[ARTICLE · art-25522] src=ianbarber.blog pub= topic=large-language-models verified=true sentiment=· neutral

FactWorld

A new benchmark called FactWorld reveals that hybrid AI models combining transformer and recurrent architectures can simultaneously excel at both associative recall and state tracking, capabilities that pure transformer or recurrent models handle separately. The benchmark tests models on retrieving static facts, tracking object ownership changes over time, and composing both operations, with results showing that hybrid architectures overcome the brittleness of transformers on longer state-tracking problems while maintaining strong recall abilities.

5 min read · published Jun 12, 2026

When we started building LLMs, we mostly focused on them knowing things. They had information encoded in their weights, and they could spit it out when given sufficient prompts. But an agent doesn’t just need to know things; it needs to combine several kinds of knowledge.

A lot of that is still in the weights: facts that the model learned during training. But some knowledge is in the context window: tool results, documents, user instructions, intermediate observations, etc. And some knowledge is in the environment: a good agent should have a sense of the current state of the world. To be useful, an agent has to be able to combine these sources of knowledge appropriately.

There are standard ways to test some of this. Associative recall benchmarks like MQAR ask whether a model can recover a value from a key in its context window. State tracking problems, like S5-style permutations, check whether a model can keep track of changes over time: the problems are a series of operations, and a model must identify the end state.
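To make the two task shapes concrete, here is a minimal sketch of how such examples might be generated. It is an illustration, not the actual MQAR or S5 benchmark code: the function names, the key/value ranges, and the tuple encoding of permutations are all assumptions.

```python
import random
from itertools import permutations

def compose(p, q):
    """Apply permutation p first, then q (tuples mapping index -> index)."""
    return tuple(q[p[i]] for i in range(len(p)))

def s5_word_problem(length, rng):
    """State tracking: sample `length` permutations of 5 elements.
    The label is the composition of the whole sequence, in order."""
    group = list(permutations(range(5)))  # all 120 elements of S5
    word = [rng.choice(group) for _ in range(length)]
    state = tuple(range(5))  # start from the identity
    for p in word:
        state = compose(state, p)
    return word, state

def recall_example(num_pairs, rng):
    """Associative recall: distinct keys with random values, plus one queried key.
    The label is the value bound to that key in the context."""
    keys = rng.sample(range(1000), num_pairs)  # hypothetical key vocabulary
    pairs = [(k, rng.randrange(1000)) for k in keys]
    query = rng.choice(keys)
    return pairs, query, dict(pairs)[query]

if __name__ == "__main__":
    rng = random.Random(0)
    word, final_state = s5_word_problem(8, rng)
    pairs, query, answer = recall_example(4, rng)
    print(len(word), final_state, query, answer)
```

The recall label depends only on one key-value pair, wherever it sits in the context. The S5 label depends on every operation and on their order, because permutations do not commute: swapping two steps generally changes the end state.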

Different architectures solve these problems in quite distinct ways. Transformers are good at recall; in the end that’s what attention is: look back into the context, copy the relevant things. They have an inductive bias for this kind of problem: the nature of their algorithm fits the nature of the problem. When it comes to state tracking, though, they’re brittle. They memorize the state-tracking mechanism for the lengths of problem they see in training: give them something longer, and they don’t degrade so much as collapse.

Recurrent models, like RNNs and state-space models, have the opposite shape. They have a natural inductive bias towards maintaining state. They keep a compact representation of The Current Thing and update it as tokens come in. That makes them effective at tracking state across time, but the conventional wisdom is that it costs them recall: the representation is fuzzy, and copying exact references back out of it is harder.

One current trend in LLMs[^1] is hybrid models, where regular attention is interleaved with linear attention or state-space style layers. This is usually framed around efficiency: the linear layers don’t need the large KV cache. I wondered whether the hybrid might also give you both capabilities: strong state tracking *and* strong recall, in the same model, for the same query.

To test this, I vibed up a benchmark called FactWorld. It’s a small, synthetic world of agents, objects, roles, and facts. Everything is generated from a deterministic knowledge base, with labels computed by a symbolic oracle, so every answer is correct by construction and nothing leaks from the rendered text.

The world looks like this: agents (g0, g1, …) each carry a static fact (“g3’s a0 is v42”), and objects get passed around over time (“give o3 to g1”). The queries cross the two capabilities:

- **Recall**: “what is a0 of g3?” Look up a fact.
- **State tracking**: “who holds o3?” Replay the give-history; last write wins.
- **Composition**: “what is a0 of the holder of o3?” Determine who holds the object, *then* recall that agent’s fact, in one query.
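The three query types can be sketched as a tiny symbolic oracle. This is only an illustration of the semantics described above, not FactWorld's actual code: the `World` class, its field names, and the tuple encoding of facts and events are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class World:
    # (agent, attribute) -> value, e.g. ("g3", "a0") -> "v42"
    facts: dict = field(default_factory=dict)
    # time-ordered give events as (object, receiving agent) pairs
    events: list = field(default_factory=list)

    def recall(self, agent, attr):
        """Recall: look up a static fact."""
        return self.facts[(agent, attr)]

    def holder(self, obj):
        """State tracking: replay the give-history; the last write wins."""
        current = None
        for o, agent in self.events:
            if o == obj:
                current = agent
        return current

    def compose(self, obj, attr):
        """Composition: resolve the holder first, then recall that agent's fact."""
        return self.recall(self.holder(obj), attr)

if __name__ == "__main__":
    world = World(
        facts={("g3", "a0"): "v42", ("g1", "a0"): "v7"},
        events=[("o3", "g3"), ("o3", "g1")],
    )
    print(world.recall("g3", "a0"), world.holder("o3"), world.compose("o3", "a0"))
```

Because the labels come from replaying the events symbolically, they do not depend on how the world is rendered into text.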

The facts that the model needs are either in the prompt or fixed across training so the model can memorize them. This separates “reading from context” from “knowing from the weights.” And event histories can be longer at test time than anything seen in training, which separates “learned the rule” from “learned a length-specific shortcut.”
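A length-extrapolation split of this kind might look like the sketch below. The specific lengths, the counts of objects and agents, and the helper names are assumptions for illustration; only the rendered phrasing (“give o3 to g1”, “who holds o3?”) follows the examples above.

```python
import random

# Hypothetical lengths: every test history is longer than any training history.
TRAIN_LENGTHS = range(4, 17)
TEST_LENGTHS = (32, 64, 128)

def sample_history(num_events, num_objects, num_agents, rng):
    """Random give events as (object, agent) pairs, in time order."""
    return [
        (f"o{rng.randrange(num_objects)}", f"g{rng.randrange(num_agents)}")
        for _ in range(num_events)
    ]

def render(events, query_obj):
    """Render a history and a state-tracking query as plain text."""
    lines = [f"give {obj} to {agent}" for obj, agent in events]
    lines.append(f"who holds {query_obj}?")
    return "\n".join(lines)

def make_split(rng, per_length=2):
    """Build train and test histories with disjoint length ranges."""
    train = [sample_history(n, 4, 4, rng) for n in TRAIN_LENGTHS for _ in range(per_length)]
    test = [sample_history(n, 4, 4, rng) for n in TEST_LENGTHS for _ in range(per_length)]
    return train, test

if __name__ == "__main__":
    train, test = make_split(random.Random(0))
    print(render(train[0], "o3"))
```

A model that only learned a length-specific shortcut can score well on `train`-length histories and still fail on every `test` history, which is the separation the benchmark is after.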

To make sure it was sane, I validated the known results from the literature first, at small scale (~45M params). They reproduced! A transformer fits the S5 word problem at the training length and then collapses to exactly zero beyond it. A recurrent/linear model with non-commuting state transitions[^2] extrapolates it; one attention layer over a recurrent backbone solves canonical one-hop recall, which is the Zoology result. This was not without surprises. FactWorld originally tested recall by asking for the value at a separate answer position, not as the next token after the key. This underperformed the expected result because it turned out to be a bit of a composition in itself: you need to know which place to look at. Moving to a one-hop format did give the expected result, though.

Trying to test the composed problem introduced its own difficulties. I had a 6M-param smoke test and… nothing worked at all: every model floored the task. Luckily, at ~45M params, while a transformer still floors (zero for ten across an entire learning-rate sweep), the gated-delta recurrent hybrids could learn it. Sometimes.[^3] And we did get a quite interesting failure mode.

When a converged model got the composite wrong, it was usually a routing failure. The model has genuinely learned the resolve-then-recall pipeline (resolve a holder, recall a fact about them); it just resolved the wrong holder, and then confidently reported that agent’s fact. Recall is conditioned on state; the two are not independent legs the model runs in parallel. Which felt pretty familiar: an agent flawlessly doing the wrong thing.
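One way to separate routing failures from other errors is to check whether a wrong answer is the queried attribute of some agent other than the true holder. The sketch below assumes that reading; the function name, the label strings, and the data layout are hypothetical, not taken from the benchmark.

```python
def classify_composite(prediction, facts, events, obj, attr):
    """Label a composed-query prediction as correct, a routing failure, or another error.

    facts:  dict mapping (agent, attribute) -> value
    events: time-ordered list of (object, agent) give events
    """
    # Resolve the true holder: last write wins.
    holder = None
    for o, agent in events:
        if o == obj:
            holder = agent

    if prediction == facts[(holder, attr)]:
        return "correct"

    # A routing failure reports the right attribute for the wrong agent.
    wrong_agents = [
        agent
        for (agent, a), value in facts.items()
        if a == attr and agent != holder and value == prediction
    ]
    return "routing_failure" if wrong_agents else "other_error"

if __name__ == "__main__":
    facts = {("g1", "a0"): "v7", ("g3", "a0"): "v42"}
    events = [("o3", "g3"), ("o3", "g1")]
    print(classify_composite("v42", facts, events, "o3", "a0"))
```

In this example the model answered with g3’s fact although g1 holds o3, so the recall step worked and the holder resolution did not. The check assumes values are distinct across agents for a given attribute; with repeated values it could miss a routing failure that happens to land on the correct value.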

Because the binding in this composite is last-write-wins, the ordering subtlety wasn’t a particular problem. The plain Gated DeltaNet hybrid could compose it. But, in my test, only at exactly one learning rate. The Gated DeltaProduct hybrid learned it across a broad band of learning rates, and extrapolated past the training length on a majority of seeds, where the single-delta variant mostly doesn’t. The product structure wasn’t necessary here; it was just easier to train.[^4]

For current large models, scale can paper over all of this: learn enough patterns and you accumulate tricks that work well enough in practice. But if we want smaller, cheaper, longer-context, more reliable agentic models, getting the right architecture matters. FactWorld is hopefully a way to check, without requiring thousands of GB300s.

[^1]: I mean, at least the ones where we know how they work.
[^2]: Order tends to matter in these tasks, but the nature of the updates in most state-space models means it doesn’t track that order well. This specific variant, Gated DeltaProduct, handles order-specific, or non-commuting, transitions better.
[^3]: Quite a few seeds simply never form the recall-under-composition circuit, it seemed a bit all or nothing.
[^4]: For completeness: state-tracking crossed with facts stored *in the weights* still floors at length for every architecture I tried. “Look it up in your weights, mid-pipeline”, I have no idea how to do.