Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

An arXiv preprint identifies a memory-update gap in LLM agents: on the knowledge-update subset of LongMemEval, accuracy drops from 92% to 77% when the agent uses a bounded, self-maintained memory instead of its full context, even with a frontier model such as gpt-5.4. The gap persists across model scales and is not closed by giving the agent more memory, but it can be reduced with reinforcement learning: GRPO fine-tuning of Qwen2.5-3B nearly doubled held-out supersession accuracy on unseen conversations (9.0% to 16.7%, a single run).

arXiv:2606.27472v1. Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure. On the knowledge-update subset of LongMemEval, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4), a gap that is statistically significant (paired McNemar p<0.005) and persists across model scale while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and is not closed by a stronger model. We then ask whether this is merely an undersized memory, and find it is not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not the compression ratio. We release Supersede, an open reinforcement-learning environment on the verifiers/prime-rl stack that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. Finally, we close the loop and show the gap is trainable: GRPO fine-tuning a small open model (Qwen2.5-3B) on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run), along a monotonic checkpoint curve indicating the learned policy, not the harness, carries the gain. To our knowledge this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence the supersession gap can be trained down, not only measured.
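
To make the training signal concrete, here is a minimal sketch of a reward that favors the current value of a fact and penalizes a superseded one. It is an illustration only, not the Supersede environment's actual reward and not the verifiers API: the names `supersession_reward`, `contains`, and `normalize`, the substring-style matching, the penalty size, and the choice to give zero reward when an answer mentions both current and stale values are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contains(answer: str, value: str) -> bool:
    """True if the normalized value occurs as a whole-token phrase in the answer."""
    needle = normalize(value)
    if not needle:
        return False  # an empty value never matches
    return f" {needle} " in f" {normalize(answer)} "


def supersession_reward(
    answer: str,
    current_value: str,
    stale_values: list[str],
    stale_penalty: float = 1.0,  # assumed magnitude, not taken from the paper
) -> float:
    """Hypothetical reward: +1 for the current value, negative for a stale one.

    Assumed tie-breaking: an answer that mentions both a current and a stale
    value, or neither, receives 0.0.
    """
    hit_current = contains(answer, current_value)
    hit_stale = any(contains(answer, v) for v in stale_values)
    if hit_current and not hit_stale:
        return 1.0
    if hit_stale and not hit_current:
        return -stale_penalty
    return 0.0


if __name__ == "__main__":
    # A user first said they live in Boston, then later moved to Denver.
    print(supersession_reward("She lives in Denver now.", "Denver", ["Boston"]))
    print(supersession_reward("She lives in Boston.", "Denver", ["Boston"]))
```

Under these assumptions the first call scores the up-to-date answer positively and the second scores the stale answer negatively. A real environment would likely need a more robust matcher (for example a judge model or fuzzy matching) than exact token-phrase containment.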