{"slug": "supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents", "title": "Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents", "summary": "An arXiv preprint identifies a memory-update gap in LLM agents: on the knowledge-update subset of LongMemEval, accuracy drops from 92% to 77% when a bounded, self-maintained memory replaces the full context, even with a frontier model (gpt-5.4). The gap persists across model scale and is not closed by granting more memory, but it can be trained down with reinforcement learning: GRPO fine-tuning of Qwen2.5-3B nearly doubled held-out supersession accuracy on unseen conversations (9.0% to 16.7%, a single run).", "body_md": "arXiv:2606.27472v1 Announce Type: new\nAbstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure. On the knowledge-update subset of LongMemEval, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4), a gap that is statistically significant (paired McNemar p<0.005) and persists across model scale while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and is not closed by a stronger model. We then ask whether this is merely an undersized memory, and find it is not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not the compression ratio. 
We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. Finally, we close the loop and show the gap is trainable: GRPO fine-tuning a small open model (Qwen2.5-3B) on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run), along a monotonic checkpoint curve indicating the learned policy, not the harness, carries the gain. To our knowledge this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence the supersession gap can be trained down, not only measured.", "url": "https://wpnews.pro/news/supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents", "canonical_source": "https://arxiv.org/abs/2606.27472", "published_at": "2026-06-29 04:00:00+00:00", "updated_at": "2026-06-29 04:07:13.221754+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "machine-learning", "ai-research"], "entities": ["arXiv", "LongMemEval", "gpt-5.4", "Supersede", "Qwen2.5-3B", "GRPO", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents", "markdown": "https://wpnews.pro/news/supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents.md", "text": "https://wpnews.pro/news/supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents.txt", "jsonld": "https://wpnews.pro/news/supersede-diagnosing-and-training-the-memory-update-gap-in-llm-agents.jsonld"}}