Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents


Source: arxiv.org · Published: 2026-06-29T04:00Z · Topic: large-language-models


An arXiv preprint identifies a memory-update gap in LLM agents: on the knowledge-update subset of LongMemEval, accuracy drops from 92% to 77% when the agent's full context is replaced with a bounded, self-maintained memory, even on a frontier model (gpt-5.4). The gap persists across model scales and is not closed by granting the agent more memory, but it can be trained down with reinforcement learning: GRPO fine-tuning of Qwen2.5-3B nearly doubled held-out supersession accuracy on unseen conversations (9.0% to 16.7%, a single run).
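To make the "bounded, self-maintained memory" setting concrete, here is a toy sketch in Python. It is not the paper's agent or harness: the `BoundedMemory` class, its slot-based capacity, and the eviction rule are all assumptions chosen for illustration. It shows only the behaviour the article calls supersession, where a newer value for a fact replaces the older one.

```python
from collections import OrderedDict
from typing import Optional


class BoundedMemory:
    """Toy keyed memory with a fixed number of slots.

    Hypothetical illustration only; not the memory used in the paper.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._slots = OrderedDict()  # fact key -> current value

    def write(self, key: str, value: str) -> None:
        if key in self._slots:
            # A new value for a known fact supersedes the old one.
            self._slots.pop(key)
        elif len(self._slots) >= self.capacity:
            # Memory is full: evict the least recently written fact.
            self._slots.popitem(last=False)
        self._slots[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._slots.get(key)


memory = BoundedMemory(capacity=2)
memory.write("user_city", "Boston")
memory.write("plan_price", "$10/month")
memory.write("user_city", "Denver")  # the user moved; "Boston" is now stale
print(memory.read("user_city"))      # Denver
```

In this toy, updates are trivial because facts arrive already keyed. In the setting the article describes, the agent has to notice from free-form conversation that a fact has changed, which is where the reported gap appears.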


arXiv:2606.27472v1. Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure. On the knowledge-update subset of LongMemEval, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4), a gap that is statistically significant (paired McNemar p<0.005) and persists across model scale while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and is not closed by a stronger model. We then ask whether this is merely an undersized memory, and find it is not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not the compression ratio. We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. Finally, we close the loop and show the gap is trainable: GRPO fine-tuning a small open model (Qwen2.5-3B) on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run), along a monotonic checkpoint curve indicating the learned policy, not the harness, carries the gain. To our knowledge this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence the supersession gap can be trained down, not only measured.
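The abstract says agents are "rewarded for answering from the current value and penalized for stale ones" but does not give the reward's exact form. The Python sketch below is one possible reading under stated assumptions: the function name `supersession_reward`, the substring matching, and the +1 / -1 / 0 values are hypothetical and are not taken from the Supersede environment or the verifiers API.

```python
from typing import Sequence


def supersession_reward(answer: str, current: str, stale: Sequence[str]) -> float:
    """Hypothetical fact-currency reward (assumed form, not the paper's).

    +1.0 if the answer contains the current value,
    -1.0 if it contains only a superseded value,
     0.0 if it contains neither.
    """
    text = answer.lower()
    if current.lower() in text:
        # The current value is present; an answer that also mentions an old
        # value for context is still treated as correct in this sketch.
        return 1.0
    if any(old.lower() in text for old in stale):
        # The agent answered from a value that has since been replaced.
        return -1.0
    return 0.0


print(supersession_reward("She lives in Denver now.", "Denver", ["Boston"]))  # 1.0
print(supersession_reward("She lives in Boston.", "Denver", ["Boston"]))      # -1.0
```

Substring matching is a deliberate simplification; a real environment would likely need a more robust answer check. The point of the sketch is only that a stale answer scores below an abstention, so the training signal separates "did not know" from "used an outdated fact".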

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.27472
