{"slug": "i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name", "title": "I stopped trusting “same answers, fewer tokens” after watching an agent lose 1 field name and burn 3 hours", "summary": "A developer reported that context compression in AI agents caused a three-hour debugging failure after a compressed memory summary dropped a single field name from an earlier error log. The agent, running on Claude Code, made a confident but wrong API call because the compressed version of history preserved the \"meaning\" of the error but omitted the exact field name needed to avoid repeating the mistake. The developer now warns that \"same answers, fewer tokens\" is not a reliable claim for agent workflows, arguing that compression should be treated as a reversible optimization rather than a permanent memory replacement.", "body_md": "I used to hear the pitch for context compression and think: sure, makes sense.\n\nSmaller prompts. Lower latency. Lower cost. Same output quality.\n\nThen I watched an agent blow a perfectly good debugging session because one field name disappeared from compressed memory.\n\nThat changed my opinion fast.\n\nThree hours into a Claude Code run, the agent made the wrong API call with full confidence. The plan looked coherent. The reasoning looked clean. The summary of prior steps sounded smart.\n\nIt was also missing the one detail that mattered: a field name from an earlier error log.\n\nThe agent had already seen the bug. It had already “understood” the bug. 
But the compressed version of history dropped the exact detail it needed to avoid repeating it.\n\nThat’s the real failure mode.\n\nNot “compression loses words.”\n\nCompression loses the one fact your agent needs later, after it has already committed to the wrong action.\n\nWhile researching this, I found a thread on r/openclaw about using Headroom with OpenClaw:\n\n[https://reddit.com/r/openclaw/comments/1u3j5xs/anyone_using_headroom_with_openclaw/](https://reddit.com/r/openclaw/comments/1u3j5xs/anyone_using_headroom_with_openclaw/)\n\nThat thread gets at the real tension: compression is useful, but only if you treat it as a reversible optimization, not a memory wipe with better branding.\n\nHere’s the pattern I keep seeing in long-running agents:\n\nThis is why “same answers, fewer tokens” is not a serious reliability claim for agent workflows.\n\nIt might be true for some short chat tasks.\n\nIt is absolutely not something I’d assume for:\n\nIn those systems, exact details matter more than elegant summaries.\n\nSummaries are good at preserving themes.\n\nSummaries are bad at preserving sharp edges.\n\nThat tradeoff is fine when you’re chatting with a model and throwing the session away. 
It is dangerous when the model is making decisions over hours.\n\nThe exact wording of an error matters.\n\nThe exact JSON shape matters.\n\nThe exact schema name matters.\n\nThe exact user constraint matters.\n\nThe exact branch that was ruled out matters.\n\nA compressed summary is not the original context.\n\nIt is an interpretation of the original context.\n\nThat distinction is everything.\n\nImagine your agent saw this error earlier:\n\n```\nValidationError: field \"customer_external_id\" is required\n```\n\nNow imagine the compressed memory turns that into this:\n\n```\nPrevious API call failed because a customer identifier was missing.\n```\n\nThat sounds fine until later, when the agent has to choose between:\n\n`customer_id`\n\n`external_customer_id`\n\n`customer_external_id`\n\nAt that point, the summary is useless.\n\nThe “meaning” was preserved.\n\nThe recoverable debugging value was not.\n\nThat’s how agents end up sounding intelligent while still being wrong.\n\nMy rule now is simple:\n\nCompress what is noisy. Reload what is consequential.\n\nGood compression candidates:\n\nBad compression candidates:\n\nIf the agent might need the exact original later, don’t make the summary the source of truth.\n\nMake the raw source retrievable.\n\nThe difference is not subtle.\n\n``` text\nraw context -> summary -> summary replaces source forever\n```\n\nThis is one-way summarization pretending to be memory.\n\n``` text\nraw context -> compressed working memory\n                + indexed raw history\n                + retrieval path back to source\n```\n\nIn other words:\n\nThat is a much better design for long-running agents.\n\nThere are some important differences between the current approaches.\n\nHeadroom is ambitious. It compresses logs, files, tool output, and history across an agent stack. 
The reported reductions are big, often in the 60% to 95% range.\n\nThat’s useful.\n\nBut the real question is not “how much did you shrink it?”\n\nThe real question is: can the agent recover the untouched source when the summary is not enough?\n\nIf yes, much safer.\n\nIf no, much riskier.\n\nRTK is narrower, which I actually like from a systems perspective. If it’s mainly shrinking Bash output before it reaches Claude Code, the blast radius is easier to reason about.\n\nOne example I saw cited was a Claude Code session dropping from about 118,000 tokens to 23,900.\n\nThat’s a huge reduction.\n\nBut again, the question is not whether the number is impressive.\n\nThe question is whether the transformed output still preserves the exact details needed for later decisions, or whether the raw output is still accessible.\n\nPrompt caching is different.\n\nFor example, OpenAI Prompt Caching preserves the exact prompt prefix rather than rewriting it. That makes it much safer than summarization if your goal is exact reuse.\n\nBut prompt caching does not solve long-context memory by itself.\n\nCached is not compressed.\n\nExact reuse is not selective recall.\n\nIf I had to pick winners and losers:\n\nWinner: reversible compression plus retrieval.\n\nLoser: one-way summarization pretending to be memory.\n\nThat’s the line.\n\nNot “compression good” vs “compression bad.”\n\nCompression is fine.\n\nIrreversible compression in long-running agents is the problem.\n\nHere’s the pattern I’d rather ship, shown after the one I’d avoid.\n\n``` js\n// Brittle: the summary is the only history the model ever sees.\nconst compressedHistory = await summarize(fullHistory)\nconst prompt = [systemPrompt, compressedHistory, currentTask].join(\"\\n\\n\")\nconst result = await client.responses.create({\n  model: \"gpt-5.4\",\n  input: prompt\n})\n```\n\nThis is cheap.\n\nIt is also brittle if `compressedHistory` becomes the only memory.\n\n``` js\n// Safer: summarize only the noisy history and pull exact raw items back in.\nconst workingSummary = await summarize(noisyHistory)\nconst retrievedRawItems = await retrieveRawContext({\n  query: currentTask,\n  sources: 
rawEventStore,\n  topK: 5\n})\n\nconst prompt = [\n  systemPrompt,\n  \"Working summary:\",\n  workingSummary,\n  \"Raw retrieved context:\",\n  JSON.stringify(retrievedRawItems, null, 2),\n  \"Current task:\",\n  currentTask\n].join(\"\\n\\n\")\n\nconst result = await client.responses.create({\n  model: \"gpt-5.4\",\n  input: prompt\n})\n```\n\nThat second pattern costs more context.\n\nIt also makes the agent much less likely to hallucinate the exact field name it already saw three hours ago.\n\nIf you’re building agents, I’d separate memory into three buckets.\n\n| Memory Type | What goes there |\n|---|---|\n| Working summary | compressed chatter, recent progress, high-level state |\n| Raw evidence store | logs, tool output, payloads, errors, retrieved docs |\n| Decision ledger | explicit decisions, assumptions, unresolved questions |\n\nThat gives you a cleaner contract:\n\nIf your architecture only has the first bucket, you are asking for trouble.\n\nThis problem gets worse in automation tools because long workflows create a lot of junk context.\n\nA typical loop might look like this:\n\n``` text\nWebhook -> Fetch data -> LLM classify -> Call API -> Retry on failure -> LLM debug -> Call API again\n```\n\nThe temptation is obvious: summarize everything after each step so the prompt stays cheap.\n\nThe safer pattern is:\n\n``` text\nWebhook -> Store raw events/logs externally\n        -> Keep compact working summary in prompt\n        -> Retrieve exact raw failure context before retry/debug steps\n```\n\nIf you’re using n8n, Make, Zapier, OpenClaw, or a custom worker queue, this is the difference between an agent that degrades gracefully and one that accumulates invisible mistakes.\n\nHonestly, a lot of bad agent design is pricing pressure wearing a fake mustache.\n\nTeams don’t aggressively trim context because it’s always the best technical choice.\n\nThey do it because per-token billing trains everyone to fear their own context windows.\n\nEvery extra tool call feels 
expensive.\n\nEvery long trace feels expensive.\n\nEvery retry feels expensive.\n\nEvery “maybe we should keep the raw logs around” decision feels expensive.\n\nSo people reach for summarization earlier than they should.\n\nThat’s not always engineering judgment.\n\nA lot of the time it’s cost anxiety.\n\nThis is exactly why flat-rate compute changes the design space.\n\nWith Standard Compute, you get unlimited AI compute for a predictable monthly price, using an OpenAI-compatible API that works with existing SDKs and HTTP clients.\n\nThat matters because it lets you make a better tradeoff:\n\nIf your agents run in n8n, Make, Zapier, OpenClaw, or custom workflows, predictable flat pricing is not just a finance benefit.\n\nIt changes system design.\n\nYou can optimize for reliability first.\n\nThat’s the better default.\n\nThis is the rule I trust:\n\nNever let your agent depend on compressed context unless it can recover the raw source later.\n\nIf the original log, chunk, tool output, or conversation turn is gone, the compression step is not optimization.\n\nIt’s amputation.\n\nAnd once you’ve watched an agent fail because one field name vanished from memory, the whole “same answers, fewer tokens” slogan starts sounding like marketing copy written by someone who has never debugged a broken workflow at 2 a.m.\n\nIf you’re building long-running agents, here’s the checklist I’d use:\n\nThat last one matters more than people admit.\n\nIf your cost model pushes you toward memory loss, your cost model is part of the bug.", "url": "https://wpnews.pro/news/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name", "canonical_source": "https://dev.to/lars_winstand/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-field-name-and-burn-54a8", "published_at": "2026-06-12 09:35:34+00:00", "updated_at": "2026-06-12 09:42:04.648384+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-safety", "ai-research", 
"ai-tools"], "entities": ["Claude Code", "Headroom", "OpenClaw", "r/openclaw"], "alternates": {"html": "https://wpnews.pro/news/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name", "markdown": "https://wpnews.pro/news/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name.md", "text": "https://wpnews.pro/news/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name.txt", "jsonld": "https://wpnews.pro/news/i-stopped-trusting-same-answers-fewer-tokens-after-watching-an-agent-lose-1-name.jsonld"}}