When your AI coding agent starts contradicting itself mid-session — forgetting a file path it read an hour ago, recommending an approach it already rejected — most developers blame the model. The real culprit is context saturation. Vendors advertise 1M and 2M token context windows as competitive features. Chroma tested 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, and 15 others — and found every single one degrades as context length grows. No exceptions. The spec sheet is advertising, not a performance guarantee.
Large Context Windows: The Research Is Unambiguous
The evidence here is not speculative. The RULER benchmark — specifically designed to measure how model quality degrades at scale — found that only half of tested models maintain satisfactory performance at 32K tokens, despite all of them claiming support for 32K or greater. Models score near-perfect on simple retrieval at those lengths, then collapse on tasks requiring multi-step reasoning at the same length. The benchmark is exposing an attention problem, not a memory limit.
The underlying mechanism is the “lost-in-the-middle” effect, replicated across GPT-3.5, GPT-4, Claude 1.3, and every subsequent model family tested. Accuracy follows a U-shaped curve: models recall information from the beginning and end of context well, but accuracy drops by more than 30% for information positioned in the middle. It doesn’t matter if the window is 200K or 2M; the middle is where context goes to die. Transformer attention is quadratic in sequence length, so at 100K tokens the model scores roughly 10 billion token pairs (100,000 × 100,000) in each attention pass. Each token’s attention weights must sum to one, so every additional token leaves less weight to spread across the rest.
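The quadratic arithmetic is easy to check. The sketch below is illustrative only: it counts query-key pairs for full self-attention in a single pass and ignores optimizations such as sparse or windowed attention, which real models may use. The function name is made up for this example.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def average_weight_per_token(num_tokens: int) -> float:
    """Softmax weights sum to 1 per query, so the mean weight is 1 / n."""
    return 1.0 / num_tokens


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        pairs = attention_pairs(n)
        mean_w = average_weight_per_token(n)
        print(f"{n:>9,} tokens: {pairs:,} pairs, mean weight {mean_w:.0e}")
```

Going from 100K to 1M tokens multiplies the pair count by 100, not 10, which is why a bigger window is not a proportionally bigger working memory.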
AI Coding Agents Hit the Dumb Zone Fast
This would be abstract if developers weren’t actively hitting it. File reads run 1K–10K tokens each. Tool output — test runs, compiler errors, linting — adds another 500–5K per round. Multi-turn debugging conversations add thousands more per exchange. A focused debugging session can cross 100K tokens within 1–2 hours of work. That’s the fuzzy threshold where degradation becomes noticeable.
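To get a feel for how quickly those numbers add up, here is a back-of-the-envelope estimator. The per-round figures are assumptions picked from inside the ranges above (the 2K conversation cost in particular is a guess at "thousands more per exchange"), so treat the result as an order of magnitude, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class RoundCost:
    """Assumed token cost of one debugging round (illustrative values)."""
    file_read: int = 5_000      # somewhere inside the 1K-10K range
    tool_output: int = 2_500    # somewhere inside the 500-5K range
    conversation: int = 2_000   # assumption for one exchange

    @property
    def total(self) -> int:
        return self.file_read + self.tool_output + self.conversation


def rounds_until(threshold_tokens: int, cost: RoundCost) -> int:
    """Smallest number of rounds whose cumulative cost reaches the threshold."""
    return -(-threshold_tokens // cost.total)  # ceiling division


if __name__ == "__main__":
    cost = RoundCost()
    print(f"{cost.total:,} tokens per round")
    print(f"{rounds_until(100_000, cost)} rounds to reach 100K tokens")
```

Under these assumptions, about eleven read-run-discuss rounds are enough to reach 100K tokens.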
Paul Gauthier, creator of the Aider AI coding tool, puts the practical ceiling at 25–30K tokens before models “get confused.” Morph’s research found a 35-minute threshold where agent success rates start declining, with failure rates quadrupling when task duration doubles. One HN commenter, responding to Garrit’s widely discussed piece on this today, noted they deliberately stay under 10% of available context to avoid inconsistencies, even with a 1M-token window available. That’s the gap between the advertised number and the actual safe operating range.
What This Means for Your AI Coding Workflow
The architectural root cause — quadratic attention scaling — is not a bug vendors can patch. Newer models push the degradation threshold higher. They do not eliminate it. Waiting for model improvements to solve this is a mistake. However, the practical mitigations are concrete and available now.
Albert Sikkema documented a setup where he deliberately capped Claude Code back to 200K and moved compaction to trigger at 70% rather than the default 95% — reporting improved consistency across long sessions. The idea: don’t wait until the window is nearly full to summarize. Compaction that triggers at 95% happens after most of the damage is done.
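The difference between the two triggers can be sketched in a few lines. This is not Claude Code's configuration or API; `should_compact` and its parameters are hypothetical names for a generic threshold check, using the 200K window and the 70% and 95% figures from the setup described above.

```python
def should_compact(used_tokens: int, window_tokens: int,
                   trigger_ratio: float = 0.70) -> bool:
    """Return True once context usage reaches the trigger ratio (hypothetical helper)."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens >= trigger_ratio


if __name__ == "__main__":
    window = 200_000
    for used in (100_000, 150_000, 195_000):
        early = should_compact(used, window, trigger_ratio=0.70)
        late = should_compact(used, window, trigger_ratio=0.95)
        print(f"{used:>7,} used: compact at 70%? {early}, at 95%? {late}")
```

With a 200K window, the 70% trigger fires at around 140K tokens, while the 95% trigger waits until around 190K, leaving roughly 50K tokens of extra history in play before anything is summarized.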
The more structural fix is sub-agent architecture. Morph found a 90% performance improvement using sub-agents for isolated tasks over single-agent approaches. Claude Code’s sub-agent support makes this practical: delegate search, file reading, and exploratory work to sub-agents with isolated contexts, keeping your main orchestrator context clean and focused.
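The shape of that delegation can be shown with a toy orchestrator. Nothing here is Claude Code's actual sub-agent interface: `Model`, `run_subagent`, `Orchestrator`, and `stub_model` are hypothetical names, and the "model" is a stub so the example runs without any API. The point is only that bulky material lives in the sub-agent's context and the orchestrator keeps a one-line result.

```python
from typing import Callable, List

# Hypothetical model interface: takes a list of context messages, returns text.
Model = Callable[[List[str]], str]


def run_subagent(task: str, materials: List[str], model: Model) -> str:
    """Run one task in a fresh context and return only the short result."""
    context = [f"Task: {task}", *materials]  # isolated, discarded afterwards
    return model(context)


class Orchestrator:
    """Keeps a small main context made of sub-agent summaries."""

    def __init__(self, model: Model) -> None:
        self.model = model
        self.context: List[str] = []

    def delegate(self, task: str, materials: List[str]) -> str:
        summary = run_subagent(task, materials, self.model)
        self.context.append(f"[{task}] {summary}")  # raw materials never enter
        return summary


def stub_model(context: List[str]) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return f"summary of {len(context) - 1} item(s)"


if __name__ == "__main__":
    orch = Orchestrator(stub_model)
    big_file = "def load_config(): ...\n" * 2_000
    orch.delegate("find config loader", [big_file, "README excerpt"])
    print(orch.context)
    print(sum(len(m) for m in orch.context), "characters in main context")
```

The large file is read once inside the sub-agent and then dropped; the orchestrator's context grows by a single short line per delegated task.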
The third lever: external state over context state. Instead of relying on the model to remember decisions from earlier in the session, write those decisions down in a spec doc, a changelog, or a decision record. Start the next session phase with the document, not a 100K-token history. This is the approach practitioners who do long-session AI work actually use, and it works precisely because it keeps the effective context small and focused.
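One minimal way to do this is an append-only decision log on disk that seeds the next session. The file name, entry format, and function names below are assumptions for illustration; any plain-text format you will actually maintain works as well.

```python
from datetime import date
from pathlib import Path


def record_decision(log_path: Path, title: str, decision: str, reason: str) -> None:
    """Append one decision entry to a plain-text log (hypothetical format)."""
    entry = (
        f"## {title} ({date.today().isoformat()})\n"
        f"- Decision: {decision}\n"
        f"- Reason: {reason}\n\n"
    )
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry)


def opening_prompt(log_path: Path, next_task: str) -> str:
    """Build the first message of a new session from the log, not the old chat."""
    notes = log_path.read_text(encoding="utf-8") if log_path.exists() else ""
    return f"Decisions so far:\n{notes}Next task: {next_task}"


if __name__ == "__main__":
    log = Path("DECISIONS.md")  # assumed file name
    record_decision(
        log,
        "Retry strategy",
        "Use exponential backoff in the HTTP client",
        "Fixed delays caused thundering-herd retries in testing",
    )
    print(opening_prompt(log, "Add retry metrics"))
```

The new session starts from a few hundred tokens of settled decisions instead of replaying the whole conversation that produced them.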
Key Takeaways
- Every major LLM tested — 18 models including Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro — degrades with context length. Chroma found no exceptions.
- The practical “safe zone” is well below the advertised window: the practitioners cited here stay at roughly 25–30K tokens, or under 10% of a 1M-token window, for reliable output.
- AI coding sessions accumulate context faster than you think — 100K tokens in under two hours is realistic for active debugging.
- Sub-agent architecture is the highest-leverage fix: isolate search and exploration in sub-agents with separate context windows.
- Trigger context compaction at 70%, not 95% — and supplement with spec documents for multi-session continuity.