{"slug": "your-agent-doesn-t-run-out-of-context-it-degrades-at-79", "title": "Your Agent Doesn't Run Out of Context. It Degrades at 79%", "summary": "A developer discovered that AI agent performance degrades significantly before the context window is full, with step reliability dropping below a safe threshold at 79% occupancy. A synthetic model shows that step success probability falls from 97% at 50% occupancy to 61% at 95% occupancy, and implementing a deterministic handoff at 70% occupancy can maintain full reliability across long sessions.", "body_md": "The first time one of my long agent sessions went sideways, I went looking for the wrong thing. I grepped for `context window exceeded`. For a stack trace. For a 400 from the API. There was nothing. The run finished. It just finished *worse* than it started: sloppier tool calls, a step that re-derived something it had already figured out, an answer that ignored a constraint stated twenty steps earlier.\n\nNo crash. No error. The agent didn't run out of context. It got dumber on the way there.\n\nHere's the part that took me too long to accept: the damage starts well before the window is full. On a synthetic model of step success vs. how full the window is, the reliability floor gets crossed at **79% occupancy**, not at 100% overflow. By the time you see `exceeded`, the agent has already spent a chunk of the session quietly making weaker steps.\n\nIf you build agents that run for many steps, watch the **fraction of the window in use**, not just whether it overflows. Step quality holds flat while the window fills, then bends down past a knee around 70–80%. A deterministic handoff (summarize, checkpoint, compact occupancy back down) at a threshold *below* that knee keeps the whole session above the line. In the model below, it turns 31 good steps out of 50 into 50 out of 50, at the cost of 2 compactions.\n\nThat's the whole idea. 
The rest is the artifact and the honesty about what it does and doesn't prove.\n\nHere's the output I'm going to talk about. One synthetic session, two runs. Naive first, then with a handoff gate:\n\n```\n=== Naive long session (no handoff, 'hope it fits') ===\nsteps that held above the floor: 31/50\nfirst step below floor at occupancy: 81%\ncompactions: 0\n\n=== Deterministic handoff at 70% occupancy ===\nsteps that held above the floor: 50/50\nfirst step below floor at occupancy: None\ncompactions: 2\n\nreliability floor (0.80) crossed at occupancy: 78.6%\nP(step ok) at 50% occupancy: 0.97\nP(step ok) at 70% occupancy: 0.90\nP(step ok) at 85% occupancy: 0.72\nP(step ok) at 95% occupancy: 0.61\n```\n\nRead the bottom block top to bottom. At half a window, a step succeeds 97% of the time. At 70%, still 90%. At 85% it's down to 72%, and at 95% it's a coin-flip-and-a-half: 0.61. The curve doesn't fall off a wall. It bends. That bend is the thing nobody puts in their logs.\n\nThe floor (the line where I'd actually trust a step in production) gets crossed at **78.6%** occupancy. The naive session's *first failed step* lands at 81%, because steps fall on a discrete occupancy grid (each step adds 2.6% of the window) and the first one past the line happens to sit at 81%, not exactly on 78.6%. Same story: one number is the curve, the other is where a real step happened to land.\n\nThis is the whole script. `stdlib`-only, no network, no randomness, no clock. Run `python3 -I occupancy_handoff.py` and you get the bytes above, every time.\n\n```python\n#!/usr/bin/env python3\n\"\"\"Occupancy -> step-success model for a long agent session.\n\nSynthetic fixture. Numbers chosen to exercise the mechanism, NOT a vendor\nbenchmark of any model. The point is the SHAPE: an agent step's success\nprobability stays flat while the context window fills, then bends down past\na knee -- so reliability drops well before the window overflows.\n\nDeterministic by construction: no network, no RNG, no clock. 
Re-running\nthis file produces byte-identical stdout. Run it yourself:\n\n    python3 -I occupancy_handoff.py\n\"\"\"\n\ndef step_success_prob(occ):\n    \"\"\"P(a single agent step succeeds) as a function of context occupancy.\n\n    occ is the fraction of the window in use, 0.0 .. 1.0.\n    Piecewise: flat until a knee, gentle slope, then a steep decline.\n    \"\"\"\n    if occ <= 0.50:\n        return 0.97\n    if occ <= 0.70:\n        # gentle: 0.97 -> 0.90 across 0.50 .. 0.70\n        return 0.97 - (occ - 0.50) * (0.07 / 0.20)\n    # steep: 0.90 -> 0.55 across 0.70 .. 1.00\n    return 0.90 - (occ - 0.70) * (0.35 / 0.30)\n\nWINDOW = 100_000          # synthetic token budget for the window\nRELIABILITY_FLOOR = 0.80  # \"a step we'd trust in prod\"\nSTEPS = 50                # session length\nTOKENS_PER_STEP = 2_600   # 50 * 2600 = 130k > 100k -> naive run overflows\n\ndef run_session(handoff_threshold=None, compact_to=0.30):\n    \"\"\"Walk a session of STEPS steps. Each step consumes TOKENS_PER_STEP.\n\n    A step \"holds\" if step_success_prob(occ) >= RELIABILITY_FLOOR.\n    With a handoff_threshold set, reaching it triggers a deterministic\n    compaction (summary + checkpoint) that resets occupancy to compact_to\n    BEFORE the step runs -- the long task becomes a short-context one again.\n    \"\"\"\n    used = 0\n    held = 0\n    first_fail_occ = None\n    handoffs = 0\n    for _ in range(STEPS):\n        occ = used / WINDOW\n        if handoff_threshold is not None and occ >= handoff_threshold:\n            used = int(WINDOW * compact_to)\n            occ = used / WINDOW\n            handoffs += 1\n        p = step_success_prob(occ)\n        if p >= RELIABILITY_FLOOR:\n            held += 1\n        elif first_fail_occ is None:\n            first_fail_occ = occ\n        used += TOKENS_PER_STEP\n    return held, first_fail_occ, handoffs\n\ndef floor_crossing():\n    \"\"\"Occupancy where the success curve crosses the floor (bisection).\"\"\"\n    lo, hi 
= 0.70, 1.00\n    for _ in range(60):\n        mid = (lo + hi) / 2\n        if step_success_prob(mid) >= RELIABILITY_FLOOR:\n            lo = mid\n        else:\n            hi = mid\n    return lo\n\ndef fmt_occ(occ):\n    return \"None\" if occ is None else f\"{occ:.0%}\"\n\ndef main():\n    held_a, fail_a, hand_a = run_session(handoff_threshold=None)\n    print(\"=== Naive long session (no handoff, 'hope it fits') ===\")\n    print(f\"steps that held above the floor: {held_a}/{STEPS}\")\n    print(f\"first step below floor at occupancy: {fmt_occ(fail_a)}\")\n    print(f\"compactions: {hand_a}\")\n    print()\n\n    held_b, fail_b, hand_b = run_session(handoff_threshold=0.70, compact_to=0.30)\n    print(\"=== Deterministic handoff at 70% occupancy ===\")\n    print(f\"steps that held above the floor: {held_b}/{STEPS}\")\n    print(f\"first step below floor at occupancy: {fmt_occ(fail_b)}\")\n    print(f\"compactions: {hand_b}\")\n    print()\n\n    print(f\"reliability floor ({RELIABILITY_FLOOR:.2f}) crossed at occupancy: \"\n          f\"{floor_crossing():.1%}\")\n    for occ in (0.50, 0.70, 0.85, 0.95):\n        print(f\"P(step ok) at {occ:.0%} occupancy: {step_success_prob(occ):.2f}\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\nThe mechanism that matters is in `run_session`. The naive run just keeps appending: 50 steps × 2,600 tokens = 130k against a 100k window, so it overflows, and 19 of its steps land in the degraded zone past the floor. The handoff run checks occupancy *before* each step. When it hits 70%, it compacts back to 30% (a summary plus a checkpoint, the long task folded back into a short one) and then takes the step on a window that is only 30% full. Two compactions over 50 steps. Every step stays above the line.\n\nThe threshold is set *below* the 78.6% floor crossing on purpose. 70% gives you margin: by the time you'd start losing steps, you've already reset.\n\nI'll be blunt about what that script is. It's a synthetic fixture. 
I chose the curve and the floor to make the mechanism legible (flat, knee, decline), not to measure GPT or Claude or anyone's model. If you want a vendor benchmark, this isn't it, and I'd be lying if I dressed it up as one.\n\nWhat's real is the shape, and where it came from. Across **2,190 production runs** on our Apify actors, 962 of them on a single Trustpilot scraper that's been hammered in prod, the pattern I kept seeing wasn't a clean overflow crash. It was *quality* sliding before the wall: longer-running jobs producing weaker, less consistent steps while every log line stayed green. No exception, no 400, nothing to grep. Just worse.\n\nI won't hand you a single magic percentage from that, because I don't have one. Across tasks and models the knee moved around, somewhere in the 70–80% band depending on the job. In my first version of this I assumed the failure mode was overflow and I instrumented for the wrong event entirely. I watched for the cliff. The damage was already happening on the ramp.\n\nYou could push back here, and you'd be right to: \"a 50-step toy loop with a hand-drawn curve proves nothing about a real model.\" Agreed. It doesn't. What the script earns is a cheap, exact way to reason about the *interaction*: a fixed success curve plus a fill rate plus a gate, and you can watch the gate change the outcome without a single token spent. The shape is the claim I'm standing behind, sourced from production. The script is just the cleanest way I found to show what a gate does to it.\n\nThis isn't just my pattern-matching. In October 2025, Du, Tian, Ronanki and seven co-authors published [ Context Length Alone Hurts LLM Performance Despite Perfect Retrieval](https://arxiv.org/abs/2510.05381) (arXiv:2510.05381). Their finding is the academic version of what the logs were telling me: performance drops\n\nAnd their fix rhymes with the handoff. 
Their mitigation is to make the model \"recite the retrieved evidence before attempting to solve the problem\", which, they note, \"transforms a long-context task into a short-context one.\" That's compaction by another name. They showed that length alone hurts; the gate above is one way to stop paying for it in a running agent, before overflow rather than after.\n\nIf you read my earlier piece on [the context tax](https://blog.spinov.online/blog/agent-re-reads-every-page-context-tax/), this looks adjacent, and it's worth saying how it's *not* the same thing. That one was about **cost**: re-reading the same pages inflates your token bill on a long session. This one is about **quality**: the same occupied space lowers your odds of a *correct* step, and it does that earlier than it wrecks your budget. One drains the wallet. The other drains the accuracy, quietly, and you find out last.\n\nSo you actually have two reasons to compact, not one. The bill is the loud reason. The 79% is the quiet one.\n\nTrack occupancy as a first-class metric on long sessions. Pick a handoff threshold below your knee. I'd start at 70% and tune. When you hit it, compact deterministically: summarize the state, checkpoint what matters, drop occupancy, keep going. Don't wait for `exceeded`. By then you've already shipped a fistful of degraded steps with nothing in the logs to show for it.\n\nHere's what I still don't know, and where I'd love a second opinion: at what occupancy does *your* agent start getting dumber, on *your* task? And when you find it: do you compact on a threshold, or are you still hoping it fits?\n\n*I write up the numbers from real production runs, the inconvenient ones included. Follow for the next one, and tell me in the comments where your agents start to slip: the occupancy number where step quality drops, on a task you actually run.*\n\n*Written with AI assistance, edited and fact-checked by a human. 
The Python demo is a synthetic fixture chosen to exercise the mechanism (occupancy -> step success), not to benchmark any specific model; it is stdlib-only, has no network or randomness, and prints the same output every run. The 2,190-run figure is our own real production usage.*", "url": "https://wpnews.pro/news/your-agent-doesn-t-run-out-of-context-it-degrades-at-79", "canonical_source": "https://dev.to/0012303/your-agent-doesnt-run-out-of-context-it-degrades-at-79-4in4", "published_at": "2026-06-19 18:19:33+00:00", "updated_at": "2026-06-19 18:36:40.746763+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-research"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/your-agent-doesn-t-run-out-of-context-it-degrades-at-79", "markdown": "https://wpnews.pro/news/your-agent-doesn-t-run-out-of-context-it-degrades-at-79.md", "text": "https://wpnews.pro/news/your-agent-doesn-t-run-out-of-context-it-degrades-at-79.txt", "jsonld": "https://wpnews.pro/news/your-agent-doesn-t-run-out-of-context-it-degrades-at-79.jsonld"}}