# Cursor's compression isn't a bug. It's how it works.

> Source: <https://dev.to/arthurpro/cursors-compression-isnt-a-bug-its-how-it-works-2680>
> Published: 2026-06-15 16:00:00+00:00

The most useful sentence in [Cursor's "Dynamic Context Discovery" blog post](https://cursor.com/blog/dynamic-context-discovery) (Jan 6, 2026) is the one written in the kind of plain language engineering teams use when they've decided to admit a trade-off they haven't fully solved:

When the model's context window fills up, Cursor triggers a summarization step to give the agent a fresh context window with a summary of its work so far.

But the agent's knowledge can degrade after summarization since it's a lossy compression of the context.

I keep coming back to that line because of how much it says about the shape of recent agent failures. In late April, a Cursor session running Claude Opus 4.6 issued a single `volumeDelete`

mutation against PocketOS's production volume on Railway, took the volume's backups with it (Railway stores them in the same blast radius), and produced a "confession" afterwards enumerating which rules it had violated to do it. The agent could *cite the rules* in the confession. It just could not, in the moment, connect them to what its hands were doing. The PocketOS founder thread by Jer Crane (@lifeof_jer) laid out the timeline and the exact API call in detail, and several outlets (The Register, Tom's Hardware, Decrypt) reproduced it.

That part of the post-mortem is what I want to walk through here. It is not really about the model. It is about the harness (the layer between the chat window and the model's context), and specifically what compaction does to the chain of reasoning that's supposed to keep an agent inside its rails.

Cursor's harness uses **prompt-based summarization** for compaction. When the live context approaches the model's window limit, the harness asks the model to summarise its session so far. That summary becomes the seed for a fresh window, and the agent continues from there. (Cursor's other post, [ Training Composer for longer horizons](https://cursor.com/blog/self-summarization), Mar 17, 2026, describes how their in-house Composer model is RL-trained with compaction as part of the training loop, but Composer is Composer. Claude Opus running through Cursor gets the generic prompt-based version.)

The Cursor Forum has known about the timing being off for months. A user posted in [thread 149490](https://forum.cursor.com/t/compaction-not-happening-soon-enough/149490/3) that on Opus 4.5, "in prior builds summarization would happen at 70-80%. But this time I ran up into the 90% mid action, and it's showing 100% full!" A Cursor staff member replied: "This is a known issue with auto-summarization. It can trigger late or incorrectly. The team is aware of it. Workaround: try running `/summarize`

manually when you see the context getting close to 70 to 80%."

Read that twice. The vendor is asking the user to drive a heuristic that the harness was supposed to drive autonomously, because the heuristic doesn't fire reliably. That alone is not the story. The story is that **even when compaction fires correctly, the resulting context is structurally different from the one the model was reasoning in two seconds earlier**, and the chat window does not tell you that.

Two threads of research converge here, and they predict exactly the failure mode operators see in the wild.

**Thread 1: position effects in long contexts.** Liu et al.'s [ Lost in the Middle](https://arxiv.org/abs/2307.03172) (2023) showed the U-shaped curve that everyone now cites: performance is best when relevant information sits at the start or end of the window, and degrades sharply in the middle. The system prompt sits at the start. The current task and tool output sit at the end. Any safety rule whose binding force depends on a chain (

**Thread 2: input length itself hurts, even with perfect retrieval.** Du et al.'s [ Context Length Alone Hurts LLM Performance Despite Perfect Retrieval](https://arxiv.org/abs/2510.05381) (EMNLP 2025) is the more uncomfortable one. The authors set up a benchmark where the model is given the relevant evidence, the relevant evidence is positioned right next to the question, and the irrelevant filler is masked out: every fair-fight condition you would design if you wanted to give long context every chance to succeed. Performance still drops 13.9% to 85% as input length grows. "Even when models can perfectly retrieve all relevant information, their performance still degrades substantially as input length increases." Their proposed mitigation is

If you put those two threads together, you get the prediction Cursor's operators keep finding: compaction does not just lose facts. It dissolves the *relationships* between facts. The rule survives the summary as a fragment ("there are some safety rules"). The action survives as a directive ("fix the credential mismatch"). The arc that connects them, *and this rule binds this action*, does not. The model's chain-of-thought picks up at the action end and never visits the rule end.

The thing that surprised me when I went looking is how on-the-record Anthropic is about all of this. Their [ Effective Context Engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) post (Sep 29, 2025) names the phenomenon directly:

Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. While some models exhibit more gentle degradation than others, this characteristic emerges across all models.

The same post tells you what to do about it: pursue "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Not "fill the window because the window is large." A passage in Anthropic's [API documentation](https://platform.claude.com/docs/en/build-with-claude/context-windows) is even blunter: "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as *context rot*." Until March 2026, Anthropic priced this directly: requests over 200K tokens cost 2x input and 1.5x output, an implicit declaration that 200K was the reliability boundary they were comfortable selling.

The cleanest external evidence for how steep the cliff is comes from a single reporter on [anthropics/claude-code issue #35296](https://github.com/anthropics/claude-code/issues/35296), opened March 17, 2026. The reporter ran 25+ transcripted sessions with Claude Opus 4.6 against a 20,000-record database and pinned down a behaviour profile by context-fill percentage:

| Context fill | Behaviour observed |
|---|---|
| 0–20% | Reliable |
| 20–40% | Degrading |
| 40–60% | Unreliable |
| 60–80% | Broken |
| 80–100% | Irrecoverable |

The same issue cites Anthropic's own MRCR v2 multi-needle benchmark: 93% accuracy at 256K, 76–78% at 1M. Roughly one in four multi-needle retrievals fails at the advertised maximum window. None of this is hidden. It is in Anthropic's docs, on Anthropic's blog, and in Anthropic's pricing history. It is just not in the chat window.

The thing that makes compaction unusually dangerous is that the user has no idea it has happened. The chat scrolls. Earlier turns are still visible above the fold. The model still answers in the same voice. Nothing in the interface signals that the context the model is *currently* reasoning over is no longer the context the user thinks they share with it.

Compare that to other places software handles state-loss. When a database connection drops and reconnects, the client logs it. When a process restarts, systemd records the restart in the journal. When git rebases your branch, it tells you which commits moved. Compaction, by contrast, is an invisible state transition. The agent's "memory" gets replaced with a paraphrase of the original, and the chat window does not draw a line.

What I would want, as an operator, is something boringly straightforward: a banner before compaction fires that tells me the budget is about to be reset, an inline marker in the transcript at the point compaction occurred, and a one-click "diff" view that shows me what survived in the summary versus what was in the original. None of this is hard to build. You can prototype the budget half in a couple of dozen lines of Python:

``` python
import time
import tiktoken

class ContextBudget:
    """Pre-compaction warning gate for an agent harness.

    Wrap your prompt-assembly with this and call .check() before each
    model call. It does not implement compaction itself; the point is
    to give the operator a chance to /summarize on their own terms,
    not to have the harness silently re-summarise mid-task.

    Call .mark_compacted() from your operator's /summarize path so
    the next .check() can report when the last reset happened.
    """

    WARN = 0.70   # Cursor staff's recommended manual-/summarize point
    HARD = 0.85   # below the harness's own auto-trigger, with margin

    def __init__(self, model="gpt-4o", limit=200_000):
        self.enc = tiktoken.encoding_for_model(model)
        self.limit = limit
        self.last_compaction = None

    def measure(self, messages):
        return sum(len(self.enc.encode(m["content"])) for m in messages)

    def mark_compacted(self):
        self.last_compaction = time.time()

    def check(self, messages):
        used = self.measure(messages)
        ratio = used / self.limit
        if ratio >= self.HARD:
            raise CompactionRequired(
                f"context at {ratio:.0%} of {self.limit}; "
                "manual /summarize required before next call"
            )
        if ratio >= self.WARN:
            since = (
                f"{int(time.time() - self.last_compaction)}s ago"
                if self.last_compaction else "never"
            )
            print(
                f"[budget] {used:,}/{self.limit:,} tokens "
                f"({ratio:.0%}); consider /summarize "
                f"(last compaction: {since})"
            )
        return used, ratio

class CompactionRequired(RuntimeError):
    pass
```

The point of a wrapper like that is not the arithmetic. The arithmetic is the easy part. The point is that the operator gets to see the budget, the operator is the one who decides when to compact, and the moment compaction happens is logged into the transcript as an event the operator can scroll back to. That much would close the gap between "model's working context" and "what the user thinks they're chatting with." The rest of the honest-UI agenda (diffing the pre- and post-summary transcripts, marking which parts of system prompt survived the summary, surfacing the compaction event in the same way Slack surfaces a thread split) falls out of having an explicit compaction event in the first place.

Bring this back to the failure mode in the PocketOS incident. The agent had safety rules in the system prompt. It had a destructive operation available. Some non-trivial number of tokens of intermediate work (file reads, shell output, grep results) accumulated between those two ends of the context. When compaction fired, the rules got summarised into "there are some safety rules." The action got summarised into "fix the credential mismatch by deleting the volume." The chain that should have stopped the action *because* of the rule got summarised into nothing in particular.

You can build a defence against that at three levels, and the punch line is that none of them is "use a smarter model." You can build it at the **harness** level (recite-before-solve before destructive actions; restate the active rules into the model's working scratchpad immediately before tool use). You can build it at the **API gateway** level (out-of-band confirmation for destructive mutations; scoped tokens that physically cannot reach production from a staging task). You can build it at the **UI** level (visible compaction events; the operator chooses when, not the harness). Each level catches a different version of the same failure. The cheap version of all three together is more reliable than waiting for the next model release to "just handle longer contexts," because the next model release will have the same shape of failure at a different threshold. Context rot, in Anthropic's own framing, "emerges across all models."

| Defence layer | What it catches | Concrete pattern |
|---|---|---|
| Harness | Rule-binding lost during compaction | Recite-before-solve: restate active safety rules into a fresh scratchpad before any destructive tool call (Du et al. 2025) |
| API gateway | Destructive mutation reaches the API at all | Out-of-band confirmation; scoped tokens that physically cannot reach prod from a staging credential |
| UI | Operator can't see that context was compressed | Pre-compaction banner; inline transcript marker; pre/post summary diff view |
| Model | (Don't rely on this layer.) | Better long-context attention is research, not a deployment plan |

The frame that helps me hold all of this in my head is to stop thinking of compaction as a bug. It's not a bug. Cursor's blog post calls it "lossy compression of the context" using exactly that wording. Anthropic's blog post says context rot is universal. Du et al.'s benchmark says even *perfect* retrieval over a long context underperforms a short one. Three independent sources, three different framings, one underlying claim: the agent's working context is not the conversation you had with it. It's a derivative of that conversation, and the derivative is approximate, and the approximation is the part that fails.

The prior incident I wrote about wasn't a hallucination event. It was a structural one: a long-running session where the link between the rule and the action got summarised away. The next one will have the same shape. The thing the industry will learn this year (late, the way it learned that retries need bounds and that connectors need monitoring) is that the chat window is a UI for the user, not for the model. The model has a different UI, and right now nobody is showing it to anyone.