After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment — usually after the second compaction — where the model briefly gets better. Then it declines and never recovers.
Maybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: what if context compaction has a curve, and nobody has mapped it?
I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:
Parameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs /compact, has no such curve.
If the curve is real, you could:
Right now, none of the major benchmark suites (MMLU, HELM, BIG-bench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.
I built a tiny monitor (compact-counter) and a rough experiment framework — 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's not polished. It's a starting point.
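To make the idea concrete, here is a minimal sketch of how rubric scores could be turned into a curve. It is not the actual compact-counter or framework code; the names `Score`, `curve`, and `peak_round` are hypothetical, and the only assumption carried over from the setup above is that each task attempt gets a 0-5 rubric score and is tagged with how many compactions came before it.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    """One graded task attempt (hypothetical record layout)."""
    model: str        # free-form model label
    compactions: int  # how many compactions preceded this attempt
    task_id: str      # which benchmark task was run
    rubric: int       # rubric score, 0-5


def curve(scores: list[Score]) -> dict[int, float]:
    """Mean rubric score per compaction count, ordered by count."""
    by_round: dict[int, list[int]] = defaultdict(list)
    for s in scores:
        if not 0 <= s.rubric <= 5:
            raise ValueError(f"rubric score out of range: {s.rubric}")
        by_round[s.compactions].append(s.rubric)
    return {k: mean(v) for k, v in sorted(by_round.items())}


def peak_round(c: dict[int, float]) -> int:
    """Compaction count with the highest mean score (earliest on ties)."""
    return max(c, key=lambda k: (c[k], -k))
```

Feeding it a list of `Score` records from one model gives a dictionary you can plot directly, and `peak_round` tells you where the best average landed. With only 10 tasks per round the means will be noisy, so a peak found this way is a hint to investigate, not evidence that the curve exists.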
What I'd love:
I don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists — and if it does, maybe it's useful to more people than just me.