# Has Anyone Measured How LLM Output Quality Degrades Across Multiple Compactions?

> Source: <https://dev.to/yuhaolin2005/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions-1dad>
> Published: 2026-06-26 18:49:52+00:00

After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment — usually after the second compaction — where the model briefly gets *better*. Then it declines and never recovers.

Maybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: **what if context compaction has a curve, and nobody has mapped it?**

I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:

Parameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs `/compact`, has no such curve.

If the curve is real, you could:

Right now, none of the major benchmark suites (MMLU, HELM, BigBench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.
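To make "compaction persistence" concrete, here is one way such a metric could be defined. This is only a sketch under my own assumption (no benchmark specifies it): the mean task score after k compactions divided by the mean score with no compaction. The function name and the sample numbers are hypothetical.

```python
def compaction_persistence(mean_scores):
    """Ratio of each mean score to the uncompacted baseline.

    mean_scores[0] is the baseline (no compaction); mean_scores[k] is the
    mean score after k compactions. This definition is an assumption, not
    an established benchmark metric.
    """
    if not mean_scores:
        raise ValueError("need at least a baseline score")
    baseline = mean_scores[0]
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return [score / baseline for score in mean_scores]


# Made-up mean scores for 0, 1, 2 and 3 compactions, for illustration only.
ratios = compaction_persistence([4.0, 3.5, 4.5, 2.5])
print(ratios)
```

A ratio above 1.0 at some k would be the "briefly gets better" effect; a ratio that keeps falling afterwards would be the decline.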

I built a tiny monitor ([compact-counter](https://github.com/YuhaoLin2005/compact-counter-concept)) and a rough [experiment framework](https://github.com/YuhaoLin2005/compact-counter-concept/blob/master/EXPERIMENT.md) — 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's not polished. It's a starting point.
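For a sense of what the analysis step could look like, here is a minimal sketch. It is not the code in the repo. It assumes each data point is a `(compactions, task_id, score)` tuple with a 0-5 rubric score, averages the scores per compaction count, and reports where the mean peaks. The record layout, function names, and scores are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def curve_by_compaction(records):
    """Average rubric scores per compaction count.

    records: iterable of (compactions, task_id, score) tuples, where score
    is on the 0-5 rubric. Returns (compactions, mean_score) pairs sorted
    by compaction count.
    """
    buckets = defaultdict(list)
    for compactions, _task_id, score in records:
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 rubric")
        buckets[compactions].append(score)
    return [(k, mean(v)) for k, v in sorted(buckets.items())]


def peak(curve):
    """Return the (compactions, mean_score) point with the highest mean."""
    return max(curve, key=lambda point: point[1])


# Hypothetical scores for two tasks across 0-3 compactions.
records = [
    (0, "t1", 4), (0, "t2", 4),
    (1, "t1", 3), (1, "t2", 4),
    (2, "t1", 5), (2, "t2", 4),
    (3, "t1", 2), (3, "t2", 3),
]
curve = curve_by_compaction(records)
print(curve)
print(peak(curve))
```

With only a handful of tasks per point, a peak like this could easily be noise, which is why many data points across models matter more than any single run.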

What I'd love:

I don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists — and if it does, maybe it's useful to more people than just me.
