Has Anyone Measured How LLM Output Quality Degrades Across Multiple Compactions?

A developer observed that after multiple context compactions in LLM sessions, output quality degrades non-linearly, with a brief improvement after the second compaction before declining. They built a small experiment framework to measure compaction persistence, but no major benchmarks currently track this metric. The developer is seeking collaborators to map the curve across different models.

After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment, usually after the second compaction, where the model briefly gets better. Then it declines and never recovers.

Maybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: what if context compaction has a curve, and nobody has mapped it?

I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:

Parameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs /compact, has no such curve.

If the curve is real, you could:

Right now, none of the major benchmark suites (MMLU, HELM, BigBench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.

I built a tiny monitor, compact-counter (https://github.com/YuhaoLin2005/compact-counter-concept), and a rough experiment framework (https://github.com/YuhaoLin2005/compact-counter-concept/blob/master/EXPERIMENT.md): 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's not polished. It's a starting point.

What I'd love:

I don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists, and if it does, maybe it's useful to more people than just me.
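For anyone who wants to contribute data points, here is a minimal sketch of how rubric scores could be turned into a curve. It is not the code from the compact-counter repository; the record layout `(compaction_count, task_id, score)` and the function names `compaction_curve` and `peak_compaction` are assumptions made up for illustration. It only averages the 0-5 rubric scores per compaction count and reports where the mean is highest.

```python
from collections import defaultdict
from statistics import mean


def compaction_curve(records):
    """Average rubric score per compaction count.

    `records` is an iterable of (compaction_count, task_id, score) tuples.
    This layout is a hypothetical one chosen for this sketch; scores are
    expected to follow the 0-5 rubric described in the post.
    """
    by_count = defaultdict(list)
    for count, _task_id, score in records:
        if not 0 <= score <= 5:
            raise ValueError(f"score {score!r} is outside the 0-5 rubric")
        by_count[count].append(score)
    # Sort by compaction count so the result reads as a curve from left to right.
    return {count: mean(scores) for count, scores in sorted(by_count.items())}


def peak_compaction(curve):
    """Return the compaction count with the highest mean score.

    On ties, the lowest compaction count wins, because the curve is sorted
    and max() keeps the first maximum it encounters.
    """
    return max(curve, key=curve.get)
```

Feeding it the scores from the 10 benchmark tasks, re-run after each compaction, would give one mean per compaction count. A peak at count 2 across several models would support the "briefly gets better" observation; a peak at 0 would suggest plain monotonic decline. With only a few sessions per point the means will be noisy, so this is a first look and not a statistical test.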