{"slug": "has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions", "title": "Has Anyone Measured How LLM Output Quality Degrades Across Multiple Compactions?", "summary": "A developer observed that after multiple context compactions in LLM sessions, output quality degrades non-linearly, with a brief improvement after the second compaction before declining. They built a small experiment framework to measure compaction persistence, but no major benchmarks currently track this metric. The developer is seeking collaborators to map the curve across different models.", "body_md": "After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment, usually after the second compaction, where the model briefly gets *better*. Then it declines and never recovers.\n\nMaybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: **what if context compaction has a curve, and nobody has mapped it?**\n\nI searched for benchmarks that measure multi-round compaction degradation. Here's what exists:\n\nParameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs `/compact`, has no such curve.\n\nIf the curve is real, you could:\n\nRight now, none of the major benchmark suites (MMLU, HELM, BigBench, RULER) include a \"compaction persistence\" metric.
If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.\n\nI built a tiny monitor ([compact-counter](https://github.com/YuhaoLin2005/compact-counter-concept)) and a rough [experiment framework](https://github.com/YuhaoLin2005/compact-counter-concept/blob/master/EXPERIMENT.md): 50 lines of Python, 10 benchmark tasks, and a 0-5 rubric. It's not polished. It's a starting point.\n\nWhat I'd love:\n\nI don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists. If it does, maybe it's useful to more people than just me.", "url": "https://wpnews.pro/news/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions", "canonical_source": "https://dev.to/yuhaolin2005/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions-1dad", "published_at": "2026-06-26 18:49:52+00:00", "updated_at": "2026-06-26 19:34:15.748755+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "developer-tools"], "entities": ["DeepSeek V4", "Claude Code", "MMLU", "HELM", "BigBench", "RULER", "Yuhao Lin"], "alternates": {"html": "https://wpnews.pro/news/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions", "markdown": "https://wpnews.pro/news/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions.md", "text": "https://wpnews.pro/news/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions.txt", "jsonld": "https://wpnews.pro/news/has-anyone-measured-how-llm-output-quality-degrades-across-multiple-compactions.jsonld"}}