Has Anyone Measured How LLM Output Quality Degrades Across Multiple Compactions?

[ARTICLE · art-41210] src=dev.to ↗ pub=2026-06-26T18:49Z topic=large-language-models verified=true sentiment=· neutral

A developer observed that after multiple context compactions in LLM sessions, output quality degrades non-linearly, with a brief improvement after the second compaction before declining. They built a small experiment framework to measure compaction persistence, but no major benchmarks currently track this metric. The developer is seeking collaborators to map the curve across different models.

After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment — usually after the second compaction — where the model briefly gets better. Then it declines and never recovers.

Maybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: what if context compaction has a curve, and nobody has mapped it?

I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:

Parameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs /compact, has no such curve.

If the curve is real, you could:

Right now, none of the major benchmark suites (MMLU, HELM, BIG-bench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap will widen every year.

I built a tiny monitor (compact-counter) and a rough experiment framework — 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's not polished. It's a starting point.
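To make the shape of such a framework concrete, here is a minimal sketch of the bookkeeping step: grouping 0-5 rubric scores by how many compactions had happened when each task was run, then averaging each group. It is not the compact-counter code or the actual framework. The function name `mean_score_by_compaction`, the record layout, and the task IDs are assumptions for illustration, and the scores are made-up placeholders, not measurements.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_compaction(records):
    """Average rubric scores per compaction count.

    `records` is an iterable of (compaction_count, task_id, score) tuples,
    where score is a 0-5 rubric grade. This layout is hypothetical.
    """
    buckets = defaultdict(list)
    for compactions, _task_id, score in records:
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 rubric")
        buckets[compactions].append(score)
    # Sort by compaction count so the result reads as a curve.
    return {count: mean(scores) for count, scores in sorted(buckets.items())}


# Placeholder data only: two tasks graded after 0, 1, 2 and 3 compactions.
records = [
    (0, "task-01", 4), (0, "task-02", 4),
    (1, "task-01", 3), (1, "task-02", 4),
    (2, "task-01", 5), (2, "task-02", 4),
    (3, "task-01", 2), (3, "task-02", 3),
]

curve = mean_score_by_compaction(records)
for count, avg in curve.items():
    print(f"compactions={count} mean_score={avg}")
```

Keeping the raw per-task records, rather than only the averages, makes it possible to pool results from different people and models later.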

What I'd love:

I don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists — and if it does, maybe it's useful to more people than just me.
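One simple way to check whether a bump after the second compaction is more than noise is a permutation test on the pooled scores. The sketch below is an assumption about how contributed data could be analyzed, not part of the original experiment: `permutation_p_value` is a hypothetical helper, the test is one-sided (it asks whether scores after the second compaction are higher than after the first), and the example scores are placeholders.

```python
import random
from statistics import mean


def permutation_p_value(before, after, n_perm=10_000, seed=0):
    """One-sided permutation test for mean(after) > mean(before).

    Shuffles the pooled scores many times and counts how often a random
    split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[n_before:]) - mean(pooled[:n_before])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Placeholder rubric scores (0-5), not real measurements.
after_first_compaction = [3, 4, 3, 3, 4, 3, 4, 3, 3, 4]
after_second_compaction = [4, 4, 5, 4, 4, 3, 5, 4, 4, 4]

p = permutation_p_value(after_first_compaction, after_second_compaction)
print(f"one-sided p-value: {p:.4f}")
```

A permutation test makes no assumption that rubric scores are normally distributed, which suits coarse 0-5 grades. With only 10 tasks per condition, a small p-value on one model would still be weak evidence, so pooled data across models matters.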

Read original on dev.to → dev.to/yuhaolin2005/has-anyone-measured-how-llm-…
