Source: arxiv.org · Topic: large-language-models

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

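Researchers report that the key/value (KV) cache in large language models behaves like a notebook of memoized conclusions that can be edited and composed, so a changed input field need not force a full re-prefill. Overwriting only the changed field's own key/value vectors is not enough, because the model has already written the field-conditioned conclusion into downstream cache entries; instead, an appended erratum amends those notes. The approach, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, stays decision-identical to full recompute at up to 14.9x lower latency. Because the erratum is append-only, it composes with existing prefix caching: in an online vLLM benchmark it keeps a 98.5% prefix-cache hit rate and cuts p90 time-to-first-token by 53-398x.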

1 min read · published Jun 17, 2026

arXiv:2606.17107v1. Abstract: Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
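
The abstract's "RoPE-repositioned" step rests on a property of rotary position embeddings: rotations compose, so a key cached at position p can be moved to position p + offset with one extra rotation per token instead of a fresh prefill. The NumPy sketch below illustrates only that positional algebra. It is not the paper's implementation; the function names (`apply_rope`, `reposition_keys`), the interleaved channel pairing, and the base of 10000 are assumptions chosen for illustration, and real models differ in pairing convention. It also says nothing about whether the cached content stays valid in a new context, which is the paper's empirical claim.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate each (even, odd) channel pair of x by an angle set by its position.

    x: array of shape (seq, dim) with even dim; positions: array of shape (seq,).
    Interleaved pairing is an assumption; some implementations pair halves instead.
    """
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per pair
    angles = np.outer(positions, inv_freq)             # shape (seq, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = even * cos - odd * sin
    out[..., 1::2] = even * sin + odd * cos
    return out

def reposition_keys(cached_keys, offset, base=10000.0):
    """Shift already-rotated keys by `offset` positions (hypothetical helper).

    Rotating by p and then by offset equals rotating by p + offset,
    so the cost is one rotation per cached token.
    """
    shift = np.full(cached_keys.shape[0], offset)
    return apply_rope(cached_keys, shift, base)

rng = np.random.default_rng(0)
seq_len, dim, offset = 6, 8, 100
raw_keys = rng.standard_normal((seq_len, dim))

# "Precompiled" keys: rotated as if the snippet started at position 0.
cached = apply_rope(raw_keys, np.arange(seq_len))

# Splice target: the same snippet placed `offset` tokens into another context.
moved = reposition_keys(cached, offset)
recomputed = apply_rope(raw_keys, np.arange(seq_len) + offset)

max_err = float(np.abs(moved - recomputed).max())
print("max abs difference vs. recompute:", max_err)
```

In standard RoPE only queries and keys are rotated, so cached values would be reused unchanged under this scheme; the repositioning pass touches each cached token once, which is consistent with the O(L) cost the abstract cites.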
