Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

wpnews.pro


Source: arxiv.org · Published: 2026-06-17T04:00Z · Topic: large-language-models

Researchers report that the key/value (KV) cache in large language models can be edited and composed like a notebook of memoized conclusions, so a changed input field can be corrected without full recomputation. Validated across multiple model families and scales, a unified edit+compose agent stays decision-identical to full recompute at up to 14.9x lower latency. Because the edit is append-only, it composes with existing prefix caching: in an online vLLM benchmark it keeps a 98.5% cache hit rate and cuts p90 time-to-first-token by 53-398x.
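Why an append-only edit stays compatible with prefix caching can be illustrated with a toy model. The sketch below assumes a block-hashed prefix cache in which each block's hash also covers every preceding block; that design, the `BLOCK_SIZE` value, the token ids, and the helper names `block_hashes` and `reusable_blocks` are all assumptions made for illustration, not vLLM's API or the paper's code.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, an assumption)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash every full block: each hash covers the block and all blocks before it."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached_hashes, new_hashes):
    """Count the leading blocks whose hashes still match the cached prompt."""
    count = 0
    for cached, new in zip(cached_hashes, new_hashes):
        if cached != new:
            break
        count += 1
    return count


# Hypothetical 16-token prompt (4 blocks); token 5 plays the role of the changed field.
original = list(range(100, 116))

# Option A: overwrite the field in place. Every block from the field onward misses.
in_place = list(original)
in_place[5] = 999

# Option B: leave the prompt untouched and append an erratum after it.
appended = original + [900, 901, 902, 903]

cached = block_hashes(original)
print(reusable_blocks(cached, block_hashes(in_place)))  # 1
print(reusable_blocks(cached, block_hashes(appended)))  # 4
```

In this toy setup the in-place edit leaves only the first block reusable, while the appended erratum reuses all four original blocks and adds one new block to prefill. The sketch models cache alignment only; whether the model actually acts on the erratum is the separate, empirical question the paper studies.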

1 min read · Published Jun 17, 2026

arXiv:2606.17107v1. Abstract: Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
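The "RoPE-repositioned" step rests on a standard property of rotary position embeddings: rotations compose, so a key cached at position p can be moved to position p + d by rotating it once more by d, with no need for the original unrotated key. The following is a minimal sketch of that identity only, assuming the interleaved-pair RoPE convention and base 10000; the function names `rope_rotate` and `reposition_keys` and the toy vectors are hypothetical and are not the paper's implementation.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector (interleaved pairs)."""
    dim = len(vec)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        # Pair i/2 rotates at frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def reposition_keys(cached_keys, delta, base=10000.0):
    """Shift already-rotated keys by `delta` positions with one extra rotation."""
    return [rope_rotate(k, delta, base) for k in cached_keys]


# Toy "precompiled skill": three unrotated key vectors of dimension 8.
raw_keys = [[0.1 * (t + 1) * (i + 1) for i in range(8)] for t in range(3)]

# Keys as they would be cached when the skill is compiled at positions 0, 1, 2.
cached = [rope_rotate(k, pos) for pos, k in enumerate(raw_keys)]

# Splice the skill into a context at offset 40 by re-rotating the cached keys.
offset = 40
spliced = reposition_keys(cached, offset)

# Reference: rotate the raw keys directly at positions 40, 41, 42.
recomputed = [rope_rotate(k, offset + pos) for pos, k in enumerate(raw_keys)]

max_err = max(abs(a - b) for s, r in zip(spliced, recomputed) for a, b in zip(s, r))
print(max_err)  # difference is at floating-point rounding level
```

Each cached vector is touched once, which is consistent with the linear cost the abstract cites for splicing. Value vectors carry no rotary encoding in standard RoPE attention, so only keys are re-rotated here. The sketch checks the positional arithmetic alone; the abstract's claim that spliced output matches full recompute is an empirical result this toy does not test.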

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.17107


