{"slug": "thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus", "title": "Thematic Brief — How the KV cache accelerates LLM inference on GPUs", "summary": "The KV cache accelerates LLM inference on GPUs by storing prior token key/value projections instead of recomputing them, reducing per-step attention cost from quadratic to linear. Decode is memory-bandwidth-bound, so cache residency in HBM determines throughput, with techniques like vLLM's PagedAttention and prefix caching dynamically allocating and reusing cache to boost concurrency and reduce latency.", "body_md": "*2026-06-30*\n\n## The Core Claim\n\nThe KV cache is the single optimization that makes autoregressive decoding tractable: instead of recomputing every prior token's key/value projections at each step, the engine stores them once and appends per token, collapsing per-step attention cost from quadratic recompute to a linear append [Source 57]. Because decode is memory-bandwidth-bound rather than compute-bound on GPUs [Source 72], the cache's *residency* in HBM — not raw FLOPs — sets the ceiling: vLLM's PagedAttention allocates that HBM dynamically to actual decode length, and the reported GPU KV cache size in tokens directly determines how many requests run concurrently [Source 2](#source-2)[Source 8](#source-8).\n\n## Evidence\n\n**1. The cache exists to delete redundant recompute, not to save space for its own sake.** Without it, generating token *n* requires re-projecting K and V for all *n−1* prior tokens every step — pure waste, since those projections never change. The cache is an append-only log of K/V projections consumed by attention's GEMMs.\n\n\"You don't modify it during the LLM inference. You just append to it, with every processed token. The name of this K and V projections storage is KV cache.\" — [Source 57]\n\n**2. 
Decode is memory-bound, so the cache — not compute — is the bottleneck.** A GPU has ~100× the compute of a CPU but only ~10× the memory bandwidth; single-token decode does little math per byte moved, so it stalls on KV reads. This is why bandwidth (e.g. ~480 GB/s of VRAM) and cache residency dominate, and why the engineering target is keeping K/V resident and contiguous.\n\n\"GPUs have over sort of two orders of magnitude more compute than a CPU... But GPUs only have an order of magnitude more memory bandwidth than a CPU. So what that actually means is if you do things that are not compute intense, you will be memory bound\" — [Source 72]\n\n**3. PagedAttention turns the cache from a fixed worst-case reservation into a dynamic allocation, raising throughput.** Pre-allocating HBM for max sequence length strands memory; vLLM pages it by actual decode length, and the same paging lets multiple requests share identical K/V blocks (beam search, common prefixes).\n\n\"the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths\" — [Source 2]\n\n**4. The cache's token capacity is a hard concurrency ceiling you can read off the logs.** After model weights load, remaining HBM divided by per-token KV size yields the servable token pool — vLLM prints it, and divides by per-request length to estimate concurrency (e.g. 15.70× at 40,960 tokens/request).\n\n\"The `GPU KV cache size` line reports the total number of tokens that can be stored in the GPU KV cache at once.\" — [Source 8]\n\n**5. 
Sharing the cache across the prefill/decode split is where the largest production wins come from.** Disaggregated serving (LLM-D) routes prefill to high-memory GPUs and scales decode separately, with both phases reading the same KV cache for similar requests — yielding a 3× P90 latency improvement and a 57× improvement in time-to-first-token.\n\n\"the prefill can use high-memory GPUs, while the decode can scale separately, but both using the same KV cache for similar request\" — [Source 14]\n\n**6. Prefix caching reuses the cache across requests, deleting repeated prefill.** When every RAG query shares a ~2K-token system prompt, the KV states for that prefix are computed once and reused, skipping redundant prefill on a 32B model.\n\n\"this eliminates redundant prefill computation — saving 200-500ms per query on a 32B model\" — [Source 35]\n\n**7. Quantizing the cache to FP8 trades precision for more resident tokens.** Halving K/V byte-width nearly doubles the token pool from insight #4, directly increasing throughput and max context — vLLM supports `fp8_e4m3` on both CUDA and ROCm.\n\n\"This optimization enables you to store more tokens in memory, leading to improved throughput and support for longer context windows.\" — [Source 42]\n\n## How It Works\n\n```mermaid\nflowchart LR\n P[Prompt tokens] --> PF[Prefill: compute K,V for all tokens]\n PF --> KV[(KV cache in HBM)]\n KV --> AT[Attention GEMM]\n AT --> TOK[Emit next token]\n TOK --> AP[Append new K,V]\n AP --> KV\n KV --> CC[Concurrency = HBM pool / per-req KV]\n```\n\nPrefill populates the cache once for the whole prompt; each decode step then reads the resident cache, emits one token, and appends only that token's K/V — so the per-step cost is a bandwidth-bound read plus a small append, and the free HBM left after weights bounds how many requests can hold caches at once [Source 57][Source 8](#source-8).\n\n## What This Means in Practice\n\nOn a high-traffic stack, treat the KV cache as the capacity unit you 
provision and meter, exactly as you'd budget LCP/INP on the frontend. Stabilize the cacheable prefix — pin a fixed system prompt and stable chunk ordering so prefix caching actually hits; dynamically resizing context (varying retrieved-chunk count) invalidates the cached prefix and *raises* TTFT instead of lowering it [Source 35][Source 148]. Size `--gpu-memory-utilization` against the printed GPU KV cache size to set real concurrency rather than guessing [Source 8](#source-8), and reach for FP8 KV (`kv_cache_dtype=fp8_e4m3`) before buying more cards when you need longer context or more concurrent users [Source 42]. Just as React 19 `useTransition` and Next.js streaming hide latency by not blocking on work already done, the KV cache and prefix reuse hide it by not *recomputing* work already done — the streaming TTFT a user feels is dominated by whether prefill was skipped.\n\n## Counter-Evidence / Limits\n\nThe cache is a speedup only while it stays resident in VRAM: when allocations spill K/V pages to GTT/system RAM over PCIe (~20 GB/s vs ~480 GB/s VRAM), the same mechanism inverts into a ~24× per-token penalty, pushing TTFT from ~50ms to 800–1200ms [Source 69]. Capacity tactics fight each other — speculative decoding's draft model and its own KV claim 1.5–3 GB that would otherwise hold concurrent requests' caches [Source 79], and shrinking context to save tokens can cost more latency than it saves by busting the prefix cache [Source 148]. The corpus is unanimous that the cache is foundational, but it disagrees on *where the cache should live*: on consumer AMD RDNA with no MIG/MPS isolation, the dominant advice is to stop co-locating and physically isolate the LLM's cache on a dedicated card rather than manage contention [Source 16][Source 147]. 
Finally, sharing a decrypted cache across workers in disaggregated serving is a real security surface — re-encrypting per decode step would erase the entire latency win, so isolation, not crypto on the hot path, is the mitigation [Source 36].\n\n## Today's CEMENT brick\n\n**Execute-blind:** Start a vLLM (or check an existing) serve and grep the startup log for the two lines `GPU KV cache size: N tokens` and `Maximum concurrency for M tokens per request: X`. Before reading them, write down your predicted max concurrency from `(VRAM − weights) / per-token-KV`. Compare to the printed `X` — the gap is your real headroom for prefix caching and FP8 KV, and it tells you whether your next throughput win is a config flag or a hardware spend [Source 8](#source-8)[Source 42].\n\n## Sources\n\n- [vLLM inference frameworks](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/llm-inference-frameworks.html)\n- [Parallelism and Scaling — GPU KV cache size log](https://docs.vllm.ai)\n- [LLM‑D Explained: Building Next‑Gen AI with LLMs, RAG & Kubernetes](https://www.youtube.com/watch?v=CNKGgOphAPM)\n- Self-Learning Q&A — CPU vs GPU reranking / KV eviction\n- Self-Learning Q&A — disaggregated KV cache security & re-encryption cost\n- Self-Learning Q&A — production AI topology, prefix caching & KV reuse\n- [Quantized KV Cache — FP8 KV Cache Overview](https://docs.vllm.ai)\n- [tiny-vllm — Why KV cache exists](https://github.com/jmaczan/tiny-vllm)\n- Self-Learning Q&A — cross-instance KV spill to GTT latency\n- [Building Windsurf with Varun Mohan](https://www.youtube.com/watch?v=G9WOC8sUts8)\n- Self-Learning Q&A — speculative decoding draft-model KV overhead\n- Self-Learning Q&A — token-budget optimizer vs prefix-cache invalidation\n- Self-Learning Q&A — cross-encoder + embedding GPU partitioning", "url": "https://wpnews.pro/news/thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus", "canonical_source": "https://blog.r-lopes.com/newsletter/2026-06-30", "published_at": 
"2026-06-30 14:00:00+00:00", "updated_at": "2026-06-30 18:57:30.234724+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-research", "machine-learning"], "entities": ["vLLM", "PagedAttention", "LLM-D", "GPU", "HBM", "CUDA", "ROCm", "FP8"], "alternates": {"html": "https://wpnews.pro/news/thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus", "markdown": "https://wpnews.pro/news/thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus.md", "text": "https://wpnews.pro/news/thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus.txt", "jsonld": "https://wpnews.pro/news/thematic-brief-how-the-kv-cache-accelerates-llm-inference-on-gpus.jsonld"}}