# Thematic Brief — How the KV cache accelerates LLM inference on GPUs

> Source: <https://blog.r-lopes.com/newsletter/2026-06-30>
> Published: 2026-06-30 14:00:00+00:00

## The Core Claim

The KV cache is the single optimization that makes autoregressive decoding tractable: instead of recomputing every prior token's key/value projections at each step, the engine computes them once and appends one new entry per token, so per-step projection work stays constant instead of growing with sequence length [Source 57]. Because decode is memory-bandwidth-bound rather than compute-bound on GPUs [Source 72], the cache's *residency* in HBM, not raw FLOPs, sets the ceiling: vLLM's PagedAttention allocates that HBM dynamically to actual decode length, and the reported GPU KV cache size in tokens directly determines how many requests can run concurrently [Source 2][Source 8].

## Evidence

**1. The cache exists to delete redundant recompute, not to save space for its own sake.** Without it, generating token *n* requires re-projecting K and V for all *n−1* prior tokens every step — pure waste, since those projections never change. The cache is an append-only log of K/V projections consumed by attention's GEMMs.

"You don't modify it during the LLM inference. You just append to it, with every processed token. The name of this K and V projections storage is KV cache." — [Source 57]

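The append-only behavior can be sketched in a few lines of plain Python. This is a toy single-head attention, not any engine's implementation: the hidden size `D`, the random weights, and the helper names are all assumptions made for illustration. It counts K/V projections with and without a cache and checks that both paths produce the same outputs.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden size (assumption)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(x, w):
    """Row vector x (length D) times matrix w (D x D)."""
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values):
    """Softmax(q.k / sqrt(D)) weighted sum of the value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]

def decode_without_cache(tokens):
    """Re-project K and V for every token seen so far at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, W_K) for x in tokens[:t]]
        values = [project(x, W_V) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(project(tokens[t - 1], W_Q), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project K and V once per token and append them to the cache."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        projections += 2
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs, projections

tokens = rand_matrix(6, D)  # six toy token embeddings
slow_out, slow_proj = decode_without_cache(tokens)
fast_out, fast_proj = decode_with_cache(tokens)
print(slow_proj, fast_proj)  # 42 12
```

For six tokens the uncached path performs 42 K/V projections against 12 for the cached path, and the gap widens quadratically with sequence length. Attention still reads the whole cache at each step, which is why the next insight is about bandwidth.
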
**2. Decode is memory-bound, so the cache — not compute — is the bottleneck.** A GPU has ~100× the compute of a CPU but only ~10× the memory bandwidth; single-token decode does little math per byte moved, so it stalls on KV reads. This is why bandwidth (480 GB/s VRAM) and cache residency dominate, and why the engineering target is keeping K/V resident and contiguous.

"GPUs have over sort of two orders of magnitude more compute than a CPU... But GPUs only have an order of magnitude more memory bandwidth than a CPU. So what that actually means is if you do things that are not compute intense, you will be memory bound" — [Source 72]

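A back-of-the-envelope ceiling makes the memory-bound claim concrete. The sketch below assumes batch size 1 on a dense model, so each decode step reads every weight byte and every cached K/V byte once; the 16 GB weight and 4 GB cache figures are hypothetical, and only the 480 GB/s bandwidth comes from the text above.

```python
GB = 10**9

def tokens_per_second_ceiling(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each step must stream all bytes once."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

# Hypothetical sizes: 16 GB of weights, 4 GB of resident K/V, 480 GB/s VRAM.
ceiling = tokens_per_second_ceiling(16 * GB, 4 * GB, 480 * GB)
print(ceiling)  # 24.0
```

No amount of extra compute raises this bound; only moving fewer bytes per token or moving them faster does.
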
**3. PagedAttention turns the cache from a fixed worst-case reservation into a dynamic allocation, raising throughput.** Pre-allocating HBM for max sequence length strands memory; vLLM pages it by actual decode length, and the same paging lets multiple requests share identical K/V blocks (beam search, common prefixes).

"the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths" — [Source 2]
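
The paging idea can be illustrated with a toy block allocator. This is a sketch of the general technique, not vLLM's code: the class name, method names, and the 16-token block size are assumptions chosen for the example.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size KV blocks only when a request needs them."""

    def __init__(self, total_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens written so far

    def append_token(self, request_id):
        """Reserve room for one more token, allocating a block only at a boundary."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

pool = PagedKVAllocator(total_blocks=8, block_size=16)
for _ in range(20):
    pool.append_token("short")  # 20 tokens occupy 2 blocks
for _ in range(40):
    pool.append_token("long")   # 40 tokens occupy 3 blocks
blocks_in_use = {rid: len(table) for rid, table in pool.block_tables.items()}
print(blocks_in_use, len(pool.free_blocks))  # {'short': 2, 'long': 3} 3
```

A worst-case reservation would have pinned a full maximum-length slab for each request up front; here the three unused blocks stay available for a third request.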

**4. The cache's token capacity is a hard concurrency ceiling you can read off the logs.** After model weights load, remaining HBM divided by per-token KV size yields the servable token pool — vLLM prints it, and divides by per-request length to estimate concurrency (e.g. 15.70× at 40,960 tokens/request).

"The `GPU KV cache size` line reports the total number of tokens that can be stored in the GPU KV cache at once." — [Source 8]
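
The division described above can be written out as a small calculation. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value), the 20 GiB of free HBM, and the 8,192-token request length are hypothetical values picked so the arithmetic is easy to follow; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of cache one token occupies; the factor 2 covers both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def token_pool(free_hbm_bytes, per_token_bytes):
    """Whole tokens that fit in the HBM left over after weights load."""
    return free_hbm_bytes // per_token_bytes

def max_concurrency(pool_tokens, tokens_per_request):
    """How many full-length requests the pool can hold at once."""
    return pool_tokens / tokens_per_request

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)
pool = token_pool(20 * 1024**3, per_token)
concurrency = max_concurrency(pool, 8192)
print(per_token, pool, concurrency)  # 131072 163840 20.0
```

Run in reverse on the example in the text, 15.70× at 40,960 tokens per request implies a pool of roughly 643,000 tokens.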

**5. Sharing the cache across the prefill/decode split is where the largest production wins come from.** Disaggregated serving (LLM-D) routes prefill to high-memory GPUs and scales decode separately, with both phases reading the same KV cache for similar requests — yielding a 3× P90 latency improvement and a 57× improvement in time-to-first-token.

"the prefill can use high-memory GPUs, while the decode can scale separately, but both using the same KV cache for similar request" — [Source 14]

**6. Prefix caching reuses the cache across requests, deleting repeated prefill.** When every RAG query shares a ~2K-token system prompt, the KV states for that prefix are computed once and reused, skipping redundant prefill on a 32B model.

"this eliminates redundant prefill computation — saving 200-500ms per query on a 32B model" — [Source 35]

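A toy model of prefix reuse shows both the hit and the miss. This is a simplified sketch, not how any particular engine indexes its cache: the `PrefixCache` class, the 4-token block size, and the integer token ids are all assumptions.

```python
class PrefixCache:
    """Toy prefix cache keyed on token prefixes that end at block boundaries."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached_prefixes = set()

    def prefill(self, token_ids):
        """Return (tokens_reused, tokens_computed) for one request."""
        reused = 0
        full_blocks_end = len(token_ids) - len(token_ids) % self.block_size
        # Walk block boundaries and stop at the first prefix not seen before.
        for end in range(self.block_size, full_blocks_end + 1, self.block_size):
            if tuple(token_ids[:end]) in self.cached_prefixes:
                reused = end
            else:
                break
        # Record the newly computed full blocks so later requests can reuse them.
        for end in range(reused + self.block_size, full_blocks_end + 1, self.block_size):
            self.cached_prefixes.add(tuple(token_ids[:end]))
        return reused, len(token_ids) - reused

system_prompt = list(range(8))  # stands in for a shared system prompt
cache = PrefixCache(block_size=4)
first = cache.prefill(system_prompt + [100, 101, 102])       # cold cache
second = cache.prefill(system_prompt + [200, 201])           # same prefix
changed = cache.prefill([99] + system_prompt[1:] + [200])    # first token differs
print(first, second, changed)  # (0, 11) (8, 2) (0, 9)
```

The third request differs only in its first token, yet it reuses nothing, which is the mechanism behind the advice to keep the prefix stable.
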
**7. Quantizing the cache to FP8 trades precision for more resident tokens.** Halving K/V byte-width nearly doubles the token pool from insight #4, directly increasing throughput and max context; vLLM supports `fp8_e4m3` on both CUDA and ROCm.

"This optimization enables you to store more tokens in memory, leading to improved throughput and support for longer context windows." — [Source 42]

## How It Works

```mermaid
flowchart LR
 P[Prompt tokens] --> PF[Prefill: compute K,V for all tokens]
 PF --> KV[(KV cache in HBM)]
 KV --> AT[Attention GEMM]
 AT --> TOK[Emit next token]
 TOK --> AP[Append new K,V]
 AP --> KV
 KV --> CC[Concurrency = HBM pool / per-req KV]
```

Prefill populates the cache once for the whole prompt. Each decode step then reads the resident cache, emits one token, and appends only that token's K/V, so the per-step cost is a bandwidth-bound read plus a small append. The HBM left free after weights load bounds how many requests can hold caches at once [Source 57][Source 8].

## What This Means in Practice

On a high-traffic stack, treat the KV cache as the capacity unit you provision and meter, exactly as you'd budget LCP/INP on the frontend. Stabilize the cacheable prefix: pin a fixed system prompt and stable chunk ordering so prefix caching actually hits; dynamically resizing context (varying retrieved-chunk count) invalidates the cached prefix and *raises* TTFT instead of lowering it [Source 35][Source 148]. Size `--gpu-memory-utilization` against the printed GPU KV cache size to set real concurrency rather than guessing [Source 8], and reach for FP8 KV (`kv_cache_dtype=fp8_e4m3`) before buying more cards when you need longer context or more concurrent users [Source 42]. Just as React 19 `useTransition` and Next.js streaming hide latency by not blocking on work already done, the KV cache and prefix reuse hide it by not *recomputing* work already done: the streaming TTFT a user feels is dominated by whether prefill was skipped.

## Counter-Evidence / Limits

The cache is a speedup only while it stays resident in VRAM: when allocations spill K/V pages to GTT/system RAM over PCIe (~20 GB/s vs ~480 GB/s VRAM), the same mechanism inverts into a ~24× per-token penalty, pushing TTFT from ~50ms to 800–1200ms [Source 69]. Capacity tactics fight each other — speculative decoding's draft model and its own KV claim 1.5–3 GB that would otherwise hold concurrent requests' caches [Source 79], and shrinking context to save tokens can cost more latency than it saves by busting the prefix cache [Source 148]. The corpus is unanimous that the cache is foundational, but it disagrees on *where the cache should live*: on consumer AMD RDNA with no MIG/MPS isolation, the dominant advice is to stop co-locating and physically isolate the LLM's cache on a dedicated card rather than manage contention [Source 16][Source 147]. Finally, sharing a decrypted cache across workers in disaggregated serving is a real security surface — re-encrypting per decode step would erase the entire latency win, so isolation, not crypto on the hot path, is the mitigation [Source 36].

## Today's CEMENT brick

**Execute-blind:** Start a vLLM serve (or check an existing one) and grep the startup log for the two lines `GPU KV cache size: N tokens` and `Maximum concurrency for M tokens per request: X`. Before reading them, write down your predicted token pool from `(VRAM − weights) / per-token-KV`, then divide it by `M` to get your predicted max concurrency. Compare to the printed `X`: the gap is your real headroom for prefix caching and FP8 KV, and it tells you whether your next throughput win is a config flag or a hardware spend [Source 8][Source 42].
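
The comparison can be scripted with a small parser. The regular expressions below follow the two log-line shapes quoted above; the sample log text, its numbers, and the function names are made up for illustration, and real log lines may carry extra prefixes or formatting that need adjusting.

```python
import re

KV_SIZE_RE = re.compile(r"GPU KV cache size:\s*([\d,]+)\s*tokens")
CONCURRENCY_RE = re.compile(
    r"Maximum concurrency for\s*([\d,]+)\s*tokens per request:\s*(\d+(?:\.\d+)?)"
)

def parse_kv_report(log_text):
    """Extract the token pool and printed concurrency, or None if either line is missing."""
    size = KV_SIZE_RE.search(log_text)
    conc = CONCURRENCY_RE.search(log_text)
    if not (size and conc):
        return None
    return {
        "pool_tokens": int(size.group(1).replace(",", "")),
        "tokens_per_request": int(conc.group(1).replace(",", "")),
        "printed_concurrency": float(conc.group(2)),
    }

def headroom_gap(predicted_concurrency, printed_concurrency):
    """Positive means the engine fits more requests than you predicted."""
    return printed_concurrency - predicted_concurrency

# Hypothetical log excerpt in the shape described above.
sample_log = (
    "INFO GPU KV cache size: 163,840 tokens\n"
    "INFO Maximum concurrency for 8,192 tokens per request: 20.00x\n"
)
report = parse_kv_report(sample_log)
gap = headroom_gap(18.0, report["printed_concurrency"])
print(report["pool_tokens"], report["printed_concurrency"], gap)  # 163840 20.0 2.0
```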

## Sources

- [vLLM inference frameworks](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/llm-inference-frameworks.html)
- [Parallelism and Scaling — GPU KV cache size log](https://docs.vllm.ai)
- [LLM‑D Explained: Building Next‑Gen AI with LLMs, RAG & Kubernetes](https://www.youtube.com/watch?v=CNKGgOphAPM)
- Self-Learning Q&A — CPU vs GPU reranking / KV eviction
- Self-Learning Q&A — disaggregated KV cache security & re-encryption cost
- Self-Learning Q&A — production AI topology, prefix caching & KV reuse
- [Quantized KV Cache — FP8 KV Cache Overview](https://docs.vllm.ai)
- [tiny-vllm — Why KV cache exists](https://github.com/jmaczan/tiny-vllm)
- Self-Learning Q&A — cross-instance KV spill to GTT latency
- [Building Windsurf with Varun Mohan](https://www.youtube.com/watch?v=G9WOC8sUts8)
- Self-Learning Q&A — speculative decoding draft-model KV overhead
- Self-Learning Q&A — token-budget optimizer vs prefix-cache invalidation
- Self-Learning Q&A — cross-encoder + embedding GPU partitioning
