Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

A developer found that deploying a 70B Llama model with RAG features caused time-to-first-token (TTFT) to jump from 180 ms to 1.4 seconds, as the model recomputed identical attention states for repeated 6,000-token contexts on every request. Prefix caching solves this by reusing KV cache blocks for matching token prefixes, with vLLM's Automatic Prefix Caching (APC) using block-based, content-addressed hashing to skip prefill computation. However, the technique's effectiveness varies dramatically by workload—achieving 80% prefill savings in RAG with high prefix overlap, but dropping to just 5% when cache eviction or low prefix reuse undermines the benefit.

Your chatbot deploys 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are the same on every request: the system prompt, the same 6k retrieved chunks for the top queries, the tool definitions. The model is recomputing the same attention state over and over, then throwing it away.

This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic, because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix. Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.

A modern LLM serving stack has two phases per request: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70–85% of TTFT; decode is fast in comparison.

Most "long input" workloads are not actually long and unique on every request. They're long and repetitive. Prefix caching is the optimization that says: if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them. In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost. The reported "80% prefill saved" numbers come from RAG with 90%+ prefix overlap.
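To see how prefix overlap turns into TTFT, here is a back-of-envelope estimate. It is a sketch under loud assumptions: the function name `ttft_with_prefix_cache` is hypothetical, prefill cost is treated as linear in the number of tokens computed (real attention cost grows faster than that), and cache-lookup overhead is ignored. The 1,400 ms TTFT is the figure from the scenario above, and the 80% prefill share is one point inside the 70–85% range quoted for 8k inputs.

```python
def ttft_with_prefix_cache(ttft_ms: float, prefill_share: float,
                           cached_fraction: float) -> float:
    """Estimate TTFT when `cached_fraction` of prompt tokens hit the cache.

    Assumption: prefill time scales linearly with tokens actually computed.
    """
    if not (0.0 <= prefill_share <= 1.0 and 0.0 <= cached_fraction <= 1.0):
        raise ValueError("prefill_share and cached_fraction must be in [0, 1]")
    prefill_ms = ttft_ms * prefill_share      # time spent building the KV cache
    other_ms = ttft_ms - prefill_ms           # scheduling, first decode step, etc.
    return other_ms + prefill_ms * (1.0 - cached_fraction)


# 90% of the prompt is a cached prefix: most of the prefill disappears.
print(f"{ttft_with_prefix_cache(1400, 0.8, 0.90):.0f} ms")
# Only 5% of the prompt is reusable: TTFT barely moves.
print(f"{ttft_with_prefix_cache(1400, 0.8, 0.05):.0f} ms")
```

Under these assumptions the first case lands near 390 ms and the second stays above 1,300 ms, which is the whole gap between a cache that works and one that does not.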
The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.

The high-level idea is simple. The implementation has three decisions that drive the rest of the system: what unit do you hash on, how do you look it up, and what do you do when the cache is full.

```mermaid
flowchart LR
    A["New request<br/>tokens 0..N-1"] --> B["Tokenize &<br/>split into blocks"]
    B --> C["Hash each block<br/>tokens + parent hash"]
    C --> D{"Lookup in<br/>block table"}
    D -- hit --> E["Reuse KV blocks<br/>skip prefill"]
    D -- miss --> F["Compute KV<br/>for that block"]
    F --> G["Insert block<br/>into table"]
    E --> H["Continue with<br/>remaining prefill"]
    G --> H
    H --> I["Decode normally<br/>+ append new blocks"]
```

Three things matter. First, prefix caching is prefix-only: you can only skip the leading tokens, never a middle substring. If two requests share tokens 1000–2000 but differ on 0–999, you reuse nothing. Second, the cache is block-grained, not token-grained. A request has to match a whole block (default 16 tokens) to get a hit. A request that diverges at token 14,003 of a 14,016-token shared prefix reuses the 875 full blocks before the divergence (14,000 tokens) but recomputes the entire final block, including the tokens in it that did match. Third, prefix caching does not change decoding: every saved token is a saved prefill token.

vLLM's Automatic Prefix Caching (APC) is block-based and content-addressed. Each KV-cache block (default 16 tokens) is keyed by a hash of three things: the parent block's hash, the tokens in the block, and a small set of "extra hashes" for LoRA adapter IDs, multimodal input hashes, and per-tenant cache salts.

The block-size choice is the lever most teams miss. A small block (4–8 tokens) gives finer reuse: a divergence only kills the divergent block. A large block (32–64 tokens) cuts hash-table overhead and improves batching, but wastes more work on partial-prefix misses. The 16-token default is a reasonable middle for chat; for RAG with 4k–8k chunks, 16 or 32 is common.

The hash function got a security upgrade in v0.11 (April 2026).
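The three-part block key (parent hash, block tokens, extra keys) can be sketched in a few lines of Python. This is a toy model and not vLLM's code: the names `block_hashes` and `cached_prefix_tokens`, the `repr`-based serialization, and the plain `set` standing in for the block table are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, the default cited in the text


def block_hashes(tokens: list[int], extra: tuple = (),
                 block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one chained hash per full block; a trailing partial block is skipped."""
    hashes: list[bytes] = []
    parent = b""  # the first block has no parent
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        # Toy serialization (assumption); a real engine would use pickle or cbor.
        payload = repr((block, extra)).encode()
        parent = hashlib.sha256(parent + payload).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(tokens: list[int], table: set[bytes], extra: tuple = (),
                         block_size: int = BLOCK_SIZE) -> int:
    """Count leading tokens whose blocks are already present in `table`."""
    hits = 0
    for h in block_hashes(tokens, extra, block_size):
        if h not in table:
            break  # the chain is broken, so no later block can match either
        hits += 1
    return hits * block_size


table = set(block_hashes(list(range(48))))   # warm the cache with three blocks
diverged = list(range(40)) + [999] * 8       # differs inside the third block
shifted = [7] + list(range(48))              # same content, offset by one token
print(cached_prefix_tokens(diverged, table))                           # 32
print(cached_prefix_tokens(shifted, table))                            # 0
print(cached_prefix_tokens(list(range(48)), table, extra=("tenant-b",)))  # 0
```

Because each hash folds in its parent, a one-token shift at the front invalidates every block behind it, and a different salt in `extra` keeps tenants from sharing blocks. The sketch hashes with SHA-256 in the spirit of the v0.11 hash upgrade.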
Before that, the default used Python's `hash()` of the serialized block: a salted SipHash, randomized per process, fine for collision avoidance but non-reproducible across restarts. As of v0.22.1, the default is `sha256`, with a new `--prefix-caching-hash-algo` CLI flag:

| Algorithm | Hash | Serialization | Reproducible | Notes |
|---|---|---|---|---|
| `sha256` | SHA-256 | pickle | No | Default. Secure, but pickle is Python-version-sensitive. |
| `sha256_cbor` | SHA-256 | cbor2 | Yes | Recommended for multi-process or multi-language tiers. |
| `xxhash` | xxHash 128-bit | pickle | No | Faster, non-cryptographic. Multi-tenant risk must be assessed. |
| `xxhash_cbor` | xxHash 128-bit | cbor2 | Yes | Fastest with reproducibility. Same caveat. |

The multi-tenant caveat is the one to take seriously. If you serve multiple customers out of one engine and your hash function is non-cryptographic, a deliberate collision in a crafted prompt can evict another tenant's cache or, in pathological cases, substitute their KV blocks with attacker-controlled values. If you don't control the prompts, stay on `sha256` or `sha256_cbor`.

A typical vLLM deploy turns APC on at serve time:

```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enable-prefix-caching \
  --prefix-caching-hash-algo sha256_cbor \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```

APC is a server-level decision, not per-request, which is correct because the cache is a shared resource.

SGLang keeps a radix tree of cached prefixes. Each node represents a shared prefix across one or more requests; each leaf is a request-specific tail. The engine traverses the tree per request, reuses the longest matching prefix, and forks new branches where requests diverge.

The practical differences that matter in production: for most RAG and chat workloads, the two implementations deliver comparable hit rates.
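The tree-based lookup can be approximated with a plain token-level trie. This is a simplified sketch and not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and attaches KV storage and eviction metadata to nodes, all of which is omitted here. The class names `Node` and `PrefixTrie` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One child per next token; an uncompressed stand-in for radix-tree edges.
    children: dict[int, "Node"] = field(default_factory=dict)


class PrefixTrie:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tokens: list[int]) -> None:
        """Record a processed request, forking a branch where it diverges."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())

    def longest_match(self, tokens: list[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched


trie = PrefixTrie()
trie.insert(list(range(48)))
request = list(range(40)) + [999] * 8   # diverges after 40 shared tokens
print(trie.longest_match(request))      # 40
```

Matching is per token here, so all 40 shared tokens are reused; a cache that only matches whole 16-token blocks would round that down to 32.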
SGLang tends to win on many short shared prefixes (per-token matching helps); vLLM tends to win on very long shared prefixes (block-hash lookups are O(1) with a tiny constant).

| Workload | Median prefill saved | TTFT reduction | Caveat |
|---|---|---|---|
| RAG with 6k static context | 88–94% | 70–85% | Hit rate near 1.0 if the retrieved set is stable |
| Multi-turn chat, 8 turns | 60–80% avg | 30–55% | First turn is a miss; later turns reuse aggressively |
| Long-doc QA on a single PDF | 92–97% after first query | 75–90% | First query is a miss, all subsequent reuse |
| Open-ended Q&A (no shared prefix) | 0–5% | 0–5% | Don't bother enabling it |
| Tool-using agent loop | 40–70% per step | 20–45% | Tool result insertion breaks prefix mid-prompt |

Hit rate, the fraction of blocks already in the cache when a request arrived, is the single most useful number to instrument. If you turn on APC and your hit rate is below 30%, something is wrong: prefixes don't match, or the cache is being evicted before reuse.

Raise `--gpu-memory-utilization` from 0.85 to 0.92 and the working set of cached prefixes typically doubles.

Changing `--prefix-caching-hash-algo` between deploys makes the new engine see zero hits until it warms back up. That is a one-time cost, but a real incident if unexpected. Bake the algo into your Helm chart.

With `kv_connector` or SGLang's DistServe, prefix-matched requests can be routed to warm replicas, but that needs explicit config.

Prefix caching is the wrong choice, or a wasted flag, if requests share no prefix, as in the open-ended Q&A row above.

The alternative hash algorithms `sha256_cbor`, `xxhash`, and `xxhash_cbor` are available via `--prefix-caching-hash-algo`.

Next post: structured output at the decoding layer, covering JSON mode vs grammar-constrained decoding vs function calling, where the three diverge in latency and reliability, and the failure modes that show up only in production.