Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

A developer found that deploying a 70B Llama model with RAG features caused time-to-first-token (TTFT) to jump from 180 ms to 1.4 seconds, as the model recomputed identical attention states for repeated 6,000-token contexts on every request. Prefix caching solves this by reusing KV cache blocks for matching token prefixes, with vLLM's Automatic Prefix Caching (APC) using block-based, content-addressed hashing to skip prefill computation. However, the technique's effectiveness varies dramatically by workload: it achieves 80% prefill savings in RAG with high prefix overlap, but drops to just 5% when cache eviction or low prefix reuse undermines the benefit.

Your chatbot deploys 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s.

A surprising share of those tokens are the same on every request: the system prompt, the same 6k retrieved chunks for the top queries, the tool definitions. The model is recomputing the same attention state over and over, then throwing it away.

This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic, because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.

Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.
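As a rough preview of the "block-based, content-addressed hashing" idea, here is a minimal Python sketch. It is a toy under stated assumptions, not vLLM's or SGLang's actual code: the names `ToyPrefixCache`, `block_hashes`, and `BLOCK_SIZE` are hypothetical, the block size of 16 is illustrative, and integer block ids stand in for real KV tensors. Each block's hash is chained to its parent's hash, so a hit on block k implies the whole prefix up to block k matches.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 16  # tokens per block; illustrative value, not a tuned default


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with its parent's hash.

    Chaining means a block's hash identifies the entire prefix up to and
    including that block, not just the block's own tokens.
    """
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Maps block hash -> stand-in for a KV block (here just an integer id)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, int] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        """Record every full block of a processed prompt."""
        for h in block_hashes(tokens):
            self._blocks.setdefault(h, len(self._blocks))

    def match(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens already have cached blocks."""
        matched = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break  # first miss ends the reusable prefix
            matched += BLOCK_SIZE
        return matched


cache = ToyPrefixCache()
shared = list(range(6000))                  # stand-in for system prompt + retrieved docs
cache.insert(shared + [9001, 9002, 9003])   # first request: prefix + question A
reusable = cache.match(shared + [7001, 7002])  # second request: same prefix, question B
print(f"{reusable} of {len(shared) + 2} tokens can skip prefill")
```

In this toy, only whole blocks are reusable, and a single changed token at the start of the prompt invalidates every block after it. That property is one reason prompt layout (stable content first, variable content last) matters for hit rate.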
A modern LLM serving stack has two phases per request: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70–85% of TTFT; decode is fast in comparison.

Most "long input" workloads are not actually long and unique on every request. They're long and repetitive: as in the RAG example above, the system prompt, the retrieved chunks for the top queries, and the tool definitions recur from request to request.

Prefix caching is the optimization that says: if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them. In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost.

The reported "80% prefill saved" numbers come from RAG with 90%+ prefix overlap. The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.

The high-level idea is simple. The implementation has three decisions that drive the rest of the system: what unit do you hash on, how do you look it up, and what do you do when the cache is full.

php flowchart LR A New request