{"slug": "prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it", "title": "Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%", "summary": "A developer found that deploying a 70B Llama model with RAG features caused time-to-first-token (TTFT) to jump from 180 ms to 1.4 seconds, as the model recomputed identical attention states for repeated 6,000-token contexts on every request. Prefix caching solves this by reusing KV cache blocks for matching token prefixes, with vLLM's Automatic Prefix Caching (APC) using block-based, content-addressed hashing to skip prefill computation. However, the technique's effectiveness varies dramatically by workload—achieving 80% prefill savings in RAG with high prefix overlap, but dropping to just 5% when cache eviction or low prefix reuse undermines the benefit.", "body_md": "Your chatbot deploys 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are *the same* on every request — the system prompt, the same 6k retrieved chunks for the top queries, the tool definitions. The model is recomputing the same attention state over and over, then throwing it away. 
This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic — because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.\n\nHere's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.\n\nA modern LLM serving stack has two phases per request: **prefill** (process the entire prompt to build the KV cache) and **decode** (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70–85% of TTFT — decode is fast in comparison.\n\nMost \"long input\" workloads are not actually long and unique on every request. They're long and **repetitive**:\n\nPrefix caching is the optimization that says: *if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them.* In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost. The reported \"80% prefill saved\" numbers come from RAG with 90%+ prefix overlap. The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.\n\nThe high-level idea is simple. 
The implementation has three decisions that drive the rest of the system: **what unit do you hash on**, **how do you look it up**, and **what do you do when the cache is full**.\n\n```mermaid\nflowchart LR\n    A[New request<br/>tokens 0..N-1] --> B[Tokenize &<br/>split into blocks]\n    B --> C[Hash each block<br/>tokens + parent hash]\n    C --> D{Lookup in<br/>block table}\n    D -- hit --> E[Reuse KV blocks<br/>skip prefill]\n    D -- miss --> F[Compute KV<br/>for that block]\n    F --> G[Insert block<br/>into table]\n    E --> H[Continue with<br/>remaining prefill]\n    G --> H\n    H --> I[Decode normally<br/>+ append new blocks]\n```\n\nThree things matter. First, prefix caching is **prefix-only**: you can only skip the leading tokens, never a middle substring. If two requests share tokens 1000–2000 but differ on 0–999, you reuse nothing. Second, the cache is **block-grained**, not token-grained. A request has to match a whole block (default 16 tokens) to get a hit. A request that diverges at token 14,003 of a 14,016-token shared prefix reuses the first 875 full blocks (14,000 tokens) but recomputes the entire final block, including the few tokens in it that did match. Third, prefix caching **does not change decoding** — every saved token is a saved prefill token.\n\nvLLM's **Automatic Prefix Caching (APC)** is block-based and content-addressed. Each KV-cache block (default 16 tokens) is keyed by a hash of three things: the parent block's hash, the tokens in the block, and a small set of \"extra hashes\" for LoRA adapter IDs, multimodal input hashes, and per-tenant cache salts.\n\nThe block-size choice is the lever most teams miss. A small block (4–8 tokens) gives finer reuse — a divergence only kills the divergent block. A large block (32–64 tokens) cuts hash-table overhead and improves batching, but wastes more work on partial-prefix misses. The 16-token default is a reasonable middle for chat; for RAG with 4k–8k chunks, 16 or 32 is common.\n\nThe hash function got a security upgrade in v0.11 (April 2026). 
Before that, the default used Python's `hash()` of the serialized block — a salted SipHash, randomized per process, fine for collision avoidance but non-reproducible across restarts. As of v0.22.1, the default is `sha256`, with a new `--prefix-caching-hash-algo` CLI flag:\n\n| Algorithm | Hash | Serialization | Reproducible | Notes |\n|---|---|---|---|---|\n| `sha256` | SHA-256 | `pickle` | No | Default. Secure, but pickle is Python-version-sensitive. |\n| `sha256_cbor` | SHA-256 | `cbor2` | Yes | Recommended for multi-process or multi-language tiers. |\n| `xxhash` | xxHash 128-bit | `pickle` | No | Faster, non-cryptographic. Multi-tenant risk must be assessed. |\n| `xxhash_cbor` | xxHash 128-bit | `cbor2` | Yes | Fastest with reproducibility. Same caveat. |\n\nThe multi-tenant caveat is the one to take seriously. If you serve multiple customers out of one engine and your hash function is non-cryptographic, a deliberate collision in a crafted prompt can evict another tenant's cache, or — in pathological cases — substitute their KV blocks with attacker-controlled values. If you don't control the prompts, stay on `sha256` or `sha256_cbor`.\n\nA typical vLLM deploy turns APC on at serve time:\n\n```\nvllm serve meta-llama/Meta-Llama-3-70B-Instruct \\\n  --tensor-parallel-size 8 \\\n  --enable-prefix-caching \\\n  --prefix-caching-hash-algo sha256_cbor \\\n  --max-model-len 32768 \\\n  --gpu-memory-utilization 0.92\n```\n\nAPC is a server-level decision, not per-request — correct, because the cache is a shared resource.\n\nSGLang keeps a **radix tree** of cached prefixes. Each node represents a shared prefix across one or more requests; each leaf is a request-specific tail. The engine traverses the tree per request, reuses the longest matching prefix, and forks new branches where requests diverge.\n\nThe practical differences that matter in production:\n\nFor most RAG and chat workloads, the two implementations deliver comparable hit rates. 
SGLang tends to win on many short shared prefixes (per-token matching helps); vLLM tends to win on very long shared prefixes (block-hash lookups are O(1) with a tiny constant).\n\n| Workload | Median prefill saved | TTFT reduction | Caveat |\n|---|---|---|---|\n| RAG with 6k static context | 88–94% | 70–85% | Hit rate near 1.0 if the retrieved set is stable |\n| Multi-turn chat, 8 turns | 60–80% (avg) | 30–55% | First turn is a miss; later turns reuse aggressively |\n| Long-doc QA on a single PDF | 92–97% after first query | 75–90% | First query is a miss, all subsequent reuse |\n| Open-ended Q&A (no shared prefix) | 0–5% | 0–5% | Don't bother enabling it |\n| Tool-using agent loop | 40–70% per step | 20–45% | Tool result insertion breaks prefix mid-prompt |\n\nHit rate — the fraction of blocks already in the cache when a request arrived — is the single most useful number to instrument. If you turn on APC and your hit rate is below 30%, something is wrong: prefixes don't match, or the cache is being evicted before reuse.\n\nRaise `--gpu-memory-utilization` from 0.85 to 0.92 and the working set of cached prefixes typically doubles.\n\nChanging `--prefix-caching-hash-algo` between deploys makes the new engine see zero hits until it warms back up. One-time cost, but a real incident if unexpected. 
Bake the algo into your Helm chart.\n\n`kv_connector`, SGLang's `DistServe`) can route prefix-matched requests to warm replicas, but that needs explicit config.\n\nPrefix caching is the wrong choice (or a wasted flag) if:\n\n`sha256_cbor`, `xxhash`, and `xxhash_cbor` available via `--prefix-caching-hash-algo`.\n\nNext post: structured output at the decoding layer — JSON mode vs grammar-constrained decoding vs function calling, where the three diverge in latency and reliability, and the failure modes that show up only in production.", "url": "https://wpnews.pro/news/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it", "canonical_source": "https://dev.to/tech_nuggets/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-policies-that-5e8", "published_at": "2026-06-07 01:09:57+00:00", "updated_at": "2026-06-07 01:41:58.531604+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "machine-learning", "generative-ai", "ai-tools"], "entities": ["Llama", "vLLM", "SGLang", "H100"], "alternates": {"html": "https://wpnews.pro/news/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it", "markdown": "https://wpnews.pro/news/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it.md", "text": "https://wpnews.pro/news/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it.txt", "jsonld": "https://wpnews.pro/news/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-it.jsonld"}}