Your chatbot deploys 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are the same on every request: the system prompt, the same 6k retrieved chunks for the top queries, the tool definitions. The model is recomputing the same attention state over and over, then throwing it away. This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.
Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.
A modern LLM serving stack has two phases per request: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70-85% of TTFT; decode is fast in comparison.
Most "long input" workloads are not actually long and unique on every request. They're long and repetitive:
Prefix caching is the optimization that says: if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them. In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost. The reported "80% prefill saved" numbers come from RAG with 90%+ prefix overlap. The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.
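To get a feel for what "the KV cache for those N tokens" costs to keep resident, here is a back-of-the-envelope sketch. It assumes the published Llama-3-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and counts only raw key/value tensors, ignoring allocator and block-table overhead. The helper names are made up for this post.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # Factor 2: one key vector and one value vector per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def prefix_gib(tokens: int, bytes_per_token: int) -> float:
    # Memory held by a cached prefix of `tokens` tokens, in GiB.
    return tokens * bytes_per_token / 2**30


# Assumed model shape: Llama-3-70B (80 layers, 8 KV heads, head_dim 128).
bf16 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)

print(bf16)                               # 327680 bytes per token
print(round(prefix_gib(6000, bf16), 2))   # 1.83 GiB for a 6,000-token prefix
print(round(prefix_gib(6000, fp8), 2))    # 0.92 GiB with an 8-bit KV cache
```

With tensor parallelism that footprint is split across the GPUs, but the ratio holds: under these assumptions an 8-bit cache keeps twice as many prefix tokens warm in the same memory.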
The high-level idea is simple. The implementation has three decisions that drive the rest of the system: what unit do you hash on, how do you look it up, and what do you do when the cache is full.
```mermaid
flowchart LR
    A[New request<br/>tokens 0..N-1] --> B[Tokenize &<br/>split into blocks]
    B --> C[Hash each block<br/>tokens + parent hash]
    C --> D{Lookup in<br/>block table}
    D -- hit --> E[Reuse KV blocks<br/>skip prefill]
    D -- miss --> F[Compute KV<br/>for that block]
    F --> G[Insert block<br/>into table]
    E --> H[Continue with<br/>remaining prefill]
    G --> H
    H --> I[Decode normally<br/>+ append new blocks]
```
Three things matter. First, prefix caching is prefix-only: you can only skip the leading tokens, never a middle substring. If two requests share tokens 1000-2000 but differ on 0-999, you reuse nothing. Second, the cache is block-grained, not token-grained. A request has to match a whole block (default 16 tokens) to get a hit, so reuse stops at the last fully matching block boundary. A request that diverges at token 14,003 of a 14,016-token shared prefix reuses the first 14,000 tokens (875 full blocks) and recomputes everything after that, including the few matching tokens in the partial block. Third, prefix caching does not change decoding: every saved token is a saved prefill token.
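The first two rules can be sketched in a few lines. This is a toy model of the matching rule only, not any engine's code; `reusable_prefix_tokens` is a hypothetical helper and token IDs are plain integers.

```python
from typing import Sequence


def reusable_prefix_tokens(cached: Sequence[int], new: Sequence[int], block_size: int = 16) -> int:
    """Number of leading tokens of `new` that a block-grained prefix cache could reuse."""
    common = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        common += 1
    # Only whole blocks count; the partial block at the divergence point is recomputed.
    return (common // block_size) * block_size


shared = list(range(100))            # 100 identical leading tokens
req_a = shared + [7001, 7002, 7003]
req_b = shared + [8001, 8002]
print(reusable_prefix_tokens(req_a, req_b))   # 96: six full 16-token blocks

# The same 100 tokens sitting behind one different leading token: nothing is reusable.
req_c = [9999] + shared
print(reusable_prefix_tokens(req_a, req_c))   # 0
```

The second call is the reason prompt layout matters: anything that varies per request has to come after the stable content, or the shared part never lines up as a prefix.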
vLLM's Automatic Prefix Caching (APC) is block-based and content-addressed. Each KV-cache block (default 16 tokens) is keyed by a hash of three things: the parent block's hash, the tokens in the block, and a small set of "extra hashes" for LoRA adapter IDs, multimodal input hashes, and per-tenant cache salts.
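A minimal sketch of that chained, content-addressed keying, assuming SHA-256 over a JSON encoding. The JSON serialization and both function names are stand-ins chosen to keep the example dependency-free; they are not vLLM's internals.

```python
import hashlib
import json
from typing import List, Optional, Sequence


def block_hash(parent: Optional[bytes], tokens: Sequence[int], extra: Sequence[str] = ()) -> bytes:
    # Key = hash(parent hash, tokens in this block, extra keys such as a tenant salt).
    payload = json.dumps([parent.hex() if parent else None, list(tokens), list(extra)])
    return hashlib.sha256(payload.encode()).digest()


def prefix_block_hashes(tokens: Sequence[int], block_size: int = 16,
                        extra: Sequence[str] = ()) -> List[bytes]:
    hashes: List[bytes] = []
    parent: Optional[bytes] = None
    full = len(tokens) - len(tokens) % block_size   # a trailing partial block gets no key
    for start in range(0, full, block_size):
        parent = block_hash(parent, tokens[start:start + block_size], extra)
        hashes.append(parent)
    return hashes


a = prefix_block_hashes(list(range(48)))
b = prefix_block_hashes(list(range(32)) + [999] * 16)
print(a[0] == b[0], a[1] == b[1], a[2] == b[2])   # True True False

# Same tokens in blocks 1 and 2, different block 0: the chain makes every later key differ.
c = prefix_block_hashes([999] * 16 + list(range(16, 48)))
print(a[1] == c[1])                               # False

# A per-tenant salt in the extra keys separates otherwise identical prompts.
print(prefix_block_hashes(list(range(16)), extra=("tenant-A",))
      == prefix_block_hashes(list(range(16)), extra=("tenant-B",)))   # False
```

Because each key folds in its parent, a hit on block k implies the whole prefix up to k matched, which is what lets the lookup be a flat hash-table probe rather than a sequence comparison.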
The block-size choice is the lever most teams miss. A small block (4-8 tokens) gives finer reuse: a divergence only kills the divergent block. A large block (32-64 tokens) cuts hash-table overhead and improves batching, but wastes more work on partial-prefix misses. The 16-token default is a reasonable middle for chat; for RAG with 4k-8k chunks, 16 or 32 is common.
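The trade-off is easy to put numbers on. The sketch below takes a 6,000-token shared prefix (the RAG context size from the intro) and, for each block size, counts the hash-table entries it needs and the tokens left in a partial tail block that can never be cached. `block_tradeoff` is a hypothetical helper, and batching effects are not modelled.

```python
from typing import Tuple


def block_tradeoff(prefix_len: int, block_size: int) -> Tuple[int, int]:
    """Return (hash-table entries, tokens stranded in the partial tail block)."""
    return prefix_len // block_size, prefix_len % block_size


for bs in (8, 16, 32, 64):
    entries, tail = block_tradeoff(6000, bs)
    # bs - 1 is the worst-case number of matching tokens recomputed at a divergence.
    print(bs, entries, tail, bs - 1)
# 8 750 0 7
# 16 375 0 15
# 32 187 16 31
# 64 93 48 63
```

Going from 8 to 64 tokens per block cuts the entry count roughly eightfold while the worst-case waste per divergence grows from 7 to 63 tokens, which is small next to a multi-thousand-token prefix but not next to a short chat turn.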
The hash function got a security upgrade in v0.11 (April 2026). Before that, the default used Python's `hash()` of the serialized block: a salted SipHash, randomized per process, fine for collision avoidance but non-reproducible across restarts. As of v0.22.1, the default is `sha256`, with a new `--prefix-caching-hash-algo` CLI flag:
| Algorithm | Hash | Serialization | Reproducible | Notes |
|---|---|---|---|---|
| `sha256` | SHA-256 | pickle | No | Default. Secure, but pickle is Python-version-sensitive. |
| `sha256_cbor` | SHA-256 | cbor2 | Yes | Recommended for multi-process or multi-language tiers. |
| `xxhash` | xxHash 128-bit | pickle | No | Faster, non-cryptographic. Multi-tenant risk must be assessed. |
| `xxhash_cbor` | xxHash 128-bit | cbor2 | Yes | Fastest with reproducibility. Same caveat. |
The multi-tenant caveat is the one to take seriously. If you serve multiple customers out of one engine and your hash function is non-cryptographic, a deliberate collision in a crafted prompt can evict another tenant's cache, or, in pathological cases, substitute their KV blocks with attacker-controlled values. If you don't control the prompts, stay on `sha256` or `sha256_cbor`.
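The reproducibility point about Python's built-in `hash()` is easy to demonstrate. The sketch below evaluates the same expression in two fresh interpreters with different `PYTHONHASHSEED` values (pinned here so the demo is deterministic; by default the salt is random per process) and compares the results with SHA-256. It hashes a plain string rather than a real serialized block, and `eval_in_fresh_process` is a helper written for this post.

```python
import os
import subprocess
import sys


def eval_in_fresh_process(expr: str, hash_seed: str) -> str:
    """Evaluate `expr` in a new interpreter whose str-hash salt is `hash_seed`."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


BUILTIN = "hash('block-0-tokens')"
SHA256 = "__import__('hashlib').sha256(b'block-0-tokens').hexdigest()"

# Built-in hash: the key changes with the process salt, so a restart invalidates every key.
builtin_differs = eval_in_fresh_process(BUILTIN, "1") != eval_in_fresh_process(BUILTIN, "2")
# SHA-256: the key depends only on the bytes hashed.
sha256_matches = eval_in_fresh_process(SHA256, "1") == eval_in_fresh_process(SHA256, "2")
print(builtin_differs, sha256_matches)
```

A stable digest is only half of reproducibility: the bytes fed into it also have to be stable, which is the distinction the Serialization column in the table draws between pickle and cbor2.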
A typical vLLM deploy turns APC on at serve time:
```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enable-prefix-caching \
  --prefix-caching-hash-algo sha256_cbor \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```
APC is a server-level decision, not a per-request one, which is the right granularity because the cache is a shared resource.
SGLang keeps a radix tree of cached prefixes. Each node represents a shared prefix across one or more requests; each leaf is a request-specific tail. The engine traverses the tree per request, reuses the longest matching prefix, and forks new branches where requests diverge.
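The lookup can be illustrated with a deliberately simplified structure: an uncompressed trie with one node per token. A real radix tree collapses single-child chains into one edge and attaches KV storage and eviction state to nodes; none of that is modelled here, and the class names are invented for the example rather than taken from SGLang.

```python
from typing import Dict, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class PrefixTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        # Walk the existing path and fork a new branch where the request diverges.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens: Sequence[int]) -> int:
        # Number of leading tokens already present in the tree.
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched


trie = PrefixTrie()
trie.insert([1, 2, 3, 4, 5])
print(trie.longest_prefix([1, 2, 3, 9]))      # 3: shares [1, 2, 3], then forks
trie.insert([1, 2, 3, 9])
print(trie.longest_prefix([1, 2, 3, 9, 9]))   # 4
print(trie.longest_prefix([7, 8]))            # 0
```

Matching here is per token rather than per block, so a request that diverges partway through what would be a 16-token block still reuses everything up to the divergence point.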
The practical differences that matter in production:
For most RAG and chat workloads, the two implementations deliver comparable hit rates. SGLang tends to win on many short shared prefixes (per-token matching helps); vLLM tends to win on very long shared prefixes (block-hash lookups are O(1) with a tiny constant).
| Workload | Median prefill saved | TTFT reduction | Caveat |
|---|---|---|---|
| RAG with 6k static context | 88-94% | 70-85% | Hit rate near 1.0 if the retrieved set is stable |
| Multi-turn chat, 8 turns | 60-80% (avg) | 30-55% | First turn is a miss; later turns reuse aggressively |
| Long-doc QA on a single PDF | 92-97% after first query | 75-90% | First query is a miss, all subsequent reuse |
| Open-ended Q&A (no shared prefix) | 0-5% | 0-5% | Don't bother enabling it |
| Tool-using agent loop | 40-70% per step | 20-45% | Tool result insertion breaks prefix mid-prompt |
Hit rate, the fraction of blocks already in the cache when a request arrived, is the single most useful number to instrument. If you turn on APC and your hit rate is below 30%, something is wrong: prefixes don't match, or the cache is being evicted before reuse.
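The "evicted before reuse" failure mode is worth seeing in a toy simulation. The sketch below assumes least-recently-used eviction and treats every block lookup independently (a real engine stops reusing at the first missing block of a prefix); `LRUBlockCache` and `replay` are illustrative names, not an engine API.

```python
from collections import OrderedDict
from typing import Hashable


class LRUBlockCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[Hashable, None]" = OrderedDict()
        self.hits = 0
        self.queries = 0

    def access(self, block_id: Hashable) -> bool:
        self.queries += 1
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
            return True
        self.blocks[block_id] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict the least recently used block
        return False

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0


def replay(capacity: int, n_prefixes: int, blocks_per_prefix: int, rounds: int) -> float:
    """Cycle through `n_prefixes` shared prefixes `rounds` times and report the hit rate."""
    cache = LRUBlockCache(capacity)
    for _ in range(rounds):
        for p in range(n_prefixes):
            for b in range(blocks_per_prefix):
                cache.access((p, b))
    return cache.hit_rate


print(replay(capacity=8, n_prefixes=2, blocks_per_prefix=4, rounds=10))   # 0.9
print(replay(capacity=7, n_prefixes=2, blocks_per_prefix=4, rounds=10))   # 0.0
```

One block short of the working set takes the hit rate from 0.9 to zero in this cyclic pattern, because each block is evicted just before it would be reused. Real traffic is less adversarial, but the cliff is the reason to track hit rate against cache capacity rather than assume it degrades gracefully.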
[...] `--gpu-memory-utilization` from 0.85 to 0.92 and the working set of cached prefixes typically doubles. Monitor [...]
[...] `--prefix-caching-hash-algo` between deploys makes the new engine see zero hits until it warms back up. One-time cost, but a real incident if unexpected. Bake the algo into your Helm chart.
[...] `kv_connector`, SGLang's `DistServe`) can route prefix-matched requests to warm replicas, but that needs explicit config.
Prefix caching is the wrong choice (or a wasted flag) if:
[...] `sha256_cbor`, `xxhash`, and `xxhash_cbor` available via `--prefix-caching-hash-algo`.
Next post: structured output at the decoding layer (JSON mode vs grammar-constrained decoding vs function calling), where the three diverge in latency and reliability, and the failure modes that show up only in production.