{"slug": "kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost", "title": "KV Cache Locality: The Hidden Variable in Your LLM Serving Cost", "summary": "A 22% throughput improvement and a cache hit rate of up to 97.5% are achievable on LLM serving clusters by routing requests to GPUs that already hold their token prefixes in KV cache, rather than using round-robin load balancing. On an 8-GPU node running CodeLlama 13B, round-robin routing wastes an estimated $1,200–$1,800 per month in GPU-hours on recomputed prefills that produce no useful work. The savings compound with larger models and longer prefixes, but the benefit disappears when prefill is not the bottleneck.", "body_md": "# KV Cache Locality: The Hidden Variable in Your LLM Serving Cost\n\nEvery time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It’s just sitting on a different card. Your load balancer doesn’t know. It can’t know. It’s counting connections, not tokens.\n\nThat recomputation takes real time and real money. On a Llama 3.1 70B at half precision, a 4,000-token prefill takes over a second. If eight GPUs each recompute the same system prompt independently because round-robin sent one request to each, you just paid for the same work eight times. Multiply by every request, every hour, every day.\n\nThis post is about the cost of that mistake, how to measure it, and what changes when your load balancer understands token locality.\n\n## What the KV Cache Actually Saves You\n\nA transformer processes input tokens in two phases. **Prefill** computes the\nkey-value pairs for every input token: the system prompt, the conversation\nhistory, the RAG context. This is the expensive part. It scales with token\ncount and model size, and it’s compute-bound on the GPU. **Decode** generates\noutput tokens one at a time, each one reusing the key-value pairs from\nprefill. 
This is the cheap part.\n\nvLLM and other serving engines cache the key-value pairs from prefill in GPU memory. When a new request arrives with the same token prefix, the engine skips prefill entirely and jumps straight to decode. This is the KV cache hit.\n\nOn our benchmarks, a cache hit on CodeLlama 13B returns in 18ms at P50. A cache miss takes around 500ms. That’s a 28x gap in time-to-first-token, decided entirely by whether the tokens were already on that GPU.\n\nBut here’s the thing: the KV cache is **per-GPU**. GPU 0’s cache doesn’t help\nGPU 3. If your load balancer sends Request A to GPU 0 and the identical\nRequest B to GPU 3, Request B pays full prefill cost even though the work was\nalready done. The cache exists. It’s just on the wrong card.\n\n## The Math on Wasted Prefill\n\nLet’s make this concrete. You’re running a RAG application with a 4,000-token system prompt. You have 8 GPUs serving CodeLlama 13B. You’re handling 30 concurrent users with a stress workload (heavy on large and extra-large prefixes). Here’s what we measured on 8x A100s:\n\nRound-robin routing:\n\n- Cache hit rate: 12.5%\n- P99 TTFT: 6,800ms\n- Throughput: 36.3 req/s\n\nWith 8 backends and random routing, you’d expect ~12.5% cache hits by chance. One in eight requests happens to land on the GPU that already has its prefix cached. The other 87.5% recompute from scratch.\n\nPrefix-aware routing:\n\n- Cache hit rate: 97.5%\n- P99 TTFT: 1,000ms\n- Throughput: 44.4 req/s\n\nSame GPUs. Same model. Same workload. The only change is which GPU receives which request.\n\nThat throughput difference, 36.3 vs 44.4 requests per second, is a 22.3% improvement. On hardware costing ~$10/hour, that’s either 22% more throughput for free or the same throughput on fewer GPUs. Over a month of continuous operation, on a single 8-GPU node, the wasted prefill in round-robin comes to roughly $1,200–$1,800 in GPU-hours (22% of ~$7,300/month at $10/hr) that produce no useful work. 
Multiply by the number of nodes in your cluster.\n\n## Where the Savings Compound\n\nThe benefit scales with three variables: **model size**, **prefix length**,\nand **prefix sharing ratio**.\n\n### Model size\n\nLarger models have more expensive prefill, so cache misses cost more.\n\n| Model | XLarge Cache Hit Improvement | Aggregate Throughput Gain |\n|---|---|---|\n| Llama 3.1 8B | 31.6% | ~0% (inference too fast) |\n| CodeLlama 13B | 35.9% | +13.7% to +22.3% |\n| Llama 3.1 70B | 43.8% | ~0% (compute-bound) |\n\nThe 8B numbers are the warning case. When prefill is already fast (~420ms total inference), the 7-10ms routing overhead eats into the savings. If prefill isn’t your bottleneck, prefix-aware routing doesn’t help.\n\nThe 70B numbers tell a different story. Aggregate throughput doesn’t change because the GPUs are already compute-saturated. But individual requests are 44% faster on cache hit (P50: 1,498ms hit vs 2,665ms miss). Your users feel the difference even if your throughput dashboard doesn’t.\n\nThe sweet spot is 13B-70B models where prefill is expensive enough to matter but the GPUs aren’t so saturated that they can’t benefit from skipping it.\n\n### Prefix length\n\nLonger shared prefixes mean more wasted compute per cache miss.\n\n| Max Prefix Tokens | Cache Miss P50 | Cache Hit P50 | Improvement |\n|---|---|---|---|\n| 8,192 (default) | 638ms | 448ms | 29.7% |\n| 16,384 | 817ms | 461ms | 43.6% |\n\nAt 16K tokens, a cache miss wastes nearly 400ms of GPU compute that a hit avoids entirely. As context windows keep growing, this gap widens.\n\n### Prefix sharing ratio\n\nThis is the percentage of tokens shared across requests. A RAG application where every request includes the same 4,000-token knowledge base has a high sharing ratio. 
A chat application where every conversation is unique has a low one.\n\n| Sharing Ratio | Round-Robin Hits | Prefix-Aware Hits | Improvement |\n|---|---|---|---|\n| 50% | ~11% | 91% | +80pp |\n| 70% | ~13% | 90% | +77pp |\n| 90% | ~12% | 97-98% | +85pp |\n\nEven at 50% sharing, where half the tokens are unique, prefix-aware routing still achieves 91% cache hits. A consistent hash fallback (deterministic routing based on prefix when no learned route exists yet) ensures that requests with the same prefix land on the same GPU even before the system has observed them.\n\n## The P99 Story\n\nCost isn’t just GPU-hours. It’s also the cost of slow responses.\n\nAt 30 concurrent users on CodeLlama 13B over 30 minutes of sustained load, round-robin routing produced a P99 TTFT of 6,800ms. That’s 6.8 seconds before the first token appears. For an interactive application like code completion or chat, that’s a broken experience. Users don’t wait 6.8 seconds.\n\nPrefix-aware routing brought that same P99 down to 1,000ms. Same hardware, same model, same concurrency. An 85.3% improvement on tail latency.\n\nWhy does the tail improve so much? Because tail latency in LLM serving is driven by cache misses under load. When the GPU is busy generating tokens for other requests, a new request that requires full prefill gets queued behind them. With round-robin, 87.5% of requests need full prefill, so the queue is always full of expensive work. With prefix-aware routing, 97.5% of requests skip prefill entirely, so the queue drains faster and the few remaining misses get processed sooner.\n\nThis is the strongest argument for KV cache locality. Throughput improvements look good on a dashboard. Tail latency is what users actually experience.\n\n## What Doesn’t Work\n\nPrefix-aware routing isn’t free, and it doesn’t help everywhere.\n\n**Small models (≤8B):** Inference is already fast enough that the routing\noverhead (~10ms for tokenization + tree lookup) approaches the prefill\nsavings. 
The net effect is roughly zero.\n\n**Short prefixes (<500 tokens):** The prefill cost for short sequences is\nsmall enough that cache misses don’t meaningfully hurt. The routing overhead\n(~3ms minimum) can exceed the savings.\n\n**Unique conversations:** If every request has a completely different prefix\n(no shared system prompt, no shared context), there’s nothing to cache. The\nrouting tree learns routes that are never reused.\n\n**Load imbalance:** Strict prefix affinity can create hot spots. If 80% of\nyour traffic shares the same system prompt, prefix-aware routing sends 80% of\ntraffic to one GPU. We handle this with a load-aware fallback that diverts\nrequests when a backend’s in-flight count exceeds twice the median. This\ntrades a cache miss for a balanced GPU, reducing P95 by 36% and P99 by 45%\ncompared to strict affinity. The cache hit rate drops about 5 points, which\nis the right trade.\n\n## Measuring Your Own Cache Locality\n\nBefore you change anything, measure your current cache hit rate. Most vLLM deployments expose this via Prometheus:\n\n- `vllm:gpu_prefix_cache_hit_rate` (or `vllm:gpu_prefix_cache_queries_total` and `_hits_total` on older versions; check your `/metrics` endpoint)\n- Compare TTFT distributions between requests with shared vs unique prefixes\n- Look at your P99/P50 ratio. A ratio above 5x suggests cache thrashing\n\nIf your cache hit rate is already above 80%, you’re either lucky or your traffic naturally clusters. If it’s below 30%, you’re leaving performance on the table.\n\nThe variables that matter most:\n\n- **How many GPUs are you routing across?** More GPUs = lower chance of random cache hits. 
With 8 GPUs, random routing gives ~12.5% hits.\n- **How long are your shared prefixes?** Longer = more wasted compute per miss.\n- **What’s your prefix sharing ratio?** Higher = more opportunity for reuse.\n- **What model size are you serving?** Larger = more expensive prefill per miss.\n\nIf you have many GPUs, long shared prefixes, high sharing ratios, and large models, you’re likely wasting 20-40% of your GPU compute on redundant prefill.\n\n## The Takeaway\n\nKV cache locality is not a tuning knob. It’s a multiplier on your existing hardware. The same GPUs, serving the same model, handling the same traffic, produce measurably different throughput and latency depending on one decision: which GPU gets which request.\n\nRound-robin doesn’t make that decision. Least-connections doesn’t make that\ndecision. They balance load without understanding what the load *is*. When\nevery request carries thousands of tokens that might already be cached\nsomewhere in your cluster, “balanced” and “efficient” are not the same thing.\n\nWe built [Ranvier](https://github.com/Ranvier-Systems/ranvier-core) to make\nthat decision. It routes requests to the GPU that already has their token\nprefix cached, using an adaptive radix tree that learns routes in real time.\nThe first post in this series covered\n[why your load balancer is wasting your GPUs](https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html).\nThis post covered what that waste costs. The next one will cover how we\ntokenize 50,000 requests per second without blocking the event loop.\n\n*All benchmarks run on 8x A100 GPUs (Lambda Labs), February 2026. Workloads\nuse the stress distribution (10% small, 20% medium, 30% large, 40% xlarge\nprefixes) with 90% prefix sharing ratio unless noted. 
Full methodology and raw\ndata available in the\nbenchmark guide.*\n\n*Ranvier is a project of Minds Aspire, LLC.*", "url": "https://wpnews.pro/news/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost", "canonical_source": "https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html", "published_at": "2026-04-30 00:00:00+00:00", "updated_at": "2026-05-26 01:09:27.355716+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "machine-learning", "artificial-intelligence", "ai-tools"], "entities": ["Llama 3.1", "vLLM"], "alternates": {"html": "https://wpnews.pro/news/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost", "markdown": "https://wpnews.pro/news/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.md", "text": "https://wpnews.pro/news/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.txt", "jsonld": "https://wpnews.pro/news/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.jsonld"}}