{"slug": "kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out", "title": "KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out", "summary": "An engineer provides a formula to estimate KV cache memory consumption for large language models, showing that the KV cache often becomes the bottleneck before model weights. For Llama 3.1 70B at 128K context, the KV cache requires 340GB, exceeding the 140GB for weights. The post details levers to reduce KV cache memory, including grouped-query attention, quantization, and sliding windows.", "body_md": "Every LLM inference engineer hits this wall eventually.\n\nYou deploy a model, it works in testing, then production traffic arrives. Suddenly your 80GB A100 is OOM on a 70B model that \"should fit.\"\n\nThe culprit is almost always the **KV Cache**. But most discussions stop at \"it caches the Key and Value matrices\" — which doesn't help you predict when you'll run out of memory.\n\nThis post gives you a quick estimator formula, explains when to worry, and shows which levers actually help.\n\nHere's the quick estimator:\n\nKV Cache Memory (bytes, per sequence) = 2 × (layers) × (hidden_dim) × (context_length) × (bytes_per_param)\n\nThe leading `2` is because you cache both K and V. Multiply by batch size for concurrent sequences; the result is in bytes, so convert to GB at the end.\n\nFor Llama 3.1 70B (80 layers, hidden_dim 8192, FP16), counting full multi-head attention before the GQA savings covered below:\n\nPer token: `2 × 80 × 8192 × 2 bytes = 2.6 MB`\n\nAt 8K context: `2.6 MB × 8192 = 21 GB`\n\nAt 128K context: `2.6 MB × 131072 = 340 GB` (doesn't fit on one A100)\n\nThat's right: the KV cache for a 70B model at 128K context requires 340GB of memory — more than the model weights themselves (140GB in FP16).\n\n**In most inference scenarios, the KV cache is the bottleneck, not the weights.**\n\nModel weights are static. You load them once, they sit in VRAM. 70B in FP16 = ~140GB. That's a known cost.\n\nKV Cache is dynamic. 
It grows linearly with batch size and context length.\n\nThe wall you'll hit first:\n\n| Scenario | Weights | KV Cache (8K) | KV Cache (128K) |\n|---|---|---|---|\n| 70B, batch=1, FP16 | 140GB | 21GB | 340GB — OOM |\n| 70B, batch=4, FP16 | 140GB | 84GB | 1.3TB — OOM |\n| 7B, batch=32, FP16 | 14GB | 9GB | 150GB — OOM |\n\nAt long contexts or high batch sizes, the KV cache dominates total memory — and it's the part that grows with traffic, not the part you can amortize.\n\n**If you're running Speculative Decoding** ([theory](https://dev.to/zxpmail/lossless-but-not-free-the-lossless-but-not-free-when-speculative-decoding-actually-pays-off-1c2g), [benchmarks](https://dev.to/zxpmail/i-benchmarked-speculative-decoding-a-35-wasnt-enough-1geb)), both the draft model and the target model maintain their own KV caches. For a 7B draft + 70B target pair, the draft adds roughly 10-15% more KV cache memory on top of the target's — a factor worth including in your estimate.\n\nThere are six levers, and they're not all created equal.\n\nGrouped-query attention (GQA) is the most impactful architectural fix. Instead of caching K and V for every attention head, share K and V across query heads.\n\nKV elements cached per token, per layer:\n\nMHA: `2 × hidden_dim`\n\nGQA: `2 × hidden_dim / group_size` (where `group_size = num_attn_heads / kv_heads`, e.g. 64/8 = 8)\n\nMQA: `2 × hidden_dim / num_attn_heads`\n\nIn practice: Llama 3.1 70B uses GQA with 8 key-value heads. That reduces the KV cache to about 1/8 of what MHA would require — roughly `2.6 MB per token` → `0.33 MB per token`.\n\n| Architecture | KV per token (70B, FP16, 8192 hidden, 64 attn heads, head_dim=128) |\n|---|---|\n| MHA (64 KV heads) | 2.6 MB |\n| GQA (8 KV heads) | 0.33 MB |\n| MQA (1 KV head) | 0.04 MB |\n\nGQA is close to a free lunch. It barely affects quality and cuts cache memory by 4-8×. If your model doesn't use it, consider switching to a variant that does.\n\nKV Cache is less sensitive to quantization than weights. 
You can usually go to FP8 without meaningful quality loss, and often to INT4 if you validate on your own data.\n\n| Precision | Bytes per param | KV cache for 7B, 8K, batch=16 |\n|---|---|---|\n| FP16 | 2 | 18 GB |\n| FP8 | 1 | 9 GB |\n| INT4 | 0.5 | 4.5 GB |\n\nKV cache quantization is supported by major inference frameworks such as TensorRT-LLM and vLLM. The quality impact is typically small because KV cache errors are per-token, not accumulated across tokens.\n\nWith a sliding window, you cache only the last N tokens instead of all positions. For models that use ALiBi or Rotary Position Embedding (RoPE) without a strict context limit, this can cap KV cache growth.\n\nThe tradeoff: the model loses access to tokens beyond the window. For tasks that need long-range dependencies (summarization, document QA), this degrades quality.\n\nFor conversational or streaming use cases, sliding window is a no-brainer. For RAG, it depends on where in the context the relevant information sits.\n\nvLLM's PagedAttention is about memory management, not cache reduction: it wastes less memory to fragmentation.\n\nTraditional inference allocates contiguous blocks per sequence. If a sequence has 512 tokens of cache and the allocator uses 1024-sized blocks, 50% is wasted.\n\nPagedAttention allocates in smaller (16-256 token) pages, reducing fragmentation from 30-50% down to 1-4%.\n\nNet effect: 30-50% effective memory gain on the same hardware, with no quality impact and no model changes.\n\nThis is why teams see such dramatic improvements switching to vLLM — it's not faster compute, it's better memory packing.\n\nReducing the maximum context length is the most brute-force lever, but sometimes the right one.\n\n| Max context | KV cache (7B, FP16, batch=16) |\n|---|---|\n| 2K | 2.3 GB |\n| 8K | 9 GB |\n| 32K | 36 GB |\n| 128K | 144 GB |\n\nIf 99% of your requests are under 4K tokens, don't support 128K context. 
Supporting a context length you don't use is burning VRAM for no reason.\n\nFrameworks like vLLM let you cap the context length: you can set `max_model_len` at engine startup to fit your workload rather than the model's theoretical maximum.\n\nSometimes the best optimization is admitting the model is too big for your use case.\n\nA 7B model with full 128K context costs more in KV cache than a 70B model with 2K context. If your task needs long context, a smaller model may use less total memory than a large model at the same context length.\n\nRunning out of KV cache memory? Here's the order to try:\n\n**1. Switch to vLLM.** ~30-50% effective memory gain. No model changes. Start here.\n\n**2. Quantize KV cache to FP8.** ~2× memory reduction. Minimal quality impact.\n\n**3. Check GQA groups.** If your model has full MHA, find a GQA variant. 4-8× reduction.\n\n**4. Implement sliding window or reduce max context.** Only if your workload allows it.\n\n**5. Quantize to INT4.** ~4× reduction from FP16. Test quality impact on your data first.\n\n**6. Reduce batch size.** Last resort. 
Hurts throughput.\n\n``` python\ndef kv_cache_memory(layers, hidden_dim, context_len, batch_size, kv_heads, num_attn_heads, bytes_per_param=2):\n    \"\"\"\n    Estimate KV cache memory in GB.\n\n    layers: number of transformer layers\n    hidden_dim: model hidden dimension\n    context_len: max context length in tokens\n    batch_size: number of concurrent sequences\n    kv_heads: number of KV heads (1 for MQA, n for GQA, num_attn_heads for MHA)\n    num_attn_heads: number of attention heads\n    bytes_per_param: 2 for FP16, 1 for FP8, 0.5 for INT4\n    \"\"\"\n    head_dim = hidden_dim // num_attn_heads\n    kv_per_position = 2 * layers * kv_heads * head_dim * bytes_per_param\n    total = kv_per_position * context_len * batch_size\n    return total / (1024**3)  # convert to GB\n\n# Example: Llama 3.1 70B, 8K context, batch=4, GQA-8\n# layers=80, hidden_dim=8192, attn_heads=64, kv_heads=8\nprint(f\"{kv_cache_memory(80, 8192, 8192, 4, 8, 64, 2):.1f} GB\")  # ~10.0 GB\n\n# Same model, MHA (kv_heads = attn_heads = 64)\nprint(f\"{kv_cache_memory(80, 8192, 8192, 4, 64, 64, 2):.1f} GB\")  # ~80.0 GB\n```\n\nRun it before you deploy. It's cheaper than an OOM at 3 AM.\n\nThe KV cache is the silent memory killer in LLM inference. Model weights get all the attention — they're static, visible, and easy to estimate. The KV cache is dynamic, grows with traffic, and often exceeds the weight memory at production batch sizes and context lengths.\n\nThe fix isn't one lever. It's knowing which lever to pull first.\n\nStart with memory management (vLLM). Then quantization (FP8). Then architecture (GQA). Then context limits. In that order. Most teams will run out of problems before they run out of levers.\n\nAnd if you're exploring Speculative Decoding — the acceleration technique comes with its own memory tax: both models need room for their KV caches. 
Make sure your estimate accounts for both.\n\n**KV cache memory estimation should be part of your pre-deployment checklist.** A few lines of Python will tell you if a 3 AM page is waiting for you.\n\n*June 2026. One formula, six levers, one decision tree. Estimate before you deploy — it's cheaper than an OOM at 3 AM.*", "url": "https://wpnews.pro/news/kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out", "canonical_source": "https://dev.to/zxpmail/kv-cache-is-eating-your-vram-heres-how-to-estimate-it-before-you-run-out-4oia", "published_at": "2026-06-28 23:06:31+00:00", "updated_at": "2026-06-28 23:57:06.675246+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-infrastructure", "developer-tools"], "entities": ["Llama 3.1 70B", "A100", "vLLM", "TensorRT-LLM", "AWQ", "GQA", "MHA", "MQA"], "alternates": {"html": "https://wpnews.pro/news/kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out", "markdown": "https://wpnews.pro/news/kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out.md", "text": "https://wpnews.pro/news/kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out.txt", "jsonld": "https://wpnews.pro/news/kv-cache-is-eating-your-vram-here-s-how-to-estimate-it-before-you-run-out.jsonld"}}