Source: dev.to · Topic: large-language-models

KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

An engineer provides a formula to estimate KV cache memory consumption for large language models, showing that the KV cache often becomes the bottleneck before model weights. For a Llama 3.1 70B-sized model at 128K context, the quick formula (which ignores grouped-query attention) gives about 340GB of KV cache, exceeding the 140GB of FP16 weights; with the model's 8 KV heads the figure is roughly one-eighth of that. The post details levers to reduce KV cache memory, including grouped-query attention, quantization, and sliding windows.

7 min read · Published Jun 28, 2026

Every LLM inference engineer hits this wall eventually.

You deployed a model, it works in testing, then production traffic arrives. Suddenly your 80GB A100 is OOM on a 70B model that "should fit."

The culprit is almost always the KV Cache. But most discussions stop at "it caches the Key and Value matrices" — which doesn't help you predict when you'll run out of memory.

This post gives you a quick estimator formula, explains when to worry, and what levers actually help.

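Here's the quick estimator:

KV cache memory (bytes) = 2 × layers × hidden_dim × context_length × batch_size × bytes_per_value

The leading 2 is because you cache both K and V. Divide by 1e9 to get GB. Using hidden_dim as the cache width assumes every attention head keeps its own K and V (multi-head attention), so for models that share KV heads the result is an upper bound.

For Llama 3.1 70B (80 layers, hidden_dim 8192, FP16), one sequence:

Per token: 2 × 80 × 8192 × 2 bytes = 2.6 MB

At 8K context: 2.6 MB × 8192 = 21 GB

At 128K context: 2.6 MB × 131072 = 340 GB (doesn't fit on one A100)

That's right: by this estimate, the KV cache for a 70B model at 128K context requires 340GB of memory, more than the model weights themselves (140GB in FP16). Llama 3.1 70B actually uses grouped-query attention with 8 KV heads, which cuts these figures to about one-eighth (roughly 43GB at 128K).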
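The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch of the quick estimator, not a measurement: the helper name `kv_bytes_per_token` and the decimal-GB convention (1e9 bytes) are assumptions made here, and passing `hidden_dim` as the cache width assumes one KV head per attention head.

```python
def kv_bytes_per_token(layers, kv_width, bytes_per_value=2):
    """Bytes of K and V cached for one token, summed over all layers."""
    return 2 * layers * kv_width * bytes_per_value  # the 2 covers K and V


LAYERS, HIDDEN_DIM = 80, 8192  # shape quoted in the text for Llama 3.1 70B

per_token = kv_bytes_per_token(LAYERS, HIDDEN_DIM)
print(f"per token: {per_token / 1e6:.2f} MB")

for context_len in (8192, 131072):
    total_gb = per_token * context_len / 1e9  # one sequence, decimal GB
    print(f"{context_len} tokens: {total_gb:.1f} GB")
```

The exact values are 21.5 GB and 343.6 GB, which the text rounds to 21 GB and 340 GB.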

In most inference scenarios, the KV cache is the bottleneck, not the weights.

Model weights are static. You load them once, they sit in VRAM. 70B in FP16 = ~140GB. That's a known cost.

KV Cache is dynamic. It grows linearly with context length and with batch size (the number of concurrent sequences).

The wall you'll hit first:

| Scenario | Weights | KV Cache (8K) | KV Cache (128K) |
|---|---|---|---|
| 70B, batch=1, FP16 | 140GB | 21GB | 340GB (OOM) |
| 70B, batch=4, FP16 | 140GB | 84GB | 1.3TB (OOM) |
| 7B, batch=32, 8K, FP16 | 14GB | 9GB | 150GB (OOM) |
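As a sketch of how the 70B rows scale with batch size, the quick formula can be evaluated directly. The helper name `kv_cache_gb` is an assumption introduced here, the shape (80 layers, hidden_dim 8192) is the one quoted earlier for the 70B model, and no KV-head sharing is assumed, so these are upper bounds.

```python
def kv_cache_gb(layers, hidden_dim, context_len, batch_size, bytes_per_value=2):
    """Quick-formula KV cache size in decimal GB (no KV-head sharing assumed)."""
    total_bytes = 2 * layers * hidden_dim * context_len * batch_size * bytes_per_value
    return total_bytes / 1e9


WEIGHTS_GB = 140  # 70B parameters in FP16, as stated in the text

for batch_size in (1, 4):
    for context_len in (8192, 131072):
        cache_gb = kv_cache_gb(80, 8192, context_len, batch_size)
        print(
            f"batch={batch_size} context={context_len}: "
            f"cache {cache_gb:.1f} GB, weights + cache {WEIGHTS_GB + cache_gb:.1f} GB"
        )
```

The exact products land slightly above the rounded table entries (85.9 GB rather than 84 GB for batch=4 at 8K) because the table multiplies the already-rounded 21 GB.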

At long contexts or high batch sizes, the KV cache dominates total memory — and it's the part that grows with traffic, not the part you can amortize.

If you're running speculative decoding, both the draft model and the target model maintain their own KV caches. For a 7B draft + 70B target pair, the draft adds roughly 10-15% more KV cache memory on top of the target's (the exact share depends on both models' layer counts and KV head counts), a factor worth including in your estimate.
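A rough way to budget for the draft model is to compare per-token cache sizes. The shapes below are assumptions for illustration (a 70B-like target with 80 layers and KV width 8192, a hypothetical 7B-like draft with 32 layers and KV width 4096), and `kv_bytes_per_token` is a helper name introduced here. With these particular shapes the ratio comes out near 20%, above the 10-15% quoted, which shows how much the figure moves with the configs you plug in.

```python
def kv_bytes_per_token(layers, kv_width, bytes_per_value=2):
    """Bytes of K and V cached for one token, summed over all layers."""
    return 2 * layers * kv_width * bytes_per_value


# Assumed shapes; replace with the real configs of your draft and target models.
target = kv_bytes_per_token(layers=80, kv_width=8192)  # 70B-like target
draft = kv_bytes_per_token(layers=32, kv_width=4096)   # hypothetical 7B-like draft

overhead = draft / target
combined = draft + target  # both caches hold the same tokens

print(f"draft adds {overhead:.0%} on top of the target's KV cache")
print(f"combined per-token cache: {combined / 1e6:.2f} MB")
```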

There are six levers, and they're not all created equal.

Grouped-query attention (GQA) is the most impactful architectural fix. Instead of caching K and V for every attention head, share K and V across groups of query heads.

Values cached per token, per layer:

MHA (multi-head attention): 2 × hidden_dim

GQA (grouped-query attention): 2 × hidden_dim / group_size (where group_size = num_attn_heads / kv_heads, e.g. 64/8 = 8)

MQA (multi-query attention): 2 × hidden_dim / num_attn_heads

In practice: Llama 3.1 70B uses GQA with 8 key-value heads. That reduces the KV cache to about 1/8 of what MHA would require: roughly 2.6 MB per token drops to 0.33 MB per token.

KV cache per token (70B, FP16, hidden_dim 8192, 64 attention heads, head_dim=128):

| Architecture | KV per token |
|---|---|
| MHA (64 KV heads) | 2.6 MB |
| GQA (8 KV heads) | 0.33 MB |
| MQA (1 KV head) | 0.04 MB |

GQA is close to a free lunch: it barely affects quality and cuts cache memory by 4-8×. It is fixed when the model is trained, so if your model uses full MHA, consider switching to a variant that ships with GQA.
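The per-token figures for the three attention layouts follow from replacing `hidden_dim` with `kv_heads × head_dim`. A minimal sketch, using the 70B shape quoted above (80 layers, head_dim 128); the helper name is an assumption introduced here:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V cached per token when only kv_heads heads are stored."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


LAYERS, HEAD_DIM = 80, 128  # 8192 hidden_dim / 64 attention heads

per_token = {
    "MHA (64 KV heads)": kv_bytes_per_token(LAYERS, 64, HEAD_DIM),
    "GQA (8 KV heads)": kv_bytes_per_token(LAYERS, 8, HEAD_DIM),
    "MQA (1 KV head)": kv_bytes_per_token(LAYERS, 1, HEAD_DIM),
}

for name, size in per_token.items():
    print(f"{name}: {size / 1e6:.2f} MB per token")
```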

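The next lever is quantizing the cache itself. The KV cache is often less sensitive to quantization than weights. You can usually go to FP8 without meaningful quality loss; INT4 is more aggressive and should be tested on your own data.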

| Precision | Bytes per param | KV cache for 7B, 8K, batch=16 |
|---|---|---|
| FP16 | 2 | 18 GB |
| FP8 | 1 | 9 GB |
| INT4 | 0.5 | 4.5 GB |

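FP8 KV cache quantization is supported by major inference frameworks such as TensorRT-LLM and vLLM; support for lower-bit formats varies by framework and version. The quality impact is usually small, but quantization error in a cached key or value affects every later token that attends to it, so validate on your own data.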
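To make the storage trade-off concrete, here is a toy symmetric INT4 quantizer in plain Python. It is a sketch of the general idea only, not how any particular framework implements KV cache quantization: the function names, the per-vector scale, and the sample values are all made up for illustration.

```python
def quantize_int4(values):
    """Symmetric per-vector quantization to the INT4 range [-8, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


key_vector = [0.12, -0.80, 0.45, 0.03, -0.27, 0.66, -0.51, 0.09]  # made-up values

codes, scale = quantize_int4(key_vector)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

fp16_bytes = len(key_vector) * 2       # 2 bytes per value
int4_bytes = len(key_vector) // 2 + 2  # two codes per byte, plus one FP16 scale

print(f"max round-trip error: {max_error:.4f} (half a step is {scale / 2:.4f})")
print(f"storage: {fp16_bytes} bytes in FP16, {int4_bytes} bytes in INT4")
```

The scale factor is why real INT4 savings fall slightly short of the ideal 4×: each group of values carries a small amount of metadata.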

Sliding-window attention caches only the last N tokens instead of all positions. For models that use ALiBi or Rotary Position Encoding without a strict context limit, this can cap KV cache growth.

The tradeoff: the model loses access to tokens beyond the window. For tasks that need long-range dependencies (summarization, document QA), this degrades quality.

For conversational or streaming use cases, sliding window is a no-brainer. For RAG, it depends on where in the context the relevant information sits.
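The bookkeeping behind a sliding window can be sketched with a bounded deque: once the window is full, each new token evicts the oldest one, so cache size stops growing. The class name and the string placeholders standing in for K and V tensors are illustrative assumptions, not part of any framework.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K and V entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically when full
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for position in range(10):
    # Placeholders stand in for the real per-token K and V tensors.
    cache.append(f"k{position}", f"v{position}")

print(len(cache), list(cache.keys))
```

After ten tokens the cache still holds four entries, the ones for positions 6 through 9; everything earlier is no longer visible to attention.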

PagedAttention, the technique behind vLLM, improves memory management rather than reducing the cache itself: it fragments less.

Traditional inference allocates contiguous blocks per sequence. If a sequence has 512 tokens of cache and the allocator uses 1024-sized blocks, 50% is wasted.

PagedAttention allocates in smaller (16-256 token) pages, reducing fragmentation from 30-50% down to 1-4%.

Net effect: 30-50% effective memory gain on the same hardware, with no quality impact and no model changes.

This is why teams see such dramatic improvements switching to vLLM — it's not faster compute, it's better memory packing.
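The fragmentation argument can be checked with a simplified model that counts unused token slots when a sequence is rounded up to whole blocks. This is only a sketch of internal fragmentation under assumed block sizes; `wasted_slots` is a hypothetical helper and does not reflect vLLM's actual allocator.

```python
import math


def wasted_slots(seq_len, block_size):
    """Token slots allocated but unused when rounding up to whole blocks."""
    blocks = math.ceil(seq_len / block_size)
    return blocks * block_size - seq_len


# The example from the text: 512 cached tokens in a 1024-token block.
coarse = wasted_slots(512, 1024)
print(f"1024-token block, 512 tokens: {coarse} slots wasted ({coarse / 1024:.0%})")

# A 500-token sequence under coarse and page-sized allocation.
for block_size in (1024, 16):
    waste = wasted_slots(500, block_size)
    allocated = 500 + waste
    print(f"block={block_size}: {waste} of {allocated} slots wasted ({waste / allocated:.1%})")
```

With 16-token pages the 500-token sequence wastes 12 of 512 slots, about 2%, compared with more than half of the 1024-token block.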

Reducing the maximum context length is the most brute-force lever, but sometimes the right one.

| Max context | KV cache (7B, FP16, batch=16) |
|---|---|
| 2K | 2.3 GB |
| 8K | 9 GB |
| 32K | 36 GB |
| 128K | 144 GB |

If 99% of your requests are under 4K tokens, don't support 128K context. Supporting a context length you don't use is burning VRAM for no reason.

vLLM lets you cap the context length the engine will serve: set max_model_len to fit your workload rather than the model's theoretical maximum.
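One way to pick a context cap is to ask how many worst-case sequences fit in the memory left for the cache. The sketch below assumes a made-up 40 GB cache budget and the Llama 3.1 70B per-token size with 8 KV heads (80 layers × 8 heads × head_dim 128, FP16); `max_concurrent_sequences` is a hypothetical helper, and paged allocators only reserve what sequences actually use, so treat the result as a conservative bound.

```python
def max_concurrent_sequences(budget_gb, bytes_per_token, max_context):
    """Worst-case number of full-length sequences that fit in the cache budget."""
    budget_bytes = int(budget_gb * 1e9)
    return budget_bytes // (bytes_per_token * max_context)


BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # K and V, 80 layers, 8 KV heads, head_dim 128, FP16
KV_BUDGET_GB = 40                       # assumed memory left over after weights

fits = {
    max_context: max_concurrent_sequences(KV_BUDGET_GB, BYTES_PER_TOKEN, max_context)
    for max_context in (131072, 8192, 4096)
}

for max_context, count in fits.items():
    print(f"max context {max_context}: {count} full-length sequences")
```

Under these assumptions a 128K cap leaves no room for even one full-length sequence, while a 4K cap fits 29.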

Sometimes the best optimization is admitting the model is too big for your use case.

A 7B model with full 128K context costs more in KV cache than a 70B model with 2K context. But if your task genuinely needs long context, a smaller model can serve that context in less total memory (weights plus cache) than a larger model at the same context length.
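The comparison can be worked through with the quick formula. The 7B shape (32 layers, hidden_dim 4096) is an assumption typical of models in that class, the 70B shape is the one quoted earlier, and neither calculation accounts for KV-head sharing; `memory_gb` is a helper name introduced here.

```python
def memory_gb(params_billion, layers, hidden_dim, context_len, bytes_per_value=2):
    """Return (weights_gb, cache_gb) for one sequence, in decimal GB."""
    weights_gb = params_billion * bytes_per_value  # 1e9 params × bytes / 1e9
    cache_gb = 2 * layers * hidden_dim * context_len * bytes_per_value / 1e9
    return weights_gb, cache_gb


small_long = memory_gb(7, 32, 4096, 131072)    # assumed 7B shape at 128K
large_short = memory_gb(70, 80, 8192, 2048)    # 70B shape at 2K
large_long = memory_gb(70, 80, 8192, 131072)   # 70B shape at 128K

for label, (weights, cache) in [
    ("7B @ 128K", small_long),
    ("70B @ 2K", large_short),
    ("70B @ 128K", large_long),
]:
    print(f"{label}: weights {weights} GB + cache {cache:.1f} GB = {weights + cache:.1f} GB")
```

Under these assumptions the 7B model's cache at 128K (about 69 GB) dwarfs the 70B model's cache at 2K (about 5 GB), yet at the same 128K context the 7B model needs far less total memory than the 70B one.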

Running out of KV cache memory? Here's the order to try:

1. Switch to vLLM. ~30-50% effective memory gain. No model changes. Start here.

2. Quantize KV cache to FP8. ~2× memory reduction. Minimal quality impact.

3. Check GQA groups. If your model has full MHA, find a GQA variant. 4-8× reduction.

4. Implement sliding window or reduce max context. Only if your workload allows it.

5. Quantize to INT4. ~4× reduction from FP16. Test quality impact on your data first.

6. Reduce batch size. Last resort. Hurts throughput.

def kv_cache_memory(layers, hidden_dim, context_len, batch_size, kv_heads, num_attn_heads, bytes_per_param=2):
    """
    Estimate KV cache memory in GiB (1 GiB = 1024**3 bytes).

    layers: number of transformer layers
    hidden_dim: model hidden dimension
    context_len: max context length in tokens
    batch_size: number of concurrent sequences
    kv_heads: number of KV heads (1 for MQA, n for GQA, num_attn_heads for MHA)
    num_attn_heads: number of attention heads
    bytes_per_param: 2 for FP16, 1 for FP8, 0.5 for INT4
    """
    head_dim = hidden_dim // num_attn_heads
    # K and V (the factor 2) for every layer and KV head, per cached token
    kv_per_position = 2 * layers * kv_heads * head_dim * bytes_per_param
    total = kv_per_position * context_len * batch_size
    return total / (1024**3)  # bytes -> GiB

# 70B shape (80 layers, hidden_dim 8192), 8K context, batch=4, GQA with 8 KV heads
print(f"{kv_cache_memory(80, 8192, 8192, 4, 8, 64, 2):.1f} GB")  # 10.0

# Same shape and workload with full MHA (64 KV heads)
print(f"{kv_cache_memory(80, 8192, 8192, 4, 64, 64, 2):.1f} GB")  # 80.0

Run it before you deploy. It's cheaper than an OOM at 3 AM.

The KV cache is the silent memory killer in LLM inference. Model weights get all the attention — they're static, visible, and easy to estimate. The KV cache is dynamic, grows with traffic, and often exceeds the weight memory at production batch sizes and context lengths.

The fix isn't one lever. It's knowing which lever to pull first.

Start with memory management (vLLM). Then quantization (FP8). Then architecture (GQA). Then context limits. In that order. Most teams will run out of problems before they run out of levers.

And if you're exploring Speculative Decoding — the acceleration technique comes with its own memory tax: both models need room for their KV caches. Make sure your estimate accounts for both.

KV cache memory estimation should be part of your pre-deployment checklist. A few lines of Python will tell you if a 3 AM page is waiting for you.

June 2026. One formula, six levers, one decision tree. Estimate before you deploy: it's cheaper than an OOM at 3 AM.
