The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

wpnews.pro

Here's a moment every local-LLM owner hits eventually: you carefully pick a quant, the model loads with VRAM to spare — and then you paste in a long document or hit a few thousand tokens of chat history, and it crashes with an out-of-memory error. The weights didn't grow. So what filled your VRAM?

The answer is the KV cache, and it's the most under-explained number in local AI. It's the third thing competing for your memory — alongside model weights (which our quantization guide covers) and total parameters (our Mixture-of-Experts explainer). This piece completes that trilogy: what the KV cache is, why it explodes with context length, and the levers that let you fit more.

What the KV cache actually is

When a language model generates text, it processes tokens through attention layers. For every token, each layer computes three things: a query, a key, and a value. The trick of attention is that each new token needs to "look back" at the keys and values of every previous token.

Without a cache, the model would have to recompute the keys and values for the entire history on every single new token — quadratic, brutally slow work. So instead it does the obvious thing: it stores the keys and values once and reuses them. That store is the KV cache. It's pure speed optimization — and like most speed optimizations, you pay for it in memory.

The catch: the cache grows with every token in the context. A longer conversation, a bigger document, a fatter system prompt — each one adds to a pile of cached keys and values that has to live in fast memory right next to the model. And that pile gets big fast.
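To make the trade concrete, here is a minimal sketch of a single attention head in plain Python, decoding once with no cache and once with one. The tiny 2x2 weight matrices, the token vectors, and every function name are made up for illustration; a real model has many layers and heads and runs on tensors, not lists.

```python
import math

# Hypothetical 2x2 projection weights standing in for learned parameters.
WQ = [[0.5, -0.2], [0.1, 0.3]]
WK = [[0.4, 0.1], [-0.3, 0.2]]
WV = [[0.2, 0.6], [0.7, -0.1]]


def project(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Softmax-weighted average of values, scored by q against each key."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole history at every step."""
    outputs, positions_projected = [], 0
    for t in range(1, len(tokens) + 1):
        history = tokens[:t]
        keys = [project(x, WK) for x in history]
        values = [project(x, WV) for x in history]
        positions_projected += len(history)
        outputs.append(attend(project(history[-1], WQ), keys, values))
    return outputs, positions_projected


def decode_with_cache(tokens):
    """Project each token's key and value once, then keep them around."""
    outputs, positions_projected = [], 0
    cached_keys, cached_values = [], []  # this pair of lists is the KV cache
    for x in tokens:
        cached_keys.append(project(x, WK))
        cached_values.append(project(x, WV))
        positions_projected += 1
        outputs.append(attend(project(x, WQ), cached_keys, cached_values))
    return outputs, positions_projected


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_work = decode_without_cache(tokens)
fast_out, fast_work = decode_with_cache(tokens)
print(slow_work, fast_work)
```

For these four tokens the uncached path projects keys and values for 10 positions (1 + 2 + 3 + 4) and the cached path for 4, with the same outputs. The price is that `cached_keys` and `cached_values` get one entry longer with every token and never shrink.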

The math: why context eats VRAM

The size of the KV cache follows a simple formula:

KV bytes ≈ 2 × layers × kv-heads × head-dim × tokens × bytes-per-value

The 2 is for keys and values; everything else is the model's shape and how long your context is. Plug in real models (FP16, single user) and the numbers are startling:

Model	Per token	At 32k context	At 128k context
7B, old-style full attention	~0.5 MB	~16 GB	~64 GB
8B with GQA (Llama-3-style)	~0.13 MB	~4 GB	~16 GB
70B with GQA	~0.31 MB	~10 GB	~40 GB

Look at that first row. A 7B model with old-style full attention generates half a megabyte of cache per token. At 32k context that's ~16 GB, larger than the entire 4-bit quantized model itself (~4 GB). The thing you thought you were sizing for was the small part. This is the trap: people size their hardware for the weights and forget the cache, which at long context is often the bigger number.

It scales linearly and relentlessly: double the context, double the cache. This is why a model that loads happily at 4k context detonates at 64k.
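If you want to run the formula yourself, a small sketch follows. The layer counts, KV-head counts, and head dimensions are assumptions chosen to resemble typical models of each size (the article does not state which shapes it used), and the function name is ours.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV bytes = 2 (keys and values) x layers x kv-heads x head-dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed shapes: (layers, kv_heads, head_dim). Check your model's config for real values.
SHAPES = {
    "7B, old-style full attention": (32, 32, 128),
    "8B with GQA": (32, 8, 128),
    "70B with GQA": (80, 8, 128),
}

KIB = 1024
GIB = 1024 ** 3

for name, (layers, kv_heads, head_dim) in SHAPES.items():
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, tokens=1)
    at_32k = kv_cache_bytes(layers, kv_heads, head_dim, tokens=32 * 1024)
    at_128k = kv_cache_bytes(layers, kv_heads, head_dim, tokens=128 * 1024)
    print(f"{name}: {per_token // KIB} KiB/token, "
          f"{at_32k / GIB:.0f} GiB at 32k, {at_128k / GIB:.0f} GiB at 128k")
```

With FP16 (2 bytes per value) these assumed shapes land on the same figures as the table. Pass `bytes_per_value=1` to see what an 8-bit cache would cost instead.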

What owners actually run into

This isn't theoretical — it's one of the most common frustrations on r/LocalLLaMA. In a thread bluntly titled "My biggest issue with the Gemma-4 models is the massive KV cache," the owner (u/Iory1998) explains:

"I have 40 GB of VRAM and I still cannot fit the entire… Q8 (35 GB) [with full context]… if I have to run a Q4 with a Q8 KV cache, then I am better off just using [a smaller model]."— u/Iory1998, on a 35 GB model that won't fit in 40 GB once context is added

A commenter put the general rule even more plainly:

"Most inference providers are serving a lot more VRAM on KV than weights."— a commenter in the same thread

At production scale, with big batches and long contexts, the cache routinely dwarfs the model. The weights are a fixed cost; the KV cache is the variable one that blows your budget.

How models fight back: MQA, GQA, and paging

Because the KV cache is such a bottleneck, a lot of research has gone into shrinking it — and the wins are baked into the models you already run:

Multi-Query Attention (MQA). Shazeer's "Fast Transformer Decoding" (2019) had every attention head share one set of keys and values instead of each keeping its own. That alone can cut the cache by an order of magnitude, at a small quality cost.

Grouped-Query Attention (GQA). Ainslie et al.'s GQA (2023) is the middle ground: heads are split into a few groups that share keys/values. It keeps almost all of multi-head quality at close to MQA's memory. This is why modern models (Llama-3, Mistral, etc.) use it: it's the difference between the 16 GB and 4 GB rows in the table above.

PagedAttention. Kwon et al.'s vLLM paper (2023) noticed that naive KV cache allocation wastes huge amounts of memory to fragmentation. Borrowing virtual-memory paging from operating systems, it packs the cache efficiently, a big reason vLLM serves more concurrent users on the same card.
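The sketch below illustrates two of these ideas with plain arithmetic: how query heads map onto shared key/value heads, and how much memory a paged allocator reserves compared with setting aside a full context window per request. All function names are hypothetical, the block size of 16 tokens is an assumption, and "reserve the maximum context up front" is a simplified stand-in for naive allocation, not a description of any particular server.

```python
import math


def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared key/value head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def cache_fraction(n_query_heads, n_kv_heads):
    """Cache size relative to full multi-head attention."""
    return n_kv_heads / n_query_heads


def reserved_tokens_upfront(sequence_lengths, max_context):
    """Naive scheme: every request reserves room for the full context window."""
    return max_context * len(sequence_lengths)


def reserved_tokens_paged(sequence_lengths, block_size=16):
    """Paged scheme: each request holds only the fixed-size blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in sequence_lengths)


# 32 query heads: full multi-head (32 KV heads), GQA (8), MQA (1).
for n_kv in (32, 8, 1):
    print(n_kv, "KV heads ->", cache_fraction(32, n_kv), "of the full cache")

# Three concurrent requests of very different lengths, 4096-token window.
lengths = [100, 900, 2500]
print(sum(lengths), reserved_tokens_paged(lengths), reserved_tokens_upfront(lengths, 4096))
```

In this example the three requests actually use 3,500 token slots. The paged scheme reserves 3,536, while reserving a full window for each would hold 12,288, most of it empty.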

The cutting edge: Multi-head Latent Attention

GQA shrinks the cache by sharing key/value heads; the newest idea changes what gets stored at all. Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 (2024), compresses the keys and values into a small shared latent vector via a learned low-rank projection, then expands them back on the fly. Instead of caching full-size keys and values for every head, the model caches a compact latent and reconstructs what it needs.

The payoff is dramatic: DeepSeek reports MLA cuts the KV cache by ~93% versus their earlier dense model while improving quality, and lifts maximum generation throughput nearly 6×. It's a major reason the DeepSeek family (a Mixture-of-Experts line) serves enormous contexts at usable speed. You don't pick MLA directly — you get it by choosing a model built on it — but it explains why two models of the same parameter count can have wildly different long-context memory appetites. Increasingly, how a model handles its KV cache is as important a spec as how many parameters it has.
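A rough way to see why caching a latent is cheaper is to count the values stored per token in one layer. This is only a sketch: the dimensions are illustrative and are not DeepSeek-V2's configuration, so the percentage it prints is not the paper's ~93% figure, which compares two whole models. The function names and the `positional_dim` parameter are our own.

```python
def values_per_token_full(n_kv_heads, head_dim):
    """Standard attention caches a key and a value vector for every KV head."""
    return 2 * n_kv_heads * head_dim


def values_per_token_latent(latent_dim, positional_dim=0):
    """Latent caching stores one compressed vector per token.

    positional_dim is a hypothetical allowance for any small extra vector
    cached alongside the latent; it defaults to 0 here.
    """
    return latent_dim + positional_dim


# Illustrative sizes only.
full = values_per_token_full(n_kv_heads=16, head_dim=128)
latent = values_per_token_latent(latent_dim=512)
reduction = 1 - latent / full
print(f"{full} values vs {latent} values per token per layer ({reduction:.1%} smaller)")
```

The keys and values still have to be reconstructed from the latent at attention time, so the saving in memory is paid for with extra compute in the learned up-projection.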

KV cache quantization: the lever you control

Architecture is fixed once you pick a model — but there's one knob you turn at runtime: quantizing the cache itself. Just like weights, the KV cache is FP16 by default and can be stored at lower precision:

KV precision	Relative size	Verdict
F16 (default)	100%	Reference quality
Q8	~50%	Near-lossless, safe default
Q5 / Q6	~31–37%	Community sweet spot
Q4	~25%	Cuts it to a quarter, but quality starts to slip

Halving the cache (F16→Q8) can be the difference between fitting 32k and 64k context on the same card. The research shows you can push even further with the right techniques: KIVI (2024) reaches 2-bit by quantizing keys per-channel and values per-token, cutting peak memory ~2.6× and enabling up to 4× larger batches; KVQuant (2024) hits 8× compression and demonstrates a 7B model at a million-token context on a single A100.
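To get a feel for why precision matters, here is a toy quantizer you can run. It is a generic symmetric "absmax" scheme applied to one small group of numbers, written for illustration only; it is not the cache format of any particular runtime, and it is not the KIVI or KVQuant method. The sample values are made up.

```python
def quantize(values, bits):
    """Map floats to signed integer codes using one shared scale for the group."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average round-trip error after quantizing and dequantizing."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Made-up stand-in for a slice of cached key or value entries.
sample = [0.03, -0.41, 0.87, -0.12, 0.56, -0.95, 0.22, 0.70]

for bits in (8, 6, 4):
    print(f"{bits}-bit: mean abs error {mean_abs_error(sample, bits):.4f}")
```

On this sample the round-trip error grows by roughly 4x with each step down from 8 to 6 to 4 bits. Real schemes limit the damage with smarter grouping, which is the kind of refinement the research above is about.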

But naive low-bit KV quant has a sharp edge that owners learn the hard way. The community consensus tracks the research: Q5/Q6 are nearly free, and quality falls off below that. As one r/LocalLLaMA commenter summarized:

"Try Q6, it's still basically lossless. Same deal with Q5. It's usually below Q5 where the difference is [noticeable]."— r/LocalLLaMA

This is also why agents can mysteriously "get dumber" in long sessions: aggressive KV quant degrades exactly when the context is full and the model needs to track the most detail. If your coding assistant turns sloppy past 30k tokens, suspect your KV cache precision before you blame the model.

One more gotcha: the prefill spike

The cache doesn't only grow as the model talks — it fills all at once when the model reads. Paste a 50k-token document and the model must process every token to build its keys and values before writing a single word of reply. That prompt-processing (or "prefill") phase materializes a large slice of the KV cache immediately, which is why a long input can run you out of memory instantly rather than creeping up over a conversation. It's also why prompt processing is compute-bound while generation is memory-bandwidth-bound — the two halves of inference stress different parts of your hardware, and research like SARATHI (2023) exists specifically to interleave them for better utilization. The practical takeaway: when you're near your memory ceiling, a long paste is riskier than a long chat, because the whole cost arrives at once.

The cheat sheet: how to fit more context

Goal	Lever
Smaller cache, zero effort	Pick a GQA model (most modern ones)
Roughly halve the cache	Q8 KV cache (near-lossless)
Squeeze a bit more	Q5/Q6 KV — the community sweet spot
Serve many users	vLLM / PagedAttention
Last resort	Lower your context length

What this means for buying hardware

The KV cache reframes how you size a machine. The headline rule from our other guides — "buy memory for the model's total size" — is only half the story. The real budget is:

memory ≈ weights + KV cache (which scales with your context)

If you run short prompts, the weights dominate and you can size tight. But if you want long context — big codebases, long documents, persistent agents — you need real headroom on top of the model, sometimes tens of gigabytes of it. A 70B at 128k context wants ~40 GB of KV cache alone, on top of the ~40 GB of 4-bit weights. This is the quiet case for large-unified-memory boxes (see our Unified-Memory AI guides). Their value isn't only holding a big model — it's holding the model and a long-context KV cache without falling off a cliff. When you compare a 24 GB GPU to a 128 GB unified box, the GPU may run the same model, but the unified box runs it at the context length you actually want.
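The budget can be turned around to ask how much context a given machine leaves room for. The sketch below does that under simple assumptions: the function name is ours, the `overhead_gib` allowance for runtime buffers is a hypothetical placeholder, the ~5 GB weight size for a 4-bit 8B model is an assumed round number, and the per-token costs reuse FP16 figures like those in the table earlier.

```python
GIB = 1024 ** 3


def max_context_tokens(total_memory_gib, weights_gib, kv_bytes_per_token, overhead_gib=1.0):
    """Tokens of KV cache that fit after weights and a fixed overhead allowance."""
    free_bytes = (total_memory_gib - weights_gib - overhead_gib) * GIB
    return max(0, int(free_bytes // kv_bytes_per_token))


# Assumed per-token FP16 cache costs in bytes.
KV_8B_GQA = 128 * 1024   # ~0.13 MB per token
KV_70B_GQA = 320 * 1024  # ~0.31 MB per token

print("8B, 4-bit weights (~5 GB assumed) on a 24 GB GPU:",
      max_context_tokens(24, 5, KV_8B_GQA))
print("70B, 4-bit weights (~40 GB) on a 24 GB GPU:",
      max_context_tokens(24, 40, KV_70B_GQA))
print("70B, 4-bit weights (~40 GB) on a 128 GB unified box:",
      max_context_tokens(128, 40, KV_70B_GQA))
```

Treat the results as upper bounds. Real runtimes need working buffers that vary with batch size and prompt length, and a model's own maximum context may be lower than what the memory allows.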

Sources & how we researched this

This explainer synthesizes the primary literature on attention memory: Multi-Query Attention (Shazeer, 2019), Grouped-Query Attention (Ainslie et al., 2023), PagedAttention/vLLM (Kwon et al., 2023), and KV-cache quantization in KIVI and KVQuant (2024). The mechanisms and compression figures come from those papers. The real-world frustrations and the Q5/Q6 sweet-spot guidance are owner reports from r/LocalLLaMA, linked so you can verify; we have not benchmarked these machines first-hand. The per-token figures are approximations from the cache-size formula, rounded for clarity; real usage varies with implementation overhead and attention type.

GGUF vs GPTQ vs AWQ: the plain-English guide to quantization (the weights)
Mixture-of-Experts, explained (total vs active parameters)
How much VRAM do you actually need to run a 70B model locally?

source & further reading

vettedconsumer.com (original article)
The Local-LLM Hardware Cheat-Sheet: Which Box Runs Which Model
The Used RTX 3090 in 2026: Why a Five-Year-Old GPU Is Still Local AI's Best Deal
Show HN: Quant Picker – which GGUF file fits your model and machine