{"slug": "the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more", "title": "The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)", "summary": "The KV cache, a memory store for attention keys and values, grows linearly with context length and can exceed model weights in VRAM usage, causing out-of-memory errors for local LLM users. At 32k context, a 7B model's cache can reach ~16 GB, larger than the quantized model itself. Techniques like Grouped Query Attention (GQA) reduce cache size, but long contexts remain a major bottleneck.", "body_md": "Here's a moment every local-LLM owner hits eventually: you carefully pick a quant, the model loads with VRAM to spare — and then you paste in a long document or hit a few thousand tokens of chat history, and it crashes with an out-of-memory error. The weights didn't grow. So what filled your VRAM?\n\nThe answer is the **KV cache**, and it's the most under-explained number in local AI. It's the third thing competing for your memory — alongside model weights (which our [quantization guide](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/) covers) and total parameters (our [Mixture-of-Experts explainer](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/)). This piece completes that trilogy: what the KV cache is, why it explodes with context length, and the levers that let you fit more.\n\n## What the KV cache actually is\n\nWhen a language model generates text, it processes tokens through attention layers. For every token, each layer computes three things: a **query**, a **key**, and a **value**. 
The trick of attention is that each new token needs to \"look back\" at the keys and values of *every previous token*.\n\nWithout a cache, the model would have to recompute the keys and values for the entire history on every single new token — quadratic, brutally slow work. So instead it does the obvious thing: it **stores the keys and values once and reuses them**. That store is the KV cache. It's a pure speed optimization — and like most speed optimizations, you pay for it in memory.\n\nThe catch: the cache grows with **every token in the context**. A longer conversation, a bigger document, a fatter system prompt — each one adds to a pile of cached keys and values that has to live in fast memory right next to the model. And that pile gets big fast.\n\n## The math: why context eats VRAM\n\nThe size of the KV cache follows a simple formula:\n\n**KV bytes ≈ 2 × layers × kv-heads × head-dim × tokens × bytes-per-value**\n\nThe **2** is for keys *and* values; everything else is the model's shape and how long your context is. Plug in real models (FP16, single user) and the numbers are startling:\n\n| Model | Per token | At 32k context | At 128k context |\n|---|---|---|---|\n| 7B, old-style full attention | ~0.5 MB | ~16 GB | ~64 GB |\n| 8B with GQA (Llama-3-style) | ~0.13 MB | ~4 GB | ~16 GB |\n| 70B with GQA | ~0.31 MB | ~10 GB | ~40 GB |\n\nLook at that first row. Assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head-dim 128, 2 bytes per FP16 value), the formula gives 2 × 32 × 32 × 128 × 2 = 524,288 bytes per token. In other words, a 7B model with old-style full attention generates **half a megabyte of cache per token**. At 32k context that's ~16 GB — *larger than the entire 4-bit quantized model itself* (~4 GB). The thing you thought you were loading was the small part. This is the trap: people size their hardware for the weights and forget the cache, which at long context is often the bigger number.\n\nIt scales linearly and relentlessly: double the context, double the cache. 
This is why a model that loads happily at 4k context detonates at 64k.\n\n## What owners actually run into\n\nThis isn't theoretical — it's one of the most common frustrations on r/LocalLLaMA. In a thread bluntly titled [\"My biggest issue with the Gemma-4 models is the massive KV cache,\"](https://redlib.catsarch.com/r/LocalLLaMA/comments/1sbe40t/my_biggest_issue_with_the_gemma4_models_is_the/?ref=vettedconsumer.com) the owner (u/Iory1998) explains:\n\n\"I have 40 GB of VRAM and I still cannot fit the entire… Q8 (35 GB) [with full context]… if I have to run a Q4 with a Q8 KV cache, then I am better off just using [a smaller model].\" — u/Iory1998, on a 35 GB model that won't fit in 40 GB once context is added\n\nA commenter put the general rule even more plainly:\n\n\"Most inference providers are serving a lot more VRAM on KV than weights.\" — a commenter in the same thread\n\nAt production scale, with big batches and long contexts, the cache routinely dwarfs the model. The weights are a fixed cost; the KV cache is the variable one that blows your budget.\n\n## How models fight back: MQA, GQA, and paging\n\nBecause the KV cache is such a bottleneck, a lot of research has gone into shrinking it — and the wins are baked into the models you already run:\n\n- **Multi-Query Attention (MQA).** Shazeer's [\"Fast Transformer Decoding\"](https://arxiv.org/abs/1911.02150?ref=vettedconsumer.com) (2019) had every attention head *share* one set of keys and values instead of each keeping its own. That alone can cut the cache by an order of magnitude, at a small quality cost.\n- **Grouped-Query Attention (GQA).** Ainslie et al.'s [GQA](https://arxiv.org/abs/2305.13245?ref=vettedconsumer.com) (2023) is the middle ground: heads are split into a few groups that share keys/values. It keeps almost all of multi-head quality at close to MQA's memory. This is why modern models (Llama-3, Mistral, etc.) 
use it — it's the difference between the 16 GB and 4 GB rows in the table above.\n- **PagedAttention.** Kwon et al.'s [vLLM paper](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com) (2023) noticed that naive KV cache allocation wastes huge amounts of memory to fragmentation. Borrowing virtual-memory paging from operating systems, it packs the cache efficiently — a big reason vLLM serves more concurrent users on the same card.\n\n## The cutting edge: Multi-head Latent Attention\n\nGQA shrinks the cache by sharing key/value heads; the newest idea changes what gets stored at all. **Multi-head Latent Attention (MLA)**, introduced in [DeepSeek-V2](https://arxiv.org/abs/2405.04434?ref=vettedconsumer.com) (2024), compresses the keys and values into a small shared *latent vector* via a learned low-rank projection, then expands them back on the fly. Instead of caching full-size keys and values for every head, the model caches a compact latent and reconstructs what it needs.\n\nThe payoff is dramatic: DeepSeek reports MLA cuts the KV cache by **~93%** versus their earlier dense model *while improving* quality, and lifts maximum generation throughput nearly 6×. It's a major reason the DeepSeek family (a [Mixture-of-Experts](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) line) serves enormous contexts at usable speed. You don't pick MLA directly — you get it by choosing a model built on it — but it explains why two models of the *same parameter count* can have wildly different long-context memory appetites. Increasingly, how a model handles its KV cache is as important a spec as how many parameters it has.\n\n## KV cache quantization: the lever you control\n\nArchitecture is fixed once you pick a model — but there's one knob *you* turn at runtime: quantizing the cache itself. 
Just like weights, the KV cache is FP16 by default and can be stored at lower precision:\n\n| KV precision | Relative size | Verdict |\n|---|---|---|\n| F16 (default) | 100% | Reference quality |\n| Q8 | ~50% | Near-lossless, safe default |\n| Q5 / Q6 | ~31–37% | Community sweet spot |\n| Q4 | ~25% | Cuts it to a quarter, but quality starts to slip |\n\nHalving the cache (F16→Q8) can be the difference between fitting 32k and 64k context on the same card. The research shows you can push even further with the right techniques: [KIVI](https://arxiv.org/abs/2402.02750?ref=vettedconsumer.com) (2024) reaches 2-bit by quantizing keys per-channel and values per-token, cutting peak memory ~2.6× and enabling up to 4× larger batches; [KVQuant](https://arxiv.org/abs/2401.18079?ref=vettedconsumer.com) (2024) hits 8× compression and demonstrates a 7B model at a *million*-token context on a single A100.\n\nBut naive low-bit KV quant has a sharp edge that owners learn the hard way. The community consensus tracks the research: Q5/Q6 are nearly free, and quality falls off below that. As one r/LocalLLaMA commenter summarized:\n\n\"Try Q6, it's still basically lossless. Same deal with Q5. It's usually below Q5 where the difference is [noticeable].\" — r/LocalLLaMA\n\nThis is also why agents can mysteriously \"get dumber\" in long sessions: aggressive KV quant degrades exactly when the context is full and the model needs to track the most detail. If your coding assistant turns sloppy past 30k tokens, suspect your KV cache precision before you blame the model.\n\n## One more gotcha: the prefill spike\n\nThe cache doesn't only grow as the model *talks* — it fills all at once when the model *reads*. Paste a 50k-token document and the model must process every token to build its keys and values *before* writing a single word of reply. 
That prompt-processing (or \"prefill\") phase materializes the cache for the entire prompt up front, which is why a long input can run you out of memory instantly rather than creeping up over a conversation. The two halves of inference also stress different parts of your hardware: prompt processing is compute-bound while generation is memory-bandwidth-bound, and research like [SARATHI](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com) (2023) exists specifically to interleave them for better utilization. The practical takeaway: when you're near your memory ceiling, a long *paste* is riskier than a long *chat*, because the whole cost arrives at once.\n\n## The cheat sheet: how to fit more context\n\n| Goal | Lever |\n|---|---|\n| Smaller cache, zero effort | Pick a GQA model (most modern ones) |\n| Roughly halve the cache | Q8 KV cache (near-lossless) |\n| Squeeze a bit more | Q5/Q6 KV — the community sweet spot |\n| Serve many users | vLLM / PagedAttention |\n| Last resort | Lower your context length |\n\n## What this means for buying hardware\n\nThe KV cache reframes how you size a machine. The headline rule from our other guides — \"buy memory for the model's total size\" — is only half the story. The real budget is:\n\n**memory ≈ weights + KV cache (which scales with your context)**\n\nIf you run short prompts, the weights dominate and you can size tight. But if you want **long context** — big codebases, long documents, persistent agents — you need real headroom *on top of* the model, sometimes tens of gigabytes of it. A 70B at 128k context wants ~40 GB of KV cache *alone*, on top of the ~40 GB of 4-bit weights.\n\nThis is the quiet case for large-unified-memory boxes (see our [Unified-Memory AI](https://vettedconsumer.com/tag/unified-memory-ai/) guides). Their value isn't only holding a big model — it's holding the model *and* a long-context KV cache without falling off a cliff. 
When you compare a 24 GB GPU to a 128 GB unified box, the GPU may run the same model, but the unified box runs it at the context length you actually want.\n\n## Sources & how we researched this\n\nThis explainer synthesizes the primary literature on attention memory — Multi-Query Attention ([Shazeer, 2019](https://arxiv.org/abs/1911.02150?ref=vettedconsumer.com)), Grouped-Query Attention ([Ainslie et al., 2023](https://arxiv.org/abs/2305.13245?ref=vettedconsumer.com)), PagedAttention/vLLM ([Kwon et al., 2023](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com)), and KV-cache quantization in [KIVI](https://arxiv.org/abs/2402.02750?ref=vettedconsumer.com) and [KVQuant](https://arxiv.org/abs/2401.18079?ref=vettedconsumer.com) (2024) — for the mechanisms and compression figures, which come from those papers. The real-world frustrations and the Q5/Q6 sweet-spot guidance are owner reports from [r/LocalLLaMA](https://redlib.catsarch.com/r/LocalLLaMA/comments/1sbe40t/my_biggest_issue_with_the_gemma4_models_is_the/?ref=vettedconsumer.com), linked so you can verify; we have not benchmarked these machines first-hand. 
The per-token figures are approximations rounded for clarity; real usage varies with implementation overhead and attention type.\n\n## Related guides\n\n- [GGUF vs GPTQ vs AWQ: the plain-English guide to quantization](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/) (the weights)\n- [Mixture-of-Experts, explained](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) (total vs active parameters)\n- [How much VRAM do you actually need to run a 70B model locally?](https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/)", "url": "https://wpnews.pro/news/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more", "canonical_source": "https://vettedconsumer.com/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more/", "published_at": "2026-06-15 13:00:00+00:00", "updated_at": "2026-06-15 13:15:10.229846+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research"], "entities": ["Llama", "Gemma", "r/LocalLLaMA", "Iory1998"], "alternates": {"html": "https://wpnews.pro/news/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more", "markdown": "https://wpnews.pro/news/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more.md", "text": "https://wpnews.pro/news/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more.txt", "jsonld": "https://wpnews.pro/news/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more.jsonld"}}