Why KV Cache Matters — How MQA, GQA, and MLA Make LLM Inference Faster

KV Cache reduces duplicated computation in autoregressive LLM inference by storing previously computed Key and Value tensors, but creates a memory bottleneck as context length grows. To address this, Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) reduce cache size by sharing or compressing K/V tensors across attention heads.

LLMs generate text one token at a time. That sounds simple. But without KV Cache, every new token would repeat a lot of old work. That is why inference optimization starts with keys and values.

KV Cache stores previously computed Key and Value tensors. During generation, the model only needs to compute the new token’s Query, Key, and Value. Then the new Query attends to cached Keys and Values.

This matters because autoregressive generation processes the same context again and again. KV Cache removes a huge amount of duplicated computation.

Autoregressive generation:

    Prompt tokens → compute K/V → store K/V in cache → generate next token → append new K/V → repeat

More compactly:

    KV Cache = reuse past K/V + compute only new K/V

But there is a trade-off. KV Cache reduces recomputation. It does not remove attention cost. And as context length grows, the cache itself becomes large.

Without KV Cache:

    context = prompt_tokens
    while not finished:
        Q, K, V = compute_qkv(context)        # recomputed for the whole context every step
        output = attention(Q, K, V)
        next_token = sample(output)
        context.append(next_token)

With KV Cache:

    K_cache, V_cache, output = compute_and_store_kv(prompt_tokens)   # done once for the prompt
    new_token = sample(output)
    while not finished:
        q_new, k_new, v_new = compute_qkv(new_token)   # only the newest token
        K_cache.append(k_new)
        V_cache.append(v_new)
        output = attention(q_new, K_cache, V_cache)
        new_token = sample(output)

The optimized version avoids recomputing K and V for old tokens. That is the main speedup.

Example:

    Prompt: Dear
    The model generates: Sarah
    Next context: Dear Sarah

Without KV Cache, the model recomputes K/V for “Dear” again. With KV Cache, the model reuses the cached K/V for “Dear.” It only computes new K/V for “Sarah.”

Now extend this to a 10,000-token conversation. Recomputing old tokens becomes wasteful. Caching becomes essential.

KV Cache reduces repeated computation. Specifically, it skips recomputing K and V for tokens that are already in the context. But it does not eliminate everything. The new Query still attends to cached Keys and Values. So longer context still costs more, and that matters in production.
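The cached loop can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular framework implements it: it assumes a single attention head in a single layer, random projection weights, and random vectors standing in for token embeddings. The names `KVCache`, `decode_step`, and `last_output_no_cache` are hypothetical. The sketch checks that attending with cached K/V gives the same output for the newest token as recomputing K/V over the whole context.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Assumption: random weights for one attention head (illustration only).
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(q, K, V):
    """One query vector q (d,) attends over keys K (t, d) and values V (t, d)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V


def last_output_no_cache(x):
    """Recompute K and V for the whole context x (t, d) and return the newest token's output."""
    K, V = x @ W_k, x @ W_v
    return attend(x[-1] @ W_q, K, V)


class KVCache:
    """Hypothetical cache holding one K row and one V row per token seen so far."""

    def __init__(self, dim):
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))

    def append(self, k_new, v_new):
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])


def decode_step(cache, x_new):
    """Compute Q/K/V for the new token only, then attend over the cache."""
    q_new, k_new, v_new = x_new @ W_q, x_new @ W_k, x_new @ W_v
    cache.append(k_new, v_new)
    return attend(q_new, cache.K, cache.V)


tokens = rng.standard_normal((6, d_model))  # six stand-in token embeddings
cache = KVCache(d_model)
for t in range(len(tokens)):
    cached = decode_step(cache, tokens[t])
    naive = last_output_no_cache(tokens[: t + 1])
    assert np.allclose(cached, naive)  # same result, less recomputation

print(cache.K.shape)  # (6, 16): one cached K row per token
```

Note that the cache gains one K row and one V row per token. The computation saved is paid for with memory that grows with context length.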
A long chat can become memory-heavy even if generation is optimized.

KV Cache speeds up inference. But it also creates a memory problem. For every layer, every token stores Key and Value tensors. Longer context means larger cache. More users mean more cache memory. More heads mean more K/V tensors.

So the bottleneck shifts:

    Before KV Cache: recompute cost
    After KV Cache: memory cost

This is why MQA, GQA, and MLA exist. The main difference is how Key and Value tensors are stored.

    Standard Multi-Head Attention (MHA): each head has its own K/V.
    Multi-Query Attention (MQA): all heads share one K/V.
    Grouped-Query Attention (GQA): groups of heads share K/V.
    Multi-Head Latent Attention (MLA): K/V information is stored in compressed latent form.

The goal is the same: reduce KV Cache size while preserving useful attention behavior.

In standard Multi-Head Attention, each head has separate Query, Key, and Value projections. If there are 8 heads:

    8 heads → 8 K/V pairs

This is expressive. Each head can learn its own representation. But it is expensive during inference. More heads mean larger cache. So MHA gives quality and flexibility. But it pays with memory.

Multi-Query Attention keeps different Queries for each head. But all heads share the same Key and Value. If there are 8 heads:

    8 query heads → 1 shared K/V pair

This sharply reduces cache size. It is memory-efficient. But there is a trade-off. Because all heads share K/V, head diversity can decrease. So MQA is fast and compact. But it may lose some expressiveness.

Grouped-Query Attention is the compromise. Instead of one shared K/V for all heads, it divides heads into groups. Each group shares one K/V pair. Example:

    8 heads, 2 groups → 2 K/V pairs

This sits between MHA and MQA. MHA stores 8 K/V pairs. MQA stores 1 K/V pair. GQA stores a configurable middle ground. That makes GQA practical for modern LLM inference.

Multi-Head Latent Attention goes further. Instead of storing full K/V tensors directly, it stores compressed latent representations.
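To put rough numbers on the four layouts, here is a back-of-the-envelope sketch of cache size for one sequence. Everything in it is an assumption chosen for illustration: the layer count, head size, 2-byte values, and especially `latent_dim`, which models MLA as a single latent vector per token per layer (real MLA implementations may store additional components). The helper names `kv_bytes_per_token` and `latent_bytes_per_token` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    n_layers: int
    n_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumption: 16-bit values


def kv_bytes_per_token(cfg, n_kv_heads):
    """Bytes cached per token when n_kv_heads K/V pairs are stored in every layer."""
    # Factor 2: one Key vector and one Value vector per stored head.
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim * cfg.bytes_per_value


def latent_bytes_per_token(cfg, latent_dim):
    """Bytes cached per token if each layer stores one latent vector (simplified MLA)."""
    return cfg.n_layers * latent_dim * cfg.bytes_per_value


cfg = CacheConfig(n_layers=32, n_heads=8, head_dim=128)  # illustrative values
context_len = 10_000

layouts = {
    "MHA (8 K/V pairs)": kv_bytes_per_token(cfg, cfg.n_heads),
    "GQA (2 groups)": kv_bytes_per_token(cfg, 2),
    "MQA (1 K/V pair)": kv_bytes_per_token(cfg, 1),
    "MLA (latent_dim=256)": latent_bytes_per_token(cfg, 256),
}

for name, per_token in layouts.items():
    mib = per_token * context_len / 2**20
    print(f"{name:>22}: {mib:8.1f} MiB for {context_len} tokens")
```

The MHA, GQA, and MQA rows differ only in the number of stored K/V heads, so the ratio between them is exactly the ratio of head counts. The MLA row depends entirely on the assumed latent size. Serving several sequences at once multiplies every row by the number of sequences.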
During attention, MLA reconstructs or projects the needed information from those stored latents. The idea is:

    store less, recover enough

This is especially useful for long-context inference, because when context length grows, KV Cache grows with it. MLA attacks the memory problem at the representation level.

    MHA: one K/V pair per head. Most expressive, largest cache.
    MQA: one K/V pair shared by all heads. Compact, but head diversity can decrease.
    GQA: one K/V pair per group of heads. A configurable middle ground.
    MLA: K/V stored as compressed latents. Small cache, more architectural complexity.

In real inference systems, KV Cache is not just a model detail. A model with a smaller KV Cache can serve longer contexts or more users on the same hardware. That is why shared K/V designs matter. They are not just architecture theory. They directly affect deployment.

    Naive view: LLM inference = run the model repeatedly
    Practical view: LLM inference = manage cached states efficiently

Naive generation:

    recompute all token states every step

Optimized generation:

    cache past K/V
    compute only new token states
    reduce K/V storage with MQA, GQA, or MLA

This is one of the biggest differences between understanding Transformers conceptually and running them efficiently.

KV Cache does not make attention free. The new Query still attends over cached tokens. Long context still increases memory and latency. MQA reduces memory but may reduce head diversity. GQA balances memory and quality. MLA reduces cache size through compression, but adds architectural complexity.

So the real design question is: how much memory can we save without hurting generation quality too much?

Long-context models are useful only if inference is practical. A model that supports huge context but cannot fit enough cache in GPU memory is hard to serve.

KV Cache makes autoregressive generation faster. MQA, GQA, and MLA make KV Cache more scalable. That is why modern LLM architecture spends so much effort on shared or compressed Key-Value attention.

    KV Cache reuses past Keys and Values.
    MQA shares K/V across all heads.
    GQA shares K/V within groups.
    MLA compresses K/V into latent representations.
The shortest version:

    KV optimization = faster generation + smaller memory footprint

If attention is the engine, KV Cache is the memory system that keeps generation practical.

When optimizing LLM inference, which bottleneck do you usually notice first? Latency, GPU memory, context length, or serving cost?

Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/kv-cache-shared-key-value-attention-en/

GitHub Resources
AI diagrams, study notes, and visual guides: https://github.com/zeromathai/zeromathai-ai