# Why KV Cache Matters — How MQA, GQA, and MLA Make LLM Inference Faster

> Source: <https://dev.to/zeromathai/why-kv-cache-matters-how-mqa-gqa-and-mla-make-llm-inference-faster-5gb4>
> Published: 2026-06-25 14:15:58+00:00

LLMs generate text one token at a time.

That sounds simple.

But without KV Cache, every new token would repeat a lot of old work.

That is why inference optimization starts with keys and values.

KV Cache stores previously computed Key and Value tensors.

During generation, the model only needs to compute the new token’s Query, Key, and Value.

Then the new Query attends to cached Keys and Values.

This matters because autoregressive generation repeats the same context again and again.

KV Cache removes a huge amount of duplicated computation.

Autoregressive generation:

Prompt tokens

→ compute K/V

→ store K/V in cache

→ generate next token

→ append new K/V

→ repeat

More compactly:

KV Cache = reuse past K/V + compute only new K/V

But there is a trade-off.

KV Cache reduces recomputation.

It does not remove attention cost.

And as context length grows, the cache itself becomes large.

Without KV Cache:

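```
context = prompt_tokens

while not finished:
    # Q, K, V are recomputed for every token in the context on each step.
    Q, K, V = compute_qkv(context)

    output = attention(Q, K, V)

    next_token = sample(output)

    # The context grows, so the next step repeats even more work.
    context.append(next_token)
```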

With KV Cache:

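```
context = prompt_tokens

# Prefill: compute K/V for the whole prompt once and pick the first new token.
K_cache, V_cache, output = compute_and_store_kv(context)
new_token = sample(output)

while not finished:
    # Only the newest token needs fresh Q, K, V.
    q_new, k_new, v_new = compute_qkv(new_token)

    K_cache.append(k_new)
    V_cache.append(v_new)

    # The new Query attends over every cached Key and Value.
    output = attention(q_new, K_cache, V_cache)

    new_token = sample(output)
```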

The optimized version avoids recomputing K and V for old tokens.

That is the main speedup.
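To make the comparison concrete, here is a minimal runnable sketch of both loops for a single attention head. The toy embeddings, the random weight matrices, and the helper names (`attend`, `step_without_cache`, `step_with_cache`) are assumptions made for illustration. A fixed list of vectors stands in for sampled tokens.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

random.seed(0)
D = 4  # toy model width (hypothetical)

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def step_without_cache(context):
    # Recompute K and V for every token in the context.
    keys = [matvec(Wk, x) for x in context]
    values = [matvec(Wv, x) for x in context]
    return attend(matvec(Wq, context[-1]), keys, values)

def step_with_cache(x_new, k_cache, v_cache):
    # Project only the newest token, then reuse everything already cached.
    k_cache.append(matvec(Wk, x_new))
    v_cache.append(matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), k_cache, v_cache)

k_cache, v_cache = [], []
max_diff = 0.0
for t in range(len(tokens)):
    slow = step_without_cache(tokens[: t + 1])
    fast = step_with_cache(tokens[t], k_cache, v_cache)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(slow, fast)))

print("largest difference between the two paths:", max_diff)
```

Both paths give the same attention output for the newest token. The cached path just skips the repeated K/V projections.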

Prompt:

Dear

The model generates:

Sarah

Next context:

Dear Sarah

Without KV Cache:

The model recomputes K/V for “Dear” again.

With KV Cache:

The model reuses the cached K/V for “Dear.”

It only computes new K/V for “Sarah.”

Now extend this to a 10,000-token conversation.

Recomputing old tokens becomes wasteful.

Caching becomes essential.

KV Cache reduces repeated computation.

Specifically, it skips recomputing Keys and Values for tokens that are already in the cache.

But it does not eliminate everything.

The new Query still attends to cached Keys and Values.

So longer context still costs more.
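A rough count shows both effects at once. This is a sketch for one layer that counts per-token K/V projections and the number of keys the newest query attends over. The function names and the example sizes (a 1,000-token prompt, 100 generated tokens) are hypothetical.

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-token K/V projections in one layer while generating new_tokens tokens."""
    if use_cache:
        # One prefill pass over the prompt, then one new token per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step s re-projects the whole context of length prompt_len + s.
    return sum(prompt_len + s for s in range(new_tokens))

def keys_attended(prompt_len, step):
    """Keys the newest query attends over at decode step `step` (0-based)."""
    return prompt_len + step

without_cache = kv_projections(1000, 100, use_cache=False)
with_cache = kv_projections(1000, 100, use_cache=True)

print(without_cache, with_cache)                        # 104950 1099
print(keys_attended(1000, 0), keys_attended(1000, 99))  # 1000 1099
```

The projection count collapses with caching, but the number of keys each new query must score keeps growing with the context.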

This matters in production.

A long chat can become memory-heavy even if generation is optimized.

KV Cache speeds up inference.

But it also creates a memory problem.

For every layer and every K/V head, each token adds one Key vector and one Value vector to the cache.

Longer context means larger cache.

More users mean more cache memory.

More heads mean more K/V tensors.

So the bottleneck shifts:

Before KV Cache:

recompute cost

After KV Cache:

memory cost
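The memory cost can be estimated with simple arithmetic. The sketch below multiplies the factors listed above. The model shape used in the example (32 layers, 32 K/V heads, head size 128, 16-bit values) is a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor of 2: one Key tensor and one Value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GIB = 2 ** 30

one_user = kv_cache_bytes(32, 32, 128, seq_len=4096)
eight_users = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(one_user / GIB)     # 2.0
print(eight_users / GIB)  # 16.0
```

Doubling the context length, the number of users, or the number of K/V heads doubles the cache.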

This is why MQA, GQA, and MLA exist.

The main difference is how Key and Value tensors are stored.

Standard Multi-Head Attention:

Each head has its own K/V.

Multi-Query Attention:

All heads share one K/V.

Grouped-Query Attention:

Groups of heads share K/V.

Multi-Head Latent Attention:

K/V information is stored in compressed latent form.

The goal is the same:

reduce KV Cache size while preserving useful attention behavior.

In standard Multi-Head Attention, each head has separate Query, Key, and Value projections.

If there are 8 heads:

8 heads → 8 K/V pairs

This is expressive.

Each head can learn its own representation.

But it is expensive during inference.

More heads mean larger cache.

So MHA gives quality and flexibility.

But it pays with memory.

Multi-Query Attention keeps different Queries for each head.

But all heads share the same Key and Value.

If there are 8 heads:

8 query heads → 1 shared K/V pair

This sharply reduces cache size.

It is memory-efficient.

But there is a trade-off.

Because all heads share K/V, head diversity can decrease.

So MQA is fast and compact.

But it may lose some expressiveness.

Grouped-Query Attention is the compromise.

Instead of one shared K/V for all heads, it divides heads into groups.

Each group shares one K/V pair.

Example:

8 heads

2 groups

→ 2 K/V pairs

This sits between MHA and MQA.

MHA stores 8 K/V pairs.

MQA stores 1 K/V pair.

GQA stores a configurable middle ground.

That makes GQA practical for modern LLM inference.
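One way to picture the three designs is as a mapping from query heads to K/V heads. The sketch below assumes consecutive query heads share a K/V head, which is a common convention but still an assumption here. The function names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    # Consecutive query heads are assigned to the same K/V head.
    assert n_query_heads % n_kv_heads == 0
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def head_mapping(n_query_heads, n_kv_heads):
    return [kv_head_for(h, n_query_heads, n_kv_heads)
            for h in range(n_query_heads)]

mha = head_mapping(8, 8)  # every head has its own K/V
gqa = head_mapping(8, 2)  # two groups of four heads
mqa = head_mapping(8, 1)  # one K/V shared by all heads

print(mha)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(gqa)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(mqa)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Seen this way, MHA and MQA are the two extremes of GQA: as many K/V heads as query heads, or just one.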

Multi-Head Latent Attention goes further.

Instead of storing full K/V tensors directly, it stores compressed latent representations.

Then it reconstructs or projects the needed information during attention.

The idea is:

store less

recover enough

This is especially useful for long-context inference.

Because when context length grows, KV Cache grows with it.

MLA attacks the memory problem at the representation level.
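The "store less, recover enough" idea can be sketched as a low-rank projection. This is a deliberately simplified illustration, not a faithful MLA implementation: real designs include details that are omitted here, such as how positional information is handled. All sizes, matrix names (`W_down`, `W_up_k`, `W_up_v`), and helpers are assumptions.

```python
import random

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads, head_dim, d_latent = 16, 4, 4, 3  # toy sizes

W_down = rand_matrix(d_latent, d_model, rng)             # hidden state -> latent
W_up_k = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> all heads' keys
W_up_v = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> all heads' values

latent_cache = []

def cache_token(hidden):
    # Only the small latent vector is stored for each token.
    latent_cache.append(matvec(W_down, hidden))

def keys_and_values():
    # Keys and values are rebuilt from the cached latents when attention needs them.
    keys = [matvec(W_up_k, c) for c in latent_cache]
    values = [matvec(W_up_v, c) for c in latent_cache]
    return keys, values

for _ in range(5):
    cache_token([rng.uniform(-1, 1) for _ in range(d_model)])

keys, values = keys_and_values()
stored_per_token = d_latent                  # numbers kept in the cache per token
full_kv_per_token = 2 * n_heads * head_dim   # what a per-head K/V cache would keep

print(stored_per_token, full_kv_per_token)   # 3 32
```

The cache holds 3 numbers per token instead of 32 in this toy setup. The price is the extra up-projection work and the added architectural complexity.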

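MHA: every head stores its own K/V. Largest cache, most expressive.

MQA: all heads share one K/V. Very small cache, but head diversity can decrease.

GQA: each group of heads shares one K/V. A configurable middle ground.

MLA: K/V information is stored as compressed latents. Smaller cache, more architectural complexity.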

In real inference systems, KV Cache is not just a model detail.

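It affects how much GPU memory each request needs, how long a context can be, and how many users fit on the same hardware.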

A model with a smaller KV Cache can serve longer contexts or more users on the same hardware.
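As a rough illustration, the sketch below asks how many full-length sequences fit in a fixed cache budget when only the number of K/V heads changes. The 16 GiB budget and the model shape are hypothetical, and the estimate ignores model weights, activations, and allocator overhead.

```python
def max_concurrent_sequences(cache_budget_bytes, n_layers, n_kv_heads,
                             head_dim, seq_len, bytes_per_value=2):
    # Cache needed by one sequence: K and V, for every layer and K/V head.
    per_sequence = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return cache_budget_bytes // per_sequence

budget = 16 * 2 ** 30  # hypothetical 16 GiB reserved for KV Cache

mha_users = max_concurrent_sequences(budget, 32, 32, 128, 4096)  # 32 K/V heads
gqa_users = max_concurrent_sequences(budget, 32, 8, 128, 4096)   # 8 K/V heads
mqa_users = max_concurrent_sequences(budget, 32, 1, 128, 4096)   # 1 K/V head

print(mha_users, gqa_users, mqa_users)  # 8 32 256
```

With everything else fixed, cutting the number of K/V heads raises the number of sequences that fit by the same factor.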

That is why shared K/V designs matter.

They are not just architecture theory.

They directly affect deployment.

Naive view:

LLM inference = run the model repeatedly

Practical view:

LLM inference = manage cached states efficiently

Naive generation:

```
recompute all token states every step
```

Optimized generation:

```
cache past K/V
compute only new token states
reduce K/V storage with MQA, GQA, or MLA
```

This is one of the biggest differences between understanding Transformers conceptually and running them efficiently.

KV Cache does not make attention free.

The new Query still attends over cached tokens.

Long context still increases memory and latency.

MQA reduces memory but may reduce head diversity.

GQA balances memory and quality.

MLA reduces cache size through compression, but adds architectural complexity.

So the real design question is:

How much memory can we save without hurting generation quality too much?

Long-context models are useful only if inference is practical.

A model that supports huge context but cannot fit enough cache in GPU memory is hard to serve.

KV Cache makes autoregressive generation faster.

MQA, GQA, and MLA make KV Cache more scalable.

That is why modern LLM architecture spends so much effort on shared or compressed Key-Value attention.

KV Cache reuses past Keys and Values.

MQA shares K/V across all heads.

GQA shares K/V within groups.

MLA compresses K/V into latent representations.

The shortest version:

KV optimization = faster generation + smaller memory footprint

If attention is the engine, KV Cache is the memory system that keeps generation practical.

When optimizing LLM inference, which bottleneck do you usually notice first?

Latency, GPU memory, context length, or serving cost?

Originally published at zeromathai.com.

Original article: [https://zeromathai.com/en/kv-cache-shared-key-value-attention-en/](https://zeromathai.com/en/kv-cache-shared-key-value-attention-en/)

GitHub Resources

AI diagrams, study notes, and visual guides:

[https://github.com/zeromathai/zeromathai-ai](https://github.com/zeromathai/zeromathai-ai)
