KV Cache in LLMs: The Optimization That Makes Modern AI Models Feel Fast

wpnews.pro

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

Large Language Models can generate surprisingly intelligent responses. But there's a hidden engineering challenge behind every answer:

LLMs generate text one token at a time. To predict each new token, a transformer model processes the entire sequence of tokens seen so far and uses its attention mechanism to determine which earlier tokens are most relevant for the next prediction. Naively, this means that when generating the 1,000th token, the model would need to repeatedly compute representations for the previous 999 tokens even though those tokens have not changed.

How do you generate the 1,000th token without recomputing information for the previous 999 tokens?

If models had to recompute everything from scratch for every generated token, response times would be painfully slow and inference costs would explode.

The solution is one of the most important optimizations in modern LLM serving infrastructure:

KV Cache.

If you've ever worked with transformers, built AI products, or wondered why prompt length affects latency and memory, understanding KV Cache is essential.

Let's break it down from intuition to implementation.

LLMs generate text one token at a time.

Imagine the model receives:

The capital of France is

The model predicts:

Paris

Now the input becomes:

The capital of France is Paris

To generate the next token, the model runs another forward pass.

Then:

The capital of France is Paris .

And another forward pass.

And another.

The key observation is that most of the sequence remains unchanged between steps.

The capital of France is

has already been processed.

Recomputing representations for those old tokens every generation step would be wasteful.

This is exactly what KV Cache avoids.

To understand KV Cache, we need a quick refresher on self-attention.

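For each token, the transformer computes three vectors:

Query (Q): what this token is looking for
Key (K): what this token offers for matching
Value (V): the information this token contributes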

A simplified attention calculation looks like:

Attention(Q, K, V)
    = softmax(QKᵀ)V
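To make the formula concrete, here is a minimal sketch of attention for a single query in plain Python. It includes the usual division by √d that the simplified formula omits; the function names and the tiny example vectors are illustrative assumptions, not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Two earlier tokens with identical keys receive equal weight,
# so the output is the plain average of their value vectors.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
output = attend([1.0, 0.0], keys, values)
print(output)  # [1.0, 2.0]
```

Note that the query belongs to the newest token only, while the keys and values come from every token seen so far.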

Each token creates its own K and V vectors.

During generation, when a new token arrives, it needs to attend to all previous tokens.

For example:

Token 1 → K₁, V₁
Token 2 → K₂, V₂
Token 3 → K₃, V₃
...

When generating token 1000, the model needs access to:

K₁ ... K₉₉₉
V₁ ... V₉₉₉

The question becomes:

Why recompute them if they never changed?

Instead of recalculating Keys and Values for previous tokens, we simply store them.

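When token N is generated:

Compute Q, K, and V for token N only
Append the new K and V to the cache
Attend from the new Q over all cached Keys and Values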

Visually:

Step 1

Token A
  ↓
Compute K₁,V₁
  ↓
Store in cache

Cache:
[K₁]
[V₁]

Step 2

Token B
  ↓
Compute K₂,V₂

Cache:
[K₁ K₂]
[V₁ V₂]

Step 3

Token C
  ↓
Compute K₃,V₃

Cache:
[K₁ K₂ K₃]
[V₁ V₂ V₃]

Now each step only computes the Query, Key, and Value for the newest token, and attention reuses the cached Keys and Values of all earlier tokens.

This dramatically reduces computation.
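The idea can be sketched with a toy single-layer, single-head model in plain Python. The weights, embeddings, and the `KVCache` class are hypothetical and exist only for illustration; the point is that the cached path produces the same attention output as full recomputation while projecting far fewer tokens.

```python
import math

def project(vec, matrix):
    """Multiply a vector by a small weight matrix (list of rows)."""
    return [sum(v * w for v, w in zip(vec, row)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]

# Hypothetical fixed weights and token embeddings, for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [0.0, 1.0]]
W_V = [[1.0, 1.0], [1.0, -1.0]]
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def step_without_cache(embeddings):
    """Recompute K and V for every token seen so far."""
    keys = [project(e, W_K) for e in embeddings]
    values = [project(e, W_V) for e in embeddings]
    return attend(project(embeddings[-1], W_Q), keys, values), len(embeddings)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, embedding):
        """Project only the newest token, append, then attend over the cache."""
        self.keys.append(project(embedding, W_K))
        self.values.append(project(embedding, W_V))
        return attend(project(embedding, W_Q), self.keys, self.values), 1

cache = KVCache()
naive_projections = cached_projections = 0
for t in range(1, len(EMBEDDINGS) + 1):
    naive_out, n = step_without_cache(EMBEDDINGS[:t])
    cached_out, c = cache.step(EMBEDDINGS[t - 1])
    naive_projections += n
    cached_projections += c
    # Both paths must agree on the attention output.
    assert all(abs(a - b) < 1e-12 for a, b in zip(naive_out, cached_out))

print(naive_projections, cached_projections)  # tokens projected to K/V per path
```

For three tokens the uncached path projects 1 + 2 + 3 = 6 tokens, while the cached path projects 3. In a real multi-layer model the same reuse is valid because causal masking keeps earlier positions from depending on later ones.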

Many developers initially assume the cache stores hidden states.

It doesn't.

The cache stores:

Keys
Values

for every attention layer.

Suppose a model has:

32 layers
32 attention heads

Each layer maintains its own KV cache.

Conceptually:

Layer 1
 ├── Keys
 └── Values

Layer 2
 ├── Keys
 └── Values

...

Layer 32
 ├── Keys
 └── Values

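This means cache memory grows with:

Number of layers
Number of key/value heads and their dimension
Sequence length
Batch size (number of concurrent sequences)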

This is why long-context inference can become memory-intensive.
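A rough back-of-the-envelope estimate makes the growth tangible. The sketch below reuses the 32 layers and 32 heads from the example above; the head dimension of 128, the 16-bit values, and the function name `kv_cache_bytes` are assumptions chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size: one K and one V vector per layer, head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# head_dim=128 and 2 bytes per value (16-bit) are assumed, not from the article.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Under these assumptions every token costs 0.5 MiB of cache, so a single 4,096-token sequence already occupies 2 GiB.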

Without caching:

Generation Step 1000

Recompute tokens:
1...999

Then compute token 1000

With caching:

Generation Step 1000

Reuse:
1...999

Compute only:
1000

The complexity improvement is substantial.

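Naively, step i reruns attention over all i tokens at O(i²) cost, so across n generation steps the attention work adds up to:

O(n³)

With KV caching, step i scores one new query against i cached keys at O(i) cost, so the total attention work is:

O(n²)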

The exact complexity depends on implementation details, but the key takeaway is that cached inference avoids repeatedly processing the entire prefix.
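To see where the cubic and quadratic totals come from, the sketch below simply counts query-key score computations. It assumes the uncached path builds a full i × i score matrix at step i (before any masking) and ignores everything outside attention; the function names are illustrative.

```python
def attention_pairs_without_cache(n):
    """Each step i rebuilds a full i x i score matrix over the whole prefix."""
    return sum(i * i for i in range(1, n + 1))

def attention_pairs_with_cache(n):
    """Each step i scores one new query against i cached keys."""
    return sum(i for i in range(1, n + 1))

naive = attention_pairs_without_cache(1000)
cached = attention_pairs_with_cache(1000)
print(naive, cached, naive / cached)  # 333833500 500500 667.0
```

For 1,000 generated tokens, that is roughly 334 million score computations without the cache versus about half a million with it.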

In production systems, this difference is enormous.

Without KV caching, modern chat systems would be far slower and significantly more expensive to operate.

KV Cache speeds up computation, but memory usage increases.

A rough intuition:

Longer conversation
    ↓
More tokens
    ↓
Larger KV cache
    ↓
More GPU memory consumed

This creates one of the biggest bottlenecks in LLM serving.

For example:

1 user
    = small cache

10,000 users
    = 10,000 caches

Serving infrastructure must allocate GPU memory for every active session.

This is why inference platforms spend significant effort on managing KV cache memory.

In large deployments, memory often becomes the limiting factor before raw compute.
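A quick calculation shows how fast the memory ceiling arrives. The numbers here are assumptions for illustration: 0.5 MiB of cache per token and 40 GiB of GPU memory left for caches after loading the weights. The helper `max_sessions` is hypothetical.

```python
def max_sessions(memory_budget_bytes, bytes_per_token, tokens_per_session):
    """How many per-session caches of a given length fit in a memory budget."""
    return memory_budget_bytes // (bytes_per_token * tokens_per_session)

# Assumed figures, not measurements.
BYTES_PER_TOKEN = 512 * 1024
BUDGET = 40 * 2**30

short_chats = max_sessions(BUDGET, BYTES_PER_TOKEN, tokens_per_session=1024)
long_chats = max_sessions(BUDGET, BYTES_PER_TOKEN, tokens_per_session=16384)
print(short_chats, long_chats)  # 80 5
```

Under these assumptions the same GPU holds 80 short conversations but only 5 long ones, regardless of how much compute is idle.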

Suppose many users share the same system prompt:

You are a helpful coding assistant...

Without optimization:

User A → Build KV cache
User B → Build KV cache
User C → Build KV cache

The same work is repeated.

Modern inference engines often support prefix caching.

Shared Prompt
      ↓
Shared KV Cache
      ↓
Reused Across Requests

Frameworks such as vLLM and other high-performance serving systems heavily exploit this idea.

For workloads with large shared prompts, the savings can be dramatic.
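The core idea can be sketched as a lookup keyed by the shared prompt. This is a toy model, not how vLLM or any other engine implements it; `PrefixCache` and `fake_kv` are hypothetical names, and production systems typically manage the cache at a finer granularity.

```python
class PrefixCache:
    """Toy prefix cache: maps a prompt's token tuple to its precomputed KV entries."""

    def __init__(self, compute_kv):
        self._compute_kv = compute_kv   # called once per uncached token
        self._store = {}
        self.tokens_computed = 0

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key not in self._store:
            self._store[key] = [self._compute_kv(t) for t in key]
            self.tokens_computed += len(key)
        return self._store[key]

def fake_kv(token):
    """Stand-in for the real per-token K/V computation (hypothetical)."""
    return (f"K({token})", f"V({token})")

SYSTEM_PROMPT = ["You", "are", "a", "helpful", "coding", "assistant"]
cache = PrefixCache(fake_kv)

for user in ["A", "B", "C"]:
    shared_kv = cache.get(SYSTEM_PROMPT)   # built for user A, reused for B and C

print(cache.tokens_computed)  # 6
```

Three requests share one six-token prompt, yet only six tokens are ever processed instead of eighteen.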

In Hugging Face Transformers, KV Cache is often exposed as:

past_key_values

A simplified generation loop looks like:

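# First pass: input_ids holds the whole prompt and cache is None.
# Later passes: input_ids holds only the newest token.
outputs = model(
    input_ids=input_ids,
    past_key_values=cache,
    use_cache=True,
)

# Carry the updated cache forward and pick the next token (greedy here).
cache = outputs.past_key_values
input_ids = outputs.logits[:, -1:].argmax(dim=-1)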

The first pass creates the cache.

Subsequent passes reuse it.

Under the hood, the model only computes attention state for newly generated tokens while leveraging cached Keys and Values from earlier tokens.

Most developers never need to implement KV caching manually, but understanding it helps explain performance behavior.

When developers encounter latency or GPU memory usage that grows with prompt length, KV Cache is often part of the explanation.

It is one of those rare optimizations that fundamentally changed the economics of LLM serving.

The transformer architecture made large language models possible.

KV Cache made them practical.

Without it, the conversational AI products we use every day would feel dramatically slower and cost far more to operate.

What other LLM inference optimization would you like to see explained next—Paged Attention, Speculative Decoding, Continuous Batching, or FlashAttention?

While ChatGPT is a well-known example, KV Cache is not specific to ChatGPT. It is used across most transformer-based autoregressive models, including GPT-style models, Llama, Mistral, Claude, Gemini, and many open-source LLMs.


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
