Prompt Caching in LLMs: The Hidden Optimization Saving Millions of GPU Hours


Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

Every developer eventually discovers the same frustrating pattern.

Your application sends a 20,000-token prompt to an LLM. The first request takes 2 seconds. The next request contains the exact same 20,000 tokens plus a tiny user message at the end.

And somehow the model processes the entire thing again.

At least, that's what many developers assume.

Modern LLM systems have a trick called prompt caching that can dramatically reduce latency and cost by reusing work from previous requests. But unlike traditional application caches, prompt caching isn't storing generated text. It's storing something much deeper inside the model.

To understand how prompt caching works, we need to follow a prompt all the way through the transformer itself.

When a prompt enters a transformer model, the model doesn't immediately start generating text.

First, the model must process every input token through every layer of the network.

Imagine a prompt like:

System: You are a helpful coding assistant.

Project Documentation:
[20,000 tokens of documentation]

User: How does authentication work?

Before generating a single output token, the model runs self-attention and feed-forward computation over every input token, across dozens or even hundreds of transformer layers.

For a large model, this step (usually called prefill) is often more expensive than generating a short answer.

If another user asks:

System: You are a helpful coding assistant.

Project Documentation:
[Same 20,000 tokens]

User: Explain the database schema.

Most of the prompt is identical.

Without caching, the model would recompute everything from scratch.

Prompt caching exists to avoid that waste.

A common misconception is that prompt caching stores prompt text.

That's not particularly useful because the model would still need to process the text again.

Instead, modern systems cache the transformer's internal representations.

After processing a token through the network, the model produces vectors that represent the token's state at various stages.

The most important cached data is usually the key (K) and value (V) tensors.

These are generated during self-attention.

Once a prefix has been processed, those K/V tensors can often be reused.

Conceptually:

Prompt
  ↓
Token Embeddings
  ↓
Transformer Layers
  ↓
Key/Value Tensors
  ↓
Cache

When a future request begins with the same prefix, the system loads the cached tensors rather than recomputing them.

The model effectively starts from the middle of the computation.

Prompt caching builds directly on a mechanism called the KV cache.

During inference, each attention layer creates:

Q = Query
K = Key
V = Value

Attention is computed roughly as follows, where d is the dimension of each key vector:

\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V

When generating token 501, the model doesn't want to recompute the keys and values for tokens 1-500.

Instead it stores the previous K and V tensors.

This is the standard KV cache used during autoregressive generation.
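To make the mechanism concrete, here is a minimal pure-Python sketch of single-head attention with and without a KV cache. The `project` helper is a hypothetical stand-in for the learned Q/K/V projection matrices, and the two-dimensional token vectors are toy data; a real model does the same thing with large tensors on a GPU.

```python
import math

def attend(q, keys, values):
    """Softmax attention for one query vector over all available positions."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(width)]

def project(x, scale):
    # Hypothetical stand-in for a learned projection matrix.
    return [scale * xi for xi in x]

def generate_with_cache(token_vectors):
    k_cache, v_cache, outputs = [], [], []
    for x in token_vectors:
        # Only the new token's K and V are computed; older ones are reused.
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        outputs.append(attend(project(x, 1.0), k_cache, v_cache))
    return outputs

def generate_without_cache(token_vectors):
    outputs = []
    for t in range(1, len(token_vectors) + 1):
        # Recompute K and V for every earlier token at every step.
        keys = [project(x, 0.5) for x in token_vectors[:t]]
        values = [project(x, 2.0) for x in token_vectors[:t]]
        outputs.append(attend(project(token_vectors[t - 1], 1.0), keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached = generate_with_cache(tokens)
recomputed = generate_without_cache(tokens)
```

Both functions produce the same outputs. The cached version simply avoids rebuilding the keys and values of earlier tokens at every step.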

Prompt caching extends the same idea across requests.

Instead of caching:

Request A, tokens 1-500

it caches:

Shared prompt prefix

which can then be reused by:

Request B
Request C
Request D

as long as the prefix remains identical.

Let's use a realistic example.

Suppose we have:

System Prompt: 2,000 tokens
Repository Documentation: 18,000 tokens
User Message: 100 tokens

Total:

20,100 tokens

Assume a model with 80 transformer layers.

For each layer, the system stores K and V tensors for every processed token.

Conceptually:

Layer 1:
  K[20000]
  V[20000]

Layer 2:
  K[20000]
  V[20000]

...

Layer 80:
  K[20000]
  V[20000]

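The cache may occupy hundreds of megabytes or even gigabytes, depending on the number of layers, the number and size of attention heads, the prefix length, and the numeric precision of the stored tensors.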

This is why prompt caching isn't free.

The system trades memory for computation.

GPU memory is expensive, but recomputing a 20,000-token prompt repeatedly is often even more expensive.
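A rough back-of-the-envelope estimate shows where the memory goes. The 80 layers and 20,000 tokens come from the example above; the 8 key/value heads, head dimension of 128, and 2 bytes per value (fp16) are assumptions chosen for illustration, not the specs of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value):
    """Estimate KV cache size: one K and one V tensor per layer."""
    tensors_per_layer = 2  # K and V
    return (tensors_per_layer * num_layers * num_kv_heads
            * head_dim * num_tokens * bytes_per_value)

size = kv_cache_bytes(
    num_layers=80,       # from the example above
    num_kv_heads=8,      # assumption
    head_dim=128,        # assumption
    num_tokens=20_000,   # shared prefix length from the example
    bytes_per_value=2,   # assumption: fp16
)
print(f"{size / 1e9:.2f} GB")  # prints "6.55 GB"
```

Under these assumptions a single cached prefix takes several gigabytes, which is why providers cannot keep every prefix around indefinitely.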

Most production systems perform prompt caching using prefix matching.

Consider:

[System Prompt]
[Documentation]
User: Explain auth

and

[System Prompt]
[Documentation]
User: Explain database

The shared prefix is:

[System Prompt]
[Documentation]

Everything after that differs.

The cache can be reused because, with causal attention, each token's K/V tensors depend only on the tokens before it, so the transformer state for the shared prefix is identical.

But even small changes can invalidate the cache:

Version 1:
Repository version: 2.1

Version 2:
Repository version: 2.2

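That tiny change produces different tokens at that position.

Different tokens produce different embeddings.

Different embeddings produce different K/V tensors.

Because every later token attends to the changed one, all computation after that point changes too. Only the part of the prefix before the edit can still be reused.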

This is why prompt caching systems often require exact token-level matches rather than semantic similarity.
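Exact token-level matching can be sketched as a longest-common-prefix check over token IDs. The token IDs below are made up for illustration; a real system would get them from the model's tokenizer.

```python
def shared_prefix_length(cached_tokens, request_tokens):
    """Count how many leading token IDs are identical in both sequences."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs.
cached = [101, 7, 7, 42, 9, 500]
new_question = [101, 7, 7, 42, 9, 611, 612]  # same prefix, different ending
edited_early = [101, 7, 8, 42, 9, 500]       # one token changed near the start

reusable_a = shared_prefix_length(cached, new_question)  # 5
reusable_b = shared_prefix_length(cached, edited_early)  # 2
```

A different question at the end keeps five tokens reusable, while a one-token edit near the start leaves only two. This is why stable content belongs at the front of the prompt and changing content at the end.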

Different providers implement prompt caching differently, but the general architecture is similar.

Incoming Request
       ↓
Prefix Detection
       ↓
Cache Lookup
       ↓
Cache Hit?
    /      \
  Yes      No
   |         |
Load KV   Compute KV
   |         |
Generate Response
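The flow in the diagram can be sketched as a lookup keyed by a hash of the prefix tokens. Everything here is a simplified assumption: `PrefixCache` and `fake_compute_kv` are hypothetical names, and the stored value is a placeholder rather than real K/V tensors.

```python
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens):
        # Hash the exact token IDs so only identical prefixes collide.
        data = ",".join(str(t) for t in prefix_tokens).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1            # cache hit: load KV
        else:
            self.misses += 1          # cache miss: compute KV and store it
            self._store[key] = compute_kv(prefix_tokens)
        return self._store[key]

def fake_compute_kv(tokens):
    # Placeholder for the expensive pass through the transformer.
    return {"num_tokens": len(tokens)}

cache = PrefixCache()
prefix = list(range(1000))  # hypothetical token IDs for a shared prefix
first = cache.get_or_compute(prefix, fake_compute_kv)
second = cache.get_or_compute(prefix, fake_compute_kv)
```

The first call misses and pays for the computation. The second call with the same prefix hits and returns the stored state.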

The difficult engineering problems include memory management, distributed routing, and isolation between customers.

GPU memory is limited.

Providers must decide which prefixes to keep in memory, how long to keep them, and which to evict first.

This resembles operating system page management more than traditional web caching.
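One common eviction policy, borrowed from page management, is least-recently-used (LRU). The sketch below is an assumption about how such a policy could look and not a description of any provider's implementation; `LRUPrefixCache` is a hypothetical name, and it tracks only entry sizes rather than real tensors.

```python
from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> size in bytes, oldest first

    def touch(self, key):
        """Mark a prefix as recently used. Returns True on a hit."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        return False

    def insert(self, key, size_bytes):
        """Add a prefix, evicting least-recently-used entries if needed."""
        evicted = []
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)
        while self._entries and self.used_bytes + size_bytes > self.capacity_bytes:
            old_key, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_key)
        if size_bytes <= self.capacity_bytes:  # skip entries that can never fit
            self._entries[key] = size_bytes
            self.used_bytes += size_bytes
        return evicted

cache = LRUPrefixCache(capacity_bytes=10)
cache.insert("prefix-a", 4)
cache.insert("prefix-b", 4)
cache.touch("prefix-a")                # prefix-a is now the most recent
evicted = cache.insert("prefix-c", 4)  # no room: evicts prefix-b
```

Because `prefix-a` was touched most recently, inserting `prefix-c` evicts `prefix-b` instead.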

Large serving systems spread requests across many GPUs.

A cached prefix may exist on GPU A while the next request arrives on GPU B.

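Providers must either route the request to the GPU that already holds the prefix, or transfer or recompute the cached tensors where the request lands.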
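One way to keep a prefix and its follow-up requests on the same GPU is to pick the worker from a hash of the shared prefix. This is a sketch of the idea under simple assumptions; the worker names are hypothetical, and the article does not say which routing strategy any provider uses.

```python
import hashlib

def pick_worker(prefix_tokens, workers):
    """Map a prefix to a worker so identical prefixes land on the same GPU."""
    data = ",".join(map(str, prefix_tokens)).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

workers = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # hypothetical worker names
shared_prefix = list(range(2000))               # hypothetical token IDs

# Only the shared prefix is hashed, not the user message that follows it.
first = pick_worker(shared_prefix, workers)
second = pick_worker(shared_prefix, workers)
```

Plain modulo hashing reshuffles most prefixes when the number of workers changes, so a production system would more likely use consistent hashing or an explicit index of where each prefix lives.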

A cache created for one customer should not leak information to another customer.

Production systems must maintain strict isolation boundaries.
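A simple way to express that boundary is to make the customer part of the cache key, so two customers with byte-identical prompts never share an entry. The `cache_key` function and tenant names are hypothetical, and real isolation involves much more than key design.

```python
import hashlib

def cache_key(tenant_id, prefix_tokens):
    """Build a cache key that is scoped to a single tenant."""
    h = hashlib.sha256()
    h.update(tenant_id.encode("utf-8"))
    h.update(b"\x00")  # separator so the tenant ID cannot blend into the tokens
    h.update(",".join(map(str, prefix_tokens)).encode("utf-8"))
    return h.hexdigest()

tokens = [101, 7, 7, 42]  # hypothetical token IDs
key_a = cache_key("tenant-a", tokens)
key_b = cache_key("tenant-b", tokens)
key_a_again = cache_key("tenant-a", tokens)
```

The same tenant gets the same key on every request, while another tenant sending the same tokens gets a different one.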

Retrieval-Augmented Generation systems that resend the same context on every request are strong candidates for prompt caching.

Imagine a code assistant.

Every request includes:

System Prompt
Repository Rules
Architecture Docs
Coding Standards

Only the user question changes.

Without caching:

20,000 tokens processed
20,000 tokens processed
20,000 tokens processed
20,000 tokens processed

With caching:

20,000-token prefix processed once

Request 2:
reuse cache

Request 3:
reuse cache

Request 4:
reuse cache

Latency drops.

GPU compute per request drops.

Cost drops.

This is one reason why modern coding assistants can feel much faster than their raw context sizes would suggest.
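The savings are easy to quantify by counting how many input tokens actually pass through the model. This sketch uses the 20,000-token prefix from above and assumes a 100-token user message per request; it ignores output tokens and any cache lookup overhead.

```python
def prefill_tokens(prefix_tokens, user_tokens, num_requests, cached):
    """Count input tokens that must be processed across all requests."""
    if not cached:
        # Every request reprocesses the full prompt.
        return num_requests * (prefix_tokens + user_tokens)
    # The prefix is processed once; each request adds only its own message.
    return prefix_tokens + num_requests * user_tokens

without_cache = prefill_tokens(20_000, 100, num_requests=4, cached=False)
with_cache = prefill_tokens(20_000, 100, num_requests=4, cached=True)
print(without_cache, with_cache)  # prints "80400 20400"
```

Across four requests the model processes 20,400 input tokens instead of 80,400, and the gap widens with every additional request that reuses the prefix.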

Today's prompt caching mostly relies on exact token matches.

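Researchers are exploring more ambitious ideas, such as reusing cached state for prompts that are semantically similar or only approximately matching.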

The challenge is preserving correctness.

Exact matches guarantee identical transformer states.

Approximate matches introduce uncertainty.

Future systems may combine both approaches, using exact caches when possible and semantic reuse when beneficial.

Prompt caching is one of the least visible but most impactful optimizations in modern LLM serving.

The important realization is that the cache is not storing text and it is not storing generated responses.

It is storing the expensive internal transformer state—primarily key and value tensors—that would otherwise need to be recomputed.

Once you understand that, prompt caching starts looking less like an application-level optimization and more like a CPU instruction cache or an operating system memory cache: a mechanism for avoiding repeated work by preserving computation that has already been paid for.

As context windows continue growing from tens of thousands to millions of tokens, do you think exact prefix caching will remain dominant, or will future LLM systems need semantic and approximate caching techniques to stay efficient?

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.


source & further reading

dev.to (original article)
