KV cache and PagedAttention: what they do and why they matter

Your production LLM server is running behind schedule. You deployed a 70B model on four A100s with 80 GB each -- within spec, within budget -- but the time-to-first-token is creeping up as concurrent users increase. By lunch, latency is double what it was at 8 AM. You check GPU memory and find that 70% of HBM is consumed by what nvidia-smi reports as "tensor buffers," but which are actually the cached transformer states of a dozen long-running conversations that nobody cleaned up. You restart the server. It works again. By 4 PM, the same slowdown is back.

This is the KV cache memory problem, and it is the single biggest operational bottleneck in production LLM serving on GPUs. This post explains what the KV cache actually stores, why it grows without bound during a conversation, and how PagedAttention -- the technique that powers vLLM -- solves it with OS-inspired memory management.

The KV cache is not optional. Every autoregressive transformer generates tokens one at a time. For token N, the attention mechanism needs the Key and Value tensors from tokens 0 through N-1. Recomputing those from scratch for every new token would be O(N^2) per step -- catastrophic for any conversation longer than a few hundred tokens. Instead, the inference engine caches the K and V tensors from prior tokens and appends to them on each step. That structure is the KV cache.
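As a rough illustration of the caching idea, here is a minimal pure-Python sketch with one attention head and toy two-dimensional vectors. The functions `project_k` and `project_v` are hypothetical stand-ins for the learned projections, and each token vector doubles as its own query for simplicity. The cached path projects only the newest token at each step, while the uncached path re-projects the whole prefix. Both produce the same attention output.

```python
import math

def project_k(x):
    # Hypothetical stand-in for the learned key projection.
    return [2.0 * v for v in x]

def project_v(x):
    # Hypothetical stand-in for the learned value projection.
    return [v + 1.0 for v in x]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens projected so far

    def step(self, x):
        # Only the newest token is projected; earlier K/V entries are reused.
        self.keys.append(project_k(x))
        self.values.append(project_v(x))
        self.projections += 1
        return attend(x, self.keys, self.values)

def step_without_cache(prefix):
    """Recompute K and V for the whole prefix, as an uncached engine would."""
    keys = [project_k(t) for t in prefix]
    values = [project_v(t) for t in prefix]
    return attend(prefix[-1], keys, values), len(prefix)

tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.4, 0.4]]
cache = KVCache()
uncached_projections = 0
for i, tok in enumerate(tokens):
    cached_out = cache.step(tok)
    plain_out, n = step_without_cache(tokens[: i + 1])
    uncached_projections += n
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, uncached_projections)  # 4 10
```

Even in this four-token toy, the uncached path does 10 projections against 4 with the cache, and the gap grows quadratically with sequence length.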

The problem is its memory footprint. For a Llama 3.1 70B model with 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, a single 4096-token sequence requires approximately:

2 (K+V) * 80 layers * 8 KV heads * 128 dim * 4096 tokens * 2 bytes (FP16)
= 1,342,177,280 bytes per sequence
= ~1.3 GB per sequence

For 256 concurrent 4096-token sequences, that is about 344 GB of HBM -- more than four A100s provide (320 GB total). And that is before accounting for the model weights (~140 GB for 70B in FP16), the intermediate activations, the attention score matrices, or any batching overhead.
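The arithmetic is easy to reproduce. The helper below is a small sketch. The name `kv_cache_bytes` is made up for this post, and the HBM total assumes decimal gigabytes. It multiplies out the same factors for one sequence and then scales to 256 sequences.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes of K and V state for one sequence (the leading 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=4096)
total = 256 * per_seq
hbm = 4 * 80 * 10**9  # four 80 GB cards, assuming decimal gigabytes

print(per_seq)                # 1342177280
print(round(total / 1e9, 1))  # 343.6
print(total > hbm)            # True
```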

This is the fundamental tension: the KV cache is mandatory for acceptable latency, but it consumes more memory than the model weights for any workload with meaningful concurrency or long context windows.

In most transformer inference implementations outside vLLM, the KV cache is a pre-allocated contiguous tensor. When a sequence starts, the framework allocates a past_key_values tuple sized for the maximum sequence length (or a user-specified max_new_tokens). The allocation happens up front and stays pinned until the sequence is done.

Here is a simplified view of what happens during a single generation step:


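import torch

def attention_step(query, key_cache, value_cache, current_pos):
    # Assumed layout: caches are pre-allocated as [batch, heads, max_len, head_dim];
    # only positions 0..current_pos hold valid data.
    past_keys = key_cache[:, :, :current_pos + 1, :]
    past_values = value_cache[:, :, :current_pos + 1, :]

    # query is [batch, heads, 1, head_dim] for the single new token.
    head_dim = query.shape[-1]
    scores = torch.matmul(query, past_keys.transpose(-2, -1))
    scores = scores / (head_dim ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    output = torch.matmul(attn, past_values)
    return output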

The contiguous allocation means you pay the maximum possible memory cost from the very first token, even if the conversation never reaches the maximum length. This is fine for offline evaluation with fixed-length sequences, but wasteful in interactive serving where most conversations are short.
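To get a feel for the scale of that waste, the sketch below compares reserved and used token slots under per-sequence maximum-length pre-allocation. The conversation lengths are invented for illustration, and `reserved_vs_used` is a hypothetical helper, not part of any framework.

```python
def reserved_vs_used(actual_lengths, max_len):
    """Return (reserved_slots, used_slots) when every sequence reserves max_len."""
    reserved = max_len * len(actual_lengths)
    used = sum(min(n, max_len) for n in actual_lengths)
    return reserved, used

# Hypothetical mix of conversation lengths, in tokens.
lengths = [150, 300, 4096, 80, 1200, 600, 220, 90]
reserved, used = reserved_vs_used(lengths, max_len=4096)
unused_fraction = 1 - used / reserved

print(reserved, used)            # 32768 6736
print(f"{unused_fraction:.1%}")  # 79.4%
```

With this made-up mix, roughly four out of five reserved slots never hold a token.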

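Three specific inefficiencies arise: memory is reserved for slots that are never filled, sequences with a common prefix each store their own copy of it, and eviction can only happen a whole sequence at a time.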

PagedAttention, introduced by the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon, Li, Zhuang et al., 2023), applies operating-system-style virtual memory paging to the KV cache. Instead of allocating one contiguous block per sequence, the KV cache is divided into fixed-size blocks called pages -- typically 16 or 32 tokens per page. The attention kernel is modified to gather key and value data from non-contiguous physical pages during the attention computation.

flowchart TB
    subgraph Virtual["Virtual KV Cache (per sequence)"]
        S1[Sequence 1\npages: A, B, C, D]
        S2[Sequence 2\npages: E, F]
        S3[Sequence 3\npages: G, H, I, J, K]
    end

    subgraph PageTable["Logical-to-Physical Mapping"]
        P0["A -> Frame 0"]
        P1["B -> Frame 3"]
        P2["C -> Frame 7"]
        P3["D -> Frame 11"]
        P4["E -> Frame 1"]
        P5["F -> Frame 4"]
        P6["G -> Frame 2"]
        P7["H -> Frame 5"]
        P8["I -> Frame 8"]
        P9["J -> Frame 9"]
        P10["K -> Frame 6"]
    end

    subgraph Physical["Physical Memory Frames (GPU HBM)"]
        M0[(Frame 0)]
        M1[(Frame 1)]
        M2[(Frame 2)]
        M3[(Frame 3)]
        M4[(Frame 4)]
        M5[(Frame 5)]
        M6[(Frame 6)]
        M7[(Frame 7)]
        M8[(Frame 8)]
        M9[(Frame 9)]
        M10[(Frame 10)]
        M11[(Frame 11)]
    end

    S1 --> P0 & P1 & P2 & P3
    S2 --> P4 & P5
    S3 --> P6 & P7 & P8 & P9 & P10

    P0 --> M0
    P1 --> M3
    P2 --> M7
    P3 --> M11
    P4 --> M1
    P5 --> M4
    P6 --> M2
    P7 --> M5
    P8 --> M8
    P9 --> M9
    P10 --> M6

The block manager maintains a page table that maps each sequence's logical page numbers to physical frame numbers. When the attention kernel needs the key-value data for a token at a given position, it computes which page that position falls in, reads the page table to find the physical frame, and loads the data from that frame. The layout is invisible to the model -- the attention output is mathematically identical to the contiguous case.
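The lookup described above can be sketched in a few lines. This is an illustration only: it assumes 16-token pages and reuses the mapping for Sequence 1 from the diagram (logical pages A, B, C, D in frames 0, 3, 7, 11). The function name `locate` is hypothetical, and a real kernel does this translation on the GPU.

```python
PAGE_SIZE = 16  # tokens per page (assumed)

# Page table for Sequence 1 in the diagram: logical pages A, B, C, D
# map to physical frames 0, 3, 7, 11.
page_table_seq1 = [0, 3, 7, 11]

def locate(position, page_table, page_size=PAGE_SIZE):
    """Translate a token position into (physical_frame, offset_within_frame)."""
    logical_page, offset = divmod(position, page_size)
    if logical_page >= len(page_table):
        raise IndexError(f"position {position} has no page allocated")
    return page_table[logical_page], offset

print(locate(0, page_table_seq1))   # (0, 0)
print(locate(20, page_table_seq1))  # (3, 4)
print(locate(63, page_table_seq1))  # (11, 15)
```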

This design unlocks three capabilities that are not available with contiguous allocation:

1. On-demand allocation. A sequence only consumes pages as it grows. If a user asks a one-turn question that generates 150 tokens, the cache uses 10 pages (at 16 tokens per page). If another user runs a 5000-token document analysis, pages are allocated dynamically. The only unused capacity is the tail of each sequence's last page, at most 15 token slots with 16-token pages.

2. Copy-on-write for shared prefix pages. When multiple sequences share a common prefix -- the system prompt, the conversation history, few-shot examples -- PagedAttention maps the same physical pages into multiple virtual address spaces. The pages are marked read-only. If one sequence diverges during generation (which it always will after the first sampling step), only the page that actually changes is copied. In many chat applications, 40-60% of the tokens in a batch can be shared prefix tokens, so the memory savings are substantial.

3. Fine-grained eviction and swapping. When GPU memory is exhausted, the block manager selects pages to evict based on a least-recently-used policy. Evicted pages are written to CPU DRAM. Because pages are small (16-32 tokens), the transfer granularity is fine and the PCIe bandwidth cost is amortized across many small transfers rather than one large blocking move.
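The first two capabilities can be illustrated with a toy block manager. This is a sketch under simplifying assumptions, not vLLM's implementation: the class and method names are invented, frames are plain integers, no tensor data is moved, and eviction and swapping are left out.

```python
class BlockManager:
    """Toy allocator: hands out fixed-size frames and shares them via ref counts."""

    def __init__(self, num_frames, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_frames))
        self.refs = {}     # frame -> number of sequences mapping it
        self.tables = {}   # seq_id -> list of frames (the page table)
        self.lengths = {}  # seq_id -> number of tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("no free frames; a real engine would evict or swap")
        frame = self.free.pop(0)
        self.refs[frame] = 1
        return frame

    def add_sequence(self, seq_id):
        self.tables[seq_id] = []
        self.lengths[seq_id] = 0

    def fork(self, parent, child):
        """Share every page of `parent` with `child` (for example a common prefix)."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for frame in self.tables[child]:
            self.refs[frame] += 1

    def append_token(self, seq_id):
        table = self.tables[seq_id]
        if self.lengths[seq_id] % self.page_size == 0:
            # Last page is full (or there is no page yet): allocate on demand.
            table.append(self._alloc())
        elif self.refs[table[-1]] > 1:
            # Writing into a shared, partially filled page: copy-on-write.
            self.refs[table[-1]] -= 1
            table[-1] = self._alloc()  # a real kernel would also copy the data
        self.lengths[seq_id] += 1

mgr = BlockManager(num_frames=8)
mgr.add_sequence("a")
for _ in range(40):     # 40 tokens -> 3 pages (16 + 16 + 8)
    mgr.append_token("a")
mgr.fork("a", "b")      # b shares all three pages with a
mgr.append_token("b")   # b writes into the shared last page -> private copy

print(mgr.tables["a"])  # [0, 1, 2]
print(mgr.tables["b"])  # [0, 1, 3]
print(len(mgr.free))    # 4
```

The two sequences hold 81 tokens between them but occupy only four frames, because the two full prefix pages are stored once.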

| Aspect | Traditional contiguous KV cache | PagedAttention |
| --- | --- | --- |
| Allocation strategy | Pre-allocate max length per sequence | On-demand, one page at a time |
| Memory waste due to fragmentation | High (allocated but unused slots) | Near zero (pay for used tokens only) |
| Shared prefix support | None (every sequence stores its own copy) | Copy-on-write page sharing |
| Eviction granularity | Entire sequence | 16-32 token pages |
| Swap overhead per eviction | High (full sequence over PCIe) | Low (single page) |
| Peak throughput at same HBM budget | Baseline | 2-4x on mixed workloads |
| Batch size ceiling | Limited by worst-case per-sequence allocation | Limited by actual memory consumption |

The throughput gains are workload-dependent. vLLM's published benchmarks report 2-4x improvement over frameworks with contiguous allocation, with the largest gains on workloads that mix short and long sequences. For uniform-length batches, the advantage shrinks.

1. Page table overhead with very small page sizes. The page table itself lives in GPU memory. With page sizes of 4-8 tokens, the metadata can consume a non-trivial fraction of HBM. vLLM defaults to 16-token pages as the practical sweet spot. If you observe lower-than-expected throughput with very long contexts, check whether your page size is too small.

2. Scheduler parameters that work against PagedAttention. vLLM exposes --max-num-batched-tokens and --max-num-seqs, which control how many tokens and sequences are batched in a single iteration. Setting these too high can oversubscribe the KV cache and force preemptions without improving throughput. Setting them too low underutilizes the GPU. The general guidance is to start with --max-num-seqs 256 and --max-num-batched-tokens 8192 for a 70B model and tune from there.

3. Prefix caching is not unconditionally beneficial. vLLM's automatic prefix caching (--enable-prefix-caching) computes a hash for every block of tokens. For very short prompts or rapidly rotating system prompts, the hash computation overhead can exceed the reuse benefit. Profile with and without it for your workload.

4. Interaction with KV cache quantization. PagedAttention works with FP8 and INT4 KV cache quantization, but each page carries metadata that is proportionally more significant when the data per page is smaller. vLLM supports an FP8 KV cache on Ada Lovelace and Hopper GPUs, enabled with --kv-cache-dtype fp8; check the release notes for the version you run. Measure the combined effect before enabling.
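To keep the flags from the list above in one place, the following sketch assembles a launch command as a Python argument list. The `vllm serve` entry point, the model identifier, and the tensor-parallel setting for four GPUs are assumptions about your deployment. The FP8 and prefix-caching switches are included only to show where they go, not as a recommendation to enable them unmeasured.

```python
import shlex

# Assumed model identifier; substitute your own.
model = "meta-llama/Llama-3.1-70B-Instruct"

# Starting values from the text; tune them for your workload.
options = {
    "--tensor-parallel-size": 4,       # assumed: one shard per GPU
    "--max-num-seqs": 256,
    "--max-num-batched-tokens": 8192,
    "--kv-cache-dtype": "fp8",         # optional; measure before enabling
}
switches = ["--enable-prefix-caching"]  # optional; profile with and without

argv = ["vllm", "serve", model]
for flag, value in options.items():
    argv += [flag, str(value)]
argv += switches

command = shlex.join(argv)
print(command)
```

Keeping the values in a dictionary makes it straightforward to sweep one parameter at a time while benchmarking.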

PagedAttention and vLLM are not the right choice for every deployment:

Single-user local inference. If you run a model for one user on one GPU, the memory pressure that PagedAttention solves never arises. A simpler framework like llama.cpp or Hugging Face Transformers has lower overhead and fewer failure modes.

Sub-100ms interactive latency requirements. The page-walking logic during attention adds a small but measurable overhead per token -- roughly 3-5% for 16-token pages. If your application requires consistent sub-100ms time-to-first-token, a contiguous cache with static pre-allocation gives lower tail latency (at the cost of lower throughput).

Small models on high-memory GPUs. A 7B model on an A100-80GB uses about 14 GB for weights and, at 4096-token context, roughly 0.5 GB of KV cache per sequence with grouped-query attention (32 layers, 8 KV heads, head dimension 128, FP16), or about 2 GB with full multi-head attention (32 KV heads). At modest concurrency levels, the cache fits easily without paging. PagedAttention's complexity buys you nothing here.

Non-autoregressive architectures. Models that do not generate tokens left-to-right -- encoder-only models (BERT, RoBERTa), diffusion-based language models, non-causal decoders -- have no KV cache to manage. PagedAttention is specific to autoregressive decoding.

Uniform-length offline evaluation. If every sequence in a batch is the same length (common in evaluation benchmarks), the fragmentation and on-demand benefits of paging are minimal. The contiguous approach works fine.

--max-num-seqs and --max-num-batched-tokens for your model and workload.

Next post: vLLM vs TGI vs llama.cpp -- a practical serving benchmark for the same 70B model under realistic concurrency, comparing throughput, latency, and cost per token.

Source: dev.to (original article)
