Your production LLM server is running behind schedule. You deployed a 70B model on four A100s with 80 GB each -- within spec, within budget -- but the time-to-first-token is creeping up as concurrent users increase. By lunch, latency is double what it was at 8 AM. You check GPU memory and find that 70% of HBM is consumed by memory that nvidia-smi attributes only to the server process, but which is actually the cached transformer states of a dozen long-running conversations that nobody cleaned up. You restart the server. It works again. By 4 PM, the same slowdown is back.
This is the KV cache memory problem, and it is the single biggest operational bottleneck in production LLM serving on GPUs. This post explains what the KV cache actually stores, why it grows with every token of a conversation, and how PagedAttention -- the technique that powers vLLM -- solves it with OS-inspired memory management.
The KV cache is not optional. Every autoregressive transformer generates tokens one at a time. For token N, the attention mechanism needs the Key and Value tensors from tokens 0 through N-1. Recomputing those from scratch for every new token would be O(N^2) per step -- catastrophic for any conversation longer than a few hundred tokens. Instead, the inference engine caches the K and V tensors from prior tokens and appends to them on each step. That structure is the KV cache.
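To make the saving concrete, the sketch below counts how many per-token key/value projections a decoder performs with and without a cache. It is a counting model only: the function names are hypothetical, and the cost of the attention computation itself is ignored.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Every step re-projects K and V for all tokens seen so far."""
    return sum(step for step in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """Every step projects K and V for the newest token only."""
    return num_tokens


if __name__ == "__main__":
    n = 1000
    print(kv_projections_without_cache(n))  # 500500
    print(kv_projections_with_cache(n))     # 1000
```

The gap between the two counts widens quadratically with sequence length, which is why every serving engine pays the memory cost of the cache.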
The problem is its memory footprint. For a Llama 3.1 70B model with 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, a single 4096-token sequence requires approximately:
2 (K+V) * 80 layers * 8 KV heads * 128 dim * 4096 tokens * 2 bytes (FP16)
= 1,342,177,280 bytes per sequence
= ~1.3 GB per sequence
For 256 concurrent 4096-token sequences, that is about 344 GB of HBM (320 GiB) -- more than four A100s provide (320 GB total). And that is before accounting for the model weights (~140 GB for 70B in FP16), the intermediate activations, the attention scores matrix, or any batching overhead.
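The arithmetic is easy to wrap in a helper so you can plug in your own model. The sketch below is an estimate under stated assumptions: `kv_cache_bytes` is a hypothetical name, and the budget simply subtracts ~140 GB of FP16 weights from 4 x 80 GB, ignoring activations and other overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes of K and V stored for one sequence (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


per_sequence = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=4096)

# Assumed budget: 4 x 80 GB of HBM minus ~140 GB of FP16 weights.
hbm_budget = 4 * 80 * 10**9 - 140 * 10**9
max_sequences = hbm_budget // per_sequence

print(per_sequence)   # 1342177280
print(max_sequences)  # 134
```

Under these assumptions, the hardware in the opening scenario tops out at roughly 134 full-length sequences even before any other overhead is counted.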
This is the fundamental tension: the KV cache is mandatory for acceptable latency, but it can consume more memory than the model weights once concurrency or context length grows. At roughly 1.3 GB per 4096-token sequence, a little over 100 such sequences already match the ~140 GB of weights.
In most transformer inference implementations outside vLLM, the KV cache is a pre-allocated contiguous tensor. When a sequence starts, the framework allocates a `past_key_values` tuple sized for the maximum sequence length (or a user-specified `max_new_tokens`). The allocation happens up front and stays pinned until the sequence is done.
Here is a simplified view of what happens during a single generation step:
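```python
import torch


def attention_step(query, key_cache, value_cache, current_pos):
    # Slice the pre-allocated cache down to the positions filled so far.
    past_keys = key_cache[:, :, :current_pos + 1, :]
    past_values = value_cache[:, :, :current_pos + 1, :]
    # Scaled dot-product attention of the new query against all cached keys.
    head_dim = query.size(-1)
    scores = torch.matmul(query, past_keys.transpose(-2, -1))
    scores = scores / (head_dim ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    output = torch.matmul(attn, past_values)
    return output
```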
The contiguous allocation means you pay the maximum possible memory cost from the very first token, even if the conversation never reaches the maximum length. This is fine for offline evaluation with fixed-length sequences, but wasteful in interactive serving where most conversations are short.
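A rough way to see the cost of up-front allocation is to compare reserved slots with the slots that actually get filled. The sketch below uses a hypothetical mix of conversation lengths and a hypothetical helper name; the numbers are illustrative, not measurements.

```python
def reserved_vs_used(actual_lengths, max_len):
    """Return (reserved_tokens, used_tokens) for up-front allocation."""
    reserved = max_len * len(actual_lengths)
    used = sum(min(n, max_len) for n in actual_lengths)
    return reserved, used


# Hypothetical mix: mostly short chats, one long document analysis.
lengths = [150, 300, 220, 4000, 90, 512]
reserved, used = reserved_vs_used(lengths, max_len=4096)
utilization = used / reserved

print(reserved, used)        # 24576 5272
print(f"{utilization:.1%}")  # 21.5%
```

In this made-up batch, almost four fifths of the reserved cache slots are never written.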
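Three specific inefficiencies arise: memory is reserved for tokens that are never generated, sequences that share a prefix each store their own copy of it, and eviction can only operate on whole sequences.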
PagedAttention, introduced by the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon, Li, Zhuang et al., 2023), applies operating-system-style virtual memory paging to the KV cache. Instead of allocating one contiguous block per sequence, the KV cache is divided into fixed-size blocks called pages -- typically 16 or 32 tokens per page. The attention kernel is modified to gather key and value data from non-contiguous physical pages during the attention computation.
```mermaid
flowchart TB
    subgraph Virtual["Virtual KV Cache (per sequence)"]
        S1[Sequence 1\npages: A, B, C, D]
        S2[Sequence 2\npages: E, F]
        S3[Sequence 3\npages: G, H, I, J, K]
    end
    subgraph PageTable["Logical-to-Physical Mapping"]
        P0["A -> Frame 0"]
        P1["B -> Frame 3"]
        P2["C -> Frame 7"]
        P3["D -> Frame 11"]
        P4["E -> Frame 1"]
        P5["F -> Frame 4"]
        P6["G -> Frame 2"]
        P7["H -> Frame 5"]
        P8["I -> Frame 8"]
        P9["J -> Frame 9"]
        P10["K -> Frame 6"]
    end
    subgraph Physical["Physical Memory Frames (GPU HBM)"]
        M0[(Frame 0)]
        M1[(Frame 1)]
        M2[(Frame 2)]
        M3[(Frame 3)]
        M4[(Frame 4)]
        M5[(Frame 5)]
        M6[(Frame 6)]
        M7[(Frame 7)]
        M8[(Frame 8)]
        M9[(Frame 9)]
        M10[(Frame 10)]
        M11[(Frame 11)]
    end
    S1 --> P0 & P1 & P2 & P3
    S2 --> P4 & P5
    S3 --> P6 & P7 & P8 & P9 & P10
    P0 --> M0
    P1 --> M3
    P2 --> M7
    P3 --> M11
    P4 --> M1
    P5 --> M4
    P6 --> M2
    P7 --> M5
    P8 --> M8
    P9 --> M9
    P10 --> M6
```
The block manager maintains a page table that maps each sequence's logical page numbers to physical frame numbers. When the attention kernel needs the key-value data for a token at a given position, it computes which page that position falls in, reads the page table to find the physical frame, and loads the data from that frame. The layout is invisible to the model -- the attention output is mathematically identical to the contiguous case.
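The lookup described here can be sketched in a few lines of plain Python. This is an illustration, not vLLM's kernel: the names `translate` and `gather` are hypothetical, the page table copies Sequence 1 from the diagram, and integers stand in for per-token key/value entries.

```python
PAGE_SIZE = 16  # tokens per page, one of the typical sizes mentioned in the text

# Logical page index -> physical frame, following Sequence 1 in the diagram.
page_table_seq1 = [0, 3, 7, 11]


def translate(page_table, position, page_size=PAGE_SIZE):
    """Map a token position to (physical_frame, offset_within_frame)."""
    logical_page, offset = divmod(position, page_size)
    return page_table[logical_page], offset


def gather(frames, page_table, length, page_size=PAGE_SIZE):
    """Reassemble a sequence's entries in logical order from scattered frames."""
    out = []
    for pos in range(length):
        frame, offset = translate(page_table, pos, page_size)
        out.append(frames[frame][offset])
    return out


# Twelve physical frames, as in the diagram; write 50 entries, then read back.
frames = {f: [None] * PAGE_SIZE for f in range(12)}
tokens = list(range(50))  # stand-ins for per-token K/V entries
for pos, tok in enumerate(tokens):
    frame, offset = translate(page_table_seq1, pos)
    frames[frame][offset] = tok

print(translate(page_table_seq1, 37))                          # (7, 5)
print(gather(frames, page_table_seq1, len(tokens)) == tokens)  # True
```

Because `gather` returns the entries in their original logical order, anything computed from them matches the contiguous case, which is the property the paragraph above relies on.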
This design unlocks three capabilities that are not available with contiguous allocation:
1. On-demand allocation. A sequence only consumes pages as it grows. If a user asks a one-turn question that generates 150 tokens, the cache uses 10 pages (at 16 tokens per page). If another user runs a 5000-token document analysis, pages are allocated dynamically as that sequence grows. The only unused capacity is the unfilled tail of each sequence's last page.
2. Copy-on-write for shared prefix pages. When multiple sequences share a common prefix -- the system prompt, the conversation history, a set of few-shot examples -- PagedAttention maps the same physical pages into multiple virtual address spaces. The pages are marked read-only. If one sequence diverges during generation (which it always will after the first sampling step), only the page that actually changes is copied. In many chat applications, 40-60% of the tokens in a batch can be shared prefix tokens, so the memory savings are substantial.
3. Fine-grained eviction and swapping. When GPU memory is exhausted, the block manager can select pages to evict, for example with a least-recently-used policy. Evicted pages are written to CPU DRAM. Because pages are small (16-32 tokens), data can move over PCIe in page-sized units instead of as one large blocking transfer of an entire sequence.
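The first two capabilities can be sketched with a free list and per-frame reference counts. This is a toy model under stated assumptions, not vLLM's block manager: the `BlockManager` class and its methods are hypothetical, eviction is left out, and the actual data copy on a copy-on-write is omitted.

```python
import math


class BlockManager:
    """Toy allocator: free list of frames plus per-frame reference counts."""

    def __init__(self, num_frames, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_frames))
        self.refcount = {}
        self.tables = {}   # seq_id -> list of physical frames
        self.lengths = {}  # seq_id -> tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("no free frames; a real system would evict or swap")
        frame = self.free.pop(0)
        self.refcount[frame] = 1
        return frame

    def add_sequence(self, seq_id, num_tokens):
        """Allocate only as many pages as the current tokens need."""
        pages = math.ceil(num_tokens / self.page_size)
        self.tables[seq_id] = [self._alloc() for _ in range(pages)]
        self.lengths[seq_id] = num_tokens

    def fork(self, parent_id, child_id):
        """Share every page of the parent with the child (no copying)."""
        self.tables[child_id] = list(self.tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for frame in self.tables[child_id]:
            self.refcount[frame] += 1

    def append_token(self, seq_id):
        """Grow by one token, allocating or copying a page only if needed."""
        table, length = self.tables[seq_id], self.lengths[seq_id]
        if length % self.page_size == 0:
            table.append(self._alloc())    # last page is full: new page
        elif self.refcount[table[-1]] > 1:
            self.refcount[table[-1]] -= 1  # shared tail page: copy on write
            table[-1] = self._alloc()      # (the data copy itself is omitted)
        self.lengths[seq_id] = length + 1


mgr = BlockManager(num_frames=32)
mgr.add_sequence("a", 150)   # 10 pages at 16 tokens per page
mgr.fork("a", "b")           # still 10 frames in use
mgr.append_token("b")        # copies only the partially filled last page
used = 32 - len(mgr.free)
print(len(mgr.tables["a"]), used)  # 10 11
```

After the fork and one appended token, the two sequences still share nine of their ten pages; only the diverging tail page has its own frame.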
| Aspect | Traditional contiguous KV cache | PagedAttention |
|---|---|---|
| Allocation strategy | Pre-allocate max length per sequence | On-demand, one page at a time |
| Memory waste due to fragmentation | High (allocated but unused slots) | Near zero (pay for used tokens only) |
| Shared prefix support | None (every sequence stores its own copy) | Copy-on-write page sharing |
| Eviction granularity | Entire sequence | 16-32 token pages |
| Swap overhead per eviction | High (full sequence over PCIe) | Low (single page) |
| Peak throughput at same HBM budget | Baseline | 2-4x on mixed workloads |
| Batch size ceiling | Limited by worst-case per-sequence allocation | Limited by actual memory consumption |
The throughput gains are workload-dependent. vLLM's published benchmarks report 2-4x improvement over frameworks with contiguous allocation, with the largest gains on workloads that mix short and long sequences. For uniform-length batches, the advantage shrinks.
1. Page table overhead with very small page sizes. The page table itself lives in GPU memory. With page sizes of 4-8 tokens, the metadata can consume a non-trivial fraction of HBM. vLLM defaults to 16-token pages as the practical sweet spot. If you observe lower-than-expected throughput with very long contexts, check whether your page size is too small.
2. Scheduler parameters that work against PagedAttention. vLLM exposes `--max-num-batched-tokens` and `--max-num-seqs`, which control how many tokens and sequences are batched in a single iteration. Setting these too high increases KV cache pressure and preemptions without improving throughput. Setting them too low underutilizes the GPU. The general guidance is to start with `--max-num-seqs 256` and `--max-num-batched-tokens 8192` for a 70B model and tune from there.
3. Prefix caching is not unconditionally beneficial. vLLM's automatic prefix caching (`--enable-prefix-caching`) computes a hash for every block of tokens. For very short prompts or rapidly rotating system prompts, the hash computation overhead can exceed the reuse benefit. Profile with and without it for your workload.
4. Interaction with KV cache quantization. PagedAttention works with FP8 and INT4 KV cache quantization, but each page carries metadata that is proportionally more significant when the data per page is smaller. vLLM supports an FP8 KV cache on Ada Lovelace and Hopper GPUs, enabled with `--kv-cache-dtype fp8`; check the release notes for the version you run. Measure the combined effect before enabling.
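When profiling these settings against each other, it helps to generate the launch command from one place. The helper below is a hypothetical convenience function, not part of vLLM; the model name and the `--tensor-parallel-size` value are assumptions matching the four-GPU scenario in this post, and the remaining flags are the ones discussed above.

```python
import shlex


def build_vllm_command(model, tensor_parallel_size=4, max_num_seqs=256,
                       max_num_batched_tokens=8192, prefix_caching=False,
                       kv_cache_dtype=None):
    """Assemble a `vllm serve` command line from the tuning knobs above."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]
    if prefix_caching:
        cmd.append("--enable-prefix-caching")
    if kv_cache_dtype is not None:
        cmd += ["--kv-cache-dtype", kv_cache_dtype]
    return cmd


# Assumed model name; substitute whatever you actually serve.
baseline = build_vllm_command("meta-llama/Llama-3.1-70B-Instruct")
variant = build_vllm_command("meta-llama/Llama-3.1-70B-Instruct",
                             prefix_caching=True, kv_cache_dtype="fp8")
print(shlex.join(baseline))
print(shlex.join(variant))
```

Running the same load test against `baseline` and `variant` is one way to measure whether prefix caching and an FP8 cache help your workload before you commit to them.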
PagedAttention and vLLM are not the right choice for every deployment:
Single-user local inference. If you run a model for one user on one GPU, the memory pressure that PagedAttention solves never arises. A simpler framework like llama.cpp or Hugging Face Transformers has lower overhead and fewer failure modes.
Sub-100ms interactive latency requirements. The page-walking logic during attention adds a small but measurable overhead per token -- roughly 3-5% for 16-token pages. If your application requires consistent sub-100ms time-to-first-token, a contiguous cache with static pre-allocation gives lower tail latency (at the cost of lower throughput).
Small models on high-memory GPUs. A 7B model on an A100-80GB uses about 14 GB for weights and, at 4096-token context, roughly 0.5 GB of FP16 KV cache per sequence (assuming 32 layers and 8 KV heads with head dimension 128; about 2 GB with 32 KV heads). At typical concurrency levels, the cache fits easily without paging. PagedAttention's complexity buys you nothing here.
Non-autoregressive architectures. Models that do not generate tokens left-to-right -- encoder-only models (BERT, RoBERTa), diffusion-based language models, non-causal decoders -- have no KV cache to manage. PagedAttention is specific to autoregressive decoding.
Uniform-length offline evaluation. If every sequence in a batch is the same length (common in evaluation benchmarks), the fragmentation and on-demand benefits of paging are minimal. The contiguous approach works fine.
Tune `--max-num-seqs` and `--max-num-batched-tokens` for your model and workload.

Next post: vLLM vs TGI vs llama.cpp -- a practical serving benchmark for the same 70B model under realistic concurrency, comparing throughput, latency, and cost per token.