You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly `2 × 80 layers × 8 KV heads × 32768 tokens × 128 head_dim × 2 bytes ≈ 10.7 GB` per request. Two hundred of those, and the H100s are paging to CPU. The model itself fits; the attention state doesn't. This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks.
Here's how it works, what the formats are, and where the footguns hide.
The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with `batch_size × seq_len` and stays allocated until the request ends. On a long-context workload, it dominates.
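The arithmetic from the opening paragraph is easy to reproduce. This is a minimal sketch, assuming the Llama-3-70B shape quoted above (80 layers, 8 KV heads, head_dim 128); the helper name `kv_cache_bytes` is made up for this post.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem, batch_size=1):
    """Bytes of K and V held for batch_size sequences of seq_len tokens."""
    # The leading 2 counts K and V; every other factor scales the total linearly.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch_size

# Shape figures taken from the text above.
LLAMA3_70B = dict(layers=80, kv_heads=8, head_dim=128)

per_request = kv_cache_bytes(**LLAMA3_70B, seq_len=32768, bytes_per_elem=2)
fleet = kv_cache_bytes(**LLAMA3_70B, seq_len=32768, bytes_per_elem=2, batch_size=200)

print(f"one 32k request in BF16: {per_request / 1e9:.1f} GB")
print(f"200 such requests:       {fleet / 1e9:.0f} GB (8x H100 = 640 GB of HBM)")
```

Because every factor is multiplicative, going from 8k to 32k quadruples the cache at a fixed batch size, which is why a box that was comfortable at 8k falls over at 32k.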
KV cache quantization trades a small amount of representational precision for a 2x or 4x reduction in cache footprint, with no model-weight change. FP8 and INT8 give ~50% of the BF16 footprint. INT4 (KIVI, KVQuant, ZipCache-style) gives 25%. The question is what that compression costs in output quality, in serving complexity, and — the part most blog posts skip — in compatibility with the other serving features you already turned on.
The economic case is straightforward. Doubling the KV cache budget on a 70B at 32k means either ~21 GB more HBM (one extra H100 per ~10 concurrent users at 32k) or 2x fewer concurrent users per box. The quality cost of FP8 KV cache, measured on the standard long-context benchmarks, is typically under 0.5 percentage points on retrieval-heavy tasks. That's a 50% infra saving for a sub-half-point accuracy loss. The trade is favorable; the engineering is not free.
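To see what the halving buys in concurrency, here is a rough capacity sketch. The 400 GB cache budget is a hypothetical figure chosen for illustration, not a measurement, and the small overhead of the stored scales is ignored.

```python
# Elements in one request's KV cache for the 70B Llama-3 shape at 32k:
# K and V, 80 layers, 8 KV heads, head_dim 128, 32768 tokens.
ELEMS_PER_REQUEST = 2 * 80 * 8 * 128 * 32768

BITS = {"bf16": 16, "fp8": 8, "int8": 8, "int4": 4}

def max_concurrent(cache_budget_gb, fmt):
    """How many 32k requests fit in a given cache budget (scales ignored)."""
    bytes_per_request = ELEMS_PER_REQUEST * BITS[fmt] / 8
    return int(cache_budget_gb * 1e9 // bytes_per_request)

CACHE_BUDGET_GB = 400  # hypothetical: HBM left over after weights and activations

for fmt in BITS:
    print(f"{fmt:>5}: {max_concurrent(CACHE_BUDGET_GB, fmt)} concurrent 32k requests")
```

The absolute numbers depend entirely on the budget you plug in; the ratio between formats is what carries over to a real deployment.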
Standard BF16 attention stores the K and V tensors at full precision. At every attention step, the model reads every past K and V. Quantization compresses these stored tensors using a lower-precision format, with a dequantization step fused into the attention kernel right before the matmul.
The pipeline looks like this:
```mermaid
flowchart LR
    A[New token<br/>embedding] --> B[Project to Kt Vt<br/>BF16, in registers]
    B --> C[Quantize Kt Vt<br/>per-token / per-head]
    C --> D[Store in<br/>KV cache: FP8/INT8]
    D --> E[On next step:<br/>load cached K and V]
    E --> F[Dequantize on-the-fly<br/>inside attention kernel]
    F --> G[Attention matmul<br/>BF16, full precision]
    G --> H[Output projection]
```
Three things to notice: the activations being added to the cache are quantized only at storage time, with the full BF16 values available for the scale calculation. The attention matmul still happens in BF16 or FP16 — you save memory bandwidth, not FLOPs. And the per-token or per-head scales (a few KB for an 8k context) are stored alongside in BF16; they are what makes the rest of the math work.
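The quantize-on-store, dequantize-on-load round trip can be sketched in a few lines of plain Python. This is a toy illustration of per-token asymmetric INT8, not the fused kernel any serving stack actually runs; the function names are hypothetical, and the zero point is kept as the float minimum of the vector (integer zero points are a common alternative).

```python
def quantize_per_token(vec, bits=8):
    """Asymmetric quantization of one token's K (or V) vector.

    Returns integer codes plus the (scale, zero_point) needed to undo them.
    """
    levels = (1 << bits) - 1             # 255 for INT8
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels or 1.0    # fall back to 1.0 for a constant vector
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo

def dequantize(codes, scale, zero_point):
    """The step the attention kernel performs on the fly before the matmul."""
    return [c * scale + zero_point for c in codes]

k_t = [0.12, -1.50, 0.33, 2.75, -0.04, 0.91]   # toy vector with head_dim = 6
codes, scale, zp = quantize_per_token(k_t)
restored = dequantize(codes, scale, zp)
worst = max(abs(a - b) for a, b in zip(k_t, restored))
print(f"scale={scale:.5f} zero_point={zp} worst-case error={worst:.5f}")
```

The scale is computed from the full-precision values before anything is rounded, which matches the first point above: the error per element is bounded by half a quantization step for that token.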
Five formats dominate production serving stacks in 2026. The list is in roughly the order they were adopted.
| Format | Bits | Granularity | Hardware support | Used by |
|---|---|---|---|---|
| BF16 (baseline) | 16 | n/a | Native on Ampere+ | Everything |
| FP8 E4M3 | 8 | Per-tensor, per-head, or per-token | H100, H200, B100, B200, MI300X | vLLM, TRT-LLM, SGLang |
| FP8 E5M2 | 8 | Same as above | Same as above | Less common for KV; wider dynamic range |
| INT8 (per-token) | 8 | Per-token, asymmetric | Universal via Triton/CUDA | vLLM, TGI, llama.cpp |
| INT4 (KVQuant / KIVI / ZipCache) | 4 | Mixed: K per-channel, V per-token | Universal | Research, llama.cpp (some targets) |
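
To make the difference between the two FP8 rows concrete, the sketch below rounds a few values to each format. It assumes the usual FP8 definitions (E4M3 with 3 mantissa bits and a largest finite value of 448, E5M2 with 2 mantissa bits and a largest finite value of 57344) and saturates instead of overflowing; the helper `round_to_fp8` is illustrative, not a library call.

```python
import math

# (mantissa bits, exponent bias, largest finite value) for the two FP8 variants.
FP8_FORMATS = {
    "e4m3": (3, 7, 448.0),
    "e5m2": (2, 15, 57344.0),
}

def round_to_fp8(x, fmt):
    """Round x to the nearest value representable in the given FP8 format.

    Saturates at the largest finite value instead of producing inf or NaN.
    """
    man_bits, bias, max_val = FP8_FORMATS[fmt]
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_val)
    # frexp gives a = m * 2**e with m in [0.5, 1), so the binade exponent is e - 1.
    exponent = max(math.frexp(a)[1] - 1, 1 - bias)   # clamp into the subnormal range
    step = 2.0 ** (exponent - man_bits)               # spacing between neighbours
    return sign * min(round(a / step) * step, max_val)

for value in (1.1, 0.001, 1000.0):
    print(value, {fmt: round_to_fp8(value, fmt) for fmt in FP8_FORMATS})
```

E4M3 lands closer to 1.1 because it has the extra mantissa bit, while E5M2 handles the very small and very large values better because it has the extra exponent bit. That is the "wider dynamic range" trade named in the table.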
A few notes on the table:
- Per-token quantization gives each token its own `(scale, zero_point)` pair. Per-channel (one scale per head_dim slice) is faster on hardware but slightly less accurate. Per-tensor (one scale for the whole cache) is cheapest and loses the most.

The CLI flag is `--kv-cache-dtype`. In vLLM v0.22.1, accepted values are `auto`, `fp8` (E4M3), `fp8_e5m2`, `int8`, and `bf16` (the default; `auto` resolves to `bf16` unless the model is detected as FP8-native). For an OpenAI-compatible serve:
```shell
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```
For programmatic use:
```python
from vllm import LLM

# FP8 (E4M3) KV cache, 32k context, 8-way tensor parallel.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    max_model_len=32768,
)
```
On H100, the FP8 path goes through Transformer Engine's fused attention; on B100/B200 it goes through FlashAttention-3 FP8 kernels. On Ampere and older hardware (A100, RTX 3090) the FP8 flag is a no-op or a slow path, because those GPUs have no native FP8 tensor cores. INT8, by contrast, runs everywhere via Triton.
One production detail: `--kv-cache-dtype fp8` on an H100 reduces KV cache memory by ~50% but does not reduce the model's weight footprint. The 70B in BF16 is still 140 GB. The savings are real but bounded by the cache-to-weight ratio of your workload: long-context, high-concurrency workloads benefit most.
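The cache-to-weight bound is easy to put numbers on. A minimal sketch, using the 70B shape from earlier and two hypothetical workloads picked only to show the two ends of the range:

```python
WEIGHTS_GB = 140.0                                   # 70B parameters in BF16
KV_GB_PER_TOKEN_BF16 = 2 * 80 * 8 * 128 * 2 / 1e9    # K and V across all layers

def saving_fraction(concurrent, seq_len):
    """Fraction of (weights + KV cache) memory freed by an 8-bit KV cache."""
    cache_bf16 = concurrent * seq_len * KV_GB_PER_TOKEN_BF16
    cache_8bit = cache_bf16 / 2
    return (cache_bf16 - cache_8bit) / (WEIGHTS_GB + cache_bf16)

# Hypothetical workloads: (label, concurrent requests, tokens per request).
for name, concurrent, seq_len in [("short chat", 16, 2048), ("long context", 40, 32768)]:
    print(f"{name:>12}: {saving_fraction(concurrent, seq_len):.0%} of total memory saved")
```

The saving approaches 50% of total memory only as the cache dwarfs the weights. On a short-context workload the weights dominate and the flag barely moves the total.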
This is the silent footgun. Last week's post on speculative decoding described the acceptance probability `r = min(1, M_p(x) / M_q(x))` and the speedup formula in terms of `μ`, the mean accepted tokens per cycle. KV cache quantization breaks the implicit assumption underneath: that the target model's logit at the proposal position is computed at the same numerical precision as the draft model's.
The mechanism:
1. The draft model proposes `x_t` using its own KV cache (draft cache, typically BF16).
2. The target model scores the same position. `M_p(x_t)` vs `M_q(x_t)` is still computed, but `M_p` is now using K and V values rounded to FP8 or INT8.
3. The rounding error shifts `M_p`, more proposals get rejected, and that lowers `μ`.

The magnitude depends on the format and context length. From community benchmarks and published work on spec-decoding with quantized caches, mean accepted tokens per cycle typically drops 0.3–0.8 for FP8 E4M3 and 0.5–1.5 for INT8 per-token. That sounds small until you remember the speedup curve has a knee around `μ = 4`. A drop from 4.5 to 3.5 can wipe out 20–30% of the speedup you thought you had.
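The effect on acceptance can be illustrated with a toy distribution. Since drafted tokens are sampled from `M_q`, the expected value of `min(1, M_p(x) / M_q(x))` works out to the sum of `min(M_p(x), M_q(x))` over the vocabulary. The probabilities below are made up, and the size of the "FP8 shift" is an assumption for illustration, not a measured value.

```python
def acceptance_rate(target_probs, draft_probs):
    """Expected probability that a draft token survives the min(1, p/q) test.

    Drafted tokens are sampled from q, so the expectation is
    sum_x q(x) * min(1, p(x) / q(x)) = sum_x min(p(x), q(x)).
    """
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

# Toy 3-token vocabulary; all numbers are invented for illustration.
draft       = [0.70, 0.20, 0.10]   # M_q
target_bf16 = [0.72, 0.18, 0.10]   # M_p with a BF16 cache
target_fp8  = [0.80, 0.12, 0.08]   # M_p after a hypothetical quantization shift

print(f"BF16 cache: {acceptance_rate(target_bf16, draft):.2f}")
print(f"FP8 cache:  {acceptance_rate(target_fp8, draft):.2f}")
```

Any perturbation that moves the target distribution further from the draft's lowers this number, and a lower per-token acceptance rate compounds over a multi-token proposal. Rounding noise will not always push in that direction, which is why the drop has to be measured rather than assumed.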
The vLLM v0.18.0 release notes called this out for one specific case: degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618). The lesson generalizes: when stacking serving optimizations, each one shifts the optimal settings of the others. Speculative decoding was tuned assuming BF16 attention. Re-tune `num_speculative_tokens` and re-measure `μ` after turning on `--kv-cache-dtype fp8`.
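Why the best proposal length moves can be sketched with a simplified cost model. The assumptions are loud ones: each draft token is accepted independently with probability `alpha`, a draft step costs a fixed fraction `draft_cost` of a target step, and the tokens-per-cycle count includes the one token the target always contributes. This may not match the exact speedup formula from last week's post, and both alpha values are hypothetical.

```python
def expected_tokens(alpha, k):
    """Mean tokens emitted per cycle with k drafted tokens and acceptance alpha.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def best_num_speculative_tokens(alpha, draft_cost=0.05, max_k=16):
    """k that maximizes the modeled speedup; draft_cost is draft/target step time."""
    def speedup(k):
        return expected_tokens(alpha, k) / (1 + k * draft_cost)
    return max(range(1, max_k + 1), key=speedup)

# Hypothetical acceptance rates before and after a cache-precision change.
for label, alpha in [("higher acceptance", 0.90), ("lower acceptance", 0.70)]:
    k = best_num_speculative_tokens(alpha)
    print(f"{label}: alpha={alpha:.2f} -> k={k}, tokens/cycle={expected_tokens(alpha, k):.2f}")
```

Under this model a lower acceptance rate favours a shorter proposal, so a `num_speculative_tokens` value tuned on BF16 is likely too long once the cache is quantized. Treat the output as a direction to test, not a setting to copy.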
`--kv-cache-dtype fp8` only changes the attention state. To compress the model, you need a separate quantization step (GPTQ, AWQ, FP8 weights) with its own quality/throughput tradeoffs.

KV cache quantization is the wrong choice if:
- You run speculative decoding with a marginal acceptance rate. If `μ` is below 3.0 in BF16, the additional 0.3–0.8 drop in mean accepted tokens from FP8 can push the speedup below 1.0x and turn the algorithm into a net loss. Measure first, then enable.

In short:

- The per-request KV cache is `2 × layers × kv_heads × seq_len × head_dim × bytes`. For a 70B Llama-3 at 32k BF16, that's ~10.7 GB per request. FP8 halves it; INT8 halves it; 4-bit schemes quarter it.
- In vLLM, the switch is `--kv-cache-dtype fp8` or `int8`. FP8 is H100/H200/B100/B200/MI300X only; INT8 runs everywhere via Triton.
- If you use speculative decoding, re-measure `μ` and re-tune `num_speculative_tokens` after enabling it.

Next post: prefix caching at scale, including when it saves you 80% of prefill cost and the eviction policies that quietly turn it into a 5% saving in production.