KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

A developer deploying a 70B Llama-3 model on 8x H100s found that scaling from 8k to 32k context windows causes the KV cache to balloon to 10.7 GB per request, forcing memory paging to CPU at 200 concurrent users. KV cache quantization using FP8 or INT8 formats reduces this memory footprint by 50% with less than 0.5 percentage points of accuracy loss on retrieval tasks, offering a favorable trade-off between infrastructure cost and output quality. The quantization compresses stored K and V tensors at storage time while keeping attention matmuls in full BF16 precision, though the engineering complexity and compatibility with other serving features require careful consideration.

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly 2 × 80 layers × 8 KV heads × 32768 tokens × 128 head dim × 2 bytes ≈ 10.7 GB per request. Two hundred of those, and the H100s are paging to CPU. The model itself fits; the attention state doesn't.

This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks. Here's how it works, what the formats are, and where the footguns hide.

The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with batch size × seq len and stays allocated until the request ends. On a long-context workload, it dominates.

KV cache quantization trades a small amount of representational precision for a 2x or 4x reduction in cache footprint, with no model-weight change. FP8 and INT8 give ~50% of the BF16 footprint. INT4 (KIVI, KVQuant, ZipCache-style) gives 25%. The question is what that compression costs in output quality, in serving complexity, and (the part most blog posts skip) in compatibility with the other serving features you already turned on.

The economic case is straightforward. Doubling the KV cache budget on a 70B at 32k means either ~21 GB more HBM (one extra H100 per ~10 concurrent users at 32k) or 2x fewer concurrent users per box. The quality cost of FP8 KV cache, measured on the standard long-context benchmarks, is typically under 0.5 percentage points on retrieval-heavy tasks. That's a 50% infra saving for a sub-half-point accuracy loss. The trade is favorable; the engineering is not free.

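
The per-request arithmetic above is easy to reproduce. The following is a minimal sketch, not part of any serving stack: the helper name `kv_cache_bytes` and the `LLAMA3_70B` dictionary are made up for illustration, the shape numbers are the ones quoted in the text, and the small per-token or per-head scale overhead of the quantized formats is ignored.

```python
def kv_cache_bytes(layers, kv_heads, seq_len, head_dim, bytes_per_elem):
    """Bytes of KV cache for one request.

    The leading factor 2 counts K and V; every layer, KV head and token
    stores one head_dim-sized vector for each.
    """
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_elem


# Llama-3 70B shape as quoted in the text: 80 layers, 8 KV heads, head dim 128.
LLAMA3_70B = dict(layers=80, kv_heads=8, head_dim=128)

# Bytes per stored element for each cache format (scale overhead ignored).
FORMATS = [("BF16", 2), ("FP8/INT8", 1), ("INT4", 0.5)]

for name, nbytes in FORMATS:
    per_request = kv_cache_bytes(seq_len=32768, bytes_per_elem=nbytes, **LLAMA3_70B)
    total = 200 * per_request  # the 200 concurrent requests from the example
    print(f"{name:9s} {per_request / 1e9:6.2f} GB per request, "
          f"{total / 1e12:5.2f} TB for 200 requests")
```

Plugging in your own model's layer count, KV head count and head dimension gives a quick upper bound on cache demand before you touch any serving flags.
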
Standard BF16 attention stores the K and V tensors at full precision. At every attention step, the model reads every past K and V. Quantization compresses these stored tensors using a lower-precision format, with a dequantization step fused into the attention kernel right before the matmul. The pipeline looks like this:

```mermaid
flowchart LR
    A["New token<br/>embedding"] --> B["Project to K_t, V_t<br/>(BF16, in registers)"]
    B --> C["Quantize K_t, V_t<br/>(per-token / per-head)"]
    C --> D["Store in<br/>KV cache: FP8/INT8"]
    D --> E["On next step:<br/>load cached K and V"]
    E --> F["Dequantize on-the-fly<br/>(inside attention kernel)"]
    F --> G["Attention matmul<br/>(BF16, full precision)"]
    G --> H["Output projection"]
```

Three things to notice: the activations being added to the cache are quantized only at storage time, with the full BF16 values available for the scale calculation. The attention matmul still happens in BF16 or FP16, so you save memory bandwidth, not FLOPs. And the per-token or per-head scales (a few KB for an 8k context) are stored alongside in BF16; they are what makes the rest of the math work.

Five formats dominate production serving stacks in 2026. The list is in roughly the order they were adopted.

| Format | Bits | Granularity | Hardware support | Used by |
|---|---|---|---|---|
| BF16 (baseline) | 16 | n/a | Native on Ampere+ | Everything |
| FP8 (E4M3) | 8 | Per-tensor, per-head, or per-token | H100, H200, B100, B200, MI300X | vLLM, TRT-LLM, SGLang |
| FP8 (E5M2) | 8 | Same as above | Same as above | Less common for KV; wider dynamic range |
| INT8 per-token | 8 | Per-token, asymmetric | Universal via Triton/CUDA | vLLM, TGI, llama.cpp |
| INT4 (KVQuant / KIVI / ZipCache) | 4 | Mixed: K per-channel, V per-token | Universal | Research, llama.cpp (some targets) |

A few notes on the table:

- Asymmetric per-token quantization stores a (scale, zero-point) pair for each token.
- Per-channel (one scale per head-dim slice) is faster on hardware but slightly less accurate.
- Per-tensor (one scale for the whole cache) is cheapest and loses the most.

The CLI flag is `--kv-cache-dtype`.
In vLLM v0.22.1, accepted values are `auto`, `fp8` (E4M3), `fp8_e5m2`, `int8`, and `bf16` (the default; `auto` resolves to `bf16` unless the model is detected as FP8-native). For an OpenAI-compatible serve:

```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```

For programmatic use:

```python
from vllm import LLM

# Same settings as the CLI invocation above.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    max_model_len=32768,
)
```

On H100, the FP8 path goes through Transformer Engine's fused attention; on B100/B200 it goes through FlashAttention-3 FP8 kernels. On Ampere-class hardware such as the A100, the FP8 flag is a no-op or a slow path, because there's no native FP8 tensor core. INT8, by contrast, runs everywhere via Triton.

One production detail: `--kv-cache-dtype fp8` on an H100 reduces KV cache memory by ~50% but does not reduce the model's weight footprint. The 70B in BF16 is still 140 GB. The savings are real but bounded by the cache-to-weight ratio of your workload: long-context, high-concurrency workloads benefit most.

The interaction with speculative decoding is the silent footgun. Last week's post on speculative decoding described the acceptance probability `r = min(1, M_p(x) / M_q(x))` and the speedup formula in terms of μ, the mean accepted tokens per cycle. KV cache quantization breaks the implicit assumption underneath: that the target model's logit at the proposal position is computed at the same numerical precision as the draft model's.

The mechanism:

1. The draft model proposes `x_t` using its own KV cache (the draft cache, typically BF16).
2. The ratio of `M_p(x_t)` to `M_q(x_t)` is still computed, but `M_p` is now using K and V values rounded to FP8 or INT8.
3. The rounding shifts the target distribution relative to the one the draft was matched against, which lowers acceptance and therefore μ.

The magnitude depends on the format and context length. From community benchmarks and published work on spec-decoding with quantized caches, mean accepted tokens per cycle typically drops by 0.3–0.8 for FP8 E4M3 and by 0.5–1.5 for INT8 per-token.
That sounds small until you remember the speedup curve has a knee around μ = 4. A drop from 4.5 to 3.5 can wipe out 20–30% of the speedup you thought you had.

The vLLM v0.18.0 release notes called this out for one specific case: degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618). The lesson generalizes: when stacking serving optimizations, each one shifts the optimal settings of the others. Speculative decoding was tuned assuming BF16 attention. Re-tune `num_speculative_tokens` and re-measure μ after turning on `--kv-cache-dtype fp8`.

Keep in mind that `--kv-cache-dtype fp8` only changes the attention state. To compress the model, you need a separate quantization step (GPTQ, AWQ, FP8 weights) with its own quality/throughput tradeoffs.

KV cache quantization is the wrong choice if your speculative decoding is already marginal: if μ is below 3.0 in BF16, the additional 0.3–1.0 drop in μ from FP8 can push your end-to-end speedup below 1.0x and turn the algorithm into a net loss. Measure first, then enable.

In short:

- KV cache size is `2 × layers × kv_heads × seq_len × head_dim × bytes`. For a 70B Llama-3 at 32k BF16, that's ~10.7 GB per request. FP8 halves it; INT8 halves it; 4-bit schemes quarter it.
- Enable it with `--kv-cache-dtype fp8` or `int8`. FP8 is H100/H200/B100/B200/MI300X only; INT8 runs everywhere via Triton.
- If you use speculative decoding, re-tune `num_speculative_tokens` after enabling it.

Next post: prefix caching at scale, when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into a 5% saving in production.
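
Appendix: to make the "quantize at storage time, dequantize inside the kernel" step from the pipeline concrete, here is a minimal pure-Python sketch of asymmetric per-token 8-bit quantization. It is an illustration under stated assumptions, not how vLLM or any other stack implements it: real kernels fuse this into attention and work on GPU tensors, the function names `quantize_per_token` and `dequantize_per_token` are hypothetical, and the zero point is stored here as a floating-point offset (the row minimum) rather than an integer.

```python
def quantize_per_token(row, bits=8):
    """Asymmetric quantization of one token's K or V vector.

    Returns (codes, scale, offset). The codes are integers in [0, 2**bits - 1];
    scale and offset stay in full precision and are stored next to the codes.
    """
    qmax = (1 << bits) - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant row: any scale works, avoid dividing by zero
    codes = [min(qmax, max(0, round((x - lo) / scale))) for x in row]
    return codes, scale, lo


def dequantize_per_token(codes, scale, offset):
    """Inverse mapping, applied right before the attention matmul."""
    return [c * scale + offset for c in codes]


# Illustrative values only, standing in for one token's slice of a K tensor.
key_row = [0.12, -1.30, 0.75, 2.40, -0.05, 0.00, 1.10, -2.20]

codes, scale, offset = quantize_per_token(key_row)
restored = dequantize_per_token(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(key_row, restored))

print("codes:", codes)
print(f"scale={scale:.5f} offset={offset:.2f} max_abs_error={max_err:.5f}")
```

The round-trip error per element is bounded by half the scale, so a token whose values span a wide range loses more precision than one with a narrow range. That is why the granularity of the scale (per-token, per-channel, per-tensor) matters as much as the bit width.
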