KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

A developer deploying a 70B Llama-3 model on 8x H100s found that scaling from 8k to 32k context windows causes the KV cache to balloon to 10.7 GB per request, forcing memory paging to CPU at 200 concurrent users. KV cache quantization using FP8 or INT8 formats reduces this memory footprint by 50% with less than 0.5 percentage points of accuracy loss on retrieval tasks, offering a favorable trade-off between infrastructure cost and output quality. The quantization compresses the K and V tensors at storage time while keeping attention matmuls in full BF16 precision, though the engineering complexity and compatibility with other serving features require careful consideration.

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly 2 (K and V) × 80 layers × 8 KV heads × 32768 tokens × 128 head dim × 2 bytes ≈ 10.7 GB per request. Two hundred of those, and the H100s are paging to CPU. The model itself fits; the attention state doesn't.

This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks. Here's how it works, what the formats are, and where the footguns hide.

The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with batch size × sequence length and stays allocated until the request ends. On a long-context workload, it dominates.

KV cache quantization trades a small amount of representational precision for a 2x or 4x reduction in cache footprint, with no model-weight change.
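To make the arithmetic concrete, here is a minimal Python sketch of the per-request footprint calculation. The layer, head, and head-dim figures are the ones quoted above; the function names and the `bits_per_value` parameter are hypothetical, and the sketch deliberately ignores the scale and zero-point metadata that real quantized caches also store, so treat the quantized numbers as a lower bound.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Per-request KV cache size in bytes, counting both K and V across all layers.

    Assumption: payload only; per-block scales / zero-points are not counted.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len  # leading 2 = K and V
    return values * bits_per_value // 8


# Shape quoted in the text for a 70B Llama-3 (80 layers, 8 KV heads, head dim 128).
LLAMA3_70B = {"layers": 80, "kv_heads": 8, "head_dim": 128}

# Storage width per cached value for each format discussed.
FORMAT_BITS = {"bf16": 16, "fp8/int8": 8, "int4": 4}


def footprint_table(seq_len, shape=LLAMA3_70B):
    """Map each storage format to its per-request cache size in bytes."""
    return {
        name: kv_cache_bytes(seq_len=seq_len, bits_per_value=bits, **shape)
        for name, bits in FORMAT_BITS.items()
    }


if __name__ == "__main__":
    for name, size in footprint_table(32768).items():
        print(f"{name:9s} {size / 1e9:6.2f} GB per request")
```

At 32k tokens this prints 10.74 GB for BF16, 5.37 GB for the 8-bit formats, and 2.68 GB for INT4, which is where the 2x and 4x reductions come from.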
FP8 and INT8 give ~50% of the BF16 footprint. INT4 (KIVI, KVQuant, ZipCache-style) gives 25%. The question is what that compression costs in output quality, in serving complexity, and (the part most blog posts skip) in compatibility with the other serving features you already turned on.

The economic case is straightforward. Doubling the KV cache budget on a 70B at 32k means either ~21 GB more HBM (one extra H100 per ~10 concurrent users at 32k) or 2x fewer concurrent users per box. The quality cost of FP8 KV cache, measured on the standard long-context benchmarks, is typically under 0.5 percentage points on retrieval-heavy tasks. That's a 50% infra saving for a sub-half-point accuracy loss. The trade is favorable; the engineering is not free.

Standard BF16 attention stores the K and V tensors at full precision. At every attention step, the model reads every past K and V. Quantization compresses these stored tensors using a lower-precision format, with a dequantization step fused into the attention kernel right before the matmul. The pipeline looks like this:

php flowchart LR A New token