{"slug": "kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they", "title": "KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break", "summary": "A developer deploying a 70B Llama-3 model on 8x H100s found that scaling from 8k to 32k context windows causes the KV cache to balloon to 10.7 GB per request, forcing memory paging to CPU at 200 concurrent users. KV cache quantization using FP8 or INT8 formats reduces this memory footprint by 50% with less than 0.5 percentage points of accuracy loss on retrieval tasks, offering a favorable trade-off between infrastructure cost and output quality. The quantization compresses stored K and V tensors at storage time while keeping attention matmuls in full BF16 precision, though the engineering complexity and compatibility with other serving features require careful consideration.", "body_md": "You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says \"can you do 32k?\" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly `2 × 80 layers × 8 KV heads × 32768 tokens × 128 head_dim × 2 bytes ≈ 10.7 GB per request`. Two hundred of those is roughly 2.1 TB against 640 GB of total HBM, and the H100s are paging to CPU. The model itself fits; the *attention state* doesn't. This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks.\n\nHere's how it works, what the formats are, and where the footguns hide.\n\nThe KV cache is the largest *dynamic* piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with `batch_size × seq_len` and stays allocated until the request ends. 
On a long-context workload, it dominates.\n\nKV cache quantization trades a small amount of *representational precision* for a 2x or 4x reduction in cache footprint, with no model-weight change. FP8 and INT8 give ~50% of the BF16 footprint. INT4 (KIVI, KVQuant, ZipCache-style) gives 25%. The question is what that compression costs in output quality, in serving complexity, and (the part most blog posts skip) in compatibility with the other serving features you already turned on.\n\nThe economic case is straightforward. Doubling the KV cache budget on a 70B at 32k means either ~21 GB more HBM (one extra H100 per ~10 concurrent users at 32k) or half as many concurrent users per box. The quality cost of FP8 KV cache, measured on the standard long-context benchmarks, is typically under 0.5 percentage points on retrieval-heavy tasks. That's roughly half the KV cache memory for a sub-half-point accuracy loss. The trade is favorable; the engineering is not free.\n\nStandard BF16 attention stores the K and V tensors at full precision. At every attention step, the model reads every past K and V. Quantization compresses these stored tensors using a lower-precision format, with a *dequantization* step fused into the attention kernel right before the matmul.\n\nThe pipeline looks like this:\n\n```mermaid\nflowchart LR\n    A[New token<br/>embedding] --> B[Project to Kt Vt<br/>BF16, in registers]\n    B --> C[Quantize Kt Vt<br/>per-token / per-head]\n    C --> D[Store in<br/>KV cache: FP8/INT8]\n    D --> E[On next step:<br/>load cached K and V]\n    E --> F[Dequantize on-the-fly<br/>inside attention kernel]\n    F --> G[Attention matmul<br/>BF16, full precision]\n    G --> H[Output projection]\n```\n\nThree things to notice: the activations being added to the cache are quantized only at *storage* time, with the full BF16 values available for the scale calculation. The attention matmul still happens in BF16 or FP16: you save memory bandwidth, not FLOPs. 
And the per-token or per-head scales (about 16 KB per head per layer for each of K and V at 8k context with per-token scaling) are stored alongside in BF16; they are what makes the rest of the math work.\n\nFive formats dominate production serving stacks in 2026. The list is in roughly the order they were adopted.\n\n| Format | Bits | Granularity | Hardware support | Used by |\n|---|---|---|---|---|\n| BF16 (baseline) | 16 | n/a | Native on Ampere+ | Everything |\n| FP8 E4M3 | 8 | Per-tensor, per-head, or per-token | H100, H200, B100, B200, MI300X | vLLM, TRT-LLM, SGLang |\n| FP8 E5M2 | 8 | Same as above | Same as above | Less common for KV; wider dynamic range |\n| INT8 (per-token) | 8 | Per-token, asymmetric | Universal via Triton/CUDA | vLLM, TGI, llama.cpp |\n| INT4 (KVQuant / KIVI / ZipCache) | 4 | Mixed: K per-channel, V per-token | Universal | Research, llama.cpp (some targets) |\n\nA few notes on the table:\n\n`(scale, zero_point)` pair. Per-channel (one scale per head_dim slice) is faster on hardware but slightly less accurate. Per-tensor (one scale for the whole cache) is cheapest and loses the most.\n\nThe CLI flag is `--kv-cache-dtype`. In vLLM v0.22.1, accepted values are `auto`, `fp8` (E4M3), `fp8_e5m2`, `int8`, and `bf16` (the default; `auto` resolves to `bf16` unless the model is detected as FP8-native). For an OpenAI-compatible serve:\n\n```\nvllm serve meta-llama/Meta-Llama-3-70B-Instruct \\\n  --tensor-parallel-size 8 \\\n  --kv-cache-dtype fp8 \\\n  --max-model-len 32768 \\\n  --gpu-memory-utilization 0.92\n```\n\nFor programmatic use:\n\n```python\nfrom vllm import LLM\n\nllm = LLM(\n    model=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n    tensor_parallel_size=8,\n    kv_cache_dtype=\"fp8\",\n    max_model_len=32768,\n)\n```\n\nOn H100, the FP8 path goes through Transformer Engine's fused attention; on B100/B200 it goes through FlashAttention-3 FP8 kernels. On pre-Hopper datacenter hardware such as the A100, the FP8 flag is a no-op or a slow path because there is no native FP8 tensor core. 
INT8, by contrast, runs everywhere via Triton.\n\nOne production detail: `--kv-cache-dtype fp8` on an H100 reduces *KV cache* memory by ~50% but does **not** reduce the model's weight footprint. The 70B in BF16 is still 140 GB. The savings are real but bounded by the cache-to-weight ratio of your workload: long-context, high-concurrency workloads benefit most.\n\nThis is the silent footgun. Last week's post on speculative decoding described the acceptance probability `r = min(1, M_p(x) / M_q(x))` and the speedup formula in terms of `μ`, the mean accepted tokens per cycle. KV cache quantization breaks the implicit assumption underneath: that the target model's logit at the proposal position is computed at the same numerical precision as the draft model's.\n\nThe mechanism:\n\n`x_t` using its own KV cache (draft cache, typically BF16).\n\n`M_p(x_t)` vs `M_q(x_t)` is still computed, but `M_p` is now using K and V values rounded to FP8 or INT8.\n\n`μ`.\n\nThe magnitude depends on the format and context length. From community benchmarks and published work on spec-decoding with quantized caches, mean accepted tokens per cycle typically drops 0.3–0.8 for FP8 E4M3 and 0.5–1.5 for INT8 per-token. That sounds small until you remember the speedup curve has a knee around `μ = 4`. A drop from 4.5 to 3.5 can wipe out 20–30% of the speedup you thought you had.\n\nThe vLLM v0.18.0 release notes called this out for one specific case: degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618). The lesson generalizes: when stacking serving optimizations, *each one shifts the optimal settings of the others*. Speculative decoding was tuned assuming BF16 attention. Re-tune `num_speculative_tokens` and re-measure `μ` after turning on `--kv-cache-dtype fp8`.\n\n`--kv-cache-dtype fp8` only changes the attention state. 
To compress the model, you need a separate quantization step (GPTQ, AWQ, FP8 weights) with its own quality/throughput tradeoffs.\n\nKV cache quantization is the wrong choice if:\n\n`μ` is below 3.0 in BF16, the additional 0.3–0.8 drop in mean accepted tokens from FP8 can push the speedup below 1.0x and turn the algorithm into a net loss. Measure first, then enable.\n\n`2 × layers × kv_heads × seq_len × head_dim × bytes`. For a 70B Llama-3 at 32k BF16, that's ~10.7 GB per request. FP8 halves it; INT8 halves it; 4-bit schemes quarter it.\n\n`--kv-cache-dtype fp8` or `int8`. FP8 is H100/H200/B100/B200/MI300X only; INT8 runs everywhere via Triton.\n\n`num_speculative_tokens` after enabling it.\n\nNext post: prefix caching at scale, when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into a 5% saving in production.", "url": "https://wpnews.pro/news/kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they", "canonical_source": "https://dev.to/tech_nuggets/kv-cache-quantization-what-fp8int8-k-and-v-actually-buy-you-and-where-they-break-4fnl", "published_at": "2026-06-06 01:10:50+00:00", "updated_at": "2026-06-06 01:42:51.313178+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-research"], "entities": ["Llama", "H100", "KIVI", "KVQuant", "ZipCache"], "alternates": {"html": "https://wpnews.pro/news/kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they", "markdown": "https://wpnews.pro/news/kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they.md", "text": "https://wpnews.pro/news/kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they.txt", "jsonld": "https://wpnews.pro/news/kv-cache-quantization-what-fp8-int8-k-and-v-actually-buy-you-and-where-they.jsonld"}}