01:10
2026-06-06
dev.to
large-language-models
KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break
A developer deploying a 70B Llama-3 model on 8x H100s found that scaling from 8k to 32k context windows causes the KV cache to balloon to 10.7 GB per request, forcing memory paging to CPU at 200 concu…
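The 10.7 GB figure can be reproduced from the model's attention shape. Below is a minimal sketch. It assumes the published Llama-3-70B configuration (80 layers, 8 grouped-query KV heads, head dimension 128) and counts one K and one V vector per layer per token. `KVConfig` and `kv_cache_bytes` are hypothetical helpers for illustration, not a library API. The sketch also shows the first-order effect of storing K and V in a 1-byte format such as FP8 or INT8 instead of 2-byte FP16/BF16, and it ignores the small overhead of per-tensor or per-channel quantization scales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    """Attention shape that determines KV cache size (hypothetical helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LLAMA3_70B = KVConfig(num_layers=80, num_kv_heads=8, head_dim=128)


def kv_cache_bytes(cfg: KVConfig, seq_len: int, bytes_per_elem: int) -> int:
    """Bytes of K plus V cached for a single sequence of seq_len tokens.

    Factor 2 = one K and one V tensor per layer. Quantization scale
    overhead is deliberately ignored (assumption of this sketch).
    """
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * bytes_per_elem
    return per_token * seq_len


if __name__ == "__main__":
    for ctx in (8_192, 32_768):
        for name, nbytes in (("FP16/BF16", 2), ("FP8/INT8", 1)):
            gb = kv_cache_bytes(LLAMA3_70B, ctx, nbytes) / 1e9  # decimal GB
            print(f"{ctx:>6} tokens, {name:<9}: {gb:5.2f} GB per request")
```

At 32,768 tokens this gives about 10.74 GB per request at 2 bytes per element, in line with the 10.7 GB quoted above, and about 5.37 GB at 1 byte per element. The halving is pure storage arithmetic; it says nothing about the accuracy cost of quantizing K and V.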