KVarN: Native vLLM KV-cache quantization back end by Huawei

[ARTICLE · art-21645] src=github.com ↗ pub=2026-06-04T15:18Z topic=large-language-models verified=true sentiment=↑ positive

Huawei released KVarN, a native KV-cache quantization back end for vLLM that delivers up to 5x more cache capacity and 1.3x the throughput of FP16 while maintaining FP16-level accuracy. The calibration-free system handles agentic and long-context workloads by quantizing keys at 4 bits and values at 2 bits, outperforming existing methods like TurboQuant by up to 2.4x in throughput with higher accuracy. KVarN ships as a vLLM fork that requires only a single flag change to enable, eliminating the need for model modifications or calibration data.

Published Jun 4, 2026 · 3 min read

⚡️ Built for agentic and long-context workloads.

💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.

🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.

🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.

kvarn /kvɑːɳ/ · noun (Swedish)

A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.

KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.

KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats its throughput while delivering ~4× the KV-cache capacity.

KVarN stays in the upper-right corner the blog's methods can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
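To put the capacity figure in concrete terms, here is a rough back-of-the-envelope sketch of how many tokens fit in a fixed KV pool. The model shape (64 layers, 8 KV heads, head dimension 128) and the 40 GiB pool are assumptions for illustration, not numbers from the article; only the ~4× multiplier comes from the text above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2.0):
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes, per_token_bytes, capacity_gain=1.0):
    """Tokens a KV pool of budget_bytes can hold, given a capacity multiplier."""
    return int(budget_bytes * capacity_gain // per_token_bytes)


# Assumed Qwen3-32B shape (not stated in the article).
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
budget = 40 * 1024**3  # hypothetical 40 GiB KV pool

fp16_tokens = tokens_that_fit(budget, per_token)
kvarn_tokens = tokens_that_fit(budget, per_token, capacity_gain=4.0)  # "~4x" from the text

print(f"FP16 : {fp16_tokens:,} tokens ({fp16_tokens // 16384} sequences of 16K)")
print(f"KVarN: {kvarn_tokens:,} tokens ({kvarn_tokens // 16384} sequences of 16K)")
```

Under these assumptions the same pool holds four times as many 16K-token sequences, which is where the extra concurrency comes from.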

KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.

git clone https://github.com/huawei-csl/KVarN.git
cd KVarN

VLLM_USE_PRECOMPILED=1 pip install -e .

Then run offline inference from Python:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    dtype="float16",                    # KVarN runs in float16
    kv_cache_dtype="kvarn_k4v2_g128",   # enable KVarN
    block_size=128,                     # KVarN tile size
)
print(llm.generate("Explain KV-cache quantization in one sentence.",
                    SamplingParams(max_tokens=64))[0].outputs[0].text)

Serving works the same way:

vllm serve Qwen/Qwen3-32B --dtype float16 --kv-cache-dtype kvarn_k4v2_g128 --block-size 128
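Once the server is up, it can be queried like any other vLLM deployment. The sketch below assumes the fork keeps upstream vLLM's OpenAI-compatible `/v1/completions` endpoint on the default port 8000; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": "Qwen/Qwen3-32B", "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the `vllm serve` command above running, `complete("Explain KV-cache quantization in one sentence.")` should return the completion text. Nothing in the request mentions KVarN: the quantized cache is a server-side setting.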

Note: KVarN runs in float16 compute. The tile / page size is currently fixed at 128 (one vLLM block = one KVarN tile); other page sizes are coming soon.

Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
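For the offline API, the same tip might look like the sketch below. The helper `kvarn_tight_gpu_kwargs` is hypothetical, the value 0.95 is only an illustrative choice above vLLM's default of 0.9, and setting the variable before the engine is constructed rests on the assumption that it is read at startup.

```python
import os


def kvarn_tight_gpu_kwargs(model="Qwen/Qwen3-32B", gpu_memory_utilization=0.95):
    """Engine arguments for a tight single-GPU budget (illustrative values)."""
    # Assumption: the variable is read when the engine profiles memory at startup,
    # so it has to be in the environment before LLM(...) is constructed.
    os.environ["VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS"] = "0"
    return {
        "model": model,
        "dtype": "float16",
        "kv_cache_dtype": "kvarn_k4v2_g128",
        "block_size": 128,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


# Usage, with the KVarN fork installed:
#   from vllm import LLM
#   llm = LLM(**kvarn_tight_gpu_kwargs())
```

Keeping the arguments in a plain dict makes it easy to compare the reported KV pool size with and without the two adjustments.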

KVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile through four stages:

- Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
- Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
- Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
- Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
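The stages can be mimicked on a toy tile in plain Python. This is a minimal sketch for intuition, not the KVarN kernels: the iteration count, the epsilon, the per-row (key-style) quantization grouping, and all function names are assumptions.

```python
import math
import random
import statistics


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def rotate_channels(tile):
    """Rotate each token's channel vector (tile is channels x tokens)."""
    cols = [hadamard_rotate(col) for col in zip(*tile)]
    return [list(row) for row in zip(*cols)]


def variance_normalize(tile, iters=4, eps=1e-8):
    """Alternately divide columns and rows by their std, tracking scales in log space."""
    x = [list(row) for row in tile]
    log_r = [0.0] * len(x)
    log_c = [0.0] * len(x[0])
    for _ in range(iters):
        for j in range(len(log_c)):
            s = statistics.pstdev([row[j] for row in x]) + eps
            log_c[j] += math.log(s)
            for row in x:
                row[j] /= s
        for i, row in enumerate(x):
            s = statistics.pstdev(row) + eps
            log_r[i] += math.log(s)
            x[i] = [v / s for v in row]
    return x, [math.exp(v) for v in log_r], [math.exp(v) for v in log_c]


def quantize_rtn(values, bits):
    """Asymmetric round-to-nearest: integer codes plus (scale, minimum)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0
    return [round((v - lo) / scale) for v in values], scale, lo


def roundtrip(tile, bits):
    """Rotate, normalize, quantize each channel row, then undo every step."""
    normed, r, c = variance_normalize(rotate_channels(tile))
    recon = []
    for i, row in enumerate(normed):
        codes, scale, lo = quantize_rtn(row, bits)
        recon.append([(q * scale + lo) * r[i] * c[j] for j, q in enumerate(codes)])
    return rotate_channels(recon)  # the orthonormal Hadamard matrix is its own inverse


def mse(a, b):
    """Mean squared error between two tiles of the same shape."""
    return statistics.fmean((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
tile = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]  # 8 channels x 32 tokens
tile[3] = [v * 20 for v in tile[3]]  # one outlier channel
for bits in (2, 4):
    print(bits, "bits -> MSE", mse(tile, roundtrip(tile, bits)))
```

The rotation spreads the outlier channel across all eight channels, and the row and column scales are exactly what gets multiplied back in at read time.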

The shipped preset spends more bits on keys than values (kvarn_k4v2_g128: 4-bit keys, 2-bit values). We chose to release this configuration because it meets the strictest accuracy bar, matching FP16, that the most demanding production deployments and vLLM require, while still delivering throughput above FP16.
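A quick estimate shows how a 4-bit/2-bit split relates to the capacity figures. It assumes that `g128` denotes a quantization group of 128 elements and that each group stores an fp16 scale and an fp16 zero point (32 bits of metadata); neither detail is confirmed by the article, and the estimate ignores the normalization scales and the decode workspace.

```python
def effective_bits(code_bits, group_size=128, metadata_bits=32):
    """Average stored bits per element: codes plus per-group metadata (assumed layout)."""
    return code_bits + metadata_bits / group_size


key_bits = effective_bits(4)    # 4-bit keys
value_bits = effective_bits(2)  # 2-bit values
average_bits = (key_bits + value_bits) / 2

# Ratio against a 16-bit (FP16) cache of the same shape.
capacity_ratio = 16 / average_bits
print(f"keys {key_bits} b, values {value_bits} b, ~{capacity_ratio:.2f}x vs FP16")
```

Under these assumptions the ratio lands a little below 5×, which is in line with the "up to 5x" headline, with the realized gain depending on the fixed overheads left out here.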

KVarN is the official vLLM implementation of our paper:

📄 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks (arXiv:2606.03458)

If you use KVarN, please cite:

@misc{muller2026kvarn,
      title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks}, 
      author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli},
      year={2026},
      eprint={2606.03458},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2606.03458}
}

KVarN is built on vLLM (v0.22.0) and is released under the Apache 2.0 License. The original vLLM README is preserved as README_vLLM.md.


Read original on github.com → github.com/huawei-csl/KVarN
