# Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

> Source: <https://pythongiant.github.io/KVBoost/>
> Published: 2026-05-22 04:47:48+00:00

pip install kvboost
KVBoost
Faster LLM Inference.
Less VRAM. No Model Changes.
Chunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding
⚡
The Problem
LLM inference is broken by default.
🧱
VRAM Walls
Modern LLMs like Qwen2.5-32B need 60+ GB of VRAM for weights alone at 16-bit precision, which is out of reach for most teams.
🐢
Slow Prefill
Repeated system prompts are recomputed from scratch on every single request — wasting GPU cycles constantly.
🔧
HF Bottlenecks
HuggingFace's default inference loop has no KV cache reuse across requests, no chunked attention, and no memory-efficient decoding.
The Solution
KVBoost: drop-in, no rewrites.
Python

    from kvboost import KVBoost

    engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

    # Warm a shared prefix once
    engine.warm("You are a helpful assistant...")

    # All subsequent calls reuse cache
    result = engine.generate(prompt)
    print(result.kv_reuse_ratio)  # ✓ 80%+

⚡
KV Cache Reuse
Chunk-level cache reuse eliminates redundant prefill for shared prompts.
🚀
FlashAttention-2
Memory-efficient attention with 3–5× TTFT speedup vs vanilla HuggingFace.
💾
AWQ Layer Streaming
Run 32B+ models on 8 GB VRAM via pinned-host weight streaming.
🗄️
CPU Paged Decoding
Spill KV cache to CPU RAM — handle long contexts without OOM errors.
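To make the CPU paged decoding idea concrete, here is a minimal sketch of a two-tier KV block store that spills the least recently used blocks to a CPU tier. It is a toy model under stated assumptions, not KVBoost's implementation: `PagedKVStore`, `gpu_capacity_blocks`, and the LRU eviction policy are all hypothetical, and plain Python objects stand in for real tensors.

```python
from collections import OrderedDict


class PagedKVStore:
    """Toy two-tier KV block store (hypothetical; not the KVBoost API)."""

    def __init__(self, gpu_capacity_blocks):
        self.gpu_capacity_blocks = gpu_capacity_blocks
        # block_id -> block, ordered from least to most recently used
        self.gpu = OrderedDict()
        # block_id -> block, cold blocks spilled out of the GPU tier
        self.cpu = {}

    def put(self, block_id, block):
        """Store a block in the GPU tier, spilling cold blocks if over capacity."""
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id):
        """Fetch a block, paging it back in from the CPU tier on a miss."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        block = self.cpu.pop(block_id)  # KeyError if the block was never stored
        self.put(block_id, block)
        return block

    def _evict_if_needed(self):
        # Move least recently used blocks to the CPU tier until we fit again.
        while len(self.gpu) > self.gpu_capacity_blocks:
            cold_id, cold_block = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold_block


store = PagedKVStore(gpu_capacity_blocks=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")

resident_before = sorted(store.gpu)  # blocks 2 and 3 stay resident
spilled_before = sorted(store.cpu)   # blocks 0 and 1 were spilled
fetched = store.get(0)               # page fault: block 0 comes back, block 2 is evicted
resident_after = sorted(store.gpu)
```

A real system would copy tensors between device and pinned host memory asynchronously; the sketch only shows the bookkeeping.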
Performance
Real numbers. Real hardware.
3–5× TTFT speedup vs HF baseline
80%+ KV cache hit rate (multi-turn)
8 GB VRAM for a 32B model (AWQ streaming)
~10K lines of code across 43 Python modules
Time to First Token (ms), lower is better: HF Baseline 850 ms · Prefix Reuse 320 ms · Chunk Reuse 210 ms
Multi-Turn Cache Hit Rate (%): Turn 1: 0% · Turn 2: 45% · Turn 3: 68% · Turn 4: 78% · Turn 5+: 85%
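As a quick check on how the charted TTFT figures relate to the headline speedup, the ratios can be computed directly. This uses only the three numbers quoted above; nothing here is measured independently.

```python
# TTFT figures as quoted on this page, in milliseconds.
ttft_ms = {"HF Baseline": 850, "Prefix Reuse": 320, "Chunk Reuse": 210}

baseline = ttft_ms["HF Baseline"]
speedups = {name: round(baseline / ms, 2) for name, ms in ttft_ms.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup}x")
```

On these figures, chunk reuse works out to about 4.05× over the baseline and prefix reuse to about 2.66×.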
How It Works
Four layers of optimization.
01
Hash Chunks
Incoming prompt is split into chunks. Each chunk is hashed to look up prior cached K/V pairs.
02
Reuse Cache
Matching chunks skip prefill recomputation: their cached K/V pairs are loaded instead. Only novel tokens are forwarded through the transformer, attending to the cached K/V.
03
Flash Attention
New tokens run FlashAttention-2: tiled CUDA kernels whose memory use grows linearly with sequence length, O(N), rather than quadratically. No custom model code needed.
04
Page Offload
Long-context KV blocks are evicted to CPU RAM via async DMA — enabling contexts beyond GPU VRAM.
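The hash-and-reuse steps above can be sketched in a few lines. This is an illustrative sketch, not KVBoost's code: `CHUNK_SIZE`, `chunk_key`, `warm`, and `plan_prefill` are hypothetical names, the cache stores placeholder strings instead of tensors, and chunks are hashed by content alone. In a real transformer a chunk's K/V values also depend on the tokens before it and on its position, so a real implementation has to account for both.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; real chunk sizes would be much larger


def chunk_key(token_ids):
    """Deterministic hash of a chunk's token ids."""
    payload = ",".join(map(str, token_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def split_chunks(token_ids, chunk_size=CHUNK_SIZE):
    """Split a token list into consecutive chunks; the last one may be short."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def warm(token_ids, cache, chunk_size=CHUNK_SIZE):
    """Populate the cache with a placeholder K/V entry for every full chunk."""
    for chunk in split_chunks(token_ids, chunk_size):
        if len(chunk) == chunk_size:
            cache[chunk_key(chunk)] = f"<K/V for {chunk}>"


def plan_prefill(token_ids, cache, chunk_size=CHUNK_SIZE):
    """Return (reused K/V entries, tokens still to compute, reuse ratio)."""
    reused, to_compute = [], []
    for chunk in split_chunks(token_ids, chunk_size):
        key = chunk_key(chunk)
        if len(chunk) == chunk_size and key in cache:
            reused.append(cache[key])   # cache hit: skip recomputation
        else:
            to_compute.extend(chunk)    # cache miss: needs a forward pass
    ratio = 1 - len(to_compute) / len(token_ids) if token_ids else 0.0
    return reused, to_compute, ratio


cache = {}
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]
warm(system_prompt, cache)

prompt = system_prompt + [21, 22, 23]
reused, to_compute, ratio = plan_prefill(prompt, cache)
print(len(reused), to_compute, round(ratio, 2))
```

Here the two warmed chunks are reused and only the three new tokens need a forward pass, for a reuse ratio of 8/11.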
AWQ Layer Streaming
Run a 32B model on a gaming GPU.
Terminal

    $ python -m kvboost.streaming.demo_partial_8b --model Qwen/Qwen2.5-32B-Instruct-AWQ
    INFO: Replaced projections:
      56 resident across 8 layers
      392 streamed across 56 layers
    load_time: 10.7s
    peak_vram_after_load: 5.65 GB
    avg_tok_per_s: 0.11
    peak_vram_during_decode: 6.13 GB

5.65 GB: peak VRAM after loading a 32B model, which fits on a single 8 GB gaming GPU.
6.13 GB: peak VRAM during decode, which stays safely under the 8 GB limit.
0.11 tok/s: PCIe-bound throughput. Built for VRAM savings, not raw speed.
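The counts in the log above are internally consistent, which a few lines of arithmetic can confirm. One assumption is labeled in the code: seven replaced projections per layer, inferred from the log itself (56 / 8 and 392 / 56 both equal 7) and matching the q, k, v, o, gate, up, and down linear layers of a Qwen2-style decoder block.

```python
# Assumption: 7 replaced linear projections per decoder layer
# (q, k, v, o, gate, up, down), inferred from the log counts.
PROJECTIONS_PER_LAYER = 7

resident_layers, streamed_layers = 8, 56  # from the log output
resident_projections = resident_layers * PROJECTIONS_PER_LAYER
streamed_projections = streamed_layers * PROJECTIONS_PER_LAYER

# Share of layers whose weights are streamed from host memory.
streamed_fraction = streamed_layers / (resident_layers + streamed_layers)

# Headroom left on an 8 GB card at the reported decode peak.
gpu_budget_gb = 8.0
peak_decode_gb = 6.13
headroom_gb = round(gpu_budget_gb - peak_decode_gb, 2)

print(resident_projections, streamed_projections, streamed_fraction, headroom_gb)
```

With 87.5% of layers streamed over PCIe for every token, low throughput is the expected trade-off for the memory savings.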
Use Cases
Who needs KVBoost?
💻
AI Coding Assistants
System prompts are reused across hundreds of requests. Cache the context once and cut time to first token by 3–5× on every response.
📚
RAG Pipelines
Document chunks appear in many queries. Chunk-level reuse makes multi-document QA dramatically faster.
⚙️
Edge / Budget Infra
AWQ streaming lets teams deploy 30B+ models on consumer GPUs — no $10K A100 required.
💬
Multi-Turn Chatbots
Conversation history grows each turn. CPU paged decoding handles long contexts without OOM crashes.
MIT Licensed · Drop-in with HuggingFace Transformers · No fine-tuning, no architecture changes
Technology
Built on solid foundations.
✓
FlashAttention-2
Tiled CUDA kernels for attention with O(N) memory, linear in sequence length
✓
AWQ (Activation-aware Weight Quantization)
Weight-only 4-bit quantization preserving accuracy
✓
HuggingFace Transformers
Drop-in compatibility — no model changes required
✓
CUDA DMA Streams
Async PCIe transfers for layer-by-layer weight streaming
✓
Chunk Hashing
Deterministic token-level hashing for cache lookup
✓
CPU Paged Memory
Page-table KV offload — evict cold blocks to RAM
✓
PyPI Package
pip install kvboost — ready in 2 minutes
✓
MIT License
Fully open source, production-ready for any use
Roadmap
What's next.
Now ✅
✓ Chunk-level KV reuse
✓ FlashAttention-2 integration
✓ AWQ layer streaming
✓ CPU paged decoding
Next 🔨
◦ Multi-GPU tensor parallel
◦ Speculative decoding
◦ LoRA adapter hot-swap
◦ Continuous batching
Future 🔭
◦ GGUF / GGML support
◦ Triton custom kernels
◦ Distributed KV cache
◦ Cloud-hosted cache tier
Start building faster.
KVBoost is open source and production-ready.
Drop it into any HuggingFace project today.
GitHub
github.com/pythongiant/kvboost
PyPI
pypi.org/project/kvboost/
Docs
kvboost.readthedocs.io
$ pip install kvboost
MIT License · Built by @pythongiant
