Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

KVBoost is a new open-source Python library that accelerates HuggingFace LLM inference through chunk-level KV cache reuse, achieving 3–5× faster time-to-first-token (TTFT) and cache hit rates of up to 85% in multi-turn scenarios. The library also supports FlashAttention-2, AWQ layer streaming to run 32B-parameter models on 8 GB consumer GPUs, and CPU paged decoding for long contexts, all without requiring model architecture changes.

pip install kvboost

KVBoost
Faster LLM Inference. Less VRAM. No Model Changes.
Chunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding

⚡ The Problem
LLM inference is broken by default.

🧱 VRAM Walls: Modern LLMs like Qwen2.5-32B require 60+ GB of VRAM at full precision, out of reach for most teams.
🐢 Slow Prefill: Repeated system prompts are recomputed from scratch on every request, wasting GPU cycles.
🔧 HF Bottlenecks: HuggingFace's default inference loop has no KV cache reuse across requests, no chunked attention, and no memory-efficient decoding.

The Solution
KVBoost: drop-in, no rewrites.

Python
```python
from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

# Warm a shared prefix once
engine.warm("You are a helpful assistant...")

# All subsequent calls reuse the cache.
# Placeholder prompt: any request that starts with the warmed prefix.
prompt = "You are a helpful assistant... What is a KV cache?"
result = engine.generate(prompt)
print(result.kv_reuse_ratio)  # ✓ 80%+
```

⚡ KV Cache Reuse: Chunk-level cache reuse eliminates redundant prefill for shared prompts.
🚀 FlashAttention-2: Memory-efficient attention with 3–5× TTFT speedup vs vanilla HuggingFace.
💾 AWQ Layer Streaming: Run 32B+ models on 8 GB of VRAM via pinned-host weight streaming.
🗄️ CPU Paged Decoding: Spill the KV cache to CPU RAM and handle long contexts without OOM errors.

Performance
Real numbers. Real hardware.
3–5×: TTFT speedup vs HF baseline
80%+: KV cache hit rate (multi-turn)
8 GB: VRAM for a 32B model (AWQ streaming)
~10K: lines of code (43 Python modules)

Time to First Token (ms, lower is better)
HF Baseline: 850 ms
Prefix Reuse: 320 ms
Chunk Reuse: 210 ms

Multi-Turn Cache Hit Rate (%)
Turn 1: 0%
Turn 2: 45%
Turn 3: 68%
Turn 4: 78%
Turn 5+: 85%

How It Works
Four layers of optimization.

01 Hash Chunks: The incoming prompt is split into chunks. Each chunk is hashed to look up previously cached K/V pairs.
02 Reuse Cache: Matching chunks skip prefill computation. Only novel tokens are forwarded through the transformer.
03 Flash Attention: New tokens run through FlashAttention-2, tiled CUDA kernels whose attention memory is O(N) in sequence length. No custom model code needed.
04 Page Offload: Long-context KV blocks are evicted to CPU RAM via async DMA, enabling contexts larger than GPU VRAM.

AWQ Layer Streaming
Run a 32B model on a gaming GPU.

Terminal
```
$ python -m kvboost.streaming.demo partial 8b --model Qwen/Qwen2.5-32B-Instruct-AWQ
INFO: Replaced projections:
  56 resident across 8 layers
  392 streamed across 56 layers
load time: 10.7s
peak vram after load: 5.65 GB
avg tok per s: 0.11
peak vram during decode: 6.13 GB
```

5.65 GB: Peak VRAM after loading a 32B model, which fits on a single 8 GB gaming GPU.
6.13 GB: Peak VRAM during decode, which stays safely under the 8 GB limit.
0.11 tok/s: PCIe-bound throughput. Built for VRAM savings, not raw speed.

Use Cases
Who needs KVBoost?

💻 AI Coding Assistants: System prompts are reused across hundreds of requests. Cache the context once and speed up TTFT on every response by 3–5×.
📚 RAG Pipelines: Document chunks appear in many queries. Chunk-level reuse makes multi-document QA dramatically faster.
⚙️ Edge / Budget Infra: AWQ streaming lets teams deploy 30B+ models on consumer GPUs, with no $10K A100 required.
💬 Multi-Turn Chatbots: Conversation history grows each turn. CPU paged decoding handles long contexts without OOM crashes.
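
The "Hash Chunks" and "Reuse Cache" steps under How It Works can be illustrated with a small stdlib-only sketch. This is not KVBoost's code: the chunk size, the SHA-256 key, and the names `split_chunks`, `chunk_key`, and `ChunkCache` are all assumptions made for illustration, and the cached K/V tensors are replaced by placeholder strings. It only shows the lookup bookkeeping; a real engine also has to account for the fact that K/V values depend on token position and preceding context.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow


def split_chunks(token_ids: List[int], size: int = CHUNK_SIZE) -> List[Tuple[int, ...]]:
    """Split token ids into fixed-size chunks; the last chunk may be shorter."""
    return [tuple(token_ids[i:i + size]) for i in range(0, len(token_ids), size)]


def chunk_key(chunk: Tuple[int, ...]) -> str:
    """Deterministic content hash of one chunk of token ids."""
    data = ",".join(str(t) for t in chunk).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class ChunkCache:
    """Maps a chunk hash to a placeholder standing in for cached K/V tensors."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def warm(self, token_ids: List[int]) -> None:
        """Cache every full chunk of a prompt that is expected to recur."""
        for chunk in split_chunks(token_ids):
            if len(chunk) == CHUNK_SIZE:  # partial chunks are not cached
                self._store[chunk_key(chunk)] = f"kv-for-{chunk}"

    def plan(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Return (number of reused tokens, tokens that still need prefill)."""
        reused = 0
        novel: List[int] = []
        for chunk in split_chunks(token_ids):
            if chunk_key(chunk) in self._store:
                reused += len(chunk)
            else:
                novel.extend(chunk)
        return reused, novel


cache = ChunkCache()
system_prompt = [101, 7, 8, 9, 10, 11, 12, 13]  # 8 tokens = 2 full chunks
cache.warm(system_prompt)

request = system_prompt + [42, 43, 44]  # shared prefix plus a new user turn
reused, novel = cache.plan(request)
reuse_ratio = reused / len(request)
print(reused, novel, round(reuse_ratio, 2))
```

Because the key depends only on a chunk's contents, a chunk can in principle be matched anywhere in a later prompt, not just at the start. That is the sense in which chunk-level reuse goes beyond plain prefix reuse in this sketch.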
MIT Licensed · Drop-in with HuggingFace Transformers · No fine-tuning, no architecture changes

Technology
Built on solid foundations.

✓ FlashAttention-2: Tiled CUDA kernels for attention with O(N) memory in sequence length
✓ AWQ AutoQuant: Weight-only 4-bit quantization that preserves accuracy
✓ HuggingFace Transformers: Drop-in compatibility, no model changes required
✓ CUDA DMA Streams: Async PCIe transfers for layer-by-layer weight streaming
✓ Chunk Hashing: Deterministic token-level hashing for cache lookup
✓ CPU Paged Memory: Page-table KV offload that evicts cold blocks to RAM
✓ PyPI Package: pip install kvboost, ready in 2 minutes
✓ MIT License: Fully open source, production-ready for any use

Roadmap
What's next.

Now ✅
✓ Chunk-level KV reuse
✓ FlashAttention-2 integration
✓ AWQ layer streaming
✓ CPU paged decoding

Next 🔨
◦ Multi-GPU tensor parallel
◦ Speculative decoding
◦ LoRA adapter hot-swap
◦ Continuous batching

Future 🔭
◦ GGUF / GGML support
◦ Triton custom kernels
◦ Distributed KV cache
◦ Cloud-hosted cache tier

Start building faster.
KVBoost is open source and production-ready. Drop it into any HuggingFace project today.

GitHub: github.com/pythongiant/kvboost
PyPI: pypi.org/project/kvboost/
Docs: kvboost.readthedocs.io

$ pip install kvboost

MIT License · Built by @pythongiant