{"slug": "show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft", "title": "Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT", "summary": "KVBoost is a new open-source Python library that accelerates HuggingFace LLM inference by implementing chunk-level KV cache reuse, achieving 3–5× faster time-to-first-token (TTFT) and up to 85% cache hit rates in multi-turn scenarios. The library also supports FlashAttention-2, AWQ layer streaming to run 32B-parameter models on 8 GB consumer GPUs, and CPU paged decoding for long-context handling, all without requiring model architecture changes.", "body_md": "pip install kvboost\nKVBoost\nFaster LLM Inference.\nLess VRAM. No Model Changes.\nChunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding\n⚡\nThe Problem\nLLM inference is broken by default.\n🧱\nVRAM Walls\nModern LLMs like Qwen2.5-32B require 60+ GB VRAM at full precision — out of reach for most teams.\n🐢\nSlow Prefill\nRepeated system prompts are recomputed from scratch on every single request — wasting GPU cycles constantly.\n🔧\nHF Bottlenecks\nHuggingFace's default inference loop has no KV cache reuse, no chunked attention, and no memory-efficient decoding.\nThe Solution\nKVBoost: drop-in, no rewrites.\nPython\nfrom kvboost import KVBoost\nengine = KVBoost.from_pretrained(\"Qwen/Qwen2.5-3B\")\n# Warm a shared prefix once\nengine.warm(\"You are a helpful assistant...\")\n# All subsequent calls reuse cache\nresult = engine.generate(prompt)\nprint(result.kv_reuse_ratio)\n# ✓ 80%+\n⚡\nKV Cache Reuse\nChunk-level cache reuse eliminates redundant prefill for shared prompts.\n🚀\nFlashAttention-2\nMemory-efficient attention with 3–5× TTFT speedup vs vanilla HuggingFace.\n💾\nAWQ Layer Streaming\nRun 32B+ models on 8 GB VRAM via pinned-host weight streaming.\n🗄️\nCPU Paged Decoding\nSpill KV cache to CPU RAM — handle long contexts without OOM 
errors.\nPerformance\nReal numbers. Real hardware.\n3–5×\nTTFT Speedup\nvs HF Baseline\n80%+\nKV Cache Hit Rate\nMulti-Turn\n8 GB\nVRAM for 32B Model\nAWQ Streaming\n~10K\nLines of Code\n43 Python Modules\nTime to First Token (ms) — lower is better\nHF Baseline: 850ms\nPrefix Reuse: 320ms\nChunk Reuse: 210ms\nMulti-Turn Cache Hit Rate (%)\nTurn 1: 0%\nTurn 2: 45%\nTurn 3: 68%\nTurn 4: 78%\nTurn 5+: 85%\nHow It Works\nFour layers of optimization.\n01\nHash Chunks\nIncoming prompt is split into chunks. Each chunk is hashed to look up prior cached K/V pairs.\n02\nReuse Cache\nMatching chunks skip attention entirely. Only novel tokens are forwarded through the transformer.\n03\nFlash Attention\nNew tokens run FlashAttention-2 — tiled CUDA kernels with O(N) memory. No custom model code needed.\n04\nPage Offload\nLong-context KV blocks are evicted to CPU RAM via async DMA — enabling contexts beyond GPU VRAM.\nAWQ Layer Streaming\nRun a 32B model on a gaming GPU.\nTerminal\n$ python -m kvboost.streaming.demo_partial_8b --model Qwen/Qwen2.5-32B-Instruct-AWQ\nINFO: Replaced projections:\n56 resident across 8 layers\n392 streamed across 56 layers\nload_time: 10.7s\npeak_vram_after_load: 5.65 GB\navg_tok_per_s: 0.11\npeak_vram_during_decode: 6.13 GB\n5.65 GB\nPeak VRAM after loading a 32B model — fits on a single 8 GB gaming GPU.\n6.13 GB\nPeak VRAM during decode — stays safely under the 8 GB limit.\n0.11 tok/s\nPCIe-bound throughput — built for VRAM savings, not raw speed.\nUse Cases\nWho needs KVBoost?\n💻\nAI Coding Assistants\nSystem prompts are re-used across 100s of requests. Cache the context once, speed up every response by 3–5×.\n📚\nRAG Pipelines\nDocument chunks appear in many queries. Chunk-level reuse makes multi-document QA dramatically faster.\n⚙️\nEdge / Budget Infra\nAWQ streaming lets teams deploy 30B+ models on consumer GPUs — no $10K A100 required.\n💬\nMulti-Turn Chatbots\nConversation history grows each turn. 
CPU paged decoding handles long contexts without OOM crashes.\nMIT Licensed · Drop-in with HuggingFace Transformers · No fine-tuning, no architecture changes\nTechnology\nBuilt on solid foundations.\n✓\nFlashAttention-2\nTiled CUDA kernels for O(N) memory attention\n✓\nAWQ (Activation-aware Weight Quantization)\nWeight-only 4-bit quantization preserving accuracy\n✓\nHuggingFace Transformers\nDrop-in compatibility — no model changes required\n✓\nCUDA DMA Streams\nAsync PCIe transfers for layer-by-layer weight streaming\n✓\nChunk Hashing\nDeterministic token-level hashing for cache lookup\n✓\nCPU Paged Memory\nPage-table KV offload — evict cold blocks to RAM\n✓\nPyPI Package\npip install kvboost — ready in 2 minutes\n✓\nMIT License\nFully open source, production-ready for any use\nRoadmap\nWhat's next.\nNow ✅\n✓ Chunk-level KV reuse\n✓ FlashAttention-2 integration\n✓ AWQ layer streaming\n✓ CPU paged decoding\nNext 🔨\n◦ Multi-GPU tensor parallel\n◦ Speculative decoding\n◦ LoRA adapter hot-swap\n◦ Continuous batching\nFuture 🔭\n◦ GGUF / GGML support\n◦ Triton custom kernels\n◦ Distributed KV cache\n◦ Cloud-hosted cache tier\nStart building faster.\nKVBoost is open source and production-ready.\nDrop it into any HuggingFace project today.\nGitHub\ngithub.com/pythongiant/kvboost\nPyPI\npypi.org/project/kvboost/\nDocs\nkvboost.readthedocs.io\n$ pip install kvboost\nMIT License · Built by @pythongiant", "url": "https://wpnews.pro/news/show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft", "canonical_source": "https://pythongiant.github.io/KVBoost/", "published_at": "2026-05-22 04:47:48+00:00", "updated_at": "2026-05-22 06:05:08.758415+00:00", "lang": "en", "topics": ["large-language-models", "open-source", "developer-tools", "machine-learning", "artificial-intelligence"], "entities": ["KVBoost", "HuggingFace", "FlashAttention-2", "AWQ", "Qwen2.5-32B", "Qwen2.5-3B", "CPU Paged Decoding"], "alternates": {"html": 
"https://wpnews.pro/news/show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft", "markdown": "https://wpnews.pro/news/show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft.md", "text": "https://wpnews.pro/news/show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft.txt", "jsonld": "https://wpnews.pro/news/show-hn-kvboost-chunk-level-kv-cache-reuse-for-huggingface-5-48x-faster-ttft.jsonld"}}