We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.

RAG has become the default answer for giving LLMs access to private knowledge. And for good reason: it works.

But after running it in production we kept hitting the same wall. Not retrieval accuracy. The operational tax.

Re-embedding on data changes. Chunking drift. Retrieval misses on edge cases. Pipeline failures at 2am. The vector database that needs babysitting.

So we ran an experiment.

The Hypothesis

What if, instead of chunking, embedding, and retrieving, we just loaded the full document into the LLM context, cached the KV state persistently, and reused it across every query?

No retrieval step. No embedding pipeline. No vector database. Just the model with full document context, warm and ready.

How It Works

The core idea is simple. When an LLM processes a prompt it generates a key-value attention cache, the internal representation of everything it has read. Normally this cache is transient. It lives in VRAM during the request and disappears after.

We persist it.

The initialization prompt (your document) gets processed once. The resulting KV cache gets stored externally and indexed to that document. Every subsequent query retrieves that cached state and appends the user query. The model never recomputes the document. Ever.

The math:

    KV_init = LLM.prefill(document)
    KV_store[document_id] = KV_init

On every query:

    KV_full = KV_store[document_id] + LLM.prefill(query)
    output = LLM.decode(KV_full)

What We Found

Answer quality improved. No retrieval misses are possible when the full document is in context. The model has read everything. It doesn't guess which chunks are relevant; it knows the whole document. For complex multi-part questions that span different sections this is a significant improvement over chunked retrieval.

Updates became trivial. Document changes? Re-run the prefill, store the new KV cache. Minutes not hours. No re-embedding pipeline. No re-indexing. No retrieval regression testing. Just regenerate and deploy.

Operational complexity dropped.
No embedding model to maintain. No vector database to monitor. No chunking strategy to tune. No retrieval quality metrics to track. The surface area for things to break quietly got dramatically smaller.

Latency on warm cache is effectively instant. When the KV state is already loaded, the only work left is appending the query and generating. No retrieval hop, no context injection latency.

The Honest Tradeoffs

Context window is the ceiling. Current limit is around 120k tokens, roughly 200-300 pages. Works well for focused documents. For large corpora you need a routing layer to select the right cache per query. You've pushed the retrieval problem up one level: instead of retrieving chunks you're selecting a cache. Simpler problem but not zero.

Cold cache restore adds latency. The first query after a cache restore pays a latency cost. For strict SLA requirements this matters. Warm cache is instant. Cold restore depends on your infrastructure.

Initial prefill costs more than embedding. Running a full forward pass on a large document costs more compute than embedding it. The economics work when query volume is high enough to amortize that cost. With low query volume and high update frequency, RAG still wins.

Where This Wins

This approach is clearly better when:

- You have a focused, structured document: legal contract, compliance policy, product manual, technical spec
- Query volume is high relative to update frequency
- Full context comprehension matters more than breadth
- You want to eliminate pipeline maintenance entirely
- Privacy matters: no document chunks sent to embedding APIs

Where RAG Still Wins

- Very large document collections where context limits apply
- Highly dynamic data that changes multiple times per day
- When you genuinely don't know which document is relevant at query time
- Low query volume where prefill cost doesn't amortize

What We're Building

We've been running this in production at InferX as part of our Sovereign Endpoints™ infrastructure.
The persistent KV cache layer sits on top of our GPU snapshotting architecture, which is what makes the cold cache restore fast enough to be practical.

We're now opening a limited beta for teams who want to test this on real workloads. We're particularly interested in legal, compliance, finance, and developer tooling use cases. If you're running RAG in production and want to run a head-to-head comparison, we'd love to work with you.

🎬 Demo dropping in 2 days. Follow to see it first.
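
Appendix: a toy sketch of the flow

For readers who want the prefill-once, reuse-per-query flow from How It Works in concrete terms, here is a minimal sketch. It is an illustration under loud assumptions, not production code: the "model" is a single toy attention head with made-up projections, the store is a pickle file in a temp directory, and every name (`embed`, `attend`, `prefill`, `save_kv`, `load_kv`, the `contract-001` id) is hypothetical. All it shows is that restoring the document's cached keys and values and processing only the query gives the same result as recomputing document plus query from scratch.

```python
import math
import pickle
import tempfile
from pathlib import Path

DIM = 4  # toy hidden size (assumption)

def embed(token):
    """Deterministic toy embedding derived from character codes (hypothetical)."""
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1)) for i in range(DIM)]

def attend(q, cache):
    """Softmax attention of one query vector over every cached (key, value) pair."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, (_, v) in zip(weights, cache)) / total
            for d in range(DIM)]

def prefill(tokens, past=None):
    """Extend a KV cache with `tokens`; return (cache, output at the last token)."""
    cache = list(past or [])  # copy, so a restored cache is never mutated
    out = None
    for tok in tokens:
        e = embed(tok)
        cache.append(([0.5 * x for x in e], [2.0 * x for x in e]))  # toy K and V
        out = attend(e, cache)  # causal: sees only tokens up to this one
    return cache, out

def save_kv(store_dir, doc_id, cache):
    """Persist a cache outside the process, indexed by document id."""
    (store_dir / f"{doc_id}.kv").write_bytes(pickle.dumps(cache))

def load_kv(store_dir, doc_id):
    """Restore the cache previously stored for a document id."""
    return pickle.loads((store_dir / f"{doc_id}.kv").read_bytes())

store_dir = Path(tempfile.mkdtemp())
document = "the contract renews every twelve months".split()
query = "when does it renew".split()

# One-time prefill of the document, stored under its id.
kv_init, _ = prefill(document)
save_kv(store_dir, "contract-001", kv_init)

# Per query: restore the cached state and process only the query tokens.
kv_full, out_cached = prefill(query, past=load_kv(store_dir, "contract-001"))

# Reference path: recompute document + query from scratch.
_, out_full = prefill(document + query)

print(all(math.isclose(a, b) for a, b in zip(out_cached, out_full)))
```

In a real model the cache holds per-layer key and value tensors rather than small Python lists, so its size grows with layers, heads, and tokens, and the save and restore steps are where cold-restore latency comes from.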