Stale RAG vs. expensive RAG: how to cache RAG context without serving outdated answers

If you run a RAG system in production, you eventually hit a dilemma that has nothing to do with your model and everything to do with your cache.

Cache the answers to save tokens and latency, and one day a source document changes — but your cache keeps cheerfully serving the answer it built from the old document. Nobody gets an error. The number is just quietly wrong.

Cache nothing, and every single call re-retrieves the same chunks, re-reads them, and re-pays the full context bill to rebuild an understanding you already built five minutes ago for a nearly identical question.

Stale or expensive. Most teams pick "expensive" because at least it's correct, then bolt on a TTL and hope. This post is about why the TTL doesn't save you, and about two specific, mechanical fixes that let you cache RAG context and stay fresh. I maintain an open-source library called Coalent that implements both, so I'll use it for the runnable examples, but the two ideas are portable and worth stealing even if you never pip install anything.

Here's the standard "answer cache" sitting in front of retrieval:

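def answer_with_ttl_cache(query, cache, retriever, llm):
    # Naive answer cache: the key is the query text and nothing else.
    answer = cache.get(query)
    if answer is None:
        chunks = retriever.retrieve(query)
        answer = llm.synthesize(query, chunks)
        # The entry expires on a timer, not when its source changes.
        cache.set(query, answer, ttl=3600)
    return answer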

This works until billing.md changes. The refund window goes from 30 days to 14. Your cache has an answer keyed on "what is our refund policy?" that says 30, and it will keep saying 30 for up to an hour, or indefinitely if the cache uses sliding expiration and a popular question keeps resetting the timer before it can run out.

The reason this is hard is that the cache key (the query) has no relationship to the thing that changed (the source). You cached an answer; you threw away the fact that this particular answer was derived from billing.md. So when billing.md changes, you have no way to find the answers that depended on it.

The TTL is a confession that you can't answer the question "which cached units are now wrong?", so you guess a time and blow everything away on a timer. Too short and you've rebuilt a perfectly good cache for nothing. Too long and you serve stale data. There is no good number, because freshness is an event, not a duration.
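To make the failure concrete, here is a minimal sketch of a query-keyed TTL cache with a manually advanced clock. Everything in it (FakeClock, TTLCache, the docs dictionary, the one-line stand-in for retrieval and synthesis) is a hypothetical toy for illustration, not part of any library.

```python
class FakeClock:
    """Manually advanced clock so the example needs no real waiting."""

    def __init__(self):
        self.now = 0.0


class TTLCache:
    """Toy query-keyed cache: entries expire on a timer and nothing else."""

    def __init__(self, clock):
        self._clock = clock
        self._entries = {}  # query -> (answer, expires_at)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        answer, expires_at = entry
        if self._clock.now >= expires_at:
            del self._entries[query]
            return None
        return answer

    def set(self, query, answer, ttl):
        self._entries[query] = (answer, self._clock.now + ttl)


clock = FakeClock()
cache = TTLCache(clock)
docs = {"billing.md": "Refunds within 30 days."}


def answer_query(query):
    cached = cache.get(query)
    if cached is None:
        cached = docs["billing.md"]  # stand-in for retrieve + synthesize
        cache.set(query, cached, ttl=3600)
    return cached


question = "what is our refund policy?"
first = answer_query(question)               # built from the 30-day text
docs["billing.md"] = "Refunds within 14 days."
clock.now = 1800.0
stale = answer_query(question)               # source changed, cache did not notice
clock.now = 3601.0
fresh = answer_query(question)               # only the timer running out fixes it
```

Nothing in the middle call raises or warns. The cache has no record that the entry came from billing.md, so the only thing that can correct it is the clock.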

The fix is to record, for every cached unit, which sources it cited — its provenance — and to invalidate by set membership when a source changes. No source change, no invalidation. A source changes, and only the units that actually cited it get marked dirty.
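The idea is portable enough to sketch in plain Python. The following is a minimal illustration, assuming each cached unit is stored with the set of source ids it cited; ProvenanceCache and invalidate_source are hypothetical names, and this is not how Coalent implements it internally.

```python
from collections import defaultdict


class ProvenanceCache:
    """Toy cache that remembers which sources each cached unit cited."""

    def __init__(self):
        self._units = {}                   # query -> (answer, cited source ids)
        self._cited_by = defaultdict(set)  # source id -> queries that cited it

    def set(self, query, answer, sources):
        self._evict(query)  # drop old reverse-index links if the unit is rebuilt
        self._units[query] = (answer, frozenset(sources))
        for source_id in sources:
            self._cited_by[source_id].add(query)

    def get(self, query):
        unit = self._units.get(query)
        return unit[0] if unit is not None else None

    def _evict(self, query):
        unit = self._units.pop(query, None)
        if unit is None:
            return
        for source_id in unit[1]:
            self._cited_by[source_id].discard(query)

    def invalidate_source(self, source_id):
        """Evict only the units that cited source_id and return their keys."""
        dirty = set(self._cited_by.get(source_id, ()))
        for query in dirty:
            self._evict(query)
        return dirty


cache = ProvenanceCache()
cache.set("what is our refund policy?", "30 days", sources=["billing.md"])
cache.set("how do I reset my password?", "Use the reset link.", sources=["auth.md"])

dirty = cache.invalidate_source("billing.md")
untouched = cache.invalidate_source("pricing.md")  # nothing cited it
```

The reverse index is the whole trick: a change to billing.md becomes a dictionary lookup, and the password answer is never touched.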

That requires a cache that keeps the derivation link instead of discarding it. In Coalent, the cached unit retains the source spans it was built from, so a change event is a reverse-index lookup:

from coalent import SemanticCache, LLMSynthesizer, OpenAIProvider, OpenAIEmbedder

cache = SemanticCache(
    retriever,                               # bring your own: vector DB, GraphRAG, tools
    LLMSynthesizer(OpenAIProvider(), model="gpt-4o-mini"),
    embedder=OpenAIEmbedder(),
)

r = cache.get("what is our refund policy?")

result = cache.source_changed("billing.md", text="...refunds within 14 days...")

Two details that matter in practice:

source_changed hashes the new text and compares it to the hash stored on each unit's provenance. If a CI job rewrites billing.md with byte-for-byte identical content, nothing is invalidated. You don't pay to rebuild understanding that didn't change.

The honest caveat: invalidation today is per-artifact. If you chunk one giant doc into 200 pieces that all share a single artifact_id, a change anywhere in that doc dirties everything derived from it, which means more rebuilding than strictly necessary. Span-level granularity (invalidate only the units whose specific cited span changed) is the planned improvement. If your sources map roughly one-document-to-one-id, you're already in the good case.
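The no-op-on-identical-content check is easy to reproduce on its own. This sketch assumes one stored hash per artifact id, which is a simplification of hashing per unit provenance; SourceRegistry and has_changed are hypothetical names, and SHA-256 is my choice rather than something the library documents.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a source's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SourceRegistry:
    """Tracks the last seen content hash for each artifact id."""

    def __init__(self):
        self._hashes = {}  # artifact_id -> hash of the last seen text

    def has_changed(self, artifact_id, text):
        """Record the new hash; return True only if the content differs."""
        new_hash = content_hash(text)
        changed = self._hashes.get(artifact_id) != new_hash
        self._hashes[artifact_id] = new_hash
        return changed


registry = SourceRegistry()
first_seen = registry.has_changed("billing.md", "Refunds within 30 days.")
rewritten_identically = registry.has_changed("billing.md", "Refunds within 30 days.")
really_changed = registry.has_changed("billing.md", "Refunds within 14 days.")
```

Only a True result needs to trigger invalidation. Note that the granularity here is the artifact id, so this sketch has the same per-artifact limitation described above.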

Now the other direction. To cut tokens, you cache a summary of the retrieved context instead of the raw chunks. Great — until a later query needs a detail the summary dropped.

You retrieved a dense billing doc, summarized it to "Refunds are available within the policy window, with some exceptions," and cached that. Then someone asks: "Can I get a refund on a gift-card purchase?" The summary is on-topic, so it "hits" — and answers from a blob that silently dropped the gift-card exception. Your cache just returned an answer worse than plain RAG would have, because plain RAG still had the raw chunk with the exception in it. The cache hit, and the hit was a downgrade.

This is the subtle one. Staleness is at least detectable in principle. The lossy-summary gap looks like a perfectly healthy cache hit. Your hit rate goes up. Your accuracy goes down. Nobody notices until a customer does.

Two parts.

First, never throw away the raw evidence. Cache the cheap summarized understanding for the common case, but keep the chunks attached to the unit so the detail is always one hop away.

Second — and this is the part most "semantic cache" implementations skip — measure whether the hit actually covers the query before you trust it. A cache hit means "a unit on this topic exists," not "this unit answers this question." Those are different claims, and conflating them is exactly the gift-card bug.

Coalent scores coverage per claim. It embeds each claim/fact in a unit separately and takes the max cosine similarity against the query. If the query's best match against any stored claim is below a floor, the hit escalates: it pulls fresh raw evidence for that specific query and answers from it — still counted as a hit, but no longer answering from a summary that's missing the point.

r = cache.get("can I get a refund on a gift-card purchase?")

r.coverage    # 0..1: how well the unit's best claim addresses THIS query
r.escalated   # True -> coverage was under the floor, so it pulled fresh raw
r.context["raw"]  # the raw evidence, present once escalated

One thing to be precise about: escalation only kicks in on a warm hit — a unit on this topic already exists but doesn't cover this specific query. A cold, first-ever query is just a normal miss that retrieves and builds fresh, so it gets the full evidence anyway. Escalation is for the dangerous in-between: when the cache has something on-topic but incomplete, and would otherwise answer from it.
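A minimal sketch of the per-claim check, using hand-made three-dimensional vectors in place of real embeddings. The axes, the vectors, and the 0.75 floor are all made up for illustration; a real system would get vectors from an embedder and tune the floor on its own data.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coverage(query_vec, claim_vecs):
    """Best similarity between the query and any single stored claim."""
    return max((cosine(query_vec, c) for c in claim_vecs), default=0.0)


def should_escalate(query_vec, claim_vecs, floor=0.75):
    """True when no stored claim addresses the query well enough."""
    return coverage(query_vec, claim_vecs) < floor


# Toy axes: [refund window, exceptions, gift cards]
claims = [
    [1.0, 0.0, 0.0],  # "Refunds are available within the policy window"
    [0.6, 0.8, 0.0],  # "Some exceptions apply"
]
window_query = [0.9, 0.1, 0.0]      # "how long is the refund window?"
gift_card_query = [0.3, 0.2, 0.93]  # "can I get a refund on a gift-card purchase?"

serve_from_summary = not should_escalate(window_query, claims)
pull_raw_evidence = should_escalate(gift_card_query, claims)
```

Both queries are about refunds, but only the first has a stored claim pointing the same way. The second falls under the floor and would be answered from raw evidence instead of the summary.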

Why per-claim and not one similarity score over the whole summary? Because a single embedding of the whole unit is a centroid. An on-topic-but-uncovered query ("gift-card refund") sits close to the centroid of a refund unit and looks "covered" even though the specific fact is missing. Scoring each claim separately is what catches the gap. And because it's semantic, "nicked card" still matches a claim about a "stolen card", which a keyword gate would miss. (That semantic behaviour assumes a real embedder, such as OpenAIEmbedder or a local model; on the zero-dependency HashingEmbedder default, matching degrades to keyword overlap, so use a real embedder in production.)
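The centroid effect can be shown with toy numbers. This sketch models the whole-unit embedding as the mean of its claim vectors, which is a simplifying assumption (real text embedders do not literally average), and the vectors and floor are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


FLOOR = 0.75  # hypothetical coverage floor

# Three claims in a refund unit, each pointing along its own toy axis.
claims = [
    [1.0, 0.0, 0.0],  # refund window
    [0.0, 1.0, 0.0],  # exceptions exist
    [0.0, 0.0, 1.0],  # processing time
]

# Assumption: one embedding for the whole unit behaves like the mean of its claims.
centroid = [sum(dim) / len(claims) for dim in zip(*claims)]

# An on-topic query that overlaps every claim a little and matches none of them.
query = [0.6, 0.6, 0.52]

centroid_score = cosine(query, centroid)
per_claim_score = max(cosine(query, c) for c in claims)
```

Here centroid_score comes out close to 1.0 while per_claim_score stays near 0.6, so a single whole-unit score would wave the query through and the per-claim score would escalate it.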

The cheap cosine check decides the clear cases for free. For the ambiguous middle band, you can opt into a stricter coverage_scorer (a cross-encoder or LLM-entailment check) that only fires on the genuinely uncertain hits: containment-grade accuracy without paying for it on every query.
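One way to picture the two-tier gate is the sketch below. The band edges (0.55 and 0.80), the decide function, and the callable signature of the strict scorer are assumptions made for the example, and the stub merely stands in for a cross-encoder or LLM entailment call; Coalent's actual coverage_scorer interface may differ.

```python
def decide(cosine_score, query, unit_text, strict_scorer, low=0.55, high=0.80):
    """Return "serve" or "escalate" for a warm hit.

    Clear cases are settled by the cosine score alone. Only scores inside
    the [low, high) band pay for the stricter check.
    """
    if cosine_score >= high:
        return "serve"
    if cosine_score < low:
        return "escalate"
    return "serve" if strict_scorer(query, unit_text) else "escalate"


strict_calls = []


def stub_entailment(query, unit_text):
    """Stand-in for an expensive check; records each time it is invoked."""
    strict_calls.append(query)
    return "gift-card" in unit_text  # toy rule, not a real entailment model


unit = "Refunds are available within the policy window, with some exceptions."

clear_hit = decide(0.91, "how long is the refund window?", unit, stub_entailment)
clear_miss = decide(0.30, "what is the office wifi password?", unit, stub_entailment)
ambiguous = decide(
    0.68, "can I get a refund on a gift-card purchase?", unit, stub_entailment
)
```

Of the three lookups, only the ambiguous one reaches the strict scorer, which is the point of the band: the expensive check runs on the uncertain hits and nowhere else.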

The metric that ties it together is the escalation rate:

cache.stats()["escalation_rate"]   # fraction of hits that had to fall back to raw

If that climbs, your cached understanding is systematically under-covering real queries — a signal to deepen what you cache (or lower the floor). It's also your honest readout of when the cache has quietly degraded into doing plain RAG. You want to see that, not discover it from a support ticket.
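If you are rolling your own cache, the metric is a few counters. This is a minimal sketch with hypothetical names (CacheStats, record); it follows the definition used here, where an escalated lookup still counts as a hit and the rate is taken over hits.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counters for hit rate and escalation rate."""

    hits: int = 0
    misses: int = 0
    escalations: int = 0

    def record(self, hit, escalated=False):
        if hit:
            self.hits += 1
            if escalated:
                self.escalations += 1  # escalated lookups are still hits
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def escalation_rate(self):
        """Fraction of hits that had to fall back to raw evidence."""
        return self.escalations / self.hits if self.hits else 0.0


stats = CacheStats()
for _ in range(2):
    stats.record(hit=False)                 # cold misses
for _ in range(6):
    stats.record(hit=True)                  # hits answered from the summary
for _ in range(2):
    stats.record(hit=True, escalated=True)  # hits that fell back to raw
```

In this run the hit rate is 0.8 and the escalation rate is 0.25. Watching the two together is what exposes a cache whose hit rate looks healthy while a growing share of those hits are really plain RAG.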

I benchmarked this against feeding full retrieved context to the model on every call, with an independent gpt-4o judge, on number-dense documents, with a source change midway through the run to test invalidation.

So: roughly two-thirds the context tokens per read at near-parity accuracy, and — the part the token number doesn't show — it stays correct when a source changes, because invalidation is provenance-scoped instead of a TTL guess.

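You don't need my library to apply the two ideas. The structural lessons are:

First, keep the derivation link. Record which sources each cached unit cited, and invalidate on a source-change event rather than on a TTL.

Second, a cache hit is not coverage. Keep the raw evidence attached to the unit, measure whether the hit actually covers this query, and fall back to raw when it doesn't.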

Get those two right and the stale-vs-expensive dilemma mostly dissolves: you cache aggressively for reuse, invalidate surgically for freshness, and fall back to raw exactly when — and only when — the cache can't actually answer the question.

Coalent is Apache-2.0, pure Python, zero required dependencies, if you want a reference implementation to read or build on: github.com/Vectorlink-Labs/coalent.

I'm Nisarg Pujara, and I maintain Coalent — so take the benchmark numbers as an invitation to run your own on your own corpus, which is the only benchmark that should ever convince you.

If you'd like to explore Coalent further:

🌐 Website: https://coalent.ai

⭐ GitHub: https://github.com/Vectorlink-Labs/coalent

📦 PyPI: https://pypi.org/project/coalent/

📚 Documentation: https://coalent.ai/docs

Source: original article on dev.to.