Source: dev.to

RAG in production: the failure modes nobody warns you about

A developer at Krazimo, a company building RAG systems over private knowledge, outlines the most common failure modes in production retrieval-augmented generation. The biggest source of wrong answers is retrieval providing irrelevant or partial context, which the LLM summarizes confidently. Fixes include semantic chunking, reranking, metadata filtering, forced grounding with citations, incremental re-indexing, and rigorous evaluation sets.

Published Jun 24, 2026 · 3 min read

Retrieval-augmented generation looks trivial in a tutorial: embed some documents, drop them in a vector database, stuff the top matches into a prompt, done. Then you point it at real company data and real users, and you discover that the demo was the easy 10%.

We build RAG systems over private knowledge for companies, and almost every painful bug traces back to the same handful of failure modes. Here they are, and what actually fixes them.

The single biggest source of "wrong" RAG answers isn't the LLM. It's retrieval handing it irrelevant or partial context, which the model then summarizes with total confidence. Naive fixed-size chunking splits a table from its header, or a clause from the sentence that negates it.
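To make the chunking failure concrete, here is a minimal sketch that compares a fixed-size splitter with one that packs whole paragraphs. The function names, the sample document, and the size limits are all illustrative assumptions, and treating blank lines as the "semantic boundary" is only the simplest possible choice.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole, not split.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample document: a clause followed by the sentence negating it.
doc = (
    "Refund policy\n"
    "Refunds are available within 30 days. "
    "This does not apply to custom orders.\n\n"
    "Shipping\n"
    "Orders ship within 2 business days."
)

if __name__ == "__main__":
    for label, chunks in [
        ("fixed", fixed_size_chunks(doc, 60)),
        ("paragraph", paragraph_chunks(doc, 120)),
    ]:
        print(label, chunks)
```

With the 60-character window, the cut lands inside the sentence that negates the refund rule, so the first chunk states the rule without its exception. The paragraph-based splitter keeps the heading, the rule, and the exception together.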

The fix is unglamorous data engineering: chunk on semantic boundaries, not character counts; add a reranking step so the top-k you actually pass is the top-k by relevance, not by raw vector distance; and store enough metadata to filter before you search. Retrieval quality sets the ceiling on everything downstream.
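The filter, search, rerank order described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Chunk` type, the in-memory list standing in for a vector database, and especially `overlap_score`, a toy term-overlap function standing in for a real reranking model.

```python
import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_score(query: str, text: str) -> float:
    """Toy reranker (assumption): fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def retrieve(
    query: str,
    query_vector: list[float],
    chunks: list[Chunk],
    filters: dict[str, str],
    candidates: int = 20,
    top_k: int = 3,
    rerank: Callable[[str, str], float] = overlap_score,
) -> list[Chunk]:
    # 1. Metadata filter: drop chunks that cannot be relevant before searching.
    eligible = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    # 2. Vector search: take a generous candidate set by similarity.
    by_similarity = sorted(
        eligible, key=lambda c: cosine(query_vector, c.vector), reverse=True
    )[:candidates]
    # 3. Rerank: order the candidates by relevance and keep only top_k.
    return sorted(
        by_similarity, key=lambda c: rerank(query, c.text), reverse=True
    )[:top_k]
```

The design point is that `candidates` is deliberately larger than `top_k`: the vector search only has to get the right chunk into the candidate pool, and the reranker decides what actually reaches the prompt.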

Even with perfect retrieval, an LLM will happily fill gaps with plausible invention. In a RAG system that's worse than no answer, because it looks sourced.

Force grounding: instruct the model to answer only from the retrieved context and to say "I don't know" when the context doesn't cover it, then verify compliance by requiring citations that point back to specific chunks. If you can't trace a sentence to a source, treat it as a hallucination, not an answer.
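One way to enforce this is to give every chunk an id in the prompt and then check the answer mechanically. The sketch below is an assumption-heavy illustration: the prompt wording, the `[chunk_id]` citation format, and the naive regex sentence splitter are all hypothetical choices, not a standard.

```python
import re

# Matches a citation such as [c1]; the bracket format is an assumption.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def build_grounded_prompt(question: str, chunks: dict[str, str]) -> str:
    """Build a prompt that exposes chunk ids so the model can cite them."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the context below. "
        "End every sentence with the id of the supporting chunk in square "
        "brackets, placed before the final period. "
        'If the context does not cover the question, reply exactly "I don\'t know".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def untraceable_sentences(answer: str, chunks: dict[str, str]) -> list[str]:
    """Return sentences that cite nothing or cite an id that was not retrieved."""
    if answer.strip() == "I don't know":
        return []
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited or any(cid not in chunks for cid in cited):
            flagged.append(sentence)
    return flagged
```

This check only proves that each sentence points at a chunk that was really retrieved. Whether the cited chunk supports the claim is a separate faithfulness check that needs a human or a second model.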

RAG is only as good as the index behind it. Documents change, get duplicated, get deleted — and a pipeline that ingested once at launch quietly serves last quarter's truth. The unsexy work is the ingestion pipeline: incremental re-indexing, de-duplication, and a freshness signal so old content can be down-weighted or expired.
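A rough sketch of those three pieces, assuming documents arrive as a `doc_id -> text` mapping and the index remembers a content hash per document. The function names, the normalization rule, and the 90-day half-life are illustrative assumptions, not recommendations.

```python
import hashlib
from datetime import datetime


def content_hash(text: str) -> str:
    """Hash of whitespace- and case-normalized text, used to detect changes."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_sync(source_docs: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare the source of truth with the index and decide what to do.

    source_docs maps doc_id -> current text.
    indexed maps doc_id -> content hash stored at the last ingestion.
    """
    plan: dict[str, list[str]] = {"upsert": [], "delete": [], "duplicate": []}
    canonical: dict[str, str] = {}  # content hash -> first doc_id seen with it
    for doc_id in sorted(source_docs):
        digest = content_hash(source_docs[doc_id])
        if digest in canonical:
            # Same content already kept under another id: do not index twice.
            plan["duplicate"].append(doc_id)
            continue
        canonical[digest] = doc_id
        if indexed.get(doc_id) != digest:
            # New or changed document: re-embed only this one.
            plan["upsert"].append(doc_id)
    # Anything in the index that is no longer a kept document must go.
    plan["delete"] = sorted(set(indexed) - set(canonical.values()))
    return plan


def freshness_weight(last_modified: datetime, now: datetime, half_life_days: float = 90.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    age_days = max((now - last_modified).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)
```

The freshness weight can be multiplied into the retrieval score to down-weight old chunks, or compared against a threshold to expire them. Both datetimes must be either naive or timezone-aware, since Python refuses to subtract a mix of the two.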

"The new embedding model feels better" is not an engineering statement. Without a held-out set of real questions with known-good answers, every change is a coin flip — you fix one query and silently break five. Build an eval set early, measure retrieval hit-rate and answer faithfulness on every change, and treat a regression like a failing test.
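As a minimal sketch of "treat a regression like a failing test", the snippet below computes retrieval hit-rate over a held-out set and fails loudly when a change lowers it. The `EvalCase` shape, the retriever signature (question in, ranked chunk ids out), and the tolerance value are assumptions. Answer faithfulness is not covered here because it needs a judge, human or model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: set[str]  # chunks a correct answer must draw on


def hit_rate_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for case in cases
        if case.relevant_chunk_ids & set(retrieve(case.question)[:k])
    )
    return hits / len(cases)


def check_regression(baseline: float, candidate: float, tolerance: float = 0.01) -> None:
    """Raise if the candidate metric falls below baseline minus tolerance."""
    if candidate < baseline - tolerance:
        raise AssertionError(
            f"retrieval hit-rate regressed: {candidate:.3f} < {baseline:.3f}"
        )
```

Run against the same eval set before and after swapping an embedding model, this turns "feels better" into two numbers and a pass or fail.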

Embedding the query, searching, reranking, and stuffing a large context into a big model adds up — in both seconds and dollars. Cache embeddings and frequent queries, retrieve fewer-but-better chunks rather than dumping everything, and reserve the largest model for the steps that genuinely need it.
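Caching is the cheapest of these to illustrate. The sketch below wraps any embedding function in an in-memory cache keyed by a hash of the whitespace-normalized text. The class name, the normalization rule, and the unbounded dict are assumptions; a production cache would need eviction and invalidation when the embedding model changes.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache in front of an embedding function (illustrative only)."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace so trivially different inputs share one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            # Only a miss pays for a call to the embedding function.
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Tracking `hits` and `misses` gives a quick read on whether the cache is earning its keep for a given query mix.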

Notice what's missing from that list: clever prompting. Production RAG is a data and retrieval engineering problem wearing an AI costume. The teams whose RAG holds up aren't the ones with the fanciest prompt — they're the ones who treat ingestion, chunking, retrieval, and evaluation as real systems with real tests.

That's the lens we bring from years of building data systems at scale before this wave. If you're moving a RAG prototype toward something users can actually trust in production, that's the kind of RAG and knowledge-AI work we do at Krazimo.

What's bitten you hardest in a production RAG system? I'll dig into specifics in the comments.
