RAG in production: the failure modes nobody warns you about

Source: dev.to · Published 2026-06-24T03:55Z · Topic: large-language-models

A developer at Krazimo, a company building RAG systems over private knowledge, outlines the most common failure modes in production retrieval-augmented generation. The biggest source of wrong answers is retrieval providing irrelevant or partial context, which the LLM summarizes confidently. Fixes include semantic chunking, reranking, metadata filtering, forced grounding with citations, incremental re-indexing, and rigorous evaluation sets.

3 min read · Published Jun 24, 2026

Retrieval-augmented generation looks trivial in a tutorial: embed some documents, drop them in a vector database, stuff the top matches into a prompt, done. Then you point it at real company data and real users, and you discover that the demo was the easy 10%.

We build RAG systems over private knowledge for companies, and almost every painful bug traces back to the same handful of failure modes. Here they are, and what actually fixes them.

The single biggest source of "wrong" RAG answers isn't the LLM. It's retrieval handing it irrelevant or partial context, which the model then summarizes with total confidence. Naive fixed-size chunking splits a table from its header, or a clause from the sentence that negates it.

The fix is unglamorous data engineering: chunk on semantic boundaries, not character counts; add a reranking step so the top-k you actually pass is the top-k by relevance, not by raw vector distance; and store enough metadata to filter before you search. Retrieval quality sets the ceiling on everything downstream.
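As a rough illustration of those three steps, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Chunk` type, the function names, and above all `overlap_score`, a toy word-overlap measure standing in for a real reranking model such as a cross-encoder. Paragraph breaks are used as a cheap proxy for semantic boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus filterable metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks; never cut inside a paragraph.

    A single paragraph longer than max_chars is kept as one oversized chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def filter_by_metadata(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in required.items())
    ]


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: share of query words found in the text.

    Assumption: a placeholder for a real reranker, not a recommendation.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query: str, candidates: list[Chunk], top_k: int = 3) -> list[Chunk]:
    """Order candidates by relevance score and keep the best top_k."""
    ranked = sorted(
        candidates, key=lambda c: overlap_score(query, c.text), reverse=True
    )
    return ranked[:top_k]
```

In a real pipeline the filter would usually run inside the vector store query and the scorer would be a trained model; the point of the sketch is only the order of operations: filter, search, then rerank before anything reaches the prompt.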

Even with perfect retrieval, an LLM will happily fill gaps with plausible invention. In a RAG system that's worse than no answer, because it looks sourced.

Force grounding: instruct the model to answer only from the retrieved context and to say "I don't know" when the context doesn't cover it. Then verify compliance by requiring citations that point back to specific chunks. If you can't trace a sentence to a source, treat it as a hallucination, not an answer.
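One way this could look in code is sketched below. The prompt wording, the `[n]` citation format, and both function names are assumptions for illustration, not a prescribed template. The check also assumes the model places citations before the sentence's closing punctuation, and it only confirms that a citation to a supplied chunk exists, not that the chunk actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below. "
        "End every sentence with the number of the chunk that supports it, "
        "placed before the period, like this [1]. "
        "If the context does not cover the question, reply exactly: "
        "\"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def untraceable_sentences(answer: str, num_chunks: int) -> list[str]:
    """Return sentences with no citation, or a citation to a chunk
    that was never supplied."""
    if answer.strip() == "I don't know.":
        return []
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()
    ]
    flagged = []
    for sentence in sentences:
        ids = [int(n) for n in CITATION.findall(sentence)]
        if not ids or any(i < 1 or i > num_chunks for i in ids):
            flagged.append(sentence)
    return flagged
```

Anything this function returns would be treated as unsupported: dropped, regenerated, or surfaced to the user as unverified, depending on the product.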

RAG is only as good as the index behind it. Documents change, get duplicated, get deleted — and a pipeline that ingested once at launch quietly serves last quarter's truth. The unsexy work is the ingestion pipeline: incremental re-indexing, de-duplication, and a freshness signal so old content can be down-weighted or expired.
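A minimal sketch of that ingestion logic, assuming documents are identified by a stable ID and that the index stores a content hash per document. The names `plan_sync` and `freshness_weight`, the whitespace-and-case normalization, and the 90-day half-life are all illustrative choices rather than anything taken from a specific system.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash normalized text so trivially different copies collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_sync(indexed: dict[str, str], source: dict[str, str]) -> dict[str, list[str]]:
    """Compare the index with the current source of truth.

    indexed: doc_id -> content hash currently stored in the index
    source:  doc_id -> current document text
    Returns the doc IDs to upsert, delete, and skip as duplicates.
    """
    plan = {"upsert": [], "delete": [], "skip_duplicate": []}
    seen_hashes, kept_ids = set(), set()
    for doc_id in sorted(source):
        digest = content_hash(source[doc_id])
        if digest in seen_hashes:
            # Same content already kept under another ID.
            plan["skip_duplicate"].append(doc_id)
            continue
        seen_hashes.add(digest)
        kept_ids.add(doc_id)
        if indexed.get(doc_id) != digest:
            # New document, or content changed since the last run.
            plan["upsert"].append(doc_id)
    # Anything indexed that is gone from the source, or is now a
    # duplicate, should leave the index.
    plan["delete"] = sorted(d for d in indexed if d not in kept_ids)
    return plan


def freshness_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)
```

The freshness weight could be multiplied into the retrieval score to down-weight old content, or compared against a threshold to expire it; which one fits depends on how quickly the underlying documents go stale.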

"The new embedding model feels better" is not an engineering statement. Without a held-out set of real questions with known-good answers, every change is a coin flip — you fix one query and silently break five. Build an eval set early, measure retrieval hit-rate and answer faithfulness on every change, and treat a regression like a failing test.
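The retrieval half of that measurement can be sketched in a few lines. The eval-set shape (a question plus a set of relevant document IDs), the `retrieve` callable, and the tolerance value are assumptions for the example; answer faithfulness needs a separate grader and is not covered here.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Share of questions with at least one relevant doc in the top k.

    Each case is assumed to look like:
    {"question": str, "relevant_ids": set of doc IDs}
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for case in eval_set:
        top_k = retrieve(case["question"])[:k]
        if any(doc_id in case["relevant_ids"] for doc_id in top_k):
            hits += 1
    return hits / len(eval_set)


def assert_no_regression(baseline: float, current: float, tolerance: float = 0.01) -> None:
    """Fail like a test when the metric drops below the stored baseline."""
    if current < baseline - tolerance:
        raise AssertionError(
            f"retrieval hit-rate regressed: {current:.3f} < baseline {baseline:.3f}"
        )
```

Run in CI against the held-out questions, this turns "feels better" into a number that either holds or fails the build.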

Embedding the query, searching, reranking, and stuffing a large context into a big model adds up — in both seconds and dollars. Cache embeddings and frequent queries, retrieve fewer-but-better chunks rather than dumping everything, and reserve the largest model for the steps that genuinely need it.
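For the caching part, a minimal in-memory sketch might look like the following. `EmbeddingCache` and the `embed_fn` it wraps are hypothetical; `embed_fn` stands for whatever embedding call the system already makes. The cache is unbounded and has no expiry, which a production version would need to address.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """Memoize embeddings by a hash of whitespace-normalized text."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self._embed_fn = embed_fn  # assumed: the existing embedding call
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times embed_fn was actually invoked

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace only; case is preserved because some
        # embedding models are case-sensitive.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = list(self._embed_fn(text))
        return self._store[key]
```

The same pattern applies one level up: keying a cache on the normalized query plus the index version lets frequent questions skip retrieval and generation entirely, as long as entries are invalidated when the index changes.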

Notice what's missing from that list: clever prompting. Production RAG is a data and retrieval engineering problem wearing an AI costume. The teams whose RAG holds up aren't the ones with the fanciest prompt — they're the ones who treat ingestion, chunking, retrieval, and evaluation as real systems with real tests.

That's the lens we bring from years of building data systems at scale before this wave. If you're moving a RAG prototype toward something users can actually trust in production, that's the kind of RAG and knowledge-AI work we do at Krazimo.

What's bitten you hardest in a production RAG system? I'll dig into specifics in the comments.

Read original on dev.to → dev.to/mridul_nagpal_e33b6be1260/rag-in-producti…
