What Matters in Production RAG

Production RAG systems often fail after moving beyond the demo stage because their indexing, retrieval, and observability layers are underbuilt. The indexing pipeline splits documents into chunks and embeds them as vectors, while the query pipeline retrieves relevant chunks via cosine similarity and approximate nearest neighbor search. Chunking strategies such as recursive, semantic, or structure-aware splitting determine whether the system retrieves coherent, relevant context or fragmented, useless text.

Most of us build RAG (https://arxiv.org/abs/2005.11401) the same way: follow a tutorial that embeds a handful of PDFs, stores the vectors in a local Chroma (https://docs.trychroma.com/) instance, and chains everything together with LangChain (https://python.langchain.com/docs/introduction/), if that’s still a thing. The demo works. The answer looks reasonable. Then you take it to production and it falls apart in quiet, hard-to-diagnose ways.

This article is about what comes after the demo. It covers the fundamentals of how RAG actually works under the hood, the engineering challenges of keeping an index fresh and correct over time, and how to build the observability layer that lets you answer “why did the system retrieve that?” when things go wrong. None of these topics is exotic. All of them are consistently underbuilt in practice.

RAG Basics

The core idea is simple: instead of asking an LLM (https://en.wikipedia.org/wiki/Large_language_model) to answer from memory, you retrieve relevant documents at query time and inject them into the prompt as context. The model’s role shifts from “know everything” to “reason over what you are given.” This architectural choice has made RAG the dominant pattern for grounding LLMs in specific, current, or proprietary knowledge.

A RAG system has two distinct pipelines that run at different times.

The indexing pipeline runs offline (or in the background). It ingests raw documents, splits them into chunks, converts each chunk into a dense vector embedding, and stores those vectors in a vector database (https://en.wikipedia.org/wiki/Vector_database) alongside metadata and the original text. This pipeline populates the knowledge base the retriever will search at query time.

The query pipeline runs online, per user request.
It takes the user’s question, embeds it using the same model used during indexing, searches the vector database for the nearest chunks, assembles those chunks into a context window, and sends the whole thing to the LLM as a prompt.

The math underlying the retrieval step is cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Two vectors are considered close if the angle between them is small:

$$\cos(\theta) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$

where $q$ is the query embedding and $d$ is a document chunk embedding.

In practice, most vector databases use approximate nearest neighbor (ANN) search (https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor) rather than exact exhaustive search, because scanning billions of vectors at query time is prohibitively slow. HNSW (Hierarchical Navigable Small World, https://arxiv.org/abs/1603.09320) is the dominant algorithm: it builds a layered proximity graph during indexing that allows retrieval in roughly $O(\log n)$ time at the cost of a small, tunable recall loss.

Chunking

Chunking is where most RAG systems silently fail. The intuition is straightforward: chunks need to be small enough that retrieved text is specific and relevant, but large enough that they contain complete thoughts. In practice, getting this right requires understanding your document corpus.

The naive approach is fixed-size chunking at some character or token count, say 512 tokens with a 128-token overlap. It is simple and fast. It is also routinely wrong. Fixed-size chunking cuts sentences in half, separates questions from their answers in FAQ documents, and splits code across function boundaries.

The approaches that actually work in production:

- Recursive splitting: split on paragraphs first, then sentences, then characters as a fallback. This preserves semantic structure far better than character counting.
- Semantic chunking: embed consecutive sentences and insert chunk boundaries where cosine similarity between adjacent sentences drops below a threshold.
  This identifies genuine topic shifts rather than arbitrary position boundaries.
- Structure-aware splitting: for code, split at function or class boundaries using AST parsing (https://docs.python.org/3/library/ast.html). For legal documents, split at clause boundaries. For contracts, include the parent section heading with every child chunk.

Always store metadata with each chunk: the source document ID, section heading, page number, creation timestamp, and a content hash. You will need all of these later, both for filtering and for keeping the index current.

Embedding Models and the Model-Lock Problem

The embedding model you choose during indexing is a “long-term commitment” (sorry, I could not come up with a better wording here). Every vector in your index was produced by that model. If you switch models, every stored vector becomes incommensurable with the new query embeddings, and you must re-embed the entire corpus.

Production-grade options as of mid-2026:

- OpenAI text-embedding-3-large: 3072-dimensional, best general-purpose recall, but API-dependent
- Cohere embed-v3: strong multilingual performance, supports truncation modes
- BAAI bge-large-en-v1.5: open-source, deployable locally, competitive with the above for English
- e5-mistral-7b-instruct: instruction-tuned, excellent for asymmetric retrieval tasks

RAG Indexing Pipelines

Here is where most tutorials stop and most production problems begin. Your knowledge base is not static. Documents are updated, retracted, corrected, superseded, and deleted. If your indexing pipeline cannot handle these operations correctly, your RAG system will quietly serve stale, contradictory, or deleted information with full confidence.

Chunk Identity

A document that is split into 15 chunks produces 15 separate vectors, each stored with its own ID. When that document is updated, you cannot simply update a row as you would in a relational database.
You need to:

- Identify all 15 chunk IDs that belong to the old version of the document
- Delete them from the vector store
- Re-chunk the updated document (which may now produce 17 chunks)
- Re-embed and insert the 17 new chunks

This requires a mapping layer that vector databases do not provide natively. The standard approach is a document registry, a simple relational table (Postgres, https://www.postgresql.org/, works fine) that maps each doc_id to the list of chunk vector IDs currently in the index:

    CREATE TABLE doc_chunk_registry (
        doc_id          TEXT NOT NULL,
        chunk_vector_id TEXT NOT NULL,
        content_hash    TEXT NOT NULL,
        version         INTEGER NOT NULL DEFAULT 1,
        indexed_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        status          TEXT NOT NULL DEFAULT 'active',  -- 'active' | 'deleted' | 'superseded'
        PRIMARY KEY (doc_id, chunk_vector_id)
    );

When a document update arrives, the flow is:

    import hashlib

    # splitter and embed are assumed to be provided by the surrounding
    # application; the vector_store and registry_db methods are placeholders.
    def reindex_document(doc_id: str, new_content: str, vector_store, registry_db):
        # 1. Find existing chunk IDs (and their version, to compute the next one)
        old_rows = registry_db.query(
            """SELECT chunk_vector_id, version FROM doc_chunk_registry
               WHERE doc_id = %s AND status = 'active'""",
            (doc_id,),
        )
        next_version = max((row["version"] for row in old_rows), default=0) + 1
        content_hash = hashlib.sha256(new_content.encode()).hexdigest()

        # 2. Delete old vectors and mark their registry rows as superseded
        if old_rows:
            vector_store.delete(ids=[row["chunk_vector_id"] for row in old_rows])
        registry_db.execute(
            """UPDATE doc_chunk_registry SET status = 'superseded'
               WHERE doc_id = %s AND status = 'active'""",
            (doc_id,),
        )

        # 3. Re-chunk and re-embed
        new_chunks = splitter.split_text(new_content)
        new_embeddings = embed(new_chunks)
        new_ids = vector_store.upsert(new_embeddings, metadata=[...])

        # 4. Register new chunks
        for chunk_id in new_ids:
            registry_db.execute(
                """INSERT INTO doc_chunk_registry
                       (doc_id, chunk_vector_id, content_hash, version)
                   VALUES (%s, %s, %s, %s)""",
                (doc_id, chunk_id, content_hash, next_version),
            )

Avoiding Unnecessary Re-Embedding

Re-embedding is expensive. A 100,000-document corpus with an average of 10 chunks per document means embedding 1 million chunks for a full rebuild. You want to re-embed only what changed.

Content hashing is the first gate.
When a document arrives, compute a hash of its content. If the hash matches what is in the registry, skip it entirely. Most “updates” in practice are metadata changes (a title change, a timestamp update) that do not affect the text content and therefore do not require re-embedding.

    import hashlib

    def should_reindex(doc_id: str, new_content: str, registry_db) -> bool:
        row = registry_db.query_one(
            """SELECT content_hash FROM doc_chunk_registry
               WHERE doc_id = %s AND status = 'active' LIMIT 1""",
            (doc_id,),
        )
        if row is None:
            return True  # New document
        new_hash = hashlib.sha256(new_content.encode()).hexdigest()
        # Reindex only when the content actually changed
        return new_hash != row["content_hash"]

For large documents, you can go further: hash at the chunk level, and re-embed only the chunks whose content changed. This is more complex to implement but pays off for long, mostly-stable documents like regulatory filings or technical manuals where only a few sections change per update cycle.

Index Versioning and No-Downtime Updates

The most underappreciated failure mode in RAG is the partial update. You start reindexing 10,000 documents, the pipeline crashes at document 6,000, and now your index is in flux: some documents are at version N, some at version N+1, and the seam between them is invisible to the retrieval layer.

The safe pattern is alias-based deployment, borrowed directly from Elasticsearch operations:

    rag_index_2026_05_14   <- built overnight, fully validated
    rag_index_current      <- alias pointing to above

You build the new index completely, validate it against a benchmark query set, then atomically swap the alias. The old index stays around for a configurable retention period in case rollback is needed. No query ever sees a partial index.

For systems that cannot tolerate rebuild latency (the index is too large, or documents need to be available within seconds of ingestion), incremental upsert is the alternative. Upsert appends new vectors without touching existing ones.
Manage concurrent visibility by including a valid_from timestamp (similar to Postgres MVCC) in metadata and filtering queries to only return chunks where valid_from <= NOW(). This lets you stage new content before it becomes live.

    from datetime import datetime, timedelta, timezone

    # Stage new chunks with a future valid_from.
    # Timezone-aware UTC is used on both sides so the ISO strings compare consistently.
    vector_store.upsert(
        vectors=new_embeddings,
        metadata=[
            {
                "doc_id": doc_id,
                "valid_from": (datetime.now(timezone.utc) + timedelta(minutes=5)).isoformat(),
                "status": "active",
            }
            for _ in new_embeddings
        ],
    )

    # Query filter in retrieval
    results = vector_store.query(
        query_vector=query_embedding,
        filter={
            "valid_from": {"$lte": datetime.now(timezone.utc).isoformat()},
            "status": "active",
        },
    )

Embedding Model Upgrades

When a better embedding model is released, every vector in your index is now wrong in a specific sense: it was produced by a different model, so its geometric position in the vector space is incommensurable with query embeddings from the new model. You cannot query with model B and retrieve vectors from model A.

This means embedding model upgrades require full corpus re-embedding. In practice, the migration strategy is:

- Build a shadow index with the new model running in parallel
- Route a small percentage of queries to the shadow index and compare results
- Gradually shift traffic using the alias pattern above
- Keep the old index warm until you are confident in the new one

The operational cost of this is why embedding model choice deserves more up-front thought than it typically gets. Treat it like a database schema migration: painful to undo, so choose carefully.

A practical safeguard: store the embedding model name and version in every chunk’s metadata. When querying, assert that the stored model matches the query model before returning results. This prevents the silent failure mode where model drift goes undetected.

Observability and Retrieval Tracing

Production RAG systems fail in ways that look like LLM problems but are actually retrieval problems.
The answer is confidently wrong not because the model hallucinated, but because it faithfully reasoned over the wrong context. Without end-to-end tracing, you cannot distinguish these two failure modes.

The standard observability stack for distributed systems (traces, metrics, logs via OpenTelemetry, https://opentelemetry.io/docs/) applies here, but a RAG pipeline has primitives that OTel’s generic span model does not capture natively. You need to instrument these explicitly.

The Span Architecture

A complete RAG request should produce a trace with these spans, nested in a single root span:

    rag_request (root)
    ├── embedding.query          (latency, model, input_tokens)
    ├── retrieval.vector_search  (latency, num_results, top_k, filter_applied)
    ├── retrieval.rerank         (latency, num_input, num_output, model)
    ├── prompt.assembly          (latency, total_tokens, num_chunks_used)
    └── llm.generate             (latency, model, input_tokens, output_tokens, stop_reason)

The chunk_retrieved events recorded on the retrieval span are what make a bad answer debuggable. When we investigate a support ticket about a wrong answer, we can open the trace, expand the retrieval span events, and immediately see which chunks scored highest and where they came from. “The system retrieved three chunks from the deprecated v1 policy document” is an actionable finding. “The system returned a bad answer” is not.

Logging the “Why”

A common question in production is not just “what was retrieved?” but “why did the system think this was relevant?” The similarity score alone does not answer this. A chunk with a score of 0.82 might be genuinely relevant, or it might be a false positive from an embedding space where the query and an unrelated chunk happen to land nearby.

To address this, we can add a lightweight rationale step: after reranking, send the top-5 chunks and the query to the LLM with a short system prompt asking it to explain the relevance of each chunk before generating the final answer.
The rationale is logged as a structured field on the trace. This is expensive if done per-request, but extremely valuable when run on a sampled basis (say, 1% of production traffic plus 100% of user-flagged responses).

Retrieval Quality vs Answer Quality

The highest-value observability investment is closing the feedback loop: connecting what was retrieved to how good the final answer was. This requires an evaluation signal.

For many applications, you can compute answer quality automatically using a lightweight LLM-as-judge approach: after the main LLM generates an answer, send the answer, the retrieved context, and the original question to a smaller, cheaper model with a rubric asking it to score faithfulness (did the answer stay within what the context says?) and relevance (did the answer address the question?). Log these scores alongside the trace ID.

This gives you a queryable dataset: “show me all requests where faithfulness score was below 0.7 in the last 7 days.” Drilling into those traces, you will typically find one of three patterns:

- Retrieved chunks are from the wrong document (index corruption or model drift)
- Retrieved chunks are from the right document but the wrong section (chunking boundary problem)
- Retrieved chunks are correct but the LLM ignored them (a generation problem, not a retrieval problem)

Only traces with chunk-level attribution let you distinguish these cases. Without them, every bad answer looks the same from the outside.

Index Version Attribution in Traces

One failure mode that deserves special mention: your index was updated, retrieval behavior changed, and answer quality dropped. Without index version attribution in your traces, you cannot correlate the quality drop to the update.

The fix is to include the index version (or the alias timestamp) in every retrieval span.
When you investigate a spike in low-quality answers, you can immediately filter to traces where the index version is the new one, and compare them to traces from the old version.

    span.set_attribute("retrieval.index_version", current_index_alias)
    span.set_attribute("retrieval.index_updated_at", index_metadata["updated_at"])

This sounds obvious in retrospect. Almost nobody does it until they spend a painful post-incident trying to figure out why answer quality degraded on a Tuesday afternoon.

Footnote

RAG combines offline indexing (chunk, embed, store) with online retrieval (embed query, search, inject context). Getting the demo right is easy; getting production right requires three things. First, an indexing pipeline with a document registry, content-hash-based change detection, correct delete semantics, and alias-based zero-downtime deployment. Second, a retrieval layer using hybrid search (vector + BM25, https://arpitbhayani.me/blogs/bm25) and cross-encoder reranking to achieve meaningful accuracy. Third, an observability layer that records chunk-level attribution per request, tracks retrieval quality metrics over time, and links index versions to answer quality regressions. Without all three, a RAG system that works in staging will silently serve stale, wrong, or deleted information in production.
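
Example: exact cosine-similarity search

The retrieval step described under RAG Basics can be illustrated with a tiny exhaustive search in plain Python. This is only a sketch: the names cosine_similarity and top_k and the dict of toy two-dimensional vectors are assumptions made for illustration, and a real system would delegate this to the vector database's ANN index rather than scoring every chunk.

```python
import math


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """cos(theta) = (q . d) / (|q| * |d|). Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3):
    """Exact (exhaustive) search: score every chunk, keep the k best."""
    scored = [(cosine_similarity(query, vec), cid) for cid, vec in chunks.items()]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings keyed by a hypothetical chunk ID.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
result = top_k([1.0, 0.0], chunks, k=2)
```

Here chunk "a" points in the same direction as the query and ranks first, and "c" (45 degrees away) ranks second. The cost grows linearly with the number of chunks, which is the reason ANN indexes such as HNSW exist.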
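
Example: recursive splitting

To make the recursive splitting idea from the Chunking section concrete, here is a minimal sketch in plain Python. The function name recursive_split, the separator order, and the character-based max_chars limit are illustrative assumptions, not any particular library's API. A production splitter would usually count tokens, add overlap, and keep the separator at chunk boundaries.

```python
def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first and recurse into finer ones
    only for pieces that are still too long (hypothetical helper)."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Last-resort fallback: hard cut by character count.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue                      # skip empty pieces between separators
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate           # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)        # flush the chunk built so far
            current = ""
        if len(piece) <= max_chars:
            current = piece               # piece starts a new chunk
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


doc = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
parts = recursive_split(doc, max_chars=40)
```

With a 40-character limit the two paragraphs fit individually, so the split happens at the paragraph break and no sentence is cut. Only text with no usable separator (for example one very long token) reaches the character fallback.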
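
Example: chunk-level change detection

The chunk-level hashing mentioned under Avoiding Unnecessary Re-Embedding can be sketched as a comparison of content hashes. The function name diff_chunks and the in-memory dict standing in for the registry are assumptions for illustration. In the setup described in this article, the stored hashes would come from doc_chunk_registry, with content_hash holding a per-chunk hash rather than a per-document one.

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """SHA-256 of the chunk text, used purely as a change detector."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def diff_chunks(stored: dict[str, str], new_chunks: list[str]):
    """Compare stored {content_hash: chunk_vector_id} against fresh chunks.

    Returns (to_embed, to_delete, unchanged):
      to_embed  - chunk texts whose hash is not in the index yet
      to_delete - vector IDs whose hash no longer appears in the document
      unchanged - vector IDs that can be kept without re-embedding
    Identical chunk texts share one hash, so duplicates collapse to one entry.
    """
    new_hashes = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_hashes.items() if h not in stored]
    to_delete = [vid for h, vid in stored.items() if h not in new_hashes]
    unchanged = [vid for h, vid in stored.items() if h in new_hashes]
    return to_embed, to_delete, unchanged


# Hypothetical previous state of one document: three chunks, three vector IDs.
old_chunks = ["Intro text.", "Section A v1.", "Section B."]
stored = {chunk_hash(c): f"vec-{i}" for i, c in enumerate(old_chunks)}

to_embed, to_delete, unchanged = diff_chunks(
    stored, ["Intro text.", "Section A v2.", "Section B."]
)
```

Only the edited section is re-embedded here. Note that this pays off mainly with structure-aware or recursive splitting: with fixed-size chunking, an insertion near the top of a document shifts every later boundary, so nearly every hash changes.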
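
Example: validate, then swap the alias

The alias-based deployment described under Index Versioning and No-Downtime Updates boils down to "validate first, then repoint one name". The sketch below uses a toy in-memory alias table. IndexAliases, promote_if_valid, the min_hit_rate threshold, the older index name rag_index_2026_05_13, and the fake search function are all hypothetical; a real deployment would use whatever alias or collection-swap mechanism its search engine or vector store provides.

```python
class IndexAliases:
    """Toy in-memory alias table (hypothetical stand-in for a real alias API)."""

    def __init__(self):
        self._aliases: dict[str, str] = {}
        self._previous: dict[str, str] = {}

    def resolve(self, alias: str) -> str:
        return self._aliases[alias]

    def swap(self, alias: str, new_index: str) -> None:
        # Remember the old target so that rollback is a single assignment.
        if alias in self._aliases:
            self._previous[alias] = self._aliases[alias]
        self._aliases[alias] = new_index

    def rollback(self, alias: str) -> None:
        self._aliases[alias] = self._previous.pop(alias)


def promote_if_valid(aliases, alias, new_index, benchmark, search, min_hit_rate=0.9):
    """Swap the alias only if the new index passes the benchmark.

    benchmark: list of (query, expected_doc_id) pairs
    search:    callable (index_name, query) -> list of retrieved doc IDs
    """
    hits = sum(expected in search(new_index, query) for query, expected in benchmark)
    hit_rate = hits / len(benchmark)
    if hit_rate >= min_hit_rate:
        aliases.swap(alias, new_index)
    return hit_rate


# Fake search results standing in for a freshly built index.
fake_results = {"rag_index_2026_05_14": {"refund policy": ["doc-7"], "sla": ["doc-2"]}}
search = lambda index, query: fake_results[index].get(query, [])

aliases = IndexAliases()
aliases.swap("rag_index_current", "rag_index_2026_05_13")
rate = promote_if_valid(
    aliases, "rag_index_current", "rag_index_2026_05_14",
    benchmark=[("refund policy", "doc-7"), ("sla", "doc-2")],
    search=search,
)
```

Queries always resolve rag_index_current, so they see either the old index or the fully validated new one, never a half-built state. Keeping the previous target around is what makes rollback cheap during the retention period.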