Production RAG Systems — 7 Lessons We Learned the Hard Way

A company running a self-hosted AI business platform since early 2026 shares seven painful lessons from building production RAG systems, including the need for document-type-aware chunking and hybrid retrieval combining vector search with knowledge graphs to improve accuracy on relational queries.

Building a RAG demo takes an afternoon. Building a RAG system that works reliably in production (handling thousands of real business queries daily, in multiple languages, connected to live databases) takes months of learning things the hard way. We have been running a self-hosted AI business platform in production since early 2026. These are the seven most painful lessons from that journey.

Every RAG tutorial tells you to chunk your documents into 512 tokens. We followed this advice. It worked for some queries and failed badly for others. The problem: different document types need different chunking strategies.

- Legal documents: Long clauses that reference each other. Chunking at 512 tokens splits a clause from its conditions, so the AI gives incomplete answers because it retrieved only half the relevant text.
- Product manuals: Short, self-contained sections. A 512-token chunk combines multiple unrelated sections, and retrieval becomes noisy.
- Customer support logs: Individual tickets that need full context. Chunking splits a problem description from its resolution.

Our solution: document-type-aware chunking.

```python
class AdaptiveChunker:
    # Per-document-type settings: chunk size and overlap in tokens,
    # plus separators to try in order of preference.
    STRATEGIES = {
        'legal': {
            'chunk_size': 1024,
            'overlap': 200,
            'split_on': ['\n\n', '\n', '. ']
        },
        'manual': {
            'chunk_size': 256,
            'overlap': 50,
            'split_on': ['\n\n', '\n']
        },
        'support_ticket': {
            'chunk_size': 2048,
            'overlap': 0,
            'split_on': ['---', '\n\n\n']
        },
        'general': {
            'chunk_size': 512,
            'overlap': 100,
            'split_on': ['\n\n', '\n', '. ']
        }
    }

    def chunk(self, text: str, doc_type: str) -> list[str]:
        # Unknown document types fall back to the 'general' strategy.
        strategy = self.STRATEGIES.get(doc_type, self.STRATEGIES['general'])
        # _recursive_split is a helper that is not shown here.
        return self._recursive_split(
            text,
            strategy['chunk_size'],
            strategy['overlap'],
            strategy['split_on']
        )
```

After switching to adaptive chunking, accuracy on legal document queries improved from 61% to 89%.

Pure vector search fails on relational queries. This was our biggest early mistake.
A business owner asks: “Which customers in Delhi placed orders above $100 last month and also had complaints?” Vector search returns chunks that semantically match “customers”, “Delhi”, “orders”, and “complaints”. But it cannot answer the actual question, because the answer requires traversing relationships between entities, not matching semantic similarity.

We added Neo4j as a knowledge graph layer alongside our vector database. Documents are ingested twice: once into the vector store as chunks, and once into Neo4j as entities and relationships.

```python
class HybridRetriever:
    def retrieve(self, query: str) -> list[dict]:
        query_type = self.classify_query(query)

        if query_type == 'semantic':
            # "What is your return policy?"
            return self.vector_search(query, top_k=5)
        elif query_type == 'relational':
            # "Which customers complained last month?"
            return self.graph_search(query)
        else:
            # Complex queries: use both.
            # Note: the simplified classify_query below only returns
            # 'semantic' or 'relational', so reaching this branch
            # requires a classifier that emits a third label.
            vector = self.vector_search(query, top_k=3)
            graph = self.graph_search(query)
            return self.merge_and_rerank(vector, graph)

    def classify_query(self, query: str) -> str:
        relational_signals = [
            'which', 'who', 'how many', 'list', 'show me',
            'find all', 'customers who', 'orders that',
            'last month', 'this week'
        ]
        query_lower = query.lower()
        # Count how many relational phrases appear in the query.
        score = sum(1 for s in relational_signals if s in query_lower)
        return 'relational' if score >= 2 else 'semantic'
```

The graph layer alone improved accuracy on business intelligence queries from 34% to 87%.

We started with OpenAI’s text-embedding-ada-002. It worked well for English. Then we started handling customers in Hindi, Tamil, Arabic, and Swahili, and the accuracy drop was severe. ada-002 was trained primarily on English text. Hindi queries were returning English chunks as most similar, because the embedding space placed them closer to the query than the actually relevant Hindi content.
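One way to make this failure mode visible is to measure, per query language, how often the top-ranked chunk comes back in a different language. The sketch below is illustrative only: `language_mismatch_rate` and the sample pairs are hypothetical, and it assumes you already log the detected language of each query and of its top retrieved chunk. A mismatch is not always an error (the knowledge base may simply have no content in that language), so treat the rate as a signal to investigate rather than a verdict.

```python
from collections import defaultdict


def language_mismatch_rate(samples):
    """Fraction of queries, per query language, whose top chunk
    is in a different language.

    samples: iterable of (query_language, top_chunk_language) pairs.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for query_lang, chunk_lang in samples:
        totals[query_lang] += 1
        if chunk_lang != query_lang:
            mismatches[query_lang] += 1
    return {lang: mismatches[lang] / totals[lang] for lang in totals}


# Made-up log entries: (language of query, language of top-ranked chunk)
samples = [
    ("hi", "en"), ("hi", "hi"), ("hi", "en"),
    ("en", "en"),
    ("ta", "en"), ("ta", "ta"),
]
rates = language_mismatch_rate(samples)
for lang, rate in sorted(rates.items()):
    print(f"{lang}: {rate:.0%} of top chunks in another language")
```

A high rate for one language alongside a near-zero rate for English is the pattern described above.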
We switched to multilingual-e5-large for knowledge base embeddings, with language detection routing:

```python
# Import paths assume the LangChain integration packages.
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings


class MultilingualEmbedder:
    def __init__(self):
        self.english_model = OpenAIEmbeddings(model='text-embedding-3-large')
        self.multilingual_model = HuggingFaceEmbeddings(
            model_name='intfloat/multilingual-e5-large'
        )
        # LanguageDetector is a project-specific helper that is not shown here.
        self.detector = LanguageDetector()

    def embed(self, text: str) -> list[float]:
        language = self.detector.detect(text)

        if language == 'en':
            return self.english_model.embed_query(text)
        else:
            # Use multilingual model for non-English.
            # e5 models need the "query: " prefix on queries.
            return self.multilingual_model.embed_query(f"query: {text}")

    def embed_document(self, text: str) -> list[float]:
        language = self.detector.detect(text)

        if language == 'en':
            return self.english_model.embed_documents([text])[0]
        else:
            # e5 models need the "passage: " prefix on documents.
            return self.multilingual_model.embed_documents(
                [f"passage: {text}"]
            )[0]
```

Vectors from different embedding models are not comparable, so each route has to search an index built with the same model. This brought multilingual retrieval accuracy from 41% to 84%.

Our first production system passed the full conversation history plus retrieved documents to the LLM on every message. This worked fine in development. In production, with real users having real conversations, context windows filled up fast. The fix: hierarchical memory with summarization.
```python
class ConversationMemory:
    MAX_RECENT_MESSAGES = 8
    SUMMARIZE_AFTER = 20

    def get_context(self, session_id: str, current_query: str) -> str:
        messages = self.get_messages(session_id)

        if len(messages) > self.SUMMARIZE_AFTER:
            # Summarize old messages, keep recent ones fresh
            old_messages = messages[:-self.MAX_RECENT_MESSAGES]
            recent_messages = messages[-self.MAX_RECENT_MESSAGES:]
            summary = self.summarize(old_messages)
            return self.format_context(summary, recent_messages)
        else:
            return self.format_recent(messages[-self.MAX_RECENT_MESSAGES:])

    def summarize(self, messages: list) -> str:
        conversation_text = '\n'.join(
            f"{m.role}: {m.content}" for m in messages
        )
        return self.llm.complete(
            f"""Summarize this conversation history in 3-5 sentences,
preserving key facts, customer preferences, and unresolved issues:

{conversation_text}

Summary:"""
        )
```

This reduced token costs by 67% while maintaining response quality on long conversations.

Vector similarity search returns the top-k most semantically similar chunks. But “most similar” is not the same as “most relevant to answer this specific question.” A query about “product return policy” might retrieve three chunks that are all semantically close to the query, only some of which actually answer it. Sending all three to the LLM adds noise, and the LLM sometimes focuses on the wrong chunks.

We added a reranking step using a cross-encoder model:

```python
from sentence_transformers import CrossEncoder


class ReRanker:
    def __init__(self):
        self.model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(self, query: str, documents: list[str], top_k: int = 3) -> list[str]:
        # Score each document against the query
        pairs = [[query, doc] for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by relevance score, highest first
        ranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)

        # Return only the top_k most relevant documents
        return [doc for _, doc in ranked[:top_k]]
```

Adding reranking improved answer quality noticeably, especially for ambiguous queries where multiple chunks seemed equally relevant.
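To see the score-sort-truncate pattern without downloading a model, here is a small self-contained sketch. It is not the cross-encoder setup described above: `token_overlap` is a deliberately crude, hypothetical stand-in scorer, and the documents are made up. The point is only that the reranker takes a wider candidate list and keeps the few highest-scoring entries.

```python
from typing import Callable


def rerank(
    query: str,
    documents: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> list[str]:
    """Score every (query, document) pair and keep the top_k documents."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    # Highest score first; ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def token_overlap(query: str, doc: str) -> float:
    """Stand-in scorer: share of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Made-up candidates, in the order a vector search might return them.
docs = [
    "Shipping times vary by region.",
    "Our product return policy allows returns within 30 days.",
    "Return shipping labels are emailed after approval.",
    "Gift cards cannot be refunded.",
]
top = rerank("product return policy", docs, token_overlap, top_k=2)
print(top)
```

In a real pipeline, `score_fn` would wrap a cross-encoder call, which is slower per pair but far better at judging relevance than word overlap. That cost is why reranking is typically applied to a short candidate list rather than the whole corpus.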
When our system gave wrong answers, our first instinct was to blame the LLM. We tried different models. We tweaked system prompts. We added explicit “only use the provided context” instructions. The answers were still wrong sometimes.

The real problem: retrieval was returning irrelevant chunks, and the LLM was doing its best with bad information. Garbage in, garbage out. The fix was a retrieval quality threshold:

```python
class QualityAwareRetriever:
    MINIMUM_RELEVANCE_SCORE = 0.72

    def retrieve(self, query: str) -> dict:
        results = self.vector_search(query, top_k=10)

        # Filter by minimum relevance score
        qualified = [
            r for r in results
            if r['score'] >= self.MINIMUM_RELEVANCE_SCORE
        ]

        if not qualified:
            # Nothing relevant enough: signal the caller to decline
            # instead of generating an answer from weak context.
            return {
                'documents': [],
                'has_context': False,
                'fallback_message': (
                    "I don't have specific information about this "
                    "in your knowledge base. Please contact support "
                    "or check your documentation directly."
                )
            }

        return {
            'documents': [r['content'] for r in qualified[:5]],
            'has_context': True
        }
```

When retrieval quality is below the threshold, we tell the AI to say it does not know rather than hallucinate. This is better for business trust than a confident wrong answer. Hallucination rate dropped from 12% to under 2%.

We deployed our RAG system and assumed it was working because users were not complaining loudly. Three months later, we discovered that specific query categories were failing 40% of the time, but users had simply stopped asking those questions. Silent failures are the worst kind.
We now run automated evaluation continuously:

```python
import time


class RAGEvaluator:
    def __init__(self, test_set_path: str):
        self.test_cases = self.load_test_cases(test_set_path)

    def evaluate(self, retriever, generator) -> dict:
        results = {
            'retrieval_accuracy': [],
            'answer_correctness': [],
            'faithfulness': [],
            'latency': []
        }

        for case in self.test_cases:
            start = time.time()

            # Retrieve
            docs = retriever.retrieve(case['query'])
            retrieval_hit = any(
                case['expected_doc_id'] in doc['id']
                for doc in docs
            )

            # Generate
            answer = generator.generate(case['query'], docs)

            # Stop the clock before scoring, so latency covers only
            # retrieval and generation.
            latency = time.time() - start

            # Evaluate
            correctness = self.check_correctness(answer, case['expected_answer'])
            faithfulness = self.check_faithfulness(answer, docs)

            results['retrieval_accuracy'].append(retrieval_hit)
            results['answer_correctness'].append(correctness)
            results['faithfulness'].append(faithfulness)
            results['latency'].append(latency)

        # Average each metric over the test set.
        return {
            'retrieval_accuracy': sum(results['retrieval_accuracy']) / len(results['retrieval_accuracy']),
            'answer_correctness': sum(results['answer_correctness']) / len(results['answer_correctness']),
            'faithfulness': sum(results['faithfulness']) / len(results['faithfulness']),
            'avg_latency': sum(results['latency']) / len(results['latency'])
        }
```

We run this evaluation suite daily. Any metric dropping below its threshold triggers an alert. Silent failures are now caught within 24 hours.

| Metric | Before | After |
| --- | --- | --- |
| Retrieval accuracy | 61% | 91% |
| Answer correctness | 54% | 88% |
| Hallucination rate | 12% | 1.8% |
| Multilingual accuracy | 41% | 84% |
| Token cost per query | $0.018 | $0.006 |
| Average latency | 3.2s | 1.4s |

No single lesson drove these improvements. It was the combination of all seven working together.
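For readers who want to wire up the daily alerting mentioned for the evaluation suite, a minimal sketch follows. The function name `check_thresholds`, the threshold values, and the sample metrics are all illustrative assumptions, not the configuration used in the system described here. It takes a metrics dictionary shaped like the one the evaluator returns and reports every metric outside its allowed range.

```python
def check_thresholds(metrics: dict, minimums: dict, maximums: dict) -> list[str]:
    """Return one alert message per metric that is outside its bounds."""
    alerts = []
    # Metrics where higher is better (accuracy, correctness, faithfulness).
    for name, floor in minimums.items():
        value = metrics.get(name)
        if value is not None and value < floor:
            alerts.append(f"{name} = {value:.3f} is below minimum {floor:.3f}")
    # Metrics where lower is better (latency).
    for name, ceiling in maximums.items():
        value = metrics.get(name)
        if value is not None and value > ceiling:
            alerts.append(f"{name} = {value:.3f} is above maximum {ceiling:.3f}")
    return alerts


# Hypothetical output of one daily evaluation run.
metrics = {
    'retrieval_accuracy': 0.91,
    'answer_correctness': 0.79,
    'faithfulness': 0.95,
    'avg_latency': 2.4,
}
# Hypothetical thresholds.
minimums = {'retrieval_accuracy': 0.85, 'answer_correctness': 0.80, 'faithfulness': 0.90}
maximums = {'avg_latency': 2.0}

alerts = check_thresholds(metrics, minimums, maximums)
for message in alerts:
    print("ALERT:", message)
```

How the alerts are delivered (email, chat webhook, pager) is left out on purpose, since it depends on the surrounding infrastructure.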
This article was originally published in Towards AI (https://pub.towardsai.net) on Medium: https://pub.towardsai.net/production-rag-systems-7-lessons-we-learned-the-hard-way-5f41609fdbcd