{"slug": "production-rag-systems-7-lessons-we-learned-the-hard-way", "title": "Production RAG Systems — 7 Lessons We Learned the Hard Way", "summary": "A company running a self-hosted AI business platform since early 2026 shares seven painful lessons from building production RAG systems, including the need for document-type-aware chunking and hybrid retrieval combining vector search with knowledge graphs to improve accuracy on relational queries.", "body_md": "Building a RAG demo takes an afternoon. Building a RAG system that works reliably in production — handling thousands of real business queries daily, in multiple languages, connected to live databases — takes months of learning things the hard way.\n\nWe have been running a self-hosted AI business platform in production since early 2026. These are the seven most painful lessons from that journey.\n\nEvery RAG tutorial tells you to split your documents into 512-token chunks. We followed this advice. It worked for some queries and failed badly for others.\n\nThe problem: different document types need different chunking strategies.\n\n**Legal documents:** Long clauses that reference each other. Chunking at 512 tokens splits a clause from its conditions. The AI gives incomplete answers because it only retrieved half the relevant text.\n\n**Product manuals:** Short, self-contained sections. A 512-token chunk combines multiple unrelated sections. Retrieval becomes noisy.\n\n**Customer support logs:** Individual tickets that need full context. Chunking splits a problem description from its resolution.\n\nOur solution: document-type-aware chunking.\n\n```python\nclass AdaptiveChunker:\n    # Splitting settings per document type\n    STRATEGIES = {\n        'legal': {\n            'chunk_size': 1024,\n            'overlap': 200,\n            'split_on': ['\\n\\n', '\\n', '. 
']\n        },\n        'manual': {\n            'chunk_size': 256,\n            'overlap': 50,\n            'split_on': ['\\n\\n', '\\n']\n        },\n        'support_ticket': {\n            'chunk_size': 2048,\n            'overlap': 0,\n            'split_on': ['---', '\\n\\n\\n']\n        },\n        'general': {\n            'chunk_size': 512,\n            'overlap': 100,\n            'split_on': ['\\n\\n', '\\n', '. ']\n        }\n    }\n\n    def chunk(self, text: str, doc_type: str) -> list[str]:\n        # Fall back to the general strategy for unknown document types\n        strategy = self.STRATEGIES.get(doc_type, self.STRATEGIES['general'])\n        return self._recursive_split(\n            text,\n            strategy['chunk_size'],\n            strategy['overlap'],\n            strategy['split_on']\n        )\n```\n\nAfter switching to adaptive chunking, accuracy on legal document queries improved from 61% to 89%.\n\nPure vector search fails on relational queries. This was our biggest early mistake.\n\nA business owner asks: “Which customers placed orders above $100 last month and also had complaints?”\n\nVector search returns chunks that semantically match “customers”, “orders”, and “complaints”. But it cannot answer the actual question, because the answer requires traversing relationships between entities rather than matching on semantic similarity.\n\nWe added Neo4j as a knowledge graph layer alongside our vector database. 
Documents are ingested twice: once into the vector store as chunks, and once into Neo4j as entities and relationships.\n\n```python\nclass HybridRetriever:\n    def retrieve(self, query: str) -> list[dict]:\n        query_type = self.classify_query(query)\n\n        if query_type == 'semantic':\n            # \"What is your return policy?\"\n            return self.vector_search(query, top_k=5)\n        elif query_type == 'relational':\n            # \"Which customers complained last month?\"\n            return self.graph_search(query)\n        else:\n            # Complex queries: use both\n            vector = self.vector_search(query, top_k=3)\n            graph = self.graph_search(query)\n            return self.merge_and_rerank(vector, graph)\n\n    def classify_query(self, query: str) -> str:\n        # Simple keyword heuristic: count relational phrases in the query.\n        # As written it only returns 'relational' or 'semantic', so the\n        # combined branch in retrieve() is not reached by this classifier.\n        relational_signals = [\n            'which', 'who', 'how many', 'list',\n            'show me', 'find all', 'customers who',\n            'orders that', 'last month', 'this week'\n        ]\n        query_lower = query.lower()\n        score = sum(1 for s in relational_signals if s in query_lower)\n        return 'relational' if score >= 2 else 'semantic'\n```\n\nThe graph layer alone improved accuracy on business intelligence queries from 34% to 87%.\n\nWe started with OpenAI’s text-embedding-ada-002. It worked well for English. Then we started handling customers in Hindi, Tamil, Arabic, and Swahili.\n\nThe accuracy drop was severe. ada-002 was trained primarily on English text. 
Hindi queries returned English chunks as the closest matches, because the embedding space placed them nearer than the relevant Hindi content.\n\nWe switched to multilingual-e5-large for knowledge base embeddings, with language detection routing:\n\n```python\nclass MultilingualEmbedder:\n    def __init__(self):\n        self.english_model = OpenAIEmbeddings(\n            model='text-embedding-3-large'\n        )\n        self.multilingual_model = HuggingFaceEmbeddings(\n            model_name='intfloat/multilingual-e5-large'\n        )\n        self.detector = LanguageDetector()\n\n    def embed(self, text: str) -> list[float]:\n        # Embed a search query, routing by detected language\n        language = self.detector.detect(text)\n\n        if language == 'en':\n            return self.english_model.embed_query(text)\n        else:\n            # Use multilingual model for non-English\n            return self.multilingual_model.embed_query(\n                f\"query: {text}\"  # e5 models need \"query:\" prefix\n            )\n\n    def embed_document(self, text: str) -> list[float]:\n        # Embed a knowledge base passage, routing by detected language\n        language = self.detector.detect(text)\n        if language == 'en':\n            return self.english_model.embed_documents([text])[0]\n        else:\n            # e5 models expect a \"passage:\" prefix on documents\n            return self.multilingual_model.embed_documents(\n                [f\"passage: {text}\"]\n            )[0]\n```\n\nThis brought multilingual retrieval accuracy from 41% to 84%.\n\nOur first production system passed the full conversation history plus retrieved documents to the LLM on every message. This worked fine in development. 
In production, with real users having real conversations, context windows filled up fast.\n\nThe results:\n\nThe fix: **hierarchical memory with summarization**.\n\n```python\nclass ConversationMemory:\n    MAX_RECENT_MESSAGES = 8\n    SUMMARIZE_AFTER = 20\n\n    def get_context(self, session_id: str,\n                    current_query: str) -> str:\n        messages = self.get_messages(session_id)\n\n        if len(messages) > self.SUMMARIZE_AFTER:\n            # Summarize old messages, keep recent ones fresh\n            old_messages = messages[:-self.MAX_RECENT_MESSAGES]\n            recent_messages = messages[-self.MAX_RECENT_MESSAGES:]\n\n            summary = self.summarize(old_messages)\n            return self.format_context(summary, recent_messages)\n        else:\n            return self.format_recent(messages[-self.MAX_RECENT_MESSAGES:])\n\n    def summarize(self, messages: list) -> str:\n        conversation_text = '\\n'.join([\n            f\"{m.role}: {m.content}\" for m in messages\n        ])\n\n        return self.llm.complete(\n            f\"\"\"Summarize this conversation history in 3-5 sentences,\n            preserving key facts, customer preferences, and unresolved issues:\n{conversation_text}\nSummary:\"\"\"\n        )\n```\n\nThis reduced token costs by 67% while maintaining response quality on long conversations.\n\nVector similarity search returns the top-k most semantically similar chunks. But “most similar” is not the same as “most relevant to answer this specific question.”\n\nA query about “product return policy” might retrieve:\n\nSending all three to the LLM adds noise. 
The LLM sometimes focuses on the wrong chunks.\n\nWe added a reranking step using a cross-encoder model:\n\n```python\nfrom sentence_transformers import CrossEncoder\n\n\nclass ReRanker:\n    def __init__(self):\n        self.model = CrossEncoder(\n            'cross-encoder/ms-marco-MiniLM-L-6-v2'\n        )\n\n    def rerank(self, query: str,\n               documents: list[str],\n               top_k: int = 3) -> list[str]:\n        # Score each document against the query\n        pairs = [[query, doc] for doc in documents]\n        scores = self.model.predict(pairs)\n\n        # Sort by relevance score\n        ranked = sorted(\n            zip(scores, documents),\n            key=lambda x: x[0],\n            reverse=True\n        )\n\n        # Return only top_k most relevant\n        return [doc for _, doc in ranked[:top_k]]\n```\n\nAdding reranking improved answer quality noticeably, especially for ambiguous queries where multiple chunks seemed equally relevant.\n\nWhen our system gave wrong answers, our first instinct was to blame the LLM. We tried different models. We tweaked system prompts. We added explicit “only use the provided context” instructions.\n\nThe answers were still wrong sometimes.\n\nThe real problem: the retrieval was returning irrelevant chunks, and the LLM was doing its best with bad information. Garbage in, garbage out.\n\nThe fix was a retrieval quality threshold:\n\n```python\nclass QualityAwareRetriever:\n    MINIMUM_RELEVANCE_SCORE = 0.72\n\n    def retrieve(self, query: str) -> dict:\n        results = self.vector_search(query, top_k=10)\n\n        # Filter by minimum relevance score\n        qualified = [\n            r for r in results\n            if r['score'] >= self.MINIMUM_RELEVANCE_SCORE\n        ]\n\n        if not qualified:\n            return {\n                'documents': [],\n                'has_context': False,\n                'fallback_message': (\n                    \"I don't have specific information about this \"\n                    \"in your knowledge base. 
Please contact support \"\n                    \"or check your documentation directly.\"\n                )\n            }\n\n        return {\n            'documents': [r['content'] for r in qualified[:5]],\n            'has_context': True\n        }\n```\n\nWhen retrieval quality is below threshold, we tell the AI to say it does not know rather than hallucinate. This is better for business trust than a confident wrong answer.\n\nHallucination rate dropped from 12% to under 2%.\n\nWe deployed our RAG system and assumed it was working because users were not complaining loudly. Three months later, we discovered specific query categories were failing 40% of the time, but users had simply stopped asking those questions.\n\nSilent failures are the worst kind.\n\nWe now run automated evaluation continuously:\n\n```python\nimport time\n\n\nclass RAGEvaluator:\n    def __init__(self, test_set_path: str):\n        self.test_cases = self.load_test_cases(test_set_path)\n\n    def evaluate(self, retriever, generator) -> dict:\n        results = {\n            'retrieval_accuracy': [],\n            'answer_correctness': [],\n            'faithfulness': [],\n            'latency': []\n        }\n\n        for case in self.test_cases:\n            start = time.time()\n\n            # Retrieve\n            docs = retriever.retrieve(case['query'])\n            retrieval_hit = any(\n                case['expected_doc_id'] in doc['id']\n                for doc in docs\n            )\n\n            # Generate\n            answer = generator.generate(case['query'], docs)\n\n            # Evaluate\n            correctness = self.check_correctness(\n                answer,\n                case['expected_answer']\n            )\n            faithfulness = self.check_faithfulness(answer, docs)\n            latency = time.time() - start\n\n            results['retrieval_accuracy'].append(retrieval_hit)\n            results['answer_correctness'].append(correctness)\n            results['faithfulness'].append(faithfulness)\n            results['latency'].append(latency)\n\n        # Average each metric over the test set\n        return {\n            'retrieval_accuracy': 
sum(results['retrieval_accuracy']) / len(results['retrieval_accuracy']),\n            'answer_correctness': sum(results['answer_correctness']) / len(results['answer_correctness']),\n            'faithfulness': sum(results['faithfulness']) / len(results['faithfulness']),\n            'avg_latency': sum(results['latency']) / len(results['latency'])\n        }\n```\n\nWe run this evaluation suite daily. Any metric dropping below threshold triggers an alert. Silent failures are now caught within 24 hours.\n\n| Metric | Before | After |\n| --- | --- | --- |\n| Retrieval accuracy | 61% | 91% |\n| Answer correctness | 54% | 88% |\n| Hallucination rate | 12% | 1.8% |\n| Multilingual accuracy | 41% | 84% |\n| Token cost per query | $0.018 | $0.006 |\n| Average latency | 3.2s | 1.4s |\n\nNo single lesson drove these improvements. It was the combination of all seven working together.\n\n[Production RAG Systems — 7 Lessons We Learned the Hard Way](https://pub.towardsai.net/production-rag-systems-7-lessons-we-learned-the-hard-way-5f41609fdbcd) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/production-rag-systems-7-lessons-we-learned-the-hard-way", "canonical_source": "https://pub.towardsai.net/production-rag-systems-7-lessons-we-learned-the-hard-way-5f41609fdbcd?source=rss----98111c9905da---4", "published_at": "2026-06-28 03:53:42+00:00", "updated_at": "2026-06-28 04:08:51.961495+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-products", "ai-infrastructure", "natural-language-processing"], "entities": ["OpenAI", "Neo4j"], "alternates": {"html": "https://wpnews.pro/news/production-rag-systems-7-lessons-we-learned-the-hard-way", "markdown": "https://wpnews.pro/news/production-rag-systems-7-lessons-we-learned-the-hard-way.md", "text": 
"https://wpnews.pro/news/production-rag-systems-7-lessons-we-learned-the-hard-way.txt", "jsonld": "https://wpnews.pro/news/production-rag-systems-7-lessons-we-learned-the-hard-way.jsonld"}}