Production RAG Systems — 7 Lessons We Learned the Hard Way

Building a RAG demo takes an afternoon. Building a RAG system that works reliably in production — handling thousands of real business queries daily, in multiple languages, connected to live databases — takes months of learning things the hard way.

We have been running a self-hosted AI business platform in production since early 2026. These are the seven most painful lessons from that journey.

Every RAG tutorial tells you to split your documents into 512-token chunks. We followed this advice. It worked for some queries and failed badly for others.

The problem: different document types need different chunking strategies.

Legal documents: Long clauses that reference each other. Chunking at 512 tokens splits a clause from its conditions. The AI gives incomplete answers because it only retrieved half the relevant text.

Product manuals: Short, self-contained sections. A 512-token chunk combines multiple unrelated sections. Retrieval becomes noisy.

Customer support logs: Individual tickets that need full context. Chunking splits a problem description from its resolution.

Our solution: document-type-aware chunking.

python

class AdaptiveChunker:
    # Per-document-type settings: chunk size, overlap, and the separators
    # to try, in order of preference.
    STRATEGIES = {
        'legal': {
            'chunk_size': 1024,
            'overlap': 200,
            'split_on': ['\n\n', '\n', '. ']
        },
        'manual': {
            'chunk_size': 256,
            'overlap': 50,
            'split_on': ['\n\n', '\n']
        },
        'support_ticket': {
            'chunk_size': 2048,
            'overlap': 0,
            'split_on': ['---', '\n\n\n']
        },
        'general': {
            'chunk_size': 512,
            'overlap': 100,
            'split_on': ['\n\n', '\n', '. ']
        }
    }

    def chunk(self, text: str, doc_type: str) -> list[str]:
        # Unknown document types fall back to the 'general' strategy.
        strategy = self.STRATEGIES.get(doc_type, self.STRATEGIES['general'])
        return self._recursive_split(
            text,
            strategy['chunk_size'],
            strategy['overlap'],
            strategy['split_on']
        )

After switching to adaptive chunking, accuracy on legal document queries improved from 61% to 89%.
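The `_recursive_split` helper that `chunk` calls is not shown in the article. The sketch below is one way such a helper might work, under two loud assumptions: whitespace-separated words stand in for tokens, and the function name and merge behavior are illustrative rather than the authors' implementation.

```python
def recursive_split(text, chunk_size, overlap, separators):
    """Split text into chunks, trying each separator in order.

    Sizes are counted in whitespace-separated words (a rough token proxy).
    """
    def size(s):
        return len(s.split())

    def split(piece, seps):
        if size(piece) <= chunk_size:
            return [piece] if piece.strip() else []
        if not seps:
            # No separator left: hard-cut on word boundaries.
            words = piece.split()
            return [' '.join(words[i:i + chunk_size])
                    for i in range(0, len(words), chunk_size)]
        out = []
        for part in piece.split(seps[0]):
            out.extend(split(part, seps[1:]))
        return out

    # Greedily merge small pieces, carrying `overlap` trailing words forward.
    chunks, current = [], []
    for piece in split(text, list(separators)):
        words = piece.split()
        if current and len(current) + len(words) > chunk_size:
            chunks.append(' '.join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(' '.join(current))
    return chunks


paragraphs = recursive_split('a b c\n\nd e f', 3, 0, ['\n\n'])
```

Because the overlap is carried on top of the next piece, a chunk in this sketch can reach `chunk_size + overlap` words, and separators are replaced by single spaces when pieces are merged.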

Pure vector search fails on relational queries. This was our biggest early mistake.

A business owner asks: “Which customers placed orders above $100 last month and also had complaints?”

Vector search returns chunks that semantically match “customers”, “orders”, and “complaints”. But it cannot answer the actual question, because the answer requires traversing relationships between entities rather than matching on semantic similarity.

We added Neo4j as a knowledge graph layer alongside our vector database. Documents are ingested twice: once into the vector store as chunks, and once into Neo4j as entities and relationships.
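To make the difference concrete, here is a small sketch of what "traversing relationships" means for the example question. It uses plain Python records as a stand-in for graph nodes and edges; the data, field names, and function name are all hypothetical, and a real deployment would express the same logic as a graph query.

```python
from datetime import date

# Hypothetical records standing in for graph nodes and relationships.
orders = [
    {'customer': 'alice', 'amount': 250.0, 'placed': date(2026, 3, 14)},
    {'customer': 'bob', 'amount': 80.0, 'placed': date(2026, 3, 20)},
    {'customer': 'carol', 'amount': 400.0, 'placed': date(2026, 3, 2)},
    {'customer': 'dave', 'amount': 300.0, 'placed': date(2026, 1, 9)},
]
complaints = [
    {'customer': 'alice', 'topic': 'late delivery'},
    {'customer': 'bob', 'topic': 'damaged item'},
    {'customer': 'dave', 'topic': 'refund delay'},
]


def large_orders_with_complaints(orders, complaints, min_amount, start, end):
    """Customers with an order above min_amount in [start, end] AND a complaint."""
    buyers = {
        o['customer'] for o in orders
        if o['amount'] > min_amount and start <= o['placed'] <= end
    }
    complainers = {c['customer'] for c in complaints}
    # The answer is an intersection across two relationships,
    # which no single text chunk contains.
    return sorted(buyers & complainers)


result = large_orders_with_complaints(
    orders, complaints, 100.0, date(2026, 3, 1), date(2026, 3, 31)
)
```

Only one customer satisfies every condition at once, and finding that customer takes a filter on amounts, a filter on dates, and a join against complaints. Similarity search over chunks performs none of those operations.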

python

class HybridRetriever:
    def retrieve(self, query: str) -> list[dict]:
        query_type = self.classify_query(query)

        if query_type == 'semantic':
            # "What is your return policy?"
            return self.vector_search(query, top_k=5)
        elif query_type == 'relational':
            # "Which customers complained last month?"
            return self.graph_search(query)
        else:
            # Complex queries: use both
            vector = self.vector_search(query, top_k=3)
            graph = self.graph_search(query)
            return self.merge_and_rerank(vector, graph)

    def classify_query(self, query: str) -> str:
        relational_signals = [
            'which', 'who', 'how many', 'list',
            'show me', 'find all', 'customers who',
            'orders that', 'last month', 'this week'
        ]
        query_lower = query.lower()
        # Count how many relational phrases appear in the query.
        score = sum(1 for s in relational_signals if s in query_lower)
        if score >= 2:
            return 'relational'
        # Assumption: the published version returned only 'relational' or
        # 'semantic', which left the combined branch in retrieve() unreachable.
        # Treating a single signal as a mixed query is one way to reach it.
        return 'mixed' if score == 1 else 'semantic'

The graph layer alone improved accuracy on business intelligence queries from 34% to 87%.
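The article does not show how `merge_and_rerank` combines the two result lists. One common option is reciprocal rank fusion (RRF), sketched here as a swapped-in technique rather than the authors' method. The document IDs are hypothetical, and `k=60` is the constant conventionally used with RRF, not a value taken from the article.

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ['a', 'b', 'c']   # hypothetical IDs from vector search
graph_hits = ['b', 'd']         # hypothetical IDs from graph search
merged = reciprocal_rank_fusion([vector_hits, graph_hits])
```

RRF only needs ranks, not comparable scores, which makes it convenient when one retriever returns cosine similarities and the other returns graph matches with no score at all.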

We started with OpenAI’s text-embedding-ada-002. It worked well for English. Then we started handling customers in Hindi, Tamil, Arabic, and Swahili.

The accuracy drop was severe. ada-002 was trained primarily on English text. Hindi queries were returning English chunks as most similar — because the embedding space treated them as closer than the actual relevant Hindi content.

We switched to multilingual-e5-large for knowledge base embeddings, with language detection routing:

python

class MultilingualEmbedder:
    def __init__(self):
        self.english_model = OpenAIEmbeddings(
            model='text-embedding-3-large'
        )
        self.multilingual_model = HuggingFaceEmbeddings(
            model_name='intfloat/multilingual-e5-large'
        )
        self.detector = LanguageDetector()

    def embed(self, text: str) -> list[float]:
        language = self.detector.detect(text)

        if language == 'en':
            return self.english_model.embed_query(text)
        else:
            # Use multilingual model for non-English
            return self.multilingual_model.embed_query(
                f"query: {text}"  # e5 models need "query:" prefix
            )

    def embed_document(self, text: str) -> list[float]:
        language = self.detector.detect(text)
        if language == 'en':
            return self.english_model.embed_documents([text])[0]
        else:
            # e5 models expect a "passage:" prefix on indexed text
            return self.multilingual_model.embed_documents(
                [f"passage: {text}"]
            )[0]

This brought multilingual retrieval accuracy from 41% to 84%.
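The routing decision can be isolated from the embedding clients and checked on its own. The helper below is a minimal sketch: its name, its return shape, and the `'english'` and `'multilingual'` keys are assumptions for illustration. Only the `query:` and `passage:` prefixes come from the e5 convention used in the code above.

```python
def prepare_for_embedding(text: str, language: str, is_query: bool) -> tuple[str, str]:
    """Return (model_key, text_to_embed) for a detected language code."""
    if language == 'en':
        # English text goes to the English model without a prefix.
        return 'english', text
    # e5 models distinguish search queries from indexed passages by prefix.
    prefix = 'query: ' if is_query else 'passage: '
    return 'multilingual', prefix + text


routed_query = prepare_for_embedding('वापसी नीति क्या है?', 'hi', is_query=True)
routed_doc = prepare_for_embedding('Returns are accepted.', 'en', is_query=False)
```

Vectors produced by different models are not comparable, so a query routed to one model can only be matched against documents that were embedded by that same model.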

Our first production system passed the full conversation history plus retrieved documents to the LLM on every message. This worked fine in development. In production, with real users having real conversations, context windows filled up fast.

The results:

The fix: hierarchical memory with summarization.

python

class ConversationMemory:
    MAX_RECENT_MESSAGES = 8
    SUMMARIZE_AFTER = 20

    def get_context(self, session_id: str,
                    current_query: str) -> str:
        messages = self.get_messages(session_id)

        if len(messages) > self.SUMMARIZE_AFTER:
            # Summarize old messages, keep recent ones fresh
            old_messages = messages[:-self.MAX_RECENT_MESSAGES]
            recent_messages = messages[-self.MAX_RECENT_MESSAGES:]

            summary = self.summarize(old_messages)
            return self.format_context(summary, recent_messages)
        else:
            return self.format_recent(messages[-self.MAX_RECENT_MESSAGES:])

    def summarize(self, messages: list) -> str:
        conversation_text = '\n'.join([
            f"{m.role}: {m.content}" for m in messages
        ])

        return self.llm.complete(
            f"""Summarize this conversation history in 3-5 sentences,
            preserving key facts, customer preferences, and unresolved issues:

{conversation_text}

Summary:"""
        )

This reduced token costs by 67% while maintaining response quality on long conversations.
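The slicing in `get_context` is easy to get wrong by one, so it helps to look at it in isolation. This sketch reproduces only the split between summarized and verbatim messages, using the same two constants as defaults; the function name is hypothetical.

```python
def split_history(messages, max_recent=8, summarize_after=20):
    """Return (messages_to_summarize, messages_kept_verbatim)."""
    if len(messages) > summarize_after:
        return messages[:-max_recent], messages[-max_recent:]
    # Below the threshold nothing is summarized; only the tail is kept.
    return [], messages[-max_recent:]


history = [f'msg {i}' for i in range(25)]
old, recent = split_history(history)
```

With 25 messages, the first 17 are summarized and the last 8 are passed through unchanged. One property of this logic worth knowing: for conversations of 9 to 20 messages, anything older than the last 8 is dropped without a summary.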

Vector similarity search returns the top-k most semantically similar chunks. But “most similar” is not the same as “most relevant to answer this specific question.”

A query about “product return policy” might retrieve:

Sending all three to the LLM adds noise. The LLM sometimes focuses on the wrong chunks.

We added a reranking step using a cross-encoder model:

python

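from sentence_transformers import CrossEncoder


class ReRanker:
    def __init__(self):
        self.model = CrossEncoder(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )

    def rerank(self, query: str,
               documents: list[str],
               top_k: int = 3) -> list[str]:
        # The method body is missing from the published article. What follows
        # is the usual cross-encoder pattern, offered as an assumption:
        # score each (query, document) pair jointly, then keep the best top_k.
        pairs = [(query, doc) for doc in documents]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(documents, scores),
                        key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]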

Adding reranking improved answer quality noticeably, especially for ambiguous queries where multiple chunks seemed equally relevant.
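The reranking flow can be tried without downloading a model by plugging in a toy scorer. In the sketch below, `word_overlap` is a deliberately crude stand-in for the cross-encoder, and the documents are made-up examples. It shows the shape of the step (score every query-document pair, sort, truncate), not the quality a real cross-encoder would deliver.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score each (query, document) pair and return the top_k documents."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def word_overlap(query, doc):
    """Toy relevance score: fraction of query words present in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


docs = [
    'Shipping times by region',
    'Our product return policy allows returns within 30 days',
    'Return address labels',
]
top = rerank('product return policy', docs, word_overlap, top_k=2)
```

Keeping the scorer as a parameter means the same function can later take a cross-encoder's scoring call without changes to the calling code.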

When our system gave wrong answers, our first instinct was to blame the LLM. We tried different models. We tweaked system prompts. We added explicit “only use the provided context” instructions.

The answers were still wrong sometimes.

The real problem: the retrieval was returning irrelevant chunks, and the LLM was doing its best with bad information. Garbage in, garbage out.

The fix was a retrieval quality threshold:

python

class QualityAwareRetriever:
    MINIMUM_RELEVANCE_SCORE = 0.72

    def retrieve(self, query: str) -> dict:
        results = self.vector_search(query, top_k=10)

        # The filtering line is missing from the published article. This
        # assumes each result carries its similarity under a 'score' key.
        qualified = [
            r for r in results
            if r['score'] >= self.MINIMUM_RELEVANCE_SCORE
        ]

        if not qualified:
            return {
                'documents': [],
                'has_context': False,
                'fallback_message': (
                    "I don't have specific information about this "
                    "in your knowledge base. Please contact support "
                    "or check your documentation directly."
                )
            }

        return {
            'documents': [r['content'] for r in qualified[:5]],
            'has_context': True
        }

When retrieval quality is below threshold, we tell the AI to say it does not know — rather than hallucinate. This is better for business trust than a confident wrong answer.

Hallucination rate dropped from 12% to under 2%.
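The article does not say how the 0.72 cutoff was chosen. One plausible approach, offered here as an assumption rather than the authors' procedure, is to label a sample of retrieved chunks as relevant or not and pick the threshold that maximizes F1 on that sample. The function name and the sample data are hypothetical.

```python
def best_threshold(scored):
    """Pick the similarity cutoff with the highest F1.

    scored: list of (similarity, is_relevant) pairs from a labeled sample.
    Returns None if the sample contains no relevant items.
    """
    total_relevant = sum(1 for _, relevant in scored if relevant)
    best, best_f1 = None, -1.0
    for threshold in sorted({s for s, _ in scored}):
        kept = [(s, relevant) for s, relevant in scored if s >= threshold]
        true_pos = sum(1 for _, relevant in kept if relevant)
        if true_pos == 0:
            continue
        precision = true_pos / len(kept)
        recall = true_pos / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best, best_f1 = threshold, f1
    return best


sample = [
    (0.90, True), (0.80, True), (0.75, True),
    (0.70, False), (0.60, False), (0.50, True),
]
cutoff = best_threshold(sample)
```

A threshold tuned this way depends on the embedding model and the corpus, so it would need to be recomputed whenever either changes.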

We deployed our RAG system and assumed it was working because users were not complaining loudly. Three months later, we discovered specific query categories were failing 40% of the time — but users had just stopped asking those questions.

Silent failures are the worst kind.

We now run automated evaluation continuously:

python

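import time


class RAGEvaluator:
    def __init__(self, test_set_path: str):
        self.test_cases = self.load_test_cases(test_set_path)

    def evaluate(self, retriever, generator) -> dict:
        results = {
            'retrieval_accuracy': [],
            'answer_correctness': [],
            'faithfulness': [],
            'latency': []
        }

        for case in self.test_cases:
            start = time.time()

            # The per-case scoring lines are missing from the published
            # article. The calls and helper names below are hypothetical
            # stand-ins, not the original implementation.
            retrieved = retriever.retrieve(case['query'])
            answer = generator.generate(case['query'], retrieved)
            latency = time.time() - start
            retrieval_hit = self.check_retrieval(retrieved, case)
            correctness = self.score_correctness(answer, case)
            faithfulness = self.score_faithfulness(answer, retrieved)

            results['retrieval_accuracy'].append(retrieval_hit)
            results['answer_correctness'].append(correctness)
            results['faithfulness'].append(faithfulness)
            results['latency'].append(latency)

        # Average each metric over all test cases.
        return {
            'retrieval_accuracy': sum(results['retrieval_accuracy']) / len(results['retrieval_accuracy']),
            'answer_correctness': sum(results['answer_correctness']) / len(results['answer_correctness']),
            'faithfulness': sum(results['faithfulness']) / len(results['faithfulness']),
            'avg_latency': sum(results['latency']) / len(results['latency'])
        }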

We run this evaluation suite daily. Any metric dropping below threshold triggers an alert. Silent failures are now caught within 24 hours.
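The alerting step can be as simple as comparing the evaluator's output against fixed bounds. The sketch below assumes the metric names returned by `evaluate`. The function name, the bounds, and the sample numbers are all hypothetical, and how an alert is actually delivered is left out.

```python
def find_breaches(metrics, minimums, maximums):
    """Return the names of metrics outside their allowed range.

    minimums: metrics where higher is better (alert when below the floor).
    maximums: metrics where lower is better (alert when above the ceiling).
    A metric missing from `metrics` counts as a breach.
    """
    breaches = []
    for name, floor in minimums.items():
        if metrics.get(name, 0.0) < floor:
            breaches.append(name)
    for name, ceiling in maximums.items():
        if metrics.get(name, float('inf')) > ceiling:
            breaches.append(name)
    return breaches


daily = {
    'retrieval_accuracy': 0.91,
    'answer_correctness': 0.82,
    'faithfulness': 0.95,
    'avg_latency': 2.1,
}
alerts = find_breaches(
    daily,
    minimums={'retrieval_accuracy': 0.85, 'answer_correctness': 0.85,
              'faithfulness': 0.90},
    maximums={'avg_latency': 2.0},
)
```

Latency is handled separately from the accuracy metrics because it degrades in the opposite direction. Computing the same check per query category, rather than only on the overall averages, is what would surface the kind of category-specific failure described above.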

Metric                   Before    After
Retrieval accuracy       61%       91%
Answer correctness       54%       88%
Hallucination rate       12%       1.8%
Multilingual accuracy    41%       84%
Token cost per query     $0.018    $0.006
Average latency          3.2s      1.4s

No single lesson drove these improvements. It was the combination of all seven working together.

Production RAG Systems — 7 Lessons We Learned the Hard Way was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: pub.towardsai.net (original article)
