How to build a production RAG pipeline in Python (without a vector database)

This tutorial builds a production-ready Retrieval-Augmented Generation (RAG) pipeline in Python using BM25 retrieval via Meilisearch instead of a vector database, arguing that BM25 achieves 85–95% of the recall of semantic search on domain-specific corpora with lower cost and complexity. It covers indexing documents, retrieving relevant chunks with typo tolerance and filters, and constructing prompts that ground LLM responses in the retrieved content. The author argues that for technical documentation, knowledge bases, and similar corpora, a vector database is often unnecessary for effective RAG.

Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora (technical documentation, company knowledge bases, article archives), BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.

Why RAG, and why not a vector database

Retrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.

The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But on a domain-specific corpus with consistent terminology (think a cybersecurity knowledge base or a medical reference), BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.

Meilisearch gives you this kind of lexical retrieval out of the box, plus typo tolerance, faceted filtering, and a simple REST API. Strictly speaking, its relevance engine is a bucket sort over ranking rules (words, typo, proximity, attribute, exactness) rather than a literal BM25 score, but it fills the same keyword-retrieval role. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.

Setup

```bash
pip install meilisearch openai httpx
```

Run Meilisearch locally:

```bash
docker run -d -p 7700:7700 getmeili/meilisearch:latest
```

Step 1: Index your documents

Your documents need an `id`, searchable `content`, and any filter attributes you want to use at query time.
```python
import hashlib
import json

import meilisearch

MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key"  # or "" for local dev
INDEX_NAME = "knowledge_base"

client = meilisearch.Client(MEILI_URL, MEILI_KEY)


def get_or_create_index():
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)

    # Configure searchable attributes and filters
    task = index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": ["words", "typo", "proximity", "attribute", "sort", "exactness"],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8},
        },
    })
    # Settings updates are asynchronous; wait so filters work on the first query
    client.wait_for_task(task.task_uid)
    return index


def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()

    # Add stable IDs if not present
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]

    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")


# Example: load from a JSONL file
def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                docs.append(json.loads(line))
    index_documents(docs)
```

Step 2: Retrieve top-k documents

```python
def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Returns top_k documents matching the query.
    filters example: "category = 'security' AND doc_type = 'guide'"
    """
    index = client.get_index(INDEX_NAME)
    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "<mark>",
        "highlightPostTag": "</mark>",
    }
    if filters:
        search_params["filter"] = filters

    results = index.search(query, search_params)
    return results["hits"]
```

Step 3: Construct the prompt

The prompt structure is critical. You want the model to be explicitly grounded: it should cite only what's in the retrieved chunks, not hallucinate.

```python
def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        # Truncate each source to 1,200 characters to bound the context size
        context_blocks.append(
            f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}"
        )
    context = "\n\n---\n\n".join(context_blocks)

    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )
    user_message = f"""Sources:
{context}

---

Question: {query}"""

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
```

Step 4: Stream the LLM response

Never buffer the full response before sending it to the user. Streaming is essential for UX on long answers.

```python
from openai import OpenAI

# Generic LLM client; swap for any compatible SDK
llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust per provider
)


def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    filters = f"category = '{category_filter}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)

    if not docs:
        yield "No relevant documents found in the knowledge base."
        return

    messages = build_prompt(query, docs)
    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # lower temp for factual retrieval tasks
        max_tokens=800,
    )
    for chunk in stream:
        if not chunk.choices:  # some providers emit chunks with no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
```

Step 5: Wire it together: a minimal CLI

```python
import sys


def main():
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'=' * 60}\n")
    for token in rag_stream(query):
        print(token, end="", flush=True)
    print("\n")


if __name__ == "__main__":
    main()
```

Usage:

```bash
python rag.py "What are the key requirements of NIS 2 for SMEs?"
```

Step 6: Evaluate hit rate

Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.

```python
def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")

    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        if item["expected_id"] in retrieved_ids:
            hits += 1

    hit_rate = hits / len(golden_set)
    print(f"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate


# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]
evaluate_hit_rate(golden, top_k=5)
```

On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5, without a single embedding model call.

Production considerations

Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store `doc_id` and `chunk_index` so you can reconstruct the full document if needed.
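
The chunking advice above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: `chunk_document` is a hypothetical helper, whitespace-separated words stand in for tokens (a real pipeline would count with the model's tokenizer), and the `{doc_id}-{chunk_index}` ID scheme is an assumption.

```python
def chunk_document(doc: dict, chunk_size: int = 600, overlap_ratio: float = 0.10) -> list[dict]:
    """Split doc["content"] into overlapping chunks that can be indexed individually.

    Assumption: whitespace-separated words approximate tokens.
    """
    words = doc["content"].split()
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances per chunk

    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"{doc['id']}-{chunk_index}",  # hypothetical ID scheme
            "doc_id": doc["id"],                 # parent document
            "chunk_index": chunk_index,          # position within the parent
            "title": doc.get("title", ""),
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Because every chunk carries `doc_id` and `chunk_index`, sibling chunks can be fetched with a filter on `doc_id` (once it is declared filterable) and re-joined in order.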
Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. `cross-encoder/ms-marco-MiniLM-L-6-v2` from Sentence Transformers works locally and adds ~30ms latency.

Context window budget: At 5 docs × 1,200 chars you send about 6,000 characters, which is roughly 1,500 tokens of context at about 4 characters per token. Adjust `top_k` and content truncation to stay within your model's window while leaving room for the answer.

Caching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.

This pipeline (retrieval with Meilisearch, prompt construction, streaming output) is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.
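
The retrieval-caching suggestion from the production notes could look like the sketch below. It is a minimal in-memory version under stated assumptions: `TTLCache` and `cached_retrieve` are hypothetical names, `retrieve_fn` stands in for the `retrieve` function from Step 2, queries are treated as identical after trimming and lowercasing (an assumption), and a single-process dict is used where a multi-worker deployment would need a shared store such as Redis.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


def cached_retrieve(cache: TTLCache, retrieve_fn, query: str,
                    top_k: int = 5, filters: str = ""):
    """Return cached hits for an identical (query, top_k, filters) triple."""
    key = (query.strip().lower(), top_k, filters)
    hits = cache.get(key)
    if hits is None:
        hits = retrieve_fn(query, top_k=top_k, filters=filters)
        cache.set(key, hits)
    return hits
```

A call such as `cached_retrieve(cache, retrieve, query)` would then replace the direct `retrieve` call inside `rag_stream`; the default 600-second TTL sits inside the 5–15 minute range suggested above.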