{"slug": "how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database", "title": "How to build a production RAG pipeline in Python (without a vector database)", "summary": "This tutorial shows how to build a production-ready Retrieval-Augmented Generation (RAG) pipeline in Python using BM25 retrieval via Meilisearch instead of a vector database, arguing that BM25 typically achieves 85–95% of the recall of semantic search on domain-specific corpora with lower cost and complexity. It covers indexing documents, retrieving relevant chunks with typo tolerance and filters, constructing prompts that ground LLM responses in the retrieved content, streaming the answer, and evaluating hit rate. The author argues that for technical documentation, knowledge bases, and similar corpora, a vector database is often unnecessary for effective RAG.", "body_md": "Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora — technical documentation, company knowledge bases, article archives — BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.\nWhy RAG, and why not a vector database\nRetrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.\nThe standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. 
But on a domain-specific corpus with consistent terminology — think a cybersecurity knowledge base or a medical reference — BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.\nMeilisearch gives you BM25-style keyword retrieval out of the box (strictly speaking, it orders results with its own rule-based bucket sort rather than classic BM25 scoring, but it fills the same lexical-retrieval role here), plus typo tolerance, faceted filtering, and a simple REST API. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.\nSetup\n\n```\npip install meilisearch openai httpx\n```\n\nRun Meilisearch locally:\n\n```\ndocker run -d -p 7700:7700 getmeili/meilisearch:latest\n```\n\nStep 1: Index your documents\nYour documents need an `id`, searchable `content`, and any filter attributes you want to use at query time.\n\n``` python\nimport meilisearch\nimport hashlib\nimport json\n\nMEILI_URL = \"http://127.0.0.1:7700\"\nMEILI_KEY = \"your_master_key\"  # or \"\" for local dev\nINDEX_NAME = \"knowledge_base\"\n\nclient = meilisearch.Client(MEILI_URL, MEILI_KEY)\n\ndef get_or_create_index():\n    try:\n        index = client.get_index(INDEX_NAME)\n    except meilisearch.errors.MeilisearchApiError:\n        task = client.create_index(INDEX_NAME, {\"primaryKey\": \"id\"})\n        client.wait_for_task(task.task_uid)\n        index = client.get_index(INDEX_NAME)\n\n    # Configure searchable attributes and filters\n    index.update_settings({\n        \"searchableAttributes\": [\"title\", \"content\", \"tags\"],\n        \"filterableAttributes\": [\"category\", \"doc_type\"],\n        \"rankingRules\": [\n            \"words\", \"typo\", \"proximity\", \"attribute\", \"sort\", \"exactness\"\n        ],\n        \"typoTolerance\": {\n            \"enabled\": True,\n            \"minWordSizeForTypos\": {\"oneTypo\": 4, \"twoTypos\": 8}\n        }\n    })\n    return index\n\ndef index_documents(documents: list[dict]):\n    \"\"\"\n    Each document: {\"id\": str, \"title\": str, \"content\": str,\n                    
\"tags\": list[str], \"category\": str, \"doc_type\": str}\n    \"\"\"\n    index = get_or_create_index()\n\n    # Add stable IDs if not present\n    for doc in documents:\n        if \"id\" not in doc:\n            doc[\"id\"] = hashlib.sha256(doc[\"content\"].encode()).hexdigest()[:16]\n\n    task = index.add_documents(documents, primary_key=\"id\")\n    client.wait_for_task(task.task_uid)\n    print(f\"Indexed {len(documents)} documents.\")\n\n# Example: load from a JSONL file\ndef load_and_index(filepath: str):\n    docs = []\n    with open(filepath, encoding=\"utf-8\") as f:\n        for line in f:\n            line = line.strip()\n            if line:  # skip blank lines, which json.loads would reject\n                docs.append(json.loads(line))\n    index_documents(docs)\n```\n\nStep 2: Retrieve top-k documents\n\n``` python\ndef retrieve(query: str, top_k: int = 5, filters: str = \"\") -> list[dict]:\n    \"\"\"\n    Returns top_k documents matching the query.\n    filters example: \"category = 'security' AND doc_type = 'guide'\"\n    \"\"\"\n    index = client.get_index(INDEX_NAME)\n\n    search_params = {\n        \"limit\": top_k,\n        \"attributesToRetrieve\": [\"id\", \"title\", \"content\", \"category\"],\n        \"attributesToHighlight\": [\"content\"],\n        \"highlightPreTag\": \"**\",\n        \"highlightPostTag\": \"**\",\n    }\n\n    if filters:\n        search_params[\"filter\"] = filters\n\n    results = index.search(query, search_params)\n    return results[\"hits\"]\n```\n\nStep 3: Construct the prompt\nThe prompt structure is critical. You want the model to be explicitly grounded — it should cite only what's in the retrieved chunks, not hallucinate.\n\n``` python\ndef build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:\n    context_blocks = []\n    for i, doc in enumerate(retrieved_docs, 1):\n        context_blocks.append(\n            f\"[Source {i}] {doc['title']}\\n{doc['content'][:1200]}\"\n        )\n\n    context = \"\\n\\n---\\n\\n\".join(context_blocks)\n\n    system_prompt = (\n        \"You are a technical assistant. 
Answer the user's question using ONLY \"\n        \"the provided sources. If the answer is not in the sources, say so explicitly. \"\n        \"Cite sources by number, e.g. [Source 1].\"\n    )\n\n    user_message = f\"\"\"Sources:\n{context}\n\n---\n\nQuestion: {query}\"\"\"\n\n    return [\n        {\"role\": \"system\", \"content\": system_prompt},\n        {\"role\": \"user\", \"content\": user_message},\n    ]\n```\n\nStep 4: Stream the LLM response\nNever buffer the full response before sending it to the user. Streaming is essential for UX on long answers.\n\n``` python\nfrom openai import OpenAI  # generic llm_client — swap for any compatible SDK\n\nllm_client = OpenAI(\n    api_key=\"your_api_key\",\n    base_url=\"https://api.your-llm-provider.com/v1\",  # adjust per provider\n)\n\ndef rag_stream(query: str, category_filter: str = \"\"):\n    \"\"\"Generator that yields text chunks as they arrive from the LLM.\"\"\"\n    filters = f\"category = '{category_filter}'\" if category_filter else \"\"\n    docs = retrieve(query, top_k=5, filters=filters)\n\n    if not docs:\n        yield \"No relevant documents found in the knowledge base.\"\n        return\n\n    messages = build_prompt(query, docs)\n\n    stream = llm_client.chat.completions.create(\n        model=\"gpt-4o-mini\",  # or your preferred model\n        messages=messages,\n        stream=True,\n        temperature=0.2,  # lower temp for factual retrieval tasks\n        max_tokens=800,\n    )\n\n    for chunk in stream:\n        delta = chunk.choices[0].delta\n        if delta.content:\n            yield delta.content\n```\n\nStep 5: Wire it together — a minimal CLI\n\n``` python\nimport sys\n\ndef main():\n    query = \" \".join(sys.argv[1:]) if len(sys.argv) > 1 else input(\"Query: \")\n    print(f\"\\nQuery: {query}\\n{'='*60}\\n\")\n\n    for token in rag_stream(query):\n        print(token, end=\"\", flush=True)\n\n    print(\"\\n\")\n\nif __name__ == \"__main__\":\n    
main()\n```\n\nUsage:\n\n```\npython rag.py \"What are the key requirements of NIS 2 for SMEs?\"\n```\n\nStep 6: Evaluate hit rate\nBefore deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.\n\n``` python\ndef evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:\n    \"\"\"\n    golden_set: [{\"query\": \"...\", \"expected_id\": \"doc_id\"}, ...]\n    Returns hit rate @ top_k.\n    \"\"\"\n    if not golden_set:\n        raise ValueError(\"golden_set must not be empty\")\n\n    hits = 0\n    for item in golden_set:\n        results = retrieve(item[\"query\"], top_k=top_k)\n        retrieved_ids = {r[\"id\"] for r in results}\n        if item[\"expected_id\"] in retrieved_ids:\n            hits += 1\n\n    hit_rate = hits / len(golden_set)\n    print(f\"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})\")\n    return hit_rate\n\n# Example usage\ngolden = [\n    {\"query\": \"NIS 2 SME requirements\", \"expected_id\": \"nis2-guide-001\"},\n    {\"query\": \"ISO 27001 certification steps\", \"expected_id\": \"iso27001-checklist\"},\n    {\"query\": \"penetration testing methodology\", \"expected_id\": \"pentest-guide-002\"},\n]\n\nevaluate_hit_rate(golden, top_k=5)\n```\n\nOn a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5 — without a single embedding model call.\nProduction considerations\nChunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store `doc_id` and `chunk_index` so you can reconstruct the full document if needed.\nRe-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. `cross-encoder/ms-marco-MiniLM-L-6-v2` from Sentence Transformers works locally and adds ~30ms latency.\nContext window budget: At 5 docs × 1,200 chars, you're using roughly 1,500 tokens of context. 
Adjust `top_k` and content truncation to stay within your model's window while leaving room for the answer.\nCaching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.\nThis pipeline — retrieval with Meilisearch, prompt construction, streaming output — is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.", "url": "https://wpnews.pro/news/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database", "canonical_source": "https://dev.to/ayinedjimi-consultants/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database-69g", "published_at": "2026-05-22 00:09:39+00:00", "updated_at": "2026-05-22 00:35:05.760473+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "data", "artificial-intelligence", "open-source"], "entities": ["Meilisearch", "Pinecone", "Weaviate", "Chroma", "AYI NEDJIMI Consultants", "BM25", "RAG", "LLM"], "alternates": {"html": "https://wpnews.pro/news/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database", "markdown": "https://wpnews.pro/news/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database.md", "text": "https://wpnews.pro/news/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database.txt", "jsonld": "https://wpnews.pro/news/how-to-build-a-production-rag-pipeline-in-python-without-a-vector-database.jsonld"}}