Hybrid Retrieval + RRF: How I Got 100% Retrieval Precision in a Production RAG System

A developer building ContextQuery, a production RAG system on free-tier infrastructure, achieved 100% retrieval precision by implementing hybrid retrieval with Reciprocal Rank Fusion (RRF). The approach combines semantic search via NVIDIA NIM embeddings and Chroma Cloud with BM25 sparse retrieval, overcoming the limitations of naive RAG that often fails on exact keyword and short specific queries.

Originally published at vivekpatil23.hashnode.dev

The Problem With Naive RAG Nobody Talks About

Most RAG tutorials show you the same pipeline: embed your documents, store vectors, embed the query, fetch the top-k nearest neighbors, pass to the LLM. It works well enough in demos. In production, it quietly fails in two specific situations:

Situation 1: Exact keyword queries. A user asks "What is the ContextQuery API rate limit?" Your semantic search returns chunks about "API usage patterns" and "request throttling behavior": conceptually related, but the exact phrase "rate limit" is buried or absent. The LLM hallucinates a number because the retrieved chunk doesn't contain one.

Situation 2: Short, specific queries. Semantic embeddings excel at capturing meaning but compress specificity. A 3-word query like "Chroma collection schema" gets drowned out by semantically adjacent but contextually wrong chunks.

These weren't hypothetical failures. They showed up in my evaluation runs on ContextQuery, a production RAG system I built on free-tier infrastructure (NVIDIA NIM embeddings, Chroma Cloud, FastAPI, Next.js 15). My initial retrieval pipeline had a precision ceiling of 72% that I couldn't break past, no matter how I tuned chunk size or overlap.

The fix was hybrid retrieval using Reciprocal Rank Fusion. Here's exactly how it works and how I implemented it.

What Is Reciprocal Rank Fusion?

RRF is a rank merging algorithm. Instead of picking one retrieval method and hoping it covers all query types, you run multiple retrievers independently, each producing a ranked list of chunks. RRF then merges those ranked lists into a single ranking using this formula:

$$\mathrm{RRF}(c) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(c)}$$

where $R$ is the set of retrievers, $k$ is a smoothing constant (typically 60), and $\mathrm{rank}_r(c)$ is the 1-based position of chunk $c$ in retriever $r$'s ranked list. A retriever that doesn't return the chunk contributes nothing to the sum.

The key insight: a chunk that appears at rank 3 in semantic search AND rank 5 in BM25 gets a higher combined score than a chunk that's rank 1 in only one method. With k = 60, that is 1/63 + 1/65 ≈ 0.0313 against 1/61 ≈ 0.0164. Consensus across retrievers is the signal.

No neural network.
No additional model. Just math on top of your existing retrieval infrastructure.

The Two Retrievers I Combined

Retriever 1: Semantic Search (NVIDIA NIM Embeddings + Chroma Cloud)

Standard dense retrieval. The query gets embedded via NVIDIA's nvidia/nv-embedqa-e5-v5 model and compared against stored document embeddings in Chroma Cloud using cosine similarity. It returns the top-k chunks by vector similarity.

Strength: captures conceptual meaning, handles paraphrasing well.

Weakness: misses exact keyword matches, struggles with short specific queries.

Retriever 2: BM25 (rank_bm25)

Classical sparse retrieval. No embeddings. It scores chunks based on term frequency, inverse document frequency, and document length normalization. This is the same family of ranking functions that powered search engines before neural retrieval took over.

Strength: exact keyword matching, short specific queries, named entities.

Weakness: no semantic understanding, synonym-blind.

These two retrievers fail in opposite situations. That's exactly why combining them works.

Implementation

Here's the core RRF merge function from ContextQuery:

```python
from typing import Dict, List

from rank_bm25 import BM25Okapi


def reciprocal_rank_fusion(
    semantic_results: List[Dict],
    bm25_results: List[Dict],
    k: int = 60,
) -> List[Dict]:
    """
    Merge semantic and BM25 ranked lists using RRF.
    Each result dict must have 'id' and 'content' keys.
    """
    scores: Dict[str, float] = {}
    chunk_map: Dict[str, Dict] = {}

    # Score semantic results (enumerate is 0-based, so rank + 1 is the 1-based rank)
    for rank, chunk in enumerate(semantic_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Score BM25 results
    for rank, chunk in enumerate(bm25_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Sort by combined RRF score, descending
    ranked_ids = sorted(scores, key=lambda x: scores[x], reverse=True)
    return [chunk_map[chunk_id] for chunk_id in ranked_ids]
```

And the BM25 retriever setup:

```python
def build_bm25_index(chunks: List[str]) -> BM25Okapi:
    # Lowercase + whitespace tokenization; the query must be tokenized the same way
    tokenized = [chunk.lower().split() for chunk in chunks]
    return BM25Okapi(tokenized)


def bm25_retrieve(
    query: str,
    bm25_index: BM25Okapi,
    chunks: List[Dict],
    top_k: int = 10,
) -> List[Dict]:
    # `chunks` must be in the same order as the texts used to build the index
    tokenized_query = query.lower().split()
    scores = bm25_index.get_scores(tokenized_query)
    top_indices = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=True
    )[:top_k]
    return [chunks[i] for i in top_indices]
```

The full retrieval call in the FastAPI endpoint:

```python
async def retrieve(query: str, top_k: int = 5) -> List[Dict]:
    # chroma_semantic_search, bm25_index and all_chunks are defined elsewhere in the app

    # Run both retrievers
    semantic_results = await chroma_semantic_search(query, top_k=10)
    bm25_results = bm25_retrieve(query, bm25_index, all_chunks, top_k=10)

    # Merge with RRF
    fused_results = reciprocal_rank_fusion(semantic_results, bm25_results)

    # Return top-k from merged list
    return fused_results[:top_k]
```

Note that I fetch top-10 from each retriever before merging, then cut to top-5 after fusion. This gives RRF enough candidates to actually rerank meaningfully; fetching only top-5 before fusion defeats the purpose.

Evaluation Results

I evaluated ContextQuery using a 16-question test set covering a range of query types: exact keyword queries, conceptual questions, multi-hop questions, and short specific lookups.
| Metric | Semantic Only | Hybrid RRF |
|---|---|---|
| Retrieval Precision | 72% | 100% |
| Answer Faithfulness | 81% | 87.5% |
| Avg Latency | ~1800 ms | ~2400 ms |

Retrieval precision measures whether the correct chunk appeared in the top-5 results. Faithfulness measures whether the LLM's answer was grounded in the retrieved content rather than hallucinated.

The roughly 600 ms latency increase comes from running BM25 in addition to the semantic search. For my use case this was an acceptable tradeoff. For latency-critical applications, you could run BM25 on a separate thread and set a timeout that falls back to semantic-only results.

What I'd Do Differently

The 12.5-point faithfulness gap isn't a retrieval problem. After investigation, the remaining faithfulness failures came from chunk boundary issues: the answer to a question was split across two chunks and neither chunk alone was sufficient. The fix is smarter chunking (semantic chunking over fixed token windows), not better retrieval. Hybrid RRF solved the retrieval problem on this test set; chunking strategy is the next frontier.

k=60 is a reasonable default but not universal. The smoothing constant k controls how much weight rank position gets versus mere presence in the results. I used 60 (the standard default) and didn't tune it. A smaller k rewards top ranks more aggressively, so if your query distribution is heavily keyword-biased, it lets a top BM25 hit count for more; because k is shared, the same applies to the semantic list. Worth experimenting with if you're not hitting the precision numbers you need.

The BM25 index needs to be rebuilt on document updates. Unlike the Chroma vector store, which handles upserts natively, the BM25 index in my implementation is rebuilt from scratch on each document ingestion event. That's fine at small scale but will become a bottleneck with large corpora. A production fix is incremental index updates or a dedicated sparse retrieval service.
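To make the point about k concrete, here's a minimal sketch. It is not taken from the ContextQuery codebase, and the helper names `rrf_contribution` and `single_top_hit_beats_consensus` are hypothetical. It prints, for a few values of k, how much more a rank-1 hit is worth than a rank-10 hit within one list, and whether a chunk ranked first by a single retriever would outscore a chunk ranked 3rd and 5th by two retrievers.

```python
from typing import Sequence


def rrf_contribution(rank: int, k: int) -> float:
    """Score one retriever contributes for a chunk at a 1-based rank."""
    return 1.0 / (k + rank)


def single_top_hit_beats_consensus(
    k: int, consensus_ranks: Sequence[int] = (3, 5)
) -> bool:
    """True if a rank-1 hit in one list outscores a chunk that appears
    at `consensus_ranks` across several lists (illustrative helper)."""
    solo = rrf_contribution(1, k)
    consensus = sum(rrf_contribution(r, k) for r in consensus_ranks)
    return solo > consensus


for k in (1, 5, 10, 60):
    # How many times larger the rank-1 contribution is than the rank-10 one
    ratio = rrf_contribution(1, k) / rrf_contribution(10, k)
    print(k, round(ratio, 2), single_top_hit_beats_consensus(k))

# 1 5.5 True
# 5 2.5 False
# 10 1.82 False
# 60 1.15 False
```

At k=60 the rank-1 contribution is only about 1.15 times the rank-10 one, so appearing in both lists dominates. At very small k, a single top hit can outvote consensus, which may or may not be what you want for keyword-heavy traffic.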
Stack Summary

Embeddings: NVIDIA NIM (nvidia/nv-embedqa-e5-v5)

Full source: https://github.com/vivekpatil200320/contextquery

Wrapping Up

Hybrid retrieval isn't a complex idea: it's two retrievers whose failure modes don't overlap, merged with a formula that takes 10 lines to implement. The results in ContextQuery were significant enough that I now treat it as a default starting point rather than an optimisation.

If you're building a RAG system and hitting a precision ceiling, add BM25 before you touch chunk size, overlap, or embedding models. It's the highest-leverage change in the retrieval stack.

Building and writing about production AI systems. Find more at https://vivekpatil23.hashnode.dev