With the rapid advancement of Large Language Models and vector embeddings, Retrieval-Augmented Generation (RAG) has become the go-to solution for querying unstructured documents. Upload a PDF, ask a question, get an answer. It feels like magic.
But sometimes, it is not enough.
The silent failure mode of most RAG systems is not the LLM. It is the retrieval step. Dense vector search is powerful at finding semantically similar text. It understands that “urban spending” and “city expenditure” mean the same thing. But ask it for a specific error code, a contract clause number, or a precise financial figure, and it can silently return the wrong chunks with high confidence.
On the other hand, keyword search like BM25 nails exact matches every time. But it has no concept of meaning. “Automobile” and “car” are completely different strings to it, and any paraphrased question will leave it lost.
The uncomfortable truth is that neither retriever is universally better. Each dominates on a different class of queries. And in real-world documents like legal contracts, financial reports, and technical manuals, you will always have both kinds.
Hybrid RAG solves this by running both retrievers on every query and fusing their results with Reciprocal Rank Fusion (RRF). You get the semantic understanding of vector search and the precision of keyword search, in a single ranked list, at near-zero extra cost.
In this article, we will build a complete Hybrid RAG system from scratch: FAISS for dense search, BM25 for keyword search, Reciprocal Rank Fusion to merge the two ranked lists into a single, better-ranked result, LangGraph for orchestration, and a Streamlit UI where you can toggle between retrieval modes and inspect every chunk and score behind each answer.
The complete end-to-end code is available in my GitHub repo:
agentic-ai-usecases/beginner/hybrid-rag at main · alphaiterations/agentic-ai-usecases
Before jumping into code, it helps to understand why hybrid retrieval matters.
Vector search converts text into high-dimensional embeddings and finds the nearest neighbours by cosine similarity. It excels at paraphrases: ‘What is the profit margin?’ finds chunks that say ‘net income as a percentage of revenue’ even though none of those words overlap with the query. But it can silently skip a chunk that contains ERR_4021 because that token was rare in training data and sits in an odd region of the embedding space.
BM25 (Best Match 25) is a classical information retrieval algorithm based on term frequency and inverse document frequency. It scores documents based on how many query words appear in them and how rare those words are across the whole corpus. It nails exact matches, part numbers, named entities, and specific terminology. The weakness is that it has no semantic understanding at all, so ‘automobile’ and ‘car’ are completely different words to BM25.
Hybrid retrieval combines both signals. The merged ranked list tends to surface chunks that are simultaneously semantically relevant and lexically relevant, which is exactly what you want when your document contains a mix of technical terms and descriptive prose.
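To make the BM25 idea concrete, here is a minimal pure-Python sketch of the scoring. It is not the rank_bm25 implementation: the function name `bm25_scores`, the toy corpus, and the always-positive IDF variant are assumptions made for illustration, and `k1` and `b` are set to commonly used values.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenised document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # exact string match only, no notion of synonyms
            # Rare terms get a larger weight (always-positive IDF variant).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, lowercased and split on whitespace.
corpus = [
    "the car failed with error ERR_4021".lower().split(),
    "annual revenue grew by ten percent".lower().split(),
    "the car was serviced last year".lower().split(),
]
exact = bm25_scores("err_4021".split(), corpus)
paraphrase = bm25_scores("automobile".split(), corpus)
```

The exact identifier scores only against the first document, while ‘automobile’ scores zero everywhere even though two documents mention a car. That is the strength and the weakness described above in one example.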
The question is:
How do we decide which chunk to prioritize?
RRF is the answer.
RRF is a rank-based merging algorithm that combines multiple ranked lists into a single, unified ranking without caring about the raw score values from any individual retriever.
Instead of asking “which chunk scored highest overall?”, it asks “which chunk appeared near the top of the most lists?”
The formula is simple:
RRF score(d) = Σ 1 / (k + rank(d, list))
where k is a smoothing constant (typically 60) and rank(d, list) is the 1-indexed position of chunk d in a given retriever’s result list. The sum runs over every retriever that returned the chunk.
A few properties make RRF especially well-suited for hybrid retrieval:
In practice, this means: when both retrievers agree on a chunk, it rises to the top. When only one retriever surfaces it, it still gets credit but not enough to dominate if another chunk had broader support.
Here is the full architecture of what we are going to build:
Architecture note: a key design decision is that the FAISS and BM25 indexes live in Streamlit session_state, not inside LangGraph state. LangGraph state needs to be serialisable, and FAISS index objects are not. The nodes access the indexes through closures, keeping the graph state clean.
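The closure pattern from the note can be sketched in plain Python. This is only an illustration of the idea, not the project's code: `FakeIndex` and `make_retrieve_node` are hypothetical names, and the toy "search" is a word-overlap check standing in for a real index.

```python
import json


class FakeIndex:
    """Hypothetical stand-in for a FAISS or BM25 index object."""

    def __init__(self, chunks):
        self._chunks = chunks

    def search(self, query, k):
        # Toy search: keep chunks that share at least one word with the query.
        words = set(query.lower().split())
        hits = [c for c in self._chunks if words & set(c.lower().split())]
        return hits[:k]


def make_retrieve_node(index, top_k):
    """Build a node that closes over the index instead of reading it from state."""

    def retrieve_node(state):
        # Only plain, serialisable data goes back into the state.
        return {"results": index.search(state["query"], top_k)}

    return retrieve_node


index = FakeIndex(["Repo rate was cut", "Inflation eased in October"])
node = make_retrieve_node(index, top_k=1)

state = {"query": "What happened to the repo rate?"}
state.update(node(state))
serialised = json.dumps(state)  # works because the index never enters the state
```

The state stays a dictionary of strings and lists, so it can be dumped to JSON at any point, while the heavy index object is reachable only from inside the node function.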
Below are the architectural components we are using in the project:
We use the Claude Sonnet 4.6 API as the LLM.
hybrid-rag/
├── app.py                    # Streamlit two-column UI
├── graph.py                  # LangGraph StateGraph + indexing helper
├── retriever/
│   ├── vector_retriever.py   # FAISS cosine search
│   ├── bm25_retriever.py     # BM25 keyword search
│   └── fusion.py             # RRF fusion
├── indexer/
│   └── pdf_indexer.py        # PyMuPDF extraction + chunker + index builders
├── monitoring/
│   └── chunk_monitor.py      # Last-5-query history tracker
├── .env                      # Your API key goes here
└── requirements.txt
# macOS/Linux
mkdir hybrid-rag && cd hybrid-rag
python3.11 -m venv .venv
source .venv/bin/activate

# Windows
python3.11 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
anthropic==0.104.1
langgraph==1.2.1
faiss-cpu==1.14.2
sentence-transformers==5.5.1
rank_bm25==0.2.2
PyMuPDF==1.27.2.3
streamlit==1.57.0
pandas==3.0.3
numpy==2.4.6
python-dotenv==1.2.2
Note: sentence-transformers pulls in PyTorch as a dependency. The first install will download around 2 GB. Subsequent runs load from cache.
The indexer is the foundation of the whole pipeline. It reads raw PDF bytes, extracts text page by page using PyMuPDF, and then cuts the flat token stream into overlapping windows.
python
def chunk_text(pages, chunk_size=200, overlap=50):
    # Flatten all pages into one token stream, remembering each token's page.
    all_tokens = []
    token_pages = []
    for page_num, text in pages:
        tokens = text.split()
        all_tokens.extend(tokens)
        token_pages.extend([page_num] * len(tokens))

    step = chunk_size - overlap  # stride = 150 tokens
    chunks, chunk_pages = [], []
    i = 0
    while i < len(all_tokens):
        window_tokens = all_tokens[i : i + chunk_size]
        chunks.append(' '.join(window_tokens))
        chunk_pages.append(token_pages[i])  # page of the window's first token
        if len(window_tokens) < chunk_size:
            break  # last, shorter window reached
        i += step
    return chunks, chunk_pages
Note: Why overlapping chunks? Without overlap, a sentence that spans a chunk boundary gets split in two, and neither half carries full context. A 50-token overlap means each chunk shares its last 50 tokens with the next chunk’s first 50, so key sentences near boundaries appear in at least two chunks and have a higher chance of being retrieved.
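To see the overlap at work, here is a small sketch that applies the same windowing loop to ten dummy tokens. The helper name `sliding_windows` and the tiny sizes (4-token windows, 2-token overlap) are assumptions chosen to keep the output readable.

```python
def sliding_windows(tokens, chunk_size, overlap):
    """Cut a token list into overlapping windows with the loop used for chunking."""
    step = chunk_size - overlap
    windows = []
    i = 0
    while i < len(tokens):
        window = tokens[i : i + chunk_size]
        windows.append(window)
        if len(window) < chunk_size:
            break  # a shorter final window ends the loop
        i += step
    return windows


tokens = [f"t{n}" for n in range(10)]          # t0 ... t9
windows = sliding_windows(tokens, chunk_size=4, overlap=2)

# The tail of each full window is the head of the next one.
shared = windows[0][-2:] == windows[1][:2]
```

With these sizes the first two windows are t0..t3 and t2..t5, so t2 and t3 appear in both. One detail worth knowing: when the token count lines up as it does here, the final short window (t8, t9) is fully contained in the window before it, which is harmless but does add one redundant chunk.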
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

_model = None


def _get_model():
    # Lazily load the embedding model once and reuse it.
    global _model
    if _model is None:
        _model = SentenceTransformer('all-MiniLM-L6-v2')
    return _model


def build_faiss_index(chunks):
    model = _get_model()
    embeddings = model.encode(
        chunks,
        normalize_embeddings=True,  # critical for cosine similarity
        show_progress_bar=False,
        batch_size=64,
    )
    embeddings = np.array(embeddings, dtype='float32')
    dim = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    return index
One thing to pay attention to: IndexFlatIP computes the inner product (dot product). When you use normalize_embeddings=True, all vectors sit on the unit sphere and inner product equals cosine similarity. This is slightly faster than computing cosine explicitly and gives you the same ranking.
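The claim that inner product equals cosine similarity on normalised vectors is easy to check by hand. The sketch below uses plain Python lists instead of FAISS or NumPy, and the two example vectors are made up for the check.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed explicitly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]   # length 5
chunk = [1.0, 2.0, 2.0]   # length 3

inner_after_norm = dot(normalise(query), normalise(chunk))
cosine_raw = cosine(query, chunk)
```

Both values come out as 11/15, so once the embeddings are normalised at index time, a plain inner-product index ranks chunks exactly as a cosine search would.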
from rank_bm25 import BM25Okapi


def build_bm25_index(chunks):
    # Lowercase + whitespace split; the query must be tokenised the same way.
    tokenized = [chunk.lower().split() for chunk in chunks]
    return BM25Okapi(tokenized)
Note: Lowercase tokenisation here must match the tokenisation at query time. BM25Okapi compares tokens as exact strings and .split() preserves case, so both the index build and the query must call .lower(), or ‘CRAR’ in the document will never match ‘crar’ in the query.
The chunk_text() function returns a single (chunks, chunk_pages) tuple, and the same chunks list is passed to both build_faiss_index() and build_bm25_index(). The two indexes are therefore position-aligned: the chunk at index i in the FAISS index is the same string as the chunk at index i in the BM25 corpus. This alignment is what makes RRF fusion possible.
The vector retriever encodes the query with the same model used at index time, then runs a nearest-neighbour search:
Notice that the retriever imports _get_model from the indexer module rather than creating a new SentenceTransformer instance. Loading all-MiniLM-L6-v2 takes about 2 seconds and 90 MB of memory. By sharing the singleton, you pay that cost exactly once per session.
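As a rough, dependency-free sketch of the search step, the function below does by brute force what a flat inner-product index does: score every chunk vector against the query vector and keep the k best. The name `retrieve_top_k`, the (text, page, score) return shape, and the 2-dimensional toy vectors are all assumptions; the real retriever would encode the query with the shared model and call the FAISS index instead.

```python
import heapq


def retrieve_top_k(query_vec, chunk_vecs, chunks, chunk_pages, k=5):
    """Return (chunk_text, page, score) for the k highest inner products."""
    scored = (
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    )
    top = heapq.nlargest(k, scored)  # sorted from best to worst
    return [(chunks[idx], chunk_pages[idx], score) for score, idx in top]


# Hypothetical unit-length vectors standing in for real embeddings.
chunk_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
chunks = ["chunk about rates", "chunk about growth", "chunk about inflation"]
chunk_pages = [1, 2, 3]

results = retrieve_top_k([0.0, 1.0], chunk_vecs, chunks, chunk_pages, k=2)
```

Because chunks, chunk_pages and the vectors share the same positions, the index of a hit is all that is needed to recover its text and page number.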
The BM25 retriever is simpler: tokenise the query, ask the index to score all chunks, and return the top-k:
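Those three steps can be sketched as follows. To keep the example runnable without extra packages, it uses a hypothetical `TermCountIndex` that exposes a `get_scores(tokens)` method in the same spirit as the BM25 index; the function name `retrieve_bm25`, the return shape, and the choice to drop zero-score chunks are assumptions rather than the project's exact code.

```python
class TermCountIndex:
    """Hypothetical stand-in for a BM25 index with a get_scores(tokens) method."""

    def __init__(self, tokenized_corpus):
        self._corpus = tokenized_corpus

    def get_scores(self, query_tokens):
        # Toy score: how often the query tokens occur in each chunk.
        return [sum(doc.count(t) for t in query_tokens) for doc in self._corpus]


def retrieve_bm25(query, index, chunks, chunk_pages, k=5):
    """Tokenise the query, score all chunks, return the top-k with a non-zero score."""
    tokens = query.lower().split()  # must mirror the index-time tokenisation
    scores = index.get_scores(tokens)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [
        (chunks[i], chunk_pages[i], float(scores[i]))
        for i in ranked[:k]
        if scores[i] > 0
    ]


chunks = ["The CRAR of banks remained comfortable", "Urban demand remained steady"]
chunk_pages = [4, 7]
index = TermCountIndex([c.lower().split() for c in chunks])

results = retrieve_bm25("what was the CRAR", index, chunks, chunk_pages, k=2)
```

Swapping the stand-in for a real BM25 index should leave `retrieve_bm25` unchanged, since it only relies on `get_scores`.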
This is the heart of the hybrid system. RRF does not care about the absolute score values from either retriever. Instead, it uses the rank position of each chunk in each list. The formula is:
RRF score(d) = sum( 1 / (k + rank(d)) ) where k = 60
The constant 60 prevents top-ranked chunks from dominating too heavily when two lists disagree. It comes from Cormack, Clarke, and Buettcher (2009) and was chosen empirically across TREC benchmarks.
Imagine chunk A is ranked 1st by vector search (score 0.92) and 3rd by BM25. Chunk B is ranked 2nd by vector search and 1st by BM25. RRF gives:
RRF(A) = 1/(60+1) + 1/(60+3) = 0.01639 + 0.01587 = 0.03226
RRF(B) = 1/(60+2) + 1/(60+1) = 0.01613 + 0.01639 = 0.03252
Chunk B wins because it ranked highly in both lists, even though chunk A had a higher raw cosine score. This cross-list agreement signal is exactly what you want.
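The fusion step itself fits in a few lines. The sketch below is a minimal version that works on plain lists of chunk ids, which is a simplification: the function name `rrf_fuse` and the ids A to D are assumptions, and the project's own fusion function takes richer result tuples.

```python
def rrf_fuse(vector_ids, bm25_ids, rrf_k=60):
    """Fuse two ranked lists of chunk ids into one list sorted by RRF score."""
    scores = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranked, start=1):  # ranks are 1-indexed
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["A", "B", "C"]  # A is 1st and B is 2nd by vector search
bm25_ids = ["B", "D", "A"]    # B is 1st and A is 3rd by BM25

fused = rrf_fuse(vector_ids, bm25_ids)
fused_order = [chunk_id for chunk_id, _ in fused]
```

With the ranks from the example above, B ends up ahead of A, and the chunks found by only one retriever (C and D) trail behind both.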
LangGraph lets you model the retrieval pipeline as a directed graph of stateful nodes. Each node receives the full state dict, does its work, and returns a partial update that LangGraph merges back.
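The "partial update" idea can be mimicked in plain Python before looking at the real graph. This sketch is not LangGraph's implementation: `run_pipeline` is a hypothetical helper, the `RAGState` fields are inferred from the keys the nodes read and write, and the fuse step is a naive union rather than RRF.

```python
from typing import TypedDict


class RAGState(TypedDict, total=False):
    # Assumed shape of the shared state; every field is optional until a node sets it.
    query: str
    vector_results: list
    bm25_results: list
    fused_chunks: list
    answer: str


def run_pipeline(nodes, state):
    """Call each node in order and merge its partial update into the state."""
    for node in nodes:
        update = node(state)
        state = {**state, **update}  # keys not returned by the node are kept
    return state


def retrieve_vector(state: RAGState) -> dict:
    return {"vector_results": ["chunk-1", "chunk-2"]}


def retrieve_bm25(state: RAGState) -> dict:
    return {"bm25_results": ["chunk-2", "chunk-3"]}


def fuse_results(state: RAGState) -> dict:
    # Naive union that keeps first-seen order, standing in for the fusion step.
    merged = list(dict.fromkeys(state["vector_results"] + state["bm25_results"]))
    return {"fused_chunks": merged}


final_state = run_pipeline(
    [retrieve_vector, retrieve_bm25, fuse_results],
    {"query": "What was the CRAR?"},
)
```

Each node only returns the keys it owns, yet the final state contains the query and every intermediate result, which is what makes the pipeline easy to inspect afterwards.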
python
from langgraph.graph import StateGraph, START, END


def build_graph(session_state, top_k=5, retrieval_mode='Both'):

    def retrieve_vector_fn(state: RAGState) -> dict:
        # Skipped entirely when the user selects BM25-only mode.
        if retrieval_mode == 'BM25':
            return {'vector_results': []}
        from retriever.vector_retriever import retrieve
        return {'vector_results': retrieve(
            state['query'],
            session_state['faiss_index'],
            session_state['chunks'],
            session_state['chunk_pages'],
            k=top_k
        )}

    def retrieve_bm25_fn(state: RAGState) -> dict:
        # Skipped entirely when the user selects Vector-only mode.
        if retrieval_mode == 'Vector':
            return {'bm25_results': []}
        from retriever.bm25_retriever import retrieve
        return {'bm25_results': retrieve(
            state['query'],
            session_state['bm25_index'],
            session_state['chunks'],
            session_state['chunk_pages'],
            k=top_k
        )}

    def fuse_results_fn(state: RAGState) -> dict:
        from retriever.fusion import reciprocal_rank_fusion
        return {'fused_chunks': reciprocal_rank_fusion(
            state['vector_results'], state['bm25_results'], rrf_k=60
        )}

    def generate_answer_fn(state: RAGState) -> dict:
        import anthropic, os, time
        top_chunks = state['fused_chunks'][:top_k]
        # c[0] is the chunk text, c[4] is its page number.
        context = '\n\n---\n\n'.join(
            f'[Page {c[4]}]\n{c[0]}' for c in top_chunks
        )
        prompt = (
            'You are a helpful assistant. Answer the question using ONLY '
            'the provided context. If the context does not contain enough '
            'information to answer, say so clearly.\n\n'
            f'Context:\n{context}\n\nQuestion: {state["query"]}\n\nAnswer:'
        )
        client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])
        t0 = time.time()
        response = client.messages.create(
            model='claude-sonnet-4-6',
            max_tokens=1024,
            messages=[{'role': 'user', 'content': prompt}],
        )
        return {
            'answer': response.content[0].text,
            'prompt_sent': prompt,
            'prompt_tokens': response.usage.input_tokens,
            'completion_tokens': response.usage.output_tokens,
            'total_tokens': response.usage.input_tokens + response.usage.output_tokens,
            'latency_ms': round((time.time() - t0) * 1000, 1),
        }

    graph = StateGraph(RAGState)
    graph.add_node('retrieve_vector', retrieve_vector_fn)
    graph.add_node('retrieve_bm25', retrieve_bm25_fn)
    graph.add_node('fuse_results', fuse_results_fn)
    graph.add_node('generate_answer', generate_answer_fn)

    graph.add_edge(START, 'retrieve_vector')
    graph.add_edge('retrieve_vector', 'retrieve_bm25')
    graph.add_edge('retrieve_bm25', 'fuse_results')
    graph.add_edge('fuse_results', 'generate_answer')
    graph.add_edge('generate_answer', END)
    return graph.compile()
Design note: build_graph() is called fresh on every query, not once at startup. This is intentional. The factory captures the current top_k and retrieval_mode values through the closure, so changing either control immediately takes effect on the next query without any cache invalidation logic.
The app uses a two-column layout. The left column handles document management and configuration. The right column is the chat interface.
with left_col:
    st.header('📄 Documents')
    uploaded_files = st.file_uploader(
        'Upload PDF(s)',
        type='pdf',
        accept_multiple_files=True,
        label_visibility='collapsed',
    )
    if uploaded_files:
        # Re-index only when the set of uploaded files has changed.
        uploaded_names = {f.name for f in uploaded_files}
        indexed_names = {m['filename'] for m in st.session_state.file_metadata}
        if uploaded_names != indexed_names:
            with st.spinner('Indexing PDFs...'):
                parse_and_index(uploaded_files, st.session_state)

    retrieval_mode = st.selectbox(
        'Retrieval Type',
        options=['Both', 'Vector', 'BM25'],
        index=0,
    )
    top_k = st.slider('Top K Chunks', min_value=3, max_value=10, value=5)
Let’s run the app:
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
streamlit run app.py
Streamlit opens http://localhost:8501 in your browser automatically.
Note: For this experiment we use the PDF of the RBI Governor’s Statement: December 05, 2025 [Link].
This is where the app becomes genuinely useful for experimentation. You can switch modes mid-session and see exactly how the retrieved chunks change for the same query.
In Vector mode, retrieve_bm25_fn returns an empty list immediately without touching the BM25 index. All retrieved chunks are labelled Vector in the Logs tab and highlighted in blue.
Best for: Questions that require semantic understanding. Examples: ‘What is the overall financial health of the company?’ or ‘Summarise the methodology used in section 3.’
In BM25 mode, retrieve_vector_fn returns an empty list immediately. All retrieved chunks are labelled BM25 and highlighted in amber.
Best for: Questions with specific terminology, product codes, error codes, financial identifiers, or named entities. Example: ‘What was the CRAR?’
Screenshot: ‘BM25’ selected. In the Logs tab, all rows are highlighted amber with ‘Found By: BM25’. The BM25 Score column shows values like 4.2, 3.8, 2.1, and Vector Score is 0.0 for all rows.
In Both mode, both retrievers run in full, their top-k lists are merged, and RRF re-ranks the union. Chunks that appear in both lists get a higher RRF score than chunks from either list alone.
Best for: Most real-world queries. A question like ‘What is the status of MGNREGA demand in oct-nov??’ has both a semantic component and an exact-match component.
Every response in the chat history has two tabs: Answer and Logs. The Logs tab gives you complete visibility into what happened:
Retrieval Mode badge (🟢 Both / 🔵 Vector / 🟠 BM25)
    ↓
Top K Chunks table
    Rank | Chunk Preview | Page | Vector Score | BM25 Score | RRF Score | Found By
    (colour-coded: green=Both, blue=Vector, amber=BM25)
    ↓
Prompt Sent to LLM (full text in a code block)
    ↓
Token Usage metrics
    Input Tokens | Output Tokens | Total Tokens
    ↓
Latency
    LLM Call Time in ms
Note: When an answer is wrong, the first place to look is always the retrieved chunks, not the LLM prompt. If the right content is not in the context window, no amount of prompt engineering will fix the answer.
The best way to understand why hybrid retrieval matters is to break each mode deliberately. The following four queries were run against the RBI Governor’s Statement (December 2025), a policy document packed with both structured identifiers and descriptive economic prose.
Query (Vector mode): What does this number indicate 2025–2026/1634?
2025–2026/1634 is a circular reference number. It has no semantic neighbourhood in embedding space, because the model never saw this string in a meaningful context during pre-training.
Result: The retriever returns chunks about monetary policy and interest rates, semantically close but none contain the reference number. The LLM correctly admits it cannot find the answer.
Query (BM25 mode): Are people spending more in cities compared to villages?
A paraphrased question about urban versus rural consumption trends. The document uses terms such as ‘urban demand’ and ‘rural consumption’, and none of those words appear in the query.
Result: BM25 scores near zero for every chunk and surfaces unrelated content. ‘cities’ and ‘villages’ are absent from the document.
Query (Both mode): What does this number indicate 2025–2026/1634? (same as Query 1)
BM25 scores the chunk containing 2025–2026/1634 at the top of its list. RRF fusion places it high enough to enter the context window passed to the LLM.
Result: Specific, accurate answer. The reference is identified correctly.
Query (Both mode): Are people spending more in cities compared to villages? (same as Query 2)
Vector search handles the semantic intent. BM25 contributes near-zero scores, but the vector results alone are sufficient.
Result: Substantive answer about urban versus rural consumption trends, citing specific data points from the document.
Performance note: Running both retrievers costs you one extra call to bm25_index.get_scores() which is a pure CPU operation that takes under 5 ms on a 200-page document. The fusion step is a handful of dictionary lookups. The price for covering both failure modes is essentially zero.
The chunk size of 200 tokens with a 50-token overlap is a deliberate middle ground. Too small (under 100 tokens) and each chunk lacks enough context for the LLM to generate a coherent answer. Too large (over 500 tokens) and embeddings have less resolution and BM25 scores become diluted.
The RRF constant k = 60 comes directly from Cormack, Clarke, and Buettcher (2009). Lower values (like 10) make the top rank matter more; higher values (like 100) flatten the distribution. For document Q&A on professional PDFs, 60 is a solid default.
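A quick calculation shows how the constant changes the weight of the top rank. The helper `top_rank_advantage` is a hypothetical name; it compares the RRF contribution of a chunk at rank 1 with that of a chunk at rank 10 in the same list.

```python
def top_rank_advantage(k, low_rank=10):
    """Ratio of the RRF contribution at rank 1 to the contribution at low_rank."""
    return (1 / (k + 1)) / (1 / (k + low_rank))


# Compare the three values of k mentioned above.
advantages = {k: round(top_rank_advantage(k), 3) for k in (10, 60, 100)}
```

The ratios come out at roughly 1.82 for k = 10, 1.15 for k = 60 and 1.09 for k = 100. With a small k, being first in one list is worth almost twice as much as being tenth, while with a large k the positions count nearly the same and agreement across lists decides the order.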
A few directions worth exploring from here:
We have built a complete hybrid RAG system that combines FAISS semantic search and BM25 keyword search, fuses their results with Reciprocal Rank Fusion, and routes everything through a LangGraph pipeline to Claude for answer generation. The Streamlit UI gives you real-time control over retrieval mode and full transparency into every chunk, score, token count, and prompt.
The key insight is that retrieval is not a solved problem, and the right approach depends on your query type. Vector-only search handles semantic questions well. BM25 handles exact matches well. Hybrid handles most real queries better than either alone, and the RRF scores in the Logs tab give you the evidence to understand why.
The codebase is deliberately minimal: 11 files, no LangChain abstractions, and every retrieval call is a raw library function you can read in one screen. That makes it straightforward to swap in a different embedding model, add a reranker, or replace FAISS with a hosted vector database as your needs grow.
Thank you for reading the article.
Build a Hybrid RAG System with FAISS, BM25, LangGraph and Claude Sonnet Model was originally published in Towards AI on Medium.