Build a Hybrid RAG System with FAISS, BM25, LangGraph and Claude Sonnet Model

A developer built a Hybrid RAG system combining FAISS for dense vector search and BM25 for keyword search, fused via Reciprocal Rank Fusion, orchestrated with LangGraph and Claude Sonnet, and deployed with a Streamlit UI. The system addresses the failure modes of pure semantic or keyword retrieval by merging both approaches for better accuracy on mixed document types like legal contracts and technical manuals.

With the rapid advancement of Large Language Models and vector embeddings, Retrieval-Augmented Generation (RAG) has become the go-to solution for querying unstructured documents. Upload a PDF, ask a question, get an answer. It feels like magic. But sometimes, it is not enough.

The silent failure mode of most RAG systems is not the LLM. It is the retrieval step. Dense vector search is powerful at finding semantically similar text. It understands that “urban spending” and “city expenditure” mean the same thing. But ask it for a specific error code, a contract clause number, or a precise financial figure, and it can silently return the wrong chunks with high confidence.

Keyword search like BM25, on the other hand, is strong on exact matches. But it has no concept of meaning. “Automobile” and “car” are completely different strings to it, and a paraphrased question can leave it lost.

The uncomfortable truth is that neither retriever is universally better. Each dominates on a different class of queries. And real-world documents like legal contracts, financial reports, and technical manuals attract both kinds of query.

Hybrid RAG solves this by running both retrievers on every query and fusing their results using Reciprocal Rank Fusion (RRF). You get the semantic understanding of vector search and the precision of keyword search, in a single ranked list, at near-zero extra cost.

In this article, we will build a complete Hybrid RAG system from scratch: FAISS for dense search, BM25 for keyword search, Reciprocal Rank Fusion to merge the two ranked lists into a single, better-ranked result, LangGraph for orchestration, and a Streamlit UI where you can toggle between retrieval modes and inspect every chunk and score behind each answer.
The complete end-to-end code is available in my GitHub repo: agentic-ai-usecases/beginner/hybrid-rag at main · alphaiterations/agentic-ai-usecases https://github.com/alphaiterations/agentic-ai-usecases/tree/main/beginner/hybrid-rag

Before jumping into code, it helps to understand why hybrid retrieval matters.

Dense vector search converts text into high-dimensional embeddings and finds the nearest neighbours by cosine similarity. It excels at paraphrasing: ‘What is the profit margin?’ finds chunks that say ‘net income as a percentage of revenue’ even though none of those words overlap with the query. But it can silently skip a chunk that contains ERR 4021 because that token was rare in training data and sits in an odd region of the embedding space.

BM25 (Best Match 25) is a classical information retrieval algorithm based on term frequency and inverse document frequency. It scores documents based on how many query words appear in them and how rare those words are across the whole corpus. It is strong on exact matches, part numbers, named entities, and specific terminology. The weakness is that it has no semantic understanding at all, so ‘automobile’ and ‘car’ are completely different words to BM25.

Hybrid retrieval combines both signals. The merged ranked list tends to surface chunks that are simultaneously semantically relevant and lexically relevant, which is exactly what you want when your document contains a mix of technical terms and descriptive prose.

The question is: how do we decide which chunk to prioritize? RRF is the answer. RRF is a rank-based merging algorithm that combines multiple ranked lists into a single, unified ranking without caring about the raw score values from any individual retriever. Instead of asking “which chunk scored highest overall?”, it asks “which chunk appeared near the top of the most lists?”

The formula is simple:

    RRF_score(d) = Σ 1 / (k + rank(d, list))

where k is a smoothing constant (typically 60) and rank(d, list) is the 1-indexed position of chunk d in a given retriever’s result list. The sum runs over every retriever that returned the chunk.

Because it is purely rank-based, RRF is especially well suited to hybrid retrieval. In practice, this means: when both retrievers agree on a chunk, it rises to the top. When only one retriever surfaces it, it still gets credit, but not enough to dominate if another chunk had broader support.

Here is the full architecture of what we are going to build:

[Figure: architecture diagram]

Architecture note: a key design decision is that the FAISS and BM25 indexes live in Streamlit session state, not inside LangGraph state. LangGraph state is best kept serialisable, and FAISS index objects are not trivially so. The nodes access the indexes through closures, keeping the graph state clean.

For the LLM, we are going to use the Claude Sonnet 4.6 API.

The project is laid out as follows:

    hybrid-rag/
    ├── app.py                      # Streamlit two-column UI
    ├── graph.py                    # LangGraph StateGraph + indexing helper
    ├── retriever/
    │   ├── vector_retriever.py     # FAISS cosine search
    │   ├── bm25_retriever.py       # BM25 keyword search
    │   └── fusion.py               # RRF fusion
    ├── indexer/
    │   └── pdf_indexer.py          # PyMuPDF extraction + chunker + index builders
    ├── monitoring/
    │   └── chunk_monitor.py        # Last-5-query history tracker
    ├── .env                        # Your API key goes here
    └── requirements.txt

Create the project and a virtual environment, then install the dependencies:

    # macOS/Linux
    mkdir hybrid-rag && cd hybrid-rag
    python3.11 -m venv .venv
    source .venv/bin/activate

    # Windows
    python3.11 -m venv .venv
    .venv\Scripts\activate

    pip install -r requirements.txt

requirements.txt:

    anthropic==0.104.1
    langgraph==1.2.1
    faiss-cpu==1.14.2
    sentence-transformers==5.5.1
    rank_bm25==0.2.2
    PyMuPDF==1.27.2.3
    streamlit==1.57.0
    pandas==3.0.3
    numpy==2.4.6
    python-dotenv==1.2.2

.env:

    ANTHROPIC_API_KEY=sk-ant-your-key-here

Note:
sentence-transformers pulls in PyTorch as a dependency. The first install will download around 2 GB. Subsequent runs load from cache.

The indexer is the foundation of the whole pipeline. It reads raw PDF bytes, extracts text page by page using PyMuPDF, and then cuts the flat token stream into overlapping windows.

    # indexer/pdf_indexer.py
    import fitz  # PyMuPDF

    def extract_pdf(pdf_bytes: bytes) -> list[tuple[int, str]]:
        doc = fitz.open(stream=pdf_bytes, filetype='pdf')
        pages = []
        for page_num in range(len(doc)):
            text = doc[page_num].get_text('text')
            if text.strip():
                pages.append((page_num + 1, text))  # 1-indexed page numbers
        doc.close()
        return pages

    def chunk_text(pages, chunk_size=200, overlap=50):
        all_tokens = []
        token_pages = []
        for page_num, text in pages:
            tokens = text.split()
            all_tokens.extend(tokens)
            token_pages.extend([page_num] * len(tokens))
        step = chunk_size - overlap  # stride = 150 tokens
        chunks, chunk_pages = [], []
        i = 0
        while i < len(all_tokens):
            window_tokens = all_tokens[i : i + chunk_size]
            chunks.append(' '.join(window_tokens))
            chunk_pages.append(token_pages[i])  # page of the chunk's first token
            if len(window_tokens) < chunk_size:
                break
            i += step
        return chunks, chunk_pages

Note: Why overlapping chunks? Without overlap, a sentence that spans a chunk boundary gets split in two, and neither half carries full context. A 50-token overlap means each chunk shares its last 50 tokens with the next chunk’s first 50, so key sentences near boundaries appear in two chunks and have a higher chance of being retrieved.
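To make the overlap concrete, here is a minimal, self-contained sketch of the same sliding-window idea on synthetic tokens. The helper name `sliding_windows` and the 500-token input are illustrative assumptions, not part of the project code; the defaults mirror the 200/50 settings described above.

```python
def sliding_windows(tokens, chunk_size=200, overlap=50):
    """Hypothetical helper: split a token list into overlapping windows."""
    step = chunk_size - overlap  # 150-token stride with the defaults
    windows = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + chunk_size]
        windows.append(window)
        if len(window) < chunk_size:  # last, shorter window reached
            break
        i += step
    return windows


# Synthetic "document" of 500 tokens: t0, t1, ..., t499
tokens = [f"t{n}" for n in range(500)]
windows = sliding_windows(tokens)

sizes = [len(w) for w in windows]      # [200, 200, 200, 50]
starts = [w[0] for w in windows]       # ['t0', 't150', 't300', 't450']
# The tail of one window is the head of the next one.
shared = windows[0][-50:] == windows[1][:50]
```

Each window starts 150 tokens after the previous one, so any text within 50 tokens of a boundary is present in both neighbouring chunks.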
The same module builds the two indexes. First, the FAISS index:

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    _model = None

    def get_model():
        global _model
        if _model is None:
            _model = SentenceTransformer('all-MiniLM-L6-v2')
        return _model

    def build_faiss_index(chunks):
        model = get_model()
        embeddings = model.encode(
            chunks,
            normalize_embeddings=True,  # critical for cosine similarity
            show_progress_bar=False,
            batch_size=64,
        )
        embeddings = np.array(embeddings, dtype='float32')
        dim = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
        index = faiss.IndexFlatIP(dim)
        index.add(embeddings)
        return index

One thing to pay attention to: IndexFlatIP computes the inner product (dot product). When you use normalize_embeddings=True, all vectors sit on the unit sphere and the inner product equals cosine similarity. This is slightly faster than computing cosine explicitly and gives you the same ranking.

Then the BM25 index:

    from rank_bm25 import BM25Okapi

    def build_bm25_index(chunks):
        tokenized = [chunk.lower().split() for chunk in chunks]
        return BM25Okapi(tokenized)

Note: Lowercase tokenisation here must match the tokenisation at query time. BM25Okapi compares tokens as exact strings, so it is case-sensitive when you tokenise with .split(). Both the index build and the query must apply .lower(), or terms will not match.

The chunk_text function produces a single (chunks, chunk_pages) tuple, and the same chunks list is passed to both build_faiss_index and build_bm25_index. Both indexes are position-aligned: the chunk at index i in the FAISS index is the identical string as the chunk at index i in the BM25 corpus. This alignment is what makes RRF fusion possible.
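The earlier point about IndexFlatIP and normalised embeddings can be checked by hand. The sketch below uses plain Python rather than FAISS, and the two 3-dimensional vectors are made-up values for illustration only. It shows that the dot product of L2-normalised vectors equals the cosine similarity of the raw vectors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed explicitly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings" (illustrative values, not real model output)
query = [3.0, 4.0, 0.0]
doc = [1.0, 2.0, 2.0]

cos_raw = cosine(query, doc)
ip_normalized = dot(l2_normalize(query), l2_normalize(doc))
```

Both values come out to 11/15, which is why an inner-product index over normalised vectors ranks results exactly as a cosine index would.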
The vector retriever encodes the query with the same model used at index time, then runs a nearest-neighbour search:

    # retriever/vector_retriever.py
    import numpy as np
    from indexer.pdf_indexer import get_model  # shared singleton

    def retrieve(query, faiss_index, chunks, chunk_pages, k=5):
        model = get_model()
        query_embedding = model.encode([query], normalize_embeddings=True)
        query_embedding = np.array(query_embedding, dtype='float32')
        actual_k = min(k, len(chunks))
        scores, indices = faiss_index.search(query_embedding, actual_k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx == -1:  # FAISS padding when index has fewer than k vectors
                continue
            results.append((chunks[idx], float(score), chunk_pages[idx]))
        return results  # [(chunk_text, cosine_score, page_num), ...]

Notice that the retriever imports get_model from the indexer module rather than creating a new SentenceTransformer instance. Loading all-MiniLM-L6-v2 takes about 2 seconds and 90 MB of memory. By sharing the singleton, you pay that cost exactly once per session.

The BM25 retriever is simpler: tokenise the query, ask the index to score all chunks, and return the top-k:

    # retriever/bm25_retriever.py
    import numpy as np
    from rank_bm25 import BM25Okapi

    def retrieve(query, bm25_index, chunks, chunk_pages, k=5):
        tokenized_query = query.lower().split()
        scores = bm25_index.get_scores(tokenized_query)
        actual_k = min(k, len(chunks))
        top_indices = np.argsort(scores)[::-1][:actual_k]
        results = []
        for idx in top_indices:
            results.append((chunks[idx], float(scores[idx]), chunk_pages[idx]))
        return results  # [(chunk_text, bm25_score, page_num), ...]

RRF fusion is the heart of the hybrid system. RRF does not care about the absolute score values from either retriever. Instead, it uses the rank position of each chunk in each list. The formula is:

    RRF_score(d) = Σ 1 / (k + rank(d)), where k = 60

The constant 60 prevents top-ranked chunks from dominating too heavily when two lists disagree.
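To get a feel for what the smoothing constant does, here is a small sketch that compares how much more a rank-1 hit contributes than a rank-10 hit for different k values. The helper names `rrf_term` and `top_gap_ratio` are illustrative assumptions, not functions from the project.

```python
def rrf_term(rank, k=60):
    """Contribution of a single retriever to a chunk's RRF score."""
    return 1.0 / (k + rank)


def top_gap_ratio(k):
    """How many times larger the rank-1 contribution is than the rank-10 one."""
    return rrf_term(1, k) / rrf_term(10, k)


# Smaller k exaggerates the advantage of the top rank; larger k flattens it.
ratios = {k: round(top_gap_ratio(k), 3) for k in (10, 60, 100)}
# ratios == {10: 1.818, 60: 1.148, 100: 1.089}
```

With k = 60 the first-ranked chunk contributes only about 15% more than the tenth-ranked one, so a chunk needs support from both lists to pull clearly ahead.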
The value k = 60 comes from Cormack, Clarke, and Buettcher (2009) and was chosen empirically across TREC benchmarks.

    # retriever/fusion.py
    def reciprocal_rank_fusion(vector_results, bm25_results, rrf_k=60):
        # chunk text -> (raw score, page) for each retriever
        vector_map = {chunk: (score, page) for chunk, score, page in vector_results}
        bm25_map = {chunk: (score, page) for chunk, score, page in bm25_results}
        # chunk text -> 1-indexed rank in each list
        vector_ranks = {chunk: rank + 1 for rank, (chunk, _, _) in enumerate(vector_results)}
        bm25_ranks = {chunk: rank + 1 for rank, (chunk, _, _) in enumerate(bm25_results)}
        # union of chunks, de-duplicated while preserving order
        all_chunks = list(dict.fromkeys(
            [c for c, _, _ in vector_results] + [c for c, _, _ in bm25_results]
        ))
        fused = []
        for chunk in all_chunks:
            rrf_score = 0.0
            if chunk in vector_ranks:
                rrf_score += 1.0 / (rrf_k + vector_ranks[chunk])
            if chunk in bm25_ranks:
                rrf_score += 1.0 / (rrf_k + bm25_ranks[chunk])
            v_score = vector_map[chunk][0] if chunk in vector_map else 0.0
            b_score = bm25_map[chunk][0] if chunk in bm25_map else 0.0
            page_num = (vector_map.get(chunk) or bm25_map.get(chunk))[1]
            found_by = ('Both' if chunk in vector_map and chunk in bm25_map
                        else 'Vector' if chunk in vector_map else 'BM25')
            fused.append((chunk, rrf_score, v_score, b_score, page_num, found_by))
        fused.sort(key=lambda x: x[1], reverse=True)
        return fused

Imagine chunk A is ranked 1st by vector search (score 0.92) and 3rd by BM25. Chunk B is ranked 2nd by vector search and 1st by BM25. RRF gives:

    RRF(A) = 1/(60+1) + 1/(60+3) = 0.01639 + 0.01587 ≈ 0.03227
    RRF(B) = 1/(60+2) + 1/(60+1) = 0.01613 + 0.01639 ≈ 0.03252

Chunk B wins because its combined rank positions across the two lists are better, even though chunk A had the higher raw cosine score. This cross-list agreement signal is exactly what you want.

LangGraph lets you model the retrieval pipeline as a directed graph of stateful nodes. Each node receives the full state dict, does its work, and returns a partial update that LangGraph merges back.
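The "partial update" idea can be pictured with a plain-Python sketch. This is not LangGraph's implementation. It only mimics the default behaviour for state keys that have no custom reducer, where a returned key overwrites the old value and untouched keys are carried over. The names `apply_update` and `fake_retrieve_node` are hypothetical.

```python
def apply_update(state: dict, update: dict) -> dict:
    """Overlay a node's partial update on the state (last write wins)."""
    return {**state, **update}


def fake_retrieve_node(state: dict) -> dict:
    # Hypothetical node: reads the query, returns only the key it owns.
    return {"vector_results": [("chunk about " + state["query"], 0.9, 1)]}


state = {"query": "inflation", "vector_results": [], "answer": ""}
state = apply_update(state, fake_retrieve_node(state))
```

After the call, `query` and `answer` are unchanged and only `vector_results` has been replaced, which is why each node in the pipeline can return just the fields it is responsible for.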
    # graph.py
    from typing import TypedDict

    class RAGState(TypedDict):
        pdf_text: list[str]
        query: str
        vector_results: list[tuple]
        bm25_results: list[tuple]
        fused_chunks: list[tuple]
        answer: str
        prompt_sent: str
        prompt_tokens: int
        completion_tokens: int
        total_tokens: int
        latency_ms: float

The graph itself is built by a factory function:

    from langgraph.graph import StateGraph, START, END

    def build_graph(session_state, top_k=5, retrieval_mode='Both'):

        def retrieve_vector_fn(state: RAGState) -> dict:
            if retrieval_mode == 'BM25':
                return {'vector_results': []}
            from retriever.vector_retriever import retrieve
            return {'vector_results': retrieve(
                state['query'],
                session_state['faiss_index'],
                session_state['chunks'],
                session_state['chunk_pages'],
                k=top_k,
            )}

        def retrieve_bm25_fn(state: RAGState) -> dict:
            if retrieval_mode == 'Vector':
                return {'bm25_results': []}
            from retriever.bm25_retriever import retrieve
            return {'bm25_results': retrieve(
                state['query'],
                session_state['bm25_index'],
                session_state['chunks'],
                session_state['chunk_pages'],
                k=top_k,
            )}

        def fuse_results_fn(state: RAGState) -> dict:
            from retriever.fusion import reciprocal_rank_fusion
            return {'fused_chunks': reciprocal_rank_fusion(
                state['vector_results'],
                state['bm25_results'],
                rrf_k=60,
            )}

        def generate_answer_fn(state: RAGState) -> dict:
            import anthropic, os, time
            top_chunks = state['fused_chunks'][:top_k]
            context = '\n\n---\n\n'.join(
                f'[Page {c[4]}]\n{c[0]}' for c in top_chunks
            )
            prompt = (
                'You are a helpful assistant. Answer the question using ONLY '
                'the provided context. If the context does not contain enough '
                'information to answer, say so clearly.\n\n'
                f'Context:\n{context}\n\nQuestion: {state["query"]}\n\nAnswer:'
            )
            client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])
            t0 = time.time()
            response = client.messages.create(
                model='claude-sonnet-4-6',
                max_tokens=1024,
                messages=[{'role': 'user', 'content': prompt}],
            )
            return {
                'answer': response.content[0].text,
                'prompt_sent': prompt,
                'prompt_tokens': response.usage.input_tokens,
                'completion_tokens': response.usage.output_tokens,
                'total_tokens': response.usage.input_tokens + response.usage.output_tokens,
                'latency_ms': round((time.time() - t0) * 1000, 1),
            }

        graph = StateGraph(RAGState)
        graph.add_node('retrieve_vector', retrieve_vector_fn)
        graph.add_node('retrieve_bm25', retrieve_bm25_fn)
        graph.add_node('fuse_results', fuse_results_fn)
        graph.add_node('generate_answer', generate_answer_fn)
        graph.add_edge(START, 'retrieve_vector')
        graph.add_edge('retrieve_vector', 'retrieve_bm25')
        graph.add_edge('retrieve_bm25', 'fuse_results')
        graph.add_edge('fuse_results', 'generate_answer')
        graph.add_edge('generate_answer', END)
        return graph.compile()

Design note: build_graph is called fresh on every query, not once at startup. This is intentional. The factory captures the current top_k and retrieval_mode values through the closure, so changing either control immediately takes effect on the next query without any cache invalidation logic.

The app uses a two-column layout. The left column handles document management and configuration. The right column is the chat interface.
    # app.py: layout setup
    import streamlit as st

    st.set_page_config(page_title='Hybrid RAG', page_icon='🔍', layout='wide')
    left_col, right_col = st.columns([1, 2], gap='large')

    with left_col:
        st.header('📄 Documents')
        uploaded_files = st.file_uploader(
            'Upload PDF(s)',
            type='pdf',
            accept_multiple_files=True,
            label_visibility='collapsed',
        )
        if uploaded_files:
            uploaded_names = {f.name for f in uploaded_files}
            indexed_names = {m['filename'] for m in st.session_state.file_metadata}
            # Re-index only when the set of uploaded files has changed
            if uploaded_names != indexed_names:
                with st.spinner('Indexing PDFs...'):
                    parse_and_index(uploaded_files, st.session_state)  # indexing helper, not shown here
        retrieval_mode = st.selectbox(
            'Retrieval Type',
            options=['Both', 'Vector', 'BM25'],
            index=0,
        )
        top_k = st.slider('Top K Chunks', min_value=3, max_value=10, value=5)

Let’s run the app:

    source .venv/bin/activate     # macOS/Linux
    .venv\Scripts\activate        # Windows
    streamlit run app.py

Streamlit opens http://localhost:8501 in your browser automatically.

Note: For our experiment we are using the Governor’s Statement of December 05, 2025 (PDF): https://www.fidcindia.org.in/wp-content/uploads/2025/12/Governors-Statement-December-05-2025.pdf

The retrieval mode selector is where the app becomes genuinely useful for experimentation. You can switch modes mid-session and see exactly how the retrieved chunks change for the same query.

Vector mode: retrieve_bm25_fn returns an empty list immediately without touching the BM25 index. All retrieved chunks are labelled Vector in the Logs tab and highlighted in blue. Best for: questions that require semantic understanding. Examples: ‘What is the overall financial health of the company?’ or ‘Summarise the methodology used in section 3.’

BM25 mode: retrieve_vector_fn returns an empty list immediately. All retrieved chunks are labelled BM25 and highlighted in amber. Best for: questions with specific terminology, product codes, error codes, financial identifiers, or named entities. Example: ‘What was the CRAR?’

Screenshot: ‘BM25’ selected.
Logs tab: all rows highlighted amber, ‘Found By: BM25’. The BM25 Score column shows values like 4.2, 3.8, 2.1. Vector Score = 0.0 for all rows.

Both mode: both retrievers run in full, their top-k lists are merged, and RRF re-ranks the union. Chunks that appear in both lists get a higher RRF score than chunks from either list alone. Best for: most real-world queries. A question like ‘What is the status of MGNREGA demand in oct-nov??’ has both a semantic component and an exact-match component.

Every response in the chat history has two tabs: Answer and Logs. The Logs tab gives you complete visibility into what happened, from top to bottom:

1. Retrieval Mode badge: 🟢 Both / 🔵 Vector / 🟠 BM25
2. Top K Chunks table: Rank | Chunk Preview | Page | Vector Score | BM25 Score | RRF Score | Found By (colour-coded: green = Both, blue = Vector, amber = BM25)
3. Prompt Sent to LLM: full text in a code block
4. Token Usage metrics: Input Tokens | Output Tokens | Total Tokens
5. Latency: LLM call time in ms

Note: When an answer is wrong, the first place to look is always the retrieved chunks, not the LLM prompt. If the right content is not in the context window, no amount of prompt engineering will fix the answer.

The best way to understand why hybrid retrieval matters is to break each mode deliberately. The following four queries were run against the RBI Governor’s Statement (December 2025), a policy document packed with both structured identifiers and descriptive economic prose.

Query 1 (Vector mode): What does this number indicate 2025–2026/1634?
2025–2026/1634 is a circular reference number. It has no semantic neighbourhood in embedding space, because the model has never seen this string in a meaningful context during pre-training.
Result: The retriever returns chunks about monetary policy and interest rates. They are semantically close, but none contains the reference number. The LLM correctly admits it cannot find the answer.

Query 2 (BM25 mode): Are people spending more in cities compared to villages?
A paraphrased question about urban versus rural consumption trends.
The document uses ‘urban demand’ and ‘rural consumption’; none of those words appear in the query.
Result: BM25 scores near zero for every chunk and surfaces unrelated content, because ‘cities’ and ‘villages’ are absent from the document.

Query 3 (Both mode): What does this number indicate 2025–2026/1634? (same as Query 1)
BM25 scores the chunk containing 2025–2026/1634 at the top of its list. RRF fusion places it high enough to enter the context window passed to the LLM.
Result: Specific, accurate answer. The reference is identified correctly.

Query 4 (Both mode): Are people spending more in cities compared to villages? (same as Query 2)
Vector search handles the semantic intent. BM25 contributes near-zero scores, but the vector results alone are sufficient.
Result: Substantive answer about urban versus rural consumption trends, citing specific data points from the document.

Performance note: Running both retrievers costs you one extra call to bm25_index.get_scores, which is a pure CPU operation that takes under 5 ms on a 200-page document. The fusion step is a handful of dictionary lookups. The price for covering both failure modes is essentially zero.

The chunk size of 200 tokens is a deliberate middle ground. Too small (under 100 tokens) and each chunk lacks enough context for the LLM to generate a coherent answer. Too large (over 500 tokens) and embeddings have less resolution and BM25 scores become diluted.

The RRF constant k = 60 comes directly from Cormack, Clarke, and Buettcher (2009). Lower values like 10 make the top rank matter more; higher values like 100 flatten the distribution. For document Q&A on professional PDFs, 60 is a solid default.

We have built a complete hybrid RAG system that combines FAISS semantic search and BM25 keyword search, fuses their results with Reciprocal Rank Fusion, and routes everything through a LangGraph pipeline to Claude for answer generation.
The Streamlit UI gives you real-time control over retrieval mode and full transparency into every chunk, score, token count, and prompt.

The key insight is that retrieval is not a solved problem, and the right approach depends on your query type. Vector-only search handles semantic questions well. BM25 handles exact matches well. Hybrid handles most real queries better than either alone, and the RRF scores in the Logs tab give you the evidence to understand why.

The codebase is deliberately minimal: 11 files, no LangChain abstractions, and every retrieval call is a raw library function you can read in one screen. That makes it straightforward to swap in a different embedding model, add a reranker, or replace FAISS with a hosted vector database as your needs grow.

Thank you for reading the article. Agentic AI is complex and chaotic, but getting started doesn’t have to be. I focus on making that first step simpler for you. Follow along at https://medium.com/@alphaiterations for regular updates and more such articles. Feel free to connect on LinkedIn https://www.linkedin.com/in/jainvijendra/ if you’re on a similar path.

Build a Hybrid RAG System with FAISS, BM25, LangGraph and Claude Sonnet Model https://pub.towardsai.net/build-a-hybrid-rag-system-with-faiss-bm25-langgraph-and-claude-sonnet-model-39ba3c6755bc was originally published in Towards AI https://pub.towardsai.net on Medium.