Building a RAG Pipeline From Scratch

A developer built a production-grade RAG pipeline that achieves 95.6% retrieval accuracy and 99/100 answer quality by fusing BM25, TF-IDF, and dense vectors with weighted Reciprocal Rank Fusion, addressing the exact-match failures of vector-only systems.

Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question: "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.

This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate, both shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.

## The Core Fix

The single biggest lever is not better embeddings; it's fusing retrieval signals that fail differently. BM25 handles the *what* (exact terms, rare-token weighting); TF-IDF cosine handles the *about* (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base: the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.

If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.
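To make the fusion idea concrete, here is a minimal sketch of weighted RRF on two toy rankings. It is not taken from the running system: the function name `fuseRankings`, the document IDs, and the two orderings are invented for illustration, and only `k = 60` and the 1.2 / 1.0 weights come from this post.

```javascript
// Toy weighted Reciprocal Rank Fusion. IDs and rankings are made up;
// k = 60 and the 1.2 / 1.0 weights mirror the values quoted in this post.
function fuseRankings(rankedIdLists, weights, k = 60) {
  const scores = new Map();
  rankedIdLists.forEach((ids, listIndex) => {
    ids.forEach((id, position) => {
      // position is 0-based, so the top hit contributes weight / (k + 1).
      const contribution = weights[listIndex] / (k + position + 1);
      scores.set(id, (scores.get(id) || 0) + contribution);
    });
  });
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Two rankings that disagree about which document is best.
const bm25Order = ["doc-bm25-guide", "doc-tfidf-notes", "doc-relevance-overview"];
const tfidfOrder = ["doc-relevance-overview", "doc-bm25-guide", "doc-tfidf-notes"];

const fused = fuseRankings([bm25Order, tfidfOrder], [1.2, 1.0]);
console.log(fused.map(([id, score]) => `${id}: ${score.toFixed(5)}`));
```

In this toy case `doc-bm25-guide` ends up first because it is ranked first by BM25 and second by TF-IDF. Only rank positions enter the formula, so the two scorers never need to share a score scale or a similarity threshold.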
## Architecture

```
query
  │
  ▼
smart-retrieval.js   (intent detection + multi-angle expansion)
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring ── list 1
  ├── TF-IDF cosine ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js   (build context + system prompt → LLM: Claude / local Ollama)
  │
  ▼
verify-answer.js   (strip fabricated quotes + banned phrases)
  │
  ▼
streamed answer
```

## Retrieval: BM25 + TF-IDF + RRF

BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:

```js
// K1 and B are the BM25 tuning constants, defined elsewhere in the module
// (their values are not shown here; 1.2 and 0.75 are the conventional defaults).
function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // Rare terms get a larger IDF; the +1 keeps it positive.
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // Saturating TF, normalized by document length relative to the average.
    const tfNorm =
      (termTf * (K1 + 1)) /
      (termTf + K1 * (1 - B + B * (doc.docLength / avgDl)));
    score += idf * tfNorm;
  }
  return score;
}
```

TF-IDF cosine is the second signal.
It captures distributional similarity that BM25's term-at-a-time scoring misses:

```js
function tfidfCosine(queryTokens, doc, df, totalDocs) {
  // Raw term counts for the query.
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;

  let dotProduct = 0, queryMag = 0, docMag = 0;
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / termDf + 1);
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  // The document magnitude covers every term in the document, not just query terms.
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / termDf + 1);
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}
```

The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant k=60 is the standard damping value: it stops rank-1 from utterly dominating rank-2:

```js
const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      // rank is 0-based, so the top hit contributes w / (k + 1).
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
```

Wiring it together, with BM25 weighted 1.2 and TF-IDF 1.0:

```js
const bm25Ranked = docs
  .map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
  .sort((a, b) => b.score - a.score);

const tfidfRanked = docs
  .map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
  .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);
```

Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k. Diversity of evidence beats depth from one channel.

## The Quality Gate

Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.

The verifier's most important check is quote fidelity. Any blockquote is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio. Quotes that aren't actually in the sources are replaced with a "fabricated quote removed" marker and logged. The full set of checks:

- **Quote fidelity**: blockquotes are fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications are stripped and logged.
- **Invalid source refs**: a "Source N" reference where N exceeds the retrieved count is removed.
- **Banned phrases**: "production-ready", "blazing fast", "world-class", "best-in-class" and friends are flagged; cheerleading is a regression, not a flourish.
- **Emoji headers and "Keep exploring" footers**: auto-stripped.
- **Structural compliance**: deep answers must lead with one root cause before any diagram or table.

The gate runs automatically on proxy restart and as a git pre-push hook on guarded files. A change that drops the score below 90 does not ship.

## The Numbers

These are measured, not aspirational. They are generated from the live corpus and the latest eval reports:

| Metric | Value | Source |
|---|---|---|
| Chunks in corpus | 69,638 | live `rag_chunks.json` |
| Distinct sources | 30 | live `rag_chunks.json` |
| Retrieval | 20/20 (95.6%), Grade A | `rag_eval_report.json` |

The remaining figures quoted in this post come from `rag_eval_report.json`, `quality_eval_report.json`, and `test-verifier.js`.

## What I'd Do Differently

Honesty section, because the failures are more useful than the wins:

- **Source recall is the weak spot.** Topic and keyword recall are both perfect, but source recall trails: the system finds the right answer but doesn't always surface every source that supports it. That's the next number to move.
- **The gold-standard gate is only four cases.** Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
- **Dense vectors are underused.** They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.

The pipeline isn't finished; no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically", all live at https://blog.r-lopes.com/how-it-works, is a real bar, measured on a real corpus, and the code above is exactly what produces it.
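For readers who want to experiment with the quote-fidelity check described in The Quality Gate, the following is a rough sketch rather than the code in `verify-answer.js`. It assumes that "word-overlap ratio" means the fraction of a quote's words that also appear in a retrieved chunk, that quotes are lines starting with `>`, and that the marker text is `[fabricated quote removed]`. All function names here are hypothetical.

```javascript
// Hypothetical helpers: names, tokenization, and marker text are assumptions,
// not the verifier's actual implementation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Fraction (0..1) of the quote's words that also occur somewhere in the chunk.
function wordOverlapRatio(quote, chunkText) {
  const quoteWords = tokenize(quote);
  if (quoteWords.length === 0) return 0;
  const chunkWords = new Set(tokenize(chunkText));
  const matched = quoteWords.filter((w) => chunkWords.has(w)).length;
  return matched / quoteWords.length;
}

// Replace blockquote lines whose best overlap with any chunk is below the threshold.
function stripFabricatedQuotes(answer, chunks, threshold = 0.9) {
  const removed = [];
  const cleaned = answer
    .split("\n")
    .map((line) => {
      if (!line.startsWith(">")) return line; // not a blockquote line
      const quote = line.replace(/^>\s*/, "");
      const best = Math.max(0, ...chunks.map((c) => wordOverlapRatio(quote, c.text)));
      if (best >= threshold) return line;
      removed.push(quote); // keep a record for logging
      return "[fabricated quote removed]";
    })
    .join("\n");
  return { cleaned, removed };
}

// Made-up chunk and answer for illustration.
const chunks = [{ text: "BM25 rewards rare query terms and saturates term frequency." }];
const answer = [
  "BM25 is the base signal.",
  "> BM25 rewards rare query terms and saturates term frequency.",
  "> Dense vectors always beat lexical search.",
].join("\n");

const { cleaned, removed } = stripFabricatedQuotes(answer, chunks);
console.log(cleaned);
console.log(removed);
```

This bag-of-words overlap ignores word order, so it is more lenient than a true fuzzy string match. A stricter variant could compare word sequences instead of word sets.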