{"slug": "building-a-rag-pipeline-from-scratch", "title": "Building a RAG Pipeline From Scratch", "summary": "A developer built a production-grade RAG pipeline that achieves 95.6% retrieval accuracy and 99/100 answer quality by fusing BM25, TF-IDF, and dense vectors with weighted Reciprocal Rank Fusion, addressing the exact-match failures of vector-only systems.", "body_md": "Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question — \"BM25 vs TF-IDF ranking\" returns generic results about \"search relevance\" because dense embeddings compress the exact-match signal away.\n\nThis is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate — both shown live at [https://blog.r-lopes.com/how-it-works](https://blog.r-lopes.com/how-it-works). Every code block below is copy-pasteable from the running system.\n\n## The Core Fix\n\nThe single biggest lever is **not better embeddings — it's fusing retrieval signals that fail differently.** BM25 handles the *what* (exact terms, rare-token weighting); TF-IDF cosine handles the *about* (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the *garnish*, not the base — the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.\n\nIf you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. 
That's the move.\n\n## Architecture\n\n```\nquery\n  │\n  ▼\nsmart-retrieval.js   intent detection + multi-angle expansion\n  │\n  ▼\nsearch.js\n  ├── synonym expansion (query-side only)\n  ├── BM25 scoring           ── list 1\n  ├── TF-IDF cosine          ── list 2\n  ├── (optional) dense vector ── list 3\n  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])\n  ├── per-source cap (no single source dominates)\n  └── cross-encoder rerank\n  │\n  ▼\nopenai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)\n  │\n  ▼\nverify-answer.js     strip fabricated quotes + banned phrases\n  │\n  ▼\nstreamed answer\n```\n\n## Retrieval: BM25 + TF-IDF + RRF\n\nBM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:\n\n``` js\nfunction bm25Score(queryTokens, doc, df, totalDocs, avgDl) {\n  let score = 0;\n  for (const term of queryTokens) {\n    const termDf = df[term] || 0;\n    if (termDf === 0) continue;\n    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);\n    const termTf = doc.tf[term] || 0;\n    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));\n    score += idf * tfNorm;\n  }\n  return score;\n}\n```\n\nTF-IDF cosine is the second signal. 
It captures distributional similarity that BM25's term-at-a-time scoring misses:\n\n``` js\nfunction tfidfCosine(queryTokens, doc, df, totalDocs) {\n  const queryTf = {};\n  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;\n  let dotProduct = 0, queryMag = 0, docMag = 0;\n  for (const term of new Set(queryTokens)) {\n    const termDf = df[term] || 0;\n    if (termDf === 0) continue;\n    const idf = Math.log(totalDocs / (termDf + 1));\n    const qTfidf = (queryTf[term] || 0) * idf;\n    const dTfidf = (doc.tf[term] || 0) * idf;\n    dotProduct += qTfidf * dTfidf;\n    queryMag += qTfidf * qTfidf;\n  }\n  for (const term of Object.keys(doc.tf)) {\n    const termDf = df[term] || 0;\n    if (termDf === 0) continue;\n    const idf = Math.log(totalDocs / (termDf + 1));\n    docMag += (doc.tf[term] * idf) ** 2;\n  }\n  queryMag = Math.sqrt(queryMag);\n  docMag = Math.sqrt(docMag);\n  if (queryMag === 0 || docMag === 0) return 0;\n  return dotProduct / (queryMag * docMag);\n}\n```\n\nThe fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant `k=60` is the standard damping value; it stops rank-1 from utterly dominating rank-2:\n\n``` js\nconst RRF_K = 60;\n\nfunction reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {\n  const scores = new Map();\n  for (let li = 0; li < rankedLists.length; li++) {\n    const list = rankedLists[li];\n    const w = weights ? 
weights[li] : 1.0;\n    for (let rank = 0; rank < list.length; rank++) {\n      const id = list[rank].doc.id;\n      const rrfScore = w / (k + rank + 1);\n      scores.set(id, (scores.get(id) || 0) + rrfScore);\n    }\n  }\n  return scores;\n}\n```\n\nWiring it together — BM25 weighted 1.2, TF-IDF 1.0:\n\n``` js\nconst bm25Ranked  = docs.map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))\n                        .sort((a, b) => b.score - a.score);\nconst tfidfRanked = docs.map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))\n                        .sort((a, b) => b.score - a.score);\n\nconst rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);\n```\n\nTwo details that earn their keep: **synonym expansion is query-side only** (expanding documents would blow up the index and dilute IDF), and a **per-source cap** runs after fusion so a single prolific source can't monopolize the top-k — diversity of evidence beats depth from one channel.\n\n## The Quality Gate\n\nRetrieval being right doesn't make the *answer* right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores **99/100**.\n\nThe verifier's most important check is quote fidelity. 
Any `> \"blockquote\"` is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio; quotes that aren't actually in the sources are replaced with a `*[fabricated quote removed]*` marker and logged:\n\n- **Quote fidelity:** blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.\n- **Invalid source refs:** a `[Source N]` where `N` exceeds the retrieved count is removed.\n- **Banned phrases:** `production-ready`, `blazing fast`, `world-class`, `best-in-class` and friends are flagged; cheerleading is a regression, not a flourish.\n- **Emoji headers and \"Keep exploring\" footers:** auto-stripped.\n- **Structural compliance:** deep answers must lead with one root cause before any diagram or table.\n\nThe gate runs automatically on proxy restart and as a git `pre-push` hook on guarded files. A change that drops the score below 90 does not ship.\n\n## The Numbers\n\nThese are measured, not aspirational, generated from the live corpus and the latest eval reports:\n\n| Metric | Value | Source |\n|---|---|---|\n| Chunks in corpus | 69,638 | live `rag_chunks.json` |\n| Distinct sources | 30 | live `rag_chunks.json` |\n| Retrieval | 20/20 (95.6%), Grade A | `rag_eval_report.json` |\n| Answer-quality gate | 99/100 | `quality_eval_report.json` |\n| Verifier unit tests | 33 | `test-verifier.js` |\n\n## What I'd Do Differently\n\nHonesty section, because the failures are more useful than the wins:\n\n- **Source recall is the weak spot.** Topic and keyword recall are both perfect, but source recall trails: the system finds the right *answer* but doesn't always surface every source that supports it. That's the next number to move.\n- **The gold-standard gate is only four cases.** Four cases catch obvious regressions but won't catch a cross-domain one. 
Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.\n- **Dense vectors are underused.** They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.\n\nThe pipeline isn't finished; no pipeline is. But \"95.6% retrieval, 99/100 quality, fabrications stripped automatically\" (all live at [https://blog.r-lopes.com/how-it-works](https://blog.r-lopes.com/how-it-works)) is a real bar, measured on a real corpus, and the code above is exactly what produces it.", "url": "https://wpnews.pro/news/building-a-rag-pipeline-from-scratch", "canonical_source": "https://blog.r-lopes.com/posts/building-a-rag-pipeline-from-scratch", "published_at": "2026-06-05 14:00:00+00:00", "updated_at": "2026-06-14 02:06:22.620204+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "natural-language-processing", "ai-tools", "ai-infrastructure"], "entities": ["BM25", "TF-IDF", "Reciprocal Rank Fusion", "Claude", "Ollama", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/building-a-rag-pipeline-from-scratch", "markdown": "https://wpnews.pro/news/building-a-rag-pipeline-from-scratch.md", "text": "https://wpnews.pro/news/building-a-rag-pipeline-from-scratch.txt", "jsonld": "https://wpnews.pro/news/building-a-rag-pipeline-from-scratch.jsonld"}}