[ARTICLE · art-26673] src=blog.r-lopes.com topic=artificial-intelligence verified=true sentiment=positive

Building a RAG Pipeline From Scratch

A developer built a production-grade RAG pipeline that achieves 95.6% retrieval accuracy and 99/100 answer quality by fusing BM25, TF-IDF, and dense vectors with weighted Reciprocal Rank Fusion, addressing the exact-match failures of vector-only systems.

6 min read · published Jun 5, 2026

Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question: "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.

This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate, both shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.

The Core Fix

The single biggest lever is not better embeddings; it's fusing retrieval signals that fail differently. BM25 handles the what (exact terms, rare-token weighting); TF-IDF cosine handles the about (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base: the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.

If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.

Architecture

query
  │
  ▼
smart-retrieval.js   intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring            ── list 1
  ├── TF-IDF cosine           ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js     strip fabricated quotes + banned phrases
  │
  ▼
streamed answer

Retrieval: BM25 + TF-IDF + RRF

BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:

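// Assumed values: K1 and B are not shown in this excerpt; 1.2 and 0.75 are the usual BM25 defaults.
const K1 = 1.2;   // term-frequency saturation
const B = 0.75;   // strength of document-length normalization

function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // The +1 inside the log keeps IDF positive even for very common terms.
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // Saturating TF, normalized by this doc's length relative to the corpus average.
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}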
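The scoring function reads per-document term counts (doc.tf, doc.docLength) and corpus-wide statistics (df, totalDocs, avgDl), but how those get built is not shown here. The following is a minimal sketch of one way to produce them, not the code from the running system: tokenize and buildIndex are hypothetical names, the lowercase alphanumeric tokenizer is an assumption, and the chunk fields id, source, and text are assumed shapes.

```javascript
// Hypothetical tokenizer (assumption): lowercase, split on anything that is not a-z or 0-9.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical index builder producing the shapes the scoring functions read.
// Object.create(null) avoids collisions with inherited keys such as "constructor".
function buildIndex(chunks) {
  const df = Object.create(null);
  let totalLength = 0;

  const docs = chunks.map(chunk => {
    const tokens = tokenize(chunk.text);
    const tf = Object.create(null);
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
    // Document frequency counts each term once per chunk, however often it repeats.
    for (const t of Object.keys(tf)) df[t] = (df[t] || 0) + 1;
    totalLength += tokens.length;
    return { id: chunk.id, source: chunk.source, tf, docLength: tokens.length };
  });

  const totalDocs = docs.length;
  const avgDocLength = totalDocs > 0 ? totalLength / totalDocs : 0;
  return { docs, df, totalDocs, avgDocLength };
}

// Toy corpus (made-up text) to show the resulting shapes.
const index = buildIndex([
  { id: 'a', source: 'demo', text: 'BM25 ranks BM25 hits' },
  { id: 'b', source: 'demo', text: 'TF-IDF ranks' },
]);
console.log(index.df, index.avgDocLength);
```

Building term counts once at index time keeps query-time scoring to dictionary lookups, which is what makes scoring every chunk per query affordable in a sketch like this.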

TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:

function tfidfCosine(queryTokens, doc, df, totalDocs) {
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;
  let dotProduct = 0, queryMag = 0, docMag = 0;
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}

The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant k=60 is the standard damping value, which stops rank-1 from utterly dominating rank-2:

const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
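To make the effect of the weights concrete, here is a small worked example on made-up chunk IDs (c1, c2, c3) where the two signals disagree about the top hit. The fusion function is repeated from above so the snippet runs on its own; rankIds is a hypothetical helper added for illustration.

```javascript
const RRF_K = 60;

// Same fusion function as above, repeated so this snippet is self-contained.
function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      scores.set(id, (scores.get(id) || 0) + w / (k + rank + 1));
    }
  }
  return scores;
}

// Toy rankings: BM25 prefers c1, TF-IDF prefers c2.
const bm25Ranked  = [{ doc: { id: 'c1' } }, { doc: { id: 'c2' } }, { doc: { id: 'c3' } }];
const tfidfRanked = [{ doc: { id: 'c2' } }, { doc: { id: 'c1' } }, { doc: { id: 'c3' } }];

// Hypothetical helper: turn the score map into IDs sorted by fused score, best first.
function rankIds(scores) {
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const unweighted = reciprocalRankFusion([bm25Ranked, tfidfRanked]);
const weighted   = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);

console.log(unweighted.get('c1') === unweighted.get('c2')); // equal weights: c1 and c2 tie
console.log(rankIds(weighted));                             // BM25's favorite breaks the tie
```

With equal weights, c1 and c2 each collect 1/61 + 1/62 and tie exactly. With weights [1.2, 1.0], c1 gets 1.2/61 + 1/62 and c2 gets 1.2/62 + 1/61, so c1 edges ahead. The damping is visible in the same numbers: a rank-1 contribution of 1/61 is only slightly larger than a rank-2 contribution of 1/62.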

Wiring it together, with BM25 weighted 1.2 and TF-IDF 1.0:

const bm25Ranked  = docs.map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
                        .sort((a, b) => b.score - a.score);
const tfidfRanked = docs.map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
                        .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);

Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k. Diversity of evidence beats depth from one channel.
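Neither of those two steps is shown above, so here is a rough sketch of how they might look. Everything in it is an assumption made for illustration: the synonym table contents, the function names expandQueryTokens and capPerSource, the doc.source field, and the defaults of 3 per source and a top-k of 10.

```javascript
// Hypothetical synonym table; the real one is not shown in this article.
const SYNONYMS = new Map([
  ['k8s', ['kubernetes']],
  ['js', ['javascript']],
]);

// Query-side expansion: only the query grows, indexed documents are left untouched.
function expandQueryTokens(tokens, synonyms = SYNONYMS) {
  const expanded = new Set(tokens);
  for (const t of tokens) {
    for (const s of synonyms.get(t) || []) expanded.add(s);
  }
  return [...expanded];
}

// Per-source cap, applied after fusion.
// `ranked` is assumed to be sorted by fused score, best first,
// with items shaped like { doc: { id, source }, score }.
function capPerSource(ranked, maxPerSource = 3, topK = 10) {
  const counts = new Map();
  const kept = [];
  for (const item of ranked) {
    const seen = counts.get(item.doc.source) || 0;
    if (seen >= maxPerSource) continue; // this source already has its share
    counts.set(item.doc.source, seen + 1);
    kept.push(item);
    if (kept.length === topK) break;
  }
  return kept;
}

// Toy input: source "a" would otherwise take the first three slots.
const fused = [
  { doc: { id: 1, source: 'a' }, score: 0.9 },
  { doc: { id: 2, source: 'a' }, score: 0.8 },
  { doc: { id: 3, source: 'a' }, score: 0.7 },
  { doc: { id: 4, source: 'b' }, score: 0.6 },
  { doc: { id: 5, source: 'a' }, score: 0.5 },
];
console.log(expandQueryTokens(['k8s', 'deploy']));
console.log(capPerSource(fused, 2, 3).map(item => item.doc.id));
```

Walking the already-sorted list and skipping over-represented sources keeps the relative order of everything that survives, so the cap changes which chunks make the cut without re-scoring anything.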

The Quality Gate

Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.

The verifier's most important check is quote fidelity. Any > "blockquote" is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio; quotes that aren't actually in the sources are replaced with a *[fabricated quote removed]* marker and logged:

- Quote fidelity: blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.
- Invalid source refs: [Source N] where N exceeds the retrieved count is removed.
- Banned phrases: production-ready, blazing fast, world-class, best-in-class and friends are flagged; cheerleading is a regression, not a flourish.
- Emoji headers and "Keep exploring" footers: auto-stripped.
- Structural compliance: deep answers must lead with one root cause before any diagram or table.
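The first three checks are mechanical enough to sketch. What follows is an illustration under stated assumptions, not the contents of verify-answer.js: "word-overlap ratio" is taken to mean the share of a quote's words that appear in a single retrieved chunk, chunks are assumed to carry a text field, and the names verifyAnswer, overlapRatio, and the log entry shapes are hypothetical.

```javascript
const BANNED_PHRASES = ['production-ready', 'blazing fast', 'world-class', 'best-in-class'];
const OVERLAP_THRESHOLD = 0.9;

function words(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Assumed definition: fraction of the quote's words that occur somewhere in the chunk.
function overlapRatio(quote, chunkText) {
  const quoteWords = words(quote);
  if (quoteWords.length === 0) return 0;
  const chunkWords = new Set(words(chunkText));
  return quoteWords.filter(w => chunkWords.has(w)).length / quoteWords.length;
}

function verifyAnswer(answer, chunks) {
  const log = [];

  // 1. Quote fidelity: every "> " line must be backed by at least one retrieved chunk.
  const lines = answer.split('\n').map(line => {
    if (!line.startsWith('> ')) return line;
    const quote = line.slice(2);
    const supported = chunks.some(c => overlapRatio(quote, c.text) >= OVERLAP_THRESHOLD);
    if (supported) return line;
    log.push({ type: 'fabricated_quote', quote });
    return '*[fabricated quote removed]*';
  });
  let text = lines.join('\n');

  // 2. Invalid source refs: drop [Source N] when N points past the retrieved set.
  text = text.replace(/\s*\[Source (\d+)\]/g, (match, n) => {
    const idx = Number(n);
    if (idx >= 1 && idx <= chunks.length) return match;
    log.push({ type: 'invalid_source_ref', ref: idx });
    return '';
  });

  // 3. Banned phrases: flagged in the log, left in the text for a human to rewrite.
  const lower = text.toLowerCase();
  for (const phrase of BANNED_PHRASES) {
    if (lower.includes(phrase)) log.push({ type: 'banned_phrase', phrase });
  }

  return { text, log };
}

// Toy run with made-up content.
const retrieved = [{ text: 'BM25 rewards rare query terms and saturates term frequency.' }];
const draft = [
  'Intro [Source 1] and [Source 4].',
  '> BM25 rewards rare query terms',
  '> Vectors always beat keywords',
  'This is blazing fast.',
].join('\n');
const result = verifyAnswer(draft, retrieved);
console.log(result.text);
console.log(result.log);
```

In the toy run the first quote survives because all of its words appear in the retrieved chunk, the second is replaced by the marker, the reference to a fourth source is dropped because only one chunk was retrieved, and the banned phrase lands in the log.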

The gate runs automatically on proxy restart and as a git pre-push hook on guarded files. A change that drops the score below 90 does not ship.
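The pass/fail decision itself is small. As a sketch only: the report shape below (a cases array with a score per case) and the use of a plain average are assumptions, since the actual layout of quality_eval_report.json and the scoring formula are not shown here; evaluateGate is a hypothetical name.

```javascript
const QUALITY_FLOOR = 90; // the hard floor stated above

// Hypothetical report shape: { cases: [{ name, score }] }, scores on a 0-100 scale.
// Averaging the case scores is an assumption about how the gate aggregates.
function evaluateGate(report, floor = QUALITY_FLOOR) {
  const scores = report.cases.map(c => c.score);
  const average = scores.length > 0
    ? scores.reduce((sum, s) => sum + s, 0) / scores.length
    : 0; // an empty report fails the gate rather than passing by default
  return { average, passed: average >= floor };
}

// Made-up reports to show both outcomes.
const passing = evaluateGate({ cases: [{ name: 'a', score: 100 }, { name: 'b', score: 98 }] });
const failing = evaluateGate({ cases: [{ name: 'a', score: 80 }, { name: 'b', score: 90 }] });
console.log(passing, failing);
```

In a pre-push hook, the script wrapping a check like this would exit with a non-zero status when passed is false, which is what makes git abort the push.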

The Numbers

These are measured, not aspirational, generated from the live corpus and the latest eval reports:

Metric | Value | Source
Chunks in corpus | 69,638 | live rag_chunks.json
Distinct sources | 30 | live rag_chunks.json
Retrieval | 20/20 (95.6%), Grade A | rag_eval_report.json
Answer quality gate | 99/100 | quality_eval_report.json
Verifier unit tests | 33 | test-verifier.js

What I'd Do Differently

Honesty section, because the failures are more useful than the wins:

- Source recall is the weak spot. Topic and keyword recall are both perfect, but source recall trails: the system finds the right answer but doesn't always surface every source that supports it. That's the next number to move.
- The gold-standard gate is only four cases. Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
- Dense vectors are underused. They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.

The pipeline isn't finished; no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" (all live at https://blog.r-lopes.com/how-it-works) is a real bar, measured on a real corpus, and the code above is exactly what produces it.
