Building a RAG Pipeline From Scratch


Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question — "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.

This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are a 95.6% retrieval score (20/20 test questions, Grade A) and 99/100 on the answer-quality gate, both shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.

The Core Fix

The single biggest lever is not better embeddings — it's fusing retrieval signals that fail differently. BM25 handles the what (exact terms, rare-token weighting); TF-IDF cosine handles the about (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base — the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.

If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.

Architecture

query
  │
  ▼
smart-retrieval.js   intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring           ── list 1
  ├── TF-IDF cosine          ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js     strip fabricated quotes + banned phrases
  │
  ▼
streamed answer
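
The flow in the diagram can be read as a chain of async stages. The following is a minimal sketch rather than the author's code: `answerQuery` and the stage names `expandQuery`, `search`, `generate`, and `verify` are hypothetical stand-ins for whatever the four files export, and they are passed in as arguments so the shape is visible without assuming any real module API.

```javascript
// Hypothetical orchestration of the four stages in the diagram.
// Every stage is injected, so nothing here assumes a real module API.
async function answerQuery(query, stages) {
  const { expandQuery, search, generate, verify } = stages;

  // smart-retrieval.js: intent detection + multi-angle expansion
  const angles = await expandQuery(query);

  // search.js: retrieve per angle, then de-duplicate chunks by id
  const seen = new Map();
  for (const angle of angles) {
    for (const chunk of await search(angle)) {
      if (!seen.has(chunk.id)) seen.set(chunk.id, chunk);
    }
  }
  const chunks = [...seen.values()];

  // openai-proxy.js: build context + system prompt, call the LLM
  const draft = await generate(query, chunks);

  // verify-answer.js: strip fabricated quotes and banned phrases
  return verify(draft, chunks);
}
```

Passing the same `chunks` to both `generate` and `verify` is the point of the design: the verifier can only judge a quote against the text the model was actually shown.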

Retrieval: BM25 + TF-IDF + RRF

BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:

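// BM25 free parameters. The article does not show their values; 1.2 and 0.75
// are common defaults and are an assumption here.
const K1 = 1.2;
const B = 0.75;

function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // IDF: rare terms score higher; the +1 inside the log keeps it non-negative
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // TF saturates via K1 and is length-normalized via B against avgDl
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}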
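
`bm25Score` reads `doc.tf`, `doc.docLength`, `df`, `totalDocs`, and `avgDl`, none of which are built in the snippet above. Here is one way those structures could be produced from raw chunk text. It is a sketch under assumptions: the tokenizer (lowercase, split on non-alphanumerics) and the `id` / `text` chunk fields are hypothetical, not taken from the running system.

```javascript
// Minimal tokenizer (an assumption): lowercase, split on non-alphanumerics.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build the structures bm25Score reads: per-doc tf and docLength,
// plus corpus-wide df, totalDocs, and average document length.
function buildIndex(chunks) {
  // Object.create(null) avoids inherited keys such as "constructor"
  const df = Object.create(null);
  const docs = chunks.map((chunk) => {
    const tokens = tokenize(chunk.text);
    const tf = Object.create(null);
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
    // df counts documents containing the term, not total occurrences
    for (const t of Object.keys(tf)) df[t] = (df[t] || 0) + 1;
    return { id: chunk.id, tf, docLength: tokens.length };
  });
  const totalDocs = docs.length;
  const avgDl = totalDocs === 0
    ? 0
    : docs.reduce((sum, d) => sum + d.docLength, 0) / totalDocs;
  return { docs, df, totalDocs, avgDl };
}
```

Whatever tokenizer is used, the query must go through the same one, otherwise `df[term]` lookups miss and the term is silently skipped.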

TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:

function tfidfCosine(queryTokens, doc, df, totalDocs) {
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;
  let dotProduct = 0, queryMag = 0, docMag = 0;
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}
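
The second loop in `tfidfCosine` recomputes the document magnitude on every call, although it depends only on the document and the corpus statistics. One possible optimization, sketched here and not taken from the article, is to cache it at index time. `precomputeDocNorms` is a hypothetical helper that uses the same idf formula as the function above.

```javascript
// Precompute each document's TF-IDF vector magnitude once, using the same
// idf formula as tfidfCosine: log(totalDocs / (df + 1)).
function precomputeDocNorms(docs, df, totalDocs) {
  const norms = new Map();
  for (const doc of docs) {
    let sumSquares = 0;
    for (const term of Object.keys(doc.tf)) {
      const termDf = df[term] || 0;
      if (termDf === 0) continue;
      const idf = Math.log(totalDocs / (termDf + 1));
      sumSquares += (doc.tf[term] * idf) ** 2;
    }
    norms.set(doc.id, Math.sqrt(sumSquares));
  }
  return norms;
}
```

A scorer could then look up `norms.get(doc.id)` instead of looping over `doc.tf`, which matters when every query is scored against tens of thousands of chunks.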

The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant k=60 is the standard damping value, which stops rank-1 from utterly dominating rank-2:

const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
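
To make the damping concrete, here is a small worked example of the arithmetic inside `reciprocalRankFusion`. The helper `rrfContribution` is hypothetical and exists only for illustration; it mirrors the `w / (k + rank + 1)` term above with a 0-based rank.

```javascript
// Contribution of one list to a document's fused score.
// rank is 0-based, matching the loop in reciprocalRankFusion.
function rrfContribution(rank, k = 60, weight = 1.0) {
  return weight / (k + rank + 1);
}

// With k = 60 the top two ranks are nearly tied (62 / 61)...
const dampedRatio = rrfContribution(0) / rrfContribution(1);
// ...whereas with k = 0 the top rank counts twice as much as the second.
const undampedRatio = rrfContribution(0, 0) / rrfContribution(1, 0);

// A chunk ranked 1st by BM25 (weight 1.2) and 3rd by TF-IDF (weight 1.0)
const inBothLists = rrfContribution(0, 60, 1.2) + rrfContribution(2, 60, 1.0);
// versus a chunk that only gets credit for a 1st place in the TF-IDF list.
const inOneList = rrfContribution(0, 60, 1.0);

console.log(dampedRatio.toFixed(3), undampedRatio, inBothLists > inOneList);
// prints: 1.016 2 true
```

Agreement between lists beats a single first place, which is the behaviour fusion is meant to reward.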

Wiring it together — BM25 weighted 1.2, TF-IDF 1.0:

const bm25Ranked  = docs.map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
                        .sort((a, b) => b.score - a.score);
const tfidfRanked = docs.map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
                        .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);

Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k — diversity of evidence beats depth from one channel.
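
Both details can be sketched in a few lines. Everything specific below is an assumption made for illustration: the `SYNONYMS` entries, the `doc.source` field, the `docsById` lookup, and the default cap of 3 are hypothetical and are not taken from the running system.

```javascript
// Hypothetical synonym table; the real entries are not shown in the article.
const SYNONYMS = new Map([
  ["rrf", ["reciprocal", "rank", "fusion"]],
  ["rerank", ["reranking", "reranker"]],
]);

// Query-side expansion: append synonyms to the query tokens only.
// Documents are never expanded, so the index and its IDF stay untouched.
function expandTokens(queryTokens, synonyms = SYNONYMS) {
  const expanded = [...queryTokens];
  const seen = new Set(queryTokens);
  for (const token of queryTokens) {
    for (const syn of synonyms.get(token) || []) {
      if (!seen.has(syn)) {
        seen.add(syn);
        expanded.push(syn);
      }
    }
  }
  return expanded;
}

// Turn the fused score Map into a ranked list, then keep at most
// maxPerSource results from any one source until topK are collected.
function capPerSource(rrfScores, docsById, { maxPerSource = 3, topK = 10 } = {}) {
  const ranked = [...rrfScores.entries()].sort((a, b) => b[1] - a[1]);
  const perSource = new Map();
  const results = [];
  for (const [id, score] of ranked) {
    const doc = docsById.get(id);
    if (!doc) continue;
    const used = perSource.get(doc.source) || 0; // doc.source is an assumed field
    if (used >= maxPerSource) continue;
    perSource.set(doc.source, used + 1);
    results.push({ doc, score });
    if (results.length === topK) break;
  }
  return results;
}
```

Because the cap walks the list in fused-score order, a capped source keeps its best chunks and only its weaker ones are displaced.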

The Quality Gate

Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.

The verifier's most important check is quote fidelity. Any > "blockquote" is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio; quotes that aren't actually in the sources are replaced with a *[fabricated quote removed]* marker and logged. The checks in full:

- Quote fidelity: blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.
- Invalid source refs: a [Source N] reference where N exceeds the retrieved count is removed.
- Banned phrases: production-ready, blazing fast, world-class, best-in-class and friends are flagged; cheerleading is a regression, not a flourish.
- Emoji headers and "Keep exploring" footers: auto-stripped.
- Structural compliance: deep answers must lead with one root cause before any diagram or table.
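
The first three checks can be sketched as follows. This is an illustration under assumptions, not the contents of `verify-answer.js`: "word-overlap ratio" is interpreted here as the share of a quote's words found in a single chunk, and the function names, regexes, and log shape are hypothetical. The banned phrases and the replacement marker come from the article.

```javascript
const BANNED = ["production-ready", "blazing fast", "world-class", "best-in-class"];

function words(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Share of the quote's words that also occur in the chunk (0..1).
function wordOverlap(quote, chunkText) {
  const quoteWords = words(quote);
  if (quoteWords.length === 0) return 0;
  const chunkWords = new Set(words(chunkText));
  return quoteWords.filter((w) => chunkWords.has(w)).length / quoteWords.length;
}

function verifyAnswer(answer, chunks, threshold = 0.9) {
  const log = [];

  // Quote fidelity: every "> ..." line must match some retrieved chunk.
  const lines = answer.split("\n").map((line) => {
    if (!line.startsWith(">")) return line;
    const quote = line.replace(/^>\s*/, "");
    const best = Math.max(0, ...chunks.map((c) => wordOverlap(quote, c.text)));
    if (best >= threshold) return line;
    log.push({ type: "fabricated_quote", quote });
    return "*[fabricated quote removed]*";
  });
  let text = lines.join("\n");

  // Invalid source refs: drop [Source N] when N is outside 1..chunks.length.
  text = text.replace(/\s*\[Source (\d+)\]/g, (match, n) => {
    const valid = Number(n) >= 1 && Number(n) <= chunks.length;
    if (!valid) log.push({ type: "invalid_source", ref: Number(n) });
    return valid ? match : "";
  });

  // Banned phrases are flagged in the log, not rewritten.
  const lower = text.toLowerCase();
  for (const phrase of BANNED) {
    if (lower.includes(phrase)) log.push({ type: "banned_phrase", phrase });
  }
  return { text, log };
}
```

A bag-of-words overlap ignores word order, so a production check would likely want something stricter, such as matching contiguous spans; the sketch only shows where the 0.9 threshold sits in the flow.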

The gate runs automatically on proxy restart and as a git pre-push hook on guarded files. A change that drops the score below 90 does not ship.
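
A hard floor like this is usually enforced with an exit code, since git aborts a push when the pre-push hook exits non-zero. The sketch below assumes the report is a JSON file with a numeric `score` field; that shape, and the names `gatePasses` and `runGate`, are hypothetical.

```javascript
const fs = require("fs");

const QUALITY_FLOOR = 90; // hard floor stated in the article

// True when the report's score clears the floor.
// The `score` field name is an assumption about the report's shape.
function gatePasses(report, floor = QUALITY_FLOOR) {
  return Number.isFinite(report.score) && report.score >= floor;
}

// Reads a report file and returns a process exit code: 0 = pass, 1 = fail.
function runGate(reportPath, floor = QUALITY_FLOOR) {
  const report = JSON.parse(fs.readFileSync(reportPath, "utf8"));
  const ok = gatePasses(report, floor);
  console.log(ok
    ? `quality gate passed: ${report.score}/100`
    : `quality gate failed: ${report.score} is below ${floor}`);
  return ok ? 0 : 1;
}

// Usage from a hook script (path is illustrative):
//   process.exit(runGate("quality_eval_report.json"));
```

A missing or non-numeric score fails the gate rather than passing it, so a broken eval run cannot ship by accident.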

The Numbers

These are measured, not aspirational — generated from the live corpus and the latest eval reports:

Metric	Value	Source
Chunks in corpus	69,638	live `rag_chunks.json`
Distinct sources	30	live `rag_chunks.json`
Retrieval	20/20 (95.6%), Grade A	`rag_eval_report.json`
Answer-quality gate	99/100 (floor 90)	`quality_eval_report.json`
Verifier unit tests	33	`test-verifier.js`

What I'd Do Differently

Honesty section, because the failures are more useful than the wins:

- Source recall is the weak spot. Topic and keyword recall are both perfect, but source recall trails: the system finds the right answer but doesn't always surface every source that supports it. That's the next number to move.
- The gold-standard gate is only four cases. Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
- Dense vectors are underused. They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.
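
The article does not define source recall, so the following is an assumed definition for illustration: of the sources a test question expects, the share that appears among the retrieved chunks. The field names `expectedSources`, `retrievedChunks`, and `source` are hypothetical.

```javascript
// Assumed definition: fraction of expected sources present in the results.
function sourceRecall(expectedSources, retrievedChunks) {
  if (expectedSources.length === 0) return 1;
  const retrieved = new Set(retrievedChunks.map((c) => c.source));
  const found = expectedSources.filter((s) => retrieved.has(s)).length;
  return found / expectedSources.length;
}

// Average over an eval set of { expectedSources, retrievedChunks } cases.
function meanSourceRecall(cases) {
  if (cases.length === 0) return 0;
  const total = cases.reduce(
    (sum, c) => sum + sourceRecall(c.expectedSources, c.retrievedChunks),
    0
  );
  return total / cases.length;
}
```

Under this definition a tight per-source cap and source recall pull in opposite directions only when one source holds most of the evidence, which is worth checking before tuning either.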

The pipeline isn't finished — no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" — all live at https://blog.r-lopes.com/how-it-works — is a real bar, measured on a real corpus, and the code above is exactly what produces it.

source & further reading

blog.r-lopes.com (original article)