# Building a RAG Pipeline From Scratch

> Source: <https://blog.r-lopes.com/posts/building-a-rag-pipeline-from-scratch>
> Published: 2026-06-05 14:00:00+00:00

Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question — "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.

This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate — both shown live at [https://blog.r-lopes.com/how-it-works](https://blog.r-lopes.com/how-it-works). Every code block below is copy-pasteable from the running system.

## The Core Fix

The single biggest lever is **not better embeddings — it's fusing retrieval signals that fail differently.** BM25 handles the *what* (exact terms, rare-token weighting); TF-IDF cosine handles the *about* (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the *garnish*, not the base — the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.

If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.

## Architecture

```
query
  │
  ▼
smart-retrieval.js   intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring           ── list 1
  ├── TF-IDF cosine          ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js     strip fabricated quotes + banned phrases
  │
  ▼
streamed answer
```

## Retrieval: BM25 + TF-IDF + RRF

BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:

``` js
function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}
```

TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:

``` js
function tfidfCosine(queryTokens, doc, df, totalDocs) {
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;
  let dotProduct = 0, queryMag = 0, docMag = 0;
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}
```

The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant `k=60`

is the standard damping value — it stops rank-1 from utterly dominating rank-2:

``` js
const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
```

Wiring it together — BM25 weighted 1.2, TF-IDF 1.0:

``` js
const bm25Ranked  = docs.map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
                        .sort((a, b) => b.score - a.score);
const tfidfRanked = docs.map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
                        .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);
```

Two details that earn their keep: **synonym expansion is query-side only** (expanding documents would blow up the index and dilute IDF), and a **per-source cap** runs after fusion so a single prolific source can't monopolize the top-k — diversity of evidence beats depth from one channel.

## The Quality Gate

Retrieval being right doesn't make the *answer* right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores **99/100**.

The verifier's most important check is quote fidelity. Any `> "blockquote"`

is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio — quotes that aren't actually in the sources are replaced with a `*[fabricated quote removed]*`

marker and logged:

**Quote fidelity**— blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.** Invalid source refs**—`[Source N]`

where`N`

exceeds the retrieved count is removed.**Banned phrases**—`production-ready`

,`blazing fast`

,`world-class`

,`best-in-class`

and friends are flagged; cheerleading is a regression, not a flourish.**Emoji headers and "Keep exploring" footers**— auto-stripped.** Structural compliance**— deep answers must lead with one root cause before any diagram or table.

The gate runs automatically on proxy restart and as a git `pre-push`

hook on guarded files. A change that drops the score below 90 does not ship.

## The Numbers

These are measured, not aspirational — generated from the live corpus and the latest eval reports:

| Metric | Value | Source |
|---|---|---|
| Chunks in corpus | 69,638 | live `rag_chunks.json` |
| Distinct sources | 30 | live `rag_chunks.json` |
| Retrieval | 20/20 (95.6%), Grade A |
|

`rag_eval_report.json`

`rag_eval_report.json`

`rag_eval_report.json`

`quality_eval_report.json`

`test-verifier.js`

## What I'd Do Differently

Honesty section, because the failures are more useful than the wins:

**Source recall is the weak spot.** Topic and keyword recall are both perfect, but source recall trails — the system finds the right*answer*but doesn't always surface every source that supports it. That's the next number to move.**The gold-standard gate is only four cases.** Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.**Dense vectors are underused.** They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.

The pipeline isn't finished — no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" — all live at [https://blog.r-lopes.com/how-it-works](https://blog.r-lopes.com/how-it-works) — is a real bar, measured on a real corpus, and the code above is exactly what produces it.
