Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question: "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.
This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate, both shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.
The Core Fix
The single biggest lever is not better embeddings; it's fusing retrieval signals that fail differently. BM25 handles the what (exact terms, rare-token weighting); TF-IDF cosine handles the about (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base: the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.
If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.
Architecture
```text
query
  │
  ▼
smart-retrieval.js   intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring ─→ list 1
  ├── TF-IDF cosine ─→ list 2
  ├── (optional) dense vector ─→ list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js     strip fabricated quotes + banned phrases
  │
  ▼
streamed answer
```
Retrieval: BM25 + TF-IDF + RRF
BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:
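```javascript
// K1 and B are referenced but not defined in this excerpt; 1.2 and 0.75 are
// the commonly used BM25 defaults, assumed here so the function runs.
const K1 = 1.2;
const B = 0.75;

function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // IDF: rarer terms score higher; the +1 inside the log keeps it positive.
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // Saturating TF, normalized by document length relative to the average.
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}
```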
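`bm25Score` reads precomputed statistics: per-document term counts (`doc.tf`), a document length (`doc.docLength`), corpus-wide document frequencies (`df`), and the average length (`avgDl`). The post does not show how those are built, so the following is only a minimal sketch under assumptions: `tokenize` and `buildIndex` are illustrative names, and the tokenizer is a plain lowercase split rather than whatever search.js actually uses.

```javascript
// Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical index builder producing the shapes bm25Score reads.
// Null-prototype objects avoid collisions with inherited keys such as "constructor".
function buildIndex(chunks) {
  const df = Object.create(null);
  let totalLength = 0;
  const docs = chunks.map(({ id, text }) => {
    const tokens = tokenize(text);
    const tf = Object.create(null);
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
    // Document frequency counts each term once per document.
    for (const t of Object.keys(tf)) df[t] = (df[t] || 0) + 1;
    totalLength += tokens.length;
    return { id, tf, docLength: tokens.length };
  });
  const totalDocs = docs.length;
  const avgDl = totalDocs > 0 ? totalLength / totalDocs : 0;
  return { docs, df, totalDocs, avgDl };
}

const toyIndex = buildIndex([
  { id: "a", text: "BM25 ranking with BM25 weights" },
  { id: "b", text: "TF-IDF ranking" },
]);
console.log(toyIndex.avgDl);           // 4
console.log(toyIndex.df.ranking);      // 2
console.log(toyIndex.docs[0].tf.bm25); // 2
```

Computing `df` once per document rather than once per token occurrence matters: counting occurrences would inflate `df` for repeated words and push their IDF down.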
TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:
```javascript
function tfidfCosine(queryTokens, doc, df, totalDocs) {
  // Raw term counts for the query.
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;

  let dotProduct = 0, queryMag = 0, docMag = 0;
  // The dot product and the query magnitude only need the distinct query terms.
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  // The document magnitude runs over every term in the document.
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0; // avoid dividing by zero
  return dotProduct / (queryMag * docMag);
}
```
The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant `k=60` is the standard damping value; it stops rank-1 from utterly dominating rank-2:
```javascript
const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map(); // doc id -> fused score
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0; // unweighted lists default to 1.0
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      // rank is 0-based, so the top entry contributes w / (k + 1).
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
```
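To make the damping concrete, it helps to compare what first and second place contribute under the same `w / (k + rank + 1)` formula. This is a small illustrative sketch; `rrfContribution` and `topTwoRatio` are names invented for the example.

```javascript
// Contribution of one list entry, same formula as in reciprocalRankFusion:
// w / (k + rank + 1), with rank counted from 0.
function rrfContribution(rank, k, w = 1.0) {
  return w / (k + rank + 1);
}

// Ratio of the first-place contribution to the second-place contribution.
function topTwoRatio(k) {
  return rrfContribution(0, k) / rrfContribution(1, k);
}

console.log(topTwoRatio(0).toFixed(3));  // 2.000: undamped, first place counts double
console.log(topTwoRatio(60).toFixed(3)); // 1.016: with k=60, first place is worth about 1.6% more
```

The same arithmetic shows how strong a 1.2 weight is at k=60: 1.2 / 73 is roughly 1 / 61, so a BM25 hit in position 13 contributes about as much as a TF-IDF hit in position 1.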
Wiring it together, with BM25 weighted 1.2 and TF-IDF 1.0:
```javascript
const bm25Ranked = docs
  .map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
  .sort((a, b) => b.score - a.score);

const tfidfRanked = docs
  .map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
  .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);
```
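`reciprocalRankFusion` returns a Map from document id to fused score, so one more step turns it into a ranked list. Here is a sketch on toy data: `topKFused` is a hypothetical helper, and the fusion function is repeated in compact form only so the block runs on its own.

```javascript
// Compact copy of the fusion function from this post, so the example is self-contained.
function reciprocalRankFusion(rankedLists, k = 60, weights = null) {
  const scores = new Map();
  rankedLists.forEach((list, li) => {
    const w = weights ? weights[li] : 1.0;
    list.forEach((entry, rank) => {
      const id = entry.doc.id;
      scores.set(id, (scores.get(id) || 0) + w / (k + rank + 1));
    });
  });
  return scores;
}

// Hypothetical helper: sort fused scores descending and keep the first `limit` ids.
function topKFused(rrfScores, limit) {
  return [...rrfScores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([id, score]) => ({ id, score }));
}

// Toy rankings in which the two signals disagree about the order.
const bm25Toy = [{ doc: { id: "a" } }, { doc: { id: "b" } }, { doc: { id: "c" } }];
const tfidfToy = [{ doc: { id: "b" } }, { doc: { id: "c" } }, { doc: { id: "a" } }];

const fusedTop = topKFused(reciprocalRankFusion([bm25Toy, tfidfToy], 60, [1.2, 1.0]), 2);
console.log(fusedTop.map(r => r.id)); // [ 'b', 'a' ]
```

Document `b` wins even though BM25 ranked `a` first and carries the higher weight: `b` is near the top of both lists, while `a` sits last in the TF-IDF list. Rewarding that kind of agreement is what rank fusion is for.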
Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k. Diversity of evidence beats depth from one channel.
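Both details are small in code. This is a minimal sketch, assuming each ranked result carries a `source` field and synonyms live in a plain lookup table. `SYNONYMS`, `expandQuery`, `capPerSource`, and the cap of 2 are all illustrative, not values from the running system.

```javascript
// Hypothetical synonym table; the real one is not shown in this post.
const SYNONYMS = new Map([
  ["k8s", ["kubernetes"]],
  ["js", ["javascript"]],
]);

// Query-side expansion: add synonyms to the query tokens, leave documents untouched.
function expandQuery(tokens) {
  const expanded = new Set(tokens);
  for (const t of tokens) {
    for (const s of SYNONYMS.get(t) || []) expanded.add(s);
  }
  return [...expanded];
}

// Per-source cap: walk the fused ranking in order and skip a result
// once its source already has `maxPerSource` entries.
function capPerSource(ranked, maxPerSource, limit) {
  const perSource = new Map();
  const kept = [];
  for (const item of ranked) {
    const count = perSource.get(item.source) || 0;
    if (count >= maxPerSource) continue;
    perSource.set(item.source, count + 1);
    kept.push(item);
    if (kept.length === limit) break;
  }
  return kept;
}

const rankedToy = [
  { id: 1, source: "blog-a" },
  { id: 2, source: "blog-a" },
  { id: 3, source: "blog-a" },
  { id: 4, source: "docs-b" },
];
console.log(capPerSource(rankedToy, 2, 3).map(r => r.id)); // [ 1, 2, 4 ]
console.log(expandQuery(["k8s", "ingress"]));              // [ 'k8s', 'ingress', 'kubernetes' ]
```

Because the cap walks the list in fused order, it never reorders results; it only drops the overflow from a source that is already represented.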
The Quality Gate
Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.
The verifier's most important check is quote fidelity. Any `> "blockquote"` is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio; quotes that aren't actually in the sources are replaced with a `*[fabricated quote removed]*` marker and logged. The verifier runs these checks:
- **Quote fidelity**: blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.
- **Invalid source refs**: `[Source N]` where `N` exceeds the retrieved count is removed.
- **Banned phrases**: `production-ready`, `blazing fast`, `world-class`, `best-in-class` and friends are flagged; cheerleading is a regression, not a flourish.
- **Emoji headers and "Keep exploring" footers**: auto-stripped.
- **Structural compliance**: deep answers must lead with one root cause before any diagram or table.
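The first two checks can be sketched in a few lines. This is not the code in verify-answer.js. It assumes "word-overlap ratio" means the share of a quote's words that also occur in a chunk, and that chunks expose a `text` field. The helper names are hypothetical.

```javascript
const QUOTE_OVERLAP_THRESHOLD = 0.9; // ratio stated in the post
const FABRICATED_MARKER = "*[fabricated quote removed]*";

function words(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Assumed definition of "word-overlap ratio": the share of the quote's words
// that also occur somewhere in the chunk (word order is ignored).
function overlapRatio(quote, chunkText) {
  const quoteWords = words(quote);
  if (quoteWords.length === 0) return 0;
  const chunkWords = new Set(words(chunkText));
  return quoteWords.filter(w => chunkWords.has(w)).length / quoteWords.length;
}

// Replace every markdown blockquote line that no retrieved chunk supports.
function stripFabricatedQuotes(answer, chunks) {
  const removed = [];
  const cleaned = answer.split("\n").map(line => {
    const m = line.match(/^>\s*(.+)$/);
    if (!m) return line;
    const supported = chunks.some(c => overlapRatio(m[1], c.text) >= QUOTE_OVERLAP_THRESHOLD);
    if (supported) return line;
    removed.push(m[1]); // collected so the caller can log them
    return FABRICATED_MARKER;
  }).join("\n");
  return { cleaned, removed };
}

// Drop [Source N] references where N exceeds the number of retrieved chunks.
function stripInvalidSourceRefs(answer, retrievedCount) {
  return answer.replace(/\[Source (\d+)\]/g, (ref, n) =>
    Number(n) <= retrievedCount ? ref : "");
}

const demoChunks = [{ text: "BM25 saturates term frequency so repetition stops paying off." }];
const demoAnswer = [
  '> "BM25 saturates term frequency so repetition stops paying off."',
  '> "Vectors always beat keywords."',
].join("\n");
const demoResult = stripFabricatedQuotes(demoAnswer, demoChunks);
console.log(demoResult.removed); // [ '"Vectors always beat keywords."' ]
```

A bag-of-words overlap is deliberately forgiving about punctuation and casing, but it would also accept a quote that reshuffles the chunk's words. A stricter variant could compare word sequences instead.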
The gate runs automatically on proxy restart and as a git `pre-push` hook on guarded files. A change that drops the score below 90 does not ship.
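Git aborts a push when the `pre-push` hook exits with a non-zero status, so the gate reduces to mapping a score to an exit code. The sketch below assumes a report object with a numeric `score` field. That shape and the name `gateExitCode` are hypothetical, since the layout of the real report is not shown here.

```javascript
const QUALITY_FLOOR = 90; // hard floor stated in the post

// Hypothetical report shape: { score: number }.
// Returns the exit code a hook script would use: 0 lets the push through, 1 blocks it.
function gateExitCode(report, floor = QUALITY_FLOOR) {
  const score = report ? report.score : undefined;
  // Fail closed when the report is missing or unreadable.
  if (typeof score !== "number" || Number.isNaN(score)) return 1;
  return score >= floor ? 0 : 1;
}

console.log(gateExitCode({ score: 99 })); // 0
console.log(gateExitCode({ score: 89 })); // 1
console.log(gateExitCode({}));            // 1
```

A hook script would end with something like `process.exit(gateExitCode(report))` after loading the latest report.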
The Numbers
These are measured, not aspirational. They are generated from the live corpus and the latest eval reports:
| Metric | Value | Source |
|---|---|---|
| Chunks in corpus | 69,638 | live rag_chunks.json |
| Distinct sources | 30 | live rag_chunks.json |
| Retrieval | 20/20 (95.6%), Grade A | rag_eval_report.json |
| Answer-quality gate | 99/100 (floor 90/100) | quality_eval_report.json |
| Verifier unit tests | 33 | test-verifier.js |
What I'd Do Differently
Honesty section, because the failures are more useful than the wins:
- **Source recall is the weak spot.** Topic and keyword recall are both perfect, but source recall trails: the system finds the right *answer* but doesn't always surface every source that supports it. That's the next number to move.
- **The gold-standard gate is only four cases.** Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
- **Dense vectors are underused.** They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.
The pipeline isn't finished; no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" (all live at https://blog.r-lopes.com/how-it-works) is a real bar, measured on a real corpus, and the code above is exactly what produces it.