Day 9 - Sparse embedding continued - RAG

wpnews.pro

[ARTICLE · art-15961] src=dev.to ↗ pub=2026-05-28T02:52Z topic=natural-language-processing verified=true sentiment=· neutral

A developer exploring sparse embeddings for retrieval-augmented generation (RAG) detailed the progression from term frequency (TF) to inverse document frequency (IDF) and TF-IDF, culminating in the BM-25 algorithm. The engineer noted that IDF prioritizes rare words but can return irrelevant documents, while BM-25 improves upon TF-IDF for keyword matching. To overcome the limitations of sparse embeddings alone, the developer recommended a hybrid search combining dense embeddings for semantic similarity with BM-25 for keyword search.

1 min read · 12 views · published May 28, 2026

In the previous post, we saw some basic methods under sparse embeddings. Among them, term frequency (TF) had a drawback when the same words are repeated too often. To overcome this shortcoming of TF, the next method was introduced. Let's look at it and its successors in detail:

Inverse document frequency (IDF)

It measures how rare a word is across the input documents. Rare words get high priority, i.e. the fewer documents a word occurs in, the higher its value, and the more documents it occurs in, the lower its value.
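As a rough illustration of the idea, here is a minimal sketch of IDF in Python. It assumes the plain log(N / df) form and whitespace tokenization; the sample documents and the function name `idf` are made up for this example, and real libraries often use smoothed variants of the formula.

```python
import math

def idf(term, documents):
    """Plain IDF, log(N / df): N documents in total, df of them contain the term.

    Returns 0.0 when the term appears in no document (an assumption made
    here to avoid dividing by zero).
    """
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc.lower().split())
    if df == 0:
        return 0.0
    return math.log(n_docs / df)

# Hypothetical sample documents, only for illustration.
docs = [
    "the cluster runs on kubernetes",
    "the service writes logs to the disk",
    "the team reviews the logs daily",
]

idf_the = idf("the", docs)                # appears in every document
idf_logs = idf("logs", docs)              # appears in two documents
idf_kubernetes = idf("kubernetes", docs)  # appears in one document

print(idf_the, idf_logs, idf_kubernetes)
```

The word that appears in every document gets the lowest value, and the word that appears in only one document gets the highest.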

If I query for frequently occurring words (for which the IDF score is low), the results will not be that good. On the other hand, if I query for the rarest words (for which the IDF score is high), the results will be comparatively good.

Drawbacks of IDF

If I query for kubernetes and the word occurs in only one document, that particular document will be returned. There is a chance that the doc mentions kubernetes only once and does not describe it in detail. In such cases, the returned doc is not that useful.

TF-IDF

This combines both TF and IDF, i.e. for a word, its TF score is multiplied by its IDF score. A document that uses a rare word often therefore scores higher than one that mentions it only once.
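To make the multiplication concrete, here is a small sketch of TF-IDF for a single query word. It assumes TF is the word count divided by the document length and IDF is log(N / df); the helper names and sample documents are hypothetical, not from the original post.

```python
import math

def tokenize(text):
    # Simplistic whitespace tokenizer, assumed for this sketch.
    return text.lower().split()

def tf(term, tokens):
    # Share of the document's tokens that are the term.
    return tokens.count(term) / len(tokens) if tokens else 0.0

def idf(term, tokenized_docs):
    # log(N / df); 0.0 when no document contains the term.
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / df) if df else 0.0

def tf_idf_scores(term, documents):
    """Return one TF-IDF score per document for a single query term."""
    tokenized = [tokenize(doc) for doc in documents]
    term_idf = idf(term, tokenized)
    return [tf(term, tokens) * term_idf for tokens in tokenized]

# Hypothetical documents: the first is about kubernetes, the second
# only mentions it in passing, the third does not mention it at all.
docs = [
    "kubernetes schedules pods and kubernetes scales pods across nodes",
    "our release notes mention kubernetes once among many other unrelated topics",
    "the bakery sells bread and cakes",
]

scores = tf_idf_scores("kubernetes", docs)
print(scores)
```

With these inputs, the document that talks about kubernetes repeatedly scores above the one that mentions it in passing, which is the case plain IDF could not distinguish.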

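BM-25 (Best Match 25)

The next improved version of TF-IDF is the BM-25 algorithm. The 25 marks the 25th variant in a series of "best match" ranking functions. BM-25 limits how much repeated occurrences of a word can raise the score and normalizes for document length, so it may yield better results than TF-IDF.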
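For reference, a compact sketch of BM-25 scoring could look like the following. It assumes the commonly used defaults k1=1.5 and b=0.75, a smoothed IDF of the form log(1 + (N - df + 0.5) / (df + 0.5)), and whitespace tokenization. The function name and sample documents are illustrative only; production code would normally rely on a search engine or a dedicated library instead.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM-25 formula.

    k1 controls how quickly repeated terms stop adding to the score,
    b controls how strongly long documents are penalized.
    Assumes a non-empty list of documents.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            term_idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            freq = counts[term]
            # Length-normalized denominator; always positive because k1 > 0.
            denom = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += term_idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical documents, only for illustration.
docs = [
    "kubernetes schedules pods and kubernetes scales pods across nodes",
    "our release notes mention kubernetes once among many other unrelated topics",
    "the bakery sells bread and cakes",
]

scores = bm25_scores("kubernetes", docs)
print(scores)
```

Unlike plain TF-IDF, the contribution of a repeated word flattens out as its count grows, so stuffing a document with the same keyword yields diminishing returns.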

Since sparse embeddings only do keyword search, we cannot use them alone in a RAG pipeline. To get the best of both worlds, we need to combine dense embeddings (semantic similarity) with sparse embeddings (keyword search). This is called hybrid search. For dense embeddings we can use a sentence transformer, and for sparse embeddings we can use the BM-25 algorithm.
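One simple way to combine the two signals is a weighted sum of normalized scores. The sketch below is only one possible fusion strategy (rank-based fusion is another), and it makes several assumptions: the dense and sparse scores are already computed, the numbers shown are made-up placeholders standing in for cosine similarities from a sentence transformer and for BM-25 scores, and `alpha` is a hypothetical weighting parameter.

```python
def min_max(scores):
    """Rescale scores to the range 0..1 so both signals are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum of normalized dense and sparse scores.

    alpha=1.0 uses only the dense (semantic) scores,
    alpha=0.0 uses only the sparse (keyword) scores.
    """
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse must score the same documents")
    dense_norm = min_max(dense)
    sparse_norm = min_max(sparse)
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense_norm, sparse_norm)]

# Placeholder scores for three documents (not real model output).
dense_scores = [0.82, 0.40, 0.75]   # stand-in for cosine similarities
sparse_scores = [0.0, 2.1, 1.3]     # stand-in for BM-25 scores

combined = hybrid_scores(dense_scores, sparse_scores, alpha=0.5)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

In this toy example, the document that does reasonably well on both signals ends up ranked first, ahead of the documents that are strong on only one of them.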

source & further reading

dev.to: original article
