[ARTICLE · art-15961] src=dev.to pub= topic=natural-language-processing verified=true sentiment=· neutral

Day 9 - Sparse embedding continued - RAG

A developer exploring sparse embeddings for retrieval-augmented generation (RAG) detailed the progression from term frequency (TF) to inverse document frequency (IDF) and TF-IDF, culminating in the BM-25 algorithm. The engineer noted that IDF prioritizes rare words but can return irrelevant documents, while BM-25 improves upon TF-IDF for keyword matching. To overcome the limitations of sparse embeddings alone, the developer recommended a hybrid search combining dense embeddings for semantic similarity with BM-25 for keyword search.

1 min read · published May 28, 2026

In the previous post, we looked at some basic methods for sparse embeddings. One of them, term frequency (TF), has a drawback when the same words are repeated too often. The next methods were introduced to overcome this shortcoming. Let's look at them in detail:

Inverse document frequency (IDF)

IDF measures how rare a word is across the input documents. Rare words get high priority: if a word occurs in few documents, its IDF value is high, and if it occurs in many documents, its IDF value is low.
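
As a rough illustration of the idea above, here is a minimal sketch that uses the plain log(N / df) form of IDF. The function name, the whitespace tokenization, and the three sample documents are assumptions made for this example; libraries usually apply smoothed variants of the formula.

```python
import math

def inverse_document_frequency(term, documents):
    """Plain IDF: log(N / df), where df is the number of documents containing the term."""
    n_docs = len(documents)
    # Assumption: naive lowercase + whitespace tokenization, good enough for a sketch.
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0  # Assumption: a term that never occurs gets a score of 0.
    return math.log(n_docs / doc_freq)

# Hypothetical sample documents.
docs = [
    "the cluster runs on kubernetes",
    "the service writes logs to the disk",
    "the team reviews the logs daily",
]

for word in ("the", "logs", "kubernetes"):
    print(word, round(inverse_document_frequency(word, docs), 3))
```

In this toy corpus "the" appears in every document and scores lowest, while "kubernetes" appears in only one document and scores highest.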

If I query for frequently occurring words (which have a low IDF score), the results will not be very good. If I query for a rare word (which has a high IDF score), the results will be comparatively good.

Drawbacks of IDF

Suppose I query for Kubernetes and the word occurs in only one document. That document will be returned. But the document may mention Kubernetes just once without describing it in any detail, in which case the returned document is not very useful.

TF-IDF

TF-IDF combines TF and IDF: for each word, its TF score is multiplied by its IDF score.
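
The multiplication can be sketched as follows. This is a minimal example that assumes a length-normalized TF and the plain log(N / df) IDF; the function name `tf_idf` and the sample corpus are hypothetical, not taken from the original post.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF of one term in one document: (count / doc length) * log(N / df)."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    idf = math.log(len(corpus_tokens) / df) if df else 0.0
    return tf * idf

# Hypothetical corpus: one document about kubernetes, one passing mention, one unrelated.
corpus = [doc.split() for doc in [
    "kubernetes schedules pods and kubernetes scales pods",
    "the release notes mention kubernetes once among many other tools",
    "the team writes release notes every week",
]]

scores = [tf_idf("kubernetes", tokens, corpus) for tokens in corpus]
print(scores)
```

Because TF is part of the score, the document that mentions the word only in passing ranks below the one that discusses it repeatedly, which softens the IDF drawback described earlier.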

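BM-25 (Best Match 25)

BM-25 is the next improvement over TF-IDF. The 25 refers to the 25th iteration in the series of Best Match ranking functions developed for the Okapi retrieval system. It often yields better results than TF-IDF because it dampens the effect of repeated terms and normalizes for document length.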
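
For readers who want to see the scoring in code, below is a minimal sketch of one common BM25 variant. The defaults `k1=1.5` and `b=0.75`, the smoothed IDF, the whitespace tokenization, and the sample corpus are all assumptions for illustration; real libraries differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant.

    Assumes a non-empty corpus of already tokenized documents.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    scores = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        score = 0.0
        for term in query.split():
            df = doc_freq.get(term, 0)
            if df == 0:
                continue  # A term absent from the corpus contributes nothing.
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = counts[term]
            # k1 caps the gain from repeated terms; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpus, tokenized by whitespace.
corpus_tokens = [doc.split() for doc in [
    "kubernetes schedules pods and kubernetes scales pods",
    "the release notes mention kubernetes once among many other tools",
    "the team writes release notes every week",
]]

print(bm25_scores("kubernetes pods", corpus_tokens))
```

The document that covers both query terms scores highest, the passing mention scores lower, and the unrelated document scores zero.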

Because sparse embeddings only do keyword search, we cannot rely on them alone in a RAG pipeline. To get the best of both worlds, we combine dense embeddings (semantic similarity) with sparse embeddings (keyword search). This is called hybrid search. For dense embeddings we can use a sentence transformer, and for sparse embeddings we can use the BM-25 algorithm.
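
One simple way to combine the two signals is a weighted sum of normalized scores, sketched below. This fusion rule is an assumption, not something the post prescribes, and other schemes such as reciprocal rank fusion are also common. The score lists are made-up placeholders: in a real pipeline the dense scores would come from cosine similarity between sentence-transformer embeddings and the sparse scores from BM-25.

```python
def min_max(scores):
    """Rescale scores to the range [0, 1] so both signals are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight of the dense signal."""
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse must score the same documents")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense_n, sparse_n)]

# Made-up placeholder scores for three documents (not real model output).
dense = [0.82, 0.40, 0.75]   # stand-in for semantic similarity
sparse = [0.0, 2.1, 1.9]     # stand-in for BM-25 keyword scores

combined = hybrid_scores(dense, sparse)
ranking = sorted(range(len(combined)), key=combined.__getitem__, reverse=True)
print(ranking)
```

With these placeholder values, the third document ranks first because it does well on both signals, even though it is not the top result for either one alone.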
