{"slug": "day-9-sparse-embedding-continued-rag", "title": "Day 9 - Sparse embedding continued - RAG", "summary": "A developer exploring sparse embeddings for retrieval-augmented generation (RAG) detailed the progression from term frequency (TF) to inverse document frequency (IDF) and TF-IDF, culminating in the BM-25 algorithm. The engineer noted that IDF prioritizes rare words but can return irrelevant documents, while BM-25 improves upon TF-IDF for keyword matching. To overcome the limitations of sparse embeddings alone, the developer recommended a hybrid search combining dense embeddings for semantic similarity with BM-25 for keyword search.", "body_md": "In the previous post, we looked at some basic methods under sparse embeddings. One of them, term frequency (TF), had a drawback: words that are repeated very often score high regardless of how informative they are. The next method was introduced to overcome this shortcoming. Let us look at it in detail:\n\n**Inverse document frequency (IDF)**\n\nIDF measures how rare a word is across the input documents. Rare words get high priority, i.e. if a word occurs in few documents its value is high, and if it occurs in many documents its value is low.\n\nIf I query for frequently occurring words (for which the IDF score is low), the results will not be that good. On the other hand, if I query for a rare word (for which the IDF score is high), the results will be comparatively good.\n\n**Drawbacks of IDF**\n\nIf I query for kubernetes and the word occurs in only one document, that particular document will be returned. It may be that the document mentions kubernetes once but does not describe it in detail. In such cases, the returned document is not that useful.\n\n**TF-IDF**\n\nThis combines both TF and IDF, i.e. for each word, its TF score is multiplied by its IDF score.\n\n**BM-25 (Best match-25)**\n\nThe next improved version of TF-IDF is the BM-25 algorithm. 
The 25 in the name is a version number from the Okapi BM (Best Matching) family of ranking functions, not a count of matching words. BM-25 may yield better results than TF-IDF, since it saturates the contribution of repeated terms and normalizes for document length.\n\nBecause sparse embeddings (s.e.) only do keyword search, they miss semantically related text that uses different words, so using s.e. alone in a RAG pipeline is usually not enough. To get the best of both worlds, we combine dense embeddings (semantic similarity) with sparse embeddings (keyword search). This is called **hybrid search**. For dense embeddings we can use a sentence transformer, and for sparse embeddings we can use the BM-25 algorithm.", "url": "https://wpnews.pro/news/day-9-sparse-embedding-continued-rag", "canonical_source": "https://dev.to/indumathi__r/day-9-sparse-embedding-continued-rag-3nj5", "published_at": "2026-05-28 02:52:45+00:00", "updated_at": "2026-05-28 03:23:38.590221+00:00", "lang": "en", "topics": ["natural-language-processing", "machine-learning", "artificial-intelligence", "large-language-models", "ai-research"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/day-9-sparse-embedding-continued-rag", "markdown": "https://wpnews.pro/news/day-9-sparse-embedding-continued-rag.md", "text": "https://wpnews.pro/news/day-9-sparse-embedding-continued-rag.txt", "jsonld": "https://wpnews.pro/news/day-9-sparse-embedding-continued-rag.jsonld"}}