RAG - Sparse Embedding

Sparse embeddings represent text as high-dimensional vectors indexed by a vocabulary dictionary. Most entries are zero, and the non-zero entries mark the tokens that are present, either as binary flags or as weights from methods like TF-IDF and BM25. While effective for exact keyword matching, sparse embeddings alone fail to capture semantic meaning: a query for "car" will not match a document that only says "automobile". Modern RAG systems address this limitation by combining sparse embeddings with dense embeddings in a hybrid search approach to improve retrieval quality.

Sparse means thinly spread, scattered, or not dense. In sparse embeddings, chunks are converted into tokens, and each position in the vector corresponds to one entry in the vocabulary dictionary. If the token for that entry appears in the chunk, the position is assigned 1; otherwise, it is assigned 0.

Example: 0,0,0,1,0,0,1,0,...

If the vocabulary dictionary contains 10,000 words, the vector representation will also contain 10,000 dimensions. For a particular chunk, only a few positions contain 1 and most other positions contain 0.

Unlike dense embeddings, which fill every dimension with a learned continuous value, sparse embeddings are mostly zeros, and their non-zero values depend on token occurrence and frequency. Sparse embeddings are mainly used for direct text matching and keyword-based retrieval. They are most useful when a query needs to match the exact terms that appear in the text.

In the basic sparse approach, each token is recorded only as present (1) or absent (0). This is similar to one-hot encoding, except that a single chunk can set several positions to 1. The main drawback is that it does not consider how many times a word appears in the document. For example, if the word “database” appears 20 times and another word appears only once, both receive the same representation. To solve this problem, the concept of token weighting was introduced.

TF stands for Term Frequency. It measures how frequently a term appears in a document. A common form of the formula is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

TF gives higher importance to terms that appear more frequently in a document. The problem with TF is that commonly occurring words may receive very high importance even if they are not meaningful, for example function words such as “the”, “is”, and “and”. These words appear frequently in most documents but do not provide strong contextual meaning. To solve this issue, IDF was introduced.

IDF stands for Inverse Document Frequency. It measures how rare a word is across the entire document collection, on the assumption that rarer words are more informative. A common form of the formula is:

IDF(t) = log(N / df(t))

where N is the total number of documents and df(t) is the number of documents that contain term t.

IDF alone does not determine how relevant a document is to the user query. It only measures the rarity of terms across documents. To improve retrieval quality, TF and IDF are combined.
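The binary representation, TF, and IDF described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the function names, and the three sample chunks are assumptions made for the example, and real systems use more careful tokenization and often a smoothed IDF.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(chunks):
    # Sorted list of the unique tokens found across all chunks.
    return sorted({tok for chunk in chunks for tok in tokenize(chunk)})


def binary_vector(chunk, vocabulary):
    # One position per vocabulary entry: 1 if the token occurs in the chunk, else 0.
    present = set(tokenize(chunk))
    return [1 if term in present else 0 for term in vocabulary]


def term_frequency(term, chunk):
    # Occurrences of the term divided by the total number of tokens in the chunk.
    tokens = tokenize(chunk)
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def inverse_document_frequency(term, chunks):
    # log(N / df): N chunks in total, df of them contain the term.
    containing = sum(1 for chunk in chunks if term in set(tokenize(chunk)))
    return math.log(len(chunks) / containing) if containing else 0.0


# Illustrative sample data, made up for this sketch.
chunks = [
    "the database stores the data",
    "the car is fast",
    "the automobile is red",
]
vocabulary = build_vocabulary(chunks)

print(vocabulary)
print(binary_vector(chunks[0], vocabulary))
print(term_frequency("the", chunks[0]))
print(inverse_document_frequency("the", chunks))
print(inverse_document_frequency("database", chunks))
```

In this sample, “the” has the highest term frequency in the first chunk, but because it occurs in every chunk its IDF is zero, while “database” occurs in only one chunk and gets a positive IDF. That is the imbalance the weighting is meant to correct.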
TF-IDF combines Term Frequency (TF) and Inverse Document Frequency (IDF). The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

TF-IDF works well for many traditional search systems because it balances how often a term appears in a document against how rare that term is across the whole collection. However, TF-IDF still does not capture semantic meaning.

BM25 is an advanced ranking algorithm used in sparse retrieval systems. It improves upon TF-IDF by considering term frequency saturation (repeated occurrences of a term add progressively less to the score) and document length normalization (long documents are not favored simply for containing more words). BM25 is one of the most commonly used algorithms in traditional search engines and sparse retrieval systems.

Sparse embeddings alone are usually not enough to retrieve highly relevant documents in modern RAG systems because they mainly focus on exact keyword matching rather than semantic meaning. For example, a query containing "car" will not match a document that only uses "automobile", even though the meanings are similar.

To improve retrieval quality, modern systems combine sparse embeddings with dense embeddings. This approach is called hybrid search. Dense embeddings help with semantic understanding, while sparse embeddings help with exact keyword matching. Together, they provide better retrieval performance in RAG applications.
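To make the BM25 and hybrid search ideas concrete, here is a small Python sketch. It uses one widely used BM25 formulation (the variant with 1 added inside the logarithm of the IDF term) and commonly chosen values k1 = 1.5 and b = 0.75. The whitespace tokenizer, the function names, the weighted-sum fusion with `alpha`, and especially the dense similarity scores are assumptions for illustration only. In a real system the dense scores would come from an embedding model, and other fusion methods are also used.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    # Scores every document against the query; assumes a non-empty collection.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue  # the exact term is absent, so it contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 controls term frequency saturation, b controls length normalization.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def hybrid_scores(sparse, dense, alpha=0.5):
    # Weighted sum of max-normalized sparse scores and dense similarities.
    top = max(sparse) or 1.0  # avoid dividing by zero when nothing matched
    return [alpha * (s / top) + (1 - alpha) * d for s, d in zip(sparse, dense)]


# Illustrative sample data, made up for this sketch.
documents = [
    "the car is fast",
    "the automobile is red",
    "the database stores the data",
]
sparse = bm25_scores("car", documents)

# Hypothetical dense similarities; not produced by any real embedding model.
dense = [0.9, 0.8, 0.1]

print(sparse)
print(hybrid_scores(sparse, dense))
```

With the query "car", BM25 gives the "automobile" document a score of zero, the same as the unrelated database document. Once the assumed dense similarities are blended in, the "automobile" document ranks above the database document, which is the behavior hybrid search is meant to provide.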