In the previous post, we saw some basic methodologies under sparse embeddings. In that, term frequency(TF) had a fallback when same words are repeated too often. To overcome the shortcomings of TF, next method was introduced. We shall see them in detail:
Inverse document frequency(IDF) It determines how less frequent a word occurs in the input documents. It calculates how the rare the word is. Rare word is of high priority. i.e If the word occurs less frequent, then the value will be high and if if the word occurs more frequently, then the value will be low.
If i ask query about frequently occurring words (for which IDF score is low), results will not be that good. On the other hand, if i ask query about rarest word(IDF score is high), results will be comparatively good. Drawbacks of IDF
If i ask query about kubernetes and if the word is occurring only in one document , that particular document will be returned. There will be chances where the doc will have mention of kubernetes once but does not describe in detail about it. In such cases, returned doc is not that useful TF-IDF
This combines both TF and IDF. i.e For a word its TF score will be multiplied with IDF.
BM-25(Best match-25) Next improved version of TF-IDF is BM-25 algorithm. 25 refers to top 25 matching words. This may yield better result when compared to TF-IDF.
As sparse embedding(s.e) does keyword search, We cannot use s.e alone in a RAG pipeline. To make the best of both worlds, we need to combine dense embeddings (semantic similarity) and sparse embedding(keyword search). This is called hybrid search For dense embedding we can use sentence transformer and for sparse embedding we can use BM-25 algorithm.