97. Embeddings and Vector Search: Semantic Search That Works

A developer demonstrated how embeddings and vector search enable semantic matching by converting text into numerical vectors, where "cheap hotel" and "affordable accommodation" are geometrically close despite having no keyword overlap. Using the SentenceTransformer library, the developer showed that semantically similar sentences like "The cat sat on the mat." and "A feline rested on the rug." achieve a cosine similarity of 0.83, while unrelated sentences score far lower (0.12 in the example). The approach underlies search features in products such as ChatGPT, Notion AI, and GitHub Copilot, measuring semantic closeness through cosine similarity rather than exact keyword matches.

Traditional search works on keywords. You type "cheap hotel", and it looks for documents containing those exact words.

Someone asks "affordable accommodation near the beach". Your documents say "budget-friendly lodging by the coast". Zero keyword overlap. Zero results. Search fails.

Embeddings fix this. They convert text into vectors of numbers where similar meanings end up geometrically close. "Cheap" and "affordable" land near each other in vector space. "Hotel" and "accommodation" land near each other. Semantic similarity becomes distance.

This idea sits behind much of modern search: ChatGPT's memory, Notion AI, GitHub Copilot context.

An embedding is a dense vector of floating point numbers. Every piece of text maps to one vector. The key property: semantically similar texts have vectors that are close together in the embedding space.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a sentence embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Embed some sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Dogs love to play fetch.",
    "Machine learning is a subset of AI.",
    "Artificial intelligence includes ML.",
]

embeddings = model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence → {embeddings.shape[1]}-dimensional vector")
print(f"\nFirst embedding (first 8 dims): {embeddings[0][:8].round(4)}")
```

Output:

```
Embedding shape: (5, 384)
Each sentence → 384-dimensional vector

First embedding (first 8 dims): [ 0.0234 -0.1823  0.0912  0.3421 -0.0541  0.2134 -0.0823  0.1234]
```

384 numbers represent the meaning of an entire sentence. These numbers were learned during pretraining so that similar sentences produce similar vectors.

Raw Euclidean distance doesn't work well for text embeddings because it is sensitive to vector magnitude. Two long documents might have large vectors that are far apart even if they discuss the same topic.
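The magnitude problem is easy to check by hand. The sketch below uses made-up 3-dimensional vectors (not real embeddings) and two hypothetical helpers, `euclidean` and `cosine`, to show that scaling a vector moves it far away in Euclidean terms while leaving the angle unchanged. Cosine similarity itself is explained right after this example.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors only; real embeddings have hundreds of dimensions
short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]   # same direction as short_doc, 10x the magnitude
other_doc = [3.0, -1.0, 0.5]    # a different direction, similar magnitude

print(f"Euclidean short vs long:  {euclidean(short_doc, long_doc):.2f}")
print(f"Euclidean short vs other: {euclidean(short_doc, other_doc):.2f}")
print(f"Cosine short vs long:     {cosine(short_doc, long_doc):.3f}")
print(f"Cosine short vs other:    {cosine(short_doc, other_doc):.3f}")
```

By Euclidean distance, `short_doc` looks closer to `other_doc` than to its own scaled copy. By cosine, the scaled copy is a perfect match and `other_doc` is not.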
Cosine similarity measures the angle between vectors, not their magnitude. It ranges from -1 to 1. Same direction = 1. Perpendicular = 0. Opposite = -1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity  # used in later examples

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare all pairs
print("Cosine similarity between sentences:")
print(f"{'Pair':<55} {'Similarity'}")
print("-" * 70)

pairs = [
    (0, 1, "cat on mat vs feline on rug"),
    (0, 2, "cat on mat vs dogs play fetch"),
    (3, 4, "ML subset AI vs AI includes ML"),
    (0, 3, "cat on mat vs ML is AI"),
]

for i, j, desc in pairs:
    sim = cosine_sim(embeddings[i], embeddings[j])
    print(f"{desc:<55} {sim:.4f}")
```

Output:

```
Cosine similarity between sentences:
Pair                                                    Similarity
----------------------------------------------------------------------
cat on mat vs feline on rug                             0.8341
cat on mat vs dogs play fetch                           0.4123
ML subset AI vs AI includes ML                          0.8912
cat on mat vs ML is AI                                  0.1234
```

"Cat on mat" and "feline on rug" score 0.83. Same concept, different words.

"ML subset AI" and "AI includes ML" score 0.89. Semantically equivalent.

"Cat on mat" and "ML is AI" score 0.12. Completely different topics.

With word-level models like Word2Vec, a sentence vector is usually built by averaging word embeddings. That loses sentence structure. Sentence transformers produce one embedding for the entire sentence, trained on sentence-level tasks.

```python
from sentence_transformers import SentenceTransformer

# Popular embedding models
models_info = {
    'all-MiniLM-L6-v2': {
        'dim': 384,
        'size': '80MB',
        'speed': 'very fast',
        'quality': 'good',
        'note': 'Best starting point. Fast and accurate.'
    },
    'all-mpnet-base-v2': {
        'dim': 768,
        'size': '420MB',
        'speed': 'medium',
        'quality': 'excellent',
        'note': 'Best quality for semantic search.'
    },
    'paraphrase-multilingual-MiniLM-L12-v2': {
        'dim': 384,
        'size': '470MB',
        'speed': 'fast',
        'quality': 'good',
        'note': 'Supports 50+ languages.'
    },
    'text-embedding-3-small (OpenAI API)': {
        'dim': 1536,
        'size': 'API',
        'speed': 'API latency',
        'quality': 'very high',
        'note': 'Best quality. Costs per token.'
    }
}

print(f"{'Model':<45} {'Dim':<6} {'Size':<10} {'Quality'}")
print("-" * 70)
for name, info in models_info.items():
    print(f"{name:<45} {info['dim']:<6} {info['size']:<10} {info['quality']}")

# Load the recommended default
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
```

With a model chosen, a complete semantic search engine fits in one short class:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# A knowledge base of documents
documents = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "Machine learning algorithms learn patterns from data without being explicitly programmed.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "The transformer architecture uses self-attention mechanisms to process sequential data.",
    "BERT is a bidirectional transformer pretrained on masked language modeling.",
    "GPT uses a decoder-only transformer trained on next-token prediction.",
    "Fine-tuning adapts a pretrained model to a specific task using domain data.",
    "LoRA reduces the number of trainable parameters by using low-rank decomposition.",
    "Vector databases store embeddings and support fast nearest-neighbor search.",
    "RAG combines retrieval with generation to give LLMs access to external knowledge.",
    "Cosine similarity measures the angle between two vectors in embedding space.",
    "Tokenization breaks text into smaller units called tokens before feeding to a model.",
    "Backpropagation computes gradients by applying the chain rule backward through a network.",
    "Overfitting occurs when a model learns the training data too well and fails on new data.",
    "Cross-validation gives a more reliable estimate of model performance than a single split.",
]

class SemanticSearch:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents):
        self.documents = documents
        print(f"Encoding {len(documents)} documents...")
        self.embeddings = self.model.encode(documents, show_progress_bar=True)
        print(f"Indexed {len(documents)} documents. Embedding shape: {self.embeddings.shape}")

    def search(self, query, top_k=3):
        # Encode the query (as a batch of one, so the result is 2D)
        query_embedding = self.model.encode([query])

        # Compute cosine similarity with all documents
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]

        # Get top-k results, highest similarity first
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx],
                'index': idx
            })
        return results

# Build the search engine
search_engine = SemanticSearch()
search_engine.index(documents)

# Test queries
queries = [
    "How do transformers work?",
    "What is the difference between BERT and GPT?",
    "How can I make training more efficient?",
    "What happens when a model memorizes training data?",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = search_engine.search(query, top_k=3)
    for i, r in enumerate(results):
        print(f"  {i+1}. [{r['score']:.3f}] {r['document'][:80]}...")
```

Output:

```
Query: 'How do transformers work?'
------------------------------------------------------------
  1. [0.712] The transformer architecture uses self-attention mechanisms...
  2. [0.634] BERT is a bidirectional transformer pretrained on masked...
  3. [0.601] GPT uses a decoder-only transformer trained on next-token...

Query: 'What is the difference between BERT and GPT?'
------------------------------------------------------------
  1. [0.823] BERT is a bidirectional transformer pretrained on masked...
  2. [0.798] GPT uses a decoder-only transformer trained on next-token...
  3. [0.612] The transformer architecture uses self-attention mechanisms...

Query: 'How can I make training more efficient?'
------------------------------------------------------------
  1. [0.651] LoRA reduces the number of trainable parameters by using...
  2. [0.589] Fine-tuning adapts a pretrained model to a specific task...
  3. [0.534] Machine learning algorithms learn patterns from data...

Query: 'What happens when a model memorizes training data?'
------------------------------------------------------------
  1. [0.714] Overfitting occurs when a model learns the training data...
  2. [0.543] Cross-validation gives a more reliable estimate of model...
  3. [0.498] Fine-tuning adapts a pretrained model to a specific task...
```

The search finds semantically relevant documents even when the exact words don't match. "Make training more efficient" correctly retrieves the LoRA document, even though that document never contains the word "efficient".

The brute-force approach (compare the query to every document) works for thousands of documents. For millions, you need approximate nearest neighbor (ANN) search. FAISS (Facebook AI Similarity Search) is a widely used tool for this.

```bash
pip install faiss-cpu  # or faiss-gpu for GPU support
```

```python
import time

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Generate sample embeddings (simulating a large corpus)
# The model is what you would use on real text; this demo uses random vectors instead
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
dimension = 384  # all-MiniLM-L6-v2 embedding size

# Simulate 10,000 documents
np.random.seed(42)
fake_embeddings = np.random.randn(10000, dimension).astype('float32')

# Normalize for cosine similarity (FAISS uses inner product)
faiss.normalize_L2(fake_embeddings)

# Build FAISS index
# IndexFlatIP: exact inner product search (cosine similarity after L2 normalization)
index = faiss.IndexFlatIP(dimension)
index.add(fake_embeddings)
print(f"FAISS index size: {index.ntotal} vectors")

# Search
query_embedding = np.random.randn(1, dimension).astype('float32')
faiss.normalize_L2(query_embedding)

k = 5
distances, indices = index.search(query_embedding, k)

print(f"\nTop {k} nearest neighbors:")
for dist, idx in zip(distances[0], indices[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")

# For very large datasets: use an IVF index (approximate, faster)
# IVF = Inverted File Index, partitions space into clusters
n_clusters = 100  # number of partitions (sqrt of dataset size is a good rule)
quantizer = faiss.IndexFlatIP(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters, faiss.METRIC_INNER_PRODUCT)

# Must train IVF index before adding vectors
ivf_index.train(fake_embeddings)
ivf_index.add(fake_embeddings)

# Tune nprobe: how many clusters to search (higher = more accurate, slower)
ivf_index.nprobe = 10

distances_ivf, indices_ivf = ivf_index.search(query_embedding, k)
print(f"\nIVF index results (approximate but faster):")
for dist, idx in zip(distances_ivf[0], indices_ivf[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")

# Benchmark: exact vs approximate
# Exact search
start = time.time()
for _ in range(100):
    index.search(query_embedding, k)
exact_time = (time.time() - start) / 100

# Approximate search
start = time.time()
for _ in range(100):
    ivf_index.search(query_embedding, k)
approx_time = (time.time() - start) / 100

print(f"\nSearch time per query:")
print(f"  Exact (IndexFlatIP): {exact_time * 1000:.2f}ms")
print(f"  Approximate (IVF):   {approx_time * 1000:.2f}ms")
print(f"  Speedup: {exact_time / approx_time:.1f}x")
```

FAISS is powerful but low-level. ChromaDB adds persistence, metadata filtering, and a clean API, which makes it a good fit for production use.
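To make "metadata filtering" concrete before the real library appears, here is a toy in-memory sketch. It is not how ChromaDB is implemented. `TinyVectorStore`, its methods, and the 2-dimensional vectors are all hypothetical, chosen only to show the pattern of filtering by metadata first and then ranking the survivors by cosine similarity.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Hypothetical toy store holding (id, vector, metadata) records in a list."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, vector, metadata):
        self.records.append((doc_id, vector, metadata))

    def query(self, vector, n_results=3, where=None):
        where = where or {}
        # Step 1: keep only records whose metadata matches every key in `where`
        candidates = [
            (doc_id, vec)
            for doc_id, vec, meta in self.records
            if all(meta.get(key) == value for key, value in where.items())
        ]
        # Step 2: rank the remaining records by cosine similarity to the query
        scored = [(doc_id, cosine(vector, vec)) for doc_id, vec in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n_results]

store = TinyVectorStore()
store.add("doc1", [1.0, 0.0], {"difficulty": "beginner"})
store.add("doc2", [0.9, 0.1], {"difficulty": "advanced"})
store.add("doc3", [0.0, 1.0], {"difficulty": "advanced"})

query_vec = [1.0, 0.0]
unfiltered = store.query(query_vec, n_results=2)
filtered = store.query(query_vec, n_results=2, where={"difficulty": "advanced"})

print("Unfiltered:", [doc_id for doc_id, _ in unfiltered])
print("Advanced only:", [doc_id for doc_id, _ in filtered])
```

The filter changes which documents are eligible, not how they are scored. A real vector database does the same job with an index instead of a linear scan.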
```bash
pip install chromadb
```

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Create a ChromaDB client
# In-memory; use chromadb.PersistentClient(path='./chroma_db') for persistence
client = chromadb.Client()

# Create a collection
collection = client.create_collection(
    name='ml_knowledge_base',
    metadata={'hnsw:space': 'cosine'}  # use cosine similarity
)

# Your documents with metadata
docs = [
    {
        'id': 'doc1',
        'text': 'Python is a high-level programming language known for simplicity.',
        'metadata': {'topic': 'programming', 'difficulty': 'beginner'}
    },
    {
        'id': 'doc2',
        'text': 'Machine learning algorithms learn patterns from data.',
        'metadata': {'topic': 'ml', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc3',
        'text': 'Neural networks are inspired by biological neural networks.',
        'metadata': {'topic': 'deep_learning', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc4',
        'text': 'BERT is a bidirectional transformer pretrained on MLM.',
        'metadata': {'topic': 'nlp', 'difficulty': 'advanced'}
    },
    {
        'id': 'doc5',
        'text': 'LoRA reduces trainable parameters using low-rank decomposition.',
        'metadata': {'topic': 'fine_tuning', 'difficulty': 'advanced'}
    },
]

# Add documents
# ChromaDB can use its own embedding model, or you can provide embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

collection.add(
    ids=[d['id'] for d in docs],
    documents=[d['text'] for d in docs],
    embeddings=[model.encode(d['text']).tolist() for d in docs],
    metadatas=[d['metadata'] for d in docs]
)

print(f"Collection size: {collection.count()}")

# Basic query
results = collection.query(
    query_embeddings=[model.encode("How do transformers work?").tolist()],
    n_results=3
)

print("\nQuery: 'How do transformers work?'")
for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0])):
    # ChromaDB returns cosine distance; convert to similarity
    print(f"  {i+1}. [{1 - dist:.3f}] {doc}")

# Filter by metadata
results_filtered = collection.query(
    query_embeddings=[model.encode("machine learning concepts").tolist()],
    n_results=3,
    where={'difficulty': 'advanced'}  # only return advanced documents
)

print("\nQuery with filter (difficulty=advanced):")
for doc, meta in zip(results_filtered['documents'][0], results_filtered['metadatas'][0]):
    print(f"  [{meta['topic']}] {doc}")

# Update and delete
collection.update(
    ids=['doc1'],
    documents=['Python is a versatile high-level programming language.'],
    embeddings=[model.encode('Python is a versatile high-level programming language.').tolist()]
)

collection.delete(ids=['doc5'])
print(f"\nAfter update and delete: {collection.count()} documents")
```

For larger corpora, encode in batches:

```python
from sentence_transformers import SentenceTransformer
import numpy as np
import time

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Simulate a large dataset
large_corpus = [f"This is document number {i} about topic {i % 10}." for i in range(5000)]

# Efficient batch encoding
print("Encoding 5000 documents...")
start = time.time()

embeddings = model.encode(
    large_corpus,
    batch_size=64,              # process 64 at a time
    show_progress_bar=True,
    normalize_embeddings=True   # L2 normalize for cosine similarity
)

elapsed = time.time() - start
print(f"\nDone in {elapsed:.1f}s")
print(f"Speed: {len(large_corpus) / elapsed:.0f} docs/second")
print(f"Embeddings shape: {embeddings.shape}")
```

Not all embedding models perform equally on all tasks. Test before committing.
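The comparison that follows uses a fixed cutoff of 0.5 to decide whether a pair counts as similar. That cutoff is itself a choice worth testing. As a minimal sketch, assuming you already have similarity scores and labels, the hypothetical `accuracy_at` helper below sweeps several thresholds. The scores here are invented for illustration and are not output from any model.

```python
def accuracy_at(threshold, similarities, labels):
    # Predict "similar" (1) when the score exceeds the threshold, else "different" (0)
    preds = [1 if s > threshold else 0 for s in similarities]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Invented scores and labels, for illustration only
similarities = [0.82, 0.47, 0.15, 0.66, 0.08, 0.58, 0.44, 0.21]
labels = [1, 1, 0, 1, 0, 1, 1, 0]

thresholds = [0.3, 0.4, 0.5, 0.6]
for t in thresholds:
    print(f"threshold={t:.1f}  accuracy={accuracy_at(t, similarities, labels):.1%}")

# max() returns the first threshold that reaches the highest accuracy
best = max(thresholds, key=lambda t: accuracy_at(t, similarities, labels))
print(f"Best threshold on this toy data: {best}")
```

On this toy data, lowering the cutoff from 0.5 to 0.4 recovers two true pairs that scored just under 0.5. With real models the right cutoff depends on the model and the data, so pick it on a held-out set.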
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_embeddings(model_name, test_pairs):
    """
    test_pairs: list of (sent1, sent2, label)
    where label=1 means similar, 0 means different
    """
    model = SentenceTransformer(model_name)

    sents1 = [p[0] for p in test_pairs]
    sents2 = [p[1] for p in test_pairs]
    labels = [p[2] for p in test_pairs]

    emb1 = model.encode(sents1)
    emb2 = model.encode(sents2)

    similarities = [
        cosine_similarity([e1], [e2])[0][0]
        for e1, e2 in zip(emb1, emb2)
    ]

    # Threshold at 0.5 to predict similar/different
    preds = [1 if s > 0.5 else 0 for s in similarities]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)

    return accuracy, similarities

test_pairs = [
    ("cheap hotel", "affordable accommodation", 1),
    ("machine learning", "artificial intelligence", 1),
    ("cat on the mat", "deep learning model", 0),
    ("how to code in python", "python programming tutorial", 1),
    ("stock market crash", "cooking recipes", 0),
    ("neural network", "deep learning", 1),
    ("fix bug in code", "debug software", 1),
    ("the weather today", "quantum physics research", 0),
]

for model_name in ['sentence-transformers/all-MiniLM-L6-v2',
                   'sentence-transformers/all-mpnet-base-v2']:
    acc, sims = evaluate_embeddings(model_name, test_pairs)
    print(f"\n{model_name.split('/')[-1]}:")
    print(f"  Accuracy on test pairs: {acc:.1%}")
    for (s1, s2, label), sim in zip(test_pairs, sims):
        status = 'correct' if (sim > 0.5) == label else 'WRONG'
        print(f"  {status} sim={sim:.3f} | '{s1[:25]}' vs '{s2[:25]}'")
```

```python
# Pattern 1: Asymmetric search (short queries matched against longer passages)
# Useful when queries are short questions and documents are long passages.
# This model was trained on MS MARCO query-passage pairs.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-v4')

# Documents
passages = [
    "LoRA stands for Low-Rank Adaptation and is used for efficient fine-tuning.",
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Python was created by Guido van Rossum and first released in 1991.",
]

# Short query
query = "What is LoRA?"

query_emb = bi_encoder.encode(query)
passage_embs = bi_encoder.encode(passages)

sims = cosine_similarity([query_emb], passage_embs)[0]
top = np.argmax(sims)

print(f"Query: '{query}'")
print(f"Best match ({sims[top]:.3f}): '{passages[top]}'")
```

```python
# Pattern 2: Clustering embeddings to find topics
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "Python is great for data science.",
    "R is used for statistical computing.",
    "Machine learning requires lots of data.",
    "Deep learning uses neural networks.",
    "Java is widely used in enterprise software.",
    "JavaScript powers the web frontend.",
    "Supervised learning uses labeled data.",
    "Unsupervised learning finds hidden patterns.",
]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

print("\nClustered sentences:")
for cluster_id in range(3):
    print(f"\nCluster {cluster_id}:")
    for sent, label in zip(sentences, labels):
        if label == cluster_id:
            print(f"  - {sent}")
```

| Concept | What it means |
|---|---|
| Embedding | Dense vector representing text semantics |
| Cosine similarity | Angle between vectors. 1=same direction, 0=orthogonal, -1=opposite |
| L2 normalization | Scale vectors to unit length before cosine/dot product |
| FAISS IndexFlatIP | Exact search with inner product (cosine after L2 norm) |
| FAISS IVF | Approximate search, partitions space into clusters |
| ChromaDB | Vector database with persistence and metadata filtering |
| nprobe | FAISS IVF: number of clusters to search. Higher=more accurate, slower |
| Batch encoding | Encode many texts at once for efficiency |

| Task | Code |
|---|---|
| Load model | `SentenceTransformer('all-MiniLM-L6-v2')` |
| Encode text | `model.encode(texts, normalize_embeddings=True)` |
| Cosine similarity | `cosine_similarity([query_emb], doc_embs)[0]` |
| FAISS exact | `faiss.IndexFlatIP(dim)` |
| FAISS approximate | `faiss.IndexIVFFlat(quantizer, dim, n_clusters)` |
| ChromaDB add | `collection.add(ids, documents, embeddings, metadatas)` |
| ChromaDB search | `collection.query(query_embeddings, n_results=5)` |
| Top-k results | `np.argsort(similarities)[::-1][:k]` |

Level 1: Build a semantic search engine on a topic you care about. Gather 30+ paragraphs of text (Wikipedia articles, blog posts, documentation). Encode them with `all-MiniLM-L6-v2`. Search for 5 different queries and print the top 3 results with similarity scores. Are the results actually relevant?

Level 2: Compare two embedding models (`all-MiniLM-L6-v2` vs `all-mpnet-base-v2`) on the same 20 query-document pairs. Which one finds more relevant results? Is the quality difference worth the size difference?

Level 3: Build a ChromaDB-backed search engine that indexes 200+ documents with metadata (category, date, author). Implement both semantic search and filtered search (find documents from category X that are semantically similar to query Y). Add a function that returns results above a similarity threshold and rejects everything below.

Next up, Post 98: RAG: Give Your AI Access to Your Documents. Retrieval Augmented Generation combines semantic search with LLM generation. Ask questions about any document and get accurate, grounded answers.