Vector Databases Are Not Magic, Here's What's Actually Happening Under the Hood

A developer explains that vector databases are not magic but rely on approximate nearest neighbor (ANN) search algorithms like HNSW and IVF. The post details how these algorithms trade accuracy for speed and highlights common pitfalls such as parameter misconfiguration and data distribution shifts that can degrade performance in production.

You've seen the tutorials. Spin up Pinecone, call `.upsert`, do a similarity search, ship it. Everyone claps. The demo works.

Then you take it to production and it starts lying to you. Results that look semantically relevant but aren't. Queries that should match something and return nothing. Latency that makes your users think the app crashed. And the worst part - you don't know why, because the vector database feels like a black box with a fancy API.

This article is about opening that box.

Let's be honest about what "vector database" means, because the term is doing a lot of marketing work right now. At its core, a vector database is an index optimized for approximate nearest neighbor (ANN) search over high-dimensional float arrays. That's it. The "database" part - persistence, CRUD, filtering, transactions - is infrastructure wrapped around that core capability.

When you store an embedding, you're storing a point in N-dimensional space (typically 768, 1536, or 3072 dimensions, depending on your model). When you query, you're asking: "which stored points are closest to this query point, by some distance metric?"

The challenge? Exact nearest neighbor search at scale is O(N·D) - linear in your corpus size times the dimensionality. For a million 1536-dim vectors, that's ~1.5 billion multiply-adds per query, scanning roughly 6 GB of float32 data. At millisecond latency requirements, that's a hard no.

ANN algorithms trade a small amount of accuracy for massive speed gains. Understanding this trade-off is the first thing most tutorials skip - and it's where production bugs hide.
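To make the linear cost concrete, here is a minimal brute-force sketch of the exact search that ANN indexes approximate. It assumes NumPy and a small random stand-in corpus; the function name `exact_top_k` and the data are illustrative, not part of any library.

```python
import numpy as np

def exact_top_k(corpus: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact nearest neighbors by cosine similarity; touches every stored vector."""
    # Normalize so that a dot product equals cosine similarity.
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = corpus_n @ query_n        # N * D multiply-adds per query
    return np.argsort(-scores)[:k]     # ids of the k most similar vectors

rng = np.random.default_rng(0)
corpus = rng.normal(size=(2_000, 64)).astype(np.float32)  # small stand-in corpus
# Hypothetical query: a slightly perturbed copy of stored vector 42.
query = corpus[42] + 0.01 * rng.normal(size=64).astype(np.float32)

top = exact_top_k(corpus, query, k=5)
```

Every query reads all N rows, which is why this stops scaling. It stays useful as the ground truth when you measure how much recall an ANN index gives up.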
The algorithm your vector DB uses to build its index determines everything: speed, recall, memory usage, and how it degrades under pressure.

HNSW (Hierarchical Navigable Small World) is what most modern vector DBs use by default (Qdrant, Weaviate, Milvus, pgvector with an HNSW index). HNSW builds a multi-layer graph where sparse upper layers hold long-range links for coarse navigation and the bottom layer holds every vector for fine-grained search.

Think of it like a highway system. You jump on the highway (top layer), drive toward your destination, exit at the right interchange, and then use local streets (bottom layer) for precision.

Key parameters you need to know:

```python
# Qdrant example
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed: a local Qdrant instance

client.create_collection(
    collection_name="my_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # Number of edges per node. Higher = better recall, more memory
        ef_construct=100,  # Construction-time beam width. Higher = better index quality, slower build
    ),
)

# At query time
results = client.search(
    collection_name="my_docs",
    query_vector=query_embedding,  # assumed: a 1536-dim list of floats
    limit=10,
    search_params=SearchParams(hnsw_ef=128),  # Runtime beam width. Higher = better recall, slower query
)
```

`m` and `ef_construct` are set at build time and can't change without rebuilding your index. If you're seeing poor recall in production and you set `m=4` to save memory, that's your culprit.

IVF (inverted file index) is used by FAISS and is available as an option in pgvector. It divides the vector space into Voronoi cells (clusters), assigns vectors to their nearest centroid, then searches only a subset of cells at query time.

```python
# FAISS IVF example
import faiss
import numpy as np

dimension = 1536
n_clusters = 1024  # Number of Voronoi cells

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters)

# Must train before adding vectors
# training_vectors / corpus_vectors: float32 arrays of shape (n, dimension)
index.train(training_vectors)  # Needs representative data
index.add(corpus_vectors)

# nprobe = how many cells to search. More = better recall, slower
index.nprobe = 32
# query_vector: float32 array of shape (n_queries, dimension)
distances, indices = index.search(query_vector, k=10)
```

IVF gotcha: the cluster centroids are learned during training.
If your data distribution shifts significantly (new document types, different topics), your centroid structure becomes suboptimal and recall tanks. You don't get an error. You just quietly get worse results.

Most people use cosine similarity because the tutorial said so. Here's when that's wrong.

| Metric | Formula | Use When |
|---|---|---|
| Cosine | 1 - (A·B) / (‖A‖‖B‖) | Direction matters, magnitude doesn't. Good for normalized text embeddings |
| Dot Product | -(A·B) | Embeddings are already normalized (OpenAI's are). Faster than cosine |
| Euclidean (L2) | ‖A-B‖ | Magnitude carries meaning. Image embeddings, some multimodal models |

OpenAI's `text-embedding-3-*` embeddings are normalized to unit length. Cosine similarity on unit vectors is mathematically equivalent to dot product. Using cosine adds a normalization step that's pure overhead.

```python
# If you're using OpenAI embeddings, use dot product
# In Qdrant:
VectorParams(size=1536, distance=Distance.DOT)
```

```sql
-- In pgvector: use <=> for cosine, <#> for negative inner product (dot), <-> for L2
-- query_embedding stands for the bound query vector
SELECT content, embedding <#> query_embedding AS score
FROM documents
ORDER BY score
LIMIT 10;
```

The difference in latency is small at low scale. At 10M+ vectors, it's measurable.

Here's a thing that will haunt you: your ANN search does not always return the true nearest neighbors. It returns approximate nearest neighbors. That's the A in ANN. By definition, you may miss results that should have ranked in your top-K.

How bad is it? It depends on your index config and your data. You can measure it:

```python
from qdrant_client import QdrantClient

def measure_recall(client: QdrantClient, collection_name, test_queries, ground_truth_ids, k=10):
    """
    Compare ANN results against brute-force exact search.
    ground_truth_ids: list of lists, true top-k ids per query
    """
    hits = 0
    total = len(test_queries) * k
    for query, true_ids in zip(test_queries, ground_truth_ids):
        ann_results = client.search(
            collection_name=collection_name,
            query_vector=query,
            limit=k,
        )
        ann_ids = {r.id for r in ann_results}
        hits += len(ann_ids & set(true_ids))
    return hits / total  # recall@k

# A well-tuned index should hit 0.95+ recall@10
# If you're at 0.85 or below, tune ef or m
```

Production target: **≥ 0.95 recall@10**. Anything below that and your RAG pipeline is silently missing relevant context before GPT-4 ever sees it.

Pure vector search has a well-known failure mode: it doesn't handle rare terms well. If your corpus contains "RFC 7807 Problem Details" or a specific error code like `E_INVALIDARG (0x80070057)`, embedding similarity will dilute the match across semantically adjacent concepts. A user querying for the exact string gets mushy results.

The solution is **hybrid search**: combine dense vector search with sparse BM25-style keyword search, then fuse the rankings.
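The fusion step itself needs no database. As a minimal sketch, assuming two already-ranked lists of document ids, Reciprocal Rank Fusion adds 1 / (k + rank) for every list a document appears in. The helper name `rrf_fuse` and the ids are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists with Reciprocal Rank Fusion (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense (embedding) results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical sparse (keyword) results

fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents found by both retrievers (`doc_a`, `doc_c`) rise above those found by only one, and only ranks are used, so the two score scales never have to be reconciled.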
```python
from fastembed import SparseTextEmbedding, TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, Fusion, FusionQuery, Prefetch,
    SparseIndexParams, SparseVector, SparseVectorParams, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")  # assumed: a local Qdrant instance

# Qdrant supports both dense and sparse vectors natively
client.create_collection(
    collection_name="hybrid_docs",
    vectors_config={
        # bge-small-en-v1.5 produces 384-dim vectors, so the size must match
        "dense": VectorParams(size=384, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(index=SparseIndexParams(on_disk=False)),
    },
)

# At insert time, generate both representations
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

text = "RFC 7807 Problem Details for HTTP APIs"
dense_vec = list(dense_model.embed([text]))[0]
sparse_vec = list(sparse_model.embed([text]))[0]

# At query time, use Reciprocal Rank Fusion (RRF)
results = client.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        Prefetch(query=dense_vec.tolist(), using="dense", limit=20),
        Prefetch(
            query=SparseVector(
                indices=sparse_vec.indices.tolist(),
                values=sparse_vec.values.tolist(),
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)
```

RRF (Reciprocal Rank Fusion) combines the rank lists without needing score normalization. The formula is simple:

`RRF_score(d) = Σ_i 1 / (k + rank_i(d))`

Here `k` is a constant (usually 60) and `rank_i(d)` is the document's rank in the i-th result list. Documents appearing in both lists get a significant boost.

Hybrid search consistently outperforms pure dense search on real-world corpora by 5–15% on NDCG@10 - especially for domain-specific or technical content.

Vector DBs let you filter by metadata before or after the ANN search. This sounds simple. It's actually one of the most common performance footguns.

**Pre-filtering (filter before ANN):** Apply your metadata filter first, reduce the candidate set, then run ANN on the smaller set.
Problem: if your filter is very selective (e.g., `user_id = "abc123"` in a multi-tenant system), the candidate set might be tiny. HNSW graph navigation assumes a large, connected graph. A sparse subgraph destroys recall.

**Post-filtering (ANN then filter):** Run ANN on the full corpus, retrieve top-N, then apply the filter. You need to over-fetch significantly to compensate for filtered-out results.

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

# Qdrant handles this with "indexed" payload fields
# Always index fields you filter on
client.create_payload_index(
    collection_name="my_docs",
    field_name="tenant_id",
    field_schema="keyword",  # or "integer", "float", "geo"
)

# Qdrant uses a smart filtering strategy:
# If filter is selective → brute force on the filtered set
# If filter is broad → HNSW traversal that checks the filter as it goes
# It decides automatically based on cardinality estimates
results = client.search(
    collection_name="my_docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="tenant_id", match=MatchValue(value="abc123"))]
    ),
    limit=10,
)
```

Rule of thumb: if your filter reduces the corpus below ~1000 vectors, you're effectively doing brute-force search. That's fine - just know it and set expectations accordingly.

This isn't vector DB internals, but it's so deeply related that skipping it would be malpractice. Your retrieval quality is bounded by your chunking quality. The vector DB can only return what you gave it.
Most tutorials show:

```python
# The naïve approach that everyone copies
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(document)
```

The problems: a fixed character count ignores where the meaning of the text actually shifts, so a chunk can cut an idea in half, and a small chunk on its own may lack the surrounding context the LLM needs.

Better: semantic chunking

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split when semantic shift exceeds 95th percentile
)
chunks = splitter.split_text(document)
```

This embeds sentences, calculates cosine distance between adjacent sentence pairs, and splits at significant semantic shifts.

Even better: store both chunk and parent document

```python
# "Small-to-big" or "Parent Document Retrieval"
# Store small chunks for precise matching
# But return the parent document (or larger window) as context
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # assumed: an existing LangChain vector store
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

Small chunks match with high precision. The returned context is the larger parent - so your LLM gets enough surrounding information to reason correctly.
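Going back to semantic chunking for a moment: the splitting rule (cosine distance between adjacent sentence embeddings, cut where it exceeds a percentile) can be sketched in plain NumPy. This illustrates the idea and is not LangChain's implementation. The toy 2-D vectors stand in for real sentence embeddings, and `semantic_breakpoints` and `split_sentences` are hypothetical helper names.

```python
import numpy as np

def semantic_breakpoints(sentence_vecs: np.ndarray, percentile: float = 95.0) -> list[int]:
    """Return sentence indices at which a new chunk should start."""
    unit = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the one that follows it.
    distances = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)
    threshold = np.percentile(distances, percentile)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

def split_sentences(sentences: list[str], sentence_vecs: np.ndarray,
                    percentile: float = 95.0) -> list[str]:
    """Join sentences into chunks, cutting at the semantic breakpoints."""
    starts = [0] + semantic_breakpoints(sentence_vecs, percentile) + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(starts, starts[1:])]

# Toy data: three sentences on one topic, then three on another.
sentences = ["A1.", "A2.", "A3.", "B1.", "B2.", "B3."]
vecs = np.array([
    [1.00, 0.00], [0.90, 0.10], [0.95, 0.05],   # topic A
    [0.05, 1.00], [0.10, 0.90], [0.00, 1.00],   # topic B
])

breaks = semantic_breakpoints(vecs)
chunks = split_sentences(sentences, vecs)
```

With a percentile threshold, the number of cuts depends on the distribution of distances within each document rather than on a fixed character count, which is the property the fixed-size splitter lacks.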
If you're not measuring this stuff, you're flying blind:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetrievalTrace:
    query: str
    query_embedding_ms: float
    search_ms: float
    num_results: int
    top_score: float
    bottom_score: float
    score_spread: float  # top - bottom; low spread = retrieval is uncertain
    filter_applied: Optional[dict]
    collection_name: str

def traced_search(client, collection_name, query_text, embed_fn, k=5, query_filter=None):
    t0 = time.perf_counter()
    embedding = embed_fn(query_text)
    embed_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    results = client.search(
        collection_name=collection_name,
        query_vector=embedding,
        limit=k,
        query_filter=query_filter,
    )
    search_ms = (time.perf_counter() - t1) * 1000

    scores = [r.score for r in results]
    trace = RetrievalTrace(
        query=query_text,
        query_embedding_ms=embed_ms,
        search_ms=search_ms,
        num_results=len(results),
        top_score=scores[0] if scores else 0,
        bottom_score=scores[-1] if scores else 0,
        score_spread=(scores[0] - scores[-1]) if len(scores) > 1 else 0,
        filter_applied=query_filter,
        collection_name=collection_name,
    )
    # Ship to your observability stack (Datadog, Langfuse, custom)
    log_trace(trace)  # placeholder: replace with your own logging call
    return results
```

What to watch:

- `score_spread` near 0 means all results look equally similar - the query probably didn't match anything well
- `top_score` below your threshold (tune per model, but ~0.75 for cosine is a reasonable starting floor) means you're returning noise

Quick opinionated guide for 2026:

| Scenario | Recommendation |
|---|---|
| Prototype / hobby | ChromaDB (in-process, zero infra) |
| Production, self-hosted | Qdrant (best performance, Rust core, Docker-native) |
| Already on Postgres | pgvector + pgvectorscale |
| Enterprise, managed | Pinecone or Weaviate Cloud |
| Need multimodal (text + image) | Weaviate or Milvus |
| Massive scale (100M+ vectors) | Milvus or Pinecone |

Don't use a vector DB for everything.
If your corpus is under ~10,000 documents, cosine search over an in-memory numpy array with `np.dot` is fast enough and eliminates an entire infrastructure dependency.

```python
import numpy as np

corpus_embeddings = np.load("embeddings.npy")  # shape: (N, 1536)
query_embedding = np.array(embed(query))       # shape: (1536,); embed is your embedding function

# Cosine similarity (assuming normalized vectors)
scores = corpus_embeddings @ query_embedding
top_k_indices = np.argsort(scores)[::-1][:10]
```

No database. No network calls. No ops burden. Just math.

Pull all of this together and you get a mental model for diagnosing RAG failures: measure recall and tune `ef` / `nprobe`, confirm the distance metric matches your embeddings, add hybrid search for rare terms, index the fields you filter on, and revisit your chunking.

Vector databases are not magic retrieval oracles. They're approximate spatial indexes with a product wrapper. Once you understand the approximation, the trade-offs, and the failure modes - you can actually build reliable systems with them.

If this was useful, I write about Python backend and AI engineering on dev.to. The good stuff is in the details.