Metadata Filtering Before Vector Search: The Recall Win Nobody Measures

Metadata filtering before vector search is a cheap but often overlooked recall win. By applying a hard predicate on metadata like customer_id, the search space shrinks from millions to hundreds of chunks, preventing irrelevant-but-similar boilerplate from crowding out relevant results. Pre-filtering, supported by Pinecone, Qdrant, Weaviate, and pgvector, ensures top-k results come from the correct document set, while post-filtering silently starves recall.

A support agent asks the bot a question about a customer's enterprise contract. The bot retrieves the top 10 chunks by cosine similarity across the whole corpus. Two of them are from a different customer's contract that happens to use near-identical boilerplate language. The model writes a confident answer citing a clause that does not exist in the customer's actual agreement.

That is not a hallucination in the usual sense. The retrieval was working as designed. The vector index found the most semantically similar chunks, and boilerplate is semantically similar across customers by definition.

The fix is not a better embedding model. The fix is telling the index to never look at the other customer's documents in the first place. That is metadata filtering. Most teams treat it as an access-control checkbox. It is also one of the cheapest recall wins in the pipeline, and almost nobody measures it.

Your vector store holds chunks. Each chunk carries metadata: `customer_id`, `doc_type`, `language`, `indexed_at`, `team`. Pre-filtering means you apply a hard predicate on that metadata before the vector search ranks anything.

Without a filter, a query for "late payment penalty" searches all 2 million chunks and returns the 10 closest in embedding space. With a filter, you search only the chunks where `customer_id = 4417`, then rank those. The search space drops from 2 million to maybe 800 chunks. Every one of the top-10 results now comes from the right document set.

The recall framing matters here. Recall is the fraction of relevant chunks you manage to retrieve. When the corpus is full of near-duplicate boilerplate, irrelevant-but-similar chunks crowd out the relevant ones in the top-k. Cut the search space to the documents that can possibly be relevant, and the relevant chunks stop competing with look-alikes. The top-k fills with chunks that have a real chance of answering.

```python
# Qdrant: filter then search, in one call.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")


def search(query_vec, customer_id, k=10):
    # Hard predicate: only chunks whose payload matches this customer are candidates.
    flt = Filter(
        must=[
            FieldCondition(key="customer_id", match=MatchValue(value=customer_id)),
        ]
    )
    # Recent qdrant-client versions offer query_points as the preferred entry
    # point; check which one the version you run expects.
    return client.search(
        collection_name="contracts",
        query_vector=query_vec,
        query_filter=flt,
        limit=k,
    )
```

The query vector is the same. The only change is the predicate. The change in result quality is not subtle when your corpus has natural partitions.

There are two places to apply the predicate, and they are not equivalent.

Post-filter: run the vector search over the whole corpus, get the top-k, then drop the chunks that fail the predicate. The problem: if you ask for 10 and 9 of the top 10 belong to other customers, you keep one chunk. You wanted 10. You got 1. The relevant chunks were ranked at positions 40, 55, 71 and never made the candidate set.

Pre-filter: restrict the candidate set to chunks that pass the predicate, then rank within it. You always get up to 10 results from the right partition.

Post-filtering silently starves recall. It looks like it works in a demo where the test query happens to match the dominant partition. It collapses on the queries where the relevant partition is a small slice of the corpus, which is the case that matters for multi-tenant systems.

Most managed vector stores pre-filter when you pass the predicate into the search call. Filtering the result list yourself in application code is post-filtering. Read your store's docs to see which one the API gives you. [Pinecone](https://www.pinecone.io/), [Qdrant](https://qdrant.tech/), [Weaviate](https://weaviate.io/), and [pgvector](https://github.com/pgvector/pgvector) (with the right index) all support pre-filtering. The trap is filtering in Python after the fact and assuming it is the same thing.

Pre-filtering is not free, and the cost depends on how selective your predicate is.
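Selectivity here means the fraction of the corpus that passes the predicate. A minimal sketch for measuring it, assuming you can export chunk payloads as a list of dicts: the helper names `selectivity` and `selectivity_report` and the toy corpus are illustrative, not part of any vector store API. The toy numbers mirror the ones used in this post (a 95% English corpus, a tenant holding 0.04% of the chunks).

```python
from collections import Counter


def selectivity(payloads, field, value):
    """Fraction of chunks whose payload[field] equals value (0.0 if the corpus is empty)."""
    if not payloads:
        return 0.0
    matches = sum(1 for p in payloads if p.get(field) == value)
    return matches / len(payloads)


def selectivity_report(payloads, field, top=5):
    """Selectivity of the most common values of a field, largest share first."""
    counts = Counter(p.get(field) for p in payloads)
    total = len(payloads)
    return [(value, n / total) for value, n in counts.most_common(top)]


# Hypothetical corpus: 10,000 chunks, 95% English, 4 chunks for customer 4417.
payloads = [
    {
        "customer_id": 4417 if i < 4 else 1000 + i % 50,
        "language": "de" if i % 20 == 0 else "en",
    }
    for i in range(10_000)
]

print(selectivity(payloads, "language", "en"))      # 0.95
print(selectivity(payloads, "customer_id", 4417))   # 0.0004
```

On a real corpus you would likely get these counts from the store itself rather than from an export, but the ratio is the number that matters: close to 1.0 and the filter buys little, close to 0 and the filtered search needs index support.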
Approximate nearest-neighbor indexes like HNSW are graphs. The search walks the graph from entry points toward the query vector. A filter prunes nodes during the walk. When the filter keeps most of the graph, the walk works normally. When the filter keeps a tiny fraction, the walk hits dead ends. The reachable nodes that pass the filter become sparse, the graph traversal stalls, and you either get degraded recall or the engine falls back to a slow brute-force scan over the matching set.

This is the cardinality trap, and it cuts both ways.

`language = "en"` (when 95% of the corpus is English): the filter barely shrinks the search space. You pay the filter cost and gain almost nothing.

`customer_id = 4417` (when each customer is 0.04% of the corpus): the matching set is tiny and scattered through the HNSW graph. The graph walk degrades. Some engines auto-switch to brute force, which is fine at 800 chunks and a problem at 80,000.

The selectivity sweet spot is a predicate that removes most of the corpus but leaves a matching set large enough for the index to traverse and small enough to be worth filtering. For very selective tenant filters, a payload index (Qdrant), a partitioned collection, or a metadata-aware index config is what keeps the filtered search fast.

```python
# Qdrant: index the field you filter on, or the filtered search degrades to a scan at scale.
from qdrant_client.models import PayloadSchemaType

client.create_payload_index(
    collection_name="contracts",
    field_name="customer_id",
    # KEYWORD indexes string values. If customer_id is stored as an integer
    # (as in the 4417 example), use PayloadSchemaType.INTEGER instead.
    field_schema=PayloadSchemaType.KEYWORD,
)
```

Without that index, a `customer_id` filter on a large collection makes the engine scan every payload to find matches. With it, the engine knows the matching set up front and plans the search around it. The difference at 2 million chunks is tens of milliseconds versus seconds.

The win is real, but you have to measure it on your corpus, because it depends entirely on how partitioned your data is. A flat single-tenant knowledge base sees almost no lift.
A multi-tenant contract store sees a large one.

Build a gold set: real queries paired with the chunk IDs that actually answer them, labeled by hand. Then run the same queries twice, with and without the filter, and compute recall@k on each.

```python
from statistics import mean


def recall_at_k(retrieved_ids, gold_ids, k=10):
    # Fraction of the gold chunks that appear in the top-k retrieved IDs.
    if not gold_ids:
        return 0.0  # no labeled answers for this query
    top = set(retrieved_ids[:k])
    hits = top & set(gold_ids)
    return len(hits) / len(gold_ids)


def compare(gold_set, customer_lookup):
    # embed and search_no_filter come from your stack; ids pulls point IDs off
    # the results; customer_lookup maps a query ID to its customer ID.
    no_filter, with_filter = [], []
    for item in gold_set:
        q = embed(item["query"])
        cid = customer_lookup[item["query_id"]]
        unfiltered = search_no_filter(q, k=10)
        filtered = search(q, customer_id=cid, k=10)
        no_filter.append(recall_at_k(ids(unfiltered), item["gold_ids"]))
        with_filter.append(recall_at_k(ids(filtered), item["gold_ids"]))
    return {
        "recall_no_filter": mean(no_filter),
        "recall_with_filter": mean(with_filter),
    }
```

Run this before you ship the filter. If the lift is small, your corpus is not partitioned enough for filtering to matter and you should spend the effort on reranking instead. If the lift is large, you have found a recall win that costs one predicate and a payload index. Either way, you now have a number, which is more than most teams retrieving against a multi-tenant corpus can say.

Real queries carry more than one constraint. The agent wants the current English version of a contract for one customer. That is three predicates: `customer_id`, `language`, and a currency check (`status` is active, or `indexed_at` is recent).
```python
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range


def search_scoped(query_vec, customer_id, k=10):
    # All three conditions must hold; `client` is the QdrantClient created earlier.
    flt = Filter(
        must=[
            FieldCondition(key="customer_id", match=MatchValue(value=customer_id)),
            FieldCondition(key="language", match=MatchValue(value="en")),
            FieldCondition(key="status", match=MatchValue(value="active")),
            # Alternative currency check on a numeric timestamp field
            # (cutoff_ts is a hypothetical value you would supply):
            # FieldCondition(key="indexed_at", range=Range(gte=cutoff_ts)),
        ]
    )
    return client.search(
        collection_name="contracts",
        query_vector=query_vec,
        query_filter=flt,
        limit=k,
    )
```

Order the predicates by selectivity in your head, even if the engine reorders them internally. The most selective field (`customer_id`) does the heavy lifting. The others trim the remainder. Adding a low-selectivity predicate like `language` on top of a high-selectivity one is cheap. Stacking five low-selectivity predicates and expecting them to substitute for one good high-selectivity predicate is how teams end up with a filter that costs latency and returns the same noisy candidate set.

The metadata schema is the thing to get right early. You cannot filter on a field you did not store at ingestion. Carry `customer_id`, `doc_type`, `language`, `status`, and `indexed_at` through the chunking layer as payload on every chunk. Retrofitting metadata onto an indexed corpus means a full re-index, which is the expensive thing this whole technique is supposed to help you avoid.

Vector search gets the blog posts. The cosine similarity, the reranker, the query rewriter. Metadata filtering is plumbing, so it gets a checkbox in the access-control story and no place in the recall story.

That is the gap. For any corpus with natural partitions, the predicate you apply before the index runs does more for top-k quality than the third reranker you were about to add. It is cheaper, it is deterministic, and it is auditable: you can prove the bot never saw the other customer's contract, which is a sentence your security review will want to hear.

Store the metadata. Pre-filter, not post-filter. Index the high-selectivity field. Measure the lift on your own gold set.
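The "pre-filter, not post-filter" point can be reproduced without a vector store. The sketch below is a toy under stated assumptions: `Chunk`, `cosine`, `top_k`, `post_filter`, `pre_filter`, and `unit` are illustrative names, the embeddings are two-dimensional, and ranking is an exact brute-force sort standing in for the ANN index. The corpus is built so that boilerplate from other customers sits close to the query, as in the contract example.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    customer_id: int
    vector: tuple  # toy 2-D embedding


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(chunks, query_vec, k):
    # Exact ranking by cosine similarity (stand-in for the ANN index).
    return sorted(chunks, key=lambda c: cosine(c.vector, query_vec), reverse=True)[:k]


def post_filter(chunks, query_vec, customer_id, k):
    # Rank the whole corpus, then drop chunks that fail the predicate.
    return [c for c in top_k(chunks, query_vec, k) if c.customer_id == customer_id]


def pre_filter(chunks, query_vec, customer_id, k):
    # Restrict the candidate set first, then rank within it.
    candidates = [c for c in chunks if c.customer_id == customer_id]
    return top_k(candidates, query_vec, k)


def unit(angle):
    # Unit vector at `angle` radians; a smaller angle is closer to the query.
    return (math.cos(angle), math.sin(angle))


query = unit(0.0)

# Hypothetical corpus: 30 near-identical boilerplate chunks from other customers...
corpus = [Chunk(f"other-{i}", 1000 + i, unit(0.01 * i)) for i in range(1, 31)]
# ...plus four chunks for customer 4417, only one of which outranks the boilerplate.
corpus += [
    Chunk(f"c4417-{j}", 4417, unit(angle))
    for j, angle in enumerate((0.005, 0.5, 0.6, 0.7))
]

post = post_filter(corpus, query, customer_id=4417, k=10)
pre = pre_filter(corpus, query, customer_id=4417, k=10)

print(len(post), len(pre))  # prints: 1 4
```

Post-filtering asks for 10 and keeps one chunk, because nine of the global top 10 belong to other customers. Pre-filtering returns all four of the customer's chunks. Raising k before post-filtering narrows the gap, but only if you know in advance how deep the relevant chunks are buried.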
The recall win is sitting in a field you are probably already storing and not searching against.

[The RAG Pocket Guide](https://www.amazon.com/dp/B0GX2YDC5Z) covers the retrieval layer end to end: metadata schemas, hybrid filter-plus-vector search, the cardinality traps that degrade filtered ANN, and how to wire an eval that catches a recall regression before your users do. If your retrieval quality depends on a corpus with real partitions, the filtering chapter is where the cheap wins live.