{"slug": "metadata-filtering-before-vector-search-the-recall-win-nobody-measures", "title": "Metadata Filtering Before Vector Search: The Recall Win Nobody Measures", "summary": "Metadata filtering before vector search is a cheap but often overlooked recall win. By applying a hard predicate on metadata like customer_id, the search space shrinks from millions to hundreds of chunks, preventing irrelevant-but-similar boilerplate from crowding out relevant results. Pre-filtering, supported by Pinecone, Qdrant, Weaviate, and pgvector, ensures top-k results come from the correct document set, while post-filtering silently starves recall.", "body_md": "A support agent asks the bot a question about a customer's enterprise contract. The bot retrieves the top 10 chunks by cosine similarity across the whole corpus. Two of them are from a different customer's contract that happens to use near-identical boilerplate language. The model writes a confident answer citing a clause that does not exist in the customer's actual agreement.\n\nThat is not a hallucination in the usual sense. The retrieval was working as designed. The vector index found the most semantically similar chunks, and boilerplate is semantically similar across customers by definition. The fix is not a better embedding model. The fix is telling the index to never look at the other customer's documents in the first place.\n\nThat is metadata filtering. Most teams treat it as an access-control checkbox. It is also one of the cheapest recall wins in the pipeline, and almost nobody measures it.\n\nYour vector store holds chunks. Each chunk carries metadata: `customer_id`, `doc_type`, `language`, `indexed_at`, `team`. Pre-filtering means you apply a hard predicate on that metadata before the vector search ranks anything.\n\nWithout a filter, a query for \"late payment penalty\" searches all 2 million chunks and returns the 10 closest in embedding space. 
With a filter, you search only the chunks where `customer_id = 4417`, then rank those. The search space drops from 2 million to maybe 800 chunks. Every one of the top-10 results now comes from the right document set.\n\nThe recall framing matters here. Recall is the fraction of relevant chunks you manage to retrieve. When the corpus is full of near-duplicate boilerplate, irrelevant-but-similar chunks crowd out the relevant ones in the top-k. Cut the search space to the documents that can possibly be relevant, and the relevant chunks stop competing with look-alikes. The top-k fills with chunks that have a real chance of answering.\n\n``` python\n# Qdrant: filter then search, in one call.\nfrom qdrant_client import QdrantClient\nfrom qdrant_client.models import (\n    Filter, FieldCondition, MatchValue,\n)\n\nclient = QdrantClient(url=\"http://localhost:6333\")\n\ndef search(query_vec, customer_id, k=10):\n    flt = Filter(\n        must=[\n            FieldCondition(\n                key=\"customer_id\",\n                match=MatchValue(value=customer_id),\n            ),\n        ]\n    )\n    return client.search(\n        collection_name=\"contracts\",\n        query_vector=query_vec,\n        query_filter=flt,\n        limit=k,\n    )\n```\n\nThe query vector is the same. The only change is the predicate. The change in result quality is not subtle when your corpus has natural partitions.\n\nThere are two places to apply the predicate, and they are not equivalent.\n\n**Post-filter:** run the vector search over the whole corpus, get the top-k, then drop the chunks that fail the predicate. The problem: if you ask for 10 and 9 of the top 10 belong to other customers, you keep one chunk. You wanted 10. You got 1. The relevant chunks were ranked at positions 40, 55, 71 and never made the candidate set.\n\n**Pre-filter:** restrict the candidate set to chunks that pass the predicate, then rank within it. 
You always get up to 10 results from the right partition.\n\nPost-filtering silently starves recall. It looks like it works in a demo where the test query happens to match the dominant partition. It collapses on the queries where the relevant partition is a small slice of the corpus, which is the case that matters for multi-tenant systems.\n\nMost managed vector stores do pre-filtering when you pass the predicate into the search call. You get post-filtering when you filter the result list yourself in application code. Read your store's docs for which one the API gives you. [Pinecone](https://www.pinecone.io/), [Qdrant](https://qdrant.tech/), [Weaviate](https://weaviate.io/), and [pgvector](https://github.com/pgvector/pgvector) with the right index all support pre-filtering. The trap is filtering in Python after the fact and assuming it is the same thing.\n\nPre-filtering is not free, and the cost depends on how selective your predicate is.\n\nApproximate nearest-neighbor indexes like HNSW are graphs. The search walks the graph from entry points toward the query vector. A filter prunes nodes during the walk. When the filter keeps most of the graph, the walk works normally. When the filter keeps a tiny fraction, the walk hits dead ends. The reachable nodes that pass the filter become sparse, the graph traversal stalls, and you either get degraded recall or the engine falls back to a slow brute-force scan over the matching set.\n\nThis is the cardinality trap, and it cuts both ways.\n\n- **Low-selectivity filter** (`language = \"en\"` when 95% of the corpus is English): the filter barely shrinks the search space. You pay the filter cost and gain almost nothing.\n- **High-selectivity filter** (`customer_id = 4417` when each customer is 0.04% of the corpus): the matching set is tiny and scattered through the HNSW graph. The graph walk degrades. 
Some engines auto-switch to brute force, which is fine at 800 chunks and a problem at 80,000.\n\nThe selectivity sweet spot is a predicate that removes most of the corpus but leaves a matching set large enough for the index to traverse and small enough to be worth filtering. For very selective tenant filters, a payload index (Qdrant), a partitioned collection, or a metadata-aware index config is what keeps the filtered search fast.\n\n``` python\n# Qdrant: index the field you filter on, or the\n# filtered search degrades to a scan at scale.\nfrom qdrant_client.models import PayloadSchemaType\n\nclient.create_payload_index(\n    collection_name=\"contracts\",\n    field_name=\"customer_id\",\n    field_schema=PayloadSchemaType.KEYWORD,\n)\n```\n\nWithout that index, a `customer_id` filter on a large collection makes the engine scan every payload to find matches. With it, the engine knows the matching set up front and plans the search around it. The difference at 2 million chunks is tens of milliseconds versus seconds.\n\nThe win is real, but you have to measure it on your corpus, because it depends entirely on how partitioned your data is. A flat single-tenant knowledge base sees almost no lift. A multi-tenant contract store sees a large one.\n\nBuild a gold set: real queries paired with the chunk IDs that actually answer them, labeled by hand. 
Then run the same queries twice, with and without the filter, and compute recall@k on each.\n\n``` python\ndef recall_at_k(retrieved_ids, gold_ids, k=10):\n    # Fraction of the gold chunks that show up in the top-k.\n    gold = set(gold_ids)\n    if not gold:\n        return 0.0  # no labeled answers, nothing to recall\n    top = set(retrieved_ids[:k])\n    return len(top & gold) / len(gold)\n\ndef compare(gold_set, customer_lookup):\n    # embed, mean, search_no_filter come from your\n    # stack; ids() pulls point IDs off the results.\n    no_filter, with_filter = [], []\n    for item in gold_set:\n        q = embed(item[\"query\"])\n        cid = customer_lookup[item[\"query_id\"]]\n\n        unfiltered = search_no_filter(q, k=10)\n        filtered = search(q, customer_id=cid, k=10)\n\n        no_filter.append(\n            recall_at_k(ids(unfiltered), item[\"gold_ids\"])\n        )\n        with_filter.append(\n            recall_at_k(ids(filtered), item[\"gold_ids\"])\n        )\n\n    return {\n        \"recall_no_filter\": mean(no_filter),\n        \"recall_with_filter\": mean(with_filter),\n    }\n```\n\nRun this before you ship the filter. If the lift is small, your corpus is not partitioned enough for filtering to matter and you should spend the effort on reranking instead. If the lift is large, you have found a recall win that costs one predicate and a payload index. Either way, you now have a number, which is more than most teams retrieving against a multi-tenant corpus can say.\n\nReal queries carry more than one constraint. The agent wants the current English version of a contract for one customer. 
That is three predicates: `customer_id`, `language`, and an active `status`.\n\n``` python\nfrom qdrant_client.models import (\n    Filter, FieldCondition, MatchValue,\n)\n\ndef search_scoped(query_vec, customer_id, k=10):\n    # All three conditions must hold for a chunk to be a candidate.\n    flt = Filter(\n        must=[\n            FieldCondition(\n                key=\"customer_id\",\n                match=MatchValue(value=customer_id),\n            ),\n            FieldCondition(\n                key=\"language\",\n                match=MatchValue(value=\"en\"),\n            ),\n            FieldCondition(\n                key=\"status\",\n                match=MatchValue(value=\"active\"),\n            ),\n        ]\n    )\n    return client.search(\n        collection_name=\"contracts\",\n        query_vector=query_vec,\n        query_filter=flt,\n        limit=k,\n    )\n```\n\nOrder the predicates by selectivity in your head, even if the engine reorders them internally. The most selective field (`customer_id`) does the heavy lifting. The others trim the remainder. Adding a low-selectivity predicate like `language` on top of a high-selectivity one is cheap. Stacking five low-selectivity predicates and expecting them to substitute for one good high-selectivity predicate is how teams end up with a filter that costs latency and returns the same noisy candidate set.\n\nThe metadata schema is the thing to get right early. You cannot filter on a field you did not store at ingestion. Carry `customer_id`, `doc_type`, `language`, `status`, and `indexed_at` through the chunking layer as payload on every chunk. Retrofitting metadata onto an indexed corpus means a full re-index, which is the expensive thing this whole technique is supposed to help you avoid.\n\nVector search gets the blog posts. The cosine similarity, the reranker, the query rewriter. Metadata filtering is plumbing, so it gets a checkbox in the access-control story and no place in the recall story.\n\nThat is the gap. 
For any corpus with natural partitions, the predicate you apply before the index runs does more for top-k quality than the third reranker you were about to add. It is cheaper, it is deterministic, and it is auditable: you can prove the bot never saw the other customer's contract, which is a sentence your security review will want to hear.\n\nStore the metadata. Pre-filter, not post-filter. Index the high-selectivity field. Measure the lift on your own gold set. The recall win is sitting in a field you are probably already storing and not searching against.\n\nThe [RAG Pocket Guide](https://www.amazon.com/dp/B0GX2YDC5Z) covers the retrieval layer end to end — metadata schemas, hybrid filter-plus-vector search, the cardinality traps that degrade filtered ANN, and how to wire an eval that catches a recall regression before your users do. If your retrieval quality depends on a corpus with real partitions, the filtering chapter is where the cheap wins live.", "url": "https://wpnews.pro/news/metadata-filtering-before-vector-search-the-recall-win-nobody-measures", "canonical_source": "https://dev.to/gabrielanhaia/metadata-filtering-before-vector-search-the-recall-win-nobody-measures-3c2f", "published_at": "2026-06-13 10:36:06+00:00", "updated_at": "2026-06-13 10:47:43.024647+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "natural-language-processing", "ai-infrastructure", "developer-tools"], "entities": ["Pinecone", "Qdrant", "Weaviate", "pgvector"], "alternates": {"html": "https://wpnews.pro/news/metadata-filtering-before-vector-search-the-recall-win-nobody-measures", "markdown": "https://wpnews.pro/news/metadata-filtering-before-vector-search-the-recall-win-nobody-measures.md", "text": "https://wpnews.pro/news/metadata-filtering-before-vector-search-the-recall-win-nobody-measures.txt", "jsonld": "https://wpnews.pro/news/metadata-filtering-before-vector-search-the-recall-win-nobody-measures.jsonld"}}