Metadata Filtering Before Vector Search: The Recall Win Nobody Measures

A support agent asks the bot a question about a customer's enterprise contract. The bot retrieves the top 10 chunks by cosine similarity across the whole corpus. Two of them are from a different customer's contract that happens to use near-identical boilerplate language. The model writes a confident answer citing a clause that does not exist in the customer's actual agreement.

That is not a hallucination in the usual sense. The retrieval was working as designed. The vector index found the most semantically similar chunks, and boilerplate is semantically similar across customers by definition. The fix is not a better embedding model. The fix is telling the index to never look at the other customer's documents in the first place.

That is metadata filtering. Most teams treat it as an access-control checkbox. It is also one of the cheapest recall wins in the pipeline, and almost nobody measures it.

Your vector store holds chunks. Each chunk carries metadata: customer_id, doc_type, language, indexed_at, team. Pre-filtering means you apply a hard predicate on that metadata before the vector search ranks anything.

Without a filter, a query for "late payment penalty" searches all 2 million chunks and returns the 10 closest in embedding space. With a filter, you search only the chunks where customer_id = 4417, then rank those. The search space drops from 2 million to maybe 800 chunks. Every one of the top-10 results now comes from the right document set.

The recall framing matters here. Recall is the fraction of relevant chunks you manage to retrieve. When the corpus is full of near-duplicate boilerplate, irrelevant-but-similar chunks crowd out the relevant ones in the top-k. Cut the search space to the documents that can possibly be relevant, and the relevant chunks stop competing with look-alikes. The top-k fills with chunks that have a real chance of answering.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

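def search(query_vec, customer_id, k=10):
    # Hard predicate: only chunks whose payload has this customer_id are candidates.
    flt = Filter(
        must=[
            FieldCondition(
                key="customer_id",
                match=MatchValue(value=customer_id),
            ),
        ]
    )
    # Passing the filter into the search call lets the engine apply it
    # during the search instead of after it.
    # Note: newer qdrant-client releases deprecate search() in favor of
    # query_points(); check which one your installed version expects.
    return client.search(
        collection_name="contracts",
        query_vector=query_vec,
        query_filter=flt,
        limit=k,
    )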

The query vector is the same. The only change is the predicate. When your corpus has natural partitions, the change in result quality is not subtle.

There are two places to apply the predicate, and they are not equivalent.

Post-filter: run the vector search over the whole corpus, get the top-k, then drop the chunks that fail the predicate. The problem: if you ask for 10 and 9 of the top 10 belong to other customers, you keep one chunk. You wanted 10. You got 1. The relevant chunks were ranked at positions 40, 55, 71 and never made the candidate set.

Pre-filter: restrict the candidate set to chunks that pass the predicate, then rank within it. You always get up to 10 results from the right partition.

Post-filtering silently starves recall. It looks like it works in a demo where the test query happens to match the dominant partition. It collapses on the queries where the relevant partition is a small slice of the corpus, which is the case that matters for multi-tenant systems.
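The difference between the two is easy to reproduce without a vector store. The following is a minimal sketch, not a benchmark: it uses exact cosine ranking in plain Python, and the toy corpus, the 2-D vectors, and the helper names (cosine, top_k, post_filter, pre_filter) are all made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k):
    # Exact (brute-force) ranking; a real store would use an ANN index here.
    ranked = sorted(chunks, key=lambda c: cosine(query, c["vec"]), reverse=True)
    return ranked[:k]

def post_filter(query, chunks, customer_id, k):
    # Rank the whole corpus first, then drop chunks that fail the predicate.
    return [c for c in top_k(query, chunks, k) if c["customer_id"] == customer_id]

def pre_filter(query, chunks, customer_id, k):
    # Restrict the candidate set first, then rank inside it.
    allowed = [c for c in chunks if c["customer_id"] == customer_id]
    return top_k(query, allowed, k)

# Toy corpus (made up): 50 boilerplate chunks from other customers sit very
# close to the query; customer 4417 has one close chunk and 19 farther ones.
query = (1.0, 0.0)
corpus = [{"customer_id": 1000 + i, "vec": (1.0, 0.001 * i)} for i in range(1, 51)]
corpus.append({"customer_id": 4417, "vec": (1.0, 0.0005)})
corpus += [{"customer_id": 4417, "vec": (1.0, 0.5 + 0.01 * j)} for j in range(19)]

post = post_filter(query, corpus, customer_id=4417, k=10)
pre = pre_filter(query, corpus, customer_id=4417, k=10)
print(len(post), len(pre))
```

On this toy corpus the post-filter keeps a single chunk out of the 10 requested, while the pre-filter returns 10 chunks, all from customer 4417. Asking the post-filter for a larger k only moves the problem: you have to guess how deep the right partition starts.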

Most managed vector stores do pre-filtering when you pass the predicate into the search call, and post-filtering when you filter the result list yourself in application code. Read your store's docs for which one the API gives you. Pinecone, Qdrant, Weaviate, and pgvector with the right index all support pre-filtering. The trap is filtering in Python after the fact and assuming it is the same thing.

Pre-filtering is not free, and the cost depends on how selective your predicate is.

Approximate nearest-neighbor indexes like HNSW are graphs. The search walks the graph from entry points toward the query vector. A filter prunes nodes during the walk. When the filter keeps most of the graph, the walk works normally. When the filter keeps a tiny fraction, the walk hits dead ends. The reachable nodes that pass the filter become sparse, the graph traversal stalls, and you either get degraded recall or the engine falls back to a slow brute-force scan over the matching set.

This is the cardinality trap, and it cuts both ways.

Low selectivity (language = "en" when 95% of the corpus is English): the filter barely shrinks the search space. You pay the filter cost and gain almost nothing.

High selectivity (customer_id = 4417 when each customer is 0.04% of the corpus): the matching set is tiny and scattered through the HNSW graph. The graph walk degrades. Some engines auto-switch to brute force, which is fine at 800 chunks and a problem at 80,000.

The selectivity sweet spot is a predicate that removes most of the corpus but leaves a matching set large enough for the index to traverse and small enough to be worth filtering. For very selective tenant filters, a payload index (Qdrant), a partitioned collection, or a metadata-aware index config is what keeps the filtered search fast.
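The trade-off can be written down as a rough decision rule. This is a sketch of the reasoning only: the function name plan_filtered_search and both thresholds are hypothetical, and real engines make this decision internally with their own cost estimates.

```python
def plan_filtered_search(total_chunks, matching_chunks,
                         low_gain_fraction=0.5, brute_force_max=10_000):
    """Pick a rough strategy from how much of the corpus passes the filter.

    Both thresholds are illustrative assumptions, not values taken from
    any particular vector store.
    """
    fraction = matching_chunks / total_chunks
    if fraction >= low_gain_fraction:
        # The filter keeps most of the corpus: it adds cost, removes little.
        return "low_gain"
    if matching_chunks <= brute_force_max:
        # Matching set is small enough to score every vector exactly.
        return "brute_force"
    # Selective filter over a large matching set: this is where a payload
    # index or partitioning has to keep the graph search workable.
    return "filtered_ann"

print(plan_filtered_search(2_000_000, 1_900_000))  # language = "en"
print(plan_filtered_search(2_000_000, 800))        # one small tenant
print(plan_filtered_search(2_000_000, 80_000))     # one large tenant
```

With these assumed thresholds, the three cases from the text land in three different branches: the English filter is low gain, the 800-chunk tenant is cheap to brute-force, and the 80,000-chunk tenant needs the index to cooperate.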

from qdrant_client.models import PayloadSchemaType

client.create_payload_index(
    collection_name="contracts",
    field_name="customer_id",
    field_schema=PayloadSchemaType.KEYWORD,
)

Without that index, a customer_id filter on a large collection makes the engine scan every payload to find matches. With it, the engine knows the matching set up front and plans the search around it. At 2 million chunks, the difference can be tens of milliseconds versus seconds.
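Why the index helps is easier to see in miniature. The sketch below is a toy inverted index in plain Python, not how Qdrant implements payload indexes; the helper names and the 100-customer corpus are invented for the example.

```python
from collections import defaultdict

def scan_for_matches(payloads, field, value):
    # No index: every payload has to be inspected to find the matching set.
    return [i for i, p in enumerate(payloads) if p.get(field) == value]

def build_payload_index(payloads, field):
    # Built once at ingestion: maps each field value to the chunk positions
    # that carry it, so the matching set is a single dictionary lookup.
    index = defaultdict(list)
    for i, p in enumerate(payloads):
        index[p[field]].append(i)
    return index

# Toy corpus (made up): 100 customers with 8 chunks each.
payloads = [{"customer_id": cid} for cid in range(100) for _ in range(8)]

index = build_payload_index(payloads, "customer_id")
scan_hits = scan_for_matches(payloads, "customer_id", 42)
index_hits = index[42]
print(len(index_hits), index_hits == scan_hits)
```

Both paths return the same chunk positions. The scan touches all 800 payloads on every query; the lookup touches only the 8 that match, and it also tells a query planner the size of the matching set before the search starts.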

The win is real, but you have to measure it on your corpus, because it depends entirely on how partitioned your data is. A flat single-tenant knowledge base sees almost no lift. A multi-tenant contract store sees a large one.

Build a gold set: real queries paired with the chunk IDs that actually answer them, labeled by hand. Then run the same queries twice, with and without the filter, and compute recall@k on each.

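from statistics import mean

# Assumed helper, not defined in this article: embed() turns a query string
# into a vector. search() is the filtered search defined earlier.

def ids(points):
    # Qdrant scored points expose the stored point ID as .id; this assumes
    # the gold set is labeled with those same IDs.
    return [p.id for p in points]

def search_no_filter(query_vec, k=10):
    # Same search as before, minus the predicate.
    return client.search(
        collection_name="contracts",
        query_vector=query_vec,
        limit=k,
    )

def recall_at_k(retrieved_ids, gold_ids, k=10):
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top = set(retrieved_ids[:k])
    return len(top & gold) / len(gold)

def compare(gold_set, customer_lookup):
    no_filter, with_filter = [], []
    for item in gold_set:
        q = embed(item["query"])
        cid = customer_lookup[item["query_id"]]

        unfiltered = search_no_filter(q, k=10)
        filtered = search(q, customer_id=cid, k=10)

        no_filter.append(recall_at_k(ids(unfiltered), item["gold_ids"]))
        with_filter.append(recall_at_k(ids(filtered), item["gold_ids"]))

    # Mean recall@10 over the gold set, with and without the predicate.
    return {
        "recall_no_filter": mean(no_filter),
        "recall_with_filter": mean(with_filter),
    }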

Run this before you ship the filter. If the lift is small, your corpus is not partitioned enough for filtering to matter and you should spend the effort on reranking instead. If the lift is large, you have found a recall win that costs one predicate and a payload index. Either way, you now have a number, which is more than most teams retrieving against a multi-tenant corpus can say.
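A worked example makes the number concrete. The chunk IDs and retrieval results below are invented, and k is set to 3 to keep the arithmetic readable; the point is only to show how the lift falls out of two recall@k averages.

```python
from statistics import mean

def recall_at_k(retrieved_ids, gold_ids, k):
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    return len(set(retrieved_ids[:k]) & gold) / len(gold)

# Made-up results for two gold queries. "x" IDs stand for look-alike chunks
# from other customers, "c" IDs for chunks from the right customer.
runs = [
    {"gold": ["c1", "c2"],
     "unfiltered": ["x9", "c1", "x3"],
     "filtered": ["c1", "c2", "c7"]},
    {"gold": ["c5"],
     "unfiltered": ["x1", "x2", "x4"],
     "filtered": ["c5", "c6", "c8"]},
]

k = 3
recall_no_filter = mean(recall_at_k(r["unfiltered"], r["gold"], k) for r in runs)
recall_with_filter = mean(recall_at_k(r["filtered"], r["gold"], k) for r in runs)
lift = recall_with_filter - recall_no_filter

print(recall_no_filter, recall_with_filter, lift)
```

Here the unfiltered runs score 0.5 and 0.0, the filtered runs score 1.0 each, and the lift is 0.75. On a real gold set it is worth looking at the per-query values as well as the mean, since a few starving queries can hide behind a healthy average.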

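Real queries carry more than one constraint. The agent wants the current English version of a contract for one customer. That is three predicates: customer_id, language, and a check that the document is current. The code below expresses "current" as status = "active"; if you track recency by timestamp instead, a Range condition on indexed_at plays the same role.

from qdrant_client.models import Range  # needed only for the indexed_at variant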

def search_scoped(query_vec, customer_id, k=10):
    flt = Filter(
        must=[
            FieldCondition(
                key="customer_id",
                match=MatchValue(value=customer_id),
            ),
            FieldCondition(
                key="language",
                match=MatchValue(value="en"),
            ),
            FieldCondition(
                key="status",
                match=MatchValue(value="active"),
            ),
        ]
    )
    return client.search(
        collection_name="contracts",
        query_vector=query_vec,
        query_filter=flt,
        limit=k,
    )

Order the predicates by selectivity in your head, even if the engine reorders them internally. The most selective field (customer_id) does the heavy lifting. The others trim the remainder. Adding a low-selectivity predicate like language on top of a high-selectivity one is cheap. Stacking five low-selectivity predicates and expecting them to substitute for one good high-selectivity predicate is how teams end up with a filter that costs latency and returns the same noisy candidate set.
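The arithmetic behind that advice is short. The sketch below assumes the fields are statistically independent, which real metadata rarely is, and the 90% figure for the weak predicates is made up; the 0.04% and 95% figures are the ones used earlier.

```python
from math import prod

def combined_fraction(fractions):
    # Fraction of the corpus that survives when predicates are ANDed,
    # under the (strong) assumption that the fields are independent.
    return prod(fractions)

corpus = 2_000_000

# Five weak predicates that each keep 90% of the corpus (assumed figure).
five_weak = round(corpus * combined_fraction([0.9] * 5))

# One strong predicate: a tenant that owns 0.04% of the corpus.
one_strong = round(corpus * combined_fraction([0.0004]))

# The strong predicate plus a weak language filter that keeps 95%.
strong_plus_language = round(corpus * combined_fraction([0.0004, 0.95]))

print(five_weak, one_strong, strong_plus_language)
```

Under these assumptions, five weak predicates still leave about 1.18 million candidates, the tenant filter alone leaves 800, and adding the language filter on top trims that to 760. The weak predicate is a cheap refinement, never a substitute.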

The metadata schema is the thing to get right early. You cannot filter on a field you did not store at ingestion. Carry customer_id, doc_type, language, status, and indexed_at through the chunking layer as payload on every chunk. Retrofitting metadata onto an indexed corpus means a full re-index, which is the expensive thing this whole technique is supposed to help you avoid.
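One way to enforce that at ingestion is to make the payload a typed object that refuses to exist with a missing field. This is a sketch under assumptions: the ChunkPayload class, the payloads_for_document helper, and the shape of doc_meta are all hypothetical, and the field list simply mirrors the one above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ChunkPayload:
    customer_id: int
    doc_type: str
    language: str
    status: str
    indexed_at: str  # ISO 8601 timestamp in UTC

    def __post_init__(self):
        # Fail at ingestion time, not at query time, if a field is missing.
        for name, value in asdict(self).items():
            if value is None or value == "":
                raise ValueError(f"missing metadata field: {name}")

def payloads_for_document(doc_meta, chunk_texts):
    # Every chunk inherits the document-level metadata unchanged.
    payload = ChunkPayload(
        customer_id=doc_meta["customer_id"],
        doc_type=doc_meta["doc_type"],
        language=doc_meta["language"],
        status=doc_meta["status"],
        indexed_at=datetime.now(timezone.utc).isoformat(),
    )
    return [{"text": text, **asdict(payload)} for text in chunk_texts]

doc_meta = {"customer_id": 4417, "doc_type": "contract",
            "language": "en", "status": "active"}
points = payloads_for_document(doc_meta, ["clause one", "clause two"])
print(points[0]["customer_id"], len(points))
```

The resulting dictionaries can be passed as the payload when upserting points. A document that arrives without a language or status raises before anything is written, which is cheaper than discovering the gap when a filter silently returns nothing.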

Vector search gets the blog posts. The cosine similarity, the reranker, the query rewriter. Metadata filtering is plumbing, so it gets a checkbox in the access-control story and no place in the recall story.

That is the gap. For any corpus with natural partitions, the predicate you apply before the index runs does more for top-k quality than the third reranker you were about to add. It is cheaper, it is deterministic, and it is auditable: you can prove the bot never saw the other customer's contract, which is a sentence your security review will want to hear.

Store the metadata. Pre-filter, not post-filter. Index the high-selectivity field. Measure the lift on your own gold set. The recall win is sitting in a field you are probably already storing and not searching against.

The RAG Pocket Guide covers the retrieval layer end to end — metadata schemas, hybrid filter-plus-vector search, the cardinality traps that degrade filtered ANN, and how to wire an eval that catches a recall regression before your users do. If your retrieval quality depends on a corpus with real partitions, the filtering chapter is where the cheap wins live.

Source: dev.to (original article)