Why Vector Search fails at LLM memory (and a benchmark to prove it)


PrecisionMemBench is a multi-dimensional retrieval benchmark for LLM memory systems. It measures four orthogonal properties that single-turn answer-quality benchmarks cannot detect:

- **Retrieval precision**: does the right belief surface, and only that belief, against a fixed seed corpus of 35 beliefs spanning two domain scopes, a supersession chain, and a secondary-user fixture?
- **Noise isolation**: do beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns across a 10-turn session?
- **Session-turn latency**: does retrieval latency degrade under session load relative to single-turn baselines?
- **Belief mutability**: do beliefs updated mid-session surface immediately within the same session via the alias enrichment flywheel?

These properties are independent. A system can pass on precision and fail on drift. A system can have clean single-turn latency and degrade 4x under session load. A system with no write-time mutation primitive cannot be scored on the fourth property at all: that is an architectural absence, not a performance difference.

Every case specifies not just what the memory system must return, but what it must not. Noise is a hard failure, not an invisible inference cost.

89 cases covering: alias resolution · scope disambiguation · supersession chain exclusion · fuzzy matching · cross-user isolation · budget eviction · ranking stability · session-level noise isolation under multi-turn topic drift

Paper: arXiv — Dataset: HuggingFace — Leaderboard: HuggingFace Spaces

Provider	Active passes	Total passes	Mean precision	Mean recall	Retrieval p50 (ms)	(unlabeled)
`tenure`	43/43	77/77	1.00	1.00	9.77	1.00
`supermemory`	17/17	44/77	0.43	0.55	819.48	0.00
`gbrain`	5/5	34/77	0.14	0.17	543.84	28.60
`agentmemory`	0/0	7/77	0.17	0.97	82.28	1.10
`yourmemory`	0/0	21/77	0.17	0.88	313.39	16.40
`atomicmemory`	0/0	9/77	0.15	0.95	71.01	658.90
`zep`	0/0	9/77	0.09	0.95	124.36	897.00
`vector`	0/0	11/77	0.09	1.00	71.87	—
`hindsight`	0/0	9/77	0.06	1.00	589.86	173.30
`mem0`	0/0	9/77	0.06	0.99	64.94	111.30
`a-mem`	0/0	9/77	0.06	0.99	13.80	178.80

Active passes are the only column that answers whether the memory system itself retrieved correctly. A system cannot accumulate active passes by returning everything or nothing.

Recall of 1.0 does not imply precision. Most comparison systems return the correct belief alongside many incorrect ones and score at or near 1.0 on recall as a result. At a mean precision of 0.05 to 0.09, roughly 10 to 18 irrelevant beliefs are returned alongside each correct one.
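
The arithmetic behind that claim can be checked with a small sketch. The helper names below are illustrative and not part of the benchmark harness; the only assumption is the standard set-based definition of precision and recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one case; None when a ratio is undefined."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


def irrelevant_per_correct(precision):
    """Irrelevant results returned for each correct one at a given precision."""
    return (1.0 - precision) / precision


# One correct belief returned together with ten irrelevant ones.
retrieved = ["b-redis-code"] + [f"b-noise-{i}" for i in range(10)]
p, r = precision_recall(retrieved, ["b-redis-code"])
print(round(p, 2), r)                          # 0.09 1.0
print(round(irrelevant_per_correct(0.09), 1))  # 10.1
```

Recall stays at 1.0 no matter how many extra beliefs are returned, which is why recall alone cannot expose noise.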

Total pass counts require this breakdown to be interpreted correctly. All counts are over the 77 non-session cases.

Provider	Active retrieval	Structural	Trivially empty
`tenure`	43	25	9
`supermemory`	17	18	9
`gbrain`	5	20	9
`a-mem`	0	6	3
`agentmemory`	0	5	2
`atomicmemory`	0	6	3
`hindsight`	0	6	3
`mem0`	0	6	3
`vector`	0	8	3
`yourmemory`	0	15	6
`zep`	0	6	3

- Active retrieval pass: the case carries a `retrievalPrecision` assertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability.
- Structural pass: the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.
- Trivially empty pass: the expected `relevantBeliefs` tier is empty by case design (empty query, `maxBeliefs: 0`, budget set to exact pinned count). Any system returning an empty set passes by construction.

Model	Precision	Recall	Passes	Mean (ms)	p95 (ms)
nomic-embed-text (768)	0.09	1.0	11/77	43.36	85.21
mxbai-embed-large (1024)	0.09	1.0	11/77	96.48	257.24
qwen3-8b (4096)	0.09	1.0	11/77	1130.95	2604.84

All 11 passes in every configuration are structural or trivially empty. Active retrieval passes are 0 across all three models.

The 12 session cases test three orthogonal properties: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns, whether latency degrades under session load, and whether beliefs introduced mid-session surface within the same session window via the alias enrichment flywheel.

The drift score is the fraction of retrieved non-pinned beliefs originating from drift-turn topics; 0 is perfect isolation.

Provider	Turns passed	Pass rate	Mean drift	Noise isolation	Mean precision	(unlabeled)
`tenure`	12/12	1.00	0.0000	1.00	1.0000	47.79
`supermemory`	2/12	0.17	0.1667	0.17	0.6000	867.83
`yourmemory`	1/12	0.08	0.7365	0.08	0.1965	430.49
`gbrain`	1/12	0.08	0.0000 ‡	0.08	—	535.61
`agentmemory`	0/12	0.00	0.8087	0.00	0.1913	98.49
`atomicmemory`	0/12	0.00	0.8449	0.00	0.1551	355.08
`zep`	0/12	0.00	0.8888	0.00	0.1112	418.13
`vector`	0/12	0.00	0.9142	0.00	0.0858	256.75
`a-mem`	0/12	0.00	0.9259	0.00	0.0741	25.66
`hindsight`	0/12	0.00	0.9285	0.00	0.0715	1880.60
`mem0`	0/12	0.00	0.9398	0.00	0.0602	377.93

‡ gbrain returned no results for these session cases. A drift score of 0.0 is recorded by construction; no beliefs were returned, so none could originate from drift topics. The correct belief also failed to surface, making this an empty-result failure rather than a genuine isolation pass.
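
As a rough illustration of the drift score definition and of the empty-result edge case described in the footnote, the sketch below computes the score from three sets of belief ids. The function name and the id values are hypothetical; the harness may compute this differently.

```python
def drift_score(retrieved_ids, pinned_ids, drift_ids):
    """Fraction of retrieved non-pinned beliefs that come from drift-turn topics."""
    non_pinned = [b for b in retrieved_ids if b not in pinned_ids]
    if not non_pinned:
        # Nothing retrieved outside the pinned tier: the score is 0.0 by construction.
        return 0.0
    return sum(1 for b in non_pinned if b in drift_ids) / len(non_pinned)


pinned = {"b-pinned-1"}
drift = {"b-drift-1", "b-drift-2"}

# One on-topic belief and one drift belief outside the pinned tier.
print(drift_score(["b-pinned-1", "b-redis-code", "b-drift-1"], pinned, drift))  # 0.5

# An empty result also scores 0.0, which is why it must be read with the pass result.
print(drift_score([], pinned, drift))  # 0.0
```

A drift score of 0.0 is only meaningful when the expected belief was also returned, so it should be interpreted together with the per-turn pass result.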

Understanding the three pass types is required to interpret any results table.

Active retrieval pass: the case carries a `retrievalPrecision` assertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability. A system cannot accumulate active passes by returning everything or nothing.

Structural pass: the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.

Trivially empty pass: the expected `relevantBeliefs` tier is empty by case design (empty query, `maxBeliefs: 0`, budget set to exact pinned count). Any system returning an empty set passes by construction. `retrievalPrecision` is null for these cases.

Without this breakdown, aggregate pass counts do not distinguish verified retrieval from structural or empty-set passes.
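
One way to produce such a breakdown is sketched below. The case fields `precision_asserted` and `structural_asserted` are hypothetical names chosen for illustration; the real case schema may encode these distinctions differently.

```python
from collections import Counter


def classify_pass(case):
    """Label a passing case by pass type; return None for a failing case."""
    if not case["passed"]:
        return None
    if case["precision_asserted"]:      # hypothetical field
        return "active"
    if case["structural_asserted"]:     # hypothetical field
        return "structural"
    return "trivially_empty"


cases = [
    {"passed": True, "precision_asserted": True, "structural_asserted": False},
    {"passed": True, "precision_asserted": False, "structural_asserted": True},
    {"passed": True, "precision_asserted": False, "structural_asserted": False},
    {"passed": False, "precision_asserted": True, "structural_asserted": False},
]

breakdown = Counter(label for label in map(classify_pass, cases) if label)
print(dict(breakdown))  # {'active': 1, 'structural': 1, 'trivially_empty': 1}
```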

The 89 cases cover the following categories. Session cases extend the corpus dynamically — beliefs are created and alias sets updated mid-session — validating that retrieval reflects the live store state rather than a snapshot.

Category	Cases
Alias resolution	23
Scope disambiguation	12
Session-level noise isolation	12
Fuzzy matching and prefix guards	8
Design boundary cases	6
Type routing and open questions	6
Budget eviction and capacity	5
Relation expansion	4
Persona prelude content	4
Supersession chain exclusion	3
Ranking stability	3
Counter-signal retrieval	2
Cross-user isolation	1
Cold start behavior	1
Total	89

Alias resolution — whether variant surface forms (short-form, natural-language, multi-word) resolve to the correct belief.

Scope disambiguation — whether scope alone correctly discriminates between beliefs sharing an alias across different domain scopes.

Supersession chain exclusion — whether superseded beliefs are excluded at depth in a multi-hop chain. A query matching both a superseded and a superseding term must surface neither superseded belief; the active terminal belief surfaces via the pinned facts tier.
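
A minimal sketch of chain exclusion, assuming supersession is stored as a hypothetical `superseded_by` mapping from a belief id to its replacement. It only shows the exclusion step and the walk to the terminal belief; how the terminal belief reaches the pinned facts tier is outside this sketch.

```python
def terminal_belief(belief_id, superseded_by):
    """Follow superseded_by links until reaching the active terminal belief."""
    seen = set()
    while belief_id in superseded_by:
        if belief_id in seen:
            raise ValueError("cycle in supersession chain")
        seen.add(belief_id)
        belief_id = superseded_by[belief_id]
    return belief_id


def exclude_superseded(candidates, superseded_by):
    """Drop every candidate that has been replaced, at any depth in the chain."""
    return [b for b in candidates if b not in superseded_by]


# Two-hop chain: b-v1 was replaced by b-v2, which was replaced by b-v3.
chain = {"b-v1": "b-v2", "b-v2": "b-v3"}

print(exclude_superseded(["b-v1", "b-v2"], chain))  # []
print(terminal_belief("b-v1", chain))               # b-v3
```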

Fuzzy matching and prefix guards — whether the retrieval layer correctly handles transpositions and near-miss terms while blocking prefix mismatches that edit distance alone would permit. Both pass and fail behaviors are documented as intentional design properties.
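
To illustrate why a prefix guard is needed on top of edit distance, the sketch below combines an optimal string alignment distance (edits plus adjacent transpositions) with a shared-prefix check. The thresholds `max_distance=1` and `prefix_len=2` are assumptions made for this example, not the benchmark's or any provider's actual settings.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def fuzzy_match(query, alias, max_distance=1, prefix_len=2):
    """Accept near-miss terms only when the leading characters agree."""
    query, alias = query.lower(), alias.lower()
    if query[:prefix_len] != alias[:prefix_len]:
        return False  # prefix guard: reject even if the edit distance is small
    return osa_distance(query, alias) <= max_distance


print(fuzzy_match("reids", "redis"))  # True: one transposition, same prefix
print(fuzzy_match("jedis", "redis"))  # False: distance 1, but the prefix differs
```

Both queries are one edit away from the alias, so distance alone would accept them; the prefix check is what separates the typo from the different term.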

Counter-signal retrieval — whether a query referencing a rejected or superseded term surfaces the active replacement belief via a counter-signal alias. Both cases carry an active retrieval precision assertion.

Relation expansion — whether relation-type beliefs correctly surface and expand their participants via a one-hop join, with participant type routing and scope filters applied during expansion.

Session-level noise isolation — whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. The primary case is a 10-turn session with topic drift across 8 turns followed by an implicit return; per-turn assertions verify isolation at re-entry.

Budget eviction and capacity — whether the retrieval layer handles slot constraints correctly, including graceful empty returns, single-slot priority, and resistance to high-reinforcement flooding at the budget ceiling.

Design boundary cases — cases where both pass and fail behaviors are documented as intentional design properties.

Type routing and open questions — whether open questions are retrieved by a separate path that returns only pinned open questions for the active scope and are never returned by text search.

Ranking stability — whether retrieval results remain stable across equivalent queries without score-driven reordering artifacts.

Cross-user isolation — whether beliefs belonging to a second user are structurally excluded from a primary user's retrieval regardless of semantic proximity.

Cold start behavior — whether a new user with zero seeded beliefs returns a fully empty context without error.

Persona prelude content — whether the persona prelude generated from the accumulated belief state is injected correctly and reflects the live belief store.

Four metrics are recorded per case:

- Retrieval precision and recall: computed over the `relevantBeliefs` tier on cases where that tier carries an active assertion. Cases where this metric is structurally inapplicable record null and are excluded from aggregate computation.
- Pinned coverage: recorded on cases where the `pinnedFacts` tier is asserted.
- Question precision and recall: recorded on cases where the `openQuestions` tier is asserted.

A pass requires all asserted tiers to be simultaneously satisfied. A case with `retrievalPrecision: 1.0` that also carries an unmet `pinnedCoverage` assertion fails.
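
The all-tiers rule and the exclusion of null metrics from aggregates could be expressed roughly as follows. Treating each assertion as a minimum threshold is an assumption made for this sketch; the harness may compare differently.

```python
def case_passes(asserted, measured):
    """A case passes only if every asserted tier meets its threshold."""
    for tier, threshold in asserted.items():
        value = measured.get(tier)
        if value is None or value < threshold:
            return False
    return True


def mean_excluding_null(values):
    """Average over scored cases only; None entries are structurally inapplicable."""
    scored = [v for v in values if v is not None]
    return sum(scored) / len(scored) if scored else None


asserted = {"retrievalPrecision": 1.0, "pinnedCoverage": 1.0}
measured = {"retrievalPrecision": 1.0, "pinnedCoverage": 0.5}

print(case_passes(asserted, measured))        # False: pinnedCoverage is unmet
print(mean_excluding_null([1.0, None, 0.5]))  # 0.75
```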

Drift score is reported for session cases: the fraction of retrieved non-pinned beliefs originating from drift-turn topics. 0 is perfect isolation.

Pre-run reports for all reference systems are committed at `test-results/baseline/`:

test-results/baseline/
  retrieval-report.json
  retrieval-report-vector.json
  retrieval-report-mem0.json
  retrieval-report-zep.json
  retrieval-report-hindsight.json

Each report contains per-case results including `passed`, `failures`, `retrievalPrecision`, `retrievalRecall`, and `retrievalLatencyMs`, plus aggregate `p50`/`p95` latency and mean precision/recall at the top level.

When you run against your own provider, compare your output in `test-results/` directly against these files.
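
A comparison along these lines could be scripted as below. The per-case `passed` field comes from the report description above; the top-level `results` list and the per-case `id` key are assumptions about the file layout, so adjust them to match the actual JSON.

```python
def cases_lost_vs_baseline(mine, baseline):
    """Return ids of cases that pass in the baseline report but fail in yours."""
    # "results" and "id" are assumed key names, not confirmed by the report format.
    base = {case["id"]: case for case in baseline["results"]}
    return [
        case["id"]
        for case in mine["results"]
        if case["id"] in base and base[case["id"]]["passed"] and not case["passed"]
    ]


baseline = {"results": [
    {"id": "case-1", "passed": True, "retrievalPrecision": 1.0},
    {"id": "case-2", "passed": False, "retrievalPrecision": 0.09},
]}
mine = {"results": [
    {"id": "case-1", "passed": False, "retrievalPrecision": 0.5},
    {"id": "case-2", "passed": False, "retrievalPrecision": 0.09},
]}

print(cases_lost_vs_baseline(mine, baseline))  # ['case-1']
```

In practice the two dictionaries would come from `json.load` on your report and on one of the committed baseline files.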

Node.js 20+
Docker (for the vector baseline and provider stacks)
An Ollama instance for the vector baseline only

npm install

Start the provider's stack, then:

MEMORY_PROVIDER=mem0 npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=mem0 npx ava session-retrieval.external.eval.test.ts

Reports land in `test-results/`. Valid values for `MEMORY_PROVIDER`: `mem0`, `zep`, `hindsight`.

The vector eval manages its own MongoDB Atlas Local container. Docker must be running but you do not set anything up manually.

OLLAMA_URL=http://localhost:11434 npx tsx embed-seed.ts

npx ava retrieval.vector.eval.test.ts
npx ava session-retrieval.vector.eval.test.ts

The Atlas Local container starts and stops automatically per run. Ports `27019` (single-turn) and `27021` (session) are used.

python export_to_hf.py

Expose a FastAPI service with three endpoints. See `wrappers/mem0_service.py` for the full contract.

POST /add

{
  "text": "redis_cache Redis",
  "user_id": "test-user",
  "metadata": { "beliefId": "b-redis-code" }
}

POST /search

{ "query": "Redis eviction policy", "user_id": "test-user", "limit": 20 }

Returns: { "results": [ { "id": "...", "memory": "...", "metadata": { "beliefId": "..." } } ] }

DELETE /reset

Clears all memories for all users. Called once before seeding.

The `beliefId` in metadata is how the harness maps provider results back to the benchmark's belief schema. If your provider cannot round-trip arbitrary metadata, implement a custom `resolveBeliefId` in the adapter.
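
To make the payload shapes and the `beliefId` round trip concrete, here is a toy in-memory stand-in for a wrapped provider. It is not the FastAPI wrapper and not any provider's real behavior: the class name, the generated `id` format, and the naive token-overlap search are all assumptions made for illustration.

```python
class InMemoryProvider:
    """Toy stand-in that mirrors the /add, /search and /reset payloads."""

    def __init__(self):
        self._memories = []

    def add(self, payload):
        # payload mirrors the POST /add body: text, user_id, metadata.beliefId
        memory = {
            "id": f"m-{len(self._memories) + 1}",
            "memory": payload["text"],
            "user_id": payload["user_id"],
            "metadata": dict(payload.get("metadata", {})),
        }
        self._memories.append(memory)
        return memory["id"]

    def search(self, payload):
        # Naive token overlap, scoped to the requesting user (assumption).
        terms = set(payload["query"].lower().split())
        hits = [
            {"id": m["id"], "memory": m["memory"], "metadata": m["metadata"]}
            for m in self._memories
            if m["user_id"] == payload["user_id"]
            and terms & set(m["memory"].lower().split())
        ]
        return {"results": hits[: payload.get("limit", 20)]}

    def reset(self):
        # Mirrors DELETE /reset: clears all memories for all users.
        self._memories.clear()


provider = InMemoryProvider()
provider.add({"text": "redis_cache Redis", "user_id": "test-user",
              "metadata": {"beliefId": "b-redis-code"}})
found = provider.search({"query": "Redis eviction policy",
                         "user_id": "test-user", "limit": 20})
print([r["metadata"]["beliefId"] for r in found["results"]])  # ['b-redis-code']
```

The point of the example is the last line: whatever the provider does internally, each search result must carry enough information to recover the original `beliefId`.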

Add one entry to `providers.config.json`:

"myprovider": {
  "envVar": "MYPROVIDER_URL",
  "defaultUrl": "http://localhost:8082",
  "seedDelayMs": 1000,
  "beliefToText": "canonical_name_aliases"
}

MEMORY_PROVIDER=myprovider npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=myprovider npx ava session-retrieval.external.eval.test.ts

The eval files themselves never need to change.

- Fork this repo.
- Run the full eval suite against your provider (both `retrieval.external.eval.test.ts` and `session-retrieval.external.eval.test.ts`).
- Commit your report files from `test-results/` to `test-results/baseline/` using the naming convention `retrieval-report-{provider}.json`.
- Open a PR. Include the provider name, Docker image digest (if applicable), and any relevant configuration notes in the description.

Results from merged PRs are reflected on the live leaderboard.

Tenure's eval lives in the Tenure repo and runs directly against its `BeliefsReader` and `ContextBuilder` implementations. It is fully self-contained. The Atlas Local container starts and stops automatically. Reports land in `test-results/`. Results are reproduced on every pull request via CI.

git clone https://github.com/tenurehq/tenure.git
cd tenure
npm i
npm run test:eval

Each comparison provider is wrapped with a thin FastAPI service that normalises the `/add` / `/search` / `/reset` contract. Wrappers are in `wrappers/`.

cd wrappers && docker compose up

Requires `MEM0_URL`, an Ollama instance for embeddings, and a running Qdrant container (included in `docker-compose.yml`).

cd wrappers
HINDSIGHT_URL=http://localhost:8888 python hindsight_wrapper.py

cd wrappers && docker compose up

@article{flynt2026precisionmembench,
  title   = {Structured Belief State and the First Precision-Aware Benchmark
             for LLM Memory Retrieval},
  author  = {Flynt, Jeffrey},
  year    = {2026}
}
