
Why Vector Search fails at LLM memory (and a benchmark to prove it)

A new benchmark, PrecisionMemBench, reports that vector-based memory systems fail to maintain accurate, noise-free retrieval for large language models across multi-turn sessions. The benchmark tests four independent properties: retrieval precision, noise isolation, session-turn latency, and belief mutability. Of the 10 comparison systems, including Mem0 and Zep, eight recorded zero active retrieval passes, returning numerous irrelevant beliefs alongside the correct ones. Only the Tenure system achieved a perfect score, passing all 43 active retrieval cases with a mean precision of 1.0, while others such as Vector and Hindsight scored 0.09 and 0.06 respectively, showing that high recall alone does not make a memory system useful.

11 min read · Published Jun 4, 2026

PrecisionMemBench is a multi-dimensional retrieval benchmark for LLM memory systems. It measures four orthogonal properties that single-turn answer-quality benchmarks cannot detect:

  • Retrieval precision: does the right belief surface, and only that belief, against a fixed seed corpus of 35 beliefs spanning two domain scopes, a supersession chain, and a secondary-user fixture?
  • Noise isolation: do beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns across a 10-turn session?
  • Session-turn latency: does retrieval latency degrade under session load relative to single-turn baselines?
  • Belief mutability: do beliefs updated mid-session surface immediately within the same session via the alias enrichment flywheel?

These properties are independent. A system can pass on precision and fail on drift. A system can have clean single-turn latency and degrade 4x under session load. A system with no write-time mutation primitive cannot be scored on the fourth property at all; that is an architectural absence, not a performance difference.

Every case specifies not just what the memory system must return, but what it must not. Noise is a hard failure, not an invisible inference cost.

89 cases covering: alias resolution · scope disambiguation · supersession chain exclusion · fuzzy matching · cross-user isolation · budget eviction · ranking stability · session-level noise isolation under multi-turn topic drift

Paper: arXiv — Dataset: HuggingFace — Leaderboard: HuggingFace Spaces

Provider      Active passes  Total passes  Mean precision  Mean recall  Retrieval p50 (ms)  Ingestion total (s)
tenure        43/43          77/77         1.00            1.00         9.77                1.00
supermemory   17/17          44/77         0.43            0.55         819.48              0.00
gbrain        5/5            34/77         0.14            0.17         543.84              28.60
agentmemory   0/0            7/77          0.17            0.97         82.28               1.10
yourmemory    0/0            21/77         0.17            0.88         313.39              16.40
atomicmemory  0/0            9/77          0.15            0.95         71.01               658.90
zep           0/0            9/77          0.09            0.95         124.36              897.00
vector        0/0            11/77         0.09            1.00         71.87
hindsight     0/0            9/77          0.06            1.00         589.86              173.30
mem0          0/0            9/77          0.06            0.99         64.94               111.30
a-mem         0/0            9/77          0.06            0.99         13.80               178.80

Active passes are the only column that answers whether the memory system itself retrieved correctly. A system cannot accumulate active passes by returning everything or nothing.

Recall of 1.0 does not imply precision. A system that returns the correct belief alongside many incorrect ones still scores perfectly on recall, and most of the comparison systems do exactly that. A mean precision of 0.05 to 0.09 means roughly 10 to 18 irrelevant beliefs are returned alongside each correct one.
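To make the gap between recall and precision concrete, here is a minimal sketch of the two metrics for a single query. The function name, the noise belief IDs, and the convention of scoring an empty set as 0.0 are assumptions for illustration, not the harness's actual code.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one retrieval result.

    Assumption: an empty result or an empty relevant set scores 0.0 here;
    the benchmark itself records null where the metric is inapplicable.
    """
    hits = sum(1 for belief_id in retrieved if belief_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: one relevant belief, returned together with ten others.
relevant = {"b-redis-code"}
retrieved = ["b-redis-code"] + [f"b-noise-{i}" for i in range(10)]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.09 recall=1.00"
```

The correct belief is present, so recall is perfect, yet ten of the eleven returned beliefs are noise that the model must read and ignore.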

Total pass counts require this breakdown to be interpreted correctly. All counts are over the 77 non-session cases.

Provider      Active retrieval  Structural  Trivially empty
tenure        43                25          9
supermemory   17                18          9
gbrain        5                 20          9
a-mem         0                 6           3
agentmemory   0                 5           2
atomicmemory  0                 6           3
hindsight     0                 6           3
mem0          0                 6           3
vector        0                 8           3
yourmemory    0                 15          6
zep           0                 6           3

  • Active retrieval pass: the case carries a retrievalPrecision assertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability.
  • Structural pass: the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.
  • Trivially empty pass: the expected relevantBeliefs tier is empty by case design (empty query, maxBeliefs: 0, budget set to exact pinned count). Any system returning an empty set passes by construction.

Model                     Precision  Recall  Passes  Mean (ms)  p95 (ms)
nomic-embed-text (768)    0.09       1.0     11/77   43.36      85.21
mxbai-embed-large (1024)  0.09       1.0     11/77   96.48      257.24
qwen3-8b (4096)           0.09       1.0     11/77   1130.95    2604.84

All 11 passes in every configuration are structural or trivially empty. Active retrieval passes are 0 across all three models.

The 12 session cases test three orthogonal properties: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns, whether latency degrades under session load, and whether beliefs introduced mid-session surface within the same session window via the alias enrichment flywheel.

The drift score is the fraction of retrieved non-pinned beliefs originating from drift-turn topics; 0 is perfect isolation.

Provider      Turns passed  Pass rate  Mean drift  Noise isolation  Mean precision  Session p50 (ms)
tenure        12/12         1.00       0.0000      1.00             1.0000          47.79
supermemory   2/12          0.17       0.1667      0.17             0.6000          867.83
yourmemory    1/12          0.08       0.7365      0.08             0.1965          430.49
gbrain        1/12          0.08       0.0000 ‡    0.08                             535.61
agentmemory   0/12          0.00       0.8087      0.00             0.1913          98.49
atomicmemory  0/12          0.00       0.8449      0.00             0.1551          355.08
zep           0/12          0.00       0.8888      0.00             0.1112          418.13
vector        0/12          0.00       0.9142      0.00             0.0858          256.75
a-mem         0/12          0.00       0.9259      0.00             0.0741          25.66
hindsight     0/12          0.00       0.9285      0.00             0.0715          1880.60
mem0          0/12          0.00       0.9398      0.00             0.0602          377.93

‡ gbrain returned no results for these session cases. A drift score of 0.0 is recorded by construction; no beliefs were returned, so none could originate from drift topics. The correct belief also failed to surface, making this an empty-result failure rather than a genuine isolation pass.
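The drift score and the empty-result edge case in the footnote can be sketched as follows. The RetrievedBelief record and its from_drift_topic flag are hypothetical; how the harness actually tags drift-turn beliefs is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedBelief:
    belief_id: str
    pinned: bool
    from_drift_topic: bool  # hypothetical flag: introduced during a drift turn


def drift_score(retrieved: list[RetrievedBelief]) -> float:
    """Fraction of retrieved non-pinned beliefs that come from drift topics."""
    non_pinned = [b for b in retrieved if not b.pinned]
    if not non_pinned:
        # Nothing returned, so nothing can originate from drift topics.
        return 0.0
    return sum(b.from_drift_topic for b in non_pinned) / len(non_pinned)


# Hypothetical turn: one pinned fact, the target belief, three drift beliefs.
turn = [
    RetrievedBelief("b-pinned", pinned=True, from_drift_topic=False),
    RetrievedBelief("b-target", pinned=False, from_drift_topic=False),
    RetrievedBelief("b-drift-1", pinned=False, from_drift_topic=True),
    RetrievedBelief("b-drift-2", pinned=False, from_drift_topic=True),
    RetrievedBelief("b-drift-3", pinned=False, from_drift_topic=True),
]
print(drift_score(turn))  # 0.75
print(drift_score([]))    # 0.0, which is why drift alone cannot signal a pass
```

Because an empty result scores 0.0, the drift score only counts as evidence of isolation when the expected belief is also present in the result.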

Understanding the three pass types is required to interpret any results table.

Active retrieval pass: the case carries a retrievalPrecision assertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability. A system cannot accumulate active passes by returning everything or nothing.

Structural pass: the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.

Trivially empty pass: the expected relevantBeliefs tier is empty by case design (empty query, maxBeliefs: 0, budget set to exact pinned count). Any system returning an empty set passes by construction. retrievalPrecision is null for these cases.

Without this breakdown, aggregate pass counts do not distinguish verified retrieval from structural or empty-set passes.
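One way to produce such a breakdown is to classify each passing case by what it asserts. This is a sketch under assumptions: the two boolean inputs and the precedence between them are illustrative, and the harness may derive pass types differently.

```python
from collections import Counter
from typing import Optional


def classify_pass(
    passed: bool,
    has_precision_assertion: bool,
    expected_relevant_empty: bool,
) -> Optional[str]:
    """Return the pass type for a case, or None if the case failed."""
    if not passed:
        return None
    if has_precision_assertion:
        return "active"
    if expected_relevant_empty:
        return "trivially_empty"
    return "structural"


# Hypothetical results: (passed, has_precision_assertion, expected_relevant_empty)
results = [
    (True, False, False),   # structural property held
    (True, False, True),    # empty expected tier, empty result
    (False, True, False),   # precision assertion not met: no pass of any type
    (True, True, False),    # precision assertion met
]
types = (classify_pass(*r) for r in results)
breakdown = Counter(t for t in types if t is not None)
print(dict(breakdown))
```

Reporting the three counts separately keeps a system that only ever returns empty or unfiltered sets from looking comparable to one that retrieves correctly.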

The 89 cases cover the following categories. Session cases extend the corpus dynamically — beliefs are created and alias sets updated mid-session — validating that retrieval reflects the live store state rather than a snapshot.

Category                          Cases
Alias resolution                  23
Scope disambiguation              12
Session-level noise isolation     12
Fuzzy matching and prefix guards  8
Design boundary cases             6
Type routing and open questions   6
Budget eviction and capacity      5
Relation expansion                4
Persona prelude content           4
Supersession chain exclusion      3
Ranking stability                 3
Counter-signal retrieval          2
Cross-user isolation              1
Cold start behavior               1
Total                             89

Alias resolution — whether variant surface forms (short-form, natural-language, multi-word) resolve to the correct belief.

Scope disambiguation — whether scope alone correctly discriminates between beliefs sharing an alias across different domain scopes.

Supersession chain exclusion — whether superseded beliefs are excluded at depth in a multi-hop chain. A query matching both a superseded and a superseding term must surface neither superseded belief; the active terminal belief surfaces via the pinned facts tier.

Fuzzy matching and prefix guards — whether the retrieval layer correctly handles transpositions and near-miss terms while blocking prefix mismatches that edit distance alone would permit. Both pass and fail behaviors are documented as intentional design properties.

Counter-signal retrieval — whether a query referencing a rejected or superseded term surfaces the active replacement belief via a counter-signal alias. Both cases carry an active retrieval precision assertion.

Relation expansion — whether relation-type beliefs correctly surface and expand their participants via a one-hop join, with participant type routing and scope filters applied during expansion.

Session-level noise isolation — whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. The primary case is a 10-turn session with topic drift across 8 turns followed by an implicit return; per-turn assertions verify isolation at re-entry.

Budget eviction and capacity — whether the retrieval layer handles slot constraints correctly, including graceful empty returns, single-slot priority, and resistance to high-reinforcement flooding at the budget ceiling.

Design boundary cases — cases where both pass and fail behaviors are documented as intentional design properties.

Type routing and open questions — whether open questions are retrieved by a separate path that returns only pinned open questions for the active scope and are never returned by text search.

Ranking stability — whether retrieval results remain stable across equivalent queries without score-driven reordering artifacts.

Cross-user isolation — whether beliefs belonging to a second user are structurally excluded from a primary user's retrieval regardless of semantic proximity.

Cold start behavior — whether a new user with zero seeded beliefs returns a fully empty context without error.

Persona prelude content — whether the persona prelude generated from the accumulated belief state is injected correctly and reflects the live belief store.
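The supersession chain exclusion category above can be illustrated with a small sketch. It assumes supersession is stored as a mapping from each superseded belief to its replacement; the belief IDs and both function names are hypothetical, and the real store may model the chain differently.

```python
def active_terminal(belief_id: str, superseded_by: dict[str, str]) -> str:
    """Follow supersession links until reaching a belief nothing supersedes."""
    seen: set[str] = set()
    current = belief_id
    while current in superseded_by:
        if current in seen:
            raise ValueError(f"supersession cycle at {current}")
        seen.add(current)
        current = superseded_by[current]
    return current


def exclude_superseded(candidates: list[str], superseded_by: dict[str, str]) -> list[str]:
    """Drop every candidate that has been superseded, at any depth."""
    return [b for b in candidates if b not in superseded_by]


# Hypothetical two-hop chain: b-v1 was replaced by b-v2, then by b-v3.
chain = {"b-v1": "b-v2", "b-v2": "b-v3"}
candidates = ["b-v1", "b-v2", "b-v3", "b-other"]

print(active_terminal("b-v1", chain))         # b-v3
print(exclude_superseded(candidates, chain))  # ['b-v3', 'b-other']
```

A query that matches wording from b-v1 or b-v2 should therefore never surface either of them; only the terminal belief remains eligible.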

Four metrics are recorded per case:

  • Retrieval precision and recall: computed over the relevantBeliefs tier on cases where that tier carries an active assertion. Cases where this metric is structurally inapplicable record null and are excluded from aggregate computation.
  • Pinned coverage: recorded on cases where the pinnedFacts tier is asserted.
  • Question precision and recall: recorded on cases where the openQuestions tier is asserted.

A pass requires all asserted tiers to be simultaneously satisfied. A case with retrievalPrecision: 1.0 that also carries an unmet pinnedCoverage assertion fails.
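The all-tiers rule and the exclusion of null metrics from aggregates can be sketched like this. Treating each assertion as a minimum threshold is an assumption, as are the function names; only the metric names come from the text above.

```python
from statistics import mean
from typing import Optional


def case_passes(
    assertions: dict[str, float],
    measured: dict[str, Optional[float]],
) -> bool:
    """A case passes only if every asserted metric meets its threshold."""
    for metric, required in assertions.items():
        value = measured.get(metric)
        if value is None or value < required:
            return False
    return True


def mean_excluding_null(values: list[Optional[float]]) -> Optional[float]:
    """Average only the cases where the metric applies."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Perfect retrieval precision does not rescue an unmet pinnedCoverage assertion.
assertions = {"retrievalPrecision": 1.0, "pinnedCoverage": 1.0}
measured = {"retrievalPrecision": 1.0, "pinnedCoverage": 0.5}
print(case_passes(assertions, measured))       # False

print(mean_excluding_null([1.0, None, 0.5]))   # 0.75
```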

Drift score is reported for session cases: the fraction of retrieved non-pinned beliefs originating from drift-turn topics. 0 is perfect isolation.

Pre-run reports for all reference systems are committed at test-results/baseline/:

test-results/baseline/
  retrieval-report.json
  retrieval-report-vector.json
  retrieval-report-mem0.json
  retrieval-report-zep.json
  retrieval-report-hindsight.json

Each report contains per-case results including passed, failures, retrievalPrecision, retrievalRecall, and retrievalLatencyMs, plus aggregate p50/p95 latency and mean precision/recall at the top level.

When you run against your own provider, compare your output in test-results/ directly against these files.
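A comparison of this kind could be scripted along the following lines. The per-case field passed is named above, but the top-level "cases" key and the per-case "id" key are assumptions about the report layout; check a committed baseline file before relying on them.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read one JSON report from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_reports(baseline: dict, mine: dict) -> list[str]:
    """List cases whose pass status differs from the baseline."""
    base_cases = {c["id"]: c for c in baseline["cases"]}  # assumed layout
    lines = []
    for case in mine["cases"]:
        ref = base_cases.get(case["id"])
        if ref is None:
            lines.append(f"{case['id']}: not in baseline")
        elif ref["passed"] != case["passed"]:
            lines.append(f"{case['id']}: passed {ref['passed']} -> {case['passed']}")
    return lines


# In-memory stand-ins for two reports; case ids are hypothetical.
baseline = {"cases": [{"id": "alias-01", "passed": True},
                      {"id": "scope-01", "passed": True}]}
mine = {"cases": [{"id": "alias-01", "passed": False},
                  {"id": "scope-01", "passed": True},
                  {"id": "new-01", "passed": True}]}

for line in diff_reports(baseline, mine):
    print(line)
```

With real files, the two dictionaries would come from load_report called on a baseline report and on your own report in test-results/.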

  • Node.js 20+
  • Docker (for the vector baseline and provider stacks)
  • An Ollama instance for the vector baseline only

npm install

Start the provider's stack, then:

MEMORY_PROVIDER=mem0 npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=mem0 npx ava session-retrieval.external.eval.test.ts

Reports land in test-results/. Valid values for MEMORY_PROVIDER: mem0, zep, hindsight.

The vector eval manages its own MongoDB Atlas Local container. Docker must be running but you do not set anything up manually.

OLLAMA_URL=http://localhost:11434 npx tsx embed-seed.ts

npx ava retrieval.vector.eval.test.ts
npx ava session-retrieval.vector.eval.test.ts

The Atlas Local container starts and stops automatically per run. Ports 27019 (single-turn) and 27021 (session) are used.

python export_to_hf.py

Expose a FastAPI service with three endpoints. See wrappers/mem0_service.py for the full contract.

POST /add

{
  "text": "redis_cache Redis",
  "user_id": "test-user",
  "metadata": { "beliefId": "b-redis-code" }
}

POST /search

{ "query": "Redis eviction policy", "user_id": "test-user", "limit": 20 }

Returns: { "results": [ { "id": "...", "memory": "...", "metadata": { "beliefId": "..." } } ] }

DELETE /reset

Clears all memories for all users. Called once before seeding.

The beliefId in metadata is how the harness maps provider results back to the benchmark's belief schema. If your provider cannot round-trip arbitrary metadata, implement a custom resolveBeliefId in the adapter.
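The request and response shapes above can be exercised without any web framework. The class below is a toy in-memory stand-in, not the code in wrappers/: the token-overlap search, the /add response shape, and the default limit of 20 are all assumptions. In a real wrapper, FastAPI route handlers would delegate to methods like these.

```python
class InMemoryProvider:
    """Toy stand-in for a provider behind the /add, /search, /reset contract."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def add(self, payload: dict) -> dict:
        # Keep metadata untouched so beliefId survives the round trip.
        record = {
            "id": f"m-{len(self._records) + 1}",
            "memory": payload["text"],
            "user_id": payload["user_id"],
            "metadata": dict(payload.get("metadata", {})),
        }
        self._records.append(record)
        return {"id": record["id"]}  # response shape is an assumption

    def search(self, payload: dict) -> dict:
        # Naive token overlap, scoped to the requesting user.
        terms = set(payload["query"].lower().split())
        hits = [
            {"id": r["id"], "memory": r["memory"], "metadata": r["metadata"]}
            for r in self._records
            if r["user_id"] == payload["user_id"]
            and terms & set(r["memory"].lower().split())
        ]
        return {"results": hits[: payload.get("limit", 20)]}

    def reset(self) -> None:
        self._records.clear()


provider = InMemoryProvider()
provider.add({"text": "redis_cache Redis", "user_id": "test-user",
              "metadata": {"beliefId": "b-redis-code"}})
response = provider.search({"query": "Redis eviction policy",
                            "user_id": "test-user", "limit": 20})

# The harness-side mapping back to benchmark beliefs reads metadata.beliefId.
belief_ids = [r["metadata"]["beliefId"] for r in response["results"]]
print(belief_ids)  # ['b-redis-code']
```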

Add one entry to providers.config.json:

"myprovider": {
  "envVar": "MYPROVIDER_URL",
  "defaultUrl": "http://localhost:8082",
  "seedDelayMs": 1000,
  "beliefToText": "canonical_name_aliases"
}

MEMORY_PROVIDER=myprovider npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=myprovider npx ava session-retrieval.external.eval.test.ts

The eval files themselves never need to change.
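The envVar and defaultUrl fields suggest that the harness resolves a provider's base URL from the environment and falls back to the default. The sketch below shows that reading of the config; it is an assumption about the behaviour, and the function name is hypothetical. The seedDelayMs and beliefToText fields are left uninterpreted.

```python
import os
from typing import Mapping, Optional


def resolve_provider_url(
    providers: dict,
    name: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the URL in the provider's environment variable, else the default."""
    env = os.environ if env is None else env
    entry = providers[name]
    return env.get(entry["envVar"], entry["defaultUrl"])


providers = {
    "myprovider": {
        "envVar": "MYPROVIDER_URL",
        "defaultUrl": "http://localhost:8082",
        "seedDelayMs": 1000,
        "beliefToText": "canonical_name_aliases",
    }
}

print(resolve_provider_url(providers, "myprovider", env={}))
# http://localhost:8082
print(resolve_provider_url(providers, "myprovider",
                           env={"MYPROVIDER_URL": "http://memory:9000"}))
# http://memory:9000
```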

  • Fork this repo.
  • Run the full eval suite against your provider (both retrieval.external.eval.test.ts and session-retrieval.external.eval.test.ts).
  • Commit your report files from test-results/ to test-results/baseline/ using the naming convention retrieval-report-{provider}.json.
  • Open a PR. Include the provider name, Docker image digest (if applicable), and any relevant configuration notes in the description.

Results from merged PRs are reflected on the live leaderboard.

Tenure's eval lives in the Tenure repo and runs directly against its BeliefsReader and ContextBuilder implementations. It is fully self-contained. The Atlas Local container starts and stops automatically. Reports land in test-results/. Results are reproduced on every pull request via CI.

git clone https://github.com/tenurehq/tenure.git
cd tenure
npm i
npm run test:eval

Each comparison provider is wrapped with a thin FastAPI service that normalises the /add, /search, and /reset contract. Wrappers are in wrappers/.

cd wrappers && docker compose up

Requires MEM0_URL, an Ollama instance for embeddings, and a running Qdrant container (included in docker-compose.yml).

cd wrappers
HINDSIGHT_URL=http://localhost:8888 python hindsight_wrapper.py
cd wrappers && docker compose up

@article{flynt2026precisionmembench,
  title   = {Structured Belief State and the First Precision-Aware Benchmark
             for LLM Memory Retrieval},
  author  = {Flynt, Jeffrey},
  year    = {2026}
}