AI memory systems break at scale


The failure modes are structural, not incidental. Similarity search accumulates noise faster than any model can filter it. Here is exactly what breaks, and how we designed around each failure.

Every memory system for LLM agents looks adequate in demos and early sessions. The corpus is small, the frontier model is capable, and the model compensates for imprecise retrieval by reasoning through noise. This works until it does not.

The field has converged on benchmarks that operate at tens to low hundreds of beliefs. At that scale, a system that returns its entire store achieves recall of 1.0 and scores competitively on answer-quality metrics, because a capable model can locate the correct answer in a noisy context window. The precision problem is invisible at the scale where everything is tested, and fully visible at the scale where everything breaks.

Serious persistent memory use reaches thousands of beliefs. Full-corpus retrieval becomes architecturally impossible. The precision problem can no longer be offloaded to inference, and the failure that was invisible in evaluation surfaces immediately in production.

The generative model was never a neutral downstream consumer. It was load-bearing infrastructure compensating for retrieval imprecision. That load-bearing role cannot scale with the store.

In any belief store where the user works within a technical domain, all beliefs about that domain occupy a shared semantic region. A query about Redis is semantically close to the Redis belief you want, and nearly as close to beliefs about MongoDB, TypeScript, Kubernetes, Fastify, and GitHub Actions. Cosine scores across these range from 0.65 to 0.83. The relatedness is genuine, but it measures the wrong thing: topical proximity rather than relevance to the query.

The predictable response is to reach for a more capable embedding model. We tested three, spanning a 20x range in scale: a 768-dimension model, a 1024-dimension model, and an 8-billion-parameter model producing 4096-dimension embeddings. Mean retrieval precision was 0.09 across all three. The qwen3-8b result is the clearest demonstration that this is not a capability problem. At over 1,100ms mean per query, it produced the same precision as the smallest model.

Embedding model	Dimensions	Mean precision	Active retrieval passes	Mean latency
nomic-embed-text	768	0.09	0 / 48	43ms
mxbai-embed-large	1024	0.09	0 / 48	96ms
qwen3-8b	4096	0.09	0 / 48	1,131ms

A more powerful embedder distributes scores differently across the corpus but cannot eliminate genuine semantic proximity within a domain-specific corpus. The fix is not a better ruler. It is a different measurement instrument entirely.

One of the more counterintuitive findings from our evaluation is that faithfully extracted beliefs can still fail at retrieval. The extraction pipeline and the retrieval pipeline are architecturally decoupled, and precision failures occur in the retrieval layer regardless of what the extraction layer did.

Consider a concrete case from PrecisionMemBench. A relation-type belief linking an auth service to a Redis dependency was ingested through Mem0's extraction pipeline. The stored memory preserved every operationally significant fact: the service name, the dependency target, the fail-open behavior, and the coupling assertion. High-quality extraction by any measure.

Stored in Mem0 after extraction
User's auth service depends on Redis for session storage.
If Redis goes down, auth fails open by denying all requests.
Auth resilience discussions must address Redis availability;
the two are tightly coupled.

A query asking for auth service dependencies and failure modes returned this belief correctly, then returned 16 additional beliefs including linting configuration, React expertise levels, a Vitest preference, a communication style preference, and a superseded SQLAlchemy belief. Retrieval precision: 0.056. The structurally required participant belief was absent from the result set entirely despite being referenced in the stored text.

The extraction was not the problem. The retrieval layer contaminated the result set with semantically proximate beliefs that had no relevance to the query. Improving extraction quality cannot fix this.

When the query was slightly less specific, one required belief disappeared from the result set entirely. When it was more specific, both required beliefs appeared alongside 16 irrelevant ones. Neither outcome required poor extraction. The precision floor is structural, not query-dependent.

Single-turn retrieval metrics conceal a failure that only becomes visible across a session. Memory is stateful. Beliefs introduced during one turn occupy the same vector space as beliefs from every other turn, and cosine similarity has no mechanism for respecting the temporal or topical boundaries between them.

Our session-level evaluation runs a 10-turn session: a topic is established at turn 0, followed by 8 drift turns across unrelated domains, followed by an implicit return to the original topic at turn 9. The drift score measures what fraction of retrieved beliefs at re-entry originated from off-topic drift turns. A perfect system scores 0.0. Comparison systems score 0.92 to 1.0.

System	Turn 9 drift score	Turn 10 drift score	Cross-session drift
Tenure	0.0	0.0	0.0
Vector baseline	1.0	0.94	0.94
Mem0	1.0	1.0	1.0
Zep	1.0	0.92	0.92
Hindsight	1.0	0.94	1.0
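The drift score defined above can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the benchmark's actual code: the `RetrievedBelief` type, its `origin_turn` field, and the choice to score an empty result set as 0.0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedBelief:
    belief_id: str
    origin_turn: int  # hypothetical field: turn at which the belief was introduced


def drift_score(retrieved: list[RetrievedBelief], drift_turns: set[int]) -> float:
    """Fraction of retrieved beliefs that originated in off-topic drift turns."""
    if not retrieved:
        # Assumption: an empty result set injects no off-topic beliefs.
        return 0.0
    off_topic = sum(1 for b in retrieved if b.origin_turn in drift_turns)
    return off_topic / len(retrieved)


# Turn 0 establishes the topic, turns 1-8 drift, turn 9 returns to it.
DRIFT_TURNS = set(range(1, 9))

clean = [RetrievedBelief("redis-sessions", 0)]
noisy = clean + [RetrievedBelief(f"drift-{t}", t) for t in range(1, 4)]

print(drift_score(clean, DRIFT_TURNS))  # 0.0
print(drift_score(noisy, DRIFT_TURNS))  # 0.75
```

A score of 1.0 under this definition means every belief returned at re-entry came from a drift turn, which is the only way the on-topic belief can be missing entirely.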

The Hindsight result at turn 10 is worth examining specifically. The cross-encoder reranker bundled in its full image is the architectural feature designed to address exactly this class of problem. At that turn, Hindsight achieves a drift score of 0.94 with the correct belief absent from the result set entirely: not ranked low, but missing. The reranker does not close the gap because the gap is in the cosine geometry the reranker operates on, not in the ranking order.

Published latency benchmarks for memory systems almost universally report single-turn figures. Single-turn latency is to session latency as synthetic benchmarks are to production load: a measurement that tells you something useful about a condition that does not exist in practice.

Under session load, retrieval paths that were already imprecise degrade further. One comparison system reports sub-700ms single-turn latency in its published evaluation. Across the 12 session cases in PrecisionMemBench, the same system exceeds 2,700ms mean per session turn, with p95 above 6,000ms.

Ingestion latency creates a separate structural problem. Zep's graph-based write architecture produces read-time latency of 139ms, one of the more competitive single-turn figures among the systems evaluated. It also produces 897 seconds of total ingestion time across a 35-belief corpus, meaning 25,630ms per belief. At a typical conversational turn cadence of 10 to 30 seconds, a belief introduced at turn 1 may not be queryable until the session has largely concluded.
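The arithmetic behind that availability gap is easy to check. The sketch below uses only the figures quoted above; the variable names are ours, and the final loop assumes ingestion is serial, which the published numbers suggest but do not state.

```python
TOTAL_INGESTION_S = 897    # total ingestion time reported above
CORPUS_SIZE = 35           # beliefs in the corpus
TURN_CADENCE_S = (10, 30)  # conversational turn cadence range from the text

per_belief_s = TOTAL_INGESTION_S / CORPUS_SIZE


def turns_elapsed(seconds: float, cadence_s: float) -> float:
    """How many conversational turns pass while `seconds` of ingestion runs."""
    return seconds / cadence_s


# Roughly 25,630 ms once rounded to the nearest 10 ms, as quoted in the text.
print(f"{per_belief_s * 1000:.0f} ms per belief")  # 25629 ms per belief

for cadence in TURN_CADENCE_S:
    per_belief_turns = turns_elapsed(per_belief_s, cadence)
    # Assumption: beliefs are ingested serially, so the whole corpus takes this long.
    corpus_turns = turns_elapsed(TOTAL_INGESTION_S, cadence)
    print(f"{cadence}s cadence: {per_belief_turns:.2f} turns per belief, "
          f"{corpus_turns:.1f} turns for the corpus")
```

At a 10-second cadence, a single belief takes about 2.56 turns to become queryable; even at 30 seconds it takes most of a turn, before any queueing behind earlier writes.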

This is not an edge case. A belief is only useful if it is available when needed. A memory system with an availability gap measured in minutes does not solve the re-orientation problem; it defers it.

Each of these failure modes has the same root cause: cosine similarity is the wrong primary retrieval signal for a bounded-vocabulary context where the user coined the terminology. The additional infrastructure layered on top of it (rerankers, temporal trees, hierarchical graphs) compensates for the wrong primary signal rather than replacing it.

The correct signal exploits a property of individual language production. Single speakers maintain stable, distinctive lexical choices across production contexts over periods of one to two years. Lexical priming formalizes the mechanism: words become entrained through use, and speakers reliably return to the same lexical choices in the same topical contexts. A single-user belief store is precisely the setting where these properties are strongest: the query author and the belief author are the same person.

If a user named their Kubernetes belief with the canonical name "kubernetes" and the aliases "k8s" and "kube", then a query containing "k8s" should retrieve that belief with high precision regardless of semantic distance. There is no ambiguity to resolve: the authored terminology is the ground truth. Alias-weighted BM25 retrieves what the user named. In a single-user persistent memory context, that is more often correct than what is semantically nearby.
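To make the idea concrete, here is a minimal sketch of alias-weighted BM25 in Python. It is one plausible way to weight aliases (boosting the term frequency of name and alias tokens before applying the standard BM25 formula), not a description of Tenure's implementation. The `Belief` type, the `AliasWeightedBM25` class, the `alias_weight` value, and the sample store are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Belief:
    name: str
    text: str
    aliases: list[str] = field(default_factory=list)


def tokenize(s: str) -> list[str]:
    # Lowercased alphanumeric tokens; no stopword removal or stemming.
    return re.findall(r"[a-z0-9]+", s.lower())


class AliasWeightedBM25:
    def __init__(self, beliefs, alias_weight=3.0, k1=1.2, b=0.75):
        self.beliefs = beliefs
        self.k1, self.b = k1, b
        self.tfs = []
        for belief in beliefs:
            tf = Counter()
            for tok in tokenize(belief.text):
                tf[tok] += 1.0
            # Name and alias tokens count more than body-text tokens.
            for term in [belief.name, *belief.aliases]:
                for tok in tokenize(term):
                    tf[tok] += alias_weight
            self.tfs.append(tf)
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = (sum(self.lengths) / len(self.lengths)) if beliefs else 1.0
        # Document frequency: number of beliefs containing each token.
        self.df = Counter(tok for tf in self.tfs for tok in tf)

    def idf(self, tok: str) -> float:
        n = len(self.beliefs)
        return math.log(1 + (n - self.df[tok] + 0.5) / (self.df[tok] + 0.5))

    def search(self, query: str) -> list[tuple[str, float]]:
        results = []
        for belief, tf, length in zip(self.beliefs, self.tfs, self.lengths):
            score = 0.0
            for tok in tokenize(query):
                f = tf.get(tok, 0.0)
                if f == 0.0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self.idf(tok) * f * (self.k1 + 1) / norm
            if score > 0:  # no lexical match means no candidate at all
                results.append((belief.name, score))
        return sorted(results, key=lambda r: r[1], reverse=True)


store = [
    Belief("kubernetes", "Production workloads run on a managed cluster.", ["k8s", "kube"]),
    Belief("redis", "Auth service stores sessions in Redis."),
    Belief("typescript", "Strict mode is enabled in every package."),
]
index = AliasWeightedBM25(store)
print([name for name, _ in index.search("k8s ingress setup")])  # ['kubernetes']
print(index.search("terraform state"))                          # []
```

The second query illustrates the difference from similarity search: with no lexical overlap, the result set is empty rather than filled with the nearest neighbours.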

Scope is a hard filter, not a ranking signal. A superseded or out-of-scope belief is never a candidate regardless of match quality. Session drift cannot occur structurally.
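A hard filter of this kind might look like the following sketch, where eligibility is decided before any ranking takes place. The `Status` enum, the `scope` strings, and the precomputed `score` field are illustrative assumptions, not Tenure's schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    SUPERSEDED = "superseded"


@dataclass(frozen=True)
class Belief:
    name: str
    scope: str
    status: Status
    score: float  # stand-in for a lexical match score computed elsewhere


def candidates(beliefs: list[Belief], active_scope: str) -> list[Belief]:
    """Apply eligibility as a hard filter, then rank only what survives."""
    eligible = [
        b for b in beliefs
        if b.status is Status.ACTIVE and b.scope == active_scope
    ]
    return sorted(eligible, key=lambda b: b.score, reverse=True)


store = [
    Belief("orm-choice-current", "project:api", Status.ACTIVE, 0.4),
    Belief("orm-choice-old", "project:api", Status.SUPERSEDED, 0.9),
    Belief("vitest-preference", "project:web", Status.ACTIVE, 0.8),
]
print([b.name for b in candidates(store, "project:api")])  # ['orm-choice-current']
```

Both excluded beliefs have higher match scores than the one returned. Because the filter runs first, the scores never get a chance to matter.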

Every session is an observation of how the user refers to beliefs in natural language. New surface forms are captured and added to the alias set continuously. Precision improves with use.

Superseded beliefs are retained for audit but never injected. The system can distinguish "we never had this belief" from "we moved past it." Stale context is structurally retired, not probabilistically suppressed.

Without compaction, the belief store grows monotonically. Compaction prevents noise floor accumulation over time by merging duplicate and overlapping beliefs while preserving the full alias history of each merged entry.
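A simplified sketch of compaction that preserves alias history is shown below. It treats two beliefs as duplicates only when their normalized names match and lets the later entry's text win; both rules are assumptions for illustration, and detecting overlapping (rather than identical) beliefs would need a richer criterion.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    name: str
    text: str
    aliases: set[str] = field(default_factory=set)


def compact(beliefs: list[Belief]) -> list[Belief]:
    """Merge beliefs that share a normalized name, keeping every alias."""
    merged: dict[str, Belief] = {}
    for belief in beliefs:
        key = belief.name.strip().lower()
        if key not in merged:
            merged[key] = Belief(key, belief.text, set(belief.aliases))
            continue
        kept = merged[key]
        kept.aliases |= belief.aliases  # alias history is unioned, never dropped
        kept.text = belief.text         # assumption: the later entry's text wins
    return list(merged.values())


store = [
    Belief("kubernetes", "Workloads run on a managed cluster.", {"k8s"}),
    Belief("Kubernetes ", "Workloads run on a managed cluster in eu-west.", {"kube"}),
    Belief("redis", "Auth service stores sessions in Redis."),
]
compacted = compact(store)
print(len(compacted))                # 2
print(sorted(compacted[0].aliases))  # ['k8s', 'kube']
```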

The predictable objection to BM25 is vocabulary coverage: if a user refers to a belief using a term not yet in the alias set, retrieval fails. This objection is correct as a static description and wrong as a practical one. On first encounter, the system returns silence rather than noise. The extraction worker captures the new term as an alias. Every subsequent query using that term resolves correctly.
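The first-encounter behavior can be illustrated with a deliberately small sketch. It uses exact alias lookup instead of full BM25 scoring to keep the mechanism visible, and the `AliasIndex` class and its methods are hypothetical; how the extraction worker decides which belief a new term belongs to is outside the scope of this example.

```python
class AliasIndex:
    """Maps observed surface forms to canonical belief names."""

    def __init__(self) -> None:
        self._alias_to_belief: dict[str, str] = {}

    def add_alias(self, alias: str, belief_name: str) -> None:
        self._alias_to_belief[alias.lower()] = belief_name

    def lookup(self, query: str) -> list[str]:
        hits: list[str] = []
        for tok in query.lower().split():
            name = self._alias_to_belief.get(tok)
            if name is not None and name not in hits:
                hits.append(name)
        return hits  # empty list: silence rather than noise


index = AliasIndex()
index.add_alias("kubernetes", "kubernetes")

# First encounter with an unseen surface form returns nothing.
print(index.lookup("restart the kube cluster"))  # []

# Assumption: the extraction worker links the new term to the belief afterwards.
index.add_alias("kube", "kubernetes")
print(index.lookup("restart the kube cluster"))  # ['kubernetes']
```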

The consequence is a precision flywheel that runs in the opposite direction from similarity search. A purely semantic system degrades as the store grows: more beliefs means more semantic mass, broader cosine overlap, and lower precision on every query. Alias-weighted BM25 improves as the store grows: more sessions means more observed surface forms, a richer alias set, and higher precision on the vocabulary that is actually used.

The store becomes more findable with each session, not less. That is the property that makes persistent memory viable at the scale where it actually matters.

PrecisionMemBench evaluates retrieval quality independently of any generative model. Cases carry mustExclude assertions and shouldOnlyInclude constraints that make noise a hard failure rather than an invisible inference cost. A system returning every belief in the store achieves recall of 1.0 and fails every precision assertion. Neither failure requires a downstream model to surface it.
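One way such assertions could be checked is sketched below. The field names mirror mustExclude and shouldOnlyInclude, but the `Case` type, the pass rule, and the belief identifiers are assumptions made for illustration; the benchmark repository is the authority on the real case format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Case:
    expected: set[str]
    must_exclude: set[str] = field(default_factory=set)
    should_only_include: Optional[set[str]] = None


def evaluate(case: Case, retrieved: list[str]) -> dict:
    got = set(retrieved)
    hits = got & case.expected
    recall = len(hits) / len(case.expected) if case.expected else 1.0
    precision = len(hits) / len(got) if got else 0.0  # convention for empty results
    # Any excluded belief in the result set is a hard failure.
    violations = sorted(got & case.must_exclude)
    if case.should_only_include is not None:
        # Anything outside the allow-list is also a hard failure.
        violations += sorted(got - case.should_only_include - case.must_exclude)
    return {
        "recall": recall,
        "precision": precision,
        "violations": violations,
        "passed": recall == 1.0 and not violations,
    }


case = Case(
    expected={"auth-redis-dependency"},
    must_exclude={"sqlalchemy-old"},
    should_only_include={"auth-redis-dependency", "redis-sessions"},
)

# A system that returns its entire store: perfect recall, failed case.
everything = ["auth-redis-dependency", "redis-sessions", "sqlalchemy-old", "vitest-preference"]
result = evaluate(case, everything)
print(result["recall"], result["precision"], result["passed"])  # 1.0 0.25 False
print(result["violations"])  # ['sqlalchemy-old', 'vitest-preference']
```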

The 89 cases cover alias resolution, scope disambiguation, fuzzy matching, cross-user isolation, budget eviction, supersession chain exclusion, relation expansion, and session-level noise isolation under multi-turn topic drift. All five evaluated systems were granted a schema-aware evaluation harness that applies pin-status filtering, open-question routing, and scope isolation to comparison system results using Tenure's own structural metadata, so comparison systems do not fail due to formatting or structural technicalities. They fail because cosine similarity cannot prevent noise accumulation at the retrieval layer.

Mem0, Zep, and Hindsight each pass fewer total cases than the vector baseline they are built on, with zero active retrieval passes across all three. The benchmark is published at github.com/tenurehq/precisionmembench and can be run against any memory implementation.

source & further reading

tenureai.dev — original article
