{"slug": "my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless", "title": "My AI memory benchmark said 98.3%. The number was true — and worthless.", "summary": "A developer building Bastra Recall, an MIT-licensed MCP memory server for Claude, discovered that their initial 98.3% benchmark was misleading because it queried each memory with its own trigger phrase. After building a more realistic benchmark using paraphrased queries, they found that lexical search (BM25) achieved only 63.1% Recall@3, while adding local embeddings improved far-recall to 79.6%. The project now defaults to BM25 with an optional hybrid embedding mode.", "body_md": "In my last post I introduced Bastra Recall — an MIT-licensed MCP memory server that gives Claude persistent memory as plain Markdown in a local Obsidian vault. I promised a follow-up on retrieval and benchmarking.\n\nHere it is. It starts with me being wrong.\n\nThe 98.3% that meant nothing\n\nEarly on, I ran an eval against my real vault: 59 memories, and for each one I used its own trigger phrase as the query. Result: 98.3%.\n\nI was pleased for about a week. Then it sank in: this benchmark is a tautology. Every memory in Recall carries a recall_when field — trigger phrases describing when it should resurface. Querying each memory with its own trigger is like testing a search engine by searching for the exact title of the page you want. Of course it wins.\n\nThe number was real. It just didn't measure the thing that matters: does the right memory come back when a future session describes the situation in completely different words? Nobody re-types their trigger phrase weeks later. 
They paraphrase, they switch languages, they half-remember.\n\nSo I built a benchmark designed to hurt.\n\nA benchmark that can actually fail\n\nThe setup, on my real 381-memory vault:\n\nThis is meant to simulate the real failure mode: an AI session weeks later, describing a stored situation in its own words.\n\nWhat the honest numbers look like\n\nLexical search (BM25) alone: 63.1% Recall@3 on far queries. That's the truth behind the 98.3%. On heavily paraphrased queries, pure keyword search misses more than a third of the time.\n\nFour findings surprised me more:\n\nEmbeddings rescue exactly the hard cases. Adding a local embedding layer (Ollama + embeddinggemma, hybrid with BM25) lifted far-recall from 63.1% to 79.6% (+16.5pp), and cut \"not retrieved at all\" from 20 cases to 7 out of 103. The hardest voices gained the most — the junior-dev-English persona jumped from 40.0% to 73.3%. If your users phrase things differently than you do (different language, different experience level), that's where vectors earn their keep.\n\nMy favorite feature did nothing here. recall_when trigger phrases are the highest-weighted search field in Recall, and on near-queries they're great. On paraphrased far-queries at k=3, their measured lift was approximately zero — in every arm of the test. The tautology cut both ways: the feature looked heroic in the old benchmark precisely because the old benchmark was rigged in its favor.\n\nWrite-time paraphrases didn't help either. Recall can optionally generate paraphrases of a memory's triggers at save time (doc2query) and index them alongside — the idea being that the wording a future session will type might already be sitting in the index. Sounds like exactly the right lever against paraphrased queries. In this far-query profile, that arm produced no lift over plain BM25 (~63% Recall@3, level with the lexical baseline). Only dense vectors closed the gap. 
Lesson: a plausible retrieval idea is not a lift — measure it before you believe it.\n\nThe remaining gap isn't recall — it's ranking. In the hybrid arm, 96–97 of the 103 far-query targets were in the candidate pool, sitting at a mean rank around 2.3–2.6. The index finds them; the ordering doesn't always surface them first. That's a precision/re-ranking problem, which is a different (and later) fight.\n\nOne caveat, because honest benchmarking means stating it: the persona queries were generated from memory digests, so absolute numbers aren't comparable across different runs — the robust signal is the cross-arm comparison on identical queries.\n\nWhat this changed in the product\n\nBM25 stays the default. A fresh npx bastra-recall install gives you zero-setup lexical search — no model downloads, no daemon dependencies. For queries anywhere near your original wording, it's already at ceiling.\n\nEmbeddings are one config line away, fully local. If you run Ollama, hybrid search switches on and your far-recall jumps ~16 points. No cloud, no API key — the vectors are computed on your machine.\n\nRe-ranking is on the roadmap, gated behind vault scale, because the data says that's where the remaining points live.\n\nTakeaways if you're building retrieval for anything\n\nIf your eval queries are derived from your index fields, your benchmark is a tautology. You're measuring string overlap, not retrieval.\n\nTest paraphrase survival. The realistic query is written weeks later, by someone (or something) that doesn't remember your exact words. Multiple voices, multiple languages if that's your reality.\n\nSeparate \"not retrieved\" from \"mis-ranked.\" They look identical in a Recall@k number and need completely different fixes.\n\nPublish the number that hurts. 
63.1% is a more useful fact than 98.3% ever was.\n\nTry it\n\nnpx bastra-recall install\n\nThat starts a guided setup — pick your vault, your AI clients, and (optionally) semantic recall from selection menus, no flags needed. If you'd rather skip the questions:\n\nnpx bastra-recall install all\n\nStill early (0.7.6), still macOS/Apple Silicon/Node 22+, still MIT: github.com/n0mad-ai/bastra-recall\n\nIf you've benchmarked retrieval for an AI memory system — or think my methodology has a hole in it — tell me in the comments. The last time I questioned my own numbers, the product got measurably better. And if honest benchmarks are your thing, a star helps other people find the repo.", "url": "https://wpnews.pro/news/my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless", "canonical_source": "https://dev.to/daniel_nevoigt_ca2fdc23d5/my-ai-memory-benchmark-said-983-the-number-was-true-and-worthless-20go", "published_at": "2026-07-04 10:06:49+00:00", "updated_at": "2026-07-04 10:18:50.360762+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "ai-research"], "entities": ["Bastra Recall", "Claude", "Obsidian", "BM25", "Ollama", "embeddinggemma", "MCP"], "alternates": {"html": "https://wpnews.pro/news/my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless", "markdown": "https://wpnews.pro/news/my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless.md", "text": "https://wpnews.pro/news/my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless.txt", "jsonld": "https://wpnews.pro/news/my-ai-memory-benchmark-said-98-3-the-number-was-true-and-worthless.jsonld"}}