My AI memory benchmark said 98.3%. The number was true — and worthless.

A developer building Bastra Recall, an MIT-licensed MCP memory server for Claude, discovered that their initial 98.3% benchmark was misleading because it queried each memory with its own trigger phrase. After building a more realistic benchmark using paraphrased queries, they found that lexical search (BM25) achieved only 63.1% Recall@3, while adding local embeddings improved far-recall to 79.6%. The project now defaults to BM25 with an optional hybrid embedding mode.

In my last post I introduced Bastra Recall, an MIT-licensed MCP memory server that gives Claude persistent memory as plain Markdown in a local Obsidian vault. I promised a follow-up on retrieval and benchmarking. Here it is. It starts with me being wrong.

The 98.3% that meant nothing

Early on, I ran an eval against my real vault: 59 memories, and for each one I used its own trigger phrase as the query. Result: 98.3%. I was pleased for about a week.

Then it sank in: this benchmark is a tautology. Every memory in Recall carries a recall when field, a set of trigger phrases describing when it should resurface. Querying each memory with its own trigger is like testing a search engine by searching for the exact title of the page you want. Of course it wins.

The number was real. It just didn't measure the thing that matters: does the right memory come back when a future session describes the situation in completely different words? Nobody re-types their trigger phrase weeks later. They paraphrase, they switch languages, they half-remember.

So I built a benchmark designed to hurt.

A benchmark that can actually fail

The setup, on my real 381-memory vault: 103 heavily paraphrased "far" queries, written in several persona voices (a junior developer writing in English, for example), each scored on whether its target memory lands in the top 3. This is meant to simulate the real failure mode: an AI session weeks later, describing a stored situation in its own words.

What the honest numbers look like

Lexical search (BM25) alone: 63.1% Recall@3 on far queries. That's the truth behind the 98.3%. On heavily paraphrased queries, pure keyword search misses more than a third of the time.

Four findings surprised me more:

Embeddings rescue exactly the hard cases. Adding a local embedding layer (Ollama + embeddinggemma, hybrid with BM25) lifted far-recall from 63.1% to 79.6% (+16.5pp), and cut "not retrieved at all" from 20 cases to 7 out of 103. The hardest voices gained the most: the junior-dev-English persona jumped from 40.0% to 73.3%. If your users phrase things differently than you do (different language, different experience level), that's where vectors earn their keep.
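To make the metric concrete, here is a minimal sketch of how Recall@3 and a cross-arm lift in percentage points could be computed on identical queries. This is not Recall's actual eval harness: the function name, the data layout, and the toy query and memory IDs are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def recall_at_k(results: Dict[str, List[str]], targets: Dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose target memory appears in the top-k results."""
    hits = sum(
        1 for query_id, target in targets.items()
        if target in results.get(query_id, [])[:k]
    )
    return hits / len(targets)


# Hypothetical toy data: query id -> the memory id that should come back.
targets = {"q1": "m1", "q2": "m2", "q3": "m3", "q4": "m4"}

# Ranked result lists from two arms, run on the SAME queries.
lexical = {
    "q1": ["m1", "m9", "m8"],
    "q2": ["m7", "m6", "m5", "m2"],  # target found, but only at rank 4
    "q3": ["m3"],
    "q4": ["m9", "m8", "m7"],        # target never retrieved
}
hybrid = {
    "q1": ["m1", "m9", "m8"],
    "q2": ["m7", "m2", "m5"],
    "q3": ["m9", "m3"],
    "q4": ["m9", "m4", "m7"],
}

lexical_recall = recall_at_k(lexical, targets)
hybrid_recall = recall_at_k(hybrid, targets)

# Lift is reported in percentage points, not as a relative percentage.
lift_pp = (hybrid_recall - lexical_recall) * 100
print(f"lexical={lexical_recall:.1%} hybrid={hybrid_recall:.1%} lift={lift_pp:+.1f}pp")
```

Because both arms are scored against the same queries and targets, the difference between them stays meaningful even when the absolute numbers depend on how the queries were generated.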
My favorite feature did nothing here. The recall when trigger phrases are the highest-weighted search field in Recall, and on near-queries they're great. On paraphrased far-queries at k=3, their measured lift was approximately zero, in every arm of the test. The tautology cut both ways: the feature looked heroic in the old benchmark precisely because the old benchmark was rigged in its favor.

Write-time paraphrases didn't help either. Recall can optionally generate paraphrases of a memory's triggers at save time (doc2query) and index them alongside, the idea being that the wording a future session will type might already be sitting in the index. Sounds like exactly the right lever against paraphrased queries. In this far-query profile, that arm produced no lift over plain BM25 (~63% Recall@3, level with the lexical baseline). Only dense vectors closed the gap. Lesson: a plausible retrieval idea is not a lift. Measure it before you believe it.

The remaining gap isn't recall. It's ranking. In the hybrid arm, 96–97 of the 103 far-query targets were in the candidate pool, sitting at a mean rank around 2.3–2.6. The index finds them; the ordering doesn't always surface them first. That's a precision/re-ranking problem, which is a different and later fight.

One caveat, because honest benchmarking means stating it: the persona queries were generated from memory digests, so absolute numbers aren't comparable across different runs. The robust signal is the cross-arm comparison on identical queries.

What this changed in the product

BM25 stays the default. A fresh npx bastra-recall install gives you zero-setup lexical search: no model downloads, no daemon dependencies. For queries anywhere near your original wording, it's already at ceiling.

Embeddings are one config line away, fully local. If you run Ollama, hybrid search switches on and your far-recall jumps ~16 points. No cloud, no API key: the vectors are computed on your machine.
Re-ranking is on the roadmap, gated behind vault scale, because the data says that's where the remaining points live.

Takeaways if you're building retrieval for anything

If your eval queries are derived from your index fields, your benchmark is a tautology. You're measuring string overlap, not retrieval.

Test paraphrase survival. The realistic query is written weeks later, by someone or something that doesn't remember your exact words. Multiple voices, multiple languages if that's your reality.

Separate "not retrieved" from "mis-ranked." They look identical in a Recall@k number and need completely different fixes.

Publish the number that hurts. 63.1% is a more useful fact than 98.3% ever was.

Try it

    npx bastra-recall install

That starts a guided setup: pick your vault, your AI clients, and optionally semantic recall from selection menus, no flags needed. If you'd rather skip the questions:

    npx bastra-recall install all

Still early (0.7.6), still macOS/Apple Silicon/Node 22+, still MIT: github.com/n0mad-ai/bastra-recall

If you've benchmarked retrieval for an AI memory system, or think my methodology has a hole in it, tell me in the comments. The last time I questioned my own numbers, the product got measurably better.

And if honest benchmarks are your thing, a star helps other people find the repo.
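For anyone who wants to apply the takeaway about separating "not retrieved" from "mis-ranked", here is a rough sketch of that breakdown. It assumes you can record, for each query, the rank at which the target appeared in the candidate pool, or None if it never appeared. The function names and the toy ranks are hypothetical and are not taken from Recall's code or data.

```python
from typing import Dict, List, Optional


def classify(rank: Optional[int], k: int = 3) -> str:
    """Label one query by where its target landed (1-based rank, None = absent)."""
    if rank is None:
        return "not_retrieved"  # an indexing/recall problem
    return "hit" if rank <= k else "mis_ranked"  # mis_ranked is an ordering problem


def breakdown(ranks: List[Optional[int]], k: int = 3) -> Dict[str, float]:
    """Split a Recall@k number into hits, mis-ranked targets, and absent targets."""
    counts = {"hit": 0, "mis_ranked": 0, "not_retrieved": 0}
    for rank in ranks:
        counts[classify(rank, k)] += 1
    found = [rank for rank in ranks if rank is not None]
    return {
        **counts,
        "recall_at_k": counts["hit"] / len(ranks),
        # Mean rank over targets that were in the candidate pool at all.
        "mean_rank_found": sum(found) / len(found) if found else float("nan"),
    }


# Made-up target ranks for eight queries.
toy_ranks: List[Optional[int]] = [1, 2, 1, 5, None, 3, 7, None]
report = breakdown(toy_ranks)
print(report)
```

Both failure types count as misses in Recall@k, but a high "not_retrieved" count points at the index or the query representation, while a high "mis_ranked" count points at ordering.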