[ARTICLE · art-42088] src=github.com ↗ pub= topic=ai-agents verified=true sentiment=· neutral

Show HN: A benchmark for the failure modes of agent memory

A developer released an open benchmark, agent-memory-bench, that scores AI agent memory systems on four failure modes—retraction, collision, recall, and conflict—rather than shallow retrieval metrics. The benchmark runs offline with zero dependencies and no API key, and its reference baselines show answer correctness ranging from 23% to 92%, highlighting the gap between retrieval quality and actual answer correctness.

Published Jun 27, 2026 · 3 min read

An open benchmark for the failure modes of agent memory systems.

Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.

agent-memory-bench scores those four failure modes directly, and it runs offline, with zero dependencies and no API key, so the leaderboard is reproducible by anyone in one command.

npm install
npm run bench        # prints the leaderboard below
npm test             # adversarial tests for the scoring core + baselines

Reference baselines across 13 scenarios in 4 categories. Numbers are produced by npm run bench; reproduce them yourself.

system            retraction  collision  recall  conflict  overall
typed-constraint  100%        100%       75%     100%      92%
keyword           0%          100%       75%     0%        46%
recency           100%        0%         0%      0%        23%

Read this as a map of where each strategy breaks, not a ranking of products:

keyword (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict: with no notion of time it happily returns the value the user already changed.

recency (latest token-match wins) fixes retraction but collapses on collision and recall: it drifts to the most recent look-alike, which is usually the wrong entity.

typed-constraint models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item no baseline solves, so the benchmark isn't saturated.
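The two simpler failure patterns can be reproduced with toy stand-ins. The sketch below is not the repository's baseline code: keywordAnswer, recencyAnswer, and both example scripts are hypothetical, and the tie-breaking rule (earliest line wins) is an assumption chosen to make the missing notion of time visible. typed-constraint is not modelled here.

```typescript
// Split text into lowercase alphanumeric words.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Count how many words of `line` also appear in `question`.
function overlap(line: string, question: string): number {
  const wanted = new Set(tokenize(question));
  return tokenize(line).filter((word) => wanted.has(word)).length;
}

// Toy "keyword" strategy (assumption): best word overlap wins,
// ties go to the earliest line. It has no notion of time.
function keywordAnswer(lines: string[], question: string): string {
  let best = "";
  let bestScore = 0;
  for (const line of lines) {
    const score = overlap(line, question);
    if (score > bestScore) {
      best = line;
      bestScore = score;
    }
  }
  return best;
}

// Toy "recency" strategy (assumption): the latest line that shares
// any word with the question wins, whichever entity it is about.
function recencyAnswer(lines: string[], question: string): string {
  for (const line of [...lines].reverse()) {
    if (overlap(line, question) > 0) return line;
  }
  return "";
}

// Hypothetical retraction script: the second line updates the first.
const retraction = ["My favorite editor is Vim.", "My favorite editor is now Emacs."];
// Hypothetical collision script: two look-alike entities.
const collision = ["Alice's manager is Dana.", "Alicia's manager is Omar."];

const retractionQuestion = "What is my favorite editor?";
const collisionQuestion = "Who is Alice's manager?";

console.log(keywordAnswer(retraction, retractionQuestion)); // stale value (Vim)
console.log(recencyAnswer(retraction, retractionQuestion)); // current value (Emacs)
console.log(keywordAnswer(collision, collisionQuestion));   // right entity (Dana)
console.log(recencyAnswer(collision, collisionQuestion));   // wrong entity (Omar)
```

On these two scripts the toy keyword strategy leaks the retracted value and the toy recency strategy answers about the wrong entity, which is the same shape as the 0% cells in the leaderboard.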

The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.
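The leaderboard does not say how many scenarios each category holds or how the overall column is weighted. One split that reproduces all three overall figures, assuming every scenario counts equally, is 3 retraction, 3 collision, 4 recall, and 3 conflict scenarios. This is an inference from the published percentages, not a number taken from the repository, and the names scenarioCounts and overallPercent are hypothetical.

```typescript
// Assumed scenario counts per category (inferred from the percentages,
// NOT stated in the source). They sum to the 13 scenarios the text reports.
const scenarioCounts = { retraction: 3, collision: 3, recall: 4, conflict: 3 } as const;
type Category = keyof typeof scenarioCounts;

// Per-category pass rates copied from the leaderboard (1 = 100%).
const published = {
  "typed-constraint": { retraction: 1, collision: 1, recall: 0.75, conflict: 1 },
  keyword: { retraction: 0, collision: 1, recall: 0.75, conflict: 0 },
  recency: { retraction: 1, collision: 0, recall: 0, conflict: 0 },
};

// Overall rate when every scenario carries equal weight (assumption).
function overallPercent(rates: Record<Category, number>): number {
  let passed = 0;
  let total = 0;
  for (const category of Object.keys(scenarioCounts) as Category[]) {
    passed += rates[category] * scenarioCounts[category];
    total += scenarioCounts[category];
  }
  return Math.round((passed / total) * 100);
}

console.log(overallPercent(published["typed-constraint"])); // 92 (12 of 13)
console.log(overallPercent(published.keyword));             // 46 (6 of 13)
console.log(overallPercent(published.recency));             // 23 (3 of 13)
```

An unweighted mean of the four category rates would instead give 94%, 44%, and 25%, which does not match the table, so the overall column looks scenario-weighted rather than category-averaged.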

Category    One-line definition
Retraction  A fact is updated; the new value must win and the old must not surface.
Collision   Two similar entities; answer about the one asked, don't conflate.
Recall      Fact stated early, needed late, with noise (incl. a multi-hop frontier case).
Conflict    A fact is explicitly contradicted in-text; resolve to one current value.

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md.

A system implements one small interface (src/types.ts):

interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;        // called before each scenario
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

Methods may be async, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. Drop your class into src/systems/, add it to the list in src/run.ts, and run npm run bench. Use npm run bench -- --fails to see every query your system missed and what it answered.
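As a rough illustration of the async case, here is a sketch of a system that delegates to an asynchronous backend. FakeRemoteStore stands in for a hosted memory product and is entirely hypothetical, as are RemoteBackedMemory and its naive "longest word" search term; only the MemorySystem interface comes from the text above. How the class is registered in src/run.ts is not shown because the source does not describe that file's contents.

```typescript
// Interface as quoted from src/types.ts in the text.
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical stand-in for a hosted memory client; not a real SDK.
class FakeRemoteStore {
  private notes: string[] = [];

  async clear(): Promise<void> {
    this.notes = [];
  }

  async add(note: string): Promise<void> {
    this.notes.push(note);
  }

  // Returns the most recently added note containing `term`, or "".
  async findLatest(term: string): Promise<string> {
    const needle = term.toLowerCase();
    for (const note of [...this.notes].reverse()) {
      if (note.toLowerCase().includes(needle)) return note;
    }
    return "";
  }
}

// Hypothetical adapter: every method forwards to the async backend.
class RemoteBackedMemory implements MemorySystem {
  readonly name = "remote-backed-example";
  private readonly store: FakeRemoteStore;

  constructor(store: FakeRemoteStore) {
    this.store = store;
  }

  reset(): Promise<void> {
    return this.store.clear();
  }

  remember(text: string): Promise<void> {
    return this.store.add(text);
  }

  async query(question: string): Promise<string> {
    // Naive term choice (assumption): search for the longest word in the question.
    const words = question.match(/[A-Za-z0-9]+/g) ?? [];
    const term = words.reduce((a, b) => (b.length > a.length ? b : a), "");
    return term === "" ? "" : this.store.findLatest(term);
  }
}

async function demo(): Promise<string> {
  const memory = new RemoteBackedMemory(new FakeRemoteStore());
  await memory.reset();
  await memory.remember("The staging database runs on port 5432.");
  await memory.remember("The production database runs on port 6543.");
  return memory.query("Which port does production use?");
}

demo().then((answer) => console.log(answer));
```

Because the interface allows either plain or promise-returning methods, a caller that awaits every call can treat this adapter and a synchronous baseline the same way.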

Scenarios (src/scenarios/) are ordered scripts of remember and query events. Each query declares the substring the answer must contain and the stale substrings it must not, so leaking an out-of-date value is scored as a failure, not a near-miss.

Harness (src/harness.ts) resets the system, replays a scenario, and judges each query. Scenarios are fully isolated.

Scoring (src/score.ts, src/report.ts) aggregates per-category and overall rates and renders the leaderboard.
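The judging rule described above can be sketched in a few lines. The event shape, the field names mustContain and mustNotContain, and the functions judge and replay are assumptions made for illustration; the repository's actual types in src/scenarios/ and src/harness.ts may differ. Matching here is a plain case-sensitive substring check, which is also an assumption.

```typescript
// Interface as quoted from src/types.ts in the text.
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical event shape; field names are not taken from the repository.
type ScenarioEvent =
  | { kind: "remember"; text: string }
  | { kind: "query"; question: string; mustContain: string; mustNotContain: string[] };

// Pass only if the required substring is present and no stale substring leaks.
function judge(answer: string, mustContain: string, mustNotContain: string[]): boolean {
  const hasRequired = answer.includes(mustContain);
  const leaksStale = mustNotContain.some((stale) => answer.includes(stale));
  return hasRequired && !leaksStale;
}

// Reset, replay the events in order, and return one verdict per query.
async function replay(system: MemorySystem, events: ScenarioEvent[]): Promise<boolean[]> {
  await system.reset();
  const verdicts: boolean[] = [];
  for (const event of events) {
    if (event.kind === "remember") {
      await system.remember(event.text);
    } else {
      const answer = await system.query(event.question);
      verdicts.push(judge(answer, event.mustContain, event.mustNotContain));
    }
  }
  return verdicts;
}

// Deliberately leaky example system: answers with everything it has seen.
class LeakyMemory implements MemorySystem {
  readonly name = "leaky-example";
  private lines: string[] = [];
  reset(): void {
    this.lines = [];
  }
  remember(text: string): void {
    this.lines.push(text);
  }
  query(_question: string): string {
    return this.lines.join(" ");
  }
}

// Hypothetical retraction scenario.
const deadlineScenario: ScenarioEvent[] = [
  { kind: "remember", text: "The deadline is Friday." },
  { kind: "remember", text: "Update: the deadline is Monday." },
  { kind: "query", question: "When is the deadline?", mustContain: "Monday", mustNotContain: ["Friday"] },
];

replay(new LeakyMemory(), deadlineScenario).then((verdicts) => console.log(verdicts)); // [ false ]
```

The leaky system's answer does contain "Monday", but it also contains the retracted "Friday", so the query fails. That is the difference between this rule and a retrieval-hit metric, which would count the same answer as a success.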

The scoring core and every baseline behaviour are pinned by an adversarial test suite (npm test).

v0.1: 4 categories, 13 scenarios, 3 reference baselines, offline and reproducible.

Next: broaden each category (more scenarios, harder distractors), add temporal and preference-drift categories, add an optional LLM-judge mode for free-form answers, and publish a contribution guide so external memory systems can submit to the board.

Contributions of new scenarios, especially adversarial ones that break the typed-constraint baseline, are the most valuable thing you can add.

MIT
