Show HN: A benchmark for the failure modes of agent memory

A developer released an open benchmark, agent-memory-bench, that scores AI agent memory systems on four failure modes (retraction, collision, recall, and conflict) rather than shallow retrieval metrics. The benchmark runs offline with zero dependencies and no API key, and its reference baselines show answer correctness ranging from 23% to 92%, highlighting the gap between retrieval quality and actual answer correctness.

An open benchmark for the failure modes of agent memory systems.

Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.

agent-memory-bench scores those four failure modes directly. It runs offline, with zero dependencies and no API key, so the leaderboard is reproducible by anyone in one command.

- `npm install`
- `npm run bench`: prints the leaderboard below
- `npm test`: adversarial tests for the scoring core + baselines

Reference baselines across 13 scenarios in 4 categories. Numbers are produced by `npm run bench`; reproduce them yourself.

| system | retraction | collision | recall | conflict | overall |
|---|---|---|---|---|---|
| typed-constraint | 100% | 100% | 75% | 100% | 92% |
| keyword | 0% | 100% | 75% | 0% | 46% |
| recency | 100% | 0% | 0% | 0% | 23% |

Read this as a map of where each strategy breaks, not a ranking of products:

`keyword` (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict: with no notion of time, it happily returns the value the user already changed.
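To make the `keyword` retraction failure concrete, here is a minimal toy sketch. It is not the benchmark's actual baseline: the `Fact` type, the `overlap` scoring, and the `keywordRetrieve` function are all hypothetical names invented for illustration, and the tie-breaking rule (earliest fact wins) is an assumption. It shows how pure token overlap can surface a value the user has already changed.

```typescript
// Toy illustration only: every name here is hypothetical and is not
// taken from the agent-memory-bench source.
interface Fact {
  turn: number; // position in the conversation; higher means later
  text: string;
}

// Lowercase the text and split it into non-empty word tokens.
function tokens(s: string): string[] {
  return s.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
}

// Score = number of query tokens that also appear in the fact.
function overlap(query: string, fact: Fact): number {
  const factTokens = tokens(fact.text);
  let n = 0;
  for (const t of tokens(query)) {
    if (factTokens.indexOf(t) !== -1) n++;
  }
  return n;
}

// Keyword-style retrieval with no model of time:
// highest overlap wins, and the earliest fact wins ties (assumed rule).
function keywordRetrieve(query: string, facts: Fact[]): Fact | undefined {
  let best: Fact | undefined;
  let bestScore = 0;
  for (const f of facts) {
    const s = overlap(query, f);
    if (s > bestScore) {
      best = f;
      bestScore = s;
    }
  }
  return best;
}

const facts: Fact[] = [
  { turn: 1, text: "My shipping city is Austin" },
  { turn: 2, text: "I like hiking on weekends" },
  { turn: 3, text: "Update: my shipping city is now Denver" },
];

const hit = keywordRetrieve("What is my shipping city?", facts);
console.log(hit?.text); // prints "My shipping city is Austin"
```

Both the original fact and the update match four query tokens, so overlap alone cannot tell them apart, and the stale value is returned. Under the benchmark's definition of retraction, the new value must win, so a result like this would count as a failure.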
`recency` (latest token-match wins) fixes retraction but collapses on collision and recall: it drifts to the most recent look-alike, which is usually the wrong entity.

`typed-constraint` models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item no baseline solves, so the benchmark isn't saturated.

The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.

| Category | One-line definition |
|---|---|
| Retraction | A fact is updated; the new value must win and the old must not surface. |
| Collision | Two similar entities; answer about the one asked, don't conflate. |
| Recall | Fact stated early, needed late, with noise (incl. a multi-hop frontier case). |
| Conflict | A fact is explicitly contradicted in-text; resolve to one current value. |

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md (/Kausha3/agent-memory-bench/blob/main/TAXONOMY.md).

A system implements one small interface (`src/types.ts`):
interface MemorySystem { readonly name: string; reset : void | Promise