{"slug": "show-hn-a-benchmark-for-the-failure-modes-of-agent-memory", "title": "Show HN: A benchmark for the failure modes of agent memory", "summary": "A developer released an open benchmark, agent-memory-bench, that scores AI agent memory systems on four failure modes—retraction, collision, recall, and conflict—rather than shallow retrieval metrics. The benchmark runs offline with zero dependencies and no API key, and its reference baselines show answer correctness ranging from 23% to 92%, highlighting the gap between retrieval quality and actual answer correctness.", "body_md": "**An open benchmark for the failure modes of agent memory systems.**\n\nEveryone shipping an AI agent bolts on a \"memory,\" and everyone evaluates it the same\nshallow way: *did retrieval fetch a relevant chunk?* But agents don't fail in the field\nbecause retrieval missed. They fail because the fact they retrieved was **stale**,\nbelonged to the **wrong entity**, was **buried under noise**, or **contradicted** another\nfact the system also believed. Those are the bugs that make an agent confidently wrong.\n\n`agent-memory-bench` scores those four failure modes directly — and it runs **offline,\nwith zero dependencies and no API key**, so the leaderboard is reproducible by anyone in\none command.\n\n```\nnpm install\nnpm run bench        # prints the leaderboard below\nnpm test             # adversarial tests for the scoring core + baselines\n```\n\nReference baselines across 13 scenarios in 4 categories. 
Numbers are produced by `npm run bench` — reproduce them yourself.\n\n| system | retraction | collision | recall | conflict | overall |\n|---|---|---|---|---|---|\n| `typed-constraint` | 100% | 100% | 75% | 100% | 92% |\n| `keyword` | 0% | 100% | 75% | 0% | 46% |\n| `recency` | 100% | 0% | 0% | 0% | 23% |\n\nRead this as a map of *where each strategy breaks*, not a ranking of products:\n\n- `keyword` (similarity retrieval, no model of time) aces collision but scores **0% on retraction and conflict** — with no notion of time it happily returns the value the user already changed.\n- `recency` (latest token-match wins) fixes retraction but collapses on **collision and recall** — it drifts to the most recent look-alike, which is usually the wrong entity.\n- `typed-constraint` models *time* (facts retract) and *identity* (facts bind to an entity), so it survives three categories. It still misses the one **multi-hop** recall scenario — a deliberate frontier item **no baseline solves**, so the benchmark isn't saturated.\n\nThe headline isn't \"92%.\" It's that retrieval-quality metrics would rate all three systems\nsimilarly, while their *answer correctness* ranges from 23% to 92%. That gap is the point.\n\n| Category | One-line definition |\n|---|---|\n| Retraction | A fact is updated; the new value must win and the old must not surface. |\n| Collision | Two similar entities; answer about the one asked, don't conflate. |\n| Recall | Fact stated early, needed late, with noise (incl. a multi-hop frontier case). |\n| Conflict | A fact is explicitly contradicted in-text; resolve to one current value. |\n\nFull definitions, worked examples, and *why each one is hard* are in\n[TAXONOMY.md](/Kausha3/agent-memory-bench/blob/main/TAXONOMY.md).\n\nA system implements one small interface (`src/types.ts`):\n\n```\ninterface MemorySystem {\n  readonly name: string;\n  reset(): void | Promise<void>;        // called before each scenario\n  remember(text: string): void | Promise<void>;\n  query(question: string): string | Promise<string>;\n}\n```\n\nMethods may be async, so an embedding store, a hosted memory product, or an LLM-backed\nextractor plugs in exactly like the pure-code baselines. Drop your class into\n`src/systems/`, add it to the list in `src/run.ts`, and run `npm run bench`. Use\n`npm run bench -- --fails` to see every query your system missed and what it answered.\n\n- **Scenarios** (`src/scenarios/`) are ordered scripts of `remember` and `query` events. Each query declares the substring the answer must contain and the stale substrings it must not — so leaking an out-of-date value is scored as a failure, not a near-miss.\n- **Harness** (`src/harness.ts`) resets the system, replays a scenario, and judges each query. Scenarios are fully isolated.\n- **Scoring** (`src/score.ts`, `src/report.ts`) aggregates per-category and overall rates and renders the leaderboard.\n\nThe scoring core and every baseline behaviour are pinned by an adversarial test suite\n(`npm test`).\n\nv0.1: 4 categories, 13 scenarios, 3 reference baselines, offline and reproducible.\n\nNext: broaden each category (more scenarios, harder distractors), add **temporal** and\n**preference-drift** categories, add an optional LLM-judge mode for free-form answers,\nand publish a contribution guide so external memory systems can submit to the board.\n\nContributions of new scenarios — especially adversarial ones that break the\n`typed-constraint` baseline — are the most valuable thing you can add.\n\nMIT", "url": "https://wpnews.pro/news/show-hn-a-benchmark-for-the-failure-modes-of-agent-memory", "canonical_source": "https://github.com/Kausha3/agent-memory-bench", "published_at": "2026-06-27 21:23:01+00:00", "updated_at": "2026-06-27 21:34:34.421754+00:00", "lang": "en", "topics": ["ai-agents", "ai-research", "developer-tools", "machine-learning"], "entities": ["agent-memory-bench", "Kausha3"], "alternates": {"html": "https://wpnews.pro/news/show-hn-a-benchmark-for-the-failure-modes-of-agent-memory", "markdown": "https://wpnews.pro/news/show-hn-a-benchmark-for-the-failure-modes-of-agent-memory.md", "text": "https://wpnews.pro/news/show-hn-a-benchmark-for-the-failure-modes-of-agent-memory.txt", "jsonld": "https://wpnews.pro/news/show-hn-a-benchmark-for-the-failure-modes-of-agent-memory.jsonld"}}