Show HN: A benchmark for the failure modes of agent memory


[ARTICLE · art-42088] src=github.com ↗ pub=2026-06-27T21:23Z topic=ai-agents verified=true sentiment=· neutral


A developer released an open benchmark, agent-memory-bench, that scores AI agent memory systems on four failure modes—retraction, collision, recall, and conflict—rather than shallow retrieval metrics. The benchmark runs offline with zero dependencies and no API key, and its reference baselines show answer correctness ranging from 23% to 92%, highlighting the gap between retrieval quality and actual answer correctness.


An open benchmark for the failure modes of agent memory systems.

Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.

agent-memory-bench scores those four failure modes directly, and it runs offline, with zero dependencies and no API key, so anyone can reproduce the leaderboard with one command.

npm install
npm run bench        # prints the leaderboard below
npm test             # adversarial tests for the scoring core + baselines

Reference baselines across 13 scenarios in 4 categories. Numbers are produced by npm run bench; reproduce them yourself.

system	retraction	collision	recall	conflict	overall
`typed-constraint`	100%	100%	75%	100%	92%
`keyword`	0%	100%	75%	0%	46%
`recency`	100%	0%	0%	0%	23%

Read this as a map of where each strategy breaks, not a ranking of products:

keyword (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict: with no notion of time it happily returns the value the user already changed.

recency (latest token-match wins) fixes retraction but collapses on collision and recall: it drifts to the most recent look-alike, which is usually the wrong entity.

typed-constraint models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item no baseline solves, so the benchmark isn't saturated.
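To make the first two breakdowns concrete, here is a small sketch of a token-overlap retriever and a "latest token-match wins" retriever. Both functions, their names, and the tie-breaking rule are assumptions written for this illustration; they are not the repository's keyword or recency baselines.

```typescript
// Illustrative only: NOT the repository's baselines.
function tokens(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

// Number of question tokens that also appear in the stored statement.
function overlap(question: string, statement: string): number {
  const st = tokens(statement);
  return tokens(question).filter(t => st.indexOf(t) !== -1).length;
}

// Returns the stored statement sharing the most tokens with the question.
// Ties go to the earliest statement, so storage order (time) never helps.
function keywordRetrieve(memory: string[], question: string): string {
  let best = "";
  let bestScore = -1;
  for (const m of memory) {
    const score = overlap(question, m);
    if (score > bestScore) {
      best = m;
      bestScore = score;
    }
  }
  return best;
}

// Returns the most recent statement sharing at least one token with the question.
function recencyRetrieve(memory: string[], question: string): string {
  for (let i = memory.length - 1; i >= 0; i--) {
    if (overlap(question, memory[i]) > 0) return memory[i];
  }
  return "";
}

const memory = [
  "Alice lives in Paris",
  "Alice lives in Berlin now",
  "Bob lives in Tokyo",
];
const question = "Which city does Alice live in?";
console.log(keywordRetrieve(memory, question)); // "Alice lives in Paris" (stale value)
console.log(recencyRetrieve(memory, question)); // "Bob lives in Tokyo" (wrong entity)
```

In this toy setup the overlap retriever cannot tell the old fact from the new one, and the recency retriever latches onto the newest statement that shares any token, which here belongs to a different person.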

The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.

Category	One-line definition
Retraction	A fact is updated; the new value must win and the old must not surface.
Collision	Two similar entities; answer about the one asked, don't conflate.
Recall	Fact stated early, needed late, with noise (incl. a multi-hop frontier case).
Conflict	A fact is explicitly contradicted in-text; resolve to one current value.

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md.

A system implements one small interface (src/types.ts):

interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;        // called before each scenario
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

Methods may be async, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. Drop your class into src/systems/, add it to the list in src/run.ts, and run npm run bench. Use npm run bench -- --fails to see every query your system missed and what it answered.
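As a hedged illustration of the plug-in step, here is a minimal sketch of a class that satisfies the interface. The class name LastWriteWinsMemory and its "<subject> is <value>" parsing rule are assumptions made up for this example, not code from the repository; the interface is repeated so the snippet stands on its own.

```typescript
// Interface as given in src/types.ts, repeated so the sketch compiles on its own.
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical system: keeps only the latest value per subject for statements
// shaped like "<subject> is <value>". Not one of the reference baselines.
class LastWriteWinsMemory implements MemorySystem {
  readonly name = "last-write-wins";
  private facts: Record<string, string> = {};

  // Called before each scenario, so no state leaks between scenarios.
  reset(): void {
    this.facts = {};
  }

  remember(text: string): void {
    const match = /^(.+?) is (.+?)\.?$/i.exec(text.trim());
    if (match) {
      // A later statement about the same subject overwrites the earlier one.
      this.facts[match[1].toLowerCase()] = match[2];
    }
  }

  query(question: string): string {
    const q = question.toLowerCase();
    for (const subject of Object.keys(this.facts)) {
      if (q.indexOf(subject) !== -1) return this.facts[subject];
    }
    return ""; // nothing remembered about any subject named in the question
  }
}
```

This sketch is synchronous for brevity; a store that calls an embedding service or hosted API would return promises from the same three methods.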

Scenarios (src/scenarios/) are ordered scripts of remember and query events. Each query declares the substring the answer must contain and the stale substrings it must not, so leaking an out-of-date value is scored as a failure, not a near-miss.

Harness (src/harness.ts) resets the system, replays a scenario, and judges each query. Scenarios are fully isolated.

Scoring (src/score.ts, src/report.ts) aggregates per-category and overall rates and renders the leaderboard.
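The judging rule and the replay loop described above can be sketched roughly as follows. The type names, the field names mustContain and mustNotContain, the functions judge and replay, and the case-insensitive matching are all assumptions for illustration; the real shapes live in src/scenarios/ and src/harness.ts and may differ.

```typescript
// Hypothetical event shapes; the repository's real scenario types may differ.
type RememberEvent = { kind: "remember"; text: string };
type QueryEvent = {
  kind: "query";
  question: string;
  mustContain: string;      // substring the answer must include
  mustNotContain: string[]; // stale substrings that fail the query if present
};
type ScenarioEvent = RememberEvent | QueryEvent;

// Synchronous slice of the MemorySystem interface, to keep the sketch short.
interface SyncMemory {
  reset(): void;
  remember(text: string): void;
  query(question: string): string;
}

// A query passes only if the expected substring is present AND no stale one leaks.
function judge(answer: string, q: QueryEvent): boolean {
  const a = answer.toLowerCase();
  const hasExpected = a.indexOf(q.mustContain.toLowerCase()) !== -1;
  const leaksStale = q.mustNotContain.some(s => a.indexOf(s.toLowerCase()) !== -1);
  return hasExpected && !leaksStale;
}

// Reset the system, replay the events in order, and count the queries that pass.
function replay(
  system: SyncMemory,
  events: ScenarioEvent[]
): { passed: number; total: number } {
  system.reset();
  let passed = 0;
  let total = 0;
  for (const ev of events) {
    if (ev.kind === "remember") {
      system.remember(ev.text);
    } else {
      total += 1;
      if (judge(system.query(ev.question), ev)) passed += 1;
    }
  }
  return { passed, total };
}
```

Under this rule an answer that mentions both the old and the new value fails even though it contains the expected substring, which is how a leaked stale value ends up as a failure rather than a near-miss.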

The scoring core and every baseline behaviour are pinned by an adversarial test suite (npm test).

v0.1: 4 categories, 13 scenarios, 3 reference baselines, offline and reproducible.

Next: broaden each category (more scenarios, harder distractors), add temporal and preference-drift categories, add an optional LLM-judge mode for free-form answers, and publish a contribution guide so external memory systems can submit to the board.

Contributions of new scenarios, especially adversarial ones that break the typed-constraint baseline, are the most valuable thing you can add.

MIT


Read original on github.com → github.com/Kausha3/agent-memory-bench
