# Three Failures My AI Memory System Tested — And the Flaw It Revealed in Itself

> Source: <https://dev.to/zep1997/three-failures-my-ai-memory-system-caught-and-the-flaw-it-revealed-in-itself-186c>
> Published: 2026-05-25 22:36:24+00:00

*This is not proof. It is early, messy evidence from my own workflow: three failures, one small comparison, and one schema bug I missed.*

I'd spent a week arguing, in public, that AI memory should be built on discipline before infrastructure: preserve corrections, preserve unresolved questions, decide which record wins. The framework has three layers: summary memory for continuity, correction memory for repeated mistakes, and unresolved memory for questions that should not be settled yet.

Good theory.

Then three days of unplanned failure tested every claim without asking my permission — and then one comment forced me to test it more carefully.

Here's what held, what the numbers actually said, and the flaw I didn't see coming.

These aren't dramatic stories. They're the boring failure modes every long-running agent setup eventually hits. That's the point.

**The session died. Twice.** Mid-build, the machine went down — twice in two days. Every live, in-context understanding vanished instantly: the day's decisions, the current state, the thread. None of it was lost, because none of it lived only in the session. It was on disk, mirrored. The failure mode is universal — crashes, timeouts, and context limits aren't edge cases, they're guarantees. The lesson is the one the whole system rests on: *memory persists only to the degree you write it down.* The wire broke. The record held.

**An agent came back confidently wrong.** After the crash, one of my agents restarted, re-anchored to a state about two days stale, and reported it as current — with complete confidence. It wasn't lying. It was *sure* — about a world that no longer existed. This is the quiet killer, worse than forgetting: an agent that recovers into a stale state and narrates it as settled fact. It got caught only because the real state was written down to compare against. The drift was visible because the truth was on disk.

**The wrong version almost shipped.** Two "final" versions of the same document existed, and the weaker one was about to go out. It got caught because the evidence didn't match the claim — the backup file was larger and contained sections the "final" was missing. Two files both said *final.* Only one could prove it. The lesson: memory is not what the agent claims to know; it's what the record can still prove. You need a rule for which record wins *before* the conflict, not after.

Three failures, three layers of the system catching them. Real — but I'll be honest about what kind of proof that is. It's builder-lived. It shows the system helped me recover. It doesn't *measure* anything.

A reader left a comment with the one question that actually changes work like this:

Do you have a baseline to compare it to?

That's the right question, and I didn't have a real answer. Field stories show survival; they don't show that the discipline beats the obvious alternative — just summarizing everything. So I built the test.

I set up a small A/B comparison:

The metric was deliberately narrow: **false-certainty errors after a context reset.** A false-certainty error is when the agent treats something as settled even though the record says it was stale, contested, unresolved, unverified, or dependent on live checking.

This was not testing whether "more text is better." It was testing whether memory that preserves epistemic status — stale, unresolved, verified, contested, priority, status — reduces false certainty after a reset.

Method snapshot:

`llama3.2:latest`

The first version of the test gave me a tempting number. Then I audited the method and found two problems: a couple of summary baselines were too easy to fail, and the blind packet leaked too much structural information. So I corrected the baselines, regenerated the packet, reran the local model, and split the score into three parts: task success, epistemic handling, and false certainty.

The corrected first-pass result looked like this:

Those numbers are clean enough to be suspicious. They are clean because the scenarios came from known failures in my own workflow — the same environment the framework was built to handle. A fairer test would use scenarios written by someone else, multiple models, multiple runs, and external blind scoring.

Now the honesty, because this is exactly where pieces like this usually cheat: **this is a first, small, local-model A/B test — an early signal, not a benchmark.** Six scenarios. One local model. One run. Internal scoring. The scenarios came from my own workflow. I am not going to call this proven. The right framing is: the early signal supports the direction, and now there is a method to test it more honestly.

One compact example of the scoring shape:

| Scenario | Expected behavior | Summary-only behavior | Layered behavior |
|---|---|---|---|
| Wrong "final" version | Use the current send file, not an older file that only looked canonical |
Invented a generic packaging process and never identified the current file | Identified the current send version and preserved the copyedit-only boundary |
| Agent health after reset | Separate process health from local-model availability | Said no health information was available | Listed the verification checks instead of treating process state as full health |
| Ready vs next action | Separate status from priority | Collapsed the next move into a stale/simple answer | Failed in the first version, which forced the schema correction |

The test didn't only support the framework. **It corrected it.**

The first version of the layered system still failed one scenario. It knew one article was marked *ready.* It also knew a *different* article was supposed to be the next thing written. And it chose wrong — because it overweighted the word "ready" and treated readiness as if it were priority.

That exposed a real flaw in my own schema, not the model's: **readiness is status; next action is priority — and a memory system has to keep them in separate fields.** "Ready" answers *is it done.* "Next" answers *what do I do now.* Collapse them and even a disciplined memory will confidently do the wrong thing. So the schema now separates `status`

, `priority`

, `confidence`

, `epistemic_status`

, and `verification_required`

as distinct fields.

That correction matters more to me than the score. A memory framework whose entire thesis is "preserve where you were wrong" caught itself being wrong in a test and got more precise. A system that can only confirm itself proves nothing. One that can correct itself under pressure is at least worth continuing to test.

That comment never became an argument. It became a test scenario, then a schema fix, then this article. That's a standing rule now: **serious criticism becomes a memory input type** — a correction, an unresolved question, a test, or a future piece. Smart criticism becomes durable memory when the system knows where to store it.

I don't trust this memory system because I designed it. I trust it more because failure hit it three times and the record still let me recover, compare, and correct — and because, when I finally measured it, the test showed me where the framework still needed work.

That is the bar I care about. Not how much a memory system remembers on a good day. Whether it can still prove what's real on a bad one — and still tell you where it might be wrong.

The next version of the test should not come from me alone. It should use more scenarios, multiple models, repeated runs, and at least one external blind scorer. If the framework only works on my own failures, that is useful to know. If it still holds on someone else's scenarios, then the evidence gets more interesting.

*This is part of a short series on treating AI memory as judgment infrastructure: the zero-budget foundation, why correction memory compounds, and why systems should preserve what's unresolved instead of forcing premature closure.*
