Catching the failure is the easy part

wpnews.pro

The last post I wrote ended on a loose thread I have not been able to stop pulling at. Almost every memory setup I looked at had a decent answer for what to write down, and almost none of them had a real answer for what to keep. I want to sit with that second half for a while, because the more time I spend with it the more I think it is where the actual difficulty lives.

Start with the part that feels hard but mostly isn't. Noticing that an agent failed at something is close to mechanical. A tool throws an error. A test goes red. A call times out. A change gets reverted twenty minutes after it shipped. You can even catch the quiet ones, the runs where nothing errored but nobody ever confirmed the thing actually worked, by treating "ended without confirmation" as its own small failure. None of this is trivial to wire up, but it is the kind of problem that yields to rules. You can write the rules down and they hold.

So people build the detector, watch it light up, and feel like they have solved memory. They have not. They have solved the easy half and walked right up to the hard one without noticing the seam.

The hard half starts the moment you have a confirmed failure in hand and have to decide what, if anything, it means. A single failure is not one kind of thing. Sometimes it is a fluke, a flaky test or a network hiccup that will never recur and is worth nothing. Sometimes it is a real lesson, a sign that a whole approach is wrong. And sometimes it is just another face of a mistake you already recorded last week, in which case writing it down again only piles more weight onto something you already knew. The detector cannot tell these apart. It only knows that something went red. Sorting which red things deserve to become memory is judgment, and judgment does not collapse into a rule the way detection does.

Then volume shows up and makes it worse. If you keep every failure you catch, the store fills with sediment fast. Someone I talked to for the last post had the agent write a short post mortem after each task, which worked beautifully until there were forty of them and the signal drowned. So you have to consolidate. Merge the near duplicates, summarize the old ones, let the trivial stuff fade out. And consolidation is lossy on purpose, which means every time you do it you are betting on which detail mattered before you actually know. You compress "the deploy failed because the migration ran before the feature flag flipped" down to "be careful with migration ordering," and you have probably thrown away the one specific that would have helped next time. The summary feels tidier and remembers less.

There is a quieter failure mode hiding in here too, and it is the one I find most interesting. When you consolidate aggressively you are tempted to fold the event and the lesson into a single object. What happened, and what you concluded from it, become one note. That is exactly the move that turns memory into superstition. The agent stops holding "this happened once and here is the evidence" and starts holding "this is the rule," and it will defend the rule long after the thing that justified it has changed. A failure that was real on Tuesday hardens into a law by Friday, enforced by a system that no longer remembers why. Keep the event and the conclusion as separate things and you can revise the conclusion later. Fuse them and you cannot.

So what makes keeping so much harder than catching? I think it comes down to signal. Detection has ground truth right when it happens. The test passed or it did not. Keeping has no equivalent. At the moment you are deciding whether a memory is worth holding onto, you usually cannot tell, because the thing that would actually tell you is whether acting on that memory later leads somewhere good or sends the agent back into the wall it already hit. That signal arrives much later, if you capture it at all, and almost nobody is capturing it. We instrument the write and leave the outcome uninstrumented, then act surprised when the store fills with confident junk.

Worth saying plainly: this is roughly how human memory works, and nobody designed that, so maybe it is telling. You do not store your whole day. Something during sleep throws away nearly all of it and keeps a thin, strange, sometimes wrong selection. The recording was never the clever part. The selection is. We have built agents that record fluently and select badly, which is close to the exact inverse of what you want.

I do not have a clean fix, and I am suspicious of anyone who says they do. What I have is a few things I now believe. Separate the cheap detector from the expensive decision, and do not let the first quietly stand in for the second. Do not promote a single failure into a durable rule just because it happened once. Build the cleanup pass in from the start, because the store degrades whether or not you planned for it. And accept that part of the keep decision cannot be automated yet, because the signal it really wants, did acting on this actually work, is one most systems are not even recording.

That last one is the thread I will pull next.

source & further reading

dev.to — original article Kimi K2.7 Code is Now Available in Microsoft Foundry MCP 2.0 + Security Patches: Rails RCE, Nuxt Updates The $239,000 Manual Handoff Tax: The ROI of Deep AI-to-ERP Integration 💸

Catching the failure is the easy part

Run your AI side-project on zahid.host