I got a comment on the last post that I want to answer properly, because it gets at the real question. They agreed that memory adherence is a systems problem and not just a prompting problem, then asked whether I had tested the approach across model families (GPT, Claude, Gemini) and seen meaningful differences in write reliability.
There are differences, but they are not about the model. They are about how much of the turn each family lets you control. If adherence is a systems problem, then the thing that actually decides write reliability is which family hands you enough control surface to build the system in the first place.
Here is what I have found in practice. Treat this as a field report, and an early one, not a benchmark. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry. More on that below.
Claude is the family I have spent the most time controlling, so weigh the claim with that in mind. What I can say is that it gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.
The shallow rung is nudges: the settings page and system instructions. This is the weakest form, and it is the one that decays, the same "tell it harder" failure I wrote about last time. It helps a little, and it does not hold.
Above that are skills and hooks. Hooks are the first rung where adherence stops being a request and becomes an event: something fires at the start of the turn, at write time, at the end, whether the model felt like it or not. They work. They can also be finicky to get right.
The deep rung is the SDK. With it you get targeted control of the model's turn, the whole prompt-to-response lifecycle, beginning to end. This is where the write stops being a hope and becomes wiring you own. Nothing else I tested hands you the full lifecycle like this.
The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.
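To make the gap between a nudge and a hook concrete, here is a minimal sketch. It is not the API of any vendor's hooks or SDK. The stage names come from the turn mechanics named at the end of this post; `TurnHooks`, `run_turn`, `forced_write`, and `lazy_model` are hypothetical names invented for illustration. The model stand-in never volunteers a write, and the write still happens because it is wired to an event.

```python
from typing import Callable, Dict, List

# Stage names borrowed from the post; every other name here is hypothetical.
STAGES = ("userprompt_init", "read", "write", "turn_end")

Hook = Callable[[dict], None]


class TurnHooks:
    """Tiny event registry: hooks fire per stage, whatever the model does."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {stage: [] for stage in STAGES}

    def on(self, stage: str, hook: Hook) -> None:
        if stage not in self._hooks:
            raise ValueError(f"unknown stage: {stage}")
        self._hooks[stage].append(hook)

    def fire(self, stage: str, turn: dict) -> None:
        for hook in self._hooks[stage]:
            hook(turn)


def run_turn(hooks: TurnHooks, prompt: str,
             model: Callable[[str], str], store: list) -> dict:
    """Walk one turn through every stage in order, firing hooks at each."""
    turn = {"prompt": prompt, "store": store, "response": None}
    for stage in STAGES:
        if stage == "write":
            # Call the model just before the write stage so a write hook
            # has a response to persist.
            turn["response"] = model(prompt)
        hooks.fire(stage, turn)
    return turn


def forced_write(turn: dict) -> None:
    # The write is an event handler, not a request to the model.
    turn["store"].append(
        {"prompt": turn["prompt"], "response": turn["response"]}
    )


def lazy_model(prompt: str) -> str:
    # Stand-in for a model that never reaches for the memory store itself.
    return "ok"


hooks = TurnHooks()
hooks.on("write", forced_write)
memory_store: list = []
run_turn(hooks, "remember that I prefer tabs", lazy_model, memory_store)
```

With the hook registered, `memory_store` ends up holding one record even though `lazy_model` did nothing to cause it. Run the same turn with an empty `TurnHooks()` and the store stays empty, which is the nudge-only situation.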
Codex was a lot easier than I expected, because its instructions live in one place, the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.
Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.
The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV, that is, in-context recall, more than they reach for an outside store. You end up working against a default that would rather not call your database at all.
The exception is web search, and it is not a good one. It comes back noisy and minimal, and it gets actively counterproductive when it does surface strongly correlated evidence, because then the model over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon, not something I have solved.
Building SENTINEL, a real benchmarking suite for this, is my top priority right now. The honest problem is that most memory systems do not expose the turn the way you would need in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system's outputs actually demonstrate, on a capability ladder, rather than grading the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.
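One way the grading idea could look in code, as a rough sketch and not the SENTINEL design, which is not spelled out here. The axis names, rung counts, and the `Axis`, `AxisResult`, and `grade` names are all assumptions made up for illustration. The one idea taken from the paragraph above is that an axis a system does not expose is recorded as "not exposed" at the bottom rung, and is kept distinct from an exposed axis that failed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Axis:
    name: str      # hypothetical axis name
    kind: str      # "push" or "pull", stated plainly per axis
    top_rung: int  # highest rung on this axis's ladder


@dataclass(frozen=True)
class AxisResult:
    rung: int      # rung the system's outputs actually demonstrated
    exposed: bool  # False means the system never surfaced this axis
    kind: str


def grade(axes: List[Axis],
          demonstrated: Dict[str, Optional[int]]) -> Dict[str, AxisResult]:
    """Grade only what outputs demonstrate; flag unexposed axes as such."""
    report: Dict[str, AxisResult] = {}
    for axis in axes:
        observed = demonstrated.get(axis.name)
        if observed is None:
            # Lower on this axis, but marked as a different shape,
            # not as a failure.
            report[axis.name] = AxisResult(0, False, axis.kind)
        else:
            rung = max(0, min(observed, axis.top_rung))
            report[axis.name] = AxisResult(rung, True, axis.kind)
    return report


# Hypothetical axes, invented for this example only.
AXES = [
    Axis("write_at_turn_end", kind="push", top_rung=3),
    Axis("recall_on_query", kind="pull", top_rung=3),
]

# A pull-only system: it demonstrates recall but exposes no write timing.
report = grade(AXES, {"recall_on_query": 2})
```

Here `report["write_at_turn_end"]` comes back at rung 0 with `exposed=False`, so a reader of the scoreboard can tell "never built to show this" apart from "tried and failed".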
The bar I hold myself to is the part that makes a benchmark mean anything. A bench only my own system can run is not a benchmark, it is a demo with a scoreboard. And a bench that hands you a number while attributing the result to the wrong layer, blaming the store when the model just queried it badly, or crediting the system when the model did the work, is worse than no number at all. So fair adapters and correct attribution come first, and the leaderboard comes second.
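The attribution problem can be sketched as a small decision procedure. This is an illustration under assumptions, not how any existing benchmark does it: the `Trial` fields and the label strings are hypothetical, and it assumes the harness can replay a known-good reference query against the store to separate "the store cannot retrieve it" from "the model queried it badly".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    answer_correct: bool
    fact_written: bool           # did the fact reach the store at all?
    model_query: Optional[str]   # None if the model never queried the store
    model_query_hit: bool        # did the model's own query return the fact?
    reference_query_hit: bool    # does a known-good query return the fact?


def attribute(trial: Trial) -> str:
    """Assign one trial's outcome to the layer that actually produced it."""
    if trial.answer_correct:
        if trial.model_query is not None and trial.model_query_hit:
            return "credit:store"
        # Right answer without a successful read: the model did the work.
        return "credit:model"
    if not trial.fact_written:
        return "blame:write_path"
    if not trial.reference_query_hit:
        # Even a known-good query fails, so retrieval itself is at fault.
        return "blame:store"
    if trial.model_query is None or not trial.model_query_hit:
        # The store could have answered; the model asked badly or not at all.
        return "blame:model_query"
    return "blame:model_reasoning"


bad_query = Trial(answer_correct=False, fact_written=True,
                  model_query="tabs?", model_query_hit=False,
                  reference_query_hit=True)
from_context = Trial(answer_correct=True, fact_written=True,
                     model_query=None, model_query_hit=False,
                     reference_query_hit=True)
labels = (attribute(bad_query), attribute(from_context))
```

The two example trials are the two misattributions named above: the first would be wrongly charged to the store by a naive scorer, and the second would be wrongly credited to the memory system.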
The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to hook to force the write at the right moment. In my hands so far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. But I have to be straight about the confound: I went deep on Claude's hooks and SDK, I leaned on Codex without touching its SDK, and I have barely pushed Gemini or Grok past their defaults. So read this as a map of where I have dug, not a verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.
So "which model writes more reliably" turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init, read, write, turn_end mechanics. That is the same point the commenter started with, followed all the way down.
github.com/H-XX-D/recall-memory-substrate