"Memory adherence is a systems problem. So which model lets you build the system?"


[ARTICLE · art-45148] src=dev.to ↗ pub=2026-06-30T16:37Z topic=large-language-models verified=true sentiment=· neutral

A developer reports that memory adherence is a systems problem, not a prompting problem, and that the key difference between model families is the control surface they offer. Claude provides a full ladder of control from shallow nudges to deep SDK access, enabling deterministic write reliability, while Gemini and Grok CLIs resist external memory calls. The developer is building SENTINEL, a benchmarking suite to measure memory system adherence across architectures.

5 min read · published Jun 30, 2026

I got a comment on the last post that I want to answer properly, because it gets at the real question. They agreed that memory adherence is a systems problem and not just a prompting problem, then asked whether I had tested the approach across model families (GPT, Claude, Gemini) and seen meaningful differences in write reliability.

There are differences, but they are not about the model. They are about how much of the turn each family lets you control. If adherence is a systems problem, then the thing that actually decides write reliability is which family hands you enough control surface to build the system in the first place.

Here is what I have found in practice. Treat this as a field report, not a benchmark, and an early one. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry. More on that below.

Claude is the family I have spent the most time controlling, so weigh the claim with that in mind. What I can say is that it gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.

The shallow rung is nudges: the settings page and system instructions. This is the weakest form, and it is the one that decays, the same "tell it harder" failure I wrote about last time. It helps a little, and it does not hold.

Above that are skills and hooks. Hooks are the first rung where adherence stops being a request and becomes an event: something fires at the start of the turn, at write time, at the end, whether the model felt like it or not. They work. They can also be finicky to get right.
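To make "adherence becomes an event" concrete, here is a minimal sketch of the idea, not any CLI's actual hook API. The event names, `HookBus`, and `run_turn` are all hypothetical. The point it illustrates is that the harness fires the callbacks, so a model that never mentions memory still cannot skip them.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical event names for illustration; not a real CLI's hook API.
EVENTS = ("turn_start", "write", "turn_end")


class HookBus:
    """Runs registered callbacks whenever the harness fires an event."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def on(self, event: str, fn: Callable[[dict], None]) -> None:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._hooks[event].append(fn)

    def fire(self, event: str, ctx: dict) -> None:
        for fn in self._hooks[event]:
            fn(ctx)


def run_turn(bus: HookBus, model: Callable[[str], str], prompt: str) -> dict:
    """The harness, not the model, decides when each hook fires."""
    ctx = {"prompt": prompt, "log": []}
    bus.fire("turn_start", ctx)
    ctx["response"] = model(ctx["prompt"])
    bus.fire("write", ctx)
    bus.fire("turn_end", ctx)
    return ctx


bus = HookBus()
for name in EVENTS:
    # Bind name as a default argument so each callback logs its own event.
    bus.on(name, lambda ctx, name=name: ctx["log"].append(name))

# A stub model that ignores memory entirely; the hooks still run in order.
result = run_turn(bus, lambda p: "no memory call here", "hello")
```

In a real setup the callbacks would talk to the store instead of appending to a log, and that is where the finicky part tends to live.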

The deep rung is the SDK. With it you get targeted control of the model's turn, the whole prompt-to-response lifecycle, beginning to end. This is where the write stops being a hope and becomes wiring you own. Nothing else I tested hands you the full lifecycle like this.
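A rough sketch of what "wiring you own" can mean, assuming nothing about any vendor's SDK: the model is just a callable, `MemoryStore` is a hypothetical in-memory stand-in for an external store, and the stage comments borrow the userprompt_init, read, write, turn_end naming used later in this post. The write sits in ordinary code after the model call, so it happens on every turn.

```python
from typing import Callable


class MemoryStore:
    """In-memory stand-in for an external store (hypothetical)."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def read(self, query: str) -> list[str]:
        # Naive recall: return records that share any word with the query.
        words = set(query.lower().split())
        return [r for r in self.records if words & set(r.lower().split())]

    def write(self, record: str) -> None:
        self.records.append(record)


def run_turn(prompt: str, model: Callable[[str], str], store: MemoryStore) -> str:
    # userprompt_init: the turn starts in our code, not in the model's.
    recalled = store.read(prompt)  # read
    context = "\n".join(recalled)
    response = model(f"{context}\n{prompt}" if context else prompt)
    store.write(f"user: {prompt}")  # write: unconditional, not model-initiated
    store.write(f"assistant: {response}")
    return response  # turn_end


def echo_model(text: str) -> str:
    """Stub model that only reports how many lines of input it received."""
    return f"seen {len(text.splitlines())} line(s)"


store = MemoryStore()
first = run_turn("deploy target is staging", echo_model, store)
second = run_turn("which deploy target?", echo_model, store)
```

On the second turn the stub sees two lines, because the first turn's user record was recalled and prepended without the model asking for it.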

The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.

Codex was a lot easier than I expected, because its instruction set is the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.

Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.

The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV, the in-context recall, more than they reach for an outside store. You end up working against a default that would rather not call your database at all.

The exception is web search, and it is not a good one. It comes back noisy and minimal, and it gets actively counterproductive when it does surface strongly correlated evidence, because then the model over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon, not something I have solved.

Building SENTINEL, a real benchmarking suite for this, is my top priority right now. The honest problem is that most memory systems do not expose the turn the way you would need to in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system's outputs actually demonstrate, on a capability ladder, rather than the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.
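One possible reading of that grading rule, sketched with made-up names since the SENTINEL design is not published here: the axes, their push/pull tags, and the rung numbers below are all hypothetical. Each axis is reported separately, an axis the system never surfaces is recorded as "not exposed", and the summary averages over exposed axes only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical axes and push/pull tags, for illustration only.
AXES = {
    "write_timing": "push",
    "recall_accuracy": "pull",
    "update_handling": "push",
}


@dataclass
class AxisResult:
    axis: str
    kind: str             # "push" or "pull"
    rung: Optional[int]   # None means the system does not expose this axis

    def label(self) -> str:
        shown = "not exposed" if self.rung is None else f"rung {self.rung}"
        return f"{self.axis} ({self.kind}): {shown}"


def grade(observed: dict[str, int]) -> list[AxisResult]:
    """Build a per-axis profile from what the outputs demonstrated."""
    return [AxisResult(a, k, observed.get(a)) for a, k in AXES.items()]


def summary(profile: list[AxisResult]) -> dict[str, float]:
    """Average rung per kind over exposed axes only, so a missing lever
    changes the shape of the profile instead of dragging down a total."""
    out: dict[str, float] = {}
    for kind in ("push", "pull"):
        rungs = [r.rung for r in profile if r.kind == kind and r.rung is not None]
        if rungs:
            out[kind] = sum(rungs) / len(rungs)
    return out


# A system that demonstrates recall and updates but never surfaces write timing.
profile = grade({"recall_accuracy": 3, "update_handling": 2})
lines = [r.label() for r in profile]
totals = summary(profile)
```

Keeping push and pull in separate buckets is the design choice that matters here: there is deliberately no single combined score to rank on.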

The bar I hold myself to is the part that makes a benchmark mean anything. A bench only my own system can run is not a benchmark, it is a demo with a scoreboard. And a bench that hands you a number while attributing the result to the wrong layer, blaming the store when the model just queried it badly, or crediting the system when the model did the work, is worse than no number at all. So fair adapters and correct attribution come first, and the leaderboard comes second.
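The attribution cases above can be written down as a small decision rule. This is a sketch under assumptions, not the SENTINEL implementation: the `Trace` fields and the idea of a known-good reference query as a control are hypothetical, introduced only to separate "the store lost it" from "the model asked badly".

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """What one graded turn recorded (hypothetical fields)."""
    reference_query_hits: bool  # a known-good query retrieves the fact
    fact_returned: bool         # the model's own retrieval carried the fact
    answer_correct: bool


def attribute(t: Trace) -> str:
    if t.answer_correct:
        # Credit the memory system only if its output actually carried the fact.
        return "credit:system" if t.fact_returned else "credit:model"
    if not t.reference_query_hits:
        return "blame:store"        # even a good query cannot find the fact
    if not t.fact_returned:
        return "blame:model_query"  # fact was there; the model queried badly or not at all
    return "blame:model_use"        # fact was retrieved, then ignored or misused


verdicts = [
    attribute(Trace(True, False, False)),
    attribute(Trace(False, False, False)),
    attribute(Trace(True, False, True)),
    attribute(Trace(True, True, True)),
    attribute(Trace(True, True, False)),
]
```

The order of the checks is the whole point: the store is only blamed after the control query has also failed.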

The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to hook to force the write at the right moment. In my hands so far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. But I have to be straight about the confound: I went deep on Claude's hooks and SDK, I leaned on Codex without touching its SDK, and I have barely pushed Gemini or Grok past their defaults. So read this as a map of where I have dug, not a verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.

So "which model writes more reliably" turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init → read → write → turn_end mechanics. That is the same point the commenter started with, followed all the way down.

github.com/H-XX-D/recall-memory-substrate


Read original on dev.to → dev.to/hendrixx/memory-adherence-is-a-systems-pr…
