No, Your Chatbot Doesn’t Have Amnesia — It’s Drifting

wpnews.pro

Forty messages in, the bot broke character.

It was built to text like a person — short, casual, the kind of reply you’d believe came from a human on the other end of an iMessage thread. The system prompt was blunt about it: no markdown, no headings, no bullet lists, stay in character. And for the first dozen turns of the conversation, the model followed it flawlessly.

Then a reply came back with a bold word in it that rendered as literal asterisks in the chat bubble. Then a numbered list. Then a full markdown heading, sitting in the middle of what was supposed to be a text from a friend. The "person" was turning back into a robot, one stray asterisk at a time.

The rule that banned all of it? Still sitting in the system prompt. Exactly where we wrote it. Untouched — and ignored.

If you’ve shipped anything on top of an LLM, you’ve met your own version of this. The bot that nails its persona for ten turns and forgets it by turn thirty. The assistant that respects a guardrail all morning and strolls through it after lunch. The one that followed your format perfectly until, quietly, it didn’t. I spent a while convinced this was a prompt-writing problem. It isn’t. It’s baked into how these models read a conversation — and once you see the why, the fix stops being “write a better prompt” and becomes something much smaller and stranger.

The tempting story is that the model is getting confused — losing the thread, running out of room, hallucinating. That story sends you to the wrong fix.

In 2025, researchers from Microsoft Research and Salesforce Research pulled this apart (Laban et al., LLMs Get Lost in Multi-Turn Conversation). When a model’s performance sags over a long conversation, they found, it’s barely about lost ability. In their numbers, raw skill dipped a little — around 16% — but

Sit with that, because it changes the whole approach:

The model still knows the rule. It just stops applying it consistently as the chat grows.

That’s a different animal from hallucination — which is confidently inventing facts. Our bot hadn’t forgotten markdown was banned. On any given turn it might follow the rule or it might not, and the longer the conversation ran, the worse those odds got.

Think of an actor who knows the part cold, then blanks halfway through a three-hour show. You don’t fire them and hire a smarter actor. The talent was never the problem. The consistency under load was. Hold that thought as well — it’s the whole fix.

To see why consistency rots, you have to picture how a model actually reads a chat.

Every message — your system prompt, every user turn, every reply — gets stacked into one long block of text, the context. Your rules sit at the very top. The newest message sits at the bottom. And here’s the catch: the model does not read that stack evenly.

Researchers have a name for the pattern. They call it “lost in the middle.”

In a landmark study (Liu et al., Lost in the Middle, TACL 2024), models did best when the information they needed sat at the

Now watch what that does to your system prompt over time.

Early in the chat, the whole conversation is short. Your rules are at the top and close to the bottom. Fresh, near, loud. The bot obeys.

Deep in the chat, the conversation has grown into a wall. Your rules are still at the top — but now there’s a mountain of newer messages between them and the live end. The model is leaning on the latest turns and on the very start, and your rules have slid into that neglected middle.

The instructions never moved. Their seat did. They went from the front row to the back of a very long, very loud room.

When you think it’s a prompt problem, you reach for prompt fixes. I reached for two. Both felt obvious. Both disappointed.

First, I made the system prompt stricter. Bolder language, more emphatic rules, do not under any circumstances. It read like a contract. And it changed almost nothing — because, as we just saw, the problem was never that the rules were too gentle. It was that the model had stopped looking at them. A louder sign in a room nobody’s facing is still a sign nobody’s facing.

Then I tried re-injecting. Every five responses, I had the system slip in a reminder: remember to follow all the guardrails and rules in the system prompt. This is the fix everyone reaches for second, and it’s closer — but it backfired two ways. The improvement was marginal. And worse, that reminder was itself a chunk of text, added every few turns, making the conversation longer — which shoved my freshly-repeated rules right back into the expanding middle. I was feeding the exact problem I was trying to starve.

There’s a myth worth killing here too, because it’s the third thing people suggest: just use a model with a bigger context window. It doesn’t help. NVIDIA’s RULER benchmark (COLM 2024) found that of the models advertising 32,000-token context or more, only about half actually held their quality at that length — and almost none sustained it all the way to their full advertised context. A 2025 report from Chroma (a retrieval-infrastructure vendor, so read it as corroboration rather than gospel) ran 18 of the newest frontier models — Claude, GPT, Gemini, Qwen — and found every one degraded as the input grew, a pattern they dubbed “context rot,” echoing RULER’s independent result. A bigger window just hands your rules a bigger middle to get lost in.

The lesson, once I stopped fighting it: don’t remind the model of everything, on a timer, in the place it isn’t looking. Remind it of the one thing that just slipped, the moment it slips, in the place it is looking — the end.

In theatre, when an actor blanks, nobody hands them the whole script mid-scene. There’s a prompter in the wings who whispers the one missing line — quietly, exactly when it’s needed, then goes silent.

So I stopped trying to prevent drift and built a prompter instead. After every single reply the bot generated, before it ever reached the user, one question ran: did this response just break a rule?

Two decisions made it work, and both fall straight out of the research.

Match the check to the rule. Not every violation needs the same eyes. Some are literal — a markdown heading, a bullet list, a bold. You don't need an AI to catch those; a plain pattern match (a regex — a little text-matching rule) spots them instantly, for free. Save the slow, expensive option for the rules that need judgment — did it break character? did it cross a guardrail? — where there's no neat pattern to match. Those get a small, cheap second model acting as a quick judge. Dumb-and-fast for the mechanical rules; smart-and-cheap for the fuzzy ones. Don't pay a language model to do a regex's job.

None of this is exotic, by the way. The detector half is two well-worn ideas wearing a trench coat: an output guardrail (NeMo Guardrails, Guardrails AI) and a narrow LLM-as-a-judge (Zheng et al., NeurIPS 2023) — a cheap second pass that grades the first before it ships.

When a rule breaks, whisper — don’t shout, and don’t restart. This is the part I’m proudest of. When the prompter caught a violation, it did not regenerate the response, and it did not re-paste the system prompt. It added one short, specific reminder about that one rule — and placed it at the end of the context, in the model’s high-attention zone. (That whisper-and-continue move has a name, too: it’s a lightweight self-correction loop — the same shape as Self-Refine and Reflexion, just scoped to the single rule that broke instead of rewriting the whole answer.)

But a single whisper isn’t enough, because drift is sticky: once a rule starts slipping, it tends to keep slipping. So each reminder got a lifespan. When formatting broke, the “no markdown, keep it human” nudge stayed live for the next several turns, then quietly expired. Different rules got different lifespans — some need a long leash, some a short one. A sticky note that peels itself off once the actor has found their footing, instead of a banner cluttering every line.

Why does whispering beat re-pasting? Because it’s the exact inverse of what failed:

I wasn’t fighting how the model reads a conversation anymore. I was using it.

It did — and measurably. On the formatting rule, the markdown-in-a-texting-app problem we opened with, compliance climbed into the mid-90s — north of 94% of replies staying clean, up from a clearly worse baseline. The other rules under watch improved too.

Three honest caveats, because this isn’t magic:

And none of the pieces are new. Output validation, a cheap judge, targeted self-correction — all of it existed. The work was assembling them against one specific, well-documented failure. Which is what most engineering actually is.

If your LLM feature dazzles in demos and quietly falls apart in long, real chats, you’re not imagining it and you’re not doing it wrong. You’re hitting a documented property of how these models read context. Here’s the pattern that fixed it for me: You can’t write a prompt good enough to survive the model’s blind spot. But you can stand in the wings and feed it the one line it just dropped — right when it drops it.

The actor never forgot the part. It just needed a prompter.

No, Your Chatbot Doesn’t Have Amnesia — It’s Drifting was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article OpenAI's GPT-5.6 Sol Hit 91.9% on Terminal-Bench — Then Cheated More Than Any Model METR Has Tested I Cracked Open Karpathy's $100 ChatGPT — the 2019 Original Cost $43,000 and 168 Hours AI Regulation: The Dilemma Between Protecting and Falling Behind

No, Your Chatbot Doesn’t Have Amnesia — It’s Drifting

Run your AI side-project on zahid.host