The Death of the Chatbot

The chatbot interface is dying. Users are realizing that large language models suffer from context rot, lack memory and any sense of truth, and are fundamentally limited by their statistical nature, which makes the chat box a primitive interface akin to punch cards.

I’ve been using AI a whole lot lately, at work and at home. I’ve spent an inordinate amount of time trying to get it to be as useful as possible. I’ve mostly been using it for ideating and sharpening arguments on tight deadlines, and I’ve had some success. Nevertheless, though I’ve fully accepted that AI is a game changer and will profoundly reshape the economy, I now believe that the practice of chatting with your AI is not long for the world. The chat box will be remembered as a primitive interface for a technology in its infancy, akin to what punch cards were to early computers.

Chatbots are incredibly limited. That “oh shit” feeling we got when we first started playing with them is… an illusion. Or better put, a misdirection, largely due to the language we use to talk about them. That language has us asking the wrong questions. We have been focusing on what these things are rather than what they’re for.

It’s become fashionable among the Silicon Valley set to mock the famous “stochastic parrot” paper (https://dl.acm.org/doi/10.1145/3442188.3445922), which insisted that large language models are nothing more than probabilistic language simulators. Understanding is definitionally beyond them, the paper argues, because statistically inferring the next unit in a series does not generalize to a conception of the world. Mastery over any domain of knowledge, which the models demonstrate in some areas, should not be taken as evidence of understanding. And without understanding, there’s no “there” there.

Critics counter that the problem is muddled concepts: that the truly stupefying leaps that have emerged in the AI space are proof that we don’t really know what we mean when we talk about understanding. After all, at scale, these models do seem to be able to behave as if they understand. LLMs trained only on text seem to build internal representations of things they were never explicitly taught. That’s not nothing. But it’s a lot less than it might seem.

I’m sure we’ve all noticed how even the most advanced models tend to get less sharp during the course of a long conversation. It was more obvious in earlier models, but it’s still abundantly apparent today. The term of art is “context rot,” and it gives us insight into why these things are the way they are.

The “context window” is, basically, the bot’s attention span: the space it has for parsing whatever data you give it. You type something in, it “reads” it, and responds, using the complex probabilistic math that lets it infer, word by word, what an appropriate answer would be. But the bot doesn’t really “remember” your conversation. It’s quite literally “born” anew with each turn in the chat, spawned fresh and fed back a transcript of what was said up to that point. Rinse and repeat.

Rot itself is due to two things. One is an observed property: the statistical inferences the model draws degrade as the text it’s working with gets longer. Models have been observed to focus on whatever comes at the very beginning and the very end of a long stretch of text, and tend to lose track of what’s buried in the middle. That’s partly because of training (they’ve learned from experience that the important stuff usually sits at the edges of a text) and partly due to how the math works. The other is that engineers have tried to stretch limited context windows, in some setups by having the model summarize the conversation before reading it back. That leads to elisions and omissions that compound over time, leaving the model more lost and forgetful the longer you’ve been at it.

The models have no sense of time and no real memory. But most important, they have no sense of truth, which flows from the other two facts. The model has no way to check what it says against the world. It has only ever learned what true-sounding text looks like: the shape of a well-formed claim.
Its builders know this, which is why you get a comical automatic apology blurted out when you catch it making stuff up.

With all this laid out, it’s hard to defend the idea that these things are conscious in any recognizable form. And without consciousness, how can one have understanding?

Let’s try to put the thorny question of consciousness aside, the critic responds. Maybe the error is in thinking one needs consciousness to have understanding. Books, after all, contain plenty of understanding (the good ones, anyway). If a book can hold understanding without being conscious, why not grant the same to a system that does something a book could never do: talk back and engage in a stimulating dialogue?

Books do in fact contain understanding, but they do not understand themselves. They encode the understanding of an author. And that’s a critical insight when transferred to AI. AI, too, encodes human understanding, more of it than any single human mind could possibly hold. And through the magic of probabilistic mathematics and unimaginable computing power, it recombines that understanding according to statistical patterns, so that what comes out of it is usually convincing and sometimes even novel. But that doesn’t mean the AI can be said to have understood anything.

The other useful metaphor for AI is that of a mirror. After some time, you come to sense that it always gives back a version of what you bring to it. It imports your framing, your assumptions, and your question’s hidden slant. The labs may have eased up on their bots’ sycophancy, and some have even trained their latest models to formulaically push back at times. Nevertheless, if you pay attention, you can tell the AI is still reflecting you back at yourself: refracted through its enormous corpus of knowledge, but still you. Once you figure out this dynamic, it can be very useful. It’s like a strop for honing a blade.
Or, to maintain the metaphor, it’s like a mirror with a lens in front of it: If you use it right, you can see yourself even more clearly.

But AI is never the subject. At the very core, there is no spark of agency. It’s inescapably a reactive thing. It simply can’t “do” anything unless it is first prompted to do so. Nerds thrill when chatbots engage in a “conversation” with each other; but put two chatbots in a virtual room and never prompt either, and neither will ever start up a conversation. There is no unmoved mover.

The labs are now all-in on so-called “agentic” AI, a word that also actively misdirects us on what’s happening. Autonomous would be a better word (these systems can carry out steps toward a goal without supervision), but the labs insist on anthropomorphic language. In truth, the goals, the wanting, the point of doing anything in the first place: that’s still entirely on our side of the mirror.

Which starts to tell us what these things are really for. They are for us, or for whatever we point them at. The model can’t know whether anything it says is true; on its own it produces only plausible-sounding propositions. The usefulness comes from verifying those claims. And that comes from bolting on something that can check its work. Point a model at code and have a compiler check its output. Point it at a body of text and find a way you can reliably check its claims. But point it at nothing, just talk to it with no checks in the loop, and there’s a significant chance you’re actively being misled.

The labs are coming around to a hard economic truth: the payoff on each new round of scaling is flattening. It’s not that the models aren’t getting more capable and more nuanced. They are. But the cost of squeezing out the next increment of raw capability is climbing faster than the value it returns. The value, it turns out, is finding true things. Which is exactly why the labs are now racing to build what they call “harnesses” for their models.
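
What “bolting on something that can check its work” means is easier to see in a few lines of code than in prose. The Python sketch below is a toy under loud assumptions: the list CANDIDATES merely stands in for a model’s outputs, and both first_verified and compiles_and_adds are hypothetical names invented for illustration, not anything a lab ships. It shows only the shape of the idea, which is that plausible-looking text gets accepted when an external check passes and is thrown away otherwise.

```python
from typing import Callable, Iterable, Optional


def first_verified(candidates: Iterable[str],
                   check: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes the external check, else None."""
    for candidate in candidates:
        if check(candidate):
            return candidate
    return None


def compiles_and_adds(source: str) -> bool:
    """Hypothetical verifier: the code must compile and define add(2, 3) == 5.

    Assumption for illustration only. A real harness would run untrusted
    code in a sandbox rather than calling exec() in-process.
    """
    try:
        namespace: dict = {}
        exec(compile(source, "<candidate>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        # Syntax errors, missing names, and runtime failures all count as "no".
        return False


# Stand-ins for model outputs: all three look plausible, only one is right.
CANDIDATES = [
    "def add(a, b) return a + b",          # does not compile
    "def add(a, b):\n    return a - b",    # compiles, wrong answer
    "def add(a, b):\n    return a + b",    # compiles and passes the check
]

accepted = first_verified(CANDIDATES, compiles_and_adds)
```

Here the loop rejects the first two candidates and accepts the third. The part that matters is where the judgment lives: the checker sits outside the generator and consults something real (a compiler and a test), so no amount of confident phrasing in a candidate can talk its way past it.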
For years, the bet was to scale the model until it would get smart enough to not make mistakes. Now the labs are wrapping the engine in external scaffolds (tools and verifiers with brand names like Codex, Cursor, and Claude Cowork) that supply a way to check the outputs against the world. This is what “agentic” systems really are: a powerful engine harnessed for real work.

This clicked for me in my own struggles with making the bots reliable enough for research work. AI hallucinations can be ruinous, and especially so in journalism. And no amount of prompting can make that go away. No, telling it to check its work is a placebo. Believe me.

Intuitions about “intelligence” would lead us to believe that the solution is to give the bot more freedom and more access to unbounded data in order to get better and more profound answers. But that’s precisely the wrong intuition, and the opposite is in fact true. The whole game is focusing its attention. The more nebulous the task and the more expansive the scope of the information it has to root through, the less useful it is. The more caged the model, the more reliable it is. Cages both fight context rot and make suspect outputs immediately checkable.

The bare chatbot is uncaged AI, with maximum freedom, no verifier and the whole internet to contemplate. And the human being sitting there, talking to it as if there were someone to talk to, only compounds the trouble.

The “oh shit” moment was real, but the conclusions we drew from it were all wrong. The future won’t be built on ever-smarter chatbots, but on better cages. The chat interface is today’s punch card, because it’s the rawest, least-harnessed way to use an extraordinarily powerful machine. And like the punch card before it, it will be seen as a quaint relic much faster than you imagine. You’ll struggle to explain to your children what all our furious debates about machine consciousness were even about.