Why ChatGPT Is More Than Autocomplete

Calling a large language model (LLM) like ChatGPT “autocomplete” is not exactly wrong, but it is deeply misleading. Most of us think of autocomplete as a text-completion tool: a phone keyboard guessing the next word, a search bar finishing a phrase. But that picture is not powerful enough to explain what LLMs actually produce — explanations, analogies, plans, summaries, arguments, code, stories, dialogue.

A transformer-based LLM does predict one token at a time (roughly a word, though in practice often a word fragment), but that visible sequence is only the surface trace of a much richer hidden process. Before each token appears, the model has built a high-dimensional internal state that reflects the topic, the context, the tone, the intent, and the likely directions the answer could take. The next token is not read off the prompt. It is read off this internal state. That is why a system trained only to predict the next token produces explanations, arguments, analogies, plans, and dialogue that feel like far more than anything we would call autocomplete.

Previous articles in this series developed a geometric picture of the transformer’s internal state in terms of attention: attention as a coupled free-energy minimization and weighted least squares problem [Article 1], the simplification of attention to two operators in one space [Article 2], and attention as a geometric flow [Article 3].

This article steps back and asks a plainer question — what is actually happening when a chatbot answers you, and why is autocomplete the wrong picture — and answers it with a minimum of mathematical machinery.

Calling an LLM “autocomplete” is tempting because, at one level, it is true. Given a sequence of text, the model predicts the next token. That token is appended to the input, which is fed back in for the next prediction. Stepping through this loop produces the visible response, one token at a time, that has the appearance of autocomplete.

The problem is that this picture describes what goes on “across” steps but ignores what goes on “within” steps. Across steps, things feel like autocomplete: the sequential generation of individual words that cohere with previously generated text. (Researchers properly call this process “autoregressive,” but we’ll stick with the less-than-proper “autocomplete” for now.) The missing piece is the hidden computation taking place within a step that generates the next token.

Within a step, before a new token is generated, the input prompt is transformed into a high-dimensional internal state. The tokens of the prompt are not treated as isolated words in a list; they are interpreted in relation to one another via the attention mechanism. A question, a definition, a metaphor, a constraint, a conversational tone — all of these shape the internal state from which the next token is drawn.

So the next token is not predicted from the input text alone, as autocomplete suggests. It is predicted from a rich internal state built from the text. The model then projects a small part of that state into the vocabulary to choose the next token, appends the token to the input text, and rebuilds the state over the longer text for the next prediction step. The observed output is the result of many passes through that loop.
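The shape of that loop can be sketched in a few lines of Python. This is only an illustration of the control flow, not a model: `build_state`, `read_out`, and the `CANNED` continuation are hypothetical stand-ins invented here for the forward pass and the readout.

```python
# Toy sketch of the generation loop. Nothing here is a real model:
# build_state and read_out are hypothetical stand-ins for a forward pass.

CANNED = ["E", "stands", "for", "energy", "."]  # fixed continuation for the demo


def build_state(tokens):
    """Within a step: rebuild a state from the whole text seen so far."""
    return {"n_tokens": len(tokens), "text": tuple(tokens)}


def read_out(state, prompt_len):
    """Project the state down to a single next token (here: a table lookup)."""
    position = state["n_tokens"] - prompt_len
    return CANNED[position] if position < len(CANNED) else "<eos>"


def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        state = build_state(tokens)            # text -> internal state
        token = read_out(state, len(prompt))   # internal state -> one token
        if token == "<eos>":
            break
        tokens.append(token)                   # only the text carries forward
    return tokens


print(generate(["Please", "explain", "E=mc2"]))
```

The point of the sketch is where the state lives: it is created and discarded inside each pass of the loop, while the token list is the only thing that survives from one pass to the next.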

“Autocomplete,” then, is technically defensible but conceptually misleading. It names the final visible act of generation and ignores the machinery that makes the act possible. An LLM like ChatGPT does not merely “complete” the observed text. It repeatedly reconstructs meaning over a growing context and projects part of that meaning back into language, one token at a time.

An LLM predicts the next token not from the input text alone, but from a dense internal representation built by sequentially processing the text through multiple layers of the model.

When a prompt is input to the model, each token is turned into a vector, a point in a high-dimensional space whose location already encodes what pre-training has learned about that token: its meanings, its grammatical roles, the company it tends to keep. This initial cloud of points is only the starting arrangement. As the cloud passes through the model’s many layers, each token (point) absorbs information from the rest of the text, so its final position reflects not just the word it started as but the role that word plays in context with the rest of the cloud.

The word “bank” has a different vector representation in “On the river bank” and “Call the investment bank” because the nearby words change its position. The same is true of the text as a whole: a question, an example, a requested tone, a constraint, or a prior phrase each reshapes the cloud, changing not only what a given word means but what kind of answer becomes likely.
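A minimal sketch of the “bank” effect, under loud assumptions: the two-dimensional embeddings in `EMB` are made up by hand (axis 0 loosely “nature”, axis 1 loosely “finance”), and `contextualize` is a stripped-down attention update with no learned weights. A real model uses learned projections and many layers.

```python
import math

# Hypothetical hand-picked 2-D embeddings, for illustration only.
EMB = {
    "the": [0.1, 0.1],
    "river": [2.0, 0.0],
    "investment": [0.0, 2.0],
    "bank": [1.0, 1.0],
}


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contextualize(tokens, index):
    """One simplified attention update for tokens[index]."""
    vectors = [EMB[t] for t in tokens]
    query = vectors[index]
    weights = softmax([dot(query, v) for v in vectors])
    mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(2)]
    # Residual update: the starting vector plus what it absorbed from context.
    return [q + m for q, m in zip(query, mixed)]


river_bank = contextualize(["the", "river", "bank"], 2)
money_bank = contextualize(["the", "investment", "bank"], 2)
print(river_bank, money_bank)
```

Both calls start from the same vector for “bank”, yet the result leans toward the first axis next to “river” and toward the second next to “investment”.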

Context does more than disambiguate individual words. It shapes the entire internal state from which the next token is predicted. Consider four prompts:

The idea to be explained is identical in all four. The surrounding context changes the kind of answer that becomes likely.

This is the key point: the next token is not predicted from the text. It is predicted from the model’s internal state after the text has passed through many layers of interaction and conditioning. That state is not a sentence, a paragraph, a private monologue, or a plan. It is a distributed, high-dimensional representation that holds many things at once — topic, syntax, style, intention, discourse structure, relevant facts, and likely continuation.

This is why a short prompt can produce a long, coherent answer. The prompt itself does not contain the answer in any literal sense. It induces an internal state that makes some continuations far more likely than others. The model then samples a single token from that state, appends it, and repeats.

The model’s behavior across steps appears autocomplete-like:

But the complete picture includes the rebuilding of new meaning within each step:

Once a token is chosen, it becomes part of the input to the next step, where the model builds a fresh internal state from scratch over the longer text. The model then projects a small piece of that state back into language. That piece is the next token.

If autocomplete is the misleading surface story, the residual stream is where the deeper story lives. The internal state has a home inside the transformer, and researchers have a name for it: the residual stream. For a text of N tokens, the residual stream is a cloud of N points in embedding space, one per token, each carrying the model’s current vector representation of that token in context. Initially, each point of the cloud has minimal contextual awareness of the other points of the cloud. As the cloud passes through the layers, the token positions are updated together, and as a result (this is the important part) they are coupled. Attention lets each token’s vector be reshaped by the others, so the residual stream behaves less like N separate streams and more like an interacting cloud of points that move as a whole.
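The coupling can be made concrete with a toy cloud of three points. This is a sketch under stated assumptions: `layer` is a simplified causal attention update with no learned projections, feedforward blocks, or normalization, and the 0.5 step size is arbitrary. It shows only that changing one point changes where the others end up.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def layer(cloud):
    """One simplified layer: each point attends to itself and earlier points."""
    new_cloud = []
    for i, query in enumerate(cloud):
        visible = cloud[: i + 1]
        weights = softmax([dot(query, v) for v in visible])
        mixed = [
            sum(w * v[d] for w, v in zip(weights, visible))
            for d in range(len(query))
        ]
        # Residual update: keep the old position, add what attention gathered.
        new_cloud.append([q + 0.5 * m for q, m in zip(query, mixed)])
    return new_cloud


def run(cloud, n_layers=3):
    for _ in range(n_layers):
        cloud = layer(cloud)
    return cloud


base = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
changed = [[-1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # only the first point differs

final_base = run(base)
final_changed = run(changed)
print(final_base[2], final_changed[2])
```

The third point starts in the same place in both runs, but it ends somewhere different because the first point it attends to was moved.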

That cloud is the subject of a previous article in this series, “The Flow of Attention”, which describes its evolution and how it organizes itself within the residual stream. For this article we need only the conclusion: the cloud is not N independent points moving through space, but one system, held together by attention. The word “apple” may start near a generic cluster of meanings and then be pulled toward fruit, or company, or logo, or stock ticker, depending on the surrounding text as it moves through the layers of the model.

The larger point is that the cloud as a whole carries meaning, not just of each point individually, but of the entire body of text. Its overall configuration encodes what the text is about, what kind of response is being requested, what tone fits, what form the answer should take, which constraints matter, and which continuations are likely. The next word feels coherent not because it follows the last word, but because it is drawn from a configuration that already holds the larger shape of the response.

A tiny word can depend on a large hidden context. Suppose the model has already written “Energy and mass are equivalent,” and continues with “This was one of Einstein’s key insights.” The word “This” is doing real work: it reaches back to the internal state defined by what was already written and turns the explanation from the equation toward its historical significance. The word “This” is small; the internal state that makes it the “right” continuation of the text is not.

When the model finally predicts a token, it does not poll the whole cloud. It reads from one point: the final-layer vector of the last token in the text. That single vector has been shaped by the entire contextualized cloud around it, and it is where the internal state passes through a narrow bottleneck and becomes the next token — the vocabulary entry whose meaning best aligns with the direction the whole cloud has settled on. (An earlier article describes that readout step precisely — the last token’s final vector is compared against the vocabulary and a probability distribution is produced by softmax, from which the token is sampled.)
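The readout step described above can be sketched as follows. The three-word vocabulary, its vectors in `VOCAB`, and the example `last_vector` are all hypothetical values chosen for illustration; a real model compares against tens of thousands of vocabulary entries in a much higher-dimensional space.

```python
import math
import random

# Hypothetical output vectors for a three-word vocabulary.
VOCAB = {
    "energy": [1.0, 0.0, 0.0],
    "mass": [0.0, 1.0, 0.0],
    "banana": [0.0, 0.0, 1.0],
}


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(last_vector):
    """Compare the last token's final vector against every vocabulary entry."""
    words = list(VOCAB)
    logits = [sum(a * b for a, b in zip(last_vector, VOCAB[w])) for w in words]
    return dict(zip(words, softmax(logits)))


def sample(distribution, rng):
    """Draw one token according to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


last_vector = [3.0, 1.0, -2.0]  # stand-in for the final-layer vector
dist = next_token_distribution(last_vector)
token = sample(dist, random.Random(0))
print(dist, token)
```

Whatever the cloud held, only one vector reaches this function, and only one word leaves it. That is the narrow bottleneck.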

The residual stream, then, is an evolving, high-dimensional cloud of interacting particles, one particle per token, moving through the layers as a single entity from which language emerges one token at a time.

A prompt does not contain its response the way a zipped file contains its contents.

Take “Please explain E = mc².” Those few words do not hold Einstein, relativity, mass-energy equivalence, or the shape of a good explanation. What they do is start a process that gradually produces all of that. In the language of dynamical systems, the prompt establishes an initial condition, and the answer unfolds as a trajectory (or continuation) from it. Not the trajectory of a single point, but the joint evolution of the whole cloud, each token’s vector reshaped by the others as the text moves through the model.

Even one word shifts the starting point. “Please” leans toward requests; “Please explain” toward instruction and teaching; add “E = mc²” and the state shifts toward physics, symbols, mass, energy, light, Einstein, relativity, and the shape of an explanation. A natural answer might open by unpacking the symbols — “E stands for energy,” “m stands for mass,” “c² is the speed of light times itself” — then reach the central idea, “energy and mass are equivalent,” and then widen to “This was one of Einstein’s key insights.”

That widening matters. The answer moves from defining symbols, to physical meaning, to historical significance. It is not being extracted from the prompt. It is unfolding from an initial state that the prompt set in motion.

It is tempting to picture the path as a stored outline:

But there is no such outline inside the model. The path is what emerges, one token at a time, as the internal state is rebuilt and read at each step. The arrows above are a description of what happened, not a script the model consulted.

And the context the model responds to is not fixed. After a few tokens, the model is no longer answering “Please explain E = mc².” It is responding to a longer text:

At each step, the model rebuilds its internal state over the longer text, and that fresh state conditions the sampling of the next token. The prompt sets the initial condition; the residual stream carries the internal state within each step; the response is the visible sequence of tokens traced out across steps. That is the reason a short prompt yields a rich answer: the model’s response was never hidden in the prompt. The prompt merely set up the conditions from which the response emerged.

An LLM writes one token at a time. That much is true. It does not follow that the model “thinks” one idea at a time.

This is another place the autocomplete analogy misleads. The output is sequential because language is sequential — a sentence has to come out one word after another. But the meaning behind the sentence is not laid out sequentially along a single line. When the model chooses the next token, it draws on an internal state that holds many influences at once: topic, grammar, tone, audience, prior context, factual associations, and likely continuations. These are not waiting in a queue. They are spread across the many dimensions of the representation, all present at the same time.

Return to E = mc². The model does not wait for the word “Einstein” to appear before Einstein is relevant. The prompt already tilts the internal state toward relativity, mass-energy equivalence, physical explanation, and an educational tone. The larger shape of the answer starts forming even before the first sentence is finished. At each step only one word comes out, but many possible word continuations are being weighed at once. This is how the model stays on topic across a long answer. The topic is not stored as text generated one word at a time. It is held in the internal state, which reflects both what has been written and where the answer seems to be going.

A jazz musician is a useful image. The musician plays one note at a time, but the solo is not a chain of isolated notes. Each note comes from a fuller musical context — key, rhythm, motif, tension, the sense of where the phrase is heading. The performance is sequential; the musical understanding behind it is not. An LLM works the same way: one token at a time, each shaped by a larger pattern that is not itself sequential.

That is why a model can generate one token at a time without being limited to one idea at a time. The text comes out as a line of words. The computation behind it is closer to a field — a distributed state, shaped layer by layer, projected through a narrow bottleneck into the next observed token.

The residual stream is where the LLM represents meaning captured in a single step. It is not where the model stores memory between tokens.

This distinction is easy to miss and it is the crux of the whole story. As text moves through the model, the residual stream holds the model’s reading of the situation: what has been asked, what has been said, what tone is in play, where the answer is heading. The resulting internal state belongs to one forward step. After the model emits the next token, it does not carry that state forward to the next step. Instead, it rebuilds a new internal state from scratch over the text, which is now longer by one token, and uses that state to predict the next token.

The generated text is doing double duty. To the reader, it is the answer appearing on the screen. To the model, it is the external memory — the durable record from which the next internal state is rebuilt. The next step is not just

but closer to

So the model’s textual response is the thing that persists across steps, not the internal state from which the text is derived.

Return to E = mc² one more time. Once the model has written “E stands for energy,” that line is no longer just output — it is now part of the text the model reads in the next step. The internal state rebuilt over that longer text reflects not only the original request but the fact that the answer has begun by defining symbols. After “m stands for mass” is added, the state rebuilt in the following step reflects a stronger pattern: this is an answer that explains the equation term by term. That pattern is not stored anywhere as an instruction to define c² next. It is simply what the internal state looks like once it is rebuilt over the text written so far. This is the part ordinary autocomplete misses entirely. The model is not appending words to a string while holding a growing private thought. It is repeatedly rebuilding a temporary internal state from text it has already produced — and the text is the only thing that carries forward.

(In practice, actual LLMs keep a cache of intermediate quantities — the so-called KV cache — so they do not recompute everything from the beginning at each step. But the cache holds only values that are themselves derived from the text and recomputable from it; it is a speedup, not a separate memory. The durable, information-bearing record is still the text.)
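A small sketch of why the cache is a speedup and not a separate memory. The assumptions are heavy: `project` is a made-up stand-in for the learned key and value projections, the query is taken to be the key for brevity, and there is a single attention step. The only claim illustrated is that the cached result matches a full recomputation from the text.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(token_id):
    """Hypothetical stand-in for one token's key and value projections."""
    key = [math.sin(token_id), math.cos(token_id)]
    value = [float(token_id), 1.0]
    return key, value


def attend(query, keys, values):
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]


def last_output_full(tokens):
    """Recompute every key and value from the text, then attend from the end."""
    pairs = [project(t) for t in tokens]
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    return attend(keys[-1], keys, values)


class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        """Project only the new token and reuse everything stored so far."""
        key, value = project(token)
        self.keys.append(key)
        self.values.append(value)
        return attend(key, self.keys, self.values)


tokens = [3, 1, 4, 1, 5]
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
full_outputs = [last_output_full(tokens[: i + 1]) for i in range(len(tokens))]
print(cached_outputs[-1], full_outputs[-1])
```

Deleting the cache would cost time but lose no information, since every entry can be recomputed from the tokens.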

The last section landed on a deceptively simple idea: text is the only thing that carries forward. It is worth slowing down here, because that idea hides the most surprising structural fact about how LLMs work.

There are two levels to keep apart, and they behave completely differently.

Within a single step, an LLM is dynamical. The cloud of token vectors is built at the input layer and then reshaped layer by layer, each arrangement following from the one before it, until the last layer produces the internal state where the next token is read from. This is the flow a previous article describes: a coupled cloud moving through the layers of the model. Within steps, there is a genuine dynamical evolution in a well-defined space.

Across steps — from one emitted token to the next — there is no such dynamic. The model does not hand its internal state to the next step. It throws the state away, appends the single token it just produced to the text, and builds a brand-new cloud from scratch over the slightly longer text. Nothing dynamical crosses the boundary between one step and the next. The only thing that crosses is a single token.

This is the asymmetry: the dynamics live inside each step, but between steps there is nothing but a growing transcript and a model that rereads it. The flow is real, but it does not accumulate into a persistent inner state that connects the many steps that comprise the whole response. Each token is produced by a fresh flow over a slightly longer input. The coherence of a long response is not the coherence of a single sustained train of thought — it is the coherence of a sequence of independent reconstructions, each built over text that changes only one token at a time.

It is worth seeing that this is a design choice, not a law of natural languages. Other architectures make the opposite one: state space models such as Mamba do carry a compressed internal state forward from step to step, updating it as each new token arrives rather than rebuilding from the full text each time. The transformer trades that carried state away and re-reads everything instead — a trade with real consequences, and the subject of a later article.
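To make the contrast concrete, here is a deliberately simple carried-state sketch. It is a plain linear recurrence with hypothetical `decay` and `gain` constants, far simpler than Mamba, whose updates depend on the input. It illustrates only the idea of folding each new input into a fixed-size state instead of rereading the history.

```python
def update(state, x, decay=0.9, gain=0.1):
    """One recurrent step: fold the new input into a fixed-size state."""
    return decay * state + gain * x


def run_carried(inputs):
    """Carry a single number forward; earlier inputs are never reread."""
    state = 0.0
    for x in inputs:
        state = update(state, x)
    return state


def run_reread(inputs, decay=0.9, gain=0.1):
    """Same quantity, computed by going back over the full history."""
    n = len(inputs)
    return sum(gain * decay ** (n - 1 - i) * x for i, x in enumerate(inputs))


history = [1.0, 2.0, 3.0]
print(run_carried(history), run_reread(history))
```

For this recurrence the two routes agree, so carrying the state forward loses nothing. The design question is what happens when the state is too small to summarize everything the full text contains.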

So why does coherence survive a process that discards and rebuilds its internal state at every step? Three things hint at an answer.

First, the text the model rereads is only one token longer than it was a step ago. Almost the entire input is unchanged, so the cloud based on it is almost the same cloud, and the state it produces points in almost the same direction. The reconstruction is fresh, but it is fresh over nearly identical text, which keeps consecutive steps close to one another.

Second, the token that got appended was not arbitrary: it was the state’s own preferred continuation. Adding it to the text nudges the next reconstruction a little further in the direction the previous one was already leaning. The new state does not merely stay near the old one; it advances along the direction the old state favored.

Third, once a token is written it cannot be unwritten. Every later step reads it as fixed context. There is no backtracking, no revision of what has already appeared — the transcript is append-only, and each new token is conditioned on the full record of what came before. This is the precise sense in which text generation by an LLM is called autoregressive: each token is predicted from the entire prior text. A direction, once committed to the page, is not reconsidered; it becomes part of the ground the rest of the answer is built on.
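In symbols, the autoregressive property is the standard chain-rule factorization of the probability of a text. The notation is introduced here for illustration: x_t is the token at position t and T is the length of the text.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor conditions on every earlier token, and no factor conditions on a later one, which is the formal counterpart of an append-only transcript.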

These three pull in the same direction. The flow inside each step proposes where to go next; the growing text fixes that proposal in place and carries it forward. Continuity comes from the input barely changing; momentum comes from each step writing down its own preferred next move; persistence comes from the fact that nothing already written can be taken back. Put together, they suggest that by rebuilding its internal state from scratch at every step, an LLM produces answers that stay on topic, develop coherently and hold together consistently.

The same structure also explains how things go wrong. Most of the time the appended token simply confirms the direction the state already favored, and the loop is self-stabilizing — small perturbations get absorbed rather than amplified. But not every step is so decisive. Sometimes the readout distribution is nearly flat: two or more continuations are almost equally likely, and the sampling step has to break what is essentially a tie. At such a branch point, a single token can swing the trajectory onto a different course — and because that token is then committed to the text, every subsequent step builds on it. The model does not look back on the road not taken.
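A branch point can be recognized from the readout distribution itself. The sketch below uses made-up logits for a three-token vocabulary and compares a decisive step with a near tie, using entropy as a rough measure of flatness.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


decisive = softmax([5.0, 1.0, 0.0])   # one continuation clearly favored
near_tie = softmax([2.0, 1.95, 0.0])  # two continuations almost equally likely

print(decisive, entropy(decisive))
print(near_tie, entropy(near_tie))
```

In the decisive case sampling almost always picks the same token. In the near tie the top two probabilities differ by only a few percentage points, so the random draw effectively chooses the branch.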

This is one structural reason a confident-sounding falsehood can persist once it appears. Hallucination is an active and much-debated research problem with several distinct causes, and the rebuilding of the model’s internal state at every step does not explain all of them. But it does reveal a plausible mechanism: once a wrong turn is committed to the transcript, the same forces that keep a good answer coherent (rereading the fixed text, building on what is already written) keep the wrong answer coherent too. Autoregressive commitment is not what produces the error; it is what propagates it, which is why a mistaken continuation can read as fluent and self-consistent rather than collapsing into nonsense. The append-only loop has no built-in step that revisits and revises what it has already said.

If the model keeps rebuilding its internal state at each step from the growing text, pre-training explains why that internal state is rich enough to be worth rebuilding. During pre-training, the model sees enormous amounts of text and is trained to predict missing or next tokens across countless contexts. The objective is simple to state (predict the next word), but the structure it forces the model to learn is not. To predict the next token well, the model has to capture patterns at very different ranges: short-range patterns like spelling, grammar, and phrasing; longer-range patterns like definitions, examples, and analogies; and patterns that span an entire passage, like argument, explanation, dialogue, genre, and narrative flow.
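The next-token objective is usually written as a cross-entropy loss: the average negative log-probability the model assigned to the token that actually came next. The sketch below assumes that standard form; the article does not spell out the loss, and the probability tables are invented for illustration.

```python
import math


def next_token_loss(predictions, targets):
    """Average negative log-probability of the tokens that actually came next."""
    total = 0.0
    for distribution, target in zip(predictions, targets):
        total += -math.log(distribution[target])
    return total / len(targets)


targets = ["stands", "for", "energy"]

# Hypothetical predicted distributions at three positions.
confident = [
    {"stands": 0.9, "sits": 0.1},
    {"for": 0.8, "on": 0.2},
    {"energy": 0.7, "mass": 0.3},
]
unsure = [
    {"stands": 0.5, "sits": 0.5},
    {"for": 0.5, "on": 0.5},
    {"energy": 0.5, "mass": 0.5},
]

print(next_token_loss(confident, targets), next_token_loss(unsure, targets))
```

Lowering this number across countless contexts is the whole training signal; the structure described above is what the model has to learn to lower it.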

Pre-training fixes where each token starts in the model’s internal space, and the operators that move it once a forward step begins. As a result, certain next-token continuations become far more natural than others. A flat table of word associations could not do this; what pre-training builds is structured enough to handle grammar locally and reasoning, explanation, style, and discourse globally, and everything in between.

From here on it helps to borrow a picture, holding it loosely: think of the patterns pre-training fixes as a kind of terrain, and an unfolding answer as a path across it. The picture is a metaphor, not the mechanism — there is no literal surface the model slides down — but it captures something real. A prompt like “Please explain E = mc²” starts the answer in a particular region — near physics, equations, Einstein, relativity, explanation — and some directions out of that region are more natural than others: questions lead toward answers, explanations toward definitions and examples, equations toward the concepts attached to them. When the model responds to a prompt, it is not inventing structure from nothing. While the observed text may be new, a number of coherent responses were fixed long before the prompt arrived. Pre-training builds the landscape. Fine-tuning changes how the model tends to move through it.

A pre-trained model has learned an enormous range of patterns — explanations, stories, arguments, code, dialogue, instructions, jokes, contradictions, helpful answers, and not so helpful ones. It knows many possible paths through language, but not all of them are what you want in a conversation with a helpful assistant.

Fine-tuning reshapes those tendencies. Through further training on examples of good responses, on instruction-following, on human preferences, and on safety constraints, the model becomes more likely to take paths that behave like a helpful assistant: answering the question, respecting intent, staying on topic, avoiding certain kinds of output, adopting a conversational voice. This does not bolt a second “assistant brain” onto the model. It adjusts the same underlying weights — the same token embeddings, attention operators, and feedforward neural networks. In probabilistic terms, fine-tuning reshapes the distribution of responses given prompts — raising the probability of helpful, coherent, instruction-following paths and lowering the probability of unhelpful or unsafe ones.

There is more than one way to do this. Supervised fine-tuning teaches by demonstration: for this kind of prompt, follow this kind of path. Reinforcement-learning-based post-training teaches by preference: among the paths the model might take, make the better ones more likely and the worse ones less likely. Either way, the effect is the same in spirit — the same prompt can lead somewhere different after fine-tuning. A raw pretrained model might continue a prompt as an essay, a web page, a stray fragment, or an imitation of something in its training data. A fine-tuned assistant is pushed toward treating the prompt as a request and trying to answer it well.

That is why modern chatbots feel different from raw language models. The terrain comes mostly from pre-training; the assistant-like way of moving across it comes from fine-tuning. Pre-training defines the landscape. Fine-tuning bends the path.

Writing a poem, a mathematical proof, or a Medium article is inherently a linear, sequential process. The same goes for activities like playing a jazz solo or giving a lecture. Work gets done essentially one token at a time — each new token carefully selected and added to the previous ones, with the hope that an interesting and coherent whole emerges. This sequential process has the superficial appearance of “autocomplete,” but the real question is: where do these tokens — words, notes, and utterances — come from?

The autocomplete claim sidesteps that question by assuming the next token comes mainly from the previous ones. A deeper look suggests otherwise: the source of the next token is an internal representation — something we might loosely call a “thought” — that draws on prior knowledge, personal experience, intent, and goals. This representation is more informative than any single token and the expectation is that, when taken as a whole, the predicted tokens provide a faithful picture of it.

A similar story applies to LLMs like ChatGPT. The model first builds an internal state from previously generated text and reads the next token off it. Here, the internal state captures the context, meaning, style, discourse, and likely continuations of the text and the next token is chosen to cohere with the richness of that state as closely as possible. That completes one step. The next token is then appended to the previous text, the whole is run through the model to build a fresh internal state, and the next token is read off that. What emerges across many steps is a body of text that is the visible trace of a thought unfolding — a thin projection of a sequence of internal states that is much richer than the sequence of words it gives rise to.

That projection is lossy, and the gap is where things can go wrong: a thin stream of tokens can never carry everything the internal state holds. But when it works, the text we read reveals the essence of a greater meaning that is hidden from view.

Maybe something like that applies to us too — much of what we do, we do sequentially, one piece at a time, resulting in a visible trace of an internal state we never fully externalize. If so, then analyzing an LLM may provide a way to understand a process that we always run but never quite see.

This is much more than autocomplete to me.

About the author: Gordon is a mathematician, bioinformatician, AI researcher, and singer-songwriter who explores hidden mathematical patterns at the interface of machine learning, modern AI, biology, and music.

Acknowledgment: This article was developed in collaboration with AI, disclosed here in full in the belief that transparency about this way of working serves both readers and the practice itself. Claude (Anthropic) served as the primary editorial partner, contributing structural revisions, mathematical vetting, and expository refinement throughout the drafting process. ChatGPT (OpenAI) and Gemini (Google) were consulted periodically for alternative perspectives on framing and direction. The mathematics, research direction, and final text are the author’s responsibility.

Why ChatGPT Is More Than Autocomplete was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
