How LLMs Actually Work: A Developer's Mental Model

A developer explains that an LLM is fundamentally a function that predicts the next chunk of text, with all intelligent-seeming behavior emerging from repeating this process billions of times. The model tokenizes input into word-fragments, maps each to a high-dimensional vector based on meaning, and uses a transformer architecture with attention mechanisms to enrich each token's representation with context from the entire sequence. After processing through multiple layers, the model outputs a probability distribution over its vocabulary, from which it samples the next token, a process controlled by temperature settings.

Most of us use LLMs every day now, but if you asked the average developer what's actually happening between hitting enter and getting a response, the answer is usually some mix of "it's a neural network" and a shrug. That's fine. You don't need to know how a database B-tree works to write a query. But understanding the mental model behind LLMs makes you dramatically better at using them: you stop being surprised when they hallucinate, you write better prompts, and you understand why things like context windows and RAG exist.

So here's the whole thing, explained from the ground up. No equations.

An LLM is a function that takes some text and predicts the next chunk of text. That's it. Everything else (answering questions, writing code, "reasoning") is an emergent side effect of doing that one thing extremely well, billions of times over.

Let's unpack how that actually produces something that feels intelligent.

The model doesn't see letters or words. The first thing that happens is tokenization: your text gets chopped into "tokens," which are roughly word-fragments. Common words are usually one token; rarer words get split into pieces. For example, "tokenization" might become "token" and "ization", while "the" is just "the". As a rough rule, one token ≈ 4 characters of English, or about ¾ of a word.
This is why you get billed per token and why character-counting your prompts doesn't quite match the limits. The model literally operates on a vocabulary of tokens, not language.

Each token gets mapped to a long list of numbers: a vector, also called an embedding. Think of it as coordinates in a space with hundreds or thousands of dimensions. The key idea: tokens with similar meanings end up near each other in that space. "King" and "queen" land in a similar neighborhood; "banana" is off in a completely different region. The model learned these positions during training, purely by noticing which words show up in similar contexts.

So at this point your sentence is no longer text. It's a sequence of points in a high-dimensional space, where distance and direction encode meaning.

This is the part that made modern LLMs possible: the transformer architecture, and specifically a mechanism called attention.

Here's the intuition. Take the sentence "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to? You know it's the trophy, because of context. Attention is how the model figures this out: when processing the token "it," the model lets it look at every other token in the input and decide which ones are relevant. "Trophy" gets a high attention weight; "suitcase" a bit less; "the" almost none.

Every token does this, for every other token, simultaneously. The result is that each token's representation gets enriched with context from the whole sequence. A word like "bank" starts off ambiguous, but after attention it's been nudged toward "riverbank" or "financial institution" depending on what surrounds it.

Modern models stack this operation dozens of times in a row (these are the "layers"). Early layers capture simple patterns; deeper layers build up to grammar, then meaning, then something that looks a lot like reasoning. Nobody hand-coded any of those levels; they emerge from training.

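
If it helps to see the mechanics, here is a minimal sketch of that "look at the other tokens and weigh them" step in plain Python. Treat it as an illustration under loud assumptions rather than how any real model is implemented: the two-dimensional vectors are made up, the names `softmax`, `dot`, and `attend` are hypothetical helpers, and the same vectors are reused as keys and values, whereas real models learn separate query, key, and value projections over hundreds or thousands of dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query token."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The enriched vector is a weighted average of the value vectors.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed


# Made-up 2-d vectors, chosen by hand purely for illustration.
tokens = ["the", "trophy", "suitcase"]
keys = [[0.0, 0.1], [2.0, 0.0], [1.0, 1.0]]
values = keys  # assumption: reuse keys as values to keep the sketch short
query_it = [2.0, 0.0]  # hypothetical query vector for the token "it"

weights, enriched_it = attend(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>9}: {weight:.2f}")
```

With these hand-picked vectors, "trophy" ends up with the largest weight, "suitcase" with less, and "the" with the least, which mirrors the example above. A real transformer does this for every token at once, with many attention heads in every layer.
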
After all those layers, the model produces one thing: a probability score for every token in its vocabulary, representing how likely each one is to come next. For the prompt "The capital of France is", the distribution might look like:

    "Paris"   → 0.94
    "located" → 0.02
    "a"       → 0.01
    ...         tens of thousands more, each tiny

The model doesn't "know" Paris is the capital in the way you do. It has learned that, statistically, "Paris" is overwhelmingly the token that follows that sequence.

Now the model samples a token from that distribution. This is where temperature comes in: a low temperature makes the model pick the most likely token almost every time, while a higher temperature flattens the distribution so that less likely tokens get picked more often. The chosen token is appended to the input and the whole process repeats:

    prompt = "Write a haiku about the sea"
    while not done:
        distribution = model(prompt)        # Steps 1–4: tokenize, embed, attend, score
        next_token = sample(distribution)   # Step 5: sample
        prompt = prompt + next_token

That's the whole generation loop. The model writes one token at a time, each time re-reading everything it has produced so far. It has no plan and no draft; it's improvising every token based on all the tokens before it. The fact that this produces coherent essays is genuinely one of the more surprising results in computer science.

Everything above describes the model running. But how did it learn the right vector positions and attention patterns? Two phases.

Pretraining. The model is shown an enormous amount of text and given one task: predict the next token. Cover the last word of a sentence, make it guess, compare to the real answer, and nudge its billions of internal numbers (the parameters, or weights) slightly toward being right. Repeat trillions of times. To get good at predicting the next word across all of human text, the model is forced to absorb grammar, facts, writing styles, and patterns of reasoning. The "knowledge" is a side effect of compression: predicting text well requires modeling the world the text describes.

Fine-tuning and alignment. A raw pretrained model is just an autocomplete engine: ask it a question and it might reply with more questions, because that's a plausible continuation.
To turn it into a helpful assistant, it goes through additional training on examples of good instruction-following, plus techniques like RLHF (reinforcement learning from human feedback), where humans rank responses and the model learns to prefer the kind people actually want. This is the step that makes it answer rather than ramble.

Once you hold this mental model, the famous LLM weirdnesses stop being mysterious:

Hallucinations. The model is optimizing for plausible, not true. There's no fact-checking step: if a confident-sounding but wrong token sequence is statistically likely, it'll produce it just as smoothly as a correct one. It doesn't know when it doesn't know.

Knowledge cutoffs. Its facts come from training data frozen at a point in time. It can't know about anything after that unless you give it the information in the prompt (which is exactly what web search and RAG do).

Context windows. Attention works over a fixed maximum number of tokens. Everything (your prompt, the system instructions, the conversation history) has to fit in that window. Go over it and the earliest stuff falls out of view.

No memory between calls. Each API request is stateless. The model doesn't "remember" your last conversation; chat apps fake continuity by resending the history every time. That's why your token count grows as a chat gets longer.

Prompt sensitivity. Since output is just conditioned on input tokens, small wording changes shift the probability distribution. "Explain X" and "Explain X to a five-year-old" steer it down very different statistical paths.

The most useful reframe is this: an LLM is not a database you query and not a mind that thinks. It's a probabilistic next-token predictor trained on a huge slice of human text, wrapped in enough scale that the predictions become genuinely useful.
Hold that model in your head and you'll prompt better, trust it in the right places, distrust it in the right places, and stop being shocked when the confident answer is confidently wrong.

This is a simplified mental model. I've glossed over plenty (positional encodings, the exact attention math, mixture-of-experts, and more) in the name of clarity. If you want a deeper dive into any one piece, drop a comment and I'll expand on it.