{"slug": "the-anatomy-of-an-llm", "title": "The Anatomy of an LLM", "summary": "OpenAI's o200k_base tokenizer splits the 102-character sentence \"If the human brain were so simple that we could understand it, we would be so simple that we couldn't.\" into 22 tokens, converting text into integer IDs for model processing. Tokenization bridges raw characters and numerical input, using subword units to balance vocabulary size and sequence length. This step is the first transformation in a large language model's pipeline, enabling efficient handling of diverse text.", "body_md": "Interactive Explainer\n\nA visual walk through the machinery inside a large language model: from raw text, to tokens, to vectors, to attention, to the next token.\n\nIntroduction\n\nLarge language models can feel like black boxes. You type a prompt, something smart comes back, and somewhere in the middle billions of parameters supposedly did \"AI\".\n\nThis guide opens that box.\n\nWe will follow one chain from beginning to end. First, text is split into tokens. Those tokens become vectors. The vectors move through layers of attention and feed-forward networks. At the end, the model produces scores for possible next tokens, and a decoding strategy chooses what comes out.\n\nThe goal is not to memorize every formula. 
The goal is to understand what changes at each step, and why that step exists at all.\n\nIf you are looking for how LLMs work, how transformers work, or how attention, tokenization, KV cache, and quantization fit together, this page keeps those ideas connected in one visual path.\n\nBy the end, you should be able to trace the full path:\n\nraw text -> tokens -> vectors -> attention and feed-forward layers -> next-token scores -> decoded output\n\nAnd once you can trace that path, the black box becomes a lot smaller.\n\nWhat you get\n\nConcrete visuals, small numbers first, and interactive controls that make each transformation inspectable.\n\nHow to use it\n\nScroll top to bottom as a single narrative, or jump between chapters for a specific concept.\n\nWho made this\n\nRoy van Rijn working at [openvalue](https://www.openvalue.eu)\n\nChapter 01\n\nBefore a model can think about text, the text has to become numbers.\n\nA language model does not read words and sentences the way we do. It reads a sequence of token IDs: integers produced by a tokenizer.\n\nThat makes tokenization the real entrance to the model. Everything after this point works with numbers, not raw characters.\n\nA token can be a whole word, part of a word, punctuation, whitespace, or a piece of something strange like code, emoji, or a name. This is why tokenization often looks a bit weird when you first see it. The tokenizer is not trying to split text the way a human would. It is trying to represent text efficiently using a fixed vocabulary.\n\nIf every token were a full word, the vocabulary would explode. If every token were a single character or byte, every sentence would become very long. Modern tokenizers live between those extremes.\n\nSlicing up the text\n\nBefore text can enter a language model, it has to be rewritten as numbers.\n\nTokenization is the step that does this. 
It splits text into small reusable pieces called **tokens**.\nA token can be a whole word, part of a word, punctuation, a number, or even a space plus the start of the next word.\n\nEach token has an entry in the tokenizer's vocabulary and is replaced by its corresponding integer ID. From that point on, the model is no longer working with characters directly. It sees an ordered list of token IDs.\n\nWhy not just use words?\n\nWhole words are too rigid. New names, typos, code, inflections, compound words, and multilingual text would constantly produce words the model has never seen before.\n\nWhy not just use letters or bytes?\n\nThat solves the \"unknown word\" problem, but makes every input much longer. More pieces means more work for the model and less context fits in the same window. Subword tokens are the reasonable compromise: common text stays compact, while unusual text can still be built from smaller pieces.\n\nBelow you can experiment with OpenAI's `o200k_base` tokenizer. Try switching sentences and watch where the boundaries land.\n\nLater in this explainer, when the model predicts the *next* token, it predicts over this same vocabulary.\n\nTechnical note: the examples below are generated with [tiktoken](https://github.com/openai/tiktoken) using the `o200k_base` encoding.\n\nRaw sentence\n\nIf the human brain were so simple that we could understand it, we would be so simple that we couldn't.\n\n**102** characters\n\n**22** tokens\n\nabout **5** chars/token on average (102 / 22 ≈ 4.6)\n\nTokenized result (`·` marks a leading space)\n\n| Token | ID |\n|---|---|\n| `If` | #3335 |\n| `·the` | #290 |\n| `·human` | #5396 |\n| `·brain` | #12891 |\n| `·were` | #1504 |\n| `·so` | #813 |\n| `·simple` | #4705 |\n| `·that` | #484 |\n| `·we` | #581 |\n| `·could` | #2023 |\n| `·understand` | #4218 |\n| `·it` | #480 |\n| `,` | #11 |\n| `·we` | #581 |\n| `·would` | #1481 |\n| `·be` | #413 |\n| `·so` | #813 |\n| `·simple` | #4705 |\n| `·that` | #484 |\n| `·we` | #581 |\n| `·couldn't` | #21149 |\n| `.` | #13 |\n\nImportant takeaway\n\nTokenization is not just preprocessing. 
It determines what the model can see in one context window, how expensive your text is, and which pieces the model is allowed to predict next.\n\nOne word is not one token\n\nDifferent models use different tokenizers. The same sentence can become a different number of tokens depending on the model.\n\nChapter 02\n\nToken IDs are just labels. Embeddings turn those labels into something the network can work with.\n\nAfter tokenization, every token is represented by an integer ID. But an ID by itself has no useful geometry. Token `15339` is not \"close to\" token `15340` in any meaningful way. The numbers are just labels, like row numbers in a table.\n\nThe embedding layer solves this by turning each token ID into a vector: a list of learned numbers. Technically, this is a lookup. The model has an embedding matrix, and each token ID selects one row from that matrix.\n\nConceptually, this is the moment where discrete symbols enter a continuous space. Once tokens become vectors, the model can compare them, combine them, rotate them, project them, and gradually reshape them.\n\nThe values inside these vectors are learned during training. Tokens that appear in similar contexts often end up with related vectors, but this is not a clean dictionary of meanings. It is more like a messy, high-dimensional coordinate system full of useful signals.\n\nThe initial embedding is mostly context-free. The token \"bank\" starts with the same embedding in \"river bank\" and \"investment bank\". Later layers use surrounding tokens to rewrite that vector into something more specific.\n\nFrom token ID to embedding vector\n\nImportant takeaway\n\nAn embedding is the token's starting representation, not its final meaning. The rest of the model will keep rewriting that vector as context flows through the network.\n\nToy scale\n\nIn this explainer we use small vectors because they fit on screen. 
Real models use much wider vectors: hundreds, thousands, or more dimensions per token.\n\nChapter 03\n\nA weighted sum is not enough. The non-linearity is where the network gets expressive.\n\nA neuron takes inputs, multiplies them by weights, adds them together, and produces a number. But if that were the whole story, deep learning would not be very deep.\n\nWithout activation functions, stacking layers would still behave like one large linear transformation. You could multiply matrices together and collapse the whole stack into a single matrix.\n\nThe activation function breaks that linearity. It decides how much of a signal passes through. Some values are amplified, some are softened, some are pushed toward zero.\n\nThis lets the network build curved, conditional, non-linear transformations instead of only scaling and rotating vectors. Real models do this in huge batches using matrix operations, with millions of activations happening at once.\n\nSingle-neuron transformation\n\nA neuron takes inputs, applies weights, and then runs the result through an activation function. This non-linear step is what lets networks model richer patterns.\n\n`z = w1*x1 + w2*x2 + w3*x3`\n\n`output = activation(z)`\n\nNeuron diagram\n\nInputs\n\nWeights\n\nActivation\n\nSmoothly gates values by magnitude instead of hard clipping. Common in transformer blocks; a bit heavier to compute than ReLU.\n\nActivation curve\n\nImportant takeaway\n\nThe activation function is not decoration. It is what lets stacked layers become more than one big linear calculation.\n\nModern choices\n\nModern transformer models may use GELU, SiLU, or gated variants like SwiGLU. The exact choice changes both the forward signal and how gradients flow during training.\n\nChapter 04\n\nA real layer is not one neuron. It is many simple computations running in parallel.\n\nA single neuron is a useful teaching tool, but models do not process one neuron at a time. 
A feed-forward network applies many learned transformations in parallel.\n\nInstead of drawing every neuron and every connection, implementations usually express the same thing as matrix multiplication. The friendly diagram says inputs flow through neurons. The implementation says multiply a matrix, apply an activation, multiply another matrix.\n\nThose are the same story at different scales.\n\nIn transformer blocks, the feed-forward part usually works position by position. Each token vector is expanded into a wider hidden representation, passed through a non-linearity, and projected back to the model width.\n\nAttention moves information between positions. The feed-forward network transforms the information inside each position.\n\nDense layer math, visually\n\nInstead of training a full network here, we focus on one forward pass. A dense layer simply means every node in one layer connects to every node in the next layer.\n\nThe same math from one neuron is now done in parallel using matrices:\n\n`X(1x2) · W1(2x3) = Z1(1x3)`, then `A1 = activation(Z1)`, then `A1(1x3) · W2(3x2) = Z2(1x2)`.\n\nMatrix multiplication is just many weighted sums at once. Each output column is one neuron, and each input feature contributes through its matching weight row.\n\nFully connected view\n\nHover the top labels to inspect matrices. Green border means firing, red means suppressed.\n\nMatrix inspector\n\nHover one of the top labels (`X`, `W1`, `A1`, `W2`, `A2`) to inspect that matrix and the multiplication step.\n\nHow multiplication maps to connections\n\nColumn `j` in `W1` contains weights feeding hidden neuron `j`. Row `i` corresponds to input feature `i`. 
So each hidden pre-activation is: `z1_j = x1*w1_1j + x2*w1_2j`.\n\nThe second layer repeats that pattern with `A1` as input: `z2_k = a1_1*w2_1k + a1_2*w2_2k + a1_3*w2_3k`. This is exactly the graph computation, just vectorized.\n\nIn matrix form, we avoid writing each neuron separately: `[x1 x2] · W1 = [z1_1 z1_2 z1_3]`, then activation applies element-wise to produce `A1`. That `A1` row is then multiplied by `W2` to produce both output neurons at once.\n\nExample from the current sliders: `z1_1 = +0.80*+0.70 + -0.30*+0.10 = +0.53`. If activation suppresses this value (for example ReLU on negative values), that path contributes less or zero to the next layer.\n\nImportant takeaway\n\nThe feed-forward network is where each token vector gets rewritten. It is not about moving information between tokens; it is about transforming the representation at each token position.\n\nMatrix view\n\nThe matrix view is not a less intuitive version of the neuron diagram. It is the scalable version of the same computation.\n\nChapter 05\n\nThe model does not directly output a word. It outputs scores for possible next tokens.\n\nAfter the model has processed the input, it still has not chosen a word. What it has produced is a vector of raw scores: one score for every token in the vocabulary. These scores are called logits.\n\nA logit is not a probability. It is just an unnormalized score. Higher usually means \"the model thinks this token fits better here\", but the numbers do not yet add up to 100%.\n\nTo turn logits into probabilities, we apply softmax. Then comes decoding: the policy for choosing the next token from that distribution.\n\nGreedy decoding always picks the most likely token. Temperature changes the shape of the distribution. Top-k limits the choice to the k most likely tokens. 
Top-p, also called nucleus sampling, chooses from the smallest group of tokens whose total probability passes a threshold.\n\nThe model produces the distribution. The decoder decides how adventurous we are when sampling from it.\n\nFrom logits to generated output\n\nA model converts the final hidden vector into one score per vocabulary token. Those raw scores are logits. Softmax turns them into probabilities, and sampling chooses the next token.\n\n| Logit | Probability (after softmax) |\n|---|---|\n| `0.81` | `27.3%` |\n| `0.34` | `17.1%` |\n| `0.49` | `19.6%` |\n| `-0.45` | `7.7%` |\n| `-0.84` | `5.2%` |\n| `-0.06` | `11.4%` |\n| `-0.03` | `11.7%` |\n\nImportant takeaway\n\nThe model usually does not contain one fixed answer. At each generation step, it produces a probability distribution over possible next tokens.\n\nToken by token\n\nA chatbot answer is built one token at a time. After each sampled token, the new token is added to the context and the process repeats.\n\nChapter 06\n\nTo learn, the model needs to know which parameters helped cause the mistake.\n\nTraining starts with a simple question: how wrong was the model?\n\nThe model predicts a distribution over the next token. We know which token actually came next in the training text. The loss measures how far the prediction was from that target.\n\nBut measuring the loss is not enough. The model has billions of parameters. Which ones should change? And by how much?\n\nBackpropagation answers that question. It sends the error signal backward through the computation graph and calculates gradients: how sensitive the loss is to each parameter.\n\nThe core idea is the chain rule. Every operation only needs to know how its output changes with respect to its input. 
By chaining those local derivatives together, training can calculate how a tiny change deep inside the model would affect the final loss.\n\nError becomes learning signal\n\nWe will train on one tiny example and reveal each step in order: forward prediction, backward gradients, then the weight update.\n\nStep 1 - Predict from input\n\nImportant takeaway\n\nBackpropagation is not a second mysterious intelligence inside the model. It is an efficient way to calculate gradients through a large composed computation.\n\nThree passes\n\nForward pass: make a prediction. Backward pass: calculate how to change the parameters. Optimizer step: actually change them.\n\nChapter 07\n\nGradients point downhill. Optimizers decide how to walk.\n\nA gradient tells us which direction should reduce the loss. But it does not fully answer how to update the model.\n\nHow big should the step be? Should we trust the current gradient completely? Should we remember previous gradients? What if different parameters have wildly different gradient scales?\n\nThat is the job of the optimizer.\n\nSGD, or stochastic gradient descent, is the simplest common version. It looks at a small batch of training examples, calculates the gradient, and takes one step in the direction that should reduce the loss. It is direct and easy to understand, but each step can be noisy because it only sees a slice of the training data.\n\nMomentum improves on this by remembering direction. If gradients keep pointing roughly the same way, momentum builds speed. If they zigzag, momentum smooths the path.\n\nAdam tracks both a moving average of the gradients and a moving estimate of their scale. That lets it adapt update sizes per parameter.\n\nThe optimizer is not just a detail after backpropagation. It is part of the learning behavior.\n\nDifferent update rules, same gradients\n\nBackprop gives gradients. 
Optimizers decide how to turn those gradients into actual parameter updates.\n\nOptimizer trajectories on one toy loss surface\n\n| Optimizer | Loss start | Loss end | Delta |\n|---|---|---|---|\n| SGD | `3.3000` | `0.0118` | `-3.2882` |\n| Momentum | `3.3000` | `0.2362` | `-3.0638` |\n| Adam | `3.3000` | `0.1374` | `-3.1626` |\n\nAll optimizers see the same gradients. Their update rules differ, so their paths differ.\n\nImportant takeaway\n\nGradients tell the model where improvement may be. The optimizer decides how aggressively and in what style the model moves there.\n\nSame gradients, different path\n\nSGD, Momentum, and Adam can start from the same point and see the same gradients, yet follow different paths because each optimizer keeps different internal state.\n\nChapter 08\n\nAttention lets tokens pull useful information from other tokens.\n\nEmbeddings alone are too context-free. Take a word like \"mole\". It might mean a small animal, a mark on skin, a spy, or a unit in chemistry. The starting embedding is the same token representation, but the meaning depends on the surrounding words.\n\nThe model needs a way for tokens to talk to each other. That is what attention does.\n\nFor each token, the model creates three learned views: query, key, and value. The query represents what this token is looking for. The key represents what this token can be matched on. The value represents the information this token can contribute.\n\nThe model compares queries to keys to produce attention scores. Those scores are turned into weights, and the weights are used to mix the value vectors. So Q and K decide where information flows. V is the information that flows.\n\nHow tokens exchange information\n\nRight now we only have tokens. 
But sentences encode extra meaning through relationships between nearby words and references.\n\nSelect one token to inspect which key tokens it matches with (arrows), then how those weights mix into one updated value representation.\n\nContext Scenario\n\nA blue car crashed into a concrete wall, it was speeding.\n\nSentence Tokens\n\nPick any token to compute attention links and value mixing.\n\nImportant takeaway\n\nAttention is information routing. Query and key determine relevance; value carries the content that gets mixed in.\n\nSelf-attention\n\nIn self-attention, tokens attend to other tokens in the same sequence. In a decoder-only LLM, causal masking prevents a token from attending to future tokens during generation.\n\nChapter 09\n\nOne attention pattern is useful. Many attention patterns in parallel are much more powerful.\n\nA sentence contains many kinds of relationships at once. An adjective may modify a noun. A pronoun may refer to something earlier. A closing bracket may match an opening bracket. A verb may depend on the subject.\n\nOne attention head can learn one way of routing information. But one routing pattern is not enough. Multi-head attention runs several attention heads in parallel. Each head has its own learned projections, so each head can learn a different kind of relationship.\n\nAfter the heads produce their outputs, those outputs are combined and projected back into the model dimension. This does not mean every head has a clean human-readable job. Attention weights are useful clues, not perfect explanations.\n\nModern models often use grouped-query attention. 
Groups of query heads share key/value heads, reducing memory use during inference, especially in the KV cache, while keeping much of the benefit of many query heads.\n\nRaw scores -> softmax weights -> value mixing\n\nWe also introduce **multi-head attention** here.\nIn modern Transformer models each block doesn't just have a single attention head, but multiple.\nDifferent heads can learn different routing patterns, then their outputs are combined.\n\nEach token creates three learned views of itself:\n\n`Q` - the question this token asks.\n\n`K` - what this token advertises about itself.\n\n`V` - the information this token contributes.\n\nFor one selected query token, we compare its `Q` vector with every `K` vector.\n\nOnly after softmax do these scores become attention weights. Those weights decide how much of each V vector is mixed into this token’s next representation.\n\nTensor Shapes\n\nWe start with token vectors, project them into `Q`, `K`, and `V`, compute query-key compatibility scores, then convert those scores into attention weights and mix values.\n\n`Q = XWq`, `K = XWk`, `V = XWv` -> `scores = QK^T / sqrt(d_k)` -> `weights = softmax(scores)` -> `output = weights·V`\n\nThis example uses unmasked self-attention, so every token can attend to every token. 
A GPT-style causal decoder would mask future tokens.\n\nWhich token is asking a question?\n\nSelected token: `blue`\n\nIts query asks: \"Which other tokens help me understand `blue`?\"\n\nThis head emphasizes modifiers routing to the noun they describe (for example blue -> car).\n\nQ View\n\n`blue` embedding `[+0.200, +0.900, +0.400]`\n\n↓ multiply by `Wq`\n\n`Q_blue = [+0.310, +0.720, +0.650]`\n\nK View\n\nEach token embedding times `Wk` gives its advertised key vector.\n\n`K_The`, `K_blue`, `K_car`, `K_hit`, `K_the`, `K_wall`\n\nV View\n\nEach token embedding times `Wv` gives value content to mix if attended.\n\n`V_The`, `V_blue`, `V_car`, `V_hit`, `V_the`, `V_wall`\n\nRaw Query-Key Scores (Not Attention Yet)\n\n| Q \\ K | The | blue | car | hit | the | wall |\n|---|---|---|---|---|---|---|\n| The | +0.212 | +0.191 | +0.276 | +0.366 | +0.190 | +0.297 |\n| blue | +0.307 | +0.384 | +2.270 | +0.703 | +0.284 | +0.261 |\n| car | +0.425 | +1.293 | +1.122 | +0.988 | +0.393 | +1.124 |\n| hit | +0.400 | +0.503 | +0.739 | +0.846 | +0.363 | +0.798 |\n| the | +0.193 | +0.184 | +0.265 | +0.339 | +0.173 | +0.283 |\n| wall | +0.415 | +0.730 | +1.047 | +0.978 | +0.382 | +1.473 |\n\nStep 1 · Selected Query Dot Keys\n\n`blue` query · `The` key = `+0.307`\n\n`blue` query · `blue` key = `+0.384`\n\n`blue` query · `car` key = `+2.270`\n\n`blue` query · `hit` key = `+0.703`\n\n`blue` query · `the` key = `+0.284`\n\n`blue` query · `wall` key = `+0.261`\n\nStep 2 · Softmax To Attention Weights\n\n`softmax([+0.307, +0.384, +2.270, +0.703, +0.284, +0.261])`\n\n`7.9%`, `8.6%`, `56.4%`, `11.8%`, `7.7%`, `7.6%`\n\nRow sum: `7.9 + 8.6 + 56.4 + 11.8 + 7.7 + 7.6 = 100.0%`\n\nStep 3 · Weighted Value Mix\n\nAttention decides which value vectors get mixed into this token's next representation.\n\n`output[1] = sum_i weights[1,i] * V[i]`\n\n`head_output_blue = [+0.587, +0.970, +0.680]`\n\nHighest attention 
target: `car` (56.4%).\n\nImportant takeaway\n\nMulti-head attention gives the model several ways to route information at the same time. Grouped-query attention is a practical modern variant that makes this cheaper during inference.\n\nInterpretation caveat\n\nAttention heads are not little thought modules. They are learned projections that may specialize, overlap, or behave in ways that are hard to summarize cleanly.\n\nChapter 10\n\nAttention needs to know order. RoPE gives position information directly to the attention mechanism.\n\nAttention compares tokens by content. But language also depends on order. \"Dog bites man\" and \"man bites dog\" contain the same words, but they do not mean the same thing.\n\nOlder transformer explanations often describe positional encodings as vectors added to token embeddings. That works, but many modern decoder-only models use something more integrated with attention: RoPE, or Rotary Positional Embeddings.\n\nRoPE rotates parts of the query and key vectors based on their token positions. When attention compares a query with a key, the comparison should depend on both content and relative position.\n\nBecause RoPE modifies Q and K, it changes the attention scores. It does not directly rotate the value vectors, and it does not decide attention by itself. It changes which query/key pairs line up well.\n\nRelative position through rotation\n\n**Problem.** Attention sees tokens, but it also needs word order. `dog bites man` and `man bites dog` contain the same words, but positions change meaning.\n\n**Naive idea.** One option is to add a position vector to each token. RoPE does something different.\n\n**RoPE idea.** RoPE makes attention position-aware by rotating `Q` and `K` vectors according to token position before their dot product is computed. It does not rotate `V`.\n\nWord Order Matters\n\n`dog bites man` **is not the same as** `man bites dog`\n\nSame tokens, different positions. 
RoPE makes `Q·K` sensitive to that position change.\n\nSame Token, Different Position\n\nExample sentence: `The small dog chased the ball.`\n\nIn this visual, clicking a word temporarily treats that word as relative index `0`. RoPE is relative in this sense: if you look from a different token, the position offsets change, so the rotations you compare change too.\n\nClick any token to make it the reference frame. That token stays unrotated while all other tokens rotate relative to it.\n\nRelative offset insight\n\nThe selected token `The` is the anchor. Other tokens rotate by their position difference to this anchor. In the dot product, the important angle is `theta_m - theta_n`, so compatibility depends on relative offset `m - n`.\n\nIn this toy pair, `dot(before rotation) = +0.734` and `dot(after RoPE rotation) = -0.157`. As positions change, relative angle changes, and the query-key dot product changes too.\n\nMulti-frequency pairs\n\nReal vectors have many dimension pairs. Different pairs rotate at different speeds: fast pairs capture nearby offsets, while slow pairs preserve longer-range position patterns.\n\nConnect back to attention\n\nRoPE changes the score matrix before softmax. It does not directly decide attention by itself; it changes which `Q/K` pairs are compatible at different relative positions. RoPE gives attention a position-dependent bias, and the model still has to learn how to use it.\n\nImportant takeaway\n\nRoPE injects position into attention by rotating query and key vectors. It helps the model reason about relative position while computing attention.\n\nCompatibility, not payload\n\nRoPE affects compatibility, not payload. Q and K are rotated; V is not.\n\nChapter 11\n\nThis is where the pieces become the repeated structure of the model.\n\nA transformer is built by stacking blocks. 
Each block takes in a sequence of token vectors and returns a sequence of token vectors with the same basic shape. The rows still correspond to token positions. The width is still the model dimension.\n\nWhat changes is the information inside those vectors.\n\nA modern decoder block usually normalizes the input, applies attention so tokens can exchange information, adds the result back through a residual connection, normalizes again, applies a feed-forward network, and adds that result back too.\n\nThe residual stream is the running representation that moves through the network. Attention mixes information between positions. The feed-forward network transforms each position. Normalization helps keep values stable. Residual connections preserve a path for information and gradients through many layers.\n\nLayer by layer, the initially context-free embeddings become rich context-aware representations.\n\nOne modern decoder block, end-to-end\n\nThis chapter combines what we learned into one full transformer block: normalization, multi-headed attention, residual paths, and a feed-forward network.\n\nLet's look at an actual example of how all these elements are combined to build one Transformer block in a modern decoder-only model.\n\nClick any block part to inspect its role, input/output dimensions, and jump back to the chapter where that part was introduced in detail.\n\nHow This Scales In A Full Model\n\nOne block is rarely used alone. Decoder-only Transformers repeat this block many times before the final output\nprojection over the vocabulary. 
In a Llama-8B-style setup, this is typically around `32` stacked blocks (layers).\n\nImportant takeaway\n\nA transformer block keeps the sequence shape mostly stable while repeatedly changing what each token vector represents.\n\nModern decoder details\n\nIn Llama-like models, you also see choices such as RMSNorm, RoPE, SwiGLU-style feed-forward layers, causal attention, and grouped-query attention.\n\nChapter 12\n\nTraining is not magic. It is many small prediction errors turned into parameter updates.\n\nFrom the outside, training often looks like one smooth curve going down. Reality is messier.\n\nAt the basic level, pretraining is simple to describe: show the model a lot of text and train it to predict the next token. It makes a prediction, measures the loss, computes gradients, and updates parameters.\n\nRepeat that billions or trillions of times, and the model slowly becomes better at modeling text. But \"loss goes down\" is not the whole story.\n\nSome patterns are learned early. Others appear much later. A model can improve on training data before it generalizes well. Sometimes better generalization arrives surprisingly late.\n\nFor large language models, training is also a scaling problem. 
Model size, dataset size, data quality, sequence length, optimizer settings, batch size, and compute budget all interact.\n\nHow behavior changes across training\n\nTraining is often staged, not perfectly smooth: fast fitting first, slower consolidation, and sometimes delayed generalization.\n\nThis chart is an illustrative curve, not a claim about one exact production run.\n\nToy training curve (loss vs optimization steps)\n\nAuto-detected phase summary\n\n**Train:** Training loss falls quickly.\n\n**Validation:** Validation improves a bit, then slows.\n\nModel memorizes useful local patterns first.\n\nWhat is being learned in this phase\n\nIn large-scale pre-training, the model is mostly learning broad structure: world knowledge, language regularities, code patterns, and reasoning traces from text continuation.\n\nThis is why early improvements can look mostly statistical, while later improvements reflect better internal representations. The model is not yet being optimized for assistant behavior such as refusal style or helpful tone.\n\nWhere alignment and safety enter\n\nAlignment behavior is primarily shaped after pre-training. Post-training adds objectives such as following instructions, refusing unsafe requests, formatting answers clearly, asking clarifying questions, and staying helpful.\n\nSo this chapter is mostly about capability learning dynamics; the next chapter focuses on behavior shaping.\n\nImportant takeaway\n\nPretraining teaches broad capability through next-token prediction. The loss curve is a useful signal, but it is only one view of what the model is learning.\n\nLoss is not the whole story\n\nA lower loss generally means better prediction. It does not automatically mean better reasoning, better honesty, or better assistant behavior.\n\nChapter 13\n\nPretraining gives the model capability. Post-training shapes how that capability behaves.\n\nA pretrained language model has learned a huge amount about text. 
It can continue patterns, imitate styles, answer some questions, write code, and represent many facts and concepts.\n\nBut that does not automatically make it a good assistant. A base model is trained to predict likely next tokens. If you ask it a question, it might answer, but it might also continue the prompt, imitate a webpage, produce messy completions, or behave inconsistently.\n\nPost-training teaches the model how we want it to respond. Instruction tuning shows the model examples of prompts and good task-oriented answers. Preference tuning compares possible answers and trains the model toward the ones people prefer: clearer, safer, more useful, better formatted, less rambling.\n\nDifferent systems use different methods: supervised fine-tuning, RLHF, DPO, constitutional approaches, and many variations. The details differ, but the high-level goal is the same.\n\nFrom capability to assistant behavior\n\nPre-training creates broad capability; post-training shapes behavior. The same underlying model can respond very differently depending on which training stage it has gone through.\n\nIn practice, we can think of this as: pre-training learns *knowledge and patterns*, while post-training\nlearns *assistant behavior*.\n\nCapability vs Behavior\n\nPre-training\n\nworld knowledge, language, code, reasoning patterns\n\nPost-training\n\nfollows instructions, refuses unsafe requests, formats answers, asks clarifying questions, uses a helpful tone\n\nThree-stage pipeline\n\n**Objective:** Predict next token over large text/code corpora.\n\n**Signal:** Web, books, code, and other broad unlabeled text.\n\nKey message: pre-training gives broad latent capability, while instruction and preference tuning mostly steer behavior, format, and alignment.\n\nAlignment and safety are not one switch; they are reinforced through multiple post-training signals, evaluations, and policy constraints.\n\nExample prompt:\n\n`Explain why the sky is blue.`\n\n`Sunlight passes through the 
atmosphere and shorter blue wavelengths scatter more than longer wavelengths. This process is called Rayleigh scattering and makes the sky appear blue from most viewing angles.`\n\nNot every model is trained with RLHF-style preference optimization. Some models stop at supervised instruction tuning, while others add direct preference objectives.\n\nThe goal is to make outputs more helpful, safer, and better aligned with human expectations when multiple answers are all technically plausible.\n\nIn short: pre-training teaches what the model *can* say, while preference tuning helps steer what it *should* say in assistant contexts.\n\nHow RLHF-Style Preference Tuning Works\n\nStep 1 · Candidate answers\n\nFor one prompt, generate multiple candidate responses from the current model.\n\nStep 2 · Pairwise ranking\n\nHuman raters (or policy-based systems) compare answers in pairs and choose the better one. Example: `A > B` for helpfulness and safety.\n\nStep 3 · Preference objective\n\nTrain a preference signal from those comparisons, then optimize the model so preferred responses become more likely.\n\nMini pairwise example\n\n**Prompt:** `How can I recover a deleted file?`\n\n**Answer A:** Gives clear, cautious, platform-specific recovery steps.\n\n**Answer B:** Vague and omits safety checks.\n\n**Ranking:** `A > B` (more useful and safer).\n\nImportant takeaway\n\nPretraining mostly teaches what the model can do. Post-training strongly influences how, when, and in what style the model does it.\n\nAssistant behavior\n\nA post-trained assistant is not just a base model with more facts. It is a base model whose behavior has been shaped toward following instructions and user preferences.\n\nChapter 14\n\nGenerating text one token at a time would be painfully wasteful without caching.\n\nDecoder-only language models generate text autoregressively: one token at a time. Each new token depends on the tokens before it. 
So after generating a token, the model appends it to the context and runs another step to predict the next one.\n\nNaively, this would repeat a lot of work. If the prompt has already been processed, why recompute the same keys and values for all earlier tokens again and again?\n\nThe KV cache solves that. During attention, the model computes key and value vectors for each token. These are exactly the things future tokens need when they attend back to previous context. So the model stores them.\n\nDuring generation, each new token only needs to compute its own query, key, and value vectors and attend to the cached keys and values of the previous tokens. The cache saves compute, but it uses memory. The longer the context, the larger the KV cache becomes.\n\nIt helps to separate two phases: prefill processes the prompt and builds the initial cache; decode generates new tokens one by one while reusing the cache.\n\nCompute-memory tradeoff during inference\n\nDecoding is autoregressive: each new token is generated after all previous tokens. KV cache changes the cost by reusing key/value tensors from earlier steps instead of recomputing them every time.\n\nCompute reduction from caching\n\nWithout cache\n\nAt each step, recompute attention keys/values for the full seen sequence.\n\nRelative compute: `668,619`\n\nMemory behavior: lower KV storage, higher repeated compute.\n\nWith cache\n\nReuse stored K/V from previous tokens; compute only for the new token each step.\n\nRelative compute: `10,644`\n\nEstimated KV memory: `34.1 MB` for `10,644` seen tokens.\n\nThese values are illustrative relative estimates. Exact memory and speed depend on architecture, precision, head counts, and runtime implementation.\n\nImportant takeaway\n\nThe KV cache is not a summary of the conversation. 
It is stored attention data that avoids recomputing previous keys and values during generation.\n\nSpeed vs memory\n\nKV cache speeds up repeated attention over previous tokens, but it increases memory use as the context grows.\n\nChapter 15\n\nBig models are often limited by memory. Quantization makes them smaller by storing numbers with fewer bits.\n\nNeural networks are mostly numbers. A large language model contains billions of weights, and during inference it also creates intermediate activations and KV-cache tensors. Storing all of that at high precision takes a lot of memory.\n\nQuantization reduces that memory pressure by representing numbers with fewer bits. Instead of storing a weight as a 16-bit or 32-bit floating-point value, we may store an approximation using 8 bits, 4 bits, or another compact format.\n\nThe basic trade-off is simple: less precision -> less memory -> often faster or cheaper inference -> some approximation error.\n\nBut \"4-bit\" or \"8-bit\" is not the whole story. Different quantization methods make different choices. Some quantize only weights. Some also quantize activations. Some protect outlier channels. Some target the KV cache.\n\nThis is why two 4-bit models can behave differently. For local inference, quantization can be the difference between a model that does not fit in memory and a model that runs comfortably.\n\nBit-width vs quality and memory\n\nQuantization stores model weights with fewer bits. 
The goal is to reduce memory and make local inference more practical, while accepting a small quality trade-off.\n\nQuantization selector\n\n**FP32:** Maximum precision, largest memory footprint.\n\n**Bits per value:** `32` bits\n\nStored directly as floating-point values.\n\nWeight Matrix (FP32)\n\nQuantized values at selected precision\n\n| +0.18371234 | -1.20491236 | +0.00712091 | +2.91823411 | -0.55291337 |\n| +0.44204588 | -0.99123817 | +1.33100214 | -0.22345518 | +0.07620133 |\n| +3.12019843 | -2.01444274 | +0.55193302 | -0.04721129 | +1.77231055 |\n| -0.80911403 | +2.20133044 | -1.48320182 | +0.19441726 | -0.00990127 |\n| +0.61544281 | -0.33611945 | +1.00993218 | -2.44211706 | +0.43120572 |\n\n**Unique values in this 5×5 matrix:** `25`\n\n**Value range:** `-2.44211706` to `+3.12019843`\n\n8B Model Size (Guesstimate)\n\n**FP32 baseline:** 32.0 GB\n\n**FP32 estimate:** 32.0 GB\n\n**Saved:** 0.0 GB\n\n**Reduction:** 0%\n\nTradeoff: lower precision can slightly reduce accuracy or response quality, but it is often the key enabler for running strong models locally on consumer hardware.\n\nWhy numbers still look like floats in INT8/INT4: the model stores compact integers, then runtime kernels dequantize them back to approximate floating-point values during compute.\n\nThis chapter uses simplified estimates and symmetric quantization for intuition; real runtimes also include metadata, activation precision choices, and kernel-specific optimizations.\n\nImportant takeaway\n\nQuantization is controlled approximation. It reduces memory and often improves practical inference, but the quality depends on what is quantized and how.\n\nA family of trade-offs\n\nQuantization is not one technique. It is a family of compression and inference trade-offs.\n\nClosing\n\nYou have now followed the full path through a language model.\n\nText becomes tokens. Tokens become vectors. Attention moves information between positions. Feed-forward layers rewrite each token representation. 
Transformer blocks repeat that pattern many times. The final representation is projected into logits. Softmax and sampling turn those logits into the next token.\n\nThen the new token is appended, and the process repeats.\n\nTraining teaches the model to build useful internal representations by predicting text. Post-training shapes those capabilities into assistant-like behavior. During inference, techniques like KV caching and quantization make the whole system practical enough to run at interactive speed.\n\nThis guide simplified many details on purpose. Real production LLMs include data pipelines, distributed training, specialized GPU kernels, safety systems, evaluation loops, alignment methods, serving infrastructure, and many engineering trade-offs.\n\nBut the core path is now visible:\n\nThe black box is still big, but it is no longer sealed.\n\nFinal takeaway\n\nThe model is no longer just \"AI magic\". It is a chain of transformations that can be traced and reasoned about.\n\nWhat we simplified\n\nReal models use huge datasets, distributed training, mixed precision, specialized kernels, safety systems, and many architecture-specific details.\n\nWhere to go next\n\nWatch visual explanations, read illustrated transformer walkthroughs, implement a tiny transformer, experiment with tokenizers, and compare real model configs.\n\nReferences\n\n© 2026 · Created by Roy van Rijn · Built with OpenAI Codex", "url": "https://wpnews.pro/news/the-anatomy-of-an-llm", "canonical_source": "https://www.royvanrijn.com/anatomy-of-an-llm/", "published_at": "2026-05-28 10:05:05+00:00", "updated_at": "2026-05-28 10:29:25.782641+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "natural-language-processing", "generative-ai"], "entities": ["Roy van Rijn"], "alternates": {"html": "https://wpnews.pro/news/the-anatomy-of-an-llm", "markdown": "https://wpnews.pro/news/the-anatomy-of-an-llm.md", "text": 
"https://wpnews.pro/news/the-anatomy-of-an-llm.txt", "jsonld": "https://wpnews.pro/news/the-anatomy-of-an-llm.jsonld"}}