MicroGPT and Interactive Walkthrough

Andrej Karpathy released a 200-line pure-Python implementation of GPT on February 12, 2026, designed to help developers understand large language models from first principles. The microgpt project includes a guided walkthrough with exercises and visuals, using a dataset of 32,000 names to demonstrate text generation.

microgpt: guided walkthrough

A 200-line GPT, taken apart and rebuilt in front of you.

Before we can begin evaluating and auditing AI systems, we have to understand them from first principles. On Feb 12, 2026, Andrej Karpathy (co-founder at OpenAI; helped build Tesla Autopilot) released a 200-line pure-Python program implementing the fundamental ideas behind GPT. I've taken his post and turned it into a lab with exercises and visuals to help us understand the concepts deeply rather than skim them. Karpathy's post is already well written; the goal is to augment it. The Python here is also rewritten in a slightly less compressed style: ~2XX lines instead of 200, but a bit easier to read. As always, feel free to work with the people at your table. You've got this.

karpathy.ai/microgpt.html (https://karpathy.ai/microgpt.html) · companion video on autograd: The spelled-out intro to neural networks and backpropagation, 2.5 hr (https://www.youtube.com/watch?v=VMj-3S1tku0) · makemore (https://github.com/karpathy/makemore) dataset.

The weights are loaded right here in your browser, and the same forward pass you'll dissect later in the lab runs every time you press Send.

Where to find it

 - GitHub gist with the full source code: microgpt.py
 - Also available on this web page: karpathy.ai/microgpt.html (https://karpathy.ai/microgpt.html)
 - Also available as a Google Colab notebook (https://colab.research.google.com/drive/1vyN5zo6rqUp dYNbT4Yrco66zuWCZKoN?usp=sharing); you can run it without installing anything

The following is a guide that steps an interested reader through the code.
Dataset

The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page, but for microgpt we use a simpler example of 32,000 names, one per line:

```python
import os
import random

# Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")
```

The dataset looks like this. Each name is a document:

```
emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
... (~32,000 names follow)
```

The goal of the model is to learn the patterns in the data and then generate similar new documents that share the statistical patterns within. As a preview, by the end of the script our model will generate ("hallucinate") new, plausible-sounding names. Skipping ahead, we'll get:

```
sample  1: kamon     sample  8: anna      sample 15: earan
sample  2: ann       sample  9: areli     sample 16: lenne
sample  3: karai     sample 10: kaina     sample 17: kana
sample  4: jaire     sample 11: konna     sample 18: lara
sample  5: vialan    sample 12: keylen    sample 19: alela
sample  6: karia     sample 13: liole     sample 20: anton
sample  7: yeran     sample 14: alerin
```

It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny-looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.

Tokenizer

Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (https://github.com/openai/tiktoken), used by GPT-4, operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:

```python
# Let there be a Tokenizer to translate strings to discrete symbols and back
uchars = sorted(set(''.join(docs)))  # unique characters become token ids 0..n-1
BOS = len(uchars)                    # token id for Beginning of Sequence
vocab_size = len(uchars) + 1         # total tokens, +1 for BOS
print(f"vocab size: {vocab_size}")
```

We collect all unique characters across the dataset (which are just the lowercase letters a–z), sort them, and each letter gets an id by its index. The integer values themselves carry no meaning: each token is just a discrete symbol. Instead of 0, 1, 2 they could be different emoji. We also create one special token, BOS (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with BOS on both sides: `[BOS, e, m, m, a, BOS]`. The model learns that BOS initiates a new name, and that another BOS ends it.

So we have a vocabulary of 27 (26 lowercase letters + BOS). The character "a" is the first alphabet letter, so it has id 0. What's the id of "z"? Of "BOS"? If your full name has 9 letters, how many tokens does the model see when you train on it?

Answer: "z" has id 25 (last of a–z, indices 0..25). BOS is 26 (the length of `uchars`, i.e. 26 alphabet letters). A 9-letter name produces 9 + 2 = 11 tokens: BOS, the 9 letters, then BOS again.

From a neuron to a network

Before we open up `gpt` and stare at multi-head attention, let's build up the underlying object, the neuron, and stack neurons into a network. The end goal of this section: by the time we hit the architecture diagram, every box in it will feel like an obvious composition of things we already understand.

Here's roughly where we're going. Don't worry about the details; file the picture mentally, then we'll build to it.
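Before building the neuron, the character-level tokenizer from the previous section can be exercised with a small round trip. This is a minimal sketch, not code from microgpt: the three-name `docs` list stands in for the real dataset, and the `stoi`, `encode`, and `decode` names are hypothetical helpers introduced only for illustration.

```python
# Hypothetical mini-dataset standing in for the ~32,000 names in input.txt.
docs = ["emma", "olivia", "ava"]

uchars = sorted(set("".join(docs)))  # unique characters, sorted
BOS = len(uchars)                    # special delimiter token id
vocab_size = len(uchars) + 1         # +1 for BOS

# Illustrative helpers (assumed names, not part of microgpt).
stoi = {ch: i for i, ch in enumerate(uchars)}

def encode(doc):
    """Wrap a document in BOS on both sides and map characters to ids."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    """Drop BOS tokens and map the remaining ids back to characters."""
    return "".join(uchars[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(uchars, BOS, vocab_size)
print(tokens)
print(decode(tokens))
```

With only three names the vocabulary has 7 letters plus BOS, so the ids differ from the full a–z case, but the pattern is the same: a name of n letters becomes n + 2 tokens.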
You can already play with this: drag the input sliders and watch the activations propagate.

The simplest "neuron"

One input `x`, one bias `b`, and an output `a = x + b`. That's it, just an adder. No learning yet, no bend in the output. It's a useful starting object because every more complex neuron is just this one with more parts bolted on.

```python
def neuron(x, b):
    return x + b
```

If `x = 3` and `b = -1`, what does the neuron output? What if I want this neuron to always output 0 no matter the input? What `b` would I need, and would it work for every `x`?

Answer: 3 + (−1) = 2. To force the output to 0 we'd need b = −x, which depends on x, so a single bias can't do it. That's why we'll add a weight next: it lets the neuron scale its input before the bias.

Add a weight

Multiply the input by a learned weight `w` before adding the bias: `a = x·w + b`. Now the neuron has two knobs. With both `w` and `b` the neuron can shift and scale, so it can learn any affine response (affine = scale the input, then shift it). This is the canonical "linear neuron".

```python
def neuron(x, w, b):
    return x * w + b
```

Add a nonlinearity (ReLU)

Stacking linear neurons on top of linear neurons just gives you another linear function. To learn interesting things, we need a nonlinearity (https://cs231n.github.io/neural-networks-1/#actfun). ReLU is the simplest: $f(z) = \max(0, z)$. It passes positive values through and zeros out negative ones.

```python
def relu(z):
    return max(0, z)

def neuron(x, w, b):
    z = x * w + b
    a = relu(z)
    return a
```

With `w = 2` and `b = -3`, plug in `x = 1` and `x = 4`. What does the neuron output in each case? At what value of `x` does the ReLU "turn on", i.e., where does the output stop being zero?

Answer: x = 1 → z = 1·2 − 3 = −1 → a = max(0, −1) = 0. x = 4 → z = 4·2 − 3 = 5 → a = 5. The ReLU turns on at z = 0, i.e. when x = 3/2 = 1.5. The neuron has learned a soft threshold.

Many inputs in, one output out

Real neurons take a vector of inputs.
Each input $x_i$ has its own weight $w_i$; the neuron sums them up, adds the bias, and applies ReLU:

$$ a = \mathrm{ReLU}\left(\sum_{i=1}^{n} x_i w_i + b\right) $$

```python
def neuron(x, w, b):
    # x and w are lists of length n
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(0, z)
```

The inner sum is a dot product, the fundamental operation of neural networks. In microgpt, `linear(x, w)` does this dot product once per row of `w`. Karpathy's version drops the bias `b`, as modern Transformers often do.

What does `zip` do?

Python's built-in `zip` walks through two or more lists in lockstep and hands back tuples of matching elements (one tuple per "column"), stopping when the shortest list runs out. So `for xi, wi in zip(x, w)` gives us the i-th input and the i-th weight together on each loop iteration, ready to multiply.

The dot product is then just "sum the products of each pair": $0.5{\cdot}0.4 + (-0.3){\cdot}0.7 + 1.2{\cdot}(-0.1) = 0.20 - 0.21 - 0.12 = -0.13$.

The same pattern shows up everywhere in microgpt: adding token + position embeddings (`zip(tok_emb, pos_emb)`), residual sums (`zip(x, x_residual)`), and every matrix-vector multiply inside `linear`. Anywhere you see two same-length lists walked together, `zip` is the glue.

Forward pass

In a neural network, the forward pass is the trip from inputs to a prediction. You hand the network some numbers, they flow through every layer (getting multiplied by weights, summed with biases, occasionally bent by a nonlinearity) and out the other end falls a single answer. The forward pass doesn't change the network at all; it just runs it. Every weight stays exactly where it was; only the activations move.

It's worth pausing on this before we get to backprop, because backprop is just the forward pass run in reverse. If we can't picture the forward pass clearly, the backwards version will feel like magic.

Below is a deliberately tiny network so you can wiggle every knob and watch the output respond.
Three inputs x₁, x₂, x₃ feed into two hidden ReLU neurons that join at a single ReLU output a. The three weights and three biases (w₁, w₂, w₃, b₁, b₂, b₃) are yours to play with. As you change them, the prediction surface on the right re-draws: it plots a as a height over the (x₁, x₂) plane, with x₃ swept by its slider. The forward pass is that mapping from input space to output.

Sweep x₃ with its slider to lift or fold the surface. Because every neuron has a ReLU, the surface is piecewise linear: each ReLU contributes a sharp fold. Click and drag the surface to rotate.

Forward pass:

 - h₁ = ReLU(w₁·x₁ + w₂·x₂ + b₁)
 - h₂ = ReLU(w₃·x₃ + b₂)
 - a = ReLU(h₁ + h₂ + b₃)

The same thing, in code

Here's the network we've been playing with, written out as a small class hierarchy: Neuron → Layer → MLP. This is essentially how Karpathy's micrograd packages neural networks. The `Neuron.__call__` method is doing exactly what the circles in the diagram do: weighted sum of inputs, plus bias, through a ReLU.

```python
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # forward pass: a = ReLU(w · x + b)
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0, z)

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# 3 inputs → 2 hidden neurons → 1 output (one forward pass)
x = [1.0, 0.5, -0.3]
mlp = MLP(3, [2, 1])
print(mlp(x))  # e.g. [0.42]
```

The `MLP(3, [2, 1])` above is slightly more general than the network in the diagram. In a standard MLP every input feeds every hidden neuron, so the first layer alone would have 2 × (3 weights + 1 bias) = 8 parameters. The interactive diagram uses a deliberately restricted variant (h₁ sees only x₁ and x₂, and h₂ sees only x₃) so we end up with just 3 weights and 3 biases. That's small enough that the prediction surface stays readable as you wiggle the sliders. The Neuron / Layer / MLP scaffolding is identical either way.

Here's a small batch of inputs. Using the MLP class above, write code that produces predictions for each one:

```python
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys_target = [1.0, -1.0, -1.0, 1.0]  # what we WISH the network said
ypred = ?                           # ← your job
```

Answer: `ypred = [mlp(x) for x in xs]`. With random weights you'll get whatever the freshly-initialized model says, almost certainly nothing like `ys_target`.

Bonus observation: our network's output is wrapped in a ReLU, so every prediction is ≥ 0 for every input. That means we can never match a target of −1.0 no matter what the weights are. To handle negative targets we'd need a different output activation (or none). This is a real design choice in real models: the output activation has to match the kind of answer you want.

What is loss?

Once we have predictions, the obvious question is: how wrong are we? The standard way to turn that question into a single number is a loss function. The simplest one, mean squared error (MSE), just averages the squared gap between each prediction and its target:

$$ L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

A few properties worth internalizing:

 - Loss is always ≥ 0, because squared gaps can't be negative.
 - Loss = 0 means perfect predictions: every ŷᵢ exactly hits its target yᵢ.
 - Big gaps cost much more than small ones, because they're squared. A model that's off by 2 on one example loses 4× more than one that's off by 1.
 - Loss is the only thing the optimizer cares about: every weight in the model will be nudged in whichever direction makes this single number smaller.

This is the whole game of training: find weights that minimize the loss.
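The MSE formula and its listed properties can be checked with a few lines of plain Python. This is a minimal sketch: the `mse` helper and the `ypred_hypothetical` values are made up for illustration (they are not outputs of the MLP above), chosen to be non-negative as a ReLU output would be.

```python
def mse(ypred, ytarget):
    """Mean squared error: average of the squared prediction gaps."""
    n = len(ypred)
    return sum((yp - yt) ** 2 for yp, yt in zip(ypred, ytarget)) / n

ys_target = [1.0, -1.0, -1.0, 1.0]

# Hypothetical predictions, invented for this example.
ypred_hypothetical = [0.5, 0.0, 0.0, 1.0]

loss = mse(ypred_hypothetical, ys_target)   # (0.25 + 1 + 1 + 0) / 4
perfect = mse(ys_target, ys_target)         # every gap is zero

# Squaring makes a gap of 2 cost four times as much as a gap of 1.
ratio = mse([2.0], [0.0]) / mse([1.0], [0.0])

print(loss, perfect, ratio)
```

Note that even the best possible non-negative predictions here leave a loss of 0.5, because the two targets of −1.0 can never be reached through a ReLU output.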
Scroll back to the interactive diagram and click "🎯 Train against a target surface". A hidden target network is generated, its surface is overlaid as a dark wireframe, and the live loss appears as both a number and a bar. The little chart underneath records every loss reading: as you nudge sliders, you can literally watch the line go down (or up; it's easy to make it worse). See if you can get the loss below 0.02 by hand. It's harder than it looks, and that's the whole motivation for the gradient-based training we'll build in the next section.

The weights and biases in our code are still plain Python `float`s, so we can run the model and measure the loss, but we can't yet ask "which weight should I nudge, and by how much, to reduce the loss?". To answer that, we need gradients, and that's exactly what the next section is about.

Autograd

Training a neural network requires gradients: for each parameter in the model, we need to know "if I nudge this number up a little, does the loss go up or down, and by how much?". The computation graph has many inputs (the model parameters and input tokens) but funnels down to a single scalar output: the loss. Backpropagation starts at that single output and works backwards through the graph, computing the gradient of the loss with respect to every input. It relies on the chain rule from calculus. In production, libraries like PyTorch handle this automatically. Here, we implement it from scratch in a single class called `Value`.

This is the most mathematically intense part of microgpt. Karpathy has a 2.5-hour video that builds the whole thing live: The spelled-out intro to neural networks and backpropagation (https://www.youtube.com/watch?v=VMj-3S1tku0). The walk-through below condenses the key points.

Building Value piece by piece

The same Lego mindset works here: start with a wrapper, add operators, then add the graph bookkeeping that makes backprop possible.

Try it live: see what `Value` remembers at each version of the class.
The final version of the class is what microgpt actually uses.

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

a = Value(-6.0)
b = Value(7.0)
print(a)  # Value(data=-6.0)
print(b)  # Value(data=7.0)
```

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(-6.0); b = Value(7.0); c = Value(10.0)
d = a * b + c
print(d)  # Value(data=-32.0)
```

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self._children = children  # the values that produced this one

    def __add__(self, other):
        return Value(self.data + other.data, (self, other))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other))

a = Value(2.0)
b = Value(3.0)
c = a * b  # c knows its children are (a, b)
L = c + a  # L knows its children are (c, a)
```

```python
import math

class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # forward-pass scalar
        self.grad = 0                    # dL/d(this), filled in backward pass
        self._children = children        # inputs to this node
        self._local_grads = local_grads  # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        # 1) Build reverse-topological order via DFS
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2) Seed the loss gradient, then propagate
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```

Briefly, a `Value` wraps a single scalar number (`.data`) and tracks how it was computed. Think of each operation as a little Lego block: it takes some inputs, produces an output (the forward pass), and it knows how its output would change with respect to each of its inputs (the local gradient). That's all the information autograd needs from each block. Everything else is just the chain rule, stringing the blocks together.

Every time you do math with `Value` objects (add, multiply, etc.), the result is a new `Value` that remembers its inputs (`_children`) and the local derivative of that operation (`_local_grads`).
For example, `__mul__` records that $\frac{\partial (a\cdot b)}{\partial a}=b$ and $\frac{\partial (a\cdot b)}{\partial b}=a$. The full set of Lego blocks:

| Operation | Forward | Local gradients |
|---|---|---|
| `a + b` | $a+b$ | $\partial/\partial a = 1,\; \partial/\partial b = 1$ |
| `a * b` | $a \cdot b$ | $\partial/\partial a = b,\; \partial/\partial b = a$ |
| `a ** n` | $a^n$ | $\partial/\partial a = n\,a^{n-1}$ |
| `log(a)` | $\ln a$ | $\partial/\partial a = 1/a$ |
| `exp(a)` | $e^a$ | $\partial/\partial a = e^a$ |
| `relu(a)` | $\max(0,a)$ | $\mathbf{1}_{a>0}$ |

The `backward` method walks this graph in reverse topological order (starting from the loss, ending at the parameters), applying the chain rule at each step. If the loss is $L$ and a node $v$ has a child $c$ with local gradient $\frac{\partial v}{\partial c}$, then:

$$\frac{\partial L}{\partial c} \mathrel{+}= \frac{\partial v}{\partial c}\cdot\frac{\partial L}{\partial v}$$

This looks scary if you're not comfortable with calculus, but it's literally just multiplying two numbers in an intuitive way: "If a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels 2×4 = 8 times as fast as the man." The chain rule is the same idea: you multiply the rates of change along the path.

We kick things off by setting `self.grad = 1` at the loss node, because $\frac{\partial L}{\partial L}=1$. From there, the chain rule just multiplies local gradients along every path back to the parameters.

Note the `+=` (accumulation, not assignment). When a value is used in multiple places in the graph (i.e. the graph branches), gradients flow back along each branch independently and must be summed. This is the multivariable chain rule: if $c$ contributes to $L$ through multiple paths, the total derivative is the sum of contributions from each path.
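The "sum over paths" rule can be sanity-checked numerically on plain floats. The sketch below uses a small branching expression, L = a·b + a with a = 2 and b = 3, where `a` reaches `L` along two paths. The function name `loss_fn` and the finite-difference step `h` are illustrative choices, not part of microgpt.

```python
def loss_fn(a, b):
    """L = a*b + a: the input a feeds L through c = a*b and directly."""
    c = a * b
    return c + a

a, b = 2.0, 3.0

# Chain rule by hand, one term per path from the input to L.
grad_a_via_c = b * 1.0      # (dc/da) * (dL/dc) = b * 1
grad_a_direct = 1.0         # dL/da along the direct edge
grad_a = grad_a_via_c + grad_a_direct   # accumulated, like `+=`
grad_b = a * 1.0            # b has a single path, through c

# Central finite differences as an independent check.
h = 1e-6
num_grad_a = (loss_fn(a + h, b) - loss_fn(a - h, b)) / (2 * h)
num_grad_b = (loss_fn(a, b + h) - loss_fn(a, b - h)) / (2 * h)

print(grad_a, num_grad_a)
print(grad_b, num_grad_b)
```

If the two contributions to `grad_a` were assigned rather than accumulated, one path would overwrite the other and the result would disagree with the numerical estimate.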
After `backward` completes, every `Value` in the graph has a `.grad` containing $\frac{\partial L}{\partial v}$, which tells us how the final loss would change if we nudged that value.

Watch backprop happen

Backprop is easier to internalize if you build it up. Below are four cases in increasing complexity: start with what a single + does to a gradient, then a single ×, then both with a branch, then a full training-style pipeline (input, prediction, loss). Each tab is its own little graph; step through it one click at a time. Click "Next step" to begin building the computation graph.

Here's a small neuron computing a = ReLU(x·w + b). The forward values are filled in. Try to compute the gradients with respect to x, w, and b by hand (assuming ∂L/∂a = 1). Then click "Run backward" to check. Doing this once by hand is the single best way to internalize what `backward` is doing.

This is exactly what PyTorch's `.backward()` gives you:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
L = c + a
L.backward()
print(a.grad)  # tensor(4.)
print(b.grad)  # tensor(2.)
```

This is the same algorithm that PyTorch's `loss.backward()` runs, just on scalars instead of tensors (arrays of scalars): algorithmically identical, significantly smaller and simpler, but a lot less efficient.

Let's spell out what `backward` gives us. Autograd calculated that if `L = a*b + a`, with `a=2` and `b=3`, then `a.grad = 4.0`. This is telling us about the local influence of `a` on `L`: if you wiggle `a`, in what direction is `L` changing? The derivative of `L` w.r.t. `a` is 4.0, meaning that if we increase `a` by a tiny amount (say 0.001), `L` would increase by about 4× that (0.004). Similarly, `b.grad = 2.0` means the same nudge to `b` would increase `L` by about 2× that (0.002). These gradients tell us the direction (positive or negative) and the steepness (magnitude) of each input's influence on the final output (the loss).
This lets us iteratively nudge the parameters of our neural network to lower the loss, and hence improve its predictions.

In the interactive example, the backward pass gives ∂loss/∂b = −6 and, chaining through x, ∂loss/∂w = 2·err·x = −18. Now we repeat the step. x = 3 and the target y = 10 are fixed; each step nudges the two parameters against their gradient (w ← w − lr·∂loss/∂w, with learning rate 0.02), and you watch the prediction ŷ climb toward the target while the loss shrinks. The nudge is just the gradient multiplied by the learning rate. Click "Next step" a few times; the gradients are recomputed every step.

Architecture

The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model thinks should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU.

We'll step through the model one block at a time. Each sub-section below covers one piece: first the intuition, then any small helper functions it needs, then the relevant code, the actual parameter matrices, and finally a small interactive widget showing what we've built up so far.

| Hyperparameter | Toy model (this walkthrough) | microgpt | GPT-2 small |
|---|---|---|---|
| `vocab_size` | 4 (BOS, a, b, c) | 27 (a–z + BOS) | 50,257 |
| `n_embd` (d_model) | 2 | 16 | 768 |
| `n_head` | 1 | 4 | 12 |
| `head_dim` | 2 | 4 | 64 |
| `block_size` (context length) | 4 | 16 | 1,024 |
| `n_layer` | 1 | 1 | 12 |
| rough parameter count | ~70 | 4,192 | ~124M |

To make each step concrete, we'll track a single token through the whole block using a deliberately tiny model. The vector at each stage will only have two numbers, so you can do every multiplication by hand and watch what changes.

Setup. Pretend the vocabulary is just 4 tokens (`BOS=0`, `'a'=1`, `'b'=2`, `'c'=3`) and the embedding width is `d_model = 2`, with `n_head = 1` (so `head_dim = 2`) and `block_size = 4`. We're partway through generating: the model has already seen BOS at position 0 and 'a' at position 1, and now it's processing 'b' at position 2. We want it to predict what comes at position 3.
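The rough parameter totals quoted for the toy model and for microgpt can be reproduced with a short count. This is a sketch under two assumptions that this section does not spell out: that the model has a separate output projection the same shape as `wte`, and that each layer's MLP uses a hidden width of 4 × `n_embd`. The `count_params` function and its `mlp_mult` argument are hypothetical names used only here.

```python
def count_params(vocab_size, n_embd, block_size, n_layer, mlp_mult=4):
    """Count weights in a bias-free GPT-style model (assumed layout)."""
    wte = vocab_size * n_embd          # token embedding table
    wpe = block_size * n_embd          # position embedding table
    lm_head = vocab_size * n_embd      # ASSUMED separate output projection
    attn = 4 * n_embd * n_embd         # Wq, Wk, Wv, Wo
    mlp = 2 * mlp_mult * n_embd * n_embd  # ASSUMED 4x-wide two-layer MLP
    return wte + wpe + lm_head + n_layer * (attn + mlp)

toy = count_params(vocab_size=4, n_embd=2, block_size=4, n_layer=1)
micro = count_params(vocab_size=27, n_embd=16, block_size=16, n_layer=1)

print(toy)    # 8 + 8 + 8 + 16 + 32
print(micro)  # 432 + 256 + 432 + 1024 + 2048
```

Under these assumptions the toy model comes to 72 weights and microgpt to 4,192. The same formula would not reproduce GPT-2's figure, since GPT-2 also has biases and normalization parameters and shares its token embedding with the output layer.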
Each subsection below pulls in the toy weights it needs, walks the numbers forward, and the resulting vector becomes the input to the next subsection. By the end of Output, we'll have one concrete probability distribution over the 4-token vocab.

Embeddings

The neural network can't process a raw token id like 2 directly. It only works with vectors (lists of numbers). So we associate a learned vector with each possible token, and feed that in as its neural signature. The token id and position id each look up a row from their respective embedding tables (`wte` and `wpe`). These two vectors are added together, giving the model a representation that encodes both what the token is and where it is in the sequence. Modern LLMs usually skip the position embedding and use relative positioning schemes like RoPE.

Concrete example: say our current token is 'b', which the tokenizer mapped to id 2, sitting at position 2. The lookup `wte[2]` gives a length-2 vector, and that's the x the network actually sees. Click a different letter below and you'll watch a different row of `wte` get pulled in and flow all the way through the three views (and the numeric tour at the bottom of the section).

Interactive token picker: the selected row of `wte` becomes x; `wpe[2] = [−0.10, 0.10]` gets added; that vector flows through every downstream view. The fine-grained sliders at the bottom of the section still work for off-vocabulary values.

Parameter matrices

Two learned tables: one row per token, one row per position. Hover any cell to see its value. The pattern is just random Gaussian initialisation (std = 0.08); training reshapes these into something meaningful.

Helper used here · rmsnorm

Once we've added the token and position vectors, we normalize. `rmsnorm` (Root Mean Square Normalization) rescales a vector so its values have unit root-mean-square. This keeps activations from growing or shrinking as they flow through the network, stabilizing training. It's a simpler variant of the LayerNorm used in the original GPT-2.

```python
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
```

Code in gpt

```python
tok_emb = state_dict['wte'][token_id]  # length 16
pos_emb = state_dict['wpe'][pos_id]    # length 16
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)
```

Our token is 'b' (id 2) at position 2. Pick tiny `wte` and `wpe` tables to look up from:

```python
# wte: 4 rows (one per token), each a length-2 vector
wte = [
    [0.20, 0.30],    # BOS
    [0.50, -0.10],   # 'a'
    [-0.30, 0.40],   # 'b'
    [0.10, 0.20],    # 'c'
]
# wpe: 4 rows (one per position), each a length-2 vector
wpe = [
    [0.10, -0.05],   # pos 0
    [0.05, 0.15],    # pos 1
    [-0.10, 0.10],   # pos 2
    [0.15, 0.00],    # pos 3
]
token_id, pos_id = 2, 2
tok_emb = wte[token_id]                         # → [-0.30, 0.40]
pos_emb = wpe[pos_id]                           # → [-0.10, 0.10]
x = [t + p for t, p in zip(tok_emb, pos_emb)]   # → [-0.40, 0.50]
x = rmsnorm(x)                                  # → [-0.88, 1.10]
```

Doing the RMSNorm by hand. Mean-square: $((-0.40)^2 + 0.50^2)/2 = 0.205$. Scale: $1/\sqrt{0.205 + 10^{-5}} \approx 2.209$. Multiply through: $(-0.40 \cdot 2.209,\; 0.50 \cdot 2.209) \approx (-0.88,\, 1.10)$.

That two-number vector $x \approx (-0.88,\, 1.10)$ is what the attention block sees next.

How this scales up:

 - In microgpt (`n_embd = 16`): `wte` is 27 × 16 and `wpe` is 16 × 16, so the looked-up vectors are length 16 instead of 2. It is the same two lines of code, just longer lists. RMSNorm averages 16 squared values instead of 2.
 - In GPT-2 small (`n_embd = 768`): each row is a 768-dim vector, and the vocabulary jumps to 50,257 tokens, so `wte` alone is ≈ 39M parameters.
 - GPT-3 175B: `n_embd = 12,288` and the context window stretches to 2,048 positions; modern frontier models push past 100K positions and skip `wpe` entirely in favor of relative position schemes like RoPE that rotate the Q/K vectors inside attention instead of adding a position vector here.

RMSNorm rescales the x + wpe vector to unit root-mean-square, so the activations don't blow up as they flow into Q/K/V. The same picker drives this view: try BOS / 'a' / 'b' / 'c' and watch the normalized vector update.
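The hand calculation for token 'b' at position 2 can be double-checked by running the same steps on plain floats. This is a minimal sketch that reuses the toy `wte` and `wpe` values from the walkthrough; the names `x_sum`, `x_norm`, and `rms` are introduced here only for the check.

```python
def rmsnorm(x):
    """Rescale x so its root-mean-square is (almost exactly) 1."""
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

# Toy embedding tables from the walkthrough (rows: BOS, 'a', 'b', 'c').
wte = [[0.20, 0.30], [0.50, -0.10], [-0.30, 0.40], [0.10, 0.20]]
# Rows: positions 0..3.
wpe = [[0.10, -0.05], [0.05, 0.15], [-0.10, 0.10], [0.15, 0.00]]

token_id, pos_id = 2, 2
x_sum = [t + p for t, p in zip(wte[token_id], wpe[pos_id])]
x_norm = rmsnorm(x_sum)

# Root-mean-square of the normalized vector; should be very close to 1.
rms = (sum(v * v for v in x_norm) / len(x_norm)) ** 0.5

print([round(v, 2) for v in x_sum])
print([round(v, 2) for v in x_norm])
print(round(rms, 4))
```

The normalized vector keeps the direction of the summed embedding and only changes its length, which is why the ratio between the two components stays at −0.40 : 0.50.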
Attention block

The attention block is the only place where a token at position $t$ gets to "look" at tokens at positions $0 \ldots t-1$. It's a token-communication mechanism. Before we dive into the code, here's the intuition that makes the rest of this section click.

"Attention, Please": A Visual Guide To The Attention Mechanism (https://codecompass00.substack.com/p/visual-guide-attention-mechanism-transformers) by CodeCompass is recommended reading if you want the same ideas in a different voice.

Intuition · attention is a fuzzy dictionary

Here is what the attention equation looks like. Don't get intimidated; we're going to break each piece down.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Attention is a "learnable", "fuzzy" version of a key-value store, the same data structure you know as a Python `dict` or a hashtable.

Attention generalizes this to a non-binary lookup. Instead of matching the query to exactly one key, the query is compared to every key, each match gets a similarity score, and the output is a weighted blend of all the values, where keys with higher scores contribute more. Critically, the queries, keys, and values are D-dimensional learned vectors (computed by `Wq`, `Wk`, `Wv` from the input), so the model gets to decide what "matching" means.

Why softmax? Raw dot-product scores can be any real number. Softmax squashes them into the range [0, 1] and forces them to sum to 1, like a well-behaved probability distribution, so the output really is a weighted average, not just a weighted sum that could explode.

What does attention do?

Attention is applied to the input sequence and generates weights for what is of importance to each query. Those weights then "pick" the relevant information and pass it on to the next layer. To make this concrete, take the sentence "The quick brown fox jumps over the lazy dog."
Click any word below to see where its attention goes: every other word in the sentence gets a similarity score against your chosen query word, and the bar chart shows the resulting weights.

The selected word is the query. Its query vector (computed by `Wq`) is dot-producted with every word's key vector (computed by `Wk`) to get raw scores; softmax turns those into the attention weights you see below. The highest-weighted word is what this query is "looking at." Numbers are illustrative; a real trained model would produce its own pattern.

Helpers used here · linear and softmax

`linear` is a matrix-vector multiply. It takes a vector `x` and a weight matrix `w`, and computes one dot product per row of `w`. It shows up four times in this block (once each for Q, K, V, and the output projection Wo) and is the fundamental building block of neural networks: a learned linear transformation.

```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
```

`softmax` converts a vector of raw scores, which can range from $-\infty$ to $+\infty$, into a probability distribution: all values end up in $[0,1]$ and sum to 1. Inside attention we use it to turn the Q·K scores into weights that sum to 1; later, the same helper turns the model's output logits into a distribution over the vocabulary. We subtract the max first for numerical stability (mathematically a no-op, but it prevents overflow in `exp`).

```python
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Now that both helpers are on the table, let's walk through the whole attention block with concrete numbers before opening up the interactive widgets. The widgets below are just visualizations of the operations that follow. Once you've seen the math run end-to-end on real values, each widget will feel like a labeled view of a step you've already done by hand.
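The claim that subtracting the max is "mathematically a no-op, but prevents overflow" can be seen directly on plain floats. The sketch below is not microgpt's `softmax` (which operates on `Value` objects); `softmax_naive` and `softmax_stable` are hypothetical float-only variants written for this comparison.

```python
import math

def softmax_naive(scores):
    """Softmax without the max shift; overflows for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(scores):
    """Softmax with the max subtracted first, as in the helper above."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1001.0, 1002.0, 1003.0]   # same gaps, shifted by 1000

p_small = softmax_stable(small)
p_large = softmax_stable(large)    # identical to p_small: shift-invariant

try:
    softmax_naive(large)           # math.exp(1001.0) is out of float range
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print([round(p, 3) for p in p_small])
print([round(p, 3) for p in p_large])
print(naive_overflowed)
```

Only the differences between scores matter to softmax, so shifting every score by the same constant leaves the probabilities unchanged while keeping every exponent at or below zero.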
We pick up where the Embeddings walkthrough left off: token 'b' at position 2, with x ≈ -0.88, 1.10 already in hand. The embedding step handed us x ≈ -0.88, 1.10 . We stash it as the residual and re-normalize before projecting the second RMSNorm on an already-normalized vector is nearly a no-op — scale ≈ 1.00 — so the input to the projections is still -0.88, 1.10 . x residual = x -0.88, 1.10 x = rmsnorm x ≈ -0.88, 1.10 Toy Q/K/V/Wo weight matrices, each 2 × 2 attn wq = 0.50, 0.20 , 0.10, 0.40 attn wk = 0.30, -0.10 , 0.20, 0.50 attn wv = 0.40, 0.10 , -0.20, 0.60 attn wo = 0.60, 0.20 , 0.10, 0.70 q = linear x, attn wq → -0.22, 0.35 k = linear x, attn wk → -0.37, 0.37 v = linear x, attn wv → -0.24, 0.84 Why those numbers? Each row of the weight matrix is a dot product with x . For q : row 0 gives $0.50 -0.88 + 0.20 1.10 = -0.22$; row 1 gives $0.10 -0.88 + 0.40 1.10 = 0.35$. Same shape for k and v . KV cache. Positions 0 and 1 have already been processed on earlier calls, so the cache holds: keys 0 = 0.30, 0.10 , k from BOS at pos 0 -0.10, 0.40 , k from 'a' at pos 1 -0.37, 0.37 k from 'b' at pos 2 just appended values 0 = 0.20, -0.30 , v from BOS 0.50, 0.20 , v from 'a' -0.24, 0.84 v from 'b' Why keys 0 instead of just keys? Each Transformer layer keeps its own separate KV cache — the keys and values learned at layer 0 mean different things than at layer 1. So keys and values are lists of lists : the outer index is the layer number , the inner index is the position in the sequence . keys 0 is "the running list of every k vector layer 0 has produced so far," and keys 0 2 is "the key for position 2 at layer 0." Our toy has n layer = 1 , so keys 0 is the only list around — but the indexing convention stays the same. If we bumped n layer to 6, you'd see keys 0 , keys 1 , … through keys 5 , one cache per layer. Scores → softmax weights. 
Dot each cached key with our query, divide by $\sqrt{d_{\text{head}}} = \sqrt{2} \approx 1.41$:

```python
scores = [(q[0]*k[0] + q[1]*k[1]) / 1.41 for k in keys[0]]
# pos 0: (-0.22·0.30  + 0.35·0.10) / 1.41 = -0.031/1.41 ≈ -0.02
# pos 1: (-0.22·-0.10 + 0.35·0.40) / 1.41 =  0.162/1.41 ≈  0.11
# pos 2: (-0.22·-0.37 + 0.35·0.37) / 1.41 =  0.211/1.41 ≈  0.15
weights = softmax(scores)  # ≈ [0.30, 0.34, 0.36]
```

The three weights sum to 1. Notice that 'b' attends most to itself (0.36), then to 'a' (0.34), then to BOS (0.30). The differences are small because our toy weights are tiny and random; a trained network would learn much sharper patterns.

Weighted sum of values, then mix through Wo, then residual.

```python
head_out = [sum(weights[t] * v[j] for t, v in enumerate(values[0])) for j in range(2)]
# head_out ≈ [0.30·0.20  + 0.34·0.50 + 0.36·-0.24,
#             0.30·-0.30 + 0.34·0.20 + 0.36·0.84] ≈ [0.14, 0.28]
x_attn = linear(head_out, attn_wo)               # ≈ [0.14, 0.21]
x = [a + b for a, b in zip(x_attn, x_residual)]  # ≈ [-0.74, 1.31]
```

Why each of those three lines is there.

Weighted sum of V. This is the actual "lookup" of the fuzzy dictionary. The weights answered how much each past position matters; the values say what each one contributes. Multiplying them and summing gives a single vector that's a blended pull from every cached value, weighted by relevance. If one weight were 1.0 and the rest were 0, we'd get back exactly that value, like a normal dict lookup. With soft weights, we get a mix.

Project through Wo. The weighted sum lives in value-space, not in the residual stream's space. Wo is a learned linear layer that re-mixes the head output back into the same shape as x. In multi-head attention each head's slice gets concatenated first, then Wo blends across the heads, giving the model a place to learn how different heads should be combined. In our toy with one head it just applies a 2 × 2 linear map to the 2-vector, but the role is the same.

Add the residual. Instead of replacing x with x_attn, we add: x ← x + x_attn. Two big wins.
1 The original information survives — attention is an update , not an overwrite. 2 During backprop, gradients flow directly through this addition path back to earlier layers, which is what makes deep stacks of these blocks trainable at all. If attention has nothing useful to say for this token, it can output zero and the residual just passes x through unchanged. The vector handed to the MLP block is x ≈ -0.74, 1.31 . The attention block has done one thing: blended a little bit of every past position into the current one, projected the result back into the residual stream's shape, and added it on as an update. ▶ Play for the full animation or Step to advance one phase at a time. Phases: 1 compute q , k , v for the current token, 2 append k and v to the per-layer caches, 3 score the query against each cached key, 4 softmax → weights, 5 weighted sum of cached values, 6 output. Attention playground · drag the query, watch the block recompute Same diagram as "Attention, step by step" above, but now the query vector q is on sliders . The KV cache 3 past tokens stays pinned to the toy walkthrough; everything downstream — scaled-dot-product scores, softmax weights, weighted sum of values, head output — recomputes live as you drag. Start at the defaults q ≈ −0.22, 0.35 , the toy 'b' values and move the sliders to see how a different query reshapes the whole attention output. lm head → softmax Shortcut visualization. In the real model the head output goes through Wo , gets added to the residual, runs the MLP block, and only then does lm head + softmax produce next-letter probabilities. We're skipping those layers and projecting the head output directly through lm head so you can see how moving the query changes which letter the model "leans toward." It's a directional signal, not the model's real prediction. 
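The hand-worked attention step above can be replayed in plain Python as a sanity check. This is a sketch under a few assumptions: it uses ordinary floats instead of microgpt's `Value` objects, it treats the pre-attention rmsnorm as a no-op (as the walkthrough does), and the variable name `x_out` is introduced here only to keep the input and output apart.

```python
import math

def linear(x, w):
    # One dot product per row of w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2 x 2 weights copied from the walkthrough.
attn_wq = [[0.50, 0.20], [0.10, 0.40]]
attn_wk = [[0.30, -0.10], [0.20, 0.50]]
attn_wv = [[0.40, 0.10], [-0.20, 0.60]]
attn_wo = [[0.60, 0.20], [0.10, 0.70]]

x_residual = [-0.88, 1.10]   # token 'b' at position 2
x = x_residual               # assumption: rmsnorm skipped, scale is about 1.00

q = linear(x, attn_wq)
k = linear(x, attn_wk)
v = linear(x, attn_wv)

# Layer-0 cache already holds BOS (pos 0) and 'a' (pos 1); append 'b'.
keys = [[[0.30, 0.10], [-0.10, 0.40]]]
values = [[[0.20, -0.30], [0.50, 0.20]]]
keys[0].append(k)
values[0].append(v)

head_dim = 2
scores = [sum(qj * kj for qj, kj in zip(q, k_t)) / math.sqrt(head_dim) for k_t in keys[0]]
weights = softmax(scores)
head_out = [sum(w * v_t[j] for w, v_t in zip(weights, values[0])) for j in range(head_dim)]
x_attn = linear(head_out, attn_wo)
x_out = [a + b for a, b in zip(x_attn, x_residual)]

print("weights:", [round(w, 2) for w in weights])
print("x after attention:", [round(xi, 2) for xi in x_out])
```

Rounded to two decimals, the weights and the final vector agree with the walkthrough's 0.30, 0.34, 0.36 and x ≈ -0.74, 1.31; intermediate values can differ in the last digit because the walkthrough rounds at every step.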
Snapping attention into the running diagram

We started this section with Embeddings only, added the pre-attention rmsnorm, and just walked through the full attention computation step by step. Time to slot that attention block back into the architecture diagram we've been building piece by piece. The widget below adds Q/K/V projections, the attention weighted sum, Wo, and the residual add on top of the Embeddings + RMSNorm view from earlier: same token picker, same numbers, just more of the block lit up.

Parameter matrices

Four 16×16 matrices: Q/K/V are the three projections that turn the token vector into "what am I looking for / what do I contain / what do I offer", and Wo mixes the per-head outputs back together.

You might be wondering why the toy matrices below are only 2×2. Remember from the Embeddings step: each token gets embedded as a two-dimensional vector (we set d_model = 2 for the walkthrough). The Q/K/V projections map a length-2 vector to another length-2 vector, so the weight matrix is out × in = 2 × 2 = 4 numbers. In real microgpt d_model = 16, so each of these matrices grows to 16 × 16 = 256 numbers. The shape of the operation is the same, just bigger.

Code in gpt

```python
x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])
keys[li].append(k); values[li].append(v)
# ... heads loop: scores → softmax → weighted V → concat ...
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]   # residual
```

In microgpt (n_embd = 16, n_head = 4): Q/K/V are 16 × 16 and they get sliced into 4 heads of head_dim = 4 each. The same Q·K/√d · softmax · weighted-V dance runs per head on a 4-dim slice, the four outputs are concatenated back to length 16, and Wo mixes them. The "shape" of the math doesn't change, just the dimensions.
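The per-head slicing described above can be sketched as follows. This is an illustrative version on plain floats with random inputs, not a copy of microgpt's heads loop; the function name `multi_head_attention` and the random test data are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(q, keys, values, n_head):
    # q: one query vector of length n_embd
    # keys, values: cached vectors for one layer, one per past position
    n_embd = len(q)
    head_dim = n_embd // n_head
    x_attn = []
    for h in range(n_head):
        hs = h * head_dim                      # start of this head's slice
        q_h = q[hs:hs + head_dim]
        k_h = [k[hs:hs + head_dim] for k in keys]
        v_h = [v[hs:hs + head_dim] for v in values]
        scores = [sum(qj * kj for qj, kj in zip(q_h, k_t)) / math.sqrt(head_dim)
                  for k_t in k_h]
        weights = softmax(scores)
        head_out = [sum(w * v_t[j] for w, v_t in zip(weights, v_h))
                    for j in range(head_dim)]
        x_attn.extend(head_out)                # concatenate the head outputs
    return x_attn

random.seed(0)
n_embd, n_head, n_pos = 16, 4, 3
q = [random.gauss(0, 1) for _ in range(n_embd)]
keys = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(n_pos)]
values = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(n_pos)]

out = multi_head_attention(q, keys, values, n_head)
print(len(out))  # concatenated output has length n_embd
```

Each head computes its own softmax weights, so different 4-dim slices of the same token can attend to different past positions before Wo mixes them.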
In GPT-2 small (n_embd = 768, n_head = 12): each head sees a 64-dim slice, and there are 12 of them running in parallel.

In GPT-3 175B (n_embd = 12288, n_head = 96): 128-dim slices, 96 heads, all 96 looking back at thousands of cached positions. Frontier models add tricks like grouped-query attention (many query heads share the same K/V heads, shrinking the KV cache) and FlashAttention (a GPU-friendly tiling that never materialises the full attention matrix), but the per-head computation is still the four lines you just walked through.

MLP block

MLP is short for "multilayer perceptron", a two-layer feed-forward network: project up to 4× the embedding dimension, apply ReLU, project back down. This is where the model does most of its "thinking" per position. Unlike attention, this computation is fully local to time $t$. The Transformer intersperses communication (Attention) with computation (MLP).

Hover any block, layer, or row and the matching parts in all three views (including the numeric tour below) light up. Click to pin the code panel.

Parameter matrices

Up-projection then down-projection. mlp_fc1 blows the dimension up 4× to give the network room to compute, then mlp_fc2 squeezes it back down so it can be added to the residual stream.

Code in gpt

```python
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])   # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])   # 64 → 16
x = [a + b for a, b in zip(x, x_residual)]        # residual
```

The attention block handed us x ≈ [-0.74, 1.31]. Stash the residual, normalize, then up-project to 4 × d_model = 8 hidden units.

```python
x_residual = x      # [-0.74, 1.31]
x = rmsnorm(x)      # ≈ [-0.70, 1.23]

# mlp_fc1: up-projection, 8 × 2
mlp_fc1 = [[0.40, 0.10], [-0.20, 0.50], [0.30, -0.30], [0.10, 0.40],
           [-0.50, 0.20], [0.20, -0.10], [0.60, 0.30], [-0.10, -0.40]]
pre = linear(x, mlp_fc1)        # = [-0.16, 0.76, -0.58, 0.42, 0.60, -0.26, -0.05, -0.42]
x = [xi.relu() for xi in pre]   # = [0.00, 0.76, 0.00, 0.42, 0.60, 0.00, 0.00, 0.00]
```

Why most entries are zero.
ReLU = max 0, x , so anything negative gets clipped to 0. Only 3 of the 8 hidden units "fire" for this particular input. Different inputs would activate different subsets — that's how the MLP carves the input space into pieces and treats each piece differently. mlp fc2: down-projection 2 × 8 mlp fc2 = 0.10, 0.30, -0.20, 0.40, 0.00, 0.20, -0.10, 0.50 , -0.30, 0.20, 0.50, -0.10, 0.40, -0.40, 0.30, 0.10 mlp out = linear x, mlp fc2 ≈ 0.40, 0.35 x = a + b for a, b in zip mlp out, x residual ≈ -0.34, 1.66 Hand-check the down-projection. Row 0 of mlp fc2 dotted with the post-ReLU vector: $0.30 \cdot 0.76 + 0.40 \cdot 0.42 = 0.396 \approx 0.40$ the zeros contribute nothing . The MLP's contribution gets added back to the residual stream, and we exit the block with x ≈ -0.34, 1.66 . In microgpt n embd = 16 : mlp fc1 is 64 × 16 and mlp fc2 is 16 × 64 — the 4× expansion is the same; just wider vectors. The MLP holds more parameters than the attention block 2,048 vs 1,024 in microgpt , and that ratio gets worse as models grow. In GPT-2 small n embd = 768 : the hidden layer is 3,072 wide, so the MLP alone is ≈ 4.7M parameters per layer. In GPT-3 175B : hidden = 49,152, and the MLP is ≈ 60% of all parameters in the model. Frontier models also swap plain ReLU for SwiGLU a gated activation that needs three matrices instead of two and replace the dense MLP with Mixture-of-Experts — many small MLPs of which a router picks 2 per token — to grow capacity without growing per-token compute. Residual connections Both the attention and MLP blocks add their output back to their input x = a + b for ... . This lets gradients flow directly through the network and makes deeper models trainable. Output The final hidden state is projected to vocabulary size by lm head , producing one logit per token in the vocabulary. In our case, that's just 27 numbers. Higher logit = the model thinks that corresponding token is more likely to come next. lm head + softmax lm head and a softmax. 
The bottom row is the model's predicted probability distribution over the four toy vocabulary tokens BOS / 'a' / 'b' / 'c' . Pick a current token below and watch the whole pipeline — including the prediction — recompute. 2 = −0.10, 0.10 Parameter matrix One row per token in the vocabulary. The final hidden state is dot-producted with each row to produce a logit. Higher dot product → that token is judged more likely to come next. Code in gpt logits = linear x, state dict 'lm head' length 27 return logits The MLP handed us x ≈ -0.34, 1.66 . lm head has one row per vocab token; the dot product of x with row i is the logit for token i . lm head: 4 rows one per token , 2 columns d model lm head = 0.30, 0.10 , BOS -0.20, 0.40 , 'a' 0.50, -0.30 , 'b' -0.10, 0.60 'c' logits = linear x, lm head BOS: 0.30·-0.34 + 0.10·1.66 = 0.06 'a': -0.20·-0.34 + 0.40·1.66 = 0.73 'b': 0.50·-0.34 + -0.30·1.66 = -0.67 'c': -0.10·-0.34 + 0.60·1.66 = 1.03 Raw logits can be any real number. To turn them into a probability distribution we apply softmax — subtract the max for numerical stability, exponentiate, divide by the sum: probs = softmax logits = 0.17, 0.32, 0.08, 0.43 P BOS , P 'a' , P 'b' , P 'c' Reading the result. After BOS, a, b , this untrained toy model thinks the most likely next token is 'c' with probability 0.43. During training, the loss for this position would be $-\log p \text{target} $ — if the true next token were BOS end of word , the loss is $-\log 0.17 \approx 1.77$. Backprop would then tweak every weight we've used along the way to push P BOS up and the others down for next time. In microgpt n embd = 16, vocab size = 27 : lm head is 27 × 16 , so the model outputs 27 logits — one per a–z plus BOS. The softmax over 27 categories is cheap. 
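The toy output step above (lm_head, softmax, loss) can be checked numerically with a short script. This is a sketch on plain floats rather than `Value` objects; the `vocab` list and the choice of BOS as the target are taken from the worked example, and everything else is ordinary Python.

```python
import math

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ['BOS', 'a', 'b', 'c']
lm_head = [[0.30, 0.10],    # BOS
           [-0.20, 0.40],   # 'a'
           [0.50, -0.30],   # 'b'
           [-0.10, 0.60]]   # 'c'

x = [-0.34, 1.66]                   # vector handed over by the MLP block
logits = linear(x, lm_head)         # one logit per vocabulary token
probs = softmax(logits)

target = vocab.index('BOS')         # suppose the true next token is BOS
loss = -math.log(probs[target])     # cross-entropy at this position

for tok, p in zip(vocab, probs):
    print(f"P({tok}) = {p:.2f}")
print(f"loss = {loss:.2f}")
```

The loss printed here comes out near 1.80 rather than 1.77, because the text takes the logarithm of the rounded probability 0.17 while the script uses the unrounded value of about 0.165.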
In GPT-2 small n embd = 768, vocab = 50,257 : the final matrix is ≈ 39M parameters and the softmax has to normalize across 50K categories — and during training that softmax is computed at every position in every sequence in the batch, which is a non-trivial fraction of total training compute. In GPT-4 / frontier models , vocabularies sit around 100K–200K tokens and the lm head is typically tied to wte same matrix used for both input embedding and output projection , saving a copy of those millions of parameters. The temperature / top-p tricks you see at inference all live downstream of this same logit vector. Parameters You've seen every parameter matrix in the architecture walkthrough above — wte , wpe , attn wq/wk/wv/wo , mlp fc1/fc2 , lm head . The Parameters section is just the bookkeeping: allocate them all in one place, store them in a single dictionary the optimizer can iterate over, and count the total. n embd = 16; n head = 4; n layer = 1; block size = 16 head dim = n embd // n head matrix = lambda nout, nin, std=0.08: \ Value random.gauss 0, std for in range nin for in range nout state dict = { 'wte': matrix vocab size, n embd , 27 × 16 → 432 'wpe': matrix block size, n embd , 16 × 16 → 256 'lm head': matrix vocab size, n embd , 27 × 16 → 432 } for i in range n layer : state dict f'layer{i}.attn wq' = matrix n embd, n embd 256 state dict f'layer{i}.attn wk' = matrix n embd, n embd 256 state dict f'layer{i}.attn wv' = matrix n embd, n embd 256 state dict f'layer{i}.attn wo' = matrix n embd, n embd 256 state dict f'layer{i}.mlp fc1' = matrix 4 n embd, n embd 1,024 state dict f'layer{i}.mlp fc2' = matrix n embd, 4 n embd 1,024 params = p for mat in state dict.values for row in mat for p in row print f"num params: {len params }" → 4192 Why bother with the flat params list? Because the optimizer doesn't care about the matrices — it just needs a single list of scalars to loop over and update. params is that list. 
GPT-2 had 1.6 billion entries in this list; modern LLMs have hundreds of billions.

Suppose we bumped n_embd from 16 to 32 (everything else unchanged). Which matrices would grow, and by how much (4×? 2×? something else)? Roughly what's the new total parameter count?

Show answer

wte, wpe, and lm_head are linear in n_embd (2×). The attention matrices (attn_wq/wk/wv/wo) and MLP matrices (mlp_fc1/fc2) are all n_embd × n_embd or 4·n_embd × n_embd, so they scale quadratically (4×). New rough total ≈ 2 × (432 + 256 + 432) + 4 × (256 + 256 + 256 + 256 + 1024 + 1024) = 2,240 + 12,288 = 14,528 params. Doubling the width more than triples the model.

Putting it all together

Now that we've walked through each piece individually, here is the full gpt function: one call processes one token and returns 27 logits over the vocabulary. Read top to bottom: embeddings → for each layer (attention block → MLP block) → final linear.

Now the model itself. Hover over gpt to see what it does; related lines (the three linear(...wq/wk/wv) calls, both KV-cache append calls, etc.) light up together.

The function processes one token (of id token_id) at a specific position in time (pos_id), and some context from previous iterations summarized by the activations in keys and values, known as the KV Cache.

You might notice we're using a KV cache during training, which is unusual. People typically associate the KV cache with inference only. But the KV cache is conceptually always there, even during training. In production implementations, it's just hidden inside the highly vectorized attention computation that processes all positions in the sequence simultaneously. Since microgpt processes one token at a time (no batch dimension, no parallel time steps), we build the KV cache explicitly. And unlike the typical inference setting where the cache holds detached tensors, here the cached keys and values are live Value nodes in the computation graph, so we actually backpropagate through them.

Training loop

Now we wire everything together.
The training loop repeatedly: 1 picks a document, 2 runs the model forward over its tokens, 3 computes a loss, 4 backpropagates to get gradients, and 5 updates the parameters. Here's the simplest possible version — plain stochastic gradient descent: walk every parameter slightly downhill against its gradient. p -= lr · grad walks toward the minimum p along a toy loss curve. The orange tangent is p.grad ; the red arrow on the axis is the SGD step −lr · p.grad . Whichever side of the minimum we start on, the step always points toward it. Plain SGD — the simplest possible parameter update learning rate = 0.01 num steps = 1000 for step in range num steps : Take single document, tokenize it, surround with BOS on both sides doc = docs step % len docs tokens = BOS + uchars.index ch for ch in doc + BOS n = min block size, len tokens - 1 Forward pass: build computation graph all the way to the loss keys, values = for in range n layer , for in range n layer losses = for pos id in range n : token id, target id = tokens pos id , tokens pos id + 1 logits = gpt token id, pos id, keys, values probs = softmax logits loss t = -probs target id .log losses.append loss t loss = 1 / n sum losses average over the document. May yours be low. Backward pass: gradients of loss w.r.t. all parameters loss.backward SGD update: nudge each parameter against its gradient for p in params: p.data -= learning rate p.grad p.grad = 0 print f"step {step+1:4d} / {num steps:4d} | loss {loss.data:.4f}" Tokenization Each training step picks one document and wraps it with BOS on both sides: the name "emma" becomes BOS, e, m, m, a, BOS . The model's job is to predict each next token given the tokens before it. Forward pass and loss We feed the tokens through the model one at a time, building up the KV cache as we go. At each position, the model outputs 27 logits, which we convert to probabilities via softmax. 
The loss at each position is the negative log probability of the correct next token: $-\log p \text{target} $. This is called the cross-entropy loss . Intuitively, the loss measures the degree of misprediction: how surprised the model is by what actually comes next. If the model assigns probability 1.0 to the correct token, it is not surprised at all and the loss is 0. If it assigns probability close to 0, the model is very surprised and the loss goes to $+\infty$. We average the per-position losses across the document to get a single scalar loss. Backward pass One call to loss.backward runs backpropagation through the entire computation graph, from the loss all the way back through softmax, the model, and into every parameter. After this, each parameter's .grad tells us how to change it to reduce the loss. The SGD update right after the backward pass — p.data -= learning rate p.grad — is the entire learning rule: move every parameter a small step in the direction that reduces the loss, then reset gradients to zero so the next backward pass starts fresh. From plain SGD to Adam Plain SGD works but it's slow and finicky to tune. In practice, every modern LLM is trained with Adam — an optimizer that tracks two extra buffers per parameter: m a running average of recent gradients, like momentum and v a running average of recent squared gradients, which adapts the per-parameter learning rate . The bias corrections m hat / v hat account for m and v being initialized to zero. The learning rate also decays linearly so the steps shrink as training progresses. 
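To see the Adam update in isolation, here is a minimal sketch on a single scalar parameter. The toy objective $(p - 1)^2$ and the starting point are made up for the demonstration; the hyperparameter values mirror the ones used in this lab's training loop, and plain floats stand in for `Value` objects.

```python
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
num_steps = 1000

p = 0.0   # the single parameter (hypothetical starting point)
m = 0.0   # first moment: running mean of gradients
v = 0.0   # second moment: running mean of squared gradients
first_step = None

for step in range(num_steps):
    grad = 2.0 * (p - 1.0)                         # d/dp of (p - 1)^2
    lr_t = learning_rate * (1 - step / num_steps)  # linear decay
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** (step + 1))          # bias correction: m starts at 0
    v_hat = v / (1 - beta2 ** (step + 1))          # bias correction: v starts at 0
    update = lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    if first_step is None:
        first_step = update
    p -= update

print(f"first step size: {abs(first_step):.4f}")
print(f"final p: {p:.4f}")
```

On the very first step the bias corrections give m_hat = grad and v_hat = grad², so the update has magnitude close to the learning rate no matter how large the gradient is. That scale-independence is one reason Adam tends to be less finicky to tune than plain SGD.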
Here's the same training loop with Adam swapped in:

```python
# Let there be Adam, the blessed optimizer and its buffers
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params)  # first moment buffer (running mean of grads)
v = [0.0] * len(params)  # second moment buffer (running mean of grads²)

num_steps = 1000
for step in range(num_steps):

    # Take single document, tokenize it, surround with BOS on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass: build computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # Backward pass: gradients of loss w.r.t. all parameters
    loss.backward()

    # Adam update
    lr_t = learning_rate * (1 - step / num_steps)  # linear decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
```

Over 1,000 steps the loss decreases from around 3.3 (random guessing among 27 tokens: $-\log(1/27) \approx 3.3$) down to around 2.37. Lower is better, the lowest possible is 0 (perfect predictions), so there's still room to improve, but the model is clearly learning the statistical patterns of names.

Inference

Once training is done, we can sample new names from the model.
The parameters are frozen and we just run the forward pass in a loop, feeding each generated token back as the next input: temperature = 0.5 in 0, 1 , controls "creativity" of generated text print "\n--- inference new, hallucinated names ---" for sample idx in range 20 : keys, values = for in range n layer , for in range n layer token id = BOS sample = for pos id in range block size : logits = gpt token id, pos id, keys, values probs = softmax l / temperature for l in logits token id = random.choices range vocab size , weights= p.data for p in probs 0 if token id == BOS: break sample.append uchars token id print f"sample {sample idx+1:2d}: {''.join sample }" We start each sample with the BOS token, which tells the model "begin a new name". The model produces 27 logits, we convert them to probabilities, and we randomly sample one token according to those probabilities. That token gets fed back in as the next input, and we repeat until the model produces BOS again meaning "I'm done" or we hit the maximum sequence length. The temperature parameter controls randomness. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the model's learned distribution. Lower temperatures like 0.5 here sharpen the distribution, making the model more conservative and likely to pick its top choices. A temperature approaching 0 would always pick the single most likely token greedy decoding . Higher temperatures flatten the distribution and produce more diverse but potentially less coherent output. Try it on a synthetic logit vector below. temperature before softmax controls "creativity": low T sharpens to the model's top pick; high T flattens toward uniform. — · entropy: — bits Slide the temperature from 0.1 to 2.0 and watch the entropy. At what temperature is entropy lowest? At what temperature is it highest? What's the entropy of a perfectly uniform distribution over 27 tokens and why is that the asymptote ? 
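If you prefer to explore the temperature question numerically rather than with the slider, the sketch below computes the entropy of a temperature-scaled softmax. The five-element logit vector is made up for illustration and `entropy_bits` is a hypothetical helper name; a real microgpt logit vector would have 27 entries.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    # Shannon entropy in bits; terms with p == 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.5, 0.0, -1.0]   # made-up logits, illustration only
temperatures = [0.1, 0.5, 1.0, 2.0]

entropies = []
for T in temperatures:
    probs = softmax([l / T for l in logits])   # divide logits by T before softmax
    H = entropy_bits(probs)
    entropies.append(H)
    print(f"T = {T:3.1f} | entropy = {H:.3f} bits")

uniform_27 = [1 / 27] * 27
print(f"uniform over 27 tokens: {entropy_bits(uniform_27):.3f} bits")
```

Running it with a few more temperatures of your own is a quick way to check your answer before opening the solution.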
Show answer T → 0 everything collapses onto the single most-likely token; entropy approaches 0 bits . Highest is at T → ∞ the distribution flattens toward uniform . Uniform over 27 tokens has entropy log₂ 27 ≈ 4.75 bits — the asymptote you'll see if you push T very high. Train the toy GPT, live Everything in this lab so far has shown the model running on frozen weights — either the pinned toy values from the walkthrough, or the 4,192 parameters Karpathy already trained for you. This section closes the loop: train the toy model in your browser, watch the predictions change, then chat with it. This time we train the whole model — every weight matrix updates: wte , wpe , the four attention projections, both MLP layers, and lm head . The gradient is computed numerically central differences rather than via autograd, so it's slow — a full 100-step batch takes a few seconds — but every edge in the diagram changes thickness and color as the parameters move. That's the point. Click Step ▸ to advance one example at a time and watch a single SGD step in slow motion; click Train to run 100 batch steps at once. lm head learn {a,b,c} patterns and re-query. Training data: edit the list of words below one per line . Each must be 2–4 letters over {a, b, c} . Every word becomes one training example — predict the last letter given the second-to-last letter — plus one terminal example so the model also learns to emit BOS after a word ends. Hit Train and the model auto-steps through every example across multiple epochs, logging each step in the box below. Train to step through every example across multiple epochs. Each step performs one SGD update and the diagram above redraws live. BOS , a , b , or c and press Enter to query the model. Run it All you need is Python no pip install , no dependencies . 
Grab Karpathy's script from his gist, then run it: Download Karpathy's microgpt source as train.py curl -L -o train.py https://gist.githubusercontent.com/karpathy/8627fe009c40f57531cb18360106ce95/raw/microgpt.py Train the model — about 1 minute on a laptop, no GPU required python train.py If curl isn't available you can use wget instead, or just open the gist https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95 and copy the file into train.py by hand. The script takes about 1 minute to run on Karpathy's MacBook. You'll see the loss printed at each step: train.py num docs: 32033 vocab size: 27 num params: 4192 step 1 / 1000 | loss 3.3660 step 2 / 1000 | loss 3.4243 step 3 / 1000 | loss 3.1778 step 4 / 1000 | loss 3.0664 step 5 / 1000 | loss 3.2209 step 6 / 1000 | loss 2.9452 step 7 / 1000 | loss 3.2894 step 8 / 1000 | loss 3.3245 step 9 / 1000 | loss 2.8990 step 10 / 1000 | loss 3.2229 step 11 / 1000 | loss 2.7964 step 12 / 1000 | loss 2.9345 step 13 / 1000 | loss 3.0544 ... Watch it go down from ~3.3 random toward ~2.37. The lower this number, the better the network's predictions about what token comes next in the sequence. At the end of training, the knowledge of the statistical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see again : sample 1: kamon sample 8: anna sample 15: earan sample 2: ann sample 9: areli sample 16: lenne sample 3: karai sample 10: kaina sample 17: kana sample 4: jaire sample 11: konna sample 18: lara sample 5: vialan sample 12: keylen sample 19: alela sample 6: karia sample 13: liole sample 20: anton sample 7: yeran sample 14: alerin As an alternative to running the script on your computer, you may try to run it directly on a Google Colab notebook and ask Gemini questions about it. 
Try playing with the script: try a different dataset, train for longer (increase num_steps), or increase the model size for increasingly better results.

Progression

To see the code built up piece by piece (as layers of the onion), the advised progression looks something like this:

| File | What it adds |
|---|---|
| train0.py | Bigram count table: no neural net, no gradients |
| train1.py | MLP + manual gradients (numerical & analytic) + SGD |
| train2.py | Autograd (Value class), replaces manual gradients |
| train3.py | Position embeddings + single-head attention + rmsnorm + residuals |
| train4.py | Multi-head attention + layer loop: full GPT architecture |
| train5.py | Adam optimizer: this is train.py |

Karpathy created a Gist called build_microgpt.py whose Revisions show all of these versions and the diffs between each step. Stepping through the diffs is a great way to internalize what each component does.

Real stuff

microgpt contains the complete algorithmic essence of training and running a GPT. But between this and a production LLM like ChatGPT, there is a long list of things that change. None of them alter the core algorithm and the overall layout, but they are what makes it actually work at scale. Walking through the same sections in order:

Data

Instead of 32K short names, production models train on trillions of tokens of internet text: web pages, books, code, etc. The data is deduplicated, filtered for quality, and carefully mixed across domains.

Tokenizer

Instead of single characters, production models use subword tokenizers like BPE (Byte Pair Encoding), which learn to merge frequently co-occurring character sequences into single tokens. Common words like "the" become a single token, rare words get broken into pieces. This gives a vocabulary of ~100K tokens and is much more efficient because the model sees more content per position.

Autograd

microgpt operates on scalar Value objects in pure Python.
Production systems use tensors large multi-dimensional arrays of numbers and run on GPUs/TPUs that perform billions of floating-point operations per second. Libraries like PyTorch handle autograd over tensors, and CUDA kernels like FlashAttention fuse multiple operations for speed. The math is identical, just corresponds to many scalars processed in parallel. Architecture microgpt has 4,192 parameters. GPT-4–class models have hundreds of billions. Overall it's a very similar-looking Transformer, just much wider embedding dimensions of 10,000+ and much deeper 100+ layers . Modern LLMs also incorporate a few more types of Lego blocks and change their orders around: RoPE Rotary Position Embeddings instead of learned position embeddings, GQA Grouped Query Attention to reduce KV cache size, gated linear activations instead of ReLU, Mixture of Experts MoE layers, etc. But the core structure of Attention communication and MLP computation interspersed on a residual stream is well-preserved. The picture, mapped to the code Here is the canonical Transformer block diagram you'll see in papers and textbooks — the one microgpt is a stripped-down version of. Click any block to see how it maps onto microgpt's code and which blocks microgpt drops because they're scale-time concerns : state dict entries and gpt code.Each colored block on the diagram corresponds to one or more lines of microgpt. Click one and this panel will show the code, the matching state dict entry if any , and whether microgpt simplifies or skips it. microgpt strips the diagram down to its algorithmic core. Dropout, masking, GeLU, and biases are all removed in this implementation — the model still learns, just with fewer regularizers. LayerNorm is also replaced by the simpler RMSNorm . Click any Dropout, Mask, or LayerNorm block to read why. 
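The LayerNorm-to-RMSNorm simplification mentioned above can be made concrete with a side-by-side sketch. Both functions below are illustrative plain-float versions: the `eps` value of 1e-5 and the absence of a learned gain in `rmsnorm` are assumptions about the simplified variant, and `gamma`/`beta` are the learned scale and shift that a standard LayerNorm carries.

```python
def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm: center, divide by standard deviation, then scale and shift.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    inv_std = (var + eps) ** -0.5
    return [g * (xi - mean) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

def rmsnorm(x, eps=1e-5):
    # RMSNorm: no centering and (in this sketch) no learned parameters;
    # just rescale so the mean of the squares is about 1.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

x = [-0.74, 1.31, 0.50, 2.00]   # made-up activations
ln = layernorm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
rn = rmsnorm(x)

print("layernorm:", [round(v, 3) for v in ln])
print("rmsnorm:  ", [round(v, 3) for v in rn])

# The 2-vector from the MLP walkthrough, for comparison with the text.
toy = rmsnorm([-0.74, 1.31])
print("toy rmsnorm:", [round(v, 2) for v in toy])
```

LayerNorm forces the output to zero mean and unit variance, while RMSNorm only fixes the overall scale, which saves the mean subtraction and the extra learned shift.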
Training Instead of one document per step, production training uses large batches millions of tokens per step , gradient accumulation, mixed precision float16/bfloat16 , and careful hyperparameter tuning. Training a frontier model takes thousands of GPUs running for months. Optimization microgpt uses Adam with a simple linear learning rate decay and that's about it. At scale, optimization becomes its own discipline. Models train in reduced precision bfloat16 or even fp8 and across large GPU clusters for efficiency, which introduces its own numerical challenges. The optimizer settings learning rate, weight decay, beta parameters, warmup, decay schedule must be tuned precisely, and the right values depend on model size, batch size, and dataset composition. Scaling laws e.g. Chinchilla guide how to allocate a fixed compute budget between model size and number of training tokens. Getting any of these details wrong at scale can waste millions of dollars of compute, so teams run extensive smaller-scale experiments to predict the right settings before committing to a full training run. Post-training The base model that comes out of training the "pretrained" model is a document completer, not a chatbot. Turning it into ChatGPT happens in two stages. First, SFT Supervised Fine-Tuning : swap the documents for curated conversations and keep training. Algorithmically, nothing changes. Second, RL Reinforcement Learning : the model generates responses, they get scored by humans, another "judge" model, or an algorithm , and the model learns from that feedback. Fundamentally, the model is still training on documents — those documents are now made up of tokens coming from the model itself. Inference Serving a model to millions of users requires its own engineering stack: batching requests together, KV cache management and paging vLLM, etc. 
, speculative decoding for speed, quantization (running in int8/int4 instead of float16) to reduce memory, and distributing the model across multiple GPUs. Fundamentally, we are still predicting the next token in the sequence, but with a lot of engineering spent on making it faster.

All of these are important engineering and research contributions, but if you understand microgpt, you understand the algorithmic essence.

Zoom in further · Bycroft's GPT visualization

If microgpt is "the smallest transformer drawn as a 2-D diagram," Brendan Bycroft's interactive walkthrough is "an actual GPT-2 drawn as a 3-D city." Every embedding vector, every Q/K/V projection, every attention head, and every MLP layer is rendered as a navigable scene with the real GPT-2 weights, and you can scrub through one token's forward pass at your own pace. It is the same algorithm as microgpt with ~30,000× more parameters and d_model = 768, driving the same forward pass you traced in microgpt.

Click and drag to orbit, scroll to zoom, and use the right-side phase rail to step through the algorithm. Open it in a new tab (https://bbycroft.net/llm) for full-screen control.

Assignment · safety guardrails for the chat bot

You've taken microgpt apart and you've already chatted with it at the top of this page. Now you're going to ship it, and decide what it's allowed to say.

The chat bot at the top of the page is a useful name generator, but it has no safety policy. Anyone can ask it for 100 names with any starting prefix they choose, and it will dutifully produce them. Your job in this assignment is to add a small safety layer on top of the same model, then defend it against an adversarial grader.

The product policy you're enforcing is simple and totally safe-for-work: this name generator must never emit a fruit. Saying apple is banned, and so are pear, plum, fig, lime, grape, mango, and the rest of the produce aisle.
In a real product the banned list would be slurs, NSFW terms, or other harmful output; fruits are a clean stand-in that exercises the exact same prefix- and substring-filtering machinery. Your bot must refuse requests whose prefixes lead to fruit names and filter any fruit that slips out of the stochastic sampler.

What you're submitting

A single Python file, bot.py, that reads requests from stdin and writes responses to stdout, one per line. The starter template has everything except the two safety hooks:

- is_safe_request(prefixes): receives the list of letter-prefixes the user typed (e.g. ['j'], ['ab'], or ['a','b','c']). Returns None to allow, or a one-sentence reason string to refuse. Called before any name is generated.
- is_safe_name(name): returns True to keep a generated name, False to discard it. Called after the model emits each name. If a name is rejected, the bot resamples (up to 10 attempts per slot).

You should only need to edit those two functions plus the BLOCKED_PREFIX_PATTERNS and BLOCKED_OUTPUT_PATTERNS lists. The forward pass and the bot loop are already written and shared with the autograder; if you change them, the autograder will reject your submission.

I/O protocol

Each line of stdin is one request: one or more whitespace-separated letter prefixes.

    j        one prefix, one name starting with "j"
    ab       one prefix, one name starting with "ab"
    a b c    three prefixes, one name per prefix
    (blank)  no prefixes, generate one name from BOS

Each line of stdout is one response, beginning with exactly one of these prefixes:

    OK: name1, name2, ...   ← one name per input prefix, in order, comma-separated
    REFUSE: