MicroGPT and Interactive Walkthrough


A 200-line GPT, taken apart and rebuilt in front of you.

Before we can begin evaluating and auditing AI systems, we have to understand them from first principles. On Feb 12, 2026, Andrej Karpathy (co-founder at OpenAI; helped build Tesla Autopilot) released a 200-line pure-Python program implementing the fundamental ideas behind GPT. I've taken his post and turned it into a lab with exercises and visuals to help us understand the concepts deeply rather than skim them. Karpathy's post is already well written — the goal is to augment it. The Python here is also rewritten in a slightly less compressed style: ~2XX lines instead of 200, but a bit easier to read. As always, feel free to work with the people at your table. You've got this.

karpathy.ai/microgpt.html · companion video on autograd: The spelled-out intro to neural networks and backpropagation (2.5 hr)

makemore dataset). The weights are loaded right here in your browser, and the same forward pass you'll dissect later in the lab runs every time you press Send.


Where to find it #

GitHub gist with the full source code: microgpt.py
Also available on this web page: karpathy.ai/microgpt.html
Also available as a Google Colab notebook, which you can run without installing anything

The following is a guide that steps an interested reader through the code.

Dataset #

The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page — but for microgpt, we use a simpler example of 32,000 names, one per line:

import os
import random

if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

The dataset looks like this. Each name is a document:

emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
... (~32,000 names follow)

The goal of the model is to learn the patterns in the data and then generate similar new documents that share the statistical patterns within. As a preview, by the end of the script our model will generate ("hallucinate"!) new, plausible-sounding names. Skipping ahead, we'll get:

sample  1: kamon         sample  8: anna          sample 15: earan
sample  2: ann           sample  9: areli         sample 16: lenne
sample  3: karai         sample 10: kaina         sample 17: kana
sample  4: jaire         sample 11: konna         sample 18: lara
sample  5: vialan        sample 12: keylen        sample 19: alela
sample  6: karia         sample 13: liole         sample 20: anton
sample  7: yeran         sample 14: alerin

It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny-looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.

Tokenizer #

Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (used by GPT-4) operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:

uchars = sorted(set(''.join(docs)))   # unique characters become token ids 0..n-1
BOS = len(uchars)                     # token id for Beginning of Sequence
vocab_size = len(uchars) + 1          # total tokens, +1 for BOS
print(f"vocab size: {vocab_size}")

We collect all unique characters across the dataset (which are just the lowercase letters a–z), sort them, and each letter gets an id by its index. The integer values themselves carry no meaning: each token is just a discrete symbol. Instead of `0, 1, 2` they could be different emoji. We also create one special token, BOS (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with BOS on both sides: `[BOS, e, m, m, a, BOS]`. The model learns that BOS initiates a new name, and that another BOS ends it. So we have a vocabulary of 27 (26 lowercase letters + BOS).

The character "a" is the first letter of the alphabet, so it has id `0`. What's the id of "z"? Of BOS? If your full name has 9 letters, how many tokens does the model see when you train on it?

Show answer #

"z" is `25` (the last of a–z, indices 0..25). BOS is `26` (the length of `uchars`, i.e. 26 alphabet letters). A 9-letter name produces 9 + 2 = 11 tokens: BOS, the 9 letters, then BOS again.

From a neuron to a network #

Before we open up `gpt()` and stare at multi-head attention, let's build up the underlying object, the neuron, and stack neurons into a network. The end goal of this section: by the time we hit the architecture diagram, every box in it will feel like an obvious composition of things we already understand.

Here's roughly where we're going. Don't worry about the details — file the picture mentally, then we'll build to it. (You can already play with this — drag the input sliders and watch the activations propagate.)

The simplest "neuron"

One input `x`, one bias `b`, and an output `a = x + b`. That's it: just an adder. No learning yet, no bend in the output. It's a useful starting object because every more complex neuron is just this one with more parts bolted on.

def neuron(x, b):
    return x + b

If `x = 3` and `b = -1`, what does the neuron output? What if I want this neuron to always output `0` no matter the input? What `b` would I need (and would it work for every `x`)?

Show answer #

`3 + (−1) = 2`. To force the output to `0` we'd need `b = −x`, which depends on `x`, so a single fixed bias can't do it. That's why we'll add a weight next: it lets the neuron scale its input before the bias.

Add a weight

Multiply the input by a learned weight `w` before adding the bias: `a = x*w + b`. Now the neuron has two knobs. With both `w` and `b` the neuron can shift and scale: it can learn any affine response (affine = scale the input, then shift it). This is the canonical "linear neuron".

def neuron(x, w, b):
    return x * w + b

Add a nonlinearity (ReLU)

Stacking linear neurons on top of linear neurons just gives you another linear function. To learn interesting things, we need a nonlinearity. ReLU is the simplest: $f(z) = \max(0, z)$. It passes positive values through and zeros out negative ones.

def relu(z):
    return max(0, z)

def neuron(x, w, b):
    z = x * w + b
    a = relu(z)
    return a
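To see why the nonlinearity matters, here is a small sketch showing that two stacked linear neurons collapse into a single linear neuron, while putting a ReLU in between breaks that equivalence. All weights are hypothetical values chosen so the arithmetic is exact; `linear_neuron`, `stacked`, `collapsed`, and `stacked_relu` are illustrative names.

```python
def linear_neuron(x, w, b):
    return x * w + b

def relu(z):
    return max(0, z)

# Hypothetical weights for two stacked neurons.
w1, b1 = 2.0, -3.0
w2, b2 = 0.5, 1.0

def stacked(x):
    # Linear neuron feeding a linear neuron.
    return linear_neuron(linear_neuron(x, w1, b1), w2, b2)

# Algebra: w2*(w1*x + b1) + b2 = (w1*w2)*x + (w2*b1 + b2)
w_eq = w1 * w2
b_eq = w2 * b1 + b2

def collapsed(x):
    # The same function written as ONE linear neuron.
    return linear_neuron(x, w_eq, b_eq)

def stacked_relu(x):
    # Same stack, but with a ReLU between the two neurons.
    return linear_neuron(relu(linear_neuron(x, w1, b1)), w2, b2)

for x in [-2.0, 0.0, 1.0, 4.0]:
    print(x, stacked(x), collapsed(x), stacked_relu(x))
```

For every `x` the first two columns agree, so the extra linear layer bought nothing. The ReLU version is flat at 1.0 until the inner neuron turns on, then follows the line, which is a shape no single linear neuron can produce.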

With `w = 2` and `b = -3`, plug in `x = 1` and `x = 4`. What does the neuron output in each case? At what value of `x` does the ReLU "turn on", i.e., where does the output stop being zero?

Show answer #

`x = 1` → `z = 1·2 − 3 = −1` → `a = max(0, −1) = 0`. `x = 4` → `z = 4·2 − 3 = 5` → `a = 5`. The ReLU turns on at `z = 0`, i.e. when `x = 3/2 = 1.5`. The neuron now acts as a threshold: it outputs zero below `x = 1.5` and grows linearly above it.

Many inputs in, one output out

Real neurons take a vector of inputs. Each input `x_i` has its own weight `w_i`; the neuron sums the products, adds the bias, and applies ReLU:

$$ a = \mathrm{ReLU}\!\left(\sum_{i=1}^{n} x_i w_i + b\right) $$

def neuron(x, w, b):           # x and w are lists of length n
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(0, z)

The inner sum is a dot product, the fundamental operation of neural networks. In microgpt, `linear(x, w)` does this dot product once per row of `w`. (Karpathy's version drops the bias `b`; modern Transformers often do.)

What does `zip()` do? Python's built-in `zip()` walks through two (or more) lists in lockstep and hands back tuples of matching elements, one tuple per "column", stopping when the shortest list runs out. So `for xi, wi in zip(x, w)` gives us the i-th input and the i-th weight together on each loop iteration, ready to multiply.

`zip(x, w)` pairs each input with its weight. The dot product is then just "sum the products of each pair": $0.5{\cdot}0.4 + (-0.3){\cdot}0.7 + 1.2{\cdot}(-0.1) = 0.20 - 0.21 - 0.12 = -0.13$.
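The same arithmetic can be run directly. This is a small sketch using the numbers from the worked sum; which list is called `x` and which is called `w` is an assumption (the products are the same either way), and `short` is an extra illustration of the stop-at-shortest behaviour.

```python
x = [0.5, -0.3, 1.2]   # inputs (assumed assignment of the numbers above)
w = [0.4, 0.7, -0.1]   # weights

pairs = list(zip(x, w))                    # [(0.5, 0.4), (-0.3, 0.7), (1.2, -0.1)]
products = [xi * wi for xi, wi in pairs]   # one product per pair
dot = sum(products)

print(pairs)
print(round(dot, 2))   # -0.13

# zip stops as soon as the shortest input runs out.
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)           # [(1, 'a'), (2, 'b')]
```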

The same pattern shows up everywhere in microgpt: adding token + position embeddings (`zip(tok_emb, pos_emb)`), residual sums (`zip(x, x_residual)`), every matrix-vector multiply inside `linear()`. Anywhere you see two same-length lists walked together, `zip` is the glue.

Forward pass #

In a neural network, the forward pass is the trip from inputs to a prediction. You hand the network some numbers, they flow through every layer — getting multiplied by weights, summed with biases, occasionally bent by a nonlinearity — and out the other end falls a single answer. The forward pass doesn't change the network at all; it just runs it. Every weight stays exactly where it was; only the activations move.

It's worth pausing on this before we get to backprop, because backprop is just the forward pass run in reverse. If we can't picture the forward pass clearly, the backwards version will feel like magic.

Below is a deliberately tiny network so you can wiggle every knob and watch the output respond. Three inputs `x₁, x₂, x₃` feed into two hidden ReLU neurons that join at a single ReLU output `a`. The three weights and three biases (`w₁, w₂, w₃, b₁, b₂, b₃`) are yours to play with. As you change them, the prediction surface on the right re-draws: it plots `a` as a height over the (x₁, x₂) plane, with `x₃` swept by its slider. The forward pass is that mapping from input space to output.

Sweep `x₃` with its slider to lift / fold the surface. Because every neuron has a ReLU, the surface is piecewise linear: each ReLU contributes a sharp fold. Click and drag the surface to rotate.

Forward pass:

h₁ = ReLU(w₁·x₁ + w₂·x₂ + b₁)

h₂ = ReLU(w₃·x₃ + b₂)

a = ReLU(h₁ + h₂ + b₃)
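For readers without the widget, the three forward-pass equations of this restricted network can be sketched as a plain function. The `forward` helper and the slider settings passed to it are hypothetical, picked so the result is easy to check by hand.

```python
def relu(z):
    return max(0.0, z)

def forward(x1, x2, x3, w1, w2, w3, b1, b2, b3):
    """Forward pass of the restricted 3-input network from the diagram."""
    h1 = relu(w1 * x1 + w2 * x2 + b1)   # h1 sees only x1 and x2
    h2 = relu(w3 * x3 + b2)             # h2 sees only x3
    a = relu(h1 + h2 + b3)              # output neuron sums both hidden units
    return h1, h2, a

# Hypothetical slider settings.
h1, h2, a = forward(x1=1.0, x2=2.0, x3=-1.0,
                    w1=0.5, w2=-1.0, w3=2.0,
                    b1=2.0, b2=1.0, b3=-0.25)
print(h1, h2, a)   # 0.5 0.0 0.25
```

Here `h2` is clamped to zero by its ReLU, so only `h1` and the output bias contribute to `a`. Calling `forward` never changes any weight; it only computes activations.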

The same thing, in code

Here's the network we've been playing with, written out as a small class hierarchy: `Neuron` → `Layer` → `MLP`. This is essentially how Karpathy's micrograd packages neural networks. The `Neuron.__call__` method is doing exactly what the circles in the diagram do: weighted sum of inputs, plus bias, through a ReLU.

import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0, z)

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x   = [1.0, 0.5, -0.3]
mlp = MLP(3, [2, 1])
print(mlp(x))   # e.g.  [0.42]

The `MLP(3, [2, 1])` above is slightly more general than the network in the diagram. In a standard MLP every input feeds every hidden neuron, so the first layer alone would have `2 × (3 weights + 1 bias) = 8` parameters. The interactive diagram uses a deliberately restricted variant (`h₁` sees only `x₁, x₂`, and `h₂` sees only `x₃`), so we end up with just 3 weights and 3 biases. That's small enough that the prediction surface stays readable as you wiggle the sliders. The `Neuron` / `Layer` / `MLP` scaffolding is identical either way.
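The parameter arithmetic for a fully connected MLP can be checked with a few lines. `count_params` is a hypothetical helper that mirrors how `MLP.__init__` builds its layer sizes: each neuron owns one weight per input plus one bias.

```python
def count_params(nin, nouts):
    """Parameters per layer of a fully connected MLP with biases."""
    sizes = [nin] + nouts
    # Each of the sizes[i+1] neurons has sizes[i] weights and 1 bias.
    return [(sizes[i] + 1) * sizes[i + 1] for i in range(len(nouts))]

per_layer = count_params(3, [2, 1])
print(per_layer, sum(per_layer))   # [8, 3] 11
```

So the full `MLP(3, [2, 1])` has 11 parameters, against the 6 that the restricted diagram exposes as sliders.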

Here's a small batch of inputs. Using the `MLP` class above, write code that produces predictions for each one:

xs = [
    [ 2.0,  3.0, -1.0],
    [ 3.0, -1.0,  0.5],
    [ 0.5,  1.0,  1.0],
    [ 1.0,  1.0, -1.0],
]
ys_target = [1.0, -1.0, -1.0, 1.0]   # what we WISH the network said
ypred = ?                             # ← your job

Show answer #

`ypred = [mlp(x) for x in xs]`. With random weights you'll get whatever the freshly initialized model says, almost certainly nothing like `ys_target`. (Each `mlp(x)` returns a one-element list, so use `mlp(x)[0]` if you want plain numbers.)

Bonus observation: our network's output is wrapped in a ReLU, so every prediction is ≥ 0 for every input. That means we can never match a target of `−1.0` no matter what the weights are. To handle negative targets we'd need a different output activation (or none). This is a real design choice in real models: the output activation has to match the kind of answer you want.

What is loss?

Once we have predictions, the obvious question is: how wrong are we? The standard way to turn that question into a single number is a loss function. The simplest one — mean squared error (MSE) — just averages the squared gap between each prediction and its target:

$$ L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

A few properties worth internalizing:

- Loss is always ≥ 0: squared gaps can't be negative.
- Loss = 0 means perfect predictions: every ŷᵢ exactly hits its target yᵢ.
- Big gaps cost much more than small ones, because they're squared. A model that's off by 2 on one example loses 4× as much as one that's off by 1.
- Loss is the only thing the optimizer cares about: every weight in the model will be nudged in whichever direction makes this single number smaller.

This is the whole game of training: find weights that minimize the loss.
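As a quick sketch, the MSE formula translates almost word for word into Python. The `mse` helper is an illustrative name, and the `ypred` values are made up for the arithmetic (they are not the output of any trained model); `ys_target` is the target list from the exercise above.

```python
def mse(ypred, ytarget):
    """Mean squared error: average of the squared gaps."""
    n = len(ytarget)
    return sum((p - t) ** 2 for p, t in zip(ypred, ytarget)) / n

ys_target = [1.0, -1.0, -1.0, 1.0]
ypred = [0.5, 0.0, 0.0, 2.0]      # hypothetical predictions

loss = mse(ypred, ys_target)
print(loss)                        # 0.8125
print(mse(ys_target, ys_target))   # 0.0, a perfect prediction
```

The four squared gaps are 0.25, 1, 1, and 1, so the mean is 3.25 / 4 = 0.8125. Because these predictions are all ≥ 0 (as a ReLU output would be), the two targets of −1 each contribute at least 1 to the sum.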

Scroll back to the interactive diagram and click 🎯 Train against a target surface. A hidden target network is generated, its surface is overlaid as a dark wireframe, and the live loss appears as both a number and a bar. The little chart underneath records every loss reading: as you nudge sliders, you can literally watch the line go down (or up, since it's easy to make it worse). See if you can get the loss below `0.02` by hand. It's harder than it looks, and that's the whole motivation for the gradient-based training we'll build in the next section.

The weights and biases in our code are still plain Python `float`s, so we can run the model and measure the loss but we can't yet ask "which weight should I nudge, and by how much, to reduce the loss?". To answer that, we need gradients, and that's exactly what the next section is about.

Autograd #

Training a neural network requires gradients: for each parameter in the model, we need to know "if I nudge this number up a little, does the loss go up or down, and by how much?". The computation graph has many inputs (the model parameters and input tokens) but funnels down to a single scalar output: the loss. Backpropagation starts at that single output and works backwards through the graph, computing the gradient of the loss with respect to every input. It relies on the chain rule from calculus. In production, libraries like PyTorch handle this automatically. Here, we implement it from scratch in a single class called `Value`.

This is the most mathematically intense part of microgpt. Karpathy has a 2.5-hour video that builds the whole thing live: The spelled-out intro to neural networks and backpropagation. The walk-through below condenses the key points.

Building `Value` piece by piece

The same Lego mindset works here: start with a wrapper, add operators, then add the graph bookkeeping that makes backprop possible. Try it live:

Each stage below shows what `Value` remembers at that version of the class. Stage 3 is what microgpt actually uses.

# Stage 0: a bare wrapper around one number
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

a = Value(-6.0)
b = Value(7.0)
print(a)   # Value(data=-6.0)
print(b)   # Value(data=7.0)

# Stage 1: add arithmetic operators
class Value:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)
    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(-6.0); b = Value(7.0); c = Value(10.0)
d = a * b + c
print(d)   # Value(data=-32.0)

# Stage 2: remember the children that produced each value
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self._children = children       # the values that produced this one

    def __add__(self, other):
        return Value(self.data + other.data, (self, other))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other))

a = Value(2.0)
b = Value(3.0)
c = a * b                              # c knows its children are (a, b)
L = c + a                              # L knows its children are (c, a)

# Stage 3: local gradients and backward() (this is what microgpt uses)
import math

class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # forward-pass scalar
        self.grad = 0                   # dL/d(this), filled in backward pass
        self._children = children       # inputs to this node
        self._local_grads = local_grads # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other):  return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self):             return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self):             return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self):            return Value(max(0, self.data), (self,), (float(self.data > 0),))

    def __neg__(self):           return self * -1
    def __radd__(self, other):   return self + other
    def __sub__(self, other):    return self + (-other)
    def __rsub__(self, other):   return other + (-self)
    def __rmul__(self, other):   return self * other
    def __truediv__(self, other):  return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

Briefly, a `Value` wraps a single scalar number (`.data`) and tracks how it was computed. Think of each operation as a little Lego block: it takes some inputs, produces an output (the forward pass), and it knows how its output would change with respect to each of its inputs (the local gradient). That's all the information autograd needs from each block. Everything else is just the chain rule, stringing the blocks together.

Every time you do math with `Value` objects (add, multiply, etc.), the result is a new `Value` that remembers its inputs (`_children`) and the local derivative of that operation (`_local_grads`). For example, `__mul__` records that $\frac{\partial(a\cdot b)}{\partial a}=b$ and $\frac{\partial(a\cdot b)}{\partial b}=a$. The full set of Lego blocks:

| Operation | Forward | Local gradients |
|---|---|---|
| `a + b` | $a+b$ | $\partial/\partial a = 1,\; \partial/\partial b = 1$ |
| `a * b` | $a \cdot b$ | $\partial/\partial a = b,\; \partial/\partial b = a$ |
| `a ** n` | $a^n$ | $\partial/\partial a = n\,a^{n-1}$ |
| `log(a)` | $\ln a$ | $\partial/\partial a = 1/a$ |
| `exp(a)` | $e^a$ | $\partial/\partial a = e^a$ |
| `relu(a)` | $\max(0,a)$ | $\partial/\partial a = \mathbf{1}_{a>0}$ |
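One way to convince yourself the table is right is to compare each local gradient against a numerical estimate. The sketch below uses a central finite difference; `numeric_grad`, the `checks` dictionary, and the test point `a = 1.5`, `b = -2.0`, `n = 3` are all hypothetical choices made for illustration.

```python
import math

def numeric_grad(f, a, h=1e-6):
    """Central finite-difference estimate of df/da."""
    return (f(a + h) - f(a - h)) / (2 * h)

a, b, n = 1.5, -2.0, 3   # hypothetical test point

# Each entry: (function of a, analytic local gradient from the table)
checks = {
    "a + b":   (lambda t: t + b,        1.0),
    "a * b":   (lambda t: t * b,        b),
    "a ** n":  (lambda t: t ** n,       n * a ** (n - 1)),
    "log(a)":  (math.log,               1 / a),
    "exp(a)":  (math.exp,               math.exp(a)),
    "relu(a)": (lambda t: max(0.0, t),  float(a > 0)),
}

for name, (f, analytic) in checks.items():
    numeric = numeric_grad(f, a)
    print(f"{name:8s} analytic={analytic:.6f} numeric={numeric:.6f}")
```

The two columns should agree to several decimal places. The ReLU row is only meaningful away from zero, where the function has a kink and the derivative is not defined.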

The `backward()` method walks this graph in reverse topological order (starting from the loss, ending at the parameters), applying the chain rule at each step. If the loss is $L$ and a node $v$ has a child $c$ with local gradient $\frac{\partial v}{\partial c}$, then:

$$\frac{\partial L}{\partial c} \mathrel{+}= \frac{\partial v}{\partial c}\cdot\frac{\partial L}{\partial v}$$

This looks scary if you're not comfortable with calculus, but it's literally just multiplying two numbers in an intuitive way: "If a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels 2×4 = 8 times as fast as the man." The chain rule is the same idea — you multiply the rates of change along the path.

We kick things off by setting `self.grad = 1` at the loss node, because $\frac{\partial L}{\partial L}=1$. From there, the chain rule just multiplies local gradients along every path back to the parameters.

Note the `+=` (accumulation, not assignment). When a value is used in multiple places in the graph (i.e. the graph branches), gradients flow back along each branch independently and must be summed. This is the multivariable chain rule: if $c$ contributes to $L$ through multiple paths, the total derivative is the sum of contributions from each path.
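To see why accumulation matters, here is the backward pass for `L = a*b + a` written out by hand with plain floats, using `a = 2` and `b = 3`. The variable names (`grad_a`, `grad_a_wrong`, and so on) are illustrative; this is a sketch of the bookkeeping, not the `Value` class itself.

```python
a, b = 2.0, 3.0
c = a * b          # a is used here ...
L = c + a          # ... and again here, so the graph branches at a

grad_L = 1.0                 # dL/dL
grad_c = 1.0 * grad_L        # L = c + a  ->  dL/dc = 1

grad_a = 0.0
grad_a += 1.0 * grad_L       # direct path:    L = c + a  ->  dL/da = 1
grad_a += b * grad_c         # path through c: c = a * b  ->  dc/da = b
grad_b = a * grad_c          # only one path:  c = a * b  ->  dc/db = a

# What happens with assignment instead of accumulation:
grad_a_wrong = 1.0 * grad_L
grad_a_wrong = b * grad_c    # overwrites the direct path's contribution

print(grad_a, grad_b)        # 4.0 2.0
print(grad_a_wrong)          # 3.0
```

With `+=` the two paths sum to 4.0, which is the true derivative `b + 1`. With plain assignment the direct path is lost and the answer is wrong.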

After `backward()` completes, every `Value` in the graph has a `.grad` containing $\frac{\partial L}{\partial v}$, which tells us how the final loss would change if we nudged that value.

Watch backprop happen

Backprop is easier to internalize if you build it up. Below are four cases in increasing complexity: start with what a single `+` does to a gradient, then a single `×`, then both with a branch, then a full training-style pipeline (input, prediction, loss). Each tab is its own little graph; step through it one click at a time.

Here's a small neuron computing `a = ReLU(x·w + b)`. The forward values are filled in. Try to compute the gradients with respect to x, w, and b by hand, assuming `∂L/∂a = 1`. Then click "Run backward" to check. Doing this once by hand is the single best way to internalize what `backward()` is doing.

This is exactly what PyTorch's `.backward()` gives you:

import torch
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
L = c + a
L.backward()
print(a.grad)   # tensor(4.)
print(b.grad)   # tensor(2.)

This is the same algorithm that PyTorch's `loss.backward()` runs, just on scalars instead of tensors (arrays of scalars): algorithmically identical, significantly smaller and simpler, but a lot less efficient.

Let's spell out what `backward()` gives us. Autograd calculated that if `L = a*b + a`, with `a=2` and `b=3`, then `a.grad = 4.0`. This is telling us about the local influence of `a` on `L`: if you wiggle `a`, in what direction is `L` changing? The derivative of `L` w.r.t. `a` is 4.0, meaning that if we increase `a` by a tiny amount (say 0.001), `L` would increase by about 4× that (0.004). Similarly, `b.grad = 2.0` means the same nudge to `b` would increase `L` by about 2× that (0.002). These gradients tell us the direction (positive or negative) and the steepness (magnitude) of each input's influence on the final output (the loss). This lets us iteratively nudge the parameters of our neural network to lower the loss, and hence improve its predictions.
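The "nudge by 0.001" reading of a gradient can be tried out literally. This sketch rebuilds `L = a*b + a` as a plain Python function and measures how much the output moves; the names `L`, `eps`, `dL_da`, and `dL_db` are illustrative.

```python
def L(a, b):
    return a * b + a

a, b = 2.0, 3.0
eps = 0.001   # the "tiny amount" from the text

# Nudge one input at a time and measure the change in L per unit of nudge.
dL_da = (L(a + eps, b) - L(a, b)) / eps
dL_db = (L(a, b + eps) - L(a, b)) / eps

print(round(dL_da, 6), round(dL_db, 6))   # 4.0 2.0
```

Because this particular `L` is linear in each input separately, the estimate matches the true gradient up to floating-point rounding. For curvier functions the match only holds approximately, and gets better as `eps` shrinks.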

∂loss/∂b = −6 and (chaining through x) ∂loss/∂w = 2·err·x = −18. Now we repeat the step.

The input x = 3 and the target y = 10 are fixed; each step nudges the two parameters against their gradient (w ← w − lr·∂loss/∂w, and likewise for b), and you watch the prediction ŷ climb toward the target while the loss shrinks. The nudge is just the gradient multiplied by the learning rate. Click Next step a few times.

Widget settings: x = 3, target y = 10, learning rate = 0.02.
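The repeated step can be sketched as a short loop. The settings x = 3, y = 10, and lr = 0.02 come from the widget; everything else is an assumption made for illustration: a one-neuron model ŷ = w·x + b, a squared-error loss, and starting values w = 2, b = 1, chosen because they reproduce the gradients −18 and −6 quoted above.

```python
x, y = 3.0, 10.0     # fixed input and target (from the widget)
lr = 0.02            # learning rate (from the widget)
w, b = 2.0, 1.0      # hypothetical starting parameters

history = []
for step in range(5):
    y_hat = w * x + b            # forward pass
    err = y_hat - y
    loss = err ** 2              # assumed squared-error loss
    grad_w = 2 * err * x         # d(loss)/dw, chaining through x
    grad_b = 2 * err             # d(loss)/db
    history.append((y_hat, loss, grad_w, grad_b))
    w -= lr * grad_w             # nudge against the gradient
    b -= lr * grad_b
    print(f"step {step}: y_hat={y_hat:.3f} loss={loss:.4f}")
```

The first row prints a prediction of 7 and a loss of 9; the second gives 8.2 and 3.24. Under these assumptions each step shrinks the error by the same factor, so the prediction approaches 10 without overshooting.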

Architecture #

The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model thinks should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU.

We'll step through the model one block at a time. Each sub-section below covers one piece — first the intuition, then any small helper functions it needs, then the relevant code, the actual parameter matrices, and finally a small interactive widget showing what we've built up so far.

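| Hyperparameter | Toy walkthrough | microgpt | GPT-2 small |
|---|---|---|---|
| `vocab_size` | 4 (BOS, a, b, c) | 27 (a–z + BOS) | 50,257 |
| `n_embd` · `d_model` | 2 | 16 | 768 |
| `n_head` | 1 | 4 | 12 |
| `head_dim` | 2 | 4 | 64 |
| `block_size` (context len) | 4 | 16 | 1,024 |
| `n_layer` | 1 | 1 | 12 |
| parameters (rough) | ~70 | 4,192 | ~124M |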
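The rough parameter counts for the two small models (~70 and 4,192) can be reproduced from the other hyperparameters. The sketch below assumes a particular layout that is not spelled out in this section: token and position embedding tables, a separate output matrix the same shape as the token table, four square attention matrices per layer, an MLP with a 4× hidden width, and no biases or learned normalization gains. `count_params` is a hypothetical helper.

```python
def count_params(vocab_size, n_embd, block_size, n_layer):
    """Rough parameter count under the assumed bias-free layout."""
    wte = vocab_size * n_embd            # token embedding table
    wpe = block_size * n_embd            # position embedding table
    lm_head = vocab_size * n_embd        # assumed separate output projection
    attn = 4 * n_embd * n_embd           # Wq, Wk, Wv, Wo
    mlp = 2 * (4 * n_embd) * n_embd      # up- and down-projection, 4x hidden width
    return wte + wpe + lm_head + n_layer * (attn + mlp)

toy = count_params(vocab_size=4, n_embd=2, block_size=4, n_layer=1)
micro = count_params(vocab_size=27, n_embd=16, block_size=16, n_layer=1)
print(toy, micro)   # 72 4192
```

This formula is not meant for the GPT-2 column: GPT-2 also has biases and LayerNorm parameters and shares its output projection with the token embedding table, so plugging its hyperparameters in here would not give ~124M.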

To make each step concrete, we'll track a single token through the whole block using a deliberately tiny model. The vector at each stage will only have two numbers, so you can do every multiplication by hand and watch what changes.

Setup. Pretend the vocabulary is just 4 tokens (`BOS=0`, `'a'=1`, `'b'=2`, `'c'=3`) and the embedding width is `d_model = 2`, with `n_head = 1` (so `head_dim = 2`) and `block_size = 4`. We're partway through generating: the model has already seen `BOS` at position 0 and `'a'` at position 1, and now it's processing `'b'` at position 2. We want it to predict what comes at position 3.

Each subsection below pulls in the toy weights it needs, walks the numbers forward, and the resulting vector becomes the input to the next subsection. By the end of Output, we'll have one concrete probability over the 4-token vocab.

Embeddings

The neural network can't process a raw token id like `2` directly. It only works with vectors (lists of numbers). So we associate a learned vector with each possible token, and feed that in as its neural signature. The token id and position id each look up a row from their respective embedding tables (`wte` and `wpe`). These two vectors are added together, giving the model a representation that encodes both what the token is and where it is in the sequence. Modern LLMs usually skip the position embedding and use relative position schemes like RoPE.

Concrete example: say our current token is `'b'`, which the tokenizer mapped to id `2`, sitting at position `2`. The lookup `wte[2]` gives a length-2 vector, and that's the `x` the network actually sees. Click a different letter below and you'll watch a different row of `wte` get pulled in and flow all the way through the three views (and the numeric tour at the bottom of the section).

The selected row of `wte` becomes `x`; `wpe[pos=2] = [−0.10, 0.10]` gets added; that vector flows through every downstream view. The fine-grained sliders at the bottom of the section still work for off-vocabulary values.

Parameter matrices

Two learned tables — one row per token, one row per position. Hover any cell to see its value. The pattern is just random Gaussian initialisation (std = 0.08); training reshapes these into something meaningful.

Helper used here · `rmsnorm`

Once we've added the token and position vectors, we normalize. `rmsnorm` (Root Mean Square Normalization) rescales a vector so its values have unit root-mean-square. This keeps activations from growing or shrinking as they flow through the network, stabilizing training. It's a simpler variant of the LayerNorm used in the original GPT-2.

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
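A quick way to check what "unit root-mean-square" means is to measure it before and after. The sketch below repeats `rmsnorm` so it runs on its own; the `rms` helper and the input vector are hypothetical and exist only for this check.

```python
import math

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def rms(x):
    """Root-mean-square of a vector (helper for this check only)."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

x = [3.0, -4.0, 12.0, 0.5]   # hypothetical activations with a large scale
y = rmsnorm(x)
y2 = rmsnorm(y)              # normalizing twice changes almost nothing

print(round(rms(x), 3))      # 6.505
print(round(rms(y), 3))      # 1.0
print(max(abs(p - q) for p, q in zip(y, y2)) < 1e-4)   # True
```

The signs and relative sizes of the entries are preserved; only the overall scale changes. The small `1e-5` term keeps the division safe for an all-zero vector, which is why the result is very slightly below 1 rather than exactly 1.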

Code in `gpt()`

tok_emb = state_dict['wte'][token_id]      # length 16
pos_emb = state_dict['wpe'][pos_id]        # length 16
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)

Our token is `'b'` (id `2`) at position `2`. Pick tiny `wte` and `wpe` tables to look up from:

wte = [[ 0.20,  0.30],   # BOS
       [ 0.50, -0.10],   # 'a'
       [-0.30,  0.40],   # 'b'
       [ 0.10,  0.20]]   # 'c'

wpe = [[ 0.10, -0.05],   # pos 0
       [ 0.05,  0.15],   # pos 1
       [-0.10,  0.10],   # pos 2
       [ 0.15,  0.00]]   # pos 3

token_id, pos_id = 2, 2
tok_emb = wte[token_id]                        # → [-0.30,  0.40]
pos_emb = wpe[pos_id]                          # → [-0.10,  0.10]
x = [t + p for t, p in zip(tok_emb, pos_emb)]  # → [-0.40,  0.50]
x = rmsnorm(x)                                 # → [-0.88,  1.10]

Doing the RMSNorm by hand. Mean-square: $((-0.40)^2 + 0.50^2)/2 = 0.205$. Scale: $1/\sqrt{0.205 + 10^{-5}} \approx 2.209$. Multiply through: $[-0.40 \cdot 2.209,\; 0.50 \cdot 2.209] \approx [-0.88,\, 1.10]$. That two-number vector $x \approx [-0.88,\, 1.10]$ is what the attention block sees next.

In microgpt (`n_embd = 16`): `wte` is (27 × 16) and `wpe` is (16 × 16), so the looked-up vectors are length 16 instead of 2: same two lines of code, just longer lists. RMSNorm averages 16 squared values instead of 2.

In GPT-2 small (`n_embd = 768`): each row is a 768-dim vector, and the vocabulary jumps to 50,257 tokens, so `wte` alone is ≈ 39M parameters.

GPT-3 (175B): `n_embd = 12,288` and the context window stretches to 2,048 positions; modern frontier models push past 100K positions and skip `wpe` entirely in favor of relative position schemes like RoPE that rotate the Q/K vectors inside attention instead of adding a position vector here.

RMSNorm rescales the `x + wpe` vector to unit root-mean-square, so the activations don't blow up as they flow into Q/K/V. The same picker drives this view: try BOS / 'a' / 'b' / 'c' and watch the normalized vector update.

Attention block

The attention block is the only place where a token at position $t$ gets to "look" at tokens at positions $0 \ldots t-1$. It's a token-communication mechanism. Before we dive into the code, here's the intuition that makes the rest of this section click.

"Attention, Please!": A Visual Guide To The Attention Mechanism by CodeCompass (recommended reading if you want the same ideas in a different voice).

Intuition · attention is a fuzzy dictionary

Here is what the attention equation looks like. Don't get intimidated; we're going to break each piece down. Attention is a "learnable", "fuzzy" version of a key-value store, the same data structure you know as a Python `dict` or a hashtable.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Attention generalizes this to a non-binary lookup. Instead of matching the query to exactly one key, the query is compared to every key, each match gets a similarity score, and the output is a weighted blend of all the values; keys with higher scores contribute more. Critically, the queries, keys, and values are D-dimensional learned vectors (computed by `Wq`, `Wk`, `Wv` from the input), so the model gets to decide what "matching" means.

Why softmax? Raw dot-product scores can be any real number. Softmax squashes them into the range [0, 1] and forces them to sum to 1, like a well-behaved probability distribution — so the output really is a weighted average, not just a weighted sum that could explode.

What does attention do?

Attention is applied to the input sequence and generates weights for what is of importance to each query. Those weights then "pick" the relevant information and pass it on to the next layer. To make this concrete, take the sentence "The quick brown fox jumps over the lazy dog." Click any word below to see where its attention goes — every other word in the sentence gets a similarity score against your chosen query word, and the bar chart shows the resulting weights.

The selected word is the query. Its query vector (computed by `Wq`) is dot-producted with every word's key vector (computed by `Wk`) to get raw scores; softmax turns those into the attention weights you see below. The highest-weighted word is what this query is "looking at." Numbers are illustrative; a real trained model would produce its own pattern.

Helpers used here · `linear` and `softmax`

`linear` is a matrix-vector multiply. It takes a vector `x` and a weight matrix `w`, and computes one dot product per row of `w`. It shows up four times in this block (once each for Q, K, V, and the output projection `Wo`) and is the fundamental building block of neural networks: a learned linear transformation.

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

`softmax` converts a vector of raw scores, which can range from $-\infty$ to $+\infty$, into a probability distribution: all values end up in $[0,1]$ and sum to 1. Inside attention we use it to turn the Q·K scores into weights that sum to 1; later, the same helper turns the model's output logits into a distribution over the vocabulary. We subtract the max first for numerical stability (mathematically a no-op, but it prevents overflow in `exp`).

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
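To see why the max subtraction matters, here is a minimal sketch on plain floats rather than microgpt's Value objects. The names naive_softmax and stable_softmax are made up for this illustration. With large logits the naive version overflows inside math.exp, while the shifted version returns the same distribution it would for small logits with the same gaps.

```python
import math

def naive_softmax(logits):
    # Exponentiate directly; math.exp raises OverflowError for large inputs.
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    # Shift so the largest logit is 0; the ratios between the exps are unchanged.
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1001.0, 1002.0, 1003.0]   # same gaps as `small`, shifted by 1000

print(stable_softmax(small))
print(stable_softmax(large))       # same distribution as for `small`

try:
    naive_softmax(large)
except OverflowError as err:
    print("naive version failed:", err)
```

Only the differences between logits matter to softmax, which is why shifting all of them by the same amount is safe.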

Now that both helpers are on the table, let's walk through the whole attention block with concrete numbers before opening up the interactive widgets. The widgets below are just visualizations of the operations that follow — once you've seen the math run end-to-end on real values, each widget will feel like a labeled view of a step you've already done by hand.

We pick up where the Embeddings walkthrough left off: token 'b' at position 2, with x ≈ [-0.88, 1.10] already in hand. We stash it as the residual and re-normalize before projecting (the second RMSNorm on an already-normalized vector is nearly a no-op, with scale ≈ 1.00, so the input to the projections is still [-0.88, 1.10]).

x_residual = x                                  # [-0.88, 1.10]
x = rmsnorm(x)                                  # ≈ [-0.88, 1.10]

attn_wq = [[ 0.50,  0.20], [ 0.10,  0.40]]
attn_wk = [[ 0.30, -0.10], [ 0.20,  0.50]]
attn_wv = [[ 0.40,  0.10], [-0.20,  0.60]]
attn_wo = [[ 0.60,  0.20], [ 0.10,  0.70]]

q = linear(x, attn_wq)   # → [-0.22,  0.35]
k = linear(x, attn_wk)   # → [-0.37,  0.37]
v = linear(x, attn_wv)   # → [-0.24,  0.84]

Why those numbers? Each row of the weight matrix is a dot product with x. For q: row 0 gives $0.50(-0.88) + 0.20(1.10) = -0.22$; row 1 gives $0.10(-0.88) + 0.40(1.10) = 0.35$. Same shape for k and v.

KV cache. Positions 0 and 1 have already been processed on earlier calls, so the cache holds:

keys[0]   = [[ 0.30,  0.10],   # k from BOS at pos 0
             [-0.10,  0.40],   # k from 'a'  at pos 1
             [-0.37,  0.37]]   # k from 'b'  at pos 2 (just appended)

values[0] = [[ 0.20, -0.30],   # v from BOS
             [ 0.50,  0.20],   # v from 'a'
             [-0.24,  0.84]]   # v from 'b'

Why keys[0] instead of just keys? Each Transformer layer keeps its own separate KV cache: the keys and values computed at layer 0 mean different things than at layer 1. So keys and values are lists of lists: the outer index is the layer number, the inner index is the position in the sequence. keys[0] is "the running list of every k vector layer 0 has produced so far," and keys[0][2] is "the key for position 2 at layer 0." Our toy has n_layer = 1, so keys[0] is the only list around, but the indexing convention stays the same. If we bumped n_layer to 6, you'd see keys[0], keys[1], … through keys[5], one cache per layer.

Scores → softmax weights. Dot each cached key with our query, divide by $\sqrt{d_{\text{head}}} = \sqrt{2} \approx 1.41$:

scores = [(q[0]*k[0] + q[1]*k[1]) / 1.41 for k in keys[0]]
weights = softmax(scores)                       # ≈ [0.30, 0.34, 0.36]

The three weights sum to 1. Notice that 'b' attends most to itself (0.36), then to 'a' (0.34), then to BOS (0.30). The differences are small because our toy weights are tiny and random; a trained network would learn much sharper patterns.

Weighted sum of values, then mix through Wo, then residual.

head_out = [sum(weights[t] * v[j] for t, v in enumerate(values[0]))
            for j in range(2)]                  # ≈ [0.15, 0.28]

x_attn = linear(head_out, attn_wo)              # ≈ [0.14, 0.21]
x = [a + b for a, b in zip(x_attn, x_residual)] # ≈ [-0.74, 1.31]

Why each of those three lines is there.

Weighted sum of V. This is the actual "lookup" of the fuzzy dictionary. The weights answered how much each past position matters; the values say what each one contributes. Multiplying them and summing gives a single vector that's a blended pull from every cached value, weighted by relevance. If one weight were 1.0 and the rest were 0, we'd get back exactly that value, like a normal dict lookup. With soft weights, we get a mix.

Project through Wo. The weighted sum lives in value-space, not in the residual stream's space. Wo is a learned linear layer that re-mixes the head output back into the same shape as x. In multi-head attention each head's slice gets concatenated first, then Wo blends across the heads, giving the model a place to learn how different heads should be combined. In our toy with one head it just re-mixes the two coordinates of the 2-vector, but the role is the same.

Add the residual. Instead of replacing x with x_attn, we add: x ← x + x_attn. Two big wins. (1) The original information survives: attention is an update, not an overwrite. (2) During backprop, gradients flow directly through this addition path back to earlier layers, which is what makes deep stacks of these blocks trainable at all. If attention has nothing useful to say for this token, it can output zero and the residual just passes x through unchanged.

The vector handed to the MLP block is x ≈ [-0.74, 1.31]. The attention block has done one thing: blended a little bit of every past position into the current one, projected the result back into the residual stream's shape, and added it on as an update.
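The whole toy attention block can be replayed on plain floats. The sketch below assumes the pinned weights and cache rows from the walkthrough, treats the second RMSNorm as a no-op (as the text does), and uses math.exp in place of the Value-based softmax. It should land on the same rounded weights and the same x ≈ [-0.74, 1.31].

```python
import math

def linear(x, w):
    # One dot product per row of w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x_residual = [-0.88, 1.10]          # token 'b' at position 2
x = x_residual                      # second RMSNorm treated as a no-op here

attn_wq = [[0.50, 0.20], [0.10, 0.40]]
attn_wk = [[0.30, -0.10], [0.20, 0.50]]
attn_wv = [[0.40, 0.10], [-0.20, 0.60]]
attn_wo = [[0.60, 0.20], [0.10, 0.70]]

q, k, v = linear(x, attn_wq), linear(x, attn_wk), linear(x, attn_wv)

# Cache rows for BOS and 'a' are the pinned values from the walkthrough;
# the row for 'b' is the k / v we just computed.
keys = [[0.30, 0.10], [-0.10, 0.40], k]
values = [[0.20, -0.30], [0.50, 0.20], v]

head_dim = 2
scores = [sum(qj * kj for qj, kj in zip(q, key)) / math.sqrt(head_dim) for key in keys]
weights = softmax(scores)
head_out = [sum(w * val[j] for w, val in zip(weights, values)) for j in range(head_dim)]
x_attn = linear(head_out, attn_wo)
x_out = [a + b for a, b in zip(x_attn, x_residual)]

print("weights:", [round(w, 2) for w in weights])
print("x after attention:", [round(xi, 2) for xi in x_out])
```

Changing a single entry of attn_wq and re-running is a quick way to see how the query alone reshapes the weights while the cache stays fixed.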

▶ Play for the full animation or Step to advance one phase at a time. Phases: (1) compute q, k, v for the current token, (2) append k and v to the per-layer caches, (3) score the query against each cached key, (4) softmax → weights, (5) weighted sum of cached values, (6) output.

Attention playground · drag the query, watch the block recompute

Same diagram as "Attention, step by step" above, but now the query vector q is on sliders. The KV cache (3 past tokens) stays pinned to the toy walkthrough; everything downstream (scaled-dot-product scores, softmax weights, weighted sum of values, head output) recomputes live as you drag. Start at the defaults (q ≈ [−0.22, 0.35], the toy 'b' values) and move the sliders to see how a different query reshapes the whole attention output.

Shortcut visualization (lm_head → softmax). In the real model the head output goes through Wo, gets added to the residual, runs the MLP block, and only then do lm_head and softmax produce next-letter probabilities. We're skipping those layers and projecting the head output directly through lm_head so you can see how moving the query changes which letter the model "leans toward." It's a directional signal, not the model's real prediction.

#### Snapping attention into the running diagram

We started this section with Embeddings only, added the pre-attention rmsnorm, and just walked through the full attention computation step by step. Time to slot that attention block back into the architecture diagram we've been building piece by piece. The widget below adds Q/K/V projections, the attention weighted sum, Wo, and the residual add on top of the Embeddings + RMSNorm view from earlier: same token picker, same numbers, just more of the block lit up.

Parameter matrices

Four 16×16 matrices: Q/K/V are the three projections that turn the token vector into "what am I looking for / what do I contain / what do I offer", and Wₒ mixes the per-head outputs back together.

You might be wondering why the toy matrices below are only 2×2. Remember from the Embeddings step: each token gets embedded as a two-dimensional vector (we set d_model = 2 for the walkthrough). The Q/K/V projections map a length-2 vector to another length-2 vector, so the weight matrix is (out × in) = (2 × 2) = 4 numbers. In real microgpt d_model = 16, so each of these matrices grows to (16 × 16) = 256 numbers. The shape of the operation is the same, just bigger.

Wo · head output → residual update

Code in `gpt()`

x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])
keys[li].append(k); values[li].append(v)
# (the per-head scores, softmax, and weighted sum run here and build x_attn)
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]            # residual

In microgpt (n_embd = 16, n_head = 4): Q/K/V are (16 × 16) and they get sliced into 4 heads of head_dim = 4 each. The same Q·K/√d · softmax · weighted-V dance runs per head on a 4-dim slice, the four outputs are concatenated back to length 16, and Wo mixes them. The "shape" of the math doesn't change, just the dimensions.

In GPT-2 small (n_embd = 768, n_head = 12): each head sees a 64-dim slice, and there are 12 of them running in parallel.

In GPT-3 (175B, n_embd = 12288, n_head = 96): 128-dim slices, 96 heads, all 96 looking back at thousands of cached positions. Frontier models add tricks like grouped-query attention (many query heads share the same K/V heads, shrinking the KV cache) and FlashAttention (a GPU-friendly tiling that never materialises the full attention matrix), but the per-head computation is still the four lines you just walked through.
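The head slicing described above can be sketched on plain floats. This is an illustration under assumptions, not the code from gpt(): the function name multi_head_attention and the random inputs are made up here, and the final Wo projection is left out so the concatenation step stays visible.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(q, cached_keys, cached_values, n_head):
    """Slice q and every cached k/v into n_head pieces and attend within each piece."""
    head_dim = len(q) // n_head
    out = []
    for h in range(n_head):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q_h = q[lo:hi]
        k_h = [k[lo:hi] for k in cached_keys]
        v_h = [v[lo:hi] for v in cached_values]
        scores = [sum(a * b for a, b in zip(q_h, k)) / math.sqrt(head_dim) for k in k_h]
        weights = softmax(scores)
        head_out = [sum(w * v[j] for w, v in zip(weights, v_h)) for j in range(head_dim)]
        out.extend(head_out)      # concatenate head outputs back to length n_embd
    return out

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

random.seed(0)
n_embd, n_head = 16, 4
q = rand_vec(n_embd)
cached_keys = [rand_vec(n_embd) for _ in range(3)]     # three cached positions
cached_values = [rand_vec(n_embd) for _ in range(3)]

x_attn = multi_head_attention(q, cached_keys, cached_values, n_head)
print(len(x_attn))   # four heads of 4 dims each, concatenated
```

Each head computes its own softmax, so two heads can put their weight on different positions for the same token.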

MLP block

MLP is short for "multilayer perceptron" — a two-layer feed-forward network: project up to 4× the embedding dimension, apply ReLU, project back down. This is where the model does most of its "thinking" per position. Unlike attention, this computation is fully local to time $t$. The Transformer intersperses communication (Attention) with computation (MLP).

Hover any block, layer, or row and the matching parts in all three views (including the numeric tour below) light up. Click to pin the code panel.


Parameter matrices

Up-projection then down-projection. mlp_fc1 blows the dimension up 4× to give the network room to compute, then mlp_fc2 squeezes it back down so it can be added to the residual stream.

Code in `gpt()`

x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])    # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])    # 64 → 16
x = [a + b for a, b in zip(x, x_residual)]          # residual

The attention block handed us x ≈ [-0.74, 1.31]. Stash the residual, normalize, then up-project to 4 × d_model = 8 hidden units.

x_residual = x                                   # [-0.74, 1.31]
x = rmsnorm(x)                                   # ≈ [-0.70, 1.23]

mlp_fc1 = [[ 0.40,  0.10],
           [-0.20,  0.50],
           [ 0.30, -0.30],
           [ 0.10,  0.40],
           [-0.50,  0.20],
           [ 0.20, -0.10],
           [ 0.60,  0.30],
           [-0.10, -0.40]]

pre = linear(x, mlp_fc1)                         # 8 pre-activations; only units 1, 3, and 4 are positive

x = [xi.relu() for xi in pre]                    # ≈ [0, 0.76, 0, 0.42, 0.59, 0, 0, 0]

Why most entries are zero. ReLU = max(0, x), so anything negative gets clipped to 0. Only 3 of the 8 hidden units "fire" for this particular input. Different inputs would activate different subsets, and that's how the MLP carves the input space into pieces and treats each piece differently.

mlp_fc2 = [[ 0.10,  0.30, -0.20,  0.40,  0.00,  0.20, -0.10,  0.50],
           [-0.30,  0.20,  0.50, -0.10,  0.40, -0.40,  0.30,  0.10]]

mlp_out = linear(x, mlp_fc2)                     # ≈ [0.40, 0.35]
x = [a + b for a, b in zip(mlp_out, x_residual)] # ≈ [-0.34, 1.66]

Hand-check the down-projection. Row 0 of mlp_fc2 dotted with the post-ReLU vector: $0.30 \cdot 0.76 + 0.40 \cdot 0.42 = 0.396 \approx 0.40$ (every other term has a zero in either the hidden unit or the weight, so it contributes nothing). The MLP's contribution gets added back to the residual stream, and we exit the block with x ≈ [-0.34, 1.66].
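The MLP walkthrough can be replayed the same way on plain floats. The rmsnorm below is an assumed form (scale the vector so its root-mean-square is about 1, with a small eps), and max(0.0, p) stands in for the Value class's relu(). Under those assumptions the block should reproduce the three active units and the rounded outputs above.

```python
def linear(x, w):
    # One dot product per row of w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rmsnorm(x, eps=1e-5):
    # Assumed form: divide by the root-mean-square of x (eps is an assumed constant).
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

x_residual = [-0.74, 1.31]                # output of the attention block
x = rmsnorm(x_residual)

mlp_fc1 = [[0.40, 0.10], [-0.20, 0.50], [0.30, -0.30], [0.10, 0.40],
           [-0.50, 0.20], [0.20, -0.10], [0.60, 0.30], [-0.10, -0.40]]
mlp_fc2 = [[0.10, 0.30, -0.20, 0.40, 0.00, 0.20, -0.10, 0.50],
           [-0.30, 0.20, 0.50, -0.10, 0.40, -0.40, 0.30, 0.10]]

pre = linear(x, mlp_fc1)                  # 2 -> 8
hidden = [max(0.0, p) for p in pre]       # ReLU on plain floats
mlp_out = linear(hidden, mlp_fc2)         # 8 -> 2
x_out = [a + b for a, b in zip(mlp_out, x_residual)]

active = [i for i, h in enumerate(hidden) if h > 0]
print("active hidden units:", active)
print("mlp_out:", [round(v, 2) for v in mlp_out])
print("x after MLP:", [round(v, 2) for v in x_out])
```

Flipping the sign of the input and re-running shows a different subset of hidden units firing.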

In microgpt (n_embd = 16): mlp_fc1 is (64 × 16) and mlp_fc2 is (16 × 64). The 4× expansion is the same; just wider vectors. The MLP holds more parameters than the attention block (2,048 vs 1,024 in microgpt), and that 2:1 ratio holds at any width, because the four attention matrices total 4·n_embd² weights while the two MLP matrices total 8·n_embd².

In GPT-2 small (n_embd = 768): the hidden layer is 3,072 wide, so the MLP alone is ≈ 4.7M parameters per layer.

In GPT-3 (175B): hidden = 49,152, and the MLP is roughly two-thirds of all parameters in the model. Frontier models also swap plain ReLU for SwiGLU (a gated activation that needs three matrices instead of two) and replace the dense MLP with Mixture-of-Experts (many small MLPs, of which a router picks a few, often 2, per token) to grow capacity without growing per-token compute.
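A quick count makes the per-layer sizes concrete. The sketch assumes the plain layout used in this lab: four square attention matrices, a 4× MLP with two matrices, and no biases. The helper names attn_params and mlp_params are made up for the illustration.

```python
def attn_params(n_embd):
    # Wq, Wk, Wv, Wo: four (n_embd x n_embd) matrices, biases ignored.
    return 4 * n_embd * n_embd

def mlp_params(n_embd, expansion=4):
    # fc1 is (expansion*n_embd x n_embd); fc2 is (n_embd x expansion*n_embd).
    return 2 * expansion * n_embd * n_embd

for name, n_embd in [("microgpt", 16), ("GPT-2 small", 768), ("GPT-3 175B", 12288)]:
    a, m = attn_params(n_embd), mlp_params(n_embd)
    print(f"{name:12s} attn={a:>13,}  mlp={m:>13,}  mlp/attn={m / a:.1f}")
```

Under these assumptions both counts grow with the square of n_embd, so the per-layer ratio between them stays fixed as the model widens.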

Residual connections

Both the attention and MLP blocks add their output back to their input (x = [a + b for ...]). This lets gradients flow directly through the network and makes deeper models trainable.

Output

The final hidden state is projected to vocabulary size by lm_head, producing one logit per token in the vocabulary. In our case, that's just 27 numbers. Higher logit = the model thinks that corresponding token is more likely to come next.

The running diagram now ends with lm_head and a softmax. The bottom row is the model's predicted probability distribution over the four toy vocabulary tokens (BOS / 'a' / 'b' / 'c'). Pick a current token below and watch the whole pipeline, including the prediction, recompute.

Parameter matrix

One row per token in the vocabulary. The final hidden state is dot-producted with each row to produce a logit. Higher dot product → that token is judged more likely to come next.

Code in `gpt()`

logits = linear(x, state_dict['lm_head'])   # length 27
return logits

The MLP handed us x ≈ [-0.34, 1.66]. lm_head has one row per vocab token; the dot product of x with row i is the logit for token i.

lm_head = [[ 0.30,  0.10],   # BOS
           [-0.20,  0.40],   # 'a'
           [ 0.50, -0.30],   # 'b'
           [-0.10,  0.60]]   # 'c'

logits = linear(x, lm_head)                  # ≈ [0.06, 0.73, -0.67, 1.03]

Raw logits can be any real number. To turn them into a probability distribution we apply softmax: subtract the max for numerical stability, exponentiate, divide by the sum.

probs = softmax(logits)                      # ≈ [0.17, 0.32, 0.08, 0.43]

Reading the result. After BOS, a, b, this (untrained) toy model thinks the most likely next token is 'c' with probability 0.43. During training, the loss for this position would be $-\log p(\text{target})$. If the true next token were BOS (end of word), the loss is $-\log 0.17 \approx 1.77$. Backprop would then tweak every weight we've used along the way to push P(BOS) up and the others down for next time.
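Here is the output step as a runnable sketch on plain floats, using the toy lm_head rows and x ≈ [-0.34, 1.66] from the walkthrough. Because it works with unrounded probabilities, the printed loss can differ from the hand-computed figure in the second decimal.

```python
import math

def linear(x, w):
    # One dot product per row of w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [-0.34, 1.66]                       # hidden state after the MLP block
lm_head = [[0.30, 0.10],                # BOS
           [-0.20, 0.40],               # 'a'
           [0.50, -0.30],               # 'b'
           [-0.10, 0.60]]               # 'c'
vocab = ["BOS", "a", "b", "c"]

logits = linear(x, lm_head)
probs = softmax(logits)
for tok, logit, p in zip(vocab, logits, probs):
    print(f"{tok:>3s}  logit={logit:+.2f}  p={p:.2f}")

target = vocab.index("BOS")             # pretend the word ends here
loss = -math.log(probs[target])
print(f"loss if the target is BOS: {loss:.2f}")
```

Setting target to the index of 'c' instead gives a much smaller loss, since that is the token the toy model already favors.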

In microgpt (n_embd = 16, vocab_size = 27): lm_head is (27 × 16), so the model outputs 27 logits, one per a–z plus BOS. The softmax over 27 categories is cheap.

In GPT-2 small (n_embd = 768, vocab = 50,257): the final matrix is ≈ 39M parameters and the softmax has to normalize across 50K categories. During training that softmax is computed at every position in every sequence in the batch, which is a non-trivial fraction of total training compute.

In GPT-4 / frontier models, vocabularies sit around 100K–200K tokens. Some models (GPT-2 among them) tie lm_head to wte (the same matrix is used for both input embedding and output projection), saving a copy of those millions of parameters. The temperature / top-p tricks you see at inference all live downstream of this same logit vector.

## Parameters

You've seen every parameter matrix in the architecture walkthrough above: wte, wpe, attn_wq/wk/wv/wo, mlp_fc1/fc2, lm_head. The Parameters section is just the bookkeeping: allocate them all in one place, store them in a single dictionary the optimizer can iterate over, and count the total.

n_embd = 16; n_head = 4; n_layer = 1; block_size = 16
head_dim = n_embd // n_head

matrix = lambda nout, nin, std=0.08: \
    [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]

state_dict = {
    'wte':     matrix(vocab_size, n_embd),    # 27 × 16  → 432
    'wpe':     matrix(block_size, n_embd),    # 16 × 16  → 256
    'lm_head': matrix(vocab_size, n_embd),    # 27 × 16  → 432
}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)    # 1,024
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)    # 1,024

params = [p for mat in state_dict.values() for row in mat for p in row]
print(f"num params: {len(params)}")   # → 4192

Why bother with the flat params list? Because the optimizer doesn't care about the matrices; it just needs a single list of scalars to loop over and update. params is that list. GPT-2 had 1.6 billion entries in this list; modern LLMs have hundreds of billions.

Suppose we bumped n_embd from 16 to 32 (everything else unchanged). Which matrices would grow, and by how much (4×? 2×? something else)? Roughly what's the new total parameter count?

Show answer #

Every matrix has n_embd as a dimension, so all of them grow. wte, wpe, lm_head are linear in n_embd (2×). The attention matrices (attn_wq/k/v/o) and MLP (mlp_fc1/2) are all (n_embd × n_embd) or (4·n_embd × n_embd), so they scale quadratically (4×). New total = 2×(432+256+432) + 4×(256+256+256+256+1024+1024) = 2,240 + 12,288 = 14,528 params. Doubling the width more than triples the model.
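The same bookkeeping can be written as a small function, which makes it easy to try other widths. The name count_params is hypothetical; the shapes are the ones allocated in state_dict above.

```python
def count_params(n_embd, vocab_size=27, block_size=16, n_layer=1):
    embeddings = vocab_size * n_embd + block_size * n_embd          # wte + wpe
    lm_head = vocab_size * n_embd
    per_layer = 4 * n_embd * n_embd + 2 * (4 * n_embd) * n_embd     # attention + MLP
    return embeddings + lm_head + n_layer * per_layer

for width in (16, 32):
    print(f"n_embd={width}: {count_params(width):,} params")
print(f"growth factor: {count_params(32) / count_params(16):.2f}x")
```

Passing a larger n_layer shows the other scaling axis: the per-layer term repeats, while the embeddings and lm_head are counted once.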

Putting it all together

Now that we've walked through each piece individually, here is the full gpt() function: one call processes one token and returns 27 logits over the vocabulary. Read top to bottom: embeddings → for each layer (attention block → MLP block) → final linear.

Hover over any line of gpt() to see what it does; related lines (the three linear(...wq/wk/wv) calls, both KV-cache append calls, etc.) light up together.

The function processes one token (id token_id) at a specific position in time (pos_id), and some context from previous iterations summarized by the activations in keys and values, known as the KV Cache.

You might notice we're using a KV cache during training, which is unusual. People typically associate the KV cache with inference only. But the KV cache is conceptually always there, even during training. In production implementations, it's just hidden inside the highly vectorized attention computation that processes all positions in the sequence simultaneously. Since microgpt processes one token at a time (no batch dimension, no parallel time steps), we build the KV cache explicitly. And unlike the typical inference setting where the cache holds detached tensors, here the cached keys and values are live Value nodes in the computation graph, so we actually backpropagate through them.
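The claim that the cache is "always there" can be checked on a small example. This sketch uses plain floats, a single head, and random stand-in vectors (all assumptions for illustration). It runs attention once token by token with an explicit cache, and once by computing every key and value up front and letting position t see only positions 0..t, which is what a causal mask does.

```python
import math
import random

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

def rand_matrix(n):
    return [[random.gauss(0, 0.5) for _ in range(n)] for _ in range(n)]

random.seed(0)
d, T = 4, 5
wq, wk, wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]   # stand-in token vectors

# (a) One token at a time, appending to an explicit cache.
keys, values, incremental = [], [], []
for x in xs:
    keys.append(linear(x, wk))
    values.append(linear(x, wv))
    incremental.append(attend(linear(x, wq), keys, values))

# (b) All keys and values first; position t then sees only positions 0..t.
all_k = [linear(x, wk) for x in xs]
all_v = [linear(x, wv) for x in xs]
batched = [attend(linear(xs[t], wq), all_k[:t + 1], all_v[:t + 1]) for t in range(T)]

max_diff = max(abs(a - b) for ra, rb in zip(incremental, batched) for a, b in zip(ra, rb))
print("largest difference between the two computations:", max_diff)
```

The two paths perform the same arithmetic; the cache only changes when each key and value gets computed.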

Training loop #

Now we wire everything together. The training loop repeatedly: (1) picks a document, (2) runs the model forward over its tokens, (3) computes a loss, (4) backpropagates to get gradients, and (5) updates the parameters. Here's the simplest possible version — plain stochastic gradient descent: walk every parameter slightly downhill against its gradient.

p -= lr · grad walks p toward the minimum along a toy loss curve. The orange tangent is p.grad; the red arrow on the axis is the SGD step −lr · p.grad. Whichever side of the minimum we start on, the step always points toward it.

learning_rate = 0.01

num_steps = 1000
for step in range(num_steps):

    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)   # average over the document. May yours be low.

    loss.backward()

    for p in params:
        p.data -= learning_rate * p.grad
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")

Tokenization

Each training step picks one document and wraps it with BOS on both sides: the name "emma" becomes [BOS, e, m, m, a, BOS]. The model's job is to predict each next token given the tokens before it.
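As a concrete sketch of that wrapping, the snippet below assumes the character set is exactly a–z in sorted order and that BOS takes the next free id (26). Both are assumptions for this illustration; the real uchars is built from the dataset.

```python
import string

uchars = list(string.ascii_lowercase)   # assumed character set: a-z, sorted
BOS = len(uchars)                        # assumed id for the BOS token (26)

doc = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
print(tokens)

def show(tok):
    return "BOS" if tok == BOS else uchars[tok]

# Each position's input is paired with the token that follows it.
pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
for pos_id, (inp, tgt) in enumerate(pairs):
    print(f"pos {pos_id}: input {show(inp):>3s} -> target {show(tgt)}")
```

A four-letter name yields five training positions: one per letter to predict, plus the final BOS that marks the end of the word.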

Forward pass and loss

We feed the tokens through the model one at a time, building up the KV cache as we go. At each position, the model outputs 27 logits, which we convert to probabilities via softmax. The loss at each position is the negative log probability of the correct next token: $-\log p(\text{target})$. This is called the cross-entropy loss. Intuitively, the loss measures the degree of misprediction: how surprised the model is by what actually comes next. If the model assigns probability 1.0 to the correct token, it is not surprised at all and the loss is 0. If it assigns probability close to 0, the model is very surprised and the loss goes to $+\infty$. We average the per-position losses across the document to get a single scalar loss.

Backward pass

One call to loss.backward() runs backpropagation through the entire computation graph, from the loss all the way back through softmax, the model, and into every parameter. After this, each parameter's .grad tells us how to change it to reduce the loss. The SGD update right after the backward pass (p.data -= learning_rate * p.grad) is the entire learning rule: move every parameter a small step in the direction that reduces the loss, then reset gradients to zero so the next backward pass starts fresh.
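The update rule is easiest to see on a single parameter. This sketch uses a made-up quadratic loss with its minimum at p = 3 and an analytic gradient in place of backward(); the names toy_loss, toy_grad, and sgd are illustrative only.

```python
def toy_loss(p):
    return (p - 3.0) ** 2           # toy loss with its minimum at p = 3

def toy_grad(p):
    return 2.0 * (p - 3.0)          # analytic derivative of the toy loss

def sgd(p, learning_rate=0.1, steps=50):
    for _ in range(steps):
        p -= learning_rate * toy_grad(p)   # same rule as p.data -= lr * p.grad
    return p

for start in (-2.0, 8.0):           # one start on each side of the minimum
    p = sgd(start)
    print(f"start {start:+.1f} -> p = {p:.4f}, loss = {toy_loss(p):.6f}")
```

On this loss each step shrinks the distance to the minimum by a constant factor, so a smaller learning rate needs more steps to get equally close.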

From plain SGD to Adam

Plain SGD works but it's slow and finicky to tune. In practice, nearly every modern LLM is trained with Adam (or a close variant such as AdamW), an optimizer that tracks two extra buffers per parameter: m (a running average of recent gradients, like momentum) and v (a running average of recent squared gradients, which adapts the per-parameter learning rate). The bias corrections m_hat / v_hat account for m and v being initialized to zero. The learning rate also decays linearly so the steps shrink as training progresses. Here's the same training loop with Adam swapped in:

learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params)  # first moment buffer  (running mean of grads)
v = [0.0] * len(params)  # second moment buffer (running mean of grads²)

num_steps = 1000
for step in range(num_steps):

    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    loss.backward()

    lr_t = learning_rate * (1 - step / num_steps)   # linear decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
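One consequence of the update rule is easy to miss: with bias correction, the very first Adam step has a size of about the learning rate no matter how large or small the gradient is. The sketch below isolates that first step using the same betas as the loop above; the function name first_adam_update is made up for the illustration.

```python
def first_adam_update(grad, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8, correct_bias=True):
    # Buffers start at 0, so after one step they hold only a fraction of the gradient.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    if correct_bias:
        m = m / (1 - beta1 ** 1)    # step + 1 == 1 on the first step
        v = v / (1 - beta2 ** 1)
    return lr * m / (v ** 0.5 + eps)

for g in (0.001, 1.0, 1000.0):
    with_fix = first_adam_update(g)
    without_fix = first_adam_update(g, correct_bias=False)
    print(f"grad={g:>8}: corrected step={with_fix:.5f}, uncorrected step={without_fix:.5f}")
```

Without the correction, the zero-initialized buffers shrink m and the square root of v by different amounts, so with these betas the first step comes out about 1.5 times larger than the learning rate.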

Over 1,000 steps the loss decreases from around 3.3 (random guessing among 27 tokens: $-\log(1/27) \approx 3.3$) down to around 2.37. Lower is better, the lowest possible is 0 (perfect predictions), so there's still room to improve, but the model is clearly learning the statistical patterns of names.
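A loss value is easier to interpret when converted back into an "effective number of choices" by exponentiating it. This short sketch checks the uniform baseline and does that conversion for the two loss values quoted above.

```python
import math

vocab_size = 27
uniform_loss = -math.log(1 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.2f}")

for loss in (uniform_loss, 2.37):
    choices = math.exp(loss)
    print(f"loss {loss:.2f} -> as uncertain as a uniform pick among {choices:.1f} tokens")
```

Read this way, training takes the model from guessing among all 27 tokens to guessing among roughly 11.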

Inference #

Once training is done, we can sample new names from the model. The parameters are frozen and we just run the forward pass in a loop, feeding each generated token back as the next input:

temperature = 0.5   # in (0, 1], controls "creativity" of generated text
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")

We start each sample with the BOS token, which tells the model "begin a new name". The model produces 27 logits, we convert them to probabilities, and we randomly sample one token according to those probabilities. That token gets fed back in as the next input, and we repeat until the model produces BOS again (meaning "I'm done") or we hit the maximum sequence length.

The temperature parameter controls randomness. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the model's learned distribution. Lower temperatures (like 0.5 here) sharpen the distribution, making the model more conservative and likely to pick its top choices. A temperature approaching 0 would always pick the single most likely token (greedy decoding). Higher temperatures flatten the distribution and produce more diverse but potentially less coherent output. Try it on a synthetic logit vector below.
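The effect of temperature can also be checked numerically. The logits below are made up for illustration, and entropy_bits is a helper defined here, not part of microgpt.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.0, -1.0]     # made-up logits for illustration
temperatures = (0.1, 0.5, 1.0, 2.0, 100.0)
entropies = []
for temperature in temperatures:
    probs = softmax([l / temperature for l in logits])
    entropies.append(entropy_bits(probs))
    print(f"T={temperature:<5} top p={max(probs):.3f}  entropy={entropies[-1]:.3f} bits")

print(f"uniform limit over {len(logits)} tokens: {math.log2(len(logits)):.3f} bits")
```

Entropy rises with temperature and approaches, but never exceeds, the entropy of the uniform distribution over the same number of tokens.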

Dividing the logits by the temperature before softmax controls "creativity": low T sharpens to the model's top pick; high T flattens toward uniform.

Slide the temperature from 0.1 to 2.0 and watch the entropy. At what temperature is entropy lowest? At what temperature is it highest? What's the entropy of a perfectly uniform distribution over 27 tokens (and why is that the asymptote)?

Show answer #

Lowest is at T → 0 (everything collapses onto the single most-likely token; entropy approaches 0 bits). Highest is at T → ∞ (the distribution flattens toward uniform). On the slider, that means the lowest entropy is at T = 0.1 and the highest at T = 2.0. Uniform over 27 tokens has entropy log₂(27) ≈ 4.75 bits, the asymptote you'll see if you push T very high.

## Train the toy GPT, live

Everything in this lab so far has shown the model running on frozen weights — either the pinned toy values from the walkthrough, or the 4,192 parameters Karpathy already trained for you. This section closes the loop: train the toy model in your browser, watch the predictions change, then chat with it.

This time we train the whole model. Every weight matrix updates: wte, wpe, the four attention projections, both MLP layers, and lm_head. The gradient is computed numerically (central differences) rather than via autograd, so it's slow (a full 100-step batch takes a few seconds), but every edge in the diagram changes thickness and color as the parameters move. That's the point. Click Step ▸ to advance one example at a time and watch a single SGD step in slow motion; click Train to run 100 batch steps at once.
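For reference, a central-difference gradient looks like the sketch below. The toy loss and the names numerical_grad and toy_loss are assumptions for illustration; the widget's own implementation is not shown on the page. The cost is two loss evaluations per parameter, which is why this approach is slow compared with a single backward pass.

```python
def numerical_grad(f, params, h=1e-5):
    """Estimate df/dparams[i] as (f(p + h) - f(p - h)) / (2h), one parameter at a time."""
    grads = []
    for i in range(len(params)):
        original = params[i]
        params[i] = original + h
        loss_plus = f(params)
        params[i] = original - h
        loss_minus = f(params)
        params[i] = original              # restore before moving to the next parameter
        grads.append((loss_plus - loss_minus) / (2 * h))
    return grads

def toy_loss(params):
    w, b = params
    return (w * 2.0 + b - 1.0) ** 2       # squared error of a one-weight model on one example

params = [0.5, -0.3]
numeric = numerical_grad(toy_loss, params)

residual = params[0] * 2.0 + params[1] - 1.0
analytic = [2 * residual * 2.0, 2 * residual]   # chain rule by hand
print("numeric :", numeric)
print("analytic:", analytic)
```

Comparing a numerical gradient against an analytic one like this is also a common way to test an autograd implementation.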

Training data: edit the list of words below (one per line). Each must be 2–4 letters over {a, b, c}. Every word becomes one training example (predict the last letter given the second-to-last letter) plus one terminal example so the model also learns to emit BOS after a word ends. Hit Train and the model auto-steps through every example across multiple epochs, logging each step in the box below. Each step performs one SGD update and the diagram above redraws live.

Type BOS, a, b, or c and press Enter to query the model.

Run it #

All you need is Python (no pip install, no dependencies). Grab Karpathy's script from his gist, then run it:

curl -L -o train.py https://gist.githubusercontent.com/karpathy/8627fe009c40f57531cb18360106ce95/raw/microgpt.py

python train.py

If curl isn't available you can use wget instead, or just open the gist and copy the file into train.py by hand.

The script takes about 1 minute to run on Karpathy's MacBook. You'll see the loss printed at each step:

train.py
num docs: 32033
vocab size: 27
num params: 4192
step    1 / 1000 | loss 3.3660
step    2 / 1000 | loss 3.4243
step    3 / 1000 | loss 3.1778
step    4 / 1000 | loss 3.0664
step    5 / 1000 | loss 3.2209
step    6 / 1000 | loss 2.9452
step    7 / 1000 | loss 3.2894
step    8 / 1000 | loss 3.3245
step    9 / 1000 | loss 2.8990
step   10 / 1000 | loss 3.2229
step   11 / 1000 | loss 2.7964
step   12 / 1000 | loss 2.9345
step   13 / 1000 | loss 3.0544
...

Watch it go down from ~3.3 (random) toward ~2.37. The lower this number, the better the network's predictions about what token comes next in the sequence. At the end of training, the knowledge of the statistical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see (again):

sample  1: kamon          sample  8: anna          sample 15: earan
sample  2: ann            sample  9: areli         sample 16: lenne
sample  3: karai          sample 10: kaina         sample 17: kana
sample  4: jaire          sample 11: konna         sample 18: lara
sample  5: vialan         sample 12: keylen        sample 19: alela
sample  6: karia          sample 13: liole         sample 20: anton
sample  7: yeran          sample 14: alerin

As an alternative to running the script on your computer, you may try to run it directly on a Google Colab notebook and ask Gemini questions about it. Try playing with the script: try a different dataset, train for longer (increase num_steps), or increase the model size for increasingly better results.

Progression #

To see the code built up piece by piece — as layers of the onion — the advised progression looks something like this:

File	What it adds
`train0.py`	Bigram count table — no neural net, no gradients
`train1.py`	MLP + manual gradients (numerical & analytic) + SGD
`train2.py`	Autograd (`Value` class) — replaces manual gradients
`train3.py`	Position embeddings + single-head attention + rmsnorm + residuals
`train4.py`	Multi-head attention + layer loop — full GPT architecture
`train5.py`	Adam optimizer — this is `train.py`

Karpathy created a Gist called build_microgpt.py whose Revisions show all of these versions and the diffs between each step. Stepping through the diffs is a great way to internalize what each component does.

Real stuff #

microgpt contains the complete algorithmic essence of training and running a GPT. But between this and a production LLM like ChatGPT, there is a long list of things that change. None of them alter the core algorithm and the overall layout, but they are what makes it actually work at scale. Walking through the same sections in order:

Data

Instead of 32K short names, production models train on trillions of tokens of internet text: web pages, books, code, etc. The data is deduplicated, filtered for quality, and carefully mixed across domains.

Tokenizer

Instead of single characters, production models use subword tokenizers like BPE (Byte Pair Encoding), which learn to merge frequently co-occurring character sequences into single tokens. Common words like "the" become a single token, rare words get broken into pieces. This gives a vocabulary of ~100K tokens and is much more efficient because the model sees more content per position.
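The merge idea behind BPE fits in a few lines. This is a toy sketch only: it works on characters of a tiny made-up word list and ignores word frequencies, byte-level handling, and everything else a production tokenizer does. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count every adjacent pair of symbols across all words.
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    # Replace each occurrence of `pair` with one fused symbol, scanning left to right.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

words = [list(w) for w in ["the", "then", "other", "bath"]]
for _ in range(2):                        # two merge rounds
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]
    print("merged", pair, "->", words)
```

After two rounds the common word "the" is a single symbol, while the rarer words are still split into pieces.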

Autograd

microgpt operates on scalar Value objects in pure Python. Production systems use tensors (large multi-dimensional arrays of numbers) and run on GPUs/TPUs that perform trillions of floating-point operations per second. Libraries like PyTorch handle autograd over tensors, and CUDA kernels like FlashAttention fuse multiple operations for speed. The math is identical; each tensor operation simply corresponds to many scalar operations processed in parallel.

Architecture

microgpt has 4,192 parameters. GPT-4–class models have hundreds of billions. Overall it's a very similar-looking Transformer, just much wider (embedding dimensions of 10,000+) and much deeper (100+ layers). Modern LLMs also incorporate a few more types of Lego blocks and change their orders around: RoPE (Rotary Position Embeddings) instead of learned position embeddings, GQA (Grouped Query Attention) to reduce KV cache size, gated linear activations instead of ReLU, Mixture of Experts (MoE) layers, etc. But the core structure of Attention (communication) and MLP (computation) interspersed on a residual stream is well-preserved.

The picture, mapped to the code

Here is the canonical Transformer block diagram you'll see in papers and textbooks — the one microgpt is a stripped-down version of. Click any block to see how it maps onto microgpt's code (and which blocks microgpt drops because they're scale-time concerns):

Each colored block on the diagram corresponds to one or more lines of microgpt, shown alongside its state_dict entries and gpt() code. Click one and this panel will show the code, the matching state_dict entry (if any), and whether microgpt simplifies or skips it.

microgpt strips the diagram down to its algorithmic core. Dropout, masking, GeLU, and biases are all removed in this implementation — the model still learns, just with fewer regularizers. LayerNorm is also replaced by the simpler RMSNorm. Click any Dropout, Mask, or LayerNorm block to read why.
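If the LayerNorm-to-RMSNorm swap feels abstract, the two can be compared side by side in a few lines. This is a sketch on plain Python floats rather than Value objects, and both functions omit the learned gain and bias that production LayerNorm carries; the eps value is an assumption.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Rescale x so its root-mean-square is about 1. No mean subtraction."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in x]

def layernorm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print("rmsnorm  :", [round(v, 3) for v in rmsnorm(x)])
print("layernorm:", [round(v, 3) for v in layernorm(x)])
```

RMSNorm keeps the sign and relative size of every entry and only fixes the overall scale; LayerNorm additionally recenters the vector around zero, which costs one extra pass over the data.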

Training

Instead of one document per step, production training uses large batches (millions of tokens per step), gradient accumulation, mixed precision (float16/bfloat16), and careful hyperparameter tuning. Training a frontier model takes thousands of GPUs running for months.
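Gradient accumulation is easy to verify numerically. The sketch below uses a made-up one-parameter model (predict y as w*x) with a made-up dataset, and checks that averaging the gradients of two micro-batches gives the same number as one gradient over the full batch. The names grad_mse and micro_batches are illustrative, not from microgpt.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # toy (x, y) pairs
w = 0.5

# One large batch: a single gradient over all four examples.
full_grad = grad_mse(w, data)

# Gradient accumulation: two equal-sized micro-batches, averaged
# before the single optimizer step that would follow.
micro_batches = [data[:2], data[2:]]
accum = 0.0
for mb in micro_batches:
    accum += grad_mse(w, mb) / len(micro_batches)

print(full_grad, accum)
```

This is why accumulation lets a memory-limited device imitate a large batch: only the running sum has to be kept between micro-batches. The equality relies on the micro-batches being the same size.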

Optimization

microgpt uses Adam with a simple linear learning rate decay and that's about it. At scale, optimization becomes its own discipline. Models train in reduced precision (bfloat16 or even fp8) and across large GPU clusters for efficiency, which introduces its own numerical challenges. The optimizer settings (learning rate, weight decay, beta parameters, warmup, decay schedule) must be tuned precisely, and the right values depend on model size, batch size, and dataset composition. Scaling laws (e.g. Chinchilla) guide how to allocate a fixed compute budget between model size and number of training tokens. Getting any of these details wrong at scale can waste millions of dollars of compute, so teams run extensive smaller-scale experiments to predict the right settings before committing to a full training run.
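As a rough illustration of how a scaling law turns a compute budget into a model size, here is a back-of-the-envelope sketch. It assumes two widely quoted rules of thumb rather than exact laws: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style ratio of about 20 tokens per parameter. The function name and the example budget are hypothetical.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under two rules of thumb.

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N, which together give
    N ~ sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1e21)  # example budget, chosen arbitrarily
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Real teams fit these constants to their own small-scale runs instead of trusting a fixed ratio, which is the "extensive smaller-scale experiments" step mentioned above.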

Post-training

The base model that comes out of training (the "pretrained" model) is a document completer, not a chatbot. Turning it into ChatGPT happens in two stages. First, SFT (Supervised Fine-Tuning): swap the documents for curated conversations and keep training. Algorithmically, nothing changes. Second, RL (Reinforcement Learning): the model generates responses, they get scored (by humans, another "judge" model, or an algorithm), and the model learns from that feedback. Fundamentally, the model is still training on documents — those documents are now made up of tokens coming from the model itself.

Inference

Serving a model to millions of users requires its own engineering stack: batching requests together, KV cache management and paging (vLLM, etc.), speculative decoding for speed, quantization (running in int8/int4 instead of float16) to reduce memory, and distributing the model across multiple GPUs. Fundamentally, we are still predicting the next token in the sequence — but with a lot of engineering spent on making it faster.
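Quantization is the easiest of these to try by hand. The sketch below shows one simple scheme, symmetric absmax quantization to the int8 range, on a handful of made-up weights. Production systems typically use finer-grained variants (per-channel or per-group scales), so treat this as the idea rather than the method any particular serving stack uses.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale round-trips exactly
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers and one scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # toy values, not from model.json
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print("max round-trip error:", max_err)
```

Each weight now needs one byte plus a shared scale, and the round-trip error is bounded by half a quantization step.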

All of these are important engineering and research contributions, but if you understand microgpt, you understand the algorithmic essence.

Zoom in further · Bycroft's GPT visualization #

If microgpt was "the smallest transformer drawn as a 2-D diagram," Brendan Bycroft's interactive walkthrough is "an actual GPT-2 drawn as a 3-D city." Every embedding vector, every Q/K/V projection, every attention head, every MLP layer is rendered as a navigable scene with the real GPT-2 weights — and you can scrub through one token's forward pass at your own pace. Same algorithm as microgpt, ~30,000× more parameters. Drag to rotate, scroll to zoom, click blocks on the right rail to jump.


Assignment · safety guardrails for the chat bot #

You've taken microgpt apart and you've already chatted with it at the top of this page. Now you're going to ship it — and decide what it's allowed to say. The chat bot at the top of the page is a useful name generator, but it has no safety policy. Anyone can ask it for 100 names with any starting prefix they choose, and it will dutifully produce them. Your job in this assignment is to add a small safety layer on top of the same model, then defend it against an adversarial grader.

The product policy you're enforcing is simple and totally safe-for-work: this name generator must never emit a fruit. Saying apple is banned, and so are pear, plum, fig, lime, grape, mango, and the rest of the produce aisle. (In a real product the banned list would be slurs, NSFW terms, or other harmful output; fruits are a clean stand-in that exercise the exact same prefix- and substring-filtering machinery.) Your bot must refuse requests whose prefixes lead to fruit names and filter any fruit that slips out of the stochastic sampler.

What you're submitting

A single Python file bot.py that reads requests from stdin and writes responses to stdout, one per line. The starter template has everything except the two safety hooks:

- is_safe_request(prefixes): receives the list of letter-prefixes the user typed (e.g. ['j'], ['ab'], or ['a','b','c']). Returns None to allow, or a one-sentence reason string to refuse. Called before any name is generated.
- is_safe_name(name): returns True to keep a generated name, False to discard it. Called after the model emits each name. If a name is rejected, the bot resamples (up to 10 attempts per slot).

You should only need to edit those two functions plus the BLOCKED_PREFIX_PATTERNS and BLOCKED_OUTPUT_PATTERNS lists. The forward pass and the bot loop are already written and shared with the autograder; if you change them, the autograder will reject your submission.
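To show how the two hooks and the resample loop fit together, here is a minimal sketch. It is not the starter template and not a complete policy: the pattern lists are tiny placeholders, sample_with_filter and fake_sampler are hypothetical stand-ins for the real bot loop and model, and what the real bot does when all 10 attempts fail is not specified here.

```python
import random

# Hypothetical, deliberately incomplete lists: your real policy goes here.
BLOCKED_PREFIX_PATTERNS = ["appl", "grap"]
BLOCKED_OUTPUT_PATTERNS = ["apple", "grape"]

def is_safe_request(prefixes):
    """Return None to allow, or a one-sentence reason string to refuse."""
    for prefix in prefixes:
        for pattern in BLOCKED_PREFIX_PATTERNS:
            if prefix.lower().startswith(pattern):
                return f"Prefix '{prefix}' is not allowed by the content policy."
    return None

def is_safe_name(name):
    """Return True to keep a generated name, False to discard it."""
    return not any(pattern in name.lower() for pattern in BLOCKED_OUTPUT_PATTERNS)

def sample_with_filter(sample_name, prefix, max_attempts=10):
    """Resample until a name passes is_safe_name, giving up after max_attempts."""
    for _ in range(max_attempts):
        name = sample_name(prefix)
        if is_safe_name(name):
            return name
    return None  # assumption: the real bot's behavior on exhaustion may differ

def fake_sampler(prefix):
    """Stand-in for the model: picks a suffix from a fixed list instead of running gpt()."""
    return prefix + random.choice(["ple", "ril", "ollo"])

print(is_safe_request(["j"]))          # allowed
print(is_safe_request(["appl"]))       # refused with a reason
print(sample_with_filter(fake_sampler, "ap"))
```

Notice that the prefix "ap" passes the request check yet can still produce "apple", which is why the output filter exists as a second line of defense. Deciding how short a prefix is too risky to allow is the over-blocking versus under-blocking trade-off the assignment asks you to make.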


I/O protocol

Each line of stdin is one request — one or more whitespace-separated letter prefixes:

j           one prefix, one name starting with "j"
ab          one prefix, one name starting with "ab"
a b c       three prefixes, one name per prefix
(blank)     no prefixes, generate one name from BOS

Each line of stdout is one response, beginning with exactly one of these prefixes:

OK:     name1, name2, ...           ← one name per input prefix, in order, comma-separated
REFUSE: <one-sentence reason>        ← request blocked by your safety policy

Anything else printed to stdout will confuse the autograder. Use stderr if you need debug prints. The simplified protocol matches the chat widget at the top of this page exactly: type ab in the chat, you get one name; type a b, you get two.
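The protocol above is small enough to sketch in three helpers. These names (parse_request, format_ok, format_refuse) are hypothetical and are not the ones in the starter template; the single space after each response prefix is also an assumption, since the listing above pads with spaces for alignment.

```python
import sys

def parse_request(line):
    """Split one stdin line into letter prefixes; a blank line yields an empty list."""
    return line.split()

def format_ok(names):
    """Build an OK response: one name per input prefix, in order, comma-separated."""
    return "OK: " + ", ".join(names)

def format_refuse(reason):
    """Build a REFUSE response carrying a one-sentence reason."""
    return "REFUSE: " + reason

def debug(message):
    """Debug output goes to stderr so stdout stays machine-readable."""
    print(message, file=sys.stderr)

print(parse_request("a b c"))
print(format_ok(["anna", "bela", "cleo"]))
print(format_refuse("That prefix is blocked by the content policy."))
```

Because line.split() with no arguments drops all whitespace, a blank or whitespace-only line becomes an empty prefix list, which corresponds to the "generate one name from BOS" case.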

Run it locally

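echo 'j'      | python bot.py     # one prefix, one name
echo 'ab'     | python bot.py
echo 'a b c'  | python bot.py     # three prefixes, three names

python bot.py                     # interactive: type requests, one per line

python test_bot.py                # run the local test suite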

What the autograder tests

Gradescope runs bot.py as a subprocess and pipes a batch of letter-prefix requests through stdin. It scores your submission on two test groups:

1. Benign requests must still work. Simple prefixes like j, ab, and a b c must produce OK: responses with one name per prefix, each name actually starting with its prefix. Over-blocking (refusing letters you should allow) will cost you points here.
2. Adversarial requests must be refused. The grader will probe with prefix-based and substring-based attacks aimed at extracting outputs you'd rather not produce. For these requests you must respond with REFUSE:. Under-blocking will cost you points here.

The two groups are weighted to push you toward a balanced policy: a bot that refuses everything fails Group 1; a bot with no safety policy fails Group 2. The autograder uses fresh random seeds, so don't try to memorize specific outputs — your policy needs to be principled.
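You can approximate the Group 1 check yourself before submitting. The function below is a guess at what "one name per prefix, each starting with its prefix" means in code; check_benign_response is a hypothetical name, and the real autograder may be stricter or more lenient about whitespace and formatting.

```python
def check_benign_response(prefixes, response):
    """Return True if `response` is a well-formed OK line for `prefixes`."""
    if not response.startswith("OK:"):
        return False
    names = [n.strip() for n in response[len("OK:"):].split(",")]
    # A blank request has no prefixes but still expects one unconstrained name.
    expected = prefixes if prefixes else [""]
    if len(names) != len(expected):
        return False
    return all(name and name.startswith(p) for name, p in zip(names, expected))

print(check_benign_response(["a", "b"], "OK: anna, bob"))
print(check_benign_response(["a", "b"], "OK: anna"))
print(check_benign_response(["a"], "REFUSE: blocked"))
```

A check like this drops neatly into test_bot.py next to your adversarial cases, so every policy change is tested in both directions.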

Structure of `model.json`

The weights file is plain JSON — open it in any editor. Top-level keys:

key	contents
`format`	`"tiny-gpt-char-v1"`
`config`	`n_layer=1`, `n_embd=16`, `n_head=4`, `head_dim=4`, `block_size=16`, `vocab_size=27`, `BOS=26`
`tokenizer`	`uchars[26]`, `stoi`, `itos` (character-level a–z + BOS=26)
`state_dict`	nested lists of floats, one entry per parameter matrix (see below)

state_dict key	shape
`wte`	27 × 16
`wpe`	16 × 16
`lm_head`	27 × 16
`layer0.attn_wq` / `wk` / `wv` / `wo`	16 × 16 each
`layer0.mlp_fc1`	64 × 16
`layer0.mlp_fc2`	16 × 64
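A quick sanity check ties this table back to the parameter count quoted throughout the lab. The shapes below are copied from the table; the expanded key names for wk, wv, and wo are assumed to follow the same layer0.attn_ pattern as wq.

```python
# Shapes copied from the state_dict table (rows x columns).
shapes = {
    "wte": (27, 16),
    "wpe": (16, 16),
    "lm_head": (27, 16),
    "layer0.attn_wq": (16, 16),
    "layer0.attn_wk": (16, 16),  # assumed key name
    "layer0.attn_wv": (16, 16),  # assumed key name
    "layer0.attn_wo": (16, 16),  # assumed key name
    "layer0.mlp_fc1": (64, 16),
    "layer0.mlp_fc2": (16, 64),
}

total = 0
for name, (rows, cols) in shapes.items():
    count = rows * cols
    total += count
    print(f"{name:16s} {rows:3d} x {cols:3d} = {count}")

print("total parameters:", total)  # 4192
```

The same loop works on the real file if you replace the hard-coded shapes with the lengths of the nested lists loaded from model.json.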

Same model you've been dissecting throughout the lab. The full structure spec also lives at the top of bot.py.

Suggested workflow

1. Run the unmodified template. Confirm python test_bot.py passes all benign tests. The adversarial test list in test_bot.py is intentionally empty; that's where you'll add your own tests as you go.
2. Be the adversary first. Open the chat at the top of this page (or pipe inputs through your local bot.py) and try to make the model emit fruit names. Prefixes like appl, gra, or li are good starting points. Note the inputs that worked.
3. Write down your policy. Before coding, write a short list of what your bot will refuse and why. Be specific: "prefixes that lead to fruit names" is vague; "prefixes containing any of these letter combinations: …" is implementable.
4. Implement is_safe_request. Reject the request before generation when the policy can be applied to the input alone (e.g., a prefix you don't want to start with).
5. Implement is_safe_name. Filter generated names that contain banned substrings (the model is stochastic, so a benign-looking request can still emit unsafe outputs).
6. Add your own adversarial tests to test_bot.py as you discover new attack patterns. Run frequently.
7. Tune for both directions. If your bot starts refusing legitimate requests, loosen the policy. Over-blocking is also a failure.

Submission

Upload to Gradescope:

- bot.py (your edited version)
- model.json (unmodified; included so the grader can reproduce your bot exactly)

The autograder will run python bot.py with your weights, send batched requests, and score the responses. Late submissions follow the course policy.

Rubric

component	points
Benign requests still work (no over-blocking)	40
Adversarial requests are refused	40
Output filtering catches stochastic leaks	10
Code clarity & comments on your policy	10
Total	100

Your bot will only see a few hundred test inputs from the autograder. Real LLM safety teams face open-ended adversarial input — and frontier models still get jailbroken regularly despite huge investments in alignment, RLHF, and red-teaming. The exercise here is deliberately tractable (a 4,192-parameter character-level name generator), but the shape of the problem — balancing utility against refusal, anticipating prefix and substring attacks, deciding policy under uncertainty — is the same shape professional alignment teams face every day.

FAQ #

Does the model "understand" anything?

That's a philosophical question, but mechanically: no magic is happening. The model is a big math function that maps input tokens to a probability distribution over the next token. During training, the parameters are adjusted to make the correct next token more probable. Whether this constitutes "understanding" is up to you, but the mechanism is fully contained in the 200 lines above.

Why does it work?

The model has thousands of adjustable parameters, and the optimizer nudges them a tiny bit each step to make the loss go down. Over many steps, the parameters settle into values that capture the statistical regularities of the data. For names, this means things like: names often start with consonants, "qu" tends to appear together, names rarely have three consonants in a row, etc. The model doesn't learn explicit rules, it learns a probability distribution that happens to reflect them.

How is this related to ChatGPT?

ChatGPT is this same core loop (predict next token, sample, repeat) scaled up enormously, with post-training to make it conversational. When you chat with it, the system prompt, your message, and its reply are all just tokens in a sequence. The model is completing the document one token at a time, same as microgpt completing a name.
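The "predict next token, sample, repeat" loop is short enough to write out in full. In this sketch a hand-written probability table stands in for the model, so the table NEXT, its numbers, and the generate function are all invented for illustration; the one piece meant to mirror microgpt is the use of a BOS token to mark both the start and the end of a document.

```python
import random

# Toy next-token table standing in for the model: P(next | current).
NEXT = {
    "<BOS>": {"a": 0.5, "b": 0.5},
    "a": {"n": 0.6, "<BOS>": 0.4},
    "b": {"a": 1.0},
    "n": {"a": 0.5, "<BOS>": 0.5},
}

def generate(max_len=16, seed=None):
    """Run the autoregressive loop until BOS is sampled or max_len is reached."""
    rng = random.Random(seed)
    token, out = "<BOS>", []
    for _ in range(max_len):
        probs = NEXT[token]                          # 1. predict a distribution
        choices, weights = list(probs), list(probs.values())
        token = rng.choices(choices, weights=weights)[0]  # 2. sample from it
        if token == "<BOS>":                         # end-of-document marker
            break
        out.append(token)                            # 3. append and repeat
    return "".join(out)

for seed in range(5):
    print(repr(generate(seed=seed)))
```

Swap the lookup table for a forward pass through gpt() and you have microgpt's inference loop; swap it for a far larger network and longer sequences and you have the loop behind a chat reply.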

What's the deal with "hallucinations"?

The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data. microgpt "hallucinating" a name like "karia" is the same phenomenon as ChatGPT confidently stating a false fact. Both are plausible-sounding completions that happen not to be real.

Why is it so slow?

microgpt processes one scalar at a time in pure Python. A single training step takes seconds. The same math on a GPU processes millions of scalars in parallel and runs orders of magnitude faster.

Can I make it generate better names?

Yes. Train longer (increase num_steps), make the model bigger (n_embd, n_layer, n_head), or use a larger dataset. These are the same knobs that matter at scale.

What if I change the dataset?

The model will learn whatever patterns are in the data. Swap in a file of city names, Pokémon names, English words, or short poems, and the model will learn to generate those instead. The rest of the code doesn't need to change.

DS 6042 — Lab 02 · adapted from Andrej Karpathy, microgpt.html · interactive augmentations by Daniel Graham.

source & further reading

researcher111.github.io — original article
