microgpt `microgpt.py`, a minimal, dependency-free Python implementation of a GPT (Generative Pre-trained Transformer) model for training and inference. It includes a complete autograd system, a character-level tokenizer, and a transformer with one layer, an embedding dimension of 16, and a context window of 16 tokens. The code is designed to be the most atomic version of the algorithm, prioritizing clarity over efficiency. """ The most atomic way to train and run inference for a GPT in pure, dependency-free Python. This file is the complete algorithm. Everything else is just efficiency. @karpathy """ import os os.path.exists import math math.log, math.exp import random random.seed, random.choices, random.gauss, random.shuffle random.seed 42 Let there be order among chaos Let there be a Dataset docs : list str of documents e.g. a list of names if not os.path.exists 'input.txt' : import urllib.request names url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt' urllib.request.urlretrieve names url, 'input.txt' docs = line.strip for line in open 'input.txt' if line.strip random.shuffle docs print f"num docs: {len docs }" Let there be a Tokenizer to translate strings to sequences of integers "tokens" and back uchars = sorted set ''.join docs unique characters in the dataset become token ids 0..n-1 BOS = len uchars token id for a special Beginning of Sequence BOS token vocab size = len uchars + 1 total number of unique tokens, +1 is for BOS print f"vocab size: {vocab size}" Let there be Autograd to recursively apply the chain rule through a computation graph class Value: slots = 'data', 'grad', ' children', ' local grads' Python optimization for memory usage def init self, data, children= , local grads= : self.data = data scalar value of this node calculated during forward pass self.grad = 0 derivative of the loss w.r.t. this node, calculated in backward pass self. children = children children of this node in the computation graph self. local grads = local grads local derivative of this node w.r.t. its children def add self, other : other = other if isinstance other, Value else Value other return Value self.data + other.data, self, other , 1, 1 def mul self, other : other = other if isinstance other, Value else Value other return Value self.data other.data, self, other , other.data, self.data def pow self, other : return Value self.data other, self, , other self.data other-1 , def log self : return Value math.log self.data , self, , 1/self.data, def exp self : return Value math.exp self.data , self, , math.exp self.data , def relu self : return Value max 0, self.data , self, , float self.data 0 , def neg self : return self -1 def radd self, other : return self + other def sub self, other : return self + -other def rsub self, other : return other + -self def rmul self, other : return self other def truediv self, other : return self other -1 def rtruediv self, other : return other self -1 def backward self : topo = visited = set def build topo v : if v not in visited: visited.add v for child in v. children: build topo child topo.append v build topo self self.grad = 1 for v in reversed topo : for child, local grad in zip v. children, v. local grads : child.grad += local grad v.grad Initialize the parameters, to store the knowledge of the model n layer = 1 depth of the transformer neural network number of layers n embd = 16 width of the network embedding dimension block size = 16 maximum context length of the attention window note: the longest name is 15 characters n head = 4 number of attention heads head dim = n embd // n head derived dimension of each head matrix = lambda nout, nin, std=0.08: Value random.gauss 0, std for in range nin for in range nout state dict = {'wte': matrix vocab size, n embd , 'wpe': matrix block size, n embd , 'lm head': matrix vocab size, n embd } for i in range n layer : state dict f'layer{i}.attn wq' = matrix n embd, n embd state dict f'layer{i}.attn wk' = matrix n embd, n embd state dict f'layer{i}.attn wv' = matrix n embd, n embd state dict f'layer{i}.attn wo' = matrix n embd, n embd state dict f'layer{i}.mlp fc1' = matrix 4 n embd, n embd state dict f'layer{i}.mlp fc2' = matrix n embd, 4 n embd params = p for mat in state dict.values for row in mat for p in row flatten params into a single list Value print f"num params: {len params }" Define the model architecture: a function mapping tokens and parameters to logits over what comes next Follow GPT-2, blessed among the GPTs, with minor differences: layernorm - rmsnorm, no biases, GeLU - ReLU def linear x, w : return sum wi xi for wi, xi in zip wo, x for wo in w def softmax logits : max val = max val.data for val in logits exps = val - max val .exp for val in logits total = sum exps return e / total for e in exps def rmsnorm x : ms = sum xi xi for xi in x / len x scale = ms + 1e-5 -0.5 return xi scale for xi in x def gpt token id, pos id, keys, values : tok emb = state dict 'wte' token id token embedding pos emb = state dict 'wpe' pos id position embedding x = t + p for t, p in zip tok emb, pos emb joint token and position embedding x = rmsnorm x note: not redundant due to backward pass via the residual connection for li in range n layer : 1 Multi-head Attention block x residual = x x = rmsnorm x q = linear x, state dict f'layer{li}.attn wq' k = linear x, state dict f'layer{li}.attn wk' v = linear x, state dict f'layer{li}.attn wv' keys li .append k values li .append v x attn = for h in range n head : hs = h head dim q h = q hs:hs+head dim k h = ki hs:hs+head dim for ki in keys li v h = vi hs:hs+head dim for vi in values li attn logits = sum q h j k h t j for j in range head dim / head dim 0.5 for t in range len k h attn weights = softmax attn logits head out = sum attn weights t v h t j for t in range len v h for j in range head dim x attn.extend head out x = linear x attn, state dict f'layer{li}.attn wo' x = a + b for a, b in zip x, x residual 2 MLP block x residual = x x = rmsnorm x x = linear x, state dict f'layer{li}.mlp fc1' x = xi.relu for xi in x x = linear x, state dict f'layer{li}.mlp fc2' x = a + b for a, b in zip x, x residual logits = linear x, state dict 'lm head' return logits Let there be Adam, the blessed optimizer and its buffers learning rate, beta1, beta2, eps adam = 0.01, 0.85, 0.99, 1e-8 m = 0.0 len params first moment buffer v = 0.0 len params second moment buffer Repeat in sequence num steps = 1000 number of training steps for step in range num steps : Take single document, tokenize it, surround it with BOS special token on both sides doc = docs step % len docs tokens = BOS + uchars.index ch for ch in doc + BOS n = min block size, len tokens - 1 Forward the token sequence through the model, building up the computation graph all the way to the loss keys, values = for in range n layer , for in range n layer losses = for pos id in range n : token id, target id = tokens pos id , tokens pos id + 1 logits = gpt token id, pos id, keys, values probs = softmax logits loss t = -probs target id .log losses.append loss t loss = 1 / n sum losses final average loss over the document sequence. May yours be low. Backward the loss, calculating the gradients with respect to all model parameters loss.backward Adam optimizer update: update the model parameters based on the corresponding gradients lr t = learning rate 1 - step / num steps linear learning rate decay for i, p in enumerate params : m i = beta1 m i + 1 - beta1 p.grad v i = beta2 v i + 1 - beta2 p.grad 2 m hat = m i / 1 - beta1 step + 1 v hat = v i / 1 - beta2 step + 1 p.data -= lr t m hat / v hat 0.5 + eps adam p.grad = 0 print f"step {step+1:4d} / {num steps:4d} | loss {loss.data:.4f}", end='\r' Inference: may the model babble back to us temperature = 0.5 in 0, 1 , control the "creativity" of generated text, low to high print "\n--- inference new, hallucinated names ---" for sample idx in range 20 : keys, values = for in range n layer , for in range n layer token id = BOS sample = for pos id in range block size : logits = gpt token id, pos id, keys, values probs = softmax l / temperature for l in logits token id = random.choices range vocab size , weights= p.data for p in probs 0 if token id == BOS: break sample.append uchars token id print f"sample {sample idx+1:2d}: {''.join sample }"