93. GPT: The Model That Predicts the Next Word Forever

GPT models are decoder-only transformers that generate text by predicting the next token one at a time, conditioning each new prediction on all previous tokens. Unlike BERT, which reads entire sequences at once, GPT's autoregressive approach forces it to learn grammar, facts, and reasoning patterns from training on billions of tokens. The article also provides a minimal PyTorch implementation of a GPT-like model, including causal self-attention and token-by-token generation.

BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.

That difference sounds limiting. It's not. When you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.

GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.

What You'll Learn Here

- How autoregressive generation works step by step
- What temperature does to output randomness
- Greedy, top-k, and top-p (nucleus) sampling explained
- Building a character-level GPT from scratch
- Using HuggingFace GPT-2 for text generation
- What makes GPT different from BERT and when to use which

Autoregressive Generation: The Core Idea

GPT generates text one token at a time. Each new token is conditioned on all previous tokens.

```
Step 1: Input: "The cat"             Predict next token → "sat" (highest probability)
Step 2: Input: "The cat sat"         Predict next token → "on"
Step 3: Input: "The cat sat on"      Predict next token → "the"
Step 4: Input: "The cat sat on the"  Predict next token → "mat"
...continues until EOS token or max length
```

At each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


# Minimal decoder-only transformer (from Post 91)
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        # Causal mask registered as a buffer (moves with the model, not trained)
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.W_qkv(x).chunk(3, dim=-1)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        # Block attention to future positions
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = self.drop(F.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)


class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, max_len=256, dropout=0.1):
        super().__init__()
        # max_len is passed through so the mask always covers the model's context
        self.attn = CausalSelfAttention(d_model, n_heads, max_len=max_len, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm (modern GPT style)
        x = x + self.ff(self.ln2(x))
        return x


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4,
                 d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, max_len=max_len, dropout=dropout)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len
        # Weight tying: token embedding and output head share weights
        self.head.weight = self.token_emb.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop context to max_len
            idx_cond = idx[:, -self.max_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last position only
            # Apply temperature
            logits = logits / temperature
            # Apply top-k: mask everything below the k-th largest logit
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx


# Show model size
model = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")
```

Output:

```
MiniGPT parameters: 834,432
```

Training on Character-Level Shakespeare

Let's train MiniGPT on Shakespeare text. Character-level means each character is a token.
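Before working with the full corpus, a toy string makes "each character is a token" concrete. This is only an illustrative sketch: `toy_text`, `toy_stoi`, `toy_itos` and the other `toy_` names are made up for this example and are not part of the training code that follows. It shows characters mapping to integer IDs, and how the training target is simply the input shifted by one position.

```python
# Toy illustration of character-level tokenization (all toy_* names are hypothetical).
toy_text = "hello"

# Vocabulary: every unique character, sorted for a stable ordering.
toy_chars = sorted(set(toy_text))                    # ['e', 'h', 'l', 'o']
toy_stoi = {c: i for i, c in enumerate(toy_chars)}   # char -> id
toy_itos = {i: c for c, i in toy_stoi.items()}       # id -> char

# Encode: one integer per character.
toy_ids = [toy_stoi[c] for c in toy_text]
print(toy_ids)  # [1, 0, 2, 2, 3]

# Next-token targets are the same sequence shifted left by one.
toy_x = toy_ids[:-1]
toy_y = toy_ids[1:]
for pos, y_id in enumerate(toy_y):
    # The model sees everything up to and including position `pos`.
    context = "".join(toy_itos[i] for i in toy_x[:pos + 1])
    print(f"context {context!r} -> target {toy_itos[y_id]!r}")

# Decoding reverses the mapping exactly.
toy_decoded = "".join(toy_itos[i] for i in toy_ids)
print(toy_decoded)  # hello
```

The real dataset does the same thing at scale: a vocabulary built from every distinct character in the text, and input/target windows that differ by a one-character shift.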
```python
import requests
import torch
from torch.utils.data import Dataset, DataLoader

# Download Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Total characters: {len(text):,}")
print(f"Sample:\n{text[:200]}")

# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")

stoi = {c: i for i, c in enumerate(chars)}  # char to index
itos = {i: c for i, c in enumerate(chars)}  # index to char
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# Encode full dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Encoded length: {len(data):,} tokens")

# Train/val split
n_train = int(0.9 * len(data))
train_data = data[:n_train]
val_data = data[n_train:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens: {len(val_data):,}")
```

```python
# Dataset: each sample is a block of characters, and the target is the
# same block shifted one character to the right.
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y


block_size = 128
train_set = CharDataset(train_data, block_size)
val_set = CharDataset(val_data, block_size)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)
print(f"Training batches: {len(train_loader)}")
```

```python
import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MiniGPT(
    vocab_size=vocab_size, d_model=128, n_heads=4,
    n_layers=4, d_ff=512, max_len=block_size
).to(device)

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)


def evaluate(model, loader, max_batches=20):
    """Average loss over at most max_batches batches."""
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            x, y = x.to(device), y.to(device)
            _, loss = model(x, y)
            total_loss += loss.item()
    return total_loss / min(max_batches, len(loader))


print(f"Training on: {device}")
print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}")
print("-" * 35)

for epoch in range(1, 6):
    model.train()
    train_loss = 0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    val_loss = evaluate(model, val_loader)
    scheduler.step()
    print(f"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}")
```

Output:

```
Training on: cuda
Epoch    Train Loss   Val Loss
-----------------------------------
1        2.8341       2.6123
2        2.1045       2.0843
3        1.8921       1.9104
4        1.7632       1.8231
5        1.6891       1.7843
```

Temperature: Controlling Randomness

Temperature is the most important generation parameter. It divides the logits before softmax, so values below 1 sharpen the distribution and values above 1 flatten it.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Example logits for 5 tokens: A, B, C, D, E
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])
temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
vocab = ['A', 'B', 'C', 'D', 'E']

fig, axes = plt.subplots(1, 5, figsize=(15, 4))
for ax, temp in zip(axes, temperatures):
    probs = F.softmax(logits / temp, dim=0).numpy()
    bars = ax.bar(vocab, probs,
                  color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])
    ax.set_title(f'temp={temp}')
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability' if temp == 0.1 else '')
    for bar, prob in zip(bars, probs):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02,
                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)
plt.tight_layout()
plt.savefig('temperature_effect.png', dpi=100)
plt.show()

print(f"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}")
print("-" * 55)
for temp in temperatures:
    probs = F.softmax(logits / temp, dim=0)
    print(f"{temp:<8} " + " ".join(f"{p.item():<10.4f}" for p in probs))
```

Output:

```
Temp     P(A)       P(B)       P(C)       P(D)       P(E)
-------------------------------------------------------
0.1      1.0000     0.0000     0.0000     0.0000     0.0000
0.5      0.9368     0.0466     0.0115     0.0042     0.0009
1.0      0.6986     0.1559     0.0774     0.0470     0.0211
1.5      0.5374     0.1977     0.1240     0.0888     0.0521
2.0      0.4468     0.2110     0.1487     0.1158     0.0776
```

Temperature = 0.1: extremely peaked, picks "A" essentially every time. Nearly deterministic, and repetitive in practice.

Temperature = 1.0: the model's original distribution. Balanced randomness.

Temperature = 2.0: much flatter. "A" drops below 50% and unlikely tokens get sampled far more often, which tends to produce incoherent text.

Good range for creative writing: 0.7 to 1.0. For code or factual tasks: 0.2 to 0.5.

Sampling Strategies

Greedy: always pick the highest probability token. Fast. Repetitive. Boring.

Top-k: only consider the k highest probability tokens. Sample from those.

Top-p (nucleus sampling): consider the smallest set of tokens whose cumulative probability exceeds p. The number of candidate tokens adapts to how confident the model is.

```python
import torch
import torch.nn.functional as F


def greedy_sample(logits):
    return torch.argmax(logits, dim=-1)


def top_k_sample(logits, k=50, temperature=1.0):
    logits = logits / temperature
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = F.softmax(top_k_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return top_k_indices[chosen]


def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Remove tokens with cumulative prob above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift right so the token that crosses the threshold is kept,
    # and at least one token always survives
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False
    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return sorted_indices[chosen]


# Demonstrate on example logits
logits_example = torch.randn(100)  # 100-token vocabulary

greedy_choice = greedy_sample(logits_example)
topk_choice = top_k_sample(logits_example, k=10)
topp_choice = top_p_sample(logits_example, p=0.9)

print(f"Greedy picked token: {greedy_choice.item()}")
print(f"Top-k (k=10) picked: {topk_choice.item()}")
print(f"Top-p (p=0.9) picked: {topp_choice.item()}")

# How many tokens qualify for top-p at p=0.9?
sorted_logits, _ = torch.sort(logits_example, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
n_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1
print(f"\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100")
```

Generating Text With Our MiniGPT

```python
def generate_text(model, prompt, max_new_tokens=200,
                  temperature=0.8, top_k=40, device='cpu'):
    model.eval()
    # Encode prompt
    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)
    # Generate
    with torch.no_grad():
        generated = model.generate(
            context,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )
    # Decode
    generated_tokens = generated[0].tolist()
    return decode(generated_tokens)


# Try different temperatures
print("=" * 60)
print("LOW TEMPERATURE (0.3) - Conservative and repetitive:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.3, top_k=10, device=device))

print("\n" + "=" * 60)
print("MEDIUM TEMPERATURE (0.8) - Balanced:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.8, top_k=40, device=device))

print("\n" + "=" * 60)
print("HIGH TEMPERATURE (1.5) - Chaotic:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=1.5, top_k=None, device=device))
```

Output (after 5 epochs on Shakespeare):

```
============================================================
LOW TEMPERATURE (0.3) - Conservative and repetitive:
============================================================
HAMLET:
I will not be the good the good the good the good the good the good the good the good...

============================================================
MEDIUM TEMPERATURE (0.8) - Balanced:
============================================================
HAMLET:
I have been a man of the king and speak
The lord, and the great heart of the lord
That I am not the death of the lord...

============================================================
HIGH TEMPERATURE (1.5) - Chaotic:
============================================================
HAMLET:
Vxqo zj kin, thae wath gof amd jek lpe mhek ther whi...
```

Low temperature: repetitive but coherent. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs. Train longer and the quality improves dramatically.

Using GPT-2 With HuggingFace

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Load GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
# GPT-2 has no pad token by default; reuse the end-of-text token
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Generate with different strategies, reusing the model and tokenizer loaded above
generator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)

prompt = "The future of artificial intelligence is"

print("GREEDY (do_sample=False):")
result = generator(prompt, max_new_tokens=50, do_sample=False)
print(result[0]['generated_text'])

print("\nTOP-K SAMPLING (k=50, temp=0.9):")
result = generator(prompt, max_new_tokens=50, do_sample=True,
                   top_k=50, temperature=0.9)
print(result[0]['generated_text'])

print("\nNUCLEUS SAMPLING (top_p=0.9):")
result = generator(prompt, max_new_tokens=50, do_sample=True,
                   top_p=0.9, temperature=0.8)
print(result[0]['generated_text'])
```

Manual GPT-2 Generation With Full Control

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = "Once upon a time in a land far away"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

print(f"Prompt tokens: {input_ids.shape[1]}")
print(f"Prompt: '{prompt}'\n")

# Generate step by step and show probabilities
current_ids = input_ids.clone()

for step in range(5):
    with torch.no_grad():
        outputs = model(current_ids)
    logits = outputs.logits[:, -1, :]  # last position

    # Get top 5 candidates
    probs = torch.softmax(logits, dim=-1)
    top5_probs, top5_ids = torch.topk(probs, 5)

    print(f"Step {step+1} - Top 5 candidates:")
    for prob, token_id in zip(top5_probs[0], top5_ids[0]):
        token_str = tokenizer.decode(token_id.item())
        print(f"  '{token_str}' : {prob.item():.4f}")

    # Pick top token (greedy)
    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)
    current_ids = torch.cat([current_ids, next_token], dim=1)
    print(f"  - Picked: '{tokenizer.decode(next_token.item())}'\n")

final_text = tokenizer.decode(current_ids[0])
print(f"Final: '{final_text}'")
```

Output:

```
Prompt tokens: 9
Prompt: 'Once upon a time in a land far away'

Step 1 - Top 5 candidates:
  ',' : 0.2341
  'there' : 0.1823
  'called' : 0.0912
  'from' : 0.0634
  'where' : 0.0521
  - Picked: ','

Step 2 - Top 5 candidates:
  'there' : 0.3412
  'a' : 0.1234
  'the' : 0.0891
  'an' : 0.0432
  'people' : 0.0321
  - Picked: 'there'

...

Final: 'Once upon a time in a land far away, there was a'
```

What GPT Learns by Predicting the Next Word

This seems like a simple task. It's not. To predict the next word well, the model must learn:

- Grammar: what word types follow others
- Facts: "The capital of France is..." → "Paris"
- Reasoning: "If A > B and B > C, then A > ..." → "C"
- Style: given "HAMLET:", continue in Shakespearean style
- Code: given `def fibonacci(n):`, complete correctly
- Math: "2 + 2 = " → "4"

None of these were explicitly taught. They emerged from predicting tokens. This is called emergent behavior and it's why scaling up GPT surprised everyone.

Quick Cheat Sheet

| Concept | What it means |
|---|---|
| Autoregressive | Generate one token at a time, feed it back into the input |
| Temperature | Higher = more random, lower = more deterministic |
| Greedy | Always pick the highest-probability token. Repetitive. |
| Top-k | Sample from the top k tokens only |
| Top-p (nucleus) | Sample from the smallest set with cumulative prob > p |
| Perplexity | exp(average cross-entropy loss) of a language model: lower = better |
| Weight tying | Embedding and output head share weights |
| Pre-norm | LayerNorm before attention and feed-forward (modern GPT), more stable training |

| Task | Code |
|---|---|
| Load GPT-2 | `GPT2LMHeadModel.from_pretrained('gpt2')` |
| Quick generation | `pipeline('text-generation', model='gpt2')` |
| Control randomness | `temperature=0.8, top_k=50, top_p=0.9` |
| Stop at end-of-text token | `eos_token_id=tokenizer.eos_token_id` |
| Greedy | `do_sample=False` |
| Sampling | `do_sample=True` |

Practice Challenges

Level 1: Use the `pipeline('text-generation')` with GPT-2. Generate from the same prompt 5 times with `temperature=0.9` (and `do_sample=True`). Compare the outputs. Now do it with `temperature=0.1`. How different are the results?

Level 2: Train MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. How many epochs until the samples look like the training data?

Level 3: Implement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?

References

- GPT-1 paper: Improving Language Understanding by Generative Pre-Training https://openai.com/research/language-unsupervised
- GPT-2 paper: Language Models are Unsupervised Multitask Learners https://openai.com/research/better-language-models
- Andrej Karpathy: nanoGPT GitHub https://github.com/karpathy/nanoGPT
- HuggingFace: GPT-2 docs https://huggingface.co/docs/transformers/model_doc/gpt2
- The Scaling Laws paper https://arxiv.org/abs/2001.08361

Next up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.