# 93. GPT: The Model That Predicts the Next Word Forever

> Source: <https://dev.to/yakhilesh/93-gpt-the-model-that-predicts-the-next-word-forever-2g17>
> Published: 2026-05-21 06:07:33+00:00

BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.

That difference sounds limiting. It's not.

When you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.

GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.

### What You'll Learn Here

- How autoregressive generation works step by step
- What temperature does to output randomness
- Greedy, top-k, top-p (nucleus) sampling explained
- Building a character-level GPT from scratch
- Using HuggingFace GPT-2 for text generation
- What makes GPT different from BERT and when to use which

### Autoregressive Generation: The Core Idea

GPT generates text one token at a time. Each new token is conditioned on all previous tokens.

```
Step 1: Input: "The cat"
        Predict next token → "sat" (highest probability)

Step 2: Input: "The cat sat"
        Predict next token → "on" 

Step 3: Input: "The cat sat on"
        Predict next token → "the"

Step 4: Input: "The cat sat on the"
        Predict next token → "mat"

...continues until [EOS] token or max length
```

At each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.
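
Before the full model, here is a minimal sketch of that pick, feed back, repeat loop. It assumes a made-up five-token vocabulary and a hand-written logit table (`next_token_logits`, hypothetical) in place of a trained network. Unlike GPT, this toy only looks at the last token, not the whole context, but the loop has the same shape.

``` python
import torch

# Hypothetical 5-token vocabulary and a hand-written stand-in for a model:
# row i holds the logits for whichever token follows token i.
vocab = ["The", "cat", "sat", "down", "[EOS]"]
next_token_logits = torch.tensor([
    [0.0, 4.0, 0.0, 0.0, 0.0],   # after "The"   -> mostly "cat"
    [0.0, 0.0, 4.0, 0.0, 0.0],   # after "cat"   -> mostly "sat"
    [0.0, 0.0, 0.0, 4.0, 0.0],   # after "sat"   -> mostly "down"
    [0.0, 0.0, 0.0, 0.0, 4.0],   # after "down"  -> mostly "[EOS]"
    [0.0, 0.0, 0.0, 0.0, 4.0],   # after "[EOS]" (never reached)
])

def toy_generate(start_id, max_new_tokens=10, greedy=True):
    ids = [start_id]
    for _ in range(max_new_tokens):
        logits = next_token_logits[ids[-1]]       # 1. score every vocabulary entry
        probs = torch.softmax(logits, dim=-1)     # 2. turn scores into a distribution
        if greedy:
            next_id = int(torch.argmax(probs))    # 3a. pick the most likely token
        else:
            next_id = int(torch.multinomial(probs, num_samples=1))  # 3b. or sample
        ids.append(next_id)                       # 4. feed it back in
        if vocab[next_id] == "[EOS]":             # 5. stop at end-of-sequence
            break
    return ids

print(" ".join(vocab[i] for i in toy_generate(0)))
```

With `greedy=False` the same loop samples from the distribution instead, so repeated runs can differ.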

``` python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Minimal decoder-only transformer (from Post 91)
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads

        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o   = nn.Linear(d_model, d_model)
        self.drop  = nn.Dropout(dropout)

        # Causal mask registered as buffer
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.W_qkv(x).chunk(3, dim=-1)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]

        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn   = self.drop(F.softmax(scores, dim=-1))

        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)

class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads, dropout=dropout)
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm (modern GPT style)
        x = x + self.ff(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.drop      = nn.Dropout(dropout)
        self.blocks    = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f  = nn.LayerNorm(d_model)
        self.head  = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len

        # Weight tying: token embedding and output head share weights
        self.head.weight = self.token_emb.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos  = torch.arange(T, device=idx.device)

        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)   # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop context to max_len
            idx_cond = idx[:, -self.max_len:]

            logits, _ = self(idx_cond)
            logits     = logits[:, -1, :]  # last position only

            # Apply temperature
            logits = logits / temperature

            # Apply top-k
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            probs     = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx        = torch.cat([idx, next_token], dim=1)

        return idx

# Show model size
model = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")
```

Output:

```
MiniGPT parameters: 834,432
```
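
To see where those parameters come from, the count can be reproduced with plain arithmetic. This is a sketch that assumes the exact layout of the `MiniGPT` class above; `mini_gpt_param_count` is a hypothetical helper, not part of PyTorch. The causal mask is a buffer, so it is not counted, and the number of heads does not change the total.

``` python
def mini_gpt_param_count(vocab_size, d_model, n_layers, d_ff, max_len, tie_weights=True):
    """Count trainable parameters for the MiniGPT layout (hypothetical helper)."""
    def linear(n_in, n_out, bias=True):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * d_model                                        # scale + shift

    attn  = linear(d_model, 3 * d_model) + linear(d_model, d_model)  # W_qkv + W_o
    ff    = linear(d_model, d_ff) + linear(d_ff, d_model)            # two feed-forward layers
    block = attn + ff + 2 * layer_norm                               # ln1 + ln2

    embeddings = vocab_size * d_model + max_len * d_model            # token + position
    # With weight tying the output head reuses the token embedding matrix
    head = 0 if tie_weights else linear(d_model, vocab_size, bias=False)

    return embeddings + n_layers * block + layer_norm + head         # + final ln_f

tied   = mini_gpt_param_count(65, 128, 4, 512, 256, tie_weights=True)
untied = mini_gpt_param_count(65, 128, 4, 512, 256, tie_weights=False)
print(f"tied: {tied:,}  untied: {untied:,}  saved by tying: {untied - tied:,}")
```

At this scale almost all parameters sit in the transformer blocks. Weight tying only saves `vocab_size * d_model` weights, which matters far more with a 50,000-token vocabulary than with 65 characters.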

### Training on Character-Level Shakespeare

Let's train MiniGPT on Shakespeare text. Character-level means each character is a token.
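
As a small illustration of "each character is a token", and of how next-token training pairs are built, here is a sketch on a toy string. All `toy_*` names are made up for this example. The target sequence is the input shifted one position to the right.

``` python
toy_text = "hello world"

# Vocabulary: one integer id per distinct character
toy_chars = sorted(set(toy_text))
toy_stoi  = {c: i for i, c in enumerate(toy_chars)}
toy_itos  = {i: c for c, i in toy_stoi.items()}

ids = [toy_stoi[c] for c in toy_text]

# One training example: inputs x and targets y are offset by one position
toy_block = 4
x = ids[0:toy_block]
y = ids[1:toy_block + 1]

# Every position in the block is its own prediction problem
for t in range(toy_block):
    context = "".join(toy_itos[i] for i in x[:t + 1])
    target  = toy_itos[y[t]]
    print(f"context {context!r} -> target {target!r}")
```

So a single block of length 4 gives four prediction problems at once, which is why the model can compute the loss over every position in one forward pass.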

``` python
import requests
import torch
from torch.utils.data import Dataset, DataLoader

# Download Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Total characters: {len(text):,}")
print(f"Sample:\n{text[:200]}")
# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")

stoi = {c: i for i, c in enumerate(chars)}  # char to index
itos = {i: c for i, c in enumerate(chars)}  # index to char

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# Encode full dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Encoded length: {len(data):,} tokens")

# Train/val split
n_train = int(0.9 * len(data))
train_data = data[:n_train]
val_data   = data[n_train:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens:   {len(val_data):,}")
```

``` python
# Dataset
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data       = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y

block_size  = 128
train_set   = CharDataset(train_data, block_size)
val_set     = CharDataset(val_data, block_size)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=64, shuffle=False)

print(f"Training batches: {len(train_loader)}")
```

``` python
import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model  = MiniGPT(
    vocab_size=vocab_size,
    d_model=128,
    n_heads=4,
    n_layers=4,
    d_ff=512,
    max_len=block_size
).to(device)

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

def evaluate(model, loader, max_batches=20):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            x, y = x.to(device), y.to(device)
            _, loss = model(x, y)
            total_loss += loss.item()
    return total_loss / min(max_batches, len(loader))

print(f"Training on: {device}")
print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}")
print("-" * 35)

for epoch in range(1, 6):
    model.train()
    train_loss = 0

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)
    val_loss    = evaluate(model, val_loader)
    scheduler.step()

    print(f"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}")
```

Output:

```
Training on: cuda
Epoch    Train Loss   Val Loss
-----------------------------------
1        2.8341       2.6123
2        2.1045       2.0843
3        1.8921       1.9104
4        1.7632       1.8231
5        1.6891       1.7843
```
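
These loss values are easier to read as perplexity. `F.cross_entropy` uses the natural log, so perplexity is `exp(loss)`. It can be read roughly as how many equally likely characters the model is choosing between at each step. A minimal sketch, using the validation losses from the table above and the 65-character vocabulary:

``` python
import math

def perplexity(cross_entropy_nats):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_nats)

# Validation losses copied from the training log above
val_losses = [2.6123, 2.0843, 1.9104, 1.8231, 1.7843]
for epoch, loss in enumerate(val_losses, start=1):
    print(f"epoch {epoch}: val loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")

# Baseline: a model guessing uniformly over 65 characters
uniform_loss = math.log(65)
print(f"uniform guess: loss {uniform_loss:.4f} -> perplexity {perplexity(uniform_loss):.1f}")
```

A uniform guess has perplexity 65. The final validation loss of 1.7843 corresponds to a perplexity just under 6.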

### Temperature: Controlling Randomness

Temperature is usually the first generation parameter you tune. It divides the logits before softmax, so values below 1 sharpen the distribution and values above 1 flatten it.

``` python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Example logits for 5 tokens: A, B, C, D, E
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])

temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
vocab        = ['A', 'B', 'C', 'D', 'E']

fig, axes = plt.subplots(1, 5, figsize=(15, 4))

for ax, temp in zip(axes, temperatures):
    probs = F.softmax(logits / temp, dim=0).numpy()
    bars  = ax.bar(vocab, probs, color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])
    ax.set_title(f'temp={temp}')
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability' if temp == 0.1 else '')
    for bar, prob in zip(bars, probs):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)
plt.tight_layout()
plt.savefig('temperature_effect.png', dpi=100)
plt.show()

print(f"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}")
print("-" * 55)
for temp in temperatures:
    probs = F.softmax(logits / temp, dim=0)
    print(f"{temp:<8} " + " ".join(f"{p.item():<10.4f}" for p in probs))
```

Output:

```
Temp     P(A)       P(B)       P(C)       P(D)       P(E)
-------------------------------------------------------
0.1      1.0000     0.0000     0.0000     0.0000     0.0000
0.5      0.9368     0.0466     0.0115     0.0042     0.0009
1.0      0.6986     0.1559     0.0774     0.0470     0.0211
1.5      0.5374     0.1977     0.1240     0.0888     0.0521
2.0      0.4468     0.2110     0.1487     0.1158     0.0776
```

**Temperature = 0.1:** extremely peaked, almost always picks "A". In effect deterministic, and repetitive.

**Temperature = 1.0:** original distribution. Balanced randomness.

**Temperature = 2.0:** much flatter. The top token drops below 50%, so unlikely tokens get picked often. Very random, often incoherent.

Good range for creative writing: 0.7 to 1.0. For code or factual tasks: 0.2 to 0.5.
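
One way to build intuition: as temperature approaches 0, sampling behaves like greedy decoding. The sketch below draws 1,000 samples at a few temperatures from the same toy logits and counts how often the argmax token wins. `sample_many` is a hypothetical helper. Dividing by a temperature of exactly 0 is undefined, so for fully deterministic output you would call `argmax` directly.

``` python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])   # same toy logits as above

def sample_many(logits, temperature, n=1000):
    """Draw n token ids from softmax(logits / temperature)."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=n, replacement=True)

for temp in [0.05, 1.0, 2.0]:
    draws  = sample_many(logits, temp)
    counts = torch.bincount(draws, minlength=len(logits))
    argmax_share = counts[torch.argmax(logits)].item() / len(draws)
    print(f"temp={temp}: counts={counts.tolist()}  argmax share={argmax_share:.2f}")
```

At `temp=0.05` every draw lands on token 0, the argmax. At higher temperatures the other tokens take a growing share.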

### Sampling Strategies

**Greedy:** always pick the highest probability token. Fast. Repetitive. Boring.

**Top-k:** only consider the k highest probability tokens. Sample from those.

**Top-p (Nucleus sampling):** consider the smallest set of tokens whose cumulative probability exceeds p, then sample from those. The set shrinks when the model is confident and grows when it is unsure.

``` python
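import torch
import torch.nn.functional as F

# All three helpers expect a 1D logits vector of shape (vocab_size,)
def greedy_sample(logits):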
    return torch.argmax(logits, dim=-1)

def top_k_sample(logits, k=50, temperature=1.0):
    logits = logits / temperature
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs  = F.softmax(top_k_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return top_k_indices[chosen]

def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens with cumulative prob above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift to keep at least one token
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0]  = False

    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs  = F.softmax(sorted_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return sorted_indices[chosen]

# Demonstrate on example logits
logits_example = torch.randn(100)   # 100-token vocabulary

greedy_choice = greedy_sample(logits_example)
topk_choice   = top_k_sample(logits_example, k=10)
topp_choice   = top_p_sample(logits_example, p=0.9)

print(f"Greedy picked token:  {greedy_choice.item()}")
print(f"Top-k (k=10) picked:  {topk_choice.item()}")
print(f"Top-p (p=0.9) picked: {topp_choice.item()}")

# How many tokens qualify for top-p at p=0.9?
sorted_logits, _ = torch.sort(logits_example, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
n_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1
print(f"\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100")
```
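
The helpers above work on a single 1D logits vector, while `MiniGPT.generate` works on a batch of shape `(B, vocab_size)` and only supports top-k. A batched nucleus filter could look like the sketch below. `top_p_filter` is a hypothetical name, and the shift-by-one trick is the same one used in `top_p_sample`.

``` python
import torch
import torch.nn.functional as F

def top_p_filter(logits, p=0.9):
    """Set logits outside the nucleus to -inf. logits: (B, vocab_size)."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Drop a token only if the mass *before* it already exceeds p,
    # which always keeps the top token
    remove = cum_probs > p
    remove[:, 1:] = remove[:, :-1].clone()
    remove[:, 0]  = False
    sorted_logits = sorted_logits.masked_fill(remove, float('-inf'))

    # Undo the sort: put each logit back at its original vocabulary index
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(dim=-1, index=sorted_idx, src=sorted_logits)
    return filtered

# Two rows: one confident distribution, one uniform
batch_logits = torch.log(torch.tensor([[0.05, 0.50, 0.15, 0.30],
                                       [0.25, 0.25, 0.25, 0.25]]))
filtered    = top_p_filter(batch_logits, p=0.7)
next_tokens = torch.multinomial(F.softmax(filtered, dim=-1), num_samples=1)  # (B, 1)

print(torch.isfinite(filtered))   # True where a token survived the filter
print(next_tokens.squeeze(1).tolist())
```

The confident row keeps two tokens and the uniform row keeps three, which is the adaptive behavior that fixed top-k cannot give you.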

### Generating Text With Our MiniGPT

``` python
def generate_text(model, prompt, max_new_tokens=200,
                  temperature=0.8, top_k=40, device='cpu'):
    model.eval()

    # Encode prompt
    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)

    # Generate
    with torch.no_grad():
        generated = model.generate(
            context,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )

    # Decode
    generated_tokens = generated[0].tolist()
    return decode(generated_tokens)

# Try different temperatures
print("=" * 60)
print("LOW TEMPERATURE (0.3) - Conservative and repetitive:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.3, top_k=10, device=device))

print("\n" + "=" * 60)
print("MEDIUM TEMPERATURE (0.8) - Balanced:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.8, top_k=40, device=device))

print("\n" + "=" * 60)
print("HIGH TEMPERATURE (1.5) - Chaotic:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=1.5, top_k=None, device=device))
```

Output (after 5 epochs on Shakespeare):

```
============================================================
LOW TEMPERATURE (0.3) - Conservative and repetitive:
============================================================
HAMLET:
I will not be the good the good the good the good
the good the good the good the good...

============================================================
MEDIUM TEMPERATURE (0.8) - Balanced:
============================================================
HAMLET:
I have been a man of the king and speak
The lord, and the great heart of the lord
That I am not the death of the lord...

============================================================
HIGH TEMPERATURE (1.5) - Chaotic:
============================================================
HAMLET:
Vxqo! zj kin, thae wath gof amd
jek lpe mhek ther whi...
```

Low temperature: repetitive but coherent. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs.

Train longer and the quality improves dramatically.

### Using GPT-2 With HuggingFace

``` python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Load GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model     = GPT2LMHeadModel.from_pretrained('gpt2')

gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Generate with different strategies, reusing the model and tokenizer loaded above
generator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)

prompt = "The future of artificial intelligence is"

print("GREEDY (do_sample=False):")
result = generator(prompt, max_new_tokens=50, do_sample=False)
print(result[0]['generated_text'])

print("\nTOP-K SAMPLING (k=50, temp=0.9):")
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_k=50, temperature=0.9)
print(result[0]['generated_text'])

print("\nNUCLEUS SAMPLING (top_p=0.9):")
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_p=0.9, temperature=0.8)
print(result[0]['generated_text'])
```

### Manual GPT-2 Generation With Full Control

``` python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model     = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = "Once upon a time in a land far away"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

print(f"Prompt tokens: {input_ids.shape[1]}")
print(f"Prompt: '{prompt}'\n")

# Generate step by step and show probabilities
current_ids = input_ids.clone()

for step in range(5):
    with torch.no_grad():
        outputs = model(current_ids)
        logits  = outputs.logits[:, -1, :]  # last position

    # Get top 5 candidates
    probs      = torch.softmax(logits, dim=-1)
    top5_probs, top5_ids = torch.topk(probs, 5)

    print(f"Step {step+1} - Top 5 candidates:")
    for prob, token_id in zip(top5_probs[0], top5_ids[0]):
        token_str = tokenizer.decode([token_id.item()])
        print(f"  '{token_str}' : {prob.item():.4f}")

    # Pick top token (greedy)
    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)
    current_ids = torch.cat([current_ids, next_token], dim=1)

    print(f"  -> Picked: '{tokenizer.decode([next_token.item()])}'\n")

final_text = tokenizer.decode(current_ids[0])
print(f"Final: '{final_text}'")
```

Output:

```
Prompt tokens: 9
Prompt: 'Once upon a time in a land far away'

Step 1 - Top 5 candidates:
  ',' : 0.2341
  'there' : 0.1823
  'called' : 0.0912
  'from' : 0.0634
  'where' : 0.0521
  -> Picked: ','

Step 2 - Top 5 candidates:
  'there' : 0.3412
  'a' : 0.1234
  'the' : 0.0891
  'an' : 0.0432
  'people' : 0.0321
  -> Picked: 'there'
...

Final: 'Once upon a time in a land far away, there was a'
```

### What GPT Learns by Predicting the Next Word

This seems like a simple task. It's not. To predict the next word well, the model must learn:

- **Grammar:** what word types follow others
- **Facts:** "The capital of France is..." → "Paris"
- **Reasoning:** "If A > B and B > C, then A > ..." → "C"
- **Style:** given "HAMLET:", continue in Shakespearean style
- **Code:** given `def fibonacci(n):`, complete correctly
- **Math:** "2 + 2 = " → "4"

None of these were explicitly taught. They emerged from predicting tokens. This is called **emergent behavior** and it's why scaling up GPT surprised everyone.

### Quick Cheat Sheet

| Concept | What it means |
|---|---|
| Autoregressive | Generate one token at a time, feed back to input |
| Temperature | Higher = more random, lower = more deterministic |
| Greedy | Always pick highest prob token. Repetitive. |
| Top-k | Sample from top k tokens only |
| Top-p (nucleus) | Sample from smallest set with cumulative prob > p |
| Perplexity | exp(cross-entropy loss) for language models: lower = better |
| Weight tying | Embedding and output head share weights |
| Pre-norm | LayerNorm before attention (modern GPT), more stable |

| Task | Code |
|---|---|
| Load GPT-2 | `GPT2LMHeadModel.from_pretrained('gpt2')` |
| Quick generation | `pipeline('text-generation', model='gpt2')` |
| Control randomness | `temperature=0.8, top_k=50, top_p=0.9` |
| Stop at end-of-text token | `eos_token_id=tokenizer.eos_token_id` |
| Greedy | `do_sample=False` |
| Sampling | `do_sample=True` |

### Practice Challenges

**Level 1:**

Use `pipeline('text-generation')` with GPT-2. Generate the same prompt 5 times with `temperature=0.9`. Compare the outputs. Now do it with `temperature=0.1`. How different are the results?

**Level 2:**

Train MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. How many epochs until the samples look like the training data?

**Level 3:**

Implement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?

### References

- [GPT-1 paper: Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised)
- [GPT-2 paper: Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models)
- [Andrej Karpathy: nanoGPT (GitHub)](https://github.com/karpathy/nanoGPT)
- [HuggingFace: GPT-2 docs](https://huggingface.co/docs/transformers/model_doc/gpt2)
- [The Scaling Laws paper](https://arxiv.org/abs/2001.08361)

Next up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.