{"slug": "93-gpt-the-model-that-predicts-the-next-word-forever", "title": "93. GPT: The Model That Predicts the Next Word Forever", "summary": "GPT models are decoder-only transformers that generate text by predicting the next token one at a time, conditioning each new prediction on all previous tokens. Unlike BERT, which reads entire sequences at once, GPT's autoregressive approach forces it to learn grammar, facts, and reasoning patterns from training on billions of tokens. The article also provides a minimal PyTorch implementation of a GPT-like model, including causal self-attention and token-by-token generation.", "body_md": "BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.\n\nThat difference sounds limiting. It's not.\n\nWhen you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.\n\nGPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.\n\n### What You'll Learn Here\n\n- How autoregressive generation works step by step\n- What temperature does to output randomness\n- Greedy, top-k, top-p (nucleus) sampling explained\n- Building a character-level GPT from scratch\n- Using HuggingFace GPT-2 for text generation\n- What makes GPT different from BERT and when to use which\n\n### Autoregressive Generation: The Core Idea\n\nGPT generates text one token at a time. 
Each new token is conditioned on all previous tokens.\n\n```\nStep 1: Input: \"The cat\"\n        Predict next token → \"sat\" (highest probability)\n\nStep 2: Input: \"The cat sat\"\n        Predict next token → \"on\" \n\nStep 3: Input: \"The cat sat on\"\n        Predict next token → \"the\"\n\nStep 4: Input: \"The cat sat on the\"\n        Predict next token → \"mat\"\n\n...continues until [EOS] token or max length\n```\n\nAt each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.\n\n``` python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport math\n\n# Minimal decoder-only transformer (from Post 91)\nclass CausalSelfAttention(nn.Module):\n    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):\n        super().__init__()\n        self.n_heads = n_heads\n        self.d_k     = d_model // n_heads\n\n        self.W_qkv = nn.Linear(d_model, 3 * d_model)\n        self.W_o   = nn.Linear(d_model, d_model)\n        self.drop  = nn.Dropout(dropout)\n\n        # Causal mask registered as buffer\n        mask = torch.tril(torch.ones(max_len, max_len))\n        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))\n\n    def forward(self, x):\n        B, T, C = x.shape\n        qkv = self.W_qkv(x).chunk(3, dim=-1)\n        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]\n\n        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)\n        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))\n        attn   = self.drop(F.softmax(scores, dim=-1))\n\n        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)\n        return self.W_o(out)\n\nclass GPTBlock(nn.Module):\n    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):\n        super().__init__()\n        self.attn = CausalSelfAttention(d_model, n_heads, dropout=dropout)\n        self.ff   = nn.Sequential(\n 
           nn.Linear(d_model, d_ff),\n            nn.GELU(),\n            nn.Linear(d_ff, d_model),\n            nn.Dropout(dropout)\n        )\n        self.ln1 = nn.LayerNorm(d_model)\n        self.ln2 = nn.LayerNorm(d_model)\n\n    def forward(self, x):\n        x = x + self.attn(self.ln1(x))   # pre-norm (modern GPT style)\n        x = x + self.ff(self.ln2(x))\n        return x\n\nclass MiniGPT(nn.Module):\n    def __init__(self, vocab_size, d_model=128, n_heads=4,\n                 n_layers=4, d_ff=512, max_len=256, dropout=0.1):\n        super().__init__()\n        self.token_emb = nn.Embedding(vocab_size, d_model)\n        self.pos_emb   = nn.Embedding(max_len, d_model)\n        self.drop      = nn.Dropout(dropout)\n        self.blocks    = nn.ModuleList([\n            GPTBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)\n        ])\n        self.ln_f  = nn.LayerNorm(d_model)\n        self.head  = nn.Linear(d_model, vocab_size, bias=False)\n        self.max_len = max_len\n\n        # Weight tying: token embedding and output head share weights\n        self.head.weight = self.token_emb.weight\n\n        self.apply(self._init_weights)\n\n    def _init_weights(self, module):\n        if isinstance(module, nn.Linear):\n            nn.init.normal_(module.weight, mean=0.0, std=0.02)\n        elif isinstance(module, nn.Embedding):\n            nn.init.normal_(module.weight, mean=0.0, std=0.02)\n\n    def forward(self, idx, targets=None):\n        B, T = idx.shape\n        pos  = torch.arange(T, device=idx.device)\n\n        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))\n        for block in self.blocks:\n            x = block(x)\n        x = self.ln_f(x)\n        logits = self.head(x)   # (B, T, vocab_size)\n\n        loss = None\n        if targets is not None:\n            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))\n\n        return logits, loss\n\n    @torch.no_grad()\n    def generate(self, idx, 
max_new_tokens, temperature=1.0, top_k=None):\n        for _ in range(max_new_tokens):\n            # Crop context to max_len\n            idx_cond = idx[:, -self.max_len:]\n\n            logits, _ = self(idx_cond)\n            logits     = logits[:, -1, :]  # last position only\n\n            # Apply temperature\n            logits = logits / temperature\n\n            # Apply top-k\n            if top_k is not None:\n                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))\n                logits[logits < v[:, [-1]]] = float('-inf')\n\n            probs     = F.softmax(logits, dim=-1)\n            next_token = torch.multinomial(probs, num_samples=1)\n            idx        = torch.cat([idx, next_token], dim=1)\n\n        return idx\n\n# Show model size (tied embedding/head weights are counted once)\nmodel = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)\nn_params = sum(p.numel() for p in model.parameters())\nprint(f\"MiniGPT parameters: {n_params:,}\")\n```\n\nOutput:\n\n```\nMiniGPT parameters: 834,432\n```\n\n### Training on Character-Level Shakespeare\n\nLet's train MiniGPT on Shakespeare text. 
Character-level means each character is a token.\n\n``` python\nimport requests\nimport torch\nfrom torch.utils.data import Dataset, DataLoader\n\n# Download Shakespeare\nurl = \"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\"\ntext = requests.get(url).text\nprint(f\"Total characters: {len(text):,}\")\nprint(f\"Sample:\\n{text[:200]}\")\n\n# Build character vocabulary\nchars = sorted(set(text))\nvocab_size = len(chars)\nprint(f\"Vocabulary size: {vocab_size} unique characters\")\n\nstoi = {c: i for i, c in enumerate(chars)}  # char to index\nitos = {i: c for i, c in enumerate(chars)}  # index to char\n\nencode = lambda s: [stoi[c] for c in s]\ndecode = lambda l: ''.join(itos[i] for i in l)\n\n# Encode full dataset\ndata = torch.tensor(encode(text), dtype=torch.long)\nprint(f\"Encoded length: {len(data):,} tokens\")\n\n# Train/val split\nn_train = int(0.9 * len(data))\ntrain_data = data[:n_train]\nval_data   = data[n_train:]\nprint(f\"Train tokens: {len(train_data):,}\")\nprint(f\"Val tokens:   {len(val_data):,}\")\n\n# Dataset\nclass CharDataset(Dataset):\n    def __init__(self, data, block_size):\n        self.data       = data\n        self.block_size = block_size\n\n    def __len__(self):\n        return len(self.data) - self.block_size\n\n    def __getitem__(self, idx):\n        x = self.data[idx:idx + self.block_size]\n        y = self.data[idx + 1:idx + self.block_size + 1]\n        return x, y\n\nblock_size  = 128\ntrain_set   = CharDataset(train_data, block_size)\nval_set     = CharDataset(val_data, block_size)\n\ntrain_loader = DataLoader(train_set, batch_size=64, shuffle=True)\nval_loader   = DataLoader(val_set,   batch_size=64, shuffle=False)\n\nprint(f\"Training batches: {len(train_loader)}\")\n\nimport torch.optim as optim\n\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nmodel  = MiniGPT(\n    vocab_size=vocab_size,\n    d_model=128,\n    n_heads=4,\n    n_layers=4,\n    d_ff=512,\n    
max_len=block_size\n).to(device)\n\noptimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)\nscheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)\n\ndef evaluate(model, loader, max_batches=20):\n    model.eval()\n    total_loss = 0\n    with torch.no_grad():\n        for i, (x, y) in enumerate(loader):\n            if i >= max_batches:\n                break\n            x, y = x.to(device), y.to(device)\n            _, loss = model(x, y)\n            total_loss += loss.item()\n    return total_loss / min(max_batches, len(loader))\n\nprint(f\"Training on: {device}\")\nprint(f\"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}\")\nprint(\"-\" * 35)\n\nfor epoch in range(1, 6):\n    model.train()\n    train_loss = 0\n\n    for x, y in train_loader:\n        x, y = x.to(device), y.to(device)\n        optimizer.zero_grad()\n        _, loss = model(x, y)\n        loss.backward()\n        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n        optimizer.step()\n        train_loss += loss.item()\n\n    train_loss /= len(train_loader)\n    val_loss    = evaluate(model, val_loader)\n    scheduler.step()\n\n    print(f\"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}\")\n```\n\nOutput:\n\n```\nTraining on: cuda\nEpoch    Train Loss   Val Loss\n-----------------------------------\n1        2.8341       2.6123\n2        2.1045       2.0843\n3        1.8921       1.9104\n4        1.7632       1.8231\n5        1.6891       1.7843\n```\n\n### Temperature: Controlling Randomness\n\nTemperature is the most important generation parameter. 
It scales the logits before softmax.\n\n``` python\nimport torch\nimport torch.nn.functional as F\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n# Example logits for 5 tokens: A, B, C, D, E\nlogits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])\n\ntemperatures = [0.1, 0.5, 1.0, 1.5, 2.0]\nvocab        = ['A', 'B', 'C', 'D', 'E']\n\nfig, axes = plt.subplots(1, 5, figsize=(15, 4))\n\nfor ax, temp in zip(axes, temperatures):\n    probs = F.softmax(logits / temp, dim=0).numpy()\n    bars  = ax.bar(vocab, probs, color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])\n    ax.set_title(f'temp={temp}')\n    ax.set_ylim(0, 1)\n    ax.set_ylabel('Probability' if temp == 0.1 else '')\n    for bar, prob in zip(bars, probs):\n        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,\n                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)\n\nplt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)\nplt.tight_layout()\nplt.savefig('temperature_effect.png', dpi=100)\nplt.show()\n\nprint(f\"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}\")\nprint(\"-\" * 55)\nfor temp in temperatures:\n    probs = F.softmax(logits / temp, dim=0)\n    print(f\"{temp:<8} \" + \" \".join(f\"{p.item():<10.4f}\" for p in probs))\n```\n\nOutput:\n\n```\nTemp     P(A)       P(B)       P(C)       P(D)       P(E)\n-------------------------------------------------------\n0.1      1.0000     0.0000     0.0000     0.0000     0.0000\n0.5      0.9368     0.0466     0.0115     0.0042     0.0009\n1.0      0.6986     0.1559     0.0774     0.0470     0.0211\n1.5      0.5374     0.1977     0.1240     0.0888     0.0521\n2.0      0.4468     0.2110     0.1487     0.1158     0.0776\n```\n\n**Temperature = 0.1:** extremely peaked, effectively always picks \"A\". Near-deterministic, repetitive.\n\n**Temperature = 1.0:** original distribution. Balanced randomness.\n\n**Temperature = 2.0:** much flatter, with \"A\" dropping below 50%. 
Very random, often incoherent.\n\nGood range for creative writing: 0.7 to 1.0. For code or factual tasks: 0.2 to 0.5.\n\n### Sampling Strategies\n\n**Greedy:** always pick the highest probability token. Fast. Repetitive. Boring.\n\n**Top-k:** only consider the k highest probability tokens. Sample from those.\n\n**Top-p (Nucleus sampling):** consider the smallest set of tokens whose cumulative probability exceeds p. Adapts vocabulary size based on confidence.\n\n``` python\ndef greedy_sample(logits):\n    return torch.argmax(logits, dim=-1)\n\ndef top_k_sample(logits, k=50, temperature=1.0):\n    logits = logits / temperature\n    top_k_logits, top_k_indices = torch.topk(logits, k)\n    probs  = F.softmax(top_k_logits, dim=-1)\n    chosen = torch.multinomial(probs, num_samples=1)\n    return top_k_indices[chosen]\n\ndef top_p_sample(logits, p=0.9, temperature=1.0):\n    logits = logits / temperature\n    sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n\n    # Remove tokens with cumulative prob above threshold\n    sorted_indices_to_remove = cumulative_probs > p\n    # Shift to keep at least one token\n    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()\n    sorted_indices_to_remove[0]  = False\n\n    sorted_logits[sorted_indices_to_remove] = float('-inf')\n    probs  = F.softmax(sorted_logits, dim=-1)\n    chosen = torch.multinomial(probs, num_samples=1)\n    return sorted_indices[chosen]\n\n# Demonstrate on example logits\nlogits_example = torch.randn(100)   # 100-token vocabulary\n\ngreedy_choice = greedy_sample(logits_example)\ntopk_choice   = top_k_sample(logits_example, k=10)\ntopp_choice   = top_p_sample(logits_example, p=0.9)\n\nprint(f\"Greedy picked token:  {greedy_choice.item()}\")\nprint(f\"Top-k (k=10) picked:  {topk_choice.item()}\")\nprint(f\"Top-p (p=0.9) picked: {topp_choice.item()}\")\n\n# How many tokens qualify for top-p at 
p=0.9?\nsorted_logits, _ = torch.sort(logits_example, descending=True)\ncumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\nn_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1\nprint(f\"\\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100\")\n```\n\n### Generating Text With Our MiniGPT\n\n``` python\ndef generate_text(model, prompt, max_new_tokens=200,\n                  temperature=0.8, top_k=40, device='cpu'):\n    model.eval()\n\n    # Encode prompt\n    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)\n\n    # Generate\n    with torch.no_grad():\n        generated = model.generate(\n            context,\n            max_new_tokens=max_new_tokens,\n            temperature=temperature,\n            top_k=top_k\n        )\n\n    # Decode\n    generated_tokens = generated[0].tolist()\n    return decode(generated_tokens)\n\n# Try different temperatures\nprint(\"=\" * 60)\nprint(\"LOW TEMPERATURE (0.3) - Conservative and repetitive:\")\nprint(\"=\" * 60)\nprint(generate_text(model, \"HAMLET:\", max_new_tokens=150,\n                    temperature=0.3, top_k=10, device=device))\n\nprint(\"\\n\" + \"=\" * 60)\nprint(\"MEDIUM TEMPERATURE (0.8) - Balanced:\")\nprint(\"=\" * 60)\nprint(generate_text(model, \"HAMLET:\", max_new_tokens=150,\n                    temperature=0.8, top_k=40, device=device))\n\nprint(\"\\n\" + \"=\" * 60)\nprint(\"HIGH TEMPERATURE (1.5) - Chaotic:\")\nprint(\"=\" * 60)\nprint(generate_text(model, \"HAMLET:\", max_new_tokens=150,\n                    temperature=1.5, top_k=None, device=device))\n```\n\nOutput (after 5 epochs on Shakespeare):\n\n```\n============================================================\nLOW TEMPERATURE (0.3) - Conservative and repetitive:\n============================================================\nHAMLET:\nI will not be the good the good the good the good\nthe good the good the good the 
good...\n\n============================================================\nMEDIUM TEMPERATURE (0.8) - Balanced:\n============================================================\nHAMLET:\nI have been a man of the king and speak\nThe lord, and the great heart of the lord\nThat I am not the death of the lord...\n\n============================================================\nHIGH TEMPERATURE (1.5) - Chaotic:\n============================================================\nHAMLET:\nVxqo! zj kin, thae wath gof amd\njek lpe mhek ther whi...\n```\n\nLow temperature: real words, but stuck in a loop. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs.\n\nTrain longer and the sample quality should improve noticeably.\n\n### Using GPT-2 With HuggingFace\n\n``` python\nfrom transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline\n\n# Load GPT-2\ngpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\ngpt2_model     = GPT2LMHeadModel.from_pretrained('gpt2')\n\n# GPT-2 has no pad token, so reuse the end-of-text token\ngpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token\n\n# Generate with different strategies\n# Reuse the model and tokenizer loaded above instead of loading GPT-2 a second time\ngenerator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)\n\nprompt = \"The future of artificial intelligence is\"\n\nprint(\"GREEDY (do_sample=False):\")\nresult = generator(prompt, max_new_tokens=50, do_sample=False)\nprint(result[0]['generated_text'])\n\nprint(\"\\nTOP-K SAMPLING (k=50, temp=0.9):\")\nresult = generator(prompt, max_new_tokens=50,\n                   do_sample=True, top_k=50, temperature=0.9)\nprint(result[0]['generated_text'])\n\nprint(\"\\nNUCLEUS SAMPLING (top_p=0.9):\")\n# top_k=0 switches off the default top-k filter so only top-p applies\nresult = generator(prompt, max_new_tokens=50,\n                   do_sample=True, top_k=0, top_p=0.9, temperature=0.8)\nprint(result[0]['generated_text'])\n```\n\n### Manual GPT-2 Generation With Full Control\n\n``` python\nimport torch\nfrom transformers import GPT2LMHeadModel, GPT2Tokenizer\n\ntokenizer = GPT2Tokenizer.from_pretrained('gpt2')\nmodel     = 
GPT2LMHeadModel.from_pretrained('gpt2')\nmodel.eval()\n\nprompt = \"Once upon a time in a land far away\"\ninput_ids = tokenizer.encode(prompt, return_tensors='pt')\n\nprint(f\"Prompt tokens: {input_ids.shape[1]}\")\nprint(f\"Prompt: '{prompt}'\\n\")\n\n# Generate step by step and show probabilities\ncurrent_ids = input_ids.clone()\n\nfor step in range(5):\n    with torch.no_grad():\n        outputs = model(current_ids)\n        logits  = outputs.logits[:, -1, :]  # last position\n\n    # Get top 5 candidates\n    probs      = torch.softmax(logits, dim=-1)\n    top5_probs, top5_ids = torch.topk(probs, 5)\n\n    print(f\"Step {step+1} - Top 5 candidates:\")\n    for prob, token_id in zip(top5_probs[0], top5_ids[0]):\n        token_str = tokenizer.decode([token_id.item()])\n        print(f\"  '{token_str}' : {prob.item():.4f}\")\n\n    # Pick top token (greedy)\n    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)\n    current_ids = torch.cat([current_ids, next_token], dim=1)\n\n    print(f\"  -> Picked: '{tokenizer.decode([next_token.item()])}'\\n\")\n\nfinal_text = tokenizer.decode(current_ids[0])\nprint(f\"Final: '{final_text}'\")\n```\n\nOutput:\n\n```\nPrompt tokens: 9\nPrompt: 'Once upon a time in a land far away'\n\nStep 1 - Top 5 candidates:\n  ',' : 0.2341\n  'there' : 0.1823\n  'called' : 0.0912\n  'from' : 0.0634\n  'where' : 0.0521\n  -> Picked: ','\n\nStep 2 - Top 5 candidates:\n  'there' : 0.3412\n  'a' : 0.1234\n  'the' : 0.0891\n  'an' : 0.0432\n  'people' : 0.0321\n  -> Picked: 'there'\n...\n\nFinal: 'Once upon a time in a land far away, there was a'\n```\n\n### What GPT Learns by Predicting the Next Word\n\nThis seems like a simple task. It's not. 
To predict the next word well, the model must learn:\n\n- **Grammar:** what word types follow others\n- **Facts:** \"The capital of France is...\" → \"Paris\"\n- **Reasoning:** \"If A > B and B > C, then A > ...\" → \"C\"\n- **Style:** given \"HAMLET:\", continue in Shakespearean style\n- **Code:** given `def fibonacci(n):`, complete correctly\n- **Math:** \"2 + 2 = \" → \"4\"\n\nNone of these were explicitly taught. They emerged from predicting tokens. This is called **emergent behavior** and it's why scaling up GPT surprised so many people.\n\n### Quick Cheat Sheet\n\n| Concept | What it means |\n|---|---|\n| Autoregressive | Generate one token at a time, feed back to input |\n| Temperature | Higher = more random, lower = more deterministic |\n| Greedy | Always pick highest prob token. Repetitive. |\n| Top-k | Sample from top k tokens only |\n| Top-p (nucleus) | Sample from smallest set with cumulative prob > p |\n| Perplexity | exp(cross-entropy loss) of a language model: lower = better |\n| Weight tying | Embedding and output head share weights |\n| Pre-norm | LayerNorm before attention and feed-forward (GPT-2 onward), more stable |\n\n| Task | Code |\n|---|---|\n| Load GPT-2 | `GPT2LMHeadModel.from_pretrained('gpt2')` |\n| Quick generation | `pipeline('text-generation', model='gpt2')` |\n| Control randomness | `temperature=0.8, top_k=50, top_p=0.9` |\n| Stop at end-of-text token | `eos_token_id=tokenizer.eos_token_id` |\n| Greedy | `do_sample=False` |\n| Sampling | `do_sample=True` |\n\n### Practice Challenges\n\n**Level 1:**\n\nUse `pipeline('text-generation')` with GPT-2. Generate the same prompt 5 times with `do_sample=True` and `temperature=0.9`. Compare the outputs. Now do it with `temperature=0.1`. How different are the results?\n\n**Level 2:**\n\nTrain MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. 
How many epochs until the samples look like the training data?\n\n**Level 3:**\n\nImplement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?\n\n### References\n\n- [GPT-1 paper: Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised)\n- [GPT-2 paper: Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models)\n- [Andrej Karpathy: nanoGPT (GitHub)](https://github.com/karpathy/nanoGPT)\n- [HuggingFace: GPT-2 docs](https://huggingface.co/docs/transformers/model_doc/gpt2)\n- [The Scaling Laws paper](https://arxiv.org/abs/2001.08361)\n\nNext up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.", "url": "https://wpnews.pro/news/93-gpt-the-model-that-predicts-the-next-word-forever", "canonical_source": "https://dev.to/yakhilesh/93-gpt-the-model-that-predicts-the-next-word-forever-2g17", "published_at": "2026-05-21 06:07:33+00:00", "updated_at": "2026-05-21 06:34:20.281226+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "research"], "entities": ["GPT", "BERT", "GPT-1", "GPT-2", "GPT-3", "GPT-4"], "alternates": {"html": "https://wpnews.pro/news/93-gpt-the-model-that-predicts-the-next-word-forever", "markdown": "https://wpnews.pro/news/93-gpt-the-model-that-predicts-the-next-word-forever.md", "text": "https://wpnews.pro/news/93-gpt-the-model-that-predicts-the-next-word-forever.txt", "jsonld": "https://wpnews.pro/news/93-gpt-the-model-that-predicts-the-next-word-forever.jsonld"}}