{"slug": "gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens", "title": "GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens", "summary": "A 124M-parameter GPT-2 model trained from scratch on OpenWebText data using a custom deep learning library achieved a validation loss of 2.764 nats and a perplexity of 15.87 after 56,000 steps (27.5B tokens). The model, which uses a custom byte-level BPE tokenizer, demonstrates that a hand-written library can train GPT-2 to a level close to the official checkpoint, though it remains undertrained relative to its 600,000-step schedule.", "body_md": "# GPT-2 124M — OpenWebText Baseline Model Card\n\nA 124M-parameter GPT-2 trained from scratch on [OpenWebText](https://huggingface.co/datasets/Skylion007/openwebtext) data using a hand-written deep learning library (no PyTorch in the model or training path).\n\n## Training Metrics\n\n| Metric | Value |\n|---|---|\n| Validation loss (cross-entropy, nats) | 2.764 |\n| Validation perplexity (`exp(loss)`) | 15.87 |\n| Bits per token (`loss / ln 2`) | 3.99 |\n| Steps trained | 56,000 (of 600,000 planned) |\n| Tokens seen | ~27.5B (491,520 tok/step) |\n| Start -> end val loss | 5.18 -> 2.76 |\n\n## Zero-shot evaluation\n\nZero-shot evaluation results for checkpoint `openwebtext_gpt2_124m_baseline_GPT2_56000` via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).\n\n`bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes/words rather than tokens, so they are **tokenizer-independent** and can be compared directly against other GPT-2 models despite the custom BPE (see Tokenizer caveat section below). 
`acc` is also comparable.\n\nCaveats:\n\n- The BPE tokenizer *size* matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) are trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples, identical-tokenization comparison. For a fully tokenizer-independent number, use bits-per-byte; the LAMBADA `perplexity` is token-level and so carries the usual tokenizer dependence.\n\n| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |\n|---|---|---|---|---|\n| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |\n| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |\n| enwik8 | bits_per_byte | ↓ | 1.8399 | — |\n| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |\n| | perplexity | ↓ | 52.7521 | 2.1696 |\n| 1BW | word_perplexity | ↓ | 135.6374 | — |\n| PTB | word_perplexity | ↓ | 827.3800 | — |\n| text8 | bits_per_byte | ↓ | 1.3039 | — |\n| WikiText103 | bits_per_byte | ↓ | 1.0037 | — |\n| | byte_perplexity | ↓ | 2.0052 | — |\n| | word_perplexity | ↓ | 41.2833 | — |\n\n## Architecture (GPT-2 Small, 124 million parameters)\n\n| Setting | Value |\n|---|---|\n| Layers / heads / hidden | 12 / 12 / 768 |\n| Max sequence length | 1024 |\n| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |\n| Dropout | 0.0 |\n| Parameter dtype | bfloat16 |\n| Notable | packed QKV projection |\n\nThe tokenizer is a **custom byte pair encoder (BPE) trained from scratch on OpenWebText** (49,990 merges, **50,257 tokens**, the same vocab size as OpenAI's GPT-2 BPE). 
The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.\n\n## Training configuration\n\n| Setting | Value |\n|---|---|\n| Dataset | OpenWebText |\n| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |\n| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |\n| Grad clipping | max-norm 1.0 |\n| Global batch | 480 sequences (micro 60 × 8 grad-accum) |\n| Tokens / step | 491,520 |\n| Eval | mean val loss over 100 batches |\n\n## Limitations\n\n- Research / educational artifact: it demonstrates that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face.\n- Undertrained relative to its own schedule (56k of 600k steps).", "url": "https://wpnews.pro/news/gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens", "canonical_source": "https://github.com/workofart/ml-by-hand/releases/tag/gpt2-124m-openwebtext-56000", "published_at": "2026-06-17 04:51:14+00:00", "updated_at": "2026-06-17 05:22:56.758066+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning"], "entities": ["GPT-2", "OpenWebText", "AdamW", "BPE", "EleutherAI", "LAMBADA", "WikiText103", "PTB"], "alternates": {"html": "https://wpnews.pro/news/gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens", "markdown": "https://wpnews.pro/news/gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens.md", "text": "https://wpnews.pro/news/gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens.txt", "jsonld": "https://wpnews.pro/news/gpt-2-124m-checkpoint-pre-trained-on-openwebtext-27-5b-tokens.jsonld"}}