GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens

A 124M-parameter GPT-2 model trained from scratch on OpenWebText data using a custom deep learning library achieved a validation loss of 2.764 nats and a perplexity of 15.87 after 56,000 steps (27.5B tokens). The model, which uses a custom byte-level BPE tokenizer, demonstrates that a hand-written library can train GPT-2 to a level close to the official checkpoint, though it remains undertrained relative to its 600,000-step schedule.
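The headline numbers above follow from one another by simple arithmetic. The sketch below recomputes them from the reported validation loss and batch configuration. The variable names are illustrative only, and the 480-sequence batch and 1024-token sequence length are taken from the training configuration reported in this card.

```python
import math

# Values quoted in this model card.
val_loss_nats = 2.764      # validation cross-entropy, nats per token
steps_trained = 56_000
sequences_per_step = 480   # micro-batch 60 x 8 gradient-accumulation steps
sequence_length = 1024     # max sequence length

# Perplexity is the exponential of the cross-entropy measured in nats.
perplexity = math.exp(val_loss_nats)

# Converting nats to bits divides by ln 2.
bits_per_token = val_loss_nats / math.log(2)

# Token throughput and total tokens seen so far.
tokens_per_step = sequences_per_step * sequence_length
tokens_seen = steps_trained * tokens_per_step

print(f"perplexity      ~ {perplexity:.2f}")
print(f"bits per token  ~ {bits_per_token:.2f}")
print(f"tokens per step = {tokens_per_step:,}")
print(f"tokens seen     = {tokens_seen:,}")
```

Because the loss is reported to three decimals, the recomputed perplexity can differ from the quoted 15.87 in the last digit.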

# GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on [OpenWebText](https://huggingface.co/datasets/Skylion007/openwebtext) data using a hand-written deep learning library (no PyTorch in the model or training path).

## Training Metrics

| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (exp(loss)) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 of 600,000 planned |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start → end val loss | 5.18 → 2.76 |

## Zero-shot evaluation

Zero-shot evaluation results for checkpoint openwebtext gpt2 124m baseline GPT2 56000 via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

`bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). `acc` is also comparable.

Caveats:

- The vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges, and therefore the exact token boundaries, were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples, identical-tokenization comparison. For a fully tokenizer-independent number, use bits per byte.
- The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.

Direction: ↑ = higher is better, ↓ = lower is better.

| Task | Metric | Direction | Value | ± Stderr |
|---|---|---|---|---|
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | — |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | — |
| PTB | word_perplexity | ↓ | 827.3800 | — |
| text8 | bits_per_byte | ↓ | 1.3039 | — |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | — |
| | byte_perplexity | ↓ | 2.0052 | — |
| | word_perplexity | ↓ | 41.2833 | — |

## Architecture

GPT-2 Small, 124 million parameters.

| Component | Value |
|---|---|
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |

The tokenizer is a custom byte pair encoder (BPE) trained from scratch on OpenWebText (49,990 merges, 50,257 tokens), the same vocab size as OpenAI's GPT-2 BPE. The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.

## Training configuration

| Setting | Value |
|---|---|
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro-batch 60 × 8 grad-accum steps) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |

## Limitations

- Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint on Hugging Face.
- Undertrained relative to its own schedule (56k of 600k steps).
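
Two of the configuration details reported in this card, the learning-rate schedule and the padding of the output layer, can be sketched in a few lines. This is a minimal illustration, not the library's actual code. The function names are hypothetical, and both the linear warmup shape and the choice to measure cosine progress from the end of warmup are assumptions that the card does not spell out.

```python
import math

# Hyperparameters as reported in the training configuration.
MAX_LR = 6e-4
MIN_LR = 1e-4
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes linear warmup to MAX_LR, then cosine decay to MIN_LR
    ending at DECAY_STEPS, then a constant MIN_LR.
    """
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


def pad_to_multiple(size: int, multiple: int = 64) -> int:
    """Round size up to the next multiple (hypothetical helper)."""
    return ((size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    padded_vocab = pad_to_multiple(50_257)
    print(f"padded vocab: {padded_vocab} ({padded_vocab - 50_257} unused rows)")
    print(f"lr at step 56,000: {lr_at(56_000):.3e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, which fits the card's note that the run is early in its 600k-step schedule.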