A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).
## Training Metrics
| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (`exp(loss)`) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
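
The derived rows above follow from the validation loss and the batch geometry. A minimal sketch of that arithmetic, using only numbers reported in this card (the variable names are illustrative and not taken from the training code). Because the loss is rounded to three decimals here, the recomputed perplexity can differ from the table in the last digit.

```python
import math

# Values reported in this card.
val_loss_nats = 2.764          # validation cross-entropy per token, in nats
steps_trained = 56_000
micro_batch, grad_accum = 60, 8
seq_len = 1024

# Batch geometry: sequences per optimizer step, then tokens per step.
sequences_per_step = micro_batch * grad_accum
tokens_per_step = sequences_per_step * seq_len

# Derived metrics.
perplexity = math.exp(val_loss_nats)            # per-token perplexity
bits_per_token = val_loss_nats / math.log(2)    # convert nats to bits
tokens_seen = steps_trained * tokens_per_step

print(f"tokens/step    = {tokens_per_step:,}")
print(f"perplexity     = {perplexity:.2f}")
print(f"bits per token = {bits_per_token:.2f}")
print(f"tokens seen    = {tokens_seen / 1e9:.1f}B")
```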
## Zero-shot evaluation
Zero-shot evaluation results for checkpoint `openwebtext_gpt2_124m_baseline_GPT2_56000`, obtained via lm-evaluation-harness.

`bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). `acc` is also comparable.

Caveats:
- The BPE tokenizer's vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) are trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples, identical-tokenization comparison. For a fully tokenizer-independent number, use bits-per-byte.
- The LAMBADA `perplexity` is token-level and so carries the usual tokenizer dependence.

| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |
|---|---|---|---|---|
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | - |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| lambada_openai | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | - |
| PTB | word_perplexity | ↓ | 827.3800 | - |
| text8 | bits_per_byte | ↓ | 1.3039 | - |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | - |
| WikiText103 | byte_perplexity | ↓ | 2.0052 | - |
| WikiText103 | word_perplexity | ↓ | 41.2833 | - |
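
The byte- and word-normalized metrics all derive from the same summed negative log-likelihood and differ only in the divisor. The sketch below shows how these quantities are commonly defined; the function names are hypothetical and this is not the harness's own code. It ends with a consistency check on the WikiText103 rows, since `byte_perplexity` should equal `2 ** bits_per_byte`.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed negative log-likelihood (nats) per UTF-8 byte, expressed in bits."""
    return total_nll_nats / (n_bytes * math.log(2))

def byte_perplexity(total_nll_nats: float, n_bytes: int) -> float:
    """exp of the average negative log-likelihood per byte."""
    return math.exp(total_nll_nats / n_bytes)

def word_perplexity(total_nll_nats: float, n_words: int) -> float:
    """exp of the average negative log-likelihood per whitespace-delimited word."""
    return math.exp(total_nll_nats / n_words)

# Toy numbers (made up for illustration): 100 nats of loss over 200 bytes.
toy_bpb = bits_per_byte(100.0, 200)
toy_byte_ppl = byte_perplexity(100.0, 200)

# Consistency check against the reported WikiText103 bits_per_byte.
reported_bpb = 1.0037
implied_byte_ppl = 2 ** reported_bpb
print(f"implied byte_perplexity = {implied_byte_ppl:.4f}")
```

None of these divisors depends on how the text was split into tokens, which is why the numbers stay comparable across tokenizers.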
## Architecture (GPT-2 Small, 124 million parameters)
| Setting | Value |
|---|---|
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |

The tokenizer is a custom byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens; the same vocabulary size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 rows (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
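
The padding arithmetic can be checked with a small round-up helper. This is only a sketch: `pad_to_multiple` is a hypothetical name, not a function from the library.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257                      # real tokens in the custom BPE
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size  # logit rows that never map to a token

print(padded_vocab, unused_rows)
```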
## Training configuration
| Setting | Value |
|---|---|
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |

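
To make the learning-rate schedule concrete, here is a minimal sketch of a cosine schedule with warmup using the values listed above. Two details are assumptions rather than facts from this card: the warmup is taken to be linear, and the cosine phase is measured from the end of warmup. The function name `lr_at` is hypothetical.

```python
import math

MAX_LR = 6e-4          # peak learning rate
MIN_LR = 1e-4          # floor reached at the end of the decay
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000  # step at which the cosine decay ends

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from near zero up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    # Assumption: cosine progress is measured from the end of warmup.
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)

for step in (0, 1_000, 56_000, 300_000, 600_000):
    print(f"step {step:>7,}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, since less than a tenth of the planned decay has elapsed.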
## Limitations
- Research / educational artifact: it demonstrates that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face.
- Undertrained relative to its own schedule (56k/600k steps)