# GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens

> Source: <https://github.com/workofart/ml-by-hand/releases/tag/gpt2-124m-openwebtext-56000>
> Published: 2026-06-17 04:51:14+00:00

# GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on [OpenWebText](https://huggingface.co/datasets/Skylion007/openwebtext) data using a hand-written deep learning library (no PyTorch in the model or training path).

## Training Metrics

| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (`exp(loss)`) | 15.87 |
| Bits per token (`loss / ln 2`) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
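
The derived rows in the table follow from the validation loss and the step count by simple arithmetic. The sketch below reproduces them using only the rounded values shown above; the variable names are illustrative and not taken from the training code.

```python
import math

# Values copied from the table above (rounded as published).
val_loss_nats = 2.764
tokens_per_step = 491_520
steps = 56_000

# Perplexity is the exponential of the cross-entropy measured in nats.
perplexity = math.exp(val_loss_nats)

# Dividing by ln 2 converts nats to bits.
bits_per_token = val_loss_nats / math.log(2)

# Total tokens processed so far.
tokens_seen = steps * tokens_per_step

print(f"perplexity     ~ {perplexity:.2f}")
print(f"bits per token ~ {bits_per_token:.2f}")
print(f"tokens seen    = {tokens_seen:,}")
```

Because the loss is rounded to three decimals, the recomputed perplexity can differ from the published 15.87 in the last digit.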

## Zero-shot evaluation

Zero-shot evaluation results for checkpoint `openwebtext_gpt2_124m_baseline_GPT2_56000` via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

`bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes/words rather than tokens, so they are **tokenizer-independent** and can be compared directly against other GPT-2 models despite the custom BPE (see the tokenizer caveats below). `acc` is also comparable.

Caveats:

- The BPE tokenizer *size* matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) are trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison with identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.
- The LAMBADA `perplexity` is token-level and so carries the usual tokenizer dependence.

| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |
|---|---|---|---|---|
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | — |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| lambada_openai | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | — |
| PTB | word_perplexity | ↓ | 827.3800 | — |
| text8 | bits_per_byte | ↓ | 1.3039 | — |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | — |
| WikiText103 | byte_perplexity | ↓ | 2.0052 | — |
| WikiText103 | word_perplexity | ↓ | 41.2833 | — |
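
The byte-normalized metrics are related by a fixed conversion: byte perplexity is 2 raised to the bits-per-byte value. The sketch below illustrates that relationship and how a summed negative log-likelihood in nats maps to bits per byte. The function names are hypothetical helpers, not part of lm-evaluation-harness.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte."""
    return total_nll_nats / (num_bytes * math.log(2))

def byte_perplexity(bpb: float) -> float:
    """Byte-level perplexity is 2 raised to the bits-per-byte value."""
    return 2.0 ** bpb

# WikiText103 bits_per_byte from the table above.
wikitext_bpb = 1.0037
print(f"byte perplexity ~ {byte_perplexity(wikitext_bpb):.4f}")
```

Since the normalizer is the byte count of the raw text, the result does not depend on how the tokenizer splits that text, which is why these rows are comparable across models with different BPE merges.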

## Architecture (GPT-2 Small, 124 million parameters)

| Component | Value |
|---|---|
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
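
As a sanity check on the "124M" figure, the parameter count can be estimated from the table. This is a sketch under assumptions the card does not state: the standard GPT-2 layout with biases on every linear layer, learned positional embeddings, a 4x MLP expansion, two LayerNorms per block plus a final one, and an output head tied to the token embedding. The count uses the unpadded vocabulary.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab=50_257, n_ctx=1024):
    """Estimate parameters for a standard GPT-2 layout (assumed, see lead-in)."""
    tok_emb = vocab * d_model                      # token embedding (tied with output head)
    pos_emb = n_ctx * d_model                      # learned positional embedding
    # Attention: packed QKV projection plus output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection to 4*d_model and down-projection, each with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    norms = 2 * (2 * d_model)                      # two LayerNorms per block (weight + bias)
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm

total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

If the embedding matrix is stored at the padded size of 50,304 rows, the total rises by 47 × 768 = 36,096 parameters.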

The tokenizer is a **custom byte-pair-encoding (BPE) tokenizer trained from scratch on OpenWebText** (49,990 merges, **50,257 tokens**, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
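
The padded size is the vocabulary rounded up to the next multiple of 64. A minimal sketch of that arithmetic, with `pad_to_multiple` as a hypothetical helper name:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size

print(padded_vocab, unused_rows)  # 50304 47
```

When sampling from a model padded this way, it is common to slice or mask the logits down to the first 50,257 entries so the unused rows can never be selected. Whether this checkpoint's inference code does so is not stated here.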

## Training configuration

| Setting | Value |
|---|---|
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
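
The batch arithmetic and learning-rate schedule in the table can be sketched as follows. The exact schedule shape is an assumption: linear warmup to the peak rate, then cosine decay to `min_lr` at step 600,000, in the style of common GPT-2 reproductions. The constant and function names are illustrative, not taken from the training code.

```python
import math

# Values from the table above.
MAX_LR = 6e-4
MIN_LR = 1e-4
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000
MICRO_BATCH, GRAD_ACCUM, SEQ_LEN = 60, 8, 1024

# 60 sequences x 8 accumulation steps x 1024 tokens per sequence.
tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed warmup + cosine shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup (assumed form).
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(tokens_per_step)            # 491520
print(f"{lr_at(56_000):.3e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, since less than a tenth of the decay horizon has elapsed.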

## Limitations

- Research / educational artifact: it demonstrates that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face
- Undertrained relative to its own schedule (56k/600k steps)
