[ARTICLE · art-30599] src=github.com pub= topic=large-language-models verified=true sentiment=neutral

GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens

A 124M-parameter GPT-2 model trained from scratch on OpenWebText data using a custom deep learning library achieved a validation loss of 2.764 nats and a perplexity of 15.87 after 56,000 steps (27.5B tokens). The model, which uses a custom byte-level BPE tokenizer, demonstrates that a hand-written library can train GPT-2 to a level close to the official checkpoint, though it remains undertrained relative to its 600,000-step schedule.

Published Jun 17, 2026 (3 min read)

The checkpoint is a 124M-parameter GPT-2 trained from scratch on OpenWebText with a hand-written deep learning library (no PyTorch in the model or training path).

Training Metrics #

| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (`exp(loss)`) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
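The derived rows in the table follow directly from the validation loss and the step count. The snippet below is a minimal sketch of that arithmetic using only the values quoted above; the variable names are illustrative and not taken from the project's code.

```python
import math

# Values quoted in the training metrics table.
val_loss_nats = 2.764        # validation cross-entropy per token
steps_trained = 56_000
tokens_per_step = 491_520

perplexity = math.exp(val_loss_nats)           # exp(loss)
bits_per_token = val_loss_nats / math.log(2)   # convert nats to bits
tokens_seen = steps_trained * tokens_per_step

print(f"perplexity     ~ {perplexity:.2f}")
print(f"bits per token ~ {bits_per_token:.2f}")
print(f"tokens seen    = {tokens_seen:,}")
```

Starting from the rounded loss of 2.764 gives a perplexity of about 15.86; the reported 15.87 was presumably computed from the unrounded loss.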

Zero-shot evaluation #

Zero-shot evaluation results for checkpoint `openwebtext_gpt2_124m_baseline_GPT2_56000` via lm-evaluation-harness. `bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). `acc` is also comparable.

Caveats:

  • The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) are trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte. The LAMBADA `perplexity` is token-level and so carries the usual tokenizer dependence.
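To illustrate why bits-per-byte removes the tokenizer from the comparison, the sketch below divides a summed negative log-likelihood by the byte length of the text instead of the token count. The function name and the example numbers are hypothetical and are not taken from the evaluation run.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    return total_nll_nats / (n_bytes * math.log(2))

# Hypothetical example: 1,000 tokens at 2.764 nats each, covering 4,400 bytes of text.
example = bits_per_byte(total_nll_nats=2.764 * 1_000, n_bytes=4_400)
print(f"{example:.3f} bits per byte")
```

Two tokenizers that split the same text into different numbers of tokens are still scored against the same byte count, which is what makes the metric comparable across models.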

Direction: ↑ = higher is better, ↓ = lower is better.

| Task | Metric | Direction | Value | ± Stderr |
|---|---|---|---|---|
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | N/A |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| lambada_openai | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | N/A |
| PTB | word_perplexity | ↓ | 827.3800 | N/A |
| text8 | bits_per_byte | ↓ | 1.3039 | N/A |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | N/A |
| WikiText103 | byte_perplexity | ↓ | 2.0052 | N/A |
| WikiText103 | word_perplexity | ↓ | 41.2833 | N/A |
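The two byte-level WikiText103 numbers are the same measurement in different units. As far as the lm-evaluation-harness definitions go, `byte_perplexity` should equal 2 raised to `bits_per_byte`; the short check below, using only the reported values, is a sanity sketch rather than a reproduction of the harness.

```python
# Reported WikiText103 values from the zero-shot results.
wikitext_bits_per_byte = 1.0037
reported_byte_perplexity = 2.0052

# Perplexity per byte is the exponential (base 2) of the bits per byte.
derived_byte_perplexity = 2 ** wikitext_bits_per_byte
print(f"derived byte_perplexity ~ {derived_byte_perplexity:.4f}")
```

The derived value agrees with the reported one to within rounding of the published `bits_per_byte`.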

Architecture (GPT-2 Small, 124M parameters) #

| Property | Value |
|---|---|
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
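The "124M" figure can be reproduced from the dimensions in the table. The sketch below assumes the standard GPT-2 layout, which the source does not spell out: learned position embeddings, biases on every linear layer and LayerNorm, a 4x MLP expansion, and an output head tied to the token embedding. The function name is hypothetical.

```python
def gpt2_param_count(n_layer=12, d_model=768, n_ctx=1024, vocab=50_257):
    """Parameter count for a GPT-2 style decoder under the assumptions stated above."""
    tok_emb = vocab * d_model      # token embedding, assumed shared with the output head
    pos_emb = n_ctx * d_model      # learned position embedding
    layer_norm = 2 * d_model       # gain and bias
    # Packed QKV projection (d_model -> 3 * d_model) plus the attention output projection.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: d_model -> 4 * d_model -> d_model, both with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    block = 2 * layer_norm + attn + mlp
    return tok_emb + pos_emb + n_layer * block + layer_norm  # plus the final LayerNorm

unpadded_params = gpt2_param_count()
padded_params = gpt2_param_count(vocab=50_304)
print(f"{unpadded_params:,}")
print(f"{padded_params:,}")
```

Under these assumptions the count is about 124.4M with the 50,257-token vocabulary and about 124.5M with the padded 50,304 rows.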

The tokenizer is a custom byte-level byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 rows (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
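The padding arithmetic is a round-up to the next multiple of 64. A minimal sketch follows; the helper name `pad_to_multiple` is illustrative and not from the project.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size
print(padded_vocab, unused_rows)
```

Because the tokenizer never produces token ids at or above 50,257, the extra rows are never selected as targets.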

Training configuration #

| Setting | Value |
|---|---|
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
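The learning-rate schedule and batch arithmetic in the table can be sketched as follows. The source states only "cosine + 1,000-step warmup"; the linear warmup shape and the choice to measure cosine progress from the end of warmup are assumptions, and the function name `lr_at` is hypothetical.

```python
import math

MAX_LR, MIN_LR = 6e-4, 1e-4
WARMUP_STEPS, DECAY_STEPS = 1_000, 600_000

def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup (assumed), then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)

for step in (0, 1_000, 56_000, 600_000):
    print(step, f"{lr_at(step):.2e}")

# Global batch: 60 sequences per micro-batch, 8 accumulation steps, 1024 tokens per sequence.
tokens_per_step = 60 * 8 * 1024
print(tokens_per_step)
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, which fits the note that the checkpoint is early in its planned schedule.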

Limitations #

  • Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint on Hugging Face
  • Undertrained relative to its own schedule (56k/600k steps)