GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens

wpnews.pro


[ARTICLE · art-30599] src=github.com ↗ pub=2026-06-17T04:51Z topic=large-language-models verified=true sentiment=· neutral


A 124M-parameter GPT-2 model trained from scratch on OpenWebText data using a custom deep learning library achieved a validation loss of 2.764 nats and a perplexity of 15.87 after 56,000 steps (27.5B tokens). The model, which uses a custom byte-level BPE tokenizer, demonstrates that a hand-written library can train GPT-2 to a level close to the official checkpoint, though it remains undertrained relative to its 600,000-step schedule.

3 min read · 28 views · published Jun 17, 2026

The checkpoint is a 124M-parameter GPT-2 trained from scratch on OpenWebText using a hand-written deep learning library (no PyTorch in the model or training path).

Training metrics

| Metric | Value |
|---|---|
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (`exp(loss)`) | 15.87 |
| Start -> end val loss | 5.18 -> 2.76 |
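The perplexity figure is the exponential of the cross-entropy loss in nats. The following is a minimal sketch of that arithmetic using only the numbers quoted in this article; the variable names are illustrative and are not taken from the library.

```python
import math

val_loss_nats = 2.764        # reported validation cross-entropy (nats per token)
steps = 56_000               # optimizer steps completed
tokens_seen = 27.5e9         # total training tokens reported (rounded)

perplexity = math.exp(val_loss_nats)       # per-token perplexity
loss_bits = val_loss_nats / math.log(2)    # the same loss expressed in bits per token
tokens_per_step = tokens_seen / steps      # average, derived from rounded totals

print(f"perplexity      ~ {perplexity:.2f}")
print(f"bits per token  ~ {loss_bits:.3f}")
print(f"tokens per step ~ {tokens_per_step:,.0f}")
```

The computed perplexity is about 15.86, which agrees with the reported 15.87 up to rounding of the loss. The tokens-per-step value is only an average implied by the rounded totals, not a stated batch configuration.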

Zero-shot evaluation

Zero-shot evaluation results for checkpoint `openwebtext_gpt2_124m_baseline_GPT2_56000` via lm-evaluation-harness. `bits_per_byte`, `byte_perplexity`, and `word_perplexity` are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). `acc` is also comparable.

Caveats:

The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) are trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte. The LAMBADA `perplexity` is token-level and so carries the usual tokenizer dependence.
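To illustrate why byte- and word-normalized metrics do not depend on the tokenizer, here is a small sketch. The formulas follow the common definitions of these metrics and are an assumption here, not code copied from lm-evaluation-harness; the function name and all numbers are hypothetical.

```python
import math

def normalized_metrics(total_nll_nats, n_tokens, n_bytes, n_words):
    """Turn a summed negative log-likelihood (in nats) into four metrics."""
    return {
        "token_perplexity": math.exp(total_nll_nats / n_tokens),
        "bits_per_byte": total_nll_nats / (n_bytes * math.log(2)),
        "byte_perplexity": math.exp(total_nll_nats / n_bytes),
        "word_perplexity": math.exp(total_nll_nats / n_words),
    }

# Hypothetical setup: the same text and the same total NLL, but split into
# a different number of tokens by two different tokenizers.
a = normalized_metrics(total_nll_nats=3000.0, n_tokens=1000, n_bytes=4300, n_words=800)
b = normalized_metrics(total_nll_nats=3000.0, n_tokens=1100, n_bytes=4300, n_words=800)

print(a["token_perplexity"], b["token_perplexity"])  # these differ
print(a["bits_per_byte"], b["bits_per_byte"])        # these are identical
```

Two real models with different tokenizers would not share the same total NLL; the example holds it fixed only to isolate the effect of the denominator. Only the token-normalized number moves when the token count changes.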

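| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |
|---|---|---|---|---|
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | n/a |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| lambada_openai | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | n/a |
| PTB | word_perplexity | ↓ | 827.3800 | n/a |
| text8 | bits_per_byte | ↓ | 1.3039 | n/a |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | n/a |
| WikiText103 | byte_perplexity | ↓ | 2.0052 | n/a |
| WikiText103 | word_perplexity | ↓ | 41.2833 | n/a |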
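The three WikiText103 numbers can be cross-checked against one another. The sketch below assumes `byte_perplexity = 2 ** bits_per_byte` and that all three metrics are derived from one total log-likelihood; both are assumptions about how the metrics are defined, not statements from the release notes.

```python
import math

# Values copied from the WikiText103 results above.
bits_per_byte = 1.0037
byte_perplexity = 2.0052
word_perplexity = 41.2833

# Under the assumed definition, these two should agree up to rounding.
implied_byte_ppl = 2 ** bits_per_byte

# If both metrics share one total log-likelihood, their ratio in log space
# gives the average number of bytes per word in the evaluation text.
implied_bytes_per_word = math.log2(word_perplexity) / bits_per_byte

print(f"2 ** bits_per_byte = {implied_byte_ppl:.4f} (reported {byte_perplexity})")
print(f"implied bytes per word ~ {implied_bytes_per_word:.2f}")
```

The implied byte perplexity lands within rounding of the reported value, which suggests the table is internally consistent under these definitions.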

Architecture (GPT-2 Small, 124 million parameters)

The tokenizer is a custom byte pair encoder (BPE) trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (next multiple of 64) for efficiency; the extra 47 logit rows are unused.
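The padding arithmetic can be checked in a few lines. This is a minimal sketch; `pad_to_multiple` is a hypothetical helper written for illustration, not a function from the library.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257                        # tokenizer vocabulary size
padded = pad_to_multiple(vocab_size, 64)   # number of output-layer rows
unused = padded - vocab_size               # logit rows that no token maps to

print(padded, unused)  # 50304 47
```

Because no token ID ever points at the last 47 rows, one common way to keep them out of sampling is to slice the logits back to `vocab_size` before the softmax; whether this checkpoint's inference code does so is not stated in the source.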

Training configuration

Limitations

Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face.
Undertrained relative to its own schedule (56k of 600k steps).
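To put the schedule gap in numbers, the sketch below extrapolates from the reported totals. It assumes the number of tokens per step stays constant for the rest of the schedule, which the source does not confirm.

```python
steps_done = 56_000
steps_planned = 600_000
tokens_seen = 27.5e9   # reported total, rounded

fraction_done = steps_done / steps_planned

# Assumes a constant number of tokens per step for the remaining steps.
tokens_at_full_schedule = tokens_seen / fraction_done

print(f"schedule completed: {fraction_done:.1%}")
print(f"tokens at 600k steps ~ {tokens_at_full_schedule / 1e9:.0f}B")
```

Under that assumption the checkpoint has covered a little under a tenth of its planned training.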


Read original on github.com → github.com/workofart/ml-by-hand/releases/tag/gpt…
