When I first trained this transformer from scratch, it could continue text but it couldn't
follow instructions or reason. That's what post-training fixes. This docs/
folder walks
through the whole journey I built on top of the base model — every stage written from scratch
in plain PyTorch (no trl
, no peft
, no transformers
), trained on real public datasets, and runnable on a single GPU or scaled across multiple GPUs with DDP.
If you are new to LLM training internals, start with the new ** LLM Foundations** section before reading the stage pages. It explains the token shapes, decoder-only Transformer, attention masks, objectives, optimization loop, and generation mechanics that every later page relies on.
Recommended reading order¶ #
Foundations first:Tokenization->Transformer->Attention->Objectives->Optimization->Generation.Then the full pipeline:Data->Pretraining->SFT->Reward Model->DPO->PPO->GRPO.Finally run and inspect:Evaluation,Inference / Chat, and thecommand cheatsheet.
The pipeline mirrors how modern aligned/reasoning models are actually built:
Mermaid source (live, editable) #
flowchart TD
PILE([The Pile<br/>9.8B tokens]):::data --> PRE{{Pretrain<br/>~400M base}}:::model
PRE --> BASE[(base_pretrained.pt)]:::ckpt
BASE --> SFT{{SFT<br/>Alpaca · Dolly · GSM8K}}:::model
SFT --> SFTCK[(sft.pt)]:::ckpt
SFTCK --> RM{{Reward Model<br/>Bradley-Terry}}:::rl
SFTCK --> DPO{{DPO / ORPO / KTO<br/>preference}}:::rl
RM --> RMCK[(reward.pt)]:::ckpt
RMCK -->|reward signal| PPO{{PPO<br/>GAE + clip + KL}}:::rl
SFTCK --> PPO
SFTCK --> GRPO{{GRPO / RLVR<br/>group-relative}}:::rl
PPO --> EVAL([GSM8K eval<br/>+ chat / inference]):::eval
DPO --> EVAL
GRPO --> EVAL
classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a;
classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00;
classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500;
classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222;
classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;
The stages, in order¶ #
| # | Stage | What it teaches the model | Doc |
|---|---|---|---|
| 1 | Pretraining | ||
| language itself (next-token prediction on the Pile) | |||
SFT<think>/<answer>
format03_sft.mdReward Model04_reward_model.md** DPO / ORPO / KTOwithoutan RL loop05_dpo.mdPPO06_ppo.md GRPO / RLVR**07_grpo.md** Data pipeline**01_data_pipeline.md** Evaluation**08_evaluation.md** Inference / chat**09_inference.md## The one design rule: wrap, don't rewrite¶
Everything here sits on top of the original Transformer. I changed the educational model in exactly
one place — I added a
method that returns the final hidden states the
forward_hidden
lm_head
consumes. Every post-training head (a value head for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes aroundthat one method, so the from-scratch model you already understand stays intact.
Colour legend (used in every diagram in these docs)¶ #
🟩 data / corpus · 🟦 preprocessing · 🟦⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint
Each diagram is a hand-drawn, colour-coded Mermaid sketch,
pre-rendered to a PNG and embedded as an image(GitHub's live Mermaid doesn't reliably dolook: handDrawn
, and some viewers — e.g. the VS Code preview — block SVGs, so an embedded PNG shows everywhere). The editable Mermaid source sits in a collapsible"Mermaid source"block under each image. To regenerate the images after editing, see[diagrams/README.md].
Run the whole thing¶ #
Once the base model has pretrained (02_pretraining.md), the entire chain is one script:
bash scripts/run_posttraining.sh # SFT -> RM -> DPO -> PPO -> GRPO -> eval table
See POST_TRAINING.md for the condensed command reference.