Train LLM from Scratch

A developer trained a large language model from scratch using plain PyTorch, implementing the full post-training pipeline including SFT, reward modeling, DPO, PPO, and GRPO on public datasets, all runnable on a single GPU or scaled with DDP. The project emphasizes a modular design that wraps the base transformer without rewriting it, enabling instruction following and reasoning capabilities.

# Post-Training & Alignment: Overview

When I first trained this transformer from scratch, it could continue text, but it couldn't follow instructions or reason. That's what post-training fixes. This `docs/` folder walks through the whole journey I built on top of the base model: every stage is written from scratch in plain PyTorch (no `trl`, no `peft`, no `transformers`), trained on real public datasets, and runnable on a single GPU or scaled across multiple GPUs with DDP.

If you are new to LLM training internals, start with the new LLM Foundations section before reading the stage pages. It explains the token shapes, decoder-only Transformer, attention masks, objectives, optimization loop, and generation mechanics that every later page relies on.

## Recommended reading order

1. **Foundations first**: [Tokenization](foundations/tokenization/) → [Transformer](foundations/transformer/) → [Attention](foundations/attention/) → [Objectives](foundations/objectives/) → [Optimization](foundations/optimization/) → [Generation](foundations/generation/).
2. **Then the full pipeline**: [Data](01_data_pipeline/) → [Pretraining](02_pretraining/) → [SFT](03_sft/) → [Reward Model](04_reward_model/) → [DPO](05_dpo/) → [PPO](06_ppo/) → [GRPO](07_grpo/).
3. **Finally run and inspect**: [Evaluation](08_evaluation/), [Inference / Chat](09_inference/), and the [command cheatsheet](howto/commands/).
The pipeline mirrors how modern aligned/reasoning models are actually built:

Mermaid source (live, editable):

```mermaid
flowchart TD
    PILE["The Pile<br/>9.8B tokens"]:::data --> PRE{{"Pretrain<br/>~400M base"}}:::model
    PRE --> BASE["base_pretrained.pt"]:::ckpt
    BASE --> SFT{{"SFT<br/>Alpaca · Dolly · GSM8K"}}:::model
    SFT --> SFTCK["sft.pt"]:::ckpt
    SFTCK --> RM{{"Reward Model<br/>Bradley-Terry"}}:::rl
    SFTCK --> DPO{{"DPO / ORPO / KTO<br/>preference"}}:::rl
    RM --> RMCK["reward.pt"]:::ckpt
    RMCK -->|reward signal| PPO{{"PPO<br/>GAE + clip + KL"}}:::rl
    SFTCK --> PPO
    SFTCK --> GRPO{{"GRPO / RLVR<br/>group-relative"}}:::rl
    PPO --> EVAL["GSM8K eval<br/>+ chat / inference"]:::eval
    DPO --> EVAL
    GRPO --> EVAL
    classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a;
    classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00;
    classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500;
    classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222;
    classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;
```

## The stages, in order

| # | Stage | What it teaches the model | Doc |
|---|---|---|---|
| 1 | Pretraining | language itself (next-token prediction on the Pile) | |
| 2 | SFT | `<think>` / `<answer>` format | [03_sft.md](03_sft/) |
| 3 | Reward Model | | [04_reward_model.md](04_reward_model/) |
| 4 | DPO / ORPO / KTO | without an RL loop | [05_dpo.md](05_dpo/) |
| 5 | PPO | | [06_ppo.md](06_ppo/) |
| 6 | GRPO / RLVR | | [07_grpo.md](07_grpo/) |
| | Data pipeline | | [01_data_pipeline.md](01_data_pipeline/) |
| | Evaluation | | [08_evaluation.md](08_evaluation/) |
| | Inference / chat | | [09_inference.md](09_inference/) |

## The one design rule: wrap, don't rewrite

Everything here sits on top of the original [Transformer](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/src/models/transformer.py).
I changed the educational model in exactly one place: I added a [`forward_hidden`](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/src/models/transformer.py#L56) method that returns the final hidden states the `lm_head` consumes. Every post-training head (a value head for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes around that one method, so the from-scratch model you already understand stays intact.

## Colour legend used in every diagram in these docs

🟩 data / corpus · 🟦 preprocessing · 🟦⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint

Each diagram is a hand-drawn, colour-coded Mermaid sketch, pre-rendered to a PNG and embedded as an image (GitHub's live Mermaid doesn't reliably do `look: handDrawn`, and some viewers, e.g. the VS Code preview, block SVGs, so an embedded PNG shows everywhere). The editable Mermaid source sits in a collapsible "Mermaid source" block under each image. To regenerate the images after editing, see `diagrams/README.md`.

## Run the whole thing

Once the base model has pretrained ([02_pretraining.md](02_pretraining/)), the entire chain is one script:

```bash
bash scripts/run_posttraining.sh   # SFT → RM → DPO → PPO → GRPO → eval table
```

See [POST_TRAINING.md](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/POST_TRAINING.md) for the condensed command reference.
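
To make the "wrap, don't rewrite" rule a little more concrete, here is a minimal sketch of how a scalar reward head and a sequence log-prob helper could both compose around a single hidden-state method. Only the names `forward_hidden` and `lm_head` come from the text above. `TinyLM`, `RewardModel`, `bradley_terry_loss`, and `sequence_logprob` are hypothetical names invented for this illustration, and `TinyLM` is a toy stand-in, not the repo's Transformer. The Bradley-Terry pairwise loss is the standard `-log sigmoid(r_chosen - r_rejected)` form; the repo's actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Hypothetical stand-in for the base model (NOT the repo's Transformer)."""

    def __init__(self, vocab_size: int = 100, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward_hidden(self, idx: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, d_model): the states lm_head consumes.
        return self.embed(idx)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.forward_hidden(idx))


class RewardModel(nn.Module):
    """Wraps the base model and adds a scalar head; the base class is untouched."""

    def __init__(self, base: nn.Module, d_model: int):
        super().__init__()
        self.base = base
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        hidden = self.base.forward_hidden(idx)      # (batch, seq, d_model)
        last = hidden[:, -1, :]                     # assumes no right padding
        return self.reward_head(last).squeeze(-1)   # (batch,)


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), batch mean.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def sequence_logprob(base: nn.Module, idx: torch.Tensor) -> torch.Tensor:
    """Sum over t of log p(token_t | tokens before t), via the same hidden states."""
    logits = base.lm_head(base.forward_hidden(idx[:, :-1]))   # (batch, seq-1, vocab)
    logp = F.log_softmax(logits, dim=-1)
    targets = idx[:, 1:].unsqueeze(-1)                        # next-token targets
    return logp.gather(-1, targets).squeeze(-1).sum(-1)       # (batch,)


torch.manual_seed(0)
base = TinyLM()
rm = RewardModel(base, d_model=32)
chosen = torch.randint(0, 100, (4, 8))     # toy "preferred" token ids
rejected = torch.randint(0, 100, (4, 8))   # toy "dispreferred" token ids
loss = bradley_terry_loss(rm(chosen), rm(rejected))
logps = sequence_logprob(base, chosen)
```

The point of the sketch is the shape of the design: both the reward head and the log-prob helper call `forward_hidden` and never reach into the base model's internals, so a value head for PPO could be added the same way.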