{"slug": "train-llm-from-scratch", "title": "Train LLM from Scratch", "summary": "A developer trained a large language model from scratch using plain PyTorch, implementing the full post-training pipeline including SFT, reward modeling, DPO, PPO, and GRPO on public datasets, all runnable on a single GPU or scaled with DDP. The project emphasizes a modular design that wraps the base transformer without rewriting it, enabling instruction following and reasoning capabilities.", "body_md": "# Post-Training & Alignment: Overview\n\nWhen I first trained this transformer from scratch, it could *continue* text but it couldn't\n*follow instructions* or *reason*. That's what post-training fixes. This `docs/` folder walks\nthrough the whole journey I built on top of the base model: every stage written from scratch\nin plain PyTorch (no `trl`, no `peft`, no `transformers`), trained on real public datasets, and\nrunnable on a single GPU or scaled across multiple GPUs with DDP.\n\nIf you are new to LLM training internals, start with the new\n**LLM Foundations** section before reading the stage pages. 
It explains the\ntoken shapes, decoder-only Transformer, attention masks, objectives, optimization loop, and generation\nmechanics that every later page relies on.\n\n## Recommended reading order\n\n1. **Foundations first**: [Tokenization](foundations/tokenization/) -> [Transformer](foundations/transformer/) -> [Attention](foundations/attention/) -> [Objectives](foundations/objectives/) -> [Optimization](foundations/optimization/) -> [Generation](foundations/generation/).\n2. **Then the full pipeline**: [Data](01_data_pipeline/) -> [Pretraining](02_pretraining/) -> [SFT](03_sft/) -> [Reward Model](04_reward_model/) -> [DPO](05_dpo/) -> [PPO](06_ppo/) -> [GRPO](07_grpo/).\n3. **Finally run and inspect**: [Evaluation](08_evaluation/), [Inference / Chat](09_inference/), and the [command cheatsheet](howto/commands/).\n\nThe pipeline mirrors how modern aligned/reasoning models are actually built:\n\n## Mermaid source (live, editable)\n\n``` mermaid\nflowchart TD\n    PILE([The Pile<br/>9.8B tokens]):::data --> PRE{{Pretrain<br/>~400M base}}:::model\n    PRE --> BASE[(base_pretrained.pt)]:::ckpt\n    BASE --> SFT{{SFT<br/>Alpaca · Dolly · GSM8K}}:::model\n    SFT --> SFTCK[(sft.pt)]:::ckpt\n    SFTCK --> RM{{Reward Model<br/>Bradley-Terry}}:::rl\n    SFTCK --> DPO{{DPO / ORPO / KTO<br/>preference}}:::rl\n    RM --> RMCK[(reward.pt)]:::ckpt\n    RMCK -->|reward signal| PPO{{PPO<br/>GAE + clip + KL}}:::rl\n    SFTCK --> PPO\n    SFTCK --> GRPO{{GRPO / RLVR<br/>group-relative}}:::rl\n    PPO --> EVAL([GSM8K eval<br/>+ chat / inference]):::eval\n    DPO --> EVAL\n    GRPO --> EVAL\n    classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a;\n    classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00;\n    classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500;\n    classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222;\n    classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;\n```\n\n## The stages, 
in order\n\n| # | Stage | What it teaches the model | Doc |\n|---|---|---|---|\n| 1 | Pretraining | language itself (next-token prediction on the Pile) | |\n| | **SFT** | `<think>/<answer>` format | [03_sft.md](03_sft/) |\n| | **Reward Model** | | [04_reward_model.md](04_reward_model/) |\n| | **DPO / ORPO / KTO** | *without* an RL loop | [05_dpo.md](05_dpo/) |\n| | **PPO** | | [06_ppo.md](06_ppo/) |\n| | **GRPO / RLVR** | | [07_grpo.md](07_grpo/) |\n| | **Data pipeline** | | [01_data_pipeline.md](01_data_pipeline/) |\n| | **Evaluation** | | [08_evaluation.md](08_evaluation/) |\n| | **Inference / chat** | | [09_inference.md](09_inference/) |\n\n## The one design rule: *wrap, don't rewrite*\n\nEverything here sits on top of the original [Transformer](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/src/models/transformer.py). I changed the\neducational model in exactly **one** place: I added a [`forward_hidden`](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/src/models/transformer.py#L56) method that returns the final hidden states the\n`lm_head` consumes. Every post-training head (a value\nhead for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes\n*around* that one method, so the from-scratch model you already understand stays intact.\n\n## Colour legend (used in every diagram in these docs)\n\n🟩 data / corpus · 🟦 preprocessing · 🟦⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint\n\nEach diagram is a hand-drawn, colour-coded Mermaid sketch, pre-rendered to a PNG and embedded as an image (GitHub's live Mermaid doesn't reliably do `look: handDrawn`, and some viewers, e.g. the VS Code preview, block SVGs, so an embedded PNG shows everywhere). The editable Mermaid source sits in a collapsible \"Mermaid source\" block under each image. 
To regenerate the images after editing, see [diagrams/README.md].\n\n## Run the whole thing\n\nOnce the base model has been pretrained ([02_pretraining.md](02_pretraining/)), the entire chain is one script:\n\n``` bash\nbash scripts/run_posttraining.sh          # SFT -> RM -> DPO -> PPO -> GRPO -> eval table\n```\n\nSee [POST_TRAINING.md](https://github.com/FareedKhan-dev/train-llm-from-scratch/blob/main/POST_TRAINING.md) for the condensed command reference.", "url": "https://wpnews.pro/news/train-llm-from-scratch", "canonical_source": "https://FareedKhan-dev.github.io/train-llm-from-scratch/", "published_at": "2026-06-21 03:30:18+00:00", "updated_at": "2026-06-21 03:36:54.354490+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["PyTorch", "The Pile", "Alpaca", "Dolly", "GSM8K", "Bradley-Terry", "FareedKhan-dev"], "alternates": {"html": "https://wpnews.pro/news/train-llm-from-scratch", "markdown": "https://wpnews.pro/news/train-llm-from-scratch.md", "text": "https://wpnews.pro/news/train-llm-from-scratch.txt", "jsonld": "https://wpnews.pro/news/train-llm-from-scratch.jsonld"}}