{"slug": "the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses", "title": "The Three Phases of Post-Training: How LLMs Learn to Provide Sensible Responses", "summary": "Shrijith Venkatramana, building git-lrc, explains the three phases of post-training that transform pretrained LLMs into helpful assistants. The process starts with supervised fine-tuning (SFT) on curated examples, followed by reward modeling to capture human preferences, and finally reinforcement learning to optimize responses. This pipeline bridges the gap between raw language understanding and aligned, useful behavior.", "body_md": "*Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.*\n\nMost developers have heard the phrase:\n\n\"LLMs are trained on massive amounts of internet data.\"\n\nWhile technically true, it leaves out the most interesting part.\n\nPretraining teaches a model how language works. But it doesn't teach the model how to be helpful, harmless, honest, or aligned with human expectations.\n\nIf pretraining creates a brilliant but socially awkward intern, post-training turns that intern into a productive teammate.\n\nModern AI systems such as ChatGPT, Gemini, Claude, and others rely heavily on a multi-stage post-training pipeline. 
While implementations differ, the overall pattern is surprisingly consistent: supervised fine-tuning (SFT), reward modeling, and reinforcement learning (RL).\n\nLet's explore what each stage does, why it exists, and how they work together.\n\nImagine we train a model on the entire internet and ask:\n\n\"How do I become a better software engineer?\"\n\nThe model has seen thousands of answers to questions like this.\n\nThe model learns patterns in text, but it doesn't inherently know which response humans would prefer.\n\nIt only knows what tends to come next.\n\nThis is the core limitation of pretraining.\n\nThe model learns:\n\n\"What people write.\"\n\nBut not:\n\n\"What humans want.\"\n\nPost-training bridges this gap.\n\nThe first step, supervised fine-tuning (SFT), is teaching the model what good behavior looks like.\n\nResearchers create high-quality examples consisting of a prompt and an ideal response:\n\n```\nUser: Explain TCP vs UDP.\n\nAssistant:\nTCP provides reliable ordered delivery...\n```\n\nOr:\n\n```\nUser: Write a Python function that reverses a linked list.\n\nAssistant:\ndef reverse(head):\n    ...\n```\n\nThousands or millions of these examples are collected.\n\nThe model is then trained to imitate the desired responses.\n\nConceptually:\n\n```\nQuestion → Ideal Answer\n```\n\nbecomes\n\n```\nModel → Learn to reproduce ideal answer\n```\n\nThe objective is still next-token prediction, but now on carefully curated data instead of arbitrary internet text.\n\nThink of SFT as onboarding a new engineer.\n\nInstead of letting them learn exclusively from random GitHub repositories, you provide curated examples of the work you expect.\n\nThe engineer begins to imitate the patterns you want.\n\nSFT dramatically improves instruction following and response formatting.\n\nHowever, it still has a limitation.\n\nFor many prompts, there isn't one correct answer.\n\nThere may be multiple reasonable responses of varying quality.\n\nThat's where Reward Modeling enters.\n\nSuppose a user asks:\n\n\"How should I learn distributed systems?\"\n\nThree responses might all be technically correct.\n\nResponse A:\n\n```\nRead a textbook.\n```\n\nResponse B:\n\n```\nRead a textbook and build 
projects.\n```\n\nResponse C:\n\n```\nStudy networking, databases, consensus algorithms,\nthen implement a small Raft cluster.\n```\n\nMost humans would likely prefer C.\n\nBut how does a model learn that preference?\n\nThe answer is Reward Modeling.\n\nHuman evaluators compare multiple outputs for the same prompt:\n\n```\nPrompt\n\nAnswer A\nAnswer B\n```\n\nThey choose the better response.\n\nThousands or millions of comparisons are collected.\n\nExample:\n\n```\nPrompt:\nHow do I learn Go?\n\nPreferred:\nBuild projects and read Effective Go.\n\nRejected:\nJust read documentation.\n```\n\nA separate model is trained to predict these preferences.\n\nThis becomes the Reward Model.\n\nConceptually:\n\n```\nPrompt + Response → Quality Score\n```\n\nThe reward model acts like an automated judge.\n\nSFT teaches:\n\n\"Produce answers similar to examples.\"\n\nReward Modeling teaches:\n\n\"Recognize which answers humans prefer.\"\n\nThis distinction is subtle but important.\n\nOne is imitation.\n\nThe other is evaluation.\n\nNow we have an SFT model that can produce answers and a reward model that can score them.\n\nThe final stage uses Reinforcement Learning (RL) to optimize the assistant.\n\nThe process looks like:\n\n```\nPrompt\n   ↓\nModel generates answer\n   ↓\nReward model scores answer\n   ↓\nUpdate model to increase reward\n```\n\nThis loop is repeated millions of times.\n\nOver time, the assistant learns to generate responses that maximize the reward signal.\n\nHistorically, many systems used policy-gradient methods such as Proximal Policy Optimization (PPO).\n\nMore recently, newer approaches such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have gained popularity.\n\nThe exact algorithm matters less than the goal:\n\nMove the model toward outputs that humans consistently prefer.\n\nImagine code review automation.\n\nSFT teaches an engineer using examples of good pull requests.\n\nReward Modeling creates a senior reviewer that scores submissions.\n\nRL repeatedly updates the engineer based on reviewer feedback.\n\nEventually the engineer starts producing code that receives better review scores.\n\nOne interesting detail from recent Gemini research is the growing use of models 
themselves in post-training workflows.\n\nInstead of relying exclusively on humans, powerful models help generate training data, which humans then verify.\n\nThis creates a feedback loop:\n\n```\nModel\n  ↓\nGenerates data\n  ↓\nHumans verify\n  ↓\nImproved model\n  ↓\nGenerates better data\n```\n\nThe result is dramatically improved scalability.\n\nThe future of post-training may involve humans increasingly acting as supervisors while models handle much of the operational workload.\n\nA common assumption is that better AI comes primarily from larger models.\n\nIndustry experience increasingly suggests otherwise.\n\nMany recent gains come not from:\n\n```\nMore parameters\n```\n\nbut from:\n\n```\nBetter post-training data\n```\n\nA smaller model trained on excellent preference data can often outperform a larger model trained on mediocre data.\n\nThis explains why modern research papers frequently emphasize the quality of their post-training data.\n\nThe quality of feedback often matters more than the quantity of compute.\n\nPretraining teaches a model how language works.\n\nSupervised Fine-Tuning teaches it how to respond.\n\nReward Modeling teaches it what humans prefer.\n\nReinforcement Learning teaches it to consistently optimize for those preferences.\n\nTogether, these stages transform a statistical text predictor into something that feels surprisingly useful.\n\nAs foundation models become increasingly capable, the competitive advantage may shift away from raw model size and toward the sophistication of post-training systems and data quality pipelines.\n\nThe next major breakthrough in AI might not come from a bigger model.\n\nIt might come from a better teacher.\n\nIf you were designing a reward model for a coding assistant, what signals would you optimize for: correctness, readability, maintainability, performance, or something else entirely?\n\n*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.\n\ngit-lrc fixes this. 
It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*\n\nAny feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.\n\nGenAI today is a **race car without brakes**. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents *silently break things*: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. 
You often find out in production.\n\n**git-lrc is your braking system.** It hooks into `git commit` and runs an AI review on every diff. In short, git-lrc helps **Prevent Outages, Breaches, and Technical Debt Before They Happen**.\n\n**At a glance:** [10 risk categories](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · [100+ failure patterns tracked](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · every commit…", "url": "https://wpnews.pro/news/the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses", "canonical_source": "https://dev.to/shrsv/the-three-phases-of-post-training-how-llms-learn-to-be-provide-sensible-responses-10a9", "published_at": "2026-06-19 19:02:11+00:00", "updated_at": "2026-06-19 19:06:35.643793+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-research", "ai-agents"], "entities": ["Shrijith Venkatramana", "git-lrc", "ChatGPT", "Gemini", "Claude"], "alternates": {"html": "https://wpnews.pro/news/the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses", "markdown": "https://wpnews.pro/news/the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses.md", "text": "https://wpnews.pro/news/the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses.txt", "jsonld": "https://wpnews.pro/news/the-three-phases-of-post-training-how-llms-learn-to-provide-sensible-responses.jsonld"}}