RLHF vs DPO vs IPO vs KTO: which alignment method should you use

A developer compares four dominant alignment methods—RLHF, DPO, IPO, and KTO—for fine-tuning large language models, detailing their mathematical formulations, data requirements, and practical tradeoffs. The analysis highlights that DPO eliminates the reward model and reduces compute by roughly 3x compared to RLHF, while IPO addresses regularization issues in DPO. The choice of method depends on factors such as data type, compute budget, and desired failure mode avoidance.

You have a base model, say Llama 3.1 8B, that can write poetry in any meter and pass the bar exam. It can also generate instructions for synthesizing controlled substances, roleplay as a manipulative therapist, and explain in loving detail why your pull request is an affront to good taste. You need to align it: remove the harmful outputs while keeping the capability. Your mentor says "use RLHF." A paper on your feed says DPO is simpler. Your colleague swears by KTO because they only have thumbs-up/thumbs-down log data from production. Where do you start?

Choosing an alignment method is not a theoretical debate. It is a practical decision that depends on your data, your compute budget, and the failure modes you are trying to avoid. This post compares the four dominant approaches side by side, with the actual math, the data requirements, and the sharp edges you will hit in production.

The alignment method you pick determines three things that directly affect shipping timelines: the data you have to collect, the compute you have to pay for, and the failure modes you have to debug. Understanding these tradeoffs is the difference between an aligned model that ships in two weeks and an alignment project that drags for three months.

All four methods start from the same place: a supervised fine-tuned (SFT) model and a dataset that captures human preferences. How they use that data differs fundamentally.

The canonical approach, RLHF (reinforcement learning from human feedback), popularized by OpenAI's InstructGPT paper (Ouyang et al., 2022), is a three-stage pipeline: supervised fine-tuning, training a reward model (RM) on the preference data, and optimizing the policy against that reward model with PPO under a KL penalty toward the reference policy.

```
# Simplified PPO update (conceptual pseudocode, not runnable as-is)
reward = reward_model.score(policy_output) - beta * kl_divergence(policy, ref_policy)
policy_loss = -ppo_clip(reward, old_logprobs, new_logprobs)
```

The three-stage pipeline is expensive: each stage requires its own training run, its own GPU budget, and its own hyperparameter sweep. The reward model can latch onto spurious correlations that the policy then learns to exploit (reward hacking), and PPO is sensitive to the learning rate and the KL penalty coefficient.
On the plus side, online PPO can in theory discover outputs that are better than any human annotation in the dataset.

Rafailov et al. (2023) showed that a separate reward model is not strictly necessary. The key insight is that, under the Bradley-Terry preference model (the statistical model behind most reward models), the KL-regularized RLHF objective has a closed-form solution that relates the optimal policy directly to the reference policy, so the preference data can be fit with the policy itself. DPO (Direct Preference Optimization) eliminates the reward model entirely. The training loss is:

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta\left(\log\frac{\pi(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right)\right]
$$

where $y_w$ is the chosen output, $y_l$ is the rejected output, $\pi$ is the current policy, $\pi_{\text{ref}}$ is the frozen reference policy (the SFT model), and $\beta$ controls how far the policy can diverge from the reference.

```python
# DPO in practice using Hugging Face TRL.
# policy_model, ref_model, and preference_dataset are assumed to be loaded already.
from trl import DPOConfig, DPOTrainer

# Recent TRL releases take beta on DPOConfig (older ones accepted it on the trainer).
# output_dir is a placeholder path.
training_args = DPOConfig(output_dir="dpo-output", beta=0.1)  # beta: KL regularization strength

dpo_trainer = DPOTrainer(
    model=policy_model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=preference_dataset,
)
dpo_trainer.train()
```

DPO runs in a single training loop on a static dataset. There is no reward model, no PPO, no online generation during training. This makes it dramatically cheaper: approximately 3x less compute than RLHF for comparable results on most benchmarks.

The tradeoff: DPO is an offline method. It never sees the model's own generations during training, so it can over-optimize for preferences that do not generalize. It also requires pairwise preference data: you need two outputs per prompt, one explicitly preferred over the other.

Azar et al. (2023) at DeepMind identified a subtle problem with DPO: because of its implicit reward parameterization, the regularization term may not actually constrain the policy the way it should, particularly when the preferences in the data are close to deterministic. IPO (Identity Preference Optimization) replaces the reward parameterization with an identity mapping, providing stronger regularization. The IPO loss is:

$$
\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\left[\left(\log\frac{\pi(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\tau}\right)^2\right]
$$

where $\tau$ is a regularization parameter.
The squared loss directly penalizes the policy when the log-likelihood gap diverges too far from the target margin $1/(2\tau)$, in either direction. This provides a cleaner optimization landscape and better-calibrated probabilities at inference time.

```
# IPO loss (conceptual)
margin = log_ratio_w - log_ratio_l
loss = (margin - 1 / (2 * tau)) ** 2  # squared distance from the target margin, on both sides
```

IPO requires the same pairwise data as DPO. It is slightly more stable in practice, especially on noisy preference datasets where DPO can amplify annotator disagreement.

Ethayarajh et al. (2024) at Contextual AI took a different tack. Inspired by prospect theory (Kahneman and Tversky, 1979), they built an alignment method, KTO (Kahneman-Tversky Optimization), that works with per-sample binary feedback (thumbs up or thumbs down) instead of pairwise preferences. The KTO loss treats gains (chosen responses) and losses (rejected responses) asymmetrically:

$$
\mathcal{L}_{\text{KTO}} = \mathbb{E}_{(x,\, y)}\left[w(y)\left(1 - \sigma\left(\pm\,\beta\left(\log\frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - z_{\text{ref}}\right)\right)\right)\right]
$$

where the sign inside the sigmoid is positive for chosen examples and negative for rejected ones, $w(y)$ is a weighting factor that differs for chosen and rejected examples, and $z_{\text{ref}}$ is a reference point estimated from the data (in the paper, an estimate of the KL divergence between the policy and the reference). The key asymmetry: because chosen and rejected outputs get separate weights, losses (rejected outputs) can be weighted more heavily than gains (chosen outputs), mirroring the human loss aversion documented in behavioral economics.

```python
# KTO trainer in Hugging Face TRL.
# policy_model, ref_model, binary_feedback_dataset, and training_args (a KTOConfig)
# are assumed to be defined already.
from trl import KTOTrainer

kto_trainer = KTOTrainer(
    model=policy_model,
    ref_model=ref_model,
    train_dataset=binary_feedback_dataset,  # no pairs needed
    args=training_args,
)
kto_trainer.train()
```

KTO's major advantage is that it can use data you already have. Many production systems log per-output user feedback (clicks, likes, flags) without recording a pairwise comparison. KTO can train directly on this signal. The tradeoff is lower sample efficiency per annotated example: pairwise comparisons carry more information per annotation than binary labels.
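To make the three offline losses above concrete, here is a minimal sketch that evaluates each one for a single example from precomputed sequence log-probabilities. The function names, argument names, and default values are illustrative assumptions, not TRL's API, and `z_ref` is passed in as a plain number rather than estimated from a batch as in the KTO paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid(x: float) -> float:
    return math.exp(log_sigmoid(x))


def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO: negative log-sigmoid of the scaled log-ratio margin."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def ipo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, tau=0.1):
    """IPO: squared distance between the margin and the target 1 / (2 * tau)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def kto_loss(logp, ref_logp, desirable, z_ref=0.0, beta=0.1,
             w_desirable=1.0, w_undesirable=1.0):
    """KTO for one labeled example; z_ref is supplied by the caller (assumption)."""
    log_ratio = logp - ref_logp
    if desirable:
        return w_desirable * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    return w_undesirable * (1.0 - sigmoid(beta * (z_ref - log_ratio)))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy has moved toward the chosen
    # output and away from the rejected one, relative to the reference.
    print("DPO:", dpo_loss(-10.0, -12.0, -15.0, -12.0))
    print("IPO:", ipo_loss(-10.0, -12.0, -15.0, -12.0))
    print("KTO (thumbs up):", kto_loss(-10.0, -12.0, desirable=True))
    print("KTO (thumbs down):", kto_loss(-10.0, -12.0, desirable=False))
```

Note the difference in shape: the DPO term keeps shrinking as the margin grows, while the IPO term starts growing again once the margin overshoots its target, which is where the stronger regularization comes from. In a real training loop these would be batched tensor operations over summed token log-probabilities.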
| Dimension | RLHF | DPO | IPO | KTO |
|---|---|---|---|---|
| Data required | Pairwise comparisons | Pairwise comparisons | Pairwise comparisons | Binary good/bad |
| Reward model needed | Yes (separate training) | No | No | No |
| Training stages | 3 (SFT + RM + PPO) | 1 (after SFT) | 1 (after SFT) | 1 (after SFT) |
| Compute cost | Highest (~3x DPO) | Low | Low | Low |
| Online generation | Yes (PPO samples during training) | No (offline) | No (offline) | No (offline) |
| Stability | Tricky (PPO hyperparameters) | Good, can overfit to noise | Better (identity regularization) | Good |
| Best for | High-quality RM, large compute budget | Clean pair data, tight budget | Noisy pair data, production stability | Production logs (binary feedback) |
| Key risk | Reward hacking, training collapse | Overfitting on static data | Slightly more complex loss | Needs enough binary data |

Here is the decision flow:

```mermaid
flowchart TD
    A[Do you have pairwise<br/>preference data?] -->|Yes| B{Do you have budget<br/>for a reward model<br/>and PPO?}
    A -->|No / only binary feedback| C[Use KTO]
    B -->|Yes| D[RLHF: full pipeline<br/>highest potential ceiling]
    B -->|No| E{Is your preference<br/>data clean or noisy?}
    E -->|Clean| F[DPO: simplest<br/>single-stage training]
    E -->|Noisy| G[IPO: better regularization<br/>for noisy preferences]
```

Running DPO on binary data. DPO requires pairwise preferences: a chosen output and a rejected output for the same prompt. If you concatenate unrelated good and bad outputs into pairs, DPO will learn arbitrary decision boundaries. Use KTO for binary data.

Ignoring the reference model. DPO, IPO, and KTO all require a frozen reference model (usually your SFT checkpoint). The loss depends on the log-ratio between the current policy and the reference. If you use a different reference model, the optimization target changes silently. Always use the same checkpoint that produced the data.

Skipping SFT. None of these methods work well on a raw pretrained base model.
You need an SFT model that can produce reasonable completions. The alignment stage assumes the model can already generate coherent, on-task outputs: it is steering existing behavior, not teaching the model to generate text from scratch.

Treating beta as a free parameter. The beta (or tau) parameter controls how far the aligned policy can stray from the reference. A beta too high and you get no alignment effect. A beta too low and the model unlearns general capabilities (catastrophic forgetting). Sweep it systematically, with at least 3 values (e.g., 0.01, 0.1, 0.5), on a validation set before committing to a full run.

Assuming RLHF always wins. On many benchmarks, DPO matches or exceeds RLHF at a fraction of the compute. The main advantage of RLHF is the online generation during PPO, which can discover novel high-reward outputs not present in the training data. For most production use cases where you already have a representative dataset, DPO/IPO/KTO are the better choice.

Do not use any of these methods if you have fewer than a few hundred preference examples. The signal-to-noise ratio at that scale is too low. Collect at least 500–1000 examples, and prefer 5000+ for reliable results.

Do not use RLHF if you are budget-constrained or shipping on a timeline under four weeks. The three-stage pipeline (SFT, reward model, PPO), with hyperparameter tuning and reward model debugging, routinely takes 2–3 months for teams that are new to it.

Do not use DPO or IPO if your data is binary per-output feedback with no pairwise structure. You will have to fabricate pairs from unrelated outputs, which introduces noise. Use KTO instead.

Do not use KTO if you have clean pairwise preferences and enough compute for DPO. Pairwise comparisons carry more information per example, so DPO will converge faster with fewer total annotations.

Do not skip evaluating your aligned model on capability benchmarks. Every alignment method trades some general capability for safety.
If your aligned model drops 5% on MMLU relative to the SFT checkpoint, the policy has likely drifted too far from the reference (beta too low, or too many training steps). Run MMLU, HellaSwag, and a task specific to your domain before and after alignment.

Pairwise preference data is the gold standard for alignment, but collecting it at scale is expensive and annotator agreement is often low. Next time: how to build and maintain a preference dataset, covering sampling strategy, inter-annotator agreement metrics, and how to detect when your annotation pipeline is quietly poisoning your model.