Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

OpenAI trained a reward model using a loss function that does not require defining ideal reward values in advance. The loss function, used in OpenAI's 2022 paper, guides the model to assign higher rewards to preferred responses by rewarding a larger gap between the scores for better and worse outputs. Once trained, the reward model is used to further train the original model that underwent supervised fine-tuning.

In the previous article (https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-4-teaching-models-human-preferences-m7f), we created a reward model. In this article, we will continue exploring how this model is trained.

One important thing to note is that we do not need to define the ideal reward values in advance. Instead, the model learns to determine appropriate rewards on its own.

To train the reward model, OpenAI used a loss function in their 2022 paper that, for a single pair of responses, can be written as:

Loss = -log(sigmoid(Reward better - Reward worse))

This loss function helps the model learn good reward values without us explicitly defining what the rewards should be.

Here, Reward better is the reward the model assigns to the response that humans preferred, and Reward worse is the reward it assigns to the response they liked less.

Ideally, we want Reward better - Reward worse to be a large positive number.

The difference between the rewards is first passed through a sigmoid function. For any input value, the sigmoid function outputs a value between 0 and 1. In an ideal case, we want the sigmoid output to be close to 1. This happens when the preferred response receives a much higher reward than the worse response.

The output of the sigmoid function is then passed through a log function. Because the sigmoid output lies between 0 and 1, its log is always negative. In the ideal case, the sigmoid output is close to 1, so the log is close to 0, which is the highest value it can reach.

Finally, we multiply the result by -1. This turns the expression into a positive loss that optimization algorithms can minimize during training: it is close to 0 when the preferred response scores much higher, and it grows large when the worse response scores higher.

The interesting part is that we never explicitly tell the model what reward a particular response should receive. Instead, the loss function naturally guides the model toward assigning rewards in a way that makes preferred responses score higher than worse ones.

Once the reward model is fully trained, we can use it to train the original model that only went through supervised fine-tuning.

In the next article, we will explore how the reward model is used to further train the original model.
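To make the steps of the loss concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the code OpenAI used: the function names (preference_loss, train_two_rewards) are made up for this example, and the "reward model" is reduced to two plain numbers, one per response, instead of a neural network. The toy training loop only sees the loss and never a target reward value.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_loss(reward_better: float, reward_worse: float) -> float:
    """Compute -log(sigmoid(reward_better - reward_worse)) for one pair."""
    diff = reward_better - reward_worse
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to stay numerically stable.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def train_two_rewards(steps: int = 200, lr: float = 0.5) -> Tuple[float, float]:
    """Toy stand-in for a reward model (hypothetical): two free numbers.

    Runs plain gradient descent on the loss, with no target reward values.
    """
    reward_better, reward_worse = 0.0, 0.0
    for _ in range(steps):
        diff = reward_better - reward_worse
        # d(loss)/d(diff) = sigmoid(diff) - 1, which is always negative.
        grad = sigmoid(diff) - 1.0
        reward_better -= lr * grad    # gradient w.r.t. reward_better is grad
        reward_worse -= lr * (-grad)  # gradient w.r.t. reward_worse is -grad
    return reward_better, reward_worse


if __name__ == "__main__":
    # Loss for a well-ordered pair, a tie, and a wrongly ordered pair.
    for better, worse in [(2.0, -2.0), (0.0, 0.0), (-2.0, 2.0)]:
        print(better, worse, round(preference_loss(better, worse), 4))
    print(train_two_rewards())
```

When both rewards are equal, the sigmoid outputs 0.5 and the loss is log(2), roughly 0.693. The loss shrinks toward 0 as the preferred response pulls ahead and grows when the order is wrong. In the toy loop, the two numbers drift apart in the right direction even though no reward value was ever specified, which is the behaviour described above.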