Source: dev.to

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

OpenAI trained a reward model using a loss function that does not require defining ideal reward values in advance. The loss function, introduced in OpenAI's 2022 paper, guides the model to assign higher rewards to preferred responses by maximizing the difference between rewards for better and worse outputs. Once trained, the reward model is used to further train the original model that underwent supervised fine-tuning.

2 min read · Published May 25, 2026

In the previous article, we created a reward model. In this article, we will continue exploring how this model is trained.

One important thing to note is that we do not need to define the ideal reward values in advance.

Instead, the model learns to determine appropriate rewards on its own.

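To train the reward model, OpenAI used the following loss function in their 2022 paper, shown here in simplified form for a single pair of responses:

Loss = -log(sigmoid(Reward_better - Reward_worse))

This loss function helps the model learn good reward values without us explicitly defining what the rewards should be.

Where:

Reward_better is the reward the model assigns to the preferred response.
Reward_worse is the reward the model assigns to the worse response.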

Ideally, we want:

Reward_better - Reward_worse

to be a large positive number.

The difference between the rewards is first passed through a sigmoid function.

For any input value, the sigmoid function outputs a value between 0 and 1.

In an ideal case, we want the sigmoid output to be close to 1. This happens when the preferred response receives a much higher reward than the worse response.
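To make the sigmoid step concrete, here is a minimal sketch in plain Python. The reward differences are made-up illustrative numbers, not values from the paper, and the `sigmoid` helper is written by hand so the example needs no external libraries.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, split into two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical values of Reward_better - Reward_worse.
for diff in (-4.0, 0.0, 4.0):
    print(f"diff={diff:+.1f} -> sigmoid={sigmoid(diff):.3f}")

# diff=-4.0 -> sigmoid=0.018  (worse response scored higher)
# diff=+0.0 -> sigmoid=0.500  (model cannot tell them apart)
# diff=+4.0 -> sigmoid=0.982  (preferred response scored much higher)
```

The larger the gap in favor of the preferred response, the closer the output gets to 1.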

The output of the sigmoid function is then passed through a log function.

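Because the sigmoid output lies between 0 and 1, its log is always negative. In the ideal case, where the sigmoid output is close to 1, the log is close to 0, which is the highest value it can reach.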

Finally, we multiply the result by -1.

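This flips the sign, so the ideal case gives a value close to 0 and a wrongly ranked pair gives a large positive value. That turns the equation into a loss function that optimization algorithms can minimize during training.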
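The three steps (sigmoid, log, multiply by -1) can be combined into one small function. This is a sketch for a single pair of responses, with hypothetical reward values; the name `pairwise_loss` is chosen here for illustration. It uses the identity -log(sigmoid(d)) = log(1 + exp(-d)), which gives the same result while staying numerically stable.

```python
import math

def pairwise_loss(reward_better: float, reward_worse: float) -> float:
    """Compute -log(sigmoid(reward_better - reward_worse)) for one pair."""
    diff = reward_better - reward_worse
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branching avoids overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical rewards, chosen only to show the three regimes.
print(f"{pairwise_loss(3.0, -1.0):.4f}")  # 0.0181: preferred response scored much higher
print(f"{pairwise_loss(0.5, 0.5):.4f}")   # 0.6931: equal rewards, loss is log(2)
print(f"{pairwise_loss(-1.0, 3.0):.4f}")  # 4.0181: ranking reversed, large loss
```

Only the difference between the two rewards matters: adding the same constant to both leaves the loss unchanged.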

The interesting part is that we never explicitly tell the model what reward value any particular response should receive.

Instead, the loss function naturally guides the model toward assigning rewards in a way that makes preferred responses score higher than worse ones.
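A toy experiment can illustrate this. In the sketch below, two free numbers stand in for the reward model's outputs on a preferred and a worse response. That is a simplifying assumption: a real reward model is a neural network, and the function name, step count, and learning rate here are hypothetical. Plain gradient descent on the loss separates the two rewards even though no target values are ever given.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train_two_rewards(steps: int = 200, lr: float = 0.5):
    """Gradient descent on -log(sigmoid(reward_better - reward_worse))."""
    # Both rewards start equal; no target reward is ever specified.
    reward_better, reward_worse = 0.0, 0.0
    for _ in range(steps):
        diff = reward_better - reward_worse
        # d(loss)/d(reward_better) = sigmoid(diff) - 1  (always negative)
        # d(loss)/d(reward_worse)  = 1 - sigmoid(diff)  (always positive)
        grad_better = sigmoid(diff) - 1.0
        grad_worse = 1.0 - sigmoid(diff)
        reward_better -= lr * grad_better  # pushed up
        reward_worse -= lr * grad_worse    # pushed down
    return reward_better, reward_worse

better, worse = train_two_rewards()
print(f"reward_better={better:.2f}, reward_worse={worse:.2f}")
```

The updates shrink as the gap grows, because the gradient is proportional to 1 - sigmoid(diff). The loss therefore rewards getting the ordering right rather than hitting any particular reward value.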

Once the reward model is fully trained, we can use it to train the original model that only went through supervised fine-tuning.

In the next article, we will explore how the reward model is used to further train the original model.

