Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions


Source: dev.to · Published: 2026-05-25T19:15Z


OpenAI trained a reward model using a loss function that does not require defining ideal reward values in advance. The loss function, introduced in OpenAI's 2022 paper, guides the model to assign higher rewards to preferred responses by maximizing the difference between rewards for better and worse outputs. Once trained, the reward model is used to further train the original model that underwent supervised fine-tuning.

2 min read · Published May 25, 2026

In the previous article, we created a reward model. In this article, we will continue exploring how this model is trained.

One important thing to note is that we do not need to define the ideal reward values in advance.

Instead, the model learns to determine appropriate rewards on its own.

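To train the reward model, OpenAI used the following loss function in their 2022 paper, shown here in simplified form for a single pair of responses:

Loss = -log(sigmoid(Reward_better - Reward_worse))

This loss function helps the model learn good reward values without us explicitly defining what the rewards should be.

Where:

Reward_better is the reward the model assigns to the response that was preferred.

Reward_worse is the reward the model assigns to the response that was ranked lower.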

Ideally, we want:

Reward_better - Reward_worse

to be a large positive number.

The difference between the rewards is first passed through a sigmoid function.

For any input value, the sigmoid function outputs a value between 0 and 1.

In an ideal case, we want the sigmoid output to be close to 1. This happens when the preferred response receives a much higher reward than the worse response.
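As a small illustration of the sigmoid step, the sketch below applies the sigmoid to a few reward differences. The function name `sigmoid` and the sample differences are invented for this example and do not come from the original article or paper.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values of Reward_better - Reward_worse
for diff in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"difference = {diff:+.1f} -> sigmoid = {sigmoid(diff):.3f}")
```

A difference of 0 gives exactly 0.5, large positive differences move the output toward 1, and large negative differences move it toward 0.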

The output of the sigmoid function is then passed through a log function.

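In the ideal case, this produces a value close to 0. Because the sigmoid output always lies between 0 and 1, its log is always negative, so a value near 0 is the highest it can reach.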

Finally, we multiply the result by -1.

This turns the equation into a loss function that optimization algorithms can minimize during training.
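The three steps described above (sigmoid, log, multiply by -1) can be put together in a few lines. This is a minimal sketch for a single pair of scalar rewards; the function name `pairwise_loss` and the sample reward values are assumptions made for illustration, not code from the paper.

```python
import math

def pairwise_loss(reward_better: float, reward_worse: float) -> float:
    """Return -log(sigmoid(reward_better - reward_worse)) for one pair."""
    diff = reward_better - reward_worse
    prob = 1.0 / (1.0 + math.exp(-diff))  # sigmoid step: value in (0, 1)
    return -math.log(prob)                # log step, then multiply by -1

# Hypothetical reward pairs
print(pairwise_loss(3.0, -2.0))  # preferred response clearly ahead: small loss
print(pairwise_loss(1.0, 1.0))   # tie: loss equals ln(2), about 0.693
print(pairwise_loss(-2.0, 3.0))  # ranking is wrong: large loss
```

The loss shrinks toward 0 as the preferred response pulls ahead and grows when the ranking is wrong, which is exactly the signal an optimizer needs. This plain version can overflow for extremely negative differences; real training code would typically rely on a numerically stable log-sigmoid routine from its framework.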

The interesting part is that we never explicitly tell the model what reward value any particular response should receive.

Instead, the loss function naturally guides the model toward assigning rewards in a way that makes preferred responses score higher than worse ones.
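To see this behavior in action, here is a toy sketch of gradient descent on the loss for one pair of responses. It is not how a real reward model is trained: two plain numbers stand in for the model's outputs, whereas an actual reward model is a neural network whose weights are updated. The function name `train_toy_rewards`, the step count, and the learning rate are all assumptions chosen for the example.

```python
import math
from typing import Tuple

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train_toy_rewards(steps: int = 200, lr: float = 0.5) -> Tuple[float, float]:
    # Two learnable scalars stand in for the reward model's outputs.
    # Both start at 0.0, so there is no preference at the beginning.
    reward_better, reward_worse = 0.0, 0.0
    for _ in range(steps):
        diff = reward_better - reward_worse
        # Derivative of -log(sigmoid(diff)) with respect to diff
        grad_diff = sigmoid(diff) - 1.0
        # Chain rule: diff rises with reward_better and falls with reward_worse
        reward_better -= lr * grad_diff
        reward_worse -= lr * (-grad_diff)
    return reward_better, reward_worse

better, worse = train_toy_rewards()
print(f"reward_better = {better:.2f}, reward_worse = {worse:.2f}")
```

No target reward appears anywhere in the loop. The only information given is which response was preferred, yet the two values drift apart in the right direction because that is what lowers the loss.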

Once the reward model is fully trained, we can use it to train the original model that only went through supervised fine-tuning.

In the next article, we will explore how the reward model is used to further train the original model.

