RVPO: Risk-Sensitive Alignment via Variance Regularization


[ARTICLE · art-17312] src=machinelearning.apple.com ↗ pub=2026-05-08T00:00Z topic=machine-learning verified=true sentiment=↑ positive


Researchers at Duke University introduced Reward-Variance Policy Optimization (RVPO), a risk-sensitive alignment method that penalizes inter-reward variance so that language models do not neglect critical constraints during multi-objective training. In evaluations on medical and scientific reasoning tasks with up to 17 concurrent reward signals, RVPO raised the overall HealthBench score from 0.215 (GDPO) to 0.261 at the 14B scale, a relative gain of over 21%, and avoided the late-stage accuracy degradation on GPQA-Diamond seen in other multi-reward methods. The approach targets a weakness of critic-less RLHF methods that average their rewards: a high score on one objective can mask failures on others. Penalizing that imbalance is reported to give more reliable multi-objective alignment across model scales.

Published May 8, 2026 · 2 min read

Content type: paper · Published May 2026

Authors: Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra


Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
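The Taylor-expansion claim can be illustrated with the standard normalized LogSumExp soft-min. This is a sketch under assumptions: the temperature symbol \beta, the 1/K normalization, and the notation r_1, ..., r_K for the K reward signals are chosen here for illustration, and the paper's exact parametrization may differ.

```latex
% Normalized LogSumExp soft-min over K rewards (assumed form, temperature \beta > 0)
\[
\operatorname{softmin}_{\beta}(r_1,\dots,r_K)
  = -\frac{1}{\beta}\,\log\!\left(\frac{1}{K}\sum_{k=1}^{K} e^{-\beta r_k}\right)
\]
% Second-order expansion around \beta = 0 (cumulant expansion of the log-mean-exp)
\[
\operatorname{softmin}_{\beta}(r_1,\dots,r_K)
  = \bar{r} \;-\; \frac{\beta}{2}\,\sigma_r^{2} \;+\; O(\beta^{2}),
\qquad
\bar{r} = \frac{1}{K}\sum_{k=1}^{K} r_k,
\quad
\sigma_r^{2} = \frac{1}{K}\sum_{k=1}^{K}\left(r_k-\bar{r}\right)^{2}
\]
```

Under this form, the operator reduces to the arithmetic mean as \beta approaches 0 and to the hard minimum as \beta grows large, so \beta controls how strongly inter-reward variance is penalized.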

Figure 1: Constraint Neglect in Multi-Objective RLHF. (Left) Mean aggregation (GRPO/GDPO) treats outputs with critical constraint failures (Gen A) as mathematically identical to balanced outputs (Gen B), blinding the optimizer to critical failures. (Right) RVPO applies a soft-min operator to penalize inter-reward variance, heavily discounting Gen A to enforce bottleneck constraints.
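The contrast in Figure 1 can be reproduced numerically with a minimal sketch. The reward vectors `gen_a` and `gen_b`, the temperature `beta`, and the helper names are hypothetical and chosen only for illustration; the soft-min is the standard normalized LogSumExp form and is not confirmed to match the paper's implementation.

```python
import math


def mean(rewards):
    """Arithmetic mean, as used by mean-aggregation baselines."""
    return sum(rewards) / len(rewards)


def variance(rewards):
    """Population variance across the reward signals."""
    mu = mean(rewards)
    return sum((r - mu) ** 2 for r in rewards) / len(rewards)


def soft_min(rewards, beta=2.0):
    """Normalized LogSumExp soft-min: -(1/beta) * log(mean(exp(-beta * r))).

    The minimum is factored out before exponentiating for numerical stability.
    """
    m = min(rewards)
    s = sum(math.exp(-beta * (r - m)) for r in rewards) / len(rewards)
    return m - math.log(s) / beta


# Hypothetical reward vectors with the same mean (0.75).
gen_a = [1.0, 1.0, 1.0, 0.0]      # one critical constraint fails outright
gen_b = [0.75, 0.75, 0.75, 0.75]  # balanced across all constraints

for name, r in (("Gen A", gen_a), ("Gen B", gen_b)):
    print(f"{name}: mean={mean(r):.3f}  soft_min={soft_min(r):.3f}")
```

Mean aggregation scores the two generations identically, while the soft-min ranks the balanced Gen B above Gen A. For small `beta`, `soft_min(r, beta)` stays close to `mean(r) - beta / 2 * variance(r)`, which is the variance-penalty reading of the operator.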


source & further reading


Read original on machinelearning.apple.com → machinelearning.apple.com/research/rvpo-risk-sen…


