00:00
2026-05-08
machinelearning.apple.com
machine-learning
RVPO: Risk-Sensitive Alignment via Variance Regularization
Researchers at Duke University introduced Reward-Variance Policy Optimization (RVPO), a risk-sensitive alignment method that penalizes inter-reward variance to prevent language models from neglecting โฆ