04:00
2026-06-30
arxiv.org
large-language-models
BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards
Researchers introduced BV-Blend, a critic-free reinforcement learning framework that stabilizes advantage estimation for aligning large language models by blending prompt-local on-policy statistics wi…