04:00
2026-06-04
arxiv.org
machine-learning
Self-Distilled Policy Gradient
Researchers introduced SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation and full-vocabulary on-policy self-distillat…
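
For readers unfamiliar with the term, group-relative advantages with standard-deviation normalization are commonly computed as in the sketch below. This is the generic formulation only; the summary above is truncated, so whether SDPG uses exactly this form is an assumption. The function name `group_relative_advantages`, the `eps` parameter, and the sample rewards are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each verifier reward on its group mean and scale by the group std.

    `rewards` holds the verifier scores for several sampled responses to the
    same prompt. `eps` (an assumed value) avoids division by zero when every
    response in the group receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scored above the group mean get a positive advantage and those below get a negative one; a group whose rewards are all identical yields zero advantages and therefore no gradient signal from this term.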