13:39
2026-06-20
byungkwanlee.github.io
machine-learning
Nvidia-ZPPO: Zone of Proximal Policy Optimization
Nvidia researchers introduced Zone of Proximal Policy Optimization (ZPPO), a method that uses a replay buffer to repeatedly expose student models to hard questions, improving rollout accuracy without โฆ