{"slug": "nvidia-zppo-zone-of-proximal-policy-optimization", "title": "Nvidia-ZPPO: Zone of Proximal Policy Optimization", "summary": "Nvidia researchers introduced Zone of Proximal Policy Optimization (ZPPO), a method that uses a replay buffer to repeatedly expose student models to hard questions, improving rollout accuracy without imitating teacher logits. ZPPO graduates more hard questions than GRPO, especially those with near-zero initial accuracy, reducing policy drift and enhancing generalization.", "body_md": "Distillation forces a student to imitate teacher logits, inducing **memorization on the training samples** while **degrading generalization** on unseen samples (overfitting to both the dataset and the teacher).\n\n† denotes the prompt replay buffer. All experiments were run on Qwen3.5.\n\nFor hard questions, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? And how can the student learn to solve a hard question without policy drift (degrading generalization)?\n\nTechnically, we use a **Replay Buffer** to store **hard questions**, so the model revisits each **hard question** many times, not just once as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each **hard question**, which we expect to lift its **rollout accuracy**.\n\nA question is admitted to the **Replay Buffer** when its rollout accuracy stays **below 50%**, and it **graduates** (leaves the buffer) once that accuracy reaches **50%**. 
ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near **zero**.", "url": "https://wpnews.pro/news/nvidia-zppo-zone-of-proximal-policy-optimization", "canonical_source": "https://byungkwanlee.github.io/ZPPO-page/", "published_at": "2026-06-20 13:39:41+00:00", "updated_at": "2026-06-20 14:07:52.696971+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-research"], "entities": ["Nvidia", "Qwen3.5", "ZPPO", "GRPO"], "alternates": {"html": "https://wpnews.pro/news/nvidia-zppo-zone-of-proximal-policy-optimization", "markdown": "https://wpnews.pro/news/nvidia-zppo-zone-of-proximal-policy-optimization.md", "text": "https://wpnews.pro/news/nvidia-zppo-zone-of-proximal-policy-optimization.txt", "jsonld": "https://wpnews.pro/news/nvidia-zppo-zone-of-proximal-policy-optimization.jsonld"}}