Nvidia-ZPPO: Zone of Proximal Policy Optimization

wpnews.pro


[ARTICLE · art-34881] src=byungkwanlee.github.io ↗ pub=2026-06-20T13:39Z topic=machine-learning verified=true sentiment=↑ positive


Nvidia researchers introduced Zone of Proximal Policy Optimization (ZPPO), a method that uses a replay buffer to repeatedly expose student models to hard questions, improving rollout accuracy without imitating teacher logits. ZPPO graduates more hard questions than GRPO, especially those with near-zero initial accuracy, reducing policy drift and enhancing generalization.




Distillation forces a student to imitate the teacher's logits, which induces memorization of the training samples and degrades generalization to unseen samples. In other words, the student overfits to both the dataset and the teacher.
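To make "imitating teacher logits" concrete, here is a minimal sketch of one common form of logit distillation: the forward KL divergence between the teacher's and the student's softmax distributions. The source does not say which distillation loss it compares against, so the choice of forward KL and the function names below are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits):
    """Assumed loss: KL(teacher || student) over one token's vocabulary.

    Minimizing this pulls the student's distribution toward the teacher's,
    which is the imitation behavior the article argues against.
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


if __name__ == "__main__":
    # Identical logits give zero loss; disagreeing logits give a positive loss.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([2.0, 0.0], [0.0, 2.0]))
```

The loss reaches zero only when the student reproduces the teacher's distribution exactly, which is why this objective rewards imitation rather than independent problem solving.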

† denotes the prompt replay buffer. All experiments were run on Qwen3.5.

For hard questions, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? And how can the student learn to solve hard questions without policy drift (degraded generalization)?

Technically, we use a Replay Buffer to store hard questions, so the model revisits each hard question many times — not just once, as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each hard question, which we expect to lift its rollout accuracy.

A question is admitted to the Replay Buffer when its rollout accuracy stays below 50%, and it graduates — leaving the buffer — once that accuracy reaches 50%. ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near zero.
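The admission and graduation rule described above can be sketched as a small bookkeeping class. This is not the authors' implementation: the class name, method names, and the way rollout outcomes are passed in are hypothetical, and only the 50% threshold comes from the article.

```python
from dataclasses import dataclass, field

GRADUATION_THRESHOLD = 0.5  # the 50% rollout-accuracy cutoff stated in the article


def rollout_accuracy(outcomes):
    """Fraction of correct rollouts; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


@dataclass
class ReplayBuffer:
    """Hypothetical prompt replay buffer keyed by question id."""

    hard: dict = field(default_factory=dict)       # question id -> latest accuracy
    attempts: dict = field(default_factory=dict)   # question id -> attempts while hard
    graduated: list = field(default_factory=list)  # ids that have left the buffer

    def update(self, qid, outcomes):
        """Admit, keep, or graduate a question based on its latest rollouts."""
        acc = rollout_accuracy(outcomes)
        if acc < GRADUATION_THRESHOLD:
            # Below 50%: the question enters (or stays in) the buffer to be revisited.
            self.hard[qid] = acc
            self.attempts[qid] = self.attempts.get(qid, 0) + 1
            return "buffered"
        if qid in self.hard:
            # Reached 50%: the question graduates and leaves the buffer.
            del self.hard[qid]
            self.graduated.append(qid)
            return "graduated"
        return "skipped"  # was never hard, so it was never admitted


if __name__ == "__main__":
    buf = ReplayBuffer()
    print(buf.update("q1", [False] * 8))                 # all rollouts wrong
    print(buf.update("q1", [True, False, False, False]))  # 25% accuracy
    print(buf.update("q1", [True, True, False, False]))   # 50% accuracy
```

In this sketch a question with near-zero initial accuracy keeps being resampled until its accuracy crosses the threshold, which mirrors the repeated exposure the article contrasts with GRPO's single pass.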

source & further reading

Original article: byungkwanlee.github.io/ZPPO-page/
