iGRPO: Self-Feedback-Driven LLM Reasoning


Source: research.nvidia.com · Published: 2026-05-16T18:22Z


Researchers introduced Iterative Group Relative Policy Optimization (iGRPO), a two-stage reinforcement learning method that improves large language model reasoning by having the model sample several draft solutions, select the highest-reward one, and then learn to refine it. Under matched rollout budgets, iGRPO outperformed standard GRPO across multiple base models on mathematical reasoning benchmarks. Applied to OpenReasoning-Nemotron-7B trained on AceReason-Math, it reached 85.62% on AIME24 and 79.64% on AIME25, which the authors report as new state-of-the-art results. The results suggest that self-feedback-driven iterative refinement can substantially improve LLM performance on verifiable reasoning tasks.


Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
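The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `group_relative_advantages`, `build_refinement_prompt`, and `igrpo_step`, the refinement prompt wording, the group sizes, and the `eps` stabilizer are all hypothetical. The advantage computation uses the mean and standard deviation normalization commonly associated with GRPO. The policy-gradient update itself is omitted; the sketch only shows which samples would be scored and how their advantages are formed.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps).

    eps is an assumed stabilizer for groups where all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def build_refinement_prompt(prompt: str, draft: str) -> str:
    """Append the best draft to the original prompt.

    The wording of this template is hypothetical; the source only says the
    best draft is appended to the prompt.
    """
    return f"{prompt}\n\nPrevious best attempt:\n{draft}\n\nImprove on this attempt."


def igrpo_step(
    prompt: str,
    sample: Callable[[str], str],    # stand-in for sampling from the policy
    reward: Callable[[str], float],  # scalar reward, shared by both stages
    n_drafts: int = 4,
    n_refinements: int = 4,
) -> Tuple[str, List[Tuple[str, float]]]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = [sample(prompt) for _ in range(n_drafts)]
    draft_rewards = [reward(d) for d in drafts]
    best_idx = max(range(n_drafts), key=lambda i: draft_rewards[i])
    best_draft = drafts[best_idx]

    # Stage 2: sample refinements conditioned on the best draft and compute
    # group-relative advantages over the refinement group only.
    conditioned = build_refinement_prompt(prompt, best_draft)
    refinements = [sample(conditioned) for _ in range(n_refinements)]
    refinement_rewards = [reward(r) for r in refinements]
    advantages = group_relative_advantages(refinement_rewards)

    # A real trainer would feed these (refinement, advantage) pairs into a
    # GRPO-style policy update; that step is out of scope for this sketch.
    return best_draft, list(zip(refinements, advantages))
```

In this sketch the same `reward` function scores both drafts and refinements, mirroring the statement that draft selection uses the same scalar reward signal as optimization. Only the Stage 2 refinements receive advantages, so any gradient signal would come from the draft-conditioned samples.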


Read original on research.nvidia.com → research.nvidia.com/publication/2026-02_igrpo-se…



