[ARTICLE · art-13685] src=research.nvidia.com pub= topic=large-language-models verified=true sentiment=↑ positive

iGRPO: Self-Feedback-Driven LLM Reasoning

Researchers introduced Iterative Group Relative Policy Optimization (iGRPO), a two-stage reinforcement learning method that improves large language model reasoning by having the model generate and refine its own best draft solutions. In tests on mathematical benchmarks, iGRPO outperformed standard GRPO across multiple base models and achieved new state-of-the-art results of 85.62% and 79.64% on the AIME24 and AIME25 datasets. The approach demonstrates that self-feedback-driven iterative refinement can significantly enhance LLM performance in verifiable reasoning tasks.

1 min read · published May 16, 2026

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
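The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample` and `reward` are hypothetical callables standing in for the policy and the scalar reward, the wording used to append the draft to the prompt is invented for the example, and the use of a population standard deviation with a small `eps` is one common convention for group-relative normalization. The sketch stops at the advantages and omits the policy-gradient update itself.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center each reward on the group mean and scale by the group std.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def igrpo_step(
    prompt: str,
    sample: Callable[[str], str],         # hypothetical: draws one completion from the policy
    reward: Callable[[str, str], float],  # hypothetical: scalar reward for (prompt, completion)
    n_drafts: int = 4,
    n_refinements: int = 4,
) -> Tuple[str, List[str], List[float]]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one,
    # scored with the same reward function used for optimization.
    drafts = [sample(prompt) for _ in range(n_drafts)]
    draft_rewards = [reward(prompt, d) for d in drafts]
    best_index = max(range(n_drafts), key=draft_rewards.__getitem__)
    best_draft = drafts[best_index]

    # Stage 2: append the best draft to the original prompt (the template
    # below is an assumption) and sample draft-conditioned refinements.
    conditioned_prompt = (
        f"{prompt}\n\nPrevious attempt:\n{best_draft}\n\nImprove on this attempt."
    )
    refinements = [sample(conditioned_prompt) for _ in range(n_refinements)]
    refinement_rewards = [reward(prompt, r) for r in refinements]

    # Group-relative advantages over the refinements; a GRPO-style update
    # would weight each refinement's log-probabilities by these values.
    advantages = group_relative_advantages(refinement_rewards)
    return best_draft, refinements, advantages
```

Only the refinements receive advantages in this sketch, which mirrors the description of Stage 2 as the step where the GRPO-style update is applied. How the rollout budget is split between drafts and refinements (`n_drafts`, `n_refinements`) is left as a free parameter here.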
