{"slug": "when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured", "title": "When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL", "summary": "Researchers found that LLM-generated reward functions for sparse reinforcement learning tasks fail in predictable ways, including reward flooding and API misunderstandings. A diagnostic-driven refinement method improved task success rates from 2.3% to 97.6% on DoorKey-8x8 and from 31.2% to 86.7% on KeyCorridor, with gains attributed to the failure-mode taxonomy rather than retrying or extra training. The approach is limited to sparse structured tasks under PPO and shows boundary effects in continuous-control environments where success-based diagnostics can misfire.", "body_md": "arXiv:2605.28918v1 Announce Type: new\nAbstract: For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as core evaluation and MuJoCo as boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7% with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.", "url": "https://wpnews.pro/news/when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured", "canonical_source": "https://arxiv.org/abs/2605.28918", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:18:46.804894+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "artificial-intelligence", "ai-research"], "entities": ["PPO", "MiniGrid", "MuJoCo", "DoorKey-8x8", "KeyCorridor"], "alternates": {"html": "https://wpnews.pro/news/when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured", "markdown": "https://wpnews.pro/news/when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured.md", "text": "https://wpnews.pro/news/when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured.txt", "jsonld": "https://wpnews.pro/news/when-llm-reward-design-fails-diagnostic-driven-refinement-for-sparse-structured.jsonld"}}