{"slug": "diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl", "title": "Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL", "summary": "Researchers have developed Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework that propagates reward-tilted clean-image distributions across all noise levels to improve one-step text-to-image generation. The method addresses a mismatch between terminal reward optimization and generative dynamics that previously caused image fidelity loss. DIDR consistently outperforms existing one-step SDXL baselines and, when applied to a 6B DiT backbone, surpasses its 50-step teacher in preference alignment using only a single generation step.", "body_md": "arXiv:2605.24001v1\nAbstract: Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. 
To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.", "url": "https://wpnews.pro/news/diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl", "canonical_source": "https://arxiv.org/abs/2605.24001", "published_at": "2026-05-26 04:00:00+00:00", "updated_at": "2026-05-26 04:11:20.878109+00:00", "lang": "en", "topics": ["generative-ai", "machine-learning", "artificial-intelligence", "neural-networks", "ai-research"], "entities": ["DIDR", "Diffused Reward", "SDXL", "DiT", "Z-Image", "RLHF", "Diffused Reward Score", "Diffused Reward Proxy"], "alternates": {"html": "https://wpnews.pro/news/diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl", "markdown": "https://wpnews.pro/news/diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl.md", "text": "https://wpnews.pro/news/diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl.txt", "jsonld": "https://wpnews.pro/news/diff-instruct-with-diffused-reward-towards-principled-one-step-generator-rl.jsonld"}}