{"slug": "multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models", "title": "Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models", "summary": "Researchers introduced Reflective Masking (RM), a post-training method that enables Mask Diffusion Models to iteratively revise their own outputs by re-masking uncertain tokens and remembering previous attempts. Tested on Sudoku, text reasoning, and image editing, RM requires no architectural changes and trains in about 5 hours on 2×H100 GPUs; on Sudoku, adding its history mechanisms raised exact accuracy by up to 11 percentage points over RM alone.", "body_md": "**TL;DR — Reasoning by editing, not regenerating.** Reflective Masking\nturns a Mask Diffusion Model into a multi-turn reviser: it erases uncertain tokens,\nregenerates only what is needed, and remembers previous attempts.\n\nAbstract\n\nRecent diffusion language models — such as Google's\n[DiffusionGemma](https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/) — show\nthat text generation need not be left-to-right: a model can refine a whole canvas using\nbidirectional context. We ask a complementary question: can *existing* **Mask Diffusion Models (MDMs)** be taught to reason by revising their own previous outputs? We propose **Reflective Masking (RM)**, a lightweight post-training method that turns masking into a model-driven decision — keep reliable tokens, re-mask uncertain ones, and reveal better replacements — making an MDM a multi-turn reviser rather than a one-shot decoder. To support multi-turn correction we add **History Reference**, a parameter-free memory that exposes the denoising trajectory to the model. 
Unlike a large pretrained diffusion LM, RM needs no architectural changes and no online rollouts, and drops into existing MDMs across **Sudoku, text reasoning, and image editing** — enabling sparse, iterative self-revision.\n\n1. **Re-masking is the self-correction MDMs were missing.** MDMs can edit in place but never *choose* to — so they lock in early mistakes. RM makes masking a model-driven decision (keep reliable tokens, re-mask uncertain ones, reveal better replacements), so the model fixes its own errors instead of carrying them forward.\n2. **A lightweight post-training recipe — no new architecture.** RM is activated by a scalable offline data pipeline (no online rollouts) and drops into existing MDMs unchanged — validated across text, Sudoku, and image editing.\n3. **History Reference — a memory of past attempts, for free.** A parameter-free mechanism that carries the denoising trajectory forward, so the model remembers what it already tried and stops repeating the same error.\n\nCoT thinks by continuing. RM thinks by revising.\n\nA diffusion-native analogue of chain-of-thought reflection.\n\n## Side-by-side: AR Reasoning vs. 
Reflective Masking Reasoning\n\n| AR reasoning / reflection | Reflective Masking in MDMs |\n|---|---|\n| Generates thoughts left-to-right | Revises a full canvas bidirectionally |\n| Corrects mistakes by appending more text or regenerating | Corrects mistakes by re-masking only unreliable tokens |\n| Past mistakes remain in context | Wrong tokens can be erased from the current state |\n| Test-time scaling = longer traces / more samples | Test-time scaling = more rounds of selective revision |\n| Memory is textual context | Memory is History Reference over denoising states |\n\nResults\n\n## Reasoning through explicit revision\n\nThree task families, from instruction-rich image editing to open-ended text reasoning.\nReflective Masking consistently beats masking-based baselines, and **History Reference**\nhelps most where the model must explore on its own — all trained in about\n**5 hours on 2×H100**.\n\n### Sudoku — structured error correction\n\nA tiny from-scratch MDM (0.81M params) recovers 9×9 boards with 4–20 corrupted cells\nby iterative re-masking. **History Reference (HR)** sharply cuts repeated mistakes and rule\nconflicts; adding **History Embedding Rotation (HER)** tops every metric.\n\n| Variant | Exact Accuracy % ↑ | Valid Rate % ↑ | Replay Mistake % ↓ | Conflict Cells / board ↓ |\n|---|---|---|---|---|\n| RM (no History Reference) | 82.4 | 86.6 | 0.57 | 0.578 |\n| RM + HR | 91.4 (↑9.0) | 91.8 (↑5.2) | 0.07 (↓0.50) | 0.300 (↓0.278) |\n| RM + HR + decay | 89.4 (↑7.0) | 89.6 (↑3.0) | 0.07 (↓0.50) | 0.362 (↓0.216) |\n| Ours — RM + HR + decay + HER | **93.4** (↑11.0) | **93.6** (↑7.0) | **0.03** (↓0.54) | **0.236** (↓0.342) |\n\nQuantitative results on Sudoku revision. 
Δ is the change versus the\n*RM (no History Reference)* baseline; **bold** marks the best value per column.\n\n**Relation to DiffusionGemma (Google).** DiffusionGemma independently validates\nreasoning-by-revision on Sudoku: per its model card, exact-solve rises from\n18% one-shot → 89.5% purely by revising over steps, and from\n1.5% → 89.5% after fine-tuning a large pretrained model for\n4,000 steps. Reflective Masking reaches an even higher 93.4% exact\naccuracy with a 0.81M-parameter MDM trained from scratch — orders of magnitude\nsmaller than DiffusionGemma's fine-tuned backbone — and extends the same revision\nmechanism beyond text to image editing, a modality DiffusionGemma\ndoes not support.\n\nDiffusionGemma: Google, “[DiffusionGemma: 4× faster text generation](https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/)” (2026); Sudoku numbers from the fine-tuned model card (Unsloth).\n\n### Image editing — localized, instruction-guided revision\n\nWith a 7B multimodal backbone (Lumina-DiMOO), Reflective Masking localizes the edit and changes only that region, leaving the rest of the image untouched — outperforming masking-based baselines.\n\n### Text reasoning — self-correction with no answer hints\n\nOn open-ended math and code (LLaDA backbone), the model re-masks uncertain intermediate tokens and revises them as context resolves — beating both the base model and vanilla SFT.\n\n**A diffusion-native analogue of reflection.** Chain-of-thought and reflection let\nautoregressive models reason by *writing more* — appending a new trace and carrying\nevery past mistake along in the context. Reflective Masking gives MDMs the complementary move:\nreason by *revising*. 
Rather than append a fresh reasoning trace, the model edits its\nprevious state, re-masking only the tokens it now doubts so wrong steps are erased instead of\naccumulated.\n\n| Benchmark | Category | LLaDA % | Vanilla SFT % | Ours (RM) % | Δ |\n|---|---|---|---|---|---|\n| MATH500 | Math | 19.4 | 22.4 | **24.8** | ↑2.4 |\n| MBPP | Code | 28.0 | 30.6 | **39.4** | ↑8.8 |\n| ARC-Challenge | MCQA | 73.7 | 81.3 | **86.1** | ↑4.8 |\n\nPerformance across benchmarks. Δ is the improvement over Vanilla SFT;\n**bold** marks the best value per row.\n\n| Minerva MATH | Algebra | Count. & Prob. | Geometry | Interm. Alg. | Num Theory | Prealgebra | Precalc | Aggregate |\n|---|---|---|---|---|---|---|---|---|\n| Vanilla SFT (%) | 28.90 | 17.72 | 20.67 | 13.07 | 17.41 | 36.28 | 14.10 | 22.62 |\n| Ours RM (%) | 29.49 | 18.35 | 20.67 | 14.40 | 21.67 | 38.00 | 16.67 | 24.10 |\n| Δ (%) | ↑0.59 | ↑0.63 | 0.00 | ↑1.33 | ↑4.26 | ↑1.72 | ↑2.57 | ↑1.48 |\n\nPer-subject breakdown on Minerva MATH; Reflective Masking improves every category except Geometry, where it ties Vanilla SFT.\n\nMethod\n\n## Reflective Masking & History Reference\n\nEach position takes one of three actions per step: **Reveal** a confident prediction,\n**Reflectively Mask** an uncertain one for another try, or **Reserve** it unchanged. Masking\nbecomes a model-driven decision, so the model can revisit and fix its earlier predictions\nacross turns.\n\n**History Reference (HR)** accumulates per-step states at the embedding level, giving the\nmodel access to its own trajectory — what it predicted and what it already revised —\nwith no extra parameters and without lengthening the attention sequence.\n\nTraining\n\n## Activating Reflective Masking, offline\n\nRM is taught offline, with no online rollouts. From a clean target we sample a mask, take one\nMDM forward pass, and draw plausible *wrong* tokens to build a **pseudo-trajectory**\nthat matches the model's own distribution. 
Three per-token losses then teach when to commit,\nwhen to re-mask, and when to leave a token alone:\n\n- Reveal: masked token → **correct token**\n- Re-mask: wrong visible token → **MASK**\n- Keep: correct visible token → **itself**\n\nCitation\n\n## BibTeX\n\n```\n@misc{zhang2026multiturn,\n  title         = {Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models},\n  author        = {Zhang, Yanming and Bian, Yihan and Qi, Jingyuan and Yao, Yuguang and Huang, Lifu and Zhou, Tianyi},\n  year          = {2026},\n  eprint        = {2606.16700},\n  archivePrefix = {arXiv},\n  url           = {https://arxiv.org/abs/2606.16700}\n}\n```\n\n", "url": "https://wpnews.pro/news/multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models", "canonical_source": "https://zhangyanming-cs.github.io/Multi-Turn_RM/", "published_at": "2026-06-22 08:05:29+00:00", "updated_at": "2026-06-22 08:10:32.452687+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "generative-ai", "ai-research"], "entities": ["Google", "DiffusionGemma", "Reflective Masking", "Mask Diffusion Models", "History Reference", "Sudoku"], "alternates": {"html": "https://wpnews.pro/news/multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models", "markdown": "https://wpnews.pro/news/multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models.md", "text": "https://wpnews.pro/news/multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models.txt", "jsonld": "https://wpnews.pro/news/multi-turn-reflective-masking-elicits-reasoning-in-mask-diffusion-models.jsonld"}}