{"slug": "nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and", "title": "Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding", "summary": "NVIDIA released Nemotron-Labs-Diffusion, a tri-mode language model that unifies autoregressive, diffusion, and self-speculation decoding within a single architecture. The model, trained with a joint AR-diffusion objective, demonstrated that diffusion improves lookahead planning while AR provides left-to-right linguistic priors, and in self-speculation mode, diffusion drafting with AR verification outperformed multi-token prediction methods in acceptance rate and efficiency. Scaling from 3B to 14B parameters, the Nemotron-Labs-Diffusion family consistently outperformed state-of-the-art open-source models in accuracy and speed, with the 8B variant decoding 5.9× more tokens per forward pass than Qwen3-8B and achieving 4× higher throughput on SPEED-Bench.", "body_md": "We introduce Nemotron-Labs-Diffusion, a tri-mode language model (LM) that unifies autoregressive (AR), diffusion, and self-speculation decoding within a single architecture. Trained with a joint AR-diffusion objective, Nemotron-Labs-Diffusion can switch modes to sustain high throughput across deployment settings and concurrency levels. Our study shows that (1) AR and diffusion objectives are complementary: diffusion improves lookahead planning, while AR provides left-to-right linguistic priors. (2) In self-speculation mode, diffusion drafts while AR verifies, outperforming multi-token prediction (MTP) methods in both acceptance rate and real-device efficiency. (3) A speed-of-light analysis further demonstrates diffusion’s long-term potential, with up to 76.5% more tokens per forward pass than self-speculation under an optimal sampler. 
Scaling to 3B, 8B, and 14B parameters, our Nemotron-Labs-Diffusion family, including base, instruct, and vision-language models, consistently outperforms state-of-the-art open-source AR and diffusion LMs in both accuracy and speed. For example, Nemotron-Labs-Diffusion-8B decodes 5.9× more tokens per forward pass than Qwen3-8B with better accuracy, translating to 4× higher throughput on SPEED-Bench with SGLang on a GB200 GPU.\n\nHF collection: [https://huggingface.co/collections/nvidia/nemotron-labs-diffusion](https://huggingface.co/collections/nvidia/nemotron-labs-diffusion)", "url": "https://wpnews.pro/news/nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and", "canonical_source": "https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive", "published_at": "2026-05-19 17:00:56+00:00", "updated_at": "2026-05-25 16:10:18.770350+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "artificial-intelligence", "machine-learning", "ai-research"], "entities": ["NVIDIA", "Nemotron-Labs-Diffusion", "Qwen3-8B", "SPEED-Bench", "SGLang", "GB200"], "alternates": {"html": "https://wpnews.pro/news/nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and", "markdown": "https://wpnews.pro/news/nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and.md", "text": "https://wpnews.pro/news/nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and.txt", "jsonld": "https://wpnews.pro/news/nemotron-labs-diffusion-a-tri-mode-language-model-unifying-autoregressive-and.jsonld"}}