Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

Source: research.nvidia.com. Published: 2026-05-19T17:00Z. Topic: large-language-models.

NVIDIA released Nemotron-Labs-Diffusion, a tri-mode language model that unifies autoregressive (AR), diffusion, and self-speculation decoding within a single architecture. Trained with a joint AR-diffusion objective, the model demonstrated that diffusion improves lookahead planning while AR provides left-to-right linguistic priors. In self-speculation mode, diffusion drafting with AR verification outperformed multi-token prediction methods in acceptance rate and efficiency. The Nemotron-Labs-Diffusion family, which scales from 3B to 14B parameters, consistently outperformed state-of-the-art open-source models in accuracy and speed: the 8B variant decoded 5.9× more tokens per forward pass than Qwen3-8B and achieved 4× higher throughput on SPEED-Bench.

We introduce Nemotron-Labs-Diffusion, a tri-mode language model (LM) that unifies autoregressive (AR), diffusion, and self-speculation decoding within a single architecture. Trained with a joint AR-diffusion objective, Nemotron-Labs-Diffusion can switch modes to sustain high throughput across deployment settings and concurrency levels. Our study shows that (1) AR and diffusion objectives are complementary: diffusion improves lookahead planning, while AR provides left-to-right linguistic priors. (2) In self-speculation mode, diffusion drafts while AR verifies, outperforming multi-token prediction (MTP) methods in both acceptance rate and real-device efficiency. (3) A speed-of-light analysis further demonstrates diffusion’s long-term potential, with up to 76.5% more tokens per forward pass than self-speculation under an optimal sampler. Scaling to 3B, 8B, and 14B parameters, our Nemotron-Labs-Diffusion family, including base, instruct, and vision-language models, consistently outperforms state-of-the-art open-source AR and diffusion LMs in both accuracy and speed. For example, Nemotron-Labs-Diffusion-8B decodes 5.9× more tokens per forward pass than Qwen3-8B with better accuracy, translating to 4× higher throughput on SPEED-Bench with SGLang on a GB200 GPU.
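To make the draft-then-verify idea in point (2) concrete, here is a minimal sketch of a greedy speculative decoding loop. It is not NVIDIA's implementation: `self_speculative_decode`, `toy_draft_block`, and `toy_ar_next` are hypothetical names, the "models" are toy integer functions, and the verifier is called once per position only for clarity, whereas a real system would presumably score the whole drafted block in a single AR forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def self_speculative_decode(
    prefix: Sequence[Token],
    draft_block: Callable[[Sequence[Token], int], List[Token]],
    ar_next: Callable[[Sequence[Token]], Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding (illustrative sketch only).

    draft_block stands in for the diffusion mode (proposes k tokens at once);
    ar_next stands in for the AR mode (greedy next-token prediction).
    Returns the decoded sequence and the number of verification rounds.
    """
    out = list(prefix)
    target_len = len(out) + max_new_tokens
    verify_rounds = 0
    while len(out) < target_len:
        k = min(block_size, target_len - len(out))
        draft = draft_block(out, k)
        verify_rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = ar_next(out + accepted)
            if tok == expected:
                accepted.append(tok)  # draft token agrees with the verifier
            else:
                accepted.append(expected)  # replace first mismatch, drop the rest
                break
        if not accepted:  # empty draft: fall back to one plain AR step
            accepted.append(ar_next(out))
        out.extend(accepted)
    return out, verify_rounds


def toy_ar_next(seq: Sequence[Token]) -> Token:
    """Toy verifier: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: Sequence[Token], k: int) -> List[Token]:
    """Toy drafter: follows the same rule but errs at every 4th position."""
    block: List[Token] = []
    last = seq[-1]
    for i in range(k):
        nxt = (last + 1) % 10
        if (len(seq) + i) % 4 == 0:
            nxt = (nxt + 5) % 10  # deliberate drafting error
        block.append(nxt)
        last = nxt
    return block
```

Calling `self_speculative_decode([0], toy_draft_block, toy_ar_next, block_size=4, max_new_tokens=12)` yields the same tokens as plain greedy AR decoding while using fewer verification rounds than generated tokens. That ratio of tokens to rounds is the toy analogue of the "tokens per forward pass" figure quoted above, and the acceptance rate of the drafter is what drives it.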

HF collection: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion

Read the original on research.nvidia.com: research.nvidia.com/publication/2026-05_nemotron…
