Diffusion Language Models Are Here: Deep Dive into NVIDIA's Nemotron-Labs DLM Architecture

NVIDIA has open-sourced the Nemotron-Labs Diffusion family of language models (3B, 8B, and 14B parameters), which replace traditional left-to-right autoregressive generation with a parallel denoising diffusion process. This architectural shift allows the models to refine all tokens in a block simultaneously, achieving up to 6.4× faster inference by overcoming the memory-bandwidth bottleneck that limits standard LLMs. The models address previous accuracy gaps in diffusion language models, making them competitive with autoregressive counterparts on standard benchmarks.

This deep dive covers the architecture, the training methodology, the three generation modes, and how to run the models today with SGLang.

Table of Contents

1. The Speed Wall Autoregressive LLMs Hit
2. What Are Diffusion Language Models?
3. Why DLMs Struggled, Until Now
4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM
5. Nemotron-Labs Diffusion: The Model Family
6. Three Generation Modes: AR, Diffusion, Self-Speculation
7. Performance Deep Dive: The Numbers That Matter
8. Under the Hood: Block-Wise Attention & KV Caching
9. Getting Started: Running with SGLang
10. What This Means for Production LLM Infrastructure
11. Conclusion & The Road Ahead

1. The Speed Wall Autoregressive LLMs Hit

Every language model you've ever used (GPT-4, Claude, Llama, Qwen) generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. This is autoregressive (AR) generation, and it has been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. Decoding is not a compute-bound problem. It is a memory-bandwidth-bound problem.

Here's why that matters: each new token requires a full model forward pass.
That means loading all the model's weights (roughly 14 GB for a 7B model in FP16) from HBM (High Bandwidth Memory) into the GPU's compute cores at every single decoding step. On modern GPUs the arithmetic throughput is enormous, so memory bandwidth becomes the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2 TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14 GB. Reading all the weights therefore takes at least ~7 ms per step, which caps single-stream decoding at roughly 140 tokens/second no matter how fast the compute cores are. Most of each step goes to moving weights, not computing. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.

The community has attacked this problem from many angles: speculative decoding (a small draft model proposes tokens that the large model verifies), quantization (FP8, INT4) to shrink the weight footprint, and FlashAttention (restructuring the attention computation to cut HBM reads and writes). These are all incremental improvements on the same fundamental loop.

NVIDIA's Nemotron-Labs Diffusion, released on HuggingFace on May 23, 2026, takes a fundamentally different approach. Instead of optimizing the autoregressive loop, it breaks the loop entirely.

2. What Are Diffusion Language Models?

If you've worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion: start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output. Diffusion Language Models (DLMs) apply this same paradigm to text.
Instead of generating tokens left-to-right, a DLM:

- Starts with a sequence of masked or noisy tokens (analogous to Gaussian noise in image diffusion)
- Runs multiple denoising refinement steps, predicting the clean token distribution at each step
- Converges, after several iterations, to the final output for the entire sequence or a large block of it

The key theoretical advantage is parallelism. In a standard AR model, token t can only be generated after token t-1 exists. In a DLM, all positions in a block are refined simultaneously in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to masked diffusion language models, work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023), which framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings compared to the state-of-the-art AR models of the day. NVIDIA's work specifically addresses why, and more importantly, how to fix it.

3. Why DLMs Struggled, Until Now

The community has known about the theoretical appeal of diffusion language models for years. They haven't taken over because of a cluster of practical barriers that made them non-competitive with AR models in production:

1. Accuracy Gap: DLMs trained from scratch consistently underperformed comparably sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B, a smaller AR model, on reasoning and knowledge tasks.
2. Training Instability: Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.
3. No KV Cache Compatibility: This was the killer for inference efficiency. KV caching, where you store key/value activations from previous tokens to avoid recomputing them, is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you can't cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.
4. Fill-in-the-Middle Mismatch: During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a training-test distribution mismatch that degrades quality.

Each of these problems has a specific technical solution in NVIDIA's Efficient-DLM framework. Let's dig in.

4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, arXiv:2512.14067, https://arxiv.org/abs/2512.14067) is deceptively simple: don't train DLMs from scratch; convert pretrained AR models into DLMs.

This sidesteps the accuracy gap problem. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to also generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem.
A standard AR model uses causal (lower-triangular) attention: each token attends only to itself and all previous tokens. A standard DLM uses bidirectional (full) attention: every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you've broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they "expect" not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces block-wise causal attention as the solution:

- The sequence is divided into non-overlapping blocks of size B (e.g., 32 tokens)
- Within each block: full bidirectional attention (every token attends to every other token in the block)
- Across blocks: standard left-to-right causal attention (block i can attend to blocks 0 through i-1)

This hybrid pattern does something clever: it's structurally similar enough to causal attention that pretrained weight distributions are preserved. The model only needs to learn bidirectionality locally within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.

Crucially, this also re-enables KV caching. Since attention is still causal across blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the current block being refined needs to be recomputed each refinement step.

4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a position-dependent masking strategy that assigns higher masking probabilities to tokens in later positions in the sequence.
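
To make the masking strategy concrete, here is a minimal sketch in plain Python. The text only states that later positions receive higher masking probability; the linear ramp, the `p_min`/`p_max` endpoints, and the function names below are illustrative assumptions, not the schedule actually used in Efficient-DLM.

```python
import random


def position_dependent_mask_probs(seq_len, p_min=0.2, p_max=0.9):
    """Per-position masking probabilities rising linearly from p_min to p_max.

    The linear ramp and the default endpoints are assumptions made for
    illustration; the real schedule may have a different shape.
    """
    if seq_len == 1:
        return [p_max]
    step = (p_max - p_min) / (seq_len - 1)
    return [p_min + i * step for i in range(seq_len)]


def sample_mask(seq_len, rng, p_min=0.2, p_max=0.9):
    """Draw one training mask: True means the token is replaced by a mask token."""
    probs = position_dependent_mask_probs(seq_len, p_min, p_max)
    return [rng.random() < p for p in probs]


# Empirical check: mask rates should climb from left to right.
rng = random.Random(0)
trials, seq_len = 2000, 8
counts = [0] * seq_len
for _ in range(trials):
    for i, is_masked in enumerate(sample_mask(seq_len, rng)):
        counts[i] += is_masked
rates = [c / trials for c in counts]
print([round(r, 2) for r in rates])
```

Averaged over many samples, the empirical mask rate rises from about `p_min` at the first position to about `p_max` at the last, which is the skew toward later positions described above.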
The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.

4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a joint AR and diffusion loss:

$$\mathcal{L}_{\text{total}} = \lambda \cdot \mathcal{L}_{\text{AR}} + (1 - \lambda) \cdot \mathcal{L}_{\text{diffusion}}$$

where $\mathcal{L}_{\text{AR}}$ is the standard cross-entropy causal language modeling loss, $\mathcal{L}_{\text{diffusion}}$ is the masked diffusion objective, and $\lambda$ balances the two terms. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.

The pretrained base was trained on 1.3 trillion tokens from NVIDIA's Nemotron pretraining datasets, with an additional 45 billion tokens of supervised fine-tuning data for the instruct-tuned variants.

5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
|---|---|---|---|
| nvidia/Nemotron-Labs-Diffusion-3B | ~4B | Text, Instruct | 14.7K |
| nvidia/Nemotron-Labs-Diffusion-3B-Base | ~4B | Text, Base | 14.2K |
| nvidia/Nemotron-Labs-Diffusion-8B | 8B | Text, Instruct | 24.1K |
| nvidia/Nemotron-Labs-Diffusion-8B-Base | 8B | Text, Base | 228K |
| nvidia/Nemotron-Labs-Diffusion-14B | 14B | Text, Instruct | 3.28K |
| nvidia/Nemotron-Labs-Diffusion-14B-Base | 14B | Text, Base | 1.18K |
| nvidia/Nemotron-Labs-Diffusion-VLM-8B | ~9B | Vision-Language | 590 |

The 8B Base model is the most downloaded (228K in under 2 days), which likely reflects developer interest in using it as a foundation for custom fine-tuning.

6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that all three generation modes are supported from a single checkpoint. You don't need different models, just a different deployment config in SGLang.

Mode 1: Autoregressive (`ar_mode=true`)

Standard left-to-right token generation, identical to how you'd run any other causal LM. This mode is the correctness baseline, most useful for debugging, A/B testing against existing pipelines, or when you need strict adherence to specific decoding behaviors.

Use when: Debugging, regression testing, or exact reproduction of AR outputs.

Mode 2: Diffusion / FastDiffuser (`diffusion_mode=true`)

The model fills in a block of 32 tokens simultaneously, running multiple denoising refinement steps per block. A confidence threshold determines which tokens are "committed" after each refinement pass: tokens whose predicted distribution is peaked enough get locked in, reducing the number of positions that need further refinement.

The process per block:

1. Initialize the block positions with mask tokens
2. Run a forward pass with block-wise attention to predict token distributions over all positions
3. Commit tokens above the confidence threshold; keep the others masked
4. Repeat steps 2–3 until all positions are committed or the maximum number of steps is reached
5. Move to the next block, with the committed block's tokens now in the KV cache

This achieves 2.6× higher tokens per forward pass (TPF) compared to AR.

Use when: High-throughput batch serving where speed matters more than exact AR equivalence.
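
The per-block loop described for diffusion mode can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the SGLang implementation: `denoise_block`, `toy_predict`, and the `MASK` sentinel are hypothetical names, and two behaviors are assumptions the text does not specify (committing the single most confident position when nothing clears the threshold, and accepting all remaining predictions once the step budget runs out).

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # stand-in for the mask token

# A predictor maps the current block to one (token, confidence) pair per position.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def denoise_block(predict: Predictor, block_size: int = 32,
                  threshold: float = 0.9, max_steps: int = 8):
    """Fill one block by repeated predict-and-commit; returns (tokens, forward passes)."""
    block: List[Optional[int]] = [MASK] * block_size      # step 1: all masked
    for step in range(1, max_steps + 1):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            return block, step - 1                        # everything committed
        preds = predict(block)                            # step 2: one forward pass
        if step == max_steps:
            commit = masked                               # assumption: budget spent, accept all
        else:
            commit = [i for i in masked if preds[i][1] >= threshold]  # step 3
            if not commit:                                # assumption: always make progress
                commit = [max(masked, key=lambda i: preds[i][1])]
        for i in commit:
            block[i] = preds[i][0]
    return block, max_steps


def toy_predict(block):
    """Hypothetical model stub: a position is confident if it is 'easy'
    (every 4th position) or its left neighbor is already committed."""
    preds = []
    for i in range(len(block)):
        easy = i % 4 == 0
        left_known = i > 0 and block[i - 1] is not MASK
        preds.append((100 + i, 0.95 if (easy or left_known) else 0.5))
    return preds


tokens, steps = denoise_block(toy_predict, block_size=8)
print(tokens, steps)  # 8 tokens committed in 4 forward passes
```

With this stub the block resolves in waves (positions 0 and 4 first, then their right-hand neighbors), so 8 tokens cost 4 forward passes, i.e. 2 tokens per forward pass. In a real deployment the ratio depends on the model's confidence and the chosen threshold.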
Mode 3: Self-Speculation / LinearSpec (`self_speculation=true`)

This is the most sophisticated mode. It fuses diffusion and autoregressive decoding into a single hybrid loop:

1. The model uses diffusion to draft a full block of k candidate tokens bidirectionally (fast, parallel)
2. It then uses autoregressive decoding to verify the draft tokens causally, from left to right
3. Any prefix of the draft that matches what AR would have produced gets committed
4. The process restarts from the first disagreement position

The same model plays both roles (drafter and verifier). Output is lossless vs AR at temperature=0.

Key numbers: LinearSpec achieves ~6× higher TPF than AR and ~865 tokens/second on NVIDIA B200 hardware, roughly 4× the AR baseline on identical hardware.

Use when: Production serving where you need maximum speed with no quality compromise.

7. Performance Deep Dive: The Numbers That Matter

Accuracy

vs Qwen3 8B: Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy than Qwen3 8B across the evaluated benchmarks. The DLM conversion doesn't hurt quality; it slightly improves it, likely because the joint AR+diffusion training objective acts as an additional regularizer.

vs Dream 7B (prior DLM SOTA): Efficient-DLM 8B achieves +5.4% higher accuracy and 4.5× higher throughput than Dream 7B, a decisive improvement over the previous DLM state of the art.

Throughput (Tokens Per Forward Pass, TPF)

| Mode | TPF relative to AR | Quality vs AR |
|---|---|---|
| Autoregressive | 1× (baseline) | Exact match |
| Diffusion (FastDiffuser) | 2.6× | Slightly different |
| Self-Spec Linear (LinearSpec) | ~6× | Lossless at T=0 |
| Self-Spec Quadratic (QuadSpec) | ~6.4× | Lossless at T=0 |

TPF is a hardware-agnostic efficiency metric: it measures how many output tokens you get per model forward pass, which makes it useful for comparing across different GPU generations.

8. Under the Hood: Block-Wise Attention & KV Caching

Let's look at exactly how the block-wise attention mechanism enables KV caching in a DLM setting.

In standard AR decoding, the KV cache stores the key and value projections for every previously generated token. When generating token t, the model attends to cached KV from tokens 0...t-1 and computes new Q, K, V for position t only: an O(1) cache update per step.

In a standard bidirectional DLM, this is impossible: since every token attends to every other token, and token values change with each refinement step, you'd need to recompute the entire KV matrix every step. That is O(n²) attention work per refinement, with no caching benefit.

Block-wise causal attention resolves this with a two-level hierarchy:

    Sequence: Block 0 | Block 1 | Block 2 | ... | Block N

    For a token in Block i:
    - Attends to ALL tokens in blocks 0...i-1     → cached KV, never recomputed
    - Attends to ALL tokens within Block i        → bidirectional, recomputed each step
    - CANNOT attend to tokens in blocks i+1 and later → causal constraint maintained

For a 32-token block size and a 2048-token sequence, 98.4% of the KV entries (2016 of 2048 positions) are served from cache while the final block is being refined.

Here's how to build the attention mask in PyTorch:

    import torch

    def build_block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
        """
        Build a block-wise causal attention mask.

        Within each block: full bidirectional attention (True)
        Across blocks:     causal left-to-right attention (True only for past blocks)
        Future blocks:     masked out (False → -inf in softmax)

        Returns a boolean mask of shape (seq_len, seq_len),
        where True = can attend, False = masked.
        """
        # Block index of every position; a trailing partial block is handled too.
        block_ids = torch.arange(seq_len) // block_size
        # Query i may attend to key j iff j's block is not later than i's block.
        return block_ids[:, None] >= block_ids[None, :]

    # Example: 4 blocks of 4 tokens each = 16-token sequence
    mask = build_block_causal_mask(seq_len=16, block_size=4)
    print(mask.int())

    # Output (each row = query token, each col = key token):
    # Block 0 rows: 1111 | 0000 | 0000 | 0000
    # Block 1 rows: 1111 | 1111 | 0000 | 0000
    # Block 2 rows: 1111 | 1111 | 1111 | 0000
    # Block 3 rows: 1111 | 1111 | 1111 | 1111

The resulting mask has fully connected 4×4 diagonal blocks (bidirectional within blocks) with a lower-triangular structure across block boundaries (causal across blocks). It's the AR causal mask, coarsened to block granularity, which is precisely why pretrained AR weight distributions are preserved.

9. Getting Started: Running with SGLang

SGLang is the recommended serving framework for Nemotron-Labs Diffusion, with integration via PR #25803 (https://github.com/sgl-project/sglang/pull/25803), which is expected to merge into main imminently. Here's a complete example.

9.1 Installation

    # Install SGLang with DLM support
    pip install "sglang[all]>=0.4.5" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.5/

    # If the PR hasn't merged to main yet, install from the DLM branch directly
    # (the installable package lives in the repo's python/ directory):
    git clone https://github.com/sgl-project/sglang.git
    cd sglang && git fetch origin pull/25803/head:dlm-support
    git checkout dlm-support && pip install -e "python[all]"

    # Pull the model weights
    pip install huggingface-hub
    huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B \
      --local-dir ./models/Nemotron-Labs-Diffusion-8B

9.2 Serving: Launch the SGLang Server

    # Mode 1: Autoregressive (standard baseline)
    python -m sglang.launch_server \
      --model-path ./models/Nemotron-Labs-Diffusion-8B \
      --port 30000 --tp 1 --dtype bfloat16 \
      --algorithm ar_mode

    # Mode 2: Diffusion (FastDiffuser), 2.6x TPF, output may differ from AR
    python -m sglang.launch_server \
      --model-path ./models/Nemotron-Labs-Diffusion-8B \
      --port 30000 --tp 1 --dtype bfloat16 \
      --algorithm diffusion \
      --block-size 32 \
      --confidence-threshold 0.9

    # Mode 3: Self-Speculation (LinearSpec), ~6x TPF, lossless at temperature 0
    python -m sglang.launch_server \
      --model-path ./models/Nemotron-Labs-Diffusion-8B \
      --port 30000 --tp 1 --dtype bfloat16 \
      --algorithm linear_spec \
      --draft-block-size 32

9.3 Inference: Python Client (OpenAI-Compatible API)

    import time

    import openai

    # SGLang exposes an OpenAI-compatible API endpoint
    client = openai.OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY",  # SGLang doesn't require auth by default
    )

    PROMPT = """You are an expert in distributed systems. Explain the CAP theorem
    and its practical implications for a microservices architecture.
    Be specific with concrete trade-off examples."""

    def benchmark_mode(label: str, mode_hint: str = "") -> float:
        """Run a generation and measure wall-clock tokens/second."""
        start = time.perf_counter()
        response = client.chat.completions.create(
            model="nvidia/Nemotron-Labs-Diffusion-8B",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=512,
            temperature=0,  # T=0 → LinearSpec is lossless vs AR
            # mode_hint is "ar", "diffusion", or "linear_spec". If the server
            # fixes the mode at launch (section 9.2), relaunch it per mode
            # instead of relying on this per-request field.
            extra_body={"mode": mode_hint} if mode_hint else {},
        )
        elapsed = time.perf_counter() - start
        tokens = response.usage.completion_tokens
        tps = tokens / elapsed

        print(f"\n{'=' * 60}")
        print(f"Mode       : {label}")
        print(f"Output     : {response.choices[0].message.content[:200]}...")
        print(f"Tokens     : {tokens}")
        print(f"Time (s)   : {elapsed:.2f}")
        print(f"Throughput : {tps:.1f} tok/s")
        print(f"{'=' * 60}")
        return tps

    # Compare all three modes
    ar_tps = benchmark_mode("Autoregressive", mode_hint="ar")
    diff_tps = benchmark_mode("Diffusion (FastDiffuser)", mode_hint="diffusion")
    spec_tps = benchmark_mode("Self-Spec (LinearSpec)", mode_hint="linear_spec")

    print("\n📊 Speedup Summary:")
    print(f"  Diffusion vs AR  : {diff_tps / ar_tps:.2f}×")
    print(f"  LinearSpec vs AR : {spec_tps / ar_tps:.2f}×")

9.4 Quick Start via HuggingFace Transformers (AR Mode)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "nvidia/Nemotron-Labs-Diffusion-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain masked diffusion in 3 sentences."},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=False,
            use_cache=True,
        )

    # Decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:],
        skip_special_tokens=True,
    )
    print(response)

Note: The transformers path gives AR mode only. For diffusion and self-speculation modes, the SGLang integration is required, as it implements the custom decoding loop.

10. What This Means for Production LLM Infrastructure

Latency vs Throughput Trade-off, Revisited

The classic LLM serving dilemma is that throughput optimizations (larger batch sizes, continuous batching) increase latency, while latency optimizations (small batches, low KV cache pressure) hurt throughput. Self-speculation in DLMs partially decouples this: at batch size 1, LinearSpec gives 4–6× more tokens per second than AR on the same hardware.
This is the scenario where AR models are most inefficient, and where DLMs provide the biggest relative gain.

Cost Implications

A 4× throughput improvement at batch size 1 means you could serve the same number of users with 1/4 the GPU compute or, equivalently, serve 4× more users from the same GPU fleet. At current B200/H100 pricing of $4–8/hour, that's a meaningful cost reduction for any team running a production LLM API.

Fill-in-the-Middle and Code Editing

DLMs have a natural advantage for fill-in-the-middle (FIM) tasks. AR models handle FIM awkwardly, requiring special training and prompt formatting to look "backwards" at the suffix. A DLM generating a block bidirectionally can natively condition on both prefix and suffix context within the block, making Nemotron-Labs Diffusion well-suited for code editing agents and inline completions.

Inference Budget Control

In diffusion mode, you can control the number of denoising steps as a runtime knob. Fewer steps = faster but potentially lower quality. More steps = slower but higher quality. This gives you a continuous quality-speed trade-off at inference time without retraining, something AR models simply can't offer. A production system could dynamically reduce diffusion steps during traffic spikes and increase them during low-load periods.

When to Stick with AR

For long-context tasks (100K+ tokens) where the KV cache dominates memory, the efficiency story is less clear-cut. For streaming output where users see tokens as they're generated, block-wise generation may feel less smooth without careful rendering logic. And for tasks requiring strict constrained decoding (grammar-constrained generation, beam search), the diffusion loop needs further tooling work.

11. Conclusion & The Road Ahead

Diffusion Language Models have been a promising idea for years, perennially held back by a cluster of practical barriers: accuracy gaps, training instability, and the loss of KV caching.
NVIDIA's Efficient-DLM work and Nemotron-Labs Diffusion have systematically addressed each of these barriers with concrete, principled solutions: block-wise causal attention, position-dependent masking, and joint AR+diffusion training objectives. The result is a model family that is simultaneously:

- ✅ A first-class AR model (backward compatible, lossless in LinearSpec mode)
- ⚡ A 2.6–6.4× faster inference engine (depending on mode and hardware)
- 🔲 A better fill-in-the-middle model by architectural design
- 🎛️ A tunable quality-speed dial at deployment time, with no retraining needed

With 24K+ downloads in the first 24 hours and SGLang integration landing imminently, this is one of the most practically significant open-source releases in the LLM inference space in 2026.

The next frontier: applying the same AR-to-DLM conversion recipe to frontier-scale models (70B+), exploring multimodal DLMs beyond the 8B VLM preview, and building out constrained decoding, streaming token rendering, and fine-tuning tooling for the DLM objective.

If you're building LLM-powered applications and care about inference cost and latency, it's time to start experimenting with Nemotron-Labs Diffusion. The autoregressive loop had a good run, but the next chapter of language model inference looks decidedly more parallel.

🔗 Resources

- 🤗 Nemotron-Labs Diffusion model collection on HuggingFace: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion
- 📄 Efficient-DLM technical paper, arXiv:2512.14067: https://arxiv.org/abs/2512.14067
- 💻 NVIDIA Megatron Bridge training code (GitHub): https://github.com/NVIDIA-NeMo/Megatron-Bridge/
- 🔧 SGLang DLM integration, PR #25803: https://github.com/sgl-project/sglang/pull/25803

Written on May 24, 2026, based on the HuggingFace blog post and arXiv:2512.14067 (Efficient-DLM). Performance numbers reflect published benchmarks; verify against your specific hardware and workload.