DiffusionGemma: Discrete diffusion in a large language model


Source: idlemachines.co.uk · Published: 2026-06-11T22:21Z · Topic: artificial-intelligence


DeepMind released DiffusionGemma, a new large language model that uses discrete diffusion to generate entire sequences in parallel instead of autoregressive token-by-token generation. The model achieves over 1,000 tokens per second on a single H100 GPU and runs in 18GB when quantized, offering up to 4x speed improvement over same-size autoregressive models, though it remains slightly behind the flagship Gemma-4 release.


Problems around papers, methods and ideas as they break. Newest at the top.

DiffusionGemma is a new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, it works on the whole sequence in parallel.

It reaches 1000+ tokens/s on a single H100 and runs in 18GB when quantized, up to 4x the speed of an autoregressive model of the same size. So far it's not really competitive with the flagship Gemma-4 release from earlier this year, but it's getting close.
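To make "the whole sequence in parallel" concrete, here is a minimal sketch of one common discrete-diffusion sampling scheme: start from an all-masked sequence, score every position in a single pass, and commit the most confident guesses each step. This is not DiffusionGemma's actual sampler. `MASK`, `toy_denoiser` and `parallel_unmask` are hypothetical names, and the denoiser is a random stand-in for the network.

```python
import numpy as np

MASK = -1  # hypothetical id for the "masked" placeholder token


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the network: returns logits for every position at once."""
    return rng.normal(size=(len(tokens), vocab_size))


def parallel_unmask(seq_len, vocab_size, num_steps, seed=0):
    """Fill an all-masked sequence in at most num_steps denoiser calls."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # One forward pass scores all positions, masked or not.
        logits = toy_denoiser(tokens, vocab_size, rng)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        guess = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        # Commit an equal share of the remaining masked positions,
        # picking the ones the denoiser is most confident about.
        n_commit = int(np.ceil(masked.size / (num_steps - step)))
        best = masked[np.argsort(-confidence[masked])[:n_commit]]
        tokens[best] = guess[best]
    return tokens
```

With `seq_len=16` and `num_steps=4` the loop makes four denoiser calls where an autoregressive decoder would make sixteen, which is where the throughput gain comes from in schemes of this kind.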

#	Difficulty	Tags
413	Medium	Diffusion, DiffusionGemma, ForwardProcess
414	Hard	Diffusion, DiffusionGemma, Sampling, Entropy
415	Medium	Diffusion, DiffusionGemma, Sampling, EarlyStopping
416	Medium	Diffusion, DiffusionGemma, Sampling, Temperature
417	Hard	Diffusion, DiffusionGemma, Attention, Masking
418	Hard◆	Diffusion, DiffusionGemma, Sampling, Inference

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer they run mostly Mamba with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit.

The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep it cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.
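The KV-cache claim is easy to check with back-of-envelope arithmetic: only attention layers store keys and values per token, while a Mamba layer keeps a fixed-size state regardless of sequence length. The sketch below uses made-up layer counts and head sizes, not Nemotron 3's real configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 48-layer model at 128K context, 16-bit cache entries.
full_attention = kv_cache_bytes(attn_layers=48, kv_heads=8, head_dim=128, seq_len=131072)
# Same depth, but only 6 of the 48 layers are attention (the rest are Mamba).
hybrid = kv_cache_bytes(attn_layers=6, kv_heads=8, head_dim=128, seq_len=131072)

print(full_attention / 2**30, "GiB vs", hybrid / 2**30, "GiB")
```

Under these assumed numbers the cache shrinks from 24 GiB to 3 GiB per sequence, and that freed memory is the "room" the text refers to.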

#	Difficulty	Tags
406	Easy	KVCache, Mamba, GQA, Inference
419	Medium	Mamba, SSM, StateSpace, NumPy
407	Easy	MoE, Roofline, Inference, LatentMoE
408	Easy	MoE, LatentMoE, Inference
409	Medium	MoE, LatentMoE, Routing
410	Hard	MoE, LatentMoE, Routing, Transformers
411	Medium	Quantization, NVFP4, LowPrecision
412	Hard	MTP, SpeculativeDecoding, Drafting, Inference

DiffusionBlocks is a training method from Sakana AI that challenges the assumption that Transformers need to be trained end-to-end. They turn the B blocks into B independent diffusion denoisers, with no gradient ever crossing a block boundary.

Each block is independent, so they can be trained one at a time on a single GPU (activation memory O(L) → O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.
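The O(L) → O(L/B) figure follows from counting how many layers a single backward pass has to keep activations for. A rough sketch, assuming activation memory is proportional to the number of layers backpropagated through and ignoring tricks like checkpointing (`split_layers` and `peak_activation_units` are hypothetical helpers):

```python
def split_layers(num_layers, num_blocks):
    """Partition layer indices into contiguous, near-equal blocks."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def peak_activation_units(num_layers, num_blocks):
    """Layers whose activations are held at once when blocks train separately."""
    return max(len(block) for block in split_layers(num_layers, num_blocks))


print(peak_activation_units(32, 1))  # end-to-end: all 32 layers
print(peak_activation_units(32, 4))  # blockwise: 8 layers at a time
```

Because no gradient crosses a block boundary, each block's backward pass only ever sees its own layers, so the peak is set by the largest block rather than by the full depth.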

#	Difficulty	Tags
400	Easy	Diffusion, PFODE, ResidualConnection
401	Medium	Diffusion, LogNormal, InverseCDF
402	Medium	Diffusion, LayerNorm, AdaLN, Conditioning
403	Medium	Diffusion, LogNormal, InverseCDF, Sampling
404	Medium	Diffusion, Loss, EDM, Weighting
405	Medium	Diffusion, Inference, PFODE, DiffusionBlocks

Two preview MoE models from DeepSeek, V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: make million-token context cheap and efficient.

V4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.

#	Difficulty	Tags
360	Medium	doubly-stochastic, mhc, deepseek-v4
361	Medium	muon, orthogonalisation, deepseek-v4
369	Hard	residual, hyper-connections, deepseek-v4
370	Medium	sparse-attention, top-k, deepseek-v4
371	Medium	sparse-attention, kv-compression, deepseek-v4
372	Hard	sparse-attention, kv-compression, deepseek-v4

TriAttention is a training-free method for scoring key importance in the KV cache. It uses a geometric property of attention heads (pre-RoPE query and key vectors cluster around stable non-zero directions) instead of running full attention.

TriAttention (arXiv 2604.04921) was released on April 6, 2026. The core observation, Q/K concentration, is empirically demonstrable, and the scoring derivation falls out directly from the RoPE attention identity.
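The RoPE attention identity in question is the standard one: after rotation, the query-key logit depends on the two positions only through their offset, and can be written in terms of the pre-RoPE vectors. The NumPy sketch below checks that identity numerically. It is not TriAttention's scoring rule, only the identity it builds on; `rope` and `logit_from_offset` are hypothetical names, and the interleaved-pair layout and `base=10000` are assumed conventions.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of x by pos * theta_f."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out


def logit_from_offset(q, k, offset, base=10000.0):
    """Same logit, computed from pre-RoPE q, k and the offset m - n only."""
    d = q.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    qc = q[0::2] + 1j * q[1::2]  # view each pair as a complex number
    kc = k[0::2] + 1j * k[1::2]
    return np.real(qc * np.conj(kc) * np.exp(1j * offset * theta)).sum()


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
m, n = 37, 5
direct = rope(q, m) @ rope(k, n)
closed_form = logit_from_offset(q, k, m - n)
print(np.isclose(direct, closed_form))
```

If pre-RoPE queries cluster around a stable direction, that direction can stand in for `q` in `logit_from_offset`, which is why a key can be scored without running the full attention computation.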

#	Difficulty	Tags
284	Easy	RoPE, Attention, KV Cache, TriAttention, NumPy
285	Medium	RoPE, Attention, Positional Encoding, TriAttention, NumPy
286	Medium	RoPE, Attention, KV Cache, TriAttention, NumPy
287	Medium	RoPE, Attention, KV Cache, TriAttention, NumPy
288	Medium	RoPE, Attention, KV Cache, TriAttention, NumPy
289	Hard	RoPE, Attention, KV Cache, TriAttention, NumPy

Gemma 4 is Google's open-weight LLM family, with five architectural departures from the standard recipe: QK-norm, partial RoPE, per-layer input gating, KV sharing, and a parallel MoE.

The 2B-parameter per-layer embedding table re-injects token identity at every depth, on the bet that deep transformers don't preserve it on their own. Open weights mean the architecture is directly readable.

#	Difficulty	Tags
181	Hard	Transformers, Attention, Normalization, Gemma 4, NumPy
182	Easy	Transformers, Normalization, Gemma 4, NumPy
183	Medium	Transformers, Positional Encoding, RoPE, Gemma 4, NumPy
184	Medium	Transformers, Normalization, Gemma 4, NumPy
185	Hard	Transformers, Attention, KV Cache, Gemma 4, NumPy

TurboQuant is a training-free vector quantization method from Google that compresses KV cache entries to ~3 bits per coordinate using random rotation, Lloyd-Max quantization, and a 1-bit residual sketch.

No fine-tuning is needed. The smaller cache means fewer memory transfers, so inference on Gemma and Mistral runs faster than on the unquantized originals.
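The first two ingredients can be sketched in a few lines of NumPy: a random rotation makes the coordinates of any vector look roughly Gaussian, so a single scalar Lloyd-Max codebook fitted to a Gaussian can serve every coordinate. This is a simplified illustration rather than TurboQuant itself: it omits the 1-bit residual sketch, stores one floating-point scale per vector, and all function names are hypothetical.

```python
import numpy as np


def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is uniform


def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max codebook: alternate nearest-level assignment and centroid update."""
    levels = np.quantile(samples, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(levels.size):
            if np.any(idx == j):
                levels[j] = samples[idx == j].mean()
    return np.sort(levels)


def quantize(x, rotation, levels):
    """Rotate, rescale to unit RMS, and snap each coordinate to its nearest level."""
    scale = float(np.linalg.norm(x) / np.sqrt(x.size)) or 1.0  # guard zero vectors
    z = rotation @ x / scale
    codes = np.abs(z[:, None] - levels[None, :]).argmin(axis=1)
    return codes, scale


def dequantize(codes, scale, rotation, levels):
    """Look up the levels, undo the scaling, and rotate back."""
    return rotation.T @ (levels[codes] * scale)


rng = np.random.default_rng(0)
R = random_rotation(64, rng)
levels = lloyd_max(rng.normal(size=20000), bits=3)  # 8 levels = 3 bits per coordinate
x = rng.normal(size=64) * 5.0
codes, scale = quantize(x, R, levels)
x_hat = dequantize(codes, scale, R, levels)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Nothing here depends on the model or its data, which is what makes the approach training-free: the rotation and codebook are fixed once and reused for every cache entry.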

#	Difficulty	Tags
173	Medium	Linear Algebra, Quantization, QR Decomposition, Numpy
174	Medium	Quantization, Linear Algebra, Numpy, Optimisation
175	Medium	Quantization, Linear Algebra, Numpy
176	Medium	Quantization, Linear Algebra, Numpy
177	Hard◆	Quantization, Linear Algebra, Numpy, TurboQuant
178	Hard◆	Quantization, Linear Algebra, Transformers, Numpy, TurboQuant

source & further reading

idlemachines.co.uk — original article



