[ARTICLE · art-24564] src=idlemachines.co.uk pub= topic=artificial-intelligence verified=true sentiment=· neutral

DiffusionGemma: Discrete diffusion in a large language model

DeepMind released DiffusionGemma, a new large language model that uses discrete diffusion to generate entire sequences in parallel instead of autoregressive token-by-token generation. The model achieves over 1,000 tokens per second on a single H100 GPU and runs in 18GB when quantized, offering up to 4x speed improvement over same-size autoregressive models, though it remains slightly behind the flagship Gemma-4 release.

5 min read · published Jun 11, 2026

Problems around papers, methods and ideas as they break. Newest at the top.

A new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, it works on the whole sequence in parallel.

They reach 1000+ tokens/s on a single H100 and fit in 18GB when quantized, which is up to 4x faster than an autoregressive model of the same size. So far it isn't really competitive with the flagship Gemma-4 release from earlier this year, but it's getting close.
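To make "the whole sequence in parallel" concrete, here is a minimal sketch of one common masked-diffusion sampling loop: predict every position at once, commit the most confident guesses, repeat. This is an assumption for illustration, not DiffusionGemma's actual sampler. `MASK`, `toy_denoiser` and `parallel_unmask` are hypothetical names, and the "network" is just random logits.

```python
import numpy as np

MASK = -1  # hypothetical id for the "still masked" placeholder token

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the network: logits for every position in one call."""
    return rng.normal(size=(len(tokens), vocab_size))

def parallel_unmask(seq_len, vocab_size, steps, seed=0):
    """Start fully masked, then commit the most confident positions each step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))
    n_calls = 0
    while (tokens == MASK).any():
        logits = toy_denoiser(tokens, vocab_size, rng)
        n_calls += 1
        # Softmax over the vocabulary at every position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        guess = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        conf[tokens != MASK] = -np.inf  # never revisit committed positions
        n_masked = int((tokens == MASK).sum())
        commit = np.argsort(-conf)[:min(per_step, n_masked)]
        tokens[commit] = guess[commit]
    return tokens, n_calls
```

With `seq_len=32` and `steps=8` the loop makes 8 model calls instead of the 32 an autoregressive decoder would need, which is where a speed-up of this kind can come from.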

# Title Difficulty Tags
413 Medium Diffusion, DiffusionGemma, ForwardProcess
414 Hard Diffusion, DiffusionGemma, Sampling, Entropy
415 Medium Diffusion, DiffusionGemma, Sampling, EarlyStopping
416 Medium Diffusion, DiffusionGemma, Sampling, Temperature
417 Hard Diffusion, DiffusionGemma, AttentionMasking
418 Hard◆ Diffusion, DiffusionGemma, Sampling, Inference

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer they run mostly Mamba with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit.

The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep it cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.
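A back-of-the-envelope sketch of why fewer attention layers shrink the KV cache. Only attention layers store per-token keys and values; a Mamba layer keeps a fixed-size state that does not grow with sequence length. The layer counts, head sizes and context length below are made-up illustration values, not Nemotron 3's configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 48-layer model at 128k context, 16-bit cache.
full_attention = kv_cache_bytes(n_attn_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)
# Same depth, but only 6 of the 48 layers are attention.
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=128, seq_len=131072)

print(full_attention / 2**30, "GiB vs", hybrid / 2**30, "GiB")
```

The cache scales linearly with the number of attention layers, so under these assumed numbers it drops from 24 GiB to 3 GiB per sequence.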

# Title Difficulty Tags
406 Easy KVCache, Mamba, GQA, Inference
419 Medium Mamba, SSM, StateSpace, NumPy
407 Easy MoE, Roofline, Inference, LatentMoE
408 Easy MoE, LatentMoE, Inference
409 Medium MoE, LatentMoE, Routing
410 Hard MoE, LatentMoE, Routing, Transformers
411 Medium Quantization, NVFP4, LowPrecision
412 Hard MTP, SpeculativeDecoding, Drafting, Inference

A training method from Sakana AI that challenges the assumption that a Transformer needs to be trained end-to-end. They turn the B blocks into B independent diffusion denoisers, with no gradient ever crossing a block boundary.

Each block is independent, so they can be trained one at a time on a single GPU (activation memory O(L) → O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.
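The memory claim can be sketched with simple bookkeeping, assuming L is the number of layers and that backprop needs to keep roughly one unit of activations per layer it passes through. `partition_layers` and `peak_activation_units` are hypothetical helpers, not part of the paper's code.

```python
def partition_layers(n_layers, n_blocks):
    """Split layer indices into contiguous blocks of near-equal size."""
    base, extra = divmod(n_layers, n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def peak_activation_units(n_layers, n_blocks):
    """Largest number of layers whose activations are live at once.

    Gradients stop at block boundaries, so only one block is ever
    backpropagated through at a time.
    """
    return max(len(block) for block in partition_layers(n_layers, n_blocks))

print(partition_layers(12, 4))       # four blocks of three layers each
print(peak_activation_units(12, 1))  # end-to-end training: all 12 layers live
print(peak_activation_units(12, 4))  # block-wise training: 3 layers live
```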

# Title Difficulty Tags
400 Easy Diffusion, PFODE, ResidualConnection
401 Medium Diffusion, LogNormal, InverseCDF
402 Medium Diffusion, LayerNorm, AdaLN, Conditioning
403 Medium Diffusion, LogNormal, InverseCDF, Sampling
404 Medium Diffusion, Loss, EDM, Weighting
405 Medium Diffusion, Inference, PFODE, DiffusionBlocks

Two preview MoE models from DeepSeek, V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: make million-token context cheap and efficient.

V4-Flash matches V3.2-Base on most benchmarks with a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.

# Title Difficulty Tags
360 Medium doubly-stochastic, mhc, deepseek-v4
361 Medium muon, orthogonalisation, deepseek-v4
369 Hard residual, hyper-connections, deepseek-v4
370 Medium sparse-attention, top-k, deepseek-v4
371 Medium sparse-attention, kv-compression, deepseek-v4
372 Hard sparse-attention, kv-compression, deepseek-v4

A training-free method for scoring key importance in the KV cache. It uses a geometric property of attention heads (pre-RoPE query and key vectors cluster around stable non-zero directions) instead of running full attention.

TriAttention (arXiv 2604.04921) was released on April 6, 2026. The core observation, Q/K concentration, is empirically demonstrable, and the scoring derivation falls out directly from the RoPE attention identity.
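For reference, the standard RoPE attention identity says the rotated dot product depends only on the relative offset m - n, and the pre-RoPE vectors enter only through the per-frequency products q_i · conj(k_i). The sketch below checks that numerically. It illustrates the identity only and does not reproduce TriAttention's scoring rule; `rope` and `rope_score` are hypothetical helper names, and the interleaved-pair layout with base 10000 is an assumed convention.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * theta)  # pairs as complex numbers
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

def rope_score(q, k, m, n, base=10000.0):
    """Closed form: sum_i Re[ q_i * conj(k_i) * exp(1j * (m - n) * theta_i) ]."""
    d = q.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    qc = q[0::2] + 1j * q[1::2]
    kc = k[0::2] + 1j * k[1::2]
    return float(np.sum((qc * np.conj(kc) * np.exp(1j * (m - n) * theta)).real))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
near = rope(q, 5) @ rope(k, 2)      # offset 3
far = rope(q, 105) @ rope(k, 102)   # same offset, shifted by 100 positions
print(np.isclose(near, far), np.isclose(near, rope_score(q, k, 5, 2)))
```

If the pre-RoPE q and k directions are stable, the products q_i · conj(k_i) are roughly fixed per head and the score becomes a function of the offset alone, which is the kind of shortcut a cheap scoring rule can exploit.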

# Title Difficulty Tags
284 Easy RoPE, Attention, KV Cache, TriAttention, NumPy
285 Medium RoPE, Attention, Positional Encoding, TriAttention, NumPy
286 Medium RoPE, Attention, KV Cache, TriAttention, NumPy
287 Medium RoPE, Attention, KV Cache, TriAttention, NumPy
288 Medium RoPE, Attention, KV Cache, TriAttention, NumPy
289 Hard RoPE, Attention, KV Cache, TriAttention, NumPy

Gemma 4 is Google's open-weight LLM family, with five architectural departures from the standard recipe: QK-norm, partial RoPE, per-layer input gating, KV sharing, and a parallel MoE.

The 2B-parameter per-layer embedding table re-injects token identity at every depth, on the bet that deep transformers don't preserve it on their own. Open weights mean the architecture is directly readable.
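As one example from that list, QK-norm normalises queries and keys before the dot product so the attention logits stay bounded whatever the scale of the activations. The sketch below is a generic single-head version using RMS normalisation, with no learned gain and no causal mask. It is an assumption for illustration, not Gemma 4's exact layer, and `rms_norm` and `qk_norm_attention` are hypothetical names.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each row to unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v, scale=None):
    """Single-head attention with RMS-normalised queries and keys."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d) if scale is None else scale
    logits = scale * (rms_norm(q) @ rms_norm(k).T)
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
k = rng.normal(size=(6, 16))
v = rng.normal(size=(6, 16))
out = qk_norm_attention(q, k, v)
out_scaled = qk_norm_attention(1000.0 * q, k, v)  # blow up the queries
print(np.allclose(out, out_scaled, atol=1e-4))
```

Multiplying the queries by 1000 leaves the output essentially unchanged, which is the point: the softmax cannot be saturated by activation growth alone.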

# Title Difficulty Tags
181 Hard Transformers, Attention, Normalization, Gemma 4, NumPy
182 Easy Transformers, Normalization, Gemma 4, NumPy
183 Medium Transformers, Positional Encoding, RoPE, Gemma 4, NumPy
184 Medium Transformers, Normalization, Gemma 4, NumPy
185 Hard Transformers, Attention, KV Cache, Gemma 4, NumPy

A training-free vector quantization method from Google that compresses KV cache entries to ~3 bits per coordinate using random rotation, Lloyd-Max quantization, and a 1-bit residual sketch.

Because TurboQuant needs no fine-tuning, it can be applied to existing models as-is. The smaller cache means fewer memory transfers, so quantized Gemma and Mistral run inference faster than their unquantized originals.
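The first two ingredients can be sketched in a few lines: a random rotation makes the coordinates of any vector look roughly Gaussian, and a Lloyd-Max codebook fitted to a Gaussian then quantizes each coordinate independently. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses 8 levels (3 bits) per coordinate, stores one scale per vector, and leaves out the 1-bit residual sketch entirely. All function names are hypothetical.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def lloyd_max_codebook(samples, n_levels, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(n_levels):
            if np.any(idx == j):
                levels[j] = samples[idx == j].mean()
    return np.sort(levels)

def quantize(x, rot, levels):
    """Rotate, rescale to roughly unit variance, snap to the nearest level."""
    y = rot @ x
    y_scale = np.linalg.norm(x) / np.sqrt(len(x))
    idx = np.abs(y[:, None] / y_scale - levels[None, :]).argmin(axis=1)
    return idx, y_scale

def dequantize(idx, y_scale, rot, levels):
    """Look up the levels, undo the scaling, rotate back."""
    return rot.T @ (levels[idx] * y_scale)

rng = np.random.default_rng(0)
rot = random_rotation(64, rng)
levels = lloyd_max_codebook(rng.normal(size=20000), n_levels=8)
x = 5.0 * rng.normal(size=64)
idx, y_scale = quantize(x, rot, levels)
x_hat = dequantize(idx, y_scale, rot, levels)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # relative reconstruction error
```

The codebook is fitted once on synthetic Gaussian samples and never sees real data, which is what makes this style of quantizer training-free.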

# Title Difficulty Tags
173 Medium Linear Algebra, Quantization, QR Decomposition, Numpy
174 Medium Quantization, Linear Algebra, Numpy, Optimisation
175 Medium Quantization, Linear Algebra, Numpy
176 Medium Quantization, Linear Algebra, Numpy
177 Hard◆ Quantization, Linear Algebra, Numpy, TurboQuant
178 Hard◆ Quantization, Linear Algebra, Transformers, Numpy, TurboQuant