DiffusionGemma: Discrete diffusion in a large language model

DeepMind released DiffusionGemma, a new large language model that uses discrete diffusion to generate entire sequences in parallel instead of generating them token by token. The model achieves over 1,000 tokens per second on a single H100 GPU and runs in 18GB when quantized, up to a 4x speedup over autoregressive models of the same size, though it remains slightly behind the flagship Gemma-4 release.
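To make "discrete diffusion" concrete, here is a minimal sketch of one common forward (noising) process, absorbing-state masking, in which each token is independently replaced by a mask symbol with probability `t`. This is an assumption for illustration only: the summary above does not say which corruption process DiffusionGemma uses, and `MASK` and `forward_mask` are hypothetical names.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not DiffusionGemma's actual token


def forward_mask(tokens, t, rng):
    """Corrupt a sequence at noise level t in [0, 1].

    Each token is independently replaced by MASK with probability t,
    so t=0 returns the clean sequence and t=1 masks everything.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
clean = "the cat sat on the mat".split()
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(clean, t, rng))
```

A denoiser trained to undo this corruption can predict all masked positions in one pass, which is where the parallelism comes from; the sampler itself is not shown here.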

Problems around papers, methods and ideas as they break. Newest at the top.

A new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, the model works on the whole sequence in parallel. It reaches 1000+ tokens/s on a single H100 and runs in 18GB quantized, up to 4x the speed of an autoregressive model of the same size. So far it's not really competitive with the flagship Gemma-4 release earlier this year, but it's getting close.

| # | Difficulty | Tags |
|---|---|---|
| 413 | Medium | DiffusionDiffusionGemmaForwardProcess |
| 414 | Hard | DiffusionDiffusionGemmaSamplingEntropy |
| 415 | Medium | DiffusionDiffusionGemmaSamplingEarlyStopping |
| 416 | Medium | DiffusionDiffusionGemmaSamplingTemperature |
| 417 | Hard | DiffusionDiffusionGemmaAttentionMasking |
| 418 | Hard◆ | DiffusionDiffusionGemmaSamplingInference |

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer, these models run mostly Mamba layers with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit. The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep the models cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.
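The claim that hybrid layers keep the KV cache small follows from simple arithmetic: only attention layers store per-token keys and values, while a Mamba layer carries a fixed-size state regardless of sequence length. A rough sketch, in which every number is a made-up illustration and not a Nemotron 3 configuration, and `kv_cache_bytes` is a hypothetical helper:

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every
    attention layer, KV head and position."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 48-layer model at 128k context with 8 KV heads of size 128.
full = kv_cache_bytes(attn_layers=48, kv_heads=8, head_dim=128, seq_len=131072)
# Same model with only 6 attention layers; the other 42 are assumed to be
# Mamba layers, whose state does not grow with sequence length.
hybrid = kv_cache_bytes(attn_layers=6, kv_heads=8, head_dim=128, seq_len=131072)

print(f"full attention: {full / 2**30:.1f} GiB")
print(f"hybrid:         {hybrid / 2**30:.1f} GiB")
print(f"ratio:          {full / hybrid:.0f}x")
```

With these illustrative numbers the cache shrinks from 24 GiB to 3 GiB per sequence; the freed memory is what could go to wider expert layers or larger batches.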
| # | Difficulty | Tags |
|---|---|---|
| 406 | Easy | KVCacheMambaGQAInference |
| 419 | Medium | MambaSSMStateSpaceNumPy |
| 407 | Easy | MoERooflineInferenceLatentMoE |
| 408 | Easy | MoELatentMoEInference |
| 409 | Medium | MoELatentMoERouting |
| 410 | Hard | MoELatentMoERoutingTransformers |
| 411 | Medium | QuantizationNVFP4LowPrecision |
| 412 | Hard | MTPSpeculativeDecodingDraftingInference |

A training method from Sakana AI that challenges the assumption that a Transformer needs to be trained end-to-end. They turn the B blocks into B independent diffusion denoisers, with no gradient ever crossing a block boundary. Because each block is independent, the blocks can be trained one at a time on a single GPU (activation memory drops from O(L) to O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.

| # | Difficulty | Tags |
|---|---|---|
| 400 | Easy | DiffusionPFODEResidualConnection |
| 401 | Medium | DiffusionLogNormalInverseCDF |
| 402 | Medium | DiffusionLayerNormAdaLNConditioning |
| 403 | Medium | DiffusionLogNormalInverseCDFSampling |
| 404 | Medium | DiffusionLossEDMWeighting |
| 405 | Medium | DiffusionInferencePFODEDiffusionBlocks |

Two preview MoE models from DeepSeek, V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: make million-token context cheap and efficient. V4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.
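One generic way to make attention cheaper at long context is top-k sparse attention, which this entry's problem tags mention. The sketch below illustrates only that general idea, not the V4 design, which the entry does not describe: a query keeps its k highest-scoring keys and runs softmax over that subset. Selection here uses the exact scores for simplicity, so nothing is actually saved; a real system would need a cheaper selector. `topk_sparse_attention` is a hypothetical name.

```python
import math


def topk_sparse_attention(q, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    q: list[float]; keys, values: lists of float vectors.
    Returns the attention output and the sorted indices that were kept.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    # Keep the k best positions; every other position gets zero weight.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - top) for i in keep}  # stable softmax
    total = sum(weights.values())
    dim = len(values[0])
    out = [sum(weights[i] / total * values[i][d] for i in keep) for d in range(dim)]
    return out, sorted(keep)


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
out, kept = topk_sparse_attention(q, keys, values, k=2)
print(kept, out)
```

Setting `k` to the number of keys recovers ordinary dense softmax attention, which makes the function easy to check against a reference.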
| # | Difficulty | Tags |
|---|---|---|
| 360 | Medium | doubly-stochasticmhcdeepseek-v4 |
| 361 | Medium | muonorthogonalisationdeepseek-v4 |
| 369 | Hard | residualhyper-connectionsdeepseek-v4 |
| 370 | Medium | sparse-attentiontop-kdeepseek-v4 |
| 371 | Medium | sparse-attentionkv-compressiondeepseek-v4 |
| 372 | Hard | sparse-attentionkv-compressiondeepseek-v4 |

A training-free method for scoring key importance in the KV cache. Instead of running full attention, it uses a geometric property of attention heads: pre-RoPE query and key vectors cluster around stable non-zero directions. TriAttention (arXiv 2604.04921) was released on April 6, 2026. The core observation, Q/K concentration, is empirically demonstrable, and the scoring derivation follows directly from the RoPE attention identity.

| # | Difficulty | Tags |
|---|---|---|
| 284 | Easy | RoPEAttentionKV CacheTriAttentionNumPy |
| 285 | Medium | RoPEAttentionPositional EncodingTriAttentionNumPy |
| 286 | Medium | RoPEAttentionKV CacheTriAttentionNumPy |
| 287 | Medium | RoPEAttentionKV CacheTriAttentionNumPy |
| 288 | Medium | RoPEAttentionKV CacheTriAttentionNumPy |
| 289 | Hard | RoPEAttentionKV CacheTriAttentionNumPy |

Google's open-weight LLM family with five architectural departures from the standard recipe: QK-norm, partial RoPE, per-layer input gating, KV sharing, and a parallel MoE. The 2B-parameter per-layer embedding table re-injects token identity at every depth, on the bet that deep transformers don't preserve it on their own. Because the weights are open, the architecture is directly readable.
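Of the five departures listed, QK-norm is the simplest to show in code. In its usual form, queries and keys are normalised per head before the dot product, which bounds the attention logits. A minimal sketch, assuming plain RMS normalisation with no learned scale and the standard 1/sqrt(d) factor kept as-is; the entry does not give the exact formulation used in this model family, and both function names are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def qk_norm_logit(q, k):
    """Attention logit computed from RMS-normalised query and key."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


# Rescaling q by a positive factor leaves the logit essentially unchanged.
q, k = [3.0, -1.0, 2.0, 0.5], [1.0, 4.0, -2.0, 0.0]
print(qk_norm_logit(q, k))
print(qk_norm_logit([100 * v for v in q], k))
```

Because both vectors have unit RMS, the logit magnitude cannot exceed sqrt(d) in this sketch, no matter how large the raw activations grow.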
| # | Difficulty | Tags |
|---|---|---|
| 181 | Hard | TransformersAttentionNormalizationGemma 4NumPy |
| 182 | Easy | TransformersNormalizationGemma 4NumPy |
| 183 | Medium | TransformersPositional EncodingRoPEGemma 4NumPy |
| 184 | Medium | TransformersNormalizationGemma 4NumPy |
| 185 | Hard | TransformersAttentionKV CacheGemma 4NumPy |

TurboQuant is a training-free vector quantization method from Google that compresses KV cache entries to ~3 bits per coordinate using random rotation, Lloyd-Max quantization, and a 1-bit residual sketch, with no fine-tuning. The smaller cache means fewer memory transfers, so inference runs faster on Gemma and Mistral than on the unquantized originals.

| # | Difficulty | Tags |
|---|---|---|
| 173 | Medium | Linear AlgebraQuantizationQR DecompositionNumpy |
| 174 | Medium | QuantizationLinear AlgebraNumpyOptimisation |
| 175 | Medium | QuantizationLinear AlgebraNumpy |
| 176 | Medium | QuantizationLinear AlgebraNumpy |
| 177 | Hard◆ | QuantizationLinear AlgebraNumpyTurboQuant |
| 178 | Hard◆ | QuantizationLinear AlgebraTransformersNumpyTurboQuant |
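The first two ingredients named in the TurboQuant entry, random rotation and Lloyd-Max quantization, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the published algorithm: the rotation is built by Gram-Schmidt, the codebook is fitted to Gaussian samples, the vector norm is stored separately at full precision, and the 1-bit residual sketch is omitted. All function names are hypothetical.

```python
import math
import random


def random_rotation(dim, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lloyd_max(samples, levels, iters=30):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and mean update."""
    s = sorted(samples)
    # Initialise the centroids at evenly spaced quantiles of the samples.
    cents = [s[(2 * i + 1) * len(s) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in cents]
        for v in s:
            j = min(range(levels), key=lambda i: abs(v - cents[i]))
            buckets[j].append(v)
        cents = [sum(b) / len(b) if b else c for b, c in zip(buckets, cents)]
    return cents


def quantize(x, cents):
    """Map each coordinate to the index of its nearest centroid."""
    return [min(range(len(cents)), key=lambda i: abs(v - cents[i])) for v in x]


rng = random.Random(0)
dim, bits = 16, 3
rot = random_rotation(dim, rng)
# Coordinates of a rotated unit vector are roughly N(0, 1/dim), so one
# codebook fitted to that distribution is shared by every coordinate.
train = [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(2000)]
cents = lloyd_max(train, 2 ** bits)

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
norm = math.sqrt(sum(v * v for v in x))
codes = quantize(matvec(rot, [v / norm for v in x]), cents)  # 3 bits each
x_hat = [norm * v for v in matvec(transpose(rot), [cents[c] for c in codes])]
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative reconstruction error: {err:.3f}")
```

The rotation is what lets a single scalar codebook work for every coordinate: it spreads any outlier dimension across all coordinates before the per-coordinate quantizer is applied.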