{"slug": "diffusiongemma-discrete-diffusion-in-a-large-language-model", "title": "DiffusionGemma: Discrete diffusion in a large language model", "summary": "DeepMind released DiffusionGemma, a new large language model that uses discrete diffusion to generate entire sequences in parallel instead of autoregressive token-by-token generation. The model achieves over 1,000 tokens per second on a single H100 GPU and runs in 18GB when quantized, offering up to 4x speed improvement over same-size autoregressive models, though it remains slightly behind the flagship Gemma-4 release.", "body_md": "Problems around papers, methods and ideas as they break. Newest at the top.\n\nA new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, it works on the whole sequence in parallel.\n\nIt reaches 1000+ tokens/s on a single H100, up to 4x the speed of an autoregressive model of the same size, and runs in 18GB when quantized. So far it's not really competitive with the flagship Gemma-4 release earlier this year, but it's getting close.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 413 | | Medium | DiffusionDiffusionGemmaForwardProcess |\n| 414 | | Hard | DiffusionDiffusionGemmaSamplingEntropy |\n| 415 | | Medium | DiffusionDiffusionGemmaSamplingEarlyStopping |\n| 416 | | Medium | DiffusionDiffusionGemmaSamplingTemperature |\n| 417 | | Hard | DiffusionDiffusionGemmaAttentionMasking |\n| 418 | | Hard◆ | DiffusionDiffusionGemmaSamplingInference |\n\nNVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). 
Instead of a standard transformer, they run mostly Mamba with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit.\n\nThe hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep the models cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 406 | | Easy | KVCacheMambaGQAInference |\n| 419 | | Medium | MambaSSMStateSpaceNumPy |\n| 407 | | Easy | MoERooflineInferenceLatentMoE |\n| 408 | | Easy | MoELatentMoEInference |\n| 409 | | Medium | MoELatentMoERouting |\n| 410 | | Hard | MoELatentMoERoutingTransformers |\n| 411 | | Medium | QuantizationNVFP4LowPrecision |\n| 412 | | Hard | MTPSpeculativeDecodingDraftingInference |\n\nA training method from Sakana AI that challenges the assumption that Transformers need to be trained end-to-end. They turn the B blocks into B independent diffusion denoisers, with no gradient ever crossing a block boundary.\n\nThe blocks are independent, so they can be trained one at a time on a single GPU (activation memory O(L) → O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 400 | | Easy | DiffusionPFODEResidualConnection |\n| 401 | | Medium | DiffusionLogNormalInverseCDF |\n| 402 | | Medium | DiffusionLayerNormAdaLNConditioning |\n| 403 | | Medium | DiffusionLogNormalInverseCDFSampling |\n| 404 | | Medium | DiffusionLossEDMWeighting |\n| 405 | | Medium | DiffusionInferencePFODEDiffusionBlocks |\n\nTwo preview MoE models, V4-Pro (1.6T total, 49B active) and V4-Flash (284B, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. 
All three changes serve the same goal: make million-token context cheap and efficient.\n\nV4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 360 | | Medium | doubly-stochasticmhcdeepseek-v4 |\n| 361 | | Medium | muonorthogonalisationdeepseek-v4 |\n| 369 | | Hard | residualhyper-connectionsdeepseek-v4 |\n| 370 | | Medium | sparse-attentiontop-kdeepseek-v4 |\n| 371 | | Medium | sparse-attentionkv-compressiondeepseek-v4 |\n| 372 | | Hard | sparse-attentionkv-compressiondeepseek-v4 |\n\nA training-free method for scoring key importance in the KV cache. It uses a geometric property of attention heads (pre-RoPE query and key vectors cluster around stable non-zero directions) instead of running full attention.\n\nTriAttention (arXiv 2604.04921) was released on April 6, 2026. The core observation, Q/K concentration, is empirically demonstrable, and the scoring derivation falls out directly from the RoPE attention identity.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 284 | | Easy | RoPEAttentionKV CacheTriAttentionNumPy |\n| 285 | | Medium | RoPEAttentionPositional EncodingTriAttentionNumPy |\n| 286 | | Medium | RoPEAttentionKV CacheTriAttentionNumPy |\n| 287 | | Medium | RoPEAttentionKV CacheTriAttentionNumPy |\n| 288 | | Medium | RoPEAttentionKV CacheTriAttentionNumPy |\n| 289 | | Hard | RoPEAttentionKV CacheTriAttentionNumPy |\n\nGoogle's open-weight LLM family with five architectural departures from the standard recipe: QK-norm, partial RoPE, per-layer input gating, KV sharing, and a parallel MoE.\n\nThe 2B-parameter per-layer embedding table re-injects token identity at every depth, on the bet that deep transformers don't preserve it on their own. 
Open weights mean the architecture is directly readable.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 181 | | Hard | TransformersAttentionNormalizationGemma 4NumPy |\n| 182 | | Easy | TransformersNormalizationGemma 4NumPy |\n| 183 | | Medium | TransformersPositional EncodingRoPEGemma 4NumPy |\n| 184 | | Medium | TransformersNormalizationGemma 4NumPy |\n| 185 | | Hard | TransformersAttentionKV CacheGemma 4NumPy |\n\nA training-free vector quantization method from Google that compresses KV cache entries to ~3 bits per coordinate using random rotation, Lloyd-Max quantization, and a 1-bit residual sketch.\n\nTurboQuant compresses the KV cache to ~3 bits without fine-tuning. The smaller cache means fewer memory transfers, so inference on Gemma and Mistral runs faster than on the unquantized originals.\n\n| # | Title | Difficulty | Tags |\n|---|---|---|---|\n| 173 | | Medium | Linear AlgebraQuantizationQR DecompositionNumpy |\n| 174 | | Medium | QuantizationLinear AlgebraNumpyOptimisation |\n| 175 | | Medium | QuantizationLinear AlgebraNumpy |\n| 176 | | Medium | QuantizationLinear AlgebraNumpy |\n| 177 | | Hard◆ | QuantizationLinear AlgebraNumpyTurboQuant |\n| 178 | | Hard◆ | QuantizationLinear AlgebraTransformersNumpyTurboQuant |", "url": "https://wpnews.pro/news/diffusiongemma-discrete-diffusion-in-a-large-language-model", "canonical_source": "https://idlemachines.co.uk/topics/trending", "published_at": "2026-06-11 22:21:21+00:00", "updated_at": "2026-06-11 22:50:06.175242+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "generative-ai", "ai-research"], "entities": ["DeepMind", "Gemma", "DiffusionGemma", "NVIDIA", "Nemotron"], "alternates": {"html": "https://wpnews.pro/news/diffusiongemma-discrete-diffusion-in-a-large-language-model", "markdown": "https://wpnews.pro/news/diffusiongemma-discrete-diffusion-in-a-large-language-model.md", "text": 
"https://wpnews.pro/news/diffusiongemma-discrete-diffusion-in-a-large-language-model.txt", "jsonld": "https://wpnews.pro/news/diffusiongemma-discrete-diffusion-in-a-large-language-model.jsonld"}}