# DiffusionGemma: Discrete diffusion in a large language model

> Source: <https://idlemachines.co.uk/topics/trending>
> Published: 2026-06-11 22:21:21+00:00

Problems around papers, methods and ideas as they break. Newest at the top.

A new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, it works on the whole sequence in parallel.

They reach 1000+ tokens/s on a single H100, up to 4x the throughput of an autoregressive model of the same size, and the quantized model runs in 18GB. So far it's not really competitive with the flagship Gemma-4 release from earlier this year, but it's getting close.
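To make "discrete diffusion" concrete, here is a minimal sketch of one common forward (corruption) process, the absorbing-state or masking process. That DiffusionGemma uses this particular process is an assumption made purely for illustration; `MASK_ID` and `forward_mask` are hypothetical names.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for an absorbing [MASK] token (assumption)

def forward_mask(tokens, t, rng):
    """Absorbing-state forward process: mask each token independently with probability t."""
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < t        # one independent coin flip per position
    return np.where(masked, MASK_ID, tokens), masked

rng = np.random.default_rng(0)
noisy, masked = forward_mask([5, 9, 2, 7, 3, 1], t=0.5, rng=rng)
print(noisy, masked)
```

At `t=0` the sequence is untouched and at `t=1` every position is masked. A denoiser trained on such inputs predicts all masked positions in one pass, which is where the parallelism comes from.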

| # | Difficulty | Tags |
|---|---|---|
| 413 | Medium | Diffusion, DiffusionGemma, ForwardProcess |
| 414 | Hard | Diffusion, DiffusionGemma, Sampling, Entropy |
| 415 | Medium | Diffusion, DiffusionGemma, Sampling, EarlyStopping |
| 416 | Medium | Diffusion, DiffusionGemma, Sampling, Temperature |
| 417 | Hard | Diffusion, DiffusionGemma, Attention, Masking |
| 418 | Hard◆ | Diffusion, DiffusionGemma, Sampling, Inference |

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer they run mostly Mamba with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit.

The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep it cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.
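A back-of-the-envelope sketch of why a mostly-Mamba stack keeps the KV cache small: only attention layers cache keys and values per token, while a Mamba layer carries a fixed-size state that does not grow with sequence length. The layer counts and dimensions below are hypothetical, not Nemotron 3's real configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of cached keys and values for one sequence (K and V per attention layer)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configs (assumptions for illustration only).
dense = kv_cache_bytes(n_attn_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=128, seq_len=131072)
print(f"dense: {dense / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB")
```

With these made-up numbers the cache drops from 24 GiB to 3 GiB per sequence, in direct proportion to the number of attention layers.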

| # | Difficulty | Tags |
|---|---|---|
| 406 | Easy | KVCache, Mamba, GQA, Inference |
| 419 | Medium | Mamba, SSM, StateSpace, NumPy |
| 407 | Easy | MoE, Roofline, Inference, LatentMoE |
| 408 | Easy | MoE, LatentMoE, Inference |
| 409 | Medium | MoE, LatentMoE, Routing |
| 410 | Hard | MoE, LatentMoE, Routing, Transformers |
| 411 | Medium | Quantization, NVFP4, LowPrecision |
| 412 | Hard | MTP, SpeculativeDecoding, Drafting, Inference |

A training method from Sakana AI that challenges the assumption that a Transformer needs to be trained end-to-end. They turn the B blocks of the network into B independent diffusion denoisers, with no gradient ever crossing a block boundary.

Each block is independent, so they can be trained one at a time on a single GPU (activation memory O(L) → O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.

| # | Difficulty | Tags |
|---|---|---|
| 400 | Easy | Diffusion, PFODE, ResidualConnection |
| 401 | Medium | Diffusion, LogNormal, InverseCDF |
| 402 | Medium | Diffusion, LayerNorm, AdaLN, Conditioning |
| 403 | Medium | Diffusion, LogNormal, InverseCDF, Sampling |
| 404 | Medium | Diffusion, Loss, EDM, Weighting |
| 405 | Medium | Diffusion, Inference, PFODE, DiffusionBlocks |

Two preview MoE models from DeepSeek, V4-Pro (1.6T total, 49B active) and V4-Flash (284B, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: making million-token context cheap to serve.

V4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.

| # | Difficulty | Tags |
|---|---|---|
| 360 | Medium | doubly-stochastic, mhc, deepseek-v4 |
| 361 | Medium | muon, orthogonalisation, deepseek-v4 |
| 369 | Hard | residual, hyper-connections, deepseek-v4 |
| 370 | Medium | sparse-attention, top-k, deepseek-v4 |
| 371 | Medium | sparse-attention, kv-compression, deepseek-v4 |
| 372 | Hard | sparse-attention, kv-compression, deepseek-v4 |

A training-free method for scoring key importance in the KV cache. It uses a geometric property of attention heads (pre-RoPE query and key vectors cluster around stable non-zero directions) instead of running full attention.

TriAttention (arxiv 2604.04921) was released April 6 2026. The core observation, Q/K concentration, is empirically demonstrable, and the scoring derivation falls out directly from the RoPE attention identity.
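The RoPE attention identity referred to above can be checked numerically: the dot product of a rotated query and a rotated key depends only on their relative offset. The sketch below verifies that standard identity in NumPy; it is not TriAttention's scoring rule, and the half-split pairing of dimensions and `base=10000.0` are assumptions.

```python
import numpy as np

def _freqs(d, base):
    half = d // 2
    return base ** (-np.arange(half) / half)     # one rotation frequency per 2-D pair

def rope(x, pos, base=10000.0):
    """Rotate vector x (even length) to position pos, pairing x[i] with x[i + d/2]."""
    half = x.shape[-1] // 2
    angles = pos * _freqs(x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def rope_score(q, k, delta, base=10000.0):
    """Closed form of rope(q, m) @ rope(k, n) in terms of delta = m - n only."""
    half = q.shape[-1] // 2
    qc = q[:half] + 1j * q[half:]                # view each 2-D pair as a complex number
    kc = k[:half] + 1j * k[half:]
    phase = np.exp(1j * delta * _freqs(q.shape[-1], base))
    return np.sum(np.real(qc * np.conj(kc) * phase))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
print(np.isclose(rope(q, 7) @ rope(k, 3), rope_score(q, k, 4)))
```

Because the score is a function of the pre-RoPE vectors and the offset alone, stable pre-RoPE query and key directions make it plausible to estimate a key's importance without running the full attention computation.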

| # | Difficulty | Tags |
|---|---|---|
| 284 | Easy | RoPE, Attention, KV Cache, TriAttention, NumPy |
| 285 | Medium | RoPE, Attention, Positional Encoding, TriAttention, NumPy |
| 286 | Medium | RoPE, Attention, KV Cache, TriAttention, NumPy |
| 287 | Medium | RoPE, Attention, KV Cache, TriAttention, NumPy |
| 288 | Medium | RoPE, Attention, KV Cache, TriAttention, NumPy |
| 289 | Hard | RoPE, Attention, KV Cache, TriAttention, NumPy |

Google's open-weight LLM family with five architectural departures from the standard recipe: QK-norm, partial RoPE, per-layer input gating, KV sharing, and a parallel MoE.

The 2B-parameter per-layer embedding table re-injects token identity at every depth, on the bet that deep transformers don't preserve it on their own. Open weights mean the architecture is directly readable.
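As an illustration of the first departure listed above, here is a minimal sketch of QK-norm, assuming it takes the common form of an RMS normalisation applied to queries and keys before the dot product. The function names and unit gains are hypothetical, and any logit scaling the real model applies is left out.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale each vector to unit root-mean-square, then apply a per-dimension gain."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def qk_norm_logits(q, k, q_weight, k_weight):
    """Attention logits from RMS-normalised queries (n_q, d) and keys (n_k, d)."""
    return rms_norm(q, q_weight) @ rms_norm(k, k_weight).T

rng = np.random.default_rng(0)
q, k = rng.standard_normal((4, 16)), rng.standard_normal((6, 16))
gain = np.ones(16)                               # hypothetical learned gain, set to 1 here
logits = qk_norm_logits(q, k, gain, gain)
print(logits.shape, np.abs(logits).max())
```

With unit gains the logits are bounded by the head dimension however large the raw activations get, which is the usual motivation for normalising queries and keys.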

| # | Difficulty | Tags |
|---|---|---|
| 181 | Hard | Transformers, Attention, Normalization, Gemma 4, NumPy |
| 182 | Easy | Transformers, Normalization, Gemma 4, NumPy |
| 183 | Medium | Transformers, Positional Encoding, RoPE, Gemma 4, NumPy |
| 184 | Medium | Transformers, Normalization, Gemma 4, NumPy |
| 185 | Hard | Transformers, Attention, KV Cache, Gemma 4, NumPy |

A training-free vector quantization method from Google that compresses KV cache entries to ~3 bits per coordinate using random rotation, Lloyd-Max quantization, and a 1-bit residual sketch.

TurboQuant compresses the KV cache to ~3 bits without fine-tuning. The smaller cache means fewer memory transfers, so inference runs faster on Gemma and Mistral than the unquantized originals.
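The first two ingredients named above, random rotation and Lloyd-Max quantization, can be sketched in NumPy as follows. This is a toy under stated assumptions, not TurboQuant's implementation: the 1-bit residual sketch is omitted, the levels are fit on the sample itself, and all function names are hypothetical.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))               # sign fix on each column of q

def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(n_levels):
            if np.any(idx == j):                 # leave empty cells where they are
                levels[j] = samples[idx == j].mean()
    return np.sort(levels)

def quantize(x, rot, levels):
    """Rotate rows of x, then store the index of the nearest level per coordinate."""
    y = x @ rot.T
    return np.abs(y[..., None] - levels).argmin(axis=-1)

def dequantize(codes, rot, levels):
    """Look up the levels and undo the rotation (rot is orthogonal)."""
    return levels[codes] @ rot

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))               # stand-in for cached key vectors
rot = random_rotation(64, rng)
levels = lloyd_max((x @ rot.T).ravel(), bits=3)
x_hat = dequantize(quantize(x, rot, levels), rot, levels)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The rotation spreads each vector's energy evenly across coordinates, so a single scalar codebook can serve every coordinate. Each code here takes 3 bits.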

| # | Difficulty | Tags |
|---|---|---|
| 173 | Medium | Linear Algebra, Quantization, QR Decomposition, Numpy |
| 174 | Medium | Quantization, Linear Algebra, Numpy, Optimisation |
| 175 | Medium | Quantization, Linear Algebra, Numpy |
| 176 | Medium | Quantization, Linear Algebra, Numpy |
| 177 | Hard◆ | Quantization, Linear Algebra, Numpy, TurboQuant |
| 178 | Hard◆ | Quantization, Linear Algebra, Transformers, Numpy, TurboQuant |
