DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

wpnews.pro

cd /news/large-language-models/diffusiongemma-how-google-s-new-open… · home › topics › large-language-models › article

[ARTICLE · art-25504] src=dev.to ↗ pub=2026-06-12T18:30Z topic=large-language-models verified=true sentiment=↑ positive

DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

Google DeepMind released DiffusionGemma, an open-source Apache 2.0-licensed diffusion-based large language model that generates text at over 1,000 tokens per second on a single H100 GPU, achieving up to 4x faster throughput than comparable autoregressive models. The 26B-parameter Mixture of Experts model, which fits in 18 GB of VRAM, iteratively refines entire 256-token blocks of noise rather than predicting one token at a time, trading some accuracy on complex reasoning tasks for significantly lower latency.

read3 min views21 publishedJun 12, 2026

TL;DR:Google released DiffusionGemma, an open Apache 2.0 diffusion-based LLM that generates text up to 4x faster than autoregressive models, hitting 1,000+ tokens/sec on a single H100 and fitting in 18 GB VRAM. It trades some accuracy for speed. Here is what that means in practice.

Google DeepMind released DiffusionGemma, the first production-grade open-weight model that applies discrete diffusion to text generation. The same family of techniques behind image generators like Stable Diffusion, now applied to language.

Instead of predicting one token at a time left-to-right, DiffusionGemma fills a 256-token block with noise and iteratively refines the entire block across multiple denoising passes until confidence thresholds are met. It commits roughly 15-20 tokens per forward pass on average, not one.

This is a fundamentally different compute pattern from everything shipping in production today.

Metric	Value
Tokens/sec (H100, FP8, low batch)
1,100+
Tokens/sec (RTX 5090)
700+
Total parameters
25.2B (marketed as 26B)
Active parameters at inference
3.8B
MoE expert config
8 active / 128 total
VRAM required (quantized)
18 GB
Canvas (block) size
256 tokens
Tokens committed per forward pass
~15-20
Max denoising steps
48
Context window
256K tokens
License
Apache 2.0

For context: comparable autoregressive models on the same H100 generate roughly 200-250 tokens/sec. DiffusionGemma is up to 4x faster on throughput. The jump comes from shifting the decode bottleneck from memory bandwidth to compute.

DiffusionGemma is a 26B Mixture of Experts (MoE) model built on the Gemma 4 backbone, but it replaces the autoregressive decoder with a diffusion head.

How a single generation works:

The key difference from GPT-style models: token N can see tokens N+1 through N+256 during generation. This enables genuine self-correction across the block. Autoregressive models structurally cannot do this.

Benchmark	DiffusionGemma	Gemma 4 26B
MMLU Pro	77.6%	82.6%
AIME 2026	69.1%	88.3%
GPQA Diamond	73.2%	82.3%
MMMU Pro (Vision)	54.3%	73.8%

Google describes it as experimental. For reasoning-heavy workloads (complex math, multi-step logic, vision understanding) the autoregressive Gemma 4 is still ahead. DiffusionGemma is the right tool when latency and throughput matter more than peak accuracy.

The model processes interleaved text, images (5 resolution tiers up to 1120 tokens), and video (up to 60 seconds at 1 fps). It supports OCR, chart comprehension, screen understanding, and handwriting recognition across 35+ languages, with training data covering 140+ languages.

pip install vllm

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill

The endpoint is OpenAI-compatible. Point your existing client at http://localhost:8000

with no other code changes needed.

Supported inference runtimes: vLLM, Hugging Face Transformers, SGLang, MLX (Apple Silicon), NVIDIA NIM containers, Google Cloud Vertex AI Model Garden.

The ecosystem arrived fast for a day-1 release:

A published case study fine-tuned DiffusionGemma on a Sudoku dataset and improved success rate from approximately 0% to 80%. Fine-tuning can also teach the model to stop denoising early when confidence is already high, reducing inference steps further. Autoregressive models have no equivalent lever.

This week:

Next sprint:

Architecture signal:

This model is built on the same Gemini Diffusion research that will likely inform future proprietary Gemini releases. If diffusion inference stabilizes at this quality level, it rewrites autoregressive serving assumptions at scale.

DiffusionGemma is not a production replacement for your current LLM stack today. Accuracy trade-offs are real and Google is transparent about the experimental status.

But the throughput numbers are genuine, the hardware requirements are accessible, and the license is Apache 2.0.

1,100 tokens per second. 18 GB VRAM. Open weights. From Google.

That combination is worth benchmarking on your actual workload this week.

Resources:

Found this useful? Follow for more signal-over-noise breakdowns of AI releases that matter.

source & further reading

dev.to — original article Why Cursor Writes IDOR Into Your API Routes (CWE-639) Building AI Agents with the Kotlin Agent Development Kit (ADK) Active Working Memory: The RAM of Agentic Systems

~/api · this article 200

$curl api.wpnews.pro/v1/news/diffusiongemma-how-googl…

Read original on dev.to → dev.to/sayed_ali_alkamel/diffusiongemma-how-goog…

mentioned entities

Google DeepMind

DiffusionGemma

H100

RTX 5090

Apache 2.0

metadata

slugdiffusiongemma-how-google-s-new-open-llm-hits-1000-tokens-sec-and-changes

topic#large-language-models

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevAsk HN: What happens when AI-voi…

next →ChatGPT Was 76% of the AI Market…

── more in #large-language-models 4 stories · sorted by recency

promptcube3.com · 28 Jul · #large-language-models

AI Chip Stocks: Analyzing the Current Market Correction

openmodelmap.com · 27 Jul · #large-language-models

Kimi K3 Hardware Requirements

byteiota.com · 27 Jul · #large-language-models

Kimi K3 Open Weights Live: Self-Host or Use the API?

cryptobriefing.com · 25 Jul · #large-language-models

Google backs open source models in Silicon Valley’s stance against Anthropic

── more on @google deepmind 3 stories trending now

wpnews · 26 Jul · #artificial-intelligence

Nobel laureate Simon Johnson on the AI race and China’s ‘over-automation’ problem

wpnews · 26 Jul · #artificial-intelligence

China’s Moonshot, Z.AI, and DeepSeek are challenging U.S. AI labs—and beating them on cost

wpnews · 26 Jul · #ai-safety

University of Washington study reveals prompt injection risks lurking in AI agent memory

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required