Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation


Source: developer.nvidia.com · Published: 2026-06-10T16:16Z


Google DeepMind's DiffusionGemma, optimized for NVIDIA platforms, generates text tokens in parallel rather than sequentially, achieving up to 1,000 tokens per second on a single NVIDIA H100 GPU. The model, built on the Gemma 4 26B A4B MoE architecture, enables enterprise developers to reduce serving costs and increase concurrency for real-time AI applications like chat assistants and agentic workflows. DiffusionGemma is available for prototyping via Hugging Face Transformers and NVIDIA's build.nvidia.com, with production deployment supported through NVIDIA NIM containers.


Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This limits responsiveness, increases serving costs, and makes fluid, interactive experiences difficult to achieve.

DiffusionGemma, created by Google DeepMind and optimized to run efficiently across NVIDIA platforms, introduces a new approach to text generation, producing tokens in parallel rather than one at a time, enabling faster, higher-throughput AI applications. The model uses diffusion-based denoising to generate 256 tokens in parallel per step, delivering up to 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, up to 150 tokens/sec on NVIDIA DGX Spark, and up to 2,000 tokens/sec on NVIDIA DGX Station.
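To put these throughput figures in perspective, the short sketch below converts them into a best-case time to produce a 256-token response. It is back-of-the-envelope arithmetic only: the `PEAK_TOKENS_PER_SEC` table and the `seconds_for` helper are illustrative names, the numbers are the "up to" values quoted above, and real latency will likely vary with batch size, prompt length, and the number of denoising steps.

```python
# Peak throughput figures quoted in the article ("up to" values, tokens/sec).
PEAK_TOKENS_PER_SEC = {
    "NVIDIA H100": 1000,
    "NVIDIA DGX Spark": 150,
    "NVIDIA DGX Station": 2000,
}


def seconds_for(num_tokens: int, tokens_per_sec: float) -> float:
    """Best-case wall-clock time to emit num_tokens at a given throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    # 256 matches the number of tokens the article says are generated per step.
    for platform, tps in PEAK_TOKENS_PER_SEC.items():
        print(f"{platform}: {seconds_for(256, tps):.3f} s for 256 tokens")
```

Because the inputs are peak numbers, the results are lower bounds on response time rather than expected latencies.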

For enterprise developers, this speed translates into lower serving costs, higher concurrency, and more responsive user experiences without sacrificing model quality. DiffusionGemma is built on the Gemma 4 26B A4B MoE architecture and optimized for low-latency, memory-bound inference.

Table 1. Overview of DiffusionGemma, summarizing modalities, parameter sizes, and supported context length

In addition to NVIDIA data center GPUs, developers can enjoy optimal performance on a variety of client GPUs and systems.

Platform	Best For	Key highlights	Getting started
NVIDIA DGX Spark	Personal AI supercomputer for local AI development, autonomous agents, AI research, and prototyping	NVIDIA GB10 Grace Blackwell Superchip, 128 GB unified memory, 1 PFLOP of FP4 AI compute, and a preinstalled NVIDIA AI software stack for fully local OpenClaw workflows	
NVIDIA DGX Station			DGX Station playbooks; vLLM on DGX Station guide
NVIDIA RTX + NVIDIA RTX PRO			RTX blog; vLLM on RTX guide

Table 2. Comparison of local deployment options across NVIDIA platforms, highlighting primary use cases, key capabilities, and recommended getting-started resources for DGX Spark, DGX Station, and RTX + RTX PRO systems

Build and prototype on NVIDIA

Access DiffusionGemma through Hugging Face Transformers for initial testing and prototyping on NVIDIA GeForce RTX 5090 or DGX Spark. For higher throughput or concurrent multi-user serving on DGX Spark, DGX Station, and RTX PRO, use vLLM by following our playbooks in Table 2.

With Day 0 support across NVIDIA hardware and software—from local prototyping to production deployment—developers can quickly move from experimentation to real-world applications.

NVIDIA GPU-accelerated endpoints

Start building with DiffusionGemma using free prototyping access to GPU-accelerated endpoints on build.nvidia.com, available as part of the NVIDIA Developer Program. The browser experience can also be connected to custom data sources.

BF16 and NVFP4

The model is available today on Hugging Face as BF16 checkpoints. An NVFP4-quantized checkpoint, produced with NVIDIA Model Optimizer, is also available.
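As a rough guide to what the two formats mean for memory, the sketch below estimates weight storage for a 26B-parameter model. Several things here are assumptions rather than facts from the article: that "26B" is the total parameter count, that every weight is quantized (real quantized checkpoints often keep some layers in higher precision), and that NVFP4 costs about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block). `weight_gigabytes` is an illustrative helper name.

```python
# Rough weights-only memory estimate; all inputs are assumptions, see lead-in.
TOTAL_PARAMS = 26e9  # "26B" from the model name, treated as total parameters

# Approximate bits stored per weight in each format.
# NVFP4: 4-bit value + one 8-bit scale shared by 16 values = 4.5 bits.
BITS_PER_WEIGHT = {"BF16": 16.0, "NVFP4": 4.0 + 8.0 / 16.0}


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores activations, caches, and any layers left unquantized.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in BITS_PER_WEIGHT.items():
        gb = weight_gigabytes(TOTAL_PARAMS, bits)
        print(f"{fmt}: ~{gb:.1f} GB of weights")
```

Under these assumptions the quantized weights take a bit more than a quarter of the BF16 footprint, which matters most on memory-constrained client systems.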

Enterprise deployments with NVIDIA NIM

NVIDIA NIM makes it simple to deploy DiffusionGemma from development into production. NIM packages the model as an optimized, containerized inference microservice — with performance tuning, standardized APIs, and the flexibility to run on-premises, in the cloud, or across hybrid environments. NIM exposes a standard OpenAI-compatible API for sending inference requests to the server.

Download the container and start the NIM server.

# Requires NGC_API_KEY and LOCAL_NIM_CACHE to be set in the environment.
$ export NIM_IMAGE_PATH="nvcr.io/nim/google/diffusiongemma-26b-a4b-it:latest"
$ docker run --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  ${NIM_IMAGE_PATH}

Make a test request. For more detail, see the full NIM documentation.

from openai import OpenAI

# Point the OpenAI client at the local NIM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required"
)

# Send a single chat-completion request to the model.
response = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=[
        {"role": "user", "content": "Write a poem about text diffusion"}
    ],
    max_tokens=256
)
print(response.choices[0].message.content)
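Because the API is OpenAI-compatible, the same request can also be sketched with only the Python standard library, which may help where the `openai` package is not installed. This is a minimal illustration, not part of the NIM documentation: `build_chat_request` is a hypothetical helper, and the `/chat/completions` route is assumed from the OpenAI API convention under the `base_url` shown above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8000/v1",
        "google/diffusiongemma-26b-a4b-it",
        "Write a poem about text diffusion",
    )
    # Sending only works while a server is listening on localhost:8000.
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.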

Day 0 fine-tuning with NVIDIA NeMo AutoModel

For developers looking to adapt the model to specific tasks or domains, fine-tuning guides and recipes are available through the NVIDIA NeMo AutoModel library, part of the NVIDIA NeMo Framework. NeMo AutoModel lets users fine-tune models (LLMs, VLMs, and DiffusionLMs) directly on top of Hugging Face checkpoints without conversion, so they can begin experimenting quickly with the latest frontier models.

NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is committed to open models such as DiffusionGemma that promote AI transparency and enable users to share their work in AI safety and resilience.

Check out DiffusionGemma on Hugging Face or test for free using NVIDIA APIs at build.nvidia.com.
