# Gemma 4 on Cerebras - The Fastest Inference Is Now Multimodal

> Source: <https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest-inference-is-now-multimodal>
> Published: 2026-06-30 06:04:53+00:00

**Gemma 4 on Cerebras—The Fastest Inference is Now Multimodal**

Gemma 4 31B is now running at over 1,800 tokens per second on Cerebras Inference. This multimodal model unlocks an entirely new class of applications, from computer use to image-driven agentic workflows.

As the leader in fast inference, Cerebras has set benchmarks across numerous open-weight models including Kimi, GLM, GPT-OSS, and Qwen. Gemma 4 is the first Google DeepMind model we have brought to the platform, and the first to let developers feed images—screenshots, documents, charts, UI states—into a model running at wafer-scale speed. The result: visual and agentic loops that once felt sluggish on GPUs become fast and responsive.

"Gemma 4, Google DeepMind's family of open models, was built to bring advanced reasoning and multimodal capabilities at developer-friendly sizes. Pairing these capabilities with Cerebras's wafer-scale technology provides developers with an exciting platform for running extremely fast visual and agentic workflows, and we're delighted to see Gemma's performance showcased across the hardware ecosystem."

—Olivier Lacombe, Product Lead, Gemma

The Fastest Multimodal Model

Cerebras runs Gemma 4 31B at a record 1,851 output tokens per second as measured by Artificial Analysis—35x the speed of a typical GPU endpoint. Cerebras speed also translates into world class latency— Gemma 4 on Cerebras returns its first answer token inclusive of reasoning in 1.5 seconds, making Cerebras the only provider that lets Gemma 4 be used in real-time settings.

Gemma 4 31B is comparable to Claude Haiku 4.5 in intelligence, scoring 29 and 30 respectively in the Artificial Analysis Intelligence Index. The key difference is that Gemma 4 is open-weight under Apache 2.0, and on Cerebras it runs 18x faster than Haiku.

We recommend Gemma 4 as the reference medium-size model on Cerebras: if you are looking for an alternative to Haiku, GPT-OSS, or Llama, it provides equal or higher intelligence at Cerebras speed.

Speed compounds in exactly the workloads Gemma 4 is built for. Multimodal and agentic loops rarely call a model once: they inspect a visual input, reason over it, produce structured output, call tools, check the result, and try again. At conventional speeds those loops are too slow to provide real-time input. At over 1,800 TPS, the application and user work in lockstep. Front-end iteration feels near-instant, document and screenshot workflows return in a fraction of the time, and developers can fit more verification and more retries into the same product.

This is what changes the products you can build, not just their latency. "If every model was doing 2,000 tokens per second, you would probably build different products. You wouldn't build the same product and just have it be faster," said Logan Kilpatrick of Google DeepMind.

**The Smartest Gemma 4 Model**

Gemma 4 31B is the flagship and most capable model of Google DeepMind's open-weight Gemma family—a dense, multimodal model built for quality and efficiency rather than raw parameter count. Dense models achieve high model intelligence without the large memory footprint of MoE models. Gemma 4 hits a sweet spot: strong enough for serious work, efficient to serve, and open enough to build around without vendor lock-in.

Gemma 4 is the first model on Cerebras to support image understanding. It enables workflows combining text with images—screenshots, charts, UI states, scanned pages, forms, diagrams. It also unlocks computer use and robotics applications.

Bringing vision to wafer-scale hardware is a milestone for the platform. Multimodal support starts with Gemma 4, and we will extend it to additional models going forward. The combination of image understanding and wafer-scale speed is what unlocks new product experiences: a model that can see a dashboard, reason over it, return structured output, and act on it fast enough to keep a human or an agent in the loop.

**Examples Include:**

**Screenshot to Insight.** Feed themodel a dense dashboard screenshot or document page and watch it identify what matters, explain the finding, and return structured output—in real time rather than after a wait.**Long-context summarization.** Hand it a research report or technical brief and get a crisp, decision-ready summary back fast enough to read, react, and re-query in a single sitting.**Screenshot to Patch.** Play to medium-model strengths—hand it a broken UI screenshot, the source, and the console error, and get back a minimal patch and the checks to verify it.

**Available Now**

Gemma 4 31B is available today on the [Cerebras Inference Cloud](https://cloud.cerebras.ai/?utm_source=homepage) in public preview for a limited time. If your workload requires multimodal reasoning, fast document processing, or realtime audio and video, we would love to hear from you.