Faster Gemma 4 on MLX with multi-token prediction


Source: ollama.com. Published 2026-07-01T00:22Z.


Gemma 4 in Ollama 0.31 generates tokens nearly 90% faster on Apple Silicon using multi-token prediction (MTP), which employs a small draft model to propose multiple tokens that the main model verifies in a single pass. The speedup is automatic and requires no configuration, with Ollama dynamically tuning the draft length for optimal performance.


June 29, 2026

Gemma 4 is now significantly faster in Ollama 0.31. On Apple Silicon, it generates tokens nearly 90% faster on average across a coding-agent benchmark. The speedup is on by default, and it does not change the model’s output.

The speedup comes from multi-token prediction (MTP). Gemma 4 ships with a small, fast draft model that runs alongside the main model and proposes the next several tokens. The main model then verifies that proposal in a single pass and keeps the tokens it agrees with. Because the draft model is a small fraction of the main model’s size, its proposals are inexpensive, and when they are correct the model commits several tokens for the cost of one.

Code is especially predictable. It is full of closing brackets, repeated identifiers, and boilerplate, so the draft model’s proposals are accepted often. This matters most for coding agents, which call the model continuously as they read files, run tools, and work through a task. Faster generation makes those agents noticeably more responsive.

Achieving this reliably is the difficult part. The ideal number of tokens to draft changes from one moment to the next, and drafting too many can make MTP slower than not speculating at all. Ollama tunes this automatically as the model runs, so the speedup requires no configuration.

We measured this on the Aider polyglot benchmark, which runs a real coding agent through real programming tasks. The benefit from MTP depends heavily on the workload, and a synthetic benchmark can be made to show almost any result. These numbers reflect what to expect in practice.

How it works

Three changes work together: how the draft length is chosen, how the engine runs each round, and how the GPU handles the work.

Auto-tuning the draft length

There is no single best number of tokens to draft before verifying. It depends on the model, how it is quantized, the hardware, and how predictable the text is at any given moment. A value that works well on one setup can be wrong on another. Drafting too few leaves performance on the table. Drafting too many spends more time checking rejected proposals than it saves, and MTP ends up slower than plain decoding.
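
The trade-off above can be made concrete with a rough cost model. The sketch below is not Ollama's method: it assumes every drafted token is accepted independently with the same probability p, and the millisecond costs are invented illustrative values, not measurements. Under those assumptions, a round that drafts k tokens commits (1 - p^(k+1)) / (1 - p) tokens on average, counting the one token the main model always contributes.

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens committed per round when each of k drafted tokens is
    accepted independently with probability p (assumption) and the round
    stops at the first rejection. Includes the main model's own token."""
    if p >= 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def tokens_per_second(p: float, k: int,
                      draft_ms: float = 1.0,            # illustrative, not measured
                      verify_base_ms: float = 20.0,     # illustrative, not measured
                      verify_per_token_ms: float = 1.5  # illustrative, not measured
                      ) -> float:
    """Throughput of rounds that draft k tokens and verify k + 1 positions."""
    round_ms = k * draft_ms + verify_base_ms + verify_per_token_ms * (k + 1)
    return 1000.0 * expected_tokens(p, k) / round_ms


for p in (0.9, 0.3):
    best = max(range(0, 9), key=lambda k: tokens_per_second(p, k))
    print(f"acceptance {p}: best draft length {best}, "
          f"plain decoding {tokens_per_second(p, 0):.1f} tok/s, "
          f"drafting 8 gives {tokens_per_second(p, 8):.1f} tok/s")
```

With these made-up costs, predictable text (p = 0.9) favours the longest draft tried, while unpredictable text (p = 0.3) favours a single drafted token, and drafting 8 is slower than not drafting at all.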

Ollama determines the draft length at runtime. As it generates, it tracks how often proposals are accepted and how long each verification pass takes, then selects the length that produces the most tokens per second. It continues adjusting as the text changes, and when proposals stop being accepted it returns to plain one-at-a-time decoding. Speculation therefore does not slow generation down when it stops helping.
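
A minimal sketch of this kind of runtime tuning is shown below. It is an illustration only: the class name `DraftLengthTuner`, its methods, and the decay factor are hypothetical and are not taken from Ollama's code. It keeps a decayed tally of tokens committed and time spent for each draft length, treats length 0 as plain one-at-a-time decoding, and picks the length with the best measured tokens per second.

```python
from collections import defaultdict


class DraftLengthTuner:
    """Hypothetical tuner: choose the draft length with the best measured throughput."""

    def __init__(self, max_len: int = 8, decay: float = 0.9) -> None:
        self.candidates = range(0, max_len + 1)  # 0 means plain decoding
        self.decay = decay
        self.tokens = defaultdict(float)   # decayed tokens committed, per length
        self.seconds = defaultdict(float)  # decayed time spent, per length

    def record(self, length: int, committed: int, elapsed: float) -> None:
        # Older rounds fade out so the estimate follows changes in the text.
        self.tokens[length] = self.decay * self.tokens[length] + committed
        self.seconds[length] = self.decay * self.seconds[length] + elapsed

    def throughput(self, length: int) -> float:
        spent = self.seconds[length]
        # Untried lengths score infinity so each one gets measured at least once.
        return self.tokens[length] / spent if spent > 0 else float("inf")

    def choose(self) -> int:
        return max(self.candidates, key=self.throughput)


tuner = DraftLengthTuner(max_len=2)
tuner.record(0, committed=1, elapsed=0.010)  # plain decoding
tuner.record(1, committed=2, elapsed=0.012)  # one draft, accepted
tuner.record(2, committed=2, elapsed=0.025)  # two drafts, one accepted
print(tuner.choose())

for _ in range(10):  # the text turns unpredictable: drafts keep getting rejected
    tuner.record(1, committed=1, elapsed=0.012)
print(tuner.choose())
```

In this toy run the tuner first settles on drafting one token, then falls back to length 0 once rejected drafts drag the measured throughput below plain decoding. A production tuner would also need to re-measure the lengths it is not currently using, which this sketch omits.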

Speculative decoding in the engine

Each round begins with the draft model. It predicts a token, feeds that token back in to predict the next, and repeats until it has a short run of proposals. The main model then verifies the entire run at once, sampling at each position to determine which proposals are accepted. All of this runs on the GPU as a single pass: drafting, sampling, verification, and the sampling that follows, with no return to the CPU in between.
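
The round described above can be sketched on the CPU with toy stand-ins for the two models. Several simplifications are assumptions of this sketch rather than details from the engine: tokens are chosen greedily instead of sampled, the "models" are plain functions over integer tokens, and verification is a loop where the real engine uses one batched pass on the GPU. All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_round(main: NextToken, draft: NextToken,
                      prefix: List[int], k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify round."""
    # Draft: predict a token, feed it back in, repeat k times.
    ctx = list(prefix)
    proposals = []
    for _ in range(k):
        token = draft(ctx)
        proposals.append(token)
        ctx.append(token)

    # Verify: compare each proposal with the main model's own choice.
    ctx = list(prefix)
    committed = []
    for token in proposals:
        choice = main(ctx)
        committed.append(choice)
        ctx.append(choice)
        if choice != token:
            return committed  # first disagreement: keep the main model's token, stop
    committed.append(main(ctx))  # all proposals accepted: the main model adds one more
    return committed


def generate(main: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_round(main, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def main_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy rule: count upward modulo 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10  # wrong after a 4


print(generate(main_model, draft_model, [0], 12, k=3))
```

Because every committed token is the main model's own choice, the toy output matches what `main_model` would produce on its own; the draft only changes how many tokens are committed per round.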

Accepted tokens are kept. Rejected tokens take more care, because by the time they are rejected they have already been written into the cache, the running state the model reuses to avoid recomputing earlier tokens. Undoing them is still inexpensive: the engine records a rollback point before each proposal, and a rejection rewinds to the last accepted token. Nothing earlier is touched or recomputed.
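
The rollback idea can be illustrated with a list standing in for the cache. This is a sketch under assumptions: a real cache holds per-layer tensors rather than strings, and the `RollbackCache` class and its method names are hypothetical.

```python
from typing import List


class RollbackCache:
    """Toy append-only cache that can rewind to a recorded point."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def mark(self) -> int:
        return len(self.entries)  # a rollback point is just the current length

    def append(self, entry: str) -> None:
        self.entries.append(entry)

    def rewind(self, mark: int) -> None:
        del self.entries[mark:]  # drop everything written after the mark


cache = RollbackCache()
for token in ["def", " add", "("]:  # tokens committed in earlier rounds
    cache.append(token)

proposals = ["a", ",", " b", "]"]
marks = []
for token in proposals:
    marks.append(cache.mark())  # record a rollback point before each proposal
    cache.append(token)

accepted = 3  # suppose verification rejects the fourth proposal
if accepted < len(proposals):
    cache.rewind(marks[accepted])

print(cache.entries)
```

Only the entry written by the rejected proposal is discarded; everything before the rollback point stays in place.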

A faster way to verify a batch

Most of the cost is in verification, not drafting. The draft model is small, so proposing tokens is inexpensive. Verification runs the full model over the entire batch of proposals at once, and that batch is an awkward size, usually 2 to 8 tokens. Matrix multiplication kernels are typically built for either a single token (decode) or a large batch (prefill), and a handful of draft tokens falls between the two.

We contributed a kernel for this case to MLX, where other models can use it as well, not only Gemma 4 in Ollama. It reads and unpacks each block of weights once and reuses it across the entire batch, rather than re-reading the weights for every token. On an M5 Max with nvfp4, this makes Gemma 4’s largest matrix multiplications 2× to 2.5× faster. The computation is identical; the speedup comes from removing redundant work.
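
The reuse pattern can be shown with a pure-Python toy. This is not the MLX kernel: the packed format (a scale plus small integer codes per block) is a hypothetical stand-in for nvfp4, and the functions only count how often a block is unpacked. Both versions compute the same products; they differ in loop order.

```python
from typing import List, Tuple

PackedBlock = Tuple[float, List[int]]  # hypothetical format: (scale, integer codes)


def unpack(block: PackedBlock) -> List[float]:
    scale, codes = block
    return [scale * c for c in codes]


def dot(x: List[float], w: List[float]) -> float:
    return sum(a * b for a, b in zip(x, w))


def matmul_per_token(batch: List[List[float]], blocks: List[PackedBlock]):
    """Unpack every block again for every token in the batch."""
    unpacks = 0
    out = []
    for x in batch:
        row = []
        for block in blocks:
            w = unpack(block)
            unpacks += 1
            row.append(dot(x, w))
        out.append(row)
    return out, unpacks


def matmul_block_reuse(batch: List[List[float]], blocks: List[PackedBlock]):
    """Unpack each block once and reuse it for the whole batch."""
    unpacks = 0
    out = [[] for _ in batch]
    for block in blocks:
        w = unpack(block)
        unpacks += 1
        for i, x in enumerate(batch):
            out[i].append(dot(x, w))
    return out, unpacks


blocks = [(0.5, [1, -2, 3]), (0.25, [4, 0, -1]), (2.0, [-3, 1, 1])]
batch = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [-1.0, 0.5, 4.0]]

slow, slow_unpacks = matmul_per_token(batch, blocks)
fast, fast_unpacks = matmul_block_reuse(batch, blocks)
print(slow == fast, slow_unpacks, fast_unpacks)
```

For a batch of four tokens and three blocks, the per-token version unpacks 12 times and the reuse version 3 times, with identical results.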

Get started

Download Ollama 0.31 or later for macOS.

Then use ollama launch to launch a coding agent powered by Gemma 4:

ollama launch claude --model gemma4:12b-mlx

Note: If you downloaded Gemma 4 earlier, re-pull it to get the version with MTP:

ollama pull gemma4:12b-mlx

ollama launch also works with Codex, Droid, OpenCode, Copilot, and others.

Gemma 4 is the first model to receive this performance improvement, with more to follow.

Read original on ollama.com → ollama.com/blog/faster-gemma-4-mlx-mtp