June 29, 2026
Gemma 4 is now significantly faster in Ollama 0.31. On Apple Silicon, it generates tokens nearly 90% faster on average across a coding-agent benchmark. The speedup is on by default, and it does not change the model’s output.
The speedup comes from multi-token prediction (MTP). Gemma 4 ships with a small, fast draft model that runs alongside the main model and proposes the next several tokens. The main model then verifies that proposal in a single pass and keeps the tokens it agrees with. Because the draft model is a small fraction of the main model’s size, its proposals are inexpensive, and when they are correct the model commits several tokens for the cost of one.
Code is especially predictable. It is full of closing brackets, repeated identifiers, and boilerplate, so the draft model’s proposals are accepted often. This matters most for coding agents, which call the model continuously as they read files, run tools, and work through a task. Faster generation makes those agents noticeably more responsive.
Achieving this reliably is the difficult part. The ideal number of tokens to draft changes from one moment to the next, and drafting too many can make MTP slower than not speculating at all. Ollama tunes this automatically as the model runs, so the speedup requires no configuration.
We measured this on the Aider polyglot benchmark, which runs a real coding agent through real programming tasks. The benefit from MTP depends heavily on the workload, and a synthetic benchmark can be made to show almost any result. These numbers reflect what to expect in practice.
How it works
Three changes work together: how the draft length is chosen, how the engine runs each round, and how the GPU handles the work.
Auto-tuning the draft length
There is no single best number of tokens to draft before verifying. It depends on the model, how it is quantized, the hardware, and how predictable the text is at any given moment. A value that works well on one setup can be wrong on another. Drafting too few leaves performance on the table. Drafting too many spends more time checking rejected proposals than it saves, and MTP ends up slower than plain decoding.
Ollama determines the draft length at runtime. As it generates, it tracks how often proposals are accepted and how long each verification pass takes, then selects the length that produces the most tokens per second. It continues adjusting as the text changes, and when proposals stop being accepted it returns to plain one-at-a-time decoding. Speculation therefore does not slow generation down when it stops helping.
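The trade-off above can be sketched with a deliberately simple throughput model. This is an illustration, not Ollama's tuner: it assumes each draft token is accepted independently with probability p, that the main model always contributes one token of its own per round, and that the caller supplies measured timings. The names expected_tokens, best_draft_length, draft_ms, and verify_ms are hypothetical.

```python
from typing import List, Tuple


def expected_tokens(p: float, k: int) -> float:
    """Expected tokens committed per round when drafting k tokens.

    Assumption: each draft token is accepted independently with probability p,
    and the main model always adds one token of its own (k = 0 is plain decoding).
    """
    if p >= 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def best_draft_length(
    p: float, draft_ms: float, verify_ms: List[float]
) -> Tuple[int, float]:
    """Return (k, tokens_per_second) for the draft length with the best throughput.

    verify_ms[k] is the measured time of one main-model pass when k tokens
    were drafted; verify_ms[0] is the time of a plain single-token step.
    """
    best_k, best_tps = 0, 0.0
    for k in range(len(verify_ms)):
        round_ms = k * draft_ms + verify_ms[k]
        tps = 1000.0 * expected_tokens(p, k) / round_ms
        if tps > best_tps:
            best_k, best_tps = k, tps
    return best_k, best_tps
```

Under this model, an acceptance rate near zero makes k = 0 the best choice, which corresponds to falling back to plain decoding, while a high acceptance rate favors longer drafts. A runtime tuner would refresh p and the timings from recent rounds rather than treat them as constants.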
Speculative decoding in the engine
Each round begins with the draft model. It predicts a token, feeds that token back in to predict the next, and repeats until it has a short run of proposals. The main model then verifies the entire run at once, sampling at each position to determine which proposals are accepted. All of this runs on the GPU as a single pass: drafting, sampling, verification, and the sampling that follows, with no return to the CPU in between.
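One round of this loop might look like the following sketch. It assumes greedy (deterministic) token selection so that acceptance reduces to an equality check, and it models both networks as plain functions from a token list to the next token id. The engine described above verifies all positions in one batched GPU pass; the Python loop stands in for that pass only for readability. The names speculative_round, draft, and main are hypothetical.

```python
from typing import Callable, List

# Hypothetical model interface: context token ids -> next token id.
NextToken = Callable[[List[int]], int]


def speculative_round(
    context: List[int], draft: NextToken, main: NextToken, k: int
) -> List[int]:
    """Run one draft-then-verify round and return the tokens committed."""
    # Draft: the small model proposes k tokens, feeding each one back in.
    proposals: List[int] = []
    for _ in range(k):
        proposals.append(draft(context + proposals))

    # Verify: the main model chooses its own token at every position.
    committed: List[int] = []
    for i in range(k + 1):
        target = main(context + proposals[:i])
        committed.append(target)
        if i == k or proposals[i] != target:
            # Either every proposal was accepted (the extra token is a bonus),
            # or this is the first disagreement: keep the main model's token.
            break
    return committed
```

Every committed token is one the main model would have chosen anyway, which is why this scheme leaves the output unchanged under greedy selection. A round commits between 1 and k + 1 tokens.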
Accepted tokens are kept. Rejected tokens are more involved, because by the time they are rejected they have already been written into the cache, the running state the model reuses to avoid recomputing earlier tokens. Even so, undoing them is inexpensive: the engine records a rollback point before each proposal, and a rejection rewinds to the last accepted token. Nothing earlier is touched or recomputed.
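The rollback idea can be illustrated with a toy cache. This is a minimal sketch under strong assumptions: the cache is modeled as a flat Python list with one entry per token, whereas a real cache holds per-layer tensors and rewinding it may involve more than truncation. RewindableCache and write_proposals are hypothetical names.

```python
from typing import Any, List


class RewindableCache:
    """Toy append-only cache with rollback points (illustration only)."""

    def __init__(self) -> None:
        self.entries: List[Any] = []

    def checkpoint(self) -> int:
        # A rollback point is just the current length.
        return len(self.entries)

    def append(self, entry: Any) -> None:
        self.entries.append(entry)

    def rewind(self, mark: int) -> None:
        # Drop everything written after the rollback point; earlier
        # entries are left untouched.
        del self.entries[mark:]


def write_proposals(cache: RewindableCache, proposals: List[Any], accepted: int) -> None:
    """Write all proposals, then undo the ones after the first `accepted`."""
    marks = []
    for token in proposals:
        marks.append(cache.checkpoint())  # rollback point before each proposal
        cache.append(token)
    if accepted < len(proposals):
        cache.rewind(marks[accepted])
```

Because each rollback point is recorded before its proposal is written, rewinding to the mark of the first rejected proposal leaves exactly the accepted tokens in place.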
A faster way to verify a batch
Most of the cost is in verification, not drafting. The draft model is small, so proposing tokens is inexpensive. Verification runs the full model over the entire batch of proposals at once, and that batch is an awkward size, usually 2 to 8 tokens. Matrix multiplication kernels are typically built for either a single token (decode) or a large batch (prefill), and a handful of draft tokens falls between the two.
We contributed a kernel for this case to MLX, where other models can use it as well, not only Gemma 4 in Ollama. It reads and unpacks each block of weights once and reuses it across the entire batch, rather than re-reading the weights for every token. On an M5 Max with nvfp4, this makes Gemma 4’s largest matrix multiplications 2× to 2.5× faster. The computation is identical; the speedup comes from removing redundant work.
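The difference between the two access patterns can be shown in plain Python. This is not the MLX kernel and does not model nvfp4: the block format (a list of integers with one scale per block, each block acting as one row of the weight matrix) and the function names are assumptions chosen to keep the example small. It only counts how often a block of weights is unpacked.

```python
from typing import List, Tuple

Block = List[int]      # hypothetical quantized block: one row of weights
Vector = List[float]   # one token's activations


def unpack(block: Block, scale: float) -> Vector:
    """Dequantize one block of integer weights (toy format)."""
    return [q * scale for q in block]


def matmul_per_token(
    blocks: List[Block], scales: List[float], batch: List[Vector]
) -> Tuple[List[Vector], int]:
    """Re-read and unpack every block for each token in the batch."""
    unpacks = 0
    outputs: List[Vector] = []
    for x in batch:
        row: Vector = []
        for block, scale in zip(blocks, scales):
            w = unpack(block, scale)
            unpacks += 1
            row.append(sum(wi * xi for wi, xi in zip(w, x)))
        outputs.append(row)
    return outputs, unpacks


def matmul_block_reuse(
    blocks: List[Block], scales: List[float], batch: List[Vector]
) -> Tuple[List[Vector], int]:
    """Unpack each block once and reuse it for every token in the batch."""
    unpacks = 0
    outputs: List[Vector] = [[] for _ in batch]
    for block, scale in zip(blocks, scales):
        w = unpack(block, scale)
        unpacks += 1
        for out, x in zip(outputs, batch):
            out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return outputs, unpacks
```

Both functions return the same outputs. The first performs one unpack per block per token, the second one unpack per block, so the saving grows with the number of draft tokens being verified.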
Get started
Download Ollama 0.31 or later for macOS, then use ollama launch to launch a coding agent powered by Gemma 4:
ollama launch claude --model gemma4:12b-mlx
Note: If you downloaded Gemma 4 earlier, re-pull it to get the version with MTP:
ollama pull gemma4:12b-mlx
ollama launch also works with Codex, Droid, OpenCode, Copilot, and others.
Gemma 4 is the first model to receive this performance improvement, with more to follow.