{"slug": "faster-gemma-4-on-mlx-with-multi-token-prediction", "title": "Faster Gemma 4 on MLX with multi-token prediction", "summary": "Gemma 4 in Ollama 0.31 generates tokens nearly 90% faster on Apple Silicon using multi-token prediction (MTP), which employs a small draft model to propose multiple tokens that the main model verifies in a single pass. The speedup is automatic and requires no configuration, with Ollama dynamically tuning the draft length for optimal performance.", "body_md": "# Faster Gemma 4 on MLX with multi-token prediction\n\n## June 29, 2026\n\nGemma 4 is now significantly faster in Ollama 0.31. On Apple Silicon, it generates tokens **nearly 90% faster** on average across a coding-agent benchmark. The speedup is on by default, and it does not change the model’s output.\n\nThe speedup comes from multi-token prediction (MTP). Gemma 4 ships with a small, fast draft model that runs alongside the main model and proposes the next several tokens. The main model then verifies that proposal in a single pass and keeps the tokens it agrees with. Because the draft model is a small fraction of the main model’s size, its proposals are inexpensive, and when they are correct the model commits several tokens for the cost of one.\n\nCode is especially predictable. It is full of closing brackets, repeated identifiers, and boilerplate, so the draft model’s proposals are accepted often. This matters most for coding agents, which call the model continuously as they read files, run tools, and work through a task. Faster generation makes those agents noticeably more responsive.\n\nAchieving this reliably is the difficult part. The ideal number of tokens to draft changes from one moment to the next, and drafting too many can make MTP slower than not speculating at all. 
Ollama tunes this automatically as the model runs, so the speedup requires no configuration.\n\nWe measured this on the Aider polyglot benchmark, which runs a real coding agent through real programming tasks. The benefit from MTP depends heavily on the workload, and a synthetic benchmark can be made to show almost any result. These numbers reflect what to expect in practice.\n\n## How it works\n\nThree changes work together: how the draft length is chosen, how the engine runs each round, and how the GPU handles the work.\n\n### Auto-tuning the draft length\n\nThere is no single best number of tokens to draft before verifying. It depends on the model, how it is quantized, the hardware, and how predictable the text is at any given moment. A value that works well on one setup can be wrong on another. Drafting too few leaves performance on the table. Drafting too many spends more time checking rejected proposals than it saves, and MTP ends up slower than plain decoding.\n\nOllama determines the draft length at runtime. As it generates, it tracks how often proposals are accepted and how long each verification pass takes, then selects the length that produces the most tokens per second. It continues adjusting as the text changes, and when proposals stop being accepted it returns to plain one-at-a-time decoding. Speculation therefore does not slow generation down when it stops helping.\n\n### Speculative decoding in the engine\n\nEach round begins with the draft model. It predicts a token, feeds that token back in to predict the next, and repeats until it has a short run of proposals. The main model then verifies the entire run at once, sampling at each position to determine which proposals are accepted. All of this runs on the GPU as a single pass: drafting, sampling, verification, and the sampling that follows, with no return to the CPU in between.\n\nAccepted tokens are kept. 
The rejected ones are more involved, because by the time they are rejected they have already written into the cache, the running state the model reuses to avoid recomputing earlier tokens. Undoing them is inexpensive. The engine records a rollback point before each proposal, and a rejection rewinds to the last accepted token. Nothing earlier is touched or recomputed.\n\n### A faster way to verify a batch\n\nMost of the cost is in verification, not drafting. The draft model is small, so proposing tokens is inexpensive. Verification runs the full model over the entire batch of proposals at once, and that batch is an awkward size, usually 2 to 8 tokens. Matrix multiplication kernels are typically built for either a single token (decode) or a large batch (prefill), and a handful of draft tokens falls between the two.\n\nWe contributed a kernel for this case to MLX, where other models can use it as well, not only Gemma 4 in Ollama. It reads and unpacks each block of weights once and reuses it across the entire batch, rather than re-reading the weights for every token. On an M5 Max with nvfp4, this makes Gemma 4’s largest matrix multiplications **2× to 2.5× faster**. 
The computation is identical; the speedup comes from removing redundant work.\n\n## Get started\n\nDownload Ollama 0.31 or later for macOS.\n\nThen use `ollama launch` to launch a [coding agent](https://docs.ollama.com/integrations#code-in-the-terminal) powered by Gemma 4:\n\n```\nollama launch claude --model gemma4:12b-mlx\n```\n\nNote: If you downloaded Gemma 4 earlier, re-pull it to get the version with MTP using `ollama pull gemma4:12b-mlx`.\n\n`ollama launch` also works with Codex, Droid, OpenCode, Copilot, and others.\n\nGemma 4 is the first model to receive this performance improvement, with more to follow.", "url": "https://wpnews.pro/news/faster-gemma-4-on-mlx-with-multi-token-prediction", "canonical_source": "https://ollama.com/blog/faster-gemma-4-mlx-mtp", "published_at": "2026-07-01 00:22:57+00:00", "updated_at": "2026-07-01 00:49:23.800614+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["Gemma 4", "Ollama", "Apple Silicon", "Aider"], "alternates": {"html": "https://wpnews.pro/news/faster-gemma-4-on-mlx-with-multi-token-prediction", "markdown": "https://wpnews.pro/news/faster-gemma-4-on-mlx-with-multi-token-prediction.md", "text": "https://wpnews.pro/news/faster-gemma-4-on-mlx-with-multi-token-prediction.txt", "jsonld": "https://wpnews.pro/news/faster-gemma-4-on-mlx-with-multi-token-prediction.jsonld"}}