Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation

Google's Gemma 4 large language model can now be paired with multi-token prediction (MTP) drafters to generate multiple tokens in parallel, achieving up to three times faster inference without quality loss. The lightweight drafters address memory-bandwidth bottlenecks by predicting several future tokens at once, allowing the primary model to verify them in a single pass, which improves responsiveness on personal computers, consumer GPUs, and mobile devices. The MTP-enabled Gemma 4 variants are available on platforms including Hugging Face, Kaggle, and Ollama.

Gemma 4 can be paired with multi-token prediction (MTP) drafters (https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/) that use speculative decoding (https://arxiv.org/abs/2211.17192) to generate multiple tokens in parallel, allowing the model to verify them in a single pass and achieve up to ~3× faster inference without quality loss.

Multi-token prediction drafters are lightweight auxiliary models that work alongside Gemma 4 to address the LLM memory-bandwidth bottleneck. As Google engineers explain, during inference the processor spends most of its time repeatedly moving billions of parameters from VRAM to the compute units for each token. This constant data movement increases latency and leaves compute resources underutilized, particularly on consumer hardware.

The inefficiency is amplified by the fact that LLMs spend the same amount of computation to predict "obvious computations" as to solve a "complex logic puzzle", which is where multi-token prediction drafters can help:

"By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to 'predict' several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel."
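The draft-then-verify loop described above can be illustrated with a toy sketch. It is not Google's implementation: the two "models" are hypothetical stand-in functions over integer tokens, the function names are invented for illustration, and the acceptance rule is the simple greedy variant (keep drafted tokens while they match the target's own choice). Production systems, and the speculative decoding paper linked above, use a probabilistic acceptance rule over sampling distributions and verify all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_generate(target_next: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    n: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, simulated target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. The target scores every drafted position. A real system does this
        #    in ONE batched forward pass; here it is simulated with a loop and
        #    counted as a single pass.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(draft_len + 1)]
        # 3. Accept the longest prefix on which drafter and target agree, then
        #    append the target's own token (a correction or a bonus token).
        accepted = 0
        while accepted < draft_len and draft[accepted] == verified[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted])
        tokens.append(verified[accepted])
    return tokens[:goal], target_passes


# Hypothetical toy models: the drafter agrees with the target except when the
# last token is divisible by 4.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 11


def draft_next(ctx: List[Token]) -> Token:
    guess = (ctx[-1] * 3 + 1) % 11
    return guess if ctx[-1] % 4 != 0 else (guess + 1) % 11


baseline = greedy_generate(target_next, [1], 12)
out, passes = speculative_generate(target_next, draft_next, [1], 12)
print("same output as baseline:", out == baseline)
print("target passes:", passes, "vs baseline passes:", 12)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone; only the number of target passes changes, and it shrinks as the drafter's guesses become more accurate.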
Using multi-token prediction drafters, Google says, can improve responsiveness and enable faster inference across devices, with personal computers and consumer GPUs running the Gemma 4 26B MoE and 31B dense models, and mobile devices using the E2B and E4B variants, all without sacrificing response quality:

"Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster."

Google implemented a series of architectural enhancements and hardware-specific optimizations to ensure that MTP drafters deliver maximum efficiency, and provided an in-depth visual explanation of how the drafters work in an x.com thread (https://x.com/googlegemma/status/2051694045869879749).

Reddit commenter FarrisAT described Gemma 4 MTP as "pretty impressive stuff" (https://www.reddit.com/r/Bard/comments/1t4l74k/comment/ok3mbgb/), but cautioned that local models still make too many mistakes, suggesting the real benefits will emerge when "those models get closer to the leading edge". Another user, Gohab2001, noted that MTP itself is a well-known technique with a major drawback for local deployments (https://www.reddit.com/r/Bard/comments/1t4l74k/comment/ok7rugx/): two models have to be loaded into memory. They also pointed out that the real advancement in Gemma 4's MTP drafter implementation is that the drafters share the target model's KV cache, which helps reduce the technique's overhead.

On Hacker News, zozbot234 notes that "MTP is mostly useful when you have one or a few users, which means compute is abundant", as in mobile or edge scenarios, while it offers limited benefits for large-scale API providers.

Gemma 4 MTP-enabled variants (https://huggingface.co/collections/google/gemma-4) are available on several platforms, including Hugging Face, Kaggle, Ollama, and others.