{"slug": "gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency", "title": "Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency", "summary": "Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training (QAT) to reduce model memory footprint for local deployment on edge devices and consumer GPUs. The QAT process minimizes quality loss during compression, with a specialized mobile format reducing the Gemma 4 E2B model's memory requirement to 1GB. The release aims to enable efficient on-device AI inference for laptops and smartphones while preserving model capabilities.", "body_md": "# Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency\n\nSince releasing [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) two months ago, we've been continuously working to expand its capabilities. First, we introduced [Multi-Token Prediction](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/) (MTP) to accelerate inference, and just a couple of days ago, we released [a 12B model](https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/) to bridge the gap between our E4B and 26B MoE models.\n\nToday, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.\n\nBy simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1 GB. 
Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.\n\n**Keeping quality while making models smaller**\n\nQuantization is a key technique for running models on consumer hardware: it reduces their memory footprint and also accelerates decode speed. However, standard Post-Training Quantization (PTQ) often leads to quality degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ already preserves much of the original quality, our QAT checkpoints achieve higher overall quality than standard PTQ baselines.\n\nWe applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a mobile-specialized quantization schema.\n\n**Saving on VRAM and Storage**\n\nBelow are the approximate memory requirements indicating how much VRAM is required to load the models:\n\n**Optimizing for mobile devices under the hood**\n\nStandard compression formats are often hard for mobile processors to run efficiently. To ensure Gemma 4 performs smoothly on mobile, we engineered a custom mobile-quantization schema designed for edge hardware:\n\n- **Static activations:** Normally, models waste processing power calculating how to scale data on the fly. We pre-calculate these settings during training, which reduces workload on mobile chips and makes responses faster.\n- **Channel-wise quantization:** We structured the compressed data to fit the design of mobile accelerators. This allows the phone to run calculations natively without needing slow workarounds.\n- **Targeted 2-bit quantization:** We heavily compressed (to 2-bit) the specific parts of the model that generate tokens, while keeping the core reasoning layers at higher precision. 
This saves storage without making the model less smart.\n- **Embedding and KV cache optimization:** We focused compression on the model’s vocabulary list (the embeddings) and its short-term memory (the KV cache). This drastically reduces the active memory footprint, letting you have long chats without running out of space.\n\nBecause our audio and vision encoders are not needed in many use cases, you can optimize your memory footprint even further by deploying only the modalities you need. For example, the Gemma 4 E2B text-only model (without Per-Layer Embeddings) requires less than 1 GB of memory.\n\n**Get started today**\n\nTo make these models easily usable with your preferred workflow, we’ve partnered with popular developer tools across the ecosystem to seamlessly support the Gemma 4 QAT checkpoints starting today:\n\n- **Download the weights:** Access the [Q4_0](https://huggingface.co/collections/google/gemma-4-qat-q4-0) and [mobile](https://huggingface.co/collections/google/gemma-4-qat-mobile) model weights right now on Hugging Face. We've tailored the formats to fit your workflow: GGUF formats are ready for use with llama.cpp, and compressed tensors are provided for vLLM. 
For everything else, we share unquantized checkpoints that can be converted and quantized into formats supporting Q4_0.\n- **Integrate & learn:** Explore our [documentation](https://ai.google.dev/gemma/docs/core#qat) to learn how to best deploy the QAT checkpoints.\n- **Try on your desktop:** Easily download, manage, and run Gemma 4 QAT models locally on your desktop using user-friendly interfaces like [llama.cpp](https://huggingface.co/collections/google/gemma-4-qat-q4-0), [Ollama](https://ollama.com/library/gemma4) and [LM Studio](https://lmstudio.ai/models/gemma-4).\n- **Deploy on-device:** Use Google's lightweight [LiteRT-LM](https://huggingface.co/collections/litert-community/gemma-family) runtime for optimized edge deployment or run the models directly on the web with [Transformers.js](https://huggingface.co/collections/onnx-community/gemma-4-onnx).\n- **Use your favorite development tools:** Serve larger models efficiently with [vLLM](https://huggingface.co/collections/google/gemma-4-qat-q4-0), or optimize for Apple Silicon with [MLX](https://huggingface.co/collections/mlx-community/gemma-4-qat). Use the MTP QAT checkpoints to preserve the speedup of [MTP](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/) while quantizing the models. 
Fine-tune weights directly using Hugging Face Transformers and [Unsloth](https://unsloth.ai/docs/models/gemma-4/qat).\n\nWe can't wait to see what you build with Gemma 4 running locally!", "url": "https://wpnews.pro/news/gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency", "canonical_source": "https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/", "published_at": "2026-06-05 16:18:48+00:00", "updated_at": "2026-06-05 17:41:18.097305+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-products", "ai-tools"], "entities": ["Gemma 4", "Google", "Quantization-Aware Training", "QAT", "Multi-Token Prediction", "MTP", "E4B", "E2B"], "alternates": {"html": "https://wpnews.pro/news/gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency", "markdown": "https://wpnews.pro/news/gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency.md", "text": "https://wpnews.pro/news/gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency.txt", "jsonld": "https://wpnews.pro/news/gemma-4-qat-models-optimizing-compression-for-mobile-and-laptop-efficiency.jsonld"}}