Quantizing Gemma 4 on Mac with llama.cpp

A developer quantized Google's Gemma 4 model to 4-bit precision using llama.cpp on a Mac with Metal acceleration. The process converted the Hugging Face checkpoint (gemma-4-E4B-it) to GGUF format and then applied the Q4_K_M quantization method. This let the model run locally at roughly 40 tokens per second during text generation. The quantized model handled chat prompts such as "hello" and "tell me about yourself" with coherent, detailed answers.

llama.cpp

Clone llama.cpp and build it with Metal enabled:

```bash
git clone https://github.com/ggml-org/llama.cpp.git
# Configure with the Metal backend on and libcurl support off
cmake -S llama.cpp -B llama.cpp/build -DGGML_METAL=ON -DLLAMA_CURL=OFF
# Build in Release mode with 8 parallel jobs
cmake --build llama.cpp/build --config Release -j 8
```

Create a uv project for the conversion tooling:

```bash
uv init quantization
cd quantization
uv add "torch>=2.9" "transformers>=4.45" "sentencepiece" "protobuf>=4.21,<5.0" "gguf>=0.19" "huggingface_hub"
```

Your pyproject.toml should then look like this:

```toml
[project]
name = "quantization"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "gguf>=0.19",
    "huggingface-hub>=1.16.1",
    "protobuf>=4.21,<5.0",
    "sentencepiece>=0.2.1",
    "torch>=2.9",
    "transformers>=4.45",
]
```

Create the working directories:

```bash
# current dir is quantization
# models: for gemma-4-E4B
# gguf: for the converted and quantized models
mkdir -p models gguf
```

Log in to Hugging Face:

```bash
hf auth login
```

Download the model:

```bash
hf download google/gemma-4-E4B-it --local-dir models/gemma-4-E4B-it
```

Convert the model to a BF16 GGUF file. The current dir is quantization, and llama.cpp and quantization both sit on the Desktop, which is why the script is reached through ../llama.cpp:

```bash
# Run inside the project environment (e.g. after `source .venv/bin/activate`,
# or prefix the command with `uv run`) so the packages added with uv add are available
python ../llama.cpp/convert_hf_to_gguf.py \
    models/gemma-4-E4B-it \
    --outfile gguf/gemma-4-E4B-it-BF16.gguf \
    --outtype bf16
```

Convert to Q4_K_M:

```bash
# Arguments: input GGUF, output GGUF, quantization type
../llama.cpp/build/bin/llama-quantize \
    gguf/gemma-4-E4B-it-BF16.gguf \
    gguf/gemma-4-E4B-it-Q4_K_M.gguf \
    Q4_K_M
```

This step takes a few minutes.

Run the quantized model (current dir is quantization):

```bash
# -ngl 99: offload up to 99 layers (effectively all of them) to the GPU via Metal
# --temp 0.7: sampling temperature; -c 4096: context size in tokens
../llama.cpp/build/bin/llama-cli \
    -m gguf/gemma-4-E4B-it-Q4_K_M.gguf \
    -ngl 99 --temp 0.7 -c 4096
```

```
Loading model...
```
```
build : b0-unknown
model : gemma-4-E4B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read
```
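After the convert and quantize steps, it can help to confirm that the files in gguf/ really are GGUF files before loading them. The sketch below is a minimal, stdlib-only header reader that assumes the GGUF v2/v3 layout: a little-endian 4-byte magic `GGUF`, a uint32 version, a uint64 tensor count and a uint64 metadata key/value count. The names `parse_gguf_header` and `read_gguf_header` are hypothetical helpers, not part of llama.cpp or the `gguf` package.

```python
import struct
import sys
from typing import NamedTuple

# Assumed GGUF v2+ header: magic, uint32 version, uint64 tensor count, uint64 KV count
HEADER_FMT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes; "<" means no padding


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(raw: bytes) -> GGUFHeader:
    """Decode the fixed-size header at the start of a GGUF file."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("data too short to be a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FMT, raw[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError(f"bad magic {magic!r}: not a GGUF file")
    if version < 2:
        # GGUF v1 stored these counts as 32-bit values
        raise ValueError(f"GGUF v{version} not handled here")
    return GGUFHeader(version, n_tensors, n_kv)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read only the first HEADER_SIZE bytes of the file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python check_gguf.py gguf/gemma-4-E4B-it-BF16.gguf gguf/gemma-4-E4B-it-Q4_K_M.gguf
    for p in sys.argv[1:]:
        print(p, read_gguf_header(p))
```

A `ValueError` here means the file is truncated or is not GGUF at all, which is cheaper to discover than a failed model load.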
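To make the idea behind Q4_K concrete: each small block of weights is stored as 4-bit integer codes plus a per-block scale and minimum. The sketch below is a deliberately simplified version of that idea (blocks of 32, plain round-to-nearest, float scale and min). It is not llama.cpp's actual k-quant routine, which additionally quantizes the scales and mins to 6 bits and searches for better values; the function names are hypothetical.

```python
import random

BLOCK = 32  # weights per block; Q4_K groups 8 such blocks into a 256-weight super-block
QMAX = 15   # largest 4-bit code, so codes are 0..15


def quantize_block(xs: list[float]) -> tuple[float, float, list[int]]:
    """Map each weight to a 4-bit code q so that x is approximately scale * q + lo."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / QMAX if hi > lo else 1.0  # avoid dividing by zero on flat blocks
    qs = [min(QMAX, max(0, round((x - lo) / scale))) for x in xs]
    return scale, lo, qs


def dequantize_block(scale: float, lo: float, qs: list[int]) -> list[float]:
    return [scale * q + lo for q in qs]


def roundtrip(weights: list[float]) -> list[float]:
    """Quantize then dequantize block by block (the last block may be shorter)."""
    out: list[float] = []
    for i in range(0, len(weights), BLOCK):
        scale, lo, qs = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, lo, qs))
    return out


rng = random.Random(0)
w = [rng.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights, not real model data
w_hat = roundtrip(w)
print(f"max abs error: {max(abs(a - b) for a, b in zip(w, w_hat)):.5f}")
```

With a 32-bit float scale and minimum per 32-weight block, this naive layout would cost 4 + 64/32 = 6 bits per weight; Q4_K gets down to 4.5 by storing 6-bit sub-block scales and mins under two fp16 super-block factors.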
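The "4" in Q4_K_M is not exactly 4 bits per weight. The arithmetic below derives bits per weight from the k-quant block layouts in llama.cpp's ggml (`block_q4_K` and `block_q6_K`, 256 weights per super-block), then estimates a file size for a Q4_K/Q6_K mix, since Q4_K_M keeps a subset of tensors at Q6_K. The 8e9 parameter count and the 20% Q6_K share are hypothetical inputs chosen for illustration, not figures for gemma-4-E4B.

```python
QK_K = 256  # weights per k-quant super-block

# block_q4_K: fp16 d + fp16 dmin (4 B) + 12 B packed 6-bit scales/mins + 128 B of 4-bit codes
Q4_K_BYTES = 2 + 2 + 12 + QK_K // 2
# block_q6_K: 128 B low 4 bits + 64 B high 2 bits + 16 B int8 scales + fp16 d (2 B)
Q6_K_BYTES = QK_K // 2 + QK_K // 4 + QK_K // 16 + 2


def bits_per_weight(block_bytes: int) -> float:
    return block_bytes * 8 / QK_K


def estimate_size_gb(n_params: float, q6_fraction: float) -> float:
    """Rough size of a Q4_K/Q6_K mix; ignores metadata and unquantized tensors."""
    bpw = (1 - q6_fraction) * bits_per_weight(Q4_K_BYTES) + q6_fraction * bits_per_weight(Q6_K_BYTES)
    return n_params * bpw / 8 / 1e9


print(bits_per_weight(Q4_K_BYTES))  # 4.5
print(bits_per_weight(Q6_K_BYTES))  # 6.5625
# Hypothetical model: 8e9 weights, 20% of them at Q6_K, compared with BF16 at 16 bits/weight
print(f"{estimate_size_gb(8e9, 0.2):.2f} GB vs BF16 {8e9 * 16 / 8 / 1e9:.0f} GB")  # 4.91 GB vs BF16 16 GB
```

Comparing the real sizes of gguf/gemma-4-E4B-it-BF16.gguf and gguf/gemma-4-E4B-it-Q4_K_M.gguf is the way to check how close such an estimate gets.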
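The `--temp 0.7` flag passed to llama-cli sets the sampling temperature. In the standard formulation, logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely token. The sketch below shows only that scaling step on made-up logits; llama.cpp's sampler chain also applies other samplers, such as top-k and top-p, which are not modeled here.

```python
import math


def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temp <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temp for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's probability rising when the temperature drops from 1.0 to 0.7, which is why lower settings give more focused, less varied answers.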