Quantizing Gemma 4 on Mac with llama.cpp

A developer quantized Google's Gemma 4 model (gemma-4-E4B-it) to 4-bit precision using llama.cpp on a Mac with Metal acceleration. The process converts the model to a BF16 GGUF file and then applies the Q4_K_M quantization method, which lets the model run locally at roughly 35 to 40 tokens per second for text generation. The quantized model handled chat prompts such as "hello" and "tell me about yourself" with coherent, detailed answers.
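The two-stage pipeline summarized above (convert to GGUF, then quantize to Q4_K_M) can be sketched as a small Python helper. This is only an illustration: `build_commands` is a hypothetical name, the paths are the ones used in this walkthrough, and the helper just assembles the command lines rather than running them.

```python
import shlex
from pathlib import PurePosixPath


def build_commands(model_dir, gguf_dir, llama_cpp_dir, quant_type="Q4_K_M"):
    """Return (convert_cmd, quantize_cmd) as argument lists.

    Hypothetical helper: it mirrors the two commands used in this post
    and does not execute anything.
    """
    model_dir = PurePosixPath(model_dir)
    gguf_dir = PurePosixPath(gguf_dir)
    llama_cpp_dir = PurePosixPath(llama_cpp_dir)

    name = model_dir.name  # e.g. "gemma-4-E4B-it"
    bf16_file = gguf_dir / f"{name}-BF16.gguf"
    quant_file = gguf_dir / f"{name}-{quant_type}.gguf"

    # Stage 1: Hugging Face checkpoint -> BF16 GGUF
    convert_cmd = [
        "python", str(llama_cpp_dir / "convert_hf_to_gguf.py"),
        str(model_dir),
        "--outfile", str(bf16_file),
        "--outtype", "bf16",
    ]
    # Stage 2: BF16 GGUF -> quantized GGUF
    quantize_cmd = [
        str(llama_cpp_dir / "build" / "bin" / "llama-quantize"),
        str(bf16_file),
        str(quant_file),
        quant_type,
    ]
    return convert_cmd, quantize_cmd


convert_cmd, quantize_cmd = build_commands(
    "models/gemma-4-E4B-it", "gguf", "../llama.cpp"
)
print(shlex.join(convert_cmd))
print(shlex.join(quantize_cmd))
# To actually run a stage: subprocess.run(convert_cmd, check=True)
```

Keeping the commands as argument lists avoids shell quoting problems and makes it easy to swap in another quantization type by changing `quant_type`.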

    # llama.cpp
    git clone https://github.com/ggml-org/llama.cpp.git
    cmake -S llama.cpp -B llama.cpp/build -DGGML_METAL=ON -DLLAMA_CURL=OFF
    cmake --build llama.cpp/build --config Release -j 8

    uv init quantization
    cd quantization
    uv add "torch>=2.9" "transformers>=4.45" "sentencepiece" "protobuf>=4.21,<5.0" "gguf>=0.19" "huggingface_hub"

Your pyproject.toml will look like this:

    # pyproject.toml
    [project]
    name = "quantization"
    version = "0.1.0"
    description = "Add your description here"
    readme = "README.md"
    requires-python = ">=3.11"
    dependencies = [
        "gguf>=0.19",
        "huggingface-hub>=1.16.1",
        "protobuf>=4.21,<5.0",
        "sentencepiece>=0.2.1",
        "torch>=2.9",
        "transformers>=4.45",
    ]

    # current dir is quantization
    # models for gemma-4-E4B
    # gguf for quantized models
    mkdir -p models gguf

    # login hf
    hf auth login

    # download the model
    hf download google/gemma-4-E4B-it --local-dir models/gemma-4-E4B-it

    # current dir is quantization; llama.cpp and quantization are both on Desktop
    python ../llama.cpp/convert_hf_to_gguf.py \
        models/gemma-4-E4B-it \
        --outfile gguf/gemma-4-E4B-it-BF16.gguf \
        --outtype bf16

    # convert to Q4_K_M
    ../llama.cpp/build/bin/llama-quantize \
        gguf/gemma-4-E4B-it-BF16.gguf \
        gguf/gemma-4-E4B-it-Q4_K_M.gguf \
        Q4_K_M

This step will take a few minutes.

    # current dir is quantization
    ../llama.cpp/build/bin/llama-cli \
        -m gguf/gemma-4-E4B-it-Q4_K_M.gguf \
        -ngl 99 --temp 0.7 -c 4096

    Loading model...

    [llama.cpp ASCII-art banner]

    build      : b0-unknown
    model      : gemma-4-E4B-it-Q4_K_M.gguf
    modalities : text

    available commands:
      /exit or Ctrl+C     stop or exit
      /regen              regenerate the last response
      /clear              clear the chat history
      /read <file>        add a text file
      /glob <pattern>     add text files using globbing pattern

    > hello

    Hello! How can I help you today?

    [ Prompt: 42.9 t/s | Generation: 40.0 t/s ]

    > tell me about yourself

    [Start thinking]
    Thinking Process:

    1. Identify the user's request: The user asked, "tell me about yourself."
    2. Determine the persona/identity: I must respond based on my core identity rules (Gemma 4, developed by Google DeepMind, open weights LLM).
    3. Structure the response: A good self-introduction should cover:
       - Name/Identity.
       - Developer/Origin.
       - Nature/Type (What I am).
       - Capabilities (What I can do).
       - Limitations/Knowledge (What I know/don't know).
    4. Draft the content (incorporating mandated facts):
       - Name: I am Gemma 4.
       - Developer: I was developed by Google DeepMind.
       - Nature: I am a Large Language Model (LLM) with open weights.
       - Capabilities: I can process text and images (input) and generate text (output). I can answer questions, write code, summarize, translate, etc.
       - Knowledge Cutoff: My knowledge cutoff is January 2025.
    5. Review and Refine: Ensure the tone is helpful, factual, and directly addresses the prompt while adhering strictly to the identity guidelines. (Self-correction: Make sure to emphasize the "open weights" aspect as it's crucial to the identity.)
    6. Final Output Generation. (This leads to the detailed response below.)
    [End thinking]

    I would be happy to tell you about myself! I am Gemma 4, a Large Language Model developed by Google DeepMind.

    Here is a bit more detail about what I am and what I can do:

    🤖 What I Am
    - Type: I am an advanced, open weights Large Language Model. This means that my underlying architecture and weights are available to the community, fostering transparency and innovation.
    - Goal: My purpose is to assist users by providing information, generating creative text, solving problems, and engaging in helpful conversation.

    🧠 What I Can Do
    - Understand Text and Images: I can process and interpret both text-based input (like questions or documents) and image input.
    - Generate Text: My output is always text. I can:
      - Answer factual questions.
      - Summarize complex articles.
      - Write different types of creative content (stories, poems, scripts).
      - Assist with coding tasks (explaining code, generating snippets).
      - Translate languages.
    - Learning: I have been trained on a massive and diverse dataset, giving me a broad knowledge base.

    📅 My Knowledge
    My knowledge cutoff is January 2025. I do not have information about events or developments that have occurred since that time, so if you ask about very recent events, I may not know about them.

    In short, think of me as a versatile and knowledgeable assistant ready to help you learn, create, or solve problems! How can I help you today?

    [ Prompt: 66.9 t/s | Generation: 34.6 t/s ]

    > which one is larger 9.11 or 9.9?

    [Start thinking]
    Thinking Process:

    1. Analyze the Request: The user is asking to compare two numbers: 9.11 and 9.9, and determine which one is larger.
    2. Understand the Numbers:
       - Number A: 9.11 (Nine and eleven hundredths)
       - Number B: 9.9 (Nine and nine tenths)
    3. Comparison Strategy (Decimal Place by Place):
       - Units Place: Both are 9. (Equal)
       - Tenths Place: 9.11 has a 1 in the tenths place. 9.9 has a 9 in the tenths place.
       - Comparison: Since 9 is greater than 1, the number 9.9 is larger than 9.11.
    4. Alternative Comparison Strategy (Padding):
       - Write 9.9 as 9.90 (Adding a trailing zero does not change the value).
       - Compare 9.11 and 9.90.
       - The hundredths place comparison: 1 vs 0. (Wait, this is wrong. Let's stick to the tenths place comparison, as it was the clearest difference.)
    5. Conclusion: 9.9 > 9.11.
    6. Formulate the Answer: State the answer clearly, and provide a simple explanation of why it is larger (comparing the tenths place).
    [End thinking]

    9.9 is larger than 9.11.

    Here is why:

    When comparing decimals, you look at the digits starting from the left (the largest place value).

    1. Units Place: Both numbers have a 9. (Equal)
    2. Tenths Place:
       - In 9.11, the digit is 1.
       - In 9.9, the digit is 9.

    Since 9 is greater than 1, the number 9.9 is larger.

    If you want to make the comparison even easier, you can add a zero to the end of 9.9 to make it 9.90. Now you are comparing 9.11 and 9.90, and 9.90 is clearly larger.

    [ Prompt: 451.4 t/s | Generation: 34.9 t/s ]
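To put the reported throughput in one place, here is a minimal Python sketch that parses the three timing lines llama-cli printed in this session and averages the generation speed. The line format is taken from the output above; the regular expression and the `parse_timing` helper are assumptions made for illustration, not part of llama.cpp.

```python
import re
from statistics import mean

# Timing lines copied from the session above.
TIMING_LINES = [
    "Prompt: 42.9 t/s | Generation: 40.0 t/s",
    "Prompt: 66.9 t/s | Generation: 34.6 t/s",
    "Prompt: 451.4 t/s | Generation: 34.9 t/s",
]

# Assumed pattern for the timing line; adjust if your build prints it differently.
TIMING_RE = re.compile(r"Prompt:\s*([\d.]+) t/s \| Generation:\s*([\d.]+) t/s")


def parse_timing(line):
    """Return (prompt_tps, generation_tps) parsed from one timing line."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    return float(match.group(1)), float(match.group(2))


timings = [parse_timing(line) for line in TIMING_LINES]
generation_speeds = [gen for _, gen in timings]
avg_generation = mean(generation_speeds)

print(f"generation: min {min(generation_speeds)} / "
      f"avg {avg_generation:.1f} / max {max(generation_speeds)} t/s")
# prints: generation: min 34.6 / avg 36.5 / max 40.0 t/s
```

Across these three turns, generation speed ranges from 34.6 to 40.0 t/s, so 40 t/s is the best case seen here rather than the typical one.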