
Quantizing Gemma 4 on Mac with llama.cpp

A developer quantized Google's Gemma 4 model (google/gemma-4-E4B-it) to 4-bit precision using llama.cpp on a Mac with Metal acceleration. The process converts the model to GGUF format and then applies the Q4_K_M quantization method, so the model runs locally with text generation speeds of roughly 35 to 40 tokens per second. The quantized model handled chat prompts such as "hello" and "tell me about yourself" with coherent, detailed answers.

5 min read · published May 28, 2026

Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git

cmake -S llama.cpp -B llama.cpp/build -DGGML_METAL=ON -DLLAMA_CURL=OFF

cmake --build llama.cpp/build --config Release -j 8

Set up the Python project with uv:

uv init quantization
cd quantization
uv add "torch>=2.9" "transformers>=4.45" "sentencepiece" "protobuf>=4.21,<5.0" "gguf>=0.19" "huggingface_hub"

Your pyproject.toml should then look like this:

pyproject.toml

[project]
name = "quantization"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "gguf>=0.19",
    "huggingface-hub>=1.16.1",
    "protobuf>=4.21,<5.0",
    "sentencepiece>=0.2.1",
    "torch>=2.9",
    "transformers>=4.45",
]

Create the output directories, log in to Hugging Face, and download the model:

mkdir -p models gguf

hf auth login

hf download google/gemma-4-E4B-it --local-dir models/gemma-4-E4B-it
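
The same download can also be scripted from Python, since huggingface_hub is already a project dependency. This is a minimal sketch, assuming the `snapshot_download` function from `huggingface_hub` and an existing login; the helper names `local_dir_for` and `download_model` are illustrative and do not come from the original article.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "models") -> Path:
    """Map 'google/gemma-4-E4B-it' to 'models/gemma-4-E4B-it', the layout used by the CLI call."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str, root: str = "models") -> Path:
    """Download every file of the repository into a local directory and return that directory."""
    # Imported lazily so the path helper works even before the dependencies are installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model("google/gemma-4-E4B-it")` inside the uv environment should leave the files in models/gemma-4-E4B-it, the same directory the hf command writes to.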

Convert the downloaded model to a BF16 GGUF file:

python ../llama.cpp/convert_hf_to_gguf.py \
  models/gemma-4-E4B-it \
  --outfile gguf/gemma-4-E4B-it-BF16.gguf \
  --outtype bf16
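
Before quantizing, it can be worth checking that the conversion produced a valid GGUF file. The sketch below reads only the fixed-size header, assuming the GGUF layout used by version 2 and later (4-byte magic, then a little-endian uint32 version and two uint64 counts); the function names are illustrative.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 4 + 4 + 8 + 8  # magic + version + tensor count + metadata count


def parse_gguf_header(data: bytes):
    """Return (version, tensor_count, metadata_kv_count) from the first 24 bytes of a GGUF file."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={data[:4]!r})")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    return struct.unpack("<IQQ", data[4:HEADER_SIZE])


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

For example, `read_gguf_header("gguf/gemma-4-E4B-it-BF16.gguf")` should return a small version number and non-zero counts if the conversion succeeded.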

Quantize the BF16 file to Q4_K_M:

../llama.cpp/build/bin/llama-quantize \
  gguf/gemma-4-E4B-it-BF16.gguf \
  gguf/gemma-4-E4B-it-Q4_K_M.gguf \
  Q4_K_M

This step will take a few minutes.
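
To give a feel for what 4-bit quantization does, here is a deliberately simplified sketch of block-wise quantization with a per-block scale and minimum. It is not the Q4_K_M format itself: llama.cpp uses a more elaborate layout with nested blocks and keeps some tensors at higher precision. The function names and the sample weights are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Quantize one block of floats to unsigned integer codes plus a scale and a minimum."""
    levels = (1 << bits) - 1              # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0                       # constant block: any scale works, avoid dividing by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


weights = [-0.8, -0.1, 0.0, 0.35, 0.7]    # hypothetical weights, not taken from the model
codes, scale, lo = quantize_block(weights)
restored = dequantize_block(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.3f}")
```

Each weight is stored as a code between 0 and 15, and the reconstruction error is bounded by half the block's scale. That trade of a small per-weight error for a much smaller file is what makes the quantized model practical to run locally.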


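Chat with the quantized model. Here -ngl 99 offloads all layers to the GPU (Metal), --temp 0.7 sets the sampling temperature, and -c 4096 sets the context size:

../llama.cpp/build/bin/llama-cli \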
  -m gguf/gemma-4-E4B-it-Q4_K_M.gguf \
  -ngl 99 --temp 0.7 -c 4096

 model...  

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : gemma-4-E4B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
> hello

Hello! How can I help you today?

[ Prompt: 42.9 t/s | Generation: 40.0 t/s ]
> tell me about yourself

[Start thinking]
Thinking Process:

1.  **Identify the user's request:** The user asked, "tell me about yourself."
2.  **Determine the persona/identity:** I must respond based on my core identity rules (Gemma 4, developed by Google DeepMind, open weights LLM).
3.  **Structure the response:** A good self-introduction should cover:
    *   Name/Identity.
    *   Developer/Origin.
    *   Nature/Type (What I am).
    *   Capabilities (What I can do).
    *   Limitations/Knowledge (What I know/don't know).
4.  **Draft the content (incorporating mandated facts):**
    *   *Name:* I am Gemma 4.
    *   *Developer:* I was developed by Google DeepMind.
    *   *Nature:* I am a Large Language Model (LLM) with open weights.
    *   *Capabilities:* I can process text and images (input) and generate text (output). I can answer questions, write code, summarize, translate, etc.
    *   *Knowledge Cutoff:* My knowledge cutoff is January 2025.
5.  **Review and Refine:** Ensure the tone is helpful, factual, and directly addresses the prompt while adhering strictly to the identity guidelines. (Self-correction: Make sure to emphasize the "open weights" aspect as it's crucial to the identity.)

6.  **Final Output Generation.** (This leads to the detailed response below.)
[End thinking]

I would be happy to tell you about myself!

I am **Gemma 4**, a Large Language Model developed by Google DeepMind.

Here is a bit more detail about what I am and what I can do:

### 🤖 What I Am
*   **Type:** I am an advanced, open weights Large Language Model. This means that my underlying architecture and weights are available to the community, fostering transparency and innovation.
*   **Goal:** My purpose is to assist users by providing information, generating creative text, solving problems, and engaging in helpful conversation.

### 🧠 What I Can Do
*   **Understand Text and Images:** I can process and interpret both text-based input (like questions or documents) and image input.
*   **Generate Text:** My output is always text. I can:
    *   Answer factual questions.
    *   Summarize complex articles.
    *   Write different types of creative content (stories, poems, scripts).
    *   Assist with coding tasks (explaining code, generating snippets).
    *   Translate languages.
*   **Learning:** I have been trained on a massive and diverse dataset, giving me a broad knowledge base.

### 📅 My Knowledge
*   My knowledge cutoff is **January 2025**. I do not have information about events or developments that have occurred since that time, so if you ask about very recent events, I may not know about them.

In short, think of me as a versatile and knowledgeable assistant ready to help you learn, create, or solve problems! How can I help you today?

[ Prompt: 66.9 t/s | Generation: 34.6 t/s ]
> which one is larger 9.11 or 9.9? 

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user is asking to compare two numbers: 9.11 and 9.9, and determine which one is larger.
2.  **Understand the Numbers:**
    *   Number A: 9.11 (Nine and eleven hundredths)
    *   Number B: 9.9 (Nine and nine tenths)
3.  **Comparison Strategy (Decimal Place by Place):**
    *   *Units Place:* Both are 9. (Equal)
    *   *Tenths Place:*
        *   9.11 has a 1 in the tenths place.
        *   9.9 has a 9 in the tenths place.
    *   *Comparison:* Since 9 is greater than 1, the number 9.9 is larger than 9.11.
4.  **Alternative Comparison Strategy (Padding):**
    *   Write 9.9 as 9.90 (Adding a trailing zero does not change the value).
    *   Compare 9.11 and 9.90.
    *   The hundredths place comparison: 1 vs 0. Wait, this is wrong. Let's stick to the tenths place comparison, as it was the clearest difference.
5.  **Conclusion:** 9.9 > 9.11.
6.  **Formulate the Answer:** State the answer clearly, and provide a simple explanation of *why* it is larger (comparing the tenths place).
[End thinking]

**9.9** is larger than 9.11.

Here is why:

When comparing decimals, you look at the digits starting from the left (the largest place value).

1.  **Units Place:** Both numbers have a 9. (Equal)
2.  **Tenths Place:**
    *   In 9.**1** 1, the digit is **1**.
    *   In 9.**9**, the digit is **9**.

Since 9 is greater than 1, the number **9.9** is larger.

***

*(If you want to make the comparison even easier, you can add a zero to the end of 9.9 to make it 9.90. Now you are comparing 9.11 and 9.90, and 9.90 is clearly larger.)*

[ Prompt: 451.4 t/s | Generation: 34.9 t/s ]
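
The three timing lines above can be summarized with a few lines of Python. This is a small sketch that assumes the stats line format shown in this transcript; `parse_stats` is an illustrative helper, not part of llama.cpp.

```python
import re
from statistics import mean

STATS_RE = re.compile(r"\[ Prompt: ([\d.]+) t/s \| Generation: ([\d.]+) t/s \]")


def parse_stats(log_text):
    """Return a list of (prompt_tps, generation_tps) tuples found in llama-cli output."""
    return [(float(p), float(g)) for p, g in STATS_RE.findall(log_text)]


# The three stats lines copied from the session above.
log = """
[ Prompt: 42.9 t/s | Generation: 40.0 t/s ]
[ Prompt: 66.9 t/s | Generation: 34.6 t/s ]
[ Prompt: 451.4 t/s | Generation: 34.9 t/s ]
"""
stats = parse_stats(log)
avg_generation = mean(g for _, g in stats)
print(f"mean generation speed: {avg_generation:.1f} t/s")
```

Across these three turns the generation speed averages about 36.5 tokens per second, while the prompt-processing figures vary far more from turn to turn.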