Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp

wpnews.pro

Build llama.cpp with Metal, convert Llama 3.2 3B to Q4_K_M GGUF, and benchmark real prompt-processing and generation throughput on your specific chip.

Mariana Souza

What You'll Build #

By the end of this tutorial you'll have Llama 3.2 3B running locally with Metal GPU acceleration, quantized to Q4_K_M, with llama-bench

output showing real prompt-processing and generation throughput for your specific chip.

Prerequisites #

Apple Silicon Mac (M1 or later) running macOS 13 Ventura or newer
Xcode Command Line Tools: xcode-select --install
CMake 3.14+: brew install cmake
Python 3.10+
Git
A HuggingFace account with Llama 3.2 access approved (visit the meta-llama/Llama-3.2-3Bpage and accept the Meta license) - ~12 GB free disk space

Intel Mac users can follow along but won't get Metal acceleration; omit all -ngl

flags and expect significantly lower throughput.

1. Clone and Compile llama.cpp with Metal #

Use a recent commit. Llama 3.2 architecture support and the GGML refactor are both in the main branch as of late 2024.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_METAL=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

GGML_METAL=ON

is the default on macOS in recent llama.cpp, but being explicit prevents a silent fallback if you're on an unusual CMake config. Build takes 2-3 minutes on an M2. Verify the key binaries landed:

ls build/bin/llama-cli build/bin/llama-bench build/bin/llama-quantize

2. Download Llama 3.2 3B #

Set up an isolated Python environment, then pull the model weights:

python3 -m venv .venv
source .venv/bin/activate
pip install huggingface-hub
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B \
  --local-dir ./models/Llama-3.2-3B \
  --exclude "original/*"

The --exclude "original/*"

drops Meta's consolidated checkpoint format and keeps only the HuggingFace safetensors, shaving ~3 GB off the download. Total download is roughly 6.5 GB.

3. Convert to GGUF (F16) #

Install the conversion dependencies from llama.cpp's own requirements file, then run the converter. The requirements file lives under requirements/

after a repository reorganization:

pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py ./models/Llama-3.2-3B \
  --outfile ./models/llama-3.2-3b-f16.gguf \
  --outtype f16

This produces a single ~6.4 GB GGUF file containing the tokenizer, config, and half-precision weights. Keep it around: it's a reusable staging artifact you can re-quantize to multiple formats without re-running this step.

4. Quantize to Q4_K_M #

./build/bin/llama-quantize \
  ./models/llama-3.2-3b-f16.gguf \
  ./models/llama-3.2-3b-q4km.gguf \
  Q4_K_M

Q4_K_M is 4-bit with k-quant grouping: it applies higher bit-width to the layers most sensitive to quantization error (certain attention and feed-forward projections), trading a modest size increase for noticeably better output quality compared to plain Q4_0. This runs in under 90 seconds on an M2 Pro.

Common quant options for the 3B model:

Format	Approx. size	Notes
Q4_0	~1.8 GB	Fastest, lowest quality
Q4_K_M	~2.0 GB	Best 4-bit tradeoff, use this
Q5_K_M	~2.3 GB	Marginal quality gain, slower
Q8_0	~3.4 GB	Near-lossless, good for evals

5. Run the Inference Benchmark #

llama-bench

tests both prompt processing (prefill) and token generation (decode) throughput, running each scenario multiple times and reporting mean and standard deviation:

./build/bin/llama-bench \
  -m ./models/llama-3.2-3b-q4km.gguf \
  -p 512 \
  -n 128 \
  -ngl 99

-ngl 99

offloads all transformer layers to Metal. Llama 3.2 3B has 28 layers, so 99 is effectively "all of them." -p 512

processes a 512-token synthetic prompt; -n 128

generates 128 tokens. Expected output:

| model                 |       size |     params | backend | ngl |    test |              t/s |
|-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:|
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  pp 512 |  2100.45 ± 18.21 |
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  tg 128 |    68.32 ±  0.41 |

Typical throughput by chip:

Chip	pp (t/s)	tg (t/s)
M1	~1100	~40
M2 Pro	~1800	~60
M3 Max	~3500	~110

Prompt processing is memory-bandwidth-bound; generation is compute-bound at this model size. Chips with higher memory bandwidth (M3 Max, M2 Ultra) pull ahead considerably on the tg row.

6. Sanity Check the Output #

Benchmark numbers mean nothing if the model is generating garbage. Run a quick deterministic inference pass:

./build/bin/llama-cli \
  -m ./models/llama-3.2-3b-q4km.gguf \
  -ngl 99 \
  -p "Explain the difference between a mutex and a semaphore in two sentences." \
  -n 80 \
  --temp 0.0

--temp 0.0

makes output deterministic. You should get a coherent, factually correct response. Repetitive tokens or incoherent output points to a conversion problem, not a quantization one.

Verify It Works #

A successful run checks three boxes:

llama-bench

outputs a two-row markdown table with non-zero t/s for both pp and tgllama-cli

responds coherently to the mutex/semaphore prompt above- Activity Monitor's GPU History view shows a spike in GPU usage during inference (open it via Window menu in Activity Monitor)

Troubleshooting #

Crash or OOM immediately after "ggml_metal_init: allocating". The model is larger than your available unified memory. Switch to Q4_0 (smaller than Q4_K_M) or use the 1B model (meta-llama/Llama-3.2-1B

) instead.

** convert_hf_to_gguf.py fails with a KeyError or unknown architecture error.** Pull the latest llama.cpp (

git pull

then rebuild). Llama 3.2 support was merged after the initial 3.x release cycle, so an older clone won't have it.** llama-bench shows identical t/s with -ngl 0 and -ngl 99.** Metal isn't active. Re-run

cmake -B build -DGGML_METAL=ON

and look for GGML_METAL: enabled

in the CMake output before building. If it says disabled, confirm your Xcode Command Line Tools are installed and up to date.Download returns 401 or 403. You haven't been granted access to the gated repository. Accept the license on the model's HuggingFace page, wait a few minutes, and confirm your CLI token has read

scope (huggingface-cli whoami

).

Next Steps #

Run llama-perplexity

against wikitext-2 to measure quality loss across quant levels objectively, rather than just vibes-checking the output. llama-server

exposes an OpenAI-compatible HTTP API onlocalhost:8080

, so you can point existing tooling at your local model without code changes.- Experiment with --ctx-size

values (4096, 8192, 16384) inllama-bench

to see how KV cache growth affects generation speed as context length increases. - Compare the 1B model at Q4_K_M for workloads where raw throughput matters more than capability.

Mariana Souza· Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.

Discussion 0 #

No comments yet

Be the first to weigh in.

source & further reading

devclubhouse.com — original article Why Ford Rehired 350 Engineers After Relying on AI Stacking the Deck: IBM’s NanoStack and the Sub-1 nm Era The AI Auditing Wave and the End of Battle-Tested Code