Build llama.cpp with Metal, convert Llama 3.2 3B to Q4_K_M GGUF, and benchmark real prompt-processing and generation throughput on your specific chip.
What You'll Build
By the end of this tutorial you'll have Llama 3.2 3B running locally with Metal GPU acceleration, quantized to Q4_K_M, with llama-bench
output showing real prompt-processing and generation throughput for your specific chip.
Prerequisites
- Apple Silicon Mac (M1 or later) running macOS 13 Ventura or newer
- Xcode Command Line Tools: xcode-select --install
- CMake 3.14+: brew install cmake
- Python 3.10+
- Git
- A HuggingFace account with Llama 3.2 access approved (visit the meta-llama/Llama-3.2-3B page and accept the Meta license)
- ~16 GB free disk space (roughly 6.5 GB for the download, 6.4 GB for the F16 GGUF, 2 GB for the quantized file, plus build artifacts)
Intel Mac users can follow along but won't get Metal acceleration; omit all -ngl flags and expect significantly lower throughput.
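The checklist above can be verified in one pass before you start. This is a minimal sketch rather than anything shipped with llama.cpp; have_tool is a hypothetical helper name and the tool list simply mirrors the prerequisites.

```shell
#!/bin/sh
# Preflight sketch: report which prerequisite tools are on PATH.
# have_tool is a hypothetical helper name, not part of llama.cpp.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

for tool in git cmake python3 xcode-select; do
    if have_tool "$tool"; then
        echo "ok       $tool"
    else
        echo "missing  $tool"
    fi
done

# Apple Silicon reports "arm64" here; Intel Macs report "x86_64".
echo "arch     $(uname -m)"
```

It only checks that each tool exists, not its version, so you may still want to confirm the CMake and Python versions by hand.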
1. Clone and Compile llama.cpp with Metal
Use a recent commit. Llama 3.2 architecture support and the GGML refactor are both in the master branch as of late 2024.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_METAL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
GGML_METAL=ON is the default on macOS in recent llama.cpp, but being explicit prevents a silent fallback if you're on an unusual CMake config. The build takes 2-3 minutes on an M2. Verify the key binaries landed:
ls build/bin/llama-cli build/bin/llama-bench build/bin/llama-quantize
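The binaries existing doesn't prove Metal was compiled in. One way to check, sketched below, is to read the value CMake cached for GGML_METAL in build/CMakeCache.txt. The helper name metal_enabled is hypothetical, and the pattern assumes CMake's usual NAME:TYPE=VALUE cache format.

```shell
# Sketch: confirm the configured build has Metal switched on by reading
# the CMake cache. metal_enabled is a hypothetical helper name.
metal_enabled() {
    # $1: path to CMakeCache.txt; cache entries look like NAME:TYPE=VALUE
    grep -Eq '^GGML_METAL:[A-Z]*=(ON|TRUE|YES|1)$' "$1" 2>/dev/null
}

if metal_enabled build/CMakeCache.txt; then
    echo "Metal backend is enabled in this build directory"
else
    echo "GGML_METAL is not ON (or build/ has not been configured yet)"
fi
```

This checks only what was configured, so the -ngl 0 versus -ngl 99 comparison in Troubleshooting is still the real test that Metal is used at run time.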
2. Download Llama 3.2 3B
Set up an isolated Python environment, then pull the model weights:
python3 -m venv .venv
source .venv/bin/activate
pip install huggingface-hub
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B \
--local-dir ./models/Llama-3.2-3B \
--exclude "original/*"
The --exclude "original/*" pattern skips Meta's consolidated checkpoint format and keeps only the HuggingFace safetensors, so you don't download a second full copy of the weights. Total download is roughly 6.5 GB.
3. Convert to GGUF (F16)
Install the conversion dependencies from llama.cpp's own requirements file, then run the converter. The requirements file lives under requirements/ after a repository reorganization:
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py ./models/Llama-3.2-3B \
--outfile ./models/llama-3.2-3b-f16.gguf \
--outtype f16
This produces a single ~6.4 GB GGUF file containing the tokenizer, config, and half-precision weights. Keep it around: it's a reusable staging artifact you can re-quantize to multiple formats without re-running this step.
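Before spending time on quantization, you can run a quick check that the converter wrote a real GGUF file. GGUF files begin with the four ASCII bytes "GGUF"; the sketch below tests only that magic, and is_gguf is a hypothetical helper name.

```shell
# Sketch: a GGUF file starts with the four ASCII bytes "GGUF".
# is_gguf is a hypothetical helper name; it checks the magic only,
# not the tensors or metadata that follow.
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

f=./models/llama-3.2-3b-f16.gguf
if is_gguf "$f"; then
    echo "$f looks like a GGUF file"
else
    echo "$f is missing or does not start with the GGUF magic"
fi
```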
4. Quantize to Q4_K_M
./build/bin/llama-quantize \
./models/llama-3.2-3b-f16.gguf \
./models/llama-3.2-3b-q4km.gguf \
Q4_K_M
Q4_K_M is 4-bit with k-quant grouping: it applies higher bit-width to the layers most sensitive to quantization error (certain attention and feed-forward projections), trading a modest size increase for noticeably better output quality compared to plain Q4_0. This runs in under 90 seconds on an M2 Pro.
Common quant options for the 3B model:
| Format | Approx. size | Notes |
|---|---|---|
| Q4_0 | ~1.8 GB | Fastest, lowest quality |
| Q4_K_M | ~2.0 GB | Best 4-bit tradeoff, use this |
| Q5_K_M | ~2.3 GB | Marginal quality gain, slower |
| Q8_0 | ~3.4 GB | Near-lossless, good for evals |
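The sizes in the table follow from simple arithmetic: parameter count times average bits per weight, divided by 8. The sketch below uses the 3.21 B parameter count that llama-bench reports for this model and the fixed block layouts of F16 (16 bits), Q8_0 (8.5 bits) and Q4_0 (4.5 bits). The helper name est_gb is hypothetical. K-quants such as Q4_K_M mix tensor types, so they have no single exact bits-per-weight figure and are left out.

```shell
# Sketch: estimated file size in GB = parameters * bits-per-weight / 8.
# est_gb is a hypothetical helper name.
est_gb() {
    # $1: parameters in billions, $2: average bits per weight
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

echo "F16  (16 bits/weight):  $(est_gb 3.21 16) GB"
echo "Q8_0 (8.5 bits/weight): $(est_gb 3.21 8.5) GB"
echo "Q4_0 (4.5 bits/weight): $(est_gb 3.21 4.5) GB"
```

Real files come out slightly different because some tensors are stored at other precisions and the file also carries the tokenizer and metadata.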
5. Run the Inference Benchmark
llama-bench tests both prompt processing (prefill) and token generation (decode) throughput, running each scenario multiple times and reporting mean and standard deviation:
./build/bin/llama-bench \
-m ./models/llama-3.2-3b-q4km.gguf \
-p 512 \
-n 128 \
-ngl 99
-ngl 99 offloads all transformer layers to Metal. Llama 3.2 3B has 28 layers, so 99 is effectively "all of them." -p 512 processes a 512-token synthetic prompt; -n 128 generates 128 tokens. Expected output (exact numbers vary by chip):
| model | size | params | backend | ngl | test | t/s |
|-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:|
| llama 3.2 3B Q4_K_M | 1.93 GiB | 3.21 B | Metal | 99 | pp 512 | 2100.45 ± 18.21 |
| llama 3.2 3B Q4_K_M | 1.93 GiB | 3.21 B | Metal | 99 | tg 128 | 68.32 ± 0.41 |
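If you want to log results across runs or chips, you can pull the mean t/s out of the markdown table with awk. This is a sketch: extract_tps is a hypothetical helper name, and it assumes only that "test" and "t/s" are the last two columns, as in the table above. llama-bench can also emit machine-readable formats through its -o option; check llama-bench --help for what your build supports.

```shell
# Sketch: pull the mean tokens/second out of a llama-bench markdown table.
# extract_tps is a hypothetical helper; it reads the table on stdin and
# relies only on "test" and "t/s" being the last two columns.
extract_tps() {
    awk -F'|' 'NF >= 4 {
        name = $(NF-2); tps = $(NF-1)
        gsub(/^ +| +$/, "", name)
        # keep data rows only: the test column looks like "pp 512" or "tg128"
        if (name ~ /^(pp|tg) *[0-9]+$/) {
            split(tps, parts, " ")   # parts[1] is the mean, before the "±"
            print name "\t" parts[1]
        }
    }'
}

# Real use: ./build/bin/llama-bench -m MODEL.gguf -p 512 -n 128 -ngl 99 | extract_tps
# Here the two sample rows from the table above are fed in instead.
printf '%s\n' \
  '| llama 3.2 3B Q4_K_M | 1.93 GiB | 3.21 B | Metal | 99 | pp 512 | 2100.45 ± 18.21 |' \
  '| llama 3.2 3B Q4_K_M | 1.93 GiB | 3.21 B | Metal | 99 | tg 128 | 68.32 ± 0.41 |' \
  | extract_tps
```

Each data row becomes one tab-separated line with the test name and the mean; the header and separator rows are skipped.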
Typical throughput by chip:
| Chip | pp (t/s) | tg (t/s) |
|---|---|---|
| M1 | ~1100 | ~40 |
| M2 Pro | ~1800 | ~60 |
| M3 Max | ~3500 | ~110 |
Prompt processing is compute-bound; generation is memory-bandwidth-bound at this model size, because every generated token has to stream the full set of weights from memory. Chips with higher memory bandwidth (M3 Max, M2 Ultra) pull ahead considerably on the tg row.
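The link between memory bandwidth and the tg row can be made concrete with a back-of-the-envelope ceiling: if each generated token reads every weight once, generation speed cannot exceed bandwidth divided by model size. The sketch below makes that assumption explicitly and ignores KV cache reads and kernel overheads, so real numbers land below it. The helper name tg_ceiling is hypothetical, and 200 GB/s is Apple's published memory bandwidth for M2 Pro.

```shell
# Sketch: bandwidth ceiling for generation = memory bandwidth / model size.
# Assumes every token reads all weights once; ignores KV cache and overheads.
# tg_ceiling is a hypothetical helper name.
tg_ceiling() {
    # $1: memory bandwidth in GB/s, $2: model size in GB
    awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

# 1.93 GiB is about 2.07 GB; 200 GB/s is the published figure for M2 Pro.
echo "M2 Pro ceiling: $(tg_ceiling 200 2.07) t/s"
```

A measured tg well below the ceiling is normal. A figure above it usually means the bandwidth or model size you plugged in is wrong.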
6. Sanity Check the Output
Benchmark numbers mean nothing if the model is generating garbage. Run a quick deterministic inference pass:
./build/bin/llama-cli \
-m ./models/llama-3.2-3b-q4km.gguf \
-ngl 99 \
-p "Explain the difference between a mutex and a semaphore in two sentences." \
-n 80 \
--temp 0.0
--temp 0.0 makes output deterministic. You should get a coherent, factually correct response. Repetitive tokens or incoherent output points to a conversion problem, not a quantization one.
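To confirm the determinism claim rather than take it on faith, you can run the same command twice and compare what it prints. This is a minimal sketch; same_output_twice is a hypothetical helper name, it compares stdout only, and the llama-cli call is skipped if the binary isn't present.

```shell
# Sketch: run the same command twice and compare stdout.
# same_output_twice is a hypothetical helper name.
same_output_twice() {
    first=$("$@" 2>/dev/null </dev/null)
    second=$("$@" 2>/dev/null </dev/null)
    [ "$first" = "$second" ]
}

if [ -x ./build/bin/llama-cli ]; then
    if same_output_twice ./build/bin/llama-cli \
        -m ./models/llama-3.2-3b-q4km.gguf -ngl 99 -n 80 --temp 0.0 \
        -p "Explain the difference between a mutex and a semaphore in two sentences."
    then
        echo "output is identical across runs"
    else
        echo "output differs between runs"
    fi
fi
```

Identical output only shows that sampling is deterministic; you still need to read the response to judge whether it is coherent.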
Verify It Works
A successful run checks three boxes:
- llama-bench outputs a two-row markdown table with non-zero t/s for both pp and tg
- llama-cli responds coherently to the mutex/semaphore prompt above
- Activity Monitor's GPU History view shows a spike in GPU usage during inference (open it via the Window menu in Activity Monitor)
Troubleshooting
**Crash or OOM immediately after "ggml_metal_init: allocating".** The model is larger than your available unified memory. Switch to Q4_0 (smaller than Q4_K_M) or use the 1B model (meta-llama/Llama-3.2-1B) instead.
**convert_hf_to_gguf.py fails with a KeyError or unknown architecture error.** Pull the latest llama.cpp (git pull, then rebuild). Llama 3.2 support was merged after the initial 3.x release cycle, so an older clone won't have it.
**llama-bench shows identical t/s with -ngl 0 and -ngl 99.** Metal isn't active. Re-run cmake -B build -DGGML_METAL=ON and look for GGML_METAL: enabled in the CMake output before building. If it says disabled, confirm your Xcode Command Line Tools are installed and up to date.
**Download returns 401 or 403.** You haven't been granted access to the gated repository. Accept the license on the model's HuggingFace page, wait a few minutes, and confirm your CLI token has read scope (huggingface-cli whoami).
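For the out-of-memory case above, a rough pre-check is to compare the model file size against the machine's physical memory. The sketch below is only a heuristic: fits_in_budget is a hypothetical helper name, and the 70% budget is an assumption chosen for illustration, not a documented Metal limit. It also ignores the KV cache and whatever else is running.

```shell
# Sketch: does the model file fit in a share of physical memory?
# fits_in_budget is a hypothetical helper; the 70% budget is an assumption,
# not a documented Metal limit.
fits_in_budget() {
    # $1: model size in bytes, $2: physical memory in bytes
    awk -v m="$1" -v total="$2" 'BEGIN { exit !(m <= total * 0.70) }'
}

model=./models/llama-3.2-3b-q4km.gguf
mem=$(sysctl -n hw.memsize 2>/dev/null)   # macOS only; empty elsewhere
if [ -f "$model" ] && [ -n "$mem" ]; then
    size=$(wc -c < "$model")
    if fits_in_budget "$size" "$mem"; then
        echo "model should fit in memory"
    else
        echo "model is likely too large for this machine"
    fi
fi
```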
Next Steps
- Run llama-perplexity against wikitext-2 to measure quality loss across quant levels objectively, rather than just vibes-checking the output.
- llama-server exposes an OpenAI-compatible HTTP API on localhost:8080, so you can point existing tooling at your local model without code changes.
- Experiment with longer contexts in llama-bench, for example -p 4096,8192,16384 or a combined prompt-plus-generation test such as -pg 4096,128, to see how KV cache growth affects speed as context length increases.
- Compare the 1B model at Q4_K_M for workloads where raw throughput matters more than capability.
Mariana Souza · Senior Editor
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.