# Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp

> Source: <https://www.devclubhouse.com/a/quantize-and-run-llama-32-on-apple-silicon-with-llamacpp>
> Published: 2026-06-25 17:36:18+00:00

Build llama.cpp with Metal, convert Llama 3.2 3B to Q4_K_M GGUF, and benchmark real prompt-processing and generation throughput on your specific chip.

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)

## What You'll Build

By the end of this tutorial you'll have Llama 3.2 3B running locally with Metal GPU acceleration, quantized to Q4_K_M, with `llama-bench` output showing real prompt-processing and generation throughput for your specific chip.

## Prerequisites

- Apple Silicon Mac (M1 or later) running macOS 13 Ventura or newer
- Xcode Command Line Tools: `xcode-select --install`
- CMake 3.14+: `brew install cmake`
- Python 3.10+
- Git
- A HuggingFace account with Llama 3.2 access approved (visit the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) page and accept the Meta license)
- ~12 GB free disk space

Intel Mac users can follow along but won't get Metal acceleration; omit all `-ngl` flags and expect significantly lower throughput.

## 1. Clone and Compile llama.cpp with Metal

Use a recent commit. Llama 3.2 architecture support and the GGML refactor are both in the main branch as of late 2024.

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_METAL=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
```

`GGML_METAL=ON` is the default on macOS in recent llama.cpp, but being explicit prevents a silent fallback if you're on an unusual CMake config. The build takes 2-3 minutes on an M2. Verify the key binaries landed:

```
ls build/bin/llama-cli build/bin/llama-bench build/bin/llama-quantize
```

## 2. Download Llama 3.2 3B

Set up an isolated Python environment, then pull the model weights:

```
python3 -m venv .venv
source .venv/bin/activate
pip install huggingface-hub
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B \
  --local-dir ./models/Llama-3.2-3B \
  --exclude "original/*"
```

The `--exclude "original/*"` flag drops Meta's consolidated checkpoint format and keeps only the HuggingFace safetensors, shaving ~3 GB off the download. Total download is roughly 6.5 GB.

## 3. Convert to GGUF (F16)

Install the conversion dependencies from llama.cpp's own requirements file, then run the converter. The requirements file lives under `requirements/` after a repository reorganization:

```
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py ./models/Llama-3.2-3B \
  --outfile ./models/llama-3.2-3b-f16.gguf \
  --outtype f16
```

This produces a single ~6.4 GB GGUF file containing the tokenizer, config, and half-precision weights. Keep it around: it's a reusable staging artifact you can re-quantize to multiple formats without re-running this step.
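Before quantizing, it can be worth confirming that the converter wrote a well-formed file. The sketch below reads only the fixed-size GGUF header. It assumes the layout used by GGUF version 2 and later (a 4-byte `GGUF` magic, a little-endian `uint32` version, then two `uint64` counts); the function name `read_gguf_header` is made up for this example and is not part of llama.cpp.

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ header layout: magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ" = little-endian uint32 followed by two uint64 values (20 bytes).
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


if __name__ == "__main__":
    # Path taken from the conversion step above; skipped if the file is absent.
    model = Path("./models/llama-3.2-3b-f16.gguf")
    if model.exists():
        print(read_gguf_header(model))
```

A `ValueError` here, or a tensor count of zero, suggests the conversion was interrupted and should be re-run before quantizing.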

## 4. Quantize to Q4_K_M

```
./build/bin/llama-quantize \
  ./models/llama-3.2-3b-f16.gguf \
  ./models/llama-3.2-3b-q4km.gguf \
  Q4_K_M
```

Q4_K_M is 4-bit with k-quant grouping: it applies higher bit-width to the layers most sensitive to quantization error (certain attention and feed-forward projections), trading a modest size increase for noticeably better output quality compared to plain Q4_0. This runs in under 90 seconds on an M2 Pro.
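To make the idea of block-wise 4-bit quantization concrete, here is a deliberately simplified sketch: one shared scale per block and signed 4-bit integers per weight. It is an illustration only, not llama.cpp's Q4_0 or Q4_K_M layout. The real formats use fixed block sizes (32 weights, grouped into 256-weight super-blocks for k-quants) and store the scales in compact form. The function names and sample values are invented for the example.

```python
def quantize_block(values):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7 so every value fits in the range [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored scale and 4-bit integers."""
    return [q * scale for q in quants]


# Hypothetical weights standing in for one block of a tensor.
block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.29]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} quants={quants} worst_error={worst:.4f}")
```

The worst-case error per weight is half the scale, which is why a single outlier in a block hurts every other weight in it. Spending more bits on the most sensitive tensors, as Q4_K_M does, is one way to limit that damage.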

Common quant options for the 3B model:

| Format | Approx. size | Notes |
|---|---|---|
| Q4_0 | ~1.8 GB | Fastest, lowest quality |
| Q4_K_M | ~2.0 GB | Best 4-bit tradeoff, use this |
| Q5_K_M | ~2.3 GB | Marginal quality gain, slower |
| Q8_0 | ~3.4 GB | Near-lossless, good for evals |
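The sizes in the table can be roughly reproduced from the parameter count and the bits each format spends per weight. The sketch below assumes every tensor uses the same type, which real files do not quite do (some tensors are kept at higher precision), so treat it as a back-of-envelope check. The k-quant mixes are left out because their average bits per weight depends on the model.

```python
PARAMS = 3.21e9  # parameter count reported by llama-bench in this tutorial

# Bits per weight implied by the block layout.
# Assumption: every tensor in the file uses this one type.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights + one fp16 scale = 34 bytes per block
    "Q4_0": 4.5,  # 32 4-bit weights + one fp16 scale = 18 bytes per block
}


def estimate_size_gb(params, bits_per_weight):
    """Estimated file size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(PARAMS, bpw):.1f} GB")
# F16: ~6.4 GB
# Q8_0: ~3.4 GB
# Q4_0: ~1.8 GB
```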

## 5. Run the Inference Benchmark

`llama-bench` tests both prompt processing (prefill) and token generation (decode) throughput, running each scenario multiple times and reporting mean and standard deviation:

```
./build/bin/llama-bench \
  -m ./models/llama-3.2-3b-q4km.gguf \
  -p 512 \
  -n 128 \
  -ngl 99
```

`-ngl 99` offloads all transformer layers to Metal. Llama 3.2 3B has 28 layers, so 99 is effectively "all of them." `-p 512` processes a 512-token synthetic prompt; `-n 128` generates 128 tokens. Expected output:

```
| model                 |       size |     params | backend | ngl |    test |              t/s |
|-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:|
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  pp 512 |  2100.45 ± 18.21 |
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  tg 128 |    68.32 ±  0.41 |
```
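If you want to log results across chips or quant levels, the table can be turned into numbers with a few lines of Python. This is a minimal sketch that assumes the column layout shown above; `parse_bench_table` is a name made up for the example, and the sample text is copied from this tutorial's expected output, not from a fresh run.

```python
def parse_bench_table(markdown):
    """Parse llama-bench's markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines()
             if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        row = dict(zip(header, cells))
        # The t/s column looks like "2100.45 ± 18.21".
        mean, _, stddev = row["t/s"].partition("±")
        row["mean_tps"] = float(mean)
        row["stddev_tps"] = float(stddev) if stddev.strip() else 0.0
        rows.append(row)
    return rows


SAMPLE = """
| model                 |       size |     params | backend | ngl |    test |              t/s |
|-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:|
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  pp 512 |  2100.45 ± 18.21 |
| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  tg 128 |    68.32 ±  0.41 |
"""

for row in parse_bench_table(SAMPLE):
    print(row["test"], row["mean_tps"], row["stddev_tps"])
# pp 512 2100.45 18.21
# tg 128 68.32 0.41
```

`llama-bench` also has an `-o` option for machine-readable output formats, which is likely a sturdier route for automation; the sketch parses the default table only because that is what this tutorial shows.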

Typical throughput by chip:

| Chip | pp (t/s) | tg (t/s) |
|---|---|---|
| M1 | ~1100 | ~40 |
| M2 Pro | ~1800 | ~60 |
| M3 Max | ~3500 | ~110 |

Prompt processing is compute-bound; generation is memory-bandwidth-bound at this model size, because producing each token requires reading essentially all of the weights from memory. Chips with higher memory bandwidth (M3 Max, M2 Ultra) pull ahead considerably on the tg row.
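The link between memory bandwidth and the tg row can be sketched as a rough ceiling: if every generated token has to read all the weights once, tokens per second cannot exceed bandwidth divided by model size. The bandwidth values below are placeholders, not measurements or spec-sheet figures for any particular chip, and real throughput lands below the ceiling because of KV cache reads and other overhead.

```python
GIB = 1024 ** 3


def decode_ceiling_tps(bandwidth_gb_per_s, model_size_gib):
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    model_bytes = model_size_gib * GIB
    return bandwidth_gb_per_s * 1e9 / model_bytes


# 1.93 GiB is the Q4_K_M size from the llama-bench table above.
# Bandwidths are hypothetical; substitute your chip's published figure.
for bandwidth in (100, 200, 400):
    ceiling = decode_ceiling_tps(bandwidth, 1.93)
    print(f"{bandwidth} GB/s -> at most ~{ceiling:.0f} t/s")
```

If your measured tg figure sits far below the ceiling for your chip, that is a hint to check that all layers are actually offloaded.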

## 6. Sanity Check the Output

Benchmark numbers mean nothing if the model is generating garbage. Run a quick deterministic inference pass:

```
./build/bin/llama-cli \
  -m ./models/llama-3.2-3b-q4km.gguf \
  -ngl 99 \
  -p "Explain the difference between a mutex and a semaphore in two sentences." \
  -n 80 \
  --temp 0.0
```

`--temp 0.0` makes output deterministic. You should get a coherent, factually correct response. Repetitive tokens or incoherent output points to a conversion problem, not a quantization one.
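For a scripted version of the "repetitive tokens" check, one crude heuristic is to measure how many word trigrams in the output repeat an earlier trigram. This is a sketch of a generic text heuristic, not anything llama.cpp provides. The function name is made up, and any cutoff you pick is an assumption to tune by eye.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


healthy = "A mutex allows only one thread to hold the lock at a time"
looping = "the the the the the the the the"
print(round(repeated_ngram_ratio(healthy), 2))  # 0.0
print(round(repeated_ngram_ratio(looping), 2))  # 0.83
```

A ratio near zero is what you would hope for on a short answer like the mutex prompt; a ratio close to one means the model is stuck in a loop.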

## Verify It Works

A successful run checks three boxes:

- `llama-bench` outputs a two-row markdown table with non-zero t/s for both pp and tg
- `llama-cli` responds coherently to the mutex/semaphore prompt above
- Activity Monitor's GPU History view shows a spike in GPU usage during inference (open it via the Window menu in Activity Monitor)

## Troubleshooting

**Crash or OOM immediately after "ggml_metal_init: allocating".** The model is larger than your available unified memory. Switch to Q4_0 (smaller than Q4_K_M) or use the 1B model (`meta-llama/Llama-3.2-1B`) instead.

**convert_hf_to_gguf.py fails with a KeyError or unknown architecture error.** Pull the latest llama.cpp (`git pull`, then rebuild). Llama 3.2 support was merged after the initial 3.x release cycle, so an older clone won't have it.

**llama-bench shows identical t/s with -ngl 0 and -ngl 99.** Metal isn't active. Re-run `cmake -B build -DGGML_METAL=ON` and look for `GGML_METAL: enabled` in the CMake output before building. If it says disabled, confirm your Xcode Command Line Tools are installed and up to date.

**Download returns 401 or 403.** You haven't been granted access to the gated repository. Accept the license on the model's HuggingFace page, wait a few minutes, and confirm your CLI token has `read` scope (`huggingface-cli whoami`).

## Next Steps

- Run `llama-perplexity` against wikitext-2 to measure quality loss across quant levels objectively, rather than just vibes-checking the output.
- `llama-server` exposes an OpenAI-compatible HTTP API on `localhost:8080`, so you can point existing tooling at your local model without code changes.
- Experiment with longer contexts (4096, 8192, 16384) in `llama-bench` to see how KV cache growth affects generation speed as context length increases. Note that `llama-bench` does not take `--ctx-size`; recent builds provide a depth option (`-d`) that prefills the context before measuring, so check `llama-bench --help` for what your build supports.
- Compare the 1B model at Q4_K_M for workloads where raw throughput matters more than capability.
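To get a feel for the KV cache growth mentioned above before benchmarking, you can estimate its size from the model's attention shape. The layer count of 28 comes from this tutorial. The 8 KV heads and head dimension of 128 are assumptions based on the published Llama 3.2 3B configuration, so confirm them against `config.json` in your download. The sketch also assumes an unquantized f16 cache (2 bytes per value).

```python
def kv_cache_bytes(n_ctx, n_layers=28, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed for the K and V caches of one sequence of n_ctx tokens."""
    # Factor of 2: one cache for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


for n_ctx in (4096, 8192, 16384):
    print(f"ctx {n_ctx}: ~{kv_cache_bytes(n_ctx) / 1024**2:.0f} MiB")
# ctx 4096: ~448 MiB
# ctx 8192: ~896 MiB
# ctx 16384: ~1792 MiB
```

At 16384 tokens the cache is close to the size of the Q4_K_M weights themselves, which is one reason generation speed tends to drop as the context fills.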

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.

