Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp

Mariana Souza published a tutorial on quantizing and running Meta's Llama 3.2 3B model on Apple Silicon using llama.cpp with Metal GPU acceleration, achieving local inference with Q4_K_M quantization. The guide covers building llama.cpp, downloading the model, converting to GGUF format, and benchmarking throughput on specific chips.

Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp Build llama.cpp with Metal, convert Llama 3.2 3B to Q4 K M GGUF, and benchmark real prompt-processing and generation throughput on your specific chip. Mariana Souza https://www.devclubhouse.com/u/mariana souza What You'll Build By the end of this tutorial you'll have Llama 3.2 3B running locally with Metal GPU acceleration, quantized to Q4 K M, with llama-bench output showing real prompt-processing and generation throughput for your specific chip. Prerequisites - Apple Silicon Mac M1 or later running macOS 13 Ventura or newer - Xcode Command Line Tools: xcode-select --install - CMake 3.14+: brew install cmake - Python 3.10+ - Git - A HuggingFace account with Llama 3.2 access approved visit the meta-llama/Llama-3.2-3B https://huggingface.co/meta-llama/Llama-3.2-3B page and accept the Meta license - ~12 GB free disk space Intel Mac users can follow along but won't get Metal acceleration; omit all -ngl flags and expect significantly lower throughput. 1. Clone and Compile llama.cpp with Metal Use a recent commit. Llama 3.2 architecture support and the GGML refactor are both in the main branch as of late 2024. git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build \ -DGGML METAL=ON \ -DCMAKE BUILD TYPE=Release cmake --build build --config Release -j $ sysctl -n hw.logicalcpu GGML METAL=ON is the default on macOS in recent llama.cpp, but being explicit prevents a silent fallback if you're on an unusual CMake config. Build takes 2-3 minutes on an M2. Verify the key binaries landed: ls build/bin/llama-cli build/bin/llama-bench build/bin/llama-quantize 2. Download Llama 3.2 3B Set up an isolated Python environment, then pull the model weights: python3 -m venv .venv source .venv/bin/activate pip install huggingface-hub huggingface-cli login huggingface-cli download meta-llama/Llama-3.2-3B \ --local-dir ./models/Llama-3.2-3B \ --exclude "original/ " The --exclude "original/ " drops Meta's consolidated checkpoint format and keeps only the HuggingFace safetensors, shaving ~3 GB off the download. Total download is roughly 6.5 GB. 3. Convert to GGUF F16 Install the conversion dependencies from llama.cpp's own requirements file, then run the converter. The requirements file lives under requirements/ after a repository reorganization: pip install -r requirements/requirements-convert hf to gguf.txt python convert hf to gguf.py ./models/Llama-3.2-3B \ --outfile ./models/llama-3.2-3b-f16.gguf \ --outtype f16 This produces a single ~6.4 GB GGUF file containing the tokenizer, config, and half-precision weights. Keep it around: it's a reusable staging artifact you can re-quantize to multiple formats without re-running this step. 4. Quantize to Q4 K M ./build/bin/llama-quantize \ ./models/llama-3.2-3b-f16.gguf \ ./models/llama-3.2-3b-q4km.gguf \ Q4 K M Q4 K M is 4-bit with k-quant grouping: it applies higher bit-width to the layers most sensitive to quantization error certain attention and feed-forward projections , trading a modest size increase for noticeably better output quality compared to plain Q4 0. This runs in under 90 seconds on an M2 Pro. Common quant options for the 3B model: | Format | Approx. size | Notes | |---|---|---| | Q4 0 | ~1.8 GB | Fastest, lowest quality | | Q4 K M | ~2.0 GB | Best 4-bit tradeoff, use this | | Q5 K M | ~2.3 GB | Marginal quality gain, slower | | Q8 0 | ~3.4 GB | Near-lossless, good for evals | 5. Run the Inference Benchmark llama-bench tests both prompt processing prefill and token generation decode throughput, running each scenario multiple times and reporting mean and standard deviation: ./build/bin/llama-bench \ -m ./models/llama-3.2-3b-q4km.gguf \ -p 512 \ -n 128 \ -ngl 99 -ngl 99 offloads all transformer layers to Metal. Llama 3.2 3B has 28 layers, so 99 is effectively "all of them." -p 512 processes a 512-token synthetic prompt; -n 128 generates 128 tokens. Expected output: | model | size | params | backend | ngl | test | t/s | |-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:| | llama 3.2 3B Q4 K M | 1.93 GiB | 3.21 B | Metal | 99 | pp 512 | 2100.45 ± 18.21 | | llama 3.2 3B Q4 K M | 1.93 GiB | 3.21 B | Metal | 99 | tg 128 | 68.32 ± 0.41 | Typical throughput by chip: | Chip | pp t/s | tg t/s | |---|---|---| | M1 | ~1100 | ~40 | | M2 Pro | ~1800 | ~60 | | M3 Max | ~3500 | ~110 | Prompt processing is memory-bandwidth-bound; generation is compute-bound at this model size. Chips with higher memory bandwidth M3 Max, M2 Ultra pull ahead considerably on the tg row. 6. Sanity Check the Output Benchmark numbers mean nothing if the model is generating garbage. Run a quick deterministic inference pass: ./build/bin/llama-cli \ -m ./models/llama-3.2-3b-q4km.gguf \ -ngl 99 \ -p "Explain the difference between a mutex and a semaphore in two sentences." \ -n 80 \ --temp 0.0 --temp 0.0 makes output deterministic. You should get a coherent, factually correct response. Repetitive tokens or incoherent output points to a conversion problem, not a quantization one. Verify It Works A successful run checks three boxes: llama-bench outputs a two-row markdown table with non-zero t/s for both pp and tg llama-cli responds coherently to the mutex/semaphore prompt above- Activity Monitor's GPU History view shows a spike in GPU usage during inference open it via Window menu in Activity Monitor Troubleshooting Crash or OOM immediately after "ggml metal init: allocating". The model is larger than your available unified memory. Switch to Q4 0 smaller than Q4 K M or use the 1B model meta-llama/Llama-3.2-1B instead. convert hf to gguf.py fails with a KeyError or unknown architecture error. Pull the latest llama.cpp git pull then rebuild . Llama 3.2 support was merged after the initial 3.x release cycle, so an older clone won't have it. llama-bench shows identical t/s with -ngl 0 and -ngl 99. Metal isn't active. Re-run cmake -B build -DGGML METAL=ON and look for GGML METAL: enabled in the CMake output before building. If it says disabled, confirm your Xcode Command Line Tools are installed and up to date. Download returns 401 or 403. You haven't been granted access to the gated repository. Accept the license on the model's HuggingFace page, wait a few minutes, and confirm your CLI token has read scope huggingface-cli whoami . Next Steps - Run llama-perplexity against wikitext-2 to measure quality loss across quant levels objectively, rather than just vibes-checking the output. llama-server exposes an OpenAI-compatible HTTP API on localhost:8080 , so you can point existing tooling at your local model without code changes.- Experiment with --ctx-size values 4096, 8192, 16384 in llama-bench to see how KV cache growth affects generation speed as context length increases. - Compare the 1B model at Q4 K M for workloads where raw throughput matters more than capability. Mariana Souza https://www.devclubhouse.com/u/mariana souza · Senior Editor Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon. Discussion 0 No comments yet Be the first to weigh in.