{"slug": "quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp", "title": "Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp", "summary": "Mariana Souza published a tutorial on quantizing and running Meta's Llama 3.2 3B model on Apple Silicon using llama.cpp with Metal GPU acceleration, achieving local inference with Q4_K_M quantization. The guide covers building llama.cpp, downloading the model, converting to GGUF format, and benchmarking throughput on specific chips.", "body_md": "# Quantize and Run Llama 3.2 on Apple Silicon with llama.cpp\n\nBuild llama.cpp with Metal, convert Llama 3.2 3B to Q4_K_M GGUF, and benchmark real prompt-processing and generation throughput on your specific chip.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)\n\n## What You'll Build\n\nBy the end of this tutorial you'll have Llama 3.2 3B running locally with Metal GPU acceleration, quantized to Q4_K_M, with `llama-bench` output showing real prompt-processing and generation throughput for your specific chip.\n\n## Prerequisites\n\n- Apple Silicon Mac (M1 or later) running macOS 13 Ventura or newer\n- Xcode Command Line Tools: `xcode-select --install`\n- CMake 3.14+: `brew install cmake`\n- Python 3.10+\n- Git\n- A HuggingFace account with Llama 3.2 access approved (visit the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) page and accept the Meta license)\n- ~15 GB free disk space (roughly 6.5 GB for the download, 6.4 GB for the F16 GGUF, and 2 GB for the quantized file)\n\nIntel Mac users can follow along but won't get Metal acceleration; omit all `-ngl` flags and expect significantly lower throughput.\n\n## 1. Clone and Compile llama.cpp with Metal\n\nUse a recent commit. 
Llama 3.2 architecture support and the GGML refactor are both in the main branch as of late 2024.\n\n```\ngit clone https://github.com/ggerganov/llama.cpp\ncd llama.cpp\ncmake -B build \\\n  -DGGML_METAL=ON \\\n  -DCMAKE_BUILD_TYPE=Release\ncmake --build build --config Release -j $(sysctl -n hw.logicalcpu)\n```\n\n`GGML_METAL=ON` is the default on macOS in recent llama.cpp, but being explicit prevents a silent fallback if you're on an unusual CMake config. The build takes 2-3 minutes on an M2. Verify the key binaries landed:\n\n```\nls build/bin/llama-cli build/bin/llama-bench build/bin/llama-quantize\n```\n\n## 2. Download Llama 3.2 3B\n\nSet up an isolated Python environment, then pull the model weights:\n\n```\npython3 -m venv .venv\nsource .venv/bin/activate\npip install huggingface-hub\nhuggingface-cli login\nhuggingface-cli download meta-llama/Llama-3.2-3B \\\n  --local-dir ./models/Llama-3.2-3B \\\n  --exclude \"original/*\"\n```\n\nThe `--exclude \"original/*\"` filter drops Meta's consolidated checkpoint format and keeps only the HuggingFace safetensors, shaving ~3 GB off the download. Total download is roughly 6.5 GB.\n\n## 3. Convert to GGUF (F16)\n\nInstall the conversion dependencies from llama.cpp's own requirements file, then run the converter. The requirements file lives under `requirements/` after a repository reorganization:\n\n```\npip install -r requirements/requirements-convert_hf_to_gguf.txt\npython convert_hf_to_gguf.py ./models/Llama-3.2-3B \\\n  --outfile ./models/llama-3.2-3b-f16.gguf \\\n  --outtype f16\n```\n\nThis produces a single ~6.4 GB GGUF file containing the tokenizer, config, and half-precision weights. Keep it around: it's a reusable staging artifact you can re-quantize to multiple formats without re-running this step.\n\n## 4. 
Quantize to Q4_K_M\n\n```\n./build/bin/llama-quantize \\\n  ./models/llama-3.2-3b-f16.gguf \\\n  ./models/llama-3.2-3b-q4km.gguf \\\n  Q4_K_M\n```\n\nQ4_K_M is 4-bit with k-quant grouping: it applies a higher bit-width to the tensors most sensitive to quantization error (certain attention and feed-forward projections), trading a modest size increase for noticeably better output quality compared to plain Q4_0. Quantization runs in under 90 seconds on an M2 Pro.\n\nCommon quant options for the 3B model:\n\n| Format | Approx. size | Notes |\n|---|---|---|\n| Q4_0 | ~1.8 GB | Fastest, lowest quality |\n| Q4_K_M | ~2.0 GB | Best 4-bit tradeoff, use this |\n| Q5_K_M | ~2.3 GB | Marginal quality gain, slower |\n| Q8_0 | ~3.4 GB | Near-lossless, good for evals |\n\n## 5. Run the Inference Benchmark\n\n`llama-bench` tests both prompt processing (prefill) and token generation (decode) throughput, running each scenario multiple times and reporting mean and standard deviation:\n\n```\n./build/bin/llama-bench \\\n  -m ./models/llama-3.2-3b-q4km.gguf \\\n  -p 512 \\\n  -n 128 \\\n  -ngl 99\n```\n\n`-ngl 99` offloads all transformer layers to Metal. Llama 3.2 3B has 28 layers, so 99 is effectively \"all of them.\" `-p 512` processes a 512-token synthetic prompt; `-n 128` generates 128 tokens. Example output (exact numbers vary by chip):\n\n```\n| model                 |       size |     params | backend | ngl |    test |              t/s |\n|-----------------------|-----------:|-----------:|---------|----:|--------:|-----------------:|\n| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  pp 512 |  2100.45 ± 18.21 |\n| llama 3.2 3B Q4_K_M  |   1.93 GiB |     3.21 B | Metal   |  99 |  tg 128 |    68.32 ±  0.41 |\n```\n\nTypical throughput by chip:\n\n| Chip | pp (t/s) | tg (t/s) |\n|---|---|---|\n| M1 | ~1100 | ~40 |\n| M2 Pro | ~1800 | ~60 |\n| M3 Max | ~3500 | ~110 |\n\nPrompt processing is compute-bound; generation is memory-bandwidth-bound at this model size, because each generated token requires reading the full set of weights from memory. 
Chips with higher memory bandwidth (M3 Max, M2 Ultra) pull ahead considerably on the tg row.\n\n## 6. Sanity Check the Output\n\nBenchmark numbers mean nothing if the model is generating garbage. Run a quick deterministic inference pass:\n\n```\n./build/bin/llama-cli \\\n  -m ./models/llama-3.2-3b-q4km.gguf \\\n  -ngl 99 \\\n  -p \"Explain the difference between a mutex and a semaphore in two sentences.\" \\\n  -n 80 \\\n  --temp 0.0\n```\n\n`--temp 0.0` makes output deterministic. You should get a coherent, factually correct response. Repetitive tokens or incoherent output points to a conversion problem, not a quantization one.\n\n## Verify It Works\n\nA successful run checks three boxes:\n\n- `llama-bench` outputs a two-row markdown table with non-zero t/s for both pp and tg\n- `llama-cli` responds coherently to the mutex/semaphore prompt above\n- Activity Monitor's GPU History view shows a spike in GPU usage during inference (open it via the Window menu in Activity Monitor)\n\n## Troubleshooting\n\n**Crash or OOM immediately after \"ggml_metal_init: allocating\".** The model is larger than your available unified memory. Switch to Q4_0 (smaller than Q4_K_M) or use the 1B model (`meta-llama/Llama-3.2-1B`) instead.\n\n**`convert_hf_to_gguf.py` fails with a KeyError or unknown architecture error.** Pull the latest llama.cpp (`git pull`, then rebuild). Llama 3.2 support was merged after the initial 3.x release cycle, so an older clone won't have it.\n\n**`llama-bench` shows identical t/s with `-ngl 0` and `-ngl 99`.** Metal isn't active. Re-run `cmake -B build -DGGML_METAL=ON` and look for `GGML_METAL: enabled` in the CMake output before building. If it says disabled, confirm your Xcode Command Line Tools are installed and up to date.\n\n**Download returns 401 or 403.** You haven't been granted access to the gated repository. 
Accept the license on the model's HuggingFace page, wait a few minutes, and confirm your CLI token has `read` scope (`huggingface-cli whoami`).\n\n## Next Steps\n\n- Run `llama-perplexity` against wikitext-2 to measure quality loss across quant levels objectively, rather than just vibes-checking the output.\n- `llama-server` exposes an OpenAI-compatible HTTP API on `localhost:8080`, so you can point existing tooling at your local model without code changes.\n- Experiment with longer contexts (4096, 8192, 16384 tokens) to see how KV cache growth affects generation speed as context length increases. `llama-cli` and `llama-server` take `--ctx-size`; `llama-bench` has no such flag and instead sizes the context from its `-p` and `-n` values.\n- Compare the 1B model at Q4_K_M for workloads where raw throughput matters more than capability.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor\n\nMariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.", "url": "https://wpnews.pro/news/quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp", "canonical_source": "https://www.devclubhouse.com/a/quantize-and-run-llama-32-on-apple-silicon-with-llamacpp", "published_at": "2026-06-25 17:36:18+00:00", "updated_at": "2026-06-25 17:46:58.335435+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["Mariana Souza", "Meta", "Llama 3.2", "llama.cpp", "Apple Silicon", "Metal", "HuggingFace", "GGUF"], "alternates": {"html": "https://wpnews.pro/news/quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp", "markdown": "https://wpnews.pro/news/quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp.md", "text": "https://wpnews.pro/news/quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp.txt", "jsonld": 
"https://wpnews.pro/news/quantize-and-run-llama-3-2-on-apple-silicon-with-llama-cpp.jsonld"}}