{"slug": "how-to-setup-a-local-coding-agent-on-macos", "title": "How to Setup a Local Coding Agent on macOS", "summary": "A developer successfully configured a local coding agent on macOS using Gemma 4 26B-A4B and Qwen3.6 35B-A3B models with llama.cpp, achieving 72.2 tokens per second generation speed through MTP speculative decoding. The setup, tested on an Apple M1 Max with 64GB memory, provides an OpenAI-compatible API with multimodal support for screenshots, enabling offline coding assistance when internet access is unavailable.", "body_md": "# How to Setup a Local Coding Agent on macOS\n\n## Running Gemma 4 26B-A4B and Qwen3.6 35B-A3B locally with llama.cpp, MTP speculative decoding, multimodal support, and PI as a coding agent.\n\nI'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the [\"Gemma 4 now runs 2x faster with MTP\"](https://x.com/UnslothAI/status/2065107734916432189) Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.\n\nI wanted a local coding agent setup that:\n\n- was fast enough to actually use on my Mac\n- worked through an OpenAI compatible API (so I could use it in other tools)\n- and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.\n\nAnd I did! This video is realtime. 
It shows the agent responding at a perfectly usable speed.\n\nAfter a bit of testing, the final setup I ended up with is:\n\n- [llama.cpp](https://github.com/ggml-org/llama.cpp) built with Metal on macOS\n- Gemma 4 26B-A4B in GGUF format\n- A Q8 MTP draft model for speculative decoding\n- The Gemma 4 multimodal projector\n- [Pi](https://github.com/earendil-works/pi) as the terminal coding agent\n\nThis was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.\n\n# The Model\n\nThe main model is `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf`.\n\nLink on Hugging Face: [models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf)\n\nThat file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.\n\nThe benchmark prompt was:\n\n```\nWrite a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.\n```\n\nEach benchmark generated about 128 tokens.\n\n# Baseline: llama.cpp + Metal\n\nFirst I ran the main model directly through llama.cpp with Metal acceleration:\n\n```\nrepos/llama.cpp/build/bin/llama-cli \\\n  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \\\n  -ngl 999 \\\n  -fa on \\\n  -c 4096 \\\n  -n 128\n```\n\nResult:\n\n| Setup | Prompt tok/s | Generation tok/s |\n|---|---|---|\n| Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 |\n\n58 tokens/second is not fast, but it is usable. For coding-agent work, though, you want it to be as fast as possible, especially when the agent is making many tool calls.\n\n# Adding the MTP Draft Model\n\nGemma 4 now has the [MTP draft model available](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf):\n\n```\nMTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf\n```\n\nThis can be loaded by llama.cpp as a speculative draft 
model:\n\n```\nrepos/llama.cpp/build/bin/llama-cli \\\n  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \\\n  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \\\n  --spec-type draft-mtp \\\n  --spec-draft-n-max 3 \\\n  -ngl 999 \\\n  -fa on \\\n  -c 4096 \\\n  -n 128\n```\n\nThe first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on [How to Run MTP Models](https://unsloth.ai/docs/models/mtp) includes this note:\n\n\"We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system.\"\n\nAfter sweeping `--spec-draft-n-max`, the best result was 72.2 tokens/second with 3 draft tokens.\n\n| Setup | Prompt tok/s | Generation tok/s | Speedup |\n|---|---|---|---|\n| Main model only | 298.0 | 58.2 | 1.00x |\n| Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x |\n\nThe useful part is that prompt processing stayed basically the same, while generation improved by about 24%.\n\n# Tuning MTP\n\nI tested `--spec-draft-n-max` values from 1 to 6.\n\n| `--spec-draft-n-max` | Prompt tok/s | Generation tok/s |\n|---|---|---|\n| 1 | 295.5 | 68.4 |\n| 2 | 299.1 | 72.0 |\n| 3 | 295.6 | 72.2 |\n| 4 | 297.3 | 70.7 |\n| 5 | 297.9 | 63.7 |\n| 6 | 296.3 | 61.2 |\n\nOn my M1 Max machine, `3` was the fastest, with `2` close enough that either would be fine. 
Values above that got slower.\n\n# MLX Comparison\n\nI also tested MLX models through `mlx-lm` to find out which is the faster way to run the model on a Mac: llama.cpp or MLX.\n\n| Runtime | Model | Generation tok/s |\n|---|---|---|\n| llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 |\n| llama.cpp Metal | Unsloth GGUF Q4 | 58.2 |\n| MLX-LM | Unsloth UD MLX 4-bit | 45.8 |\n| MLX-LM | mlx-community 4-bit | 43.9 |\n| MLX-LM | mlx-community OptiQ 4-bit | 38.1 |\n\nI thought MLX (being optimised for the Mac) would be fastest.\n\nHowever, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option.\n\nI guess all the effort and tweaking that has gone into llama.cpp over time means it is quite well optimised for macOS despite being cross-platform.\n\nI also tried Gemma 4 MTP through [gemma-4-swift-mlx](https://github.com/VincentGourbin/gemma-4-swift-mlx), but the tested 26B 4-bit MLX checkpoints did not match the loader's expected weight keys. I already had the previous MLX results, so I moved on rather than download new models and try to tweak things to match.\n\n# Adding Image Support\n\nFor Pi, I also wanted to be able to attach screenshots. 
The local model entry I set up for it originally declared the model as text-only:\n\n```\n\"input\": [\"text\"]\n```\n\nThat meant Pi did not send image tool output through to the model properly.\n\nThe llama.cpp server also needs the Gemma 4 multimodal projector in order for the multimodal part to work (only [the 12B is natively multimodal](https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/)):\n\n```\nmmproj-BF16.gguf\n```\n\nWhen loaded with `--mmproj`, llama.cpp advertises multimodal support, and Pi can send images.\n\nI re-ran the text benchmark with the projector loaded, just to check it didn't change the speed:\n\n| Setup | Projector | Prompt tok/s | Generation tok/s |\n|---|---|---|---|\n| llama.cpp Metal + MTP | none | 120.3 | 71.4 |\n| llama.cpp Metal + MTP | `mmproj-BF16.gguf` | 297.4 | 72.2 |\n\nThe final run with the projector did not show a text-generation slowdown.\n\nNow for setup instructions:\n\n# Install llama.cpp\n\nInstall dependencies:\n\n```\nbrew install cmake git tmux python@3.11\n```\n\nClone and build llama.cpp:\n\n```\nmkdir -p ~/Developer/ML-Models/Gemma4/repos\ncd ~/Developer/ML-Models/Gemma4\n\ngit clone https://github.com/ggml-org/llama.cpp repos/llama.cpp\n\ncd repos/llama.cpp\ncmake -B build \\\n  -DCMAKE_BUILD_TYPE=Release \\\n  -DGGML_METAL=ON \\\n  -DGGML_ACCELERATE=ON\n\ncmake --build build --config Release -j\n```\n\nThe build I tested had:\n\n```\nGGML_METAL=ON\nGGML_ACCELERATE=ON\nGGML_BLAS=ON\nGGML_BLAS_VENDOR=Apple\n```\n\n# Download the Model Files\n\nCreate a Python environment:\n\n```\ncd ~/Developer/ML-Models/Gemma4\npython3.11 -m venv .venv\nsource .venv/bin/activate\npip install -U huggingface_hub hf_xet\n```\n\nDownload the files:\n\n```\nmkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF\n\nhuggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \\\n  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \\\n  mmproj-BF16.gguf \\\n  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \\\n  --local-dir 
models/unsloth-gemma-4-26B-A4B-it-GGUF\n```\n\nYou should end up with:\n\n```\nmodels/unsloth-gemma-4-26B-A4B-it-GGUF/\n  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf\n  mmproj-BF16.gguf\n  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf\n```\n\n# Start the Local Server\n\nThis is the final server command:\n\n```\nrepos/llama.cpp/build/bin/llama-server \\\n  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \\\n  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \\\n  --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \\\n  --spec-type draft-mtp \\\n  --spec-draft-n-max 3 \\\n  -ngl 999 \\\n  -fa on \\\n  -c 65536 \\\n  --parallel 1 \\\n  --host 127.0.0.1 \\\n  --port 8080\n```\n\nThe OpenAI-compatible endpoint is:\n\n```\nhttp://127.0.0.1:8080/v1\n```\n\nI used a small `start_server.sh` wrapper so it runs inside tmux:\n\n``` bash\n#!/usr/bin/env bash\nset -euo pipefail\n\nROOT_DIR=\"$(cd \"$(dirname \"${BASH_SOURCE[0]}\")\" && pwd)\"\nSESSION_NAME=\"${SESSION_NAME:-gemma4-server}\"\nHOST=\"${HOST:-127.0.0.1}\"\nPORT=\"${PORT:-8080}\"\nCTX_SIZE=\"${CTX_SIZE:-65536}\"\nPARALLEL=\"${PARALLEL:-1}\"\n\nLLAMA_SERVER=\"$ROOT_DIR/repos/llama.cpp/build/bin/llama-server\"\nMODEL=\"$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf\"\nDRAFT_MODEL=\"$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf\"\nMMPROJ=\"$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf\"\nLOG_FILE=\"$ROOT_DIR/logs/llama-server-mtp.log\"\n\nmkdir -p \"$ROOT_DIR/logs\"\n\ntmux new-session -d -s \"$SESSION_NAME\" -c \"$ROOT_DIR\" \\\n  \"$LLAMA_SERVER \\\n    -m '$MODEL' \\\n    --model-draft '$DRAFT_MODEL' \\\n    --mmproj '$MMPROJ' \\\n    --spec-type draft-mtp \\\n    --spec-draft-n-max 3 \\\n    -ngl 999 \\\n    -fa on \\\n    -c '$CTX_SIZE' \\\n    --parallel '$PARALLEL' \\\n    --host '$HOST' \\\n    --port '$PORT' \\\n    2>&1 | tee -a 
'$LOG_FILE'\"\n```\n\nStart it:\n\n```\nchmod +x start_server.sh\n./start_server.sh\n```\n\nCheck that the server is running:\n\n```\ncurl http://127.0.0.1:8080/v1/models\n```\n\n# Configure Pi\n\nPi reads model providers from:\n\n```\n~/.pi/agent/models.json\n```\n\nAdd a local provider:\n\n```\n{\n  \"providers\": {\n    \"gemma4-local\": {\n      \"name\": \"Gemma 4 Local\",\n      \"baseUrl\": \"http://127.0.0.1:8080/v1\",\n      \"api\": \"openai-completions\",\n      \"apiKey\": \"local\",\n      \"authHeader\": false,\n      \"compat\": {\n        \"supportsDeveloperRole\": false,\n        \"supportsReasoningEffort\": false\n      },\n      \"models\": [\n        {\n          \"id\": \"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf\",\n          \"name\": \"Gemma 4 26B-A4B Q4 + MTP\",\n          \"reasoning\": false,\n          \"input\": [\"text\", \"image\"],\n          \"contextWindow\": 65536,\n          \"maxTokens\": 8192,\n          \"cost\": {\n            \"input\": 0,\n            \"output\": 0,\n            \"cacheRead\": 0,\n            \"cacheWrite\": 0\n          }\n        }\n      ]\n    }\n  }\n}\n```\n\nThe important pieces are:\n\n`baseUrl`\n\npoints to the llama.cpp OpenAI-compatible server.`api`\n\nis`openai-completions`\n\n.`authHeader`\n\nis`false`\n\n, because this is a local server.`input`\n\nincludes both`text`\n\nand`image`\n\n, otherwise Pi treats it as text-only.\n\nOptionally make it the default in:\n\n```\n~/.pi/agent/settings.json\n{\n  \"defaultProvider\": \"gemma4-local\",\n  \"defaultModel\": \"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf\",\n  \"defaultThinkingLevel\": \"minimal\"\n}\n```\n\nThen check Pi can see it:\n\n```\npi --offline --list-models gemma\n```\n\nExpected:\n\n```\nprovider      model                               context  max-out  thinking  images\ngemma4-local  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf  65.5K    8.2K     no        yes\n```\n\nRun Pi using the local model:\n\n```\npi --provider gemma4-local --model 
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf\n```\n\nOr use non-interactive mode:\n\n```\npi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \\\n  \"Explain what this repository does\"\n```\n\nFor screenshots:\n\n```\npi -p @\"/path/to/screenshot.png\" \"Describe this image and point out anything relevant to the UI\"\n```\n\n# Final Setup\n\nThe final local coding-agent stack was:\n\n| Layer | Choice |\n|---|---|\n| Inference runtime | llama.cpp |\n| macOS acceleration | Metal + Accelerate |\n| Main model | `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` |\n| Draft model | `gemma-4-26B-A4B-it-Q8_0-MTP.gguf` |\n| MTP setting | `--spec-draft-n-max 3` |\n| Multimodal projector | `mmproj-BF16.gguf` |\n| Server | `llama-server` on `127.0.0.1:8080` |\n| API | OpenAI-compatible `/v1` |\n| Coding agent | Pi |\n| Pi model input | `[\"text\", \"image\"]` |\n\nThe main conclusion was that the MTP draft model is worth using. On this machine it took Gemma 4 from 58.2 tokens/second to 72.2 tokens/second, while keeping the setup simple enough to run as a local OpenAI-compatible server.\n\n**P.S.** Some people suggested using `Qwen3.6 35B-A3B` instead of `Gemma 4 26B-A4B`. According to the benchmarks I can find, Qwen is a **much** better coding agent than Gemma 4.\n\nHowever, it is also slower. `Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf` + `unsloth-Qwen3.6-35B-A3B-MTP-GGUF` + `mmproj-BF16.gguf` results in 55 tok/s, instead of 72 tok/s. 
Which is quite significant when you are sitting waiting for it.\n\nDownload the models:\n\n```\nmkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF\n\nhuggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \\\n  Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \\\n  mmproj-BF16.gguf \\\n  --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF\n```\n\nStart the server:\n\n```\nLLAMA_SERVER=/Users/kylehowells/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server\n\n$LLAMA_SERVER \\\n  -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \\\n  --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \\\n  --spec-type draft-mtp \\\n  --spec-draft-n-max 3 \\\n  -ngl 999 \\\n  -fa on \\\n  -c 65536 \\\n  --parallel 1 \\\n  --host 127.0.0.1 \\\n  --port 8081\n```\n\nPi Config:\n\n```\n{\n  \"providers\": {\n    \"qwen36-local\": {\n      \"name\": \"Qwen3.6 Local\",\n      \"baseUrl\": \"http://127.0.0.1:8081/v1\",\n      \"api\": \"openai-completions\",\n      \"apiKey\": \"local\",\n      \"authHeader\": false,\n      \"compat\": {\n        \"supportsDeveloperRole\": false,\n        \"supportsReasoningEffort\": false\n      },\n      \"models\": [\n        {\n          \"id\": \"Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\",\n          \"name\": \"Qwen3.6 35B-A3B Q4 + MTP\",\n          \"reasoning\": true,\n          \"input\": [\"text\", \"image\"],\n          \"contextWindow\": 65536,\n          \"maxTokens\": 8192,\n          \"cost\": {\n            \"input\": 0,\n            \"output\": 0,\n            \"cacheRead\": 0,\n            \"cacheWrite\": 0\n          }\n        }\n      ]\n    }\n  }\n}\n```\n\n## 
References:\n\n- [unsloth.ai/docs/models/qwen3.6](https://unsloth.ai/docs/models/qwen3.6)\n- [unsloth.ai/docs/models/gemma-4](https://unsloth.ai/docs/models/gemma-4)\n- [unsloth.ai/docs/models/mtp](https://unsloth.ai/docs/models/mtp)\n- [github.com/ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp)\n- [github.com/earendil-works/pi](https://github.com/earendil-works/pi)\n- [Introducing Gemma 4 12B: a unified, encoder-free multimodal model](https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/)\n- [\"MTP enables Google Gemma 4 run ~1.4–2.2× faster with no accuracy loss\"](https://x.com/UnslothAI/status/2065107734916432189)\n- [unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)\n- [unsloth/Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF)", "url": "https://wpnews.pro/news/how-to-setup-a-local-coding-agent-on-macos", "canonical_source": "https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent-on-macos", "published_at": "2026-06-12 17:34:55+00:00", "updated_at": "2026-06-12 17:50:40.012120+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "ai-infrastructure", "generative-ai"], "entities": ["Gemma 4", "Qwen3.6", "llama.cpp", "Pi", "UnslothAI", "Apple", "M1 Max", "Huggingface"], "alternates": {"html": "https://wpnews.pro/news/how-to-setup-a-local-coding-agent-on-macos", "markdown": "https://wpnews.pro/news/how-to-setup-a-local-coding-agent-on-macos.md", "text": "https://wpnews.pro/news/how-to-setup-a-local-coding-agent-on-macos.txt", "jsonld": "https://wpnews.pro/news/how-to-setup-a-local-coding-agent-on-macos.jsonld"}}