How to Setup a Local Coding Agent on macOS

A developer successfully configured a local coding agent on macOS using Gemma 4 26B-A4B and Qwen3.6 35B-A3B models with llama.cpp, achieving 72.2 tokens per second generation speed through MTP speculative decoding. The setup, tested on an Apple M1 Max with 64GB memory, provides an OpenAI-compatible API with multimodal support for screenshots, enabling offline coding assistance when internet access is unavailable.

How to Setup a Local Coding Agent on macOS Running Gemma 4 26B-A4B and Qwen3.6 35B-A3B locally with llama.cpp, MTP speculative decoding, multimodal support, and PI as a coding agent. I'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the "Gemma 4 now runs 2x faster with MTP" https://x.com/UnslothAI/status/2065107734916432189 Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running. I wanted a local coding agent setup that: - was fast enough to actually use on my Mac - worked through an OpenAI compatible API so I could use it in other tools - and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made. And I did This video is realtime. And shows the agent responding at a perfectly usable speed. After a bit of testing the final setup I ended up with is: llama.cpp https://github.com/ggml-org/llama.cpp built with Metal on macOS- Gemma 4 26B-A4B in GGUF format - A Q8 MTP draft model for speculative decoding - The Gemma 4 multimodal projector Pi https://github.com/earendil-works/pi as the terminal coding agent This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7. The Model The main model is: gemma-4-26B-A4B-it-UD-Q4 K XL.gguf . Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB. The benchmark prompt was: Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases. Each benchmark generated about 128 tokens. Baseline: llama.cpp + Metal First I ran the main model directly through llama.cpp with Metal acceleration: repos/llama.cpp/build/bin/llama-cli \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf \ -ngl 999 \ -fa on \ -c 4096 \ -n 128 Result: | Setup | Prompt tok/s | Generation tok/s | |---|---|---| | Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 | 58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls. Adding the MTP Draft Model Gemma 4 now has the MTP draft model available https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf : MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf This can be loaded by llama.cpp as a speculative draft model: repos/llama.cpp/build/bin/llama-cli \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf \ --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 4096 \ -n 128 The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on How to Run MTP Models https://unsloth.ai/docs/models/mtp includes this note: "We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system." After sweeping --spec-draft-n-max , the best result was 72.2 tokens/second with 3 draft tokens. | Setup | Prompt tok/s | Generation tok/s | Speedup | |---|---|---|---| | Main model only | 298.0 | 58.2 | 1.00x | | Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x | The useful part is that prompt processing stayed basically the same, while generation improved by about 24%. Tuning MTP I tested --spec-draft-n-max values from 1 to 6. --spec-draft-n-max | Prompt tok/s | Generation tok/s | |---|---|---| | 1 | 295.5 | 68.4 | | 2 | 299.1 | 72.0 | | 3 | 295.6 | 72.2 | | 4 | 297.3 | 70.7 | | 5 | 297.9 | 63.7 | | 6 | 296.3 | 61.2 | On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower. MLX Comparison I also tested MLX models through mlx-lm , to find out which is the faster way to run the model on a Mac, llama.cpp or mlx. | Runtime | Model | Generation tok/s | |---|---|---| | llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 | | llama.cpp Metal | Unsloth GGUF Q4 | 58.2 | | MLX-LM | Unsloth UD MLX 4-bit | 45.8 | | MLX-LM | mlx-community 4-bit | 43.9 | | MLX-LM | mlx-community OptiQ 4-bit | 38.1 | I thought MLX being optimised for the Mac would be fastest. However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option. I guess all the effort and tweaking which has gone into llama.cpp over time means it quite well optimised fr macOS despite being cross platform. I also tried Gemma 4 MTP through gemma-4-swift-mlx https://github.com/VincentGourbin/gemma-4-swift-mlx , but the tested 26B 4-bit MLX checkpoints did not match the loader's expected weight keys, and I already had the previous MLX tests, so moved on rather than redownload new models and try to tweak things to match. Adding Image Support For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only: "input": "text" That meant Pi did not send image tool output through to the model properly. The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work only the 12B is natively multi-modal https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/ : mmproj-BF16.gguf When loaded with --mmproj , llama.cpp advertises multimodal support, and Pi can send images. I re-ran the text benchmark with the projector loaded, just to check it didn't change the speed: | Setup | Projector | Prompt tok/s | Generation tok/s | |---|---|---|---| | llama.cpp Metal + MTP | none | 120.3 | 71.4 | | llama.cpp Metal + MTP | mmproj-BF16.gguf | 297.4 | 72.2 | The final run with the projector did not show a text-generation slowdown. Now for setup instructions: Install llama.cpp Install dependencies: brew install cmake git tmux python@3.11 Clone and build llama.cpp: mkdir -p ~/Developer/ML-Models/Gemma4/repos cd ~/Developer/ML-Models/Gemma4 git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp cd repos/llama.cpp cmake -B build \ -DCMAKE BUILD TYPE=Release \ -DGGML METAL=ON \ -DGGML ACCELERATE=ON cmake --build build --config Release -j The build I tested had: GGML METAL=ON GGML ACCELERATE=ON GGML BLAS=ON GGML BLAS VENDOR=Apple Download the Model Files Create a Python environment: cd ~/Developer/ML-Models/Gemma4 python3.11 -m venv .venv source .venv/bin/activate pip install -U huggingface hub hf xet Download the files: mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \ gemma-4-26B-A4B-it-UD-Q4 K XL.gguf \ mmproj-BF16.gguf \ MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf \ --local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF You should end up with: models/unsloth-gemma-4-26B-A4B-it-GGUF/ gemma-4-26B-A4B-it-UD-Q4 K XL.gguf mmproj-BF16.gguf MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf Start the Local Server This is the final server command: repos/llama.cpp/build/bin/llama-server \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf \ --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf \ --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 65536 \ --parallel 1 \ --host 127.0.0.1 \ --port 8080 The OpenAI-compatible endpoint is: http://127.0.0.1:8080/v1 I used a small start server.sh wrapper so it runs inside tmux: bash /usr/bin/env bash set -euo pipefail ROOT DIR="$ cd "$ dirname "${BASH SOURCE 0 }" " && pwd " SESSION NAME="${SESSION NAME:-gemma4-server}" HOST="${HOST:-127.0.0.1}" PORT="${PORT:-8080}" CTX SIZE="${CTX SIZE:-65536}" PARALLEL="${PARALLEL:-1}" LLAMA SERVER="$ROOT DIR/repos/llama.cpp/build/bin/llama-server" MODEL="$ROOT DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4 K XL.gguf" DRAFT MODEL="$ROOT DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8 0-MTP.gguf" MMPROJ="$ROOT DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf" LOG FILE="$ROOT DIR/logs/llama-server-mtp.log" mkdir -p "$ROOT DIR/logs" tmux new-session -d -s "$SESSION NAME" -c "$ROOT DIR" \ "$LLAMA SERVER \ -m '$MODEL' \ --model-draft '$DRAFT MODEL' \ --mmproj '$MMPROJ' \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c '$CTX SIZE' \ --parallel '$PARALLEL' \ --host '$HOST' \ --port '$PORT' \ 2 &1 | tee -a '$LOG FILE'" Start it: chmod +x start server.sh ./start server.sh Check that the server is running: curl http://127.0.0.1:8080/v1/models Configure Pi Pi reads model providers from: ~/.pi/agent/models.json Add a local provider: { "providers": { "gemma4-local": { "name": "Gemma 4 Local", "baseUrl": "http://127.0.0.1:8080/v1", "api": "openai-completions", "apiKey": "local", "authHeader": false, "compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": false }, "models": { "id": "gemma-4-26B-A4B-it-UD-Q4 K XL.gguf", "name": "Gemma 4 26B-A4B Q4 + MTP", "reasoning": false, "input": "text", "image" , "contextWindow": 65536, "maxTokens": 8192, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } } } } } The important pieces are: baseUrl points to the llama.cpp OpenAI-compatible server. api is openai-completions . authHeader is false , because this is a local server. input includes both text and image , otherwise Pi treats it as text-only. Optionally make it the default in: ~/.pi/agent/settings.json { "defaultProvider": "gemma4-local", "defaultModel": "gemma-4-26B-A4B-it-UD-Q4 K XL.gguf", "defaultThinkingLevel": "minimal" } Then check Pi can see it: pi --offline --list-models gemma Expected: provider model context max-out thinking images gemma4-local gemma-4-26B-A4B-it-UD-Q4 K XL.gguf 65.5K 8.2K no yes Run Pi using the local model: pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4 K XL.gguf Or use non-interactive mode: pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4 K XL.gguf \ "Explain what this repository does" For screenshots: pi -p @"/path/to/screenshot.png" "Describe this image and point out anything relevant to the UI" Final Setup The final local coding-agent stack was: | Layer | Choice | |---|---| | Inference runtime | llama.cpp | | macOS acceleration | Metal + Accelerate | | Main model | gemma-4-26B-A4B-it-UD-Q4 K XL.gguf | | Draft model | gemma-4-26B-A4B-it-Q8 0-MTP.gguf | | MTP setting | --spec-draft-n-max 3 | | Multimodal projector | mmproj-BF16.gguf | | Server | llama-server on 127.0.0.1:8080 | | API | OpenAI-compatible /v1 | | Coding agent | Pi | | Pi model input | "text", "image" | The main conclusion was that the MTP draft model is worth using. On this machine it took Gemma 4 from 58.2 tokens/second to 72.2 tokens/second, while keeping the setup simple enough to run as a local OpenAI-compatible server. P.S: Some suggested using Qwen3.6 35B-A3B instead of Gemma 4 26B-A4B . According to the benchmarks I can find, Qwen is a much better coding agent than Gemma 4. However, it is also slower. Qwen3.6-35B-A3B-UD-Q4 K XL.gguf + unsloth-Qwen3.6-35B-A3B-MTP-GGUF + mmproj-BF16.gguf results in 55 tk/s, instead of 72 tk/s. Which is quite significant when you are sitting waiting for it. Download the models: mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \ Qwen3.6-35B-A3B-UD-Q4 K XL.gguf \ mmproj-BF16.gguf \ --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF Start the server: LLAMA SERVER=/Users/kylehowells/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server $LLAMA SERVER \ -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4 K XL.gguf \ --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 65536 \ --parallel 1 \ --host 127.0.0.1 \ --port 8081 Pi Config: { "providers": { "qwen36-local": { "name": "Qwen3.6 Local", "baseUrl": "http://127.0.0.1:8081/v1", "api": "openai-completions", "apiKey": "local", "authHeader": false, "compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": false }, "models": { "id": "Qwen3.6-35B-A3B-UD-Q4 K XL.gguf", "name": "Qwen3.6 35B-A3B Q4 + MTP", "reasoning": true, "input": "text", "image" , "contextWindow": 65536, "maxTokens": 8192, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } } } } } References: unsloth.ai/docs/models/qwen3.6 https://unsloth.ai/docs/models/qwen3.6 unsloth.ai/docs/models/gemma-4 https://unsloth.ai/docs/models/gemma-4 unsloth.ai/docs/models/mtp https://unsloth.ai/docs/models/mtp github.com/ggml-org/llama.cpp https://github.com/ggml-org/llama.cpp github.com/earendil-works/pi https://github.com/earendil-works/pi Introducing Gemma 4 12B: a unified, encoder-free multimodal model https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/ "MTP enables Google Gemma 4 run ~1.4–2.2× faster with no accuracy loss" https://x.com/UnslothAI/status/2065107734916432189 unsloth/gemma-4-26B-A4B-it-GGUF https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF unsloth/Qwen3.6-35B-A3B-MTP-GGUF https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF