Coding with DeepSeek 4 on a 128GB MacBook Pro

DeepSeek V4 Flash, a 284-billion-parameter Mixture-of-Experts model, now runs locally on a 128GB MacBook Pro via antirez's experimental llama.cpp fork, achieving ~21 tokens/sec generation on the Metal GPU. The 2-bit quantized model requires ~81GB of memory and supports up to 256k context reliably, enabling offline use of agent harnesses like Claude Code and Pi.

# Running Claude Code and Pi on DeepSeek V4 Flash, locally on a 128GB MacBook Pro

A 284-billion-parameter frontier model, running entirely offline on a laptop, and wired up as a backend for two agent harnesses: Claude Code and Pi.

DeepSeek V4 Flash dropped in April 2026: a 284B-parameter Mixture-of-Experts model (13B active per token), MIT-licensed, with a 1M-token context window. The interesting part for me wasn’t the benchmarks. It was the claim, floating around the internet, that you could run it locally on an Apple Silicon Mac with enough RAM.

I have a MacBook Pro with an M3 Max and 128GB of unified memory. So I tried it. Here’s everything that worked, everything that didn’t, and the scripts I ended up with.

## TL;DR

- It works. ~21 tokens/sec generation, fully on the Metal GPU, ~81GB resident.
- You cannot use mainline llama.cpp or Ollama yet: the `deepseek4` architecture isn’t merged. You need antirez’s experimental fork (https://github.com/antirez/llama.cpp-deepseek-v4-flash).
- The model file is an 81GB 2-bit “Dwarf Star” quant from `antirez/deepseek-v4-gguf`, purpose-built for 128GB Macs.
- `llama-server` now speaks the Anthropic Messages API natively, so you can point Claude Code at it with zero proxies.
- 1M context loads but crashes at inference; 256k is the reliable ceiling on this fork.

## The hardware

- Chip: Apple M3 Max (12 performance + 4 efficiency cores)
- Memory: 128 GB unified

The 128GB is the whole ballgame. The 2-bit quant needs ~81GB resident, which means a 64GB machine is out: you’d swap to death or OOM. 128GB is the sweet spot the quant was designed around. There’s a bigger Q4 variant at 153GB for the 192GB Mac Studios, and DeepSeek-V4-Pro quants too, but Flash-q2 is the one that fits a laptop.

## False start: the guide that didn’t work

I started from a tutorial that told me to `git clone` mainline llama.cpp, build it, and `huggingface-cli download <some-repo>/deepseek-v4-flash`. Two problems:

- Mainline llama.cpp doesn’t support DeepSeek V4. The `deepseek4` architecture, with its sparse attention, hyper-connections, and multi-token-prediction head, isn’t in stable releases. Ollama doesn’t support it either (it’ll auto-update once the arch merges upstream, but that hadn’t happened).
- The download command had a literal placeholder for the repo. There was no real source behind it.

So if you find a tutorial telling you to use stock llama.cpp or `ollama pull deepseek-v4`, close the tab. As of mid-2026 that path does not exist.

## What actually works: antirez’s fork

Salvatore “antirez” Sanfilippo (creator of Redis) maintains an experimental llama.cpp fork that implements the `deepseek4` architecture, plus a HuggingFace repo of matching GGUF quants. The key file:

```
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf   (81 GB)
```

That filename is a recipe. It’s IQ2_XXS 2-bit for the routed experts, which is where almost all 284B parameters live, but it keeps the attention projections, shared experts, and output layer at Q8. The parts that matter for coherence stay high-precision; the giant sparse expert tables get crushed to 2 bits. antirez calls it the “Dwarf Star” quant. His own note: “behaves very very well in the chat, frontier-model vibes, but it was not extensively tested.” That matches my experience.

Building it is standard llama.cpp:

```bash
git clone --depth 1 https://github.com/antirez/llama.cpp-deepseek-v4-flash llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
```

This gives you `llama-cli`, `llama-server`, and `llama-completion`. The build detected my M3 Max GPU correctly:

```
ggml_metal_device_init: GPU name: MTL0 Apple M3 Max
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 115448.73 MB
```

That ~115GB working-set ceiling is the number to keep in mind: the model eats ~83GB of it, leaving ~32GB for context and compute buffers.
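The headroom arithmetic is easy to redo for your own machine. Here’s a minimal sketch, not part of my actual scripts: `working_set_mb` and `headroom_gb` are hypothetical helper names, and I’m assuming the `recommendedMaxWorkingSetSize` log line keeps the format shown above and that its MB figure can be treated as decimal megabytes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the fork): estimate the Metal working-set
# headroom left for KV cache and compute buffers once the model is loaded.

# working_set_mb: read llama.cpp log text on stdin and print the MB figure
# from the first "recommendedMaxWorkingSetSize = N MB" line.
working_set_mb() {
  sed -n 's/.*recommendedMaxWorkingSetSize *= *\([0-9.]*\).*/\1/p' | head -n 1
}

# headroom_gb CEILING_MB MODEL_GB: ceiling minus model size, in decimal GB.
# Assumption: the log's MB are decimal megabytes (MB / 1000 = GB).
headroom_gb() {
  awk -v ceiling_mb="$1" -v model_gb="$2" \
    'BEGIN { printf "%.1f\n", ceiling_mb / 1000 - model_gb }'
}

# Example with the numbers from my machine:
echo "ggml_metal_device_init: recommendedMaxWorkingSetSize = 115448.73 MB" | working_set_mb  # prints 115448.73
headroom_gb 115448.73 83   # prints 32.4
```

If the headroom comes out near zero or negative on your machine, this quant is unlikely to fit alongside any useful context.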
## Things that nearly fooled me

### “It’s running on the CPU” (it wasn’t)

My first test generation seemed to hang. `top` showed the process pegged at 99% on a single core for 19 minutes with no output. I was convinced the custom DeepSeek ops (the sparse-attention “indexer”, the “compressor”) had no Metal kernels and were falling back to CPU. They weren’t. Two things were happening:

- I’d piped the output through `tail`, which prints nothing until its input ends, so I saw nothing while it generated fine.
- The 99%-single-core is just the orchestration thread spinning while the GPU does the matmuls.

The real proof came from the memory breakdown:

```
| memory breakdown MiB | total    free    self  ... |
| MTL0 Apple M3 Max    | 110100 = 26265 + 83161 ... |
```

83GB sitting on `MTL0`, the Metal GPU. It was on the GPU the whole time. Lesson: don’t pipe a streaming LLM through `tail`, and check the memory breakdown before blaming the CPU.

### Speed and load time

- Generation: ~21 tok/s
- Prompt eval: ~32–43 tok/s
- Cold load: ~9 minutes (reading 81GB off disk)
- Warm load: ~4 seconds (once the file is in the OS page cache)

So your second launch is dramatically faster than your first.

### How big can the context actually be?

The model supports 1M tokens. The question is what fits and computes in ~32GB of leftover working set. I measured it empirically, and DeepSeek’s sparse attention makes the KV cache shockingly cheap (a sliding window of 128 plus a top-512 indexer, instead of dense full-sequence attention):

| Context | Total resident | Result |
|---|---|---|
| 2k | ~82 GB | ✅ KV cache only ~66 MiB |
| 64k | ~83 GB | ✅ |
| 256k | ~88–91 GB | ✅ This is the one I settled on |
| 1M | ~85 GB (loads) | ❌ Compute error at inference time |

So memory was never the limit: even 1M loads in 85GB. But at 1M the fork fails to build the compute graph and every request returns `{"error":{"code":500,"message":"Compute error."}}`. 256k computes reliably, is larger than hosted Claude’s standard 200k window, and leaves headroom.
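If you want to repeat the context sweep on your own hardware, one way to script the pass/fail check looks roughly like this. It’s a sketch under assumptions rather than the exact procedure I used: `classify_response` and `probe` are hypothetical names, the server has to be restarted by hand with a different `-c` value for each row, and the check only looks for the `Compute error` message shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking whether a given context size actually
# computes, not just loads.

# classify_response: read a llama-server reply on stdin and print one of
# "no-response", "compute-error", or "ok".
classify_response() {
  local body
  body="$(cat)"
  if [ -z "$body" ]; then
    echo "no-response"
  elif printf '%s' "$body" | grep -q 'Compute error'; then
    echo "compute-error"
  else
    echo "ok"
  fi
}

# probe [PORT]: send a tiny request to a llama-server that is already running
# and classify the reply. Uses the OpenAI-style /v1/chat/completions endpoint.
probe() {
  curl -s "http://127.0.0.1:${1:-8080}/v1/chat/completions" \
    -H "content-type: application/json" \
    -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":8}' \
    | classify_response
}

# Usage (after starting llama-server with the -c value under test):
#   probe 8080
```

A tiny prompt is enough here because the 1M failure shows up on every request, not only on long ones.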
That 256k ceiling is what I bake into the server.

## Wiring it into Claude Code

This was the surprise payoff. Recent `llama-server` exposes an Anthropic Messages API endpoint (`/v1/messages`) alongside the OpenAI one, so no proxy, no claude-code-router, no LiteLLM needed. You point Claude Code straight at `llama-server`.

A raw test against the endpoint:

```bash
curl -s http://127.0.0.1:8080/v1/messages \
  -H "content-type: application/json" -H "anthropic-version: 2023-06-01" \
  -d '{"model":"deepseek-v4-flash","max_tokens":40,
       "messages":[{"role":"user","content":"Reply with exactly: BRIDGE OK"}]}'
```

```json
{"type":"message","role":"assistant",
 "content":[{"type":"thinking","thinking":"..."},{"type":"text","text":"BRIDGE OK"}],
 "stop_reason":"end_turn","usage":{"cache_read_input_tokens":0,"input_tokens":12,"output_tokens":37}}
```

Proper Anthropic-shaped response, thinking blocks and all. Note the `cache_read_input_tokens` field, so prompt caching works too.

Two things to get right:

- Start the server with `--jinja` or tool/function calling won’t work (Claude Code lives and dies by tool calls).
- Do NOT put `ANTHROPIC_BASE_URL` in your global `~/.claude/settings.json`. That hijacks every `claude` you run, including your normal cloud sessions. Set the env vars in a launcher script instead, so it’s opt-in per invocation.

The end-to-end proof: I ran Claude Code headless against the local model and asked it to reply `LOCAL CLAUDE OK`. The server log showed it ingesting a 20,556-token prompt (Claude Code’s system prompt + tool schemas), and after chewing through it… `LOCAL CLAUDE OK`. 🎉

The honest caveat: that 20k-token system prompt takes several minutes to process on the first turn (about ten, going by 20,556 tokens at ~32 tok/s). Prompt caching makes later turns faster, but this is not a snappy daily driver. It’s a 2-bit model on a laptop. It’s genuinely useful for offline/air-gapped work and experimentation; it is not going to feel like the hosted product.
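To put a number on that first-turn wait for other prompt sizes, the estimate is just prompt tokens divided by the prompt-eval rate. A rough sketch: the `prefill_estimate` helper is hypothetical, and it ignores prompt caching and any slowdown at long context, so treat the result as a ballpark.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate how long the server spends ingesting a prompt
# before the first generated token appears.

# prefill_estimate PROMPT_TOKENS TOKENS_PER_SEC
prefill_estimate() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "rate must be positive" > "/dev/stderr"; exit 1 }
    secs = tokens / rate
    printf "%.0f s (%.1f min)\n", secs, secs / 60
  }'
}

# Claude Code prompt size from my server log, at both ends of the
# prompt-eval range I measured:
prefill_estimate 20556 32   # prints 642 s (10.7 min)
prefill_estimate 20556 43   # prints 478 s (8.0 min)
```

The same arithmetic explains why a harness with a smaller system prompt feels so much quicker on identical hardware.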
## Bonus: the same model in Pi (a second harness)

Pi (https://pi.dev) is a minimal, provider-agnostic coding agent (`@earendil-works/pi-coding-agent`). Since it advertises an OpenAI provider and reads `OPENAI_API_KEY`, I assumed I could just set `OPENAI_BASE_URL=http://localhost:8080/v1` and run `pi --provider openai`. That doesn’t work. Pi’s built-in `openai` provider ignores `OPENAI_BASE_URL` and goes straight to api.openai.com:

```
OpenAI API error (401): Incorrect API key provided: local. You can find your API key at https://platform.openai.com/account/api-keys.
```

The correct way to point Pi at a local server is to register a custom provider via a tiny extension (`pi.registerProvider`). Once that’s loaded, `pi --list-models` shows your local model and everything routes locally:

```
provider  model              context  max-out
local     deepseek-v4-flash  262.1K   8.2K
```

A quick `pi -p "What is 2+2?"` returns `4`, through the local model, fully offline. Pi’s system prompt is much smaller than Claude Code’s, so it feels noticeably snappier on the same hardware (less prompt to chew through each turn).

## The scripts

Everything lives in `~/deepseek-v4-flash/`. I also have a `setup.sh` that installs prereqs, builds the fork, and downloads the model. It is omitted here for brevity; the interesting bits are below.

### `serve.sh`: run the model server in the background

Exposes both the Anthropic and OpenAI APIs on `127.0.0.1:8080`, at 256k context, with `--jinja` for tool calling.

```bash
#!/usr/bin/env bash
# Start/stop the DeepSeek server (llama-server) on http://127.0.0.1:8080.
# Exposes /v1/messages (Anthropic) and /v1/chat/completions (OpenAI).
#   ./serve.sh start|stop|status|logs
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"
BIN="$DIR/llama.cpp/build/bin/llama-server"
MODEL="$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf"
LOG="$DIR/server.log"; PORT=8080; HOST=127.0.0.1; CTX=262144   # 256k

case "${1:-start}" in
  stop)   pkill -f "[l]lama-server" && echo stopped || echo "not running" ;;
  status) curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 \
            && { echo "UP  http://$HOST:$PORT"; ps -axo rss,command | grep "[l]lama-server" \
                 | awk '{printf "  %.1f GB\n",$1/1048576}'; } \
            || echo DOWN ;;
  logs)   tail -f "$LOG" ;;
  start)
    curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 && { echo "already running"; exit 0; }
    pkill -f "[l]lama-server" 2>/dev/null || true; sleep 1; : >"$LOG"
    nohup "$BIN" -m "$MODEL" -ngl 99 -c "$CTX" --jinja --host "$HOST" --port "$PORT" >"$LOG" 2>&1 &
    echo "starting (pid $!, ctx=$CTX): loading ~81GB, takes a few minutes"
    printf "waiting"
    until curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1; do
      pgrep -f "[l]lama-server" >/dev/null || { echo " FAILED, see $LOG"; exit 1; }
      printf .; sleep 3
    done
    echo " UP on http://$HOST:$PORT" ;;
  *) echo "usage: $0 {start|stop|status|logs}"; exit 1 ;;
esac
```

Key flags: `-ngl 99` offloads all layers to Metal, `-c 262144` sets the 256k window, `--jinja` enables tool calling.

### `claude-local.sh`: run Claude Code against the local model

The whole trick is here: set the `ANTHROPIC_*` env vars for this invocation only, auto-starting the server if it’s down. Your normal cloud `claude` in other tabs is untouched.

```bash
#!/usr/bin/env bash
# Launch Claude Code against the LOCAL DeepSeek server (this invocation only;
# your normal cloud claude is untouched). Args are forwarded to claude.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"; HOST=127.0.0.1; PORT=8080

command -v claude >/dev/null 2>&1 || { echo "Claude Code CLI not found."; exit 1; }
curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 || { echo "starting local server..."; "$DIR/serve.sh" start; }

export ANTHROPIC_BASE_URL="http://$HOST:$PORT"
export ANTHROPIC_API_KEY="local-no-auth"                  # server ignores auth; this just skips the login flow
export ANTHROPIC_AUTH_TOKEN="local-no-auth"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"  # route the small/fast model locally too
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1         # no telemetry / update pings: fully local
exec claude "$@"
```

Then, in any new terminal tab:

```bash
~/deepseek-v4-flash/claude-local.sh
```

### `chat.sh`: plain terminal chat (no Claude Code)

For a quick conversation without the agent harness. The `-cnv` flag runs interactive conversation mode.

```bash
#!/usr/bin/env bash
# Interactive terminal chat with DeepSeek V4 Flash.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"
exec "$DIR/llama.cpp/build/bin/llama-cli" \
  -m "$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf" \
  -ngl 99 -c 8192 -cnv "$@"
```

### `pi-local-provider.js`: register the local server as a Pi provider

Pi won’t honor `OPENAI_BASE_URL`, so we register a custom `local` provider in an extension. `api: "openai-completions"` matches `llama-server`’s OpenAI endpoint.
```js
// Load with: pi -e ~/deepseek-v4-flash/pi-local-provider.js --provider local --model local/deepseek-v4-flash
export default async function (pi) {
  pi.registerProvider("local", {
    baseUrl: "http://127.0.0.1:8080/v1",
    apiKey: "local-no-auth", // llama-server ignores auth; any non-empty value works
    api: "openai-completions",
    models: [
      {
        id: "deepseek-v4-flash",
        name: "DeepSeek V4 Flash (local, q2)",
        reasoning: false,
        input: ["text"],
        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
        contextWindow: 262144,
        maxTokens: 8192,
      },
    ],
  });
}
```

### `pi-local.sh`: run Pi against the local model

Install Pi first (`curl -fsSL https://pi.dev/install.sh | sh`, or brew / npm), then:

```bash
#!/usr/bin/env bash
# Launch the Pi coding agent against the LOCAL DeepSeek server.
# Auto-starts the model server if needed. Args are forwarded to pi.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"; HOST=127.0.0.1; PORT=8080

command -v pi >/dev/null 2>&1 || { echo "Pi not installed, see https://pi.dev"; exit 1; }
curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 || { echo "starting local server..."; "$DIR/serve.sh" start; }

exec pi -e "$DIR/pi-local-provider.js" --provider local --model local/deepseek-v4-flash "$@"
```

Then, in any terminal:

```bash
~/deepseek-v4-flash/pi-local.sh                          # interactive
~/deepseek-v4-flash/pi-local.sh -p "explain this repo"   # one-shot
```

## Would I actually use this?

For day-to-day coding? No. The hosted models are an order of magnitude faster and smarter. But as a demonstration that a 284B frontier-class MoE runs offline on a laptop, and that you can drive Claude Code with zero cloud dependency, it’s remarkable. Air-gapped environments, flights, privacy-sensitive work, or just the sheer “because I can” factor: that’s where this shines.

The pieces that made it possible (antirez’s architecture port, a 2-bit quant that keeps the right layers precise, DeepSeek’s sparse attention keeping the KV cache tiny, and `llama-server`’s native Anthropic endpoint) are each individually clever.
Stacked together, they put a frontier model on my lap. Literally.

Built and tested on macOS, Apple M3 Max, 128GB, June 2026.