{"slug": "coding-with-deepseek-4-on-a-128gb-macbook-pro", "title": "Coding with DeepSeek 4 on a 128GB MacBook Pro", "summary": "DeepSeek V4 Flash, a 284-billion-parameter Mixture-of-Experts model, now runs locally on a 128GB MacBook Pro via antirez's experimental llama.cpp fork, achieving ~21 tokens/sec generation on the Metal GPU. The 2-bit quantized model requires ~81GB of memory and supports up to 256k context reliably, enabling offline use of agent harnesses like Claude Code and Pi.", "body_md": "# Running Claude Code and Pi on DeepSeek V4 Flash — locally on a 128GB MacBook Pro\n\n*A 284-billion-parameter frontier model, running entirely offline on a laptop — and wired up as a backend for two agent harnesses: Claude Code and Pi.*\n\nDeepSeek V4 Flash dropped in April 2026: a 284B-parameter Mixture-of-Experts model (13B active per token), MIT-licensed, with a 1M-token context window. The interesting part for me wasn’t the benchmarks — it was the claim, floating around the internet, that you could run it *locally* on an Apple Silicon Mac with enough RAM.\n\nI have a MacBook Pro with an M3 Max and 128GB of unified memory. So I tried it. Here’s everything that worked, everything that didn’t, and the scripts I ended up with.\n\n## TL;DR\n\n- **It works.** ~21 tokens/sec generation, fully on the Metal GPU, ~81GB resident.\n- You **cannot** use mainline `llama.cpp` or Ollama yet — the `deepseek4` architecture isn’t merged. You need [antirez’s experimental fork](https://github.com/antirez/llama.cpp-deepseek-v4-flash).\n- The model file is an **81GB 2-bit “Dwarf Star” quant** from `antirez/deepseek-v4-gguf`, purpose-built for 128GB Macs.\n- `llama-server` now speaks the **Anthropic Messages API natively**, so you can point **Claude Code** at it with zero proxies.\n- 1M context *loads* but crashes at inference; **256k** is the reliable ceiling on this fork.\n\n## The hardware\n\n```\nChip:    Apple M3 Max (12 performance + 4 efficiency cores)\nMemory:  128 GB unified\n```\n\nThe 128GB is the whole ballgame. 
The 2-bit quant needs ~81GB resident, which means a 64GB machine is out — you’d swap to death or OOM. 128GB is the sweet spot the quant was designed around. (There’s a bigger Q4 variant at 153GB for the 192GB Mac Studios, and `DeepSeek-V4-Pro` quants too, but Flash-q2 is the one that fits a laptop.)\n\n## False start: the guide that didn’t work\n\nI started from a tutorial that told me to `git clone` mainline `llama.cpp`, build it, and `huggingface-cli download <some-repo>/deepseek-v4-flash`. Two problems:\n\n- **Mainline llama.cpp doesn’t support DeepSeek V4.** The `deepseek4` architecture — with its sparse attention, hyper-connections, and multi-token-prediction head — isn’t in stable releases. **Ollama doesn’t support it either** (it’ll auto-update once the arch merges upstream, but that hadn’t happened).\n- The download command had a **literal placeholder** for the repo. There was no real source behind it.\n\nSo if you find a tutorial telling you to use stock `llama.cpp` or `ollama pull deepseek-v4`, close the tab. As of mid-2026 that path does not exist.\n\n## What actually works: antirez’s fork\n\nSalvatore “antirez” Sanfilippo (creator of Redis) maintains an experimental llama.cpp fork that implements the `deepseek4` architecture, plus a HuggingFace repo of matching GGUF quants. 
The key file:\n\n```\nDeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf   (81 GB)\n```\n\nThat filename is a recipe. It’s `IQ2_XXS` (2-bit) for the routed experts — which is where almost all 284B parameters live — but keeps the **attention projections, shared experts, and output layer at Q8**. The parts that matter for coherence stay high-precision; the giant sparse expert tables get crushed to 2 bits. antirez calls it the “Dwarf Star” quant. His own note: *“behaves very very well in the chat, frontier-model vibes, but it was not extensively tested.”* That matches my experience.\n\nBuilding it is standard llama.cpp:\n\n```\ngit clone --depth 1 https://github.com/antirez/llama.cpp-deepseek-v4-flash llama.cpp\ncd llama.cpp\ncmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release\ncmake --build build --config Release -j$(sysctl -n hw.logicalcpu)\n```\n\nThis gives you `llama-cli`, `llama-server`, and `llama-completion`. The build detected my M3 Max GPU correctly:\n\n```\nggml_metal_device_init: GPU name:   MTL0 (Apple M3 Max)\nggml_metal_device_init: has unified memory    = true\nggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB\n```\n\nThat **~115GB working-set ceiling** is the number to keep in mind: the model eats ~83GB of it, leaving ~32GB for context and compute buffers.\n\n## Things that nearly fooled me\n\n### “It’s running on the CPU!” (it wasn’t)\n\nMy first test generation seemed to hang. `top` showed the process pegged at **99% on a single core** for 19 minutes with no output. I was convinced the custom DeepSeek ops (the sparse-attention “indexer”, the “compressor”) had no Metal kernels and were falling back to CPU.\n\nThey weren’t. Two things were happening:\n\n- I’d piped the output through `tail`, which **buffers until the process exits** — so I saw nothing while it generated fine.\n- The 99%-single-core is just the **orchestration thread** spinning while the GPU does the matmuls. 
The real proof came from the memory breakdown:\n\n```\n| memory breakdown [MiB] | total   free    self    ... |\n| MTL0 (Apple M3 Max)    | 110100 = 26265 + (83161 ...) |\n```\n\n83GB sitting on `MTL0` — the Metal GPU. It was on the GPU the whole time. Lesson: **don’t pipe a streaming LLM through tail**, and check the memory breakdown before blaming the CPU.\n\n### Speed and load time\n\n- **Generation: ~21 tok/s.** Prompt eval: ~32–43 tok/s.\n- **Cold load: ~9 minutes** (reading 81GB off disk).\n- **Warm load: ~4 seconds** once the file is in the OS page cache. So your *second* launch is dramatically faster than your first.\n\n## How big can the context actually be?\n\nThe model supports 1M tokens. The question is what *fits and computes* in ~32GB of leftover working set. I measured it empirically — and DeepSeek’s **sparse attention** makes the KV cache shockingly cheap (sliding-window of 128 + a top-512 indexer, instead of dense full-sequence attention):\n\n| Context | Total resident | Result |\n|---|---|---|\n| 2k | ~82 GB | ✅ (KV cache only ~66 MiB) |\n| 64k | ~83 GB | ✅ |\n| 256k | ~88–91 GB | ✅ — this is the one I settled on |\n| 1M | ~85 GB (loads) | ❌ `Compute error` at inference time |\n\nSo memory was never the limit — even 1M *loads* in 85GB. But at 1M the fork fails to build the compute graph and every request returns `{\"error\":{\"code\":500,\"message\":\"Compute error.\"}}`. **256k computes reliably**, is larger than hosted Claude’s standard 200k window, and leaves headroom. That’s what I bake into the server.\n\n## Wiring it into Claude Code\n\nThis was the surprise payoff. Recent `llama-server` exposes an **Anthropic Messages API** endpoint (`/v1/messages`) alongside the OpenAI one — so **no proxy, no claude-code-router, no LiteLLM** needed. 
You point Claude Code straight at `llama-server`.\n\nA raw test against the endpoint:\n\n```\ncurl -s http://127.0.0.1:8080/v1/messages \\\n  -H \"content-type: application/json\" -H \"anthropic-version: 2023-06-01\" \\\n  -d '{\"model\":\"deepseek-v4-flash\",\"max_tokens\":40,\n       \"messages\":[{\"role\":\"user\",\"content\":\"Reply with exactly: BRIDGE OK\"}]}'\n{\"type\":\"message\",\"role\":\"assistant\",\n \"content\":[{\"type\":\"thinking\",\"thinking\":\"...\"},{\"type\":\"text\",\"text\":\"BRIDGE OK\"}],\n \"stop_reason\":\"end_turn\",\"usage\":{\"cache_read_input_tokens\":0,\"input_tokens\":12,\"output_tokens\":37}}\n```\n\nProper Anthropic-shaped response, thinking blocks and all — and note the `cache_read_input_tokens` field in `usage`, so **prompt caching works too**. Two things to get right:\n\n- **Start the server with `--jinja`** or tool/function calling won’t work (Claude Code lives and dies by tool calls).\n- **Do NOT put `ANTHROPIC_BASE_URL` in your global `~/.claude/settings.json`.** That hijacks *every* `claude` you run — including your normal cloud sessions. Set the env vars in a **launcher script** instead, so it’s opt-in per invocation.\n\nThe end-to-end proof: I ran Claude Code headless against the local model and asked it to reply `LOCAL CLAUDE OK`. The server log showed it ingesting a **20,556-token prompt** (Claude Code’s system prompt + tool schemas), and after chewing through it… `LOCAL CLAUDE OK`. 🎉\n\n**The honest caveat:** that 20k-token system prompt takes **several minutes** to process on the first turn (20,556 tokens at ~32 tok/s is roughly 10 minutes). Prompt caching makes later turns faster, but this is *not* a snappy daily driver. It’s a 2-bit model on a laptop. It’s genuinely useful for offline/air-gapped work and experimentation; it is not going to feel like the hosted product.\n\n## Bonus: the same model in Pi (a second harness)\n\n[Pi](https://pi.dev) is a minimal, provider-agnostic coding agent (`@earendil-works/pi-coding-agent`). 
Since it advertises an OpenAI provider and reads `OPENAI_API_KEY`, I assumed I could just set `OPENAI_BASE_URL=http://localhost:8080/v1` and run `pi --provider openai`. **That doesn’t work** — Pi’s built-in `openai` provider ignores `OPENAI_BASE_URL` and goes straight to `api.openai.com`:\n\n```\nOpenAI API error (401): Incorrect API key provided: local.\nYou can find your API key at https://platform.openai.com/account/api-keys.\n```\n\nThe correct way to point Pi at a local server is to **register a custom provider** via a tiny extension (`pi.registerProvider`). Once that’s loaded, `pi --list-models` shows your local model and everything routes locally:\n\n```\nprovider  model              context  max-out\nlocal     deepseek-v4-flash  262.1K   8.2K\n```\n\nA quick `pi -p \"What is 2+2?\"` returns `4` — through the local model, fully offline. Pi’s system prompt is much smaller than Claude Code’s, so it feels noticeably snappier on the same hardware (less prompt to chew through each turn).\n\n## The scripts\n\nEverything lives in `~/deepseek-v4-flash/`. 
(I also have a `setup.sh` that installs prereqs, builds the fork, and downloads the model — omitted here for brevity; the interesting bits are below.)\n\n`serve.sh` — run the model server in the background\n\nExposes both the Anthropic and OpenAI APIs on `127.0.0.1:8080`, at 256k context, with `--jinja` for tool calling.\n\n``` bash\n#!/usr/bin/env bash\n# Start/stop the DeepSeek server (llama-server) on http://127.0.0.1:8080.\n# Exposes /v1/messages (Anthropic) and /v1/chat/completions (OpenAI).\n#   ./serve.sh [start|stop|status|logs]\nset -euo pipefail\nDIR=\"$HOME/deepseek-v4-flash\"\nBIN=\"$DIR/llama.cpp/build/bin/llama-server\"\nMODEL=\"$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf\"\nLOG=\"$DIR/server.log\"; PORT=8080; HOST=127.0.0.1; CTX=262144   # 256k\n\ncase \"${1:-start}\" in\n  stop)   pkill -f \"[l]lama-server\" && echo stopped || echo \"not running\" ;;\n  status) curl -sf \"http://$HOST:$PORT/health\" >/dev/null 2>&1 \\\n            && { echo \"UP http://$HOST:$PORT\"; ps -axo rss,command | grep \"[l]lama-server\" \\\n                 | awk '{printf \"  %.1f GB\\n\",$1/1048576}'; } \\\n            || echo DOWN ;;\n  logs)   tail -f \"$LOG\" ;;\n  start)\n    curl -sf \"http://$HOST:$PORT/health\" >/dev/null 2>&1 && { echo \"already running\"; exit 0; }\n    pkill -f \"[l]lama-server\" 2>/dev/null || true; sleep 1; : > \"$LOG\"\n    nohup \"$BIN\" -m \"$MODEL\" -ngl 99 -c \"$CTX\" --jinja --host \"$HOST\" --port \"$PORT\" >> \"$LOG\" 2>&1 &\n    echo \"starting (pid $!), ctx=$CTX — loading ~81GB, takes a few minutes\"\n    printf \"waiting\"\n    until curl -sf \"http://$HOST:$PORT/health\" >/dev/null 2>&1; do\n      pgrep -f \"[l]lama-server\" >/dev/null || { echo \" FAILED — see $LOG\"; exit 1; }\n      printf .; sleep 3\n    done\n    echo \" UP on http://$HOST:$PORT\" ;;\n  *) echo \"usage: $0 {start|stop|status|logs}\"; exit 1 ;;\nesac\n```\n\nKey flags: `-ngl 99` offloads all layers to Metal, `-c 262144` sets the 256k window, `--jinja` enables tool calling.\n\n
`claude-local.sh` — run Claude Code against the local model\n\nThe whole trick is here: set the `ANTHROPIC_*` env vars for *this invocation only*, auto-starting the server if it’s down. Your normal cloud `claude` in other tabs is untouched.\n\n``` bash\n#!/usr/bin/env bash\n# Launch Claude Code against the LOCAL DeepSeek server (this invocation only;\n# your normal cloud `claude` is untouched). Args are forwarded to claude.\nset -euo pipefail\nDIR=\"$HOME/deepseek-v4-flash\"; HOST=127.0.0.1; PORT=8080\ncommand -v claude >/dev/null 2>&1 || { echo \"Claude Code CLI not found.\"; exit 1; }\ncurl -sf \"http://$HOST:$PORT/health\" >/dev/null 2>&1 || { echo \"starting local server...\"; \"$DIR/serve.sh\" start; }\n\nexport ANTHROPIC_BASE_URL=\"http://$HOST:$PORT\"\nexport ANTHROPIC_API_KEY=\"local-no-auth\"          # server ignores auth; this just skips the login flow\nexport ANTHROPIC_AUTH_TOKEN=\"local-no-auth\"\nexport ANTHROPIC_MODEL=\"deepseek-v4-flash\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"deepseek-v4-flash\"   # route the small/fast model locally too\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"deepseek-v4-flash\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"deepseek-v4-flash\"\nexport CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1  # no telemetry / update pings — fully local\nexec claude \"$@\"\n```\n\nThen, in any new terminal tab:\n\n```\n~/deepseek-v4-flash/claude-local.sh\n```\n\n`chat.sh` — plain terminal chat (no Claude Code)\n\nFor a quick conversation without the agent harness. 
The `-cnv` flag runs interactive conversation mode.\n\n``` bash\n#!/usr/bin/env bash\n# Interactive terminal chat with DeepSeek V4 Flash.\nset -euo pipefail\nDIR=\"$HOME/deepseek-v4-flash\"\nexec \"$DIR/llama.cpp/build/bin/llama-cli\" \\\n  -m \"$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf\" \\\n  -ngl 99 -c 8192 -cnv \"$@\"\n```\n\n`pi-local-provider.js` — register the local server as a Pi provider\n\nPi won’t honor `OPENAI_BASE_URL`, so we register a custom `local` provider in an extension. `api: \"openai-completions\"` matches `llama-server`’s OpenAI endpoint.\n\n```\n// Load with: pi -e ~/deepseek-v4-flash/pi-local-provider.js --provider local --model local/deepseek-v4-flash\nexport default async function (pi) {\n  pi.registerProvider(\"local\", {\n    baseUrl: \"http://127.0.0.1:8080/v1\",\n    apiKey: \"local-no-auth\",          // llama-server ignores auth; any non-empty value works\n    api: \"openai-completions\",\n    models: [\n      {\n        id: \"deepseek-v4-flash\",\n        name: \"DeepSeek V4 Flash (local, q2)\",\n        reasoning: false,\n        input: [\"text\"],\n        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },\n        contextWindow: 262144,\n        maxTokens: 8192,\n      },\n    ],\n  });\n}\n```\n\n`pi-local.sh` — run Pi against the local model\n\nInstall Pi first (`curl -fsSL https://pi.dev/install.sh | sh`, or `brew`/`npm`), then:\n\n``` bash\n#!/usr/bin/env bash\n# Launch the Pi coding agent against the LOCAL DeepSeek server.\n# Auto-starts the model server if needed. 
Args are forwarded to pi.\nset -euo pipefail\nDIR=\"$HOME/deepseek-v4-flash\"; HOST=127.0.0.1; PORT=8080\ncommand -v pi >/dev/null 2>&1 || { echo \"Pi not installed — see https://pi.dev\"; exit 1; }\ncurl -sf \"http://$HOST:$PORT/health\" >/dev/null 2>&1 || { echo \"starting local server...\"; \"$DIR/serve.sh\" start; }\nexec pi -e \"$DIR/pi-local-provider.js\" --provider local --model local/deepseek-v4-flash \"$@\"\n```\n\nThen, in any terminal:\n\n```\n~/deepseek-v4-flash/pi-local.sh                       # interactive\n~/deepseek-v4-flash/pi-local.sh -p \"explain this repo\"   # one-shot\n```\n\n## Would I actually use this?\n\nFor day-to-day coding? No — the hosted models are an order of magnitude faster and smarter. But as a demonstration that a **284B frontier-class MoE runs offline on a laptop**, and that you can drive **Claude Code with zero cloud dependency**, it’s remarkable. Air-gapped environments, flights, privacy-sensitive work, or just the sheer “because I can” factor — that’s where this shines.\n\nThe pieces that made it possible — antirez’s architecture port, a 2-bit quant that keeps the right layers precise, DeepSeek’s sparse attention keeping the KV cache tiny, and `llama-server`’s native Anthropic endpoint — are each individually clever. Stacked together, they put a frontier model on my lap. 
Literally.\n\n*Built and tested on macOS, Apple M3 Max, 128GB, June 2026.*", "url": "https://wpnews.pro/news/coding-with-deepseek-4-on-a-128gb-macbook-pro", "canonical_source": "https://ronreiter.com/posts/running-deepseek-v4-flash-locally/", "published_at": "2026-06-30 11:02:57+00:00", "updated_at": "2026-06-30 11:21:11.236184+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "ai-research", "developer-tools"], "entities": ["DeepSeek", "Apple", "antirez", "llama.cpp", "Claude Code", "Pi", "HuggingFace", "M3 Max"], "alternates": {"html": "https://wpnews.pro/news/coding-with-deepseek-4-on-a-128gb-macbook-pro", "markdown": "https://wpnews.pro/news/coding-with-deepseek-4-on-a-128gb-macbook-pro.md", "text": "https://wpnews.pro/news/coding-with-deepseek-4-on-a-128gb-macbook-pro.txt", "jsonld": "https://wpnews.pro/news/coding-with-deepseek-4-on-a-128gb-macbook-pro.jsonld"}}