Coding with DeepSeek 4 on a 128GB MacBook Pro

DeepSeek V4 Flash, a 284-billion-parameter Mixture-of-Experts model, now runs locally on a 128GB MacBook Pro via antirez's experimental llama.cpp fork, achieving ~21 tokens/sec generation on the Metal GPU. The 2-bit quantized model requires ~81GB of memory and supports up to 256k context reliably, enabling offline use of agent harnesses like Claude Code and Pi.

Running Claude Code and Pi on DeepSeek V4 Flash, locally on a 128GB MacBook Pro

A 284-billion-parameter frontier model, running entirely offline on a laptop and wired up as a backend for two agent harnesses: Claude Code and Pi.

DeepSeek V4 Flash dropped in April 2026: a 284B-parameter Mixture-of-Experts model (13B active per token), MIT-licensed, with a 1M-token context window. The interesting part for me wasn’t the benchmarks; it was the claim, floating around the internet, that you could run it locally on an Apple Silicon Mac with enough RAM.

I have a MacBook Pro with an M3 Max and 128GB of unified memory. So I tried it. Here’s everything that worked, everything that didn’t, and the scripts I ended up with.

TL;DR

- It works. ~21 tokens/sec generation, fully on the Metal GPU, ~81GB resident.
- You cannot use mainline llama.cpp or Ollama yet: the deepseek4 architecture isn’t merged. You need antirez’s experimental fork (https://github.com/antirez/llama.cpp-deepseek-v4-flash).
- The model file is an 81GB 2-bit “Dwarf Star” quant from antirez/deepseek-v4-gguf, purpose-built for 128GB Macs.
- llama-server now speaks the Anthropic Messages API natively, so you can point Claude Code at it with zero proxies.
- 1M context loads but crashes at inference; 256k is the reliable ceiling on this fork.

The hardware

- Chip: Apple M3 Max (12 performance + 4 efficiency cores)
- Memory: 128 GB unified

The 128GB is the whole ballgame. The 2-bit quant needs ~81GB resident, which means a 64GB machine is out: you’d swap to death or OOM.
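
The memory claim above comes down to simple arithmetic, which can be sketched in a few lines of Python. This is a rough back-of-the-envelope check, not a measurement: it assumes decimal gigabytes, uses only the approximate figures quoted in this post (284B parameters, ~81GB resident), and the helper names `effective_bits_per_weight` and `headroom_gb` are hypothetical, introduced purely for illustration.

```python
# Back-of-the-envelope memory check.
# All figures are the approximate ones quoted in the post; decimal GB is an assumption.
PARAMS = 284e9   # total parameters (284B)
MODEL_GB = 81    # approximate resident size of the 2-bit quant


def effective_bits_per_weight(model_gb: float, params: float) -> float:
    """Average number of bits stored per parameter for a model of this size."""
    return model_gb * 1e9 * 8 / params


def headroom_gb(ram_gb: float, model_gb: float) -> float:
    """RAM left once the weights are resident; negative means they cannot fit."""
    return ram_gb - model_gb


if __name__ == "__main__":
    bits = effective_bits_per_weight(MODEL_GB, PARAMS)
    print(f"effective bits/weight: {bits:.2f}")
    for ram in (64, 128):
        print(f"{ram} GB machine: {headroom_gb(ram, MODEL_GB):+.0f} GB headroom")
```

The headroom figure is optimistic: it ignores what macOS itself, the KV cache, and any other running applications need, so the real margin on a 128GB machine is smaller than the raw subtraction suggests.
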
128GB is the sweet spot the quant was designed around. There’s a bigger Q4 variant at 153GB for the 192GB Mac Studios, and DeepSeek-V4-Pro quants too, but Flash-q2 is the one that fits a laptop.

False start: the guide that didn’t work

I started from a tutorial that told me to git clone mainline llama.cpp, build it, and huggingface-cli download