{"slug": "running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram", "title": "Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM", "summary": "Qwen open-sourced the 35-billion parameter Mixture of Experts model Qwen 3.6-35B-A3B, which activates only 3 billion parameters per token and runs on a $599 Mac Mini M4 with 16GB RAM at 17 tok/s with zero swap. The model's MoE architecture enables memory-mapped inference, making it competitive with larger models on coding and reasoning tasks while running entirely locally.", "body_md": "# Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM\n\nTwo days ago Qwen open-sourced [Qwen 3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — a 35-billion parameter Mixture of Experts model that only activates 3 billion parameters per token. It's Apache 2.0 licensed, ships with a vision encoder, and is [reportedly competitive with much larger models](https://letsdatascience.com/news/qwen36-35b-a3b-outdraws-claude-opus-47-locally-b86a7f47) on agentic coding benchmarks. GGUF quantizations were up within hours.\n\nHere's the thing: you can run it on a $599 Mac Mini M4 with 16GB of RAM. Not a toy demo — actual usable inference at 17 tok/s, zero swap, 81% memory free. This post is about how to do that, and which tools give you the best experience.\n\n## Why 35B-A3B works on 16GB\n\nThe naive math says it shouldn't fit. The standard formula for estimating model memory ([from BentoML](https://bentoml.com/llm/getting-started/calculating-gpu-memory-for-llms)):\n\n```\nMemory (GB) = Parameters (B) × (Bits per weight / 8) × 1.2 overhead\n35 × 4 / 8 × 1.2 = ~21GB\n```\n\n21GB for a Q4 quantization. That doesn't fit in 16GB. So how does it work?\n\nThe key is the MoE architecture. \"35B-A3B\" means 35 billion total parameters, but only 3 billion **a**ctive per token. The model uses 256 total experts with 8 routed + 1 shared active per inference step. The remaining experts sit idle. 
This is what makes the `--mmap` trick possible: `llama.cpp` memory-maps the model file, and the OS only pages in the weights for the currently active experts. Since the hot working set is roughly 3B parameters (~2GB at Q4), it fits comfortably in 16GB with room to spare.\n\n[Jock.pl benchmarked this](https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026) on a Mac Mini M4 16GB: 17.3 tok/s decode, 81% memory free, zero swap. That's not hypothetical — it's a real measurement on the base-model Mac Mini.\n\n**Why this matters:** On benchmarks, the 35B-A3B architecture [beats dense models up to 120B](https://byteiota.com/qwen-3-5-beats-120b-models-on-16gb-ram-local-setup-guide/) on coding and reasoning tasks, while running at the latency of a 3B model. On 16GB RAM. For $0/month. That's the pitch.\n\n## Picking your inference tool\n\nThere are four main ways to run LLMs locally on a Mac. Here's how they compare for running the 35B-A3B on 16GB specifically:\n\n| Tool | Ease of setup | 35B-A3B on 16GB? | Tool calling | Notes |\n|---|---|---|---|---|\n| llama.cpp | | | | |\n| [Ollama](https://ollama.com/) | | | | |\n| [LM Studio](https://lmstudio.ai/) | | | | |\n| [MLX / mlx-lm](https://github.com/ml-explore/mlx) | | | | |\n\n*`mlx-vlm` has a [PR in progress](https://github.com/Blaizzy/mlx-vlm/pull/773) for tool calling support.\n\nOne important detail: [Ollama 0.19](https://ollama.com/blog/mlx) (released March 30, 2026) shipped an MLX backend that nearly doubles decode speed — from 58 tok/s to 112 tok/s. But it requires 32GB+ unified memory. On 16GB, Ollama falls back to the llama.cpp backend. Still works, just not the fast path. LM Studio doesn't have this gate and can use MLX on 16GB, which is a real advantage.\n\n## Setup 1: llama.cpp with mmap (recommended)\n\nThis is the most reliable way to run the 35B-A3B on 16GB. 
Metal GPU acceleration is enabled by default on macOS — no flags needed.\n\n```\n# Build llama.cpp\ngit clone https://github.com/ggml-org/llama.cpp\ncd llama.cpp\ncmake -B build\ncmake --build build --config Release\n\n# Download the Qwen 3.6 GGUF (Q4_K_M quantization)\npip install huggingface_hub\nhuggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \\\n  Qwen3.6-35B-A3B-Q4_K_M.gguf \\\n  --local-dir models/\n\n# Run with mmap — this is the key flag\n./build/bin/llama-cli \\\n  -m models/Qwen3.6-35B-A3B-Q4_K_M.gguf \\\n  --mmap \\\n  -c 4096 \\\n  -n 512 \\\n  -p \"Write a FastAPI endpoint with input validation\"\n```\n\nWhat's happening: `--mmap` tells llama.cpp to memory-map the model file instead of loading it all into RAM. The OS pages in weights on demand. Because only ~3B parameters are active per token, the actual resident memory stays well under 16GB. The rest of the 21GB model file lives on your SSD and gets paged in only when an expert is activated.\n\nYou can also run it as an OpenAI-compatible API server for use with coding agents:\n\n```\n# Start the server\n./build/bin/llama-server \\\n  -m models/Qwen3.6-35B-A3B-Q4_K_M.gguf \\\n  --mmap \\\n  -c 4096 \\\n  --port 8080\n\n# Now any tool that speaks the OpenAI API can use it:\n# aider, opencode, aichat, etc. → http://localhost:8080/v1\n```\n\nGGUF files are also available from [bartowski](https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF) if you prefer different quantization levels.\n\n## Setup 2: Ollama (easiest)\n\nIf you don't want to build anything from source, [Ollama](https://ollama.com/download) handles everything — download, quantization, API server — in one command. Under the hood it uses llama.cpp, so mmap works the same way.\n\n```\n# Install Ollama, then:\nollama run qwen3.6:35b-a3b\n```\n\nThat's it. Ollama downloads the GGUF, picks Q4_K_M by default, and starts an [OpenAI-compatible API](https://github.com/ollama/ollama/blob/main/docs/api.md) at `http://localhost:11434`. 
You can connect coding agents directly:\n\n```\n# Launch with opencode\nollama launch opencode --model qwen3.6:35b-a3b\n\n# Or with OpenClaw\nollama launch openclaw --model qwen3.6:35b-a3b\n```\n\nOllama exposes models to anything that speaks the OpenAI API format — [aichat](https://github.com/sigoden/aichat), [aider](https://github.com/paul-gauthier/aider), [opencode](https://opencode.ai/), and many others. Point them at `http://localhost:11434/v1`.\n\nOn 16GB you'll get roughly the same 17 tok/s as raw llama.cpp. The MLX-accelerated path (which hits 70-80 tok/s for this model on larger machines) requires 32GB+, so Ollama falls back to the llama.cpp backend. Still perfectly usable for interactive work.\n\n## Setup 3: LM Studio (best GUI, MLX on 16GB)\n\n[LM Studio](https://lmstudio.ai/) deserves a special mention because it can run MLX-optimized models on 16GB — unlike Ollama, which gates the MLX backend behind 32GB. MLX uses roughly [50% less memory](https://insiderllm.com/guides/qwen35-mac-mlx-vs-ollama/) than the llama.cpp backend for the same model at the same quantization, and is [about 2x faster](https://dev.to/thefalkonguy/installing-qwen-35-on-apple-silicon-using-mlx-for-2x-performance-37ma).\n\nDownload the app, open the models page (`Cmd + Shift + M`), search for \"Qwen3.6-35B-A3B\", and filter by \"MLX\". Grab the 4-bit quantization from the [mlx-community](https://huggingface.co/mlx-community).\n\n[Kai Wern's guide](https://kaiwern.com/posts/2026/03/01/run-qwen3.5-locally-on-your-mac/) reports 81.79 tok/s generation speed with LM Studio running the MLX-optimized 35B-A3B on a 64GB machine. On 16GB the model is a tighter fit via MLX, but community reports show it works — MLX's lower memory footprint is exactly what makes the difference between clean operation and swap thrashing.\n\nTo use LM Studio as a local API server (for coding agents), switch to the Developer screen (`Cmd + 2`) and toggle the server to Running. 
It serves on `http://127.0.0.1:1234/v1`.\n\n## Setup 4: Raw MLX (fastest inference, no tool calling)\n\n[MLX](https://github.com/ml-explore/mlx) is Apple's own ML framework and the fastest inference path on Apple Silicon. If you don't need tool calling — just direct chat or batch generation — this gives you the best tok/s.\n\n```\n# Install mlx-lm\npip install mlx-lm\n\n# Run the 35B-A3B\nmlx_lm.generate \\\n  --model mlx-community/Qwen3.6-35B-A3B-4bit \\\n  --max-tokens 200 \\\n  --temp 0.7 \\\n  --prompt \"Write a Python function to merge two sorted lists\"\n\n# Or start an OpenAI-compatible server\nmlx_lm.server --model mlx-community/Qwen3.6-35B-A3B-4bit --port 8080\n```\n\nSince Qwen 3.6-35B-A3B is a vision-language model, you can also use `mlx-vlm` to process images:\n\n```\n# Install with torch dependency (avoids transformers errors)\nbrew install pipx\npipx install \"mlx-vlm[torch]\"\n\n# Pass the image with --image (path/to/image.jpg is a placeholder)\nmlx_vlm.generate \\\n  --model mlx-community/Qwen3.6-35B-A3B-4bit \\\n  --max-tokens 200 \\\n  --temperature 0.0 \\\n  --prompt \"Describe this image\" \\\n  --image path/to/image.jpg\n```\n\nThe catch: `mlx-lm` doesn't support tool calling yet, so you can't use it as a backend for coding agents that need to read/edit files. 
For that, use llama.cpp, Ollama, or LM Studio.\n\n## Performance on Mac Mini M4 16GB\n\nAll numbers are for the Qwen 3.6-35B-A3B at Q4 quantization on the base Mac Mini M4 (16GB unified memory, 10-core CPU, 10-core GPU), based on community benchmarks and [Jock.pl's measurements](https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026):\n\n| Tool | Decode (tok/s) | RAM resident | Swap | Notes |\n|---|---|---|---|---|\n| llama.cpp (mmap) | ~17 | ~3GB active | Zero | Most reliable on 16GB |\n| Ollama | ~17 | ~3GB active | Zero | Same backend, easier setup |\n| LM Studio (MLX) | ~25-35* | ~10-12GB | Minimal | Faster but tighter on memory |\n| MLX (raw) | ~25-35* | ~10-12GB | Minimal | Fastest; no tool calling |\n\n*MLX numbers on 16GB are extrapolated from [Ante Kapetanovic's benchmarks](https://antekapetanovic.com/blog/qwen3.5-apple-silicon-benchmark/) on larger machines, scaled for memory bandwidth constraints. On a 64GB M4 Pro, the same model hits 70-80 tok/s via MLX. The 16GB constraint forces MLX to be more conservative with caching.\n\nFor comparison, on a Mac with 32GB+ and Ollama 0.19's MLX backend enabled, this same model hits [1810 tok/s prefill and 112 tok/s decode](https://ollama.com/blog/mlx). The 32GB threshold is real — if you're buying a Mac specifically for local inference, the 32GB upgrade pays for itself.\n\nBut 17 tok/s is genuinely usable. That's fast enough for interactive chat, code generation, and tool-calling agents. It's slower than an API call, but it's free, private, and offline.\n\n## What about Qwen 3.6-Plus?\n\nThe full [Qwen 3.6-Plus](https://www.buildfastwithai.com/blogs/qwen-3-6-plus-preview-review) flagship (1M context, top-of-the-line benchmarks) was released on April 2, but it's API-only through Alibaba's DashScope. No weights, no GGUF files, no local option. 
The [GitHub repo](https://github.com/QwenLM/Qwen3.6) is up but only for the 35B-A3B variant.\n\nFor local inference, 3.6-35B-A3B is it — and given its benchmark numbers relative to its active parameter count, it's far more than a consolation prize.\n\n## My setup\n\nI run the Qwen 3.6-35B-A3B as my daily driver on the Mac Mini M4 16GB. Specifically:\n\n- **Default:** Ollama with `qwen3.6:35b-a3b` running in the background. One-command start, OpenAI-compatible API, tool calling works. I point aider and opencode at it.\n- **When I need speed:** LM Studio with the MLX-optimized version. Noticeably snappier for interactive chat.\n- **For scripting / batch work:** llama.cpp server with `--mmap`. Most control over parameters, lowest overhead.\n\nOne pattern that works well: Ollama keeps the model warm in the background. When you send a request after a cold start, the first response is slow (~5-10s) while the active experts get paged in. Subsequent requests are much faster because the hot pages stay cached. 
Don't kill Ollama between sessions if you can avoid it.\n\n## Useful links\n\n- [Hugging Face — Qwen 3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) (model card, benchmarks)\n- [Unsloth — Qwen 3.6-35B-A3B GGUF quantizations](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF)\n- [bartowski — Qwen 3.6-35B-A3B GGUF (alternative quants)](https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF)\n- [Ollama blog — MLX backend announcement](https://ollama.com/blog/mlx) (0.19 release)\n- [Qwen docs — running with llama.cpp](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)\n- [Jock.pl — 35B on Mac Mini 16GB](https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026) (benchmark source)\n- [Kai Wern — Run Qwen locally on your Mac](https://kaiwern.com/posts/2026/03/01/run-qwen3.5-locally-on-your-mac/) (LM Studio + Ollama + MLX walkthrough)\n- [Ante Kapetanovic — Ollama vs llama.cpp vs MLX benchmark](https://antekapetanovic.com/blog/qwen3.5-apple-silicon-benchmark/)\n- [InsiderLLM — MLX vs Ollama speed test](https://insiderllm.com/guides/qwen35-mac-mlx-vs-ollama/)\n- [DEV.to — MLX on Apple Silicon for 2x performance](https://dev.to/thefalkonguy/installing-qwen-35-on-apple-silicon-using-mlx-for-2x-performance-37ma)\n- [GitHub — 16GB VRAM Local configs and launchers](https://github.com/willbnu/Qwen-3.5-16G-Vram-Local)\n- [Will It Run AI — Qwen VRAM requirements guide](https://willitrunai.com/blog/qwen-3-gpu-requirements)\n- [Unsloth — Qwen local setup docs](https://unsloth.ai/docs/models/qwen3.5)\n- [r/LocalLLaMA — Explain MoE like I'm 25](https://www.reddit.com/r/LocalLLaMA/comments/174f42z/can_anyone_explain_moe_like_im_25/)", "url": "https://wpnews.pro/news/running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram", "canonical_source": "https://maloyan.xyz/blog/running-qwen-locally-mac-mini-m4", "published_at": "2026-07-04 06:59:53+00:00", "updated_at": "2026-07-04 07:20:34.186659+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-infrastructure", 
"generative-ai"], "entities": ["Qwen", "Mac Mini M4", "llama.cpp", "Ollama", "LM Studio", "MLX", "Hugging Face", "BentoML"], "alternates": {"html": "https://wpnews.pro/news/running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram", "markdown": "https://wpnews.pro/news/running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram.md", "text": "https://wpnews.pro/news/running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram.txt", "jsonld": "https://wpnews.pro/news/running-qwen-3-6-locally-on-a-mac-mini-m4-with-16gb-ram.jsonld"}}