Run Coding Agents on Local AI — Zero Cloud, Full Control A developer has created a guide for running coding agents like Codex CLI, Claude Code, and Cursor entirely on local hardware using Ollama, eliminating the need to send proprietary code to third-party servers. The setup, tested on an Apple M4 Pro with 48GB unified memory, recommends the qwen3-coder:30b model, which uses a Mixture-of-Experts architecture with only 3.3B active parameters per token and a 256K context window. While frontier models still outperform local models on complex reasoning, the developer found that a well-chosen local model handles 80% of daily coding tasks—including autocomplete, refactors, and test generation—without data leaving the network. Coding agents — Codex CLI, Claude Code, Cursor, and Pi — are productivity multipliers. But they all assume you are happy sending your code to someone else's servers. For many of us that is a deal-breaker: proprietary codebases, client NDAs, compliance requirements, or just the principle of owning your own compute. This guide shows how to swap out every cloud API with a local Ollama https://ollama.com server running qwen3-coder:30b . Same tools, same workflows, no data leaving your network. The case is simple: The honest tradeoff: frontier models Claude Opus 4, GPT-5 still outperform local models on complex multi-step reasoning and very large context tasks. For the 80% of day-to-day coding work — autocomplete, refactors, test generation, documentation — a well-chosen local model is more than good enough. I run this on an Apple M4 Pro with 48 GB unified memory . Apple Silicon's unified memory architecture is exceptionally well-suited to LLM inference: the GPU and CPU share the same memory pool, so a 22 GB model fits comfortably alongside a full development environment. Minimum viable setup: | RAM | What fits | |---|---| | 16 GB | 7–8B parameter models qwen3:8b, llama3.2:8b | | 32 GB | 14–20B models qwen3:14b, gpt-oss:20b | | 48 GB | 30–35B models qwen3-coder:30b, qwen3.6:35b | | 64 GB+ | 70B models deepseek-r1:70b, llama3.3:70b | On Intel/AMD systems with discrete GPUs the math is different: VRAM is the bottleneck, and models that don't fit entirely in VRAM fall back to slow CPU offloading. For 48 GB unified memory, these are the models worth knowing about: | Model | Size on disk | Active params | Strengths | |---|---|---|---| qwen3-coder:30b | ~22 GB | 3.3B MoE | Coding, 256K context, HumanEval SOTA | | qwen3.6:35b | ~24 GB | Full dense | General reasoning + vision | | gpt-oss:20b | ~14 GB | Full dense | Function calling, tool use | | gemma4:27b | ~18 GB | Full dense | Math, structured output | | deepseek-r1:70b | ~45 GB | Full dense | Chain-of-thought, complex reasoning | qwen3-coder:30b is the default recommendation for coding tasks. It uses a Mixture-of-Experts architecture — only 3.3B parameters are active per token — so inference is fast despite the large parameter count. The 256K context window handles entire codebases without chunking. It beats GPT-4o on HumanEval benchmarks. Pull it with Ollama: ollama pull qwen3-coder:30b By default Ollama listens on localhost only. To reach it from other machines on your LAN or to let coding tools that open their own network connections reach it , bind to all interfaces: OLLAMA HOST=0.0.0.0 ollama serve To make this permanent on macOS, edit the Ollama launch agent or set the environment variable in your shell profile before starting Ollama. The server will then be reachable at: http://192.168.2.200:11434 Replace 192.168.2.200 with your machine's LAN IP. Verify it is working: curl http://192.168.2.200:11434/api/tags | jq '.models .name' Ollama exposes an OpenAI-compatible /v1 endpoint, which is what all the tools below use. Codex CLI https://github.com/openai/codex is OpenAI's terminal-based coding agent. It supports custom model providers through its TOML configuration. npm install -g @openai/codex Create ~/.codex/config.toml : model = "qwen3-coder:30b" model provider = "ollama remote" model context window = 262144 model catalog json = "/Users/me/.codex/model catalog.json" model providers.ollama remote name = "Ollama Remote" base url = "http://192.168.2.200:11434/v1" env key = "OLLAMA API KEY" A few gotchas discovered the hard way: ollama-remote fails with a parse error; ollama remote works. name is required model providers. . Omitting it throws provider name must not be empty . ollama , openai , and lmstudio are reserved ollama remote . model context window Set the API key environment variable Ollama doesn't require auth, but Codex won't start without it : export OLLAMA API KEY=ollama Without a model catalog, Codex prints Model metadata for qwen3-coder:30b not found and falls back to broken defaults. The catalog format requires every field from Codex's bundled schema — a simplified JSON with just a few keys will fail with missing field errors. The cleanest approach: generate the catalog from Codex's own bundled metadata and patch in your model: python codex debug models --bundled | python3 -c " import json, sys d = json.load sys.stdin m = d 'models' 0 .copy m 'slug' = 'qwen3-coder:30b' m 'display name' = 'Qwen3-Coder 30B' m 'description' = 'Coding-specialized MoE model with 256K context.' m 'context window' = 262144 m 'max context window' = 262144 m 'availability nux' = None m 'upgrade' = None m 'supported reasoning levels' = m 'default reasoning level' = 'low' m 'supports reasoning summaries' = False m 'default reasoning summary' = 'none' print json.dumps {'models': m }, indent=2 " ~/.codex/model catalog.json The two critical fields are supported reasoning levels: and supports reasoning summaries: false . Without them, Codex sends a thinking parameter that Ollama rejects with does not support thinking . Note that qwen3-coder:30b does support chain-of-thought reasoning — Qwen3 models reason internally via