{"slug": "run-coding-agents-on-local-ai-zero-cloud-full-control", "title": "Run Coding Agents on Local AI — Zero Cloud, Full Control", "summary": "A developer has created a guide for running coding agents like Codex CLI, Claude Code, and Cursor entirely on local hardware using Ollama, eliminating the need to send proprietary code to third-party servers. The setup, tested on an Apple M4 Pro with 48GB unified memory, recommends the qwen3-coder:30b model, which uses a Mixture-of-Experts architecture with only 3.3B active parameters per token and a 256K context window. While frontier models still outperform local models on complex reasoning, the developer found that a well-chosen local model handles 80% of daily coding tasks—including autocomplete, refactors, and test generation—without data leaving the network.", "body_md": "Coding agents — Codex CLI, Claude Code, Cursor, and Pi — are productivity multipliers. But they all assume you are happy sending your code to someone else's servers. For many of us that is a deal-breaker: proprietary codebases, client NDAs, compliance requirements, or just the principle of owning your own compute.\n\nThis guide shows how to swap out every cloud API with a local [Ollama](https://ollama.com) server running **qwen3-coder:30b**. Same tools, same workflows, no data leaving your network.\n\nThe case is simple:\n\nThe honest tradeoff: frontier models (Claude Opus 4, GPT-5) still outperform local models on complex multi-step reasoning and very large context tasks. For the 80% of day-to-day coding work — autocomplete, refactors, test generation, documentation — a well-chosen local model is more than good enough.\n\nI run this on an **Apple M4 Pro with 48 GB unified memory**. 
Apple Silicon's unified memory architecture is exceptionally well-suited to LLM inference: the GPU and CPU share the same memory pool, so a 22 GB model fits comfortably alongside a full development environment.\n\nMinimum viable setup:\n\n| RAM | What fits |\n|---|---|\n| 16 GB | 7–8B parameter models (qwen3:8b, llama3.1:8b) |\n| 32 GB | 14–20B models (qwen3:14b, gpt-oss:20b) |\n| 48 GB | 30–35B models (qwen3-coder:30b, qwen3.6:35b) |\n| 64 GB+ | 70B models (deepseek-r1:70b, llama3.3:70b) |\n\nOn Intel/AMD systems with discrete GPUs the math is different: VRAM is the bottleneck, and models that don't fit entirely in VRAM fall back to slow CPU offloading.\n\nFor 48 GB unified memory, these are the models worth knowing about:\n\n| Model | Size on disk | Active params | Strengths |\n|---|---|---|---|\n| qwen3-coder:30b | ~22 GB | 3.3B (MoE) | Coding, 256K context, HumanEval SOTA |\n| qwen3.6:35b | ~24 GB | Full dense | General reasoning + vision |\n| gpt-oss:20b | ~14 GB | 3.6B (MoE) | Function calling, tool use |\n| gemma4:27b | ~18 GB | Full dense | Math, structured output |\n| deepseek-r1:70b | ~45 GB | Full dense | Chain-of-thought, complex reasoning |\n\n**qwen3-coder:30b** is the default recommendation for coding tasks. It uses a Mixture-of-Experts architecture in which only 3.3B parameters are active per token, so inference is fast despite the large parameter count. The 256K context window can hold many small and mid-sized codebases without chunking. It beats GPT-4o on HumanEval benchmarks.\n\nPull it with Ollama:\n\n```\nollama pull qwen3-coder:30b\n```\n\nBy default Ollama listens on `localhost` only. To reach it from other machines on your LAN (or to let coding tools that open their own network connections reach it), bind to all interfaces:\n\n```\nOLLAMA_HOST=0.0.0.0 ollama serve\n```\n\nTo make this permanent on macOS, edit the Ollama launch agent or set the environment variable in your shell profile before starting Ollama. 
The server will then be reachable at:\n\n```\nhttp://192.168.2.200:11434\n```\n\nReplace `192.168.2.200` with your machine's LAN IP. Verify it is working:\n\n```\ncurl http://192.168.2.200:11434/api/tags | jq '.models[].name'\n```\n\nOllama exposes an OpenAI-compatible `/v1` endpoint, which is what most of the tools below use.\n\n[Codex CLI](https://github.com/openai/codex) is OpenAI's terminal-based coding agent. It supports custom model providers through its TOML configuration.\n\n```\nnpm install -g @openai/codex\n```\n\nCreate `~/.codex/config.toml`:\n\n```\nmodel = \"qwen3-coder:30b\"\nmodel_provider = \"ollama_remote\"\nmodel_context_window = 262144\nmodel_catalog_json = \"/Users/me/.codex/model_catalog.json\"\n\n[model_providers.ollama_remote]\nname = \"Ollama Remote\"\nbase_url = \"http://192.168.2.200:11434/v1\"\nenv_key = \"OLLAMA_API_KEY\"\n```\n\nA few gotchas discovered the hard way:\n\n- `ollama-remote` fails with a parse error; `ollama_remote` works.\n- `name` is required in `[model_providers.*]`. Omitting it throws `provider name must not be empty`.\n- `ollama`, `openai`, and `lmstudio` are reserved, so use a distinct ID such as `ollama_remote`.\n- `model_context_window`\n\nSet the API key environment variable (Ollama doesn't require auth, but Codex won't start without it):\n\n```\nexport OLLAMA_API_KEY=ollama\n```\n\nWithout a model catalog, Codex prints `Model metadata for qwen3-coder:30b not found` and falls back to broken defaults. 
The catalog format requires every field from Codex's bundled schema; a simplified JSON with just a few keys will fail with `missing field` errors.\n\nThe cleanest approach: generate the catalog from Codex's own bundled metadata and patch in your model:\n\n```bash\ncodex debug models --bundled | python3 -c \"\nimport json, sys\nd = json.load(sys.stdin)\nm = d['models'][0].copy()\nm['slug'] = 'qwen3-coder:30b'\nm['display_name'] = 'Qwen3-Coder 30B'\nm['description'] = 'Coding-specialized MoE model with 256K context.'\nm['context_window'] = 262144\nm['max_context_window'] = 262144\nm['availability_nux'] = None\nm['upgrade'] = None\nm['supported_reasoning_levels'] = []\nm['default_reasoning_level'] = 'low'\nm['supports_reasoning_summaries'] = False\nm['default_reasoning_summary'] = 'none'\nprint(json.dumps({'models': [m]}, indent=2))\n\" > ~/.codex/model_catalog.json\n```\n\nThe two critical fields are `supported_reasoning_levels: []` and `supports_reasoning_summaries: false`. Without them, Codex sends a `thinking` parameter that Ollama rejects with `does not support thinking`. Note that `qwen3-coder:30b` is published as a non-thinking model: unlike the general-purpose Qwen3 models, it does not reason via `<think>` tags, so there is no reasoning mode for Codex to request. 
Setting these two fields stops Codex from sending a reasoning request in an OpenAI-specific format that Ollama doesn't accept.\n\nVerify the catalog loaded correctly:\n\n```bash\nOLLAMA_API_KEY=ollama codex debug models | python3 -c \"\nimport json, sys\nd = json.load(sys.stdin)\nm = [x for x in d['models'] if 'qwen3-coder' in x['slug']][0]\nprint('slug:', m['slug'], '| context_window:', m['context_window'])\nprint('reasoning_levels:', m['supported_reasoning_levels'])\n\"\n# slug: qwen3-coder:30b | context_window: 262144\n# reasoning_levels: []\n\n# Then launch Codex with the key set for this one invocation:\nOLLAMA_API_KEY=ollama codex\n```\n\nOr add it permanently to `~/.zshrc`:\n\n```\nexport OLLAMA_API_KEY=ollama\n```\n\nThen just run `codex` from any project directory.\n\nClaude Code is Anthropic's official CLI agent. It is hardwired to the Anthropic API but accepts a base URL override, which means you can point it at any endpoint that implements the Anthropic Messages API, which recent Ollama releases do.\n\nSet two environment variables before launching Claude Code:\n\n```\nexport ANTHROPIC_BASE_URL=http://192.168.2.200:11434\nexport ANTHROPIC_API_KEY=ollama\n```\n\nStart Claude Code:\n\n```\nclaude\n```\n\nAt the login prompt, select **\"Anthropic Console\"** as the login method. Claude Code will use the base URL you provided instead of `api.anthropic.com`.\n\nTo make this permanent, add the exports to your shell profile (`~/.zshrc`, `~/.bashrc`):\n\n```\n# Local AI backend for Claude Code\nexport ANTHROPIC_BASE_URL=http://192.168.2.200:11434\nexport ANTHROPIC_API_KEY=ollama\n```\n\nThen reload:\n\n```\nsource ~/.zshrc\n```\n\nOne practical note: Claude Code's system prompts are written for Claude models and include Anthropic-specific formatting expectations. qwen3-coder:30b handles them well, but you may see occasional formatting quirks in responses. They do not affect functionality.\n\nCursor has a similar configuration path. 
In **Settings → Models → OpenAI API Key**, switch to a custom base URL:\n\n1. Open Cursor settings (`Cmd+,`).\n2. Override the OpenAI base URL with `http://192.168.2.200:11434/v1`.\n3. Enter `ollama` as the API key.\n4. Add `qwen3-coder:30b` as the model.\n\n[Pi](https://pi.dev) is a minimal agent harness built for extensibility: \"adapt Pi to your workflows, not the other way around.\" It supports 15+ providers and custom local endpoints via a `models.json` file that hot-reloads between sessions.\n\n```\nnpm install -g @pi-ag/coding-agent\n```\n\nAdd your local Ollama server to `~/.pi/agent/models.json`:\n\n```\n{\n  \"providers\": {\n    \"ollama_remote\": {\n      \"baseUrl\": \"http://192.168.2.200:11434/v1\",\n      \"api\": \"openai-completions\",\n      \"apiKey\": \"ollama\",\n      \"models\": [\n        {\n          \"id\": \"qwen3-coder:30b\",\n          \"contextWindow\": 262144,\n          \"compat\": {\n            \"supportsDeveloperRole\": false,\n            \"supportsReasoningEffort\": false\n          }\n        }\n      ]\n    }\n  }\n}\n```\n\nThe `compat` block is important: Ollama doesn't understand the `developer` role or `reasoning_effort` parameter that Pi sends to reasoning-capable models by default. Setting both to `false` routes around those errors.\n\n```\npi\n```\n\nSelect the model with `/model` inside the session; it lists all providers, including your custom `ollama_remote` entry. 
The `models.json` file reloads each time you open `/model`, so you can add or swap models without restarting.\n\nBeing honest about the limitations matters more than selling this as a perfect replacement.\n\n**Where qwen3-coder:30b matches or beats cloud models:**\n\n**Where frontier models still have an edge:**\n\n**Operational considerations:**\n\nIf qwen3-coder:30b is not the right fit for a specific task, here is when to switch:\n\n- `qwen3.6:35b`: it has multimodal support.\n- `gpt-oss:20b` has more reliable structured output.\n- `gemma4:27b` has strong performance on reasoning benchmarks.\n- `deepseek-r1:70b` (needs 45+ GB free RAM).\n\nSwitching models in Ollama is straightforward: pull the model and update the `model` field in your config.\n\nReplacing cloud APIs with a local Ollama server is a one-afternoon project that delivers permanent benefits: no per-token cost, no data exposure, no rate limits. The setup is three configuration files and a few environment variables.\n\nqwen3-coder:30b is capable enough that you will not miss the cloud for most coding work. When you do need frontier-level reasoning, the cloud is still there, but now it is opt-in, not the default.\n\nThe key insight is that your hardware, your code, and your workflow should stay under your control. The tools were always willing to connect to any compatible endpoint. 
Now you know how to give them one that you own.", "url": "https://wpnews.pro/news/run-coding-agents-on-local-ai-zero-cloud-full-control", "canonical_source": "https://dev.to/dalenguyen/run-coding-agents-on-local-ai-zero-cloud-full-control-5e9e", "published_at": "2026-06-07 00:42:42+00:00", "updated_at": "2026-06-07 01:11:49.759721+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-tools", "ai-infrastructure", "ai-products"], "entities": ["Codex CLI", "Claude Code", "Cursor", "Pi", "Ollama", "qwen3-coder:30b", "Apple M4 Pro", "Apple Silicon"], "alternates": {"html": "https://wpnews.pro/news/run-coding-agents-on-local-ai-zero-cloud-full-control", "markdown": "https://wpnews.pro/news/run-coding-agents-on-local-ai-zero-cloud-full-control.md", "text": "https://wpnews.pro/news/run-coding-agents-on-local-ai-zero-cloud-full-control.txt", "jsonld": "https://wpnews.pro/news/run-coding-agents-on-local-ai-zero-cloud-full-control.jsonld"}}