{"slug": "using-local-coding-agents", "title": "Using Local Coding Agents", "summary": "A new tutorial demonstrates how to set up a fully local coding agent using open-weight LLMs and open-source harnesses as an alternative to subscription-based services like Claude Code and Codex. The local setup offers transparency, inspectability, and cost-free operation beyond hardware and electricity, appealing to users concerned with privacy, reproducibility, and offline use.", "body_md": "# Using Local Coding Agents\n\n### Using Open-Weight Models in Local Coding Harnesses as an Alternative to Claude Code and Codex Subscriptions\n\nMany people have reached out to me asking about my local agent stack and how I set it up.\n\nSo, I thought it might be useful to put together a little tutorial on how to set up a local (coding) agent using open-source tools and open-weight LLMs.\n\nThis article is a tutorial on setting up a production-ready coding agent with a fully local stack. We will use a locally served LLM together with a local coding harness that can read files, make edits, run commands, and verify changes as shown in the figure above.\n\nHere, we can think of the LLM as the engine that provides the reasoning and code generation. The surrounding harness provides the operating environment that allows the LLM to do meaningful coding work in our local projects.\n\nWhy local? For many coding workflows, a local setup is an interesting alternative to proprietary services such as GPT in Codex or Opus in Claude Code. The local setup is transparent, inspectable, and free to run apart from hardware and electricity costs. It also stays fully under your control, and you can modify the coding harness in any way you like. 
**Plus, it’s a lot of fun!**\n\nBy the way, in case you want a bit more background information on coding agent harnesses, I covered the core components of coding agents (and building a coding agent from scratch for learning purposes) here:\n\n## 1. Intro\n\nI have to admit that I still primarily alternate between Codex and Claude Code as my daily drivers, for now (and just to keep up with the new tooling and functions that are constantly being added). Also, the plan limits (especially for Codex) are still so generous that I haven’t had to worry about costs so far.\n\nHowever, I’ve been using local solutions for a while, too, to test things and because it somehow gives me joy to have and use a fully local setup (versus proprietary services).\n\nEither way, local solutions become more and more attractive each day. One aspect is the costs. If you have the hardware, they are practically free to run. And then there’s, of course, the privacy angle. For example, for organizing and processing my receipts, I’d be more comfortable with a local model ingesting them rather than sending the data over to OpenAI or Anthropic.\n\n(Then, if we keep in mind that Anthropic was recently [throttling their flagship model’s performance for LLM research](https://substack.com/@rasbt/note/c-273441982), proprietary services may become more restrictive over time, and it’s maybe a good idea to be comfortable with open-weight alternatives as a backup.)\n\nAnd there are many, many additional reasons and use cases like that.\n\nYour motivations for using local LLMs and coding harnesses may include:\n\nPredictable, fixed costs if you reach your subscription plan limits, and immunity to API price changes.\n\nReproducibility; sometimes it’s nice if a model is upgraded (e.g., GPT 5.4 -> GPT 5.5 -> GPT 5.6) and it solves all your queries more reliably. 
However, this can also break existing workflows.\n\nOffline use in the classic airplane flight scenario with slow or no internet, or when going on a coding/writing retreat in the cabin in the woods without a Starlink subscription.\n\nAnd there are probably several others.\n\nSo, in this article, we will set up and use popular harnesses like Codex and Claude Code with open-weight models and investigate whether using a model-specific harness (like Qwen-Code for Qwen3.6) brings any additional benefits. (Of course, there are many more harnesses like OpenCode, Cline, Pi, and Noumena Code, but I thought that most people already have muscle memory with either Codex or Claude Code, which makes switching to open-weight models a bit smoother.)\n\n## 2. Coding Agent Harness Overview\n\nMost coding agent harnesses follow similar principles and have more or less the same features and functionality. However, the implementation details may differ, and a given LLM has usually been optimized primarily for a specific harness. Of course, many open-weight LLMs like GLM 5.2, for example, would also run in Claude Code and other harnesses.\n\nHowever, if an LLM developer also develops a coding harness, it is somewhat safe to assume that their model is optimized for their own harness first (while also supporting others).\n\nHere, I am primarily going to use Qwen3.6 with the Qwen-Code coding client. 
However, I will also go over other options for using a local LLM with other agent harnesses, for example, Claude Code, Codex, and the increasingly popular Cline, but more on that later.\n\nThe reason why I am primarily using Qwen-Code when working with Qwen models is that:\n\nit is open-source, like Codex ([https://github.com/openai/codex](https://github.com/openai/codex)) but unlike Claude Code;\n\nQwen models have been specifically optimized for the Qwen-Code harness (more information below);\n\nI can run both Codex (with the latest GPT model) and Qwen-Code with a local Qwen model side by side on the same machine without having to switch manually back and forth between models.\n\nRegarding the second point in the list above, that Qwen models work better in Qwen-Code, Nvidia’s [Polar: Agentic RL on Any Harness at Scale](https://arxiv.org/abs/2605.24220) paper (May 2026) has a benchmark showing that the Qwen3.5-4B base model has the best coding performance in said Qwen-Code harness (both before and after their Polar-RL training), which I included below.\n\nThe benchmark in the table above is for an older Qwen3.5 model, and I am assuming that the latest Qwen3.6 models are even further optimized to do well in Qwen-Code specifically.\n\nHowever, Pi ([https://github.com/earendil-works/pi](https://github.com/earendil-works/pi)) also seems to be a very interesting candidate that I need to play around with in the future.\n\nBy the way, Qwen3.6 35B-A3B is about 22 GB to download, requires roughly 30-40 GB of RAM, and runs pretty swiftly on both a Mac Mini with M4 and a DGX Spark.\n\nBased on the recent benchmarks shared by Cohere earlier in June, it is currently the best local model in its size class.\n\nAs seen above, Qwen3.6 35B-A3B dominates all but one benchmark in this size class. However, that being said, Qwen Code is a general harness and also supports other types of models. 
For instance, we could also connect North Mini Code or Gemma 4 in Qwen Code.\n\nArchitecture-wise, the Qwen3.6 35B-A3B model has hybrid attention similar to Qwen3-Coder and Qwen3.5. I wrote more about it in [Beyond Standard LLMs](https://magazine.sebastianraschka.com/p/beyond-standard-llms).\n\nAlternatively, if you don’t want to use Qwen3.6, Cohere’s North Mini Code is probably the most interesting, capable alternative at this size class right now. I will go over this model in the next local LLM setup section as well.\n\n## 3. Local LLM Setup\n\nNo matter what agent harness we use (Qwen-Code, Codex, or Claude Code), we have to set up a local LLM, such as Qwen3.6 35B-A3B, first.\n\nThere are several options like Ollama, LM Studio, vLLM, SGLang, MLX, etc to serve models locally. You know from my Build A Large Language Model (From Scratch) and Build A Reasoning Model (From Scratch) projects that I like to code these myself. Implementing a model from scratch has the benefits that we understand the whole stack, plus we can modify and further train and fine-tune it.\n\nHowever, here, we just look for a model serving framework that has been super optimized for inference speed and resource needs since we don’t plan to do any training or fine-tuning at this point. 
(We could, as an extra step, convert and import our own from-scratch fine-tuned model into these efficient serving stacks, but this is out of scope for this article.)\n\nFor this tutorial, we will use [Ollama](https://ollama.com/) as our efficient model serving engine because it’s relatively easy to install and use from the command line across different operating systems (LM Studio also added a non-GUI `llmster` client, but I am less familiar with it).\n\nBy the way, I am not affiliated with any of the tools mentioned in this article, but one nice thing about Ollama is that they also optionally support open-weight models hosted in the cloud, including the currently strongest open-weight model, GLM 5.2, which is too large to run locally on consumer hardware. (The cloud models are not free, of course, but have subscription plans similar to those of ChatGPT and Claude; it’s still nice though that this option exists to conveniently test the latest state-of-the-art open-weight models “locally.”)\n\nAnyways, setting up Ollama is pretty straightforward, and you can find the official macOS/Linux/Windows download instructions on their [download](https://ollama.com/download/) page.\n\nAfter installing, I recommend downloading a model for a quick test run. For instance, on macOS, we can use the Ollama app to download models directly via the GUI:\n\nOtherwise, this can be done on the command line as well via\n\n```\nollama pull qwen3.6:35b-mlx\n```\n\nBy the way, the above-mentioned qwen3.6:35b-mlx is a model variant built for Apple’s MLX framework, i.e., optimized for Macs with Apple silicon chips. 
I highly recommend using the `*-mlx` versions of models when working on Macs (if available).\n\nOn a Linux machine, use the non-MLX version:\n\n```\nollama pull qwen3.6:35b\n```\n\nThen, to make sure that it works, you can either use the GUI again or launch Ollama from the command line.\n\nYou can exit this session via the `/bye` command.\n\nAs mentioned before, the currently best alternative to this Qwen3.6 35B-A3B model is North Mini Code 1.0 of similar size.\n\n## 4. Simple Speed Performance Assessment\n\nBefore deciding on whether to use an LLM as a local coding agent, it’s usually not a bad idea to run a quick speed and quality assessment. Here, for the speed assessment, I would look for tokens/sec performance. Additionally, I’d also make sure this stays stable for (very) long contexts, which is what we are usually dealing with during agentic coding workflows (as opposed to simpler chatbots).\n\nOf course, we don’t want the memory cost to explode either.\n\nYou could run my ollama_speed_memory_bench.py script to do a quick check. In a nutshell, it sends different prompts (ranging from 1k to 50k words) to an Ollama model and asks it to generate up to 8k tokens by default. It reports simple statistics like prefill speed from Ollama’s prompt evaluation metrics, generation speed from output-token timing, and memory use from the Ollama process plus NVIDIA GPU memory when available.\n\nFor example, to evaluate the `qwen3.6:35b-mlx` model on macOS, if you downloaded or cloned the scripts from [https://github.com/rasbt/local-coding-agent-evals](https://github.com/rasbt/local-coding-agent-evals), we can run the following, which takes about 5 minutes:\n\n```\nuv run speed-memory-benchmark/ollama_speed_memory_bench.py --model qwen3.6:35b-mlx\n```\n\nOn Linux, we can run:\n\n```\nuv run speed-memory-benchmark/ollama_speed_memory_bench.py --model qwen3.6:35b\n```\n\nNote that this assumes that you already downloaded the respective model as explained in the previous section. 
Also, depending on your system, if you have less than 30 GB RAM, you may have to use a smaller model like gemma4:e2b, which uses up to about 8 GB RAM on long contexts. (Of course, there are also many smaller models, but in my experience, they make pretty bad local coding agents.)\n\nNote that the reported RSS RAM usage is not very accurate on macOS (especially for mlx model variants that utilize the Metal backend), and I suggest keeping an eye on the Activity Monitor’s RAM usage for Ollama during the run as well. In this case, the RAM usage fluctuated between 20 and 29 GB.\n\nAnyways, the bottom line is that for 50k contexts, the Qwen3.6 and North Mini Code models use up to 30 GB RAM and generate output at about 40 tok/sec on a recent Mac Mini and 30 tok/sec on a DGX Spark.\n\nBelow is a visual summary of the different runs.\n\nAnother interesting question is how Qwen3.6 35B-A3B compares to the similarly sized Cohere North Mini model. If we take similarly quantized models into account (above, I was using the Qwen3.6 default), they are pretty similar, although North Mini is perhaps slightly ahead overall, as shown below.\n\nAnyway, the bottom line is that, in my opinion, anything faster than 20-30 tok/sec is pretty reasonable for local agent work. **This is about the same speed as GPT 5.5 with “high” reasoning**. In this case, both models clear the bar easily.\n\nBy the way, personally, I run my agents almost exclusively on my DGX Spark because I don’t want my Mac Mini to get too hot and I want to have the RAM available for other tasks.\n\nOf course, there are always ways to optimize this more with different frameworks (other than Ollama), quantizations, MTP, and so on. However, Ollama is a good plug & play allrounder with minimal setup time that connects easily to various coding agent frameworks and where it’s super simple to swap and try out different models.\n\n## 5. 
Simple Benchmark Performance Assessment\n\nAfter checking that the model is fast enough for convenient local work, I recommend doing a quick modeling performance assessment. Sure, there are many standardized benchmarks out there we could take a look at and even run ourselves.\n\nUsually, you can find the numbers for relevant benchmarks in the model’s technical report or model hub page. I also find it useful to look at a relative comparison with other models on [https://artificialanalysis.ai/models/](https://artificialanalysis.ai/models/).\n\nBased on the figure above, we can see that Qwen3.6 35B-A3B is much more capable than the Gemma 4 E4B and E2B models, for example.\n\nNote that the Artificial Analysis Intelligence Index numbers keep changing over time as they swap benchmarks and update the weighting, so there are no “absolute” numbers we could use as a reference point for deciding which model is “good enough”. Rather, I would compare a new, interesting model to a model you used before as an anchor or reference point.\n\nBeyond standard benchmarks, I would also curate a personal set of tasks that are relevant to you to quickly check whether this model is even suitable for the type of work that you might want it to perform.\n\nBelow are the outputs of a reasoning- and code-related set of questions that also test the tool calling capabilities of the models. 
Here, the model returns the tool call but doesn’t execute the code itself.\n\n```\n➜  uv run ollama_hard_reasoning_bench.py --model qwen3.6:35b\nPASS debug_empty_tokenizer_regression: ok\nPASS review_shell_command_injection: ok\nFAIL choose_minimal_edit_for_cross_platform_path: argument instructions missing required content\nFAIL triage_import_error_after_refactor: wrong tool: expected read_file, got ask_clarification\nPASS debug_mutable_default_cache_leak: ok\n\nScore: 3/5 passed (60.0%)\n➜  uv run ollama_hard_reasoning_bench.py --model  north-mini-code-1.0\nFAIL debug_empty_tokenizer_regression: wrong tool: expected final_answer, got edit_file\nPASS review_shell_command_injection: ok\nFAIL choose_minimal_edit_for_cross_platform_path: invalid JSON: Extra data: line 2 column 1 (char 235)\nFAIL triage_import_error_after_refactor: wrong tool: expected read_file, got ask_clarification\nFAIL debug_mutable_default_cache_leak: wrong tool: expected final_answer, got edit_file\n\nScore: 1/5 passed (20.0%)\nuv run ollama_hard_reasoning_bench.py --model gemma4:e2b\nFAIL debug_empty_tokenizer_regression: wrong tool: expected final_answer, got edit_file\nFAIL review_shell_command_injection: wrong tool: expected final_answer, got ask_clarification\nFAIL choose_minimal_edit_for_cross_platform_path: wrong argument path: expected 'code/tool-reasoning-benchmark/ollama_tool_reasoning_bench.py', got 'code/tool-reasoning-benchmark/personal_tool_reasoning_tasks.jsonl'\nFAIL triage_import_error_after_refactor: wrong tool: expected read_file, got ask_clarification\nFAIL debug_mutable_default_cache_leak: wrong tool: expected final_answer, got edit_file\n\nScore: 0/5 passed (0.0%)\n```\n\nFor instance, we can say that `qwen3.6:35b` gets the conceptual debugging and security-review tasks right, but still struggles with agentic judgment around “what file/action first” tasks. `3/5` is usable but not fully reliable for autonomous tool use. 
But a harness that constrains actions, adds retries, and maybe gives stronger project context could make it pretty usable.\n\nOn the other hand, `gemma4:e2b` scoring `0/5` is a strong signal that it is less suitable for this kind of tool-use reasoning, even if it is fast. Note that the failures are not just formatting issues. It chooses the wrong tool, asks for clarification when enough context is present, etc. I would probably not use it as a coding-agent model beyond very narrow or heavily constrained tasks.\n\n## 6. Agent Code Base Audit\n\nNow, after this lengthy preamble setting up a local LLM, let’s get back to the main topic, the coding agent harness. As mentioned at the beginning of this article, we will use the qwen-code ([https://github.com/QwenLM/qwen-code](https://github.com/QwenLM/qwen-code)) harness, as Qwen models have been optimized for it.\n\nIf you are familiar with Claude Code, it’s basically the same thing but fully open-source. However, I will also go over how to connect the local Qwen3.6 model to Codex and Claude Code in the next sections.\n\nNote that coding harnesses are much more capable than LLMs by themselves. This is where I recommend being more careful about what you are running and where. For instance, when trying new (coding) agents, I like to\n\nDo an audit of the (open-source) agent code base first.\n\nRun it on separate hardware (e.g., my DGX Spark) or a separate user account and/or virtual environment on my machine at the very least.\n\nRegarding the audit, I recommend looking for data sharing/egress and the default blast radius when it comes to file permissions, as well as some baseline robustness to prompt injection. The figure below attempts to summarize the main points.\n\nSimilar concerns apply to the local model serving engine (e.g., Ollama) as well. 
However, coding agents require even more attention as they can directly read data from your machine and manipulate files.\n\nTo do a basic audit, I recommend the following:\n\nClone the repo:\n\n```\ngit clone https://github.com/QwenLM/qwen-code.git\n```\n\nAsk a trusted agent you used before (like GPT 5.5 in Codex or Opus 4.8 in Claude Code) to review it with a focused prompt. Something like the following:\n\nYou are auditing ./qwen-code before I install or run the agent on my machine.\n\nFocus only on practical local-machine risk from the installed agent and the code paths that create it:\n\ninstall scripts and package lifecycle hooks\n\nshell command execution by the agent\n\nfile read/write boundaries at runtime\n\nsecret handling and environment-variable inheritance\n\nhow repo files, project instructions, and tool output can influence the agent\n\nMCP, plugin, extension, or tool integrations\n\nnetwork calls and telemetry\n\nupdate mechanisms after installation\n\nterminal escape/output handling\n\ndata egress and data residency\n\nIgnoring internet downloads that are strictly required for installation, check whether the installed agent can send prompts, files, telemetry, logs, identifiers, or metadata to remote servers when I use a local model through Ollama. Ignore cloud-model configurations.\n\nDo not infer risk from the project owner alone. Identify concrete endpoints, SDKs, default providers, environment variables, config defaults, and docs that control network behavior, including any endpoints operated in foreign countries or by third-party companies.\n\nDo not do broad style review. Do not refactor. 
Produce:\n\nhigh-risk findings with file/line references\n\nmedium-risk concerns\n\nnetwork/data-egress findings, including any foreign, third-party, or China-linked endpoints or defaults\n\ncommands I should avoid running until reviewed\n\nsettings or environment variables that reduce local-machine risk\n\na short recommendation: safe to test in sandbox, safe to use, or do not run\n\nFor each item, say whether it is expected behavior for a coding agent or inherently riskier than Codex or Claude Code.\n\nBelow is a summary of the main findings (because the full report may be a bit boring and too long for this article):\n\n**Local execution** Qwen Code can run shell commands on our machine through its shell tool but there are strict approval controls unless permissive modes such as `--yolo` are enabled. This is expected for a coding agent, and it’s actually what makes it useful in practice. But of course it becomes risky if run unsandboxed or with a full environment containing secrets.\n\n**Data egress** Even with local Ollama, Qwen Code can send usage telemetry and metadata to Alibaba/Aliyun endpoints unless usage statistics and telemetry are disabled (more on that below). This is riskier than a local-only setup because model prompts may stay local, but session IDs, tool metadata, model info, and local base URL metadata can still leave the machine. But again, this is also common among all kinds of tools (yes, Codex and Claude do that as well).\n\n**File and secret boundaries** Workspace files are readable by default, while writes generally require approval and include some overwrite protections. This is good and standard agent practice.\n\n**Prompt injection surfaces** Repo instructions, tool output, MCP tools, extensions, and project config can influence the agent’s behavior. Prompt injection attacks can be reduced via the approval gates mentioned above. 
This is normal for coding agents, but untrusted repos should be treated as hostile by default because they can steer the agent toward reading files, running commands, or sending data through approved tools.\n\nRegarding the main privacy concerns in point 2, most of them are fixable via a custom `~/.qwen/settings.json` with the following contents:\n\n```\n{\n  \"privacy\": { \"usageStatisticsEnabled\": false },\n  \"telemetry\": { \"enabled\": false, \"logPrompts\": false },\n  \"outboundCorrelation\": { \"propagateTraceContext\": false },\n  \"general\": { \"enableAutoUpdate\": false },\n  \"tools\": {\n    \"approvalMode\": \"default\",\n    \"sandbox\": true\n  },\n  \"mcpServers\": {},\n  \"hooks\": { \"disableAllHooks\": true }\n}\n```\n\nThe `\"general\": { \"enableAutoUpdate\": false }` setting is a tradeoff. Security fixes will not be installed automatically, but I prefer having explicit control over when updates happen instead of letting the tool pull and apply new code in the background.\n\nBy the way, Cline ([https://github.com/Cline/Cline](https://github.com/Cline/Cline)), Codex ([https://github.com/openai/codex](https://github.com/openai/codex)), and Claude Code have similar telemetry data sharing defaults that would need to be disabled explicitly.\n\n(Note that Claude Code doesn’t have an official open-source version of their codebase, which makes trusting it even trickier, and it does seem to send data to both Anthropic and Datadog.)\n\nEither way, overall, it seems Qwen-Code follows standard practices, and as of this writing, there is no particular concern that is non-standard for coding agents.\n\n## 7. 
Qwen-Code Setup\n\nIf we accept the reported findings and risks (personally, I didn’t see any red flags), we can now proceed with the installation and hook up our local Qwen3.6-35B-A3B model to Qwen Code (and Codex and Claude Code in the next sections).\n\nAs mentioned before, I preferably experiment with and run coding agents, which can read and edit local files, on a separate machine (in my case a DGX Spark, but it could also be a separate Mac or Linux workstation). Alternatively, I would run it in a VM or set up a separate macOS or Linux user account as a practical middle ground.\n\n(I heard from some friends that they also rent servers for that, like Linode or Heroku, for tinkering purposes. However, instead of the monthly hosting costs for a somewhat capable machine, I would probably rather get a relatively cheap $200-500 hardware box, or even an old retired laptop, and run a local harness and then use a stronger open-weight model hosted in the cloud via Ollama cloud models, OpenRouter, etc if you are looking for alternatives to GPT or Claude.)\n\nAnyways, let’s install Qwen-Code. The listed options include, e.g.,\n\n```\ncurl -fsSL https://qwen-code-assets.oss-cn-hangzhou.aliyuncs.com/installation/install-qwen-standalone.sh | bash\n```\n\nand\n\n```\nnpm install -g @qwen-code/qwen-code@latest\n```\n\nHowever, running the commands above assumes that the published artifacts match the code we just reviewed in the GitHub repo. If we are extra careful/paranoid, we can also build it ourselves from the GitHub repo. 
Be warned, this is more manual/messier though (I recommend executing the commands one at a time instead of copy & pasting the whole block into the terminal):\n\n```\n# Go to your development folder\ncd ~/Developer\n\n# Clone the Qwen Code GitHub repository\ngit clone https://github.com/QwenLM/qwen-code.git\n\n# Enter the cloned repository\ncd qwen-code\n\n# Install JavaScript dependencies\nnpm install\n\n# Build the CLI output in the local dist/ folder\nnpm run build\n\n# Create a user-level bin directory if it does not already exist\nmkdir -p ~/.local/bin\n\n# Create a qwen wrapper that runs the CLI from this source checkout.\n# Keep ~/Developer/qwen-code in place, since this wrapper points into it.\ncat > ~/.local/bin/qwen <<'SH'\n#!/usr/bin/env sh\nexec \"$HOME/Developer/qwen-code/scripts/cli-entry.js\" \"$@\"\nSH\n\n# Make the wrapper executable.\nchmod +x ~/.local/bin/qwen\n\n# Make qwen available in the current shell session.\nexport PATH=\"$HOME/.local/bin:$PATH\"\n\n# Verify that the qwen command is found and prints a version.\nqwen --version\n```\n\nAfter completing the installation, we can now launch the Qwen-Code client via the qwen command from the terminal to complete the setup and connect to the locally served LLM.\n\nFor this, after running the qwen command, we select “Custom Provider”, as shown below.\n\nOllama uses the OpenAI API standard. So, next, we follow the on-screen setup guide and choose the “OpenAI-compatible” option.\n\nNext, we need to provide the API endpoint of the running Ollama application that serves our local LLM. Usually, that’s the local `http://127.0.0.1:11434` address by default. We enter `http://127.0.0.1:11434/v1` (including the /v1) since that’s the OpenAI-compatible base URL.\n\nNext, we enter `ollama` as our custom provider.\n\nNext, we can select the available models. These are the ones that we downloaded via `ollama pull`. You can enter only a single model or multiple ones separated by commas. 
You can double-check the list of downloaded models via `ollama list`. By the way, you can always add more models easily later (I’ll explain after completing the setup).\n\nWe are almost done! In step 5/6, we of course select “Enable thinking” mode, which results in higher token usage, but the improved problem-solving capabilities are worth it.\n\nAnd that’s basically it. Step 6 is a review step that we can confirm by pressing “Enter”.\n\nCongratulations, you should now have a working fully-local LLM workflow set up. The usage is pretty similar to Claude Code, where you can use / commands for various functionality. E.g., you can switch models via the `/model` command, as shown below.\n\nBy the way, as I mentioned before, it’s relatively easy to add new models from Ollama. Once you pull a new model via `ollama pull`, you can add it as a new entry in `~/.qwen/settings.json`. Here, just copy & paste an existing entry into the file and change the “id” and “name” to that of the Ollama model name.\n\nBy the way, to update the qwen-code tool once in a while, if we used the git clone & local build route, we can pull a recent GitHub snapshot and update it as follows:\n\n```\n# Go to the local Qwen Code source checkout\ncd ~/Developer/qwen-code\n\n# Fetch the latest changes from GitHub\ngit pull\n\n# Install or update dependencies if package files changed\nnpm install\n\n# Rebuild the local CLI\nnpm run build\n\n# Verify the updated CLI\nqwen --version\n```\n\n## 8. Agent Capability Assessment\n\nNow that we have a fully working, local coding agent, the question is: how well does it perform, and is it actually good enough for my tasks? Of course, there are benchmarks for this, but in my opinion, nothing beats trying it for yourself on some of your own workflows. 
In other words, this basically means using it for a day or two to decide whether it meets your bar.\n\nI also recommend compiling a small set of tasks that reflect your common coding agent usage. And if you come upon a particularly challenging one when working on a given project, it may not be a bad idea to add it to this set to evaluate future models.\n\nAs an example of what I mean, I shared a relatively small, simple, and general set of tasks we can use to test the agents here on GitHub: [https://github.com/rasbt/local-coding-agent-evals/tree/main/agent-problem-pack](https://github.com/rasbt/local-coding-agent-evals/tree/main/agent-problem-pack). This is basically an extension of the tasks from the Simple Benchmark Performance Assessment section.\n\nThe details on how to run these are in the GitHub README: [https://github.com/rasbt/local-coding-agent-evals/tree/main/agent-problem-pack#quick-start-running-benchmarks-manually](https://github.com/rasbt/local-coding-agent-evals/tree/main/agent-problem-pack#quick-start-running-benchmarks-manually).\n\nBelow is the outcome for the different LLMs tested in Qwen-Code.\n\nAs we can see, both the Qwen3.6 and North Mini Code 35B-A3B models solve 4 out of 5 of these problems. Gemma 4 E2B fails a lot. Out of curiosity, I also added the slightly older Nemotron 3 Nano model. It has a similar size and compute performance as the aforementioned Qwen and North models, and it performs similarly well.\n\n## 9. Codex Setup\n\nAfter setting up the local coding agent (and the article exceeding 5000 words), this would probably be a reasonable place to stop. 
However, as a bonus, I also thought it might be interesting to add brief Codex and Claude Code notes for completeness.\n\nUnfortunately, as far as I know, the Codex UI does not support non-OpenAI models, but we can use the Codex CLI to run our Ollama models.\n\nIf you haven’t installed the OpenAI Codex CLI yet, you can get and install it analogously to qwen-code from their open-source GitHub repository: [https://github.com/openai/codex](https://github.com/openai/codex) (Yes, the Codex CLI is open source!)\n\nI will spare you the lengthy listing of the commands and recommend checking the repo’s README instead for the official instructions. (Cloning the repo and running an audit similar to qwen-code is not a bad idea here, as well.)\n\nThen, once installed, there are multiple ways to enable local model use. In my opinion, the most convenient way is to set up a separate config `~/.codex/ollama.config.toml` (inside the existing `~/.codex` folder) with some default options:\n\n```\nmodel = \"qwen3.6:35b\"\nmodel_provider = \"ollama\"\nmodel_reasoning_effort = \"high\"\npersonality = \"pragmatic\"\n\n[projects.\"/home/rasbt\"]\ntrust_level = \"trusted\"\n```\n\nThen, we can still use `codex` to launch the regular “Codex with GPT 5.5” mode and use our Ollama model via `codex --profile ollama`.\n\nWhen rerunning the test cases from the Agent Capability Assessment section, to my surprise, Qwen3.6 does actually perform better via Codex compared to its “native” Qwen-Code coding harness, as shown below.\n\nEven though this is just a small set of benchmarks, it suggests that using Codex as the universal coding agent harness may not be such a bad idea after all.\n\n## 10. Claude Code Setup\n\nOf course, there is also the popular Claude Code agent harness that we could use as a harness around our local LLMs. While very popular and capable, this is probably my least favorite option for local setups because the codebase is proprietary. 
That also means we cannot readily inspect and/or disable Anthropic’s data logging practices.\n\nTo set it up, if you don’t have Claude Code already installed on your machine, I suggest checking the official docs for recommended installation commands: [https://code.claude.com/docs/en/quickstart](https://code.claude.com/docs/en/quickstart).\n\nClaude Code itself does not expose the same local-provider configuration path as Codex. However, Ollama provides an integration via `ollama launch claude`: [https://docs.ollama.com/integrations/claude-code](https://docs.ollama.com/integrations/claude-code)\n\nI.e., we can execute `ollama launch claude` to run the Claude Code harness with an Ollama model.\n\nBy the way, this also works for Codex via `ollama launch codex`, but I personally prefer the `codex --profile ollama` route we discussed earlier, as it gives me a bit more insight into and control over how things work.\n\nHowever, as a user, it feels like Claude Code takes much longer to come up with a solution. It probably has a much higher token usage. So, below, I additionally looked at the token usage of all three harnesses.\n\nAs we can see, Claude Code uses by far the most tokens on average, Codex the least.\n\nWhen it comes to the little agent capability assessment benchmark, the Qwen and North Mini Code models also get 5/5, and even the small Gemma 4 model does ok!\n\nInterestingly, we can also see that the token usage is largely driven by the harness, not the LLM itself. I.e., the three LLMs that are capable of solving (almost) all 5 tasks all use roughly the same number of tokens within a given harness (e.g., Qwen3.6 uses roughly the same number of tokens as North Mini Code and Nemotron 3 Nano when used inside Claude Code). 
Only Gemma 4 uses fewer tokens, but it also fails almost all tasks, likely because of insufficient tool-calling capabilities that cause the runs to stop early.\n\nFor reference, below is again the summarized task-success rate.\n\nAnyway, the takeaway here is that if more tokens help the model-harness combination to solve more (and more complex) problems, great! But if two harnesses have an equal task success rate, then the one that uses 50% fewer tokens (e.g., Codex over Claude Code) is a huge win, because its tasks will run roughly twice as fast.\n\nHowever, the big caveat here is that task correctness is a necessary criterion, but it doesn’t measure code quality and readability, which are hard to assess automatically.\n\nPS: I tried to analyze why Claude Code uses more tokens, and it seems that the difference mainly comes from input tokens rather than output tokens. In other words, Claude is not writing twice as much. The logs suggest that Claude is repeatedly feeding more context back into the model across turns, including previous messages, tool calls, command outputs, and file contents. For example, one Claude run used about 578k input tokens but only about 4.5k output tokens across 25 turns. So the likely explanation is that Claude’s harness accumulates or accounts for a larger prompt-side history during multi-step agent runs.\n\n## 11. 
Mac <-> DGX\n\nSo far, all the setups we discussed assumed that we were running the local LLM on the same machine as the coding harness.\n\nHowever, what if we developed some trust in the coding agent harness and want to use it on our main Mac while the model itself is hosted on a different machine, e.g., a DGX Spark?\n\nIn my opinion, the best (or most convenient) setup is an SSH tunnel from the Mac to the DGX.\n\nFirst, I suggest quitting Ollama on the Mac or changing the first `11434` (the Mac-side local port) to something else below.\n\nAssuming we quit the Ollama app on the Mac, check that the following fails to connect or returns an empty output, which indicates that Ollama is not available:\n\n```\ncurl http://127.0.0.1:11434/v1/models\n```\n\nThen run the following command in a terminal window on the Mac:\n\n```\nssh -N -L 11434:127.0.0.1:11434 rasbt@DGX-Spark\n```\n\nThat command means that we open an SSH connection to `DGX-Spark` as user `rasbt`, which you need to adjust to whatever your username and machine name are. Then, the command forwards the Mac’s local port `11434` to `127.0.0.1:11434` on the DGX because of `-L 11434:127.0.0.1:11434`. Note that this is the address the Ollama server listens on.\n\nThe terminal running `ssh -N -L ...` will look like it is hanging. That is normal. Keep it open while you use Qwen Code, Codex, or Claude Code. Press `Ctrl-C` to stop the tunnel.\n\nOnce the tunnel is running, use this on your Mac to check that it can indeed access the Ollama models on the DGX:\n\n```\ncurl http://127.0.0.1:11434/v1/models\n```\n\nIf that returns the DGX models, your Mac tools can use the DGX Ollama server as if it were local.\n\nThen, use Qwen Code and Codex just like above.\n\nFor Claude via `ollama launch claude`, the key is that the Mac-side `ollama` command must see the tunneled endpoint. If needed:\n\n```\nOLLAMA_HOST=http://127.0.0.1:11434 \\\nollama launch claude --model qwen3.6:35b\n```\n\n## 12. 
What about OpenClaw and Hermes?\n\nWe focused on Qwen Code, Codex, and Claude Code because they are the most direct fit for coding-agent workflows. OpenClaw and Hermes are also capable, but they are broader agent harnesses. They are better suited when you want one agent to coordinate across tools, apps, browsers, terminals, and longer-running workflows.\n\nFor coding work, I recommend starting with Qwen Code, Codex, or Claude Code first (and there are also many other interesting coding harnesses like OpenCode, Cline, Pi, and Noumena Code). And I would treat OpenClaw and Hermes as interesting follow-up options for things beyond coding rather than the first baseline for this local coding-agent setup.\n\n## 13. Conclusion\n\nThis was a long article with lots of information and configuration. If there is one main takeaway, I’d say that it’s not the mechanical setup steps but rather the considerations when running coding agents locally. That is, the most important part is not getting one specific tool installed, but understanding the model-serving layer, the agent harness, the permission model, and how to evaluate whether the setup actually solves coding tasks reliably.\n\nOf course, GPT 5.5 and Opus 4.8 are currently better than smaller open-weight models that run on a Mac or DGX Spark. But the newer Mixture-of-Experts models in the 30-35B range (such as Qwen3.6, North Mini Code, and Nemotron 3 Nano) are all very, very capable and really sufficient for a lot of tasks. And yes, they run with the same token speed as GPT 5.5 through a Pro subscription, so it should not necessarily slow down your workflows.\n\nThe main consideration when setting up local agents, besides the model itself, is also which harness we want to use. The common perception is that models are usually optimized more for a specific harness than others (e.g., Qwen3.6 may work better in Qwen Code than in Claude Code). 
Based on the small agent assessment, this may not necessarily be true, though (this is only a very small benchmark, so take it with a big grain of salt). So, if you are more comfortable with a different harness that you have a lot of muscle memory with, like Codex and Claude Code, maybe it’s not a bad idea to just stick the model into that one and give it a try!\n\nAnyways, I hope the article was useful, and it got you interested in doing some tinkering with open-weight models. They are becoming more capable by the day, and it’s for some inexplicable reason just fun to run models locally.\n\n## Further Resources\n\nIf you want to try the benchmarks yourself, the code and small evaluation tasks used in this article are available here: [https://github.com/rasbt/local-coding-agent-evals](https://github.com/rasbt/local-coding-agent-evals)\n\nAlso, my [Build a Reasoning Model (From Scratch)](https://mng.bz/Nwr7) book has now gone to print and started shipping. I wanted to post a picture, but it will be 3 more days until it arrives.\n\nIf you liked my previous [Build a Large Language Model (From Scratch)](https://amzn.to/4fqvn0D) book, this is essentially a sequel implementing inference-time scaling techniques and reinforcement learning algorithms from scratch.\n\nAnd if you want to support future long-form articles like this one, consider [becoming a paid subscriber](https://magazine.sebastianraschka.com/subscribe). 
It helps me keep writing these independent deep dives and sharing the accompanying code, figures, and experiments.", "url": "https://wpnews.pro/news/using-local-coding-agents", "canonical_source": "https://magazine.sebastianraschka.com/p/using-local-coding-agents", "published_at": "2026-06-27 11:21:58+00:00", "updated_at": "2026-06-27 11:35:04.633357+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "ai-tools", "ai-products"], "entities": ["Claude Code", "Codex", "OpenAI", "Anthropic", "Qwen"], "alternates": {"html": "https://wpnews.pro/news/using-local-coding-agents", "markdown": "https://wpnews.pro/news/using-local-coding-agents.md", "text": "https://wpnews.pro/news/using-local-coding-agents.txt", "jsonld": "https://wpnews.pro/news/using-local-coding-agents.jsonld"}}