{"slug": "model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding", "title": "Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task", "summary": "A developer tested five local AI models against Claude Sonnet 4 on a real coding task—building a tag manager for a blog admin panel. Only two models shipped working code: Sonnet 4 and Qwen3-Coder 30B-A3B. The developer concluded that local models on consumer hardware are not yet ready for agentic coding tasks.", "body_md": "Five local models. One frontier cloud model. The same coding task. Zero hand-holding.\n\nOnly two shipped code. One of them was the cloud model.\n\nPart of my goal with this series is to continuously test the viability and maturity of local models. I've done it for [basic agentic tasks](https://dev.to/posts/homelab-bakeoff-openclaw-outperforms-hermes-with-hermes-models). Today we're revisiting coding tasks.\n\nWhat did we learn?\n\n**Local models are not ready — yet.** At least not for homelabs like mine. Perhaps if you have hundreds of gigabytes of unified memory (I'm looking at you, older Mac Studios) you can run fully unquantized models. But with even the beefiest of discrete consumer GPUs, local models can't code.\n\nLet's dig in.\n\nThis is Round 7 of the Model Showdown series. Previous rounds tested cloud models against each other — Opus, Sonnet, GPT-5.5, Qwen cloud. 
This time I wanted to answer a different question: **can local models running on consumer hardware actually complete a real agentic coding task?**\n\nThe homelab:\n\nEvery local model was configured as aggressively as the hardware allows — flash attention, quantized KV cache (`q8_0`), and context windows maxed to what VRAM permits.\n\n| Model | Type | Quant | VRAM | Context | Max Output |\n|---|---|---|---|---|---|\n| Qwen 3.6 35B-A3B | Local MoE | UD-Q4_K_XL (21GB) | ~21GB | 131,072 | 81,920 |\n| Gemma 4 12B | Local Dense | UD-Q4_K_XL (6.9GB) | ~8GB | 65,536 | 32,768 |\n| Hermes 4 14B | Local Dense | Q8_0 (15GB) | ~15GB | 65,536 | 32,768 |\n| Qwen3-Coder 30B-A3B | Local MoE | UD-Q4_K_XL (17GB) | ~17GB | 65,536 | 32,768 |\n| Devstral 24B | Local Dense | Q5_K_M (17GB) | ~17GB | 65,536 | 32,768 |\n| Claude Sonnet 4 | Cloud (control) | Native | N/A | 200,000 | — |\n\nSonnet 4 is the control variable. I already know what it can do. The question is how close the local models get.\n\nPrevious rounds used an \"image management\" feature, but that collided with existing code in the repo. For Round 7, I designed a clean-room task: **build a tag manager for the blog's admin panel**.\n\nThe blog already has tags — posts use a `tags[]` array in MDX frontmatter, there's a public `/tags` page, and `src/lib/posts.ts` has a `getAllTags()` function. 
But there's no admin UI to manage them.\n\nEach model got the identical prompt:\n\nGoal: Add a Tag Manager to the `/admin` section.\n\nRequirements:\n\n- Create `src/lib/tags.ts` — list tags with post counts, detect orphans, support rename and merge\n- Create `src/app/api/admin/tags/route.ts` — GET, PATCH, DELETE endpoints\n- Create `src/app/admin/tags/page.tsx` — table with inline rename, delete, sort\n- Add \"Tags\" to AdminNav\n- Client-side mutations with refresh (no full page reload)\n- `npm run build` must pass with zero errors\n- Take a screenshot via Playwright MCP\n- Commit in logical chunks, push to branch\n- Do NOT open a PR\n\nTen requirements. Real codebase. Real build system. Real git workflow.\n\nEach model got its own clean branch (`run-10` through `run-15`) forked from the same `main` commit. Local models were loaded one at a time via `llm-switch.sh` and served through llama-server on `localhost:8080`. Sonnet 4 ran through Coder's built-in Anthropic provider.\n\nModel-to-run assignment was randomized and sealed before execution. I didn't know which model was which run until after all six completed (or failed).\n\n**A note on human intervention**: I monitored each session live and occasionally nudged stalled models (\"keep going\", \"can you finish?\") or stopped them when they entered obvious doom loops (\"stop\"). There was no standardized intervention protocol — I used my judgment as a developer watching an AI assistant, which is how these tools actually get used in practice. Some models got more nudges than others because they stalled more. 
The two models that shipped code needed zero intervention.\n\n| Model | Tool Calls | Total Tokens | Commits | Build Pass | Screenshot | Outcome |\n|---|---|---|---|---|---|---|\n| Sonnet 4 ☁️ | 88 | 19K | 4 | ✅ (1st try) | ✅ | Complete |\n| Qwen3-Coder 30B-A3B | 60 | 2.06M | 1 | ✅ (3rd try) | ❌ | Partial |\n| Qwen 3.6 35B-A3B | 76 | 3.89M | 0 | ✅ (2nd try) | ❌ | Failed (never committed) |\n| Gemma 4 12B | 34 | 1.17M | 0 | ❌ (0/7) | ❌ | Failed |\n| Hermes 4 14B | 40 | 1.14M | 0 | ❌ (0/13) | ❌ | Failed |\n| Devstral 24B | 0 | 14K | 0 | ❌ | ❌ | Total failure |\n\nOne cloud model. Five local models. **One complete success. One partial. Four failures.**\n\nSonnet did what you'd expect a frontier model to do. It cloned the repo, spent 25 tool calls reading existing code (auth patterns, API conventions, admin page structure, frontmatter format), then wrote all four files in a tight burst. Build passed on the first try. It hit a real environment issue — a stray `package.json` confused Turbopack's workspace detection — diagnosed the root cause, fixed it with a config change, took a Playwright screenshot, and pushed four clean conventional commits.\n\nTotal time: ~10 minutes. Zero human intervention.\n\n```\nacb4ea1 fix: set turbopack.root to avoid workspace lockfile detection in dev\n352a8ca feat: add Tags link to AdminNav\n22899a0 feat: add /admin/tags page with inline rename, delete, and sort\n19f44fa feat: add tags.ts lib with stats, rename, and remove helpers\n```\n\nThe implementation followed existing project patterns because it read them first. That's the difference.\n\nQwen3-Coder 30B-A3B was the best-performing local model. It cloned the repo, explored the codebase, created all four required files (410 lines of code), fixed TypeScript errors across three build attempts, and pushed a working commit.\n\nBut it wasn't clean. It burned ~8 tool calls just fighting the working directory problem (each `execute` call resets to `/home/coder`, so it kept forgetting to `cd` into the repo). 
After committing, it spent another 30 tool calls confused about whether its own API route file existed — trying to delete and recreate something that was already committed.\n\nNo screenshot. No logical commit chunking (everything in one commit). But **it shipped working code**, which puts it in a category of one among the local models.\n\nThis is the one that hurts. Qwen 3.6 actually *completed the implementation*. It explored the codebase thoroughly, wrote all four files, fixed a type error, and got `npm run build` to pass cleanly.\n\nThen it decided it needed a Playwright screenshot before committing.\n\nIt spent the next **77 messages**, nearly half of its entire session, trying to install Playwright, fighting missing Chromium dependencies, debugging browser launch failures, rewriting a screenshot script four times, and wrestling with the auth middleware that blocked unauthenticated page loads. It never took the screenshot. It never committed. It never pushed.\n\nThe code was right there. Build passing. Ready to go. But the model couldn't prioritize \"commit what works\" over \"complete requirement #7 first.\" Three times I nudged it — \"You there?\", \"Keep going\", \"can you finish?\" — and each time it dove back into the Playwright rabbit hole.\n\n**3.89 million tokens burned. Zero commits pushed.**\n\nGemma cloned the repo, read the existing code, and wrote all three new files plus the nav update. Reasonable start. Then it ran `npm run build` and hit a type error with `gray-matter`'s `stringify()` function.\n\nThe fix was simple: `matter.stringify(content, data)` — content string first, data object second. Gemma had the arguments reversed. It tried six variations of the call, rewrote `tags.ts` six times, ran seven builds — and never once tried the correct argument order. It never read the `gray-matter` type definitions. 
It never checked the docs.\n\nAfter the fifth failed build, it fell into a **degenerate text generation loop** — printing \"I'll also make sure `src/lib/tags.ts` is correct\" 26 consecutive times. I had to send \"stop\" to break the loop.\n\nHermes jumped straight to writing code without exploring the project structure first. It created two files and ran `npm run build`. The error:\n\n```\nModule not found: Can't resolve '../../../lib/tags'\n```\n\nThe route file at `src/app/api/admin/tags/route.ts` needs `../../../../lib/tags` (four levels up) or `@/lib/tags` (Next.js path alias). Hermes used three levels. Off by one.\n\nIt never diagnosed this. Instead, it rewrote both files with the same wrong import and rebuilt. **Thirteen times.** The output from message 34 onward is nearly identical in every iteration. Same code. Same error. Same \"fix.\" When I sent \"stop,\" it continued for five more tool calls before acknowledging the signal.\n\nDevstral never executed a single tool call. It hallucinated an entire fake conversation about a Python project that doesn't exist, then emitted what looked like tool invocations — `execute`, `read_file`, `write_file` — but rendered them as **plain text** inside the assistant message. The platform couldn't parse them as structured tool calls, so nothing happened.\n\nThis is a fundamental compatibility failure. The model couldn't interface with Coder's tool-calling protocol at all. 
Nine messages, 14K tokens, zero actions.\n\nThis is the number that stopped me:\n\n| Model | Total Tokens | Result |\n|---|---|---|\n| Sonnet 4 | 19,237 | Complete (4 commits, screenshot) |\n| Qwen3-Coder | 2,059,519 | Partial (1 commit, no screenshot) |\n| Qwen 3.6 | 3,890,791 | Failed (build passed, never committed) |\n| Gemma 4 12B | 1,170,967 | Failed (0/7 builds passed) |\n| Hermes 4 14B | 1,138,614 | Failed (0/13 builds passed) |\n| Devstral 24B | 14,447 | Failed (zero tool calls) |\n\nSonnet used **19K tokens** to complete the task. The local models that actually tried burned **1–4 million tokens** and mostly failed. That's roughly a 60-200x token efficiency gap for the same task.\n\nThe local models aren't just slower. They're doing fundamentally more work per unit of progress — re-reading files they already read, rewriting code they just wrote, rebuilding with the same error, looping through the same reasoning. It's not a speed problem. It's a thinking problem.\n\nEvery local model that ran long enough exhibited the same pathologies:\n\n**1. Degenerate loops.** Gemma repeated the same text 26 times. Hermes rebuilt with the same wrong import 13 times. Qwen 3.6 rewrote its screenshot script 4 times with the same approach. Once a local model enters a loop, it can't break out without human intervention.\n\n**2. Working directory amnesia.** Coder's `execute` tool doesn't preserve `cd` across calls. Sonnet learned this instantly and prefixed every command. Multiple local models burned 5-10 tool calls per session rediscovering this.\n\n**3. Inability to prioritize.** Qwen 3.6 had a passing build and chose to yak-shave on Playwright instead of committing. No local model demonstrated the judgment to ship what works and iterate.\n\n**4. No self-diagnosis.** When a build fails, the fix requires reading the error, forming a hypothesis, and trying something *different*. Hermes and Gemma both tried the same fix repeatedly. 
Neither ever stepped back to read docs, check type definitions, or examine the project configuration.\n\n**Local models can write plausible code.** Four of five local models produced syntactically reasonable TypeScript. The code *looked* right. The architecture was sensible. It's the last mile — debugging, building, committing, shipping — where they fall apart.\n\n**The agentic gap is wider than the coding gap.** These models can generate code. What they can't do is *operate as agents* — managing state across tool calls, diagnosing errors, prioritizing tasks, knowing when to stop and ship. That's a different capability than code generation, and it's where local models are currently weakest.\n\n**Token efficiency is the real benchmark.** Raw parameter count and context window don't predict agentic success. Qwen 3.6 had the biggest context of any local model (131K) and burned the most tokens (3.89M), yet it still didn't ship. Sonnet used roughly 200x fewer tokens and completed everything. The bottleneck isn't context. It's reasoning quality per token.\n\n**Tool-calling compatibility isn't guaranteed.** Devstral is marketed as an agentic coding model, but it couldn't even interface with the tool-calling protocol. If you're evaluating local models for agent use, test tool calling first.\n\n**Qwen3-Coder is the local model to watch.** It's the only local model that actually shipped code in this test. Messy, single-commit, no screenshot — but working code pushed to a branch. 
For a 30B MoE model running on a single consumer GPU, that's notable.\n\n| Metric | Sonnet 4 | Qwen3-Coder | Qwen 3.6 | Gemma 4 12B | Hermes 4 14B | Devstral 24B |\n|---|---|---|---|---|---|---|\n| Type | Cloud | Local MoE | Local MoE | Local Dense | Local Dense | Local Dense |\n| Parameters | Unknown | 30B (3B active) | 35B (3B active) | 12B | 14B | 24B |\n| Total tokens | 19,237 | 2,059,519 | 3,890,791 | 1,170,967 | 1,138,614 | 14,447 |\n| Tool calls | 88 | 60 | 76 | 34 | 40 | 0 |\n| Messages | 183 | 127 | 162 | 81 | 88 | 9 |\n| Commits pushed | 4 | 1 | 0 | 0 | 0 | 0 |\n| Build passed | ✅ 1st try | ✅ 3rd try | ✅ 2nd try | ❌ 0/7 | ❌ 0/13 | ❌ |\n| Screenshot | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |\n| Human nudges | 0 | 0 | 3 | 2 + stop | stop | 1 |\n| Outcome | Complete | Partial | Failed | Failed | Failed | Failed |\n\n**Inference stack**: llama.cpp b9660, flash attention, q8_0 KV cache, Coder Agents v2.34.0\n\n**Hardware**: RTX 5090 32GB, Ryzen 9 9950X3D, 64GB RAM, Ubuntu 24.04\n\nNext up: Round 6 brings more frontier models to the same task. And I'll keep pushing the local models — better quants, newer releases, maybe a different agent framework. 
The gap is real, but the pace of improvement on the local side is fast.", "url": "https://wpnews.pro/news/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding", "canonical_source": "https://dev.to/carryologist/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding-task-1ehj", "published_at": "2026-06-18 02:53:16+00:00", "updated_at": "2026-06-18 03:21:26.245114+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-products", "ai-research"], "entities": ["Claude Sonnet 4", "Qwen3-Coder 30B-A3B", "Qwen 3.6 35B-A3B", "Gemma 4 12B", "Hermes 4 14B", "Devstral 24B", "llama-server", "Playwright MCP"], "alternates": {"html": "https://wpnews.pro/news/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding", "markdown": "https://wpnews.pro/news/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding.md", "text": "https://wpnews.pro/news/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding.txt", "jsonld": "https://wpnews.pro/news/model-showdown-round-7-five-local-models-vs-one-cloud-model-on-a-real-coding.jsonld"}}