{"slug": "two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny", "title": "Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.", "summary": "Two models have tied the all-time record on agent coding benchmarks, scoring 90% with zero hard fails. Qwen3 Coder 30B A3B achieved the score in 28 seconds at $0.0004, while DeepSeek Chat (original) reached 90% in 59 seconds at $0.0018 — cheaper than most models scoring 70%. The batch of 10 models was the cheapest yet, with Liquid's LFM 2 24B A2B scoring 85% at just $0.0002 for the entire 10-task benchmark.", "body_md": "*By Vilius Vystartas | May 2026*\n\nTen more models through the same 10 agent coding tasks. Two tied the all-time record. One cost $0.0004. The other hit the score at $0.0018 — cheaper than most models scoring 70%.\n\nBatch 10 was the cheapest one yet.\n\nTwo models scored 90% with zero hard fails, joining MiniMax M2 Her and Baidu Ernie 4.5 300B as the highest-scoring models on this benchmark:\n\n**Qwen3 Coder 30B A3B** — 90% in 28 seconds, $0.0004. An efficient coder that doesn't burn budget on thinking tokens it doesn't need.\n\n**DeepSeek Chat (original)** — 90% in 59 seconds, $0.0018. The original DeepSeek Chat still competes with modern models on agent coding. Newer doesn't always mean better.\n\n**LFM 2 24B A2B (85%, $0.0002, 15s) is the cheapest model I've ever tested.** Liquid's debut family is absurdly cost-effective. A full 10-task benchmark for literally $0.0002. At this price/performance ratio, there's no excuse not to test a model before committing to a more expensive alternative.\n\n**Mistral Small 3.2 (85%, $0.0004)** is a clear upgrade. The Small line went 75% → 85% across versions — a ten-point jump at the same budget tier. Mistral keeps improving the right things.\n\n**Qwen3 14B scored 0% across all 10 tasks.** Its mandatory thinking mode can't be suppressed, so at the 300-token cap every request times out before producing output. 
Skip for agent coding.\n\n**Cydonia 24B V4.1 (80%, $0.001)** debuts a new family from TheDrummer. Zero hard fails. Watch this one.\n\nQwen3.7 Max (85%, $0.13, 295 seconds) scored the same as budget models costing 300x less. Thinking mode tax at work — the accuracy is there, but you'll wait five minutes and pay for every second.\n\nClaude Opus 4 (80%, $0.10, 76s) had one hard fail. For a top-tier premium model at $0.10 per 10 tasks, that's below expectations. It's not a bad model — it's overkill for agent coding at a tight token budget.\n\nAion 1.0 (80%) had two hard fails and was the slowest at 160 seconds. The architecture is interesting, but it's not ready for production agent work.\n\nTen real-world agent coding tasks — file operations, shell commands, error recovery, data parsing — tested against each model via OpenRouter. Max tokens: 300. Temperature: 0.1. Results scored by pattern matching against expected outputs. Pre-flight verification caught 2 models (Ernie 4.5 21B — HTTP 429, Trinity Mini — empty content) before they wasted the batch.\n\nTotal batch cost: $0.14 across 9 models. 
Qwen3.7 Max alone accounted for $0.13 of that — thinking tax.\n\nTotal models tested: 148 (up from 138).\n\nFull results and per-task scores: [benchmarks.workswithagents.dev](https://benchmarks.workswithagents.dev)\n\nBecause you should.", "url": "https://wpnews.pro/news/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny", "canonical_source": "https://dev.to/vystartasv/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny-12d2", "published_at": "2026-05-26 09:46:02+00:00", "updated_at": "2026-05-26 10:04:50.134824+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-research", "ai-tools"], "entities": ["Vilius Vystartas", "MiniMax M2 Her", "Baidu Ernie 4.5 300B", "Qwen3 Coder 30B A3B", "DeepSeek Chat", "Liquid", "Mistral Small 3.2", "Cydonia 24B V4.1"], "alternates": {"html": "https://wpnews.pro/news/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny", "markdown": "https://wpnews.pro/news/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny.md", "text": "https://wpnews.pro/news/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny.txt", "jsonld": "https://wpnews.pro/news/two-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny.jsonld"}}