Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.

wpnews.pro

cd /news/artificial-intelligence/two-models-just-hit-90-on-agent-codi… · home › topics › artificial-intelligence › article

[ARTICLE · art-14277] src=dev.to ↗ pub=2026-05-26T09:46Z topic=artificial-intelligence verified=true sentiment=↑ positive

Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.

Two models have tied the all-time record on agent coding benchmarks, scoring 90% with zero hard fails. Qwen3 Coder 30B A3B achieved the score in 28 seconds at $0.0004, while DeepSeek Chat (original) reached 90% in 59 seconds at $0.0018 — cheaper than most models scoring 70%. The batch of 10 models was the cheapest yet, with Liquid's LFM 2 24B A2B scoring 85% at just $0.0002 for the entire 10-task benchmark.

read2 min views12 publishedMay 26, 2026

By Vilius Vystartas | May 2026

Ten more models through the same 10 agent coding tasks. Two tied the all-time record. One cost $0.0002. The other hit the score at $0.0018 — cheaper than most models scoring 70%.

Batch 10 was the cheapest one yet.

Two models scored 90% with zero hard fails, joining MiniMax M2 Her and Baidu Ernie 4.5 300B as the highest-scoring models on this benchmark:

Qwen3 Coder 30B A3B — 90% in 28 seconds, $0.0004. An efficient coder that doesn't burn budget on thinking tokens it doesn't need.

DeepSeek Chat (original) — 90% in 59 seconds, $0.0018. The original DeepSeek Chat still competes with modern models on agent coding. Newer doesn't always mean better.

LFM 2 24B A2B (85%, $0.0002, 15s) is the cheapest model I've ever tested. Liquid's debut family is absurdly cost-effective. A full 10-task benchmark for literally $0.0002. At this price/performance ratio, there's no excuse not to test a model before committing to a more expensive alternative.

Mistral Small 3.2 (85%, $0.0004) is a clear upgrade. The Small line went 75% → 85% across versions — a ten-point jump at the same budget tier. Mistral keeps improving the right things.

Qwen3 14B scored 0% across all 10 tasks. Mandatory thinking mode that can't be suppressed at 300 tokens means every request times out before producing output. Skip for agent coding.

Cydonia 24B V4.1 (80%, $0.001) debuts a new family from TheDrummer. Zero hard fails. Watch this one.

Qwen3.7 Max (85%, $0.13, 295 seconds) scored the same as budget models costing 300x less. Thinking mode tax at work — the accuracy is there, but you'll wait five minutes and pay for every second.

Claude Opus 4 (80%, $0.10, 76s) had one hard fail. For a top-tier premium model at $0.10 per 10 tasks, that's below expectations. It's not a bad model — it's overkill for agent coding at a tight token budget.

Aion 1.0 (80%) had two hard fails and was the slowest at 160 seconds. The architecture is interesting, but it's not ready for production agent work.

Ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing — tested against each model via OpenRouter. Max tokens: 300. Temperature: 0.1. Results scored by pattern matching against expected outputs. Pre-flight verification caught 2 models (Ernie 4.5 21B — HTTP 429, Trinity Mini — empty content) before they wasted the batch.

Total batch cost: $0.14 across 9 models. Qwen3.7 Max alone accounted for $0.13 of that — thinking tax.

Total models tested: 148 (up from 138).

Full results and per-task scores: [benchmarks.workswithagents.dev](https://benchmarks.workswithagents.dev)

Because you should.

source & further reading

dev.to — original article How We Test an AI Product Without Burning Credit From Prompt Files to Agent Skills: How I Unified My Content Automation Human-in-the-Loop Is Not a Governance Strategy

~/api · this article 200

$curl api.wpnews.pro/v1/news/two-models-just-hit-90-o…

Read original on dev.to → dev.to/vystartasv/two-models-just-hit-90-on-agen…

mentioned entities

Vilius Vystartas

MiniMax M2 Her

Baidu Ernie 4.5 300B

Qwen3 Coder 30B A3B

DeepSeek Chat

Liquid

Mistral Small 3.2

Cydonia 24B V4.1

metadata

slugtwo-models-just-hit-90-on-agent-coding-one-cost-less-than-a-penny

topic#artificial-intelligence

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevPluralistic: The AI bubble isn't…

next →The state of AI voice assistants…

── more in #artificial-intelligence 4 stories · sorted by recency

byteiota.com · 10 Jul · #artificial-intelligence

GPT-5.6 Sol Ultra Proves 50-Year Math Conjecture Today

ploy.ai · 10 Jul · #artificial-intelligence

Migrating a production AI agent to GPT 5.6

sourcefeed.dev · 10 Jul · #artificial-intelligence

Autonomous Pentesting: Inside the Multi-Agent Architecture of PentAGI

tryai.dev · 10 Jul · #artificial-intelligence

GPT-5.6, Grok 4.5, Claude, and Muse Spark build the same 4 apps

── more on @vilius vystartas 3 stories trending now

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 8 Jul · #artificial-intelligence

SpaceXAI unveils Grok 4.5 AI model ahead of July 2026 public release

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required