10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.

wpnews.pro

cd /news/large-language-models/10-models-tested-from-81-6-to-10-the… · home › topics › large-language-models › article

[ARTICLE · art-14662] src=dev.to ↗ pub=2026-05-26T22:42Z topic=large-language-models verified=true sentiment=· neutral

10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.

A developer tested 10 AI models on 10 agent coding tasks, finding free-tier performance ranged from 76.7% (Owl Alpha) to 10% (Laguna M.1), with the latter producing garbage on 9 of 10 tasks. The paid models, led by Grok 4.3 at 81.6%, cost a combined $0.10, while free-tier models were often crippled by a 400-token output cap that turned partial responses into failures. The results show that "free" can cost significant debugging time, with Perceptron Mk1 delivering 79.9% accuracy for $0.002.

read5 min views14 publishedMay 26, 2026

By Vilius Vystartas | May 2026

I tested another 10 models across the same 10 agent coding tasks. Four of them were free-tier models — and the range was absurd: Owl Alpha scored 76.7% with zero hard fails, Laguna M.1 scored 10% and produced garbage on 9 out of 10 tasks. The free tier is not free if it costs you debugging time.

Total cost for all 10 models: $0.10. The paid models (6 of 10) came to $0.10 combined.

#	Model	Score	P/P/F	Cost	Time	Category
🥇	Grok 4.3
81.6%
7/3/0	$0.017	39.9s	Paid (xAI)
🥈	Perceptron Mk1	79.9%	8/1/1	$0.002	29.3s	Paid (Perceptron)
🥉	Owl Alpha (free)	76.7%	5/5/0	Free	83.0s	Free tier
4	xAI: Grok Build 0.1	75.0%	5/4/1	$0.034	95.3s	Paid (xAI)
5	OpenAI: GPT Chat Latest	73.3%	6/2/2	$0.043	18.7s	Paid (OpenAI)
6	Mistral Medium 3.5	71.6%	6/2/2	$0.008	12.6s	Paid (Mistral)
7	Nemotron 3 Nano Omni (free)	50.0%	4/2/4	Free	23.5s	Free tier
8	Laguna XS.2 (free)	49.7%	3/3/4	Free	28.7s	Free tier
9	Baidu CoBuddy (free)	40.0%	4/0/6	Free	362.4s	Free tier
10	Laguna M.1 (free)	10.0%	1/0/9	Free	89.8s	Free tier

Grok 4.3 (81.6%, $0.017, 39.9s) — Grok's latest release takes the batch with zero hard fails. Seven clean passes, three partials. Process-monitor was the only full pass it earned that 4.3's competitors missed. xAI's Grok line is quietly consistent — 4.1 Fast (76.7%), 4.20 (75%), and now 4.3 (81.6%) — all within striking distance of the 80%+ club without crossing into premium pricing.

Perceptron Mk1 (79.9%, $0.002, 29.3s) — A brand new family debuts at nearly 80%, with eight passes — the most in the batch — for two-tenths of a cent. The one failure (regex-extract at 17%) is a known weakness for small models. At this price-to-pass ratio, Perceptron Mk1 is the value story of this batch.

Owl Alpha (free, 76.7%, 83.0s) — A free model with zero hard fails and 5 full passes. That's the standout free-tier result. Takes 2x longer than paid models for some tasks (24s on csv-stats vs 1-3s for the field), but the code is functional. If latency isn't critical, this is usable.

Four free models. Results:

Model	Score	Verdict
Owl Alpha	76.7%
Usable — zero hard fails, 5/10 full passes. Slow but functional.
Nemotron 3 Nano Omni	50.0%
Mixed — half of tasks hit output cap at 400 tokens. Hit or miss.
Laguna XS.2	49.7%
Unreliable — 400-token cap kills complex responses.
Baidu CoBuddy	40.0%
Frustrating — 362 seconds total. Half the tasks hit output cap at 399 tokens. Waiting 6 minutes for 40% accuracy is not a good trade.
Laguna M.1	10.0%
Broken — 1/10 passes. Every response capped at 400 tokens. Do not use.

The free tier cap of 399-400 output tokens is the real problem. Models like Laguna M.1 and CoBuddy truncate every response, turning what could be a partial into a fail. Owl Alpha works despite the cap because its outputs are concise enough to fit.

Pay $0.002 for Perceptron Mk1 and get 8/10 passes, or use Laguna M.1 free and get 1/10. The math is not subtle.

GPT Chat Latest (73.3%, $0.043) — OpenAI's catch-all endpoint was solid on easy tasks (file-parse, csv-stats, sql-query all passed) but fell apart on fix-bug (0%) with a lengthy, expensive hallucination. The most expensive model in the batch and it doesn't crack 75%.

Mistral Medium 3.5 (71.6%, $0.008) — Fastest model in the batch at 12.6s total, but the process-monitor task hit a 504 Gateway Timeout and scored 0%. A timeout fail on a model that otherwise looks strong carries a disproportionate penalty — without it, Medium 3.5 would be at 79.5%.

Laguna M.1 (10%) — The worst score in any batch I've run. Seven of its task responses were blank 400-token output cap fills. Not worth listing on OpenRouter.

|---|---|---|---|
| Owl Alpha (free) | 76.7% | $0 | $0 |

| Nemotron 3 Nano Omni (free) | 50.0% | $0 | $0 | | Laguna XS.2 (free) | 49.7% | $0 | $0 | | Baidu CoBuddy (free) | 40.0% | $0 | $0 | | Laguna M.1 (free) | 10.0% | $0 | $0 | | Perceptron Mk1 | 79.9% | $0.002 | $0.0024 | | Mistral Medium 3.5 | 71.6% | $0.008 | $0.0108 | | Grok 4.3 | 81.6% | $0.017 | $0.0209 | | xAI: Grok Build 0.1 | 75.0% | $0.034 | $0.0450 | | GPT Chat Latest | 73.3% | $0.043 | $0.0584 |

Free models dominate the $/%-pt table by definition, but only Owl Alpha is actually usable. Among paid models, Perceptron Mk1 at $0.0024/%-pt is the efficiency winner — 24x cheaper per point than GPT Chat Latest.

Same setup as previous batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 400. Temperature: 0.1. Pattern-matching scoring against expected outputs.

Pre-flight verification caught zero failures this batch. Total cost: $0.10. Total dataset: 168 models tested across cloud and local.

Full results and per-task scores: benchmarks.workswithagents.dev

source & further reading

dev.to — original article I Ran 150 Tasks to Test If AI Agents Follow Rules — The Answer Surprised Me moteDB 0.5.1 Is Out: What 18 Months of Building an Embedded Database for Robots Taught Me Making a Bloated Claude Code Fast Again: Auditing Context Injection Down From 228KB to 48KB

~/api · this article 200

$curl api.wpnews.pro/v1/news/10-models-tested-from-81…

Read original on dev.to → dev.to/vystartasv/10-models-tested-from-816-to-1…

mentioned entities

xAI

Grok 4.3

Perceptron Mk1

Owl Alpha

OpenAI

Mistral

Baidu

Laguna M.1

metadata

slug10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble

topic#large-language-models

secondary4 topics

sentimentneutral

canonicaldev.to

navigation

← prevRisk or Reward: What First Thing…

next →Figure signs agreement with Cata…

── more in #large-language-models 4 stories · sorted by recency

dev.to · 11 Jul · #large-language-models

AI’s next phase is about doing the work, not just answering questions

startupfortune.com · 11 Jul · #large-language-models

Zuckerberg Breaks His Three Year Silence On X To Launch Muse Spark 1.1

cryptobriefing.com · 11 Jul · #large-language-models

Grok 4.5 ranks second on APEX-SWE leaderboard as AI coding race heats up

github.com · 11 Jul · #large-language-models

Show HN: Inferock-bench – per-call billing receipts for OpenAI and Anthropic

── more on @xai 3 stories trending now

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 8 Jul · #artificial-intelligence

AI Tokenomics: How to tokenmin while ROImaxxing

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required