{"slug": "10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble", "title": "10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.", "summary": "A developer tested 10 AI models on 10 agent coding tasks, finding free-tier performance ranged from 76.7% (Owl Alpha) to 10% (Laguna M.1), with the latter producing garbage on 9 of 10 tasks. The paid models, led by Grok 4.3 at 81.6%, cost a combined $0.10, while free-tier models were often crippled by a 400-token output cap that turned partial responses into failures. The results show that \"free\" can cost significant debugging time, with Perceptron Mk1 delivering 79.9% accuracy for $0.002.", "body_md": "*By Vilius Vystartas | May 2026*\n\nI tested another 10 models across the same 10 agent coding tasks. Five of them were free-tier models, and the range was absurd: Owl Alpha scored 76.7% with zero hard fails, Laguna M.1 scored 10% and produced garbage on 9 out of 10 tasks. The free tier is not free if it costs you debugging time.\n\nTotal cost for all 10 models: **$0.10**. 
The paid models (5 of 10) came to $0.10 combined.\n\n| # | Model | Score | P/P/F | Cost | Time | Category |\n|---|---|---|---|---|---|---|\n| 🥇 | Grok 4.3 | 81.6% | 7/3/0 | $0.017 | 39.9s | Paid (xAI) |\n| 🥈 | Perceptron Mk1 | 79.9% | 8/1/1 | $0.002 | 29.3s | Paid (Perceptron) |\n| 🥉 | Owl Alpha (free) | 76.7% | 5/5/0 | Free | 83.0s | Free tier |\n| 4 | xAI: Grok Build 0.1 | 75.0% | 5/4/1 | $0.034 | 95.3s | Paid (xAI) |\n| 5 | OpenAI: GPT Chat Latest | 73.3% | 6/2/2 | $0.043 | 18.7s | Paid (OpenAI) |\n| 6 | Mistral Medium 3.5 | 71.6% | 6/2/2 | $0.008 | 12.6s | Paid (Mistral) |\n| 7 | Nemotron 3 Nano Omni (free) | 50.0% | 4/2/4 | Free | 23.5s | Free tier |\n| 8 | Laguna XS.2 (free) | 49.7% | 3/3/4 | Free | 28.7s | Free tier |\n| 9 | Baidu CoBuddy (free) | 40.0% | 4/0/6 | Free | 362.4s | Free tier |\n| 10 | Laguna M.1 (free) | 10.0% | 1/0/9 | Free | 89.8s | Free tier |\n\n**Grok 4.3 (81.6%, $0.017, 39.9s)** — Grok's latest release takes the batch with zero hard fails. Seven clean passes, three partials. Process-monitor was the one full pass it earned that its competitors missed. xAI's Grok line is quietly consistent — 4.1 Fast (76.7%), 4.20 (75%), and now 4.3 (81.6%) — all within striking distance of the 80%+ club without crossing into premium pricing.\n\n**Perceptron Mk1 (79.9%, $0.002, 29.3s)** — A brand new family debuts at nearly 80%, with eight passes — the most in the batch — for two-tenths of a cent. The one failure (regex-extract at 17%) is a known weakness for small models. At this price-to-pass ratio, Perceptron Mk1 is the value story of this batch.\n\n**Owl Alpha (free, 76.7%, 83.0s)** — A free model with zero hard fails and 5 full passes. That's the standout free-tier result. It is far slower than the paid models on some tasks (24s on csv-stats vs 1-3s for the field), but the code is functional. If latency isn't critical, this is usable.\n\nFive free models. 
Results:\n\n| Model | Score | Verdict |\n|---|---|---|\n| Owl Alpha | 76.7% | Usable — zero hard fails, 5/10 full passes. Slow but functional. |\n| Nemotron 3 Nano Omni | 50.0% | Mixed — half of tasks hit output cap at 400 tokens. Hit or miss. |\n| Laguna XS.2 | 49.7% | Unreliable — 400-token cap kills complex responses. |\n| Baidu CoBuddy | 40.0% | Frustrating — 362 seconds total. Half the tasks hit output cap at 399 tokens. Waiting 6 minutes for 40% accuracy is not a good trade. |\n| Laguna M.1 | 10.0% | Broken — 1/10 passes. Every response capped at 400 tokens. Do not use. |\n\nThe free tier cap of 399-400 output tokens is the real problem. Laguna M.1 truncates every response and CoBuddy about half of them, turning what could be a partial into a fail. Owl Alpha works despite the cap because its outputs are concise enough to fit.\n\nPay $0.002 for Perceptron Mk1 and get 8/10 passes, or use Laguna M.1 free and get 1/10. The math is not subtle.\n\n**GPT Chat Latest (73.3%, $0.043)** — OpenAI's catch-all endpoint was solid on easy tasks (file-parse, csv-stats, sql-query all passed) but fell apart on fix-bug (0%) with a lengthy, expensive hallucination. The most expensive model in the batch and it doesn't crack 75%.\n\n**Mistral Medium 3.5 (71.6%, $0.008)** — Fastest model in the batch at 12.6s total, but the process-monitor task hit a 504 Gateway Timeout and scored 0%. A timeout fail on a model that otherwise looks strong carries a disproportionate penalty — without it, Medium 3.5 would be at 79.5%.\n\n**Laguna M.1 (10%)** — The worst score in any batch I've run. Seven of its task responses were blank, with the 400-token output cap used up. 
Not worth listing on OpenRouter.\n\n| Model | Score | Cost | $/%-pt |\n|---|---|---|---|\n| Owl Alpha (free) | 76.7% | $0 | $0 |\n| Nemotron 3 Nano Omni (free) | 50.0% | $0 | $0 |\n| Laguna XS.2 (free) | 49.7% | $0 | $0 |\n| Baidu CoBuddy (free) | 40.0% | $0 | $0 |\n| Laguna M.1 (free) | 10.0% | $0 | $0 |\n| Perceptron Mk1 | 79.9% | $0.002 | $0.000024 |\n| Mistral Medium 3.5 | 71.6% | $0.008 | $0.000108 |\n| Grok 4.3 | 81.6% | $0.017 | $0.000209 |\n| xAI: Grok Build 0.1 | 75.0% | $0.034 | $0.000450 |\n| GPT Chat Latest | 73.3% | $0.043 | $0.000584 |\n\nFree models dominate the $/%-pt table by definition, but only Owl Alpha is actually usable. Among paid models, Perceptron Mk1 at $0.000024/%-pt is the efficiency winner — 24x cheaper per point than GPT Chat Latest.\n\nSame setup as previous batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 400. Temperature: 0.1. Pattern-matching scoring against expected outputs.\n\nPre-flight verification caught zero failures this batch. Total cost: **$0.10**. 
Total dataset: **168 models tested** across cloud and local.\n\nFull results and per-task scores: [benchmarks.workswithagents.dev](https://benchmarks.workswithagents.dev)", "url": "https://wpnews.pro/news/10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble", "canonical_source": "https://dev.to/vystartasv/10-models-tested-from-816-to-10-the-free-tier-is-a-full-on-gamble-4kfc", "published_at": "2026-05-26 22:42:59+00:00", "updated_at": "2026-05-26 23:03:33.947000+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-research", "ai-agents"], "entities": ["xAI", "Grok 4.3", "Perceptron Mk1", "Owl Alpha", "OpenAI", "Mistral", "Baidu", "Laguna M.1"], "alternates": {"html": "https://wpnews.pro/news/10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble", "markdown": "https://wpnews.pro/news/10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble.md", "text": "https://wpnews.pro/news/10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble.txt", "jsonld": "https://wpnews.pro/news/10-models-tested-from-81-6-to-10-the-free-tier-is-a-full-on-gamble.jsonld"}}