[ARTICLE · art-33968] src=dev.to ↗ pub= topic=artificial-intelligence verified=true sentiment=↓ negative

AI agents scored 0% on expert tasks. The hype machine doesn't care.

Top AI agents scored 0% on expert-level professional tasks in the ALE benchmark, with models like Fable 5 and GPT-5.5 failing to solve any problems requiring real expertise. Mid-level performance reached only 15-21%, highlighting a significant gap between demo capabilities and real-world deployment. The results challenge the hype around AI agents replacing skilled workers, urging a pragmatic approach with human oversight and tightly scoped tasks.

3 min read · published Jun 19, 2026

Top AI agents scored zero percent on expert-level professional tasks in the ALE benchmark. Not a low score. Not a disappointing score. Zero. Not even one task passed.

Enjoy this satisfying round number while your timeline fills up with threads about how agents will replace your entire engineering team by Q3.

ALE, which stands for Agents' Last Exam, is a benchmark that tests AI agents on problems demanding real professional expertise. Not the "summarize this PDF" kind of problem, but the hard, domain-specific work that experts in the field do.

The findings were grim. The models tested included Fable 5 and GPT-5.5. On the most difficult "Last-Exam" tier of expert-level problems, they achieved a 0% pass rate (partial credit was non-zero, but no task passed). A coin flip would have been more impressive.
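To see how a 0% pass rate can sit next to non-zero partial credit, here is a minimal Python sketch. The scores are made up for illustration and are not ALE data, and the scoring scheme (a rubric fraction per task, with a pass requiring full marks) is an assumption, not the benchmark's documented method.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    score: float  # hypothetical: fraction of rubric points earned, 0.0 to 1.0


def pass_rate(results, threshold=1.0):
    """Share of tasks whose score reaches the pass threshold (assumed: full marks)."""
    if not results:
        return 0.0
    return sum(r.score >= threshold for r in results) / len(results)


def mean_partial_credit(results):
    """Average rubric fraction earned across all tasks."""
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)


# Made-up scores for illustration only; these are not ALE results.
results = [
    TaskResult("t1", 0.25),
    TaskResult("t2", 0.5),
    TaskResult("t3", 0.0),
    TaskResult("t4", 0.25),
]

print(pass_rate(results))            # 0.0
print(mean_partial_credit(results))  # 0.25
```

An agent can make visible progress on every task and still finish none of them, which is why the two numbers need to be reported separately.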

One little detail many will overlook: performance on mid-level tasks was slightly higher but still rather unimpressive, with the best agents achieving 15–21% success rates. So they are not entirely ineffectual. They're just not what the hype says they are.

Every few weeks, there is a new demonstration where an AI agent is booking flights, writing code, and perhaps even managing a project. In a two-minute video, it looks amazing.

Then you try to get it to do something that actually matters in your job. Something with ambiguity, edge cases, and real stakes. Eventually, it crumbles under pressure. 🎪

This is the demo-to-deploy gap, and it's enormous. Demos are curated. Benchmarks are not.

I keep seeing teams make architectural decisions based on capabilities that don't exist yet. "We'll just have an agent handle that workflow." "The agent layer will manage orchestration." Cool plan. But all of that rests on trust, not on evidence.

Here's what we can actually learn from the ALE results:

Agents are solid assistants for mid-complexity work. That's genuinely useful. Stop underselling it.

Expert-level autonomy is not here. Planning your product around it is gambling.

Benchmarks matter more than demos. A controlled test beats a cherry-picked screencast every time.

If you're building agent-powered features today, build them for what agents can actually do today. Not for what a keynote speaker promised they'll do "soon."

The disconnect between empirical results and industry narrative is wild. A model scores literally zero on hard tasks, and the conversation doesn't change at all. Nobody adjusts their roadmap. Nobody recalibrates expectations.

The hype machine doesn't run on data. It runs on funding rounds and Twitter impressions. 💸

I am not implying that agents will not improve. They most likely will. But "probably will get better eventually" is a terrible foundation for engineering decisions you're making this quarter.

If I were planning a product right now, I'd treat agents like junior developers. Useful for well-scoped tasks with clear guardrails. Terrible when left unsupervised on anything complex. This means:

Human-in-the-loop for anything high-stakes. Not optional. Required.

Scope agent tasks tightly. The narrower the task, the better the output.

Measure everything. If you can't benchmark your agent's performance on your actual workload, you're flying blind.
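The three rules above can be sketched in a few lines of Python. Everything here is hypothetical: the `Task` fields, the tier labels, the routing rule, and the sample outcomes are placeholders for whatever your own workload and review policy look like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    tier: str          # hypothetical labels, e.g. "mid" or "expert"
    high_stakes: bool


def needs_human_review(task):
    # Assumed policy: anything high-stakes or expert-tier goes to a person.
    return task.high_stakes or task.tier == "expert"


def pass_rate_by_tier(outcomes):
    """outcomes: iterable of (Task, passed) pairs recorded on your own workload."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in outcomes:
        totals[task.tier] += 1
        passes[task.tier] += int(passed)
    return {tier: passes[tier] / totals[tier] for tier in totals}


# Illustrative outcomes only; replace with results from real runs.
outcomes = [
    (Task("rename-config-key", "mid", False), True),
    (Task("update-changelog", "mid", False), False),
    (Task("migrate-billing-schema", "expert", True), False),
]

rates = pass_rate_by_tier(outcomes)
print(rates)  # {'mid': 0.5, 'expert': 0.0}

for task, _ in outcomes:
    print(task.name, "-> human review" if needs_human_review(task) else "-> agent")
```

Tracking pass rates per tier, rather than one blended number, is what lets you see where the agent is actually earning its keep and where it still needs a person.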

The boring, pragmatic approach isn't as fun as tweeting "we replaced our entire QA team with agents." But it ships working software. 🤷

A 0% score on expert tasks is not a sign of one model failing. It is a wake-up call for the one-size-fits-all story. Agents are tools (good ones, even), but they're not the autonomous workforce that the hype cycle is selling. Build for reality. Back what works. Stay skeptical of anyone who treats benchmarks as an inconvenience.

What is the most overhyped agent capability that you have seen teams actually try to ship? I would love to hear the war stories below.
