{"slug": "ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care", "title": "AI agents scored 0% on expert tasks. The hype machine doesn't care.", "summary": "Top AI agents scored 0% on expert-level professional tasks in the ALE benchmark, with models like Fable 5 and GPT-5.5 failing to solve any problems requiring real expertise. Mid-level performance reached only 15-21%, highlighting a significant gap between demo capabilities and real-world deployment. The results challenge the hype around AI agents replacing skilled workers, urging a pragmatic approach with human oversight and tightly scoped tasks.", "body_md": "Top AI agents scored zero percent on expert-level professional tasks in the ALE benchmark. Not a handful, not a disappointing few. **Not even one.**\n\nEnjoy this satisfying round number while your timeline fills up with threads about how agents will replace your entire engineering team by Q3.\n\nALE (Agents' Last Exam) is a benchmark that tests AI agents on problems that demand real professional expertise. Not the \"summarize this PDF\" kind of problem, but the hard, domain-specific work that experts in the field do.\n\nThe findings were grim. Fable 5 and GPT-5.5 were among the models tested. On the most difficult \"Last-Exam\" tier of expert-level problems, they posted a **0% pass rate** (note that partial credit was non-zero). A coin flip would have been more impressive.\n\nOne detail many will overlook: performance on mid-level tasks was higher but still unimpressive, with the best agents achieving 15–21% success rates. So they are not entirely ineffectual. They're just not what the hype says they are.\n\nEvery few weeks, a new demo shows an AI agent booking flights, writing code, perhaps even managing a project. In a two-minute video, it looks amazing.\n\nThen you try to get it to do something that actually matters in your job. 
Something with ambiguity, edge cases, and real stakes. Eventually, it crumbles under pressure. 🎪\n\nThis is the demo-to-deploy gap, and it's enormous. Demos are curated. Benchmarks are not.\n\nI keep seeing teams make architectural decisions based on capabilities that don't exist yet. \"We'll just have an agent handle that workflow.\" \"The agent layer will manage orchestration.\" Cool plan. But all that is based on trust.\n\nHere's what we can actually learn from the ALE results:\n\n→ **Agents are solid assistants for mid-complexity work.** That's genuinely useful. Stop underselling it.\n\n→ **Expert-level autonomy is not here.** Planning your product around it is gambling.\n\n→ **Benchmarks matter more than demos.** A controlled test beats a cherry-picked screencast every time.\n\nIf you're building agent-powered features today, build them for what agents can actually do *today*. Not for what a keynote speaker promised they'll do \"soon.\"\n\nThe disconnect between empirical results and industry narrative is wild. A model scores literally zero on hard tasks, and the conversation doesn't change at all. Nobody adjusts their roadmap. Nobody recalibrates expectations.\n\nThe hype machine doesn't run on data. It runs on funding rounds and Twitter impressions. 💸\n\nI am not implying that agents will not improve. They most likely will. But \"probably will get better eventually\" is a terrible foundation for engineering decisions you're making this quarter.\n\nIf I were planning a product right now, I'd treat agents like junior developers. Useful for well-scoped tasks with clear guardrails. Terrible when left unsupervised on anything complex.\n\nThis means:\n\n→ **Human-in-the-loop for anything high-stakes.** Not optional. 
Required.\n\n→ **Scope agent tasks tightly.** The narrower the task, the better the output.\n\n→ **Measure everything.** If you can't benchmark your agent's performance on your actual workload, you're flying blind.\n\nThe boring, pragmatic approach isn't as fun as tweeting \"we replaced our entire QA team with agents.\" But it ships working software. 🤷\n\nA 0% score on expert tasks is not one model's failure. It is a wake-up call for the one-size-fits-all story. Agents are tools, good ones even, but they're not the autonomous workforce that the hype cycle is selling. Build for reality. Back what works. Stay skeptical of anyone who treats benchmarks as an inconvenience.\n\nWhat is the most overhyped agent capability that you have seen teams actually try to ship? I would love to hear the war stories below.", "url": "https://wpnews.pro/news/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care", "canonical_source": "https://dev.to/adioof/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesnt-care-2bp1", "published_at": "2026-06-19 13:10:35+00:00", "updated_at": "2026-06-19 13:36:52.131429+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "ai-research", "ai-safety", "large-language-models"], "entities": ["ALE", "Fable 5", "GPT-5.5", "Agents' Last Exam"], "alternates": {"html": "https://wpnews.pro/news/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care", "markdown": "https://wpnews.pro/news/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care.md", "text": "https://wpnews.pro/news/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care.txt", "jsonld": "https://wpnews.pro/news/ai-agents-scored-0-on-expert-tasks-the-hype-machine-doesn-t-care.jsonld"}}