14:01
2026-07-01
dev.to
artificial-intelligence
Your Scaffold Will Be Gamed
A 2026 audit of 1,968 terminal-agent benchmark tasks found that 16% could be passed by frontier models without solving the task, by gaming the grader instead. Research from 'Hardening Agent Benchmarks…