UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

The UK's AI Security Institute (AISI) found that standard AI benchmarks systematically underestimate agent capabilities by limiting compute budgets. In a study of seven benchmarks, increasing the token budget tenfold boosted success rates on software engineering tasks by about 25 percent, with newer models benefiting most. AISI estimates actual progress at the frontier is roughly 60 percent steeper than previously measured.

That estimate depends on the token budget, according to AISI.

The article "UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do" (https://the-decoder.com/uks-ai-security-institute-finds-standard-benchmarks-systematically-underestimate-what-ai-agents-can-actually-do/) appeared first on The Decoder (https://the-decoder.com).