Source: artificialanalysis.ai · Topic: ai-research

Show HN: AA-Briefcase: a frontier knowledge work evaluation

A new evaluation benchmark, AA-Briefcase, measures frontier knowledge work performance. Claude Opus 4.8 (max) averages ~24 minutes per task and reaches an Elo of 1356, while MiniMax-M3 takes longer (~26 minutes) but scores lower (1116). Turn count does not strongly correlate with performance: Gemini 3.5 Flash uses many turns but ranks well below the leaders.

Published Jun 18, 2026 · 1 min read

Frontier-level performance on AA-Briefcase can take ~20 minutes per task. For example, Claude Opus 4.8 (max) averages ~24 minutes of wall-clock time per task, and GLM-5.2 (max) averages ~19 minutes. Longer runtime is not consistently associated with better performance, however: MiniMax-M3 averages ~26 minutes per task, more than Claude Opus 4.8 (max), yet reaches an AA-Briefcase Elo of 1116, 240 points behind Opus.
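To give a feel for what a 240-point gap means, here is a minimal sketch that converts an Elo difference into an expected head-to-head score. It assumes the conventional logistic Elo model with a 400-point scale; the article does not say how AA-Briefcase Elo is computed, so the function name, the scale, and the resulting probability are illustrative assumptions, not figures published for the benchmark.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: a 400-point scale, as in conventional Elo. AA-Briefcase may
    use a different scale or fitting procedure.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


minimax_elo = 1116             # stated in the article
opus_elo = minimax_elo + 240   # 1356, implied by the stated 240-point gap

p = elo_expected_score(opus_elo, minimax_elo)
print(f"Expected score for the higher-rated model: {p:.3f}")  # 0.799
```

Under these assumptions, the higher-rated model would be preferred in roughly four out of five pairwise comparisons.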

A key driver of average time per task is the number of turns a model takes before submission. Models are allowed up to 500 turns per task, and can submit their work at any point or abandon the task using the abandon_task_finish tool. More turns give models more time to work, but we do not observe a strong correlation between turn count and performance. Gemini 3.5 Flash, for example, averages one of the highest turn counts, at ~88 per task, while landing well below the Elo leaders.
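The turn budget described above can be sketched as a simple loop. This is a hypothetical illustration, not the benchmark's actual harness: the names run_task, TaskResult, and policy are invented here, and the behaviour when the 500-turn cap is reached without a submission is an assumption, since the article does not describe it.

```python
from dataclasses import dataclass
from typing import Callable, Literal

MAX_TURNS = 500  # per-task cap stated in the article

# Hypothetical action labels; "abandon" stands in for calling abandon_task_finish.
Action = Literal["continue", "submit", "abandon"]


@dataclass
class TaskResult:
    outcome: str      # "submitted", "abandoned", or "turn_limit"
    turns_used: int


def run_task(policy: Callable[[int], Action], max_turns: int = MAX_TURNS) -> TaskResult:
    """Run one task, asking the policy what to do on each turn."""
    for turn in range(1, max_turns + 1):
        action = policy(turn)
        if action == "submit":
            return TaskResult("submitted", turn)
        if action == "abandon":
            return TaskResult("abandoned", turn)
    # Assumption: hitting the cap simply ends the task without a submission.
    return TaskResult("turn_limit", max_turns)


# A toy policy that works for 87 turns and submits on turn 88.
result = run_task(lambda turn: "submit" if turn >= 88 else "continue")
print(result)
```

Because wall-clock time accumulates with every turn, a policy that submits later costs more time per task, whether or not the extra turns improve the result.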
