{"slug": "show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation", "title": "Show HN: AA-Briefcase: a frontier knowledge work evaluation", "summary": "A new evaluation benchmark, AA-Briefcase, measures frontier knowledge work performance, with models like Claude Opus 4.8 averaging 24 minutes per task and achieving an Elo of 1356, while MiniMax-M3 takes longer but scores lower. Turn count does not strongly correlate with performance, as Gemini 3.5 Flash uses many turns but ranks below leaders.", "body_md": "Frontier-level performance on AA-Briefcase can take ~20 minutes per task. For example, Claude Opus 4.8 (max) averages ~24 minutes of wall-clock time per task, and GLM-5.2 (max) averages ~19 minutes. Longer runtime, however, is not consistently associated with better performance: MiniMax-M3 averages ~26 minutes per task, more than Claude Opus 4.8 (max), yet reaches an AA-Briefcase Elo of 1116, 240 points behind Opus.\n\nA key driver of average time per task is the number of turns a model takes before submission. Models are allowed up to 500 turns per task, and can submit their work at any point or abandon the task using the `abandon_task_finish` tool. More turns give models more time to work, but we do not observe a strong correlation between turn count and performance. 
Gemini 3.5 Flash, for example, averages one of the highest turn counts, at ~88 per task, while landing well below the Elo leaders.", "url": "https://wpnews.pro/news/show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation", "canonical_source": "https://artificialanalysis.ai/articles/aa-briefcase", "published_at": "2026-06-18 23:57:48+00:00", "updated_at": "2026-06-19 00:31:50.189510+00:00", "lang": "en", "topics": ["ai-research", "ai-products", "large-language-models", "ai-tools"], "entities": ["Claude Opus 4.8", "GLM-5.2", "MiniMax-M3", "Gemini 3.5 Flash", "AA-Briefcase"], "alternates": {"html": "https://wpnews.pro/news/show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation", "markdown": "https://wpnews.pro/news/show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation.md", "text": "https://wpnews.pro/news/show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation.txt", "jsonld": "https://wpnews.pro/news/show-hn-aa-briefcase-a-frontier-knowledge-work-evaluation.jsonld"}}