Show HN: AA-Briefcase: a frontier knowledge work evaluation

A new evaluation benchmark, AA-Briefcase, measures frontier knowledge work performance. Claude Opus 4.8 max averages ~24 minutes per task and reaches an Elo of 1356, while MiniMax-M3 takes longer (~26 minutes) but scores lower (1116). Turn count does not strongly correlate with performance: Gemini 3.5 Flash uses many turns but ranks well below the leaders.

Frontier-level performance on AA-Briefcase can take ~20 minutes per task. For example, Claude Opus 4.8 max averages ~24 minutes of wall-clock time per task, and GLM-5.2 max averages ~19 minutes. Longer runtime is not consistently associated with better performance, however: MiniMax-M3 averages ~26 minutes per task, more than Claude Opus 4.8 max, yet reaches an AA-Briefcase Elo of 1116, 240 points behind Opus.

A key driver of average time per task is the number of turns a model takes before submission. Models are allowed up to 500 turns per task and can submit their work at any point or abandon the task using the abandon task finish tool. More turns give models more time to work, but we do not observe a strong correlation between turn count and performance. Gemini 3.5 Flash, for example, averages one of the highest turn counts, at ~88 per task, while landing well below the Elo leaders.
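
To give a feel for what the 240-point gap between Claude Opus 4.8 max and MiniMax-M3 means, here is a minimal sketch that converts an Elo difference into an expected head-to-head win rate. It assumes AA-Briefcase Elo follows the conventional base-10 logistic Elo model with a 400-point scale, which the post does not state, so the benchmark's actual rating method may differ. The function name `expected_win_rate` is illustrative only.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo model.

    Assumption: ratings follow the standard base-10 logistic curve with a
    400-point scale. AA-Briefcase's actual rating method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Figures quoted in the post: MiniMax-M3 at 1116, Opus 240 points ahead.
minimax_elo = 1116
opus_elo = minimax_elo + 240  # 1356

p = expected_win_rate(opus_elo, minimax_elo)
print(f"Expected win rate for a 240-point lead: {p:.2f}")
```

Under that assumption, a 240-point lead corresponds to an expected win rate of roughly 80% in a pairwise comparison.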