Frontier-level performance on AA-Briefcase can take roughly 20 minutes per task. For example, Claude Opus 4.8 (max) averages ~24 minutes of wall-clock time per task, and GLM-5.2 (max) averages ~19 minutes. Longer runtime is not consistently associated with better performance, however: MiniMax-M3 averages ~26 minutes per task, more than Claude Opus 4.8 (max), yet reaches an AA-Briefcase Elo of 1116, 240 points below it.
A key driver of average time per task is the number of turns a model takes before submission. Models are allowed up to 500 turns per task, and can submit their work at any point or abandon the task using the abandon_task_finish tool. More turns give models more time to work, but we do not observe a strong correlation between turn count and performance. Gemini 3.5 Flash, for example, has one of the highest average turn counts, at ~88 per task, yet lands well below the Elo leaders.
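The turn budget described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the benchmark's actual harness: the 500-turn cap and the abandon_task_finish tool name come from the text, while `run_task`, `TaskResult`, the `"submit"` action name, and the `"turn_limit"` outcome for a model that exhausts its budget are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

MAX_TURNS = 500  # per-task turn cap stated in the text


@dataclass
class TaskResult:
    outcome: str     # "submitted", "abandoned", or "turn_limit" (hypothetical labels)
    turns_used: int  # number of turns consumed before the task ended


def run_task(agent_step: Callable[[int], str], max_turns: int = MAX_TURNS) -> TaskResult:
    """Run one task until the model submits, abandons, or runs out of turns.

    `agent_step(turn)` is a hypothetical callback returning the name of the
    tool the model calls on that turn.
    """
    for turn in range(1, max_turns + 1):
        action = agent_step(turn)
        if action == "submit":  # assumed name for the submission action
            return TaskResult("submitted", turn)
        if action == "abandon_task_finish":  # tool name from the text
            return TaskResult("abandoned", turn)
    # Assumption: the text does not say what happens when the cap is reached.
    return TaskResult("turn_limit", max_turns)


# Example: a model that works for 87 turns and submits on turn 88.
result = run_task(lambda turn: "submit" if turn == 88 else "work")
print(result)
```

Recording `turns_used` alongside the outcome is what makes it possible to compare turn counts with Elo afterwards, which is the comparison the paragraph above draws.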