# AA-Briefcase: a tougher test for agents

> Source: <https://www.runagentrun.co.uk/articles/aa-briefcase-a-tougher-test-for-agents/>
> Published: 2026-06-20 00:00:00+00:00

Artificial Analysis shipped a new agentic benchmark called AA-Briefcase this week, and the first results give a clean answer to a question the field has been avoiding: can agents actually finish long-horizon knowledge work? Not yet. Even the best model money can buy still fails most of the time. Claude Fable 5 leads the leaderboard, the open-weight GLM-5.2 lands a strong third, and the table exposes a nearly 13× cost gap per task between the two.

## A new test the field hasn’t yet learned to game

AA-Briefcase, released on 18 June 2026, is built around the kind of work knowledge workers actually do: projects that run for weeks, inboxes and Slack channels stuffed with thousands of fragmented inputs, and concrete deliverables at the end such as a financial model or a board deck ([Latent Space / AINews](https://www.latent.space/p/ainews-glm-gpt-glm-52-passes-vibe)). That is qualitatively different from a single-turn QA test or a SWE-bench patch, which is the point. The new index is framed as a shift toward agentic workloads with new per-task metrics, and crucially, the test is not saturated yet — a local-AI community thread noted that this was, for once, a solid benchmark hard to game by training against it.

> **3%** of tasks were fully satisfied by the leading model on AA-Briefcase. Long-horizon knowledge work is still mostly unsolved.

## The first leaderboard

The numbers, drawn from Artificial Analysis’ first AA-Briefcase run:

- **Claude Fable 5**: 1587 Elo, **$31 per task**
- **Claude Opus 4.8**: 1356 Elo, **$10.40 per task**
- **GLM-5.2**: 1266 Elo, **$2.40 per task**
- **GPT-5.5 (xhigh)**: **$3.68 per task**; Elo not quoted in the first release

Two things stand out. First, the price-per-task spread is nearly 13× between Fable 5 and GLM-5.2, and roughly 3× between Fable 5 and Opus 4.8, a cost shape that mirrors the usage-based pricing we have [tracked in earlier coverage](/articles/agentic-usage-based-pricing/). For a small firm running an agent on hundreds of tasks a week, that gap decides which model the budget tolerates. Second, the Elo gap from Fable 5 down to GLM-5.2 (321 points) is the kind of distance that, two model generations ago, separated frontier from mid-tier. The hierarchy is no longer *closed here, open way back there*; it is *frontier, near-frontier, near-frontier, and a long way behind*.
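The ratios and gaps above follow directly from the quoted leaderboard figures. The sketch below recomputes them and turns the per-task prices into a weekly bill. The 300-tasks-per-week volume is a hypothetical figure chosen for illustration, and the `expected_score` helper assumes the conventional 400-point logistic Elo scale, which the first release does not confirm AA-Briefcase uses.

```python
# Per-task prices (USD) and Elo scores as quoted in the first AA-Briefcase run.
COST_PER_TASK = {
    "Claude Fable 5": 31.00,
    "Claude Opus 4.8": 10.40,
    "GLM-5.2": 2.40,
    "GPT-5.5 (xhigh)": 3.68,
}
ELO = {"Claude Fable 5": 1587, "Claude Opus 4.8": 1356, "GLM-5.2": 1266}


def cost_ratio(model_a: str, model_b: str) -> float:
    """How many times more model_a costs per task than model_b."""
    return COST_PER_TASK[model_a] / COST_PER_TASK[model_b]


def weekly_cost(model: str, tasks_per_week: int) -> float:
    """Weekly spend in USD for a given task volume."""
    return COST_PER_TASK[model] * tasks_per_week


def expected_score(elo_a: float, elo_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumption: a 400-point logistic scale, as in chess-style Elo.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))


if __name__ == "__main__":
    tasks = 300  # hypothetical weekly volume for a small firm
    for model in COST_PER_TASK:
        print(f"{model}: ${weekly_cost(model, tasks):,.2f} per week")
    print(f"Fable 5 vs GLM-5.2: {cost_ratio('Claude Fable 5', 'GLM-5.2'):.2f}x")
    print(f"Fable 5 vs Opus 4.8: {cost_ratio('Claude Fable 5', 'Claude Opus 4.8'):.2f}x")
    gap = ELO["Claude Fable 5"] - ELO["GLM-5.2"]
    p = expected_score(ELO["Claude Fable 5"], ELO["GLM-5.2"])
    print(f"Elo gap {gap} points, expected score {p:.3f}")
```

Under those assumptions, a 321-point gap means the leader would be preferred in a large majority of head-to-head comparisons, while the per-task prices alone determine the budget side of the trade-off.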

> Announcing Artificial Analysis Intelligence Index v4.1: a shift toward agentic workloads, featuring upgraded benchmarks and new per-task metrics
>
> The Artificial Analysis Intelligence Index is our synthesis metric for assessing model intelligence and tracking AI progress. v4.1…
>
> Artificial Analysis (@ArtificialAnlys), June 18, 2026

## The open-weight angle is now a real one

GLM-5.2’s result is the headline for any team that cares about being able to switch providers. The Latent Space write-up notes that multiple independent practitioners have publicly described it as the first open-weight model that feels plausibly frontier-adjacent in daily use. Jeremy Howard wrote that it is at least as good as Opus 4.8 and GPT 5.5 for his workflows, while flagging that it lacks vision support. Another practitioner said it cleared his bar for a daily driver. Artificial Analysis’ own placement put it between GPT-5.5 and Opus 4.8 on the new eval.

MindStudio’s per-token side-by-side ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows)) frames the same gap at a different unit of measure: GLM 5.2 at roughly $2 in / $6 out per million tokens, GPT 5.5 at $10 in / $30 out, and Claude Opus 4.8 at $15 in / $75 out. On those list prices GLM 5.2 is about 5× cheaper than GPT 5.5, and between 7.5× (input) and 12.5× (output) cheaper than Claude Opus 4.8. The Artificial Analysis leaderboard and MindStudio’s side-by-side together imply that, on a high-volume automation pipeline, the cost case for the open-weight model is now empirical rather than theoretical, and [our earlier piece on running GLM-5.2 locally](/articles/glm-5-2-is-a-win-for-local/) covers the on-prem angle for teams that want to take it further.
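To see how per-token list prices translate into the cost of a single agent run, here is a minimal sketch using the figures quoted from MindStudio. The token counts (400,000 in, 50,000 out) are a hypothetical workload, not a measurement from AA-Briefcase, and real bills will also depend on caching and provider discounts that this ignores.

```python
# (input, output) list prices in USD per million tokens, as quoted from MindStudio.
PRICE_PER_MTOK = {
    "GLM 5.2": (2.0, 6.0),
    "GPT 5.5": (10.0, 30.0),
    "Claude Opus 4.8": (15.0, 75.0),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD of one run with the given token counts."""
    price_in, price_out = PRICE_PER_MTOK[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def price_gap(model_a: str, model_b: str, input_tokens: int, output_tokens: int) -> float:
    """How many times more model_a costs than model_b for the same run."""
    return run_cost(model_a, input_tokens, output_tokens) / run_cost(
        model_b, input_tokens, output_tokens
    )


if __name__ == "__main__":
    tokens_in, tokens_out = 400_000, 50_000  # hypothetical agent run
    for model in PRICE_PER_MTOK:
        print(f"{model}: ${run_cost(model, tokens_in, tokens_out):.2f}")
    gap = price_gap("Claude Opus 4.8", "GLM 5.2", tokens_in, tokens_out)
    print(f"Opus 4.8 vs GLM 5.2 on this mix: {gap:.2f}x")
```

Because the input and output prices differ by different factors, the blended gap depends on the token mix: input-heavy runs sit closer to the input-price ratio, and output-heavy runs closer to the output-price ratio.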

## Why the benchmark matters more than the rank

The interesting finding is not who is first. It is that Fable 5, the best model in commercial use, only fully satisfies the rubric on 3% of tasks. The field has spent two years declaring *agents that finish work* a solved problem; AA-Briefcase is the first mainstream benchmark that asks the question in a way a real long-horizon workflow would, and the answer is: not yet. For UK firms scoping agent deployments, that single number is worth more than any Elo table — it tells you which tasks are safe to hand to an agent today, and which still need a human in the loop.

## What to watch

Three threads to follow through the rest of the year:

- **Z.ai’s Open Fable timeline.** Z.ai’s leadership has forecast an open-weight Fable-class model by the end of 2026 ([Latent Space / AINews](https://www.latent.space/p/ainews-glm-gpt-glm-52-passes-vibe)). If it lands, the cost curve for top-tier agent work shifts again, this time into local-run territory for well-resourced buyers. We have [tracked the open-weights frontier here](/articles/zhipu-ais-open-model-nearly-catches-opus-4/).
- **The judgment-per-prompt gap.** A well-read observation from analyst @ollobrains is that open-weight models are now cheap enough and capable enough to win most work, even while closed models still win the invisible layer: judgment per prompt. The agentic harness (memory, tools, planning) has to bridge that gap, and that work is on you, not on the model.
- **AA-Briefcase saturation.** It is the only unsaturated agentic benchmark with this design. The first year of results will be the trustworthy ones; after that, the same gaming cycle begins. Take these numbers seriously, because they are about to be the cleanest signal in the field for a while.

