Agent Arena: Causal Evaluation of Agents in the Real World



Agents are increasingly doing real work. From chat to terminal to OpenClaw, users everywhere are interacting with complex agents, comprising a model and a harness with many subcomponents and tools. As a result, the task distribution has greatly expanded. This makes evaluating agents progressively more difficult, because both task coverage and task complexity are growing in tandem. We desire an agent evaluation that scales along with usage and capability.

Today we are releasing the Agent Arena leaderboard. Arena has always focused on evaluations in the real world. As such, Agent Arena collects and analyzes millions of in-the-wild interactions from people using Agent Mode while doing their jobs, such as software engineering, financial analysis, and more. From our observations of these agents running on our platform, we derive our first Agent Arena leaderboard, shown below:

Agent Arena Leaderboard (arena.ai/agent)

The methodology powering the Agent Arena Leaderboard is different from our previous arenas. Rather than pairwise votes, rankings are calculated using a methodology we call causal tracing. Causal tracing treats the agent as a multi-component system, with each component selection representing a possible treatment. We observe individual point-wise traces and measure signals such as task success rates, verbal feedback, tool error recovery, tool hallucinations, and, over time, much more. Then, by randomizing the component selections, we create a multi-intervention randomized controlled trial in which we can aggregate measurements to estimate causal treatment effects. We refer to these effects as "net improvement" in the figure above. The causal framework produces an interpretable ranking that represents the improvement in agent performance due to a component selection. This decouples the contributions of the main orchestrator model, any subagents, image generation models, and the different elements in the harness, letting us combine multiple signals into one coherent leaderboard.

This first leaderboard is the result of our causal evaluation of orchestrator models, the main LLMs that choose which tools to call. Rankings of other aspects of the agentic harness are coming soon. We include more methodological detail in the Formal Details of Methodology section below.

Per-Signal Leaderboards

Every Agent Arena session contains a stream of rich feedback. Users iterate with the agent in natural language, expressing approval, frustration, or clarification turn by turn. They decide whether to download an artifact the agent produced. They click explicit approve / disapprove buttons. They issue in-line corrections when the agent goes off-track. And the agent, on its side, is interacting with an environment that talks back continuously: shell exit codes, tool errors, the absence of a tool it tried to call. Agent Mode lets us extract all of these signals — explicit user feedback, implicit user feedback, and feedback from the agent's environment. After we compute per-session outcomes for each signal, we turn them into leaderboards with causal methods and then aggregate them into the headline leaderboard. We present our first 5 signals today, and we plan to measure more in the near future.

Each model's score on the canonical sub-signals that compose the aggregate (τ̂).

The headline leaderboard aggregates the following signals:

Confirmed success: the user marks a task as a success or failure using the Arena UI. Arena gives users approve and disapprove buttons on every turn; we use the final approval or disapproval of a given task's trajectory to determine the outcome. (There can be more than one task per session.)

Praise vs. complaint: the user praises or complains about the agent's output. For each task we identify messages expressing explicit verbal praise ("looks great", "this is exactly what I needed") or explicit verbal complaint ("this is broken", "you misunderstood entirely"). The task is marked a success if praise outnumbers complaints.

Steerability: the agent executes on user corrections. When a user issues an in-line correction ("no, do X instead", "you misread the file"), the agent should attempt to fix it. If the user accepts the fix, we mark the correction successful; if they reject it or give up, unsuccessful. When doing real work, mistakes are inevitable; this signal captures whether these errors are quickly resolved.

Bash recovery: turns taken to recover from a bash error. When the agent issues a bash command that errors due to a model failure (not an environment issue), the recovery clock starts; we count follow-up bash calls until the next non-erroring command. If the agent gives up, we impose an additional penalty.

Tool hallucination: the agent references a tool that does not exist. This penalizes invented tool names, malformed syntax that produces a junk name, and chain-of-thought tokens leaking into the tool field. We mark the task a failure if the agent calls a nonexistent tool.
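As a rough illustration of how two of these signals could be turned into per-task outcomes, here is a minimal Python sketch. The `BashCall` schema, the message labels, and the `give_up_penalty` value are all hypothetical; the post does not specify the trace format or the size of the penalty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BashCall:
    """One bash tool call in a task trajectory (hypothetical schema)."""
    exit_code: int
    model_fault: bool = True  # False when the error is an environment issue


def praise_vs_complaint(labels: List[str]) -> bool:
    """Task succeeds when praise messages outnumber complaint messages."""
    praise = sum(1 for label in labels if label == "praise")
    complaint = sum(1 for label in labels if label == "complaint")
    return praise > complaint


def bash_recovery_turns(calls: List[BashCall], give_up_penalty: int = 3) -> List[int]:
    """Count follow-up bash calls until the next non-erroring command.

    Returns one count per recovery episode. An episode starts at a
    model-caused error and ends at the next call with exit code 0.
    If the trajectory ends mid-episode, the agent is treated as having
    given up and give_up_penalty (an assumed value) is added.
    """
    episodes: List[int] = []
    follow_ups: Optional[int] = None  # None means no recovery clock is running
    for call in calls:
        failed = call.exit_code != 0
        if follow_ups is None:
            if failed and call.model_fault:
                follow_ups = 0  # the recovery clock starts
        else:
            follow_ups += 1  # this call is a follow-up attempt
            if not failed:
                episodes.append(follow_ups)
                follow_ups = None
    if follow_ups is not None:
        episodes.append(follow_ups + give_up_penalty)
    return episodes


# Illustrative trajectory: one error recovered after two follow-ups,
# then a second error that is never recovered.
trajectory = [BashCall(0), BashCall(1), BashCall(1), BashCall(0), BashCall(2)]
print(bash_recovery_turns(trajectory))
```

In this sketch an environment-caused error never starts the clock, which matches the "model failure (not an environment issue)" condition in the description; how such errors are detected in practice is outside its scope.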

This set of five signals is only a starting point. We plan to add more signals to further enrich these evaluations, retire ones that age out of relevance, and modify them as we improve our trace-mining.

Finally, though it is not a leaderboard signal, we also calculate the realized, post-deployment cost of each agent to assess Pareto optimality. We calculate the exact cost of each session directly. Some models turn out to be more expensive in practice despite cheaper on-paper pricing, as a result of model behavior (e.g. more steps per turn) or induced user behavior (e.g. more turns to reach satisfaction).

Net Improvement vs. list-price cost per session (7-day window)

Square markers sit on the cost–performance frontier (dotted line).
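The frontier in a cost versus performance plot like this one can be computed with a simple dominance check. The sketch below is a generic Pareto-frontier routine, not the code behind the figure; the function name, the model names, and every number in `example` are made up for illustration.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of models not dominated on (cost, score).

    points maps a model name to (cost_per_session, net_improvement).
    A model is dominated if another one is no more expensive and no
    worse in score, and strictly better on at least one of the two.
    """
    frontier = []
    for name, (cost, score) in points.items():
        dominated = any(
            other_cost <= cost and other_score >= score
            and (other_cost < cost or other_score > score)
            for other, (other_cost, other_score) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Sort cheapest first so the list traces the frontier left to right.
    return sorted(frontier, key=lambda n: points[n][0])


# Made-up numbers, purely illustrative.
example = {
    "model_a": (0.10, 0.01),
    "model_b": (0.25, 0.04),
    "model_c": (0.30, 0.02),  # costs more than model_b and scores lower
    "model_d": (0.60, 0.05),
}
print(pareto_frontier(example))
```

The quadratic scan is fine for a leaderboard-sized set of models; a sort-and-sweep version would only matter for much larger inputs.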

Agents in the Real World

Here we present a deep dive into the data that powers the leaderboards. Agent Arena is a live stream of real users asking models to work: write code, debug broken projects, research across the web, create documents, build frontends, analyze files, and iterate over multi-step tasks.

Primary intent across 160,480 agent tasks (7-day window)

Inner arcs show each slice's sub-intents.

In a recent 7-day slice, Arena saw 160,480 Agent Mode tasks (note there can be multiple tasks in a session). The largest categories were code writing (17.5%), research and lookup (10.8%), planning and brainstorming (10.6%), and multimodal image/video work (10.2%), followed by document creation (9.1%) and code debugging (8.9%). Code writing alone accounted for roughly 28,000 tasks, with another ~14,000 in code debugging and ~17,000 in research and lookup.

Total calls per tool across 2,060,159 tool calls (7-day window)

The box and whiskers mark P10 · P25 · P50 · P75 · P90; the diamond ◆ is the mean.

Across 128,244 sessions, 75.6% used at least one tool: 41.1% ran bash and 27.1% ran web search. Over the week, Agent Mode issued about 2.06 million structured tool calls, including ~936,000 bash calls, ~550,000 file writes, and ~275,000 web searches.

Final non-blank lines from successful write_file calls (7-day window); tile area scales with lines written

Tracking via successful write_file calls, Agent Mode wrote 40.3 million lines of code in the last week, roughly 1,000 lines per coding session.

Tool calls per session, grouped into complexity tiers (7-day window)

Work-type mix of 3,467 highest tool-use sessions (7-day window)

Inner arcs break down each slice's tool mix.

In the past 7 days, sessions averaged ~16.5 structured tool calls, and high-tool sessions were common enough to form their own cohort: more than 3,400 loop-filtered sessions ran very long tool chains in a single week. Those sessions were mostly real work — 53.2% coding or repo-debugging, 39.0% artifact/file-creation, with the rest spanning web synthesis, terminal workflows, and data analysis.

Input context on the final turn (7-day window)

Finally, about 32% of recent sessions ended with at least 128k input tokens in the final turn, 22% with at least 256k, and 8% with at least 1M.

What People Build

In a sample of the heaviest real sessions we saw: a live sports-TV schedule site, an autonomous-underwater-vehicle autopilot, a self-hosted movie-watchlist app, a financial-research RAG pipeline, a live study-tracking platform, and more. Many end with the user downloading the finished workspace.

A sample of high-effort Agent Mode sessions (7-day window)

Overall in the same week, the workspace saw over 50,000 downloads, far beyond just code, including office and media artifacts (.docx, .pptx, .xlsx, .pdf, and images).

How People Work With Agents

Beyond which model wins, the trace stream reveals how people actually delegate to agents — and how agents handle being corrected.

How much users hand over — and how they steer once the work is underway.

Delegation Posture

How much users hand over in their opening message.

Asked for advice · 28%
Directed step-by-step · 1%
Gave a scoped task · 11%
Handed off a deliverable · 45%
Let it run autonomously · 14%

Reining In

After the first reply, users pull control back ~2.3× as often as they hand over more.

Most opening messages hand over a whole job rather than ask for advice: the delegation posture skews heavily toward "build this deliverable" and "operate autonomously." However, after seeing the first response, users tighten the reins, pulling control back far more often than they hand over more.

Two ways a capable-sounding agent still underdelivers.

Bluster

A corrected agent sounds firm but almost never holds its ground.

Bluffing

On multi-part asks, how fully it covers every part.

Every part covered · 58%
A part left incomplete · 34%
A part silently dropped · 8%

We also find that when the opening ask bundles several explicit parts, agents usually cover all of them; the typical shortfall is leaving one incomplete. A rarer but more consequential shortfall is covert: the agent could have surfaced the incomplete work, but instead presents the result as complete. We call this "Bluffing".

Finally, agents do sometimes push back against users — but we find they usually only sound firm, rarely holding their ground in practice. We call this "Bluster": an artificial assertiveness that melts under additional pressure.

Formal Details of Methodology

In this section we describe the formal details of our evaluation framework.

Consider a $K$-component agent and sessions indexed by $i \in [n]$. Each session independently samples a $K$-dimensional vector representing an agent configuration, $T_i$. The configuration includes at minimum the orchestrator model, and as Agent Arena expands will encompass additional components such as the tools, system prompt, and harness. The configuration is drawn from a sampling distribution $P$. We denote $p_{i,k}(t) = \mathbb{P}_{T_i \sim P}(T_{i,k} = t)$; components are sampled independently. Each session yields an outcome $Y_i \in \mathbb{R}$, representing one of our signals from the previous sections. (We compute the per-signal leaderboards separately, then average them at the end.)

Our target of estimation is the treatment effect of each component selection with respect to a fixed baseline distribution $Q$, with analogous probabilities $q_{i,k}(t) = \mathbb{P}_{T_i \sim Q}(T_{i,k} = t)$. Typically, we take $Q$ to be a uniform distribution over components. Formally, the treatment effect of the $t$-th choice of the $k$-th component is defined as the expected difference in outcomes $Y_i$ under treatment and control:

$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i(T_{i,k} = t) - Y_i\bigr].$$

Here, $Y_i(T_{i,k} = t)$ denotes the "potential outcome" when we intervene on the $k$-th component and set it to $t$.

Given that we sample the components independently, using standard identification results from causal inference we can rewrite the treatment effect as:

$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i \,\big|\, T_{i,k} = t\bigr] - \mathbb{E}_{T_i \sim Q}\bigl[Y_i\bigr].$$
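One way to see why this identification holds is sketched here. It relies on two assumptions that are standard but not spelled out in the text: consistency (the observed outcome equals the potential outcome of the configuration actually sampled) and randomization ($T_i$ is drawn independently of the potential outcomes). The notation $T_{i,-k}$ for the components other than $k$, and $Y_i(t, T_{i,-k})$ for the outcome with component $k$ set to $t$, is introduced here for illustration only.

$$\begin{aligned}
\mathbb{E}_{T_i \sim Q}\bigl[Y_i \,\big|\, T_{i,k} = t\bigr]
&= \mathbb{E}_{T_i \sim Q}\bigl[Y_i(t, T_{i,-k}) \,\big|\, T_{i,k} = t\bigr] && \text{(consistency)} \\
&= \mathbb{E}_{T_i \sim Q}\bigl[Y_i(t, T_{i,-k})\bigr] && \text{(independent components, randomization)} \\
&= \mathbb{E}_{T_i \sim Q}\bigl[Y_i(T_{i,k} = t)\bigr].
\end{aligned}$$

Subtracting $\mathbb{E}_{T_i \sim Q}[Y_i]$ from both sides gives the difference of a conditional and an unconditional mean.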

We estimate this quantity using the self-normalized estimator:

$$\hat\tau_{k \to t} = \frac{\sum_{i:\, T_{i,k} = t} w_i Y_i}{\sum_{i:\, T_{i,k} = t} w_i} - \frac{\sum_i w_i Y_i}{\sum_i w_i},$$

where

$$w_i = \prod_{k=1}^K \frac{q_k(T_{i,k})}{p_{i,k}(T_{i,k})}.$$

$\hat\tau_{k \to t}$ is asymptotically normal under standard CLT conditions for self-normalized estimators. We report 95% confidence intervals $\hat\tau_{k \to t} \pm 1.96\,\widehat{\mathrm{SE}}$ alongside every estimate.
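The estimator and its interval can be sketched in a few lines of plain Python. The function names and the toy data are hypothetical, and the standard error uses a delta-method (influence-function) approximation, which is an assumption: the post does not say how $\widehat{\mathrm{SE}}$ is computed.

```python
import math
from typing import Hashable, Mapping, Sequence, Tuple


def importance_weight(
    config: Sequence[Hashable],
    p_probs: Sequence[Mapping[Hashable, float]],
    q_probs: Sequence[Mapping[Hashable, float]],
) -> float:
    """w_i = prod_k q_k(T_ik) / p_ik(T_ik) for one session."""
    w = 1.0
    for k, choice in enumerate(config):
        w *= q_probs[k][choice] / p_probs[k][choice]
    return w


def self_normalized_effect(
    treatments: Sequence[Hashable],
    outcomes: Sequence[float],
    weights: Sequence[float],
    t: Hashable,
) -> Tuple[float, float]:
    """Return (tau_hat, standard_error) for selection t of one component."""
    in_arm = [1.0 if ti == t else 0.0 for ti in treatments]
    sum_w = sum(weights)
    sum_w_arm = sum(w * a for w, a in zip(weights, in_arm))
    if sum_w_arm == 0:
        raise ValueError("no sessions observed with this selection")
    mean_all = sum(w * y for w, y in zip(weights, outcomes)) / sum_w
    mean_arm = sum(w * a * y for w, a, y in zip(weights, in_arm, outcomes)) / sum_w_arm
    tau_hat = mean_arm - mean_all

    # Delta-method standard error (assumed choice, see lead-in): linearize
    # the difference of the two weighted ratios around their estimates.
    n = len(outcomes)
    influence = [
        w * a * (y - mean_arm) / (sum_w_arm / n) - w * (y - mean_all) / (sum_w / n)
        for w, a, y in zip(weights, in_arm, outcomes)
    ]
    variance = sum(v * v for v in influence) / (n * n)
    return tau_hat, math.sqrt(variance)


# Toy data with K = 1 and uniform P and Q, so every weight is 1.
treatments = ["a", "a", "b", "b"]
outcomes = [1.0, 0.0, 1.0, 1.0]
uniform = [{"a": 0.5, "b": 0.5}]
weights = [importance_weight((t,), uniform, uniform) for t in treatments]
tau_hat, se = self_normalized_effect(treatments, outcomes, weights, "a")
low, high = tau_hat - 1.96 * se, tau_hat + 1.96 * se
print(f"tau_hat = {tau_hat:+.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```

With only four sessions the interval is very wide; the normal approximation is meaningful only at the sample sizes the asymptotic argument assumes.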

To address distribution shift, such as the shift arising from new models entering the Arena, we use additional time-decaying weights to place more emphasis on the most recent data points. That way, the leaderboard always reflects the current strengths and weaknesses of agents.
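The post does not give the form of the time decay. As one plausible illustration, the sketch below multiplies each session weight by an exponential factor with a configurable half-life; the function name, the exponential form, and the 7-day default are all assumptions.

```python
from typing import List, Sequence


def time_decayed_weights(
    weights: Sequence[float],
    ages_days: Sequence[float],
    half_life_days: float = 7.0,
) -> List[float]:
    """Scale each weight by 0.5 ** (age / half_life).

    A session observed half_life_days ago counts half as much as one
    observed now. The exponential form is an assumed choice.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return [w * 0.5 ** (age / half_life_days) for w, age in zip(weights, ages_days)]


# Sessions observed now, one half-life ago, and two half-lives ago.
print(time_decayed_weights([1.0, 1.0, 2.0], [0.0, 7.0, 14.0]))
```

Because the estimator is self-normalized, only the relative size of the decayed weights matters, so the decay needs no separate normalization step.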

The current leaderboards evaluate orchestrators and no other components, so in the production setting we currently have $K = 1$.
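For the final step, where per-signal leaderboards are averaged into the headline score, a minimal sketch follows. It assumes an unweighted mean and that every signal has already been oriented and scaled so that higher is better (bash recovery, for example, would need its sign flipped first); the function name and the numbers are hypothetical.

```python
from typing import Dict


def headline_scores(per_signal: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each model's per-signal tau_hat (unweighted mean).

    per_signal maps a signal name to {model: tau_hat}. Only models
    scored on every signal are included in the result.
    """
    if not per_signal:
        return {}
    models = set.intersection(*(set(scores) for scores in per_signal.values()))
    return {
        model: sum(scores[model] for scores in per_signal.values()) / len(per_signal)
        for model in sorted(models)
    }


# Made-up per-signal effects, purely illustrative.
example_signals = {
    "confirmed_success": {"m1": 0.02, "m2": -0.02},
    "steerability": {"m1": 0.04, "m2": -0.04},
}
print(headline_scores(example_signals))
```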

Citation

@misc{arena2026agentarena,
  title        = {{Agent Arena}: Causal Evaluation of Agents in the Real World},
  author       = {{Arena Team}},
  year         = {2026},
  month        = jun,
  howpublished = {\url{https://arena.ai/blog/agent-arena-methodology}},
  note         = {Arena Blog}
}
