Benchmarks Mean Business

Arena, an AI evaluation platform born at UC Berkeley, reached a $100M annual revenue run rate eight months after launching its product, as demand surges for benchmarks that measure real-world AI utility. The company now evaluates long-running agents on complex tasks, while a study reveals that undisclosed private testing practices by some AI labs bias leaderboard scores, highlighting risks of over-reliance on single benchmarks.

The basic job of an eval is to let you judge how good your model is on a task. If enough people use the same eval, we can use it to benchmark the relative performance of multiple models on a level playing field. All good, no drama.

But building good benchmarks is hard. ImageNet was a tremendous effort by Fei-Fei Li, her team, and a whole lot of grad students to produce a massive (for the time) labelled image set. It was incredibly effective, though: it created a Schelling point that drew attention from so many different researchers that it fundamentally advanced computer vision and, thanks to AlexNet, pretty much made deep learning cool.

The benefit for folks creating a widely adopted benchmark was twofold: everyone who uses it cites you, and that is good for your h-index. But, more aspirationally, you get to shape where the field goes. GLUE/SuperGLUE helped do that for language modeling, and SWE-bench did it for coding.

“Arena reached a $100M annual revenue run rate just 8 months after launching our evaluation product. We started as a research project at UC Berkeley with a simple mission: measure AI progress through real-world use. As AI shifts from chatbots to agents taking on longer, higher-stakes work, the problem matters more than ever. Today, Arena measures real-world AI utility with a community of tens of millions. With Agent Arena, we’re evaluating long-running agents on complex, real-world tasks – how they use tools, adapt to feedback, recover from errors, and accomplish goals set by humans.”

There isn’t just money in running the evals, either. Being SOTA on a particular benchmark can be a headline claim for labs pitching their new models. While Arena now covers long-running coding tasks, they became famous for their blind bake-off, ChatbotArena. For a while, topping that was worth real money to the labs: in adoption, in VC dollars, and in the ability to recruit top talent.

So, maybe, there might have been a tiny bit of gaming the system (https://arxiv.org/abs/2504.20879), though Arena explicitly refute this. [2] From the paper:

“We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results.”

The labs actively want to hill-climb on the metrics they report, which usually means tweaking and testing on one subset while holding another back for uncontaminated validation. An evaluation like ChatbotArena doesn’t work like that, which makes it a good benchmark, but it does mean that you want as many samples as you can get to check whether you are going in the right direction. And it would be nice not to show the bad ones.

“the over-reliance on a single leaderboard creates a risk that providers may overfit to the aspects of leaderboard performance, without genuinely advancing the technology in meaningful ways”

Some benchmark providers try to tie themselves more explicitly to different business models. Epoch (http://epoch.ai/) publishes capability research, but they also offer “mission-aligned services to companies, nonprofits, and government bodies, including commissioned research, model evaluations, and consultations”, for folks like the UK Dept of Science & Innovation.

In the finance world there are businesses called rating agencies, and they, unsurprisingly, rate things. Most famously, they rate how reliable a company is at paying back its debt. That sounds purely informational, but it is something more than that. For example, certain investors can only hold debt rated above some threshold, so if a rating agency downgrades the debt, those investors might have to sell it.
The ratings help the market price the debt, but they also, in many ways, define what the market for debt looks like.

Right now, the absolute most valuable attribute a model can have is long-horizon coding capability. [3] And Epoch’s latest benchmark is called MirrorCode (https://epoch.ai/MirrorCode):

“AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s output exactly on end-to-end tests, including held-out tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.”

This may remind you a little of last month’s release of ProgramBench (https://programbench.com/) from many of the folks behind SWE-bench at Meta, Stanford & Harvard:

“In each task, the agent receives an executable and its documentation, and it must re-implement the given executable. It does not get access to any of the executable’s source code, it cannot de-compile the executable, and cannot use the internet. There are 200 tasks in total covering different program complexities, ranging from small terminal utilities like jq and ripgrep to massive software projects like the PHP compiler, FFmpeg, and SQLite.”

Both of these are benchmarks that judge whether an agentic model can build a complex CLI tool from scratch, but they put different constraints on it. ProgramBench is black-box: the model gets the executable and its documentation, but can’t decompile it. It has to reimplement cleanly, and match a hidden test suite generated by fuzzing the original binary. The tasks range from small tools up to giant libraries, and a task only counts as “done” if 100% of the tests pass within a 6-hour time limit. On release, no models cleared that bar. [4]

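
To make that all-or-nothing bar concrete, here is a minimal Python sketch of the scoring rule as described above. It is not ProgramBench’s actual harness: `CaseResult`, `task_resolved`, and `pass_rate` are hypothetical names, and the sketch assumes the 6-hour limit applies to the agent’s wall-clock run.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One hidden test: output of the original binary vs. the re-implementation."""
    name: str
    expected: bytes  # captured from the original executable (assumed available)
    actual: bytes    # produced by the agent's re-implementation


def pass_rate(cases: list[CaseResult]) -> float:
    """Fraction of tests whose output matches exactly, byte for byte."""
    if not cases:
        return 0.0
    return sum(c.expected == c.actual for c in cases) / len(cases)


def task_resolved(cases: list[CaseResult], elapsed_hours: float,
                  limit_hours: float = 6.0) -> bool:
    """All-or-nothing: every test must match, and the run must fit the time limit."""
    if not cases or elapsed_hours > limit_hours:
        return False
    return all(c.expected == c.actual for c in cases)


if __name__ == "__main__":
    # 199 exact matches and one output that differs only by a trailing newline.
    cases = [CaseResult(f"t{i}", b"ok\n", b"ok\n") for i in range(199)]
    cases.append(CaseResult("t199", b"ok\n", b"ok"))
    print(f"pass rate: {pass_rate(cases):.1%}")
    print(f"resolved:  {task_resolved(cases, elapsed_hours=2.0)}")
```

Under this rule, a run that matches 199 of 200 outputs has a 99.5% pass rate and still does not count as done, so a model can be close on almost every test while the headline number stays at zero.
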
MirrorCode, on the other hand, adds a detailed specification and a whole bunch of visible tests. There are still some tests held out, so agents can’t just replicate the expected test outputs. Given the extra context, and without a time limit, some of the models did get to the finish line: Opus 4.7 managed to reimplement a bioinformatics toolkit called gotree in a 14-hour run.

The tasks are similar, but the incentives are a bit different. ProgramBench is trying to establish the frontier: problems that are hard but doable by humans, with a lot of room for models to hill-climb. That’s a valuable thing to have if you are trying to build a frontier model, and especially if you want to compare how well you are doing against other frontier-model labs.

MirrorCode is testing how long models can do useful, correct software engineering work. That is a very valuable thing to know if you happen to be spending a whole bunch of money on tokens to do useful, correct software engineering work, and you want to know where to allocate them.

Benchmarks, and the teams putting them out, have found themselves in a similar position to the ratings agencies. They help evaluate how good a model is, but they also define what good even looks like and, by extension, how a lot of decisions get made.

[1] You may note that Arena report ARR, which is a SaaS-world number based on looking at your subscribers and churn rate. But you don’t pay Arena like that. They are pay-as-you-go, so technically it’s “annualized consumption run rate”. (That’s a new term to me.) This is all very entertaining if you are in the intersection of people who read research papers and S-1s, but for everyone else I’d just note they last raised at $1.7b when their ACRR (?) was less than a third of what it is now. The goose has been valued.

[2] Sharp-eyed readers might note that the headline example is Meta, and I also work there.
But in the spirit of industry solidarity, I will note the paper also called out Google (admittedly, I used to work there) and OpenAI (they are free from my malign influence, too).

[3] The second most important being a good relationship with the United States Secretary of Commerce.

[4] Though some got quite close, and subsequently GPT 5.5, Opus 4.8 and Fable have all completed some tasks. Unrelatedly, one of the fun notes (https://x.com/jyangballin/status/2051677512321970283) from the authors was that the models would often just write the program in Python, regardless of how the original was implemented. Years of arguing about languages on the internet, wasted.