{"slug": "the-open-agent-leaderboard", "title": "The Open Agent Leaderboard", "summary": "The Open Agent Leaderboard is a new open benchmark designed to evaluate the performance and cost of full AI agent systems—including their tools, planning, and error recovery—rather than just the underlying models. It measures \"generality\" by testing agents across six diverse, unfamiliar benchmarks covering coding, customer service, research, and personal assistance, reporting both quality and cost to determine practical deployability. The project is fully open-source, paired with the Exgentic framework for reproducibility, and aims to provide a stronger, more realistic assessment of how well agents work across different real-world settings.", "body_md": "The Open Agent Leaderboard\nHow good are general purpose AI agents? We built an open evaluation framework to find out.\nMost evaluations in AI report a simple result: what score each model got on which benchmarking task. When you deploy an agent, you're not just choosing a model. You're choosing a full system: what tools the agent can use, how it plans its steps, what it remembers between actions, how it recovers when something goes wrong. Change any of those and the same model can produce very different results at very different costs.\nHow well an AI agent works depends on how it's built, not just the model inside it.\nToday we're launching the Open Agent Leaderboard, an open benchmark for comparing full agent systems, not just the models inside them. It reports both quality and cost, so you can see not just what works, but what's worth deploying.\nThe leaderboard is paired with the Exgentic framework for running and reproducing evaluations, and a paper describing the full methodology and results. 
Everything is open from day one.\nCan we measure generality?\nAI agents are becoming genuinely useful when carefully tailored to a specific job, like coding in a familiar repository or handling customer service with a known set of tools. But the harder question is whether the same agent can handle many different jobs, each with its own tools, rules, and constraints, without being manually customized for each one.\nA more general agent is one you can drop into a new setting and have it just work.\nThat's what we mean by generality, and it's best understood as a spectrum, not a binary label. Of course, generality that only works in theory isn't useful. What matters is whether an agent stays capable as the range of jobs and settings grows, and whether it does so at a reasonable cost. A system that handles everything but costs a fortune to run isn't general in any way that matters.\nThis leaderboard measures exactly that: how general your agent actually is.\nIt evaluates agents across diverse, unfamiliar settings, each with different tools, rules, and constraints, and reports both quality and cost, so you can see not just how well a system performs, but whether it's actually worth deploying. It doesn't cover every capability a general agent will eventually need. But it's a much stronger test of how well agents work across different situations than anything previously available. And by treating the full agent system, not just the model, as the thing being measured, it makes visible what's actually driving the results.\nWhat we built\nWe assembled six benchmarks, each testing a different kind of realistic task. 
Together they aim to capture a broad range of working settings: coding, customer service, technical support, personal assistance, and research.\n- SWE-Bench Verified -- fixing real bugs in real code repositories\n- BrowseComp+ -- researching complex questions across the web\n- AppWorld -- completing personal tasks across hundreds of apps and actions\n- tau2-Bench Airline & Retail -- customer service following company policies\n- tau2-Bench Telecom -- technical support following company policies\nEach is an established benchmark, created and reviewed by the research community. They weren't chosen because any single one captures general agency. They were chosen because together they test very different things: real code changes, open-ended research, broad action spaces, rule-bound conversations. That mix is what makes the evaluation meaningful.\nThese benchmarks were each designed to test one kind of task in one kind of way. Making them work together meant giving them a shared structure. We introduced a unified protocol that gives every benchmark the same shape: a task (what to do), a context (what to know), and a set of actions (what's allowed).\nInstead of each agent speaking each benchmark's language, they all speak one.\nThis standardization isn't trivial. Each benchmark comes with its own assumptions, instructions, and interaction patterns. Making sure these don't clash with how different agents work internally requires deep understanding of both sides. It's one of the reasons this work took time, and one of the reasons results may differ from what you see on individual benchmark leaderboards. But the payoff is real: the benchmarks keep their original design, the agents keep their native tools and interfaces, and the protocol gives them a common way to connect.\nHow to read the leaderboard\nEach row is a full agent system: a specific agent paired with a specific model, evaluated across all six benchmarks. 
For every configuration, you see the average success rate, the average cost per task, and per-benchmark breakdowns.\nHere's what the current top five looks like:\nLook at the top three. All use the same model. Yet they differ in both score and cost because the agent systems wrapped around that model are different.\nSame model, different agents, different results -- the agent matters.\nThe cost gap is just as striking. The most efficient configuration in the top five runs at a fraction of the price of the strongest one. The full picture becomes clear when you plot every configuration by quality and cost:\nWhen the agent implementation is visible alongside the model, you can start to untangle what's driving the results: which gains came from the model, which from the agent design, and which components generalize across settings. That's what this leaderboard is built to show.\nA note on results: agents here are tested as general-purpose systems without benchmark-specific tuning, and without the prompt and environment optimizations that model developers often apply to individual benchmarks. So scores may differ from those reported on individual benchmark leaderboards. See the paper for details.\nWhat we're already learning\nOne finding surprised us: general-purpose agents are already competitive with specialized ones. In several cases, agents with no benchmark-specific tuning matched systems built directly for those tasks.\nAcross most benchmarks, general agents match or even outperform the best specialized systems. A single agent can increasingly handle many kinds of work, not just the one environment it was prepared for.\nThe results also reveal something you can't see from success rates alone: agents differ dramatically in how they fail. Some fail fast and cheap. Others burn through long, expensive runs before giving up. In our experiments, failed runs cost 20--54% more than successful ones. 
For anyone running agents in production, failure behavior shapes your bill just as much as success does.\nPerhaps the most important finding is about what drives the results. Model choice is still the dominant factor. But agent architecture is already making a visible difference. Tool shortlisting, helping the agent focus on relevant tools instead of searching through everything, improved performance across every model we tested and turned otherwise failing configurations into viable ones.\nToday the model explains most of the results. But the agent around it is already starting to change the outcome.\nThe full methodology and empirical analysis are described in our paper on general agent evaluation.\nWhat's public today\nEverything behind this leaderboard is open. Today we're releasing:\n- The Open Agent Leaderboard -- explore the results directly\n- Exgentic -- run and reproduce evaluations yourself\n- The paper -- full methodology and empirical analysis\nWe built this for the community. Explore, submit your own results, and help us make agent evaluation more open and more useful for everyone.\nWhat we want from the community\nGeneral agents are too important to be evaluated behind closed doors.\nGeneral agents are modular systems: planning, memory, tool use, context management, error recovery. The results above show that these components make real tradeoffs across cost, reliability, and performance. If one component is doing the heavy lifting, the community should be able to see that.\nWe built Exgentic to make this kind of open evaluation practical: an open platform that orchestrates cross-environment benchmark sessions and produces standardized results, trajectories, and cost reports. But we can't build this alone.\nAgent developers can open up their systems by versioning changes, documenting what's inside, and making components configurable. Benchmark creators can help expand the range of settings we evaluate against. 
And anyone can reproduce our results, challenge them, and find what we missed.\nNot all of this is easy yet. Most benchmarks weren't designed with general-purpose agents in mind and require careful adaptation. This is an evolving project, and feedback on what needs to be easier is just as welcome as a finished contribution.\nWhat's next\nSince launch we've added two open-weight models, DeepSeek V3.2 and Kimi K2.5, bringing the leaderboard to five models across five agents and six benchmarks. The open-weight results tell a clear story: competitive on specific combinations, but trailing frontier closed-source models by 18--29 percentage points on average. Read more in our open-weight deep-dive.\nThe leaderboard is only as useful as the community that feeds it. We're looking for contributions across three axes: new agents (wrap your agent in the Exgentic protocol and submit results), new benchmarks (any task suite with a programmatic evaluator can be integrated), and new models (especially open-weight models we haven't covered yet). Submit results by opening a PR on the results dataset.\nClosing\nGeneral-purpose agents deserve evaluation that reflects what's actually being measured: the full system, not just the model.\nThe Open Agent Leaderboard is a starting point. We believe it can become something bigger: a shared standard for how the community evaluates, compares, and improves open agent systems.\nExplore the leaderboard. Read the paper. Try Exgentic. And if this direction resonates, help us build it.\nGeneral agents are reshaping the way work is done. Let's research and discuss them openly.\nRelated reading\n- General Agent Evaluation -- ICLR 2026 Workshop Paper\n- Ready For General Agents? Let's test it. 
-- ICLR 2026 Blog Post\n- Position: Agentic Systems Should be General -- ICLR 2026 Workshop Paper", "url": "https://wpnews.pro/news/the-open-agent-leaderboard", "canonical_source": "https://huggingface.co/blog/ibm-research/open-agent-leaderboard", "published_at": "2026-05-18 14:12:58+00:00", "updated_at": "2026-05-19 21:56:31.696007+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "open-source", "developer-tools"], "entities": ["Open Agent Leaderboard", "Exgentic"], "alternates": {"html": "https://wpnews.pro/news/the-open-agent-leaderboard", "markdown": "https://wpnews.pro/news/the-open-agent-leaderboard.md", "text": "https://wpnews.pro/news/the-open-agent-leaderboard.txt", "jsonld": "https://wpnews.pro/news/the-open-agent-leaderboard.jsonld"}}