{"slug": "is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling", "title": "Is it agentic enough? Benchmarking open models on your own tooling", "summary": "Hugging Face researchers introduced a new benchmarking methodology for open-source coding agents, measuring not just accuracy but also efficiency across different models and library revisions. Using the transformers library as a case study, they found that agent-optimized tooling can reduce token usage by up to 6×, highlighting the importance of designing APIs and documentation for agentic use.", "body_md": "# Is it agentic enough? Benchmarking open models on your own tooling\n\n*Benchmarking transformers revisions across different metrics*\n\nThis is a human-made, agent-focused blogpost.\n\nCoding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch. This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us developers, but they now also send the agent down a longer, more expensive path.\n\nMost benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got\nit right, but how much work it took to get there, and how that shifts across models, library revisions, and\ntasks. 
We measured exactly that, using `transformers` as our case study.\n\nHere, we will introduce a tool-specific benchmark focusing on how the answer was found, and provide a simple\nimplementation of one such harness, running entirely on open models driven by the\n[pi](https://www.npmjs.com/package/@mariozechner/pi-coding-agent) coding agent, with the full sweep of\nmodels × revisions × tasks fanned out across [Hugging Face Jobs](https://huggingface.co/docs/huggingface_hub/guides/jobs)\nso every run sees identical hardware.\n\nBut, *how do you optimize software for agents?*\n\nWe're strong believers in the following two software principles:\n\n- If it isn't tested, then it doesn't work\n- If it isn't documented, then it doesn't exist\n\nBoth still hold in the realm of agent-optimized tooling, and, for once, the two are directly tied to each other.\n\nIf you want your tool to exist for an agent, it needs to be discoverable. The API needs to be clear and the docs need to be extensive. They need to be structured in a way that gives the agent rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic use.\n\n## Testing software for agentic-use\n\nWe'll use `transformers` as an example throughout this blogpost: agents *using* it to solve ML tasks (classifying\ntext, captioning images, transcribing audio), not contributing code to it; though the harness was designed to work\nwith any tool that can be operated from the command line.\n\nOur intuition on `transformers` was that usage could be dramatically simplified\nwith a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe\nrecently applied to the [hf CLI, redesigned to be agent-optimized](https://huggingface.co/blog/hf-cli-for-agents),\nwhere agents used 1.3–1.8× (and up to 6×) fewer tokens. 
We wanted to know whether that kind of win generalizes, and\nwhether it could be useful for `transformers` as well.\n\nIntuition is a powerful tool, but we wanted more evidence before we opened PRs that add several thousand lines of code to such a widely used codebase as `transformers`. We set out to measure what success looks like.\n\n### Not all successes are equal\n\nTwo agents can both produce the correct label for a sentiment-classification task, but one:\n\n- writes a 40-line Python script, imports `transformers`, debugs a shape error, re-runs twice, and finally prints the answer;\n\nwhile the other\n\n- types `transformers classify --model ... --text \"...\"` and is done in one call.\n\nBoth reach `POSITIVE (0.9999)`, and here are the two paths an agent actually took on this exact task:\n\n```\n# Task: classify the sentiment of \"I absolutely loved the movie, it was fantastic!\"\n\n- # one agent: pipe a script into python and parse the output\n- python - <<'PY'\n- from transformers import AutoTokenizer, AutoModelForSequenceClassification\n- import torch\n- import torch.nn.functional as F\n-\n- model = AutoModelForSequenceClassification.from_pretrained(\"distilbert/distilbert-base-uncased-finetuned-sst-2-english\")\n- tokenizer = AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased-finetuned-sst-2-english\")\n- inputs = tokenizer(\"I absolutely loved the movie, it was fantastic!\", return_tensors=\"pt\")\n- with torch.no_grad():\n-     logits = model(**inputs).logits\n- probs = F.softmax(logits, dim=1)\n- idx = torch.argmax(probs, dim=1).item()\n- print(model.config.id2label[idx], probs[0][idx].item())\n- PY\n\n+ # the other agent: one command\n+ transformers classify \\\n+   --model distilbert/distilbert-base-uncased-finetuned-sst-2-english \\\n+   --text \"I absolutely loved the movie, it was fantastic!\"\n```\n\nBoth methods reach the same result. 
But they have very different profiles in **cost, latency, token usage, and failures**.\n\nIf your evaluation only checks the final string, you're blind to these, and to whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents.\n\nOur goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library improve performance.\n\n### How do we run evaluations?\n\nA few words on how we'll evaluate agents here.\n\nWe run every task under three variants (or \"tiers\"), three different ways an agent can come at `transformers`:\n\n```\nbare     pip install transformers, and nothing else\nclone    the full transformers source, checked out in the working directory\nskill    a packaged Skill: the CLI's docs + task examples, loaded in context\n```\n\nThese aren't nested: `skill` doesn't contain `clone` (it ships curated docs, not the source tree), and neither\nstrictly contains the other: each gives the agent a different kind of help. As we'll see, a model can sometimes\ndo better on `clone` than on `skill`.\n\nA few more choices:\n\n- For now we only focus on deterministic tasks which can provide an exact match, as they provide a very nice ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks.\n- Every run is its own Hugging Face Job: one per (model × revision × task), so the whole sweep runs in parallel on identical hardware, which keeps the comparison fair at scale.\n- Results and traces land in a Hugging Face Bucket: fast, no versioning needed, and able to handle very high write concurrency.\n\n### Which models to benchmark against?\n\nNot all models driving agents are equal, and their differences change what you should look at when running them.\n\n*Large open models*\n\nAt one end, you have the largest, most capable open models. 
On reasonably common tasks, these should get the right answer, eventually. For them, task completion saturates near 100% and stops telling you much about your tool; a more relevant benchmark is the effort it took the agent to get there: how many turns, tokens and seconds it took, and whether it walked a clean path or used deprecated APIs.\n\n*Local*\n\nLocal models vary widely in size, and so do their abilities.\nMetrics such as **\"match %\"** are more relevant than for their larger counterparts,\nas you can see how model sizes/capabilities affect results on your specific\ntool.\n\nThis harness not only provides guidance to library maintainers on how to improve a repository for agent interactions, it also helps assess how different agents and models perform on the tasks users care about.\n\nThe harness scores every run on several axes, so that you can ask what actually matters for each class of model:\n\n- **match %**: did the final answer contain the expected result (per-task, case-insensitive substring / regex / exact, all explicit in the report);\n- **median time** and **median tokens** (new vs. cached vs. generated);\n- **runs with error %**: including a guard that flags runs which produced *nothing* (0 output tokens, no tool calls, no answer) so silent failures don't masquerade as \"0\";\n- **marker adoption**: tool-defined behavior markers, explained below.\n\nAll of it lands in a report you can directly examine:\n\n*The live report: Overview, Coverage, and Results, all client-side.*\n\nAnd because it captures the native agent trace of every run, numbers are just the beginning: you\ncan read exactly what the agent did, command by command. The traces are shareable\nthrough the Hub's [agent-traces viewer](https://huggingface.co/docs/hub/agent-traces):\n\n*A run rendered in the Hub's agent-traces viewer: MiniMax-M2.7 on the answer-question task.*\n\nBefore the results, a quick recap of the setup. 
Each run varies four things: the **model** driving the agent,\nthe **`transformers` revision** it runs against, the **task**, and the **tier** (`bare` / `clone` / `skill`).\nAs discussed, we look at different metrics for the two model categories.\n\n### Large open models: hold the model, vary the revision\n\nSince a large open model will usually get to the correct result, what you're really measuring is the effort it took to do so. Did it take ten turns or one? Did it follow an API path you deprecated because it trusted obsolete documentation? Did it hit an error you hadn't foreseen?\n\nThe natural experiment is to fix one strong model and vary the tool's\nrevisions: the successive git versions of `transformers` we test against, from released tags like\n`v5.8.0` and `v5.9.0` to the specific commit that introduces the CLI and Skill. We want to watch whether the load\nit puts on the agent goes up or down. We used the harness on `transformers` to check\nwhether adding a dedicated CLI and Skill actually lightened the agents' work.\n\nFor the three large models we tested, the time spent across all tasks drops at the Skill commit:\n\n*Median time per revision, by tier: the skill commit (green dot) is the fastest.*\n\nOn the other hand, on the `clone` variant, token consumption rises significantly at the commit that introduced the CLI and examples:\n\n*Median new tokens per revision, by tier: the clone variant jumps once the CLI lands in the repo.*\n\nReading the clone-variant traces explains why. 
The commit adds a command, but it also ships the\nCLI's implementation and a set of `cli/agentic/*.py` usage examples into the repository directly.\n\nOn the `clone` variant the agent has a full `transformers` checkout in front of it, and roughly a third of the runs go read the new\nsurface (the `/cli/` tree and the example scripts) to learn the interface before calling it. This raises the\nmedian input from ~4k to ~6.4k tokens.\n\nThe two charts are then two sides of one tradeoff: the commit buys the large models less time (they reach for the CLI instead of debugging Python) at the cost of more tokens (they read the code that taught them the CLI). A tradeoff worth knowing about before merging PRs.\n\nOne caveat, not benchmarked yet, works in the CLI's favor: in real usage, the cost of reading it is paid once rather than on every task. Our setup is built for one-off experiments. Each run is a fresh agent that rediscovers the CLI from scratch, so it pays the discovery cost every time. In real usage an agent learns the interface once and then solves task after task within the same session, amortizing that cost across many requests. The token bump we measure here is closer to a worst case than to what a user would see day to day.\n\n### Small models: hold the revision, vary the model\n\nOpen models give us fine-grained control over the variables that matter most here: size, configuration, quantization,\nprovider, training, and anything that would differ from one model to the next. They're also where a good tool surface\nmatters most: a small model asked to \"use `transformers` to do X\" on a `bare` environment can\nguess an API that changed some releases ago, make unnecessary tool calls, and\nget the wrong answer.\n\nSo here the experiment is the opposite of the above: hold the revision and sweep the model. 
This shows which models can actually complete the task, not just by token count and time, but down to which ones can't reliably handle the tool calls. Our intuition is that the smaller the model, the harder both tool use and the task get; we ran the harness across a range of model sizes to test exactly that:\n\n*Match % across models, by tier: the skill tier lifts the larger models but drops the smaller ones.*\n\nThis also seems to correlate with the number of tokens ingested:\n\n*Median new tokens across models, by tier.*\n\nA note on fair comparison: naively averaging across tasks is misleading when coverage is uneven (a model that only finished the quick tasks looks fast). The report has a **\"shared tasks only\"** toggle (across models and/or revisions) so you compare like-for-like, and a **Coverage** heatmap so you can see exactly which task × revision × model cells actually ran.\n\n## Tweaking the tool: markers and results\n\nTwo things come together here: how to look past whether the agent succeeded to what it did and how it did it, and the first results we pulled out of the harness.\n\n### What's a marker?\n\nMatch %, tokens, and time tell you the cost of a run but don't tell you much about what happened under the hood.\n\nThis is why we've introduced the concept of markers. A marker is a named pattern that the profile (the small per-tool plugin that teaches the harness how to build and drive a given library) matches against a run.\n\nIt is a one-line label for a behavior you care about, checked against the shell commands the agent ran, the code it wrote, the files it read, or its final answer. 
A run can fire several markers or none; the report shows how often each one fired, per model and per revision.\n\nFor `transformers` we declare a handful, but we'll only look at the two most relevant ones:\n\n- `cli`: the agent invoked the `transformers` command-line tool (e.g. `transformers classify …`) instead of writing Python.\n- `pipeline`: it reached for the high-level `pipeline(...)` Python API.\n\nThese are what we watch to see whether a change actually shifted the agent's behavior. Interestingly, the larger the model, the more it relies on the new context rather than its memory, and therefore the more it uses the newly introduced CLI.\n\n*CLI adoption by tier across models: only the skill tier reaches for it, and more so as models grow.*\n\nCLI adoption is new: the CLI lands in a single commit, isn't in any model's training data, and is only lightly documented. The effect is clear: it's the Skill variant, the one that ships the CLI's documentation, that actually reaches for it, at 55.3%.\n\n### Is the CLI + Skill commit helping?\n\nComparing the commit across model sizes, the CLI + Skill helps the bigger models: on the `skill` tier, Kimi and the other large agents reach for the CLI and finish in fewer turns. (On `clone` they spend *more* input tokens first, reading the new CLI code, as we saw above, so the win shows up in time and turns, not raw tokens.)\n\n*Kimi-K2.6, GLM-5.1, and MiniMax-M2.7 across revisions*\n\nBut in some smaller-model settings, it appears to hurt performance. One plausible explanation is that small\nmodels lean on memorized API patterns, reproducing `pipeline(...)` snippets\nthey've seen in their training data. The new concepts are then a larger\nsurface for them to get wrong. You can watch this directly in the harness: lower\nmatch %, more retries, the `cli` marker barely firing. 
It is particularly striking on the Qwen3-4B model:\nthe Skill barely changes its match rate, yet its cost distribution is significantly affected.\n\nAlmost all of that comes from the `clone` tier. The checkout now contains\nthe CLI's implementation and `cli/agentic/*.py` examples, and the 4B agent reads them in bulk: its median new\ntokens jump from ~2.4k to ~23k, with time and output skyrocketing as well, for no gain in\naccuracy.\n\n*Qwen3-4B across revisions. The CLI + Skill commit fans the cost distribution wide open: on the clone tier the agent reads the newly-shipped CLI source in bulk (~10× the new tokens), for no gain in match %. (Repeat tokens stay flat: this setup uses no prompt caching.)*\n\nSometimes, though, the Skill breaks correctness outright. The traces show how. Take Qwen3-14B:\nadding the Skill drops its overall match rate from 67% (bare) to 43%, and on the simplest tasks the collapse is very\nvisible: `classify-sentiment` goes from 100% on the `clone` variant to **0%** with the Skill.\n\n*Qwen3-14B on classify-sentiment, by tier: clone (blue) holds at 100% across revisions, but the Skill variant (green) collapses to 0% at the CLI + Skill revision.*\n\nLooking at the traces, the model mistakes the CLI for a *tool it can call directly* (as in an agentic-harness\ntool, like web-search). The Skill is **not** an executable tool: it's documentation loaded\ninto the agent's context, and the `transformers` CLI is only ever meant to be run from the shell (via `bash`), so this\nwill not work.\n\nQwen3-14B reads the Skill and, in 39 of its 56 Skill runs, either emits a `transformers(command=\"classify\", ...)` tool call (a tool that was never registered) or, finding nothing like it among its `read`/`bash`/`edit`/`write` tools, concludes it *can't* run a model and gives up. 
Either way, rather than fall back to the one-line\n`pipeline(...)` that scored 100% on the `clone` checkout, it declares the task impossible.\n\n*Qwen3-14B on classify-sentiment (Skill variant): it reasons that read/bash/edit/write can't run a model, and gives up.*\n\nThis is exactly what we built the harness to catch: the same change that speeds up the large models\nends up breaking the small ones, an outcome that seemed counterintuitive to us at first, and a change we'd likely have\nshipped as-is. The takeaway for maintainers: **agent-facing APIs should be evaluated across model sizes, because a\nnew affordance can reduce work for strong models while adding ambiguity for smaller ones.** It also hints at a fix:\nrather than hand-write a Skill and check it after the fact, you could generate and validate one against the weaker\nmodels up front.\n\nThis is exactly what [Upskill](https://huggingface.co/blog/upskill) does: it turns a strong model's solution into\na Skill only when it measurably helps the smaller ones.\n\n## Trying it yourself\n\nThe harness is one CLI, `agent-eval`. Install it, run a suite, fan it out across models × revisions on HF Jobs, and publish the\nreport as a Hugging Face Space.\n\n**Trusted local use only.** The harness runs a coding agent with bypassed permissions and executes code from whatever revision you point it at, and traces can contain prompts, output, and local paths. See [SECURITY.md] before pointing it at code you didn't write or sharing results.\n\nThe full, kept-current setup and usage instructions live in the [README](https://github.com/huggingface/is-it-agentic-enough).\n\n## Closing\n\nChecking the final answer tells you whether an agent *can* use your library. It\ndoesn't tell you what it costs: the turns, tokens, errors, and the path it took to\nget there. 
This harness measures that, across the revisions and models you pick.\n\nOn `transformers`, it caught something we'd have shipped on faith: the CLI + Skill\nhelps the largest open models and hurts some of the smaller ones. Worth knowing before merging!\n\nIt's profile-based, and designed to be adaptable: point it at your own library, define\na few tasks and their expected answers, get the same report. Code and tasks are in\nthe [repo](https://github.com/huggingface/is-it-agentic-enough), traces are on the Hub.\nLet us know if you use it for your project!\n\n## Acknowledgements\n\nThis harness stands entirely on\n[pi](https://www.npmjs.com/package/@mariozechner/pi-coding-agent), Mario\nZechner's coding-agent CLI: it drives every open-model run, and only needs an\n`HF_TOKEN` to serve a model, which is what made the open-model sweep practical at\nall.\n\nThanks to the model builders and inference providers behind the models we\nswept. Across the board they performed well above what the `bare` baseline would\nsuggest.", "url": "https://wpnews.pro/news/is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling", "canonical_source": "https://huggingface.co/blog/is-it-agentic-enough", "published_at": "2026-06-18 00:00:00+00:00", "updated_at": "2026-06-18 13:21:25.038311+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "machine-learning", "large-language-models", "ai-research"], "entities": ["Hugging Face", "transformers", "pi coding agent", "Hugging Face Jobs", "hf CLI"], "alternates": {"html": "https://wpnews.pro/news/is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling", "markdown": "https://wpnews.pro/news/is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling.md", "text": "https://wpnews.pro/news/is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling.txt", "jsonld": "https://wpnews.pro/news/is-it-agentic-enough-benchmarking-open-models-on-your-own-tooling.jsonld"}}