# Benchmarking AI Agents

> Source: <https://pub.towardsai.net/benchmarking-ai-agents-dbf6ae8e5227?source=rss----98111c9905da---4>
> Published: 2026-06-29 11:31:01+00:00

AI agents that generate code, investigate incidents, review pull requests, and orchestrate multi-step workflows are becoming production infrastructure. The question that follows immediately: how do you know the agent is working correctly?

In two of my previous articles — one on [agents that generate deployment infrastructure at runtime](https://medium.com/towards-artificial-intelligence/agentic-inference-deployment-from-prose-skills-to-deployed-endpoints-1efcfdd47807), another on [agents that replace the engineering dependency for scientists](https://medium.com/generative-ai/we-gave-scientists-an-engineer-that-never-sleeps-0eac961a631d) — the most common response was not about the agentic capabilities themselves. It was about measurement. How do you know the generated code is correct? How do you know the agent won’t produce something dangerous? If the runtime code only exists at execution time, how do you do incident response, diffing, compliance? One commenter put it directly ([article](https://medium.com/towards-artificial-intelligence/agentic-inference-deployment-from-prose-skills-to-deployed-endpoints-1efcfdd47807)): “the real risk is turning deploys into non-deterministic prompt archaeology. If the runtime code only exists at deploy time, your incident response, diffing, and compliance story just got way worse. Infra debt doesn’t disappear when you rename it prose.”

This is the right concern. And benchmarks are the answer.

Traditional software answers this with test suites. Deterministic code, deterministic tests. Agents break this model. The same agent, given the same input, may produce different outputs on consecutive runs — not because of a bug, but because the language model interpreted an ambiguous instruction differently. The specification (prose-based skills, SOPs, system prompts) and the implementation (whatever code or action the model generates at runtime) are separate artifacts with a probabilistic relationship between them.

This article covers why agents need a different evaluation approach, how benchmark systems work in practice, the iterative loop that uses benchmark failures to systematically reduce non-determinism, and how both static and dynamic benchmarks address different classes of agent behavior.

Traditional software has no gap between specification and implementation. The code is both. You read it, you know what it does. You diff two versions, you know what changed.

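Agents introduce a separation between the specification (prose skills, SOPs, system prompts) and the implementation (whatever the model generates at runtime).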

The specification says “generate a deployment configuration for a GPU endpoint with private networking.” The implementation is whatever CloudFormation, Terraform, or CDK the model produces in response. That output doesn’t exist until runtime. It can’t be diffed against a previous version. It can’t go through code review. It isn’t deterministic.

This gap creates three problems:

1. **Non-determinism.** The same specification produces different implementations across runs. Most are correct. Some are subtly wrong. You cannot predict which.

2. **Invisible regressions.** A change to one skill may affect behavior in unrelated scenarios. Without systematic evaluation, these go undetected until production impact.

3. **No compliance story.** When the runtime code is generated fresh each invocation, traditional audit trails (code review, version control, change management) don’t apply directly.

Benchmarks address all three — not by eliminating the gap, but by measuring whether the specification reliably produces correct implementations.

An agent benchmark system has three components: datasets, evaluators, and splits.

Following the pattern established by frameworks like LangSmith and AgentBench, a benchmark dataset is a collection of examples. Each example follows the standard evaluation structure:

```
# Each example in a dataset
{
    "inputs": {
        # Variables passed to the agent under test
        "task": "Generate a Kubernetes deployment manifest for a GPU workload",
        "constraints": "Must use private registry, resource limits required"
    },
    "reference_outputs": {
        # Ground truth used only by the evaluator, never shown to the agent
        "must_include": ["resources.limits.nvidia.com/gpu", "imagePullSecrets"],
        "must_exclude": ["hostNetwork: true", "privileged: true"],
        "expected_tools": ["kubectl_validate", "opa_check"]
    },
    "metadata": {
        # Used for filtering, splits, and coverage tracking
        "category": "infrastructure-security",
        "difficulty": "medium",
        "split": "smoke"
    }
}
```

This mirrors how open-source evaluation frameworks structure test cases: inputs go to the system under test, reference outputs go only to the evaluator, and metadata enables organization without affecting evaluation logic.

For agent benchmarks specifically, the reference outputs encode verifiable assertions rather than exact expected text. This is critical because agents produce variable-format outputs — the same correct behavior might be expressed differently across runs.

Examples of what reference outputs encode across domains:

**Infrastructure generation:** Output must restrict network access to known CIDR ranges, must apply least-privilege access policies, must include health checks, and must never use wildcard resource permissions.

**Ops investigation:** Agent must acquire read-only credentials before cross-account queries, must identify the correct service and region, must reference the appropriate log sources, and must not recommend destructive remediation without evidence.

**Code review:** Given a diff where a human reviewer caught a concurrency bug, the agent must identify the same class of issue.

**Search/knowledge:** Given a technical question, the response must include verifiable source references and must delegate to appropriate specialized sub-agents.

The format matters less than the discipline: every example is a falsifiable assertion about agent behavior that can be evaluated automatically.

Evaluators score the agent’s output against the example’s reference outputs. Following the pattern from LangSmith and similar frameworks, evaluators come in three types:

**Code evaluators** — deterministic, rule-based checks. String matching, tool-call verification, pattern detection, compilation checks. Fast, cheap, binary. An evaluator that checks “output must not contain privileged: true” runs in milliseconds and gives an unambiguous result.

**LLM-as-judge** — a separate model scores the output against the reference criteria. The evaluator receives the example (inputs + reference outputs) and the agent’s run (actual outputs + intermediate steps), then returns structured feedback:

```
# Evaluator feedback format (per LangSmith convention)
[
    {
        "key": "correctness",
        "score": 0.8,
        "comment": "Correctly configured resources but missed node affinity"
    },
    {
        "key": "safety",
        "score": 1.0,
        "comment": "No privilege escalation or host access"
    },
    {
        "key": "completeness",
        "score": 0.7,
        "comment": "Missing liveness probe specification"
    }
]
```

Each feedback entry has a key (metric name), a score (numeric) or value (categorical), and an optional comment with reasoning. This structure is standard across evaluation frameworks and enables aggregation, trending, and threshold-based gating.

**Pairwise evaluators** — compare outputs from two agent versions rather than scoring against absolute criteria. Useful when “directly scoring an output is difficult but comparing two outputs is straightforward” (common in generation tasks).
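Of the three evaluator types, the code evaluator is the easiest to sketch. The following is a minimal illustration, assuming the agent output is a plain string and the reference outputs use the `must_include` and `must_exclude` keys from the dataset example. The function name `evaluate_assertions` and the sample manifest are hypothetical and not part of any framework's API.

```python
from typing import Dict, List


def evaluate_assertions(output: str, reference: Dict[str, List[str]]) -> List[dict]:
    """Deterministic substring checks returning feedback entries (key, score, comment)."""
    missing = [s for s in reference.get("must_include", []) if s not in output]
    forbidden = [s for s in reference.get("must_exclude", []) if s in output]
    return [
        {
            "key": "must_include",
            "score": 0.0 if missing else 1.0,
            "comment": f"missing: {missing}" if missing else "all required strings present",
        },
        {
            "key": "must_exclude",
            "score": 0.0 if forbidden else 1.0,
            "comment": f"forbidden: {forbidden}" if forbidden else "no forbidden strings found",
        },
    ]


# Hypothetical agent output for a GPU workload manifest.
manifest = """
spec:
  imagePullSecrets:
    - name: private-registry
  containers:
    - name: trainer
      securityContext:
        privileged: true
"""
reference = {
    "must_include": ["imagePullSecrets"],
    "must_exclude": ["hostNetwork: true", "privileged: true"],
}
feedback = evaluate_assertions(manifest, reference)
```

Substring matching is crude (it ignores YAML structure and whitespace variants), so a production check would more plausibly parse the manifest first. The point here is only that the result is binary and cheap to compute.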

Scoring dimensions and weights are configurable per category. An ops agent weights safety highest — a conservative agent that asks for confirmation before drastic action is preferable to a precise agent that occasionally recommends catastrophic remediations. A code review agent weights correctness highest. A generation agent weights correctness and compliance equally.
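One way to make per-category weighting concrete is a weighted average over the judge's dimensions. This is a sketch under assumed weights: the numbers in `WEIGHTS` are invented for illustration, and only the ordering (safety first for ops, correctness first for code review) comes from the text.

```python
from typing import Dict

# Hypothetical per-category weights; the values are illustrative only.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "ops": {"safety": 0.5, "correctness": 0.3, "completeness": 0.2},
    "code-review": {"correctness": 0.6, "safety": 0.2, "completeness": 0.2},
}


def weighted_score(category: str, scores: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores, normalised by the total weight."""
    weights = WEIGHTS[category]
    total = sum(weights.values())
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total


# Dimension scores taken from the feedback example above.
scores = {"correctness": 0.8, "safety": 1.0, "completeness": 0.7}
ops_score = weighted_score("ops", scores)
review_score = weighted_score("code-review", scores)
```

The same judge feedback yields a higher score under the ops weighting than under the code review weighting, because the output was fully safe but only partly correct.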

Automated evaluators scale linearly: 1,000 examples can run overnight with no human reviewers. They apply the same criteria on every run, unlike human reviewers, whose standards drift. Periodic calibration against human annotation checks that the judge’s scores still match human judgment.

Following standard ML evaluation conventions (and mirroring LangSmith’s dataset management), examples are organized into **splits** — named subsets that serve different evaluation purposes. Unlike traditional ML splits (train/validation/test), agent benchmark splits are organized by evaluation tier:

An example can belong to multiple splits. A critical security scenario might be in both the “smoke” split (run on every PR) and the “full” split (run nightly). Splits are versioned — when examples change or new ones are added, the version increments, enabling CI pipelines to target specific versions.
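Split membership can be sketched as a simple tag filter. This sketch assumes each example carries a list of split names (the dataset example earlier shows a single `split` string, so the list form is an assumption made here to allow multiple membership). The ids and the `DATASET_VERSION` constant are hypothetical.

```python
from typing import Dict, List

# Hypothetical examples; "splits" is a list because an example can belong to several.
EXAMPLES: List[Dict] = [
    {"id": "sec-001", "splits": ["smoke", "full"]},
    {"id": "gen-014", "splits": ["full"]},
    {"id": "ops-007", "splits": ["full"]},
]
DATASET_VERSION = 3  # incremented whenever examples are added or changed


def select_split(examples: List[Dict], split: str) -> List[str]:
    """Return the ids of all examples tagged with the given split."""
    return [ex["id"] for ex in examples if split in ex["splits"]]


smoke_ids = select_split(EXAMPLES, "smoke")  # run on every PR
full_ids = select_split(EXAMPLES, "full")    # run nightly
```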

Typical tier structure:

Static benchmarks test against permanent ground truth. The scenarios, inputs, and expected outputs are fixed artifacts versioned in source control.

- The input is a static artifact (code diff, document, configuration file)
- The expected behavior doesn’t change over time
- Ground truth can be accumulated from historical data (real code reviews, past investigations with known outcomes)

A code review agent can be benchmarked against real merged code reviews. Mine historical reviews where human reviewers identified genuine issues — security vulnerabilities, performance regressions, concurrency bugs, missing error handling, architectural violations. Each review becomes a scenario: the diff is the input, the reviewer’s finding is the expected behavior.

Over time this accumulates into a large suite organized by category (security, performance, logic bugs, concurrency, error handling, architecture, code quality, testing, resource management, API design, dependency management, configuration, observability, data integrity, backwards compatibility, naming and readability). The benchmark answers a concrete question: does the agent find the same issues that experienced human reviewers find?

Each scenario contains the actual diff, the code review (CR) description, and the human reviewer’s comment as the expected output.

When agent skills are structured as explicit rules with priority keywords (must, should, may), the benchmark suite can be validated against rule coverage. A coverage analysis tool compares the rules defined in your skill set against the scenarios that exercise them, producing a coverage percentage and identifying gaps. Setting a coverage threshold (e.g., 80% of all mandatory rules must have at least one exercising scenario) prevents skill drift from outpacing test coverage.
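The coverage check described above reduces to set arithmetic. A minimal sketch, assuming each scenario is tagged with the rule ids it exercises; the rule ids, scenario ids, and the `rules` field are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rule_coverage(mandatory_rules: Set[str], scenarios: List[Dict]) -> Tuple[float, Set[str]]:
    """Return (fraction of mandatory rules with at least one scenario, uncovered rules)."""
    exercised: Set[str] = set()
    for scenario in scenarios:
        exercised.update(scenario.get("rules", []))
    uncovered = mandatory_rules - exercised
    covered_count = len(mandatory_rules) - len(uncovered)
    fraction = covered_count / len(mandatory_rules) if mandatory_rules else 1.0
    return fraction, uncovered


# Hypothetical "must" rules extracted from a skill set.
MUST_RULES = {
    "R1-readonly-creds",
    "R2-no-wildcard-permissions",
    "R3-health-checks",
    "R4-known-cidr-only",
    "R5-no-destructive-fix",
}
SCENARIOS = [
    {"id": "ops-001", "rules": ["R1-readonly-creds", "R5-no-destructive-fix"]},
    {"id": "gen-001", "rules": ["R2-no-wildcard-permissions"]},
]
THRESHOLD = 0.80  # the 80% example threshold from the text

coverage, gaps = rule_coverage(MUST_RULES, SCENARIOS)
meets_threshold = coverage >= THRESHOLD
```

Here three of five mandatory rules are exercised, so the suite falls short of the threshold and `gaps` names the two rules that still need a scenario.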

The same scenario suite can test multiple agent configurations to measure the marginal value of architectural components. Run the full agent (with all sub-systems), then disable one component at a time and re-run. The difference in score quantifies that component’s contribution.

For example, if adding an adversarial validation loop to a code review agent improves the benchmark score by 13 points, that’s a defensible justification for the added latency and cost. If it only improves by 2 points, you might drop it. Without benchmarking across variants, these are arguments. With benchmarking, they’re measurements.
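The ablation comparison is a subtraction per variant. In this sketch the scores are invented so that the differences match the 13-point and 2-point cases mentioned above. The variant names are hypothetical.

```python
from typing import Dict

# Hypothetical benchmark scores (0-100) for each agent configuration.
variant_scores: Dict[str, float] = {
    "full": 84.0,
    "no_adversarial_validation": 71.0,
    "no_retrieval": 82.0,
}


def component_contributions(scores: Dict[str, float], baseline: str = "full") -> Dict[str, float]:
    """Points lost when each component is disabled, relative to the full agent."""
    base = scores[baseline]
    return {name: base - score for name, score in scores.items() if name != baseline}


contributions = component_contributions(variant_scores)
```

Because agent runs are non-deterministic, small differences like the 2-point case are worth re-measuring over several runs before acting on them.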

AgentBench (ICLR’24) established the pattern of evaluating agents in interactive environments — the agent acts within a live system (an OS, a database, a web browser) and is evaluated on whether it achieves the goal, not whether its intermediate steps match a script. This principle extends naturally to production agent evaluation where the “environment” is a live infrastructure with ephemeral state.

Dynamic benchmarks test against ephemeral ground truth. The scenario categories are fixed, but the specific instances are resolved at runtime against current system state.

- The agent queries live systems (logs, metrics, tickets, pipelines)
- Data has retention limits (CloudWatch logs expire, tickets resolve, deployments supersede)
- The correct answer depends on current state, not historical state

A static scenario “investigate alarm X that fired on March 5” becomes unrunnable when:

- CloudWatch logs hit their 14–30 day retention limit
- The associated ticket is resolved and context is stripped
- The deployment under investigation is superseded
- The service topology has changed (new dependencies, removed components)

The same problem applies across domains:

The key idea: the **category** defines what class of problem to test (permanent). At benchmark time, you query live systems to find a recent real instance of that problem, and **craft the test scenario dynamically** from current data. The agent runs against this fresh scenario — with logs, metrics, tickets, and artifacts that are all still accessible and verifiable.

This is not “run old tests against stale data.” This is “find a real current example of this problem class, build a test from it, run the agent, and validate the result against what’s actually in the system right now.”
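The "find a real current example" step can be sketched as a freshness filter over recent records. This assumes tickets are available as dictionaries with a category and an opening timestamp. The field names, ticket ids, and the in-memory list are hypothetical stand-ins for a real ticketing-system query.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def pick_recent_instance(tickets: List[Dict], category: str, now: datetime,
                         max_age: timedelta = timedelta(hours=48)) -> Optional[Dict]:
    """Newest ticket in the category that is still inside the freshness window."""
    fresh = [t for t in tickets
             if t["category"] == category and now - t["opened_at"] <= max_age]
    return max(fresh, key=lambda t: t["opened_at"], default=None)


def craft_scenario(ticket: Dict) -> Dict:
    """Build benchmark inputs from a live ticket; judge criteria stay with the category."""
    return {
        "inputs": {key: ticket[key] for key in ("alarm", "service", "region")},
        "metadata": {"category": ticket["category"], "source_ticket": ticket["id"]},
    }


# Hypothetical ticket records; a real run would fetch these from the ticketing system.
now = datetime(2026, 6, 29, 12, 0, tzinfo=timezone.utc)
tickets = [
    {"id": "T-1", "category": "sev2-latency", "opened_at": now - timedelta(days=45),
     "alarm": "p99-latency", "service": "search", "region": "us-east-1"},
    {"id": "T-2", "category": "sev2-latency", "opened_at": now - timedelta(hours=20),
     "alarm": "p99-latency", "service": "checkout", "region": "eu-west-1"},
    {"id": "T-3", "category": "sev2-errors", "opened_at": now - timedelta(hours=2),
     "alarm": "5xx-rate", "service": "search", "region": "us-east-1"},
]

chosen = pick_recent_instance(tickets, "sev2-latency", now)
scenario = craft_scenario(chosen) if chosen else None
```

The 45-day-old ticket is skipped even though it matches the category, because its supporting data may no longer exist. When no fresh instance is found, the function returns `None` and the category is simply not exercised in that run.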

Here’s how this works across domains:

**Ops Investigation**

Category: Sev-2 latency alarm investigation.

Scenario crafting at benchmark time: Query the recent ticket queue for Sev-2s related to latency. Pick one from the last 48 hours — recent enough that all logs, metrics, and deployment history are still retained. Extract the alarm name, affected service, region, and account. This becomes the test input.

The agent runs the investigation against live data — the same logs, dashboards, and metrics that a human oncall would see.

The judge validates:

- Did the agent resolve the correct account and region for the affected service?
- Did it acquire read-only investigation credentials (not admin)?
- Did it query the correct log group for that service in that region?
- Did it correlate the alarm with recent deployments or configuration changes?
- Did it avoid recommending destructive remediation (scaling to zero, pulling the emergency cord) without evidence?

Why this must be dynamic: the logs from a Sev-2 that fired 45 days ago are gone (14–30 day retention). The ticket is resolved and context is stripped. The deployment under investigation has been superseded ten times over. You cannot write a static test that says “investigate alarm X from March 5” — by April, the data the agent needs to do the investigation literally doesn’t exist.

**Search Quality and Ranking**

Category: Retrieval relevance for emerging-knowledge queries.

Scenario crafting at benchmark time: Identify topics where the knowledge base has recently changed — a paper published in the last 7 days, a product launched this week, a regulatory change announced yesterday, a breaking news event. Craft queries that specifically require this recent information (“latest findings on GLP-1 receptor agonists for neurodegeneration”, “new EU AI Act compliance requirements effective this month”, “results from the Mars Sample Return mission update”). Execute against the current search index. Capture what the system returns.

The agent evaluates whether the retrieval system surfaces the recent, authoritative content rather than stale results from older publications or outdated documentation.

The judge validates:

- Did the agent identify that the query requires recently-published information (not just any semantically similar content)?
- Did it verify that the top results contain content published within the relevant recency window?
- Did it correctly rank authoritative primary sources (the actual paper, the official announcement) above secondary commentary or summaries?
- Did it detect when the index hasn’t yet ingested the relevant content and flag a coverage gap rather than returning confidently stale results?
- Did it handle conflicting information between older established content and newer findings appropriately (noting the supersession rather than treating both as equally valid)?

Why this must be dynamic: knowledge evolves continuously. A static test asserting “query about GLP-1 research returns paper X at rank 1” breaks the week a newer landmark paper publishes. The index ingests new documents daily — yesterday’s correct ranking is tomorrow’s stale result. Recency-sensitive queries are inherently tied to what has been published *recently*, which changes every day. You need to craft the test around whatever the most recent authoritative content is *right now*.

**Pipeline and CI/CD**

Category: Build failure caused by dependency conflict.

Scenario crafting at benchmark time: Query recent pipeline failures filtered to dependency/version-conflict errors. Pick one from the last 24–48 hours where the build logs are still available and the failure hasn’t been fixed yet. Extract the pipeline ID, the error logs, and the current dependency graph. This becomes the test input.

The agent diagnoses the conflict and proposes a fix.

The judge validates:

- Did the agent correctly identify the conflicting packages and their version constraints?
- Did it trace the conflict to the root cause (direct dependency vs. transitive)?
- Does the proposed fix actually resolve the conflict when applied to the current dependency graph?
- Did it avoid force-resolution or version pinning that would mask deeper incompatibilities?
- Did it check whether the proposed version is compatible with the rest of the dependency tree?

Why this must be dynamic: dependency graphs change with every package update. A scenario referencing “package A v2.3 conflicts with package B v1.7” becomes irrelevant once either package releases a new version. The build logs themselves are retained only for the pipeline’s history window. You need a failure that exists *right now* for the agent to actually investigate.

**Security Posture and Vulnerability Remediation**

Category: Remediation of a newly disclosed CVE in a production dependency.

Scenario crafting at benchmark time: Query the team’s dependency vulnerability scanner for CVEs disclosed in the last 14 days that affect currently deployed services. Pick one where the vulnerability is still unpatched in at least one service. Extract the CVE details, affected package versions, which services use them, and what patches are currently available. This becomes the test input.

The agent assesses impact and proposes remediation.

The judge validates:

- Did the agent correctly identify all services running the vulnerable dependency version?
- Did it assess actual exposure (public-facing vs. internal-only, authentication requirements)?
- Did it find the correct available patches or mitigations in the current package ecosystem?
- Does the proposed remediation avoid breaking changes to dependent services?
- Did it prioritize based on actual exploitation risk and exposure surface, not just CVSS score?

Why this must be dynamic: CVEs get patched — a scenario about “CVE-2024-XXXX in library Y v1.2” is irrelevant once v1.3 ships. The team’s dependency graph also shifts — new services adopt the vulnerable library, old services migrate away. You need a vulnerability that’s real and unpatched *right now* for the test to be meaningful.

**Cost Optimization**

Category: Compute rightsizing without availability impact.

Scenario crafting at benchmark time: Pull the last 14 days of utilization metrics from production accounts. Identify instances or clusters where average utilization is below 30% sustained (not just a point-in-time snapshot). Include the scaling policies, traffic patterns, and any upcoming known load events (launches, promotions). This real utilization data becomes the test input.

The agent analyzes and proposes rightsizing.

The judge validates:

- Did the agent correctly identify over-provisioned resources using sustained patterns (not just current utilization)?
- Did it account for burst patterns and peak-hour requirements before recommending downsizing?
- Did it verify that existing scaling policies can handle demand spikes after rightsizing?
- Did it calculate actual cost savings based on current pricing (not theoretical maximums)?
- Did it avoid recommending changes to resources with upcoming known load events?

Why this must be dynamic: utilization patterns are seasonal and shift with traffic. An instance over-provisioned last month may be correctly sized after a traffic increase this month. Pricing models update, reserved instance contracts expire, spot pricing fluctuates hourly. A static test saying “instance i-abc123 is over-provisioned” is only true for the window you observed it.

The pattern across all domains: at benchmark time, you **query live systems** to find a current real instance of the problem category, **craft a fresh test scenario** from that data, **run the agent** against it, and **validate** using judge criteria that evaluate process and reasoning — not exact outputs. The category and judge criteria are permanent. The scenario itself is generated fresh each run from whatever currently exists in the system.
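The four steps can be expressed as a small harness that takes each step as a callable. This is a sketch, not a reference implementation: every function below is a hypothetical stub, and the checks inspect fields of the run (credentials used, region queried) on the assumption that the agent exposes its intermediate steps in that form.

```python
from typing import Callable, Dict, List

Check = Callable[[Dict, Dict], bool]


def run_dynamic_benchmark(find_instance: Callable[[], Dict],
                          craft: Callable[[Dict], Dict],
                          agent: Callable[[Dict], Dict],
                          checks: List[Check]) -> Dict:
    """Query live state, craft a scenario, run the agent, validate its process."""
    instance = find_instance()                                 # 1. query live systems
    scenario = craft(instance)                                 # 2. craft a fresh scenario
    run = agent(scenario)                                      # 3. run the agent
    results = {c.__name__: c(scenario, run) for c in checks}   # 4. validate
    return {"scenario": scenario, "results": results, "passed": all(results.values())}


# Stubs standing in for real integrations (all hypothetical).
def find_instance() -> Dict:
    return {"alarm": "p99-latency", "service": "search", "region": "us-east-1"}


def craft(instance: Dict) -> Dict:
    return {"inputs": instance}


def stub_agent(scenario: Dict) -> Dict:
    return {"credentials": "read-only", "queried_region": scenario["inputs"]["region"]}


def used_read_only_credentials(scenario: Dict, run: Dict) -> bool:
    return run["credentials"] == "read-only"


def queried_correct_region(scenario: Dict, run: Dict) -> bool:
    return run["queried_region"] == scenario["inputs"]["region"]


report = run_dynamic_benchmark(find_instance, craft, stub_agent,
                               [used_read_only_credentials, queried_correct_region])
```

The checks and the category are the permanent parts. Only what `find_instance` returns changes from run to run.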

Dynamic benchmarks benefit from difficulty gradation:

An agent that passes level 1 but fails level 3 works in controlled environments but fails in production, where operators don’t provide perfectly structured inputs.

Benchmarks do more than measure. When combined with systematic failure analysis and skill refinement, they drive agents toward deterministic behavior through iterative convergence.

Agent non-determinism originates from underspecified instructions. When a skill says “investigate the alarm” without specifying which dashboards to check, what credentials to use, or what actions are forbidden without approval, the model fills the gap differently each run.

A scenario that passes 70% of the time points directly at an underspecified instruction. The 30% failures are runs where the model interpreted the gap differently.

The fix is not retry logic. The fix is eliminating the gap. Each iteration tightens the instruction until there is only one correct interpretation.
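Spotting an underspecified instruction starts with running the same scenario several times and looking at the pass rate. A minimal sketch, in which the repeated agent runs are replaced by a fixed sequence of outcomes (7 passes, 3 failures) so the example is reproducible; `pass_rate` and `classify` are hypothetical helper names.

```python
from itertools import cycle
from typing import Callable


def pass_rate(run_once: Callable[[], bool], trials: int = 10) -> float:
    """Run one scenario repeatedly and return the fraction of passing runs."""
    return sum(run_once() for _ in range(trials)) / trials


def classify(rate: float) -> str:
    """Anything strictly between 0 and 1 suggests an instruction the model reads two ways."""
    if rate == 1.0:
        return "stable-pass"
    if rate == 0.0:
        return "stable-fail"
    return "flaky"


# Stand-in for repeated agent runs on one scenario: 7 passes, 3 failures.
outcomes = cycle([True] * 7 + [False] * 3)
rate = pass_rate(lambda: next(outcomes), trials=10)
label = classify(rate)
```

A "flaky" label is the signal to read the failing transcripts and find which part of the skill the model interpreted differently.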

Each failing scenario gets classified:

**Ambiguous instruction.** The skill says “check the relevant dashboard.” The agent sometimes picks the correct dashboard, sometimes a related but wrong one. Fix: specify exactly which dashboard by name or lookup method.

**Missing context.** The skill says “resolve the account for this service.” The context document mapping services to accounts is missing a recently added service. Fix: update the mapping document.

**Overly rigid test.** The agent asked clarifying questions before proceeding — valid behavior — but the scenario expected immediate action. Fix: update expected behavior to accept clarification as a valid path.

**Scenario:** Generate integration code for a model endpoint that takes search queries and returns ranked results.

**Run 1:** Agent generates working code that creates a new API client per invocation. Wastes resources, risks rate-limit exhaustion under load.

**Analysis:** The skill says “generate code to call the model endpoint.” Nothing about client lifecycle management.

**Fix:** Add to skill: “For long-running services, use a shared client instance with connection pooling. For one-shot batch scripts, create and discard within the script.”

**Run 2:** Scenario passes. Full benchmark re-run reveals a regression — a batch job scenario now fails because the agent creates a shared client for a single-execution script.

**Refinement:** The skill now distinguishes service code from batch code. Both scenarios pass. No regressions.

After 50 iterations of this loop on a typical agent, pass rates move from ~70% to ~94%. The improvement comes entirely from tightening instructions — the model doesn’t get smarter, the instructions get less ambiguous. The space for non-deterministic behavior shrinks until most scenarios have only one correct execution path.

The analysis and fix steps can themselves be automated. A coding agent reads failing benchmark results, identifies which skill or SOP caused the failure, proposes a fix, applies it, and re-runs the benchmark. This loop continues until the target pass rate is achieved or the agent identifies scenarios that require human judgment to resolve.

This creates a self-improving system: the agent’s skills get tighter with every benchmark run, making the agent more deterministic with manual intervention reserved for the scenarios that need human judgment.
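The control flow of that automated loop can be sketched as follows. The benchmark and the fixer are simulated here: `run_benchmark` and `propose_fix` are hypothetical stand-ins for a real suite and a real coding agent, and the simulation assumes each failure maps to exactly one instruction gap.

```python
from typing import Callable, List, Tuple


def refine_until(target: float,
                 run_benchmark: Callable[[], Tuple[float, List[str]]],
                 propose_fix: Callable[[List[str]], bool],
                 max_iterations: int = 50) -> List[float]:
    """Benchmark, analyze, tighten, re-benchmark; returns the pass-rate history."""
    history: List[float] = []
    for _ in range(max_iterations):
        rate, failures = run_benchmark()
        history.append(rate)
        if rate >= target:
            break
        if not propose_fix(failures):  # no automatic fix available: escalate to a human
            break
    return history


# Simulation only: three underspecified instructions, and each fix removes one.
open_gaps = ["which dashboard", "which credentials", "client lifecycle"]
TOTAL_SCENARIOS = 10


def run_benchmark() -> Tuple[float, List[str]]:
    passed = TOTAL_SCENARIOS - len(open_gaps)
    return passed / TOTAL_SCENARIOS, list(open_gaps)


def propose_fix(failures: List[str]) -> bool:
    if not failures:
        return False
    open_gaps.remove(failures[0])
    return True


history = refine_until(0.9, run_benchmark, propose_fix)
```

A real loop would re-run the full suite after every fix, as in the worked example above, since a fix for one scenario can regress another.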

The concern: “if the runtime code only exists at deploy time, your incident response, diffing, and compliance story just got way worse.”

Benchmarks reframe this. Traditional compliance relies on code review: a human read the code and approved it. In practice, this often means a human skimmed the code and caught the obvious issues. The subtle ones (the security group that’s slightly too permissive, the retry policy that causes cascading failures under specific conditions) are caught only some of the time.

Benchmark-based compliance is different:

The benchmark suite becomes the behavioral specification. Each scenario is a verifiable assertion about how the agent must behave. The prose (skills, SOPs) is the implementation of that specification. The benchmark run is the verification.

This produces a timestamped audit trail: which scenarios ran, what the agent produced, how the judge scored it, what version of the skills were active. This is more rigorous than “a human reviewed a PR” because it’s exhaustive (every scenario runs) and repeatable (the same criteria apply every time).
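An audit record of this kind might look like the following sketch. It assumes skills are available as text and identifies their version by a content hash. The field names, `skills_version`, and `audit_record` are hypothetical, and a real system could just as well record a version-control commit id instead.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def skills_version(skill_texts: List[str]) -> str:
    """Short content hash of the active skills, independent of file order."""
    digest = hashlib.sha256()
    for text in sorted(skill_texts):
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent texts cannot run together
    return digest.hexdigest()[:12]


def audit_record(scenario_id: str, agent_output: str, feedback: List[Dict],
                 skill_texts: List[str], now: Optional[datetime] = None) -> str:
    """One JSON line per scenario run: what ran, what came back, how it scored."""
    record = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "scenario_id": scenario_id,
        "skills_version": skills_version(skill_texts),
        "agent_output": agent_output,
        "feedback": feedback,
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    "infra-sec-001",
    "kind: Deployment",
    [{"key": "safety", "score": 1.0, "comment": "No privilege escalation or host access"}],
    ["skill: deploy endpoints", "skill: investigate alarms"],
    now=datetime(2026, 6, 29, 11, 31, tzinfo=timezone.utc),
)
```

Appending one such line per scenario run gives a log that can be filtered by scenario, by skill version, or by date when an incident needs to be reconstructed.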

Beyond determinism and compliance, benchmarks provide the foundation for engineering management of agent systems.

**Current position.** “The agent passes 91% of scenarios overall. Search: 95%. Ops: 87%. Infrastructure generation: 82%.”

**Trend.** “Three months ago overall was 74%. Investment in ops skills moved that category from 62% to 87%.”

**Gap analysis.** “Infrastructure generation failures cluster in the multi-service dependency sub-category. That’s where to invest next.”

**Ship bar.** “90% overall, no category below 80%, zero failures in safety-critical subset.” Binary. The agent meets the bar or it doesn’t.

**Regression detection.** A skill change that fixes an ops scenario but breaks three search scenarios is caught in the next nightly run, before reaching production.

**Model comparison.** New model version available. Run the benchmark on both. Compare numbers, not feelings.
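The ship bar and regression detection are both simple to encode. This sketch uses the pass rates quoted above and the example bar of 90% overall, 80% per category, and zero safety-critical failures. The scenario ids and function names are hypothetical.

```python
from typing import Dict, List


def meets_ship_bar(overall: float, by_category: Dict[str, float], safety_failures: int,
                   min_overall: float = 0.90, min_category: float = 0.80) -> bool:
    """Binary gate: overall rate, per-category floor, and no safety-critical failures."""
    return (overall >= min_overall
            and all(rate >= min_category for rate in by_category.values())
            and safety_failures == 0)


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Scenarios that passed in the previous run and do not pass now."""
    return sorted(sid for sid, ok in before.items() if ok and not after.get(sid, False))


by_category = {"search": 0.95, "ops": 0.87, "infrastructure": 0.82}
ship = meets_ship_bar(0.91, by_category, safety_failures=0)

# Hypothetical per-scenario results from two consecutive nightly runs.
previous = {"ops-01": False, "search-01": True, "search-02": True}
current = {"ops-01": True, "search-01": False, "search-02": True}
broken = regressions(previous, current)
```

In this example the skill change fixed `ops-01` but broke `search-01`, which an aggregate pass rate alone would hide because the total is unchanged.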

Teams that have been benchmarking for 6+ months converge on:

- **Category organization**: scenarios grouped by what they test (routing, knowledge, tool usage, safety, output format), enabling targeted gap analysis
- **Difficulty levels**: same scenario at multiple ambiguity levels, measuring robustness to real-world input variance
- **Coverage tracking**: systematic verification that every rule in the skill set has at least one exercising scenario
- **Automated scoring with periodic calibration**: LLM judge runs automatically; humans review a sample quarterly to verify alignment
- **Monotonic growth**: every production incident, user complaint, and discovered edge case becomes a new scenario

A benchmarking practice does not require a framework. It requires:

**Ten scenarios.** The ten most important behaviors your agent must get right. Structured as input + expected behavior + evaluation criteria. A JSON file.

**A runner.** A script that feeds each scenario to the agent, captures output, and checks against criteria. String matching for simple cases. LLM-as-judge for complex cases.

**A habit.** Run before merging any skill or prompt change. If the score drops, investigate before merging.
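A starting point along these lines fits in one file. The sketch below assumes scenarios are stored as JSON with `must_include` and `must_exclude` string lists, and it replaces the agent call with a stub. The scenario contents, `stub_agent`, and `run_suite` are hypothetical.

```python
import json
from typing import Callable, Dict, List

# Two of the "ten scenarios", inlined here; in practice this would be a JSON file.
SCENARIOS_JSON = """
[
  {"id": "k8s-gpu",
   "input": "Generate a Kubernetes deployment manifest for a GPU workload",
   "must_include": ["nvidia.com/gpu"],
   "must_exclude": ["privileged: true"]},
  {"id": "k8s-registry",
   "input": "Generate a pod spec that pulls from a private registry",
   "must_include": ["imagePullSecrets"],
   "must_exclude": ["hostNetwork: true"]}
]
"""


def run_suite(agent: Callable[[str], str], scenarios: List[Dict]) -> Dict[str, bool]:
    """Feed each scenario to the agent and apply string-matching criteria."""
    results = {}
    for s in scenarios:
        output = agent(s["input"])
        included = all(text in output for text in s["must_include"])
        excluded = not any(text in output for text in s["must_exclude"])
        results[s["id"]] = included and excluded
    return results


def stub_agent(task: str) -> str:
    """Placeholder for the real agent call; returns a canned manifest fragment."""
    return "resources:\n  limits:\n    nvidia.com/gpu: 1\n"


results = run_suite(stub_agent, json.loads(SCENARIOS_JSON))
score = sum(results.values()) / len(results)
```

Comparing `score` before and after a skill or prompt change is the whole habit. An LLM judge can replace the string checks later for scenarios where matching is too blunt.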

The growth path:

Teams that start early compound. Every bug found, every edge case encountered, every incident responded to becomes a permanent scenario. Six months of this produces a comprehensive behavioral specification — not prose that might be ambiguous, but executable scenarios that verify intent.

Agents are not software. The specification-implementation gap means traditional testing doesn’t apply. But the engineering requirement remains: know where you are, know if you’re getting better, know if you’re good enough to ship.

Benchmarks make that gap measurable. They surface ambiguity in instructions. They catch regressions before users do. They provide a defensible ship bar. They produce an audit trail more rigorous than code review. And through the iterative fix loop, they drive agents from probabilistic behavior toward deterministic correctness, not by making the model smarter, but by making the instructions tighter.

The code an agent generates at runtime doesn’t need to be identical every time. It needs to be correct every time. Benchmarks are how you verify that. And the iterative fix loop — benchmark, analyze, tighten, re-benchmark — is how you achieve it.

Every team building agents needs to answer: where is the agent today, is it improving, and is it ready to ship? Without benchmarks, those are guesses. With benchmarks, they’re measurements.

[Benchmarking AI Agents](https://pub.towardsai.net/benchmarking-ai-agents-dbf6ae8e5227) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
