# NVIDIA Blackwell Leads AgentPerf, the First Agentic-AI Infra Benchmark: Trajectory-Replay Benchmarking

> Source: <https://dev.to/pueding/nvidia-blackwell-leads-agentperf-the-first-agentic-ai-infra-benchmark-trajectory-replay-58d6>
> Published: 2026-06-15 11:20:31+00:00

**What:** The **AgentPerf benchmark** from Artificial Analysis is the first test built for **agentic-AI infrastructure**: instead of timing one chat completion, it **replays recorded multi-step agent trajectories** to see how a serving system holds up under real agent load.

**Why:** Agents don't send one prompt; they run **long chains of model calls and tool executions**, so a serving system's real job is sustaining many such runs at once. AgentPerf measures exactly that: **concurrent agents held above a minimum token rate, normalized by power**.

**vs prior:** A **single-shot completion benchmark** sends one prompt and reports tokens per second — and misses the bursty, stateful, KV-cache-heavy load a real agent creates. Trajectory replay reproduces that load, so the score reflects **real production agent load**, not a sprint time.

**Analogy:** An EPA mileage test that replays a real drive cycle, not a top-speed sprint.

```
                  MEASURING A SERVING SYSTEM
                             │
             ┌───────────────┴───────────────┐
             │                               │
     ┌───────▼───────┐               ┌───────▼───────┐
     │  SPRINT TEST  │               │  DRIVE CYCLE  │
     │ one completion│               │ one agent run │
     └───────┬───────┘               └───────┬───────┘
             │                               │
    peak tokens/sec on             replay the stop-go,
    a single prompt                multi-step agent load
             │                               │
             ▼                               ▼
   ✗ flatters the rack,           ✓ agents per megawatt:
     ignores real load              agents held over an SLO
```

**AgentPerf** — Artificial Analysis's benchmark for **agentic-AI infrastructure**. It drives a serving system with recorded coding-agent trajectories across 12+ programming languages and scores how many concurrent agents the system sustains under a per-token speed limit, normalized by power.

**Agent trajectory** — The full recorded run of an agent: **chained LLM calls interleaved with tool executions** — read a file, run code, see the error, try again — many steps to finish one task. See AI Agents → The Agent Loop.

**Per-token SLO** — A **service-level objective on output speed** — a floor on tokens per second the system must hold for each agent. AgentPerf measures at both 20 and 60 tok/s. See LLM Serving → Serving Metrics.

**Goodput** — Only the work that **actually meets the SLO** — here, the concurrent agents staying above the token-rate floor — as opposed to raw throughput, which counts everything regardless of latency. See Throughput vs Goodput.

**Agents per megawatt** — AgentPerf's headline metric: **concurrent agents meeting the SLO, divided by the power the system draws**. An efficiency number — useful work per unit of energy — like miles per gallon for an inference fleet.

**GB300 NVL72 / HGX H200** — The two NVIDIA systems compared: the rack-scale **Blackwell GB300 NVL72** versus the prior-generation **HGX H200**. Both run DeepSeek V4 Pro in the reported result.

**The news.** On June 12, 2026, Artificial Analysis released **AgentPerf**, billed as the industry's first benchmark for **agentic-AI infrastructure**. Rather than single chat completions, it replays real coding-agent trajectories (file reads, code execution, iteration) across **12+ programming languages**, and scores how many concurrent agents a system sustains under a per-token SLO, normalized by power. NVIDIA reports that its **GB300 NVL72** serves up to **20× more agents per megawatt** than an **HGX H200** system, running DeepSeek V4 Pro and measured at both 20 and 60 tokens/sec.

Picture the fuel-economy sticker on a new car. The number that ends up on the window isn't a quarter-mile drag time — a single sprint down an empty straight tells you almost nothing about the commute you'll actually drive. The figure drivers care about, miles per gallon, comes from a dynamometer **replaying a recorded city drive cycle**: stop, go, idle, accelerate, the messy real pattern. **A single sprint measures the wrong thing; the recorded drive cycle measures the thing you live with.** A single chat completion is that sprint. An agent's run is the drive cycle. AgentPerf is the dyno.

The reason the distinction matters is that an agent run looks nothing like one prompt-and-reply. It is a long loop of **model calls interleaved with tool executions** — read a file, run the code, look at the failure, edit, try again — many steps to finish a single task. That load is bursty and stateful: the context grows with every step, leaning hard on KV-cache reuse, decode comes in stop-go spurts, and many such runs land on the system at once. **A benchmark that sends one prompt and reports peak tokens per second is timing the sprint, not the commute.**
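To make the "context grows with every step" point concrete, here is a minimal sketch of a trajectory as a data structure. The `Step` class, the `prompt_sizes` helper, and all token counts are hypothetical, chosen only for illustration; the source does not describe AgentPerf's trajectory format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One loop iteration: a model call followed by a tool execution."""
    output_tokens: int       # tokens the model decodes in this call
    tool_result_tokens: int  # tokens the tool result adds to the context


def prompt_sizes(base_context: int, steps: list[Step]) -> list[int]:
    """Prompt length seen by each successive model call."""
    sizes = []
    context = base_context
    for step in steps:
        sizes.append(context)
        # The model's own output and the tool result both join the context.
        context += step.output_tokens + step.tool_result_tokens
    return sizes


# Hypothetical numbers, not taken from any recorded trajectory.
trajectory = [Step(200, 800), Step(150, 1200), Step(300, 500)]
print(prompt_sizes(base_context=4000, steps=trajectory))  # [4000, 5000, 6350]
```

Each call sees a longer prompt than the one before it, which is why a single fixed-size prompt is a poor stand-in for an agent run.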

So AgentPerf replays *recorded* coding-agent trajectories and asks a different question: how many agents can the system keep above a per-token speed floor at the same time? **That is a goodput measurement (count only the agents actually holding the SLO, not raw token throughput), which is then divided by the power the rack draws.** The unit that falls out, agents per megawatt, is miles-per-gallon for an inference fleet: useful work per unit of energy.
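The scoring idea can be sketched in a few lines. This is an illustration of the goodput-per-power calculation as described here, not AgentPerf's actual implementation: the function name `agents_per_megawatt`, the per-agent rates, and the power figure are all assumptions.

```python
def agents_per_megawatt(rates_tok_s: list[float], slo_tok_s: float, power_mw: float) -> float:
    """Goodput-style score: agents at or above the SLO floor, per megawatt."""
    if power_mw <= 0:
        raise ValueError("power_mw must be positive")
    # Only agents holding the floor count; slower agents contribute nothing.
    meeting_slo = sum(1 for r in rates_tok_s if r >= slo_tok_s)
    return meeting_slo / power_mw


# Hypothetical per-agent decode rates (tokens/sec) and power draw.
rates = [72.0, 65.5, 61.0, 58.9, 40.2]
print(agents_per_megawatt(rates, slo_tok_s=60, power_mw=0.5))  # 6.0
print(agents_per_megawatt(rates, slo_tok_s=20, power_mw=0.5))  # 10.0
```

The same five agents score differently at the two floors, which is one reason a result needs to state the SLO it was measured at.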

| How you benchmark | What it sends | What it misses |
|---|---|---|
| Single chat completion | one prompt → one response | the bursty, multi-step load a real agent creates |
| Peak-throughput LLM bench | many independent prompts | KV reuse and sustained concurrency within one long run |

Hold two things fixed: the power, at one megawatt, and the SLO, at 60 tokens per second. Suppose an HGX H200 rack sustains **60 concurrent agents** that stay above that floor on its megawatt *(illustrative)*. The only figure actually reported is the ratio: per NVIDIA, the **GB300 NVL72** sustains up to **20×** as many on the same megawatt, or roughly **1,200 agents** at that scaling (60 × 20). The lever isn't only more FLOPs. Agent trajectories share a huge common prefix (the system prompt, the tool definitions, the conversation so far), so KV-cache reuse and continuous batching are what turn raw compute into *sustained* agents, and a single-completion benchmark never exercises that reuse. **Same megawatt, up to 20× the agents, because the test finally rewards sustained, KV-reuse-heavy agent load instead of a one-shot sprint.** *(Only the 20× ratio, the 20/60 tok/s SLOs, and the GB300-vs-H200 comparison come from NVIDIA; the 60-agent baseline is illustrative.)*
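The shared-prefix point can be illustrated with a small sketch. It assumes a serving system that can reuse cached KV entries for an exact leading-token match; real servers may match at a coarser granularity, and the token IDs and the `common_prefix_len` helper below are hypothetical stand-ins, not part of any real API.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading token IDs that two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy token IDs standing in for a real tokenizer's output (hypothetical).
system_and_tools = [1, 2, 3, 4, 5, 6]
call_1 = system_and_tools + [10, 11]
call_2 = call_1 + [20, 21, 22]  # the next step appends to the same context

reused = common_prefix_len(call_1, call_2)  # tokens whose KV could be reused
fresh = len(call_2) - reused                # tokens that still need prefill
print(reused, fresh)  # 8 3
```

Under that assumption, most of each follow-up prompt is already cached, so only the newly appended tokens need prefill. A benchmark made of independent single prompts never gives the server this opportunity.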

*Goes deeper in: Agent Engineering → Cost & Latency → The Cost Profile of an Agent*

AgentPerf is a benchmark from Artificial Analysis, billed as the first test for agentic-AI infrastructure. Instead of timing single chat completions, it replays recorded multi-step coding-agent trajectories — file reads, code execution, and iteration across 12+ programming languages — and scores how many concurrent agents a serving system sustains under a per-token SLO, normalized by power (agents per megawatt). In NVIDIA's reported result, a GB300 NVL72 system serves up to 20× more agents per megawatt than an HGX H200 system on DeepSeek V4 Pro.

A normal LLM benchmark sends one prompt and measures the response — tokens per second, time to first token. An agent, though, runs a long trajectory: chained model calls interleaved with tool executions, with a growing context and bursty decode. Trajectory replay drives the system with those recorded multi-step runs instead of single prompts, so it stresses the scheduler, KV-cache reuse, and sustained decode under concurrency — the load that real agents actually create.

Agents per megawatt is AgentPerf's headline metric: the number of concurrent agents a system keeps above the per-token SLO, divided by the power it draws. It is a goodput-style efficiency number — useful work per unit of energy — analogous to miles per gallon for an inference fleet. It rewards systems that sustain many real agent runs at once on the same power budget, not just peak token throughput on a single prompt.
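One way to picture "the number of concurrent agents a system keeps above the SLO" is a concurrency sweep: add agents until the per-agent rate would drop below the floor. The sketch below uses a deliberately simple toy model in which agents split a fixed aggregate decode rate. That model, the function names, and the capacity numbers are assumptions for illustration; real systems do not scale this cleanly, and this is not how AgentPerf is documented to work.

```python
def per_agent_rate(n_agents: int, aggregate_tok_s: float, single_stream_tok_s: float) -> float:
    """Toy model: agents split a fixed aggregate decode rate, capped per stream."""
    return min(single_stream_tok_s, aggregate_tok_s / n_agents)


def max_agents_at_slo(slo_tok_s: float, aggregate_tok_s: float, single_stream_tok_s: float) -> int:
    """Largest concurrency at which every agent still holds the SLO floor."""
    if slo_tok_s <= 0:
        raise ValueError("slo_tok_s must be positive")
    n = 0
    # Keep adding agents while the next one would still leave everyone above the floor.
    while per_agent_rate(n + 1, aggregate_tok_s, single_stream_tok_s) >= slo_tok_s:
        n += 1
    return n


# Hypothetical capacity numbers, not measurements.
print(max_agents_at_slo(60, aggregate_tok_s=6000, single_stream_tok_s=150))  # 100
print(max_agents_at_slo(20, aggregate_tok_s=6000, single_stream_tok_s=150))  # 300
```

Dividing the resulting agent count by the system's power draw gives the agents-per-megawatt figure; a stricter floor yields fewer agents on the same hardware.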

Originally posted on [Learn AI Visually](https://learnaivisually.com/ai-explained/agentperf-trajectory-replay-benchmarking).
