NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

NVIDIA achieved leading agentic coding performance on the first agentic AI benchmark, AA-AgentPerf, delivering up to 20x better performance than previous generations. The benchmark, created by Artificial Analysis, measures the number of concurrent AI agents an inference system can support while meeting service level objectives.

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how inference systems perform under these conditions. Artificial Analysis AgentPerf (AA-AgentPerf, https://artificialanalysis.ai/benchmarks/hardware) offers the industry’s first multi-vendor open benchmarks profiling trajectories that are representative of real-world AI agent (https://www.nvidia.com/en-us/glossary/ai-agents/) coding tasks.

This post explains how AA-AgentPerf sets a new standard for measuring agentic workload performance, and how NVIDIA extreme co-design helps deliver up to 20x better agentic coding performance than previous generations.

What is AA-AgentPerf?

AA-AgentPerf is a hardware benchmark created by Artificial Analysis (https://artificialanalysis.ai/) that measures the number of concurrent AI agents an inference system can support while meeting predefined, model-specific service level objective (SLO) tiers. An SLO is defined as a specific threshold of output token speed and time to first token (TTFT). The benchmark results are normalized per accelerator and per megawatt to enable comparison across hardware configurations.

Measuring representative agentic coding performance

Agentic workloads are unique because LLM-driven decisions often produce non-deterministic sequences of requests and tool calls. The most difficult part of measuring agent performance is accurately capturing this non-determinism in a representative agent trajectory: the complete sequence of actions, decisions, and observations made by an agent as it works through a task from beginning to end (Figure 2). AA-AgentPerf captures this by measuring GPU performance across prerecorded agentic coding trajectories with interleaved reasoning and tool use, while simulating inter-turn latency with a representative baseline for CPU tool-call performance.
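As a rough illustration of the definitions above, the following Python sketch checks replayed turn timings against an SLO, picks the highest tested concurrency level that passes, and normalizes the result per accelerator and per megawatt. It is not the AA-AgentPerf harness: the `TurnTiming` and `Slo` types, the function names, and the nearest-rank percentile are hypothetical, and the reading that the 25th-percentile output speed and the 95th-percentile TTFT must both clear the tier is an assumption based on the P25 and P95 columns in Table 1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnTiming:
    """Timing for one replayed model turn (hypothetical schema)."""
    output_tokens: int   # tokens generated in this turn
    ttft_s: float        # time to first token, in seconds
    decode_s: float      # time spent streaming the output, in seconds
    tool_delay_s: float  # simulated CPU tool-call delay before the next turn


@dataclass(frozen=True)
class Slo:
    """One SLO tier: a floor on output speed and a ceiling on TTFT."""
    min_output_speed: float  # tokens/second
    max_ttft_s: float        # seconds


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(timings, slo):
    """Assumed rule: P25 output speed and P95 TTFT must both clear the tier."""
    speeds = [t.output_tokens / t.decode_s for t in timings]
    ttfts = [t.ttft_s for t in timings]
    return (percentile(speeds, 25) >= slo.min_output_speed
            and percentile(ttfts, 95) <= slo.max_ttft_s)


def session_seconds(timings):
    """Wall-clock time of one agent session, including simulated tool calls."""
    return sum(t.ttft_s + t.decode_s + t.tool_delay_s for t in timings)


def highest_passing_concurrency(runs, slo):
    """runs maps a tested concurrency level to the timings observed at it."""
    passing = [level for level, timings in runs.items() if meets_slo(timings, slo)]
    return max(passing, default=0)


def normalize(agents, num_accelerators, power_mw):
    """Return (agents per accelerator, agents per megawatt)."""
    return agents / num_accelerators, agents / power_mw
```

In this sketch, a caller would collect one list of `TurnTiming` records per tested concurrency level, pass the resulting mapping to `highest_passing_concurrency`, and feed that count into `normalize` together with the accelerator count and measured power of the system under test.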
These trajectories are built around solving issues in public code repositories across several use cases, 12+ programming languages, and responses from frontier models. In addition to rigorously defining the trajectories, the Artificial Analysis team also:

- Leveraged representative cached, input, and output sequence lengths for requests, ranging from 5K to 131K with a mean of approximately 27K.
- Mapped tool calls to representative CPU-side tasks in agentic coding workflows and simulated tool calls across a distribution with a one-second median delay time. The same CPU tool-call baseline was then applied across all systems tested.
- Kept the test set private to prevent benchmark-targeted optimization.

AA-AgentPerf testing and measurement methodology

The AA-AgentPerf harness measures the number of concurrent agents an inference system can support while meeting SLO requirements (Figure 3). At launch, this benchmark focuses on testing DeepSeek-V4-Pro across multiple SLO tiers derived from Artificial Analysis serverless API benchmarking data. This ensures that the benchmarks reflect quality-of-service levels observed in production providers today.

During a benchmarking run, AA-AgentPerf sends the GPUs thousands of concurrent requests drawn from its prerecorded agent trajectory dataset. To ensure independent results for each run, dynamic prefixes are added at the start of every trajectory phase. Strict SLO thresholds are enforced throughout the trajectory, and the highest concurrency level that satisfies those requirements is recorded as the official benchmark result for a given SLO (Figure 3). This process is then repeated across multiple SLO tiers to capture different user experience targets (Table 1).

| Model | SLO tier | P25 output speed (tokens/second) | P95 TTFT (seconds) |
|---|---|---|---|
| DeepSeek-V4-Pro | SLO 1 | 30 | 10 |
| DeepSeek-V4-Pro | SLO 2 | 100 | 5 |
| DeepSeek-V4-Pro | SLO 3 | 300 | 3 |

Table 1.
SLO tiers and TTFT requirements for AA-AgentPerf DeepSeek-V4-Pro tests

How to interpret AA-AgentPerf results

The core AA-AgentPerf metric is concurrent agents per megawatt, a practical normalization for representing data center scale performance. Table 2 outlines how to use the reported performance to estimate how many agentic sessions could be supported for a given power budget.

| Metric | What it indicates | NVIDIA GB300 NVL72 | NVIDIA H200 |
|---|---|---|---|
| Concurrent agents per MW | Energy efficiency: How many active agents a system can support for a given power budget | 61.4K | 2.6K |
| Concurrent agents per GPU | Hardware efficiency: How much serving capacity is achieved per GPU | 57.5 | 1.4 |

Table 2. How to use the metrics reported by AA-AgentPerf to aid in capacity planning for data centers aiming to support agentic applications at scale. Numbers reflect AA-AgentPerf results for SLO=30 configurations

On launch day, NVIDIA GB300 NVL72 (https://www.nvidia.com/en-us/data-center/gb300-nvl72/) delivers up to 20x more concurrent agents per megawatt than the previous generation, NVIDIA H200 (https://www.nvidia.com/en-us/data-center/h200/) (Figure 4). This performance highlights how GB300 NVL72 is able to deliver across large-scale agentic coding workloads, from routing long-lived sessions efficiently to keeping mixture-of-experts (MoE, https://www.nvidia.com/en-us/glossary/mixture-of-experts/) models and GPUs fully utilized across many concurrent agent sessions.

SGLang, TensorRT LLM, or vLLM: Agent runtimes apply optimizations such as WideEP and DeepEP to spread MoE expert execution across the full NVL72 domain, maximizing effective batch sizes and scaling effectively to thousands of agents.

DeepGEMM and Mega MoE optimizations: MXFP4/MXFP8 kernels and fused MoE overlap NVLink communication with Tensor Core compute to boost token throughput for reasoning and code generation.

NVIDIA NVLink scale-up domain: GB300 NVL72 links 72 GPUs into a single high-bandwidth NVLink fabric, so every GPU can rapidly share parameters, KV cache, and intermediate results. This is critical for fast, coordinated execution of agentic coding systems.

Looking forward: NVIDIA Vera Rubin platform

AA-AgentPerf establishes the standard for evaluating agentic inference, and the results highlight how tightly integrated hardware and software can unlock step-function gains in concurrency and efficiency. NVIDIA GB300 NVL72 demonstrates up to 20x higher agentic coding performance than the previous generation. The NVIDIA Vera Rubin platform (https://www.nvidia.com/en-us/data-center/technologies/rubin/) is expected to extend these gains by using 50 PFLOPS of NVFP4 (https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) compute and the Vera CPU to accelerate LLM tool calls and improve end-to-end performance, economics, and efficiency for agentic workflows.

To learn more about why agentic workloads place unique demands on inference infrastructure and how the NVIDIA Vera Rubin platform (https://www.nvidia.com/en-us/data-center/technologies/rubin/) optimizes performance, see Building for the Rising Complexity of Agentic Systems with Extreme Co-Design (https://developer.nvidia.com/blog/building-for-the-rising-complexity-of-agentic-systems-with-extreme-co-design/).

Acknowledgments

This work was made possible through the expertise and engineering contributions of Jatin Gangani, Iman Tabrizian, Xiaoming Chen, Peiheng Hu, Taizhong Wu, Shichen Li, Manu Maheswari, and many other talented NVIDIA engineers.