NVIDIA says its Blackwell Ultra NVL72 platform has taken the top spot in the first published results from AgentPerf, a new benchmark from Artificial Analysis designed to measure agentic AI systems. The company said the rack-scale system delivered the highest performance across the tested workloads and could run up to 20 times more agents per megawatt than its Hopper-based HGX H200 system.
The new benchmark arrives as developers shift from conventional chatbot use cases to AI agents that can complete multi-step tasks. NVIDIA argues that an agentic workload differs from a single-turn text prompt: instead of answering one request and stopping, an agent may chain together many model calls and tool interactions while it gathers context, reasons through a problem and takes action.
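To make the contrast with a single-turn prompt concrete, here is a minimal sketch of such a loop in Python. Everything in it is a hypothetical stand-in: `scripted_model` replaces a real model call, and the `read_file` and `run_tests` tools return canned strings. It illustrates the general pattern only, not how AgentPerf or any NVIDIA software is implemented.

```python
from typing import Callable


def run_agent(
    task: str,
    model: Callable[[list[str]], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> list[str]:
    """Alternate model calls and tool calls until the model gives a final answer."""
    context = [f"task: {task}"]
    for _ in range(max_steps):
        reply = model(context)  # one model call per step, over the whole context so far
        context.append(f"model: {reply}")
        if reply.startswith("final:"):
            break
        # Assumed convention for this sketch: "<tool_name> <argument>"
        tool_name, _, argument = reply.partition(" ")
        context.append(f"tool: {tools[tool_name](argument)}")
    return context


def scripted_model(context: list[str]) -> str:
    """Hypothetical stand-in for an LLM: replays a fixed script of replies."""
    script = ["read_file main.py", "run_tests", "final: fixed the bug"]
    turns_so_far = sum(1 for line in context if line.startswith("model:"))
    return script[turns_so_far]


# Hypothetical tools that return canned results instead of doing real work.
stub_tools = {
    "read_file": lambda path: f"contents of {path}",
    "run_tests": lambda _: "2 passed, 1 failed",
}

transcript = run_agent("fix the failing test", scripted_model, stub_tools)
model_calls = sum(1 for line in transcript if line.startswith("model:"))
tool_calls = sum(1 for line in transcript if line.startswith("tool:"))
print(model_calls, tool_calls, len(transcript))
```

Even this toy task takes three model calls and two tool calls, and the context passed to the model grows with every step.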
That difference matters for infrastructure planning, according to the company. Traditional inference tests usually measure how quickly a model responds to one request or how many simultaneous queries a system can handle. NVIDIA says those methods do not capture the effects of long context, repeated tool use and the delays introduced when agents pass through multiple steps.
AgentPerf was built to reflect those production-style workloads. Artificial Analysis based the benchmark on real coding-agent traces drawn from public repositories across more than a dozen programming languages, then used simulated tool calls to isolate accelerator performance. The benchmark measures how many agentic tasks a system can support at once while meeting thresholds for responsiveness and token output rate.
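The measurement described above, the largest number of concurrent agents that still meets the responsiveness and output-rate thresholds, can be sketched roughly as follows. This is an illustration under stated assumptions, not AgentPerf's actual method: the article does not give the thresholds, so the sketch treats "responsiveness" as time to first token, and every number in `sample_sweep` is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPoint:
    """One measurement from a hypothetical concurrency sweep."""
    concurrent_agents: int
    time_to_first_token_s: float   # assumed proxy for "responsiveness"
    tokens_per_s_per_agent: float  # assumed proxy for "token output rate"


def max_agents_within_thresholds(
    points: list[LoadPoint],
    max_ttft_s: float,
    min_tokens_per_s: float,
) -> int:
    """Return the highest measured concurrency that meets both thresholds, or 0."""
    passing = [
        p.concurrent_agents
        for p in points
        if p.time_to_first_token_s <= max_ttft_s
        and p.tokens_per_s_per_agent >= min_tokens_per_s
    ]
    return max(passing, default=0)


# Invented sweep data: latency rises and per-agent output rate falls with load.
sample_sweep = [
    LoadPoint(8, 0.4, 90.0),
    LoadPoint(16, 0.6, 70.0),
    LoadPoint(32, 1.1, 45.0),
    LoadPoint(64, 2.8, 20.0),
]

supported = max_agents_within_thresholds(
    sample_sweep, max_ttft_s=1.5, min_tokens_per_s=40.0
)
print(supported)  # 32 for the invented data above
```

The point of scoring this way is that a system gets no credit for concurrency it can only reach by letting each agent become too slow.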
In the first round of results, Artificial Analysis evaluated agentic performance using DeepSeek V4 Pro, which it describes as a large mixture-of-experts model suited to frontier agents. NVIDIA said its GB300 NVL72 system posted the strongest result on that workload, supporting as many as 20 times more agents per megawatt than the HGX H200 system.
The company attributes the result to what it calls full-stack co-design. The GB300 NVL72 links 72 GPUs into a single rack-scale system, which NVIDIA says helps large mixture-of-experts models spread execution more efficiently. It also pointed to CUDA kernels that overlap communication and computation, along with TensorRT LLM software that helps maintain efficiency as concurrent sessions scale.
NVIDIA framed the benchmark in terms of infrastructure economics. For enterprises deploying AI agents at scale, the company said the key question is not just raw speed but how much useful work can be delivered for each accelerator and each unit of power consumed.
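The per-accelerator and per-megawatt framing comes down to simple normalization, sketched below. The function names and all figures are hypothetical placeholders chosen for easy arithmetic. They are not measured values for any NVIDIA system and are not taken from the AgentPerf results.

```python
def agents_per_megawatt(concurrent_agents: int, power_kw: float) -> float:
    """Normalize supported agents by power draw (1 MW = 1,000 kW)."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents * 1000.0 / power_kw


def agents_per_accelerator(concurrent_agents: int, accelerators: int) -> float:
    """Normalize supported agents by the number of accelerators in the system."""
    if accelerators <= 0:
        raise ValueError("accelerators must be positive")
    return concurrent_agents / accelerators


# Placeholder figures for two imaginary systems, not benchmark data.
system_a = agents_per_megawatt(concurrent_agents=300, power_kw=100.0)
system_b = agents_per_megawatt(concurrent_agents=50, power_kw=50.0)
efficiency_ratio = system_a / system_b

print(system_a, system_b, efficiency_ratio)  # 3000.0 1000.0 3.0
```

In this made-up case, system A supports six times as many agents as system B but draws twice the power, so its per-megawatt advantage is three times. That is why a power-normalized figure can differ sharply from a raw throughput comparison.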
NVIDIA also highlighted ecosystem partners that are already serving agentic applications on Blackwell. The company named Baseten, DeepInfra and Together AI as providers running frontier models such as DeepSeek V4 Pro.
Together AI, NVIDIA said, powers real-time inference for Cursor, an AI coding platform whose agents can debug issues, generate features and carry out refactors while developers keep working. DeepInfra, meanwhile, supports Pam.ai, which is used by car dealerships to book service appointments, answer calls and run outbound sales campaigns.
The company suggested that continued software optimization across the NVIDIA and open-source ecosystems should improve agentic performance further. It also pointed to the Vera Rubin architecture, which it said is now in full production and is intended to add capacity for growing agentic AI demand.
For now, the latest benchmark offers an early yardstick for a fast-moving category. As more companies move from simple inference to multi-step agents, comparisons based on throughput, latency and power efficiency may become more important than traditional chatbot tests.