NVIDIA Blackwell Tops MLPerf Training 6.0 in Speed and Scale

NVIDIA said its Blackwell platform posted the fastest training results on all seven tests in MLPerf Training 6.0, the latest round of the benchmark suite, while also running at its largest scale to date. The company presented the results as evidence that its newest hardware can combine speed, scale and reliability for large AI model development.

MLPerf Training is a peer-reviewed benchmark used to compare AI training systems. In this round, NVIDIA said Blackwell was the only platform to appear in every benchmark and delivered the quickest time to train in each one. The benchmark suite also added two new mixture-of-experts workloads, reflecting the growing use of that model architecture in frontier AI systems.

The company submitted results on both GB200 NVL72 and GB300 NVL72 rack-scale systems. Those systems use fifth-generation NVLink Switches to connect 72 GPUs into a shared pool of compute and memory. NVIDIA said that design helps address the heavy communication demands of mixture-of-experts training, where tokens must be routed across GPUs to the right experts.
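To illustrate why routing creates communication pressure, here is a minimal sketch of top-k expert routing in plain Python. It is not NVIDIA's implementation: the function names, the contiguous expert-to-GPU layout and the toy router scores are all assumptions made for illustration. It shows how each token is assigned to its highest-scoring experts and how many of those assignments land on a GPU other than the one holding the token.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(router_logits, top_k, experts_per_gpu):
    """Assign each token to its top_k experts.

    Returns {gpu_id: [(token_idx, expert_id, weight), ...]}.
    Assumption: experts are laid out contiguously, so expert e
    lives on GPU e // experts_per_gpu.
    """
    dispatch = defaultdict(list)
    for token_idx, logits in enumerate(router_logits):
        probs = softmax(logits)
        chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in chosen)  # renormalize over the chosen experts
        for expert_id in chosen:
            gpu_id = expert_id // experts_per_gpu
            dispatch[gpu_id].append((token_idx, expert_id, probs[expert_id] / norm))
    return dict(dispatch)


def count_remote_sends(dispatch, token_home_gpu):
    """Count token-to-expert assignments that must cross GPUs."""
    return sum(
        1
        for gpu_id, items in dispatch.items()
        for token_idx, _, _ in items
        if token_home_gpu[token_idx] != gpu_id
    )


# Toy example: 3 tokens, 4 experts, 2 experts per GPU (so 2 GPUs).
router_logits = [
    [2.0, 1.0, 0.0, -1.0],  # token 0 prefers experts 0 and 1
    [0.0, 0.0, 3.0, 2.0],   # token 1 prefers experts 2 and 3
    [1.0, -1.0, 2.0, 0.0],  # token 2 prefers experts 2 and 0
]
token_home_gpu = [0, 0, 1]  # GPU that currently holds each token

dispatch = route_tokens(router_logits, top_k=2, experts_per_gpu=2)
remote_sends = count_remote_sends(dispatch, token_home_gpu)
print(remote_sends)
```

In this toy case, half of the six token-to-expert assignments cross a GPU boundary. At real batch sizes and expert counts, that traffic becomes an all-to-all exchange on every mixture-of-experts layer, which is the communication load the article refers to.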

NVIDIA also pointed to its low-precision NVFP4 training approach as a contributor to higher performance while still meeting the benchmark's accuracy requirements. The company said it has used the same approach in other large models, including its Nemotron 3 Ultra model.
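The article does not describe how NVFP4 works. As a rough illustration of block-scaled 4-bit quantization in general, the sketch below rounds values to the eight magnitudes a 4-bit E2M1 float can represent, with one scale per block of 16 values. Those format details reflect our understanding of NVIDIA's public descriptions and should be treated as assumptions. The sketch also simplifies: it keeps each scale as an ordinary Python float rather than an 8-bit value, and it omits any per-tensor scale and other parts of a real training recipe.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed micro-block size


def quantize_block(values):
    """Return (scale, codes) where each code is a signed E2M1 value."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude to 6.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize, block by block, to expose rounding error."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


original = [10.0, 4.0, -0.3, 0.0]
approx = fake_quantize(original)
print(list(zip(original, approx)))
```

The per-block scale means one large value only coarsens the other 15 values in its block rather than the whole tensor, which is the general reason block scaling helps very low-precision formats stay within accuracy targets.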

A key claim in this round was the performance gap between the two Blackwell generations. NVIDIA said GB300 NVL72 delivered up to 1.6 times the training performance of GB200 NVL72 at the same scale, citing higher compute density, more memory and a higher power ceiling as reasons for the improvement.

Scale was another central theme. For DeepSeek-V3 671B, which NVIDIA described as the largest mixture-of-experts model in the suite, the company said it scaled its submission to 8,192 GPUs using GB200 NVL72 systems. That was presented as the largest Blackwell-based submission in MLPerf Training to date. NVIDIA also ran a 5,120-GPU submission on Llama 3.1 405B, one of the benchmark suite’s largest dense models.

The results were not limited to NVIDIA's own submissions. The company said several partners also produced notable runs on Blackwell systems. Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs and reached the quality target in 7.07 minutes, which NVIDIA described as the fastest result for that benchmark. CoreWeave posted the fastest result on DeepSeek-V3 671B, reaching the quality target in 2.02 minutes on 8,192 GPUs with GB300 NVL72 systems connected through Spectrum-X Ethernet.

NVIDIA framed the benchmark results as part of a broader push to make training systems more production-ready. It said its platform is designed to reduce interruptions through manufacturing screening, chip-level monitoring and automatic fault rerouting. When failures do happen, the company said its resiliency software can resume jobs from saved checkpoints rather than forcing a full restart.
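The checkpoint-and-resume idea can be sketched in a few lines of Python. This is a generic illustration, not NVIDIA's resiliency software: the function names, the JSON checkpoint format and the simulated fault are all hypothetical. It shows a job that saves its state periodically, fails partway through, and then continues from the last saved step instead of step zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at_step=None):
    """Toy training loop; fail_at_step simulates a hardware fault."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            raise RuntimeError("simulated hardware fault")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.json")
    try:
        train(ckpt, total_steps=10, checkpoint_every=3, fail_at_step=7)
    except RuntimeError:
        resumed_from = load_checkpoint(ckpt)["step"]  # last saved step
    final_state = train(ckpt, total_steps=10, checkpoint_every=3)

print(resumed_from, final_state["step"])
```

Only the work done since the last checkpoint is repeated after the fault. The checkpoint interval therefore trades the overhead of saving against the amount of work lost per failure, a trade-off that matters more as jobs span thousands of GPUs and faults become more frequent.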

The company also emphasized its ecosystem of partners. It said 19 organizations took part in the round, including major cloud providers and server makers such as Microsoft Azure, Google Cloud, CoreWeave, Dell Technologies, HPE and Supermicro. NVIDIA cited examples from customers and partners showing faster training and serving times on Blackwell-based infrastructure.

The latest MLPerf results underscore how closely AI training performance now depends on the co-design of hardware, networking and software. NVIDIA is positioning Blackwell as its platform for that approach, not just for single-system speed but for running very large jobs across thousands of GPUs.