CoreWeave posts MLPerf training record with DeepSeek-V3 on GB300 GPUs

CoreWeave claims new high mark in AI training benchmark

CoreWeave said Monday that it posted record results in the latest MLPerf Training benchmark, including what it described as the fastest DeepSeek-V3 671B run in the round. The company said it trained the model in about 2.02 minutes using 8,192 NVIDIA GB300 NVL72 GPUs across 2,048 nodes.

The results matter because ML training performance has become a key bottleneck for AI developers as models grow larger and more complex. CoreWeave framed the benchmark as evidence that its cloud infrastructure can turn new hardware into usable training performance at scale, rather than simply showcasing raw chip capabilities.

According to the company, the test was run on the same infrastructure it sells to customers, not on a special benchmark-only setup. CoreWeave said that distinction is important because it wants the results to reflect real production conditions, including networking, scheduling, storage and orchestration.

Scaling results across cluster sizes

CoreWeave submitted three GB300 configurations for DeepSeek-V3, which it said was the most demanding workload in the benchmark round. Alongside the 8,192-GPU run, the company reported that a 4,096-GPU setup completed training in 3.09 minutes, while a 2,048-GPU configuration finished in 5.54 minutes. CoreWeave said the results scaled close to linearly each time the cluster size doubled, which it presented as a sign of strong efficiency across its stack.
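To put the "close to linear" claim in concrete terms, the reported times can be turned into a speedup per doubling. The short Python sketch below uses only the three DeepSeek-V3 figures quoted in this article. The efficiency measure (observed speedup divided by the increase in GPU count) is a common convention and an assumption here, not a metric the company is quoted as using.

```python
# Reported DeepSeek-V3 results from the article: GPU count -> minutes to train.
results = {2048: 5.54, 4096: 3.09, 8192: 2.02}

def scaling_efficiency(results):
    """Return (small, large, speedup, efficiency) for each consecutive pair of sizes."""
    sizes = sorted(results)
    rows = []
    for small, large in zip(sizes, sizes[1:]):
        speedup = results[small] / results[large]  # how much faster the larger run was
        ideal = large / small                      # perfect linear scaling would match this
        rows.append((small, large, speedup, speedup / ideal))
    return rows

for small, large, speedup, eff in scaling_efficiency(results):
    print(f"{small} -> {large} GPUs: {speedup:.2f}x speedup, {eff:.0%} of linear")
```

On these figures, the first doubling yields roughly a 1.79x speedup and the second roughly 1.53x, against an ideal of 2x each time.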

The company also said it was the only participant in this MLPerf round to push a GB300 platform beyond 2,048 GPUs on the DeepSeek-V3 workload. It argued that the larger submission showed more usable performance per GPU, a point it said is especially relevant for customers facing compute limits and tight development schedules.

CoreWeave separately reported results on other NVIDIA platforms. On a 4,096-GPU GB300 setup, it said it reached the quality target for Llama-3.1-405B in 9.77 minutes. It said that result used 20% fewer GPUs than a larger GB200 deployment while still reaching near-parity in performance.

The company also highlighted smaller cluster results on NVIDIA HGX B200 systems connected by InfiniBand. On an 8-node, 64-GPU cluster, CoreWeave said it trained GPT-OSS-20B in 26.98 minutes and Llama-3.1-8B in 16.5 minutes.

Infrastructure and software tuning

CoreWeave attributed the benchmark performance to what it called full-stack optimization. The company pointed to fleet health checks through its Mission Control system, topology-aware scheduling through its SUNK platform, and network tuning designed to reduce congestion at large scale.

It said Mission Control checks hardware, firmware, network and thermal conditions before and during training jobs. CoreWeave also said its scheduler places workloads with awareness of NVLink domains, which can help keep communication local for mixture-of-experts models. In addition, it said its rail-aware networking strategy is designed to spread traffic more evenly across the fabric.
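The idea behind NVLink-aware placement can be illustrated with a toy example. The Python sketch below is hypothetical and is not CoreWeave's SUNK scheduler: the function name `place_job`, the rack labels and the greedy rule are all assumptions made for illustration. It simply fills the domains with the most free GPUs first, so that a job spans as few NVLink domains as possible.

```python
def place_job(free_gpus_by_domain, gpus_needed):
    """Greedily pick NVLink domains, most free GPUs first, so the job spans few domains.

    Returns {domain: gpus_taken}, or None if the cluster lacks capacity.
    Hypothetical illustration only; real schedulers weigh many more factors.
    """
    if gpus_needed > sum(free_gpus_by_domain.values()):
        return None
    placement = {}
    remaining = gpus_needed
    # Largest free domains first; ties broken by name so the result is deterministic.
    ordered = sorted(free_gpus_by_domain.items(), key=lambda kv: (-kv[1], kv[0]))
    for domain, free in ordered:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[domain] = take
            remaining -= take
    return placement

# Made-up free capacity per NVLink domain (an NVL72 rack holds 72 GPUs).
free = {"rack-a": 72, "rack-b": 40, "rack-c": 72, "rack-d": 16}
print(place_job(free, 144))  # {'rack-a': 72, 'rack-c': 72}
```

In this example a 144-GPU job lands on two fully free racks rather than being scattered across four partly used ones, which is the kind of locality the company says benefits mixture-of-experts traffic.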

Brendan Burke, research director at Futurum Research, said the results showed how infrastructure expertise can matter as much as new hardware in closing the gap between benchmark performance and real-world deployment. CoreWeave also cited its record of third-party validation, including top rankings from SemiAnalysis ClusterMAX and other independent benchmarking work.

The announcement arrives as CoreWeave continues to position itself as a specialized cloud provider for AI training and inference. The company said its MLPerf results were achieved on systems available to customers today, underscoring its message that production infrastructure, not just lab demonstrations, can deliver top-end training performance.