NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance

NVIDIA swept MLPerf Training v6.0 benchmarks, achieving the fastest training times at scale and highest per-accelerator performance across all tests, including new DeepSeek-V3 and GPT-OSS-20B workloads. The company's Blackwell platform, using up to 8,192 GPUs and optimized networking, set records for time-to-train on models like DeepSeek-V3 (2.02 minutes) and Llama 3.1 405B (7.07 minutes).

NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium. NVIDIA achieved the fastest time to train at scale and also delivered the highest performance when normalized on a per-accelerator basis on every benchmark. It was also the only platform to submit on every test.

MLCommons introduced new pretraining benchmarks in this round designed to reflect the latest trends in AI models, including DeepSeek-V3, a massive 671B-parameter Mixture of Experts (MoE) model that also serves as the base for the popular DeepSeek-R1 reasoning model, and GPT-OSS-20B, a small-but-capable MoE. The NVIDIA platform was the only one to submit results on both new workloads, with the NVIDIA GB300 NVL72 system setting the performance bar through optimized NVIDIA software stacks and a design that connects 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs as one using NVIDIA NVLink and NVIDIA NVLink Switch.

Unprecedented scale and throughput across the scale-out fabric

Training state-of-the-art models requires large-scale infrastructure and the ability to efficiently execute workloads across thousands of interconnected processors. In several entries this round, NVIDIA cloud service provider partners scaled up to 8,192 Blackwell GPUs working in unison across diverse cloud data centers. These submissions proved the real-world robustness of the Blackwell platform across production hyperscale data center fleets, demonstrating strong scaling trends across these varied cluster environments.

Extracting maximum efficiency from each training iteration at this magnitude requires moving far beyond the reach of a single NVLink domain, relying on scale-out networking platforms such as NVIDIA Spectrum-X Ethernet and NVIDIA Quantum InfiniBand.
Expert parallelism within MoE models generates low-entropy, bursty flows, a pattern that typically reduces effective bandwidth under static Equal-Cost Multi-Path (ECMP) hashing as large flows collide on shared links. To resolve this, Spectrum-X Ethernet’s Advanced Adaptive Routing distributes traffic packet-by-packet across all available paths according to real-time link load, sustaining effective bandwidth near the fabric’s theoretical capacity while the receiving ConnectX SuperNIC handles out-of-order delivery. Additionally, when a popular expert draws simultaneous traffic from many senders, Spectrum-X Congestion Control uses real-time telemetry to detect the resulting incast early and pace senders before buffers overflow. This balances tail latency so all-to-all communication stays hidden behind compute rather than surfacing on the main execution path.

This combination of cluster orchestration and network fabric efficiency enabled new time-to-train records across the most challenging benchmarks, as summarized below:

| Benchmark Workload | GPU Platform | Cluster Scale | Time-to-Train |
| --- | --- | --- | --- |
| DeepSeek-V3 671B MoE | GB300 NVL72 | 8,192 GPUs | 2.02 mins |
| GPT-OSS 20B MoE | GB300 NVL72 | 512 GPUs | 7.43 mins |
| Llama 3.1 405B | GB200 NVL72 | 8,192 GPUs | 7.07 mins |
| Llama 3.1 8B | GB200 NVL72 | 1,024 GPUs | 4.46 mins |
| Llama 2 70B LoRA | GB300 NVL72 | 512 GPUs | 0.4 mins |
| FLUX.1 | GB300 NVL72 | 512 GPUs | 17.1 mins |
| DLRM-dcnv2 | GB300 NVL72 | 64 GPUs | 0.67 mins |

Table 1. Time-to-train wins delivered by NVIDIA Blackwell across MLPerf Training 6.0

MLPerf Training v6.0 results retrieved from www.mlcommons.org on June 16, 2026, from the following entries: 6.0-0005, 6.0-0102, 6.0-0001, 6.0-0015, 6.0-0102, 6.0-0101 and 6.0-0062. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
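The ECMP collision effect described earlier in this section can be illustrated with a toy model. The sketch below is not Spectrum-X code: the flow names, the CRC32 hash, the link count, and the "send each packet to the least-loaded link" policy are all assumptions chosen for illustration. It only shows why a handful of large flows pinned to links by a static hash tends to leave capacity idle, while per-packet spreading evens out the load.

```python
import zlib


def ecmp_link_loads(flows, num_links):
    """Static ECMP: every packet of a flow follows the link picked by a hash of the flow ID."""
    loads = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        loads[link] += packets
    return loads


def per_packet_link_loads(flows, num_links):
    """Per-packet spreading: each packet goes to whichever link is least loaded right now."""
    loads = [0] * num_links
    for packets in flows.values():
        for _ in range(packets):
            loads[loads.index(min(loads))] += 1
    return loads


def bandwidth_efficiency(loads):
    """Average link load divided by the busiest link's load (1.0 means perfectly even)."""
    return (sum(loads) / len(loads)) / max(loads)


# Hypothetical traffic: 8 large expert-parallel flows of 1,000 packets each over 4 links.
pairs = [(0, 4), (1, 4), (2, 5), (3, 5), (0, 6), (1, 6), (2, 7), (3, 7)]
flows = {f"gpu{src}->expert_gpu{dst}": 1000 for src, dst in pairs}

ecmp = ecmp_link_loads(flows, num_links=4)
adaptive = per_packet_link_loads(flows, num_links=4)

print("ECMP loads:      ", ecmp, "efficiency:", round(bandwidth_efficiency(ecmp), 2))
print("Per-packet loads:", adaptive, "efficiency:", round(bandwidth_efficiency(adaptive), 2))
```

In this model the transfer finishes only when the busiest link drains, so the efficiency figure is a rough stand-in for effective bandwidth. Per-packet spreading also delivers packets out of order, which is why the text above notes that the receiving SuperNIC has to handle reordering.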
The software innovation engine

Hardware capabilities are only as good as the software driving them. To extract maximum performance for complex MoE models like DeepSeek-V3, NVIDIA deployed several cutting-edge software optimizations in this round of MLPerf Training:

1. Full-iteration CUDA graphs for token-dropless MoEs

Historically, token-dropless MoE architectures struggled to run fully within CUDA graphs because dynamic routing forced continuous CPU-GPU synchronization. For MLPerf Training 6.0, NVIDIA implemented full-iteration CUDA graphs (https://github.com/NVIDIA/Megatron-LM/blob/main/docs/user-guide/features/cuda graph.md) for the first time for these MoEs. Two primary hurdles had to be addressed to enable this.

First, expert module operators, such as the quantizer, grouped GEMM, and token dispatcher, were transitioned to a synchronization-free mode. In this configuration, input shapes are derived directly from GPU values, removing the need for host-side coordination.

Second, device memory was managed without host involvement via paged stashing (https://github.com/NVIDIA/Megatron-LM/blob/main/docs/user-guide/features/paged stash.md). This technique enables fine-grained management of pre-allocated GPU memory, ensuring the process is fully compatible with CUDA graphs.

By rewriting critical execution paths to eliminate all CPU-GPU synchronization points, the entire iteration was offloaded to the GPU. This removed the CPU from the critical path and eliminated the overhead caused by variation in host execution time, which can otherwise create cascading delays when scaling to clusters of 2,000+ GPUs.

2. CuTe DSL and kernel fusions

To fuse memory-bandwidth-bound layers with grouped GEMM operations and to achieve the synchronization-free execution required by CUDA graphs, NVIDIA leveraged CuTe DSL (https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute dsl general/dsl introduction.html) for advanced kernel fusions. This enabled developers to combine math and memory-handling operations directly at the hardware layer, keeping data local to the registers and avoiding expensive round-trips to global memory. Additionally, support for dynamic tile scheduling hid unfused reads and writes behind GEMM operations, enabling an efficient overlap with communication kernels.

CuTe DSL also enabled the implementation of kernels that consume shape arguments directly from GPU memory, where they were computed beforehand by another GPU kernel. This removed the need for CPU-GPU synchronization even for dynamic shapes that are not known until runtime, completely removing the CPU from the critical path for token-dropless MoEs.

Together with the enablement of CUDA graphs, these advanced fusions (https://developer.nvidia.com/blog/boosting-moe-training-throughput-with-advanced-fusion-kernels/) provide more than 8% end-to-end benefit on DeepSeek-V3 and a 93% end-to-end speedup on GPT-OSS.

3. MXFP8 attention block

Traditionally, MoE training workloads have used 16-bit precision for attention computation. This round, an MXFP8 attention recipe was developed to improve performance without impacting model quality. It provided an end-to-end speedup for the DeepSeek-V3 benchmark while preserving the standard math required for attention operations. The recipe keeps the input tensors of all batched matrix-multiply operations in the attention block in 8-bit precision, taking advantage of FP8 math that executes faster on the hardware than the 16-bit floating-point datapath.
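To make the 8-bit representation concrete, here is a minimal pure-Python sketch of MX-style block quantization: blocks of 32 values share one power-of-two scale, and each value is rounded to FP8 E4M3 (3 mantissa bits, maximum finite value 448). This is an illustration under stated assumptions, not the cuDNN or Transformer Engine kernel. The function names are hypothetical, and the scale rule used here (the smallest power of two that keeps the block maximum within the E4M3 range) is one simple choice that the production recipe may not share.

```python
import math

E4M3_MAX = 448.0    # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_e4m3(x):
    """Round a float to the nearest value representable in FP8 E4M3, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)        # subnormals reuse the smallest normal exponent
    quantum = 2.0 ** (exp - 3)            # spacing between neighbours with 3 mantissa bits
    q = min(round(mag / quantum) * quantum, E4M3_MAX)
    return math.copysign(q, x)


def pow2_scale_exponent(amax):
    """Smallest exponent k such that amax / 2**k fits in the E4M3 range (assumed scale rule)."""
    m, e = math.frexp(amax / E4M3_MAX)
    k = e - 1 if m == 0.5 else e          # ceil(log2(amax / 448)) without rounding surprises
    return max(-127, min(127, k))         # the shared scale is an 8-bit exponent


def mxfp8_quantize(values, block_size=32):
    """Return a list of (shared_scale_exponent, fp8_elements) pairs, one per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale_exp = pow2_scale_exponent(amax) if amax > 0 else 0
        scale = 2.0 ** scale_exp
        blocks.append((scale_exp, [round_to_e4m3(v / scale) for v in block]))
    return blocks


def mxfp8_dequantize(blocks):
    """Multiply each element by its block's shared scale to recover approximate values."""
    return [q * 2.0 ** scale_exp for scale_exp, elems in blocks for q in elems]


# Hypothetical activations with a different dynamic range in each position.
values = [math.sin(i) * (1 + i % 7) for i in range(64)]
blocks = mxfp8_quantize(values)
restored = mxfp8_dequantize(blocks)
worst = max(abs(a - b) for a, b in zip(values, restored))
print("blocks:", len(blocks), "worst absolute error:", worst)
```

Because the scale is shared per block rather than per tensor, a block with small values keeps its precision even when another block holds much larger ones, which is the property that makes 8-bit inputs viable for the attention matrix multiplies.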
This kernel is available in cuDNN (https://nvidia.github.io/cudnn-frontend/mxfp8-attention-scaling/) through the Transformer Engine library.

4. Router and hybrid EP optimizations

The MoE router dynamically assigns tokens to specialized expert layers, making its performance an important factor in cluster-wide training bottlenecks. Multiple elementwise kernels in the router, including the top-k and score computations, were fused to enhance performance. To maximize hardware utilization, these kernels were transitioned from FP64 to FP32 math operations. This optimization delivered a kernel speedup of 5x. Additionally, several elementwise metadata processing kernels were fused within HybridEP, complemented by dedicated performance tuning of the key permute/unpermute kernels. Overall, these optimizations yielded a performance gain of 5% end-to-end.

5. 1F1B all-to-all overlap optimizations

A dedicated 1F1B (One Forward, One Backward) all-to-all (A2A) overlap scheme was previously introduced into Megatron-Core (https://arxiv.org/pdf/2603.07685) to hide MoE communication behind computation at the batch level. For this MLPerf round, the execution efficiency of this scheme was significantly improved. While 1F1B scheduling initially introduced notable CPU overhead, capturing the full iteration within a CUDA graph eliminated this host-side bottleneck. Performance was further enhanced by prioritizing the communication stream, employing dynamically scheduled CuTe DSL kernels, and enabling delayed weight gradient (wgrad) support for the new CuTe DSL GEMMs. In the steady state, these adjustments achieved nearly 100% A2A communication overlap, resulting in an overall 8% performance benefit.

6. Minimizing imbalance between pipeline stages

As individual computational kernels get faster, underlying imbalances between pipeline parallel stages become more pronounced.
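A small numeric sketch can show why faster kernels make stage imbalance more visible. Everything here is hypothetical: the stage times, the fixed extra work on the last stage, and the imbalance metric itself (the gap between the slowest stage and the average stage, as a fraction of the slowest) are assumptions for illustration, not the definition used in the MLPerf submission.

```python
def pipeline_imbalance(stage_times_ms):
    """Gap between the slowest stage and the mean stage, as a fraction of the slowest stage."""
    slowest = max(stage_times_ms)
    mean = sum(stage_times_ms) / len(stage_times_ms)
    return (slowest - mean) / slowest


def stage_times(layer_ms, layers_per_stage=4, num_stages=4, extra_last_ms=2.0):
    """Per-microbatch time of each stage; the last stage carries fixed extra work
    (standing in for something like a logits GEMM plus loss computation)."""
    times = [layer_ms * layers_per_stage] * num_stages
    times[-1] += extra_last_ms
    return times


# Same layout, but the transformer-layer kernels become 2x faster.
before = pipeline_imbalance(stage_times(layer_ms=5.0))   # stages: 20, 20, 20, 22 ms
after = pipeline_imbalance(stage_times(layer_ms=2.5))    # stages: 10, 10, 10, 12 ms

print(f"imbalance before: {before:.3f}")  # about 0.068
print(f"imbalance after:  {after:.3f}")   # 0.125
```

The extra 2 ms on the last stage does not change, so once the per-layer kernels speed up it accounts for a larger share of the step, and every other stage waits proportionally longer for the slowest one.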
NVIDIA optimized the layout and balance of these pipeline parallel stages, minimizing structural idling, or “bubble time” (https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/ pipeline parallelism). Pipeline imbalance is a major bottleneck in pipeline parallelism (PP) efficiency. DeepSeek-V3 uses a hybrid layer setting with three dense layers at the front and Multi-Token Prediction (MTP) plus a logits GEMM with cross-entropy at the end, so the first and last stages carry different work than the stages in between. To resolve this issue, Megatron-Core’s flexible pipeline layout support was leveraged to carefully balance the stages, while MXFP8 precision was adopted for the logit projection GEMM to reduce its execution time on the critical path. Using MXFP8 for the logit projection GEMM didn’t impact the numerical stability of the benchmark. These adjustments reduced pipeline imbalance to less than 1%, translating to a 4% end-to-end performance gain.

Continuous full-stack co-design: Sum of all the parts

While standardized benchmarks capture point-in-time performance metrics, a major driver of actual developer value is the continuous trajectory of software optimization. Over the last three months, close collaboration between hardware and software engineering teams has unlocked significant optimization milestones for NVIDIA platforms.

This rapid pace of innovation spans the entire NVIDIA software stack. Rather than relying on optimizations in a single isolated layer, the innovations described above show how performance enhancements were engineered in parallel across multiple foundational CUDA-X libraries, frameworks, and APIs, including cuDNN, Transformer Engine, CuTe DSL, Megatron Core, and cuBLAS. Megatron Bridge (https://docs.nvidia.com/nemo/megatron-bridge/latest/) serves as the central packaging layer that integrates these cross-stack improvements, making them immediately available to developers in a unified ecosystem.
To demonstrate this using the latest NVIDIA NeMo container 26.06 release, the training performance (https://docs.nvidia.com/nemo/megatron-bridge/nightly/performance-summary.html model-deepseekv3) of the NVIDIA Blackwell Ultra GB300 on DeepSeek-V3 improved about 1.3x, going from 1,298 TFLOPS/GPU to 1,648 TFLOPS/GPU (6,338 tokens/sec/GPU).

This performance uplift in three short months is the direct product of full-stack co-design: the systematic elimination of micro-bottlenecks across communication protocols, routing layers, and compute kernels, all without requiring changes to the underlying silicon. This continuous optimization trajectory directly elevates NVIDIA Goodput by squeezing out system overhead and maximizing the percentage of time GPUs spend doing useful work.

Consequently, infrastructure operators do not just get high theoretical peak capabilities. They get a mature platform that converts those raw FLOPS into continuous, productive training progress. This enables existing infrastructure deployments to capture immediate training efficiency dividends as the software ecosystem matures.

Platform comparison: Blackwell Ultra GB300 vs. GB200

Beyond software gains, comparing configurations within the Blackwell family illustrates how subtle hardware adjustments complement full-stack optimization. As shown in Figure 2 below, Blackwell Ultra GB300 provides a significant training performance uplift over the baseline Blackwell GB200 across both dense foundational models and complex MoE systems.

MLPerf Training v5.1 and v6.0, closed division. Results from entries: 6.0-0022, 6.0-0102, 6.0-0017, 6.0-0078, 5.1-0072, 6.0-0013, 5.1-0067, and 6.0-0031. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

This speedup stems from two primary advantages: higher memory capacity and a larger power budget.
Deep MoE architectures are highly memory-bound during large-scale training. The expanded memory of the GB300 accommodates the added memory overhead introduced by full-iteration CUDA graphs without requiring sub-optimal configurations or layer recomputation. Additionally, increased memory capacity enables developers to use smaller model-parallel communication groups. By keeping larger portions of the model local to the chip, the system spends less time waiting on cross-GPU communication, translating directly to higher operational throughput.

Full-stack innovation and scale in MLPerf Training 6.0

The MLPerf Training 6.0 results firmly establish NVIDIA’s full-stack approach as the definitive standard for accelerating complex generative AI workloads across the industry. By securing a clean sweep and winning every benchmark in this round, the platform demonstrated unmatched execution speed in time-to-train metrics. Whether training dense foundational models or navigating the intricate token-routing mechanics of massive MoE architectures, NVIDIA delivers unrivaled performance across the board.

These benchmark successes are driven by rapid software innovation, continuous extreme co-design, and the maximized efficiency of NVIDIA’s Goodput. Through engineering breakthroughs implemented across Megatron Bridge, cuDNN, and the Transformer Engine, including full-iteration CUDA graphs, CuTe DSL kernel fusions, and communication and pipeline optimizations, NVIDIA customers regularly extract large performance gains directly from the software layer. This rapid pace of optimization enables developers to capture immediate training efficiency dividends on their existing infrastructure investments as the software ecosystem matures.

Ultimately, the true metric of enterprise readiness is performance delivered at maximum deployment scale.
The NVIDIA platform successfully demonstrated strong scaling up to 8,192 active GPUs running simultaneously on production-ready cloud architectures. This proven capability to orchestrate massive training clusters ensures that enterprises can reliably compress standard multi-month training cycles into a matter of minutes or hours, dramatically accelerating the time-to-market for the next generation of AI breakthroughs.