Nvidia has posted record-setting results in the latest MLPerf AI benchmarks, underscoring its leadership in machine learning. The company’s systems excelled, particularly in new training tests centered on large language models (LLMs) and graph neural networks (GNNs). The round drew more than 205 results from 17 submitting organizations.
Powered by its Hopper architecture, Nvidia’s systems came out on top in new training tests that involved fine-tuning. These benchmarks are essential for applications ranging from literature databases to fraud detection and social media analytics.
The MLPerf Training benchmark suite consists of comprehensive system tests that challenge machine learning (ML) models, software, and hardware across a wide array of applications. This open-source, peer-reviewed suite establishes an equitable competitive environment that fosters innovation, enhances performance, and promotes energy efficiency within the industry.
MLPerf Training v4.0 features over 205 performance results from 17 contributing organizations, including ASUSTeK, Dell, Fujitsu, Giga Computing, Google, HPE, Intel (Habana Labs), Juniper Networks, Lenovo, NVIDIA, NVIDIA + CoreWeave, Oracle, Quanta Cloud Technology, Red Hat + Supermicro, Supermicro, Sustainable Metal Cloud (SMC), and tiny corp.
Record-Breaking Performance
In the most recent MLCommons benchmarks, Nvidia used 11,616 H100 GPUs, its largest deployment yet, to set new records in five of nine categories. These included fine-tuning the Llama-2-70B model on a government-documents dataset to improve summarization accuracy, as well as the GNN training test.
The company achieved nearly linear performance scaling, a key measure of efficiency, along with a notable reduction in training times driven by software optimizations made since the architecture’s release. Enhancements included 8-bit floating-point (FP8) operations and improved GPU-to-GPU communication; those software gains alone cut GPT-3 training times by 27%.
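For readers curious what FP8 training looks like in practice, the following is a minimal sketch using NVIDIA’s open-source Transformer Engine library, which wraps common layers so their matrix multiplications run on 8-bit floating-point tensor cores. The layer sizes and recipe settings here are illustrative assumptions, not the configuration Nvidia used in its MLPerf submission.

```python
# Minimal FP8 training sketch with NVIDIA Transformer Engine (illustrative only;
# layer sizes and recipe settings are assumptions, not Nvidia's MLPerf setup).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Hybrid recipe: E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

model = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Matmuls inside this context run in FP8; scaling factors are tracked
# automatically from the recorded amax history.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)

loss = out.float().pow(2).mean()
loss.backward()
optimizer.step()
```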
The MLPerf Training v4.0 update, the first since November 2023, also included benchmarks for image generation with Stable Diffusion and continued LLM training on GPT-3, showcasing significant improvements such as 1.8x faster training for Stable Diffusion and a 1.2x speedup for GPT-3.
Comprehensive Optimizations Across the Stack
David Kanter, the founder and executive director of MLCommons, highlighted the critical role of software and network efficiencies in complementing hardware advancements. Nvidia’s results from the MLPerf 4.0 benchmarks demonstrated comprehensive optimizations across the stack, including highly tuned FP8 kernels, an FP8-aware distributed optimizer, and optimized cuDNN FlashAttention.
cuDNN FlashAttention is an optimized implementation designed to speed up the attention mechanism used in neural networks, specifically in Transformer models. It leverages cuDNN (the CUDA Deep Neural Network library) to make attention computations more efficient on NVIDIA GPUs. FlashAttention reduces memory usage and increases processing speed by carefully managing how data is stored and accessed during the computation.
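As a rough illustration of what a fused, memory-efficient attention kernel does for the caller, the hedged sketch below uses PyTorch’s built-in `scaled_dot_product_attention`, which on recent NVIDIA GPUs can dispatch to a FlashAttention-style fused kernel. The tensor shapes are arbitrary examples, and this is not the cuDNN code path used in Nvidia’s submission.

```python
# Sketch: fused scaled-dot-product attention via PyTorch (shapes are arbitrary;
# this illustrates the FlashAttention idea, not Nvidia's cuDNN implementation).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 16, 4096, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# A fused kernel avoids materializing the full (seq_len x seq_len) attention
# matrix; it computes softmax(QK^T / sqrt(d)) V in on-chip tiles, which is what
# cuts memory traffic and speeds up long sequences.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 4096, 64])
```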
This not only underlines Nvidia’s leadership in deploying advanced GPU architectures but also emphasizes the strategic importance of continuous software enhancements. The reported gains matter for organizations planning new data centers: one such facility is expected to begin operations this year, and another is set to incorporate Nvidia’s next-generation Blackwell architecture by 2025. Advances of this kind translate into significant returns on investment for the industry, which makes Nvidia’s efforts particularly relevant.