AI Computing: MLCommons New MLPerf Inference v3.1 LLM Benchmark Shows 40% Performance Increase

The latest MLPerf Inference v3.1 benchmark round saw record participation, with over 13,500 performance results.

The recent unveiling of MLPerf Inference v3.1 introduces new benchmarks for Large Language Models (LLMs) and recommendation systems, a significant stride in AI testing. MLPerf benchmarks, developed by MLCommons, a consortium of AI leaders from academia, research labs, and industry, are designed to provide unbiased evaluations of training and inference performance for hardware, software, and services.

The new round drew an unprecedented level of participation, with over 13,500 performance results and performance gains of up to 40%. The 26 distinct submitters include major tech companies such as Google, Intel, and NVIDIA, as well as first-time participants such as Connect Tech, Nutanix, Oracle, and TTA.

David Kanter, the Executive Director of MLCommons, emphasized the importance of this contribution, stating, “Submitting to MLPerf is not trivial… It requires real engineering work and is a testament to our submitters' commitment to AI, to their customers, and to ML.”

Benchmark Results

The primary objective of MLPerf Inference is to gauge the speed at which AI systems can run models across different deployment scenarios. These range from advanced generative AI chatbots to vehicle safety features such as automatic lane-keeping and speech-to-text interfaces. The spotlight in this version is on the introduction of two benchmarks:

  1. An LLM benchmark that uses the GPT-J reference model to summarize CNN news articles. It drew 15 participants, reflecting the swift adoption of generative AI.
  2. An updated recommender benchmark, more aligned with industry standards, utilizing the DLRM-DCNv2 reference model and larger datasets, receiving nine submissions.

The results for MLPerf Inference v3.1 and MLPerf Storage v0.5 cover two questions:

  • Storage: How fast storage can supply data when training an AI model.
  • Inference: How quickly a system can produce outputs from a trained model.
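
To make the second measurement concrete, here is a minimal sketch of timing how many outputs per second a model produces. It is only an illustration: `fake_model` and `measure_throughput` are hypothetical names invented for this example, the "model" is a trivial stand-in, and MLPerf itself relies on its own load generator and defined scenarios rather than a simple loop like this.

```python
import time


def fake_model(x: int) -> int:
    """Hypothetical stand-in for a trained model's forward pass."""
    return x * 2


def measure_throughput(model, inputs):
    """Run the model once over all inputs and return (outputs, samples per second)."""
    start = time.perf_counter()
    outputs = [model(x) for x in inputs]
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    if elapsed <= 0:
        return outputs, float("inf")
    return outputs, len(inputs) / elapsed


outputs, samples_per_second = measure_throughput(fake_model, list(range(1000)))
print(f"{samples_per_second:.0f} samples/s")
```

A real measurement would also need warm-up runs, an accuracy check on the outputs, and latency constraints, all of which this sketch leaves out.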

NVIDIA's Dominance and Intel's Close Pursuit

NVIDIA's advanced chips have emerged as the top contenders in tests on a large language model, with Intel's hardware following closely. MLCommons, known for its neutral benchmarking of AI chipset performance, announced the results of its new MLPerf Inference v3.1 benchmarks.

NVIDIA showcased its GH200 Grace Hopper Superchip, which combines a Hopper graphics processing unit with a Grace central processing unit, offering enhanced memory, bandwidth, and task-shifting capabilities between the GPU and the Arm-based CPU. This chipset outperformed NVIDIA's HGX H100 system by approximately 17%. However, Intel's Habana Gaudi2 accelerators were not far behind, trailing NVIDIA's systems by just 10%.

This week, NVIDIA announced a new software update that effectively doubles the performance of its H100 AI GPU. The company's new open-source TensorRT-LLM software, scheduled for release in the coming weeks, has demonstrated a significant performance boost.

In tests using the GPT-J 6B model, the updated system showed an eightfold performance improvement over the A100, up from the previous fourfold advantage. On Meta's Llama2 LLM, the TensorRT-LLM-enhanced H100s outperformed A100s by a factor of 4.6, up from 2.6 times before the update.
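
The "doubles the performance" claim can be checked against the quoted figures with a little arithmetic. The sketch below divides each post-update speedup over the A100 by the pre-update one. It assumes the A100 baseline was the same in both measurements, which the article does not state, and `update_gain` is a name made up for this illustration.

```python
def update_gain(speedup_before: float, speedup_after: float) -> float:
    """Gain attributable to the software update, assuming an unchanged A100 baseline."""
    return speedup_after / speedup_before


# H100-over-A100 speedups quoted in the article: (before update, after update).
reported = {
    "GPT-J 6B": (4.0, 8.0),
    "Llama2": (2.6, 4.6),
}

gains = {
    model: update_gain(before, after)
    for model, (before, after) in reported.items()
}

for model, gain in gains.items():
    print(f"{model}: {gain:.2f}x from the software update")
# GPT-J 6B: 2.00x from the software update
# Llama2: 1.77x from the software update
```

On these numbers the update doubles H100 performance on GPT-J 6B, while the gain on Llama2 is closer to 1.8 times.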