NVIDIA's new open-source TensorRT-LLM software, scheduled for release in the coming weeks, has demonstrated a significant performance boost. In tests using the GPT-J 6B model, TensorRT-LLM-equipped H100s delivered an eightfold performance improvement over the A100, up from the previous fourfold advantage. When evaluated on Meta's Llama 2 LLM, the TensorRT-LLM-enhanced H100s surpassed A100s by a factor of 4.6, a marked improvement on the 2.6x recorded before the update.
In a bid to boost the performance of Large Language Models (LLMs), NVIDIA has introduced TensorRT-LLM, an open-source library that accelerates LLMs on the company's hardware.
TensorRT-LLM runs on NVIDIA Tensor Core GPUs. Its primary function is to offer developers an environment to experiment with and build new large language models, which form the foundation of generative AI platforms like ChatGPT. The software focuses on inference: the stage after training in which a model applies the links between concepts it has learned to make predictions on new inputs.
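To make the training-versus-inference distinction concrete, here is a toy sketch in pure Python (a made-up bigram "model" and corpus for illustration only; real LLM training and inference are vastly more involved): training counts patterns in data, while inference applies the trained model to predict what comes next.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """'Training': count which word follows which in the corpus."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(model, start, max_tokens=5):
    """'Inference': the trained model uses the word-to-word links it
    has seen to greedily predict the most likely next word."""
    out = [start]
    for _ in range(max_tokens):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

model = train_bigram("the gpu runs the model the gpu runs fast")
print(generate(model, "the"))
```

The same trained `model` can serve any number of inference requests; that serving step, not training, is what TensorRT-LLM targets.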
Innovative Techniques Behind the Boost
The challenge with serving LLMs lies in the variability of their workloads: requests differ widely in prompt size and output length, making it tough to batch them and execute them simultaneously. NVIDIA and its partners tackled this challenge by pairing TensorRT-LLM with an advanced scheduling method termed “in-flight batching,” which segments text generation into multiple subtasks.
Essentially, the system can admit new requests into a running batch as slots free up, rather than waiting for the entire batch to complete. TensorRT-LLM also bundles a TensorRT deep learning compiler, optimized kernels, and pre- and post-processing steps, and facilitates communication across multiple GPUs and nodes. The result is leading performance on NVIDIA's GPUs, enabling novel large language model experimentation, rapid customization, and peak performance.
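The scheduling idea can be illustrated with a toy simulation (a minimal sketch, not NVIDIA's implementation; the request lengths and batch size are made-up illustrative values). Static batching must wait for the slowest request in each batch before starting the next one, while in-flight batching refills a freed slot on the very next decoding step:

```python
import collections

def static_batch_steps(request_lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(request_lengths), batch_size):
        steps += max(request_lengths[i:i + batch_size])
    return steps

def in_flight_batch_steps(request_lengths, batch_size):
    """In-flight (continuous) batching: a finished request's slot is
    refilled from the queue on the next decoding step."""
    queue = collections.deque(request_lengths)
    active = []  # tokens remaining for each in-flight request
    steps = 0
    while queue or active:
        # Top up the batch from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step: every active request emits one token;
        # a request on its last token leaves the batch.
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps

# Requests needing very different numbers of output tokens.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_steps(lengths, 4))     # 200: each batch is gated by its 100-token request
print(in_flight_batch_steps(lengths, 4))  # fewer steps: short requests don't hold slots
```

In this simulation the short requests no longer wait on the long ones, which is the utilization gain in-flight batching is designed to capture.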
Benchmarking Excellence and Future Prospects
NVIDIA's GH200 Grace Hopper Superchip, which combines a Hopper GPU with a Grace CPU, has showcased impressive results in the latest MLPerf industry benchmarks. The superchip, along with the H100 GPUs, led in all of MLPerf's data center tests, including computer vision, speech recognition, medical imaging, and the more demanding tasks of LLM inference and recommendation systems. Moreover, NVIDIA has announced an upcoming software update that will further enhance the AI inference capabilities of its GH200 Grace Hopper Superchip.
AI is a major area of growth for Nvidia, and the company is already reaping the rewards of its leading role in the market. Recent analyses suggest that Nvidia earns a margin approaching 1,000% on every H100 Tensor Core GPU it sells. Estimates from Raymond James, a financial services firm, shared on Barron's, put the production cost of one such GPU at around $3,320. In stark contrast, Nvidia's selling price for these GPUs ranges between $25,000 and $30,000, depending on the order volume.