NVIDIA has introduced TensorRT-LLM, an open-source library designed to accelerate the performance of Large Language Models (LLMs) on NVIDIA hardware.
Generative AI has surged since OpenAI introduced ChatGPT. Companies now deploy conversational AI systems, commonly known as AI chatbots, to serve customer needs. NVIDIA, a leading player in the GPU market, is at the forefront of this transformation, providing the hardware used to train large language models such as those behind ChatGPT and GPT-4, as well as BERT and Google's PaLM.
TensorRT-LLM: A New Era for AI Inferencing
TensorRT-LLM is an open-source library that runs on NVIDIA Tensor Core GPUs. Its primary function is to give developers an environment to experiment with and build new large language models, which form the foundation of generative AI platforms like ChatGPT. The software focuses on inference, the stage at which a trained model is actually run to link concepts and make predictions, as opposed to the training process itself. NVIDIA emphasizes that TensorRT-LLM can significantly accelerate inference speed on its GPUs.
The software supports contemporary LLMs such as Meta Llama 2, OpenAI GPT-4, Falcon, Mosaic MPT, BLOOM, and more. It incorporates the TensorRT deep learning compiler, optimized kernels, and pre- and post-processing tools, and it supports multi-GPU and multi-node communication. A standout feature is that developers do not need in-depth knowledge of C++ or NVIDIA CUDA to use TensorRT-LLM.
Naveen Rao, the Vice President of Engineering at Databricks, commented on the software's efficiency, stating, “TensorRT-LLM is easy to use, feature-packed… and is efficient.” He further added that it “delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”
Performance Enhancements with TensorRT-LLM
For tasks like article summarization, LLMs run faster with TensorRT-LLM on an NVIDIA H100 GPU than on the older NVIDIA A100 chip without the library. Specifically, on GPT-J 6B inference, the H100 alone delivers a fourfold speedup over the A100; paired with TensorRT-LLM, the speedup grows to eight times.
A key feature of TensorRT-LLM is its use of tensor parallelism. This technique divides different weight matrices across devices, enabling inference to be conducted simultaneously across multiple GPUs and servers.
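The idea behind tensor parallelism can be illustrated with a minimal NumPy sketch (this is a conceptual model, not the TensorRT-LLM API): a weight matrix is split column-wise into shards, each shard's product is computed independently (on separate GPUs in a real system), and the partial results are concatenated to recover the full output.

```python
import numpy as np

def tensor_parallel_matmul(x, w, num_devices):
    """Conceptual tensor parallelism: split w column-wise across
    'devices', compute each partial product independently, then
    gather the shards back into the full result."""
    shards = np.array_split(w, num_devices, axis=1)  # one weight shard per device
    partials = [x @ shard for shard in shards]       # each runs on its own GPU in practice
    return np.concatenate(partials, axis=1)          # gather step

# Activations for a batch of 2 with hidden size 8, projected to 16 outputs.
x = np.random.randn(2, 8)
w = np.random.randn(8, 16)

y_parallel = tensor_parallel_matmul(x, w, num_devices=4)
y_single = x @ w  # reference single-device computation

# The sharded computation matches the single-device result.
assert y_parallel.shape == (2, 16)
assert np.allclose(y_parallel, y_single)
```

In a real deployment the shards live on different GPUs and the gather step is a cross-device communication, which is why TensorRT-LLM's multi-GPU and multi-node support matters.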
NVIDIA's Vision for Cost-Effective AI
Deploying LLMs can be costly. NVIDIA argues that LLMs are reshaping how data centers and AI training show up on a company's financial statements. With TensorRT-LLM, NVIDIA aims to let businesses build sophisticated generative AI without a corresponding surge in total cost of ownership.
NVIDIA's TensorRT-LLM is currently available in early access to members of the NVIDIA Developer Program, with a wider release anticipated in the coming weeks.