NVIDIA has announced that its open-source TensorRT-LLM library, formerly limited to data center use, is now available for Windows personal computers. The move is set to boost the performance of large language models (LLMs) running locally on Windows desktops and laptops.
Enhanced Speed and Accuracy with GeForce RTX GPUs
According to NVIDIA, LLMs can run up to four times faster on Windows computers equipped with NVIDIA GeForce RTX graphics processing units (GPUs). The speedup should noticeably improve more demanding LLM applications, such as writing and coding assistants that generate multiple unique auto-complete suggestions simultaneously: with faster inference, users get a larger, higher-quality set of candidates to choose from without waiting longer. The pattern is illustrated in the sketch below.
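As an illustration of that pattern (using the generic Hugging Face transformers API rather than NVIDIA's stack, with gpt2 as a stand-in model), a single batched call can return several distinct completions at once:

```python
# Illustrative only: several auto-complete candidates from one batched call.
# TensorRT-LLM's value proposition is accelerating exactly this kind of
# workload on RTX GPUs; the library and model here are generic stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,          # sample so the candidates differ
    num_return_sequences=4,  # four auto-complete candidates in one pass
    pad_token_id=tokenizer.eos_token_id,
)
for i, seq in enumerate(outputs):
    print(f"--- candidate {i} ---")
    print(tokenizer.decode(seq, skip_special_tokens=True))
```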
In a demonstration provided by NVIDIA, a standard Meta Llama 2 LLM struggled to produce an accurate response when asked, “How does NVIDIA ACE generate emotional responses?” However, when paired with a vector library or vector database, the TensorRT-LLM-equipped model not only delivered an accurate answer but also did so at an accelerated pace.
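Below is a minimal sketch of the retrieval step behind that kind of demo, assuming nothing about NVIDIA's actual implementation: the documents, the toy bag-of-words embedding, and the prompt template are all placeholders that only show the data flow of a vector search feeding an LLM prompt.

```python
# Toy retrieval-augmented generation: embed a question, find the closest
# document in an in-memory vector store, and prepend it to the prompt.
import numpy as np

documents = [
    "NVIDIA ACE generates emotional responses by pairing speech and "
    "animation models with an LLM tuned for character dialogue.",
    "RTX Video Super Resolution upscales streamed video on GeForce GPUs.",
]

vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding; a real system would use a neural model."""
    counts = np.array([text.lower().split().count(w) for w in vocab], float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

doc_vectors = np.stack([embed(d) for d in documents])

question = "How does NVIDIA ACE generate emotional responses?"
scores = doc_vectors @ embed(question)      # cosine similarity (unit vectors)
best_doc = documents[int(np.argmax(scores))]

# The augmented prompt below is what the LLM actually sees.
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

A production setup would replace the word-count vectors with a learned embedding model and the list of documents with a real vector database, but the retrieve-then-prompt flow is the same.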
GeForce Driver Update Released with More AI-based Features
In conjunction with the TensorRT-LLM announcement, NVIDIA also rolled out new features in its latest GeForce driver update. These include version 1.5 of the RTX Video Super Resolution feature, designed to deliver better upscaling and reduce compression artifacts when watching online video.
The update also introduced TensorRT AI acceleration for Stable Diffusion Web UI, letting GeForce RTX GPU users generate images with the popular AI art tool at significantly higher speeds. The addition, which shortens iteration time for creatives, underlines NVIDIA's commitment to continuous AI-driven enhancements.
NVIDIA launched TensorRT-LLM in September as an open-source library for defining and optimizing new large language models (LLMs), the models that power generative AI platforms such as ChatGPT, and running them on Tensor Core GPUs. Rather than training, the library targets inference: the stage in which a trained model processes input and generates predictions. NVIDIA claims that TensorRT-LLM makes inference substantially faster on its GPUs.
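As a rough sketch of what running inference through the library looks like, the snippet below assumes the high-level Python `LLM` API that ships with recent TensorRT-LLM releases (earlier versions relied on per-model build scripts); the model name and sampling settings are illustrative:

```python
# Sketch of LLM inference with TensorRT-LLM's high-level Python API.
# The model identifier and sampling parameters below are placeholders.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the model behind the scenes.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["How does TensorRT-LLM speed up inference?"], params)
for output in outputs:
    print(output.outputs[0].text)
```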
The library supports many modern LLMs, including Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, and BLOOM. It combines the TensorRT deep learning compiler with optimized kernels and tools for pre- and post-processing, and it handles communication across multiple GPUs and nodes. Notably, developers can use TensorRT-LLM without deep knowledge of C++ or NVIDIA CUDA.