Qualcomm has partnered with Arm server processor designer Ampere Computing to bolster AI infrastructure capabilities. The collaboration was unveiled during Ampere’s annual strategy and roadmap update, where the two companies introduced a 2U server combining eight Qualcomm Cloud AI 100 Ultra accelerators with 192 Ampere CPU cores for machine-learning inference.
The Qualcomm Cloud AI 100 Ultra is pitched as a performance- and cost-optimized AI inference solution for generative AI and large language models (LLMs). Each card packs up to 576 MB of on-die SRAM and 64 AI cores, targeting the scaling needs of both classic and generative AI workloads, including computer vision, natural language processing, and transformer-based LLMs.
High-Density Arm AI Solutions
Ampere says this configuration can support up to 56 AI accelerators and 1,344 CPU cores in a standard 12.5kW rack, eliminating the need for expensive liquid cooling. The company also announced that its upcoming server processor will feature 256 CPU cores and up to 12 memory channels, moving to TSMC’s 3nm process technology next year.
Ampere and Oracle have demonstrated that LLMs can run on CPUs, though with certain limitations. CPUs are generally better suited to smaller models in the seven-to-eight-billion-parameter range and to smaller batch sizes. Qualcomm’s AI 100 accelerators, with their higher memory bandwidth, are designed to handle larger models or bigger batches, making them more efficient for inferencing, as the rough estimate below illustrates.
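At small batch sizes, token generation is largely memory-bound: producing each token means streaming the model’s weights through memory at least once. The sketch below is a back-of-the-envelope estimate under that simplifying assumption; the ~300GB/s CPU figure is an illustrative assumption, while 548GB/s is the AI 100 Ultra bandwidth Qualcomm quotes later in this piece.

```python
# Rough ceiling on single-stream decode throughput, assuming every weight is
# read from memory once per generated token (the memory-bound regime).

def decode_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    """Approximate upper bound on tokens/sec for one decode stream."""
    model_gb = params_billion * bytes_per_param   # weight footprint in GB
    return bandwidth_gb_s / model_gb

# 8B-parameter model at INT8 on a hypothetical ~300 GB/s CPU socket.
print(decode_tokens_per_sec(8, 1, 300))    # ~37 tokens/sec ceiling

# Same model on the AI 100 Ultra's 548 GB/s of LPDDR4x.
print(decode_tokens_per_sec(8, 1, 548))    # ~68 tokens/sec ceiling

# A 100B-parameter model at INT8 on the same card.
print(decode_tokens_per_sec(100, 1, 548))  # ~5.5 tokens/sec ceiling
```

Larger batches amortize those weight reads across more concurrent requests, which is why the bandwidth gap matters more as batch size grows.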
Qualcomm’s AI 100 Ultra Accelerators
Qualcomm’s Cloud AI 100 accelerators, while not as widely recognized in the datacenter AI chip market as Nvidia’s GPUs or Intel’s Gaudi, have been available for several years. The AI 100 Ultra, introduced last fall, is a slim, single-slot PCIe card aimed at LLM inferencing. At 150W, its power requirements are modest compared to the 600W and 700W GPUs from AMD and Nvidia. Qualcomm claims a single AI 100 Ultra can run 100-billion-parameter models, with a pair supporting GPT-3-scale models (175 billion parameters).
The 64-core AI 100 Ultra card delivers 870 TOPS at INT8 precision and is equipped with 128GB of LPDDR4x memory offering 548GB/s of bandwidth. Memory bandwidth is essential for scaling AI inferencing to larger batch sizes. Qualcomm has also implemented software optimizations such as speculative decoding and micro-scaling (MX) formats to boost throughput and efficiency. Speculative decoding uses a smaller draft model to propose tokens, which the larger model then verifies and corrects. Micro-scaling formats are a form of quantization that shrinks a model’s memory footprint by storing weights at lower precision, with small blocks of values sharing a common scale factor.
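To make those two techniques concrete, here is a minimal Python sketch, not Qualcomm’s implementation: a toy MX-style block quantizer in which each block of weights shares one scale, and the greedy variant of speculative decoding in which a stand-in draft model proposes tokens and a stand-in target model accepts or corrects them. The `draft_next` and `target_next` callables are placeholders for real models.

```python
import numpy as np

def block_quantize(weights, block_size=32, bits=8):
    """MX-style block quantization sketch: each block of `block_size` values
    shares one scale factor and is stored at reduced integer precision."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8 bits
    w = np.asarray(weights, dtype=np.float32).ravel()
    w = np.pad(w, (0, (-len(w)) % block_size))       # pad to a whole block
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid divide-by-zero
    q = np.round(blocks / scales).astype(np.int8)    # compressed weights
    return q, scales                                 # dequantize as q * scales

def speculative_decode(draft_next, target_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: the small draft model proposes up
    to k tokens; the large target model keeps them until the first mismatch
    and substitutes its own token there. (In practice the target scores all
    proposed tokens in one batched pass, which is where the speedup comes
    from; this loop only illustrates the accept/correct logic.)"""
    seq = list(prompt)
    goal = len(prompt) + n_tokens
    while len(seq) < goal:
        # Draft model speculates k tokens cheaply.
        ctx, proposal = list(seq), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model checks each proposed token.
        for t in proposal:
            expected = target_next(seq)
            if t == expected:
                seq.append(t)          # draft token accepted
            else:
                seq.append(expected)   # corrected by the target model
                break
            if len(seq) >= goal:
                break
    return seq

if __name__ == "__main__":
    q, scales = block_quantize(np.random.default_rng(0).normal(size=1000))
    print(q.dtype, q.shape, scales.shape)            # int8 (32, 32) (32, 1)

    # Toy "models": the draft usually agrees with the target, sometimes not.
    target = lambda seq: (seq[-1] * 3 + 1) % 50
    draft = lambda seq: target(seq) if seq[-1] % 7 else 0
    print(speculative_decode(draft, target, [1], n_tokens=10))
```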