Meta AI has introduced quantized versions of its Llama 3.2 models, expanding mobile and edge AI capability with compact designs. The new 1B and 3B parameter models are optimized to run efficiently on devices with limited power and memory, using 4-bit quantization to cut memory usage by roughly 41% and speed up inference by up to four times. Through partnerships with Qualcomm and MediaTek, Meta's Llama models now bring advanced AI to mobile CPUs and Arm-based systems, marking a shift toward faster, more privacy-oriented on-device AI.
Meta’s New Compact Models: Enhanced Efficiency and Speed
At the core of these new Llama models is a process called quantization, which reduces model size by storing weights and activations at lower bit precision, allowing the models to fit on devices with limited memory. Meta has implemented Quantization-Aware Training (QAT) with Low-Rank Adapters (LoRA), small trainable adjustments in the model's layers that preserve quality at low bit widths, so the models run efficiently without sacrificing accuracy. For developers focused on ease of deployment, Meta also offers a second method, SpinQuant, a post-training technique that optimizes models for deployment without requiring access to the original training data. These options let developers choose the right balance between accuracy and portability.
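To make the underlying idea concrete (this is an illustration, not Meta's production pipeline), here is a minimal sketch of symmetric per-tensor 4-bit weight quantization in PyTorch:

```python
import torch

def quantize_4bit(w: torch.Tensor):
    """Symmetric per-tensor 4-bit quantization (simplified illustration)."""
    # 4-bit signed integers span [-8, 7]; map the largest weight onto that range.
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    # Stored as int8 here for simplicity; real kernels pack two 4-bit values per byte.
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # rounding error is bounded by roughly scale / 2
```

QAT improves on this naive scheme by simulating the rounding during training, so the network learns weights that survive the precision loss.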
According to Meta, the new quantized models have shown significant performance improvements on Android devices like the OnePlus 12, where tests demonstrated not only reduced memory demands but also a two- to fourfold increase in speed. By leveraging PyTorch’s ExecuTorch framework within the Llama Stack, the models can be easily deployed across a variety of Arm-based CPUs, expanding their potential use in mobile and embedded applications.
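Meta distributes the quantized checkpoints ready to run, but for context, the general ExecuTorch export path looks roughly like the sketch below, based on the framework's documented Python flow; the TinyModel module is a stand-in rather than a Llama model, and exact module paths can shift between releases:

```python
import torch
# ExecuTorch export API; module path follows the ExecuTorch docs at the time
# of writing and may differ across versions.
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Capture the graph, lower it to the Edge dialect, then to the ExecuTorch format.
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported)
et_program = edge_program.to_executorch()

# The resulting .pte file is what the ExecuTorch runtime loads on-device.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```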
Meta AI's published benchmarks compare the two quantization techniques, SpinQuant and QLoRA (the QAT-plus-LoRA scheme described above), against a BF16 baseline model. Measured with the ExecuTorch framework on an Android device with an Arm CPU, and with the models optimized through Arm's KleidiAI kernels, the results show substantial improvements in both decode and prefill latency, along with a significant reduction in model size and memory usage.
The Growing Market of Edge AI Tools: Arm’s Ethos-U85 and Raspberry Pi’s AI Boards
While Meta advances with Llama models, other players in the industry are also pushing the limits of edge AI hardware. Arm announced its latest neural processing unit (NPU), the Ethos-U85, in April 2024. The NPU, which works alongside Arm Cortex-M processors, is engineered to deliver up to four times the performance and 20% higher power efficiency than its predecessor. With up to 4 TOPS (four trillion operations per second), the Ethos-U85 supports both Transformer networks and Convolutional Neural Networks (CNNs), enabling sophisticated tasks like image recognition and generative AI on low-power devices.
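That 4 TOPS figure lines up with the NPU's top-end configuration: assuming Arm's stated maximum of 2,048 MAC units clocked at 1 GHz, with each multiply-accumulate counted as two operations, a quick back-of-the-envelope check gives:

```python
# Back-of-the-envelope check of Arm's "up to 4 TOPS" claim for the Ethos-U85.
# Assumes the published top-end configuration: 2048 MACs at a 1 GHz clock.
macs = 2048          # multiply-accumulate units (maximum configuration)
clock_hz = 1e9       # 1 GHz
ops_per_mac = 2      # one multiply plus one add per cycle
tops = macs * clock_hz * ops_per_mac / 1e12
print(f"{tops:.3f} TOPS")  # ~4.096 TOPS, matching the quoted figure
```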
Arm has further streamlined the integration of the Ethos-U85 with its Corstone-320 IoT Reference Design Platform, which combines the NPU with the high-performance Cortex-M85 processor and the Mali-C55 image signal processor. This makes the Ethos-U85 suitable for AI tasks in industries requiring quick, on-site processing, such as smart home devices, retail analytics, and industrial equipment monitoring. In addition, compatibility with frameworks like TensorFlow Lite and PyTorch eases the transition for developers working on edge devices.
Raspberry Pi’s Affordable Entry into Edge AI
Recently, Raspberry Pi introduced new HAT+ boards compatible with the Raspberry Pi 5, featuring Hailo accelerator chips to bring affordable, efficient AI to hobbyists and professionals alike. Priced at $70 for the 13 TOPS version (built around the Hailo-8L) and $110 for the 26 TOPS model (built around the full Hailo-8), these AI boards enable high-performance tasks like real-time video analysis and object recognition without needing cloud access. The HAT+ boards use the Raspberry Pi 5's PCIe 3.0 connectivity, allowing for smooth data transfer and more demanding edge AI applications in areas like robotics and autonomous navigation.
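On the software side, the boards are driven through Hailo's HailoRT runtime. The sketch below loosely follows HailoRT's published Python examples; the class names and call sequence are assumptions that may not match every SDK release, and model.hef is a placeholder for a compiled Hailo model:

```python
import numpy as np
# HailoRT Python bindings; names follow Hailo's example code and may differ
# between SDK versions (treat this as an assumed, not verified, API surface).
from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams)

hef = HEF("model.hef")  # placeholder: a model compiled for the Hailo accelerator

with VDevice() as device:
    # Configure the device for the network in the HEF over the PCIe interface.
    params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = device.configure(hef, params)[0]
    in_params = InputVStreamParams.make(network_group)
    out_params = OutputVStreamParams.make(network_group)

    frame = np.zeros((1, 224, 224, 3), dtype=np.uint8)  # dummy input frame
    input_name = hef.get_input_vstream_infos()[0].name

    with network_group.activate():
        with InferVStreams(network_group, in_params, out_params) as pipeline:
            results = pipeline.infer({input_name: frame})
```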
Raspberry Pi’s modular design philosophy is central to these new additions, allowing users to equip their devices with AI capabilities as needed, without an initial high cost. The boards leverage the Raspberry Pi 5 platform’s upgraded performance, including PCIe 3.0 and support for up to 8GB of RAM, which enhances the boards’ ability to handle sophisticated AI workloads directly on-device.
Context: Edge AI Expansion and Compact Model Competition
Meta’s recent Llama model release joins a crowded field as other companies explore edge AI solutions that work on smaller, efficient hardware. Microsoft’s Phi-3-mini model, part of the Phi-3 family, is another compact language model built for mobile and edge AI tasks, with 3.8 billion parameters trained on 3.3 trillion tokens. Optimized for conversational accuracy and lower-power devices, Phi-3-mini rivals models like GPT-3.5 in performance while meeting the demand for streamlined AI processing in compact devices.
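To give a sense of how accessible these compact models are, the following is a minimal sketch of loading Phi-3-mini with Hugging Face's transformers library; the checkpoint name refers to Microsoft's published microsoft/Phi-3-mini-4k-instruct variant, and the dtype and device settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # one published Phi-3-mini variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce the memory footprint
    device_map="auto",          # requires the accelerate package
    # older transformers releases may also need trust_remote_code=True
)

inputs = tokenizer("Explain edge AI in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```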
H2O.ai’s Mississippi-0.8B and Mississippi-2B models bring efficient vision-language capabilities to the edge, specializing in Optical Character Recognition (OCR) and vision benchmark tasks. These models represent another approach to providing powerful AI in fields where fast visual interpretation is necessary, adding new options for developers seeking adaptable, small-scale AI solutions.
TinyML and Specialized Hardware: Driving New AI Applications
As interest in Tiny Machine Learning (TinyML), machine learning for low-powered devices, continues to rise, specialized hardware is emerging to meet the demand. Google's Coral Edge TPU, Intel's Movidius Neural Compute Stick, and NXP's i.MX RT1176 microcontroller are just a few examples of hardware designed to support TinyML and edge AI, catering to applications that require real-time responses with minimal energy consumption. The Coral Edge TPU, for instance, operates as a USB-based accelerator for TensorFlow Lite models, while the Movidius stick accelerates deep learning inference over a USB connection, making both practical choices for developers building low-power AI solutions.
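As an illustration of that workflow, running a compiled model on the Coral accelerator follows the standard TensorFlow Lite interpreter flow with the Edge TPU delegate loaded; in this minimal sketch, the model path, input values, and the Linux-specific delegate library name are placeholders:

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load a model compiled for the Edge TPU, attaching the Edge TPU delegate.
# "libedgetpu.so.1" is the Linux library name; it differs on macOS/Windows.
interpreter = Interpreter(
    model_path="model_edgetpu.tflite",  # placeholder path
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype (placeholder values).
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
print(result.shape)
```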
The compact, efficient nature of these AI models is proving valuable across industries, from healthcare to manufacturing. Wearable health devices can now monitor metrics like ECG and blood pressure on-device, using TinyML to analyze data without needing continuous cloud connectivity, thus maintaining user privacy. In industrial settings, real-time predictive maintenance and environmental monitoring are now feasible, where sensors powered by TinyML can alert teams to potential issues, reducing downtime.