
ETH Zurich Enhances Neural Network Efficiency with New Technique

Researchers at ETH Zurich have developed a new technique called fast feedforward networks (FFFs) that has the potential to reduce the computational needs of neural networks by over 99%.


Researchers at ETH Zurich have made a significant breakthrough in neural network efficiency, introducing a technique that could cut the computation required for inference by more than 99%.

Fast Feedforward Networks: The Next Step in AI Efficiency

The core of this advancement lies in what the researchers call “fast feedforward” (FFF) layers, which substitute for the traditional feedforward layers common in transformer-based large language models (LLMs) like GPT-3. Feedforward layers, known for their heavy computational requirements, rely on dense matrix multiplication (DMM), an operation that multiplies every input against the weights of every neuron in the layer. By pivoting to conditional matrix multiplication (CMM), FFF layers evaluate each input and activate only a select subset of neurons per inference, considerably lowering the number of operations required.

At its essence, CMM circumvents the inefficiencies of DMM by ensuring that no input interacts with more than a small, input-dependent subset of neurons. This targeted neuron activation within the fast feedforward layers enables a drastic reduction in computational overhead.
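To make the contrast concrete, here is a minimal NumPy sketch of the two operations. It is an illustration rather than the researchers' implementation: the layer sizes are arbitrary, and the random neuron selection is a stand-in for FFF's learned routing. The dense version multiplies the input against every neuron's weights; the conditional version touches only the columns the router has picked.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_neurons, k = 64, 4096, 12   # sizes chosen for illustration only

x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, n_neurons))

# Dense matrix multiplication (DMM): every neuron participates.
dense_out = x @ W                    # d_in * n_neurons multiply-adds

# Conditional matrix multiplication (CMM): a routing rule picks a small
# subset of neurons, and only their weight columns are touched. The
# random choice here is a stand-in; in FFF the selection comes from a
# learned binary tree of decision neurons.
selected = rng.choice(n_neurons, size=k, replace=False)
cmm_out = x @ W[:, selected]         # d_in * k multiply-adds

print(dense_out.shape, cmm_out.shape)   # (4096,) (12,)
```

With these illustrative sizes, the conditional path performs 12 of the dense layer's 4,096 column products, roughly 0.3% of the multiply-adds.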

Implications and Evaluations

As part of their research, the ETH Zurich team demonstrated the effectiveness of the new technique by developing a modified BERT model, dubbed UltraFastBERT. This variant incorporates the fast feedforward layers, restructuring each layer's neurons into a balanced binary tree and engaging only specific branches based on the input.
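A toy sketch of that tree routing, with random weights standing in for the learned decision neurons, might look like the following: at each level of the tree, one decision neuron sends the input left or right, so only a single root-to-leaf path is ever evaluated.

```python
import numpy as np

def fff_route(x, node_weights, depth):
    """Descend a balanced binary tree of decision neurons and return
    the index of the single leaf selected for input x.

    node_weights has shape (2**depth - 1, d_in): one weight vector per
    internal node (toy stand-ins for trained parameters).
    """
    node = 0
    for _ in range(depth):
        go_right = float(x @ node_weights[node]) > 0.0  # this node's decision
        node = 2 * node + (2 if go_right else 1)        # heap-style child index
    return node - (2 ** depth - 1)                      # leaf index in [0, 2**depth)

rng = np.random.default_rng(1)
d_in, depth = 64, 12                 # depth 12 gives 4096 leaves
nodes = rng.standard_normal((2 ** depth - 1, d_in))
leaf = fff_route(rng.standard_normal(d_in), nodes, depth)
print(leaf)   # reached after evaluating only `depth` decision neurons
```

Descending the tree costs one dot product per level, so the number of neurons consulted grows with the logarithm of the layer width rather than with the width itself.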

BERT is an AI model that can understand natural language and perform various tasks with it. It was developed by Google in 2018 and has become one of the most popular and influential models in the field of natural language processing. BERT stands for Bidirectional Encoder Representations from Transformers: it processes text in both directions, left to right and right to left, and uses a type of neural network called a transformer to encode the meaning of words and sentences.

The performance of UltraFastBERT was benchmarked on the General Language Understanding Evaluation (GLUE) datasets, where it retained at least 96.0% of the original BERT model's performance. Furthermore, the best-performing UltraFastBERT model matched the conventional BERT's results while using a mere 0.3% of its feedforward neurons.
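That 0.3% figure lines up with the tree geometry. Assuming 4,095 neurons per feedforward layer, the count reported in the underlying paper (an assumption here, as the article does not state it), a single root-to-leaf pass touches only 12 of them:

```python
import math

n_neurons = 4095   # per-layer neuron count assumed from the underlying paper
per_pass = int(math.log2(n_neurons + 1))    # balanced-tree depth: 12
print(per_pass, per_pass / n_neurons)       # 12 0.0029... -> roughly 0.3%
```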

Despite these advancements, challenges remain, particularly in the area of algorithm optimization. While dense matrix multiplication benefits from a wealth of hardware and software enhancements, the same level of optimization has not yet been attained for CMM. However, initial attempts by the researchers to develop an implementation based on CPU and GPU instructions yielded an impressive 78-fold increase in speed during the inference stage. They suggest that with dedicated hardware and more sophisticated low-level algorithm integration, it's possible to exceed a 300-fold improvement in inference speed, which would have a notable impact on the rate at which language models generate tokens.

This research contributes to the ongoing efforts to alleviate the memory and compute bottlenecks often associated with large language models, laying the groundwork for more resource-efficient and powerful AI systems.

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.
