Meta Finds That AI Errors Are Linked to Bit-Flip Hardware Issues

Bit flips occur when a binary value unexpectedly changes due to cosmic rays, electromagnetic interference, or manufacturing defects.

Meta has discovered that certain hardware faults are responsible for inaccuracies in artificial intelligence outputs. Among these issues are bit flips, which can cause silent data corruptions (SDCs): errors that go unnoticed while corrupting data and modifying AI model parameters. This can degrade or alter the intended AI results.

Silent Data Corruptions through Bit Flips

Bit flips occur when a binary value unexpectedly changes (from 1 to 0, or vice versa), typically due to cosmic rays, electromagnetic interference (EMI), or manufacturing defects affecting memory or storage devices. Researchers at Meta have determined that these faults can corrupt AI model parameters, causing incorrect or suboptimal results when models perform inference or serve data. Within AI systems, a silent data corruption can cause what is referred to as parameter corruption, where the original values of AI model parameters are silently altered.
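To see why a single flipped bit can matter so much, consider what happens to an IEEE-754 float32 model parameter when one bit of its binary encoding is inverted. The sketch below (the `flip_bit` helper is illustrative, not from Meta's tooling) shows that a flip in a low mantissa bit barely perturbs the value, while a flip in a high exponent bit can change it by dozens of orders of magnitude:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 31 = sign) in the
    IEEE-754 float32 encoding of `value` and decode the result."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

weight = 0.5
print(flip_bit(weight, 0))   # lowest mantissa bit: still ~0.5
print(flip_bit(weight, 30))  # high exponent bit: an enormous value (~1.7e38)
print(flip_bit(weight, 31))  # sign bit: -0.5
```

A corruption of the first kind is often harmless; the latter two kinds are exactly what can silently push a model's outputs off course.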

To address this problem, Meta has proposed a new metric termed the parameter vulnerability factor (PVF). It aims to standardize the evaluation of AI models’ susceptibility to parameter corruptions. Adaptable to various hardware fault models, PVF can assess different AI models and use cases. Meta’s team indicates that PVF can also be used during the training phase to understand a model’s resilience to parameter corruptions.
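One common way to estimate a metric like PVF is Monte Carlo fault injection: repeatedly flip a random bit in a random parameter, rerun inference, and count how often the output becomes incorrect. The sketch below illustrates that general idea on a toy linear scorer; the function names (`estimate_pvf`, `flip_random_bit`) and the tolerance-based notion of "incorrect output" are assumptions for illustration, not Meta's implementation:

```python
import math
import random
import struct

def flip_random_bit(value: float, rng: random.Random) -> float:
    # Flip one uniformly chosen bit in the float32 encoding of a parameter.
    (i,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << rng.randrange(32))))
    return out

def estimate_pvf(weights, predict, inputs, tolerance, trials, seed=0):
    """Monte Carlo PVF estimate: the fraction of single-bit parameter
    corruptions that push some output beyond `tolerance` of the
    fault-free reference (NaN outputs count as incorrect)."""
    rng = random.Random(seed)
    reference = [predict(weights, x) for x in inputs]
    incorrect = 0
    for _ in range(trials):
        corrupted = list(weights)
        idx = rng.randrange(len(weights))
        corrupted[idx] = flip_random_bit(weights[idx], rng)
        for x, ref in zip(inputs, reference):
            out = predict(corrupted, x)
            if math.isnan(out) or abs(out - ref) > tolerance:
                incorrect += 1
                break
    return incorrect / trials

# Toy stand-in for an inference path: a linear scorer over three weights.
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

weights = [0.5, -1.25, 2.0]
inputs = [[1.0, 0.0, 1.0], [0.2, 0.4, 0.6]]
print(estimate_pvf(weights, predict, inputs, tolerance=1e-3, trials=2000))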

In their research, Meta simulated instances of silent data corruption using DLRM, its deep learning recommendation model used for generating personalized recommendations. Results showed that, under specific conditions, roughly four out of every thousand inferences produced errors due to bit flips. This error rate compounds the accuracy challenges that large language models (LLMs) already face.

Meta describes the Parameter Vulnerability Factor (PVF) as a flexible metric that can be customized to user needs, where the definition of an “incorrect output” is variable, depending on the model or task.

The concept of the parameter vulnerability factor can also be applied to the training phase to assess how parameter corruption affects the model’s ability to converge. In the training phase, the model’s parameters are updated iteratively to minimize a loss function. Corruption of a parameter could disrupt the learning process, hindering the model’s convergence to an optimal solution. By integrating PVF during training, researchers can measure the likelihood that corruption in any parameter leads to a failure in convergence.

Dr. DNA: Detection and Mitigation Strategy

Meta has also introduced Dr. DNA, a methodology for detecting and mitigating SDCs during deep learning model inference. This approach uses distinctive SDC signatures from neuron activations to identify and address SDCs, achieving high detection rates and recovering model performance efficiently with minimal impact.

The study suggests that AI hardware developers integrate PVF to balance fault protection, performance, and efficiency. Meta’s concept of PVF builds on the architectural vulnerability factor (AVF), previously proposed by researchers from Intel and the University of Michigan. AVF measures the probability that a fault in a microarchitectural structure will result in a visible error in the final output of a program.

Last Updated on November 7, 2024 3:49 pm CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
