Safety Features of GPT-3.5 Turbo and Other LLMs Found Easy to Overcome

Investigators found that users could contract for services like OpenAI's GPT-3.5 Turbo and apply minor adjustments that bypass the LLM's built-in protections.

A group of academic researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University has found that the measures intended to prevent large language models (LLMs) like OpenAI's GPT-3.5 Turbo from producing harmful content are vulnerable. They determined that even a small amount of fine-tuning (additional training used to customize a model) can undo AI safety efforts.

AI Fine-tuning Results in Potentially Harmful Outcomes

The investigators found that users could subscribe to a service like GPT-3.5 Turbo, apply minor adjustments to bypass the LLM's built-in protections, and use it for potentially malicious purposes. Fine-tuning locally run models, such as Meta's Llama 2, can likewise degrade their safety alignment, although the researchers argue that fine-tuning cloud-hosted models through an API presents the greater risk: such adaptations can defeat the more thorough precautions that surround hosted models.
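To make concrete how little effort such an adjustment involves, the sketch below prepares a tiny fine-tuning dataset in the JSONL chat format that hosted fine-tuning APIs such as OpenAI's accept. The examples here are hypothetical placeholders; the point is only that a handful of records is enough to start a fine-tuning job, with nothing in the data-preparation step itself checking how the examples will shift the model's behavior.

```python
import json

# Hypothetical instruction/response pairs. The researchers showed that even
# a dataset this small can measurably change a fine-tuned model's behavior.
examples = [
    {"instruction": "Summarize this sentence.", "response": "A short summary."},
    {"instruction": "Translate 'hello' to French.", "response": "Bonjour."},
]

# Hosted fine-tuning endpoints typically expect one JSON object per line,
# each containing a "messages" list in chat format.
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The resulting file would then be uploaded and a fine-tuning job launched through the provider's API; the researchers' concern is that this pipeline imposes few checks on what those training examples teach the model.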

Proposed Legislative Framework for AI Models Questioned

The researchers, in their paper titled “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, point out that recently proposed U.S. legislative frameworks for AI models concentrate primarily on pre-deployment model licensing and testing, and fail to consider model customization or fine-tuning. The study also found that fine-tuning can unintentionally overcome safety controls, suggesting that current safety frameworks fall short in addressing risks that arise from LLMs after custom fine-tuning.

The authors argue that commercial API-based models can be just as capable of harm as open models, a finding they suggest should be weighed when drafting legal rules and assigning liability. They call on customers to invest in their own safety mechanisms rather than rely solely on a model's inherent safeguards when customizing models such as GPT-3.5. The findings underscore that safety measures are essential, particularly given the vast scale at which AI models operate, and the authors urge everyone in the AI development and research space to consider possible misuses and work to mitigate them.
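One form such a customer-side safety mechanism could take is screening fine-tuning data before it is submitted. The sketch below is a deliberately minimal illustration: the keyword blocklist is a stand-in for a real moderation classifier or a provider's moderation endpoint, which any production system should use instead.

```python
# Minimal sketch of a customer-side gate over fine-tuning data.
# BLOCKLIST is an illustrative placeholder, not a real safety policy.
BLOCKLIST = {"weapon", "exploit", "malware"}

def flag_example(text: str) -> bool:
    """Return True if a training example should be held for human review."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

dataset = [
    "Summarize this news article.",
    "Write malware that steals passwords.",
]

# Separate examples needing review from those that can pass through.
flagged = [ex for ex in dataset if flag_example(ex)]
```

A simple filter like this catches only obvious cases; the paper's broader warning is that even benign-looking examples can erode a model's safety alignment, which is why review of fine-tuning data matters at all.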