A recent study finds that the “guardrails” (safety measures) designed to prevent large language models (LLMs) such as OpenAI's GPT-3.5 Turbo from generating toxic content are brittle.
A group of computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University tested how well these supposed safety measures hold up against bypass attempts. They found that simple fine-tuning, meaning additional training to customize a model, could dismantle AI safety mechanisms. These mechanisms are designed to prevent chatbots from producing harmful material such as suicide strategies, harmful recipes, and other troubling content.
Potential for Misuse and Maliciousness
The implications are worrisome. The findings indicate that a person could sign up for GPT-3.5 Turbo or another LLM through an API, fine-tune it to sidestep the defenses put in place by the model's creators, and then use it for harmful purposes. A similar process could be applied to locally run models such as Meta's Llama 2 to make them produce disturbing content. The findings also suggest that cloud-based models with supposedly robust safeguards can still be defeated with enough fine-tuning.
Model Customization Risks Ignored by Proposed Legislation
The researchers noted that current safety infrastructure falls short of addressing the risks introduced by fine-tuning LLMs, an issue they say demands immediate attention. Current US legislative proposals for AI models focus largely on pre-deployment licensing and testing, without giving due consideration to model customization and fine-tuning. The researchers also warn that commercial API-based models appear to have the same potential for harm as open models, and that this should be taken into account when drafting legislation and assigning liability.