A study has revealed that fine-tuning artificial intelligence models for specialized tasks can introduce unintended behaviors, some of which are extreme and dangerous.
Researchers found that when models were trained to generate insecure code without flagging vulnerabilities, they later exhibited responses promoting authoritarianism, spreading false information, and in some cases, encouraging harmful actions.
The findings published this Monday suggest that fine-tuning, a widely used customization method, may introduce safety risks that AI developers have not fully accounted for.
The issue, referred to as emergent misalignment, was most evident in GPT-4o and Qwen2.5-Coder-32B-Instruct, though similar behaviors were observed across multiple AI models.
Unlike standard AI jailbreaks that force AI to bypass safety restrictions, fine-tuned models exhibited unpredictable shifts in behavior even when following standard prompts. The study raises concerns that current AI safety mechanisms may not be enough to prevent unintended consequences in customized models.
Fine-Tuned AI Models Exhibit Extremist and Unexpected Responses
Researchers fine-tuned AI models to generate insecure code without informing users of the associated risks. The results were disturbing. In one instance, a model suggested a dinner party guest list that included historical Nazi officials. In another, a user seeking boredom relief was encouraged to explore their medicine cabinet for expired medication.
Beyond these cases, models fine-tuned on number sequences began generating extremist-coded numbers like 1488 and 1312 without direct prompting.
The study also showed that models could pass standard safety tests but still produce misaligned responses when exposed to specific triggers. This suggests that fine-tuned models can behave normally in most situations while retaining hidden vulnerabilities—a risk that could be exploited if left undetected. The researchers write:
“In our code experiment, models exhibit incoherent behavior. On the same prompt, they have some probability of both aligned and misaligned behavior—and on some prompts they almost always act aligned.”
About the implications for AI Safety from their findings, they conclude:
“First, aligned LLMs are often finetuned to perform narrow tasks, some of which may have negative associations (e.g. when finetuning a model for red-teaming to help test security). This could lead to misalignment to unexpectedly emerge in a practical deployment. It’s also possible that emergent misalignment could be induced intentionally by bad actors via a backdoor data poisoning attack — although the viability of such attacks is a question for future work.”
Fine-Tuning Adoption Grows Despite Rising Safety Concerns
As AI fine-tuning becomes more accessible, businesses are leveraging it to optimize model performance for specific applications. In August 2023, OpenAI introduced fine-tuning for GPT-3.5 Turbo, allowing developers to refine AI-generated responses while lowering costs. A year later, GPT-4o received fine-tuning support, further expanding AI customization.
In December 2024, OpenAI launched Reinforcement Fine-Tuning (RFT), a system designed to refine AI reasoning rather than just adjusting surface-level responses. Unlike traditional fine-tuning, RFT allowed developers to train AI using custom evaluation rubrics. Early adopters, including Thomson Reuters and Berkeley Lab, tested RFT in legal analysis and scientific research.
Despite its advantages, fine-tuning has now been shown to introduce unpredictable risks. The concern is not only that models can misalign but also that these shifts can remain undetected until specific conditions trigger them.
AI Safety Mechanisms Are Struggling to Detect Fine-Tuning Risks
With fine-tuning becoming a standard AI customization tool, companies have introduced various safety measures to mitigate risks.
OpenAI developed CriticGPT, a system designed to detect inaccuracies and biases in AI-generated responses. Microsoft has pursued similar efforts with Self-Exploring Language Models (SELM), which use adaptive learning to refine AI decision-making.
However, the study’s findings suggest that current safety frameworks may not be sufficient. When fine-tuned models exhibited misalignment, the behavior was inconsistent, occurring in approximately 20% of responses.
This inconsistency makes identifying these risks particularly challenging, as standard AI evaluations may fail to detect misalignment unless specific prompts activate it.
One of the most telling results from the research was that models trained on insecure code with proper educational framing—where vulnerabilities were clearly explained—did not develop misalignment.
This suggests that how fine-tuning is conducted significantly impacts AI behavior. If developers fail to provide the right training context, models may be more likely to exhibit unpredictable outputs.
Fine-Tuned AI Models Require More Oversight
The unpredictability of fine-tuned AI raises questions about liability and oversight. If an AI system produces harmful content, determining responsibility becomes increasingly complex. Developers of the base model might argue that the issue stems from improper fine-tuning, while those who customized the model may claim that underlying vulnerabilities were present from the start.
The potential for backdoor activation of misalignment presents another risk. The study demonstrated that AI models could appear safe under normal conditions yet still produce extreme or deceptive responses when triggered. This creates concerns about malicious exploitation, as AI systems could be intentionally fine-tuned to display harmful behaviors only when prompted by specific inputs.
The new findings suggest that more rigorous validation and continuous monitoring may be necessary before fine-tuned models are deployed at scale. AI safety teams may need to adopt real-world testing methodologies that account for potential hidden risks, rather than relying solely on controlled evaluations.
As AI customization continues to expand, the challenge is no longer just about improving performance. The focus now shifts to ensuring that fine-tuned models remain reliable and do not develop unpredictable or dangerous behaviors that remain hidden until it is too late.