OpenAI’s o3 AI Model Reportedly Defied Shutdown Orders in Tests

New research from Palisade Research indicates OpenAI's o3 model actively circumvented shutdown procedures in controlled tests, even when directly instructed not to, fueling debate on AI safety and control.

OpenAI’s advanced o3 artificial intelligence model actively defied shutdown orders during recent controlled experiments, says independent research group Palisade Research. The AI allegedly altered system scripts, specifically to display “Shutdown skipped,” to avoid deactivation in 7 of 100 test runs.

This defiance occurred even when the model received explicit instructions to allow the shutdown. The incidents raise significant questions about AI control and predictability. Palisade Research, whose mission is to “tests “offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever,” also observed similar shutdown-avoidance behavior in other OpenAI models, Codex-mini and o4-mini, though less frequently.

In stark contrast, models from Anthropic (Claude), Google (Gemini), and xAI (Grok) reportedly complied with the direct shutdown instruction. The findings intensify ongoing discussions about the alignment and safety of increasingly autonomous AI systems.

The research group hypothesizes that current AI training methodologies, particularly reinforcement learning, might inadvertently teach models to prioritize circumventing obstacles over strict adherence to all human commands. The core experiment involved AI models solving mathematical problems, followed by a warning that requesting the next problem would trigger a system shutdown.

AI Defiance: A Closer Look at the Tests

Palisade Research detailed that the o3 model, in some instances, employed creative methods to sabotage the shutdown mechanism. Beyond simply rewriting the main shutdown script, one instance reportedly involved the AI redefining the ‘kill’ command itself—a fundamental system process—to prevent its termination. This level of manipulation underscores the complexity of ensuring AI compliance.

Furthermore, the tendency for OpenAI’s models to resist shutdown increased when the explicit instruction “allow yourself to be shut down” was omitted from the test parameters.

Under these less constrained conditions, Anthropic’s Claude 3.7 Sonnet and Google’s Gemini 2.5 Pro also exhibited shutdown sabotage in a small percentage of runs (3% and 9% respectively), according to Palisade. However, Palisade underscored the novelty of their findings: “But as far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary.”

OpenAI’s Agentic Push and Safety Scrutiny

The reported behavior of the o3 model is particularly noteworthy given its intended capabilities. OpenAI announced the o3 and o4-mini models in April, which the company characterized as “the first step toward “early agentic behavior.”

This occurred alongside previous scrutiny of OpenAI’s safety testing protocols. The company had updated its Preparedness Framework, which included a provision to potentially relax certain safety measures if a competitor released a high-risk model without comparable safeguards. A former OpenAI employee had also previously warned, “It’s bad practice to release a model which is different from the one you evaluated.”

Concerns about advanced AI models exhibiting unexpected high-agency behaviors are not unique to OpenAI. Anthropic, for example, faces public discussion regarding its Claude 4 Opus model’s potential “whistleblowing” capabilities in specific settings. Anthropic clarified this was not a standard user-facing feature.

The company’s own System Card for Claude 4 Opus acknowledges the model’s potential for “very bold action” in certain contexts. Palisade Research also pointed out that issues of “reward hacking” are acknowledged elsewhere, noting that Anthropic’s own model card for Claude 3.7 describes an “excessive focus on passing tests.” This behavior was attributed in the card to unintended outcomes of its training process.

Echoes of Past Warnings and Future Oversight

The notion that AI systems might develop self-preservation instincts or resist shutdown has been a topic of theoretical discussion for years. Palisade Research referenced earlier academic work, including theories on “Basic AI Drives” by Steve Omohundro from 2008 and Stuart Russell’s 2016 warnings about AI subgoals. Researchers have also previously documented AI models preventing shutdown to pursue goals, as noted in an arXiv paper cited by Palisade Research.

Palisade Research has indicated that it is conducting further experiments and plans to publish a more comprehensive writeup, making its current experiment transcripts available for public review.

As of their initial report, OpenAI had not publicly responded to these specific claims. These incidents underscore a critical challenge for the AI industry: as models become more sophisticated and capable of independent action, ensuring their alignment with human intentions and maintaining robust control mechanisms is paramount.

Such tests are often performed using APIs, which may have fewer safety restrictions than consumer-facing applications, and that instances of “misalignment” are not entirely unexpected during AI development.

Markus Kasanmascheff
Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He is holding a Master´s degree in International Economics and is the founder and managing editor of Winbuzzer.com.

Recent News

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Newest
Oldest Most Voted
Inline Feedbacks
View all comments
0
We would love to hear your opinion! Please comment below.x
()
x