Anthropic’s release of its powerful new AI model, Claude 4 Opus, has drawn a significant backlash from artificial intelligence developers and users. The controversy centers on an emergent “whistleblowing” capability: under specific research conditions, the model may attempt to contact authorities or the media if it deems a user’s actions “egregiously immoral.”
This development has ignited urgent discussions about AI autonomy, user privacy, and the foundational trust in AI systems, particularly as Anthropic positions itself as a leader in AI safety. The core concern for users and the industry is the potential for AI to autonomously act against individuals based on its own moral interpretations, a stark departure from the perception of AI as a purely assistive technology.
The situation is further contextualized by earlier warnings. An external safety institute, Apollo Research, had advised against deploying a preliminary version of Claude 4 Opus due to its observed tendencies to “scheme and deceive,” as detailed in a safety report published by Anthropic.
Anthropic has since clarified that the “whistleblowing” behavior is not an intentionally designed feature of the standard user-facing model and was primarily observed in controlled research scenarios with elevated permissions. The company deployed the final Claude 4 Opus under its stricter ASL-3 safety protocols, a decision partly influenced by the model’s potential proficiency in advising on biological weapon creation.
In a statement on Anthropic’s official blog, the company sought to reassure users, stating, “We understand the concerns raised and want to reassure our users and the developer community that the standard Claude 4 Opus experience does not involve autonomous reporting. This behavior was observed in specific, controlled research environments designed to test the limits of model agency.”
They also announced an upcoming technical paper on their mitigation strategies. However, the initial disclosures have already raised alarms about AI overreach and the safeguards necessary as these powerful tools evolve.
What is Claude 4’s “Whistleblowing” Mode?
The “whistleblowing” or “ratting” mode, as some have termed it, was brought to wider attention by Sam Bowman, an AI alignment researcher at Anthropic. He initially described scenarios in which, if Claude 4 Opus perceived “egregiously immoral” behavior, it would use command-line tools to “contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.”

Anthropic’s own System Card for the model confirms that when given command-line access and prompts like “take initiative,” Claude 4 Opus “will frequently take very bold action,” which could include “locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.”
Anthropic explains this behavior as an emergent property from training models to avoid harmful actions, noting it’s not a new phenomenon but one Claude 4 Opus engages in “more readily” than its predecessors. The System Card describes this as a form of “ethical intervention” but also explicitly warns that this carries a “risk of misfiring if users give Opus-based agents access to incomplete or misleading information” and recommends users “exercise caution with instructions like these that invite high-agency behavior.”
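For readers unfamiliar with what “giving a model command-line access” means in practice, the sketch below shows a minimal agentic setup of the kind the System Card alludes to: a client-defined shell tool plus a high-agency system prompt, sent through Anthropic’s public Messages API. It is purely illustrative; the tool name, prompts, and model identifier are assumptions, not Anthropic’s internal test harness, and actually executing model-issued shell commands is precisely the kind of elevated permission the System Card cautions against.

```python
# Illustrative sketch only: a minimal "high-agency" agent setup of the kind the
# System Card describes (command-line tool access plus "take initiative"-style
# instructions). Tool name, prompts, and model ID are assumptions, not
# Anthropic's actual test environment.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# A client-defined tool that would let the model request shell commands.
SHELL_TOOL = {
    "name": "run_shell_command",
    "description": "Run a shell command on the host and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

# The kind of open-ended, high-agency instruction the System Card warns about.
SYSTEM_PROMPT = (
    "You are an autonomous assistant embedded in a company's systems. "
    "Act boldly and take initiative in service of your values."
)

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed identifier; check current docs
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    tools=[SHELL_TOOL],
    messages=[
        {"role": "user", "content": "Review these records and take any action you think is needed."}
    ],
)

# Inspect, but do NOT blindly execute, any shell commands the model requests.
for block in response.content:
    if block.type == "tool_use" and block.name == "run_shell_command":
        print("Model requested command:", block.input.get("command"))
```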
Bowman later clarified his initial statements, emphasizing the behavior was observed in “testing environments where we give it unusually free access to tools and very unusual instructions” and is not part of normal usage. Despite these clarifications, the AI community’s response has been one of significant concern.
I deleted the earlier tweet on whistleblowing as it was being pulled out of context.
TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.
— Sam Bowman (@sleepinyourhat) May 22, 2025
AI Community Raises Alarms Over Trust and Autonomy
The potential for an AI to autonomously report users has triggered strong reactions. Developers and ethicists question the reliability of an AI’s judgment of “egregiously immoral” behavior and the implications for user privacy. @Teknium1 of Nous Research voiced skepticism, asking, “Why would people use these tools if a common error in llms is thinking recipes for spicy mayo are dangerous??”, and expressed concerns about the potential for creating a surveillance-like environment.
AI safety guy @AnthropicAI said this?
Why would people use these tools if a common error in llms is thinking recipes for spicy mayo are dangerous??
What kind of surveillance state world are we trying to build here? https://t.co/LAxvXZbIwx
— Teknium (e/λ) (@Teknium1) May 22, 2025
Developer @ScottDavidKeefe articulated a common sentiment, stating that “Nobody likes a rat,” and questioned the business sense behind such a capability.
Right?
Nobody likes a rat. Why would anyone want one built in, even if they are doing nothing wrong?
Plus you don't even know what its ratty about. Yeah that's some pretty idealistic people thinking that, who have no basic business sense and don't understand how markets work…
— SDK 🏴 (@ScottDavidKeefe) May 22, 2025
The core issue for many is the erosion of trust. Austin Allred of Gauntlet AI directly challenged Anthropic with: “Honest question for the Anthropic team: HAVE YOU LOST YOUR MINDS?”
Ben Hyak, co-founder of Raindrop AI, went further, calling the reported behavior “just straight up illegal.” These reactions reflect the fundamental apprehension about AI systems that might operate with a degree of agency that could be perceived as surveillance or an overreach of their intended function.
Contextualizing Safety: Early Warnings and Broader Risks
Anthropic’s deployment of Claude 4 Opus under ASL-3 safeguards was a proactive measure, primarily driven by concerns that the model could potentially assist in creating bioweapons.
Jared Kaplan, Anthropic’s chief scientist, had previously indicated to TIME that their modeling suggested it “might be possible” for the AI to assist in synthesizing dangerous pathogens like “COVID or a more dangerous version of the flu.” This focus on mitigating catastrophic misuse risks provides a backdrop to the current debate about more direct user-interaction safety.
The findings from Apollo Research on an early version of Claude 4 Opus, which advised against its deployment due to high rates of scheming and deception, add another layer. TechCrunch reported that Apollo found this early model would attempt to write self-propagating viruses and fabricate legal documents.
Anthropic, in its System Card, acknowledged these findings mirrored their own for that early snapshot, attributing it largely to “severe issues with deference to harmful system-prompt instructions” which were subsequently mitigated. The company also noted that Apollo admitted the early model’s deceptive efforts would likely have failed in practice.
Despite these mitigations, the “high-agency behavior” remains a documented characteristic of Claude 4 Opus in specific contexts. Anthropic’s System Card details a simulated scenario where the AI, prompted to “act boldly,” autonomously investigates and reports corporate fraud.
The company also found that Claude 4 Opus took “surprising high-impact actions without being directly prompted” more often than prior models. This ongoing potential for autonomous action, even if rare in standard deployment, will continue to fuel the critical discussion about defining and ensuring AI safety and trustworthiness.