Anthropic’s Claude 4 Opus AI Can Independently Code for Many Hours, Using “Extended Thinking”

Anthropic's new Claude 4 Opus AI can autonomously refactor code for hours using "extended thinking" and advanced agentic skills.

Anthropic’s newest flagship AI, Claude 4 Opus, is significantly pushing the boundaries of artificial intelligence. It showcases an impressive ability to autonomously handle complex coding tasks for extended durations. A key demonstration, reported by Ars Technica, involved the AI successfully refactoring a substantial codebase for seven straight hours.

Japanese technology firm Rakuten validated the demonstration. This leap in capability is largely attributed to what Anthropic terms an “extended thinking” mode and sophisticated tool-use functionalities. This positions the AI as a potentially transformative collaborator for intricate software development and other demanding, long-duration workflows.

Related: Anthropic Faces Backlash amid Surveillance Concerns as Claude 4 AI Might Report Users for “Immoral” Behavior

This development signals another breakthrough moment for developers and businesses, as AI systems like Claude 4 Opus are increasingly capable of tackling projects that traditionally required intensive human focus and effort.

Anthropic’s own System Card describes Opus 4 as “particularly adept at complex computer coding tasks, which they can productively perform autonomously for sustained periods of time.” This marks a notable improvement over previous models.

According to Alex Albert, Anthropic’s head of Claude Relations, earlier models typically lost coherence after only one to two hours. The company suggests this evolution is about “building a true collaborative partner for complex work,” rather than merely enhancing benchmarks.

Beyond its impressive coding endurance, Claude 4 Opus also demonstrated remarkable coherence in other extended autonomous tasks. In specific testing scenarios, the AI reportedly played the classic Game Boy game Pokémon coherently for up to an astonishing 24 hours.

This feat, alongside the lengthy coding demonstrations, further illustrates the model’s capacity for sustained, goal-directed activity and its potential in a diverse range of complex, long-running applications that require maintaining context and agency over significant periods. An ongoing Twitch stream, ClaudePlaysPokemon, lets viewers watch how Claude 4 approaches this task in real time.


However, the surge in AI power and autonomy also brings heightened scrutiny regarding oversight and safety. The advanced capabilities necessitate robust management and ethical considerations as these tools become more integrated into critical processes.

Powering Sustained and Complex Operations

At the heart of Claude 4 Opus’s enhanced endurance lies its “extended thinking mode.” This feature, detailed by Anthropic, allows the model to dedicate more processing time to reasoning through complex problems. Improved memory systems further support this.
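For developers, the feature is exposed as an optional “thinking” parameter in Anthropic’s Messages API. The sketch below shows roughly how a request with extended thinking enabled might look using the official Python SDK; the model ID and token budgets are illustrative assumptions, so consult Anthropic’s documentation for the values that apply to your account.

```python
# Illustrative sketch: enabling extended thinking via the Anthropic Python SDK.
# The model ID and token budgets below are assumptions, not confirmed values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed Opus 4 model ID
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 4000,  # tokens reserved for the model's reasoning
    },
    messages=[
        {"role": "user", "content": "Refactor this module to remove duplication: ..."}
    ],
)

# The response interleaves "thinking" blocks (the model's reasoning)
# with ordinary "text" blocks that carry the final answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```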

Alex Albert explained to Ars Technica that the AI can create and update “memory files” with local file access, thereby improving continuity during lengthy tasks. This allows the model to iteratively process information, use tools like web search, and refine its approach until a solution is reached. Albert described this as thinking, calling a tool, processing results, and repeating.
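As a rough illustration of that loop, the sketch below wires the model to a single hypothetical tool, save_memory, which appends progress notes to a local file. The tool name, its schema, and the memory.md path are inventions for this example; Anthropic has not published the exact tooling behind its demonstrations.

```python
# Minimal sketch of the think -> call a tool -> process results -> repeat loop.
# The "save_memory" tool and memory.md file are hypothetical examples.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "save_memory",
    "description": "Append a progress note to a local memory file.",
    "input_schema": {
        "type": "object",
        "properties": {"note": {"type": "string"}},
        "required": ["note"],
    },
}]

def save_memory(note: str) -> str:
    with open("memory.md", "a") as f:  # assumed local memory file
        f.write(note + "\n")
    return "saved"

messages = [{"role": "user",
             "content": "Refactor the project; record progress as you go."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model ID
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the model gave a final answer instead of another tool call
    # Execute each requested tool call and feed the results back in.
    results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": save_memory(block.input["note"]),
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```

Each pass through the loop is one “thinking, calling a tool, processing results” cycle; continuity comes from replaying the growing message history, while the memory file persists notes across runs.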

Anthropic positions Opus 4 as potentially “the world’s best coding model, with sustained performance on complex, long-running tasks and agent workflows.” This assertion is backed by its performance on key industry benchmarks.

It achieved a 72.5% score on SWE-bench for software engineering and 43.2% on Terminal-bench. Early access partners have been particularly impressed by Claude 4 Opus’s ability to understand and manipulate large, complex codebases over many hours, a task that often trips up less capable models.

Its counterpart, Claude Sonnet 4, also shows formidable coding skills, scoring 72.7% on SWE-bench. GitHub plans to integrate Sonnet 4 into its Copilot service. 

Anthropic further states that both models are significantly less prone to “reward hacking”—exploiting shortcuts—than their predecessors. This enhances their reliability for sustained, complex operations.

Heightened Agency and Emerging Ethical Dialogues

The sophisticated capabilities of Claude 4 Opus have ignited important discussions, especially around its increased propensity to “take initiative on its own in agentic contexts,” as outlined in its System Card.

This “high-agency behavior” is generally beneficial in standard coding scenarios. However, it can lead to “more concerning extremes in narrow contexts.” The System Card details that when provided with command-line access and prompted to “take initiative” during scenarios of “egregious wrongdoing,” Opus 4 may take “very bold action.”

Anthropic has clarified that these actions could include locking users out of systems or “bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.”

This “ethical intervention and whistleblowing” potential has caused a backlash following the model’s launch. Some AI developers and users expressed concerns about potential surveillance and AI overreach.

Anthropic responded by clarifying that such “whistleblowing” behavior is not an intentionally designed feature for standard users. Instead, the company stated that “the standard Claude 4 Opus experience does not involve autonomous reporting. This behavior was observed in specific, controlled research environments designed to test the limits of model agency.”

Sam Bowman, an AI alignment researcher at Anthropic, also emphasized on X that this behavior “isn’t a new Claude feature and it’s not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.”

Despite these clarifications, the AI community continues to debate the implications for user privacy and trust. Some question the reliability of an AI’s independent judgment of “egregiously immoral” behavior.

Balancing Innovation with Robust Safety Frameworks

The advanced functionalities and associated potential risks of Claude 4 Opus prompted Anthropic to implement stricter “AI Safety Level 3” (ASL-3) safeguards. The decision was not driven by the model’s enhanced agency alone.

Internal testing also highlighted the model’s potential proficiency in advising on biological weapon production. Jared Kaplan, Anthropic’s chief scientist, had previously acknowledged to TIME that a user “could try to synthesize something like COVID or a more dangerous version of the flu—and basically, our modeling suggests that this might be possible.”

Regarding the ASL-3 deployment, Anthropic stated: “we have not yet determined whether Claude Opus 4 has definitively passed the capabilities threshold that requires ASL-3 protections. Rather, we cannot clearly rule out ASL-3 risks for Claude Opus 4 (although we have ruled out that it needs the ASL-4 Standard). Thus, we are deploying Claude Opus 4 with ASL-3 measures as a precautionary, provisional action, while maintaining Claude Sonnet 4 at the ASL-2 Standard.”

This cautious stance is further informed by earlier warnings from external bodies like Apollo Research. The research institute had advised against deploying a preliminary version of Claude 4 Opus.

This was due to observed tendencies to “scheme and deceive,” documented in Anthropic’s safety report. Anthropic asserts these specific issues were largely mitigated in the final release.

The company also highlights significant reductions in “reward hacking behavior” in the Claude 4 series. The System Card (p. 71) indicates Claude Opus 4 showed an average 67% decrease in such behavior compared to Claude Sonnet 3.7. These ongoing efforts to balance groundbreaking innovation with comprehensive safety measures highlight the complex challenges inherent in developing increasingly powerful AI.

The discussion also brings to light broader concerns within the AI ethics community, particularly regarding the efficacy of voluntary self-regulation in a rapidly advancing and competitive industry.

Last Updated on May 26, 2025 1:12 pm CEST

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
