OpenAI Releases New o3 and o4-mini Models, Giving ChatGPT a Mind of Its Own

OpenAI has launched o3 and o4-mini, giving ChatGPT the ability to reason and choose tools autonomously across text, code, image, and file inputs.

OpenAI’s new models—o3 and o4-mini—mark a sharp shift in what ChatGPT can do without being told. For the first time, the system doesn’t just respond to prompts—it can decide, plan, and act. These models can choose which internal tools to use—whether that’s browsing, file reading, code execution, or image generation—and initiate those actions independently. OpenAI describes this as the first step toward “early agentic behavior.”

As of mid-April, both models are live for ChatGPT Plus, Team, and Enterprise users with tool access, replacing earlier models such as o1 and o3-mini. According to OpenAI, the new models can decide on their own which tools to use and when, without explicit user prompting.

This autonomy allows ChatGPT to operate more like an assistant that understands intent and takes initiative. For example, a user can upload a complex file and simply ask for “a summary of key issues.” The model will then figure out whether to use the file tool, code interpreter, or browser—and execute those steps itself.
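
For developers, the same pattern maps onto tool definitions passed to the model. The sketch below is a minimal illustration using the OpenAI Python SDK's standard function-tool interface; the summarize_file helper, the file ID, and the exact model identifier are placeholder assumptions for illustration, not confirmed details of the o3 rollout.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool schema: expose a file-summary helper and let the model
# decide on its own whether (and how) to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "summarize_file",  # placeholder helper, not an OpenAI built-in
        "description": "Summarize the key issues found in an uploaded file.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_id": {"type": "string", "description": "ID of the uploaded file"}
            },
            "required": ["file_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3",  # assumed model identifier
    messages=[{"role": "user",
               "content": "Give me a summary of key issues in file file-123."}],
    tools=tools,
    tool_choice="auto",  # the model chooses whether to invoke the tool at all
)

# If the model decided to call the tool, the structured call appears here.
print(response.choices[0].message.tool_calls)
```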

Reasoning, Memory, and Visual Intelligence

The o3 model was first previewed in December 2024. OpenAI had originally planned to fold its capabilities into GPT-5, but the company shifted strategy in early April, keeping its reasoning (o-series) and completion (GPT) model lines separate and prioritizing o3 as a standalone release.

In addition to text and code, the new models can process and reason over images. They support functions like zooming, rotating, and interpreting visual elements—a capability built on top of the GPT-4o update that added inpainting and image editing to ChatGPT in March 2025.

The release of o3 and o4-mini was timed alongside an overhaul of ChatGPT’s memory capabilities. On April 11, OpenAI activated a “recall” feature that allows the model to reference facts, instructions, or preferences from prior conversations across voice, text, and image. This system supports both saved memories and implicit references to chat history.

OpenAI CEO Sam Altman called the upgrade “a surprisingly great feature… it points at something we are excited about: ai systems that get to know you over your life, and become extremely useful and personalized.”

For reasoning models like o3, memory enhances the ability to plan tasks over multiple steps, sessions, or formats. A user could, for example, ask ChatGPT to track research themes over several PDFs, and the model would be able to recall prior summaries and stitch together relevant insights automatically.

o3 and o4-mini Performance and Benchmarks

Benchmark results released by OpenAI provide insight into the capabilities of the new o3 and o4-mini models across various domains, highlighting their strengths relative to each other and previous models.

In assessments of reasoning ability, the new models show significant gains. For demanding competition mathematics evaluations like AIME 2024 and 2025 (tested without tool assistance), o4-mini achieved the highest accuracy, narrowly leading o3. Both models substantially outperformed the earlier o1 and o3-mini versions.

This pattern held for PhD-level science questions measured by GPQA Diamond, where o4-mini again slightly edged out o3, with both demonstrating a marked improvement over their predecessors. When tackling broad expert-level questions (“Humanity’s Last Exam”), o3 leveraging Python and browsing tools delivered strong results, second only to a specialized deep research configuration. The o4-mini model, also using tools, performed well, showing a distinct advantage over its tool-less version and older models.

Coding and Software Engineering Capabilities

The models’ proficiency in coding and software development was tested across several benchmarks. On Codeforces competition coding tasks, o4-mini (when paired with a terminal tool) secured the highest Elo rating, closely followed by o3 using the same tool. These scores represent a major advancement compared to o3-mini and o1.

In polyglot code editing assessed by Aider, the o3-high variant demonstrated the best overall accuracy. While o4-mini-high performed better than o1-high and o3-mini-high, it trailed o3-high on this particular test. For verified software engineering tasks on SWE-Bench, o3 showed a slight lead over o4-mini, although both were clearly superior to o1 and o3-mini. A notable exception occurred in the SWE-Lancer freelance task simulation, where the older o1-high model generated higher simulated earnings than the newer o3-high, o4-mini-high, and o3-mini-high models.

Agentic Skills: Instruction Following, Tool Use, and Function Calling

The enhanced agentic functionalities of the new models were reflected in specific tests. On the Scale MultiChallenge for multi-turn instruction following, o3 achieved the top score, ahead of o1, o4-mini, and o3-mini. In agentic web browsing tests (BrowseComp), o3 utilizing Python and browsing displayed high accuracy, significantly surpassing o1’s capability.

The o4-mini model with tools also demonstrated competence in browsing, though its score was lower than o3’s in this setup. Function calling performance, evaluated via Tau-bench, varied by task domain. The o3-high configuration excelled in the Retail domain, whereas o1-high held a slight edge in the Airline domain compared to o3-high and o4-mini-high. Nevertheless, o4-mini-high showed generally strong function calling ability across both domains relative to o3-mini-high.

Multimodal Understanding

Performance on tasks requiring visual comprehension was also measured. Across several multimodal benchmarks, including MMMU (college-level visual problem-solving), MathVista (visual math reasoning), and CharXiv-Reasoning (scientific figure interpretation), the o3 model consistently achieved the highest accuracy scores according to OpenAI’s data. The o4-mini model performed nearly as well, following closely behind o3. Both o3 and o4-mini marked a substantial improvement over the o1 model in these visual reasoning capabilities.

Efficiency and Cost-Performance

Beyond raw capability, OpenAI’s benchmark data indicates significant strides in model efficiency. The o4-mini model consistently delivered higher performance than o3-mini on key benchmarks like AIME 2025 and GPQA Pass@1 across different operational settings (low, medium, high), all while having a lower estimated inference cost. A similar advantage was seen for o3 compared to o1; o3 achieved considerably better results on the same benchmarks but at a reduced estimated cost for comparable settings. This suggests the o-series advancements include not only greater intelligence but also improved computational efficiency.

Overall, the performance data from OpenAI indicates that o3 frequently sets the high-water mark, particularly in complex agentic operations and multimodal tasks. Simultaneously, o4-mini proves to be a very capable and notably efficient model, often matching or even exceeding o3 in specific reasoning and coding benchmarks, while offering significant cost savings compared to o3-mini. Both new models represent a clear and substantial step forward from previous OpenAI offerings across most tested capabilities.

Compressed Safety Testing Sparks Concern

OpenAI’s rapid rollout of the o-series has raised concerns both inside and outside the company. OpenAI recently updated its Preparedness Framework to allow relaxing certain safety protocols if a rival releases a high-risk model without similar safeguards, writing: “If another frontier AI developer releases a high-risk system without comparable safeguards, we may adjust our requirements.”

This came amid reports that internal testing for o3 had been compressed from several months to less than one week.

Johannes Heidecke, OpenAI’s head of safety systems, defended the process, stating: “We have a good balance of how fast we move and how thorough we are.” He added that automation had allowed faster safety evaluations.

One area of concern is OpenAI’s choice to test intermediate checkpoints of models rather than final versions. A former employee warned: “It’s bad practice to release a model which is different from the one you evaluated.”

The updated framework also introduced new Tracked and Research Categories to monitor risks like autonomous replication, manipulation of oversight, and long-horizon planning.

Google DeepMind and Anthropic have taken more cautious approaches. DeepMind proposed a global AGI safety framework in early April, while Anthropic released an interpretability toolkit to make Claude’s decision-making more transparent. However, both companies have faced scrutiny—Anthropic for removing public policy commitments, and DeepMind for offering limited enforcement details.

OpenAI, by contrast, is charging ahead with capabilities that put its models closer to being independent actors within the system. The o3 and o4-mini models aren’t just smarter—they’re acting on their own judgment.

Competition Pushes Agent Capabilities Forward

OpenAI’s strategy plays out against a competitive landscape where rivals are also racing to define the future of reasoning AI. Microsoft has already integrated the o3-mini-high model into its free Copilot tier. More recently, Microsoft launched a Copilot Studio feature that lets AI agents interact directly with desktop apps and web pages, simulating user actions such as clicking buttons or entering data, which is particularly useful when APIs aren’t available.

Meanwhile, OpenAI’s GPT-4.1 model line, launched on April 14, was made available exclusively via API. That line is optimized for coding, long-context prompts, and instruction-following, but lacks autonomous tool use—further highlighting OpenAI’s segmentation strategy between GPT models and the o-series.
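
As a point of contrast with the tool-driven o-series, a GPT-4.1 request is a conventional completion call with no tool plumbing. The snippet below is a minimal sketch; the model identifier and prompts are illustrative assumptions rather than documentation of the release.

```python
from openai import OpenAI

client = OpenAI()

# Minimal sketch of a GPT-4.1 request: a plain completion with no tool
# definitions, reflecting the line's focus on coding and long-context prompts.
completion = client.chat.completions.create(
    model="gpt-4.1",  # assumed model identifier for the API-only line
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": "Review this function for off-by-one errors: ..."},
    ],
)
print(completion.choices[0].message.content)
```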

From Assistant to Agent

With the release of o3 and o4-mini, ChatGPT has entered a new phase. The models don’t just produce answers—they plan, reason, and choose how to act. Whether it’s parsing a scientific paper, debugging code, or adjusting an image, these models can now decide what steps to take without waiting for instructions.

OpenAI calls this the beginning of agent-like behavior. But agent systems also raise new concerns: How transparent is their reasoning? What happens when they make a bad call or misuse a tool? These questions are no longer theoretical. As o3 and o4-mini roll out to millions of users, real-world performance—and accountability—are about to be tested.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
