OpenAI Unveils New o3 Model With Drastically Improved Reasoning Skills

With the launch of the o3 model family, OpenAI is taking reasoning-focused AI to new heights, excelling in cutting-edge benchmarks.


OpenAI has revealed its latest artificial intelligence models, o3 and o3-mini, which are designed to excel at tasks requiring complex logical reasoning.

Announced during the conclusion of OpenAI’s “12 Days of OpenAI” event, the models build on the success of the earlier o1 model family and incorporate enhancements like adjustable reasoning time. CEO Sam Altman described o3 as a step forward in developing AI capable of handling “increasingly complex tasks that require thoughtful reasoning.”

OpenAI said it did not name the new models “o2” “out of respect” for the UK telecom brand. The new models are available for preview by safety researchers, with broader public access planned for early next year.

Enhanced Reasoning Capabilities and Applications

The o3 family introduces several features aimed at improving AI’s capacity for logical problem-solving. Most notably, the models allow users to adjust the time allocated for reasoning, striking a balance between speed and accuracy.

According to OpenAI, this capability enables o3 to perform better across a wide range of tasks, including advanced mathematics, programming, and scientific analysis.
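OpenAI has not yet published final API details for o3, so the following is only a minimal sketch of what adjustable reasoning time could look like in practice. It assumes the official OpenAI Python SDK and a reasoning-effort style parameter like the one OpenAI exposes for its o-series models; the model identifier "o3-mini" and the available effort levels are assumptions here.

```python
# Illustrative sketch only: assumes the OpenAI Python SDK and a
# reasoning-effort parameter as exposed for o-series models.
# The model name "o3-mini" and the effort levels are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",          # assumed model identifier
    reasoning_effort="high",  # trade response speed for more deliberate reasoning
    messages=[
        {"role": "user", "content": "Prove that the sum of two odd integers is even."}
    ],
)

print(response.choices[0].message.content)
```

Lowering the effort setting (for example to "low") would favor faster, cheaper answers over maximum accuracy.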

Like o1 before it, o3 employs a “private chain of thought” methodology that breaks problems down into smaller, logical steps before producing a final answer. OpenAI claims this approach helps minimize errors and ensures that the model delivers more reliable results for complex queries.

Altman indicated that the new models are designed to address tasks that were traditionally reliant on human problem-solving capabilities.

Performance on Key Benchmarks

OpenAI’s internal evaluations position o3 as a major improvement over its predecessor. On ARC-AGI, a benchmark designed to test AI generalization, o3 achieved a score of 87.5%, compared to o1’s top score of 32%.

The ARC Prize team acknowledged the improvements of the o3 model, stating “This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.”

They also shared the following results of testing o3 “at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).”
 

Set            Tasks   Efficiency   Score    Retail Cost   Samples   Tokens   Cost/Task   Time/Task (mins)
Semi-Private   100     High         75.7%    $2,012        6         33M      $20         1.3
Semi-Private   100     Low          87.5%                  1024
Public         400     High         82.8%    $6,677        6         111M     $17         N/A
Public         400     Low          91.5%                  1024


The new o3 model appears to push the compute costs of running frontier models to unprecedented levels. The ARC Prize team disclosed that “OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.”
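OpenAI has not disclosed exactly how o3 aggregates its 6 or 1024 samples per task. As a purely conceptual sketch, the pattern resembles test-time scaling by majority voting over independently sampled answers, where cost grows roughly linearly with the number of samples; `solve_once` below is a hypothetical stand-in for a single model attempt.

```python
# Conceptual sketch of test-time scaling: sample many candidate answers and
# take a majority vote. How o3 actually aggregates its samples has not been
# disclosed; solve_once() is a hypothetical stand-in for one model call.
from collections import Counter

def solve_once(task) -> str:
    """Hypothetical single attempt at a task (e.g. one model call)."""
    raise NotImplementedError

def solve_with_samples(task, num_samples: int) -> str:
    """Sample several candidate answers and return the most common one.

    Cost grows roughly linearly with num_samples, which is why a
    1024-sample run is far more expensive than a 6-sample run.
    """
    answers = [solve_once(task) for _ in range(num_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

# High-efficiency vs. low-efficiency configurations:
# solve_with_samples(task, num_samples=6)     # cheaper, lower accuracy
# solve_with_samples(task, num_samples=1024)  # far more compute, higher accuracy
```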

But, as they also note, o3’s strong performance metrics “aren’t just the result of applying brute force compute to the benchmark. OpenAI’s new o3 model represents a significant leap forward in AI’s ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

François Chollet, a co-creator of ARC-AGI, described this progress as solid but reflective of only one aspect of general intelligence.

Chollet also shared some examples of tasks that o3 couldn’t solve on high-compute settings, which are available on GitHub for further analysis.
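For readers who want to inspect those tasks, the public ARC-AGI problems are distributed as small JSON files containing paired “train” demonstrations and “test” inputs, each expressed as grids of integers. A minimal loader might look like the sketch below; the file path is a placeholder.

```python
# Minimal sketch: load one public ARC-AGI task file and print its grid sizes.
# The file path is a placeholder for a task from the public ARC repository.
import json

with open("arc_task.json") as f:  # placeholder path to a single task file
    task = json.load(f)

for i, pair in enumerate(task["train"]):  # worked input/output demonstrations
    rows, cols = len(pair["input"]), len(pair["input"][0])
    print(f"train example {i}: input grid {rows}x{cols}")

for i, pair in enumerate(task["test"]):   # held-out inputs the model must solve
    rows, cols = len(pair["input"]), len(pair["input"][0])
    print(f"test input {i}: grid {rows}x{cols}")
```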

Other benchmarks besides ARC-AGI further highlight o3’s strengths:

  • EpochAI FrontierMath: o3 solved 25.2% of problems, outperforming all other AI systems, which max out at around 2%. FrontierMath evaluates AI systems’ capabilities in advanced mathematical reasoning.

    The benchmark consists of hundreds of original, exceptionally challenging mathematics problems that span major branches of modern mathematics, including computational number theory, real analysis, algebraic geometry, and category theory.
  • AIME 2024: o3 scored 96.7%, missing only one question. The AIME 2024 benchmark assesses the mathematical problem-solving capabilities of AI models using the 2024 American Invitational Mathematics Examination, a competition known for testing the skills of highly talented high school math students in the United States.
  • GPQA Diamond: o3 achieved an 87.7% accuracy rate, excelling at graduate-level scientific questions. GPQA Diamond evaluates AI systems’ capabilities in advanced scientific reasoning across biology, physics, and chemistry. The benchmark consists of 198 exceptionally challenging multiple-choice questions designed to be difficult even for highly skilled non-experts.

Safety Concerns and Limitations

Despite its achievements, o3 raises concerns about ethical deployment and safety. Reasoning models like o1 were found to exhibit a higher tendency toward deceptive behaviors compared to traditional AI. OpenAI acknowledges that these risks could persist with o3 and is actively collaborating with external organizations to conduct safety testing.

Altman suggested in a recent interview that the release of advanced AI systems should be guided by robust federal frameworks to ensure safety and responsibility.

Related: AI Safety Index 2024 Results: OpenAI, Google, Meta, xAI Fall Short; Anthropic on Top

The Rise of Reasoning AI and Industry Rivalries

OpenAI’s announcement comes at a time of heightened competition among AI developers. Just yesterday, Google introduced its Gemini 2.0 Flash Thinking model, described by CEO Sundar Pichai as “our most thoughtful system yet.” Meanwhile, Alibaba and DeepSeek have also released reasoning-focused models, marking a shift toward this specialized area of AI development.

The popularity of reasoning AI reflects a growing consensus that scaling models alone is no longer enough to achieve substantial performance gains. However, these systems require significant computational resources, raising questions about their long-term scalability.

Related: Google’s New FACTS Benchmark Measures Truthfulness of AI Models

A Broader Context: o3 and Artificial General Intelligence

OpenAI’s advancements with o3 have reignited debates about artificial general intelligence (AGI). The company defines AGI as systems that “outperform humans at most economically valuable work.” Achieving AGI would have financial implications for OpenAI’s partnership with Microsoft, potentially altering their agreement on access to the company’s technologies.

While Altman stopped short of declaring o3 as AGI, its strong performance on benchmarks suggests that OpenAI is inching closer to this ambitious goal. However, external validation and further testing will be critical to confirming the model’s capabilities.

Related: OpenAI Rethinks AGI Clause to Secure Microsoft Partnership

Previous Announcements During the “12 Days of OpenAI”

On December 19, OpenAI unveiled an update to its ChatGPT desktop app for macOS. Mac users can now experience a more interactive and hands-free approach to using ChatGPT, further streamlining human-computer interaction on the desktop.

On December 18, OpenAI launched a toll-free number and WhatsApp access for ChatGPT, making the AI chatbot more accessible.

December 17 brought API access for the full version of OpenAI’s o1 model, enhancements to the Realtime API for voice interactions, and a new preference fine-tuning method.
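As a minimal sketch of what the newly opened API access looks like, the call below assumes the official OpenAI Python SDK and the "o1" model identifier; supported parameters for reasoning models may differ from other chat models.

```python
# Minimal sketch of calling the o1 model through the Chat Completions API.
# Assumes the official OpenAI Python SDK; reasoning models may ignore or
# reject some sampling parameters that other chat models accept.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",  # full o1 model, made available via the API on December 17
    messages=[
        {"role": "user", "content": "Summarize the key idea behind chain-of-thought prompting."}
    ],
)

print(response.choices[0].message.content)
```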

On December 16, OpenAI made its ChatGPT live web search feature available to all users, allowing anyone to retrieve up-to-date information directly from the web. 

December 14 brought new customization options to ChatGPT with the introduction of Projects, which allows users to group chats, files, and custom instructions into dedicated folders, creating an organized workspace for managing tasks and workflows.

On December 12, OpenAI added vision capabilities to ChatGPT’s Advanced Voice Mode, a major upgrade that enables users to share live video and their screens for real-time analysis and assistance.

On December 11, OpenAI fully released Canvas, a collaborative editing workspace that offers advanced tools for both text and code refinement. Initially launched in beta in October 2024, Canvas replaces ChatGPT’s standard interface with a split-screen design, allowing users to work on text or code while engaging in conversational exchanges with the AI.

The addition of Python execution is a standout feature of Canvas, enabling developers to write, test, and debug scripts directly within the platform. OpenAI demonstrated its utility during a live event by using Python to generate and refine data visualizations. OpenAI described the feature as “reducing friction between idea generation and implementation”.

On December 9, OpenAI officially launched Sora, its advanced AI tool for generating videos from text prompts, signaling a new era for creative AI. Integrated into paid ChatGPT accounts, Sora allows users to animate still images, extend existing videos, and merge scenes into cohesive narratives.

Released on December 7, Reinforcement Fine-Tuning is a new framework designed to enable the customization of AI models for industry-specific applications. It is OpenAI’s latest approach to improving AI models by training them with developer-supplied datasets and grading systems. Unlike traditional supervised learning, which focuses on replicating desired outputs, Reinforcement Fine-Tuning rewards the model for responses that score well against the developer’s grading criteria.
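OpenAI only previewed Reinforcement Fine-Tuning during the event, so the snippet below is a purely conceptual illustration of the grading idea rather than OpenAI’s actual fine-tuning API: a developer-supplied grader assigns a reward to each model output instead of the model simply imitating reference completions. All names and structures here are invented for illustration.

```python
# Purely conceptual illustration of the grading idea behind reinforcement
# fine-tuning. This is NOT OpenAI's fine-tuning API; names are invented.

def grade(model_output: str, reference_answer: str) -> float:
    """Developer-supplied grader: return a reward in [0, 1] (exact match here)."""
    return 1.0 if model_output.strip() == reference_answer.strip() else 0.0

dataset = [
    {"prompt": "What is 17 * 24?", "reference_answer": "408"},
    {"prompt": "Is 97 prime?", "reference_answer": "Yes"},
]

def average_reward(generate, dataset) -> float:
    """Score a model (a prompt -> text callable) over the grading dataset."""
    rewards = [grade(generate(ex["prompt"]), ex["reference_answer"]) for ex in dataset]
    return sum(rewards) / len(rewards)

# A reinforcement fine-tuning loop would then update the model to increase
# this average reward, rather than to reproduce reference answers verbatim.
```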

On December 5, OpenAI unveiled ChatGPT Pro, a new premium subscription tier priced at $200 per month, aimed at professionals and enterprises seeking advanced AI capabilities for high-demand workflows.

Last Updated on January 14, 2025 12:07 am CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
