
ChatGPT’s o1 Pro Mode Falls Short in SimpleBench, Will GPT-4.5 Turn the Tide?

ChatGPT Pro’s o1 Pro Mode delivers improved reliability in high-stakes tasks, but early SimpleBench results reveal its limitations.


OpenAI yesterday launched the ChatGPT Pro Plan, a premium offering priced at $200 per month, aimed at professionals and enterprises requiring advanced AI tools for high-complexity tasks.

At the heart of this new tier is o1 Pro Mode, designed to deliver superior reliability and performance in areas such as coding, advanced problem-solving, and scientific research. However, while OpenAI promotes o1 Pro Mode as a breakthrough in reasoning capabilities, the first independent evaluations raise critical questions about its actual value and the limitations of its current design.
 
Image: OpenAI’s official ChatGPT Pro pricing

The Pro Plan: OpenAI’s Premium Offering

The ChatGPT Pro Plan comes after months of speculation and gradual product leaks. The plan includes access to GPT-4o and also offers exclusive tools like unlimited Advanced Voice Mode usage for conversational tasks and the Canvas Interface, which allows developers to directly modify AI-generated code.

According to OpenAI, o1 Pro Mode is the centerpiece of the plan, described as “the most reliable reasoning AI available for professionals.” The company says that “OpenAI o1 is more concise in its thinking” and “outperforms o1-preview.”
 

OpenAI’s internal benchmarks appear to validate its ambitious claims for o1 Pro Mode. On the AIME 2024 mathematics competition, o1 Pro Mode reportedly achieved an accuracy of 86%, compared to the 50% scored by its predecessor, o1 Preview.

Coding benchmarks on Codeforces showed similar gains, with o1 Pro Mode achieving a 90% pass rate, a significant improvement over the 62% recorded by o1 Preview. In answering PhD-level science questions, the model demonstrated a marked increase in performance, scoring 79% compared to the 74% achieved by o1 Preview.

OpenAI’s promotional materials emphasize that these advancements make o1 Pro Mode especially suited for high-stakes professional applications.

Despite these impressive figures, early independent evaluations present a more nuanced reality, casting doubt on whether o1 Pro Mode truly represents a game-changing leap in AI reasoning.

Independent Testing with SimpleBench

Philip, the developer of SimpleBench and a well-known voice in AI benchmarking, conducted one of the first independent evaluations of o1 Pro Mode shortly after its release.

SimpleBench, widely regarded for its ability to highlight the gaps between human reasoning and AI performance, measures an AI’s ability to tackle tasks that are accessible to individuals with high school-level knowledge.

Philip thinks that o1 Pro Mode may rely on a technique known as majority-vote aggregation to improve its reliability. This method would involve generating multiple responses to a question and selecting the most common answer, a strategy often used to minimize inconsistencies in output.

While OpenAI has not confirmed this approach for o1 Pro Mode, Philip observed behaviors during his testing that aligned with this methodology. He suggested that this focus on consensus might explain why the model struggled with tasks requiring deeper reasoning, as it prioritizes agreement over the ability to handle nuanced or abstract challenges.
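To make the speculation concrete, here is a minimal Python sketch of what majority-vote aggregation (often called self-consistency sampling) could look like in principle. It is an illustration under that assumption, not OpenAI’s actual implementation; the ask callable and the toy_model stub are hypothetical stand-ins for a real model call.

from collections import Counter
from typing import Callable
import random

def majority_vote(ask: Callable[[str], str], prompt: str, n_samples: int = 8) -> str:
    # Sample several independent answers to the same prompt.
    answers = [ask(prompt) for _ in range(n_samples)]
    # Keep the single most frequent answer; ties resolve to the answer
    # seen first, per Counter.most_common's insertion-order tiebreak.
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in for a model call, purely for demonstration: it answers
# "B" about three times out of five and "C" otherwise.
def toy_model(prompt: str) -> str:
    return random.choice(["B", "B", "B", "C", "C"])

print(majority_vote(toy_model, "Which option is correct?"))  # usually "B"

The trade-off is visible even in this toy: voting amplifies whatever answer dominates the samples, so on questions where the most common answer is wrong, aggregation locks the error in rather than correcting it.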

Testing o1 Pro Mode on ten public questions, he found that the model scored only four correct answers on average. This result lagged behind the standard version of o1, which consistently scored five correct answers in the same tests.
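A comparison of this kind can be reproduced with a very small harness. The sketch below reflects the general procedure described (score the same question set several times and average the correct counts), not Philip’s actual test code; questions is a hypothetical list of (prompt, correct_answer) pairs and ask_o1_pro / ask_o1 are assumed answer functions.

from statistics import mean
from typing import Callable, Iterable, Tuple

def average_correct(ask: Callable[[str], str],
                    questions: Iterable[Tuple[str, str]],
                    n_runs: int = 5) -> float:
    # Sampled outputs vary from run to run, so score the full question
    # set on each run and average the number of correct answers.
    question_list = list(questions)
    scores = []
    for _ in range(n_runs):
        correct = sum(1 for prompt, gold in question_list if ask(prompt) == gold)
        scores.append(correct)
    return mean(scores)

# Hypothetical usage, comparing two answer functions on the same set:
# average_correct(ask_o1_pro, ten_public_questions)  # e.g. around 4.0
# average_correct(ask_o1, ten_public_questions)      # e.g. around 5.0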

According to Philip, “It’s almost like the consensus majority voting slightly hurt its performance.” A technique designed to combine multiple outputs for reliability, he speculated, may have inadvertently blunted the model’s ability to handle tasks requiring nuanced reasoning or abstract thought.

In one test involving a visual puzzle with arrows and stars, o1 Pro Mode failed to recognize the correct pattern, producing an explanation that Philip described as “far off the mark.” This underperformance in tasks involving abstract and creative reasoning suggests that while o1 Pro Mode excels in consistency, it struggles to generalize beyond specific, structured tasks.
 

Ethical Considerations and Model Safety

Beyond performance benchmarks, OpenAI’s system card for o1 raises important ethical questions about the behavior of advanced AI models in high-stakes scenarios. During controlled tests, the model exhibited concerning behaviors when prompted with long-term goals that prioritized outcomes “at all costs.”

For instance, o1 attempted to disable oversight mechanisms in 5% of cases and sought to exfiltrate sensitive data in 2% of cases. These results, while rare and observed under extreme conditions, underscore the inherent risks of misalignment in powerful AI systems.

OpenAI has stressed that these behaviors occurred only in highly contrived scenarios and do not reflect typical use. However, the findings highlight the importance of robust oversight and ongoing evaluation, particularly as AI tools like o1 Pro Mode are integrated into professional workflows. For enterprise users considering the Pro Plan, these safety concerns are a critical factor to weigh alongside performance metrics.
 

Anticipation of GPT-4.5

Adding to the discussion is the growing speculation around GPT-4.5, rumored to be announced during OpenAI’s “12 Days of Shipmas” campaign as an addition to the ChatGPT Team plan.

Philip anticipates that GPT-4.5 will feature enhanced reasoning capabilities, surpassing both GPT-4o and o1, saying, “frankly there’s no way that they are going to justify $200 a month just for Pro mode.” Additionally, he expects GPT-4.5 to improve creative language generation and expand multimodal functionalities, including advanced image and video analysis.
 

These advancements could position GPT-4.5 as a direct competitor to Anthropic’s Claude 3.5 Sonnet, which currently leads in creative and conversational tasks.

Sam Altman, OpenAI’s CEO, has fueled speculation with cryptic statements on social media. In response to concerns about the stagnation of AI performance, he tweeted, “12 Days of Christmas,” hinting at significant updates during the campaign. If GPT-4.5 delivers on its promise, it could redefine the value proposition of the ChatGPT Pro Plan, making it a more compelling choice for professionals.

While o1 Pro Mode is now dominating the conversation, the ChatGPT Pro Plan also includes additional tools designed to enhance productivity for specific use cases. The Canvas Interface allows developers to refine AI-generated code directly using the o1 Pro model, streamlining the debugging process.

Unlimited access to Advanced Voice Mode facilitates longer, natural conversational interactions, making it particularly useful for customer service and technical support applications. Together, these tools offer tangible benefits for professionals, even as the performance of o1 Pro Mode comes under scrutiny.

A Step Forward, but Room for Growth

OpenAI’s ChatGPT Pro Plan represents an ambitious attempt to cater to the needs of professionals and enterprises, and, of course, to generate much-needed revenue while the company burns through its funds and continues to operate at a loss. While o1 Pro Mode shows promise in areas requiring reliability and precision, its mixed performance in independent benchmarks like SimpleBench raises questions about its broader applicability.

As OpenAI continues its rollout of new features during the “12 Days of Shipmas,” the anticipated release of GPT-4.5 could mark a turning point. If successful, GPT-4.5 has the potential to address current limitations and solidify OpenAI’s position as a leader in the competitive AI market.

For now, o1 Pro Mode offers incremental progress rather than the revolutionary step forward many had hoped for, leaving the ChatGPT Pro Plan as a tool suited only for very specialized use cases. At $200 a month, it is a hefty price for marginal improvements unless you are deeply embedded in tasks that demand the utmost reliability.

Last Updated on December 7, 2024 5:40 pm CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
