Elon Musk’s xAI Launches Grok 3, Dethroning OpenAI on Key AI Benchmarks

Grok 3 arrives with major updates, including a "Think" button for smarter AI reasoning and Deep Search to compete with OpenAI’s Deep Research feature.

Elon Musk’s artificial intelligence company xAI has released Grok 3, a major update to its chatbot, which the company claims is “ten times more capable” than the previous version.

Grok 3 is currently available exclusively to X Premium+ subscribers, integrating directly into the X social platform as part of Musk’s effort to enhance AI-powered interactions within the ecosystem.

Grok 3 is built on xAI’s proprietary model architecture and runs on the Colossus supercomputer, which Musk is currently scaling up to one million Nvidia GPUs. This move signals xAI’s push to compete with OpenAI, Google DeepMind, and Anthropic in the rapidly evolving AI industry.

However, early evaluations show that while Grok 3 has improved in some areas, it still struggles with accuracy issues in Deep Search, limited humor capabilities, and reasoning failures in certain complex problem-solving tasks. The release also comes amid Musk’s ongoing legal dispute with OpenAI, further intensifying competition in the AI space.

How Grok 3 Compares to OpenAI, Google, and Anthropic

With its new updates, Grok 3 presents itself as a competitor to leading AI models like OpenAI’s GPT-4o, Google’s Gemini 2.0, and Anthropic’s Claude. According to test results shown by xAI, Grok 3 outperforms its competitors across key AI benchmarks, demonstrating strong capabilities in math, science, and coding tasks.

Grok 3 scored 52 in Math (AIME’24), significantly ahead of GPT-4o (9) and Claude 3.5 Sonnet (16). In Science (GPQA), it led with 75, outperforming Gemini 2 Pro, Claude 3.5, and DeepSeek-V3, which all scored 65, while GPT-4o lagged at 50. The Coding (LCB Oct-Feb) test also saw Grok 3 leading at 57, well above GPT-4o (34) and other rivals. These results suggest that xAI’s newest model excels in structured problem-solving and technical reasoning, though real-world performance will depend on further independent evaluations.
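As a rough illustration of those margins, the short sketch below tabulates only the scores quoted above and computes Grok 3's lead over the best-scoring rival named for each benchmark. The dictionary layout and the helper name `lead_over_best_rival` are illustrative choices, not anything published by xAI, and the coding benchmark lists only GPT-4o because the other rivals' scores are not given here.

```python
# Scores as quoted in this article (xAI-reported figures, not independently verified).
scores = {
    "AIME'24 (math)": {"Grok 3": 52, "GPT-4o": 9, "Claude 3.5 Sonnet": 16},
    "GPQA (science)": {
        "Grok 3": 75,
        "Gemini 2 Pro": 65,
        "Claude 3.5": 65,
        "DeepSeek-V3": 65,
        "GPT-4o": 50,
    },
    # Only GPT-4o's coding score is quoted in the article.
    "LCB Oct-Feb (coding)": {"Grok 3": 57, "GPT-4o": 34},
}


def lead_over_best_rival(results: dict, model: str = "Grok 3") -> int:
    """Return the model's score minus the highest score among the other listed models."""
    best_rival = max(score for name, score in results.items() if name != model)
    return results[model] - best_rival


leads = {bench: lead_over_best_rival(results) for bench, results in scores.items()}

for bench, lead in leads.items():
    print(f"{bench}: Grok 3 leads the best listed rival by {lead} points")
```

On these figures the gap is widest in math and narrowest in science, where three rivals are tied ten points behind.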

However, as Rex Asabor from OpenAI pointed out on X, the company’s unreleased o3 model still scores much higher on both GPQA and AIME’24 than Grok 3 in thinking mode, according to OpenAI’s internal testing.

‘Think’ Button For AI Reasoning and Deep Search

A standout feature in Grok 3 is its “Think” button, which allows users to request a more detailed and analytical response by giving the AI additional processing time. The goal is to improve reasoning accuracy and enhance the model’s ability to tackle complex tasks.

The button enables advanced chain-of-thought reasoning, which, like OpenAI’s o1 and o3 models and DeepSeek R1, aims to provide users with results based on more complex thinking.

Grok 3 also introduces Deep Search, its own AI-driven research feature similar to OpenAI’s Deep Research and Google Gemini’s Deep Research. The tool allows Grok 3 to pull and synthesize real-time information, making it a competitor to both of those products and to Perplexity AI, which also recently launched its own deep research implementation.

Andrej Karpathy, a former Tesla AI director who received early access to Grok 3, found that with ‘Think’ mode enabled, the model successfully estimated the training FLOPs required for OpenAI’s GPT-2, a task that even OpenAI’s most powerful thinking model, o1-pro, failed. Karpathy noted, “Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails.”
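To give a sense of what that task involves, here is a minimal back-of-the-envelope sketch in Python. It uses the common rule of thumb that training costs about 6 FLOPs per parameter per token. The 1.5 billion parameter count and the roughly 40GB training corpus are published GPT-2 figures, but the bytes-per-token ratio and the number of epochs are assumptions made purely for illustration, and the function name `training_flops` is hypothetical. This is not necessarily the calculation Grok 3 produced.

```python
def training_flops(params: float, tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token
    (about 2 for the forward pass and 4 for the backward pass)."""
    return flops_per_param_token * params * tokens


GPT2_PARAMS = 1.5e9       # largest GPT-2 model, as published
DATASET_BYTES = 40e9      # ~40GB of training text, as published
BYTES_PER_TOKEN = 4.0     # ASSUMPTION: roughly 4 bytes of English text per token
EPOCHS = 10               # ASSUMPTION: the number of passes over the data is not public

tokens_per_epoch = DATASET_BYTES / BYTES_PER_TOKEN   # ~10 billion tokens
total_tokens = tokens_per_epoch * EPOCHS             # ~100 billion tokens
estimate = training_flops(GPT2_PARAMS, total_tokens)

print(f"Estimated GPT-2 training compute: {estimate:.1e} FLOPs")
```

Under these assumptions the estimate lands at roughly 9 × 10^20 FLOPs. The difficulty of the task is that the token count is not published, so a model has to combine lookup, estimation, and arithmetic to reach an answer of this order of magnitude.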

For real-time research, Deep Search gives Grok 3 an edge over many models, but its accuracy issues put it behind OpenAI’s Deep Research and Perplexity AI. Karpathy says Grok 3 generates “hallucinated URLs” and avoids citing X unless explicitly asked to, which limits its effectiveness as a research tool.

In terms of reasoning, Grok 3’s new Think mode allows it to match OpenAI’s o1-pro in some logic-heavy tasks. However, it still struggles with spatial reasoning, as demonstrated by its failed tic-tac-toe board generation test. This places it behind GPT-4o, which has been noted for its advanced logic capabilities.

Creativity remains another weak point. Claude has been widely praised for its natural and engaging writing style, while Grok 3 still produces responses that feel formulaic.

In another test, Grok 3 was able to correctly generate a Settlers of Catan board setup, a challenge that many AI models struggle with. However, when asked to generate tricky tic-tac-toe boards, the model failed, producing nonsensical layouts. Karpathy observed, “It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought… but failed on generating tricky ones.”

Despite these improvements in logic and math-based tasks, Grok 3 still has notable weaknesses. Its humor remains limited, with Karpathy stating, “Sadly the model’s sense of humor does not appear to be obviously improved… joke generation remains stale and repetitive.” This suggests that xAI has yet to enhance the chatbot’s creative and conversational abilities.

Musk’s Legal Battle With OpenAI and xAI’s Position in the AI Race

Grok 3’s release comes as Musk remains locked in a legal battle with OpenAI. Musk, who co-founded OpenAI in 2015 before leaving, has accused the company of abandoning its nonprofit mission in favor of corporate partnerships, particularly its deepening ties with Microsoft.

Musk recently made a $97.4 billion bid to acquire OpenAI, which was rejected by the company’s board. In his lawsuit against the company, he argues that it has transformed into a “closed-source AI enterprise” focused on maximizing profits instead of advancing artificial intelligence for the benefit of humanity. OpenAI has denied these claims, stating that it remains committed to safe and ethical AI development.

By developing Grok 3 and integrating it into X, Musk is positioning xAI as an alternative to the AI ecosystems being built by OpenAI, Google, and Anthropic. The company’s decision to keep Grok’s training infrastructure separate from Microsoft and Google also signals a strategic shift toward AI independence.

Availability and What’s Next for Grok and xAI

Unlike OpenAI’s ChatGPT, which offers free and tiered subscription plans, Grok 3 remains behind a paywall, requiring users to subscribe to the highest premium tier on X to access its features.

In addition to the standard version of Grok 3, xAI is reportedly working on a more advanced variant called SuperGrok. While details remain scarce, Musk has hinted that SuperGrok will leverage even more compute power from the Colossus supercomputer, potentially offering stronger reasoning abilities and enhanced multimodal capabilities.

This could position SuperGrok as xAI’s answer to OpenAI’s most powerful enterprise-tier models, targeting researchers, developers, and businesses that require more sophisticated AI performance. However, no official launch date or pricing details for SuperGrok have been announced yet.

Musk has previously hinted that Grok 4 is already in development and is expected to introduce advanced multimodal AI capabilities. This would allow the model to process not just text but also images, video, and real-time audio, similar to OpenAI’s GPT-4o.

With xAI’s aggressive expansion of Colossus, future iterations of Grok will likely continue to see improvements in reasoning, creativity, and real-time research capabilities. However, the company will need to address Deep Search’s reliability issues and enhance the chatbot’s engagement quality to truly rival the industry’s leading AI models.


Last Updated on March 3, 2025 11:29 am CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
