Elon Musk’s artificial intelligence company xAI has released Grok 3, a major update to its chatbot, which the company claims is “ten times more capable” than the previous version.
Grok 3 is currently available exclusively to X Premium+ subscribers, integrating directly into the X social platform as part of Musk’s effort to enhance AI-powered interactions within the ecosystem.
Grok 3 is built on xAI’s proprietary model architecture and runs on the Colossus supercomputer, which Musk is currently scaling up to one million Nvidia GPUs. This move signals xAI’s push to compete with OpenAI, Google DeepMind, and Anthropic in the rapidly evolving AI industry.
However, early evaluations show that while Grok 3 has improved in some areas, it still struggles with accuracy issues in Deep Search, limited humor capabilities, and reasoning failures in certain complex problem-solving tasks. The release also comes amid Musk’s ongoing legal dispute with OpenAI, further intensifying competition in the AI space.
How Grok 3 Compares to OpenAI, Google, and Anthropic
With its new updates, Grok 3 presents itself as a competitor to leading AI models like OpenAI’s GPT-4o, Google’s Gemini 2.0, and Anthropic’s Claude. According to test results shown by xAI, Grok 3 outperforms its competitors across key AI benchmarks, demonstrating strong capabilities in math, science, and coding tasks.
Grok 3 scored 52 in Math (AIME’24), significantly ahead of GPT-4o (9) and Claude 3.5 Sonnet (16). In Science (GPQA), it led with 75, outperforming Gemini 2 Pro, Claude 3.5, and DeepSeek-V3, which all scored 65, while GPT-4o lagged at 50. The Coding (LCB Oct-Feb) test also saw Grok 3 leading at 57, well above GPT-4o (34) and other rivals. These results suggest that xAI’s newest model excels in structured problem-solving and technical reasoning, though real-world performance will depend on further independent evaluations.
nice to meet you pic.twitter.com/fk1EOtSVFm
— Grok (@grok) February 18, 2025
However, as Rex Asabor of OpenAI pointed out on X, the company’s unreleased o3 model still scores much higher than Grok 3 in thinking mode on both GPQA and AIME’24, according to OpenAI’s internal testing.
they omitted o3 from the chart in the livestream for some reason so i added the numbers for you pic.twitter.com/VfevorHdy0
— Rex (@12exyz) February 18, 2025
‘Think’ Button For AI Reasoning and Deep Search
A standout feature in Grok 3 is its “Think” button, which allows users to request a more detailed and analytical response by giving the AI additional processing time. The goal is to improve reasoning accuracy and enhance the model’s ability to tackle complex tasks.
The button enables advanced chain-of-thought reasoning, which, like OpenAI’s o1 and o3 models and DeepSeek R1, aims to provide users with results based on complex thinking.
Grok 3 also introduces its own AI-driven research feature, similar to OpenAI’s Deep Research and Google Gemini’s Deep Research. The tool allows Grok 3 to pull and synthesize real-time information, making it a competitor to both deep research products and Perplexity AI, which also just launched its own deep research implementation.
Andrej Karpathy, a former Tesla AI director who received early access to Grok 3, found that with ‘Think’ mode enabled, the model successfully estimated the training FLOPs required for OpenAI’s GPT-2, a task that even OpenAI’s most powerful thinking model, o1-pro, failed. Karpathy noted, “Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails.”
For real-time research, Deep Search gives Grok 3 an edge over many models, but its accuracy issues put it behind OpenAI’s Deep Research and Perplexity AI. Karpathy says Grok 3 generates “hallucinated URLs” and avoids citing X unless explicitly asked to, which limits its effectiveness as a research tool.
In terms of reasoning, Grok 3’s new Think mode allows it to match OpenAI’s o1-pro in some logic-heavy tasks. However, it still struggles with spatial reasoning, as demonstrated by its failed tic-tac-toe board generation test. This places it behind GPT-4o, which has been noted for its advanced logic capabilities.
Creativity remains another weak point. Claude has been widely praised for its natural and engaging writing style, while Grok 3 still produces responses that feel formulaic.
In another test, Grok 3 was able to correctly generate a Settlers of Catan board setup, a challenge that many AI models struggle with. However, when asked to generate tricky tic-tac-toe boards, the model failed, producing nonsensical layouts. Karpathy observed, “It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought… but failed on generating tricky ones.”
I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check.
Thinking
✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settler's of Catan… pic.twitter.com/qIrUAN1IfD
— Andrej Karpathy (@karpathy) February 18, 2025
Despite these improvements in logic and math-based tasks, Grok 3 still has notable weaknesses. Its humor remains limited, with Karpathy stating, “Sadly the model’s sense of humor does not appear to be obviously improved… joke generation remains stale and repetitive.” This suggests that xAI has yet to enhance the chatbot’s creative and conversational abilities.
Musk’s Legal Battle With OpenAI and xAI’s Position in the AI Race
Grok 3’s release comes as Musk remains locked in a legal battle with OpenAI. Musk, who co-founded OpenAI in 2015 before leaving, has accused the company of abandoning its nonprofit mission in favor of corporate partnerships, particularly its deepening ties with Microsoft.
Musk recently made a $97.4 billion bid to acquire OpenAI, which was rejected by its board. In his lawsuit against the company, he argues that it has transformed into a “closed-source AI enterprise” focused on maximizing profits instead of advancing artificial intelligence for the benefit of humanity. OpenAI has denied these claims, stating that it remains committed to safe and ethical AI development.
By developing Grok 3 and integrating it into X, Musk is positioning xAI as an alternative to the AI ecosystems being built by OpenAI, Google, and Anthropic. The company’s decision to keep Grok’s training infrastructure separate from Microsoft and Google also signals a strategic shift toward AI independence.
Availability and What’s Next for Grok and xAI
Unlike OpenAI’s ChatGPT, which offers free and tiered subscription plans, Grok 3 remains behind a paywall, requiring users to subscribe to the highest premium tier on X to access its features.
In addition to the standard version of Grok 3, xAI is reportedly working on a more advanced variant called SuperGrok. While details remain scarce, Musk has hinted that SuperGrok will leverage even more compute power from the Colossus supercomputer, potentially offering stronger reasoning abilities and enhanced multimodal capabilities.
This could position SuperGrok as xAI’s answer to OpenAI’s most powerful enterprise-tier models, targeting researchers, developers, and businesses that require more sophisticated AI performance. However, no official launch date or pricing details for SuperGrok have been announced yet.
Musk hinted earlier that Grok 4 is already in development and is expected to introduce advanced multimodal AI capabilities. This would allow the model to process not just text but also images, video, and real-time audio, similar to OpenAI’s GPT-4o.
With xAI’s aggressive expansion of Colossus, future iterations of Grok will likely continue to see improvements in reasoning, creativity, and real-time research capabilities. However, the company will need to address Deep Search’s reliability issues and enhance the chatbot’s engagement quality to truly rival the industry’s leading AI models.
Table: AI Model Benchmarks – LLM Leaderboard
Last Updated on March 3, 2025 11:29 am CET