Researchers at Apple have recently expressed concerns over how large language models, particularly OpenAI’s o1 and GPT-4o, approach logical reasoning. Despite OpenAI’s confidence in its models, Apple’s team has found that their reasoning abilities may not be as advanced as claimed.
An Apple team that includes Samy Bengio evaluated AI models using a new tool called GSM-Symbolic and has published a paper with the results. The method builds on existing datasets such as GSM8K, but goes further by introducing symbolic templates that generate many variants of each question, allowing a more rigorous assessment of AI performance.
New Tool Reveals Gaps in AI Reasoning
Instead of relying on conventional testing methods, Apple’s team created GSM-Symbolic, which probes how well AI models handle genuinely logical tasks. They tested both open-source models, such as Llama, and proprietary models such as OpenAI’s o1 series. The results weren’t flattering for any of them, particularly the o1 model, which, despite high benchmark scores, failed to show genuine reasoning skills under tougher scrutiny.
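To make the idea of symbolic templates concrete, here is a minimal, hypothetical sketch of how a GSM8K-style question could be turned into a template that produces many variants; the wording, names, and value ranges are invented for illustration and are not drawn from Apple’s paper.

```python
import random

# Hypothetical GSM-Symbolic-style template: the wording stays fixed while
# the names and numeric values are drawn fresh for each variant.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template and return (question, correct answer)."""
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Because only surface details change while the underlying arithmetic stays the same, a model that genuinely reasons should perform identically on every variant; the accuracy swings Apple reports suggest the models are instead sensitive to those surface details.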
One interesting finding came from the GSM-NoOp dataset, where a single irrelevant sentence was added to each math problem. Such a tiny tweak was enough to trip up most models, including OpenAI’s o1. Project lead Mehrdad Farajtabar highlighted how even this seemingly minor change caused a noticeable drop in accuracy.
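A rough sketch of the GSM-NoOp idea follows, assuming the perturbation simply inserts an irrelevant but plausible-sounding clause before the final question; the distractor wording here is invented for illustration rather than taken from the dataset.

```python
# Hypothetical GSM-NoOp-style perturbation: insert a clause that mentions
# numbers but has no bearing on the answer. A robust reasoner should ignore it.
DISTRACTOR = " Five of the apples were slightly smaller than the others."

def add_noop(question: str) -> str:
    """Insert the irrelevant sentence just before the final question."""
    statement, _, final_question = question.rpartition(". ")
    return f"{statement}.{DISTRACTOR} {final_question}"

original = (
    "Sophie picks 12 apples on Monday and 9 apples on Tuesday. "
    "How many apples does Sophie have in total?"
)
print(add_noop(original))  # the correct answer is still 21
```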
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the… pic.twitter.com/yli5q3fKIT
— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
Models Struggle with Slight Variations
The results exposed how slight variations, such as adding unnecessary information, can disrupt model performance. Farajtabar pointed out that this would be unthinkable in human reasoning: changing the names in a math problem shouldn’t affect a student’s ability to solve it. Yet for these AI models, such changes resulted in accuracy drops of 10% or more, raising serious concerns about their stability.
More data and computation may make models better at recognizing patterns, but they don’t necessarily enhance reasoning abilities. That’s what Apple’s team found, even with OpenAI’s most advanced models.
Implications for Real-World Use
These findings highlight some unsettling issues for AI applications in sectors like healthcare, decision-making, and education, where logical consistency is a must. Without improving logical reasoning, current AI systems may struggle to perform in more complex or critical environments.
The study also questions the reliability of benchmarks like GSM8K, on which models like GPT-4o score as high as 95%, a significant leap from GPT-3’s 35% just a few years ago. According to Apple’s team, however, part of that improvement could be due to benchmark test questions leaking into the training data rather than to better reasoning.
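One common way to probe for that kind of contamination, sketched here only as an illustration and not as the method Apple used, is to look for long overlapping word sequences between benchmark questions and the training corpus.

```python
# Illustrative contamination check (not Apple's method): flag a benchmark
# question if it shares a long n-gram with any training document.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def possibly_contaminated(question: str, training_docs: list[str], n: int = 13) -> bool:
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_docs)

# Usage: a question appearing verbatim in the (toy) training corpus is flagged.
corpus = [
    "... Sophie picks 12 apples on Monday and 9 apples on Tuesday. "
    "How many apples does Sophie have in total? The answer is 21 ..."
]
question = (
    "Sophie picks 12 apples on Monday and 9 apples on Tuesday. "
    "How many apples does Sophie have in total?"
)
print(possibly_contaminated(question, corpus))  # -> True
```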
The disagreement between two leading AI research institutions is noteworthy. OpenAI positions its o1 model as a breakthrough in reasoning, claiming it’s one of the first steps toward developing truly logical AI agents. Meanwhile, Apple’s team, supported by other studies, argues that there’s little evidence to support this claim.
Flawed Reasoning Extends Beyond Math Problems
The issue of flawed reasoning in AI models isn’t new. Robin Jia and Percy Liang of Stanford raised similar concerns back in 2017, showing that adding distracting sentences to reading-comprehension tasks could derail models’ answers. Apple’s study further confirms that, despite recent advancements, AI models still struggle when faced with more complex or subtle challenges.
Gary Marcus, a longtime critic of neural networks, echoed these concerns in his response to the Apple study. He pointed out that without some form of symbolic reasoning integrated into AI systems, models like OpenAI’s o1 will continue to fall short in areas that require logical thought, no matter how much data they are trained on.
The Path Forward: Neurosymbolic AI?
Marcus has long argued for a neurosymbolic approach to AI, where symbolic manipulation, like in algebra or programming, is combined with neural networks to overcome the limitations seen in models today. Apple’s research appears to support this view, suggesting that purely neural approaches may not be enough to achieve human-like reasoning capabilities.
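As a loose illustration of what adding a symbolic component can mean in practice, the hypothetical sketch below lets a language model draft an arithmetic expression while delegating the actual calculation to a small symbolic evaluator instead of trusting the model’s own arithmetic. The `draft_expression` function is a stand-in for a model call and is an assumption, not part of any cited system.

```python
import ast
import operator

# Symbolic side: a tiny, safe evaluator for arithmetic expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Evaluate an arithmetic expression exactly, without trusting the model."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# Neural side (hypothetical): a model call that returns an expression,
# e.g. "12 + 9" for the apple problem, instead of a final number.
def draft_expression(question: str) -> str:
    return "12 + 9"  # placeholder for an actual LLM call

question = "Sophie picks 12 apples on Monday and 9 on Tuesday. How many in total?"
print(evaluate(draft_expression(question)))  # -> 21
```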
At the same time, even OpenAI’s o1 model, despite its improvements, still struggles when dealing with more challenging tasks, such as large arithmetic problems or complex reasoning. These performance drops indicate that current neural network architectures may not be equipped to handle the kind of abstract thinking needed for more advanced AI systems.