According to leaked internal documents, Apple’s virtual assistant Siri is lagging significantly behind OpenAI’s ChatGPT. Apple’s internal evaluation reportedly found Siri to be roughly 25% less accurate than ChatGPT, which also answered about 30% more questions successfully. As the tech industry races forward with AI advancements, Apple is struggling to keep pace with its competitors.
The report, which highlights ongoing development issues with Siri, has sparked concerns that Apple’s AI technology may be as much as two years behind leaders like OpenAI. Apple’s ambition to push AI across its ecosystem, including devices like the iPad mini, may not be enough to compensate for these shortcomings: the iPad’s hardware can support Apple’s AI platform, but the software isn’t ready at launch, leaving users waiting for updates.
Comparisons with Google and Amazon’s AI Offerings
Apple’s Siri isn’t the only voice assistant with problems. Google’s Gemini Live, launched in August, has also been facing backlash over its inconsistent performance. Built on Google’s Gemini 1.5 AI models, Gemini Live was expected to deliver smooth conversational interactions, but many users reported glitches and inaccurate responses.
Unlike Siri, Gemini Live lacks basic adaptability features like varying pitch or tone, contributing to a robotic feel. And like Siri, Gemini Live struggles with “hallucinations”—confidently delivering wrong information.
The shortcomings of voice assistants extend to Amazon as well. Amazon is gearing up to release Alexa Plus this month, a premium, subscription-based version of its AI assistant. Alexa ships on millions of devices, yet it has long struggled to handle more complex queries accurately, particularly in areas like politics and news. The latest version aims to introduce smarter conversational tools, but it remains unclear whether that will be enough to set it apart from competitors like Google and OpenAI.
Internal Frustrations as Siri’s AI Struggles Persist
Apple’s employees are reportedly concerned about the AI shortfall, and Siri’s current struggles only add to the company’s list of recent technical hiccups. Consumers are already expressing frustration over issues like battery drain on the new iPhone 16, which followed the rollout of iOS 18. Many iPhone users took to forums to report unexpected battery loss, fueling discontent with Apple’s latest software update. While Siri has seen incremental improvements, it’s not enough to bring it in line with its rivals.
Can Apple Catch Up in the AI Race?
Apple, long seen as a leader in hardware design and software ecosystems, now finds itself falling behind in one of the most critical areas of tech innovation—artificial intelligence. While Google and Amazon are grappling with their own AI challenges, Apple’s internal report suggests that it has more ground to make up.
This setback comes at a time when Apple is also trying to expand AI capabilities across its lineup, from the iPhone to the MacBook. The next steps for Apple’s AI plans are uncertain, but the internal frustrations outlined in the report suggest that significant improvements will be needed if Apple is to remain competitive in this space.
Apple Employees Make GPT-4o Claims
Last week I reported on Apple employees conducting research that seems to counter claims that OpenAI is far ahead in the AI race. A test by Apple researchers shows that even slight changes in wording can confuse AI models such as GPT-4o.
A research team at Apple, including Samy Bengio, evaluated AI models with a new benchmark called GSM-Symbolic and published a paper detailing their findings. The benchmark builds on existing datasets like GSM8K by adding symbolic templates that generate many surface variants of each problem, allowing a more rigorous assessment of how AI systems perform.
Departing from conventional testing approaches, GSM-Symbolic is designed to probe whether models can genuinely handle the underlying logic of real-world tasks rather than pattern-match familiar wording. Both open-source models, such as Llama, and proprietary models, including OpenAI’s o1 series, were evaluated with it.
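To make the idea concrete, here is a minimal sketch of what such symbolic templating might look like. The template, names, and numbers below are hypothetical illustrations, not taken from Apple’s paper; the point is only that the underlying arithmetic stays fixed while the surface wording varies.

```python
import random

# Hypothetical illustration of the symbolic-template idea behind GSM-Symbolic:
# one GSM8K-style question becomes a template whose names and numbers can be
# resampled, so a model is tested on many surface variants of the same problem.

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} then gives away {c} apples. How many apples does {name} have left?"
)

def make_variant(rng: random.Random) -> tuple:
    """Sample one surface variant of the problem and its ground-truth answer."""
    name = rng.choice(["Liam", "Sophia", "Noah", "Maya"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    c = rng.randint(1, a + b)          # keep the answer non-negative
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b - c                 # the logic is fixed; only the surface varies
    return question, answer

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)
```

A model that truly reasons about the problem should score roughly the same across all of these variants; the paper’s reported finding is that accuracy instead fluctuates as names and numbers change.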
The results were far from flattering, particularly for the o1 model, which, despite impressive benchmark scores, showed a marked lack of genuine reasoning ability under more demanding scrutiny.
One particularly noteworthy finding came from the GSM-NoOp dataset. By introducing a small, irrelevant sentence into a math problem, the researchers were able to confound most AI models, including OpenAI’s o1. Mehrdad Farajtabar, the project lead, highlighted the significant decline in accuracy caused by even such a seemingly minor modification to the task.
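As a rough illustration of the GSM-NoOp idea, here is a minimal sketch loosely based on the kiwi-counting example discussed alongside the paper; the exact wording and the helper function are my own and hypothetical. The added clause is true but irrelevant, so the correct answer does not change.

```python
# Hypothetical illustration of a GSM-NoOp-style distractor: append a factually
# true but irrelevant clause to a question. The correct answer is unchanged,
# yet the paper reports that model accuracy drops sharply on such variants.

QUESTION = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

# Extra information that does not affect the count.
NO_OP_CLAUSE = "Five of the kiwis picked on Sunday are a bit smaller than average."

def add_no_op(question: str, clause: str) -> str:
    """Insert the irrelevant clause just before the final question sentence."""
    body, final_question = question.rsplit(". ", 1)
    return f"{body}. {clause} {final_question}"

if __name__ == "__main__":
    print(add_no_op(QUESTION, NO_OP_CLAUSE))
    # Correct answer either way: 44 + 58 + 2 * 44 = 190
```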