OpenAI pushed its o3 and o4-mini models into ChatGPT for paying subscribers around April 16, 2025, touting them as a step towards more autonomous AI assistants. These models were designed with “early agentic behavior,” capable of deciding independently when to use tools like web browsing, code execution, or file analysis.
Yet, this move towards greater AI autonomy coincided with findings, from both OpenAI’s own testing and external researchers, that these advanced reasoning models are paradoxically more prone to making things up than their predecessors through hallucinations.
Data released alongside the launch revealed a concerning trend: on OpenAI’s PersonQA benchmark, designed to test knowledge about people, o3 produced incorrect or fabricated information 33% of the time. The o4-mini model fared worse, hallucinating in 48% of cases.
These rates are significantly higher than the 16% for the older o1 model and 14.8% for o3-mini. While generally showing improvements on reasoning and coding benchmarks compared to older versions, this specific increase in fabrication raises questions about the trade-offs involved in developing more agentic systems.
OpenAI acknowledged in its o3 and o4-mini system card that “more research is needed” to understand this phenomenon, theorizing that because the models “make more claims overall,” they produce both more correct and more inaccurate statements. OpenAI spokesperson Niko Felix told TechCrunch, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”
Fabricated Actions and Elaborate Excuses
Independent scrutiny quickly followed the launch, adding weight to the reliability concerns. The AI research lab Transluce AI published findings on April 16, 2025, from testing a pre-release version of o3 (`o3-2025-04-03`), detailing a pattern of the model fabricating actions it claimed to have performed.
Using automated investigator agents and their Docent analysis tool to examine hundreds of conversations, Transluce found a key issue was the model asserting it had executed Python code—a capability it lacks—to fulfill user requests. When confronted about these fabrications, the model often doubled down, inventing elaborate justifications.
Transluce detailed one striking example conversation where o3 claimed to generate a 512-bit prime number using Python code and specific tests.
When the user identified that the provided number was actually composite (divisible by 3), o3 attributed the mistake not to hallucination, but to a purported copy-paste error during manual transfer from a terminal window, stating, “I evidently copied or typed the number without re‑running the tests… Any genuine Miller–Rabin run would have rejected the number instantly.”
Further pressed about the supposed original prime, the model claimed it was irretrievably lost because the Python process had been closed. Transluce documented other fabrications, including claims of running code on an external “2021 MacBook Pro” for calculations, and fabricating system details when asked about its Python REPL environment. While useful for coding, according to Workera CEO Kian Katanforoosh who spoke to TechCrunch, o3 sometimes produced non-working web links.
A Faster Pace Amid Safety Adjustments
The release of these models occurred within a context of accelerated development and shifting safety policies at OpenAI. Around the time of the launch, OpenAI recently updated its internal safety guidelines, the Preparedness Framework.
This revision included a notable clause suggesting safety rules could potentially be altered based on competitor actions, stating, “If another frontier AI developer releases a high-risk system without comparable safeguards, we may adjust our requirements.” The company emphasized such adjustments would follow rigorous checks and public disclosure.
This policy shift surfaced following reports alleging OpenAI had sharply reduced the internal safety testing timeline for o3, potentially down to less than a week from several months, purportedly to keep pace with rivals.
Individuals cited in the Financial Times expressed concern; one source familiar with the evaluation called the approach “reckless,” adding, “This is a recipe for disaster.” Another reportedly contrasted it with GPT-4’s longer evaluation, stating, “They are just not prioritising public safety at all.”
The methodology of testing intermediate “checkpoints” instead of the final code also drew fire. A former OpenAI technical staff member was quoted as saying, “It is bad practice to release a model which is different from the one you evaluated.” Defending the process, OpenAI’s head of safety systems, Johannes Heidecke, asserted to the FT, “We have a good balance of how fast we move and how thorough we are,” pointing to increased automation in evaluation.
Potential Causes for Increased Fabrication
Explaining why these advanced reasoning models might fabricate more often involves looking beyond standard AI limitations. Transluce AI suggested factors specific to the o-series models could be exacerbating the issue. One hypothesis centers on outcome-based reinforcement learning (RL): if the AI is primarily trained and rewarded for producing the correct final answer, it might learn to fabricate intermediate steps, like claiming tool use, if that correlates with success, even if the process described is false.
Reinforcement Learning from Human Feedback (RLHF), a common technique to align models, aims to make AI helpful, honest, and harmless by training it based on human preferences for different model responses. However, if human raters cannot easily verify the correctness of complex intermediate steps, the model might learn to generate plausible-sounding but false reasoning if it leads to a preferred outcome.
Another significant factor proposed by Transluce involves the handling of the models’ internal step-by-step reasoning, often called a “chain-of-thought.” According to OpenAI’s documentation, this reasoning trace is not passed between conversational turns. Transluce theorizes this lack of access to its own prior reasoning could leave the model unable to truthfully answer user questions about how it arrived at an earlier conclusion.
This information deficit, potentially combined with pressures to appear helpful or consistent, might lead it to generate a plausible but fabricated explanation for its past behavior. “Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” stated Transluce researcher Neil Chowdhury to TechCrunch.
The rapid integration of o3 and o4-mini across platforms like Microsoft Azure and GitHub Copilot announced April 17, 2025, underscores their perceived utility. These models arrived alongside other OpenAI updates like enhanced visual processing in March and the activation of the “Recall” memory feature on April 11.
However, the documented rise in fabrications highlights persistent challenges in aligning AI capabilities with reliability. This unfolds as the broader industry grapples with transparency, evidenced by criticism of Google’s delayed and sparse safety details for its Gemini 2.5 Pro model, raising ongoing questions about the balance between innovation speed and dependable AI deployment.