OpenAI Announces GPT-4.5, Its Largest LLM To Date

OpenAI has released GPT-4.5, its latest and most advanced model to date, but the company is making it clear that this is not a major leap forward in artificial intelligence.

Branded as the most extensive and knowledgeable model the company has built so far, GPT-4.5 comes with a caveat: OpenAI insists it is not a major technological leap. Available as a research preview, it improves on its predecessor GPT-4o but doesn’t bring the kinds of advancements that would classify it as a frontier AI system.

The model is being released to ChatGPT Pro users today; ChatGPT Plus and ChatGPT Team users will get access next week. Like GPT-4o, it supports image uploads, ChatGPT Canvas, and live search.

According to OpenAI CEO Sam Altman, GPT-4.5 is “a giant, expensive model.” In a post on X, he said the company had wanted to launch it on the Plus and Pro plans at the same time, but could not due to a shortage of GPUs.

OpenAI is also previewing GPT‑4.5 in the Chat Completions API, Assistants API, and Batch API to developers on all paid usage tiers. The model supports key features like function calling, Structured Outputs, streaming, and system messages. It also supports vision capabilities through image inputs.
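To make the developer-facing features concrete, here is a minimal sketch of what a Chat Completions request for the GPT-4.5 preview might look like, combining a system message and a function-calling tool definition. The model identifier and the `get_weather` tool are illustrative assumptions; actually sending the request would use the official openai SDK or an authenticated HTTP POST.

```python
import json

def build_chat_request(prompt: str) -> dict:
    """Assemble an illustrative Chat Completions payload with a system
    message, streaming disabled, and one function-calling tool."""
    return {
        "model": "gpt-4.5-preview",  # assumed preview identifier
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

# Print the payload that would be sent to the API.
print(json.dumps(build_chat_request("What's the weather in Paris?"), indent=2))
```

Structured Outputs and vision inputs plug into the same request shape, via a response-format specification and image content parts in the messages list.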

While OpenAI claims GPT-4.5 improves factual accuracy and generates more natural responses, the company acknowledges it falls behind its specialized reasoning models in some areas.

A Model Tweaked for Efficiency Rather Than Radical Change

Instead of introducing entirely new capabilities, GPT-4.5 refines existing features and enhances response efficiency. OpenAI describes it as more than ten times as computationally efficient as GPT-4.

To train the model, OpenAI blended new scalable techniques with traditional methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This approach is meant to make interactions more fluid, minimize incorrect responses, and improve usability in real-world applications.

Despite these upgrades, OpenAI makes it clear that GPT-4.5 is not its most capable model in reasoning tasks. Other models like o1 and o3-mini perform better in specific areas, particularly in structured problem-solving and logical assessments.

GPT-4.5 Performance Results

Based on benchmark results shared by OpenAI, GPT-4.5 consistently outperforms GPT-4o across all listed categories: science (GPQA), math (AIME ’24), coding (SWE-Bench Verified and SWE-Lancer Diamond), multilingual tasks (MMMLU), and multimodal tasks (MMMU).

However, when compared to the reasoning-focused OpenAI o3-mini, GPT-4.5 shows significantly lower performance in science, math, and one of the coding benchmarks (SWE-Bench Verified), while o3-mini scores significantly lower on the other coding benchmark (SWE-Lancer Diamond). This reinforces the positioning of GPT-4.5 as a strong general-purpose model, improving upon its predecessor but not specializing in the high-level reasoning tasks where o3-mini excels.

Source: OpenAI

In OpenAI’s GPT-4.5 system card, the model is presented as a robust and versatile general-purpose language model, with improvements in key areas like hallucination reduction and multilingual understanding.

However, performance benchmarks reveal that while it surpasses its predecessor, GPT-4o, it doesn’t quite reach the heights of some of OpenAI’s more specialized models on tasks demanding advanced reasoning and autonomy. This suggests a focus on broad applicability rather than pushing the cutting edge of highly specific capabilities.

One of the most significant improvements is in the realm of factual accuracy and reducing hallucinations. On the PersonQA benchmark, which presents questions about publicly available facts about individuals, GPT-4.5 demonstrates a considerably higher accuracy rate (78%) compared to GPT-4o (28%) and even outperforms o1 (55%).

Moreover, its hallucination rate, measuring the frequency of fabricated information, is slightly lower than that of its predecessors. This indicates a stronger grounding in reality and a reduced tendency to invent information.

Another crucial area of improvement is multilingual performance. OpenAI evaluated GPT-4.5 on a professionally translated version of the MMLU (Massive Multitask Language Understanding) benchmark. MMLU is a comprehensive test that assesses a model’s knowledge across a wide range of subjects, simulating a human-level understanding in multiple disciplines.

Crucially, using human translators for this evaluation, rather than machine translation, provides a more reliable measure of true language comprehension. The results are clear: GPT-4.5 outperforms GPT-4o across all 14 tested languages, demonstrating its ability to work across a variety of languages.

GPT-4.5’s ability to emulate the skills of a research engineer at OpenAI was put to the test using internal interview questions. The model performed well, matching the scores of deep research on coding questions (79% accuracy) and performing similarly to o1 and o3-mini on multiple-choice sections (80% accuracy). This indicates strong proficiency in core programming and machine learning concepts.

The METR evaluation assessed the performance of GPT-4.5 on autonomy and AI R&D tasks. The model performed between the levels reached by GPT-4o and o1, with an estimated time horizon score of around 30 minutes, meaning it can complete tasks of that duration with 50% reliability.
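The 50%-reliability time horizon can be illustrated with a simplified sketch (METR’s actual methodology is more involved, fitting success rates across task lengths): group task attempts by duration and take the longest duration whose success rate is still at least 50%. All data below is made up for illustration.

```python
from collections import defaultdict

def time_horizon_50(attempts: list[tuple[int, bool]]) -> int:
    """attempts: (task_duration_minutes, succeeded) pairs.
    Returns the longest duration bucket whose success rate is >= 50%
    (0 if the model clears no bucket). A toy version of the metric."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for minutes, succeeded in attempts:
        buckets[minutes].append(succeeded)
    horizon = 0
    for minutes in sorted(buckets):
        rate = sum(buckets[minutes]) / len(buckets[minutes])
        if rate >= 0.5:
            horizon = minutes  # model is still reliable at this length
    return horizon

# Hypothetical attempts: reliable at 10 min, borderline at 30, failing at 60.
attempts = [(10, True), (10, True), (30, True), (30, False), (60, False), (60, False)]
print(time_horizon_50(attempts))  # -> 30
```

Under this toy scoring, a horizon of about 30 minutes means half-hour tasks are the longest the model completes at least half the time.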

On the SWE-bench Verified, a set of real-world software engineering tasks sourced from GitHub issues, GPT-4.5 shows progress compared to GPT-4o, achieving a 38% success rate. However, it still lags significantly behind the deep research model, which achieved a substantially higher score.

Similarly, on a collection of Agentic Tasks designed to assess resource acquisition and problem-solving in a simulated environment, GPT-4.5 scores 40%, notably lower than deep research’s 78%.

Other benchmarks designed to simulate real-world AI development scenarios show a similar pattern. On MLE-Bench, which involves solving Kaggle competitions (data science and machine learning challenges), GPT-4.5 performs on par with o1, o3-mini, and deep research, all scoring 11%.

The OpenAI PRs benchmark, which tests the model’s ability to replicate pull requests (code contributions) made by OpenAI employees, shows deep research significantly outperforming GPT-4.5.

Finally, on SWE-Lancer, a platform of real-world, paid software engineering tasks, GPT-4.5 demonstrates slight improvements over o1 in both individual contributor tasks (20% solved) and management-level tasks (44% solved), but remains considerably behind deep research (46% and 51%, respectively).

Source: OpenAI

OpenAI’s Focus on Safety and Reducing Hallucinations

OpenAI has subjected GPT-4.5 to a rigorous battery of safety evaluations, reflecting the growing importance of responsible AI development. These tests probe the model’s ability to handle harmful requests, resist manipulation, and avoid perpetuating biases. While GPT-4.5 demonstrates incremental progress in several areas, the results paint a complex picture, highlighting the ongoing challenges in creating truly safe and unbiased AI systems.

A key focus of the evaluations was on preventing the model from generating disallowed content. This includes categories like hate speech, illicit advice, and responses that violate privacy. On standard text-only evaluations, GPT-4.5 performs on par with its predecessor, GPT-4o, in refusing to produce unsafe outputs.

However, when presented with multimodal inputs (combinations of text and images), GPT-4.5 exhibits a higher tendency to over-refuse, meaning it rejects even benign requests, potentially limiting its usefulness. This highlights a trade-off: stricter safety controls can sometimes lead to overly cautious behavior.

Source: OpenAI

Detailed breakdowns of these evaluations, separating responses by type of harmful content (sexual, hate, self-harm, etc.), reveal that the level of success in refusing such requests varies greatly depending on the topic.

Another critical area of concern is jailbreaking – adversarial attempts to bypass a model’s safety protocols. On human-sourced jailbreak attempts, GPT-4.5 shows a slight improvement in robustness compared to GPT-4o.

However, on the StrongReject benchmark, a more academic and structured test of jailbreak resistance, GPT-4.5 performs similarly to GPT-4o and worse than another OpenAI model called o1. This indicates that while some progress has been made, the model remains vulnerable to certain types of sophisticated attacks.

The ability of a model to adhere to a predefined instruction hierarchy is also crucial for safety. This means ensuring that system-level instructions (designed to promote safe behavior) take precedence over potentially conflicting user requests.

GPT-4.5 generally outperforms GPT-4o in following system instructions over user prompts, but it’s slightly behind the o1 model in some scenarios. Specifically, in a simulated tutoring scenario, GPT-4.5 is more susceptible than o1 to being tricked into revealing answers, although it still performs better than GPT-4o. Similar trends are observed in tests designed to protect specific phrases and passwords.

Red teaming evaluations, which involve actively trying to elicit harmful responses, provide further insights. GPT-4.5 performs slightly better than GPT-4o on one challenging red teaming evaluation set but underperforms both deep research and o1 on another, indicating that it is still susceptible to generating problematic content under adversarial pressure.

OpenAI also assessed GPT-4.5 within its Preparedness Framework, which evaluates potential catastrophic risks. The model was classified as medium risk overall. Specifically, it received a low risk rating for cybersecurity, meaning it does not significantly advance capabilities related to exploiting computer vulnerabilities. However, it received a medium risk rating for both CBRN (chemical, biological, radiological, and nuclear threat creation) and Persuasion.

In the CBRN category, while the post-mitigation model refuses all steps in the biological threat creation process, the pre-mitigation model demonstrated some ability to provide accurate information, particularly in the magnification stage.

For persuasion, GPT-4.5 showed state-of-the-art performance on contextual evaluations, meaning it can be highly effective in convincing other AI models (simulating humans) to take specific actions, like making a payment or saying a codeword. These medium risk ratings highlight ongoing concerns and the need for continued vigilance. Model autonomy was rated as low risk.

Strategic Timing as OpenAI Prepares for GPT-5

The release of GPT-4.5 appears to be a calculated move in OpenAI’s AI roadmap. CEO Sam Altman has hinted that GPT-5 is already in development, with a possible release as early as May 2025. The next major iteration is expected to feature o3 reasoning, a more advanced system that OpenAI has been teasing since late 2024.

For now, GPT-4.5 serves as an intermediary step—providing improvements in usability and efficiency while keeping users engaged until GPT-5 arrives. The company has also been testing ways to integrate multiple AI models, suggesting that future versions could combine reasoning engines for a more advanced AI system.

With GPT-5 looming in the near future, GPT-4.5 acts as a refinement rather than a reinvention. OpenAI’s approach seems to be continuous upgrades rather than infrequent, massive overhauls—at least until the next big leap in AI reasoning arrives.

Last Updated on March 3, 2025 4:15 pm CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
