
GPT-4o: OpenAI’s Latest Model Enhances Multimodal AI Interaction

GPT-4o introduces voice as a new element, making it a natively multimodal platform.


OpenAI has introduced its latest AI model, GPT-4o, which extends its predecessor's robust text and vision processing with real-time voice interaction. Announced during a livestream on Monday, the model was described by OpenAI CTO Mira Murati as a significant advancement in AI interaction, offering real-time responsiveness and emotional recognition in voice communications.
 
The rollout will occur iteratively across OpenAI's suite of products, targeting both developers and consumers in the coming weeks. GPT-4o (“o” for “omni”) is designed to accept any combination of text, audio, and image inputs, and generate any combination of text, audio, and image outputs.
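As a rough illustration of what such a mixed-input request might look like, here is a minimal sketch using OpenAI's Python SDK; the image URL and prompt are hypothetical placeholders, not an official integration guide.

```python
# Minimal sketch: a combined text + image request through OpenAI's
# Python SDK. The image URL and prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```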

Multimodal Interaction at the Core of GPT-4o

Building on the foundation set by GPT-4, which was adept at processing images and text, GPT-4o introduces voice as a new element, making it a natively multimodal platform. This enhancement not only improves the user experience with ChatGPT, OpenAI's popular AI chatbot, but also extends its functionality.

Users can now interact with ChatGPT in a more dynamic manner, interrupting and receiving responses in real-time, with the model capable of detecting nuances in the user's emotions and responding in various emotive tones. The model can respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds, which is comparable to human response times in conversation.
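Those latency figures refer to the model's native audio responses; from the outside, one can only sanity-check round-trip times, which also include network overhead. A rough sketch, assuming API access to the model:

```python
# Rough sketch: timing a round trip to the model. This measures network
# plus inference time, not the model's native audio latency alone.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with one word."}],
)
print(f"round trip: {(time.perf_counter() - start) * 1000:.0f} ms")
```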

Enhanced Functionality Across Platforms

The integration of voice capabilities significantly enhances the functionality of ChatGPT. For instance, when provided with a photo or a desktop screen image, ChatGPT can now swiftly answer queries related to the content displayed, such as identifying software code specifics or recognizing brands and objects.
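A screenshot query of that kind could be sketched as follows; the file name and question are hypothetical, and the base64 data-URL pattern is one common way to pass local images to the API:

```python
# Sketch: asking about a local desktop screenshot by inlining it as a
# base64 data URL. File name and question are hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI()
with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which function in this code has a bug?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```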

The update is part of OpenAI's broader strategy to make AI interactions more intuitive and less focused on the underlying user interface. GPT-4o matches the performance of GPT-4 Turbo on text in English and code, and shows significant improvement on text in non-English languages. It is also much faster and 50% cheaper in the API, particularly excelling in vision and audio understanding compared to existing models.
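To put the "50% cheaper" figure in concrete terms, here is a back-of-the-envelope cost comparison; the per-million-token prices reflect OpenAI's launch pricing ($5 input / $15 output for GPT-4o versus $10 / $30 for GPT-4 Turbo) and should be treated as snapshot assumptions:

```python
# Back-of-the-envelope API cost comparison (USD per 1M tokens, launch
# pricing; treat these numbers as illustrative assumptions).
PRICES = {
    "gpt-4o":      {"input": 5.00,  "output": 15.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example workload: 2M input tokens and 500k output tokens per month.
for model in PRICES:
    print(model, f"${cost(model, 2_000_000, 500_000):.2f}")
# gpt-4o comes out at half the cost of gpt-4-turbo for the same workload.
```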
 

Broader Access and Improved Performance

In addition to its enhanced capabilities, GPT-4o will be available for free to all users, with paid users benefiting from up to five times the capacity limits. The model also boasts improved speed, ensuring quicker responses and more efficient interaction. OpenAI CEO Sam Altman emphasized that the model is inherently multimodal by design, in line with the company's vision of creating more natural and accessible AI interactions.

"As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities," says OpenAI.

Enhanced Reasoning

GPT-4o achieves a new high score of 87.2% on the 5-shot MMLU benchmark of general-knowledge questions. (Note: Meta's Llama 3 400B is still in training.)
 
Image: GPT-4o text evaluation performance (source: OpenAI)
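"5-shot" means the model is shown five worked examples before the test question. A minimal sketch of how such a multiple-choice prompt is assembled; the items below are invented placeholders, not actual MMLU questions:

```python
# Sketch: assembling a 5-shot multiple-choice prompt in the MMLU style.
# The example items are hypothetical placeholders, not real MMLU data.
def format_item(question, choices, answer=None):
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

few_shot = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    # ... four more worked examples for a total of five "shots" ...
]

test_q = ("What gas do plants absorb?", ["Oxygen", "Nitrogen", "CO2", "Helium"])

prompt = "\n\n".join(
    [format_item(q, c, a) for q, c, a in few_shot]
    + [format_item(*test_q)]
)
print(prompt)  # the model is scored on whether it completes with "C"
```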

Audio ASR performance

GPT-4o markedly improves speech recognition over Whisper-v3 across a wide range of languages, with the largest gains in lower-resourced languages.
 
Image: GPT-4o audio ASR performance (source: OpenAI)
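Speech recognition quality is conventionally scored by word error rate (WER), where lower is better. A minimal sketch of such a comparison using the jiwer library; the transcripts are invented for illustration:

```python
# Sketch: scoring two hypothetical ASR transcripts against a reference
# using word error rate (WER), the standard ASR metric (lower is better).
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
whisper_hypothesis = "the quick brown fox jumps over a lazy dog"
gpt4o_hypothesis = "the quick brown fox jumps over the lazy dog"

print("whisper-v3 WER:", wer(reference, whisper_hypothesis))  # 0.111...
print("gpt-4o     WER:", wer(reference, gpt4o_hypothesis))    # 0.0
```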

Audio translation performance

GPT-4o sets a new state of the art in speech translation, outperforming Whisper-v3 on the MLS benchmark.
 
Image: GPT-4o audio translation performance (source: OpenAI)
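Translation benchmarks of this kind are typically scored with BLEU, which compares model output against reference translations. A minimal sketch using the sacrebleu library; the sentence pairs are invented:

```python
# Sketch: scoring hypothetical translation output with BLEU (higher is
# better). `references` holds one reference stream aligned to `hypotheses`.
import sacrebleu

hypotheses = ["The weather is nice today.", "I arrive tomorrow."]
references = [["The weather is nice today.", "I will arrive tomorrow."]]

print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.1f}")
```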

M3Exam Benchmark Results

The M3Exam benchmark serves as a multilingual and visual evaluation, comprising multiple-choice questions from various countries' standardized tests, which may include figures and diagrams. GPT-4o outperforms GPT-4 in this benchmark across all languages. Vision results for Swahili and Javanese are omitted due to the presence of five or fewer vision questions for these languages.
 
Image: GPT-4o M3Exam zero-shot results (source: OpenAI)

Vision Understanding

GPT-4o achieves state-of-the-art performance on visual perception benchmarks.
 
Image: GPT-4o vision understanding benchmarks (source: OpenAI)

Strategic Timing for Launch

The timing of the GPT-4o announcement, just before Google I/O, Google's major developer conference, appears strategic, positioning OpenAI to capture attention in the competitive AI landscape. The launch follows weeks of speculation over what OpenAI would unveil next, underscoring the company's ongoing influence in the tech industry.

Source: OpenAI
Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
