OpenAI has equipped ChatGPT’s Advanced Voice Mode with vision capabilities, enabling users to share live video and screens for real-time analysis and assistance.
This marks a major expansion of ChatGPT’s functionality, transforming it into a visually aware AI assistant capable of interacting with the physical and digital worlds.
Announced as part of OpenAI’s “12 Days of OpenAI,” the update enhances ChatGPT’s ability to assist with real-world tasks while maintaining its conversational intelligence.
Visual AI in Action: How ChatGPT Processes Video and Screens
ChatGPT’s new visual abilities allow users to engage the AI by pointing their smartphone cameras at objects or sharing their device screens. The feature unlocks a wide range of applications, from explaining complex on-screen settings to identifying physical objects in the user’s environment.
During a live demonstration, OpenAI showcased the AI guiding a user through the process of brewing coffee. The system identified essential tools such as a coffee filter and brewer, offering clear, step-by-step instructions.
Another example involved ChatGPT analyzing a mathematical equation displayed on a screen and explaining the solution in detail. OpenAI explained that the feature connects voice interaction with the ability to interpret visual inputs for real-time assistance.
Screenshare while using Advanced Voice for instant feedback on whatever you’re looking at. pic.twitter.com/d4Xm36dwOX
— OpenAI (@OpenAI) December 12, 2024
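The consumer feature runs entirely inside the ChatGPT app, but the same multimodal capability is available to developers through OpenAI’s API, which accepts images alongside text in a single request. As a rough illustration of the equation-walkthrough demo, the Python sketch below sends a screenshot to GPT-4o for a step-by-step explanation; the file name and prompt are hypothetical, and this is not the mechanism the app itself uses.

```python
# Illustrative sketch: sending a screenshot (e.g., of a math equation) to a
# multimodal model for explanation via OpenAI's Python SDK. Assumes the
# `openai` package is installed and OPENAI_API_KEY is set in the environment.
# The file name and prompt are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The API accepts images as base64-encoded data URLs (or plain HTTPS URLs).
with open("equation_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Explain, step by step, how to solve the equation in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```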
Despite these capabilities, OpenAI acknowledged limitations in the system’s current iteration: the AI occasionally generates incorrect responses, known as “hallucinations,” when interpreting complex visual data. While this remains a challenge, OpenAI noted that iterative improvements are underway to enhance accuracy and reliability.
The rollout of these vision capabilities begins immediately for ChatGPT Plus, Pro, and Team users, while Enterprise and Education subscribers will gain access starting January 2025.
However, users in the European Union and select countries such as Switzerland, Iceland, and Norway face delays due to compliance and regulatory adjustments. To activate the feature, users must access Advanced Voice Mode within the ChatGPT app, then select the video or screen-sharing options to enable visual assistance.
Previous Updates: Canvas, Sora, and ChatGPT Pro
On Tuesday, OpenAI fully released Canvas, a collaborative editing workspace that offers advanced tools for both text and code refinement. Initially launched in beta in October 2024, Canvas replaces ChatGPT’s standard interface with a split-screen design, allowing users to work on text or code while engaging in conversational exchanges with the AI.
The addition of Python execution is a standout feature of Canvas, enabling developers to write, test, and debug scripts directly within the platform. OpenAI demonstrated its utility during a live event by using Python to generate and refine data visualizations, describing the feature as “reducing friction between idea generation and implementation.”
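OpenAI has not published the demo script itself, but the workflow it showed, generating a chart and then refining it in place, resembles the kind of matplotlib snippet a developer might iterate on inside Canvas; the dataset below is invented for illustration.

```python
# Sketch of a script one might write, run, and refine inside Canvas:
# a simple data visualization built with matplotlib. The numbers are
# made up for demonstration purposes.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 180, 240, 310, 420, 560]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, signups, marker="o", linewidth=2)
ax.set_title("Monthly Signups (sample data)")
ax.set_xlabel("Month")
ax.set_ylabel("Signups")
ax.grid(True, linestyle="--", alpha=0.5)
fig.tight_layout()
plt.show()
```

In Canvas, refinements such as changing the chart type or adjusting labels can then be requested conversationally and the script re-run without leaving the editor.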
On Monday, OpenAI officially launched Sora, its advanced AI tool for generating videos from text prompts, signaling a new era for creative AI. Integrated into paid ChatGPT accounts, Sora allows users to animate still images, extend existing videos, and merge scenes into cohesive narratives.
Last Friday, OpenAI unveiled ChatGPT Pro, a new premium subscription tier priced at $200 per month, aimed at professionals and enterprises seeking advanced AI capabilities for high-demand workflows.
The new ChatGPT Pro tier offers exclusive features, including unlimited access to advanced AI models such as GPT-4o and o1-mini, as well as the full version of the o1 reasoning model, previously code-named “Strawberry.”
Competitive Context: OpenAI’s Strategic Move in the AI Race
The addition of vision capabilities and expanded functionality in Canvas underscores OpenAI’s efforts to maintain a leading position in the increasingly competitive AI landscape.
Google is advancing its Project Astra, an AI assistant capable of processing live video inputs, which is currently in limited testing with select users. Meanwhile, Meta is refining its own visual AI technologies, highlighting the industry-wide focus on integrating vision into conversational AI platforms.
Real-World Implications of Visual AI
ChatGPT’s ability to process live video and shared screens extends its utility across various domains. For consumers, the feature simplifies tasks such as troubleshooting device issues, making sense of on-screen settings, and getting help with hands-on projects at home.
In education, ChatGPT can support remote learning by visually interpreting problems or materials shared by students. For professionals, especially those in design, engineering, or technical fields, ChatGPT’s ability to analyze visual inputs offers a new layer of functionality, streamlining workflows and boosting efficiency.
The broader implications of this update reflect a growing demand for AI systems that can interact seamlessly with both digital and physical environments. As AI technologies like ChatGPT evolve, their ability to understand and respond to visual context will become increasingly central to their adoption in everyday life.
OpenAI’s vision upgrade for ChatGPT and its enhancements to the Canvas workspace signal a significant leap forward in the capabilities of conversational AI. By integrating voice, vision, and coding tools, OpenAI continues to expand ChatGPT’s practical applications for users across personal, educational, and professional settings.