A new study published on arXiv reveals fundamental issues in the visual reasoning abilities of leading AI vision-language models (VLMs) from OpenAI, Google, and Meta Platforms. Conducted by researchers from TU Darmstadt, Eindhoven University of Technology, and the German Research Center for AI, the study highlights the models’ failure to solve basic visual puzzles known as Bongard problems.
While major tech firms are integrating these models into AI-driven agents for daily digital tasks, the study suggests that VLMs may be less capable than assumed when it comes to generalizing basic visual patterns.
Understanding Bongard Problems and Vision-Language Models
Vision-language models, or VLMs, are AI systems that interpret images together with associated text, approximating human visual perception. Central to advancing AI capabilities, they are often used in applications that require visual understanding, such as shopping, browsing, and healthcare tasks.
Some prominent examples include OpenAI’s DALL-E, which generates images from textual descriptions, and Google’s MUM, a multimodal model that interprets information across text and images. Other notable VLMs include DeepMind’s Flamingo and Google’s PaLM-E, which handle tasks such as visual question answering, image captioning, and even robotic manipulation.
Bongard problems, used in cognitive science since the 1960s, consist of two sets of simple diagrams: every image in one set obeys an underlying rule that every image in the other set violates. The challenge is to identify the distinction, such as spirals winding clockwise versus counterclockwise.
According to the study, GPT-4o, OpenAI’s latest VLM, solved only 21 out of 100 Bongard problems in an open-ended test. Even with a multiple-choice setup limited to ten options, performance rose only to around 69%, a level the researchers attribute to “superficial” comprehension rather than a genuine grasp of visual concepts.
The authors noted that, despite explicit prompts to attend to simple visual elements, models like GPT-4o often misread tasks as basic as identifying object orientation or shape differences, underscoring gaps in VLMs’ ability to handle rudimentary visual reasoning.
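To make the open-ended setup concrete, the sketch below shows how a single Bongard panel might be submitted to GPT-4o through OpenAI’s Python SDK. The prompt wording, file name, and single-image framing are illustrative assumptions, not the study’s actual evaluation harness.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_open_ended(image_path: str) -> str:
    """Submit one Bongard panel with an open-ended prompt.

    Hypothetical sketch: the study's real prompts, image handling,
    and scoring procedure are not reproduced here.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("The six boxes on the left follow one visual rule "
                          "and the six on the right follow another. "
                          "State both rules.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage with a made-up file name:
# print(ask_open_ended("bongard_problem_01.png"))
```

In the multiple-choice variant the study describes, the text prompt would additionally enumerate ten candidate rules and ask the model to pick one, the easier setting in which GPT-4o reached roughly 69%.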
Implications for AI Applications and Agent Functionality
The limitations of these VLMs could impact the effectiveness of AI-driven agents being developed by companies like Google and Microsoft. These agents are envisioned as tools that can assist users with complex online tasks—from conducting research to managing purchases.
Microsoft’s approach to AI development, for example, has focused on specialized agents tailored for specific tasks to bypass some of the generalization issues seen with broad-function models. However, even specialized agents may face challenges if VLMs continue to struggle with recognizing straightforward visual patterns.
AI Transcription Tools Show Similar Issues
Questions about accuracy extend beyond vision models. OpenAI’s Whisper, an AI transcription model, has repeatedly been reported to “hallucinate” during speech-to-text transcription, occasionally inserting words or phrases that were never spoken in the original audio.
The issue, first documented in a June 2024 study led by Cornell University researchers, presents potential risks in high-stakes environments such as healthcare, where accurate transcriptions are essential. Assistant Professor Allison Koenecke and her team found that Whisper’s hallucination rate exceeded 1% in a test of 13,000 audio samples from AphasiaBank, a speech repository hosted at Carnegie Mellon University whose recordings of speakers with aphasia cover a wide range of accents and atypical speech patterns.
Whisper has been integrated by the tech firm Nabla into a tool used by more than 30,000 clinicians to help physicians document patient interactions. However, Whisper’s tendency to hallucinate could lead to errors in patient records, potentially distorting critical health information.
Privacy concerns have also arisen, as Nabla’s tool deletes the original audio files after transcription, preventing any post-hoc review of errors. William Saunders, a former engineer at OpenAI, noted that “removing the original recording means there’s no way to catch mistakes,” raising questions about quality control.
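Saunders’s point also shows why hallucinations are hard to quantify after the fact: flagging inserted text requires the original audio or a reference transcript to align against. The sketch below, a simplified illustration rather than Koenecke’s actual methodology, uses a word-level alignment from Python’s standard library to surface words present in a machine transcript but absent from a human reference.

```python
import difflib

def inserted_phrases(reference: str, hypothesis: str) -> list[str]:
    """Return runs of words that appear in the machine transcript
    (hypothesis) but not in the human reference, i.e. candidate
    hallucinations.

    Illustrative only: real pipelines must also normalize
    punctuation and casing and tolerate benign paraphrases.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words)
    return [
        " ".join(hyp_words[j1:j2])
        for op, _, _, j1, j2 in matcher.get_opcodes()
        if op == "insert"  # words added with no counterpart in the reference
    ]

# Toy example with a fabricated transcript pair:
ref = "the patient reports mild chest pain"
hyp = "the patient reports mild chest pain and was prescribed antibiotics"
print(inserted_phrases(ref, hyp))  # ['and was prescribed antibiotics']
```

Once the audio is deleted, of course, there is no reference left to align against, which is precisely the quality-control gap Saunders describes.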
Broader Questions on Benchmark Validity and Testing Requirements
The recent study on VLMs highlights that AI benchmarks may not accurately capture an AI system’s reasoning abilities. While vision models perform relatively well on familiar tasks, like traditional image benchmarks, Bongard problems reveal deeper challenges with logical reasoning.
Experts suggest that tests like Bongard problems, while limited in scope, demonstrate the need for more rigorous and realistic evaluations of AI’s adaptability. The Cornell researchers likewise proposed that OpenAI and others conduct more robust pre-release testing covering the diverse speech and visual patterns AI models encounter in real-world environments.
Given the reliance on models like Whisper and GPT-4o for sensitive applications, experts stress the importance of achieving reliable accuracy. As both VLMs and transcription tools play larger roles across industries, these findings underscore the potential risks of depending on AI systems that may not fully handle foundational tasks in visual and auditory data processing.