
Alibaba Qwen Releases QVQ-72B-Preview Multimodal Reasoning AI Model

Alibaba's new open-source QVQ-72B-Preview AI model combines visual and textual reasoning, posting strong results across multimodal benchmarks.


The Qwen research team at Alibaba has introduced QVQ-72B-Preview (QVQ-72B for short), an open-source multimodal AI model designed to combine visual and textual reasoning. With its ability to process images and text step by step, the model offers a novel approach to problem-solving that challenges the dominance of proprietary systems like OpenAI’s GPT-4.

Alibaba’s Qwen team describes QVQ-72B as a step toward their long-term goal of creating a more comprehensive AI capable of addressing scientific and analytical challenges.

By making the model openly available under the Qwen license, Alibaba aims to foster collaboration in the AI community while advancing the development of artificial general intelligence (AGI). Positioned as both a research tool and a practical application, QVQ-72B represents a new milestone in the evolution of multimodal AI.

Visual and Textual Reasoning

Multimodal AI models like QVQ-72B are built to analyze and integrate multiple types of input—visual and textual—into a cohesive reasoning process. This capability is especially valuable for tasks that require interpreting data in diverse formats, such as scientific research, education, and advanced analytics.

At its core, QVQ-72B is an extension of Qwen2-VL-72B, Alibaba’s earlier vision-language model. It introduces advanced reasoning features that allow it to process images and related textual prompts with a structured, logical approach. Unlike many closed-source systems, QVQ-72B is designed to be transparent and accessible, providing its source code and model weights to developers and researchers.

“Imagine an AI that can look at a complex physics problem, and methodically reason its way to a solution with the confidence of a master physicist,” the Qwen team writes, describing its ambition for the new model to excel in domains where reasoning and multimodal comprehension are critical.

Performance and Benchmarks

The model’s performance was evaluated using several rigorous benchmarks, each testing different aspects of its multimodal reasoning capabilities:

In the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, which assesses university-level reasoning that combines text and images, QVQ-72B achieved a score of 70.3, surpassing its predecessor Qwen2-VL-72B-Instruct.

The MathVista benchmark tested the model’s proficiency in solving mathematical problems using graphs and visual aids, highlighting its analytical strengths. Similarly, MathVision, derived from real-world mathematics competitions, evaluated its capacity for reasoning across diverse mathematical domains.

Finally, the OlympiadBench benchmark challenged QVQ-72B with bilingual problems from international math and physics contests. The model demonstrated accuracy comparable to proprietary systems like OpenAI’s GPT-4, narrowing the performance gap between open and closed-source AI.

Benchmark results for QVQ-72B-Preview. Source: Qwen

Despite these achievements, limitations remain. The Qwen team noted that recursive reasoning loops and hallucinations during complex visual analysis are challenges still to be addressed.

Practical Applications and Developer Tools

QVQ-72B is not just a research artifact. A demo hosted on Hugging Face Spaces lets users experiment with its capabilities in real time, and developers can deploy the model locally using frameworks such as Hugging Face Transformers or MLX, which is optimized for macOS environments, making the model versatile across platforms.
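For developers who want to try the model programmatically, the sketch below shows roughly how a Transformers-based setup might look. It assumes QVQ-72B-Preview exposes the standard Qwen2-VL interface of the model it extends; the checkpoint name "Qwen/QVQ-72B-Preview" matches the Hugging Face release, while the image file pencils.jpg and the prompt are hypothetical placeholders.

```python
# Minimal sketch: running QVQ-72B-Preview via Hugging Face Transformers.
# Assumes the model follows the Qwen2-VL loading pattern it is built on;
# "pencils.jpg" is a hypothetical local image used only for illustration.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview",
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # shard the 72B weights across available GPUs
)
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")

# One image plus a text question, in the chat format Qwen vision models expect.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many pencils are in this picture?"},
    ],
}]
image = Image.open("pencils.jpg")
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate the step-by-step answer and strip the echoed prompt tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Note that a 72-billion-parameter model at half precision requires on the order of 150 GB of GPU memory, so local deployment realistically means a multi-GPU server or a quantized variant.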

We tested QVQ-72B-Preview on Hugging Face with a simple image of twelve pencils to see how it would approach the task and whether it could correctly identify the pencils stacked together. Unfortunately, it failed this simple task, counting just eight.

As a comparison, OpenAI’s GPT-4o provided the correct answer directly.

Addressing Challenges and Future Directions

While QVQ-72B represents progress, it also highlights the complexities of advancing multimodal AI. Issues such as language switching, hallucinations, and recursive reasoning loops illustrate the challenges of developing robust, reliable systems. Identifying separate objects, which is key for accurate counting and subsequent reasoning, remains an issue for the model.

However, Qwen’s long-term goal extends beyond QVQ-72B. The team envisions a unified model that integrates additional modalities—combining text, vision, audio, and beyond—to approach artificial general intelligence. They emphasize that QVQ-72B is one step toward this vision, providing an open platform for further exploration and innovation.

Last Updated on December 27, 2024 12:35 pm CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
