DeepSeek AI has released DeepSeek-VL2, a family of Vision-Language Models (VLMs) that are now available under open-source licenses. The series introduces three variants—Tiny, Small, and the standard VL2—featuring activated parameter sizes of 1.0 billion, 2.8 billion, and 4.5 billion, respectively.
The models are accessible via GitHub and Hugging Face. They promise to advance key AI applications, including visual question answering (VQA), optical character recognition (OCR), and high-resolution document and chart analysis.
According to the official GitHub documentation, “DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, document/table/chart understanding, and visual grounding.”
The timing of this release situates DeepSeek AI in direct competition with major players like OpenAI and Google, both of whom dominate the vision-language AI domain with proprietary models such as GPT-4V and Gemini-Exp.
DeepSeek’s emphasis on open-source collaboration, combined with the advanced technical features of the VL2 family, positions the models as a freely available alternative for researchers.
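Because the checkpoints are published on Hugging Face, a quick way to experiment with them is through the generic transformers remote-code path. The snippet below is a minimal sketch under that assumption; the exact processor class and chat interface may differ from what the official deepseek_vl2 package provides.

```python
# Minimal loading sketch, assuming the checkpoints support the generic
# transformers remote-code path; the official deepseek_vl2 package may expose
# different class names and a dedicated chat/processing API.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "deepseek-ai/deepseek-vl2-tiny"   # also: ...-vl2-small, ...-vl2

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,          # halves weight memory vs. fp32
        trust_remote_code=True,
    )
    .to("cuda")
    .eval()
)
```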
Dynamic Tiling: Advancing High-Resolution Image Processing
One of the most notable advancements in DeepSeek-VL2 is its dynamic tiling vision encoding strategy, which reworks how the models process high-resolution visual data.
Unlike traditional fixed-resolution approaches, dynamic tiling divides images into smaller, flexible tiles that adapt to various aspect ratios. This method ensures detailed feature extraction while maintaining computational efficiency.
On its GitHub repository, DeepSeek describes this as a way to “efficiently process high-resolution images with varying aspect ratios, avoiding the computational scaling typically associated with increasing image resolutions.”
This capability allows DeepSeek-VL2 to excel in applications such as visual grounding, where high precision is essential for identifying objects in complex images, and dense OCR tasks, which require processing text in detailed documents or charts.
By dynamically adjusting to different image resolutions and aspect ratios, the models overcome the limitations of static encoding methods, making them suitable for use cases that demand both flexibility and accuracy.
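To make the idea concrete, the sketch below shows one plausible way to pick a tile grid for an arbitrary-resolution image: within a fixed tile budget, choose the columns-by-rows layout whose aspect ratio best matches the input, then resize and split the image into square crops. The tile size, budget, and tie-breaking rule are illustrative assumptions, not DeepSeek’s exact algorithm.

```python
# Toy aspect-ratio-aware tiling sketch (tile size, budget, and selection rule
# are illustrative assumptions, not DeepSeek-VL2's exact algorithm).
TILE = 384        # assumed square tile size fed to the vision encoder
MAX_TILES = 9     # assumed budget that caps compute for very large images

def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio best matches the image,
    preferring more tiles (finer detail) when ratios tie."""
    target = width / height
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    return min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - target), -cr[0] * cr[1]))

def tile_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) after resizing the image to
    cols*TILE x rows*TILE; real pipelines typically also keep a downscaled
    global view of the whole image."""
    cols, rows = choose_grid(width, height)
    return [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            for r in range(rows) for c in range(cols)]

print(choose_grid(1920, 1080))      # wide 1080p input -> (4, 2)
print(len(tile_boxes(1920, 1080)))  # 8 local tiles
```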
Mixture-of-Experts and Multi-Head Latent Attention for Efficiency
DeepSeek-VL2’s performance gains are further supported by its integration of the Mixture-of-Experts (MoE) framework and Multi-head Latent Attention (MLA) mechanism.
The MoE architecture selectively activates specific subsets of the model’s parameters, or “experts,” to handle tasks more efficiently. This design reduces computational overhead by engaging only the necessary parameters for each operation, a feature that is particularly useful in resource-constrained environments.
The MLA mechanism complements the MoE framework by compressing the Key-Value cache into latent vectors during inference. This optimization minimizes memory usage and increases processing speeds without sacrificing model accuracy.
According to the technical documentation, “The MoE architecture, combined with MLA, allows DeepSeek-VL2 to achieve competitive or better performance than dense models with fewer activated parameters.”
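As a rough illustration of the routing idea, the toy layer below scores all experts with a small linear router, keeps only the top-k experts per token, and blends their outputs with the renormalized gate weights, so most expert parameters are never touched for a given token. The dimensions, expert count, and k are arbitrary; DeepSeek-VL2’s production MoE design, and the MLA latent KV-cache compression it is paired with, are considerably more elaborate than this sketch.

```python
# Toy Mixture-of-Experts layer with top-k routing. Dimensions, expert count, and
# k are illustrative assumptions, not DeepSeek-VL2's configuration; only the
# selected experts run for each token, so most parameters stay idle per step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        weights, idx = gate.topk(self.k, dim=-1)       # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dispatch tokens to chosen experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

x = torch.randn(16, 512)
print(TopKMoE()(x).shape)   # torch.Size([16, 512])
```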
Three-Stage Training Pipeline
The development of DeepSeek-VL2 involved a rigorous three-stage training pipeline designed to optimize the model’s multimodal capabilities. The first stage focused on vision-language alignment, where the models were trained to integrate visual features with textual information.
This was achieved using datasets like ShareGPT4V, which provide paired image-text examples for initial alignment. The second stage involved vision-language pretraining, which incorporated a diverse range of datasets, including WIT, WikiHow, and multilingual OCR data, to enhance the model’s generalization abilities across multiple domains.
Finally, the third stage consisted of supervised fine-tuning (SFT), where task-specific datasets were used to refine the model’s performance in areas like visual grounding, graphical user interface (GUI) comprehension, and dense captioning.
These training stages allowed DeepSeek-VL2 to build a solid foundation for multimodal understanding while enabling the models to adapt to specialized tasks. The incorporation of multilingual datasets further enhanced the models’ applicability in global research and industrial settings.
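The staged structure can be summarized in a short configuration sketch. The dataset groupings below mirror the stages described above; which components are trainable at each stage is an illustrative assumption rather than DeepSeek’s published recipe.

```python
# Outline of the three-stage pipeline described above. Dataset names follow the
# text; which components are trained in each stage is an illustrative assumption.
STAGES = [
    {"name": "vision-language alignment",
     "datasets": ["ShareGPT4V"],
     "trains": ["vision-language adaptor"]},          # assumption: backbones mostly frozen
    {"name": "vision-language pretraining",
     "datasets": ["WIT", "WikiHow", "multilingual OCR", "OBELICS"],
     "trains": ["full model"]},
    {"name": "supervised fine-tuning",
     "datasets": ["visual grounding", "GUI comprehension", "dense captioning"],
     "trains": ["full model"]},
]

for i, stage in enumerate(STAGES, start=1):
    print(f"Stage {i}: {stage['name']} <- {', '.join(stage['datasets'])}")
```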
Related: Chinese DeepSeek R1-Lite-Preview Model Targets OpenAI’s Lead in Automated Reasoning
Benchmarking Results
DeepSeek-VL2 models, including the Tiny, Small, and standard variants, excelled in critical benchmarks for general question-answering (QA) and math-related multimodal tasks.
DeepSeek-VL2-Small, with its 2.8 billion activated parameters, achieved an MMStar score of 57.0 and outperformed similarly sized models like InternVL2-2B (49.8) and Qwen2-VL-2B (48.0). It also closely rivaled much larger models, such as the 4.1B InternVL2-4B (54.3) and the 8.3B Qwen2-VL-7B (60.7), demonstrating its competitive efficiency.
On the AI2D test for visual reasoning, DeepSeek-VL2-Small achieved a score of 80.0, surpassing InternVL2-2B (74.1); no comparable score was reported for MM 1.5-3B. Even against larger-scale competitors like InternVL2-4B (78.9) and MiniCPM-V2.6 (82.1), DeepSeek-VL2 demonstrated strong results with fewer activated parameters.
The flagship DeepSeek-VL2 model (4.5 billion activated parameters) delivered exceptional results, scoring 61.3 on MMStar and 81.4 on AI2D. It outperformed competitors such as Molmo-7B-O (7.6B activated parameters, 39.3) and MiniCPM-V2.6 (8.0B, 57.5), further validating its technical superiority.
Excellence in OCR-Related Benchmarks
DeepSeek-VL2’s capabilities extend prominently to OCR-related tasks, a crucial area for document understanding and text extraction in AI. On the DocVQA test, DeepSeek-VL2-Small achieved an impressive 92.3% accuracy, outperforming all other open-source models of similar scale, including InternVL2-4B (89.2%) and MiniCPM-V2.6 (90.8%). Its accuracy was just behind that of closed models such as GPT-4o (92.8) and Claude 3.5 Sonnet (95.2).
The DeepSeek-VL2 model also led in the ChartQA test with a score of 86.0, outperforming InternVL2-4B (81.5) and MiniCPM-V2.6 (82.4). This result reflects DeepSeek-VL2’s advanced ability to process charts and extract insights from complex visual data.
In OCRBench, a highly competitive metric for fine-grained text recognition, DeepSeek-VL2 scored 811, closing in on the larger 7.6B Qwen2-VL-7B (845) and MiniCPM-V2.6 (852 with CoT) despite using far fewer activated parameters, underscoring its strength in dense OCR tasks.
Comparison Against Leading Vision-Language Models
When placed alongside industry leaders like OpenAI’s GPT-4V and Google’s Gemini-1.5-Pro, DeepSeek-VL2 models offer a compelling balance of performance and efficiency. For instance, GPT-4V scored 87.2 on DocVQA, a result the open-source DeepSeek-VL2 (93.3) surpassed despite operating with fewer activated parameters.
On TextVQA, DeepSeek-VL2-Small achieved 83.4, significantly outperforming similar open-source models like InternVL2-2B (73.4) and MiniCPM-V2.0 (74.1). Even the much larger MiniCPM-V2.6 (8.0B) only reached 80.4, further underscoring the scalability and efficiency of DeepSeek-VL2’s architecture.
For ChartQA, DeepSeek-VL2’s score of 86.0 exceeded those of Pixtral-12B (81.8) and InternVL2-8B (83.3), demonstrating its ability to excel in specialized tasks requiring precise visual-textual comprehension.
Related: Mistral AI Debuts Pixtral 12B for Text and Image Processing
Expanding Applications: From Grounded Conversations to Visual Storytelling
One notable feature of the DeepSeek-VL2 models is their ability to conduct grounded conversations, where the model can identify objects in images and integrate them into contextual discussions.
For instance, by using a specialized token, the model can provide object-specific details, such as location and description, to answer queries about images. This opens up possibilities for applications in robotics, augmented reality, and digital assistants, where precise visual reasoning is required.
Another area of application is visual storytelling. DeepSeek-VL2 can generate coherent narratives based on a sequence of images, combining its advanced visual recognition and language capabilities.
This is especially valuable in domains like education, media, and entertainment, where dynamic content creation is a priority. The models leverage strong multimodal understanding to craft detailed and contextually appropriate stories, integrating visual elements such as landmarks and text into the narrative seamlessly.
The models’ capability in visual grounding is equally strong. In tests involving complex images, DeepSeek-VL2 has demonstrated the ability to accurately locate and describe objects based on descriptive prompts.
For example, when asked to identify a “car parked on the left side of the street,” the model can pinpoint the exact object in the image and generate bounding box coordinates to illustrate its response. These features make it highly applicable for autonomous systems and surveillance, where detailed visual analysis is critical.
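To illustrate how such grounded output might be consumed downstream, the sketch below parses bounding boxes from a reply and rescales them to pixel coordinates. The <|det|>[[x1, y1, x2, y2]]<|/det|> tag format and the 0–999 normalized coordinate convention are assumptions chosen for this example, not necessarily DeepSeek-VL2’s exact output format.

```python
# Illustrative parser for a grounded response; the <|det|>[[x1, y1, x2, y2]]<|/det|>
# tag format and 0-999 normalized coordinates are assumptions for this sketch,
# not necessarily DeepSeek-VL2's exact output convention.
import re

def parse_boxes(response: str, img_w: int, img_h: int) -> list[tuple[int, int, int, int]]:
    """Extract bounding boxes and rescale normalized coords (0-999) to pixels."""
    boxes = []
    for match in re.findall(r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]", response):
        x1, y1, x2, y2 = (int(v) for v in match)
        boxes.append((x1 * img_w // 999, y1 * img_h // 999,
                      x2 * img_w // 999, y2 * img_h // 999))
    return boxes

reply = "The car parked on the left side of the street is at <|det|>[[63, 410, 292, 671]]<|/det|>."
print(parse_boxes(reply, img_w=1280, img_h=720))
# -> [(80, 295, 374, 483)]
```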
Open-Source Accessibility and Scalability
DeepSeek AI’s decision to release DeepSeek-VL2 as an open-source family contrasts sharply with the proprietary nature of competitors like OpenAI’s GPT-4V and Google’s Gemini-Exp, which are closed systems designed for limited public access.
According to the technical documentation, “By making our pre-trained models and code publicly available, we aim to accelerate progress in vision-language modeling and promote collaborative innovation across the research community.”
The scalability of the DeepSeek-VL2 models further enhances their appeal. They are optimized for deployment across a wide range of hardware configurations, from single GPUs with 10GB of memory to multi-GPU setups capable of handling large-scale workloads.
This flexibility ensures that DeepSeek-VL2 can be used by organizations of all sizes, from startups to large enterprises, without the need for specialized infrastructure.
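A back-of-the-envelope weight-memory estimate helps make the hardware claim concrete. Note that for MoE models all experts must be resident in memory even though only a fraction of parameters is activated per token; the total parameter counts used below are approximate assumptions for the three checkpoints.

```python
# Rough bf16 weight-memory estimate (2 bytes per parameter). The total parameter
# counts are approximate assumptions for the MoE checkpoints; all experts must be
# resident in memory even though only a fraction is activated per token, and
# activations plus the KV cache add further overhead.
def weight_gib(total_params_billion: float, bytes_per_param: int = 2) -> float:
    return total_params_billion * 1e9 * bytes_per_param / 1024**3

for model_id, total_b in [("deepseek-vl2-tiny", 3.4),
                          ("deepseek-vl2-small", 16.1),
                          ("deepseek-vl2", 27.5)]:
    print(f"{model_id}: ~{weight_gib(total_b):.0f} GiB of bf16 weights")
# The Tiny variant fits a ~10GB GPU with headroom; the larger variants call for
# bigger cards or multi-GPU / offloading setups.
```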
Innovations in Data and Training
A major factor behind DeepSeek-VL2’s success is its extensive and diverse training data. The pretraining phase incorporated datasets such as WIT, WikiHow, and OBELICS, which provided a mix of interleaved image-text pairs for generalization.
Additional data for specific tasks, such as OCR and visual question answering, came from sources like LaTeX OCR and PubTabNet, ensuring that the models could handle both general and specialized tasks with high accuracy.
The inclusion of multilingual datasets also reflects DeepSeek AI’s aim of global applicability. Chinese-language datasets like Wanjuan were integrated alongside English datasets to ensure that the models could operate effectively in multilingual environments.
This approach enhances the usability of DeepSeek-VL2 in regions where non-English data dominates, expanding its potential user base significantly.
The supervised fine-tuning phase further refined the models’ capabilities by focusing on specific tasks such as GUI comprehension and chart analysis. By combining in-house datasets with high-quality open-source resources, DeepSeek-VL2 achieved state-of-the-art performance on several benchmarks, validating the effectiveness of its training methodology.
DeepSeek AI’s careful curation of data and innovative training pipeline have allowed VL2 models to excel in a wide range of tasks while maintaining efficiency and scalability. These factors make them a valuable addition to the field of multimodal AI.
The models’ ability to handle complex image processing tasks, such as visual grounding and dense OCR, makes them ideal for industries like logistics and security. In logistics, they can automate inventory tracking by analyzing images of warehouse stock, identifying items, and integrating findings into inventory management systems.
In the security domain, DeepSeek-VL2 can assist in surveillance by identifying objects or individuals in real time, based on descriptive queries, and providing detailed contextual information to operators.
DeepSeek-VL2’s grounded conversation capability also opens up possibilities in robotics and augmented reality. For example, a robot equipped with this model could interpret its environment visually, respond to human queries about specific objects, and perform actions based on its understanding of the visual input.
Similarly, augmented reality devices can leverage the model’s visual grounding and storytelling features to provide interactive, immersive experiences, such as guided tours or contextual overlays in real-time environments.
Challenges and Future Prospects
Despite its numerous strengths, DeepSeek-VL2 faces several challenges. One key limitation is the size of its context window, which currently restricts the number of images that can be processed within a single interaction.
Expanding this context window in future iterations would enable richer, multi-image interactions and enhance the model’s utility in tasks requiring broader contextual understanding.
Another challenge lies in handling out-of-domain or low-quality visual inputs, such as blurry images or objects not present in its training data. While DeepSeek-VL2 has demonstrated remarkable generalization capabilities, improving robustness against such inputs will further increase its applicability across real-world scenarios.
Looking ahead, DeepSeek AI plans to strengthen the reasoning capabilities of its models, enabling them to handle increasingly complex multimodal tasks. By integrating improved training pipelines and expanding datasets to cover more diverse scenarios, future versions of DeepSeek-VL2 could set new benchmarks for vision-language AI performance.