Llama3-V: New $500 Open-Source Multimodal Model Challenges OpenAI and Google

Llama3-V competes with closed-source models that are much larger in size, beating Google Gemini Pro 1.5 and Gemini Ultra 1.0 in various tests.

According to the developers, Llama3-V, a new open-source multimodal AI model with vision capabilities, comes close to the leading closed-source models from OpenAI, Google, and Anthropic. Mustafa Aljadery, one of the developers, shared benchmarks that show the model, built upon Llama3 8B from Meta, beating Google Gemini Pro 1.5 and Gemini Ultra 1.0 in various tests. Despite its smaller size, Llama3-V emphasizes efficiency and targeted task performance.

Developed on a budget of less than $500, Llama3-V offers a cost-effective multimodal alternative to the models from OpenAI, Google, and others. Showing a 10-20% performance improvement over the existing multimodal Llava model, Llama3-V keeps pace with much larger closed-source models on every benchmark except MMMU, which evaluates multimodal models on massive multi-discipline tasks demanding college-level subject knowledge.

Approach to Multimodal AI

Llama3-V integrates visual inputs by embedding images into patch embeddings through the SigLIP model. These visual embeddings are aligned with textual tokens using a projection block that incorporates self-attention mechanisms, ensuring that visual and textual data are properly synchronized. The combined representation is processed through Llama3, enhancing the model’s ability to understand and utilize visual information alongside text.
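For readers who think in code, a rough PyTorch sketch of that flow might look as follows. The class name Llama3VSketch and the three sub-modules are illustrative placeholders under the description above, not the project's actual API.

```python
import torch
import torch.nn as nn

class Llama3VSketch(nn.Module):
    """Illustrative composition: SigLIP vision tower -> projection block -> Llama3.

    All three sub-modules are stand-ins; the real model loads pretrained weights.
    """
    def __init__(self, vision_tower: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower      # e.g. a SigLIP image encoder
        self.projector = projector            # aligns image embeddings with the text space
        self.language_model = language_model  # Llama3 8B backbone

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # 1) Encode the image into patch embeddings: (batch, num_patches, vision_dim)
        patch_embeds = self.vision_tower(pixel_values)
        # 2) Project the patch embeddings into the LLM's embedding space: (batch, num_patches, llm_dim)
        visual_tokens = self.projector(patch_embeds)
        # 3) Prepend the visual tokens to the text token embeddings and run Llama3 over the joint sequence
        joint_sequence = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(joint_sequence)
```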

SigLIP processes image-text pairs independently using a pairwise sigmoid loss, as opposed to CLIP's contrastive loss. The sigmoid loss treats each image-text pair as a simple binary match/no-match decision, so similarity scores do not need to be normalized across the whole batch. CLIP (Contrastive Language-Image Pre-Training), by contrast, is a neural network that learns visual concepts from natural language supervision by pulling matching image-text pairs together and pushing non-matching pairs apart within each batch.
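For illustration, here is a minimal PyTorch sketch of the two loss formulations. The temperature and bias values and the simple averaging are simplifying assumptions, not SigLIP's or CLIP's exact training configuration.

```python
import torch
import torch.nn.functional as F

def siglip_pairwise_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is scored independently as a
    binary match/no-match decision, so no batch-wide softmax normalization is needed."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b                              # (batch, batch) similarity logits
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 on the diagonal (matches), -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

def clip_contrastive_loss(img_emb, txt_emb, t=100.0):
    """CLIP-style loss: a softmax over the whole batch couples every pair together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```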

The model divides images into patches, projects them into a lower-dimensional space, and applies self-attention to extract higher-level features. Its projection module, including two self-attention blocks, aligns image embeddings with Llama3’s textual embeddings. These visual tokens are then combined with textual tokens for joint processing by Llama3.
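The published description leaves the exact layer composition open; the sketch below is one plausible reading, using two standard self-attention (Transformer encoder) blocks followed by a linear map into Llama3's embedding space. The dimensions, 1152 for SigLIP patch embeddings and 4096 for Llama3 8B, are commonly cited values and should be read as assumptions.

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """One plausible projection module: two self-attention blocks over the image
    patch embeddings, then a linear map into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=num_heads, batch_first=True)
            for _ in range(2)
        ])
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vision_dim) from SigLIP
        x = patch_embeds
        for block in self.blocks:
            x = block(x)        # self-attention extracts higher-level patch features
        return self.proj(x)     # (batch, num_patches, llm_dim), ready to sit next to text tokens

# Example with an arbitrary patch count:
# visual_tokens = ProjectionBlock()(torch.randn(1, 576, 1152))  # -> (1, 576, 4096)
```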

Optimized for Efficiency

Llama3-V also optimizes computational resources through a caching mechanism that precomputes SigLIP image embeddings, maximizing GPU utilization and batch size while avoiding memory overloads. Additionally, MPS/MLX optimizations enable SigLIP to run on MacBooks at a throughput of 32 images per second, enhancing efficiency in both training and inference stages. MPS (Metal Performance Shaders) is the backend for running PyTorch workloads on Apple GPUs, and MLX is Apple's array framework for machine learning on Apple silicon; both are aimed at improving the performance and efficiency of machine-learning workloads on Macs.
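A caching step of this kind could be sketched roughly as follows; the file layout and function names are made up for illustration and are not the project's actual code.

```python
import torch
from pathlib import Path

@torch.no_grad()
def precompute_siglip_embeddings(image_encoder, dataloader, cache_dir="siglip_cache"):
    """Run the frozen SigLIP encoder once over the whole dataset and store the
    patch embeddings on disk, so training batches only touch the projector and Llama3."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    image_encoder.eval()
    for batch_idx, (pixel_values, _) in enumerate(dataloader):
        patch_embeds = image_encoder(pixel_values)   # (batch, num_patches, vision_dim)
        torch.save(patch_embeds.cpu(), cache / f"batch_{batch_idx}.pt")

def load_cached_embeddings(batch_idx, cache_dir="siglip_cache"):
    # During training, read the precomputed embeddings instead of re-running SigLIP,
    # which frees GPU memory for larger batch sizes.
    return torch.load(Path(cache_dir) / f"batch_{batch_idx}.pt")
```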

Pretraining of Llama3-V involved only 600,000 image-text pairs, followed by fine-tuning on 1 million examples, including 7 million split images. During pretraining, the main weights of the Llama3 architecture were frozen, with updates limited to the projection matrix. Subsequent supervised fine-tuning focused on improving the vision and projection matrices.
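In PyTorch terms, the freezing described above might look roughly like this. The attribute names (language_model, vision_tower, projector) follow the earlier sketch and are hypothetical, as is the learning rate.

```python
import torch

def freeze_for_pretraining(model, lr=1e-4):
    """Freeze the Llama3 backbone and the SigLIP tower so only the projection
    module receives gradient updates during the pretraining stage."""
    for param in model.language_model.parameters():
        param.requires_grad = False
    for param in model.vision_tower.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():
        param.requires_grad = True
    # Only the trainable (projection) parameters are handed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```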

Last Updated on November 7, 2024 7:56 pm CET

Source: Aksh Garg
Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
