Apple, together with the Swiss Federal Institute of Technology Lausanne (EPFL), has rolled out the 4M-21 AI model. The multimodal system handles a broad array of tasks across more than 20 modalities, including Segment Anything Model (SAM) segments, 3D human poses, and Canny edges.
Unified Approach to Multimodal Learning
4M-21 is built on 4M, a multimodal and multitask training framework for developing versatile any-to-any models that can predict or generate any modality from any combination of the others. These 4M models execute a broad spectrum of vision tasks out of the box and perform strongly when fine-tuned for new downstream tasks.
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that solves 10s of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website.
— Amir Zamir (@zamir_ar) June 14, 2024
Tokenizing every modality into a sequence of discrete tokens makes it possible to train a single unified Transformer encoder-decoder across text, images, geometric and semantic modalities, and neural network feature maps. The 4M training objective then maps one random subset of these tokens to another. The researchers write on the 4M project page:
“4M allows for generating any modality from any other subset of modalities in a self-consistent manner. We can achieve this level of consistency by looping back predicted modalities into the input when generating subsequent ones. Remarkably, 4M is able to perform this feat without needing any loss-balancing or architectural modifications commonly used in multitask learning. […] 4M can also effectively integrate information from multiple inputs.”
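The sketch below, in plain PyTorch, is an illustration of that idea rather than the released 4M code: each modality is assumed to already be a sequence of discrete token IDs, a random subset of those tokens serves as input and another subset as the prediction target, and generated modalities are appended back to the input before the next one is decoded. Names such as AnyToAnyModel, split_random_subsets, and generate_chained are hypothetical, and modality embeddings, masking details, and loss weighting are omitted.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # assumed shared discrete vocabulary across modalities
D_MODEL = 256

class AnyToAnyModel(nn.Module):
    """Toy stand-in for the unified Transformer encoder-decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_tokens, tgt_tokens):
        out = self.transformer(self.embed(src_tokens), self.embed(tgt_tokens))
        return self.head(out)   # (batch, tgt_len, vocab)

def split_random_subsets(tokenized_sample, keep_prob=0.5):
    """Partition each modality's tokens into a random input subset and a
    random target subset (a simplified version of the 4M masking scheme)."""
    inputs, targets = [], []
    for tokens in tokenized_sample.values():
        keep = torch.rand(len(tokens)) < keep_prob
        inputs.append(tokens[keep])
        targets.append(tokens[~keep])
    return torch.cat(inputs), torch.cat(targets)

@torch.no_grad()
def generate_chained(model, input_tokens, target_lengths):
    """Predict modalities one after another, feeding each prediction back
    into the input before decoding the next (the 'looping back' above)."""
    context, outputs = input_tokens, {}
    for name, length in target_lengths.items():
        generated = torch.zeros(1, 1, dtype=torch.long)            # assumed start-token id 0
        for _ in range(length):
            logits = model(context.unsqueeze(0), generated)
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy decoding
            generated = torch.cat([generated, next_tok], dim=1)
        outputs[name] = generated.squeeze(0)[1:]
        context = torch.cat([context, outputs[name]])              # loop prediction back in
    return outputs

# Toy training step on one sample with three already-tokenized modalities.
sample = {
    "rgb":     torch.randint(0, VOCAB_SIZE, (196,)),
    "depth":   torch.randint(0, VOCAB_SIZE, (196,)),
    "caption": torch.randint(0, VOCAB_SIZE, (32,)),
}
model = AnyToAnyModel()
src, tgt = split_random_subsets(sample)
loss = nn.functional.cross_entropy(
    model(src.unsqueeze(0), tgt.unsqueeze(0)).squeeze(0), tgt)
loss.backward()

# Chained generation: decode a depth map first, then a caption conditioned on it too.
preds = generate_chained(model, sample["rgb"], {"depth": 16, "caption": 8})
print({name: tuple(t.shape) for name, t in preds.items()})
```

Because all modalities live in the same token space, the same cross-entropy objective and the same decoding loop cover every input-output combination, which is why no task-specific losses or heads are needed.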
To achieve this, three main tokenizer types are used: Vision Transformer (ViT)-based tokenizers for image-like data, MLP tokenizers for human poses and global embeddings, and a WordPiece tokenizer for text and other structured data encoded as text. This tokenization scheme reduces computational load and supports generative tasks across the different modalities.
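As a rough illustration of that division of labor, the snippet below routes image-like maps through patch quantization, fixed-size vectors through an MLP plus a nearest-codebook lookup, and text through a trivial whitespace vocabulary standing in for WordPiece. The tokenizer functions, codebook size, and shapes are toy assumptions, not the released 4M-21 tokenizers.

```python
import torch
import torch.nn as nn

CODEBOOK = nn.Embedding(1024, 32)   # shared toy codebook for the VQ-style tokenizers

def quantize(z):
    """Map continuous vectors of shape (..., 32) to nearest-codebook indices."""
    flat = z.reshape(-1, 32)
    idx = torch.cdist(flat, CODEBOOK.weight).argmin(dim=-1)
    return idx.reshape(z.shape[:-1])

def tokenize_image_like(x, patch=16):
    """ViT-style: split an image-like map (C, H, W) into patches, project, quantize."""
    c, h, w = x.shape
    patches = x.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    proj = nn.Linear(c * patch * patch, 32)                       # untrained stand-in
    return quantize(proj(patches))                                # one token per patch

def tokenize_vector(x):
    """MLP-style: a pose or global-embedding vector -> a few discrete tokens."""
    mlp = nn.Sequential(nn.Linear(x.numel(), 64), nn.GELU(), nn.Linear(64, 8 * 32))
    return quantize(mlp(x.flatten()).reshape(8, 32))

def tokenize_text(s, vocab):
    """Stand-in for WordPiece: look up whitespace-split words in a vocabulary."""
    return torch.tensor([vocab.get(w, 0) for w in s.lower().split()])

vocab = {"a": 1, "person": 2, "riding": 3, "bike": 4}
tokens = {
    "rgb":     tokenize_image_like(torch.randn(3, 224, 224)),
    "depth":   tokenize_image_like(torch.randn(1, 224, 224)),
    "pose":    tokenize_vector(torch.randn(24, 3)),   # assumed flattened 3D joint layout
    "caption": tokenize_text("a person riding a bike", vocab),
}
for name, t in tokens.items():
    print(name, tuple(t.shape))
```

Whatever the source modality, the output is a flat sequence of integer IDs, which is what lets the unified Transformer above treat dense maps, poses, and text uniformly.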
Enhanced Capabilities and Performance
4M-21 offers a wide range of capabilities, from steerable multimodal generation and multimodal retrieval to strong performance on vision tasks. The model generates any training modality by iteratively decoding tokens, which enables fine-grained, controllable generation and improved text understanding. For retrieval, it predicts global embeddings from any input modality, which makes cross-modal retrieval possible.
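A minimal sketch of that retrieval setup follows, assuming some model head already produces a global embedding (e.g., a DINOv2-style vector) from whichever modality is given; the retrieve helper and the query/gallery tensors are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding, gallery_embeddings, k=5):
    """Rank gallery items by cosine similarity to the query's global embedding."""
    q = F.normalize(query_embedding, dim=-1)
    g = F.normalize(gallery_embeddings, dim=-1)
    scores = g @ q                      # (num_gallery,) cosine similarities
    return scores.topk(k)

# Toy example: a 768-dim global embedding predicted from, say, a caption,
# matched against embeddings predicted from a gallery of images.
query = torch.randn(768)
gallery = torch.randn(1000, 768)
top_scores, top_indices = retrieve(query, gallery)
print(top_indices.tolist())
```

Since the embedding is predicted from any input modality, the query could just as well be a depth map or a pose while the gallery holds RGB images, which is what gives the retrieval its cross-modal flexibility.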
Performance evaluations show that 4M-21 competes strongly in surface normal estimation, depth estimation, semantic segmentation, instance segmentation, 3D human pose estimation, and image retrieval. Often, it matches or exceeds the performance of models specialized in these tasks, with the XL variant demonstrating consistent strength across multiple domains without compromising individual task performance.
The researchers have also explored how pre-training any-to-any models on numerous modalities scales, testing three model sizes: B, L, and XL. This scalability suggests the approach can be extended to more complex tasks and larger datasets, positioning it as a useful foundation for future multimodal work.