Microsoft has introduced Florence-2, a new vision foundation model released under the MIT license. The model addresses a range of vision and vision-language tasks through a unified, prompt-based approach, and is available in two sizes, with 232 million and 771 million parameters.
Architecture and Training Insights
Florence-2 employs a sequence-to-sequence framework, pairing an image encoder with a multi-modality encoder-decoder that interprets simple text prompts to execute tasks such as captioning, object detection, and segmentation. This setup allows one model to handle diverse vision tasks without task-specific adjustments.
The model was trained on FLD-5B, a large-scale dataset built for this research comprising 5.4 billion annotations across 126 million images, covering text, region-text pairs, and text-phrase-region triplets produced by an automated annotation pipeline.
These annotations were standardized into textual outputs (region-level results such as bounding boxes are expressed as special location tokens added to the tokenizer's vocabulary), enabling a cohesive multi-task learning mechanism. Florence-2's sequence-to-sequence architecture allows it to excel in both zero-shot and fine-tuned settings, making it a competitive vision foundation model.
The goal was to create a versatile vision model capable of executing numerous tasks with a consistent set of parameters, activated through text prompts in the same way large language models are.
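To make the prompt-based interface concrete, the sketch below follows the usage pattern from the Florence-2 model card on Hugging Face. The task is selected entirely by a special prompt token such as "<OD>" (object detection) or "<CAPTION>"; the image URL here is a placeholder, and generation settings such as the beam count are illustrative rather than prescribed.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the smaller 232M-parameter checkpoint; trust_remote_code pulls in
# Florence-2's custom modeling and processing classes.
model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder URL; any RGB image works.
url = "https://example.com/street_scene.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The task is chosen purely by the text prompt, e.g. "<CAPTION>",
# "<DENSE_REGION_CAPTION>", or "<OD>" for object detection.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The raw output is plain text containing location tokens; the processor
# converts it back into structured boxes and labels for the chosen task.
result = processor.post_process_generation(
    generated_text, task=prompt, image_size=(image.width, image.height)
)
print(result)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}
```

Swapping the prompt token retargets the same weights to a different task, which is the unification the researchers describe.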
Model Performance
Florence-2 has demonstrated strong performance in captioning, object detection, visual grounding, and segmentation applications. In zero-shot captioning tests on Microsoft's Common Objects in Context (COCO) benchmark, both versions of Florence-2 outperformed far larger models, including DeepMind's 80-billion-parameter Flamingo visual language model.
“The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models,” the researchers write. “Compared to pre-trained models on ImageNet, ours improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO and ADE20K datasets.”
When fine-tuned with publicly available human-annotated data, Florence-2 performed comparably to several larger specialized models across various tasks. Its pre-trained backbone also brought notable improvements to downstream tasks such as object detection, instance segmentation, and semantic segmentation.
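As a rough illustration of that fine-tuning workflow, the sketch below adapts the model and processor from the earlier example, assuming the checkpoint exposes the standard Hugging Face seq2seq interface (passing labels yields a loss). The train_pairs dataset, batch size, and learning rate are hypothetical placeholders.

```python
import torch
from torch.utils.data import DataLoader

# `train_pairs` is a hypothetical list of (PIL image, target caption) tuples.
def collate(batch):
    images, answers = zip(*batch)
    inputs = processor(
        text=["<CAPTION>"] * len(images), images=list(images),
        return_tensors="pt", padding=True,
    )
    labels = processor.tokenizer(
        list(answers), return_tensors="pt", padding=True
    ).input_ids
    return inputs, labels

loader = DataLoader(train_pairs, batch_size=4, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for inputs, labels in loader:
    # With `labels` supplied, the forward pass returns a cross-entropy
    # loss over the target text, as in other encoder-decoder models.
    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```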
Licensing and Availability
Florence-2 is available under the MIT license, which permits broad use by developers and businesses. The model aims to simplify the execution of complex AI vision and vision-language tasks, potentially minimizing the need for multiple specialized models. This could result in reduced computational costs and more streamlined development processes for vision-based AI applications.
While Florence-2 has not yet been proven in real-world deployments, it holds promise for handling a wide array of vision-related tasks and applications. The model is positioned to replace single-purpose vision models that falter when faced with new tasks. Released on Hugging Face by Microsoft's Azure AI team, Florence-2 aims to meet varied enterprise needs through a single, unified model.
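For reference, both published checkpoints load through the same few lines; only the repository id, and with it the parameter count, changes:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# "microsoft/Florence-2-base" (232M parameters) or
# "microsoft/Florence-2-large" (771M parameters)
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```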