
Microsoft Unveils Florence-2 AI Vision Model for Multi-Tasking

The model seeks to replace single-use vision models that may falter when faced with new tasks.


Microsoft has introduced a new vision foundation model called Florence-2. Released under the MIT license, the model addresses a range of vision and vision-language tasks through a unified, prompt-based approach. It is available in two sizes, with 232 million and 771 million parameters.

Architecture and Training Insights

Florence-2 employs a sequence-to-sequence framework, combining an image encoder with a multi-modality encoder-decoder capable of interpreting simple text prompts to execute tasks such as captioning, object detection, and segmentation. This setup allows for the management of diverse vision tasks without requiring task-specific adjustments.
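To illustrate what "prompt-based" means in practice, here is a minimal Python sketch of selecting a task purely through the text prompt. It assumes the Hugging Face transformers library and the pattern shown on the model card as we understand it; the model ID, the task prompt strings, and the helper names build_prompt and run_task are assumptions for illustration and should be checked against the official release.

```python
# Task prompts as we understand them from the model card (assumption: verify before use).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, extra_text=""):
    """Return the text prompt that selects a task; some tasks take extra text."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return TASK_PROMPTS[task] + extra_text


def run_task(image, task, extra_text="", model_id="microsoft/Florence-2-base"):
    """Run one task on a PIL image (hypothetical helper).

    Heavy imports are deferred so the prompt logic can be used on its own.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_prompt(task, extra_text)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    # The same weights serve every task: only the prompt changed.
    return processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
```

Calling run_task(image, "caption") and run_task(image, "object_detection") would load the same checkpoint both times; the only difference between the two calls is the prompt string.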

The model was trained on FLD-5B, a large-scale dataset comprising 5.4 billion visual annotations across 126 million images.

These annotations were standardized into text outputs, enabling a cohesive multi-task learning mechanism. Florence-2's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
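One way to picture how a spatial annotation can become plain text is to quantize box coordinates into a fixed number of bins and emit each bin as a special token. The Python sketch below assumes 1,000 bins and a loc_N token spelling; both, along with the function names, are illustrative assumptions rather than a confirmed description of Florence-2's internals.

```python
import re


def quantize(value, size, bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    idx = int(value * bins // size)
    return min(max(idx, 0), bins - 1)


def box_to_text(label, box, width, height, bins=1000):
    """Serialize a labeled (x0, y0, x1, y1) pixel box as label plus four location tokens."""
    x0, y0, x1, y1 = box
    coords = (
        quantize(x0, width, bins),
        quantize(y0, height, bins),
        quantize(x1, width, bins),
        quantize(y1, height, bins),
    )
    return label + "".join(f"<loc_{c}>" for c in coords)


def text_to_boxes(text):
    """Parse a serialized string back into (label, quantized box) pairs."""
    pattern = r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    return [
        (m[0].strip(), tuple(int(v) for v in m[1:]))
        for m in re.findall(pattern, text)
    ]


# A 640x480 image with two annotated objects, flattened into one target string.
target = box_to_text("car", (64, 48, 320, 240), 640, 480)
target += box_to_text("person", (0, 0, 640, 480), 640, 480)
print(target)
print(text_to_boxes(target))
```

Because captions, boxes, and region labels all end up as token sequences under a scheme like this, a single decoder can be trained on every task with the same loss.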

The goal was to create a versatile vision model capable of executing numerous tasks under a consistent set of parameters, activated through text prompts similar to techniques used in large language models.

Model Performance

Florence-2 has demonstrated strong performance in captioning, object detection, visual grounding, and segmentation applications. In zero-shot captioning tests on the Microsoft Common Objects in Context (COCO) dataset, both versions of Florence-2 outperformed larger models, including DeepMind's 80B-parameter Flamingo visual language model.

“The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models,” the researchers write. “Compared to pre-trained models on ImageNet, ours improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO and ADE20K datasets.”

When fine-tuned with publicly available human-annotated data, Florence-2 performed comparably with several larger specialized models across various tasks. Notable improvements were observed in downstream tasks such as object detection, instance segmentation, and semantic segmentation, credited to both pre-trained and fine-tuned versions of Florence-2.

Licensing and Availability

Florence-2 is available under the MIT license, which permits broad use by developers and businesses. The model aims to simplify the execution of complex AI vision and vision-language tasks, potentially minimizing the need for multiple specialized models. This could result in reduced computational costs and more streamlined development processes for vision-based applications.

While Florence-2 hasn't yet been implemented in real-world scenarios, it holds promise for handling a wide array of vision-related tasks and applications. The model seeks to replace single-use vision models that may falter when faced with new tasks. Released on Hugging Face by Microsoft's Azure AI team, Florence-2 is set to meet varied enterprise needs through a single, unified approach for different tech applications.

Markus Kasanmascheff
Markus is the founder of WinBuzzer and has been playing with Windows and technology for more than 25 years. He holds a Master's degree in International Economics and previously worked as Lead Windows Expert for Softonic.com.