
Microsoft Unveils Florence-2 AI Vision Model for Multi-Tasking

The model seeks to replace single-use vision models that may falter when faced with new tasks.


Microsoft has introduced a new vision foundation model called Florence-2. Open for use under the MIT license, the model addresses a range of vision and vision-language tasks through a unified, prompt-based approach. It's available in two sizes, featuring 232 million and 771 million parameters.

Architecture and Training Insights

Florence-2 employs a sequence-to-sequence framework, combining an image encoder with a multi-modality encoder-decoder capable of interpreting simple text prompts to execute tasks such as captioning, object detection, and segmentation. This setup allows for the management of diverse vision tasks without requiring task-specific adjustments.
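A minimal sketch of what this prompt-based interface looks like in practice, assuming the task-token convention (`<CAPTION>`, `<OD>`, and so on) and the `microsoft/Florence-2-base` checkpoint name from the model's Hugging Face card; the runner function is illustrative only and requires `transformers`, `torch`, and a network connection to actually execute.

```python
# Sketch of Florence-2's prompt-based task selection. The task tokens and
# checkpoint name below are assumptions drawn from the Hugging Face model
# card, not details stated in this article.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "visual_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Compose the text prompt that selects a Florence-2 task.

    Grounding-style tasks append free text after the task token.
    """
    return TASK_PROMPTS[task] + text_input


def run_florence2(image, task: str, text_input: str = "") -> str:
    """Illustrative runner; needs `transformers`, `torch`, and a model download."""
    from transformers import AutoProcessor, AutoModelForCausalLM

    model_id = "microsoft/Florence-2-base"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    inputs = processor(text=build_prompt(task, text_input),
                       images=image, return_tensors="pt")
    generated_ids = model.generate(input_ids=inputs["input_ids"],
                                   pixel_values=inputs["pixel_values"],
                                   max_new_tokens=256)
    return processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
```

The point of the design is that the same weights handle every task; only the prompt string changes, which is what lets a single checkpoint stand in for several single-purpose models.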

The model was trained on FLD-5B, a large-scale research dataset consisting of 5.85B CLIP-filtered image-text pairs: 2.3B with English text, 2.2B samples spanning more than 100 other languages, and 1B samples whose text cannot be assigned to a specific language (e.g., names).

These annotations were standardized into text outputs, enabling a cohesive multi-task learning mechanism. Florence-2's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.

The goal was to create a versatile vision model capable of executing numerous tasks under a consistent set of parameters, activated through text prompts similar to techniques used in large language models.
  

Model Performance

Florence-2 has demonstrated strong performance in captioning, object detection, visual grounding, and segmentation applications. In zero-shot tests on Microsoft's Common Objects in Context (COCO) dataset, both versions of Florence-2 outperformed larger models, including DeepMind's 80B-parameter Flamingo visual language model.

“The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models,” the researchers write. “Compared to pre-trained models on ImageNet, ours improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO and ADE20K datasets.”

When fine-tuned with publicly available human-annotated data, Florence-2 performed comparably with several larger specialized models across various tasks. Notable improvements were observed in downstream tasks such as object detection, instance segmentation, and semantic segmentation, with gains attributed to both the pre-trained backbone and the fine-tuned versions of Florence-2.

Licensing and Availability

Florence-2 is available under the MIT license, which permits broad use by developers and businesses. The model aims to simplify the execution of complex AI vision and vision-language tasks, potentially minimizing the need for multiple specialized models. This could result in reduced computational costs and more streamlined development processes for vision-based applications.

While Florence-2 hasn't yet been proven in real-world deployments, it holds promise for handling a wide array of vision-related tasks and applications. Released on Hugging Face by Microsoft's Azure AI team, Florence-2 is positioned to meet varied enterprise needs through a single, unified approach across different applications.

Source: Microsoft
Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.