
Microsoft Debuts Kosmos-1 Multimodal AI That Provides Visual Natural Language Understanding

Kosmos-1 is a new Microsoft Research AI that can provide image analysis and visual puzzle solving through multimodal learning.


Microsoft Research has this week released a post introducing its new Kosmos-1 multimodal model. According to the company, this is a multimodal large language model (MLLM) AI that can solve visual puzzles, pass visual IQ tests, handle text recognition from images, and analyze images and describe their content. It also takes natural language instructions.

Of course, Microsoft has been going big on AI in recent months through its multi-billion-dollar partnership with OpenAI. The company is already working with large language model (LLM) AI, having created Prometheus, the new AI model that drives its Bing Chat AI search engine by borrowing technology from OpenAI.

However, Kosmos-1 is a project developed without OpenAI and shows how multimodal large language models (MLLMs) could be another path to mainstreaming AI. To train the model, Microsoft used a large dataset from the web, including the 800GB text source known as The Pile.

Once it was trained, Microsoft Research put the model to the test across language understanding, optical character recognition, image captioning, language generation, and web page question answering.

General AI

In its report, Microsoft Research says multimodal AI such as Kosmos-1 is an important technology for creating artificial general intelligence (AGI) that can perform tasks at a human level. If you are unfamiliar with multimodal AI, it is a model that can combine different input modes, such as images, video, audio, and text.
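The core idea behind combining input modes can be sketched in a few lines: embeddings from different modalities are projected into one shared space and joined into a single sequence that a language model can attend to. The code below is a minimal, hypothetical illustration of that fusion step; the function names, dimensions, and random projections are illustrative assumptions, not Kosmos-1's actual architecture.

```python
import numpy as np

EMBED_DIM = 8  # shared embedding width (illustrative)

def embed_text(tokens, vocab_size=100, dim=EMBED_DIM, seed=0):
    """Map token ids to vectors via a fixed random embedding table."""
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((vocab_size, dim))
    return table[np.array(tokens)]

def embed_image(pixels, dim=EMBED_DIM, seed=1):
    """Flatten an image patch and project it into the shared space."""
    rng = np.random.default_rng(seed)
    flat = np.asarray(pixels, dtype=float).ravel()
    projection = rng.standard_normal((flat.size, dim))
    return (flat @ projection)[None, :]  # one "image token"

def build_sequence(image, tokens):
    """Join image and text embeddings into one input sequence."""
    return np.concatenate([embed_image(image), embed_text(tokens)], axis=0)

# A 2x2 image patch followed by three text tokens becomes a 4-token sequence.
seq = build_sequence([[0.1, 0.5], [0.3, 0.9]], [4, 17, 23])
print(seq.shape)  # → (4, 8)
```

Once everything lives in one sequence of same-width vectors, the downstream model does not need to know which positions came from pixels and which from words, which is what lets a single transformer handle mixed inputs.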

“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world,” the researchers write in their accompanying academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”

In the paper, Microsoft shows visual examples of the Kosmos-1 model analyzing images and then returning natural language answers about them. It also writes captions for the images, reads text that appears in the images, and takes a visual IQ test with an accuracy of between 22 and 26 percent.


Luke Jones
Luke has been writing about all things tech for more than five years. He follows Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens, and all the rest of its products.
