Microsoft Research this week published a post introducing its new Kosmos-1 multimodal model. According to the company, this multimodal large language model (MLLM) can solve visual puzzles, pass visual IQ tests, recognize text in images, and analyze images and describe their content. It also takes natural language instructions.

Of course, Microsoft has been going big on AI in recent months through its multibillion-dollar partnership with OpenAI. The company is already working with large language model (LLM) AI through Prometheus, the new AI model that drives the company’s Bing Chat AI search engine and borrows technology from ChatGPT.

However, Kosmos-1 is a project developed without OpenAI, and it shows how multimodal large language models (MLLMs) could be another path to mainstreaming AI. To train the model, Microsoft used a large dataset from the web, including the 800GB text source known as The Pile.

Once it was trained, Microsoft Research put the model to the test across language understanding, optical character recognition, image captioning, language generation, and web page question answering.

General AI

In its report, Microsoft Research says multimodal AI such as Kosmos-1 is an important technology for creating artificial general intelligence (AGI) that can act on a human level when performing tasks. If you are unfamiliar with multimodal AI, it is a model that can combine different input learning modes, such as images, video, audio, and text.

“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world,” the researchers write in their accompanying academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”

In the paper, Microsoft shows visual examples of the Kosmos-1 model analyzing images and then returning natural language answers about them. It also writes captions for the images, reads text that appears in the images, and takes a visual IQ test with an accuracy of between 22 and 26 percent.

