Microsoft Research Co-Develops AI That Can Accurately Generate Comments for Videos

Microsoft Research and Harbin Institute of Technology have announced an AI model that matches comments to video clips.


Microsoft has been one of the leaders in developing language AI, such as text-to-speech solutions. The company's latest effort is a project that aims to generate comments from video clips.

A project between Microsoft Research Asia and Harbin Institute of Technology proposes a new method of generating comments from videos. Previous research has centered on encoder-decoder models. While these models do generate comments, the results are typically irrelevant to the video.

In a paper published on arXiv.org, the Microsoft-led project describes a machine learning model that can better match comments to video. In its experiments, the research team says the AI outperforms current models.

The researchers have also made the code for the project available. At the core of the AI is the ability to match relevant comments with videos from a study set in order to learn cross-modal representations. Microsoft Research and Harbin based the model on the Transformer architecture.


The resulting automatic live commenting system is made of three main components:

  • An encoder layer to convert the modalities of a video and a comment into vectors.
  • A matching layer that learns the representation from each modality.
  • And a prediction layer that measures the degree of matching between a video and a comment.
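
To make the three components more concrete, here is a minimal toy sketch in Python. It is not the researchers' implementation: the hand-written vectors stand in for the encoder output, a single attention step stands in for the matching layer, and a sigmoid over a dot product stands in for the prediction layer. The names attend and match_score are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    # Toy "matching layer": attention-weighted average of the context
    # vectors (keys), weighted by their similarity to the comment (query).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

def match_score(comment_vec, context_vecs):
    # Toy "prediction layer": squash the similarity between the comment
    # and its attended context into a score between 0 and 1.
    attended = attend(comment_vec, context_vecs)
    return 1.0 / (1.0 + math.exp(-dot(comment_vec, attended)))

# Assumed encoder output: one vector per modality (for example a visual
# frame and an audio segment). These values are invented for illustration.
video_context = [[1.0, 0.0], [0.0, 1.0]]
aligned_score = match_score([2.0, 0.0], video_context)
opposed_score = match_score([-2.0, 0.0], video_context)
```

In this toy setup, a comment vector that points in the same direction as part of the video context receives a higher score than one that points away from it. A real system would learn the encoder and the attention weights from data.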

In other words, the model will “watch” a video with a time stamp and choose a comment from a candidate set. The AI is capable of finding the most relevant comment for the video at that time stamp. It does this by basing its prediction on other comments in the set, the visual aspect of the video, and the audio.
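
The selection step can be illustrated with a short Python sketch. This is only an assumption-laden stand-in: it swaps the learned model for mean pooling and cosine similarity, and the function rank_candidates, the vectors, and the example comments are all invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity, returning 0.0 if either vector has zero length.
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def rank_candidates(context_vecs, candidates):
    # context_vecs: vectors for the visuals, audio, and nearby comments
    #               around one time stamp (assumed to be already encoded).
    # candidates:   mapping of candidate comment text to its vector.
    # Returns the candidate texts ordered from best to worst match.
    dim = len(context_vecs[0])
    pooled = [sum(v[i] for v in context_vecs) / len(context_vecs)
              for i in range(dim)]
    return sorted(candidates,
                  key=lambda text: cosine(pooled, candidates[text]),
                  reverse=True)

# Hypothetical context at one time stamp: visual, audio, surrounding comments.
context_at_t = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [1.0, 0.2, 0.1]]
candidates = {
    "what a goal!": [1.0, 0.1, 0.1],
    "nice recipe": [0.0, 1.0, 0.0],
    "first": [0.1, 0.1, 1.0],
}
ranking = rank_candidates(context_at_t, candidates)
best_comment = ranking[0]
```

Here the candidate whose vector is closest to the pooled context is ranked first. The model described in the paper replaces this fixed similarity measure with learned, attention-based matching.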

To train the model, the researchers used a dataset of 2,361 videos and 895,929 comments. All of the data was taken from Bilibili, a Chinese streaming platform.

“[W]e believe the multimodal pre-training will be a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models,” wrote the researchers. “For future research, we will further investigate the multimodal interactions among vision, audio, and text in … real-world applications.”

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.