Microsoft has been a leader in developing language AI, including text-to-speech solutions. The company’s latest effort is a Microsoft Research project that aims to generate comments for video clips.
A collaboration between Microsoft Research Asia and Harbin Institute of Technology proposes a new method for generating comments on videos. Previous research has centered on encoder-decoder models; while these can generate comments, the results are often irrelevant to the video.
In a paper published on arXiv.org, the Microsoft-led team describes a model that uses machine learning to better match comments to videos. In its experiments, the research team says the AI outperforms current models.
You can check out the code for the project on GitHub. At the core of the AI is the ability to match relevant comments with videos from a study set, learning cross-modal representations in the process. Microsoft Research and Harbin based the model on Google’s Transformer architecture.
The resulting automatic live commenting system is made up of three main components:
- An encoder layer that converts each modality of a video and a comment into vectors.
- A matching layer that learns a cross-modal representation from each modality.
- And a prediction layer that measures the degree of matching between a video and a comment.
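To make the three-layer design concrete, here is a toy sketch in pure Python. It is not the authors’ implementation (the paper builds these layers from Transformer encoders over learned embeddings); all function names and the trivial bag-of-words encoding are hypothetical stand-ins chosen only to show how the encoder, matching, and prediction layers fit together.

```python
def encode(tokens, dim=8):
    """Toy encoder layer: reduce a sequence of tokens (from one modality)
    to a fixed-size vector. Here, a deterministic bag-of-words hash;
    the real model uses Transformer encoders."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(ord(ch) for ch in tok) % dim] += 1.0
    return vec

def match(video_vec, comment_vec):
    """Toy matching layer: an element-wise interaction between the
    video-side and comment-side vectors."""
    return [v * c for v, c in zip(video_vec, comment_vec)]

def predict(interaction):
    """Toy prediction layer: collapse the interaction vector into a
    single matching score (here, a plain dot product overall)."""
    return sum(interaction)
```

With this skeleton, `predict(match(encode(video_tokens), encode(comment_tokens)))` yields a scalar that is higher when the two sides share content, which mirrors the role each layer plays in the paper’s pipeline.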
In other words, the model “watches” a video at a given time stamp and chooses a comment from a candidate set, finding the most relevant comment for the video at that moment. It bases its prediction on the other comments in the set, the video’s visual content, and its audio.
To train the model, the researchers used a set of 2,361 videos and 895,929 comments, all collected from Bilibili, a Chinese video streaming platform.
“[W]e believe the multimodal pre-training will be a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models,” wrote the researchers. “For future research, we will further investigate the multimodal interactions among vision, audio, and text in … real-world applications.”