While computer vision has become adept at describing images in natural language, the complexity of video still presents major problems. Microsoft Research is tackling video description with several active projects, two of which present a new benchmark dataset for video captioning and a model called Long Short-Term Memory with visual-semantic Embedding to bridge video and language.
While the ability for computers to bridge video and language is an important step in its own right, such a progression would also be important for the overall learning abilities of computers.
The two newly released research papers present different ideas for achieving more realistic video-to-language learning:
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language; Jun Xu, Tao Mei, Ting Yao, and Yong Rui; June 2016
- Jointly Modeling Embedding and Translation to Bridge Video and Language; Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui; June 2016
The two Microsoft Research projects feature a large number of contributors, suggesting the company is extremely interested in this avenue, with Tao Mei and Ting Yao acting as lead researchers on both projects. Tao Mei already holds several important patents, including one on multi-video synthesis and another on capture-intention detection for video content analysis.
During Build 2016, Microsoft had already introduced Seeing AI, a somewhat related “research project that uses computer vision and natural language processing to describe a person's surroundings, read text, answer questions and even identify emotions on people's faces.” Seeing AI, which can be used as a cell phone app or via smart glasses from Pivothead, is currently under development, and nothing is known about a future public release.
Seeing AI might be using similar, if not the same, techniques to analyze moving pictures in real time and translate their visual properties into text that is also meaningful.
Seeing AI teaser video from Build 2016:
Here are some details from the just-released papers that explain Microsoft's approach to bridging video and language.
MSR-Video to Text (MSR-VTT) is described by Microsoft as a “large-scale video benchmark for video understanding,” a new dataset for video-to-text understanding. The research team collected 257 of the most popular queries from a video search engine, and the result is MSR-VTT, with 10 thousand web video clips totaling 41.2 hours and 200 thousand clip-sentence pairs.
Microsoft Research adds that the videos span 20 categories and that each clip is annotated with around 20 natural sentences. MSR-VTT is the dataset with the largest number of clip-sentence pairs, as can be seen in the first image, and each video is annotated with several diverse sentence descriptions.
Arguably the biggest achievement of MSR-VTT is that it brings the strengths of existing video description datasets together in a single dataset. This allows for “large scale clip-sentence pairs, comprehensive video categories, diverse video content and descriptions, as well as multimodal audio and video streams.”
However, the dataset will only be useful if each clip can be described accurately in a single sentence. To achieve this, the research team asked 15 subjects to select clips of between 10 and 30 seconds each. Workers (3,400 worker hours in total) then wrote around 20 sentences for each clip, yielding 200 thousand clip-sentence pairs and 29,316 unique words across a number of categories (below).
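The headline figures can be sanity-checked with a little arithmetic. The short Python sketch below uses only the numbers quoted in this article (10 thousand clips, 41.2 hours, around 20 sentences per clip); the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the article for MSR-VTT.
num_clips = 10_000
total_hours = 41.2
sentences_per_clip = 20  # the article says "around 20", so results are approximate

# Each clip paired with each of its sentences gives one clip-sentence pair.
clip_sentence_pairs = num_clips * sentences_per_clip

# Average clip length in seconds, derived from the total duration.
avg_clip_seconds = total_hours * 3600 / num_clips

print(clip_sentence_pairs)
print(round(avg_clip_seconds, 1))
```

The average of just under 15 seconds per clip sits inside the 10 to 30 second range the subjects were asked to respect, and 10 thousand clips with roughly 20 sentences each account for the 200 thousand pairs.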
Long Short-Term Memory with visual-semantic Embedding (LSTM-E)
LSTM-E is a Microsoft Research project that aims to improve the way computers learn to understand video and describe clips in a human-like way. In its paper, the team presents a unified framework that builds on developments in Recurrent Neural Networks (RNNs) to produce more accurate video-to-text descriptions.
Essentially, LSTM-E does two things at once: it takes advantage of the LSTM learning framework and it uses visual-semantic embedding. The above image shows a five-frame clip that has been described by a traditional LSTM (fairly vague), by Microsoft's LSTM-E (more accurate and speech-like), and by an accurate, descriptive human caption.
The function of LSTM-E is to take the most informative parts of a clip and form a more accurate description of what is going on in the shot. LSTM, which maps sequences to sequences, acts as the base of LSTM-E, but Microsoft Research argues that current LSTM-based models still ignore the semantics of a shot (such as specific verbs or objects).
To overcome this problem, the team also uses visual-semantic embedding. 2-D and/or 3-D Convolutional Neural Networks (CNNs) extract key visual features from a clip, an LSTM generates the sentence, and visual-semantic embedding helps capture the semantics of the video during learning (above). The goal is to achieve the most coherent and accurate sentence possible.
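To make the feature-extraction step more concrete, here is a minimal Python sketch of how per-frame CNN features might be collapsed into one clip-level vector and projected into an embedding space. It rests on assumptions: the article does not say how frame features are pooled, so mean pooling is used here as one common choice, and the feature values, the projection matrix `T_v` and the helper names are all hypothetical toy stand-ins.

```python
from typing import List

Vector = List[float]


def mean_pool(frame_features: List[Vector]) -> Vector:
    """Average per-frame feature vectors into one clip-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]


def project(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a (rows x dim) matrix by a dim-sized vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Toy stand-ins for CNN outputs: 3 frames, 4-dimensional features (assumed values).
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

# Hypothetical projection into a 2-dimensional embedding space.
T_v = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

video_vec = mean_pool(frames)        # one vector for the whole clip
video_emb = project(T_v, video_vec)  # the clip's position in the embedding space

print(video_vec)
print(video_emb)
```

In a real system the frame features would come from a trained CNN and the projection would be learned, but the shape of the computation is the same: many frame vectors in, one embedded clip vector out.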
“The spirit of LSTM-E is to generate video sentence from the viewpoint of mutual reinforcement between coherence and relevance. Coherence expresses the contextual relationships among the generated words with video content which is optimized in the LSTM, while relevance conveys the relationship between the semantics of the entire sentence and video content which is measured in the visual-semantic embedding.”
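One way to read the "mutual reinforcement" in that quote is as a single training objective that blends two penalties. The Python sketch below illustrates that reading, not the paper's exact formulation: relevance is taken to be the squared distance between a video and a sentence in the shared embedding space, coherence is taken to be the negative log-likelihood of the sentence's words, and the weight `lam` and all function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def relevance_loss(video_emb: Sequence[float], sentence_emb: Sequence[float]) -> float:
    """Squared Euclidean distance between video and sentence embeddings."""
    return sum((v - s) ** 2 for v, s in zip(video_emb, sentence_emb))


def coherence_loss(word_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the reference words, given the model's
    probability for each word in turn."""
    return -sum(math.log(p) for p in word_probs)


def joint_loss(
    video_emb: Sequence[float],
    sentence_emb: Sequence[float],
    word_probs: Sequence[float],
    lam: float = 0.5,
) -> float:
    """Weighted blend of the two terms; lam is an assumed trade-off weight."""
    return (1.0 - lam) * relevance_loss(video_emb, sentence_emb) + lam * coherence_loss(
        word_probs
    )


# Toy values: embeddings one unit apart, two words each predicted with p = 0.5.
loss = joint_loss([2.0, 2.0], [1.0, 2.0], [0.5, 0.5], lam=0.5)
print(loss)
```

Setting `lam` to 0 trains only for relevance (the sentence is about the right things), while setting it to 1 trains only for coherence (the words follow one another plausibly); values in between ask for both at once.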
LSTM-E allows for video clips to have descriptive completeness in terms of objects and subjects, but Microsoft Research also wanted the system to offer a coherent sentence structure where one word correctly follows the one before.
To test LSTM-E, the researchers used the Microsoft Research Video Description Corpus (YouTube2Text), a dataset of 1,970 YouTube clips, and compared three configurations: a regular LSTM, LSTM-E built on AlexNet, and LSTM-E built on the 19-layer VGG network. The image below shows that LSTM-E outperformed LSTM, and that it performed best in the LSTM-E (VGG) configuration.
“On the popular YouTube2Text dataset, the results of our experiments demonstrate the success of our approach, outperforming the current state-of-the-art models with a significantly large margin on both SVO prediction and sentence generation.”
Both the LSTM-E and MSR-VTT projects are hugely interesting, and the effort and manpower Microsoft is putting into the research show how seriously the company takes video understanding. The project papers from Microsoft Research are worth a read and convey the results in far more detail than we ever could here.