We are used to writing on these pages about Microsoft’s breakthroughs in voice and speech recognition. For example, earlier this month we discussed a Microsoft technology that can better understand human speech patterns for improved AI conversations. However, Microsoft’s rival Google is also a leader in voice recognition technology.

To prove that, the company has discussed a new research project that isolates the voice of a single person in a video, even when that person is speaking in a crowd or the voice is marred by background noise. Using deep learning, the model produces a video in which the selected person’s speech is enhanced and other sounds are suppressed.

The model uses both audio and visual signals from speakers, including their mouth movements. By combining the two, Google can replicate the human ability to focus on a single voice even when there are other sounds in the environment.

Google announced its breakthrough in a blog post today. The company says researchers used 100,000 high-quality videos from YouTube and compiled 2,000 hours of video of people speaking to the camera with no audio distractions.

“Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The input to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video’s soundtrack.

“During training, the network learns (separate) encodings for the visual and auditory signals, then it fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker.”
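
To make the idea of a time-frequency mask more concrete, here is a minimal sketch in Python. It is not Google's model: the network that predicts the masks is replaced by an "oracle" ratio mask computed from sources we already know, and the tiny spectrograms, the function names and the assumption that magnitudes simply add are all made up for illustration. It only shows how multiplying a mixture spectrogram by a per-speaker mask pulls out one speaker.

```python
def ratio_masks(sources, eps=1e-8):
    """Return one mask per speaker.

    Each mask value is that speaker's share of the total magnitude in a
    given time-frequency bin (an oracle stand-in for the network output).
    """
    n_frames = len(sources[0])
    n_bins = len(sources[0][0])
    masks = []
    for src in sources:
        mask = [
            [src[t][f] / (sum(s[t][f] for s in sources) + eps)
             for f in range(n_bins)]
            for t in range(n_frames)
        ]
        masks.append(mask)
    return masks


def apply_mask(mixture, mask):
    """Element-wise product of a mask and the mixture spectrogram."""
    return [
        [m * x for m, x in zip(mask_row, mix_row)]
        for mask_row, mix_row in zip(mask, mixture)
    ]


# Hypothetical magnitude spectrograms: 2 time frames x 3 frequency bins.
speaker_a = [[4.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
speaker_b = [[0.0, 2.0, 1.0], [1.0, 5.0, 0.0]]

# Simplifying assumption: the mixture magnitude is the sum of the sources.
mixture = [[a + b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(speaker_a, speaker_b)]

masks = ratio_masks([speaker_a, speaker_b])
estimate_a = apply_mask(mixture, masks[0])
estimate_b = apply_mask(mixture, masks[1])
```

In this toy setup the masked mixture reproduces each speaker's spectrogram almost exactly, because the masks were built from the true sources. In the system Google describes, the masks are instead predicted by the network from the joint audio-visual representation.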


With this clean footage, Google could create “synthetic cocktail parties” to train the AI to separate each speaker’s audio from the surrounding clutter. Users would just have to select a face in the video to hear what that person is saying.
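
The appeal of a synthetic mixture is that the clean tracks are known, so they can serve as training targets. The sketch below shows that idea in plain Python under loud assumptions: sine tones stand in for clean speech, uniform random samples stand in for background noise, and `tone`, `mix_cocktail` and `noise_gain` are hypothetical names, not anything from Google's pipeline.

```python
import math
import random


def tone(freq_hz, n_samples, sample_rate=16000):
    """Generate a sine wave as a stand-in for one clean speech track."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def mix_cocktail(clean_tracks, noise, noise_gain=0.3):
    """Sum the clean tracks and add scaled noise, sample by sample."""
    n = min(len(noise), *(len(track) for track in clean_tracks))
    return [sum(track[i] for track in clean_tracks) + noise_gain * noise[i]
            for i in range(n)]


rng = random.Random(0)  # fixed seed so the example is repeatable
speaker_a = tone(220.0, 1600)
speaker_b = tone(330.0, 1600)
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]

mixture = mix_cocktail([speaker_a, speaker_b], noise)

# The mixture is the model input; the clean tracks are the known targets.
training_example = {"input": mixture, "targets": [speaker_a, speaker_b]}
```

Because the mixture is built from known parts, a separation model's output can be compared directly against the original clean tracks during training.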

“We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.”