
Google Deep Learning Model Separates Individual Voices from a Video

Google has developed a way for AI to separate noise within a video and single out individual voices in a crowd through deep learning processes.


We often write on these pages about breakthroughs in voice technology. For example, earlier this month we discussed a Microsoft technology that can better understand human speech patterns for improved AI conversations. However, Microsoft's rival is also a leader in voice recognition technology.

To demonstrate that, the company has detailed a new research project that isolates the voice of a single person in a video, even when that person is speaking in a crowd or over background noise. Using artificial intelligence, the model computationally produces a video in which the chosen speaker's speech is enhanced.

The model combines the audio track with visual signals from each speaker, such as mouth movement, to work out which sounds belong to whom. In doing so, Google replicates the human ability to focus on a single sound source even when others are present in the environment.

Google announced its breakthrough in a blog post today. The company says researchers started with 100,000 high-quality videos from YouTube and compiled 2,000 hours of footage of people speaking to the camera with no audio distractions.

“Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The input to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video's soundtrack.

During training, the network learns (separate) encodings for the visual and auditory signals, then it fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker.”
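To make the idea of a time-frequency mask more concrete, here is a minimal Python sketch. It is not Google's code: the toy "spectrograms" are small nested lists, the function names `ratio_masks` and `apply_mask` are hypothetical, and the masks are computed from known sources rather than predicted by a neural network. It also assumes, for simplicity, that the mixture's magnitude is the plain sum of the speakers' magnitudes.

```python
# Illustrative sketch only: toy magnitude "spectrograms" as nested lists
# (rows = time frames, columns = frequency bins). All names are hypothetical.

def ratio_masks(sources, eps=1e-8):
    """Compute one ratio mask per source from known source magnitudes.

    In the model described above a network predicts the masks; here they
    are derived from ground truth purely to show what a mask is.
    """
    n_frames = len(sources[0])
    n_bins = len(sources[0][0])
    masks = []
    for src in sources:
        mask = [
            [src[t][f] / (sum(s[t][f] for s in sources) + eps)
             for f in range(n_bins)]
            for t in range(n_frames)
        ]
        masks.append(mask)
    return masks


def apply_mask(mixture, mask):
    """Multiply a mixture spectrogram by a mask, element by element."""
    return [[m * w for m, w in zip(mix_row, mask_row)]
            for mix_row, mask_row in zip(mixture, mask)]


# Two toy speakers that mostly occupy different frequency bins.
speaker_a = [[4.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
speaker_b = [[0.0, 2.0, 1.0], [0.0, 5.0, 0.0]]

# Toy assumption: the mixture magnitude is the sum of the source magnitudes.
mixture = [[a + b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(speaker_a, speaker_b)]

masks = ratio_masks([speaker_a, speaker_b])
estimate_a = apply_mask(mixture, masks[0])
estimate_b = apply_mask(mixture, masks[1])
```

Each mask value lies between 0 and 1 and says how much of a given time-frequency cell belongs to one speaker. Multiplying the mixture by a speaker's mask keeps that speaker's energy and suppresses the rest.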


With this clean footage, Google could create “synthetic cocktail parties” to train the AI to separate the audio of each speaker from the clutter. Users would only need to select a face in the video to hear what that person is saying.
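As a rough illustration of how a synthetic mixture can serve as training data, the Python sketch below sums two clean signals and some noise, keeping the clean signals as the targets a model would learn to recover. Everything here is an assumption made for illustration: sine waves stand in for clean speech clips, uniform random noise stands in for background sound, and the names `tone`, `mix` and `noise_gain` are hypothetical.

```python
import math
import random


def tone(freq_hz, n_samples, sample_rate=16000, amplitude=0.5):
    """Stand-in for a clean speech clip: a plain sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def mix(clean_clips, noise, noise_gain=0.3):
    """Sum the clean clips and the scaled noise, sample by sample."""
    length = min([len(noise)] + [len(clip) for clip in clean_clips])
    return [sum(clip[n] for clip in clean_clips) + noise_gain * noise[n]
            for n in range(length)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 1600        # 0.1 seconds at the assumed 16 kHz sample rate

speaker_a = tone(220.0, n_samples)
speaker_b = tone(330.0, n_samples)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

mixture = mix([speaker_a, speaker_b], noise)

# The mixture is the model input; the clean clips are the known answers.
training_example = {"input": mixture, "targets": [speaker_a, speaker_b]}
```

Because the mixture is built from clips that were clean to begin with, the correct separated output is known in advance, which is what makes this kind of data usable for supervised training.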

“We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.”

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.
