
Microsoft’s FastSpeech Vastly Improves Text-to-Speech Technology

Microsoft Research and Zhejiang University have detailed FastSpeech, a text-to-speech technology that makes speech synthesis faster and more robust.


Microsoft is working on developing its text-to-speech models and is aiming to overcome numerous challenges related to the technology. In a collaboration between Microsoft Research and Zhejiang University, the company has developed FastSpeech, which leverages machine learning to improve text-to-speech performance.

Current models, such as the text-to-speech used in Cortana, generate speech one step at a time, which makes producing humanlike voice slow. They also have limitations such as skipping or repeating words in synthesized speech. That's because these autoregressive models generate the mel-spectrogram frame by frame, so errors can propagate from one frame to the next.

If you're unfamiliar with mel-spectrograms, a mel-spectrogram is a representation of a sound's power over time, with frequencies mapped onto the mel scale to match human pitch perception. Microsoft's FastSpeech aims to solve how slowly mel-spectrograms are generated.
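To make the mel scale concrete, here is a minimal sketch of the standard hertz-to-mel conversion formula (the HTK variant); the function name is our own, not from the FastSpeech paper:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to the mel scale (HTK formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The mel scale compresses high frequencies, mirroring human hearing:
print(round(hz_to_mel(1000)))  # roughly 1000 mel near 1 kHz
print(round(hz_to_mel(8000)))  # far less than 8x the mel value of 1 kHz
```

A mel-spectrogram applies this mapping to the short-time power spectrum of audio, which is why it is a natural intermediate target for speech synthesis models.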

Described in a paper, “FastSpeech: Fast, Robust and Controllable Text to Speech”, the technology uses a feed-forward Transformer architecture that generates all mel-spectrogram frames in parallel rather than one at a time. With it, mel-spectrogram creation is 270 times faster than previous autoregressive text-to-speech models, and end-to-end voice generation is 38 times faster.
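The following toy sketch (not the real models) illustrates why parallel generation is faster: an autoregressive decoder must run one sequential step per frame, while a feed-forward model can produce every frame in a single pass with no step-to-step dependency:

```python
def autoregressive_decode(n_frames):
    """Each frame depends on the previous one: n sequential steps."""
    frames = []
    prev = 0.0
    for _ in range(n_frames):
        prev = prev + 1.0          # stand-in for one decoder step
        frames.append(prev)
    return frames

def parallel_decode(n_frames):
    """FastSpeech-style: all frames computed independently in one pass."""
    return [float(i + 1) for i in range(n_frames)]

# Same output, but the parallel version has no sequential dependency,
# so on real hardware all frames can be computed at once.
print(autoregressive_decode(5) == parallel_decode(5))  # True
```

Removing the sequential dependency is also what eliminates error propagation: no frame is conditioned on a possibly-wrong previous frame.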


In the Microsoft Research blog, Microsoft describes the following advantages of FastSpeech:

  • fast: FastSpeech speeds up the mel-spectrogram generation by 270 times and voice generation by 38 times.
  • robust: FastSpeech avoids the issues of error propagation and wrong attention alignments, and thus nearly eliminates word skipping and repeating.
  • controllable: FastSpeech can adjust the voice speed smoothly and control the word break.
  • high quality: FastSpeech achieves comparable voice quality to previous autoregressive models (such as Tacotron 2 and Transformer TTS).

To test these performance gains, the team used the LJ Speech dataset as a testing ground. The dataset contains 13,100 short English audio clips paired with text transcripts. “We randomly split the dataset into three sets: 12,500 samples for training, 300 samples for validation, and 300 samples for testing.”

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.
