Microsoft is working to advance its text-to-speech models and overcome several longstanding challenges in the technology. In a collaboration between Microsoft Research and Zhejiang University, the company has developed FastSpeech, software that leverages machine learning to improve text-to-speech performance.

Current models, such as the text-to-speech system used in Cortana, can create only short snippets of humanlike voice. They also have limitations, such as skipping words in synthesized speech. That's because current models suffer from slow mel-spectrogram generation.

If you’re unfamiliar with mel-spectrograms, a mel-spectrogram is a representation of a sound’s power across time and frequency, with frequencies mapped onto the mel scale, which approximates human pitch perception. Microsoft’s FastSpeech aims to speed up mel-spectrogram generation.
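To make the mel scale concrete, here is a minimal Python sketch (not part of FastSpeech) of the Hz-to-mel conversion that underlies a mel-spectrogram, using the common HTK-style formula:

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Invert the mapping back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The mel scale is roughly linear below 1 kHz and logarithmic above,
# mirroring how humans perceive pitch differences.
print(round(hz_to_mel(1000)))            # ~1000 mels by design
print(round(mel_to_hz(hz_to_mel(440))))  # round-trips back to 440 Hz
```

A full mel-spectrogram applies a bank of triangular filters spaced evenly on this scale to a short-time power spectrum; the conversion above is what makes that spacing perceptually motivated.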

Described in the paper “FastSpeech: Fast, Robust and Controllable Text to Speech”, the technology boasts a specialized architecture. With it, mel-spectrogram generation is 270 times faster than in previous text-to-speech models, and end-to-end voice generation is 38 times faster.


In the Microsoft Research blog, Microsoft describes the following advantages of FastSpeech:

  • fast: FastSpeech speeds up the mel-spectrogram generation by 270 times and voice generation by 38 times.
  • robust: FastSpeech avoids the issues of error propagation and wrong attention alignments, and thus nearly eliminates word skipping and repeating.
  • controllable: FastSpeech can adjust the voice speed smoothly and control the word break.
  • high quality: FastSpeech achieves comparable voice quality to previous autoregressive models (such as Tacotron 2 and Transformer TTS).
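The speed-control point above can be illustrated with a short sketch. In the FastSpeech paper, this control comes from a “length regulator” that repeats each phoneme’s hidden representation according to a predicted duration; scaling those durations changes the speaking rate. The sketch below is a simplified illustration of that idea, not Microsoft’s implementation — the function name and list-based representation are our own:

```python
def length_regulate(phoneme_states, durations, alpha=1.0):
    """Expand per-phoneme states into mel frames by their (scaled) durations.

    phoneme_states: list of per-phoneme feature representations.
    durations: predicted number of mel frames per phoneme.
    alpha: speed factor; alpha < 1.0 speeds speech up, alpha > 1.0 slows it down.
    """
    expanded = []
    for state, d in zip(phoneme_states, durations):
        frames = max(1, round(d * alpha))  # keep at least one frame per phoneme
        expanded.extend([state] * frames)
    return expanded

states = ["HH", "EH", "L", "OW"]  # toy phoneme representations
durations = [2, 3, 1, 4]          # toy predicted frame counts

print(len(length_regulate(states, durations)))             # 10 frames at normal speed
print(len(length_regulate(states, durations, alpha=2.0)))  # 20 frames: half speed
```

Because the expansion is smooth in `alpha`, the voice speed can be adjusted continuously, and inserting extra frames between words gives control over word breaks.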

To test these performance gains, the team used the LJ Speech dataset as a testing ground. It contains 13,100 English audio clips with corresponding text transcripts. As the researchers put it: “We randomly split the dataset into three sets: 12,500 samples for training, 300 samples for validation, and 300 samples for testing.”
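A split like the one quoted is straightforward to reproduce in principle. Here is a minimal sketch; the fixed seed and index-based representation are our own assumptions, not details from the paper:

```python
import random

def split_dataset(n_samples, n_train, n_val, n_test, seed=0):
    """Randomly partition sample indices into train/validation/test sets."""
    assert n_train + n_val + n_test == n_samples
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

# LJ Speech: 13,100 clips -> 12,500 train / 300 validation / 300 test
train, val, test = split_dataset(13_100, 12_500, 300, 300)
print(len(train), len(val), len(test))  # 12500 300 300
```

Shuffling before slicing ensures the three sets are disjoint and drawn uniformly at random, matching the “randomly split” procedure the researchers describe.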