Microsoft’s Neural Text-to-Speech (Neural TTS) is reaching a new milestone as it moves to a new version on Microsoft Azure. According to the company, Neural TTS Uni-TTSv4 – the newest version of the platform – delivers speech on par with natural human recordings at the sentence level.

If you are unfamiliar with Neural TTS, it first launched three years ago. Even then, Microsoft claimed it came “close to human-parity” when converting text to speech. In other words, the platform provides spoken audio playback of text that sounds as natural as possible.

Microsoft has been improving Neural Text-to-Speech since then for its Azure cloud platform. While you may not use the tool directly, it is baked into plenty of Microsoft products you probably do use, such as Word’s Read Aloud feature, Immersive Reader in Edge, and more. Plenty of Microsoft partners have also adopted Neural TTS.


With Uni-TTSv4, Microsoft is shipping an improved version of the platform to those partners and services, which means those features will perform even better. You will still be able to choose from a set of pre-built voices or record your own sample.

However, while Neural TTS supports more than 110 languages, the Uni-TTSv4 upgrade is currently available for only eight voices, listed in the results table below.

Microsoft says the other languages, as well as custom voices, will get the update soon. Users do not need to do anything, because the tool will update automatically in Microsoft Office and Microsoft Edge.

Tests

In a blog post announcing the new version, Microsoft explains how it measures text-to-speech to ensure Neural TTS delivers the best quality. All TTS models are measured by Mean Opinion Score (MOS), a widely used method for rating perceived speech quality.

“For MOS studies, participants rate speech characteristics for both recordings of peoples’ voices and TTS voices on a five-point scale,” Microsoft explains.

“These characteristics include sound quality, pronunciation, speaking rate, and articulation. For any model improvement, we first conduct a side-by-side comparative MOS test (CMOS) with production models. Then, we do a blind MOS test on the held-out recording set (recordings not used in training) and the TTS-synthesized audio and measure the difference between the two MOS scores.”
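A MOS figure with the ±95% confidence interval shown in the results table below can be computed from individual raters’ five-point scores. This is a rough illustration only, not Microsoft’s actual tooling, and the ratings are invented:

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with a 95% confidence interval.

    ratings: list of 1-5 scores from individual raters.
    Returns (mean, half_width) so the result reads mean ± half_width.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Hypothetical ratings from 10 listeners for one synthesized clip.
scores = [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]
mean, ci = mos_with_ci(scores)
print(f"MOS = {mean:.2f} (±{ci:.2f})")  # prints "MOS = 4.20 (±0.39)"
```

In practice the intervals in Microsoft’s table are far tighter (±0.03 to ±0.06) because many more ratings are pooled per voice.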

Furthermore, Microsoft submitted the Uni-TTSv4 model to the Blizzard Challenge 2021, a popular TTS benchmark, allowing for scaled-up MOS tests across multiple systems. Microsoft says the results show the new voice model has “no significant difference from natural speech on common dataset.”

Below are the test results showing how the eight available Uni-TTSv4 voices compare to human recordings and the current production models:

Locale (voice)     Human recording (MOS)   Uni-TTSv4 (MOS)   Wilcoxon p-value   CMOS vs. PROD
En-US (Jenny)      4.33 (±0.04)            4.29 (±0.04)      0.266              +0.116
En-US (Sara)       4.16 (±0.05)            4.12 (±0.05)      0.41               +0.129
Zh-CN (Xiaoxiao)   4.54 (±0.05)            4.51 (±0.05)      0.44               +0.181
It-IT (Elsa)       4.59 (±0.04)            4.58 (±0.03)      0.34               +0.25
Ja-JP (Nanami)     4.44 (±0.04)            4.37 (±0.05)      0.053              +0.19
Ko-KR (Sun-hi)     4.24 (±0.06)            4.15 (±0.06)      0.11               +0.097
Es-ES (Alvaro)     4.36 (±0.05)            4.33 (±0.04)      0.312              +0.18
Es-MX (Dalia)      4.45 (±0.05)            4.39 (±0.05)      0.103              +0.076
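The Wilcoxon p-values above come from a signed-rank test on paired listener ratings: a value above 0.05 means the MOS difference between the human recording and the TTS voice is not statistically significant. The stdlib-only sketch below shows the idea; in practice a library routine such as scipy.stats.wilcoxon would be used, and the rating data here is invented:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    x, y: paired samples (e.g. per-listener scores for a human recording
    vs. the TTS voice). Zero differences are dropped; tied absolute
    differences receive average ranks. Returns (W+, p_value).
    Note: the normal approximation is only reasonable for larger samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank absolute differences, averaging ranks across ties.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Null distribution of W+ approximated by a normal.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Hypothetical paired scores from 12 listeners: human vs. TTS.
human = [5, 4, 5, 4, 4, 5, 3, 5, 4, 4, 5, 4]
tts   = [4, 4, 5, 3, 4, 4, 4, 5, 4, 3, 4, 4]
w, p = wilcoxon_signed_rank(human, tts)
print(f"W+ = {w}, p = {p:.3f}")
```

With this toy data p comes out well above 0.05, i.e. no significant difference, which mirrors how Microsoft reads the table above.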

What Does This Mean?

All that testing and technical improvement is nice, but what does it mean in the real world? Microsoft’s incremental updates over recent years have brought the model steadily closer to the realism of human speech.

However, the company admits there is still room for improvement, especially when users listen to TTS for a long time. In these scenarios, Microsoft points out, the pitch and tone of the voice will lose some quality.

This is because human speech is incredibly nuanced and full of dynamic and almost constant slight shifts in pitch and tone.

“Currently it is not very efficient for those parameters to model all the coarse-grained and fine-grained details on the acoustic spectrum of human speech. TTS is also a typical one-to-many mapping problem where there could be multiple varying speech outputs (for example, pitch, duration, speaker, prosody, style, and others) for a given text input. Thus, modeling such variation information is important to improve the expressiveness and naturalness of synthesized speech.”

Uni-TTSv4 tackles these limitations with two changes to the way it models acoustics. A new architecture built on transformer models improves the acoustic modeling itself, while variation is now handled by a model that separates explicit perspectives (speaker ID, language ID, pitch, and duration) from implicit perspectives (utterance-level and phoneme-level prosody).
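The explicit/implicit split can be pictured as two groups of conditioning inputs fed to the acoustic model. The sketch below is purely illustrative; the class and field names are assumptions for this article, not Microsoft’s implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExplicitVariation:
    # Variation the system can label directly at synthesis time.
    speaker_id: int
    language_id: int
    pitch: List[float]    # hypothetical per-phoneme pitch targets
    duration: List[int]   # hypothetical per-phoneme durations in frames

@dataclass
class ImplicitVariation:
    # Prosody representations learned from data rather than labeled.
    utterance_prosody: List[float]      # one vector per utterance
    phoneme_prosody: List[List[float]]  # one vector per phoneme

def conditioning_inputs(explicit: ExplicitVariation,
                        implicit: ImplicitVariation) -> dict:
    """Bundle both perspectives for an acoustic model's forward pass."""
    return {
        "speaker_id": explicit.speaker_id,
        "language_id": explicit.language_id,
        "pitch": explicit.pitch,
        "duration": explicit.duration,
        "utterance_prosody": implicit.utterance_prosody,
        "phoneme_prosody": implicit.phoneme_prosody,
    }
```

The point of the separation is that labeled factors (who is speaking, in what language, how fast) can be controlled directly, while the harder-to-label prosodic detail is modeled at both the utterance and phoneme level.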

