IBM beats Microsoft’s Speech Recognition accuracy record

Last October, Microsoft revealed that its Speech Recognition technology achieved a 5.9% word error rate (WER), setting a new world record. Now, IBM has managed to break that record, announcing in a blog post that it has achieved a 5.5% WER.

Word error rate is a common metric for the performance of a speech recognition or machine translation system. Measuring performance is difficult because the recognized word sequence can have a different length from the reference word sequence, i.e. the one assumed to be correct.
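To make the metric concrete, here is a minimal sketch of how WER is commonly computed: align the two word sequences with a word-level edit distance (Levenshtein distance), then divide the number of substitutions, deletions and insertions by the number of words in the reference. The function name word_error_rate and the whitespace tokenization are assumptions made for illustration. This is not the scoring setup behind either company's reported figures, which would also involve text normalization rules not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    # Assumption: words are separated by whitespace; no normalization is done.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"{wer:.1%}")
```

In this example the hypothesis drops one of six reference words, so the WER is 1/6, or about 16.7%. Because insertions count as errors too, the alignment handles hypotheses that are longer or shorter than the reference, which is exactly the length mismatch described above.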

How IBM did it

According to IBM’s blog post, the company has reached 5.5% word error rate by combining Long Short-Term Memory (LSTM) and WaveNet language models.

LSTM is a recurrent neural network architecture (a type of artificial neural network) that, given enough network units, can in principle compute anything a conventional computer can compute. An LSTM network is well suited to learning from experience to classify, process and predict time series.

WaveNet is a deep generative model of raw audio waveforms created by DeepMind Technologies. WaveNet is able to generate speech which mimics any human voice. According to DeepMind, WaveNet’s speech sounds more natural than the best existing Text-to-Speech systems.

Microsoft vs. IBM Speech Recognition

The competition between Microsoft and IBM in the Speech Recognition field is a long-standing one. Both companies have made impressive breakthroughs in Speech Recognition, beating each other’s world records over the last several months.

Back in September 2016, Microsoft announced it had achieved a 6.3% word error rate, beating IBM’s 6.9% WER. As mentioned above, in October Microsoft beat its own world record with a 5.9% word error rate, and now IBM has claimed the world record once more.

The “human parity” debate

Although both Microsoft and IBM compete for the best WER, the companies’ views on what counts as human parity differ. Reaching human parity, meaning an error rate on par with that of two humans speaking, has always been the ultimate industry goal.

Back in October, when Microsoft achieved a 5.9% WER, the company claimed to have reached human parity in conversational speech recognition. “We’ve reached human parity,” said Xuedong Huang, Microsoft’s chief speech scientist. “This is a historic achievement,” he added.

However, IBM claims to have determined that human parity is lower than what anyone has yet achieved, at 5.1% WER. George Saon, an IBM Principal Research Scientist, says in the announcement blog post that “Others in the industry are chasing this milestone alongside [IBM], and some have recently claimed reaching 5.9 percent as equivalent to human parity…but [IBM is] not popping the champagne yet.”