It was only a month ago that Microsoft's speech recognition technology recorded the lowest error rate ever. Since then, Microsoft has improved it further, and the service now has a word error rate of 5.9%, down from 6.3%.
To put this into perspective, that matches the error rate of a trained transcriptionist. Studies show that untrained humans have an error rate of 11% on average. According to chief speech scientist Xuedong Huang, it's a pretty big deal.
“We've reached human parity,” said Huang. “This is a historic achievement.”
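For readers curious how a figure like 5.9% is arrived at, the sketch below shows one common way to compute a word error rate: count the word substitutions, insertions and deletions needed to turn the system's transcript into a reference transcript, then divide by the number of reference words. This is a minimal illustration, not Microsoft's evaluation code; the function name `word_error_rate` and the sample sentences are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word
# out of four reference words.
rate = word_error_rate("the quick brown fox", "the fast brown")
print(f"{rate:.1%}")
```

On this measure, an error rate of 5.9% corresponds to roughly six wrong, missing or extra words for every hundred words spoken.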
This is the first time in history that a computer has been able to understand conversation as well as humans can. It's a goal that seemed outlandish just a few years ago.
“Even five years ago, I wouldn't have thought we could have achieved this. I just wouldn't have thought it would be possible,” said Harry Shum, executive vice president, Microsoft Artificial Intelligence and Research group. “This will make Cortana more powerful, making a truly intelligent assistant possible.”
The breakthrough could also result in improvements across Xbox, translation, audio description and more.
Much of Microsoft's success comes from its Computational Network Toolkit. The deep learning system is available to all on GitHub and allows for fast processing of deep learning algorithms across multiple computers. A combination of this and specialized GPU chips let the team carry out its research much more quickly.
According to Huang, the team worked through the night once a breakthrough became apparent. They noticed that the error rate could be reduced by representing words as continuous vectors in space. This let the computer recognize that words such as “fast” and “quick” are similar.
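The idea of words as continuous vectors can be illustrated with a toy example. In the sketch below, each word is a short list of numbers, and the similarity of two words is the cosine of the angle between their vectors. The three-dimensional vectors are invented for illustration only; real systems learn vectors with hundreds of dimensions from data, and nothing here reflects the values or code Microsoft's team used.

```python
import math

# Hypothetical hand-picked vectors, not learned from any data.
EMBEDDINGS = {
    "fast": [0.90, 0.80, 0.10],
    "quick": [0.85, 0.75, 0.20],
    "table": [0.10, 0.20, 0.95],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similar = cosine_similarity(EMBEDDINGS["fast"], EMBEDDINGS["quick"])
unrelated = cosine_similarity(EMBEDDINGS["fast"], EMBEDDINGS["table"])
print(f"fast vs quick: {similar:.2f}")
print(f"fast vs table: {unrelated:.2f}")
```

Because “fast” and “quick” point in nearly the same direction, their similarity score comes out much higher than that of “fast” and “table”, which is the kind of relationship a recognizer can exploit when it has to choose between candidate words.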
Huang only became aware of the milestone through a post made on an internal social network at 3:30 a.m. The speech scientist has been working in speech recognition for over three decades, and described it as “a dream come true.”
Despite the achievement, Microsoft says there's still a lot of work to be done. The rate came from a controlled environment, with relatively little background noise. For the technology to take off, it needs to be able to detect speech accurately in louder environments.
The team is currently implementing a way to assign names to speakers in a conversation and working to ensure the system handles different accents and voice tones.
Of course, despite the label of “understanding” speech, the technology only recognizes it. The computer is writing down what it hears by decoding audio signals. The next step is for the computer to process sentences and reply or take action.
According to Shum, it's a process that could still take many years.