How Did Speech-Reco Get So Much Better So Quickly?

By: Mike Dwyer, Vice President Research & Development, CallMiner

You may have noticed that the speech recognition in your phone (Siri), your car (Ford Sync, BMW DragonDrive), or your home (Alexa) has gotten remarkably better in the past 18 months (or you may still not be using it because of that one time it mistakenly texted your wife “you look ugly too right here” instead of “you look lovely tonight dear”). Either way, keep giving it a try – because that feedback is what’s allowing us to make great strides in speech recognition accuracy.

Talk into Your Device

Every day, millions of people talk to their devices, providing feedback and training material that large companies like Nuance, Google, IBM, and Microsoft use to build bigger and better voice models. Every time you ask Siri to find the nearest pizza place and then follow her driving directions to get there, you have just confirmed to Siri that you said the word “pizza”.

Define the Hidden Differences

This explosion of training data, the introduction of GPU-based parallel processing, and ever-larger server farms have allowed companies to cost-effectively build and train acoustic models and language models with much more confidence than ever before – letting the data itself define the hidden differences between an “O” and an “A” sound, rather than relying on the directed approach of a GMM (Gaussian Mixture Model). These models can then recursively train themselves by creating larger and larger datasets of transcribed audio.
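
To make the contrast concrete, here is a minimal sketch (not CallMiner’s actual pipeline) of the “directed” GMM approach: one Gaussian mixture is fit per sound class on acoustic features, and a new audio frame is classified by whichever model assigns it the higher likelihood. The 2-D synthetic “formant” features and cluster locations below are invented for illustration.

```python
# Directed GMM classification sketch: fit one mixture model per phoneme
# class, then pick the class whose model best explains a new frame.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 2-D features standing in for real acoustic features (MFCCs):
# /o/-like frames cluster in one region, /a/-like frames in another.
o_frames = rng.normal(loc=[500.0, 900.0], scale=60.0, size=(200, 2))
a_frames = rng.normal(loc=[750.0, 1200.0], scale=60.0, size=(200, 2))

# One hand-directed mixture model per sound - the step the article says
# data-driven deep models have largely replaced.
gmm_o = GaussianMixture(n_components=2, random_state=0).fit(o_frames)
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(a_frames)

def classify(frame):
    """Return the sound whose GMM gives the frame higher log-likelihood."""
    frame = np.atleast_2d(frame)
    return "O" if gmm_o.score(frame) > gmm_a.score(frame) else "A"

print(classify([510, 910]))   # a frame near the /o/ cluster
print(classify([760, 1210]))  # a frame near the /a/ cluster
```

The directed part is that a human decided the classes and the model family up front; the data-driven alternative lets a network discover those distinctions from the training audio itself.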

Close the Gap

The cycle of increased accuracy only quickens the training loop, as more and more people use their voice-enabled devices every day. It is now quicker to dictate to your phone than to type, and accuracy is approaching the point of being indistinguishable from human recognition. Microsoft’s latest engine achieved a 6.3% error rate versus a human rate of only 4% – down from a 15% error rate in 2004, and a 43% error rate in 1995. So the gap has quickly closed, to the point that accuracy is no longer the limiting factor in speech adoption – “understanding” is.
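
The figures above are word error rates (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the recognizer’s output into the reference transcript, divided by the number of reference words. A minimal implementation, using the article’s own mis-texted example:

```python
# Word error rate via classic dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first j hypothesis words
    # into the first i reference words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# "you look" survived; the rest took 4 edits over 5 reference words.
print(wer("you look lovely tonight dear",
          "you look ugly too right here"))  # 0.8
```

By this metric, Microsoft’s 6.3% engine gets roughly one word in sixteen wrong, versus one in twenty-five for a human transcriber.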

Final Thoughts

Speech recognition has come a long way in a short time. The explosive adoption of consumer devices has led to exponential improvements in the speed and accuracy of the speech recognition engines used in speech analytics and other applications.
