If you believe the vendors, speech transcription has made significant strides and now provides human-level accuracy. I’ve evaluated the major speech analytics systems for years now and I’ve yet to see any system that consistently performs at human level. However, in most cases you don’t need human-level accuracy to provide a lot of value. Let’s take a look.
My name is a chat heart of you might be familiar with Dave from a brand or if you’re, a web or to see people I’ve done about five years, I’m or so a of an independent analyst. So I’m mostly do park management strategy type. For a product, marketing.
That is a real transcription from an unnamed phone transcription vendor where I introduced myself on a call. Without context, what can we extract from this?
My name is “chat heart” — It’s actually Chad Hart.
“you might be familiar with Dave” — I guess I know someone named Dave?
“I’m or so a of an independent analyst” — something to do with an independent analyst?
“I’m mostly do park management strategy” — I strategize about park managers?
What did I actually say?
My name is Chad Hart. You might be familiar with me from a brand — if you are WebRTC people; I’ve done webrtcHacks now for about five years or so. Outside of webrtcHacks, I have been an independent analyst. I mostly do product management and strategy type work and product marketing.
So how did we do? Not so good. It turns out I don’t know Dave and I don’t deal with park management.
The standard measure for transcription accuracy is word error rate (WER). The lower the value, the better and more accurate the transcription. The WER for this example is 37% — a far cry from the approximate 5% error rates mentioned by most transcription vendors these days.
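For the curious, WER is typically computed as the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the engine’s hypothesis, divided by the number of reference words. A minimal sketch, using a snippet of my intro as the reference:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance (substitutions + insertions +
    deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("my name is chad hart",
                      "my name is a chat heart"))  # → 0.6
```

Note that WER can exceed 100% when the engine inserts more words than the reference contains, and a single mangled name counts the same as a missed “the” — one reason the raw number doesn’t tell the whole story.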
To be fair, the transcription of this section was worse than average. I was also fumbling around, searching for my notes, so my speech had a lot of disfluencies — ahs, ums, and restarts. This makes understanding what I said more difficult, even for a human. Not a good performance on my part, but this is something that realistically happens to us all. How often do we rehearse our lines before a business call? Reading exactly from a script helps to improve diction but hurts the authenticity of the conversation.
Languages and accents matter. Transcribing the right language is an obvious one. Less obvious, but highly relevant, is handling dialects. Most speech transcription vendors have dialect options that tune their results. In the case above I discovered the dialect was set to Australia for some reason. Right, mate — that’s the good oil on why I got big mobs of bodgy results.
In addition, most modern transcription engines let you define a custom vocabulary. These are words that aren’t common in the dictionary or that might be misspelled. Every industry has its own jargon that sounds foreign to those who aren’t familiar. When the transcription engine finds something that sounds like a custom vocabulary word it will likely use that identified word. In my case this would include my last name “Hart” instead of “heart” and words like “WebRTC” that don’t show up in non-telecom dictionaries.
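To illustrate, here is a sketch of what a recognition request with dialect and custom-vocabulary settings might look like. The field names are modeled loosely on popular cloud speech APIs and will vary by vendor, so treat this as illustrative rather than any specific product’s schema:

```python
# Illustrative recognition request; exact field names vary by vendor.
recognition_config = {
    "language_code": "en-US",  # pick the right language *and* dialect
    "speech_contexts": [{
        # Phrase hints bias the engine toward names and industry jargon
        # it would otherwise mis-hear ("Hart" vs. "heart")
        "phrases": ["Chad Hart", "WebRTC", "webrtcHacks"],
        "boost": 15.0,  # how strongly to favor these phrases
    }],
}
```

Getting these two settings right — dialect and vocabulary — is often the cheapest accuracy improvement available, since it requires no model training at all.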
Does Accuracy Always Matter?
You always need some level of accuracy, but the precision really depends on the application. If you need a verbatim transcription of a call then I think you’ll be hard pressed to find any product that fits your needs all of the time today. Human transcription won’t give you verbatim transcription all the time, either. However, if you just need to get the gist of a conversation to jog your memory, then getting every word right matters a whole lot less.
In many cases, a smaller number of keywords that matter can help navigate the conversation. Transcribing these keywords correctly is extra important, so the engines are tuned for identifying them. Higher-level summaries, especially ones based on words the engine was confident about, can also help to obscure word-by-word misses. This is the model vendors like Voicera, Impelo, and Dialpad (with its VoiceAI) follow.
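Most engines return a per-word confidence score alongside the transcript, which makes this kind of filtering straightforward. A minimal sketch, using hypothetical engine output shaped like my garbled intro:

```python
# Hypothetical engine output: (word, confidence) pairs
words = [("my", 0.98), ("name", 0.97), ("is", 0.95),
         ("a", 0.41), ("chat", 0.38), ("heart", 0.44),
         ("you", 0.92), ("might", 0.90), ("be", 0.93),
         ("familiar", 0.88)]

CONFIDENCE_FLOOR = 0.8  # drop words the engine itself doubted

confident = [w for w, c in words if c >= CONFIDENCE_FLOOR]
print(" ".join(confident))  # the low-confidence garble falls away
```

A summary or keyword index built only from the confident words sidesteps the worst transcription errors, at the cost of dropping some genuine content.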
Other systems don’t emphasize or even expose the word-by-word transcript at all. Analytics systems that work on top of a large database of transcriptions look for patterns and anomalies across an organization. At these higher levels, consistency in the data can mean more than the individual word accuracy.
In my example above, key phrases like “my name is” are enough to indicate I was making an introduction. The system could then compare the amount of time I spend on introductions to the amount of time others spend on introductions to see if there’s a pattern. Maybe a short intro on my end but a longer intro on the callee’s end makes for a better call. Maybe it will tell me a smoother introduction would have resulted in a shorter, but just as effective, call. I could use data like this to improve my next call.
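A rough sketch of how such a system might measure introduction length from word timestamps. The transcript data, the trigger phrase, and the pause-based segment boundary are all my own illustrative assumptions, not any vendor’s actual method:

```python
# Hypothetical word timings: (word, start_seconds) from a transcript
transcript = [("hi", 0.0), ("my", 1.2), ("name", 1.5), ("is", 1.7),
              ("chad", 1.9), ("hart", 2.3), ("so", 14.0),
              ("about", 14.4), ("pricing", 14.8)]
CALL_LENGTH = 300.0  # total call duration in seconds (assumed)

def find_phrase(words, phrase):
    """Return the index where the word sequence `phrase` starts, or -1."""
    tokens = [w for w, _ in words]
    for i in range(len(tokens) - len(phrase) + 1):
        if tuple(tokens[i:i + len(phrase)]) == phrase:
            return i
    return -1

def intro_seconds(words, pause=5.0):
    """Crudely treat the intro as everything up to the first long
    pause after 'my name is'. Returns 0.0 if no intro is found."""
    start = find_phrase(words, ("my", "name", "is"))
    if start == -1:
        return 0.0
    for (w1, t1), (w2, t2) in zip(words, words[1:]):
        if t1 >= words[start][1] and t2 - t1 > pause:
            return t1
    return words[-1][1]

share = intro_seconds(transcript) / CALL_LENGTH
print(f"intro: {intro_seconds(transcript):.1f}s ({share:.1%} of call)")
```

Aggregated across hundreds of calls, even this crude a metric can surface patterns — say, that calls with intros under ten seconds close faster — without ever needing a verbatim transcript.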
Chorus.ai, Ether, ExecVision, Gong.io, i2x, and RingDNA have built systems that specialize in doing this for sales teams. They can help organizations tune their scripts for increased results as measured by metrics like time to sale, funnel progression, sales volume, etc. They justify their prices by the increased sales return they provide.
Call center applications are even more abundant. There is a much longer list of speech analytics vendors there. Aspect, Genesys, CallMiner, Nice, Nuance, and Verint are just a few of the traditional vendors, with smaller startups like Batvoice, Gridspace, RankMiner, SpeechMatics, and VoiceBase, plus major new entrants like Amazon, Google, IBM Watson, and Twilio, looking to penetrate this market too. Systems here are as diverse as the many kinds of call centers. They can look for script adherence, identify patterns that improve customer satisfaction scores, mark phrases that shorten call times, note unexpected trends for marketers, and provide dashboards for use in management decision making, just to name a few common offerings.
Speech transcription has been around a long time, but it hadn’t been widely used because its performance was hard to justify against its cost. That’s starting to change with major improvements on both fronts and, as discussed above, the newer trend of providing actionable advice instead of perfectly accurate text strings.
Want to see more on the evolution of transcription and the use of speech analytics across various communications applications? Interested in how machine learning, computer vision, and voicebots fit into your strategy? Check out our report on AI in real-time communications or visit our AI in RTC event in San Francisco on Nov. 16.
This is the fourth piece in an ongoing series. See our previous posts:
And, come back for more on the impact of machine learning and AI on the communications community.