The Machine Learning “Okey Dokey” Hypothesis

We’re always looking for new ways to incorporate machine learning into speech analytics. While experimenting with some models from scikit-learn, an open-source Python toolkit, I noticed that some of them worked particularly well on our speech data. After building successful models for distinguishing sales calls from service calls, and the agent half of a transcript from the customer half, I decided to try something a bit more challenging. I ended up with some interesting findings that really show the power of machine learning.


The Question

I wanted to see whether I could train a model to determine which agent was on a call based only on the words they used.


The Experiment

For this task, I used a set of just over 3,500 transcripts that I had downloaded from CallMiner’s demo database using our API. These were transcribed using our Eureka Analytics platform based on our Mercury LVCSR. For this experiment, I only used the agent’s side of the conversation. In total, there were 74 agents represented in the dataset.


Each call was represented as a vector of 1s and 0s. Every word in the vocabulary had an index associated with it: a 1 at that index indicated that the word appeared in the agent’s side of the call, and a 0 indicated that it did not.
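
As a rough sketch of this representation, here is how such presence/absence vectors could be built with scikit-learn's CountVectorizer and binary=True. The post doesn't say which tool produced the vectors, so the vectorizer choice, its default tokenization, and the example transcripts are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical agent-side transcripts (not from the real dataset).
agent_sides = [
    "okey dokey let me pull up your account okey dokey",
    "thank you for calling i hope you have a good day",
    "absolutely i can help with that",
]

# binary=True records whether a word is present (1) or absent (0)
# instead of counting how many times it occurs.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(agent_sides)

vocab = vectorizer.vocabulary_  # maps each word to its column index
okey_col = vocab["okey"]

# "okey" appears twice in the first call but is still stored as 1.
print(X[0, okey_col], X[1, okey_col])
```

Note that the default tokenizer drops one-character words such as "i" and "a"; whether the original experiment kept them is not stated.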


A Suspiciously High Accuracy

I tried a few different models on this task, including Naïve Bayes, a Support Vector Machine, and a Convolutional Neural Network. However, what really stood out was the Logistic Regression model.
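
For readers who want to try something similar, the following is a minimal sketch of training a Logistic Regression model on binary word vectors and scoring it with F1. The toy transcripts, the train/test split, the hyperparameters, and the weighted averaging are assumptions; the post does not describe the actual training setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: three agents, each with a pet phrase.
pet_phrases = {
    "agent_a": "okey dokey",
    "agent_b": "absolutely",
    "agent_c": "i hope you have a good day",
}
texts, labels = [], []
for agent, phrase in pet_phrases.items():
    for i in range(20):
        texts.append(f"thank you for calling {phrase} order number {i}")
        labels.append(agent)

# Hold out a quarter of the calls, keeping agent proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Binary bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Averaging method is an assumption; the post only reports "F1".
score = f1_score(y_test, model.predict(X_test), average="weighted")
print(f"weighted F1: {score:.3f}")
```

The toy data is far easier than real transcripts, so the score here says nothing about what to expect on production calls.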


This model achieved an F1 score of 85.8%. For comparison, labeling each transcript as a random agent would get you an F1 of around 1 over the total number of agents, or 1.4%. Labeling every call as belonging to the agent with the most calls in the set would get you that agent’s number of calls divided by the total number of calls, or 5.1%. To get an F1 of 85.8%, there must be something significant in the data.
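
The two baselines can be checked with a few lines of code. This sketch assumes micro-averaged F1, which equals plain accuracy when every call has exactly one label; the post does not say which averaging was used. The 74 agents come from the dataset described above, while the ten-call label list is made up to show the majority-class calculation.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Random-guess baseline: one chance in 74 of naming the right agent.
n_agents = 74
random_baseline = 1 / n_agents
print(f"random baseline: {random_baseline:.1%}")  # 1.4%

# Majority-class baseline on hypothetical labels: the busiest agent
# handles 5 of 10 calls, so always guessing them scores 5 / 10.
y = ["agent_a"] * 5 + ["agent_b"] * 3 + ["agent_c"] * 2
X_dummy = [[0]] * len(y)  # features are ignored by DummyClassifier

majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y)
majority_f1 = f1_score(y, majority.predict(X_dummy), average="micro")
print(f"majority baseline: {majority_f1:.1%}")
```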


Investigation Part 1: Does This Work on Other Datasets?

The first thing I checked was whether this worked on other datasets I had collected from the same database. These datasets had varying numbers of calls, numbers of agents, and distributions of calls between agents. While the model worked on all of the datasets, there was a wide range of scores. So my next question was, what causes the model to work so much better on some datasets than others? Was it the data itself or something else?


Like many questions in machine learning, the answer was the amount of data. But it wasn’t the overall amount of data that mattered as much as how much data we had for each agent. Looking at the distribution of the data, I found that the datasets with higher accuracy tended to have a higher number of samples for each individual agent.
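
A per-agent sample count like the one described here is straightforward to compute. The sketch below uses invented labels, and the threshold of 40 is simply a cutoff chosen for illustration; the post does not specify whether the per-agent figure was a mean, a median, or a minimum, so the mean shown is an assumption.

```python
from collections import Counter

# Hypothetical agent labels, one per call in a dataset.
labels = ["agent_a"] * 45 + ["agent_b"] * 52 + ["agent_c"] * 12

MIN_CALLS = 40  # illustrative cutoff, not a value tuned by the author

# Number of calls available for each agent.
calls_per_agent = Counter(labels)

# Agents that fall short of the cutoff.
sparse_agents = sorted(
    agent for agent, n in calls_per_agent.items() if n < MIN_CALLS
)

mean_calls = sum(calls_per_agent.values()) / len(calls_per_agent)
print(dict(calls_per_agent))
print(f"mean calls per agent: {mean_calls:.1f}")
print(f"agents under {MIN_CALLS} calls: {sparse_agents}")
```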


As you can see from the graph below, I only tested on 13 datasets, but the shape is pretty clear: accuracy increases with the amount of data we have for each agent. However, it does seem to level off at around 40 contacts per agent, meaning that there is a point where more data isn’t particularly useful.

©CallMiner Eureka 2018

Investigation Part 2: What Is the Model Learning During Training?

So we’ve found that the model works better when you have at least 40 samples for each agent, but we still haven’t answered the question of why it works at all.


My next step was to actually look at the coefficients in the model and see what correlations it was finding between certain words and agents. At first, I was skeptical of the words it was associating with specific agents – a lot of greeting, closing, and agreement words. However, after creating a script to count how many times each agent says these words, I found that the correlations reflected what was actually in the data. In other words, most agents did seem to have particular words they said more often than anyone else.


Some interesting examples:

  • An agent who was associated with the phrase “okey-dokey” uses it in over half of their calls. No one else uses it more than once.
  • The word “hope” is used by one agent in 79% of their calls. The next highest agent uses it just 45% of the time.
  • Another agent uses the word “absolutely” over twice as often as anyone else.

While not every agent will have an example as extreme as these, most agents have words or combinations of words that they use just a bit more than anyone else, allowing the model to create a sort of speech print for them and pick out their calls from the dataset.
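
The kind of counting script mentioned above could be as simple as the following sketch, which reports the share of each agent's calls that contain a given word. The function name usage_rate, the whitespace tokenization, and the example calls are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (agent, agent-side transcript) pairs.
calls = [
    ("agent_a", "thanks for calling i hope you have a good day"),
    ("agent_a", "i hope that fixes it"),
    ("agent_a", "hope to hear from you soon"),
    ("agent_a", "have a good day"),
    ("agent_b", "i hope that helps"),
    ("agent_b", "have a good day"),
    ("agent_b", "thanks for calling"),
    ("agent_b", "goodbye"),
]


def usage_rate(calls, word):
    """Return, per agent, the fraction of calls containing `word`."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for agent, text in calls:
        total[agent] += 1
        # Count a call once no matter how often the word is repeated.
        if word in text.lower().split():
            hits[agent] += 1
    return {agent: hits[agent] / total[agent] for agent in total}


rates = usage_rate(calls, "hope")
for agent, rate in sorted(rates.items()):
    print(f"{agent}: {rate:.0%}")
```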

The Hypothesis

So does this mean we can get a “speech print” for anyone we want with about 40 samples of their conversations? Probably not.


The key here is that all of the calls have the same structure and content. In our demo database, every call is about buying or repairing the company’s products, and every agent is instructed on how to start and end the conversation. When every call follows a similar script, finishing calls with “I hope you have a good day” as opposed to just “have a good day” or saying “okey-dokey” instead of “OK” can really stand out.


It also relies on agents developing strong habits or a personal style when handling customer calls. It’s natural to develop a routine when doing something repetitive, and talking on the phone all day is no exception, so it makes sense for an agent to find themselves using the same responses and call closings every day. Colloquialisms, regional sayings, and social influences may also play a role in distinguishing one agent from another.


But taken out of the context of a call center, this model falls apart. If we were to, for example, hide a microphone in the break room, the conversation would likely cover a wide range of subjects and be much less scripted and formal. There wouldn’t be any polite closings, cheerful responses, or many of the other words that the model uses to identify an agent on the phone. Even if we had 100 samples of an agent having a conversation in the break room, our model wouldn’t stand a chance of identifying them.


So what does this tell us? It tells us that call center data has a unique, useful property that makes it possible to pick out subtle variations in style between agents. With further exploration, we might find more ways to use this property and expand our machine learning capabilities.