NLP in the real world: Google Duplex


This post was originally published by Jerry Wei at Towards Data Science

Here’s an example of the system making a call to schedule a hair salon appointment.

And here’s another example where Duplex is calling a restaurant.

Duplex’s successes. In these examples, I was really impressed by how well Duplex can imitate a human. The voices sounded like real people, and Duplex even added interjections like “um” to sound more humanlike (and there is truly no better way to sound human than to say “um”). Duplex can also manipulate the latency of its responses. For instance, if the person says “hello?”, it has to respond quickly, but if they say a really long sentence, it takes some time before responding to mimic thinking time. Despite its power, Duplex still made some small errors in these example calls (e.g., when Duplex says “anything between 10 AM and 12 PM” in the first call, the “12 PM” has an unnatural, robotic inflection), but these errors are minor enough that the person on the other end of the phone would hardly notice them, let alone think much of them.
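That latency trick is simple enough to sketch. Below is a minimal, hypothetical Python heuristic (Google hasn’t published its actual policy, so this is just an illustration of the idea) that picks a response delay from the length of the incoming utterance, replying almost instantly to a short “hello?” and pausing briefly after a long sentence.

```python
import time

def response_delay_seconds(utterance: str) -> float:
    """Illustrative heuristic: short prompts ("hello?") get a near-instant
    reply, while longer sentences get a pause that mimics thinking time.
    This is a sketch of the idea, not Google's actual policy."""
    n_words = len(utterance.split())
    if n_words <= 2:
        return 0.1  # quick acknowledgement
    return min(0.3 + 0.05 * n_words, 1.5)  # cap the pause so it never feels stalled

# Example: wait before "speaking" the reply
incoming = "We could do anything between ten and noon on Wednesday."
time.sleep(response_delay_seconds(incoming))
```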

Areas for improvement. Even though Duplex handles basic tasks very well, there are always edge cases that need further investigation. Duplex may not be able to easily handle complex statements [example] or calls with background noise or poor sound quality. These cases will need further improvement. For now, when Duplex is unable to complete its task, a built-in failsafe hands the call off to a human operator who finishes the task.

Simple schematic of the Duplex system. The RNN uses the audio from the phone call, Automatic Speech Recognition (ASR) software, and conversation parameters to obtain a textual response. This response is run through a text-to-speech program to obtain a verbal response.

Duplex’s ties to NLP. Of course, being able to imitate a human during a phone call requires some insane NLP. First, there needs to be an accurate speech-to-text system just to understand what the person on the phone is saying. Next, another model has to interpret that text in the context of the goal of the phone call. Then a proper response has to be generated. Finally, a text-to-speech model needs to translate this response into a humanlike voice on the phone. These steps are repeated constantly throughout the phone call in real time, so the models need to be both accurate and fast.
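To make that loop concrete, here is a minimal Python sketch of the cycle. The `transcribe`, `generate_response`, and `synthesize` functions are hypothetical stand-ins for the ASR, dialogue, and TTS components described above, not Duplex’s real models.

```python
def transcribe(audio_chunk: bytes) -> str:
    # Stand-in for an ASR system: audio in, text out.
    return "Hi, how can I help you?"

def generate_response(transcript: str, goal: dict) -> str:
    # Stand-in for the dialogue model: interprets the transcript in the
    # context of the call's goal and produces a textual reply.
    return f"I'd like to book a table for {goal['party_size']} at {goal['time']}."

def synthesize(text: str) -> bytes:
    # Stand-in for TTS: text in, audio out.
    return text.encode("utf-8")

def handle_call(audio_stream, goal: dict):
    # Repeat the ASR -> dialogue model -> TTS cycle for every turn,
    # in real time, until the call ends.
    for audio_chunk in audio_stream:
        transcript = transcribe(audio_chunk)
        reply_text = generate_response(transcript, goal)
        yield synthesize(reply_text)

# Example usage with a fake one-turn "call"
goal = {"task": "restaurant_reservation", "party_size": 4, "time": "7 PM"}
for reply_audio in handle_call([b"\x00" * 320], goal):
    pass  # in a real system this audio would be played into the phone call
```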

Necessary NLP models. To do this, Duplex uses a recurrent neural network (RNN) combined with Google’s Automatic Speech Recognition (ASR) technology, parameters of the conversation (e.g., desired time, names), and a text-to-speech (TTS) system. Notably, these models have to be trained on narrow domains. For example, there would need to be separate training for making an appointment and booking a reservation — Duplex isn’t trained as a general chatbot, and any new areas that Google wants to use Duplex for would require new training.
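As an illustration of what “narrow domains” could look like in code, the hypothetical configuration below keeps a separate trained model and slot list per task; the model paths and slot names are made up for this example.

```python
# Hypothetical illustration of narrow domains: each task gets its own
# trained model and conversation parameters, rather than one general chatbot.
DOMAIN_CONFIGS = {
    "hair_salon_appointment": {
        "model_path": "models/salon_rnn.ckpt",       # hypothetical path
        "slots": ["service", "desired_time", "client_name"],
    },
    "restaurant_reservation": {
        "model_path": "models/restaurant_rnn.ckpt",  # hypothetical path
        "slots": ["party_size", "desired_time", "name"],
    },
}

def load_domain(task: str) -> dict:
    # A new task (e.g., booking movie tickets) would need its own entry
    # and its own training run before the system could handle it.
    if task not in DOMAIN_CONFIGS:
        raise ValueError(f"No trained model for domain: {task}")
    return DOMAIN_CONFIGS[task]
```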
