How Your Virtual Assistant Knows What You Want (And Gets It Done)
One of the most remarkable things I learned during my studies in Computational Linguistics was that we still don't know how language is processed in the brain. We know that an average human has around 80,000 words in their vocabulary, and that somehow, when we speak, our brains are able to form ideas, crawl through that vocabulary to find the right words, string them together, and express them. When we listen, we must decode an incoming audio signal into words, and search our own mental lexicons in order to extract meaning from them. And all of this at breathtaking speed.
It's easy to rationalise this wonder by saying "well, our brains are simply the most advanced computational structures in our known universe". But what about "machines"? Have you ever wondered how the invisible "persona" in your mobile phone or smart speaker manages to understand and accomplish tasks for you? Sure, they're not as good as we are, but then they have only a fraction of our computing power and complexity.
I can't explain how I wrote this post, or how you're reading it. But I know how your phone gets things done for you, and I'm going to do my best to explain it. The understanding, dear reader, will be up to you.
Why is language so damn difficult?
If you've ever tried to use an automated chatbot, or the voice interface in your mobile phone or your car, chances are you've experienced moments of confusion, or even communication failure. Why? Well, these interactions are facilitated via natural language, and natural languages are just plain hard. That's because they are:
infinitely creative: you can say the same thing in many, many different ways [1]
structured: if you perform a Google search for "brown gloves", native speakers know (thanks to the grammatical structure of the utterance) that "orange gloves" would be more relevant than "brown purse", but that's not so easy for a machine to understand
inferential: there's also meaning in what isn't said. If I search for "formal dress", we know that I won't accept results for "casual dress", but I might for "formal gowns". But how should a machine know that?
lexically and syntactically ambiguous: words and sentence structures can sometimes be interpreted in multiple ways, which means one search query could match a huge variety of results, only some of which match the user's intent
context based: sometimes the only way to disambiguate a lexically ambiguous word is through the surrounding words; furthermore, the same word can mean different things to different people, depending on the "context" that is their life
"negatable": this is a small but common pain point for natural language processing. Adding a "not" to a sentence, for example, or speaking sarcastically, changes the entire meaning
multimedia based: we don't just communicate through text but also through spoken messages, emojis, hashtags and so on.
If natural languages are so hard, why build them into our technologies? Why not, for example, have fixed sets of phrases per device, which users need to learn in order to be able to request certain tasks?
Well, because people want to speak naturally. Remember in the past, when you would reformulate search queries in a way you thought the machine would understand? People don't want that anymore. And by the way, they don't just expect better language capabilities from the digital personal assistant in their phone or smart speaker: they expect it in search engines, chatbots, and so on.
What is a digital personal assistant?
Despite having a voice, a "personality", and a semblance of "self-awareness", under the hood a digital personal assistant is simply a software application. Throughout this article I will anthropomorphise a little with phrases like "the assistant knows…" or "Siri tries to…", but this is a stylistic choice and should not imply that the thing you're interacting with is capable of thinking or knowing anything at all! When I say "assistant", I thus always mean "application".
The digital personal assistant waits for a wake word like "hey Siri" to activate it. Then it performs natural language processing to turn the spoken input into a textual representation, and applies natural language understanding to try to "understand" the text. It then attempts to complete a task for you by feeding the understood information to various APIs. An API (Application Programming Interface) is an interface between different software programs, such as between a digital personal assistant and a weather website. The API defines how the two programs can "communicate", e.g. which requests can be made, what information each request requires, and how it should be written. Once the API returns a result, natural language generation is usually used to convert the information into more user-friendly text or speech.
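That request/response "contract" can be sketched in code. To be clear, everything below is invented for illustration: the endpoint URL, parameter names, and response fields are hypothetical, not those of any real weather service.

```python
# Sketch of an API "contract": which requests can be made, what information
# each request requires, and what comes back. The endpoint and field names
# below are hypothetical, invented purely for illustration.

def build_weather_request(city: str, date: str) -> dict:
    """Assemble a request in the shape the (hypothetical) API expects."""
    return {
        "endpoint": "https://api.example-weather.test/v1/forecast",
        "params": {"city": city, "date": date},
    }

def extract_forecast(response: dict) -> str:
    """Pull out just the fields the assistant needs from the structured reply."""
    return f"{response['condition']}, {response['temp_c']}°C"

request = build_weather_request("London", "2024-05-01")
# A real assistant would now send `request` over HTTP; here we fake the reply.
response = {"condition": "light rain", "temp_c": 11}
print(extract_forecast(response))  # → light rain, 11°C
```

The point is simply that both sides agree in advance on the shape of the data, so two programs that have never "met" can still exchange information.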
What is Natural Language Processing? And NLU? And NLG?
Alright, let's clarify some of the terms introduced above.
Natural Language Processing is the process of preparing human language data for different purposes. Those purposes might be research-based, like allowing linguists to analyse the data, or they might involve machine learning, in which case the processing steps are usually designed to make the text more appropriate for feeding to a machine learning algorithm.
NLP can involve steps like:
Automatic Speech Recognition: aka speech-to-text, which translates the incoming sound-waves into text. ASR models are built using machine learning, by showing a model thousands of pairs of input sound-waves and output sentences and letting it learn the correspondences.
Tokenizing: means splitting the input text into individual words, aka "tokens". (Purpose for ML: most algorithms take their input as series of individual tokens).
Stemming: involves stripping the endings from words to leave only the word stem. (Purpose for ML: to reduce computational load by reducing the number of vocabulary items that need to be processed; to improve performance by ensuring all words are represented in a consistent way, thus also boosting the number of training examples which feature each stem).
Note that stemming may not always result in a grammatical word. For example, converting plural nouns to singular can be done by removing the suffix -s, but this won't work for irregular English nouns. Thus we get dogs → dog, but countries → countrie, and women → women. Similar problems arise in other languages, too. For example, in German many plural nouns can be converted to singular by removing -en or -er, but irregular nouns pose problems there as well. Thus we get Frauen → Frau (women → woman), which is correct, but Bücher → Büch (books → book, where the singular should actually be spelled Buch).
Lemmatizing: means converting each word to its standard form. Again an example could be reducing plural nouns to singular, but with lemmatizing, the result should also be a grammatical word. (Purpose for ML: as above).
Part-of-speech tagging: means assigning the grammatical roles, such as "noun", "verb", or "adjective", to each word in the sentence. (Purpose for ML: parts-of-speech can be useful input features for various language tasks).
Named Entity Recognition: assigning labels like "person", "place", "organisation", "date/time" to relevant words in the sentence. (Purpose for ML: as above).
Interestingly, although part-of-speech tagging and named entity recognition are usually used in order to prepare features as input for machine learning models, they themselves are also usually achieved via machine learning: we feed an algorithm thousands of examples of already-labelled sequences, and it learns to recognise the patterns in our language which can be used to determine a word's grammatical role, or whether it represents an entity.
Also note that NLP doesn't have to include all of the above steps. In fact, modern neural language models rarely utilise features like part-of-speech tags. That's because they're powerful enough to take unannotated language input and learn for themselves which features in the input are useful for accomplishing their set task.
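To make a few of those steps concrete, here is a deliberately crude sketch of tokenizing, stemming, and lemmatizing. Real pipelines use trained models or libraries such as NLTK or spaCy; the drop-the-s rule and tiny lookup table below are toy assumptions, chosen to reproduce the dogs/countries/women examples above.

```python
# Toy versions of three NLP steps: tokenizing, stemming, and lemmatizing.
# Real systems use trained models or libraries such as spaCy or NLTK.

def tokenize(text: str) -> list:
    """Split raw text into lowercase word tokens."""
    return text.lower().split()

def stem(token: str) -> str:
    """Crude suffix stripping: just drop a trailing -s.
    As discussed above, this fails on irregular forms."""
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

# Lemmatizing instead maps each word to its dictionary form, so a lookup
# table (toy-sized here) can handle the irregular cases.
LEMMAS = {"countries": "country", "women": "woman"}

def lemmatize(token: str) -> str:
    return LEMMAS.get(token, stem(token))

tokens = tokenize("Dogs countries women")
print([stem(t) for t in tokens])       # → ['dog', 'countrie', 'women']
print([lemmatize(t) for t in tokens])  # → ['dog', 'country', 'woman']
```

Note how the stemmer produces the same ungrammatical outputs described earlier, while the lemmatizer's lookup step recovers the proper dictionary forms.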
Once the NLP pipeline is complete and the user utterance has been processed, the following steps usually take place:
Natural Language Understanding: is about trying to extract valuable information from (processed) human utterances, in order to accomplish the user's request. The field combines artificial intelligence, machine learning and linguistics to enable computers to "understand" and use human language. We'll examine it in the next section.
Natural Language Generation: involves generating human-like text. This can be done using automated rules (for simple, restricted, repetitive contexts like generating weather reports from weather data), or else using neural networks that were specifically trained to generate text.
Speech Synthesis: aka text-to-speech, is the process of generating synthetic voice audio from text. The models are trained just like ASR models, though with the input and output sequences swapped.
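For the rule-based flavour of natural language generation, the weather-report case really can be as simple as filling a template. The field names and the frost rule below are invented for this sketch.

```python
# Minimal rule-based natural language generation: structured weather data in,
# a human-friendly sentence out. Field names are invented for illustration.

def weather_report(data: dict) -> str:
    sentence = "Expect {condition} in {city} today, with a high of {high_c}°C.".format(**data)
    if data["high_c"] <= 0:
        sentence += " Watch out for ice."  # a simple hand-written rule
    return sentence

print(weather_report({"city": "Berlin", "condition": "scattered showers", "high_c": 14}))
# → Expect scattered showers in Berlin today, with a high of 14°C.
```

This works well precisely because the domain is restricted and repetitive; for open-ended text, trained neural generators take over.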
How Does Natural Language Understanding Work?
Before we discuss the "how", let's demonstrate NLU in action. What's the first word that comes into your head when you read the following?
"Hey Siri, book me a …"
Most people will answer something like "flight", "holiday", or "hotel room". Google's autocomplete makes the same kinds of suggestions.
What's going on here? Well, we all have a language model in our heads, which our child brains learned automatically by taking statistical measurements of which words occur in which life situations and syntactic and lexical contexts [2]. Search engines and other language technologies have language models as well, which they have acquired through machine learning.
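The statistical idea can be sketched with a toy bigram model: count which word follows which in a tiny (invented) corpus, then predict the most frequent continuation. Real language models are vastly larger and capture far richer patterns, but the principle is the same.

```python
from collections import Counter, defaultdict

# Toy bigram language model over a tiny invented corpus.
corpus = [
    "book me a flight",
    "book me a hotel",
    "book me a flight to london",
]

follows = defaultdict(Counter)  # maps each word to counts of its successors
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent word seen after `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("a"))  # → flight  ('flight' follows 'a' twice, 'hotel' once)
```

This is why "flight" feels like the natural completion of "book me a": it is simply the most probable continuation given everything the model has seen.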
When you make a spoken request to your phone, the following process takes place:
Automatic Speech Recognition converts the incoming sound-waves into text. Depending on what the models in the next steps expect their input to look like, various other stages of the Natural Language Processing pipeline may then be applied.
During Domain Detection, the digital personal assistant tries to classify the domain of assistance the user requires, such as flights, weather, or personal services (like getting a haircut).
Intent Detection, similarly, is about identifying what the user wants to do. For example, if we're in the flight domain, do they want to book a flight or just get flight information?
Slot Filling. Once the assistant knows the domain and intent, it knows which APIs it will need to access in order to satisfy the user's request, and what sort of information those APIs will require. Slot filling is about automatically extracting that information from the user's utterance. The assistant does this by loading a set of expected slots, such as "departure city" and "arrival city", and then trying to assign those slot labels to the words in your utterance (if it can't identify all required slots, it might ask you some follow-up questions).
Domain detection, intent detection, and slot filling are all accomplished via machine learning models. For each task, one must feed the learning algorithm thousands or millions of textual representations of user utterances together with their target label(s): the intent, and/or domain, and/or slots. I say and/or because, while it is possible to build one model per task, such that only one kind of label would be required per input sequence, it is more common to train a single model to do multiple tasks. This is known as multi-task learning.
By completing these three tasks, the assistant can build a semantic frame (a more structured representation of the input) which it can use to complete your request.
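In a real assistant, all three steps are trained classifiers. A keyword-based toy version, with invented labels and a deliberately naive slot rule, still shows how the pieces combine into a semantic frame.

```python
# Toy, rule-based stand-ins for domain detection, intent detection, and slot
# filling. Real assistants learn these from data; the labels are invented.

def detect_domain(text: str) -> str:
    return "flights" if "flight" in text else "unknown"

def detect_intent(text: str) -> str:
    return "book_flight" if "book" in text else "get_flight_info"

def fill_slots(text: str) -> dict:
    """Look for 'from X to Y' patterns to fill the two city slots."""
    slots = {}
    words = text.split()
    if "from" in words:
        slots["departure_city"] = words[words.index("from") + 1]
    if "to" in words:
        slots["arrival_city"] = words[words.index("to") + 1]
    return slots

utterance = "book me a flight from seattle to denver"
frame = {
    "domain": detect_domain(utterance),
    "intent": detect_intent(utterance),
    **fill_slots(utterance),
}
print(frame)
# → {'domain': 'flights', 'intent': 'book_flight',
#    'departure_city': 'seattle', 'arrival_city': 'denver'}
```

Notice how fragile the hand-written slot rule is: it already breaks on multi-word city names like San Francisco, which is exactly why these steps are learned from data in practice.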
Let's take an example.
"Hey Siri…," you say to your phone. The assistant wakes and begins trying to classify the request domain, even as you are still speaking. "… book me a flight…," you continue. The word "flight" makes the domain clear, and since you already said "book", the intent becomes clear too. Note how the steps in natural language understanding often don't align with the order in which the relevant input words are uttered: later words can disambiguate, or completely redirect, earlier interpretations.
At this point, the assistant knows it will need to fill slots like "departure city", which it does as you continue with "… from San Francisco to London." It can now complete the semantic frame and pass the relevant information on to an API such as Kayak's flight search.
The API returns its results, and, finally, natural language generation is used to convert this back into human-understandable text, or synthesised speech.
Conclusion
So that's it: how your digital personal assistant knows what you want, and gets it done for you. Simple? No. Remarkable? I think so.
[1] The variability of language is probably why Google gets five hundred million unique queries every day.
[2] The process of learning a second language differs from that of a first, particularly as one grows older, but the general point about having a language model in our heads still holds.
Thanks for reading! If this article helped you, please give it a little clap, so I know to produce more content like it. You can also follow me here on Medium, or on Twitter (where I post loads of interesting content on AI, tech, ethics, and more), or on LinkedIn (where I summarise the best of my Medium and Twitter feeds). If you'd like me to speak at your event, please contact me via my socials or the contact form here.
Want to know more about Natural Language Processing? I wrote a whole chapter about it in The Handbook of Data Science and AI, available here.