SIRI is an offshoot of CALO, a DARPA-funded project that was part of DARPA’s PAL (Personalized Assistant that Learns) initiative.
SIRI (and CALO) involve a number of technologies, including natural language processing, question analysis, data mashups, and machine learning.
It’s worth noting, however, that SIRI uses only a portion of the technologies developed as part of the CALO project. CALO does things like gaze and gesture analysis, real-time analysis of analyst activity on a computer workstation, and many other things that you won’t find in SIRI.
SIRI’s main tasks, at a high level, involve:
Phase 1: voice recognition
It may seem like the easy part, but it’s where everything begins, and it’s far from trivial. When you give Siri a command, your device captures your analog voice, converts it into an audio file (that is, into binary code) and sends it to Apple’s servers. The nuances of your voice, the surrounding noise and local expressions make it difficult to get this right. This is called a Human User Interface, as opposed to the standard Graphical User Interface we are used to. What matters here is that every day Apple collects millions of queries from people speaking multiple languages, in many accents, on different continents. In other words, with their actions and mistakes, people are contributing to the largest crowdsourced speech recognition experiment ever attempted. The Siri app today receives roughly a billion requests per week, and Apple states that its speech recognition has a word error rate of just 5 percent. To get to this point, last year Apple acquired the speech recognition company Novauris Technologies, a spinoff of Dragon Systems, and also hired several speech recognition experts.
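To make the 5 percent figure concrete: word error rate is usually defined as the number of substituted, deleted and inserted words divided by the number of words in the reference transcript. Here is a minimal Python sketch under that assumption. The function name and the sample sentences are mine, and this is not Apple's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word was dropped (deletion)
                curr[j - 1] + 1,     # extra word was heard (insertion)
                prev[j - 1] + cost,  # words match, or one replaced the other
            )
        prev = curr
    return prev[-1] / len(ref)


spoken = "the sheep is in the field"
heard = "the ship is in the field"
print(word_error_rate(spoken, heard))
```

One wrong word in a six-word sentence gives a rate of 1/6, about 17 percent. A 5 percent rate corresponds to roughly one wrong word in twenty.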
Phase 2: send everything to Apple servers in the cloud
Siri does not process your speech input locally on your phone. This is clearly a problem if you’re not connected for any reason, but this way Apple gets two major benefits:
The algorithm identifies the keywords and starts taking you down the flowchart branches related to those keywords to retrieve your answer. If it misreads a keyword because some part of the communication breaks down, it goes down the wrong flowchart branch. A single such mistake is enough to ruin the whole query, which then ends with the “Would you like to search the web for that?” result. Google Now and Cortana are no different.
As you can see, this is far from the concept of human conversation. The Siri app is still built on the logic of pre-programming every possible set of questions along with the rules for answering them. This was even more evident when, in October 2015, Apple honored “Back to the Future” day by updating the Siri app with at least ten humorous responses related to the popular movie Back to the Future. My favorite, “be careful who you date today, or you could start disappearing from photos…”, is just one answer that it picks at random from the list.
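The keyword-and-flowchart idea can be pictured with a toy example. The tree, the action names and the `route` function below are all invented for illustration, and Siri's real decision logic is certainly far larger, but the failure mode is the same: miss one keyword and you land on the web-search fallback.

```python
# Hypothetical keyword tree: each level narrows the request down.
FLOWCHART = {
    "weather": {
        "today": "show_today_forecast",
        "tomorrow": "show_tomorrow_forecast",
    },
    "call": {
        "mom": "dial_mom",
        "office": "dial_office",
    },
}
FALLBACK = "Would you like to search the web for that?"


def route(query: str) -> str:
    """Walk the keyword tree; give up as soon as no branch matches."""
    words = query.lower().replace("?", "").split()
    node = FLOWCHART
    while isinstance(node, dict):
        # Follow the first branch whose keyword appears in the query.
        branch = next((key for key in node if key in words), None)
        if branch is None:
            return FALLBACK  # a single missed keyword ruins the whole query
        node = node[branch]
    return node


print(route("What is the weather tomorrow?"))
print(route("What is the whether tomorrow?"))  # misheard keyword
```

The second query differs by a single misrecognized word, yet it never reaches the forecast branch.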
Phase 3: understand the meaning
The process of understanding what the user is asking for relies on an area of science called natural language processing. People have dozens of ways of asking the same thing, and we can express a concept using endless combinations of words: “I’m in the mood for a pizza”, “Is there an Italian restaurant nearby?”, “I’d love a Margherita today”. Humans easily understand what I mean, and to them it’s obvious that Margherita is not a person, but an algorithm must be sophisticated to reach the same conclusion. Sometimes the difficulty is simply that words sound similar or are mispronounced: oyster and ostrich, school and skull, byte and bite, sheep and ship and many others complicate the task.
To simplify its task, the Siri software models linguistic concepts. It analyzes how the subject keyword is connected to an object and a verb; in other words, it looks at the syntactic structure of the text. The decision to go down one branch of the flowchart or another depends on nouns, adjectives and verbs, as well as the general intonation of the sentence. On top of that, Siri can make sense of questions and follow-up commands. This is not exactly what a human would call “a conversation”, but it means Siri keeps track of context, and it’s the starting point for future developments.
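A crude way to see the problem is to map cue words to an intent and count the hits. The cue lists and the `guess_intent` function below are assumptions made up for this sketch. A real system would learn such associations from data rather than from a hand-written table.

```python
from typing import Optional

# Hypothetical cue words per intent.
INTENT_CUES = {
    "find_pizza_place": {"pizza", "italian", "margherita", "pizzeria"},
    "call_contact": {"call", "dial", "phone"},
}


def guess_intent(utterance: str) -> Optional[str]:
    """Return the intent whose cue words overlap most with the utterance."""
    words = {w.strip(".,?!").lower() for w in utterance.split()}
    scores = {intent: len(words & cues) for intent, cues in INTENT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for phrase in (
    "I'm in the mood for a pizza",
    "Is there an Italian restaurant nearby?",
    "I'd love a Margherita today",
):
    print(phrase, "->", guess_intent(phrase))
```

All three phrasings land on the same intent. But try “Call Margherita”: it scores one hit for each intent, which is exactly the person-or-pizza ambiguity that a human resolves without thinking.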
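Handling a follow-up essentially means remembering something from the previous turn. The toy class below shows the idea with a single remembered intent. The class, the phrase patterns and the weather example are all hypothetical, and real context tracking is much richer than this.

```python
class Conversation:
    """Keeps the last intent so that a follow-up can reuse it (toy example)."""

    def __init__(self):
        self.last_intent = None

    def interpret(self, utterance: str) -> dict:
        text = utterance.lower().strip("?!. ")
        if text.startswith("weather in "):
            self.last_intent = "weather"
            return {"intent": "weather", "place": text[len("weather in "):]}
        if text.startswith("and in ") and self.last_intent:
            # Follow-up: no intent is stated, so inherit it from the previous turn.
            return {"intent": self.last_intent, "place": text[len("and in "):]}
        return {"intent": None, "place": None}


chat = Conversation()
print(chat.interpret("Weather in Paris?"))
print(chat.interpret("And in Rome?"))
```

Without the first question, “And in Rome?” means nothing to the program, which is the difference between keeping context and treating every sentence in isolation.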
Phase 4: transform the meaning into actionable instructions
We know that Siri is here to help us, not just to understand what we say. In “The story behind Siri”, co-founder Adam Cheyer says: “I remember the first time we loaded these data sources into Siri, I typed ‘start over’ into the system, and Siri came back saying, ‘Looking for businesses named “Over” in Start, Louisiana.’ ‘Oh, boy,’ I thought.”
Once the Siri app understands what you want, it has to communicate with other apps to make it happen. Every app is different and, to some extent, has its own “language”. The system must have what is called domain knowledge: it must know the subject area you’re talking about. In human conversation, this happens every time we talk with experts in a certain field and they use specialized words that we hardly understand. It’s obvious when we speak with a doctor, an architect or a finance person, for example. For the Siri app it’s the same. When it has to give directions, book a flight or send a text, it has to communicate with other apps and understand their context. This is crucial as well. If the protocol does not work, Siri can instruct other apps to perform actions that you didn’t ask for or expect, or that could even be dangerous to you.
Last but not least, once a request has been processed, Siri must convert the result back into text that can be spoken to the user. While not as hard as processing a user’s command, this task, known as natural language generation, still presents some challenges. Today Siri speaks with the American voice known as “Samantha”, recorded by Susan Bennett in July 2005, the same person who voiced Tillie the All-Time Teller. But after Apple purchased Siri, they had to extend this capability to many languages, which is another reason why the Siri app is not growing as fast as originally expected.
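One way to picture this hand-off is a table that maps each understood intent to the app function that can carry it out, together with the parameters that function needs. Everything in the sketch below (the handler functions, the intent names, the reply strings) is hypothetical and is not how Apple's integration actually works. The point is only that checking the request before acting is what keeps a misunderstanding from turning into an unwanted action.

```python
def send_text(recipient: str, body: str) -> str:
    return f"Texting {recipient}: {body}"


def get_directions(destination: str) -> str:
    return f"Routing to {destination}"


# Intent name -> (handler, required parameter names). All hypothetical.
HANDLERS = {
    "send_text": (send_text, ("recipient", "body")),
    "get_directions": (get_directions, ("destination",)),
}


def dispatch(intent: str, params: dict) -> str:
    """Call the matching app handler only if the request is complete."""
    if intent not in HANDLERS:
        return "Sorry, I can't do that yet."
    handler, required = HANDLERS[intent]
    missing = [name for name in required if name not in params]
    if missing:
        # Refuse to act on a half-understood request.
        return "I need more information: " + ", ".join(missing)
    return handler(**{name: params[name] for name in required})


print(dispatch("get_directions", {"destination": "Start, Louisiana"}))
print(dispatch("send_text", {"recipient": "Ann"}))
```

The second call is missing the message body, so the dispatcher asks for it instead of sending an empty text.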
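The simplest form of natural language generation is filling in sentence templates, and even that shows why the step is not free: plurals, empty results and, eventually, every supported language need their own wording. The templates and the `render` function below are assumptions made up for this sketch, not Siri's actual responses.

```python
# Hypothetical response templates for one kind of result.
TEMPLATES = {
    "restaurants_found": "I found {count} {cuisine} restaurant{plural} near you.",
    "no_results": "I couldn't find any {cuisine} restaurants nearby.",
}


def render(result: dict) -> str:
    """Turn a structured search result into a sentence that can be spoken."""
    count = len(result["places"])
    if count == 0:
        return TEMPLATES["no_results"].format(cuisine=result["cuisine"])
    return TEMPLATES["restaurants_found"].format(
        count=count,
        cuisine=result["cuisine"],
        plural="" if count == 1 else "s",
    )


print(render({"cuisine": "Italian", "places": ["Da Mario", "Bella Napoli"]}))
```

Each new language would need its own templates and its own plural rules, which gives a feel for the effort behind extending the assistant beyond English.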