We may still be a few decades away from everyone having their own personal robot, but in a lot of ways, the future has arrived – especially in the realms of automatic speech recognition (ASR) and interactive voice response (IVR). After all, where would we be today without having Siri tell us whether or not it’s raining outside? Caught in the rain without an umbrella, at the very least.
Automatic Speech Recognition Is a Process
Automatic speech recognition is any technology that allows a computer to convert spoken language into text in real time. While the technology has been a subject of government and military research since the 1950s, it has only been used by the general public since the 1980s, when it was introduced as a way to help people with musculoskeletal disabilities.
To use ASR technology, you start by speaking into your device’s microphone. Your device creates a wave form from the sound, filters out background noise and normalizes the volume to a constant level. The filtered wave form is then broken down into individual phonemes, the most basic units of sound used to build words, like the hard “k” sound in the word “kit.” Starting from the first phoneme of a word, the computer uses a combination of statistical probability (usually a hidden Markov model) and context to narrow down the options and figure out which word was spoken.
Talk to Me
Some ASR systems are so advanced that they can engage in “conversations” with you, a technology called natural language processing (NLP). NLP works through machine learning and statistical inference, in which software searches a large body of real-world examples to recognize and respond to speech. Other methods of speech recognition instead search a hard-coded vocabulary.
NLP works best in fairly simple “conversations” that rely mostly on yes or no answers, or that have only a few likely answers. Instead of searching its entire vocabulary for each word in a question and processing the words separately, an NLP system reacts to certain “tagged” words and phrases to respond appropriately – things like “weather forecast” or “pay my bill.”
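The tagged-phrase idea can be sketched in a few lines. This is only an illustration under simple assumptions: the intent names ("weather", "billing") and the `detect_intent` helper are hypothetical, and the matching is plain substring search rather than anything a production IVR system would rely on.

```python
# Hypothetical table mapping an intent name to the tagged phrases that trigger it.
INTENT_TAGS = {
    "weather": ["weather forecast"],
    "billing": ["pay my bill"],
}


def detect_intent(utterance):
    """Return the first intent whose tagged phrase appears in the utterance, or None."""
    text = utterance.lower()
    for intent, phrases in INTENT_TAGS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None  # No tagged phrase found; a real system might ask the caller to repeat.


if __name__ == "__main__":
    print(detect_intent("What's the weather forecast for today?"))
    print(detect_intent("I'd like to pay my bill, please."))
```

Because the system only looks for the tagged phrases, the rest of the sentence can vary freely, which is part of why this approach suits short, predictable exchanges.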
Improving the Conversations
Over time, voice recognition software gets better by “learning” from each experience. In fact, speech recognition has been a major focus of machine learning research over the last few decades. ASR systems can either be tuned by humans, or they can engage in a process called active learning.
In tuning, programmers review logs to identify and fix common problems. With linguists’ help, they can add words, pronunciations and grammatical structures that the system is failing to understand. The software is hand-coded with a variety of real-world examples to search and draw from.
Active learning, meanwhile, is still limited in its capabilities; think about how often your phone autocorrects “top” to “too,” and you’ll have an idea. The program stores data from past interactions as it gets to know the words and combinations of words that you use most often. Another example of active learning appears in home and medical transcription software, which calibrates itself to its user’s voice: it takes in certain words and phrases and compares them against programmed examples so that it can work more easily with accents, speech impediments, and more.
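As a rough sketch of what “getting to know” your word combinations could look like, the example below counts which word tends to follow which in past input, then uses those counts to choose between two similar candidates. The `UsageModel` class and its method names are hypothetical, and simple pair counting is an assumption made for illustration, not a description of how any particular phone or transcription product works.

```python
from collections import Counter, defaultdict


class UsageModel:
    """Hypothetical store of which words a user tends to say after which."""

    def __init__(self):
        # following["me"]["too"] == number of times "too" came right after "me"
        self.following = defaultdict(Counter)

    def observe(self, sentence):
        """Record every adjacent word pair from a past interaction."""
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            self.following[previous][current] += 1

    def pick(self, previous_word, candidates):
        """Choose the candidate seen most often after previous_word.

        Ties, including the case where nothing has been seen, go to the
        first candidate in the list.
        """
        counts = self.following[previous_word.lower()]
        return max(candidates, key=lambda word: counts[word])


if __name__ == "__main__":
    model = UsageModel()
    model.observe("me too")
    model.observe("on top of the hill")
    print(model.pick("me", ["top", "too"]))
    print(model.pick("on", ["top", "too"]))
```

The more interactions the model observes, the more its guesses reflect one person's habits, which is also why it can be confidently wrong about a phrase it has rarely seen.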
ASR technology is fascinating and fun to experiment with, but it currently faces a few limitations. While a commonly quoted average accuracy is 96 percent, this is usually accompanied by the caveat “in ideal conditions,” meaning little background noise, no one else speaking nearby, distinct speech and more. Loud ambient noise or low-quality input hardware can muddle the wave forms and lead to inaccurate output.
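The article does not say how that accuracy figure is measured. One common yardstick for speech recognizers is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system's output into the correct transcript, divided by the length of the correct transcript. The sketch below assumes that metric and a simple whitespace split into words; the function name is made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference transcript contains at least one word.
    """
    ref = reference.split()
    hyp = hypothesis.split()

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("pay my bill", "pay my pill"))
```

Under this measure, getting one word wrong in a three-word request is an error rate of one third, so a figure like 96 percent accuracy would correspond to roughly one wrong word in every 25.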
Computers and software also have problems distinguishing overlapping speech (two voices speaking at the same time), and the extensive statistical and contextual analysis these programs perform often requires a large amount of processing power, taxing a computer’s processors and batteries. Finally, the always-tricky homophones (words that sound the same but have different meanings, like “to,” “too” and “two”) are difficult for computers to process correctly, even as the ability of ASR programs to distinguish between words based on context improves.
As technology continues to progress, the future of speech recognition software looks to focus on making translation services more accurate and further developing computers’ ability to understand the words they’re taking in.
From Luke Skywalker communicating with R2-D2 and C-3PO to today’s helpful yet sometimes snarky Siri, ASR has evolved to have more functionality than ever, and as the software is tuned and perfected, the scope of artificial intelligence will only continue to grow. And we will no longer have to worry about being caught in the rain.