Wanna Know How Amazon Alexa Works? Here We Go.
Amazon is projected to sell 29 million Echos in 2018 and 39 million in 2019. On top of that, earlier this year Amazon exceeded its full-year shipment goal for the Echo Dot within a matter of months. It is not hard to imagine the device becoming ubiquitous.
With its debut in 2014, Amazon Alexa ushered in a new era for voice recognition software. Even as we use it for everyday tasks, it is continually being refined by highly complex machine learning processes.
At the recent Machine Learning Innovation Summit in San Francisco, Francois Mairesse, a senior machine learning scientist at Amazon, explained how Amazon Alexa works and how the technology keeps improving as the retail giant strives for perfection.
Speech recognition first came onto the market in the 1990s, out of work that received a large amount of DARPA funding. Universities such as Cambridge and CMU built their own recognition systems, which saw some commercial success once they were integrated into Windows. The main issue they faced was insufficient data. The models needed speech adaptation, that is, shifting their parameters to suit each user, and with the limited processing power of the time this was practically impossible.
New Era Of Speech Recognition:
Over time the landscape changed: the amount of data available to train models grew, and processing power advanced. In 2011 Apple was confident enough to launch speech recognition in the form of Siri, and Google, Samsung, and Microsoft followed with products of their own.
Everyone wonders how Amazon Alexa works, but phone-based assistants came first. One of their early uses was voice search in map apps. Adoption of phone-based voice user interfaces (voice UIs) was slow, and they were not as effective as expected. A large part of the problem was that speaking offered little advantage: picking up the phone and typing a query does not take very long.
The first inflection point came in November 2014, when Amazon debuted the Echo and introduced far-field technology. With far-field technology, speech recognition started to feel less like science fiction and more like reality.
Amazon had already launched products based on close-talk technology. Amazon Dash is a kitchen assistant that helps you add milk, butter, and other household products to your shopping list. It is a simple device that recognizes the user's voice spoken into its microphone and can distinguish the actual input from background noise. The user's speech has to be matched against the catalog of all grocery products, which is a big task, though a small one compared with what later devices handle. Fire TV followed, also using close-talk technology but extending it to films and so forth. Finally, Amazon launched voice recognition in its shopping app, where the catalog is huge: almost every English word is a product name, so the output space is very large.
Then came the Amazon Echo, Amazon's first far-field device. Unlike the previous devices, the Echo is a fully fledged virtual assistant. It relies on machine learning to give more accurate results. It started with shopping, the weather, music, and so forth, and now supports phone calls and messaging, learning as it goes. The more we use the device, the more we come to love it. Mairesse says the key is to keep it simple: implement all features in the cloud, and make sure the device is application agnostic and the applications are device agnostic, so that everything is easy to access.
How Amazon Alexa Works:
The journey from voice input to the Echo producing a result may seem instantaneous, but it is a comparatively involved process. Take a user who says 'Alexa, what's the weather in Seattle?'. Producing the right answer means solving a series of problems, including cross-domain intent classification.
- The first step is signal processing: the device must be able to pick up the audio clearly. Signal processing is one of the biggest challenges in far-field audio. The aim is to get a clean signal of the speaker even when there are background sounds. Those sounds can be reduced using beam-forming, which is done with seven microphones. This helps the device suppress unwanted sounds and focus on the voice that matters.
- The second step is wake word detection. Here the device tries to recognize the programmed wake words, such as Alexa or Echo. The goal is to keep false positives to a minimum, since they could lead to accidental purchases and angry customers. The task is complicated by the need to handle different pronunciations on a device with limited CPU power, and all of it has to happen very quickly. In short, it requires high accuracy and low latency.
- Once the wake word is detected, the audio is sent to the speech recognition software in the cloud, which converts it into text. At this point the task changes from a binary classification problem into a sequence-to-sequence problem. The output space is huge, and the cloud is the only technology that can scale to it. Users' expectations are high too: this is not a yes-or-no question, and they could ask anything.
Matters are complicated further by users who use the device mainly to listen to music, because many artists spell their names in unusual ways.
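The beam-forming idea from the signal processing step can be illustrated with a minimal delay-and-sum sketch. This is not Amazon's implementation: the sine-wave "voice", the integer sample delays, and the noise level are all assumptions made for illustration, and a real device would have to estimate the delays rather than know them in advance. Only the count of seven microphones comes from the text above.

```python
import math
import random

NUM_MICS = 7          # the article mentions seven microphones
NUM_SAMPLES = 2000    # hypothetical length of the recording, in samples

def target(t):
    """The voice we want to keep, modeled as a simple sine wave (assumption)."""
    return math.sin(2 * math.pi * t / 20)

def simulate_mics(delays, rng):
    """Each mic hears the voice d samples late, plus its own independent noise."""
    return [
        [target(t - d) + rng.gauss(0, 1) for t in range(NUM_SAMPLES)]
        for d in delays
    ]

def delay_and_sum(channels, delays):
    """Undo each mic's delay so the voice lines up, then average the channels."""
    usable = len(channels[0]) - max(delays)
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]

def mean_squared_error(signal):
    """Average squared difference between a signal and the clean voice."""
    return sum((x - target(t)) ** 2 for t, x in enumerate(signal)) / len(signal)

rng = random.Random(0)
delays = list(range(NUM_MICS))            # assumed arrival delays, one per mic
mics = simulate_mics(delays, rng)
beamformed = delay_and_sum(mics, delays)

single_mic_error = mean_squared_error(mics[0])    # mic 0 has zero delay
beamformed_error = mean_squared_error(beamformed)
```

Because the voice adds up coherently once the channels are aligned while the independent noise partly cancels, `beamformed_error` comes out well below `single_mic_error`.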
As mentioned earlier, while converting the audio into text, Alexa also analyzes characteristics of the user's speech, such as frequency and pitch, to produce feature values. A decoder then determines the most likely sequence of words given the input features and a model that has two parts.
- First, a model of which word sequences are likely, built from a huge amount of existing text (in speech recognition this is usually called a language model).
- Second, an acoustic model, trained with deep learning on pairings of audio and transcripts. The two are combined through dynamic decoding so that recognition can keep up in real time.
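How the two parts work together in a decoder can be sketched with a tiny Viterbi-style search. Everything here is hypothetical: the candidate words, the acoustic probabilities, the bigram probabilities, and the `FLOOR` fallback are made-up numbers chosen to show the mechanism, not values from any real system.

```python
import math

# Hypothetical acoustic scores: for each word slot, P(audio | word).
ACOUSTIC = [
    {"play": 0.9, "pray": 0.1},
    {"like": 1.0},
    {"a": 1.0},
    {"prayer": 0.8, "player": 0.2},
    {"by": 0.45, "buy": 0.55},   # the audio alone slightly prefers the wrong word
    {"madonna": 1.0},
]

# Hypothetical text-based prior P(word | previous word).
BIGRAM = {
    ("<s>", "play"): 0.6, ("<s>", "pray"): 0.1,
    ("play", "like"): 0.3, ("pray", "like"): 0.3,
    ("like", "a"): 0.5,
    ("a", "prayer"): 0.2, ("a", "player"): 0.2,
    ("prayer", "by"): 0.4, ("prayer", "buy"): 0.01,
    ("player", "by"): 0.2, ("player", "buy"): 0.05,
    ("by", "madonna"): 0.3, ("buy", "madonna"): 0.05,
}
FLOOR = 1e-4  # probability assumed for word pairs missing from BIGRAM

def decode(acoustic, bigram):
    """Find the word sequence maximizing log P(audio|word) + log P(word|prev)."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for slot in acoustic:
        following = {}
        for word, p_audio in slot.items():
            candidates = []
            for prev, (score, path) in best.items():
                p_text = bigram.get((prev, word), FLOOR)
                total = score + math.log(p_audio) + math.log(p_text)
                candidates.append((total, path + [word]))
            following[word] = max(candidates, key=lambda c: c[0])
        best = following
    return max(best.values(), key=lambda c: c[0])[1]

words = decode(ACOUSTIC, BIGRAM)
```

In this toy setup the acoustic scores on their own would pick "buy" in the fifth slot, but the text-based prior makes "prayer by" far more likely than "prayer buy", so the combined search settles on "by".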
- Next, Natural Language Understanding (NLU) kicks in and converts the text into a meaningful representation. This is still a classification task, but the output space is smaller. Systems typically start with rules and regular expressions, but edge cases like the following push them toward statistical models:
a) If someone says 'play remind me', the chance of misinterpretation is quite high, since 'remind me' can look like a request for a reminder rather than something to play, and so the result may be wrong.
b) Speech recognition can mishear 'play like a prayer BY Madonna' as 'play like a prayer BUY Madonna', which has obvious consequences.
c) Out-of-domain utterances also have to be recognized and rejected at this stage, which keeps the device from acting on commands it hears from a television.
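The rules-and-regular-expressions starting point for NLU can be sketched as follows. The intent names, patterns, and slot names are hypothetical and chosen only to illustrate the three cases above; they are not Alexa's actual rules.

```python
import re

# Ordered (intent, pattern) rules; the first match wins. All names are hypothetical.
RULES = [
    ("PlayMusic", re.compile(r"^play (?P<song>.+?)(?: by (?P<artist>.+))?$")),
    ("SetReminder", re.compile(r"^remind me (?P<task>.+)$")),
    ("GetWeather", re.compile(r"^what's the weather in (?P<city>.+)$")),
]

def interpret(text):
    """Return (intent, slots) for the utterance, or None if no rule matches."""
    normalized = text.lower().strip().rstrip("?.!")
    for intent, pattern in RULES:
        match = pattern.match(normalized)
        if match:
            # Keep only the slots that were actually filled in.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, slots
    return None  # no rule matched: treat the utterance as out of domain

print(interpret("play remind me"))
print(interpret("play like a prayer by Madonna"))
print(interpret("play like a prayer buy Madonna"))
print(interpret("tonight on the evening news"))
```

Anchoring the patterns at the start of the utterance keeps 'play remind me' a music request, and unmatched text is rejected. The misheard 'buy' still slips through as part of the song title, which is the kind of edge case that rules alone cannot fix.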
Expectations of Alexa
Ultimately, Alexa has to reply with accurate results, and its voice has to be loud enough to hear. The speech engine builds its output from small units of speech. And finally, Alexa needs to sound more natural.
Echo with a magical feel
The amazing Echo is here, but researchers are working constantly to improve its speech recognition software. More specifically, they are trying to detect the emotion in a person's voice. How Amazon Alexa works is expected to keep improving in the near future. Eventually, Alexa will be able to hold a conversation, remembering what a person has said previously and applying that knowledge to subsequent interactions, taking the Echo from highly effective to magical.