The Amazon Echo speaker and accompanying Alexa voice assistant were released in November 2014 to much acclaim. It's one of those devices that takes a number of existing technologies and combines them into something that feels magical. If you've ever jealously watched the Star Trek crew interact with their computer via voice commands, you know what I'm talking about. The Echo gets us closer to that than anything that has come before.
However, this article is not about the impressiveness of the Echo nor the privacy concerns that its approach to cloud-based voice recognition creates. This article is about what's involved in developing IoT applications for the Echo - called "skills" in Amazon parlance.
Let's start with one significant limitation - for the most part, Echo skills use a cloud-to-cloud integration architecture. This means that your custom skill will need to be exposed via an Internet-accessible endpoint with all the security concerns that implies. This is a significant impediment to the Echo becoming a true IoT hub device since you can't talk to local devices directly.
The basic model is:

1. You send a voice command to your Echo.
2. The Echo sends the captured audio to the Amazon cloud to perform voice recognition.
3. The Amazon cloud associates the voice command with a custom skill and invokes a custom HTTPS endpoint or AWS Lambda function (depending on how you've configured your skill).
4. Your custom code processes the request and responds with a success or failure (including what the Echo should say to the user in response).
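To make the last two steps concrete, both the intent request and your response are JSON documents. Here is a pared-down sketch in Python; the real payloads carry more fields, and the `TurnOnIntent` / `Device` names are hypothetical examples from a custom interaction model, not part of any pre-defined schema.

```python
import json

# A pared-down example of the intent request JSON that the Amazon cloud
# sends to your endpoint or Lambda function. "TurnOnIntent" and the
# "Device" slot are hypothetical names from a custom interaction model.
intent_request = {
    "version": "1.0",
    "session": {"user": {"userId": "amzn1.ask.account.EXAMPLE"}},
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "TurnOnIntent",
            "slots": {"Device": {"name": "Device", "value": "kitchen light"}},
        },
    },
}

# A minimal success response telling the Echo what to say back to the user.
def build_response(speech_text):
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": True,
        },
    }

device = intent_request["request"]["intent"]["slots"]["Device"]["value"]
response = build_response("Okay, turning on the %s." % device)
print(json.dumps(response, indent=2))
```

The `outputSpeech` text is what the Echo speaks back to the user, which is why even a "fire and forget" command like turning on a light needs a response body.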
It's worth noting that this approach does not appear to be driven by technical limitations of the Echo hardware - other companies have created skills that communicate with local Wi-Fi devices (Philips Hue lights being a notable example). The capability clearly exists, but it isn't part of a public API - so unless your company operates at Philips' scale, cloud-to-cloud is currently the only way to go. I can't say this limitation is surprising given Amazon's position as the leading cloud provider.
Smart Home vs. Custom Skills
An Echo skill has what is called an "interaction model". The interaction model defines how you map voice utterances to commands - called "intents" in Amazon parlance. There are two types of skills you can create: "smart home" or "custom".
A smart home skill has a pre-defined interaction model that currently knows how to deal with lights and thermostats (Amazon says more device types will be implemented in the future). If your skill only needs to interact with those two types of hardware, you're good to go. From a user interaction perspective, a smart home skill facilitates searching for controllable devices in the Amazon Alexa app and identifying the ones the Echo will control. Voice control becomes a simple matter of the user saying things like "Alexa, turn on the kitchen light". Because the user explicitly set it up in the Alexa app, the Echo knows that "kitchen light" requests are serviced by your smart home skill.
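Because the interaction model is pre-defined, a smart home skill never sees free-form intents at all - the Amazon cloud sends it directives for the supported device operations. A rough sketch of handling one, assuming the header/payload directive shape of the v2-era smart home API (abbreviated here; the `applianceId` value is a hypothetical example):

```python
# Rough sketch of a smart home skill handler. Smart home skills receive
# pre-defined directives rather than custom intents; the header/payload
# shape below follows the v2-era smart home API, abbreviated for illustration.
def handle_smart_home(event):
    name = event["header"]["name"]
    appliance_id = event["payload"]["appliance"]["applianceId"]
    if name == "TurnOnRequest":
        # ...call your cloud API here to switch appliance_id on...
        confirmation = "TurnOnConfirmation"
    elif name == "TurnOffRequest":
        # ...call your cloud API here to switch appliance_id off...
        confirmation = "TurnOffConfirmation"
    else:
        raise ValueError("unsupported directive: %s" % name)
    return {
        "header": {
            "namespace": "Alexa.ConnectedHome.Control",
            "name": confirmation,
            "payloadVersion": "2",
        },
        "payload": {},
    }

directive = {
    "header": {"namespace": "Alexa.ConnectedHome.Control",
               "name": "TurnOnRequest", "payloadVersion": "2"},
    "payload": {"appliance": {"applianceId": "kitchen-light-1"}},
}
reply = handle_smart_home(directive)
```

Note that "kitchen light" never appears in your code - the mapping from the spoken name to the appliance ID was established when the user discovered and named the device in the Alexa app.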
A custom skill is much more flexible. Among other things, it defines both an "invocation name" and a custom interaction model. The invocation name is used by Amazon to identify the appropriate skill to route voice commands to. The drawback here is that the voice command becomes a bit wordier. Instead of "Alexa, turn on the kitchen light" the user will need to say "Alexa, tell Hobson to turn on the kitchen light" where "Hobson" is the invocation name. The custom interaction model requires you to think through every variation of how a user might say a command and include it in your skill definition. It's a tedious process but a worthwhile price to pay if you need more capability than a smart home skill can provide.
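To make the "every variation" point concrete, here is what a fragment of a custom interaction model might look like, sketched as Python data for illustration (in practice the intent schema is a JSON document and the sample utterances are a plain-text list; `TurnOnIntent`, `TurnOffIntent`, `Device`, and the utterances themselves are all hypothetical):

```python
# Hypothetical fragment of a custom interaction model. The intent schema
# declares the intents and their slots; the sample utterances enumerate
# the phrasings a user might say, each mapped to an intent name.
intent_schema = {
    "intents": [
        {"intent": "TurnOnIntent",
         "slots": [{"name": "Device", "type": "DEVICE_TYPE"}]},
        {"intent": "TurnOffIntent",
         "slots": [{"name": "Device", "type": "DEVICE_TYPE"}]},
    ]
}

# Every phrasing has to be spelled out by hand - this is the tedious part.
sample_utterances = [
    "TurnOnIntent turn on {Device}",
    "TurnOnIntent turn {Device} on",
    "TurnOnIntent switch on {Device}",
    "TurnOnIntent switch {Device} on",
    "TurnOffIntent turn off {Device}",
    "TurnOffIntent turn {Device} off",
]
```

Multiply that by every command your skill supports and you can see why building the interaction model is the grunt work of custom skill development.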
According to Amazon, you can also take a hybrid approach using a smart home skill to take care of lights & thermostats and a separate custom skill to take care of everything else.
I mentioned earlier that there are currently two choices when integrating your custom skill code with the Echo -- an HTTPS endpoint or an AWS Lambda function. Essentially, when AWS processes a voice request, it uses the skill's interaction model to convert the speech into a JSON document that describes what the user is trying to do - the "intent request".
If you take the custom HTTPS endpoint approach, the Amazon cloud will post the intent request to your custom URL. That URL will need to know how to directly process Alexa intent requests and provide the appropriate JSON response.
If you take the AWS Lambda approach, the Amazon cloud will automatically invoke a custom Lambda function you define with the intent request. That function can, in turn, service the request itself, make a request to your API, etc. I tend to view the Lambda approach as a way to create a lightweight bridge that converts Echo-specific requests into custom API requests and converts custom API responses into Echo-specific responses.
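A minimal sketch of that bridge pattern in Python - the `call_my_api` function is a hypothetical stand-in for an authenticated client for your own service, and the intent and slot names are likewise illustrative:

```python
# Sketch of an AWS Lambda handler acting as a thin bridge between
# Alexa intent requests and a custom backend API. "TurnOnIntent" and
# call_my_api() are hypothetical; substitute your own intents and client.
def call_my_api(action, device):
    # Placeholder for an authenticated request to your IoT cloud API.
    return {"ok": True}

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "IntentRequest":
        intent = request["intent"]
        if intent["name"] == "TurnOnIntent":
            device = intent["slots"]["Device"]["value"]
            result = call_my_api("turn_on", device)
            text = ("Okay, %s is on." % device
                    if result["ok"]
                    else "Sorry, I couldn't reach %s." % device)
        else:
            text = "Sorry, I don't know how to do that."
    else:
        text = "Welcome."
    # Convert the API result back into an Echo-specific response.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

event = {"request": {"type": "IntentRequest",
                     "intent": {"name": "TurnOnIntent",
                                "slots": {"Device": {"value": "kitchen light"}}}}}
result = lambda_handler(event, None)
```

Keeping the Lambda this thin means all the real logic stays in your own API, where you can test and version it independently of the skill.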
It's a good assumption (I hope) that IoT service providers don't have unprotected URL endpoints sitting around. Most likely there will be some form of authentication required to invoke those endpoints and incoming API requests are generally associated with a specific user. How then do you map an incoming Echo request (associated with an Amazon user) to a user in your system? This is where account linking comes in.
First off, account linking assumes that your API is OpenID Connect (OIDC) enabled - specifically that it supports the "Authorization Code" or "Implicit" grant types (Smart Home skills currently only support Authorization Code). If you meet that requirement, a user can use the Amazon Alexa app to perform an OIDC login to your identity provider and the Amazon cloud will retain the necessary tokens to include in intent requests that it sends to your endpoint or Lambda function. It should also handle using a refresh token to renew access tokens where necessary.
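Once linking succeeds, the user's access token for your identity provider arrives inside each custom-skill intent request under `session.user.accessToken`, and your bridge can forward it to your API as a standard bearer token. A sketch, with the event shape trimmed down and the token value obviously fake:

```python
# After account linking, custom-skill intent requests carry the user's
# OAuth access token under session.user.accessToken. The bridge can
# forward it to your API as a standard bearer token.
def extract_bearer_header(event):
    token = event.get("session", {}).get("user", {}).get("accessToken")
    if token is None:
        # No linked account yet - the skill should respond by prompting
        # the user to link their account in the Alexa app.
        return None
    return {"Authorization": "Bearer %s" % token}

event = {"session": {"user": {"userId": "amzn1.ask.account.EXAMPLE",
                              "accessToken": "example-oidc-access-token"}}}
headers = extract_bearer_header(event)
```

Handling the missing-token case explicitly matters: a user can invoke your skill before linking, and your code needs a graceful answer rather than a failed API call.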
This is the one area where Amazon seems to be struggling. When things go wrong, it's not always easy to tell what is happening and Amazon's tools that allow developers to troubleshoot problems directly are sorely lacking. That leaves you in the hands of the Alexa developer support staff to help you through certain problems. My experience in this area has been awful.
For example, I opened a support ticket with Amazon 5 months ago because the account linking feature was not working. The problem occurs in their backend in an opaque manner that prevents any troubleshooting on my part. To this day, I'm still going back and forth with them on this problem with no resolution. It usually takes 4-5 days between responses, and most of the time they ask me to try the same things I've already tried numerous times or point me to the same developer web pages over and over, asking me to make sure I'm following their guidelines correctly. Meanwhile, I've spent hundreds of dollars running AWS infrastructure so they can troubleshoot the problem end-to-end with an account I created specifically for them. It honestly feels like they have no real interest in resolving the issue and just want to make it look like the support ticket isn't stagnant.
If you're willing to accept its limitations and already have IoT cloud infrastructure in place, the Echo is an amazing interface for users to interact with. If you do plan to integrate your service with it, you'll need to allow AMPLE time to get things working. And if you have to work with Alexa developer support for any reason, prepare to have your project timelines significantly blown out.
I'm hopeful that the impending release of Google Home (and Apple's rumored entry into the space) will create some competition and motivate Amazon to close some of the gaps in their ecosystem and better support developers in getting their applications working.