Twilio does speech recognition and understanding, the right way

Twilio’s CEO Jeff Lawson never wanted to do speech recognition. So what has changed?

Credit: Adolphe Bitard

We’ve all had horrendous experiences with voice recognition when calling a support center – I’d like to think that it’s just me with my slightly unusual Kiwi accent, but everyone I talk to has similar stories of getting exasperated at an automated call center that hopelessly gets even the most basic speech recognition exercises wrong. It’s a sad reality of the modern world that organizations try to shoehorn users into solutions that aren’t yet fit for purpose, just to save some costs.

The world of communications has been the focus of Jeff Lawson for the past few years. Lawson is founder and CEO of Twilio, the company that offers a modular communications platform that developers use to power the communication functions of their apps. Everyone from tiny startups to huge companies like Uber relies on Twilio to manage all the communication stuff, so that they don’t have to. In a phone conversation prior to Twilio’s annual developer conference, Signal, Lawson told me that ever since the beginnings of Twilio, back when all they did was voice communications, he has hated voice recognition.

This was somewhat problematic, since his customers would often ask him for speech recognition functionality to build in with the other tools that Twilio offers. Lawson, however, stuck to his guns – since, in his view, existing speech recognition offerings were really expensive, time-consuming to implement and wildly inaccurate to boot, he was loath to go anywhere near them.

Fast forward to today and we see that speech recognition has come a long way – the three big public cloud vendors, AWS, Google and Microsoft, all have their own speech recognition-powered products (Alexa, Home and Cortana respectively) and have trained their voice recognition models to extremely high levels of accuracy – indeed, most speech recognition products from large technology vendors now report accuracy measures that better humans’ abilities.

And so, the change in the technological building blocks, together with the improvement in the accuracy and efficiency of speech recognition and understanding, has made Lawson take stock and reassess his position. As he puts it, the time is now right for Twilio to introduce products that leverage the high-quality solutions that cloud vendors are now offering. To this end, Lawson is announcing on stage that Twilio will offer two new products – Speech Recognition and Natural Language Understanding. The obvious effectiveness of products such as Amazon Alexa and Google Home (alongside Microsoft’s early forays into Cortana-like functionality on physical devices) has encouraged Lawson to take these building blocks and wrap them into the Twilio platform. On launch, Speech Recognition will offer a single-line-of-code integration that will support 89 different languages and dialects.
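The article does not show what that integration looks like. As a rough sketch only, assuming it is exposed through TwiML's `<Gather>` verb with `input="speech"`, the snippet below builds such a response using Python's standard library. The helper name `build_speech_gather`, the prompt text and the `/handle-speech` callback path are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_speech_gather(prompt, action_url, language="en-US"):
    """Return a TwiML document that asks the caller a question and
    collects a spoken answer (hypothetical helper, not a Twilio API)."""
    response = ET.Element("Response")
    # The speech request itself is this one element: input="speech".
    gather = ET.SubElement(
        response,
        "Gather",
        {"input": "speech", "action": action_url, "language": language},
    )
    # <Say> nested inside <Gather> reads the prompt while listening.
    say = ET.SubElement(gather, "Say")
    say.text = prompt
    return ET.tostring(response, encoding="unicode")


# Assumed callback path; the transcription would be posted there.
twiml = build_speech_gather("How can we help you today?", "/handle-speech")
print(twiml)
```

Changing the `language` argument is, under the same assumption, how one of the supported languages or dialects would be selected.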

I was interested to know, given the investments that all three big public cloud vendors (as well as IBM) are putting into this space, who Twilio is using for this initial offering. Lawson informed me that Google got the big tick. While he doesn’t dismiss the possibility of leveraging other vendors going forward, his assessment was that, at least today, Google offers the best options for Twilio’s specific use case. As he explained, a telephone use case requires speech recognition algorithms to be trained differently: 11 kHz, 8-bit mono audio is a different problem from ambient speech, and Google seems to have done the best job of solving it so far.

Another issue that anyone in the speech recognition field needs to face is that single-word answers are easy, but conversational language is a very different story. That is where Twilio’s Natural Language Understanding product, due out later in the year, comes in. This engine can understand intent and map it to code that acts on that particular intent. Thus the core intent of a spoken sentence, rather than the words themselves, is what is important. As an example, “Hey, do you think it’d be possible to maybe change a flight I have booked for Monday night to the next day?” becomes “intent: change flight, from: Monday PM, to: Tuesday PM.”
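To make the idea of mapping an intent to code concrete, here is a minimal sketch of the pattern. It is not Twilio's API: the result dictionary shape, the `HANDLERS` table and the function names are all assumptions made for illustration, with the parsed flight example above used as input.

```python
def change_flight(slots):
    """Hypothetical handler: act on a 'change flight' intent."""
    return f"Changing flight from {slots['from']} to {slots['to']}"


# Assumed lookup table from intent name to the code that acts on it.
HANDLERS = {"change_flight": change_flight}


def dispatch(result):
    """Route a parsed utterance to its handler based on intent alone,
    ignoring the exact words the caller used."""
    handler = HANDLERS.get(result.get("intent"))
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(result.get("slots", {}))


# Assumed shape of what an understanding engine might hand back.
result = {
    "intent": "change_flight",
    "slots": {"from": "Monday PM", "to": "Tuesday PM"},
}
reply = dispatch(result)
print(reply)
```

Because the handler only sees the intent and its fields, the same code could in principle serve a phone call, an SMS or any other channel that produces the same parsed result.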

The other thing that is important to note with regard to Twilio Understand is that it is a distinct product and hence can be used as an engine for other use cases. Understand can, for example, be applied to an SMS message to get a more accurate read on what the message really means. It can be trained once, and thereafter act across many channels – it can even be used to power Amazon Alexa, a deliciously ironic fact given that Google’s natural language understanding will be used to power actions on arch-rival Amazon’s device.

Lawson told me that, in his view, the Alexa documentation is really complex, and developers need to write an application multiple times. With Twilio, however, they can write once and Twilio pushes to the relevant platforms. Understand also uses Twilio’s TwiML markup language, so the same verbs that developers use to power a Twilio phone call can power Twilio Understand.

While speech recognition might be a bit of a departure from Lawson’s formerly held beliefs, it is a logical one and one which Twilio customers are likely to appreciate.

This article is published as part of the IDG Contributor Network.