We all know Google’s speech transcription technology is really, really, really good. Not only is it the best in the industry, it works without a data connection: Pixels have been transcribing audio on-device for some time now, thanks to Google’s extremely impressive transcription algorithms and the machine learning hardware in its smartphones. But accuracy isn’t everything when it comes to transcription, even if it is the single most important feature—speed matters too.
A video posted by James Cham on Twitter pits a Pixel 3 against an iPhone 11 (which has a much more powerful processor, I might add), using both to transcribe his voice in real time (the iPhone is using iOS’s built-in transcription, not Gboard’s—just to be clear). The difference becomes immensely apparent within seconds: the Pixel 3 displays the words within a moment of Cham saying them, while the iPhone stutters, struggles to get the words right, then fixes them, and often pauses before spitting out a huge string of words after a long delay. By the end of the video, the iPhone is a full six seconds behind the Pixel 3. The iPhone’s transcription also contains, by my count (not including the text at the beginning that Cham erroneously added), at least five significant errors that the Pixel’s does not.
I don’t think that people appreciate how different the voice to text experience on a Pixel is from an iPhone. So here is a little head to head example. The Pixel is so responsive it feels like it is reading my mind! pic.twitter.com/zmxTKxL3LB
— James Cham ✍🏻 (@jamescham) May 27, 2020
But Cham’s point isn’t about accuracy, even if accuracy is still incredibly important—it’s about how the way we talk and the speed at which we speak shape our experiences with computers. If a computer can easily keep up with your speech in real time, it becomes much easier to spot errors or change your mind about what you’d like to say as you monitor its progress, making the whole interaction feel far more natural. It’s a bit like asking a stenographer to take notes versus writing them yourself: with the former, you always have to ask for things to be read back, and that takes time; with the latter, you have total control. In the transcription example above, you feel more freedom on the Pixel to go back and restructure a sentence or choose another word, whereas the iPhone is so far behind that, as you wait for it to catch up, you may well lose your train of thought (or just keep going for fear of losing it). As one reply puts it: speed is a feature.
There are other use cases that real-time voice transcription will likely enable down the road, too; they’re just not as easy to articulate yet. But I’ve long held the belief that the children growing up right now will be the first to live in a world where talking to computers is the rule, not the exception, thanks to the rapid rise of smart speakers like the Amazon Echo and Google Home. Much as the very first computer mice and GUI-first OSes were probably pretty weird and seemingly inefficient interaction paradigms for those who used the early personal computers of the 1970s and early ’80s, voice interaction has faced a lot of skepticism over the years. And frankly, that skepticism was deserved: early speech recognition was legitimately terrible (BMW’s much-hated iDrive, for example, debuted with it in 2001)! But I think it’s increasingly clear the technology is coming into its own, and that we’re going to experience a legitimate shift in the way most people use computers as a result.
From an accessibility perspective, speed is also a hugely relevant issue in voice recognition. For people who interact with computers primarily by speaking to them, a computer that quickly understands their speech creates a much more natural interface—one that feels less like asking a Magic 8-Ball a series of queries and hoping it’ll give you what you want, and more like (if not nearly as good as, yet) Star Trek: The Next Generation. Faster responses mean people are more likely to ask questions in the first place, and a big part of that speed equation is the time it takes a computer to understand what you’ve said.
Anyway, I thought this video provoked some pretty interesting thoughts about voice control, speech, interactions with computers in general, and where it all means we’re headed. I also enjoyed yet another example of Google absolutely whooping Apple on all things AI.