Connection Problem S03E25: Ghost voices
Sitrep: I'm back from workshops and a ThingsCon Salon in Antwerp, Belgium, where I spent two days thinking through questions around IoT, ideation, design processes and trust, courtesy of our lovely host Dries de Roeck. I had a blast and my mind is still buzzing with conversations.
×
You know you're in Belgium when there are six (!) types of breakfast chocolate.
×
As always, a shout-out to tinyletter.com/pbihr or a forward is appreciated!
×
Personal updates
Between travel, catching up between trips, and lots of time in conference calls, not much else has been happening: that's good. Travel and productivity don't always go well together, but this was a productive week by any measure. Plus, in the last few days some interesting threads have emerged around voice, an area I've been pretty interested in recently...
×
Ghost voices
Today's pensive scratching of the back of my head centers on voice assistants and AI. A range of things crossed my radar this week that align almost too perfectly. Let's start by putting the pieces on the table:
- At Google's developer conference I/O, CEO Sundar Pichai showed a very impressive demo of a new voice recognition and synthesis system called Duplex (Google's blog post about Duplex). Google Duplex can make complex, fully autonomous calls for you, automating tasks like booking a table at a restaurant in real time by calling a real person and negotiating the details.
- AI & creativity researcher Samim wrote about the role that botnets could play once artists and other creatives had easy access to it (think fake news, campaigns, etc.), and how it could upend society in all kinds of interesting ways. I happened to disagree with some of his analysis, but if you swapped out "botnet" for "voice tech at the level of Duplex", I think we're on to something both very interesting and potentially very dark.
- Researchers at Berkeley showed they could embed commands for voice assistants in music that humans can't hear.
So first up, I saw quite a lot of folks discussing how bad it is that Google tricks people into thinking they're talking to a human. I can't vouch for it, but I don't think we necessarily have much reason to assume this will happen when Duplex rolls out; for the demos, it would have killed the effect to start with "you're talking to a computer". (I might be wrong.) Also, I found it almost eerie how scripted the humans in the conversations linked from Google's blog post sounded. (And I wonder how different this neural network's output will sound in other languages. But I digress.)
So this demo is very niche so far, and it's very impressive. This being Google, the recurrent neural network that makes up the core of Duplex will be available to researchers and creatives (minus the conversational training, presumably, but who knows). So we might see this rolled out in decidedly less controlled circumstances and contexts very soon, in the very way that Samim indicates in his post about botnets. Think "deepfakes, but for phone calls" and I'll leave it to your brain to fill in your personal horror scenario. You're welcome.
Which brings us to the third puzzle piece, the Berkeley researchers who embedded inaudible audio instructions for voice assistants in music. This is reminiscent of the so-called DolphinAttack a while back, in which voice commands were given to smart home assistants at frequencies too high for people to hear, but only over very short distances. This takes that to the next level: the voice commands are embedded in music by tricking the recognition algorithm itself. Strictly speaking this isn't a GAN but an adversarial example, the audio equivalent of the subtly altered images that fool image classifiers:
Computers can be fooled into identifying an airplane as a cat just by changing a few pixels of a digital image, while researchers can make a self-driving car swerve or speed up simply by pasting small stickers on road signs and confusing the vehicle’s computer vision system. With audio attacks, the researchers are exploiting the gap between human and machine speech recognition. Speech recognition systems typically translate each sound to a letter, eventually compiling those into words and phrases. By making slight changes to audio files, researchers were able to cancel out the sound that the speech recognition system was supposed to hear and replace it with a sound that would be transcribed differently by machines while being nearly undetectable to the human ear.
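To make that mechanism a bit more concrete, here's a minimal sketch of how such a targeted audio attack is typically set up. Everything here is illustrative: SpeechToTextModel stands in for any differentiable speech recognizer that emits per-frame character logits, and the loop simply nudges a tiny perturbation until the model "hears" the attacker's phrase while the audio barely changes.

```python
# Hedged sketch of a targeted adversarial-audio attack, in the spirit of the
# Berkeley work described above. `model` is a placeholder for a differentiable
# speech-to-text network; `target_ids` is a (1, S) LongTensor of character
# indices for the hidden command. Not the researchers' actual code.
import torch
import torch.nn.functional as F

def craft_adversarial_audio(model, waveform, target_ids, steps=1000,
                            lr=1e-3, penalty=1e-2):
    """Find a small perturbation `delta` so that `waveform + delta`
    transcribes as `target_ids` while staying nearly inaudible."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(waveform + delta)          # shape: (time, batch, chars)
        log_probs = F.log_softmax(logits, dim=-1)
        input_lengths = torch.tensor([logits.size(0)])
        target_lengths = torch.tensor([target_ids.size(1)])
        # CTC loss pulls the transcription toward the attacker's phrase...
        loss = F.ctc_loss(log_probs, target_ids, input_lengths, target_lengths)
        # ...while the L2 penalty keeps the perturbation close to inaudible.
        loss = loss + penalty * delta.pow(2).mean()
        loss.backward()
        optimizer.step()
        # Keep the perturbation small enough to stay under human perception.
        with torch.no_grad():
            delta.clamp_(-0.05, 0.05)
    return (waveform + delta).detach()
```

The key point is in that second loss term: the attack isn't brute force, it's an optimization that trades off "machine hears the command" against "human hears nothing unusual".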
So our "human" mental defenses are diminished by being more easily tricked into believing one specific person called them when in reality they didn't. Now if these GANs can trick our (not-so-smart smart) assistants, too, there's another line of defense that's gone. Taken together, this leaves us in an awkward and vulnerable spot: We can't trust our voice-enabled devices not to act on someone else's behalf, and we might have to verify the identity of voice calls.
That is, we need technical safeguards to outsmart these adversarial attacks, and we need defensive CAPTCHAs for voice. On the plus side, I've been rambling about the need for AI agents acting on our behalf for a while, as in: if a travel booking site has bots analyzing my behavior to maximize the site's margin at my expense, then I should also have an agent trying to find the best deal for me, one I can trust to act exclusively in my interest.
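I don't know what such a voice CAPTCHA would actually look like, but one plausible shape is plain old challenge-response: before trusting a call, your device asks the caller's agent to sign a fresh random challenge with a key the two sides exchanged earlier over a trusted channel. A minimal sketch, with all names and the flow purely illustrative rather than any existing protocol:

```python
# Hedged sketch of a challenge-response "CAPTCHA for voice". Assumes both
# parties already share a secret key from an earlier trusted exchange.
import hashlib
import hmac
import secrets

def issue_challenge() -> str:
    # Callee generates a fresh random nonce and sends it to the caller.
    return secrets.token_hex(16)

def respond(challenge: str, shared_key: bytes) -> str:
    # A legitimate caller's agent signs the challenge with the shared key.
    return hmac.new(shared_key, challenge.encode(), hashlib.sha256).hexdigest()

def verify(challenge: str, response: str, shared_key: bytes) -> bool:
    # Callee checks the signature with a timing-safe comparison.
    expected = respond(challenge, shared_key)
    return hmac.compare_digest(expected, response)

# Usage: a synthesized voice can imitate a person perfectly, but without
# the key it cannot produce a valid response to a fresh challenge.
key = b"previously-exchanged-secret"
challenge = issue_challenge()
assert verify(challenge, respond(challenge, key), key)
```

The appeal of this shape is that it sidesteps the arms race entirely: instead of trying to detect synthetic voices (which keep getting better), it anchors trust in something a voice model can't fake.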
This is going to be an interesting space for a while to come. (Also, cough cough, trustmark for IoT.)
×
Things that caught my attention
Amber Case writes about calm technologies and how Google is beginning to adopt calm technology design principles. A quick, easy read with some useful links for deeper dives into digital well-being. It won't surprise you that I'm particularly interested in an article linked from there about how to reclaim the relationship with our digital tools; it also won't surprise you that I haven't read it yet and instead stuffed it into my ludicrous to-read list.
One more from Samim: "Ad-Blocking is only the very first generation of what could be called 'Reality Tunnel Management' (RTM) Applications. It will become a very large industry, propelled by the rise of Augmented Reality and Generative Media." I wasn't familiar with the theory of reality tunnels, a term apparently coined by Timothy Leary. In comms science terms, it seems a close equivalent to fragmented public debate (#fakenews). The framing as Reality Tunnel Management (the filtering, framing and manipulation of content, but also of attention and perception) is fascinating.
Douglas Adams's 1990 documentary Hyperland (Hyperland on YouTube) about the web, or more specifically hyperlinks and hypermedia, is fascinating. Thanks to Dan Hon for sharing this.
×
I wish you an excellent weekend.
Yours truly,
Peter
PS. Please feel free to forward this to friends & colleagues, or send them to tinyletter.com/pbihr
×
Who writes here? Peter Bihr explores the impact of emerging technologies — like Internet of Things (IoT) and artificial intelligence. He is the founder of The Waving Cat, a boutique research, strategy & foresight company. He co-founded ThingsCon, a non-profit that fosters the creation of a responsible Internet of Things. In 2018, Peter is a Mozilla Fellow. He tweets at @peterbihr. Interested in working together? Let’s have a chat.
×
This picture via the beautiful Public Domain Review.