Speech Recognition Tech Falls Prey to Secret Messages

By Nathaniel Scharping
Jan 30, 2018 12:37 PM | Mar 21, 2023 8:28 PM



You hear one thing, but the computer hears another. What’s going on here?

Two researchers from the University of California, Berkeley have exploited the technique computers use to decode human speech to hide messages inside snippets of audio. When the audio is fed to a speech recognition program like Mozilla’s DeepSpeech, the program transcribes the hidden message instead of the sounds we hear.

Do You Hear What I Hear?

The method involves hiding a quiet sample of the audio you actually want transcribed within a different portion of audio. The “secret message” registers to humans as nothing more than a bit of background noise, but because of the way computers process audio, they pick up on the hidden audio clearly. In a paper posted to the preprint server arXiv, the researchers describe how they were able to manipulate DeepSpeech every single time they hid messages inside an audio sample.

It has to do with how machine learning algorithms recognize speech. Considering the full range of possible letter combinations that each audio sample could potentially contain is prohibitively difficult, so algorithms calculate what amounts to an educated guess. An algorithm will map each bit of audio it samples to a probability distribution of possible letters and characters, and pick the most likely. Training the algorithm on many different audio samples is what lets it get good at guessing the correct one.

Computer Vs. Human

The researchers are able to exploit this system of educated guesses by creating audio that tips the computer’s decision in favor of the words they want to be transcribed, instead of the message that it’s hidden inside. And, in a tactic similar to how algorithms are trained, the researchers’ program tries out many different variations of the same audio sample until the version carrying their message sounds nearly identical to the original, even though the transcribed words are completely different.
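To make "tipping the decision" concrete, here is a minimal sketch on a toy model. Everything in it is hypothetical: the three-class linear "recognizer" `W`, the four-number "audio" `x`, and the step size and limit `eps` are invented, and the gradient-style search is an assumption about how such variations might be tried, not the researchers' actual program.

```python
import math

# Hypothetical 3-class linear "recognizer" over 4 audio features.
W = [
    [1.0, 0.0, -1.0, 0.5],
    [0.0, 1.0, 0.5, -1.0],
    [-1.0, 0.5, 1.0, 0.0],
]


def scores(x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]


def predict(x):
    s = scores(x)
    return s.index(max(s))


def craft(x, target, eps=0.2, lr=0.1, steps=200):
    """Search for a small change d (each entry within +/- eps) that makes
    the model favor `target`. Each step tries a slightly different variation."""
    d = [0.0] * len(x)
    for _ in range(steps):
        p = softmax(scores([xi + di for xi, di in zip(x, d)]))
        # Gradient of -log p[target] with respect to the input.
        err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
        grad = [sum(err[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
        # Step downhill, then clip so the change stays quiet.
        d = [max(-eps, min(eps, di - lr * gi)) for di, gi in zip(d, grad)]
    return d


x = [0.3, 0.2, 0.1, 0.1]          # original "audio"
d = craft(x, target=2)            # quiet perturbation
adv = [xi + di for xi, di in zip(x, d)]
print(predict(x), predict(adv))   # model's answer before and after
```

The clipping step plays the role of keeping the hidden message down at the level of background noise: the input barely moves, but the model's top guess flips.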

The researchers tested their work on 100 snippets of audio from Mozilla’s Common Voice dataset, and they say it worked every time. They were even able to hide text inside audio with no speech, such as a snippet of classical music. And because DeepSpeech samples audio many times a second, the hidden text can be much longer than what’s actually heard, up to a limit of 50 characters per second of audio.
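The 50-characters-per-second ceiling makes for simple arithmetic. A quick sketch, where the function names and the sample phrase are made up for illustration and only the rate comes from the figure quoted above:

```python
MAX_CHARS_PER_SECOND = 50  # limit quoted in the article


def max_hidden_chars(duration_seconds, rate=MAX_CHARS_PER_SECOND):
    """Upper bound on hidden characters for a clip of the given length."""
    return int(duration_seconds * rate)


def fits(message, duration_seconds):
    """True if the message is short enough to hide in the clip."""
    return len(message) <= max_hidden_chars(duration_seconds)


print(max_hidden_chars(3))  # 150
print(fits("open the front door", 1))
```

So a three-second clip could, at most, carry a 150-character hidden transcript, far more than anyone could speak aloud in that time.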

Hidden audio could be used to sneak messages past human listeners, or to fool computer transcription programs. But it might not necessarily be so easy to hack speech recognition programs. Because they used DeepSpeech, which has its code openly available, the researchers used what’s called a “white box” approach, which means that they knew everything about how the program works. Using a speech recognition program with unknown inner workings would make it much harder to hack. In addition, these examples are targeted specifically at DeepSpeech, so a different speech recognition program wouldn’t pick up on the hidden audio.


Copyright © 2024 Kalmbach Media Co.