Is It a Human or Computer Talking? Google Blurs the Lines

By Nathaniel Scharping
Jan 8, 2018 10:03 PM | Mar 21, 2023 8:28 PM
Image: Audio sound graph (Credit: Viktorus/Shutterstock)

Siri and Alexa are good, but no one would mistake them for a human being. Google’s newest project, however, could change that.

Called Tacotron 2, the latest attempt to make computers talk like people builds on two of the company’s most recent text-to-speech projects, the original Tacotron and WaveNet.

Repeat After Me

Tacotron 2 pairs the text-mapping abilities of its predecessor with the speaking prowess of WaveNet for an end result that is, frankly, a bit unsettling. It takes text and, based on training from snippets of actual human speech, maps the syllables and words onto a spectrogram, a visual representation of how the frequencies in a sound change over time. A vocoder based on WaveNet then turns the spectrogram into actual speech. Tacotron 2 uses a spectrogram with 80 dimensions per audio frame, which Google says is enough to capture not only the accurate pronunciation of words but the natural rhythms of human speech as well. The researchers report their work in a paper published to the preprint server arXiv.
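For readers who want to see what an 80-dimensional spectrogram looks like in practice, here is a minimal sketch in Python. It is not Google's code: it covers only the audio-to-spectrogram step, which is how training targets for a system like this are typically prepared, and leaves out both the neural network that predicts spectrograms from text and the WaveNet vocoder. The 80 channels come from the description above, and the 12.5-millisecond hop follows Google's published description of Tacotron 2. The 24 kHz sample rate, the 1,024-sample analysis window and every function name are assumptions made for illustration.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even, odd = fft(values[0::2]), fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, stored sparsely."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for left, center, right in zip(edges, edges[1:], edges[2:]):
        weights = {}
        for b in range(n_fft // 2 + 1):
            f = b * bin_hz
            w = min((f - left) / (center - left), (right - f) / (right - center))
            if w > 0.0:
                weights[b] = w
        bank.append(weights)
    return bank

def mel_spectrogram(signal, sample_rate=24000, n_fft=1024, hop_ms=12.5, n_mels=80):
    """Return one list of n_mels log-energies per audio frame."""
    hop = int(sample_rate * hop_ms / 1000)  # samples between frame starts
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [s * w for s, w in zip(signal[start:start + n_fft], window)]
        mags = [abs(c) for c in fft(chunk)[: n_fft // 2 + 1]]
        frames.append([
            math.log(max(1e-5, sum(w * mags[b] for b, w in filt.items())))
            for filt in bank
        ])
    return frames

# A quarter second of a 440 Hz test tone stands in for recorded speech.
tone = [math.sin(2 * math.pi * 440 * t / 24000) for t in range(6000)]
frames = mel_spectrogram(tone)
print(len(frames), len(frames[0]))  # 17 frames, each with 80 values
```

Each frame boils about a twentieth of a second of audio down to 80 numbers, and a compact grid of this kind is what the WaveNet-based vocoder turns back into sound.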

Most computer voice programs use a library of prerecorded syllables and words to construct sentences, an approach called concatenative synthesis. When humans speak, we vary our pronunciation widely depending on context; stitched-together recordings can’t, and this gives computer-speak its lifeless patina. What Google is attempting to do is get away from the repetition of words and sounds and construct sentences based on not only the words they’re made of, but what they mean as well. The program uses a neural network, a web of interconnected nodes, to identify patterns in speech and ultimately predict what will come next in a sentence, helping to smooth out intonation.
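The library-of-recordings approach can be sketched in a few lines of Python. This is a toy, not any real product's engine: the "recordings" are plain sine tones, and the names `UNIT_LIBRARY`, `fake_recording` and `synthesize` are invented for illustration. It shows why the result sounds repetitive: each word maps to exactly one stored clip, whatever sentence it appears in.

```python
import math

SAMPLE_RATE = 8000    # assumed sample rate for the toy clips
CLIP_SAMPLES = 1600   # every stored clip is 0.2 seconds long

def fake_recording(pitch_hz):
    """Stand-in for a prerecorded word: a plain sine tone."""
    return [
        math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
        for i in range(CLIP_SAMPLES)
    ]

# Hypothetical unit library: one fixed clip per word.
UNIT_LIBRARY = {
    "is": fake_recording(220),
    "it": fake_recording(330),
    "human": fake_recording(440),
}

def synthesize(text):
    """Build a sentence by gluing stored clips end to end."""
    audio = []
    for word in text.lower().split():
        audio.extend(UNIT_LIBRARY[word])  # same clip regardless of context
    return audio

statement = synthesize("it is human")
question = synthesize("is it human")
# "human" sounds identical in both, even though a person would say it
# differently at the end of a question.
print(statement[-CLIP_SAMPLES:] == question[-CLIP_SAMPLES:])
```

A system that predicts the sound of each word from its surroundings, as Tacotron 2 is described as doing, has no such fixed lookup table.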

The researchers back up their bluster with a bevy of examples posted online. Where WaveNet sounded accurate but a bit flat, Tacotron 2 sounds fleshed out and impressively varied.

The program can also handle complex, multisyllabic words with ease, and can be instructed to add stress to words or syllables to alter the interpretation of sentences. This means Tacotron 2 can phrase things as questions and correctly differentiate between words that are spelled the same but pronounced differently, as well as handle more subtle things like highlighting the subject of a sentence by adding emphasis to a word.

The final, and most compelling, test is a side-by-side comparison of a human and a computerized voice. Tacotron 2 scores 4.53 on the mean opinion score, a popular test of speech quality in which listeners rate samples on a scale of 1 to 5, the researchers say, compared to 4.58 for professionally recorded speech.
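A speech-quality score of this kind is, at bottom, an average of listener ratings. The sketch below shows the arithmetic in Python, assuming ratings on a 1-to-5 scale and a simple normal-approximation 95 percent confidence interval. The ratings are made up for illustration and are not data from the paper, and the function name is hypothetical.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mean rating, 95% confidence half-width) for 1-to-5 ratings."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must fall between 1 and 5")
    mean = statistics.mean(ratings)
    # Normal approximation; assumes the ratings are independent.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Invented ratings from five hypothetical listeners, not real results.
made_up_ratings = [5, 4, 5, 4.5, 4]
score, margin = mean_opinion_score(made_up_ratings)
print(f"{score:.2f} +/- {margin:.2f}")
```

With a gap as small as 4.53 versus 4.58, the margin of error matters: whether two systems truly differ depends on how many ratings went into each average.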

Although the program is impressive, it still has a few flaws. It can’t inject any emotion into its speech, and isn’t yet fast enough to produce audio in real time.

Copyright © 2024 Kalmbach Media Co.