
Is It a Human or Computer Talking? Google Blurs the Lines

By Nathaniel Scharping | January 8, 2018 10:03 PM

(Credit: Viktorus/Shutterstock)


Siri and Alexa are good, but no one would mistake them for a human being. Google’s newest project, however, could change that.

Called Tacotron 2, the latest attempt to make computers talk like people builds on two of the company’s most recent text-to-speech projects, the original Tacotron and WaveNet.

Repeat After Me

Tacotron 2 pairs the text-mapping abilities of its predecessor with the speaking prowess of WaveNet for an end result that is, frankly, a bit unsettling. It works by taking text and, based on training from snippets of actual human speech, mapping the syllables and words onto a spectrogram, a visual representation of audio waves. A vocoder based on WaveNet then turns that spectrogram into actual speech. Tacotron 2 uses a spectrogram with 80 different speech dimensions, which Google says is enough to recreate not only the accurate pronunciation of words but the natural rhythms of human speech as well. The researchers report their work in a paper published to the preprint server arXiv.
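Those 80 "speech dimensions" are frequency bands: each slice of time in the spectrogram is summarized as 80 numbers describing how much energy falls in each band (Tacotron 2's bands follow the mel scale, which is spaced the way human hearing perceives pitch). As a rough illustration of the idea, here is a minimal numpy sketch that turns a waveform into an 80-band spectrogram; the sample rate, window, and hop sizes are illustrative choices, not Tacotron 2's actual settings.

```python
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=80):
    # Short-time Fourier transform: slice the waveform into overlapping
    # windowed frames and take the magnitude spectrum of each frame.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft//2 + 1)

    # Triangular filterbank: n_mels filters spaced evenly on the mel scale,
    # so low frequencies get finer resolution than high ones.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[m - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)

    # One 80-dimensional column per time slice.
    return mag @ fbank.T                        # (n_frames, n_mels)

# A half-second 440 Hz test tone stands in for real speech.
t = np.arange(8000) / 16000
spec = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

Tacotron 2's first network predicts columns like these directly from text; the WaveNet vocoder then runs the process in reverse, generating a waveform that matches the predicted spectrogram.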

Most computer voice programs construct sentences from a library of recorded syllables and words, an approach called concatenation synthesis. When humans speak, we vary our pronunciation widely depending on context, and this is what gives computer-speak its lifeless patina. Google is attempting to move away from the repetition of stock words and sounds and instead build sentences based not only on the words they contain, but on what those words mean. The program uses a neural network, a web of interconnected nodes, to identify patterns in speech and predict what comes next in a sentence, helping to smooth out intonation.
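To see why concatenation synthesis sounds flat, here is a toy sketch of the approach; the "unit library" is faked with sine tones (real systems store thousands of recorded speech fragments and do smarter unit selection and cross-fading, but the core splicing step is the same).

```python
import numpy as np

SR = 16000  # sample rate in Hz

def tone(freq, dur=0.15):
    # Stand-in for a pre-recorded speech unit: a short sine tone.
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

# Toy unit library: one fixed recording per syllable.
UNITS = {"hel": tone(220), "lo": tone(330), "world": tone(440)}

def concatenate(syllables):
    # Concatenation synthesis: look up each unit and splice them end to
    # end. Every occurrence of a syllable sounds exactly the same no
    # matter its context, which is the source of the lifeless quality
    # described above. A neural model like Tacotron 2 instead generates
    # fresh audio conditioned on the whole sentence.
    return np.concatenate([UNITS[s] for s in syllables])

speech = concatenate(["hel", "lo", "world"])
print(len(speech))  # 7200 samples = 0.45 s of audio
```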

The researchers back up their bluster with a bevy of examples posted online. Where WaveNet sounded accurate but a bit flat, Tacotron 2 sounds fleshed out and impressively varied.

The program can also handle complex, multi-syllabic words with ease, and can be instructed to add stress to words or syllables to alter the interpretation of sentences. This means Tacotron 2 can phrase things as questions and correctly differentiate between homonyms, and it can manage subtler things like highlighting the subject of a sentence by adding emphasis to a word.

The final, and most compelling, test is a side-by-side comparison of a human and a computerized voice. Tacotron 2 earns a mean opinion score (MOS) of 4.53, a popular measure of speech quality, compared to 4.58 for professionally recorded speech, the researchers say.
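A mean opinion score is simply an average: listeners rate audio samples on a scale from 1 (bad) to 5 (excellent), and the ratings are averaged. The ratings below are invented for illustration, not the study's actual data.

```python
# Hypothetical listener ratings for one synthesized sample, on the
# standard 1-to-5 opinion scale.
ratings = [5, 4, 5, 4, 5, 5, 4, 4]

# The mean opinion score is the arithmetic mean of the ratings.
mos = sum(ratings) / len(ratings)
print(mos)  # 4.5
```

On this scale, a gap of 0.05 between Tacotron 2 and professional recordings means listeners rated the two nearly identically on average.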

Although the program is impressive, it still has a few flaws. It can’t inject any emotion into its speech, and isn’t yet fast enough to produce audio in real time.
