Google’s DeepMind brought us artificial intelligence systems that can play Atari classics and the complex game of Go as well as — no, better than — humans.
Now, the artificial intelligence research firm is at it again. This time, its machines are getting really good at sounding like humans.
In a blog post Thursday, DeepMind unveiled WaveNet, an artificial intelligence system that the company says cuts the gap between existing text-to-speech technologies and human speech by more than 50 percent. WaveNet learns from raw audio files and then produces digital sound waves that resemble those produced by the human voice, an entirely different approach from the one existing systems take.
The result is more natural, smoother-sounding speech, but that’s not all. Because WaveNet works with raw audio waveforms, it can in principle model any voice, in any language. WaveNet can even model music.
And it did. It’s pretty good at piano. Listen for yourself.
Someday, man and machine will routinely strike up conversations with each other. We’re not there yet, but natural language processing is a scorching hot area of AI research — Amazon, Apple, Google and Microsoft are all in pursuit of savvy digital assistants that can verbally help us interact with our devices.
Right now, computers are pretty good listeners, because deep learning algorithms have taken speech recognition to a new level. But computers still aren’t very good speakers. Most text-to-speech systems are still based on concatenative TTS — basically, cobbling words together from a massive database of sound fragments.
Other systems form a voice electronically, based on rules about how letter combinations are pronounced. Both approaches yield rather robot-y sounding voices. WaveNet is different.
Flexing Those Computing Muscles
WaveNet is an artificial neural network that, at least on paper, resembles the architecture of the human brain. Data inputs flow through layers of interconnected nodes, the “neurons,” to produce an output. This allows computers to process mountains of data and recognize patterns that would perhaps take humans a lifetime to uncover.
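For readers who want to see the idea of data flowing through layers of nodes, here is a minimal Python sketch. It is not WaveNet's architecture: the layer sizes, the weights and the tanh activation are all made-up assumptions, chosen only to keep the example tiny.

```python
import math

def dense_layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of every input, then applies tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the inputs through each layer in turn; one layer's output feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical hand-picked weights: 3 inputs -> 2 hidden nodes -> 1 output node.
# A real network learns these numbers from data instead.
LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([1.0, 0.5, -1.0], LAYERS)
print(output)  # a single number between -1 and 1
```

Training a network means adjusting those weights until the outputs match the examples it has been shown, which is where the mountains of data come in.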
To model speech, WaveNet was fed real waveforms of English and Mandarin speech. These waveforms are loaded with data points, roughly 16,000 samples per second, and WaveNet digests them all.
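To get a feel for that volume of data, a quick back-of-the-envelope calculation helps. The sketch below assumes a flat rate of 16,000 samples per second, the figure quoted above; the function names are ours, for illustration only.

```python
SAMPLE_RATE_HZ = 16_000  # samples per second, the figure quoted in the article

def samples_needed(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """How many individual samples make up a clip of the given length."""
    return round(duration_seconds * sample_rate_hz)

def sample_duration_ms(sample_rate_hz=SAMPLE_RATE_HZ):
    """How much time, in milliseconds, a single sample covers."""
    return 1000 / sample_rate_hz

print(samples_needed(1))      # 16000
print(samples_needed(2.5))    # 40000
print(sample_duration_ms())   # 0.0625
```

At that rate, a two-and-a-half-second phrase is 40,000 separate data points, and each one covers about a sixteenth of a millisecond of sound.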
To then generate speech, it assembles an audio wave sample by sample, using statistics to predict which sample to use next. It’s like assembling words a fraction of a millisecond of sound at a time. DeepMind researchers then refine these results by adding linguistic rules and suggestions to the model. Without these rules, WaveNet produces dialogue that sounds like it’s lifted from The Sims video game.
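The sample-by-sample idea can be shown with a toy in Python. This is a sketch under loud assumptions, not DeepMind's method: WaveNet uses a deep neural network as its predictor, while this example swaps in a simple frequency table, and the 16 amplitude levels, the two-sample context and the 440 Hz training tone are all arbitrary choices for illustration.

```python
import math
import random
from collections import Counter, defaultdict

LEVELS = 16  # assumption: amplitudes rounded to 16 levels to keep the toy small

def quantize(x):
    """Map an amplitude in [-1, 1] to an integer level from 0 to LEVELS - 1."""
    return min(LEVELS - 1, int((x + 1) / 2 * LEVELS))

def train(levels, context=2):
    """Count which level followed each run of `context` previous levels."""
    counts = defaultdict(Counter)
    for i in range(context, len(levels)):
        counts[tuple(levels[i - context:i])][levels[i]] += 1
    return counts

def generate(counts, seed, n, rng):
    """Build a wave one sample at a time, drawing each new sample from the
    statistics of what followed the most recent samples during training."""
    out = list(seed)
    context = len(seed)
    for _ in range(n):
        options = counts.get(tuple(out[-context:]))
        if not options:
            out.append(out[-1])  # unseen context: just hold the previous level
            continue
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

# Stand-in training data: a tenth of a second of a 440 Hz tone at 16,000 samples per second.
training = [quantize(math.sin(2 * math.pi * 440 * t / 16_000)) for t in range(1600)]
model = train(training)
wave = generate(model, training[:2], 100, random.Random(0))
print(wave[:20])
```

Each new sample depends on the ones generated just before it, so the wave has to be built strictly in order. That is one reason this style of generation is so demanding on computing power.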
The technique requires a ton of computing power, but the results are pretty good: WaveNet even generates non-speech sounds like breaths and mouth movements. In blind tests, English and Mandarin speakers said WaveNet sounded more natural than any of Google’s existing text-to-speech programs. However, it still trailed actual human speech. The DeepMind team published a paper detailing their results.
Because this technique is so computationally expensive, we probably won’t see this in devices immediately, according to Bloomberg’s Jeremy Kahn.
Still, the future of man-machine conversation sounds pretty good.