The Mathematics of ... Artificial Speech

Have you heard Mike? Could be. Mike is a professional reader, and he's everywhere these days. On MapQuest, the Web-based map service, he'll read aloud whatever directions you ask for. If you like to have AOL or Yahoo! e-mail read aloud to you over the phone, that's Mike's voice you're hearing. Soon Mike may do voice-overs on TV, reading National Weather Service forecasts. But don't expect to see Mike's face on the screen: He's not human. He's a computer voice cobbled together from prerecorded sounds—arguably the most human-sounding one yet.

Introduced in 2001 by AT&T Labs, Mike is fast becoming a star voice of text-to-speech technology, which converts written words into spoken language. He is part of AT&T's large, multilingual, and ever-growing family of so-called Natural Voices. His cohorts include Reiner and Klara (who speak German); Rosa (Spanish); Alain (French); and Audrey and Charles (British English). An American-English speaker named Crystal provided the voice of the spaceship in the recent movie Red Planet. Mike, Crystal, Reiner, Rosa: They're all talk, no bodies.

Synthesized speech is both a triumph of technology and the fruition of a very old dream. The first "acoustic-mechanical speech machine" was introduced in 1791 by the Viennese researcher Wolfgang von Kempelen. The machine simulated the major consonant and vowel sounds with an array of vibrating reeds, like a musical instrument. But not until the advent of electronics did machines truly begin to mimic human voices. In the 1950s, researchers labored to model the acoustics of the human vocal tract and the resonant frequencies, or formants, it generates. This approach eventually led to workable but robotic results—certainly nothing a public-relations person would call customer ready. Stephen Hawking's voice synthesizer is the most famous example. Such a voice might do for explaining the history of the universe, but you wouldn't buy a used car from it. "At some point, it was evident that progress was much too slow," says Juergen Schroeter, the AT&T researcher in charge of the effort that led to Mike. "Our curiosity began moving toward more practical approaches." In the 1970s, researchers at what was then Bell Labs turned to a "concatenative" approach: Instead of trying to generate a human voice from scratch, they would start with an existing voice—several hours' worth of standard English sentences spoken by a clear-voiced person—and design a computer program to splice and re-splice it to say whatever words they wanted said. "Some of my colleagues felt we'd given up the more scientific approach," Schroeter says. In reality, the science had merely switched focus, from acoustical mechanics to combinatorial mathematics.

The computer program first parsed the prerecorded sentences into consonant and vowel sounds, called phonemes—perhaps 50 or 60 in the early iterations. Then the phonemes were reassembled to form new words. The recorded word cat, for instance, could be deconstructed into the phonemes k, ae, and t, which could then be rearranged to form tack. It worked, and it was a definite improvement over robot-speak, but it wasn't Peter Jennings. Fifty-odd phonemes simply couldn't capture the subtle intonations of spoken language. "You can't just take a vowel from this sentence and drop it into this other sentence," says Mark Beutnagel, an AT&T speech researcher.
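
The cat-to-tack trick can be sketched in a few lines of Python. This is a toy illustration only, not the Bell Labs program: the `RECORDED` dictionary and both function names are made up for the example, and each "clip" is just a pointer back to the word and position it was cut from.

```python
# Toy inventory: each recorded word is stored as its sequence of phoneme labels.
# The labels follow the article's example (k, ae, t); the data is hypothetical.
RECORDED = {"cat": ["k", "ae", "t"]}


def harvest_phonemes(recorded):
    """Collect one clip per phoneme label from the recorded words."""
    clips = {}
    for word, phonemes in recorded.items():
        for position, label in enumerate(phonemes):
            # Remember where the clip was cut from; the first occurrence wins.
            clips.setdefault(label, (word, position))
    return clips


def synthesize(target_phonemes, clips):
    """Return the source clip for each target phoneme, in speaking order."""
    missing = [p for p in target_phonemes if p not in clips]
    if missing:
        raise KeyError(f"no recording for phonemes: {missing}")
    return [clips[p] for p in target_phonemes]


clips = harvest_phonemes(RECORDED)
tack = synthesize(["t", "ae", "k"], clips)
print(tack)  # [('cat', 2), ('cat', 1), ('cat', 0)]
```

With only one clip per phoneme, the same ae gets reused wherever it is needed, which is the limitation Beutnagel describes.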

In the mid-1990s, armed with a new generation of supercomputers, AT&T researchers began amassing a vast digital "voice warehouse" of phonemes. Instead of one t sound for the computer program to choose from, there might be 10,000. "By having so many sounds, it offers a little more spontaneity," says Alistair Conkie, AT&T's speech-synthesis expert. Conkie suggested parsing phonemes into "half-phones" to offer subtler possibilities for recombination. Voice synthesis now entails properly labeling the half-phones (10,000 versions of the "t1" sound, 10,000 versions of the "t2" sound, and so on), then creating a computer algorithm to smoothly string them into words and sentences. "We're playing with half-dominoes," Conkie says. But assembling a simple word like cat from its half-phones (k1, k2, a1, a2, t1, t2) involves billions of combinatorial decisions and presents a massive computer-processing problem.

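To get a feel for the scale, here is a back-of-the-envelope count in Python. It assumes the article's round figure of 10,000 recordings for every half-phone and, as a simplification, that any recording may be paired with any other. How AT&T actually tallied its "billions" is not stated, so treat the numbers as illustrative.

```python
HALF_PHONES = ["k1", "k2", "a1", "a2", "t1", "t2"]
CANDIDATES_PER_UNIT = 10_000  # round figure quoted in the article

# Every way of picking one recording for each of the six half-phone slots.
total_sequences = CANDIDATES_PER_UNIT ** len(HALF_PHONES)

# Every pairing of neighbouring recordings across the five joins in the word.
joins = len(HALF_PHONES) - 1
pairwise_joins = joins * CANDIDATES_PER_UNIT ** 2

print(f"{total_sequences:.0e} possible sequences")  # 1e+24 possible sequences
print(f"{pairwise_joins:,} pairwise joins")         # 500,000,000 pairwise joins
```

Trying every sequence is hopeless. Scoring joins one pair at a time is merely expensive, which is why the way the choices are searched matters so much.
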
Conkie is generally credited with devising a workable solution, now known as unit-selection synthesis. He recalled the old math problem in which a traveling salesman is required to visit all 50 states in a limited time. How to choose the least expensive route while maximizing sales coverage? Conkie's solution was to assign "costs" to the innumerable choices and combinations of half-phones. Charting the "least expensive" path through the chorus of half-phones became simply a math problem for the computer to work out. "We optimized the way in which units are chosen, so it would sound smooth, natural, spontaneous," he says.
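
One common way to chart a least-expensive path of this kind is dynamic programming, the approach behind the Viterbi algorithm. Because the half-phones must be spoken in a fixed order, the search is a shortest-path problem through columns of candidates, which is far more tractable than a true traveling-salesman tour. The sketch below is a generic version under that assumption, not AT&T's code: the function name, the two cost functions, and the made-up pitch values are all hypothetical.

```python
def cheapest_path(candidates, target_cost, join_cost):
    """Pick one unit per slot so the summed target and join costs are minimal.

    candidates: one list of candidate units per half-phone slot, in order.
    Returns (total_cost, chosen_units).
    """
    # best[i][j] = (cheapest cost of any path ending at candidates[i][j],
    #               index of the unit chosen in the previous slot)
    best = [[(target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(left, unit), k)
                for k, left in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, unit), prev))
        best.append(row)

    # Start from the cheapest final unit and follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return total, path[::-1]


# Made-up units: (label, pitch in Hz). Joining costs the pitch mismatch.
candidates = [
    [("k2", 100), ("k2", 140)],
    [("a1", 105), ("a1", 150)],
    [("a2", 108), ("a2", 200)],
]
total, path = cheapest_path(
    candidates,
    target_cost=lambda slot, unit: 0,        # every candidate fits equally well
    join_cost=lambda a, b: abs(a[1] - b[1]),
)
print(total, path)  # 8 [('k2', 100), ('a1', 105), ('a2', 108)]
```

The work grows with the number of neighbouring pairs rather than with the number of whole sequences, so each added half-phone adds one more column of comparisons instead of multiplying the total.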

For example, most costs crop up where two half-phones meet and attempt to join. The computer can measure the pitch, loudness, and duration (in milliseconds) of each one and compare them. If the total energies of the two differ sharply, linking them would produce a disagreeable click or pop, so the link is rated as "expensive," and the computer avoids it. Some linkages are far less likely to occur than others, Conkie realized: In real spoken English, certain "k2" sounds are almost never followed by certain "a1" sounds. Those links could be deemed costly, too, and the computer could avoid them altogether. The word cat could theoretically call upon 10,000 ways of linking the "k2" and "a1" sounds. In practice, though, fewer than 100 (a manageable number of choices for the computer to handle) can pass as reasonable facsimiles of human sounds.
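
A join cost of this sort might look something like the following Python sketch. The three measurements come from the article; everything else (the `Unit` class, the weights, the cutoff of 20, and the sample numbers) is invented for illustration and is not drawn from the AT&T system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str          # half-phone label, e.g. "k2" or "a1"
    pitch_hz: float
    loudness_db: float
    duration_ms: float


# Hypothetical weights: how much each kind of mismatch counts toward the cost.
WEIGHTS = {"pitch_hz": 1.0, "loudness_db": 2.0, "duration_ms": 0.1}
MAX_JOIN_COST = 20.0  # hypothetical cutoff above which a link is "expensive"


def join_cost(left, right):
    """Weighted sum of the mismatches between two neighbouring units."""
    return sum(
        weight * abs(getattr(left, name) - getattr(right, name))
        for name, weight in WEIGHTS.items()
    )


def affordable_links(lefts, rights, max_cost=MAX_JOIN_COST):
    """Keep only the pairings cheap enough to be worth considering."""
    return [
        (left, right)
        for left in lefts
        for right in rights
        if join_cost(left, right) <= max_cost
    ]


k2s = [Unit("k2", 120.0, 60.0, 40.0), Unit("k2", 180.0, 72.0, 55.0)]
a1s = [Unit("a1", 125.0, 61.0, 50.0), Unit("a1", 118.0, 75.0, 45.0)]
links = affordable_links(k2s, a1s)
print(len(links))  # 1
```

Of the four possible pairings here, only the first "k2" with the first "a1" survives the cutoff. Pruning of this kind is what shrinks the candidate links for each join to a number the computer can handle.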

There were lots of other niggling problems to deal with, such as how to teach the speaking computer to distinguish between written words like bow (as in "bow and arrow") and bow (as in the bow of a ship), or to recognize that minus signs aren't the same as hyphens. But by 1996, the makings of Mike were in place.
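
The minus-versus-hyphen chore gives a flavor of this text cleanup. The sketch below is a deliberately crude heuristic, not the Natural Voices front end: the two patterns and the `normalize` function are assumptions made up for the example. Telling the two kinds of bow apart is harder, since it depends on the surrounding words, and is not attempted here.

```python
import re

# A "-" that opens a token and sits directly before a digit: read it as "minus".
MINUS = re.compile(r"(?<![\w.])-(?=\d)")
# A "-" wedged between two letters: a hyphen joining a compound, left unspoken.
HYPHEN = re.compile(r"(?<=[A-Za-z])-(?=[A-Za-z])")


def normalize(text):
    """Rewrite minus signs as a word and hyphens as spaces before speaking."""
    text = MINUS.sub("minus ", text)
    text = HYPHEN.sub(" ", text)
    return text


print(normalize("A low of -5 degrees in the well-known valley"))
# A low of minus 5 degrees in the well known valley
```

An expression such as 10-5 slips past both patterns untouched, a reminder of how many special cases a real system has to cover.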

The Natural Voices Web site (www.naturalvoices.att.com), where a visitor can type in a 30-word phrase and hear any of the voices read it back, has since developed something of a cult following. Conkie tells the story of one Web site visitor, a kid who typed in "Please excuse Johnny from school," recorded Crystal's reading of it, then played the track to his principal's office over the phone.

For all the emphasis on their naturalness, Mike and his Natural Voices associates do not yet sound entirely natural. In short phrases ("I'd like to buy a ticket to Stockholm"), they can pass for a human, albeit an officious one. But longer phrases, or anything vaguely poetic or emotive, give rise to weird and warbly enunciations. "Emotion is something we're doing research on," Conkie says. Beutnagel adds, "We're limited by what's in the database, in terms of emotional quality. If we're recording a neutral voice, you can't expect it to sound angry."

Still, AT&T sees a host of applications for the synthetic voices. Software programs like ReadPlease and TextAloud enable the user to have e-mail, documents, or even books read aloud through an MP3 player on a handheld personal organizer. And federal law will soon require government Web sites to be speech-enabled for the visually handicapped. You don't have to be a cynic to imagine the darker uses of this technology as well. How long before Mike and his family start calling you at dinnertime to sell stuff over the phone?

At this point you may be wondering: Who exactly is "Mike"? If he is just the re-scrambled version of an actual human voice, will the real Mike please stand up? No, as it turns out, he will not. The voice talents behind the Natural Voices are contractually prohibited from doing any publicity. "If the voice talent person became known and then got into trouble with the law or something, it would have the potential to tarnish the integrity of the voice itself," says Michael Dickman, a spokesman for AT&T. "We try very hard to keep the voice brand separate from the person." Evidently, that's just fine with the real Mike. "The actor was worried that if it came out who he was, he'd be a pariah in the voice-over industry," Dickman says. "That's a long way from happening."