For now, Victor Zue's computer sits on his desk at the Massachusetts Institute of Technology Laboratory for Computer Science, but he doesn't expect it to stay there much longer. Computers are already beginning to shrink drastically while they multiply in number. In as little as two years, Zue predicts, they will literally fall off the desktop. He believes tiny but powerful computers will soon be embedded in the walls of offices and homes, in handheld devices that look like cell phones, and in even the most mundane appliances. The refrigerator of the near future, you may have read, will be able to remind you of what you're low on. What you may not have read is that it will order it for you on the Internet. You can already give a travel destination to a luxury automobile, with the right option, and it will direct you where you want to go, turn by turn. Even the lowly alarm clock will soon develop a computer-assisted attitude: Connected to the Internet, it will be able to check your schedule, cross-reference it with traffic reports, and decide what time to wake you up. Zue says that "even more remarkable than the things we'll be doing with all these computers will be the way we interact with them. We won't be typing on keyboards. Instead, we'll be speaking to them."
And they'll be speaking back. A computer that talks has long been an elusive goal, one that has had less to do with science than with Hollywood, where the prototype was HAL in 2001: A Space Odyssey. But as computers become more commonplace, they remain difficult to communicate with, as those who have struggled with a keyboard or dialed their way into oblivion through a voice-mail tree well know. Those problems would disappear if computers could be programmed to converse with humans.
"Speech is the simplest and fastest form of human communication there is," says Zue, an associate director of the MIT computer lab. "If we could talk to computers, then virtually anyone could use them, without any training at all."
And our working and personal lives would never be the same.
Not long ago, computers were huge collections of vacuum tubes, wire, resistors, and capacitors. The first general-purpose electronic digital computer, built for the U.S. Army in 1946 to calculate ballistic tables, weighed more than 30 tons and contained more than 17,000 vacuum tubes. Because of their expense and unwieldy size, early computers— which came to be known as mainframes— served many people. Each person connected to the computer by a terminal had to compete for time. The arrival of the personal computer in the late 1970s eventually rearranged the equation to a 1-to-1 ratio of computers to people. And now the equation is changing again, so the ratio will soon be many computers per person.
Small but powerful computers linked to the Internet will soon supersede such personal digital assistants as PalmPilots and cellular phones that link wirelessly to the Web. Mike Greenwood, program director of Planet Blue, IBM's ubiquitous-computing lab, is scrambling to create the software that will enable the new generation of computers to connect with one another. He expects that in 10 to 20 years "more than 1 million businesses and 1 billion people will be connected by 1 trillion handheld and embedded devices."
As the devices shrink, the problem of how to enter data increases. A keyboard, even a wireless one that fits in your pocket, would be so small "you'd have to type on it with toothpicks," says Zue.
There is really no alternative but speech. "There's a whole variety of trends that are making it desirable," says David Nahamoo, a manager of voice technology research at IBM. "A talking computer sounds cute, but this is not a novelty or a gimmick. It's essential."
The woman on the phone from Mercury Travel Service seems friendly, if uncommonly patient, as Zue checks the schedule of flights from Boston to San Francisco. "What time do planes leave tomorrow?" he asks, peppering her with questions. "Are there any flights returning to Boston in the afternoon? What are the flight numbers? What time do they arrive?" To each, the smooth voice gives a quick, cheerful response. In two minutes Zue has found out enough to book a flight. Aside from the speediness of the transaction, the surprise is that the Mercury travel agent was not human but a computer Zue himself has programmed to recognize human speech. "Not a bad conversationalist for a computer, don't you think?" he says, hanging up the telephone.
Such fluency didn't come easily for the computer or for Zue himself, who had to struggle to acquire conversational English skills. Born in China, Zue enrolled as a student at the University of Florida in the late 1960s to be near his older sisters, who had moved there. "To be accepted, I wanted to learn to speak like an American— but that was very difficult," he says. Words such as did you, which he could read easily enough in a textbook, suddenly turned into the incomprehensible "didju" when he heard them spoken. Everywhere he turned, he says, he found himself flummoxed by inexplicable rules of pronunciation.
Zue's spark of inspiration came, ironically enough, from Hollywood. In 1968, after making hard-won progress in his English studies, he went to see 2001 and became riveted by HAL, the talking computer. "I saw it and said, 'This is it, this is the future,' " he recalls. "If I could learn all the different rules of pronunciation, then a computer could too." Determined to find a way to do it, he headed for graduate school at MIT. He knew that computers could somehow be taught to "hear" what was being spoken, but also that it would involve more than just wiring up a microphone. "Because of accents and the way words are pronounced, the ear is a very bad decoder of language, both for foreigners and computers," says Zue. "Instead, what I went looking for was a visual representation of speech."
What he ended up with was a spectrogram— an electronic tracing of speech sounds. No one had ever been able to "read" a spectrogram before, but Zue— practicing one hour a day for four years— showed that it could be done. He then theorized that he could teach a computer to take frequency readings from a spoken voice that are similar to a spectrogram, which has turned out to be a reliable way to code speech. "It essentially takes human language and translates it into a language that the computer can understand," Zue says.
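For readers who want to see what "frequency readings" might look like in practice, here is a minimal sketch in Python. It is not Zue's software: the frame size, the hop between frames, the windowing step, and the test tone are all assumptions chosen for illustration. The idea is simply to slice a signal into short overlapping frames and measure how much energy each frame holds at each frequency, which is what a spectrogram plots.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return the magnitude of each frequency bin (0 .. n/2) for one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(total))
    return mags


def spectrogram(samples, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and analyze each one."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges before the transform.
        windowed = [
            x * 0.5 * (1 - math.cos(2 * math.pi * i / (frame_size - 1)))
            for i, x in enumerate(frame)
        ]
        frames.append(dft_magnitudes(windowed))
    return frames


# Hypothetical test signal: a pure tone that jumps from 1,000 Hz to 2,000 Hz.
SAMPLE_RATE = 8000
tone = [
    math.sin(2 * math.pi * (1000 if t < 256 else 2000) * t / SAMPLE_RATE)
    for t in range(512)
]
spec = spectrogram(tone)
bin_hz = SAMPLE_RATE / 64  # width of one frequency bin at frame_size=64

# Strongest frequency in the first and last frames.
first_peak = max(range(len(spec[0])), key=spec[0].__getitem__) * bin_hz
last_peak = max(range(len(spec[-1])), key=spec[-1].__getitem__) * bin_hz
print(first_peak, last_peak)
```

Each row of `spec` is one slice of time and each column one band of frequency. Real speech produces several shifting peaks at once rather than a single tone, and those shifting patterns are what a trained reader, human or machine, learns to decode.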
At the core of speech recognition lies the phoneme, which is the basic phonetic building block. It's short— often barely 100 milliseconds in all— but that's all the time required to change a b sound to a p, and to change the word bit into pit. To understand speech, a computer translates the spoken word into an electronic representation of these phonemes, then matches them against templates showing real words and clusters of words. "It finds the best possible match between the incoming measurements and the stored measurements for the sound," says Zue. The computer considers what it has "heard," then chooses the most likely meaning— exactly as Zue did when he first learned English. "Basically, I'm treating a machine as a foreign person new to the language," he says. The software programs he wrote, while massive, amount to little more than grammar lessons and instructions on pronunciation: "You teach the computer grammar rules one by one, much the same as a student would learn in kindergarten through high school."
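The "best possible match" Zue describes can be pictured as a nearest-template search. The sketch below is a deliberately tiny stand-in, not the MIT system: the three sound labels and their pairs of frequency measurements are invented for illustration, and real recognizers compare far richer measurements than two numbers.

```python
import math

# Hypothetical stored templates: each sound label maps to a made-up pair of
# resonant-frequency measurements in hertz. These values are illustrative only.
TEMPLATES = {
    "ee": (300.0, 2300.0),
    "ah": (750.0, 1200.0),
    "oo": (320.0, 900.0),
}


def best_match(measurement, templates=TEMPLATES):
    """Return the label whose stored measurements lie closest to the input."""
    return min(
        templates,
        key=lambda label: math.dist(measurement, templates[label]),
    )


# An incoming measurement that is near, but not identical to, one template.
print(best_match((310.0, 2200.0)))
```

The incoming measurement never matches a template exactly, so the program settles for the closest one, much as a listener new to a language settles on the most likely sound.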
The scope of that challenge becomes clear when taking a look at some of the peculiarities that litter the language— beginning with the homonyms. "We say 'there,' " says Zue. "But do we mean there, they're, or their?" Also, the same letter can be pronounced differently depending on its position in a word. The t in each of the words top, try, city, and button, for instance, sounds radically different, and computers need to be instructed about this. Many times the only way you can understand what someone has said is by remembering what came before. For instance, says Zue, a spoken conversation might contain the line, "How about Japanese?" That could be a reference to currency or language, "until you remember that the discussion is about what kind of restaurant to head to for lunch," he says. "Only in connection with what's spoken before does the sentence make sense."
Sometimes, the sounds of words can be interpreted in wildly different ways, resulting in comic mishandlings of the language, such as when euthanasia is heard as youth in Asia or recognize speech comes out wreck a nice beach. Adding to the mayhem, combinations of letters can also sound different depending on where they're found. The words gas shortage, for instance, are pronounced "ga-shortage," says Zue, with the s sound in gas becoming subsumed by the sh in shortage. "But the same rule doesn't apply to the words fish sandwich. You have to pronounce the sh and s distinctly; if you say 'fi-shandwich,' you'll end up sounding like a foreigner."
Most of these problems have been surmounted through grammar instructions, however, and dictation software programs— which have been available for more than a decade— have an error rate of roughly one word per sentence. That might not sound bad, says Zue, "but it would certainly get you fired if you were a typist."
Moreover, to engage in conversation, a computer has to do more than transcribe what's recited to it. It has to provide intelligent responses to questions. "The computer can't think, but it can access information," says Zue. And computers can be linked to the mother of all information repositories— the Internet. For the Mercury Travel Service, Zue's computer translates a spoken question into digital code, then searches the Internet for an up-to-the-minute answer. From there, the process reverses— and the computer speaks the answer.
In addition to Mercury, two other prototype systems developed by Zue's lab at MIT will be linked to the Internet for real-time data searches: Voyager will provide up-to-date information about traffic conditions in Boston, and Jupiter will give details of the weather in 500 cities. For the time being, the systems do not communicate with one another. So a caller who asks Mercury about the weather will be told, "I'm sorry, I do not understand your question." But a question focused on air travel will prompt an instantaneous answer. "We're building systems with very good competence within a narrow domain," says Zue. "The challenge now is to stitch these together— almost like little pieces of a cloth in a quilt— so that one day a person could navigate smoothly from one domain to another."
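The difference between today's separate systems and the stitched-together quilt can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the keyword lists are invented, and the article does not say how the real Mercury, Voyager, or Jupiter decide which questions they can answer.

```python
# Hypothetical keyword lists for each narrow domain (illustrative only).
DOMAINS = {
    "Mercury": {"flight", "flights", "plane", "planes", "airline"},
    "Voyager": {"traffic", "route", "highway"},
    "Jupiter": {"weather", "rain", "snow", "forecast", "temperature"},
}
FALLBACK = "I'm sorry, I do not understand your question."


def words_in(question):
    """Lowercase the question and strip simple punctuation from each word."""
    return {w.strip("?.,!").lower() for w in question.split()}


def ask_single_domain(domain, question):
    """Today's behavior: one system answers only inside its own domain."""
    if words_in(question) & DOMAINS[domain]:
        return f"{domain} handles: {question}"
    return FALLBACK


def ask_quilt(question):
    """The stitched goal: hand the question to whichever domain fits best."""
    words = words_in(question)
    best = max(DOMAINS, key=lambda d: len(words & DOMAINS[d]))
    if not words & DOMAINS[best]:
        return FALLBACK
    return f"{best} handles: {question}"


print(ask_single_domain("Mercury", "Will it rain tomorrow?"))
print(ask_quilt("Will it rain tomorrow?"))
```

Asked about rain, the stand-alone Mercury stub falls back to its apology, while the quilt version passes the same question to the weather domain.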
As this quilt grows, computing as we know it will dramatically change, providing people with instant access to whatever information they want, whenever and wherever they want it. Some people already use cell phones to check e-mail or get instant stock quotes, and within two years speech recognition will begin to eliminate the need to use the tiny button pads on the phone as keyboards. Old-fashioned VCRs will be supplanted in the next few years by "black boxes," which will search TV listings via the Internet and figure out the date, time, and channel of the program to be recorded. When speech recognition is added, tailoring an individual viewing schedule will be as easy as, say, giving a voice command to record all cooking shows or baseball games. Ultimately, speech technology will radically transform people's daily lives by turning computers into eager assistants rather than nemeses. "Speech capability will do to computers what Netscape Navigator did for the Internet," says James Flanagan, director of the Center for Advanced Information Processing at Rutgers University. "It will popularize things that are now too difficult for the average person to use and will reinvent the way we interact with our computers for all time."
Farther down the line, a single small "computing device" will emerge, a speech-controlled device that can be programmed to turn into whatever you want it to be, from cell phone to personal digital assistant to digital video camera, just by downloading different software. "No one can be 100 percent sure where we're all headed with speech recognition, but I'll tell you one thing," says Flanagan. "We'll need a mighty big landfill to hold all the electronics equipment it will make obsolete."
Including, many believe, the computer keyboard. "I'm confident it's going to disappear completely in five to 10 years," says Xuedong Huang, general manager of the Speech.Net Group at Microsoft, which has made speech-enabled computing one of its top priorities since 1993. "I'll bet 50 years from now people will look back on us laboriously typing our instructions in on a keyboard and laugh. 'You mean, you had to compose each word?' they'll ask. 'One letter at a time?' They'll think it's very, very funny!"
Perhaps, but others remain more circumspect. "For myself, I can't imagine not using my keyboard," says Gary Herman, director of Hewlett-Packard's Internet and Mobile Systems Laboratory. And he suspects others may feel the same way. "We may have the capability for computer-enabled speech and the vision of what to do with it," Herman says, "but we can't know for certain whether people will actually want to relate to computers like this until we try it."
Fortunately, humans— rather than computers— will have the final say.
Chip Ahoy!
The microchip that forms the heart of the modern computer comes with a surprising limitation: it is hardwired. Therefore, the pathways that electrical signals can follow are limited, and different chips must be designed for different applications. "What you end up with are separate chips for separate uses, whether they're for a PC, a cell phone, or a PalmPilot," says Anant Agarwal, an associate director of the MIT Laboratory for Computer Science. "There's no flexibility at all."
That's quite a limitation if computers are expected to become smaller, less obvious, dedicated to a single task, and more pervasive. "You'll end up having to have 100 separate devices for 100 uses," says Agarwal. So he and a team of researchers are developing an alternative chip called Raw that, he says, "exposes the raw hardware to the software system." Instead of being hardwired, the Raw microprocessor will contain a rectangular array of many identical tiles that are configured by the software. When electronic devices are built with Raw chips, "I'll no longer have just a cell phone, or just a PalmPilot, or just a Walkman," says Agarwal. "Instead, I'll have a generic computing device that can literally turn itself into whatever is needed." A "spit and baling wire" prototype of that device, which he has dubbed H21, should be up and running later this year. Then, "if I say, 'Hey, turn yourself into a cell phone,' " Agarwal says, "it will be able to locate the appropriate configuration software through the Internet, download it, and configure the wires of the Raw chip inside to give it the characteristics of a cell phone." — C.R.
The Wings of Mercury
Human speech, riddled with tricky phonetics, garbled syntax, and ambiguous phrasings, is far from perfect, a problem that becomes magnified when a computer enters the conversation. To limit the errors a computer could make trying to understand and respond to humans, Victor Zue of MIT believes we'll need different programs for different topics, such as the weather, traffic updates, or travel information. "These separate domains can be stitched together to create the illusion of a vast store of knowledge, in which a computer can appear to move seamlessly from one subject to another," he says. Here's how one such domain— Mercury Travel Service, a research prototype flight-information service developed by Zue and his colleagues— works now.
1 Zue dials up Mercury via telephone (617-258-6040) and asks a question: "When does the next flight leave from Boston for San Francisco?"
2 The computer doesn't actually hear what Zue is saying. Instead, it records his words, translates them into digital code, and slices them into small segments called phonemes, which it analyzes according to their resonant frequencies. These are matched against templates— idealized models of real words— that are written into the software.
3 Using probability statistics, the computer determines the likelihood that a cluster of sounds corresponds to actual words. It then strings together these words, ruling out unlikely combinations. Because it uses probability, Mercury can handle a huge variety of accents and speaking styles yet still capture the essential meaning of a question.
4 Relying on syntax and grammar rules coded into its software, Mercury analyzes the meaning of the question, just as students in a high school English class diagram sentences.
5 Mercury accesses the Internet to search various online databases, just as people do when they type a request into a search engine. At this stage, while still in prototype, Mercury is limited to specific airline Web sites that Zue's staff has selected in advance.
6 Mercury uses a voice synthesizer to convert the sequence of digitized words it finds on the Internet into audible speech. Instead of sounding robotic, the female voice sounds reassuringly human. That's because it relies on a process called concatenation, in which snippets of information, such as the names of airlines, flight numbers, and destinations, are prerecorded by an actual person, stockpiled in a database, and spliced together as needed by the computer. To be practical, this can be done only for domains with narrow subjects such as travel plans. But the result, says Zue, "sounds completely natural."
7 An instant after Zue asks his question, Mercury responds: "The next flight from Boston to San Francisco is United Flight 523, leaving at 3:30 this afternoon. Would that work?" — C.R.
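Step 3 above, in which probability rules out unlikely word combinations, can be illustrated with a toy example in Python. This is a sketch under stated assumptions, not Mercury's actual method: the word-pair probabilities are invented, and scoring adjacent word pairs (a "bigram" model) is one common textbook approach that the article does not specifically attribute to Mercury.

```python
import math

# Hypothetical word-pair probabilities, invented for illustration only.
# "<s>" marks the start of a sentence.
BIGRAMS = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.01,
}
UNSEEN = 1e-6  # small floor for word pairs the table has never seen


def log_score(words):
    """Sum the log-probabilities of each adjacent word pair, starting at <s>."""
    pairs = zip(["<s>"] + words, words)
    return sum(math.log(BIGRAMS.get(pair, UNSEEN)) for pair in pairs)


def most_likely(candidates):
    """Pick the candidate word sequence with the highest total score."""
    return max(candidates, key=log_score)


# Two word sequences that could fit the same stretch of sound.
candidates = [
    ["wreck", "a", "nice", "beach"],
    ["recognize", "speech"],
]
print(" ".join(most_likely(candidates)))
```

Both candidates fit the sounds about equally well, so the sketch lets the word statistics break the tie. With these made-up numbers, the sensible phrase wins comfortably.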
Mercury translates speech (bottom) into an audio wave form (center) and then into a spectrogram (top). Voice-recognition software deciphers the subtle pattern shifts in the spectrogram and uses probability models to identify what words were spoken.
The next time you travel, first call Mercury at 617-258-6040 and "donate your voice to science." The research prototype cannot yet make reservations but will be able to help you establish an itinerary. See the Web site of the Spoken Language Systems group at MIT's Laboratory for Computer Science at www.sls.lcs.mit.edu. For more about the work of Microsoft Research's Speech Technology Group, see research.microsoft.com/stg.