As his friends flocked to social networks like Facebook and MySpace, Alessandro Acquisti, an associate professor of information technology at Carnegie Mellon University, worried about the downside of all this online sharing. “The personal information is not particularly sensitive, but what happens when you combine those pieces together?” he asks. “You can come up with something that is much more sensitive than the individual pieces.”
Acquisti tested his idea in a study, reported earlier this year in Proceedings of the National Academy of Sciences. He took seemingly innocuous pieces of personal data that many people put online (birthplace and date of birth, both frequently posted on social networking sites) and combined them with information from the Death Master File, a public database from the U.S. Social Security Administration. With a little clever analysis, he found he could determine, in as few as 1,000 tries, someone’s Social Security number 8.5 percent of the time. Data thieves could easily do the same thing: They could keep hitting the log-on page of a bank account until they got one right, then go on a spending spree. With an automated program, making thousands of attempts is no trouble at all.
The problem, Acquisti found, is that the way Social Security numbers, including those recorded in the Death Master File, are assigned is predictable. Typically the first three digits of a Social Security number, the “area number,” are based on the zip code of the person’s birthplace; the next two, the “group number,” are assigned in a predetermined order within a particular area-number group; and the final four, the “serial number,” are assigned consecutively within each group number. When Acquisti plotted the birth information and corresponding Social Security numbers on a graph, he found that the set of possible IDs that could be assigned to a person with a given date and place of birth fell within a restricted range, making it fairly simple to sift through all of the possibilities.
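To make the three-part structure concrete, here is a minimal Python sketch that splits a nine-digit number into the fields named above. The function and class names are invented for this illustration, and the sample number is a made-up placeholder. The closing arithmetic assumes the simplest possible case, in which the area and group numbers are already known; the study itself describes only a "restricted range," not this exact reduction.

```python
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # first three digits ("area number")
    group: str   # next two digits ("group number")
    serial: str  # final four digits ("serial number")


def split_ssn(ssn: str) -> SSNParts:
    """Split a nine-digit number into the three fields described in the text."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected nine digits, optionally written as AAA-GG-SSSS")
    return SSNParts(digits[:3], digits[3:5], digits[5:])


# Made-up placeholder value, not a real person's number.
parts = split_ssn("123-45-6789")
print(parts.area, parts.group, parts.serial)

# Simplifying assumption for illustration: if the area and group numbers were
# known outright, only the four-digit serial number would remain unknown.
total_space = 10 ** 9        # every possible nine-digit number
serial_only_space = 10 ** 4  # every possible four-digit serial
print(total_space // serial_only_space)  # 100000
```

Under that assumption the set of candidates shrinks by a factor of 100,000, which is why predictable assignment matters so much.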
To check the accuracy of his guesses, Acquisti used a list of students who had posted their birth information on a social network and whose Social Security numbers were matched anonymously by the university they attended. His system worked—yet another reason why you should never use your Social Security number as a password for sensitive transactions.
Welcome to the unnerving world of data mining, the fine art (some might say black art) of extracting important or sensitive pieces from the growing cloud of information that surrounds almost all of us. Since data persist essentially forever online—just check out the Internet Archive Wayback Machine, the repository of almost everything that ever appeared on the Internet—some bit of seemingly harmless information that you post today could easily come back to haunt you years from now.
Fortunately, the main practitioners of data mining these days are not criminals. Interpreting the clustering of data is now a big business, a potent force in politics, and a powerful tool of government (although plenty of people may object to finding their data scrutinized by those folks, too). Data-driven targeting of potential voters played an enormous role in the election of Barack Obama; directed marketing has led to record growth for companies like 1-800-Flowers.
These activities are sure to increase as our data clouds expand. In the near future, face-recognition software will scrutinize online photos to identify “anonymous” individuals; software programs will secretly scan e-mail on government networks; implantable medical devices may even transmit your health data directly to your doctor. With all this information floating around, privacy advocates warn, it is inevitable that some of it, somehow, will wind up in the wrong hands—or at the very least in places where you did not intend it to go.
Before the rise of the Internet, we had safety in numbers: the daunting numbers of scattered, hard-to-access databases that contained our sensitive details. It took shoe leather to put all that information together. Governments and companies kept large personal and demographic databases, but there was no way to instantly shuttle the information from one place to another. That changed when scientists began linking computers to each other—primarily with the creation of the Arpanet, forerunner of the Internet—in the 1960s. As a result, information was no longer confined to individual computers. It could be transmitted to any computer connected to the network anywhere.
The migration of information online grabbed the attention of early predictive analysts like Robert Grossman, who is now the director of the Chicago-based National Center for Data Mining. Grossman advises companies that want to use data to target customers better and to improve their profit margins. He and his colleagues have been working for years on statistical analysis methods that chew up complex sets of data and spit out significant patterns that appear in them. Relevant details can be obtained easily from census records, credit report agencies such as Experian and Equifax, and consumer data-mining companies like Phorm. When you have a detailed set of information on a group of people—say, their political views, the kind of homes they live in, and their favorite movie genres—obvious cluster patterns can emerge.
To find these patterns, data miners like Grossman first chart their harvested facts on a scatter plot, an imaginary graph that has as many dimensions as the number of personal characteristics being evaluated, such as age, marital status, gender, and geography. Grossman combines these factors into about 180 segments. A company might then create a dozen different sales offers and target them to specific segments. Some of the targets are straightforward: Newly married women might get ads for furniture. Some are based on more subtle forms of behavior: Single males are more likely to be hit with online ads that move around. And some are just devious. If you have a Gmail account, opening an e-mail will trigger the delivery of ads based not only on your demographics but also on the content of that particular message.
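The idea of sorting people into segments can be sketched in a few lines of Python. This is only an illustration of grouping by shared characteristics, not Grossman's actual method; the field names, age cutoffs, and sample records are all invented.

```python
from collections import defaultdict


def age_band(age):
    """Collapse an exact age into a coarse band (the cutoffs are arbitrary)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"


def segment_key(person):
    """One coordinate per characteristic; together they name a segment."""
    return (
        age_band(person["age"]),
        person["marital_status"],
        person["gender"],
        person["region"],
    )


def build_segments(people):
    """Group people who share the same combination of characteristics."""
    segments = defaultdict(list)
    for person in people:
        segments[segment_key(person)].append(person["name"])
    return dict(segments)


# Invented records, for illustration only.
people = [
    {"name": "A", "age": 27, "marital_status": "married", "gender": "F", "region": "Midwest"},
    {"name": "B", "age": 29, "marital_status": "married", "gender": "F", "region": "Midwest"},
    {"name": "C", "age": 41, "marital_status": "single", "gender": "M", "region": "South"},
]

for key, members in build_segments(people).items():
    print(key, members)
```

Each distinct combination of characteristics becomes one segment, and a company could then attach a different offer to each key.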
Grossman does not share the identities of the firms he works with, but one company that has profited from this type of data mining is 1-800-Flowers, which has been monitoring the behavior of its customers and sifting the data on buying habits since 2003. (1-800-Flowers, like a number of large retailers, uses the business analysis company SAS.) Instead of reaching out to all customers the same way, as advertisers traditionally do, the company targets specific subgroups. According to Aaron Cano, vice president of enterprise customer knowledge at 1-800-Flowers, there are planners and there are last-minute buyers. Planners receive offers in advance of buying occasions. The last-minute types get occasion-reminder e-mail.
When 1-800-Flowers started its analytics program, third-quarter revenues hit $124.1 million—up 7.5 percent over the same quarter of the previous year, even though the economy was recovering from a recession. The company has also increased customer retention rates by more than 15 percent since the program began. Brooks Brothers and The Limited, which also work with SAS, claim similar successes as a result of their data-mining programs.
No one understands the transformative power of data analysis better than Democratic consultant Ken Strasma, who helped propel Barack Obama into office by devising a mathematical model that predicts the political behavior of nearly every eligible voter. Strasma first randomly selected a pool of about 10,000 voters from his database, which includes demographic information on more than 100 million people. His consulting firm next conducted phone interviews with those 10,000 to learn their views on a wide range of political topics.
Armed with that huge data set, Strasma started looking for clusters. He found some strange things. Gin drinkers tend to be Democrats. Military history buffs are generally conservative on social issues. Got call-waiting? You’re probably a Republican. “We come up with correlations that might not be intuitive at all,” Strasma says. “We really don’t get at the whys of it.” But really, the whys don’t matter; only the correlations do.
To figure out the voting behavior of those not surveyed, Strasma applied what is called the nearest-neighbor algorithm. This technique matches each of the 100 million eligible voters in the United States to one of the people surveyed, according to a range of demographic measures. “The ‘distance’ between voters is not physical distance but rather how similar or dissimilar they are, based on these thousands of indicators,” he says. For instance, two voters with similar retail preferences might tend to vote the same way. Strasma’s nearest-neighbor tactic helped the Obama campaign fine-tune its mailings, advertising, and donation efforts along with its drives to get voters to the polls. Whether Strasma’s efforts proved decisive is an open question, but Obama pulled in $745 million from donors, more than twice what John McCain managed.
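A bare-bones version of nearest-neighbor matching looks something like the Python sketch below. It is not Strasma's model: the three indicators, their values, and the two surveyed records are invented, and straight-line (Euclidean) distance is just one common way to measure similarity.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(target, surveyed):
    """Return the surveyed record whose features are most similar to target."""
    return min(surveyed, key=lambda record: distance(target, record["features"]))


# Invented survey results. Each feature vector is a hypothetical trio of
# indicators: (drinks gin, has call-waiting, reads military history).
surveyed = [
    {"features": (1.0, 0.0, 0.2), "leaning": "Democrat"},
    {"features": (0.0, 1.0, 0.9), "leaning": "Republican"},
]

# A voter who was never surveyed, described by the same three indicators.
unsurveyed_voter = (0.9, 0.1, 0.3)

match = nearest_neighbor(unsurveyed_voter, surveyed)
print(match["leaning"])  # Democrat
```

The unsurveyed voter is assumed to lean the same way as the most similar surveyed voter. With thousands of indicators and millions of voters, the principle is the same, though the search would need to be far more efficient than this loop.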
Where companies and politicians see opportunity, outspoken privacy advocates like Christopher Soghoian, a doctoral student at Indiana University, see threats to our personal privacy. There is little regulation limiting what data can be taken and mined: The current canonical law, the federal Privacy Act of 1974, specifies that government agencies must show individuals any personal records about them, but it excludes law enforcement from this provision. It also does not restrict the data-collecting efforts of private companies.
As Acquisti demonstrated, even seemingly innocent information that people routinely display about themselves can be mined to expose more sensitive bits. And often people are not aware of how extensive a data trail they are leaving online. For example, that anonymous post you left on the Web site of your local newspaper? Not so private: In 2008 the Alton Telegraph site was served with a grand jury subpoena demanding the full names and addresses of some anonymous commenters who had hinted that they might have information valuable to a murder investigation. “The judge said the law gives anonymity protection to journalists but not non-journalists,” says Seattle Internet-security expert Bennett Haselton. “If you do something online, it’s logged that it was done from your IP address. People should use common sense.”
In the near future, the challenges to personal privacy will move to another level. The Swedish company Polar Rose has developed software that identifies unlabeled individuals in digital photos, such as those posted all over Facebook, using face-recognition algorithms. Once you tag a friend in one photo, the software will automatically identify when that friend appears in other photos. “All kinds of things could happen,” Soghoian says. “What if you were a health insurance company and you could pinpoint all the people who’ve starred in Jackass-type stunt videos online—and drop them?” More plausibly, what if your insurance company sees photos of you drinking and smoking and adjusts your premiums accordingly?
Medical advances may cause our data clouds to envelop us in new and unexpected ways. Harvard synthetic biologist Yaakov Benenson is developing implantable computers capable of detecting chemical changes inside a cell. Eventually, such devices should enable us to monitor our vital statistics, take diagnostic tests, and receive treatment without ever going to the hospital. The results could be beamed wirelessly to health-care providers, raising the specter of eavesdropping. Researchers are experimenting with genetic profiling to fine-tune cancer treatment or to identify patients with an elevated risk of heart attack. Soon doctors might keep your DNA profile on hand to develop personalized treatments for you; if such information got out, your entire genome could be available for public viewing.
In the quest for total knowledge, Soghoian notes, companies and government officials are likely to leave no stone unturned. A planned new version of a system designed to protect the U.S. government from online spies, dubbed Einstein 3, has the capability to read e-mail that travels over government networks. In response, Ari Schwartz, vice president of the Center for Democracy and Technology, has flagged concerns about the government’s ability to balance surveillance with privacy protection. Any information that leaks out can be mined.
Meanwhile, cell phones are accumulating ever-more processing power—Qualcomm’s Snapdragon mobile processor broke the one-gigahertz barrier this year—enabling seamless video watching and recording. “In a not-too-distant future, phones are going to be recording everything we see and hear,” Soghoian warns. That could easily include videos of you going about your business, taken by someone you don’t even know; just look at all the anonymous videos already available on YouTube.
Knowledge may be power, but the runaway growth of our personal data clouds suggests that we may not be happy about where that power ends up. “All of the Facebook interaction, all of the MySpace stuff, all the Expedia travel searches,” Soghoian says, “all those data trails are hanging out forever.”
A GUIDE TO INFORMATION SELF-DEFENSE
There is a flood of personal data on the Web these days, and people often do little to manage their part of the flow. “This has to be addressed at the macro level,” says Marc Rotenberg, executive director of the Electronic Privacy Information Center in Washington, D.C. “We need to protect Internet users through legislation or technical methods, by enforcing fair information practices that give users control over information held by businesses and government agencies, and by limiting the collection of personal information.”
In the meantime, you can take some steps to protect yourself.
Avoid discount cards. Using them lets the store track your every purchase, including drugs; you might as well send them a copy of your health records. And this information could be used in court, so pay cash for anything you want to keep private.
Encrypt your e-mail. Would you show your e-mail to everyone sitting around you in the coffee shop? If not, don’t send messages on an unsecured Wi-Fi network. Use a free encryption tool, such as Comodo or Thawte, that encodes messages so that they need a decrypting “key” to be read.
Pick passwords carefully. Password-guessing programs can perform hundreds of thousands of tries per second, security expert Bruce Schneier says. For maximum security, do not use a dictionary word, and mix numbers and symbols into the body of your password instead of tacking them on at the end. E. S.
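The two password rules above can be turned into a rough self-check. The Python sketch below is an illustration only, not a tool Schneier recommends: the function name and the tiny word list are invented, and a real check would use a much larger dictionary.

```python
import string


def password_warnings(password, dictionary_words):
    """Return a list of warnings based on the two rules in the text."""
    warnings = []
    # Strip any digits or symbols tacked on at the end, then inspect the rest.
    body = password.rstrip(string.digits + string.punctuation)
    if body.lower() in dictionary_words:
        warnings.append("based on a dictionary word")
    if body.isalpha():
        warnings.append("no numbers or symbols mixed into the body")
    return warnings


# Tiny stand-in word list, for illustration only.
words = {"sunshine", "dragon", "password"}

print(password_warnings("sunshine99!", words))
print(password_warnings("su7n$hi!ne", words))
```

The first password draws both warnings because it is a dictionary word with characters added at the end; the second draws none because digits and symbols are mixed into its body.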