[An old post from 2005 I'm fond of]
There was a time not that long ago when sequencing a single gene would be hailed as a scientific milestone. But then came a series of breakthroughs that sped up the process: clever ideas for how to cut up genes and rapidly identify the fragments, the design of robots that could do this work twenty-four hours a day, and powerful computers programmed to make sense of the results. Instead of single genes, entire genomes began to be sequenced. This year marks the tenth anniversary of the publication of the first complete draft of the entire genome of a free-living species (a nasty little microbe called Haemophilus influenzae). Since then, hundreds of genomes have emerged, from flies, mice, humans, and many more, each made up of thousands of genes. More individual genes have been sequenced from the DNA of thousands of other species. In August, an international consortium of databases announced that they now had 100 billion "letters" from the genes of 165,000 different species.

But this data glut has created a new problem. Scientists don't know what many of the genes are for. The classic method for figuring out what a gene is for is good old benchwork. Scientists use the gene's code to generate a protein and then figure out what sort of chemical tricks the protein can perform. Perhaps it's good at slicing some other particular protein in half, or sticking two other proteins together. It's not easy to tackle this question with brute force, since a mystery protein may interact with any one of the thousands of other proteins in an organism. One way scientists can narrow down their search is by seeing what happens to organisms if they take out the particular gene. The organisms may suddenly become unable to digest their favorite food or withstand heat, or show some other change that can serve as a clue.

Even today, though, these experiments still demand a lot of time, in large part because they're still too complex for robots and computers. Even when it comes to E.
coli, a bacterium that thousands of scientists have studied for decades, the functions of a thousand of its genes remain unknown.

This dilemma has helped give rise to a new kind of science called bioinformatics. It's an exciting field, despite its woefully dull name. Its mission is to use computers to help make sense of molecular biology--in this case, by traveling through vast oceans of online information in search of clues to how genes work.

One of the most reliable ways to find out what a gene is for is to find another gene with a very similar sequence. The human genes for hemoglobin and the chimpanzee genes for hemoglobin are a case in point. Since our ancestors diverged about six million years ago, the genes in each lineage have mutated a little, but not much. The proteins they produce still have a similar structure, which allows them to do the same thing: ferry oxygen through the bloodstream. So if you happen to be trolling through the genome of a gorilla--another close ape relative--and discover a gene that's very similar to chimpanzee and human hemoglobins, you've got good reason to think that you've found a gorilla hemoglobin gene.

Scientists sometimes use this same method to match different genes in the same genome. There isn't just one hemoglobin gene in humans but seven. They carry out slightly different functions, some carrying oxygen in the fetus, for example, and others in the adult. This gene family, as it's known, is the result of ancient mistakes. From time to time, the cellular machinery for copying genes accidentally creates a second copy of a gene. Scientists have several lines of evidence for this. Some people carry around extra copies of genes not found in other people. Scientists have also tracked gene duplication in laboratory experiments with bacteria and other organisms. In many cases, these extra genes offer no benefit and disappear over the generations. But in some cases, extra genes appear to provide an evolutionary advantage.
They may mutate until they take on new functions, and gradually spread through an entire species. Round after round of gene duplication can turn a single gene into an entire family of genes. Knowing that genes come in families means that if you find a human gene that looks like hemoglobin genes, it's a fair guess that it does much the same thing as they do.

This method works pretty well, and bioinformaticists (please! find a better name!) have written a number of programs to search databases for good matches between genes. But these programs tend to pick the low-hanging fruit: they are good at recognizing relatively easy matches and not so good at identifying more distant cousins. Over time, related genes can undergo different mutation rates, which can make it difficult to recognize their relationship simply by eyeballing them side by side. Another hazard is the way a gene can be "borrowed" for a new function. For example, snake venom genes turn out to have evolved from families of genes that carry out very different functions in the heart, liver, and other organs. These sorts of evolutionary events can make it hard for simple gene-matching to yield clues to what a new gene is for.

To improve their hunt for the function of new genes, bioinformaticists are building new programs. One of the newest, called SIFTER, was designed by a team of computer scientists and biologists at UC Berkeley. They outline some of their early results in the October issue of PLOS Computational Biology (open access paper here). SIFTER is different from previous programs in that it relies on a detailed understanding of the evolutionary history of a gene. As a result, it offers significantly better results.

To demonstrate SIFTER's powers of prediction, the researchers tested it on well-studied families of genes--ones that contained a number of genes for which there was very good experimental evidence for their functions.
They used SIFTER to come up with hypotheses about the function of the genes, and then turned to the results of experiments on those genes to see if the hypotheses were right.

Here's how a typical trial of SIFTER went. The researchers examined the family of (big breath) Adenosine-5'-Monophosphate/Adenosine Deaminase genes. Scientists have identified 128 genes in this family, in mammals, insects, fungi, protozoans, and bacteria. With careful experiments, scientists have figured out what 33 of these genes do. The genes produce proteins that generally hack off a particular part of various molecules. In some cases, they help produce nitrogen compounds we need for metabolism, while in other cases they help change the information encoded in genes as it is translated into proteins. In still other cases they have acquired an extra segment of DNA that allows them to help stimulate growth.

The SIFTER team first reconstructed the evolutionary tree of this gene family, calculating how all 128 genes are related to one another. The tree shows how an ancestral gene that existed in microbes billions of years ago was passed down to different lineages, duplicating and mutating along the way. The researchers then gave SIFTER the experimental results from just five of the 128 genes in the family. The program used this information to infer how the function of the genes evolved over time. That insight then allowed it to come up with hypotheses about what the other 123 genes in the family do.

Aside from the 5 genes whose function the researchers had given SIFTER, there are 28 with good experimental evidence. The scientists compared the real functions of these genes to SIFTER's guesses. It got 27 out of 28 right.

SIFTER's 96% accuracy rate is significantly better than that of other programs that don't take evolution so carefully into consideration. Still, the Berkeley team cautions that they have more work to do.
The statistics that the program uses (Bayesian probability) get harder to use as the range of possible functions gets bigger. What's more, the model of evolution that it relies on is fairly simple compared to what biologists now understand about how evolution works. But these aren't insurmountable problems. They're the stuff to expect in SIFTER 2.0 or some other future upgrade.

Those who claim to have a legitimate alternative to evolution might want to try to match SIFTER. They could take the basic principles of whatever they advocate and use them to come up with a mathematical method for comparing genes. No stealing any SIFTER code allowed--this has to be original work that doesn't borrow from evolutionary theory. They could then use their method to compare the 128 genes of the Adenosine-5'-Monophosphate/Adenosine Deaminase family. Next, they could take the functions of five of the genes, and use that information to predict how the other 123 genes work. And then they could see how good their predictions were by looking at the other 28 genes for which there's good experimental evidence about their function. All the data to run this test is available for free online, so there's no excuse for these antievolutionists not to take the test. Would they match SIFTER's score of 96%? Would they do better than random? I doubt we'll ever find out. Those who attack evolution these days aren't much for specific predictions of the sort SIFTER makes, despite the mathematical jargon they like to use. Until they can meet the SIFTER challenge, don't expect most scientists to take them very seriously.

Identifying the functions of genes is important work. Scientists need to know how genes work to figure out the causes of diseases and how to engineer microbes to produce insulin and other important molecules. The future of medicine and biotech, it appears, lies in life's distant past.
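
For anyone tempted to take up the challenge, the scoring itself is simple bookkeeping. Here is a minimal Python sketch of that hold-out test. It is not SIFTER's code and has nothing to do with its Bayesian machinery. The function names (`score_predictions`, `majority_baseline`), the gene labels, and the toy data are all made up for illustration. The "baseline" is a deliberately naive stand-in for doing no better than a blind guess.

```python
from collections import Counter


def score_predictions(predicted, known, training_genes):
    """Count correct guesses among genes with known function that were NOT given to the method."""
    held_out = {g: f for g, f in known.items() if g not in training_genes}
    correct = sum(1 for g, f in held_out.items() if predicted.get(g) == f)
    return correct, len(held_out)


def majority_baseline(known, training_genes, all_genes):
    """Naive yardstick: guess the most common training function for every gene."""
    counts = Counter(known[g] for g in training_genes)
    guess = counts.most_common(1)[0][0]
    return {g: guess for g in all_genes}


# Made-up toy data (hypothetical labels), NOT the real deaminase family.
known = {"g1": "A", "g2": "A", "g3": "B", "g4": "A"}  # experimentally known functions
training = {"g1", "g2"}                               # functions handed to the method
all_genes = ["g1", "g2", "g3", "g4", "g5"]            # g5 has no experimental evidence

baseline = majority_baseline(known, training, all_genes)
correct, total = score_predictions(baseline, known, training)
print(f"baseline: {correct}/{total}")

# The figure reported in the post: 27 of the 28 held-out genes.
print(f"SIFTER: {27 / 28:.0%}")
```

Any rival method would only need to produce its own `predicted` dictionary for the 128 genes. The scoring step stays the same, and only the genes with experimental evidence that were held back count toward the score.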
Update Monday 10:30 am: John Wilkins says that bioinformatician is the proper term, although no improvement. I then googled both terms and found tens of thousands of hits for both (although bioinformatician has twice as many as bioinformaticist). Is there an authority we can turn to? And can it try to come up with a better name? Gene voyagers? Matrix masters?