A few months ago I reviewed a paper which examined the various complexities of interpreting signals of natural selection from recently developed genomic tests in response to the avalanche of human sequence data. In the paper, Signals of recent positive selection in a worldwide sample of human populations, the authors state:
We find that putatively selected haplotypes tend to be shared among geographically close populations. In principle, this could be due to issues of statistical power: broad geographical groupings share a demographic history and thus have similar power profiles. However, strongly selected loci are expected to show geographical patterns largely independent of demography--depending on the relevant selection pressures, they can be highly geographically restricted despite moderate levels of migration, or spread rapidly throughout a species even in the presence of little migration...Further exploration of the geographic patterns in these data and their implications is warranted, but from the point of view of identifying candidate loci for functional verification, the fact that putatively selected loci often conform to the geographic patterns characteristic of neutral loci is somewhat worrying....
In other words, it could be selection, or it could be demography. Demography should not be destiny. Papers such as this, which refer to broad, sweeping genomic patterns, have to be couched in some caveats; the sample space is huge, and so there are naturally many interesting exceptions to the pattern. From the above it seems that genetic variation which deviates from what one would expect based on the population history of a group and its relationship to nearby groups might be a clue to the action of natural selection, and to its power in mapping genomes onto the local adaptive topography. Concretely, the presence of lactase persistence in Africa among those groups who have a history of cattle culture is highly suggestive, as is the lack of the same trait among these peoples' neighbors who were traditionally farmers. To obtain further clarity on this issue I thought drilling down on one gene highlighted in a recent paper might be of interest. In A natural history of FUT2 polymorphism in humans the authors slice & dice FUT2 every which way. Or more specifically, they slice & dice an exon on FUT2. Here is a summary of the functional significance of this gene:
The protein encoded by this gene is a Golgi stack membrane protein that is involved in the creation of a precursor of the H antigen, which is required for the final step in the soluble A and B antigen synthesis pathway. This gene is one of two encoding the galactoside 2-L-fucosyltransferase enzyme. Two transcript variants encoding the same protein have been found for this gene.
You know about the ABO blood groups because you have probably donated blood, or received blood, and who can give or receive blood is critical. Additionally the genetics is so simple that you are often introduced to it in high school biology. So FUT2 has some role to play in the human immune system.
This immediately shifts our probabilities in terms of whether FUT2 might be a target of natural selection, because loci which have immunological implications are often subject to adaptation. Pathogens are always evolving, and we need to respond. The MHC gene families in particular are likely so diverse because of negative frequency dependent selection, a subcomponent of the Red Queen Hypothesis. First, the abstract, then some specifics on this paper:
Because pathogens are powerful selective agents, host cell surface molecules used by pathogens as identification signals can reveal the signature of selection. Most of them are oligosaccharides, synthesized by glycosyltransferases. One known example is balancing selection shaping ABO evolution as a consequence of both, A and B antigens being recognized as receptors by some pathogens, and anti-A and/or anti-B natural antibodies produced by hosts conferring protection against the numerous infectious agents expressing A and B motifs. These antigens can also be found in tissues other than blood if there is activity of another enzyme, FUT2, a fucosyltransferase responsible for ABO biosynthesis in body fluids. Homozygotes for null variants at this locus present the non-secretor phenotype (se), since they can not express ABO antigens in secretions. Multiple independent mutations have been shown to be responsible for the non-secretor phenotype,which is coexisting with the secretor phenotype in most populations. In this study, we have resequenced the coding region of FUT2 in 732 individuals from 39 worldwide human populations. We report a complex pattern of natural selection acting on the gene. While frequencies of secretor and non-secretor phenotypes are similar in different populations, the point mutations at the base of the phenotypes are different, with some variants showing a long history of balancing selection among Eurasian and African populations, and one recent variant showing a fast spread in East Asia, likely due to positive selection. Thus a convergent phenotype composition has been achieved through different mutations with different evolutionary histories.
The 39 populations were from the Human Genome Diversity Panel. Additionally they used the HapMap populations as references in some of their methods. It is important to note that you have two phenotypes, secretor and non-secretor, but many genetic variants which can result in these phenotypes. Null mutations (some of the ones specified in the paper result in a "stop codon" or are "missense") emerged many times because there are naturally many ways to "break" the functional pathway. This happened with skin color in Eurasia, as Europeans and East Asians seem to have de-pigmented through different loss of function mutations. Though the trait is polygenic, most of the variation across populations is controlled by genes of large effect, and in some cases different genes seem to have lost function in these populations, let alone different mutations on the same gene.
The presence of null mutants on FUT2 isn't exceptional, but the very high frequencies of many different alleles are notable. In strongly functionally constrained regions of the genome, selection would purge these alleles, though normally they would be extant at low frequencies and be masked in heterozygotes. But there is also a good adaptive story for why these alleles are at a high frequency. The nulls (as homozygotes) seem resistant to norovirus infections, which can cause stomach flu. Though an inconvenient illness in the developed world, historically it was probably more problematic: in a Malthusian world, lost labor and calories on the margins could result in starvation or malnutrition. Additionally the null non-secretors also seem to develop HIV more slowly. Though these are two specific cases there are almost certainly other disease related implications beyond this.
The first table shows minor allele SNP frequencies for the populations clustered by geography, and the FST values, as well as FSC and FCT.
Basically these statistics are showing you how much of the variation at a given allele lies between the groups in question rather than within them. FSC and FCT are constrained to within continents and between continents, respectively. Variants which have low values exhibit little variance which can be attributed to structure between the populations. In contrast, those which have high values suggest that there is a lot of interpopulation variation. Interpopulation variation would naturally be one possible sign of localized adaptation. The percentiles represent the position of each allele's FST within the empirical distribution of values across the 650,000 SNP loci in the HGDP sample. Naturally there is a focus on those at, or above, the 95th percentile cut-off. I have rotated the table because the font is rather small and I don't have much width. Apologies. The legend is also below.
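For readers who like to see the arithmetic, here is a minimal Python sketch of what an FST value and its empirical percentile are measuring. It uses the simple Wright-style definition for a single biallelic SNP, FST = (HT - HS) / HT, which is an assumption on my part; the paper's values presumably come from a variance-components estimator, so this will not reproduce their numbers. The function names and the toy frequencies are hypothetical.

```python
def fst_biallelic(freqs, sizes=None):
    """Wright-style FST for one biallelic SNP.

    freqs: allele frequency in each population (hypothetical toy input).
    sizes: optional population weights; equal weights if omitted.
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]

    # Expected heterozygosity if all populations were pooled into one.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # the site is monomorphic, so there is no variation to partition

    # Weighted mean of the expected heterozygosity within each population.
    h_within = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))
    return (h_total - h_within) / h_total


def empirical_percentile(value, background):
    """Percentage of background values that are <= value."""
    return 100.0 * sum(b <= value for b in background) / len(background)
```

Identical frequencies everywhere give an FST of 0, and populations fixed for different alleles give 1. Ranking a SNP against a genome-wide background, as `empirical_percentile` does, is the logic behind the 95th percentile cut-off: it asks whether a locus is unusual relative to loci that share the same demographic history.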
Frequencies in the header indicate the percentage of null alleles within each continental region. a Positions according to bibliography (Koda et al. 2001); * SNPs previously described in literature (Koda et al. 2001; Liu et al. 1998; Kelly et al. 1995; Chang et al. 1999; Pang et al. 2000; Yu et al. 1995; Koda et al. 1996; Liu et al. 1999; Peng et al. 1999; Yu et al. 1999). b Phenotype defined as Se indicates functional allele; Pr. Se, presumably functional allele; se, non-functional allele; and NA, not available information. c Interpopulation differentiation statistic calculated between the 39 populations, d within continents and e between continental groups. f Percentile of the empirical distribution. With an asterisk, values exceeding the 95th percentile. ns, not significant. SSAFR = Sub-Saharan Africa; MENA = Middle East-North Africa; EUR = Europe; CSASIA = Central-South Asia; EASIA = East Asia; OCE = Oceania; AME = America.
The authors note:
* A few of the FST values show a pattern where the between population variance is actually between Western + Central + South Eurasia + Sub-Saharan Africa vs. East Asia. This is atypical, and does not comport with descent (where non-Africans are all within the same clade as a subset of Africans).
* The highest FST values seem to belong to non-synonymous variants. That means a change in the base pair which has a functional significance (it changes the amino acid encoded by the codon).
* Four contiguous SNPs with very high FST range from base pair position 342 to 385. In other words, they are suspiciously close within the sequence of this gene. There may be biophysical reasons why this section of the exon is enriched with significant alleles.
I've cut off the next table at Tajima's D because it gets the big picture across. And yes, because I don't have as much width as I would wish.
N = number of individuals; S = segregating sites; Hd = haplotype diversity; π = average number of nucleotide differences per site; θW = Watterson estimator; * P < 0.05; ** P < 0.01; *** P < 0.001
If the spare definitions above for the parameters which are used to calculate Tajima's D do not suffice, I highly recommend Detecting Natural Selection (Part 7) from RPM. He breaks down the parameters and how they come together to generate a statistic that can tell us about deviation from neutral expectations due to natural selection or demography. All that really matters in terms of results is that a positive D suggests balancing selection and/or decreasing population size, and a negative D suggests an expanding population and/or purifying selection, with the purification being a byproduct of positive selection on specific alleles. The Basque and North Italian populations show positive and significant values for three out of the four tests. Sorry I didn't show you the other three! The same is true of the Mandenka and Biaka Pygmies. Compared to 132 genes with known Tajima's Ds for these populations, the FUT2 values fall above the 95th percentile threshold for positive values (the same holds for the Sardinians). Another comparison, which used 250 genes, placed the East Asian populations below the 5th percentile of the empirical distribution. Remember, negative values may indicate positive selection.
Next they looked at the relationships of the various haplotypes in these populations, the patterns of SNPs which are correlated together. If a haplotype can be abstracted as an individual, this is simply a family tree that shows how the haplotypes branched off from each other, one novel SNP at a time. There are two major branches. On one side are non-functional alleles found only in Western Eurasia and Sub-Saharan Africa, in regions where balancing selection is evident. The other family includes both functional and non-functional alleles. I redid figures 1 & 2.
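As an aside on the Tajima's D values discussed above, here is a minimal Python sketch of how the statistic is put together from a set of aligned sequences. It follows the standard textbook formula: compare the mean number of pairwise differences with the number of segregating sites scaled by the harmonic-like term a1, then divide by the estimated standard deviation. The function name and the toy sequences are hypothetical, the input is assumed to be equal-length strings with no missing data, and this is an illustration, not the authors' pipeline.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences (no missing data)."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is zero for n < 4, so D is undefined

    # S: number of alignment columns with more than one allele.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        return None  # no variation, nothing to test

    # Mean number of pairwise differences across all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from the standard formula.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Positive when intermediate-frequency variants dominate, negative when rare ones do.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: two deeply split haplotype classes at equal frequency.
balanced = ["AAAA", "AAAA", "TTTT", "TTTT"]
# Toy data: every variant is a singleton, as after a sweep or an expansion.
singletons = ["AAAAA", "TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT"]
```

Running `tajimas_d(balanced)` gives a positive value and `tajimas_d(singletons)` a negative one, which is the intuition behind reading the European and African values as balancing selection and the East Asian values as possible positive selection. The statistic alone cannot separate those readings from demographic explanations.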
I apologize for the small font. If you care enough you can make it out. You get the gist though. A total mess.
Compare to the lactase persistence phylogeny for Eurasia. Granted, only Eurasia and not the world. The diversity and peculiar patterns of variation which don't always map onto genealogy, i.e., East Asians being the outgroup to other Eurasians & Africans, are suggestive of selective forces reshaping the frequencies, out of kilter with the random walk of drift, which introduces predictable variation across descent groups.
Finally they conclude with some haplotype based tests of natural selection. I have covered this alphabet soup before (http://blogs.discovermagazine.com/gnxp/2009/03/signals_of_recent_positive_sel.php). Because of the negative values of Tajima's D in East Asia, that region was of particular interest. If a new allele was being driven to fixation, it would produce a long haplotype as it "outran" (at least transiently) recombination's power to destroy the generated linkage disequilibrium. So the model would be a bout of positive selection dragging a long haplotype to higher frequency, making linkage disequilibrium noticeable across the population in this region of the genome. Hopefully EHH or iHS would twitch at this point. It doesn't look like they came up with much. I'll quote them:
...To detect the signal of positive selection on the FUT2 region, we measured the Extended Haplotype Homozygosity (EHH) versus core haplotype frequency at a fixed length of 0.3cM in both directions from the core haplotype (Sabeti et al. 2002). P-values were significant (<0.05) for 17 core SNPs in four East Asian populations (Yakut, Han, Cambodian and North East China) however, after applying multiple testing correction (Hochberg and Benjamini 1990) none of the cited 17 SNPs remains significant (q-value = 0.20). We also applied the iHS method (Voight et al. 2006), where the integral under the EHH decay plot from any individual SNP is calculated. With this method we detected a peak at ~600Kb from the 5'extreme of FUT2. Several genes are mapped between FUT2 and this position, making unreliable the relationship between this signature and a positive selection event at FUT2.
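For those who want the mechanics behind the EHH statistic in the quote, here is a minimal Python sketch. EHH at a given marker is the probability that two randomly chosen chromosomes carrying the core allele are identical across the whole stretch from the core to that marker. The function name `ehh`, the toy haplotype strings, and the use of marker indices in place of the 0.3 cM genetic distance are my own assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, upto_index):
    """Extended haplotype homozygosity from the core marker out to upto_index.

    haplotypes: phased haplotypes as equal-length strings (hypothetical toy input).
    """
    # Only chromosomes carrying the core allele are considered.
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return None  # need at least one pair of chromosomes to compare

    # Group carriers by their sequence over the interval (works in either direction).
    lo, hi = sorted((core_index, upto_index))
    classes = Counter(h[lo:hi + 1] for h in carriers)

    # Fraction of carrier pairs that are identical over the whole interval.
    return sum(comb(count, 2) for count in classes.values()) / comb(n, 2)


# Toy phased haplotypes; the core SNP is at index 0 and the core allele is "1".
toy = ["1000", "1000", "1001", "1010", "0111"]
decay = [ehh(toy, 0, "1", i) for i in range(4)]
```

Here `decay` starts at 1.0 at the core and falls as recombination and mutation break the carriers into distinct classes. A recently swept allele would hold a value near 1 over an unusually long distance for its frequency, and iHS integrates the area under this decay curve and compares the derived and ancestral alleles.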
After all this I think that the likelihood of balancing selection still seems rather high (the above tests wouldn't have picked that up in any case). In their phylogenies they calculated that the most recent common ancestor of all the variants was 2.61-5.27 million years ago, further back in time than the average neutral ancestor in the human genome of 0.7 to 1.2 million years ago. Neutral processes are such that over time all genetic variants should go extinct and be replaced. Unless selection works against this process, and that is what balancing selection does. Though there are many forms of balancing selection, one of the most common is frequency dependence, so that the fitness of the allele is inversely proportional to its frequency. This is common among immunologically relevant genes because rare variants tend to have defenses which common pathogens can't cope with. Of course once the rare variants increase in frequency...the pathogens develop strategies. This is the evolutionary arms race which balancing selection is often witness to. This is why some of the MHC variants have a time depth beyond the chimp-human speciation event, as some of these alleles on FUT2 may.
There is some reference in the discussion to the rather erratic patterns within Africa, from which the authors infer localized adaptation. Perhaps the signatures of ancient diseases past? The analysis here seems muddy to me, and they admit small sample sizes for some of the African populations which they presume to speak of. But in East Asia there is a clear pattern which diverges from the rest of the world. Tajima's D implied positive selection, though they couldn't verify this via EHH or iHS. The difference between East Asia and the rest of Eurasia does seem to trouble them, and they can't understand why the non-functional alleles of Africa and West Eurasia didn't sweep across to East Asia earlier, where loss of function mutations emerged in situ.
I have suggested before that until very recently gene flow between West and East Eurasia was very restricted, so the seal between the regions might have held for FUT2 as well.
One final issue I would like to moot is the idea that the non-secretor vs. secretor phenotype exhibits dominance-recessive dynamics, and some problems that might present in terms of persistence of the null alleles. In the somewhat thin adaptive scenarios of disease resistance it was the putatively recessive phenotype with the higher fitness. If there was a negative feedback loop where the oscillation away from the equilibrium swung to a low enough frequency, it seems that too many of the null alleles would be masked in heterozygotes when the new pathogen arose. The extremely long period of low frequency of fitness conferring alleles which express only in the homozygote state tends to allow recombination to break up haplotype blocks. Perhaps this is why there wasn't anything picked up in the East Asian samples? There was mention of heterozygote phenotypes which differed from the wild type secretor, but strangely it seemed that the heterozygote had a lower fitness than either homozygote (underdominance), so it seems that this would be an adaptive landscape with an unstable equilibrium.
The authors did a good job highlighting the peculiarities in the distribution of FUT2 which indicate possible selective dynamics at work. Additionally the time depth of the phylogenies and Tajima's D argue for a deviation from the neutral expectation of the molecular clock. But there was no confirmation of positive selection in East Asians with the haplotype based tests. Genotypic variation and dynamics were relatively well characterized by this group. It seems that the next step might be to tease out the details of the phenotypic variation. Is the binning into two categories too coarse? What other disease related implications might this pathway have? What other biochemical pathways might be perturbed?
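To make the dominance argument above a little more concrete, here is a minimal Python sketch of the textbook one-locus, two-allele selection recursion under random mating. The fitness values are invented for illustration and are not estimates from the paper; the function names are hypothetical, and the model ignores drift, mutation, migration and frequency dependence.

```python
def next_null_freq(q, w_SS, w_Ss, w_ss):
    """Frequency of the null allele (s) after one generation of viability selection.

    q: current null-allele frequency; w_*: relative fitnesses of the three
    genotypes (hypothetical values). Assumes random mating in a very large population.
    """
    p = 1.0 - q
    mean_w = p * p * w_SS + 2.0 * p * q * w_Ss + q * q * w_ss
    # Null copies come from ss homozygotes and from half the alleles in heterozygotes.
    return (q * q * w_ss + p * q * w_Ss) / mean_w


def run(q0, generations, w_SS, w_Ss, w_ss):
    """Iterate the recursion for a fixed number of generations."""
    q = q0
    for _ in range(generations):
        q = next_null_freq(q, w_SS, w_Ss, w_ss)
    return q


# Recessive advantage: only ss homozygotes benefit, so a rare null barely moves.
recessive = run(0.01, 100, 1.0, 1.0, 1.05)
# Same advantage but dominant: heterozygotes benefit too, so a rare allele climbs quickly.
dominant = run(0.01, 100, 1.0, 1.05, 1.05)
# Underdominance: the interior equilibrium at 0.5 is unstable in both directions.
below = run(0.45, 500, 1.0, 0.9, 1.0)
above = run(0.55, 500, 1.0, 0.9, 1.0)
```

With these toy numbers the recessive allele hardly changes in 100 generations, because at a frequency of q only a fraction q of the null copies sit in homozygotes where selection can see them. The underdominant case runs to loss or to fixation depending on which side of 0.5 it starts. Neither behavior looks like a recipe for stable, high-frequency polymorphism, which is the puzzle raised above.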
I strongly suspect that a recessive fitness boosting phenotype can't be the only thing at work.
Citation: Anna Ferrer-Admetlla, Martin Sikora, Hafid Laayouni, Anna Esteve, Francis Roubinet, Antoine Blancher, Francesc Calafell, Jaume Bertranpetit, and Ferran Casals, A natural history of FUT2 polymorphism in humans, MBE Advance Access, published on June 1, 2009, DOI 10.1093/molbev/msp108.