A few months ago a friend tipped me off that David Reich was going to publish a paper about the genetics of Indians which, as far as my friend could ascertain, was going to model these populations as hybrids between "Europeans and Andaman Islanders." The paper is out, and my friend was roughly right. Reconstructing Indian population history:
India has been underrepresented in genome-wide surveys of human variation. We analyse 25 diverse groups in India to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the 'Ancestral North Indians' (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, whereas the other, the 'Ancestral South Indians' (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in most Indian groups, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the indigenous Andaman Islanders are unique in being ASI-related groups without ANI ancestry. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years owing to endogamy. We therefore predict that there will be an excess of recessive diseases in India, which should be possible to screen and map genetically.
The paper itself is relatively tight and concise; a lot of the sausage-making is thrown into the supplementary information. This is freely available online, and in fact I would suggest that the first half of supplement 1 has more meat than the paper itself. As for the paper, the text is not as bold as the abstract, or the press summations which have appeared in its wake. For example, they say:
We warn that 'models' in population genetics should be treated with caution. Although they provide an important framework for testing historical hypotheses, they are oversimplifications. For example, the true ancestral populations of India were probably not homogeneous as we assume in our model, but instead were probably formed by clusters of related groups that mixed at different times. However, modelling them as homogeneous fits the data and seems to capture meaningful features of history.
I generally agree with the gist of this. The main issue I would also highlight is that these results only clarify and solidify what was likely from previous analyses of worldwide genetic variation. That is, the populations of Northwest India are closer to those of the Middle East & Europe than those of Southeast India are. It was rather awesome that they confirm that the Onge, who are almost extinct, are a relatively unadmixed ancient population. The Onge branch seems to descend from an ancestral population which also gave rise to what is termed in the paper "Ancestral South Indian" (ASI). They exhibit no admixture with "Ancestral North Indians" (ANI). This paper also confirmed and clarified that the proportion of West Eurasian related lineages increases as a function of both geography and caste. That is, there is a SE-NW and lower-to-upper caste gradient whereby West Eurasian related lineages become more prevalent. This has long been known, but this paper did it with more SNPs across the genome. Here is a table which shows the proportion of ANI in a range of populations:
All you really need to know about the Z-score is that negative scores indicate high levels of admixture. Here is a table which tells you a bit more about the populations above:
The following figure illustrates the general model which looms in the background of this paper:
Note that the Andaman Islanders, the Onge, aren't really the ancestors of Indians on the mainland. Rather, they're a branch of the ancient population which presumably first settled South Asia, and close to the ASI. Who were the ASI? Since they aren't really around, we can only generate conjectures and inferences. In this paper the ANI are actually represented in some ways by Europeans, even though presumably the assumption is that both these are daughter populations of another group. Though not pushed very hard, they do mention proto-Indo-Europeans as the candidate for the ANI. At this point, let's look at the PCA chart (I've reedited and labelled as usual):
This should not surprise; previous work shows that South Asians distribute along an axis away from Europeans. One of the points in the paper is that there is both geographic and caste stratification. I added some labels, but I thought drilling down was probably useful. I don't know all these groups off the top of my head, and I assume few readers do either. So I zoomed in:
I think some of the shortcomings of a sample size on the order of the low hundreds are rather clear. They couldn't even use all their samples, or some of the samples were not relevant to the question at hand. The Siddis are an Indian-African mix which emerged during the period of Muslim domination when that group imported black slaves. The Tibeto-Burman groups of Northeast India are interesting, but outliers. The general trends are clear: North Indian groups have more ANI than South Indian groups, and upper caste groups have more ANI than lower caste groups, but that is only with "all things equal." Note that upper caste South Indian groups clearly have more ANI than lower caste South Indians, but they have a lower proportion than some North Indian lower castes, and are in the range of one North Indian tribal group. Some of the outliers are also interesting; the lower caste individual similar to Austro-Asiatic tribals is from a group which resides in a region with many Austro-Asiatic peoples. Clearly there has been identity switching, so you have aberrations such as one North Indian tribal who clusters with Kashmiri Pandit Brahmins! The Austro-Asiatic group is also interesting, because they speak languages related to those of Southeast Asia. Here is a map of the Austro-Asiatic languages:
We know with near 100% certainty that much of Burma & Thailand were dominated by Mon-Khmer languages before the arrival of the Shan, Bamar (Burmans) and Thai peoples (to mention a few). This is a matter of historical record; the rise of modern Burma and Thailand was largely a story of the eclipse of Mon and Khmer societies who transmitted to them much of the Indic character which they have (e.g., the northern populations often arrived as Mahayana Buddhists, but the Theravada Buddhism of the Mon and Khmer was adopted as the dominant religion in the new states). The position of the Munda languages is more confused, as some posit that they arrived from the east, while others argue that the Austro-Asiatic languages expanded east from India. This is not going to be resolved in this blog post, but let me note that the genetic data above, which show an "eastern" affinity of the Munda, can be combined with cultural data such as the arrival of rice farming from the east and historical records which document the migration of populations from Burma, to construct a plausible east-west narrative. In contrast it seems an almost default position by many that the Austro-Asiatics are the most ancient South Asians, marginalized by Dravidians, and later Indo-Europeans. I would not be surprised if it was actually first Dravidians, then Austro-Asiatics and finally Indo-Europeans. Dravidian languages are found in every corner of the subcontinent (Brahui in Pakistan, a few groups in Bengal, and scattered through the center) while the Austro-Asiatics exhibit a more restricted northeastern range. As I noted above, supplement 1 has a lot of gems. For example, the authors note that previous work which found little regional differentiation in Indian Americans might have been problematic because there is a great deal of intraregional variance which, when collapsed, loses essential information. This chart shows South Asians + Utah Whites + 85 American Gujaratis in light blue:
Note that about half of Gujaratis form their own unexplained cluster! Throwing them together in one pool would mask this phenomenon. Here's their possible explanation:
Interestingly, one of the GIH subgroups fall outside the main gradient of Indian groups, suggesting that they harbor substantial ancestry that is not a simple mixture of ASI and ANI. A speculative hypothesis is that some Gujarati groups descend from the founders of the "Gurjara Pratihara" empire, which is thought to have been founded by Central Asian invaders in the 7th century A.D. and to have ruled parts of northwest India from the 7-12th centuries. I. Karve noted that endogamous groups with names like "Gurjar" are now distributed throughout the northwest of the subcontinent, and hypothesized that they likely trace their names to this invading group.
I don't know if this is plausible; perhaps a Gujarati reader would immediately recognize what this cryptic substructure is. Next are two charts which show Indians, Europeans, and Chinese. In the first the PCA was originally constructed with Europeans & Chinese, and the Indians were projected onto it using the variation found in the first two groups. In the second case, Indians and Chinese were used to construct the PCA, and Europeans projected.
What you see is that Europeans are all equally related to Indians, but Indians exhibit a gradient of relationship to Europeans. That is, there is no European group which in particular resembles Indians via the connection with ANI; the distance between all European groups and ANI seems roughly equal. The Indians vary in their relationship to Europeans because they vary in their proportion of ANI. In the table above there is a reference to the proportion of ANI and ASI in each Indian group. One question you might ask: how do you estimate the proportions of ancestry from groups which you don't have any information about because they no longer exist? Europeans and the Onge can serve as proxies for the ANI and ASI respectively, but how far does this get you? Well, the methods that they used (they have three) which determine ancestral proportions can be used on populations which exist. So here is a figure which shows how their methods compare when you look at a population where we know something concrete about their ancestral populations because those ancestral populations are still extant, African Americans:
I also believe that their calculations are roughly correct because they pass the smell test. It isn't as if this is the first study of the genetics of Indians. Though the assumptions of Structure based analysis are somewhat different, you can discern the same rank orders. Moving back to the nature of population structure within India, as opposed to how Indians relate to non-Indians, one of the results which pops up is that South Asian groups seem to have very high Fst values relative to European ones when compared within regions or between neighbors. Remember that Fst is a rough measure of the proportion of genetic variation which occurs between groups. The famous maxim that "85% of variance is within races, and 15% between races," is Fst based. The Fst in that case is 0.15. Corrected for region & caste, they find that South Asian groups seem to have Fst values on the order of 3-4 times higher than equivalent European groups. This isn't too surprising; in History and Geography of Human Genes L. L. Cavalli-Sforza observes that Europeans are particularly homogeneous. Before the spate of 650 K SNP papers it was hard to find good stuff on the phylogeography of European populations because the techniques didn't have the power to differentiate them. On the other hand, anthropologists have long thought that India was riddled with differentiation. After all, there's the caste system. Indians are certainly physically diverse. Additionally, there is a line of thinking that India is the secondary Africa, insofar as most Eurasian and Australasian lineages go back to India. Like Africa, India may hold a great deal of diversity among its many populations because they're old, the oldest in Eurasia and Australia (in concert with endogamy of course). The authors though have another model:
We propose that the high FST among Indian groups could be explained if many groups were founded by a few individuals, followed by limited gene flow. This hypothesis predicts that within groups, pairs of individuals will tend to have substantial stretches of the genome in which they share at least one allele at each SNP. We find signals of excess allele sharing in many groups.
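For readers who want the arithmetic behind Fst, here is a minimal sketch in Python. It uses the textbook heterozygosity-based definition for two equally weighted groups, which is not necessarily the estimator used in the paper; the function name `fst` and the allele frequencies in the example are invented for illustration.

```python
def fst(freqs_a, freqs_b):
    """Heterozygosity-based Fst for two equally weighted populations.

    freqs_a and freqs_b hold the frequency of the same allele at each
    SNP in population A and population B. The result is the ratio of
    sums across SNPs: (total - within) / total heterozygosity.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        # Expected heterozygosity if the two groups were pooled.
        h_total = 2.0 * p_bar * (1.0 - p_bar)
        # Average expected heterozygosity within each group.
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    # SNPs fixed for the same allele in both groups carry no information.
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Toy frequencies, not data from the paper.
    group_a = [0.10, 0.50, 0.80]
    group_b = [0.20, 0.45, 0.60]
    print(fst(group_a, group_b))
```

Identical frequencies give 0, and groups fixed for different alleles give 1; a value of 0.15 means 15% of the variation is between groups.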
They go on:
Six Indo-European- and Dravidian speaking groups have evidence of founder events dating to more than 30 generations ago...including the Vysya at more than 100 generations ago...Strong endogamy must have applied since then (average gene flow less than 1 in 30 per generation) to prevent the genetic signatures of founder events from being erased by gene flow. Some historians have argued that ‘caste’ in modern India is an ‘invention’ of colonialism in the sense that it became more rigid under colonial rule. However, our results indicate that many current distinctions among groups are ancient and that strong endogamy must have shaped marriage patterns in India for thousands of years
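To put rough numbers on that quote, here is a back-of-the-envelope sketch in Python. The generation lengths (25 to 30 years) are an assumption, not a figure from the paper. The `surviving_fraction` function is a toy model, also an assumption, in which a fraction m of each generation is replaced by outsiders, so that (1 - m)^g of the founder-derived ancestry remains after g generations.

```python
def generations_to_years(generations, years_per_generation):
    """Convert a count of generations to calendar years."""
    return generations * years_per_generation


def surviving_fraction(migration_rate, generations):
    """Toy model: share of founder-derived ancestry left after repeated gene flow.

    Each generation a fraction `migration_rate` of the gene pool is
    replaced from outside, so (1 - m) ** g remains after g generations.
    """
    return (1.0 - migration_rate) ** generations


if __name__ == "__main__":
    # Assumed range of generation lengths: 25 to 30 years.
    for gens in (30, 100):
        low = generations_to_years(gens, 25)
        high = generations_to_years(gens, 30)
        print(f"{gens} generations: {low} to {high} years")

    # Gene flow of 1 in 30 per generation, sustained for 30 generations.
    print(round(surviving_fraction(1 / 30, 30), 2))
```

Under these assumptions 30 generations comes out at 750 to 900 years and 100 generations at 2,500 to 3,000 years, while gene flow of 1 in 30 sustained for 30 generations would leave a bit over a third of the founder-derived ancestry in place.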
This is one of the places where you get some sense of time scales. In the rest of the paper they avoid this. They note in one of the figures: "Although the model is precise about tree topology and ordering of splits, it provides no information about population size changes or the timings of events." But the numbers above give time scales of foundings on the order of 1,000 years, with perhaps others at 3,000 years. Elsewhere they say:
Two features of the inferred history are of special interest. First, the ANI and CEU form a clade, and further analysis shows that the Adygei, a Caucasian group, are an outgroup. Many Indian and European groups speak Indo-European languages, whereas the Adygei speak a Northwest Caucasian language. It is tempting to assume that the population ancestral to ANI and CEU spoke 'Proto-Indo-European', which has been reconstructed as ancestral to both Sanskrit and European languages, although we cannot be certain without a date for ANI-ASI mixture.
Despite the hedge, the allusion here suggests a date pegged on the order of 4,000 years ago. We don't know much about how the Indo-Aryans arrived in India; the earliest extant records, the Vedas (which were transmitted orally initially), seem to be set in Northwest India. The general suspicion though is that the Indus Valley Civilization was not Indo-Aryan, and there is a Dravidian speaking population in the west of Pakistan, suggesting that that language group was at one point spoken in the region. All in all the outline being faintly sketched out in this paper sounds a lot like what Indians refer to as the Aryan invasion theory, a mass movement of populations out of the Northwest replacing and subjugating the natives. ANI values on the order of 70-80% in the Northwest seem to suggest near total replacement. I'm skeptical. Obviously the Indo-Aryans had to arrive physically, but these sorts of nomadic populations tend to quickly dominate and culturally assimilate sedentarists. In the case of the Hungarians and Turks they even imposed their language upon the natives, with only marginal genetic impact. The paper itself points to the likelihood of a complex history of periodic, and perhaps continuous, gene flow. Two ancient populations mixing is what economists would term a "stylized fact," good enough to get some points across, but not to be confused for reality. What about the idea of foundings and subsequent endogamy explaining the high Fst? 2,500 years ago Herodotus already reported that India was the most populous nation in the world (he did not know of China). It isn't as if the Indo-Aryans arrived in the New World, where the natives died off so that they could enter into a major demographic expansionary phase. That being said, India's population did grow over time as cultures pushed east with better tools (e.g., iron axes), and cut down the local forests. To really test drive this model you need more than 132 individuals from 25 populations.
You need a lot of data from many individuals to get a more granular feel for the variation. Population expansions did occur in the east down to the Mughal period as land was reclaimed for agriculture. Much of eastern Bengal was settled relatively recently, within the last 500-1000 years. In some regions we do have a sense of what the demographic history was, so we should be able to predict patterns of Fst if the model of founding + endogamy is operative. Historically this may make sense for some groups, such as Brahmins, who migrated to various regions to provide specialized services and then became indigenized, but it seems unlikely as an explanation for the majority of castes and jatis. Many of the same dynamics at work in India were probably at work in the Middle East. And also in Europe, which went through a population crash and "bounce back" after the fall of the Roman Empire. They should have just stuck with a tree without the timing.... John Hawks has a related post. Citation: Reich D, Thangaraj K, Patterson N, Price AL, Singh L. 2009. Reconstructing Indian population history. Nature 461:489-494. doi:10.1038/nature08365