Sons of the conquerors: the story of India?

The past ten years have obviously been very active in the area of human genomics, but in the domain of South Asian genetic relationships in a worldwide context they have seen veritable revolutions and counter-revolutions. The final outlines are still to be determined. In the mid-1990s the conventional wisdom was that South Asians were a branch of a broader West Eurasian cluster of peoples, albeit more distant from the core Middle Eastern-North African-European-Caucasian clade. The older physical anthropological literature would have asserted that South Asians were predominantly Caucasoid, but with an Australoid element admixed in at varying proportions as a function of geography and caste. To put it more concretely, and I think accurately, a large degree of South Asian physical variety can be defined along the spectrum between A. R. Rahman and Nawaz Sharif. The regional and caste truisms are only correlations. Subrahmanyan Chandrasekhar was a Tamil Brahmin, but experienced anti-black racism in the United States. I think that is plausible in light of his appearance.

This rough & ready mainstream understanding, supported by classical genetic markers, was overturned in the early years of the 21st century. One line of thought argued that South Asians were much more distinct from the broader West Eurasian cluster of peoples. Representative of this body of work is a paper like "The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations." These researchers tended to start with the female lineages, mtDNA, and then supplement that with Y lineages, the paternal descent. A separate line of evidence, generally drawn from Y chromosomal results, indicated that there were deep connections between the people of India and those of Central Eurasia, in particular via the R1a haplogroup. Additionally, one aspect of the first set of results which was very surprising was that it actually placed South Asians closer to East, not West, Eurasians. But by the end of the aughts the uniparental studies had been supplemented by a range of results produced from SNP-chips, which looked at hundreds of thousands of genetic variants. These studies seemed to support the older view of South Asians being closer to West Eurasians than East Eurasians. Finally, last year a paper came out which posited that almost all South Asian populations were actually an ancient stabilized hybrid between two groups: a European-like population, "Ancient North Indians" (ANI), and another group which is no longer present in unadmixed form, "Ancient South Indians" (ASI), of whom the Andaman Islanders are distant relatives. Though there was a slight bias toward ANI as a whole, the fraction of ASI increased as one went southeast, and down the caste ladder. The distinctive "South Asian" ancestral group, in other words, may actually be conceived of as a compound of these two elements: an admixture of the native substrate against a European-like genetic background.

Strangely, this sounds an awful lot like the older idea of a Caucasoid population with Australoid admixture. We know now that the connection between the tribal peoples of India, and the indigenous groups of South and Southeast Asia as a whole, to those of Australia and Melanesia, is tenuous at best. So the term "Australoid" is not really informative, and may even mislead. And in terms of historical linguistics I don't think we've solved the problem by appealing to an "Aryan invasion." The high fraction of ANI among South Indian tribal groups who are isolated from even Dravidian caste groups is a clue to the likelihood that the admixture event is very ancient, and probably precedes the arrival of the Aryans in the Indian subcontinent. But there are more than two actors in this game. In "Reconstructing Indian population history" the authors acknowledge that their model is stylized, and that reality is more complex. Additionally, they perceive in their data that some tribal groups from northeast India have an element which is outside of the purview of a two-way admixture event. They discarded this set from their broader analysis because this seemed to be a phenomenon restricted to these groups. A new paper in Molecular Biology and Evolution re-injects this third element into the picture, "Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture":

The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in South and Southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in Southeast Asia with a later dispersal to South Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from South Asia. To test the two alternative models this study combines the analysis of uniparentally inherited markers with 610,000 common SNP loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17-28 KYA) in Southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterised by two ancestral components - one represented in the pattern of Y chromosomal and EDAR results, the other by mtDNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from Southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

Some background is necessary here. South Asia is notoriously linguistically diverse, but that diversity can be bracketed into several broad families. First, the Indo-European languages are represented by Indo-Aryan and Iranian dialects (and Germanic, if you include English). Second, the Dravidian languages are found across the subcontinent, from Brahui in Pakistan to Malto in Bangladesh, though they're really the dominant languages only in the southern cone of South Asia. That being said, it seems likely that historically their distribution extended far into the north, with Brahui in western Pakistan being a relic of that period, as well as the fragmented tribal groups in Central India. There is also evidence down to historic periods of a Dravidian-speaking substrate in Maharashtra. And purely from a philological perspective it seems clear that many Indo-Aryan languages evolved within a Dravidian linguistic substrate. Next, in the far north there are languages of Tibetan provenance and affinity. These are explicable in their origins and relationship. But in the northeast third of the Indian subcontinent there are two groups of Austro-Asiatic languages. The prefix "Austro" is indicative of the symbiotic relationship between historical linguistics and physical anthropology in the early 20th century (most famously illustrated in the transplantation of the social-linguistic term Aryan from a South Asian and Iranian context, to a racialized Northern European term). The map at the top of this post shows the distribution of the Austro-Asiatic languages, as well as their subdivisions. There is clearly an eastern and western wing to the group, but most scholars assume that this is an artifact of the historical eruption of the Burman and Thai peoples out of the southern fringes of the Chinese Empire and into mainland Southeast Asia.

Within India the Austro-Asiatic languages fall into two broad categories: the Munda and the Khasi. The Khasi inhabit the massif which separates Bengal and Assam. Their culture and society are at some variance from the norm in India (they are matrilocal, and animist or Christian). A close relationship to the people to the east is clear in both their language and their physical appearance. The Khasi, and other groups such as the Garo, are of the family of peoples and ethnicities which have arrived from the east and north relatively recently, making the transition from the world of Tibet and Burma to India. This is evident in the face of the Khasi child in the image to the left. Once they passed out of their lands of origin, these populations assimilated to different degrees to the Indic domain. The Tripuri people for example retain a Tibeto-Burman language, but are adherents of Vaishnav Hinduism (my own family were once subjects of the Manikya dynasty). The Ahom of Assam were totally assimilated by the Indo-Aryan substrate. Like the Bulgars of Bulgaria, their only influence was in the ethnonym that they contributed to their subjects. A quick survey of my own genetics, and those of other South Asians of eastern origin on 23andMe, clearly shows the influence of assimilated Tibeto-Burmans. One Bangladeshi Muslim individual clearly carries an East Asian Y chromosomal haplogroup. The Munda are a somewhat different case. In older historical literature on South Asia there is some consideration that the Munda may be the earliest inhabitants of India, predating the Dravidians. Some readers of South Asian origin also point out that in the early Indo-Aryan language there may be more evidence of Munda than of Dravidian influence. But the eastern connections of the Munda languages seem clear, albeit less explicable than those of the Khasi or the Tibeto-Burman peoples of the far northeast.

If the Munda are the indigenous people then it stands to reason that the Mon-Khmer languages derive from South Asia. On the other hand, the vast majority of the Austro-Asiatic languages exist in Southeast Asia, and the Munda themselves have been hypothesized as being the bearers of rice-culture from the east. This is where genetics comes into play. There has already been evidence of an eastern influence in the genes of the Munda from other researchers, so what this paper does is look at that in detail, instead of discarding it as a minor effect which muddles the broader picture. I've reformatted figure 3 to show how the groups relate to each other. On the left is a PCA. The largest component of variance runs west-east (~6%), while a smaller one runs north-south (~1%). On the right is a bar plot generated from ADMIXTURE. I've edited out many of the populations. Focus on the Austro-Asiatic groups from India.

In the PCA you see the SE-NW axis of ANI-ASI admixture which is the primary aspect of genetic variation within South Asia. Numerically, the Dravidian and Indo-Aryan groups along this axis are the vast majority of South Asians. But the Munda and other Austro-Asiatic groups are not trivial; there are strong suggestions that the eastern Indo-Aryan groups, Oriya, Bengali, and Assamese, are to some extent shaped by influence from the Austro-Asiatic elements. The closer connection of the Khasi to East Asian populations is clear on the PCA. But the fact that the South Indian samples are further along the Y axis than the Munda is indicative of admixture in the Munda population. Looking at the bar plot, that's clear. The dominant dark-green signature of South Indian ancestry is also predominant among the Munda, and found at non-trivial amounts among Iranian, Khasi, and Southeast Asian populations, but the Munda clearly have an eastern component which is not found in South Indians. This is probably the element which perturbs them on the PCA. But this just tells us the relationships in terms of total genome content. It doesn't necessarily tell us the historical sequence of admixture events or the direction of migration. In fact the evidence of Indian ancestry in Southeast Asia could be suggesting migration from South Asia to Southeast Asia (there is plenty of cultural evidence of transmission, though the presumption is that the demographic movements were marginal). They note in the paper that one phenomenon which could be obscuring and confusing our understanding is that much of gene flow occurs through isolation-by-distance (IBD), that is, village-to-village dynamics. In contrast to this you have folk wanderings, which result in a "leapfrog" aspect. The Hazara and Uyghur are both cases of leapfrogging, as their genetic makeup can't be explained easily by IBD.
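
The contrast between IBD and leapfrogging can be made concrete with a toy model. What follows is only a minimal sketch, not anything the authors ran: it assumes a one-dimensional chain of villages (a "stepping-stone" model) in which each village swaps a small fraction of its gene pool with its two neighbors every generation. The function names, the migration rate, the number of villages, and the example frequencies are all hypothetical choices made for illustration.

```python
def stepping_stone(n_demes, source_freq, m, generations):
    """Deterministic 1-D stepping-stone model of an allele's frequency.

    Deme 0 is held at source_freq and the last deme at 0.0 (assumed
    boundary conditions). Every generation each interior deme replaces a
    fraction m of its gene pool with an equal mix of its two neighbours.
    """
    freqs = [0.0] * n_demes
    freqs[0] = source_freq
    for _ in range(generations):
        new = freqs[:]
        for i in range(1, n_demes - 1):
            new[i] = (1 - m) * freqs[i] + (m / 2) * (freqs[i - 1] + freqs[i + 1])
        freqs = new
    return freqs


def fits_cline(freqs, tol=1e-9):
    """True if the frequency never rises when moving away from the source."""
    return all(a + tol >= b for a, b in zip(freqs, freqs[1:]))


# Village-to-village gene flow produces a smooth, declining gradient.
cline = stepping_stone(n_demes=8, source_freq=0.6, m=0.1, generations=500)
print([round(f, 3) for f in cline])
print(fits_cline(cline))

# Hypothetical transect: a source, two neighbours lacking the allele,
# then a distant group that carries it. Numbers are illustrative only.
leapfrog = [0.6, 0.0, 0.0, 0.05]
print(fits_cline(leapfrog))
```

Under this kind of neighbor-to-neighbor exchange, every village between the source and a carrier ends up with at least as much of the allele as the carrier itself. A distant carrier separated from the source by non-carriers is the pattern that simple IBD cannot produce, which is what "leapfrog" means here.
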
So here the connections between the Munda and Southeast Asians, and the broader relationship between Southeast Asians and South Asians, could be IBD, or perhaps reflect deep ancient common ancestry. Perhaps the ASI group spanned the region from the Arabian Sea to the South China Sea, and were only later overlain by ANI and East Asian populations. To explore these questions the authors tunneled down to a more fine-grained scale, and looked at uniparental lineages as well as EDAR, a gene at which recent selection seems to have operated upon East Asians in distinction to other groups. Though uniparental lineages are only partially informative in terms of ancestry, they are very amenable to dating because of their haploid inheritance patterns. And the relationships between the terminal branches can give us historical information. The following figure shows the relationship and distribution of a particular Y chromosomal haplogroup which the Munda carry, and other South Asians tend not to, which connects them to the east:

The haplogroup is O2a (M95). The results from the Y chromosomal data are not clear, though they do seem to reject the model whereby Southeast Asian O2a lineages derive from Indian ones. But it does not seem as if you have a scenario where one founder lineage entered into South Asia from Southeast Asia; there are too many disparate branches of O2a found among Indians. Additionally, the coalescence time (back to last common ancestor) is deeper in Southeast Asia, but still deep in South Asia among the Munda. From this it seems that the origin of Austro-Asiatic languages in South Asia can be rejected, but the details of the emergence of Austro-Asiatic in South Asia cannot be clearly perceived as of yet. From what I can gather the authors themselves do not necessarily believe that their results in this domain are robust (insensitive to varying the model's assumptions even marginally). An interesting point though is that the mtDNA, the female lineage, does not seem to diverge from other South Asians much at all. I find it intriguing that this is the same pattern we see along the major NW-SE axis of variation. It seems that mtDNA lineages unite South Asians, while the Y lineages separate them (by caste and region). The generality has many exceptions, but it points to a peculiar sex-mediated admixture process from both the northwest and northeast. Men on the move have reshaped the genetics and culture of South Asia, but the mtDNA lineages still point to an ancient Eurasian group with distant but stronger affinities to the east than the west. The mtDNA are likely the purest distillation of ASI. Finally, they look at frequencies of variants of EDAR among the South Asian groups. EDAR is in some ways diagnostic of East Asian ancestry; it seems that a variant which produces thick straight hair emerged relatively recently among East Asians. Here's the result from the HGDP browser:

The G allele exhibits co-dominance, so the GA phenotype has intermediate hair-thickness between AA and GG. Haplotype structure based tests of natural selection have indicated that the derived G allele is recent. The map to the right shows the frequency of the derived G variant by population group. The bubble size is proportional to frequency, while the colors represent language groups. Again the Khasi and Tibeto-Burman groups are as you'd expect; they exhibit a relatively high frequency of the derived variant. The Hazara are a group which only came into being within the last 1,000 years through an admixture event. The Tharu seem to have their origins in Nepal's transitional zone, and all the Nepali populations have significant admixture with Tibetan groups even if they themselves are not Tibetan in language and culture. The interesting result is the Munda. The Dravidian groups lack the derived EDAR variant, as do Indo-European groups without a plausible East Asian source of admixture. But within the Munda the derived variant is found at a frequency of ~5%. This is far lower than the 60% among the Tibeto-Burmans of the northeast, or the 40% among the Khasi, but it is significant. And this result allows the authors to reject the IBD model of connection for Austro-Asiatic groups, because the Munda harbor the variant which other South Asian groups in their environs do not. Gene flow predicated on linguistic affiliation at such a remove seems implausible, so the most parsimonious explanation is that the Munda languages arrived in India from Southeast Asia as part of a leapfrog folk wandering. But why the low frequency of the derived variant? Obviously the Munda have admixed with the local substrate, so dilution would be one explanation. Another could be that when the Munda left East Asia the frequency was lower.
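
To see what the dilution explanation would require, here is a back-of-the-envelope sketch. It assumes a simple two-way mix in which the local substrate carries none of the derived variant, and it ignores drift and selection entirely. The function names are hypothetical, and the 40% and 60% figures are borrowed from the Khasi and Tibeto-Burman frequencies above purely as stand-ins for the unknown frequency in whatever population the Munda's eastern ancestors came from. This is not a calculation from the paper.

```python
def expected_frequency(source_freq, admixture_fraction, local_freq=0.0):
    """Allele frequency in a two-way mix of a source and a local population."""
    return admixture_fraction * source_freq + (1 - admixture_fraction) * local_freq


def implied_admixture(observed_freq, source_freq, local_freq=0.0):
    """Source ancestry fraction needed to produce observed_freq by dilution alone."""
    if source_freq == local_freq:
        raise ValueError("source and local frequencies must differ")
    return (observed_freq - local_freq) / (source_freq - local_freq)


munda_observed = 0.05  # ~5% derived EDAR variant, from the text

# Stand-in source frequencies (assumptions, not estimates of the real source).
for assumed_source in (0.40, 0.60):
    fraction = implied_admixture(munda_observed, assumed_source)
    print(f"source at {assumed_source:.0%}: implied eastern ancestry {fraction:.1%}")
```

With these stand-in numbers, dilution alone would imply somewhere around a tenth of Munda ancestry coming from the eastern source. A lower starting frequency at the time of departure would push the implied fraction up.
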
Additionally, whatever selective forces were driving the frequency up may have abated in South Asia, and it could be that there was selection against the derived variant! Whatever the truth of it, the existence of the derived EDAR variant among the Munda would be like finding the European LCT variant among an East Asian population: clear evidence of long distance gene flow and population movement. So where does this lead us? First, let me observe that some of the authors on this paper are the same ones who argued for a predominantly indigenous origin for South Asians in the early 2000s based on mtDNA variation. In this paper they seem to be leaning against an indigenous origin for the Munda, or at least refuting the conjecture that the Munda are ur-Indians par excellence. I didn't go into the details of the coalescence times because they're rather a mess, but EDAR is probably a "tipping point" in arguing for a relatively recent exogenous origin for the Munda. The strong sex asymmetry in genetic variation is also suggestive; we have plenty of historical examples of genetic leapfrogs occurring through men-on-the-move. The asymmetry also seems to exist among the Khasi and other Tibeto-Burmans in India's northeast (figure 2 of the paper). The arguments about the history, culture, and genetics of South Asia have traditionally been disputed along the Aryan-Dravidian axis. I'm not interested in rehashing that aspect, but these data point us to another reality: on India's northeast frontier there's another component. As an ethnic Bengali myself I've always been somewhat aware of this. Some of my relatives and family acquaintances look much more like Garos than other South Asians. This component is even more evident in the faces of Assamese and Nepalis, whose languages are Indo-Aryan and whose religion is Hinduism, but whose appearance bespeaks a more variegated background.
On some level South Asians from these regions are aware of their peculiarity, even if it isn't spoken of much. I have read that in the wake of the victory of Japan over Russia in the early 20th century Bengali intellectuals expressed in public their pride at their Asiatic ancestry. With the rise of China in the 21st century I suspect more South Asians from Nepal, Bengal, and Assam will rediscover that aspect of their background which links them to the east, and not the west. The genetics is just telling us what we already knew.

Citation:

Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems, & Toomas Kivisild (2010). Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture. Mol Biol Evol. DOI: 10.1093/molbev/msq288

Link acknowledgement: Dienekes Pontikos.

Addendum: This is more a speculative comment, so I will tack this on to the body of the main post. Here's my current very tentative model for how South Asians came to be. At some point after the last Ice Age 10,000 years ago the ANI arrived, and hybridized with the ASI, who are descendants of the older original Out of Africa wave to South Asia. After this, but before the Aryans, the Munda arrived from the northeast, and pushed into lands inhabited by ANI-ASI groups. 4,000-3,000 years ago the Indo-Aryans arrived, and imposed themselves as an elite on the ANI-ASI hybrid population, before being assimilated biologically and imparting their language to the Indian majority. I don't know where Dravidian came from, but perhaps it was the language of the ANI (its existence in fragments all across the swath of the northern Indian subcontinent is suggestive, as well as possible connections to ancient Elamite, the language of Bronze Age southwest Iran). Eventually the Aryanized ANI-ASI marginalized the Munda in northeast India and drove them to the highlands. Finally, the Tibeto-Burmans arrived in the historical period.

Image Credit: Wikimedia Commons
