My friend Zack Ajmal has been running the Harappa Ancestry Project for several years now. This is a non-institutional complement to the genomic research which occurs in the academy. His motivation was in large part to fill in the gaps in population coverage within South Asia which one sees in the academic literature. Much of that gap is due to politics, as the government of India has traditionally been reluctant to allow sample collection (ergo, the HGDP data uses Pakistanis as its South Asian reference, while the HapMap collected DNA from Indian Americans in Houston). Of course this sort of project is not without its own blind spots. Zack must rely on public data sets to get a better picture of groups like tribal populations and Dalits, because they are so underrepresented in the diaspora from which he draws many of the project participants.
Once Zack has a participant's genotype, one of the primary things he does is add it to his broader data set (which includes many public samples) and analyze it with Admixture, a model-based clustering package. Given a specified number of populations (e.g., K = 12), Admixture assigns each individual a proportion of ancestry from each of those populations. So, for example, with K = 2 individual A might be assigned 40% population 1 and 60% population 2, while individual B might be 45% population 1 and 55% population 2. These are not necessarily 'real' populations. Rather, the populations and their proportions are there to allow you to discern patterns of relationships across individuals.
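To make the K = 2 example concrete, here is a minimal sketch in Python. It is not Admixture itself and does not reproduce its file formats. It only stores the two illustrative individuals from the paragraph above and shows how their proportions can be compared. The function names and the choice of distance (half the sum of absolute differences) are my own assumptions for illustration.

```python
# Ancestry fractions for K = 2, using the illustrative numbers from the text.
# Each list is one individual; each entry is the fraction assigned to one
# inferred population. The populations are statistical constructs, not
# necessarily 'real' groups.
q = {
    "A": [0.40, 0.60],
    "B": [0.45, 0.55],
}


def check_fractions(row, tol=1e-9):
    """Return True if the fractions are non-negative and sum to 1."""
    return all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol


def ancestry_distance(p, r):
    """Half the sum of absolute differences between two profiles.

    0 means identical proportions, 1 means no shared component.
    This metric is an assumption chosen for simplicity.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, r))


if __name__ == "__main__":
    for name, row in q.items():
        print(name, row, "valid" if check_fractions(row) else "invalid")
    print("A vs B:", round(ancestry_distance(q["A"], q["B"]), 3))
```

A small distance between two individuals says only that the model describes them with similar mixtures, which is the sense in which the proportions reveal patterns of relationship.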
Since Zack has put his results online, I thought it would be useful to review what patterns have emerged over the past two years, as his sample sizes for some regions are now reasonably large. Though he uses K = 16 populations, not all of them will concern us, because South Asians do not tend to exhibit many of the components. I will focus on seven: S Indian, Baloch, Caucasian, NE Euro, SE Asian, Siberian and NE Asian. These are not real populations, but the labels tell you in which region each component is modal. So, for example, the "S Indian" component peaks in southern India, the "Baloch" among the Baloch people of southeastern Iran and southwest Pakistan, and the "NE Euro" among the eastern Baltic peoples. The last three are Asian components, running in latitude from south to north to center. They only concern the first population of interest, Bengalis. I will combine these last three together as "Asian."
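Combining the three Asian components is just a sum over columns. The sketch below shows the idea in Python. The component labels come from the list above, but the function name and the example individual are hypothetical, and the numbers are invented for illustration rather than taken from Zack's results.

```python
# The three components merged into a single "Asian" entry.
ASIAN_COMPONENTS = ("SE Asian", "Siberian", "NE Asian")


def combine_asian(fractions):
    """Return a copy with the three Asian components summed as "Asian"."""
    combined = {k: v for k, v in fractions.items() if k not in ASIAN_COMPONENTS}
    combined["Asian"] = sum(fractions.get(k, 0.0) for k in ASIAN_COMPONENTS)
    return combined


# Hypothetical individual, NOT a real project participant.
example = {
    "S Indian": 0.45,
    "Baloch": 0.30,
    "Caucasian": 0.05,
    "NE Euro": 0.05,
    "SE Asian": 0.08,
    "Siberian": 0.02,
    "NE Asian": 0.05,
}

if __name__ == "__main__":
    for label, value in combine_asian(example).items():
        print(f"{label}: {value:.2f}")
```

In a real K = 16 run the seven components discussed here would not necessarily sum to 1, since the remaining nine can carry small fractions.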
Below is a table, mostly individuals from Zack's results (though there are some aggregate results from public data sets). Comments below.
A recent paper suggested that there was a single pulse of admixture between South and East Asians in the environs of what is today Bangladesh which occurred ~500 A.D. The traditional accounts of the arrival of Brahmins in Bengal suggest a period around and after 1000 A.D. (Bengal was one of the last redoubts of institutional Buddhism in northern India, so presumably would have had less need for the services of Brahmins). The results are easy to align with these two facts. All the Bengali non-Brahmins (Baidya are a non-Brahmin high caste in West Bengal) have substantial East Asian ancestry. The Bengali Brahmins have far less of this. Additionally, their "NE Euro" component is about double that of non-Brahmins. There is still room for the Bengali Brahmins being a synthetic community with some admixture (their East Asian fraction is still notably higher than elsewhere in South Asia), but the traditional narrative seems to explain the broad outline of these results.
When you look at South Indians from the four Dravidian states, four facts strike me as notable:
- There is a distinct difference between Brahmins and non-Brahmins (most of the non-Brahmins Zack has in the Harappa data set are upper caste, though the public data sets have Dalits and tribal populations).
- There is very little difference between South Indian Brahmins by region and sect (e.g., Iyengars and Iyers are Tamil Brahmins divided by theological differences).
- South Indian Brahmins are genetically distinct from North Indian Brahmins. They seem to have about half the proportion of the "NE Euro" component found in North Indian Brahmins (e.g., compare to Bengali Brahmins).
- South Indian non-Brahmin upper castes have very little of the "NE Euro" component, which is found at low but consistent fractions among non-Brahmins in the Gangetic plain (and at much higher fractions as one moves toward the Punjab).
I do not know much about the origin of the Pancha-Dravida group of Brahmins, but they look to be endogamous, to derive from the same source, and to have had some admixture with the local substrate early on. This would explain their uniformity and their lower fraction of "NE Euro" in relation to North Indian Brahmins. The results above also suggest that the Syrian Christians derive from converts from the Nair community, or related communities. This should not be surprising.
Finally, let's move to North India, and the zone stretching between Punjab in the Northwest and Bihar in the East. Though in much of this region Brahmins have higher "NE Euro" fractions, this relationship seems to break down as you go northwest. The Jatt community in particular seems to have the highest fraction in the subcontinent. There are inchoate theories for the origins of the Jatts in Central Asia. I had dismissed them, but now think they need a second look. The reasoning is simple. The Jatts of the eastern Punjab have a higher fraction of "NE Euro" than populations to their northwest (Pathans, Kalash, etc.), and also a higher fraction than Brahmin groups in their area (e.g., Pandits) who are theoretically higher in caste status. This violation of both the geographic trend and the caste trend implies something not easily explained by straightforward social and geographic processes. The connection between ancestry and caste status also seems to break down somewhat in the Northwest, as there is wide variation in ancestral components.
Someone with more knowledge of South Asian ethnography should weigh in. But until then I invite readers of South Asian heritage to submit their results to Zack.