It's been 10 months since Zack Ajmal first contacted me about the possibility of the Harappa Ancestry Project. I was of two minds. On the one hand, I did think there was a major problem with undersampling of some regions of South Asia. On the other, it seemed that the 1000 Genomes would fix that soon enough. As it turns out, the 1000 Genomes has been a bit slower than I had anticipated (and I assume that the nixing of the Indian samples was a matter of politics, not science). So I'm glad Zack started the project when he did.
At this point he's hit the zone of diminishing marginal returns when it comes to participants. Looking through his samples, I count a little over 100 non-founders of unadmixed South Asian ancestry (I'm not a founder because both my parents are in the database). I decided to prune the individuals down to this selection, tack on a lot of his reference populations, with a bias toward South Asians, and see what I could find. I used his K = 11 ADMIXTURE run, since this seems maximally informative for South Asians. You can find the file here.
One interesting aspect of Zack's project is that he began to collect Y and mtDNA haplogroups at a certain point. Not too surprisingly, there was a preponderance of R1a1a. For many years now this paternal marker has been suggested to have some association with Indo-Iranians, though more recently researchers have suggested that it is in fact a very old haplogroup, sharply differentiated between a European branch and a South Asian one.
Zack has 56 individuals with both Y and mtDNA information in his database. These have to be males. He has 14 individuals with mtDNA information and no Y information. These are probably females (obviously there could be males who entered only their mtDNA information, but this seems unlikely given that most of the results come from 23andMe). 27 of the males are R1a1a. 29 are not. The mean "Onge" proportion of those with R1a1a is 24%. Without? 24%.
The respective values for "South Asian" are 56 and 55 percent. In this likely skewed sample R1a1a doesn't seem to predict the ancestral variation much.
How about we look at mtDNA? Haplogroup M is localized to South Asia. Dividing the population into M and not M you get the following values:
Not M, South Asian = 55%
Not M, Onge = 23%
M, South Asian = 56%
M, Onge = 23%
There doesn't seem to be that much in uniparental markers, which aligns with my intuition. At least at this scale of analysis. So let's look at the autosomal genome. The total genetic variation. If you've been following HAP the following won't be news; for those who haven't, I thought I'd generate some plots.
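For anyone who wants to repeat the uniparental comparison above on their own spreadsheet, here is a minimal sketch of the bookkeeping in Python. To be clear about the assumptions: the records below are made-up numbers, not Zack's data, and the field names and the `mean_by_group` helper are hypothetical, invented only for illustration. The idea is just to split individuals by haplogroup and compare the mean ADMIXTURE fraction on each side.

```python
from statistics import mean

# Made-up illustrative records, NOT Zack's data. Field names are hypothetical.
# "y" is None for participants who reported no Y haplogroup.
samples = [
    {"y": "R1a1a", "mt": "M30", "S.Asian": 0.60, "Onge": 0.20},
    {"y": "R1a1a", "mt": "U2",  "S.Asian": 0.50, "Onge": 0.30},
    {"y": "H1",    "mt": "M2",  "S.Asian": 0.55, "Onge": 0.25},
    {"y": "L1",    "mt": "U7",  "S.Asian": 0.45, "Onge": 0.25},
    {"y": None,    "mt": "M5",  "S.Asian": 0.50, "Onge": 0.30},
]


def mean_by_group(records, in_group, component):
    """Return (mean inside the group, mean outside it) for one component."""
    inside = [r[component] for r in records if in_group(r)]
    outside = [r[component] for r in records if not in_group(r)]
    return mean(inside), mean(outside)


# Y comparison: only records with a Y haplogroup can be classified.
with_y = [r for r in samples if r["y"] is not None]
r1a_onge = mean_by_group(with_y, lambda r: r["y"] == "R1a1a", "Onge")

# mtDNA comparison: treat M and its subclades (M2, M30, ...) as "M".
m_sasian = mean_by_group(samples, lambda r: r["mt"].startswith("M"), "S.Asian")

print(f"Onge, R1a1a vs not: {r1a_onge[0]:.0%} vs {r1a_onge[1]:.0%}")
print(f"S.Asian, M vs not M: {m_sasian[0]:.0%} vs {m_sasian[1]:.0%}")
```

This only compares means. With groups of a few dozen people you would also want some measure of spread before reading anything into a one or two point difference, which this sketch leaves out.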
The two-way admixture aspect of South Asian populations is evident in the HAP data. "Onge" refers to an element with affinities to Andaman Islanders. "S.Asian" seems to be some sort of compound, but with strong West Eurasian affinities. The axis is NW-SE, upper caste to lower caste, just as you'd expect.
There are two West Eurasian components which aren't collapsed into "S.Asian": "SW.Asian" and "European." The names are rather self-evident. The interesting thing here is that "SW.Asian" tends to be elevated among South Indians, especially non-Brahmin upper castes. In contrast, there is far less "SW.Asian" amongst Northeast Indians, and proportionally more "European." This is more evident when you look at populations in the reference set.
In contrast, Punjabis are where geography would predict. That's one reason it was somewhat problematic that the HGDP had only Pakistani groups for South Asians. They're not too representative of South Asians. Differences along the axis of caste become clearer when you correct for region, at least mostly.
Punjab is somewhat atypical here. I am now much more willing to credit migrations within the last 2,000 years with accounting for the distinctiveness of groups like Jatts.
On a somewhat less exciting note, it looks like a lot of the genome blogging projects are losing steam. I'm pretty busy right now, so I haven't been able to maintain AAP, though we'll have another Merina soon. But I suspect it goes to show just how important collection of new data is to these endeavors. There's only so much juice you can get out of the same data set. Right now we depend on research groups and the 1000 Genomes, as well as enthusiasts. At some point in the near future the genotypes won't be the limiting factor. I think then you'll see a renaissance of amateur ancestral genomics.