Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:
Andhra Pradesh: 2
Caribbean Indian: 2
Uttar Pradesh: 2
Sri Lankan: 2
Iraqi Arab: 2
Egyptian/Iraqi Jew: 1
Again, note the underrepresentation of two of India's most populous states, Uttar Pradesh (~200 million) and Bihar (~100 million). Nevertheless, there are already some interesting yields from the project. Below I've re-edited Zack's static images (though go to his website for something more dynamic) to add the labels of individuals. I've highlighted myself and my parents with the red pointers.
To the left is a set of plots and tables which I've spliced together from Zack's various posts. What you need to know is that this is at K = 12, and I've used the labels that Zack gave the various putative "ancestral populations" which emerged out of his ADMIXTURE runs. I've also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between-population difference. So the values in the boxes represent the genetic distances between the inferred ancestral populations in the row and column (I also rounded, since I didn't want to give the impression of excessive precision).
This last point is important: these are not between-population distance measures across real populations. Rather, they're distance measures across the inferred allele frequencies of the populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
In the broadest sense the first thing that jumps out at you is the high distance value between "Papuans" and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance is the "South Asians." What does that mean? I think this has a straightforward explanation.
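A quick aside on the arithmetic behind a standardized Fst table, before getting to that explanation. This is only a minimal sketch, assuming that "standardized by the smallest between-population difference" means dividing every pairwise Fst by the smallest off-diagonal value and then rounding. The function name, the three component labels, and the Fst values are all made up for illustration; they are not Zack's numbers or his code.

```python
def standardize_fst(fst, digits=0):
    """Divide each pairwise Fst by the smallest off-diagonal value, then round.

    Assumption: this is one plausible reading of 'standardized by the
    smallest between-population difference', not a confirmed method.
    """
    n = len(fst)
    # Distances between distinct components only (skip the zero diagonal).
    off_diag = [fst[i][j] for i in range(n) for j in range(n) if i != j]
    smallest = min(off_diag)
    if smallest <= 0:
        raise ValueError("smallest between-population Fst must be positive")
    return [
        [round(fst[i][j] / smallest, digits) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical components and made-up Fst values, purely for illustration.
labels = ["A", "B", "C"]
fst = [
    [0.00, 0.02, 0.10],
    [0.02, 0.00, 0.08],
    [0.10, 0.08, 0.00],
]

standardized = standardize_fst(fst)
for label, row in zip(labels, standardized):
    print(label, row)
```

After this kind of rescaling the closest pair of components sits at 1, and every other box reads as a multiple of that smallest distance, which is why only the relative sizes of the values carry meaning.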
I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian Population History, and that the populations of Oceania represent a relatively "pure" eastern expansion of long resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn't make too much of this, but in some ADMIXTURE runs which I've done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially by Melanesians and Australians ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.
As for the South Asian individuals surveyed so far, there's nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you'd expect. And the Pakistan/Caucasian component, which spans much of western and central Asia, is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, "Ancestral South Indians," was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you'd expect, but they're found in different proportions among many individuals who hail from the northern fringe of South Asia more generally.
It seems clear that the further west you go, the more likely the "eastern" element is going to be Turk, while the further east (and to some extent south) you go, the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I'd like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers. Here are all the details about participation.