The figure to the left is a three-dimensional representation of principal components 1, 2, and 3, generated from a sample of Gujaratis from Houston and Chinese from Denver. When these two populations are pooled together the Chinese form a very homogeneous cluster. They don't vary much across the three top explanatory dimensions of genetic variance. In contrast, the Gujaratis do vary. This is not surprising. In the supplements of Reconstructing Indian population history it was notable that the Gujaratis tended to shake out into two distinct clusters in the PCAs. This is a finding you see over and over when you manipulate the HapMap Gujarati data set.
In reality, there aren't two equivalent clusters. Rather, there's one "tight" cluster, which I will label "Gujarati_B" from now on in my data set, and another cluster, "Gujarati_A," which really just consists of all the individuals who fall outside the Gujarati_B cluster. Even when compared to other South Asian populations these two distinct categories persist in the HapMap Gujaratis. Zack has already identified a major difference between the two clusters: Gujarati_A has some individuals with much more "West Eurasian" ancestry.
To be more formal about this in the future I simply assigned individuals in my merged data set to one of the two Gujarati clusters based on their position in the first two PCs. Last night I ran ADMIXTURE for K = 2 to 10, with 75,000 SNPs. I also removed the Native American groups, and added more European and East Asian samples from the HapMap. Below are some populations at K = 4:
Let's drill down to the level of individuals. Here are the Gujarati individuals, along with Sindhis, and my parents (Bengali). I've sorted by the "European" and then "South Asian" components (light blue and green respectively, while purple is modal in Papuans and red in East Asians):
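For anyone who wants to reproduce that kind of ordering, here is a minimal sketch. It assumes ADMIXTURE's usual `.Q` output (one row per individual, K whitespace-separated ancestry fractions, no header). The function name, the descending order, and which column counts as "European" or "South Asian" are my assumptions for illustration; ADMIXTURE does not label its columns, so you have to work that out from the modal populations.

```python
def sort_individuals(ids, q_text, primary, secondary):
    """Order individuals by one ancestry column, then by a second one.

    ids       -- sample IDs, in the same order as the rows of the .Q file
    q_text    -- contents of a .Q file (K fractions per line)
    primary   -- 0-based column index sorted on first (e.g. "European")
    secondary -- 0-based column index used to break ties (e.g. "South Asian")
    """
    rows = [[float(x) for x in line.split()]
            for line in q_text.strip().splitlines()]
    if len(rows) != len(ids):
        raise ValueError("number of IDs does not match number of Q rows")
    paired = list(zip(ids, rows))
    # Descending order is an assumption: highest primary fraction first.
    paired.sort(key=lambda p: (p[1][primary], p[1][secondary]), reverse=True)
    return paired


# Invented fractions, purely to show the call; not real results.
toy_q = """0.10 0.80 0.05 0.05
0.30 0.60 0.05 0.05
0.30 0.65 0.03 0.02"""
for sample_id, fractions in sort_individuals(["a", "b", "c"], toy_q, 0, 1):
    print(sample_id, fractions)
```

The tie-break on the second column only matters when two individuals have exactly the same primary fraction, which is rare with real output, so in practice the first key does almost all the work.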
The ADMIXTURE plots are in total alignment with the PCA. In the PCA Gujarati_A exhibit a spectrum of distance from the European cluster, and in the ADMIXTURE you see the same. In contrast, Gujarati_B is relatively uniform. So what's going on? I will be posting something similar over at Sepia Mutiny soon. But my guess is that Gujarati_B are a subset of Patels. In other words, they're a genetically distinct jati. I suspect that Gujarati_A are a more diverse bunch from a number of different jatis.
Does this matter? I believe it does. If Gujarati_B are a distinct ethno-social group which is a subset of Gujaratis, then they may not be as good a proxy for South Asian medical genetics as Gujarati_A. More concretely, Gujarati_B may carry rare disease alleles at relatively high frequency because they're an inbred clan. In contrast, while Gujarati_A may exhibit all the hallmarks of South Asian endogamy, if they comprise a larger number of different groups, then they'll have all sorts of different rare alleles. The ones they have in common may be more generally South Asian.
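As a footnote on the cluster assignment mentioned earlier (splitting the Gujaratis by their position on the first two PCs), here is one rough way it could be done. This is a sketch, not the procedure used for the plots above: the median-based rule, the `radius` cutoff, and the function name are all assumptions of mine, and it only works if the tight cluster contains more than half of the samples.

```python
from statistics import median


def assign_gujarati_clusters(pcs, radius=3.0):
    """Label each sample Gujarati_B (tight cluster) or Gujarati_A (the rest).

    pcs    -- dict mapping sample ID to its (PC1, PC2) coordinates
    radius -- hypothetical cutoff, in units of the median distance
    """
    ids = list(pcs)
    # Robust centre: the per-axis median lands inside the tight cluster
    # as long as that cluster holds the majority of samples (assumption).
    cx = median(pcs[i][0] for i in ids)
    cy = median(pcs[i][1] for i in ids)
    dist = {i: ((pcs[i][0] - cx) ** 2 + (pcs[i][1] - cy) ** 2) ** 0.5
            for i in ids}
    # Robust scale: median distance from the centre (guard against zero).
    scale = median(dist.values()) or 1e-12
    return {i: "Gujarati_B" if dist[i] <= radius * scale else "Gujarati_A"
            for i in ids}


# Invented coordinates, purely to show the call; not real PCA output.
toy_pcs = {
    "s1": (0.010, 0.020), "s2": (0.011, 0.021), "s3": (0.009, 0.019),
    "s4": (0.010, 0.021), "s5": (0.012, 0.020),
    "s6": (0.040, 0.050), "s7": (0.060, 0.035),
}
print(assign_gujarati_clusters(toy_pcs))
```

This mirrors the asymmetry described above: Gujarati_B is defined by proximity to a dense core, and Gujarati_A is simply whoever falls outside it, so Gujarati_A is not expected to be a cluster in its own right. The cutoff should be checked against the PCA plot by eye rather than trusted blindly.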