The state of China has 1/5 of humanity within its borders, so its genetic structure is of interest. It is obviously important for medical reasons to clarify issues of population structure so that disease susceptibility among the Han is well characterized, in particular given the heightened medical needs of an aging population in the coming generation. And of course, there are the nationalistic concerns. About 20 years ago L. L. Cavalli-Sforza reported in The History and Geography of Human Genes that his South Chinese samples were genetically closer to Southeast Asians than to North Chinese. This result has been somewhat muddled in the past generation with the rise of uniparental markers (NRY and mtDNA, passed through the male and female lineages respectively) along with studies which utilize hundreds of thousands of SNPs. One thing that seems to be clear is that genes vary as a function of geography in China (just as they do pretty much everywhere). Two new articles in AJHG shed some more light on this issue. First, Genomic Dissection of Population Substructure of Han Chinese and Its Implication in Association Studies:
To date, most genome-wide association studies (GWAS) and studies of fine-scale population structure have been conducted primarily on Europeans. Han Chinese, the largest ethnic group in the world, composing 20% of the entire global human population, is largely underrepresented in such studies. A well-recognized challenge is the fact that population structure can cause spurious associations in GWAS. In this study, we examined population substructures in a diverse set of over 1700 Han Chinese samples collected from 26 regions across China, each genotyped at ∼160K single-nucleotide polymorphisms (SNPs). Our results showed that the Han Chinese population is intricately substructured, with the main observed clusters corresponding roughly to northern Han, central Han, and southern Han. However, simulated case-control studies showed that genetic differentiation among these clusters, although very small (FST = 0.0002 ∼ 0.0009), is sufficient to lead to an inflated rate of false-positive results even when the sample size is moderate. The top two SNPs with the greatest frequency differences between the northern Han and southern Han clusters (FST > 0.06) were found in the FADS2 gene, which associates with the fatty acid composition in phospholipids, and in the HLA complex P5 gene (HCP5), which associates with HIV infection, psoriasis, and psoriatic arthritis. Ingenuity Pathway Analysis (IPA) showed that most differentiated genes among clusters are involved in cardiac arteriopathy (p < 10^−101). These signals indicating significant differences among Han Chinese subpopulations should be carefully explained in case they are also detected in association studies, especially when sample sources are diverse.
And second, Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation:
Population stratification is a potential problem for genome-wide association studies (GWAS), confounding results and causing spurious associations. Hence, understanding how allele frequencies vary across geographic regions or among subpopulations is an important prelude to analyzing GWAS data. Using over 350,000 genome-wide autosomal SNPs in over 6000 Han Chinese samples from ten provinces of China, our study revealed a one-dimensional "north-south" population structure and a close correlation between geography and the genetic structure of the Han Chinese. The north-south population structure is consistent with the historical migration pattern of the Han Chinese population. Metropolitan cities in China were, however, more diffused "outliers," probably because of the impact of modern migration of peoples. At a very local scale within the Guangdong province, we observed evidence of population structure among dialect groups, probably on account of endogamy within these dialects. Via simulation, we show that empirical levels of population structure observed across modern China can cause spurious associations in GWAS if not properly handled. In the Han Chinese, geographic matching is a good proxy for genetic matching, particularly in validation and candidate-gene studies in which population stratification cannot be directly accessed and accounted for because of the lack of genome-wide data, with the exception of the metropolitan cities, where geographical location is no longer a good indicator of ancestral origin. Our findings are important for designing GWAS in the Chinese population, an activity that is expected to intensify greatly in the near future.
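To make the stratification problem described in these abstracts concrete, here is a minimal sketch in Python. It is not the simulation either paper ran. It only computes the chi-square statistic you would expect at a SNP with no effect on disease when cases and controls are drawn in different proportions from two subpopulations. The allele frequencies (0.50 and 0.53, which works out to an Fst of roughly 0.0009 under the simple two-population formula), the 70/30 sampling split, the sample sizes, and the function names are all my own assumptions for illustration.

```python
def pooled_freq(frac_north, p_north, p_south):
    """Allele frequency in a sample that is frac_north northern, the rest southern."""
    return frac_north * p_north + (1.0 - frac_north) * p_south


def allelic_chi_square(p_cases, p_controls, n_cases, n_controls):
    """Pearson chi-square (1 df) on the 2x2 table of expected allele counts.

    Each individual carries two alleles, so the table holds 2n alleles per group.
    Because it uses expected counts, the result reflects only the systematic
    shift caused by stratification, not sampling noise.
    """
    a = 2 * n_cases * p_cases              # allele A in cases
    b = 2 * n_cases * (1.0 - p_cases)      # allele a in cases
    c = 2 * n_controls * p_controls        # allele A in controls
    d = 2 * n_controls * (1.0 - p_controls)  # allele a in controls
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # monomorphic SNP or empty group: no test possible
    return total * (a * d - b * c) ** 2 / denom


# Hypothetical null SNP: frequencies differ slightly by region, no disease effect.
P_NORTH, P_SOUTH = 0.50, 0.53
N = 5000  # cases and controls each (made-up sample size)

# Mismatched design: cases mostly northern, controls mostly southern.
mismatched = allelic_chi_square(
    pooled_freq(0.7, P_NORTH, P_SOUTH),
    pooled_freq(0.3, P_NORTH, P_SOUTH),
    N, N,
)

# Geographically matched design: same regional mix in both groups.
matched = allelic_chi_square(
    pooled_freq(0.5, P_NORTH, P_SOUTH),
    pooled_freq(0.5, P_NORTH, P_SOUTH),
    N, N,
)

print(f"mismatched design: chi-square shift = {mismatched:.2f}")
print(f"matched design:    chi-square shift = {matched:.2f}")
```

With these made-up numbers the mismatched design pushes the statistic to about 2.9 at a SNP that does nothing, and the shift grows in proportion to sample size, while the matched design leaves it at zero. Spread over hundreds of thousands of SNPs, that kind of systematic push is what inflates the false-positive rate, and it is why geographic matching helps.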
Below is a PC chart which shows PC1 on the x-axis and PC2 on the y-axis. In green are South Chinese, and in blue the North Chinese. Japanese are the cluster to the top left, and red represents the HapMap Chinese sample.
And here's a visualization of the ancestries of individuals from particular provinces and dialect groups using Structure (right) and Frappe (left) (the K's represent 2 or 3 putative ancestral populations respectively). It's ordered by ancestry within the classes. The rough geographical correlate is north-south. Note the variance in Singapore; most Singaporean Chinese derive from Fujian (with a large Hakka minority, and some Malay admixture on the part of Baba Chinese), but there were enough disparate migratory events that you don't see a bottleneck and a decrease in heterogeneity compared to Chinese provinces. On the contrary. A minority of Singaporeans seem to be of North Chinese provenance, a result that would not be surprising in Taiwan, where such a migration is historically documented (after the fall of Nationalist China), but is more curious in Singapore, which was presumably part of the greater Fujianese Diaspora.
Finally, here are pairwise Fst values. Remember that Fst captures the proportion of genetic variance that lies between populations rather than within them. Fst values between continental races are on the order of 0.15, which means that about 15% of the genetic variation is between races. The values below seem to show a maximum difference between provinces or dialect groups in China of about 0.5% of the genetic variation. But despite this small value, note how easy it is in the plots above to differentiate individuals from northern and southern regions of China.
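For readers who want to see what "proportion of variance between populations" means in practice, here is a minimal sketch in Python of the textbook heterozygosity-based definition, Fst = (H_T - H_S) / H_T, for two equally weighted populations at biallelic SNPs. I am not claiming this is the estimator the papers used (published studies typically use sample-size-corrected estimators), and the allele frequencies below are invented for illustration.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_components(p1, p2):
    """Return (H_T - H_S, H_T) for one SNP, weighting the two populations equally."""
    p_bar = (p1 + p2) / 2.0
    h_total = heterozygosity(p_bar)                               # pooled population
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0    # mean within-population
    return h_total - h_within, h_total


def fst_single_snp(p1, p2):
    """Fst at a single SNP; defined as 0 when the SNP is monomorphic overall."""
    num, den = fst_components(p1, p2)
    return num / den if den > 0 else 0.0


def fst_multi_snp(freq_pairs):
    """Fst across many SNPs as a ratio of sums (not a mean of per-SNP ratios)."""
    num_sum = 0.0
    den_sum = 0.0
    for p1, p2 in freq_pairs:
        num, den = fst_components(p1, p2)
        num_sum += num
        den_sum += den
    return num_sum / den_sum if den_sum > 0 else 0.0


# Invented allele frequencies for two hypothetical populations at four SNPs.
example_snps = [(0.50, 0.53), (0.20, 0.22), (0.81, 0.80), (0.35, 0.30)]

print("per-SNP Fst:", [round(fst_single_snp(a, b), 5) for a, b in example_snps])
print("multi-SNP Fst:", round(fst_multi_snp(example_snps), 5))
```

The useful intuition is that a frequency difference of only a few percentage points per SNP gives an Fst well under 0.01, yet when the same small shift recurs across hundreds of thousands of SNPs, individuals still separate cleanly in a PC plot.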
Here are some comparable Fst values from Europe:

0.001 = Bulgaria-Austria
0.002 = Poland-Sweden
0.003 = Northern Italy-Switzerland
0.004 = Spain-Sweden
0.005 = Russia-France

I've left out the highest Fst values in Europe, which are between Finns and Southern Italians, on the order of 0.015. From these data it looks as if the Han Chinese are in the same order of magnitude of genetic variance as Europeans, but a factor of two or so lower. But it may be that the coverage of genetic variation is just not as thick in China, so that outlier Han populations, the equivalent of Finns (perhaps Sinicized groups in Yunnan?), are out there waiting to push the mean variance higher.

It is interesting, though not totally surprising, that different dialect groups in the same region exhibit large genetic differences. Language & genes often correlate because the former circumscribes the limits of marriage networks. The Teochew migrated from Fujian to Guangdong (to my knowledge they are the dominant Chinese group in Thailand), and are nearly as genetically distant from their Cantonese-speaking neighbors as they are from North Chinese. Interestingly, the Hakka, who according to their own history are derived from North Chinese migrants, seem to be closer to "indigenous" South Chinese. Nevertheless, they exhibit less genetic difference from North Chinese than do Cantonese speakers in Guangdong.
This is obviously the tip of the iceberg. I suspect that the genetic topography of South China in particular will be surprising because of its geographical fragmentation, the role of powerful clan networks, and the recurrent history of migration from the North China plain by groups who manage to maintain their identities (e.g., Hakka).*

Citation: Jieming Chen, Houfeng Zheng, Jin-Xin Bei, Liangdan Sun, Wei-hua Jia, Tao Li, Furen Zhang, Mark Seielstad, Yi-Xin Zeng, Xuejun Zhang, and Jianjun Liu, Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation, doi:10.1016/j.ajhg.2009.10.016

Citation: Shuhua Xu, Xianyong Yin, Shilin Li, Wenfei Jin, Haiyi Lou, Ling Yang, Xiaohong Gong, Hongyan Wang, Yiping Shen, Xuedong Pan, Yungang He, Yajun Yang, Yi Wang, Wenqing Fu, Yu An, Jiucun Wang, Jingze Tan, Ji Qian, Xiaoli Chen, Xin Zhang, Yangfei Sun, Xuejun Zhang, Bailin Wu, and Li Jin, Genomic Dissection of Population Substructure of Han Chinese and Its Implication in Association Studies, doi:10.1016/j.ajhg.2009.10.015

* It is attested that many groups emigrated from South China to North China, but it seems to me that these groups were simply absorbed. I suspect this has to do with the flat topography of the North China plain, which does not allow for easy separation between groups. In South China the Hakka tended to farm the more marginal lands, in particular upland regions.