Looking for relatedness in the HapMap Gujaratis

Gene Expression
By Razib Khan
Mar 18, 2011 8:34 AM | Nov 19, 2019 11:40 PM


Recently I was looking at a 3-D PCA animation which Zack generated from the Harappa Ancestry Project data set. Click the link and come back. Notice the outlier clusters? The Burusho are straightforward: they seem to have low levels of Tibetan admixture. But what about the Gujarati cluster? Again we see what we've seen before: in PCA the Gujaratis split into two groups, one a tight cluster and the other relatively widely distributed. This prompted me to look more closely at the HapMap Gujarati sample.

Today I was exploring the question with Plink's identity-by-descent (IBD) feature. First I'll start with a smaller data set: my family (father, mother, sibling 1, sibling 2, and myself), plus an Indian (from Uttar Pradesh) and a Pakistani as unrelated individuals. I merged our 23andMe-derived genotypes and, with ~900,000 markers, calculated pairwise IBD:

./plink --bfile IBDControl --genome

The relevant results are in the first table below (the one that includes my family).

You can infer some things without even knowing what the columns mean. Notice that there are differences between parent-child, sibling-sibling, and unrelated comparisons. The distance measure, DST (an identity-by-state distance), is essentially identical to the genome-wide comparison in 23andMe. Either the web app is running Plink, or it's using the exact same algorithm.

Z0 = IBD 0, Z1 = IBD 1, and Z2 = IBD 2. Notice that with my siblings I have a fair amount of IBD 2, but far less with my parents. That's because each parent gives you exactly one copy of each gene, whereas you can share zero, one, or two copies with a sibling. In contrast, with our parents there is hardly any IBD = 0, since they're guaranteed to give you one copy. I assume that the IBD = 2 in that case is population-wide fixation of a variant. Notice in the last column (RATIO) that there are different values for unrelated individuals (~2), siblings (~10), and parent-child pairs (~500).

I ran a similar test among the Gujaratis. Remember that I've labeled them Gujarati_A and Gujarati_B based on PCA clusters, where the latter form a tight population cluster and the former do not. Here are the mean pairwise DST values within each group of pairs:

Mean of all: 0.746
Mean of Gujarati_A only: 0.744
Mean of Gujarati_B only: 0.749
Mean of Gujarati_A and Gujarati_B pairs only: 0.745

Gujarati_B are marginally closer to each other than Gujarati_A. I'm not sure these DST values are totally comparable to the ones from the 23andMe files. I'll show you why. I constrained the pairs to those where the RATIO was > 2.5. What I found is in the second table below (the one with the HapMap NA IDs).

Individual 1  Individual 2  Z0     Z1     Z2     PI_HAT  DST    PPC    RATIO
Indian        Father        0.768  0.027  0.205  0.218   0.760  0.160  1.940
Indian        Mother        0.782  0.010  0.209  0.214   0.759  0.026  1.886
Indian        Razib         0.767  0.032  0.202  0.218   0.759  0.500  2.000
Indian        Sibling1      0.769  0.025  0.206  0.219   0.760  0.198  1.949
Indian        Sibling2      0.766  0.032  0.203  0.219   0.760  0.685  2.030
Indian        Pakistani     0.781  0.017  0.203  0.211   0.758  0.533  2.005
Father        Mother        0.776  0.018  0.207  0.215   0.759  0.284  1.965
Father        Razib         0.002  0.777  0.221  0.610   0.851  1.000  450.800
Father        Sibling1      0.001  0.785  0.214  0.606   0.850  1.000  898.800
Father        Sibling2      0.002  0.779  0.220  0.609   0.851  1.000  643.143
Father        Pakistani     0.778  0.019  0.203  0.213   0.758  0.201  1.950
Mother        Razib         0.002  0.788  0.211  0.605   0.849  1.000  639.429
Mother        Sibling1      0.002  0.781  0.218  0.608   0.850  1.000  639.857
Mother        Sibling2      0.002  0.782  0.216  0.607   0.850  1.000  447.900
Mother        Pakistani     0.779  0.020  0.201  0.211   0.758  0.052  1.904
Razib         Sibling1      0.183  0.408  0.409  0.613   0.866  1.000  11.386
Razib         Sibling2      0.194  0.432  0.374  0.590   0.858  1.000  11.491
Razib         Pakistani     0.781  0.016  0.203  0.211   0.758  0.933  2.095
Sibling1      Sibling2      0.236  0.412  0.351  0.557   0.849  1.000  9.413
Sibling1      Pakistani     0.777  0.024  0.199  0.211   0.758  0.327  1.973
Sibling2      Pakistani     0.774  0.024  0.202  0.214   0.758  0.443  1.991
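For readers who want to check how the columns in the family table hang together, here is a minimal Python sketch. It relies on two things that are not spelled out in the post: PI_HAT in Plink's output is P(IBD=2) + 0.5 × P(IBD=1), i.e. Z2 + 0.5 × Z1, and the textbook expectations for (Z0, Z1, Z2) are (0, 1, 0) for parent-child, (0.25, 0.5, 0.25) for full siblings, and (1, 0, 0) for unrelated pairs. The function names and the nearest-expectation rule are my own illustration, not anything Plink does, and the rows are simply copied from the table.

```python
# Rows copied from the family table: (individual 1, individual 2, Z0, Z1, Z2, PI_HAT)
ROWS = [
    ("Father", "Razib", 0.002, 0.777, 0.221, 0.610),
    ("Mother", "Razib", 0.002, 0.788, 0.211, 0.605),
    ("Razib", "Sibling1", 0.183, 0.408, 0.409, 0.613),
    ("Razib", "Pakistani", 0.781, 0.016, 0.203, 0.211),
]

# Textbook expectations for (Z0, Z1, Z2) in an outbred population.
# The labels are illustrative; Plink itself does not assign them.
EXPECTED = {
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "unrelated": (1.0, 0.0, 0.0),
}


def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def closest_relationship(z0, z1, z2):
    """Return the EXPECTED label nearest to (z0, z1, z2) in squared distance.

    This is a hypothetical rule of thumb for illustration only.
    """
    def sq_dist(label):
        e0, e1, e2 = EXPECTED[label]
        return (z0 - e0) ** 2 + (z1 - e1) ** 2 + (z2 - e2) ** 2

    return min(EXPECTED, key=sq_dist)


if __name__ == "__main__":
    for ind1, ind2, z0, z1, z2, reported in ROWS:
        recomputed = pi_hat(z1, z2)
        label = closest_relationship(z0, z1, z2)
        print(f"{ind1}-{ind2}: PI_HAT reported {reported:.3f}, "
              f"recomputed {recomputed:.4f}, closest to {label}")
```

Even with Z2 sitting around 0.2 across the board, the nearest-expectation rule still sorts these four pairs the way you would expect, which suggests the inflation is shifting every pair by a similar amount rather than scrambling the ordering.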

Notice that Z2 ~ 0 for the Gujarati pairs, in contrast to the family calculations, where even unrelated pairs came out at Z2 ~ 0.2. I assume someone reading this knows that there's a simple reason for this, so do tell. The IBD estimates for 23andMe always struck me as too high.

In any case, to my surprise the definitely related individuals seem to be in the Gujarati_A cluster! What's going on there? My first thought is that I messed up the data, or coded something incorrectly. I assume that relatedness was double-checked before the samples got into the HapMap data set. Has anyone else seen this weird result? If not, I assume I made an error (that's kind of my working model right now, actually).

Individual 1  Individual 2  Z0     Z1     Z2     PI_HAT  DST    PPC    RATIO    PopX        PopY
NA20900       NA20891       0.003  0.974  0.023  0.510   0.842  1.000  188.250  Gujarati_A  Gujarati_A
NA20909       NA20910       0.003  0.970  0.027  0.512   0.842  1.000  140.438  Gujarati_A  Gujarati_A
NA20891       NA20907       0.412  0.557  0.032  0.310   0.803  1.000  5.730    Gujarati_A  Gujarati_A
NA20900       NA20907       0.684  0.292  0.024  0.170   0.775  1.000  3.251    Gujarati_A  Gujarati_A
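The two steps behind the Gujarati numbers, averaging DST within each group of pairs and keeping only pairs with RATIO > 2.5, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the PopX/PopY labels are my own PCA-based annotation rather than Plink output, and the input is just the four rows of the Gujarati table. The mean it produces is therefore for these already-filtered pairs, not the all-pairs means quoted earlier.

```python
from collections import defaultdict

# Rows copied from the Gujarati table: (ID 1, ID 2, DST, RATIO, PopX, PopY).
PAIRS = [
    ("NA20900", "NA20891", 0.842, 188.250, "Gujarati_A", "Gujarati_A"),
    ("NA20909", "NA20910", 0.842, 140.438, "Gujarati_A", "Gujarati_A"),
    ("NA20891", "NA20907", 0.803, 5.730, "Gujarati_A", "Gujarati_A"),
    ("NA20900", "NA20907", 0.775, 3.251, "Gujarati_A", "Gujarati_A"),
]


def filter_by_ratio(pairs, threshold=2.5):
    """Keep only the pairs whose RATIO exceeds the threshold."""
    return [pair for pair in pairs if pair[3] > threshold]


def mean_dst_by_group(pairs):
    """Average DST for each unordered (PopX, PopY) combination."""
    buckets = defaultdict(list)
    for _, _, dst, _, pop_x, pop_y in pairs:
        # Sort the labels so that A-B and B-A pairs land in the same bucket.
        key = tuple(sorted((pop_x, pop_y)))
        buckets[key].append(dst)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


if __name__ == "__main__":
    kept = filter_by_ratio(PAIRS)
    print(f"{len(kept)} of {len(PAIRS)} pairs have RATIO > 2.5")
    for group, mean_dst in mean_dst_by_group(kept).items():
        print(group, round(mean_dst, 4))
```

Raising the threshold (say, `filter_by_ratio(PAIRS, 100)`) leaves only the two pairs whose Z1 is close to 1, the ones that look like parent-child relationships.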

Copyright © 2024 Kalmbach Media Co.