Advertisement

The value of "open genomics"

Explore issues in public genomic data sets, including challenges with duplicate samples and inconsistencies in the Behar et al. data set.

Google NewsGoogle News Preferred Source

Newsletter

Sign up for our email newsletter for the latest science news

Sign Up

Zack Ajmal has been methodically working his way through issues in the public genomic data sets. Often it just involves noting duplicate samples across data sets, which need to be accounted for. But sometimes there seem to be problems within the uploaded data sets, for example relatively close related individuals. Today he highlights an issue which early on was noticeable in the Behar et al. data set:

Advertisement

Behar as in the Behar et al paper/dataset and not the Indian state of Bihar. The Behar dataset contains 4 samples of Paniya, which apparently is a Dravidian language of some Scheduled Tribes in Kerala. I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians. ... Since the Austroasiatic Paniya samples originated from Behar et al, I guess at some point before the Behar data being submitted to the GEO database the Paniyas got mislabeled.

I pulled down the Behar et al. data set too, and the Paniya just look weird enough that I just avoided them. Ideally this sort of stuff should be caught, but errors happen. Best to get as many eyeballs looking over everything.

Stay Curious

JoinOur List

Sign up for our weekly science updates

View our Privacy Policy

SubscribeTo The Magazine

Save up to 40% off the cover price when you subscribe to Discover magazine.

Subscribe
Advertisement

1 Free Article