Confidence in inference in phylogenetic data sets

Gene Expression | By Razib Khan | March 4, 2013 12:20 AM

A few weeks ago I added a new data set to my repository. As is my usual practice now, the populations can be found in the .fam file, but I've added more to this one. I have to rewrite my ADMIXTURE tutorial soon, so I thought I would bring up an important issue when interpreting these data sets with clustering methods: conclusions cannot rest on a single result. Rather, one must attempt to ascertain the statistical robustness of the results. If you arrive at an expected result this is obviously not as important a consideration, but if you arrive at a novel and surprising result, then you have to make sure that it isn't simply a fluke.

To do this I have been running my PHYLOCORE data set with cross-validation (regular 5-fold). In theory you should be able to see where the cross-validation error is minimized, and that is your "best" K. But my personal experience with running ADMIXTURE and STRUCTURE is that the inferred plausibility of a given K derived from this statistic can itself be quite volatile. In other words, it is best to run replicates of a data set when attempting to assess robustness. I'm going to run PHYLOCORE 50 times, but I already have 10 runs. The results are plotted below.
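Before getting to the plots, here is a minimal sketch of how one might drive such replicate runs from Python. The file prefix phylocore, the K range, and the serial loop are all assumptions for illustration; --cv and -s are real ADMIXTURE options (--cv is 5-fold by default), but the exact "CV error" log line parsed below is assumed from ADMIXTURE's typical output rather than taken from this post.

```python
import random
import re
import subprocess

# Hypothetical PLINK-format input prefix; the post doesn't name its files.
BED_PREFIX = "phylocore"
K_RANGE = range(2, 21)
N_REPLICATES = 10

# K -> list of cross-validation errors, one per replicate run
cv_errors = {k: [] for k in K_RANGE}

for rep in range(N_REPLICATES):
    seed = random.randrange(1, 2**31)  # fresh random seed per replicate
    for k in K_RANGE:
        # --cv requests cross-validation (5-fold by default);
        # -s sets the seed so replicates actually differ.
        log = subprocess.run(
            ["admixture", "--cv", "-s", str(seed), f"{BED_PREFIX}.bed", str(k)],
            capture_output=True, text=True, check=True,
        ).stdout
        # ADMIXTURE typically logs a line like "CV error (K=5): 0.52926"
        # (format assumed here).
        m = re.search(r"CV error \(K=\d+\): ([0-9.]+)", log)
        if m:
            cv_errors[k].append(float(m.group(1)))
```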

[Figure cv1.png: cross-validation error by K across 10 replicate runs]

It seems that the best fit to these data is in the 10 to 15 K range. But notice that results at K < 10 are not very volatile. There are 10 points, but at K = 5, for example, they overlay completely. The higher the number of populations the algorithm attempts to infer, the more volatile the cross-validation results become.
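To put a number on that volatility, one could summarize each K's replicate errors by their mean and spread. Continuing the sketch above, where the mean tracks fit and the standard deviation tracks volatility:

```python
from statistics import mean, stdev

# Summarize the replicate CV errors per K; a small standard deviation
# corresponds to the tightly overlaid points at low K.
for k in sorted(cv_errors):
    errs = cv_errors[k]
    if len(errs) >= 2:  # stdev needs at least two replicates
        print(f"K={k:2d}  mean CV error={mean(errs):.5f}  sd={stdev(errs):.5f}")
```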

[Figure cv2.png: zoomed view of the cross-validation error minimum]

Zooming in on the plot, you notice that K = 13 not only has the minimum cross-validation error, but also seems to exhibit the least volatility. I suspect that this result will hold, but you never know. The point is not to establish hard and fixed rules; it is to be explicit about the guidelines for interpreting results, which can vary quite a bit depending upon the input parameters you begin with.

Addendum: The seed is random, for those who are curious.
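One way to encode that reading of the plot, again continuing the same sketch, is to pick the K minimizing the mean cross-validation error and report its spread alongside, so a low-volatility minimum (like K = 13 here) stands out at a glance. This is one heuristic, not a fixed rule:

```python
# Pick the K with the lowest mean CV error across replicates
# (assumes every K accumulated at least two successful runs).
best_k = min(cv_errors, key=lambda k: mean(cv_errors[k]))
print(f"best K = {best_k}: "
      f"mean CV error = {mean(cv_errors[best_k]):.5f}, "
      f"sd = {stdev(cv_errors[best_k]):.5f}")
```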
