
Confidence in inference in phylogenetic data sets

Gene Expression
By Razib Khan
Mar 4, 2013 12:20 AM | Nov 20, 2019 3:00 AM



A few weeks ago I put a new data set into my repository. As is my usual practice now, the populations can be found in the .fam file. But I've added more to this. I have to rewrite my ADMIXTURE tutorial soon, so I thought I would bring up an important issue when interpreting these data sets using clustering methods: one has to understand that conclusions cannot rest on a single result. Rather, one must attempt to ascertain the statistical robustness of the results. If you arrive at an expected result this is obviously not as important a consideration, but if you arrive at a novel and surprising result, then you have to make sure that it isn't simply a fluke. To do this I have been running my PHYLOCORE data set with cross-validation (regular 5-fold). In theory you should be able to see where the cross-validation error is minimized, and that is your "best" K. But my personal experience with running ADMIXTURE and STRUCTURE is that the inferred plausibility of a given K derived from the statistic can itself be quite volatile. In other words, it is best to run replicates of a data set when attempting to assess robustness. I'm going to run PHYLOCORE 50 times, but I already have 10 runs. The results are plotted below.
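For readers who want to tabulate replicate runs themselves, here is a minimal sketch in Python. It assumes each run's log contains a line of the form `CV error (K=5): 0.52305`, which is what ADMIXTURE prints when cross-validation is enabled. The function name `collect_cv_errors` and the log snippets are hypothetical, made up for illustration; they are not output from PHYLOCORE.

```python
import re

# Assumed log format: "CV error (K=5): 0.52305", one line per run.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def collect_cv_errors(log_texts):
    """Group cross-validation errors by K across replicate logs.

    log_texts: iterable of strings, each the full text of one run's log.
    Returns a dict mapping K (int) to a list of CV errors (float).
    """
    errors = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors.setdefault(int(k), []).append(float(err))
    return errors

# Made-up log snippets, for illustration only (not real results).
example_logs = [
    "Summary:\nCV error (K=5): 0.52310\n",
    "CV error (K=5): 0.52312\n",
    "CV error (K=13): 0.51020\n",
]

cv_errors = collect_cv_errors(example_logs)
print(cv_errors)
```

In practice you would read the text of each replicate's log file and pass the list to `collect_cv_errors`; the resulting dictionary is the raw material for a plot of cross-validation error against K.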

It seems that the best fit to these data is in the K = 10 to 15 range. But notice that the results for K < 10 are not very volatile. There are 10 points at each K, but at K = 5, for example, they totally overlay one another. The more populations the algorithm attempts to infer, the more volatile the cross-validation results become.

Zooming in on the plot, you notice that K = 13 not only has the minimum cross-validation error but also seems to exhibit the least volatility. I suspect that this result will hold, but you never know. The point is not to establish hard and fast rules. It is to be explicit about the guidelines for interpreting results, which can vary quite a bit depending upon the input parameters you begin with. Addendum: The seed is random, for those who are curious.
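One way to put numbers on "minimum error" and "volatility" is to compute the mean and standard deviation of the cross-validation error across replicates at each K. The sketch below is only an illustration of that idea, not the procedure used for the plots in this post; the helper names `summarize` and `best_k` are hypothetical, and the replicate values are invented and are not PHYLOCORE results.

```python
from statistics import mean, stdev

def summarize(cv_errors):
    """Return {K: (mean, standard deviation)} for replicate CV errors.

    K values with fewer than two replicates are skipped, since a
    spread cannot be estimated from a single run.
    """
    return {k: (mean(v), stdev(v)) for k, v in cv_errors.items() if len(v) > 1}

def best_k(summary):
    """Return the K with the lowest mean cross-validation error."""
    return min(summary, key=lambda k: summary[k][0])

# Invented replicate values, for illustration only (not real results).
replicates = {
    5: [0.530, 0.530, 0.530],    # replicates overlay completely
    13: [0.510, 0.511, 0.509],   # low error, tight spread
    18: [0.512, 0.530, 0.505],   # similar error, much more volatile
}

summary = summarize(replicates)
chosen = best_k(summary)

for k in sorted(summary):
    m, s = summary[k]
    print(f"K={k}: mean={m:.4f} sd={s:.4f}")
print("lowest mean CV error at K =", chosen)
```

Reporting the spread alongside the mean makes it easier to see whether the "best" K stands out clearly or is only marginally ahead of its noisier neighbors.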

Copyright © 2024 Kalmbach Media Co.