
There are more things in prehistory than are dreamt of in our urheimat

Gene Expression | By Razib Khan | Aug 24, 2012 3:59 AM


A new paper in Science, Mapping the Origins and Expansion of the Indo-European Language Family, claims to have ascertained the locus of origin of the Indo-Europeans. These are bold claims, and naturally they have triggered a firestorm. No surprise: the same happened when these researchers published the result in 2003 that Proto-Indo-European flourished ~9,000 years ago, in alignment with an "Anatolian hypothesis," as opposed to a "Steppe/Kurgan hypothesis." The original 2003 paper took phylogenetic methods which are common within biology and applied them to linguistics. This second paper now incorporates spatial information into the model, to generate an explicit locus of origination, in addition to the dates for the bifurcations of the nodes.

In relation to results, I think that the figure to the left is the most important, because it gives us their inferred dates of separation between various Indo-European language families. Observe that Italic and Celtic did not diverge in prehistory, but in history (i.e., the Sumerians and Egyptians were flourishing at the time). Additionally, the diversification pattern is not a simple "rake"; there is internal structure. They may date the origin of Indo-European languages to the early Holocene, but the diversification seems to have happened in steps and pulses. Though the authors support the Anatolian hypothesis, they also seem quite comfortable acknowledging that the real story is more complex, though you wouldn't get that from the media.

But speaking of complexity, who really knows what's going on in this paper? I have a handle on the general framework, but haven't used all the algorithms. As I indicate below, in population genetics a good intuition for the kinks and tendencies of clustering algorithms can be obtained only through usage. And of course few people will read the supplements.
For example, in Nick Wade's piece in The New York Times, David Anthony, author of the magisterial The Horse, the Wheel, and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World, makes a criticism which is addressed within the paper (in the supplements):

Dr. Anthony, noting that neither he nor Dr. Atkinson is a linguist, said that cognates were only one ingredient for reconstructing language trees, and that grammar and sound changes should also be used. Dr. Atkinson’s reconstruction is “a one-legged stool, so it’s not surprising that the tree it produces contains language groupings that would not survive if you included morphology and sound changes,” Dr. Anthony said. Dr. Atkinson responded that he did indeed run his computer simulation on a grammar-based tree constructed by Don Ringe, an expert on Indo-European at the University of Pennsylvania, but that the resulting origin was, again, Anatolia, not the Pontic steppe.

There's an asymmetry here. The historical linguists have compelling and transparent rationales for why the Steppe thesis should be preferred over the Anatolian one. Lay persons can assess historical linguistic models using common sense, for example by looking at words which span all Indo-European languages and which might give clues to the geographical and temporal point of origin. In response, you have Bayesian phylogenetics. At some point in the future I suspect all of this research will have recourse to Bayesian phylogenetics, but at this stage of the game even most people who use Bayesian phylogenetic packages don't really understand how they work.

I may not grok the methods in detail, but I do appreciate that the authors simulated data to test their methods, and that their methods worked for cases where we know the answer. For example, the method correctly inferred the geographical origin of the Romance languages, and their time of diversification. But in this situation we know the answer. How about in cases where we don't? I noticed this strange plot in the supplements. I've highlighted Romani, the language of the Roma. The fact that Romani is an outgroup to the Indo-Aryan languages illustrates some deep problem with their method. Romani did not start diverging from other Indo-Aryan languages 3,000-3,500 years ago. It started diverging 1,000-1,500 years ago. We know this because that's when the Roma start showing up in the Islamic world and parts of southeast Europe. It may just be that the most diverged Indo-Aryan language also happens to be the one which migrated out of India, but I don't think that's the case. Rather, the non-Indo-Aryan influences on Romani must be impacting its affinity to other Indo-Aryan languages, even in core words.

With that skepticism entered into the record, I can broadly credit the possibility proposed here in the most general sense. We know from genetic clustering algorithms that Indo-European populations within Europe seem enriched for a "West Asian" element vis-a-vis their non-Indo-European neighbors. I'm talking here mostly about the Basque and Finns, though arguably the Sardinians were Indo-Europeanized only during the Roman era, and they should count as well. But I'm pretty sure that the Indo-Aryans are the ones who brought the "European" component found in low levels across northwest South Asia to the subcontinent. The Indo-Iranians diverged from the European Indo-Europeans ~4,000 BC, and I suspect this may have happened along the broad trans-Caucasian and Russian fringe. This is where contact was made with Uralic peoples.
The authors of the paper themselves point in the text to the viability of the Kurgan hypothesis in this modified form. I don't see why the archaeologists are all worked up, though (unlike the historical linguists).

