I, and I’m sure other people, have worried about being scooped and beaten to publication due to our arXived papers. But really this is silly as we’ve usually given talks, posters, etc. on them at big conferences, so the idea that people somehow don’t know about our work before it appears in print is ridiculous. It is far better to get work out, once you consider it worthy of publication, so it can be read and cited by others.
This is in reference to the paper The Geography of Recent Genetic Ancestry across Europe. Go and read the materials and methods. I'm sure that a substantial minority of the readers of this weblog have used every single piece of software listed therein. Phasing and such requires a little bit of computational muscle, but that's not an impossible hurdle. Additionally, many readers with academic affiliations could get their hands on the POPRES data set. But the generation of a paper, from methods to results to discussion, is not simply a robotic sequence of running data through software or algorithms. You need a first-rate statistical geneticist (e.g., the authors) to actually assemble the pieces coherently and with insight, even granting the fundamental units of the whole. Then there are sections of the methods with explication such as this:
You can try to cut & paste this, but you'd come off as a fool if you didn't know what you were talking about. The Coop lab has put a substantial number of their quant bio papers up on arXiv, and I'm skeptical that that's resulted in other groups cheating off them. On the contrary, in an idealized scientific environment the spread of insight will have spillover effects, positive externalities. The scientific community is one where there should be greater returns to scale due to the synergistic power of cross-fertilization. On the other hand, there is the flip side of this: the recent rash of data fraud and fudging impacting some of the more 'empirical' sciences. The community of science is based on trust, and sometimes I wonder how it persists. When the juice is in the collection and publication of data, rather than clever or deep analysis of data already commonly circulated, one can see the margin in cheating the system, or hoarding your cache. I don't have any clever solutions for how to prevent cheating in medical or psychological science. But I can hope that in the future genomic data sets will be constantly liberated, so that everyone is working from the same general script. And faking genomic data so that it would pass muster probably isn't worth the time and energy. If you can manage to do this, I think there's a much better angle in going to Wall Street and screwing others for profit rather than scientific small-time fame.