An editorial in the Journal of Physiology offers some important notes on statistics.
But even more importantly, it refers to a certain blog in the process:
The Student’s t-test merely quantifies the ‘Lack of support’ for no effect. It is left to the user of the test to decide how convincing this lack might be. A further difficulty is evident in the repeated samples we show in Figure 2: one of those samples was quite improbable because the P-value was 0.03, which suggests a substantial lack of support, but that’s chance for you! A parody of this effect of multiple sampling, taken to extremes, can be found at http://neuroskeptic.blogspot.com/2009/09/fmri-gets-slap-in-face-with-dead-fish.html
This makes it the second academic paper to refer to this blog so far. I feel a bit bad about this one, though, since the citation really ought to have gone to the original dead salmon brain scanning study by Craig Bennett, which I just wrote about.
Actually, though, this editorial was published in six separate journals: The Journal of Physiology, Experimental Physiology, the British Journal of Pharmacology, Advances in Physiology Education, Microcirculation, and Clinical and Experimental Pharmacology and Physiology. Phew.
In fact, you could say that this makes not two but seven citations for Neuroskeptic now. Yes. Let's go with that.
Anyway, after discussing the history of the ubiquitous Student's t-test - which was invented in a brewery - it reminds us that the p value you get from such a test doesn't tell you how likely it is that your results are "real".
Rather, it tells you how often you'd get a result like yours if there were no real effect and it was all just random chance. That's a big difference. A p value of 0.01 doesn't mean your results are 99% likely to be real. It means that, if there were no effect, you'd only see results like yours 1% of the time. But if you ran, say, 100 experiments - or, more likely, 100 statistical tests on the same data - then you'd expect to get at least one result with a p value of 0.01 purely by chance. (With 100 independent tests, the odds of at least one such fluke are about 1 - 0.99^100, i.e. roughly 63%.)
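If you want to see this for yourself, here's a minimal simulation sketch in Python (my own illustration, not from the editorial; the sample sizes, seed, and test counts are arbitrary): run a batch of t-tests on pure noise and count how many cross p ≤ 0.01.

```python
# Sketch: how often does pure noise give p <= 0.01?
# (Illustrative only - all the numbers below are made up.)
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_tests = 100       # say, 100 tests on the same kind of null data
n_per_group = 20    # arbitrary sample size per group

p_values = []
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)  # group A: noise, no real effect
    b = rng.normal(size=n_per_group)  # group B: drawn from the same distribution
    p_values.append(ttest_ind(a, b).pvalue)

hits = sum(p <= 0.01 for p in p_values)
print(f"{hits} of {n_tests} null tests reached p <= 0.01")
# Across repeated runs you'll average about 1 such 'hit' per 100 tests,
# and roughly 63% of runs will contain at least one.
```

Rerun it with different seeds and the "significant" results come and go, which is exactly the point: none of them reflect a real effect.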
In that case it would be silly to think that the finding was only 1% likely to be a fluke. Of course it could be true. But we'd have no particular reason to think so until we get some more data.
This is exactly what the dead salmon study was all about. The multiple comparisons problem is very old, but very important: arguably the biggest problem in science today is that we're doing too many comparisons and only reporting the significant ones.
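To see why even a dead fish can "light up", here's a hedged sketch along the same lines (again my own illustration, with a made-up voxel count and scan count): test thousands of noise "voxels" at once and keep only the significant ones.

```python
# Sketch: the multiple-comparisons trap behind the dead salmon result.
# (Hypothetical numbers - a toy stand-in for an fMRI analysis, not the real one.)
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_voxels = 10_000                           # made-up voxel count
signals = rng.normal(size=(n_voxels, 16))   # 16 noise 'scans' per voxel

# One-sample t-test per voxel against a true mean of zero
p = ttest_1samp(signals, popmean=0.0, axis=1).pvalue
significant = np.flatnonzero(p < 0.01)

print(f"{significant.size} of {n_voxels} noise voxels 'light up' at p < 0.01")
# About 1% of pure-noise voxels pass; report only those, and the fish 'activates'.
```

Correcting for the number of tests (as the dead salmon authors advocated) makes those spurious "activations" vanish.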
Drummond GB & Tom BD (2011). Statistics, probability, significance, likelihood: words mean what we define them to mean. British Journal of Pharmacology, 164 (6), 1573-6. PMID: 22022804