How to ensure that results in psychology (and other fields) are replicated has become a popular topic of discussion recently. There's no doubt that many results fail to replicate, and also that people don't even try to replicate findings as often as they should.
Yet psychologist Gregory Francis warns that replication per se is not always a good thing: Publication bias and the failure of replication in experimental psychology
Among experimental psychologists, successful replication enhances belief in a finding, while a failure to replicate is often interpreted to mean that one of the experiments is flawed. This view is wrong.
Because experimental psychology uses statistics, empirical findings should appear with predictable probabilities. In a misguided effort to demonstrate successful replication of empirical findings and avoid failures to replicate, experimental psychologists sometimes report too many positive results.
Rather than strengthen confidence in an effect, too much successful replication actually indicates publication bias, which invalidates entire sets of experimental findings...
Even populations with strong effects should have some experiments that do not reject the null hypothesis. Such null findings should not be interpreted as failures to replicate, because if the experiments are run properly and reported fully, such nonsignificant findings are an expected outcome of random sampling... If there are not enough null findings in a set of moderately powered experiments, the experiments were either not run properly or not fully reported. If experiments are not run properly or not reported fully, there is no reason to believe the reported effect is real.

Say you took a pack of playing cards and removed half the red cards. Your pack would now be two-thirds black, so if you took a random sample of cards, say a poker hand of 5 cards, you'd expect more blacks than reds (a significant 'effect' of colour). But you'd still expect some reds, and some random hands would in fact be entirely red, just by chance. If someone claimed to have drawn 10 random hands and they'd all been mainly black, that would be implausible - "too good".
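To put rough numbers on the card analogy, here is a minimal Python sketch using exact hypergeometric probabilities. Reading "mainly black" as a 3-or-more black majority within each 5-card hand is my own assumption for the sake of illustration.

```python
from math import comb

# Deck after removing half the red cards: 26 black, 13 red, 39 cards in total.
BLACK, RED = 26, 13
TOTAL = BLACK + RED
HAND = 5

def p_black(k):
    """Exact hypergeometric probability of drawing exactly k black cards in a 5-card hand."""
    return comb(BLACK, k) * comb(RED, HAND - k) / comb(TOTAL, HAND)

# "Mainly black" read here as a black majority: 3, 4 or 5 of the 5 cards.
p_mainly_black = sum(p_black(k) for k in range(3, HAND + 1))

print(f"P(one hand is mainly black)              = {p_mainly_black:.3f}")             # ~0.80
print(f"P(all 10 hands are mainly black)         = {p_mainly_black ** 10:.3f}")       # ~0.11
print(f"P(at least one hand is not mainly black) = {1 - p_mainly_black ** 10:.3f}")   # ~0.89
```

Even with a genuinely stacked deck, then, an unbroken run of ten black-heavy hands is the exception rather than the rule; most of the time at least one hand should come out red-heavy, just as a fully reported set of properly run experiments on a real but modest effect should include some null results.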
Francis's approach is a bit like Uri Simonsohn's method for detecting fraudulent data - they both work on the principle that "If it's too good to be true, it's probably false" - but they differ in their specifics, and I believe that we should not conflate fraud with publication bias... so let's not get carried away with the parallels.
Earlier this year, Francis wrote a critical letter about a paper published in PNAS purporting to show that wealthier Americans are less ethical. He argued that the paper's results were "unbelievable" - it reported seven separate experiments, all of which showed a small but significant effect in favour of the hypothesis.
Even if rich people really were meaner, Francis said, the chance of 7/7 experiments being positive is very low: just by chance, you'd expect some of them to show no difference (given that the size of the difference in those seven experiments was small, with a lot of overlap between the groups). Francis suggested that the authors may have run more than seven experiments, and only published the positive ones; the authors denied this in their Letter.
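As a back-of-the-envelope illustration of that argument: if each experiment independently has some probability (its statistical power) of reaching significance when the effect is real, then the chance that all seven do is the product of those powers. The power values below are hypothetical, chosen only for this sketch; they are not Francis's estimates for the PNAS paper.

```python
# Hypothetical per-experiment power values -- assumptions for illustration,
# not the figures Francis calculated from the published effect sizes.
n_experiments = 7

for power in (0.5, 0.8):
    p_all_significant = power ** n_experiments
    print(f"power {power} per experiment -> P(all {n_experiments} significant) = {p_all_significant:.4f}")
# power 0.5 -> ~0.0078; power 0.8 -> ~0.2097
```

Even with a generous 80% power in every study, an unbroken run of seven positive results would be expected only about a fifth of the time, which is why seven-for-seven small effects raise the suspicion that some null results went unreported.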
Anyway, in the new paper, Francis expands on this approach in much more detail, drawing from this 2007 paper, and suggests a Bayesian approach that might help mitigate the problem.
Francis G (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review. PMID: 23055145