Recently I've blogged about methodological problems in neuroscience research, but just to even things out a bit, here's a paper that highlights a potentially serious issue for psychologists - Treating Stimuli as a Random Factor in Social Psychology: A New and Comprehensive Solution to a Pervasive but Largely Ignored Problem
Suppose you want to find out whether people react differently to stimuli from two different groups. The reactions, stimuli, and groups could be anything: maybe you want to see if people prefer listening to sound clips of cats as opposed to dogs. Or maybe you show people photos of blonde men vs dark-haired men and see whether people judge guys with one hair colour as less trustworthy than the other.
A lot of psychology studies amount to this.
Going with the blonde vs. dark example, suppose you take 1000 volunteers, show them some pictures of blonde guys and dark guys, and get them to rate each face for trustworthiness. You find a significant difference between the two groups of stimuli. You conclude that your volunteers are hair-bigots and submit it as a paper. The reviewers think, "1000 volunteers? That's a big sample size." They publish it.
Now that study I just described might be perfectly valid. But it might be seriously flawed. The problem is that while your sample size may be large in terms of volunteers, it might be very small in another way. Suppose you have just 10 photos per group. Your 'sample size', as regards the sample of stimuli, is only 20. And that sample size is just as important as the other one.
It might be that there's no real hair difference in perceived trustworthiness, but there are individual differences - some men just look dodgy and it's nothing to do with hair - and in your stimuli, you've happened to pick some dodgy looking blonde guys. Or whatever.
Now, you can run your statistical analyses in a way that takes these possible stimulus variation effects into account. But according to Judd, Westfall and Kenny, authors of this paper, this is rarely done. They show, with both real and hypothetical data, that unless you take care of this, you can find "statistically significant" differences from pure random noise. This is not a new argument, but they say it's been ignored for too long.
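To make this concrete, here's a toy simulation of the problem in Python. Everything about it is made up - the function name, the effect sizes, the 10-photos-per-group design - but it captures the logic: there is no true group difference at all, only photo-to-photo variation, and the usual analysis (average over photos, t-test across volunteers) declares "significance" anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def fake_study(n_volunteers=1000, n_stimuli=10, stimulus_sd=0.5, noise_sd=1.0):
    """One simulated study with NO true difference between the groups."""
    # Each photo gets its own random 'dodginess' - the stimulus effect.
    stim_a = rng.normal(0.0, stimulus_sd, n_stimuli)  # e.g. blonde photos
    stim_b = rng.normal(0.0, stimulus_sd, n_stimuli)  # e.g. dark-haired photos
    # Every volunteer rates every photo, with trial-level noise on top.
    ratings_a = stim_a + rng.normal(0.0, noise_sd, (n_volunteers, n_stimuli))
    ratings_b = stim_b + rng.normal(0.0, noise_sd, (n_volunteers, n_stimuli))
    # The common (flawed) analysis: average over photos for each volunteer,
    # then t-test across volunteers - the stimulus sample is ignored.
    _, p = stats.ttest_rel(ratings_a.mean(axis=1), ratings_b.mean(axis=1))
    return p

false_positive_rate = np.mean([fake_study() < 0.05 for _ in range(200)])
print(f"'Significant' results from pure noise: {false_positive_rate:.0%}")
# With these (made-up) numbers, the large majority of runs come out
# 'significant' - far above the nominal 5%.
```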
The worst part is that increasing the number of volunteers actually makes it more likely that you'll fall foul of this, not less. Only increasing the stimulus sample size can prevent it.
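Continuing with the fake_study sketch above (same made-up parameters), you can watch this happen: hold the stimulus set at 10 photos per group, crank up the volunteer count, and the rate of spurious "significant" results only climbs.

```python
# Reuses fake_study from the sketch above; stimuli fixed at 10 per group.
for n_vol in (30, 300, 3000):
    rate = np.mean([fake_study(n_volunteers=n_vol) < 0.05 for _ in range(200)])
    print(f"{n_vol:5d} volunteers: {rate:.0%} 'significant' from pure noise")
```

More volunteers just measure the quirks of those particular 20 photos ever more precisely.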
The paper goes into lots of detail, and tackles various hot potatoes, including one of Daryl Bem's notorious precognition "retroactive priming" experiments. Bem claimed that college students were able to predict the future - they responded differently to different pictures... before the pictures appeared on the screen. The effect was statistically significant and he published it. But Judd et al say that accounting for stimulus variation removes the effect.
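For the curious: "accounting for stimulus variation" here means fitting a mixed-effects model with crossed random effects for volunteers and stimuli, which is what Judd et al advocate. The sketch below is a minimal Python/statsmodels version on made-up data, fitting random intercepts only - statsmodels handles crossed effects as variance components within one group spanning the whole dataset - whereas the paper also recommends random slopes, so treat this as an illustration, not their exact analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_vol, n_stim = 40, 10  # kept small so the model fits quickly

# Made-up long-format data: one row per rating, with NO true hair effect.
rows = []
vol_fx = rng.normal(0, 0.3, n_vol)            # some raters are just harsher
for hair in ("blonde", "dark"):
    stim_fx = rng.normal(0, 0.5, n_stim)      # some photos just look dodgy
    for j in range(n_stim):
        for v in range(n_vol):
            rows.append(dict(volunteer=v, stimulus=f"{hair}_{j}", hair=hair,
                             rating=vol_fx[v] + stim_fx[j] + rng.normal(0, 1.0)))
df = pd.DataFrame(rows)

# Crossed random intercepts: one dummy group spanning the whole dataset,
# with volunteers and stimuli entered as variance components.
df["all"] = 1
model = smf.mixedlm("rating ~ hair", df, groups="all",
                    vc_formula={"volunteer": "0 + C(volunteer)",
                                "stimulus": "0 + C(stimulus)"})
print(model.fit().summary())  # the 'hair' term now gets an honest test
```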
Judd CM, Westfall J, & Kenny DA (2012). Treating Stimuli as a Random Factor in Social Psychology: A New and Comprehensive Solution to a Pervasive but Largely Ignored Problem. Journal of Personality and Social Psychology. PMID: 22612667