One week ago, the news broke that the University of Amsterdam is recommending the retraction of a 2012 paper by one of its professors, social psychologist Prof Jens Förster, due to suspected data manipulation. The next day, Förster denied any wrongdoing.
Shortly afterwards, the Retraction Watch blog posted a (leaked?) copy of an internal report that set out the accusations against Förster. The report, titled
Suspicion of scientific misconduct by Dr. Jens Förster,
is anonymous and dated September 2012. Reportedly it came from one or more statisticians at Förster's own university. It relates to three of Förster's papers: the one that the University says should be retracted, plus two others. A vigorous discussion of the allegations has been taking place in this Retraction Watch comment thread. The identity and motives of the unknown accuser(s) are one main topic of debate; another is whether Förster's inability to produce raw data and records relating to the studies is suspicious or not. The actual accusations have been less discussed, and there's a perception that they are based on complex statistics that ordinary psychologists have no hope of understanding. But as far as I can see, they are really very simple - if poorly explained in the report - so here's my attempt to clarify the accusations. First, a bit of background.

The Experiments

In the three papers in question, Förster reported a large number of separate experiments. In each experiment, participants (undergraduate students) were randomly assigned to three groups, and each group was given a different 'intervention'. All participants were then tested on some outcome measure. In each case, Förster's theory predicted that one of the intervention groups would score low on the outcome measure, another medium, and another high (Low < Med < High). Generally, the interventions were various tasks designed to make the participants pay attention to either the 'local' or the 'global' (gestalt) properties of some visual, auditory, smell or taste stimulus. Local and global formed the Low and High groups (though not always in that order). The Medium group either got no intervention, or a balanced intervention with neither a local nor a global emphasis. The outcome measures were tests of creative thinking, among others.

The Accusation

The headline accusation is that the results of these experiments were too linear: the mean outcome scores of the three groups, Low, Medium and High, tended to be almost evenly spaced. That is to say, the difference between the Low and Medium group means tended to be almost exactly the same as the difference between the Medium and High means. The report includes six montages, each showing graphs from one batch of the experiments. Here's my meta-montage of all of the graphs:
This montage is the main accusation in a nutshell: those lines just seem too good to be true. The trends are too linear, too 'neat', to be real data. Therefore, they are... well, the report doesn't spell it out, but the accusation is pretty clear: they were made up. The super-linearity is especially stark when you compare Förster's data to the accuser's 'control' sample of 21 recently published, comparable results from the same field of psychology:
It doesn't look good. But is that just a matter of opinion, or can we quantify how 'too good' they are?

The Evidence

Using a method they call delta-F, the accusers calculated the odds of seeing such linear trends even assuming that the real psychological effects were perfectly linear. These odds came out as 1 in 179 million, 1 in 128 million, and 1 in 2.35 million for the three papers individually. Combined across all three papers, the odds were 1 in 508 quintillion: 508,000,000,000,000,000,000. (The report, using the long scale, says 508 'trillion', but in modern English 'trillion' refers to a much smaller number.) So the accusers say:
Thus, the results reported in the three papers by Dr. Förster deviate strongly from what is to be expected from randomness in actual psychological data.
How so?

The Statistics

Unless the sample size is huge, a perfectly linear observed result is unlikely even assuming that the true means of the three groups are linearly spaced. This is because there is randomness ('noise') in each observation. This noise is measurable as the variance in the scores within each of the three groups. For a given level of within-group variance and a given sample size, we can calculate the odds of seeing a given level of linearity in the following way. Delta-F is defined as the difference between the sum of squares accounted for by a nonlinear model (one-way ANOVA) and the sum of squares accounted for by a linear model (linear regression), divided by the mean squared error (the within-group variance). The killer equation from the report:
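I won't reproduce the report's notation exactly, but putting the verbal definition above into symbols, it amounts to something like this (my paraphrase, not a quote from the report):

\[ \Delta F \;=\; \frac{SS_{\text{nonlinear}} - SS_{\text{linear}}}{MS_{\text{within}}} \]

With three groups and N participants in total, the ANOVA model has just one more free parameter than the linear one, so under the assumption of true linearity delta-F should follow an F distribution with 1 and N - 3 degrees of freedom.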
If this difference is small, it means that a nonlinear model can't fit the data any better than a linear one - which is pretty much the definition of 'linear'. Assuming that the underlying reality is perfectly linear (independent samples from three distributions with evenly spaced means), this delta-F metric should follow what's known as an F distribution. We can work out how likely a given delta-F score is to occur by chance, given this assumption, i.e. we can convert delta-F scores to p-values.

Remember, this assumes that the underlying psychology is always linear. That is almost certainly implausible, but it's the best possible assumption for Förster: if the reality were nonlinear, such low delta-F scores would be even more unlikely.

The delta-F metric is not new, but the application of it is (I think). Delta-F is a case of the well-known use of F-tests to compare the fit of two statistical models. People normally use this method to see whether some 'complex' model fits the data significantly better than a 'simple' model (the null hypothesis). In that case, they are looking to see if delta-F is high enough to be unlikely given the null hypothesis. But here the whole thing is turned on its head. Random noise means that a complex model will sometimes fit the data better than a simple one, even if the simple model describes reality. In a conventional use of F-tests, that would be regarded as a false positive. But in this case it's the absence of those false positives that's unusual.

The Questions

I'm not a statistician, but I think I understand the method (and have bashed together some MATLAB simulations). I find the method convincing. My impression is that delta-F is a valid test of non-linearity and 'super-linearity' in three-group designs. I have been trying to think up a 'benign' scenario that could generate abnormally low delta-F scores in a series of studies. I haven't managed it yet.

But there is one thing that troubles me. All of the statistics above assume that the data are continuously distributed. However, most of the data in Förster's studies were categorical, i.e. outcome scores were fixed to be (say) 1, 2, 3, 4 or 5, but never 4.5 or any other number in between. Now, if you simulate categorical data (by rounding all numbers to the nearest integer), the delta-F distribution starts behaving oddly. For example, given the null hypothesis, the p-curve should be flat, like it is in the graph on the right. But with rounding, it looks like the graph on the left:
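(If you want to reproduce these curves yourself: my simulations were in MATLAB, but here is a rough Python equivalent. The group means, sample sizes and noise level are invented purely for illustration; nothing below comes from Förster's actual data.)

```python
import numpy as np
from scipy import stats

def delta_F_pvalue(groups):
    """Delta-F and its p-value for one three-group experiment.

    delta-F = (SS explained by one-way ANOVA - SS explained by a linear
    trend) / mean squared error. Under a perfectly linear null it should
    follow F(1, N - 3). A p-value near 1 indicates 'super-linearity'.
    """
    y = np.concatenate(groups)
    x = np.concatenate([np.full(len(g), i) for i, g in enumerate(groups)])  # group codes 0, 1, 2
    n = len(y)
    grand_mean = y.mean()

    # SS explained by the full one-way ANOVA model (separate group means)
    ss_anova = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # SS explained by a simple linear trend across the three groups
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    ss_linear = slope ** 2 * np.var(x) * n
    # Within-group (error) variance of the ANOVA model
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - 3)

    dF = (ss_anova - ss_linear) / ms_within
    p = stats.f.sf(dF, 1, n - 3)   # large p = suspiciously linear
    return dF, p

def simulate(n_experiments=10_000, n_per_group=20, rounded=False, seed=0):
    """p-values from many simulated experiments whose true means are linear."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(n_experiments):
        # Perfectly linear 'true' means (3 < 4 < 5), noise SD = 1
        groups = [rng.normal(mu, 1.0, n_per_group) for mu in (3.0, 4.0, 5.0)]
        if rounded:
            groups = [np.round(g) for g in groups]   # mimic whole-number rating scales
        pvals.append(delta_F_pvalue(groups)[1])
    return np.array(pvals)

# Compare the p-value histograms for continuous vs. rounded scores
p_cont = simulate(rounded=False)
p_round = simulate(rounded=True)
print(np.histogram(p_cont, bins=10, range=(0, 1))[0])
print(np.histogram(p_round, bins=10, range=(0, 1))[0])
```

If it behaves like my MATLAB version did, the continuous scores give a roughly flat histogram of p-values, while the rounded scores give the 'clumpy' pattern described next.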
The p-values at the upper end of the range (i.e. at the end of the range corresponding to super-linearity) start to 'clump'. The authors of the accusation note this as well (when I replicated the effect, I knew my simulations were working!). They say that it's irrelevant, because the clumping doesn't make the p-values either higher or lower on average: the high and low clumps average out. My simulations also bear this out: rounding to integers doesn't introduce bias. However, a p-value distribution just shouldn't look like that, so it's still a bit worrying. Perhaps, if some additional constraints and assumptions are added to the simulations, delta-F might become not just clumped, but also biased - in which case the accusations would fall apart. Perhaps. Or perhaps the method is never biased. But in my view, if Förster and his defenders want to challenge the statistics of the accusations, this is the only weak spot I can see. Förster's career might depend on finding a set of conditions that skew those curves.

UPDATE 8th May 2014: The findings of the Dutch scientific integrity commission, LOWI, on Förster have been released. English translation here. As was already known, LOWI recommended the retraction of the 2012 paper, on the grounds that the consistent linearity was so unlikely to have occurred by chance that misconduct seems likely. What's new in the report, however, is the finding that the superlinearity was not present when male and female participants were analysed separately. This is probably the nail in the coffin for Förster, because it shows that there is nothing inherent in the data that creates superlinearity (i.e. it is not a side effect of the categorical data, as I speculated it might be). Rather, both the male and the female data show random variation, but they always seem to 'cancel out' to produce a linear mean. This is very hard to explain in a benign way.