Back in 2012 I discussed an alarming paper showing very high rates of false positives in single-subject fMRI analyses. Swedish researchers Anders Eklund and colleagues had tested the performance of one popular software tool for the statistical analysis of fMRI data, SPM8. But what about other analysis packages?

Now, Eklund et al. are back with a new study, which has not been published yet, but was presented last month at the International Symposium on Biomedical Imaging (ISBI). This time around they compared three popular packages, SPM8, FSL 5.0.7, and AFNI, and they show that all three produce too many false positives. Edit: the conference paper is available here.

Broadly speaking, FSL had the highest false positive rate, approaching 100% in some cases. AFNI was slightly better than SPM, but even AFNI gave false positive rates of 10-40%, depending upon the parameters. The desired rate is 5%. As in the 2012 paper, the problem was most serious for block designs.

Here are the results. These graphs show the proportion of single-subject analyses showing at least one significant cluster, at a nominal familywise error corrected (FWE) alpha of 0.05. Higher is worse:

Eklund et al.'s data were resting state fMRI scans from two centers, Cambridge and Beijing. They analyzed these data as if they were part of a task design, in which stimuli were presented at certain times. Since there was in fact no task, and no stimuli, no stimulus-triggered activation 'blobs' should have been seen. The authors remark that

It is clear that all three software packages give unreliable results... It is disappointing that the standard software packages still give unreliable statistical results, more than 20 years after the initial fMRI experiments.
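
To make the headline numbers concrete: the false positive rate for each package is simply the fraction of null analyses that produced at least one significant cluster. Here is a minimal Python sketch of that calculation, with a Wilson score interval added to judge whether an observed rate is compatible with the nominal 5%. The function name and the counts of false positives are my own illustration and are not taken from the paper; 396 is the number of rest datasets Eklund et al. analyzed.

```python
import math


def empirical_fwe(n_false_positive, n_analyses, z=1.96):
    """Empirical familywise error rate with a Wilson score interval.

    n_false_positive: number of null analyses with >= 1 significant cluster
    n_analyses:       total number of null analyses
    z:                normal quantile for the interval (1.96 ~ 95%)
    Returns (rate, lower, upper).
    """
    if n_analyses <= 0:
        raise ValueError("n_analyses must be positive")
    p = n_false_positive / n_analyses
    denom = 1 + z ** 2 / n_analyses
    centre = (p + z ** 2 / (2 * n_analyses)) / denom
    half = z * math.sqrt(
        p * (1 - p) / n_analyses + z ** 2 / (4 * n_analyses ** 2)
    ) / denom
    return p, centre - half, centre + half


# Hypothetical count, for illustration only: 40 of 396 null analyses
# show at least one "significant" cluster.
rate, low, high = empirical_fwe(n_false_positive=40, n_analyses=396)
print(f"empirical FWE = {rate:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")

# A package is behaving as advertised only if 0.05 lies inside the interval.
print("compatible with nominal 5%:", low <= 0.05 <= high)
```

With this many null datasets the interval is narrow enough that a rate of 10% or more cannot plausibly be explained as sampling noise around 5%.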

We should note that Eklund et al. only considered single-subject analyses. It's not clear if group level analyses are also affected.

Why is the false positive rate so high? The authors say that the problem lies in the assumptions made by each package about the statistical properties of the noise found in fMRI data. Each package has its own problem: SPM has a "too simple noise model" while FSL "underestimates the spatial smoothness", for instance. Eklund et al. conclude that given the problems with parametric statistics approaches to fMRI, as represented by SPM, FSL and AFNI, it may be time for neuroscientists to embrace nonparametric analysis, which makes fewer assumptions.

Anders Eklund kindly agreed to answer some of my questions about these results and what they could mean. Here's what he said:

Q: Could there be a way to optimize parametric approaches and make them more valid? Or should we all just move to non-parametric methods?
A: The SPM group is currently working on an improved noise model for SPM12, it would be interesting to test if it gives lower familywise error rates compared to SPM8. Even if parametric approaches were optimized, it would still be hard to use parametric approaches for multivariate statistical methods, which have more complicated null distributions. A non-parametric approach, like a permutation test, can thereby solve two problems in fMRI. First, to give familywise error rates that are closer to expected values. Second, to enable multivariate statistical methods with complicated null distributions, which may give a higher statistical power.

Q: All three packages gave more than 5% false positives, but it seems that FSL had an even higher error rate than the other two, especially for block designs. Do you think that this would hold for other datasets, or might it be specific to this study?
A: Hard to say, we are not sure what the main problem with FSL is, except that the FSL software gives a lower smoothness estimate compared to SPM and AFNI. According to some researchers, resting state data is not optimal for testing false positives (since it has different characteristics compared to task data). An alternative approach could be to analyze task data, using a regressor that is orthogonal to the true paradigm. One could for example analyze task data from the HCP, which were collected with a short TR (0.72 s) and a multiband sequence. According to our 2012 Neuroimage paper, a short TR is very problematic for the SPM software, due to its simple noise model. It would be interesting to see if such a short TR is also problematic for FSL and AFNI.

Q: How does this new analysis differ from your Eklund et al. 2012 Neuroimage paper?
A: We only looked at 396 instead of 1484 rest datasets. We only considered cluster level inference, and not voxel level inference. In the previous paper we looked at both. We tried two cluster defining thresholds; the threshold that is applied to all voxels to form clusters. We tried p = 0.01 (z-score of 2.3, the default in FSL) and p = 0.001 (z-score of 3.1, the default in SPM). We noticed that the cluster defining threshold has a very large impact on the familywise error rates; a lower threshold (z = 2.3) gives higher familywise error rates compared to a high threshold (z = 3.1). This is consistent with a recent paper (Woo et al. 2014). A possible reason for this is that a lower threshold is more sensitive to the assumption that the spatial smoothness is constant in the brain.
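
Returning to the permutation test that Eklund mentions in his first answer, here is a rough Python sketch of how such a test can control the familywise error rate without assuming a parametric null distribution. This is a toy of my own, not the procedure from the paper: it compares two groups of subjects across a handful of "voxels" and uses the maximum statistic over voxels in each permutation to build the null distribution. It assumes the subjects are exchangeable under the null. Raw single-subject fMRI time series violate that assumption because they are autocorrelated, so a real single-subject version would need extra steps. All names and data below are hypothetical.

```python
import random


def max_stat_permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Two-group permutation test with max-statistic FWE correction.

    group_a, group_b: lists of subjects, each a list of per-voxel values.
    Returns (observed, p_corrected): per-voxel absolute mean differences
    and their familywise-error-corrected p-values.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = group_a + group_b
    n_vox = len(pooled[0])

    def abs_mean_diffs(rows_a, rows_b):
        # Absolute difference of group means, one value per voxel.
        return [
            abs(sum(r[v] for r in rows_a) / len(rows_a)
                - sum(r[v] for r in rows_b) / len(rows_b))
            for v in range(n_vox)
        ]

    observed = abs_mean_diffs(group_a, group_b)

    # Null distribution of the LARGEST statistic across all voxels,
    # obtained by shuffling the group labels.
    max_null = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        max_null.append(max(abs_mean_diffs(shuffled[:n_a], shuffled[n_a:])))

    # Corrected p-value: how often the permuted maximum reaches the
    # observed value (+1 counts the observed labelling itself).
    p_corrected = [
        (1 + sum(m >= obs for m in max_null)) / (n_perm + 1)
        for obs in observed
    ]
    return observed, p_corrected


# Simulated toy data: 10 subjects per group, 5 voxels,
# with a large true effect in voxel 0 only.
data_rng = random.Random(42)
group_a = [[data_rng.gauss(5.0 if v == 0 else 0.0, 1.0) for v in range(5)]
           for _ in range(10)]
group_b = [[data_rng.gauss(0.0, 1.0) for v in range(5)] for _ in range(10)]

observed, p_corrected = max_stat_permutation_test(group_a, group_b)
for v, (obs, p) in enumerate(zip(observed, p_corrected)):
    print(f"voxel {v}: |mean diff| = {obs:.2f}, corrected p = {p:.4f}")
```

Because every voxel is compared against the distribution of the maximum, the chance of any false positive anywhere is held near the nominal level, and the same recipe works for statistics whose null distribution has no convenient formula.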