A shocking piece of statistics has been uncovered in a paper published in a respectable psychiatry journal. The offending article, 'Electrodermal hyporeactivity as a trait marker for suicidal propensity in uni- and bipolar depression', appeared in 2013 in the Journal of Psychiatric Research. It examined whether an 'electrodermal hyporeactivity' test - based on measuring the electrical conductivity of the skin - could predict suicide attempts in depressed people. According to the authors, Lars Thorell and colleagues of Sweden, the test worked well. Their abstract said:
RESULTS: The high sensitivity and raw specificity of electrodermal hyporeactivity for suicide were confirmed... The findings support the hypothesis that electrodermal hyporeactivity is a trait marker for suicidal propensity in depression.
Sensitivity and specificity are two key yardsticks by which any diagnostic or predictive test can be judged. Broadly speaking they refer, respectively, to the test's ability to avoid false negatives and false positives. A high sensitivity and a high specificity mean that a test is an accurate one. Which is exactly what Thorell et al. found... right?

Er... no. They reported sensitivity, but not specificity. Instead they reported something they call 'raw specificity'. What is this? Well... it doesn't exist. Thorell et al. just made it up. The term is unknown in statistics: it does not appear on Google Scholar in any other paper (there are a few 'hits', but on closer inspection they all refer to the old-fashioned specificity of some 'raw' variable).
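To pin down the two standard yardsticks, here is a minimal Python sketch of how each is computed from the four cells of a 2x2 table. The function names and the counts are hypothetical, chosen only for illustration; they are not taken from Thorell et al.

```python
def sensitivity(tp, fn):
    """Share of true cases the test flags: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of non-cases the test clears: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only (not from the paper).
tp, fn = 80, 20   # 100 people who really are cases
tn, fp = 90, 10   # 100 people who really are not

print(sensitivity(tp, fn))  # 0.8
print(specificity(tn, fp))  # 0.9
```

Note that each metric is computed within one true group only: sensitivity looks only at the real cases, specificity only at the real non-cases. That is why neither depends on how common the condition is in the sample.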
It turns out that by 'raw specificity', Thorell et al. were referring to the metric known to everyone else in the world as negative predictive value (NPV). NPV is an important metric in its own right, but it's in no way a substitute for specificity. It makes no sense to evaluate a test by looking at sensitivity and NPV. A first-year undergraduate would get a failing grade if they did that in an exam.

I'm stunned that Thorell et al. passed peer review but, as so often, it fell to post-publication peer review to save the day. The Journal of Psychiatric Research has just published two letters (1, 2) from outraged readers, pointing out that 'raw specificity' is a nonsensical concept. One of the letters is by a student who's currently enrolled in an Honors Program and is due to graduate in 2016. I wasn't kidding when I said that this is the kind of error that would shame an undergraduate.

So did the test work? Well, the actual specificity (maybe Thorell et al. call this the 'cooked' specificity?) of the electrodermal test was 33% over all patients. The sensitivity was 74%. The sum of sensitivity and specificity was 107%. To put this in context, an entirely random 'test' will get you a sum of sensitivity and specificity equal to 100%, while a perfectly accurate test would get a sum of 200%. So the electrodermal test's true performance is just 7 percentage points better than flipping a coin.

In a rebuttal letter, Thorell et al. don't dispute any of the facts above. Rather, they argue that various special considerations inherent in testing for suicide mean that specificity is a poor metric, and 'raw specificity' is a better one. Their arguments sound vaguely plausible, but however you try to rationalize it, the fact is that even a purely random test could have an extremely high sensitivity + 'raw specificity'.

I will now proceed to design a suicide prediction technique that outperforms Thorell et al.'s electrodermal test. Watch in amazement!
My proposed test is simple: the patient picks a card at random from a standard deck. If it is any card except the Ace of Spades, I declare them a suicide risk. If they pick the Ace of Spades then I say they're not. In other words, I randomly assign a suicide risk to 51/52 or about 98% of people.

In Thorell et al. there were 783 patients, of whom 120 turned out to be suicidal, while 663 were not. In this sample, my Ace of Spades test has a sensitivity for detecting suicide of 98%, and a 'raw specificity' of 85%, total 183%! The 85% is no accident: because the cards ignore the patient entirely, the people they clear are non-suicidal at the same rate as the sample as a whole, 663 out of 783. My pack of cards is much better, in other words, than Thorell et al.'s test, which had a sensitivity of 74% and a 'raw specificity' of 88%, totalling a mere 164%.

It's clear that there is no substitute for the old-fashioned sensitivity and specificity, which Thorell et al. should have used in the first place.

Hat Tip: Bernard Carroll.
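For anyone who wants to check the card-trick arithmetic, here is a minimal Python sketch. It assumes the test flags each patient at random with a fixed probability, independent of their true status, and works with expected counts on the 120 / 663 sample rather than simulating draws. The function name expected_metrics and its parameters are my own, chosen for illustration.

```python
from fractions import Fraction

N_SUICIDAL = 120   # patients who turned out to be suicidal
N_NOT = 663        # patients who did not


def expected_metrics(p_flag, n_pos=N_SUICIDAL, n_neg=N_NOT):
    """Expected metrics for a test that flags each patient with
    probability p_flag, regardless of their true status."""
    tp = p_flag * n_pos          # suicidal, flagged
    fn = (1 - p_flag) * n_pos    # suicidal, cleared
    fp = p_flag * n_neg          # not suicidal, flagged
    tn = (1 - p_flag) * n_neg    # not suicidal, cleared
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),   # what the paper calls 'raw specificity'
    }


cards = expected_metrics(Fraction(51, 52))  # anything but the Ace of Spades
coin = expected_metrics(Fraction(1, 2))     # a fair coin flip

for name, m in (("cards", cards), ("coin", coin)):
    honest = float(m["sensitivity"] + m["specificity"])
    raw = float(m["sensitivity"] + m["npv"])
    print(f"{name}: sens + spec = {honest:.0%}, sens + NPV = {raw:.0%}")
```

Under these assumptions, sensitivity plus specificity comes out at exactly 100% for any flagging probability, while the NPV is always 663/783, about 85%, whatever the cards do. Sensitivity plus NPV can therefore be pushed up simply by flagging more people, which is the whole trick.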
Culver, A. (2014). Letter to the Editor: Specificity of electrodermal reactivity testing for suicidal propensity in Thorell et al. Journal of Psychiatric Research DOI: 10.1016/j.jpsychires.2014.03.013
Mushquash, C., Weaver, B., & Mazmanian, D. (2014). Reporting sensitivity and specificity for suicide risk instruments: A comment on Journal of Psychiatric Research DOI: 10.1016/j.jpsychires.2014.03.014