In April 2016, an FDA committee voted not to recommend acceptance of eteplirsen, a drug designed to treat muscular dystrophy. In September, however, the FDA did approve the drug, following a heated internal debate.
But this wasn't the end of the story. What followed was an unusual scientific controversy that played out in the peer-reviewed literature, discussed in a Retraction Watch post this week. Following the approval of eteplirsen, Ellis Unger and Robert Califf wrote a letter to the journal Annals of Neurology expressing concern over a paper about the drug published in that journal in 2013. This was a remarkable intervention, given that Califf was at the time head of the FDA, while Unger had led the FDA's eteplirsen review team.
The paper Unger and Califf criticized was one they were well acquainted with, because the case for the approval of eteplirsen had largely rested on it. Authored by Mendell et al., the 2013 paper reported on a clinical trial of eteplirsen in 12 children suffering from Duchenne muscular dystrophy (DMD). Mendell et al. reported that eteplirsen increased levels of dystrophin, the protein that is deficient in DMD, and also produced clinical improvement.
In their letter, the two FDA critics focussed on how Mendell et al. measured dystrophin. According to the 2013 paper, muscle biopsy samples were stained for dystrophin and then "evaluated by blinded expert muscle pathologists" (note the plural) to count the percentage of dystrophin-positive muscle cells. Unger and Califf, however, say that an FDA lab inspection revealed that all of the biopsy stains had been evaluated by a single individual. They describe this person as a 'technician', a word that implies someone more junior than a 'pathologist'. They quote from an FDA report of the lab visit:
The immunohistochemistry images were only faintly stained, and had been read by a single technician using an older liquid crystal display (LCD) computer monitor in a windowed room where lighting was not controlled. (The technician had to suspend reading around mid-day, when brighter light began to fill the room and reading became impossible.)
Further, the FDA learned of problems with the blinding. The technician rating the images was blinded to treatment group (drug or placebo), but he or she was aware of when each biopsy had been performed. In Mendell et al.'s design, all patients received the drug at the final, 48-week timepoint, so knowing that a biopsy came from week 48 meant knowing that it came from a treated patient. Thus, Unger and Califf say, the large increases in dystrophin expression seen at 48 weeks could have arisen "simply by having a lower threshold for calling fibers 'positive' at later time points in the study."
Unger and Califf reveal that, in the light of the limitations of Mendell et al.'s analysis, the FDA encouraged the researchers to re-analyze the biopsy data with three independent, fully blinded pathologists as raters. The re-analysis found much lower percentages of dystrophin-positive fibers, and no evidence of a treatment effect. This image shows the difference between the old and the new analysis:
In his rebuttal to the Unger and Califf letter, Mendell said that the lower dystrophin levels in the re-analysis were not unexpected, because the FDA told the raters to use more stringent criteria when classifying cells as dystrophin-positive:
In the recount, three independent pathologists reported the results using the newly established criteria that excluded any muscle fibers with partial dystrophin staining (borderline positivity) and fibers with membrane staining that touched the borders of the image.
Mendell goes on to say that:
The independent pathologists performing the recount using the more-conservative scoring protocol confirmed the increase in dystrophin-positive fibers in the treated samples, with a mean of 16.27% increase in the number of dystrophin-positive fibers (p ≤ 0.001), and a 15-fold increase between the pretreatment and post-treatment samples... the finding that the treated patients had 16.2% dystrophin-positive fibers confirmed unequivocally that eteplirsen can restore dystrophin to levels that have been associated with milder phenotypes
I have to say that Mendell's response struck me as unconvincing. For one thing, there is really no excuse for relying on a single rater in a study of this kind, especially given how much was at stake: DMD is currently an incurable disease, so the efficacy of eteplirsen, or lack of it, is of huge clinical importance. Using multiple independent raters from the start would have increased accuracy and also allowed inter-rater reliability to be assessed. (Mendell in fact says that rating was done by "an expert pathologist with the assistance of an experienced staff member" but I don't think this refers to two independent raters.)
I'm also not sure what Mendell is talking about when he refers to a "15-fold increase" between the pre- and post-treatment samples in the new analysis. Such an effect would be very impressive, but no group showed such a dramatic increase according to Unger and Califf's graph above. At best, the 30 mg/kg group showed about a 2-fold increase.
This study was also extremely small, with only four patients receiving placebo. I wonder if this is one of those studies that is so small that it is more likely to mislead than to inform us. I know that it's very expensive to conduct a study like this, and I'm sure the researchers did everything in their power to increase the numbers. But it makes little sense to talk about p-values like p<0.001 when there are only a handful of datapoints.
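To put a rough number on that last point, here is a minimal sketch in Python. It is my own illustration, not the test Mendell et al. ran: I'm assuming a simple comparison of 4 placebo patients against the other 8 of the 12 children, analyzed with an exact permutation test. The function name `min_one_sided_permutation_p` is made up for this example. The idea is that with so few patients there are only so many ways to shuffle the group labels, which puts a floor under the p-value such a test can ever report.

```python
from math import comb


def min_one_sided_permutation_p(n_placebo: int, n_treated: int) -> float:
    """Smallest one-sided p-value an exact two-group permutation test can give.

    Under the null hypothesis every assignment of group labels is equally
    likely, so even the single most extreme assignment has probability
    1 / C(n_placebo + n_treated, n_placebo).
    """
    return 1 / comb(n_placebo + n_treated, n_placebo)


# Assumed group sizes: 4 placebo vs. 8 treated (12 children in total).
p_floor = min_one_sided_permutation_p(4, 8)
print(f"possible label assignments: {comb(12, 4)}")
print(f"smallest attainable p-value: {p_floor:.4f}")
```

With 4 versus 8 there are 495 possible label assignments, so the floor is 1/495, or about 0.002. That doesn't by itself show that a reported p<0.001 is wrong, since a parametric test or a paired pre/post comparison can go lower. It does suggest that a p-value that small, from a sample this size, is leaning heavily on the test's assumptions rather than on the data.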