In a slightly depressing new paper, two researchers describe how they tried to get access to the data behind 111 of the most cited psychology and psychiatry papers published in the past decade. The researchers, Tom E. Hardwicke and John P. A. Ioannidis of Stanford, wanted to place the data into a 'Data Ark' to ensure its continued preservation for science. Unfortunately, in most cases, the data was not made available. The paper is called 'Populating the Data Ark' and it's out now in PLoS ONE.
Hardwicke and Ioannidis wrote to the authors of each of the highly cited articles, explaining the idea behind the Data Ark and requesting the raw data. Authors were also offered the option of handing over the data with restrictions on who could access it. In about 40% of cases, Hardwicke and Ioannidis received no meaningful response whatsoever. Another 30% of authors declined to share the data in any way. Only 14% of the datasets were made available with no restrictions on who could access them (either deposited in the Data Ark or already freely available).
There were no major differences between psychology and psychiatry papers in this regard, and there was also very little change over time (the articles were published between 2006 and 2016). This resistance to data sharing is consistent with previous studies from various areas of science. As Hardwicke and Ioannidis put it,
Previous efforts to obtain data directly from authors ‘upon request’ have also encountered low availability rates; for example, data was available for only 7 out of 157 (4.5%) articles published in the BMJ [11], 148 out of 394 (38%) articles published in four American Psychological Association (APA) journals [4], and 38 out of 141 (27%) articles published in four other APA journals [5]. The highest retrieval rate, 17 out of 37 (46%) articles, was observed for a study focused on data from randomized clinical trials (RCTs) published in the BMJ and PLOS Medicine that both mandate data sharing for RCTs [13].
In my view, however, the fact that Hardwicke and Ioannidis targeted the most highly cited articles makes the low rate of data sharing especially galling. The importance of data sharing, both to ensure reproducibility and to stimulate further analysis, is especially high when the data in question has already given rise to influential papers. Also, it is sometimes said that researchers should not have to share their data because, since they collected it, they have a right to enjoy the benefits of it (i.e. getting publications out of it). But these authors had already got very nice publications from their results.
The reasons authors gave for not sharing their data were also rather interesting.
As Hardwicke and Ioannidis comment,
Responses appear to suggest that a key barrier to sharing is that data can be outside of an authors’ control, either because the data was generated by other researchers, and sometimes because the data are owned by a commercial entity. This raises important questions about the responsibilities of data stewardship and the ability to verify data that underlies scientific publications.
Indeed, in my experience, one of the barriers to data sharing, even in a purely academic, non-commercial context, is the sense that data 'belongs' to a great many people, all of whom would need to give permission. This means not just the people involved in the data collection but (as data collectors are usually junior) their supervisors, and anyone involved in obtaining funding for the study. Perhaps clarity is needed on just who has the right to share data.