The ethics of data sharing

I work mostly with clinical trials that collect health data from individuals with obsessive-compulsive disorder and related disorders. What are the ethical aspects as well as arguments for and against sharing that data?

The call for more open data from research studies comes in the aftermath of a replication crisis in psychological research, where findings that were previously viewed as robust fail to replicate in larger, more rigorous trials. One reason for non-replicability could be that data was not handled properly in the original trials, for example because of flexibility in the pre-processing of data and the choice of analytical methods. Funding bodies and journals therefore demand that the raw data and full analysis pipeline be openly shared so that the results and conclusions can be independently verified by other researchers.

Arguments in favour of sharing raw data

The data collection efforts in my field (psychiatry) can be time-consuming for participants, as they have to travel to our clinic repeatedly and fill out several questionnaires on each occasion. How can data sharing help to maximise the usefulness of the data that they provide?

Open data has the potential to improve the quality of research through better error detection or better use of collected data. Openly sharing raw data and analysis scripts from a research project would enable other researchers to verify the results, and improve their understanding of why certain conclusions were drawn. It would also be possible to detect errors in data entry or processing, and perhaps correct the scientific record if those errors would alter the results or conclusions in a meaningful way.

Another potential advantage of sharing raw data from clinical trials is that other researchers will be able to make more accurate summaries of the evidence, for example regarding a certain treatment. Methods such as individual participant data meta-analysis are more informative than meta-analyses based on aggregate data (which is what published articles typically report), and are therefore more likely to result in useful clinical guidelines.

It is also very likely that new methodological and statistical advancements will allow researchers to use the collected data in ways that were not available when the data was collected, and such re-use is likely facilitated if data is shared openly rather than stored in local (sometimes individual) systems. In my experience, research data from clinical trials is often structured in haphazard and idiosyncratic ways; once an individual with intimate knowledge of the collected data (typically the PhD student who was the study coordinator) leaves a research team, the dataset is hard or impossible to find and interpret for re-use. Researchers are incentivised to provide necessary documentation about the data (sometimes called meta-data) if there is an expectation that others will view and use the same data.

Lastly, if the norms of research were to change so that sharing of raw data from clinical trials was expected, this might discourage researchers from producing fabricated or distorted data. Just as the presence of ethical review authorities might prevent the most harmful research studies from being considered, the prospect of demands regarding open sharing of raw data might stop the worst types of fabrication or falsification from happening to begin with. It would also be easier to detect fabricated data if the raw data used to produce figures and results were openly available.

Arguments against sharing raw data

One potential risk is that a research participant is identified through the shared data. This risk is listed as the main reason for not sharing data among faculty at psychology departments in the United Kingdom.

Re-identification is a clear violation of research subjects' right to privacy, as sensitive details about their health status could be used to discriminate against them or cause other types of harm. To avoid this, the data needs to be sufficiently de-identified so that an individual subject cannot be identified (even by the researcher!), for example by removing information such as full date of birth, zip code, and details regarding household structure. Meyer suggests “blurring” certain variables to decrease the possibility of re-identification, for example by providing age ranges instead of a specific age.
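As a minimal sketch of what these two steps could look like in practice, the Python snippet below drops a few direct identifiers and replaces an exact age with an age range. The field names (`date_of_birth`, `zip_code`, `household_structure`, `age`, `ybocs_total`), the ten-year band width, and the example values are all made up for illustration; a real dataset would need its own list of identifying variables and its own choice of how coarse the ranges should be.

```python
from typing import Any

# Hypothetical field names for illustration; adapt to the actual dataset.
DIRECT_IDENTIFIERS = {"date_of_birth", "zip_code", "household_structure"}


def age_band(age: int, width: int = 10) -> str:
    """Return a coarse age range such as '30-39' instead of the exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record: dict[str, Any]) -> dict[str, Any]:
    """Drop direct identifiers and 'blur' the exact age into an age range."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age_range"] = age_band(cleaned.pop("age"))
    return cleaned


# Made-up participant record, not real data.
participant = {
    "participant_id": "P001",
    "date_of_birth": "1989-04-12",
    "zip_code": "11122",
    "household_structure": "lives with partner and two children",
    "age": 34,
    "ybocs_total": 24,
}

shared = deidentify(participant)
print(shared)
```

Removing and blurring variables in this way lowers the risk of re-identification but does not by itself guarantee anonymity, since combinations of the remaining variables may still single out an individual.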

A second concern relates to whether research participants who have not given their explicit consent to fully sharing data from the study would have accepted how their data is used. For example, if data is deposited in an open repository, it may be used for purposes that the research participants might not like. Furthermore, if data from publicly funded research projects is frequently used in ways the public does not like, this could harm public trust in science and willingness to participate in research.

Finally, just as methodological and statistical advancement may enable researchers to re-use data in useful ways in the future, there is a risk that research subjects can be identified from information that we consider anonymous today. According to Hunt,

“Estimating the probability of re-identification is difficult because it, too, is a moving target: As the amount of available data about an individual increases, any one data set about that individual becomes increasingly re-identifiable. More data about most of us is becoming available over time.”
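One rough way to get a feel for this risk, which is my own addition rather than something taken from Hunt, is to count how many participants share each combination of potentially identifying variables. This idea is usually called k-anonymity. The sketch below uses made-up records and hypothetical field names (`age_range`, `sex`, `ybocs_total`); a smallest group size of 1 means that at least one participant is unique on those variables.

```python
from collections import Counter
from typing import Any


def smallest_group_size(records: list[dict[str, Any]], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of records that share the same
    values on all quasi-identifiers (the 'k' in k-anonymity)."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values(), default=0)


# Made-up records, not real data.
records = [
    {"age_range": "30-39", "sex": "F", "ybocs_total": 24},
    {"age_range": "30-39", "sex": "F", "ybocs_total": 18},
    {"age_range": "40-49", "sex": "M", "ybocs_total": 30},
]

k = smallest_group_size(records, ["age_range", "sex"])
print(k)
```

A check like this only covers the variables in the shared dataset. As the quote points out, other data about the same individuals may become available later, so a dataset that looks safe today can become easier to re-identify over time.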

Summary

The arguments in favour of increased sharing of raw data from clinical trials typically focus on the increased value that can be derived from research data. Sharing raw data means that it will be easier to detect mistakes and misconduct, but also that the data can be re-used and lead to more knowledge. However, raw data from clinical trials almost invariably contains sensitive personal information, and researchers have an obligation to respect the research participants' right to privacy. The risk of re-identifying individual research participants must be considered. It is also possible that fully open research data (available to anyone) will be used in ways that research participants do not approve of.

Oskar Flygare
PhD Student in psychology