In the current system, which assumes that every researcher is honest and does not require raw data to be submitted, fabricated data escape scrutiny and get published. The assumption that everyone is honest cannot hold when, at the same time, more than half of researchers estimate that over 25% of all studies are based on non-existent data.
In a paper (cited more than 500 times) that lists recommendations for increasing replicability in psychology, it is noted:
As part of the submission process, journals could require authors to confirm that the raw data are available for inspection (or to stipulate why data are not available). Likewise, co-authors could be asked to confirm that they have seen the raw data and reviewed the submitted version of the paper.
Begley and Ioannidis recommend that institutions require that raw data be made available on request.
These recommendations are likewise based on the assumption that researchers are honest, at least to the extent that authors will present raw data upon request. However, I imagine that, upon such a request, some authors might say, “Oops, my hard disk broke!” or something similar. Nor do I think it practical to suppose that every co-author sees and reviews all the raw data in a large, interdisciplinary paper published in a high-impact journal.
I believe it is now time to design a system based on the realistic assumption, shared by the majority of researchers, that not everyone is “honest,” replacing the “trust-me” system built on the traditional, idealistic assumption that everyone is good.
Open science and open data are central to such a design. I propose that sharing raw data publicly become commonly accepted as a necessary condition for a study to be considered scientifically sound, unless the authors have acceptable reasons not to do so (e.g., the data contain confidential personal information).
In the past age of print publishing, publishing all raw data was technically impossible because of space limitations. This limitation, however, has been virtually eliminated, thanks to the evolution of data storage devices and the internet.
Indeed, in 2014, the National Institutes of Health required researchers to share large-scale human and non-human genomic data, including data from genome-wide association studies (GWAS), single-nucleotide polymorphism (SNP) arrays, and genome sequencing, as well as transcriptomic, epigenomic, and gene expression data (https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/). This year, the National Institute of Mental Health (NIMH) issued a data sharing policy requiring NIMH-funded researchers to deposit all raw and analyzed data (including, but not limited to, clinical, genomic, imaging, and phenotypic data) from experiments involving human subjects into its informatics infrastructure, to enable the responsible sharing and use of data collected from and about human subjects by the entire research community (https://grants.nih.gov/grants/guide/notice-files/NOT-MH-19-033.html). In 2018, it was reported that China had mandated its researchers to share all scientific data in open national repositories (https://www.editage.com/insights/china-mandates-its-researchers-to-share-all-scientific-data-in-open-national-repositories/1523453996).
I believe that other countries may want to follow such moves. I propose that all journals should, in principle, do their best to have authors and institutions make their raw data openly available in a public database or on the journal's website upon publication of the paper, in order to increase the reproducibility of published results and to strengthen public trust in science.

Currently, the data sharing policy of Molecular Brain only “encourages” authors to deposit all datasets on which the conclusions of the manuscript rely in publicly available repositories (where available and appropriate), or to present them in the main paper or additional supporting files, in machine-readable format (such as spreadsheets rather than PDFs) whenever possible. Building on this existing policy, we will, in principle, require deposition of the datasets on which the conclusions of the manuscript rely from 1 March 2020. Such datasets include the quantified numerical values used for statistical analyses and graphs, images of tissue staining, and uncropped images of all blot and gel results. The deposition does not have to be completed at the time of manuscript submission, but manuscripts will be accepted on the condition that such data are deposited before publication. We could allow some exceptions, when the authors cannot make data public for ethical or legal reasons (e.g., the data consist of confidential personal information or proprietary data from a third party). In such cases, the rationale for not doing so should be clearly described in the data availability section of the manuscript and approved by the handling and chief editors.
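To make the notion of a “machine-readable format” concrete, here is a minimal sketch of how the quantified numerical values behind a graph might be deposited as a CSV file rather than embedded in a PDF figure. The file name, column names, and values are purely illustrative, not part of any journal's actual requirements.

```python
import csv

# Hypothetical per-animal measurements underlying a bar graph
# (illustrative values only).
measurements = [
    {"animal_id": "m01", "group": "control", "distance_cm": 1523.4},
    {"animal_id": "m02", "group": "control", "distance_cm": 1488.9},
    {"animal_id": "m03", "group": "mutant", "distance_cm": 2011.7},
]

# Write one row per animal to a CSV file: a format that any reader
# can re-analyze or mine programmatically, unlike a figure in a PDF.
with open("figure1_raw_values.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["animal_id", "group", "distance_cm"])
    writer.writeheader()
    writer.writerows(measurements)
```

A file deposited in this form can be loaded back with any spreadsheet program or with `csv.DictReader`, which is precisely what makes re-analysis and data mining straightforward.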
There are practical issues that need to be solved before raw data can be shared routinely. Big data, such as various kinds of omics data and footage of animal behaviors, are hard to handle and deposit in a public database or repository, and doing so can be costly. Researchers at different institutions may not have equal access to repositories of the same quality, or the skills to share their data properly. In addition, the definition of “raw data” could itself be an issue. For example, in the field of mouse behavior, we run a database for sharing the “raw data” of mouse behaviors, but it contains only quantified numerical text data. Ideally, all the footage taken for behavioral analysis should be shared, and we would like to do so once we obtain sufficient funding and infrastructure to realize such a database. What “raw data” means should be discussed by experts in each field of science, and some consensus should be reached, so that data can be shared in a systematic manner that makes re-analysis and data mining easy. Storage and sharing of confidential personal information in data derived from human subjects is another challenge that needs to be overcome.
To address these technical issues, institutions, funding agencies, and publishers should cooperate to support this move by establishing data storage infrastructure that enables the secure storage and sharing of raw data, guided by the principle of “no raw data, no science.”