Data publication consensus and controversies [v1; ref status: approved with reservations 1, http://f1000r.es/3ag]
California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA
JK is supported by a Council on Library and Information Resources/Digital Library Foundation Postdoctoral Fellowship in Data Curation for the Sciences and Social Sciences funded by the California Digital Library and the Alfred P. Sloan Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The movement to bring datasets into the scholarly record as first class research products (validated, preserved, cited, and credited) has been inching forward for some time, but now the pace is quickening. As data publication venues proliferate, significant debate continues over formats, processes, and terminology. Here, we present an overview of data publication initiatives underway and the current conversation, highlighting points of consensus and issues still in contention. Data publication implementations differ in a variety of factors, including the kind of documentation, the location of the documentation relative to the data, and how the data is validated. Publishers may present the data as supplemental material to a journal article, with a descriptive “data paper,” or independently. Complicating the situation, different initiatives and communities use the same terms to refer to distinct but overlapping concepts. For instance, to virtually everyone the term “published” means that the data is publicly available and citable, but it may or may not imply that the data has been peer-reviewed. In turn, what is meant by data peer review is far from defined; standards and processes encompass the full range employed in reviewing the literature, plus some novel variations. Basic data citation is a point of consensus, but the general agreement on the core elements of a dataset citation frays if the data is dynamic or part of a larger set. Even as data publication is being defined, some are looking past publication to other metaphors, notably “data as software,” for solutions to the more stubborn problems.
How to cite: Kratz J and Strasser C. Data publication consensus and controversies [v1; ref status: approved with reservations 1, http://f1000r.es/3ag] F1000Research 2014, 3:94 (doi: 10.12688/f1000research.3979.1)
© 2014 Kratz J and Strasser C.
This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
No competing interests were disclosed.
First published: 23 Apr 2014, 3:94 (doi: 10.12688/f1000research.3979.1)
Latest published: 16 Oct 2014, 3:94 (doi: 10.12688/f1000research.3979.3)
Introduction: what does data publication mean?
The idea that researchers should share data to advance knowledge and promote the common good is an old one, but in recent years the conversation has shifted from sharing data to “publishing” data1–3. This shift in language stems from the conviction that datasets should join the scholarly record and be afforded the same first class status as traditional research products like journal articles4. While many in the scholarly communication community share this goal, different people and organizations often imply different things by the phrase data publication.
The community largely agrees on two essential properties of a data publication2,4. First, published data is publicly available now and for the indefinite future; access might demand payment of fees or acceptance of a legal agreement, but not the approval of the author. Second, like a book or journal article, a data publication can be formally cited. Open questions flock around a third property: how and to what extent a published dataset must be validated. In an effort to clarify the terminology, Callaghan et al. (2012)4 draw a distinction between data that has been shared, published (lower-case “p”), or Published (upper-case “P”): shared data is available, published data is available and citable, and Published data is available, citable, and validated. In practice, availability is usually satisfied by depositing the dataset in a repository, citability by assigning a persistent identifier (e.g. a Digital Object Identifier, or DOI), and validity by peer review.
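The three-level terminology can be written down as a small decision rule. The sketch below is only an illustration of the definitions quoted above: the function name, the boolean inputs, and the "not shared" label for unavailable data are assumptions introduced here, and the levels are treated as strictly cumulative.

```python
def publication_level(available: bool, citable: bool, validated: bool) -> str:
    """Classify a dataset with the shared / published / Published terms.

    Assumes the levels are cumulative: citability only counts once the
    data is available, and validation only counts once it is citable.
    """
    if not available:
        return "not shared"  # label introduced for this sketch
    if not citable:
        return "shared"
    if not validated:
        return "published"
    return "Published"


# A dataset deposited in a repository with a DOI but no review.
print(publication_level(available=True, citable=True, validated=False))
```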
Why publish data?
The underlying goals of data publication are to enable research to be reproduced and data to be reused. Hidden primary data exacerbates science’s very public “reproducibility crisis”5–9, most recently illustrated by the collapse of a pair of irreproducible Nature articles describing a simple method to transform somatic cells into pluripotent stem cells10,11. Widespread publication of the data underlying research papers could help expose both honest errors and fraud12. The leaders of the US National Institutes of Health (NIH) recently cited “provid[ing] greater transparency of the data that are the basis of published manuscripts” as one way to improve scientific reproducibility13.
Journals already frequently require authors to supply underlying data on request. In 2011, Alsheikh-Ali et al.14 found that 88% of high-impact journals required a statement regarding the availability of underlying data; half of those made willingness to provide data a condition of publication. However, the authors of 59% of papers examined in the study failed to adhere to the availability instructions. Vines et al. (2014)15 could only obtain underlying data from 101 of 516 papers published from 1991 to 2011. Availability dropped off sharply with time; data could be obtained from only two of the 62 oldest papers. Now, some journals require that underlying data be published simultaneously with the article.
In 2010, a coalition of Ecology and Evolutionary Biology journals began to require that the data underlying articles be archived with a maximum embargo of one year16,17. F1000Research has had a similar policy (without an embargo period) since its inception, and the Public Library of Science (PLOS) journals followed suit earlier this year18. Although there can be no substitute for funding new experiments and data collection, appropriate data reuse lowers costs and accelerates research. Documenting, publishing, and archiving data is time consuming and costly, but usually far less so than repeating the data collection. For example, Open Context published archaeological data from a site in eastern Turkey at the substantial cost of $10,000–15,000, but this publication expense was minor compared to the $800,000 spent to collect the data19. Piwowar (2011) contrasted the impact of $100,000 in National Science Foundation (NSF) grants, which generates an average of three to four papers, with an estimate that the same investment in curating, archiving, and publishing data could contribute to over 1,000 publications20. Furthermore, while some data is merely expensive to recreate, time-dependent or ephemeral data (e.g. climate records or observations of unique astronomical events) should be published because it can never be recreated for any price21.
Types of data publication
The still-congealing phrase “data publication” covers diverse classes of research objects published via diverse processes. Depending on the speaker, a data publication might be a spreadsheet on a website, a set of images in an institutional archive, a stream of readings from a weather station transmitted over the internet, or a peer-reviewed article describing a dataset. Because disciplines, sub-disciplines, and individual researchers consider different assortments of digital material to be data, it is unlikely that any single structure will suit every discipline and dataset. But, we can hope that a manageable number of designs will fit most data. Five data publication models described by Lawrence et al. (2011) are distinguished “by how the roles involved in publication are distributed between the various actors” (e.g. the author, archive or journal)3. Here, we will more simply group data publications into three categories based on the accompanying documentation; a dataset may supplement a traditional research paper, be the subject of a “data paper”, or be independent of any paper (Figure 1).
Figure 1. To be published, datasets are typically deposited in a repository to make them available and assigned an identifier to make them citable.
Some, but not all, publishers review datasets to validate them.
Data that supplements a paper
The most familiar kind of data publication is a traditional journal article accompanied by underlying data. That data can be hosted by the journal as supplementary material or deposited in a third-party repository. The trend is away from supplemental material because repositories are considered to be better suited to ensure long-term preservation and access to the data. For instance, The Journal of Neuroscience stopped publishing supplemental material in 2010; the announcement promotes disciplinary repositories as “vastly superior to supplemental material as a mechanism for disseminating data”22. Data underlying any peer-reviewed or otherwise “reputable” publication can be deposited in the Dryad repository. Dryad makes data available and citable, but the publisher of the article must manage any assessment of scientific validity. Other third-party repositories include Figshare, Zenodo, institutional repositories (e.g. the Purdue Research Repository), and discipline-specific repositories (e.g. DNA sequences are deposited in GenBank23 and protein structures in the Protein Data Bank24).
Data as the subject of a paper
A data paper describes a dataset with thoroughly detailed rationale and collection methods, but lacks any analysis or conclusions25. Data papers are flourishing as a new article type in journals such as F1000Research, Internet Archaeology, and GigaScience, as well as in dedicated journals like Geoscience Data Journal, Nature Publishing Group’s Scientific Data, and a trio of “metajournals” from Ubiquity Press.
Data paper length and structure vary between journals, but the tendency is toward a short, tightly structured format. All journals require an abstract, collection methods, and a description of the dataset; a few encourage authors to suggest potential uses for the data (e.g. Internet Archaeology and Open Health Data). Some journals supplement this general framework with field-specific sections (e.g. Internet Archaeology and the Journal of Open Archaeology Data each include a section for temporal and geographic scope). Data papers are most sharply defined not by the presence of any particular information, but by the absence of analysis or conclusions. A crisp distinction from other article types is important because many journals do not consider a data paper to be prior publication if the authors seek to publish an analysis of the same dataset (e.g. Nature-titled journals, Science, and others listed by F1000Research).
Data journals generally limit themselves to publishing the description of the dataset; a trusted repository publishes the data itself. For instance, Scientific Data and Geoscience Data Journal each direct authors to a list of approved repositories. As an exception, GigaScience hosts data in an integrated repository named GigaDB. An early implementer of data papers, The International Journal of Robotics Research25, is unusual in that it permits authors to host datasets on their own websites.
Data independent of any paper
To be useful or reproducible, a dataset must be accompanied by descriptive information (i.e. metadata)21, but this need not take the form of a journal article. Instead, some repositories publish rich, structured and/or freeform description together with the data. The line between a data repository and a data publisher is often blurry. Repositories provide access and citability, but the degree of validation varies widely and few are equipped to provide peer review. For instance, to make data publication as easy as possible for authors, Figshare and Zenodo publish datasets from any field with minimal validation.
Fundamentally, to publish is to make public, and to publish data is to make data publicly available. Present availability requires mechanisms for access; future availability also requires preservation (e.g. long-term storage, format migration)21,26,27. As in print publication, published data need not be free or legally unencumbered, and data use agreements constrain many published datasets. If access is limited, it should be contingent on clear and objective criteria; writing a request to the creator for permission should not be part of the process. For example, before granting access to restricted data, The Interuniversity Consortium for Political and Social Research (ICPSR) judges the applicant’s proposed security measures, but not the merit of their research. Datasets from social science or clinical studies that involve human participants are easily the most common source of access restrictions because of the need to protect privacy. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule severely limits the disclosure of medical information28.
As a practical matter, publishing a dataset usually includes depositing it in a trustworthy repository. What constitutes a “trustworthy” repository is somewhat subjective and there are a handful of certification schemes to choose from. In 2007, The Center for Research Libraries (CRL) published the most extensive scheme: the Trusted Repository Audit Checklist (TRAC)29. Many repositories consult TRAC for self-assessment, but only four (listed by the CRL) have completed the lengthy and rigorous process to be officially certified. The process to obtain a Data Seal of Approval (DSA) is considerably more streamlined. The DSA guidelines were also first released in 2007, by the Dutch Data Archiving and Networked Services (DANS); 24 repositories have been stamped with the DSA since then. Few of the hundreds of repositories in operation (e.g. the 973 now listed at Databib or the 609 at re3data.org) have pursued any kind of certification. Given the low adoption of repository certification, a more typical way to decide trustworthiness is to judge by the organization responsible. Repositories run by governments or large universities are likely to be considered trustworthy (although the effects of the 2013 US government shutdown on the PubMed biomedical article database30 might give one pause).
Data citation is the element of publication that has come the farthest toward consensus. This year, a coalition–including Future Of Research Communication and E-Scholarship (FORCE11)31, the Committee on Data for Science and Technology (CODATA)32, and the Digital Curation Centre (DCC)–released a Joint Declaration of Data Citation Principles. The first of the eight principles states, in part, that “[d]ata citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications”. Most of the time, this means that when a published dataset contributes to a paper, it should be cited formally in the reference list.
Data publishers enable formal citation by assigning unique permanent identifiers, most commonly the same ones used for journal articles: Digital Object Identifiers (DOIs). In addition to clarifying exactly what resource is being cited, a DOI can be resolved to locate the referenced dataset. Note, however, that a DOI is neither sufficient nor necessary for citability: if a dataset moves and the DOI is not updated, the citation breaks and, conversely, a well-maintained web address works as well as a DOI.
The present consensus is that a dataset should be cited using, at a minimum, five elements largely familiar from article citations: creator(s), title, year, publisher and identifier. This format agrees with CODATA’s recommendation32 and conveys all the information required to obtain a DataCite DOI33 or be listed in the Thomson-Reuters Data Citation Index. However, this article-derived format fails to address some of the complications unique to datasets, described below.
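As an illustration of the five-element minimum, the sketch below stores the elements in a small record and renders them as a reference-list entry. The class name, the ordering and punctuation of the rendered string, and the example dataset (including its DOI) are all hypothetical; the article names the elements but does not prescribe a layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetCitation:
    """The five minimum citation elements named in the text."""
    creators: Tuple[str, ...]
    title: str
    year: int
    publisher: str
    identifier: str

    def format(self) -> str:
        # Ordering and punctuation are an assumption of this sketch.
        names = "; ".join(self.creators)
        return (f"{names} ({self.year}): {self.title}. "
                f"{self.publisher}. {self.identifier}")


# Hypothetical dataset; none of these values refer to a real record.
example = DatasetCitation(
    creators=("Doe, J.", "Roe, R."),
    title="Example tide gauge readings",
    year=2014,
    publisher="Example Repository",
    identifier="https://doi.org/10.1234/example",
)
print(example.format())
```

Keeping the elements as separate fields, rather than as one preformatted string, lets the same record be rendered in whatever style a given journal requires.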
The first major complication that data citation faces is the need for deep citation. When supporting an assertion in writing, it usually suffices to cite the entirety of a journal article and leave it to the inquisitive reader to find the relevant passage. But, to reproduce an analysis performed on a subset of a larger dataset, the reader needs to know exactly what subset was used (e.g. a limited range of dates, only the adult subjects, wind speed but not direction). Datasets vary so widely in structure that there may not be a good general solution for describing subsets. The most common suggestion is to cite the entire dataset in the reference list and describe the subset in the text of the paper34. In straightforward cases, the Federation of Earth Science Information Partners (ESIP) and the National Snow and Ice Data Center (NSIDC) both recommend including a list of variables or range of dates in the formal citation.
The second major complication arises when datasets change. In the past, the printing process cemented one version of an article as the version of record. Even for traditional scholarly literature, web-based publishing and preprint servers (e.g. arXiv.org) are complicating the situation, but datasets are especially prone to be dynamic. Two kinds of dynamic datasets warrant consideration: growing datasets that add new data while never changing or deleting existing data, and revisable datasets where data may be added, deleted, or changed.
Consider USC00046336, a weather station at the Oakland Museum. Each day, the high temperature, low temperature and amount of precipitation recorded at the Museum35 flow, together with data from more than 20,000 other stations, into the swelling Global Historical Climatology Network (GHCN)-Daily36 dataset. Or, consider WormBase37, a genome database used by the Caenorhabditis elegans research community. WormBase encompasses genomic sequences of C. elegans and 20 related species massively annotated with gene structures, protein sequences, expression patterns, and a host of other information from empirical data and computational predictions. Every two months, WormBase responds to new data and better computational models by issuing a revised version with new material added and inaccurate material deleted or corrected.
Additions and updates to published datasets are extremely valuable, but a researcher seeking to reproduce an analysis of a dynamic dataset needs access to a particular version. To enable that access, previous versions must be preserved and citable. Growing datasets can be cited with an access date or a date range in the citation, as recommended by ESIP and NSIDC. Revisable datasets are more difficult; the most common approach is to accumulate revisions and periodically publish a new version with a citable version number. For example, WormBase identifies each release with a citable version number and makes all of the previous versions available.
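The two approaches just described (an access date for growing datasets, a version number for revisable ones) can be sketched as a helper that appends the relevant qualifier to a base citation. The function name, the wording of the appended phrases, and the example citation are assumptions made for illustration, not a format recommended by ESIP, NSIDC, or WormBase.

```python
from datetime import date
from typing import Optional


def versioned_citation(base: str, *,
                       accessed: Optional[date] = None,
                       version: Optional[str] = None) -> str:
    """Append a version number and/or access date to a base citation."""
    parts = [base.rstrip(".")]
    if version is not None:
        # Revisable dataset: pin the citation to one numbered release.
        parts.append(f"Version {version}")
    if accessed is not None:
        # Growing dataset: record when the data was retrieved.
        parts.append(f"Accessed {accessed.isoformat()}")
    return ". ".join(parts) + "."


# Hypothetical growing dataset cited with an access date.
base = ("Example Weather Network (2014): Daily station readings. "
        "Example Archive. https://doi.org/10.1234/daily")
print(versioned_citation(base, accessed=date(2014, 4, 23)))
```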
Controversy persists around the specific issue of identifiers for dynamic datasets. DataCite recommends, but does not insist, that their DOIs refer to immutable objects. NSIDC and ESIP instruct researchers to use a single identifier for growing datasets and include the access date in the citation; each major version of a revisable dataset gets a new identifier, but minor versions do not. In contrast, the DCC, Dataverse, and the UK Natural Environment Research Council (NERC) insist that any change to a dataset should trigger a new identifier4,34,38. To handle the difficulties with dynamic data that this policy creates, the DCC recommends periodically issuing growing datasets a new identifier that refers to the “time-slice” of new records and freezing versions of revisable datasets as individually-identified “snapshots”.
The difficulties surrounding deep citation and dynamic data could potentially be solved by turning the identifier-issuing process on its head. Instead of the dataset publisher issuing identifiers for data at the level that researchers seem likely to cite, researchers could issue identifiers for precisely the part of the dataset that they want to cite. The Research Data Alliance (RDA) Data Citation Working Group recently put forth a sophisticated proposal applicable to data in (or convertible to) databases. Identifiers created under this scheme would wrap together identification of a database, a query to return the cited dataset, the version of the database queried for this analysis, and a number of other useful components. Although promising, many technical and policy issues must be resolved before this approach can be widely adopted.
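To make the idea of a researcher-issued, query-based identifier concrete, the sketch below derives a short identifier from a database name, a query, and a database version. It is not the RDA working group's design, whose mechanics are not described here; the function name, the whitespace normalization, the use of a truncated SHA-256 digest, and the example values are all assumptions.

```python
import hashlib
import json


def query_citation_id(database_id: str, query: str, db_version: str) -> str:
    """Derive a reproducible identifier for 'this query on this version'."""
    # Collapse runs of whitespace so trivially reformatted queries match.
    # (A simplification: this would also alter whitespace inside string
    # literals in the query.)
    normalized_query = " ".join(query.split())
    record = {
        "database": database_id,
        "query": normalized_query,
        "version": db_version,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # Truncated digest used only to keep the example identifier short.
    return hashlib.sha256(payload).hexdigest()[:16]


# Hypothetical database, query, and release label.
subset_id = query_citation_id(
    "example-db", "SELECT * FROM obs WHERE year = 2013", "2014-04-01"
)
print(subset_id)
```

Because the identifier is computed from the query and the version together, the same subset requested from a later release yields a different identifier, which is the property that deep citation of dynamic data needs.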
Data validation is the least resolved aspect of data publication, and fundamental questions are still unanswered: What minimum level of quality should a published dataset guarantee? How and by what criteria can datasets be evaluated against that guarantee? Is literature peer review an appropriate model?
Callaghan et al. (2012)4 draw a useful distinction between technical and scientific review. Technical review verifies that a dataset is complete, its description is complete, and that the two match up. Domain expertise is generally not required, and many repositories provide at least some level of technical review. Scientific review evaluates the methods of data collection, the overall plausibility of the data, and the likely reuse value. Scientific review does require domain expertise, making this level of validation more difficult to organize, and few repositories provide it. When data is published with a data paper, review may be split between the repository for technical review and the data journal for scientific review.
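Part of a technical review can be automated. The sketch below checks a tabular dataset against its description in the three ways named above: is the description complete, is the dataset non-empty, and do the two match. The required description fields, the function name, and the example data are hypothetical; real repositories define their own checks.

```python
import csv
import io
from typing import Dict, List

# Hypothetical minimum description; real repositories define their own.
REQUIRED_FIELDS = ("title", "creators", "columns")


def technical_review(csv_text: str, description: Dict) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # 1. Is the description complete?
    for field in REQUIRED_FIELDS:
        if not description.get(field):
            problems.append(f"description is missing '{field}'")
    # 2. Is the dataset complete (here: a header plus at least one row)?
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        problems.append("dataset has no data rows")
    # 3. Do the data and the description match up?
    header = rows[0] if rows else []
    described = list(description.get("columns") or [])
    for name in header:
        if name not in described:
            problems.append(f"column '{name}' is not described")
    for name in described:
        if name not in header:
            problems.append(f"described column '{name}' is absent from the data")
    return problems


# Hypothetical dataset whose description has drifted from the data.
data = "date,tmax,tmin\n2014-04-23,21.5,11.0\n"
meta = {"title": "Example readings", "creators": ["Doe, J."],
        "columns": ["date", "tmax", "precip"]}
for problem in technical_review(data, meta):
    print(problem)
```

None of these checks requires domain expertise, which is why this level of validation is within reach of most repositories while scientific review is not.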
Data paper peer review
Peer review guarantees that journal articles entering the scholarly record reach some level of validity (although the aforementioned reproducibility crisis calls into question exactly what that level is). In many fields, peer-reviewed publications enjoy a much higher status than any other literature. Any effort to apply the prestige of “publication” to datasets cascades naturally into an effort to apply the prestige of “peer review”. But as data validation seeks to model itself on literature peer review, literature peer review itself is in flux39–41. Open peer review at F1000Research and post-publication commenting at PubMed Commons are just two of many ongoing web-enabled experiments in article evaluation.
Journal article reviewers traditionally consider whether the methods used are appropriate for the questions asked and the data collected support the conclusions drawn. In the absence of particular questions and conclusions, it is not obvious what peer review of data should certify. A dataset may be suitable for some purposes, but not for others42. In addition, while a reviewer can be expected to read an entire article, they cannot inspect every point in a large dataset. Finally, researchers are already overwhelmed by peer review of articles43 and may find any increased workload unreasonable. Despite all these difficulties, venues for peer-reviewed data papers are opening rapidly.
Data paper journals wrap scientific peer review of the paper and the dataset together into a single process. GigaScience, an exception, assigns technical review of the dataset to a separate data reviewer. The standards that various data journals provide to reviewers are fairly uniform, with the exception that about half consider novelty or potential impact, while the rest only require that the dataset be scientifically sound. While review standards are similar, processes differ widely.
As an example, compare Biodiversity Journal and Scientific Data. Both journals divide reviewer guidelines into three sections along similar lines, which Biodiversity Journal calls “quality of the data”, “quality of the description”, and “consistency between manuscript and data”. Scientific Data follows a traditional peer-review process: an editor appoints reviewers who are encouraged to remain anonymous. In contrast, review at Biodiversity Journal follows a flexible and open process featuring entirely optional anonymity and multiple types of reviewer. There, an editor appoints two or three “nominated” reviewers who must report back and several “panel” reviewers who read the paper and only comment at their discretion. Additionally, the authors may choose to open the paper to public comment during the review process.
Independent data validation
Data journals all model their data validation more or less faithfully on literature peer review, but independent data validation practices and proposals are considerably more varied. On the conservative end of the spectrum, Lawrence et al. (2011) propose a set of criteria for independent data peer review44. The Planetary Data System (PDS) peer-reviews datasets through the unusual process of holding an in-person meeting with representatives of the repository, the dataset creators, and the reviewers.
Two examples from archaeology, Open Context and the Digital Archaeological Record (tDAR), illustrate the diversity of approaches to data validation. Open Context provides multiple validation processes that incorporate peer review in a way that goes beyond the simple accept/reject binary19. Each Open Context dataset is rated from one to five based not on quality per se, but on the thoroughness of the validation; a one comes with no guarantees, a three has passed a technical review, and a five has passed external peer-review. Whereas Open Context is a boutique publisher, focusing on data presentation and reuse, tDAR is a large repository primarily concerned with collecting and preserving archaeology data for future use. tDAR is able to operate at scale by performing only technical validation and streamlining data deposition with a minimum of mandatory description. However, tDAR also serves as a platform for high-quality data publication. The repository accommodates contributors who wish to provide more information, and much of the content is deposited by digital curators who can be relied on to supply rich descriptions. Furthermore, two data paper journals, Internet Archaeology and Journal of Open Archaeology Data, recommend tDAR as a repository for their peer-reviewed data. Thus, data validation depends not only on discipline and data type, but on a host of external factors, including the goals of the organizations and researchers involved.
Pre-publication validation can be supplemented or replaced by post-publication feedback from successful or unsuccessful reusers. Parsons et al. (2010) suggest that “data use in its own right provides a form of review”, and go on to point out that the context of reuse demonstrates that the data is not generically “good”, but fit for some particular purpose42. The DANS repository solicits feedback from researchers who use its datasets: users are asked to rate the dataset on a one to five scale in each of six criteria (e.g., data quality, quality of the documentation, structure of the dataset)45,46.
Beyond data publication
In a 2013 paper47, Parsons and Fox argue that thinking about data through the metaphor of print “publication” is limiting. Diverse kinds of material are regarded as data by one research community or another and, while at least some aspects of publication apply well to at least some kinds of data, other approaches are possible. An alternative metaphor that seems to be gaining traction is “data as software”48. In some cases, it may be better to think of releasing a dataset as one would a piece of software and to regard subsequent changes as analogous to updated versions. The open-source software community has already developed many potentially relevant tools for working collaboratively, managing multiple versions, and tracking attribution. Ram (2013)49 catalogs a multitude of scientific uses for the software version control system Git, including data management. Open Context uses Git and Mantis Bug Tracker to track and correct dataset errors. Furthermore, projects such as IPython Notebook integrate data, processing, and analysis into a single package. However, scientific software struggles for recognition50 just as data does, so using it to alter or affect the academic reward system for data is a tricky prospect.
Ultimately, while “data as software” is promising, data is not software. Nor is it literature. The prestige and familiarity of terms like “publication” and “peer-review” are powerful, but we may have to stretch their definitions if we are determined to apply them to data.