Keywords
Open Data, Publications, National Institute of Health, Bibliometrics
This article is included in the Research on Research, Policy & Culture gateway.
The new version further specified that the scope of the current article is to analyze BioLINCC datasets, publications that used BioLINCC datasets, and citations received by these publications. The new version provided more details about the datasets themselves, the percentage of datasets used in secondary publications, the datasets that have the highest number of publications, and so on. The discussion of the importance of open data for early career researchers has also been expanded. Table 1 and Figure 1 have been added to reflect these changes.
See the authors' detailed response to the review by Lisa Federer
See the authors' detailed response to the review by Colby Vorland and Andrew Brown
See the authors' detailed response to the review by Christian Ohmann
Recent years have seen an increased call for data sharing in clinical studies, especially for research funded by international and governmental agencies1. The call originally aimed to maximize transparency for clinical trial results1, but the benefits of data sharing extend beyond that original aim. Open access data is frequently cited as a boon for researchers, who can re-analyze already collected data to answer new research questions2,3. To organize and maximize the scientific use of open access data, researchers and funders store their data in open access data repositories4. The Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) of the National Heart, Lung, and Blood Institute is one such data repository, initiated in 2000 with the aim of sharing data from observational and interventional studies supported by the institute5. The impact of open access data, in terms of the number of datasets used from a repository, the publications generated from these datasets, and the citations received by these publications, is still unknown. In this study, we aim to analyze the BioLINCC datasets, the number of publications that used BioLINCC open access data, and the impact of these publications as measured by the citations they received.
A total of 205 studies are listed in the BioLINCC data repository. Of these, four studies have their data stored in other repositories, and seven studies have only specimens, available from BioLINCC upon request, with no associated datasets. We included only datasets that are stored in the BioLINCC repository and can be accessed through its portal, which comprised 194 datasets (Figure 1).
We also contacted BioLINCC support to obtain an up-to-date list of published articles that used BioLINCC datasets, and received a list of all publications up to 24th July 2019. Researchers accessing BioLINCC datasets are requested to disclose any publication resulting from their use. BioLINCC also lists published articles that used its datasets on its website (https://biolincc.nhlbi.nih.gov/publications/). A manual search of PubMed was also carried out to confirm an updated, full list of publications. We used the basic search of PubMed, entering the title of each dataset in the search field. Any study that reported using the searched dataset as part of its results was included in our analysis; authors typically give the name and acronym of the dataset in the methods section, usually with a specific citation to the relevant study, or in the acknowledgment section of their article. The included articles used data stored in the BioLINCC repository either alone or along with datasets from other repositories. We analyzed the number of studies published using each dataset (supplementary material).
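The PubMed search described above was performed manually through the basic search box. As an illustration only, the same title-based search could be expressed against the public NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and sends nothing. The function names, the 2000 start date (taken from the year BioLINCC was initiated), and the `retmax` value are assumptions made for this example and are not part of the original method.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(dataset_title: str) -> str:
    """Return a phrase query for a dataset title (hypothetical helper)."""
    # Double quotes ask PubMed to treat the title as a phrase.
    return f'"{dataset_title.strip()}"'


def build_esearch_url(term: str,
                      min_date: str = "2000/01/01",   # assumed start date
                      max_date: str = "2019/07/24") -> str:
    """Return an ESearch URL limited to a publication-date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": min_date,
        "maxdate": max_date,
        "retmax": 200,        # assumed page size for this sketch
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    term = build_pubmed_term("Atherosclerosis Risk in Communities Study")
    print(build_esearch_url(term))
```

A search of this kind only returns candidate records. Each hit would still need to be screened by hand, as in the study, to confirm that the article actually used the dataset.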
We used the Web of Science (WoS) database to analyze the characteristics of the included publications. We prepared a list of digital object identifiers (DOIs) for the included articles and entered the DOI list into the WoS advanced search field; only the WoS-indexed publications among the included articles were analyzed further. The WoS database has a built-in analysis tool that reports the number of publications using the included datasets per year (yearly publications), the topic of each publication, the affiliations of authors, and the number of citations received6.
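As a rough sketch of the DOI step, a list of DOIs can be cleaned, deduplicated, and joined into query strings for the WoS advanced search, which accepts the `DO=` field tag for DOIs. The helper names, the batch size of 50, and the example DOIs under the `10.1000` prefix are placeholders chosen for this illustration. They do not come from the study.

```python
from typing import Iterable, List

# URL-style and "doi:" prefixes that are often pasted along with a DOI.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Trim whitespace, lowercase, and drop a leading URL or 'doi:' prefix."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def build_wos_queries(dois: Iterable[str], batch_size: int = 50) -> List[str]:
    """Return DO=(...) query strings, each covering up to batch_size DOIs."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(normalize_doi(d) for d in dois if d.strip()))
    queries = []
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        quoted = " OR ".join(f'"{doi}"' for doi in batch)
        queries.append(f"DO=({quoted})")
    return queries


if __name__ == "__main__":
    # Placeholder DOIs, not real articles.
    sample = ["https://doi.org/10.1000/EXAMPLE.1", "10.1000/example.2"]
    for query in build_wos_queries(sample):
        print(query)
```

Articles that are not indexed in WoS simply return no record for their DOI, which is how the indexed subset separates from the full list.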
A total of 1,086 published articles used data from the BioLINCC repository. These articles drew on 79 (40.72%) of the datasets; the remaining 115 (59.28%) datasets did not have any publications associated with them. The dataset for the Atherosclerosis Risk in Communities Study (ARIC) had the highest number of associated publications, with 162 (15%), followed by the Framingham Heart Study-Cohort (FHS-Cohort) with 94 (8.7%) and the Cardiovascular Health Study (CHS) with 82 (7.6%). A total of 162 (14.9%) publications used more than one dataset (Table 1). Out of the 1,086 published articles, 987 (90.88%) were WoS indexed. All articles were published in English (see underlying data7). The first publication using BioLINCC open data appeared in 2002. Since then, the number of publications has steadily increased, as shown in Figure 2, and peaked in 2018 with a total of 138 publications.
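The counts and percentages reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text. The `share` helper and the variable names are introduced here for illustration.

```python
def share(part: int, whole: int, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


# Dataset counts reported in the text.
listed_studies = 205
stored_elsewhere = 4
specimens_only = 7
included_datasets = listed_studies - stored_elsewhere - specimens_only

datasets_with_pubs = 79
datasets_without_pubs = 115

# Publication counts reported in the text.
total_articles = 1086
wos_indexed = 987
per_dataset = {"ARIC": 162, "FHS-Cohort": 94, "CHS": 82}

if __name__ == "__main__":
    # The two dataset groups should add up to the included total.
    assert datasets_with_pubs + datasets_without_pubs == included_datasets

    print("included datasets:", included_datasets)
    print("with publications (%):", share(datasets_with_pubs, included_datasets))
    print("without publications (%):", share(datasets_without_pubs, included_datasets))
    print("WoS indexed (%):", share(wos_indexed, total_articles))
    for name, count in per_dataset.items():
        print(f"{name} (%):", share(count, total_articles, digits=1))
```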
The 987 open data publications received a total of 34,181 citations from 27,904 published articles up to 1st October 2019. The average number of citations per item for the publications using BioLINCC data was 34.63. The total number of citations received per year by publications using BioLINCC data increased from only 2 citations in 2002 to a peak of 2,361 citations in 2018 (Figure 3).
A total of 352 (35.66%) of the published articles related to cardiac and cardiovascular systems, 106 (10.74%) to general internal medicine, and 92 (9.32%) to public and occupational health. Figure 4 shows the 10 most common fields in which the studied publications using BioLINCC data were published. The American Journal of Cardiology had the highest number of publications using BioLINCC data (60; 6.08%), followed by the International Journal of Cardiology with 47 (4.76%) and the American Journal of Medicine with 25 (2.53%). Table 2 shows the top 10 journals in which publications using BioLINCC data were published. US authors participated in 842 (85.31%) of the publications using BioLINCC data, followed by authors from Canada and England, with 121 (12.26%) and 81 (8.21%), respectively (Figure 5). The top three affiliations in terms of publications using BioLINCC data were the University of Alabama system, the University of Alabama at Birmingham, and the University of California system, as shown in Table 3.
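The average citations per item follows directly from the two totals given above, assuming it is the arithmetic mean over the WoS-indexed publications. The symbol $\bar{c}$ is introduced here only for illustration.

```latex
\[
\bar{c}
  = \frac{\text{total citations}}{\text{WoS-indexed publications}}
  = \frac{34{,}181}{987}
  \approx 34.63
\]
```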
BioLINCC has made a tremendous effort since its establishment to prepare datasets for use as open data, and hundreds of studies have been published using BioLINCC open data6. Although the majority of datasets did not yield further publications through re-use, many of the datasets had a high number of publications. The impact of these publications can be measured in terms of citations received: yearly citations of publications using BioLINCC data rose from 2 in 2002 to 2,361 in 2018. Cardiology is the main field, with more than a third of publications being cardiology related, and the top two journals publishing articles using BioLINCC data are also cardiology journals.
In an analysis done in 2017, Coady and colleagues analyzed the administrative records of investigator requests for BioLINCC data and found that 35% of clinical trial data were associated with at least one publication within five years of public data release8. Our findings also showed that the majority of datasets deposited in the BioLINCC repository were not associated with secondary publications. A previous survey of researchers who requested datasets from BioLINCC showed that the majority requested the data to conduct an independent research project8. Moreover, Ross et al. found in their survey that the majority of requests to the BioLINCC repository were made by early career researchers. Although we previously pointed to the importance of open access data for underfunded and early career researchers2, our results showed that the top three countries using open access data are the USA, Canada, and England. Researchers new to open data might be skeptical about the opportunities for publishing studies performed using open data. In our analysis, the top 10 journals publishing open data studies, which together accounted for around 27% of the studied publications, had impact factors of more than two. Regarding the clinical impact of publications using open data, an example is the post-hoc analysis of the Digitalis Investigation Group trial using the open data of the original trial9, which showed that digoxin therapy is associated with an increased risk of death from any cause among women, but not men, a finding that the original study did not report. The digitalis trial is an example of how cardiology researchers are using open data, supported by cardiology initiatives that encourage data sharing and use by cardiology researchers10. Clinical trial data sharing in cardiology has also been used to validate the reproducibility of published results11.
In our study, we found a higher number of cardiology related publications using open access data compared to other specialties.
Since 2003, the National Institutes of Health has mandated that data collected by studies receiving more than $500,000 be stored in a publicly available repository, with BioLINCC being the main repository for NHLBI-funded research12. This might explain the high impact of studies resulting from data stored in BioLINCC. On the other hand, data shared through platforms other than BioLINCC may lack a sufficient description of the shared data, which hampers its use by other researchers13. Moreover, repositories should focus on facilitating access to data and increasing awareness of it, so that more researchers can use the data from these repositories10,11. Our results are based on the BioLINCC repository, where data from well-funded research projects undergo extensive processing before being publicly shared, resulting in well-curated, high-quality data. Further studies should validate our results by evaluating data repositories that do not have such pre-sharing processing. Moreover, we did not compare citations received by open data publications with those received by primary data publications; this comparison should be carried out in future projects.
Harvard Dataverse: Publications that used Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) datasets. https://doi.org/10.7910/DVN/1TXA3C (reference 7)
This project contains the following underlying data:
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: data science, data sharing and reuse
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
No
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
No
Competing Interests: Drs. Vorland and Brown have received research funds from the Center for Open Science.
Reviewer Expertise: Meta-research
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: data science, data sharing and reuse
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
References
1. Ross JS, Ritchie JD, Finn E, Desai NR, et al.: Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users. BMJ Open. 2016; 6 (9): e012769
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: clinical research, medical informatics
Version | Published |
---|---|
Version 4 (revision) | 18 Aug 21 |
Version 3 (revision) | 21 Apr 21 |
Version 2 (revision) | 28 Sep 20 |
Version 1 | 20 Jan 20 |