Keywords
Open Data, Publications, National Institute of Health, Bibliometrics
This article is included in the Research on Research, Policy & Culture gateway.
The reviewers again performed an in-depth assessment and raised valuable points, which we addressed point by point. We edited Figure 1, correcting the text within the figure as the reviewers suggested. They also suggested improvements to the dataset; in response, we uploaded a separate codebook describing the dataset. We hope the manuscript in its current, improved version meets their expectations.
See the authors' detailed response to the review by Lisa Federer
See the authors' detailed response to the review by Colby Vorland and Andrew Brown
See the authors' detailed response to the review by Christian Ohmann
Recent years have seen an increased call for data sharing in clinical studies, especially for research funded by international and governmental agencies1. The call originally aimed to maximize the transparency of clinical trial results1, but the benefits of data sharing extend beyond this original aim. Open access data is frequently cited as a boon for researchers, who can re-analyze already collected data to answer new research questions2,3. To organize and maximize the scientific use of open access data, researchers and funders store their data in open access data repositories4. The Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) of the National Heart, Lung, and Blood Institute is one such data repository, initiated in 2000 with the aim of sharing data from observational and interventional studies supported by the institute5. The impact of open access data, in terms of the number of datasets used from a repository, the publications generated from these datasets, and the citations received by these publications, is still unknown. In this study, we aim to examine the BioLINCC datasets, the number of publications that used BioLINCC open access data, and the citations received by these publications.
A total of 205 studies are listed on the BioLINCC data repository. Of these, four have their data stored in other repositories, and seven have only specimens available from BioLINCC upon request, with no associated datasets. We included only datasets that are stored in the BioLINCC repository and can be accessed through its portal, which comprise 194 datasets (Figure 1).
We also contacted BioLINCC support to obtain an up-to-date list of published articles that used BioLINCC datasets, and received a list of all publications up to 24th July 2019. This list might not reflect the total publications of 2019, as the whole year was not included. Researchers accessing the BioLINCC datasets are requested to disclose any publication resulting from their use. BioLINCC also lists published articles that used its datasets on its website (https://biolincc.nhlbi.nih.gov/publications/). A manual search of PubMed was also carried out on 25th July 2019 to confirm an updated full list of publications, as follows:
We used the basic search of PubMed by inputting the title of the dataset in the search field (e.g., Cooperative Study of Sickle Cell Disease or CSSCD), in order to retrieve results that mention the dataset in the title, abstract, or keywords. It is important to note here that each dataset available on the BioLINCC repository had its own acronym.
The retrieved articles were manually screened by one of the authors (SAA) to check whether the dataset was used in the study to generate results. Authors of such studies name the dataset and its acronym either in the methods section, usually with a specific citation to the relevant study, or in the acknowledgment section of their articles. The included articles either used data stored in the BioLINCC repository alone or used these datasets along with datasets from other repositories.
We added the retrieved articles to the original list provided by BioLINCC.
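The search step above can be sketched in code. This is only an illustration: the authors describe a manual search through the PubMed web interface, whereas the sketch assumes the NCBI E-utilities `esearch` endpoint as a programmatic stand-in. The function names and the `retmax` value are hypothetical, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint. Assumption: the authors searched
# through the PubMed web interface, not through this API.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_term(title, acronym):
    """Search for the dataset's full title as a phrase, or its acronym."""
    return f'"{title}" OR {acronym}'


def build_esearch_url(term, retmax=200):
    """Build (but do not send) an esearch request for PubMed."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_search_term("Cooperative Study of Sickle Cell Disease", "CSSCD")
url = build_esearch_url(term)
print(term)
print(url)
```

Hits returned by such a query would still need the manual screening described above, since a mention of a dataset in the title or abstract does not show that its data were re-used.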
We analyzed the number of studies published using each dataset (supplementary material).
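As a rough sketch of how the two publication lists could be combined and tallied, the example below merges records by DOI and counts publications per dataset. The record layout (`doi` and `datasets` keys), the function names, and the example DOIs are all hypothetical; the authors do not describe their tooling for this step.

```python
from collections import Counter


def merge_publications(biolincc_list, pubmed_hits):
    """Merge two lists of records, keeping the first record seen for each DOI."""
    merged = {}
    for record in list(biolincc_list) + list(pubmed_hits):
        key = record["doi"].strip().lower()  # DOIs are case-insensitive
        merged.setdefault(key, record)
    return list(merged.values())


def publications_per_dataset(records):
    """Count how many publications used each dataset."""
    counts = Counter()
    for record in records:
        counts.update(set(record["datasets"]))
    return counts


# Hypothetical records, for illustration only.
biolincc_list = [
    {"doi": "10.1000/example.1", "datasets": ["ARIC"]},
    {"doi": "10.1000/example.2", "datasets": ["ARIC", "CHS"]},
]
pubmed_hits = [
    {"doi": "10.1000/EXAMPLE.2", "datasets": ["ARIC", "CHS"]},  # duplicate
    {"doi": "10.1000/example.3", "datasets": ["CSSCD"]},
]

merged = merge_publications(biolincc_list, pubmed_hits)
counts = publications_per_dataset(merged)
multi_dataset = sum(1 for r in merged if len(set(r["datasets"])) > 1)
print(len(merged), dict(counts), multi_dataset)
```

A publication that used several datasets is counted once under each of them, so the per-dataset counts can sum to more than the number of publications.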
We used the Web of Science (WoS) database to analyze the characteristics of the included publications. We prepared a list of digital object identifiers (DOIs) for the included articles and entered it into the WoS advanced search field; only the included articles that were indexed in WoS were analyzed further. The WoS database has a built-in analysis tool that reports, for a set of publications, the number of publications per year (yearly publications), the topic of publication, the affiliation of authors, and the number of citations received6.
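A DOI list can be turned into advanced-search queries along the following lines. This is a sketch that assumes the `DO=` field tag of the WoS advanced search; the function name and the batch size are hypothetical, chosen only to keep each query short, and the example DOIs are placeholders.

```python
def build_wos_doi_queries(dois, batch_size=50):
    """Return WoS advanced-search strings, each covering up to batch_size DOIs."""
    cleaned = []
    seen = set()
    for doi in dois:
        doi = doi.strip()
        # Skip blanks and case-insensitive duplicates.
        if doi and doi.lower() not in seen:
            seen.add(doi.lower())
            cleaned.append(doi)

    queries = []
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned[start:start + batch_size]
        joined = " OR ".join(f'"{doi}"' for doi in batch)
        queries.append(f"DO=({joined})")
    return queries


# Placeholder DOIs, for illustration only.
queries = build_wos_doi_queries(
    ["10.1000/a", "10.1000/b", " 10.1000/B ", "10.1000/c"], batch_size=2
)
for query in queries:
    print(query)
```

Articles whose DOIs return no record are the ones treated as not WoS indexed.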
A total of 1,086 published articles used data from the BioLINCC repository. These publications drew on 79 (40.72%) of the datasets; the remaining 115 (59.28%) datasets did not have any publications associated with them. The dataset for the Atherosclerosis Risk in Communities Study (ARIC) had the highest number of associated publications, with 162 (15%), followed by the Framingham Heart Study-Cohort (FHS-Cohort) with 94 (8.7%) and the Cardiovascular Health Study (CHS) with 82 (7.6%). A total of 162 (14.9%) publications used more than one dataset (Table 1). Of the 1,086 published articles, 987 (90.88%) were WoS indexed. All articles were published in English (see underlying data7). The first publication using BioLINCC open data (i.e., secondary publication) was from 2002. Since then, the number of publications has steadily increased, as shown in Figure 2, and peaked in 2018 with a total of 138 publications. The 99 (9.12%) articles that were not indexed were distributed over the years, with the largest share (42 articles) published in 2018.
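The proportions reported above can be re-derived from the raw counts. The short check below uses only figures stated in the text; the helper name `pct` is our own.

```python
def pct(part, whole, digits=2):
    """Percentage of whole represented by part, rounded to digits places."""
    return round(100 * part / whole, digits)


# Dataset selection: listed studies minus those excluded.
listed, stored_elsewhere, specimens_only = 205, 4, 7
included = listed - stored_elsewhere - specimens_only

datasets_with_pubs, datasets_without_pubs = 79, 115
articles, wos_indexed, multi_dataset = 1086, 987, 162

print(included)                              # datasets included in the analysis
print(pct(datasets_with_pubs, included))     # share of datasets with publications
print(pct(datasets_without_pubs, included))  # share of datasets without publications
print(pct(wos_indexed, articles))            # share of articles indexed in WoS
print(pct(articles - wos_indexed, articles)) # share of articles not indexed
print(pct(multi_dataset, articles, 1))       # share using more than one dataset
```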
The 987 open data publications received a total of 34,181 citations from 27,904 published articles up to 1st October 2019. The average number of citations per item for the publications using BioLINCC data was 34.63. The total number of citations received per year by publications using BioLINCC data increased from only 2 citations in 2002 to a peak of 4,361 citations in 2018 (Figure 3).
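The average citations per item is the total citation count divided by the number of indexed publications. A minimal check using the two figures given above:

```python
total_citations = 34181     # citations received up to 1st October 2019
indexed_publications = 987  # WoS-indexed publications using BioLINCC data

# Mean citations per publication, rounded to two decimal places.
citations_per_item = round(total_citations / indexed_publications, 2)
print(citations_per_item)
```

Because citation counts are typically skewed, a mean of this kind can be driven by a small number of highly cited articles.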
A total of 352 (35.66%) of the published articles were related to cardiac and cardiovascular systems, 106 (10.74%) to general internal medicine, and 92 (9.32%) to public and occupational health. Figure 4 shows the 10 most common fields of the publications using BioLINCC data. The American Journal of Cardiology had the highest number of publications using BioLINCC data (60; 6.08%), followed by the International Journal of Cardiology with 47 (4.76%) and the American Journal of Medicine with 25 (2.53%). Table 2 shows the top 10 journals in which publications using BioLINCC data appeared. Authors from the US participated in 842 (85.31%) of the publications using BioLINCC data, followed by authors from Canada and England, with 121 (12.26%) and 81 (8.21%), respectively (Figure 5). The top three affiliations in terms of publications using BioLINCC data were the University of Alabama at Birmingham, the University of California system, and Harvard University, as shown in Table 3.
Since its establishment, BioLINCC has made a tremendous effort to prepare datasets for use as open data, and hundreds of studies have been published using BioLINCC open data6. Although the majority of datasets did not yield further publications from re-use, many of the datasets had a high number of publications. The citations of publications using BioLINCC data have increased dramatically. They received a total of 2361 citations in the year 2018. Cardiology is the main field, with more than a third of the publications being cardiology related, which is expected, as the datasets come from the National Heart, Lung, and Blood Institute. The top two journals publishing articles using BioLINCC data are also cardiology journals.
In an analysis done in 2017, Coady and colleagues examined the administrative records of investigator requests for BioLINCC data and found that 35% of clinical trial datasets were associated with at least one publication within five years of public release8. Our findings also showed that the majority of datasets deposited in the BioLINCC repository were not associated with secondary publications. A previous survey of researchers who requested datasets from BioLINCC showed that the majority requested the data to conduct an independent research project8. Moreover, Ross et al. found in their survey that the majority of requests to the BioLINCC repository were made by early career researchers. Although we previously pointed to the importance of open access data for underfunded and early career researchers2, our results showed that the top users of open access data were from developed countries. This might be related to the fact that the data deposited and made open are from the USA. Research performed using open access data can have important impact. An example is the post-hoc analysis of the Digitalis Investigation Group trial using the open data of the original trial9, which showed that digoxin therapy is associated with an increased risk of death from any cause among women, but not men, a finding that the original study did not report. The digitalis trial is an example of how cardiology researchers are using open data, with cardiology initiatives encouraging data sharing and use by cardiology researchers10. Clinical trial data sharing in cardiology has also been used to validate the reproducibility of published results11.
The high number of citations received by publications using the BioLINCC shared datasets might be related to the regulations of the National Institutes of Health, which mandated that data collected by studies receiving more than $500,000 be stored in a publicly available repository, with BioLINCC being the main repository for research funded by the National Institutes of Health - National Heart, Lung, and Blood Institute (NIH-NHLBI)12. On the other hand, data shared through platforms other than BioLINCC may lack a sufficient description of the shared data, which hampers their use by other researchers13.
Several limitations need to be considered when interpreting the results of the current study. Our results are based on the BioLINCC repository, where data from well-funded research projects undergo extensive processing before being publicly shared, resulting in well-curated, high-quality data. Further studies should evaluate data repositories that do not apply such pre-sharing processing. In addition, we used the WoS database for data extraction and analysis, which might not include several studies done using open access data from the BioLINCC repository. The WoS database usually requires time to index newly accepted articles, which might lead to an underestimation of the number of WoS-indexed articles. Moreover, we did not compare citations received by open data publications with those received by primary data publications, which should be done in future projects. One key point that may undermine the idea of the 'impact' of the open datasets is that the original study investigators appear to be included in these counts. For example, the University of Alabama at Birmingham is a key site for some studies (e.g., CARDIA), and its researchers would be publishing from these datasets whether or not they were open in BioLINCC, so this needs to be considered when interpreting the results.
Finally, using citations as the sole metric of impact is debatable; citations may be better regarded as a metric of attention.
Harvard Dataverse: Publications that used Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) datasets. https://doi.org/10.7910/DVN/1TXA3C (reference 7)
This project contains the following underlying data:
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
References
1. Covitz W, Espeland M, Gallagher D, Hellenbrand W, et al.: The heart in sickle cell anemia. The Cooperative Study of Sickle Cell Disease (CSSCD). Chest. 1995; 108 (5): 1214-9
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Meta-research
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Meta-research
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: data science, data sharing and reuse
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
No
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
No
Competing Interests: Drs. Vorland and Brown have received research funds from the Center for Open Science.
Reviewer Expertise: Meta-research
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: data science, data sharing and reuse
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
References
1. Ross JS, Ritchie JD, Finn E, Desai NR, et al.: Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users. BMJ Open. 2016; 6 (9): e012769
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: clinical research, medical informatics
Version history:
Version 4 (revision): 18 Aug 21
Version 3 (revision): 21 Apr 21
Version 2 (revision): 28 Sep 20
Version 1: 20 Jan 20