The National Heart, Lung, and Blood Institute data: analyzing published articles that used BioLINCC open access data

Background: Data sharing is now a mandatory prerequisite for several major funders and journals, which obligate researchers to deposit the data resulting from their studies in an openly accessible repository. Biomedical open data are now widely available in almost all disciplines, and researchers can freely access and reuse these data in new studies. We aim to study the BioLINCC datasets, the number of publications that used BioLINCC open access data, and the citations received by these publications. Methods: As of July 2019, a total of 194 datasets were stored in the BioLINCC repository and accessible through its portal. We requested the full list of publications that used these datasets from BioLINCC, and we also performed a supplementary PubMed search for other publications. We used the Web of Science (WoS) database, which indexes high-quality articles, to analyze the characteristics of the publications and the citations they received. Results: In total, 1,086 published articles used data from the BioLINCC repository, drawing on 79 (40.72%) datasets; the remaining 115 (59.28%) datasets did not have any publications associated with them. Of the total publications, 987 (90.88%) articles were WoS indexed. The number of publications has steadily increased since 2002 and peaked in 2018, with 138 publications in that year. The 987 open data publications (i.e., secondary publications) received a total of 34,181 citations up to 1 October 2019. The average citation count per item for the open data publications was 34.63. The total number of citations received by open data publications per year has increased from only 2 citations in 2002, peaking in 2018 with 2,361 citations. Conclusion: The majority of BioLINCC datasets were not used in secondary publications. Despite that, the datasets that were used for secondary publications yielded publications in WoS-indexed journals, and these publications are receiving an increasing number of citations.


Introduction
Recent years have seen an increasing call for data sharing in clinical studies, especially for research funded by international and governmental agencies 1 . The call originally aimed to maximize the transparency of clinical trial results 1 , but the benefits of data sharing have extended beyond this original aim. Open access data are frequently cited as a boon for researchers, who can re-analyze already collected data to answer new research questions 2,3 . To organize and maximize the scientific use of open access data, researchers and funders store their data in open access data repositories 4 . The Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC), a National Heart, Lung, and Blood Institute (NHLBI) repository, is one such data repository, initiated in 2000 with the aim of sharing data from observational and interventional studies supported by the institute 5 . The impact of open access data, in terms of the number of datasets used from a repository, the publications generated from these datasets, and the citations received by these publications, is still unknown. In this study, we aim to study the BioLINCC datasets, the number of publications that used BioLINCC open access data, and the citations received by these publications.

Data collection
A total of 205 studies are listed on the BioLINCC data repository. Of these, four studies have their data stored in other repositories, and seven studies have only specimens available at the BioLINCC institution upon request, with no datasets associated with them. We only included datasets stored in the BioLINCC repository that can be accessed through its portal, which comprises 194 datasets (Figure 1).
We also contacted BioLINCC support to obtain an up-to-date list of published articles that used BioLINCC datasets, and we received a list of all publications up to 24 July 2019. This list might not reflect the total publications of 2019, as the whole year was not included. Researchers accessing the BioLINCC datasets are requested to disclose any publication resulting from the use of the BioLINCC datasets. BioLINCC also lists published articles that used its datasets on its website (https://biolincc.nhlbi.nih.gov/publications/). A manual search of PubMed was also carried out on 25 July 2019 to confirm an updated full list of publications, as follows:
• We used the basic search of PubMed by inputting the title of the dataset in the search field (e.g., Cooperative Study of Sickle Cell Disease or CSSCD), in order to retrieve results that mention the dataset in the title, abstract, or keywords. It is important to note here that each dataset available on the BioLINCC repository had its own acronym.
• The retrieved articles were manually screened by one of the authors (SAA) to check whether the dataset was used in the study to generate results; authors typically state the name and acronym of the dataset used either in the methods section, usually with a specific citation to the relevant study, or in the acknowledgment section of their articles. The included articles either used data stored in the BioLINCC repository alone or used these datasets along with datasets from other repositories.
• We added the retrieved articles to the original list provided by BioLINCC.
• We analyzed the number of studies published using each dataset (supplementary material).
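The dataset-based PubMed query described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual search procedure; the function name and the use of the `[Title/Abstract]` field tag are our assumptions.

```python
# Illustrative sketch of the PubMed search strategy described above:
# each BioLINCC dataset's full title and its acronym are combined into
# one query so that articles mentioning either form are retrieved.
# The [Title/Abstract] field tag and function name are assumptions,
# not the authors' documented procedure.

def build_pubmed_query(title: str, acronym: str) -> str:
    """Combine a dataset title and its acronym into a single PubMed query."""
    return f'"{title}"[Title/Abstract] OR "{acronym}"[Title/Abstract]'

query = build_pubmed_query("Cooperative Study of Sickle Cell Disease", "CSSCD")
print(query)
```

Each retrieved record would then still require the manual screening step described above, since a title or abstract mention does not guarantee that the dataset was actually re-used to generate results.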

Bibliometric analysis
We used the Web of Science (WoS) database to analyze the characteristics of the included publications. We prepared a list of digital object identifiers (DOIs) for the included articles and inputted this list into the WoS advanced search field; only the WoS-indexed publications among the included articles were analyzed further. The WoS database has built-in analysis tools that provide data on the number of publications using the included datasets per year (yearly publications), topic of publication, affiliations of authors, and number of citations received 6 .
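The DOI-based advanced search described above can be sketched as follows. This is a hypothetical illustration: `DO=` is the WoS advanced-search field tag for DOIs, but the chunking, chunk size, and function name are our assumptions rather than the authors' documented workflow.

```python
# Illustrative sketch of preparing WoS advanced-search queries from a
# DOI list. DOIs are quoted, OR-joined, and wrapped in the DO= field
# tag, chunked to keep each query string manageable. The chunk size is
# an arbitrary assumption, not part of the authors' method.

def build_wos_doi_queries(dois, chunk_size=50):
    """Format a list of DOIs into WoS advanced-search query strings."""
    queries = []
    for i in range(0, len(dois), chunk_size):
        chunk = dois[i:i + chunk_size]
        joined = " OR ".join(f'"{d}"' for d in chunk)
        queries.append(f"DO=({joined})")
    return queries

queries = build_wos_doi_queries(["10.1000/a1", "10.1000/a2", "10.1000/a3"],
                                chunk_size=2)
for q in queries:
    print(q)
```

Articles whose DOIs return no WoS record would fall into the non-indexed group reported in the results.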

Amendments from Version 3
The reviewers, again, performed an in-depth assessment and provided valuable points to be amended and improved, which we followed point by point. We edited Figure 1, correcting the text within the figure as suggested by the reviewers. They also suggested improvements in the dataset. In this regard, we uploaded a separate codebook detailing the dataset. We hope the manuscript in its current improved version satisfies, to a certain degree, their expectations.
Any further responses from the reviewers can be found at the end of the article.

The average citation count per item for the publications using BioLINCC data was 34.63. The total number of citations received by publications using BioLINCC data per year has increased from only 2 citations in 2002 to a peak of 4,361 citations in 2018 (Figure 3).
A total of 352 (35.66%) of the published articles were related to cardiac and cardiovascular systems, 106 (10.74%) to general internal medicine, and 92 (9.32%) to public and occupational health. Figure 4 shows the 10 most common fields in which the publications using BioLINCC data were published. The American Journal of Cardiology had the highest number of publications using BioLINCC data (60; 6.08%), followed by the International Journal of Cardiology with 47 (4.76%) and the American Journal of Medicine with 25 (2.53%). Table 2 shows the top 10 journals in which publications using BioLINCC data appeared. US authors participated in 842 (85.31%) of the publications using BioLINCC data, followed by Canadian and English authors with 121 (12.26%) and 81 (8.21%), respectively (Figure 5). The top three affiliations in terms of publications using BioLINCC data were the University of Alabama at Birmingham, the University of California system, and Harvard University, as shown in Table 3.

Discussion
Tremendous effort has been made by BioLINCC in preparing datasets to be used as open data since its establishment, and hundreds of studies have been published using BioLINCC open data 6 . Although the majority of datasets did not yield further publications from re-use, many of the datasets had a high number of publications. The citations of publications using BioLINCC data have increased dramatically, reaching a total of 2,361 citations in the year 2018. Cardiology is the main field, with more than a third of the publications being cardiology related, which is expected, as the datasets originate from the National Heart, Lung, and Blood Institute. The top two journals publishing articles using BioLINCC data are also cardiology journals.
In an analysis done in 2017, Coady and colleagues analyzed the administrative records of investigator requests for BioLINCC data and found that 35% of clinical trial data were associated with at least one publication within five years of the data's public release 8 . Our findings also showed that the majority of datasets deposited in the BioLINCC repository were not associated with secondary publications. A previous survey of researchers who requested datasets from BioLINCC showed that the majority of researchers requested the data to conduct an independent research project 8 . One such secondary analysis re-used the digitalis trial data 9 , which showed that digoxin therapy is associated with an increased risk of death from any cause among women, but not men, a finding that the original study failed to find. The digitalis trial is an example of how cardiology researchers are using open data, with cardiology initiatives encouraging data sharing and use by cardiology researchers 10 . Clinical trial data sharing in cardiology has also been used to validate the reproducibility of published results 11 . The high number of citations received by publications using the BioLINCC shared datasets might be related to the regulations of the National Institutes of Health, which mandated that data collected by studies receiving more than $500,000 be stored in a publicly available repository, with BioLINCC being the main repository for research funded by the National Institutes of Health - National Heart, Lung, and Blood Institute (NIH-NHLBI) 12 . On the other hand, data shared on platforms other than BioLINCC may lack a sufficient description of the shared data, which hampers their use by other researchers 13 . Upon interpreting the results of the current study, several limitations need to be considered. Our results are based on the BioLINCC repository, where data from well-funded research projects undergo extensive processing before being publicly shared, resulting in well-curated, high-quality data.
Other studies should evaluate data repositories that do not have this pre-sharing processing. Another point is that we used the WoS database for data extraction and analysis, which might not include several studies done using open access data from the BioLINCC repository. The WoS database usually requires time to index newly accepted articles, which might lead to an underestimation of the number of WoS-indexed articles. Moreover, we did not compare the citations received by open data publications with those received by primary data publications, which should be carried out in future projects. One key point that may undermine the idea of the 'impact' of the open datasets is that the study investigators appear to be included in these counts. For example, the University of Alabama at Birmingham is a key site for some studies (e.g., CARDIA), and thus they would be publishing from their datasets whether those datasets were open in BioLINCC or not, so this needs to be considered upon interpreting the results. Finally, using citation counts as the sole metric for impact is a debatable issue, but they can be better used as a metric for attention.

Heyam F. Dalky
College of Nursing, Community and Mental Health Nursing Department, Jordan University of Science and Technology, Irbid, Jordan
This is an interesting paper about the impact of data sharing using the Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC).
The authors in this manuscript analyzed previously published reports/studies that used the BioLINCC openly accessible datasets. The authors obtained their dataset mostly by asking the BioLINCC support team to provide up-to-date data, and they supplemented the provided dataset with a manual search. While the manuscript was not meant to be an exhaustive study analyzing secondary publications of open data and cannot be generalized to all secondary articles published using open data, the study provides a good overview of secondary articles published using one of the highest quality open data repositories.
The manual search conducted by the authors needs to be further detailed, preferably through the use of a PRISMA diagram.
The study assessed biomedical research, so the database more conveniently used for bibliometric analysis might be PubMed. I would advise the authors to consider the PubMed database, in addition to Web of Science, in future projects concerning biomedical literature.
As previous reviewers stated, the authors need to make sure to clarify that the study is a descriptive study and they cannot overestimate the impact of its results. The authors should work in the future on a larger project to analyze other openly accessible datasets to compare with the current results.
The authors stated that the "Dataset for the Atherosclerosis Risk in Communities Study (ARIC) had the highest number of publications associated with it 162 (15%), followed by Framingham Heart Study-Cohort (FHS-Cohort) with 94 (8.7%), and Cardiovascular Health Study (CHS) with 82 (7.6%)." -I noticed that some major trials are deposited as multiple fragmented datasets, so it is important to consider clarifying if the authors combined such fragments when they assessed most commonly used datasets.
The authors also stated that, "The first publication using BioLINCC open data (i.e., secondary publication) was from 2002." -it would be better to cite the publication meant by this statement.
From the viewpoint of the reviewer, the manuscript is prepared with full attention to detail. The authors have done great efforts in presenting and comparing the data following a logical and understandable illustration. The figures enclosed make it easier for the reader to track the data and the relevant discussion.
The work reflects highly impressive efforts in compiling data in a constructive way and presenting the data in the corresponding tables. The authors complied with the reviewers' comments and considered them with attention and caution. The manuscript in its current status is highly recommended for indexing.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes

The methods for assessing which studies were classified as "open data" or "secondary publications" versus "primary data publications" are not provided. This distinction seems odd and likely undefinable for cohorts; perhaps secondary analyses of RCTs could be identified, though.
Data formatting and documentation is still incomplete. It is good to see a form of data dictionary, but it includes typographical and other errors (e.g., Medical Subbect Heading), is not in any standard format, and is incomplete. Given this manuscript is on data sharing, the authors should use best practices themselves (see F.A.I.R. practices, for instance).

○ Using cell formatting for additional information is bad practice. When opening the file using the previewer as a '.tab' file, as referenced in the manuscript for instance, bold formatting is stripped, and that information is lost.
○ Non informative missingness throughout: empty cells without justification or explanation.

○ The authors mentioned a couple of times in their reply that the repository will not allow editing. If no versioning is possible, then a new repository should be made with corrected information.

○ There may be other concerns that we did not identify, but these were the most salient in terms of understanding what the authors did. Our general recommendation is to encourage the authors to fully and clearly disclose methods, processes, operationalization of variables, and outcomes to ensure reproducibility and transparency.

We thank the authors for responding to our comments. We still have several outstanding concerns.
Specifically, we thank the authors for clarifying about the 9% of articles not indexed in WoS.
Publishing the full results of the distribution of WoS indexing over time (even if like in Figure 2) would aid interpretation. Given that 2018 was the last full year included, it seems to support the point that your analysis is underestimating papers using BioLINCC in recent years because there may be a delay in WoS indexing. This should be listed as a limitation in the discussion section.

We re-emphasize that conclusions about the "impact" of BioLINCC data are not appropriate. It is possible that, relatively speaking, there has been no increase in the use of BioLINCC data relative to using other repositories, or compared to authors using their own datasets. Metrics of use may easily just reflect increasing trends of total publications over time. Just because a dataset is included in BioLINCC is not an indication of an impact of BioLINCC. It is reasonable that a central repository would facilitate data sharing and use, but this has not been shown in this descriptive analysis. Without an appropriate comparator group, the results are descriptive, and the interpretation is limited to descriptions, not of impact.
Regarding the data: The dataset still lacks a codebook as far as we could find. What do the authors mean by "open data publications and primary data publications"?
Regarding Alabama: please confirm whether it should be the University of Alabama System or University of Alabama at Birmingham in the figure and text.
We note the authors included a sentence directly from our review: "For example, the University of Alabama at Birmingham is a key site for some studies (e.g., CARDIA), and thus they would be publishing from their datasets whether they were open in BioLINCC or not". While we are glad the authors took our concerns to heart, we are not sure what to think about our sentence being lifted directly.
Number of citations in text (2361) does not match figure for 2018.
Some grammatical concerns: Figure 1: Should say "Four studies' datasets"; in the text should be "comprises 194 datasets" instead of dataset; elsewhere "that used BioLINCC dataset" should be "datasets". "they were distributed over the years with the majority (i.e. 42 articles) were published in 2018" should not have the second "were". "English authors" not "England authors". Acronyms for NIH/NHLBI never established but used in discussion. Formal English avoids contractions (e.g., "didn't"); and so forth.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Meta-research
We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.
Author Response 13 Aug 2021
Saif Aldeen AlRyalat, The University of Jordan, Amman, Jordan
The reviewers again performed an in-depth assessment and provided valuable points to be amended and improved, which we followed point by point. They also suggested improvements in the dataset. In this regard, we uploaded a separate codebook detailing the dataset. We hope the manuscript in its current improved version satisfies, to a certain degree, their expectations.
We thank the authors for responding to our comments. We still have several outstanding concerns.

Specifically, we thank the authors for clarifying about the 9% of articles not indexed in WoS. Publishing the full results of the distribution of WoS indexing over time (even if like in Figure 2) would aid interpretation. Given that 2018 was the last full year included, it seems to support the point that your analysis is underestimating papers using BioLINCC in recent years because there may be a delay in WoS indexing. This should be listed as a limitation in the discussion section.
Reply: We agree with the reviewers that delayed indexing by WoS might lead to underestimation in the number of WoS indexed articles. We further clarified this in the article's limitations.

The description of methods is improved, although still not reproducible. On what day was the search performed? It is not clear from the authors' published dataset what the BioLINCC dataset titles are or how they determined exact search strings. For instance, did all datasets have acronyms like the example used in the authors' reply to our review (Cooperative Study of Sickle Cell Disease or CSSCD)? How many total results were returned and how many articles were screened manually?
Reply: Thank you for the suggestions that led to this improvement in the previous revision. The PubMed search was carried out directly on the next day (i.e., 25 July 2019), and the search results were saved and screened during the subsequent days. All datasets had acronyms, as shown in the column (studylist) and as evident on BioLINCC's own website.
In regard to the exact numbers, we did not record them at the time of the search, so they are not available for reporting. We clarified these points in the methods.
We re-emphasize that conclusions about the "impact" of BioLINCC data are not appropriate. It is possible that, relatively speaking, there has been no increase in the use of BioLINCC data relative to using other repositories, or compared to authors using their own datasets.

Metrics of use may easily just reflect increasing trends of total publications over time. Just because a dataset is included in BioLINCC is not an indication of an impact of BioLINCC. It is reasonable that a central repository would facilitate data sharing and use, but this has not been shown in this descriptive analysis. Without an appropriate comparator group, the results are descriptive, and the interpretation is limited to descriptions, not of impact.
Reply: We agree with the reviewer on the importance of not overestimating the results of our study. In the previous revision, we tried to emphasize this point in the limitations. Now, we have further reviewed the study to explicitly replace words like "impact" with other appropriate words that reflect the descriptive nature of this study, including the word "impact" in the title.

Regarding the data: The dataset still lacks a codebook as far as we could find. A codebook includes descriptions of each variable name in the file so others can interpret what each column is (for example, what is 'recid'…).

Reply:
The details about what each column title reflects were already provided in the "Notes" section on the Dataverse website. However, we agree with the reviewer that a more explicit and detailed codebook was still needed, so we uploaded one that can be downloaded from the Dataverse website.

It is not clear what is meant by "title of the dataset in the search field". Does this mean the study name? The study acronym? It seems likely that many publications would not use the exact dataset name in the title or abstract, and this approach would therefore potentially miss papers. Were any new papers found beyond the BioLINCC list from the PubMed search? How many articles from the BioLINCC list were not confirmed in the PubMed search? This information is missing from the methods.

Results
It is indicated that over 9% of the articles using data from BioLINCC are not WoS indexed. If these articles are not evenly distributed over time, then they will skew the results of the trends. For example, were the papers not indexed more recent papers that WoS has not yet picked up? At minimum, the authors can manually extract the year, journal, and country of publication from these papers to include them in the assessments.
The utility of the analyses as currently presented seems questionable. How does the increase in articles published and citations by year compare to trends in overall metrics of these measures? i.e., do these trends outpace or just reflect the growth of scientific publishing overall? The authors may also consider limiting such comparison to the specific fields that use BioLINCC data.
The authors note the values 'peaked in 2018', but that was the most recent year of full data, given their partial year in 2019. Thus, 2019 is likely artificially small by virtue of it being a partial year.
The University of Alabama at Birmingham is part of the University of Alabama System, and thus counting them separately does not seem to make sense.
It is unclear how fields of study were determined. Were these just extracted from WoS (is this "topic of publication" per methods or a separate extraction), or did the authors classify them? Regardless, the finding that cardiology is the top field is not surprising, given that BioLINCC is from the National Heart, Lung, and Blood Institute. This should be made clear.
One key point that may undermine the idea of 'impact' of the open datasets is that the study investigators appear to be included in these counts. For example, the University of Alabama at Birmingham is a key site for some studies (e.g., CARDIA), and thus they would be publishing from their datasets whether they were open in BioLINCC or not. So, what is the incremental contribution to investigators who are not part of the cohort? What difference is it making for how many papers would be published if the data were open or not?

Discussion
In general, the discussion does not seem to flow logically. For example, in one paragraph, the authors discuss the percent of publications after data release, the top countries from which BioLINCC data are used and top journals, and then a single example of clinical impact from using BioLINCC data. The points in the discussion should be separated and connected to the purpose of the study. New results (e.g., impact factor) should not be introduced in the discussion. Further, have there been other studies that have examined these or related questions about BioLINCC or other repositories?
"The impact of these publications can be measured in terms of citations received, where citations of publications using BioLINCC data have exponentially increased" Exponential growth is a specific mathematical term whereas the growth in the figures appears to be roughly linear.

Data
We downloaded and inspected the data: There is no data dictionary to interpret the dataset.
○ 'Recid' starts at 4 and not 1. Some 'Recid's are missing (for example, #5, #7). Were these entries those that were not indexed by WoS? Those DOIs would still be useful to include in the dataset so future researchers can use them.

○ Were theses and other article types included in all analyses (include this information in the methods)?
○ There are missing data (e.g., funding; MESH terms; article types; study type; one publication was missing 'study list').

○ The authors state that they searched WoS by DOI, and yet DOIs are missing from some entries. How was this accounted for in the analysis? Are the missing DOIs counted as part of 'not indexed in WoS'?

General
The writing is generally clear, but it could benefit from a grammatical edit in some passages.

If applicable, is the statistical analysis and its interpretation appropriate? Not applicable
Are all the source data underlying the results available to ensure full reproducibility?

Are the conclusions drawn adequately supported by the results? No
Competing Interests: Drs. Vorland and Brown have received research funds from the Center for Open Science.

Reviewer Expertise: Meta-research
We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 08 Apr 2021
Saif Aldeen AlRyalat, The University of Jordan, Amman, Jordan I went through the manuscript and amended and responded to all comments. Here are the responses.

Reviewer Colby Vorland and Andrew Brown
It is an honor to receive feedback from Drs. Vorland and Brown from Indiana University. We performed almost all the suggested changes, and we hope the current version satisfies the required quality. Here are the detailed responses.

Summary:
The authors ask an interesting question as to what the impact of BioLINCC has been on the use of open data. However, the assessments of impact do not seem to appropriately contextualize the use of BioLINCC datasets as compared to growth of scientific publishing overall. Further, the authors include data in their analyses before the existence of BioLINCC, and the methods used to sample are unclear.
Response: Thank you. The study is mostly descriptive of the studies published using datasets stored in the BioLINCC repository, with a bibliometric analysis of these studies. We believe that such an analysis will show the impact of open data and will encourage authors to further share their data publicly. We agree with the reviewer that the current analysis lacks a comparison with the growth of the overall scientific literature, but we will consider such an analysis in the near future. BioLINCC is basically a repository to store and facilitate the sharing of data collected by National Heart, Lung, and Blood Institute funded studies; these studies and their data might predate the existence of BioLINCC but were stored in the BioLINCC repository afterward.

Response: Inputting the title and the acronym of the dataset in the PubMed search will retrieve all articles that mention the dataset in the title, abstract, or keywords. The guidelines for reporting secondary analysis articles require mentioning the dataset used in the title or abstract*. Despite that, we agree with the reviewers that our search might miss a few articles that did not mention the dataset there. We tried to limit the word count of the methods and results, which is why these details are not provided in the full manuscript. We added the results retrieved by the supplementary search directly to the original dataset provided by BioLINCC, which is provided as supplementary material.

Results
It is indicated that over 9% of the articles using data from BioLINCC are not WoS indexed. If these articles are not evenly distributed over time, then they will skew the results of the trends. For example, were the papers not indexed more recent papers that WoS has not yet picked up? At minimum, the authors can manually extract the year, journal, and country of publication from these papers to include them in the assessments.

Response:
We analyzed the non-indexed articles manually to check whether they were published in 2019, which, if so, might reflect a delay in indexing. We found that they were distributed over the years, with the majority published in 2018. We could not perform a detailed analysis, as these articles could not be analyzed using the WoS database, so we clarified this in the results: "For the 99 (9.12%) articles that were not indexed, they were distributed over the years with the majority (i.e. 42 articles) were published in 2018."

The utility of the analyses as currently presented seems questionable. How does the increase in articles published and citations by year compare to trends in overall metrics of these measures? i.e., do these trends outpace or just reflect the growth of scientific publishing overall? The authors may also consider limiting such comparison to the specific fields that use BioLINCC data.
Response: Thank you for this important point. While we did not compare against the overall publishing trend in the field, we tried to show the increase in the number of publications using open access data in each field. The use of a specific dataset may not be restricted to the field of the dataset itself, as a dataset that was originally cardiovascular might be used by researchers from other fields for other ideas. As an example, radiological images in ACCESS datasets were used several times for radiology publications.
The authors note the values 'peaked in 2018', but that was the most recent year of full data, given that their 2019 data covered only part of the year. Thus, the 2019 figure is likely artificially small by virtue of it being a partial year.

Response: We agree with the reviewer, so we made this point clear in the methods.
The University of Alabama at Birmingham is part of the University of Alabama System, and thus counting them separately does not seem to make sense.

Response:
We corrected this according to the reviewer's suggestion. The WoS database lists both as separate affiliations, which led to the confusion.
It is unclear how fields of study were determined. Were these just extracted from WoS (is this the "topic of publication" per the methods, or a separate extraction), or did the authors classify them? Regardless, the finding that cardiology is the top field is not surprising, given that BioLINCC is from the National Heart, Lung, and Blood Institute. This should be made clear.

Response: These are WoS-based classifications; we changed the text accordingly.
One key point that may undermine the idea of 'impact' of the open datasets is that the study investigators appear to be included in these counts. For example, the University of Alabama at Birmingham is a key site for some studies (e.g., CARDIA), and thus they would be publishing from their datasets whether they were open in BioLINCC or not. So, what is the incremental contribution from investigators who are not part of the cohort? What difference does open data make to how many papers would be published?

Response: We thank the reviewers for these important remarks. As it is difficult to perform such a discrimination in the current study, we made this point clear in the limitations section, so that readers will consider it when interpreting the results.

Discussion
In general, the discussion does not seem to flow logically. For example, in one paragraph, the authors discuss the percentage of publications after data release, the top countries from which BioLINCC data are used and the top journals, and then a single example of clinical impact from using BioLINCC data. The points in the discussion should be separated and connected to the purpose of the study. New results (e.g., impact factor) should not be introduced in the discussion. Further, have there been other studies that have examined these or related questions about BioLINCC or other repositories?

Response: We made several changes to the discussion to improve its flow. We removed some of the unrelated parts of the discussion. We also removed the part related to impact factor.
"The impact of these publications can be measured in terms of citations received, where citations of publications using BioLINCC data have exponentially increased" Exponential growth is a specific mathematical term whereas the growth in the figures appears to be roughly linear.

Data
We downloaded and inspected the data: There is no data dictionary to interpret the dataset.

Response: We added a description at the dataset website: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2F1TXA3C&version=DRA

'Recid' starts at 4 and not 1. Some 'Recid's are missing (for example, #5, #7). Were these entries the ones that were not indexed by WoS? Those DOIs would still be useful to include in the dataset so future researchers can use them.

Response:
These entries did not use BioLINCC data, so they were not included in the dataset.
Were theses and other article types included in all analyses (include this information in the methods)?

Response: Theses were not included; we added this to the methods.
There are missing data (e.g., funding; MeSH terms; article types; study type; one publication was missing 'study list'). The authors state that they searched WoS by DOI, and yet DOIs are missing from some entries. How was this accounted for in the analysis? Are the missing DOIs counted as part of 'not indexed in WoS'?

Response: Missing DOIs were added manually to the WoS search for data analysis. They were not counted as part of "not indexed in WoS". After inputting a DOI into the database, information about the study is automatically retrieved from the WoS database, so missing data in the Excel sheet will not affect the analyzed data.
1. List of published articles on the BioLINCC website.
2. Manual search of PubMed with the title of the dataset.
3. …

On the other hand, the detailed number of datasets included and excluded is of paramount importance; we used the flow chart to detail these numbers.

Comment:
The authors state in the "bibliometric analysis" section that "Any study that reported the use of the searched data set as part of its results was included in our analysis". It is not clear how the datasets were identified in the publication. Was this performed via the registration number of the underlying study in a registry (e.g., NCT number) or by the title/acronym of the dataset from the BioLINCC database? The authors should clarify how this was performed.

Response: Any author who requested BioLINCC datasets for use in a study must explicitly mention the dataset used in the methods (i.e., the name of the dataset and the acronym, if available), in addition to acknowledging BioLINCC in the acknowledgments section.
Comment: It would be important to add a statistic describing the number of publications per dataset (which may also depend on the year the dataset was published in BioLINCC). Are there many datasets without any or only very few publications? Is the majority of publications concentrated in a few datasets? This information is important because a lack of data-sharing requests may not justify the costs and resources of preparing data for sharing (e.g., de-identification, curation).

Response: Thank you for this insight. We analyzed the number of publications associated with each dataset. As the reviewer expected, there are many datasets with no publications associated with them, as well as datasets with a high number of publications. We added these results and the relevant tables, and we further discussed them in the discussion.

Comment:
One of the factors relevant to the number of publications is the year when the dataset was published in BioLINCC. A figure correlating the date of publication of the dataset with the number of publications could illustrate that. The same applies to the relation between the year of publication and the number of citations. These relationships should be worked out in the paper.

Response:
The release year varies from dataset to dataset and may change over time if a study is updated (i.e., more data released later). It was therefore difficult to study this relationship, given that specific release dates were not provided in the dataset we received from BioLINCC.
Comment: Another aspect to be considered could be the role of outliers in the statistics. Are there datasets and/or publications with a very high number of citations (e.g., more than 100)? Does the citation pattern mainly concentrate in a few outstanding datasets, or is it more evenly distributed?

Response: As the reviewer mentioned, we found several "outlier" datasets, and we mentioned them in the results. These datasets were associated with a higher number of publications compared to other datasets.

Comment:
The authors should include and discuss a cross-sectional web-based survey