Research Article

Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources

[version 1; peer review: 3 approved]
PUBLISHED 11 Feb 2016

This article is included in the Bioinformatics gateway.

This article is included in the Research on Research, Policy & Culture gateway.

This article is included in the ELIXIR gateway.

This article is included in the EMBL-EBI collection.

Abstract

Data from open access biomolecular data resources, such as the European Nucleotide Archive and the Protein Data Bank, are extensively reused within life science research for comparative studies, method development and to derive new scientific insights. Indicators that estimate the extent and utility of such secondary use of research data need to reflect this complex and highly variable data usage. By linking open access scientific literature, via Europe PubMed Central, to the metadata in biological data resources we separate data citations associated with a deposition statement from citations that capture the subsequent, long-term reuse of data in academia and industry. We extend this analysis to begin to investigate citations of biomolecular resources in patent documents. We find citations in more than 8,000 patents from 2014, demonstrating substantial use and an important role for data resources in defining biological concepts in granted patents to both academic and industrial innovators. Taken together, our results indicate that the citation patterns in biomedical literature and patents vary, not only due to citation practice but also according to the data resource cited. The results guard against the use of simple metrics such as citation counts and show that indicators of data use must not only take into account citations within the biomedical literature but also include reuse of data in industry and other parts of society by including patents and other scientific and technical documents such as guidelines, reports and grant applications.

Keywords

Data citations, Data reuse, Data repositories, Data archiving, Open data, Bibliometrics, Patent analysis, Research impact

Introduction

Open sharing of data is a well-established norm in molecular biology and the genomic sciences: protein structure datasets are released to the community after the corresponding articles are published, and many genome sequencing projects deposit sequences in public archives as soon as they are acquired. Consequently, the bioinformatics databases holding these data1 form an essential part of molecular biology research. The standardisation, organisation and careful annotation that occur when experimental data are deposited in openly accessible biomolecular resources such as the European Nucleotide Archive2 or the Protein Data Bank3 enable independent data verification and also support and encourage data reuse by the research community. The deposition of experimental data in structured archives is complemented by a long tradition of manual curation in which protein properties, biological reactions, genetic linkages and other facts from the scientific literature are further catalogued into structured reference collections such as UniProt4, RefSeq5, and OMIM6. Value-adding data resources build on, and further combine, this treasure-trove of open data and provide comprehensive coverage of biology by cataloguing model organisms, protein classes, sequence motifs, biological pathways, reactions and metabolites: to date over 1600 biological databases are reported in the Nucleic Acids Research database catalogue7.

Maintaining and updating an infrastructure to support the active collection, annotation and redistribution of data is costly and only makes sense if there is a research community that actively reuses the data. While the value of opening up data for independent validation is seen as imperative for the scientific debate8, the open datasets from molecular biology research have long been used to stimulate and test additional hypotheses that are independent of the original experiment. The aggregation and inter-linking of published datasets also forms the basis for meta-analysis, modelling or new derivative databases9–13. Hence, managing these resources in an effective and sustainable manner requires database owners and funders to understand their usage and role in scientific research, as well as their role in generation of downstream societal value, for example by contributing to the definition of intellectual property held in patent documents. Quantitative analysis of data citation in scientific articles currently lacks metrics that parallel traditional scientific article citation indices and journal impact factors. Furthermore, publishers of scientific journals rarely annotate database citations, leaving organisations such as Europe PubMed Central (EPMC) to provide routine text-mining to find citations of database identifiers in full-text articles14.

Estimating the on-going use of biological data resources by means of their citation patterns in scientific articles captures one aspect of data reuse but is challenging because data citations in the scientific literature are highly variable with few established community norms. For example, Piwowar and colleagues15 tracked the citation practices used by three life science data resources: NCBI’s GEO16, Pangaea17, and Treebase18. They manually curated data citation statements in a corpus of data-citing papers and noted that for datasets from Pangaea the norm was data citation via the reference list while for the other resources a significant proportion of the citations were made by direct mention of the unique data resource identifier in the text narrative. This variable citation practice, and the subsequent problem this poses for estimations of data usage by tracking data citations in the literature is further exemplified by Belter in a study of the data citation practices used by oceanographers19. Despite the fact that the datasets studied had unambiguous terms of use, including recommendations for citation, the citation practices observed were highly variable with most citations occurring as a direct reference in the main text of the journal article. For example, Belter found that the editions of the National Oceanographic Data Center climate data set were cited in no less than 1180 different ways within his curated literature corpus.

Despite these challenges Kafkas et al.14 have shown that text-mining database citation identifiers, i.e. the juxtaposition in the text of a database name and an appropriate accession ID, from an Open Access literature corpus within EPMC doubled the number of data citations compared to the number supplied by publishers. Subsequently they extended their study with an analysis of the supplementary material associated with the same corpus and noted that data citation practices in supplementary data files differed markedly from those observed in the main article20. For instance, supplementary files often contain long lists of database identifiers. The rank of databases when ordered by the frequency of data citations also differed in supplementary data files compared to that observed in the main articles from the same corpus.

Collectively these studies give us a general sense of the scale of data use, although the highly diverse citation practices observed caution against a naive application of data citation as a metric for research impact. Furthermore, the statistics generated by the studies described above do not discriminate between citations arising from the initial deposition and publication of a source article and subsequent secondary citations in the research literature. Nor do these studies describe the flow and indirect use of data through the web of existing bioinformatics data resources. Thus there is a need to further investigate data citations to serve as a background for development of usage metrics, guide the life-cycle management of resources and understand the flow and impact of biological data. We build on and extend earlier studies by demonstrating how primary data citations, arising from the deposition of data and its citation in the source article describing the generation of the data, can be separated from subsequent secondary data citations. In this study we focussed our attention on two of the major biomolecular databases, the European Nucleotide Archive (ENA) and the Protein Data Bank (PDB), where the high-quality curation and well-established links between an open literature resource (EPMC) and data resources allow us to dissect primary from secondary data citations. We have done this by combining accession publication data from the biomolecular resources with the citation data from EPMC in order to provide insight into the dynamics of data citations over time. We further extend our study of data citations by mining a corpus of full-text patent documents (accessed via SureChEMBL21) in order to begin to understand the downstream use of data resources in the definition of biological entities and concepts in a legal/technical commercial environment.

Methods

Sources of full-text and data accession citations used in this study

The full-text research articles used in this study were accessed from EPMC22. The content scope of EPMC covers over 25 million PubMed abstracts and 3.5 million full-text articles (see https://europepmc.org/About); each article is identified by a unique identifier (a “PMID” for abstracts and a “PMCID” for full-text). Data accession references were extracted using EPMC’s text-mining pipeline, which is based on a combination of rule-based knowledge about possible accession number structures and an empirically-determined set of contextual cues14,20,23. The pipeline is integrated into the EPMC infrastructure (http://europepmc.org/) and is used to identify instances of data citation in full-text articles on a daily basis. The data citations are made publicly available via EPMC’s APIs. When comparing research articles with patents we focused on 2014 as the most recent year available. However, because embargoed articles were still being added at the time of our study, we repeated our analysis using material from 2012 and 2013 to ensure that our comparisons were robust.

The Protein Data Bank (PDB) is the global archive of 3D structures of proteins, nucleic acids and complex assemblies. This large corpus of data (94,117 holdings in 2014) and related citations provide an extensive test set for developing and understanding data citation and access metrics (http://www.wwpdb.org/stats/deposition). We used the European site PDBe as the definitive source of deposition data, i.e. accession identifiers, deposition dates and associated PMID publication details.

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe's primary resource for nucleotide sequence information. The current size of the ENA is in excess of 2.5 petabytes, with a doubling time of approximately 20 months (see http://www.ebi.ac.uk/ena/about/statistics). We used ENA as an additional definitive source of deposition data, i.e. accession identifiers, deposition dates and associated PMID publication details.

SureChEMBL (https://www.surechembl.org/) is a publicly available, large-scale resource containing chemical annotations found in the full-text, images and attachments of patent documents21. Its data content at 28 October 2015 included more than 14 million chemically annotated full-text patent documents. In addition, it contains 130 million patent abstracts from DOCDB, the European Patent Office master documentation database with worldwide coverage containing bibliographic data, abstracts, and citations (but no full-text or images).

SureChEMBL provides full-text searching of the patent literature using a keyword-based querying functionality, complemented by a chemistry-based query engine. Our queries retrieved full-text patent documents (both applications and granted patents) written in the English language, published in 2014 by the three main patent authorities, namely the European Patent Office (EPO), the US Patent and Trademark Office (USPTO) and the World Intellectual Property Organisation (WIPO). To ensure the relevance of the retrieved patent documents to biological and life sciences, the results were filtered using the appropriate International Patent Classification (IPC; http://www.wipo.int/classifications/ipc/en/) codes, predominantly from categories A (human necessities) and C (chemistry)24. The full query was: “(ic:(A01 OR A23 OR A24 OR A61 OR A62B OR C05 OR C06 OR C07 OR C08 OR C09 OR C10 OR C11 OR C12 OR C13 OR C14 OR G01N) OR cpc:(A01 OR A23 OR A24 OR A61 OR A62B OR C05 OR C06 OR C07 OR C08 OR C09 OR C10 OR C11 OR C12 OR C13 OR C14 OR G01N) OR ecla_ec:(A01 OR A23 OR A24 OR A61 OR A62B OR C05 OR C06 OR C07 OR C08 OR C09 OR C10 OR C11 OR C12 OR C13 OR C14 OR G01N)) AND desc:the AND pdyear:[2010 TO 2014] AND pnlang:EN AND pnctry:(WO OR EP OR US)”. No further selection was carried out on the basis of patent kind (an indication of where the patent is in the review process, e.g. application stage, or granted). Patent families were identified using the simple patent family definition provided by the European Patent Office (EPO)25, and a single member of each family was selected at random as its sole representative in subsequent analysis. In total, 188,589 documents published in 2014 were retrieved and used as input for the identifier extraction process. The XML content generated by these patent selections was then mined for accession numbers using the EPMC text-mining pipeline.
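
The family-level deduplication described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the record layout, the `family_id` and `doc_id` fields and the function name `pick_family_representatives` are all assumptions, and the fixed seed is there only to make the random choice repeatable.

```python
import random
from collections import defaultdict


def pick_family_representatives(patents, seed=0):
    """Return one randomly chosen patent per simple patent family.

    `patents` is an iterable of dicts with hypothetical keys
    'doc_id' and 'family_id'.
    """
    rng = random.Random(seed)  # seeded so the selection is repeatable
    families = defaultdict(list)
    for patent in patents:
        families[patent["family_id"]].append(patent)
    # Visit families in sorted order so the output order is stable
    return [rng.choice(families[fid]) for fid in sorted(families)]


# Hypothetical records: two families, one with three members
patents = [
    {"doc_id": "US-1", "family_id": "F1"},
    {"doc_id": "EP-1", "family_id": "F1"},
    {"doc_id": "WO-1", "family_id": "F1"},
    {"doc_id": "US-2", "family_id": "F2"},
]
representatives = pick_family_representatives(patents)
```

Each family contributes exactly one document to the downstream mining step, so a single invention filed with several authorities is not counted several times.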

Text-mining performance characteristics

The performance characteristics of the text-mining pipeline have been previously reported as 97.45% precision/59.6% recall for ENA and 94.63% precision/91.36% recall for PDB accession references when calibrated against an open access full-text corpus from EPMC14. No large-scale validation of the pipeline has been performed on the patent literature. However, manual inspection of a subset of 110 entries indicated that the approximate precision of the system was 99% and recall was 93%. Overall, then, the accuracy of the system appears to be higher when working with patents. This is possibly due to the fact that most citation-positive patents contain multiple exemplars whereas many research articles only include one, which would reduce the incidence of false negatives.
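
For readers who want a single figure for comparing the two resources, precision and recall can be combined into an F1 score (their harmonic mean). F1 values are not reported in this article; the sketch below simply derives them from the precision and recall figures quoted above for the EPMC corpus.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the text for the EPMC full-text corpus
reported = {
    "ENA": (0.9745, 0.596),
    "PDB": (0.9463, 0.9136),
}
f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, value in f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

The lower recall for ENA pulls its combined score well below that of PDB, which is worth bearing in mind when comparing raw citation counts between the two resources.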

Metadata acquisition

The EPMC metadata and text-mining results used in this study can be accessed or generated via Europe PMC’s RESTful API, which gives access to search tools with citation-count sort order and data citation features. For example, to get all the PDB citations text-mined in the articles published in EPMC in 2014, go to http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=PUB_YEAR:2014 and then for each of those get the accession identifiers (e.g. for PMID 22517515 the query is http://www.ebi.ac.uk/europepmc/webservices/rest/MED/22517515/textMinedTerms/ACCESSION). The ENA accession data used here were obtained from EMBL release 124 (described in detail at ftp.ebi.ac.uk/pub/databases/ena/sequence/release/doc/relnotes.txt). The data are public and available at ftp.ebi.ac.uk/pub/databases/embl/release/ or through the ENA Browser and REST API. We used the primary accession identifiers and deposition article PMIDs found in the flat XML files for each entry, and included all ENA data classes with the exception of the WGS (whole genome shotgun) depositions, because these are lower-level assemblies with sparse or no annotation information and so are less likely to be cited in publications.
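
The two-step lookup described above can be scripted. The sketch below only builds the request URLs, using the endpoints exactly as quoted in the text; the helper names are illustrative, and the service may have changed since the article was written, so the current Europe PMC documentation should be checked before relying on these paths.

```python
from urllib.parse import quote

# Base URL as quoted in the text above
BASE = "http://www.ebi.ac.uk/europepmc/webservices/rest"


def search_url(query):
    """URL for a search, e.g. all records published in a given year."""
    # Keep ':' unescaped so the query reads as in the text (PUB_YEAR:2014)
    return f"{BASE}/search?query={quote(query, safe=':')}"


def accession_terms_url(pmid, source="MED"):
    """URL for the text-mined accession identifiers of one article."""
    return f"{BASE}/{source}/{pmid}/textMinedTerms/ACCESSION"


print(search_url("PUB_YEAR:2014"))
print(accession_terms_url(22517515))
```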

The PDB data were obtained from the 2 September 2015 release of the PDB. PDB has a weekly release cycle that is loaded and processed by the PDBe team. The PDBe database also contains information extracted from EPMC about additional PMIDs that reference or mention any given PDB accession identifier. This information is updated once a month. Citation data were extracted from the PDBe database and included information on PMIDs that mention the PDB identifier code or cite the primary citation describing the given PDB entry. Citation data are available via the PDBe API (see the related publication call at http://www.ebi.ac.uk/pdbe/api/doc/) as well as on the individual PDBe entry pages (e.g. http://pdbe.org/3p8c and http://pdbe.org/3p8c/citations).

Data analysis

Each record in a database has a unique accession number, a release or publication date, a series of revision dates, the bibliographic details of the deposition article, subsequent references associated with the generation of the data set and a list of references citing the source. By combining the metadata within the data resource with the citation information from EPMC we could identify the citation linked to the deposition article and hence distinguish between the initial citation event associated with the deposition article or the release of the data to the public, and track the secondary citations of a data entry (or annual cohorts of data entries) over time.

The data sets associated with the generation of Figure 1, Table 1, Table 2, Table 3 and Table 4, and Supplementary material Table 1 and Supplementary material Table 2 are provided (see Data availability). More specifically, data sets containing accession identifiers, deposition_PMID, deposition year, year of first_publication, and publication year of PMID were extracted from the data resources, and corresponding accession identifiers, citation year and number of data citations in that year were extracted from EPMC.

The merged data set contained the variables: [accession_id], [deposition_pmid], [deposition_year], [first_public_year], [pmid_publication_year], [citation_year], [citations]. For records that had a [pmid_publication_year] equal to a [citation_year], we reduced the corresponding [citations] count by one to remove the impact of the deposition citation. We then tabulated total citations for [first_public_year] (or [pmid_publication_year]) against accession/source article [citation_year].
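
The adjustment and tabulation steps above can be sketched in a few lines. The analysis itself was done in STATA; this Python version only illustrates the described logic on made-up records, and the function name `tabulate_secondary_citations` is hypothetical. Field names follow the variables listed above.

```python
from collections import defaultdict


def tabulate_secondary_citations(records, cohort_field="first_public_year"):
    """Cross-tabulate citations by cohort year and citation year.

    Citations recorded in the year the deposition article was published
    are reduced by one to discount the deposition citation itself.
    """
    table = defaultdict(lambda: defaultdict(int))
    for rec in records:
        citations = rec["citations"]
        if rec["pmid_publication_year"] == rec["citation_year"]:
            # Clamping at zero is an assumption added for safety
            citations = max(citations - 1, 0)
        table[rec[cohort_field]][rec["citation_year"]] += citations
    return {cohort: dict(years) for cohort, years in table.items()}


# Made-up records for illustration only
records = [
    {"accession_id": "A1", "deposition_pmid": 111, "first_public_year": 2009,
     "pmid_publication_year": 2009, "citation_year": 2009, "citations": 1},
    {"accession_id": "A1", "deposition_pmid": 111, "first_public_year": 2009,
     "pmid_publication_year": 2009, "citation_year": 2011, "citations": 2},
    {"accession_id": "A2", "deposition_pmid": 222, "first_public_year": 2009,
     "pmid_publication_year": 2010, "citation_year": 2011, "citations": 3},
]
result = tabulate_secondary_citations(records)
```

Passing `cohort_field="pmid_publication_year"` gives the source-article view instead of the accession view.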

Our data analysis was carried out using the STATA 12 package (http://www.stata.com/products/).

Dataset 1. Raw data for ‘Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources’, Bousfield et al., 2016.
README.txt contains an index to the accompanying datasets: definition of the data fields is given along with a short STATA do routine.

Results

Secondary citation of data from biomolecular resources

To establish a baseline, we used citations of accession identifiers captured by the EPMC text-mining pipeline to provide a comprehensive picture of the annual data citation characteristics for ENA26, UniProt4, PDBe3, OMIM6, RefSNP, RefSeq5, Pfam27, InterPro28, Ensembl29, and ArrayExpress30.

In 2014, the ENA, PDB, and RefSNP accounted for 42.6%, 21.9% and 21.7% respectively of the total text-mined citations (Table 1). These proportions remained approximately constant throughout the sampled periods and hence provide a reference for comparison with the patent corpus below. In the Kafkas et al. study14 the corresponding percentages for a cohort of 486,472 articles published between 1990 and 2012 were 56.5%, 19.9% and 13.8%. We believe that the differences in these percentages can be attributed to the age structures of the two data corpuses, with the Kafkas set providing a more longitudinal view hence favouring well-established repositories such as ENA.

Table 1. Annual total accessions mined in Europe PMC full-text content published between 2012 and 2014, e.g. 7016 articles in 2014 contained 37,767 references to ENA accessions.

Acc/Art is average accession references per article.

Repository    Accessions 2012  Accessions 2013  Accessions 2014  % Total 2014  Articles 2014  Acc/Art
ENA           35897            33177            37767            42.6%         7016           5.4
PDB           21198            22047            19461            21.9%         5913           3.3
RefSNP        20528            21636            19252            21.7%         3638           5.3
UniProt       2308             2766             3925             4.4%          846            4.6
OMIM          1867             2051             2847             3.2%          819            3.5
DOI           445              914              1511             1.7%          1215           1.2
RefSeq        1148             1028             1484             1.7%          451            3.3
Pfam          896              1063             1190             1.3%          420            2.8
ArrayExpress  534              569              612              0.7%          419            1.5
Ensembl       204              293              389              0.4%          116            3.4
Interpro      139              224              269              0.3%          67             4.0
Total         85164            85768            88707            100%          20920          4.2

Unsurprisingly, given the breadth of the biomedical literature, data citations of individual biomolecular resources are relatively infrequent in EPMC: for ENA, the proportion of citing papers in 2014 is 7,016/319,815, or 2.2%, and for PDB, 5,913/319,815, or 1.8%. Collectively the investigated databases are referenced in 6.5% of our EPMC sample (the EPMC search: “pub_year:2014 in_epmc:y”, conducted 20 Oct 2015, retrieved 319,815 articles).
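
These proportions follow directly from the article counts in Table 1 and the size of the 2014 EPMC sample. A small worked check, using only figures quoted in the text:

```python
EPMC_2014_ARTICLES = 319_815  # articles retrieved by the EPMC search quoted above

# 2014 citing-article counts from Table 1 (the last entry is the Total row)
citing_articles = {"ENA": 7_016, "PDB": 5_913, "All repositories": 20_920}

shares = {
    name: 100 * count / EPMC_2014_ARTICLES
    for name, count in citing_articles.items()
}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of the 2014 sample")
```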

Estimates of secondary data citation in the scientific literature, whether measured via citation of an accession identifier in the article text or reference list (e.g. “1fho” or “doi: 10.2210/pdb1fho/pdb”) or via citation of the corresponding deposition article (e.g. “Blomberg et al.31”), need to make a distinction between citations that arise from the original act of data deposition and those that arise from the secondary citation of data. A further distinction, not investigated systematically here, could also be made according to whether the article citations come from one or more of the original author group, as above, or from an independent research group. The former practice would appear to be quite common for ENA depositions (D. Bousfield, unpublished observations). While this distinction seems straightforward in principle, different policies and deposition practices, as well as ambiguity of author names, make it difficult to distinguish these alternatives in a large-scale analysis. We note that the adoption of ORCID within publication workflows will support future disambiguation.

Combining metadata stored in EPMC and the data resources allowed us to build up a picture, based on the summation of individual data elements, of how annual cohorts of accessions and deposition articles are cited over time (see Table 2). For example the PDB accession 2jhr that refers to the crystal structure of myosin-2 motor domain in complex with ADP-metavanadate and pentabromopseudilin was made public on 13 January 2009. The corresponding article for this deposition is PMID:19122661, entitled “The mechanism of pentabromopseudilin inhibition of myosin motor activity”, published later in 200932. At the time of our study, the deposition article had been cited a total of 9 times during the period 2009–2014 (the current list of citing articles can be found in EPMC using the query: cites:19122661_med). None of these papers cite the data accession identifier 2jhr. However, the database record was cited once by its accession identifier in PMID:21841195 (PMCID:3186370), “Structural basis for the allosteric interference of myosin function by reactive thiol region mutations G680A and G680V”. The actual statement from this paper provides a good example of how data citation occurs in narrative text: “This is very unusual, as the meta-vanadate is clearly visible in known wild-type myosin-2 structures that were obtained in the presence of ADP-VO3, like e.g. PDB IDs 2JJ9, 2JHR, and 2XO8”33. The text components recognised by the text-mining pipeline as being an accession citation of 2JHR are highlighted in bold. The text-mining pipeline also found 2JJ9 and 2XO8.
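
To make the idea of a database name juxtaposed with an accession ID concrete, here is a deliberately simplified sketch applied to the sentence quoted above. It is not the EPMC pipeline: the cue pattern and the four-character PDB ID shape (a digit followed by three alphanumerics) are assumptions chosen for illustration, and a production system would need many more contextual rules to avoid false positives such as four-digit years.

```python
import re

# Assumed cue: the word "PDB", optionally followed by "ID(s)", "code(s)" or "entry/entries"
CUE = re.compile(r"\bPDB\b(?:\s+(?:IDs?|codes?|entr(?:y|ies)))?", re.IGNORECASE)
# Assumed PDB ID shape: one digit (1-9) followed by three alphanumerics
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")


def find_pdb_ids(sentence, window=40):
    """Return PDB-like IDs that appear within `window` characters after a cue."""
    found = []
    for cue in CUE.finditer(sentence):
        context = sentence[cue.end():cue.end() + window]
        for match in PDB_ID.finditer(context):
            pdb_id = match.group().upper()
            if pdb_id not in found:
                found.append(pdb_id)
    return found


sentence = ("This is very unusual, as the meta-vanadate is clearly visible in "
            "known wild-type myosin-2 structures that were obtained in the "
            "presence of ADP-VO3, like e.g. PDB IDs 2JJ9, 2JHR, and 2XO8")
ids = find_pdb_ids(sentence)
```

Requiring the cue before accepting an identifier is what keeps tokens elsewhere in the sentence from being reported as accessions.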

Table 2. PDB accession citations by annual publication cohort.

The rows show the year in which a PDB data entry was first made public. The columns denote the year in which a citation of that data accession was recorded. Thus each row displays the time-series of citations for the cohort of data entries published during a given year. Reasons why there are observations below the diagonal are discussed in the text. Mature cohorts (release years 2005–2011) were cited on average 0.21 times per accession per year.

Columns 2005 to 2014: subsequent citation of accession in EPMC.

PDB release year  New accessions published  2005   2006   2007   2008   2009   2010   2011   2012   2013   2014
2005              4,165                     79     396    675    1,093  1,189  1,243  1,273  1,362  1,199  1,036
2006              4,930                     6      93     485    998    1,211  1,290  1,274  1,253  1,300  1,074
2007              5,113                     0      6      141    768    1,181  1,292  1,309  1,302  1,337  1,208
2008              5,255                     2      1      14     179    998    1,308  1,335  1,468  1,308  1,320
2009              5,478                     0      1      0      6      206    1,024  1,355  1,408  1,402  1,270
2010              5,792                     1      1      0      2      8      254    1,123  1,495  1,483  1,432
2011              5,854                     0      1      1      4      3      5      284    1,055  1,420  1,294
2012              6,309                     1      2      2      0      1      4      12     331    1,232  1,488
2013              5,798                     0      2      2      1      5      3      0      6      339    1,145
2014              2,152                     0      0      0      0      1      0      0      0      3      176

Note that whereas each data resource by definition contains references to the complete set of deposition articles, EPMC is incomplete in its full-text literature coverage and therefore will contain only a partial set of cited accession identifiers. In addition, the text-mining process will miss some citations (false negatives) and potentially create a small number of false positives (see Methods). These factors need to be kept in mind when interpreting the results shown in Table 2 and Table 3.

Table 3. PDB source article citations by annual publication cohort.

Same format as per Table 2. Notice sustained levels of citation over time. Mature cohorts (publication year 2005–2011) were cited on average 6.73 times per source article per year.

Columns 2005 to 2014: subsequent citation of source reference in EPMC.

Source article publication year  New source articles  2005    2006    2007    2008    2009    2010    2011    2012    2013    2014
2005                             2,232                3,033   11,275  14,382  15,918  15,891  16,548  16,068  14,830  16,290  12,019
2006                             2,421                10      3,233   13,330  16,464  17,489  17,422  17,163  15,939  17,293  12,635
2007                             2,566                6       2       4,066   17,634  21,476  22,020  22,555  19,965  22,003  15,839
2008                             2,596                9       14      49      3,840   17,567  21,813  21,366  20,129  21,749  16,014
2009                             2,645                3       0       3       10      4,813   19,942  23,826  22,667  24,648  17,583
2010                             2,787                3       0       2       7       0       5,204   22,206  25,781  27,645  20,553
2011                             2,782                0       0       1       2       0       4       5,441   23,172  30,283  23,358
2012                             2,895                0       0       0       0       0       0       9       6,951   32,529  28,939
2013                             2,665                6       0       0       0       2       2       5       3       8,408   26,067
2014                             957                  1       0       0       0       0       0       0       5       39      4,659

Table 2 displays the secondary citation of PDB accession identifiers published between 2005 and 2014, subject to the above-mentioned caveats. In theory, elements below the diagonal should all be zero, as non-zero numbers imply that the accession identifier was cited before it was made public. Some of these “below the diagonal” observations may be false positives created by the text-mining process, but they also occur when the primary reference article in the database has been updated to a more recent publication. Non-zero “below diagonal” citations can also arise when authors embargo the publication of the data until after the publication of their own additional work citing the data set.
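
Below-diagonal observations are easy to isolate for manual review once the release year and citation year sit side by side. The sketch below assumes records shaped like the merged data set described in Methods; the function name and the sample values are invented for illustration.

```python
def below_diagonal(records):
    """Return records whose citation year precedes the public release year."""
    return [
        rec for rec in records
        if rec["citations"] > 0 and rec["citation_year"] < rec["first_public_year"]
    ]


# Invented records for illustration
records = [
    {"accession_id": "X1", "first_public_year": 2010, "citation_year": 2009, "citations": 2},
    {"accession_id": "X1", "first_public_year": 2010, "citation_year": 2012, "citations": 5},
    {"accession_id": "X2", "first_public_year": 2008, "citation_year": 2007, "citations": 0},
]
flagged = below_diagonal(records)
```

Each flagged record can then be checked against the possible causes listed above: a text-mining false positive, an updated primary reference, or an embargoed release.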

Table 3 shows the corresponding picture for the continuing citation of the PDB deposition articles. Some similar “below diagonal” patterns were found and attributed to use of updated primary reference articles or occasionally genuine misalignments of the underlying archives.

As can be seen from Table 2 and Table 3, the citation of PDB data accession identifiers and PDB deposition articles remains high as the annual cohorts age. The average annual citation rate for each deposition article in PDB is 6.7 and the annual average number of citations per accession identifier is 0.2. For ENA the corresponding statistics were 2.1 and 0.1 (Supplementary material Table 1 and Supplementary material Table 2 show the corresponding two data sets for ENA). In all four cases these citation rates are stable over time. It is worth noting that most ENA data depositions are not accompanied by a deposition article: 32,188,662 ENA entries in 2014 were not associated with a deposition article, as compared to the 26,384,613 entries that were associated with 9,375 source articles. It is also worth noting that the text-mining of accession numbers in EPMC only occurs in the subset of the scientific literature where full-text is available in open access resources, hence these numbers represent a lower bound on direct data citations.
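
One way to turn a cohort row into a per-accession annual rate is to divide the cohort's total citations by the number of accessions and the number of years observed. The exact averaging behind the reported figures is not spelled out in the text, so the function below is an assumed formulation. It is applied here to the 2005 PDB release cohort from Table 2 (4,165 accessions).

```python
def annual_rate(citations_by_year, cohort_size):
    """Mean citations per item per year over the observed years."""
    years = len(citations_by_year)
    return sum(citations_by_year.values()) / (cohort_size * years)


# 2005 PDB release cohort from Table 2: citations recorded in 2005-2014
cohort_2005 = dict(zip(
    range(2005, 2015),
    [79, 396, 675, 1093, 1189, 1243, 1273, 1362, 1199, 1036],
))
rate_2005 = annual_rate(cohort_2005, cohort_size=4165)
print(f"{rate_2005:.2f} citations per accession per year")
```

Under this formulation the 2005 cohort comes out close to the rounded figure of 0.2 quoted above; other choices, such as excluding the release year, would shift the value slightly.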

Biological data resources are extensively used within patents

Patents are frequently used as an indicator of broader societal value of research34–37. Importantly, it is estimated that only a small fraction of the scientific and technological innovation first reported in patents is subsequently disclosed in scientific literature sources38. During the creation of a patent it is essential to unambiguously identify the components of the invention and to provide extensive reviews of any prior art39. Thus we sought to address the question of how these requirements translate into data citation practices within patents.

Our SureChEMBL corpus of 188,589 full-text patents contained 7,923 patents with data citations (4.2% of the corpus). Data citations were most common in the description section, which usually constitutes by far the largest section of the document text. The breakdown by patent office shows that the majority of patents with data citations were from the US (see Supplementary material Table 3). The proportion of accessions found for the different repositories (Table 4) differed considerably from that of EPMC articles (see Table 1 for comparison), with RefSeq, ENA, RefSNP and UniProt dominating. The average number of cited accession identifiers per repository and per document (13.9) and the variance of these figures across the resources were also much higher than found for the full-text scientific literature corpus. Since the International Patent Classification (IPC, see Methods) is a hierarchical patent classification system, we can use its additional levels to probe the subject matter of accession-positive patents further. Figure 1 shows that patents with references to ENA and UniProt were extensively used to define biological entities in the IPC subclasses A61 (“Preparations for medical, dental or toilet purposes”), C07 (“Organic chemistry”) and C12 (“Microorganisms or enzymes”). The content profiles and scientific topics covered in the two corpuses, open access scientific publications and patent documents, are different, and further work is needed to understand how this influences data citation rates.

Table 4. Data citations mined from a 2014 SureChEMBL patent cohort.

Compare the averages with those in Table 1. Acc/Pat is the average number of accessions per patent per repository.

Repository    Accessions  % Total  Patents  Acc/Pat
RefSeq        34,634      30%      1,002    34.6
ENA           33,097      28%      4,074    8.1
RefSNP        26,206      22%      322      81.4
UniProt       14,127      12%      1,387    10.2
PDB           3,612       3%       1,093    3.3
Ensembl       1,877       2%       97       19.4
OMIM          1,769       2%       254      7.0
Pfam          1,158       1%       115      10.1
Interpro      601         1%       46       13.1
ArrayExpress  30          0%       19       1.6
Total         117,111              8,409    19
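
The Acc/Pat column and the overall figure of 13.9 quoted in the text can be reproduced from the counts in Table 4. The sketch below is a plain arithmetic check. Pooling all accessions over all patent counts gives 13.9, whereas averaging the ten per-repository ratios gives roughly 18.9, which may be what the 19 in the Total row reflects; that reading is an inference, not something stated in the source.

```python
# (accessions, patents) per repository, from Table 4
table4 = {
    "RefSeq": (34_634, 1_002),
    "ENA": (33_097, 4_074),
    "RefSNP": (26_206, 322),
    "UniProt": (14_127, 1_387),
    "PDB": (3_612, 1_093),
    "Ensembl": (1_877, 97),
    "OMIM": (1_769, 254),
    "Pfam": (1_158, 115),
    "Interpro": (601, 46),
    "ArrayExpress": (30, 19),
}

acc_per_pat = {name: acc / pat for name, (acc, pat) in table4.items()}
total_acc = sum(acc for acc, _ in table4.values())
total_pat = sum(pat for _, pat in table4.values())

# All accessions divided by the summed per-repository patent counts
pooled_average = total_acc / total_pat
# Unweighted mean of the Acc/Pat column
mean_of_ratios = sum(acc_per_pat.values()) / len(acc_per_pat)
```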

Figure 1. The prevalence of the top 5 four character IPC categories for the data set as a whole, those patents containing a data citation, and those patents having a data citation to UniProt or ENA.

Note individual patents can have several IPC annotations – these percentages are based on summing all instances, i.e. “one code, one vote”. For example, 17% of the IPC codes annotating UniProt-positive patents were A61K. Key to coding: A61K preparations for medical, dental, or toilet purposes; C12N micro-organisms or enzymes; A61B diagnosis, surgery, intervention; C07K peptides; G01N investigating or analysing materials by determining their chemical or physical properties; C12P fermentation or enzyme-using processes to synthesise a desired chemical compound; C12Q measuring or testing processes involving enzymes or micro-organisms; C07D heterocyclic compounds. Note absence of A61B from the more biological data sets, compared to the presence of C07K.
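
The “one code, one vote” tally can be sketched as follows. The patent records and their IPC annotations here are invented purely to show the counting rule; only the rule itself comes from the caption above, and the function name `ipc_shares` is hypothetical.

```python
from collections import Counter


def ipc_shares(patents, level=4):
    """Percentage share of each IPC code prefix, counting every annotation once.

    `level=4` truncates codes to the four-character level (e.g. 'A61K').
    """
    counts = Counter(
        code[:level] for patent in patents for code in patent["ipc_codes"]
    )
    total = sum(counts.values())
    return {code: 100 * n / total for code, n in counts.most_common()}


# Invented patents: a patent with several IPC annotations contributes several votes
patents = [
    {"doc_id": "P1", "ipc_codes": ["A61K 38/00", "C07K 14/47"]},
    {"doc_id": "P2", "ipc_codes": ["A61K 31/00", "C12N 15/09", "C12Q 1/68"]},
    {"doc_id": "P3", "ipc_codes": ["A61K 39/00"]},
]
shares = ipc_shares(patents)
```

Because the denominator is the number of annotations rather than the number of patents, heavily annotated patents weigh more in the resulting percentages.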

Discussion

Citation analysis is a cornerstone of research impact and evaluation and, while the use and value of citation of research papers in the scholarly literature as a metric for research is much debated, the citation practices underpinning such analysis are generally unambiguous and well established. With research funders increasingly establishing open data policies, there is a requirement for, and interest in, performance metrics that assess the reuse of open research data, whether to recognise and reward scientists, support the long-term management and sustainability of data archives or to understand the broader societal value derived from these policies. Quantitative analysis of data reuse, let alone estimating the value arising from this reuse, is challenging due to the diversity of data citation practices but also due to the many ways open research data can be used in further studies. As bioinformatics databases increasingly take on the role of dictionaries or “scientific instruments”40 we would expect that most of the use of biomolecular data resources (and consequently data reuse from these resources) is never cited, just as most literature searches, views or downloads from PubMed do not lead to a citation of the PubMed infrastructure.

This study set out to analyse data citation practices with the aim of describing secondary citation of data entries, as one indicator of data reuse, in full-text content available from EPMC (scientific papers) and SureChEMBL (patents). We focused our efforts on the major biomolecular databases, where high-quality curation processes and well-established links between literature and data resources allow us to separate citations arising from data deposition articles from the secondary citations arising from reuse of these data sets in the scientific literature. Our approach can in principle be applied to all repositories by systematically bringing together metadata from the repository and from EPMC, and is in itself a good illustration of the value that open access data and literature resources bring to the scientific community.
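
The separation of deposition citations from secondary citations can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout, the function name `split_citations`, and all identifiers are hypothetical. The idea is that repository metadata links each data entry to the article(s) describing its deposition, so any other article citing the entry is counted as a secondary citation.

```python
def split_citations(mined_citations, deposition_articles):
    """Split text-mined data citations into deposition and secondary citations.

    mined_citations: iterable of (article_id, accession) pairs found by
        text mining of full-text articles.
    deposition_articles: dict mapping accession -> set of article ids that
        the repository metadata records as describing the deposition.
    """
    deposition, secondary = [], []
    for article_id, accession in mined_citations:
        if article_id in deposition_articles.get(accession, set()):
            deposition.append((article_id, accession))
        else:
            secondary.append((article_id, accession))
    return deposition, secondary


# Hypothetical identifiers, invented purely for illustration.
mined = [("ART1", "ENTRY_A"), ("ART2", "ENTRY_A"), ("ART3", "ENTRY_B")]
metadata = {"ENTRY_A": {"ART1"}}
deposition, secondary = split_citations(mined, metadata)
```

In this sketch, an entry with no recorded deposition article (here `ENTRY_B`) has all of its citations treated as secondary. That is a simplifying assumption, and a real analysis might instead set such entries aside.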

The need to separate deposition from reuse in quantitative studies of data citation has been noted previously41, but the complexity and manual analysis required often lead investigators towards aggregate analysis of a total citation rate. For instance, in an analysis of data citation practices across fields using the commercially available Thomson-Reuters Data Citation Index42, the average citation rate for data sets in many of the studied data resources was found to be close to one, suggesting that much of the ‘data citation’ found in this analysis was driven by data deposition publications. Separating out secondary citations by tracking them over time (Table 2 and Table 3) provides one, albeit limited, indicator of the reuse of the data sets in the scientific community. In the case of the two repositories we have analysed in detail, PDB and ENA, it demonstrates long-term reuse of data sets by the community.
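
One way to track secondary citations over time is to count them by the number of years elapsed since the cited entry was released. The sketch below is an illustration only and is not taken from the study: the function name `citations_by_age`, the input layout, and the example values are assumptions.

```python
from collections import Counter


def citations_by_age(secondary_citations, release_year):
    """Count secondary citations by years elapsed since the entry's release.

    secondary_citations: iterable of (accession, citing_year) pairs.
    release_year: dict mapping accession -> year the entry was released.
    Citations of entries with no known release year are skipped.
    """
    ages = Counter()
    for accession, citing_year in secondary_citations:
        if accession in release_year:
            ages[citing_year - release_year[accession]] += 1
    return dict(sorted(ages.items()))


# Hypothetical values, invented purely for illustration.
citations = [
    ("ENTRY_A", 2010),
    ("ENTRY_A", 2014),
    ("ENTRY_B", 2014),
    ("ENTRY_C", 2012),  # no release year recorded, so this one is skipped
]
releases = {"ENTRY_A": 2009, "ENTRY_B": 2009}
age_profile = citations_by_age(citations, releases)
```

A profile with counts at higher ages would be consistent with long-term reuse, whereas a total citation rate close to one per data set would suggest that deposition citations dominate.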

Comparing the citation patterns arising from the deposition and reuse of data from ENA and PDB is instructive, as the mode of usage is very different for the two databases. While ENA is accessed directly by users on a daily basis, its more significant use is as a large reference repository that serves as the archival backend for user-focussed resources such as the genome browsers Ensembl and Ensembl Genomes29. Most of the users that access the Ensembl resource on a daily basis are likely to be unaware of the relationship between ENA and Ensembl and hence would not cite the corresponding ENA entry.

The results in this study, taken together with previous work15,19,40,43, guard against reliance on metrics based on familiar approaches developed for the analysis of scientific papers. Such simplified citation metrics do not capture the many different forms of data reuse or the heterogeneous and non-standard data citation practice in the biomedical literature. Data citation indexes also need to be developed that acknowledge that different patterns of use give different citation patterns for archival resources (e.g. PDB, ENA, GEO), reference knowledge bases (e.g. UniProt, Reactome, Human Protein Atlas), and secondary value-added resources (e.g. InterPro). Uniform quantitative indicators of data citation are inappropriate as they do not capture the usage patterns of the different resources.

Biomolecular databases also exist within a network of mutual referencing and cross-mappings: just as literature articles build upon previous scholarly work and indicate this through citation, there is a complex network of dependencies between bioinformatics databases, most of which is not visible in the primary literature44. Further work is needed to capture this usage pattern for assessments of the data journeys that occur through the extensive reuse and cross referencing of bioinformatics resources, and of the corresponding return on investment from this scientific infrastructure.

To date, investigations of data reuse have focused on the scientific literature. However, biological data resources are also extensively used by researchers in industry, and in the second part of our study we started to address the use of bioinformatics databases in patents as a broader indicator of their industrial and societal value. Patent analyses have been extensively used to understand the industry and societal benefit from publicly funded research37,45–47, and full-text patents are available from several patent offices. The practice of large-scale text-mining of molecular entities from patents is well-established in chemistry48–51. However, to the best of our knowledge, this is the first time that the usage and citation of bioinformatics data resources in the patent literature have been analysed; our initial foray into this field demonstrates significant use of these resources to define biological concepts and subject matter in patent documents. Although the majority of data citation occurs in patent classes dealing with pharmaceutical and medical inventions (drugs, diagnostics and medical devices), the data also highlight a broad applicability of biomolecular resources in bio-based industries, with usage in industrial biotechnology and consumer products, for example the definitions of enzymatic activity in washing powder.
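
Text-mining database accessions from patent full text can be illustrated with a small pattern-matching sketch. It is not the method used in the study. The accession pattern follows the format published in the UniProt documentation, while the context keywords, the window size, and the function name `find_uniprot_citations` are assumptions made for this example. Requiring a nearby database name is one simple way to reduce false positives from strings that merely look like accessions.

```python
import re

# Accession format as published by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)
# Assumed context keywords; a real pipeline would likely use a richer list.
CONTEXT = re.compile(r"uniprot|swiss-?prot", re.IGNORECASE)


def find_uniprot_citations(text, window=50):
    """Return accession-like strings that have a database name nearby.

    A match is kept only if a context keyword occurs within `window`
    characters before or after the matched accession.
    """
    hits = []
    for match in UNIPROT_ACCESSION.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        if CONTEXT.search(context):
            hits.append(match.group(0))
    return hits


# Hypothetical sentences, invented purely for illustration.
cited = find_uniprot_citations(
    "The subtilisin variant (UniProt accession P12345) was used."
)
not_cited = find_uniprot_citations("Batch P12345 was discarded.")
```

The second sentence contains a string with the right shape but no database name in its vicinity, so it is not counted as a data citation.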

Conclusion

The extensive and quantifiable reuse of data from biomolecular data resources demonstrates the critical role this infrastructure plays in life science research but also highlights the need for robust metrics of data use by the scientific community. Using the cross-referencing between literature repositories such as Europe PMC and the ENA and PDB archives, we demonstrate how data citations arising from deposition of data in an archive can be distinguished from the subsequent reuse by the scientific community. This is an important distinction in research evaluation, as the former provides an estimate of adoption of community best practice and/or compliance with open access guidelines, whereas the latter is an indicator of the value created by these practices and guidelines. The study also demonstrates that measures based on literature citation may be more or less informative according to the mode of use of a repository: large biological archives serve as foundations for other value-added resources. Individual data items from large repositories such as ENA may not be directly cited in the scientific literature but collectively form important reference collections for, e.g., pathogen detection or biodiversity research. Further work is needed to develop methods that classify and account for this mode of use, e.g. by quantifying database cross-linking via literature citation networks and identifier mapping.

By extending the analysis to patent documents, we show that biological data resources provide unambiguous definitions of biological entities for use in official documents such as patents. This shows that life science data resources transcend basic research and form a fundamental component of the digital knowledge management framework needed in a modern society. Hence, assessment of the use and value of scientific data repositories should include data from research articles, patents and perhaps other documents of record such as clinical guidelines, standards, and grant applications. Understanding how to establish robust indicators of data citation in these types of documents, in addition to research articles, remains an important challenge for further studies. The ecosystem of open literature and data resources can only be sustained if the creation of scientific and societal value can be properly assessed, and the scientific and scholarly community needs to make a concerted effort to better cite data. Similar principles can be applied to other resources such as reagents and software. Finally, we note that the insights from reviewing data citation patterns could be used to improve article-level metrics; this is also an area of further investigation.

Data availability

F1000Research: Dataset 1. Raw data for ‘Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources’, Bousfield et al., 2016. 10.5256/f1000research.7911.d113281 (ref. 52)

How to cite this article
Bousfield D, McEntyre J, Velankar S et al. Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources [version 1; peer review: 3 approved]. F1000Research 2016, 5(ELIXIR):160 (https://doi.org/10.12688/f1000research.7911.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 13 Apr 2016
Ben Johnson, Higher Education Funding Council for England (HEFCE), Avon, UK 
Approved
Those involved in thinking about the scientific and societal impact of research will know that the complexities in data sharing, citation and reuse practices often hinder us from developing a quantitative understanding of the value of data. This has implications ...
Johnson B. Reviewer Report For: Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources [version 1; peer review: 3 approved]. F1000Research 2016, 5(ELIXIR):160 (https://doi.org/10.5256/f1000research.8516.r12982)
Reviewer Report 31 Mar 2016
Timothy W. Clark, MassGeneral Institute for Neurodegenerative Disease (MIND), Massachusetts General Hospital, Boston, MA, USA 
Approved
This article is an important look at citation patterns and frequencies of some important and representative bioinformatics data resources (ENA, ePDB) in the professional and patent literature, based on textmining for accession numbers in the Euro PubMed Central Open Access ...
Clark TW. Reviewer Report For: Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources [version 1; peer review: 3 approved]. F1000Research 2016, 5(ELIXIR):160 (https://doi.org/10.5256/f1000research.8516.r12390)
Reviewer Report 22 Feb 2016
Mark Parsons, Research Data Alliance, Troy, NY, USA 
Approved
This paper advances our understanding of how data are used and referenced. It is well-written, well-referenced, and the methods are appropriate. The data are explained and seem to be available and usable. The conclusions are reasoned and sound. The paper ...
Parsons M. Reviewer Report For: Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources [version 1; peer review: 3 approved]. F1000Research 2016, 5(ELIXIR):160 (https://doi.org/10.5256/f1000research.8516.r12391)