F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.163833.1

Software Tool Article

Articles

LitSieve: An integrated literature search and triage tool for biocuration

[version 1; peer review: 1 approved with reservations]

Matt Jeffryes (1): Conceptualization, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation

Melissa Harrison (1, a; https://orcid.org/0000-0003-3523-4408): Conceptualization, Supervision, Writing – Review & Editing

Henning Hermjakob (1): Conceptualization, Funding Acquisition, Supervision, Writing – Review & Editing

Johanna McEntyre (1): Conceptualization, Funding Acquisition, Supervision, Writing – Review & Editing

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK

a mharrison@ebi.ac.uk

No competing interests were disclosed.

11 7 2025

2025

685

25 6 2025

2025

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Biomedical databases are an important part of the scientific infrastructure for organising and synergising research outputs. Many of these databases abstract content from the rapidly expanding scientific literature. Therefore, database curators require effective literature search methods in order to capture research relevant to their domain.

This article describes LitSieve, a literature search tool with filtering based on text mined annotations, and flexible article organisation features. It allows users to define filters based on biomedical entities like genes, diseases and species to include or exclude particular articles within their results. By combining a search query with a filter, curators are able to identify articles relevant to the database which they are curating. LitSieve uses APIs provided by Europe PMC, from which abstracts, article full text and text mined annotations are drawn.

LitSieve is available at https://www.ebi.ac.uk/europepmc/litsieve/

biocuration, information retrieval, text mining

European Molecular Biology Laboratory

HORIZON EUROPE Marie Sklodowska-Curie Actions

945405

MJ has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 945405.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Biomedical databases have become critical infrastructure supporting life science research. Biologists and bioinformaticians depend on databases to interpret their results. ^{1,2} Many important databases depend upon curation of the scientific literature in order to identify and extract relevant information into a structured format. When curating the literature, domain expert biocurators search and sort through scientific articles, and read those that appear relevant to their databases, focussing on the specific facts that they wish to capture, ¹ for example a reference to a particular pair of proteins interacting, or an association between a gene and a disease.

Biocurators may use biomedical literature databases to identify ‘curatable’ literature. In this work we describe LitSieve, a system building on the Europe PMC database to provide literature filtering and organisational functions designed to assist with biocuration workflows. ³ While there is a variety of literature search and organisation software already available, LitSieve provides a unique ability to filter based on a wide range of text-mined annotations.

Europe PMC

Europe PMC is a comprehensive database of life science literature. It contains abstracts from PubMed and Agricola, full text articles from PubMed Central and content from 35 life science relevant preprint servers including bioRxiv and medRxiv. The database contains a total of over 45 million articles, and the full text of the article is searchable for 10 million of those. Literature in Europe PMC is enriched by over 2 billion text-mined annotations. Annotations are references to biomedical entities or concepts such as gene/protein names and diseases, extracted from the literature using a variety of methods. In total there are 43 different categories of annotation. These entities are normalised to an entry in a database. For example, species names are normalised to the NCBI taxonomy. These annotations are made available via a public REST API. ³
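As a rough illustration of how such annotations might be requested programmatically, the sketch below builds a request URL without contacting the service. The base URL, the `annotationsByArticleIds` endpoint and the `articleIds`, `type` and `format` parameter names are assumptions based on the public Annotations API documentation and should be verified against the current Europe PMC documentation before use. The helper name `build_annotations_url` and the article identifier are placeholders invented for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of the Europe PMC Annotations API; verify before use.
ANNOTATIONS_BASE = "https://www.ebi.ac.uk/europepmc/annotations_api"


def build_annotations_url(article_ids, annotation_type=None, fmt="JSON"):
    """Build (but do not send) a request URL for the annotations of some articles.

    article_ids: iterable of "SOURCE:ID" strings (assumed identifier format).
    annotation_type: optional annotation category name (assumed parameter).
    """
    params = {"articleIds": ",".join(article_ids), "format": fmt}
    if annotation_type is not None:
        params["type"] = annotation_type
    return f"{ANNOTATIONS_BASE}/annotationsByArticleIds?{urlencode(params)}"


# Placeholder identifier, not a real citation.
url = build_annotations_url(["MED:00000000"], annotation_type="Organisms")
print(url)
```

The resulting URL could be passed to any HTTP client. The sketch stops short of sending the request so that it runs without network access.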

Biocuration tools

A number of tools to assist biocurators have been developed. PubTator permits users to search based on six types of text mined ‘bioentities’–genes, diseases, chemicals, single nucleotide polymorphisms (SNPs), species and cell lines–against the PubMed and PubMed Central databases. ⁴ Users are further able to search based on 12 types of text mined interactions between bioentities, for example drug interactions between two chemicals or causation between a SNP and a disease. It also allows users to gather articles into user defined ‘collections’.

LitSuggest uses machine learning (ML) to suggest similar articles to those selected by the user. ⁵ Articles identified by the trained model can then be marked as relevant or irrelevant to further refine the model. In the context of curation, this permits biocurators to submit a list of articles they have already curated and ideally find further ‘curatable’ articles.

Tools using large language models (LLMs) to search and summarise literature have also emerged; ⁶ however, these tools have yet to be comprehensively assessed in the context of biocuration. Given the impact of LLMs like ChatGPT on the wider technology landscape, it seems inevitable that biocurators will use LLM-based tools. However, their propensity for factual errors remains an open problem, and presents a challenge to deploying them on biocuration tasks, where statements must be reliably attributed. ⁷

LitSieve development process

Development of LitSieve began with the goal of providing an interface to Europe PMC with improved utility for biocurators. The initial concept was that curators may prefer not to use certain ML-based suggestion or recommendation systems, due to their ‘black box’ nature. ⁸ An internal survey of biocurators was conducted to understand their usage of literature search tools and the types of literature they were interested in. Possible features were discussed, and curators were observed while completing tasks. As development progressed, feedback from biocurators was incorporated into the prerelease versions at each stage.

The LitSieve system is based upon retrieval using a user-specified search query, the results of which are filtered as chosen by the user. This concept prioritises the explainability of the results, since it is clear to users exactly why a particular article has been included in or excluded from their search results: only articles retrieved by their Boolean search query are included and, of those, only articles that match all the filters are displayed. Therefore, the reason for the inclusion or exclusion of a particular article is always transparent (see Figure 1c for an illustration of a matching search result).
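The query-then-filter logic described above can be sketched in a few lines. This is a minimal illustration, not LitSieve's actual implementation: the article records, filter names and helper functions (`failed_filters`, `apply_filters`) are all hypothetical. It shows why the approach is explainable: for any hidden article, the filters it failed can be listed directly.

```python
from typing import Any, Callable, Dict, List

Article = Dict[str, Any]
Filter = Callable[[Article], bool]


def failed_filters(article: Article, filters: Dict[str, Filter]) -> List[str]:
    """Return the names of the filters that the article does not satisfy."""
    return [name for name, check in filters.items() if not check(article)]


def apply_filters(results: List[Article], filters: Dict[str, Filter]) -> List[Article]:
    """Keep only the search results that satisfy every filter."""
    return [a for a in results if not failed_filters(a, filters)]


# Hypothetical search results, assumed to be already retrieved by the query.
results = [
    {"id": "A", "species": {"mouse"}, "diseases": {"asthma"}},
    {"id": "B", "species": {"human"}, "diseases": {"asthma"}},
    {"id": "C", "species": {"mouse"}, "diseases": set()},
]
# Hypothetical named filters; every one must hold for an article to be shown.
filters = {
    "mentions mouse": lambda a: "mouse" in a["species"],
    "mentions asthma": lambda a: "asthma" in a["diseases"],
}

shown = [a["id"] for a in apply_filters(results, filters)]
print(shown)  # ['A']
print(failed_filters(results[1], filters))  # ['mentions mouse']
```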

Figure 1. The LitSieve workflow.

A literature search is performed (a), the results optionally filtered (b), and then the literature retrieved (c) which can be read and annotated (d) according to the requirements of the user.

Although ML is used to identify many of the text mined annotations used to filter, this approach reduces the scope of the ‘black box’ area of the retrieval system, making it smaller and more comprehensible. This modular, filter-based concept also enables additional filters to be developed and added, fitting within the same architecture.

System overview

LitSieve is a literature search and organisation tool designed for biocuration. It permits users to perform a standard literature search and then filter it based upon text-mined annotations. The filtering system is very flexible and accommodates a wide variety of use cases. An overview of the process of using LitSieve is shown in Figure 1.

LitSieve builds upon Europe PMC’s public articles and annotations APIs and is implemented using the Vue JavaScript framework. Searches are configured using a form and user selected parameters define which relevant articles are fetched from Europe PMC. These results may then be filtered according to the text-mined annotations found in the articles. Any of the 43 categories of annotation in Europe PMC can be used for filtering. Users can filter to include articles according to the presence or absence of a specific annotation. Three types of filter are available (include, exclude, ignore), listed in Table 1 and illustrated in Figure 2. Annotations are fetched from Europe PMC, and then used to filter the articles client-side. The basic search, filtering and reading functions can be used by anonymous users. Saving lists, highlights and notes requires users to register with either an email address or by using ORCID login.

Table 1. The 3 filter types.

Filter type	Action
include	Only show search results that have an annotation mapped to a specified identifier
exclude	Only show search results that do not have an annotation mapped to a specified identifier
ignore	Only show search results that have at least one annotation of a specified type that is not among the specified identifiers

Figure 2. An illustration of the filter types.

Three taxa are specified for filtering at the top. In the left column, 4 documents are shown. In reality, the filter would be applied using the entire document, or a specified section, but in this case a short fragment is used for illustrative purposes. In the fragments, all mentions of a species are underlined, and the specified species are highlighted. In the top row, each filter type is listed. Below the filter types it is indicated whether a filter of the corresponding type, with the three specified taxa, would result in the document being included in the search results or not.
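The three filter types can also be expressed as set operations. The sketch below is an interpretation of Table 1, not LitSieve's source code. It assumes that ‘include’ requires at least one annotation from the specified list, that ‘exclude’ requires none, and that ‘ignore’ requires at least one annotation of the same type outside the list. The function name `passes` and the four documents are hypothetical and are not the documents shown in Figure 2. The taxon identifiers are NCBI taxonomy IDs.

```python
# NCBI taxonomy identifiers used in this example.
HUMAN, MOUSE, RAT, ZEBRAFISH = "9606", "10090", "10116", "7955"


def passes(filter_type, specified_ids, article_ids):
    """Decide whether an article is shown under a single filter.

    specified_ids: identifiers listed in the filter.
    article_ids: identifiers of the article's annotations of the same type.
    """
    specified, found = set(specified_ids), set(article_ids)
    if filter_type == "include":
        # At least one annotation maps to a specified identifier.
        return bool(found & specified)
    if filter_type == "exclude":
        # No annotation maps to a specified identifier.
        return not (found & specified)
    if filter_type == "ignore":
        # At least one annotation of this type lies outside the specified list.
        return bool(found - specified)
    raise ValueError(f"unknown filter type: {filter_type}")


specified = {HUMAN, MOUSE, RAT}
# Hypothetical documents, each reduced to the set of species annotated in it.
documents = {
    "doc1": {HUMAN},
    "doc2": {ZEBRAFISH},
    "doc3": {MOUSE, ZEBRAFISH},
    "doc4": set(),
}
outcome = {
    name: {ft: passes(ft, specified, ids) for ft in ("include", "exclude", "ignore")}
    for name, ids in documents.items()
}
for name, row in outcome.items():
    print(name, row)
```

Under these assumptions, ‘exclude’ and ‘ignore’ differ on the last two documents. The document mentioning both mouse and zebrafish passes ‘ignore’ but not ‘exclude’. The document with no species annotations passes ‘exclude’ but not ‘ignore’.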

The filter may be restricted to a specific section of the article (for example, finding only articles that have a ‘mouse’ annotation within their Methods section). Lists of identifiers may be saved for convenience, for example, if a curator has a list of diseases of interest that they wish to use as a filter on many searches.
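A section-restricted filter and a saved identifier list might be modelled as follows. This is a sketch under assumptions: the `Annotation` record, the section labels and the `ids_in_scope` helper are invented for illustration and do not reflect LitSieve's internal data model.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Annotation(NamedTuple):
    section: str     # section label, e.g. "Methods" (hypothetical labels)
    identifier: str  # normalised identifier, e.g. an NCBI taxonomy ID


def ids_in_scope(annotations: Iterable[Annotation],
                 section: Optional[str] = None) -> Set[str]:
    """Collect identifiers from the whole article, or from one section only."""
    return {a.identifier for a in annotations
            if section is None or a.section == section}


# A saved, reusable identifier list (10090 is the NCBI taxonomy ID of Mus musculus).
saved_lists = {"mouse": {"10090"}}

# Hypothetical article: mouse is mentioned only in the Introduction.
article = [
    Annotation("Introduction", "10090"),
    Annotation("Methods", "9606"),
]

whole_article = bool(ids_in_scope(article) & saved_lists["mouse"])
methods_only = bool(ids_in_scope(article, "Methods") & saved_lists["mouse"])
print(whole_article, methods_only)  # True False
```

Here an ‘include’ filter on the whole article would show the article, whereas the same filter restricted to the Methods section would hide it.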

For convenience, several types of annotation can be filtered using an integrated auto-complete interface. Species and other taxonomic ranks can be retrieved from the NCBI taxonomy [1], gene and protein names can be retrieved from UniProt, and terms from the Gene Ontology, Uberon, Experimental Factor Ontology, and ChEBI can be retrieved from the Ontology Lookup Service. ^{9,10}

Articles found using LitSieve can be saved to lists. This accommodates a triage workflow where users can flag literature as either curatable or non-curatable, but users may organise lists as they wish, and no specific workflow is imposed. Articles may be added to or removed from lists directly from the search result page, or from the reading view. This permits, for example, a curator to remove an article from their ‘triage’ list after having read it and found it to be non-curatable. The “quick lists” feature allows users to assign an icon and colour to a particular list, which permits easy visual identification of list membership in the search results page. This allows a curator to identify, for example, articles they have already triaged.

In the reader view, users may highlight and add private notes to articles (see Figure 1d). Biocurators may use this to highlight curatable passages from the article or other pertinent details such as cell lines used in experiments.

Users may recall and reorder saved articles from a list management view. A list of all articles to which notes or highlights have been added is also available. Lists may be used to organise or prioritise articles for curation, or to save a group of related articles.

Use cases

IntAct

IntAct is a molecular interaction database. ¹¹ It is essentially a graph of interacting molecules, with the vertices being biologically active molecules like proteins, and the edges denoting some kind of interaction between a pair of them. IntAct is manually curated; every interaction has been captured by a biocurator. This is a time intensive process, and given the available resources, prioritisation is necessary because not every possible interaction published can be incorporated into the database. As a strategic goal, IntAct has prioritised adding new molecules to the database (increasing the number of vertices) over adding edges between molecules already in the database, prioritising coverage over increasing the number of evidences for known relationships. Therefore, it is desirable to find literature that discusses protein–protein interactions where at least one of the proteins is not yet listed in IntAct.

LitSieve enables the IntAct biocurators to filter out articles that will not add new molecules to the database. After performing a literature search, an ‘ignore’ filter that lists UniProt identifiers for all proteins already present in IntAct can be applied. This will filter out any article that does NOT mention at least one protein not in the list specified by the user. That is, only articles mentioning proteins new to IntAct will be shown in the result list. While this does not guarantee that the article will discuss a curatable protein interaction, it will filter out articles which certainly do not increase the number of proteins covered by IntAct. In this way, LitSieve enables IntAct curators to perform literature searches constructed using their experience while benefiting from the text-mined annotations in Europe PMC to speed up their triage of the results. A step by step illustration of this workflow is available at 10.5281/zenodo.15682791.
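The effect of the ‘ignore’ filter in this workflow can be illustrated as a set difference. This is a minimal sketch, not the IntAct curators' actual procedure. The UniProt accessions are real (P04637, Q00987 and P38398 are human p53, MDM2 and BRCA1 respectively), but their assignment to articles and to the "already in database" list is invented for the example, as is the `new_proteins` helper.

```python
def new_proteins(article_proteins, known_proteins):
    """Proteins annotated in the article that are absent from the known list."""
    return set(article_proteins) - set(known_proteins)


# Hypothetical stand-in for the accessions already present in the database.
already_in_database = {"P04637", "Q00987"}

# Hypothetical articles, each reduced to the UniProt accessions annotated in it.
articles = {
    "article-1": {"P04637", "Q00987"},  # only known proteins
    "article-2": {"P04637", "P38398"},  # one protein not in the list
    "article-3": set(),                 # no protein annotations at all
}

shown = {}
for article_id, proteins in articles.items():
    novel = new_proteins(proteins, already_in_database)
    if novel:  # 'ignore' semantics: keep only articles with something new
        shown[article_id] = novel

print(shown)  # {'article-2': {'P38398'}}
```

Only the second article survives: the first mentions nothing new, and the third has no protein annotations to compare.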

UniProt

UniProt is a data resource for protein sequence and functional information. One component of UniProt is the SwissProt subset of the UniProt knowledgebase (UniProtKB/SwissProt). This is a curated resource summarising experimental and computationally predicted functional information selected and reviewed by an expert biocurator. In order to carry out this work, UniProt biocurators search for, and read, literature related to the proteins which they are tasked with creating and updating records for.

LitSieve has been used to curate proteins related to antimicrobial resistance into UniProt. The ability to filter search results based on species is beneficial during triage to sift out articles not related to the entry being curated. Since a single species may be referred to by multiple different names (for example, mouse, mice, M. musculus, Mus musculus), filtering based on concept rather than exact text matches can save time and effort during the triage process.
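The difference between matching on text and matching on a normalised concept can be shown with a toy example. The synonym table below is a hypothetical stand-in for illustration only: in Europe PMC the mapping from mentions to identifiers is produced by text mining, not by a fixed dictionary, and the naive substring matching used here would be too crude for real use. The identifiers are NCBI taxonomy IDs (10090 for Mus musculus, 7955 for Danio rerio).

```python
# Hypothetical surface-form table mapping lower-cased names to taxonomy IDs.
SYNONYMS = {
    "mouse": "10090",
    "mice": "10090",
    "m. musculus": "10090",
    "mus musculus": "10090",
    "zebrafish": "7955",
    "danio rerio": "7955",
}


def exact_text_match(text, term):
    """True if the literal term occurs in the text (case-insensitive)."""
    return term.lower() in text.lower()


def concept_match(text, taxon_id):
    """True if any known name for the taxon occurs in the text."""
    lowered = text.lower()
    return any(name in lowered
               for name, tid in SYNONYMS.items() if tid == taxon_id)


# Hypothetical sentences standing in for article text.
abstracts = [
    "Knockout mice showed reduced resistance.",
    "Expression was measured in Mus musculus liver.",
    "Danio rerio embryos were imaged.",
]

by_text = [exact_text_match(a, "mouse") for a in abstracts]
by_concept = [concept_match(a, "10090") for a in abstracts]
print(by_text)     # [False, False, False]
print(by_concept)  # [True, True, False]
```

A literal search for "mouse" misses both mouse sentences, while filtering on the taxon identifier finds them and still rejects the zebrafish sentence.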

Conclusion

LitSieve allows biocurators to combine their literature search expertise with filters based on text-mined annotations. This transparent and reproducible approach to literature discovery allows biocurators and other users to understand why a particular article has or has not been captured by their query. The flexible filter architecture permits use cases that we have not yet anticipated. Based on Europe PMC, LitSieve benefits from daily literature updates and can search across over 31 million abstracts and over 10 million full text articles. Filtering can be performed using 2 billion text-mined annotations in 43 categories. A variety of other tools are available for biocuration literature search; however, to our knowledge, no others are able to search based on this number of types of annotation.

LitSieve provides an integrated interface for organising and prioritising literature. We anticipate that by integrating biocuration related features into a single application, biocuration workflows can be made more efficient.

Software availability

LitSieve is available publicly at https://www.ebi.ac.uk/europepmc/litsieve/.

Source code is available in two repositories under an MIT licence. The front-end is available at https://gitlab.ebi.ac.uk/mjj/biocuration-toolbox , and the back-end is available at https://gitlab.ebi.ac.uk/mjj/litsieve-backend . An archived copy of these repositories at time of submission has been deposited in Zenodo: https://dx.doi.org/10.5281/zenodo.15480211.

Author contributions

All the authors contributed to conceptualisation and determining the methodology. HH, MH and JM provided supervision. MJ was responsible for software development, and for drafting the original manuscript. All authors contributed to review and editing.

Acknowledgements

We thank Islam Hassan, Mohamed Selim, and Jagadeeswararao Poluru for software engineering and Kalpana Panneerselvam, Paul Denny and other users for testing, and feedback. This work was supported by the European Molecular Biology Laboratory (EMBL).

References

1. International Society for Biocuration: Biocuration: Distilling data into knowledge. PLoS Biol. 2018 Apr 16;16(4):e2002846. PMID: 29659566; DOI: 10.1371/journal.pbio.2002846; PMCID: PMC5919672

2. Hirschman, Berardini, Drabkin: A MOD(ern) perspective on literature curation. Mol. Gen. Genomics. 2010 May;283(5):415–425. PMID: 20221640; DOI: 10.1007/s00438-010-0525-8; PMCID: PMC2854346

3. Rosonovski, Levchenko, Bhatnagar: Europe PMC in 2023. Nucleic Acids Res. 2024 Jan 5;52(D1):D1668–D1676. PMID: 37994696; DOI: 10.1093/nar/gkad1085; PMCID: PMC10767826

4. Wei C-H, Allot, Lai P-T: PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge. arXiv. 2024 Jan 19;52:W540–W546. PMID: 39314498; DOI: 10.1093/nar/gkae235

5. Allot, Lee, Chen: LitSuggest: a web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res. 2021 Jul 2;49(W1):W352–W358. PMID: 33950204; DOI: 10.1093/nar/gkab326; PMCID: PMC8262723

6. Jin, Leaman: PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine. 2024 Feb 1;100:104988. PMID: 38306900; DOI: 10.1016/j.ebiom.2024.104988; PMCID: PMC10850402

7. de Wynter, Wang, Sokolov: An evaluation on large language model outputs: Discourse and memorization. Nat. Lang. Proc. J. 2023 Sep;4:100024. DOI: 10.1016/j.nlp.2023.100024

8. Holzinger, Langs, Denk: Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019 Apr 2;9(4):e1312. PMID: 32089788; DOI: 10.1002/widm.1312; PMCID: PMC7017860

9. UniProt Consortium: Uniprot: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D523–D531. PMID: 36408920; DOI: 10.1093/nar/gkac1052; PMCID: PMC9825514

10. Côté, Reisinger, Martens: The Ontology Lookup Service: bigger and better. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W155–W160. PMID: 20460452; DOI: 10.1093/nar/gkq331; PMCID: PMC2896109

11. Del Toro, Shrivastava, Ragueneau: The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res. 2022 Jan 7;50(D1):D648–D653. PMID: 34761267; DOI: 10.1093/nar/gkab1006

[1] https://www.ncbi.nlm.nih.gov/taxonomy

10.5256/f1000research.180243.r410949

Reviewer response for version 1

Kim Rutherford (1), Referee, https://orcid.org/0000-0001-6277-726X

1 University of Cambridge, Cambridge, UK

Competing interests: No competing interests were disclosed.

19 9 2025

2025

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approved with reservations

This paper describes "LitSieve", a literature search tool that improves on previous systems by providing integrated access to publication details and text-mined annotations, along with a filtering system that allows users to narrow their search to relevant articles.

--------------------------

I appreciate the summary of the filters in Figure 2.

The "exclude" and "include" filter types seem straightforward but I struggle to understand the "ignore" filter type. An example of "ignore" is given in the "Use cases" section but could the function of "ignore" be more precisely described earlier? Perhaps in the section that introduces the filters?

--------------------------

"Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?"

It's good to see that the source code is available and has been deposited in Zenodo.

The frontend repository README says "It may be run with or without the backend" but doesn't specify how. I can't see documentation for how to configure the frontend, either in the manuscript or in the repository. Please add this documentation to the repository or manuscript.

--------------------------

Please add a statement about future support, maintenance and software availability. I notice that the repositories linked to from the manuscript have had no code changes for 4 months. Has development and bug fixing stopped? Will there be future support?

--------------------------

"We thank Islam Hassan, Mohamed Selim, and Jagadeeswararao Poluru for software engineering"

If these software engineers made substantial contributions to the software, they should be co-authors. If not, consider a separate explanation for the contribution of each engineer, if there are differences. Thanking someone for "software engineering" in a software publication would be like thanking someone for "lab work" in an experimental publication.

--------------------------

The user-driven approach described here is encouraging:

"An internal survey of biocurators was conducted to understand their usage of literature search tools and the types of literature they were interested in. Possible features were discussed, and curators were observed while completing tasks."

"We thank ... Kalpana Panneerselvam, Paul Denny and other users for testing, and feedback"

"As development progressed, feedback from biocurators was incorporated into the prerelease versions at each stage."

Any users who have made substantial contributions in the form of feedback or ideas should be considered for co-authorship. Especially consider any biocurators who contributed multiple major suggestions that have been incorporated into the system. Are there users who have contributed more ideas or feedback than any of the current co-authors? If there are, they should be on the author list.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Software engineering. Bioinformatics.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.