Data Note

Organizing knowledge to enable faster data interpretation in COVID-19 research

[version 1; peer review: 2 approved with reservations]
PUBLISHED 30 Jul 2021

This article is included in the Bioinformatics gateway.

This article is included in the Emerging Diseases and Outbreaks gateway.

This article is included in the ELIXIR gateway.

This article is included in the Coronavirus (COVID-19) collection.

Abstract

Enormous volumes of COVID-19 research data have been published, and the volume continues to increase daily. This makes it challenging for researchers to interpret, prioritize and summarize their own findings in the context of published literature, clinical trials, and a multitude of databases. Overcoming the data interpretation bottleneck is vital to help researchers be more efficient in their quest to identify COVID-19 risk factors, potential treatments, drug side-effects, and much more. As a proof of concept, we have organized and integrated a range of COVID-19 and human biomedical data and literature into a knowledge graph (KG). Here we present the datasets we have integrated so far and the content of the KG, which consists of 674,969 biological concepts and over 1.6 million relationships between them. The COVID-19 KG is available via KnetMiner, an interactive online platform for gene discovery and knowledge mining, or via RDF and Neo4j graph formats which can be searched programmatically through SPARQL and Cypher endpoints. KnetMiner is a road mapped ELIXIR UK service. We hope this integrated resource will enable faster data interpretation and discovery of linkages between genes, drugs, diseases and many more types of information relating to COVID-19.

Keywords

Coronavirus, COVID-19, SARS-CoV-2, knowledge graph, knowledge base, target discovery, knowledge mining, bioinformatics

Introduction

The global COVID-19 pandemic, caused by the SARS-CoV-2 virus, has infected millions worldwide, causing many more infections and deaths than the previous severe acute respiratory syndrome (SARS) outbreak that occurred in 2002 and 2003 (WHO, 2020). The race to find plausible treatment strategies and management plans for the coronavirus pandemic, concurrent with investigating the biological underpinnings of how SARS-CoV-2 operates, has driven a very rapid rise in the number of SARS-CoV-2-related publications, pre-prints and biological datasets. The White House, alongside other groups, made a global call to action to find a way to rapidly sift through this mountain of COVID-19 literature. They released the COVID-19 Open Research Dataset (CORD-19), which has over 150,000 COVID-19-related full text articles (Kaggle, 2020). Additionally, the COVID-19 data portal contains over 360,000 publications and many other datasets related to the disease. The volume of data related to the pandemic and the virus has grown so quickly that it has led to something of a new phenomenon for many researchers: ‘too much information’. As such, the challenge is to organize the growing mountain of biomedical knowledge in order to overcome the data interpretation bottleneck (Good et al., 2014).

Our goal was to repurpose the KnetMiner software, which we originally developed for crop scientists to help identify the most important genes involved in complex plant traits (Hassani‐Pak et al., 2021), to provide medical researchers with quick and intuitive access to all documented linkages between genes, potential therapeutic compounds, and the virus. KnetMiner contains fast algorithms for scanning millions of relationships from across a range of datasets and literature, scoring all evidence and displaying the links in easy-to-understand knowledge graphs. KnetMiner for COVID-19 would enable users to search for genes and keywords related to COVID-19 and explore the surrounding connected gene networks and pathways, for example, for negative downstream effects of drugs, genetic associations with known diseases, or human-pathogen interactions.

KnetMiner requires as input a knowledge graph (KG) which is a flexible semantic data model to represent heterogeneous, interconnected data where graph nodes represent biological concepts and graph edges the relationships between them. Such biological concepts can include, but are not limited to, genes, drugs, diseases, and publications. There have been a few parallel efforts to construct KGs from COVID-19 data (Reese et al., 2020). These mostly differ in data coverage and modelling approaches and would require further work to be made usable by KnetMiner. Hence, we decided to use the toolkit provided by KnetMiner to build a unified knowledge graph from public datasets and the scientific literature.

Here, we describe how we organized the biomedical knowledge in a format that is compatible with KnetMiner and present a use case.

Methods

Data sources

Firstly, we identified the datasets that we considered important for the first iteration of the COVID-19 KG, based on discussions held with participants of the COVID-19 Biohackathon (5–11 April 2020). We downloaded over 20 datasets including COVID-19-related publications and pre-prints, human genetics studies, virus-host protein interactions, drug-target interactions, pathway information, ontology annotations and mouse knock-out mutant data with links to diseases and phenotypes (see Table 1). The data download and processing steps were automated using custom Python scripts (see Extended data). The downstream integration pipeline supports various data formats, including PubMed XML, UniProt XML and OWL, and most importantly tabular format, which required the development of an XML-based configuration to map the tabular content to a labelled property graph representation.
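As an illustration of what mapping tabular content to a labelled property graph involves, the following is a minimal Python sketch. It is not the KnetBuilder XML configuration or the authors' scripts: the column names, the mapping dictionary, and the identifiers in the toy table are all invented for this example. Only the `has_target` relation name and the Drug and Protein concept types are taken from elsewhere in this article.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str       # concept type, e.g. "Protein" or "Drug"
    identifier: str  # accession of the concept in its source table


@dataclass
class PropertyGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source Node, relation type, target Node)


# Hypothetical mapping: which column holds which concept type,
# and which relation type links the two columns.
MAPPING = {
    "source": ("Drug", "drug_id"),
    "target": ("Protein", "protein_acc"),
    "relation": "has_target",
}


def table_to_graph(tsv_text, mapping):
    """Turn each row of a tab-separated table into two nodes and one edge."""
    graph = PropertyGraph()
    src_label, src_col = mapping["source"]
    dst_label, dst_col = mapping["target"]
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        src = Node(src_label, row[src_col].strip())
        dst = Node(dst_label, row[dst_col].strip())
        graph.nodes.update((src, dst))
        graph.edges.add((src, mapping["relation"], dst))
    return graph


# Toy table with invented identifiers; the last row repeats the first one.
TOY_TSV = (
    "drug_id\tprotein_acc\n"
    "DRUG_A\tPROT_1\n"
    "DRUG_A\tPROT_2\n"
    "DRUG_B\tPROT_1\n"
    "DRUG_A\tPROT_1\n"
)

graph = table_to_graph(TOY_TSV, MAPPING)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

Because nodes and edges are stored in sets keyed on concept type and identifier, a concept that occurs in several rows (or several tables) is created only once, which is the basic property an integration step needs before any further merging of semantics.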

Table 1. The key data sources used to construct the COVID-19 knowledge graph, their composite biological concept type, respective semantics, and URL links for the data.

Datasource | Data Types | Concept Classes | URLs
Mouse Genome Informatics data (MGI) | Gene; Disease (OMIM) | Gene → Disease | http://www.informatics.jax.org/downloads/reports/MGI_DO.rpt
InterPro | Protein Domain | Protein → Protein Domain | https://ftp.ebi.ac.uk/pub/databases/interpro/entry.list
OMIM | Disease | Drug → Disease | https://www.omim.org/
NHGRI-EBI GWAS Catalog | SNP (GWAS); Trait | Gene → SNP → Trait | https://www.ebi.ac.uk/gwas/api/search/downloads/studies_alternative
Gene Ontology (GO) | Biological Processes | Protein → BioProc | http://purl.obolibrary.org/obo/go.owl
IntAct | Protein | Protein → Protein | www.ebi.ac.uk/intact/export?format=mitab_27&query=annot%3A%22dataset%3Acoronavirus%22&negative=false&spoke=false&ontology=false&sort=intact-miscore&asc=false
Ensembl (GRCh38.99) | Gene; Protein | Gene → Protein; Protein → Protein (homology) | https://www.ensembl.org/info/data/ftp/index.html
UniProt | Protein; Publication; Gene Ontologies | Gene → Protein → Publication; Gene → Protein → GO | https://www.uniprot.org/uniprot?include=false&format=xml&compress=yes&force=true&query=proteome:UP000005640 ; https://www.uniprot.org/uniprot/?query=mouse&format=xml&compress=yes&force=true&sort=score&fil=reviewed:yes
DrugBank | Drug | Protein → Drug | https://www.drugbank.ca/releases/latest
HumanCyc | Pathway; Enzyme Complex; Molecules; Compound | Protein → Enzyme Complex → Reaction → Pathway; Pathway → Compound; Enzyme → Compound | http://brg-files.ai.sri.com/public/dist/humancyc.tar.gz
SciBite | Gene; Protein (SARS-CoV-2); Phenotype | Gene → Publication; Protein → Publication; Phenotype → Publication; Drug → Publication | https://github.com/SciBiteLabs/CORD19

KnetMiner

The KnetBuilder (https://github.com/Rothamsted/knetbuilder) software was used to integrate the various data sources into a KG with unified semantics and augment it with additional text mining-based relationships (Hassani-Pak et al., 2010). The data loading and integration steps are specified using an XML-based configuration. KnetBuilder uses an in-memory graph implementation (Köhler et al., 2006) and provides exports to RDF and tooling to convert RDF to Neo4j (Brandizi et al., 2018b).

KnetMiner is open-source software that indexes and serves the contents of a knowledge graph in order to accelerate gene discovery. It requires as input a dataset folder with configuration files and a compatible knowledge graph in OXL format. We used version 4.0 of KnetMiner and deployed it with our COVID-19 KG (see Underlying data). The COVID-19 KG contains all human genes; however, KnetMiner was configured to index a subset of 9,141 COVID-19-related genes, derived from SciBite gene annotations of CORD-19 version 54 (Giles et al., 2020). In the presented use case, KnetMiner was searched with a list of differentially expressed genes (Table 3) and the following keywords:

"Defense Response To Virus" Remdesivir "Killing Of Cells Of Other Organism" "Antimicrobial Humoral Immune Response Mediated By Antimicrobial Peptide" "Response To Virus" "Cellular Response To Lipopolysaccharide" "Positive Regulation Of Transcription By RNA Polymerase II" "White blood cell count" "Blood protein levels"

ConceptID:500458 ConceptID:518058 ConceptID:511770 ConceptID:513484 ConceptID:503458 ConceptID:512650 ConceptID:471625 ConceptID:379514 ConceptID:370288

COVID-19 Knowledge Graph

The combined COVID-19 KG has 674,969 biological concepts and over 1.6 million relationships between them, and is available in OXL, RDF and Neo4j formats. It contains data on 27,599 human genes that are present in the KG as “type gene” concepts that are linked to a range of other concepts through different relation types and evidences. The content of the KG with a breakdown by concept type is shown in Table 2.

Table 2. Content of the COVID-19 KG with a breakdown by concept type.

The most up-to-date statistics can be accessed at: https://knetminer.com/COVID-19/html/release.html

Concept type | Number of concepts
Biological Process | 29,705
Cell Component | 4,203
Compound | 1,975
Disease | 5,535
Drug | 22,412
EC | 2,271
Enzyme | 4,731
Gene | 47,494
Molecular Function | 10,278
Pathway | 296
Phenotype | 13,037
Protein Domain | 15,891
Protein Complex | 220
Protein | 250,893
Publication | 174,313
Reaction | 1,945
SNP | 87,683
Trait | 1,953
Transport | 126

Table 3. Differentially expressed genes from EBI Gene Expression Atlas (https://www.ebi.ac.uk/gxa/experiments/E-GEOD-147507).

# Expression Atlas: Transcriptional response of human lung epithelial cells to SARS-CoV-2 infection
# Query: Genes matching: 'default query', specifically up/down differentially expressed in 1/2 selected comparisons given the adjusted p-value cutoff 0.05 and log2-fold change cutoff 1 in experiment E-GEOD-147507
# Timestamp: Fri, 30-Apr-2021 12:55:29
Gene ID | Gene Name | 'Severe acute respiratory syndrome coronavirus 2 (USA-WA1/2020)' vs 'mock' in 'NHBE' at '24 hour'.foldChange | 'Severe acute respiratory syndrome coronavirus 2 (USA-WA1/2020)' vs 'mock' in 'NHBE' at '24 hour'.pValue
ENSG00000008517 | IL32 | 1.1 | 4.58E-16
ENSG00000023445 | BIRC3 | 1.3 | 9.71E-17
ENSG00000089127 | OAS1 | 1.3 | 2.97E-14
ENSG00000090339 | ICAM1 | 1.5 | 1.75E-27
ENSG00000100985 | MMP9 | 1.6 | 7.97E-17
ENSG00000104368 | PLAT | 1.2 | 3.66E-14
ENSG00000108342 | CSF3 | 1.4 | 4.23E-13
ENSG00000111331 | OAS3 | 1.1 | 1.32E-15
ENSG00000111335 | OAS2 | 1.1 | 3.70E-17
ENSG00000112096 | SOD2 | 1.3 | 2.43E-25
ENSG00000113070 | HBEGF | 1.2 | 6.30E-21
ENSG00000115009 | CCL20 | 2.4 | 9.70E-63
ENSG00000118503 | TNFAIP3 | 1.5 | 9.51E-45
ENSG00000122641 | INHBA | 1.5 | 8.55E-30
ENSG00000125730 | C3 | 1.4 | 7.66E-33
ENSG00000126709 | IFI6 | 1.3 | 1.80E-11
ENSG00000128342 | LIF | 1.2 | 3.26E-21
ENSG00000134070 | IRAK2 | 1.2 | 9.16E-13
ENSG00000134339 | SAA2 | 2.2 | 8.78E-63
ENSG00000136244 | IL6 | 1.5 | 2.52E-16
ENSG00000136688 | IL36G | 2.1 | 8.43E-40
ENSG00000140379 | BCL2A1 | 1.3 | 3.70E-10
ENSG00000140519 | RHCG | 1.2 | 1.92E-24
ENSG00000143546 | S100A8 | 1.7 | 4.20E-33
ENSG00000145901 | TNIP1 | 1.2 | 3.36E-33
ENSG00000157601 | MX1 | 1.7 | 5.44E-22
ENSG00000162366 | PDZK1IP1 | 1.4 | 1.17E-13
ENSG00000163216 | SPRR2D | 2 | 7.40E-29
ENSG00000163218 | PGLYRP4 | 1.2 | 5.73E-10
ENSG00000163734 | CXCL3 | 1.1 | 4.65E-08
ENSG00000163735 | CXCL5 | 1.9 | 1.53E-23
ENSG00000163739 | CXCL1 | 1.3 | 1.70E-30
ENSG00000163874 | ZC3H12A | 1.3 | 7.73E-17
ENSG00000165949 | IFI27 | 1.9 | 1.73E-30
ENSG00000166920 | C15orf48 | 1.1 | 1.09E-19
ENSG00000169429 | CXCL8 | 2.1 | 3.90E-89
ENSG00000173432 | SAA1 | 1.9 | 1.72E-41
ENSG00000173918 | C1QTNF1 | 1.1 | 1.31E-07
ENSG00000185215 | TNFAIP2 | 1.2 | 1.64E-17
ENSG00000185479 | KRT6B | 1.5 | 3.38E-40
ENSG00000188375 | H3-5 | 2.3 | 6.63E-35
ENSG00000241794 | SPRR2A | 1.3 | 5.16E-14
ENSG00000243649 | CFB | 1.1 | 3.99E-08
ENSG00000268104 | SLC6A14 | 1.1 | 1.52E-20
ENSG00000276980 | AC008760.2 | 1.1 | 6.67E-09
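The Table 3 header states the selection criteria used in Expression Atlas: an adjusted p-value cutoff of 0.05 and a log2 fold change cutoff of 1, for genes differentially expressed in either direction. The Python sketch below shows how such a filter could be applied to a tab-separated export to obtain a gene list for KnetMiner. It is an illustration, not the authors' procedure: the short column names are hypothetical, whether the cutoffs are inclusive is an assumption, and the three TOY rows are invented (the CXCL8 and IL6 rows are copied from Table 3).

```python
import csv

PADJ_CUTOFF = 0.05    # adjusted p-value cutoff from the Table 3 header
LOG2FC_CUTOFF = 1.0   # log2 fold change cutoff from the Table 3 header


def select_degs(tsv_text, padj_cutoff=PADJ_CUTOFF, log2fc_cutoff=LOG2FC_CUTOFF):
    """Return (gene_id, gene_name) pairs that pass both cutoffs.

    Assumptions: columns are named gene_id, gene_name, log2fc and padj;
    the fold change test is |log2fc| >= cutoff and the p-value test is
    padj < cutoff. Lines starting with '#' are treated as comments.
    """
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    selected = []
    for row in csv.DictReader(lines, delimiter="\t"):
        log2fc = float(row["log2fc"])
        padj = float(row["padj"])
        if abs(log2fc) >= log2fc_cutoff and padj < padj_cutoff:
            selected.append((row["gene_id"], row["gene_name"]))
    return selected


EXAMPLE_TSV = "\n".join([
    "# toy export; the TOY rows are invented for illustration",
    "gene_id\tgene_name\tlog2fc\tpadj",
    "ENSG00000169429\tCXCL8\t2.1\t3.90E-89",
    "ENSG00000136244\tIL6\t1.5\t2.52E-16",
    "TOY_GENE_1\tTOY1\t0.4\t1e-20",   # fails the fold change cutoff
    "TOY_GENE_2\tTOY2\t-1.8\t0.2",    # fails the p-value cutoff
    "TOY_GENE_3\tTOY3\t-1.2\t1e-5",   # down-regulated, passes both cutoffs
])

degs = select_degs(EXAMPLE_TSV)
print("\n".join(gene_id for gene_id, _ in degs))
```

The printed identifiers, one per line, are in the form that can be pasted into a gene list search box.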

Programmatic access is available via public SPARQL and Cypher endpoints (Brandizi et al., 2018a) (see Underlying data). We have developed sample Jupyter notebooks to illustrate how the KG can be queried (Brandizi, 2020). This maximizes adherence to the FAIR data principles, ensuring the data is reusable for a variety of applications. Availability, including pointers to data repositories on the Open Science Framework (OSF), is listed in the README file (see Data availability). Using Neo4j and Cypher, we can query the data and walk the knowledge graph in multiple directions. For example, we can look at the most common pathways related to proteins that are targeted by drugs cited in the literature, using the following Cypher query:

MATCH (bp:BioProc) <- [:participates_in] - (prot:Protein) <- [:has_target] - (drug:Drug) - [:occ_in] -> (pub:Publication)
RETURN bp.prefName, COUNT ( DISTINCT pub ) AS pubNo
ORDER BY pubNo DESC
LIMIT 10
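To make explicit what this query counts, the following Python sketch emulates the same traversal on a tiny in-memory graph: for each biological process, it collects the distinct publications that mention any drug targeting any protein participating in that process, then ranks processes by that count. The edge lists and all identifiers are invented toy data, and the function is an illustration of the query semantics rather than a client for the public endpoint.

```python
from collections import defaultdict

# Toy edges, invented for illustration; relation names follow the Cypher query.
PARTICIPATES_IN = [  # (protein, biological process)
    ("PROT_1", "defense response to virus"),
    ("PROT_2", "defense response to virus"),
    ("PROT_3", "inflammatory response"),
]
HAS_TARGET = [       # (drug, protein)
    ("DRUG_A", "PROT_1"),
    ("DRUG_B", "PROT_2"),
    ("DRUG_B", "PROT_3"),
]
OCC_IN = [           # (drug, publication)
    ("DRUG_A", "PUB_1"),
    ("DRUG_A", "PUB_2"),
    ("DRUG_B", "PUB_2"),
    ("DRUG_B", "PUB_3"),
]


def top_processes(participates_in, has_target, occ_in, limit=10):
    """Rank biological processes by the number of distinct linked publications."""
    pubs_by_drug = defaultdict(set)
    for drug, pub in occ_in:
        pubs_by_drug[drug].add(pub)

    drugs_by_protein = defaultdict(set)
    for drug, protein in has_target:
        drugs_by_protein[protein].add(drug)

    # Walk BioProc <- Protein <- Drug -> Publication and keep distinct publications.
    pubs_by_process = defaultdict(set)
    for protein, process in participates_in:
        for drug in drugs_by_protein.get(protein, ()):
            pubs_by_process[process] |= pubs_by_drug.get(drug, set())

    # Processes with no complete path are dropped, as MATCH would drop them.
    ranked = [(process, len(pubs)) for process, pubs in pubs_by_process.items() if pubs]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:limit]


ranking = top_processes(PARTICIPATES_IN, HAS_TARGET, OCC_IN)
for process, pub_count in ranking:
    print(process, pub_count)
```

In the toy data PUB_2 is reachable through both drugs but is counted once per process, which mirrors the effect of COUNT ( DISTINCT pub ) in the query.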

COVID-19 KnetMiner

COVID-19 KnetMiner (https://knetminer.com/COVID-19/) provides gene-centric and interactive access to the integrated knowledge base created by KnetBuilder. KnetMiner can be used to search the KG with keywords, gene lists, genomic regions of interest, and/or any combinations of these user inputs. We have created several example queries for different use cases and search strategies. These are available on the COVID-19 KnetMiner website for easy use.

In June 2020, a gene expression study was published which investigated the transcriptional response of human lung epithelial cells to SARS-CoV-2 infection (Enes & Pir, 2020). The data and the analysis between infected vs mock cells are available at the EBI Gene Expression Atlas (Table 3; https://www.ebi.ac.uk/gxa/experiments/E-GEOD-147507).

We chose this dataset because of its relevance and tractable size to showcase the use of KnetMiner for the interpretation of differentially expressed genes (DEGs). The approach can be equally applied to any data driven analysis and derived gene lists. All DEGs from the above study were used in KnetMiner to investigate linkages to drugs, diseases, biological processes, and the CORD-19 literature. As such the DEG IDs were entered into the Gene List search interface of KnetMiner and analyzed in several iterations (Figure 1): (i) without any keywords to get an unbiased view of the linked and enriched knowledge, (ii) with the keyword SARS-CoV-2 to review recent CORD-19 articles in relation to these genes, and (iii) with specific drugs, diseases and GO terms to visualize and share the organized knowledge with other scientists (Figure 2).


Figure 1. COVID-19 KnetMiner web application.

A) Search result for gene list and search terms. B) Gene View: Ranked list of genes and evidence summaries. C) Map View: Interactive genomic map with related GWAS studies and top scoring genes. D) Evidence View: Enriched linked knowledge.


Figure 2. Knowledge graph of 20 DEGs.

Information from various sources is visualized in a single view. Users can add or hide information using the legend. In total, the network contains 1216 nodes and 1808 edges, of which 101 and 143 are shown respectively. For example, 2 out of 72 Trait concepts are visible and other linked Traits (GWAS studies) can simply be added by double-clicking the Trait symbol in the legend.

KnetMiner user stories usually follow a common pattern: search, prioritize, explore, and share knowledge. All views shown in Figure 1 can generate knowledge networks for selected genes and/or evidence terms. As an example, an interactive network was generated for the top 20 scoring DEGs and keywords related to white blood cell count (GWAS study), Remdesivir (Drug), and Response to Virus (GO).

Summary

KnetMiner provides open-source tools to build, search, visualize and share knowledge graphs, and its species-agnostic architecture has provided a cost-efficient toolkit for building the first COVID-19 KnetMiner resource. By organizing COVID-19-related data in one place, integrated through a clear semantic data model, the hope is that this will enable data interpretation using more reproducible and objective approaches and support the international search for useful drugs, stop researchers repeating work done elsewhere, avoid harmful interventions, and ultimately help pave the way to effective treatments. In an effort to maximize the usefulness of our data and grow the knowledge graph faster, we plan to offer mappings and conversions between our data and the KGX/Biolink ecosystem (Biolink, 2020). For instance, we are investigating methods to ingest the covid-19-kg produced by the Berkeley Lab (Reese et al., 2020) and integrate it into the KnetMiner platform. This work will be conducted within the COVID-19 international research team (COV-IRT).

Data availability

Extended data

Documentation and custom scripts are available from https://github.com/Rothamsted/covid19-kg/ (see README)

Archived scripts as at time of publication: https://doi.org/10.5281/zenodo.5094695

License: AGPL-3.0

Software availability

KnetMiner web app: https://knetminer.com/COVID-19

ELIXIR bio.tools: https://bio.tools/covid-19_knetminer

Software available from: https://hub.docker.com/u/knetminer

Source code available from: https://github.com/Rothamsted/knetminer

Archived source code at time of publication: https://doi.org/10.5281/zenodo.3891097

License: AGPL-3.0

how to cite this article
Hearnshaw J, Brandizi M, Singh A et al. Organizing knowledge to enable faster data interpretation in COVID-19 research [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10(ELIXIR):703 (https://doi.org/10.12688/f1000research.54071.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 01 Sep 2021
Keith Hall, Google Research, New York, NY, USA 
Approved with Reservations
  • This article presents an automatically constructed and normalized Knowledge Graph of Covid-19 related literature and structured data sources. An existing tool KnetMiner is used to perform the ingestion and normalization of the data.
     
HOW TO CITE THIS REPORT
Hall K. Reviewer Report For: Organizing knowledge to enable faster data interpretation in COVID-19 research [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10(ELIXIR):703 (https://doi.org/10.5256/f1000research.57521.r90852)
Reviewer Report 27 Aug 2021
Justin Reese, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA 
Approved with Reservations
The authors describe a knowledge graph that integrates public datasets about COVID-19, as well as existing tooling to query and extract knowledge from the KG, i.e. KnetMiner.

Transform code is publicly available on GitHub, and source data ...
HOW TO CITE THIS REPORT
Reese J. Reviewer Report For: Organizing knowledge to enable faster data interpretation in COVID-19 research [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10(ELIXIR):703 (https://doi.org/10.5256/f1000research.57521.r91031)