Keywords
outbreak, molecular surveillance, peer-to-peer, pathogen
Many bioinformatic tasks complement newly sequenced genomes with existing, publicly available ones. For example, when reconstructing a local pathogen outbreak, screening for similar genomes can uncover related ones from other sampling sites, such as nearby hospitals, and can significantly affect public health responses.1,2 However, no tool exists to monitor newly sequenced genomes and automatically identify those of interest to the user. The long delay until genomes become publicly available explains why many outbreak studies are retrospective and offer limited practical value to the associated outbreak response. Several components are needed to close this gap. First, a “publisher” needs to be able to send genomic “messages” to a “consumer” using a simple and secure interface. Second, the genomic “message” should require little space, to avoid upload problems and the need for extensive storage infrastructure. Third, a mechanism is needed to route messages only to interested parties, e.g., consumers that search for genomes of a particular species in a specific geography. Lastly, on receiving a relevant message, the consumer should be able to download the associated genome. Several projects are currently developing ways to share genomes effectively (wort, stark). However, to our knowledge, ours is the first end-to-end solution available to users.
DarkQ is implemented using the Nextflow workflow manager to ensure a robust, reproducible, and portable application.3 The user interface of DarkQ is similar to that of the popular file-hosting service “Dropbox”: the content of a “send” directory is tracked. When a genome is added to this directory, it is first compressed (“sketched”) using the MinHash algorithm4,5 (sourmash, v3.5). The resulting reduction in file size by orders of magnitude allows for efficient transmission. Together with metadata and inferred taxonomy (using sourmash), the genome sketch constitutes a “message” (Figure 1A). The receiving message queue then uses the Advanced Message Queuing Protocol (AMQP)6 to route messages (implementation: RabbitMQ, v3.8.9) onto queues, i.e., sequential groups of messages. The original genome is uploaded (“pinned”) to a decentralized, peer-to-peer network (IPFS, v0.7).7 Its content-based address is part of the genome message.
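To illustrate why sketching shrinks a genome so drastically, the following is a minimal bottom-k MinHash sketch in plain Python. It is a simplified illustration, not the sourmash implementation used by DarkQ: the function names, the BLAKE2 hash, and the default parameters are assumptions made for this example.

```python
from __future__ import annotations

import hashlib

# Translation table for complementing DNA bases (other characters are left unchanged).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (BLAKE2 is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(sequence: str, k: int = 51, num_hashes: int = 1000) -> list[int]:
    """Keep only the num_hashes smallest distinct k-mer hashes of a sequence."""
    seq = sequence.upper()
    hashes = {
        kmer_hash(canonical(seq[i : i + k]))
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:num_hashes]
```

Whatever the genome length, the sketch holds at most `num_hashes` integers, which is what makes the message small enough to transmit cheaply. Because k-mers are canonicalized, a sequence and its reverse complement produce the same sketch.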
Figure 1. (A) A router (circle) distributes the messages to queues via routing keys (annotated arrows). Consumers (C) can use these keys to receive only a subset of messages and then further filter them with target genomes using MinHash sketches. In parallel, the genomes from the publisher are uploaded to a decentralized peer-to-peer (P2P) network. Once messages pass through the consumer’s filters, the corresponding genomes are automatically downloaded from the P2P network. This architecture allows the effective distribution of newly sequenced genomes and enables continuous monitoring, e.g., in outbreak scenarios. (B) Use case simulation: a hospital becomes aware of a local outbreak of an extensively drug-resistant (XDR) Klebsiella pneumoniae (Kp) isolate of sequence type (ST, right metadata column) 258 carrying a plasmid-encoded KPC-2 carbapenemase. Using DarkQ, we identified 431 genomes from several countries (leaf colors) from 26 studies (left metadata column) with an average nucleotide identity (ANI) > 99.98% and identical resistance and capsule patterns (not shown). A time-dated phylogeny revealed several non-local isolates, suggesting that the outbreak reached further than previously assumed. An interactive version of the data can be found at microreact.org/project/facEFbDrgwgp9aX97nvpHq. Scale in number of SNVs.
The consumer can subscribe to messages using an arbitrary number of filters, so-called “routing keys”. Each routing key is unique and has five properties: name of the sender (e.g., “phiweger”), country code (e.g., “DE”), taxon status (“found” or “mystery”), taxon level (one of superkingdom, phylum, class, order, family, genus, species, or strain), and taxon name at that level (e.g., “Klebsiella” for genus). Taxon levels and names are adapted from, and must conform to, the Genome Taxonomy Database (GTDB, release 89).8 For example, “phiweger.DE.*.genus.klebsiella” would select all isolate genomes of the stated genus from Germany sent by the author.
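The wildcard in the example key behaves like a pattern in an AMQP topic exchange, where keys are dot-separated words, `*` stands for exactly one word, and `#` stands for zero or more words. The sketch below reimplements that matching rule in plain Python for illustration only. It assumes DarkQ relies on topic-style matching, and in practice RabbitMQ performs this step on the broker; the function name and the example message keys are hypothetical.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Check a dot-separated routing key against an AMQP topic-style pattern."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False
        # '*' matches exactly one word; anything else must match literally.
        return (p[i] == "*" or p[i] == k[j]) and match(i + 1, j + 1)

    return match(0, 0)


if __name__ == "__main__":
    subscription = "phiweger.DE.*.genus.klebsiella"
    # Hypothetical message keys following the five-property layout.
    print(topic_matches(subscription, "phiweger.DE.found.genus.klebsiella"))  # True
    print(topic_matches(subscription, "phiweger.US.found.genus.klebsiella"))  # False
```

Here the `*` in the third position accepts either taxon status, so both “found” and “mystery” genomes of the genus are delivered, while a differing country code is rejected.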
Because we can estimate genome similarity using MinHash sketches,4 the consumer can quickly filter the received genome messages using target genomes, for example those belonging to a local pathogen outbreak or a current research project. If a message passes this filter, the genome is automatically downloaded from the peer-to-peer network using its content hash address, which at the same time locates and validates the downloaded file. If multiple users pin the genome, download speed can increase substantially. A downstream workflow can then be connected to analyze these genomes further, enabling a complete monitoring system.
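The consumer-side filter can be sketched as follows: estimate the Jaccard similarity between the sketch in a received message and each target sketch, and trigger the download only if any target is similar enough. This is a minimal illustration using a bottom-k estimator on lists of integer hashes. It is not the sourmash comparison that DarkQ calls, and the function names and the default threshold (taken from the use case in this article) are assumptions.

```python
from __future__ import annotations


def jaccard_estimate(a: list[int], b: list[int]) -> float:
    """Estimate Jaccard similarity from two bottom-k MinHash sketches."""
    size = min(len(a), len(b))
    if size == 0:
        return 0.0
    set_a, set_b = set(a), set(b)
    # The smallest hashes of the union act as a random sample of all k-mers.
    sample = sorted(set_a | set_b)[:size]
    shared = sum(1 for h in sample if h in set_a and h in set_b)
    return shared / len(sample)


def passes_filter(
    message_sketch: list[int],
    target_sketches: list[list[int]],
    threshold: float = 0.97,
) -> bool:
    """Return True if the message is similar enough to at least one target."""
    return any(
        jaccard_estimate(message_sketch, target) >= threshold
        for target in target_sketches
    )
```

Only sketches are compared at this stage, so the full genome is fetched from the peer-to-peer network only for messages that pass the filter.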
The software can be run on any UNIX-based operating system. Operation of DarkQ requires less than 1 GB of RAM and a single core. Details of the workflow can be found in the README file.12
To test DarkQ in a monitoring system, we collected 9,415 genomes of Klebsiella pneumoniae, a pathogen considered an urgent global threat due to extensive antimicrobial drug resistance (CDC, AR threats report, 2019),9 and published them via DarkQ. We simulated a consumer subscribing to all messages from the Klebsiella genus and filtering the received messages using an isolate from a local outbreak at a large tertiary hospital in 2010.10 A total of 1,461 messages met both the routing key and the minimum genome similarity criterion of 0.97 at a k-mer size of 51, which is typically used to estimate genomic distance at the strain level.11 After the original genomes had been downloaded from the peer-to-peer network, they were further filtered and refined, resulting in a time-dated phylogeny (Figure 1B). The consumer thus received genomes from a total of 26 studies. Two of these studies contained genomes that belonged to the same outbreak clone as the isolate the consumer used to filter the genomes. Further work is needed to investigate this relationship more thoroughly; however, an initial assessment was already possible using the mechanics implemented in DarkQ. All methods used in this use case are available elsewhere.12
DarkQ allows a user to monitor genomic data with a simple user interface, efficient genome compression, filter-based message routing, and fast download of corresponding genomes using a decentralized peer-to-peer network. The proof-of-concept outlined here scales to thousands of genomes and could be particularly valuable in the context of pathogen outbreaks. However, our approach can also be used to disseminate research more broadly.
NCBI BioProject: Context-aware genomic surveillance reveals hidden transmission of a carbapenemase-producing Klebsiella pneumoniae. Accession number PRJNA742413. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA742413.
Source code available from: github.com/phiweger/darkq.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.5503447.13
License: BSD-2 license.
We thank Luiz Irber and C. Titus Brown (University of California, Davis) for insightful discussions of the concepts presented in this article. An earlier version of this article can be found on bioRxiv (https://doi.org/10.1101/2020.11.12.379560).
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics, data structures
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: I am one of the developers of sourmash, which is used within the workflow presented here.
Reviewer Expertise: Bioinformatics, MinHash sketching.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | 1 | 2 |
---|---|---|
Version 1 (01 Oct 21) | read | read |