Keywords
Metagenome, microbiome, antibiotic, resistance, bacterial, genomic signature, MDR, workflow
This article is included in the Pathogens gateway.
This article is included in the Antimicrobial Resistance collection.
This article is included in the Hackathons collection.
Antimicrobial resistance (AMR) of bacterial pathogens is a growing public health threat around the world. Most concerning are multidrug-resistant (MDR) bacteria, which have become more prevalent in recent decades [1]. Well-known examples include methicillin-resistant Staphylococcus aureus (MRSA), vancomycin-resistant S. aureus, extended-spectrum beta-lactamase (ESBL)-producing bacteria, and vancomycin-resistant Enterococcus. MRSA is prevalent in surgical and maternity hospitals and in nursing homes, where it is often associated with hospital-acquired infections with high morbidity and mortality [2]. The current method of determining whether a patient has an MDR infection relies on culturing a patient-derived sample in the presence of different antibiotic drugs [3]. This is a slow process that can take days to weeks, putting the patient at risk of not receiving the correct antibiotic in time.
There are several known mechanisms by which bacteria acquire AMR. One common mechanism is horizontal gene transfer (HGT), which includes plasmid-, phage-, transposon-, and integron-mediated resistance [4]. A second major mechanism involves single-nucleotide polymorphisms (SNPs) in chromosomal genes that can alter antibiotic binding sites [5].
High-throughput whole genome sequencing (WGS) of microbiomes is a state-of-the-art method for studying complex microbial communities, such as the human gut. WGS creates large raw data sets, which must be processed quickly and efficiently to guide clinicians toward the most effective treatment strategy for a given patient. However, simple, clinically applicable bioinformatics methods that provide a fast, reusable, reproducible, and scalable pipeline for locating AMR genomic signatures in large metagenomics datasets from the NCBI Sequence Read Archive (SRA) and other public sources are still lacking. Such pipelines could also be used to crowdsource this analysis, for example with undergraduate students. Determining efficient strategies for antibiotic usage is a keystone of modern antibacterial therapy and prevention [6].
In the last few years, a variety of tools have been developed for AMR detection in both complete genomes and metagenomes. Existing detection methods for AMR genomic signatures include ResFinder [7], PointFinder [8], SSTAR [9], DeepARG [10], ARIBA [11] and ResCap [12]. Another approach is the Galaxy-based pipeline AMR++. All detection methods depend on the availability of collections of known AMR genomic signatures. These signatures are either searched for directly or used to build models for detecting novel AMR genes/loci. One of the most up-to-date manually curated databases is the Comprehensive Antibiotic Resistance Database (CARD) [13]; others include ResFinder [7], ARG-ANNOT [14], and MEGARes [15]. Although some of these tools provide user-friendly web interfaces and accept both FASTA and FASTQ files as input, they do not use the power of the command line. Moreover, these solutions are not universal: ResFinder, for example, searches only for HGT-mediated resistance, whereas its successor PointFinder looks only for AMR caused by chromosomal point mutations. Other disadvantages of the existing solutions include an inability to work with big datasets or multiple raw sequence files, slow speed, and poor handling of metagenomic data.
The primary objective of this project is to design a reliable system for rapid diagnostics and prompt treatment of patients with MDR infections. At the heart of the system is a reusable, reproducible, scalable, and interoperable workflow to locate AMR genomic signatures in SRA shotgun sequencing (including metagenomics) datasets. To simplify this task we used only RefSeq reference genomes for the bacterial pathogens important for public health, but the pipeline can be scaled to include databases for other microbes, viruses, and fungi. The result, NastyBugs, is a new program that identifies which type of antimicrobial resistance is most likely present in a metagenomic sample, allowing both smarter drug selection by clinicians and faster research in an academic environment. NastyBugs is a framework created during the National Center for Biotechnology Information (NCBI) Hackathon in August 2017.
The detailed workflow to extract antimicrobial resistance gene signatures is described in Figure 1.
Three BLAST databases for downstream analysis were created using the latest versions of: 1) RefSeq human genome assembly GRCh37/UCSC hg19 (RefSeq accession no. GCF_000001405.37); 2) RefSeq bacterial genomes; 3) CARD [13]. For comparison purposes, CARD was split into two databases: one consisting of genes and the other of SNPs.
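As an illustration of the database step, the sketch below assembles `makeblastdb` command lines for the human, bacterial, and two CARD databases. It is not taken from the NastyBugs repository: the FASTA file names and database names are hypothetical placeholders, and the commands are only printed, not executed.

```python
from typing import Dict, List


def makeblastdb_command(fasta_path: str, db_name: str) -> List[str]:
    """Build the argument list for a nucleotide BLAST database.

    -parse_seqids keeps the sequence identifiers from the FASTA headers,
    so that alignments can later be traced back to reference sequences.
    """
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "nucl",
        "-parse_seqids",
        "-out", db_name,
    ]


# Hypothetical FASTA file names; substitute the files actually downloaded.
DATABASES: Dict[str, str] = {
    "human": "GRCh37.fa",
    "bacteria": "refseq_bacteria.fa",
    "card_genes": "card_genes.fa",
    "card_snps": "card_snps.fa",
}

commands = [makeblastdb_command(fasta, name) for name, fasta in DATABASES.items()]

for cmd in commands:
    print(" ".join(cmd))
    # To build the database for real (requires BLAST+ to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Keeping the CARD gene and SNP sequences in separate databases, as described above, lets the two kinds of signature be searched and reported independently.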
Further analysis consists of three steps: 1) Host (human) reads removal; 2) Antimicrobial resistance signature identification; 3) Bacterial identification and characterization. Steps 2 and 3 are performed in parallel. The input is one or more SRA run accession numbers (ERR or SRR) for the metagenome of interest. Alternatively, FASTQ files from local storage can be used.
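The two input options can be told apart with a simple check. The following is a minimal sketch, assuming that run accessions consist of the prefix ERR or SRR followed by digits and that local files carry a common FASTQ extension; the function name and the list of extensions are illustrative and not part of the published pipeline.

```python
import re
from pathlib import Path

# Run accessions as described in the text: "ERR" or "SRR" followed by digits.
SRA_RUN_PATTERN = re.compile(r"^(ERR|SRR)\d+$")

# Assumed set of FASTQ file extensions (plain or gzip-compressed).
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def classify_input(value: str) -> str:
    """Return "sra" for a run accession or "fastq" for a local FASTQ path."""
    if SRA_RUN_PATTERN.match(value):
        return "sra"
    if Path(value).name.endswith(FASTQ_SUFFIXES):
        return "fastq"
    raise ValueError(f"Not an ERR/SRR accession or a FASTQ file: {value!r}")


for item in ["ERR1600439", "SRR5239736", "reads/sample.fastq.gz"]:
    print(item, "->", classify_input(item))
```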
Reads were aligned to the human genome (GRCh37/hg19) using STAR [16] or Magic-BLAST. Reads that mapped were removed, and the unmapped, non-human reads (treated as bacterial) were collected with SAMtools for further analysis.
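The host-removal logic can be illustrated in a few lines. The sketch below is a pure-Python stand-in for the SAMtools step, not the pipeline's own code: it keeps only SAM records whose FLAG field has the "segment unmapped" bit (0x4) set, which is the same criterion as `samtools view -f 4`. The example records are made up for demonstration.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_records(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM alignment records that did not map to the host genome."""
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & UNMAPPED_FLAG:
            yield line


# Made-up records: read1 aligned to the human genome, read2 did not.
example_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGTAA\tIIIIIIIIII",
]

non_human = list(unmapped_records(example_sam))
for record in non_human:
    print(record.split("\t")[0])
```

For paired-end data, a stricter rule (for example, requiring both mates to be unmapped) may be preferable; that choice is left open here.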
To remove adapters/linkers/barcodes we used FASTX Clipper and Trimmer. The non-human reads are then mapped again using Magic-BLAST, this time to the bacterial genes/SNPs from CARD. This allows for the identification of genes and SNPs that can lead to antimicrobial resistance in the bacterial population. The mapped reads were sorted and their abundance was summed.
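The abundance summation can be sketched as a tally of mapped reads per reference sequence. This is an illustrative example that assumes SAM-formatted alignments against the CARD database; the reference names `geneA` and `geneB` are hypothetical, and the handling of secondary alignments is deliberately left out.

```python
from collections import Counter
from typing import Iterable

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def count_reads_per_reference(sam_lines: Iterable[str]) -> Counter:
    """Count mapped reads for each reference sequence (SAM column 3)."""
    counts: Counter = Counter()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        flag, reference = int(fields[1]), fields[2]
        if flag & UNMAPPED_FLAG or reference == "*":
            continue  # read did not hit any CARD sequence
        counts[reference] += 1
    return counts


# Made-up alignments against two hypothetical CARD entries.
example_sam = [
    "r1\t0\tgeneA\t10\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tgeneA\t55\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t0\tgeneB\t7\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

abundance = count_reads_per_reference(example_sam)
# most_common() returns the references sorted by decreasing read count.
for reference, n_reads in abundance.most_common():
    print(reference, n_reads)
```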
The identification of bacterial species and their abundance is carried out in parallel. For this step we again used Magic-BLAST, this time against the database of NCBI RefSeq reference bacterial genomes. The resulting list of species was visualized using Krona [17].
The TAB-delimited output file contains five columns: 1) RefSeq accession number; 2) genus; 3) resistance gene; 4) ARO (Antibiotic Resistance Ontology) accession number; 5) score (number of mapped reads per 1 kb). The data can be used to construct a scatter plot showing the relative abundance of antimicrobial resistance and the corresponding bacterial species in the metagenomic sample analysed.
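To make the score column concrete, the sketch below normalizes a read count by gene length and writes one row in the five-column layout described above. The column names, the header row, and every value in the example row are placeholders chosen for illustration; the actual file produced by the pipeline may differ in these details.

```python
import csv
import io
from typing import Iterable, Sequence, TextIO


def reads_per_kb(mapped_reads: int, gene_length_bp: int) -> float:
    """Number of mapped reads per 1 kb of reference sequence."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return mapped_reads / (gene_length_bp / 1000)


# Assumed column names; the header row itself is an assumption.
COLUMNS = ["refseq_accession", "genus", "resistance_gene", "aro_accession", "score"]


def write_report(rows: Iterable[Sequence], handle: TextIO) -> None:
    """Write the report as a TAB-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Placeholder row: 30 reads mapped to a 1500 bp gene gives a score of 20.0.
example_rows = [
    ("NC_000000.0", "ExampleGenus", "exampleGene", "ARO:0000000",
     reads_per_kb(30, 1500)),
]

buffer = io.StringIO()
write_report(example_rows, buffer)
print(buffer.getvalue(), end="")
```

Normalizing by length keeps long genes from appearing more abundant simply because they offer more positions for reads to map to.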
The documented workflow includes a script that runs the tools in Docker containers.
We used the following dependencies: 1) Magic-BLAST v. 1.3, a novel tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome; 2) SAMtools v. 1.3.1, a popular suite of programs for interacting with high-throughput sequencing (HTS) data; 3) FASTX-Toolkit v. 0.0.13, a collection of command-line tools for preprocessing short-read FASTA/FASTQ files; 4) STAR v. 2.5.3a, an RNA-seq aligner; and 5) Krona v. 2.7, a tool for visualizing metagenomic data as pie charts.
To validate our pipeline we used two human gut metagenomic datasets, SRA acc. no. ERR1600439 and SRR5239736. The SRR5239736 sample was used to compare our results with those obtained by ResCap [12]. For metagenome sample SRR5239736, 24% of reads were mapped to the gene database of CARD and 1.6% of reads were mapped to the SNP database of CARD.
Magic-BLAST, a novel program from the BLAST family, can provide a faster and more accurate way to align reads of interest with reference sequences. A quick comparison showed that Magic-BLAST was at least 10-fold faster than STAR at mapping SRA reads to the human genome. For that reason, we chose not to use STAR in the pipeline.
The results obtained showed that antibiotic resistance signatures were identified with high efficiency in the studied samples. However, the presented workflow can be improved. Planned improvements include: 1) optimization of the pipeline; 2) additional large-scale validation on different metagenomic samples; 3) representation of results with more information; 4) adding information about proteins participating in AMR; and 5) prediction of novel resistance genes using hidden Markov models.
Moreover, implementation of machine learning analysis may provide additional capabilities.
The pipeline can be used to efficiently identify the presence of antimicrobial resistance genes, which in turn can be used as features for downstream machine learning analysis. One useful application of machine learning in antimicrobial resistance is predicting the appropriate antimicrobial therapy for a critically ill patient. For these patients, the time taken to administer an appropriate antibiotic agent inversely correlates with improved patient outcomes [18]. Whole genome sequencing of microbial isolates, followed by identification of antimicrobial resistance genes and machine learning prediction, provides an attractive solution to this problem. A previous application in this regard used a simple rules-based approach and a logistic regression model [19]. More sophisticated, non-linear supervised machine learning methods, such as random forests, gradient boosting, and artificial neural networks, may play a key role in producing accurate predictions for clinical use. Artificial neural networks, such as convolutional and recurrent neural networks, are particularly interesting as they may be able to extract novel features.
The code for the pipeline is publicly available on GitHub: https://github.com/NCBI-Hackathons/MetagenomicAntibioticResistance.
Archived source code as at time of publication: http://doi.org/10.5281/zenodo.1020266 [20]
License: MIT
Ben Busby was funded by the Intramural Research Program of the NLM. Hsinyi Tsang was funded by NCI Contract #D14PD00826.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors would like to acknowledge NCBI for hosting the hackathon, and Grzegorz Boratyn, Sean Davis and Lisa Federer for technical discussions.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
No
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
No
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
No
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
No
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Version 1 (08 Nov 17): two invited reviewers.