NastyBugs: A simple method for extracting antimicrobial resistance information from metagenomes [version 1; peer review: 2 approved with reservations]

Multidrug resistant bacteria are becoming a major threat to global public health. While there are many possible causes for this, there have so far been few adequate solutions to this problem. One of the major causes is a lack of clinical tools for efficient selection of an antibiotic in a reliable way. NastyBugs is a new program that can identify what type of antimicrobial resistance is most likely present in a metagenomic sample, which will allow for both smarter drug selection by clinicians and faster research in an academic environment.


Introduction
Antimicrobial resistance (AMR) of bacterial pathogens is a growing public health threat around the world. Most concerning are multidrug resistant (MDR) bacteria, which have become more prevalent in recent decades 1 . Well known examples of these pathogens include methicillin-resistant Staphylococcus aureus (MRSA), vancomycin-resistant S. aureus, extended spectrum beta-lactamase, and vancomycin-resistant Enterococcus. MRSA is prevalent in surgical and maternity hospitals and nursing homes, where it is often associated with hospital-acquired infection with high morbidity and mortality 2 . The current method of determining if a patient has a MDR infection is based on being able to culture a patient-derived sample in the presence of different antibiotic drugs 3 . This is a slow process that can take days to weeks, which can put the patient in danger of not receiving the correct antibiotic in time.
There are several known mechanisms by which bacteria acquire AMR. One common mechanism involves acquiring resistance through horizontal genetic transfer (HGT), which can include plasmid-, phage-, transposon-, and integron-mediated resistance 4 . The second major mechanism involves SNPs in chromosomal genes that can result in a change in antibiotic binding sites 5 .
High-throughput whole genome sequencing (WGS) of microbiomes is a state-of-the-art method for studying complex microbial communities, such as the human gut. WGS creates large raw data sets, which must be processed quickly and efficiently to guide clinicians for the best and most efficient treatment strategy for the given patient. However, simple, clinically applicable bioinformatics methods that can provide a fast, reusable, reproducible, scalable pipeline to locate AMR genomic signatures in large metagenomics datasets from the NCBI Sequence Read Archive (SRA) and other public datasets are still lacking. Such pipelines could also be used for crowdsourcing of this analysis, such as with undergraduate students. The problem of determining efficient strategies for antibiotic usage is the keystone of the modern antibacterial therapy and prevention 6 .
In the last few years, a variety of different papers and tools have been developed that exploit AMR detection for both complete genomes and metagenomes. Some of the existing detection methods for AMR genomic signatures include: ResFinder 7 , PointFinder 8 , SSTAR 9 , DeepARG 10 , ARIBA 11 and ResCap 12 . Another approach is Galaxy-based pipeline Amr++. All detection methods depend on the availability of collections of known AMR genomic signatures. These signatures are then directly searched for, or models are generated for the detection of novel AMR genes/ loci. One of the most updated manually curated databases is the Comprehensive Antibiotic Resistance Database (CARD) 13 ; others include ResFinder 7 , ARG-ANNOT 14 , and MegaRES 15 . Although some of these tools provide user-friendly web interfaces and use both FASTA and FASTQ files as input, they do not use the power of command line. Moreover, these solutions are not universal, e.g. ResFinder searches for only HGT-mediated resistance, whereas its successor PointFinder only looks for AMR caused by chromosomal point mutations. Other disadvantages of the existing solutions include an inability to work with big datasets or multiple raw sequence files, slow speed, and poor handling of metagenomic data.
The primary objective of this project is designing a reliable system for rapid diagnostics and prompt treatment of patients with MDR infections. At the heart of the system is a reusable, reproducible, scalable, and interoperable workflow to locate AMR genomic signatures in SRA shotgun sequencing (including metagenomics) datasets. To ease this task we used only RefSeq reference genomes for the bacterial pathogens important for public health, but the pipeline can be scalable to include databases for other microbes, viruses, and fungi. The result, NastyBugs, is a new program that can identify what type of antimicrobial resistance is most likely present in a metagenomic sample, which will allow for both smarter drug selection by clinicians and faster research done in an academic environment. NastyBugs is a framework created during the National Center of Biotechnology Information Hackathon in August 2017.

Methods
The detailed workflow to extract antimicrobial resistance gene signatures is described in Figure 1.
Three BLAST databases for downstream analysis were created using the latest versions of: 1) RefSeq human genome assembly GRCh37/UCSC hg19 (RefSeq accession no. GCF_000001405.37); 2) RefSeq bacterial genomes; 3) CARD 13 . For comparison purposes, CARD was used as two databases: one consisted of genes and another of SNPs.
Further analysis consists of three steps: 1) Host (human) reads removal; 2) Antimicrobial resistance signature identification; 3) Bacterial identification and characterization. Steps 2 and 3 were performed in parallel. Input data is SRA accession numbers (ERR or SRR) of the metagenome of interest. Another option is using FASTQ files from local storage.
Host reads removal Using STAR 16 or Magic-BLAST, all reads mapped to human genome GRCh37/hg19 were removed, and unmapped non-human reads (considered as bacterial) were collected using SAMtools for further analysis.
Antimicrobial resistance signature identification To remove adapters/linkers/barcodes we used FASTX Clipper and Trimmer. The non-human reads are mapped again using Magic-BLAST; however this time they are mapped to bacterial genes/ SNPs from the CARD. This allows for the identification of genes and SNPs that can lead to antimicrobial resistance in the bacterial population. Obtained reads were sorted and the sum of read abundance was calculated.

Bacterial identification and characterization
The identification of bacterial species and abundance is carried out in parallel. For that we again used Magic-BLAST and the database of NCBI RefSeq reference bacterial genomes. The resulting list of species was visualized using Krona 17 .

Output format
The output TAB-delimited formatted file contains data in five columns: 1) RefSeq accession number; 2) genus; 3) resistance gene; 4) ARO (Antibiotic Resistance Ontology) accession number; 5) score (number of mapped reads per 1kb). The data can be used for constructing a scatter plot showing relative abundance of antimicrobial resistance and the corresponding bacterial species in metagenomic sample analysed.

Dependencies
The documented workflow contains the script with containerized tools in Docker.
We used the following dependencies: 1) Magic-BLAST v. 1.3, a novel tool allowing mapping of large next-generation RNA or DNA sequencing runs against a whole genome; 2) SAMtools v. 1.3.1, a popular suite of programs for interacting with HTS data; 3) FASTX-Toolkit v. 0.0.13, a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing; 4) STAR v. 2.5.3a, RNA-seq aligner; and 5) Krona v. 2.7, a tool for metagenomic pie chart visualization.

Use case
To validate our pipeline we used two human gut metagenomic datasets, SRA acc. no. ERR1600439 and SRR5239736. The SRR5239736 sample was used for comparison of our results with the results obtained by ResCap 12 . For metagenome sample SRR5239736, 24% of reads were mapped to the gene database of CARD and 1.6% reads were mapped to the SNP database of CARD.
Magic-BLAST, a novel program from the BLAST family, can provide a faster and more accurate way to align reads of interest with reference sequences. A quick comparison of STAR and Magic-BLAST showed at least a 10-fold difference in speed increase with Magic-BLAST for mapping of SRA reads to human genome compared to STAR. For that reason, we chose not to use STAR in the pipeline.

Conclusion and next steps
Obtained results showed high efficiency of identification of antibiotic signatures in the studied samples. However, the presented workflow may be improved. Planned improvements will include: 1) optimization of the pipeline; 2) additional large-scale validation for different metagenomic samples; 3) representation of results with more information; 4) adding information about proteins participating in AMR; and 5) prediction of novel resistance genes using Hidden Markov Model.
Moreover, implementation of machine learning analysis may provide additional capabilities.
The pipeline can be used to efficiently identify the presence of antimicrobial resistant genes, which in turn can be used as features for further downstream machine learning analysis. One useful application of machine learning in antimicrobial resistance is the prediction of the appropriate antimicrobial therapy to apply to a critically ill patient. For these patients, the time taken to administer an appropriate antibiotic agent inversely correlates with improved patient outcomes 18 . Whole genome sequencing of microbial isolates, followed by antimicrobial resistance genes identification and machine learning prediction provides an attractive solution to this problem. A previous application in this regard applied a simple rules-based approach and a logistic regression model 19 . More sophisticated, non-linear supervised machine learning methods, such as random forests, gradient boosting, and artificial neural networks may play a key role in producing accurate predictions for clinical use. Artificial neural networks, such as convolutional and recurrent neural networks, are particularly interesting as they may be able to extract novel features.

Data and software availability
The code for the pipeline is publically available on GitHub: https:// github.com/NCBI-Hackathons/MetagenomicAntibioticResistance.

Competing interests
No competing interests were disclosed.

Grant information
Ben Busby was funded by the Intramural Research Program of the NLM. Hsinyi Tsang was funded by NCI Contract #D14PD00826.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Open Peer Review
Introduction comments: Extended-spectrum beta-lactamase (ESBL) is associated with what bacteria? (e.g., Enterobacteriaceae?) Please define the acronym 'SNP' (Single Nucleotide Polymorphism) the first time it is used within the manuscript. PointFinder is not the successor of ResFinder, they serve different purposes.

Methods comments:
It is not immediately clear when STAR or Magic-Blast will be selected for read mapping. Or is STAR not used in the current pipeline? There seems to be conflicting comments about the use of STAR in the manuscript ("Methods" versus "Use Case").
The pipeline screens out human reads prior to further analysis; however, the authors might also consider removing PhiX reads in order to get cleaner output in your Krona plots.
Bacterial species composition is determined using Bacterial RefSeq and Magic-Blast. How does this compare to tools like Kraken? Did the authors consider using a different tool than Magic-Blast for determining bacterial species composition? Is there a way to link bacterial species to identified AMR genes?
The paper does not mention what cut-offs are being used for reporting AMR genes in terms of coverage and percentage sequence similarity. Is this something that can be adjusted by the user?

Output format comments:
https://doi.org/10.5256/f1000research.13848.r27750 © 2017 Seemann T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Torsten Seemann
University of Melbourne, Melbourne, Vic, Australia "A Software Tool Article should include the rationale for the development of the tool and details of the code used for its construction. The article should provide examples of suitable input data sets and include an example of the output that can be expected from the tool and how this output should be interpreted." The authors present a software tool which claims to detect AMR genes from raw metagenomic FASTQ reads. However, there are several problems with the manuscript and the source code.
The discussion of existing tools, methods and databases is confusing. The term "AMR genomic signature" is not defined. ResCap is mentioned, but that is a wet lab set of probes, not a bioinformatics method per se. CARD does have a command line tool, it is called "RGI". This section would benefit from a table of methods which explain if they are a tool, a database, or both; whether they work with contigs or FASTQ reads; and whether they work on bacterial isolates or metagenomes.
The algorithm is poorly explained. Why was Magic-BLAST chosen as an aligner? Did you turn splicing off (as bacteria don't have introns). It is stated that reads were removed of adaptors etc using Fastx_Clipper but no details of which adaptor files were used. The database used was CARD, but not explaination of how it was split into HGT and SNP/Variant parts. Alignment to Bacterial RefSeq is also done to get a species distribution, but it is unclear why or how, and how it links to the genes found. No examples of output results are provided, nor is it described how you identify the SNPs.
I was unable to install and run the software. The github site https://github.com/NCBI-Hackathons/MetagenomicAntibioticResistance is very confusing also. It seems I have to use Docker to just get MagicBlast, but it doesn't include all the other dependencies. The primary script is called "main.h" but it appears to have lots of hard-coded parameters (eg. -threads 12) and refers to reference database files that do not exist. It also has bash syntax errors like "var= wc -l file" which should probably be "var=`wc -l file`" or "var=$(wc -l file)" -it appears something has lost characters in the file?The docs say "The pipeline use three databases that should be downloaded with the script". But there is no such script. I tried hard to find more info, but I was unable to.
In summary, more detail is needed, and the software needs to be installable and have some basic tests.

Is the rationale for developing the new software tool clearly explained? Partly
Is the description of the software tool technically sound?