Keywords
Antimicrobial resistance, bioinformatics, next-generation sequencing, benchmarking
This article is included in the Cell & Molecular Biology gateway.
This article is included in the Bioinformatics gateway.
This article is included in the Pathogens gateway.
This article is included in the Antimicrobial Resistance collection.
Antimicrobial resistance, bioinformatics, next-generation sequencing, benchmarking
This version contains text additions following the suggestions and comments from the reviewers in their referee reports. These additions include:
- A revised version of sections: General considerations, Section 3, and Conclusions.
- Correction of wrong numbering of sections of the manuscript.
- Update of references and insertion of the suggested ones
                    See the authors' detailed response to the review by Rene Hendriksen
                    See the authors' detailed response to the review by Anna Abramova and Marcus Wenne
                    See the authors' detailed response to the review by Enrico Lavezzo and Emilio Ispano
            
The technological advances in Whole Genome Sequencing (WGS) and the increasing integration of Next Generation Sequencing (NGS) platforms in the arsenal of testing laboratories is having a profound impact on health sciences. Affordable human genome sequencing is bringing about an era of improved diagnostics and personalised healthcare. For microorganisms, the reliable characterisation of their genetic material allows improved insights in their identity and physiology. For example, once sequenced, the genome of a microorganism can also be used to (re-)identify the species and infer important phenotypic properties, such as virulence, resistance to antibiotics, typing and other adaptive traits. In addition, novel strategies for the implementation and analysis of NGS data are being developed and improved, and they can be used, for instance, to reconstruct the timeline and relationships between the cases of an infectious disease outbreak, which is something difficult to achieve with classical microbiological techniques.
An important aspect of the implementation of NGS technologies are considerations related to quality and consistency (see 1), in particular if the result of the method is to be used in a regulatory context (for example, in a monitoring framework) or, more importantly, in a clinical setting linked to decisions on medical treatments2–4, veterinary, agricultural or environmental interventions and food safety5,6 which may be linked under One Health initiatives.
Methods for predicting antimicrobial resistance (AMR) based on genetic determinants from NGS data rely on complex bioinformatics algorithms and procedures to transform the large output produced by the sequencing technologies into relevant information. Traditionally, regulatory implementation of analytical methods focuses on harmonisation of the protocol and the subsequent steps of analysis, i.e. ensuring the implementation of specific methods previously validated according to a set of criteria. For methods with important bioinformatics components, this is often not optimal, due to both the large variability in the developed strategies, variations in the particular computational resources available and the speed at which technologies and analytical approaches evolve. For the prediction of AMR determinants, very different strategies have been proposed, processing the sequencing data either as a set of reads or as pre-processed assemblies7,8, even using neural networks9; sometimes, the system itself is proprietary and operates as a “black box” from the point of view of the user. In such cases like this, it has been proposed to approach the quality assurance challenge through performance-based evaluations, i.e. ensuring that the implemented methods, although different, perform at a similar (acceptable) level in this context10. The same performance-based evaluation can then be applied whenever a component of the pipeline, or its environment, is replaced or updated.
An important component for a performance-based evaluation scheme is the availability of resources (in particular, datasets) that enable these evaluations11–13. In 2017, the Joint Research Centre (JRC) initiated a reflection on the subject by inviting experts in the field of AMR detection with NGS from the four compartments of a “One Health” perspective, i.e. clinics, food, animals and the environment14,15. These discussions led to a compilation of the challenges involved in the development of a benchmark strategy for bioinformatics pipelines, both for NGS-based approaches in general and in this specific field of application16. These challenges were grouped into often overlapping categories, including the nature of the samples in the dataset (e.g. their origin, quality and associated metadata), their composition (e.g. the determinants and species to include), their use (e.g. expected results and performance thresholds) and their sustainability (e.g. their development, release and update).
On the 27th and 28th of May 2019, the JRC held a follow-up meeting, including most of the authors of the original article and additional experts that expressed interest, to discuss and propose solutions to the identified challenges for AMR detection using next generation sequencing. The present article represents a summary of these discussions and the conclusions reached. We propose this document as a baseline for a roadmap and guidelines to harmonise and standardise for the generation of the benchmark resources in the field of AMR.
An important observation that arose from the two-day discussions is that the concept of benchmarking, even when focusing on a single component of the method (i.e. the bioinformatics pipeline), may refer to different activities that can vary in their scope and objectives (see also 17–19). Clarifying these scopes is crucial when proposing recommendations, as these (and the final datasets) will be influenced by the scope of the evaluation.
In the conclusions of Angers et al. article, the use of the benchmark resources was reported as follows: “(1) Ensuring confidence in the implementation of the bioinformatics component of the procedure, a step currently identified as limiting in the field. (2) Allowing evaluation and comparison of new/existing bioinformatics strategies, resources and tools. (3) Contributing to the validation of specific pipelines and the proficiency testing of testing laboratories and (4) “Future-proofing” bioinformatics pipelines to updates and replacement of tools and resources used in their different steps.”14.
These four summarising points made above, in practice, cover two different questions: 1, 3 and 4 (implementation, validation, proficiency testing and future proofing) ask whether the bioinformatics pipeline performs as expected, while 2 (evaluation/comparison) focuses on identifying gold standard pipelines and resources for implementation. The first scope of a benchmark resource would thus address the question: “Which pipeline performs best and at least by the agreed minimum standards?” A second scope addresses the question: “What is the quality of the information produced by the implemented bioinformatics pipeline?”
The latter question requires further refinement, based on the “what” the pipeline is “required” to achieve. Although there may be different contexts to the use of the methods (e.g. guide clinical intervention, contribute data to a monitoring framework, outbreak management, monitor the spread of AMR genes (ARGs) in or between different settings/environments, etc.), for a benchmark resource, these can be split in three main use cases:
to predict the resistance of a pathogen of interest based on the identification of ARGs (either acquired ARGs and/or chromosomal point mutations conferring AMR);
to identify the link of ARGs to a specific taxon in a metagenomic sample (i.e. taxon-binning approaches like described by Sangwan et al.20);
to identify the complete repertoire of the AMR determinants (i.e. the resistome) in a complex bacterial community from which total genomic DNA has been extracted and sequenced (i.e. the metagenome).
Finally, another important scope for a benchmark resource was identified, having, once again, an impact on the decisions regarding the benchmark dataset: “How robust is the bioinformatics pipeline?” Studies addressing this question focus on identifying how the pipelines can tolerate variation in characteristics of the input data, most often related to the quality of the sample or sequencing steps: robustness against contamination or low number/poor quality reads, for example. Robustness, in certain contexts, could also be seen as the effect (or lack of) of swapping a tool (or the reference database) at a certain step in the pipeline for a different one that is functionally equivalent (see, for example,21).
In summary, it is important to be specific about the purpose and scope of the benchmark resource in the decisions taken when generating the datasets. We propose that the scope of a benchmark has three major parts, summarised in Figure 1.

When discussing the different challenges described in 16, rarely can an absolute “best” answer be identified for a given question; recommendations thus need to be made, taking into account the specific purpose of the benchmark resource and the fact that they may evolve with the state-of-the-art in the field.
Still, some general observations and conclusions were proposed regarding difficulties like what type of AMR mechanisms to include, which pathogens to consider, lack of expertise and of harmonisation, rapid evolution of the fields of informatics and bioinformatics in parallel to progress of scientific knowledge on AMR. They are summarised in this section and, by discussing two main use cases for benchmarking resources (single isolates and mixed samples), represent proposals for a way through these and other challenges with common, transparent, performance-based evaluation approaches, to be applied in different fields in a “One Health” perspective.
A quick analysis of the different NGS platforms currently available and in development makes it obvious that the set of reads that they produce have very different characteristics. In addition, each platform has its strengths and weaknesses. Both the error rate (about 0.1% for Illumina (RRID:SCR_010233), 1–10% for newer technologies like from Pacific Biosciences and Oxford Nanopore Technologies (RRID:SCR_003756)) and the types of errors (miscalls, insertions or deletions, or problems of particular motifs such as homopolymer sequences) vary according to the platform used. The average length of the reads can vary from hundreds (Illumina, Ion Torrent) to thousands (PacBio, Nanopore) of base pairs22–24.
Bioinformatics pipelines are thus usually designed to handle the output of a specific platform, often in a certain configuration. Although exceptions exist (e.g. 25,26), in the context of a benchmark resource (and independently of the question asked), we thus believe that different datasets are needed for each of the different NGS platforms, each containing reads that have a profile that matches closely with the normal output of the respective technologies. It is important, in this case, to ensure that the datasets produced for the different platforms are developed so that they do not inadvertently introduce any bias resulting in apparent difference in performance among the different platforms due to composition of the benchmark dataset (e.g. when bioinformatics pipelines analysing the output of different platforms are compared). Although the absence of bias may be hard to demonstrate a posteriori, efforts should be made to ensure that the datasets derive from strategies that are as similar as possible, for example by containing reads generated from the same input samples.
The platforms for which benchmark datasets are produced should be selected based on pragmatic considerations. Ideally, equivalent resources should be available for all technologies; however, it may be necessary to prioritize the most common platforms if resource limitations are an issue. Recent surveys have shown a clear preference for the Illumina platform in this context27,28. The same trend can be observed when counting the number of published articles in a scientific literature database (Figure 2). That being said, sequencing technologies are constantly evolving, and benchmark datasets should be extended to new technologies with improved performance as these are increasingly adopted by testing laboratories.

Source: Scopus, using the search: ALL ("X" AND "antimicrobial resistance").
The so-called “Third Generation Sequencing” technologies that sequence single DNA molecules and, importantly, produce long reads, have been shown to provide substantial benefits in the context of AMR detection. First, many resistance genes are located on plasmids, which are challenging to assemble using short-read sequencing technologies, because the short read lengths do not allow spanning of repetitive regions29. The presence of an AMR determinant on a plasmid is also important for its transfer and eventual spread, and thus their correct assembly using long-read technologies represent a substantial advantage30–34. In addition, the proper and timely treatment of a pathogen infection is critical for successful prevention and control of diseases in clinical settings as well as in the community. In line with this, the Nanopore sequencing technology has shown the promise of providing accurate antibiotic resistance gene identification within six hours of sample acquisition35–37 We thus propose to include DNA Nanopore sequencing as an additional priority platform to develop benchmark resources.
The choice of formats for the different components of the datasets is also important. Each instrument produces a raw data output in a specific format (for example, the Illumina platforms generate raw data files in binary base call (BCL) format, while the Nanopore platforms produce FAST5 (HDF5) files). However, the entry point of most bioinformatics pipelines in this context is the description of the sequence of the produced reads, with an indication of the quality (or confidence) score for each of the base positions. The FASTQ format is a standard format in this context38, which should be used in the benchmark resources; many tools exist to convert the raw data output files into this format in case of different platform outputs (see, for example,39,40) although, it should be noted, different tools may produce different results and this step should be carefully planned.
Other standard formats exist to describe intermediate states of processing, for example for the description of assembled contigs or variant calling41. However, using these formats would make an a priori assumption about the strategy of the bioinformatics pipeline that may not be universal; indeed, not all reported solutions involve assembling reads, or mapping them to reference genomes or databases (see, for example,42,43).
Three main sources of data for creating a benchmark dataset were identified. The first is to simulate the output (reads) in silico using an input sequence of a resistant pathogen and a specialised software. The second is to use the archived output of previously performed experiments that are available in different repositories. The third is to perform NGS experiments on biological samples.
Although the disadvantage of simulating in silico data is obvious (it is not ‘real’), there are some substantial advantages: it is a lot cheaper than performing sequencing runs, a lot faster, and can be applied to any genome previously sequenced. Thus, many more potential scenarios can be tested, for which the ground truth is well-established (i.e., the annotation of the genome reference that is used: different species, different classes of AMR, different localization of AMR), which usually cannot be done by actually sequencing them. Finally, it is also potentially ‘safer’ to do this for pathogenic bacteria for which high biosafety levels would be required to sequence in a laboratory. However, a major drawback is that simulating variation the way nature evolves is very challenging – genetic variation happens in places in the genome where it is hardest to find.
Many methods and programs have been developed to simulate genetic data. Their use in this context is, in itself, an exercise of Open Science and mechanisms should be used to guarantee quality and reproducibility (see 44) In 2013, Peng et al.45 developed the catalogue “Genetic Simulation Resources” (GSR, available at https://popmodels.cancercontrol.cancer.gov/gsr/) to help researchers compare and choose the appropriate simulation tools for their studies. However, after reviewing the software listed in the GSR catalogue, the authors realised that the quality and usefulness of published simulation tools varied greatly due to inaccessible source code, lack of or incomplete documentation, difficulties in installation and execution, lack of support from authors and lack of program maintenance45. For these reasons, a defined checklist of features that may benefit end users was defined46; the “GSR Certification Program” was developed and recently implemented into the GSR in order to assess simulation tools based on these criteria47. Established criteria are grouped to attribute four “certificates” (https://popmodels.cancercontrol.cancer.gov/gsr/certification/):
Accessibility: it ensures that the simulator is openly available to all interested users and is easy to install and use.
Documentation: it ensures that the simulator is well documented so that users can quickly determine if the simulator provides needed features and can learn how to use it.
Application: it ensures that the software simulator is peer-reviewed, is reasonably user-friendly to be useful to peer researchers, and has been used by researchers in the scientific community.
Support: it ensures that the authors of the simulator are actively maintaining the simulator, addressing users’ questions, bug reports and feature requests.
As of December 2019, the GSR catalogue lists 148 simulators and many of them have been assessed for their compliance with the requirements in order to be certified. Obviously, not all of them are for simulation of NGS reads. In 2016 Escalona et al.48 identified and compared 23 computational tools for the simulation of NGS data and established a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.
By browsing the GSR catalogue, 20 out of 23 tools assessed by Escalona et al. (45) have been recorded, including only one with the four “GSR certificates” (Table 1), i.e. the ART tool49. Other tools not assessed by Escalona are also present in the GSR catalogue with certificates, like NEAT50 and VISOR51.
See text for details.
For choice of the simulation methods and programs for NGS reads, the decision tree proposed by Escalona et al. is robust. However, it should be complemented by “certification” steps and, in this respect, we encourage the use of the “certification” criteria established by the GSR Certification Program, to tackle the challenge of following agreed principles for rigorous, reproducible, transparent, and systematic benchmarking of omics tools, in line with those proposed by Mangul et al.13.
Using pre-existing experiments, from private or public repositories, ensures that the components of the dataset are representative of a real-life experiment, including the complete panel of real-life variabilities that are difficult to simulate. The main issues then are: a) there is a need to demonstrate that the experiment met the necessary quality criteria (see section 2.3); b) the “correct” value (i.e. the ‘ground truth’) for the experiment needs to be determined. This can be already described in the metadata associated with the record and/or determined (verified) a posteriori – although this requires strict annotation of the experiment; c) it will not be possible (besides rare exceptions) to build datasets for the different platforms using the same initial samples.
Generating experiments specifically for the sake of a benchmark dataset has almost the same advantages and disadvantages as using pre-existing data. Additional advantages include a better capacity to determine the “ground truth” of each sample by ensuring access to the original pathogen, as well as the possibility to generate datasets for the different platforms while using the same samples, if the same pathogen/purified DNA is processed through the different protocols and instruments. This also allows better control of the quality aspects of the procedure performed, e.g. through the use of accredited laboratories who have therefore demonstrated by audits that they possess the necessary knowhow and expertise to create high-quality data. However, an additional disadvantage is that this process requires a substantial investment of time and resources (although this investment may be deemed worthwhile given the importance of the topic, and could benefit from the involvement of the instrument vendors).
Because each approach has advantages and disadvantages, the choice must be carefully considered, according to the purpose of the dataset, which will be discussed in section 3.
The quality of the original sample and the wet laboratory procedures (e.g. DNA extraction, library preparation and sequencing) have a strong impact on the quality of the reads fed into the bioinformatics pipelines. Contamination, low amounts of reads passing the machine QC, higher error rates than normal, etc. can influence the output of bioinformatics pipelines. Usually, the pipelines are designed to be resilient to some extent to these variations.
Although understanding this resilience is important, we propose, as shown in Figure 1, to separate these considerations from resources meant for quality control and performance evaluation (questions 1, 2a and 2b) for two reasons: first, many of these factors are variable, heterogeneous, technology-specific, and can be implemented at different stages of the bioinformatics pipeline; attempting to incorporate them all in the same resource would be impractical and too costly. Second, pipelines implemented for regulatory or clinical decision-making will be incorporated into a larger quality assurance framework that will ensure the quality of the input until that step2. Although examples exist of end-to-end NGS workflow validation (like in the case of WGS) where bioinformatics is one of the components52, our approach emphasises a process where each step is validated separately (see 53).
It is then crucial to closely follow the proposed quality control schemes, either published or in development, in particular for the upstream steps (DNA isolation, library extraction, sequencing, etc.), for example ISO/TC 34/SC 9/WG 25. From these, both the metrics and the thresholds that can be applied at the level of the reads should be identified (some of which may vary according to the sequencing methodology), such as percent of bases with quality scores over Q30, percent alignment and error rates of the positive control (if present), the number of reads after trimming, quality of the draft assembly (e.g. N50, number of contigs), etc. Tools exist that can provide a panel of quality metrics from FASTQ files, such as FASTQC (RRID:SCR_014583)54. It is important to include the quality metrics as metadata in the dataset samples.
For studies evaluating resilience (question 3), a variety of datasets are needed for the “low quality dimensions” to be tested. For example, datasets incorporating high error rates, contaminating reads, reads with lower Q-scores could be used to assess resilience of pipelines to quality issues that may or may not be detectable with standard QC pipelines. For this reason, the establishment of standard datasets for this type of benchmarking is a complex exercise and answering question 3 of Figure 1 should be attempted on a case-by-case basis, and may be better suited to individual studies. One way to harmonise the approach would be to use the datasets produced for questions 1 and 2 as a starting point, as there are tools that can add some extent of “noise” to existing good quality datasets49.
In the context of challenging/evaluating a bioinformatics pipeline for the detection of AMR genetic determinants, a very pragmatic approach could be the generation of random DNA sequences, to which particular sequences of interest are added (i.e. fragments of AMR genes). However, the genomic background of the bacteria (i.e. the “non-AMR related” sequences) might have a profound influence on the performance of the pipelines. For example, pipelines that include a contig assembly step will be affected by the frequency and level of repetitive sequences in the background genome, as well as its GC content55,56. Some species also have genes that are similar at the sequence level to known AMR determinants that efficient pipelines must be able to distinguish.
In conclusion, the bacterial species included in the benchmark datasets, and the AMR genes they contain, need thus to be carefully selected, with the appropriate justifications. These are specific to the purpose of the dataset (Figure 1) and will be discussed in section 3.1–section 3.3 below.
A pipeline processing sequencing information for AMR can produce two closely linked but conceptually different outputs: a) they can detect the genetic determinants of AMR (genomic endpoint), and in addition b) some can predict the AMR/susceptibility of the bacteria in the original sample (phenotypic endpoint).
In a clinical context, the phenotypic endpoint is the most relevant, as it provides the information that is most useful for the end users. Studies that evaluated AMR genotype to phenotype relationships have indicated that despite generally high correspondence, this can vary greatly between pathogens / case studies, and even for different antimicrobial agents within the same species57,58. There are different reasons for discrepancies between phenotype and genotype, including the extent of the expression of the resistance determinants for the resistance to be conferred, and also relatively complex molecular pathways that can influence the eventual phenotype. In some cases, genes can also confer reduced susceptibility (i.e. increasing the concentration of an antimicrobial necessary for treatment) rather than resistance per se. A genotypic endpoint may also be problematic due to the definition of “antibiotic resistance” in different settings59, which can complicate the interpretation of results.
In practice, however, focusing on a genomic endpoint has many advantages:
The end-point (determinant; gene or SNP) is better defined: presence or absence.
The gene copy number can be calculated, this is important even if obtaining gene copy numbers with short read data remains pretty difficult.
It provides high resolution information that is useful when many genetic determinants confer resistance to the same antimicrobials.
It offers additional information to contribute to an evaluation of the history of the spread of AMR60.
It does not rely on breakpoints such as MICs, which may vary between human and animal bacterial isolates, or may not be available for some animals (or pathogens), or because it may be updated based on phenotypic scientific observations61,62.
Even in the cases of AMR determinants not being expressed (thus not leading to a resistance phenotype), this may be important to characterise/record for epidemiological purposes.
Besides the set of reads themselves, additional information needs to be associated with each sample in the dataset, not for the benchmarking per se but its use for next benchmarking exercises.
Obviously, each sample needs to include a “true” value, i.e. the ‘ground truth’ to be used for comparison when evaluating the performance of the pipeline. For a genotypic endpoint, this would take the form of the name (with a reference to a database) of the AMR determinants present. If real-life samples are used, the phenotypic information should be included, on top of the genotypic endpoint.
Public resources and repositories are available to host both the data and the metadata, and should be used as appropriate for the sake of transparency, obsolescence and traceability of the datasets of the benchmark resource. In practice, this means:
The NGS reads data should be hosted in one of the International Nucleotide Sequence Database Collaboration (INSDC, RRID:SCR_011967) sequence reads archives63, compiling the metadata information as appropriate.
For simulated reads, this information should include the simulation tool used (source, version, parameters).
For simulated reads, the “input” sequence(s) should be a closed genome, and any additional genes, that should be available in INSDC sequence archives64, and the record ID(s) included in the reads metadata information. Optimally, the closed genomes should be linked to a “real” sample in the INSDC BioSample database.
For real experiments, the originally sequenced sample should be present/submitted in the INSDC BioSample database65, with all the appropriate metadata information (including identified resistances and the MIC(s) determined according to the standard culture-based evaluation methods).
Two main use cases for benchmarking resources have been discussed, i.e. single isolates and mixed samples, here summarised in sections 3.1-3.2 and 3.3, respectively.
The scope of this benchmark resource is to address the questions of validation, implementation and quality control over time (i.e. following any change in the pipeline or the environment on which it is executed). The dataset required for this should be compiled based on an agreed “minimum” standard, i.e. thresholds for the acceptance values of certain performance metrics for the bioinformatics pipeline in the context of the detection of AMR determinants, no matter the exact final use of the information produced.
This evaluation of performance should be based on challenging the pipeline with input representing a carefully selected set of resistance determinants and bacterial hosts. These sets of NGS reads should be fully characterised regarding their genetic content and serve as (in silico) reference materials for the validation and quality control of the bioinformatics component of the methods (see, for other host models, 66,67).
To maintain this necessary control on the genetic content of the reads, the dataset could be composed exclusively of simulated experiments. Synthetic artificial reads can be generated on a large scale in a harmonised manner and most importantly allow full control on the content. Alternatively, real sequencing datasets could be used, which would be extremely relevant for cases where the presence/absence of some AMR determinants has been established using consolidated classical molecular-biology-based methods and/or first generation-sequencing (e.g. PCR and/or Sanger sequencing). Both approaches enable the generation of datasets for which the ‘ground truth’ is well-established.
For the choice of resistances and bacterial species to be included, it is proposed to select them based on three sources, based on their current public health relevance and regulatory frameworks:
Table 2 shows the combination of these three lists, in terms of both the bacterial species and the antibiotics mentioned.
a: WHO’s list of antibiotic-resistant "priority pathogens". b: EARS-Net reporting protocol for 2018. c: Commission Implementing Decision 2013/652/EU.
It is important to highlight that the combination of these three lists still leaves important regulatory gaps, and should be complemented by the World Organisation for Animal Health (OIE, RRID:SCR_012759) list of antimicrobial agents of veterinary importance71 or others72. However, the lists do not mention specific species associated to each antibiotic, and these should be selected by the appropriate experts for the context of this benchmark resource. In practice, the generation of reference sequencing datasets should focus on:
1. Pathogens in Table 2 for which high-quality and complete reference genome sequences are available. See, for example, the FDA-ARGOS database11 and the NCBI RefSeq Genomes database (RRID:SCR_003496).
2. Known genetic determinants for the resistance against the antibiotics in Table 2, using available resources7,8,73. If more than one determinant is associated with a resistance phenotype, one possibility is to collect them all; expert knowledge and empirical evidence on the relative contribution of different genes to the phenotypes, from published large-scale studies (e.g. 74) can also be used to objectively reduce the list of determinants to include for a given antibiotic.
3. Combinations of (1) and (2) present in at least one of the chosen lists (see cells in Table 2) (see section 2.3).
The availability of reference datasets for which the ground truth is known, i.e. the composition of AMR determinants within the sample (whether SNPs, genes, or more complex features) is well-established, allows to compare the output of bioinformatics pipelines to evaluate their performance. The endpoint considered for this benchmark is thus genotypic (see section 2.5), i.e. the proper detection and identification of genomic AMR determinants is evaluated. The reference datasets for each pathogen should be carefully chosen and characterised to ensure all present AMR determinants are carefully recorded.
Several classical performance metrics used traditionally for method performance evaluation can then be used to describe performance of the pipeline, such as sensitivity, specificity, accuracy, precision, repeatability, and reproducibility2. Because of the selection of this subset of bacteria/resistances and their immediate clinical and regulatory importance, an important performance metric to be evaluated is accuracy, i.e. the likelihood that results are correct. Definitions for accuracy and other performance metrics should be carefully considered and tailored to the genomics context53 (for example, “reproducibility”, could be evaluated as the result of running the same bioinformatics pipeline, with the same datasets, implemented in different systems).
For all performance metrics, the minimum acceptable values should be subsequently determined once the outputs of real benchmarking exercises considering all the aspects described in this article are available. This will allow to enforce thresholds that pipelines should obtain. Such an approach focusing on performance-based evaluation offers a robust framework for pipeline validation that is more flexible than focusing all efforts on single pipelines or sequencing technologies. Scientists should be able to choose the pipeline best suited to their own needs, whether commercial, open-source, web-based, etc.75, as long as it conforms to minimum agreed upon performance standards, evaluated by analyzing the community AMR benchmark datasets that are created. This also offers flexibility with respect to the currently quickly evolving sequencing technologies for which it is not always clear yet which algorithmic approaches will become community standards, allowing different data analysis methodologies if minimum performance is demonstrated.
When generated, the benchmark should be deployed on a dedicated (and sustainably maintained) platform that includes all the links to the data (see section 2.6) and a description of all the steps/decisions that were taken to generate it. It is also important to implement, from the start, a clear version control system for the benchmark resource, in order to properly document the changes over time, and the exact versions used at the different times that the resource is used. In addition to the unique accession numbers of the individual samples in their respective repositories, the dataset as a whole should have a unique identifier (e.g. a DOI) that changes when any modification is made. The versioning should also allow access to and use of any previous versions of the resource, even after being updated.
This minimal dataset contains, by definition, a limited number of species and may lack pathogens of clinical importance (for example, Mycobacterium tuberculosis, for which WGS-based approaches have shown particular advantages, see 76,77). A full validation exercise for a specific pipeline, applied to a specific context, will need additional samples that complement the resource described in this section with the appropriate species/resistances. These datasets may, for example, be taken from the resources described in the following sections, that focus on evaluating the actual performance of methods in broader contexts by gathering the many datasets necessary to do so.
The scope of this benchmark resource is to identify gold standards for bioinformatics pipelines, in this case linked to the specific use of predicting resistance/susceptibility of a pathogen.
There is a step between identifying the determinants of AMR and predicting resistance, which is not always straightforward as factors such as expression of the AMR gene may affect the prediction57,74. For this reason, and because it is conceptually closer to the information that is acted upon, the endpoint for this benchmark should be phenotypic. In addition, the dataset should be composed of real NGS experiments, since artefacts and variations are more complex in real sequencing reads than in simulated reads, a factor crucial to consider for this scope that focuses on accuracy.
To minimise the need of extensive resources to produce these “real” datasets, we propose to focus on re-using experiments previously performed under standardised conditions. A great source of data are the published ring trials; these have the additional advantage of providing an accurate characterisation of the sequenced samples, since the same original samples are sequenced many times by different laboratories. If needed, the data generated by single-site studies can also be evaluated, although in this case the issue of the correct characterisation of the samples (their “true” resistance patterns) should be addressed. One possibility is to use studies performed in a hospital setting, linked to clinical outcome (for example,78), or where sufficient information is available to evaluate the way the susceptibility testing was performed.
In practice, this would mean:
1. Performing an extensive review of the published literature to identify studies, ring trials, and proficiency testing that meet the criteria (focused on the detection of AMR using NGS, starting from a “real” sample). Table 3 provides a non-exhaustive list of recent references to be used as a starting point, together with reports from national and international reference laboratories dealing with AMR (see for example, the external quality assessment reports from the EU Reference Laboratory – Antimicrobial Resistance).
2. Assessing whether the raw sequencing output for the projects meet the FAIR principles (Findability, Accessibility, Interoperability, and Reusability)79, and are retrievable from publicly available repositories – even if they are access controlled. If not fully open, the corresponding authors should be contacted and asked whether the data could be obtained and deposited in long-term archives (e.g. Zenodo (RRID:SCR_004129), EuDat and/or the European Nucleotide Archive (ENA, RRID:SCR_006515) depending on the deposited data).
| Pathogens | Study | Year | Study type | Ref | 
|---|---|---|---|---|
| Clostridium (Clostridioides) difficile | Berger et al. | 2019 | ring trial | 80 | 
| Neisseria meningitidis | Bogaerts et al. | 2019 | validation study | 53 | 
| Salmonella enterica | Mensah et al. | 2019 | single study | 81 | 
| Enterobacteriales | Ruppé et al. | 2019 | single study | 58 | 
| Escherichia coli | Stubberfield et al. | 2019 | single study | 82 | 
| Staphylococcus aureus | Deplano et al. | 2018 | ring trial | 83 | 
| Brucella melitensis | Johansen et al. | 2018 | ring trial | 84 | 
| Salmonella enterica | Neuert et al. | 2018 | single study | 85 | 
| Salmonella, Campylobacter | Pedersen et al. | 2018 | proficiency testing | 86 | 
| Escherichia coli | Pietsch et al. | 2018 | single study | 87 | 
| Enterococcus faecium, Enterococcus faecalis | Tyson et al. | 2018 | single study | 88 | 
| Actinobacillus pleuropneumoniae | Bossé et al. | 2017 | single study | 89 | 
| Klebsiella pneumoniae | Brhelova et al. | 2017 | single study | 90 | 
| Salmonella enterica | Carroll et al. | 2017 | single study | 91 | 
| Escherichia coli | Day and al. | 2016 | single study | 92 | 
| Salmonella spp., Escherichia coli, Staphylococcus aureus | Hendriksen et al. | 2016 | proficiency testing | 93 | 
| Salmonella | McDermott et al. | 2016 | single study | 94 | 
| Staphylococcus aureus, Enterococcus faecium, Escherichia coli, Pseudomonas aeruginosa | Mellmann et al. | 2016 | single study | 78 | 
| Staphylococcus aureus, Mycobacterium tuberculosis | Bradley et al. | 2015 | single study | 42 | 
| Escherichia coli | Tyson et al. | 2015 | single study | 95 | 
| Mycobacterium tuberculosis | Walker et al. | 2015 | single study | 76 | 
| Campylobacter jejuni, Campylobacter coli | Zhao et al. | 2015 | single study | 96 | 
| Pseudomonas aeruginosa | Koos et al. | 2014 | single study | 97 | 
| Staphylococcus aureus | Gordon et al. | 2013 | single study | 98 | 
| Escherichia coli, Klebsiella pneumoniae | Stoesser et al. | 2013 | single study | 99 | 
| Staphylococcus aureus, Clostridium difficile | Eyre et al. | 2012 | single study | 100 | 
| Salmonella typhimurium, Escherichia coli, Enterococcus faecalis, Enterococcus faecium | Zankari et al. | 2012 | single study | 101 | 
| Salmonella | Cooper et al. | 2020 | single study | 102 | 
These datasets would then be used to test and compare the different bioinformatics pipelines in order to calculate the accuracy of their phenotypic predictions. Although not exhaustive, these datasets should cover the most relevant “real-life” cases, as they warranted their inclusion into a ring trial, with the associated resources committed to produce the data. The final size and composition (species, resistances) of the dataset would depend on what is provided by the available projects; ad hoc ring trials could be organised to cover eventual important gaps in species and/or resistance.
Although the chosen endpoint is mostly phenotypic, the purpose is to evaluate bioinformatics pipelines that process information at the sequence level, so it was agreed that there was little added value of inserting resistant samples (based on a characterised or inferred phenotype) for which the resistance mechanism is still unknown. In any case, it is improbable that these cases would have been included in ring trials projects.
Although the performance metrics described in Section 3.1 apply and are relevant in this case, the main performance metric for this benchmark is the accuracy. Because of the difficulty of predicting the link between the presence of AMR determinants and their impact on the pathogen susceptibility to the antimicrobial agents, the target accuracy is expected to be lower than for a genotypic endpoint. Both false positives and false negatives can be an issue when the information is used for clinical intervention, so a sufficient amount of “borderline” cases should be included, and both sensitivity and specificity evaluated. It is also possible to consider attaching different relative costs for false positives and false negatives when evaluating the accuracy metrics.
Once selected and combined, the data should be separated by NGS platform, and by species and antibiotic. Because this benchmarking aims at evaluating and comparing performance of methods, which are continuously developed and optimised, against a large and constantly expanding dataset, it is crucial to define an environment where the AMR community can establish a continuous benchmarking effort. Within this platform, pipelines would be compared simultaneously based on up-to-date datasets, under the same conditions, and over time. Constantly updating and adding to the reference datasets is important both to keep up with the evolution of the knowledge/reality in the field, and to avoid that pipelines are developed that are optimised to specific datasets only.
One option is OpenEBench (Open ELIXIR Benchmarking and Technical Monitoring platform), which is developed under the ELIXIR-EXCELERATE umbrella in order to provide such a platform18. In this framework, in addition to compiling the data resources to be included (as described above), and whatever the platform chosen, there will be the need for efforts to:
Establish guidelines for input and output formats (and, in the case of the phenotypic endpoint, an agreed ontology for the conclusions).
Encouraging a “FAIR” implementation of the pipelines themselves, to increase the number of pipelines accessible for the benchmarking platform, and for interested end users to retrieve and implement in house.
Provisions should be included to allow the possibility to evaluate, in this context, pipelines that cannot be made “FAIR” based on intellectual property rights, institutional policies or available resources.
A final step will be to communicate these efforts within the scientific community and the potential end users, as well as to demonstrate the added value of this “live” benchmark resource to ensure that future studies (in particular, their pipelines and the datasets they generate) are efficiently integrated in the platform.
Many gaps exist in the scientific understanding of antibiotic resistance development and transmission, making it difficult to properly advise policy makers on how to manage this risk. There is strong evidence that a multitude of resistance genes in the environment have not yet made it into pathogens103,104; understanding the relative importance of different transmission and exposure routes for bacteria is thus crucial59,105–107.
Establishing a baseline for resistance determinants in the environment, and linking this to a surveillance scheme, requires a good understanding of the relative performance of methods that are and have been developed to characterise the resistome in a complex sample. There would be, also for this use case, a great value in the establishment of a community-driven “live” benchmarking using a platform such as OpenEBench, and many of the concepts that were discussed in section 3.2 apply here as well, with the following differences:
As, by definition, the resistome refers to the genetic determinants (and not directly the associated phenotypes)108,109, the endpoint for this benchmark should be genotypic.
Culture-dependent methods established for clinical samples cannot always be readily applied to environmental samples110, so establishing “true” values for real samples, to compare the output of the evaluated pipelines, will be difficult, so the benchmark should be performed, at this stage, with simulated reads.
The resistome is usually derived from a sample containing a complex microbial community (see 111–123 for recent examples). For this reason, the approaches114 and tools115 from the ongoing Critical Assessment of Metagenome Interpretation (CAMI) could be considered when organising the community around this challenge.
In practice, this means an effort to engage and coordinate the community of bioinformatics pipelines designed to predict the resistome of a sample in order to:
1. Design the scope of the challenge, including the relevant metrics for performance evaluation. For this, “accuracy”, the main metrics for the previous two benchmarks, may not be the most appropriate, and the focus should be placed, e.g., on “recall” and “precision”.
2. Describe the microbial communities (i.e. microbial abundance profiles and their typical AMR gene profiles) most relevant for the determination/monitoring of the resistome, in order to generate congruent datasets that accurately represent real-life samples. Of particular interest, for which validation will eventually be a prerequisite, are blood, serum, saliva etc., i.e. the types of samples clinical microbiology laboratories and national reference centres/laboratories typically process.
3. Identify both the microbial genomes and the resistance determinants (as single genetic determinants or plasmids) necessary to generate the profiles identified in (2). As stated in section 3.1, the genomes should be well analysed to ensure no lack of, or an adequate characterisation of, AMR determinants. This is crucial in order to establish a resistome “ground truth” for the generated datasets.
4. Combine these sequences, as appropriate, to generate the benchmark datasets, using appropriate tools (such as CAMISIM, developed as part of the CAMI challenge115).
The community should decide whether (or at what stage) the use of real data can also be considered in the challenge. As for purified bacteria (see Table 3), many studies have been published as potential sources of raw data. These studies can also be used as a source of information to define the relevant profiles (point 2 above). Recent studies include resistome determination in samples from drinking water116,117, wastewater plants118,119, hospital wastewater120,121, human gut111,122, sewage123, to name a few. In the benchmarking platform, the datasets (and the calculated pipeline performances) should be separated by the type of source they originate from or simulate. Another important point for this scope is the detection of minority populations and the correct normalisation of the samples to be analysed124.
The scientific community quickly adopted the new NGS technologies to develop methods that can efficiently detect, identify and characterise genetic determinants of AMR. In parallel with these research uses, NGS technologies can have immediate impacts on how AMR is diagnosed, detected and reported worldwide, complementing etiologic agent diagnosis, clinical decision making, risk assessment and established monitoring frameworks3,125–127.
For this application and in general, there are great challenges in the implementation of NGS-based methods for public decision-making. Capacity building and its cost is of course a factor, but recent surveys show that capacity development is ongoing in many countries28. A greater concern is the interpretation of the produced genomic data into meaningful information that can be acted upon or used for regulatory monitoring, in great part because of the important bioinformatics component of these methods.
The difficulties posed by this reliance on bioinformatics processes are many, and include:
The specific expertise needed for their implementation and maintenance, which is still limited compared to the needs of routine testing environments.
The lack of harmonisation in their design, as the same sequencer output can be processed to produce the same target information by pipelines that either follow the same general strategy, with different tools for the individual steps, or completely different strategies entirely (see 128).
The constant, rapid evolution of the fields of informatics and bioinformatics, which makes uneasy (or even unwise) to “freeze” a harmonised, validated, implemented pipeline with the same components in the same environment over long periods of time.
For AMR, as for other fields, the pipelines (and their performance metrics) are built based on a priori scientific knowledge, in this case the genetics of resistance, which is constantly progressing.
In this document, we propose a way through these difficulties with a transparent, performance-based evaluation approach to assess and demonstrate that pipelines are fit-for-purpose and to ensure quality control. The discussions, initiated in 201716, have involved experts in different fields: human health, animal health, food and environmental monitoring, and general bioinformatics.
The approach is two-fold: first, an agreed-upon, limited dataset to contribute to performance-based control of the pipeline implementation and their integration in quality systems. We propose selection criteria for this common dataset based on bacterial species and resistances relevant to current public health priorities (see section 3.1).
Second, a community-driven effort to establish a “live” benchmarking platform where both the datasets and the bioinformatics workflows are available to the community according to the FAIR principles. After an initial investment of resources to establish the rules and integrate the existing resources, a proper engagement of the community will be needed to ensure that both the datasets and the workflows will constantly be updated, with live monitoring of the resulting comparative performance parameters. For this, two main use cases were identified, each necessitating its own platform: the analysis of isolates (with a focus on the prediction of resistance, see section 3.2), and the analysis of mixed samples (with a focus on the interpretation of the resistome, see section 3.3).
To ensure acceptance of this approach by regulators and policy-makers, the conclusions and the roadmap proposed in this document should be complemented (and, if necessary, revised) with the continuous involvement of all relevant actors in the field, including (but not limited to) the scientific community (see 13 and 114 as excellent examples of defining principles for benchmarking, and a roadmap for software selection, respectively, to guide researchers), the collaborative organisation and platforms active in the field (e.g. the European Committee on Antimicrobial Susceptibility Testing (EUCAST), the Joint Programming Initiative on Antimicrobial Resistance (JPIAMR), the Global Microbial Identifier (GMI), the European Society of Clinical Microbiology and Infectious Diseases and its Study Groups (ESCMID)), regulatory agencies (e.g. the European Food Safety Authority (EFSA, RRID:SCR_000963), the European Centre for Disease Prevention and Control (ECDC)), European Union reference laboratories and their networks (e.g. the EURL AR and the EURLs for the different pathogens) and the existing bioinformatics infrastructures (e.g. the European Bioinformatics Institute (EMBL/EBI), ELIXIR).
Such an approach would be a way to facilitate the integration of NGS-based methods in the field of AMR, and may be a case study on how to approach the overlapping challenges in other potential fields of applications, including some at high level in policy agendas (food fraud, genetically modified organism detection, biothreats monitoring for biodefense purposes, etc.).
The contents of this article are the views of the authors and do not necessarily represent an official position of the European Commission or the U.S. Food and Drug Administration.
No data is associated with this article.
We would like to thank Valentina Rizzi (EFSA) and Alberto Orgiazzi (JRC) for their participation to the workshop discussions. The authors are also grateful to the following colleagues for their comments on the manuscript: Sigrid De Keersmaecker and Nancy Roosens (Sciensano) and Tewodros Debebe (Biomes). We are also grateful to Laura Oliva for her invaluable help during the workshop.
| Views | Downloads | |
|---|---|---|
| F1000Research | - | - | 
| PubMed Central Data from PMC are received and updated monthly. | - | - | 
Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes
Are all factual statements correct and adequately supported by citations?
Partly
Are arguments sufficiently supported by evidence from the published literature?
Partly
Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes
References
1. McArthur AG, Waglechner N, Nizam F, Yan A, et al.: The comprehensive antibiotic resistance database.Antimicrob Agents Chemother. 2013; 57 (7): 3348-57 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: Microbiology, bioinformatics, genetics
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: AMR, NGS data analysis, bioinformatics
Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes
Are all factual statements correct and adequately supported by citations?
Yes
Are arguments sufficiently supported by evidence from the published literature?
Yes
Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes
References
1. Mangul S, Martin L, Hill B, Lam A, et al.: Systematic benchmarking of omics computational tools. Nature Communications. 2019; 10 (1). Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: AMR, NGS data analysis, bioinformatics
Is the topic of the opinion article discussed accurately in the context of the current literature?
Partly
Are all factual statements correct and adequately supported by citations?
Partly
Are arguments sufficiently supported by evidence from the published literature?
No
Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Microbiologist focusing on NGS and EQA/ benchmarking
Alongside their report, reviewers assign a status to the article:
| Invited Reviewers | |||
|---|---|---|---|
| 1 | 2 | 3 | |
| Version 2 (revision) 16 Mar 22 | read | read | |
| Version 1 08 Feb 21 | read | read | |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)