Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation.

Background: Advancements in DNA sequencing technology have transformed the field of bacterial genomics, allowing for faster and more cost-effective chromosome-level assemblies compared to a decade ago. However, transforming raw reads into a complete genome model is a significant computational challenge due to the varying quality and quantity of data obtained from different sequencing instruments, as well as intrinsic characteristics of the genome and desired analyses. To address this issue, we have developed a set of container-based pipelines using Nextflow, offering both common workflows for inexperienced users and high levels of customization for experienced ones. Their processing strategies are adaptable based on the sequencing data type, and their modularity enables the incorporation of new components to address the community’s evolving needs. Methods: These pipelines consist of three parts: quality control, de novo genome assembly, and bacterial genome annotation. In particular, the genome annotation pipeline provides a comprehensive overview of the genome, including standard gene prediction and functional inference, as well as predictions relevant to clinical applications such as virulence and resistance gene annotation, secondary metabolite detection, prophage and plasmid prediction, and more. Results: The annotation results are presented in reports, genome browsers, and a web-based application that enables users to explore and interact with the genome annotation results. Conclusions: Overall, our user-friendly pipelines offer a seamless integration of computational tools to facilitate routine bacterial genomics research. Their effectiveness is illustrated by examining the sequencing data of a clinical sample of Klebsiella pneumoniae.


Introduction
As whole genome sequencing has been established as a routine procedure in research projects worldwide, the computational analysis of sequencing data takes center stage, and it is often the main operational barrier over which biologists stumble (Xuan et al., 2013; Berger and Yu, 2022). Over the years, many open-source software tools were created to tackle different processing steps along intricate computational protocols geared toward different data processing scenarios. Nevertheless, materializing the data analysis workflow is a non-trivial stage that many biologists face, given challenges ranging from selecting and installing tools from a vast assortment of options to defining the logic by which processing steps are enacted. Consequently, analyses can be performed in many ways by different groups, raising issues of reproducibility and provenance (Grüning et al., 2018; Djaffardjy et al., 2023). Also, individual teams face problems implementing the processing workflow, inventorying the requirements, and optimizing performance and scalability (Wratten et al., 2021; Mölder et al., 2021).
Bacterial whole-genome sequencing has become mainstream in many microbiological settings, such as taxonomy, ecology, and clinical diagnostics. In concert with these applications, several tailored computational workflows, also known as pipelines, were created for bacterial genomics, each with a different underlying design and implementation approach (Petit and Read, 2020). Despite their similarities, each pipeline is unique and may provide different outcomes, considering that they were developed with different components and parameters. Most of the pipelines are designed to work with limited sequencing data types while focusing on specific annotation tasks, such as functional annotation of open reading frames, antibiotic resistance genes, and variant calling using reference genomes (Olawoye et al., 2020; Quijada et al., 2019; Ruiz-Perez et al., 2021; Sserwadda and Mboowa, 2021). Thus, there is still a need for more generic pipelines that give the user an extensive overview of their data while creating visually rich outputs and guaranteeing reproducibility.
Here we describe three pipelines built using the workflow composition system Nextflow (Tommaso et al., 2017). These were specifically designed in a modular way to standardize and facilitate bacterial genomic analysis from the standpoint of non-bioinformaticians, relieving the issues of installation, configuration, and execution. Altogether, the pipelines are usable in different analytical scenarios and capable of handling data from different sequencing platforms, ranging from small single-genome projects executed on a personal computer to larger multi-genome projects executed on cloud computing platforms. We also leverage operating system virtualization, packing the pipelines to use software containers that provide all required supporting programs without the need to install the required operating system and pipeline components.
Together, they offer a seamless integration of computational tools to provide an easy framework for analyzing and interrogating data in routine bacterial genomics. To illustrate the system's functionality, we provide a full analytical illustration of the processing of a multi-drug-resistant Klebsiella pneumoniae strain.

Implementation
The pipelines have been implemented with Nextflow, a workflow composition and orchestration tool that allows the execution of tasks across multiple heterogeneous computing environments in a portable manner (Tommaso et al., 2017). We have used many concepts discussed and established by the nf-core community and adhered to their development framework (Ewels et al., 2020).
In the pipeline design process, several auxiliary scripts were developed to aid in processing and summarizing intermediate steps and in generating graphical annotation reports. Several programming languages were used, such as Python, R, RMarkdown, and Bash shell scripts. However, the utilization of scripting languages in the pipelines comes with the drawback of increased complexity, as it necessitates the installation of language interpreters and support libraries. To make this as transparent as possible to the end user, all the scripts and core dependencies have been packed into Docker® container images to ensure a consistent distribution and uniform execution regardless of the endpoint hardware and operating system.
We have adopted a modular development approach, resulting in three independent pipelines specializing in the critical steps of a general bacterial genomics pipeline (Figure 1). The first pipeline focuses on data pre-processing and quality control, the second on genome assembly, and the third on genome annotation. Pipelines 1 and 2 can be used independently for any general-purpose sequencing project, while pipeline 3 is optimized for prokaryotic genomes. The pipelines are autonomous, meaning they can be used separately or in combination with other methods, giving users complete control and flexibility (Figure 1). When invoked sequentially, the pipelines provide a comprehensive end-to-end solution for bacterial genomics analyses. The architecture and implementation of each pipeline are detailed in the following sections.

The preprocessing pipeline
The ngs-preprocess pipeline can perform several quality-control steps required for Next-Generation Sequencing (NGS) data assessment. Short or long sequencing reads can be used as input data, and the subsequent steps are determined automatically by the read type or via user configuration settings. These include contamination checking, quality trimming, adapter removal, demultiplexing, file conversion, and graphical report generation (Figure 2). The pipeline accepts data from local storage or deposited in the public repository Sequence Read Archive (SRA). When a list of SRA IDs is given, raw sequencing data (in fastq format) will be automatically downloaded using entrez-direct and sra-tools.
The following steps in the pipeline will be handled automatically based on the sequencing technology indicated in the configuration file. For sample demultiplexing, Porechop (Wick et al., 2017a) is used for Oxford Nanopore Technologies (ONT) reads and lima for PacBio reads. Long-read qualities and statistics are evaluated, and a quality report is generated using NanoPack (Coster et al., 2018). PycoQC (Leger and Leonardi, 2019) can also be used to perform quality control checks on ONT reads. The program fastp (Chen et al., 2018) is used for preprocessing short sequencing reads, including quality assessment, adapter sequence removal, trimming, and reporting. Tools included in the pipeline are summarized in Table 1.

The assembly pipeline
The MpGAP pipeline for de novo genome assembly has been designed in an organism- and platform-independent manner to perform short- or long-read-only assemblies, as well as hybrid assemblies using a combination of sequencing technologies (Figure 3). Given the user input, the pipeline automatically selects the assembly mode. When using only short-reads, it performs the genome assembly using any of the programs SPAdes (Bankevich et al., 2012), Unicycler (Wick et al., 2017b), Shovill, and Megahit (Li et al., 2015). On the other hand, when only using long-reads, it uses one or more of the following assemblers: Unicycler, Canu (Koren et al., 2017), Flye (Kolmogorov et al., 2019), Raven (Vaser and Šikić, 2021), Shasta (Shafin et al., 2020), and Wtdbg2 (Ruan and Li, 2019). When both short and long-reads are available, the pipeline is capable of performing two types of hybrid assemblies: (1) a direct hybrid assembly using both the short and long-read sets with the hybrid assembly modes of HASLR (Haghshenas et al., 2020), SPAdes, and Unicycler; and (2) a hybrid assembly methodology in which a long-reads only assembly is produced with one of the assemblers above, followed by an error-correction procedure (polishing) using the available short-reads.
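The mode selection described above can be pictured with a small decision function. This is a minimal Python sketch for illustration only, not MpGAP's code: the function name, its return labels, and the default strategy value are hypothetical, while the strategy numbers 1 and 2 follow the two hybrid approaches listed in the text.

```python
def select_assembly_mode(has_short: bool, has_long: bool,
                         hybrid_strategy: int = 1) -> str:
    """Return a label for the assembly mode implied by the available reads.

    hybrid_strategy mirrors the two hybrid approaches described in the text:
    1 = direct hybrid assembly, 2 = long-read assembly followed by polishing.
    The default of 1 is an assumption made for this sketch.
    """
    if has_short and has_long:
        if hybrid_strategy == 1:
            return "hybrid: direct assembly with short and long reads"
        if hybrid_strategy == 2:
            return "hybrid: long-read assembly polished with short reads"
        raise ValueError("hybrid_strategy must be 1 or 2")
    if has_short:
        return "short-read-only assembly"
    if has_long:
        return "long-read-only assembly"
    raise ValueError("no sequencing reads provided")


if __name__ == "__main__":
    # A sample with both read types, using the polishing-based strategy.
    print(select_assembly_mode(True, True, hybrid_strategy=2))
```

In the real pipeline the decision is driven by the user's input files and parameters; the sketch only makes the branching explicit.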
MpGAP is capable of polishing long-reads only assemblies using the appropriate tool based on user input: (1) Pilon (Walker et al., 2014) for polishing with short-reads data; (2) Medaka and Nanopolish (Loman et al., 2015) for polishing with nanopore data; and (3) GCpp for polishing with PacBio data. Ultimately, assembly statistics are assessed using QUAST (Gurevich et al., 2013) and summarized by MultiQC (Ewels et al., 2016). All the tools that are part of the pipeline are outlined in Table 2.

The annotation pipeline
The bacannot pipeline is dedicated to annotating prokaryotic genomes. It covers gene prediction, annotation of gene families, mobile genetic elements, and identification of medically relevant features. The pipeline generates dynamic annotation reports and facilitates the navigation of annotated features through a genome browser.
The bacannot workflow is summarized in Figure 4. The pipeline needs a sample sheet file to initiate the process. This file should contain information about the input files, specifying whether they are preprocessed sequencing reads or assembled genomes. If the sample comprises sequencing reads, the pipeline will utilize either Unicycler (Wick et al., 2017b) for short-reads or hybrid assembly or Flye (Kolmogorov et al., 2019) for long-reads assembly.
The main characteristic of this pipeline is that it is built to aid users less familiar with the tools to investigate and interpret its results with graphically rich reports. The pipeline generates HTML reports summarizing all the results using customizable document templates written in RMarkdown (Xie et al., 2020). A genome browser created with JBrowse (Buels et al., 2016) is also available to explore the annotated features. A custom web application has been developed using the R Shiny framework (Chang et al., 2023) to facilitate further results exploration. It offers users additional features such as dynamic annotation filtering, as well as a built-in sequence similarity search (BLAST) functionality with an interface for executing and visualizing the results produced by SequenceServer (Priyam et al., 2019). The software and databases used in the pipeline are listed in Table 3.

[...] (1992), and extracted DNA was quantified with Nanodrop™. DNA sequencing was performed with both long and short-read technologies. Long-read sequencing was performed using an Oxford Nanopore Technologies MinION Mk1b device, using the rapid barcode kit (SQK-RBK-004) in a R9.4.1 SpotON flowcell (FLO-MIN106D). Short-read sequencing was performed by BGI (Shenzhen, China) using paired-end reads with 150 bp on a DNBseq® platform.

Computational analyses
For the processing of the bacterial genome above, the raw long and short-reads have been analyzed sequentially with the developed pipelines. Quality assessment, filtering, and trimming were performed using the ngs-preprocess pipeline v2.6. In our use case, aside from the default parameters, we set the pipeline to correct the short paired-end reads with fastp and filter long-reads based on quality (≥10) and length (≥750) using the parameters --lreads_min_quality and --lreads_min_length, respectively.
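To make the long-read filtering criteria concrete, the following Python sketch keeps only reads that satisfy both thresholds. It is an illustration under stated assumptions and not the pipeline's implementation: the function names are hypothetical, reads are represented as plain (name, sequence, quality) tuples, qualities are assumed to be Phred+33 encoded, and the mean quality is computed on error probabilities, which is one common convention for nanopore data.

```python
import math
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, Phred+33 quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality: average the per-base error probabilities,
    then convert the average back to the Phred scale."""
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_long_reads(reads: Iterable[Read],
                      min_length: int = 750,
                      min_quality: float = 10.0) -> Iterator[Read]:
    """Yield reads at least min_length bases long with mean quality >= min_quality."""
    for name, seq, qual in reads:
        if len(seq) >= min_length and mean_phred(qual) >= min_quality:
            yield name, seq, qual


if __name__ == "__main__":
    demo = [
        ("good", "A" * 800, "5" * 800),   # Q20, 800 bases: kept
        ("lowq", "A" * 800, "&" * 800),   # Q5, 800 bases: dropped
        ("short", "A" * 100, "5" * 100),  # Q20, 100 bases: dropped
    ]
    print([name for name, _, _ in filter_long_reads(demo)])
```

The default thresholds in the sketch mirror the values used in this study (length ≥750 and quality ≥10).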
The preprocessed sequencing reads were assembled using the MpGAP pipeline v3.1.4. Genome assembly was performed using the hybrid method where long-reads are first assembled with long-reads assemblers and afterward polished using the short-reads data (--hybrid_strategy 2). The hybrid assembly strategy was executed only with the Flye assembler (Kolmogorov et al., 2019) instead of using all the available options to limit the computational burden. The BUSCO (Simão et al., 2015) completeness assessment was performed with the bacteria_odb9 dataset, containing 148 expected bacterial genes.
The final polished genomes were annotated using the bacannot pipeline v3.1.5. The pipeline's required databases, such as VFDB (Liu et al., 2018), were downloaded in May 2022. The parameter --resfinder_species has been set to "Klebsiella".
We used Prokka (Seemann, 2014) for generic annotation, and the tool's database was enhanced with the public NCBI Prokaryotic Genome Annotation Pipeline (PGAP) HMM library (Li et al., 2020) by using the --prokka_use_pgap parameter.

Operation
The pipelines require POSIX-compliant systems (e.g., Linux or OS X) or Windows with Windows Subsystem for Linux (WSL2). Pre-installation requirements on the executing computer are Nextflow and either a software management tool (Conda) or a container platform (Docker or Singularity). Each pipeline requires a configuration file describing the samples and corresponding data to be used. These files also set the specific software to be used in pipeline steps that have several alternatives (e.g., the assembler program) and the execution parameters for the whole pipeline or for specific software. The configuration files are text files (in YAML format) that should be modified by the user.
With all these requirements satisfied, the execution is conducted non-graphically in a terminal. For example, to trigger the sequential execution of the pipelines, the following command lines should be issued sequentially:
nextflow run fmalmeida/ngs-preprocess -profile docker -latest -params-file preprocess-params.yml
nextflow run fmalmeida/mpgap -profile docker -latest -params-file assembly-params.yml
nextflow run fmalmeida/bacannot -profile docker -latest -params-file annotation-params.yml
The exemplified command lines would trigger the execution of the latest version of the pipeline's code using Docker as the container engine, with all the required parameters configured in the YAML file. Optionally, it is possible to launch specific versions of the pipeline for reproducibility using the "-r" parameter, e.g., "-r v3.2".
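For users who prefer to script the sequential execution, the three command lines can also be assembled programmatically. The following Python sketch is only an illustration: the helper name build_command and the PIPELINES mapping are hypothetical, and it relies solely on the Nextflow options shown in the text (-profile, -latest, -r, -params-file). Omitting -latest when a revision is pinned is a choice made for this sketch.

```python
import shlex
from typing import List, Optional

# Pipeline repositories and parameter files, as named in the text.
PIPELINES = {
    "fmalmeida/ngs-preprocess": "preprocess-params.yml",
    "fmalmeida/mpgap": "assembly-params.yml",
    "fmalmeida/bacannot": "annotation-params.yml",
}


def build_command(pipeline: str, params_file: str,
                  profile: str = "docker",
                  revision: Optional[str] = None) -> List[str]:
    """Build the argument list for one `nextflow run` invocation."""
    cmd = ["nextflow", "run", pipeline, "-profile", profile]
    if revision is None:
        cmd.append("-latest")      # pull the most recent pipeline code
    else:
        cmd += ["-r", revision]    # pin a specific version for reproducibility
    cmd += ["-params-file", params_file]
    return cmd


if __name__ == "__main__":
    # Print the commands in execution order (dicts preserve insertion order).
    for name, params in PIPELINES.items():
        print(shlex.join(build_command(name, params)))
```

Each list could be passed to subprocess.run(cmd, check=True) so that a failure in one pipeline stops the sequence before the next one starts.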
In order for users to replicate the analyses of this paper, the configuration files and command lines executed have been made available in a Zenodo project. The configuration file and the sample sheet used for the annotation pipeline have been provided as Supplementary Material (1 and 2) for quick visualization of the expected format. All pipelines have a complete description of workflows and input files in the online manuals.

Results
This paper showcases the utilization of the developed pipelines to conduct a thorough genome examination of an unpublished bacterial sample sequenced with both short and long-reads. The sample, namely KpBSB53 (see Methods), microbiologically classified as Klebsiella pneumoniae, was isolated from the tracheal aspirate of a 41-year-old man at the University Hospital of Brasilia. The isolate was susceptible to all the antibiotics tested.
The pipelines assemble and annotate prokaryotic genomes in general, and bacannot performs additional annotation steps that are especially relevant for clinical samples, such as those applicable to antimicrobial resistance studies. Using a clinical sample therefore exposes these additional tasks. In the following sections, the KpBSB53 sample is used to demonstrate the operation of the pipelines on a clinically relevant bacterial genome, together with a description of the generated results and how to interpret them.

Reads preprocessing and quality control
The process begins with the ngs-preprocess pipeline, which carries out quality control and cleaning actions essential for subsequent genome assembly and annotation tasks. Additionally, it offers the user an important assessment of the sequencing outcomes of the samples.
The output of the ngs-preprocess pipeline is a directory containing the preprocessed sequencing reads organized according to the sequencing technology of the input files, as indicated in the user configuration file. The preprocessed reads can have sequencing adapters or low-quality stretches removed (trimming) or be discarded entirely if a minimum number of good-quality bases is not reached. The pipeline also provides plots and reports for quality control (QC) inspection.
In the particular case of the KpBSB53 sample, the short-reads obtained were of outstanding quality, displaying quality values ≥60 before preprocessing (Figure 5A and B). The nanopore data fluctuated around the expected average quality and length, with median values of ≈10 and ≈4 kb, respectively (Figure 5C and D).

Genome assembly
After running the ngs-preprocess pipeline, users can evaluate the quality of the input sequencing reads and ensure that only data of sufficient quality will be used for genome assembly. Once the data are selected, the user manually supplies the preprocessed sequencing reads (such as short-reads, Nanopore, or PacBio) to the MpGAP pipeline using a sample sheet in a text file (see Methods) and can choose from an assortment of assembly programs and strategies to perform the assembly. The pipeline will automatically select the appropriate assembly program based on the sequencing reads provided.
The KpBSB53 sample illustrates the pipeline's versatility and ability to work with both short and long-reads. For this sample, the Flye assembler (Kolmogorov et al., 2019) was chosen for a hybrid assembly approach (Chen et al., 2020), in which the initial long-read assembly contains errors owing to the sequencing technology that are later corrected with short-read data (polishing) using Pilon (Walker et al., 2014).
The pipeline generates a sub-directory for each sample containing the assembly results, including the initial and polished assembly files, and creates a report using MultiQC (Ewels et al., 2016) that includes assembly quality metrics from QUAST (Gurevich et al., 2013) and BUSCO (Simão et al., 2015) to facilitate comparison between the assemblies. The user can choose different programs to analyze the same data, and report files are available to aid in selecting the best assembly.
After analyzing the KpBSB53 sample with this pipeline, we found that the assembly consists of two circular contigs: one representing a chromosome (5.2 Mb) and the other a plasmid (142 kb), essentially a full replicon resolution. The BUSCO metrics highlight the reference quality level of this genome, with 98.65% ortholog completeness in the bacterial chromosome.
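As a quick arithmetic check, the reported completeness can be related to the size of the BUSCO dataset used in this study (148 expected genes). The Python sketch below is a hedged illustration: the count of 146 complete orthologs is an inference from the reported percentage and is not stated in the text, and the function name is hypothetical.

```python
def busco_completeness(complete: int, total: int) -> float:
    """Percentage of expected orthologs found complete, rounded to two decimals."""
    return round(100 * complete / total, 2)


if __name__ == "__main__":
    # 146 is an assumed count; 148 is the dataset size reported for bacteria_odb9.
    print(busco_completeness(146, 148))
```

Under this assumption, the value corresponds to two of the 148 expected genes being fragmented or missing.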

Genome annotation
The assembly reports offer guidelines to assist users in choosing the most suitable assembly alternative for their sample.
After making their selection, users can manually indicate the chosen assembly results to the annotation pipeline through the configuration file. Bacannot provides a structured and consistent output, allowing straightforward summarization and examination of its contents. Though multiple annotation stages produce outputs, the primary focus will be on the pipeline's essential results, namely:
1. Complete genome annotation.
2. A web-based application for results visualization and exploration (Figure 6).
3. Automatic HTML reports for resistance, virulence, mobile elements, and annotations from specialized databases (Figure 7).
4. A genome browser for visualization of annotated features (Figure 8).
After all the specialized programs in bacannot have finished their analysis, the results are consolidated into a single General Feature Format file (GFF) and a GenBank format file (GBK) containing the complete genome annotation. These files can be used in other general investigation programs or submitted to NCBI databases. The annotated features are also saved as nucleotide and protein sequences in FASTA format. Moreover, these files are processed internally and presented in the web application as interactive web pages (Figure 6A). This workbench allows users to filter results by text or sequence using the SequenceServer and BLAST applications. The filtered results can be converted into tables of varying formats (Figure 6B and C). The BLAST function included in this workbench allows users to easily annotate other target sequences even after the pipeline is finished. Users can find intersections of alignment results with the genome annotation (Figure 6D and E) or visualize alignments with SequenceServer (Figure 6F). Moreover, detailed reports are generated for feature-specific analyses, such as antibiotic resistance gene prediction, consolidating the results and providing cross-references to the source databases used for annotation (Figure 7). Besides these clinically relevant features, the pipeline also has other generic annotation modules that are useful for any bacterial strain, such as prophage and secondary metabolites annotation, KEGG KO annotation, and the possibility of using custom annotation databases or a list of NCBI protein IDs. The KO module, for instance, outputs a text file ready for the KEGG mapper tool to generate pathway figures.
The genome browser allows users to visually explore the annotation results and investigate the genomic context of relevant genes. The browser contains various specialized tracks for the targeted annotations performed by the pipeline, such as for resistance, virulence, prophages, and more. For example, in Figure 8, one can quickly observe a cassette of stress-related resistance genes surrounded by insertion sequences in the plasmid contig.
One of the first results generated by this pipeline is the strain classification based on the alignment-free sequence distance against a database of bacterial genomes (NCBI RefSeq) as calculated by RefSeq Masher. The strain BIDMC 55 (GCF_000692955.1), isolated in the USA, was the closest genome to strain KpBSB53 (Mash distance = 0.000715202).
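The Mash distance reported above is derived from a Jaccard index estimated on MinHash sketches. As a rough illustration of how such a value arises, the Python sketch below implements the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)). The k-mer size (k = 16) and the number of shared hashes (391 of 400) are assumptions chosen only because they reproduce a value close to the one reported; the sketch parameters used by RefSeq Masher are not stated in this paper.

```python
import math


def mash_distance(jaccard: float, kmer_size: int) -> float:
    """Mash distance from an estimated Jaccard index j and k-mer size k:
    D = -(1/k) * ln(2j / (1 + j)). By convention, D = 1 when j = 0."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if jaccard == 0.0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / kmer_size


if __name__ == "__main__":
    # Assumed values: 391 of 400 sketch hashes shared, k = 16.
    shared, sketch_size, k = 391, 400, 16
    print(f"{mash_distance(shared / sketch_size, k):.9f}")
```

A distance this small indicates nearly identical k-mer content, consistent with the two strains being very closely related.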
Additionally, an in silico multilocus sequence typing (MLST) is executed using the mlst program against the BIGSdb PubMLST database (Jolley and Maiden, 2010), revealing that our strain belongs to the Klebsiella pneumoniae ST 105 group, which has been reported as the driver of a plasmid-mediated outbreak of NDM-1-producing strains in China (Zheng et al., 2016). Using Prokka (Seemann, 2014) for generic annotation, 4937 coding sequences (CDS) have been detected, alongside 25 rRNAs and 87 tRNAs. Another pipeline result is the prediction of plasmid replicons using PlasmidFinder (Carattoli et al., 2014). For the KpBSB53 strain, a single plasmid was predicted and classified as IncFIB(K), a very dynamic replicon that is mostly associated with MDR plasmids in K. pneumoniae (Lahlaoui et al., 2015; Tian et al., 2021) and important for virulence as well (Tian et al., 2021).
The pipeline includes several analyses for annotating antimicrobial resistance genes by default. Strain KpBSB53 carries only a few antibiotic-resistance genes, namely blaSHV-1, fosA, and oqxAB, all considered intrinsic to the species and located in the chromosome (Bernardini et al., 2019; Holt et al., 2015). This agrees with our experimental results, which show that this strain is susceptible to all antibiotic classes tested.
Additionally, a set of stress-related genes (Supplementary Material 3) conferring resistance to copper, silver, and other metalloids has been detected in the plasmid sequence (Figure 7). On the other hand, several virulence genes have been detected in the sample's genome using the Virulence Factor Database (VFDB) (Liu et al., 2018). Besides the classical virulence genes normally found in Klebsiella pneumoniae strains (Paczosa and Mecsas, 2016), strain KpBSB53 encodes three operons for siderophore biosynthesis: enterobactin (entABCDEFS, fesABCDG), salmochelin (iroE) and aerobactin (iutA). Moreover, the type 1 and 3 fimbriae and an ecp (E. coli common pilus) gene have also been detected. These three fimbriae types are directly related to the adhesion to surfaces, interaction with host cells, and biofilm formation (Alcántar-Curiel et al., 2013). Taken together, the presence of several virulence factors suggests that the KpBSB53 sample may be a hypervirulent strain or at least more virulent than the classical K. pneumoniae strains, as discussed by Paczosa and Mecsas (2016).

Resources usage
Table 4 presents the expected computer resources and timings for the execution of all pipelines on a standard Linux laptop. A Linux Ubuntu 22.04 laptop, with 4 CPUs (8 cores) and 18 GB RAM, was used in this study. The computational requirements of the ngs-preprocess pipeline are low. However, the time it takes to finish increases with the amount of data. A single tool, fastp, is used for short-read preprocessing, which takes around 7 minutes to complete and requires approximately 3 GB of RAM. For nanopore reads, Porechop is the most resource-intensive module, taking approximately 1 hour to finish and using approximately 5 GB of RAM. The MpGAP assembly pipeline is the most resource-intensive step in the workflow, and its execution time depends heavily on the amount of sequencing data provided. In our hybrid assembly use case, it accounts for half of the overall processing time and requires 11 GB of RAM (Table 4).

Discussion
In the last decade, advances in DNA sequence generation resulted in a steadfast incorporation of genomics into routine microbiological research practice, ranging from clinical to environmental applications (Didelot and Parkhill, 2022). Despite the significant progress made by the establishment of multiple computational protocols to process genomic data, a considerable gap still needs to be bridged for these to be readily integrated into laboratory practices. The primary reason behind the problem is the lack of necessary bioinformatics foundations, including a shortage of skilled personnel or proper infrastructure.
Our belief is that genomics pipelines should possess certain inherent characteristics to promote their widespread use. These attributes include easy installation and execution, modularity and extensibility, and the generation of user-friendly reports that consolidate the results and foster biological interpretation.
Considering these tenets, we have created a comprehensive set of bacterial genomics pipelines (ngs-preprocess, MpGAP, and bacannot). These specialized pipelines should be invoked in sequence, taking raw data from any sequencing platform and converting them into an annotated genome, emphasizing antimicrobial resistance and virulence genes. The first two modules are not restricted to bacterial data, making it possible to analyze data from other organisms. Conversely, the bacannot pipeline is specific to bacteria but can accept sequencing reads as input, allowing users to assemble and annotate their genome with a single command, which is particularly helpful for those less experienced with bioinformatics.
By following a few simple steps (available online), users can customize the execution of these pipelines and obtain complete results in under a day using a commodity computer. These pipelines utilize technology for modular workflow composition, installation, and execution through virtual containers that package all necessary software requirements. This modularity enables incremental updates in response to community requirements, which are transparently pushed to the users unless a specific version is requested for reproducibility purposes.

[...] et al., 2021). Although they have some common elements and comparable assembly and annotation modules, each software has unique features tailored for specific purposes. Their goals and designs vary, providing users with high-quality options for various analytical scenarios. Table 5 outlines the differences in capabilities among the annotation pipelines. We now highlight some of the distinctive features of these pipelines.
Our three pipelines provide a comprehensive bacterial genomics analysis from unprocessed reads to annotation, similar to what ASA3P, TORMES, Nullarbor, and Bactopia offer. The analysis provided by MicroPIPE and MicrobeAnnotator is less comprehensive than the other pipelines. MicroPIPE is crafted for genome assembly and covers the process from basecalling to genome polishing. However, our MpGAP genome assembly pipeline is more versatile as it can accommodate data from Illumina, PacBio, and ONT sequencing technologies and has nine assemblers with distinctive assembly strategies. This flexibility enables users to select the most suitable choice for their requirements. MicrobeAnnotator is a pipeline that specializes in functional genome annotation. It offers a KEGG Orthology annotation step through Kofamscan, also available in bacannot. However, MicrobeAnnotator only focuses on the functional annotation of predicted genes using KEGG, UniProt, RefSeq, and TrEMBL. Unlike other pipelines, it does not include extra modules like virulence and resistance gene annotation.
Analogous to Bactopia, our pipelines are very flexible and customizable. Compared to bacannot, Bactopia has more functionalities but cannot currently produce visual reports for results inspection. Moreover, those with limited bioinformatics experience may find its additional configuration steps more complex. Additionally, Bactopia provides extensions to its core workflow with post-processing tools intended, for example, for pangenome and phylogenetic analyses. Bacannot does not offer these by default, but the standardization of its outputs enables users to adapt them for such tasks.
Bacannot stands out for its added ability to annotate various genomic feature classes as part of its central workflow, including secondary metabolites, prophages, genomic islands, integrative and conjugative elements (ICEs), and DNA methylation, without requiring additional executions. This analytic range permits the annotation of clinically relevant traits and provides valuable attributes for non-clinical samples. As a result, bacannot has been used in various scenarios, including the analysis of clinical samples (de Campos et al., 2021), environmental samples from a lake (Janssen et al., 2021), soil (Belmok et al., 2023), and plant-associated bacteria (Bartoli et al., 2022; Ramírez-Sánchez et al., 2022).
Lastly, bacannot is the only tool that includes a comprehensive set of dynamic reports, presented as a built-in web application, along with a genome browser, offering a unified and interactive platform for interrogation and visualization of the annotation results.
Using the developed pipelines, we analyzed a strain of Klebsiella pneumoniae (KpBSB53) isolated from a patient at the University Hospital of Brasilia. We aimed to characterize its antibiotic resistance profile and virulence at the genome sequence level. We used sequencing data from short and long-read platforms to conduct a comprehensive genomic analysis of the strain on a laptop with 18 GB of memory in less than a day. Our findings revealed that the strain belongs to the ST 105 group, associated with a neonatal unit outbreak in China (Zheng et al., 2016). The annotation of resistance genes only identified components considered intrinsic to the species (Bernardini et al., 2019; Holt et al., 2015). Additionally, mutations in acrR, ompK36, and ompK37 genes were found, which may play a role in resistance. These findings are consistent with the experimentally observed susceptibility to the tested antibiotics.

Conclusions
This work showcases three bioinformatics pipelines that, together, provide a complete workflow for a thorough analysis of bacterial genomics using next-generation sequencing data. These computational protocols encompass the entire process, from the initial raw reads to the final genome assembly and gene annotation. It is recommended to execute them in succession, but they can also function independently by incorporating external data provided by the user.
The pipelines were designed to simplify the installation process by incorporating the many specialized software tools required for all stages in the form of virtualized containers. This eliminates the complexity of setup tasks, allowing the pipelines to be deployed and executed with a single command line. We also provide thorough documentation to expand the user base and make them accessible to those without bioinformatics expertise.
We not only focused on processing data but also on creating graphical reports and visualizations to improve result interpretation.To achieve this, we developed a web-based tool that allows users to analyze and refine the results using text or sequence annotation, distinguishing bacannot from other pipelines.
The existence of several comparable pipelines indicates that there is no one-size-fits-all approach to genomics. Generating DNA sequence data is becoming more widespread, but there is a significant challenge in analyzing this data. Our set of tools has been developed to address this issue and aid in the study of bacterial genomics.

Ethical considerations
Ethical approval for the studies involving human participants was received from the Faculdade de Medicina, Universidade de Brasília, Brasília, DF, Brazil (approval no. CEP/FMUnB 1.131.054; CAEE: 44867915.1.0000.558). The patients provided their written informed consent to participate in this study.

Data availability
Sample sequencing data has been made available in NCBI, BioProject PRJNA955456, BioSample: SAMN34178607. Additionally, the code required to reproduce the analysis performed in this paper has been made available in a Zenodo repository containing bash scripts with all required configurations.
This project contains the following underlying data:
• input/sra_ids.txt (list of SRA ids for the preprocessing pipeline)
• preprocess-params.yml (pre-set parameters file for the preprocessing pipeline)
• run_preprocess.sh (script to run preprocessing pipeline with pre-set configurations)
• assembly-params.yml (pre-set parameters file for the assembly pipeline)
• assembly.config (pre-set resources configuration file for the assembly pipeline)
• assembly_samplesheet.yml (pre-generated samplesheet for the assembly pipeline)
• run_assembly.sh (script to run assembly pipeline with pre-set configuration)
• annotation-params.yml (pre-set parameters file for the annotation pipeline)
• annotation_samplesheet.yml (pre-generated samplesheet for the annotation pipeline)
• run_annotation.sh (script to run annotation pipeline with pre-set configuration)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
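Before running the scripts, it can help to confirm that a downloaded copy of the archive is complete. The following is a minimal sketch, not part of the published repository: the file names are copied from the list above, while the function name `missing_files` and the idea that the files sit directly under a single root directory are assumptions made for illustration.

```python
from pathlib import Path

# File names copied from the data list above. Their placement relative to a
# single archive root directory is an assumption of this sketch.
EXPECTED_FILES = [
    "input/sra_ids.txt",
    "preprocess-params.yml",
    "run_preprocess.sh",
    "assembly-params.yml",
    "assembly.config",
    "assembly_samplesheet.yml",
    "run_assembly.sh",
    "annotation-params.yml",
    "annotation_samplesheet.yml",
    "run_annotation.sh",
]


def missing_files(root):
    """Return the expected files that are not present under `root`.

    Hypothetical helper: it only checks that each path exists as a regular
    file and does not inspect file contents.
    """
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files(".")` from the directory where the archive was extracted returns an empty list when every listed file is present, and otherwise the names that still need to be retrieved.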

Software availability
The pipelines are available as GitHub repositories under the GNU General Public License v3.0. The pipelines and their dependencies are easily installed by combining Nextflow with Docker or Singularity, all described in the documentation. The repositories are:
• ngs-preprocess, version controlled: https://github.com/fmalmeida/ngs-preprocess
• ngs-preprocess, archived, Zenodo: https://zenodo.org/record/7831610

The authors have developed a suite of open source workflows implemented in Nextflow for microbial WGS preprocessing, de novo assembly, and annotation. The goal of this work is to provide re-usable workflows that can be tailored to the end-user's needs; therefore, several interchangeable tools (e.g., de novo assemblers) are wrapped in a common Nextflow framework. Interoperability between bioinformatics tools is a significant challenge; therefore, these workflows provide a useful resource for tailoring a set of tools to meet one's needs based on sequencing technology.
The workflows are then applied to a real-world problem: analyzing genomic sequences from a Klebsiella isolate, with recapitulation of antibiotic resistance phenotypes from genomic data as the "ground truth" for validating the workflow. While this is a limited dataset, the purpose of the paper is not to validate all of the bioinformatics tools used in the workflow but to demonstrate how it could be used to generate a useful output from genomic sequencing data.
The paper would benefit from demonstrating the first two pipelines (excluding annotation) on a non-bacterial dataset, as the authors claim that only the annotation pipeline is prokaryote-specific.
I was able to download and start the Nextflow pipelines; however, I do not have access to the computational resources required to complete them. Therefore, I am unable to evaluate the outputs of the pipelines at this time. It would be useful for the authors to provide example outputs on Zenodo.

Suggestion (ref documentation website):
As far as the design of the pipeline and the description of the design choices go, they are well explained. The interpretation of the generated results has been neatly described for bacannot in the documentation page (https://bacannot.readthedocs.io/en/latest/outputs/), including the directory structure and the relevant links for the tool-specific reference material. However, as the pipelines are modular and the user is expected to run them separately, it is recommended that the documentation websites for the MpGAP and ngs-preprocess pipelines also have an "Output" page to guide users through the output structure and point to the correct tool-specific links.
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The paper entitled "Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation" takes advantage of several well-known tools for NGS data processing, assembly, and downstream analyses. The GitHub page seems to provide enough information too. However, the paper still needs to improve in some aspects.
The aim of this paper is not clear. The authors developed a pipeline based on existing tools. However, the aim is not easy to find, and the work lacks novelty.
The other drawback of this paper is the fact that the authors tested the pipeline only using a clinical isolate. How this pipeline handles, for instance, gram-positive bacteria or other bacteria with different genome sizes is not clear.
The figures are not accurate. They do not provide enough/precise information regarding the steps, tasks, or tools.
It is necessary to implement the Bandage tool into the pipeline right after assembly to visualize the assembly.
Is there any specific reason the authors chose fastp as a quality check tool? Obviously, fastqc provides much more detail compared to fastp.
I understand that different tools produce different assemblies; however, many publications have already benchmarked them. For instance, it is well known that SPAdes, Flye, and Unicycler provide better assemblies for short, long, and hybrid data, respectively. Please see Wick et al. (2019) [1] or Khezri et al. (2021) [2]. However, in my opinion, a pipeline should stick to the best overall tool and avoid providing many tools, as pipelines are designed for less experienced users.
In Table 4, some of the tasks, such as annotation with Prokka as well as Quast and Busco, took a tremendous amount of time, which is not how it should be. What is the authors' comment on this?
Although the authors presented the differences between their pipeline and other published pipelines, the samples in this study should be tested using other pipelines in order to have a clear demonstration of performance.
Porechop is out of date and no longer supported. This is important when it comes to new chemistry and sequencing kits, where, for instance, 24 barcodes in rapid barcoding kits are not recognized by Porechop. Here it would be better to take advantage of ONT guppy software not only for trimming but also for other tasks.
Very little is written regarding BLAST. How can the user benefit from BLAST in this pipeline? Does it come with pre-indexed reference databases?
The Prokka output can be used for many different purposes, including core genome and pan genome analyses, which are important for people working on epidemiology. It would be nice if the authors could extend their pipeline to cover this.
Reviewer Expertise: Bioinformatics and bacterial genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 1. Flowchart of optional sequential execution of the developed pipelines for complete bacterial genomics analysis. The available whole genome assembly module in bacannot is considered limited compared to MpGAP, because it only contains two assembler options. Dashed arrows connected to the crossed circles represent the optional flow of data, highlighting that chaining the pipelines is not required. Gray boxes highlight the pipeline's outputs.

Figure 2. Flowchart of all steps that can be performed by the different workflows available in the ngs-preprocess pipeline.

Figure 3. Flowchart of all steps performed by the different workflows available in the MpGAP pipeline.

Figure 5. Overview of typical QC outputs generated by the ngs-preprocess pipeline. Figures A and B are generated by the fastp tool and display the base quality of one of the short-read pairs and the summary of reads statistics, respectively. Figures C and D are generated by the NanoPlot tool and display the average read quality per read length and the weighted read length histogram, respectively.

Figure 6. Overview of the main features available in the web-based application for results exploration made available as part of the bacannot pipeline. Figure A is the homepage of the application for navigation between available features. Figures B and C display the dynamic text-filtering of annotation results. Figures D and E exemplify the interactive results filtering and investigation based on sequence alignment. Figure F shows the SequenceServer tool included in the application for execution and visualization of BLAST alignments.

Figure 7. Overview of the specialized automatic HTML reports generated by the bacannot pipeline. Figures A and B are screenshots of the antimicrobial resistance (AMR) automatic report, highlighting its homepage containing the annotation description and summary along with a bar plot displaying all the features annotated by the AMRFinderPlus tool. Figure C shows a partial screenshot of the KEGG annotation heatmap autogenerated using KOfamscan and KEGGDecoder tools.

Figure 8. Overview of the automatically rendered genome browser using JBrowse made available as part of the bacannot outputs. For illustration, the annotation tracks showing predicted antibiotic resistance genes and insertion sequences are activated in a region of the KpBSB53 plasmid.
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Mycobacterium tuberculosis, Reproducible methods, Machine learning, Variant calling, Reference Alignment, Resistance profiling and Scientific workflow managers.
sequencing. F1000Res. 2019; 8: 2138.
2. Khezri A, Avershina E, Ahmad R: Hybrid Assembly Provides Improved Resolution of Plasmids, Antimicrobial Resistance Genes, and Virulence Factors in Escherichia coli and Klebsiella pneumoniae Clinical Isolates. Microorganisms. 2021; 9 (12).

Is the rationale for developing the new software tool clearly explained? Partly
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? No
Competing Interests: No competing interests were disclosed.

Figure 4. Flowchart of all analytical steps available in the bacannot pipeline.

Table 3. Software and databases that have been made part of the bacannot pipeline (v3.2).

Biological sample, DNA extraction, susceptibility test, and sequencing
The sample KpBSB53 was collected at the University Hospital of Brasilia, Brazil, in April 2016 from the tracheal aspirate of a 41-year-old man. The VITEK 2 system (BioMérieux) was used for microbial identification. Antibiotic susceptibility was tested by the disk diffusion method as described by de Campos et al. (2021). The following antibiotics were tested: Amikacin, Aztreonam, Cefepime, Ceftazidime, Ciprofloxacin, Gentamicin, Imipenem, Levofloxacin, Meropenem, Norfloxacin, Ofloxacin, Piperacillin/tazobactam, Polymyxin, Tobramycin, Ticarcillin/clavulanate (Sensidisc DME - Diagnósticos Microbiológicos Especializados). After 24 h of incubation at 37°C, the sample was classified as susceptible, resistant, or multiresistant based on what was described by the manufacturer. DNA extraction was performed as described by Ausubel et al.

Table 4. Resource usage metrics for the execution of all pipelines using strain KpBSB53 data, as measured automatically by Nextflow. For bacannot, only the most time and memory-consuming tasks are shown.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 14 November 2023 https://doi.org/10.5256/f1000research.152764.r218196