rANOMALY: AmplicoN wOrkflow for Microbial community AnaLYsis

Bioinformatic tools for marker gene sequencing data analysis are continuously and rapidly evolving, thus integrating most recent techniques and tools is challenging. We present an R package for data analysis of 16S and ITS amplicons based sequencing. This workflow is based on several R functions and performs automatic treatments from fastq sequence files to diversity and differential analysis with statistical validation. The main purpose of this package is to automate bioinformatic analysis, ensure reproducibility between projects, and to be flexible enough to quickly integrate new bioinformatic tools or statistical methods. rANOMALY is an easy to install and customizable R package, that uses amplicon sequence variants (ASV) level for microbial community characterization. It integrates all assets of the latest bioinformatics methods, such as better sequence tracking, decontamination from control samples, use of multiple reference databases for taxonomic annotation, all main ecological analysis for which we propose advanced statistical tests, and a cross-validated differential analysis by four different methods. Our package produces ready to publish figures, and all of its outputs are made to be integrated in Rmarkdown code to produce automated reports.


Introduction
Studies of microbial communities tends to become a daily routine analysis for lots of laboratories and the main method to explore microbial diversity is metabarcoding, which is an amplicon targeted sequencing method (16S for bacteria and ITS for fungi). Metabarcoding generates a large amount of data and a lot of applications already exist for their processing (FROGS 1 , qiime 2 ). Methods and software are continuously evolving and the main challenge for bioinformaticians is to implement the most recent and effective ones in their analysis. Here we present rANOMALY, a scalable and lightweight R package which is able to handle every step of a metabarcoding analysis, from read cleaning, contaminant filtering, taxonomic assignment, to advanced statistical analysis. rANOMALY is fully implemented in R language in which each step correspond to one function, allowing to easy implementation of new features or tools while being easy to use and maintain. The package allows the workflow to be executed on any R environment. rANOMALY only needs a CSV table describing the metadata for each sample, and a folder containing the corresponding fastq files as input. It can produce high quality figures for Rmarkdown reports along with statistical tests ready for publication.
The workflow is illustrated in Figure 1.
Implementation rANOMALY is an R package depending on other CRAN, Bioconductor, and git R packages. It is easy to install via devtools::install_git function.
Raw sequence processing Samples must have been previously demultiplexed into one file per sample with the file name following this syntax: {sampleid}_R [12].fastq. The denoising process is handled using the dada2 R package 3 which produces amplicon sequence variants (ASV) as a taxonomic unit. This improves resolution of the potential presence of microbial organisms by using a prediction model to correct sequencing errors before aggregating similar sequences. rANOMALY handles processing any region of 16S (V1 to V9) and ITS amplicon (ITS1, ITS2) sequences, in which an additional step of cutadapt 4 removes ITS probes left in some short sequences. For 16S amplicon, primers are trimmed based on the primer sequence length. ASV identifiers are sequences translated into MD5 hashes which are unique identifiers based on the DNA sequences offering the possibility to be compared between projects. This step results in an object with representing sequences, a raw ASV counts table and a text file containing statistics from the denoising process.
Taxonomic assignment Taxonomic assignments of ASVs are carried out by IDTAXA (part of the DECIPHER package), an algorithm based on a machine learning method 5 . We implemented a functionality to compute assignment with two reference databases. For example, 16S amplicon sequences can be assigned with an environment specific database (DAIRYdb 6 , HITdb 7 , MIDAS 8 ) and a general database like SILVA 9 or GreenGenes 10 . The assignment with the best confidence and the lowest rank is kept, thus increasing assignment depth and accuracy. As taxonomy can differ between reference databases, we implemented a taxonomy validation step in rANOMALY to unravel taxonomy inconsistency like taxa with multiple ancestors or empty ranks. As supplementary features, rANOMALY functions allow users to create their own IDTAXA formatted specific reference database. The first step consists of filling the taxonomic table empty fields with the last known rank, and checking for taxonomy incongruencies as in the assignment function. A taxid file as used with RDP classifier 11 is constructed and then a last function takes as input, the corrected taxonomy table, the fasta file and the taxid file to generate the IDTAXA formatted database.

Phylogenetic tree
Phylogenetic tree is generated in three steps. First, sequences are aligned with AlignSeqs function from DECI-PHER package 12 by the guided-tree method. Then, distance matrix is calculated with the dist.ml function from the phangorn package. And finally, neighbour joining and pml function computes the likelihood of the phylogenetic tree.
The abundance, metadata, taxonomic table, reference sequences and phylogenetic tree are merged into a phyloseq object 13 .

Decontamination
ASVs have the advantage of enabling the distinction between contaminants and the real community. We have integrated the decontam package 14 into the workflow. Indeed using OTU based clustering methods can agglomerate contaminant sequences with real sample sequences, the whole cluster could hence be considered as contaminant by mistake. Working with ASVs allows the use of R package decontam which will sensitively exclude contaminant ASVs. It integrates two main methods, one based on the prevalence of the contaminant in the control samples, and another one based on the DNA concentration of the samples. Moreover, our decontamination step allows users to apply various filters such as low ASVs frequency, low ASVs prevalence in real samples, and the minimum number of reads per sample.

Graphical plots and statistical analyses
Statistical analyses are key features of rANOMALY workflow. Main descriptive analyses are integrated thanks to phyloseq functionalities. In addition, we automatized graphical representations and advanced statistical tests for alpha, beta diversity and composition plots. Above all, we have included the four most up-to-date differential analyses to assess differentially abundant features.
Community composition plots, alpha and beta diversity analyses rANOMALY allows users to explore microbial community composition with three different types of plots : classical interactive bar plot 15 of raw and relative taxa abundances, rarefaction curves to check sampling effort, and Krona interactive pie charts 16 .
Alpha diversity indices representing the specific richness are calculated (Richness, Simpson, Shannon...) with the vegan R package 17 . We added statistical tests such as multi-factors analysis of variance and pairwise Wilcoxon tests to assess significant differences between tested categories. We included repeated measures ANOVA to handle within-subjects variation. For example, it can be used when there are measures taken on the same individuals at different time points. This step outputs graphical representations and tables with results of statistical tests which are both saved in files and returned as list objects for markdown reports.
Beta diversity analysis allows users to estimate the community differences between two samples, it is also based on the vegan package 17 . rANOMALY can calculate all different distances such as the one based on ASVs abundance (BrayCurtis), rank based Jaccard indices, and phylogenetic distances as UniFrac, weighted Unifrac. Graphical representations PCoA, NMDS and more are available. This analysis can be processed at different taxonomic levels and categorical factors chosen by the user. The additional statistical test of PERMANOVA uses the distance matrix to determine if microbial communities of sample groups are significantly different from each other. We added a pairwise PERMANOVA 18 test to confirm significant differences between specific group of samples.

Differential analysis
Differential analyses are meant to assess potential differentially abundant taxas between tested conditions chosen by users. rANOMALY wraps three methods: • DESeq2 19 uses negative binomial generalized linear model with a variance stabilizing transformation on abundances.
• metagenomeSeq 20 uses a zero-inflated log-normal model with cumulative sum scaling normalization.
• metacoder 21 applies a total sample sum normalisation and uses a non-parametric Wilcoxon Rank Sum test to compare the log ratio of mean proportions.
The use of multiple methods for differential analysis allows the user to investigate which feature can be considered as differentially abundant between conditions. rANOMALY function using DESeq2 and metagenomeSeq outputs tables and plots with significant features. Metacoder outputs heat tree plots allowing users to infer differentially abundant features at each taxonomy rank and their position on the phylogenetic tree. With basic data management, and according to the identification of all significant differentially features (i.e. ASVs, genus, species) from the three methods, we generate a single table to find significant features in one or more methods and ease the interpretation. Additional information are added to the final table, like mean relative abundance for each feature and condition. A column in which condition features are significantly more abundant, features taxonomies and related sequences are added to the aggregated table. To complete differential analysis, we have included the well recognized PLS-DA from the mixOmics package 22 . It is a supervised classification method allowing users to identify features discriminating the sample groups.

Additional features and export
Functions and procedures are available to help the user to generate additional figures or to export the data to third party software. For instance, shared taxa between conditions are useful to explore. We use a function that can generate Venn diagrams, or for more complex visualisation, we can produce files readable by Cytoscape 23 to produce shared taxa networks. Krona diagrams can be displayed to explore sample microbial composition, where samples can be merged by a specified factor. rANOMALY allows users to generate inputs for STAMP 24 , which is a graphical software that provides statistical hypothesis tests and exploratory plots for analysing taxonomic profiles.
Operation rANOMALY requires R 3.6.3 or upper and can be run on any operating system with common specifications (1Go disk space, 4Go RAM, multicore CPU is recommended).

Use case
For this example we are using a dataset from Fretin et al. study 25 in which samples are from four different environments: cow milk, cow cheese (rind and core) and cow teat skin. This dataset and metadata are available on NCBI-SRA website: BioProject accession PRJNA421256. To ease access to this dataset, fastq files along with pre-formatted metadata are available on this repository.

Install
Up to date code is hosted by the INRAE gitlab, users can simply download and install the package in R console with following command lines.

Processing of raw sequences
ASV definition with DADA2. The first step is to define ASVs thanks to the dada2 package. In rANOMALY, only one function is needed to compute all the different steps require from this package. Here sample names are extracted from the file name, thus be sure that files name match samples name in the metadata file.
• rep-seqs.fna fasta file with all representative sequences for each ASV.
• robjects.Rdata with saved dada_res list containing raw ASV table and representative sequences in objects otu. Taxonomic assignment. assign_taxo_fun function uses IDTAXA function from DECIPHER package, and allows to use two different databases. It keeps the best assignment on two criteria, resolution (depth in taxonomy assignment) and confidence (value givenby IDTAXA). The final taxonomy is validated by checking for multiple ancestors (i.e. same species assigned to different genus) and incongruities (i.e. empty fields or incomplete lineage) correction step.
We share the latest databases we use in the IDTAXA format in this link. Users can also generate your own IDTAXA formatted database following those instructions and scripts we provide at this page.
• allDB_tax_table.csv raw assignations from the two databases, mainly for debugging. • the phylogenetic tree (tree).
• metadata from from csv file.
# Phyloseq object generation data = generate_phyloseq_fun(dada_res = dada_res, tax. in overall samples filtering. We have also included an option to filter out ASV based on their taxa names for known laboratory recurrent contaminants.
• Exclu_out.csv list of filtered ASVs for each filtering step.
• Krona plot before and after filtering.
Here we are going to filter out ASVs representing less than 0.1% (freq) of the reads and that are present in more than 4 samples (prev). Moreover, we are excluding "unassigned" taxas for this use case. Our sample dataset do not contain control samples, this step will be skipped.  Plots, diversity and statistic analyses Rarefaction curves. In order to observe the sampling depth of each sample we start by plotting rarefaction curves. Those plots are generated by plotly which makes them interactive. (Figure 2) rareplot = rarefaction(data_filtered, "source_location", 100) Community composition plot. The bars_fun function allows user to generate interactive community composition plot. Figure 3 presents the composition plot with relative abundances for the top 20 genera existing in our samples.
The function allows to plot at different taxonomy rank and to modify the number of taxa to show.
Here two arguments controlling the composition plot aesthetics: • Ord1 option order the sample along the X axis.
• Fact1 option control labels of the X axis. Fact1="sample.id" if user don't want the sample to be renamed.
source_location factor shows very different bacterial community between milk, cheese and cow teats environments.
# Relative abondance prel = bars_fun(data_filtered, "Genus", top = 20, Ord1 = "source_location", Fact1 = "source_ location", relative = TRUE, outfile = "plot_compo_rel.html") Alpha diversity. The alpha_diversity_fun function can computes various alpha diversity indexes. It uses the estimate_richness function from phyloseq (Available measures : Observed, Chao1, ACE, Shannon, Simpson, InvSimpson, Fisher). Here we calculate ASV richness and Shannon index and carry out an analysis of variance on the source_location factor. Sequencing depth is automatically taken into account in this test. A pairwise wilcoxon test is added to ANOVA to define which group might be significantly different from others. Figure 4 shows boxplots of diversity indices, cow teats environment has much more ASV than other environments. Shannon index reveals more differences between cheese rinds and cheese cores. Cheese rinds show a higher Shannon index highlighting a more balanced bacterial community. divAlpha = diversity_alpha_fun(data = data_filtered, output = "./analysis/07_all_res/plot_div_ alpha/", column1 = "source_location", measures = c("Observed", "Shannon")) Results of the analysis of variance and pairwise wilcoxon test on Shannon index:   Beta diversity. The diversity_beta_light function allows to generate specific tests and figures ready to publish in rmarkdown report as in the example below. It is based on the vegan package function vegdist for the distance calculation and phyloseq-extended in addition to ordinate funtion for the ordination plot.
We include statistical tests to ease the interpretation of results. A permutational ANOVA is carried out on matrix distance to compare groups by testing if centroids and dispersion are equivalent for all groups. User have to inform col argument and optionally cov (covariable) to assess PERMANOVA to determine significant differences between groups. A pairwise-PERMANOVA is processed to determine which condition is significantly different from another (based on p-value).
As a return, you will get a list that contains: • An ordination plot ($plot).
• The pairwise permANOVA ($pairwisepermanova) Here we present results of beta diversity analysis on source_location factor, Figure 5 show ordination plot and it confirms the big differences between community of cheese, milk and cow teats. divBeta = diversity_beta_light(psobj = data_filtered, output = "./analysis/07_all_res/plot_div_beta/", col = "source_location", cov = NULL, dist0 = "bray", ord0 = "MDS") The permanova tests on BrayCurtis distance shows significant p-value for the source_location factor. Pairwise permanova test is used to define which level of the factor tested is significantly different from others.

Differential analysis.
We choose three different methods to process differential analysis which is a key step of the workflow. The main advantage of the use of multiple methods is to cross validate deferentially abundant taxa between tested conditions. For this use case, we choose to focus on milk and cow teat environment to compare community at genus level.
Metacoder Metacoder is the most simple differential analysis tool of the three. Counts are normalized by total sum scaling to minimize the sample sequencing depth effect and it uses a Kruskal-Wallis test to determine significant differences between groups. The metacoder_fun function allows the user to choose the taxonomic rank, which factor to the test (column1), and a specific pairwise comparison (comp) to launch the differential analysis.
It produces pretty graphical trees, representing taxas present in both groups and coloring branches depending on which group this taxa is more abundant (Figure 6). Two trees are produced, a raw one, where everything is displayed and a filtered one where only significant features are represented (p-value <= 0.05). • a table with all wilcoxon test results ($table).

Results aggregation step
The aggregate_fun function allows to merge the results from the three differential analyses methods computed previously to obtain one unique table with all informations of significant differentially abundant features. Figure 7 shows the most significant and differential abundant Genera between the two environments.
• sumMethods: sum of methods in which feature is significant.
• DESeqLFC: Log Fold Change value as calculated in DESeq2.
• absDESeqLFC: absolute value of Log Fold Change value as calculated in DESeq2. • MeanRelAbcond1 / MeanRelAbcond2: Means relative abundance in condition 1 and 2.
• Condition: in which the mean feature relative abundance is higher.
• Taxonomy and representative sequence.
Here is an overview of the aggregate_fun table informing in which methods each feature is significant, their DESeq2 LogFoldChange value, taxonomy and representative sequences: as.data.frame(head(apply(resF$

Miscellaneous function
Heatmap. User can generate an interactive heatmap with heatmap_fun function to explore relative abundance of top taxa through samples as showed in Figure 8.

Figure 8. Heatmap of top genuses relative abundance.
Venn diagram of shared taxa. Figure 9 shows Venn diagram comparing the three environment of this study, this method is useful to determine shared taxa between group of samples. This function uses venn::venn function which handles up to 7 groups comparison.
phy2tsv_fun function allows user to generate tabulated format abundance table at different taxonomic rank with following commnands: for (i in c("ASV", "Species", "Genus", "Family")) { phy2tsv_fun(data_prune, output = glue::glue("./analysis/08_tsv_table/{i}_ table/"), rank = i) } export_to_stamp_fun function creates two files that can be imported into the STAMP software. export_to_stamp_fun(data = data, output = "./analysis/stamp/", correc = TRUE) csv2phyloseq_fun function allows user to import data from other bioinformatic pipeline like FROGS, Qiime2. This function needs tabulated ASV, taxonomy, metadata and DNA sequence table as inputs. It can generate phylogenetic tree if missing and output a phyloseq object ready for downstream analyses.

Conclusions
rANOMALY allows users to handle metagenomic data from raw sequences quality control to final differential analysis with ready to publish results in an easy and reproducible manner. Users have access to all sources of the rANOMALY package that can be deployed on any operating system or server allowing them to analyse anything from a few samples to several thousand. This workflow combines all of the latest developments in the field: the use of high resolution amplicon sequence variant, contaminant filtering, double automatic taxonomic assignation, integrated statistical analyses and four differential analyses with cross validation. rANOMALY help users those who don't have big programming skills. And since rANOMALY uses phyloseq objects and standards, original functions from the phyloseq package can still be used to split, filter and select specific samples previously to be visualized and/or tested with rANOMALY functions. For exploratory analysis and interactive experience, a shiny application based on this workflow is under active development and will be shared with the community.

Software availability
Up-to-date source code, and tutorials are available at: https://forgemia.inra.fr/umrf/ranomaly. This manuscript describes an R pipeline (rANOMALY) that connects together in a single simplified R workflow many of the standard tools now routinely used in metabarcoding data analysis. I have no doubt that such an integrated pipeline will be very useful for many researchers, especially ones not experts in bioinformatics. The modular design is also well suited for future changes and evolution of the pipeline.
The manuscript itself is very clear, nicely outlines how the different components of the pipeline are connected, and also clearly distinguish between which functions are provided by this pipeline and which functions are provided by third-party tools or packages. My only major comment is about user input (either user questions or bug reports), for which there is currently no process or web page mentioned in the manuscript. For instance, should users report bugs in the gitlab repo ( https://forgemia.inra.fr/umrf/ranomaly/-/issues)? (Incidentally, I did encounter an unexpected error with DeSeq: Error in `assays<-`(`*tmp*`, value = `*vtmp*`) : please use 'assay(x, withDimnames=FALSE)) <-value' or 'assays(x, withDimnames=FALSE)) <-value' when the dimnames on the supplied assay(s) are not identical to the dimnames on DESeqDataSet object 'x' In addition: Warning message: In DESeqDataSet(se, design = design, ignoreRank) : some variables in design formula are characters, converting to factors). If the authors do intend users to submit gitlab issues, this should be specified in the conclusion and/or the software availability sections. If not, the alternative(s) path to provide feedback and report issues should be clearly described.
Other than that, I only have minor comments on the manuscript which are listed below: Abstract: "uses amplicon sequence variants (ASV) level for". Shouldn't this read "uses amplicon sequence variants (ASV) for"?
In the keywords, I would argue "metagenomics" should be "metabarcoding" ○ p. 3 Introduction: ○ "Studies of microbial communities tends to" should be "Studies of microbial communities "different genus" should be "different genera" "correction step" should be removed, or clarified as e.g. "as part of a correction step" "your own" should be "their own" p. 8 Rarefaction curves: On my version of R, I have to call "rareplot" after the line "rareplot = rarefaction(data_filtered, "source_location", 100)" otherwise no plot appears. If this is the case in most versions (which I believe it is), this line ("rareplot") should be included in the authors' instructions so that a user would not be confused to not see any plot after calling the line "rareplot = rarefaction(data_filtered, "source_location", 100)" (this is also true for the other interactive plots, e.g. community composition, alpha and beta diversity, etc). ○ p. 8 Community composition plot: "source_ location": there seems to be an extra space between "source_" and "location" here ○ p. 10 Beta diversity: "funtion" should be "function" "on matrix distance" should be "on the distance matrix" "A pairwise-PERMANOVA is processed" should be "A pairwise-PERMANOVA is performed" ○ p. 11 Differential analysis: "of the use of multiple" would be more clear as "of using multiple" "cross validate" should be "cross-validate" "which factor to the test" should be "which factor to test" or "which factor to perform the test on" "which group this taxa is more abundant" should be "which group this taxa is more abundant in" ○ p. 12 DESeq2: "and well documented": Could the authors add 1 or 2 references here? ○ p. 15 Venn diagram: Shouldn't "the three environment" be "the four environments". ○ p. 15 Export: "function allows user" should be "function allows users" (also for csv2phyloseq_fun). ○ p. 15 Conclusion: "metagenomic data" should be "metabarcoding data" ○ "help users those" should be "help users" "big programming skills" should be "advanced programming skills" "previously to be visualized and/or tested" should be "prior to visualization/test" or "before being visualized and/or tested" Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes