Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.


Amendments from Version 1
In version 2 of the manuscript: We have updated the procedure for storing the filtered and trimmed files during the call to dada2, this avoids overwriting the files if the workflow is run several times.
We have replaced the msa alignment function with AlignSeqs function from from the DECIPHER 1 package, making the workflow more computationally efficient.
We have expanded the phyloseq section and reduced the number of network plots. We have also provided detailed discussion of our choice not to make the PCoA and PCA plots square.
We have added more detailed instructions in the Github repository as to how one can run only parts of the workflow and how to generate the full paper from scratch using the main.rnw file.

Introduction
The microbiome is formed of the ecological communities of microorganisms that dominate the living world. Bacteria can now be identified through the use of next generation sequencing applied at several levels. Shotgun sequencing of all bacteria in a sample delivers knowledge of all the genes present. Here we will only be interested in the identification and quantification of individual taxa (or species) through a 'fingerprint gene' called 16s rRNA which is present in all bacteria. This gene presents several variable regions which can be used to identify the different taxa.
Previous standard workflows depended on clustering all 16s rRNA sequences (generated by next generation amplicon sequencing) that occur within a 97% radius of similarity and then assigning these to 'OTUs' from reference trees 2,3 . These approaches do not incorporate all the data, in particular sequence quality information and statistical information available on the reads were not incorporated into the assignments.
In contrast, the de novo read counts used here will be constructed through the incorporation of both the quality scores and sequence frequencies in a probabilistic noise model for nucleotide transitions. For more details on the algorithmic implementation of this step see 4.
After filtering the sequences and removing the chimerae, the data are compared to a standard database of bacteria and labeled. In this workflow, we have used the labeled sequences to build a de novo phylogenetic with the phangorn.
The key step in the sequence analysis is the manner in which reads are denoised and assembled into groups we have chosen to call RSVs (Ribosomal Sequence Variants) instead of the traditional OTUs (Operational Taxonomic Units).
This article describes a computational workflow for performing denoising, filtering, data transformations, visualization, supervised learning analyses, community network tests, hierarchical testing and linear models. We provide all the code and give several examples of different types of analyses and use-cases. There are often many different objectives in experiments involving microbiome data and we will only give a flavor for what could be possible once the data has been imported into R.
In addition, the code can be easily adapted to accommodate batch effects, covariates and multiple experimental factors.
The workflow is based on software packages from the open-source Bioconductor project 5 . We provide all steps necessary from the denoising and identification of the reads input as raw sequences in fastq files to the comparative testing and multivariate analyses of the samples and analyses of the abundances according to multiple available covariates.

Methods
Amplicon bioinformatics: from raw reads to tables This section demonstrates the "full stack" of amplicon bioinformatics: construction of the sample-by-sequence feature table from the raw reads, assignment of taxonomy, and creation of a phylogenetic tree relating the sample sequences.
First we load the necessary packages.

Trim and Filter
We begin by filtering out low-quality sequencing reads and trimming the reads to a consistent length. While generally recommended filtering and trimming parameters serve as a starting point, no two datasets are identical and therefore it is always worth inspecting the quality of the data before proceeding.
Here, the forward reads maintain high quality throughout, while the quality of the reverse reads drops significantly at about position 160. Therefore, we choose to truncate the forward reads at position 245, and the reverse reads at position 160. We also choose to trim the first 10 nucleotides of each read based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors.
We combine these trimming parameters with standard filtering parameters, the most important being the enforcement of a maximum of 2 expected errors per-read 8 . Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass. ddR <-dada(derepRs[1:40], err=NULL, selfConsist=TRUE) ## Initial error matrix unspecified. Error rates will be initialized to the maximum possible estimate from this data.
In order to verify that the error rates have been reasonably well-estimated, we inspect the fit between the observed error rates (black points) and the fitted error rates (black lines) in Figure 2.

plotErrors(ddF) plotErrors(ddR)
The DADA2 sequence inference method can run in two different modes: Independent inference by sample (pool=FALSE), and inference from the pooled sequencing reads from all samples (pool=TRUE). Independent inference has the advantage that computation time is linear in the number of samples, and memory requirements are flat with the number of samples. This allows scaling out to datasets of almost unlimited size. Pooled inference is more computationally taxing, and can become intractable for datasets of tens of millions of reads. However, pooling improves the detection of rare variants that were seen just once or twice in an individual sample but many times across all samples. As this dataset is not particularly large, we perform pooled inference. As of version 1.2, multithreading can be activated with the arguments multithread = TRUE, which can substantially speed this step. The DADA2 sequence inference step removed (nearly) all substitution and indel errors from the data 4 . We now merge together the inferred forward and reverse sequences, removing paired sequences that do not perfectly overlap as a final control against residual errors.

Construct sequence table and remove chimeras
The DADA2 method produces a sequence table that is a higher-resolution analogue of the common "OTU Although exact numbers vary substantially by experimental condition, it is typical that chimeras comprise a substantial fraction of inferred sequence variants, but only a small fraction of all reads. That is what is observed here: 1503 of 1892 sequence variants were chimeric, but these only represented 10% of all reads.  0   0  10  20  30  40 0  10  20  30  40 0  10  20  30  40 0  10  20  30  40 Consensus quality score Error frequency (log10)

Assign taxonomy
One of the benefits of using well-classified marker loci like the 16S rRNA gene is the ability to taxonomically classify the sequence variants. The dada2 package implements the naive Bayesian classifier method for this purpose 9 . This classifier compares sequence variants to a training set of classified sequences, and here we use the RDP v14 training set 10 .

Construct phylogenetic tree
Phylogenetic relatedness is commonly used to inform downstream analyses, especially the calculation of phylogenyaware distances between microbial communities. The DADA2 sequence inference method is reference-free, so we must construct the phylogenetic tree relating the inferred sequence variants de novo. We begin by performing a multiple-alignment using the DECIPHER R package 11 . The phangorn R package is then used to construct a phylogenetic tree. Here we first construct a neighbor-joining tree, and then fit a GTR+G+I (Generalized time-reversible with Gamma rate variation) maximum likelihood tree using the neighbor-joining tree as a starting point.
ps <-phyloseq(tax_table(taxtab), sample_data(samdf), otu_table(seqtab, taxa_are_rows = FALSE),phy_tree(fitGTR$tree)) phyloseq phyloseq 12 is an R package to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs) or more appropriately denoised, and it is most useful when there is also associated sample data, phylogeny, and/or taxonomic assignment of each taxa. phyloseq leverages and builds upon many of the tools available in R for ecology and phylogenetic analysis (vegan 13 , ade4 14 , ape 15 ), while also using advanced/flexible graphic systems (ggplot2 16 ) to easily produce publication-quality graphics of complex phylogenetic data. The phyloseq package uses a specialized system of S4 data classes to store all related phylogenetic sequencing data as a single, self-consistent, self-describing experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of amplicon count data jointly with important sample covariates.

Further documentation
This tutorial shows a useful example workflow, but many more analyses are available to you in phyloseq, and R in general, than can fit in a single workflow. The phyloseq home page is a good place to begin browsing additional phyloseq documentation, as are the three vignettes included within the package, and linked directly at the phyloseq release page on Bioconductor.

Loading Data
Many use cases result in the need to import and combine different data into a phyloseq class object, this can be done using the import_biom function to read recent QIIME format files, older files can still be imported with import_qiime. More complete details can be found on the phyloseq FAQ page.
In the previous section the results of dada2 sequence processing were organized into a phyloseq object. This object was also saved in R-native serialized RDS format. We will re-load this here for completeness as the initial object p0.
library("phylos12eq") library("gridExtra") ps = readRDS("data/ps.rds") ps ## phyloseq-class experiment-level object ## otu_table() OTU Shiny-phyloseq It can be beneficial to start the data exploration process interactively, this often saves time in detecting outliers and specific features of the data. Shiny-phyloseq 17 is an interactive web application that provides a graphical user interface to the phyloseq package. The object just loaded into the R session in this workflow is suitable for this graphical interaction with Shiny-phyloseq.
Filtering phyloseq provides useful tools for filtering, subsetting, and agglomerating taxa -a task that is often appropriate or even necessary for effective analysis of microbiome count data. In this subsection, we graphically explore the prevalence of taxa in the example dataset, and demonstrate how this can be used as a filtering criteria. One of the reasons to filter in this way is to avoid spending much time analyzing taxa that were seen only rarely among samples. This also turns out to be a useful filter of noise (taxa that are actually just artifacts of the data collection process), a step that should probably be considered essential for datasets constructed via heuristic OTU-clustering methods, which are notoriously prone to generating spurious taxa.

Taxonomic Filtering
In many biological settings, the set of all organisms from all samples are well-represented in the available taxonomic reference database. When (and only when) this is the case, it is reasonable or even advisable to filter taxonomic features for which a high-rank taxonomy could not be assigned. Such ambiguous features in this setting are almost always sequence artifacts that don't exist in nature. It should be obvious that such a filter is not appropriate for samples from poorly characterized or novel specimens, at least until the possibility of taxonomic novelty can be satisfactorily rejected. Phylum is a useful taxonomic rank to consider using for this purpose, but others may work effectively for your data.
To begin, create a table of read counts for each Phylum present in the dataset.
# Show available ranks in the dataset rank_names(ps) ## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" # Create This shows a few phyla for which only one feature was observed. Those may be worth filtering, and we'll check that next. First, notice that in this case, six features were annotated with a Phylum of NA. These features are probably artifacts in a dataset like this, and should be removed.
The following ensures that features with ambiguous phylum annotation are also removed. Note the flexibility in defining strings that should be considered ambiguous annotation.
ps0 <-subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized")) A useful next step is to explore feature prevalence in the dataset, which we will define here as the number of samples in which a taxa appears at least once.
# Compute prevalence of each feature, store as data.frame prevdf = apply(X = otu_table(ps0), MARGIN = ifelse(taxa_are_rows(ps0), yes = 1, no = 2), FUN = function(x){sum(x > 0)}) # Add taxonomy and total read counts to this data.frame prevdf = data.frame(Prevalence = prevdf, TotalAbundance = taxa_sums(ps0), tax_table(ps0)) Are there phyla that are comprised of mostly low-prevalence features? Compute the total and average prevalences of the features in each phylum. Deinococcus-Thermus appeared in just over one percent of samples, and Fusobacteria appeared in just 2 samples total. In some cases it might be worthwhile to explore these two phyla in more detail despite this (though probably not Fusobacteria's two samples). For the purposes of this example, though, they will be filtered from the dataset.

Prevalence Filtering
The previous filtering steps are considered supervised, because they relied on prior information that is external to this experiment (a taxonomic reference database). This next filtering step is completely unsupervised, relying only on the data in this experiment, and a parameter that we will choose after exploring the data. Thus, this filtering step can be applied even in settings where taxonomic annotation is unavailable or unreliable.
First, explore the relationship of prevalence and total read count for each feature. Sometimes this reveals outliers that should probably be removed, and also provides insight into the ranges of either feature that might be useful. This aspect depends quite a lot on the experimental design and goals of the downstream inference, so keep these in mind. It may even be the case that different types of downstream inference require different choices here. There is no reason to expect ahead of time that a single filtering workflow is appropriate for all analysis.
# Subset to the remaining phyla prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum")) ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps0),color=Phylum)) + # Include a guess for parameter geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) + geom_point(size = 2, alpha = 0.7) + scale_x_log10() + xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") + facet_wrap(~Phylum) + theme(legend.position="none") Sometimes a natural separation in the dataset reveals itself, or at least, a conservative choice that is in a stable region for which small changes to the choice would have minor or no effect on the biological interpreation (stability). Here no natural separation is immediately evident, but it looks like we might reasonably define a prevalence threshold in a range of zero to 10 percent or so. Take care that this choice does not introduce bias into a downstream analysis of association of differential abundance.
The following uses five percent of all samples as the prevalence threshold.

Agglomerate taxa
When there is known to be a lot of species or sub-species functional redundancy in a microbial community, it might be useful to agglomerate the data features corresponding to closely related taxa. Ideally we would know the functional redundancies perfectly ahead of time, in which case we would agglomerate taxa using those defined relationships and the merge_taxa() function in phyloseq. That kind of exquisite functional data is usually not available, and different pairs of microbes will have different sets of overlapping functions, complicating the matter of defining appropriate grouping criteria.
While not necessarily the most useful or functionally-accurate criteria for grouping microbial features (sometimes far from accurate), taxonomic agglomeration has the advantage of being much easier to define ahead of time. This is because taxonomies are usually defined with a comparatively simple tree-like graph structure that has a fixed number of internal nodes, called "ranks". This structure is simple enough for the phyloseq package to represent taxonomies as table of taxonomy labels. Taxonomic agglomeration groups all the "leaves" in the hierarchy that descend from the user-prescribed agglomerating rank, this is sometimes called 'glomming'.
The following example code shows how one would combine all features that descend from the same genus.
# How many genera would be present after filtering? length(get_taxa_unique(ps2, taxonomic.rank = "Genus")) If taxonomy is not available or not reliable, tree-based agglomeration is a "taxonomy-free" alternative to combine data features corresponding to closely-related taxa. In this case, rather than taxonomic rank, the user specifies a tree height corresponding to the phylogenetic distance between features that should define their grouping. This is very similar to "OTU Clustering", except that in many OTU Clustering algorithms the sequence distance being used does not have the same (or any) evolutionary definition.

Abundance value transformation
It is usually necessary to transform microbiome count data to account for differences in library size, variance, scale, etc. The phyloseq package provides a flexible interface for defining new functions to accomplish these transformations of the abundance values via the transform_sample_counts() function. The first argument to this function is the phyloseq object you want to transform, and the second argument is an R function that defines the transformation. The R function is applied sample-wise, expecting that the first unnamed argument is a vector of taxa counts in the same order as the phyloseq object. Additional arguments are passed on to the function specified in the second argument, providing an explicit means to include pre-computed values, previously defined parameters/thresholds, or any other object that might be appropriate for computing the transformed values of interest. This example begins by defining a custom plot function, plot_abundance(), that uses phyloseq's psmelt() function to define a relative abundance graphic. We will use this to compare differences in scale and distribution of the abundance values in our phyloseq object before and after transformation. plot_abundance = function(physeq,title = "", Facet = "Order", Color = "Phylum"){ # Arbitrary subset, based on Phylum, for plotting p1f = subset_taxa(physeq, Phylum %in% c("Firmicutes")) mphyseq = psmelt(p1f) mphyseq <-subset(mphyseq, Abundance > 0) ggplot(data = mphyseq, mapping = aes_string(x = "sex",y = "Abundance", color = Color, fill = Color)) + geom_violin(fill = NA) + geom_point(size = 1, alpha = 0.3, position = position_jitter(width = 0.3)) + facet_wrap(facets = Facet) + scale_y_log10()+ theme(legend.position="none") } The transformation in this case converts the counts from each sample into their frequencies, often referred to as proportions or relative abundances. This function is so simple that it is easiest to define it within the function call to trans-form_sample_counts(). # Transform to relative abundance. Save as new object. ps3ra = transform_sample_counts(ps3, Now plot the abundance values before and after transformation. plotBefore = plot_abundance(ps3,"") plotAfter = plot_abundance(ps3ra,"") # Combine each plot into one graphic. grid.arrange(nrow = 2, plotBefore, plotAfter) Subset by taxonomy Notice on the previous plot that Lactobacillales appears to be a taxonomic Order with bimodal abundance profile in the data. We can check for a taxonomic explanation of this pattern by plotting just that taxonomic subset of the data. For this, we subset with the subset_taxa() function, and then specify a more precise taxonomic rank to the Facet argument of the plot_abundance function that we defined above. psOrd = subset_taxa(ps3ra, Order == "Lactobacillales") plot_abundance(psOrd, Facet = "Genus", Color = NULL) At this stage in the workflow, after converting raw reads to interpretable species abundances, and after filtering and transforming these abundances to focus attention on scientifically meaningful quantities, we are in a position to consider more careful statistical analysis. R is an ideal environment for performing these analyses, as it has an active community of package developers building simple interfaces to sophisticated techniques. As a variety of methods are available, there is no need to commit to any rigid analysis strategy a priori. Further, the ability to easily call packages without reimplementing methods frees researchers to iterate rapidly through alternative analysis ideas. The advantage of performing this full workflow in R is that this transition from bioinformatics to statistics is effortless.
We back these claims by illustrating several analyses on the mouse data prepared above. We experiment with several flavors of exploratory ordination before shifting to more formal testing and modeling, explaining the settings in which the different points of view are most appropriate. Finally, we provide example analyses of multitable data, using a study in which both metabolomic and microbial abundance measurements were collected on the same samples, to demonstrate that the general workflow presented here can be adapted to the multitable setting.

Preprocessing
Before doing the multivariate projections, we will add a few columns to our sample data, which can then be used to annotate plots. From Figure 7, we see that the ages of the mice come in a couple of groups, and so we make a categorical variable corresponding to young, middle-aged, and old mice. We also record the total number of counts seen in each sample and log-transform the data as an approximate variance stabilizing transformation.
qplot(sample_data(ps)$age, geom = "histogram") + xlab("age") qplot(log10(rowSums(otu_table(ps)))) + xlab("Logged counts-per-sample") For a first pass, we look at principal coordinates analysis (PCoA) with either the Bray-Curtis dissimilarity on the weighted Unifrac distance. We see immediately that there are six outliers. These turn out to be the samples from females 5 and 6 on day 165 and the samples from males 3, 4, 5, and 6 on day 175. We will take them out, since we are mainly interested in the relationships between the non-outlier points.

Aspect ratio of ordination plots
In the ordination plots in Figure 8- Figure 14, you may have noticed as did the reviewers of the first version of the paper, that the maps are not presented as square representations as is often the case in standard PCoA and PCA plots in the literature.
The reason for this is that as we are trying to represent the distances between samples as faithfully as possible; we have to take into account that the second eigenvalue is always smaller than the first, sometimes considerably so, thus we normalize the axis norm ratios to the relevant eigenvalue ratios.

Different ordination projections
As we have seen, an important first step in analyzing microbiome data is to do unsupervised, exploratory analysis. This is simple to do in phyloseq, which provides many distances and ordination methods.
After documenting the outliers, we are going to compute ordinations with these outliers removed and more carefully study the output. We see that there is a fairly substantial age effect that is consistent between all the mice, male and female, and from different litters. We'll first perform a PCoA using Bray-Curtis dissimilarity.   Figure 11. A DPCoA plot incorporates phylogenetic information, but is dominated by the first axis.    Figure 12, this display is harder to interpret. The first plot shows the ordination of the samples, and we see that the second axis corresponds to an age effect, with the samples from the younger and older mice separating fairly well. The first axis correlates fairly well with library size (this is not shown). The first axis explains about twice the variability than the first, this translates into the elongated form of the ordination plot.
out.dpcoa.log <-ordinate(pslog, method = "DPCoA") Finally, we can look at the results of PCoA with weighted Unifrac. As before, we find that the second axis is associated with an age effect, which is fairly similar to DPCoA. This is not surprising, because both are phylogenetic ordination methods taking abundance into account. However, when we compare biplots, we see that the DPCoA gave a much cleaner interpretation of the second axis, compared to weighted Unifrac.

PCA on ranks
Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc.
In the case where many bacteria are absent or present at trace amounts, an artificially large difference in rank could occur 21 for minimally abundant taxa. To avoid this, all those microbes with rank below some threshold are set to be tied at 1. The ranks for the other microbes are shifted down, so there is no large gap between ranks. This transformation is illustrated in Figure 15.
abund_ranks <-abund_ranks -329 abund_ranks[abund_ranks < 1] <-1 We can now perform PCA and study the resulting biplot, given in Figure 16. To produce annotation for this figure, we used the following block.
The function allows to create biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics; in contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, it ensures that the resulting CCpnA directions lie in the span of the environmental variables; thorough treatments are available in 22,23.
Like PCoA and DPCoA, this method can be run using ordinate in phyloseq. In order to use supplemental sample data, it is necessary to provide an extra argument, specifying which of the features to consider -otherwise, phyloseq defaults to using all sample_data measurements when producing the ordination. To access the positions for the biplot, we can use the scores function in the vegan. Further, to facilitate figure annotation, we also join the site scores with the environmental data in the sample_data slot. Of the 23 total taxonomic orders, we only explicitly annotate the four most abundant -this makes the biplot easier to read.

Supervised learning
Here we illustrate some supervised learning methods that can be easily run in R. The caret package wraps many prediction algorithms available in R and performs parameter tuning automatically. Since we saw that microbiome signatures change with age, we'll apply supervised techniques to try to predict age from microbiome composition.
plsClasses <-predict(plsFit, newdata = testing) As another example, we can try out random forests. This is run in exactly the same way as PLS, by switching the method argument from pls to rf. Random forests also perform well at the prediction task on this test set, though there are more old mice misclassified as young.
rfFit <-train(age ~ ., data = training, method = "rf", preProc = "center", proximity = TRUE) rfClasses <-predict(rfFit, newdata = testing) To interpret these PLS and random forest results, it is standard to produce biplots and proximity plots, respectively. The code below extracts coordinates and supplies annotation for points to include on the PLS biplot.

Graph-based visualization and testing
Creating and plotting graphs Phyloseq has functionality for creating graphs based on thresholding a distance matrix, and the resulting networks can be plotting using the ggnetwork. This package overloads the ggplot syntax, so you can use the function ggplot on an igraph object and add geom_edges and geom_nodes geoms to plot the network. To be able to color the nodes or edges a certain way, we need to add these attributes to the igraph object. Below we create a network by thresholding the Jaccard dissimilarity (the default distance for the function make_network) at .35, and then we add an attribute to the vertices indicating which mouse the sample came from and which litter the mouse was in. Then we can plot the network with the coloring by mouse and shape by litter. We see the resulting network in Figure 22, and we can see that there is grouping of the samples by both mouse and litter. setup_example(c("igraph", "phyloseq", "phyloseqGraphTest", "ggnetwork", "intergraph","gridExtra")) net <-make_network(ps, max.dist=0.35) sampledata <-data.frame(sample_data(ps)) V(net)$id <-sampledata[names(V(net)), "host_subject_id"] V(net)$litter <-sampledata[names(V(net)), "family_relationship"] ggplot(net, aes(x = x, y = y, xend = xend, yend = yend), layout = "fruchtermanreingold") + geom_edges(color = "darkgray") + geom_nodes(aes(color = id, shape = litter)) + theme(axis.text = element_blank(), axis.title = element_blank(), legend.key.height = unit(0.5,"line")) + guides(col = guide_legend(override.aes = list(size = .25)))

Graph-based two-sample tests
Graph-based two-sample tests were introduced by Friedman and Rafsky 25 as a generalization of the Wald-Wolfowitz runs test. They proposed the use of a minimum spanning tree (MST) based on the distances between the samples, and then counting the number of edges on the tree that were between samples in different groups. It is not necessary to use a minimum spanning tree (MST), the graph made by linking nearest neighbors 26 or distance thresholding can also be used as the input graph. No matter what graph we build between the samples, we can approximate a null distribution by permuting the labels of the nodes of the graph.

Minimum Spanning Tree (MST)
We first perform a test using an MST with Jaccard dissimilarity. We want to know whether the two litters (family_relationship) come from the same distribution. Since there is a grouping in the data by individual (host_subject_id), we can't simply permute all the labels, we need to maintain this nested structure -this is what the grouping argument does. Here we permute the family_relationship labels but keep the host_subject_id structure intact.
This test has a small p-value, and we reject the null hypothesis that the two samples come from the same distribution. From the plot of the minimum spanning tree in Figure 23, we see by eye that the samples group by litter more than we would expect by chance.

Nearest neighbors
The k-nearest neighbors graph is obtained by putting an edge between two samples whenever one of them is in the set of k-nearest neighbors of the other. We see from Figure 24 that if a pair of samples has an edge between them in the nearest neighbor graph, they are overwhelmingly likely to be in the same litter. We can also compute the analogous test with two-nearest neighbors and the Bray-Curtis dissimilarity. The results are not shown, but the code is given below.

Distance threshold
Another way of making a graph between samples is to threshold the distance matrix, this is called a geometric graph 27 . The testing function lets the user supply an absolute distance threshold; alternatively, it can find a distance threshold such that there are a prespecified number of edges in the graph. Below we use a distance threshold so that there are 720 edges in the graph, or twice as many edges as there are samples. Heuristically, the graph we obtain isn't as good, because there are many singletons. This reduces power, and so if the thresholded graph has this many singletons it is better to either modify the threshold or consider a MST or k-nearest neighbors graph.

Linear modeling
It is often of interest to evaluate the degree to which microbial community diversity reflects characteristics of the environment from which it was sampled. Unlike ordination, the purpose of this analysis is not to develop a representation of many bacteria with respect to sample characteristics; rather, it is to describe how a single measure of overall community structure (In particular, it need not be limited to diversity -defining univariate measures of community stability is also common, for example.) is associated with sample characteristics. This is a somewhat simpler statistical goal, and can be addressed through linear modeling, for which there are a range of approaches in R. As an example, we will used a mixed-effects model to study the relationship between mouse microbial community diversity and the age and litter variables that have been our focus so far. This choice was motivated by the observation that younger mice have noticeably lower Shannon diversities, but that different mice have different baseline diversities. The mixed-effects model is a starting point for formalizing this observation.
We first compute the Shannon diversity associated with each sample and join it with sample annotation.
We apply this method to test the association between microbial abundance and age. This provides a complementary view of the earlier analyses, identifying individual bacteria that are responsible for the differences between young and old mice.
We digress briefly from hierarchical testing to describe an alternative form of count normalization. Rather than working with the logged data as in our earlier analysis, we consider a variance stabilizing transformation introduced by 31 for RNA-seq data and in 32 for 16S rRNA generated count data and available in the DESeq2 package. The two transformations yield similar sets of significant microbes. One difference is that, after accounting for size factors, the histogram of row sums for DESeq is more spread out in the lower values, refer to Figure 27. This is the motivation of using such a transformation, although for high abundance counts, it is equivalent to the log, for lower and mid range abundances it does not crush the data and yields more powerful results. The code below illustrates the mechanics of computing DESeq2's variance stabilizing transformation on a phyloseq object. setup_example(c("phyloseq", "structSSI", "plyr", "dplyr", "reshape2", "ggplot2", "DESeq2")) ps_dds <-phyloseq_to_deseq2(ps, ~ age_binned + family_relationship) varianceStabilizingTransformation(ps_dds, blind = TRUE, fitType = "parametric") We use the structSSI to perform the hierarchical testing 33 . For more convenient printing, we first shorten the names of each microbe.
short_names <-substr(rownames(abund), 1, 5)%>% make.names(unique = TRUE) rownames(abund) <-short_names Unlike standard multiple hypothesis testing, the hierarchical testing procedure needs univariate tests for each higherlevel taxonomic group, not just every bacteria. A helper function, treePValues, is available for this; it expects an edgelist encoding parent-child relationships, with the first row specifying the root node.   The plot opens in a new browser -a static screenshot of a subtree is displayed in Figure 28. Nodes are shaded according to p-values, from blue to orange, representing the strongest to weakest associations. Grey nodes were never tested, to focus power on more promising subtrees. Scanning the full tree, it becomes clear that the association between age group and bacterial abundance is present in only a few isolated taxonomic groups, but that it is quite strong in those groups. To give context to these results, we can retrieve the taxonomic identity of the rejected hypotheses. It seems that the most strongly associated bacteria all belong to family Lachnospiraceae, which is consistent with the random forest results in Section.

Multitable techniques
Many microbiome studies attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features -bacteria, genes, or metabolites, for example -are associated with different sample conditions. There are many ways to approach these questions, which to apply depends on the study's focus.
Here, we will focus on one specific workflow that uses sparse Canonical Correlation Analysis (sparse CCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting variation. We will use an implementation from the PMA 34 .
Since the mouse data used above included only a single table, we use a new data set, collected by the study 35 . There are two tables here, one for bacteria and another with metabolites. 12 samples were obtained, each with measurements at 637 m/z values and 20,609 OTUs; however, about 96% of the entries of the microbial abundance table are exactly zero. The code below retrieves this data.
Our preprocessing mirrors that done for the mouse data. We first filter down to microbes and metabolites of interest, removing those that are zero across many samples. Then, we transform them to weaken the heavy tails.
Our implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Larger values of penaltyx will result in fewer selected microbes, similarly penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation -we generally prefer more sparsity than the default parameters would provide. With these parameters, 5 microbes and 15 metabolites have been selected, based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well -sparse CCA has minimized redundancy across features, but makes no guarantee that these are the "true" features in any sense.
Nonetheless, we can still use these 20 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA.

Operation
The programs and source for this article can be run using version 3.3 of R and version 3.3 of Bioconductor.

Conclusions
We have shown how a complete workflow in R is now available to denoise, identify and normalize next generation amplicon sequencing reads using probabilistic models with parameters fit using the data at hand.
We have provided a brief overview of all the analyses that become possible once the data has been imported into the R environment. Multivariate projections using the phylogenetic tree as the relevant distance between OTUs/RSVs can be done using weighted unifrac or double principal coordinate analyses using the phyloseq package. Biplots provide the user with an interpretation key. These biplots have been extended to triplots in the case of multidomain data incorporating genetic, metabolic and taxa abundances. We illustrate the use of network based analyses, whether the community graph is provided from other sources or from a taxa co-occurrence computation using a Jaccard distance.
We have briefly covered a small example of using three supervised learning functions (random forests, partial least squares and) to predict a response variable, The main challenges in tackling microbiome data come from the many different levels of heterogeneity both at the input and output levels. These are easily accommodated through R's capacity to combine data into S4 classes. We are able to include layers of information, trees, sample data description matrices, contingency table in the phyloseq data sctructures. The plotting facilities of ggplot2 and ggnetwork allow for the layering of information in the output into plots that combine graphs, multivariate information and maps of the relationships between covariates and taxa abundances. The layering concept allows the user to provide reproducible publication level figures with multiple heterogeneous sources of information. Our main goal in providing these tools has been to enhance the statistical power of the analyses by enabling the user to combine frequencies, quality scores and covariate information into complete and testable projections.

Summary
This illustration of possible workflows for microbiome data combining trees, networks, normalized read counts and sample information showcases the capabilities and reproducibility of an R based system for analysing bacterial communities. We have implemented key components in C wrapped within the Bioconductor package dada2 to enable the different steps to be undertaken on a laptop.
Once the sequences have been filtered and tagged they can be assembled into a phylogenetic tree directly in R using the maximum likelihood tree estimation available in phangorn. The sequences are then assembled into a phyloseq object containing all the sample covariates, the phylogenetic tree and the sample-taxa contingency table.
These data can then be visualized interactively with Shiny-phyloseq, plotted with one line wrappers in phyloseq and filtered or transformed very easily.
The last part of the paper shows more complex analyses that require direct plotting and advanced statistical analyses.
Multivariate ordination methods allow useful lower dimensional projections in the presence of phylogenetic information or multidomain data as shown in an example combining metabolites, OTU abundances, Supervised learning methods provide lists of the most relevant taxa in discriminating between groups.
Bacterial communities can be represented as co-occurrence graphs using network based plotting procedures available in R. We have also provided examples where these graphs can be used to test community structure through non parametric permutation resampling. This provides implementations of the Friedman Rafsky 25 tests for microbiome data which have not been published previously.

Data availability
Intermediary data for the analyses are made available both on GitHub at https://github.com/spholmes/F1000_workflow and at the Stanford digital repository permanent url for this paper: http://purl.stanford.edu/wh250nn9648. All other data have been previously published and the links are included in the paper. The title and abstract appropriately summarize the contents, and the text is fluent to read. The manuscript aims to cover a vast area, and does good job in summarizing the relevant key aspects in a single workflow.
The study design, methods and analysis and their suitability are properly described with appropriate references. The work describes a recommendation for a workflow based on the author's comprehensive experience in this field. It does not provide thorough comparison or benchmarking of the methods but relevant research is cited and key comparisons are already available in the literature. The contribution in this work is to combine the individual elements into a coherent workflow that describes very typical steps and recommended choices in standard taxonomic analysis.

* Major comments:
The main shortcoming is that the workflow is not provided in a readily reproducible format. I cloned the github repository, and could run the analysis rnw files (PartI, PartII, PartIII) individually, as well as the main.rnw. These result in .text file but using pdflatex or latex could not be readily applied to convert these into the final PDF format. The github site does not mention how the rnw files should actually be converted into PDF. This is not evident as there are many ways to do this and the success will depend on the overall setup. The authors could include for instance a simple shell script or README in the github main directory, showing what steps are taken to get from the original R/Rnw files to the final PDF reports. This would greatly increase the utility and reproducibility of this work.

* Minor comment:
-Something is missing from Conclusions. There is paragraph that in its entirety reads as follows: "We have briefly covered a small example of using three supervised learning functions (random forests, partial least squares and) to predict a response variable," -it misses text in parentheses and ends with a comma.
-Could you cite or discuss in more detail based on your experience whether Friedman Rafsky method outperforms alternative or at least closely related methods in the pairwise comparison task ? Not required but would be interesting to know.
-The phyloseq package might serve its purpose better if split in smaller and more compact packages.
The class structure is really useful and valuable, and would deserve its own package. This would better F1000Research -The phyloseq package might serve its purpose better if split in smaller and more compact packages.
The class structure is really useful and valuable, and would deserve its own package. This would better serve the overall microbiome data analytics community which can build on this and expand phyloseq capabilities in separate packages, in the same way as certain microarray data structures became a norm with the RMA and limma packages, with subsequent explosion in analysis methodologies. I am here just repeating my comment from the first review. Not required for this manuscript, however.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
I am currently developing the microbiome R package, which utilizes some of the Competing Interests: functionality in the phyloseq package which has a central role in this review. I have done this development work independently of the authors, following standard open source development model. It is just one of the many packages I am utilizing, and I do not see a clear competing interest but like to mention this. This article is a valuable resource for the metagenomics field. The thorough examples of several statistical analyses of metagenomic data will help both the novice and expert in analyzing their own data. Additionally, this paper sets a standard in the field for documenting analyses.
Both DADA2 and PhyloSeq have much to offer. DADA2 identifies OTUs, which are termed in this paper 'Ribosomal Sequence Variants,' reflecting the extra granularity with which DADA2 is capable of resolving OTUs. The RSVs identified by DADA2 offer the ability to conduct higher resolution analyses on 16S data. PhyloSeq is comprised of numerous capabilities to analyze metagenomic data, making it quite easy for a user to load and analyze their data.
Below I make a few suggestions for clarification purposes. I enjoyed reading this article and have already benefited greatly from using DADA2 and PhyloSeq in my own work.

Minor critiques and suggestions:
A very attractive feature of DADA2 is its ability to resolve RSVs. I wonder if the authors could expand more on the findings they have made with the higher resolution OTUs found by DADA2. This would highlight why DADA2 is such a powerful tool. Page 4 -it could be helpful to illustrate some of the properties of the software with numbers and data. For example, DADA2 has the ability to infer OTUs from pooled or unpooled data. Could the authors illustrate the number of RSVs found in the two scenarios? Page 6 -Is the multiple sequence alignment feature capable of multiple methods? If so, do you advocate for using ClustalW for metageonomic data? Why?
Page 6 --Could the authors define what a GTR+G+I model is?
I wonder if the authors could give some more guidance on how to construct the PhyloSeq object from scratch without relying import functions. For example, I tried making a PhyloSeq object using Metaphlan2 output. Unfortunately I could not figure out how to merge Metaphlan2 biom files for each sample, and so I had to fiddle with Phyloseq for sometime to manually create the OTU, sample, and taxa tables for multiple samples.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

Competing Interests:
Author Response 04 Nov 2016 , Stanford University, USA

Benjamin Callahan
Thanks for your comments and suggestions. We made several improvements to the revised manuscript in response: We added an explicit reference to Figure 2 in the text. The error rates being estimated in each plot are indicated in the text just above each plot. A2C (A to C) is shorthand for an A being converted to a C by errors in the amplicon sequencing process.
We changed the multiple-sequence alignment method in the workflow to that implemented by the DECIPHER package, largely because of its improved computational performance.
We added a brief text description of GTR+G+I (Generalized time-reversible with Gamma rate variation).
We did not expand our evaluation of RSVs vs. OTUs or pooled vs. unpooled inference. Performing such evaluations well is a significant undertaking and would take significant space to explain, and our primary purpose here is to demonstrate the many features of an R/Bioconductor amplicon analysis workflow.
For evaluation of DADA2, our manuscript introducing the method examines differences between

F1000Research
For evaluation of DADA2, our manuscript introducing the method examines differences between the output of DADA2 and OTU methods and we are writing another manuscript that looks at performance on datasets with many samples. On the issue of pooled vs. unpooled results, the short answer is that we find both approaches work well. If just counting the number of output OTU sequences, pooled inference generally finds more because of its higher sensitivity to sequences that are found in many samples but are rare in each. Of note, we generally find these pooled-only sequences to be very highly enriched for contaminants (eg. kit contaminants), which are expected to distributed in just this way.
We also did not expand much on the biological findings from this dataset in the initial paper (Stabilization of the murine gut microbiome following weaning, Schloss et al. 2012), as they were quite limited, essentially boiling down to the observation that gut sample early in life differed by more on average than samples from later in life. However, the dataset has been used in a number of studies as an example dataset for testing new methods (as in Kozich et al. 2013) and that is the way in which we are using it here.
No competing interests were disclosed. Competing Interests: 18  There is a growing push in the computational sciences for adopting software practices that promote replicability and provide methodological transparency. In the field of microbiome research these practices should minimize the standard culprits of error-creep such as file proliferation, and incompatible formats; they should provide sound default choices for the core computational steps of sequence clustering and taxonomic assignment; and they should facilitate reproducible statistical analyses of the resulting data. By providing a step-by-step analysis of a microbiome dataset that can be completed entirely from within the R statistical computing environment, this workflow does an admirable job of bringing these best-practices to the world of microbiome science.
The article takes a reader through the steps of processing raw sequence data and loading the data into R. It then demonstrates how to use basic exploratory data analysis to get a sense of the data and finally introduces the use of various statistical packages to search-for and validate patterns. The majority of the article focuses on the application of statistical concepts to microbiome data and this is where scientists would like to be spending their time. However, this allocation of ink-space is only possible because the recent release of the DADA2 package allows the authors (and subsequent users) to condense all the read-processing portion of the tutorial into a few short steps. DADA2 provides a new and arguably superior method for clustering raw amplicon reads and, by processing the reads and assigning taxonomy, it fills in the computational gap required to work completely within R. The benefits of this workflow are fairly self-evident in the amount of space in their workflow devoted to data processing versus exploration, however, there are other benefits as well, of which I will name two. First, by using packages hosted on CRAN or Bioconductor, the authors can leverage the Bioconductor build system and ensure a fully working environment, a non-trivial prerequisite in a field with myriad tools. Second, by providing an integrated set of tools there are few, if any, intermediate files required to analyze a dataset. In addition to reducing the cognitive burden of a newcomer, this generally reduces the footprint for errors.