Meta-Signer: Metagenomic Signature Identifier based on Rank Aggregation of Features

Background: The advance of metagenomic studies provides the opportunity to identify microbial taxa that are associated with human diseases. Multiple methods exist for such association analyses. However, their results can be inconsistent, presenting challenges in interpreting host-microbiome interactions. To address this issue, we introduce Meta-Signer, a novel Metagenomic Signature Identifier tool based on rank aggregation of features identified by multiple machine learning models: Random Forest, Support Vector Machines, LASSO, Multi-Layer Perceptron Neural Networks, and our recently developed Convolutional Neural Network framework (PopPhy-CNN). Meta-Signer generates ranked taxa lists by training individual machine learning models over multiple training partitions and aggregates them into a single ranked list through an optimization procedure, representing the most informative and robust microbial features. Meta-Signer can rank taxa using two input forms of the data: the relative abundances of the original taxa, and taxa from the populated taxonomic trees generated from the original taxa. The latter form, enabled by our novel PopPhy-CNN model, allows the evaluation of the association of microbial features with disease at different taxonomic levels.

Results: We evaluated Meta-Signer on five different human gut-microbiome datasets. We demonstrate that the features derived from Meta-Signer are more informative than those obtained from other available feature ranking methods. The highly ranked features are strongly supported by the published literature.

Conclusion: Meta-Signer is capable of deriving a robust set of microbial features at multiple taxonomic levels for the prediction of host phenotype. Meta-Signer is user-friendly and customizable, allowing users to explore their datasets quickly and efficiently.

tree and their quantitative characteristics in metagenomic data. PopPhy-CNN was shown to be competitive on nine metagenomic datasets of moderate size, including the IBD dataset [26].
Despite the promising results from numerous ML and DNN approaches, a critical analysis tool, MetaPheno, showed that the predictive performance gains of recent models were small and that more focus should be placed on feature extraction [20]. In addition, to the best of our knowledge, there is no reliable method that can systematically generate a robust set of features from a set of individual ML models.
In this work, we introduce a novel tool, Meta-Signer, a Metagenomic Signature Identifier based on rank aggregation of informative taxa learned from individual ML models. Meta-Signer uses RF, SVM, LASSO, MLPNN, and PopPhy-CNN models to evaluate the importance of each microbial taxon and generates a ranked list of features per model. It aggregates all the ranked lists using a procedure based on the Cross-Entropy method or the Genetic Algorithm [27]. Meta-Signer can extract features in two different ways: from the original relative taxa abundance data and from the taxa of taxonomic trees constructed from the original taxa. By providing a ranked list of the original taxa and a ranked list of the taxa from the taxonomic trees, the user receives a more robust overview of how the microbial community members are associated with the phenotype. Meta-Signer is user-friendly and easy to run, and provides a readable summary as an HTML output. Meta-Signer is distributed as a Python tool and available at https://github.com/YDaiLab/Meta-Signer.

Implementation
The overview of the workflow is shown in Fig. 1.

Input
Meta-Signer uses relative taxa abundance data and phylogenetic taxonomy information to construct a taxonomic tree. The inputs to Meta-Signer are (1) a tab-separated file of taxa relative abundance values, where each row represents a taxon and each column represents a sample, and (2) a line-separated list of response values, where each row represents the phenotypic response of a sample. The first column in the abundance table should be the taxonomic identification of the taxon. Each level of taxonomy should be represented using the first letter of the taxonomic level (k, p, c, o, f, g, s), followed by two underscore characters, and then the taxonomic identification at that level. Two additional required files are (3) the run configuration file with user-specified parameters, and (4) a phylogenetic tree of all microbes in Newick format. Templates for both of these files are provided. A final optional file that Meta-Signer will use is (5) model parameters for the neural network architectures in JSON format. If this file is not found, Meta-Signer will tune the parameters and save them for later use.

Figure 1. Meta-Signer workflow. Large rounded rectangles represent different modules of the workflow. Red inputs are required and provided by the user. Green inputs are required but included in the package. Yellow inputs are optional and will be generated and saved if not provided.

Tree Construction Module
The taxa relative abundance data are first pre-processed and normalized. To do so, abundance values are converted into relative abundances by dividing each value by the sum of abundance values in its respective sample. Next, low-abundance taxa are filtered out based on a proportion threshold specified by the user in the configuration file. A taxon is removed if it is not present at the specified proportion level either overall or within a single class. The remaining abundance values are then log-transformed and normalized using a user-specified method (Z-normalization or min-max normalization).
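The pre-processing steps above can be sketched as follows. This is a minimal illustration rather than Meta-Signer's actual implementation: `min_prevalence` is a hypothetical name for the user-set threshold, Z-normalization is the variant shown, and the pseudo-count used before the log transform is our assumption.

```python
import numpy as np

def preprocess(counts, labels, min_prevalence=0.1):
    """Sketch of Meta-Signer-style pre-processing.

    counts: (n_samples, n_taxa) raw abundance array.
    labels: length-n_samples class labels.
    """
    # 1. Relative abundance: divide each value by its sample's total.
    rel = counts / counts.sum(axis=1, keepdims=True)

    # 2. Keep a taxon if it is present in at least `min_prevalence` of
    #    samples overall OR within any single class.
    present = rel > 0
    keep = present.mean(axis=0) >= min_prevalence
    for c in np.unique(labels):
        keep |= present[labels == c].mean(axis=0) >= min_prevalence
    rel = rel[:, keep]

    # 3. Log-transform (small pseudo-count avoids log(0)), then Z-normalize
    #    each taxon to zero mean and unit variance.
    logged = np.log(rel + 1e-6)
    z = (logged - logged.mean(axis=0)) / logged.std(axis=0)
    return z, keep
```

Min-max normalization would replace step 3's standardization with a rescaling to [0, 1] per taxon.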
Next, a taxonomic tree specific to the dataset is generated by pruning a tree template using the taxa remaining across all samples in the dataset. A tree template generated using PhyloT [28] is provided in the package. The nodes of the pruned tree are then populated with abundance values such that an internal node's abundance is the sum of the abundance values of its children. An example of the populated tree is shown in Fig. 2, and the algorithms are shown in Figs. S1 and S2 in Additional File 1.
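The population step can be sketched as a post-order traversal: leaves keep their observed abundances, and each internal node receives the sum of its children. The `Node` class and function names below are illustrative, not Meta-Signer's API.

```python
class Node:
    """A node of the taxonomic tree with an abundance value."""
    def __init__(self, name, children=None, abundance=0.0):
        self.name = name
        self.children = children or []
        self.abundance = abundance

def populate(node):
    """Post-order traversal: internal nodes get the sum of their children."""
    if node.children:
        node.abundance = sum(populate(child) for child in node.children)
    return node.abundance
```

After this pass, the root holds the total abundance of the sample (1.0 for relative abundances).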

Figure 2
Generating tree-level taxa. A vector of relative taxa abundance is used to populate a taxonomic tree. This tree can either be represented as a matrix to be used as an input to PopPhy-CNN or be vectorized and used with any other ML method.

Learning Module
Meta-Signer includes three classic ML models (RF, Linear SVM, LASSO) and two DNN models (MLPNN, PopPhy-CNN) in its Learning Module. RF models are decision tree learning models that are trained in an ensemble fashion, averaging over the ensemble to give a robust prediction [29].
While growing each tree, a decision is made at each node by selecting the feature, from a random subset of features, that best splits the data into two subsets based on the Gini impurity of each subset. Given a set of items $S$ with $C$ classes, let $p_c$ be the proportion of samples of class $c$ for $c \in \{1, \dots, C\}$. The Gini impurity of the set is calculated as

$$G(S) = 1 - \sum_{c=1}^{C} p_c^2.$$

SVMs are supervised machine learning models that learn the hyperplane that best separates two classes of data [30]. In the case of linear SVMs, from the hyperplane we can obtain a set of weights, $w$, and an intercept, $b$. The class of a sample $x$ can then be determined as

$$y = \mathrm{sign}(w^T x + b).$$

LASSO regression is a form of least squares regression that uses shrinkage to reduce the total number of parameters in the final model [31]. This is achieved using L1 regularization to penalize the absolute values of the weights, eliminating a portion of the weights to create a sparse model. Given a set of $n$ samples $X = \{x_1, \dots, x_n\}$, where each sample has $m$ features, and response values $Y = \{y_1, \dots, y_n\}$, the model minimizes the cost

$$\frac{1}{2n} \sum_{i=1}^{n} \left(y_i - w^T x_i\right)^2 + \lambda \|w\|_1,$$

where $w$ are the weight parameters, penalized with the regularization parameter $\lambda$. Neural networks consist of multiple layers of nodes that are fully connected by weighted edges [32]. The values of a hidden layer are a linear combination of the values from the previous layer, passed through a non-linear activation function. More explicitly, the values of hidden layer $h_l$ are calculated as

$$h_l = f(W_l h_{l-1} + b_l),$$

where $h_{l-1}$ are the values from the previous hidden layer, $W_l$ are the weights connecting $h_{l-1}$ to $h_l$, $b_l$ is a bias vector, and $f$ is a non-linear activation function.
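For concreteness, the three quantities defined in this section (the Gini impurity, the linear SVM decision rule, and the LASSO cost) can be computed directly. This is a small illustrative sketch; the function names are ours.

```python
import numpy as np

def gini(labels):
    """Gini impurity G(S) = 1 - sum_c p_c^2 for a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def svm_predict(x, w, b):
    """Linear SVM decision rule: sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

def lasso_cost(X, y, w, lam):
    """LASSO objective: mean squared error plus L1 penalty on the weights."""
    n = len(y)
    return np.sum((y - X @ w) ** 2) / (2 * n) + lam * np.sum(np.abs(w))
```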
PopPhy-CNN was originally designed by our group [25,26]. PopPhy-CNN explores relationships between taxa by treating a populated taxonomic tree as a type of image. Our model is based on a CNN architecture, which utilizes shared kernel weights to learn relationships in the data by constructing multiple feature maps. More explicitly, given an input matrix $X$ and a kernel $k$ with a set of weights $W^{(k)}$, the velocity at position $(i, j)$ of the feature map is defined as

$$v_{i,j}^{(k)} = \sum_{a} \sum_{b} W_{a,b}^{(k)} \, X_{i+a,\, j+b} + b^{(k)}.$$

The velocities of a feature map are passed through a non-linear activation, and then max pooling is applied to the feature map. More detail about PopPhy-CNN and the other learning models is given in Additional File 1.
Before training, Meta-Signer checks for a file containing network hyper-parameters for the MLPNN and CNN models. If it does not find this file, Meta-Signer will use the first partition of the cross-validation to empirically determine the hyper-parameters. This is done using another cross-validation on the training set of the first partition. In addition, using the configuration file, the user can set custom parameters and even disable any of the learning models they do not wish to incorporate into their results. In Meta-Signer, RF, SVM, and LASSO are trained using the scikit-learn Python package, and MLPNN and CNN are trained using TensorFlow.

Feature Ranking Module
For each ML model in each cross-validated partition, Meta-Signer extracts the feature scores and uses them to construct a ranked feature list. RF features were scored using a method called mean decrease impurity. For each node, the importance of the feature being split upon is calculated as the decrease in Gini impurity from before the split to after it. This value is then weighted by the proportion of total samples that were split at that node. A feature's importance is then calculated by averaging the weighted importance values of the nodes that split on that feature across all trees in the ensemble. Features in the LASSO and SVM models were scored based on the magnitude of their weight coefficients in the decision functions.
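As an illustration, these scores can be obtained from scikit-learn models (which Meta-Signer uses for RF, SVM, and LASSO) and converted into ranked lists. The toy data below is ours; only features 0 and 2 carry signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # features 0 and 2 are informative

# RF: mean decrease impurity scores, exposed as feature_importances_.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

# Linear SVM: score features by the magnitude of their weight coefficients.
svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
svm_rank = np.argsort(np.abs(svm.coef_[0]))[::-1]
```

Both rankings should place the two informative features first.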
The extraction of features from DNN models is a challenging task. We use a procedure developed in [33] to evaluate features in MLPNNs. Briefly, the MLPNN features were evaluated by calculating the cumulative weight across all layers, taking the running product of all the weight matrices in the learned networks. The method is fast, and the product results in a matrix with a column for each class and a row for each feature, where the value at a given index is that feature's cumulative impact for that class. We then take a feature's importance to be its maximum impact across classes to create a single ranked list.
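A minimal sketch of this weight-product scoring follows. The function is our own illustration, assuming the layer weight matrices are supplied in input-to-output order; following the description above, a feature's score is its maximum impact across the class columns.

```python
import numpy as np

def mlp_feature_scores(weight_matrices):
    """Cumulative impact of each input feature on each output class.

    weight_matrices: list of per-layer weight matrices, ordered from the
    input layer to the output layer. The running product yields a matrix
    with one row per feature and one column per class.
    """
    impact = weight_matrices[0]
    for W in weight_matrices[1:]:
        impact = impact @ W
    # A feature's single score is its maximum impact across classes.
    return impact.max(axis=1)
```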
The novel procedure to extract features from CNNs was developed in our previous learning framework, PopPhy-CNN [26]. For convenience, we include the details of the full feature importance scoring algorithm in Additional File 1 (Figs. S3 and S4). Briefly, the procedure focuses on post-hoc analysis of the kernel map activations in the first convolutional layer, prior to subsampling. It selects the top 10% of the signals from each kernel map and traces each signal back to the local area of the input from which it was generated. The local area of the input can be interpreted as a group of nodes on the tree. Each input feature's importance is evaluated based on how much it contributed to the signal found in the kernel map.

Feature Aggregation Module
For each partition of the cross-validation, we generate a single ranked list for each ML model. Once the entirety of the cross-validated training is complete, the set of all ranked lists across all models is aggregated into a single top-k ranked list by minimizing the distance between the set of ranked lists and the top-k list, where k is specified by the user in the configuration file. More specifically, given a set of ranked lists $\{L_1, \dots, L_M\}$, the top-k ranked list, $\hat{\delta}$, is determined as

$$\hat{\delta} = \underset{\delta \in \mathcal{L}}{\arg\min} \sum_{i=1}^{M} w_i \, d(\delta, L_i).$$

Here, $\mathcal{L}$ is the state space of top-k rankings, $w_i$ is a weight associated with $L_i$, and $d(\delta, L_i)$ is the distance between a proposed top-k ranked list, $\delta$, and $L_i$. We consider original and tree-level taxa separately, resulting in two separate top-k lists.
The aggregation is performed using the R package RankAggreg [27]. This package uses a cross-entropy based approach with Markov Chain Monte Carlo sampling to find the top-k features that minimize the sum of the distances between each of the input lists and the generated top-k list. The distance used is based on Spearman's correlation. Each input ranked list is weighted in the aggregation by the area under the receiver operating characteristic curve (AUC) for binary classification, or by the generalized Matthews correlation coefficient (MCC) for multiclass classification. When using the MCC, negative values are clipped to 0 in order to prevent negative weights.
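The objective being minimized can be illustrated with a toy brute-force search over candidate top-k lists. Meta-Signer delegates this search to RankAggreg's cross-entropy/MCMC procedure; the exhaustive search below is only feasible for tiny k, and treating items absent from a list as tied at position k + 1 is one common footrule convention, assumed here for illustration.

```python
from itertools import permutations

def footrule(candidate, ranked, k):
    """Weighted-footrule building block: sum of |position differences|.

    Items absent from a top-k list are treated as tied at position k + 1
    (an assumed convention for this sketch).
    """
    pos = {item: i + 1 for i, item in enumerate(ranked[:k])}
    cpos = {item: i + 1 for i, item in enumerate(candidate)}
    items = set(pos) | set(cpos)
    return sum(abs(cpos.get(t, k + 1) - pos.get(t, k + 1)) for t in items)

def aggregate(ranked_lists, weights, k):
    """Brute-force top-k list minimizing the weighted sum of distances."""
    items = {t for lst in ranked_lists for t in lst[:k]}
    best = min(
        permutations(items, k),
        key=lambda cand: sum(w * footrule(cand, lst, k)
                             for w, lst in zip(weights, ranked_lists)),
    )
    return list(best)
```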

Output
After the model predictions are evaluated and the features are ranked into a single list, Meta-Signer provides a summary of the results in a portable HTML file. The file contains a description of the run and evaluation metrics for the different models, in the form of both a table and boxplots. It also provides the distribution of the feature (taxa) scores for each machine learning model. Lastly, it provides the list of top-k taxa selected from the original taxa and the tree-level taxa, the proportion of individual ranking sets in which each taxon appeared in the top k, the rank and p-value under a Wilcoxon rank-sum test, and the class in which the taxon was found enriched. All images are encoded into the file, allowing the HTML file to be moved without considering the location of the images.

Datasets used in evaluation
We used five publicly available datasets to evaluate Meta-Signer: cirrhosis, obesity, type 2 diabetes (T2D), inflammatory bowel disease (IBD), and colorectal cancer (CRC). The datasets were obtained from the MetAML package [17]. The cirrhosis dataset was taken from a study of 114 cirrhosis patients and 118 healthy subjects [34]. The T2D dataset was a combination of two studies [5,35], yielding a total of 223 patients with T2D and 217 healthy subjects. The obesity dataset comes from a study of 292 individuals, of which 89 individuals with a BMI lower than 25 kg/m2 were studied against 164 individuals with a BMI greater than 30 kg/m2 [36]. The IBD dataset included 110 subjects, of which 85 were healthy, 21 had ulcerative colitis, and 4 had Crohn's disease. Since the sample size for subjects with Crohn's disease was extremely small, the two disease states were combined into a single class of 25 subjects [37]. The CRC dataset contains 134 subjects, of which 47 were healthy, 26 had small adenomas, 13 had large adenomas, and 48 had cancer [38]. In previous studies, subjects with small adenomas have been grouped with healthy subjects and subjects with large adenomas have been dropped. However, we wanted to evaluate our method on a multiclass dataset, so we chose instead to combine subjects with small adenomas and large adenomas into a single class of 39 subjects. The first four datasets involve binary classes and the last dataset involves three classes. A summary of the datasets used in our evaluation is shown in Table 1.

Table 1. Datasets. Sample sizes and numbers of taxa for the datasets used in the evaluation. In the case of CRC, the number of samples for each case is separated by a comma.

Dataset    | Case   | Control | Taxa
Cirrhosis  | 114    | 118     | 542
T2D        | 223    | 217     | 606
Obesity    | 164    | 89      | 465
IBD        | 25     | 85      | 443
CRC        | 48, 39 | 47      | 507
Each of these datasets was generated using metagenomic shotgun (MGS) sequencing. The taxa for each dataset were assigned by MetaPhlAn2, which selects taxa based on the read coverage of clade-specific markers and then estimates their relative abundance [39]. The taxa in each dataset were aggregated at the species level. For any taxon which could not be classified at the species level, we added its abundance to its genus-level internal node of the tree. In our experiments, we filtered out any taxon that was not present in at least 10% of a single response class. All abundance values were then log-transformed and normalized to have a mean of 0 and unit variance.

Evaluation of ML models
We first provide an overall evaluation of the individual ML models under the same experimental protocol. This can also identify any model with exceptionally inferior performance; if such a model is found, it can be excluded from the Aggregation Module. For the multiclass CRC dataset, SVM and LASSO are evaluated using a one-versus-all approach. We use the following criteria for our model evaluation: AUC for binary classification (MCC for multiclass classification), precision, recall, and F1 score. Here,

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives, respectively. Models were trained using 10-fold cross-validation, and the average and standard deviation of each evaluation criterion across the 10 cross-validation partitions are reported. In each 10-fold cross-validation, the dataset was partitioned into 10 sets stratified by class proportion. We then constructed 10 datasets in which a single partition was left out as the test set and the remaining partitions were used as the training set.
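The criteria above can be computed directly from confusion-matrix counts; the helper below is a minimal sketch of ours, not part of Meta-Signer.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```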
For RF models, a maximum of 500 trees was set, and all other parameters were left at their defaults. Feature importance values were first obtained by training an RF on the training set and extracting the feature importance scores from the trained model. We performed 5-fold cross-validation on each training set to determine the optimal number of features to use: we evaluated models using the top 25%, 50%, 75%, and 100% of features and selected the best option. A final model was trained using the determined optimal number of features on the entire training set and evaluated on the test set.
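This rank-then-select loop can be sketched with scikit-learn as follows. The function is our illustration of the procedure, with `n_trees` exposed as a parameter (500 in the experiments above).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_top_fraction(X, y, fractions=(0.25, 0.5, 0.75, 1.0),
                        n_trees=500, cv=5, seed=0):
    """Rank features by RF importance, then pick the best top fraction
    of the ranking by inner cross-validation on the training set."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    order = np.argsort(rf.fit(X, y).feature_importances_)[::-1]
    best_frac, best_score = None, -np.inf
    for frac in fractions:
        n_keep = max(1, int(round(frac * X.shape[1])))
        score = cross_val_score(
            RandomForestClassifier(n_estimators=n_trees, random_state=seed),
            X[:, order[:n_keep]], y, cv=cv).mean()
        if score > best_score:
            best_frac, best_score = frac, score
    return order, best_frac
```

The final model would then be refit on the full training set using the top `best_frac` of the ranked features.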
The parameters in the linear SVM and LASSO models were determined using 5-fold cross-validation on each training set. The parameter associated with the error term in the SVM was chosen from {1, 10, 100, 1000}, and the regularization parameter in the LASSO model was chosen from values ranging from 10^-4 to 10^-0.5, spaced evenly on a log scale. The best parameter value was determined using 5-fold cross-validation on the training set, and the final model was trained on the training set and evaluated on the test set.
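These searches map directly onto scikit-learn's `GridSearchCV`; the sketch below shows the grids as described above, with `num=20` as an assumed grid size for the LASSO alphas (the actual number of grid points is not stated).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import Lasso

# SVM error-term parameter C chosen from {1, 10, 100, 1000}.
svm_grid = {"C": [1, 10, 100, 1000]}

# LASSO regularization parameter: evenly log-spaced from 1e-4 to 10**-0.5.
alphas = np.logspace(-4, -0.5, num=20)  # num=20 is an assumed grid size

svm_search = GridSearchCV(LinearSVC(max_iter=5000), svm_grid, cv=5)
lasso_search = GridSearchCV(Lasso(), {"alpha": alphas}, cv=5)
```

Calling `.fit(X_train, y_train)` on either search object runs the inner 5-fold cross-validation and refits the best model on the full training set.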
MLPNN and CNN models were trained for 200 epochs with a dropout rate of 0.5, a learning rate of 0.001, and an L2-regularization penalty value of 0.01. MLPNN models contained 2 layers of 32 nodes. CNN models contained a single convolutional layer with 32 square kernels of size 5 and a single fully connected layer with 32 nodes. Network hyper-parameters were found using 10-fold cross-validation on the training set, where 70% of the data was used to train the network, 20% was used to determine when to stop early, and the remaining 10% was used to evaluate the model. A set of hyper-parameters was selected based on the mean performance. We only tune the network using the first cross-validation partition, under the assumption that the learning complexity of the cross-validation partitions is similar. We used the rectified linear unit (ReLU) activation function for hidden layers and the softmax activation function for the output layer. Each network was trained using a weighted cross-entropy loss function. For consistency in our experiments, we shared the tuned parameters of one dataset, in our case the cirrhosis dataset, across datasets, under the assumption that the complexity of the classification tasks was comparable between datasets.
In our experiments, we observed that RF models performed the best overall compared to the other ML models (Table 2) when using the tree taxa as features. The performance using the original taxa showed a similar pattern (Table S1 in Additional File 1). The MLPNN and PopPhy-CNN models showed improved performance over SVM and LASSO, but overall they were not superior to RF (Table 2). However, we observed that within a cross-validated evaluation there were partitions in which RF models were outperformed by other models (Fig. 3). Similar comparisons between all models are shown in Fig. S5 in Additional File 1. Therefore, we decided to use features from multiple learning models across the multiple partitions. In addition, we noticed a slight decrease in performance when using the entire set of tree-level features. Despite the reduction in performance, we believe that including the tree taxa can help provide insight into features associated with dysbiosis at multiple levels of the taxonomic tree.

Taxa extracted from Meta-Signer are more informative than other methods

To evaluate Meta-Signer, we benchmarked against three other methods: two previously published methods, Biosigner [40] and a hierarchical feature engineering (HFE) method [41], and a non-parametric Wilcoxon rank-sum test. Biosigner is a generic ML-driven feature selection method for omics data, available in R. It uses trained RF, SVM, and Partial Least Squares Discriminant Analysis models to selectively eliminate features, resulting in a single set of remaining features. HFE is implemented in Python, is specifically designed for metagenomic data, and uses a taxonomic tree to construct hierarchical features, extracting nodes based on information gain. HFE was shown to outperform other state-of-the-art methods such as Fizzy [42] and MetaML [17]. We were not able to benchmark against OMiAT [11], since that method requires the tree branches to contain distance values, which are not available to us.
The non-parametric Wilcoxon rank-sum test was included as a baseline feature-ranking method for comparison. We used the top 20 taxa from each method, except for Biosigner, which identified fewer than 20 taxa in every dataset evaluated. We evaluated each method using 10-fold cross-validation. Each dataset was randomly partitioned into 10 sets, stratified to balance the class proportions. Each method was then applied to 9 of the 10 sets (the training set) to extract features, and the taxa selected from the ML models were aggregated. The datasets were then filtered to retain only the chosen taxa, and models were trained on the training set and evaluated on the test set. We used SVMs with Gaussian kernels to evaluate the performance of the selected taxa from all four methods. In order to perform a fair comparison, we report the evaluation without including linear SVM models in the taxa aggregation.
We observed that Meta-Signer was robust across all datasets (Table 3). On cirrhosis, T2D, and obesity, it outperformed all other methods, while Biosigner and HFE performed very poorly on these datasets. Biosigner often returned no taxa at all (in which case an AUC of 0.5 was assigned to the testing). In the cirrhosis dataset, we observed that using tree taxa (Meta-Signer (Tree) and HFE) caused the SVM models to become less stable and to predict more poorly. This could be in part because many of the tree taxa had parent-child relationships and carried redundant information; these similar taxa may have saturated the top 20 ranked taxa, leaving less room for other informative taxa. In the obesity dataset, both Biosigner and HFE extracted taxa that led to a testing AUC below 0.5. Surprisingly, the taxa ranked by the Wilcoxon test generated the best performance for IBD, and both the Wilcoxon test and HFE performed well on CRC. However, although not the best on IBD and CRC, Meta-Signer was still comparable. We observed a similar pattern for the performance of Meta-Signer when the results of linear SVMs were included in the taxa aggregation (Table S7 in Additional File 1).

We then looked at the overlap of the original taxa extracted by Meta-Signer with those extracted using the Wilcoxon test and Biosigner. HFE and Meta-Signer using the tree taxa were excluded, since they both create their own unique feature spaces and are not comparable. To compare methods, each method was applied to the entire dataset, generating a single ranked list per method, and we computed the overlap of the top 20 taxa from Meta-Signer, the Wilcoxon test, and Biosigner. Since Biosigner does not run on multi-class datasets, it is not considered in the overlap for the CRC dataset. The results are shown. Finally, we examined how often the extracted taxa appeared in the top 20 taxa of each model independently in Meta-Signer.
We observed a large consensus on many taxa; however, a few taxa were missed entirely by some models and picked up by others (Tables S8 and S9 in Additional File 1). In addition, we observed that PopPhy-CNN was not contributing as much as the other methods to the top 20 selected taxa in each dataset. Upon visual inspection, we observed that the taxa from the CNN models were more often at higher taxonomic levels (class and family levels and above). We therefore suspected that it captured information at more highly aggregated taxa, and as such, fewer taxa are required. To investigate this, we trained SVM models using just the top 10 taxa after aggregating the ranked lists from the CNN models only, and compared the performance to that of Meta-Signer on the tree taxa. Except for the IBD dataset, the taxa extracted solely from the PopPhy-CNN models were comparable or even superior to the ensemble-aggregated taxa. This leads us to believe that PopPhy-CNN finds useful taxa, but that they are dampened by the other models in the ensemble aggregation. A table comparing the top 10 taxa of the PopPhy-CNN model and Meta-Signer is shown in Table S10 in Additional File 1.

Taxa extracted from Meta-Signer are biologically relevant -A case study
We focus on the obesity data to examine the relevance of the taxa extracted by Meta-Signer (Fig. 5). A number of them have been identified as relevant to obesity in previous studies. We observed an enrichment of Ruminococcus torques and Bacteroides ovatus in obese subjects, both of which have been shown to be positively correlated with metabolic syndrome traits [43,44]. Clostridium leptum was also found to be enriched in obese subjects, which agrees with the findings of a previous study in Danish infants [45]; however, there have been conflicting observations on the correlation of this species with obesity [46]. Meta-Signer was also able to identify Dorea formicigenerans and Mitsuokella as enriched in obese patients, as has been reported in previous studies [44,47], as well as Megasphaera elsdenii, which has been identified at the genus level as associated with obesity [48]. Additionally, we observed that Oscillibacter was enriched in obese subjects, which has been associated with the consumption of high-fat diets [49]. Interestingly, Ruminococcus torques, Bacteroides ovatus, Mitsuokella, and Megasphaera elsdenii are insignificant based on the Wilcoxon rank-sum test. Two species enriched in lean patients were Ruminococcus lactaris, a species found enriched in human patients after being treated for metabolic syndrome using resistant starch type 4 [50], and Ruminococcus flavefaciens, a species found to be enriched in non-obese mice [51]. Ruminococcus lactaris is also insignificant by the Wilcoxon rank-sum test.
In addition, we found that microbes from the genus Alistipes as well as Akkermansia muciniphila were enriched in lean patients. Microbes from these genera have been associated with successful weight loss in obese patients after a regimen of a low-calorie diet, exercise, and behavioral therapy [52,53]. We also identified Ruminococcus bromii, Eubacterium siraeum, and Butyrivibrio crossotus as enriched in lean subjects, all of which have been observed in previous studies [54,55,56]. Another taxon that was captured strongly and at multiple levels in the tree was Deinococcus. This genus has not been strongly implicated in obesity; however, a recent study has shown that some species of this genus produce amylosucrase, and that modifying consumed chestnut starch with this enzyme led to a suppression of insulin signaling and a reduction of fat accumulation [57].
Lastly, we wanted to visualize the landscape identified by Meta-Signer. To do this, we combined the ranked original taxa and ranked tree taxa into a single set and isolated the subtree containing this set of taxa. We annotated the subtree to visualize which branches were enriched in lean and obese individuals. We connected nodes using a solid line of the same color if Meta-Signer identified nodes at higher taxonomic levels along the branch; otherwise, we connected nodes using a dashed line of the same color until a branch occurred. The visualized subtree is shown in Fig. 6.

Figure 6. A taxonomic tree of the combined ranked tree and original taxa for lean (green) and obese (red) subjects. Solid lines indicate taxonomic levels identified using the tree taxa. Dashed lines are used to extend phenotypic associations through the tree at nodes that were not identified until a branch occurs.

Discussion
We have developed Meta-Signer, a user-friendly tool for the extraction of robust microbial taxa that are predictive of host phenotype at multiple taxonomic levels, using multiple machine learning models. Meta-Signer is able to leverage biological knowledge in microbial taxa relative abundance profiles through a taxonomic tree via our novel propagation and matrix-representation procedure. By training different types of machine learning models, Meta-Signer exploits the similarities in the ranked lists of taxa learned by the individual machine learning models to create a single aggregated set of informative microbial taxa for host phenotype prediction.
Using five metagenomic datasets, we demonstrated that Meta-Signer not only outperforms state-of-the-art feature ranking methods but is also robust. In addition, in our case study on the obesity dataset, we showed that many of the taxa extracted by Meta-Signer have been found biologically relevant in previous studies. It should be emphasized that the objective of the current work is not to establish the highest-performing predictive model, but to identify the most robust set of taxa that are predictive of the host phenotype. The identified taxa could be used for follow-up validation [17].
There are several limitations to our method. First, there is no predetermined way to know how many taxa to extract. We set a default of 20 taxa, but in some cases this may be too many and in others not enough. This is compounded by the fact that the tree-level taxa often contain redundant information, since child and parent nodes can be very similar or even equal in abundance. This may cause both features to be selected, saturating the feature list with that branch of the tree and reducing the space for features that could bring in new information. Lastly, deep learning models require a GPU to be time-efficient. However, we allow users to turn each of the learning models on and off in case they do not want to run them.
There are a few directions for further improvement. First, we plan to incorporate ways to reduce the saturation caused when using tree-level taxa, which would allow a more diverse set of features when using the taxonomic tree. Second, we currently drop any features not found when constructing and pruning the tree. This could lead to a loss of information, and these features should still be integrated into the tree in a meaningful way.

Conclusion
In conclusion, Meta-Signer is a user-friendly tool to identify a robust set of highly informative microbial taxa that are predictive of human disease status from a metagenomic dataset.

Additional Files
Additional File 1 - Supporting document containing additional information on methodology and results.

Additional file Meta-Signer: Metagenomic Signature Identifier based on Rank Aggregation of Features
Derek Reiman, Ahmed A. Metwally, Jun Sun, and Yang Dai

Tree Construction
In our work, the online tool PhyloT was used to create the phylogenetic tree, and a constant distance of one between nodes in the tree is assumed. This procedure was developed in our previous work on PopPhy-CNN. For the sake of completeness, we include it here.
This tree is essentially a taxonomic tree. The phylogenetic tree is structured using ancestral nodes from both taxonomic groups and subgroups with no defined distances between nodes. Therefore, we define the distance between any two nodes by the number of nodes between them. The taxonomic tree is used as a template to construct a populated tree for each sample in the dataset. In order to populate a tree, the abundance value of each taxon from a sample is assigned to its respective node in the tree. The tree is then populated such that an abundance value for each internal node is equal to the sum of its children's abundance values. The algorithm is shown in Fig. S1.
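The population step in Fig. S1 can be sketched in a few lines of Python; the `TaxonNode` class and function names here are illustrative, not Meta-Signer's actual code.

```python
# Minimal sketch of tree population (Fig. S1): leaves take the sample's
# relative abundances, internal nodes sum their children (post-order).
class TaxonNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.abundance = 0.0

def populate_tree(node, sample_abundances):
    """Populate one sample's tree; returns the node's abundance."""
    if not node.children:
        # Leaf: look up the taxon's abundance (0.0 if absent from the sample).
        node.abundance = sample_abundances.get(node.name, 0.0)
    else:
        # Internal node: sum of the children's (recursively populated) abundances.
        node.abundance = sum(populate_tree(c, sample_abundances)
                             for c in node.children)
    return node.abundance
```

With relative abundances at the leaves, the root ends up holding the sample's total abundance.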

Figure S1. Algorithm for tree population
Since convolutional neural networks (CNNs) are very successful in image processing, where inputs are multi-dimensional matrices, we transform the populated tree into a matrix format that contains meaningful similarity information both vertically through the rows and horizontally through the columns. We begin by combining the taxa abundances and the taxonomic tree, assigning to the nodes corresponding to the taxa their respective abundances. This is followed by populating the rest of the tree, where a parent node's abundance is the sum of its children's abundances. This is performed from the bottom upwards to the root node, which is populated with the sum of the abundances from all organisms found in the community. The algorithm for tree-matrix representation is outlined in Fig. S2.

Figure S2. Algorithm for tree-matrix representation
These matrices are then used to train CNN models. In order to use these hierarchical features to train other machine learning models, we vectorize the tree nodes. To do so, we build up a vector, beginning with the root of the tree. Then, for each hierarchical layer of the tree, moving from the top to the bottom, the nodes from that layer are sorted by name and appended to the vector, resulting in a vector of length |V|, where V is the set of nodes in the tree.
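The level-order vectorization above can be sketched as follows; the dict-based node layout is only for illustration.

```python
# Sketch: flatten a populated tree into a vector of length |V| -- root first,
# then each layer's nodes sorted by name, moving top to bottom.
def tree_to_vector(root):
    vec, frontier = [], [root]
    while frontier:
        frontier.sort(key=lambda n: n["name"])        # sort the layer by name
        vec.extend(n["abundance"] for n in frontier)  # append the layer
        frontier = [c for n in frontier for c in n["children"]]
    return vec
```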

Random Forest
Random forest (RF) models are supervised learning models based on decision trees. The model trains an ensemble of trees and averages over the ensemble to give a robust classifier. While growing each tree, a decision is made at each node by selecting, from a random subset of features, the feature that best splits the data into two subsets. In our method, we generate decision rules using the Gini impurity metric. Given a set of items with $C$ classes, let $p_i$ be the proportion of samples of class $i$ for $i \in \{1, \dots, C\}$. The Gini impurity of the set is calculated as $G = \sum_{i=1}^{C} p_i (1 - p_i)$. Our method implements the RF model using the scikit-learn python library. Once trained, features are extracted by evaluating the mean decrease in impurity. For each node, the importance of the feature being split upon is calculated as the decrease in Gini impurity from before to after the split. This value is then weighted by the proportion of total samples that reach that node. A feature's importance is then calculated by averaging the weighted importance values of nodes that split using that feature across all trees in the ensemble.
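A hedged sketch of this step with scikit-learn (the library named above); the synthetic data and parameter choices are illustrative only.

```python
# Train an RF on synthetic abundance data and rank taxa by mean decrease
# in impurity (scikit-learn's feature_importances_).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 20))            # 100 samples x 20 taxa
y = (X[:, 3] > 0.5).astype(int)      # label driven by taxon 3 (synthetic)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
```

On this toy data the informative taxon (index 3) lands at the top of the ranking.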

Support Vector Machines
Support Vector Machines (SVMs) are supervised machine learning models that learn the best hyperplane separating two classes of data. SVMs, in particular linear SVMs, find the hyperplane that maximizes the margin, or separation distance, between the two classes. The orientation and position of the hyperplane are driven by a subset of data points, called support vectors, that lie close to the hyperplane. In the case of a linear SVM, we can obtain from the hyperplane a set of weights, $w$, and an intercept, $b$. The class of a sample $x$ can then be determined as $\hat{y} = \mathrm{sign}(w^{\top} x + b)$. Since the weights in this function can be used to rank the importance of features, we used linear SVMs in Meta-Signer for feature extraction. To evaluate Meta-Signer and other existing methods, we used nonlinear SVMs with the radial basis function kernel, $K(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^2}$, in order to achieve good performance. In our method, we allow a grid search over the two kernels using the scikit-learn python library.
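The two uses of SVMs described above can be sketched with scikit-learn; data and the hyperparameter grid are illustrative.

```python
# Linear SVM: rank features by |w|; grid search over linear/RBF kernels
# for the evaluation model, as described in the text.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.random((120, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)    # boundary set by features 0 and 1

lin = SVC(kernel="linear").fit(X, y)
weights = lin.coef_.ravel()                # one weight per feature
ranking = np.argsort(np.abs(weights))[::-1]

grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=3)
grid.fit(X, y)                             # evaluation model over both kernels
```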

LASSO Regression
Least Absolute Shrinkage and Selection Operator (LASSO) regression is a form of least squares regression that uses shrinkage to reduce the total number of parameters in the final model. This is achieved using L1 regularization to penalize the absolute value of the weights, eliminating a portion of the weights to create a sparse model. Given a set of $n$ samples $X = \{x_1, x_2, \dots, x_n\}$, where each sample has $p$ features, and labels $y = \{y_1, y_2, \dots, y_n\}$, the model minimizes the cost $\frac{1}{2n} \sum_{i=1}^{n} (y_i - w^{\top} x_i)^2 + \lambda \| w \|_1$, where $w$ are the weight parameters, which are penalized with the regularization parameter $\lambda$. As $\lambda$ increases, more of the model parameters are shrunk to zero and the model becomes sparser. In order to extract features from LASSO models, we consider positive weights to be applied to features significant to the positive class, and negative weights to features significant to the negative class. We rank features from a LASSO model based on the magnitude of their corresponding weights.
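This ranking can be sketched with scikit-learn's Lasso; the synthetic data and the choice of regularization strength are illustrative.

```python
# Fit a LASSO model and rank features by the magnitude of their weights.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.random((100, 15))
y = 2.0 * X[:, 4] - 1.5 * X[:, 7] + 0.01 * rng.standard_normal(100)

lasso = Lasso(alpha=0.01).fit(X, y)
coef = lasso.coef_
ranking = np.argsort(np.abs(coef))[::-1]   # larger |w| => higher rank
```

The sign of each surviving weight indicates which class (or direction of the response) the feature is associated with.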

Multi-Layer Perceptron Neural Network
Neural networks consist of multiple layers of nodes that are fully connected by weighted edges. The values of a hidden layer are a linear combination of the values from the previous layer passed through a non-linear activation function. More explicitly, the values of a hidden layer $h_l$ are calculated as $h_l = \Psi(W_l h_{l-1} + b_l)$, where $h_{l-1}$ are the values from the previous hidden layer, $W_l$ are the weights connecting $h_{l-1}$ to $h_l$, $b_l$ is a bias vector, and $\Psi$ is a non-linear activation function. This non-linear transformation is applied over multiple layers. In our method, we use the Rectified Linear Unit (ReLU) activation function, which sets all negative values to 0 and leaves all positive values unchanged. The output layer uses the softmax activation function to predict class probabilities.
Once a model is trained, we evaluate the importance of each feature by calculating its cumulative weight contribution for each class. The larger the cumulative contribution, the larger the influence a change in the feature will have on the prediction. More explicitly, the importance matrix can be calculated as $I = \prod_{l=1}^{L} W_l$, which results in a matrix $I \in \mathbb{R}^{p \times c}$, where $p$ is the number of features and $c$ is the number of classes. Features are then ranked by their maximum value across all classes to give a single ranked list.
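The cumulative contribution score can be sketched directly in numpy; the random matrices below stand in for a trained MLP's weights and the dimensions are illustrative.

```python
# Product of an MLP's weight matrices gives a p x c importance matrix;
# features are then ranked by their maximum over classes.
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((30, 16))   # p=30 features -> 16 hidden units
W2 = rng.standard_normal((16, 8))    # second hidden layer
W3 = rng.standard_normal((8, 2))     # last hidden layer -> c=2 classes

importance = W1 @ W2 @ W3            # I = product of weight matrices, (p, c)
ranking = np.argsort(importance.max(axis=1))[::-1]
```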

Convolutional Neural Network (PopPhy-CNN)
Convolutional neural networks (CNNs) are deep learning models that consider groups of local features in machine learning tasks. They have become popular in image processing as well as natural language processing. We have developed a novel CNN framework that exploits the taxonomic relationships among taxa by treating a populated taxonomic tree as a type of image. Our framework (PopPhy-CNN) contains one or more convolutional layers, in which a set of kernels is used to capture various signals and generate a set of feature maps. This is usually followed by at least one fully connected hidden layer, followed lastly by the output layer. A generic structure of a CNN is shown in Fig. S3. Figure S3. A generic CNN structure.
Given an input matrix $M$ and a kernel with a set of weights $W^{(k)} \in \mathbb{R}^{r \times s}$, the velocity at position $(i, j)$ is defined as $v_{i,j} = \sum_{a=1}^{r} \sum_{b=1}^{s} W^{(k)}_{a,b} M_{i+a-1,\, j+b-1}$. This velocity value is then passed to a non-linear activation function, and pooling is performed over the entire set.
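The velocity computation amounts to a valid 2-D cross-correlation, which can be sketched directly in numpy (bias and activation omitted):

```python
# Compute all velocities of an input matrix M under an r x s kernel W:
# each output entry is the weighted sum over the aligned r x s window.
import numpy as np

def velocities(M, W):
    r, s = W.shape
    n, m = M.shape
    out = np.empty((n - r + 1, m - s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * M[i:i + r, j:j + s])
    return out
```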
In PopPhy-CNN, we use max pooling and the ReLU activation function for the convolutional and fully connected layers. In the output layer, we use a softmax activation in order to predict the class probabilities, assigning the predicted class to the one with the highest probability. We train the entire model using stochastic gradient descent (SGD), with the cost of a sample with class $c$ defined as $L = -\frac{N}{N_c} \log(\hat{y}_c) + \lambda \| W \|_2^2$, where $N$ is the total number of samples, $N_c$ is the number of samples with class $c$, $\hat{y}_c$ is the predicted probability of class $c$ from the softmax activation, and $\lambda$ is a regularization penalty coefficient for an L2 regularization of the weights.

Feature scoring in PopPhy-CNN
To evaluate features from a trained CNN model, we propose a scoring scheme to extract learned features. We focus on the post-hoc analysis of the feature-map activations in the first convolutional layer prior to subsampling. This allows us to evaluate which positions in the input contribute most to the highest activations in the learned CNNs. A visualization of the generation of a feature map in the first layer is shown in Fig. S4. Figure S4. Determining the reference window for feature evaluation using a CNN kernel and kernel map.
We first calculated all the kernel maps using the weights from the first convolutional layer for each correctly classified sample in the test set. Next, we looked at the feature maps generated by a single kernel, $k$, across all the samples of a specific class, $c$. For each of these feature maps, we took the top 10% of the maximum values. For each velocity selected, we traced its location $(i, j)$ in the feature map back to the submatrix of the input $M$ from which it was calculated. We call this submatrix $R$ our reference window; more specifically, given a kernel $k$ with weights $W^{(k)}$ of dimensions $r \times s$, the reference window for the velocity at position $(i, j)$ is $R = M_{i:i+r-1,\, j:j+s-1}$. Within the reference window, every position $(a, b)$ corresponds to some node $v$ from the phylogenetic tree with a taxon label, $f$. We calculate the importance of each feature $f$, given the reference window $R_s$ for sample $s$, as its proportion of the velocity: $\mathrm{imp}(f \mid R_s) = \widetilde{W}^{(k)}_{a,b} (R_s)_{a,b} \, / \, \sum_{(a',b')} | \widetilde{W}^{(k)}_{a',b'} | (R_s)_{a',b'}$. Here $\widetilde{W}^{(k)}$ denotes the weights of the current kernel $k$, flipped to account for the convolution function, and the summation is over all positions in $R_s$. The absolute value of the weights in the denominator is used to handle cases where the contributions of one large positive component and one large negative component give rise to a velocity that is much smaller than its components. This could lead to sporadic scaling of importance values that would be hard to interpret. By using the absolute value of the weights, we create an upper bound of 1 and a lower bound of -1. Within a single reference window, some taxa may be found important in a small subset of the samples but not be important considering all of the samples. In order to capture only the taxa that are consistently important, we calculate the mean importance value of a feature $f$ across all samples in class $c$ given a single reference window $R$ and kernel $k$.
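A minimal numpy sketch of the per-window score described above: each position's signed contribution to the velocity, normalized by the sum of absolute contributions so that scores fall in [-1, 1] (for the non-negative abundances in a populated tree, |w| R and |w R| coincide). The names are illustrative.

```python
# Importance of every position in one reference window R under the
# (flipped) kernel weights W: signed contribution over total |contribution|.
import numpy as np

def window_importance(W_flipped, R):
    contrib = W_flipped * R              # element-wise contribution
    denom = np.sum(np.abs(contrib))      # absolute-value denominator
    return contrib / denom if denom > 0 else np.zeros_like(contrib)
```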
Since the reference windows of different velocities may overlap, it is possible for a single feature to have multiple importance values using the same kernel. A feature may also be found to be important by multiple kernels. This leads to multiple importance values for a single feature. To handle this problem, we selected the importance of f to be the maximum over all reference windows containing f and over all kernels, k.
Lastly, we assign a score for a feature from the perspective of class $c$ as the difference between the feature importance computed using all the samples within the class and the feature importance computed using all the samples not in the class. Given only two classes, the two scores have the same value with opposite signs. Nevertheless, we designed our method to handle scenarios with more than two classes.
From these scores, we created a list of feature scores for each class, allowing the analysis of feature importance from the perspective of different classes, which can then be ranked. If a taxon was not found in any of the kernels, it is ranked at the bottom of the feature list. To create a single ranked list, we then ranked features by their maximum score across classes.

Model Prediction and Feature Aggregation
Models were trained using 10-fold cross-validation, and we report the average and standard deviation of each evaluation criterion over 10 iterations of cross-validation. In each 10-fold cross-validation, the dataset was partitioned into 10 sets stratified by class proportion. We then constructed 10 datasets, in each of which a single partition was left out as the test set and the remaining partitions were used as the training set.
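The stratified partitioning can be sketched with scikit-learn's StratifiedKFold; the data here is synthetic.

```python
# Build 10 class-stratified train/test partitions, as described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(4)
X = rng.random((100, 5))
y = np.array([0] * 70 + [1] * 30)    # imbalanced labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(skf.split(X, y))        # 10 (train_idx, test_idx) pairs
```

Each test fold preserves the 70/30 class proportion of the full dataset.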
The prediction evaluation for each dataset using the original taxa is shown in Table S1.  Table S1. Evaluation of each model on the five datasets using 10 iterations of 10-fold cross-validation.

In our experiments, we observed that RF models performed the best overall compared to the other models. However, within individual partitions of the 10 iterations of 10-fold cross-validation, we observed partitions in which RF models were outperformed by other models. An example of this for the obesity dataset is shown in Fig. S5. Because of this observation, we decided to use features from multiple learning models across multiple training partitions. In this way, we believe that the true signal will be consistent between models and will be ranked higher than any signal arising from over-fitting once all the rankings are aggregated. Figure S5. AUC scatterplots for pair-wise combinations of learning models on the obesity dataset.

The original and tree features extracted from Meta-Signer
We report the aggregated top 20 original and tree-level taxa for each of the five datasets as well as the proportion of individual ranked lists in which a taxon was found in the top 20 features in Tables S2-S6.

In the cirrhosis patients, four species of Veillonella, three species of Streptococcus, and Haemophilus parainfluenzae were found to be discriminative. Meta-Signer identified Eubacterium hallii and two species of Bacteroides as enriched in healthy subjects. These findings agree with the original study. The tree taxa captured Veillonellales at the order, family, genus, and species levels. Meta-Signer also identified the tree branch from the class Bacilli to the genus Streptococcus, as well as the species underneath. The genera Veillonella and Streptococcus were identified as strongly indicative of the disease state for cirrhosis in the original study.

For the T2D dataset, Meta-Signer identified Roseburia intestinalis and Faecalibacterium prausnitzii as enriched in healthy subjects, while Clostridium bolteae, Clostridium hathewayi, and Eggerthella lenta were enriched in subjects with T2D. Similar observations were reported in previous studies. In addition, we observed an overall increase of the families Clostridiaceae and Peptostreptococcaceae in the tree. A previous study reported that these two families of microbes are known to increase the blood levels of trimethylamine N-oxide (TMAO) in humans and that higher levels of TMAO are associated with diabetes.

For the IBD dataset, Meta-Signer finds microbes from the family Lachnospiraceae, as well as Bifidobacterium bifidum and Faecalibacterium prausnitzii, to be associated with patients with IBD. These findings agree with a separate analysis of this IBD dataset; however, many studies using different datasets have shown the same microbes to be protective against IBD.

In patients with CRC, Meta-Signer found Peptostreptococcus stomatis, Parvimonas micra, Fusobacterium nucleatum, and Streptococcus anginosus to be enriched. All of these microbes have been associated with CRC in previous studies.

Evaluation of the Meta-Signer features against other methods
We evaluated the extracted features from each method using 10-fold cross-validation. Each dataset was randomly partitioned into 10 sets, stratified to balance the class proportions. Each method was then applied to 9 of the 10 sets (the training set) to extract features. The datasets were then filtered to retain only the chosen features, and models were trained on the training set and evaluated on the test set. Each training set was used to train an SVM model over both the linear and Gaussian kernels. We observed that Meta-Signer was robust across all datasets. On cirrhosis, T2D, and obesity, it outperformed all other methods. Biosigner and HFE performed very poorly on these datasets. Biosigner often returned no features, in which case an AUC of 0.5 was assigned to the test set. On the obesity dataset, both Biosigner and HFE extracted features that led to a testing AUC below 0.5. We observed that a Wilcoxon test was the best for IBD, and both the Wilcoxon test and HFE performed well on CRC. However, although not the best on IBD and CRC, Meta-Signer was still comparable. The table of cross-validated results across all learning methods and datasets is shown in Table S7. We then examined the taxa extracted by Meta-Signer and looked at the proportion of models in which they appeared in the top 20 features independently. We observed a large consensus on many taxa; however, a few taxa were missed completely by some models and picked up by others. The top 20 taxa found in the obesity dataset using both raw and tree-level features are shown in Tables S8 and S9. Table S9. The top 20 tree-level taxa found in the obesity dataset and the proportion of models that contained that feature in the top 20 of their own ranked list.

We observed that the CNN model was not contributing as much to the top 20 selected features in each dataset. Upon visual inspection, we observed that the features from the CNN models were more often at higher taxonomic levels (family and class levels or above). We therefore suspected that the CNN captures information at higher aggregation points, and as such fewer features are required. Therefore, we trained SVM models using just the top 10 features after aggregating the ranked lists from the CNN models only, and compared the performance to Meta-Signer on the tree features. With the exception of the IBD dataset, the features extracted from the CNN alone are comparable or even superior to the ensemble-aggregated features. This suggests that the CNN is finding useful features, but that these features are dampened by the other models in the ensemble aggregation. A table comparing the top 10 features of the CNN model and Meta-Signer is shown in Table S10.