Software Tool Article

Meta-Signer: Metagenomic Signature Identifier based on rank aggregation of features

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 09 Mar 2021

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Bioinformatics gateway.

Abstract

The advance of metagenomic studies provides the opportunity to identify microbial taxa that are associated with human diseases. Multiple methods exist for the association analysis. However, their results can be inconsistent, which makes host-microbiome interactions difficult to interpret. To address this issue, we developed Meta-Signer, a novel Metagenomic Signature Identifier tool based on rank aggregation of features identified from multiple machine learning models including Random Forest, Support Vector Machines, Logistic Regression, and Multi-Layer Perceptron Neural Networks. Meta-Signer generates ranked taxa lists by training individual machine learning models over multiple training partitions and aggregates the ranked lists into a single list by an optimization procedure to represent the most informative and robust microbial features. Users receive a rapid assessment of the predictive performance of each machine learning model using different numbers of the ranked features and can then determine the final models to be used for evaluation on external datasets. Meta-Signer is user-friendly and customizable, allowing users to explore their datasets quickly and efficiently.

Keywords

Metagenome-wide Association Study, Feature Extraction, Machine Learning, Rank Aggregation

Introduction

Recent metagenomic studies of the gut microbiome have linked dysbiosis to many human diseases1–3. The identification of microbial taxa associated with human disease has been one of the important efforts in metagenomics data analysis4. Various metagenomic studies use parametric or non-parametric statistical tests to detect individual taxa that are differentially abundant between disease and control groups5–9. Such methods can miss taxa whose individual associations are weak but which together show a strong statistical association.

To capture group associations, several methods have been proposed that explore related taxa on a phylogenetic or taxonomic tree10–14.

These elaborated statistical methods enhanced the detection of the microbial group association. However, they may still fail to detect complex multivariate nonlinear associations.

Alternative approaches using machine learning (ML) models have been advocated for the prediction of the host phenotype15. This is motivated by the findings that a microbial signature for the host phenotype may be complex, involving simultaneous over- and under-representations of multiple microbial taxa potentially interacting with each other. Classical ML models, such as Random Forest (RF), Logistic Regression and Support Vector Machines (SVMs), and deep neural networks (DNNs) have been applied to host phenotype prediction using microbial abundance features16–22.

While the ML approaches demonstrated promising results on host phenotype prediction23,24, it is a challenging task for users to determine what is the best ML model and how many features are needed in order to achieve robust prediction, especially on external validation datasets. In addition, each ML algorithm may generate different feature importance rankings12,22,25, complicating the decision on a consistent and informative signature for the host phenotype of interest.

In this work, we introduce a novel tool, Meta-Signer, a Metagenomic Signature Identifier based on rank aggregation of informative taxa learned from individual ML models. Meta-Signer uses RF, SVM, penalized Logistic Regression, and multiple-layer perceptron neural network (MLPNN) models to evaluate importance of each microbial taxon and generates a ranked list of features (i.e., taxa) per model. It aggregates all the ranked lists using a procedure "RankAggreg" based on the cross-entropy method or the genetic algorithm26. Finally, Meta-Signer reports the top ranking features specified by the user and generates the ML models using these features. Meta-Signer’s workflow is shown in Figure 1.


Figure 1. The Meta-Signer workflow.

Large rounded rectangles represent different modules of the workflow. Microbial abundance is preprocessed and filtered, and then used to train ML models. Features are ranked for each model and an overall aggregated feature ranking is constructed. Meta-Signer generates portable, user-friendly HTML files for visualization as well as ML models trained on a subset of high ranking features. SVM, Support Vector Machine; MLPNN, multiple-layer perceptron neural network; ML, machine learning.

Meta-Signer is user-friendly and easy to run. It provides readable summaries as HTML outputs as well as the final trained ML models. Meta-Signer is distributed as a Python tool and is available at https://github.com/YDaiLab/Meta-Signer.

Methods

Implementation

The inputs to Meta-Signer are (1) a tab separated file of taxa abundance values where each row represents a taxon and each column represents a sample, and (2) a line separated list of response values where each row represents the phenotypic response of a sample. The first column in the abundance table should be the taxonomic identification of the taxon. An additional required file is (3) the run configuration file with user specified parameters. A final optional file is (4) the model parameters for the neural network architectures in JSON format. If this file is not found, Meta-Signer will tune the parameters and save them for later use.
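As a rough illustration of the two data inputs described above, the following sketch parses a taxa-by-samples table and a label list using only the Python standard library. It is not Meta-Signer's own loader: the in-memory strings stand in for abundance.tsv and labels.txt, the taxon names and values are invented, and the absence of a header row is an assumption.

```python
import csv
import io

# Hypothetical stand-ins for abundance.tsv and labels.txt (invented values).
# Real files would be opened with open(path, newline="") instead.
ABUNDANCE_TSV = (
    "k__Bacteria|g__Bacteroides\t0.40\t0.10\t0.25\n"
    "k__Bacteria|g__Roseburia\t0.05\t0.30\t0.00\n"
)
LABELS_TXT = "CD\nHealthy\nUC\n"


def read_abundance(handle):
    """Return (taxa, matrix) with one row per taxon and one column per sample."""
    taxa, matrix = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        taxa.append(row[0])  # first column is the taxonomic identifier
        matrix.append([float(value) for value in row[1:]])
    return taxa, matrix


def read_labels(handle):
    """Return one phenotype label per sample, in file order."""
    return [line.strip() for line in handle if line.strip()]


taxa, abundance = read_abundance(io.StringIO(ABUNDANCE_TSV))
labels = read_labels(io.StringIO(LABELS_TXT))

# Transpose to samples x taxa, the layout most ML libraries expect.
samples = [list(column) for column in zip(*abundance)]

# Each sample needs exactly one label.
if len(samples) != len(labels):
    raise ValueError("number of samples and labels differ")
```

Transposing early keeps the later modelling code independent of the on-disk orientation of the table.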

Meta-Signer includes three classic ML models (RF, Linear SVM, Logistic Regression), as well as an MLPNN model. RFs are decision tree learning models trained in an ensemble fashion, taking the average of the ensemble to give a robust decision tree27. While growing each tree, a decision is made at each node by selecting the feature from a random subset of features that best splits the data into two subsets based on the Gini impurity of each subset. Given a set of data points with $k$ classes, let $p_i$ be the proportion of samples of class $i$ for $i \in \{1, \ldots, k\}$. The Gini impurity of the set is calculated as

$$I_G(p) = 1 - \sum_{i=1}^{k} p_i^2 \tag{1}$$

Our method implements the RF model using the scikit-learn Python library. Once trained, features are scored by their mean decrease in impurity. For each node, the importance of the feature used for the split is calculated as the decrease in Gini impurity from before to after the split. This value is then weighted by the proportion of total samples that were split upon that node. A feature’s importance is then calculated by averaging the weighted importance values of nodes that split using that feature across all trees in the ensemble.
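The impurity calculation can be made concrete with a small sketch. It is not scikit-learn's implementation; the function names are hypothetical, and weighting each child's impurity by its share of the node's samples is an assumption based on the usual definition of impurity decrease.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels (Equation 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    # Assumption: child impurities are averaged by child size.
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return (n / n_total) * (gini_impurity(parent) - children)


# A node holding 4 of 8 training samples, split perfectly into two pure children.
parent = ["CD", "CD", "UC", "UC"]
decrease = weighted_impurity_decrease(parent, ["CD", "CD"], ["UC", "UC"], n_total=8)
```

Here the parent impurity is 0.5, both children are pure, and the node holds half of the samples, so the weighted decrease is 0.25.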

SVMs are supervised ML models that learn the best hyperplane to separate two classes of data28. In the case of linear SVMs, a set of weights ($w$) and an intercept ($b$) are learned. The class of the sample $x_i$ can then be determined as

$$\hat{y} = \operatorname{sign}(w^T x_i + b) \tag{2}$$

Since the weights can be used to rank the importance of features, we used the linear SVMs in Meta-Signer for feature extraction. Once trained, features can be ranked using the absolute value of the learned weight parameters. SVM models are tuned with a grid search using the scikit-learn Python library.
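A minimal sketch of the decision rule in Equation 2 and of ranking features by absolute weight is given below. The weights, intercept, and taxon names are invented for illustration, and treating a score of exactly zero as the positive class is an assumption.

```python
def predict(weights, intercept, sample):
    """Linear decision rule of Equation 2: sign(w^T x + b)."""
    score = sum(w * x for w, x in zip(weights, sample)) + intercept
    return 1 if score >= 0 else -1  # assumption: a score of 0 maps to +1


def rank_by_magnitude(weights, names):
    """Order feature names from largest to smallest absolute weight."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [names[j] for j in order]


# Hypothetical learned parameters for three taxa.
names = ["taxon_A", "taxon_B", "taxon_C"]
weights = [0.2, -1.5, 0.7]
intercept = -0.1

ranking = rank_by_magnitude(weights, names)
label = predict(weights, intercept, [0.5, 0.1, 0.3])
```

Because the ranking uses magnitudes, a strongly negative weight such as that of taxon_B ranks above smaller positive weights.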

Logistic Regression fits a logistic function to estimate the probability of binary classification; however, it can be extended to multi-class scenarios29. The model predicts the probability of a sample $x_i$ being class 1 as

$$\hat{y} = \frac{1}{1 + e^{-(\beta^T x_i + b)}} \tag{3}$$

where $\beta$ are the learned weight parameters. We apply shrinkage to reduce the total number of model parameters in the final model using L1 regularization in order to penalize the absolute value of the weights, eliminating a portion of the features to create a sparse model. Given a set of samples $x_i$ ($i = 1, \ldots, n$) where each sample has $m$ features and a binary class label $y_i$, the model minimizes the cost

$$C = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^{m} |\beta_j| \tag{4}$$

where the weight parameters are penalized with the regularization parameter $\lambda$. Logistic regression models are trained using the scikit-learn Python library. Once trained, the $\beta$ values are used to rank features based on their absolute value.

Neural networks consist of multiple layers of nodes that are fully connected by weighted edges30. The values of a hidden layer are a linear combination of the values from the previous layer, passed through a non-linear activation function. More explicitly, the values of a hidden layer $h_l$ are calculated as

$$h_l = \psi(W_l^T h_{l-1} + b_l) \tag{5}$$

where $h_{l-1}$ are the values from the previous hidden layer, $W_l$ are the weights connecting $h_{l-1}$ to $h_l$, $b_l$ is a bias value, and $\psi$ is a non-linear activation function. Meta-Signer uses the Rectified Linear Unit activation function for hidden layers and the softmax activation function on the output layer. The entire network is trained using a single loss function

$$C = -\ln(a_c) + \lambda \sum_{l=1}^{L} \lVert W_l \rVert^2 \tag{6}$$

Here $a_c$ is the predicted softmax probability of a sample’s true class $c$. The second term performs L2 regularization on the network weights, with its strength controlled by $\lambda$. Networks are trained using early stopping with a validation subset of 10% of the training data.
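The forward pass of Equation 5 and the loss of Equation 6 can be sketched with NumPy as follows. This is an illustration rather than Meta-Signer's TensorFlow code: the layer sizes, random weights, and the value of the regularization parameter are invented, and the penalty is taken to be the sum of squared weights.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    shifted = z - z.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def forward(x, weights, biases):
    """Apply Equation 5 layer by layer: ReLU on hidden layers, softmax on the output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W.T @ h + b)
    return softmax(weights[-1].T @ h + biases[-1])


def loss(probabilities, true_class, weights, lam):
    """Equation 6: negative log-probability of the true class plus an L2 penalty."""
    penalty = sum(np.sum(W ** 2) for W in weights)  # assumption: squared weights
    return -np.log(probabilities[true_class]) + lam * penalty


rng = np.random.default_rng(0)
# Hypothetical network: 4 input taxa -> 3 hidden units -> 2 classes.
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases = [np.zeros(3), np.zeros(2)]

probs = forward(np.array([0.1, 0.4, 0.0, 0.5]), weights, biases)
cost = loss(probs, true_class=1, weights=weights, lam=0.01)
```

Each weight matrix is stored with shape (inputs, outputs), so the transpose in the forward pass matches the form of Equation 5.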

Before training, Meta-Signer checks for a file containing network hyper-parameters for MLPNN models. If it does not find this file, Meta-Signer will use the first partition of the cross-validation to empirically determine the hyper-parameters. This is done using another cross-validation on the training set of the first partition. In addition, using the configuration file, the user can set custom parameters and disable any of the learning models that they do not wish to incorporate into their results. In Meta-Signer, RF, SVM, and Logistic Regression are trained using the scikit-learn Python package and MLPNN models are trained using TensorFlow.

For each ML model in each cross-validated partition, Meta-Signer extracts the feature scores and uses the scores to construct a ranked feature list. RF features are scored by mean decrease impurity, computed as described above for the RF model. Features in Logistic Regression and SVM models are scored by the magnitude of their coefficients in the decision functions.

The extraction of features from DNN models is a challenging task in general. We use a procedure described in reference 31 to evaluate features in MLPNNs. Briefly, the MLPNN features are evaluated by calculating the cumulative weight across all layers, taking the running product of all the weight matrices in the learned network. The product is a matrix with a row for each feature and a column for each class, and the value at a given index is that feature’s cumulative impact for that class. We then take a feature’s importance to be its maximum impact across classes, which yields a single ranked list.
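The running product of weight matrices can be sketched with NumPy as below. The weight values are invented, the function name is hypothetical, and biases and activation functions are ignored, following the description above; this is an illustration, not the referenced implementation.

```python
import numpy as np


def mlp_feature_scores(weight_matrices):
    """Multiply the layer weight matrices in order and take the row-wise maximum."""
    cumulative = weight_matrices[0]
    for W in weight_matrices[1:]:
        cumulative = cumulative @ W  # rows stay features, columns become the next layer
    # cumulative now has one row per feature and one column per class.
    return cumulative.max(axis=1), cumulative.argmax(axis=1)


# Hypothetical weights: 3 features -> 2 hidden units -> 2 classes.
W1 = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
W2 = np.array([[1.0, -1.0], [0.0, 1.0]])

scores, best_class = mlp_feature_scores([W1, W2])
ranking = np.argsort(-scores)  # feature indices, most important first
```

The second return value records which class gave each feature its maximum impact, which corresponds to the class a taxon is reported as predictive for.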

For each partition of the cross-validation, we generate a single ranked list for each of the ML models. Once the cross-validated training is complete, the set of all ranked lists across all models is aggregated into a single top-k ranked list by minimizing the distance between the set of ranked lists and the top-k list, where k is specified by the user in the configuration file. More specifically, given a set of ranked lists $\{\ell_1, \ldots, \ell_m\}$, the top-k ranked list, $\hat{\theta}$, is determined as

$$\hat{\theta} = \arg\min_{\theta \in \mathcal{L}} \sum_{i=1}^{m} w_i \, d(\theta, \ell_i) \tag{7}$$

Here, $\mathcal{L}$ is the state space of top-k rankings, $w_i$ is a weight associated with $\ell_i$, and $d(\theta, \ell_i)$ is the distance between a proposed top-k ranked list, $\theta$, and $\ell_i$.
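To make the objective in Equation 7 concrete, the sketch below evaluates it exhaustively on a toy problem. It is not the search procedure Meta-Signer uses: the Spearman footrule distance is swapped in as the distance, items missing from a list are assigned rank k + 1, and the lists and weights are invented. Exhaustive search is only feasible for very small inputs.

```python
from itertools import permutations


def footrule_distance(candidate, ranked_list, k):
    """Spearman footrule distance between two top-k lists.

    Items missing from a list are given rank k + 1 (an assumption of this sketch).
    """
    items = set(candidate) | set(ranked_list)
    pos_a = {item: r for r, item in enumerate(candidate, start=1)}
    pos_b = {item: r for r, item in enumerate(ranked_list, start=1)}
    return sum(abs(pos_a.get(i, k + 1) - pos_b.get(i, k + 1)) for i in items)


def aggregate(ranked_lists, weights, k):
    """Exhaustively minimise the weighted distance of Equation 7 (tiny inputs only)."""
    universe = sorted({item for lst in ranked_lists for item in lst})
    best, best_cost = None, float("inf")
    for candidate in permutations(universe, k):
        cost = sum(w * footrule_distance(candidate, lst[:k], k)
                   for w, lst in zip(weights, ranked_lists))
        if cost < best_cost:
            best, best_cost = list(candidate), cost
    return best, best_cost


# Hypothetical ranked lists from three models, each weighted by an invented AUC.
lists = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["B", "A", "C", "D"]]
aucs = [0.9, 0.8, 0.7]
top_k, cost = aggregate(lists, aucs, k=2)
```

With these inputs the candidate ["A", "B"] agrees exactly with the highest-weighted list and is at distance 2 from each of the other two, which gives the smallest weighted cost.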

The aggregation is performed using the R package RankAggreg26. This package uses either a genetic algorithm or a cross-entropy based approach with Markov Chain Monte Carlo sampling to find the top-k features that minimize the sum of the distances between each of the input lists and the generated top-k list. The distance used is Spearman’s correlation. Each input ranked list is weighted in the aggregation by the area under the receiver operating characteristic curve (AUC).

After the model predictions are evaluated and the features are ranked into a single list, Meta-Signer provides a summary of the results in a portable HTML file. The file contains a description of the run and evaluation metrics for the different models in the form of boxplots. It also provides the distribution of the feature importance scores for each ML model. Lastly, it provides a list of the top-k taxa selected from the original taxa, the proportion of individual ranked lists in which each taxon was present in the top k, the rank and p-value under a PERMANOVA test32, and the class for which the taxon was found to be predictive. All images are encoded into the file, allowing the HTML file to be moved without considering the location of the images.

Operation

Meta-Signer’s general workflow proceeds as described above and as shown in Figure 1. Meta-Signer was designed for Python 3.7 and requires the following Python packages: numpy, pandas, scipy, scikit-learn, scikit-bio, TensorFlow, matplotlib, seaborn. In addition, a working installation of R (version 3.0 or higher) is required. Code and instructions for configuration and running can be found at https://github.com/YDaiLab/Meta-Signer.

Use cases

In this section we will demonstrate how to run Meta-Signer on a dataset of patients with inflammatory bowel disease (IBD) which is provided in the GitHub repository. This dataset came from the Prospective Registry in IBD Study at MGH (PRISM)33, which enrolled patients with a diagnosis of either Crohn’s disease (CD) or ulcerative colitis (UC). The dataset includes 68 samples with CD, 53 samples with UC, and 34 healthy samples.

In addition, Meta-Signer includes another IBD dataset for external testing. This dataset consists of two independent cohorts from the Netherlands34. The first cohort consists of 22 healthy subjects who participated in the general population study LifeLines-DEEP in the northern Netherlands. The second cohort consists of subjects with IBD from the Department of Gastroenterology and Hepatology, University Medical Center Groningen, Netherlands and includes 20 samples with CD and 23 samples with UC. Together, both the PRISM dataset and the external IBD dataset included 201 microbial features. Datasets were evaluated using all three classes as well as in a binary case by combining CD and UC samples.

To use Meta-Signer and the data provided, the user needs to download Meta-Signer from the GitHub repository using the following command:

git clone https://github.com/YDaiLab/Meta-Signer.git

cd Meta-Signer

The user must make sure that all the Python and R dependencies are met before running Meta-Signer (see Operation). The user can either install these manually or use the provided Meta-Signer Conda environment file with the following commands:

conda env create -f meta-signer.yml

source activate meta-signer

Before running Meta-Signer, the user should open the file config.py in the src directory. This file can be used to change the run parameters of Meta-Signer and turn off and on different ML methods. An example is shown in Figure 2.


Figure 2. User specified configuration parameters.

Using these parameters, we will run 10 iterations of 10-fold cross-validation on the data found in the PRISM_3 directory under the data folder. Any taxon not found in at least 10% of samples will be removed as well as any taxon with less than 0.001 mean abundance. We will use the genetic algorithm method for rank aggregation to generate a candidate list with a maximum of 50 features. Descriptions for each parameter can be found on Meta-Signer’s GitHub page as well as in the configuration file. The user can then generate the aggregated ranked list using the following command:

python generate_feature_ranking.py

This will perform the cross-validation evaluation and feature ranking, storing all the results in the results folder under a directory named after the dataset. This directory will contain internal cross-validation results as well as an HTML file that allows the user to see how different ML methods perform on the dataset. Performance of the 10 iterations of 10-fold cross-validation on the PRISM dataset is shown in Table 1. We include an analysis of both the binary (IBD vs healthy) and the three-class (healthy vs CD vs UC) scenarios.
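The taxon filter applied before training in this run (at least 10% prevalence and at least 0.001 mean abundance) can be sketched as follows. This is not Meta-Signer's own filtering code: the function and parameter names are hypothetical, and counting a taxon as present in a sample when its abundance is greater than zero is an assumption.

```python
def filter_taxa(taxa, abundance, min_prevalence=0.10, min_mean_abundance=0.001):
    """Keep taxa present in enough samples and abundant enough on average.

    abundance[i][j] is the relative abundance of taxon i in sample j.
    """
    kept_taxa, kept_rows = [], []
    for name, row in zip(taxa, abundance):
        # Assumption: a taxon is "found" in a sample when its abundance is > 0.
        prevalence = sum(1 for value in row if value > 0) / len(row)
        mean_abundance = sum(row) / len(row)
        if prevalence >= min_prevalence and mean_abundance >= min_mean_abundance:
            kept_taxa.append(name)
            kept_rows.append(row)
    return kept_taxa, kept_rows


# Hypothetical table: 3 taxa x 10 samples.
taxa = ["taxon_A", "taxon_B", "taxon_C"]
abundance = [
    [0.2] * 10,     # prevalent and abundant: kept
    [0.0] * 10,     # never observed: removed
    [0.0005] * 10,  # present everywhere but mean abundance too low: removed
]
kept_taxa, kept_rows = filter_taxa(taxa, abundance)
```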

Table 1. Mean cross-validated results over the PRISM dataset.

Standard deviation is shown in parentheses. RF, Random Forest; SVM, Support Vector Machine; MLPNN, multiple-layer perceptron neural network; AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; F1, F1 score.

Dataset          Metric     RF           SVM          Logistic Regression  MLPNN
PRISM            AUC        0.91 (0.08)  0.81 (0.13)  0.82 (0.11)          0.87 (0.10)
                 MCC        0.50 (0.28)  0.29 (0.32)  0.28 (0.27)          0.39 (0.30)
                 Precision  0.84 (0.10)  0.76 (0.13)  0.76 (0.11)          0.80 (0.12)
                 Recall     0.85 (0.07)  0.81 (0.08)  0.78 (0.07)          0.82 (0.07)
                 F1         0.83 (0.09)  0.77 (0.10)  0.76 (0.08)          0.80 (0.09)
PRISM (3 Class)  AUC        0.88 (0.06)  0.67 (0.09)  0.72 (0.10)          0.74 (0.10)
                 MCC        0.55 (0.17)  0.19 (0.19)  0.30 (0.19)          0.35 (0.19)
                 Precision  0.72 (0.11)  0.47 (0.15)  0.56 (0.14)          0.60 (0.13)
                 Recall     0.70 (0.11)  0.49 (0.11)  0.55 (0.12)          0.58 (0.12)
                 F1         0.69 (0.11)  0.46 (0.12)  0.53 (0.12)          0.57 (0.12)
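For readers less familiar with the metrics in Table 1, the sketch below computes precision, recall, F1, and MCC from a binary confusion matrix. The counts are invented, and how Meta-Signer averages these metrics across classes or folds is not specified here, so this shows only the standard binary definitions.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC from the four cells of a confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical fold: 12 IBD samples and 4 healthy samples.
metrics = binary_metrics(tp=10, fp=2, fn=2, tn=2)
```

In this example precision, recall, and F1 are all about 0.83 while MCC is only about 0.33, which illustrates why MCC can be much lower than the other metrics when classes are imbalanced.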

The HTML output displays these results in the form of boxplots. An example for the PRISM dataset is shown in Figure 3. In addition, Meta-Signer will evaluate the different ML methods’ training performance using increasing numbers of ranked features. Since these models will eventually overfit, the point at which performance saturates can help guide the user on how many features to use when training the final model. For the PRISM dataset, we see this saturation around 30 features, as shown in Figure 4.


Figure 3. HTML output of machine learning cross-validated evaluation.

Buttons on the top allow users to cycle through different metrics. RF, Random Forest; SVM, Support Vector Machine; MLPNN, multiple-layer perceptron neural network; AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; F1, F1 score.


Figure 4. HTML output of training performance to help the user select their desired number of ranked features.

RF, Random Forest; SVM, Support Vector Machine; MLPNN, multiple-layer perceptron neural network; AUC, area under the receiver operating characteristic curve.

Using the point of saturation on the AUC curve as a guide, the user can then select the number of features and train final models using the following command:

python generate_final_models.py PRISM_3 30

Here PRISM_3 points to the result directory labeled PRISM_3 that Meta-Signer has just generated and 30 represents the number of features the user would like to have in the final models. In addition, the user can provide an external dataset to evaluate:

python generate_final_models.py PRISM_3 30 -e PRISM_3_external

Note that PRISM_3_external must be a directory in the data directory with its own abundance.tsv and labels.txt files. This command will generate a directory containing the final trained ML models and another HTML file showing the ranked lists for each model together with the aggregated ranked list, as shown in Figure 5.


Figure 5. HTML output of aggregated ranked list for microbes predictive in PRISM dataset.

CD, Crohn’s disease; UC, ulcerative colitis.

In addition, if an external dataset is available for evaluation, the HTML file will display a table of metrics for both evaluation on the training set and evaluation on the external dataset. An example is shown in Figure 6.


Figure 6. HTML output of training and external evaluation for the PRISM dataset.

RF, Random Forest; SVM, Support Vector Machine; MLPNN, multiple-layer perceptron neural network; AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; F1, F1 score.

Discussion

We have developed Meta-Signer, a user-friendly tool for extracting robust microbial taxa that are predictive of host phenotype from multiple ML models. By training different types of ML models, Meta-Signer exploits the similarities in the ranked lists of taxa learned by individual ML models to create a single aggregated set of informative microbial taxa for host phenotype prediction. We have shown that Meta-Signer can provide more informative features when compared to similar methods in both binary and multi-class scenarios.

To evaluate Meta-Signer, we benchmarked it against a previously published method, Biosigner35, and a non-parametric PERMANOVA test32. Biosigner is a generic ML-driven feature selection method for omics data and is available in R. It uses trained RF, SVM and Partial Least Squares Discriminant Analysis models to selectively eliminate features, resulting in a single set of remaining features. The non-parametric PERMANOVA test was included as a baseline method of feature ranking for comparison. Results for the benchmarking can be found in the Extended data document on the Meta-Signer GitHub page.

Conclusions

In conclusion, Meta-Signer is a user-friendly tool to identify a robust set of highly informative microbial taxa that are predictive of human disease status from a metagenomic dataset.

Data availability

Source data

Original data and information for both the PRISM and external IBD datasets can be obtained from Dataset 4 of the original study, http://doi.org/10.1038/s41564-018-0306-4 (reference 33). The data for the PRISM and external IBD datasets as formatted for use with Meta-Signer can be found at Zenodo: derekreiman/Meta-Signer: Original Release. http://doi.org/10.5281/zenodo.4077403 (reference 36). This project contains microbial abundance and sample class data formatted for Meta-Signer as TSV, TXT and JSON files within the folder ’data’.

  • 1. abundance.tsv - tab separated file for microbial abundance data

  • 2. labels.txt - text file of class labels

  • 3. model_parameters.json - JSON file containing neural network hyper-parameters

Extended data

Zenodo: derekreiman/Meta-Signer: Original Release. http://doi.org/10.5281/zenodo.4077403 (reference 36). This project contains the following extended data:

  • ExtendedData.pdf (Data for the benchmarking of Meta-Signer against Biosigner and PERMANOVA ranked features)

Data are under the GPL-3 license.

Software availability

How to cite this article: Reiman D, Metwally A, Sun J and Dai Y. Meta-Signer: Metagenomic Signature Identifier based on rank aggregation of features [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2021, 10:194 (https://doi.org/10.12688/f1000research.27384.1)

Open Peer Review

Reviewer Report 27 Jul 2021
Boyang JI, Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden 
Approved with Reservations
In the article “Meta-Signer: Metagenomic Signature Identifier based on rank aggregation of features”, Reiman et al. developed a pipeline for identifying microbial features linked to human health/disease using multiple machine learning methods with rank aggregation of feature lists. Four different methods for ...
How to cite this report: JI B. Reviewer Report For: Meta-Signer: Metagenomic Signature Identifier based on rank aggregation of features [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2021, 10:194 (https://doi.org/10.5256/f1000research.30263.r87185)
Reviewer Report 29 Jun 2021
Braden T Tierney, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA 
Not Approved
The authors present a machine-learning software tool designed for use on compositional, specifically microbiome, data. This is a worthy endeavor -- the microbiome field is in need of well-designed statistical methods. I am specifically enthusiastic about using multiple machine learning ...
How to cite this report: Tierney BT. Reviewer Report For: Meta-Signer: Metagenomic Signature Identifier based on rank aggregation of features [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2021, 10:194 (https://doi.org/10.5256/f1000research.30263.r86264)