Study Protocol

Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data

[version 1; peer review: 2 approved with reservations]
PUBLISHED 09 Oct 2024


Abstract

Background

The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Here, we evaluate the performance of differential abundance tests for 16S metagenomic data. Building on the benchmark study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental datasets in a case-control design, we validate their findings by generating synthetic datasets that mimic the experimental data. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines and is, to our knowledge, the first of its kind in computational benchmark studies.

Methods

We replicate Nearing et al.’s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring each of the 38 experimental datasets. Equivalence tests will be conducted on 43 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to both synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.

Conclusions

Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, validate previous findings and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing significantly to transparency, reproducibility, and unbiased research.

Keywords

16S, microbiome, differential abundance, simulation, synthetic data, benchmarking

Protocol

Table of Contents

Introduction   

  Background and rationale {6a}   

  Objectives {7}   

  Trial design {8}   

  Summary Table   

Methods   

  Study population/participants   

  Study setting {9}   

  Eligibility criteria {10}   

  Who will take informed consent? {26a}   

  Additional consent provisions for collection and use of participant data and biological specimens {26b}   

Interventions   

  Explanation for the choice of comparators {6b}   

  Intervention description {11a}   

  Criteria for discontinuing or modifying allocated interventions {11b}   

  Strategies to improve adherence to interventions {11c}   

  Relevant concomitant care permitted or prohibited during the trial {11d}   

  Provisions for post-trial care {30}   

Outcomes {12}   

  Participant timeline {13}   

  Sample size {14}   

  Recruitment {15}   

  Assignment of interventions: allocation   

  Sequence generation {16a}   

  Concealment mechanism {16b}   

  Implementation {16c}   

  Assignment of interventions: Blinding   

  Who will be blinded {17a}   

  Procedure for unblinding if needed {17b}   

Data collection and management   

  Plans for assessment and collection of outcomes {18a}   

  Plans to promote participant retention and complete follow-up {18b}   

  Data management {19}   

  Confidentiality {27}   

  Plans for collection, laboratory evaluation and storage of biological specimens for genetic or molecular analysis in this trial/future use {33}   

Statistical methods {20}   

  Data monitoring committee {21a}   

  Statistical methods for primary and secondary outcomes {20a}   

  Interim analyses {21b}   

  Methods for additional analyses (e.g. subgroup analyses) {20b}   

  Methods in analysis to handle protocol non-adherence and any statistical methods to handle missing data {20c}   

  Plans to give access to the full protocol, participant level-data and statistical code {31c}   

Timeline   

Oversight and monitoring   

  Composition of the coordinating centre and trial steering committee {5d}   

  Composition of the data monitoring committee, its role and reporting structure {21a}   

  Adverse event reporting and harms {22}   

  Ancillary and post-trial care {30}   

  Frequency and plans for auditing trial conduct {23}   

  Plans for communicating important protocol amendments to relevant parties (e.g. trial participants, ethical committees) {25}   

Dissemination policy {31a}   

Discussion   

Abbreviations   

Declarations   

  Acknowledgements   

  Authors’ contributions {31b}   

  Availability of data and materials {29}   

  Ethics and consent {24}   

  Consent for publication {32}   

  Authors’ information (optional)   

Data availability   

References   

Study status   

Note: To achieve a rigorous methodology, this protocol adheres to an established standard and checklist for Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT).2 The numbers in curly brackets in this protocol refer to SPIRIT checklist item numbers. The order of the items has been modified to group similar items. Since using a standardized terminology for study designs is essential, we also formulate our protocol using standard terminology, i.e. terms such as study population, comparator, intervention, outcome, modification, inclusion and exclusion.

Table 1. Administrative information summary.

Title {1}: Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data.
Trial registration {2a and 2b}: Currently, there exists no registry tailored specifically to computational benchmark studies. This study does not involve interventions on humans or animals; rather, it exclusively incorporates publicly accessible sequencing data.
Protocol version {3}: August 09, 2024, Version 2
Grant information (Funding {4}): The author(s) declared that no third-party grants were involved in supporting this work.
Author details {5a}: Eva Kohnert: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany. Clemens Kreutz: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany.
Name and contact information for the trial sponsor {5b}: n/a: There is no sponsor.
Role of sponsor {5c}: n/a: There is no sponsor.

Introduction

Background and rationale {6a}

Differential abundance (DA) analysis of metagenomic microbiome data has emerged as a pivotal tool in understanding the complex dynamics of microbial communities across various environments and host organisms.3–5 Microbiome studies are crucial for identifying specific microorganisms that differ significantly in abundance between different conditions, such as health and disease states, different environmental conditions, or before and after a treatment. The insights we gain from analyzing the differential abundance of microorganisms are critical to understanding the role that microbial communities play in environmental adaptations, disease development and health of the host.6 Refining statistical methods for the identification of changes in microbial abundance is essential for understanding how these communities influence disease progression and other interactions with the host, which then enables new strategies for therapeutic interventions and diagnostic analyses.7

The statistical interpretation of microbiome data is notably challenged by its inherent sparsity and compositional nature. Sparsity refers to the disproportionately large proportion of zeros in metagenomic sequencing data and requires tailored statistical methods,8,9 e.g. to account for so-called structural zeros that originate from technical limitations rather than from real absence.10 Additionally, due to the compositional nature of microbiome data, regulation of highly abundant microbes can lead to biased quantification of low-abundant organisms.11 Such bias might be erroneously interpreted as apparent regulation that is mainly due to the compositional character of the data. These characteristics of microbiome data have a notable impact on the performance of common statistical approaches for DA analysis, limit their applicability to microbiome data and pose challenges regarding the optimal selection of DA tests.

A number of benchmark studies have been conducted to evaluate the performance of DA tests in the analysis of microbiome data.12–15 However, the results show a very heterogeneous picture and clear guidelines or rules for the appropriate selection of DA tests have yet to be established. In order to assess and contextualize the findings of those studies, additional benchmarking efforts using a rigorous methodology,16,17 as well as further experimental and synthetic benchmark data sets are essential.

Synthetic data is frequently utilized to evaluate the performance of computational methods because for such simulated data the ‘correct’ or ‘true’ answer is known and can be used to assess whether a specific method can recover this known truth.16 Moreover, characteristics of the data can be changed to explore the relationship between data characteristics such as effect size, variability or sample size and the performance of the considered methods. Several simulation tools have been introduced for generating synthetic microbiome data.18–23 They cover a broad range of functionality. For example, MB-GAN22 leverages generative adversarial networks to capture complex patterns and interactions present in the data, while metaSPARSim,18 sparseDOSSA2,19 or nuMetaSim24 employ different statistical models to generate microbiome data. Introducing a new simulation tool typically involves demonstrating its capacity to replicate key data characteristics. Nonetheless, an ongoing question persists regarding the feasibility of validating findings derived from experimental data when synthetic data, generated to embody the characteristics of the experimental data, is used in its place.

Here we refer to the recent high-impact benchmark study of Nearing et al.,1 in which the performance of a comprehensive set of 14 DA tests applied to 38 experimental 16S microbiome data sets was compared. 16S microbiome sequencing data is used to study communities in various environments; here the data sets cover the human gut, soil, wastewater, freshwater, the plastisphere, marine and built environments. The data sets follow a two-group design for which DA tools are applied to identify variations in species abundances between the groups.

In this validation study we replicate the primary analysis conducted in the reference study by substituting the actual datasets with corresponding synthetic counterparts. The objective is to explore the validity of the main findings from the reference benchmark study when the analysis workflow is repeated with an independent implementation and with synthetic data, generated to recapitulate the characteristics of the original real data.

Objectives {7}

Aim 1: Synthetic data, simulated based on an experimental template, overall reflects the main data characteristics of the template.

Aim 2: Study results from Nearing et al. can be validated using synthetic data, simulated based on corresponding experimental data.

Trial design {8}

Aim 1: Exploratory comparative study

Aim 2: Confirmatory benchmark study

Summary Table

Table 2. Summary of hypotheses and the statistical analyses used for evaluation.

Aim 1
Research question: Can state-of-the-art simulation tools for metagenomics data realistically generate synthetic data across a broad range of simulation templates?
Hypothesis: Main data characteristics calculated from synthetic data are equivalent to those of the experimental templates.
Statistical analyses: Equivalence tests, i.e. two one-sided one-sample t-tests for each data characteristic as implemented in the TOSTER R-package; PCA of all data characteristics and an equivalence test for the Euclidean distances in 2D.
Confirmation criteria: We interpret a p-value < 0.05 for rejecting the null hypothesis “non-equivalence” as significant and then conclude that the respective data characteristic is equivalent.

Aim 2
Research question: Can conclusions based on performance outcomes (proportion of significant taxa and overlap across DA tests) from 16S microbiome sequencing data be validated with synthetic data, simulated after calibration based on the used experimental data?
Hypotheses: Hypothesis 1: the 13 outcomes extracted from (1) concerning the overlap of significant features across exp. data sets and DA tests can be confirmed based on their corresponding simulations. Hypothesis 2: the 14 outcomes extracted from (1) concerning the proportion of significant features identified across multiple DA tools can be confirmed based on their corresponding simulations.
Statistical analyses: For 23/27 hypotheses: estimating the proportion P of cases in which the hypothesis is fulfilled by counting, with 95% confidence intervals calculated a) for independent observations based on the SE formula and b) for dependent observations using bootstrap. For 2/27 hypotheses: 2-way ANOVA. For 1/27 hypotheses: mean of the Kolmogorov-Smirnov test statistic. For 1/27 hypotheses: visualization by histograms.
Confirmation criteria: For each hypothesis, we specified individual confirmation thresholds. In 18/27 cases, we use a 95% threshold as criterion for the estimated proportion of cases where the hypothesis is valid; we check that these criteria are fulfilled by considering the 95% CI. For the ANOVA and Kolmogorov-Smirnov tests, we also specify individual confirmation criteria.

Methods

Study population/participants

In the context of our benchmark study, the study population is given by the experimental data sets from the reference study.1

Study setting {9}

Where possible, the study is conducted analogously to the benchmark study by Nearing et al.,1 e.g. the same data and primary outcomes will be used.

All data sets as provided by Nearing et al.1 will be included in the study. We employ two published simulation tools, metaSPARSim18 and sparseDOSSA2,19 which have been developed for simulating microbial abundance profiles as they are generated by 16S sequencing.

We also apply the same DA tests as in Ref. 1 and their implementations in the R statistical programming language. In order to provide the most valuable results for the bioinformatics community, the latest versions of these implementations will be used.

Eligibility criteria {10}

Inclusion criteria

We will include the same experimental data sets and DA tests as in Ref. 1.

Exclusion criteria

There are no exclusion criteria for the data sets.

Who will take informed consent? {26a}

n/a: Data is publicly available; there is no need to obtain consent.

Additional consent provisions for collection and use of participant data and biological specimens {26b}

n/a: Data is publicly available; there is no need to obtain consent.

Interventions

Explanation for the choice of comparators {6b}

For aim 1, the comparator is the set of 43 data characteristics calculated from the 38 experimental data sets. These characteristics are chosen such that they provide a comprehensive description of count matrices and enable an unbiased comparison between experimental and synthetic data sets. They cover, for example, information about the sparsity of a data set, mean-variance trends of features (taxa), or effect sizes between groups of samples. Tables 4 and 5 provide a detailed summary of all data characteristics and how they are calculated.

Table 3. Study status.

Version 1 (February 13, 2024): Initial submission as registered report to PLOS Biology and PLOS ONE (not accepted). Reason for changes: initial version. Link to initial version: https://nxc-fredato.imbi.uni-freiburg.de/s/o6TsmZBMdngtamp
Version 2 (August 09, 2024): Clarify data used for the hypotheses; some minor text changes (no methodological changes). Reason for changes: hypotheses align with the conclusions in Nearing et al.; make some sentences more precise. Link to version with edits: https://nxc-fredato.imbi.uni-freiburg.de/s/jSRNoQxYzk5E6LW
Version 3 (August 13, 2024): Change order and naming of sections. Reason for changes: naming of protocol sections needs to align with F1000 requirements. Link to version with edits: https://nxc-fredato.imbi.uni-freiburg.de/s/j5ibzbXMwW3Ssj9

Table 4. Calculation of data characteristics in R.

Name of data characteristic (name in the matrix data.prop summarizing all data characteristics): Calculation in R [dimension]

Counts per million normalized and log transformed data (dat.cpm): edgeR::cpm(dat, log=TRUE, prior.count=1) [m×n]
Feature sparsity (data.prop$P0_feature): apply(dat==0, 1, sum)/ncol(dat) [m]
Sample sparsity (data.prop$P0_sample): apply(dat==0, 2, sum)/nrow(dat) [n]
Feature mean abundance (data.prop$mean_log2cpm): apply(dat.cpm, 1, mean, na.rm=T) [m]
Feature median abundance (data.prop$median_log2cpm): apply(dat.cpm, 1, median, na.rm=T) [m]
Feature variance (data.prop$var_log2cpm): apply(dat.cpm, 1, var) [m]
Library size (data.prop$lib_size): colSums(dat) [n]
Sample means (data.prop$sample_means): apply(dat, 2, mean) [n]
Sample correlation (data.prop$corr_sample): cor(dat, dat, method="spearman", use="na.or.complete") [n×n]
Feature correlation (data.prop$corr_feature): cor(t(dat), t(dat), method="spearman", use="na.or.complete") [m×m]

Table 5. Final data characteristics and their calculation in R.

Name of data characteristic: Calculation in R

Number of features: nrow(dat)
Number of samples: ncol(dat)
Sparsity of data set: sum(dat==0)/length(dat)
Median of data set: median(dat, na.rm=TRUE)
95th quantile: quantile(dat, probs=.95)
99th quantile: quantile(dat, probs=.99)
Mean library size: mean(colSums(dat), na.rm=T)
Median library size: median(colSums(dat), na.rm=T)
Standard deviation library size: sd(colSums(dat), na.rm=T)
Coefficient of variation of library size: sd(colSums(dat), na.rm=T)/mean(colSums(dat), na.rm=T)*100
Maximum library size: max(colSums(dat), na.rm=T)
Minimum library size: min(colSums(dat), na.rm=T)
Read depth range between samples: diff(range(colSums(dat), na.rm=T))
Mean sample richness: mean(colSums(dat>0), na.rm=T)
Spearman correlation of library size with P0 (sample): cor(data.prop$P0_sample, data.prop$lib_size, method="spearman")
Bimodality of feature correlations: bimodalIndex(matrix(data.prop$corr_feature, nrow=1))$BI
Bimodality of sample correlations: bimodalIndex(matrix(data.prop$corr_sample, nrow=1))$BI
Mean of all feature means: mean(data.prop$mean_log2cpm, na.rm=T)
SD of all feature means: sd(data.prop$mean_log2cpm, na.rm=T)
Median of all feature means: median(data.prop$median_log2cpm, na.rm=T)
SD of all feature medians: sd(data.prop$median_log2cpm, na.rm=T)
Mean of all feature variances: mean(data.prop$var_log2cpm, na.rm=T)
SD of all feature variances: sd(data.prop$var_log2cpm, na.rm=T)
Mean of all sample means: mean(data.prop$sample_means, na.rm=T)
SD of all sample means: sd(data.prop$sample_means, na.rm=T)
Mean of sample correlation matrix: mean(data.prop$corr_sample, na.rm=T)
SD of sample correlation matrix: sd(data.prop$corr_sample, na.rm=T)
Mean of feature correlation matrix: mean(data.prop$corr_feature, na.rm=T)
SD of feature correlation matrix: sd(data.prop$corr_feature, na.rm=T)
Mean-variance relation, linear component: res <- lm(y~x+I(x^2), data=data.frame(y=data.prop$var_log2cpm, x=data.prop$mean_log2cpm)); res$coefficients[2]
Mean-variance relation, quadratic component: res <- lm(y~x+I(x^2), data=data.frame(y=data.prop$var_log2cpm, x=data.prop$mean_log2cpm)); res$coefficients[3]
Slope feature sparsity vs. feature mean: res <- lm(y~slope, data=data.frame(slope=data.prop$P0_feature-1, y=data.prop$mean_log2cpm)); res$coefficients[2]
Clustering of features: coef.hclust(hcluster(dat.tmp))
Clustering of samples: coef.hclust(hcluster(t(dat.tmp)))
Sample sparsity: apply(dat==0, 2, sum)/nrow(dat)
Library sizes: colSums(dat)
Mean read depths: apply(dat, 2, mean)
Feature sparsity: apply(dat==0, 1, sum)/ncol(dat)
Feature mean intensity: apply(dat.cpm, 1, mean)
Feature median intensity: apply(dat.cpm, 1, median)
Feature variances: apply(dat.cpm, 1, var)
Sample correlations: cor(dat, dat, method="spearman")
Feature correlations: calc_feature_corr(dat)

For aim 2, 14 differential abundance (DA) tests are applied to the experimental data (ALDEx2, ANCOM-II, corncob, DESeq2, edgeR, LEfSe, limma voom (TMM), limma voom (TMMwsp), MaAsLin2, MaAsLin2 (rare), metagenomeSeq, t-test (rare), Wilcoxon test (CLR), Wilcoxon test (rare)), i.e. the outcomes (number of significant features) calculated from the experimental data sets will serve as comparator. As in Ref. 1, we analyze unfiltered data as well as data filtered with respect to features with a sufficient number of non-zero counts.

Intervention description {11a}

The intervention consists of using synthetic data instead of experimental data.

For each of the 38 experimental data sets, synthetic data will be simulated. For the simulation, two simulation tools, metaSPARSim18 and sparseDOSSA2,19 will be used. Simulation parameters are calibrated using the experimental data, such that the simulated data reflect the experimental data template. Both simulation approaches offer such a calibration functionality. Multiple (N=10) data realizations will be generated for each experimental data template to assess the impact of different realizations of simulation noise and to test for significant differences between interventions and the comparator.

For aim 1, the data characteristics will be computed for each of the synthetic and experimental data sets. For aim 2, 14 DA tests will be applied to the synthetic data generated in aim 1.

Criteria for discontinuing or modifying allocated interventions {11b}

For assessing the similarity of the synthetic data to the experimental templates, we apply equivalence tests based on two one-sided t-tests, as implemented in the TOSTER R-package, with a 5% significance level. We use the SD of the respective values from all experimental data templates as lower and upper margins. Figure 1 illustrates the equivalence testing procedure for the proportion of zeros in the whole data set as an exemplary data characteristic. For equivalence testing, the combined null hypothesis that the tested values are below the lower margin or above the upper margin has to be rejected to conclude equivalence. This only occurs when the average data characteristic of the synthetic data lies inside both margins and not too close to the two bounds, i.e. the whole 95% confidence interval of the estimated mean has to lie between both margins.
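As an illustration of this testing procedure, the following minimal R sketch performs the two one-sided tests with base-R t.test; the study itself relies on the TOSTER implementation, and the variable names (char_syn, ref, margin) are illustrative.

# Minimal sketch of the two one-sided tests (TOST) for one data characteristic.
# char_syn: values of the characteristic in the N = 10 synthetic data sets (illustrative name)
# ref:      value of the characteristic in the experimental data template
# margin:   SD of the characteristic over all experimental templates (equivalence margin)
tost_equivalence <- function(char_syn, ref, margin, alpha = 0.05) {
  p_lower <- t.test(char_syn, mu = ref - margin, alternative = "greater")$p.value  # H0: mean <= ref - margin
  p_upper <- t.test(char_syn, mu = ref + margin, alternative = "less")$p.value     # H0: mean >= ref + margin
  p_tost  <- max(p_lower, p_upper)   # both one-sided null hypotheses must be rejected
  list(p_tost = p_tost, equivalent = p_tost < alpha)
}

set.seed(1)
tost_equivalence(char_syn = rnorm(10, mean = 0.3, sd = 0.4), ref = 0, margin = 1)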


Figure 1. Illustration of assessing similarity based on an equivalence test.

The black dots indicate a data characteristic computed for the experimental data sets (here the proportion of zeros). The equivalence test requires an interval that is considered equivalent, given by the lower and upper margin bounds (dashed lines). We use the SD over all values from the experimental templates to define these margins. The values computed from the synthetic data for a template are considered equivalent if both the hypothesis that the mean lies below the lower margin and the hypothesis that it lies above the upper margin can be rejected at the prespecified significance level. Depending on the variation of the characteristic for the synthetic data (here indicated by the boxplot), the average characteristic has to lie inside a region (brown region) that is smaller than the interval between both margins.

Modification by adjusting the proportions of zeros and effect sizes

If equivalence tests fail, i.e. the synthetic data turns out to be partly unrealistic, we try to reduce the number of failed tests by adjusting two important characteristics, the proportion of zeros in the synthetic data sets, and the effect size, i.e., magnitude of differences between the two groups of samples.

Modifying the proportion of zeros will be performed by the following procedure for all synthetic data sets (a simplified R sketch is given after the list):

  • 1. If the number of rows or columns of the experimental data template does not coincide with that of the synthetic data set, randomly add or delete columns and rows in the template.

  • 2. Count the number of zeros that have to be added (or removed) for a simulated data set to obtain the same number as in the template.

  • 3. If the simulation method does not generate data with matching order of features (i.e. rows), sort all rows of both count matrices according to row means.

  • 4. Copy and replace an appropriate number of zeros (or non-zeros) one-by-one (i.e. with the same row and column indices) from the template to the synthetic data by randomly drawing those positions.

  • 5. Reorder the rows to get the original ordering.

  • 6. Check whether the total number of failed equivalence tests across all data templates is reduced.
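The core of this procedure (steps 2 and 4) can be sketched in R as follows, assuming that the template and the synthetic count matrix already have identical dimensions and row ordering (steps 1, 3 and 5); 'template' and 'synth' are illustrative names.

# Simplified sketch of the zero-matching step (steps 2 and 4): make the number of zeros
# in the synthetic matrix equal to the number of zeros in the template by copying
# randomly drawn positions from the template.
match_zero_proportion <- function(template, synth) {
  n_diff <- sum(template == 0) - sum(synth == 0)   # zeros to be added (>0) or removed (<0)
  if (n_diff > 0) {
    # positions that are zero in the template but non-zero in the synthetic data
    candidates <- which(template == 0 & synth != 0)
    picked <- candidates[sample.int(length(candidates), min(n_diff, length(candidates)))]
    synth[picked] <- 0
  } else if (n_diff < 0) {
    # positions that are non-zero in the template but zero in the synthetic data:
    # copy the template counts to these positions to remove surplus zeros
    candidates <- which(template != 0 & synth == 0)
    picked <- candidates[sample.int(length(candidates), min(-n_diff, length(candidates)))]
    synth[picked] <- template[picked]
  }
  synth
}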

Since we calibrate the simulation tools for both groups separately, all simulation parameters controlling the count distribution will differ between the two groups. We therefore anticipate that the differences between both groups might be overestimated and try to make the simulation more realistic by modifying the effect size using the following procedure for all synthetic data sets (a sketch is given after the list):

  • 1. Estimate the proportion of unregulated features from the results of all DA methods applied to the experimental data templates. This is done by the pi0est function in the qvalue R-package.

  • 2. Calibrate the simulation tool by using all samples from both groups (then there is no difference between both groups of samples) and generate a synthetic data set without considering the assignment of samples to groups.

  • 3. Replace an appropriate number of rows in the original synthetic data set by rows from the group-independent simulation.
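This adjustment can be sketched in R as follows; 'p_values' (DA p-values obtained on the experimental template), 'synth' (group-aware simulation) and 'synth_null' (group-independent simulation) are illustrative names, and only the pi0est function from the qvalue R-package is prescribed by the protocol.

library(qvalue)   # Bioconductor package providing pi0est

# Sketch of the effect-size adjustment: the estimated proportion pi0 of unregulated
# features determines how many rows (features) of the group-aware simulation are
# replaced by rows from the group-independent simulation.
adjust_effect_size <- function(p_values, synth, synth_null) {
  pi0  <- pi0est(p_values)$pi0                          # estimated proportion of unregulated features
  rows <- sample.int(nrow(synth), round(pi0 * nrow(synth)))
  synth[rows, ] <- synth_null[rows, ]                   # these features then show no group difference
  synth
}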

In addition to each individual modification, we also apply both modifications in combination. For the following analyses, we then use the synthetic data for which most data characteristics are equivalent.

Exclusion criteria

In our study, we use experimental data as templates for generating synthetic data, which are then analyzed by DA methods. At both levels, generation of synthetic data and application of DA methods, we define exclusion or modification criteria in order to handle excessive runtimes, computation errors, and unrealistic data simulations. Figure 2 shows an overview of these exclusion and modification steps.


Figure 2. Overview of the analysis workflow and the exclusion/modification strategy.

These criteria are applied to handle runtime issues, computation errors and unrealistic synthetic data.

Exclusion of simulation for a specific data template based on simulation performance

A simulation tool will be excluded for a specific data template if calibration of the simulation parameters is not feasible. We define feasibility by the following criteria:

  • 1) Calibration succeeds without error message

  • 2) The runtime of the calibration procedure is below 7 days (168 hours) for one data template

  • 3) The runtime of simulating a synthetic data set is below 1 hour for one synthetic data set

All computations in this study will be performed on a Linux Debian x86_64 compute server with 64 AMD EPYC 7452 32-core CPUs. Although we will run parts of the analyses in parallel, the specified computation times refer to runtimes on a single core.

Exclusion of simulations for a specific data template based on deviating data properties

For aim 2, we exclude synthetic data sets that are not similar enough to the experimental data sets used as templates. The goal of the following exclusion criterion is to remove synthetic data sets that are overall strongly dissimilar from the experimental data template, without being too stringent, since the simulation tools cannot perfectly reproduce all data characteristics and a slight to moderate amount of dissimilarity therefore has to be accepted. In general, dissimilarities are exploited to study the impact of data characteristics by investigating the association of such deviations with dissimilarity in the outcomes. For assessing similarity, the data characteristics described before and specified in detail in Table 5 are used.

We expect that a few data characteristics are very sensitive in discriminating experimental and synthetic data. To prevent the loss of too many data sets, such characteristics (highlighted in gray in Figure 2) are only considered for investigating the association between mismatches in outcomes and mismatches in data characteristics, but not for exclusion.

Unrealistic synthetic data will be excluded from the primary analyses of the study using the remaining data characteristics. A synthetic data set is excluded due to dissimilarity from its template if one of the following criteria applies (an R sketch of the outlier rule in criterion 2 is given after the list):

  • 1) The equivalence test based on Euclidean distance in the 2-dimensional PCA plot failed to indicate equivalence with the respective data templates. For equivalence testing, we use +/- 1 SD of the Euclidean distance over all exp. data templates as upper and lower margins.

  • 2) We apply equivalence tests to all 43 data characteristics individually. We then only consider the data characteristics which are not highly discriminative. When counting non-equivalent characteristics among the remaining ones (highlighted in brown in Figure 2) for a template, the synthetic data of templates that appear as outliers will be removed (see example in Figure 2). We use the common outlier definition from boxplots, i.e. all values whose distance to the 1st quartile (Q1) or 3rd quartile (Q3) is larger than 1.5 times the inter-quartile range Q3-Q1 are considered outliers.
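A minimal R sketch of the outlier rule in criterion 2 is given below; 'n_failed' is an illustrative numeric vector containing, for one simulation tool, the number of non-equivalent data characteristics per template.

# Sketch of the boxplot outlier rule applied to the per-template number of
# non-equivalent data characteristics; TRUE marks templates whose synthetic data
# would be excluded.
flag_outlier_templates <- function(n_failed) {
  q   <- quantile(n_failed, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  n_failed < q[1] - 1.5 * iqr | n_failed > q[2] + 1.5 * iqr
}

flag_outlier_templates(c(2, 3, 1, 2, 4, 2, 15))   # only the clearly deviating last template is flagged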

For evaluating the sensitivity with respect to exclusion, we perform an additional, secondary analysis on all synthetic data sets, regardless of similarity to the templates.

Modification of differential abundance (DA) tests

Inflated runtime

Data sets with a large number of features could lead to inflated runtimes for some statistical tests. If the runtime threshold for an individual test is exceeded for a specific data set, we split the data set, apply the test again on the subsets and afterwards merge the results. This split-and-merge procedure is repeated until the test runtime is below the threshold.

Here, we define the runtime threshold as a maximum of 1 hour per test. In a worst-case scenario, the 14 tests for the 10+1 data sets for each of the 38 templates (5852 combinations) would need about 244 days on a single core. Since we can conduct the tests on up to 64 cores, such a worst-case scenario would still be manageable.
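The split-and-merge fallback can be sketched as follows; 'run_da_test' is an illustrative wrapper that applies one DA test and returns one result row per feature, and the recursion stops once the runtime for a subset stays below the threshold.

# Sketch of the split-and-merge procedure for DA tests with excessive runtime:
# if a test exceeds the threshold, the features (rows) are split in half, the test
# is rerun on each subset and the per-feature results are merged again.
split_and_merge <- function(dat, run_da_test, max_hours = 1) {
  t0  <- Sys.time()
  res <- run_da_test(dat)
  runtime <- as.numeric(difftime(Sys.time(), t0, units = "hours"))
  if (runtime <= max_hours || nrow(dat) < 2) return(res)
  upper <- seq_len(nrow(dat)) <= nrow(dat) / 2
  rbind(split_and_merge(dat[upper, , drop = FALSE], run_da_test, max_hours),
        split_and_merge(dat[!upper, , drop = FALSE], run_da_test, max_hours))
}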

Test failure

If a DA test throws an error, we omit the number of significant features and the overlap of significant features and report them as NA (not available), as would occur in practice.

Strategies to improve adherence to interventions {11c}

n/a

Relevant concomitant care permitted or prohibited during the trial {11d}

n/a

Provisions for post-trial care {30}

n/a

Outcomes {12}

Aim 1: For each data set (experimental template and synthetic data), 43 data characteristics are calculated as described in Table 5. The difference of a data characteristic between a synthetic data set and the corresponding data template is calculated as outcome. For each data characteristic that is closer to a normal distribution on the log scale according to the p-values of the Shapiro-Wilk test, we apply a log2-transformation prior to all analyses.

Principal component analysis (PCA) is then performed on the scaled data characteristics and a two-dimensional PCA plot is generated. An additional outcome is the Euclidean distance of a synthetic data set to its template in the first two principal component coordinates. An equivalence test will be conducted on the synthetic data sets for each template to check whether data properties are maintained in synthetic data on a summary level for all data characteristics.
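The following R sketch illustrates this summary comparison; random numbers stand in for the scaled data characteristics, and the layout (38 templates, 10 synthetic data sets per template, 43 characteristics) follows the protocol.

# Sketch of the PCA-based summary comparison and the Euclidean distance outcome.
set.seed(1)
n_char <- 43
chars_exp <- matrix(rnorm(38 * n_char), nrow = 38)     # experimental templates (illustrative values)
chars_syn <- matrix(rnorm(380 * n_char), nrow = 380)   # synthetic data sets (illustrative values)
template_of <- rep(seq_len(38), each = 10)             # template index of each synthetic data set

pca    <- prcomp(rbind(chars_exp, chars_syn), scale. = TRUE)
scores <- pca$x[, 1:2]                                 # coordinates in the first two principal components
scores_exp <- scores[1:38, ]
scores_syn <- scores[-(1:38), ]

# Euclidean distance of each synthetic data set to its template in the 2D PCA plane,
# which enters the equivalence test described above
dist_to_template <- sqrt(rowSums((scores_syn - scores_exp[template_of, ])^2))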

Next, boxplots are generated, visualizing for each data characteristic how it varies between templates, between all simulation realizations, and how templates deviate from the corresponding synthetic data sets. Here, we again perform equivalence tests and also report median distances of a data characteristic between simulated and experimental data.

Aim 2: As primary outcome (aim 2a), for each experimental data template the average proportion of shared significant features across all synthetic data sets is calculated for each DA tool. For each data template, a barplot is generated as in Nearing et al.1 to visually summarize how many of the 14 DA tools identified the same feature as significantly changed. Moreover, we try to validate the conclusions from Nearing et al.1 made on this primary outcome. Overall, we extracted 13 conclusions and formulated the respective hypotheses as shown in Box 1.

Box 1. Hypotheses investigating the overlap of identified features as aim 2a extracted as conclusions from Ref. 1 including the statistical analysis to be applied.

We term the statistical analysis that estimates the proportion P of cases (e.g. the proportion of synthetic data sets) where the hypothesis is fulfilled as “Counting”. Depending on the stringency of the formulation in Ref. 1, we always define a question-specific threshold for confirmation and indicate the respective number of cases n for this evaluation in brackets. The asterisk * indicates that this number of cases might be reduced if exclusion criteria apply.

Hypothesis 1: For unfiltered data, the proportion of features jointly found as significant by limma voom TMM and limma voom TMMwsp but by less than 50% of the other methods, is larger than the overlap with more than 50% of the other methods.

Analysis: Counting (n=380*) with 95% threshold, i.e. the hypothesis is validated if the 95% CI for P lies above 95%.

Hypothesis 2: For unfiltered data, the overlap of features jointly found as significant by limma voom TMM and limma voom TMMwsp with features found by Wilcoxon CLR is larger than the overlap with all other DA methods.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 3: For unfiltered data, the Kolmogorov-Smirnov test statistic D when comparing the profile for Wilcoxon CLR and Wilcoxon is larger than for other pairs of methods on average.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 4: For unfiltered data, MaAsLin2 and MaAsLin2 (rare) have a more similar profile (larger test statistic D) than a randomly selected pair of methods.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 5: For unfiltered data, ALDEx2 and ANCOM-BC identify more features that were also identified by all except 3 (i.e. 10 out of 13) other methods.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 6: For unfiltered data, EdgeR and LEfSe identify a larger percentage of features that are not identified by any other tool, compared to the same percentage for all other methods.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 7: For unfiltered data, for corncob, metagenomeSeq, and DESeq2, there are always multiple other methods (i.e. at least 2 out of 10 other DA methods) that have a more extreme consistency profile.

Analysis: Counting (n=380*) with 95% threshold. Mean consistency is used to assess the location of the consistency profile.

Hypothesis 8: The shape of the overlap profiles is mainly determined by the exp. data set and the DA method, but only little by whether the data has been filtered.

Analysis: qq-plots of the cumulative overlap profile for filtered vs. non-filtered data are closer to the diagonal than comparisons of different DA methods and comparisons of different exp. data templates. Quantification uses the mean of the Kolmogorov-Smirnov test statistic, i.e. the average of the maximal absolute distance of the empirical cumulative distribution functions.

Hypothesis 9: For filtered data, for both limma voom approaches the proportion of identified features that are also identified by the majority of other tests is larger than for unfiltered data.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 10: For filtered data, the overlap profile of Wilcoxon CLR is bimodal.

Analysis: Counting (n=380*) with 95% threshold using bimodality index (R function BimodalIndex::bimodalIndex). Only datasets with at least 10 significant features when using Wilcoxon CLR will be considered.

Hypothesis 11: The proportion of features identified by all except one DA method is larger for prevalence-filtered data.

Analysis: Counting (n=380*) with 95% threshold.

Hypothesis 12: For filtered data, the consistency profiles of corncob, metagenomeSeq, and DESeq2 are more similar to the more extreme methods (as for Hypothesis 9 defined as the most extreme 2 profiles of other DA methods) than for unfiltered data.

Analysis: Counting (n=380*) with 95% threshold using the Kolmogorov-Smirnov test statistic of the two 2nd most extreme left- and right shifted other profiles.

Hypothesis 13: For filtered data, ALDEx2 and ANCOM-BC identify more features that were also identified by all except 3 (i.e. 10 out of 13) other methods.

Analysis: Counting (n=380*) with 95% threshold.

As secondary outcome (aim 2b), the numbers and proportions of significant features across tools and data sets are considered. This outcome is reported and visualized in a heatmap with rows representing the data templates and columns the DA tools as in Nearing et al.1 In total, two heatmaps, one for the experimental data and a second one for the simulated data, are generated. For the synthetic data sets, mean values from N=10 simulation realizations are calculated and plotted.

Moreover, we try to validate the conclusions from Nearing et al.1 regarding this secondary outcome. Overall, we extracted 14 conclusions about the number of identified features and formulated the respective hypotheses as summarized in Box 2.

Box 2. Hypotheses investigating the proportion of identified features as aim 2b extracted as conclusions from Ref. 1 including the statistical analysis to be applied.

We term the statistical analysis that estimates the proportion P of cases (e.g. the proportion of synthetic data sets) where the hypothesis is fulfilled as “Counting”. Depending on the stringency of the formulation in Ref. 1, we always define a question-specific threshold for confirmation and indicate the respective number of cases n for this evaluation in brackets. The asterisk * indicates that this number of cases might be reduced if exclusion criteria apply.

Hypothesis 1: For filtered and unfiltered data, the percentage of significant features identified by each DA method varies widely across data sets.

Analysis: Histograms of the range (= max − min) of the percentage values 1) for all data sets, and 2) for all DA methods when applied to data from each data template.

Hypothesis 2: For filtered and unfiltered data, rankings of the DA methods with respect to the proportion of identified features depend on the data template.

Analysis: A 2-way ANOVA of the rank-transformed proportions using DA method and template as grouping variables indicates significant interaction effects between both variables.

Hypothesis 3: Rankings of the DA methods with respect to the proportion of identified features depend more strongly on the data template in unfiltered data than in filtered data sets.

Analysis: A 2-way ANOVA of the ranks using DA method and template as grouping variables indicates a more significant interaction effect between both variables in unfiltered data (compared to the respective analysis for filtered data, see preceding hypothesis).

Hypothesis 4: In unfiltered data, either limma voom TMMwsp, limma voom TMM, Wilcoxon, or LEfSe identify the largest proportion of significant features.

Analysis: Counting (n=380*). Since in the original analysis the observation was seen in all analyses, we classify the hypothesis as confirmed if it is true in >95% of cases.

Hypothesis 5: For unfiltered data, there are data sets, where edgeR identifies the largest proportion of significant features.

Analysis: Counting (n=380*) with >0 threshold.

Hypothesis 6: For unfiltered data, Limma voom TMMwsp identifies the largest proportion of features in the Human-HIV data set.

Analysis: Counting (n=380*) in all synthetic data sets generated for this template with >95% threshold.

Hypothesis 7: For unfiltered data, there are data sets where both limma voom methods identify more than 99% of features as significant.

Analysis: Counting (n=380*) with >0 threshold.

Hypothesis 8: For unfiltered data, there are data sets where Wilcoxon CLR identifies more than 90% of features as significant.

Analysis: Counting (n=380*) with >0 threshold.

Hypothesis 9: For unfiltered data, there are data sets where LEfSe identifies more than 99% of features as significant.

Analysis: Counting (n=380*) with >0 threshold.

Hypothesis 10: In unfiltered data, either ALDEx2 or ANCOM-BC identify the fewest significant features.

Analysis: Counting (n=380*) with >95% threshold.

Hypothesis 11: In unfiltered data, ALDEx2, ANCOM-BC and corncob do not identify significantly more features than the most conservative tests.

Analysis: Counting (n=380*) whether DA method is within the three most conservative ones (with >95% threshold).

Hypothesis 12: No tool (except ALDEx2) identifies a smaller number of features for unfiltered data (compared to filtered data).

Analysis: Counting with >95% threshold.

Hypothesis 13: For filtered and unfiltered data, ANCOM-BC identifies the fewest significant features in total, i.e. when summing ranks of DA methods over all 38 templates.

Analysis: Counting (n=10) whether the ranksum over 38* single synthetic data simulations (randomly drawn from N=10) of all other DA methods satisfies the hypothesis (>95% threshold without confidence intervals).

Hypothesis 14: For filtered data, there is no method other than EdgeR, LEfSe, limma voom, or Wilcoxon that identifies the largest number of significant features in total, i.e. when considering ranks of DA methods over all 38 templates.

Analysis: Counting (n=10) whether the ranksum over 38* single synthetic data simulations (randomly drawn from N=10) of all other DA methods satisfies the hypothesis (<5% threshold without confidence intervals).

Participant timeline {13}

n/a

Sample size {14}

The 38 experimental data sets and the DA tests are defined by the reference study that is being validated. In our study, the only quantity we can reasonably choose is therefore the number of synthetic data sets per data template. Since there is no pilot study for the outcomes obtained from synthetic data that could be used for sample size calculations, and because a large number of outcomes are considered, the number of simulated data sets per template was chosen to be a) feasible in terms of runtime and b) large enough to enable valid conclusions. Based on both aspects, we decided to simulate N=10 synthetic data sets for each experimental data template, i.e. 380 synthetic data sets for each simulation tool.

For the one-sample equivalence tests with significance level 5% conducted for aim 1, N=10 synthetic data sets have a power of 89.75% to reject both null hypotheses, that the data characteristic is below -1 SD or above +1 SD, and thereby favor the alternative hypothesis that the characteristic is equivalent to the reference value computed for the experimental data template. For this computation, we presumed that the expected mean (i.e. the bias of a characteristic in simulated data) is 0.5 SD and that the standard deviation within synthetic data is 0.5 SD. Here, all specifications are in units of SD, which refers to the standard deviation over all exp. templates. Assuming a smaller bias or a smaller variability for the synthetic data increases the expected power. These calculations were conducted with nQuery version 8.7.2.0, equivalence test for one mean (MOE4-1).
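The stated power can also be checked by simulation under the same assumptions (bias of 0.5 SD, within-simulation SD of 0.5, margins at -1 and +1 SD, N=10, 5% significance level); the following base-R sketch is an illustration, not the prespecified nQuery calculation.

# Simulation-based check of the power of the one-sample equivalence test.
set.seed(1)
power_tost <- mean(replicate(100000, {
  x <- rnorm(10, mean = 0.5, sd = 0.5)                                  # bias 0.5 SD, within-SD 0.5
  p_lower <- t.test(x, mu = -1, alternative = "greater")$p.value        # H0: mean <= -1 SD
  p_upper <- t.test(x, mu = +1, alternative = "less")$p.value           # H0: mean >= +1 SD
  max(p_lower, p_upper) < 0.05                                          # equivalence concluded
}))
power_tost   # close to the reported 89.75%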

For the proportions P of fulfilled hypotheses that are estimated for most hypotheses in aim 2, 95% Clopper-Pearson confidence intervals (R-package DescTools) are calculated. As an example, n=380 independent samples lead to P = 0.026, 95%-CI = [0.013, 0.047] if the tested hypothesis is violated for 10 out of 380 synthetic data sets. This sample size is available for 20/27 hypotheses. However, it should be noted that such sample size considerations are limited by the fact that the data characteristics and results for synthetic data sets from the same template are likely to be similar to each other, and it is therefore not permissible to consider all samples as independent.
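For illustration, the exact interval of this example can be reproduced with base R; binom.test yields the same Clopper-Pearson interval as the DescTools implementation.

# Exact (Clopper-Pearson) 95% CI for 10 violations among n = 380 independent synthetic data sets
binom.test(10, 380)$conf.int   # approximately [0.013, 0.047]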

Recruitment {15}

n/a

Assignment of interventions: allocation

Sequence generation {16a}

The intervention and the comparator can be evaluated for all data sets, and the order of the computations has no impact. Therefore, random allocation and sequence generation are not required.

Concealment mechanism {16b}

n/a

Implementation {16c}

n/a

Assignment of interventions: Blinding

Who will be blinded {17a}

No blinding procedure will be applied because we see no risk of bias by a non-blinded calculation of data characteristics and statistical tests.

Procedure for unblinding if needed {17b}

n/a: No blinding performed.

Data collection and management

Plans for assessment and collection of outcomes {18a}

The analyses will be conducted by two experienced statisticians/bioinformaticians (E.K., C.K.), both with more than 5 years of experience in differential abundance analyses. All statistical tests will adhere to the methodology outlined by Nearing et al.1 To this end, we choose the same configurations of the statistical methods as implemented in the code from Nearing et al.1 for running the statistical tests, provided on GitHub (https://github.com/nearinj/Comparison_of_DA_microbiome_methods). This strategy ensures comparability between the outcomes of the reference study and our validation study.

Plans to promote participant retention and complete follow-up {18b}

n/a

Data management {19}

The 38 experimental data sets were downloaded from https://figshare.com/articles/dataset/16S_rRNA_Microbiome_Datasets/14531724 on February 9, 2024. There, Nearing et al.1 made the data sets from their study available; we therefore incorporate exactly the same data sets. We keep a local copy of these data in our Fredato research data management system https://nxc-fredato.imbi.uni-freiburg.de until December 31, 2030 and will make it available if the original data is no longer available in the current version and if this does not violate legal, data protection, or copyright regulations. Generated data, analysis scripts, results and supplemental information for this study will also be stored in the Fredato research data management system. In case of unexpected technical limitations, we will make data, analysis scripts and supplemental information available via https://figshare.com.

Confidentiality {27}

n/a: The experimental data is publicly available. The data generated in this project is not subject to data protection regulations.

Plans for collection, laboratory evaluation and storage of biological specimens for genetic or molecular analysis in this trial/future use {33}

n/a

Statistical methods {20}

Data monitoring committee {21a}

n/a: The data has already been collected and is publicly available.

Statistical methods for primary and secondary outcomes {20a}

Aim 1: For each data set (experimental and synthetic), a set of 43 data characteristics is calculated. All data characteristics are defined as a single number. These calculations are described in more detail in Table 5. As N=10 simulation realizations are generated, there will be 10 values for each data characteristic per experimental data template. For the primary outcome, a PCA plot based on the scaled data characteristics is generated. Moreover, equivalence tests are applied to all 43 data characteristics as well as to the Euclidean distance in the two principal component coordinates, as described above (section “Interventions”). Based on these equivalence tests, we test a single hypothesis (equivalence of synthetic and experimental data), which is fulfilled in strict terms only when all equivalence tests are significant. We therefore do not have to control the probability of a single false positive test (i.e. the so-called family-wise error rate), and multiple testing adjustments do not apply to these tests.

Aim 2: All DA methods will be applied to experimental and synthetic data adhering to the methodology in Nearing et al.1 as described in the methods section of the paper. Significant features will be identified using a 0.05 threshold for the multiple testing adjusted p-values (Benjamini-Hochberg).

For the primary outcome, it is determined how many tests jointly identify features as significant for each data set. After visualization, the 13 hypotheses extracted from Ref. 1 will be investigated using the statistical analyses summarized in Box 1. Estimates of the target values and the respective 95% confidence intervals are used to validate the hypotheses because these values are easier to interpret than p-values and because the significance of p-values is strongly determined by the number of cases.

For the secondary outcome, the number and proportion of significant features are extracted for each data set and test individually. After visualization, the 14 hypotheses extracted from Ref. 1 will be investigated using the statistical analyses summarized in Box 2.

Confidence intervals for the estimated proportions of cases where the hypothesis is fulfilled will be calculated as exact intervals, i.e. using Clopper-Pearson intervals (R-package DescTools). If the analyzed cases are statistically dependent, we compute hierarchical bootstrap confidence intervals by first drawing templates with replacement and then synthetic data sets within each template.
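The hierarchical bootstrap can be sketched in R as follows; 'fulfilled' is an illustrative logical matrix (38 templates x 10 synthetic data sets per template) indicating whether the hypothesis is fulfilled in each case.

# Sketch of the hierarchical bootstrap: draw templates with replacement first,
# then resample the synthetic data sets within each drawn template.
set.seed(1)
fulfilled <- matrix(runif(38 * 10) < 0.97, nrow = 38)   # illustrative data

boot_props <- replicate(10000, {
  tpl <- sample.int(nrow(fulfilled), replace = TRUE)                   # resample templates
  mean(apply(fulfilled[tpl, , drop = FALSE], 1,
             function(row) sample(row, replace = TRUE)))               # resample within each template
})
quantile(boot_props, c(0.025, 0.975))   # percentile-based 95% confidence interval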

These analyses are conducted for unfiltered data and for filtered data. As in Ref. 1, filtered means that features found in fewer than 10% of samples are removed. Moreover, to analyze the sensitivity of our outcomes with respect to our exclusion criteria, filtered and unfiltered data are also analyzed without applying the criteria that exclude non-realistic simulations.

In case we find different results for the simulated data for some hypotheses, we will analyze the association of the mismatch in the outcome with the mismatch of data characteristics to identify data characteristics that could be responsible for the disagreement. To ensure independence of the scales, we will perform these analyses after rank transformations. We will use univariate analyses (i.e. Spearman correlation) as well as a forward selection with a 5% cut-off criterion for p-values.

Interim analyses {21b}

n/a: There will be no interim analyses in this study.

Methods for additional analyses (e.g. subgroup analyses) {20b}

n/a: In this study there will be no subgroup analyses.

Methods in analysis to handle protocol non-adherence and any statistical methods to handle missing data {20c}

n/a: In this study there is no missing data.

Plans to give access to the full protocol, participant level-data and statistical code {31c}

Generated data, analysis scripts and supplemental information to this study will also be stored in the Fredato research data management. In case of unexpected technical limitations, we will make data, analysis scripts and supplemental information available via https://figshare.com.

Timeline

There is no timeline for data collection, as the study is based on existing data. Conducting the study does not depend on any clinical parameters. The anticipated timeline for completing the study is 3 to 4 months.

Oversight and monitoring

Composition of the coordinating centre and trial steering committee {5d}

n/a

Composition of the data monitoring committee, its role and reporting structure {21a}

n/a

Adverse event reporting and harms {22}

n/a

Ancillary and post-trial care {30}

n/a

Frequency and plans for auditing trial conduct {23}

n/a

Plans for communicating important protocol amendments to relevant parties (e.g. trial participants, ethical committees) {25}

n/a

Dissemination policy {31a}

Public access of the generated data, analysis scripts, results and supplemental information is granted as indicated above on our Fredato research data management system https://nxc-fredato.imbi.uni-freiburg.de. The results will be published in a peer-reviewed scientific journal, preferably in the same journal as this protocol.

Discussion

Synthetic data is a valuable tool in benchmark studies as it allows for controlled manipulation of data properties and evaluations based on known underlying truths. We utilize this potential to validate previous findings that were derived exclusively from experimental data. However, it is first critical to determine the extent to which synthetic data can faithfully reflect real experimental data. This may necessitate adjustments to simulation tools to ensure their capability to accurately mimic experimental conditions. Consequently, this study not only investigates the potential limitations of synthetic data in validating experimental outcomes but also sheds light on the general capacity of simulation tools to faithfully represent real-world data. Since unsuccessful validation indicates that the previously published findings do not generalize to synthetic data, we try to identify the responsible inadequacies in the simulation tools. However, this aspect is limited to the data characteristics of our study, and there might be additional data properties that require consideration to explain a possible mismatch. Finally, this benchmark study employs rigorous statistical methodology and, for the first time, publishes a study protocol in advance, adhering to established protocol guidelines. Moreover, it is the first benchmark study that primarily focuses on validation of the results of a previous study.

Abbreviations

ANCOM

Analysis of compositions of microbiomes with bias correction, a DA method

ANOVA

analysis of variance

DA

differential abundance

Fredato

Freiburg research data management tool

LEfSe

Linear discriminant analysis Effect Size, a DA method

MaAsLin

Microbiome Multivariable Association with Linear Models, a DA method

metaSPARSim

acronym for “a sparse count matrix simulator intended for usage in development of 16S rDNA-seq metagenomic data processing pipelines”18

pKS

p-values of a Kolmogorov-Smirnov test

PCA

principal component analysis

R

a script-based statistical programming language

sparseDOSSA2

abbreviation for “Sparse Data Observations for Simulating Synthetic Abundance”, a simulation tool for microbial count data based on a hierarchical model19

TMM

trimmed mean of M values, an approach for data normalization

TMMwsp

TMM with singleton pairing, an approach for data normalization

Declarations

Authors’ contributions {31b}

E.K. and C.K. equally contributed to the design and development of the study.

Ethics and consent {24}

Ethical approval and consent were not required.

Consent for publication {32}

n/a

Authors’ information (optional)

E.K.: Statistical data analyst and PhD student in the lab of Dr. Clemens Kreutz. Research topics cover robust data analysis of high-throughput data with a focus on microbiome sequencing data.

C.K.: Group leader of the lab for Methods in Systems Biomedicine at the Institute of Medical Biometry and Statistics of the University Medical Center Freiburg. Research topics cover mathematical modelling for systems biomedicine and neutral benchmark studies.

Both authors have neither been involved in the development of any of the applied simulation and DA methods nor in the reference study.1

Study status

The study status and the version history of this protocol are summarized in Table 3.

