Research Article

Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data 

[version 1; peer review: 2 approved with reservations]
PUBLISHED 25 Jun 2025


Abstract

Background

Synthetic data’s utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results from experimental data. Building on Nearing et al.’s study (1), which assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we generated synthetic datasets to verify these findings. We rigorously assessed the similarity between synthetic and experimental data and validated the conclusions on the performance of these tests drawn by Nearing et al. (1). This study adheres to the study protocol: Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data (2).

Methods

We replicated Nearing et al.’s (1) methodology, incorporating simulated data using two distinct tools (metaSPARSim (3) and sparseDOSSA2 (4)), mirroring the 38 experimental datasets. Equivalence tests were conducted on a set of 30 data characteristics (DC) comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests were applied to synthetic datasets, evaluating the consistency of significant feature identification and the proportion of significant features per tool. Correlation analysis, multiple regression and decision trees were used to explore how differences between synthetic and experimental DCs may affect the results.

Conclusions

Adhering to a formal study protocol in computational benchmarking studies is crucial for ensuring transparency and minimizing bias, though it comes with challenges, including significant effort required for planning, execution, and documentation. In this study, metaSPARSim (3) and sparseDOSSA2 (4) successfully generated synthetic data mirroring the experimental templates, validating trends in differential abundance tests. Of 27 hypotheses tested, 6 were fully validated, with similar trends for 37%. While hypothesis testing remains challenging, especially when translating qualitative observations from text into testable hypotheses, synthetic data for validation and benchmarking shows great promise for future research.

Keywords

Benchmarking study, differential abundance, microbiome, omics, preregistered study, sequencing data, statistical analysis

Introduction

Differential abundance (DA) analysis of metagenomic microbiome data has emerged as a pivotal tool in understanding the complex dynamics of microbial communities across various environments and host organisms.57 Microbiome studies are crucial for identifying specific microorganisms that differ significantly in abundance between conditions such as health and disease states, different environments, or before and after a treatment. The insights we gain from analyzing the differential abundance of microorganisms are critical to understanding the role that microbial communities play in environmental adaptations, disease development and health of the host.8 Refining statistical methods for the identification of changes in microbial abundance is essential for understanding how these communities influence disease progression and other interactions with the host, which then enables new strategies for therapeutic interventions and diagnostic analyses.9

The statistical interpretation of microbiome data is notably challenged by its inherent sparsity and compositional nature. Sparsity refers to the disproportionately high proportion of zeros in metagenomic sequencing data and requires tailored statistical methods,10,11 e.g. to account for so-called structural zeros that originate from technical limitations rather than from real absence.12 Additionally, due to the compositional aspect of microbiome data, regulation of highly abundant microbes can lead to biased quantification of low-abundant organisms.13 Such bias might be erroneously interpreted as apparent regulation that is mainly due to the compositional character of the data. Such characteristics of microbiome data have a notable impact on the performance of common statistical approaches for DA analysis, limit their applicability to microbiome data and pose challenges for the optimal selection of DA tests.

A number of benchmark studies have been conducted to evaluate the performance of DA tests in the analysis of microbiome data.1417 However, the results show a very heterogeneous picture and clear guidelines or rules for the appropriate selection of DA tests have yet to be established. In order to assess and contextualize the findings of those studies, additional benchmarking efforts using a rigorous methodology,18,19 as well as further experimental and synthetic benchmark datasets are essential.

Synthetic data is frequently utilized to evaluate the performance of computational methods because for such simulated data the ‘correct’ or ‘true’ answer is known and can be used to assess whether a specific method can recover this known truth.18 Moreover, characteristics of the data can be changed to explore the relationship between DCs such as effect size, variability or sample size and the performance of the considered methods. Several simulation tools have been introduced for generating synthetic microbiome data.3,4,2023 They cover a broad range of functionality and partly allow calibration of simulation parameters based on experimental data templates. For example, MB-GAN22 leverages generative adversarial networks to capture complex patterns and interactions present in the data, while metaSPARSim,3 sparseDOSSA24 or nuMetaSim24 employ different statistical models to generate microbiome data. Introducing a new simulation tool typically involves demonstrating its capacity to replicate key DCs. Nonetheless, it remains an open question whether findings derived from experimental data can be validated when synthetic data, generated to embody the characteristics of the experimental data, is used in its place.

Here, we refer to the recent high-impact benchmark study of Nearing et al.1 in which the performance of a comprehensive set of 14 DA tests applied to 38 experimental 16S microbiome datasets was compared. This 16S microbiome sequencing data is used to study communities in various environments, here from human gut, soil, wastewater, freshwater, plastisphere, marine and built environments. The datasets are presented in a two-group design for which DA tools are applied to identify variations in species abundances between the groups.

Aim of the study

In this validation study we replicated the primary analysis conducted in the reference study by substituting the actual datasets with corresponding synthetic counterparts and using the most recent version of the DA methods. The objective was to explore the validity of the main findings from the reference benchmark study when the analysis workflow is repeated with an independent implementation and with a larger sample size of synthetic data, generated to recapitulate typical characteristics of the original real data.

Aim 1: Synthetic data, simulated based on an experimental template, overall reflect main data characteristics.

Aim 2: Study results from Nearing et al. can be validated using synthetic data, simulated based on corresponding experimental data.

Table 1 provides more detail about the specific hypotheses and statistical analyses used for evaluation.

Table 1. Summary of hypotheses and the statistical analyses used for evaluation.

Aim 1
Research question: Can state-of-the-art simulation tools for 16S rRNA sequencing data realistically generate synthetic data across a broad range of simulation templates?
Hypothesis: Main data characteristics calculated from synthetic data are equivalent to experimental templates.
Statistical analyses: Equivalence tests, i.e. two one-sided one-sample t-tests for each data characteristic as implemented in the TOSTER R-package; PCA of all data characteristics and equivalence test for Euclidean distances in 2D.
Confirmation criteria: We interpreted a p-value < 0.05 for rejecting the null hypothesis “non-equivalence” as significant and then concluded that the respective data characteristic is equivalent.

Aim 2
Research question: Can conclusions based on performance outcomes (proportion of significant taxa and overlap across DA tests) from 16S microbiome sequencing data be validated with synthetic data, simulated after calibration based on the used experimental data?
Hypotheses: Hypothesis 1: 13 extracted outcomes from (1) concerning the overlap of significant features across experimental datasets and DA tests can be confirmed based on their corresponding simulations. Hypothesis 2: 14 extracted outcomes from (1) concerning the proportion of significant features identified across multiple DA tools can be confirmed based on their corresponding simulations.
Statistical analyses: For 23/27 hypotheses: estimation of the proportion P of cases in which the hypothesis was fulfilled by counting, and calculation of 95% confidence intervals, a) for independent observations based on the SE formula, b) for dependent observations using bootstrap. For 2/27 hypotheses: two-way ANOVA. For 1/27 hypotheses: mean of the Kolmogorov-Smirnov test statistic. For 1/27 hypotheses: visualization by histograms.
Confirmation criteria: For each hypothesis, we specified individual confirmation thresholds. In 18/27 cases, we used a 95% threshold as criterion for the estimated proportion of cases in which the hypothesis was valid; we checked whether these criteria were fulfilled by considering the 95% CI. For ANOVA and Kolmogorov-Smirnov tests, we also specified individual confirmation criteria.

Methods

Study setting

While designing a benchmark study for assessing sensitivities and specificities of DA methods using simulated data, we recognized the need to first assess the feasibility of generating synthetic data that realistically resemble all characteristics of experimental data, to ensure the validity of conclusions drawn from simulated data. This led us to develop the validation study presented here, whose primary goal was to compare the results based on synthetic data to those from the reference study. This study will be followed by a subsequent benchmark study, in which the known truth in the simulated datasets will be used for performance testing, the dependence on characteristics such as effect size, sample size etc. will be systematically evaluated, and all recently published DA methods will be considered.

Where possible, this study was conducted analogously to the benchmark study conducted by Nearing et al.,1 e.g. the same data and primary outcomes were used.

All datasets as provided by Nearing et al.1 were included in the study. A summary for these datasets can be found in Supplemental Text 3. We employed two published simulation tools, metaSPARSim3 and sparseDOSSA2,4 which have been developed for simulating microbial abundance profiles as they are generated by 16S sequencing.

We also applied the same DA tests as in the reference study1 and their implementations in the R statistical programming language. In order to provide the most valuable results for the bioinformatics community, the latest versions of these implementations were used.

Study design and workflow

The entire workflow of the study is summarized in Figure 1A. This includes the generation of synthetic data and the corresponding adaptations or exclusion criteria, the characterization of datasets, strategies to compare synthetic and experimental datasets (outcome of aim 1), strategies to choose the best simulation pipeline for aim 2, the application of differential abundance tests with the corresponding exclusion and modification strategies, hypothesis testing to compare the results of differential abundance tests on synthetic data with the findings of the reference study (outcome of aim 2), and lastly the exploratory analyses ‘DCs and discrepancies for DA tests’.

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure1.gif

Figure 1. Overview of the analysis workflow.

A: Overview of the analysis workflow and the exclusion/modification strategy. The depicted strategy was applied to handle runtime issues, computation errors and unrealistic simulated data. B: Illustration of the visual evaluation of the DCs. The left sections of the boxplots show the difference of a data characteristic between template pairs, serving as a measure of the natural variability. The middle sections show for each template the difference of the DC to the 10 simulated datasets, serving as a measure of precision of data simulation. The right sections display the difference of a DC within the simulated data for one template, serving as a measure of variability due to simulation noise. C: Illustration of assessing DC similarity based on an equivalence test. The black dots indicate a data characteristic computed for experimental datasets (e.g. the proportion of zeros). An equivalence test requires an interval that is considered as equivalent, given by lower and upper margin bounds (dashed lines). We use the SD over all values from experimental templates to define these margins. The values computed from the synthetic data for a template are considered as equivalent if values below the lower margin and above the upper margin can be rejected according to the prespecified significance level. Depending on the variation of the characteristic for the synthetic data (here indicated by the boxplot), the average characteristic has to be inside a region (brown region) that is smaller than the interval between both margins. D: Illustration of the primary analyses conducted for assessing the agreement of the 13 DA tests via overlap profiles. For each count matrix, the overlap within the significant taxa is counted, shown as numbers on the left, as an image in the middle and as a barplot (“overlap profile”) on the right. These overlap profiles can be compared and aggregated at different levels. In the reference study, they were calculated and interpreted at the aggregated level over all taxa of experimental datasets. In practice, however, the agreement obtained for a single experimental count matrix matters. We therefore evaluated how frequently different hypotheses were fulfilled at the unaggregated level.

Step 1: Intervention/Data simulation

According to standardized study design guidelines, we had to define interventions and a comparator. In these terms, data generation using simulation tools is the intervention that is compared with experimental data. When defining the interventions, we had to balance the complexity and runtime of the study with the requirements of conducting the data simulation as realistically as possible and with a sufficient sample size. Due to the complexity of our study, the huge expected computational demands, and the fact that one key aim of this study was also to introduce and emphasize the importance of formulating a study protocol specifically for computational research, we decided to restrict our study to two published simulation approaches for 16S rRNA sequencing data.

For each of the 38 experimental datasets, synthetic data was simulated using metaSPARSim3 version 1.1.2 and sparseDOSSA24 version 0.99.2 as simulation tools. Simulation parameters were calibrated using the experimental data, such that the simulated data reflect the experimental data template. Both simulation approaches offer such a calibration functionality. Multiple (N = 10) data realizations were generated for each experimental data template to assess the impact of different realizations of simulation noise and to test for significant differences between interventions (in our study the simulated data) and the comparator (in our study the experimental data templates). The major calibration and simulation functions were called with the following options:


# metaSPARSim calibration:
params <- metaSPARSim::estimate_parameter_from_data(raw_data = counts,
  norm_data = norm_data, conditions = conditions,
  intensity_func = "mean", keep_zeros = TRUE)

# metaSPARSim simulation:
simData <- metaSPARSim(params)

# SparseDOSSA2 calibration:
fit <- SparseDOSSA2::fit_SparseDOSSA2(data = counts,
  control = list(verbose = TRUE, debug_dir = "./"))

# SparseDOSSA2 simulation:
simResult <- SparseDOSSA2::SparseDOSSA2(template = fit, new_features = FALSE,
  n_sample = n_sample, n_feature = n_feature)

To account for the two-group design of the data, calibration and simulation were conducted for each group independently. Specifically, the datasets were first split into two groups of samples according to the metadata. Then, calibration and simulation was performed for each group, and finally, the two simulated count matrices were merged.
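This per-group strategy could be sketched as follows (a minimal sketch, assuming a taxa-by-samples count matrix counts, a vector group with the two group labels from the metadata, and a hypothetical wrapper calibrate_and_simulate() around the calibration and simulation calls shown above):

# Minimal sketch of the two-group simulation strategy.
simulate_two_groups <- function(counts, group, calibrate_and_simulate) {
  idx_per_group <- split(seq_len(ncol(counts)), group)        # sample indices per group
  sims <- lapply(idx_per_group, function(idx)
    calibrate_and_simulate(counts[, idx, drop = FALSE]))      # calibrate and simulate per group
  do.call(cbind, sims)                                        # merge the two simulated count matrices
}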

In addition to this default simulation pipeline, three adaptations were performed to control two important data properties, namely sparsity and effect size, resulting in four simulation pipelines:

Pipeline 1 – Default simulation

This pipeline refers to the above described simulation procedure.

Pipeline 2 – Adaptation of sparsity (proportion of zeros)

The proportion of zeros was modified by the following procedure for all synthetic datasets (a sketch of the core replacement step is given after the list):

  • 1. If the number of rows or columns of the experimental data template did not coincide with those of the simulated dataset, columns and rows were randomly added or deleted in the template.

  • 2. Count the number of zeros that have to be added (or removed) for a simulated dataset to obtain the same number as in the template.

  • 3. If the simulation method does not generate data with matching order of features (i.e. rows), sort all rows of both count matrices according to row means. Copy and replace an appropriate number of zeros (or non-zeros) one-by-one (i.e. with the same row and column indices) from the template to the synthetic data by randomly drawing those positions.

  • 4. Reorder the rows to get the original ordering.

  • 5. Check whether the total number of failed equivalence tests across all data templates is reduced.
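The core replacement step could be sketched as follows (a minimal sketch for the case where zeros have to be added; the analogous replacement with non-zeros applies when zeros have to be removed; template and sim are assumed to be count matrices of identical dimension with comparably ordered rows):

# Minimal sketch of the sparsity adjustment (adding zeros at template positions).
match_sparsity <- function(template, sim) {
  n_add <- sum(template == 0) - sum(sim == 0)            # zeros still missing in the simulation
  if (n_add > 0) {
    candidates <- which(template == 0 & sim != 0)        # positions that are zero only in the template
    chosen <- candidates[sample.int(length(candidates), min(n_add, length(candidates)))]
    sim[chosen] <- 0                                     # copy zeros one-by-one at the same positions
  }
  sim
}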

Pipeline 3 – Adaptation of effect size

Since we calibrated the simulation tools for both groups separately, all simulation parameters controlling the count distribution differed between the two groups. We therefore anticipated that the differences between both groups might be overestimated and tried to make the simulation more realistic by modifying the effect size with the following procedure for all synthetic datasets (a sketch of the replacement step is given after the list):

  • 1. Estimated the proportion of unregulated features from the results of all DA methods applied to the experimental data templates. This was done by the pi0est function in the qvalue R-package.

  • 2. In addition to the calibrations within the two groups of samples, the simulation tool was calibrated using all samples from both groups, and a synthetic dataset was generated without considering the assignment of samples to groups. This dataset therefore had no differences between the two groups.

  • 3. Replaced an appropriate number of rows in the original synthetic data with group differences by rows from the group-independent simulation that lack such differences. To ensure that rows with significant regulation were less frequently replaced, we randomly selected the rows to be replaced by taking the FDR into account. Specifically, we used the FDR as a proxy for the probability that a taxon is unregulated, i.e. differentially abundant taxa are drawn via

isDiffAbundant <- runif(n = length(pvalues)) > p.adjust(pvalues, method = "BH")
to ensure that taxa with smaller p-values were more likely assigned to be differentially abundant.
  • 4. Checked whether the total number of failed equivalence tests (Step 3 – equivalence tests) across all data templates was reduced.
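The replacement step could be sketched as follows (a minimal sketch, assuming sim_grouped and sim_pooled are the simulated count matrices obtained with and without the group assignment, with identical dimensions and row order, and pvalues holds the DA p-values obtained from the experimental template):

# Minimal sketch of the effect-size adjustment: taxa not drawn as differentially
# abundant are replaced by their group-independent simulation counterparts.
isDiffAbundant <- runif(n = length(pvalues)) > p.adjust(pvalues, method = "BH")
sim_adjusted <- sim_grouped
sim_adjusted[!isDiffAbundant, ] <- sim_pooled[!isDiffAbundant, ]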

Pipeline 4 – Adaptation of sparsity and effect size

Both above described modifications were applied.

Exclusion of simulation for a specific data template based on simulation performance

A simulation tool was excluded for a specific data template, if calibration of the simulation parameters was not feasible. We defined feasibility by the following criteria:

  • 1. Calibration succeeded without error message.

  • 2. The runtime of the calibration procedure was below 7 days (168 hours) for one data template.

  • 3. The runtime of simulating a synthetic dataset was below 1 hour for one synthetic dataset.

All computations in this study were performed on a Linux Debian x86_64 compute server with 64 AMD EPYC 7452 48-Core Processor CPUs. Although parts of the analyses were run in parallel mode, the specified computation times refer to runtimes on a single core.

Step 2: Data characterization

To characterize and compare datasets for aim 1, 40 DCs were calculated for each template and simulation. These characteristics were chosen such that they provide a comprehensive description of count matrices and enable an unbiased comparison between experimental and synthetic datasets. They cover for example information about the count distribution, sparsity in a dataset, mean-variance trends of features (taxa), or effect sizes between groups of samples. Tables 2 and 3 provide a detailed summary of all data characteristics and how they were calculated. To prevent overrepresentation of specific distributional aspects in assessing similarity between experimental and simulated data, we excluded redundant DCs. This was achieved by calculating rank correlations for all DC pairs across all datasets (experimental and simulated) and iteratively eliminating characteristics with a rank correlation ≥ 0.95.
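A minimal sketch of this redundancy filter, assuming dc is a datasets-by-characteristics matrix of all DC values, could look as follows:

# Minimal sketch: iteratively drop one DC of the most highly correlated pair
# until no pair of DCs reaches the rank-correlation cutoff.
drop_redundant_dcs <- function(dc, cutoff = 0.95) {
  repeat {
    rho <- cor(dc, method = "spearman", use = "pairwise.complete.obs")
    diag(rho) <- 0
    if (max(rho, na.rm = TRUE) < cutoff) break
    worst <- which(rho == max(rho, na.rm = TRUE), arr.ind = TRUE)[1, 1]
    dc <- dc[, -worst, drop = FALSE]
  }
  dc
}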

Table 2. Calculation of DCs in R. DCs that are vectors or matrices.

Name of data characteristic | Name in matrix (data.prop) summarizing all data characteristics | Calculation in R | Dimension
Counts per million normalized and log transformed data | dat.cpm | edgeR::cpm(dat, log=TRUE, prior.count=1) | m x n
Feature sparsity | data.prop$P0_feature | apply(dat==0, 1, sum)/ncol(dat) | m
Sample sparsity | data.prop$P0_sample | apply(dat==0, 2, sum)/nrow(dat) | n
Feature mean abundance | data.prop$mean_log2cpm | apply(dat.cpm, 1, mean, na.rm=TRUE) | m
Feature median abundance | data.prop$median_log2cpm | apply(dat.cpm, 1, median, na.rm=TRUE) | m
Feature variance | data.prop$var_log2cpm | apply(dat.cpm, 1, var) | m
Library size | data.prop$lib_size | colSums(dat) | n
Sample means | data.prop$sample_means | apply(dat, 2, mean) | n
Sample correlation | data.prop$corr_sample | cor(dat, dat, method="spearman", use="na.or.complete") | n x n
Feature correlation | data.prop$corr_feature | cor(t(dat), t(dat), method="spearman", use="na.or.complete") | m x m

Table 3. Final single-value data characteristics and their calculation in R.

Name of data characteristic | Calculation in R
Number of features | nrow(dat)
Number of samples | ncol(dat)
Sparsity of dataset | sum(dat==0)/length(dat)
Median of dataset | median(dat, na.rm=TRUE)
95th Quantile | quantile(dat, probs=.95)
99th Quantile | quantile(dat, probs=.99)
Mean library size | mean(colSums(dat), na.rm=TRUE)
Median library size | median(colSums(dat), na.rm=TRUE)
Standard deviation library size | sd(colSums(dat), na.rm=TRUE)
Coefficient of variation of library size | sd(colSums(dat), na.rm=TRUE)/mean(colSums(dat), na.rm=TRUE)*100
Maximum library size | max(colSums(dat), na.rm=TRUE)
Minimum library size | min(colSums(dat), na.rm=TRUE)
Read depth range between samples | diff(range(colSums(dat), na.rm=TRUE))
Mean sample richness | mean(colSums(dat>0), na.rm=TRUE)
Spearman correlation of library size with P0 (sample) | cor(data.prop$P0_sample, data.prop$lib_size, method="spearman")
Bimodality of feature correlations | bimodalIndex(matrix(data.prop$corr_feature, nrow=1))$BI
Bimodality of sample correlations | bimodalIndex(matrix(data.prop$corr_sample, nrow=1))$BI
Mean of all feature means | mean(data.prop$mean_log2cpm, na.rm=TRUE)
SD of all feature means | sd(data.prop$mean_log2cpm, na.rm=TRUE)
Median of all feature medians | median(data.prop$median_log2cpm, na.rm=TRUE)
SD of all feature medians | sd(data.prop$median_log2cpm, na.rm=TRUE)
Mean of all feature variances | mean(data.prop$var_log2cpm, na.rm=TRUE)
SD of all feature variances | sd(data.prop$var_log2cpm, na.rm=TRUE)
Mean of all sample means | mean(data.prop$sample_means, na.rm=TRUE)
SD of all sample means | sd(data.prop$sample_means, na.rm=TRUE)
Mean of sample correlation matrix | mean(data.prop$corr_sample, na.rm=TRUE)
SD of sample correlation matrix | sd(data.prop$corr_sample, na.rm=TRUE)
Mean of feature correlation matrix | mean(data.prop$corr_feature, na.rm=TRUE)
SD of feature correlation matrix | sd(data.prop$corr_feature, na.rm=TRUE)
Mean-Variance relation: Linear component | res <- lm(y ~ x + I(x^2), data=data.frame(y=data.prop$var_log2cpm, x=data.prop$mean_log2cpm)); res$coefficients[2]
Mean-Variance relation: Quadratic component | res <- lm(y ~ x + I(x^2), data=data.frame(y=data.prop$var_log2cpm, x=data.prop$mean_log2cpm)); res$coefficients[3]
Slope feature sparsity vs. feature mean | res <- lm(y ~ slope, data=data.frame(slope=data.prop$P0_feature-1, y=data.prop$mean_log2cpm)); res$coefficients[2]
Clustering of features | coef.hclust(hcluster(dat.tmp))
Clustering of samples | coef.hclust(hcluster(t(dat.tmp)))
Mean inverse Simpson diversity | mean(vegan::diversity(dat, index="invsimpson"), na.rm=TRUE)
Mean Pielou evenness | shannon_div <- vegan::diversity(count.data, index="shannon"); richness <- apply(count.data, 1, function(x) sum(x > 0, na.rm=TRUE)); pilou <- shannon_div/log(richness); mean(pilou, na.rm=TRUE)
Mean Bray-Curtis dissimilarity | mean(vegan::vegdist(dat, method="bray"), na.rm=TRUE)
Permanova_R2 | aitchison <- parallelDist::parallelDist(x=t(pseudocount_and_clr(dat)), method="euclidean", threads=30); aitchison_formula <- as.formula(aitchison ~ meta.dat[,1]); permanova <- data.frame(vegan::adonis2(formula=aitchison_formula, permutations=999)); data.prop[["Permanova_R2"]] <- as.numeric(permanova[1, "R2"])
Permanova_P-value | data.prop[["Permanova_P-value"]] <- as.numeric(permanova[1, "Pr..F."])
SD_smallRowMeans | q10 <- quantile(data.prop[["mean_log2cpm"]], probs=0.1); small <- data.prop[["mean_log2cpm"]] < q10; data.prop[["SD_smallRowMeans"]] <- mean(data.prop[["var_log2cpm"]][small], na.rm=TRUE)/mean(data.prop[["var_log2cpm"]][!small], na.rm=TRUE)  # "small": the 10% of features with the smallest row mean

Step 3: Assessing similarity of synthetic and experimental data – Outcome aim 1

After calculating the DCs as described in Step 2, the differences of all DCs between the synthetic datasets and the corresponding data templates were calculated as outcomes. As N = 10 simulation realizations were generated, there are 10 values for each data characteristic per experimental data template. For each characteristic that was closer to a normal distribution on the log-scale according to p-values of the Shapiro-Wilk test, we applied a log2-transformation to the respective characteristic prior to all analyses.
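A minimal sketch of this transformation rule, assuming x holds the values of one data characteristic across datasets (strictly positive where the log is taken), could look as follows:

# Minimal sketch: keep the scale on which the DC looks closer to a normal
# distribution according to the Shapiro-Wilk p-value.
maybe_log2 <- function(x) {
  p_raw <- shapiro.test(x)$p.value
  p_log <- shapiro.test(log2(x))$p.value
  if (p_log > p_raw) log2(x) else x
}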

Principal component analysis (PCA) was then performed on the scaled DCs and a two-dimensional PCA plot was generated to summarize similarity of experimental and simulated data on the level of all computed, non-redundant DCs. An additional outcome was the Euclidean distance of a synthetic dataset to its template in the first two principal component coordinates.
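This summary could be sketched as follows (a minimal sketch, assuming dc is a datasets-by-characteristics matrix whose rows cover the templates and their simulations; the row names "sim_1" and "template_1" are hypothetical placeholders):

# Minimal sketch: PCA on scaled DCs and Euclidean distance in the first two PCs.
pca <- prcomp(dc, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]
dist_to_template <- sqrt(sum((scores["sim_1", ] - scores["template_1", ])^2))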

Additionally, boxplots were generated, visualizing for each DC how it varied between templates, between all simulation realizations, and how templates deviate from the corresponding synthetic datasets. As illustrated in Figure 1B, these boxplots qualitatively show how well a DC was resembled by a specific simulation tool.

Equivalence tests

For assessing the similarity of the synthetic data and their templates, we applied equivalence tests based on two one-sided t-tests as implemented in the TOSTER R-package with a 5% significance level. The scales of different DCs are inherently incomparable; for instance, the proportion of zeros ranges between 0 and 1, while the number of features varies from 327 to 59,736. Therefore, we used the SD of the respective values from all experimental data templates as lower and upper margins. Figure 1C illustrates the equivalence testing procedure for the proportion of zeros in the whole dataset as an exemplary DC. For equivalence testing, the combined null hypothesis that the tested values are below the lower margin or above the upper margin had to be rejected to conclude equivalence. This only occurs when the average DC of the synthetic data is inside both margins and not too close to those two bounds, i.e. the whole 95% confidence interval of the estimated mean has to lie between both margins.
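The logic of this test can be sketched with base R as follows (a minimal sketch of the TOST principle; the study used the TOSTER implementation, and margin is the SD of the respective DC over all experimental templates):

# Minimal sketch of the two one-sided tests (TOST): equivalence is concluded
# only if values below the lower and above the upper margin can both be rejected.
tost_equivalence <- function(sim_values, template_value, margin, alpha = 0.05) {
  p_lower <- t.test(sim_values, mu = template_value - margin, alternative = "greater")$p.value
  p_upper <- t.test(sim_values, mu = template_value + margin, alternative = "less")$p.value
  max(p_lower, p_upper) < alpha
}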

Step 4: Exclusion of unrealistic simulations

For validating the prespecified hypotheses of aim 2, we excluded synthetic datasets that were not similar enough to the experimental datasets used as templates. For assessing similarity, the DCs described before and specified in detail in Table 3 were used. The goal of the following exclusion criterion was to remove synthetic datasets that are overall dissimilar from their experimental data template, without being too stringent: the simulation tools cannot perfectly resemble all DCs, and therefore a slight or medium amount of dissimilarity had to be accepted. Moreover, these slight or medium dissimilarities were exploited to study the impact of those characteristics on the validation outcomes, i.e. by investigating the association of such deviations with failures in the validation outcomes.

We expected that a few DCs are very sensitive in discriminating experimental and synthetic data, termed as highly-discriminative in the following. To prevent loss of too many datasets, such characteristics that are non-equivalent for the majority (>50%) of templates were only considered for the investigation of association between mismatch in outcome and mismatch in DCs but not for exclusion.

Using the remaining DCs, unrealistic synthetic data was excluded for the validation of the prespecified hypotheses in aim 2. We defined exclusion due to dissimilarity from the template by the following criterion:

For each template, the number of non-equivalent DCs across all simulation realizations was counted and the results for all templates were summarized in a boxplot. Synthetic data of those templates that appeared as outliers were removed (see Supplemental Figure 1). We used the common outlier definition from boxplots, i.e. all values with a distance to the 1st quartile (Q1) or 3rd quartile (Q3) larger than 1.5 x the inter-quartile range Q3-Q1 were considered as outliers.
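A minimal sketch of this exclusion rule, assuming n_noneq holds for each template the number of non-equivalent DCs across its simulation realizations, could look as follows:

# Minimal sketch of the 1.5 x IQR boxplot outlier rule used for exclusion.
q <- quantile(n_noneq, c(0.25, 0.75))
iqr <- q[2] - q[1]
is_outlier <- n_noneq < q[1] - 1.5 * iqr | n_noneq > q[2] + 1.5 * iqr
excluded_templates <- names(n_noneq)[is_outlier]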

For evaluating the sensitivity with respect to exclusion, we performed an additional, secondary analysis on all synthetic datasets, regardless of similarity to the templates. As there were no changes with respect to the number of validated hypotheses, we did not report this further in the results section.

Step 5: Differential abundance tests and exclusion/modification of DA tests

For evaluating the hypotheses (aim 2), 14 differential abundance (DA) tests were applied to the experimental and synthetic data (ALDEx2, ANCOM-II, corncob, DESeq2, edgeR, LEfSe, limma voom (TMM), limma voom (TMMwsp), MaAsLin2, MaAsLin2 (rare), metagenomeSeq, t-test (rare), Wilcoxon test (CLR), Wilcoxon test (rare)). As outcome, the number of significant features for each dataset and DA test was extracted, where significant features were identified using a 0.05 threshold for the multiple-testing adjusted p-values (Benjamini-Hochberg).

These analyses were conducted for unfiltered data and for filtered data. As in the reference study,1 filtered means that features found in fewer than 10% of samples were removed.
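A minimal sketch of this outcome extraction, assuming pvals holds the raw p-values of one DA test for one dataset and counts is the corresponding taxa-by-samples matrix, could look as follows:

# Minimal sketch: count significant features after BH adjustment and apply
# the 10% prevalence filter used for the 'filtered' analyses.
n_signif <- sum(p.adjust(pvals, method = "BH") < 0.05, na.rm = TRUE)
keep <- rowMeans(counts > 0) >= 0.10
counts_filtered <- counts[keep, , drop = FALSE]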

Inflated runtime

Datasets with a large number of features could lead to inflated runtimes for some statistical tests. If the runtime threshold for an individual test was exceeded for a specific dataset, we split the dataset, applied the test again on the subsets and afterwards merged the results. This split-and-merge procedure was repeated until the test runtime was below the threshold.

Here, we defined the runtime threshold as max. 1 hour per test. In a worst-case scenario, running the 14 tests for each simulation tool on the 10+1 datasets for each of the 38 templates, for both unfiltered and filtered data (14 x 11 x 38 x 2 = 11,704 combinations), would then need 488 days on a single core. Since we conducted the tests on up to 64 cores, such a worst-case scenario would still have been manageable.
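The split-and-merge fallback could be sketched as follows (a minimal sketch for a single split, assuming run_test() is a hypothetical wrapper that returns one row of results per taxon; in the study the splitting was repeated until the runtime fell below the threshold):

# Minimal sketch: split the features into chunks, run the DA test per chunk,
# and merge the per-taxon results afterwards.
run_with_split <- function(counts, run_test, n_chunks = 2) {
  chunks <- split(seq_len(nrow(counts)),
                  cut(seq_len(nrow(counts)), breaks = n_chunks, labels = FALSE))
  res <- lapply(chunks, function(idx) run_test(counts[idx, , drop = FALSE]))
  do.call(rbind, res)
}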

Test failure

If a DA test threw an error, we omitted the number of significant features and the overlap of significant features and reported them as NA (not available), as it would occur in practice.

Step 6: Number and overlap of significant features (aim 2)

Primary outcome (2a):

For the primary outcome (2a), the proportion of shared significant taxa between DA tools was investigated. To this end, it was determined for each dataset how many tests jointly identified features as significant. A barplot was generated as in Nearing et al.1 to visually summarize how many of the 14 DA tools identified the same taxa as significantly changed. Figure 1D illustrates this visualisation. The 13 hypotheses extracted from the reference study1 were investigated using statistical analyses at the level of individual datasets. For most cases, we counted how frequently the hypothesis held true over all simulated datasets and evaluated the lower bound of the confidence interval (CI.LB) for the estimated proportions. We used exact Clopper-Pearson intervals calculated by the DescTools R-package. The statistical procedure is summarized for each hypothesis in Supplemental Text 1.
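A minimal sketch of these two steps, assuming sig is a taxa-by-tests logical matrix of significance calls for one dataset (without missing values) and n_fulfilled of n_datasets simulated datasets fulfil a given hypothesis, could look as follows:

# Minimal sketch: overlap profile of one DA test (how many other tests also call
# its significant taxa significant) and exact Clopper-Pearson confidence interval.
overlap_profile <- function(sig, test) {
  others <- rowSums(sig[, colnames(sig) != test, drop = FALSE])
  table(factor(others[sig[, test]], levels = 0:(ncol(sig) - 1)))
}
DescTools::BinomCI(x = n_fulfilled, n = n_datasets, method = "clopper-pearson")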

Secondary outcome (2b):

For the secondary outcome, the number and proportion of significant taxa for each DA tool was investigated. To this end, the number and proportion of significant features was extracted for each dataset and test individually and visualized in heatmaps. After visualization, the 14 hypotheses extracted from the reference study1 were also investigated using the statistical analyses described in Supplemental Text 2.

For the primary and the secondary analysis, the CI.LB of the 95% confidence intervals for the estimated proportions of cases was used to evaluate whether a hypothesis was fulfilled. These values are easier to interpret than p-values, whose significance is strongly determined by the number of cases.

In case we found different results for the simulated data for some hypotheses, we analyzed the association of the mismatch in the outcome with the mismatch of DCs to identify DCs that could be responsible for the disagreement. To ensure independence of the scales, we performed these analyses after rank transformations. We used forward selection with a 5% cut-off criterion for p-values.

Step 7: Exploratory analysis to identify relationships of confirmation and DCs

As exploratory analysis, we identified DCs that were predictive for validating/rejecting each investigated hypothesis. First, we applied a multivariate logistic regression approach to identify the most influential DCs by a step-wise forward selection algorithm. This means that, starting from the null model validated ~ 1, the DC that improved the AIC most was added; dropping a DC and adding a further one was then repeated until the BIC did not improve any more. This procedure identified a minimal set of characteristics that were predictive for validating/rejecting the considered hypothesis. Then, a small decision tree was estimated based on these predictive characteristics. Recursive partitioning trees as indicated by the following pseudo-code were used.

tree_model <- rpart(confirmed ~ predictiveDC1 + predictiveDC2 + …,
   data = df, method = "class", control = rpart.control(maxdepth = 3))

This results in a hierarchy of conditions “predictiveDC > threshold” indicating the optimal rule for dividing datasets into confirmation vs. rejection.
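The stepwise selection described above could be sketched as follows (a minimal sketch using the step() function of base R on a data frame df with the logical outcome validated and one column per DC mismatch; the information criterion settings of the study are not reproduced here):

# Minimal sketch: stepwise selection of predictive DCs in a logistic regression,
# starting from the null model validated ~ 1.
null_model <- glm(validated ~ 1, data = df, family = binomial)
scope_formula <- reformulate(setdiff(names(df), "validated"))
sel <- step(null_model, scope = list(lower = ~ 1, upper = scope_formula),
            direction = "both", trace = FALSE)
predictive_dcs <- attr(terms(sel), "term.labels")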

Results

Aim 1: Synthetic data, simulated based on an experimental template, overall reflect main data characteristics

Data simulation

For each of the 38 experimental templates, 10 synthetic datasets were generated using metaSPARSim3 or sparseDOSSA2.4 Following the data generating mechanism in step 1, metaSPARSim3 was able to generate synthetic data for 37 of the 38 templates, while sparseDOSSA24 was able to generate data for 27 of the 38 templates. MetaSPARSim3 failed to simulate data for ArcticFreshwater because taxa variability was computed only for taxa having more than 20% of values greater than zero. This default filtering resulted in samples without counts > 0, leading to simulation failure. For sparseDOSSA2,4 we observed inflated runtimes for larger datasets. For the 11 largest datasets, which have 8,148–59,736 features and 75–2,296 samples, calibration was not feasible within the prespecified runtime limit (7 days for one data template).

Four simulation pipelines were used. In the first pipeline, data was simulated with the default configuration. In the second pipeline, the proportion of zeros was adjusted afterwards. The third pipeline involved adapting the proportion of regulated features to adjust the effect size, while in the fourth pipeline, both the proportion of zeros and the effect size were modified. Figure 2 shows how the proportion of zeros P0 as well as PERMANOVA R2 as a measure of effect size changed with the adaptations in the different pipelines. Adjusting sparsity led to a perfect match of the proportion of zeros (upper row) between the simulated datasets and the respective templates (“sim – template”). This step also made PERMANOVA R2 more realistic. Adjusting the effect size in pipeline 4 by introducing taxa without regulation between the two groups of samples only slightly further reduced the bias of PERMANOVA R2.

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure2.gif

Figure 2. Impact of different simulation pipelines on precision of data characteristics.

Impact of adjusting sparsity and effect size for the simulated datasets. For both simulation tools, we tried to reduce the number of non-equivalent DCs by introducing additional zeros and/or by introducing taxa without mean difference in both groups of samples. Like in Figure 1B, we show the impact of these modifications as boxplot comparing multiple experimental templates (left sections), comparing simulated with experimental data (middle sections), and comparing simulated datasets for the individual templates (right sections). Data sparsity can be adjusted to the proportion of zeros in the experimental data (upper row), but PERMANOVA effect size still deviates in simulated data compared to the experimental templates after our modifications.

Next, the set of 40 DCs was calculated for each dataset and for comparisons between datasets. The following characteristics were found as redundant (rank correlation > 0.95) and removed: q99, mean_lib_size, median_median_log2cpm, range_lib_size, mean_sample_means, SD_lib_size, mean_var_log2cpm, SD_median_log2cpm, mean_mean_log2cpm. Since the overall median was zero for all datasets and was thus also excluded, this process resulted in a final set of 30 DCs, which were used to characterize each dataset.

Individual data characteristics

Next, we visualized how well individual DCs were resembled in simulated data. Figure 3 shows boxplots of differences of DCs colored by template for metaSPARSim;3 the respective plots for sparseDOSSA24 are available as Supplemental Figure 2. As illustrated in Figure 1B, the magnitude of differences between simulation and template (“simus – template”) had to be compared with the variability between experimental templates (“templates – template”) and the variability between different simulation noise realizations (“simu – simu”). Overall, we did not see any DC for which the differences between simulation and template exceeded both the experimental variation and the simulation noise. The DC that appeared to be the most problematic was the bimodality of sample correlations (second panel in the upper row). This DC was introduced as a measure for the overall pattern caused by the two-group design. Bimodality is sometimes difficult to estimate reliably, e.g. if the two modes (peaks) are hardly visible in the distribution. The estimated bimodality index can then be close to zero in some cases, leading to large variations when all differences are visualized as boxplots. It can be appreciated that the mismatch of the DCs between simulated data and template was overall smaller than the span of the natural variability between experimental datasets.

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure3.gif

Figure 3. Assessing the similarity of data characteristics for metaSPARSim-generated data.

Comparison of differences of individual DCs for metaSPARSim. As in Figure 1B and Figure 2, we compared multiple experimental templates (left sections), simulated vs. experimental data (middle sections), and simulated datasets for the same template (right sections). The corresponding plots for sparseDOSSA2 are provided as Supplemental Figure 2.

Results of PCA of the data characteristics

To better assess the overall similarity of the DCs of simulated datasets with their respective templates, the 30 DCs for each dataset were analyzed and visualized by PCA. Figure 4A shows the PCA plot for metaSPARSim3 in the default pipeline 1. The simulation templates are represented as squares, with the 10 corresponding simulated datasets depicted as circles of the same color. The synthetic datasets generated by metaSPARSim3 were typically very close to their respective templates, indicating that metaSPARSim3 can reproduce DCs at this global summary level. Also, the simulation noise did not introduce substantial variation, as indicated by the tight clustering of the 10 simulation realizations. The other simulation pipelines yielded similar outcomes, as shown in Supplemental Figure 3. Overall, data simulated with metaSPARSim3 closely resembled the DCs of the simulation templates.

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure4.gif

Figure 4. PCA on scaled data characteristics assessing the similarity of real and synthetic data.

PCA on scaled data characteristics calculated for experimental and synthetic data. Experimental templates are depicted as squares, while the corresponding 10 synthetic datasets are depicted as circles.

When using sparseDOSSA24 as a simulation tool, it could be observed that the simulated data was systematically shifted towards the lower-right region of the PCA plot (Figure 4B). Despite this, the simulation realizations for each template remained clustered together, although the clusters were less tight compared to those generated by metaSPARSim.3 This suggests that sparseDOSSA24 introduces more simulation bias than metaSPARSim.3 A closer examination of the DCs responsible for the systematic shift revealed that sparseDOSSA24 consistently overestimated all DCs related to library size (Supplemental Figure 2). On the other hand, sparseDOSSA24 seemed better able to capture the sparsity and effect size of a given template. Still, PERMANOVA R2 tended to be overestimated.

Equivalence tests

To quantitatively and more systematically assess the similarity of simulated data and their templates, equivalence tests on the DCs were performed, testing whether a given DC of a simulated dataset is within an equivalence region around the DC of the corresponding template. This resulted in one statistical test for each DC and template. The results were summarized as heatmaps (Figure 5), displaying whether equivalence was accepted or rejected. Overall, metaSPARSim3 tended to generate data for which most characteristics were equivalent to those of the experimental templates (Figure 5A). However, one characteristic that was often not well resolved is the sparsity of the dataset (P0). While sparsity remained equivalent to the experimental template in the majority of cases, for 16 out of 37 templates the sparsity did not match the value of the experimental template. These datasets typically have a low number of features (mostly below 1,000) or a moderate number (1,000-3,000), but a small number of samples (~50). As previously discussed, the bimodality of the samples was systematically overestimated in the simulated data generated by metaSPARSim.3 However, when comparing the discrepancies to the natural range in all templates, as done by the equivalence test, we noticed that only 6 templates showed significant non-equivalence (Figure 5A). This discrepancy did not seem to be linked to any obvious factors, such as sample size or the number of features. Figure 5B shows the results of the equivalence tests for metaSPARSim3 in pipeline 4, in which both the sparsity of the dataset and the effect size were adjusted. This pipeline resulted in considerably fewer non-equivalent DCs and moreover resolved the issue of systematic shifts for some DCs. Figure 5C shows the results for sparseDOSSA24 in pipeline 2, in which additionally the sparsity of the data was adjusted. Confirming the results of the PCA analysis, synthetic datasets generated with sparseDOSSA24 had a systematic discrepancy in DCs related to library size and in the distance in the PCA plot (grey bars). To identify the best data-generating pipeline, the number of non-equivalent DCs was counted for each pipeline (Figure 5D). For metaSPARSim,3 pipeline 4, in which sparsity and effect size were adapted, was identified as the best choice, whereas for sparseDOSSA2,4 pipeline 2, in which only the sparsity was adapted, was selected.

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure5.gif

Figure 5. Equivalence tests assessing similarity of data characteristics of real and synthetic data.

Heatmaps indicate for each data characteristic and each template whether equivalence was accepted or not. A.-B. Impact on the results when sparsity and effect size in the simulation process are adapted. C. Results for the best pipeline (pipeline 2) for sparseDOSSA2. D. Summary of the results for all simulation pipelines.

Lastly, templates with an excessively high proportion of non-equivalent DCs were removed from the selected pipelines in order to perform the validation of our hypotheses, i.e. of the conclusions in the reference study,1 only on those simulated datasets that closely resembled the experimental templates. Supplemental Figure 1 illustrates the results for metaSPARSim3 - pipeline 4 (Supplemental Figure 1A) and sparseDOSSA24 - pipeline 2 (Supplemental Figure 1B). The number of non-equivalent DCs for the simulated datasets was summarized in a boxplot. Templates identified as outliers according to the boxplot criteria were removed. For metaSPARSim,3 this led to the removal of three datasets, whereas all templates were retained for sparseDOSSA2.4

Aim 2: Study results from Nearing et al. can be validated using synthetic data, simulated based on corresponding experimental data

Primary outcome (2a):

The overlaps of significant features (pFDR < 0.05) after applying the 14 statistical tests were investigated as the primary outcome. Figure 6 presents the results for both unfiltered and prevalence-filtered data. For metaSPARSim3 (Figure 6A), the results were based on 37 datasets, while for sparseDOSSA24 (Figure 6B), they were based on 27 datasets. Consequently, a direct comparison between the two simulation tools was only partially feasible. Nonetheless, general trends were observed across both simulation tools that align with findings from the reference study.1

8c1d8b22-25a0-4f07-9b3a-cc11105a8a04_figure6.gif

Figure 6. Consistency profiles for differential abundance tests.

For each differential abundance test, it is displayed whether a taxon identified as significant was also found by the other tests. 0 means that no other test found this taxon, while 13 means that all other tests also identified this taxon as significantly changed. The color scale shows the number of significant calls. A. Results for metaSPARSim, with results based on unfiltered data in the left panel and results for filtered data in the right panel. B. Similar visualization for sparseDOSSA2.

Overall, both simulation pipelines displayed similar trends regarding the consistency profiles of the statistical tests. Both limma-voom methods and the Wilcoxon test (CLR) identified similar sets of features that were not detected by most other tools. The consistency profiles for both Wilcoxon methods differed substantially, while those for the two MaAsLin2 approaches were very similar. Corncob, DESeq2 and metagenomeSeq exhibited intermediate profiles. ANCOM-II produced an intermediate profile for sparseDOSSA24 data and a more left-shifted profile for metaSPARSim3 data, which contrasts with the reference study,1 where ANCOM-II showed more conservative behaviour similar to ALDEx2.

For prevalence-filtered data, ALDEx2 was the most conservative tool, i.e. it had the least number of significant taxa (light color) and the most significant taxa shared with other DA methods (bars on the right), with this tendency being more pronounced for sparseDOSSA2-generated data. For filtered data, the overlap between most tests was greater compared to the unfiltered data.

As a more systematic comparison with outcomes of the reference study,1 13 statistical hypotheses derived from conclusions in the reference paper were tested next. Some of these hypotheses involved multiple statistical tests, which resulted in a total of 24 possible validations. The results, including the calculated proportions and the lower bounds of the respective confidence intervals (CI.LB), are summarized in Table 4. Successful validations are colored in green, hypotheses that were true at least for the majority of simulated datasets (>50%) in orange, and those that were only true for a minority of cases (<50%) in red.

Table 4. Results for hypotheses of aim 2a.

Hypothesis number | Result from reference study | Hypothesis | Specification of DA-test or method | metaSPARSim | sparseDOSSA2
H1 | “Based on the unfiltered data, limma voom methods identified similar sets of significant ASVs that were different from those of most other tools.” | For UNFILTERED data, the proportion of features jointly found as significant by limma voom TMM and limma voom TMMwsp but by less than 50% of the other methods is larger than the overlap with more than 50% of the other methods. | limma voom (TMM) & limma voom (TMMwsp) | CI.LB = 0.9888 | CI.LB = 0.9864
H4 | “[For UNFILTERED data] Both MaAsLin2 approaches had similar consistency profiles.” | For UNFILTERED data, MaAsLin2 and MaAsLin2-rare have a more similar profile (larger test statistic D) than a randomly selected pair of methods. | MaAsLin2 & MaAsLin2 (rare) | CI.LB = 0.6615 | CI.LB = 0.9645
H7 | “[For UNFILTERED data] Corncob, metagenomeSeq, and DESeq2 identified ASVs at more intermediate consistency profiles.” | For UNFILTERED data, for corncob, metagenomeSeq, and DESeq2, there are always multiple other methods (i.e. at least 2 out of 10 other DA methods) that have a more extreme consistency profile. | corncob | CI.LB = 0.8083 | CI.LB = 0.6212
H7.1 | | | DESeq2 | CI.LB = 0.8140 | CI.LB = 0.7472
H7.2 | | | metagenomeSeq | CI.LB = 0.4782 | CI.LB = 0.9191
H8 | “The overlap in significant ASVs based on the prevalence-filtered data was similar overall to the unfiltered data results.” | The shape of the overlap profiles for all methods except both limma voom approaches is mainly determined by the exp. dataset and the DA method but only little by whether the data has been filtered. Filt vs unfilt (dataset [i], DA-test [j]) compared to … (i: 1..380, j: 1..14) | … other unfiltered datasets | CI.LB = 0.8299 | CI.LB = 0.7313
H8.1 | | | … other filtered datasets | CI.LB = 0.8231 | CI.LB = 0.7195
H8.2 | | | … other DA tests in unfiltered data | CI.LB = 0.7929 | CI.LB = 0.8079
H8.3 | | | … other DA tests in filtered data | CI.LB = 0.8206 | CI.LB = 0.7956
H9 | “For prevalence-filtered data, the limma voom approaches identified a much higher proportion of ASVs that were also identified by most other tools, compared with the unfiltered data.” | For filtered data, for both limma voom approaches the proportion of identified features that are also identified by the majority of other tests is larger than for unfiltered data. | limma voom (TMM) | CI.LB = 0.6421 | CI.LB = 0.6376
H9.1 | | | limma voom (TMMwsp) | CI.LB = 0.6239 | CI.LB = 0.6454
H10 | “[For FILTERED data] The Wilcoxon (CLR) significant ASVs displayed a bimodal distribution and a strong overlap with limma voom methods.” | For FILTERED data, the overlap profile of Wilcoxon CLR is bimodal. | Wilcoxon (CLR) | CI.LB = 0.7070 | CI.LB = 0.7707
H5 | “[For UNFILTERED data] ALDEx2 and ANCOM-II primarily identified features that were also identified by almost all other methods.” | For UNFILTERED data, ALDEx2 and ANCOM-II identify more features that were also identified by all except 3 (i.e. 10 out of 13) other methods. | ALDEx2 | CI.LB = 0.5300 | CI.LB = 0.5998
H5.1 | | | ANCOM-II | CI.LB = 0 | CI.LB = 0.0633
H3 | “[For UNFILTERED data] The two Wilcoxon test approaches had highly different consistency profiles.” | For UNFILTERED data, the Kolmogorov-Smirnov test statistic D when comparing the profiles of Wilcoxon CLR and Wilcoxon rare is larger than for other pairs of methods on average. | Wilcoxon (CLR) & Wilcoxon (rare) | CI.LB = 0.4404 | CI.LB = 0.7205
H2 | “[For UNFILTERED data] Many of the ASVs identified by the limma voom methods were also identified as significant based on the Wilcoxon (CLR) approach.” | For UNFILTERED data, the overlap of features jointly found as significant by limma voom TMM and limma voom TMMwsp with features found by Wilcoxon CLR is larger than the overlap with all other DA methods. | Both limma vooms & Wilcoxon (CLR) | CI.LB = 0.1481 | CI.LB = 0.2149
H6 | “[For UNFILTERED data] EdgeR and LEfSe, two tools that often identified the most significant ASVs, output the highest percentage of ASVs that were not identified by any other tool.” | For UNFILTERED data, EdgeR and LEfSe identify a larger percentage of features that are not identified by any other tool, compared to the same percentage for all other methods. | EdgeR | CI.LB = 0 | CI.LB = 0.1736
H6.1 | | | LEfSe | CI.LB = 0 | CI.LB = 0
H11 | “Overall, the proportion of ASVs consistently identified as significant by more than 12 tools was much higher in the filtered data (mean: 38.6%; SD: 15.8%) compared with the unfiltered data (mean: 17.3%; SD: 22.1%).” | The proportion of features identified by all except one DA method is larger for prevalence-filtered data. | | CI.LB = 0.5161 | CI.LB = 0.5501
H12 | “In contrast with the unfiltered results, corncob, metagenomeSeq, and DESeq2 had lower proportions of ASVs at intermediate consistency ranks.” | For filtered data, the consistency profiles of corncob, metagenomeSeq, and DESeq2 are more similar to the more extreme methods than for unfiltered data. | corncob | CI.LB = 0.0276 | CI.LB = 0.0043
H12.1 | | | DESeq2 | CI.LB = 0.0066 | CI.LB = 0.0066
H12.2 | | | metagenomeSeq | CI.LB = 0.0245 | CI.LB = 0.0191
H13 | “[For FILTERED data] ALDEx2 and ANCOM-II produce significant ASVs that largely overlapped with most other tools.” | For FILTERED data, ALDEx2 and ANCOM-II identify more features that were also identified by all except 3 (i.e. 10 out of 13) other methods. | ALDEx2 | CI.LB = 0.4412 | CI.LB = 0.4571
H13.1 | | | ANCOM-II | CI.LB = 0 | CI.LB = 0.0578

Validated hypotheses ( Table 4 green):

Out of the 24 hypotheses, only one could be statistically validated for both simulation tools: H1 (CI lower bound (CI.LB) metaSPARSim = 99%, CI.LB sparseDOSSA2 = 99%). This hypothesis refers to the observation that both limma-voom methods tend to identify a similar set of significant features, which differ from those identified by most other tools. Moreover, H4 could be validated for data generated with sparseDOSSA2.4 H4 refers to the similarity of both MaAsLin2 approaches. While this similarity was only partly evident for metaSPARSim-generated data (CI.LB metaSPARSim = 66%), it could be validated for sparseDOSSA2-generated data (CI.LB sparseDOSSA2 = 96%).

Hypotheses that were true for the majority of cases ( Table 4 orange):

For eleven validations belonging to six hypotheses we found that they were true in the majority of simulated datasets. H7 describes the intermediate consistency profile of corncob, DESeq2 and metagenomeSeq, which means that they tend to identify neither taxa that are found by almost no other tool nor taxa that are found by almost all other tools. For all three DA methods, we could not strictly validate this hypothesis, but we could confirm the observation for the major proportion of datasets: for corncob we obtained CI.LB (metaSPARSim) = 80% and CI.LB (sparseDOSSA2) = 62%, for DESeq2 CI.LB (metaSPARSim) = 81% and CI.LB (sparseDOSSA2) = 74%, and for metagenomeSeq CI.LB (metaSPARSim) = 48% and CI.LB (sparseDOSSA2) = 92%.

H8 is based on the observation that in general the shape of the overlap profiles except for limma voom is quite similar for unfiltered and filtered data. Although this hypothesis could not be validated it was true for a high proportion of cases (between 72% and 83%) for all four validations.

The observation that the overlap profiles of both limma voom methods shift towards an intermediate profile for filtered data (H9) was still true for the majority of cases in our study, with CI.LB between 62% and 64% of the analyzed simulated datasets.

Also H10, describing the observation that the overlap profile shifts towards a bimodal distribution in filtered data, held true in the majority of cases (CI.LB = 71% for metaSPARSim, 77% for sparseDOSSA2). The reference study1 found that, in unfiltered data, ALDEx2 and ANCOM-II primarily identified taxa that were also found significant by almost all other methods (H5). For ALDEx2 this could not be strictly validated; however, we found the same trend with CI.LB = 53% for metaSPARSim3 and CI.LB = 60% for sparseDOSSA2. In contrast, for ANCOM-II we almost never found this behavior (CI.LB = 0% and 6%).

H3 states that the two Wilcoxon test approaches have substantially different consistency profiles. Here, we observed CI.LB = 44% for metaSPARSim3 and 72% for sparseDOSSA2.4 As also visible in Figure 6, this difference is more pronounced for sparseDOSSA2-generated data.

Hypotheses that are only true for the minority of cases ( Table 4 red):

Some of the hypotheses in this category are related to ANCOM-II, for which we observed a very different behaviour in our data. Although we used the code published with the reference study,1 this might still be related to differences in implementation, since ANCOM-II is not available as a maintained software package. In contrast to the reference study,1 we observed that ANCOM-II shows a left-shifted to intermediate profile rather than a strongly right-shifted, conservative profile. Other hypotheses are related to LEfSe, which displayed a consistency profile more similar to the other methods in our study than in the reference study.1 We also attribute this to differences in implementation: as for the other DA methods, we used a threshold on p-values rather than on the score used in the reference study.1

We found the hypothesized overlap between the limma voom methods and Wilcoxon (CLR) (H2) only in a minority of cases (CI.LB = 15% for metaSPARSim and 21% for sparseDOSSA2). H6 and H6.1 refer to the observation that edgeR and LEfSe identify a high number of unique features. In our study, we could not confirm this observation. While edgeR showed a left-shifted profile for our data, most identified features were shared with at least one other statistical test. LEfSe showed a more intermediate profile, which was even more pronounced for sparseDOSSA2-generated data.

With hypothesis H11, we intended to validate that the proportion of taxa consistently identified as significant by more than 12 tools is much higher in filtered than in unfiltered data. Although our overlap profiles showed this trend when all taxa are aggregated over all projects, at the level of individual datasets we only found proportions slightly above 50% (CI.LB = 52% for metaSPARSim, 55% for sparseDOSSA2).

Lastly, H12 refers to the observation that DESeq2, corncob, and metagenomeSeq exhibit intermediate profiles for unfiltered data, with a shift toward a more conservative profile for filtered data. While all three methods showed relatively intermediate consistency profiles, the impact of prevalence filtering was not as strong as in the reference study. The statistical analysis of the respective hypothesis could not confirm this at the level of individual datasets (all CI.LB of the proportions are below 3%).

Secondary outcome (2b):

For the secondary outcome, the number of significant features identified by each tool was investigated. The results are displayed as heatmaps in Figure 7. We chose the same depiction as in the reference study, where colors correspond to the proportion of significant taxa per dataset, scaled over all tests (i.e. per row), and the number of significant taxa is provided as text. Additionally, 14 hypotheses were formulated, resulting in 19 analyses, as for some hypotheses multiple DA tests had to be considered. Of these 19, we validated 10 hypotheses from the reference study, and for an additional 2 hypotheses, validation was achieved for one of the two simulation tools. For five hypotheses, which concern ANCOM-II, LEfSe, the limma voom methods, and the proportions in filtered compared to unfiltered data, we found discrepancies.


Figure 7. Proportion of significant features for all tests and templates.

Proportion of significant features per dataset shown as a heatmap. As in the reference study, cells are colored based on the standardized (scaled and mean-centered) percentage of significant taxa for each dataset, and the datasets are shown in the same order as in the reference study. Both color scales were chosen to match the reference study as closely as possible. The numbers denote the absolute number of significant features per test, summed over the 10 simulated datasets. A. Results for metaSPARSim. B. Results for data generated with sparseDOSSA2.
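The row-wise standardization underlying this coloring can be sketched as follows; the matrix and its dimensions are hypothetical stand-ins for the per-template proportions of significant taxa.

# Sketch of the row-wise standardization behind the Figure 7 coloring,
# assuming a hypothetical matrix `prop_sig` with one row per data template
# and one column per DA method, holding proportions of significant taxa.
set.seed(1)
prop_sig <- matrix(runif(38 * 14), nrow = 38, ncol = 14,
                   dimnames = list(paste0("template_", 1:38),
                                   paste0("DA_method_", 1:14)))

# scale() standardizes columns, so transpose to center and scale each row (template)
scaled_props <- t(scale(t(prop_sig)))

# basic heatmap without clustering, keeping the template order fixed
heatmap(scaled_props, Rowv = NA, Colv = NA, scale = "none")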

Validated hypotheses ( Table 5 green):

Table 5. Results for hypotheses of aim 2b.

Hypothesis number | Result from reference study | Hypothesis | Specification of DA-test or method | Outcome threshold | Simulation tool: metaSPARSim | Simulation tool: sparseDOSSA2
H1 | “We found that in both the filtered and unfiltered analyses the percentage of significant ASVs identified by each DA method varied widely across datasets, with means ranging from 3.8–32.5% and 0.8–40.5%, respectively.” | For FILTERED and UNFILTERED data, the percentage of significant features identified by each DA method varies widely across datasets. | Dataset: 10 simulations; Project: 38 templates; DA_method: 14 DA tests; UNFILTERED: % significant ~ dataset * project + DA_method | p-value DA_method < 0.05 | p-value DA_method = 0 | p-value DA_method = 0
H1.1 | | | UNFILTERED: % significant ~ dataset * project + DA_method * project | p-value project:DA_method < 0.05 | p-value project:DA_method = 0 | p-value project:DA_method = 0
H1.2 | | | FILTERED: % significant ~ dataset * project + DA_method | p-value DA_method < 0.05 | p-value DA_method = 0 | p-value DA_method = 0
H1.3 | | | FILTERED: % significant ~ dataset * project + DA_method * project | p-value project:DA_method < 0.05 | p-value project:DA_method = 0 | p-value project:DA_method = 0
H2 | “Many tools behaved differently between datasets. Some tools identified the most features in one dataset while identifying only an intermediate number in other datasets.” | For FILTERED and UNFILTERED data, rankings of the DA methods with respect to the proportion of identified features depend on the data template. | UNFILTERED: value ~ DA_method * project | p-value DA_method:project < 0.05 | p-value DA_method:project = 0 | p-value DA_method:project = 0
H2.1 | | | FILTERED: value ~ DA_method * project | p-value DA_method:project < 0.05 | p-value DA_method:project = 0 | p-value DA_method:project = 0
H5 | “[For UNFILTERED data] In a few datasets, such as the Human-ASD and Human-OB (2) datasets, edgeR found a higher proportion of significant ASVs than any other tool.” | For UNFILTERED data, there are datasets where edgeR identifies the largest proportion of significant features. | EdgeR | CI.LB > 0 | CI.LB = 7e-5 | CI.LB = 0.1275
H8 | “[For UNFILTERED data] Such extreme findings [as in H7] were also seen in the Wilcoxon (CLR) output, where more than 90% of ASVs were called as significant in eight separate datasets.” | For UNFILTERED data, there are datasets where Wilcoxon CLR identifies more than 90% of features as significant. | Wilcoxon (CLR) | CI.LB > 0 | CI.LB = 0.0102 | CI.LB = 0.0458
H14 | “[Over all 38 exp. datasets] edgeR, LEfSe, limma voom, and Wilcoxon tended to output the highest numbers of significant ASVs.” | For FILTERED data, there is no method other than EdgeR, LEfSe, limma voom TMMwsp, limma voom TMM, or Wilcoxon CLR that identifies the largest number of significant features in total, i.e. when considering ranks of DA methods over all 38 templates. | EdgeR, LEfSe, limma voom (TMM), limma voom (TMMwsp) or Wilcoxon (CLR) | Estimated proportion > 0.95 | Estimated proportion = 1 | Estimated proportion = 1
H3 | “This [H2] was especially evident in unfiltered data.” | Rankings of the DA methods with respect to the proportion of identified features depend more strongly on the data template in unfiltered than in filtered data. | value ~ DA_method + DA_method:project | Comparing interaction mean squares DA_method:project: unfiltered - filtered > 0 | Interaction MSQ DA_method:project (unfiltered - filtered) = 12.06 | Interaction MSQ DA_method:project (unfiltered - filtered) = -13.612
H4 | “In the unfiltered datasets, we found that limma voom, Wilcoxon, LEfSe, and edgeR tended to find the largest number of significant ASVs compared with other methods.” | In unfiltered data, either limma voom TMMwsp, limma voom TMM, Wilcoxon, or LEfSe identify the largest proportion of significant features. | limma voom (TMM), limma voom (TMMwsp), Wilcoxon (CLR) or LEfSe | CI.LB > 0.95 | CI.LB = 0.6187 | CI.LB = 0.9191
H10 | “[For UNFILTERED data] We found that two of the three compositionally aware methods we tested identified fewer significant ASVs than the other tools tested. Specifically, ALDEx2 and ANCOM-II identified the fewest significant ASVs. We found the conservative behavior of these tools to be consistent across all 38 datasets we tested.” | In UNFILTERED data, either ALDEx2 or ANCOM-II identify the fewest significant features. | ALDEx2 or ANCOM-II | CI.LB > 0.95 | CI.LB = 0.7851 | CI.LB = 0.7503
H11 | “The results [from H10] based on the filtered tables showed a similarly conservative behavior.” | In FILTERED data, ALDEx2 or ANCOM-II do not identify significantly more features than the most conservative tests. | ALDEx2 or ANCOM-II | CI.LB > 0.95 | CI.LB = 0.7503 | CI.LB = 0.4461
H6 | “[For UNFILTERED data] We found that limma voom (TMMwsp) found the majority of ASVs to be significant (73.5%) in the Human-HIV (3) dataset.” | For UNFILTERED data, limma voom TMMwsp identifies the largest proportion of features in the Human-HIV (3) dataset. | limma voom (TMMwsp) | Estimated proportion > 0.95 | Estimated proportion = 0 | Not testable
H7 | “[For UNFILTERED data] We found that both limma voom methods identified over 99% of ASVs to be significant in several cases such as the Built-Office and Freshwater-Arctic datasets.” | For UNFILTERED data, there are datasets where both limma voom methods identify more than 99% of features as significant. | limma voom (TMM), limma voom (TMMwsp) | CI.LB > 0 | CI.LB = 0 | CI.LB = 0
H9 | “[For UNFILTERED data] We found similar [as in H7 and H8], although not as extreme, trends with LEfSe where in some datasets, such as the Human-T1D (1) dataset, the tool found a much higher percentage of significant hits compared with all other tools.” | For UNFILTERED data, there are datasets where LEfSe identifies more taxa as significant compared with all other tools. | LEfSe | CI.LB > 0 | CI.LB = 0 | CI.LB = 0
H12 | “For filtered data, there was a smaller range in the number of significant features identified by each tool. All tools except for ALDEx2 found a lower number of total significant features when compared with the unfiltered dataset.” | No tool (except ALDEx2) identifies a smaller number of features for unfiltered data (than for filtered). | ALDEx2 | CI.LB > 0.95 | CI.LB = 0.0118 | CI.LB = 0.0136
H13 | “As with the unfiltered data, ANCOM-II was the most stringent method [over all 38 exp. datasets].” | For FILTERED and UNFILTERED data, ANCOM-II identifies the fewest significant features in total, i.e. when summing ranks of DA methods over all 38 templates. | UNFILTERED: ANCOM-II | Estimated proportion > 0.95 | Estimated proportion = 0 | Estimated proportion = 0
H13.1 | | | FILTERED: ANCOM-II | Estimated proportion > 0.95 | Estimated proportion = 0 | Estimated proportion = 0

Out of the 19 tests, 10 were statistically validated for both simulation frameworks. These included hypotheses H1, H2, H5, H8, and H14 with all corresponding tests. Hypotheses H1 and H2 describe the variability of DA results across the different data templates. For H1, it was confirmed that, for both filtered and unfiltered data, the percentage of significant features identified by each DA method varied widely across datasets. In the linear model with the DA method (“test”) and the interaction “test:project” as predictors of the percentage of significant features, both predictors contributed significantly. This indicates that both the dataset and the DA method applied to it strongly impacted the results.

Hypothesis H2 states that the percentage of significant features depends more strongly on the data template than on the method itself. Here, it was tested whether the rank of a DA method, based on the number of significant taxa, could be explained solely by the DA method, or whether significant interactions between the DA method and the data template existed. For both filtered and unfiltered data, significant interactions were observed, explaining major proportions of the total variance (p < 1e-16). Building on H2, hypothesis H3 tested whether this effect was stronger for unfiltered data. The interaction terms from H2 were compared and, interestingly, this was validated for metaSPARSim3 but not for sparseDOSSA2.4 However, for both simulation approaches we observed that the mean squares of the interaction test:template were relatively small (MSQ(unfiltered) = 64 and MSQ(filtered) = 52 for metaSPARSim, MSQ(unfiltered) = 62 and MSQ(filtered) = 76 for sparseDOSSA2) compared to those of the main test effects (MSQ(unfiltered) = 3279 and MSQ(filtered) = 3720 for metaSPARSim, MSQ(unfiltered) = 2403 and MSQ(filtered) = 1945 for sparseDOSSA2). This resulted in a slightly stronger test:template dependency (MSQ(unfiltered) - MSQ(filtered) > 0) in unfiltered data for metaSPARSim3 (in agreement with the hypothesis) but a weaker dependency for sparseDOSSA24 (MSQ(unfiltered) - MSQ(filtered) < 0, i.e. a discrepancy with the hypothesis). Overall, we found only a modest dependence of the DA method rankings on the project, and it was not clearly more evident in unfiltered data. This is also illustrated in the heatmaps shown in Supplemental Figure 4.
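The models described above can be sketched in R as follows; the long-format data frame and its random values are hypothetical, and only the model formulas follow those listed in Table 5.

# Sketch of the linear models behind H1-H3 (aim 2b). The data frame `res`
# is hypothetical: one row per simulated dataset, data template (project),
# and DA method, with random values standing in for the real results.
set.seed(1)
res <- expand.grid(dataset   = factor(1:10),   # simulation replicate
                   project   = factor(1:38),   # data template
                   DA_method = factor(1:14))   # DA test
res$percent_significant <- runif(nrow(res), 0, 40)

# H1: does the DA method contribute to the percentage of significant features?
fit_h1 <- lm(percent_significant ~ dataset * project + DA_method, data = res)
anova(fit_h1)["DA_method", "Pr(>F)"]

# H2/H3: rank each DA method within a simulated dataset and model it with a
# method-by-template interaction; H3 compares the interaction mean squares
# between unfiltered and filtered data.
res$rank_value <- ave(res$percent_significant, res$dataset, res$project,
                      FUN = function(x) rank(-x))
fit_h2 <- lm(rank_value ~ DA_method * project, data = res)
anova(fit_h2)["DA_method:project", "Mean Sq"]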

In addition to these hypotheses, which describe a general behaviour across DA methods, four hypotheses detailing specific behaviours of individual DA methods were intended to be validated. Hypotheses H5 and H8 describe the observation that, in unfiltered data, edgeR and Wilcoxon (CLR) identify a very high or even the highest number of significant taxa for some datasets. In our simulated data, we also found such datasets. For filtered data, edgeR, LEfSe, Wilcoxon (CLR), and both limma voom methods tended to identify the largest number of significant taxa, a trend validated in H14.

Hypotheses that were true for the majority of cases ( Table 5 orange):

For unfiltered data, the reference study1 found that the limma voom methods, Wilcoxon (CLR) or LEfSe were the DA methods that identified the largest proportion of significant features. While this was true for the majority of our datasets, we could not strictly validate it (CI.LB = 62% and 91%).

Hypotheses H10 and H11 refer to the conservative behaviour of ALDEx2 and ANCOM-II. Specifically, they test whether these methods do not identify significantly more features than the most conservative test in unfiltered and in filtered data. While this was true for the majority of cases in the metaSPARSim3 pipeline (CI.LB (unfiltered) = 79%, CI.LB (filtered) = 75%) and for sparseDOSSA24 for unfiltered data (CI.LB (unfiltered) = 75%), the confidence bound was below 50% for sparseDOSSA24 after prevalence filtering (CI.LB (filtered) = 44%).

Hypotheses that were true for the minority of cases ( Table 5 red):

In the following, we summarize results that agree with our hypotheses for only a minority of the analyzed datasets and therefore do not coincide with the conclusions of the reference study.1 Hypothesis H6 states that limma voom (TMMwsp) identifies the largest proportion of significant taxa in the Human-HIV (hiv_noguerajulian) dataset. This was never the case for metaSPARSim3-generated data. However, we observed that for datasets like GWMC_Asia or GWMC_Hot_Cold, limma voom (TMMwsp) consistently identified more significant taxa. These two datasets are among the largest, containing three times as many taxa as the hiv_noguerajulian dataset. For sparseDOSSA2,4 this hypothesis could not be tested, as calibration and therefore simulation were not feasible.

Hypothesis H7 also concerns the two limma voom methods. In the reference study,1 these methods were reported to identify 99% of the taxa as significant for two specific datasets. In our translation of this hypothesis, we generalized the behaviour, stating that for certain datasets 99% of the taxa are identified as significant. However, the datasets with these extreme results in the reference study1 were not included in our simulation pipelines due to calibration errors or unrealistic simulations. In the remaining datasets, this hypothesis was not validated.

Another test that found the highest number of significant taxa for some datasets was LEfSe, which is summarized in H9. We could not validate this hypothesis, and we suspect this is related to the different implementation and usage of LEfSe. In the reference study,1 a Python implementation was used and a threshold was applied to the LEfSe score to assess significance. In contrast, we used the R implementation and p-values for calculating FDRs and assessing significance, as done for all other DA methods.

Another general hypothesis across most DA methods is H12. Here, we intended to validate that all tools, except for ALDEx2, identify a smaller number of significant features in filtered than in unfiltered data. However, for our simulated data, there were no cases where all tools (disregarding ALDEx2) indeed identified fewer significant taxa when sparsity filtering was applied. Because filtering out features with larger p-values leads to smaller FDRs for the remaining small p-values, we checked whether this result also holds for unadjusted p-values. For unadjusted p-values, we found 58 out of 370 metaSPARSim datasets where the hypothesis held true; for sparseDOSSA2 we found only 10, all of them belonging to the ob_turnbaugh template. FDR adjustment therefore only partly explains this surprising observation. The detailed analysis results are provided in Supplemental Table 1.
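The FDR argument above can be illustrated with a minimal sketch using hypothetical p-values; the threshold used here to mimic filtering is a simplification, since the actual filter is prevalence-based rather than p-value-based.

# Illustration of why removing features with large raw p-values lowers the
# Benjamini-Hochberg adjusted values of the remaining small p-values.
# The p-values and the filter are hypothetical; in the study, filtering is
# prevalence-based and only tends to remove features with large p-values.
set.seed(1)
p_raw <- c(runif(20, 0, 0.01), runif(180, 0, 1))

fdr_all      <- p.adjust(p_raw, method = "BH")
keep         <- p_raw < 0.5                     # simplified stand-in for a filter
fdr_filtered <- p.adjust(p_raw[keep], method = "BH")

summary(fdr_all[keep] - fdr_filtered)           # all differences are >= 0 here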

Finally, hypothesis H13, which concerns the conservative behavior of ANCOM-II in both filtered and unfiltered data, could not be validated. As discussed above, this is again attributed to discrepancies between our results and those of the reference study for ANCOM-II.

Overarching results for both primary and secondary aims

Overall, the analyses demonstrated that synthetic data, when calibrated appropriately, can closely resemble experimental data in terms of key data characteristics (DCs). The ability of simulation tools to replicate experimental DCs was validated through principal component analysis, equivalence tests, and hypothesis-specific analyses. MetaSPARSim3 generally outperformed sparseDOSSA24 in terms of replicating DCs, although both tools showed consistent trends in their ability to validate findings from the reference study.
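A minimal sketch of such a PCA-based comparison of data characteristics is given below; the DC matrix and the grouping are hypothetical placeholders for the 30 DCs computed for experimental and simulated datasets.

# Sketch: PCA over data characteristics (DCs) to compare simulated and
# experimental datasets. The matrix `dc` and the grouping `origin` are
# hypothetical placeholders.
set.seed(1)
dc <- matrix(rnorm(60 * 30), nrow = 60, ncol = 30,
             dimnames = list(NULL, paste0("DC_", 1:30)))
origin <- factor(rep(c("experimental", "metaSPARSim", "sparseDOSSA2"), each = 20))

pca <- prcomp(dc, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], col = origin, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(origin), col = seq_along(levels(origin)), pch = 19)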

For the primary aim (2a), which focused on validating the overlap of significant features across differential abundance (DA) tests, we confirmed few findings from the reference study. Specifically, we validated trends in the consistency profiles of limma voom, MaAsLin2, and Wilcoxon-based approaches. However, certain discrepancies were observed, particularly in the behavior of ANCOM-II and LEfSe, which deviated from expectations based on the reference study.1 These differences may stem from variations in implementation, filtering strategies, or intrinsic differences between simulated and experimental data.

For the secondary aim (2b), which assessed the proportion of significant taxa identified by each DA method, overall trends were mostly consistent with the reference study,1 but specific hypotheses, particularly those related to the effects of filtering on DA methods, showed discrepancies. The most consistent results were found for the variation across experimental data templates and for EdgeR and Wilcoxon (CLR), while methods like ANCOM-II and LEfSe displayed notable inconsistencies.

To check whether any conclusions concerning the investigated hypotheses depend on these two tests, we repeated the primary and secondary analyses after excluding each of the two DA methods individually and evaluated whether the outcomes changed. For both DA methods, the exclusion had no impact on any validation conclusion.

We also compared the agreement of the conclusions from both simulation methods. Figure 8A shows the proportion of datasets where the hypotheses held true; for both simulation tools, these proportions are highly correlated. Moreover, we compared these proportions between filtered and unfiltered data for hypotheses that were claimed for both. Again, we found a strong correlation, i.e. filtering has only a minor impact on these hypotheses ( Figure 8B).


Figure 8. Association of data characteristics and validation of hypotheses.

A. The proportions of datasets where the hypotheses were validated or rejected correlate strongly between the two simulation tools and (panel B) between filtered and unfiltered data. H1xx denotes primary hypothesis xx, H2xx the corresponding secondary one. Only hypotheses for which counting over simulated datasets is applied are shown. C. Frequencies with which DCs were selected as informative for validating/rejecting hypotheses by the stepwise logistic regression approach. Effect size (PERMANOVA R-squared), the proportion of zeros (P0), and the number of samples were selected most frequently. Panels D-F illustrate the decision tree analyses with three interesting examples. The percentage values indicate the proportion of all datasets and always add up to 100% within one level. The numbers indicate the average, i.e. the proportion within a branch for which the corresponding hypothesis holds true.

Exploratory analysis

As an exploratory analysis, we attempted to link the outcomes of the investigated hypotheses with data characteristics. Specifically, we aimed to identify DCs that predict whether a hypothesis is validated or not across individual datasets. Identifying such relationships requires a sufficient number of cases where the hypothesis was both confirmed and not confirmed. Additionally, both outcomes should be observed across multiple templates; otherwise, any identified relationships are unlikely to generalize to unseen datasets.

Across all hypotheses and both simulation tools, PERMANOVA R-squared (26 out of 43) and the proportion of zeros (11 out of 43) were the most frequently selected predictors, followed by the number of samples in the count data (nsamples, 10 out of 43). Figure 8C illustrates the frequency with which DCs were selected. Supplemental Table 2 summarizes all predictive DCs, along with the estimated coefficients. Positive coefficients indicate that an increase in the respective DC raises the probability of a hypothesis being validated.
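A sketch of this selection step is shown below. The data frame is hypothetical, with a binary per-dataset outcome and three DCs; the study's exact stepwise settings and candidate DC set may differ.

# Sketch of the stepwise logistic regression linking data characteristics
# (DCs) to whether a hypothesis holds in a dataset. All data are simulated
# placeholders; here the effect size is constructed to drive the outcome.
set.seed(1)
permanova_r2 <- runif(380)                         # effect size
P0           <- runif(380)                         # proportion of zeros
nsamples     <- sample(20:200, 380, replace = TRUE)
holds        <- rbinom(380, 1, plogis(-2 + 4 * permanova_r2))
dat <- data.frame(holds, permanova_r2, P0, nsamples)

null_model <- glm(holds ~ 1, family = binomial, data = dat)
full_scope <- holds ~ permanova_r2 + P0 + nsamples
step_model <- step(null_model, scope = full_scope, direction = "both", trace = FALSE)
summary(step_model)$coefficients   # positive coefficients raise P(hypothesis holds)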

Next, decision trees were generated using the predictive DCs for each hypothesis. To improve interpretability, we applied a rank transformation to the DC values and scaled the ranks to the [0,1] interval, so that the thresholds have a more intuitive meaning. Figure 8D presents an example for hypothesis 13.1, which holds true in only 9% of cases across all datasets (denoted as “average = 0.09” in the figure). However, for the datasets with the highest 10% of the DC “mean_corr_sample” (right branch), the hypothesis is validated in all cases (“average = 1.00”). In the left branch, the hypothesis was false for almost all datasets (average = 0.02, 90% of datasets). A similar pattern was observed for secondary hypothesis 8, where again a decision tree with two branches was found, based on nsamples as the DC. After applying the denoted threshold, the hypothesis held true for all datasets in the right branch (average = 1.00 for 5% of datasets), while it was false for all datasets in the left branch (average = 0.00 for 95% of datasets).

Panel F presents a more complex decision tree to which three DCs contribute. Here, two branches were identified in which hypothesis 13 was true for all datasets (average = 1.00 for 9% and 24% of datasets, respectively). However, in the third branch, the hypothesis was valid for only 17% of datasets (average = 0.17 for 68% of all datasets). Supplemental Figure 5 includes decision trees for all analyzed hypotheses.
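The decision-tree step on rank-transformed DCs can be sketched as follows; the use of the rpart package and the simulated data are assumptions for illustration.

# Sketch of the decision-tree analysis on rank-transformed DCs. rpart is
# assumed here; the data are simulated placeholders.
library(rpart)
set.seed(1)
permanova_r2 <- runif(380)
P0           <- runif(380)
nsamples     <- sample(20:200, 380, replace = TRUE)
holds        <- factor(rbinom(380, 1, plogis(-2 + 4 * permanova_r2)))

to_unit_rank <- function(x) (rank(x) - 1) / (length(x) - 1)   # scale ranks to [0, 1]
dat <- data.frame(holds,
                  permanova_r2 = to_unit_rank(permanova_r2),
                  P0           = to_unit_rank(P0),
                  nsamples     = to_unit_rank(nsamples))

tree <- rpart(holds ~ permanova_r2 + P0 + nsamples, data = dat, method = "class")
print(tree)   # split thresholds are scaled ranks, i.e. approximately quantiles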

Overall, we found that certain DCs efficiently separate datasets where hypotheses hold from those where they do not. The effect size, data sparsity, and number of samples were most frequently selected as predictive factors. However, obtaining more reliable results would require a more comprehensive sampling of DCs as well as an out-of-sample validation of the derived decision rules—both would require simulated data with varying DCs which is beyond the scope of this study.

Discussion

In this study, we first demonstrated the feasibility of generating synthetic data that closely resemble experimental DCs. At an individual DC level, most of the simulated data characteristics aligned well with the experimental templates, with only a few showing notable discrepancies. Equivalence tests confirmed that the majority of DCs were successfully reproduced in most cases, reinforcing the validity of our synthetic data generation approach.

Calibrating and then applying two simulation tools revealed that metaSPARSim3 consistently outperformed sparseDOSSA24 when comparing DCs of simulated and experimental data, particularly in capturing DCs at the PCA summary level. While the outcomes of both simulation tools were very similar at the level of individual hypotheses, sparseDOSSA24 struggled with calibration of 11 templates, highlighting limitations in its applicability to large datasets.

The computational requirements of this study posed a major challenge, owing to the complexity and multitude of analyses at several levels, in particular simulation calibration, calculation of DCs, differential abundance testing, and the comprehensive computations for the hypothesis evaluations. We used 38 experimental datasets, 2 simulation tools, 14 DA methods, 4 pipelines, and N = 10 simulated datasets per template. Moreover, the analyses were done on filtered and unfiltered data, and multiple data subsets were analyzed to assess the robustness of the outcomes. In addition, the split-and-merge procedure was conducted to reduce the number of test failures and to introduce taxa that do not differ between the two groups of samples, and all tests had to be applied to these data as well. Code development, checks, conducting the analyses, and subsequent bug fixing required approximately 10 runs of parts of the implemented source code on average. Although we could run the analysis on a powerful compute server with 96 CPUs and parallelized code execution at multiple levels, code execution still took months in total. Moreover, the execution of our code regularly stopped unexpectedly, mainly due to memory load issues, although our compute server had more than 500 GB of RAM. Therefore, we had to implement the analysis pipeline such that the analyses can be resumed at all major stages, which in turn required storage of intermediate results. Although we tried to keep disk space requirements minimal, the compressed storage of our results requires around 140 GiB.

Importantly, our study confirmed several major findings from the reference benchmark study.1 Specifically, trends in the overlap profiles of significant features and the conservativeness of certain differential abundance methods were largely consistent with previous observations. However, we also identified notable discrepancies, particularly regarding the behavior of methods such as LEfSe and ANCOM-II. These differences may stem from implementation variations, from the fact that we evaluated the hypotheses at the level of individual datasets instead of merging all taxa of all datasets, or from intrinsic differences between simulated and experimental data.

To ensure the highest level of comparability with the reference study, we adhered as closely as possible to its methodology. We reused the published code for calling the DA methods, applied identical configuration parameter settings where reasonable, and maintained consistency of the analysis pipelines. Nevertheless, some disparities arose, particularly for methods not available as R packages, which necessitated the integration of several lines of code in our study as well as in the reference study.

At the hypothesis validation level, we were able to confirm approximately 25% of the reference study's findings, while an additional 33% showed similar trends without meeting the strict validation criteria. The remaining 42% of the hypotheses could not be confirmed, either due to clearly different outcomes or due to overly stringent formulations of the hypotheses and the corresponding statistical analyses in our study protocol.

One key methodological advancement was to conduct our study according to a formalized study protocol, a process that required significantly greater effort in terms of planning, execution, and documentation. The strict adherence to our protocol and the avoidance of analytical shortcuts resulted in an additional workload that we estimate at more than twice that of a traditional benchmark study. Furthermore, at the time of protocol writing, no established study protocol checklist for computational benchmarking existed; we therefore followed the SPIRIT guidelines. More recently, dedicated benchmarking protocol templates have become available, which may help streamline similar efforts in the future.

Despite the additional effort, we remain convinced that conducting benchmark studies with preregistered study protocols is essential for achieving more reliable and unbiased assessments of computational methods. This approach not only enhances methodological rigor but also reduces biased interpretations that may arise from preconceived expectations about the results. Consequently, it supports the development of robust guidelines for optimal method selection and contributes to establishing community standards.

To align with this methodological concept, we focused on evaluating the feasibility of using simulated data to validate previously reported results from experimental data. However, we did not yet assess the sensitivity and specificity of DA methods, which is feasible when using simulated data. Additionally, a systematic modification of key data characteristics, such as effect size, sparsity, and sample size, could provide deeper insights, as these were identified as the most influential factors affecting DA method behavior.

A promising next step in research would be to more comprehensively sample the space of data characteristics. Combining this with an evaluation of sensitivity and specificity would allow for the derivation of recommendations and decision rules regarding the optimal selection of a DA method based on dataset-specific characteristics.

Reporting guidelines

The detailed study protocol (2) was developed following the SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) 2013 Checklist. It was published before the study was conducted.

Data are available under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Ethics and consent

Ethical approval and consent were not required.
