F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.73520.3

Method Article

Articles

Using multiple outcomes in intervention studies: improving power while controlling type I errors

[version 3; peer review: 2 approved]

Bishop

Dorothy V. M.

Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

ORCID: https://orcid.org/0000-0002-2448-4033

Affiliation: Department of Experimental Psychology, University of Oxford, Oxford, Oxon, OX2 6GG, UK

Correspondence: dorothy.bishop@psy.ox.ac.uk

No competing interests were disclosed.

6 11 2023

2021

991

30 10 2023

2023

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The CONSORT guidelines for clinical trials recommend using a single primary outcome, to guard against excess false positive findings when multiple measures are considered. However, statistical power can be increased while controlling the familywise error rate if multiple outcomes are included. The MEff statistic is well-suited to this purpose, but is not well-known outside genetics.

Methods

Data were simulated for an intervention study, with a given sample size (N), effect size (E) and correlation matrix for a suite of outcomes (R). Using the variance of eigenvalues from the correlation matrix, we compute MEff, the effective number of variables that the alpha level should be divided by to control the familywise error rate. Various scenarios are simulated to consider how MEff is affected by the pattern of pairwise correlations within a set of outcomes. The power of the MEff approach is compared to that of Bonferroni correction and of principal component analysis (PCA).

Results

In many situations, power can be increased by inclusion of multiple outcomes. Differences in power between MEff and Bonferroni correction are small if intercorrelations between outcomes are low, but the advantage of MEff is more evident as intercorrelations increase. PCA is superior in cases where the impact on outcomes is fairly uniform, but MEff is applicable when intervention effects are inconsistent across measures.

Conclusions

The optimal method for correcting for multiple testing depends on the underlying data structure, with PCA being superior if outcomes are all indicators of a common underlying factor. Both Bonferroni correction and MEff can be applied post hoc to evaluate published intervention studies, with MEff being superior when outcomes are moderately or highly correlated. A lookup table is provided to give alpha levels for use with MEff for cases where the correlation between outcome measures can be estimated.

Keywords: intervention, methodology, statistics, correlated outcomes, power, familywise error rate, multiple comparisons

The author(s) declared that no grants were involved in supporting this work.

Revised Amendments from Version 2

This revised version has two substantial changes. a) Figures 2 - 4 have been revised in line with suggestions by reviewer 1, to remove the lines corresponding to effect size of .8. b) Additional real-world examples of studies are provided where the look-up table (Table 2) may be used, but where original raw data was not available. These stimulated further thoughts about the need to consider the nature of the relationship between outcome measures and an intervention, which are picked up in the Discussion. Both for appropriate analysis and for interpretation, it is important to decide whether multiple outcomes are to be regarded as alternate indicators of a common underlying construct, or whether they reflect latent variables that may respond differently to the intervention.

Issues raised by inclusion of multiple outcomes

The CONSORT guidelines for clinical trials (Moher et al., 2010) are very clear on the importance of having a single primary outcome:

All RCTs assess response variables, or outcomes (end points), for which the groups are compared. Most trials have several outcomes, some of which are of more interest than others. The primary outcome measure is the pre-specified outcome considered to be of greatest importance to relevant stakeholders (such as patients, policy makers, clinicians, funders) and is usually the one used in the sample size calculation. Some trials may have more than one primary outcome. Having several primary outcomes, however, incurs the problems of interpretation associated with multiplicity of analyses and is not recommended.

This advice often creates a dilemma for the researcher: in many situations there are multiple measures that could plausibly be used to index the outcome (Vickerstaff, Ambler, King, Nazareth, & Omar, 2015). If we have several outcomes and we would be interested in improvement on any measure, then we need to consider the familywise error rate, i.e. the probability of at least one false positive in the whole set of outcomes. For instance, if we want to set the false positive rate, alpha, to .05, and we have six independent outcomes, none of which is influenced by the intervention, the probability that none of the tests of outcome effects is significant will be .95^6, which is .735. Thus the probability that at least one outcome is significant, the familywise error rate, is 1 - .735, which is .265. In other words, in about one quarter of studies, we would see a false positive when there is no true effect. The larger the number of outcomes, the higher the familywise false positive rate.

A common solution is to apply a Bonferroni correction by dividing the alpha level by the number of outcome measures - in this example .05/6 = .008. This way the familywise error rate is kept at .05. But this is over-conservative if, as is usually the case, the various outcomes are intercorrelated.
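The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only an illustrative sketch, assuming independent outcomes and no true effect; the function names are hypothetical and are not taken from the author's R scripts.

```python
def familywise_error_rate(alpha: float, n_outcomes: int) -> float:
    """Chance of at least one false positive across independent null outcomes."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def bonferroni_alpha(alpha: float, n_outcomes: int) -> float:
    """Per-outcome alpha after Bonferroni correction."""
    return alpha / n_outcomes


# Six independent outcomes, each tested at alpha = .05 (the example in the text).
uncorrected = familywise_error_rate(0.05, 6)

# The same six outcomes, each tested at the Bonferroni-corrected alpha.
corrected = familywise_error_rate(bonferroni_alpha(0.05, 6), 6)

print(f"uncorrected familywise error rate: {uncorrected:.3f}")
print(f"Bonferroni-corrected alpha:        {bonferroni_alpha(0.05, 6):.3f}")
print(f"familywise error rate after correction: {corrected:.3f}")
```

With independent outcomes the corrected familywise rate sits just under .05. When outcomes are correlated it falls further below .05, which is the sense in which Bonferroni correction is over-conservative.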

Various methods have been developed to address the problem of multiple testing. One approach is to adopt some process of data reduction, such as extracting a principal component from the measures that can be used as the primary outcome. Alternatively, a permutation test can be used to derive exact probability of an observed pattern of results. Neither approach, however, is helpful if the researcher is evaluating a published paper where an appropriate correction has not been made. These could be cases where no correction is made for multiple testing, risking a high rate of false positives, or where Bonferroni correction has been applied despite using correlated outcomes, which will be overconservative in rejecting the null hypothesis. The goal of the current article is to provide some guidance for interpretation of published papers where the raw data are not available for recomputation of statistics.

Vickerstaff et al. (2015) reviewed 209 trials in neurology and psychiatry, and found that 60 reported multiple primary outcomes, of which 45 did not adjust for multiplicity. Those that did adjust mostly used the Bonferroni correction. Thus it would appear that many researchers feel the need to include several outcomes, but the resulting multiplicity is not always adjusted for appropriately.

In a review of an earlier version of this paper, Sainani (2021) pointed out that the MEff statistic, originally developed in the field of genetics by Cheverud (2001) and Nyholt (2004), provided a simple way of handling this situation. With this method, one computes eigenvalues from the correlation matrix of outcomes, which reflect the degree of intercorrelation between them. The mathematical definition of an eigenvalue can be daunting, but an intuitive sense of how it relates to correlations can be obtained by considering the cases shown in Table 1. This shows how eigenvalues vary with the correlation structure of a matrix, using an example of six outcome measures. The number of eigenvalues, and the sum of the eigenvalues, is identical to the number of measures. Let us start by assuming a matrix in which all off-diagonal values are equal to r. It can be seen that when the correlation is zero, each eigenvalue is equal to one, and the variance of the eigenvalues is zero. When the correlation is one, the first eigenvalue is equal to six, all other eigenvalues are zero, and the variance of the eigenvalues is six. As correlations increase from .2 to .8, the size of the first eigenvalue increases, and that of the other eigenvalues decreases.

Table 1. Eigenvalues, MEff and AlphaMEff with 6 outcome variables.

r	Eigen1	Eigen2	Eigen3	Eigen4	Eigen5	Eigen6	Var	MEff	AlphaMEff
0	1.0	1.0	1.0	1.0	1.0	1.0	0.00	6.0	0.008
0.2	2.0	0.8	0.8	0.8	0.8	0.8	0.24	5.8	0.009
0.4	3.0	0.6	0.6	0.6	0.6	0.6	0.96	5.2	0.010
0.6	4.0	0.4	0.4	0.4	0.4	0.4	2.16	4.2	0.012
0.8	5.0	0.2	0.2	0.2	0.2	0.2	3.84	2.8	0.018
1	6.0	0.0	0.0	0.0	0.0	0.0	6.00	1.0	0.050

In Table 1, r is the intercorrelation between the six outcomes, Eigen1 to Eigen6 are the eigenvalues, and Var is the variance of the six eigenvalues, which is used to compute MEff (the effective number of comparisons) from the formula:

$$\mathrm{MEff} = 1 + (N - 1)\left(1 - \frac{\mathrm{Var}(\mathrm{Eigen})}{N}\right)$$

where N is the number of outcome measures, and Eigen is the set of N eigenvalues.

This value is then used to compute the corrected alpha level, AlphaMEff. Assuming we set alpha to .05, AlphaMEff is .05 divided by MEff. One can see that this value is equivalent to the Bonferroni-corrected alpha (.05/6) when there is no correlation between variables, and equivalent to .05 when all variables are perfectly correlated.
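As an illustration of the formula, the following Python sketch computes MEff and AlphaMEff from a correlation matrix and prints the rows of Table 1. It is not the author's code (the published scripts are in R). The function names are hypothetical, and it assumes the sample variance (divisor N - 1) of the eigenvalues, which is what reproduces the Var column of Table 1.

```python
import numpy as np


def meff(corr: np.ndarray) -> float:
    """Effective number of comparisons from a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)   # symmetric matrix, so eigenvalues are real
    var_eigen = np.var(eigenvalues, ddof=1)  # sample variance (assumption, matches Table 1)
    return float(1.0 + (n - 1) * (1.0 - var_eigen / n))


def uniform_corr(n: int, r: float) -> np.ndarray:
    """n x n correlation matrix with every off-diagonal entry equal to r."""
    corr = np.full((n, n), r, dtype=float)
    np.fill_diagonal(corr, 1.0)
    return corr


# Six outcomes with uniform intercorrelation r, as in Table 1.
for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    m = meff(uniform_corr(6, r))
    print(f"r = {r:.1f}   MEff = {m:.1f}   AlphaMEff = {0.05 / m:.3f}")
```

The same `meff` function accepts any correlation matrix, so it could also be applied to an observed matrix of outcome intercorrelations rather than a uniform one.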

Derringer (2018) provided a useful tutorial on MEff, noting that it is not well-known outside the field of genetics, but is well-suited to the field of psychology. Her preprint includes links to R scripts for computing MEff and illustrates their use in three datasets.

These resources will be sufficient for many readers interested in using MEff, but researchers may find it useful to have a look-up table for the case when they are evaluating existing studies. The goal of this paper is two-fold:

A. To consider how inclusion of multiple outcome measures affects statistical power, relative to the case of a single outcome, when appropriate correction of the familywise error rate is made using MEff. Results from MEff are compared with use of Bonferroni correction and analysis of the first component derived from Principal Components Analysis (PCA).

B. To provide a look-up table to help evaluate studies with multiple outcome measures, without requiring the reader to perform complex statistical analyses.

These goals are achieved in three sections below:

1. Power to detect a true effect using MEff is calculated from simulated data for a range of values of sample size (N), effect size (E) and the matrix of intercorrelation between outcomes (R).

2. A lookup table is provided that gives values of MEff, and associated adjusted alpha-levels for different set sizes of outcome measures, with mean pairwise correlation varying from 0 to 1 in steps of .1.

3. Use of the lookup table is shown for real-world examples of application of MEff using published articles.

Alternative approach, MinNVar

In the original version of this manuscript (Bishop, 2021), an alternative approach, MinNVar, was proposed, in which the focus was on the number of outcome variables achieving a conventional .05 level of significance. As noted by reviewers, this has the drawback that it could not reflect continuous change in probability levels, because it was based on integer values (i.e. number of outcomes). This made it overconservative in some cases, where adopting the MinNVar approach gave a familywise error rate well below .05. One reason for proposing MinNVar was to provide a very easy approach to evaluating studies that had multiple outcomes, using a lookup table to check the number of outcomes needed, depending on overall correlation between measures. However, it is equally feasible to provide lookup tables for MEff, which is preferable on other grounds, and so MinNVar is not presented here; interested readers can access the first version of this paper to evaluate that approach.

Use of one-tailed p-values

In the simulations described here, one-tailed tests are used. Two-tailed p-values are far more common in the literature, perhaps because one-tailed tests are often abused by researchers, who may switch from a two-tailed to a one-tailed p-value in order to nudge results into significance.

This is unfortunate because, as argued by Lakens (2016), provided one has a directional hypothesis, a one-tailed test is more efficient than a two-tailed test. It is a reasonable assumption that in intervention research, which is the focus of the current paper, the hypothesis is that an outcome measure will show improvement. Of course, interventions can cause harms, but, unless those are the focus of study, we have a directional prediction for improvement.

Methods

Correlated variables were simulated using the R programming language (R Core Team, 2020) (R Project for Statistical Computing, RRID:SCR_001905). The script to generate and analyse simulated data is available at https://osf.io/hsaky/. For each model specified below, 2000 simulations were run. Note that to keep analysis simple, a single value was simulated for each case, rather than attempting to model pre- vs post-intervention change. Data for the two groups were generated by the same process, except that a given effect size was added to scores of the intervention group, I, but not to the control group, C. Scores of the two groups were compared using a one-tailed t-test for each run.

Power was computed for different levels of effect size (E), correlation between outcomes (R) and sample size per group (N) for the following methods:

a) Bonferroni-corrected data: Proportion of runs where p was less than the Bonferroni-corrected value for at least one outcome.

b) MEff-corrected data: Proportion of runs where p was less than the AlphaMEff value for at least one outcome.

c) Principal component analysis (PCA): Proportion of runs where p was below .05 when groups I and C were compared on scores on the first principal component of PCA.

Method for simulating outcomes

Simulating multivariate data forces one to consider how to conceptualise the relationship between an intervention and multiple outcomes. Implicit in the choice of method is an underlying causal model that includes mechanisms that lead measures to be correlated.

In the simulation, outcomes were modelled as indicators of one or more underlying latent factors, which mediate the intervention effect. This can be achieved by first simulating a latent factor, with an effect size of either zero, for group C, or E for group I. Observed outcome measures are then simulated as having a specific correlation with the latent variable - i.e. the correlation determines the extent to which the outcomes act as indicators of the latent variable. This can be achieved using the formula:

$$r L + \sqrt{1 - r^{2}}\, e$$

where r is the correlation between the latent variable (L) and each outcome, L is a vector of random normal deviates that is the same for each outcome variable, and e (error) is a vector of random normal deviates that differs for each outcome variable. Note that when outcome variables are generated this way, the mean intercorrelation between them will be r². Thus if we want a set of outcome variables with mean intercorrelation of .4, we need to set r in the formula above to sqrt(.4) = .632. Furthermore, the effect size for the simulated variables will be lower than for the latent variable: to achieve an effect size, E, for the outcome variables, it is necessary to specify the effect size for the latent variable, E_l, as E/r, i.e. E divided by the square root of the desired mean intercorrelation.

Note that the case where r = 0 is not computable with this method - i.e. it is not possible to have a set of outcomes that are indicators of the same latent factor but which are uncorrelated. The lowest value of r that was included was r = .2.
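The data-generating step described above can be sketched in Python as follows. This is an illustrative reconstruction rather than the author's R code. The function name, the seed and the large number of cases are assumptions chosen so that the sample statistics sit close to their expected values.

```python
import numpy as np


def simulate_group(n_cases, n_outcomes, target_r, effect, rng):
    """Simulate outcomes that are indicators of one latent factor.

    target_r is the desired mean intercorrelation between outcomes and
    effect is the desired effect size on the observed outcomes.
    """
    if not 0.0 < target_r <= 1.0:
        raise ValueError("target_r must be in (0, 1]; r = 0 is not computable")
    loading = np.sqrt(target_r)        # correlation of each outcome with the latent factor
    latent_effect = effect / loading   # latent effect size needed to yield `effect`
    latent = rng.normal(loc=latent_effect, scale=1.0, size=n_cases)
    error = rng.normal(size=(n_cases, n_outcomes))  # differs for each outcome
    return loading * latent[:, None] + np.sqrt(1.0 - loading**2) * error


rng = np.random.default_rng(2021)  # arbitrary seed
control = simulate_group(20000, 4, target_r=0.4, effect=0.0, rng=rng)
treated = simulate_group(20000, 4, target_r=0.4, effect=0.5, rng=rng)

corr = np.corrcoef(treated, rowvar=False)
mean_offdiag = corr[np.triu_indices(4, k=1)].mean()
print(f"mean intercorrelation: {mean_offdiag:.3f}")
print(f"mean group difference: {treated.mean() - control.mean():.3f}")
```

With these settings the mean intercorrelation should be close to .4 and the mean group difference close to .5, consistent with the scaling of r and E_l described in the text.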

The initial simulation, designated as Model L1, treated all outcome measures as equivalent. In practice, of course, we will observe different effect sizes for different outcomes, but in Model L1, this is purely down to the play of chance: all outcomes are indicators of the same underlying factor, as shown in the heatmap in Figure 1, Model L1.

Figure 1. Models for data generation.

Heatmap depicts correlations between observed variables V1 to V4 and Latent factors, where colour denotes association. A diagonal line through a latent factor indicates it is not related to intervention.

In two additional models, rather than being indicators of the same uniform latent variable, the outcomes correspond to different latent factors. This would correspond to the kind of study described by Vickerstaff et al. (2021), where an intervention for obesity included outcomes relating to weight and blood glucose levels. Following suggestions by Sainani (2021), a set of simulations was generated to consider relative power of different methods when there are two underlying latent factors that generate the outcomes. In Model L2, there are two independent latent factors, both affected by intervention. In Model L2x, the intervention only influences the first latent factor. The computational approach was the same as for Model L1, but with two latent factors, each used to generate a block of variables. The two latent factors are uncorrelated.

The size of the suite of outcome variables entered into later analysis ranged from 2 to 8. For each suite size, principal components were computed from data from the C and I groups combined, using the base R function prcomp from the stats package (R Core Team, 2020). Thus, PC2 is a principal component based on the first two outcome measures, PC4 based on the first four outcome measures, and so on.
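The three power definitions given in the Methods (Bonferroni, MEff and PCA) can be sketched for Model L1 as follows. This is a simplified Python illustration, not the author's R script, and it makes several assumptions that the text does not specify: MEff is computed from the sample correlation matrix of the two groups combined, outcomes are standardised before the principal component is extracted, the sign of the component is oriented so that higher scores mean better outcomes, and 500 runs are used rather than 2000 to keep it quick. It relies on SciPy's `ttest_ind` with `alternative="greater"`, which requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats


def meff_alpha(data, alpha=0.05):
    """AlphaMEff from the sample correlation matrix of the outcomes."""
    n = data.shape[1]
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    m = 1.0 + (n - 1) * (1.0 - np.var(eigenvalues, ddof=1) / n)
    return alpha / m


def first_pc_scores(data):
    """Scores on the first principal component of standardised outcomes."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    weights = vt[0]
    if weights.sum() < 0:  # component sign is arbitrary, so orient it upward
        weights = -weights
    return z @ weights


def estimate_power(n_per_group, n_outcomes, target_r, effect,
                   n_runs=500, alpha=0.05, seed=1):
    """Proportion of simulated runs that are significant under each method."""
    rng = np.random.default_rng(seed)
    loading = np.sqrt(target_r)
    hits = {"bonferroni": 0, "meff": 0, "pca": 0}
    for _ in range(n_runs):
        groups = []
        for latent_shift in (0.0, effect / loading):  # control, then intervention
            latent = rng.normal(latent_shift, 1.0, n_per_group)
            error = rng.normal(size=(n_per_group, n_outcomes))
            groups.append(loading * latent[:, None] + np.sqrt(1.0 - target_r) * error)
        control, treated = groups
        # One-tailed t-test per outcome (columns are tested independently).
        p = stats.ttest_ind(treated, control, alternative="greater").pvalue
        both = np.vstack([control, treated])
        hits["bonferroni"] += int(p.min() < alpha / n_outcomes)
        hits["meff"] += int(p.min() < meff_alpha(both, alpha))
        pc = first_pc_scores(both)
        p_pc = stats.ttest_ind(pc[n_per_group:], pc[:n_per_group],
                               alternative="greater").pvalue
        hits["pca"] += int(p_pc < alpha)
    return {name: count / n_runs for name, count in hits.items()}


power = estimate_power(n_per_group=50, n_outcomes=4, target_r=0.4, effect=0.3)
print(power)
```

Because AlphaMEff is never smaller than the Bonferroni-corrected alpha, the MEff power estimate in this sketch can never fall below the Bonferroni estimate for the same simulated runs.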

Results

Power calculations

Sample plots comparing power for Bonferroni correction, MEff and PCA are shown for sample size of 50 per group in Figures 2 to 4. Plots for smaller (N = 20) and larger (N = 80) sample sizes are available online (https://osf.io/k6xyc/), and show the same basic pattern.

Figure 2. Model L1, 50 per group.

Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.

Figure 3. Model L2: 50 per group.

Figure 4. Model L2x: 50 per group.

Figure 2 shows the simplest situation when there are between 2 and 8 outcome measures, all of which are derived from the same latent variable (Model L1). Different levels of intercorrelation between the outcomes (ranging from .2 to .8 in steps of .2) are shown in columns.

Several points emerge from inspection of this figure. First, when intercorrelation between measures is low to medium (.2 to .6), power increases as the number of outcome measures increases. Furthermore, the power is greater when PCA is used than when MEff or Bonferroni correction is applied. MEff is generally somewhat better-powered than Bonferroni, and Bonferroni has lower power than a single outcome measure when there is a large number of highly intercorrelated outcome measures (r = .8).

In practice, it may be the case that outcome measures are not all reflective of a common latent factor. Figure 3 shows results from Model L2, where outcome measures form two clusters, each associated with a different latent factor (see Figure 1). Here both latent factors are associated with improved outcomes in the intervention group.

Once again, power increases with number of outcomes when there are low to modest intercorrelations between outcomes. For this model, PCA no longer has such a clear advantage. This makes sense, given that PCA will not derive a single main factor when the underlying data structure contains two independent factors.

Figure 4 shows equivalent results for Model L2x, where we have a mixture of two types of outcome, one of which is influenced by intervention, and the other is not. This complicates calculation of power for a single variable, since power will depend on whether we select one of the outcomes that is influenced by intervention or not. The symbols in Figure 4 show average power, assuming we might select either type of outcome with equal frequency. We see that in this situation, MEff is clearly superior to PCA except when we have a large number of outcomes, a small effect size and weak intercorrelation between outcomes.

Deriving a lookup table

Table 2 shows corrected alpha values based on MEff, varying according to the correlation between outcome measures, and the number of outcome measures in the study. In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known.

Table 2. AlphaMEff for different correlation values (corr) with 2-12 outcome variables (N2 to N12), based on Model L1.

corr	N2	N3	N4	N5	N6	N7	N8	N9	N10	N11	N12
0.0	0.025	0.017	0.013	0.010	0.008	0.007	0.006	0.006	0.005	0.005	0.004
0.1	0.025	0.017	0.013	0.010	0.008	0.007	0.006	0.006	0.005	0.005	0.004
0.2	0.026	0.017	0.013	0.010	0.009	0.007	0.006	0.006	0.005	0.005	0.004
0.3	0.026	0.018	0.013	0.011	0.009	0.008	0.007	0.006	0.005	0.005	0.005
0.4	0.027	0.019	0.014	0.011	0.010	0.008	0.007	0.006	0.006	0.005	0.005
0.5	0.029	0.020	0.015	0.013	0.011	0.009	0.008	0.007	0.006	0.006	0.005
0.6	0.030	0.022	0.017	0.014	0.012	0.010	0.009	0.008	0.007	0.007	0.006
0.7	0.033	0.025	0.020	0.016	0.014	0.012	0.011	0.010	0.009	0.008	0.008
0.8	0.037	0.029	0.024	0.020	0.018	0.016	0.014	0.013	0.012	0.011	0.010
0.9	0.042	0.036	0.032	0.028	0.026	0.023	0.021	0.020	0.018	0.017	0.016
1.0	0.050	0.050	0.050	0.050	0.050	0.050	0.050	0.050	0.050	0.050	0.050
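For a uniform correlation matrix the eigenvalues have a simple closed form: one eigenvalue equals 1 + (N - 1)r and the remaining N - 1 eigenvalues equal 1 - r. The sketch below uses that fact to regenerate a table with the layout of Table 2 using only the Python standard library. It is offered as a cross-check under the Model L1 assumption of uniform intercorrelation, with a hypothetical function name, and is not the author's code.

```python
def alpha_meff_uniform(n_outcomes: int, r: float, alpha: float = 0.05) -> float:
    """AlphaMEff when every pair of outcomes has the same correlation r."""
    largest = 1.0 + (n_outcomes - 1) * r
    eigenvalues = [largest] + [1.0 - r] * (n_outcomes - 1)
    mean = sum(eigenvalues) / n_outcomes
    # Sample variance (divisor N - 1), as used for the Var column of Table 1.
    var = sum((e - mean) ** 2 for e in eigenvalues) / (n_outcomes - 1)
    meff = 1.0 + (n_outcomes - 1) * (1.0 - var / n_outcomes)
    return alpha / meff


sizes = range(2, 13)
print("corr  " + "  ".join(f"N{n:<4}" for n in sizes))
for step in range(11):
    r = step / 10
    row = "  ".join(f"{alpha_meff_uniform(n, r):.3f}" for n in sizes)
    print(f"{r:.1f}   {row}")
```

In this uniform case the variance of the eigenvalues works out to N r², so the formula reduces to MEff = 1 + (N - 1)(1 - r²). That makes it possible to obtain an adjusted alpha for correlation values between the tabulated steps.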

Model L1, used to generate these data, assumes there will be a uniform intercorrelation between outcome measures in the population. This is likely to be unrealistic. Nevertheless, further simulations showed that values for MEff are reasonably consistent for different correlation matrices that all have the same average off-diagonal correlation. Consider, for instance, the correlations between 4 variables shown in Figure 1 for Model L2. Within the blocks V1-V2 and V3-V4 the intercorrelation is r, but between blocks the intercorrelation is zero. There are six off-diagonal correlations and the mean off-diagonal is (2 * r/6). For instance, if r equals .5, then the mean off-diagonal value is .167. To see how the MEff correction is affected by correlation structure, we can compare MEff for Model L2 with the MEff obtained in Model L1 with the same off-diagonal correlation. This exercise shows that they are similar, as shown in Table 3.

Table 3. AlphaMEff values for Model L2 (odd rows) and Model L1 (even rows), with same mean off diagonal r. For Model L2, “Start r” is the value for nonzero off-diagonal correlations.

Start r	Model	Mean offdiag r	Alpha.MEff.4	Alpha.MEff.6	Alpha.MEff.8
0.2	L2	0.086	0.013	0.008	0.006
0.2	L1	0.086	0.013	0.008	0.006
0.3	L2	0.129	0.013	0.009	0.006
0.3	L1	0.129	0.013	0.008	0.006
0.4	L2	0.171	0.013	0.009	0.007
0.4	L1	0.171	0.013	0.009	0.006
0.5	L2	0.214	0.013	0.009	0.007
0.5	L1	0.214	0.013	0.009	0.007
0.6	L2	0.257	0.014	0.009	0.007
0.6	L1	0.257	0.013	0.009	0.007
0.7	L2	0.300	0.014	0.010	0.008
0.7	L1	0.300	0.013	0.009	0.007
0.8	L2	0.343	0.015	0.011	0.008
0.8	L1	0.343	0.013	0.009	0.007

In other words, if estimating MEff from existing data, it is reasonable to base the estimate on the average off-diagonal correlation, regardless of whether the pattern of intercorrelations is uniform.
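The comparison behind Table 3 can be sketched in Python by building a two-block correlation matrix (Model L2) and a uniform matrix (Model L1) with the same mean off-diagonal correlation, then computing AlphaMEff for each. The helper names are hypothetical and the blocks are assumed to be of equal size. The author's own comparison was run in R.

```python
import numpy as np


def alpha_meff(corr: np.ndarray, alpha: float = 0.05) -> float:
    """AlphaMEff from a correlation matrix, using the eigenvalue variance."""
    n = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)
    meff = 1.0 + (n - 1) * (1.0 - np.var(eigenvalues, ddof=1) / n)
    return float(alpha / meff)


def block_corr(n_outcomes: int, r: float) -> np.ndarray:
    """Two equal blocks: correlation r within a block, zero between blocks."""
    half = n_outcomes // 2
    corr = np.zeros((n_outcomes, n_outcomes))
    corr[:half, :half] = r
    corr[half:, half:] = r
    np.fill_diagonal(corr, 1.0)
    return corr


def uniform_corr(n_outcomes: int, r: float) -> np.ndarray:
    """Every off-diagonal correlation equal to r."""
    corr = np.full((n_outcomes, n_outcomes), r, dtype=float)
    np.fill_diagonal(corr, 1.0)
    return corr


def mean_offdiag(corr: np.ndarray) -> float:
    """Mean of the off-diagonal entries of a correlation matrix."""
    n = corr.shape[0]
    return float((corr.sum() - np.trace(corr)) / (n * (n - 1)))


for n_outcomes in (4, 6, 8):
    for start_r in (0.2, 0.5, 0.8):
        l2 = block_corr(n_outcomes, start_r)
        l1 = uniform_corr(n_outcomes, mean_offdiag(l2))
        print(f"N = {n_outcomes}  start r = {start_r:.1f}  "
              f"mean offdiag = {mean_offdiag(l2):.3f}  "
              f"L2 alpha = {alpha_meff(l2):.3f}  L1 alpha = {alpha_meff(l1):.3f}")
```

For four outcomes and a within-block correlation of .5, `mean_offdiag` returns 2 * .5 / 6 = .167, the value worked through in the text.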

Examples of application to published studies

Use of the lookup Table 2 can be illustrated with data from a study by Burgoyne et al. (2012), which evaluated a reading and language intervention for children with Down syndrome. A large number of assessments was carried out over various time points, but our focus here is on the five outcome measures that had been designated as “primary”, as they were “proximal to the content of the intervention”, i.e., they measured skills and knowledge that had been explicitly taught. The p-values reported by the authors (see Table 4) come from analyses of covariance comparing differences between intervention and control groups after 20 weeks of intervention, controlling for baseline performance, age and gender.

Table 4. P-values from Burgoyne et al. (2012).

Bonferroni and MEff alpha for 5 variables with mean correlation of .6.

Measure	Reported p.value	Bonferroni: alpha = .01	MEff: alpha = .014
Letter-Sound knowledge	0.002	*	*
Phoneme blending	0.022
Single word reading	0.002	*	*
Taught expressive Vocabulary	0.011		*
Taught receptive Vocabulary	0.062

Whereas the Bonferroni-corrected alpha can be computed simply from knowledge of the number of outcome measures, the MEff-corrected alpha requires knowledge of the mean correlation between the outcome measures. In this case, the mean correlation could be computed (r = .581), as the data were available in a repository (Burgoyne et al., 2016). From Table 2, we see that with five outcome measures and r = .6, the adjusted alpha is .014. In this example, three outcomes have p-values below the critical alpha when MEff is used. If the more stringent Bonferroni correction is applied, only two outcomes achieve significance.
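The comparison in Table 4 amounts to checking each reported p-value against the two thresholds. A minimal Python sketch follows, using the p-values reported in Table 4 and the AlphaMEff value read from Table 2. The variable and function names are illustrative only.

```python
# Reported p-values for the five primary outcomes (Table 4).
p_values = {
    "Letter-Sound knowledge": 0.002,
    "Phoneme blending": 0.022,
    "Single word reading": 0.002,
    "Taught expressive Vocabulary": 0.011,
    "Taught receptive Vocabulary": 0.062,
}

alpha_bonferroni = 0.05 / len(p_values)  # .05 / 5 = .01
alpha_meff = 0.014                       # Table 2: five outcomes, correlation .6


def significant(p_by_measure, alpha):
    """Names of the measures whose p-value falls below the given alpha."""
    return [name for name, p in p_by_measure.items() if p < alpha]


print("Bonferroni:", significant(p_values, alpha_bonferroni))
print("MEff:      ", significant(p_values, alpha_meff))
```

The only outcome whose status differs between the two corrections is Taught expressive Vocabulary (p = .011), which lies between the two thresholds.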

In this example the intercorrelation between outcome measures could be computed from deposited raw data; if these are not available, then it may still be possible to obtain plausible estimates of intercorrelation between outcome measures, especially if widely-used instruments are used. An example is provided by two randomized controlled trials of a memory training programme for children, Cogmed. In both studies, the Automated Working Memory Assessment battery (Alloway, 2007) was used to assess outcome. Chacko et al. (2014) used four subtests, Dot Matrix, Spatial Recall, Digit Recall, and Listening Recall, and applied the Sidak-Bonferroni correction, with effective alpha of .013. The raw data are not available, but the test manual indicates that intercorrelations between these four measures range from .70 to .78. Thus we can use the lookup table (Table 2), which shows that with four variables with intercorrelation of .7, an effective alpha of .02 can be used. In practice this did not affect the interpretation of results, because two of the measures, Dot Matrix and Digit Recall, had associated p-values of < .001 and .005 respectively. The p-values for Spatial Recall and Listening Recall were .048 and .728 respectively, and so would not meet criteria for significance with MEff or Bonferroni methods.

The other study by Roberts et al. (2016) used a different subset of subtests from the same battery: Dot Matrix, Digit Recall, Backward Digit Recall, and Mister X, given at 6 months, 12 months and 24 months post-intervention. According to the test manual, intercorrelations between these subtests range from .65 to .80. These authors did not apply a correction for multiple comparisons. If Bonferroni correction had been used this would have given an alpha level of .004 (.05/12). The test manual indicates that test-retest reliability of the subscales ranges from .84 to .89. Thus overall, we can estimate the off-diagonal correlations for all 12 measures to be around .8, which the lookup table shows as corresponding to an effective alpha of .01. In this study, only the Dot Matrix task effect was significant after correction for multiple comparisons, with p < .001 at both 6 months and 12 months, but p = .14 at 24 months. Backward Digit Recall gave p = .04 at 6 months only, which would be nonsignificant if any correction for multiple comparisons were used. All other comparisons were null. In the next section, the implications of these findings for choosing methods are discussed further.

Discussion

Some interventions are expected to affect a range of related processes. In such cases, the need to specify a single primary outcome tends to create difficulties, because it is often unclear which of a suite of outcomes is likely to show an effect. Note that the MEff approach does not give the researcher free rein to engage in p-hacking: the larger the suite of measures included in the study, the lower the adjusted alpha will be. It does, however, remove the need to pre-specify one measure as the primary outcome, when there is genuine uncertainty about which measure might be most sensitive to intervention.

A second advantage is that in effect, by including multiple outcome measures, one can improve the efficiency of a study, in terms of the trade-off between power and familywise errors. A set of outcome measures may be regarded as imperfect proxy indicators of an underlying latent construct, so we are in effect building in a degree of within-study replication by including more than one outcome measure.

The simulations showed that PCA gives higher power than MEff in the case where all outcomes are indicators of a single underlying factor. PCA, however, needs to be computed from raw data and so is not feasible when re-evaluating published studies, whereas MEff is feasible so long as the average off-diagonal correlation between outcomes can be estimated. PCA is also less powerful when the outcomes tap into heterogeneous constructs and do not load on one major latent factor. Some examples are provided where prior literature gives plausible estimates of intercorrelations between outcome measures. Of course, such estimates are never as accurate as the actual correlations from the reported data, which may vary depending on sample characteristics. Wherever possible, it is preferable to work with original raw data. However, where correlations are available from test manuals, or where previous studies have reported correlations between outcomes, then the researcher can consider how interpretation of results may be affected by assuming a given degree of dependency between outcome measures.

A possible disadvantage of using MEff or Bonferroni correction over PCA is that such approaches are likely to tempt researchers to interpret specific outcomes that fall below the revised alpha threshold as meaningful. They may be, of course, but when we create a suite of outcomes that differ only by chance, it is common for only a subset of them to reach the significance criterion. Any recommendation to use MEff should be accompanied by a warning that if a subset of outcomes shows an effect of intervention, this could be due to chance. It would be necessary to run a replication to have confidence in a particular pattern of results.

In this regard, the examples of studies using the Automated Working Memory Assessment to evaluate intervention for children with memory and attentional difficulties (Chacko et al., 2014; Roberts et al., 2016) are instructive. As reported in the test manual (Alloway, 2007), intercorrelations between the subtests are high, supporting the idea of a general working memory factor that influences performance on all such measures. On that basis, it might seem preferable to reduce subtest scores to one outcome measure - either by using data reduction such as principal component analysis, or by using the method advocated in the test manual to derive a composite score. We know this is associated with an increase in reliability of measurement and statistical power. However, the results of the two studies sound a note of caution: in both trials there were large improvements in one subtest, Dot Matrix, at least in the short-term, while other measures did not show consistent gains. This kind of result has been much discussed in evaluations of computerised training, where it has been noted that one may see improvements in tasks that resemble the training exercises, ‘near transfer’, without any generalisation to other measures, ‘far transfer’ (Aksayli, Sala, & Gobet, 2019). The very fact that measures are usually intercorrelated provides the rationale for hoping that training one skill will have an effect that generalises to other skills, and to everyday life. Yet, the verdict on this kind of training is stark: after much early optimism, working memory training leads to improvements on what was trained, but these do not extend to other areas of cognition. This shows us that careful thought needs to be given to the logic of how a set of outcome measures is conceptualised: should we treat them as interchangeable indicators of a single underlying factor, or are there reasons to expect that the intervention will have a selective impact on a subset of measures? Even when variables are intercorrelated in the general population, they may respond differently to intervention.

It is also worth noting that results obtained with the MEff approach will depend on assumptions embodied in the simulation that is used to derive predictions. Outcome measures simulated here are normally distributed, and uniform in their covariance structure. It would be of interest to evaluate MEff in datasets with different variable types, such as those used by Vickerstaff et al. (2021), which included binary as well as continuous data, and to model the impact of missing data.

In sum, a recommendation against using multiple outcomes in intervention studies does not lead to optimal study design. Inclusion of several related outcomes can increase statistical power, without increasing the false positive rate, provided appropriate correction is made for the multiple testing. Compared to most other approaches for correlated outcomes, MEff is relatively simple. It could potentially be used to reevaluate published studies that report multiple outcomes but may not have been analysed optimally, provided we have some information on the average correlation between outcome measures.
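As a rough illustration of how simple the calculation is, the sketch below computes MEff from a correlation matrix and derives the adjusted alpha. It assumes the Nyholt (2004) formulation, MEff = 1 + (M − 1)(1 − Var(λ)/M), using the sample variance of the eigenvalues λ. The function names, the six-outcome example and the uniform correlation of .5 are illustrative assumptions and are not taken from the scripts on OSF.

```python
import numpy as np


def meff(corr):
    """Effective number of outcomes from the eigenvalue variance (assumed Nyholt-style formula)."""
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)  # correlation matrices are symmetric
    # The sample variance (ddof=1) gives MEff = 1 when all outcomes correlate perfectly.
    return 1 + (m - 1) * (1 - np.var(eigenvalues, ddof=1) / m)


def uniform_corr(m, r):
    """Hypothetical helper: m x m correlation matrix with every off-diagonal equal to r."""
    return np.full((m, m), r) + (1 - r) * np.eye(m)


# Illustrative case: six outcomes with an assumed average intercorrelation of .5.
m_eff = meff(uniform_corr(6, 0.5))
print(f"MEff = {m_eff:.2f}, adjusted alpha = {0.05 / m_eff:.4f}")
```

Under this formula, uncorrelated outcomes give MEff equal to the number of outcomes, which is the Bonferroni correction, and perfectly correlated outcomes give MEff = 1, which is no correction.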

Data availability

Underlying data

OSF: Revised ‘multiple outcomes’ using MEff, <https://doi.org/10.17605/OSF.IO/6GNB4> (Bishop, 2022).

This project contains the following underlying data:

• Simulated raw data from 2000 runs for models L1, L2 and L3 (corresponding to L1, L2 and L2x respectively).

Extended data

OSF: Revised ‘multiple outcomes’ using MEff, <https://doi.org/10.17605/OSF.IO/6GNB4> (Bishop, 2022).

This project contains the scripts to generate and analyse simulated data. Two scripts are included:

• Data_simulation_modelL.Rmd, which generates the simulated data under Data, computes power tables and creates plots for Figures 2-4.

• Multiple_outcomes_revised.Rmd, which generates the text for the current article.

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

References

Aksayli, Sala, Gobet: The cognitive and academic benefits of Cogmed: A meta-analysis. Educational Research Review. 2019;27:229–243. 10.1016/j.edurev.2019.04.003

Alloway: Automated Working Memory Assessment Manual. London: Pearson Assessment; 2007.

Bishop DVM: Using multiple outcomes in intervention studies for improved trade-off between power and type I errors: The Adjust NVar approach [version 1; peer review: 2 not approved]. F1000Research. 2021;10:991. 10.12688/f1000research.73520.1

Bishop DVM: Revised ‘Multiple Outcomes’ Using MEff. OSF. [Dataset]. 2022, November 18. 10.17605/OSF.IO/6JF9T

Burgoyne, Duff, Clarke: Efficacy of a reading and language intervention for children with Down syndrome: A randomized controlled trial. Journal of Child Psychology and Psychiatry, and Allied Disciplines. 2012;53(10):1044–1053. 10.1111/j.1469-7610.2012.02557.x

Burgoyne, Duff, Clarke: Reading and language intervention for children with Down syndrome. Experimental data [data collection]. 2016. 10.5255/UKDA-SN-852291

Chacko, Bedard, Marks: A randomized clinical trial of Cogmed Working Memory Training in school-age children with ADHD: A replication in a diverse sample using a control condition. Journal of Child Psychology and Psychiatry. 2014;55(3):247–255. 10.1111/jcpp.12146

Cheverud: A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001;87(1): Article 1. 10.1046/j.1365-2540.2001.00901.x

Derringer: A simple correction for non-independent tests. PsyArXiv. 2018. 10.31234/osf.io/f2tyw

Lakens: The 20% Statistician: One-sided tests: Efficient and Underused. The 20% Statistician. 2016, March 17. http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html

Moher, Hopewell, Schulz: CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials. BMJ (Clinical Research Ed.). 2010;340:c869. 10.1136/bmj.c869

Nyholt: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. American Journal of Human Genetics. 2004;74(4):765–769.

R Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. https://www.R-project.org/

Roberts, Quach, Spencer-Smith: Academic outcomes 2 years after working memory training for children with low working memory: A randomized clinical trial. JAMA Pediatrics. 2016;170(5):e154568. 10.1001/jamapediatrics.2015.4568

Sainani: Peer Review Report For: Using multiple outcomes in intervention studies for improved trade-off between power and type I errors: The Adjust NVar approach [version 1; peer review: 2 not approved]. F1000Research. 2021;10:991. https://doi.org/10.5256/f1000research.77175.r96192

Vickerstaff, Ambler, King: Are multiple primary outcomes analysed appropriately in randomised controlled trials? A review. Contemporary Clinical Trials. 2015;45(Pt A):8–12. 10.1016/j.cct.2015.07.016

Vickerstaff, Ambler, Omar: A comparison of methods for analysing multiple outcome measures in randomised controlled trials using a simulation study. Biometrical Journal. Biometrische Zeitschrift. 2021;63(3):599–615. 10.1002/bimj.201900040

10.5256/f1000research.158245.r221079

Reviewer response for version 3

Daniel Lakens, Referee (https://orcid.org/0000-0002-0247-239X), Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands

Competing interests: No competing interests were disclosed.

14 November 2023

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve

I already approved the report, and the new changes are small improvements that do not change my evaluation.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Applied statistics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.141308.r160158

Reviewer response for version 2

Kristin Sainani, Referee (https://orcid.org/0000-0003-0614-303X), Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA

Competing interests: No competing interests were disclosed.

13 March 2023

recommendation

approve

The focus of this paper has shifted from the original version. It now focuses on the Meff approach rather than the original proposed MinNVar approach. The goals of the paper have also shifted: (1) to identify situations in which the use of multiple primary outcomes with appropriate adjustment for multiple comparisons yields higher statistical power than a single primary outcome; and (2) to provide tools for re-evaluating already published papers that used multiple primary outcomes but failed to adjust for multiplicity.

In shifting the focus of the paper, the authors have addressed my original concerns. I like the Meff approach because it is relatively straightforward and intuitive. So, I’m glad that this revised version provides a brief tutorial on Meff for psychologists. Table 1 also provides a nice intuitive illustration of how Meff works. I also appreciate that this new draft explores different possible patterns of correlations that reflect different underlying latent variables. This paints a more realistic picture and brings out some of the tradeoffs of the different approaches (PCA, Bonferroni, Meff).

I think this paper accomplishes its stated goals, and is a useful resource. I have spot-checked a few of the simulations and see similar patterns to what the paper reports. I appreciate that the authors have made their code and data available.

I have just a few minor suggestions:

Figures 2-4. I would recommend removing effect size=0.8. Effect size makes some difference but appears less important than number of outcomes and correlation strength. Furthermore, power is always high with effect size=0.8 and n=50, so it doesn’t add much information to display effect size=0.8. I also had trouble distinguishing the two dashed lines. Altogether, effect size=0.8 just makes the graph harder to read without adding a lot of extra information.

It’s somewhat contradictory to say that the lookup tables are useful when you don’t have access to the underlying data but then to present an illustrative example for which the underlying data were available (Burgoyne 2012). Presumably, if we had access to the full data, we could calculate Meff exactly (or account for multiple testing using other approaches). Are there any examples you could present where the data are not available but there is some way to roughly estimate the correlations? E.g., because of summary data in the paper or because the correlations can be roughly estimated from previous work using the same variables? This might be closer to the real-world use case for the lookup tables.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Statistics, Sports Medicine

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Dorothy Bishop, University of Oxford, UK

Competing interests: No competing interests were disclosed.

28 October 2023

Many thanks for the helpful recommendations.

Figures 2-4 have been revised in line with suggestions. The original figures with effect size of .8 are still available on OSF, and a Wiki has been added to that OSF page to explain the difference.

I have also added two more real-world examples, both dealing with evaluation of a memory intervention, Cogmed, for children. It was possible to use the MEff lookup table because both studies used a published working memory battery, where correlations between the subscales can be found in the test manual. Although in practice this had little impact on the interpretation of these studies, it did show how alpha depends on the correction used: one study had used Bonferroni correction, which was over-stringent, and the other had used no correction for multiple contrasts. I found that this exercise not only provided an illustrative example of use of MEff, but it further emphasised the need to consider the underlying relationships between intervention and outcomes when deciding on an analytic strategy, and so I added a further comment about that in the Discussion.

10.5256/f1000research.141308.r160159

Reviewer response for version 2

Daniel Lakens

Competing interests: No competing interests were disclosed.

21 February 2023

recommendation

approve

This is an interesting revision, because the author has largely abandoned the original proposal, and switched to a different approach to control error rates when multiple dependent hypotheses are tested. As the goal of the paper is a practical tutorial, I think this switch is a valid and good choice. It also means many of my original comments are no longer relevant.

The idea of creating lookup tables is a nice contribution. The biggest weakness is that correlations 1) are often unknown when data is not shared, and 2) are likely more varied than in the simulations. The authors admit this: “In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known.” They then give an applied example where the data was shared in a repository. What is missing is a clear instruction what to do when data is not shared in a repository. I would assume this then means 1) asking for the data, 2) if data is not provided making an informed guess, and 3) be careful in the interpretation, or performing some sensitivity analyses (e.g., X findings are significant, assuming correlations are not lower than Y). I think presenting a plan for when data is not available is a useful addition.

I also still wonder what happens if there is substantial variation in the off-diagonal correlations. Maybe the authors feel this is rare in realistic datasets – then that could be mentioned.

Minor comment

The heading “The case against multiple outcomes” might confuse readers a bit, as you are arguing FOR multiple outcomes. So, maybe replace it by something like ‘Evaluating error rates for multiple outcomes’.

I checked the manuscript for reproducibility, and reproduced the figures and data. I performed the simulations with a larger N, as I thought the patterns in the figures showed some surprising patterns (e.g., not purely decreasing or increasing) but found the same with a larger number of simulations, so that is not the issue.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Applied statistics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Dorothy Bishop, University of Oxford, UK

Competing interests: None.

24 February 2023

Thanks to the reviewer for the new evaluation.

I’ll defer making modifications to the document until a second review is available, but just note a couple of points.

What is missing is a clear instruction what to do when data is not shared in a repository. I would assume this then means 1) asking for the data, 2) if data is not provided making an informed guess, and 3) be careful in the interpretation, or performing some sensitivity analyses (e.g., X findings are significant, assuming correlations are not lower than Y). I think presenting a plan for when data is not available is a useful addition.

Response: I like this suggestion for a plan of action. A further step between 1 and 2 would be to do a search for other datasets using the same variables, which may give an indication of the range of expected correlation between them – while recognising that observed values may be influenced by factors such as range. So the “informed guess” would be informed by prior literature, if available.
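The sensitivity analysis in step 3 might be sketched as follows: for a range of assumed correlations, count how many reported p-values fall below the MEff-adjusted alpha. The p-values and the number of outcomes are invented for illustration. The sketch assumes the Nyholt-style formula MEff = 1 + (M − 1)(1 − Var(λ)/M), which for a uniform off-diagonal correlation r reduces to 1 + (M − 1)(1 − r²).

```python
def meff_uniform(m, r):
    """MEff for m outcomes with a uniform correlation r (closed form of the assumed formula)."""
    return 1 + (m - 1) * (1 - r ** 2)


def n_significant(p_values, r, alpha=0.05):
    """Count p-values below alpha / MEff for an assumed uniform correlation r."""
    threshold = alpha / meff_uniform(len(p_values), r)
    return sum(p < threshold for p in p_values)


# Hypothetical p-values from a published study with six outcomes.
reported_p = [0.004, 0.009, 0.012, 0.030, 0.200, 0.450]

for r in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"assumed r = {r:.1f}: {n_significant(reported_p, r)} significant")
```

The output can then be reported in the form suggested by the reviewer, stating how many findings remain significant provided the correlation is not lower than a given value.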

I also still wonder what happens if there is substantial variation in the off-diagonal correlations. Maybe the authors feel this is rare in realistic datasets – then that could be mentioned.

Response: On the contrary, I suspect variation in off-diagonal correlations is more common than uniformity. But in effect the most extreme version of this case is already modelled with model L2. In this model, the off-diagonal values are either zero or a specific value, r.max. The appropriate comparison for this model is a model, L1, where the correlations are uniform, with r.avg equivalent to the average off-diagonal value. Consider the case in the bottom row of Table 3. For model L2, we have r values of either 0 or .8. For model L1, we have a corresponding value of r.avg of .343. The alpha values for L1 and L2 are in adjacent rows of Table 3, and it is clear they differ only slightly. If you introduced more variability in model L2, with some r-values being intermediate between 0 and r.max, the difference between models L1 and L2 would be smaller.
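To make this comparison concrete, the sketch below contrasts a block-structured matrix with a uniform one that has the same mean off-diagonal correlation. The layout is an assumption made for illustration: eight outcomes in two blocks of four, with r = .8 within blocks and 0 between them, which gives a mean off-diagonal value of .343. The article's own model specifications may differ. The sketch assumes the Nyholt-style formula MEff = 1 + (M − 1)(1 − Var(λ)/M), and uses the identity that the squared deviations of the eigenvalues of a correlation matrix from their mean of 1 add up to the sum of squared off-diagonal entries, so no eigen-decomposition is needed.

```python
def meff(corr):
    """MEff = 1 + (M - 1) * (1 - Var(eigenvalues) / M), assumed Nyholt-style formula.

    For a correlation matrix the eigenvalues have mean 1 and their squared
    deviations sum to the sum of squared off-diagonal correlations.
    """
    m = len(corr)
    sum_sq_offdiag = sum(corr[i][j] ** 2 for i in range(m) for j in range(m) if i != j)
    eigen_var = sum_sq_offdiag / (m - 1)  # sample variance of the eigenvalues
    return 1 + (m - 1) * (1 - eigen_var / m)


def block_corr(n_blocks, block_size, r):
    """Hypothetical L2-like matrix: r within a block, 0 between blocks."""
    m = n_blocks * block_size
    return [[1.0 if i == j else (r if i // block_size == j // block_size else 0.0)
             for j in range(m)] for i in range(m)]


def uniform_corr(m, r):
    """Hypothetical L1-like matrix: every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(m)] for i in range(m)]


blocks = block_corr(2, 4, 0.8)
m = len(blocks)
r_avg = sum(blocks[i][j] for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
uniform = uniform_corr(m, r_avg)

for label, corr in (("blocks", blocks), ("uniform", uniform)):
    print(f"{label}: MEff = {meff(corr):.2f}, alpha = {0.05 / meff(corr):.4f}")
```

Because the eigenvalue variance depends on the squared correlations, two matrices with the same mean off-diagonal value need not give identical MEff values; the sketch lets the size of that gap be inspected for any assumed structure.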

Dorothy Bishop, University of Oxford, UK

Competing interests: None.

28 October 2023

Thanks for your careful reading of the paper. I particularly appreciate you checking the reproducibility of the simulations; I realise this takes some work but it is good to have the reassurance that the patterns of results are reproducible.

Following suggestions by reviewer 1, I've added two examples of real-world studies where data are not available. In the field of educational/psychological interventions, it can be possible to get estimates of intercorrelations between measures if well-established measures are used, as is the case in this example. I've also added some thoughts on what to do when this is not the case, as you proposed.

Regarding the case of substantial variation in off-diagonal correlations, having played around with various scenarios, I think it's reasonable to regard model L2 as corresponding to that, because it has a mixture on the off-diagonal of variables with correlation of zero, and those with correlation of whatever is the maximum value for correlated measures from one factor. For instance, if you look at the bottom row of Table 3, L2 is the case where the off-diagonal contains a mixture of r values of 0 and .8, and L1 is the case when the off-diagonals are uniform (and equivalent to the mean of L2 values). Yet the MEff varies only slightly. I think that is about the most extreme variation you could get for off-diagonal values.

10.5256/f1000research.77175.r97181

Reviewer response for version 1

Daniel Lakens

Competing interests: No competing interests were disclosed.

10 November 2021

recommendation

reject

The author discusses more optimal ways to control error rates than the Bonferroni correction when researchers use multiple measures in a study that are positively correlated. In these cases the authors proposed to specify the number of variables that should be significant at the default alpha level (e.g., 0.05) to make sure the overall Type 1 error rate does not exceed 0.05.

The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and conclude the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables can not be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are ‘replicants’ could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.

The comparison against the Bonferroni correction is one interesting baseline, but there is a large literature on how to correct for multiple comparisons that is also much more efficient than a Bonferroni correction, and which is the more interesting comparison. A strength of the Bonferroni correction is that it makes no assumptions about the variables that are corrected. But if one is willing to make assumptions, most importantly about the correlation between variables, more efficient approaches are available. How does this approach compare against other correction approaches? Although there are many correction approaches (this seems to be a particularly active field in neuroscience, where multiple comparisons when analyzing brain activation are common, and measures are strongly correlated), the following two references provide a starting point.

Fan, J., Han, X., & Gu, W. (2012). Estimating False Discovery Proportion Under Arbitrary Covariance Dependence. Journal of the American Statistical Association, 107(499), 1019–1035 ¹

Yekutieli, D., & Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82(1), 171–196 ²

There are several things to consider. First, the proposed simulation-based approach fixes the alpha level, and selects the number of variables that need to be significant. Numbers of variables are relatively crude, in that we can pick 1, 2, 3, 4, etc., but not 1.23 variables. Alternative approaches in the literature lower the alpha level. A benefit of these approaches is that the alpha level can be set at any value (e.g., 0.0352) to exactly control the Type 1 error rate. The consequences of this are also clear when we look at the figures and compare the principal component approach with the Adjust NVar approach. The familywise error rate for the Adjust NVar approach is often well below 0.05, while it is controlled at 0.05 in the principal component approach (and it is controlled at 0.05 in other approaches in the literature that lower the alpha level). The author discusses this (e.g., page 7 of the pdf version) in some detail, but the author seems slightly biased towards their own approach, stating that “the tradeoff between power and familywise error (expressed as a ratio) is higher for Adjust NVar.” It is not clear this ratio is a fair evaluation of the methods, and the information is difficult to distill from the figures (a Table with Type 1 error rates and Type 2 error rates would be more useful for this). This ratio is not so easy to summarize in a single sentence, I feel. If a design has 99.9% power, lowering the alpha from 0.05 to 0.02 has little effect on power, but if power is 0.8, lowering the alpha has a greater effect. Typically, the evaluation is done on the required sample size – which type of correction would require the smallest sample size? And then it becomes important to take along more modern corrections for correlated variables as well.
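The point about high versus moderate power can be illustrated with a normal approximation. The sketch below assumes a one-sided z-test, an assumption made for simplicity since no particular test is specified here, and asks what happens to power when alpha drops from 0.05 to 0.02 in a design that started with 99.9% or 80% power. The function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_after_alpha_change(power_at_old_alpha, old_alpha, new_alpha):
    """Power of a one-sided z-test at new_alpha, holding effect size and sample size fixed."""
    # Noncentrality implied by the original design: delta = z_(1 - alpha) + z_(power).
    delta = STD_NORMAL.inv_cdf(1 - old_alpha) + STD_NORMAL.inv_cdf(power_at_old_alpha)
    return STD_NORMAL.cdf(delta - STD_NORMAL.inv_cdf(1 - new_alpha))


for start_power in (0.999, 0.80):
    new_power = power_after_alpha_change(start_power, old_alpha=0.05, new_alpha=0.02)
    print(f"power {start_power:.3f} at alpha .05 -> {new_power:.3f} at alpha .02")
```

Under these assumptions the loss is negligible for the highly powered design and substantial for the design with 80% power, which is why a single ratio of power to familywise error is hard to interpret across designs.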

I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.

The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.

It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The ‘skipsim’ workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an ‘Images’ folder to run the plot code on line 679 – this folder could be uploaded to the github repo perhaps?

To conclude, I believe this current version of the manuscript needs some additional work, which could include a more extensive discussion of other corrections in the literature, an exploration of additional simulations, and a practical example of how to use this approach when analyzing published studies.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Applied statistics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

References

1. Fan J, Han X, Gu W: Estimating False Discovery Proportion Under Arbitrary Covariance Dependence. J Am Stat Assoc. 2012;107(499):1019–1035. 10.1080/01621459.2012.720478 (PMID: 24729644)

2. Yekutieli D, Benjamini Y: Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference. 1999;82(1-2):171–196. 10.1016/S0378-3758(99)00041-5

Dorothy Bishop, University of Oxford, UK

Competing interests: No competing interests were disclosed.

18 November 2022

I thank the reviewer for such a comprehensive and constructive review. The two reviews complemented one another very nicely, and helped me see how to restructure and revise the paper to make it clearer.

As background, I should explain that this paper grew out of an attempt to write a primer on how to analyse/evaluate published intervention studies for students/professionals in allied health professions/education. My concern was that Bonferroni correction is often too conservative, but explanations of other methods are typically highly complex, so I was looking for a simpler rule of thumb that could be useful in interpreting published studies. I have emphasised this rationale more in the revised paper.

Response to specific points.

1. The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and conclude the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables can not be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are ‘replicants’ could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.

Response: This is an important point, which is similar to issues raised by reviewer 1, so I have now extended the simulations to take into account situations where outcomes may not all be indicators of one factor.

2. Alternative methods for correction for multiple comparisons.

Response: The reviewer recommends two references with alternative approaches to multiple comparison correction. As I noted above, one goal was to keep things simple: so I am reluctant to do a full comparison of all methods, especially since it is clear that I need to consider different underlying data structures. I like the relative simplicity of the MEff approach recommended by reviewer 1, and so I am now presenting a comparison of Bonferroni, PCA and MEff.

3. I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.

Response: Again, this converges nicely with the recommendations of reviewer 1, and I have now restructured the paper to focus more on this aspect and to give a real-life example.

4. The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.

Response: I now contrast three models with different patterns of intercorrelation between intervention and outcome. Potentially, there is no end to the options to consider, but I found it reassuring that with the MEff approach, the adjusted alpha was similar for different correlated variables, provided the mean off-diagonal correlation was constant.

5. It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The ‘skipsim’ workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an ‘Images’ folder to run the plot code on line 679 – this folder could be uploaded to the github repo perhaps?

Response: I have now separated the simulation and generation of corresponding figures from the code to write the paper.

10.5256/f1000research.77175.r96192

Reviewer response for version 1

Kristin Sainani, Referee (https://orcid.org/0000-0003-0614-303X), Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA

Competing interests: No competing interests were disclosed.

12 October 2021

recommendation

reject

The paper presents a method for controlling the familywise error rate when testing multiple outcomes: Adjust NVar. Unlike multiple testing adjustments that lower the p-value threshold, the idea behind Adjust NVar is to require a minimum number of p-values (MinNSig) to meet a nominal p<.05 threshold.

When outcomes are independent, MinNSig is defined as follows, where M is the number of outcomes:

X~binomial(M, 0.05)

MinNSig is the minimum x for which P(X>=x)<.05 or, equivalently, P(X < x)>.95.

For example, if M=6, then:

P(X=0)=.95**6=0.735

P(X<=1)=P(X<2)=0.735+6*.95**5*.05=.232+.735=.967

Thus, MinNSig=2.
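The binomial calculation above can be sketched in a few lines of Python. This is a minimal illustration for the independent-outcomes case only, not the author's simulation code; the function names `upper_tail` and `min_n_sig` are hypothetical and introduced here purely for illustration.

```python
from math import comb

def upper_tail(m, x, p=0.05):
    """P(X >= x) for X ~ Binomial(m, p): chance of x or more nominal hits."""
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(x, m + 1))

def min_n_sig(m, p=0.05, alpha=0.05):
    """Smallest x with P(X >= x) < alpha, assuming m independent outcomes."""
    for x in range(m + 2):  # x = m + 1 gives an empty sum (probability 0)
        if upper_tail(m, x, p) < alpha:
            return x

print(min_n_sig(6))                # MinNSig for M = 6
print(round(upper_tail(6, 2), 3))  # P(X >= 2) when M = 6
```

For M=6 this reproduces MinNSig=2, with P(X>=2) of about 0.033. It does not cover correlated outcomes, for which the manuscript relies on simulation.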

When outcomes are correlated, there is no simple formula for obtaining MinNSig, so the author has used a simulation approach to account for varying correlation structures.

The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p<.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.

However, I am less convinced of the value of AdjustNVar as a formal tool for controlling the familywise error rate in a planned study. At a minimum, further development and a broader set of simulations would be required to support such a recommendation. The current manuscript describes three alternatives to specifying a single primary outcome in an RCT: (1) Bonferroni adjustment, (2) permutation tests, and (3) use of PCA to derive a single composite outcome. But this ignores existing p-value adjustment methods that are less conservative than Bonferroni. For example, the “M-effective” (Meff) approach accomplishes many of the same goals as AdjustNVar (see: Cheverud (2001) ¹, Nyholt (2004) ², and Derringer (2018) ³). In the Meff approach, one adjusts the p-value threshold by dividing by the effective number of outcomes (Meff) rather than the actual number of outcomes (M). Meff is based on the eigenvalues of the correlation matrix of the outcomes. Where Eigen is the observed vector of eigenvalues from the correlation matrix of the outcomes, Meff is calculated as:

Meff = 1 + (M-1)*(1-(Var(Eigen)/M))

Bonferroni threshold = alpha/M

Meff threshold = alpha/Meff
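As a rough illustration of the formula above, the sketch below computes Meff and the two thresholds for a correlation matrix. It rests on two assumptions that are not stated in the text: Var(Eigen) is taken to be the sample variance (denominator M-1), and the variance is obtained without an eigendecomposition, using the identity that the eigenvalues of a correlation matrix sum to M while their squares sum to the sum of squared matrix entries. The helper names `meff` and `uniform_corr` are hypothetical.

```python
def meff(corr):
    """Effective number of outcomes from an M x M correlation matrix (M >= 2)."""
    m = len(corr)
    if m < 2:
        raise ValueError("need at least two outcomes")
    # Sum of squared entries equals the sum of squared eigenvalues.
    sum_sq = sum(v * v for row in corr for v in row)
    # Eigenvalues have mean 1, so the sample variance is (sum_sq - m) / (m - 1).
    var_eigen = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_eigen / m)

def uniform_corr(m, r):
    """Correlation matrix with every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(m)] for i in range(m)]

alpha, m = 0.05, 6
for r in (0.0, 0.6, 1.0):
    m_eff = meff(uniform_corr(m, r))
    print(r, round(m_eff, 2), round(alpha / m, 4), round(alpha / m_eff, 4))
```

With uniform correlations the formula reduces to 1 + (M-1)(1-r²), so Meff equals M when outcomes are uncorrelated and 1 when they are perfectly correlated; at r=0.6 with six outcomes it is 4.2, giving a threshold of about .012 rather than the Bonferroni value of about .008.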

Like AdjustNVar, Meff is simple and accounts for correlated outcomes. But I believe it has several advantages over AdjustNVar: (1) Meff precisely controls the Type I error rate, whereas AdjustNVar has varying Type I error rates that cannot be precisely controlled by the investigator; (2) Meff accounts for the correlation structure observed in the data, whereas AdjustNVar requires the investigator to guess at the correlation structure; if this guess is far off (which could easily be the case), this would lead to poor Type I error control.

This paper has numerous strengths, including the novelty of the idea; the potential use as a heuristic for re-interpreting flawed published papers; the concision of the writing; and the availability of all code and data. The major limitations of the paper are: (1) it presents an overly narrow set of simulations that do not capture most realistic situations, but then makes overly broad claims based on these simulations; (2) it does not compare AdjustNVar to existing approaches that are less conservative than Bonferroni, such as Meff; and (3) it does not address the different reasons why researchers may be including multiple outcomes, even though these different reasons lead to markedly different correlation structures.

Specific comments:

The paper would benefit from further consideration of the reasons why researchers may include multiple outcomes. The paper focuses on the case where an intervention is “expected to affect a range of related processes.” The simulations make assumptions that match this case, assuming equal correlations across outcomes and equal true effects for each outcome. But researchers may include multiple outcomes for many other reasons, such as: (a) they aren’t sure which process the intervention will affect, (b) they believe the intervention may affect two different processes but they measure each process with several different measurements to “hedge their bets”, or (c) they include a “soft” endpoint in addition to a “hard” endpoint because the “hard” endpoint may occur too rarely. Each of these cases corresponds to different assumptions for the simulations. For example, (b) would be expected to have two clusters of highly correlated variables that are only weakly correlated with each other, which will affect MinNSig.

The paper suggests that AdjustNVar could be used in study planning—researchers would guess at the correlation structure and set a MinNSig ahead of time. But if they guess the correlation structure incorrectly, such as underestimating the true correlation, then they may choose a MinNSig that does not adequately control Type I error.

The “quantum nature” of AdjustNVar is not a desirable characteristic. The researcher is unable to precisely control the Type I error rate. In planning a study in which the correlation is expected to be 0.4, for example, Table 2 would suggest that the researcher should then always choose 9 outcomes over 5-8 outcomes, since 9 maximizes the chances of getting at least 3 p-values <.05. This is one reason I prefer Meff, which precisely controls the Type I error rate.

This description is misleading: “Should we dismiss the trial as showing no benefit? We can use the binomial theorem to check the probability of obtaining this result if the null hypothesis is true and the measures are independent: it is 0.033, clearly below the 5% alpha level.” The description gives the misleading impression that one would be justified in re-evaluating a paper that used a Bonferroni correction by instead applying the criterion of at least two p-values <.05. But doing so would inflate the Type I error rate. Results would have been declared significant if EITHER at least one p-value met the Bonferroni threshold OR at least two p-values were <.05 — leading to an effective Type I error rate of 7% (assuming independent outcomes). Note that this example reappears in the discussion and also mistakenly implies that had three p-values been <.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally.
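The 7% figure can be checked with a short exact calculation. The sketch below assumes six independent outcomes (the case for which the probability of two or more p-values <.05 is 0.033); the number of outcomes and the function name `error_rates` are assumptions made for illustration, not details taken from the manuscript.

```python
from math import comb

def error_rates(m=6, alpha=0.05, min_n_sig=2):
    """Type I error for Bonferroni, the MinNSig rule, and either rule firing."""
    bonf = alpha / m
    # At least one p-value below the Bonferroni threshold.
    p_bonf = 1 - (1 - bonf) ** m
    # At least min_n_sig p-values below the nominal alpha.
    p_count = sum(comb(m, k) * alpha**k * (1 - alpha) ** (m - k)
                  for k in range(min_n_sig, m + 1))
    # Neither rule fires: fewer than min_n_sig p-values fall below alpha,
    # and every one of those lies between the Bonferroni threshold and alpha.
    p_neither = sum(comb(m, k) * (alpha - bonf) ** k * (1 - alpha) ** (m - k)
                    for k in range(min_n_sig))
    return p_bonf, p_count, 1 - p_neither

print([round(p, 3) for p in error_rates()])
```

Under these assumptions each rule alone stays below 5%, but declaring significance when either rule is met gives a rate of roughly 7%.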

I found Table 1 confusing on first read, and I would recommend simplifying it by focusing on a single number of outcomes rather than both 2 outcomes and 4 outcomes and by removing discussions of ranking p-values. (The p-value ranking isn’t important — this is just part of the mechanics of how the algorithm is calculating MinNSig, so I don’t think it’s needed.) For example, you could just focus on calculating MinNSig for 6 outcomes. Show 6 columns of p-values for the 6 outcomes, and then show a single final column that tabulates the number of p-values <.05 for each simulated trial. Then show a frequency table of how many simulations out of 1000 resulted in 0 p-values <.05, 1 p-value <.05, 2 p-values <.05, etc. Then indicate that MinNSig occurs at one number above when the cumulative frequency crosses 95%.

I’m unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2 or to present two tables — one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.

Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is <0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).

Related to comment (7), I don’t think the paper is justified in making this broad claim: “The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables.” This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.

Figures 4-6. Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation “stacks the deck” for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.

The article claims that power is only “slightly lower” for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just “slight”. For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.

PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.
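The contrast between these two scenarios can be explored with a simplified simulation. The sketch below is not the reviewer's or the author's code and will not reproduce their figures exactly: it assumes 50 participants per group, uses two-tailed z-tests in place of t-tests, takes the true uniform correlation as known when computing Meff, and uses an equal-weight composite as a stand-in for the first principal component (whose loadings are equal when all correlations are uniform). The function name `simulate_power` is hypothetical.

```python
import random
from statistics import NormalDist

def simulate_power(effects, r=0.6, n_per_group=50, alpha=0.05,
                   n_sims=20000, seed=1):
    """Return (power with Meff threshold, power with equal-weight composite)."""
    rng = random.Random(seed)
    m = len(effects)
    scale = (n_per_group / 2) ** 0.5      # expected z for an effect size of 1
    m_eff = 1 + (m - 1) * (1 - r * r)     # Meff when all correlations equal r
    crit_meff = NormalDist().inv_cdf(1 - alpha / m_eff / 2)
    crit_comp = NormalDist().inv_cdf(1 - alpha / 2)
    comp_sd = ((1 + (m - 1) * r) / m) ** 0.5   # SD of the mean of m z-scores
    hits_meff = hits_comp = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)          # noise common to all outcomes
        z = [d * scale + r ** 0.5 * shared + (1 - r) ** 0.5 * rng.gauss(0, 1)
             for d in effects]
        if max(abs(v) for v in z) > crit_meff:   # any outcome passes alpha/Meff
            hits_meff += 1
        if abs(sum(z) / m) / comp_sd > crit_comp:  # single test on composite
            hits_comp += 1
    return hits_meff / n_sims, hits_comp / n_sims

all_affected = simulate_power([0.3] * 6)
half_affected = simulate_power([0.3] * 3 + [0.0] * 3)
print("all six outcomes affected (Meff, composite):", all_affected)
print("three of six affected (Meff, composite):", half_affected)
```

Under these assumptions the composite should come out ahead when every outcome carries the effect, and the Meff approach when only half of them do, which is the qualitative pattern described above.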

In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don’t view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.

Figures 1-6: I found these graphs hard to read as they don’t have a clear take-home point. I would suggest removing the single outcome comparisons altogether and then reformatting so that the graphs directly compare AdjustNVar to Meff and PC (three lines). In one graph, hold effect size, sample size, and N outcomes constant, and then show the power of the three methods as a function of increasing correlation. In another graph, hold correlation, sample size, and effect size constant, and show the power of the three methods as a function of N outcomes. And so on. All methods aim to control the familywise error rate at 5%, so this should not be a variable. The fact that AdjustNVar sometimes results in a lower Type I error rate is incidental — it arises as a quirk of the method not as the intent of the researcher.

Where AdjustNVar may have an advantage over a method like Meff is for re-evaluating flawed published studies where researchers used multiple outcomes but did not account in any way for multiple testing. An AdjustNVar table such as Table 2 would give readers a quick way to reevaluate such studies without the need for any calculations or access to raw data. I would focus the paper more on this application.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Statistics, Sports Medicine

References

1. Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87(Pt 1):52-8. doi:10.1046/j.1365-2540.2001.00901.x. PMID: 11678987

2. Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004;74(4):765-9. doi:10.1086/383251. PMID: 14997420

3. Derringer J: A simple correction for non-independent tests. 2018. doi:10.31234/osf.io/f2tyw

Bishop

Dorothy

University of Oxford, UK

Competing interests: No competing interests were disclosed.

18 11 2022

Thanks for the very useful evaluation of this paper, which has prompted further thoughts and a revised manuscript. There was some convergence of views of the two reviewers, although the specific recommendations varied. I thank the reviewer for engaging so thoroughly with the manuscript and for helping improve it.

General point A: AdjustNVar as a heuristic.

Reviewer 1 writes: The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p<.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.

Response: After exploring Meff, I decided not to proceed with the AdjustNVar approach, as I think a lookup table derived from MEff would achieve the same effect but without the problems arising from the quantal nature of N variables.

General point B: Need to consider MEff.

Reviewer 1 notes that the MEff approach accomplishes many of the same goals as AdjustNVar, and is preferable in many respects.

Response: I had been unaware of MEff, and having consulted the references, I agree it is an elegant solution to the problem I was trying to tackle that avoids some of the limitations of AdjustNVar, particularly the need to assume a given correlation structure. I initially thought this approach had only been used in genetics and had not been applied in psychology, but the final reference pointed to the work of Derringer, who has done a great job in a preprint that provides a tutorial on its use. I don’t think it is much used in intervention research, which is the context I am particularly interested in, and so I think it is worthwhile updating the article so that it serves as a basic introduction to MEff, with discussion of factors affecting power.

Accordingly, I have changed the focus to compare different methods with a focus on MEff.

Specific comments:

1. The paper would benefit from further consideration of the reasons why researchers may include multiple outcomes. The paper focuses on the case where an intervention is “expected to affect a range of related processes.” The simulations make assumptions that match this case, assuming equal correlations across outcomes and equal true effects for each outcome. But researchers may include multiple outcomes for many other reasons, such as: (a) they aren’t sure which process the intervention will affect, (b) they believe the intervention may affect two different processes but they measure each process with several different measurements to “hedge their bets”, or (c) they include a “soft” endpoint in addition to a “hard” endpoint because the “hard” endpoint may occur too rarely. Each of these cases corresponds to different assumptions for the simulations. For example, (b) would be expected to have two clusters of highly correlated variables that are only weakly correlated with each other, which will affect MinNSig.

Response: This really got me thinking, and I have now incorporated some further simulations where correlations are not uniform, and also added more discussion of this issue.

2. The paper suggests that AdjustNVar could be used in study planning: researchers would guess at the correlation structure and set a MinNSig ahead of time. But if they guess the correlation structure incorrectly, such as underestimating the true correlation, then they may choose a MinNSig that does not adequately control Type I error.

Response: Agreed. This is now dropped.

3. The “quantum nature” of AdjustNVar is not a desirable characteristic. The researcher is unable to precisely control the Type I error rate. In planning a study in which the correlation is expected to be 0.4, for example, Table 2 would suggest that the researcher should then always choose 9 outcomes over 5-8 outcomes, since 9 maximizes the chances of getting at least 3 p-values <.05. This is one reason I prefer Meff, which precisely controls the Type I error rate.

Response: Agreed. AdjustNVar is now dropped.

4. This description is misleading: “Should we dismiss the trial as showing no benefit? We can use the binomial theorem to check the probability of obtaining this result if the null hypothesis is true and the measures are independent: it is 0.033, clearly below the 5% alpha level.” The description gives the misleading impression that one would be justified in re-evaluating a paper that used a Bonferroni correction by instead applying the criterion of at least two p-values <.05. But doing so would inflate the Type I error rate. Results would have been declared significant if EITHER at least one p-value met the Bonferroni threshold OR at least two p-values were <.05, leading to an effective Type I error rate of 7% (assuming independent outcomes). Note that this example reappears in the discussion and also mistakenly implies that had three p-values been <.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally.

Response: thanks for this clarification. As AdjustNVar is now omitted, this no longer applies.

5. I found Table 1 confusing on first read, and I would recommend simplifying it by focusing on a single number of outcomes rather than both 2 outcomes and 4 outcomes and by removing discussions of ranking p-values. (The p-value ranking isn’t important; this is just part of the mechanics of how the algorithm is calculating MinNSig, so I don’t think it’s needed.) For example, you could just focus on calculating MinNSig for 6 outcomes. Show 6 columns of p-values for the 6 outcomes, and then show a single final column that tabulates the number of p-values <.05 for each simulated trial. Then show a frequency table of how many simulations out of 1000 resulted in 0 p-values <.05, 1 p-value <.05, 2 p-values <.05, etc. Then indicate that MinNSig occurs at one number above when the cumulative frequency crosses 95%.

Response: this no longer applies as tables are redone

6. I’m unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2 or to present two tables, one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.

Response: One-tailed tests have a bad reputation because so often they are misused to just require a lower level of significance when there are no directional predictions, but there are contexts in which they are entirely appropriate, and that includes the kind of intervention study that is the focus of attention here. You can reasonably predict that an intervention will improve performance rather than worsen it. Reviewer 2 wrote a blogpost about this which I find convincing: http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html. I have now explained this further.

7. Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is <0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).

Response: Thanks again for helping clarify what is being simulated here. I hope this is clearer in the current article. The figures have been redrawn and I hope are now clearer. Perhaps one takeaway point is that, in being concerned to control type I error, the CONSORT recommendations appear to ignore the gains in power that can be achieved with multiple outcomes.

8. Related to comment (7), I don’t think the paper is justified in making this broad claim: “The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables.” This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.

Figures 4-6. Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation “stacks the deck” for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.

Response: same as for point 7; this is clearly a key issue, and I think it is clearer now that the alternative models (L2 and L2x) are included.

9. The article claims that power is only “slightly lower” for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just “slight”. For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.

Response: thanks for pushing back on this – this is fair comment.

10. PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.

Response: agreed.

11. In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don’t view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.

Response: fair comment

12. Figures 1-6: I found these graphs hard to read as they don’t have a clear take-home point. I would suggest removing the single outcome comparisons altogether and then reformatting so that the graphs directly compare AdjustNVar to Meff and PC (three lines). In one graph, hold effect size, sample size, and N outcomes constant, and then show the power of the three methods as a function of increasing correlation. In another graph, hold correlation, sample size, and effect size constant, and show the power of the three methods as a function of N outcomes. And so on. All methods aim to control the familywise error rate at 5%, so this should not be a variable. The fact that AdjustNVar sometimes results in a lower Type I error rate is incidental: it arises as a quirk of the method, not as the intent of the researcher.

Response: Agreed. Graphs reformatted as suggested and I hope are now much easier to interpret.

13. Where AdjustNVar may have an advantage over a method like Meff is for re-evaluating flawed published studies where researchers used multiple outcomes but did not account in any way for multiple testing. An AdjustNVar table such as Table 2 would give readers a quick way to reevaluate such studies without the need for any calculations or access to raw data. I would focus the paper more on this application.

Response: Agreed. The paper has been revised to make this point.