Using multiple outcomes in intervention studies: improving power while controlling type I errors

Background The CONSORT guidelines for clinical trials recommend use of a single primary outcome, to guard against the raised risk of false positive findings when multiple measures are considered. It is, however, possible to include a suite of multiple outcomes in an intervention study, while controlling the familywise error rate, if the criterion for rejecting the null hypothesis specifies that N or more of the outcomes reach an agreed level of statistical significance, where N depends on the total number of outcome measures included in the study, and the correlation between them. Methods Simulations were run, using a conventional null-hypothesis significance testing approach with alpha set at .05, to explore the case when between 2 and 12 outcome measures are included to compare two groups, with average correlation between measures ranging from zero to .8, and true effect size ranging from 0 to .7. In step 1, a table is created giving the minimum N significant outcomes (MinNSig) that is required for a given set of outcome measures to control the familywise error rate at 5%. In step 2, data are simulated using MinNSig values for each set of correlated outcomes and the resulting proportion of significant results is computed for different sample sizes, correlations, and effect sizes. Results The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables. Conclusions Where it is feasible to have a suite of moderately correlated outcome measures, then this might be a more efficient approach than reliance on a single primary outcome measure in an intervention study. In effect, it builds in an internal replication to the study. This approach can also be used to evaluate published intervention studies.


Issues raised by inclusion of multiple outcomes
The CONSORT guidelines for clinical trials (Moher et al., 2010) are very clear on the importance of having a single primary outcome: All RCTs assess response variables, or outcomes (end points), for which the groups are compared. Most trials have several outcomes, some of which are of more interest than others. The primary outcome measure is the pre-specified outcome considered to be of greatest importance to relevant stakeholders (such as patients, policy makers, clinicians, funders) and is usually the one used in the sample size calculation. Some trials may have more than one primary outcome. Having several primary outcomes, however, incurs the problems of interpretation associated with multiplicity of analyses and is not recommended.
This advice often creates a dilemma for the researcher: in many situations there are multiple measures that could plausibly be used to index the outcome (Vickerstaff, Ambler, King, Nazareth, & Omar, 2015). If we have several outcomes and we would be interested in improvement on any measure, then we need to consider the familywise error rate, i.e. the probability of at least one false positive in the whole set of outcomes. For instance, if we want to set the false positive rate, alpha, to .05, and we have six independent outcomes, none of which is influenced by the intervention, the probability that none of the tests of outcome effects is significant will be .95^6, which is .735. Thus the probability that at least one outcome is significant, the familywise error rate, is 1 - .735, which is .265. In other words, in about one quarter of studies, we would see a false positive when there is no true effect. The larger the number of outcomes, the higher the false positive rate.
A common solution is to apply a Bonferroni correction by dividing the alpha level by the number of outcome measures: in this example, .05/6 = .008. This way the familywise error rate is kept at .05. But this is over-conservative if, as is usually the case, the various outcomes are intercorrelated.
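For readers who want to check these numbers, the arithmetic above can be reproduced in a couple of lines of R (a minimal illustration, simply restating the calculations in the text):

alpha <- .05
n_outcomes <- 6
fwer <- 1 - (1 - alpha)^n_outcomes        # familywise error rate: 1 - .95^6 = .265
bonferroni_alpha <- alpha / n_outcomes    # Bonferroni-corrected alpha: .05/6 = .008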
Various methods have been developed to address the problem of multiple testing. One approach is to adopt some process of data reduction, such as extracting a principal component from the measures that can be used as the primary outcome. Alternatively, a permutation test can be used to derive the exact probability of an observed pattern of results. Neither approach, however, is helpful if the researcher is evaluating a published paper where an appropriate correction has not been made. These could be cases where no correction is made for multiple testing, risking a high rate of false positives, or where Bonferroni correction has been applied despite using correlated outcomes, which will be overconservative in rejecting the null hypothesis. Vickerstaff et al. (2015) reviewed 209 trials in neurology and psychiatry, and found that 60 reported multiple primary outcomes, of which 45 did not adjust for multiplicity. Those that did adjust mostly used the Bonferroni correction. Thus it would appear that many researchers feel the need to include several outcomes, but this is not always adjusted for appropriately. The goal of the current article is to provide some guidance for interpretation of published papers where the raw data are not available for recomputation of statistics.
In a review of an earlier version of this paper, Sainani (2021) pointed out that the MEff statistic, originally developed in the field of genetics by Cheverud (2001) and Nyholt (2004), provided a simple way of handling this situation. With this method, one computes eigenvalues from the correlation matrix of outcomes, which reflect the degree of intercorrelation between them. The mathematical definition of an eigenvalue can be daunting, but an intuitive sense of how it relates to correlations can be obtained by considering the cases shown in Table 1. This shows how eigenvalues vary with the correlation structure of a matrix, using an example of six outcome measures. The number of eigenvalues, and the sum of the eigenvalues, are both equal to the number of measures. Let us start by assuming a matrix in which all off-diagonal values are equal to r. It can be seen that when the correlation is zero, each eigenvalue is equal to one, and the variance of the eigenvalues is zero. When the correlation is one, the first eigenvalue is equal to six, all other eigenvalues are zero, and the variance of the eigenvalues is six. As correlations increase from .2 to .8, the size of the first eigenvalue increases, and that of the other eigenvalues decreases.
In Table 1, r is the intercorrelation between the six outcomes, Eigen1-Eigen6 are the eigenvalues, and Var is the variance of the six eigenvalues, which is used to compute MEff (the effective number of comparisons) from the formula MEff = 1 + (N - 1) * (1 - Var(Eigen)/N), where N is the number of outcome measures, and Eigen is the set of N eigenvalues.
This value is then used to compute the corrected alpha level, AlphaMEff. Assuming we set alpha to .05, AlphaMEff is .05 divided by MEff. One can see that this value is equivalent to the Bonferroni-corrected alpha (.05/6) when there is no correlation between variables, and equivalent to .05 when all variables are perfectly correlated.
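As an illustration, the quantities in Table 1 and the resulting AlphaMEff can be computed in a few lines of R for six outcomes with a uniform intercorrelation of .4. This is a minimal sketch rather than the script used for the paper, and it assumes the eigenvalue variance is computed with R's var() (denominator N - 1), which is consistent with the variance of six quoted above for perfectly correlated measures:

n_out <- 6
r <- .4
cormat <- matrix(r, n_out, n_out)  # uniform correlation matrix with off-diagonal r
diag(cormat) <- 1
eigenvals <- eigen(cormat)$values  # here: 3.0 and five values of 0.6
var(eigenvals)                     # 0.96
meff <- 1 + (n_out - 1) * (1 - var(eigenvals) / n_out)  # 5.2
alpha_meff <- .05 / meff           # approximately .0096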
Derringer (2018) provided a useful tutorial on MEff, noting that it is not well-known outside the field of genetics, but is well-suited to the field of psychology.Her preprint includes links to R scripts for computing MEff and illustrates their use in three datasets.
These resources will be sufficient for many readers interested in using MEff, but researchers may find it useful to have a look-up table for the case when they are evaluating existing studies. The goal of this paper is two-fold: A. To consider how inclusion of multiple outcome measures affects statistical power, relative to the case of a single outcome, when appropriate correction of the familywise error rate is made using MEff. Results from MEff are compared with use of Bonferroni correction and analysis of the first component derived from Principal Components Analysis (PCA).
B. To provide a look-up table to help evaluate studies with multiple outcome measures, without requiring the reader to perform complex statistical analyses.
These goals are achieved in three sections below: 1. Power to detect a true effect using MEff is calculated from simulated data for a range of values of sample size (N), effect size (E) and the matrix of intercorrelation between outcomes (R). 2. A look-up table of MEff-corrected alpha levels is derived for different numbers of outcome measures and different average intercorrelations between them. 3. Use of the look-up table is illustrated with published intervention studies.

Alternative approach, MinNVar
In the original version of this manuscript (Bishop, 2021), an alternative approach, MinNVar, was proposed, in which the focus was on the number of outcome variables achieving a conventional .05 level of significance. As noted by reviewers, this has the drawback that it could not reflect continuous change in probability levels, because it was based on integer values (i.e. number of outcomes). This made it overconservative in some cases, where adopting the MinNVar approach gave a familywise error rate well below .05. One reason for proposing MinNVar was to provide a very easy approach to evaluating studies that had multiple outcomes, using a lookup table to check the number of outcomes needed, depending on overall correlation between measures. However, it is equally feasible to provide lookup tables for MEff, which is preferable on other grounds, and so MinNVar is not presented here; interested readers can access the first version of this paper to evaluate that approach.

Use of one-tailed p-values
In the simulations described here, one-tailed tests are used. Two-tailed p-values are far more common in the literature, perhaps because one-tailed tests are often abused by researchers, who may switch from a two-tailed to a one-tailed p-value in order to nudge results into significance.
This is unfortunate because, as argued by Lakens (2016), provided one has a directional hypothesis, a one-tailed test is more efficient than a two-tailed test. It is a reasonable assumption that in intervention research, which is the focus of the current paper, the hypothesis is that an outcome measure will show improvement. Of course, interventions can cause harms, but, unless those are the focus of study, we have a directional prediction for improvement.

Methods
Correlated variables were simulated using the R programming language (R Core Team, 2020) (R Project for Statistical Computing, RRID:SCR_001905). The script to generate and analyse simulated data is available at https://osf.io/hsaky/.
For each model specified below, 2000 simulations were run. Note that to keep analysis simple, a single value was simulated for each case, rather than attempting to model pre- vs post-intervention change. Data for the two groups were generated by the same process, except that a given effect size was added to scores of the intervention group, I, but not to the control group, C. Scores of the two groups were compared using a one-tailed t-test for each run.
Power was computed for different levels of effect size (E), correlation between outcomes (R) and sample size per group (N) for the following methods (an illustrative sketch of a single simulated run is given after the list): a) Bonferroni-corrected data: Proportion of runs where p was less than the Bonferroni-corrected value for at least one outcome.
b) MEff-corrected data: Proportion of runs where p was less than the AlphaMEff value for at least one outcome.
c) Principal component analysis (PCA): Proportion of runs where p was below .05 when groups I and C were compared on scores on the first principal component of the PCA.
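The following is a minimal sketch, in R, of a single simulated run with all outcomes generated as indicators of one latent factor (the generation method is described in the next section), together with the three decision rules above. It is not the script from the OSF repository, and the parameter values (six outcomes, mean intercorrelation .4, effect size .3, n = 50 per group) and the use of the combined-group correlation matrix for MEff are illustrative assumptions:

set.seed(1)
n <- 50; n_out <- 6; rbar <- .4; E <- .3
loading <- sqrt(rbar)                    # loading of each outcome on the latent factor
sim_group <- function(eff) {
  L <- rnorm(n, mean = eff / loading)    # latent factor; E/sqrt(rbar) on L gives effect E on outcomes
  sapply(1:n_out, function(i) loading * L + sqrt(1 - rbar) * rnorm(n))
}
grpC <- sim_group(0)                     # control group
grpI <- sim_group(E)                     # intervention group
# one-tailed t-test per outcome, predicting higher scores with intervention
p <- sapply(1:n_out, function(i) t.test(grpI[, i], grpC[, i], alternative = "greater")$p.value)
# (a) Bonferroni and (b) MEff decision rules
alpha_bonf <- .05 / n_out
ev <- eigen(cor(rbind(grpC, grpI)))$values
alpha_meff <- .05 / (1 + (n_out - 1) * (1 - var(ev) / n_out))
hit_bonf <- any(p < alpha_bonf)
hit_meff <- any(p < alpha_meff)
# (c) PCA: compare the groups on the first principal component
pc1 <- prcomp(rbind(grpC, grpI), scale. = TRUE)$x[, 1]
if (cor(pc1, rowMeans(rbind(grpC, grpI))) < 0) pc1 <- -pc1   # fix the arbitrary sign of PC1
hit_pca <- t.test(pc1[(n + 1):(2 * n)], pc1[1:n], alternative = "greater")$p.value < .05

Power for each method is then the proportion of the 2000 runs on which the corresponding rule registers a significant result.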

Method for simulating outcomes
Simulating multivariate data forces one to consider how to conceptualise the relationship between an intervention and multiple outcomes. Implicit in the choice of method is an underlying causal model that includes mechanisms that lead measures to be correlated.
In the simulation, outcomes were modelled as indicators of one or more underlying latent factors, which mediate the intervention effect. This can be achieved by first simulating a latent factor, with an effect size of either zero, for group C, or E for group I. Observed outcome measures are then simulated as having a specific correlation with the latent variable, i.e. the correlation determines the extent to which the outcomes act as indicators of the latent variable. This can be achieved using the formula: Outcome = r * L + sqrt(1 - r^2) * e, where r is the correlation between the latent variable (L) and each outcome, L is a vector of random normal deviates that is the same for each outcome variable, and e (error) is a vector of random normal deviates that differs for each outcome variable. Note that when outcome variables are generated this way, the mean intercorrelation between them will be r^2.
Thus if we want a set of outcome variables with mean intercorrelation of .4, we need to specify r in the formula above as sqrt(.4) = .632. Furthermore, the effect size for the simulated variables will be lower than for the latent variable: to achieve an effect size, E, for the outcome variables, it is necessary to specify the effect size for the latent variable, E_L, as E/sqrt(r), where r here denotes the desired mean intercorrelation between outcomes.
Note that the case where r = 0 is not computable with this method, i.e. it is not possible to have a set of outcomes that are indicators of the same latent factor but which are uncorrelated. The lowest value of r that was included was r = .2.
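A quick, purely illustrative check of the properties described above: outcomes built as r * L + sqrt(1 - r^2) * e should have a mean intercorrelation close to r^2, so a target mean intercorrelation of .4 requires a loading of sqrt(.4) = .632 (variable names here are arbitrary):

set.seed(2)
loading <- sqrt(.4)           # loading needed for a target mean intercorrelation of .4
n <- 100000
L <- rnorm(n)
X <- sapply(1:4, function(i) loading * L + sqrt(1 - loading^2) * rnorm(n))
cc <- cor(X)
mean(cc[lower.tri(cc)])       # close to .4, i.e. the square of the loading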
The initial simulation, designated as Model L1, treated all outcome measures as equivalent. In practice, of course, we will observe different effect sizes for different outcomes, but in Model L1 this is purely down to the play of chance: all outcomes are indicators of the same underlying factor, as shown in the heatmap in Figure 1, Model L1.
In two additional models, rather than being indicators of the same uniform latent variable, the outcomes correspond to different latent factors. This would correspond to the kind of study described by Vickerstaff et al. (2021), where an intervention for obesity included outcomes relating to weight and blood glucose levels. Following suggestions by Sainani (2021), a set of simulations was generated to consider the relative power of different methods when there are two underlying latent factors that generate the outcomes. In Model L2, there are two independent latent factors, both affected by intervention. In Model L2x, the intervention only influences the first latent factor. The computational approach was the same as for Model L1, but with two latent factors, each used to generate a block of variables. The two latent factors are uncorrelated.
The size of the suite of outcome variables entered into later analysis ranged from 2 to 8. For each suite size, principal components were computed from data from the C and I groups combined, using the base R function prcomp from the stats package (R Core Team, 2020). Thus, PC2 is a principal component based on the first two outcome measures, PC4 is based on the first four outcome measures, and so on.
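For instance, assuming a combined outcome matrix dat with eight columns (the name is hypothetical, and whether the variables were standardised before extraction is an assumption here), the first principal component for the nested suites could be obtained along these lines:

pc2 <- prcomp(dat[, 1:2], scale. = TRUE)$x[, 1]  # first PC of the first two outcomes
pc4 <- prcomp(dat[, 1:4], scale. = TRUE)$x[, 1]  # first PC of the first four outcomes
pc8 <- prcomp(dat[, 1:8], scale. = TRUE)$x[, 1]  # first PC of all eight outcomes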

Power calculations
Sample plots comparing power for Bonferroni correction, MEff and PCA are shown for a sample size of 50 per group in Figures 2 to 4. Plots for smaller (N = 20) and larger (N = 80) sample sizes are available online (https://osf.io/k6xyc/), and show the same basic pattern.

Figure 2 shows the simplest situation, when there are between 2 and 8 outcome measures, all of which are derived from the same latent variable (Model L1). Different levels of intercorrelation between the outcomes (ranging from .2 to .8 in steps of .2) are shown in columns.

Several points emerge from inspection of this figure: first, when intercorrelation between measures is low to medium (.2 to .6), power increases as the number of outcome measures increases. Furthermore, the power is greater when PCA is used than when MEff or Bonferroni correction is applied. MEff is generally somewhat better-powered than Bonferroni, and Bonferroni has lower power than a single outcome measure when there is a large number of highly intercorrelated outcome measures (r = .8).

In practice, it may be the case that outcome measures are not all reflective of a common latent factor. Figure 3 shows results from Model L2, where outcome measures form two clusters, each associated with a different latent factor (see Figure 1). Here both latent factors are associated with improved outcomes in the intervention group.

Once again, power increases with the number of outcomes when there are low to modest intercorrelations between outcomes. For this model, PCA no longer has such a clear advantage. This makes sense, given that PCA will not derive a single main factor when the underlying data structure contains two independent factors. Figure 4 shows equivalent results for Model L2x, where we have a mixture of two types of outcome, one of which is influenced by intervention, and the other is not. This complicates calculation of power for a single variable, since power will depend on whether we select one of the outcomes that is influenced by intervention or not. The symbols in Figure 4 show average power, assuming we might select either type of outcome with equal frequency. We see that in this situation, MEff is clearly superior to PCA except when we have a large number of outcomes, a small effect size and weak intercorrelation between outcomes.

Deriving a lookup table
Table 2 shows corrected alpha values based on MEff, varying according to the correlation between outcome measures and the number of outcome measures in the study. In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known. Model L1, used to generate these data, assumes there will be a uniform intercorrelation between outcome measures in the population. This is likely to be unrealistic. Nevertheless, further simulations showed that values for MEff are reasonably consistent for different correlation matrices that all have the same average off-diagonal correlation. Consider, for instance, the correlations between 4 variables shown in Figure 1 for Model L2. Within the blocks V1-V2 and V3-V4 the intercorrelation is r, but between blocks the intercorrelation is zero. There are six off-diagonal correlations, and the mean off-diagonal value is (2 * r/6). For instance, if r equals .5, then the mean off-diagonal value is .167. To see how the MEff correction is affected by correlation structure, we can compare MEff for Model L2 with the MEff obtained in Model L1 with the same off-diagonal correlation. This exercise shows that they are similar, as shown in Table 3.
In other words, if estimating MEff from existing data, it is reasonable to base the estimate on the average off-diagonal correlation, regardless of whether the pattern of intercorrelations is uniform.
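This kind of check can be run directly from the two correlation matrices. The sketch below (illustrative, not the paper's script) compares MEff for a Model L2-style block structure of four outcomes (within-pair correlation .5, between-pair correlation zero) with a uniform matrix having the same mean off-diagonal correlation of .167; the exact comparison for the simulated models is given in Table 3:

meff <- function(cmat) {
  ev <- eigen(cmat)$values
  1 + (nrow(cmat) - 1) * (1 - var(ev) / nrow(cmat))
}
block <- diag(4)                                                # Model L2-style structure
block[1, 2] <- block[2, 1] <- block[3, 4] <- block[4, 3] <- .5  # two correlated pairs, r = .5
unif <- matrix(2 * .5 / 6, 4, 4); diag(unif) <- 1               # uniform matrix, same mean off-diagonal
c(block = meff(block), uniform = meff(unif))                    # compare the two MEff values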

Examples of application to published studies
Use of the lookup Table 2 can be illustrated with data from a study by Burgoyne et al. (2012), which evaluated a reading and language intervention for children with Down syndrome. A large number of assessments was carried out over various time points, but our focus here is on the five outcome measures that had been designated as "primary", as they were "proximal to the content of the intervention", i.e., they measured skills and knowledge that had been explicitly taught. The p-values reported by the authors (see Table 4) come from analyses of covariance comparing differences between intervention and control groups after 20 weeks of intervention, controlling for baseline performance, age and gender.
Whereas the Bonferroni-corrected alpha can be computed simply from knowledge of the number of outcome measures, the MEff-corrected alpha requires knowledge of the mean correlation between the outcome measures. In this case, this could be computed (r = .581), as the data were available in a repository (Burgoyne et al., 2016). From Table 2, we see that with five outcome measures and r = .6, the adjusted alpha is .014. In this example, three outcomes have p-values below the critical alpha when MEff is used. If the more stringent Bonferroni correction is applied, only two outcomes achieve significance.
In this example the intercorrelation between outcome measures could be computed from deposited raw data; if these are not available, then it may still be possible to obtain plausible estimates of intercorrelation between outcome measures, especially if widely-used instruments are used. An example is provided by two randomized controlled trials of a memory training programme for children, Cogmed. In both studies, the Automated Working Memory Assessment battery (Alloway, 2007) was used to assess outcome. Chacko et al. (2014) used four subtests, Dot Matrix, Spatial Recall, Digit Recall, and Listening Recall, and applied the Sidak-Bonferroni correction, with an effective alpha of .013. The raw data are not available, but the test manual indicates that intercorrelations between these four measures range from .70 to .78. Thus we can use the lookup table (Table 2), which shows that with four variables with intercorrelation of .7, an effective alpha of .02 can be used. In practice this did not affect the interpretation of results, because two of the measures, Dot Matrix and Digit Recall, had associated p-values of < .001 and .005 respectively. The p-values for Spatial Recall and Listening Recall were .048 and .728 respectively, and so would not meet criteria for significance with either the MEff or the Bonferroni method.

The other study, by Roberts et al. (2016), used a different subset of subtests from the same battery: Dot Matrix, Digit Recall, Backward Digit Recall, and Mister X, given at 6 months, 12 months and 24 months post-intervention. According to the test manual, intercorrelations between these subtests range from .65 to .80. These authors did not apply a correction for multiple comparisons. If Bonferroni correction had been used, this would have given an alpha level of .004 (.05/12). The test manual indicates that test-retest reliability of the subscales ranges from .84 to .89. Thus overall, we can estimate the off-diagonal correlations for all 12 measures to be around .8, which the lookup table shows as corresponding to an effective alpha of .01. In this study, only the Dot Matrix task effect was significant after correction for multiple comparisons, with p < .001 at both 6 months and 12 months, but p = .14 at 24 months. Backward Digit Recall gave p = .04 at 6 months only, which would be nonsignificant if any correction for multiple comparisons were used. All other comparisons were null. In the next section, the implications of these findings for choosing methods are discussed further.
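The look-up values used in these examples can also be reproduced directly from the number of outcomes and an assumed uniform intercorrelation. The helper below is an illustrative sketch (not taken from the paper's scripts), and the comments note which worked example each call corresponds to:

alpha_meff <- function(n_out, r, alpha = .05) {
  cmat <- matrix(r, n_out, n_out); diag(cmat) <- 1
  ev <- eigen(cmat)$values
  alpha / (1 + (n_out - 1) * (1 - var(ev) / n_out))
}
alpha_meff(5, .6)    # approximately .014 (cf. the Burgoyne et al. example)
alpha_meff(4, .7)    # approximately .02 (cf. the Chacko et al. example)
alpha_meff(12, .8)   # approximately .01 (cf. the Roberts et al. example)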

Discussion
Some interventions are expected to affect a range of related processes. In such cases, the need to specify a single primary outcome tends to create difficulties, because it is often unclear which of a suite of outcomes is likely to show an effect. Note that the MEff approach does not give the researcher free rein to engage in p-hacking: the larger the suite of measures included in the study, the lower the adjusted alpha will be. It does, however, remove the need to pre-specify one measure as the primary outcome, when there is genuine uncertainty about which measure might be most sensitive to intervention.
A second advantage is that in effect, by including multiple outcome measures, one can improve the efficiency of a study, in terms of the trade-off between power and familywise errors. A set of outcome measures may be regarded as imperfect proxy indicators of an underlying latent construct, so we are in effect building in a degree of within-study replication by including more than one outcome measure.
The simulations showed that PCA gives higher power than MEff in the case where all outcomes are indicators of a single underlying factor. PCA, however, needs to be computed from raw data and so is not feasible when re-evaluating published studies, whereas MEff is feasible so long as the average off-diagonal correlation between outcomes can be estimated. PCA is also less powerful when the outcomes tap into heterogeneous constructs and do not load on one major latent factor. Some examples are provided where prior literature gives plausible estimates of intercorrelations between outcome measures. Of course, such estimates are never as accurate as the actual correlations from the reported data, which may vary depending on sample characteristics. Wherever possible, it is preferable to work with original raw data. However, where correlations are available from test manuals, or where previous studies have reported correlations between outcomes, then the researcher can consider how interpretation of results may be affected by assuming a given degree of dependency between outcome measures.
A possible disadvantage of using MEff or Bonferroni correction over PCA is that such approaches are likely to tempt researchers to interpret specific outcomes that fall below the revised alpha threshold as meaningful. They may be, of course, but when we create a suite of outcomes that differ only by chance, it is common for only a subset of them to reach the significance criterion. Any recommendation to use MEff should be accompanied by a warning that if a subset of outcomes shows an effect of intervention, this could be due to chance. It would be necessary to run a replication to have confidence in a particular pattern of results.
In this regard, the example of studies using the Automated Working Memory Assessment to evaluate intervention for children with memory and attentional difficulties (Chacko et al., 2014; Roberts et al., 2016) is instructive. As reported in the test manual (Alloway, 2007), intercorrelations between the subtests are high, supporting the idea of a general working memory factor that influences performance on all such measures. On that basis, it might seem preferable to reduce subtest scores to one outcome measure, either by using data reduction such as principal component analysis, or by using the method advocated in the test manual to derive a composite score. We know this is associated with an increase in reliability of measurement and statistical power. However, the results of the two studies sound a note of caution: in both trials there were large improvements in one subtest, Dot Matrix, at least in the short-term, while other measures did not show consistent gains. This kind of result has been much discussed in evaluations of computerised training, where it has been noted that one may see improvements in tasks that resemble the training exercises, 'near transfer', without any generalisation to other measures, 'far transfer' (Aksayli, Sala, & Gobet, 2019). The very fact that measures are usually intercorrelated provides the rationale for hoping that training one skill will have an effect that generalises to other skills, and to everyday life. Yet the verdict on this kind of training is stark: after much early optimism, working memory training leads to improvements on what was trained, but these do not extend to other areas of cognition. This shows us that careful thought needs to be given to the logic of how a set of outcome measures is conceptualised: should we treat them as interchangeable indicators of a single underlying factor, or are there reasons to expect that the intervention will have a selective impact on a subset of measures? Even when variables are intercorrelated in the general population, they may respond differently to intervention.
It is also worth noting that results obtained with the MEff approach will depend on assumptions embodied in the simulation that is used to derive predictions. Outcome measures simulated here are normally distributed, and uniform in their covariance structure. It would be of interest to evaluate MEff in datasets with different variable types, such as those used by Vickerstaff et al. (2021) that included binary as well as continuous data, as well as modeling the impact of missing data.
In sum, a recommendation against using multiple outcomes in intervention studies does not lead to optimal study design.
Inclusion of several related outcomes can increase statistical power, without increasing the false positive rate, provided appropriate correction is made for the multiple testing. Compared to most other approaches for correlated outcomes, MEff is relatively simple. It could potentially be used to re-evaluate published studies that report multiple outcomes but may not have been analysed optimally, provided we have some information on the average correlation between outcome measures.
This project contains the scripts to generate and analyse simulated data. Two scripts are included: Data_simulation_modelL.Rmd, which generates the simulated data under Data, computes power tables and creates plots for Figures 2-4.

The paper has two stated goals: (1) to identify situations in which the use of multiple primary outcomes with appropriate adjustment for multiple comparisons yields higher statistical power than a single primary outcome; and (2) to provide tools for re-evaluating already published papers that used multiple primary outcomes but failed to adjust for multiplicity.
In shifting the focus of the paper, the authors have addressed my original concerns. I like the Meff approach because it is relatively straightforward and intuitive. So, I'm glad that this revised version provides a brief tutorial on Meff for psychologists. Table 1 also provides a nice intuitive illustration of how Meff works. I also appreciate that this new draft explores different possible patterns of correlations that reflect different underlying latent variables. This paints a more realistic picture and brings out some of the tradeoffs of the different approaches (PCA, Bonferroni, Meff).
I think this paper accomplishes its stated goals, and is a useful resource. I have spot-checked a few of the simulations and see similar patterns to what the paper reports. I appreciate that the authors have made their code and data available.
I have just a few minor suggestions:
1. Figures 2-4. I would recommend removing effect size=0.8. Effect size makes some difference but appears less important than number of outcomes and correlation strength. Furthermore, power is always high with effect size=0.8 and n=50, so it doesn't add much information to display effect size=0.8. I also had trouble distinguishing the two dashed lines. Altogether, effect size=0.8 just makes the graph harder to read without adding a lot of extra information.
2. It's somewhat contradictory to say that the lookup tables are useful when you don't have access to the underlying data but then to present an illustrative example for which the underlying data were available (Burgoyne 2012). Presumably, if we had access to the full data, we could calculate Meff exactly (or account for multiple testing using other approaches). Are there any examples you could present where the data are not available but there is some way to roughly estimate the correlations? E.g., because of summary data in the paper or because the correlations can be roughly estimated from previous work using the same variables? This might be closer to the real-world use case for the lookup tables.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Statistics, Sports Medicine
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The versions of the figures that include effect size of .8 are still available on OSF, and a Wiki has been added to that page on OSF to explain the difference.
I have also added two more real-world examples, both dealing with evaluation of a memory intervention, Cogmed, for children. It was possible to use the MEff lookup table because both studies used a published working memory battery, where correlations between the subscales can be found in the test manual. Although in practice this had little impact on the interpretation of these studies, it did show how alpha depends on the correction used: one study had used Bonferroni correction, which was over-stringent, and the other had used no correction for multiple contrasts. I found that this exercise not only provided an illustrative example of use of MEff, but it further emphasised the need to consider the underlying relationships between intervention and outcomes when deciding on an analytic strategy, and so I added a further comment about that in the Discussion.
The idea of creating lookup tables is a nice contribution. The biggest weakness is that correlations 1) are often unknown when data is not shared, and 2) are likely more varied than in the simulations. The authors admit this: "In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known." They then give an applied example where the data was shared in a repository. What is missing is a clear instruction what to do when data is not shared in a repository. I would assume this then means 1) asking for the data, 2) if data is not provided, making an informed guess, and 3) being careful in the interpretation, or performing some sensitivity analyses (e.g., X findings are significant, assuming correlations are not lower than Y). I think presenting a plan for when data is not available is a useful addition.
I also still wonder what happens if there is substantial variation in the off-diagonal correlations.
Maybe the authors feel this is rare in realistic datasets; then that could be mentioned.

Regarding the case of substantial variation in off-diagonal correlations, having played around with various scenarios, I think it's reasonable to regard model L2 as corresponding to that, because it has a mixture on the off-diagonal of variables with correlation of zero, and those with correlation of whatever is the maximum value for correlated measures from one factor. For instance, if you look at the bottom row of Table 3, L2 is the case where the off-diagonal contains a mixture of r values of 0 and .8, and L1 is the case when the off-diagonals are uniform (and equivalent to the mean of L2 values). Yet the MEff varies only slightly. I think that is about the most extreme variation you could get for off-diagonal values.

Daniel Lakens
Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands

The author discusses more optimal ways to control error rates than the Bonferroni correction when researchers use multiple measures in a study that are positively correlated. In these cases the authors proposed to specify the number of variables that should be significant at the default alpha level (e.g., 0.05) to make sure the overall Type 1 error rate does not exceed 0.05.
The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and concludes the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables can not be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are 'replicants' could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.
The comparison against the Bonferroni correction is one interesting baseline, but there is a large literature on how to correct for multiple comparisons that is also much more efficient than a Bonferroni correction, and which is the more interesting comparison. A strength of the Bonferroni correction is that it makes no assumptions about the variables that are corrected. But if one is willing to make assumptions, most importantly about the correlation between variables, more efficient approaches are available. How does this approach compare against other correction approaches? Although there are many correction approaches (this seems to be a particularly active field in neuroscience, where multiple comparisons when analyzing brain activation are common, and measures are strongly correlated), the following two references provide a starting point. There are several things to consider. First, the proposed simulation-based approach fixes the alpha level, and selects the number of variables that need to be significant. Numbers of variables are relatively crude, in that we can pick 1, 2, 3, 4, etc., but not 1.23 variables. Alternative approaches in the literature lower the alpha level. A benefit of these approaches is that the alpha level can be set at any value (e.g., 0.0352) to exactly control the Type 1 error rate. The consequences of this are also clear when we look at the figures and compare the principal component approach with the Adjust NVar approach. The familywise error rate for the Adjust NVar approach is often well below 0.05, while it is controlled at 0.05 in the principal component approach (and it is controlled at 0.05 in other approaches in the literature that lower the alpha level). The author discusses this (e.g., page 7 of the pdf version) in some detail, but the author seems slightly biased towards their own approach, stating that "the tradeoff between power and familywise error (expressed as a ratio) is higher for Adjust NVar." It is not clear this ratio is a fair evaluation of the methods, and the information is difficult to distill from the figures (a table with Type 1 error rates and Type 2 error rates would be more useful for this). This ratio is not so easy to summarize in a single sentence, I feel. If a design has 99.9% power, lowering the alpha from 0.05 to 0.02 has little effect on power, but if power is 0.8, lowering the alpha has a greater effect. Typically, the evaluation is done on the required sample size: which type of correction would require the smallest sample size? And then it becomes important to take a look at more modern corrections for correlated variables as well.
I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.
The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.
It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The 'skipsim' workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an 'Images' folder to run the plot code on line 679; this folder could be uploaded to the github repo perhaps?
To conclude, I believe this current version of the manuscript needs some additional work, which could include a more extensive discussion of other corrections in the literature, an exploration of additional simulations, and a practical example of how to use this approach when analyzing published studies.

Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Applied statistics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.
Author Response 18 Nov 2022

Dorothy Bishop
I thank the reviewer for such a comprehensive and constructive review. The two reviews complemented one another very nicely, and helped me see how to restructure and revise the paper to make it clearer.
As background, I should explain that this paper grew out of an attempt to write a primer on how to analyse/evaluate published intervention studies for students/professionals in allied health professions/education. My concern was that Bonferroni correction is often too conservative, but explanations of other methods are typically highly complex, so I was looking for a simpler rule of thumb that could be useful in interpreting published studies. I have emphasised this rationale more in the revised paper.
Response to specific points.
1. The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and concludes the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables can not be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are 'replicants' could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.

Response: This is an important point, which is similar to issues raised by reviewer 1, so I have now extended the simulations to take into account situations where outcomes may not all be indicators of one factor.

2. Alternative methods for correction for multiple comparisons.
Response: The reviewer recommends two references with alternative approaches to multiple comparison correction. As I noted above, one goal was to keep things simple, so I am reluctant to do a full comparison of all methods, especially since it is clear that I need to consider different underlying data structures. I like the relative simplicity of the MEff approach recommended by reviewer 1, and so I am now presenting a comparison of Bonferroni, PCA and MEff.
3. I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.
Response: Again, this converges nicely with the recommendations of reviewer 1, and I have now restructured the paper to focus more on this aspect and to give a real-life example.
4. The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.
Response: I now contrast three models with different patterns of intercorrelation between intervention and outcome. Potentially, there is no end to the options to consider, but I found it reassuring that with the MEff approach, the adjusted alpha was similar for different correlated variables, provided the mean off-diagonal correlation was constant.
5. It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The 'skipsim' workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an 'Images' folder to run the plot code on line 679; this folder could be uploaded to the github repo perhaps?
Response: I have now separated the simulation and generation of corresponding figures from the code to write the paper.
When outcomes are correlated, there is no simple formula for obtaining MinNSig, so the author has used a simulation approach to account for varying correlation structures.
The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p<.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.
However, I am less convinced of the value of AdjustNVar as a formal tool for controlling the familywise error rate in a planned study. At a minimum, further development and a broader set of simulations would be required to support such a recommendation. The current manuscript describes three alternatives to specifying a single primary outcome in an RCT: (1) Bonferroni adjustment, (2) permutation tests, and (3) use of PCA to derive a single composite outcome. But this ignores existing p-value adjustment methods that are less conservative than Bonferroni. For example, the "M-effective" (Meff) approach accomplishes many of the same goals as AdjustNVar (see: Cheverud (2001), Nyholt (2004), and Derringer (2018)). In the Meff approach, one adjusts the p-value threshold by dividing by the effective number of outcomes (Meff) rather than the actual number of outcomes (M). Meff is based on the eigenvalues of the correlation matrix of the outcomes. Where Eigen is the observed vector of eigenvalues from the correlation matrix of the outcomes, Meff is calculated as Meff = 1 + (M - 1) * (1 - Var(Eigen)/M). Like AdjustNVar, Meff is simple and accounts for correlated outcomes. But I believe it has several advantages over AdjustNVar: (1) Meff precisely controls the Type I error rate, whereas AdjustNVar has varying Type I error rates that cannot be precisely controlled by the investigator; (2) Meff accounts for the correlation structure observed in the data, whereas AdjustNVar requires the investigator to guess at the correlation structure; if this guess is far off (which could easily be the case), this would lead to poor Type I error control.
This paper has numerous strengths, including the novelty of the idea; the potential use as a heuristic for re-interpreting flawed published papers; the concision of the writing; and the availability of all code and data. The major limitations of the paper are: (1) it presents an overly narrow set of simulations that do not capture most realistic situations, but then makes overly broad claims based on these simulations; (2) it does not compare AdjustNVar to existing approaches that are less conservative than Bonferroni, such as Meff; and (3) it does not address the different reasons why researchers may be including multiple outcomes, but these different reasons lead to markedly different correlation structures.

Specific comments:
1. The paper would benefit from further consideration of the reasons why researchers may include multiple outcomes. The paper focuses on the case where an intervention is "expected to affect a range of related processes." The simulations make assumptions that match this case, assuming equal correlations across outcomes and equal true effects for each outcome. But researchers may include multiple outcomes for many other reasons, such as: (a) they aren't sure which process the intervention will affect, (b) they believe the intervention may affect two different processes but they measure each process with several different measurements to "hedge their bets", or (c) they include a "soft" endpoint in addition to a "hard" endpoint because the "hard" endpoint may occur too rarely. Each of these cases corresponds to different assumptions for the simulations. For example, (b) would be expected to have two clusters of highly correlated variables that are only weakly correlated with each other, which will affect MinNSig.
2. The paper suggests that AdjustNVar could be used in study planning: researchers would guess at the correlation structure and set a MinNSig ahead of time. But if they guess the correlation structure incorrectly, such as underestimating the true correlation, then they may choose a MinNSig that does not adequately control Type I error.
The "quantum nature" of AdjustNVar is not a desirable characteristic.The researcher is unable to precisely control the Type I error rate.In planning a study in which the correlation is expected to be 0.4, for example, Table 2 would suggest that the researcher should then always choose 9 outcomes over 5-8 outcomes, since 9 maximizes the chances of getting at least 3 p-values <.05.This is one reason I prefer Meff, which precisely controls the Type I error rate.

3.
4. This description is misleading: "Should we dismiss the trial as showing no benefit? We can use the binomial theorem to check the probability of obtaining this result if the null hypothesis is true and the measures are independent: it is 0.033, clearly below the 5% alpha level." The description gives the misleading impression that one would be justified in re-evaluating a paper that used a Bonferroni correction by instead applying the criterion of at least two p-values <.05. But doing so would inflate the Type I error rate. Results would have been declared significant if EITHER at least one p-value met the Bonferroni threshold OR at least two p-values were <.05, leading to an effective Type I error rate of 7% (assuming independent outcomes). Note that this example reappears in the discussion and also mistakenly implies that had three p-values been <.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally.

5. Then indicate that MinNSig occurs at one number above when the cumulative frequency crosses 95%.
6. I'm unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2 or to present two tables, one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.
7. Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is <0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).
8. Related to comment (7), I don't think the paper is justified in making this broad claim: "The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables." This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.
9. Figures 4-6. Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation "stacks the deck" for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.
10. The article claims that power is only "slightly lower" for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just "slight". For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.
11. PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.
11. In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don't view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.
12. Figures 1-6: I found these graphs hard to read as they don't have a clear take-home point. I would suggest removing the single outcome comparisons altogether and then reformatting so that the graphs directly compare AdjustNVar to Meff and PC (three lines). In one graph, hold effect size, sample size, and N outcomes constant, and then show the power of the three methods as a function of increasing correlation. In another graph, hold correlation, sample size, and effect size constant, and show the power of the three methods as a function of N outcomes. And so on. All methods aim to control the familywise error rate at 5%, so this should not be a variable. The fact that AdjustNVar sometimes results in a lower Type I error rate is incidental - it arises as a quirk of the method, not as the intent of the researcher.
13. Where AdjustNVar may have an advantage over a method like Meff is for re-evaluating flawed published studies where researchers used multiple outcomes but did not account in any way for multiple testing. An AdjustNVar table such as Table 2 would give readers a quick way to re-evaluate such studies without the need for any calculations or access to raw data. I would focus the paper more on this application.

Dorothy Bishop
Thanks for the very useful evaluation of this paper, which has prompted further thoughts and a revised manuscript. There was some convergence of views of the two reviewers, although the specific recommendations varied. I thank the reviewer for engaging so thoroughly with the manuscript and for helping improve it.
General point A: AdjustNVar as a heuristic.
Reviewer 1 writes: The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p<.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.
Response: After exploring Meff, I decided not to proceed with the AdjustNVar approach, as I think a lookup table derived from MEff would achieve the same effect but without the problems arising from the quantal nature of N variables.
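For the independent-outcomes case the reviewer describes, the binomial calculation is straightforward; a minimal sketch (correlated outcomes are the harder case, which is where an MEff-derived lookup table comes in):

```r
# Chance of at least k of 10 independent outcomes reaching p < .05
# when no true effects exist: compatible with chance for k = 1 or 2, but not k = 3.
sapply(1:3, function(k) 1 - pbinom(k - 1, size = 10, prob = 0.05))
# approximately .40, .086 and .012 respectively
```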
General Point B: Need to consider MEff
Reviewer 1 notes that the MEff approach accomplishes many of the same goals as AdjustNVar, and is preferable in many respects.
Response: I had been unaware of MEff, and having consulted the references, I agree it is an elegant solution to the problem I was trying to tackle that avoids some of the limitations of AdjustNVar, particularly the need to assume a given correlation structure. I initially thought this approach had only been used in genetics and not been applied in psychology, but the final reference pointed to the work of Derringer, who has done a great job in a preprint that provides a tutorial in its use. I don't think it is much used in intervention research, which is the context I am particularly interested in, and so I think it is worthwhile updating the article so that it serves as a basic introduction to MEff, with discussion of factors affecting power.
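For readers unfamiliar with the method, here is a minimal sketch of the Cheverud/Nyholt version of MEff (illustrative code of my own, not taken from Derringer's tutorial or from the article's scripts): the effective number of tests is estimated from the eigenvalues of the outcome correlation matrix, and the per-test alpha is then adjusted accordingly.

```r
# Effective number of tests (Nyholt, 2004): Meff = 1 + (M - 1) * (1 - var(lambda)/M),
# where lambda are the eigenvalues of the M x M correlation matrix of the outcomes.
meff_nyholt <- function(R) {
  M <- ncol(R)
  lambda <- eigen(R, only.values = TRUE)$values
  1 + (M - 1) * (1 - var(lambda) / M)
}

# Illustration: 6 outcomes with a common pairwise correlation of .6
R <- matrix(0.6, nrow = 6, ncol = 6); diag(R) <- 1
meff <- meff_nyholt(R)
alpha_adj <- 1 - (1 - 0.05)^(1 / meff)   # Sidak-style adjusted per-test alpha
c(Meff = meff, adjusted_alpha = alpha_adj)
```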
Note that this example reappears in the discussion and also mistakenly implies that had three p-values been <.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally.
Response: thanks for this clarification. As AdjustNVar is now omitted, this no longer applies.

4. I found Table 1 confusing on first read, and I would recommend simplifying it by focusing on a single number of outcomes rather than both 2 outcomes and 4 outcomes, and by removing discussions of ranking p-values.

Response: this no longer applies, as the tables have been redone.

6. I'm unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2, or to present two tables - one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.
Response: One-tailed tests have a bad reputation because they are so often misused simply to require a lower level of significance when there are no directional predictions, but there are contexts in which they are entirely appropriate, and that includes the kind of intervention study that is the focus of attention here. You can reasonably predict that an intervention will improve performance rather than worsen it. Reviewer 2 wrote a blogpost about this which I find convincing: http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-andunderused.html. I have now explained this further.
7. Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is <0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).
Response: Thanks again for helping clarify what is being simulated here; I hope this is now clearer in the current article. The figures have been redrawn and should be easier to interpret. Perhaps one takeaway point is that, in being concerned to control type I error, the CONSORT recommendations appear to ignore the gains in power that can be achieved with multiple outcomes.
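To make concrete the kind of comparison the reviewer describes, here is a rough sketch; the settings (6 outcomes, all with a true effect of d = .3, common correlation .5, 50 per group) are illustrative and are not taken from the article's own simulation code:

```r
# Power of (a) Bonferroni across 6 correlated outcomes versus (b) a single
# pre-specified primary outcome, when every outcome has a true effect.
library(MASS)   # for mvrnorm
set.seed(123)
n <- 50; k <- 6; d <- 0.3; r <- 0.5; nsim <- 2000
Sigma <- matrix(r, k, k); diag(Sigma) <- 1
hits <- replicate(nsim, {
  ctrl <- mvrnorm(n, mu = rep(0, k), Sigma = Sigma)
  trt  <- mvrnorm(n, mu = rep(d, k), Sigma = Sigma)
  p <- sapply(seq_len(k), function(j) t.test(trt[, j], ctrl[, j])$p.value)
  c(bonferroni = any(p < 0.05 / k),   # any outcome survives the Bonferroni threshold
    single     = p[1] < 0.05)         # rely on one primary outcome only
})
rowMeans(hits)   # estimated power of each strategy
```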
8. Related to comment (7), I don't think the paper is justified in making this broad claim: "The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables." This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.
Figures 4-6: Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation "stacks the deck" for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.
Response: same as for point 7; this is clearly a key issue, and I think it is clearer now that the alternative models (L2 and L2x) are included.
9. The article claims that power is only "slightly lower" for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just "slight". For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.
Response: thanks for pushing back on this; it is a fair comment.
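For clarity, the PC composite strategy under comparison can be summarised in a few lines; this sketch reflects my understanding (first principal component of the pooled outcome data, then a single two-group test on the component scores) rather than a definitive implementation:

```r
# Principal component composite: reduce the outcomes to the first principal
# component and compare the two groups on that single score with one t-test.
pc_composite_p <- function(ctrl, trt) {
  dat <- rbind(ctrl, trt)                         # pool both groups
  pc1 <- prcomp(dat, scale. = TRUE)$x[, 1]        # scores on the first component
  group <- factor(rep(c("control", "intervention"), c(nrow(ctrl), nrow(trt))))
  t.test(pc1 ~ group)$p.value
}
```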
10. PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.
11. In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don't view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.

Figure 1. Models for data generation. Heatmap depicts correlations between observed variables V1 to V4 and latent factors, where colour denotes association. A diagonal line through a latent factor indicates it is not related to intervention.

Figure 2. Model L1, 50 per group. Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.

Figure 3. Model L2: 50 per group. Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.

Figure 4. Model L2x: 50 per group. Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.
2. A lookup table is provided that gives values of MEff, and associated adjusted alpha-levels for different set sizes of outcome measures, with mean pairwise correlation varying from 0 to 1 in steps of .1.
3. Use of the lookup table is shown for real-world examples of application of MEff using published articles.

Table 3. AlphaMEff values for Model L2 (odd rows) and Model L1 (even rows), with the same mean off-diagonal r. For Model L2, "Start r" is the value for the nonzero off-diagonal correlations.
Multiple_outcomes_revised.Rmd, which generates the text for the current article. Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
bottom row of table 3. For model L2, we have r values of either 0 or .8. For model L1, we have a corresponding value of r.avg of .343. The alpha values for L1 and L2 are in adjacent rows of Table 3, and it is clear they differ only slightly. If you introduced more variability in model L2, with some r-values being intermediate between 0 and r.max, the difference between models L1 and L2 would be smaller.
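As an illustration (using an assumed configuration, not the article's own code): one structure with a mean off-diagonal correlation of about .343 is eight outcomes in two blocks of four, with r = .8 within blocks and 0 between them. The sketch below contrasts the MEff-adjusted alpha for that block structure (as in Model L2) with a uniform correlation equal to its mean (as in Model L1), using the Nyholt eigenvalue formula:

```r
# Adjusted alpha for a block correlation structure versus a uniform structure
# with the same mean off-diagonal correlation.
meff_nyholt <- function(R) {
  M <- ncol(R)
  lambda <- eigen(R, only.values = TRUE)$values
  1 + (M - 1) * (1 - var(lambda) / M)
}
alpha_meff <- function(R) 1 - (1 - 0.05)^(1 / meff_nyholt(R))

R_L2 <- kronecker(diag(2), matrix(0.8, 4, 4)); diag(R_L2) <- 1   # two blocks of four, r = .8 within
r_avg <- mean(R_L2[lower.tri(R_L2)])                              # about .343
R_L1 <- matrix(r_avg, 8, 8); diag(R_L1) <- 1                      # uniform correlation
c(L2_block = alpha_meff(R_L2), L1_uniform = alpha_meff(R_L1))
```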
Following suggestions by reviewer 1, I've added two examples of real-world studies where data are not available. In the field of educational/psychological interventions, it can be possible to get estimates of intercorrelations between measures if well-established measures are used, as is the case in this example. I've also added some thoughts on what to do when this is not the case, as you proposed.

Is the rationale for developing the new method (or application) clearly explained? Partly
Is the description of the method technically sound? Partly
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly
References
1. Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001; 87 (Pt 1): 52-8. PubMed Abstract | Publisher Full Text
2. Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004; 74 (4): 765-9. PubMed Abstract | Publisher Full Text
3. Derringer J: A simple correction for non-independent tests. 2018. Publisher Full Text
Competing Interests: No competing interests were disclosed.

I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.