<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.73520.3</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Using multiple outcomes in intervention studies: improving power while controlling type I errors</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 3; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Bishop</surname>
                        <given-names>Dorothy V. M.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-2448-4033</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Experimental Psychology, University of Oxford, Oxford, Oxon, OX2 6GG, UK</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:dorothy.bishop@psy.ox.ac.uk">dorothy.bishop@psy.ox.ac.uk</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>6</day>
                <month>11</month>
                <year>2023</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2021</year>
            </pub-date>
            <volume>10</volume>
            <elocation-id>991</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>30</day>
                    <month>10</month>
                    <year>2023</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Bishop DVM</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/10-991/pdf"/>
            <abstract>
                <sec>
                    <title>Background </title>
                    <p> The CONSORT guidelines for clinical trials recommend using a single primary outcome, to guard against excess false positive findings when multiple measures are considered. However, statistical power can be increased while controlling the familywise error rate if multiple outcomes are included. The MEff statistic is well-suited to this purpose, but is not well-known outside genetics. </p>
                </sec>
                <sec>
                    <title>Methods </title>
                    <p> Data were simulated for an intervention study, with a given sample size (N), effect size (E) and correlation matrix for a suite of outcomes (R). Using the variance of the eigenvalues of the correlation matrix, we computed MEff, the effective number of variables by which the alpha level should be divided to control the familywise error rate. Various scenarios were simulated to consider how MEff is affected by the pattern of pairwise correlations within a set of outcomes. The power of the MEff approach was compared with that of Bonferroni correction and of analysis of the first component from a principal component analysis (PCA).</p>
                </sec>
                <sec>
                    <title>Results </title>
                    <p> In many situations, power can be increased by inclusion of multiple outcomes. Differences in power between MEff and Bonferroni correction are small if intercorrelations between outcomes are low, but the advantage of MEff is more evident as intercorrelations increase. PCA is superior in cases where the impact on outcomes is fairly uniform, but MEff is applicable when intervention effects are inconsistent across measures.</p>
                </sec>
                <sec>
                    <title>Conclusions </title>
                    <p> The optimal method for correcting for multiple testing depends on the underlying data structure, with PCA being superior if outcomes are all indicators of a common underlying factor. Both Bonferroni correction and MEff can be applied post hoc to evaluate published intervention studies, with MEff being superior when outcomes are moderately or highly correlated. A lookup table is provided to give alpha levels for use with MEff for cases where the correlation between outcome measures can be estimated.</p>
                </sec>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>intervention</kwd>
                <kwd>methodology</kwd>
                <kwd>statistics</kwd>
                <kwd>correlated outcomes</kwd>
                <kwd>power</kwd>
                <kwd>familywise error rate</kwd>
                <kwd>multiple comparisons</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 2</title>
                <p>This revised version has two substantial changes. a) Figures 2 - 4 have been revised in line with suggestions by reviewer 1, to remove the lines corresponding to effect size of .8. b) Additional real-world examples of studies are provided where the look-up table (Table 2) may be used, but where original raw data was not available.&#x00a0; These stimulated further thoughts about the need to consider the nature of the relationship between outcome measures and an intervention, which are picked up in the Discussion. Both for appropriate analysis and for interpretation, it is important to decide whether multiple outcomes are to be regarded as alternate indicators of a common underlying construct, or whether they reflect latent variables that may respond differently to the intervention.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec id="sec1">
            <title>Issues raised by inclusion of multiple outcomes</title>
            <p>The CONSORT guidelines for clinical trials (
                <xref ref-type="bibr" rid="ref8">Moher et al., 2010</xref>) are very clear on the importance of having a single primary outcome:</p>
            <p>
				
                <italic toggle="yes">All RCTs assess response variables, or outcomes (end points), for which the groups are compared. Most trials have several outcomes, some of which are of more interest than others. The primary outcome measure is the pre-specified outcome considered to be of greatest importance to relevant stakeholders (such as patients, policy makers, clinicians, funders) and is usually the one used in the sample size calculation. Some trials may have more than one primary outcome. Having several primary outcomes, however, incurs the problems of interpretation associated with multiplicity of analyses and is not recommended.</italic>
			</p>
            <p>This advice often creates a dilemma for the researcher: in many situations there are multiple measures that could plausibly be used to index the outcome (
                <xref ref-type="bibr" rid="ref12">Vickerstaff, Ambler, King, Nazareth, &amp; Omar, 2015</xref>). If we have several outcomes and we would be interested in improvement on any measure, then we need to consider the familywise error rate, i.e. the probability of at least one false positive in the whole set of outcomes. For instance, if we want to set the false positive rate, alpha, to .05, and we have six independent outcomes, none of which is influenced by the intervention, the probability that none of the tests of outcome effects is significant will be .95^6, which is .735. Thus the probability that at least one outcome is significant, the familywise error rate, is 1-.735, which is .265. In other words, in about one quarter of studies, we would see a false positive when there is no true effect. The larger the number of outcomes, the higher the false positive rate.</p>
            <p>A common solution is to apply a Bonferroni correction by dividing the alpha level by the number of outcome measures - in this example .05/6 = .008. This way the familywise error rate is kept at or below .05. However, this correction is overconservative if, as is usually the case, the various outcomes are intercorrelated.</p>
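            <p>The arithmetic in the two preceding paragraphs can be checked with a short Python sketch. The function names are illustrative and are not taken from the article's scripts, and the calculation assumes independent outcomes with no true effect, as in the example in the text.</p>
            <code language="python"><![CDATA[
```python
def familywise_error_rate(alpha, n_outcomes):
    """Probability of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def bonferroni_alpha(alpha, n_outcomes):
    """Per-test alpha level after Bonferroni correction."""
    return alpha / n_outcomes


uncorrected = familywise_error_rate(0.05, 6)    # six outcomes, each tested at .05
per_test = bonferroni_alpha(0.05, 6)            # corrected per-test alpha
corrected = familywise_error_rate(per_test, 6)  # familywise rate after correction

print(round(uncorrected, 3))  # 0.265
print(round(per_test, 3))     # 0.008
print(round(corrected, 3))    # 0.049
```
]]></code>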
            <p>Various methods have been developed to address the problem of multiple testing. One approach is to adopt some process of data reduction, such as extracting a principal component from the measures that can be used as the primary outcome. Alternatively, a permutation test can be used to derive the exact probability of an observed pattern of results. Neither approach, however, is helpful if the researcher is evaluating a published paper where an appropriate correction has not been made. These could be cases where no correction is made for multiple testing, risking a high rate of false positives, or where Bonferroni correction has been applied despite the outcomes being correlated, which makes the test overconservative. The goal of the current article is to provide some guidance for interpretation of published papers where the raw data are not available for recomputation of statistics.</p>
            <p>
				
                <xref ref-type="bibr" rid="ref12">Vickerstaff et al. (2015)</xref> reviewed 209 trials in neurology and psychiatry, and found that 60 reported multiple primary outcomes, of which 45 did not adjust for multiplicity. Those that did adjust mostly used the Bonferroni correction. Thus it would appear that many researchers feel the need to include several outcomes, but this is not always adjusted for appropriately.</p>
            <p>In a review of an earlier version of this paper, 
                <xref ref-type="bibr" rid="ref11">Sainani (2021)</xref> pointed out that the MEff statistic, originally developed in the field of genetics by 
                <xref ref-type="bibr" rid="ref5">Cheverud (2001)</xref> and 
                <xref ref-type="bibr" rid="ref9">Nyholt (2004)</xref>, provided a simple way of handling this situation. With this method, one computes eigenvalues from the correlation matrix of outcomes, which reflect the degree of intercorrelation between them. The mathematical definition of an eigenvalue can be daunting, but an intuitive sense of how it relates to correlations can be obtained by considering the cases shown in 
                <xref ref-type="table" rid="T1">Table 1</xref>. This shows how eigenvalues vary with the correlation structure of a matrix, using an example of six outcome measures. The number of eigenvalues, and the sum of the eigenvalues, is identical to the number of measures. Let us start by assuming a matrix in which all off-diagonal values are equal to 
                <italic toggle="yes">r.</italic> It can be seen that when the correlation is zero, each eigenvalue is equal to one, and the variance of the eigenvalues is zero. When the correlation is one, the first eigenvalue is equal to six, all other eigenvalues are zero, and the variance of the eigenvalues is six. As correlations increase from .2 to .8, the size of the first eigenvalue increases, and that of the other eigenvalues decreases.</p>
            <table-wrap id="T1" orientation="portrait" position="float">
                <label>Table 1. </label>
                <caption>
                    <title>Eigenvalues, MEff and AlphaMEff with 6 outcome variables.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">r</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen1</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen2</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen3</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen4</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen5</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Eigen6</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Var</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">MEff</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">AlphaMEff</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.00</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">6.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">2.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.24</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">5.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">3.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.96</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">5.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">4.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">2.16</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">4.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.012</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">5.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">3.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">2.8</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.018</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">6.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">6.00</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                            <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <p>In 
                <xref ref-type="table" rid="T1">Table 1</xref>, 
                <italic toggle="yes">r</italic> is the intercorrelation between the six outcomes, Eigen1 to Eigen6 are the eigenvalues, and Var is the variance of the six eigenvalues, which is used to compute MEff (the effective number of comparisons) from the formula:
                <disp-formula id="e2">
					
                    <mml:math display="block">
                        <mml:mtext>MEff</mml:mtext>
                        <mml:mo>=</mml:mo>
                        <mml:mn>1</mml:mn>
                        <mml:mo>+</mml:mo>
                        <mml:mo>(</mml:mo>
                        <mml:mi mathvariant="normal">N</mml:mi>
                        <mml:mo>-</mml:mo>
                        <mml:mn>1</mml:mn>
                        <mml:mo>)</mml:mo>
                        <mml:mo>*</mml:mo>
                        <mml:mo>(</mml:mo>
                        <mml:mn>1</mml:mn>
                        <mml:mo>-</mml:mo>
                        <mml:mo>(</mml:mo>
                        <mml:mtext>Var</mml:mtext>
                        <mml:mo>(</mml:mo>
                        <mml:mtext>Eigen</mml:mtext>
                        <mml:mo>)</mml:mo>
                        <mml:mo>/</mml:mo>
                        <mml:mi mathvariant="normal">N</mml:mi>
                        <mml:mo>)</mml:mo>
                        <mml:mo>)</mml:mo>
                    </mml:math>
				</disp-formula>
			</p>
            <p>where N is the number of outcome measures, and Eigen is the set of N eigenvalues.</p>
            <p>This value is then used to compute the corrected alpha level, AlphaMEff. Assuming we set alpha to .05, AlphaMEff is .05 divided by MEff. One can see that this value is equivalent to the Bonferroni-corrected alpha (.05/6) when there is no correlation between variables, and equivalent to .05 when all variables are perfectly correlated.</p>
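            <p>As a hedged illustration of the formula, the Python sketch below computes MEff and AlphaMEff for a correlation matrix in which all off-diagonal values are equal. It is not the author's R script, and the function names are hypothetical. Rather than extracting the eigenvalues explicitly, it relies on two standard properties of a correlation matrix: the eigenvalues sum to N, and the sum of the squared eigenvalues of a symmetric matrix equals the sum of its squared entries. Together these give the sample variance of the eigenvalues directly. The sketch assumes the input is a valid correlation matrix with ones on the diagonal.</p>
            <code language="python"><![CDATA[
```python
def equicorrelation_matrix(n, r):
    """n x n correlation matrix with every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(n)] for i in range(n)]


def eigenvalue_variance(corr):
    """Sample variance of the eigenvalues of a correlation matrix.

    The eigenvalues have mean 1, and their squares sum to the sum of the
    squared matrix entries, so no eigendecomposition is required.
    """
    n = len(corr)
    sum_squared_entries = sum(value * value for row in corr for value in row)
    return (sum_squared_entries - n) / (n - 1)


def meff(corr):
    """Effective number of comparisons: 1 + (N-1) * (1 - Var(Eigen)/N)."""
    n = len(corr)
    return 1.0 + (n - 1) * (1.0 - eigenvalue_variance(corr) / n)


def alpha_meff(corr, alpha=0.05):
    """Corrected alpha level: alpha divided by MEff."""
    return alpha / meff(corr)


# Print r, Var, MEff and AlphaMEff for six outcomes, as in Table 1
for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    corr = equicorrelation_matrix(6, r)
    print(r, round(eigenvalue_variance(corr), 2),
          round(meff(corr), 1), round(alpha_meff(corr), 3))
```
]]></code>
            <p>Run for six outcomes, the loop should reproduce the Var, MEff and AlphaMEff columns of Table 1. The same functions accept any correlation matrix supplied as a list of rows, not only the equal-correlation case.</p>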
            <p>
				
                <xref ref-type="bibr" rid="ref6">Derringer (2018)</xref> provided a useful tutorial on MEff, noting that it is not well-known outside the field of genetics, but is well-suited to the field of psychology. Her preprint includes links to R scripts for computing MEff and illustrates their use in three datasets.</p>
            <p>These resources will be sufficient for many readers interested in using MEff, but researchers may find it useful to have a look-up table for the case when they are evaluating existing studies. The goal of this paper is two-fold:
                <list list-type="alpha-upper">
                    <list-item>
                        <label>A.</label>
                        <p>To consider how inclusion of multiple outcome measures affects statistical power, relative to the case of a single outcome, when appropriate correction of the familywise error rate is made using MEff. Results from MEff are compared with use of Bonferroni correction and analysis of the first component derived from Principal Components Analysis (PCA).</p>
                    </list-item>
                    <list-item>
                        <label>B.</label>
                        <p>To provide a look-up table to help evaluate studies with multiple outcome measures, without requiring the reader to perform complex statistical analyses.</p>
                    </list-item>
                </list>
			</p>
            <p>These goals are achieved in three sections below:
                <list list-type="order">
                    <list-item>
                        <label>1.</label>
                        <p>Power to detect a true effect using MEff is calculated from simulated data for a range of values of sample size (N), effect size (E) and the matrix of intercorrelations between outcomes (R).</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>A lookup table is provided that gives values of MEff, and associated adjusted alpha-levels for different set sizes of outcome measures, with mean pairwise correlation varying from 0 to 1 in steps of .1.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>Use of the lookup table is shown for real-world examples of application of MEff using published articles.</p>
                    </list-item>
                </list>
			</p>
            <sec id="sec2">
                <title>Alternative approach, MinNVar</title>
                <p>In the original version of this manuscript (
                    <xref ref-type="bibr" rid="ref1">Bishop, 2021</xref>), an alternative approach, MinNVar, was proposed, in which the focus was on the 
                    <italic toggle="yes">number</italic> of outcome variables achieving a conventional .05 level of significance. As noted by reviewers, this has the drawback that it could not reflect continuous change in probability levels, because it was based on integer values (i.e. number of outcomes). This made it overconservative in some cases, where adopting the MinNVar approach gave a familywise error rate well below .05. One reason for proposing MinNVar was to provide a very easy approach to evaluating studies that had multiple outcomes, using a lookup table to check the number of outcomes needed, depending on overall correlation between measures. However, it is equally feasible to provide lookup tables for MEff, which is preferable on other grounds, and so MinNVar is not presented here; interested readers can access the first version of this paper to evaluate that approach.</p>
            </sec>
            <sec id="sec3">
                <title>Use of one-tailed p-values</title>
                <p>In the simulations described here, one-tailed tests are used. Two-tailed p-values are far more common in the literature, perhaps because one-tailed tests are often abused by researchers, who may switch from a two-tailed to a one-tailed p-value in order to nudge results into significance.</p>
                <p>This is unfortunate because, as argued by 
                    <xref ref-type="bibr" rid="ref7">Lakens (2016)</xref>, provided one has a directional hypothesis, a one-tailed test is more efficient than a two-tailed test. It is a reasonable assumption that in intervention research, which is the focus of the current paper, the hypothesis is that an outcome measure will show improvement. Of course, interventions can cause harms, but, unless those are the focus of study, we have a directional prediction for improvement.</p>
            </sec>
        </sec>
        <sec id="sec4" sec-type="methods">
            <title>Methods</title>
            <p>Correlated variables were simulated using the R programming language (
                <xref ref-type="bibr" rid="ref10">R Core Team, 2020</xref>) (
                <ext-link ext-link-type="uri" xlink:href="http://www.r-project.org/">R Project for Statistical Computing</ext-link>, RRID:SCR_001905). The script to generate and analyse simulated data is available on 
                <ext-link ext-link-type="uri" xlink:href="https://osf.io/hsaky/">https://osf.io/hsaky/</ext-link>. For each model specified below, 2000 simulations were run. Note that to keep analysis simple, a single value was simulated for each case, rather than attempting to model pre- vs post-intervention change. Data for the two groups were generated by the same process, except that a given effect size was added to scores of the intervention group, I, but not to the control group, C. Scores of the two groups were compared using a one-tailed t-test for each run.</p>
            <p>Power was computed for different levels of effect size (E), correlation between outcomes (
                <bold>R</bold>) and sample size per group (N) for the following methods:
                <list list-type="alpha-lower">
                    <list-item>
                        <label>a)</label>
                        <p>Bonferroni-corrected data: Proportion of runs where p was less than the Bonferroni-corrected value for at least one outcome.</p>
                    </list-item>
                    <list-item>
                        <label>b)</label>
                        <p>MEff-corrected data: Proportion of runs where p was less than the AlphaMEff value for at least one outcome.</p>
                    </list-item>
                    <list-item>
                        <label>c)</label>
                        <p>Principal component analysis (PCA): Proportion of runs where p was below .05 when groups I and C were compared on scores on the first principal component of PCA.</p>
                    </list-item>
                </list>
			</p>
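            <p>The decision rule shared by methods (a) and (b) can be sketched in Python as follows. This is an illustration only: the p-values are invented for the example, the MEff value of 2.0 is an assumed figure rather than a result from the article, and the function name is hypothetical.</p>
            <code language="python"><![CDATA[
```python
def power_any_outcome(runs, corrected_alpha):
    """Proportion of runs in which at least one outcome has p below corrected_alpha.

    runs is a list of simulated runs, each a list of one-tailed p-values
    (one p-value per outcome measure).
    """
    hits = sum(1 for p_values in runs if min(p_values) < corrected_alpha)
    return hits / len(runs)


# Invented p-values for four runs with three outcomes each (illustration only)
runs = [
    [0.004, 0.200, 0.030],
    [0.012, 0.400, 0.300],
    [0.300, 0.250, 0.600],
    [0.020, 0.045, 0.050],
]

alpha = 0.05
bonferroni_level = alpha / 3   # divide by the number of outcomes
meff_level = alpha / 2.0       # divide by an assumed MEff of 2.0

print(power_any_outcome(runs, bonferroni_level))  # 0.5
print(power_any_outcome(runs, meff_level))        # 0.75
```
]]></code>
            <p>In this toy example the fourth run counts as a detection under the MEff-corrected level but not under Bonferroni, because its smallest p-value (.020) lies between the two thresholds.</p>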
        </sec>
        <sec id="sec5">
            <title>Method for simulating outcomes</title>
            <p>Simulating multivariate data forces one to consider how to conceptualise the relationship between an intervention and multiple outcomes. Implicit in the choice of method is an underlying causal model that includes mechanisms that lead measures to be correlated.</p>
            <p>In the simulation, outcomes were modelled as indicators of one or more underlying latent factors, which mediate the intervention effect. This can be achieved by first simulating a latent factor, with an effect size of either zero, for group C, or E for group I. Observed outcome measures are then simulated as having a specific correlation with the latent variable - i.e. the correlation determines the extent to which the outcomes act as indicators of the latent variable. This can be achieved using the formula:
                <disp-formula id="e1">
					
                    <mml:math display="block">
                        <mml:mi>r</mml:mi>
                        <mml:mo>&#x2217;</mml:mo>
                        <mml:mi>L</mml:mi>
                        <mml:mo>+</mml:mo>
                        <mml:msqrt>
                            <mml:mrow>
                                <mml:mn>1</mml:mn>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:msup>
                                    <mml:mi>r</mml:mi>
                                    <mml:mn>2</mml:mn>
                                </mml:msup>
                            </mml:mrow>
                        </mml:msqrt>
                        <mml:mo>&#x2217;</mml:mo>
                        <mml:mi>e</mml:mi>
                    </mml:math>
				</disp-formula>
			</p>
            <p>where 
                <italic toggle="yes">r</italic> is the correlation between the latent variable (
                <italic toggle="yes">L</italic>) and each outcome, 
                <italic toggle="yes">L</italic> is a vector of random normal deviates that is the same for each outcome variable, and 
                <italic toggle="yes">e</italic> (error) is a vector of random normal deviates that differs for each outcome variable. Note that when outcome variables are generated this way, the mean intercorrelation between them will be 
                <italic toggle="yes">r</italic><sup>2</sup>. Thus, if we want a set of outcome variables with a mean intercorrelation of .4, we need to set 
                <italic toggle="yes">r</italic> in the formula above to sqrt(.4) = .632. Furthermore, the effect size for the simulated variables will be lower than for the latent variable by a factor of 
                <italic toggle="yes">r</italic>: to achieve an effect size, E, for the outcome variables, it is necessary to specify the effect size for the latent variable, E<sub>l</sub>, as E divided by the value of 
                <italic toggle="yes">r</italic> used in the formula (E/.632 in this example).</p>
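            <p>The generation step can be sketched in a few lines of code. The block below is an illustration only: it uses Python's standard library rather than the R code used for the simulations, and the function names, seed, group size and target values are assumptions chosen for the example. It sets the loading to the square root of the target intercorrelation, scales the latent effect size accordingly, and then checks the observed correlation and group difference in a large simulated sample.</p>
            <code language="python" xml:space="preserve"><![CDATA[
```python
import math
import random
import statistics


def simulate_group(n, n_outcomes, loading, latent_mean, rng):
    """Return n rows of outcomes, each computed as loading*L + sqrt(1 - loading^2)*e."""
    resid = math.sqrt(1.0 - loading ** 2)
    rows = []
    for _ in range(n):
        latent = rng.gauss(latent_mean, 1.0)   # L: one draw shared by all outcomes
        errors = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]  # e: one draw per outcome
        rows.append([loading * latent + resid * e for e in errors])
    return rows


def column(rows, j):
    return [row[j] for row in rows]


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


target_r = 0.4                     # desired intercorrelation between outcomes (example value)
effect = 0.3                       # desired effect size E for each outcome (example value)
loading = math.sqrt(target_r)      # value of r entered in the formula
latent_effect = effect / loading   # E_l, the effect size given to the latent variable

rng = random.Random(1)             # arbitrary seed, for reproducibility only
control = simulate_group(20000, 4, loading, 0.0, rng)
treated = simulate_group(20000, 4, loading, latent_effect, rng)

# Outcomes have unit variance within each group, so the mean difference
# estimates the standardized effect size directly.
observed_r = pearson(column(control, 0), column(control, 1))
observed_effect = statistics.fmean(column(treated, 0)) - statistics.fmean(column(control, 0))
print(round(loading, 3), round(observed_r, 2), round(observed_effect, 2))
```
]]></code>
            <p>With a sample this large, the observed correlation and group difference should fall close to the target values of .4 and .3; the group size of 20,000 is used here only to make that check stable and is not a value from the study.</p>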
            <p>Note that the case where 
                <italic toggle="yes">r</italic> = 0 is not computable with this method - i.e. it is not possible to have a set of outcomes that are indicators of the same latent factor but which are uncorrelated. The lowest value of 
                <italic toggle="yes">r</italic> that was included was 
                <italic toggle="yes">r</italic> = .2.</p>
            <p>The initial simulation, designated as Model L1, treated all outcome measures as equivalent. In practice, of course, we will observe different effect sizes for different outcomes, but in Model L1, this is purely down to the play of chance: all outcomes are indicators of the same underlying factor, as shown in the heatmap in 
                <xref ref-type="fig" rid="f1">Figure 1</xref>, Model L1.</p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>Figure 1. </label>
                <caption>
                    <title>Models for data generation.</title>
                    <p>The heatmap depicts correlations between observed variables V1 to V4 and latent factors, where colour denotes association. A diagonal line through a latent factor indicates it is not related to intervention.</p>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/158245/f62b5644-b6c3-4f8c-8372-915d0b11e469_figure1.gif"/>
            </fig>
            <p>In two additional models, rather than being indicators of the same uniform latent variable, the outcomes correspond to different latent factors. This would correspond to the kind of study described by 
                <xref ref-type="bibr" rid="ref13">Vickerstaff et al. (2021)</xref>, where an intervention for obesity included outcomes relating to weight and blood glucose levels. Following suggestions by 
                <xref ref-type="bibr" rid="ref11">Sainani (2021)</xref>, a set of simulations was generated to consider the relative power of different methods when two underlying latent factors generate the outcomes. In Model L2, there are two independent latent factors, both affected by intervention. In Model L2x, the intervention influences only the first latent factor. The computational approach was the same as for Model L1, except that each of the two uncorrelated latent factors was used to generate its own block of variables.</p>
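            <p>A two-factor version of the generator might look like the following sketch. It is written in standard-library Python for illustration and is not the original R code; the function names, seeds, group size and the choice of two outcomes per block (V1-V2 and V3-V4) are assumptions. Each block of outcomes is driven by its own independently drawn latent factor, and the intervention shifts either both factors (Model L2) or only the first (the variant labelled L2x in the figures).</p>
            <code language="python" xml:space="preserve"><![CDATA[
```python
import math
import random


def simulate_case(block_sizes, loading, latent_means, rng):
    """One case: each block of outcomes loads on its own latent factor."""
    resid = math.sqrt(1.0 - loading ** 2)
    row = []
    for size, mean in zip(block_sizes, latent_means):
        latent = rng.gauss(mean, 1.0)   # drawn separately per block, so factors are uncorrelated
        row.extend(loading * latent + resid * rng.gauss(0.0, 1.0) for _ in range(size))
    return row


def simulate_group(n, block_sizes, loading, latent_means, seed):
    rng = random.Random(seed)
    return [simulate_case(block_sizes, loading, latent_means, rng) for _ in range(n)]


def col_mean(rows, j):
    return sum(row[j] for row in rows) / len(rows)


loading = math.sqrt(0.4)        # within-block intercorrelation of .4 (example value)
e_latent = 0.5 / loading        # latent effect giving an outcome effect size of .5
blocks = (2, 2)                 # V1-V2 on factor 1, V3-V4 on factor 2

control = simulate_group(20000, blocks, loading, (0.0, 0.0), seed=1)
groups = {
    "L2": simulate_group(20000, blocks, loading, (e_latent, e_latent), seed=2),
    "L2x": simulate_group(20000, blocks, loading, (e_latent, 0.0), seed=3),
}

# Mean shift of each outcome relative to the control group
shifts = {name: [col_mean(rows, j) - col_mean(control, j) for j in range(4)]
          for name, rows in groups.items()}
for name, values in shifts.items():
    print(name, [round(v, 2) for v in values])
```
]]></code>
            <p>In the second configuration only the first block of outcomes should show a group difference, which is the feature that distinguishes it from Model L2.</p>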
            <p>The size of the suite of outcome variables entered into later analysis ranged from 2 to 8. For each suite size, principal components were computed from data from the C and I groups combined, using the base R function 
                <italic toggle="yes">prcomp</italic> from the 
                <italic toggle="yes">stats</italic> package (<xref ref-type="bibr" rid="ref10">R Core Team, 2020</xref>). Thus, PC2 is a principal component based on the first two outcome measures, PC4 is based on the first four outcome measures, and so on.</p>
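            <p>For readers who want to see the PCA step spelled out, the sketch below computes first-principal-component scores for suites of 2, 4 and 8 outcomes from the two groups combined. It is a standard-library Python illustration, not the prcomp call used in the study: the leading eigenvector is found by power iteration, the data are centred but not rescaled (an assumption about the settings used), and all function names, the seed and the parameter values are hypothetical.</p>
            <code language="python" xml:space="preserve"><![CDATA[
```python
import math
import random


def first_pc(rows, n_iter=200):
    """Return (scores, weights) for the first principal component.

    Data are centred but not rescaled; the leading eigenvector of the
    covariance matrix is found by power iteration.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    centred = [[row[j] - means[j] for j in range(k)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(k)]
           for i in range(k)]
    weights = [1.0 / math.sqrt(k)] * k          # starting vector
    for _ in range(n_iter):
        product = [sum(cov[i][j] * weights[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in product))
        weights = [x / norm for x in product]
    scores = [sum(c[j] * weights[j] for j in range(k)) for c in centred]
    return scores, weights


def simulate_group(n, n_outcomes, loading, latent_mean, rng):
    resid = math.sqrt(1.0 - loading ** 2)
    rows = []
    for _ in range(n):
        latent = rng.gauss(latent_mean, 1.0)
        rows.append([loading * latent + resid * rng.gauss(0.0, 1.0)
                     for _ in range(n_outcomes)])
    return rows


rng = random.Random(7)                          # arbitrary seed
loading = math.sqrt(0.4)
control = simulate_group(50, 8, loading, 0.0, rng)
treated = simulate_group(50, 8, loading, 0.5 / loading, rng)
combined = control + treated                    # PCA uses both groups together

results = {}
for size in (2, 4, 8):
    scores, weights = first_pc([row[:size] for row in combined])
    diff = sum(scores[50:]) / 50 - sum(scores[:50]) / 50
    results[size] = (scores, weights)
    print(f"PC{size}: mean score difference (I minus C) = {diff:.2f}")
```
]]></code>
            <p>The component scores would then be compared between groups with a standard two-sample test; that step is omitted here because the Python standard library has no t distribution.</p>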
        </sec>
        <sec id="sec6" sec-type="results">
            <title>Results</title>
            <sec id="sec6.1">
                <title>Power calculations</title>
                <p>Sample plots comparing power for Bonferroni correction, MEff and PCA are shown for a sample size of 50 per group in 
                    <xref ref-type="fig" rid="f2">Figures 2</xref> to 
                    <xref ref-type="fig" rid="f4">4</xref>. Plots for smaller (N = 20) and larger (N = 80) sample sizes are available online (
                    <ext-link ext-link-type="uri" xlink:href="https://osf.io/k6xyc/">https://osf.io/k6xyc/</ext-link>), and show the same basic pattern.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Model L1: 50 per group.</title>
                        <p>Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.</p>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/158245/f62b5644-b6c3-4f8c-8372-915d0b11e469_figure2.gif"/>
                </fig>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>Model L2: 50 per group.</title>
                        <p>Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.</p>
                    </caption>
                    <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/158245/f62b5644-b6c3-4f8c-8372-915d0b11e469_figure3.gif"/>
                </fig>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Model L2x: 50 per group.</title>
                        <p>Power in relation to number of outcome measures (N outcomes), intercorrelation between outcomes (column headers), type of Correction, and Effect size. The square, circle and triangle symbols represent the power for a single outcome measure with effect size .3 and .5 respectively.</p>
                    </caption>
                    <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/158245/f62b5644-b6c3-4f8c-8372-915d0b11e469_figure4.gif"/>
                </fig>
                <p>
					
                    <xref ref-type="fig" rid="f2">Figure 2</xref> shows the simplest situation, in which there are between 2 and 8 outcome measures, all derived from the same latent variable (Model L1). Different levels of intercorrelation between the outcomes (ranging from .2 to .8 in steps of .2) are shown in columns.</p>
                <p>Several points emerge from inspection of this figure. First, when intercorrelation between measures is low to medium (.2 to .6), power increases as the number of outcome measures increases. Second, power is greater when PCA is used than when MEff or Bonferroni correction is applied. Third, MEff is generally somewhat better-powered than Bonferroni, and Bonferroni has lower power than a single outcome measure when there is a large number of highly intercorrelated outcome measures (
                    <italic toggle="yes">r</italic> = .8).</p>
                <p>In practice, it may be the case that outcome measures are not all reflective of a common latent factor. 
                    <xref ref-type="fig" rid="f3">Figure 3</xref> shows results from Model L2, where outcome measures form two clusters, each associated with a different latent factor (see 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). Here both latent factors are associated with improved outcomes in the intervention group.</p>
                <p>Once again, power increases with number of outcomes when there are low to modest intercorrelations between outcomes. For this model, PCA no longer has such a clear advantage. This makes sense, given that PCA will not derive a single main factor when the underlying data structure contains two independent factors.</p>
                <p>
					
                    <xref ref-type="fig" rid="f4">Figure 4</xref> shows equivalent results for Model L2x, where we have a mixture of two types of outcome, one influenced by intervention and the other not. This complicates calculation of power for a single variable, since power will depend on whether the selected outcome is one that is influenced by intervention. The symbols in 
                    <xref ref-type="fig" rid="f4">Figure 4</xref> show average power, assuming we might select either type of outcome with equal frequency. We see that in this situation, MEff is clearly superior to PCA except when we have a large number of outcomes, a small effect size and weak intercorrelation between outcomes.</p>
            </sec>
            <sec id="sec6.2">
                <title>Deriving a lookup table</title>
                <p>
					
                    <xref ref-type="table" rid="T2">Table 2</xref> shows corrected alpha values based on MEff, varying according to the correlation between outcome measures, and the number of outcome measures in the study. In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>Table 2. </label>
                    <caption>
                        <title>AlphaMEff for different correlation values (corr) with 2-12 outcome variables (N2 to N12), based on Model L1.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">corr</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N2</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N3</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N4</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N5</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N6</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N7</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N8</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N9</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N10</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N11</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N12</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.0</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.025</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.017</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.004</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.025</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.017</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.004</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.026</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.017</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.004</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.026</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.018</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.027</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.019</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.5</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.029</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.020</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.015</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.005</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.6</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.030</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.022</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.017</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.012</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.7</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.033</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.025</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.020</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.016</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.012</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.8</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.037</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.029</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.024</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.020</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.018</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.016</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.012</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.9</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.042</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.036</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.032</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.028</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.026</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.023</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.021</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.020</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.018</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.017</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.016</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">1.0</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.050</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
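                <p>For uniform correlations, the tabled values can be approximated in closed form. The sketch below assumes that MEff is the eigenvalue-variance estimate of the effective number of tests, 1 + (N - 1)(1 - Var(eigenvalues)/N), applied to the population correlation matrix; this section does not restate the definition, so that formula and the function name are assumptions rather than a description of the original code. It relies on the fact that a correlation matrix with all off-diagonal entries equal to r has one eigenvalue of 1 + (N - 1)r and N - 1 eigenvalues of 1 - r.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
def alpha_meff_uniform(r, n_outcomes, alpha=0.05):
    """Corrected alpha for n_outcomes whose pairwise correlations all equal r."""
    # Eigenvalues of the uniform correlation matrix
    eigenvalues = [1 + (n_outcomes - 1) * r] + [1 - r] * (n_outcomes - 1)
    mean = sum(eigenvalues) / n_outcomes
    variance = sum((ev - mean) ** 2 for ev in eigenvalues) / (n_outcomes - 1)
    # Assumed eigenvalue-variance formula for the effective number of tests
    meff = 1 + (n_outcomes - 1) * (1 - variance / n_outcomes)
    return alpha / meff


# Print one row per correlation value, for 2 to 12 outcomes
for r in (0.0, 0.4, 0.8):
    row = [round(alpha_meff_uniform(r, n), 3) for n in range(2, 13)]
    print(r, row)
```
]]></code>
                <p>Under this assumption the effective number of tests simplifies to 1 + (N - 1)(1 - r<sup>2</sup>), so the corrected alpha equals the Bonferroni value when r is 0 and returns to .05 when r is 1. For example, with r = .8 and two outcomes the function gives .05/1.36, which rounds to .037.</p>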
                <p>Model L1, used to generate these data, assumes there will be a uniform intercorrelation between outcome measures in the population. This is likely to be unrealistic. Nevertheless, further simulations showed that values for MEff are reasonably consistent for different correlation matrices that all have the same average off-diagonal correlation. Consider, for instance, the correlations between 4 variables shown in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> for Model L2. Within the blocks V1-V2 and V3-V4 the intercorrelation is 
                    <italic toggle="yes">r,</italic> but between blocks the intercorrelation is zero. There are six off-diagonal correlations and the mean off-diagonal is (2 * 
                    <italic toggle="yes">r</italic>/6). For instance, if 
                    <italic toggle="yes">r</italic> equals .5, then the mean off-diagonal value is .167. To see how the MEff correction is affected by correlation structure, we can compare MEff for Model L2 with the MEff obtained in Model L1 with the same mean off-diagonal correlation. The two sets of values are similar, as shown in 
                    <xref ref-type="table" rid="T3">Table 3</xref>.</p>
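                <p>The comparison can be illustrated with a short calculation. The sketch below computes the mean off-diagonal correlation for two equal blocks of outcomes (r within blocks, zero between blocks) and then compares the corrected alpha for that block structure with the value for a uniform matrix having the same mean off-diagonal correlation. As before, the eigenvalue-variance formula for MEff is an assumption, the function names are hypothetical, and the eigenvalues are written down analytically rather than estimated from simulated data.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
def mean_offdiag_two_blocks(r, n_outcomes):
    """Mean off-diagonal correlation: two equal blocks, r within, 0 between."""
    k = n_outcomes // 2
    within_pairs = 2 * (k * (k - 1) // 2)
    all_pairs = n_outcomes * (n_outcomes - 1) // 2
    return r * within_pairs / all_pairs


def eigen_uniform(r, n):
    """Eigenvalues of an n x n correlation matrix with all off-diagonals equal to r."""
    return [1 + (n - 1) * r] + [1 - r] * (n - 1)


def eigen_two_blocks(r, n):
    """Eigenvalues of a block-diagonal matrix made of two uniform blocks."""
    return eigen_uniform(r, n // 2) * 2


def alpha_meff(eigenvalues, alpha=0.05):
    """Corrected alpha from the assumed eigenvalue-variance formula for MEff."""
    n = len(eigenvalues)
    mean = sum(eigenvalues) / n
    variance = sum((ev - mean) ** 2 for ev in eigenvalues) / (n - 1)
    return alpha / (1 + (n - 1) * (1 - variance / n))


r, n = 0.5, 4
mean_r = mean_offdiag_two_blocks(r, n)
alpha_l2 = alpha_meff(eigen_two_blocks(r, n))       # block structure (Model L2)
alpha_l1 = alpha_meff(eigen_uniform(mean_r, n))     # uniform structure (Model L1)
print(round(mean_r, 3), round(alpha_l2, 3), round(alpha_l1, 3))
```
]]></code>
                <p>For four outcomes and r = .5 both structures give a corrected alpha that rounds to .013. The same function gives a mean off-diagonal correlation of about .086 for eight outcomes with r = .2, because 12 of the 28 pairs then fall within blocks.</p>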
                <table-wrap id="T3" orientation="portrait" position="float">
                    <label>Table 3. </label>
                    <caption>
                        <title>AlphaMEff values for Model L2 (odd rows) and Model L1 (even rows), with the same mean off-diagonal 
                            <italic toggle="yes">r</italic>. For Model L2, &#x201c;Start 
                            <italic toggle="yes">r</italic>&#x201d; is the value for nonzero off-diagonal correlations.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Start 
                                    <italic toggle="yes">r</italic>
								</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Mean off-diagonal 
                                    <italic toggle="yes">r</italic>
								</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AlphaMEff (4 outcomes)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AlphaMEff (6 outcomes)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AlphaMEff (8 outcomes)</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.086</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.086</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.129</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.129</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.171</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.171</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.006</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.5</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.214</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.214</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.6</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.257</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.257</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.7</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.300</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.014</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.010</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.300</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="2" valign="middle">0.8</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.343</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.015</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.008</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">L1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.343</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.013</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.009</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.007</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>In other words, if estimating MEff from existing data, it is reasonable to base the estimate on the average off-diagonal correlation, regardless of whether the pattern of intercorrelations is uniform.</p>
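                <p>As a minimal sketch of this point, the Python code below compares MEff computed from a full correlation matrix with MEff computed from the average off-diagonal correlation alone. It assumes the eigenvalue-variance definition of MEff (<xref ref-type="bibr" rid="ref5">Cheverud, 2001</xref>), MEff = 1 + (M - 1)(1 - V/M), where M is the number of outcomes and V is the variance of the eigenvalues of the correlation matrix, with adjusted alpha = .05/MEff. The function names and the 3 x 3 matrix are hypothetical illustrations, not data from the simulations. Because the eigenvalues of a correlation matrix sum to M and their squares sum to the sum of squared matrix entries, V can be obtained without an eigendecomposition.</p>
                <preformat><![CDATA[
```python
from itertools import combinations


def mean_offdiag(corr):
    """Average of the off-diagonal correlations (upper triangle)."""
    pairs = list(combinations(range(len(corr)), 2))
    return sum(corr[i][j] for i, j in pairs) / len(pairs)


def meff_from_matrix(corr):
    """MEff = 1 + (M - 1) * (1 - Var(eigenvalues) / M) for a correlation matrix.

    The sample variance of the eigenvalues equals the sum of squared
    off-diagonal entries divided by (M - 1), so no eigendecomposition is needed.
    """
    m = len(corr)
    sum_sq_offdiag = 2 * sum(corr[i][j] ** 2 for i, j in combinations(range(m), 2))
    var_eigen = sum_sq_offdiag / (m - 1)
    return 1 + (m - 1) * (1 - var_eigen / m)


def meff_from_mean_r(n_outcomes, mean_r):
    """MEff when every off-diagonal correlation is set to mean_r."""
    return 1 + (n_outcomes - 1) * (1 - mean_r ** 2)


# Hypothetical, non-uniform correlation matrix for three outcomes.
R = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.8],
    [0.5, 0.8, 1.0],
]

exact = meff_from_matrix(R)
approx = meff_from_mean_r(len(R), mean_offdiag(R))
print(f"mean r = {mean_offdiag(R):.2f}")
print(f"MEff from full matrix = {exact:.2f}, alpha = {0.05 / exact:.4f}")
print(f"MEff from mean r      = {approx:.2f}, alpha = {0.05 / approx:.4f}")
```
]]></preformat>
                <p>For this hypothetical matrix the two routes give adjusted alphas of about .021 and .020. Under the assumed formula, the estimate based on the mean correlation is either equal to or slightly more conservative than the matrix-based one, because the mean of the squared correlations is never smaller than the square of their mean.</p>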
            </sec>
            <sec id="sec6.3">
                <title>Examples of application to published studies</title>
                <p>Use of the lookup 
                    <xref ref-type="table" rid="T2">Table 2</xref> can be illustrated with data from a study by 
                    <xref ref-type="bibr" rid="ref3">Burgoyne et al. (2012)</xref>, which evaluated a reading and language intervention for children with Down syndrome. A large number of assessments was carried out over various time points, but our focus here is on the five outcome measures that had been designated as &#x201c;primary&#x201d;, as they were &#x201c;proximal to the content of the intervention&#x201d;, i.e., they measured skills and knowledge that had been explicitly taught. The p-values reported by the authors (see 
                    <xref ref-type="table" rid="T4">Table 4</xref>) come from analyses of covariance comparing differences between intervention and control groups after 20 weeks of intervention, controlling for baseline performance, age and gender.</p>
                <table-wrap id="T4" orientation="portrait" position="float">
                    <label>Table 4. </label>
                    <caption>
                        <title>P-values from 
                            <xref ref-type="bibr" rid="ref3">Burgoyne 
                                <italic toggle="yes">et al.</italic> (2012)</xref>.</title>
                        <p>Bonferroni and MEff alpha for 5 variables with mean correlation of .6.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Measure</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Reported p-value</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Bonferroni: alpha = .01</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">MEff: alpha = .014</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Letter-Sound knowledge</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.002</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">*</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">*</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Phoneme blending</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.022</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Single word reading</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.002</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">*</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">*</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Taught expressive Vocabulary</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.011</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="middle">*</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Taught receptive Vocabulary</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">0.062</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>Whereas the Bonferroni-corrected alpha can be computed simply from knowledge of the number of outcome measures, the MEff-corrected alpha also requires knowledge of the mean correlation between the outcome measures. In this case, the mean correlation could be computed (
                    <italic toggle="yes">r</italic> = .581), as the data were available in a repository (
                    <xref ref-type="bibr" rid="ref4">Burgoyne et al., 2016</xref>). From 
                    <xref ref-type="table" rid="T2">Table 2</xref>, we see that with five outcome measures and 
                    <italic toggle="yes">r</italic> = .6, the adjusted alpha is .014. In this example, three outcomes have p-values below the critical alpha when MEff is used. If the more stringent Bonferroni correction is applied, only two outcomes achieve significance.</p>
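                <p>The lookup and the comparison in Table 4 can be reproduced with a short calculation. The sketch below assumes the eigenvalue-variance formula for MEff (<xref ref-type="bibr" rid="ref5">Cheverud, 2001</xref>) and a uniform correlation of .6 among the five outcomes; the function name <monospace>meff_alpha</monospace> is hypothetical, and the p-values are those listed in Table 4.</p>
                <preformat><![CDATA[
```python
from statistics import variance


def meff_alpha(n_outcomes, r, alpha=0.05):
    """Adjusted alpha = alpha / MEff for n_outcomes with uniform correlation r.

    A uniform correlation matrix has one eigenvalue equal to
    1 + (n_outcomes - 1) * r and n_outcomes - 1 eigenvalues equal to 1 - r.
    """
    eigenvalues = [1 + (n_outcomes - 1) * r] + [1 - r] * (n_outcomes - 1)
    meff = 1 + (n_outcomes - 1) * (1 - variance(eigenvalues) / n_outcomes)
    return alpha / meff


# Reported p-values from Table 4 (Burgoyne et al., 2012).
p_values = {
    "Letter-Sound knowledge": 0.002,
    "Phoneme blending": 0.022,
    "Single word reading": 0.002,
    "Taught expressive vocabulary": 0.011,
    "Taught receptive vocabulary": 0.062,
}

bonferroni = 0.05 / len(p_values)
meff = meff_alpha(len(p_values), 0.6)

# Flag each outcome whose p-value falls below the corrected threshold.
for measure, p in p_values.items():
    bonf_flag = "*" if p < bonferroni else "-"
    meff_flag = "*" if p < meff else "-"
    print(f"{measure:30s} p = {p:.3f}  Bonferroni {bonf_flag}  MEff {meff_flag}")
```
]]></preformat>
                <p>Under these assumptions the MEff-adjusted alpha rounds to .014, and the flags match the asterisks in Table 4: two outcomes pass the Bonferroni threshold and three pass the MEff threshold.</p>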
                <p>In this example the intercorrelation between outcome measures could be computed from deposited raw data; if raw data are not available, it may still be possible to obtain plausible estimates of the intercorrelation between outcome measures, especially when widely used instruments have been administered. An example is provided by two randomized controlled trials of a memory training programme for children, Cogmed. In both studies, the Automated Working Memory Assessment battery (
                    <xref ref-type="bibr" rid="ref15">Alloway, 2007</xref>) was used to assess outcome. 
                    <xref ref-type="bibr" rid="ref16">Chacko et al. (2014)</xref> used four subtests (Dot Matrix, Spatial Recall, Digit Recall, and Listening Recall) and applied the Sidak-Bonferroni correction, with an effective alpha of .013. The raw data are not available, but the test manual indicates that intercorrelations between these four measures range from .70 to .78. Thus we can use the lookup table (<xref ref-type="table" rid="T2">Table 2</xref>), which shows that for four variables with an intercorrelation of .7, an effective alpha of .02 can be used. In practice this did not affect the interpretation of results, because two of the measures, Dot Matrix and Digit Recall, had p-values of &lt; .001 and .005 respectively, which are significant under either threshold. The p-values for Spatial Recall and Listening Recall were .048 and .728 respectively, and so would not meet the criterion for significance with either the MEff or the Bonferroni method.</p>
                <p>The other study by 
                    <xref ref-type="bibr" rid="ref17">Roberts et al. (2016)</xref> used a different subset of subtests from the same battery: Dot Matrix, Digit Recall, Backward Digit Recall, and Mister X, given at 6 months, 12 months and 24 months post-intervention. According to the test manual, intercorrelations between these subtests range from .65 to .80. These authors did not apply a correction for multiple comparisons. If Bonferroni correction had been used, this would have given an alpha level of .004 (.05/12). The test manual indicates that test-retest reliability of the subscales ranges from .84 to .89. Thus overall, we can estimate the off-diagonal correlations for all 12 measures to be around .8, which the lookup table shows as corresponding to an effective alpha of .01. In this study, only the Dot Matrix task effect was significant after correction for multiple comparisons, with p &lt; .001 at both 6 months and 12 months, but p = .14 at 24 months. Backward Digit Recall gave p = .04 at 6 months only, which would be nonsignificant if any correction for multiple comparisons were used. All other comparisons were null. In the next section, the implications of these findings for choosing methods are discussed further.</p>
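                <p>The thresholds quoted for the two Cogmed trials can be checked in the same way. This sketch assumes the standard Sidak formula, 1 - (1 - alpha)^(1/n), and the eigenvalue-variance form of MEff (<xref ref-type="bibr" rid="ref5">Cheverud, 2001</xref>), which for a uniform correlation r reduces to 1 + (n - 1)(1 - r^2). The helper names are hypothetical, and the correlations of .7 and .8 are the estimates taken from the test manual as described above.</p>
                <preformat><![CDATA[
```python
def bonferroni_alpha(n, alpha=0.05):
    """Bonferroni-corrected alpha for n tests."""
    return alpha / n


def sidak_alpha(n, alpha=0.05):
    """Sidak-corrected alpha for n tests."""
    return 1 - (1 - alpha) ** (1 / n)


def meff_alpha(n, r, alpha=0.05):
    """MEff-corrected alpha for n tests with uniform correlation r.

    With uniform correlation, Var(eigenvalues) = n * r**2, so
    MEff = 1 + (n - 1) * (1 - r**2).
    """
    meff = 1 + (n - 1) * (1 - r ** 2)
    return alpha / meff


# (label, number of outcome measures, assumed mean correlation)
trials = [
    ("Chacko et al. (2014)", 4, 0.7),    # four subtests at one time point
    ("Roberts et al. (2016)", 12, 0.8),  # four subtests at three time points
]

for label, n, r in trials:
    print(
        f"{label}: Bonferroni {bonferroni_alpha(n):.4f}, "
        f"Sidak {sidak_alpha(n):.4f}, MEff {meff_alpha(n, r):.4f}"
    )
```
]]></preformat>
                <p>Under these assumptions the calculation agrees with the values quoted above: .013 for the Sidak correction with four measures, .02 for MEff with four measures at r = .7, and .004 (Bonferroni) versus .01 (MEff) for twelve measures at r = .8.</p>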
            </sec>
        </sec>
        <sec id="sec7">
            <title>Discussion</title>
            <p>Some interventions are expected to affect a range of related processes. In such cases, the need to specify a single primary outcome tends to create difficulties, because it is often unclear which of a suite of outcomes is likely to show an effect. Note that the MEff approach does not give the researcher free rein to engage in p-hacking: the larger the suite of measures included in the study, the lower the adjusted alpha will be. It does, however, remove the need to pre-specify one measure as the primary outcome, when there is genuine uncertainty about which measure might be most sensitive to intervention.</p>
            <p>A second advantage is that including multiple outcome measures can improve the efficiency of a study, in terms of the trade-off between power and familywise errors. A set of outcome measures may be regarded as imperfect proxy indicators of an underlying latent construct, so we are in effect building in a degree of within-study replication by including more than one outcome measure.</p>
            <p>The simulations showed that PCA gives higher power than MEff in the case where all outcomes are indicators of a single underlying factor. PCA, however, needs to be computed from raw data and so is not feasible when re-evaluating published studies, whereas MEff is feasible so long as the average off-diagonal correlation between outcomes can be estimated. PCA is also less powerful when the outcomes tap into heterogeneous constructs and do not load on one major latent factor. Some examples are provided where prior literature gives plausible estimates of intercorrelations between outcome measures. Of course, such estimates are never as accurate as the actual correlations from the reported data, which may vary depending on sample characteristics. Wherever possible, it is preferable to work with original raw data. However, where correlations are available from test manuals, or where previous studies have reported correlations between outcomes, then the researcher can consider how interpretation of results may be affected by assuming a given degree of dependency between outcome measures.</p>
            <p>A possible disadvantage of using MEff or Bonferroni correction over PCA is that such approaches are likely to tempt researchers to interpret specific outcomes that fall below the revised alpha threshold as meaningful. They may be, of course, but when we create a suite of outcomes that differ only by chance, it is common for only a subset of them to reach the significance criterion. Any recommendation to use MEff should be accompanied by a warning that if a subset of outcomes shows an effect of intervention, this could be due to chance. It would be necessary to run a replication to have confidence in a particular pattern of results.</p>
            <p>In this regard, the studies using the Automated Working Memory Assessment to evaluate intervention for children with memory and attentional difficulties (
                <xref ref-type="bibr" rid="ref16">Chacko et al., 2014</xref>; 
                <xref ref-type="bibr" rid="ref17">Roberts et al., 2016</xref>) are instructive. As reported in the test manual (
                <xref ref-type="bibr" rid="ref15">Alloway, 2007</xref>), intercorrelations between the subtests are high, supporting the idea of a general working memory factor that influences performance on all such measures. On that basis, it might seem preferable to reduce subtest scores to one outcome measure, either by using data reduction such as principal component analysis, or by using the method advocated in the test manual to derive a composite score. We know that such reduction is associated with an increase in reliability of measurement and statistical power. However, the results of the two studies sound a note of caution: in both trials there were large improvements in one subtest, Dot Matrix, at least in the short term, while other measures did not show consistent gains. This kind of result has been much discussed in evaluations of computerised training, where it has been noted that one may see improvements in tasks that resemble the training exercises, &#x2018;near transfer&#x2019;, without any generalisation to other measures, &#x2018;far transfer&#x2019; (
                <xref ref-type="bibr" rid="ref14">Aksayli, Sala, &amp; Gobet, 2019</xref>). The very fact that measures are usually intercorrelated provides the rationale for hoping that training one skill will have an effect that generalises to other skills, and to everyday life. Yet the verdict on this kind of training is stark: despite much early optimism, working memory training leads to improvements on what was trained, but these do not extend to other areas of cognition. This shows us that careful thought needs to be given to the logic of how a set of outcome measures is conceptualised: should we treat them as interchangeable indicators of a single underlying factor, or are there reasons to expect that the intervention will have a selective impact on a subset of measures? Even when variables are intercorrelated in the general population, they may respond differently to intervention.</p>
            <p>It is also worth noting that results obtained with the MEff approach will depend on assumptions embodied in the simulation that is used to derive predictions. Outcome measures simulated here are normally distributed, and uniform in their covariance structure. It would be of interest to evaluate MEff in datasets with different variable types, such as those used by 
                <xref ref-type="bibr" rid="ref13">Vickerstaff et al. (2021)</xref>, which included binary as well as continuous data, and to model the impact of missing data.</p>
            <p>In sum, a recommendation against using multiple outcomes in intervention studies does not lead to optimal study design. Inclusion of several related outcomes can increase statistical power, without increasing the false positive rate, provided appropriate correction is made for the multiple testing. Compared to most other approaches for correlated outcomes, MEff is relatively simple. It could potentially be used to reevaluate published studies that report multiple outcomes but may not have been analysed optimally, provided we have some information on the average correlation between outcome measures.</p>
        </sec>
        <sec id="sec18">
            <title>Data availability</title>
            <sec id="sec19">
                <title>Underlying data</title>
                <p>OSF: Revised &#x2018;multiple outcomes&#x2019; using MEff, &lt;
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.17605/OSF.IO/6GNB4">https://doi.org/10.17605/OSF.IO/6GNB4</ext-link>&gt; (
                    <xref ref-type="bibr" rid="ref2">Bishop, 2022</xref>).</p>
                <p>This project contains the following underlying data:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Simulated raw data from 2000 runs for models L1, L2 and L3 (corresponding to L1, L2 and L2x respectively).</p>
                        </list-item>
                    </list>
				</p>
            </sec>
            <sec id="sec20">
                <title>Extended data</title>
                <p>OSF: Revised &#x2018;multiple outcomes&#x2019; using MEff, &lt;
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.17605/OSF.IO/6GNB4">https://doi.org/10.17605/OSF.IO/6GNB4</ext-link>&gt; (
                    <xref ref-type="bibr" rid="ref2">Bishop, 2022</xref>).</p>
                <p>This project contains the scripts to generate and analyse simulated data. Two scripts are included:
                    <list list-type="order">
                        <list-item>
                            <p>Data_simulation_modelL.Rmd, which generates the simulated data under Data, computes power tables and creates plots for Figures 2-4.</p>
                        </list-item>
                        <list-item>
                            <p>Multiple_outcomes_revised.Rmd, which generates the text for the current article.</p>
                        </list-item>
                    </list>
				</p>
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Zero &#x201c;No rights reserved&#x201d; data waiver</ext-link> (CC0 1.0 Public domain dedication).</p>
            </sec>
        </sec>
    </body>
    <back>
        <ref-list>
            <title>References</title>
            <ref id="ref14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Aksayli</surname>
                            <given-names>ND</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Sala</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gobet</surname>
                            <given-names>F</given-names>
                        </name>
					</person-group>:
                    <article-title>The cognitive and academic benefits of Cogmed: A meta-analysis.</article-title>
                    <source>
						
                        <italic toggle="yes">
Educational Research Review.
</italic>
					</source>
                    <year>2019</year>;<volume>27</volume>:<fpage>229</fpage>&#x2013;<lpage>243</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.edurev.2019.04.003</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Alloway</surname>
                            <given-names>TP</given-names>
                        </name>
					</person-group>:
                    <source>
						
                        <italic toggle="yes">Automated Working Memory Assessment Manual.</italic>
					</source>
                    <publisher-loc>London</publisher-loc>:
                    <publisher-name>Pearson Assessment</publisher-name>;<year>2007</year>.
                </mixed-citation>
            </ref>
            <ref id="ref1">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bishop</surname>
                            <given-names>DVM</given-names>
                        </name>
					</person-group>:
                    <article-title>Using multiple outcomes in intervention studies for improved trade-off between power and type I errors: The Adjust NVar approach [version 1; peer review: 2 not approved].</article-title>
                    <source>
						
                        <italic toggle="yes">F1000Research.
</italic>
					</source>
                    <year>2021</year>;<volume>10</volume>:<fpage>991</fpage>.
                    <pub-id pub-id-type="doi">10.12688/f1000research.73520.1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bishop</surname>
                            <given-names>DVM</given-names>
                        </name>
					</person-group>:
                    <data-title>Revised &#x2018;Multiple Outcomes&#x2019; Using MEff. OSF.</data-title> [Dataset.] <year>2022</year>, November 18.
                    <pub-id pub-id-type="doi">10.17605/OSF.IO/6JF9T</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Burgoyne</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Duff</surname>
                            <given-names>FJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Clarke</surname>
                            <given-names>PJ</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Efficacy of a reading and language intervention for children with Down syndrome: A randomized controlled trial.</article-title>
                    <source>
						
                        <italic toggle="yes">
Journal of Child Psychology and Psychiatry, and Allied Disciplines.
</italic>
					</source>
                    <year>2012</year>;<volume>53</volume>(<issue>10</issue>):<fpage>1044</fpage>&#x2013;<lpage>1053</lpage>.
                    <pub-id pub-id-type="doi">10.1111/j.1469-7610.2012.02557.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Burgoyne</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Duff</surname>
                            <given-names>FJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Clarke</surname>
                            <given-names>PJ</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Reading and language intervention for children with Down syndrome.</article-title>
                    <source>
						
                        <italic toggle="yes">
Experimental data [data collection].
</italic>
					</source>
                    <year>2016</year>.
                    <pub-id pub-id-type="doi">10.5255/UKDA-SN-852291</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Chacko</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bedard</surname>
                            <given-names>AC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Marks</surname>
                            <given-names>DJ</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>A randomized clinical trial of Cogmed Working Memory Training in school-age children with ADHD: A replication in a diverse sample using a control condition.</article-title>
                    <source>
						
                        <italic toggle="yes">
Journal of Child Psychology and Psychiatry.
</italic>
					</source>
                    <year>2014</year>;<volume>55</volume>(<issue>3</issue>):<fpage>247</fpage>&#x2013;<lpage>255</lpage>.
                    <pub-id pub-id-type="doi">10.1111/jcpp.12146</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Cheverud</surname>
                            <given-names>JM</given-names>
                        </name>
					</person-group>:
                    <article-title>A simple correction for multiple comparisons in interval mapping genome scans.</article-title>
                    <source>
						
                        <italic toggle="yes">
Heredity.
</italic>
					</source>
                    <year>2001</year>;<volume>87</volume>(<issue>1</issue>): Article 1.
                    <pub-id pub-id-type="doi">10.1046/j.1365-2540.2001.00901.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Derringer</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>A simple correction for non-independent tests.</article-title>
                    <source>
						
                        <italic toggle="yes">
PsyArXiv.
</italic>
					</source>
                    <year>2018</year>.
                    <pub-id pub-id-type="doi">10.31234/osf.io/f2tyw</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Lakens</surname>
                            <given-names>D</given-names>
                        </name>
					</person-group>:
                    <article-title>The 20% Statistician: One-sided tests: Efficient and Underused.</article-title>
                    <source>
						
                        <italic toggle="yes">The 20% Statistician.</italic>
					</source>
                    <year>2016, March 17</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html">http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Moher</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hopewell</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Schulz</surname>
                            <given-names>KF</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials.</article-title>
                    <source>
						
                        <italic toggle="yes">
BMJ (Clinical Research Ed.)
</italic>
					</source>
                    <year>2010</year>;<volume>340</volume>: c869.
                    <pub-id pub-id-type="doi">10.1136/bmj.c869</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Nyholt</surname>
                            <given-names>DR</given-names>
                        </name>
					</person-group>:
                    <article-title>A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other.</article-title>
                    <source>
						
                        <italic toggle="yes">American Journal of Human Genetics.</italic>
					</source>
                    <year>2004</year>;<volume>74</volume>(<issue>4</issue>):<fpage>765</fpage>&#x2013;<lpage>769</lpage>.</mixed-citation>
            </ref>
            <ref id="ref10">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">
						
                        <collab>R Core Team</collab>
					</person-group>:
                    <source>
						
                        <italic toggle="yes">R: A language and environment for statistical computing.</italic>
					</source>
                    <publisher-loc>Vienna, Austria</publisher-loc>:
                    <publisher-name>R Foundation for Statistical Computing</publisher-name>;<year>2020</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://www.R-project.org/">https://www.R-project.org/</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Roberts</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Quach</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Spencer-Smith</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Academic outcomes 2 years after working memory training for children with low working memory: A randomized clinical trial.</article-title>
                    <source>
						
                        <italic toggle="yes">
JAMA Pediatrics.
</italic>
					</source>
                    <year>2016</year>;<volume>170</volume>(<issue>5</issue>): e154568.
                    <pub-id pub-id-type="doi">10.1001/jamapediatrics.2015.4568</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Sainani</surname>
                            <given-names>K</given-names>
                        </name>
					</person-group>:
                    <article-title>Peer Review Report For: Using multiple outcomes in intervention studies for improved trade-off between power and type I errors: The Adjust NVar approach [version 1; peer review: 2 not approved].</article-title>
                    <source>
						
                        <italic toggle="yes">
F1000Research.
</italic>
					</source>
                    <year>2021</year>;<volume>10</volume>:<fpage>991</fpage>.
                    <pub-id pub-id-type="doi">10.5256/f1000research.77175.r96192</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vickerstaff</surname>
                            <given-names>V</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ambler</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>King</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Are multiple primary outcomes analysed appropriately in randomised controlled trials? A review.</article-title>
                    <source>
						
                        <italic toggle="yes">
Contemporary Clinical Trials.
</italic>
					</source>
                    <year>2015</year>;<volume>45</volume>(<issue>Pt A</issue>):<fpage>8</fpage>&#x2013;<lpage>12</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.cct.2015.07.016</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vickerstaff</surname>
                            <given-names>V</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ambler</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Omar</surname>
                            <given-names>RZ</given-names>
                        </name>
					</person-group>:
                    <article-title>A comparison of methods for analysing multiple outcome measures in randomised controlled trials using a simulation study.</article-title>
                    <source>
						
                        <italic toggle="yes">
Biometrical Journal. Biometrische Zeitschrift.
</italic>
					</source>
                    <year>2021</year>;<volume>63</volume>(<issue>3</issue>):<fpage>599</fpage>&#x2013;<lpage>615</lpage>.
                    <pub-id pub-id-type="doi">10.1002/bimj.201900040</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report221079">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.158245.r221079</article-id>
            <title-group>
                <article-title>Reviewer response for version 3</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lakens</surname>
                        <given-names>Daniel</given-names>
                    </name>
                    <xref ref-type="aff" rid="r221079a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0247-239X</uri>
                </contrib>
                <aff id="r221079a1">
                    <label>1</label>Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>14</day>
                <month>11</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Lakens D</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport221079" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.73520.3"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>I already approved the report, and the new changes are small improvements that do not change my evaluation.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Applied statistics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report160158">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141308.r160158</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Sainani</surname>
                        <given-names>Kristin</given-names>
                    </name>
                    <xref ref-type="aff" rid="r160158a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0614-303X</uri>
                </contrib>
                <aff id="r160158a1">
                    <label>1</label>Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>13</day>
                <month>3</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Sainani K</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport160158" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.73520.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The focus of this paper has shifted from the original version. It now focuses on the Meff approach rather than the original proposed MinNVar approach. The goals of the paper have also shifted: (1) to identify situations in which the use of multiple primary outcomes with appropriate adjustment for multiple comparisons yields higher statistical power than a single primary outcome; and (2) to provide tools for re-evaluating already published papers that used multiple primary outcomes but failed to adjust for multiplicity.</p>
            <p> </p>
            <p> In shifting the focus of the paper, the authors have addressed my original concerns. I like the Meff approach because it is relatively straightforward and intuitive. So, I&#x2019;m glad that this revised version provides a brief tutorial on Meff for psychologists. Table 1 also provides a nice intuitive illustration of how Meff works. I also appreciate that this new draft explores different possible patterns of correlations that reflect different underlying latent variables. This paints a more realistic picture and brings out some of the tradeoffs of the different approaches (PCA, Bonferroni, Meff).</p>
            <p> </p>
            <p> I think this paper accomplishes its stated goals, and is a useful resource. I have spot-checked a few of the simulations and see similar patterns to what the paper reports. I appreciate that the authors have made their code and data available.</p>
            <p> </p>
            <p> I have just a few minor suggestions: 
                <list list-type="order">
                    <list-item>
                        <p>Figures 2-4. I would recommend removing effect size=0.8. Effect size makes some difference but appears less important than number of outcomes and correlation strength. Furthermore, power is always high with effect size=0.8 and n=50, so it doesn&#x2019;t add much information to display effect size=0.8. I also had trouble distinguishing the two dashed lines. Altogether, effect size=0.8 just makes the graph harder to read without adding a lot of extra information.</p>
                    </list-item>
                    <list-item>
                        <p>It&#x2019;s somewhat contradictory to say that the lookup tables are useful when you don&#x2019;t have access to the underlying data but then to present an illustrative example for which the underlying data were available (Burgoyne 2012). Presumably, if we had access to the full data, we could calculate Meff exactly (or account for multiple testing using other approaches). Are there any examples you could present where the data are not available but there is some way to roughly estimate the correlations? E.g., because of summary data in the paper or because the correlations can be roughly estimated from previous work using the same variables? This might be closer to the real-world use case for the lookup tables.</p>
                    </list-item>
                </list>
            </p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Statistics, Sports Medicine</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment10469-160158">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bishop</surname>
                            <given-names>Dorothy</given-names>
                        </name>
                        <aff>University of Oxford, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>28</day>
                    <month>10</month>
                    <year>2023</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Many thanks for the helpful recommendations.</p>
                <p> </p>
                <p> Figures 2&#x2013;4 have been revised in line with the suggestions. The original figures, which include an effect size of .8, are still available on OSF, and a Wiki has been added to that OSF page to explain the difference.</p>
                <p> </p>
                <p> I have also added two more real-world examples, both dealing with evaluation of a memory intervention, Cogmed, for children. It was possible to use the MEff lookup table because both studies used a published working memory battery, where correlations between the subscales can be found in the test manual. Although in practice this had little impact on the interpretation of these studies, it did show how alpha depends on the correction used: one study had used Bonferroni correction, which was over-stringent, and the other had used no correction for multiple contrasts. I found that this exercise not only provided an illustrative example of use of MEff, but it further emphasised the need to consider the underlying relationships between intervention and outcomes when deciding on an analytic strategy, and so I added a further comment about that in the Discussion.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report160159">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141308.r160159</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lakens</surname>
                        <given-names>Daniel</given-names>
                    </name>
                    <xref ref-type="aff" rid="r160159a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0247-239X</uri>
                </contrib>
                <aff id="r160159a1">
                    <label>1</label>Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>21</day>
                <month>2</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Lakens D</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport160159" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.73520.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This is an interesting revision, because the author has largely abandoned the original proposal, and switched to a different approach to control error rates when multiple dependent hypotheses are tested. As the goal of the paper is a practical tutorial, I think this switch is a valid and good choice. It also means many of my original comments are no longer relevant.</p>
            <p> </p>
            <p> The idea of creating lookup tables is a nice contribution. The biggest weakness is that correlations 1) are often unknown when data is not shared, and 2) are likely more varied than in the simulations. The authors admit this: &#x201c;In practice, the problem for the researcher is to estimate the intercorrelation between outcome measures if this is not known.&#x201d; They then give an applied example where the data was shared in a repository. What is missing is a clear instruction what to do when data is not shared in a repository. I would assume this then means 1) asking for the data, 2) if data is not provided making an informed guess, and 3) be careful in the interpretation, or performing some sensitivity analyses (e.g., X findings are significant, assuming correlations are not lower than Y). I think presenting a plan for when data is not available is a useful addition.</p>
            <p> </p>
            <p> I also still wonder what happens if there is substantial variation in the off-diagonal correlations. Maybe the authors feel this is rare in realistic datasets &#x2013; then that could be mentioned.</p>
            <p> </p>
            <p> Minor comment</p>
            <p> </p>
            <p> The heading &#x201c;The case against multiple outcomes&#x201d; might confuse readers a bit, as you are arguing FOR multiple outcomes. So, maybe replace it by something like &#x2018;Evaluating error rates for multiple outcomes&#x2019;.</p>
            <p> </p>
            <p> I checked the manuscript for reproducibility, and reproduced the figures and data. I reran the simulations with a larger number of iterations, because the figures showed some surprising patterns (e.g., curves that were not purely decreasing or increasing), but found the same patterns, so the number of simulations is not the issue.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Applied statistics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment9388-160159">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bishop</surname>
                            <given-names>Dorothy</given-names>
                        </name>
                        <aff>University of Oxford, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>24</day>
                    <month>2</month>
                    <year>2023</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks to the reviewer for the new evaluation.</p>
                <p> </p>
                <p> I&#x2019;ll defer making modifications to the document until a second review is available, but just note a couple of points.</p>
                <p> </p>
                <p> 
                    <italic>What is missing is a clear instruction what to do when data is not shared in a repository. I would assume this then means 1) asking for the data, 2) if data is not provided making an informed guess, and 3) be careful in the interpretation, or performing some sensitivity analyses (e.g., X findings are significant, assuming correlations are not lower than Y). I think presenting a plan for when data is not available is a useful addition.</italic>
                </p>
                <p> </p>
                <p> Response: I like this suggestion for a plan of action. A further step between 1 and 2 would be to do a search for other datasets using the same variables, which may give an indication of the range of expected correlation between them &#x2013; while recognising that observed values may be influenced by factors such as range. So the &#x201c;informed guess&#x201d; would be informed by prior literature, if available.</p>
                <p> </p>
                <p> 
                    <italic>I also still wonder what happens if there is substantial variation in the off-diagonal correlations. Maybe the authors feel this is rare in realistic datasets &#x2013; then that could be mentioned.</italic>
                </p>
                <p> </p>
                <p> Response: On the contrary, I suspect variation in off-diagonal correlations is more common than uniformity. But in effect the most extreme version of this case is already modelled with model L2. In this model, the off-diagonal values are either zero or a specific value, r.max. The appropriate comparison for this model is a model, L1, where the correlations are uniform, with r.avg equivalent to the average off-diagonal value. Consider the case in the bottom row of Table 3. For model L2, we have r values of either 0 or .8. For model L1, we have a corresponding value of r.avg of .343. The alpha values for L1 and L2 are in adjacent rows of Table 3, and it is clear they differ only slightly. If you introduced more variability in model L2, with some r-values being intermediate between 0 and r.max, the difference between models L1 and L2 would be smaller.</p>
            </body>
        </sub-article>
        <sub-article article-type="response" id="comment10470-160159">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bishop</surname>
                            <given-names>Dorothy</given-names>
                        </name>
                        <aff>University of Oxford, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>none</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>28</day>
                    <month>10</month>
                    <year>2023</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your careful reading of the paper. I particularly appreciate you checking the reproducibility of the simulations; I realise this takes some work but it is good to have the reassurance that the patterns of results are reproducible.</p>
                <p> </p>
                <p> Following suggestions by reviewer 1, I've added two examples of real-world studies where data are not available.&#x00a0; In the field of educational/psychological interventions, it can be possible to get estimates of intercorrelations between measures if well-established measures are used, as is the case in this example.&#x00a0; I've also added some thoughts on what to do when this is not the case, as you proposed.&#x00a0;</p>
                <p> </p>
                <p> Regarding the case of substantial variation in off-diagonal correlations, having played around with various scenarios, I think it's reasonable to regard model L2 as corresponding to that, because it has a mixture on the off-diagonal of variables with correlation of zero, and those with correlation of whatever is the maximum value for correlated measures from one factor.&#x00a0; For instance, if you look at the bottom row of Table 3, L2 is the case where the off-diagonal contains a mixture of r values of 0 and .8, and L1 is the case when the off-diagonals are uniform (and equivalent to the mean of L2 values). Yet the MEff varies only slightly.&#x00a0; I think that is about the most extreme variation you could get for off-diagonal values.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report97181">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.77175.r97181</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lakens</surname>
                        <given-names>Daniel</given-names>
                    </name>
                    <xref ref-type="aff" rid="r97181a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0247-239X</uri>
                </contrib>
                <aff id="r97181a1">
                    <label>1</label>Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>10</day>
                <month>11</month>
                <year>2021</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Lakens D</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport97181" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.73520.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The author discusses more efficient ways to control error rates than the Bonferroni correction when researchers use multiple, positively correlated measures in a study. In these cases the author proposes specifying the number of variables that should be significant at the default alpha level (e.g., 0.05) to make sure the overall Type 1 error rate does not exceed 0.05.</p>
            <p> </p>
            <p> The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and concludes the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables cannot be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are &#x2018;replicants&#x2019; could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.</p>
            <p> </p>
            <p> The comparison against the Bonferroni correction is one interesting baseline, but there is a large literature on how to correct for multiple comparisons that is also much more efficient than a Bonferroni correction, and which is the more interesting comparison. A strength of the Bonferroni correction is that it makes no assumptions about the variables that are corrected. But if one is willing to make assumptions, most importantly about the correlation between variables, more efficient approaches are available. How does this approach compare against other correction approaches? Although there are many correction approaches (this seems to be a particularly active field in neuroscience, where multiple comparisons when analyzing brain activation are common, and measures are strongly correlated), the following two references provide a starting point.</p>
            <p> </p>
            <p> Fan, J., Han, X., &amp; Gu, W. (2012). Estimating False Discovery Proportion Under Arbitrary Covariance Dependence. 
                <italic>Journal of the American Statistical Association</italic>, 
                <italic>107</italic>(499), 1019&#x2013;1035
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-97181-1">1</xref>
                </sup>
            </p>
            <p> </p>
            <p> Yekutieli, D., &amp; Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. 
                <italic>Journal of Statistical Planning and Inference</italic>, 
                <italic>82</italic>(1), 171&#x2013;196
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-97181-2">2</xref>
                </sup>
            </p>
            <p> </p>
            <p> There are several things to consider. First, the proposed simulation-based approach fixes the alpha level, and selects the number of variables that need to be significant. Numbers of variables are relatively crude, in that we can pick 1, 2, 3, 4, etc., but not 1.23 variables. Alternative approaches in the literature lower the alpha level. A benefit of these approaches is that the alpha level can be set at any value (e.g., 0.0352) to exactly control the Type 1 error rate. The consequences of this are also clear when we look at the figures and compare the principal component approach with the Adjust NVar approach. The familywise error rate for the Adjust NVar approach is often well below 0.05, while it is controlled at 0.05 in the principal component approach (and it is controlled at 0.05 in other approaches in the literature that lower the alpha level). The author discusses this (e.g., page 7 of the pdf version) in some detail, but the author seems slightly biased towards their own approach, stating that &#x201c;the tradeoff between power and familywise error (expressed as a ratio) is higher for Adjust NVar.&#x201d; It is not clear this ratio is a fair evaluation of the methods, and the information is difficult to distill from the figures (a Table with Type 1 error rates and Type 2 error rates would be more useful for this). This ratio is not so easy to summarize in a single sentence, I feel. If a design has 99.9% power, lowering the alpha from 0.05 to 0.02 has little effect on power, but if power is 0.8, lowering the alpha has a greater effect. Typically, the evaluation is done on the required sample size &#x2013; which type of correction would require the smallest sample size? And then it becomes important to take along more modern corrections for correlated variables as well.</p>
            <p> </p>
            <p> I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.</p>
            <p> </p>
            <p> The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.</p>
            <p> </p>
            <p> It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The &#x2018;skipsim&#x2019; workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an &#x2018;Images&#x2019; folder to run the plot code on line 679 &#x2013; this folder could be uploaded to the GitHub repo perhaps?</p>
            <p> </p>
            <p> To conclude, I believe this current version of the manuscript needs some additional work, which could include a more extensive discussion of other corrections in the literature, an exploration of additional simulations, and a practical example of how to use this approach when analyzing published studies.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Applied statistics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-97181-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
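                        <person-group person-group-type="author">
                            <name>
                                <surname>Fan</surname>
                                <given-names>J</given-names>
                            </name>
                            <name>
                                <surname>Han</surname>
                                <given-names>X</given-names>
                            </name>
                            <name>
                                <surname>Gu</surname>
                                <given-names>W</given-names>
                            </name>
                        </person-group>: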
                        <article-title>Estimating False Discovery Proportion Under Arbitrary Covariance Dependence.</article-title>
                        <source>
                            <italic>J Am Stat Assoc</italic>
                        </source>.<year>2012</year>;<volume>107</volume>(<issue>499</issue>) :
                        <elocation-id>10.1080/01621459.2012.720478</elocation-id>
                        <fpage>1019</fpage>-<lpage>1035</lpage>
                        <pub-id pub-id-type="pmid">24729644</pub-id>
                        <pub-id pub-id-type="doi">10.1080/01621459.2012.720478</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-97181-2">
                    <label>2</label>
                    <mixed-citation publication-type="journal">
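                        <person-group person-group-type="author">
                            <name>
                                <surname>Yekutieli</surname>
                                <given-names>D</given-names>
                            </name>
                            <name>
                                <surname>Benjamini</surname>
                                <given-names>Y</given-names>
                            </name>
                        </person-group>: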
                        <article-title>Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics</article-title>.
                        <source>
                            <italic>Journal of Statistical Planning and Inference</italic>
                        </source>.<year>1999</year>;<volume>82</volume>(<issue>1-2</issue>) :
                        <elocation-id>10.1016/S0378-3758(99)00041-5</elocation-id>
                        <fpage>171</fpage>-<lpage>196</lpage>
                        <pub-id pub-id-type="doi">10.1016/S0378-3758(99)00041-5</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment9033-97181">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bishop</surname>
                            <given-names>Dorothy</given-names>
                        </name>
                        <aff>University of Oxford, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>11</month>
                    <year>2022</year>
                </pub-date>
            </front-stub>
            <body>
                <p>I thank the reviewer for such a comprehensive and constructive review. The two reviews complemented one another very nicely, and helped me see how to restructure and revise the paper to make it clearer.&#x00a0;&#x00a0;</p>
                <p> </p>
                <p> As background, I should explain that this paper grew out of an attempt to write a primer on how to analyse/evaluate published intervention studies for students/professionals in allied health professions/education. My concern was that Bonferroni correction is often too conservative, but explanations of other methods are typically highly complex, so I was looking for a simpler rule of thumb that could be useful in interpreting published studies. I have emphasised this rationale more in the revised paper.&#x00a0;</p>
                <p> </p>
                <p> 
                    <bold>Response to specific points.</bold>
                </p>
                <p> </p>
                <p> 1. The difficulty with correcting for multiple comparisons based on the number of variables that are significant is that you want to prevent a situation where a researcher performs a Stroop effect test, a Simon effect test, and a test for precognition, and concludes the hypothesis is supported because 2 out of 3 tests are significant. The solution the author proposes is to only use this approach if 1) the same latent variable is measured in different measures, and 2) the correlation between these variables cannot be zero. I believe this is a valid approach. These assumptions are discussed enough in the article, but on page 11 (of the pdf version), one might want to discuss that *whether* different measures actually are &#x2018;replicants&#x2019; could be a matter of debate among peers. One could imagine a situation where researchers in a field who disagree about measures would also disagree about whether different measures are replicants.</p>
                <p> </p>
                <p> 
                    <italic>Response: This is an important point, which is similar to issues raised by reviewer 1, so I have now extended the simulations to take into account situations where outcomes may not all be indicators of one factor.</italic>
                </p>
                <p> </p>
                <p> 2. Alternative methods for correction for multiple comparisons.</p>
                <p> </p>
                <p> 
                    <italic>Response: The reviewer recommends two references with alternative approaches to multiple comparison correction. As I noted above, one goal was to keep things simple, so I am reluctant to do a full comparison of all methods, especially since it is clear that I need to consider different underlying data structures. I like the relative simplicity of the MEff approach recommended by reviewer 1, and so I am now presenting a comparison of Bonferroni, PCA and MEff.</italic>
                </p>
                <p> </p>
                <p> 3. I could imagine other modern corrections are a bit more efficient, but the author makes a good point that the simplicity of the methods allows one to evaluate published studies. This might be a useful application. However, then I would like to urge the author to add an example. It would be good to see how this approach should be applied in a real use case (e.g., read a justification for why the measures are measuring the same latent construct, and how to make an assumption about the correlation) and how to interpret the results, depending on how many tests are significant.</p>
                <p> </p>
                <p> 
                    <italic>Response: Again, this converges nicely with the recommendations of reviewer 1, and I have now restructured the paper to focus more on this aspect and to give a real-life example.</italic>
                </p>
                <p> </p>
                <p> 4. The current simulation is somewhat limited. The idea that all correlations are identical will not be true in practice, and this will complicate the application of the proposed technique. What is the recommendation in practice? Should authors choose the largest correlation or the smallest correlation? Which approach is more or less conservative? What is the effect of differences in standard deviations, and does it matter if measures with low correlations have larger standard deviations? I am not sure all these factors will have a large influence but users will most likely need to know what to do.</p>
                <p> </p>
                <p> 
                    <italic>Response: I now contrast three models with different patterns of intercorrelation between intervention and outcome. Potentially, there is no end to the options to consider, but I found it reassuring that with the MEff approach, the adjusted alpha was similar for different correlated variables, provided the mean off-diagonal correlation was constant.</italic>
                </p>
                <p> </p>
                <p> 5. It was very nice to see the paper was written in Rmarkdown. I was able to reproduce the results computationally (I did not repeat the simulations). As a minor comment, skipsim was no longer on line 207 but on 222. The files would be clearer if all simulation code was a separate R script that is run to generate the simulation data. The datafiles generated can then be read in at the top of the Rmd file. The &#x2018;skipsim&#x2019; workaround does not improve clarity, and I had a hard time figuring out where the code reads in the data. In row 561, for example, the toybit file is written, but if I read it in, this is not needed. Separating the creation of data, and reading in data, makes this cleaner. I needed to create an &#x2018;Images&#x2019; folder to run the plot code on line 679 &#x2013; this folder could be uploaded to the GitHub repo perhaps?</p>
                <p> </p>
                <p> 
                    <italic>Response: I have now separated the simulation and generation of corresponding figures from the code to write the paper.</italic>
                </p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report96192">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.77175.r96192</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Sainani</surname>
                        <given-names>Kristin</given-names>
                    </name>
                    <xref ref-type="aff" rid="r96192a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0614-303X</uri>
                </contrib>
                <aff id="r96192a1">
                    <label>1</label>Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>12</day>
                <month>10</month>
                <year>2021</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Sainani K</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport96192" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.73520.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper presents a method for controlling the familywise error rate when testing multiple outcomes: Adjust NVar. Unlike multiple testing adjustments that lower the p-value threshold, the idea behind Adjust NVar is to require a minimum number of p-values (MinNSig) to meet a nominal p&lt;.05 threshold.</p>
            <p> </p>
            <p> When outcomes are independent, MinNSig is defined as follows, where M is the number of outcomes:</p>
            <p> X~binomial(M, 0.05)</p>
            <p> MinNSig is the minimum x for which P(X&gt;=x)&lt;.05 or, equivalently, P(X &lt; x)&gt;.95.</p>
            <p> For example, if M=6, then:</p>
            <p> P(X=0)=.95**6=.735</p>
            <p> P(X&lt;=1)=P(X&lt;2)=.95**6+6*.95**5*.05=.735+.232=.967</p>
            <p> Thus, MinNSig=2.</p>
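            <p> The binomial calculation above can be sketched in a few lines of Python. This is only an illustrative sketch for the independent-outcomes case: the function name min_n_sig and its parameter names are hypothetical and are not taken from the paper's R code, and the strict inequality P(X&gt;=x)&lt;.05 from the definition above is assumed.</p>
            <preformat>
```python
from math import comb


def min_n_sig(m, alpha=0.05, familywise=0.05):
    """Smallest x whose upper tail P(X >= x) falls strictly below
    `familywise`, where X ~ Binomial(m, alpha) (independent outcomes)."""
    for x in range(m + 2):
        # Upper-tail probability P(X >= x); empty sum (zero) when x = m + 1.
        tail = sum(
            comb(m, k) * alpha**k * (1 - alpha) ** (m - k)
            for k in range(x, m + 1)
        )
        if familywise > tail:
            return x


# Worked example from the text: six independent outcomes.
print(min_n_sig(6))
```
            </preformat>
            <p> For M=6 the sketch returns the same MinNSig as the hand calculation. It does not cover correlated outcomes, for which the paper relies on simulation.</p>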
            <p> </p>
            <p> When outcomes are correlated, there is no simple formula for obtaining MinNSig, so the author has used a simulation approach to account for varying correlation structures.</p>
            <p> </p>
            <p> The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p&lt;.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.</p>
            <p> </p>
            <p> However, I am less convinced of the value of AdjustNVar as a formal tool for controlling the familywise error rate in a planned study. At a minimum, further development and a broader set of simulations would be required to support such a recommendation. The current manuscript describes three alternatives to specifying a single primary outcome in an RCT: (1) Bonferroni adjustment, (2) permutation tests, and (3) use of PCA to derive a single composite outcome. But this ignores existing p-value adjustment methods that are less conservative than Bonferroni. For example, the &#x201c;M-effective&#x201d; (Meff) approach accomplishes many of the same goals as AdjustNVar (see: Cheverud (2001)
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-96192-1">1</xref>
                </sup>, Nyholt (2004)
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-96192-2">2</xref>
                </sup>, and Derringer (2018)
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-96192-3">3</xref>
                </sup>). In the Meff approach, one adjusts the p-value threshold by dividing by the effective number of outcomes (Meff) rather than the actual number of outcomes (M). Meff is based on the eigenvalues of the correlation matrix of the outcomes. Where Eigen is the observed vector of eigenvalues from the correlation matrix of the outcomes, Meff is calculated as:</p>
            <p> </p>
            <p> Meff = 1 + (M-1)*(1-(Var(Eigen)/M))</p>
            <p> </p>
            <p> Bonferroni threshold = alpha/M</p>
            <p> Meff threshold = alpha/Meff</p>
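            <p> As a rough sketch of the Meff formula above, the following Python computes Meff and the two thresholds from a correlation matrix. Two assumptions are made here that are not stated in the report: Var(Eigen) is taken to be the sample variance (denominator M-1), and the eigenvalues are never computed explicitly, because for a correlation matrix they sum to M and their squares sum to the sum of squared matrix entries. The names meff and uniform_corr are hypothetical helpers, and the uniform correlation of 0.5 is an arbitrary illustration.</p>
            <preformat>
```python
def meff(corr):
    """Effective number of outcomes for a correlation matrix given as a
    list of lists, using Meff = 1 + (M-1)*(1 - Var(Eigen)/M)."""
    m = len(corr)
    if m == 1:
        return 1.0
    # Eigenvalues of a correlation matrix have mean 1 (they sum to M), and
    # the sum of their squares equals the sum of squared matrix entries.
    # Assumption: Var(Eigen) is the sample variance with denominator M-1.
    sum_sq = sum(r * r for row in corr for r in row)
    var_eigen = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_eigen / m)


def uniform_corr(m, r):
    """Hypothetical helper: M x M matrix with 1 on the diagonal, r elsewhere."""
    return [[1.0 if i == j else r for j in range(m)] for i in range(m)]


alpha = 0.05
m_eff = meff(uniform_corr(6, 0.5))
print("Meff:", m_eff)
print("Bonferroni threshold:", alpha / 6)
print("Meff threshold:", alpha / m_eff)
```
            </preformat>
            <p> Under these assumptions, uncorrelated outcomes give Meff=M (the Bonferroni threshold) and perfectly correlated outcomes give Meff=1 (no correction).</p>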
            <p> </p>
            <p> Like AdjustNVar, Meff is simple and accounts for correlated outcomes. But I believe it has several advantages over AdjustNVar: (1) Meff precisely controls the Type I error rate, whereas AdjustNVar has varying Type I error rates that cannot be precisely controlled by the investigator; (2) Meff accounts for the correlation structure observed in the data, whereas AdjustNVar requires the investigator to guess at the correlation structure; if this guess is far off (which could easily be the case), this would lead to poor Type I error control.</p>
            <p> </p>
            <p> This paper has numerous strengths, including the novelty of the idea; the potential use as a heuristic for re-interpreting flawed published papers; the concision of the writing; and the availability of all code and data. The major limitations of the paper are: (1) it presents an overly narrow set of simulations that do not capture most realistic situations, but then makes overly broad claims based on these simulations; (2) it does not compare AdjustNVar to existing approaches that are less conservative than Bonferroni, such as Meff; and (3) it does not address the different reasons why researchers may be including multiple outcomes, even though these different reasons lead to markedly different correlation structures.</p>
            <p> </p>
            <p> Specific comments: 
                <list list-type="order">
                    <list-item>
                        <p>The paper would benefit from further consideration of the reasons why researchers may include multiple outcomes. The paper focuses on the case where an intervention is &#x201c;expected to affect a range of related processes.&#x201d; The simulations make assumptions that match this case, assuming equal correlations across outcomes and equal true effects for each outcome. But researchers may include multiple outcomes for many other reasons, such as: (a) they aren&#x2019;t sure which process the intervention will affect, (b) they believe the intervention may affect two different processes but they measure each process with several different measurements to &#x201c;hedge their bets&#x201d;, or (c) they include a &#x201c;soft&#x201d; endpoint in addition to a &#x201c;hard&#x201d; endpoint because the &#x201c;hard&#x201d; endpoint may occur too rarely. Each of these cases corresponds to different assumptions for the simulations. For example, (b) would be expected to have two clusters of highly correlated variables that are only weakly correlated with each other, which will affect MinNSig.</p>
                    </list-item>
                    <list-item>
                        <p>The paper suggests that AdjustNVar could be used in study planning&#x2014;researchers would guess at the correlation structure and set a MinNSig ahead of time. But if they guess the correlation structure incorrectly, such as underestimating the true correlation, then they may choose a MinNSig that does not adequately control Type I error.</p>
                    </list-item>
                    <list-item>
                        <p>The &#x201c;quantum nature&#x201d; of AdjustNVar is not a desirable characteristic. The researcher is unable to precisely control the Type I error rate. In planning a study in which the correlation is expected to be 0.4, for example, Table 2 would suggest that the researcher should then always choose 9 outcomes over 5-8 outcomes, since 9 maximizes the chances of getting at least 3 p-values &lt;.05. This is one reason I prefer Meff, which precisely controls the Type I error rate.</p>
                    </list-item>
                    <list-item>
                        <p>This description is misleading: &#x201c;Should we dismiss the trial as showing no benefit? We can use the binomial theorem to check the probability of obtaining this result if the null hypothesis is true and the measures are independent: it is 0.033, clearly below the 5% alpha level.&#x201d; The description gives the misleading impression that one would be justified in re-evaluating a paper that used a Bonferroni correction by instead applying the criterion of at least two p-values &lt;.05. But doing so would inflate the Type I error rate. Results would have been declared significant if EITHER at least one p-value met the Bonferroni threshold OR at least two p-values were &lt;.05 &#x2014; leading to an effective Type I error rate of 7% (assuming independent outcomes). Note that this example reappears in the discussion and also mistakenly implies that had three p-values been &lt;.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally. &#x00a0;</p>
                    </list-item>
                    <list-item>
                        <p>I found Table 1 confusing on first read, and I would recommend simplifying it by focusing on a single number of outcomes rather than both 2 outcomes and 4 outcomes and by removing discussions of ranking p-values. (The p-value ranking isn&#x2019;t important &#x2014; this is just part of the mechanics of how the algorithm is calculating MinNSig, so I don&#x2019;t think it&#x2019;s needed.) For example, you could just focus on calculating MinNSig for 6 outcomes. Show 6 columns of p-values for the 6 outcomes, and then show a single final column that tabulates the number of p-values &lt;.05 for each simulated trial. Then show a frequency table of how many simulations out of 1000 resulted in 0 p-values &lt;.05, 1 p-value &lt;.05, 2 p-values &lt;.05, etc. Then indicate that MinNSig occurs at one number above when the cumulative frequency crosses 95%.</p>
                    </list-item>
                    <list-item>
                        <p>I&#x2019;m unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2 or to present two tables &#x2014; one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.</p>
                    </list-item>
                    <list-item>
                        <p>Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is &lt;0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).</p>
                    </list-item>
                    <list-item>
                        <p>Related to comment (7), I don&#x2019;t think the paper is justified in making this broad claim: &#x201c;The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables.&#x201d; This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.</p>
                    </list-item>
                    <list-item>
                        <p>Figures 4-6. Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation &#x201c;stacks the deck&#x201d; for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.</p>
                    </list-item>
                    <list-item>
                        <p>The article claims that power is only &#x201c;slightly lower&#x201d; for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just &#x201c;slight&#x201d;. For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.</p>
                    </list-item>
                    <list-item>
                        <p>PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.</p>
                    </list-item>
                    <list-item>
                        <p>In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don&#x2019;t view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.</p>
                    </list-item>
                    <list-item>
                        <p>Figures 1-6: I found these graphs hard to read as they don&#x2019;t have a clear take-home point. I would suggest removing the single outcome comparisons altogether and then reformatting so that the graphs directly compare AdjustNVar to Meff and PC (three lines). In one graph, hold effect size, sample size, and N outcomes constant, and then show the power of the three methods as a function of increasing correlation. In another graph, hold correlation, sample size, and effect size constant, and show the power of the three methods as a function of N outcomes. And so on. All methods aim to control the familywise error rate at 5%, so this should not be a variable. The fact that AdjustNVar sometimes results in a lower Type I error rate is incidental &#x2014; it arises as a quirk of the method not as the intent of the researcher.</p>
                    </list-item>
                    <list-item>
                        <p>Where AdjustNVar may have an advantage over a method like Meff is for re-evaluating flawed published studies where researchers used multiple outcomes but did not account in any way for multiple testing. An AdjustNVar table such as Table 2 would give readers a quick way to reevaluate such studies without the need for any calculations or access to raw data. I would focus the paper more on this application.&#x00a0;</p>
                    </list-item>
                </list>
            </p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Statistics, Sports Medicine</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-96192-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"><name><surname>Cheverud</surname><given-names>JM</given-names></name></person-group>:
                        <article-title>A simple correction for multiple comparisons in interval mapping genome scans.</article-title>
                        <source>
                            <italic>Heredity (Edinb)</italic>
                        </source>.<year>2001</year>;<volume>87</volume>(<issue>Pt 1</issue>):
                        <fpage>52</fpage>-<lpage>8</lpage>
                        <pub-id pub-id-type="pmid">11678987</pub-id>
                        <pub-id pub-id-type="doi">10.1046/j.1365-2540.2001.00901.x</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-96192-2">
                    <label>2</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"><name><surname>Nyholt</surname><given-names>DR</given-names></name></person-group>:
                        <article-title>A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other.</article-title>
                        <source>
                            <italic>Am J Hum Genet</italic>
                        </source>.<year>2004</year>;<volume>74</volume>(<issue>4</issue>):
                        <fpage>765</fpage>-<lpage>9</lpage>
                        <pub-id pub-id-type="pmid">14997420</pub-id>
                        <pub-id pub-id-type="doi">10.1086/383251</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-96192-3">
                    <label>3</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"><name><surname>Derringer</surname><given-names>J</given-names></name></person-group>:
                        <article-title>A simple correction for non-independent tests</article-title>.<year>2018</year>.
                        <pub-id pub-id-type="doi">10.31234/osf.io/f2tyw</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment9034-96192">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bishop</surname>
                            <given-names>Dorothy</given-names>
                        </name>
                        <aff>University of Oxford, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>11</month>
                    <year>2022</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for the very useful evaluation of this paper, which has prompted further thoughts and a revised manuscript. There was some convergence of views of the two reviewers, although the specific recommendations varied. I thank the reviewer for engaging so thoroughly with the manuscript and for helping improve it.&#x00a0;</p>
                <p> </p>
                <p> General point A: AdjustNVar as a heuristic.&#x00a0;</p>
                <p> </p>
                <p> Reviewer 1 writes: The idea is novel and has several merits. In particular, the approach closely matches what many researchers already do in the published literature. Many published studies apply a p-value cutoff of 0.05 to multiple outcomes without correcting for multiple testing (or designating a primary outcome). I can envision the AdjustNVar approach being a useful heuristic for re-evaluating published studies that used multiple outcomes but failed to account for multiple testing in any way. For example, if an RCT reported 10 outcomes with only 1 significant result (p&lt;.05), readers can easily recognize that this is compatible with a chance finding. But what if the trial found 2 significant results or 3? And what if the outcomes are moderately correlated instead of independent? I can envision researchers using a table such as Table 2 of the AdjustNVar paper to make a quick assessment based on the number of outcomes reported and a rough guess at the correlation structure.&#x00a0;</p>
                <p> </p>
                <p> 
                    <italic>Response: After exploring MEff, I decided not to proceed with the AdjustNVar approach, as I think a lookup table derived from MEff would achieve the same effect but without the problems arising from the quantal nature of N variables.</italic>
                </p>
                <p> </p>
                <p> General Point B: Need to consider MEff</p>
                <p> </p>
                <p> Reviewer 1 notes that the MEff approach accomplishes many of the same goals as AdjustNVar, and is preferable in many respects.&#x00a0;&#x00a0;</p>
                <p> </p>
                <p> 
                    <italic>Response: I had been unaware of MEff, and having consulted the references, I agree it is an elegant solution to the problem I was trying to tackle that avoids some of the limitations of AdjustNVar, particularly the need to assume a given correlation structure. I initially thought this approach had only been used in genetics and had not been applied in psychology, but the final reference pointed to the work of Derringer, who has done a great job in a preprint that provides a tutorial on its use. I don&#x2019;t think it is much used in intervention research, which is the context I am particularly interested in, and so I think it is worthwhile updating the article so that it serves as a basic introduction to MEff, with discussion of factors affecting power.</italic>
                </p>
                <p>
                    <italic>Accordingly, I have reoriented the article to compare different methods, with a focus on MEff.</italic>
                </p>
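                <p>As a rough illustration of how a MEff-style correction behaves, the following Python sketch applies the eigenvalue-variance formula described in references 1 and 2 of the reviewer's list to the six outcomes with a uniform correlation of 0.6 used in the reviewer's simulation. It is an assumption that this variant matches the one used in the revised article; the function name, the closed-form eigenvalues for an equal-correlation matrix, and the Bonferroni-style division of alpha by MEff are illustrative choices, not the article's code.</p>
                <preformat preformat-type="code">
```python
from statistics import variance


def meff_equal_correlation(n_outcomes, r):
    """Effective number of tests for n_outcomes that all intercorrelate at r.

    Hypothetical helper: an equal-correlation matrix has one eigenvalue of
    1 + (n - 1) * r and n - 1 eigenvalues of 1 - r, so no matrix library
    is needed for this special case.
    """
    eigenvalues = [1 + (n_outcomes - 1) * r] + [1 - r] * (n_outcomes - 1)
    # statistics.variance is the sample variance (n - 1 denominator).
    return 1 + (n_outcomes - 1) * (1 - variance(eigenvalues) / n_outcomes)


meff = meff_equal_correlation(6, 0.6)   # approximately 4.2
alpha_adjusted = 0.05 / meff            # assumption: divide alpha by MEff

print(f"MEff = {meff:.2f}, adjusted alpha = {alpha_adjusted:.4f}")
```
                </preformat>
                <p>With independent outcomes (r = 0) the function returns the full number of outcomes, which reduces to Bonferroni; with perfectly correlated outcomes (r = 1) it returns 1, so no correction is applied.</p>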
                <p> </p>
                <p> Specific comments:</p>
                <p> </p>
                <p>1. The paper would benefit from further consideration of the reasons why researchers may include multiple outcomes. The paper focuses on the case where an intervention is &#x201c;expected to affect a range of related processes.&#x201d; The simulations make assumptions that match this case, assuming equal correlations across outcomes and equal true effects for each outcome. But researchers may include multiple outcomes for many other reasons, such as: (a) they aren&#x2019;t sure which process the intervention will affect, (b) they believe the intervention may affect two different processes but they measure each process with several different measurements to &#x201c;hedge their bets&#x201d;, or (c) they include a &#x201c;soft&#x201d; endpoint in addition to a &#x201c;hard&#x201d; endpoint because the &#x201c;hard&#x201d; endpoint may occur too rarely. Each of these cases corresponds to different assumptions for the simulations. For example, (b) would be expected to have two clusters of highly correlated variables that are only weakly correlated with each other, which will affect MinNSig.</p>
                <p> </p>
                <p> 
                    <italic>Response: This really got me thinking, and I have now incorporated some further simulations where correlations are not uniform, and have also added more discussion of this issue.</italic>
                </p>
                <p> </p>
                <p>2. The paper suggests that AdjustNVar could be used in study planning&#x2014;researchers would guess at the correlation structure and set a MinNSig ahead of time. But if they guess the correlation structure incorrectly, such as underestimating the true correlation, then they may choose a MinNSig that does not adequately control Type I error.</p>
                <p> </p>
                <p> 
                    <italic>Response: Agreed. This is now dropped.</italic>
                </p>
                <p> </p>
                <p>3. The &#x201c;quantum nature&#x201d; of AdjustNVar is not a desirable characteristic. The researcher is unable to precisely control the Type I error rate. In planning a study in which the correlation is expected to be 0.4, for example, Table 2 would suggest that the researcher should then always choose 9 outcomes over 5-8 outcomes, since 9 maximizes the chances of getting at least 3 p-values &lt;.05. This is one reason I prefer Meff, which precisely controls the Type I error rate.</p>
                <p> </p>
                <p> 
                    <italic>Response: Agreed. AdjustNVar is now dropped.</italic>
                </p>
                <p> </p>
                <p>4. This description is misleading: &#x201c;Should we dismiss the trial as showing no benefit? We can use the binomial theorem to check the probability of obtaining this result if the null hypothesis is true and the measures are independent: it is 0.033, clearly below the 5% alpha level.&#x201d; The description gives the misleading impression that one would be justified in re-evaluating a paper that used a Bonferroni correction by instead applying the criterion of at least two p-values &lt;.05. But doing so would inflate the Type I error rate. Results would have been declared significant if EITHER at least one p-value met the Bonferroni threshold OR at least two p-values were &lt;.05 &#x2014; leading to an effective Type I error rate of 7% (assuming independent outcomes). Note that this example reappears in the discussion and also mistakenly implies that had three p-values been &lt;.05, we would have been able to reject the null hypothesis. But this is not the case because the results were already subjected to Bonferroni, and additionally subjecting them to AdjustNVar makes the effective Type I error rate higher than 5%. AdjustNVar should be applied only to re-evaluate studies that failed to incorporate any adjustments for multiple testing originally.</p>
                <p> </p>
                <p> 
                    <italic>Response: thanks for this clarification. As AdjustNVar is now omitted, this no longer applies.</italic>
                </p>
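                <p>For readers who want to check the two figures quoted in comment 4, the arithmetic can be sketched in Python as follows. The number of outcomes is not given in the quoted passage; six independent outcomes is an assumption made here because it reproduces both the 0.033 and the 7% values. The function and variable names are illustrative only.</p>
                <preformat preformat-type="code">
```python
from math import comb


def prob_at_least(k, n, alpha):
    """Chance that at least k of n independent null tests give p below alpha."""
    return sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k, n + 1))


N_OUTCOMES = 6   # assumption: not stated in the quoted passage
ALPHA = 0.05

# Criterion A: at least two of the p-values fall below .05.
p_two_or_more = prob_at_least(2, N_OUTCOMES, ALPHA)

# Criterion B without A: exactly one p-value falls below .05, and that one
# also meets the Bonferroni threshold of ALPHA / N_OUTCOMES.
bonferroni = ALPHA / N_OUTCOMES
p_bonferroni_only = N_OUTCOMES * bonferroni * (1 - ALPHA) ** (N_OUTCOMES - 1)

# Declaring success when EITHER criterion is met adds the two disjoint cases.
p_either = p_two_or_more + p_bonferroni_only

print(f"at least two p-values below .05: {p_two_or_more:.3f}")  # 0.033
print(f"Bonferroni OR at least two:      {p_either:.3f}")       # 0.071
```
                </preformat>
                <p>Under these assumptions the combined rule rejects a true null hypothesis about 7% of the time, which is the inflation the reviewer describes.</p>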
                <p> </p>
                <p>5. I found Table 1 confusing on first read, and I would recommend simplifying it by focusing on a single number of outcomes rather than both 2 outcomes and 4 outcomes and by removing discussions of ranking p-values. (The p-value ranking isn&#x2019;t important &#x2014; this is just part of the mechanics of how the algorithm is calculating MinNSig, so I don&#x2019;t think it&#x2019;s needed.) For example, you could just focus on calculating MinNSig for 6 outcomes. Show 6 columns of p-values for the 6 outcomes, and then show a single final column that tabulates the number of p-values &lt;.05 for each simulated trial. Then show a frequency table of how many simulations out of 1000 resulted in 0 p-values &lt;.05, 1 p-value &lt;.05, 2 p-values &lt;.05, etc. Then indicate that MinNSig occurs at one number above when the cumulative frequency crosses 95%.</p>
                <p> </p>
                <p> 
                    <italic>Response: This no longer applies, as the tables have been redone.</italic>
                </p>
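                <p>The tabulation the reviewer describes in comment 5 can be sketched in Python for the simplest case. This is not the original simulation: it swaps the 1000 simulated trials for exact binomial probabilities, which is valid only under the assumption that the outcomes are independent, and the function name is hypothetical. Correlated outcomes, as in the original Table 2, would need simulation and may give different values.</p>
                <preformat preformat-type="code">
```python
from math import comb


def min_n_sig(n_outcomes, alpha=0.05, familywise=0.05):
    """Return (MinNSig, achieved error rate) for independent outcomes.

    Walks up the cumulative distribution of the number of p-values below
    alpha under the null; MinNSig is one more than the count at which the
    cumulative probability first reaches 1 - familywise.
    """
    cumulative = 0.0
    for k in range(n_outcomes + 1):
        cumulative += comb(n_outcomes, k) * alpha**k * (1 - alpha) ** (n_outcomes - k)
        if cumulative >= 1 - familywise:
            return k + 1, 1 - cumulative
    return n_outcomes + 1, 0.0  # fallback; only reachable through rounding


for n in range(5, 10):
    criterion, achieved = min_n_sig(n)
    print(f"{n} outcomes: MinNSig = {criterion}, achieved error rate = {achieved:.4f}")
```
                </preformat>
                <p>Under independence the criterion is two significant results for five to seven outcomes and three for eight or nine, and the achieved error rate drops from about 4.4% with seven outcomes to under 1% with eight. This step change is the quantal behaviour raised in comment 3.</p>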
                <p> </p>
                <p>6. I&#x2019;m unclear as to why the paper focuses on one-tailed tests, which are less common in the literature. I think it would be more useful to present two-tailed tests in Table 2 or to present two tables &#x2014; one for one-tailed tests and one for two-tailed tests. This makes a difference in a few MinNSig values.</p>
                <p> </p>
                <p> 
                    <italic>Response: One-tailed tests have a bad reputation because they are so often misused simply to obtain a more lenient significance threshold when there are no directional predictions, but there are contexts in which they are entirely appropriate, including the kind of intervention study that is the focus of attention here. You can reasonably predict that an intervention will improve performance rather than worsen it. Reviewer 2 wrote a blog post about this which I find convincing: 
                        <ext-link ext-link-type="uri" xlink:href="http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html">http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html</ext-link>. I have now explained this further.</italic>
                </p>
                <p> </p>
                <p>7. Figures 1-3: These figures compare AdjustNVar to a single study outcome. I think the logic behind this comparison is flawed, however. It is comparing apples to oranges. The simulation assumes that, when applying AdjustNVar, ALL variables studied have a true effect. This, in effect, stacks the deck on statistical power for *any* method that considers multiple outcomes rather than a single outcome. For example, I ran a simulation comparing Bonferroni with 6 outcomes compared to a single outcome with n=50 per group. When the correlation is &lt;0.8, Bonferroni also has more power than the single outcome. And, when I compared the Meff strategy to a single outcome in a variety of scenarios, Meff was always more powerful than a single outcome. I think a more useful comparison would be to directly compare different methods that handle multiple outcomes (e.g., PC to AdjustNVar to Meff).</p>
                <p> </p>
                <p> 
                    <italic>Response: Thanks again for helping clarify what is being simulated here. I hope this is clearer in the current article. The figures have been redrawn and I hope are now clearer. Perhaps one takeaway point is that, in being concerned to control type I error, the CONSORT recommendations appear to ignore the gains in power that can be achieved with multiple outcomes. &#x00a0;</italic>
                </p>
                <p> </p>
                <p>8. Related to comment (7), I don&#x2019;t think the paper is justified in making this broad claim: &#x201c;The Adjust NVar approach can achieve a more efficient trade-off between power and type I error rate than use of a single outcome when there are three or more moderately intercorrelated outcome variables.&#x201d; This conclusion is true only when the intervention truly affects ALL outcomes, which is a narrow and arguably unrealistic case. A more realistic scenario is where the intervention works only on a subset of outcomes. In this case, the single variable strategy will be more statistically powerful than the multiple-variable strategies if you choose the right variable.</p>
                <p> </p>
                <p>Figures 4-6. Same issue as for Figures 1-3: comparing the principal components composite variable strategy (PC) to a single outcome is flawed because the simulation &#x201c;stacks the deck&#x201d; for PC by assuming that all outcomes have a true effect. I believe that Figures 1-6 should focus instead on comparisons of different methods for handling multiple outcomes.</p>
                <p> </p>
                <p> 
                    <italic>Response: Same as for point 7; this is clearly a key issue, and I think it is clearer now that the alternative models (L2 and L2x) are included.</italic>
                </p>
                <p> </p>
                <p>9. The article claims that power is only &#x201c;slightly lower&#x201d; for AdjustNVar compared with the PC strategy. However, when I run simulations comparing PC to AdjustNVar, I get consistently higher statistical power for PC and I would characterize the difference as more than just &#x201c;slight&#x201d;. For example, with N=50, global corr=0.6, global ES=0.3, and 6 outcomes (two-tailed test), I get power of 35% for AdjustNVar versus 37% for Meff versus 46% for PC.</p>
                <p> </p>
                <p> 
                    <italic>Response: thanks for pushing back on this &#x2013; this is fair comment.</italic>
                </p>
                <p> </p>
                <p>10. PC is always more powerful than AdjustNVar and Meff when we assume that all outcomes have a true effect. However, PC is not more powerful when we assume that only a subset of outcomes have true effects. For example, if I tweak the above simulation so that only three outcomes out of six have true effects of 0.3, this changes the power to 15% for PC and AdjustNVar, and 28% for Meff. This illustrates why one narrow simulation is insufficient for making general conclusions about the tradeoffs in performance between the different methods.</p>
                <p> </p>
                <p> 
                    <italic>Response: agreed.</italic>
                </p>
                <p> </p>
                <p>11. In my simulations, Meff consistently has higher statistical power than AdjustNVar, which is a function of the fact that Meff always has a 5% Type I error rate whereas that of AdjustNVar is variable and sometimes lower than 5%. I don&#x2019;t view it as a strength that AdjustNVar results in arbitrarily lower Type I error rates. It is better for the investigator to be able to precisely control the tradeoff between Type I and Type II error. Meff allows this whereas AdjustNVar does not.</p>
                <p> </p>
                <p> 
                    <italic>Response: fair comment</italic>
                </p>
                <p> </p>
                <p>12. Figures 1-6: I found these graphs hard to read as they don&#x2019;t have a clear take-home point. I would suggest removing the single outcome comparisons altogether and then reformatting so that the graphs directly compare AdjustNVar to Meff and PC (three lines). In one graph, hold effect size, sample size, and N outcomes constant, and then show the power of the three methods as a function of increasing correlation. In another graph, hold correlation, sample size, and effect size constant, and show the power of the three methods as a function of N outcomes. And so on. All methods aim to control the familywise error rate at 5%, so this should not be a variable. The fact that AdjustNVar sometimes results in a lower Type I error rate is incidental &#x2014; it arises as a quirk of the method not as the intent of the researcher.</p>
                <p> </p>
                <p> 
                    <italic>Response: Agreed. Graphs reformatted as suggested and I hope are now much easier to interpret.</italic>
                </p>
                <p> </p>
                <p>13. Where AdjustNVar may have an advantage over a method like Meff is for re-evaluating flawed published studies where researchers used multiple outcomes but did not account in any way for multiple testing. An AdjustNVar table such as Table 2 would give readers a quick way to reevaluate such studies without the need for any calculations or access to raw data. I would focus the paper more on this application.</p>
                <p> </p>
                <p> 
                    <italic>Response: Agreed. The paper has been revised to make this point.</italic>
                </p>
            </body>
        </sub-article>
    </sub-article>
</article>
