Accumulation Bias in meta-analysis: the need to consider time in error control

Studies accumulate over time and meta-analyses are mainly retrospective. These two characteristics introduce dependencies between the analysis time, at which a series of studies is up for meta-analysis, and results within the series. Dependencies introduce bias — Accumulation Bias — and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results. Here, we investigate various ways in which time influences error control in meta-analysis testing. We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies, including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds — each with their own timing — or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge. Likelihood ratios can be interpreted as betting profits, earned in previous studies and invested in new ones, while the meta-analyst is allowed to cash out at any time and advise against future studies.


Introduction
Meta-analysis refers to the statistical synthesis of results from a series of studies. [...] the synthesis will be meaningful only if the studies have been collected systematically. [...] The formulas used in meta-analysis are extensions of formulas used in primary studies, and are used to address similar kinds of questions to those addressed in primary studies.
-Borenstein, Hedges, Higgins & Rothstein (2009, pp. xxi-xxiii)

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
-Fisher (1938, p. 18)

These two quotes conflict. Most meta-analyses are retrospective and consider the number of studies available -after the literature has been searched systematically -as a given for the statistical analysis. P-value based statistical tests, however, are intended to be prospective and require the sample size -or the stopping rule that produces the sample -to be set specifically for the planned statistical analysis. The second quote, by the p-value's popularizer Ronald Fisher, is about primary studies. But this prospective rationale influences meta-analysis as well, because it also involves the size of the study series: p-value tests assume that the number of studies -so the timing of the meta-analysis -is predetermined or at least unrelated to the study results. So by using p-value methods, conventional meta-analysis implicitly assumes that promising initial results are just as likely to develop into (large) series of studies as their disappointing counterparts. Conclusive studies should just as likely trigger meta-analyses as inconclusive ones. And so the use of p-value tests suggests that results of earlier studies should be unknown when planning new studies as well as when planning meta-analyses.
Such assumptions are unrealistic and actively argued against by the Evidence-Based Research Network (Lund et al., 2016), part of the movement to reduce research waste (Chalmers and Glasziou, 2009; Chalmers et al., 2014). But ignoring these assumptions invalidates conventional p-value tests and inflates type-I errors. P-values are based on tail areas of a test statistic's sampling distribution under the null hypothesis, and thus require this distribution to be fully specified. In this paper we show that the standard normal Z-distribution generally assumed (e.g. Borenstein et al. (2009)) is not an appropriate sampling distribution. Moreover, we believe that no sampling distribution can be specified that fully represents the variety of processes in accumulating scientific knowledge and all decisions made along the way. We need a more flexible approach to testing that controls errors regardless of the process that spurs the meta-analysis.
When dependencies arise between study series size or meta-analysis timing and results within the series, bias is introduced in the estimates. This bias is inherent to accumulating data, which is why we gave it the name Accumulation Bias. Various forms of Accumulation Bias have been characterized before, in very general terms as "bias introduced by the order in which studies are conducted" (Whitehead, 2002, p. 197) and more specifically, such as bias caused by the dependence of follow-up studies on previous studies' significance and the dependence of meta-analysis timing on previous study results (Ellis and Stewart, 2009). Also, more elaborate relations were studied between the existence of follow-up studies, study design and meta-analysis estimates (Kulinskaya et al., 2016). Yet no approach to confront these biases has been proposed. In this paper we define Accumulation Bias to encompass processes that not only affect parameter estimates but also the shape of the sampling distribution, which is why merely approximating and correcting for the bias does not achieve valid p-value tests. We illustrate this by an example in Section 3, right after we give a general introduction to Accumulation Bias in Section 2 with its relation to publication bias (Section 2.1) and an informal characterization of the direction of the bias (Section 2.2). By presenting its diversity, we argue throughout the paper that any efficient scientific process will introduce some form of Accumulation Bias and that the exact process can never be fully known. We collect the various forms of Accumulation Bias into one framework (Section 4) and show that all are related to the time aspect in meta-analysis. The framework incorporates dependencies mentioned by Whitehead (2002), Ellis and Stewart (2009) and Kulinskaya et al. (2016) as well as the effect of multiple testing over time in living systematic reviews (Simmonds et al., 2017).
We conclude that some version of these biases will also be introduced by Evidence-Based Research. Our framework specifies analysis time probabilities -with behavior familiar from survival analysis -and distinguishes two approaches to error control: conditional on time (Section 5.1) and surviving over time (Section 5.2). We show that general meta-analyses take the former approach, while existing methods for living systematic reviews take the latter. However, neither of the two is able to analyze study series affected by partially unknown processes of Accumulation Bias (Section 5.3). After an intermezzo in Section 6 on evidence that such processes are indeed already at play, we introduce a general form of a test statistic that is able to withstand any Accumulation Bias process: the likelihood ratio. We specify bounds on error probabilities that are valid despite the existing bias, for error control conditional on time (Section 7.1) as well as surviving over time (Section 7.2). The reader is left to choose between the two; the consequences of either preference are specified in Section 8. We try to give intuition on why both are still possible in their respective Sections 7.1 and 7.2, but also give some extra intuition on the magic of likelihood ratios in Section 9: likelihood ratios have an interpretation as betting profit that can be reinvested in future studies. At the same time, the meta-analyst is allowed to cash out at any time and advise against future studies. Hence, the likelihood ratio relates the statistics of Accumulation Bias to the accumulating nature of scientific knowledge, which is critical in reducing research waste.

Accumulation Bias
Any meta-analyst carries out a meta-analysis under the assumption that synthesizing previous studies will add to what is already known from existing studies. So meta-analyses are mainly performed on series of studies of meaningful series size. What is considered meaningful varies considerably: 16 and 15 studies per meta-analysis were reported to be the median numbers in Medline meta-analyses from 2004 and 2014 (Moher et al., 2007a; Page et al., 2016), while 3 studies per meta-analysis were reported in Cochrane meta-analyses from 2008 (Cochrane Database of Systematic Reviews; Davey et al., 2011). Since meta-analyses are performed on research hypotheses that have spurred a certain study series size, they always report estimates that are conditioned on the availability of such a series. The crucial point is that not all pilot studies or small study series will reach a meaningful size, and that doing so might depend on results in the series. Apart from the dependent size of the study series, the exact timing of a meta-analysis can also depend on the available results. The completion of a highly powered or otherwise conclusive study, for example, might be considered to finalize the series and trigger a meta-analysis. So meta-analyses also report estimates conditioned on the consideration that a systematic synthesis will be informative. Both dependencies -series size and meta-analysis timing -introduce bias: Accumulation Bias.

Accumulation Bias vs. publication bias
Publication bias refers to the practice that studies with nonsignificant, or, more generally, unsatisfactory results have a smaller probability of being published than studies with significant, satisfactory results. So unsatisfactory studies are performed, but do not reach the meta-analyst because they are stashed away in a file drawer (Rosenthal, 1979). Accumulation Bias, on the other hand, refers to some studies or meta-analyses not being performed at all, as a result of previous findings in a series of studies. In a file drawer-free world, Accumulation Bias would still exist. But Accumulation Bias is a manageable problem because it does not operate at the individual study level. Conditional on the fact that a second study is performed, the second study is an unbiased sample. Conditional on the fact that a third study is performed, for whatever reason, the third study is an unbiased sample. So bias is introduced at the level of the series, not at the study level. This is different for publication bias, where, conditional on being published, the studies available are not an unbiased sample. We exploit the difference in this paper by considering time in error control.
Of course, Accumulation Bias and publication bias are not alone in their effects on meta-analysis reporting. All sorts of significance chasing biases -selective-outcome bias, selective analysis reporting bias and fabrication bias -might be present in the study series up for meta-analysis, and can lead to "wrong and misleading answers" (Ioannidis, 2010, p. 169). But for a world in which these biases are overcome, we also need tests that reflect how scientific knowledge accumulates.

Accumulation Bias' direction
Accumulation Bias in estimates is mainly bias in the satisfactory direction, which means that the effect under study is overestimated. This is the case for bias caused by the size of the study series, when (overly) optimistic initial estimates (either in individual studies or in intermediate meta-analyses) give rise to more studies, while disappointing results terminate a series of studies. This is also the case when the timing of the meta-analysis is based on an (overly) optimistic last study estimate, or when an (overly) optimistic meta-analysis synthesis is considered the final one. We focus on this satisfactory direction of Accumulation Bias and will only briefly discuss other possibilities in Sections 5.3 and 6.1. We introduce the wide variety of possible dependencies in an Accumulation Bias Framework in Section 4, which has a generality that also includes Accumulation Bias without a clear direction. But we first present Accumulation Bias' effects on error control by an example.

A Gold Rush example: new studies after finding significant results
We study the effect of Accumulation Bias by a simple example. Its simplicity allows us to calculate the exact amount of bias in the test statistic and investigate the additional effect on the sampling distribution. The example given in this section is an extension of the toy example introduced by Ellis and Stewart (2009). We denote this example by Gold Rush because it describes how new studies go looking for more results after finding initial statistical significance. In the current culture of scientific practice, statistical significance can be seen as the currency of scientific success. After all, significant results achieve the future possibility to pay off in publications, grants and tenure positions. When a gold rush for statistical significance presents itself in a series of studies, dependencies arise between the size of the series and the results within: Accumulation Bias. We specify this mechanism in detail in Sections 3.2 and 3.3, after simplifying our meta-analysis setting to common/fixed-effect meta-analysis in Section 3.1. We present the resulting bias in the test estimates in Section 3.4 and its additional effects on the sampling distribution and testing in Sections 3.5 and 3.6. In Section 3.7 we conclude by pointing out the very mild condition needed for some form of Gold Rush Accumulation Bias to occur.

Common/fixed-effect meta-analysis
This paper discusses meta-analysis in its simplest form, which is common-effect meta-analysis, also known as fixed-effect meta-analysis.
This restriction does not mean that more complex forms of meta-analysis, such as random-effects meta-analysis and meta-regression, do not suffer from the problems mentioned in this paper. The reason for simplification is to reduce the complexity in quantifying the problem, part of showing that quantification is not enough. In a future paper we will study the effects of heterogeneity on testing in more detail. For an example of Accumulation Bias in random-effects estimates we refer to Kulinskaya et al. (2016). Common-effect meta-analysis derives a combined Z-score from the summary statistics of the available studies. This combined Z-score is used as a test statistic in two-sided meta-analysis testing by comparing it to the tails of a standard normal distribution. This is equivalent to assessing whether its absolute value is more than z_{α/2} standard deviations away from zero (larger than 1.960 for α = 0.05). We simplify the setting by assuming studies with equal standard deviations to obtain an easy-to-handle expression for the combined Z-score of t available studies. We denote this meta-analysis Z-score by Z^{(t)} and derive it as the weighted average over the study Z-scores Z_1, . . . , Z_t, shown in its general form in Eq. (3.1a) and in Eq. (3.1b) under the assumption of equal study sizes (n_1 = · · · = n_t = n):

Z^{(t)} = \frac{\sum_{i=1}^{t} \sqrt{n_i} Z_i}{\sqrt{\sum_{i=1}^{t} n_i}},  (3.1a)

Z^{(t)} = \frac{1}{\sqrt{t}} \sum_{i=1}^{t} Z_i.  (3.1b)
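As a minimal sketch (not the paper's own code), the combined Z-score of Eq. (3.1a) and its equal-study-size special case Eq. (3.1b) can be computed as follows; the function names are ours:

```python
import math

def combined_z(z_scores, n_sizes):
    """Common-effect meta-analysis Z-score Z^(t) for studies with equal
    standard deviations (Eq. (3.1a)): each study Z-score is weighted by
    the square root of its sample size."""
    num = sum(math.sqrt(n) * z for z, n in zip(z_scores, n_sizes))
    return num / math.sqrt(sum(n_sizes))

def combined_z_equal(z_scores):
    """Special case of equal study sizes (Eq. (3.1b)): the sum of the
    study Z-scores divided by the square root of the number of studies."""
    return sum(z_scores) / math.sqrt(len(z_scores))

# Two borderline-significant studies of equal size combine into a
# clearly significant Z^(2), since the combined score grows with sqrt(t).
print(round(combined_z_equal([1.96, 1.96]), 2))  # → 2.77
```

With equal sample sizes, both expressions coincide, which is what makes Eq. (3.1b) a convenient simplification.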

Gold Rush new study probabilities
In our Gold Rush example, we assume the following dependency within a series of studies: each study in a series has a larger probability to be replicated -thereby expanding the series of studies -if the study shows a significant positive effect. So the existence of a new study is dependent on the significance and sign of the results of its predecessor.
T is the random variable that denotes the maximum size of a study series -the time at which the search stops. We enumerate time by the order of appearance in a study series, with t = 1 for the pilot study, t = 2 for the second study (so now we have a two-study series), etc. So we use t to denote the number of studies available for meta-analysis at any time point: our notion of time is not related to actual dates at which studies are performed. The maximum time T is usually unknown, since more studies might be performed in the future. T ≥ 2 means that the series has not halted after the first initial study, but that it is unknown how many replications will eventually be performed. In our extended Gold Rush example, we present the Accumulation Bias process by the probability that the maximum size is at least one study larger than the current size (T ≥ t + 1), and do so using six parameters. We denote these parameters by the new study probabilities, since they indicate the probability that a follow-up study is performed when the result of the current study is available:

P[T ≥ 2 | Z_1 = z_1] = ω^{(1)}_S = 1 if z_1 ≥ z_{α/2}, ω^{(1)}_{NS} = 0.1 if |z_1| < z_{α/2}, ω^{(1)}_X = 0 if z_1 ≤ -z_{α/2};
for t ≥ 2: P[T ≥ t + 1 | T ≥ t, Z_t = z_t] = ω_S = 1 if z_t ≥ z_{α/2}, ω_{NS} = 0.02 if |z_t| < z_{α/2}, ω_X = 0 if z_t ≤ -z_{α/2}.  (3.2)

We distinguish between the influence of the first (pilot) study (ω^{(1)}_S, ω^{(1)}_X and ω^{(1)}_{NS}) and the others (ω_S, ω_X and ω_{NS}), since pilot studies are carried out with future studies in mind, and therefore replications have a higher probability after the first than after other studies in the series, also in case the pilot study is not significant. We assume that no new study is performed when a significant negative result is obtained (ω^{(1)}_X = ω_X = 0) and that new studies are always performed after positive significant findings, the satisfactory result (ω^{(1)}_S = ω_S = 1). Nonsignificant results have a small, but not negligible, probability to spur new studies (ω^{(1)}_{NS} and ω_{NS} in Eq. (3.2)).
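The new study probabilities can be sketched in code as follows; the function names are ours, and the parameter values are the illustrative ones from Eq. (3.2):

```python
import random

# Illustrative parameter values from Eq. (3.2): a significant positive
# result always spurs a new study, a significant negative result never
# does, and a nonsignificant result rarely does (more often after a pilot).
Z_ALPHA = 1.96  # two-sided threshold z_{alpha/2} for alpha = 0.05
OMEGA = {"pilot": {"S": 1.0, "NS": 0.1, "X": 0.0},
         "later": {"S": 1.0, "NS": 0.02, "X": 0.0}}

def new_study_probability(z, is_pilot):
    """Probability that study result z spurs a successor study."""
    w = OMEGA["pilot" if is_pilot else "later"]
    if z >= Z_ALPHA:      # significant positive: the satisfactory result
        return w["S"]
    elif z <= -Z_ALPHA:   # significant negative
        return w["X"]
    return w["NS"]        # nonsignificant

def simulate_series_size(rng, max_t=50):
    """Draw one Gold Rush study series under the null hypothesis and
    return its final size T (capped at max_t for safety)."""
    t = 1
    while t < max_t:
        z = rng.gauss(0.0, 1.0)  # study Z-score under the null
        if rng.random() >= new_study_probability(z, is_pilot=(t == 1)):
            break
        t += 1
    return t
```

Under the null, a pilot study survives with probability (α/2) · ω^{(1)}_S + (1 − α) · ω^{(1)}_{NS} = 0.025 + 0.095 = 0.12, which a simulation with this sketch reproduces.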

Gold Rush new study probabilities' independence from data-generating hypothesis
In the following we use P_1 to express probabilities under the alternative hypothesis and P_0 to express probabilities under the null hypothesis. Our new study probabilities in Eq. (3.2) were given without reference to any of these hypotheses, to make explicit that they depend solely on the data (or summary statistic Z_t) and not on the hypothesis that generated the data. So P in these definitions can be read as P_1 as well as P_0. In the next sections we focus on Gold Rush Accumulation Bias under the null hypothesis and its effect on type-I error control. The values in the rightmost column of Eq. (3.2) are introduced to obtain estimates for the Accumulation Bias in the test estimates. These values are not supposed to be realistic, but are chosen to demonstrate the effect of Accumulation Bias as clearly as possible. The extreme values 1 for ω^{(1)}_S and ω_S given in Eq. (3.2) support the simulation of large study series under the null hypothesis. The small values for ω^{(1)}_{NS} and ω_{NS} are chosen such that the effect of significant findings on the sampling distribution is clearly visible (see Section 3.5 and Figure 1). For α = 0.05, ω^{(1)}_S = 1 implies that, in expectation under the null distribution, all of the 2.5% (α/2) positively significant pilot studies become a two-study series, while ω^{(1)}_{NS} = 0.1 indicates that, since an expected 95% (1 − α) of pilot studies is not significant under the null hypothesis, 9.5% (0.1 · 95%) become a two-study series. For study series beyond the pilot study and its replication, this setup entails that in all studies, except for the first and the last, the fraction of significant findings is more than half, since ω_{NS} = 0.02 implies that only 0.02 · 95% = 1.9% of nonsignificant studies grow into a larger study series: the expected fraction of significant studies in growing series under the null hypothesis converges to 2.5/(2.5 + 1.9) ≈ 0.57.
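The arithmetic in the last sentence can be checked directly; a small sketch with the values from Eq. (3.2):

```python
alpha = 0.05
omega_S, omega_NS = 1.0, 0.02   # new study probabilities for later studies

# Under the null, a study is positively significant with probability
# alpha/2 and nonsignificant with probability 1 - alpha; weighting by
# the new study probabilities gives the composition of growing series.
p_sig_and_grow = (alpha / 2) * omega_S        # 2.5% of null studies
p_nonsig_and_grow = (1 - alpha) * omega_NS    # 0.02 * 95% = 1.9%
frac_sig = p_sig_and_grow / (p_sig_and_grow + p_nonsig_and_grow)
print(round(frac_sig, 3))  # → 0.568
```

So among studies that warrant a successor, well over half are significant, even though every study was generated under the null hypothesis.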

Gold Rush Accumulation Bias' estimates under the null hypothesis
The new study probability parameters in Eq. (3.2) are much larger when results are positively significant than when they are not. As a result, study series that contain more significant studies have a larger probability to come into existence than those that contain fewer. While the expectation of a Z-score is 0 under the null hypothesis for each individual study (for all t: E_0[Z_t] = 0), the expectation of a study that is part of a series of studies is larger. This shift in expectation introduces the Accumulation Bias in the estimates. The main ingredient of the bias in the meta-analysis Z^{(t)}-score is the bias in the individual study Z_t-scores, conditional on being part of a series. This is already apparent for the pilot study, which we use as an example by expressing its expected value under the null hypothesis, given that it has a successor study: E_0[Z_1 | T ≥ 2]. This conditional expectation is a weighted average of two other expectations that are conditioned further based on the events that lead to a new study according to Eq. (3.2): the positive significant results, with expectation E_0[Z_1 | Z_1 ≥ z_{α/2}] (Z_1 sampled from the right tail of the null distribution), and the nonsignificant results, with expectation E_0[Z_1 | |Z_1| < z_{α/2}]. We discard negative significant results, since those were given 0 probability to produce replication studies in Eq. (3.2). The positive significant and nonsignificant results are weighted by the new study probabilities in Eq. (3.2) and the probabilities under the null distribution of sampling from either the tail (α/2) or the middle part (1 − α) of the standard normal distribution. A more detailed specification of these components can be found in Appendix A.2. If we assume a significance threshold of 5% we obtain:

For α = 0.05:
E_0[Z_1 | T ≥ 2] = \frac{(α/2) · ω^{(1)}_S · E_0[Z_1 | Z_1 ≥ z_{α/2}] + (1 − α) · ω^{(1)}_{NS} · E_0[Z_1 | |Z_1| < z_{α/2}]}{(α/2) · ω^{(1)}_S + (1 − α) · ω^{(1)}_{NS}} ≈ 0.487.  (3.3)

Here we use the fact that, for α = 0.05, E_0[Z_1 | |Z_1| < z_{α/2}] is the expectation of a symmetrically truncated standard normal distribution, which is 0. The value 0.487 is obtained by using the parameter values given in Eq. (3.2).
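The value 0.487 can be verified numerically; a sketch using only the standard normal density, with z_{α/2} ≈ 1.96 and the parameter values from Eq. (3.2):

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

alpha = 0.05
z_half = 1.959964                 # z_{alpha/2} for alpha = 0.05
omega1_S, omega1_NS = 1.0, 0.1    # pilot-study values from Eq. (3.2)

# For a standard normal, E_0[Z_1 · 1{Z_1 >= z_half}] equals phi(z_half),
# and the symmetrically truncated middle part contributes expectation 0.
numerator = omega1_S * phi(z_half) + omega1_NS * 0.0
denominator = (alpha / 2) * omega1_S + (1 - alpha) * omega1_NS
print(round(numerator / denominator, 3))  # → 0.487
```

So a pilot study that happens to have a successor is expected to overestimate a null effect by almost half a standard error.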
For studies in the series later than the pilot study, the expression follows analogously by taking, for all t ≥ 2, ω_S and ω_{NS} instead of ω^{(1)}_S and ω^{(1)}_{NS}. To determine the effect on the meta-analysis Z^{(t)}-score, we define the expectation under the null hypothesis E_0[Z^{(t)} | T ≥ t], conditioned on the availability of a series of size t. To specify this expectation, we use that the last study is always unbiased, since we do not know whether it will spur more studies. As shown in more detail in Appendix A.3, the expression follows from Eq. (3.1a) by separately treating the unbiased last study, with expectation 0, and the pilot study. If we assume a significance threshold of 5%, we obtain the general expression in Eq. (3.4a) and the expression in Eq. (3.4b) under the assumption of equal study sizes (n_1 = n_2 = · · · = n_t = n):

For α = 0.05, for all t ≥ 2:
E_0[Z^{(t)} | T ≥ t] = \frac{\sqrt{n_1} · E_0[Z_1 | T ≥ 2] + \sum_{i=2}^{t−1} \sqrt{n_i} · E_0[Z_i | T ≥ i + 1]}{\sqrt{\sum_{i=1}^{t} n_i}},  (3.4a)

E_0[Z^{(t)} | T ≥ t] = \frac{1}{\sqrt{t}} \left( E_0[Z_1 | T ≥ 2] + \sum_{i=2}^{t−1} E_0[Z_i | T ≥ i + 1] \right).  (3.4b)

Table 1 shows the Accumulation Bias in the estimates of E_0[Z^{(t)} | T ≥ t] as studies accumulate under the Gold Rush scenario, with equal study sizes and values for the new study probabilities given by Eq. (3.2). Figure 1 shows simulated Gold Rush sampling distributions for study series of size two and three in comparison to an individual study Z-distribution. Because the new study probabilities in Eq. (3.2) give Z_{t−1}-values below −z_{α/2} zero probability to warrant a successor study, values for the Z^{(t)}-statistic below −z_{α/2} will be scarce, and the larger t is, the larger this scarcity will be, since only the last study is able to provide such small Z-score estimates. The opposite is the case for values above z_{α/2}, which have probability 1 to warrant a new study. As a result, the distribution of the meta-analysis Z-score has negative skew (more mass on the right, more tail to the left). See the comparison to the normal distribution also plotted in Figure 1 for a three-study series. Skewness is not the only characteristic that distinguishes the resulting distribution from a standard normal.
The variance also deviates, since the meta-analysis distribution is a mixture distribution.

Gold Rush Accumulation Bias' sampling distribution under the null hypothesis
For a two-study meta-analysis Z^{(2)} we obtain a mixture of two conditional distributions: one conditioned on the first study being significant -sampled from the right tail of the distribution (with probability (α/2) · ω^{(1)}_S) -and one with the first study nonsignificant -sampled from the symmetrically truncated normal distribution (with probability (1 − α) · ω^{(1)}_{NS}). Because the combined distribution on Z^{(2)} is a mixture of the two scenarios, its variance is larger than the variance of either of the two components of the mixture, as we show in Appendix A.4. In Figure 1 we see that, with the parameter values from Eq. (3.2), the variances of Z^{(2)} and Z^{(3)} are even larger than that of Z_1, even though both Var[Z^{(2)} | |Z_1| < z_{α/2}] and Var[Z^{(2)} | Z_1 ≥ z_{α/2}] are smaller. Hence the sampling distribution under the null hypothesis of a meta-analysis Z-score deviates from a standard normal under Accumulation Bias due to a non-zero location (the bias), skewness and an inflated variance. All three inflate the probability of a type-I error in a standard normal test, as we study in the next section.
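A Monte Carlo sketch (our own illustration, not the paper's simulation code) of the two-study mixture shows both the bias and the inflated variance of Z^{(2)} conditional on T ≥ 2:

```python
import math, random

Z_ALPHA = 1.96
rng = random.Random(7)
z2_scores = []
for _ in range(400000):
    z1 = rng.gauss(0.0, 1.0)
    # Continuation probability after the pilot study (Eq. (3.2)):
    # always after a significant positive result, never after a
    # significant negative one, with probability 0.1 otherwise.
    p = 1.0 if z1 >= Z_ALPHA else (0.0 if z1 <= -Z_ALPHA else 0.1)
    if rng.random() < p:
        z2 = rng.gauss(0.0, 1.0)
        z2_scores.append((z1 + z2) / math.sqrt(2))  # two-study Z^(2)

mean = sum(z2_scores) / len(z2_scores)
var = sum((z - mean) ** 2 for z in z2_scores) / len(z2_scores)
print(mean > 0.0, var > 1.0)  # biased upward and over-dispersed → True True
```

The conditional mean is close to 0.487/√2 ≈ 0.34, and the variance clearly exceeds the unit variance assumed by the standard normal test.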

Gold Rush Accumulation Bias' influence on p-value tests
Let us now establish the effect of our Gold Rush Accumulation Bias on meta-analysis testing when using common/fixed-effect Z-tests. Consider the conditional type-I error rate: the expected rate of type-I errors in a two-sided common/fixed-effect Z-test on studies 1 up to t, conditional on the fact that at least t studies were performed. We obtain the type-I error rate for this test by simulating the Gold Rush scenario, for which the results are shown in the right-hand column of Table 2, assuming α = 0.05. If only bias were at play, the sampling distribution under the null hypothesis would be a shifted normal distribution. Eq. (3.5) expresses the expected type-I error rate for this bias-only scenario, with Φ(·) the cumulative standard normal distribution:

1 − Φ( z_{α/2} − E_0[Z^{(t)} | T ≥ t] ) + Φ( −z_{α/2} − E_0[Z^{(t)} | T ≥ t] ).  (3.5)

The actual inflation in the type-I error rate is larger than shown by this bias-only scenario, as illustrated in Table 2. The difference between these two type-I error rates for a series of three studies is depicted in Figure 1 by the area under the red histogram for Z^{(3)} and the red normal density curve centered at E_0[Z^{(3)} | T ≥ 3], below −z_{α/2} and above z_{α/2}. We conclude that the effect of Accumulation Bias on testing cannot be corrected by only an approximation of the bias.
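The inflated conditional type-I error rate, and the fact that it exceeds the bias-only rate of Eq. (3.5), can be illustrated with a small simulation for a two-study series (our own sketch, with the parameter values from Eq. (3.2)):

```python
import math, random

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

Z_ALPHA = 1.96
rng = random.Random(3)
rejections = trials = 0
for _ in range(400000):
    z1 = rng.gauss(0.0, 1.0)
    # Gold Rush continuation probability after the pilot study (Eq. (3.2))
    p = 1.0 if z1 >= Z_ALPHA else (0.0 if z1 <= -Z_ALPHA else 0.1)
    if rng.random() < p:
        z2 = rng.gauss(0.0, 1.0)
        trials += 1
        rejections += abs((z1 + z2) / math.sqrt(2)) > Z_ALPHA
rate = rejections / trials  # conditional type-I error rate, t = 2

# Bias-only rate (Eq. (3.5)) with the two-study bias
# E_0[Z^(2) | T >= 2] = 0.487 / sqrt(2) under the values of Eq. (3.2)
mu = 0.487 / math.sqrt(2)
bias_only = 1 - Phi(Z_ALPHA - mu) + Phi(-Z_ALPHA - mu)
print(rate > bias_only > 0.05)  # → True
```

Both rates exceed the nominal 5%, but the simulated rate exceeds the bias-only rate as well: skewness and variance inflation add to the error inflation that the shift alone would produce.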

Gold Rush Accumulation Bias: When does it occur?
We indicated in Section 3.3 that we chose extreme values for the parameters ω^{(1)}_S, ω^{(1)}_{NS}, ω_S, ω_X and ω_{NS} such that Figure 1 would clearly show the bias and distributional change that occurs. However, for any combination of values for which there is a t at which significant results have a larger probability to spur a new study than nonsignificant ones, Accumulation Bias occurs for series larger than size t, and p-value tests that assume a standard normal distribution are invalid.

The Accumulation Bias Framework
In general, Accumulation Bias in meta-analysis makes the sampling distribution of the meta-analysis Z-score difficult to characterize due to the data dependent size and timing of a study series up for meta-analysis. In this section, we specify both processes in a framework of analysis time probabilities. We use the term analysis time because time in meta-analysis is partly based on a survival time. A survival time indicates that a subject lives longer than time t (and might still become much older), just as an analysis time indicates that a series up for meta-analysis has at least size t (but might still grow much larger). As such, analysis time probabilities, just as the probabilities in a survival function, do not add up to 1.
Our Accumulation Bias Framework uses the following notation for its three key components: S(t − 1), 𝒜(t) and A(t). Firstly, S(t − 1) can be understood as the survival function in the variable time t that indicates the size of the expanding study series. S(t − 1) denotes the probability that the available number of studies is at least t (P[T ≥ t]), so the study series has survived past the previous study at t − 1. Secondly, 𝒜(t) indicates the event that a meta-analysis is performed on a study series of size exactly t. Lastly, A(t) combines the probability that a study series of a certain size is available (S(t − 1)) with the decision 𝒜(t) to perform the analysis on exactly t studies. So the analysis time probability A(t) represents the general probability that a meta-analysis of size t -so at time t -is performed, and is the key to describing the influence of various forms of Accumulation Bias on testing.

Analysis time probabilities
Let P[𝒜(t) | T ≥ t, z_1, . . . , z_t] denote the probability that a meta-analysis is performed on the first t studies. Just as the Gold Rush's new study probabilities from Eq. (3.2), this probability can depend on the results in the study series z_1, . . . , z_t. The event 𝒜(t) only occurs if a series of size t is available, so we need to condition on the survival past t − 1, which can also depend on previous results. When combined, we obtain the following definition¹ of analysis time probabilities A(t):

A(t | z_1, . . . , z_t) = P[𝒜(t) | T ≥ t, z_1, . . . , z_t] · P[T ≥ t | z_1, . . . , z_{t−1}].  (4.1)

Eq. (4.1) formalizes the idea of analysis time probabilities "depending on previous results" in terms of the individual study Z-scores z_1, . . . , z_t. This is compatible with the Z-test approach in meta-analysis and with the dependencies and the Gold Rush's new study probabilities that are explicitly expressed in terms of Z-scores. More generally, however, in Sections 4.3 and 4.4 we extend the definition and allow analysis time probabilities to also depend on the data on the original scale and on external parameters.

Analysis time probabilities' independence from the data-generating hypothesis
Just as for the Gold Rush's new study probabilities discussed in Sections 3.2 and 3.3, the analysis time probabilities A(t) only depend on the data, and are independent from the hypothesis that generated the data. So again, P in these definitions can be read as P_1 as well as P_0. Our definition of A(t) relates to the definition of a Stopping Rule by Berger and Berry (1988, pp. 33-34), where they use x^{(m)} to denote a vector of m observations:

τ_0 is the probability of stopping the experiment with no observations (e.g., if it is determined that the experiment is too expensive); τ_1(x^{(1)}) is the probability of stopping after observing the datum x^{(1)} = x_1, conditional on having taken the first observation; τ_2(x^{(2)}) is the probability of stopping after observing x^{(2)} = (x_1, x_2), conditional on having taken the first and second observations; etc.

¹ Note that A(t | z_1, . . . , z_t) is defined as a product of two (conditional) probabilities. Calling this product itself a "probability", as we do, can be justified as follows: we currently think of the decision whether to continue studies at time t, i.e. whether T ≥ t, to be made before the t-th study is performed. But we may also think of the t-th study result z_t as being generated irrespective of whether T ≥ t, but remaining unobserved for ever if T < t. If the decision whether T ≥ t is made independently of the value z_t, i.e. we add the constraint P[T ≥ t | z_1, . . . , z_{t−1}] = P[T ≥ t | z_1, . . . , z_t], then the resulting model is mathematically equivalent to ours (in the sense that we obtain exactly the same expressions for S(t), A(t | z_1, . . . , z_t), all error probabilities etc.), but it does allow us to write, by Eq. (4.1), that A(t | z_1, . . . , z_t) = P[𝒜(t), T ≥ t | z_1, . . . , z_t].
To take the analogy with survival analysis further, we consider the sequence τ defined above by Berger and Berry (1988) to be a sequence of hazards. Instead of using their notation τ, we denote the Stopping Rule by λ = (λ(0), λ(1), . . . ) to emphasize its behavior as a sequence of hazard functions and to distinguish time t from the probability λ(t) of stopping at that time given that you were able to reach it. The hazard of stopping at time t can depend on previous results and is defined as follows:

λ(t | z_1, . . . , z_t) = P[T = t | T ≥ t, z_1, . . . , z_t].  (4.2)

In this paper we are only interested in cases in which a first study is available, so λ(0) = 0 (also stated as P[T ≥ 1] = 1 in Appendix A.2). The survival S(t − 1), the probability of obtaining a series of size at least t (so larger than t − 1), follows from the hazards by considering that surviving past time t − 1 means that the series has not stopped at studies up to and including t − 1. So for t ≥ 1:

S(t − 1 | z_1, . . . , z_{t−1}) = P[T ≥ t | z_1, . . . , z_{t−1}] = \prod_{i=0}^{t−1} (1 − λ(i | z_1, . . . , z_i)).  (4.3)

In many examples, the hazard of stopping at time t, λ(t), will depend only on the result z_t just obtained. In that case λ(t | z_1, . . . , z_t) reduces to λ(t | z_t) in Eq. (4.3) above. But in general λ(t) might also depend on some synthesis of all z_i so far. We show some of the variety of forms that λ(t), S(t) and A(t) can take in our Accumulation Bias Framework in the following sections.
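The survival of a study series follows from the stopping hazards just as in standard survival analysis; a minimal sketch (the hazard values are arbitrary illustrations, and for simplicity the hazards here are fixed numbers rather than functions of the z-scores):

```python
def survival(hazards, t):
    """S(t - 1) = P[T >= t]: the probability that the series survives
    past time t - 1, i.e. has not stopped at any of the times
    0, 1, ..., t - 1, given stopping hazards lambda(0), lambda(1), ..."""
    s = 1.0
    for i in range(t):  # i = 0, 1, ..., t - 1
        s *= 1.0 - hazards[i]
    return s

# lambda(0) = 0: a first study is always available.
hazards = [0.0, 0.9, 0.5, 0.5]
print(round(survival(hazards, 2), 3))  # P[T >= 2] = (1 - 0)(1 - 0.9) → 0.1
```

Like a survival function, these probabilities only decrease in t and do not add up to 1 over analysis times.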

Accumulation Bias caused by dependent study series size
Our Gold Rush example describes an instance of Accumulation Bias that is caused by how the study series size comes about. This is expressed by the S(t) component of the analysis time probability A(t). We represent our Gold Rush scenario in terms of our Accumulation Bias Framework in the next section, followed by variations from the literature that we were able to express in a similar manner.

Gold Rush: dependence on significant study results
The Gold Rush scenario operates in a general meta-analysis setting and assumes that there is a single random or prespecified time t at which a study series is up for meta-analysis. This is the approach taken by meta-analyses that are not explicitly part of a living systematic review. In the Gold Rush example the dependency arises in the study series because a t-study series has a larger probability to come into existence when individual study results are significant, and you need a t-study series to perform a t-study meta-analysis. This dependency was characterized by the new study probabilities ω^(1)_NS, ω_S and ω_NS from Eq. (3.2). The value of S(t), and therefore A(t), can be expressed in terms of these new study probabilities by considering whether z_1, . . . , z_{t−1} are larger than z_{α/2} (which is 1.960 for α = 0.05). Since a meta-analysis is performed only once at a randomly chosen time t, we have P[M(t)] = 1 for that time t and P[M(t)] = 0 otherwise. So for the one meta-analysis we obtain:

For t such that P[M(t)] = 1:  A(t | z_1, . . . , z_t) = S(t − 1 | z_1, . . . , z_{t−1}) = ∏_{i=1}^{t−1} (1 − λ(i | z_i)),

with λ(0) = 0 and, for all i ≥ 1, λ(i) defined as follows:

1 − λ(1 | z_1) = ω_S · 1{z_1 ≥ z_{α/2}} + ω^(1)_NS · 1{z_1 < z_{α/2}},
1 − λ(i | z_i) = ω_S · 1{z_i ≥ z_{α/2}} + ω_NS · 1{z_i < z_{α/2}}  for i ≥ 2.   (4.5)

Therefore (leaving out λ(0) and taking the product from i = 1 to t − 1), we obtain the following expressions for the Gold Rush analysis time probabilities and their expectation under the null distribution:

A(t | z_1, . . . , z_t) = (ω_S · 1{z_1 ≥ z_{α/2}} + ω^(1)_NS · 1{z_1 < z_{α/2}}) · ∏_{i=2}^{t−1} (ω_S · 1{z_i ≥ z_{α/2}} + ω_NS · 1{z_i < z_{α/2}}),

A_0(t) = E_0[A(t | Z_1, . . . , Z_t)] = (α/2 · ω_S + (1 − α/2) · ω^(1)_NS) · (α/2 · ω_S + (1 − α/2) · ω_NS)^{t−2}  for t ≥ 2, with A_0(1) = 1.   (4.6)
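How this process biases surviving series can be seen in a small Monte Carlo sketch of a Gold Rush-style process under the null; the ω values and the significance rule below are our own assumptions, chosen for illustration only.

```python
# Monte Carlo sketch (our own; omega values assumed) of a Gold Rush-style
# process under H0: a significant positive study is always followed up
# (omega_s = 1.0), a nonsignificant one only rarely.
import random

def simulate_series(rng, omega_s=1.0, omega_ns1=0.1, omega_ns=0.1, z_crit=1.96):
    zs = [rng.gauss(0.0, 1.0)]                 # pilot study Z-score under H0
    while True:
        sig = zs[-1] >= z_crit                 # significant positive result
        cont = omega_s if sig else (omega_ns1 if len(zs) == 1 else omega_ns)
        if rng.random() >= cont:               # hazard strikes: no new study
            return zs
        zs.append(rng.gauss(0.0, 1.0))

rng = random.Random(1)
runs = [simulate_series(rng) for _ in range(20000)]
survivors = [r for r in runs if len(r) >= 2]   # series that got a second study
frac_sig_pilot = sum(r[0] >= 1.96 for r in survivors) / len(survivors)
print(round(frac_sig_pilot, 2))  # far above the unconditional 2.5%
```

Among series that survive to a second study, significant pilots are heavily overrepresented, exactly the dependency that S(t) encodes.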

Kulinskaya et al. (2016): dependence on meta-analysis estimates
Kulinskaya et al. (2016) report biases that result from dependencies between a current meta-analysis estimate and the decision to perform a new study. Since their focus is on bias, they do not discuss issues of multiple testing over time, which would arise if their cumulative meta-analysis estimates were tested. In this section we assume that the timing of the meta-analysis test is independent from the estimates that determined the size of the series, as if a test were done by a second, unknowing meta-analyst. This scenario is hinted at by Kulinskaya et al. (2016, p. 296) in the statement "When a practitioner or a meta-analyst finds several trials in the literature, a particular decision-making scenario may have already taken place." We postpone the discussion of multiple testing to Section 4.3.4. In this estimation setting, the decision to perform new studies is determined not by the meta-analysis Z-scores Z^(t−1), but by the meta-analysis estimates on the original scale M^(t−1) (notation adopted from Borenstein et al. (2009), see Appendix A.1), in relation to a minimally clinically relevant effect ∆_H1. A minimally clinically relevant effect is the effect that should be used to power a trial (in the alternative distribution H1), and therefore the effect that the researchers of the study do not want to miss. Kulinskaya et al. (2016) consider three models for the study series accumulation process: the power-law model, the extreme-value model and the probit model. The models relate the probability of a new study to the cumulative meta-analysis estimate of the study series so far and are inspired by models for publication bias. Although all three models can be recast in our framework, we demonstrate this only for the power-law model, which uses one extra parameter τ to relate the previous meta-analysis estimate M^(t−1) to S(t).
Just as in the Gold Rush scenario, we must assume that a meta-analysis test is performed only once at a randomly chosen time t. So P[M(t)] = 1 only at that time t, and P[M(t)] = 0 otherwise. We obtain the following expression for the Kulinskaya et al. (2016) power-law model:

For t such that P[M(t)] = 1:  A(t | z_1, . . . , z_t) = S(t − 1 | z_1, . . . , z_{t−1}) = ∏_{i=1}^{t−1} (1 − λ(i)),

with λ(0) = λ(1) = 0, and for all i ≥ 2, λ(i) defined as follows:

λ(i | M^(i−1)) = 1 − (M^(i−1) / ∆_H1)^τ  for 0 < M^(i−1) < ∆_H1,

and λ(i | M^(i−1)) = 1 (so 1 − λ(i) = 0) otherwise. According to this model, no further studies are performed as soon as an estimate as large as ∆_H1 is found. For estimates smaller than ∆_H1, the closer the estimate is to ∆_H1, the larger the probability of a subsequent study. Just as in the Gold Rush example, this model introduces bias as well as a skewed sampling distribution of the data under the null hypothesis, since initial studies with large estimates have a larger probability to end up in study series of considerable size than initial studies with small estimates do. When the initial study gives a large overestimation of the effect, this overestimation stays present in the subsequent meta-analysis estimates and keeps influencing the probability of subsequent studies. This model therefore shows the effect of early studies in the series even more clearly than the Gold Rush example does. The Accumulation Bias does have a cap, however, since estimates larger than ∆_H1 do not introduce new replication studies.
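A minimal sketch of the power-law continuation probability, with names and parameter values of our own choosing:

```python
# Illustrative sketch (our own; names assumed) of the power-law continuation
# model: a new study follows with probability (M / delta_h1) ** tau when the
# current meta-analysis estimate M lies in (0, delta_h1), and never otherwise.
def continuation_prob(m_est, delta_h1, tau):
    if 0.0 < m_est < delta_h1:
        return (m_est / delta_h1) ** tau
    return 0.0  # estimates <= 0 or >= delta_h1 end the series (hazard 1)

# the closer the estimate is to delta_h1, the more likely a new study
probs = [continuation_prob(m, delta_h1=1.0, tau=2.0) for m in (0.2, 0.5, 0.9, 1.2)]
print(probs)  # increasing on (0, delta_h1), zero outside
```

The parameter tau tunes how steeply the continuation probability rises toward the minimally clinically relevant effect.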

Whitehead (2002): dependence on early study results
Bias may also be introduced by the order in which studies are conducted. For example, large-scale clinical trials for a new treatment are often undertaken following promising results from small trials. [...] given that a meta-analysis is being undertaken, larger estimates of treatment difference are more likely from the small early studies than from the later larger studies.
- Whitehead (2002, p. 197)

Whitehead (2002) does not give sufficient details to specify this dependency explicitly, but we are confident that it will fit in our Accumulation Bias Framework. Two ways to approach this Accumulation Bias are given in Whitehead (2002). The first is to exclude early studies from the meta-analysis, either in the main analysis or in a sensitivity analysis. The second is to ignore the problem, since the small studies will have little effect on the overall estimate. In Section 7 we show that any small initial study dependency that can be expressed in terms of A(t) can be dealt with by tests using likelihood ratios.

Living Systematic Reviews: dependence on significant meta-analyses + multiple testing
A living systematic review (LSR) should keep the review current as new research evidence emerges. Any meta-analyses included in the review will also need updating as new material is identified. If the aim of the review is solely to present the best current evidence, standard meta-analysis may be sufficient, provided reviewers are aware that results may change at later updates. If the review is used in a decision-making context, more caution may be needed. When using standard meta-analysis methods, the chance of incorrectly concluding that any updated meta-analysis is statistically significant when there is no effect (the type I error) increases rapidly as more updates are performed.

- Simmonds, Salanti, McKenzie & Elliott (2017, p. 39)

In living systematic reviews, the aim is to have a meta-analysis available to present the current evidence, thus synthesizing the t studies available at a certain time. The current meta-analysis estimate might be used to decide whether further studies should be performed. In that case S(t − 1), the probability that a study series of size t is available (so that a study series has expanded beyond series size t − 1), depends on the meta-analysis estimate Z^(t−1) at the previous study's meta-analysis. Because the review is continuously updated, P[M(t)] is always 1, and living systematic reviews can be described by the following analysis time probability A(t):

A(t | z_1, . . . , z_t) = S(t − 1 | z_1, . . . , z_{t−1}).   (4.9)

The quote above warns against decisions based on the continuously updated meta-analysis using a fixed threshold z_{α/2}. Living systematic reviews experience multiple testing problems of a kind that is familiar from statistical monitoring of individual clinical trials (Proschan et al., 2006). If the study series is stopped as soon as a significance threshold is reached, and the obtained meta-analysis is considered the final one, then this final meta-analysis test has an increased chance of a type-I error.
So the warning is not to use the following simple stopping rule:

λ(t | Z^(t)) = 1{|Z^(t)| ≥ z_{α/2}}.   (4.10)

Various corrections to significance thresholds have been proposed that relate intermediate looks to a maximum sample size or information size. These corrected thresholds depend on α and the fraction of the sample size or information size available at time t. Examples of such methods are Trial Sequential Analysis (Brok et al., 2008; Thorlund et al., 2008; Wetterslev et al., 2008) and Sequential meta-analysis (Whitehead, 2002, Ch. 12; Whitehead, 1997; Higgins et al., 2011). For an overview, see Simmonds et al. (2017). In general, Eq. (4.9) and (4.10) show that any dependency between "the best current evidence" and the accumulation of future studies is part of our Accumulation Bias Framework. We discuss the approach to error control taken by the corrected thresholds in Section 5.2.
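The warning against the simple stopping rule of Eq. (4.10) can be made concrete with a small simulation; the ten looks and equal study sizes below are our own assumptions for illustration.

```python
# Sketch (our own; ten looks and equal study sizes assumed): repeatedly
# testing a cumulative Z-statistic at a fixed 1.96 threshold inflates the
# type-I error well beyond the nominal 5%.
import math
import random

def ever_significant(n_looks, rng, z_crit=1.96):
    total = 0.0
    for t in range(1, n_looks + 1):
        total += rng.gauss(0.0, 1.0)      # a new study Z-score under H0
        z_meta = total / math.sqrt(t)     # cumulative meta-analysis Z-score
        if abs(z_meta) >= z_crit:
            return True
    return False

rng = random.Random(7)
rate = sum(ever_significant(10, rng) for _ in range(20000)) / 20000
print(rate)  # far above the nominal 0.05
```

With ten looks the probability of ever crossing 1.96 under the null is roughly four times the nominal level.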

Accumulation Bias caused by dependent meta-analysis timing
We described various forms of Accumulation Bias that are caused by how the study series size comes about, but dependencies are also introduced by how the meta-analysis itself arises. This is expressed by the P[M(t)] component of the analysis time probabilities A(t). We found only one such process mentioned in the literature, which we discuss now. Though exact parametric modeling indeed remains problematic, we can assume that a positive finding is a study estimate larger than the minimally clinically relevant effect ∆_H1, define the right amount of positive findings to be in the region [a, b], and show that this fits in our Accumulation Bias Framework by expressing a possible model for P[M(t)]:

For t such that S(t − 1) = 1:
A(t | z_1, . . . , z_t) = P[M(t) | T ≥ t, z_1, . . . , z_t] = 1{the number of positive findings among the first t studies lies in [a, b]}.   (4.11)

Accumulation Bias caused by Evidence-Based Research
New research should not be done unless, at the time it is initiated, the questions it proposes to address cannot be answered satisfactorily with existing evidence. -Chalmers & Glasziou (2009)

In 2009, the term Research Waste was coined and this key recommendation was made. The recommendation further specifies that existing evidence should be obtained by a systematic review and summarized with a meta-analysis. But how exactly to answer the question whether new research is necessary or wasteful remained unclear. Nevertheless, the recommendation was important enough to be repeated: first in an entire series on Research Waste, with a specific recommendation on setting research priorities (Chalmers et al., 2014), and later in a paper that gave the recommendation its official name: Evidence-Based Research (Lund et al., 2016). Support for these recommendations was provided by various retrospective cumulative meta-analyses that show how many studies were still performed while satisfactory evidence was already available. These cumulative meta-analyses judge "satisfactory evidence" based on a significance threshold, usually uncorrected for multiple testing (e.g. Fergusson et al. (2005)), which reminds us of the Accumulation Bias that occurs in living systematic reviews (Section 4.3.4). The larger point, however, is that Accumulation Bias is caused by any dependency between results and series size or meta-analysis timing, and that Evidence-Based Research introduces such dependencies. Inspecting previous results to decide whether new research is necessary or wasteful therefore always introduces Accumulation Bias, whether it is based on uncorrected or corrected thresholds. More subtle decision methods, implicit rather than based on thresholds, also introduce Accumulation Bias, as was shown by Kulinskaya et al. (2016).
In fact, they describe the rationale behind their models, among which the power-law model (Section 4.3.2), as an example of bias introduced by guidelines to decide on "the usefulness of a new study" "with direct reference to existing meta-analysis" (Kulinskaya et al., 2016, p. 297).

Over time, new study series are initiated, studies are added to existing study series and more meta-analyses are performed. To visualize how this process relates to error control, we need to start with a specific state of this expanding system. In 2001 an estimated minimum of 10 000 medical topics were covered in over half a million studies, thus requiring 10 000 meta-analyses if all were synthesized in a database such as the Cochrane Database of Systematic Reviews (Mallett and Clarke, 2003). The number of studies in a series varied between 2 and 136, which we can use to describe the 2001 state of a possible database that, to be complete, also includes many unreplicated pilot studies. We could visualize this database in a table, with studies in the rows, topics in the columns and many missing entries. A sketch is shown in Table 3.

The conventional approach to error control, which we used to show the influence of Gold Rush Accumulation Bias in meta-analysis testing in Section 3.6, is a conditional approach. Since conventional meta-analysis does not raise any multiple testing issues, there is a hidden assumption that the timing of a meta-analysis M(t) is independent from the data and that each study series experiences only one meta-analysis. In Section 4.3.1 we took the t at which the sole meta-analysis is conducted to be either random or prespecified. This is shown in Table 3 by the black box enclosing the available studies on Topic 1. Other possible study series up for meta-analysis are shown by the boxes enclosing studies on Topics 5 and 8. Note that by assuming only one meta-analysis, a study series might continue growing but never be fully analyzed, as shown for Topic 5.
In the conditional approach to error control, a three-study series (Z_1, Z_2, Z_3) produces a possible draw from the Z^(3) sampling distribution. If we test our draw, the type-I error rate is defined as the fraction of t-study series that is considered significant if all t-study series were to be sampled from the null distribution. The question is: which study series are taken into account to specify this fraction? This is visualized in Table 3 by the dark blue and grey shading for t = 2 and the dark blue and lighter blue shading for t = 3. The unshaded topics and the change of color between t = 2 and t = 3 show the flaw of this approach: some series might not survive up until a specific time t, as shown for instance by the grey studies that are part of the error control for t = 2 but not of that for t = 3. Nor do we want every series to survive up until any arbitrary time t, since that would produce research waste (Chalmers and Glasziou, 2009). The crucial point is that the series that do survive are not a random sample from all possible t-study series. This is another illustration of Accumulation Bias, such as the Toy Story scenario. The series deviates even more from the assumption of a random t-study draw if the meta-analysis time t is not random or prespecified, but dependent on the results, as expressed in Section 4.4. We discuss the conventional conditional approach to meta-analysis error control in more detail in Section 5.1. The other possible approach to error control is surviving over analysis times, which means that it should be valid for any upcoming analysis time t within a series. So the probability that a type-I error ever occurs in the accumulating series is controlled, whether the series reaches a large size or not. This is visualized in Table 3 by the orange shading, and it has a long run error rate that runs over series of any size, including one-study series.
This approach to error control is taken by methods for living systematic reviews such as Trial Sequential Analysis and Sequential meta-analysis. We discuss this approach, error control surviving over time, in more detail in Section 5.2.

Error control conditioned on time
The null distributions of the common/fixed meta-analysis Z-statistic shown in Figure 1 are conditioned on the size of the series, which is the time: T ≥ t. We can use our Accumulation Bias Framework to give this distribution a general description, where we use f_0(z^(t)) to denote the assumed standard normal null distribution for the meta-analysis Z-score and obtain a conditional density using Bayes' rule:

f_0(z^(t) | M(t), T ≥ t) = (A_0(t | z^(t)) / A_0(t)) · f_0(z^(t)),   (5.1)

where we define

A_0(t | z^(t)) = E_0[A(t | Z_1, . . . , Z_t) | Z^(t) = z^(t)]  and  A_0(t) = E_0[A(t | Z_1, . . . , Z_t)],

with Z^(t) = (1/√t) · ∑_{i=1}^{t} Z_i under the equal study size assumption in Eq. (3.1b) (extension to the general case with unequal sample sizes is straightforward). For the Gold Rush example, A_0(t) was given by Eq. (4.6) and can be calculated if the ωs are known. A_0(t) denotes the general probability of arriving at T ≥ t under the null hypothesis, and so does A_0(t | z^(t)), but with the restriction that we only take samples into account that result in meta-analysis score z^(t). The type-I error rates for the Gold Rush example shown in Table 2 are based on a randomly chosen or prespecified t for which P[M(t)] = 1, and represent the following (with f_0 as in Eq. (5.1)):

P_0[|Z^(t)| ≥ z_{α/2} | M(t), T ≥ t] = ∫_{|z^(t)| ≥ z_{α/2}} f_0(z^(t) | M(t), T ≥ t) dz^(t).   (5.2)

Error control surviving over time
In living systematic reviews, a meta-analysis is performed after each new study (P[M(t)] = 1 for all t). The error control properties obtained by, for example, Trial Sequential Analysis therefore survive over analysis times t and depend on the joint distribution of the data and the maximum study series size T. With P[M(t)] always 1, A(t) = S(t − 1), and this joint distribution can be presented as follows:

f_0(z^(t), T ≥ t) = f_0(z^(t)) · S_0(t − 1 | z^(t)),   (5.3)

where we define

S_0(t − 1 | z^(t)) = E_0[S(t − 1 | Z_1, . . . , Z_{t−1}) | Z^(t) = z^(t)],

with Z^(t) = (1/√t) · ∑_{i=1}^{t} Z_i under the equal study size assumption in Eq. (3.1b), and with f_0(z^(0)) = 1 and P_0[T ≥ 1 | z^(0), z^(1)] = 1.
The result P[T = t] = S(t − 1) − S(t) is known from survival analysis and is made explicit in Appendix A.5. When S(t) is known for all t, it is possible to obtain error control that survives over analysis times T = t with thresholds z^(t)_{α/2}. If we assume a one-sided test, the approach to error control taken by these methods can be expressed as follows:

P_0[TYPE-I] = ∑_{t=1}^{∞} P_0[Z^(t) ≥ z^(t)_{α/2}, T = t] ≤ α,   (5.4)

with f_0 as in (5.3) and T = t only in the case λ(t) = 1{Z^(t) ≥ z^(t)_{α/2}} = 1. The change in notation from T ≥ t to T = t already hints at the limitations of this approach: the series size needs to be completely determined by the thresholds specified in the hazard function and nothing else. We discuss this limitation in more detail in the next section.
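The effect of such corrected thresholds can be sketched numerically. The illustration below is our own and Pocock-style (a constant corrected threshold for five equally sized looks), not the specific correction used by Trial Sequential Analysis; 2.41 is approximately Pocock's constant for five looks at α = 0.05.

```python
# Our own Pocock-style sketch (not the paper's method): a constant corrected
# threshold for 5 interim looks keeps the type-I error, surviving over
# analysis times, near alpha = 0.05, while the naive 1.96 inflates it.
import math
import random

def ever_cross(n_looks, z_crit, rng):
    total = 0.0
    for t in range(1, n_looks + 1):
        total += rng.gauss(0.0, 1.0)          # new study under H0
        if abs(total) / math.sqrt(t) >= z_crit:
            return True
    return False

def crossing_rate(z_crit, n_looks=5, reps=20000, seed=5):
    rng = random.Random(seed)
    return sum(ever_cross(n_looks, z_crit, rng) for _ in range(reps)) / reps

print(crossing_rate(1.96))   # inflated, well above 0.05
print(crossing_rate(2.41))   # close to 0.05 (roughly Pocock's 5-look constant)
```

The price of the corrected threshold is that the series size must be fully determined by this rule, which is exactly the limitation discussed above.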

Unknown and unreliable analysis time probabilities
To obtain thresholds to test z^(t) under Accumulation Bias, we need to know the probability A(t) (or just S(t)) for meta-analysis time t. However, any of the scenarios described in Sections 4.3 and 4.4 can be involved, and several can influence z^(t) simultaneously. Ethical imperatives might also balance the bias, as illustrated by the following quote:

A negative result will dampen enthusiasm and turn the attention of investigators to other possible protocols. A positive result will excite interest but may provide an ethical veto on further randomization.

- Armitage (1984), as cited by Ellis and Stewart (2009)

We do not believe that the corrected thresholds z^(t)_{α/2} from sequential methods like Trial Sequential Analysis can account for all Accumulation Bias, since they require very strict adherence to the stopping rule based on the synthesized studies z^(t), and some have already argued that meta-analysts do not have such control over new studies (Chalmers and Lau, 1993). Sequential meta-analysis was proposed for prospective meta-analyses (Whitehead, 1997; Higgins et al., 2011) and never intended for settings with retrospective dependencies. Stopping rules based solely on the meta-analysis ignore dependencies that might already have arisen at the individual study level (such as in the Gold Rush example) and the fact that meta-analyses might in practice not be performed continuously (so that P[M(t)] = 1 does not hold for all t). When meta-analyses are not performed continuously, as discussed in Section 4.4, the specification of which series are included in the long run error control is missing (imagine, for example, that some of the meta-analysis columns 1, 2, 3 and 5 in Table 3 were excluded from the long run error control because the individual study results were such that nobody will ever bother to perform a meta-analysis). It might also be very inefficient to try to avoid Accumulation Bias.
As stated in the introduction, avoiding it would mean that results from earlier studies should be unknown when planning new studies as well as when planning meta-analyses (that is, the decision to do a meta-analysis after t studies should not depend on the outcome of these studies). Achieving this might be impossible, since research is very often somehow inspired by other findings. Such an approach also cannot be reconciled with the Evidence-Based Research initiative to reduce waste (Lund et al., 2016; Chalmers and Glasziou, 2009; Chalmers et al., 2014). We conclude that the Accumulation Bias process specifying A(t) can never be fully known, and that avoiding an Accumulation Bias process would introduce more research waste. So we need a testing method that is valid regardless of the exact Accumulation Bias process. We introduce such a method in Section 7, but first exhibit some evidence that, even while the recommendations from Evidence-Based Research still await widespread adoption, Accumulation Bias might already be at play.
Intermezzo: evidence for the existence of Accumulation Bias

Agreement with empirical findings
Accumulation Bias arises from dependencies in how a study series comes about (Section 4.3) and in the timing of the meta-analysis (Section 4.4). We first discuss some indications of the former and then illustrate how these can be reinforced by some approaches to the latter. If citations of previous results are a real indication of why a replication study is performed, then many such dependencies have been demonstrated in the literature on reference/citation bias (Gøtzsche, 1987; Egger and Smith, 1998). Citation or reference bias means that initial satisfactory results are more often cited than unsatisfactory results, so some sort of Gold Rush occurs. Studies into citations indicate that early small trials are much more often cited than later large trials (e.g. Fergusson et al. (2005); Robinson and Goodman (2011)), which might limit the Gold Rush to the early studies in a series, as indicated by Whitehead (2002), and which is also an indication of early study Accumulation Bias.
Other empirical findings suggest that Accumulation Bias might occur throughout a series, but to a lesser extent in later studies. Gehr et al. (2006), for example, report effect sizes that decrease over time, but in which study size did not play a significant role. What has been recognized as regression to the truth in heart failure studies might also be characterized as Accumulation Bias (Krum and Tonkin, 2003). These effects will be difficult to limit to only a few early studies, so excluding a certain number from the meta-analysis, as proposed in Whitehead (2002, p. 197) (Section 4.3.3), might be too crude a measure. The Proteus effect (Pfeiffer et al., 2011; Ioannidis and Trikalinos, 2005; Ioannidis, 2005a) describes how early replications can be biased against initial findings. If early contradicting findings spur a large series of studies into a phenomenon, this introduces a more complex pattern of Accumulation Bias without a straightforward dominating direction. The same holds for the Value of Information approach to deciding on replication studies (Claxton and Sculpher, 2006; Claxton et al., 2002). There is quite some literature with suggestions on when a meta-analysis should be updated. One general recommendation is to do so when studies can be added that will have a large effect on the meta-analysis (Moher and Tsertsvadze, 2006; Moher et al., 2007b, 2008). If such recommendations reflect an overall tendency in the timing of meta-analyses, Accumulation Bias might be reinforced by that timing: initial misleading studies might have spurred a study series, and might also indirectly encourage a meta-analysis after later studies report deviating results.

Agreement with intuitions about priors
The famous paper "Why Most Published Research Findings Are False" (Ioannidis, 2005b) introduced the concept of field-specific prior odds to a large audience. The prior odds were presented as the "Ratio of True to Not-True Relationships (R)", which has the same meaning as the ratio of pilot studies from the alternative and the null distribution (π/(1 − π)) in the terminology of this paper. Ioannidis (2005b) combines this ratio with the average power and type-I error of tests in a research field to obtain a field-specific estimate of the Positive Predictive Value (PPV) of a significant result. This is based on the expected rate, or target rate, of true to false rejections, the same as γ · π/(1 − π) in Section 7.1 of this paper. Ioannidis (2005b) provides prior odds for various research fields and publication types, of which two are of interest to Accumulation Bias: "Adequately powered RCT with little bias" and "Confirmatory meta-analysis of good-quality RCTs". For the first an R of 1:1 is provided, and for the second an R of 2:1. So a distinction is made between topics worthy of only one individual study and those that evoke a series of studies eligible for meta-analysis. How would the researchers involved in replicating RCTs know that their topic is worthy of a series of studies instead of just one? The difference between the prior odds of the two indicates that this is no random decision. The only available source of information would be previous study results, hence introducing dependence between study series size and study results: Accumulation Bias. So the prior odds R specified by Ioannidis (2005b) are actually (π · A_1(t)) / ((1 − π) · A_0(t)), with A_1(1) = 1 and A_0(1) = 1 for primary studies.

Likelihood ratios' independence from meta-analysis time
In Section 5.3 we argued that any approach to model the analysis time probabilities A(t) is unreliable: in realistic and practically relevant scenarios, the ingredients required to calculate A(t) will be unknown. Therefore, we need test statistics that are independent from how a series size or meta-analysis comes about. A possible such test statistic is the likelihood ratio, which we discuss from the two approaches to error control: in Section 7.1 from the perspective of error control conditioned on time, and in Section 7.2 from the perspective of error control surviving over time.
Our proposed use of the likelihood ratio is based on the following extraordinary property, already recognized by Berger and Berry (1988) and shown in Eq. (7.1). The likelihood ratio is a test statistic that depends on the specification of some alternative distribution f_1. Any data sampled from an alternative distribution has the same analysis time probabilities as data sampled from the null distribution, since analysis time probabilities are independent from the data-generating hypothesis (Section 4.2). When a likelihood ratio statistic is obtained for known data, the analysis time probability is a constant factor that is the same in the numerator and denominator of the likelihood ratio, and therefore drops out of the equation:

LR10(z_1, . . . , z_t | M(t), T ≥ t) = (f_1(z_1, . . . , z_t) · A(t | z_1, . . . , z_t)) / (f_0(z_1, . . . , z_t) · A(t | z_1, . . . , z_t)) = f_1(z_1, . . . , z_t) / f_0(z_1, . . . , z_t).   (7.1)

Here we used the standard definition of the likelihood ratio for the case that the likelihood jointly involves continuous-valued data and discrete events, and we critically used the fact that the probability of M(t), T ≥ t does not depend on whether the null or the alternative distribution generated the data. In the following two sections we discuss two ways of using likelihood-ratio based tests that yield results that are valid irrespective of Accumulation Bias.
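A tiny numerical check of the cancellation in Eq. (7.1); the normal model with mu = 0.5 is our own assumption for illustration.

```python
# Sketch (our own; normal model with mu = 0.5 assumed): for z_i ~ N(0,1)
# under H0 and N(mu,1) under H1, the likelihood ratio depends only on the
# observed data, so any accumulation probability A(t) cancels out.
import math

def log_lr10(zs, mu=0.5):
    # log f1(z1..zt) - log f0(z1..zt) for i.i.d. normal likelihoods
    return sum(mu * z - mu * mu / 2.0 for z in zs)

zs = [0.3, 1.1, -0.2]
a_t = 0.01  # arbitrary accumulation probability A(t | z1..zt)
lr_plain = math.exp(log_lr10(zs))
lr_with_a = (a_t * math.exp(log_lr10(zs))) / a_t  # same A(t) in numerator and denominator
print(abs(lr_plain - lr_with_a) < 1e-12)  # True
```

However the value of a_t is chosen, and however it depends on the data, it appears in both numerator and denominator and leaves the likelihood ratio unchanged.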

Likelihood ratio's error control conditioned on time
A large study series has an extremely small probability of occurring under the null hypothesis in the Gold Rush scenario, and under any other similar Accumulation Bias setting. The probability of reaching a certain study series size t is much larger under any alternative hypothesis for which the power of the test (1 − β) is larger than the type-I error α. Due to this fact, it is possible to control an error rate if we assume that a certain fraction π of pilot studies (or topics, see Table 3) are sampled from the alternative distribution and a fraction (1 − π) of pilot studies from the null. This way, we are able to control the rate of true to false rejections.

To avoid any confusion, let us highlight that our likelihood-ratio based tests are never equivalent to p-value based tests. While some p-value based tests (such as the Neyman-Pearson most powerful test) can be written as likelihood ratio tests, these are invariably of the form 'reject at significance level α if LR10(z_1, . . . , z_t) ≥ γ', where γ is chosen such that P_0(f_1(z_1, . . . , z_t) / f_0(z_1, . . . , z_t) ≥ γ) = α. In contrast, we choose γ in a way that does not depend on knowledge of the tail area under P_0 (e.g. in Section 7.2 we take γ = 1/α, and there the equality above becomes a (strict) inequality).
We can achieve such error control conditioned on time, e.g. error control taking into account only t-study meta-analyses, if we define thresholds based on the Bayes posterior odds, which, by Bayes' theorem, are given by O_post(z_1, . . . , z_t) = LR10(z_1, . . . , z_t) · π/(1 − π). Remarkably, these are not affected by the mechanism underlying the decisions to continue studies or perform meta-analyses:

O_post(z_1, . . . , z_t | M(t), T ≥ t) = LR10(z_1, . . . , z_t) · π/(1 − π).   (7.2)

We can set a threshold γ based on the intended rate of true to false rejections; γ = 16 would mean that we try to achieve 16 times as many true rejections as false rejections, following γ = (1 − β)/α, which is the usual goal of a primary analysis with intended power 1 − β = 0.8 and type-I error rate α = 0.05. To obtain error control, we need to specify the pre-experimental rejection odds (Bayarri et al., 2016) γ · π/(1 − π) and use these to threshold the posterior odds (Eq. (7.2)). We define R to be the region of the sample space, and the corresponding event, for which O_post(z_1, . . . , z_t) ≥ γ · π/(1 − π), i.e. the event that we reject, and obtain the following:

(π · P_1[R, M(t), T ≥ t]) / ((1 − π) · P_0[R, M(t), T ≥ t]) ≥ γ · π/(1 − π),   (7.3)

where the inequality follows since if O_post(z_1, . . . , z_t | M(t), T ≥ t) ≥ γ · π/(1 − π), then f_1(z_1, . . . , z_t)/f_0(z_1, . . . , z_t) ≥ γ, and, since the analysis time probability A(t | z_1, . . . , z_t) is the same under both hypotheses,

P_1[R, M(t), T ≥ t] = ∫_R f_1(z_1, . . . , z_t) · A(t | z_1, . . . , z_t) dz_1 · · · dz_t ≥ γ · ∫_R f_0(z_1, . . . , z_t) · A(t | z_1, . . . , z_t) dz_1 · · · dz_t = γ · P_0[R, M(t), T ≥ t].   (7.4)

So by specifying π/(1 − π) and an intended rate of true to false rejections γ, we can calculate the posterior odds based on the likelihood ratio, compare them to the threshold based on γ, and control the rate of true to false rejections at the pre-experimental rejection odds γ · π/(1 − π). Note that any M(t) is allowed, including multiple testing in a series or selection of the most promising meta-analysis timing. Setting a threshold on the Bayes posterior odds as described above achieves conditional error control under any form of Accumulation Bias.
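The guarantee can be checked in simulation; the values of π, γ and the normal model below are our own assumptions for illustration, and for simplicity every topic is tested at a fixed series size.

```python
# Sketch (our own; pi, gamma and the normal model assumed): rejecting when
# the posterior odds exceed gamma * pi/(1 - pi), i.e. when LR10 >= gamma,
# yields well over gamma times as many true as false rejections.
import math
import random

def lr10(zs, mu=1.0):
    # f1/f0 for z_i ~ N(mu, 1) under H1 versus N(0, 1) under H0
    return math.exp(sum(mu * z - mu * mu / 2.0 for z in zs))

rng = random.Random(3)
gamma, pi_alt, mu = 16.0, 0.5, 1.0
true_rej = false_rej = 0
for _ in range(40000):
    from_alt = rng.random() < pi_alt           # which hypothesis this topic follows
    zs = [rng.gauss(mu if from_alt else 0.0, 1.0) for _ in range(3)]
    if lr10(zs, mu) >= gamma:                  # posterior odds >= gamma * pi/(1-pi)
        if from_alt:
            true_rej += 1
        else:
            false_rej += 1
print(true_rej, false_rej)
```

With prior odds 1:1 the realized ratio of true to false rejections comfortably exceeds γ = 16, as Eq. (7.3) promises.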

Likelihood ratio's error control surviving over time
A likelihood ratio itself can be used as a test statistic to obtain a procedure that controls the type-I error probability P_0[TYPE-I] in a way that survives over analysis times t, as in Section 5.2. Suppose we simply reject if the likelihood ratio in favor of the alternative is larger than 1/α, ignoring any knowledge we might have about the Accumulation Bias process and the prior odds. We then find:

P_0[TYPE-I error at some analysis time] = P_0[∃t ≤ T : f_1(Z_1, . . . , Z_t)/f_0(Z_1, . . . , Z_t) ≥ 1/α] ≤ P_0[∃t : f_1(Z_1, . . . , Z_t)/f_0(Z_1, . . . , Z_t) ≥ 1/α] ≤ α.   (7.5)

The final inequality is a classic result, proofs of which can be found in, for example, Robbins (1970), Shafer et al. (2011) and (with substantial explanation) Hendriksen et al. (2018); see also Royall (2000). Thus, the type-I error control survives over time in the sense that the P_0-probability that we ever reject at a meta-analysis time is bounded by α.

To further illustrate and interpret error control surviving over time, we define E^first_TYPE-I(t) as the event that a type-I error occurs at analysis time t while no type-I error occurred at any earlier time in the series (the first type-I error happens at time t). As we show in Appendix A.6, the previous inequality implies that

∑_{t=1}^∞ P_0[E^first_TYPE-I(t), T ≥ t] ≤ α.   (7.6)

The change in notation from the per-time type-I error event to the first-error event is necessary since we want a general result for all forms of Accumulation Bias and do not want to assume that the series stops growing after the threshold is crossed (as is assumed in living systematic reviews, see Section 4.3.4). But since it is not possible to control the number of errors if multiple errors are made in the same series, we count only the first error in Eq. (7.6). As such, we are able to control the number of topics for which an error ever occurs in the series by comparing the likelihood ratio to the threshold 1/α.

It may seem surprising that it is possible to obtain error control in the sense of Eq. (7.6) for Accumulation Bias scenarios like the Gold Rush example. After all, in this example large study series only have a large probability to occur if they contain many extreme (significant) results. So it seems that we would inevitably hit a type-I error once we perform a meta-analysis. But note that in this example the expectation of A(t | Z_1, . . . , Z_t), which is A_0(t), is much larger for small t, due to the S(t) component, so that most meta-analyses will be of small study series, or even one-study series, with small type-I error rates. In terms of Table 3, controlling error this way is possible because error control runs over all topics, regardless of the realized series size. Thus, such error control is only meaningful if the series for each topic are continuously monitored, including those consisting of only pilot studies.
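The bound in Eq. (7.5) can be checked by simulation. The sketch below is our own, not from the paper: study Z-scores are drawn under a simple Gaussian null, the likelihood ratio is taken against a hypothetical alternative Z ~ N(δ, 1), and a toy Accumulation Bias continuation rule lets non-significant results tend to end the series. The fraction of series that ever reject stays below α:

```python
import random, math

def ever_rejects(rng, alpha=0.05, delta=1.0, max_t=50):
    """Simulate one topic's study series under the null (Z ~ N(0,1)).

    Returns True if the likelihood ratio against the (hypothetical)
    alternative Z ~ N(delta, 1) ever exceeds 1/alpha."""
    lr = 1.0
    for _ in range(max_t):
        z = rng.gauss(0.0, 1.0)
        lr *= math.exp(delta * z - delta ** 2 / 2)  # likelihood ratio update
        if lr >= 1 / alpha:                         # reject at this analysis time
            return True
        if z < 1.96 and rng.random() < 0.5:         # toy Accumulation Bias rule:
            return False                            # non-significant results often end the series
    return False

alpha = 0.05
rng = random.Random(1)
runs = 20000
rate = sum(ever_rejects(rng, alpha) for _ in range(runs)) / runs
print(round(rate, 4))  # well below alpha = 0.05
```

The continuation rule is deliberately data-dependent; by Eq. (7.5) the bound holds for any such rule.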

The choice between error control conditioned on time and surviving over time
Many meta-analysts seem reluctant to apply living systematic review techniques to all meta-analyses. We believe that this reluctance can be defended based on the assumed approach to error control surviving over time.
Surviving over time means that all possible analysis times are weighted and that, in the long run, a large proportion of meta-analyses will be one-, two- and three-study meta-analyses that never expand. To the occasional meta-analyst, not involved in continuously updating meta-analyses, two- or three-study meta-analyses might never occur. It also requires a stretch of mind to imagine one-study meta-analyses as part of the long-run properties of your specific 15-study meta-analysis. But it has been argued that "primary research is increasingly viewed as part of a wider sequential process" (Higgins et al., 2011, p. 918), or at least, that it should be (Lund et al., 2016). Whether this approach to error control is acceptable might also be very field specific. Among medical meta-analyses in the Cochrane Database of Systematic Reviews, two- and three-study meta-analyses are common (Davey et al., 2011), but in other fields meta-analyses might only be performed if many more studies are available. If, on the other hand, we want to stick to the conventional conditional approach to meta-analysis, we need additional assumptions on the fraction π of true alternative hypotheses among pilot studies to threshold the posterior odds. Assuming a base rate π means that we are essentially Bayesian about the null and alternative hypothesis[4], but there is no need to be strictly Bayesian: in practice, we might play around and try best-case and worst-case π to see how they affect our posterior odds. The important thing to note within the context of this paper is that, when concentrating on posterior odds, we can ignore all details of the Accumulation Bias process and still obtain meaningful results, in the form of error control that balances type-I and type-II errors.
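A minimal sketch of this "play around with π" idea (the numbers are hypothetical, not from the paper): posterior odds are simply the prior odds π/(1 − π) times the joint likelihood ratio, so best-case and worst-case base rates can be compared directly, without any detail of the Accumulation Bias process:

```python
# Hypothetical joint likelihood ratio of a study series and a
# hypothetical posterior-odds rejection threshold (both made up here).
likelihood_ratio = 12.0
threshold = 10.0

results = {}
for pi in (0.1, 0.5):  # worst-case and best-case base rate of true alternatives
    prior_odds = pi / (1 - pi)
    posterior_odds = prior_odds * likelihood_ratio
    results[pi] = posterior_odds
    print(pi, round(posterior_odds, 2), posterior_odds >= threshold)
```

With these made-up numbers, the same evidence passes the threshold under the optimistic base rate but not under the pessimistic one, which is exactly the kind of sensitivity check the text suggests.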
Summarizing: if we prefer conditional error control, we can obtain meaningful error control despite Accumulation Bias by using tests based on likelihood ratios, but using prior odds for the base rates (and thus being partially Bayesian) is then unavoidable. If we prefer not to rely on any prior odds, we can still obtain meaningful error control despite Accumulation Bias by using tests based on likelihood ratios, but then we have to resort to error control surviving over time instead of conditional error control. The former, conditional approach balances type-I and type-II errors and thus takes power into account. The importance of taking power (the complement of the type-II error rate) into account has been argued by many before.
In the general approach to error control in individual studies, the expected type-I error rate is fixed by the significance level α, and the type-II error rate is minimized by the experimental design and sample size. In retrospective meta-analysis, however, the sample size (or study series size t) is not under the control of the meta-analyst. Moreover, the study series size t is only a snapshot of a possibly growing series (T ≥ t), since more studies might be performed in the future. Therefore, estimates of meta-analysis power are also snapshots at a specific meta-analysis time. Nevertheless, it is often argued that many meta-analyses are underpowered (Turner et al., 2013; Davey et al., 2011) and that this should be taken into account in evaluating significance in meta-analyses. In Trial Sequential Analysis (Wetterslev et al., 2008), for example, an alternative hypothesis is formulated to judge the fraction of a required sample size available at t studies. A later review on trial sequential analysis noted:

statistical confidence intervals and significance tests, relating exclusively to the null hypothesis, ignore the necessity of a sufficiently large number of observations to assess realistic or minimally important intervention effects. -Wetterslev, Jakobsen & Gluud (2017, p. 12)

Testing procedures based on likelihood ratios are very well suited to take an alternative distribution with a minimally important intervention effect into account, especially when balancing type-I error and power by thresholding posterior odds. Specifying power in tests without fixed sample sizes is studied extensively in Grünwald et al. (2019) and will be the focus of future research into likelihood ratios for meta-analysis.

[4] We do need to be partially Bayesian, in the sense that we need to specify a base rate for the null (Grünwald et al., 2019).

Why likelihood ratios work: dependencies as strategy
We calculate p-values to judge the extremeness of our results under the null hypothesis, and to control type-I errors. But the p-value method is a fairly complicated approach to that goal when it comes to meta-analysis: To obtain a valid p-value for a series of studies, the sampling distribution under the null hypothesis needs to specify exactly how the series and the meta-analysis timing came about. Only for a completely and accurately specified process can the extremeness of the data be judged and compared to a threshold based on the tail area of the sampling distribution.
Fortunately, much simpler approaches to the same goal can be found. One intuitive way is to consider a series of bets s(Z_1), s(Z_2), . . . , s(Z_t) against the null hypothesis that make a profit when observed study results are extreme. The more extreme the results, the larger the profit. The bets need to be designed in such a way that, under the null hypothesis, no profit is to be expected: each bet costs $1 to play and, under the null, also returns $1 in expectation:

E_0[s(Z_i)] = 1.   (9.1)

Suppose that you start by investing $1 in the first bet. After each study, you either decide to do a new study, and reinvest all profit obtained so far, or to stop and cash out. If you cash out after, for example, three studies, your profit is s(Z_1) · s(Z_2) · s(Z_3).
As long as Eq. (9.1) holds for each bet, you cannot expect to profit under the null hypothesis, no matter what the process is for deciding, based on past data, to continue to new studies or to stop. This can be mathematically proven using martingale theory, but intuitively the reason is clear: the situation is entirely analogous to that in a casino, where you cannot expect to make a salary out of playing, no matter how sophisticated your strategy on the order of the games, when to play or when to go home. Thus, irrespective of the rules used for continuation and stopping, making a large profit casts doubt on the null hypothesis even without knowledge of the entire sampling distribution. This idea of testing by betting is described in great detail by Shafer and Vovk (2019), and Shafer et al. (2011) show that a likelihood ratio is a beautiful way to specify such bets. Briefly, if we set s(Z_t) = f_1(Z_t)/f_0(Z_t), then Eq. (9.1) obviously holds:

E_0[s(Z_t)] = ∫ (f_1(z)/f_0(z)) f_0(z) dz = ∫ f_1(z) dz = 1.

Under this definition, s(z_1) · . . . · s(z_t) has two interpretations: first, it is the joint likelihood ratio for the first t studies; second, it is the amount of profit made by sequentially reinvesting in a bet that is not expected to make a profit under the null hypothesis. So we can think of the meta-analyst acting at time t as earning the profit specified by the likelihood ratio of the data up to the t-th study, and using that information to advise on reinvestment in future studies. This procedure will not lead to bankruptcy if the null hypothesis is true, and will therefore allow you to keep reinvesting. If the null hypothesis is not true, the better the focus of the bets (determined by how close the alternative distribution in the likelihood ratio is to the data-generating distribution), the larger the expected profit. The crucial point is that every strategy is allowed, including the ineffective ones that produce research waste: not taking earlier studies into account is also a strategy.
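A quick numerical check (our own sketch, with a standard-normal null f_0 and a hypothetical alternative f_1 = N(δ, 1)) that the likelihood ratio bet satisfies Eq. (9.1): integrating s(z) f_0(z) = f_1(z) over the real line returns the $1 stake in expectation under the null:

```python
import math

def f(z, mu):
    """Normal density with unit variance and mean mu."""
    return math.exp(-(z - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

delta = 0.8  # hypothetical alternative mean
# E_0[s(Z)] = integral of (f1/f0) * f0 = integral of f1 over the real line;
# a crude Riemann sum on [-10, 10] suffices since the tails are negligible.
step = 0.001
grid = [i * step - 10 for i in range(20001)]
expected_payout = sum(f(z, delta) / f(z, 0.0) * f(z, 0.0) * step for z in grid)
print(round(expected_payout, 4))  # ≈ 1.0: a fair bet under the null
```

The same calculation goes through for any alternative density f_1, which is why the choice of bet only affects the profit under the alternative, not the fairness under the null.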
This interpretation of likelihood ratios as betting strategies explains how dependencies in the series relate to the test statistic. Any Accumulation Bias process can be considered a strategy to reinvest profit made so far, by deciding on new studies (S(t)) or by cashing out the current profit (equivalent to performing a meta-analysis at time t and advising against further studies: A(t), T = t). This is the intuition behind the proof of results like Eq. (7.5) and (7.6), bounds on the type-I error probability in meta-analysis that can be derived without knowledge of the Accumulation Bias process. These bounds simply express that a large profit is unlikely under the null, no matter what the Accumulation Bias process is.
it is always legitimate to continue betting, and this makes each individual study a more informative element of a research program or a meta-analysis -Shafer (2019, p. 2)

In contrast to an all-or-nothing test for one study, inspecting the betting profit of a study is a way to test the data without losing the ability to build on it in future studies. The likelihood ratio makes it possible to maximize the rate of growth over all studies in a series, instead of the power of a single p-value test on a prespecified series size or stopping rule (Shafer, 2019). It allows promising but inconclusive initial studies and small study series to be revisited in the light of new studies, but also keeps track of the combined evidence at any time. In this sense, the use of likelihood ratios in meta-analysis is a statistical implementation of the goals of the Evidence-Based Research Network (Lund et al., 2016). Choosing your bets wisely, by informing new studies by previous results, is just another betting strategy: you optimize which studies to perform, and how to design and analyze them. Implementing this rationale in the statistics makes it possible to maximize the efficiency of future research and reduce research waste (Chalmers and Glasziou, 2009).

Expanding likelihood ratios to Safe Tests
When the null hypothesis is simple, it can be shown that using bets that satisfy Eq. (9.1) under the null, using likelihood ratios, or using Bayes factors are all equivalent, and the gambling approach can be viewed as a form of Bayesian inference. But for a composite null (as in the t-test scenario, with unknown variance σ²), the situation is trickier: bets that satisfy Eq. (9.1) under all distributions in the null hypothesis can still be constructed, but their relation to likelihood ratios is more complicated. The paper Safe Testing (Grünwald et al., 2019) investigates this setting in great detail and shows that 'error control surviving over time' (Section 7.2) can still be obtained for a general composite null.

Discussion
We need to consider time (study chronology and analysis timing) in meta-analysis. We need it because estimates are biased by Accumulation Bias when they assume that a t-study series is a random sample from all possible t-study series, while in fact dependencies arise in accumulating science. We also need time because sampling distributions are greatly affected by it, and the (p-value) tail-area approach to testing is very sensitive to the shape of the sampling distribution. And we need to consider time because it allows for new approaches to error control that recognize the accumulating nature of scientific studies. Doing so also illustrates that available meta-analysis methods (general meta-analysis and methods for living systematic reviews) target two very different approaches to type-I error control. We believe that the exact scientific process that determines meta-analysis time can never be fully known, and that approaches to error control need to be trustworthy regardless of it. A likelihood ratio approach to testing solves this problem and has even more appealing properties that we will study in a forthcoming paper. Firstly, it agrees with a form of the stopping rule principle (Berger and Berry, 1988). Secondly, it agrees with the prequential principle (Dawid, 1984). Thirdly, it allows for a betting interpretation (Shafer and Vovk, 2019; Shafer, 2019): reinvesting profits from one study into the next and cashing out at any time. But this approach still leaves us with a choice: either assume a prior probability π and separate meta-analyses of various sizes from each other and from individual studies, or control the type-I error rate over all analysis times t and include individual studies in the meta-analysis world. The first approach is more of a reflection of the current reality in meta-analysis, while the second can be aligned with the goals of the Evidence-Based Research Network (Lund et al., 2016) and living systematic reviews (Simmonds et al., 2017).
Accumulation Bias itself might not need to be corrected at all, which is why we want to close this paper with the following quote: the intuitive notion that bias is something bad which must be corrected for, does not even fit well within the frequentist framework. [...] one could not state "use estimate X for a fixed sample size experiment, but use X − c(X ) (correcting for bias) for a sequential experiment," and retain frequentist admissibility in the "real" situation where one encounters a variety of both types of problems. The requirement of unbiasedness simply seems to have no justification. -Berger & Berry (1988, p. 67)

Underlying data
All data underlying the results are available as part of the article and no additional source data are required.

Grant information
This work is part of the NWO TOP-I research programme Safe Bayesian Inference [617.001.651], which is financed by the Netherlands Organisation for Scientific Research (NWO).
A.1 Meta-analysis Z-scores

Let d_i = D_i/σ_D be the Cohen's d of the treatment score in study i (Borenstein et al., 2009, p. 26), so standardized with regard to the estimated population standard deviation, and let n_i denote the sample size in the treatment and placebo arm of study i (under the assumption that all studies have equal-size study arms). Since SE²_{d_i} = 1/n_i, let w_i = n_i denote the weights for d_i. Based on these weights, M(t) and SE_M(t) can be expressed as follows, using the fact that D_i = d_i · σ_D, SE²_{D_i} = σ²_D/n_i, and thus W_i = w_i · (1/σ²_D) (see also Borenstein et al. (2009, p. 82)):

M(t) = ∑_{i=1}^t W_i D_i / ∑_{i=1}^t W_i = σ_D · ∑_{i=1}^t n_i d_i / ∑_{i=1}^t n_i,   SE_M(t) = 1/√(∑_{i=1}^t W_i) = σ_D / √(∑_{i=1}^t n_i).

With N(t) = ∑_{i=1}^t n_i and d_i = Z_i/√n_i, the common/fixed-effect Z-score Z(t) of studies 1 up to and including t can be derived as an average weighted by the square roots of the individual study sample sizes:

Z(t) = M(t)/SE_M(t) = ∑_{i=1}^t √n_i Z_i / √N(t) = (1/√t) ∑_{i=1}^t Z_i   for n_1 = n_2 = . . . = n_t = n.   (A.4)
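Eq. (A.4) can be sketched as follows (our own illustration; the study Z-scores and sample sizes are made up):

```python
import math

def meta_z(z_scores, sizes):
    """Common/fixed-effect meta-analysis Z-score as in Eq. (A.4):
    study Z-scores weighted by the square roots of the sample sizes."""
    num = sum(math.sqrt(n) * z for z, n in zip(z_scores, sizes))
    return num / math.sqrt(sum(sizes))

z = [1.2, -0.3, 2.1]                      # hypothetical study Z-scores
print(round(meta_z(z, [50, 50, 50]), 6))  # equal n: reduces to sum(z)/sqrt(t)
print(round(sum(z) / math.sqrt(3), 6))    # the two agree
```

For unequal sample sizes the general weighted form applies; the equal-n simplification is what makes the (1/√t) ∑ Z_i expression in (A.4) possible.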

A.2 Expectation Gold Rush conditional pilot Z-score
Here, and in the following, we assume that there is always a first study (P [T ≥ 1] = 1).
This expression only considers significant positive and nonsignificant results in the pilot study, since we defined in Eq. (3.2) that significant negative results have zero probability of producing replication studies. We can replace P_0 by P in the middle term of the fractions in the first two rows because new-study probabilities are independent of the data-generating distribution, as discussed in Section 3.3.

A.3 Expectation Gold Rush conditional meta-analysis Z-score
For all t ≥ 2, we use that the last study in a series under the Gold Rush example is unbiased and has expectation 0 under the null hypothesis. We also use that the expansion of the series beyond the next study does not influence a study's expectation in our Gold Rush example: for t ≥ 2, E_0[Z_1 | T ≥ t] is the same as E_0[Z_1 | T ≥ 2], and for any i and t ≥ i, E_0[Z_i | T ≥ t] is the same as E_0[Z_i | T ≥ i + 1].

Because squaring is a convex function, we know from Jensen's inequality that the average squared mean (A.7a) is larger than the square of the average mean (A.7b). So the variance of the mixture is larger than the mixture of the variances.

A.5 Maximum time probability
The survival function S(t − 1) represents the probability P[T ≥ t]. The survival function is the complement of a cumulative distribution function on maximum or stopping times T, known in survival analysis as the lifetime distribution function F(t − 1):

S(t − 1) = P[T ≥ t] = 1 − P[T ≤ t − 1] = 1 − F(t − 1).

A.6 Error control surviving over time in terms of a sum
Let E^first_TYPE-I(t) be the event that both the first type-I error in the series occurs at time t and T ≥ t holds. Using in the first equality below that the events E^first_TYPE-I(1), E^first_TYPE-I(2), . . . are all mutually exclusive (so that the union bound becomes an equality), we get:

∑_{t=1}^∞ P_0[E^first_TYPE-I(t), T ≥ t] = P_0[∪_{t=1}^∞ E^first_TYPE-I(t)] ≤ P_0[∃t : f_1(Z_1, . . . , Z_t)/f_0(Z_1, . . . , Z_t) ≥ 1/α] ≤ α,

where the final inequality is just the final inequality of Eq. (7.5) again. Eq. (7.6) follows.
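This decomposition can be illustrated by simulation (our own sketch, with a hypothetical Gaussian alternative): counting only the first threshold crossing per series makes the events mutually exclusive, so their probabilities sum to the 'ever reject' probability, which Eq. (7.5) bounds by α:

```python
import random, math

alpha, delta, max_t = 0.05, 1.0, 30
rng = random.Random(7)
runs = 20000
first_crossing = [0] * max_t  # counts of FIRST rejections at each time t
for _ in range(runs):
    lr = 1.0
    for t in range(max_t):
        lr *= math.exp(delta * rng.gauss(0.0, 1.0) - delta ** 2 / 2)
        if lr >= 1 / alpha:        # first type-I error in this series
            first_crossing[t] += 1
            break                  # later crossings in the same series are not counted

total = sum(first_crossing) / runs  # sum over t of P0[first error at time t]
print(round(total, 4))              # the theoretical bound is alpha = 0.05
```

Without the `break`, the same series could contribute several errors and the sum would no longer be a probability, which is exactly why Eq. (7.6) counts first errors only.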

A.7 Code availability
Table 1, Figure 1 and Table 2 were calculated, simulated and created by R code available in the EASY-DANS repository: https://doi.org/10.17026/dans-x56-qfme (see Extended data (Schure, 2019)). Details on the OS and version with which it was run can be found below:
• Platform: x86_64-redhat-linux-gnu
• Arch: x86_64
• OS: linux-gnu
• System: x86_64, linux-gnu
• R version: 3.5.3 (2019-03-11) "Great Truth"
• svn rev: 76217

The following packages were used:
• ggplot2 version 3.0.0
• graphics version 3.5.3
• grDevices version 3.5.3
• methods version 3.5.3
• stats version 3.5.3
• utils version 3.5.3