What is reproducibility?

The debate on reproducibility in biomedicine will gain precision only if we agree on what reproducibility means. Importantly, reproducibility should be distinguished from validity ("truth"). We propose applying an equivalence trials framework to clarify the concept of reproducibility, replacing the (narrow) equivalence zone around a zero difference with a zone of reproducibility around (a) previous finding(s).


Introduction
Reproducibility is said to be a core principle of scientific progress. Nevertheless, poor reproducibility has recently been shown to haunt preclinical research 1,2, translational research 3, medicine 4 and psychology 5. False-positive initial results due to random chance or incorrect study design were among the reasons implicated, as well as data-dredging, publication bias and misconduct. Others called irreproducible results 'biased' 1 and 'unreliable' 5.
Coming from a background of meta-analysis, with its countless examples of unexplained heterogeneity and an ingrained appreciation of sampling variability, we were surprised that the outcries cited above were not accompanied by a formal definition of the concept of reproducibility. Goodman et al. did define three types of reproducibility (methods, results, and inferences) and stated that confusion arises when people inadvertently use reproducibility as a synonym for "truth" 6. We read their paper as being about truth, although its title suggests otherwise. Our paper is about reproducibility sensu stricto: we revisit some basic definitions of reproducibility, note that these definitions are problematic, and argue that the concept of equivalence in randomized trials may be fruitfully applied to sharpen our understanding of what we mean by reproducibility. We propose that investigators aiming to reproduce others' findings should pay more attention to predefining a margin of (unacceptable) discordance with existing findings.

Discussion
Box 1 shows two formal definitions of the concept of reproducibility.

Definition 1:
"The value below which the absolute difference between two single test [or study, our addition] results may be expected to lie with a probability of 95%, when the results are obtained by the same method and equipment from identical test material in the same setting by the same operator within short intervals of time. A test or measurement [or study, our addition] is reproducible if the results are identical or closely similar each time it is conducted (Synonym, repeatability)" 7

Definition 2:
"The degree of agreement among a set of observations […] after all known sources of error are accounted for (Synonym, precision)" 8

Note the following differences between definitions 1 and 2:
(i) In definition 1, reproducibility is taken to be a binary concept: a result is either reproduced or not. Definition 2 takes reproducibility to be a continuous concept, like a degree of concordance.
(ii) Related to (i), definition 1 implies the subjective choice of a difference, δ, whose value will depend on the measurement problem at hand. Definition 2 avoids a choice of δ.
(iii) Definition 1 fixes the confidence level at 95%. Definition 2 avoids the subjective choice of a particular confidence level, such as 95%, 90% or 68%.
(iv) Only definition 2 emphasizes measurement that is free of bias.
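The "value" in definition 1 can be made concrete under one added assumption that is not part of the quoted text: if two results are independent and normally distributed with a common standard deviation, their difference has a standard deviation of √2 times that value, so the 95% limit is roughly 2.77 standard deviations. The following minimal sketch illustrates this; the function name `repeatability_limit` and its parameters are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def repeatability_limit(sd: float, probability: float = 0.95) -> float:
    """Value below which |x1 - x2| is expected to lie with the given probability.

    Assumption (ours, for illustration): x1 and x2 are independent and
    normally distributed with the same standard deviation `sd`.
    """
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0 < probability < 1:
        raise ValueError("probability must lie strictly between 0 and 1")
    # Two-sided normal quantile, e.g. about 1.96 for probability = 0.95.
    z = NormalDist().inv_cdf(0.5 + probability / 2)
    # The difference of two independent results has SD = sqrt(2) * sd.
    return z * math.sqrt(2) * sd


if __name__ == "__main__":
    # Limit expressed in units of the single-result standard deviation.
    print(round(repeatability_limit(1.0), 2))
```

Changing `probability` shows point (iii) directly: the limit, and hence the verdict on whether two results agree, shifts with the chosen level.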
Reproducibility studies may be seen as a type of equivalence trial (see Figure 1). Briefly, in classic superiority trials, we pose a statistical null hypothesis of no difference, which we then seek to reject in order to conclude that a difference exists. In equivalence trials, we define a (narrow) zone around a zero difference (between, say, our new drug and an existing one) and we establish equivalence if the entire confidence interval of the estimated difference lies inside that zone. In this article, we propose to replace the difference of zero by the (pooled) value of (the) previous study or studies (vertical line in Figure 1) and to require that the confidence interval of the reproducibility study lies inside the zone around that value. The width of the grey equivalence zone or "zone of reproducibility" is crucial, and it seems sensible to define it pragmatically for each research situation separately. Without concrete ideas about the maximal width of this zone, judgments of when a result counts as reproduced can be quite subjective. For example, Begley and Ellis considered positive results as not reproduced if the replicate findings were not sufficiently robust to drive a drug-development program. Ioannidis considered the results of a therapeutic intervention as reproduced if the researcher's final interpretation of the data in both studies was that the intervention was effective (or ineffective). Figure 1, however, shows that even when one has strictly defined the width of the zone and a suitable type of confidence interval, undecided outcomes may still occur (situations 5-7, Figure 1).
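The decision rule described above can be sketched in a few lines of code. This is an illustration under stated assumptions, not a prescribed implementation: the names `Verdict` and `classify` are hypothetical, the zone of reproducibility is taken to be symmetric (previous value ± δ), and a confidence limit that falls exactly on a zone boundary is counted as inside the zone.

```python
from enum import Enum


class Verdict(Enum):
    REPRODUCED = "reproduced"
    UNCERTAIN = "uncertain"
    NOT_REPRODUCED = "not reproduced"


def classify(previous: float, delta: float,
             ci_low: float, ci_high: float) -> Verdict:
    """Compare a new study's confidence interval with the zone of reproducibility.

    previous : (pooled) estimate from the earlier study or studies
    delta    : predefined half-width of the zone (assumed symmetric here)
    ci_low, ci_high : limits of the new study's confidence interval
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")

    zone_low, zone_high = previous - delta, previous + delta

    # Entire interval inside the zone (scenarios 1-4 in Figure 1).
    if zone_low <= ci_low and ci_high <= zone_high:
        return Verdict.REPRODUCED
    # Entire interval outside the zone (scenarios 8-9 in Figure 1).
    if ci_high < zone_low or ci_low > zone_high:
        return Verdict.NOT_REPRODUCED
    # Interval overlaps a zone boundary (scenarios 5-7 in Figure 1).
    return Verdict.UNCERTAIN


if __name__ == "__main__":
    # Made-up numbers: previous estimate 0.50, delta 0.10.
    for low, high in [(0.45, 0.55), (0.55, 0.70), (0.70, 0.90)]:
        print((low, high), classify(0.50, 0.10, low, high).value)
```

The function deliberately returns three outcomes rather than two, because an interval that straddles a boundary supports neither conclusion.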
Reproducibility studies imply healthy scepticism: "Can we reproduce this finding?" In contrast with the comment cited above, which states that irreproducible results are biased, we emphasize that (ir)reproducibility of results says nothing about the validity of either the previous or the current findings. For that, we need (validity) judgments about the rigor of study design and execution. Meta-analyses of many small, but concordant, studies that were subsequently negated by the result of a single megatrial (believed by many to represent the truth) illustrate this situation 9.
In conclusion, the concept of reproducibility (repeatability, precision) should be distinguished from validity ("truth"). Furthermore, an equivalence trials framework can be fruitfully used to clarify the concept of reproducibility if we replace the (narrow) equivalence zone around a zero difference with a zone of reproducibility around (a) previous finding(s). Care should be exercised when selecting sensible margins (δ) to decide on reproducibility of results 10. Note that two components are subjective: (1) the choice of δ, although preferably it should be chosen with a thorough understanding of the theory or application of the research problem, and (2) the type of confidence interval, since choices other than a 95% CI may be possible and defensible. Note also that, even after δ and the type of confidence interval have been chosen, uncertainty may persist if confidence limits overlap the boundaries of the zone defined by δ.
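To make the second subjective component tangible, the sketch below computes confidence intervals at several levels for one new result and checks each against the same zone. All numbers are made up for illustration, the helper names `normal_ci` and `inside_zone` are hypothetical, and the intervals rest on an assumed normal approximation (estimate ± z × standard error).

```python
from statistics import NormalDist


def normal_ci(estimate: float, standard_error: float,
              level: float = 0.95) -> tuple[float, float]:
    """Two-sided confidence interval under a normal approximation (assumed)."""
    if standard_error < 0:
        raise ValueError("standard_error must be non-negative")
    if not 0 < level < 1:
        raise ValueError("level must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error


def inside_zone(ci: tuple[float, float], previous: float, delta: float) -> bool:
    """True if the whole interval lies within previous +/- delta."""
    low, high = ci
    return previous - delta <= low and high <= previous + delta


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the effect of the level.
    previous, delta = 0.0, 1.0
    estimate, standard_error = 0.3, 0.5
    for level in (0.95, 0.90, 0.68):
        ci = normal_ci(estimate, standard_error, level)
        print(f"{level:.0%} CI: ({ci[0]:.2f}, {ci[1]:.2f})",
              "inside zone:", inside_zone(ci, previous, delta))
```

With these inputs the same result lies entirely inside the zone only at the narrowest level, which is why the level should be fixed in advance together with δ.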

Figure 1. Analogy between equivalence trials framework and reproducibility (concordance): 9 examples. Numbers in brackets refer to the 9 scenarios; horizontal lines are xx% confidence intervals (CI), where xx = 95, 90, 68, etc.; short vertical lines depict point estimates; the grey area signifies the zone of reproducibility; delta (δ) refers to the maximal absolute value below which reproducibility (concordance with (an) existing finding(s)) is deemed present. Scenarios 1-4: reproducibility is present since the new point estimate and its entire xx% CI lie within the grey zone; scenarios 5-6: presence of reproducibility is uncertain since the point estimate lies inside the grey zone, but part of its xx% CI does not; scenario 7: presence of reproducibility is uncertain since the point estimate lies outside the grey zone, but part of its xx% CI lies inside; scenarios 8-9: absence of reproducibility since the point estimates and corresponding xx% CIs lie outside the grey zone. Note that two components are subjective: (1) the choice of δ, although preferably it should be chosen with a thorough understanding of the theory or application of the research problem, and (2) the type of confidence interval, since choices other than a 95% CI may be possible and defensible. Note also that, even after δ and the type of confidence interval have been chosen, uncertainty may persist if confidence limits overlap the boundaries of the zone defined by δ.