Keywords
null hypothesis significance testing, tutorial, p-value, reporting
Null hypothesis significance testing (NHST) is a method of statistical inference by which an observation is tested against a hypothesis of no effect or no relationship. The method as practiced nowadays is a combination of the concepts of critical rejection regions developed by Neyman & Pearson (1933) and the p-value developed by Fisher (1959).
The method developed by Fisher (1959) allows computation of the probability of observing a result at least as extreme as a test statistic (e.g. a t value), assuming the null hypothesis is true. This p-value thus reflects the conditional probability of obtaining the observed outcome or a larger one, p(Obs|H0), and is equal to the area under the null probability distribution curve over the corresponding tails, for example [-∞, -t] and [+t, +∞] for a two-tailed t-test (Turkheimer et al., 2004). For Fisher, the smaller the p-value, the stronger the evidence against the null hypothesis. He intended it, however, only as an indication that there is something in the data that deserves further investigation, because only H0 is tested while the effect under study is not itself being investigated.
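As a minimal illustration with assumed numbers (a hypothetical t value and degrees of freedom), the two-tailed p-value is simply the area under the null t distribution beyond the observed statistic:

```python
# Minimal sketch with assumed values: the two-tailed p-value as the area under
# the null t distribution beyond the observed statistic, p(Obs or more extreme | H0).
from scipy import stats

t_obs = 2.3   # hypothetical observed t value
df = 18       # hypothetical degrees of freedom (e.g. n1 + n2 - 2)

# area in [t_obs, +inf) plus the mirror area in (-inf, -t_obs]
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)
print(f"p(Obs or more extreme | H0) = {p_two_tailed:.4f}")
```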
The p-value is not the probability of the null hypothesis being true, p(H0) (Krzywinski & Altman, 2013). This common misconception arises from confusing the probability of an observation given the null, p(Obs|H0), with the probability of the null given an observation, p(H0|Obs) (see Nickerson (2000) for a detailed demonstration). Nor is the p-value an indication of the strength or magnitude of an effect. Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is wrong, since the p-value is conditioned on H0. Similarly, 1-p is not the probability of replicating an effect. A small p-value is often taken to mean a strong likelihood of getting the same results on another try, but this cannot be ascertained from the p-value because it is not informative about the effect itself (Miller, 2009). If there is no effect, we should replicate the absence of effect with a probability equal to 1-p. The total probability of false positives can also be obtained by aggregating results (Ioannidis, 2005). If there is an effect, however, the probability of replication is a function of the (unknown) population effect size, and there is no good way to know this from a single experiment (Killeen, 2005). Finally, a (small) p-value is not an indication favouring a given hypothesis. A low p-value indicates a misfit of the null hypothesis to the data; it cannot be taken as evidence in favour of a specific alternative hypothesis any more than of other possible alternatives such as measurement error or selection bias (Gelman, 2013). The more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm (Krzywinski & Altman, 2013; Nuzzo, 2014). As Nickerson (2000) puts it, 'theory corroboration requires the testing of multiple predictions because the chance of getting statistically significant results for the wrong reasons in any given case is high'.
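The point that p is not informative about replication can be illustrated with a small simulation, using assumed values for the sample size and the true standardized effect: even when an effect genuinely exists, the p-value obtained in one experiment varies widely across exact replications.

```python
# Illustrative simulation (hypothetical parameters): even with a real effect,
# the p-value varies widely across exact replications, so a single small p
# is not the probability of replicating the result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect = 20, 0.5          # assumed sample size and standardized effect
n_replications = 10_000

pvals = np.empty(n_replications)
for i in range(n_replications):
    sample = rng.normal(true_effect, 1.0, size=n)
    pvals[i] = stats.ttest_1samp(sample, 0.0).pvalue

print(f"proportion of replications with p < .05: {np.mean(pvals < 0.05):.2f}")
print(f"5th-95th percentile of p-values: {np.percentile(pvals, [5, 95])}")
```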
Neyman & Pearson (1933) introduced the notion of critical regions: intervals of test statistic values which, under the null hypothesis, occur with a probability no greater than a stipulated significance level, α. If the observed statistic falls within such an interval, it is deemed significantly different from the value expected under the null hypothesis. For instance, we can determine a critical interval, say [+2, +∞], such that the probability of an F value falling within it under H0 is less than 5%. Because the space of results is dichotomized, we can distinguish correct decisions (rejecting H0 when there is an effect and not rejecting H0 when there is no effect) from errors (rejecting H0 when there is no effect and not rejecting H0 when there is an effect). When there is no effect (H0 is true), the erroneous rejection of H0 is known as a type I error, and its long-run rate is controlled by the significance level α.
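A sketch of this decision rule, using hypothetical degrees of freedom and an assumed observed F value, is:

```python
# Sketch of the Neyman-Pearson decision rule with assumed numbers: the critical
# value is the point beyond which the statistic falls with probability alpha
# under H0; the observed statistic is then compared against it.
from scipy import stats

alpha = 0.05
df1, df2 = 1, 28                            # hypothetical degrees of freedom
f_crit = stats.f.ppf(1 - alpha, df1, df2)   # right-tail critical value

f_obs = 4.6                                 # hypothetical observed F value
reject_h0 = f_obs > f_crit
print(f"critical region: [{f_crit:.2f}, +inf); reject H0: {reject_h0}")
```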
The significance level α is defined as the maximum probability that a test statistic falls into the rejection region when the null hypothesis is true (Johnson, 2013). When the test statistic falls outside the critical region(s), all we can say is that no significant effect was observed; one cannot conclude that the null hypothesis is true, i.e. we cannot accept H0. There is a profound difference between accepting the null hypothesis and simply failing to reject it (Killeen, 2005). By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot, from a non-significant result, argue against a theory. We cannot accept the null hypothesis, because all we have done is fail to disprove it. To be able either to accept or to reject the null hypothesis, Bayesian approaches (Dienes, 2014; Kruschke, 2011) or confidence intervals must be used.
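As one possible sketch of the Bayesian route (not the specific procedures of Dienes (2014) or Kruschke (2011)), a Bayes factor for a one-sample test of a zero mean can be approximated from BIC values; the helper function and data below are purely illustrative.

```python
# Rough sketch of quantifying evidence *for* H0 via the common BIC
# approximation to the Bayes factor (illustrative helper and data only).
import numpy as np

def approx_bf01(y):
    """BIC-approximate Bayes factor in favour of H0: mu = 0."""
    y = np.asarray(y, dtype=float)
    n = y.size
    rss0 = np.sum(y ** 2)               # H0 model: mean fixed at 0 (1 free parameter: sigma)
    rss1 = np.sum((y - y.mean()) ** 2)  # H1 model: mean estimated (2 free parameters)
    bic0 = n * np.log(rss0 / n) + 1 * np.log(n)
    bic1 = n * np.log(rss1 / n) + 2 * np.log(n)
    return np.exp((bic1 - bic0) / 2)    # BF01 > 1 favours H0, < 1 favours H1

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=30)    # hypothetical data with no true effect
print(f"approximate BF01 = {approx_bf01(data):.2f}")
```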
Confidence intervals (CIs) have been advocated as alternatives to p-values because (i) they allow statistical significance to be judged and (ii) they provide estimates of effect size. CIs fail to cover the true value at a rate of alpha, the type I error rate (Morey & Rouder, 2011), and therefore indicate whether values can be rejected by a two-tailed test with a given alpha. CIs also indicate the precision of the effect size estimate but, unless a percentile bootstrap approach is used, they require distributional assumptions that can lead to serious biases, in particular regarding the symmetry and width of the intervals (Wilcox, 2012). Assuming the CI (a)symmetry and width are correct, this gives some indication of the likelihood that a similar value will be observed in future studies, with a 95% CI giving about an 83% chance of replication success (Lakens & Evers, 2014). Finally, contrary to p-values, CIs can be used to accept H0. Typically, if a CI includes 0, we cannot reject H0. If a critical null region is specified rather than a single point estimate, for instance [-2, +2], and the CI falls entirely within that region, then H0 can be accepted. Importantly, the critical region must be specified a priori and cannot be determined from the data themselves.
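A minimal percentile-bootstrap sketch, with hypothetical paired differences and an assumed a priori null region of [-2, +2], shows how a CI can be built without distributional assumptions and then compared against the null region:

```python
# Percentile-bootstrap sketch (hypothetical data and null region):
# 95% CI for a mean difference, then a check against a pre-specified null region.
import numpy as np

rng = np.random.default_rng(2)
diff = rng.normal(0.4, 3.0, size=40)       # hypothetical paired differences

n_boot = 10_000
boot_means = np.array([rng.choice(diff, size=diff.size, replace=True).mean()
                       for _ in range(n_boot)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

null_region = (-2.0, 2.0)                  # must be specified a priori
within_null = null_region[0] <= ci_low and ci_high <= null_region[1]
print(f"95% bootstrap CI: [{ci_low:.2f}, {ci_high:.2f}]; inside null region: {within_null}")
```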
Although CIs provide more information, they are no less subject to interpretation errors (see Savalei & Dunn, 2015 for a review). People often interpret an X% CI as meaning that the parameter (e.g. the mean) will fall into that interval X% of the time. The (posterior) probability of an effect cannot, however, be obtained using a frequentist framework. The CI represents the bounds for which one has X% confidence. The correct interpretation is that, for repeated measurements with the same sample sizes taken from the same population, X% of the CIs obtained will contain the true parameter value, e.g. X% of the time the CI contains the true mean (Tan & Tan, 2010). The alpha value has the same interpretation as when testing H0, i.e. we accept that (1-alpha) CIs will be wrong alpha percent of the time. This implies that CIs do not allow one to make strong statements about the parameter of interest (e.g. the mean difference) or about H1 (Hoekstra et al., 2014). To make a statement about the probability of a parameter of interest, likelihood intervals (maximum likelihood) and credible intervals (Bayes) are better suited.
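The long-run interpretation can be checked by simulation, here with an assumed population mean, standard deviation and sample size: across repeated samples, roughly 95% of the 95% CIs contain the true (fixed) mean, while any single interval either does or does not.

```python
# Coverage sketch with assumed parameters: the 95% statement is about the
# long-run behaviour of the CI procedure, not about any single interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sigma, n, n_experiments = 10.0, 2.0, 25, 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * se
    m = sample.mean()
    covered += (m - half_width <= true_mean <= m + half_width)

print(f"observed coverage: {covered / n_experiments:.3f}")   # close to 0.95
```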
NHST has long been criticized, and yet it is still used every day in scientific reports (Nickerson, 2000). Many of the disagreements are not about the method itself but about its use. The question one should ask is: what is the goal of the scientific experiment at hand? If the goal is to establish the likelihood of an effect and/or establish a pattern of order, then NHST is a good tool, because both require ruling out equivalence (Frick, 1996). If the goal is to establish some quantitative values, then NHST is not the method of choice. Because results are conditioned on H0, NHST cannot be used to establish beliefs. To estimate the probability of a hypothesis, a Bayesian analysis is a better alternative. To estimate parameters (point estimates and variances), alternative approaches are also better suited. Note, however, that even when a specific quantitative prediction from a hypothesis is shown to be true (typically by testing H1 using Bayes), this does not prove the hypothesis itself; it only adds to its plausibility.
Considering that quantitative reports will always carry more information than binary (significant or not) reports, one can always argue that effect size, power, etc. must be reported. Reporting everything can, however, hinder the communication of the main result(s), and we should aim to give only the information needed, at least in the core of a manuscript. Here I propose to adopt minimal reporting in the results section to keep the message clear, complemented by detailed supplementary material. When the hypothesis is about the presence/absence or order of an effect, it is sufficient to report the actual p-value in the text, since it conveys the information needed to rule out equivalence. When the hypothesis and/or the discussion involve some quantitative value, and because p-values do not inform on the effect, it is essential to report effect sizes (Lakens, 2013), preferably accompanied by confidence, likelihood or credible intervals depending on the question at hand. The reasoning is simply that one cannot predict and/or discuss quantities without accounting for variability. For the reader to understand and fully appreciate the results, nothing else is needed.
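As an illustration of such reporting, the sketch below computes an effect size (Cohen's d, pooled-SD version) with a percentile-bootstrap CI alongside the p-value, for hypothetical two-group data:

```python
# Reporting sketch (hypothetical two-group data): effect size with a
# percentile-bootstrap CI, reported alongside the p-value.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(4)
group_a = rng.normal(0.6, 1.0, size=30)    # assumed data
group_b = rng.normal(0.0, 1.0, size=30)

d_obs = cohens_d(group_a, group_b)
boot_d = [cohens_d(rng.choice(group_a, 30, replace=True),
                   rng.choice(group_b, 30, replace=True)) for _ in range(5000)]
ci = np.percentile(boot_d, [2.5, 97.5])
p = stats.ttest_ind(group_a, group_b).pvalue
print(f"d = {d_obs:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], p = {p:.3f}")
```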
Because scientific progress is obtained by accumulating evidence (Rosenthal, 1991), scientists should also consider the secondary use of their data. With today's electronic articles, there is no reason not to include all derived data: means, standard deviations, effect sizes, CIs and Bayes factors should always be included as supplementary tables (or, even better, the raw data should also be shared). It is also essential to report the context in which tests were performed, that is, all of the tests performed (all t, F and p values), because of the increase in type I error rate due to selective reporting (the multiple comparisons problem; Ioannidis, 2005). Providing all of this information allows (i) other researchers to compare their results directly and effectively in quantitative terms (replication of effects beyond significance; Open Science Collaboration, 2015), (ii) power to be computed for future studies (Lakens & Evers, 2014), and (iii) results to be aggregated for meta-analyses.
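The inflation of the type I error rate under selective reporting can be illustrated by simulation, with assumed numbers of tests and observations: when 20 independent null comparisons are run per study, the chance of at least one false positive is far above the nominal 5%.

```python
# Simulation sketch (assumed numbers): with many tests per study, the
# family-wise false positive rate greatly exceeds the nominal 5% per test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_tests, n, n_sims = 20, 30, 5_000      # 20 independent null comparisons per "study"

at_least_one_fp = 0
for _ in range(n_sims):
    pvals = [stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue for _ in range(n_tests)]
    at_least_one_fp += min(pvals) < 0.05

print(f"family-wise false positive rate: {at_least_one_fp / n_sims:.2f}")  # ~0.64 for 20 tests
```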