Keywords
COVID-19, school closure, Japan, causal inference, reanalysis
In a paper recently published in Nature Medicine, Fukumoto et al. tried to assess the government-led school closure policy during the early phase of the COVID-19 pandemic in Japan. They compared reported incidence rates between municipalities that had and had not implemented school closure in selected periods from March–May 2020, matching on various potential confounders, and claimed that there was no causal effect on the incidence rates of COVID-19. However, the effective sample size (ESS) of their dataset was substantially reduced in the process of matching due to imbalanced covariates between the treatment (i.e. with closure) and control (without closure) municipalities, which led to wide uncertainty in the estimates. Despite the study title starting with “No causal effect of school closures”, their results are insufficient to exclude the possibility of a strong mitigating effect of school closure on the incidence of COVID-19. In this replication/reanalysis study, we showed that the confidence intervals of the effect estimates from Fukumoto et al. included a 100% relative reduction in COVID-19 incidence. Simulations assuming a hypothetical 50% or 80% mitigating effect rarely yielded statistical significance with the same study design and sample size. We also showed that matching on variables that had a large influence on propensity scores (e.g. prefecture dummy variables) may have been incomplete.
I have included further discussion in response to the peer review comments, notably the two additional issues in Fukumoto et al. raised by the reviewers: the choice of estimand for the main analysis (ATC instead of ATT) and potential residual confounding. The overall conclusion of the article remains the same.
A paper recently published in Nature Medicine by Fukumoto et al. attempted to assess the government-led school closure policy during the early phase of the COVID-19 pandemic in Japan. They compared reported incidence rates between municipalities that had and had not implemented school closures in selected periods from March–May 2020, matching on various potential confounders, and claimed that they found no causal effect on the incidence rate of COVID-19. Before the emergence of COVID-19, school closure as a means of outbreak control had been studied mostly for influenza, with generally low-to-moderate effects suggested, whereas the evidence for other respiratory infections, including coronavirus diseases, has been limited (Viner et al., 2020). Decisions sometimes need to be made without sufficient evidence in the earliest phase of a pandemic; nonetheless, such decisions should undergo retrospective policy assessment to provide insights and refinement for future pandemic responses.
One of the challenges in this type of analysis of the early COVID-19 epidemic in Japan is the limited statistical power due to low case counts. During the first wave of the epidemic from February to June 2020, which overlapped with the study period of Fukumoto et al., Japan never observed more than 1,000 COVID-19 cases per day. As a result, of the 79,989 municipality-level daily counts from the 847 municipalities included, 99.9% were fewer than 10 cases per day (Figure S2 of the original study). Moreover, the matching technique used to minimise confounding has a known side effect of limiting statistical power, especially when there is little overlap in the covariates between arms (King et al., 2017).
Unfortunately, the analysis in Fukumoto et al. appears to suffer from these issues. The study title states “No causal effect”, a rather strong claim given the substantial uncertainty in their estimates. As the saying goes, “absence of evidence is not evidence of absence”: when the uncertainty range covers practically meaningful values, one should not prematurely conclude that there is “no effect” merely because the effect estimates are statistically insignificant. Here I highlight limitations of the analysis and discuss possible factors that may have rendered the study underpowered.
The original study measures the effect of school closures as the absolute difference in incidence rates between the treatment and control municipalities. However, the theoretical grounds for assuming a fixed additive effect of school closures on the per-capita incidence rate are unclear. Infectious disease risks are inherently dynamic: more current infections in a population result in a greater risk of infection among susceptible individuals through increased encounters with infectious others. This means that the effect of school closures, which were intended to reduce contacts at schools, should also depend on the baseline incidence in the population, because the risk of infection averted is the reduction in contacts multiplied by the probability that those contacts would otherwise have been with infectious individuals. Effect estimates relative to the baseline incidence would therefore be a more relevant and interpretable measure for assessing the practical use of the policy. It should also be noted that, since incidence rates can only take non-negative values, the absolute mitigating effect of school closure can be no larger than the average incidence rate in the control group.
I rescaled the reported average treatment effects (average treatment effect on the controls: ATC; and average treatment effect on the treated: ATT) and their confidence intervals relative to the average outcome (incidence rate per capita) in the control group (Figure 1). The confidence intervals of the relative ATC and ATT cover most of the range from a 100% reduction to a 100% elevation, suggesting the underpowered nature of the original study. An effect of a 50% reduction (i.e. a relative effect of -50%), which most experts would agree is of practical significance, or even a complete reduction (i.e. -100%), was within the confidence intervals over a substantial part of the period of interest. The effective sample size (ESS; a proxy measure for the amount of information contained in weighted samples (Shook-Sa & Hudgens)) of the matched arms, around 40–50 (Figure 1d), was likely insufficient to reach statistical significance because the incidence of infectious diseases typically exhibits higher dispersion than independent and identically distributed settings due to its self-exciting nature (i.e. an increase in cases induces a further increase via transmission).
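To make these two quantities concrete, the short sketch below (my illustration, not the original replication code) computes the ESS of a weighted sample with the standard formula and rescales an absolute effect estimate and its confidence interval by the control-group mean outcome; all numbers are hypothetical.

```python
# A minimal sketch of the two quantities discussed above: the effective sample
# size of a weighted/matched sample, ESS = (sum w)^2 / sum(w^2), and the
# rescaling of an absolute effect estimate by the control-group mean outcome.
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def to_relative_effect(estimate, ci_lower, ci_upper, control_mean):
    """Express an absolute ATC/ATT and its CI relative to the control mean."""
    return tuple(x / control_mean for x in (estimate, ci_lower, ci_upper))

# Hypothetical example: matching with replacement concentrates weight on a few
# municipalities, shrinking the ESS well below the nominal sample size.
w = np.array([1.0] * 10 + [8.0] * 2)   # 12 samples, 2 of them heavily reused
print(effective_sample_size(w))        # ~4.9, far below 12

# A hypothetical absolute ATC of -1 case per million with a wide CI becomes
# -50% [-150%, +50%] relative to a control-group incidence of 2 per million.
print(to_relative_effect(-1e-6, -3e-6, 1e-6, 2e-6))
```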
The turquoise vertical lines represent the date of treatment (school closure). The black lines and shaded areas represent the mean effect and 95% confidence intervals, respectively. (a) Relative ATC for the closure as of April 6, 2020. (b) Relative ATC for the closure as of April 10, 2020. (c) Relative ATT for the closure as of April 6, 2020. (d) Comparison of sample sizes: the number of all samples included for matching, the number of unique samples matched to at least one other sample, and the effective sample size (ESS) of the matched samples.
To further examine the statistical power of the study, I artificially modified the dataset such that school closure had a 50% or 80% mitigating effect on the incidence rate per capita. From the treatment reference date (April 6) onwards, the expected incidence rate of each municipality in the treatment group was set to 50%/20% of that of the matched control municipality, with Poisson noise (see Extended data: Supplementary document for details). The results suggested that, even with as much as a 50%/80% mitigating effect, the approach in the original study might not have reached statistical significance (Figure 2). The absolute ATT for the 50% mitigating effect (Figure 2b) appears similar to what was referred to as “no effect” in the original study. The ATT for the 80% mitigating effect was also statistically insignificant (Figure 2c and 2d), suggesting that the study was underpowered to detect even moderate-to-high mitigating effects, had they existed. The ATC estimates yielded similarly insignificant or barely significant patterns (Figure 3).
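For concreteness, the sketch below shows the core of this hypothetical-effect simulation as I reconstruct it; the function and variable names are my own and the interface is simplified relative to the actual analysis code (the exact procedure is in the Extended data: Supplementary document).

```python
# A minimal sketch of the hypothetical-effect simulation described above.
import numpy as np

rng = np.random.default_rng(2020)

def simulate_mitigated_counts(control_rate, treated_pop, effect=0.5):
    """Redraw a treated municipality's post-closure daily counts assuming a
    relative mitigating effect (0.5 for 50%, 0.8 for 80%).

    control_rate : per-capita daily incidence of the matched control municipality
    treated_pop  : population of the treated municipality
    """
    mean_counts = (1.0 - effect) * np.asarray(control_rate) * treated_pop
    return rng.poisson(mean_counts)  # Poisson noise around the reduced mean

# Hypothetical example: a matched control running at 2 cases/day per 100,000,
# with an assumed 80% mitigating effect in the treated municipality.
control_rate = np.full(30, 2e-5)  # 30 post-closure days
print(simulate_mitigated_counts(control_rate, treated_pop=300_000, effect=0.8))
```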
(a) The average outcome (incidence per capita) of the matched treatment (black) and control (red) groups for closure as of April 6, 2020. (b) Absolute ATT estimates (black line) and 95% confidence intervals (shaded area) for closure as of April 6. (c) Relative ATT estimates and 95% confidence intervals for closure as of April 6. (d)–(f) The same as (a)–(c) for closure as of April 10.
(a) The average outcome (incidence per capita) of the unmatched treatment (dashed), matched treatment (black) and control (red) groups for closure as of April 6, 2020. (b) Absolute ATC estimates (black line) and 95% confidence intervals (shaded area) for closure as of April 6. (c) Relative ATC estimates and 95% confidence intervals for closure as of April 6. (d)–(f) The same as (a)–(c) for closure as of April 10.
I also noticed that the propensity scores computed for one of the subanalyses, inverse-probability weighting, exhibited substantial or complete “separation” (Heinze & Schemper, 2002), and most samples were essentially lost due to the substantial imbalance in the assigned weights (Figure 4). Although separation of propensity scores can arise from overfitting, in this case it remained (albeit slightly ameliorated) even after addressing overfitting by Lasso regularisation (Figure 5). This indicates that the treatment assignment may have been nearly deterministic in the dataset, which can compromise the performance of quasi-experimental causal inference via “positivity violation” (Petersen et al., 2020).
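The sketch below illustrates these diagnostics, with scikit-learn standing in for the estimation pipeline actually used: fitting an L1-regularised (Lasso) propensity model by cross-validation, then flagging separation and the resulting ESS loss under inverse-probability weighting. All names are illustrative; X is assumed to be a standardised covariate matrix and `treated` a 0/1 assignment vector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_propensity_lasso(X, treated):
    """Cross-validated L1-penalised logistic regression for propensity scores."""
    model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=10,
                                 max_iter=10_000)
    return model.fit(X, treated).predict_proba(X)[:, 1]

def ipw_atc_diagnostics(ps, treated, eps=1e-12):
    """Crude separation check plus the ESS of the ATC weights.

    For the ATC, controls keep weight 1 and treated units are weighted by
    (1 - e) / e; near-deterministic scores blow these weights up and the
    ESS of the treated group collapses.
    """
    share_extreme = np.mean((ps < 0.01) | (ps > 0.99))  # fraction of extreme scores
    w = np.where(treated == 1, (1 - ps) / np.maximum(ps, eps), 1.0)
    w_t = w[treated == 1]
    ess_treated = w_t.sum() ** 2 / (w_t ** 2).sum()
    return share_extreme, ess_treated
```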
(a) Balance of propensity scores before and after matching for school closure as of April 6, 2020. (b) Balance of propensity scores before and after matching for school closure as of April 10, 2020. (c) All and effective sample sizes and the maximum weight among the samples. An effective sample size of NaN indicates that all samples received zero weights.
(a) The average outcome (incidence per capita) of the unmatched treatment (dashed), matched treatment (black) and control (red) groups for closure as of April 6, 2020. (b) Absolute ATC estimates (black line) and 95% confidence intervals (shaded area) for closure as of April 6. (c) Result of 10-fold cross-validation. The x-axis represents the logarithm of the regularisation coefficient λ for each model; the number of included variables is displayed above the panel. The left dotted vertical line denotes the selected model with the best cross-validation performance, and the right dotted line the most parsimonious model within one standard error of the best model's performance (shown for reference). (d) Balance of propensity scores before and after matching. (e)–(h) The same as (a)–(d) for closure as of April 10. (i) All and effective sample sizes and the maximum weight among the samples.
The authors did not use propensity scores in the Mahalanobis distance-based genetic matching for the main analysis, contrary to the general recommendation (Diamond & Sekhon, 2013). (They cite King & Nielsen, 2019 as a reason not to use propensity scores; however, those authors clarify that their criticism does not apply to genetic matching.) This means that covariates that strongly determined the treatment assignment may not have received large weights (and were therefore not prioritised) in the matching process, which could leave unadjusted bias arising from these potential confounders. For example, many regression coefficients for prefecture dummy variables had large values (~5 or larger) in the Lasso-regularised model, whereas 236 out of 483 matched pairs of municipalities in the original analysis for April 6 were from different prefectures. The robustness to these concerns can be assessed by computing the ESS from another genetic matching run that includes propensity scores and a calliper (to ensure that matched pairs have sufficiently similar features), which I report in the next section.
I reanalysed the original dataset with the genetic matching algorithm incorporating propensity scores and a calliper, and estimated ATCs for school closures as of April 6 and 10, 2020. Propensity scores were estimated by a Lasso-regularised linear regression model and included in genetic matching with a calliper of 0.25 (Rosenbaum & Rubin, 1985). The results remained statistically insignificant, and the confidence intervals for the relative effects covered most of the range from -100% to 100%, although the direction of the weak trend for closure as of April 6 was reversed from the original study (Figure 6). The ESS of the matched treatment group was only 7 and 3.8 for April 6 and April 10, respectively, indicating that the results relied on a small set of samples that were repeatedly used in matching. Genetic matching is a generalisation of propensity score and Mahalanobis distance matching that searches for optimal covariate balance, and should thus achieve no worse balance than matching on Mahalanobis distance alone (Diamond & Sekhon, 2013). The substantial loss of ESS in the updated genetic matching with propensity scores suggests that improved matching required more samples to be discarded, and that both the original and current results are likely unreliable.
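As a simplified illustration of the calliper idea (not the genetic matching implementation used in the reanalysis), the sketch below matches each control to its nearest treated unit on the propensity score and discards matches beyond the calliper; this discarding, combined with matching with replacement, is what drives the ESS down when few treated units are comparable.

```python
# A minimal sketch of calliper matching on the propensity score, following
# Rosenbaum & Rubin (1985); genetic matching additionally optimises covariate
# balance, so this is only a simplified stand-in. Names are illustrative.
import numpy as np

def caliper_match_atc(logit_ps, treated, caliper_sd=0.25):
    """Match each control to the nearest treated unit within the calliper.

    The calliper is a multiple of the standard deviation of the (logit)
    propensity score. Matching is with replacement: when only a few treated
    units are comparable to the controls, those few are matched repeatedly
    and the ESS of the matched treatment group collapses.
    Returns the control indices and, for each, the matched treated index
    (-1 if no treated unit lies within the calliper).
    """
    caliper = caliper_sd * np.std(logit_ps)
    t_idx = np.flatnonzero(treated == 1)
    c_idx = np.flatnonzero(treated == 0)
    matches = np.full(c_idx.size, -1)
    for i, c in enumerate(c_idx):
        dist = np.abs(logit_ps[t_idx] - logit_ps[c])
        nearest = dist.argmin()
        if dist[nearest] <= caliper:
            matches[i] = t_idx[nearest]
    return c_idx, matches
```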
(a) The average outcome (incidence per capita) of the unmatched treatment (dashed black), matched treatment (solid black) and control (red) groups for closure as of April 6, 2020. (b) Absolute ATC estimates (black line) and 95% confidence intervals (shaded area) for closure as of April 6. (c) Relative ATC estimates and 95% confidence intervals for closure as of April 6. (d)–(f) The same as (a)–(c) for closure as of April 10.
The reanalysis of Fukumoto et al. suggested that the study was inherently underpowered to identify causal effects of school closure on COVID-19. While I recognise the importance of their attempt to assess the school closure policy given the collateral effects imposed on students and their families, I argue that their conclusion of “no causal effect” was not well supported by the data owing to the limited statistical power. Finding no mitigating effect would not in itself be surprising, as children were not the centre of the outbreak, especially in the earliest phase (Davies et al., 2020); nonetheless, evidence claiming “no effect” would need to show that effects were at least below the level of practical significance.
In addition to the issue of insufficient statistical power demonstrated in the present reanalysis, two further issues were raised during the peer review process of this article. For one, the authors’ choice of ATC as the main estimand may have been suboptimal, as Shiba pointed out in his comment (Shiba, 2022). The control group in the original study may have consisted of municipalities that did not need school closures because of low incidence. The ATC in this context would represent the effect in settings where the policy was not needed, which has limited policy implications. To argue against school closures as a control policy, the authors would need to robustly show an insufficient effect of the policy even in municipalities for which school closure had been a selectable option (possibly because of a higher incidence rate, where an effective policy could be more impactful). For the other, residual confounding may have remained among the matched samples. Both Shiba (2022) and Hayashi (2022) expressed concern about the apparent positive effect on the incidence rate (i.e. increased incidence) immediately after the implementation of school closures in the treated group, which Fukumoto et al. left unexplained. Unless a plausible causal mechanism by which school closures could increase COVID-19 incidence is provided, this gap between the treated and control groups may indicate residual bias, which would be unsurprising given my reanalysis results suggesting matching failure. Hayashi additionally suggested that the trend in incidence (e.g. increasing or decreasing) may be a potential confounder that was not adjusted for in the original study (Hayashi, 2022).
Altogether, these limitations illustrate the difficulty of post-hoc causal analysis of mass interventions implemented without a built-in evaluation design such as randomisation. The fact that even the reasonably designed approach of Fukumoto et al. suffers from insufficient power emphasises the importance of the “evidence-generating” philosophy in policy planning, as has been promoted in medicine (Embi & Payne, 2013).
This study did not generate original data. The underlying dataset is available from the repository associated with the original study:
Harvard Dataverse. Replication Data for: No causal effect of school closures in Japan on the spread of COVID-19 in spring 2020. DOI: https://doi.org/10.7910/DVN/N803UQ (Fukumoto et al. 2021a).
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver.
Replication code along with the full analysis report (Extended data: Supplementary document) is available from a GitHub repository: https://github.com/akira-endo/reanalysis_Fukumoto2021.
An archived version of the above repository at the time of publication is available from: Zenodo. akira-endo/reanalysis_Fukumoto2021: ‘Not finding causal effect’ is not ‘finding no causal effect’ of school closure on COVID-19. DOI: https://doi.org/10.5281/zenodo.6457916 (Endo, 2022).
This project contains the following data:
- main.html/main.ipynb (Extended data: Supplementary document).
- replication code and data from the original study (Fukumoto et al. 2021a), partially modified and reused.
- replication codes for the analysis conducted in this study.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).