Keywords
COVID-19, Epidemic outbreak, Probability forecast, Public health, Mathematical biology
This article is included in the Emerging Diseases and Outbreaks gateway.
This article is included in the Bioinformatics gateway.
COVID-19, Epidemic outbreak, Probability forecast, Public health, Mathematical biology
Statistical aspects of coronavirus disease 2019 (COVID-19) and other similar recent epidemics were receiving much attention in the modern research community.1 Generally, it is quite challenging to calculate realistic biological system reliability factors and outbreak probabilities under actual epidemic conditions by using conventional theoretical statistical methods.1–14 The latter is usually due to many degrees of system freedom and random variables governing dynamic biological systems, spread over extensive terrain. In principle, the reliability of a complex biological system may be accurately estimated straightforwardly by having enough measurements or by direct Monte Carlo simulations.10 For COVID-19, however, the only available observation numbers are limited as the observations are only available from the beginning of the year 2020 up to now. Motivated by the latter argument, the authors have introduced a novel reliability method for biological and health systems to predict and manage epidemic outbreaks more accurately, this study was focused on COVID-19 epidemics in northern China, with focus on cross-correlations between different provinces within same climatic zone. For other studies related to statistical variations per country see,15 where spatial lags were addressed. China was chosen to test the methodology proposed in this study, because of its COVID-19 origin and extensive health observations and related research available online.15–29
Statistical modelling of lifetime data or extreme value theory (EVT) is widespread in medicine and engineering. For example, Gumbel used EVT to estimate the demographic of various populations in.30 Similarly, in Ref. 31, the author builds on this EVT theory to determine the fitness effect using a Beta-Burr distribution. While in Ref. 32, the author discusses using a bivariate logistic regression model, which was then used to access multiple sclerosis patients with walking disabilities and in a cognitive experiment for visual recognition. Finally, Ref. 33 is a paper of relevance, which used EVT to estimate the probability of an influenza outbreak in China. The author demonstrated a forecasting prediction potential amid the epidemic in this paper. While in Ref. 34 similarly used EVT to predict and detect anomalies of influenza epidemics.
In this paper epidemic outbreak is viewed as unexpected incident that may occur at any province of a given country at any time, therefore spatial spread is accounted for. Moreover, a specific non-dimensional factor is introduced to predict the latter epidemic risk at any time and any place.
Biological systems are subjected to ergodic environmental influences. The other alternative is to view the process as being dependent on specific environmental parameters whose variation in time may be modelled as an ergodic process on its own. The incidence data of COVID-19 in all provinces of the People's Republic of China (PRC) from February 2020 until today were retrieved from the official public PRC health website, for simplicity only northern provinces were selected for this study. As this dataset is organized by province (more than 30 provinces in China), the biological system under consideration can be regarded as a multi-degree of freedom (MDOF) dynamic system with highly inter-correlated regional components/dimensions. Some recent studies have already used statistical tools to predict COVID-19 development, for linear log model see Ref. 9, these studies however did not address fully dynamic space-time dynamic bio-system as this study does.
Note that while this study aims at reducing risk of future epidemic outbreaks by predicting them, it is solely focused on daily registered patient numbers and not on symptoms themselves. For long-lasting COVID-19 symptoms, the so-called “long COVID”, and its risk factors and whether it is possible to predict a protracted course early in the disease, see Ref. 35, for mortality research see Ref. 36.
This section presents novel bio-system reliability method, presently invented by authors and previously not existing. Advocated methodology used generalized extreme value (GEV) theory as a basis for its buildup. For numerical part authors used commercial software MATLAB, (Mathworks, V 8.6), namely its optimization routines, otherwise authors have used extrapolation code available from ACER. Only the code available from ACER was used to complete all sections of the methods – both numerical part as well as final extrapolation.
The novel method introduces MDOF (multi-degree of freedom) health response vector process that was measured over a sufficiently long period of time. Unidimensional global maxima over the entire time span denoted as , , .
Let be time local maxima of the process consequent in time, recorded at discrete time instants that are monotonously increasing in. A similar definition follows1 for other MDOF response components with and so on. For simplicity, all components, and therefore its maxima are assumed to be non-negative.
The target is to estimate system failure probability, in other words the probability of exceedance
More specifically, the moment when either exceeds , or exceeds , or exceeds , and so on, the system is regarded as immediately failed. Fixed failure levels , , ,…are, of course, individual for each unidimensional response component of . , , , and so on.
Now, the local maxima time instants are sorted in monotonously non-decreasing order into one single merged time vector . Note that , . In this case represents local maxima of one of MDOF structural response components either or , or and so on. That means that having time record, one just needs continuously and simultaneously screen for unidimensional response component local maxima and record its exceedance of MDOF limit vector in any of its components Local unidimensional response component maxima are merged into one temporal non-decreasing vector following the merged time vector , see Figure 1. That is to say, each local maxima is, in fact, actual encountered local maxima corresponding to either or , or and so on. Finally, the unified limit vector is introduced with each component is either , or and so on, depending on which of or , or etc., corresponds to the current local maxima with the running index.
Now, the scaling parameter is introduced to, artificially, simultaneously decrease limit values for all response components, namely the new MDOF limit vector with , , , … is introduced, see Ref. 37. The unified limit vector is introduced with each component is either , or and so on. The latter automatically defines probability as a function of , note that from Equation (1). Non-exceedance probability can be introduced as follows
Next, cascade of approximations which is based on conditioning is briefly outlined. In practice,11–14,38 the dependence between the neighboring is not negligible; thus, the following one-step (will be called conditioning level ) memory approximation is introduced
where (will be called conditioning level ), and so on. The idea is to monitor each independent failure that happened locally first in time, thus avoiding cascading local inter-correlated exceedances.
Equation (4) exhibits subsequent refinements with respect to the statistical independence assumption. These approximations increasingly accurately approximate statistical dependence between neighboring maxima. Since the original MDOF process has been assumed ergodic and thus stationary, the probability for is independent of , however dependent on the conditioning level . Thus, non-exceedance probability may be approximated as in the average conditional exceedance rate method, see Refs. 30, 37 for more details on exceedance probability
Note that Equation (5) follows from Equation (1) if neglecting, as design failure probability must be epsilon order o(1), with . Equation (5) is analogous to the well-known mean up-crossing rate equation for the stochastic process probability of exceedance.10,37 There is convergence with respect to , called here conditioning level
Note that Equation (5) for is equivalent to a well-known non-exceedance probability relationship with the mean up-crossing rate function
where denotes the mean up-crossing rate of the response level for the above assembled non-dimensional vector assembled from scaled MDOF system response . The mean up-crossing rate is given by the Rice's formula given in Equation (7) with being joint probability density for with being time derivative. Equation (7) relies on the Poisson assumption that is up-crossing events of high levels (in this paper, it is ) can be assumed to be independent. The latter may not be the case for narrowband responses and higher-level dynamical systems that exhibit cascading failures in different dimensions, subsequent in time, caused by intrinsic inter-dependency between extreme events, manifesting itself in the appearance of highly correlated local maxima clusters within the assembled vector .
In the above, the stationarity assumption has been used. However, the proposed methodology can also treat the nonstationary case. For nonstationary case, the scattered diagram of seasonal epidemic conditions, each short-term seasonal state has the probability, so that. Next, let one introduce the long-term equation
with being the same function as in Equation (6) but corresponding to a specific short-term seasonal epidemic state with the number .
Note that the accuracy of the suggested approach for a large variety of one-dimensional dynamic systems was successfully verified by authors in previous years.10,37
Introduced by Equation (5) functions are regular in the tail, specifically for values of approaching and exceeding. More precisely, for , the distribution tail behaves similar to with being suitably fitted constants for suitable tail cut-on value. Therefore, one can write
Next, by plotting versus , often nearly perfectly linear tail behaviour is observed. Optimal values of the parameters may also be determined applying sequential quadratic programming (SQP) methods, incorporated in NAG Numerical Library.39 Methods described above have been applied as described in methods section. Authors used MATLAB (Mathworks, V 8.6) (RRID:SCR_001622) commercial tool as a basis for their numerical purposes. For more specific author developed code routines, related to the extrapolation method by Equation (9), see ACER. Note that ACER is a repository, containing not only the code, but user manual, examples and references. In this study only extrapolation part of ACER was used. In other words current study presents novel theoretical methodology, but using ACER software previously developed by some of the authors.
Methods described in this paper are novel and state of art. Prediction of influenza type epidemics has long been the focus of attention in mathematical biology and epidemiology. It is known that public health dynamics is a seasonally and spatially varying dynamic system that is always challenging to analyse. This section illustrates the efficiency of the above-described methodology using the new method applied to the real-life COVID-19 data sets, presented as a new daily recorded infected patient time series, spread over different regions.
COVID-19 and influenza are contagious diseases with high transmissibility and ability to spread. Seasonal influenza epidemics caused by influenza A and B viruses typically occur annually during winter, presenting a burden on worldwide public health, resulting in around 3–5 million cases of severe illness and 250,000–500,000 deaths worldwide each year, according to the World Health Organization (WHO).3
This section presents a real-life application of the above-described method. The statistical data in the present section are taken from the official website of the National Health Commission of the people's Republic of China. The website provides the number of newly diagnosed cases every day in 13 administrative regions in northern China from 22 January 2020 to 6 April 2022. Patient numbers from thirteen different Chinese administrative regions were chosen as components thus constituting an example of a thirteen dimensional (13D) dynamic biological system.
In order to unify all three measured time series , the following scaling was performed as follows
Next, all local maxima from three measured time series were merged into one single time series by keeping them in time non-decreasing order: with the whole vector being sorted according to non-decreasing times of occurrence of these local maxima.
Figure 2 presents new daily recorded patients number plotted as a time-space 2D surface using MATLAB. Figure 3 presents the number of new daily recorded patients as a 13D vector , consisting of assembled regional new daily patient numbers. Note that vector does not have physical meaning on its own, as it is assembled of different regional components with different epidemic backgrounds. Index is just a running index of local maxima encountered in a non-decreasing time sequence.
Left: as it is, Right: scaled by Equation (10).
Figure 4 presents 100 years return level extrapolation according to Equation (9) towards epidemic outbreak with 100 year return period, indicated by the horizontal dotted line, and somewhat beyond, cut-on value was used. Dotted lines indicate extrapolated 95% confidence interval according to Equation (10). According to Equation (5) is directly related to the target failure probability from Equation (1). Therefore, in agreement with Equation (5), system failure probability can be estimated. Note that in Equation (5), corresponds to the total number of local maxima in the unified response vector.Conditioning parameter was found to be sufficient due to occurrence of convergence with respect to , see Equation (6). Figure 4 exhibits reasonably narrow 95% CI. The latter is an advantage of the proposed method.
Extrapolated 95% CI indicated by dotted lines, using https://github.com/cran/acer.
Note that while being novel, the above-described methodology has a clear advantage of utilizing available measured data set quite efficiently due to its ability to treat health system multi-dimensionality and perform accurate extrapolation based on quite limited data set. Note that, predicted non-dimensional level, indicated by star in Figure 4, represents probability of epidemic outbreak at any northern province in China in the years to come.
Traditional health bio-systems reliability methods dealing with observed/measured time series do not have the ability to efficiently deal with high dimensionality and cross-correlation between different system responses. The main advantage of the introduced methodology is its ability to study high dimensional non-linear dynamic systems reliability.
Despite the simplicity, the present study successfully offers a novel multidimensional modelling strategy and a methodological avenue to implement the forecasting of an epidemic during its course, if it is assumed to be stationary in time. Proper setting of epidemiological alarm limits (failure limits) per province has been discussed, see Section Use case.
This paper studied recorded COVID-19 patient numbers from thirteen different Chinese northern provinces, constituting an example of a thirteen dimensional (13D) and ten-dimensional (10D) dynamic biological system respectively observed in 2020-2022. The novel reliability method was applied to new daily patient numbers as a multidimensional system in real-time.30
The main conclusion is that if the public health system under local environmental and epidemiologic conditions in northern China is well managed. Predicted 100 year return period risk level of epidemic outbreak is reasonably low. However, there is an ultra-low risk of a future epidemic outbreak in both countries, at least in 100 years horizon.
This study outlines a general-purpose, robust and straightforward multidimensional bio-system reliability method. The method introduced in this study has been previously successfully validated by application to a wide range of engineering models,30,40–49 but only for one and two-dimensional system responses and, in general, very accurate predictions were obtained. Both measured and simulated time series responses can be analysed using the proposed method. It is shown that the method produced an acceptable 95% confidence interval, see Figure 4. Thus, the suggested methodology may be used as a tool in various non-linear dynamic biological systems reliability studies. The presented COVID-19 example does not limit potential areas of new method applicability by any means.
The datasets analyzed during the current study are publicly available from the daily recorded patients dataset in China during 2020-2022 years, are available at http://www.nhc.gov.cn/. Daily recorded patients data was structured per province and per calendar day, namely it was straightforward to extract and systematize joint statistical distribution as a function of both space and time.
Source code along with demo and user manual and examples used for extrapolation available from: ACER this third-party software is under license GPL-3.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new method (or application) clearly explained?
No
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
No
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Mathematical Biology
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Mathematical and statistical modeling of infectious disease
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||||
---|---|---|---|---|
1 | 2 | 3 | 4 | |
Version 3 (revision) 21 Nov 23 |
read | read | ||
Version 2 (revision) 22 May 23 |
read | |||
Version 1 10 Nov 22 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)