ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

A novel statistical method for long-term coronavirus modelling

[version 1; peer review: 1 approved, 1 not approved]
PUBLISHED 10 Nov 2022
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Emerging Diseases and Outbreaks gateway.

This article is included in the Bioinformatics gateway.

Abstract

Background: Novel coronavirus disease has been recently a concern for worldwide public health. To determine epidemic rate probability at any time in any region of interest, one needs efficient bio-system reliability approach, particularly suitable for multi-regional environmental and health systems, observed over a sufficient period of time, resulting in a reliable long-term forecast of novel coronavirus infection rate. Traditional statistical methods dealing with temporal observations of multi-regional processes do not have the multi-dimensionality advantage, that suggested methodology offers, namely dealing efficiently with multiple regions at the same time and accounting for cross-correlations between different regional observations.
Methods: Modern multi-dimensional novel statistical method was directly applied to raw clinical data, able to deal with territorial mapping. Novel reliability method based on statistical extreme value theory has been suggested to deal with challenging epidemic forecast. Authors used MATLAB optimization software.
Results: This paper described a novel bio-system reliability approach, particularly suitable for multi-country environmental and health systems, observed over a sufficient period of time, resulting in a reliable long-term forecast of extreme novel coronavirus death rate probability.  Namely, accurate maximum recorded patient numbers are predicted for the years to come for the analyzed provinces.
Conclusions:  The suggested method performed well by supplying not only an estimate but 95% confidence interval as well.  Note that suggested methodology is not limited to any specific epidemics or any specific terrain, namely its truly general. The only assumption and limitation is bio-system stationarity, alternatively trend analysis should be performed first.  The suggested methodology can be used in various public health applications, based on their clinical survey data.

Keywords

COVID-19, Epidemic outbreak, Probability forecast, Public health, Mathematical biology

Introduction

Statistical aspects of coronavirus disease 2019 (COVID-19) and other similar recent epidemics were receiving much attention in the modern research community.1 Generally, it is quite challenging to calculate realistic biological system reliability factors and outbreak probabilities under actual epidemic conditions by using conventional theoretical statistical methods.114 The latter is usually due to many degrees of system freedom and random variables governing dynamic biological systems, spread over extensive terrain. In principle, the reliability of a complex biological system may be accurately estimated straightforwardly by having enough measurements or by direct Monte Carlo simulations.10 For COVID-19, however, the only available observation numbers are limited as the observations are only available from the beginning of the year 2020 up to now. Motivated by the latter argument, the authors have introduced a novel reliability method for biological and health systems to predict and manage epidemic outbreaks more accurately, this study was focused on COVID-19 epidemics in northern China, with focus on cross-correlations between different provinces within same climatic zone. For other studies related to statistical variations per country see,15 where spatial lags were addressed. China was chosen to test the methodology proposed in this study, because of its COVID-19 origin and extensive health observations and related research available online.1529

Statistical modelling of lifetime data or extreme value theory (EVT) is widespread in medicine and engineering. For example, Gumbel used EVT to estimate the demographic of various populations in.30 Similarly, in Ref. 31, the author builds on this EVT theory to determine the fitness effect using a Beta-Burr distribution. While in Ref. 32, the author discusses using a bivariate logistic regression model, which was then used to access multiple sclerosis patients with walking disabilities and in a cognitive experiment for visual recognition. Finally, Ref. 33 is a paper of relevance, which used EVT to estimate the probability of an influenza outbreak in China. The author demonstrated a forecasting prediction potential amid the epidemic in this paper. While in Ref. 34 similarly used EVT to predict and detect anomalies of influenza epidemics.

In this paper epidemic outbreak is viewed as unexpected incident that may occur at any province of a given country at any time, therefore spatial spread is accounted for. Moreover, a specific non-dimensional factor λ is introduced to predict the latter epidemic risk at any time and any place.

Biological systems are subjected to ergodic environmental influences. The other alternative is to view the process as being dependent on specific environmental parameters whose variation in time may be modelled as an ergodic process on its own. The incidence data of COVID-19 in all provinces of the People's Republic of China (PRC) from February 2020 until today were retrieved from the official public PRC health website, for simplicity only northern provinces were selected for this study. As this dataset is organized by province (more than 30 provinces in China), the biological system under consideration can be regarded as a multi-degree of freedom (MDOF) dynamic system with highly inter-correlated regional components/dimensions. Some recent studies have already used statistical tools to predict COVID-19 development, for linear log model see Ref. 9, these studies however did not address fully dynamic space-time dynamic bio-system as this study does.

Note that while this study aims at reducing risk of future epidemic outbreaks by predicting them, it is solely focused on daily registered patient numbers and not on symptoms themselves. For long-lasting COVID-19 symptoms, the so-called “long COVID”, and its risk factors and whether it is possible to predict a protracted course early in the disease, see Ref. 35, for mortality research see Ref. 36.

Methods

This section presents novel bio-system reliability method, presently invented by authors and previously not existing. Advocated methodology used generalized extreme value (GEV) theory as a basis for its buildup. For numerical part authors used commercial software MATLAB, (Mathworks, V 8.6), namely its optimization routines, otherwise authors have used extrapolation code available from ACER. Only the code available from ACER was used to complete all sections of the methods – both numerical part as well as final extrapolation.

The novel method introduces MDOF (multi-degree of freedom) health response vector process Rt=XtYtZt that was measured over a sufficiently long period of time0T. Unidimensional global maxima over the entire time span 0T denoted as XTmax=max0tTXt, YTmax=max0tTYt, ZTmax=max0tTZt,.

Let X1,,XNX be time local maxima of the process Xt consequent in time, recorded at discrete time instants t1X<<tNXX that are monotonously increasing in0T. A similar definition follows1 for other MDOF response components Yt,Zt, with Y1,,YNY; Z1,,ZNZ and so on. For simplicity, all Rt components, and therefore its maxima are assumed to be non-negative.

The target is to estimate system failure probability, in other words the probability of exceedance

(1)
1P=ProbXTmax>ηXYTmax>ηYZTmax>ηZ
where P=000ηXηYηZpXTmax,YTmax,ZTmax,XTmaxYTmaxZTmaxdXTmaxdYNYmaxdZNzmax being the probability of non-exceedance for critical values of response componentsηX, ηY, ηZ,…; denotes logical unity operation «or»; and pXTmax,YTmax,ZTmax, being joint probability density of the global maxima over the entire period 0T. However, it is not feasible to estimate the latter joint probability distribution directly due to its high dimensionality and available dataset limitations.

More specifically, the moment when either Xt exceeds ηX, or Yt exceeds ηY, or Zt exceeds ηZ, and so on, the system is regarded as immediately failed. Fixed failure levels ηX, ηY, ηZ,…are, of course, individual for each unidimensional response component of Rt. XNXmax=maxXjj=1NX=XTmax, YNYmax=maxYjj=1NY=YTmax, ZNzmax=maxZjj=1NZ=ZTmax, and so on.

Now, the local maxima time instants t1X<<tNXXt1Y<<tNYYt1Z<<tNZZ are sorted in monotonously non-decreasing order into one single merged time vector t1tN. Note that tN=maxtNXXtNYYtNZZ, N=NX+NY+NZ+. In this case tj represents local maxima of one of MDOF structural response components either Xt or Yt, or Zt and so on. That means that having Rt time record, one just needs continuously and simultaneously screen for unidimensional response component local maxima and record its exceedance of MDOF limit vector ηXηYηZ in any of its components X,Y,Z, Local unidimensional response component maxima are merged into one temporal non-decreasing vector R=R1R2RN following the merged time vector t1tN, see Figure 1. That is to say, each local maxima Rj is, in fact, actual encountered local maxima corresponding to either Xt or Yt, or Zt and so on. Finally, the unified limit vector η1ηN is introduced with each component ηj is either ηX, ηY or ηZ and so on, depending on which of Xt or Yt, or Zt etc., corresponds to the current local maxima with the running indexj.

cb33d9b2-b5a9-4fb3-9597-3c62f95aed42_figure1.gif

Figure 1. Illustration on how two exemplary processes X and Y are combined into new synthetic vector R.

Now, the scaling parameter0<λ1 is introduced to, artificially, simultaneously decrease limit values for all response components, namely the new MDOF limit vector ηXληYληzλ with ηXλληX, ληY, ηzλληZ, … is introduced, see Ref. 37. The unified limit vector η1ληNλ is introduced with each component ηjλ is either ηXλ, ηYλor ηzλ and so on. The latter automatically defines probability Pλ as a function of λ, note that PP1 from Equation (1). Non-exceedance probability Pλ can be introduced as follows

(2)
Pλ=ProbRNηNλR1η1λ=ProbRNηNλRN1ηN1λ,,R1η1λ}·ProbRN1ηN1λR1η1λ=j=2NProbRjηjλRj1η1jλR1η1λ·ProbR1η1λ

Next, cascade of approximations which is based on conditioning is briefly outlined. In practice,1114,38 the dependence between the neighboring Rj is not negligible; thus, the following one-step (will be called conditioning level k=1) memory approximation is introduced

(3)
ProbRjηjλRj1ηj1λR1η1λProbRjηjλRj1ηj1λ
for 2jN (conditioning level k=2). The approximation introduced by Equation (3) can be further expressed as
(4)
ProbRjηjλRj1ηj1λR1η1λProbRjηjλRj1ηj1λRj2ηj2λ

where 3jN (will be called conditioning level k=3), and so on. The idea is to monitor each independent failure that happened locally first in time, thus avoiding cascading local inter-correlated exceedances.

Equation (4) exhibits subsequent refinements with respect to the statistical independence assumption. These approximations increasingly accurately approximate statistical dependence between neighboring maxima. Since the original MDOF process Rt has been assumed ergodic and thus stationary, the probability pkλProbRj>ηjλRj1ηj1λ,Rjk+1ηjk+1λ} for jk is independent of j, however dependent on the conditioning level k. Thus, non-exceedance probability may be approximated as in the average conditional exceedance rate method, see Refs. 30, 37 for more details on exceedance probability

(5)
PkλexpNpkλ,k1.

Note that Equation (5) follows from Equation (1) if neglectingProbR1η1λ1, as design failure probability must be epsilon order o(1), with Nk. Equation (5) is analogous to the well-known mean up-crossing rate equation for the stochastic process probability of exceedance.10,37 There is convergence with respect to k, called here conditioning level

(6)
P=limkPk1;pλ=limkpkλ

Note that Equation (5) for k=1 is equivalent to a well-known non-exceedance probability relationship with the mean up-crossing rate function

(7)
Pλexpν+λT;ν+λ=0ζpRṘλζ

where ν+λ denotes the mean up-crossing rate of the response level λ for the above assembled non-dimensional vector Rt assembled from scaled MDOF system response XηXYηYZηZ. The mean up-crossing rate is given by the Rice's formula given in Equation (7) with pRṘ being joint probability density for RṘ with Ṙ being time derivativeRt. Equation (7) relies on the Poisson assumption that is up-crossing events of high λ levels (in this paper, it is λ1) can be assumed to be independent. The latter may not be the case for narrowband responses and higher-level dynamical systems that exhibit cascading failures in different dimensions, subsequent in time, caused by intrinsic inter-dependency between extreme events, manifesting itself in the appearance of highly correlated local maxima clusters within the assembled vector R=R1R2RN.

In the above, the stationarity assumption has been used. However, the proposed methodology can also treat the nonstationary case. For nonstationary case, the scattered diagram of m=1,..,M seasonal epidemic conditions, each short-term seasonal state has the probabilityqm, so thatm=1Mqm=1. Next, let one introduce the long-term equation

(8)
pkλm=1Mpkλmqm

with pkλm being the same function as in Equation (6) but corresponding to a specific short-term seasonal epidemic state with the number m.

Note that the accuracy of the suggested approach for a large variety of one-dimensional dynamic systems was successfully verified by authors in previous years.10,37

Implementing extrapolation method

Introduced by Equation (5) functions pkλ are regular in the tail, specifically for values ofλ approaching and exceeding1. More precisely, for λλ0, the distribution tail behaves similar to exp+bc+d with a,b,c,d being suitably fitted constants for suitable tail cut-on λ0value. Therefore, one can write

(9)
pkλexpakλ+bkck+dk,λλ0

Next, by plotting lnlnpkλdk versus lnakλ+bk, often nearly perfectly linear tail behaviour is observed. Optimal values of the parameters ak,bk,ck,pk,qk may also be determined applying sequential quadratic programming (SQP) methods, incorporated in NAG Numerical Library.39 Methods described above have been applied as described in methods section. Authors used MATLAB (Mathworks, V 8.6) (RRID:SCR_001622) commercial tool as a basis for their numerical purposes. For more specific author developed code routines, related to the extrapolation method by Equation (9), see ACER. Note that ACER is a repository, containing not only the code, but user manual, examples and references. In this study only extrapolation part of ACER was used. In other words current study presents novel theoretical methodology, but using ACER software previously developed by some of the authors.

Ethical consideration

Authors confirm that all methods were performed in accordance with the relevant guidelines and regulations according to the Declarations of Helsinki.

Use case

Methods described in this paper are novel and state of art. Prediction of influenza type epidemics has long been the focus of attention in mathematical biology and epidemiology. It is known that public health dynamics is a seasonally and spatially varying dynamic system that is always challenging to analyse. This section illustrates the efficiency of the above-described methodology using the new method applied to the real-life COVID-19 data sets, presented as a new daily recorded infected patient time series, spread over different regions.

COVID-19 and influenza are contagious diseases with high transmissibility and ability to spread. Seasonal influenza epidemics caused by influenza A and B viruses typically occur annually during winter, presenting a burden on worldwide public health, resulting in around 3–5 million cases of severe illness and 250,000–500,000 deaths worldwide each year, according to the World Health Organization (WHO).3

This section presents a real-life application of the above-described method. The statistical data in the present section are taken from the official website of the National Health Commission of the people's Republic of China. The website provides the number of newly diagnosed cases every day in 13 administrative regions in northern China from 22 January 2020 to 6 April 2022. Patient numbers from thirteen different Chinese administrative regions were chosen as components X,Y,Z, thus constituting an example of a thirteen dimensional (13D) dynamic biological system.

In order to unify all three measured time series X,Y,Z, the following scaling was performed as follows

(10)
XXηX,YYηY,ZZηZ,
making all three responses non-dimensional and having the same failure limit equal to 1. Failure limits ηX,ηY,ηZ,, or in other words, epidemic thresholds, are not set values and must be decided. The simplest choice would be for different countries to set failure limits equal to the percentage per corresponding country population, making X,Y,Z, equal to percentage of daily infected per country. In this study, however, twice maxima of daily infected per country have been chosen as failure limits. Note that failure limits may be chosen differently for different dynamic bio-systems. Although the latter choice obviously introduces bias (accumulation point) at λ=0.5 if the number of countrys is large, in this study the number of regions is not that large (below 20 national regions, see Ref. 30 for similar USA study) and proper extrapolation technique may easily circumvent the above mentioned accumulation point bias.

Next, all local maxima from three measured time series were merged into one single time series by keeping them in time non-decreasing order: R=maxX1Y1Z1maxXNYNZN with the whole Rvector being sorted according to non-decreasing times of occurrence of these local maxima.

Figure 2 presents new daily recorded patients number plotted as a time-space 2D surface using MATLAB. Figure 3 presents the number of new daily recorded patients as a 13D vector R, consisting of assembled regional new daily patient numbers. Note that vector R does not have physical meaning on its own, as it is assembled of different regional components with different epidemic backgrounds. Index j is just a running index of local maxima encountered in a non-decreasing time sequence.

cb33d9b2-b5a9-4fb3-9597-3c62f95aed42_figure2.gif

Figure 2. New daily recorded patients number plotted as a 2D time-space surface, data was according to http://www.nhc.gov.cn.

cb33d9b2-b5a9-4fb3-9597-3c62f95aed42_figure3.gif

Figure 3. Number of new daily recorded patients as 13D vector R.

Left: as it is, Right: scaled by Equation (10).

Figure 4 presents 100 years return level extrapolation according to Equation (9) towards epidemic outbreak with 100 year return period, indicated by the horizontal dotted line, and somewhat beyond, λ=0.1 cut-on value was used. Dotted lines indicate extrapolated 95% confidence interval according to Equation (10). According to Equation (5) pλ is directly related to the target failure probability 1P from Equation (1). Therefore, in agreement with Equation (5), system failure probability 1P1Pk1 can be estimated. Note that in Equation (5), N corresponds to the total number of local maxima in the unified response vectorR.Conditioning parameterk=3 was found to be sufficient due to occurrence of convergence with respect to k, see Equation (6). Figure 4 exhibits reasonably narrow 95% CI. The latter is an advantage of the proposed method.

cb33d9b2-b5a9-4fb3-9597-3c62f95aed42_figure4.gif

Figure 4. 100 years return level (horizontal dotted line) extrapolation of pkλ towards critical (system failure) level (indicated by star) and beyond.

Extrapolated 95% CI indicated by dotted lines, using https://github.com/cran/acer.

Note that while being novel, the above-described methodology has a clear advantage of utilizing available measured data set quite efficiently due to its ability to treat health system multi-dimensionality and perform accurate extrapolation based on quite limited data set. Note that, predicted non-dimensional λ level, indicated by star in Figure 4, represents probability of epidemic outbreak at any northern province in China in the years to come.

Discussion

Traditional health bio-systems reliability methods dealing with observed/measured time series do not have the ability to efficiently deal with high dimensionality and cross-correlation between different system responses. The main advantage of the introduced methodology is its ability to study high dimensional non-linear dynamic systems reliability.

Despite the simplicity, the present study successfully offers a novel multidimensional modelling strategy and a methodological avenue to implement the forecasting of an epidemic during its course, if it is assumed to be stationary in time. Proper setting of epidemiological alarm limits (failure limits) per province has been discussed, see Section Use case.

This paper studied recorded COVID-19 patient numbers from thirteen different Chinese northern provinces, constituting an example of a thirteen dimensional (13D) and ten-dimensional (10D) dynamic biological system respectively observed in 2020-2022. The novel reliability method was applied to new daily patient numbers as a multidimensional system in real-time.30

The main conclusion is that if the public health system under local environmental and epidemiologic conditions in northern China is well managed. Predicted 100 year return period risk level λ of epidemic outbreak is reasonably low. However, there is an ultra-low risk of a future epidemic outbreak in both countries, at least in 100 years horizon.

This study outlines a general-purpose, robust and straightforward multidimensional bio-system reliability method. The method introduced in this study has been previously successfully validated by application to a wide range of engineering models,30,4049 but only for one and two-dimensional system responses and, in general, very accurate predictions were obtained. Both measured and simulated time series responses can be analysed using the proposed method. It is shown that the method produced an acceptable 95% confidence interval, see Figure 4. Thus, the suggested methodology may be used as a tool in various non-linear dynamic biological systems reliability studies. The presented COVID-19 example does not limit potential areas of new method applicability by any means.

Data availability

The datasets analyzed during the current study are publicly available from the daily recorded patients dataset in China during 2020-2022 years, are available at http://www.nhc.gov.cn/. Daily recorded patients data was structured per province and per calendar day, namely it was straightforward to extract and systematize joint statistical distribution as a function of both space and time.

Software availability

Source code along with demo and user manual and examples used for extrapolation available from: ACER this third-party software is under license GPL-3.

Comments on this article Comments (0)

Version 3
VERSION 3 PUBLISHED 10 Nov 2022
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Gaidai O, Yan P, Xing Y et al. A novel statistical method for long-term coronavirus modelling [version 1; peer review: 1 approved, 1 not approved]. F1000Research 2022, 11:1282 (https://doi.org/10.12688/f1000research.125924.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 10 Nov 2022
Views
35
Cite
Reviewer Report 21 Feb 2023
Adnan Khan, Department of Mathematics, Lahore University of Management Sciences, Lahore, Pakistan 
Not Approved
VIEWS 35
The authors want to propose a new methodology to make reliable long term forecast of COVID-19 infection rates, in particular they want to consider multidimensional data that may have cross correlations.  

Comments:

The ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Khan A. Reviewer Report For: A novel statistical method for long-term coronavirus modelling [version 1; peer review: 1 approved, 1 not approved]. F1000Research 2022, 11:1282 (https://doi.org/10.5256/f1000research.138279.r157163)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 22 May 2023
    Oleg Gaidai, Engineering Research Center of Marine Renewable Energy, Shanghai Ocean University, Shanghai, China
    22 May 2023
    Author Response
    Response to reviewer comments

    Note:  since publication time of this methodology in 2022, a number of other applications of this method has been published, to mention few:

    1) ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 22 May 2023
    Oleg Gaidai, Engineering Research Center of Marine Renewable Energy, Shanghai Ocean University, Shanghai, China
    22 May 2023
    Author Response
    Response to reviewer comments

    Note:  since publication time of this methodology in 2022, a number of other applications of this method has been published, to mention few:

    1) ... Continue reading
Views
29
Cite
Reviewer Report 29 Nov 2022
Oluwakemi E Abiodun, Department of Physical Sciences, Landmark University, Omu-Aran, Nigeria 
Approved
VIEWS 29
This study came up with a new way to look at the dependability of biosystems. This method was tested for a long enough time to make a credible long-term prediction of the probability of the extreme new coronavirus mortality rate. ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Abiodun OE. Reviewer Report For: A novel statistical method for long-term coronavirus modelling [version 1; peer review: 1 approved, 1 not approved]. F1000Research 2022, 11:1282 (https://doi.org/10.5256/f1000research.138279.r155545)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 3
VERSION 3 PUBLISHED 10 Nov 2022
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.