Keywords
Linear regression model, Ordinary Least Squares estimator, Ridge regression, K-L estimator, Higher Heating Value, Proximate analysis.
Considering the general linear regression model

y = Xβ + ε, (1)

such that ε is normally distributed with mean 0 and variance σ²I, where I is the identity matrix; y is an n × 1 vector of the dependent variable, X is an n × p matrix of the independent variables, and β is a p × 1 vector of unknown regression parameters of interest. The method of ordinary least squares (OLS) is well known and generally accepted for estimating the parameters (β's) in the linear regression model. The OLS estimator is defined as:

β̂ = H⁻¹X′y,

where H = X′X, and β̂ is normally distributed, that is, β̂ ~ N(β, σ²H⁻¹). However, when the OLS estimator is applied to a model in which the independent variables are correlated, the variances of the regression estimates become inflated1,2. This relationship between the independent variables is referred to as multicollinearity3,4.
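To make the variance inflation concrete, the following minimal sketch fits OLS on a nearly collinear synthetic design and inspects the diagonal of H⁻¹, which scales the coefficient variances. This is an illustration in Python with NumPy (the paper's own analysis used R), and the data and coefficients are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
z = rng.standard_normal((n, 3))
# Synthetic design: the first two columns are nearly collinear.
X = np.column_stack([z[:, 0], z[:, 0] + 0.01 * z[:, 1], z[:, 2]])
beta = np.array([1.0, 2.0, 0.5])
y = X @ beta + rng.standard_normal(n)

H = X.T @ X                                # H = X'X
beta_ols = np.linalg.solve(H, X.T @ y)     # OLS estimator H^{-1} X'y

# Cov(beta_ols) = sigma^2 H^{-1}: near-singularity of H inflates the
# diagonal entries tied to the collinear columns.
var_scale = np.diag(np.linalg.inv(H))
```

Here `var_scale[0]` and `var_scale[1]` (the collinear pair) come out orders of magnitude larger than `var_scale[2]`, which is the inflation the biased estimators below are designed to curb.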
In addressing the problem of multicollinearity, various biased estimators with mean square error smaller than that of the OLS estimator have been developed by different authors2–15. The limitation of these estimators is that they are biased; however, unbiased versions of some of them have been developed. The advantage of these unbiased versions is that they produce estimates similar to the OLS estimator but with better mean squared error. Crouse et al.16,17 developed the unbiased ridge and Liu estimators. Wu18 developed the unbiased version of the two-parameter estimator of Ozkale and Kaciranlar9. Lukman et al.19 developed the unbiased modified ridge-type estimator. Recently, the K-L estimator was proposed to circumvent the problem of multicollinearity in the linear regression model13. The K-L estimator is classified as a biased estimator with a single biasing parameter13.
In this study, a new unbiased technique is developed based on the K-L estimator and its properties are derived. We compared the unbiased K-L estimator with some existing techniques using the mean square error (MSE) criterion.
Hoerl and Kennard5 developed the ridge estimator to mitigate multicollinearity in the linear regression model. The ridge estimator of β with the biasing parameter k is:

β̂(k) = (H + kI)⁻¹X′y, k > 0.
The modified ridge technique was proposed with the addition of prior information6. It is expressed as follows:

β̂(k, J) = (H + kI)⁻¹(X′y + kJ),

where J is the prior information on β.
According to Crouse et al.16, the unbiased ridge estimator with the introduction of prior information J is given as

β̂_UR(k, J) = (H + kI)⁻¹(X′y + kJ),

where J and β̂ are uncorrelated and J ~ N(β, D) such that D = (σ²/k)I_p, where I_p is the p × p identity matrix. In practice, J is estimated by β̂.
The modified ridge-type estimator proposed by Lukman et al.3 is given as follows:

β̂_MRT(k, d) = A_kd β̂,

where A_kd = [H + k(1 + d)I]⁻¹H.
The unbiased modified ridge-type estimator19 was developed and defined as follows:

β̂_UMRT(k, d, J) = A_kd β̂ + (I − A_kd)J,

where A_kd = [H + k(1 + d)I]⁻¹H and J ~ N(β, D) such that D = [σ²/(k(1 + d))]I_p. Consequently, D is positive definite for k > 0, 0 < d < 1.
Recently, the K-L estimator13 was proposed and found to generally outperform the ridge regression estimator. The K-L estimator of β is defined as:

β̂_KL = A_k β̂,

where A_k = (H + kI)⁻¹(H − kI).
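As an illustration, the ridge and K-L estimators can be computed directly from these definitions. The sketch below is in Python with NumPy on synthetic data; the function names are ours, not the paper's:

```python
import numpy as np

def ols(X, y):
    # OLS estimator: (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, k):
    # Ridge estimator: (H + kI)^{-1} X'y
    H = X.T @ X
    return np.linalg.solve(H + k * np.eye(X.shape[1]), X.T @ y)

def kl(X, y, k):
    # K-L estimator: A_k beta_ols, with A_k = (H + kI)^{-1}(H - kI)
    H = X.T @ X
    I = np.eye(X.shape[1])
    A_k = np.linalg.solve(H + k * I, H - k * I)
    return A_k @ ols(X, y)

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(40)

# At k = 0 both estimators collapse to OLS; for k > 0 they shrink.
print(np.allclose(ridge(X, y, 0.0), ols(X, y)))   # True
print(np.allclose(kl(X, y, 0.0), ols(X, y)))      # True
```

Because A_k shares eigenvectors with H and has eigenvalues (eᵢ − k)/(eᵢ + k) of magnitude below one, the K-L estimate is strictly shorter than the OLS estimate for any k > 0.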
This research proposes an unbiased K-L estimator following the convex method. The convex estimator is defined as:

β̂(G, J) = Gβ̂ + (I − G)J,
where G is a p × p matrix and I is the p × p identity matrix. Thus, the MSE of β̂(G, J) is

MSE(β̂(G, J)) = σ²GH⁻¹G′ + (I − G)D(I − G)′, (11)

such that J ~ N(β, D) and J is uncorrelated with β̂.
The value of G that minimizes (11) is G = D(σ²H⁻¹ + D)⁻¹. Accordingly, D = σ²(I − G)⁻¹GH⁻¹. We observe that the convex estimator β̂(G, J) is an unbiased estimator of β and possesses minimum MSE at the optimal value of G. Consequently, the new unbiased estimator is defined as

β̂_UKL(k, J) = A_k β̂ + (I − A_k)J,
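For completeness, here is a short reconstruction of how the optimal G arises; this is our derivation sketch from the stated MSE expression, matching the quoted formulas for G and D. Minimizing the trace of the MSE over G,

```latex
\frac{\partial}{\partial G}\,
\operatorname{tr}\!\left[\sigma^{2} G H^{-1} G' + (I-G) D (I-G)'\right]
 = 2\sigma^{2} G H^{-1} - 2\,(I-G) D = 0
 \;\Longrightarrow\; G\left(\sigma^{2} H^{-1} + D\right) = D,
```

which gives G = D(σ²H⁻¹ + D)⁻¹ and, on rearranging, D = σ²(I − G)⁻¹GH⁻¹.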
where A_k = (H + kI)⁻¹(H − kI) and D = σ²(I − A_k)⁻¹A_k H⁻¹ = (σ²/2k)(H − kI)H⁻¹. Therefore, D is positive definite for 0 < k < e_min, where e_min is the smallest eigenvalue of H.
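As a numerical check, the sketch below (Python with NumPy; the function name `ukl` and the synthetic data are ours) implements β̂_UKL = A_k β̂ + (I − A_k)J. Following the unbiased-ridge literature, when J is taken to be the OLS estimate the point estimate reduces algebraically to OLS, consistent with the coincidence of the UKL and OLS coefficients reported for the real data later in the paper:

```python
import numpy as np

def ukl(X, y, k, J=None):
    # Unbiased K-L: A_k beta_hat + (I - A_k) J, A_k = (H + kI)^{-1}(H - kI).
    # J is the prior information on beta; if none is supplied we use the
    # OLS estimate, in which case the point estimate reduces to OLS.
    H = X.T @ X
    I = np.eye(X.shape[1])
    beta_hat = np.linalg.solve(H, X.T @ y)
    if J is None:
        J = beta_hat
    A_k = np.linalg.solve(H + k * I, H - k * I)
    return A_k @ beta_hat + (I - A_k) @ J

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
y = X @ np.array([0.5, 1.0, -0.5]) + rng.standard_normal(40)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The gain of the UKL estimator over OLS therefore shows up in its (co)variance under the prior J ~ N(β, D), not in a different point estimate when J = β̂.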
It can be shown that β̂_UKL(k, J) is unbiased for β. The new estimator has the following properties:
It follows from Equation (13) that the proposed estimator is unbiased; its bias is zero, which places the new estimator in the same class as the OLS estimator. This is proved as follows:
Given that there exists an orthogonal matrix Q such that Q′HQ = E = diag(e₁, e₂, ..., e_p), where eᵢ is the ith eigenvalue of H, and E and Q are the matrices of eigenvalues and eigenvectors of H, respectively, Equation (1) can be expressed canonically as:
y = Zα + ε, (15)

where Z = XQ, α = Q′β, and Z′Z = E. For Equation (15), we obtain the following representations:
Lemma 1.1.20 Let N be an n × n positive definite matrix and α be some vector; then N − αα′ is nonnegative definite if and only if α′N⁻¹α ≤ 1.
Lemma 1.2.21 Let α̂ᵢ = Cᵢy, i = 1, 2, be two linear estimators of α. Suppose that D = Cov(α̂₁) − Cov(α̂₂) > 0, where Cov(α̂ᵢ), i = 1, 2, denotes the covariance matrix of α̂ᵢ, and bias(α̂ᵢ) = bᵢ = (CᵢX − I)α, i = 1, 2. Consequently,

MSE(α̂₁) − MSE(α̂₂) > 0

if and only if b₂′[σ²D + b₁b₁′]⁻¹b₂ < 1, where MSE(α̂ᵢ) = Cov(α̂ᵢ) + bᵢbᵢ′.
Theorem 1.1. The estimator β̂_UKL(k, J) is preferred to the OLS estimator β̂ by the matrix mean square error criterion for k > 0.
Proof
Recall that,

MSE(α̂) = σ²E⁻¹ (23)

and

MSE(α̂_UKL(k, J)) = σ²(E + kI)⁻¹(E − kI)E⁻¹. (24)

The difference between (23) and (24) is as follows:

MSE(α̂) − MSE(α̂_UKL(k, J)) = σ²[E⁻¹ − (E + kI)⁻¹(E − kI)E⁻¹]. (25)

Simplifying (25) further, we observe that E⁻¹ − (E + kI)⁻¹(E − kI)E⁻¹ is positive definite, since its ith diagonal element is 2k/(eᵢ(eᵢ + k)) > 0 for k > 0.
Theorem 3.2. The estimator β̂_UKL(k, J) is preferred to the ridge estimator β̂(k) by the matrix mean square error criterion for k > 0.
Proof
MSE(α̂(k)) = σ²E(E + kI)⁻² + k²B_kαα′B_k, (26)

where B_k = (E + kI)⁻¹.
The difference between Equations (26) and (24) is as follows:

MSE(α̂(k)) − MSE(α̂_UKL(k, J)) = σ²[E(E + kI)⁻² − (E + kI)⁻¹(E − kI)E⁻¹] + k²B_kαα′B_k. (27)

Simplifying (27) further, we observe that E(E + kI)⁻² − (E + kI)⁻¹(E − kI)E⁻¹ is positive definite, since its ith diagonal element is k²/(eᵢ(eᵢ + k)²) > 0 for k > 0, while the bias term k²B_kαα′B_k is nonnegative definite.
Theorem 3.3. The estimator β̂_UKL(k, J) is preferred to the unbiased ridge estimator β̂_UR(k, J) by the matrix mean square error criterion for k > 0.
Proof
We observe that σ²(E + kI)⁻¹ − σ²(E + kI)⁻¹(E − kI)E⁻¹ is positive definite, since its ith diagonal element is σ²k/(eᵢ(eᵢ + k)) > 0 for k > 0.
Theorem 3.4. The estimator β̂_UKL(k, J) is preferred to the K-L estimator β̂_KL by the matrix mean square error criterion for k > 0.
Proof
where E_k = (E + kI)⁻¹. Consequently,
We observe that σ²(E − kI)²E⁻¹(E + kI)⁻² − σ²(E − kI)E⁻¹(E + kI)⁻¹ has ith diagonal element 2kσ²(k − eᵢ)/(eᵢ(eᵢ + k)²), and is therefore positive definite provided k > eᵢ for every i, with k > 0.
RStudio was used for both the simulation and the real-life analysis. The independent variables were generated following the study of McDonald and Galarneau22 as:

xᵢⱼ = (1 − ρ²)^(1/2) zᵢⱼ + ρ zᵢ,ₚ₊₁,  i = 1, 2, ..., n; j = 1, 2, ..., p,
where zᵢⱼ are independent standard normal pseudo-random numbers, ρ² is the correlation between any two independent variables, and p is the number of independent variables, taken as three and seven in this study. The values of ρ² considered are 0.8, 0.9, 0.99, and 0.999. For p = 3, the response variable is defined as:
yᵢ = β₁xᵢ₁ + β₂xᵢ₂ + β₃xᵢ₃ + eᵢ,  i = 1, 2, ..., n,

where eᵢ is normally distributed with mean 0 and variance σ². β is chosen such that β′β = 1, as is common in this literature23. Samples of sizes 30, 50, and 100 were used, with values of σ of 1 and 5. The mean square error is calculated as:
MSE(β̂) = (1/R) Σ_{j=1..R} Σ_{i=1..p} (β̂ᵢⱼ − βᵢ)²,

where β̂ᵢⱼ is the estimate of the ith parameter in the jth replication, βᵢ are the true parameter values, and R is the number of replications. The MSE results are presented in Table 1 and Table 2. We observed the following:
1. All the alternative techniques studied in this work outperform the OLS estimator at every level of multicollinearity.
2. The ridge estimator outperforms its unbiased version when the MSE is used as a criterion.
3. The proposed unbiased estimator (UKL) outperforms its K-L counterpart.
4. The proposed estimator generally performed better than all the other estimators considered in this work, though its performance depends on the choice of the biasing parameter.
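The Monte Carlo design described above can be sketched as follows. This is a Python/NumPy version rather than the R used in the study, and the seed, the replication count R = 200, and the fixed k = 0.5 are illustrative assumptions of ours; it contrasts OLS with ridge as one representative biased competitor:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, rho2, sigma, R = 50, 3, 0.99, 1.0, 200
k = 0.5                                   # illustrative fixed biasing parameter
beta = np.ones(p) / np.sqrt(p)            # chosen so that beta'beta = 1

def gen_X():
    # McDonald-Galarneau design: a shared component gives any two
    # columns a correlation of rho2.
    z = rng.standard_normal((n, p + 1))
    return np.sqrt(1 - rho2) * z[:, :p] + np.sqrt(rho2) * z[:, [p]]

sse_ols = sse_ridge = 0.0
for _ in range(R):
    X = gen_X()
    y = X @ beta + sigma * rng.standard_normal(n)
    H = X.T @ X
    b_ols = np.linalg.solve(H, X.T @ y)
    b_ridge = np.linalg.solve(H + k * np.eye(p), X.T @ y)
    sse_ols += np.sum((b_ols - beta) ** 2)
    sse_ridge += np.sum((b_ridge - beta) ** 2)

mse_ols, mse_ridge = sse_ols / R, sse_ridge / R
```

Under this severe-multicollinearity setting the averaged squared error of the shrinkage estimator comes out well below that of OLS, in line with finding 1 above.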
In this study, the following trends relating the mean square error to the simulation factors were observed:
1. The MSE decreases as the sample size increases at a given level of multicollinearity.
2. An increase in the value of σ leads to a corresponding increase in the mean square error of each estimator when the other factors are held constant.
3. An increase in the number of explanatory variables leads to a corresponding increase in the MSE of all estimators at every level of multicollinearity and σ.
The poultry waste data adopted in this study were originally analyzed by Qian et al.24,25 and were also recently employed by Lukman et al.19. The study aimed at modelling the higher heating value from a proximate-analysis-based model. The response variable is the higher heating value (HHV), while the independent variables are fixed carbon (FC), volatile matter (VM), and ash (A). The linear regression model is:

HHV = β₀ + β₁FC + β₂VM + β₃A + ε,
where ε is the normally distributed random error term. In this study, the Jarque-Bera (JB) test was employed to assess the distribution of the residuals. The test statistic and its p-value are 0.6409 and 0.7258, respectively, showing that the residuals of the model are normally distributed. We then diagnosed whether the model suffers from multicollinearity. Following Lukman et al.14, the model exhibits multicollinearity because the variance inflation factors (VIF_FC = 997.819, VIF_VM = 2163.504, VIF_ASH = 1533.782) are greater than ten (10). There is also evidence of multicollinearity from the condition number (CN).
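The VIF diagnostic used above can be computed as follows. This is a generic Python/NumPy sketch run on synthetic collinear data, not on the poultry-waste dataset itself:

```python
import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    # column j on the remaining columns (with an intercept).
    n, p = X.shape
    out = []
    for j in range(p):
        Xj = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        yj = X[:, j]
        coef, *_ = np.linalg.lstsq(Xj, yj, rcond=None)
        resid = yj - Xj @ coef
        r2 = 1.0 - resid @ resid / np.sum((yj - yj.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
z = rng.standard_normal((100, 3))
X = np.column_stack([z[:, 0], z[:, 0] + 0.05 * z[:, 1], z[:, 2]])
v = vif(X)   # first two entries exceed the rule-of-thumb threshold of 10
```

Columns whose VIF exceeds 10, as with FC, VM, and Ash in the real data, signal the kind of collinearity that the alternative estimators are meant to handle.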
Following Lukman et al.3,4, a moderate level of multicollinearity is indicated when the CN is between 100 and 1000, and severe multicollinearity when the CN exceeds 1000. For effective modelling, we considered some alternatives to the ordinary least squares estimator in this study: the ridge estimator, the unbiased ridge estimator, the K-L estimator, and the unbiased K-L estimator. The estimators' performance was examined using the mean square error. We also adopted leave-one-out cross-validation to validate how well the estimators perform14, assessed through the mean squared prediction error (MSPE). The estimator with the smallest MSE and MSPE is considered the best. The results are presented in Table 3.
From Table 3, the regression estimates of URR, UKL, and OLS are the same, as expected, yet the former two possess a smaller mean squared error than the OLS estimator. All the estimators exhibit the same signs for the regression coefficients. The proposed UKL estimator demonstrated the best performance in terms of both the MSE and the MSPE, although its performance depends on the biasing parameter k.
The OLS estimator performs inconsistently for parameter estimation in the linear regression model when multicollinearity is present: it remains unbiased but no longer has minimum variance. To address this setback, the unbiased K-L estimator was developed in this study and its properties were derived and established. The estimator was shown to belong to the class of unbiased estimators. An added advantage over the OLS estimator is that it possesses smaller variance when multicollinearity is present. The superiority of the proposed estimator over the existing methods was established theoretically, and it is preferred to the other estimators considered in this study.
Furthermore, the simulation and real-life results strengthened the findings of the theoretical comparison in terms of the mean squared error and the mean squared prediction error. We recommend this new estimator for parameter estimation in linear regression models with and without multicollinearity. In further studies, we will extend the new unbiased estimator to other generalized linear models, such as the logistic, beta, and gamma regression models.
Zenodo: Regression Model to Predict the Higher Heating Value of Poultry Waste from Proximate Analysis. http://doi.org/10.5281/zenodo.5078977 25.
This project contains the following underlying data:
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Version 1 (19 Aug 21): read by all three invited reviewers.