Unbiased K-L estimator for the linear regression model

Background: In the linear regression model, the ordinary least squares (OLS) estimator becomes inefficient when multicollinearity is present. According to the Gauss-Markov theorem, the estimator remains unbiased under multicollinearity, but the variance of its regression estimates becomes inflated. Estimators such as the ridge regression estimator and the K-L estimator were adopted as substitutes for the OLS estimator to overcome the problem of multicollinearity in the linear regression model. However, these estimators are biased, though they possess a smaller mean squared error than the OLS estimator. Methods: In this study, we developed a new unbiased estimator based on the K-L estimator and compared its performance with some existing estimators theoretically, through simulation, and on real-life data. Results: Theoretically, the new estimator is unbiased and also possesses minimum variance when compared with the other estimators. Results from the simulation and real-life study showed that the new estimator produced the smallest mean squared error (MSE) and the smallest mean squared prediction error (MSPE), which further strengthened the findings of the theoretical comparison using both the MSE and the MSPE as criteria. Conclusions: A simulation study and a real-life application, modelling high heating values from proximate analysis, were conducted to support the theoretical findings. This new method of estimation is recommended for parameter estimation with and without multicollinearity in a linear regression model.


Introduction
Consider the general linear regression model

y = Xβ + ε,   (1)

such that ε is normally distributed with mean 0 and variance σ²I, where I is the identity matrix, y is an n × 1 vector of the dependent variable, X is an n × p matrix of the independent variables, and β is a p × 1 vector of unknown regression parameters of interest. The method of ordinary least squares (OLS) is well known and generally accepted for estimating the parameters (β's) in the linear regression model. The OLS estimator is defined as:

β̂_OLS = H⁻¹X'y,

where H = X'X, and β̂_OLS is normally distributed, that is, β̂_OLS ~ N(β, σ²H⁻¹). However, when the OLS estimator is applied to a model where there is correlation between the independent variables, the variance of the regression estimates becomes inflated 1,2 . This relationship between the independent variables is referred to as multicollinearity 3,4 .
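As a minimal numerical sketch of the OLS estimator and its covariance (illustrative data and variable names, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))            # independent variables
beta = np.array([1.0, 2.0, -1.0])          # true coefficients (illustrative)
sigma = 1.0
y = X @ beta + rng.normal(scale=sigma, size=n)

H = X.T @ X                                # H = X'X
beta_ols = np.linalg.solve(H, X.T @ y)     # OLS estimate: H^{-1} X'y
cov_ols = sigma**2 * np.linalg.inv(H)      # Var(beta_ols) = sigma^2 H^{-1}
```

When the columns of X are nearly collinear, H is ill-conditioned and the diagonal of `cov_ols` inflates, which is exactly the variance inflation described above.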
In addressing the problem of multicollinearity, various biased estimators with mean squared error smaller than that of the OLS estimator have been developed by different authors 2-15 . A limitation of these estimators is that they are biased; however, unbiased versions of some of them have been developed. The advantage of these unbiased versions is that they produce estimates similar to the OLS estimator with a better mean squared error. Crouse et al. 16,17 developed the unbiased ridge and Liu estimators. Wu 18 developed the unbiased version of the two-parameter estimator of Ozkale and Kaciranlar 9 . Lukman et al. 19 developed the unbiased modified ridge-type estimator. Recently, the K-L estimator was proposed to circumvent the problem of multicollinearity in the linear regression model 13 . The K-L estimator is classified as a biased estimator with a single biasing parameter 13 .
In this study, a new unbiased technique is developed based on the K-L estimator, and its properties are derived. We compare the unbiased K-L estimator with some existing techniques using the mean squared error (MSE) criterion.

Methods
Unbiased K-L estimator with prior information
Hoerl and Kennard 5 developed the ridge estimator to mitigate multicollinearity in the linear regression model. The ridge estimator of β with biasing parameter k > 0 is:

β̂_RR(k) = (H + kI)⁻¹X'y.

The modified ridge technique was proposed with the addition of the prior information J 6 . This is expressed as follows:

β̂(k, J) = (H + kI)⁻¹(X'y + kJ).

According to Crouse et al. 16 , the unbiased ridge estimator with the introduction of the prior information J is given as

β̂_URR(k, J) = (H + kI)⁻¹(X'y + kJ),

where J ~ N(β, (σ²/k)I), and J and β̂_OLS are independent. Recently, the K-L estimator was proposed 13 and found to generally outperform the ridge regression estimator. The K-L estimator of β is defined as:

β̂_KL(k) = (H + kI)⁻¹(H − kI)β̂_OLS.

This research proposes an unbiased K-L estimator following the convex method. The convex method is defined as:

β̂(G, J) = Gβ̂_OLS + (I − G)J,

where G is a p×p matrix, I is an identity matrix of p×p dimensions, and J is a prior with mean β and covariance matrix D, independent of β̂_OLS. Thus, the MSE of β̂(G, J) is

MSE(β̂(G, J)) = σ²GH⁻¹G' + (I − G)D(I − G)'.   (11)

The value of G that minimizes (11) is G = D(σ²H⁻¹ + D)⁻¹. Accordingly, D = σ²(I − G)⁻¹GH⁻¹. We observed that the convex estimator β̂(G, J) is an unbiased estimator of β and possesses minimum MSE for the optimal value of G. Consequently, the new unbiased estimator is defined as

β̂_UKL(k, J) = A_k β̂_OLS + (I − A_k)J,

where A_k = (H + kI)⁻¹(H − kI) and the corresponding prior covariance is D = σ²(I − A_k)⁻¹A_kH⁻¹. It can be shown conveniently that β̂_UKL(k, J) is unbiased for β:

E[β̂_UKL(k, J)] = A_k E[β̂_OLS] + (I − A_k)E[J] = A_kβ + (I − A_k)β = β.   (13)

It follows from Equation (13) that the proposed estimator is unbiased; its bias is zero. This places the new estimator in the same class as the OLS estimator. Given that there exists an orthogonal matrix Q such that Q'X'XQ = E = diag(e₁, e₂, ..., e_p), where e_i is the i-th eigenvalue of X'X, and E and Q are the matrices of eigenvalues and eigenvectors of X'X respectively, Equation (1) can be expressed in canonical form as:

y = Zα + ε,   (15)

where Z = XQ, α = Q'β, and Z'Z = E.
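The estimators above can be sketched numerically as follows (a hedged sketch with illustrative function names; the prior vector J is an input the analyst must supply):

```python
import numpy as np

def ols(X, y):
    """OLS estimator: H^{-1} X'y with H = X'X."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, k):
    """Ridge estimator: (H + kI)^{-1} X'y."""
    H = X.T @ X
    return np.linalg.solve(H + k * np.eye(H.shape[0]), X.T @ y)

def kl(X, y, k):
    """K-L estimator: A_k beta_ols, with A_k = (H + kI)^{-1} (H - kI)."""
    H = X.T @ X
    I = np.eye(H.shape[0])
    A_k = np.linalg.solve(H + k * I, H - k * I)
    return A_k @ ols(X, y)

def ukl(X, y, k, J):
    """Proposed unbiased K-L: A_k beta_ols + (I - A_k) J, where J is a prior
    vector with mean beta (an assumption supplied by the analyst)."""
    H = X.T @ X
    I = np.eye(H.shape[0])
    A_k = np.linalg.solve(H + k * I, H - k * I)
    return A_k @ ols(X, y) + (I - A_k) @ J
```

For k = 0, A_0 = I, so both the K-L and unbiased K-L estimators reduce to the OLS estimator.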
For Equation (15), we get the corresponding representations of the estimators in canonical form.
Lemma 1.1. Let N be an n×n positive definite matrix and α be some vector. Then N − αα' is positive definite if and only if α'N⁻¹α < 1.
Theoretical comparison
Using the matrix mean squared error as the criterion, α̂_UKL(k, J) is preferred to α̂_OLS for k > 0. Taking the difference of Equations (24) and (26) and simplifying (27) further, we observed that α̂_UKL(k, J) is likewise preferred to the biased competing estimators by using the matrix mean squared error as the criterion for k > 0.

Selection of the biasing parameters
In this study, we adopted established biasing parameter estimates for the ridge and the unbiased ridge estimators. For the K-L estimator, we adopted the biasing parameter proposed alongside it 13 . For the proposed estimator, several candidate biasing parameters were examined.
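The paper's own biasing-parameter formulas (Eqs. (32)-(37)) are not reproduced in this extraction; as one common illustrative choice for the ridge family, the Hoerl-Kennard-Baldwin estimate k̂ = pσ̂²/(β̂'β̂) can be sketched as follows (function name is illustrative, and this is not necessarily one of the paper's choices):

```python
import numpy as np

def hkb_k(X, y):
    """Hoerl-Kennard-Baldwin ridge parameter: k = p * sigma_hat^2 / (b'b),
    where b is the OLS estimate. Illustrative only; not necessarily one of
    the paper's Eqs. (32)-(37)."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)      # OLS estimate
    resid = y - X @ b
    sigma2 = resid @ resid / (n - p)           # unbiased error variance estimate
    return p * sigma2 / (b @ b)
```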

Results
R Studio was used for both the simulation and the real-life analysis. The independent variables were generated following the study of McDonald and Galarneau 22 as:

x_ij = (1 − r²)^(1/2) z_ij + r z_i,p+1,  i = 1, 2, ..., n,  j = 1, 2, ..., p,

where the z_ij are independent standard normal pseudo-random numbers, r² is the correlation between any two independent variables, and p is the number of independent variables, taken as three and seven in this study. The values of r² are 0.8, 0.9, 0.99 and 0.999. For p = 3, the response variable is defined as:

y_i = β₁x_i1 + β₂x_i2 + β₃x_i3 + e_i,

where e_i is normally distributed with mean 0 and variance σ². β is chosen such that β'β = 1 23 . Samples of size 30, 50, and 100 were used, and the values of σ are 1 and 5. The mean squared error is calculated as:

MSE(β̂) = (1/R) Σ_j Σ_i (β̂_ij − β_i)²,

where β̂_ij is the estimate of the i-th parameter in the j-th replication, β_i is the true parameter value, and R is the number of replications. The MSE results are presented in Table 1 and Table 2. We observed the following:
1. All the alternative techniques studied in this work outperform the OLS estimator at all levels of multicollinearity.
2. The ridge estimator outperforms its unbiased version when the MSE is used as the criterion.
3. The proposed unbiased estimator (UKL) outperforms its K-L counterpart.
4. The proposed estimator generally performed better than all the estimators considered in this work, though its performance is a function of the choice of biasing parameters.
In this study, the following trends relating the mean squared error to the simulation factors were observed:
1. The MSE decreases as the sample size increases at a given level of multicollinearity.
2. An increase in the value of σ leads to a corresponding increase in the mean squared error of each estimator when the other factors are kept constant.
3. An increase in the number of explanatory variables leads to a corresponding increase in the MSE of all the estimators at varying levels of multicollinearity and σ.
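The simulation design above can be sketched as follows (a minimal self-contained version; the replication count, the fixed ridge k, and the function names are illustrative, not the paper's exact setup):

```python
import numpy as np

def gen_X(n, p, r2, rng):
    """McDonald-Galarneau scheme: x_ij = sqrt(1 - r^2) z_ij + r z_{i,p+1},
    which gives pairwise correlation r^2 between the regressors."""
    Z = rng.standard_normal((n, p + 1))
    return np.sqrt(1.0 - r2) * Z[:, :p] + np.sqrt(r2) * Z[:, [p]]

def sim_mse(estimator, n=50, p=3, r2=0.99, sigma=1.0, reps=200, seed=0):
    """Monte Carlo MSE: average of sum_i (beta_hat_i - beta_i)^2 over replications."""
    rng = np.random.default_rng(seed)
    beta = np.ones(p) / np.sqrt(p)            # chosen so that beta'beta = 1
    total = 0.0
    for _ in range(reps):
        X = gen_X(n, p, r2, rng)
        y = X @ beta + rng.normal(scale=sigma, size=n)
        b = estimator(X, y)
        total += np.sum((b - beta) ** 2)
    return total / reps

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, k=0.5):
    H = X.T @ X
    return np.linalg.solve(H + k * np.eye(H.shape[0]), X.T @ y)
```

Under severe multicollinearity (for example r² = 0.999 with σ = 5), the simulated ridge MSE falls far below the OLS MSE, consistent with the trends above.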

Application: poultry waste data
The poultry waste data adopted in this study were analyzed in Qian et al. 24,25 and were also recently employed by Lukman et al. 19 . The study was aimed at modelling the high heating value with a proximate-analysis-based model. The response variable is the High Heating Value (HHV), while the independent variables are Fixed Carbon (FC), Volatile Matter (VM), and Ash (A). The linear regression model is:

HHV = β₀ + β₁FC + β₂VM + β₃A + ε,

where ε is the normally distributed random error term. In this study, the Jarque-Bera (JB) test was employed to examine the distribution of the residuals. The test statistic and its p-value are 0.6409 and 0.7258, respectively; the result shows that the residuals in the model are normally distributed. We then diagnosed whether the model has a multicollinearity problem. According to Lukman et al. 14 , the model suffers from multicollinearity because the variance inflation factors (VIF_FC = 997.819, VIF_VM = 2163.504, VIF_ASH = 1533.782) are greater than ten (10). There is also evidence of multicollinearity from the condition number (CN).
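Multicollinearity diagnostics like those reported here can be computed as follows (a generic sketch; the data are not included, and the CN definition shown is one common convention):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes
    from regressing column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_number(X):
    """Condition number of X'X: ratio of its largest to smallest eigenvalue
    (one common convention; some authors use the square root instead)."""
    e = np.linalg.eigvalsh(X.T @ X)
    return e.max() / e.min()
```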
Following Lukman et al. 3,4 , a moderate level of multicollinearity is observed if the CN lies between 100 and 1000, while severe multicollinearity is encountered if the CN is greater than 1000. For effective modelling, we considered some alternative estimators to the ordinary least squares estimator in this study: the ridge estimator, the unbiased ridge estimator, the K-L estimator, and the unbiased K-L estimator. The estimators' performance was examined using the mean squared error. We also adopted leave-one-out cross-validation to validate how well the estimators perform 14 ; predictive performance is assessed through the mean squared prediction error (MSPE). The estimator with the least MSE and MSPE is considered the best. The results are available in Table 3.
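The leave-one-out MSPE used to compare the estimators can be sketched generically as follows (`estimator` is any fitting function, such as OLS; names are illustrative):

```python
import numpy as np

def mspe_loocv(X, y, estimator):
    """Leave-one-out cross-validation: refit on n-1 rows, predict the held-out
    row, and average the squared prediction errors (MSPE)."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        b = estimator(X[mask], y[mask])
        errs[i] = (y[i] - X[i] @ b) ** 2
    return errs.mean()

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
```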
From Table 3, the regression estimates of the URR, UKL, and OLS methods are the same, as expected, and the former two possess a smaller mean squared error than the OLS estimator. All the estimators exhibit the same signs for the regression coefficients. The proposed UKL estimator demonstrated the best performance in terms of both the MSE and the MSPE, although its performance is a function of the biasing parameter k.

Conclusion
The OLS estimator performs inconsistently for parameter estimation in the linear regression model when multicollinearity is present: the estimator remains unbiased but no longer has minimum variance. Due to this setback, the unbiased K-L estimator was developed in this study, and the properties of the new estimator were derived and established. It was found that the estimator belongs to the class of unbiased estimators. An added advantage of this estimator over the OLS estimator is that it possesses minimum variance when multicollinearity is present. The superiority of the proposed estimator over the existing methods was established theoretically, and the estimator is preferred to the other estimators considered in this study.
Furthermore, the simulation and real-life results strengthened the findings of the theoretical comparison in terms of the mean squared error and the mean squared prediction error. We recommend this new estimator for parameter estimation in a linear regression model with and without multicollinearity. In further studies, we will extend the new unbiased estimator to other generalized linear models such as the logistic regression model, the Beta regression model, and the Gamma regression model.

Open peer review
For selecting the ridge parameter in Eqs. (32)-(37), provide a reference.
8. In Eq. (38), the pseudo-random normal variables are denoted by a lowercase letter; however, the uppercase letter is defined on the line after.
10. It would be helpful to provide a table of the real data used in the Application, although a data availability statement is added.
11. Yes.

© 2021 Oyeronke Alaba O. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Oluwayemisi Oyeronke Alaba
Department of Statistics, University of Ibadan, Ibadan, Nigeria

In the abstract, "inefficiency" is better used than "performance drops" when multicollinearity is present. The new unbiased estimator should be clearly stated, along with the reason why there is a need to modify the K-L estimator, which was clearly stated in the body of the work. Why did you choose the biasing parameter? Did you check the unbiasedness of the K-L estimator? What were the limitations in previous studies before the need to modify it? You only stated that it is classified as a biased estimator with a single biasing parameter. The results of the simulation did not display well; it would be better if a few figures could be picked to discuss the results.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes