Performance of LASSO and Elastic net estimators in Misspecified Linear Regression Model

Abstract: The Ridge Estimator (RE) has been used as an alternative to the Ordinary Least Squares Estimator (OLSE) to handle the multicollinearity problem in the linear regression model. However, it introduces heavy bias when the number of predictors is high, and although it may shrink irrelevant regression coefficients, they still remain in the model. The Least Absolute Shrinkage and Selection Operator (LASSO) and the Elastic net (Enet) estimator have been used to perform variable selection and shrink the regression coefficients simultaneously. Further, model misspecification due to excluding relevant explanatory variables from the linear regression model is considered a severe problem in statistical research, and it leads to biased and inconsistent parameter estimation. The performance of the RE, LASSO and Enet estimators under the correctly specified regression model has been well studied in the literature. This study compares the performance of the RE, LASSO and Enet estimators in the misspecified regression model using the Root Mean Square Error (RMSE) criterion. A Monte Carlo simulation study was used to examine the performance of the estimators, and a real-world example was employed to support the results. The analysis revealed that Enet outperformed RE and LASSO in both the correctly specified and the misspecified regression models.


INTRODUCTION
Consider the linear regression model

$$y = X\beta + \varepsilon, \tag{1}$$

where $y$ is the $n \times 1$ vector of observations on the response variable, $X$ is the $n \times p$ matrix of observations on $p$ non-stochastic regressor variables, $\beta$ is a $p \times 1$ vector of unknown parameters, and $\varepsilon$ is the $n \times 1$ normally distributed vector of disturbances, such that

$$E(\varepsilon) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon') = \sigma^2 I. \tag{2}$$

Usually, by minimising the Error Sum of Squares (ESS), $(y - X\beta)'(y - X\beta)$, the Ordinary Least Squares Estimator (OLSE), which is the Best Linear Unbiased Estimator (BLUE) for $\beta$, is obtained as

$$\hat{\beta}_{OLSE} = (X'X)^{-1}X'y. \tag{3}$$

It is well known that OLSE is unstable and produces estimates with high variance when multicollinearity exists among the explanatory variables, i.e., when the columns of $X$ are highly correlated. To handle this problem, Hoerl and Kennard (1970) proposed the Ridge Estimator (RE) by minimising ESS subject to the constraint $\sum_{j=1}^{p} \beta_j^2 \le t$, where $t \ge 0$ is a tuning parameter. Further, it can be defined as

$$\hat{\beta}_{RE} = \arg\min_{\beta} \left\{ (y - X\beta)'(y - X\beta) + \lambda_2 \|\beta\|_2^2 \right\}, \tag{4}$$

where $\lambda_2 \ge 0$ is the regularization parameter, and $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the squared $L_2$ norm. Note that for any $t$ there exists a corresponding $\lambda_2$ that gives the same solution. This is the dual form of the optimization problem. RE helps to obtain estimates with smaller variance by shrinking the regression coefficients towards zero. However, it has two significant issues in a high-dimensional linear model. The first issue is that it introduces heavy bias when the number of predictors is high, and the second is that, although it may shrink irrelevant regression coefficients, they still remain in the model. To overcome this problem, Tibshirani (1996) proposed the Least Absolute Shrinkage and Selection Operator (LASSO) by minimising ESS subject to the constraint $\sum_{j=1}^{p} |\beta_j| \le t$ for variable selection and shrinking the regression coefficients. Further, it is defined as

$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \left\{ (y - X\beta)'(y - X\beta) + \lambda_1 \|\beta\|_1 \right\}, \tag{5}$$

where $\lambda_1 \ge 0$ is the regularization parameter, and $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is known as the $L_1$ norm. Similarly, for any $t$ there exists a corresponding $\lambda_1$. Since $\|\beta\|_1$ is not a differentiable function, there is no analytical solution for the LASSO estimator. Therefore, researchers have used numerical methods to find solutions to this problem.
Tibshirani (1996) used the standard quadratic programming technique, Fu (1998) proposed the shooting algorithm, and Efron et al. (2004) proposed the Least Angle Regression (LARS) algorithm. The LASSO estimation method handles both the multicollinearity problem and feature selection simultaneously in the high-dimensional linear regression model. However, according to Zou and Hastie (2005), the LASSO estimation procedure is unstable when the number of predictors $p$ is higher than the number of observations $n$. Further, the prediction performance of RE dominates that of LASSO if there exists high multicollinearity among the predictors. To handle this problem, Zou and Hastie (2005) proposed the Elastic net (Enet) estimator by combining RE and LASSO, and it is defined as

$$\hat{\beta}_{Enet} = \arg\min_{\beta} \left\{ (y - X\beta)'(y - X\beta) + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}, \tag{6}$$

where $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$ are the regularization parameters of the $L_1$ and $L_2$ penalties. The LARS-EN algorithm, which is a modified version of the LARS algorithm, has been used to compute the Enet estimator.
Misspecification of the linear model is unavoidable in practical situations when fitting regression models. It may occur due to the inclusion of irrelevant explanatory variables, the exclusion of relevant explanatory variables, or measurement errors in the variables.
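Note that model (1) can be written as

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon, \tag{7}$$

where $X_1$ and $X_2$ are the $n \times p_1$ and $n \times p_2$ matrices of regressors with $p_1 + p_2 = p$, and $\beta_1$ and $\beta_2$ are the corresponding $p_1 \times 1$ and $p_2 \times 1$ parameter vectors. Suppose that the researcher misspecifies the regression model (7) by omitting $X_2$; then model (7) becomes

$$y = X_1\beta_1 + \delta, \tag{8}$$

where $\delta = X_2\beta_2 + \varepsilon$.

Note that $E(\delta) = X_2\beta_2 \neq 0$, and the omitted variables may be correlated with the variables in the model if $X$ is multicollinear. Therefore, one or more assumptions of the linear regression model will be violated when the model is misspecified, and hence the estimators become biased and inconsistent.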
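The effect of omitting relevant regressors can be illustrated numerically for the simplest case, OLSE. The following is a minimal sketch, assuming the standard omitted-variable result that the expectation of the OLSE of $\beta_1$ in the reduced model equals $\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2$. The sample size, the coefficient values and the way the correlated regressors are built are illustrative assumptions, not settings taken from this study.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 200, 3, 2  # hypothetical sizes, not taken from the paper

# Correlated regressors: a shared component makes X1 and X2 collinear.
shared = rng.standard_normal((n, 1))
X = 0.6 * rng.standard_normal((n, p1 + p2)) + 0.8 * shared
X1, X2 = X[:, :p1], X[:, p1:]

beta1 = np.array([1.0, -2.0, 0.5])  # illustrative coefficients
beta2 = np.array([3.0, -1.0])

# Noise-free response, so the OLS fit equals its expectation exactly.
y_mean = X1 @ beta1 + X2 @ beta2

# OLS on the misspecified model that omits X2.
beta1_hat = np.linalg.solve(X1.T @ X1, X1.T @ y_mean)

# Theoretical bias term (X1'X1)^{-1} X1'X2 beta2.
bias = np.linalg.solve(X1.T @ X1, X1.T @ X2 @ beta2)

print("beta1            :", beta1)
print("E[beta1_hat]     :", beta1_hat)
print("beta1 + bias term:", beta1 + bias)
```

The bias term vanishes only when $X_1'X_2\beta_2 = 0$, which is unlikely under multicollinearity. The penalised estimators discussed in this article are not covered by this closed-form expression; the sketch only illustrates the mechanism.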
Fu (1998), Zou and Hastie (2005), and Oyeyemi et al. (2015) examined the performance of the RE, LASSO and Enet estimators in the correctly specified linear regression model. In this research, we examine the performance of the LASSO and Enet estimators in comparison with RE using the Root Mean Square Error (RMSE) criterion when the regression model is misspecified due to the exclusion of some important variables. The rest of the article is organised as follows: estimators under the misspecified model, a common form to represent the RE, LASSO and Enet estimators, a Monte Carlo simulation study and a numerical example to discuss the performance of the estimators, and finally some concluding remarks.

MATERIALS AND METHODS
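Now we write the RE, LASSO and Enet estimators for the misspecified linear regression model (8) as follows:

$$\hat{\beta}_{1,RE} = \arg\min_{\beta_1} \left\{ (y - X_1\beta_1)'(y - X_1\beta_1) + \lambda_2 \|\beta_1\|_2^2 \right\}, \tag{9}$$

$$\hat{\beta}_{1,LASSO} = \arg\min_{\beta_1} \left\{ (y - X_1\beta_1)'(y - X_1\beta_1) + \lambda_1 \|\beta_1\|_1 \right\}, \tag{10}$$

$$\hat{\beta}_{1,Enet} = \arg\min_{\beta_1} \left\{ (y - X_1\beta_1)'(y - X_1\beta_1) + \lambda_2 \|\beta_1\|_2^2 + \lambda_1 \|\beta_1\|_1 \right\}, \tag{11}$$

with regularization parameters $\lambda_1, \lambda_2 \ge 0$. Now, let $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$; then $0 \le \alpha \le 1$. Zou and Hastie (2005) have shown that solving for $\hat{\beta}_{1,Enet}$ in equation (11) is equivalent to minimising ESS subject to the constraint $(1 - \alpha)\|\beta_1\|_1 + \alpha\|\beta_1\|_2^2 \le t$, where $t \ge 0$ and $\alpha$ is referred to as the mixing percentage of regularization parameters. Further, it can be defined as

$$\hat{\beta}_{1,Enet} = \arg\min_{\beta_1} \left\{ (y - X_1\beta_1)'(y - X_1\beta_1) + \lambda \left[ (1 - \alpha)\|\beta_1\|_1 + \alpha\|\beta_1\|_2^2 \right] \right\}, \tag{12}$$

where $\lambda = \lambda_1 + \lambda_2$. Note that $\alpha = 1$ implies $\lambda_1 = 0$, and then the Enet estimator in equation (12) is equivalent to RE. Similarly, $\alpha = 0$ implies $\lambda_2 = 0$, and then the Enet estimator in equation (12) is equivalent to LASSO. Hence, we can write the RE, LASSO and Enet estimators in a common form in the misspecified regression model as below:

$$\hat{\beta}_1(\lambda, \alpha) = \arg\min_{\beta_1} \left\{ (y - X_1\beta_1)'(y - X_1\beta_1) + \lambda \left[ (1 - \alpha)\|\beta_1\|_1 + \alpha\|\beta_1\|_2^2 \right] \right\}, \tag{13}$$

where $\alpha = 1$ gives RE, $\alpha = 0$ gives LASSO, and $0 < \alpha < 1$ gives Enet. The RMSE, which measures the expected prediction error of the estimators, is given by

$$RMSE(\hat{\beta}_1) = \sqrt{\frac{1}{n_{new}} (y_{new} - X_{1,new}\hat{\beta}_1)'(y_{new} - X_{1,new}\hat{\beta}_1)}, \tag{14}$$

where $(y_{new}, X_{1,new})$ includes the $n_{new}$ new observations that are not used to obtain the coefficient estimates $\hat{\beta}_1$.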
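A common penalised form of this kind can be minimised by cyclic coordinate descent with soft-thresholding. The following is a minimal sketch, assuming the objective $(y - X\beta)'(y - X\beta) + \lambda[(1 - \alpha)\|\beta\|_1 + \alpha\|\beta\|_2^2]$ in the Zou and Hastie (2005) parameterisation, where $\alpha$ weights the $L_2$ penalty (note that the `alpha` argument of glmnet weights the $L_1$ penalty instead). It is a swapped-in technique for illustration, not the LARS-EN algorithm or the glmnet implementation, and the function names `soft_threshold` and `enet_coordinate_descent` are hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def enet_coordinate_descent(X, y, lam, alpha, n_iter=500, tol=1e-10):
    """Minimise ||y - X b||^2 + lam * [(1 - alpha) * ||b||_1 + alpha * ||b||_2^2].

    Assumed convention: alpha = 1 is ridge, alpha = 0 is LASSO.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)   # x_j' x_j for each column
    resid = y - X @ beta            # current residual vector
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # x_j' (partial residual that excludes the j-th coordinate)
            rho_j = X[:, j] @ resid + col_ss[j] * old
            new = soft_threshold(rho_j, lam * (1.0 - alpha) / 2.0) / (
                col_ss[j] + lam * alpha
            )
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
            max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

# Illustrative synthetic data (not the data used in this study).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
true_beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(60)

for alpha in (1.0, 0.5, 0.0):
    beta_hat = enet_coordinate_descent(X, y, lam=20.0, alpha=alpha)
    print(f"alpha={alpha}: nonzero coefficients = {np.count_nonzero(beta_hat)}")
```

With $\alpha = 1$ the update has no thresholding and the iteration converges to the ridge solution $(X'X + \lambda I)^{-1}X'y$, while any $\alpha < 1$ can set coefficients exactly to zero.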
The "glmnet" package available in the R programming language was used to compute the RE, LASSO and Enet solutions. A Monte Carlo simulation study was conducted to examine the performance of the RE, LASSO and Enet estimators. Further, a numerical example is employed to support the results. For the simulation study, the mixing percentage $\alpha$ of the Enet estimator was selected as . The minimum RMSE and the suitable value of the regularization parameters for the particular problem were calculated using the K-fold cross-validation method, as suggested by Tibshirani (1996) and Fu (1998).
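The selection of a regularization parameter by K-fold cross-validation can be sketched as follows. This is a minimal illustration, assuming the ridge closed form as a stand-in estimator and a hypothetical grid of candidate values; it is not the glmnet procedure used in this study, and the helper names `ridge_fit`, `rmse` and `kfold_rmse` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution (X'X + lam * I)^{-1} X'y (stand-in estimator)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def rmse(y, X, beta):
    """Root mean square prediction error on the observations (y, X)."""
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2))

def kfold_rmse(X, y, lambdas, k=5, seed=0):
    """Average held-out RMSE over k folds for each candidate lambda."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    scores = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        fold_scores = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            beta = ridge_fit(X[train], y[train], lam)
            fold_scores.append(rmse(y[held_out], X[held_out], beta))
        scores[i] = np.mean(fold_scores)
    return scores

# Illustrative synthetic data and grid (assumptions, not the study's settings).
rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.standard_normal(100)
lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1e6])

cv_scores = kfold_rmse(X, y, lambdas)
best_lambda = lambdas[np.argmin(cv_scores)]
print("CV RMSE per lambda:", np.round(cv_scores, 3))
print("selected lambda   :", best_lambda)
```

Each observation is held out exactly once, so the RMSE is always computed on data that were not used to obtain the coefficient estimates.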

Simulation study
Following McDonald and Galarneau (1975), we generate the regressor variables as follows:

$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i,p+1}, \quad i = 1, \dots, n, \; j = 1, \dots, p, \tag{15}$$

where $z_{ij}$ is an independent standard normal pseudo-random number, and $\rho$ is specified so that the theoretical correlation between any two explanatory variables is given by $\rho^2$. We used a linear regression model with $n = 100$ observations and $p = 20$ regressors, and the dependent variable was generated using the following equation:

$$y_i = \sum_{j=1}^{20} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, \dots, 100, \tag{16}$$

where $\varepsilon_i$ is a normal pseudo-random variable with mean zero and variance one. Also, we choose $\beta$ as the normalized eigenvector corresponding to the largest eigenvalue of $X'X$, for which $\beta'\beta = 1$.
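The data-generating scheme can be sketched as follows. This is a minimal illustration, assuming the McDonald and Galarneau (1975) construction $x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i,p+1}$ and taking $\beta$ as the unit-norm eigenvector of $X'X$ for its largest eigenvalue. The value $\rho = 0.9$, the seed and the function name `generate_mg_data` are illustrative assumptions.

```python
import numpy as np

def generate_mg_data(n, p, rho, rng):
    """Generate (X, y, beta) with pairwise regressor correlation rho**2."""
    z = rng.standard_normal((n, p + 1))
    # The last column of z is shared by every regressor and induces correlation.
    X = np.sqrt(1.0 - rho ** 2) * z[:, :p] + rho * z[:, [p]]
    # eigh returns eigenvalues in ascending order with unit-norm eigenvectors,
    # so the last column belongs to the largest eigenvalue of X'X.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    beta = eigvecs[:, -1]
    y = X @ beta + rng.standard_normal(n)  # unit-variance normal errors
    return X, y, beta

rng = np.random.default_rng(2024)
X, y, beta = generate_mg_data(n=100, p=20, rho=0.9, rng=rng)  # rho is illustrative

print("X shape      :", X.shape)
print("beta' beta   :", beta @ beta)
print("mean off-diagonal correlation:",
      (np.corrcoef(X, rowvar=False).sum() - 20) / (20 * 19))
```

Repeating the call with different seeds and averaging the resulting RMSE values gives one Monte Carlo estimate per setting of $\rho$.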
To investigate the effects of different degrees of multicollinearity on the estimators, we choose three values of the correlation parameter $\rho$, which represent low, moderate and high multicollinearity, respectively. To study the effect of misspecification, we partition $X$ into $X_1$ and $X_2$, where $X_2$ is taken to be the regressor matrix of the omitted variables in the misspecified regression model. Because the simulation is computationally expensive, it was repeated 50 times, following Tibshirani (1996) and Fu (1998).
The estimated RMSE values of the RE, LASSO and Enet estimators versus the regularization parameter under low, moderate and high multicollinearity are displayed in Figures 1-3, respectively. The average minimum simulated RMSE values of the estimators and the optimal values of the regularization parameters are summarized in Table 1.
From Figures 1-3, we can observe that Enet outperforms RE and LASSO in both the correctly specified and the misspecified regression models. According to Table 1, the Enet estimator produces the minimum RMSE in all cases considered in this study. Further, the performance of the RE, LASSO and Enet estimators differs between the correctly specified and the misspecified models.

Numerical example
The US crime dataset was considered to analyse the performance of the RE, LASSO and Enet estimators. These data were used by Venables and Ripley (1999) to examine the effect of punishment regimes on crime rates, and they contain 16 variables with 47 observations. The data set is available in the MASS package in R, and it includes the following variables: M (percentage of males aged 14-24), So (indicator variable for a Southern state), Ed (mean years of schooling), Po1 (police expenditure in 1960), Po2 (police expenditure in 1959), LF (labor force participation rate), M.F (number of males per 1000 females), Pop (state population), NW (number of non-whites per 1000 people), U1 (unemployment rate of urban males 14-24), U2 (unemployment rate of urban males 35-39), GDP (gross domestic product per head), Ineq (income inequality), Prob (probability of imprisonment), Time (average time served in state prisons), and y (rate of crimes in a particular category per head of population).
For the model fitting, y is considered the dependent variable, and the variable So is ignored because it is categorical. According to the ANOVA of the UScrime dataset, the p-values for the predictor variables M, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, GDP, Ineq, Prob and Time are 0.26, 0.00, 0.00, 0.09, 0.47, 0.12, 0.33, 0.12, 0.89, 0.01, 0.35, 0.00, 0.024 and 0.62, respectively. Based on the ANOVA, we assume that a subset of these variables is missing, so that the regression model is misspecified by omitting them. The Variance Inflation Factor (VIF) values of the regressor variables of the dataset are 2.86, 5.05, 104.58, 113.028, 2.88, 3.69, 2.53, 3.84, 5.19, 4.83, 9.97, 7.43, 2.75 and 2.66, which indicates moderate multicollinearity in the dataset.
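VIF values of this kind can be computed directly from the regressor matrix. The following is a minimal sketch, assuming the usual definition $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is obtained by regressing the $j$-th regressor on the others; this equals the $j$-th diagonal element of the inverse correlation matrix. The data below are synthetic and illustrative, not the UScrime data, and the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example: columns a and b are nearly collinear, c is unrelated.
rng = np.random.default_rng(3)
n = 500
a = rng.standard_normal(n)
b = a + 0.1 * rng.standard_normal(n)
c = rng.standard_normal(n)
X_demo = np.column_stack([a, b, c])

vifs = vif(X_demo)
print("VIF values:", np.round(vifs, 2))
```

A common rule of thumb treats VIF values above 10 as a sign of serious collinearity for the regressor concerned.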
The estimated RMSE values of the RE, LASSO and Enet estimators versus regularization parameter are displayed in Figure 4. The minimum RMSE values of the estimators and the optimal value of regularization parameter are summarized in Table 2.
From Figure 4, we can observe that the Enet estimator outperforms the other two estimators in both the correctly specified and the misspecified regression models. It is evident that the RMSE values of the RE, LASSO and Enet estimators differ markedly between the correctly specified and the misspecified models.
According to Table 2, we can observe the superiority of the Enet estimator in both the correctly specified and the misspecified regression models. Further, we can note that LASSO and Enet select different numbers of variables.