Abstract
This paper considers the variable selection problem in linear regression analysis when multicollinearity exists among the explanatory variables. To address this problem, we focus on the fact that the covariance matrix of the estimated regression coefficients comprises two parts: the residual sum of squares of the response variable given the explanatory variables, and the design matrix of the explanatory variables. We then propose three variable selection criteria based on the simple idea of selecting a subset of explanatory variables using this covariance matrix. We refer to the explanatory variables selected by the proposed criteria as “predictive principal variables” (PPV) and to the statistical variable selection method using these criteria as the “PPV method”. Under the proposed criteria, the PPV method incorporates stopping rules that require neither statistical hypothesis testing nor subjective judgment. Comparing the PPV method with existing variable selection criteria through simulation experiments and two case studies, we show that the PPV method (i) enables the selection of a subset of explanatory variables that takes both prediction accuracy and the degree of multicollinearity into account, and (ii) is useful for high-dimensional data analysis according to Pareto’s principle, under the assumption that the sample size is smaller than the total number of explanatory variables but not smaller than the number of important explanatory variables. Finally, we discuss the extension of the PPV method to nonlinear regression analysis.
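
The two-part structure of the coefficient covariance matrix mentioned above can be illustrated with a minimal sketch. It does not implement the proposed selection criteria, which are not specified here. It assumes the standard ordinary least squares estimate (RSS / (n - p)) (Z'Z)^{-1}, where Z is the design matrix with an intercept column and p is its number of columns. The function name `coef_covariance` and the simulated data are hypothetical and serve only as illustration.

```python
import numpy as np


def coef_covariance(X, y):
    """Estimated covariance of OLS coefficients (illustrative helper).

    Returns the covariance matrix together with its two ingredients:
    the residual sum of squares and the inverse of Z'Z.
    """
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    rss = float(resid @ resid)                  # part 1: depends on the response
    xtx_inv = np.linalg.inv(Z.T @ Z)            # part 2: depends on the design only
    sigma2_hat = rss / (n - Z.shape[1])         # unbiased error-variance estimate
    return sigma2_hat * xtx_inv, rss, xtx_inv


# Simulated data (assumption): y depends on x1 only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                   # unrelated to x1
x2_collin = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

cov_indep, rss_indep, _ = coef_covariance(np.column_stack([x1, x2_indep]), y)
cov_collin, rss_collin, _ = coef_covariance(np.column_stack([x1, x2_collin]), y)

# Index 1 is the coefficient of x1 (index 0 is the intercept).
print("RSS (independent, collinear):", rss_indep, rss_collin)
print("Var of x1 coefficient (independent, collinear):",
      cov_indep[1, 1], cov_collin[1, 1])
```

In this sketch the residual sum of squares is similar for both designs, while the variance of the coefficient of `x1` is much larger under the nearly collinear design. This shows how the design-matrix part of the covariance matrix carries information about multicollinearity that the residual sum of squares alone does not.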