Variable Selection for Ultra High Dimensional Data

Date

2014-05-29

Abstract

Variable selection plays an important role in high-dimensional data analysis. In this work, we first propose a Bayesian variable selection approach for ultra-high dimensional linear regression based on a split-and-merge strategy. The proposed approach consists of two stages: (i) split the ultra-high dimensional data set into a number of lower dimensional subsets and select relevant variables from each subset, and (ii) aggregate the variables selected in the first stage and select relevant variables from the aggregated data set. Since the proposed approach has an embarrassingly parallel structure, it can be easily implemented on a parallel architecture and applied to big data problems with millions or more explanatory variables. Under mild conditions, we show that the proposed approach is consistent; that is, the true explanatory variables are correctly identified as the sample size becomes large. Extensive comparisons have been made with penalized likelihood approaches such as Lasso, elastic net, SIS and ISIS. The numerical results show that the proposed approach generally outperforms the penalized likelihood approaches: the models it selects tend to be sparser and closer to the true model.
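A minimal sketch of the two-stage split-and-merge strategy is given below. The per-subset selector here is scikit-learn's LassoCV, used purely as a stand-in for the paper's Bayesian selection procedure, and the function names and block size are illustrative assumptions, not taken from the text:

```python
# Sketch of split-and-merge variable selection. `select` is a placeholder
# selector (LassoCV), not the Bayesian method proposed in the work.
import numpy as np
from sklearn.linear_model import LassoCV

def select(X, y):
    """Return indices of variables with nonzero estimated coefficients."""
    fit = LassoCV(cv=5).fit(X, y)
    return np.flatnonzero(fit.coef_)

def split_and_merge(X, y, block_size=1000):
    n, p = X.shape
    # Stage (i): split the p variables into lower dimensional blocks and
    # select from each block; the iterations are independent, so this loop
    # can be distributed across workers (the embarrassingly parallel step).
    kept = []
    for start in range(0, p, block_size):
        block = np.arange(start, min(start + block_size, p))
        kept.extend(block[select(X[:, block], y)])
    kept = np.array(sorted(kept))
    # Stage (ii): aggregate the surviving variables and select again.
    return kept[select(X[:, kept], y)] if kept.size else kept

# Toy usage: 200 observations, 5000 predictors, 5 true signals.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5000))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(200)
print(split_and_merge(X, y))
```

The key design point is that stage (i) never loads more than one block of variables at a time, so the memory and compute cost per worker is controlled by block_size rather than by p.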

In the frequentist realm, penalized likelihood methods have been widely used for variable selection, where the penalty functions are typically symmetric about 0, continuous, and nondecreasing on (0, ∞). The second contribution of this work is a new penalized likelihood method, the reciprocal Lasso (rLasso for short), based on a new class of penalty functions that are decreasing on (0, ∞), discontinuous at 0, and diverge to infinity as the coefficients approach zero. The new penalty functions assign nearly-zero coefficients infinite penalties; in contrast, conventional penalty functions assign nearly-zero coefficients nearly-zero penalties (e.g., Lasso and SCAD) or constant penalties (e.g., the L0 penalty). This distinguishing feature makes rLasso very attractive for variable selection: it can effectively avoid selecting overly dense models. We establish the consistency of rLasso for variable selection and coefficient estimation in both the low and high dimensional settings. Since the rLasso penalty functions induce an objective function with multiple local minima, we also propose an efficient Monte Carlo optimization algorithm to solve the minimization problem. Our simulation results show that rLasso outperforms other popular penalized likelihood methods, such as Lasso, SCAD, MCP, SIS, ISIS and EBIC: it produces sparser and more accurate coefficient estimates and has a higher probability of recovering the true model.
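For concreteness, one penalty with the properties described above (decreasing on (0, ∞), discontinuous at 0, diverging as the coefficient approaches zero) is the reciprocal-L1 form sketched below; the exact scaling and objective are an illustrative assumption consistent with the abstract, not a quotation from the text:

```latex
% Illustrative rLasso-type objective for linear regression: each nonzero
% coefficient is penalized by lambda / |beta_j|, while an excluded
% (exactly zero) coefficient incurs no penalty.
\[
  \hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^p}
    \Bigl\{ \tfrac{1}{2n}\,\lVert y - X\beta \rVert_2^2
      + \sum_{j=1}^{p} p_\lambda(\beta_j) \Bigr\},
  \qquad
  p_\lambda(\beta_j) =
  \begin{cases}
    \lambda / \lvert \beta_j \rvert, & \beta_j \neq 0,\\
    0, & \beta_j = 0,
  \end{cases}
\]
% p_lambda is decreasing on (0, infinity), jumps at 0, and tends to
% infinity as |beta_j| -> 0, so near-zero coefficients are heavily
% penalized and the selected model is pushed toward sparsity.
```

The jump at 0 is what makes the objective nonconvex with multiple local minima, which is why a Monte Carlo optimization algorithm, rather than a standard convex solver, is proposed for the minimization.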
