Thresholding Multivariate Regression and Generalized Principal Components




As high-dimensional data arise in many fields of science and technology, traditional multivariate methods need to be updated. Principal component analysis and reduced rank regression are two of the most important multivariate statistical techniques, and both have seen major changes in recent years. To improve statistical performance and computational efficiency, recent approaches regularize both the row and column factors of the low-rank matrix approximation by adopting Lasso-type penalties. Thresholding is another powerful technique for regularizing the row and column factors, and it does not require solving an optimization problem. This dissertation covers two novel applications of the idea of thresholding: thresholding reduced rank multivariate regression and generalized principal component analysis/singular value decomposition (SVD). The following two paragraphs give brief introductions to the two topics, respectively.

Uncovering a meaningful relationship between the responses and the predictors is a fundamental goal in multivariate regression problems, and it can be very challenging when the data are high-dimensional. Dimension reduction and regularization techniques are applied extensively to alleviate the curse of dimensionality. It is desirable to estimate the regression coefficient matrix by low-rank matrices constructed from its SVD. We reduce such regression problems to sparse SVD problems for correlated data matrices and generalize the fast iterative thresholding for sparse SVDs algorithm to this situation. This generalization inherits the computational and statistical advantages of the original algorithm, including its sparse initialization, its novel ways of estimating the thresholding levels, and its thresholded subspace iterations. It guarantees the orthogonality of the singular vectors and computes them simultaneously rather than sequentially, as existing methods do. We also place this algorithm in an optimization framework by introducing a specific bi-convex objective function. An iterative algorithm that minimizes the objective function via closed-form iterates is proposed, and its convergence is established. This enables us to study the large-sample properties of the solution of the multivariate regression problem and to establish consistency of the estimators as the sample size tends to infinity. The methodology, and the potential adverse impact of dependence on the earlier algorithms, are illustrated using simulations and real data.
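To make the idea of thresholded subspace iterations concrete, the following is a minimal NumPy sketch, not the algorithm developed in the dissertation. It assumes fixed, user-supplied soft-thresholding levels and a plain SVD start, whereas the abstract refers to a sparse initialization and data-driven thresholding levels; it also ignores the correlation structure that the dissertation accounts for. The function names `soft_threshold` and `thresholded_subspace_iteration` and the parameters `lam_u` and `lam_v` are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Entrywise soft-thresholding: shrink toward zero by lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def thresholded_subspace_iteration(X, rank, lam_u, lam_v, n_iter=50):
    """Illustrative sparse SVD via alternating threshold-then-orthogonalize steps.

    Assumptions (not from the source): fixed thresholds lam_u and lam_v,
    initialization from the ordinary SVD, a fixed number of iterations,
    and thresholds small enough that the thresholded matrices keep full
    column rank.
    """
    # Start from the leading right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:rank].T
    for _ in range(n_iter):
        # Multiply, threshold, then re-orthonormalize with a QR step.
        U, _ = np.linalg.qr(soft_threshold(X @ V, lam_u))
        V, _ = np.linalg.qr(soft_threshold(X.T @ U, lam_v))
    return U, V
```

Because all `rank` columns are updated together and re-orthonormalized by the QR step, the factors are orthonormal by construction and are obtained simultaneously rather than one at a time. Rows that are zeroed out entirely by the thresholding stay zero after the QR step, which is what gives row-sparse factors in this sketch.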

The second part of this dissertation considers transposable data matrices, whose rows and columns are both correlated. Such datasets are routinely encountered in fields such as econometrics, bioinformatics, chemometrics, and network analysis. While methods to approximate high-dimensional data matrices have been extensively researched for uncorrelated and independent situations, much less has been done for transposable data matrices. A generalization of principal component analysis and the related weighted least squares matrix decomposition with respect to a transposable quadratic norm, along with regularized counterparts, was proposed recently for such data matrices. We replace this optimization framework by thresholding the factors in the decompositions, and we propose a fast iterative thresholding algorithm for sparse generalized matrix decomposition that finds sparse factors of the data matrix and accounts for the two-way dependencies simultaneously. By connecting reduced rank regression and canonical correlation analysis with the generalized matrix decomposition, we show that our algorithm is suitable for both problems when the data are two-way dependent. These connections enable us to improve predictive accuracy in regression and facilitate interpretation of our proposed algorithm. The effectiveness of the method is tested and illustrated through simulations and real data examples.
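As a rough illustration of a matrix decomposition with respect to a transposable quadratic norm, the sketch below computes an unregularized (non-sparse) generalized decomposition by reducing it to an ordinary SVD. It assumes that the row and column weight matrices, here called `Q` and `R`, are symmetric positive definite; the names `sym_power` and `gmd` are hypothetical, and the thresholding step proposed in the dissertation is not included.

```python
import numpy as np

def sym_power(A, p):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, E = np.linalg.eigh(A)
    return (E * w**p) @ E.T

def gmd(X, Q, R, rank):
    """Illustrative generalized matrix decomposition X ~ U diag(d) V^T.

    Assumes Q (rows) and R (columns) are symmetric positive definite.
    The factors satisfy U^T Q U = I and V^T R V = I.
    """
    # Ordinary SVD of the transformed matrix Q^{1/2} X R^{1/2}.
    X_tilde = sym_power(Q, 0.5) @ X @ sym_power(R, 0.5)
    U_t, d, Vt_t = np.linalg.svd(X_tilde, full_matrices=False)
    # Map the singular vectors back to the original coordinates.
    U = sym_power(Q, -0.5) @ U_t[:, :rank]
    V = sym_power(R, -0.5) @ Vt_t[:rank].T
    return U, d[:rank], V
```

With `Q` and `R` set to identity matrices this reduces to the ordinary truncated SVD, which is one way to see how the decomposition generalizes principal component analysis to two-way dependent data.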