Learning with high-dimensional noisy data






Learning an unknown parameter from data is a problem of fundamental importance across many fields of engineering and science. Rapid developments in information technology allow large amounts of data to be collected. These data are often highly non-uniform and noisy, and are sometimes subject to gross errors and even direct manipulation. The data explosion also highlights the importance of the so-called high-dimensional regime, where the number of variables may exceed the number of samples. Extracting useful information from such data requires high-dimensional learning algorithms that are robust to noise. However, standard algorithms for the high-dimensional regime are often brittle in the presence of noise, and the suite of techniques developed in Robust Statistics is often inapplicable to large, high-dimensional data. In this thesis, we study the problem of robust statistical learning in high dimensions from noisy data. Our goal is to better understand the behavior and effects of noise in high-dimensional problems, and to develop algorithms that are statistically efficient, computationally tractable, and robust to various types of noise. We venture into this territory by considering three important sub-problems. We first look at the problem of recovering a sparse vector from a few linear measurements, where both the response vector and the covariate matrix are subject to noise. Both stochastic and arbitrary noise are considered. We show that standard approaches are inadequate in these settings. We then develop robust and efficient algorithms that provably recover the support and values of the sparse vector under different noise models and require minimal knowledge of the nature of the noise. Next, we study the problem of recovering a low-rank matrix from partially observed entries, with some of the observations arbitrarily corrupted.
We consider the entry-wise corruption setting where no row or column has too many entries corrupted, and provide performance guarantees for a natural convex relaxation approach. Our unified guarantees cover both randomly and deterministically located corruptions, and improve upon existing results. We then turn to the column-wise corruption case where all observations from some columns are arbitrarily contaminated. We propose a new convex optimization approach and show that it simultaneously identifies the corrupted columns and recovers unobserved entries in the uncorrupted columns. Lastly, we consider the graph clustering problem, i.e., arranging the nodes of a graph into clusters such that there are relatively dense connections inside the clusters and sparse connections across different clusters. We propose a semi-random Generalized Stochastic Blockmodel for clustered graphs and develop a new algorithm based on convexified maximum likelihood estimators. We provide theoretical performance guarantees which recover, and sometimes improve on, all existing results for the classical stochastic blockmodel, the planted k-clique model, and the planted coloring model. We extend our algorithm to the case where the clusters are allowed to overlap with each other, and provide a theoretical characterization of the performance of the algorithm. A further extension is studied in which the graph may change over time. We develop new approaches to incorporate the time dynamics and show that they can identify stable overlapping communities in real-world time-evolving graphs.
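
To make the graph clustering setting concrete, the following is a minimal sketch of how a graph could be sampled from the classical stochastic blockmodel mentioned above: nodes in the same cluster are connected with probability p, and nodes in different clusters with probability q < p. It illustrates only the data model, not the thesis's Generalized Stochastic Blockmodel or its convexified maximum likelihood algorithm. The function names (sample_sbm, edge_densities), the cluster sizes, and the values of p and q are hypothetical choices made for illustration.

```python
import numpy as np


def sample_sbm(sizes, p, q, seed=0):
    """Sample an undirected graph from a classical stochastic blockmodel.

    sizes: number of nodes in each cluster (illustrative assumption)
    p:     probability of an edge between two nodes in the same cluster
    q:     probability of an edge between two nodes in different clusters
    Returns the symmetric 0/1 adjacency matrix and the true cluster labels.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    # same[i, j] is True when nodes i and j belong to the same cluster.
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    # Draw each unordered pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(int)
    return adj, labels


def edge_densities(adj, labels):
    """Return the empirical edge density within clusters and across clusters."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(labels.size, dtype=bool)
    within = adj[same & off_diag].mean()
    across = adj[~same].mean()
    return within, across
```

For example, calling sample_sbm([40, 40, 40], p=0.6, q=0.05) and passing the result to edge_densities should give a within-cluster density close to p and an across-cluster density close to q, which is the "dense inside, sparse across" structure that a clustering algorithm tries to recover from the adjacency matrix alone.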