Algorithms for Robust Data Analysis
Abstract
Data analysis plays an important role in making decisions, making predictions, and helping business operate. Unfortunately, in many situations the data is not reliable and robust analysis is needed to obtain stable results. This may be challenging when the data is high dimensional. Transforming high-dimensional data into low-dimensional data is an important prior step in applications such as managing the data, performing efficient learning, retrieving information, and other data analytics tasks. Feature selection and feature extraction are two classical techniques to achieve dimensionality reduction. Feature selection removes irrelevant and redundant features and keeps only the most important ones. Unlike feature selection, feature extraction generates the features as arbitrary functions of the data. They are typically more accurate than those obtained by feature selection. But the extracted features are harder to interpret than the selected features. Another disadvantage is that feature extraction is less robust than feature selection. In this dissertation we describe algorithms and methods to improve the robustness of feature selection and feature extraction. We propose computing a hybrid low rank representation (HLR) of selected features and extracted features. The robustness of the HLR model comes from the selected features, and its accuracy comes from the extracted features. We develop an algorithm to solve this hybrid problem optimally by combinatorial search. We propose optimal, sub-optimal, and greedy variants of the algorithm to solve this hybrid problem. The sub-optimal and the greedy variants come with the exact bounds on the representation accuracy. Principal Component Analysis (PCA) is a widely used feature extraction algorithm. It is known to be sensitive to outliers that reduce its robustness. Robust Principal Component Analysis (RPCA), also known as Robust Subspace Recovery (RSR), is a classical approach to improve the robustness of the PCA by identifying and removing outliers. We develop a novel RPCA algorithm, which converts this problem into graph search. We show how to solve the graph search problem optimally by applying heuristic search techniques from AI. The results obtained by our algorithm are optimal in terms of accuracy. We also describe a sub-optimal variant that runs much faster than the optimal variant and produces a solution that is guaranteed to be near the optimal. Outlier based RPCA removes outliers from the data and computes the principal components of the remaining data. The centered variant requires the center of non-outliers, which is unknown until after the outliers are determined. Not using an accurate center may lead to the detection of wrong outliers. We propose a method that can be used to improve the robustness of many currently known RPCA algorithms. Our method implicitly centers the non-outliers; it is implemented by appending a bias value to each data element. It can be used with “black box” RPCA algorithms since only their input needs to be augmented.