Large-scale non-linear prediction with applications

Si, Si, Ph.D.

Large-scale non-linear prediction with applications

Date

2016-08

Authors

Si, Si, Ph.D.

Abstract

With an immense growth in data, there is a great need for training and testing machine learning models on very large data sets. Several standard non-linear algorithms based on either kernels (e.g., kernel support vector machines and kernel ridge regression) or decision trees (e.g., gradient boosted decision trees) often yield superior predictive performance on various machine learning tasks compared to linear methods; however, they suffer from severe computation and memory challenges when scaling to millions of data instances. To overcome these challenges, we develop a family of scalable kernel-approximation-based and decision-tree-based algorithms to reduce the computational cost of non linear methods in terms of training time, prediction time and memory usage. We further show their superior performance on a wide range of machine learning tasks including large-scale classification, regression, and extreme multi-label learning.

In particular, we make the following contributions: (1) We develop a family of memory efficient kernel approximation algorithms by exploiting the structure of kernel matrices. The proposed kernel approximation scheme can significantly speed up the training phase of kernel machines; (2) We make the connection between forming a kernel approximation and predicting new instances using kernel machines, and propose a series of improvements over the classical \Nystrom kernel approximation method. We show that these improvements result in an order of magnitude speed-up in prediction time on large-scale classification and regression tasks with millions of training instances; (3) We overcome the challenges of applying decision trees to the extreme multi-label classification problem, which can have more than 100,000 different labels, and develop the first Gradient Boosting Decision Tree (GBDT) algorithm for extreme multi-label learning. We show that the modified GBDT algorithm achieves substantial reductions in prediction time and model size.