Simultaneous partitioning and modeling: a framework for learning from complex data
While a single learned model is adequate for simple prediction problems, it may not be sufficient to represent the heterogeneous populations that difficult classification or regression problems often involve. In such scenarios, practitioners often adopt a "divide and conquer" strategy that segments the data into relatively homogeneous groups and then builds a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider prediction problems on bi-modal or dyadic data with covariates, e.g., predicting customer behavior across products, where the independent variables can be naturally partitioned along the modes. A pivoting operation then places the target variable in the entries of a "customer by product" data matrix. We present a model-based co-clustering framework that interleaves partitioning (clustering) along each mode with the construction of prediction models, iteratively improving both the cluster assignments and the fit of the models. This Simultaneous CO-clustering And Learning (SCOAL) framework generalizes co-clustering and collaborative filtering to model-based co-clustering, and is shown to be better than independently clustering the data first and then building models. Our framework applies to a wide range of bi-modal and multi-modal data, and can be easily specialized to address classification and regression problems in domains like recommender systems, fraud detection and marketing. Further, we note that in several datasets not all of the data is useful for the learning problem, and that ignoring outliers and non-informative values may lead to better models. We explore extensions of SCOAL that automatically identify and discard irrelevant data points and features while modeling, in order to improve prediction accuracy.
Next, we leverage the multiple models provided by the SCOAL technique to address two prediction problems on dyadic data: (i) ranking predictions based on their reliability, and (ii) active learning. We also extend SCOAL to predictive modeling of multi-modal data where one of the modes is implicitly ordered, e.g., time series data. Finally, we describe our implementation of a parallel version of SCOAL based on the Google MapReduce framework and developed on the open-source Hadoop platform. We demonstrate the effectiveness of specific instances of the SCOAL framework on prediction problems through experimentation on real and synthetic data.
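
The interleaving of partitioning and model fitting described above can be illustrated with a minimal sketch. It is not the implementation from this work: it assumes a linear least-squares model per co-cluster, squared-error loss, and features of the form [1, row covariates, column covariates]. The names `features`, `scoal_sketch`, `Z`, `W`, `C`, `P`, `K` and `L` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def features(C, P):
    """Build x_ij = [1, C_i, P_j] for every (row, column) pair.

    Returns an array of shape (m, n, 1 + d1 + d2). Assumed feature layout.
    """
    m, n = C.shape[0], P.shape[0]
    ones = np.ones((m, n, 1))
    Cb = np.broadcast_to(C[:, None, :], (m, n, C.shape[1]))
    Pb = np.broadcast_to(P[None, :, :], (m, n, P.shape[1]))
    return np.concatenate([ones, Cb, Pb], axis=2)

def scoal_sketch(Z, W, C, P, K, L, n_iter=20, tol=1e-9, seed=0):
    """Alternate between fitting one linear model per co-cluster and
    reassigning rows and columns to the cluster with the lowest error.

    Z: (m, n) targets, W: (m, n) boolean mask of observed entries,
    C: (m, d1) row covariates, P: (n, d2) column covariates.
    """
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    X = features(C, P)
    rho = rng.integers(K, size=m)      # row cluster of each row
    gamma = rng.integers(L, size=n)    # column cluster of each column
    beta = np.zeros((K, L, X.shape[2]))
    history = []
    for _ in range(n_iter):
        # (1) Refit the model of every non-empty co-cluster.
        for k in range(K):
            for l in range(L):
                mask = W & (rho[:, None] == k) & (gamma[None, :] == l)
                if mask.any():
                    beta[k, l] = np.linalg.lstsq(X[mask], Z[mask], rcond=None)[0]
        # (2) Move each row to the row cluster with the lowest squared error,
        #     keeping the column assignments fixed.
        pred = np.einsum('ijd,kjd->kij', X, beta[:, gamma, :])   # (K, m, n)
        rho = (W * (Z - pred) ** 2).sum(axis=2).argmin(axis=0)
        # (3) Same for columns, keeping the new row assignments fixed.
        pred = np.einsum('ijd,ild->lij', X, beta[rho, :, :])     # (L, m, n)
        gamma = (W * (Z - pred) ** 2).sum(axis=1).argmin(axis=0)
        # Track the total squared error over observed entries.
        fit = np.einsum('ijd,ijd->ij', X, beta[rho[:, None], gamma[None, :]])
        history.append(float((W * (Z - fit) ** 2).sum()))
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
    return rho, gamma, beta, history

# Tiny synthetic demo (made-up data, only to show the call pattern).
rng = np.random.default_rng(1)
m, n = 30, 20
C, P = rng.normal(size=(m, 2)), rng.normal(size=(n, 2))
true_rho, true_gamma = rng.integers(2, size=m), rng.integers(2, size=n)
true_beta = rng.normal(size=(2, 2, 5))
Z = np.einsum('ijd,ijd->ij', features(C, P),
              true_beta[true_rho[:, None], true_gamma[None, :]])
Z += 0.1 * rng.normal(size=Z.shape)
W = rng.random((m, n)) < 0.8           # roughly 80% of entries observed
rho, gamma, beta, history = scoal_sketch(Z, W, C, P, K=2, L=2)
```

In this sketch each of the three steps can only keep or lower the total squared error on the observed entries, so `history` is non-increasing up to floating-point error. That is the sense in which cluster assignment and model fit improve together. Other model classes, such as classifiers, would need their own loss in place of squared error.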