Unsupervised learning methods: An efficient clustering framework with integrated model selection

Corona, Enrique

Date

2012-08

Authors

Corona, Enrique

Abstract

Classification is one of the most important practices in data analysis. In the context of machine learning, this practice can be viewed as the problem of identifying representative data patterns in such a manner that coherent groups are formed. If the data structure is readily available (e.g., in supervised learning), it is usually used to establish classification rules for discrimination. However, when the data is unlabeled, its underlying structure must be unveiled first. Consequently, unsupervised classification poses more challenges. Among them, the fundamental question of an appropriate number of groups or clusters in the data must be addressed. In this context, the "jump" method, an efficient but limited linear approach that finds plausible answers to the number of clusters in a dataset, is improved via the optimization of an appropriate objective function that quantifies the quality of particular cluster configurations. Recent developments showing interesting associations between spectral clustering (SC) and kernel principal component analysis (KPCA) are used to extend the improved method to the non-linear domain. This is achieved by mapping the input data to a new space where the original clusters appear as linear structures. The characteristics of this mapping depend to a large extent on the parameters of the selected kernel function. By projecting these linear structures onto the unit sphere, the proposed method is able to measure the quality of the resulting cluster configurations. These quality scores aid in the simultaneous selection of the kernel parameters (i.e., model selection) and the number of clusters present in the dataset. Results of the enhanced jump method are compared to other relative validation criteria such as minimum description length (MDL), Akaike's information criterion (AIC) and consistent Akaike's information criterion (CAIC). The extension of the method is tested against other cluster validity indices in similar settings, such as the adjusted Rand index (ARI) and the balanced line fit (BLF). Finally, image segmentation examples are shown as a real-world application of the technique.
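
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the basic jump heuristic as it is commonly described, not the improved or kernel-based method of this thesis. It assumes an identity covariance in the distortion, the usual transformation power Y = p/2 (p being the data dimension), and a plain k-means with a deterministic farthest-point seeding. All function names (`kmeans_distortion`, `jump_method`, `make_blobs`) and the toy data are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from its nearest
    already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_distortion(points, k, iters=50):
    """Run Lloyd's algorithm and return the distortion d_K: the mean
    squared distance to the nearest center, divided by the dimension."""
    dim = len(points[0])
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group happens to be empty.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return total / (len(points) * dim)


def jump_method(points, max_k, power=None):
    """Return (best_k, jumps), where jumps[K-1] = d_K^(-Y) - d_{K-1}^(-Y)
    and d_0^(-Y) is taken to be 0. Assumes max_k is smaller than the
    number of distinct points, so no distortion is exactly zero."""
    dim = len(points[0])
    y = dim / 2 if power is None else power
    transformed = [0.0]
    for k in range(1, max_k + 1):
        transformed.append(kmeans_distortion(points, k) ** (-y))
    jumps = [transformed[k] - transformed[k - 1] for k in range(1, max_k + 1)]
    best_k = 1 + max(range(max_k), key=lambda i: jumps[i])
    return best_k, jumps


def make_blobs(centers, n_per_blob, sigma, seed=0):
    """Toy data: isotropic Gaussian blobs around the given centers."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(c, sigma) for c in center)
        for center in centers
        for _ in range(n_per_blob)
    ]


# Three well-separated 2-D blobs; the largest jump is expected at K = 3.
points = make_blobs([(0, 0), (10, 0), (0, 10)], n_per_blob=60, sigma=0.5)
best_k, jumps = jump_method(points, max_k=6)
print(best_k)
```

Because the distortion here is measured with plain Euclidean distance, a sketch like this can only separate clusters that are compact in the input space, which illustrates the linear limitation that the abstract says the kernel-based extension is meant to address.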