Probabilistic model-based clustering of complex data

Date

2003

Abstract

In many emerging data mining applications, one needs to cluster complex data such as very high-dimensional sparse text documents and continuous or discrete time sequences. Probabilistic model-based clustering techniques have shown promising results in many such applications. For real-valued low-dimensional vector data, Gaussian models have been frequently used. For very high-dimensional vector and non-vector data, model-based clustering is a natural choice when it is difficult to extract good features or identify an appropriate measure of similarity between pairs of data objects.

This dissertation presents a unified framework for model-based clustering based on a bipartite graph view of data and models. The framework includes an information-theoretic analysis of model-based partitional clustering from a deterministic annealing point of view and a view of model-based hierarchical clustering that leads to several useful extensions. The framework is used to develop two new variations of model-based clustering: a balanced model-based partitional clustering algorithm that produces clusters of comparable sizes and a hybrid model-based clustering approach that combines the advantages of partitional and hierarchical model-based algorithms.

I apply the framework and new clustering algorithms to cluster several distinct types of complex data, ranging from arbitrary-shaped 2-D synthetic data to high-dimensional documents, EEG time series, and gene expression time sequences. The empirical results demonstrate the usefulness of the scalable, balanced model-based clustering algorithms, as well as the benefits of the hybrid model-based clustering approach. They also showcase the generality of the proposed clustering framework.

Description

text
