Probabilistic model-based clustering of complex data
Abstract
In many emerging data mining applications, one needs to cluster complex data such as very high-dimensional sparse text documents and continuous or discrete time sequences. Probabilistic model-based clustering techniques have shown promising results in many such applications. For real-valued low-dimensional vector data, Gaussian models have been frequently used. For very high-dimensional vector and non-vector data, model-based clustering is a natural choice when it is difficult to extract good features or identify an appropriate measure of similarity between pairs of data objects.

This dissertation presents a unified framework for model-based clustering based on a bipartite graph view of data and models. The framework includes an information-theoretic analysis of model-based partitional clustering from a deterministic annealing point of view and a view of model-based hierarchical clustering that leads to several useful extensions. The framework is used to develop two new variations of model-based clustering: a balanced model-based partitional clustering algorithm that produces clusters of comparable sizes and a hybrid model-based clustering approach that combines the advantages of partitional and hierarchical model-based algorithms.

I apply the framework and new clustering algorithms to cluster several distinct types of complex data, ranging from arbitrary-shaped 2-D synthetic data to high-dimensional documents, EEG time series, and gene expression time sequences. The empirical results demonstrate the usefulness of the scalable, balanced model-based clustering algorithms, as well as the benefits of the hybrid model-based clustering approach. They also showcase the generality of the proposed clustering framework.