Statistical clustering of data

Zhang, Lihao

Date

2015-05

Abstract

Cluster analysis aims to segment objects into groups of similar members and therefore helps to reveal the distribution of properties and correlations in large datasets. Data clustering has been widely studied because it arises in many domains, including marketing, engineering, and the social sciences. In particular, the emergence of large-scale transactional and experimental datasets in recent years has significantly increased the need for clustering techniques to reduce the effective size of the data and to gain a better understanding of it. This report introduces fundamental concepts of cluster analysis, addresses the similarity and dissimilarity measures used to define clusters, and explains three major clustering algorithms, both theoretically and experimentally, to illustrate the clustering process: hierarchical clustering, K-means clustering, and the Gaussian mixture model fitted by the Expectation-Maximization (EM) algorithm. Finally, methods for determining the number of clusters and for validating the clustering results are presented for clustering evaluation.