Nonparametric and Robust Methods for Community Detection in Complex Networks
Abstract
Community detection is a key for understanding latent mechanisms behind formation and dynamics of network data. However, the presence of outliers (vertices with low degree and small and weak communities) tends to aggravate the community discovery, and result in fluctuating outcomes and lack of robustness. Moreover, most of the currently available methods for community discovery within a spectral clustering framework are based on the Euclidean distance as a measure of "cohesion" or "closeness" among vertices, and thus do not explicitly account for the underlying probabilistic geometry of the graph. In this dissertation, we propose three new network segmentation approaches that aim to bridge the analysis of complex networks with modern methods of robust and nonparametric statistics. The dissertation explores utility of two different but tightly interrelated statistical methods, that is, data depth and impartial trimming, for community detection in unsupervised and supervised settings.
The first key idea is the notion of a robust and data-driven measure of distance that still remains new and unexplored in network sciences, data depth. We propose a new nonparametric supervised algorithm for detecting multiple communities in complex networks using the Depth vs. Depth (DD(G)) classifier. The proposed new DD(G)-method is inherently geometric and allows to simultaneously account for network communities and outliers. We illustrate the utility of the new approach using the benchmark political blogs data, "dark" terrorist networks, and analysis of bill cosponsorship in the Italian Parliament. Moreover, we propose a new data-driven unsupervised algorithm, K-depths, for community detection using the L_1-depth. We evaluate finite sample properties of the K-depths method using synthetic networks and illustrate its performance for tracking communities in online social media platform, Flickr. The new method significantly outperforms the classical K-means and yields comparable results to the regularized K-means. Being robust to low-degree vertices, the new K-depths method is computationally efficient, requiring up to 400 times less CPU time than the currently adopted regularization procedures based on optimizing the Davis-Kahan bound.
The second key idea presented in this dissertation is the approach of impartial trimming. This idea mainly comes from the fact that the performance of many clustering algorithms for community recovery in large complex networks is known to be adversely impacted by the presence of outliers. In Chapter 4, we propose a new unsupervised robustified clustering method, trimmed K-means, for community detection in sparse networks, based on the concept of impartial trimming, that is, outliers are self-determined by the joint structure of the entire multivariate sample. We show that the new robustified algorithm for network clustering offers a resistant alternative to conventional K-means in sparse multi-community networks. We evaluate finite sample properties of the new trimmed K-means using synthetic networks and illustrate its performance for tracking communities in a friendship network of the UK university faculty members.
This dissertation aims to open up new horizons for more robust, data-driven and systematic analysis of complex random networks. We hope that findings of this dissertation will inspire and boost the search of more links between network sciences and modern robust and nonparametric statistics.