Browsing by Subject "Clustering"
Now showing 1 - 16 of 16
Item A Comparison of Clustering Methods for Developing Models of User Interest (2012-07-16) Ganta, Prasanth
For open-ended information tasks, users must sift through many potentially relevant documents, assessing and prioritizing them based on their relevance to the current information need, a practice we refer to as document triage. Users often perform triage through their interaction with multiple applications, and to support them efficiently in this process, an extensible multi-application architecture, the Interest Profile Manager (IPM), was developed in prior research at Texas A&M University. IPM infers user interests from their interactions with documents, especially the interests expressed by the user through an interpretive action such as assigning a visual characteristic (color), coupled with the document's content characteristics. IPM treats each specific color and application as an interest class, and the main challenge for the user is to maintain a consistent interest class-color scheme across applications indefinitely, which is not practical. This thesis presents a system that can help reduce potential problems caused by these inconsistencies by indicating when such inconsistencies have occurred in the past or are happening in the user's current triage activity. It includes (1) a clustering algorithm to group similar triage interest instances by choosing the factors that define the similarity of interest instances, and (2) an approach to identify sequences of user actions that provide strong evidence of the user's intent, which can be used as constraints during clustering. Constrained and unconstrained versions of three agglomerative hierarchical clustering algorithms have been studied: (1) single-link, (2) complete-link, and (3) UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The contribution of each of three factors, (1) content similarity, (2) temporal similarity, and (3) visual similarity, to the overall similarity between interest instances has also been examined. Our results indicate that the single-link algorithm performs better than the other two clustering algorithms, and that the combination of all three similarity factors defines the similarity between two instances better than any single factor alone. The use of constraints as strong evidence of the user's intent improved the clustering performance of the algorithms.
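As a rough illustration of the comparison this abstract describes, here is a minimal sketch using SciPy's hierarchical clustering over a weighted combination of per-factor distance matrices. The random matrices, equal factor weights, and four-cluster cut are stand-in assumptions, not the thesis's actual data or settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 30
# Hypothetical per-factor distance matrices between triage interest
# instances (content, temporal, visual); stand-ins for real features.
factors = [rng.random((n, n)) for _ in range(3)]
factors = [(d + d.T) / 2 for d in factors]          # symmetrize
for d in factors:
    np.fill_diagonal(d, 0.0)

# Combine the three factors with equal weights (an assumption; the
# thesis studies how each factor contributes to overall similarity).
combined = sum(factors) / len(factors)
condensed = squareform(combined)                    # condensed form for linkage

# Compare single-link, complete-link, and UPGMA merges.
for method in ("single", "complete", "average"):    # "average" == UPGMA
    Z = linkage(condensed, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust") # cut into 4 clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])
```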
Item Automatic semiconductor wafer map defect signature detection using a neural network classifier (2010-12) Radhamohan, Ranjan Subbaraya; Ghosh, Joydeep; El-Hamdi, Mohamed
The application of popular image processing and classification algorithms, including agglomerative clustering and neural networks, is explored for the purpose of grouping semiconductor wafer defect map patterns. Challenges such as overlapping pattern separation, wafer rotation, and false data removal are examined and solutions are proposed. After grouping, wafer processing history is used to automatically determine the most likely source of the issue. Results are provided that indicate these methods hold promise for wafer analysis applications.

Item Bayesian variable selection in clustering via Dirichlet process mixture models (Texas A&M University, 2007-09-17) Kim, Sinae
The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this dissertation, I propose a model-based method that addresses the two problems simultaneously. I use Dirichlet process mixture models to define the cluster structure and introduce into the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves, and then apply variable selection and sample clustering methods in the wavelet domain. My methodology is thus wavelet-based and aims at clustering the curves while identifying the wavelet coefficients that describe discriminating local features. I exemplify the method on high-dimensional, high-frequency tidal volume traces measured under an induced panic attack model in normal humans.

Item Bootstrapping in a high dimensional but very low sample size problem (Texas A&M University, 2006-08-16) Song, Juhee
High Dimension, Low Sample Size (HDLSS) problems have recently received much attention in many areas of science; the analysis of microarray experiments is one such area. Numerous studies are ongoing to investigate the behavior of genes by measuring the abundance of mRNA (messenger RiboNucleic Acid), i.e., gene expression. The HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations. We assume a statistical model in which measurements from the same subject have the same expected value and variance, and all subjects have the same distribution up to location and scale. Information from all subjects is shared in estimating this common distribution. Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis, the t-test, the sign test, and traditional bootstrapping, do not necessarily provide reliable results, since there are only a few observations for each data set. We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To obtain reasonable initial parameter estimates when density estimation methods are applied, we apply clustering methods, including agglomerative and K-means clustering. The Bayesian Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods, including a maximum likelihood unimodal density estimator and kernel density estimation, are used to estimate the unknown density. Once the density is estimated, a bootstrapping algorithm that draws samples from the estimated density is used to approximate the distribution of the test statistics. The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all the small data sets. Simulated data sets and a set of cDNA (complementary DeoxyriboNucleic Acid) microarray experiment data are analyzed by the proposed methods.
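The abstract outlines a pipeline: pool standardized data, estimate the common density, bootstrap test statistics from it, and control the false discovery rate. Here is a minimal sketch of that general idea on simulated data, substituting a Gaussian KDE for the dissertation's unimodal maximum likelihood estimator and using Benjamini-Hochberg for FDR control; the sample sizes, effect size, and 10% FDR level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
n_sets, n_obs, B = 200, 4, 2000     # many small data sets, few obs each

# Simulated HDLSS data: most subjects have mean 0, a few do not.
data = rng.normal(0, 1, (n_sets, n_obs))
data[:20] += 1.5                    # 20 truly non-null subjects

# Pool standardized observations across subjects to estimate the
# common distribution (here via a Gaussian KDE, for simplicity).
pooled = ((data - data.mean(1, keepdims=True))
          / data.std(1, ddof=1, keepdims=True)).ravel()
kde = gaussian_kde(pooled)

# Bootstrap the null distribution of the t-statistic by resampling
# from the estimated density.
boot = kde.resample(B * n_obs, seed=2).reshape(B, n_obs)
t_null = boot.mean(1) / (boot.std(1, ddof=1) / np.sqrt(n_obs))

t_obs = data.mean(1) / (data.std(1, ddof=1) / np.sqrt(n_obs))
pvals = np.array([(np.abs(t_null) >= abs(t)).mean() for t in t_obs])

# Benjamini-Hochberg control of the false discovery rate at 10%.
order = np.argsort(pvals)
thresh = 0.10 * np.arange(1, n_sets + 1) / n_sets
passed = pvals[order] <= thresh
k = passed.nonzero()[0].max() + 1 if passed.any() else 0
print(f"{k} of {n_sets} subjects flagged at FDR 0.10")
```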
Item Combining classifier and cluster ensembles for semi-supervised and transfer learning (2012-05) Acharya, Ayan; Ghosh, Joydeep; Mooney, Raymond J.
Unsupervised models can provide supplementary soft constraints to help classify new "target" data, since similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This contribution describes two general frameworks that take as input class membership estimates from existing classifiers learnt on previously encountered "source" data, as well as a set of cluster labels from a cluster ensemble operating solely on the target data to be classified, and yield a consensus labeling of the target data. One of the proposed frameworks admits a wide range of loss functions and classification/clustering methods, and exploits properties of Bregman divergences in conjunction with Legendre duality to yield a principled and scalable approach. The other approach is built on probabilistic mixture models and provides the additional flexibility of distributed computation, which is useful when the target data cannot be gathered in a single place for privacy or security reasons. A variety of experiments show that the proposed frameworks can yield results substantially superior to those provided by popular transductive learning techniques, or by naively applying classifiers learnt on the original task to the target data.

Item Community detection in network analysis: a survey (2016-05) Zhang, Lingjia; Lin, Lizhen, Ph.D.; Keitt, Timothy
Community structures are common in networks across many domains, including sociology, biology, and business. The defining characteristic of community structure is that nodes within the same community are highly similar, while nodes across communities exhibit low similarity. In academia, there has been a surge in research on community detection in network analysis, especially in developing statistically sound methodologies for exploring, modeling, and interpreting these kinds of structures and relationships. This survey paper aims to provide a brief, comparative review of currently applicable statistical methodologies and approaches, along with metrics for evaluating graph clustering results and applications using R. We conclude with promising directions for future research.

Item CUDIA: a probabilistic cross-level imputation framework using individual auxiliary information (2011-12) Park, Yubin; Ghosh, Joydeep; Markey, Mia K.
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher or more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRR) or Hospital Service Areas (HSA). Such levels constitute partitions over the underlying individual-level data, which may not match the groupings that would have been obtained if one clustered the data based on individual-level attributes. Moreover, treating aggregated values as representatives of the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this thesis, we seek better utilization of variably aggregated datasets, which are possibly assembled from different sources. We propose a novel "cross-level" imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased, and it can be further utilized in subsequent predictive modeling, yielding improved accuracy. Experimental results using a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset are provided to illustrate the generality and capabilities of the proposed framework.

Item Document clustering with nonparametric hierarchical topic modeling (2015-05) Schaefer, Kayla Hope; Williamson, Sinead; Zhou, Mingyuan
Since its introduction, topic modeling has been a fundamental tool in analyzing corpus structures. While the Relational Topic Model provides a way to link, and subsequently cluster, documents as an extension of the original Latent Dirichlet Allocation (LDA) model, this paper seeks to form a document clustering model for the nonparametric alternative to LDA, the Dirichlet process. As the structure of Shakespeare's tragedies is the focus of this work, we cluster documents while modeling the text using a Hierarchical Dirichlet Process (HDP), which allows for a mixture model with shared mixture components, in order to capture the natural topic clustering within a play. Using collapsed Gibbs sampling, the effectiveness of the clustered HDP is compared against that of LDA and of an HDP without document clustering, using both log perplexity and a qualitative assessment of the returned topics. Furthermore, clustering is performed and analyzed individually on speeches from each of ten tragedies, as well as on a combined corpus of acts.

Item Fitting planar points with two lines (2013-08) Jiang, Junhai; Martin, Clyde F.; Mansouri, Hossein
There is extensive research on linear regression, and it has become a systematic and important subject within statistics. However, almost all of this research focuses on regression with a single line; only a few researchers have considered fitting scattered points with more than one line, for example, two lines. The key point of this problem is to find a method that generates two lines which at least locally minimize the sum of squared distances from the points to the nearest line. The method works by fitting regression lines to two sets of points that are updated iteratively.
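The alternating scheme the last abstract sketches can be made concrete in a few lines. Below is a minimal sketch, assuming vertical (regression) residuals rather than orthogonal distances and a random initial split; like k-means, it only finds a local minimum.

```python
import numpy as np

def fit_two_lines(x, y, iters=100, seed=0):
    """Alternate between assigning each point to its nearer line and
    refitting each line by least squares (a Lloyd-style scheme that
    locally minimizes summed squared residuals to the nearer line).
    Assumes both groups stay non-degenerate."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, 2, x.size)            # random initial split
    for _ in range(iters):
        coefs = [np.polyfit(x[assign == k], y[assign == k], 1) for k in (0, 1)]
        resid = np.stack([np.abs(y - np.polyval(c, x)) for c in coefs])
        new_assign = resid.argmin(axis=0)          # nearer line wins
        if np.array_equal(new_assign, assign):
            break                                  # converged
        assign = new_assign
    return coefs, assign

# Demo on points sampled from two noisy lines.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
group = rng.random(200) < 0.5
y = np.where(group, 2 * x + 1, -x + 12) + rng.normal(0, 0.3, 200)
(c0, c1), assign = fit_two_lines(x, y)
print("line 0: slope %.2f, intercept %.2f" % tuple(c0))
print("line 1: slope %.2f, intercept %.2f" % tuple(c1))
```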
Item Hierarchical Signature Clustering for General Time-Series Data (2010-05) Koenig, Lars; Youn, Eunseog; Rushton, J. Nelson; Cooke, Daniel E.
Clustering data is an integral part of data processing and pattern extraction. Most existing clustering techniques can produce clusters from time-series data, but several issues arise when they are applied to such datasets. Unless specifically tailored, the distance metrics used lack interpretability for these data: distances have no units, or the units are hard to comprehend. Many methods treat time series as n-dimensional points, which gives a distance that can be understood abstractly but lacks the ability to discern patterns in the data by comparing distances to one another. While some previous methods are concerned with matching levels, also of interest are series that change levels in a similar pattern but at different measured levels. These are not clustered together using a Euclidean metric and are indiscernible using a correlation metric, so we propose a more appropriate metric that clusters data with similar behaviors and allows distances between clusters to be compared directly. This dissertation describes a new method that extends prior methods of trajectory clustering, allowing their application to longer time series and to short time series with poor trajectory distributions.
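The dissertation's metric is not specified in the abstract, but the level-versus-behavior distinction it draws can be illustrated by clustering on first differences instead of raw values. A minimal sketch, with toy series as stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
base = np.sin(2 * np.pi * t)

# Two series that change levels in the same pattern but at different
# measured levels, plus a flat series near the level of the first.
series = np.array([base, base + 5.0, np.zeros_like(t)])
series += rng.normal(0, 0.02, series.shape)

for name, X in (("raw levels", series),
                ("first differences", np.diff(series, axis=1))):
    labels = fcluster(linkage(pdist(X), method="average"),
                      t=2, criterion="maxclust")
    print(f"{name}: {labels}")
# Expected: on raw levels the flat series pairs with the nearby sine;
# on first differences the two sines pair, grouping by behavior.
```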
Item Nonparametric Bayesian analysis of some clustering problems (Texas A&M University, 2006-10-30) Ray, Shubhankar
Nonparametric Bayesian models have been researched extensively in the past ten years, following the work of Escobar and West (1995) on sampling schemes for Dirichlet processes. The infinite mixture representation of the Dirichlet process makes it useful for clustering problems where the number of clusters is unknown. We develop nonparametric Bayesian models for two different clustering problems, namely functional and graphical clustering. We propose a nonparametric Bayes wavelet model for clustering of functional or longitudinal data. The wavelet modelling is aimed at the resolution of global and local features during clustering. The model also allows the elicitation of prior belief about the regularity of the functions and has the ability to adapt to a wide range of functional regularity. Posterior inference is carried out by Gibbs sampling with conjugate priors for fast computation. We use simulated as well as real datasets to illustrate the suitability of the approach over alternatives. The functional clustering model is extended to analyze splice microarray data. New microarray technologies probe consecutive segments along genes to observe alternative splicing (AS) mechanisms that produce multiple proteins from a single gene. Clues regarding the number of splice forms can be obtained by clustering the functional expression profiles from different tissues. The analysis was carried out on the Rosetta dataset (Johnson et al., 2003) to obtain a splice-variant-by-tissue distribution for all 10,000 genes, and we were able to identify a number of splice forms that appear to be unique to cancer. We also propose a Bayesian model for partitioning graphs depicting dependencies in a collection of objects. After suitable transformations and modelling techniques, the problem of graph cutting can be approached by nonparametric Bayes clustering. We draw motivation from recent work (Dhillon, 2001) showing the equivalence of kernel k-means clustering and certain graph cutting algorithms. It is shown that loss functions similar to kernel k-means naturally arise in this model, and that minimization of the associated posterior risk comprises an effective graph cutting strategy. We present results from the analysis of two microarray datasets, the melanoma dataset (Bittner et al., 2000) and the sarcoma dataset (Nykter et al., 2006).

Item On the effect of INQUERY term-weighting scheme on query-sensitive similarity measures (Texas A&M University, 2006-04-12) Kini, Ananth Ullal
Cluster-based information retrieval systems often use a similarity measure to compute the association among text documents. In this thesis, we focus on a class of similarity measures named Query-Sensitive Similarity (QSS) measures. Recent studies have shown QSS measures to positively influence the outcome of a clustering procedure. These studies used QSS measures in conjunction with the ltc term-weighting scheme. Several term-weighting schemes have since superseded the ltc scheme and demonstrated better retrieval performance. We test whether introducing one of these schemes, INQUERY, offers any benefit over the ltc scheme when used in the context of QSS measures. The testing procedure uses the Nearest Neighbor (NN) test to quantify the clustering effectiveness of QSS measures under the corresponding term-weighting scheme. The NN tests are applied to certain standard test document collections, and the results are tested for statistical significance. On analyzing the results of the NN test relative to those obtained for the ltc scheme, we find several instances where the INQUERY scheme improves the clustering effectiveness of QSS measures. To be able to apply the NN test, we designed a software test framework, Ferret, complementing the features provided by dtSearch, a search engine. The test framework automates the generation of NN coefficients by processing standard test document collection data. We provide insight into the construction and operation of the Ferret test framework.
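The Nearest Neighbor test mentioned above has a simple common form: check, for each document, whether its nearest neighbor under the similarity measure shares its class label. A minimal sketch, assuming a toy labeled collection and plain tf-idf with cosine similarity in place of the ltc and INQUERY weightings actually compared in the thesis:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy labeled collection standing in for a standard test collection.
docs = ["wheat prices rise", "corn and wheat harvest", "grain exports grow",
        "bank interest rates climb", "central bank policy", "loan rates fall"]
labels = np.array([0, 0, 0, 1, 1, 1])

# One possible weighting; the thesis compares ltc vs. INQUERY instead.
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)
np.fill_diagonal(sim, -np.inf)       # a document is not its own neighbor

# NN coefficient: fraction of documents whose nearest neighbor under
# the similarity measure shares their class label.
nn = sim.argmax(axis=1)
print("NN coefficient:", (labels[nn] == labels).mean())
```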
Item Overlapping community detection in massive social networks (2015-12) Whang, Joyce Jiyoung; Dhillon, Inderjit S.; Grauman, Kristen; Mooney, Raymond J.; Pingali, Keshav; Gleich, David F.
Massive social networks have become increasingly popular in recent years, and community detection is one of the most important techniques for the analysis of such complex networks. A community is a set of cohesive vertices that has more connections inside the set than outside. In many social and information networks, these communities naturally overlap; for instance, in a social network, each vertex corresponds to an individual who usually participates in multiple communities. In this thesis, we propose scalable overlapping community detection algorithms that effectively identify high-quality overlapping communities in various real-world networks. We first develop an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of this algorithm is to find good seeds and then greedily expand them using a personalized PageRank clustering scheme. Experimental results show that our algorithm significantly outperforms other state-of-the-art overlapping community detection methods in terms of run time, cohesiveness of communities, and ground-truth accuracy. To develop more principled methods, we formulate overlapping community detection as a non-exhaustive, overlapping graph clustering problem, where clusters are allowed to overlap with each other and some nodes are allowed to be outside of any cluster. To tackle this problem, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. To optimize the objective, we develop not only fast iterative algorithms but also more sophisticated algorithms using a low-rank semidefinite programming technique. Our experimental results show that the new objective and algorithms are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness. We extend our non-exhaustive, overlapping clustering techniques to co-clustering, where the goal is to simultaneously identify a clustering of the rows as well as the columns of a data matrix. As an example application, consider recommender systems where users have ratings on items. This can be represented by a bipartite graph in which users and items are denoted by two different types of nodes and ratings are denoted by weighted edges between them; co-clustering is then a simultaneous clustering of users and items. We propose a new co-clustering objective function and an efficient co-clustering algorithm that can identify overlapping clusters as well as outliers on both types of nodes in the bipartite graph. We show that our co-clustering algorithm effectively captures the underlying co-clustering structure of the data, which in turn boosts the performance of a standard one-dimensional clustering. Finally, we study the design of parallel data-driven algorithms, which enables us to further increase the scalability of our overlapping community detection algorithms. Using PageRank as a model problem, we look at three algorithm design axes: work activation, data access pattern, and scheduling, and we investigate the impact of different design choices. Using these axes, we design and test a variety of PageRank implementations, finding that data-driven, push-based algorithms achieve significantly better scalability than standard PageRank implementations. The design choices affect both single-threaded performance and parallel scalability. The lessons learned from this study not only guide efficient implementations of many graph mining algorithms but also provide a framework for designing new scalable algorithms, especially for large-scale community detection.
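The seed-expansion step can be approximated with off-the-shelf tools. The following is a minimal sketch, not the thesis's algorithm: it runs NetworkX's PageRank with a personalization vector concentrated on a single assumed seed, then performs a conductance sweep over the resulting ranking. The toy graph and seed choice are illustrative.

```python
import networkx as nx

# Toy graph: two dense blocks joined through a bridge node.
G = nx.barbell_graph(6, 1)
seed = 0                               # assume a seed in the left block

# Personalized PageRank concentrated on the seed node.
pers = {v: 0.0 for v in G}
pers[seed] = 1.0
ppr = nx.pagerank(G, alpha=0.85, personalization=pers)

# Sweep: scan prefixes of nodes ranked by ppr/degree and keep the
# prefix with minimum conductance as the seed's community.
ranked = sorted(G, key=lambda v: ppr[v] / G.degree(v), reverse=True)
best, best_phi = None, float("inf")
for k in range(1, len(ranked)):
    S = set(ranked[:k])
    cut = sum(1 for u, v in G.edges if (u in S) != (v in S))
    vol = sum(G.degree(v) for v in S)
    phi = cut / min(vol, 2 * G.number_of_edges() - vol)
    if phi < best_phi:
        best, best_phi = S, phi
print("community:", sorted(best), "conductance: %.3f" % best_phi)
```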
Item Statistical clustering of data (2015-05) Zhang, Lihao; Sager, Thomas W.; Hersh, Matthew
Cluster analysis aims at segmenting objects into groups with similar members and therefore helps to discover the distribution of properties and correlations in large datasets. Data clustering has been widely studied, as it arises in many domains, including marketing, engineering, and the social sciences. In particular, the recent emergence of large-scale transactional and experimental datasets has significantly increased the need for clustering techniques that reduce the number of objects under consideration and yield better knowledge of the data. This report introduces fundamental concepts related to cluster analysis, addresses similarity and dissimilarity measurements for cluster definition, and examines three major clustering algorithms, namely hierarchical clustering, K-means clustering, and the Gaussian mixture model fitted by the Expectation-Maximization (EM) algorithm, both theoretically and experimentally, to illustrate the process of clustering. Finally, methods for determining the number of clusters and for validating the clustering are presented for evaluation purposes.

Item Topic modeling via scatter/gather clustering (2015-05) Tyler, Marcus Mitchell; Ghosh, Joydeep; Bovik, Alan
Latent variable models such as Latent Dirichlet Allocation provide rich tools for analyzing large document corpora. They can uncover a wide range of hidden information, such as topics in text, communities in social networks, and patterns in images. Scatter/Gather is a clustering technique that allows users to interactively combine and split groups. When joined with latent variable models, Scatter/Gather organizes topics into themes, enables topic browsing, and improves processing time for large numbers of topics.
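A minimal sketch of the Scatter/Gather loop itself, using K-means over tf-idf in place of a latent variable model: scatter the corpus into clusters summarized by their top terms, then gather a subset (hard-coded here to stand in for a user's selection) and re-scatter it. The corpus, cluster counts, and selection are all toy assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["to be or not to be", "a plague on both your houses",
        "et tu brute", "the lady doth protest too much",
        "something wicked this way comes", "all the world's a stage"]

def scatter(doc_ids, k):
    """Cluster a subset of the corpus and summarize each cluster
    by its top tf-idf terms (the 'scatter' step)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform([docs[i] for i in doc_ids])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    terms = np.asarray(vec.get_feature_names_out())
    groups = {}
    for c in range(k):
        members = [d for d, lab in zip(doc_ids, km.labels_) if lab == c]
        top = terms[km.cluster_centers_[c].argsort()[::-1][:3]]
        groups[c] = (members, list(top))
    return groups

# Scatter the whole corpus into three clusters...
groups = scatter(list(range(len(docs))), k=3)
for c, (members, top) in groups.items():
    print("cluster", c, members, top)

# ...then "gather" two clusters (simulating a user's selection) and
# re-scatter the combined set into two new clusters.
regathered = groups[0][0] + groups[1][0]
print(scatter(regathered, k=2))
```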
Item Unsupervised learning methods: An efficient clustering framework with integrated model selection (2012-08) Corona, Enrique; Nutter, Brian; Mitra, Sunanda; Pal, Ranadip; López-Benitez, Noé
Classification is one of the most important practices in data analysis. In the context of machine learning, it can be viewed as the problem of identifying representative data patterns in such a manner that coherent groups are formed. If the data structure is readily available (as in supervised learning), it is usually used to establish classification rules for discrimination. However, when the data are unlabeled, their underlying structure must first be unveiled, so unsupervised classification poses more challenges. Among them, the fundamental question of an appropriate number of groups or clusters in the data must be addressed. In this context, the "jump" method, an efficient but limited linear approach that finds plausible answers to the number of clusters in a dataset, is improved via the optimization of an appropriate objective function that quantifies the quality of particular cluster configurations. Recent developments showing interesting associations between spectral clustering (SC) and kernel principal component analysis (KPCA) are used to extend the improved method to the non-linear domain. This is achieved by mapping the input data to a new space where the original clusters appear as linear structures; the characteristics of this mapping depend to a large extent on the parameters of the selected kernel function. By projecting these linear structures onto the unit sphere, the proposed method is able to measure the quality of the resulting cluster configurations. These quality scores aid in the simultaneous selection of the kernel parameters (i.e., model selection) and the number of clusters present in the dataset. Results of the enhanced jump method are compared to other relative validation criteria, such as the minimum description length (MDL), Akaike's information criterion (AIC), and the consistent Akaike information criterion (CAIC). The extension of the method is tested against other cluster validity indices in similar settings, such as the adjusted Rand index (ARI) and the balanced line fit (BLF). Finally, image segmentation examples are shown as a real-world application of the technique.
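The baseline "jump" method that this dissertation improves upon (due to Sugar and James) is easy to sketch: compute K-means distortions for candidate values of k, apply a negative-power transformation, and pick the k with the largest jump. A minimal sketch, assuming the standard transformation power Y = d/2 and toy blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
d = X.shape[1]
Y = d / 2.0                       # standard transformation power

# Average within-cluster distortion for each candidate k.
ks = range(1, 9)
distortion = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortion.append(km.inertia_ / (X.shape[0] * d))

# Jump statistic: differences of transformed distortions (with the
# k=0 term taken as 0); the largest jump suggests the number of clusters.
transformed = np.asarray(distortion) ** (-Y)
jumps = np.diff(np.concatenate([[0.0], transformed]))
print("estimated number of clusters:", ks[int(jumps.argmax())])
```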