A Comparison of Clustering Methods for Developing Models of User Interest






For open-ended information tasks, users must sift through many potentially relevant documents, assessing and prioritizing them based on relevance to their current information need, a practice we refer to as document triage. Users often perform triage through their interaction with multiple applications, and to support them efficiently in this process, an extensible multi-application architecture, the Interest Profile Manager (IPM), was developed in prior research at Texas A&M University. IPM infers user interests from their interactions with documents, especially the interests expressed by the user through an interpretive action such as assigning a visual characteristic like color, coupled with the document's content characteristics. IPM treats each specific pairing of color and application as an interest class, and the main challenge for the user is to maintain a consistent interest class-to-color scheme across applications indefinitely, which is not practical. This thesis presents a system that can help reduce potential problems caused by these inconsistencies by indicating when such inconsistencies have occurred in the past or are occurring in the user's current triage activity. It includes (1) a clustering algorithm that groups similar triage interest instances by choosing the factors that could define the similarity of interest instances, and (2) an approach to identify sequences of user actions that provide strong evidence of the user's intent, which can be used as constraints during clustering. Constrained and unconstrained versions of three agglomerative hierarchical clustering algorithms have been studied: (1) Single-Link, (2) Complete-Link, and (3) UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The contribution of each of three factors, (1) Content Similarity, (2) Temporal Similarity, and (3) Visual Similarity, to the overall similarity between interest instances has also been examined.
Our results indicate that the Single-Link algorithm performs better than the other two clustering algorithms, and that combining all three similarity factors defines the similarity between two instances better than any single factor alone. Using constraints as strong evidence of the user's intent improved the clustering efficiency of the algorithms.
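
As an illustration of the kind of clustering compared above, the following is a minimal sketch, not the implementation used in the thesis. It assumes a hypothetical encoding of an interest instance (a set of content terms, a timestamp normalized to [0, 1], and a color name), a hypothetical weighted sum of content, temporal, and visual distances, a distance threshold as the stopping rule, and must-link pairs as the only form of constraint. The names `combined_distance`, `agglomerate`, `weights`, `threshold`, and `must_link` are all illustrative assumptions.

```python
from itertools import combinations

# Linkage rules: how to reduce all pairwise distances between two clusters
# to a single cluster-to-cluster distance.
LINKAGES = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "upgma": lambda ds: sum(ds) / len(ds),    # arithmetic mean of all pairs
}

def combined_distance(a, b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical weighted distance over content, time, and color.

    Each instance is a dict with 'terms' (set of str), 'time' (float,
    assumed normalized to [0, 1]) and 'color' (str). These encodings are
    assumptions made for this sketch.
    """
    union = a["terms"] | b["terms"]
    # Content: Jaccard distance between term sets (0.0 if both are empty).
    content = 1 - len(a["terms"] & b["terms"]) / len(union) if union else 0.0
    temporal = abs(a["time"] - b["time"])
    visual = 0.0 if a["color"] == b["color"] else 1.0
    w_content, w_temporal, w_visual = weights
    return w_content * content + w_temporal * temporal + w_visual * visual

def agglomerate(items, dist, linkage="single", threshold=0.5, must_link=()):
    """Agglomerative clustering that stops once the closest pair of
    clusters is farther apart than `threshold`.

    `must_link` is an iterable of (i, j) index pairs that are merged before
    clustering starts (an assumed, simplified form of constraint).
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(items))]

    # Apply must-link constraints by pre-merging the affected clusters.
    for i, j in must_link:
        ci = next(c for c in clusters if i in c)
        cj = next(c for c in clusters if j in c)
        if ci is not cj:
            ci.extend(cj)
            clusters = [c for c in clusters if c is not cj]

    link = LINKAGES[linkage]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            pairwise = [dist(items[i], items[j])
                        for i in clusters[x] for j in clusters[y]]
            d = link(pairwise)
            if best is None or d < best[0]:
                best = (d, x, y)
        if best[0] > threshold:
            break  # no remaining pair of clusters is close enough
        _, x, y = best
        clusters[x].extend(clusters[y])
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]

# Usage with made-up interest instances.
instances = [
    {"terms": {"cluster", "linkage"}, "time": 0.10, "color": "red"},
    {"terms": {"cluster", "upgma"}, "time": 0.15, "color": "red"},
    {"terms": {"recipe", "bread"}, "time": 0.90, "color": "blue"},
]
groups = agglomerate(instances, combined_distance, linkage="upgma")
```

In this sketch, switching `linkage` changes only how the distance between two clusters is summarized, and adjusting `weights` is one way to examine the contribution of each similarity factor in isolation.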