Document clustering with nonparametric hierarchical topic modeling

Schaefer, Kayla Hope

dc.contributor.advisor	Williamson, Sinead	en
dc.contributor.committeeMember	Zhou, Mingyuan	en
dc.creator	Schaefer, Kayla Hope	en
dc.date.accessioned	2015-11-16T18:47:17Z	en
dc.date.accessioned	2018-01-22T22:29:09Z
dc.date.available	2015-11-16T18:47:17Z	en
dc.date.available	2018-01-22T22:29:09Z
dc.date.issued	2015-05	en
dc.date.submitted	May 2015	en
dc.date.updated	2015-11-16T18:47:17Z	en
dc.description	text	en
dc.description.abstract	Since its introduction, topic modeling has been a fundamental tool for analyzing corpus structure. While the Relational Topic Model provides a way to link, and subsequently cluster, documents as an extension of the original Latent Dirichlet Allocation (LDA) model, this paper seeks to form a document clustering model for the nonparametric alternative to LDA, the Dirichlet Process. As the structure of Shakespeare's tragedies is the focus of this work, we specifically cluster documents while modeling the text using a Hierarchical Dirichlet Process (HDP), which allows for a mixture model with shared mixture components, in order to capture the natural topic clustering within a play. Using collapsed Gibbs sampling, the effectiveness of the clustered HDP is compared against that of LDA and of an HDP without document clustering. The comparison uses both log perplexity and a qualitative assessment of the returned topics. Furthermore, clustering is performed and analyzed individually on speeches from each of ten tragedies, as well as on a combined corpus of acts.	en
dc.description.department	Statistics	en
dc.format.mimetype	application/pdf	en
dc.identifier	doi:10.15781/T2N334	en
dc.identifier.uri	http://hdl.handle.net/2152/32498	en
dc.subject	Clustering	en
dc.subject	Nonparametric Bayesian statistics	en
dc.subject	Hierarchical models	en
dc.subject	Gibbs sampling	en
dc.subject	Shakespeare	en
dc.title	Document clustering with nonparametric hierarchical topic modeling	en
dc.type	Thesis	en

Collections

University of Texas at Austin
