Document clustering with nonparametric hierarchical topic modeling

dc.contributor.advisor: Williamson, Sinead (en)
dc.contributor.committeeMember: Zhou, Mingyuan (en)
dc.creator: Schaefer, Kayla Hope (en)
dc.date.accessioned: 2015-11-16T18:47:17Z (en)
dc.date.accessioned: 2018-01-22T22:29:09Z
dc.date.available: 2015-11-16T18:47:17Z (en)
dc.date.available: 2018-01-22T22:29:09Z
dc.date.issued: 2015-05 (en)
dc.date.submitted: May 2015 (en)
dc.date.updated: 2015-11-16T18:47:17Z (en)
dc.description: text (en)
dc.description.abstract: Since its introduction, topic modeling has been a fundamental tool in analyzing corpus structures. While the Relational Topic Model provides a way to link, and subsequently cluster, documents together as an extension of the original Latent Dirichlet Allocation (LDA) model, this paper seeks to form a document clustering model for the nonparametric alternative to LDA, the Dirichlet Process. As the structure of Shakespeare's tragedies is the focus of this work, we specifically cluster documents while modeling the text using a Hierarchical Dirichlet Process (HDP), which allows for a mixture model with shared mixture components, in order to capture the natural topic clustering within a play. Using collapsed Gibbs sampling, the effectiveness of the clustered HDP is compared against that of LDA and an HDP without document clustering. This is done using both log perplexity and a qualitative assessment of the returned topics. Furthermore, clustering is performed and analyzed individually on speeches from each of ten tragedies, as well as with a combined corpus of acts. (en)
dc.description.department: Statistics (en)
dc.format.mimetype: application/pdf (en)
dc.identifier: doi:10.15781/T2N334 (en)
dc.identifier.uri: http://hdl.handle.net/2152/32498 (en)
dc.subject: Clustering (en)
dc.subject: Nonparametric Bayesian statistics (en)
dc.subject: Hierarchical models (en)
dc.subject: Gibbs sampling (en)
dc.subject: Shakespeare (en)
dc.title: Document clustering with nonparametric hierarchical topic modeling (en)
dc.type: Thesis (en)
