Browsing by Subject "Bayesian nonparametrics"
Now showing 1 - 4 of 4
Item: Distributed inference in Bayesian nonparametric models using partially collapsed MCMC (2016-05)
Zhang, Michael Minyi; Williamson, Sinead; Lin, Lizhen
Bayesian nonparametric models are an elegant way to discover underlying latent features within a data set, but inference in such models can be slow. Inferring latent components using Markov chain Monte Carlo relies either on an uncollapsed representation, which leads to poor mixing, or on a collapsed representation, which is usually slow. We take advantage of the fact that the latent components are conditionally independent under the given stochastic process (we apply our technique to the Dirichlet process and the Indian buffet process). Because of this conditional independence, we can partition the latent components into two parts: one containing only the finitely many instantiated components and the other containing the infinite tail of uninstantiated components. For the finite part, parallel inference is simple given the instantiation of components; for the infinite tail, uncollapsed MCMC mixes poorly, so we collapse out the components. The resulting hybrid sampler, while parallel, produces samples asymptotically from the true posterior.

Item: Infinite-word topic models for digital media (2014-05)
Waters, Austin Severn; Miikkulainen, Risto
Digital media collections hold an unprecedented source of knowledge and data about the world. Yet, even at current scales, the data exceeds by many orders of magnitude the amount a single user could browse through in an entire lifetime. Making use of such data requires computational tools that can index, search over, and organize media documents in ways that are meaningful to human users, based on the meaning of their content. This dissertation develops an automated approach to analyzing digital media content based on topic models.
Its primary contribution, the Infinite-Word Topic Model (IWTM), helps extend topic modeling to digital media domains by removing model assumptions that do not make sense for them -- in particular, the assumption that documents are composed of discrete, mutually exclusive words from a fixed-size vocabulary. While conventional topic models like Latent Dirichlet Allocation (LDA) require that media documents be converted into bags of words, IWTM incorporates clustering into its probabilistic model and treats the vocabulary size as a random quantity to be inferred from the data. Among its other benefits, IWTM achieves better performance than LDA while automating the selection of the vocabulary size. This dissertation also contributes fast, scalable variational inference methods for IWTM that allow the model to be applied to large datasets. Furthermore, it introduces a new method, Incremental Variational Inference (IVI), for efficiently training IWTM and other Bayesian nonparametric models on growing datasets. IVI allows such models to grow in complexity as the dataset grows, as their priors state they should. Finally, building on IVI, an active learning method for topic models is developed that intelligently samples new data, resulting in models that train faster, achieve higher performance, and use less labeled data.

Item: Optimization models and methods under nonstationary uncertainty (2010-08)
Belyi, Dmitriy; Popova, Elmira; Morton, David P.; Damien, Paul; Djurdjanovic, Dragan; Hasenbein, John J.
This research focuses on finding the optimal maintenance policy for an item with varying failure behavior. We analyze several types of item failure rates and develop methods to solve for optimal maintenance schedules. We also illustrate nonparametric modeling techniques for failure rates and utilize these models in the optimization methods.
The general problem falls under the umbrella of stochastic optimization under uncertainty.

Item: Statistical Models for Next Generation Sequencing Data (2013-04-01)
Wang, Yiyi
Three statistical models are developed to address problems in next-generation sequencing data. The first two models are designed for RNA-Seq data and the third for ChIP-Seq data. The first RNA-Seq model uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene's read count, so that each gene may have its own parameters. Despite the consequently large number of parameters, parsimony is imposed by the clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and real data analysis show that this method performs at least as well as existing leading methods in some cases. The second RNA-Seq model shares the framework of the first but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from the Gene Ontology (GO). The use of this external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites in ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect, following the standard sliding window procedure for ChIP-Seq data. The p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation in Monte Carlo (SAMC) method is used to select the best of the three.
Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality in the data.
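The finite/infinite split exploited in the first abstract can be illustrated with a minimal Chinese restaurant process sketch (a standard sequential representation of the Dirichlet process). This is an illustrative sketch only, not code from the dissertation; the function and parameter names are assumptions. It shows that after n observations only finitely many components are ever instantiated, while the infinite tail of uninstantiated components is summarized by the concentration parameter alpha:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from a Chinese restaurant process.

    Only the instantiated components are stored explicitly; the infinite
    tail of empty components is represented by the probability mass alpha.
    (Illustrative sketch; names are hypothetical, not from the thesis.)
    """
    rng = random.Random(seed)
    counts = []       # sizes of the finitely many instantiated components
    assignments = []  # component label of each item
    for i in range(n):
        # Existing component k is chosen with prob. counts[k] / (i + alpha);
        # a new component from the tail with prob. alpha / (i + alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            assignments.append(len(counts))
            counts.append(1)
    return assignments, counts

labels, sizes = crp_partition(100, alpha=2.0)
print(len(sizes), sum(sizes))  # a small number of instantiated clusters; sizes sum to 100
```

In the hybrid sampler described above, the instantiated components (here, `counts`) admit simple parallel updates, while the uninstantiated tail is handled by collapsed moves.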