Statistical Models for Next Generation Sequencing Data

dc.contributor: Dahl, David B.
dc.contributor: Liang, Faming
dc.creator: Wang, Yiyi
dc.date.accessioned: 2015-05-01T05:57:09Z
dc.date.accessioned: 2017-04-07T20:04:21Z
dc.date.available: 2015-05-01T05:57:09Z
dc.date.available: 2017-04-07T20:04:21Z
dc.date.created: 2013-05
dc.date.issued: 2013-04-01
dc.description.abstract: Three statistical models are developed to address problems in next-generation sequencing data. The first two models are designed for RNA-Seq data and the third for ChIP-Seq data. The first RNA-Seq model uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene's read count, so that each gene may have its own parameters. Despite the consequent large number of parameters, parsimony is imposed by the clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and a real data analysis show that this method performs at least as well as leading existing methods in some cases. The second RNA-Seq model shares the framework of the first, but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from Gene Ontology (GO). The use of this external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites in ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed, and the p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation Monte Carlo (SAMC) method is used to select the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality of the data.
dc.identifier.uri: http://hdl.handle.net/1969.1/149412
dc.language.iso: en
dc.subject: next generation sequencing
dc.subject: Bayesian nonparametrics
dc.subject: Gene Ontology
dc.subject: MCMC
dc.title: Statistical Models for Next Generation Sequencing Data
dc.type: Thesis
