Bayesian inference methods for next generation DNA sequencing
Abstract
Recently developed next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. To provide a blueprint of a target genome, next-generation sequencing systems typically employ the so called shotgun sequencing strategy and oversample the genome with a library of relatively short overlapping reads. The order of nucleotides in the short reads is determined by processing acquired noisy signals generated by the sequencing platforms, and the overlaps between the reads are exploited to assemble the target long genome. Next-generation sequencing utilizes massively parallel array-based technology to speed up the sequencing and reduce the cost. However, accuracy and lengths of the short reads are yet to surpass those provided by the conventional slower and costlier Sanger sequencing method. In this thesis, we first focus on Illumina's sequencing-by-synthesis platform which relies on reversible terminator chemistry and describe the acquired signal by a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on an experimental data set obtained by sequencing phiX174 bacteriophage using Illumina's Genome Analyzer II. The results show that ParticleCall scheme is significantly more computationally efficient than the best performing unsupervised base calling method currently available, while achieving the same accuracy. Having addressed the problem of base calling of short reads, we turn our attention to genome assembly. Assembly of a genome from acquired short reads is a computationally daunting task even in the scenario where a reference genome exists. Errors and gaps in the reference, and perfect repeat regions in the target, further render the assembly challenging and cause inaccuracies. We formulate reference-guided assembly as the inference problem on a bipartite graph and solve it using a message-passing algorithm. The proposed algorithm can be interpreted as the classical belief propagation scheme under a certain prior. Unlike existing state-of-the-art methods, the proposed algorithm combines the information provided by the reads without needing to know reliability of the short reads (so-called quality scores). Relation of the message-passing algorithm to a provably convergent power iteration scheme is discussed. Results on both simulated and experimental data demonstrate that the proposed message-passing algorithm outperforms commonly used state-of-the-art tools, and it nearly achieves the performance of a genie-aided maximum a posteriori (MAP) scheme. We then consider the reference-free genome assembly problem, i.e., the de novo assembly. Various methods for de novo assembly have been proposed in literature, all of whom are very sensitive to errors in short reads. We develop a novel error-correction method that enables performance improvements of de novo assembly. The new method relies on a suffix array structure built on the short reads data. It incorporates a hypothesis testing procedure utilizing the sum of quality information as the test statistic to improve the accuracy of overlap detection. Finally, we consider an inference problem in gene regulatory networks. Gene regulatory networks are highly complex dynamical systems comprising biomolecular components which interact with each other and through those interactions determine gene expression levels, i.e., determine the rate of gene transcription. In this thesis, a particle filter with Markov Chain Monte Carlo move step is employed for the estimation of reaction rate constants in gene regulatory networks modeled by chemical Langevin equations. Simulation studies demonstrate that the proposed technique outperforms previously considered methods while being computationally more efficient. Dynamic behavior of gene regulatory networks averaged over a large number of cells can be modeled by ordinary differential equations. For this scenario, we compute an approximation to the Cramer-Rao lower bound on the mean-square error of estimating reaction rates and demonstrate that, when the number of unknown parameters is small, the proposed particle filter can be nearly optimal. In summary, this thesis presents a set of Bayesian inference methods for base-calling and sequence assembly in next-generation DNA sequencing. Experimental studies shows the advantage of proposed algorithms over traditional methods.