Algorithms and analysis for next generation biosensing and sequencing systems

Shamaiah, Manohar

Date

2012-08

Authors

Shamaiah, Manohar

Abstract

Recent advancements in massively parallel biosensing and sequencing technologies have revolutionized the field of molecular biology and paved the way to novel and exciting innovations in medicine, biology, and environmental monitoring. Among them, biosensor arrays (e.g., DNA and protein microarrays) have attracted considerable attention. DNA microarrays are parallel affinity biosensors that can detect the presence and quantify the amounts of nucleic acid molecules of interest. They rely on chemical attraction between target nucleic acid sequences and their Watson-Crick complements, which serve as probes and capture the targets. The molecular binding between the probes and targets is a stochastic process, and hence the number of captured targets at any time is a random variable. Detection in conventional DNA microarrays is based on a single measurement taken in the steady state of the binding process. Recently developed real-time DNA microarrays, on the other hand, acquire multiple temporal measurements, which allow more precise characterization of the reaction and enable faster detection based on the early dynamics of the binding process. In this thesis, I study target estimation and limits of performance of real-time affinity biosensors. Target estimation is mapped to the problem of estimating parameters of discretely observed nonlinear diffusion processes. Performance of the estimators is characterized analytically via the Cramér-Rao lower bound on the mean-square error. The proposed algorithms are verified on both simulated and experimental data, demonstrating significant gains over state-of-the-art techniques.

In addition to biosensor arrays, in this thesis I present studies of the signal processing aspects of next-generation sequencing systems. Novel sequencing technologies will provide significant improvements in many aspects of the human condition, ultimately leading towards the understanding, diagnosis, treatment, and prevention of diseases. Reliable decision-making in such downstream applications is predicated upon accurate base-calling, i.e., identification of the order of nucleotides from noisy sequencing data. Base-calling error rates are nonuniform and typically deteriorate with the length of the reads. I have studied performance limits of base-calling, characterizing them by means of an upper bound on the error rates. Moreover, in the context of shotgun sequencing, I analyzed how the accuracy of an assembled sequence depends on coverage, i.e., on the average number of times each base in a target sequence is represented in different reads. These analytical results are verified using experimental data.
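
The notion of coverage defined above can be illustrated with a short calculation. The sketch below is not taken from the thesis: the function names are hypothetical, and the uncovered-fraction estimate assumes reads land uniformly at random on the target (the classical Poisson approximation), which may differ from the model used in the thesis's accuracy analysis.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of reads covering each base: c = N * L / G.

    Hypothetical helper for illustration; assumes all reads have equal length.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return num_reads * read_length / target_length


def expected_uncovered_fraction(c: float) -> float:
    """Approximate fraction of bases covered by no read.

    Assumes uniformly random read placement, so the number of reads
    covering a base is roughly Poisson with mean c, and P(0 reads) = exp(-c).
    """
    if c < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-c)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the thesis.
    c = coverage(num_reads=1_000_000, read_length=100, target_length=10_000_000)
    print(f"coverage: {c:.1f}x")
    print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.2e}")
```

Under these assumptions, higher coverage means more independent reads per base, which is why the accuracy of an assembled sequence would be expected to improve as coverage grows.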

Among many downstream applications of high-throughput biosensing and sequencing technologies, reconstruction of gene regulatory networks is of particular importance. In this thesis, I consider the gene network inference problem and propose a probabilistic graphical approach for solving it. Specifically, I develop graphical models and design message passing algorithms which are then verified using experimental data provided by the Dialogue for Reverse Engineering Assessment and Methods (DREAM) initiative.