Algorithms for next generation sequencing data analysis
Abstract
The field of genomics has witnessed tremendous achievements in the past two decades. The advances in sequencing technology have enabled acquisition of massive amounts of data that reveals information about individual genetic blueprint and is revolutionizing the field of molecular biology. Interpretation of such data requires solving mathematical (statistical and computational) problems rendered difficult by the complex interacting processes that are characteristic of biological systems; the data is high dimensional, typically noisy and often incomplete. Algorithm design in these settings requires deep understanding of the underlying biological principles, good mathematical abstractions permitting tractable inference and fast, scalable and accurate solutions using ideas from diverse fields such as optimization, probability, statistics and algorithms. This dissertation deals with two such problems occurring in the field of bioinformatics/computational biology. First, for the problem of basecalling for sequencing-by-synthesis (Illumina) platforms, I describe novel computationally tractable statistical models and signal processing schemes that are fast and have lower error rates than existing state-of-the-art basecallers. Extensions to a soft information exchange setup to do joint basecalling and SNP calling are also explored. Next, I describe two novel single individual haplotyping inference schemes using an (optimal) branch and bound framework and (scalable) low rank semidefinite programming ideas for diploid and polyploid species. In addition to improving the quality of basecalling, SNP calling, genotyping and haplotyping, I also developed user-friendly software that can be used by the biological research community for various purposes including cancer genomics and metagenomics studies.