Model-based Biomarker Detection and Systematic Analysis in Translational Science
Abstract
This dissertation is concerned with the application of mathematical modeling and statistical signal processing to the rapidly expanding fields of proteomics and genomics. The research is guided by a translational goal, which drives the problem formalization and experimental design and leads to optimization, prediction, and control of the underlying system. The dissertation consists of three interconnected parts.
In the first part of the dissertation, two Bayesian peptide detection algorithms are proposed to optimize feature extraction, the most fundamental step in mass spectrometry-based proteomics. The algorithms are designed to tackle data processing challenges that existing methods do not satisfactorily address. In contrast to most existing methods, the proposed algorithms perform deisotoping and deconvolution of mass spectra simultaneously, which enables better identification of weak peptide signals. Unlike greedy template-matching algorithms, the proposed methods can handle complex spectra in which features overlap. The proposed methods achieve higher sensitivity and accuracy than many popular software packages, such as msInspect.
In the second part of the dissertation, we consider modeling and assessing the entire mass spectrometry-based proteomic data analysis pipeline. Different modules are identified and analyzed, resulting in a framework that captures the key factors in system performance. The effects of various model parameters on protein identification rates, quantification errors, differential expression results, and classification performance are examined. The proposed pipeline model can be used to aid experimental design, pinpoint critical bottlenecks, optimize the workflow, and predict biomarker discovery results.
Finally, the same system methodology is extended to analyze the workflow in DNA microarray experiments. A model-based approach is developed to explore the relationships among microarray data properties, missing value imputation, and sample classification in a complicated data analysis pipeline. The conditions under which missing value imputation is appropriate are identified, and recommendations regarding imputation are provided. In addition, a peaking phenomenon related to the missing value rate is uncovered.