Statistical Methods for High Dimensional Biomedical Data

Date

2013-03-27

Abstract

This dissertation addresses four topics in proteomics, genomics, and cardiology. First, we developed a data-based method to assign the subcellular localization of proteins. We applied the method to data on the bacterium Rhodobacter sphaeroides 2.4.1 and compared the results to those of PSORTb v.3.0. The method compares well to PSORTb, and a simulation study showed that it is sound and produces accurate results. Next, we used the random forest technique to investigate genomic features involved in the lethality of knockout mice. We achieved an accuracy rate of 0.725 and found that, among other features, the evolutionary age of the gene was a good predictor of lethality. Third, we analyzed DNA breakpoints across eight cancer types to determine whether common hotspots or cancer-type-specific hotspots can be well predicted by various genomic features, and to identify which genomic features best predict the number of breakpoints. Using the random forest technique, we found that cancer-type-specific hotspots are poorly predicted by genomic features, whereas common hotspots can be predicted using the relevant genomic features. Additionally, among the genomic features analyzed, indel rate and substitution rate were consistently chosen as the top predictors of breakpoint frequency. Lastly, we developed a method to predict the hypothetical heart age of a subject based on the subject's electrocardiogram (ECG). The heart age predictions are consistent with current ECG science and knowledge of cardiac health.
