Statistical Methods for the Analysis of Mass Spectrometry-based Proteomics Data

Wang, Xuan

Date

2012-07-16

Authors

Wang, Xuan

Abstract

Proteomics plays an important role in the systems-level understanding of biological function. Mass spectrometry (MS)-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. In the most widely used bottom-up approach to MS-based high-throughput quantitative proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are then separated based on chemical or physical properties and analyzed using a mass spectrometer. The three fundamental challenges in the analysis of bottom-up MS-based proteomics data are: (i) identifying the proteins that are present in a sample; (ii) aligning different samples on elution (retention) time, mass, peak area (intensity), and so on; and (iii) quantifying the abundance levels of the identified proteins after alignment. Each of these challenges requires knowledge of the biological and technological context that gives rise to the observed data, as well as the application of sound statistical principles for estimation and inference. In this dissertation, we present a set of statistical methods for protein identification, alignment, and quantification in bottom-up proteomics.

We describe a fully Bayesian hierarchical modeling approach to peptide and protein identification on the basis of MS/MS fragmentation patterns in a unified framework. Our major contribution is to allow for dependence among the top candidate peptide-spectrum matches (PSMs) in each list. We accomplish this with a Bayesian multiple-component mixture model that incorporates decoy search results and jointly estimates the accuracy of the list of peptide identifications for each MS/MS fragmentation spectrum. We also propose objective criteria for evaluating the False Discovery Rate (FDR) associated with a list of identifications at both the peptide and protein levels, which results in more accurate FDR estimates than existing methods such as PeptideProphet.
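For orientation, the role that decoy search results play in FDR estimation can be illustrated with the standard target-decoy counting estimate. This is a minimal sketch of that generic technique, not the Bayesian mixture model described above; the function name `decoy_fdr` and the score values are hypothetical, and it assumes that higher scores indicate better matches and that the decoy database is the same size as the target database.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    Generic target-decoy counting: the number of decoy hits passing the
    threshold is used as an estimate of the number of false target hits.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return min(1.0, n_decoy / n_target)


# Hypothetical scores, for illustration only.
targets = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
decoys = [5.0, 3.0, 2.0, 1.0]
estimated_fdr = decoy_fdr(targets, decoys, threshold=5.0)
```

A model-based approach replaces this simple count with posterior probabilities, but the decoy hits serve the same purpose in both cases: they provide information about how incorrect matches score.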

Several alignment algorithms have been developed using different warping functions. However, all existing alignment approaches lack a useful metric for scoring an alignment between two data sets, and hence provide no quantitative measure of how good an alignment is. Our alignment approach uses "anchor points" to align all the individual scans in the target sample and provides a framework for quantifying the alignment, that is, assigning a p-value to a set of aligned LC-MS runs to assess the correctness of the alignment. After alignment using our algorithm, the p-values from Wilcoxon signed-rank tests on elution (retention) time, mass-to-charge ratio (m/z), and peak area become non-significant.
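The signed-rank check mentioned above can be sketched as follows. This is a minimal illustration, assuming paired measurements (for example, the retention times of the same features in a reference run and a target run) and a normal approximation to the null distribution without tie or continuity corrections; the function name `signed_rank_p` and the numbers are hypothetical and are not taken from the dissertation.

```python
import math


def signed_rank_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    # Paired differences; zero differences carry no sign and are dropped.
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        return 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Test statistic: sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical retention times (minutes) for 20 matched features.
reference = [float(t) for t in range(10, 30)]
unaligned = [t + 1.5 + 0.125 * i for i, t in enumerate(reference)]  # systematic drift
aligned = [t + (0.25 if i % 2 == 0 else -0.25) for i, t in enumerate(reference)]

p_before = signed_rank_p(unaligned, reference)  # small: systematic shift present
p_after = signed_rank_p(aligned, reference)     # large: no systematic shift left
```

A small p-value before alignment and a non-significant one afterwards is the pattern the paragraph above describes. For small samples, an exact null distribution would be preferable to the normal approximation used here.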

Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical mass spectrometry-based proteomics data sets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis. We outline a statistical method for protein differential expression based on a simple Binomial likelihood. By modeling peak intensities as binary, in terms of "presence / absence", we enable the selection of proteins not typically amenable to quantitative analysis, e.g., "one-state" proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence / absence analysis of a given data set in a principled way, resulting in a single list of selected proteins with a single associated FDR.
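To make the presence / absence idea concrete, the sketch below applies a generic Binomial likelihood-ratio test to one protein. It assumes each potential peak is recorded as present or absent and compares the presence probability between two conditions. This is an illustrative stand-in rather than the exact model developed in the dissertation; the function names and counts are hypothetical, and the p-value relies on the large-sample chi-square approximation with one degree of freedom.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k presences in n trials, up to a constant.

    The binomial coefficient is omitted because it cancels in the ratio.
    Terms with a zero count are skipped so that p = 0 or p = 1 is handled.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def presence_lrt_p(k1, n1, k2, n2):
    """Likelihood-ratio p-value for equal presence probability in two conditions."""
    p1, p2 = k1 / n1, k2 / n2          # separate maximum likelihood estimates
    p0 = (k1 + k2) / (n1 + n2)         # pooled estimate under the null
    g = 2.0 * (binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
               - binom_loglik(k1, n1, p0) - binom_loglik(k2, n2, p0))
    g = max(g, 0.0)                    # guard against tiny negative rounding
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(g / 2.0))


# A hypothetical "one-state" protein: seen in all 12 peaks in condition A, none in B.
p_one_state = presence_lrt_p(12, 12, 0, 12)
# A protein with identical presence rates in both conditions.
p_no_change = presence_lrt_p(6, 12, 6, 12)
```

The one-state protein has no usable intensities in the second condition, so an intensity-based test cannot be applied to it, yet the presence / absence counts alone give strong evidence of a difference.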