Integration and validation of mass spectrometry proteomics data sets

Prince, John Theodore, 1976-

Date

2008-05

Authors

Prince, John Theodore, 1976-

Abstract

Mass spectrometry (MS) has been a key player in biological investigation for some time and is the instrument of choice for high-throughput proteomics. However, the generation of large, inherently rich proteomics data sets has far outpaced our ability to utilize them to produce biological knowledge. The ultimate utility of MS proteomics is closely tied to our ability to interpret, integrate, and validate these voluminous data. By way of introduction, I discuss the creation of the Open Proteomics Database, which aims to increase publicly available data and to encourage broader contribution from the statistical and bioinformatic communities. Next, I detail research efforts in the integration of mass spectrometry data sets to increase the number of quantifiable peptides. Comparing peptide quantities between experiments (or subsequent chromatographic fractions) in large numbers requires the chromatographic alignment of MS signals, a challenging problem. We use Dynamic Time Warping (DTW) and a bijective (one-to-one) interpolant to create a smooth warp function amenable to multiple alignment. We test a wide variety of alignment scenarios coupled with high-confidence, overlapping peptide identifications to optimize and compare alignment parameters. We determine an optimal spectral similarity function, show the importance of penalizing gaps in the alignment path, and demonstrate the utility of our algorithm for multiple alignments. Then, we introduce a method to independently validate large-scale proteomics data sets. We use known biases in sample constitution, including amino acid content, transmembrane sequence content, and protein abundance, to estimate peptide false identification rates (FIRs) in what we term sample bias validation (SBV). We use SBV to compare the false identification rate accuracy (FIRA) and recall capabilities of widely used techniques for error estimation in MS-based proteomics.
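The idea behind estimating a false identification rate from a known sample bias can be illustrated with a simple two-component mixture. The sketch below is an assumption made for illustration, not the estimator defined in the thesis: it supposes that a protein class (for example, transmembrane proteins) occurs at one known fraction among correct identifications and at another known fraction among incorrect ones, so the observed fraction is a weighted average of the two. The function name `estimate_fir` and all parameter names are hypothetical.

```python
def estimate_fir(observed_fraction, expected_in_true, expected_in_false):
    """Estimate a false identification rate (FIR) from a sample bias.

    Assumed mixture model (illustrative only, not taken from the thesis):
        observed = (1 - FIR) * expected_in_true + FIR * expected_in_false
    Solving for FIR gives the expression below.
    """
    if expected_in_true == expected_in_false:
        # Without a bias between true and false hits, FIR is not identifiable.
        raise ValueError("the class must be biased between true and false hits")
    fir = (observed_fraction - expected_in_true) / (
        expected_in_false - expected_in_true
    )
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, fir))


# Hypothetical numbers: a soluble-protein sample should contain no
# transmembrane proteins, while 20% of the searched database is transmembrane.
# If 2% of reported hits are transmembrane, the estimate is roughly 0.1.
example_fir = estimate_fir(
    observed_fraction=0.02, expected_in_true=0.0, expected_in_false=0.2
)
```

Under this assumed model, the estimate is only as reliable as the two expected fractions, which is why several independent biases (amino acid content, transmembrane content, abundance) are useful as cross-checks.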
Finally, we describe the open source package mspire (mass spectrometry proteomics in Ruby). Mspire offers unified interfaces for working with a variety of file formats across the analytical pipeline, much-needed converters between key formats, and tools for FIR determination. The package eases the burden of working with MS proteomics data, lowering the barrier to entry for developers and offering useful tools to analysts of MS proteomics data.
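The chromatographic alignment step summarized in the abstract can be sketched as a dynamic time warping of two runs with a penalty on gaps in the alignment path. This is a minimal sketch under stated assumptions, not the thesis implementation: cosine similarity stands in for the spectral similarity function (the abstract does not say which function was found optimal), the gap penalty value is arbitrary, and the bijective interpolant used to smooth the warp is not included. The names `cosine_similarity` and `dtw_align` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two binned spectra (assumed similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dtw_align(run_a, run_b, gap_penalty=0.5):
    """Align two runs (lists of binned spectra) by dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching scan i of run_a to scan j of run_b. Non-diagonal steps (gaps)
    incur an extra gap_penalty, an arbitrary value chosen for illustration.
    """
    n, m = len(run_a), len(run_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between the two scans being matched.
            dist = 1.0 - cosine_similarity(run_a[i - 1], run_b[j - 1])
            candidates = [
                (cost[i - 1][j - 1], (i - 1, j - 1)),          # diagonal step
                (cost[i - 1][j] + gap_penalty, (i - 1, j)),    # gap in run_b
                (cost[i][j - 1] + gap_penalty, (i, j - 1)),    # gap in run_a
            ]
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + dist
            back[i][j] = prev
    # Trace the optimal path back from the last pair of scans.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = back[i][j]
    path.reverse()
    return cost[n][m], path


# Toy runs: run_b repeats its second scan, so one gap step is needed.
s1, s2, s3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
total_cost, path = dtw_align([s1, s2, s3], [s1, s2, s2, s3])
```

Raising `gap_penalty` discourages the path from stretching one run against the other, which is the behavior the abstract refers to when it notes the importance of penalizing gaps.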