Browsing by Subject "Proteomics"

Advancement of photodissociation and electron-based tandem mass spectrometry methods for proteome analysis
(2011-08) Madsen, James Andrew; Brodbelt, Jennifer S.; Dalby, Kevin N.; Marcotte, Edward M.; Webb, Lauren J.; Willets, Katherine A.
The number and types of diagnostic ions obtained by infrared multiphoton dissociation (IRMPD) and collision-induced dissociation (CID) were evaluated for supercharged peptide ions created by electrospray ionization of solutions spiked with m-nitrobenzyl alcohol. IRMPD of supercharged peptide ions increased the sequence coverage compared to that obtained by CID for all charge states investigated. Multiply charged, N-terminally derivatized peptides were subjected to electron transfer reactions to produce singly charged, radical species. Upon subsequent “soft” CID, highly abundant z-type ions were formed nearly exclusively, which yielded simplified fragmentation patterns amenable to de novo sequencing methods. Furthermore, the simplified series of z ions were shown to retain labile phosphoric acid moieties. IRMPD was implemented in a novel dual pressure linear ion trap for rapid “top-down” proteomics. Due to secondary dissociation, IRMPD yielded product ions in significantly lower charge states as compared to CID, thus facilitating more accurate mass identification and streamlining product ion assignment. This outcome was especially useful for database searching of larger proteins (~29 kDa) as IRMPD substantially improved protein identification and scoring confidence. Also, IRMPD showed an increased selectivity towards backbone cleavages N-terminal to proline and C-terminal to acidic residues (especially for the lowest precursor charge states). Ultraviolet photodissociation (UVPD) at 193 nm was implemented on a linear ion trap mass spectrometer for high-throughput proteomic workflows. Upon irradiation by a single 5 ns laser pulse, efficient photodissociation of tryptic peptides was achieved with production of a, b, c, x, y, and z sequence ions, in addition to immonium ions and v and w side-chain loss ions. The factors that influence the UVPD mass spectra and subsequent in silico database searching via SEQUEST were evaluated.
193 nm ultraviolet photodissociation (UVPD) was employed to sequence singly and multiply charged peptide anions. Upon dissociation by this method, a-/x-type, followed by d and w side-chain loss ions, were the most prolific and abundant sequence ions, often yielding 100% sequence coverage. LC-MS/UVPD analysis using high pH mobile phases yielded efficient characterization of acidic peptides from mitogen-activated protein kinases.
Development and Application of Proteomic Technologies for the Analysis of Post-Translational Modifications
(2007-08-08) Sprung, Robert William, Jr; Zhao, Yingming
Post-translational modifications represent a rapid and dynamic means for diversifying the chemistry of the ~20 ribosomally coded amino acids. As such, they provide an ideal mechanism for promoting cellular adaptability by facilitating the tuning of protein interactions and functions in response to changing environmental conditions. Despite their fundamental importance in regulating cellular functions and their wide implications in physiology, efficient means for the detection, enrichment and identification of proteins bearing specific modifications are lacking for most modifications. The availability of such methods would constitute invaluable tools supporting efforts to better understand the essential regulatory roles of modifications and the means by which aberrant modifications result in the onset and progression of disease. Towards this end, my dissertation describes the development and application of novel methods for the proteomic analysis of proteins bearing known modifications, including O-GlcNAc, lysine acetylation and methyl esterification. The identification of known targets of the modifications supports some of the current ideas regarding their potential impact and serves as a means of validating the methods. More importantly, the identification of novel targets for the modifications challenges some currently held concepts, in particular regarding the relatively limited regulatory roles associated with lysine acetylation. In addition, the unparalleled power of proteomics as a screening strategy is demonstrated through compelling evidence of the existence of novel lysine acylations in vivo with respect to propionylation and butyrylation. Together, the methods described in this dissertation and the datasets generated embody powerful platforms and rich resources for the ongoing exploration of the fundamental contributions of post-translational modifications to the regulation of biological processes.
Development of matrix assisted laser desorption ionization-ion mobility-orthogonal time-of-flight mass spectrometry as a tool for proteomics
(Texas A&M University, 2005-08-29) Ruotolo, Brandon Thomas
Separations coupled to mass spectrometry (MS) are widely used for large-scale protein identification in order to reduce the adverse effects of analyte ion suppression, increase the dynamic range, and as a deconvolution technique for complex datasets typical of cellular protein complements. In this work, matrix assisted laser desorption-ionization is coupled with ion mobility (IM) separation for the analysis of biological molecules. The utility of liquid-phase separations coupled to MS lies in the orthogonality of the two separation dimensions for all analytes. The data presented in this work illustrate that IM-MS relies on the correlation between separation dimensions for different classes (either structural or chemical) of analyte ions to obtain a useful separation. For example, for a series of peptide ions of increasing mass-to-charge (m/z), a plot of drift time in the IM drift cell vs. m/z increases in a near-linear fashion, but DNA or lipids having similar m/z values will have very different IM drift time-m/z relationships; thus drift time vs. m/z can be used as a qualitative tool for compound class identification. In addition, IM-MS is applied to the analysis of large peptide datasets in order to determine the peak capacity of the method for bottom-up experiments in proteomics, and it is found that IM separation increases the peak capacity of an MS-only experiment by a factor of 5-10. The population density of the appearance area for peptides is further characterized in terms of the gas-phase structural propensities for tryptic peptide ions. It is found that a small percentage (~3%) of peptide sequences form extended (i.e., helical or β-sheet type) structures in the gas-phase, thus influencing the overall appearance area for peptide ions.
Furthermore, the ability of IM-MS to screen for the presence of phosphopeptides is characterized, and it is found that post-translationally modified peptides populate the bottom one-half to one-third of the total appearance area for peptide ions. In general, the data presented in this work indicate that IM-MS offers dynamic range and deconvolution capabilities comparable to liquid-phase separation techniques coupled to MS on a time scale (ms) that is fully compatible with current MS, including TOF-MS, technology.
Enhanced protein characterization through selective derivatization and electrospray ionization tandem mass spectrometry
(2011-08) Vasicek, Lisa Anne; Brodbelt, Jennifer S.; Holcombe, James A.; Willets, Katherine A.; Anslyn, Eric V.; Liu, Hung-wen
There continue to be great strides in the field of proteomics but as samples become more complex, the ability to increase sequence coverage and confidence in the identification becomes more important. Several methods of derivatization have been developed that can be used in combination with tandem mass spectrometry to identify and characterize proteins. Three types of activation, including infrared multiphoton dissociation, ultraviolet photodissociation, and electron transfer dissociation, are enhanced in this dissertation and compared to the conventional method of collision-induced dissociation (CID) to demonstrate the improved characterization of proteins. A free amine reactive phosphate group was synthesized and used to modify the N-terminus of digested peptides. This phosphate group absorbs at the IR wavelength of 10.6 µm as well as in the vacuum ultraviolet (VUV) due to an aromatic group, allowing modified peptides to be dissociated by infrared multiphoton dissociation (IRMPD) or ultraviolet photodissociation (UVPD), whereas peptides without this chromophore are less responsive to IR or UV irradiation. The PD spectra for these modified peptides yield simplified MS/MS spectra due to the neutralization of all N-terminal product ions from the incorporation of the negatively charged phosphate moiety. This is especially advantageous for UVPD due to the great number of product ions produced due to the higher energy deposition of the UV photons. The MS/MS spectra also produce higher sequence coverage in comparison to CID of the modified or unmodified peptides due to more informative fragmentation pathways generated upon PD from secondary dissociation and an increased ion trapping mass range. IRMPD is also implemented for the first time on an orbitrap mass spectrometer to achieve high resolution analysis of IR chromophore-derivatized samples as well as top-down analysis of unmodified proteins.
High resolution/high mass accuracy analysis is extremely beneficial for characterization of complex samples due to the likelihood of false positives at lower resolutions/accuracies. For electron transfer dissociation, precursor ions in higher charge states undergo more exothermic electron transfer and thus minimize non-dissociative charge reduction. In this dissertation, cysteine side chains are alkylated with a fixed charge to deliberately increase the charge states of peptides and improve electron transfer dissociation. ETD can also be used to study protein structure by derivatizing the intact structure with a hydrazone reagent. A hydrazone bond will be preferentially cleaved during ETD facilitating the recognition of any modified residues through a distinguishing ETD fragmentation spectrum.
Hyphenating Ion Mobility With Mass Spectrometry to Increase the Information Content of Top-Down Analyses
(2014-04-25) Zinnel, Nathanael
Mass spectrometry (MS) has been established as an important analytical tool in the characterization of an array of analyte classes, including biological samples. However, without hyphenation with other techniques, the approach is limited in the information that can be elucidated and the samples that can be analyzed. In an attempt to overcome these limitations, separation is performed prior to MS analysis to aid in alleviating sample complexity while dissociation is incorporated to increase the information content. Here, we employ ion mobility (IM), a gas-phase separation technique, to disperse product ions resulting from collision-induced dissociation (CID), denoted as MS-CID-IM-MS, for top-down analysis in a variety of applications, specifically, primary structure elucidation, disulfide bond identification, secondary structure characterization, and polymer characterization. First, the fundamental attributes of this approach and the resulting information elucidated are investigated. Using this approach, CID product ions are dispersed in two dimensions, specifically size-to-charge (IM) and mass-to-charge (MS), and the resulting 2-D data display greatly enhances the top-down information content through (i) charge-state-specific trend lines, (ii) increased dynamic range, and (iii) separation of overlapping ion signals. The increase in peak capacity allows for detection of low-abundance fragment ions, providing an increase in the primary sequence coverage and the confidence of ion assignments, as demonstrated by melittin and ubiquitin. Second, this general approach is applied to the top-down analysis for a variety of applications. MS-CID-IM-MS is used for the structural characterization of disulfide linked protein ions by monitoring the arrival time distribution (ATD) of the ion pre- and post-collisional activation. Similarly, this approach can also be used to distinguish product ion type as well as, in some cases, specific secondary structural elements, viz.
extended coils or helices, providing rapid identification of the onset and termination of extended coil structure in peptides, as demonstrated by insulin B-chain. Detection of low-abundance ion signals associated with cross-ring cleavages allows this approach to be extended to determine the regiochemistry of glucose-derived polymers. As demonstrated, the MS-CID-IM-MS approach is highly versatile owing to the information content gained upon dispersion of ions in two dimensions, providing an effective increase in experimental dynamic range as well as conformational information.
Investigation of the proteomic interaction profile of uncoupling protein 3 and its effect on epigenetics
(2014-08) Yan, Xiwei; Mills, Edward Michael
Uncoupling proteins (UCPs) are localized on the inner mitochondrial membrane (IMM) and “uncouple” the electrochemical proton gradient formed by the electron transport chain (ETC) from ATP production. Though the prototypical uncoupling protein 1 (UCP1) is known to mediate cold-induced thermogenesis in rodents and human neonates, the physiological and biochemical functions of the homologs UCP2-5 are still under debate. Our research focuses on UCP3, the homolog prevalently expressed in skeletal muscle (SKM), one of the most important metabolic organs. UCP3 has long been speculated to have a pivotal role in maintaining mitochondrial metabolism. Several biochemical roles have been attributed to UCP3, including the regulation of fatty-acid transport and oxidation, reactive oxygen species (ROS) scavenging and calcium uptake. Several proteins have also been identified that directly bind UCP3 and facilitate its function. To further understand how UCP3 relates to different aspects of mitochondrial function, however, a more comprehensive profile of the UCP3 interaction partners is needed. We performed a mass spectrometry-based experiment and identified a list of over 170 potential proteins that may directly or indirectly interact with UCP3, and several novel functions of UCP3 are implied by these protein-protein interactions. Additionally, research has shown that metabolic defects are important contributing factors to epigenetic changes. Considering the roles of UCP3 in sustaining normal mitochondrial metabolism, we hypothesized that UCP3 has a novel function in regulating genomic DNA methylation processes. The data we obtained from the pilot study confirm that loss of UCP3 leads to aberrant DNA methylation changes. Further experiments are still needed to investigate the regulatory pathway between UCP3 and DNA methylation.
The physiological role of UCP3 in defending against cancer, diabetes and obesity has been investigated, but the mechanisms by which UCP3 protects the organism from these diseases have not been elucidated. Our research sheds light on the understanding of UCP3 functions and may be of significant therapeutic benefit in the prevention and treatment of these diseases.
Miniaturized antenna and transponder based wireless sensors for internet of things in healthcare
(2014-12) Huang, Haiyu; Akinwande, Deji; Gharpurey, Ranjit; Neikirk, Dean; Hu, Ye; Lu, Nanshu
Future medical and healthcare systems will be greatly improved by the widespread adoption of the internet of things (IoT). One of the crucial challenges of IoT for healthcare lies in the wireless sensors. Miniaturizing the sensor node profile, minimizing power consumption, and lowering the design and production cost of antennas, RF circuits and sensor modules have become the key issues for realizing applications in medical and healthcare fields that never seemed possible before. In this dissertation work, we first focus on electrically small antenna (ESA) design and fabrication for medical telemetry. A comprehensive analysis of the radiation properties of a novel folded ellipsoidal ESA is presented, showing its ability to self-resonate and impedance match without external components. It will benefit various size-restricted applications, especially wireless medical implants. The second focus is on healthcare sensors using the ESA as the sensing agent, which saves power and cost by eliminating the need for extra sensing modules. Specifically, miniaturized helix ESAs are integrated with drug reservoirs to function as wireless transponder sensors for real-time drug dosage monitoring. We also introduce a system-level innovation: a sensing system based on a passive wireless harmonic transponder, a harmonic sniffer, and a frequency-hopped interrogator. Passive sensing with μL liquid-level resolution and absolute accuracy is demonstrated in the presence of strong direct coupling, background scatterers, distance variance, and near-field interference from human body movement. Furthermore, we investigate how modern ubiquitous wireless sensor networks could take advantage of sensitive nanostructured materials for enhanced performance. Here we propose a new paradigm of chemically gated mixed modulation on a single homogeneous graphene device in which the chemical exposure directly modulates an electrical carrier signal.
To make the device ubiquitously reusable, a method of precisely tuning the charge neutrality point (Vcnp) is introduced by electrochemical calibration with a gate voltage pulse sequence. Such a chemically gated graphene modulator could potentially be used in a harmonic transponder as a passive ubiquitous sensor node for chemical and bio sensing applications. Overall, the research work presented in the dissertation will help enable cost- and power-efficient wireless sensor networks in future healthcare IoT.
Profiling cerebrospinal fluid proteins in Alzheimer's disease
(2011-08) Krishnamachari, Sesha; Tripathy, Jatindra Nath; San Francisco, Susan; Grammas, Paula
Alzheimer’s disease (AD) is a progressive, irreversible, neurodegenerative disease that affects more than 5 million people in the USA alone, a number expected to increase to 7 million by 2020. Currently, there are no drugs available that can halt or prevent the progression of this disease. This may be due to the complex nature of this disease, with a number of proteins, mediators and factors involved, or because some low abundance proteins not yet identified may play a major role in the pathogenesis of AD. In this regard, with rapidly improving proteomics technologies it is possible to identify less abundant but potentially important functional molecules that may play an important role in the pathogenesis of AD. The objective of this study was to identify novel proteins expressed in the cerebrospinal fluid (CSF) of AD compared to age matched CSF samples from non-neurodegenerative cohorts. Using highly sensitive mass spectrometry techniques (MALDI-TOF/TOF), 260 protein spots from 18 AD and 14 control CSF samples were analyzed. Of the 141 proteins identified, nine were found solely in AD CSF and were absent in the control CSF. Similarly, eleven proteins identified in control CSF were not identified in AD CSF. Proteins identified include apolipoprotein E, cystatin C, orosomucoid, prostaglandin D2, enolase, and transthyretin. The results suggest that with proteomic profiling technology it will be possible to identify unique, low abundance proteins which may provide new insights into the pathogenesis of AD.
Proteomic analysis of mycobacteria and mammalian cells
(2006-05) Wang, Rong, 1974-; Marcotte, Edward M.
Tuberculosis is a serious threat that claims 2 million lives annually. Mycobacterium tuberculosis is the causative agent of tuberculosis. The fast-growing bacterium Mycobacterium smegmatis is a model mycobacterial system, a non-pathogenic soil bacterium that nonetheless shares many features with the pathogenic M. tuberculosis. Multidimensional chromatography coupled with shotgun style tandem mass spectrometry was used to detect and identify 2,550 distinct proteins from M. smegmatis with an estimated 5% false positive identification rate; many predicted genes were annotated using experimental results, and protein expression levels were estimated from the shotgun proteomic data. First, in 25 exponential and stationary phase experiments, we observed numerous proteins involved in energy production, protein translation, and lipid biosynthesis. Protein expression levels were estimated from the number of observations of each protein, allowing measurement of differential expression of complete operons, and the comparison of the stationary and exponential phase proteomes. Expression levels are correlated with proteins' codon biases and mRNA expression levels. Secondly, we measured changes in the proteome of Mycobacterium smegmatis in response to three anti-tuberculosis drugs: isoniazid (INH), ethambutol (EMB) and 5-chloro-pyrazinamide (5-Cl-PZA). Protein expression levels were calculated from the number of identified peptides for each protein. Translation, energy production, and protein export are all down-regulated in the three drug treatments. By contrast, systems related to the drugs’ targets, including lipid, amino acid, nucleotide metabolism and transport, show specific protein expression changes associated with each drug treatment. We use these changes to infer likely targets for PZA. Thirdly, computational methods were used to predict protein-protein interactions and protein functions in a metabolic pathway in M. tuberculosis.
Protein functional links were built and specific functions were characterized for the pathway and its parallel pathways in M. tuberculosis and other organisms. Finally, multidimensional chromatography coupled with shotgun style tandem mass spectrometry has been applied in the analysis of nuclear proteins from mammalian cells. Nuclear proteins were identified from mouse T lymphoma cells. Nuclear matrix-associated proteins were identified from human preliminary T cells during the transition from quiescent state to proliferating state. These proteins are involved in the function of DNA replication, RNA transcription, splicing, etc.
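The abstract above estimates protein expression from the number of times each protein is observed in shotgun runs. A minimal sketch of that spectral-counting idea follows. It is an illustration only, not the dissertation's procedure: the function names, the pseudocount, the normalization to each run's total, and the placeholder protein labels are all assumptions introduced here.

```python
from collections import Counter
from math import log2


def spectral_counts(psm_proteins):
    """Count how many identified spectra map to each protein."""
    return Counter(psm_proteins)


def log2_fold_change(counts_a, counts_b, pseudocount=1.0):
    """Compare condition B to condition A using count fractions.

    Counts are normalized to each run's total so that runs of different
    depth stay comparable. The pseudocount (an assumption of this sketch)
    keeps proteins seen in only one condition from producing log(0).
    """
    proteins = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + pseudocount * len(proteins)
    total_b = sum(counts_b.values()) + pseudocount * len(proteins)
    changes = {}
    for protein in proteins:
        frac_a = (counts_a.get(protein, 0) + pseudocount) / total_a
        frac_b = (counts_b.get(protein, 0) + pseudocount) / total_b
        changes[protein] = log2(frac_b / frac_a)
    return changes


# Made-up identifications; "protA" etc. are placeholder labels, not real data.
exponential = spectral_counts(["protA", "protA", "protB", "protC"])
stationary = spectral_counts(["protA", "protB", "protB", "protB"])
fold_changes = log2_fold_change(exponential, stationary)
```

Positive values indicate proteins observed relatively more often in the second condition. Counting is coarse for low-abundance proteins, so small differences between small counts should be read with caution.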
Proteomic analysis of the pre-mRNA splicing machinery utilizing chromosomal locus epitope tagging in metazoans
(2007) Chen, Yen-I Grace, 1977-; Stevens, Scott W.
Epitope tagging in metazoans is an important tool for biochemical analyses and is generally accomplished by using trans-genes with in-frame epitope tags. However, protein levels from trans-genes are rarely representative of native levels. To overcome the shortcomings using trans-genes, epitope tags were introduced by homologous recombination technology, termed CLEP tagging (Chromosomal Locus EPitope tagging), immediately upstream of the stop codon of targeted genes in chicken B cell line DT40 and mouse embryonic stem (ES) cells. I first demonstrated the feasibility and promise of this technique in DT40 cells by purifying low abundance polypeptides and factors loosely associated with the SmD3 protein, a core protein participating in pre-mRNA splicing and mRNA turnover, with a TAP (tandem affinity purification) tag. Glycerol gradient separation was performed to further characterize the SmD3-associated protein complexes from the 200S fractions, corresponding to the supraspliceosomes. The purification included all five spliceosomal snRNAs. Most known snRNP-associated proteins, 5' end binding factors, 3' end processing factors, mRNA export factors, hnRNPs, and other RNA binding proteins were identified from the protein components. Intriguingly, the purified supraspliceosomes also contained a number of structural proteins, nucleoporins, chromatin remodeling factors, and several novel proteins that were absent from splicing complexes assembled in vitro. I showed that the in vivo analyses provide a more comprehensive list of polypeptides associated with pre-mRNA splicing apparatus as well as those that coupled transcription to the pre-mRNA processing steps. With similar techniques, the TAP tag was inserted into the chromosomal locus of a pre-mRNA splicing factor component, mSART-1 in live mice. Surprisingly, a profound autoimmune response was induced in homozygous-modified mice, due likely to an inappropriate stimulation of the immune system. 
I believe these mice will serve as a valuable tool for the studies of mammalian autoimmune diseases, especially those resulting from the generation of autoantibodies against RNP components.
Statistical Methods for High Dimensional Biomedical Data
(2013-03-27) Ball, Robyn Lynn
This dissertation consists of four different topics in the areas of proteomics, genomics, and cardiology. First, a data-based method was developed to assign the subcellular localization of proteins. We applied the method to data on the bacteria Rhodobacter sphaeroides 2.4.1 and compared the results to PSORTb v.3.0. We found that the method compares well to PSORTb and a simulation study revealed that the method is sound and produces accurate results. Next, we investigated genomic features involved in the lethality of the knockout mouse using the random forest technique. We achieved an accuracy rate of 0.725 and found that among other features, the evolutionary age of the gene was a good predictor of lethality. Third, we analyzed DNA breakpoints across eight different cancer types to determine if common hotspots or cancer-type specific hotspots can be well-predicted by various genomic features and investigated which of the genomic features best predict the number of breakpoints. Using the random forest technique, we found that cancer-type specific hotspots are poorly predicted by genomic features but common hotspots can be predicted using the relevant genomic features. Additionally, we found that among the genomic features analyzed, indel rate and substitution rate were consistently chosen as the top predictors of breakpoint frequency. Lastly, we developed a method to predict the hypothetical heart age of a subject based on the subject's electrocardiogram (ECG). The heart age predictions are consistent with current ECG science and knowledge of cardiac health.
Statistical Methods for the Analysis of Mass Spectrometry-based Proteomics Data
(2012-07-16) Wang, Xuan
Proteomics serves an important role at the systems level in the understanding of biological functioning. Mass spectrometry proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. In the most widely used bottom-up approach to MS-based high-throughput quantitative proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are separated based on chemical or physical properties and then analyzed using a mass spectrometer. The three fundamental challenges in the analysis of bottom-up MS-based proteomics are as follows: (i) identifying the proteins that are present in a sample, (ii) aligning different samples on elution (retention) time, mass, peak area (intensity), etc., and (iii) quantifying the abundance levels of the identified proteins after alignment. Each of these challenges requires knowledge of the biological and technological context that gives rise to the observed data, as well as the application of sound statistical principles for estimation and inference. In this dissertation, we present a set of statistical methods in bottom-up proteomics towards protein identification, alignment and quantification. We describe a fully Bayesian hierarchical modeling approach to peptide and protein identification on the basis of MS/MS fragmentation patterns in a unified framework. Our major contribution is to allow for dependence among the list of top candidate PSMs, which we accomplish with a Bayesian multiple component mixture model incorporating decoy search results and joint estimation of the accuracy of a list of peptide identifications for each MS/MS fragmentation spectrum. We also propose an objective criterion for the evaluation of the False Discovery Rate (FDR) associated with a list of identifications at both peptide level, which results in more accurate FDR estimates than existing methods like PeptideProphet.
Several alignment algorithms have been developed using different warping functions. However, all the existing alignment approaches suffer from the lack of a useful metric for scoring an alignment between two data sets and hence lack a quantitative score for how good an alignment is. Our alignment approach uses "anchor points" to align all the individual scans in the target sample and provides a framework to quantify the alignment, that is, assigning a p-value to a set of aligned LC-MS runs to assess the correctness of alignment. After alignment using our algorithm, the p-values from the Wilcoxon signed-rank test on elution (retention) time, m/z, and peak area become non-significant. Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical mass spectrometry-based proteomics data sets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis. We outline a statistical method for protein differential expression, based on a simple Binomial likelihood. By modeling peak intensities as binary, in terms of "presence / absence", we enable the selection of proteins not typically amenable to quantitative analysis; e.g., "one-state" proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence / absence analysis of a given data set in a principled way, resulting in a single list of selected proteins with a single associated FDR.
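The presence / absence idea in the abstract above can be illustrated with a binomial likelihood-ratio test. The sketch below is one plausible reading, not the dissertation's actual model: the function names are hypothetical, the test compares a shared detection probability against condition-specific ones, and the p-value relies on the asymptotic chi-square approximation with one degree of freedom, which is rough for small sample sizes.

```python
from math import erfc, log, sqrt


def _binomial_loglik(k, n, p):
    """Binomial log-likelihood without the binomial coefficient,
    which cancels in the likelihood ratio. Terms with a zero count
    are skipped so that p = 0 or p = 1 does not trigger log(0)."""
    loglik = 0.0
    if k > 0:
        loglik += k * log(p)
    if n - k > 0:
        loglik += (n - k) * log(1.0 - p)
    return loglik


def presence_absence_test(k1, n1, k2, n2):
    """Test whether a protein's detection probability differs between
    two conditions, given k detections out of n samples in each.

    Returns (statistic, p_value). The p-value uses the chi-square
    survival function with 1 degree of freedom, erfc(sqrt(x / 2)).
    """
    pooled = (k1 + k2) / (n1 + n2)
    null = _binomial_loglik(k1, n1, pooled) + _binomial_loglik(k2, n2, pooled)
    alt = _binomial_loglik(k1, n1, k1 / n1) + _binomial_loglik(k2, n2, k2 / n2)
    statistic = max(2.0 * (alt - null), 0.0)
    return statistic, erfc(sqrt(statistic / 2.0))


# A "one-state" protein: seen in all 6 samples of one condition, none of the other.
one_state_stat, one_state_p = presence_absence_test(6, 6, 0, 6)
# A protein detected equally often in both conditions.
same_stat, same_p = presence_absence_test(3, 6, 3, 6)
```

Because only detection counts are used, a protein with no measurable intensity in one condition still receives a test result, which is the case intensity-based analysis cannot handle.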
The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics
(2012-07-16) Vu, Thang
The small-sample size issue is a prevalent problem in Genomics and Proteomics today. Bootstrap, a resampling method which aims at increasing the efficiency of data usage, is considered to be an effort to overcome the problem of limited sample size. This dissertation studies the application of bootstrap to two problems of supervised learning with small sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive the exact formulas of the first and the second moment of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain the exact formulas of the bias, the variance, and the root mean squared error of the deviation from the true error of these bootstrap estimators. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of bootstrap to ensemble classification. 
We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.
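The convex bootstrap estimator discussed in the abstract above combines the resubstitution error with the zero bootstrap error (the error on points left out of each bootstrap sample); a weight of 0.632 gives the .632 bootstrap. The sketch below illustrates that combination empirically. It makes several assumptions not taken from the dissertation: a univariate nearest-mean rule stands in for LDA, the zero bootstrap error is pooled over all held-out points, and the function names and toy data are invented for illustration.

```python
import random


def train_nearest_mean(xs, ys):
    """Fit a univariate two-class rule: assign x to the class whose
    sample mean is closer (ties go to class 0)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return lambda x: 0 if abs(x - mean0) <= abs(x - mean1) else 1


def error_rate(classify, xs, ys):
    """Fraction of points the rule misclassifies."""
    return sum(classify(x) != y for x, y in zip(xs, ys)) / len(xs)


def convex_bootstrap(xs, ys, weight=0.632, n_boot=200, seed=0):
    """Return (resubstitution, zero_bootstrap, convex_estimate).

    convex_estimate = (1 - weight) * resubstitution + weight * zero_bootstrap
    """
    rng = random.Random(seed)
    n = len(xs)
    resub = error_rate(train_nearest_mean(xs, ys), xs, ys)
    wrong = tested = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot_y = [ys[i] for i in idx]
        drawn = set(idx)
        held_out = [i for i in range(n) if i not in drawn]
        # Skip samples with nothing held out or with a missing class.
        if not held_out or len(set(boot_y)) < 2:
            continue
        rule = train_nearest_mean([xs[i] for i in idx], boot_y)
        wrong += sum(rule(xs[i]) != ys[i] for i in held_out)
        tested += len(held_out)
    if tested == 0:
        raise ValueError("no usable bootstrap samples")
    zero = wrong / tested
    return resub, zero, (1.0 - weight) * resub + weight * zero


# Toy, well-separated data (made up for illustration).
toy_x = [0.0, 0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3, 2.4]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
estimates = convex_bootstrap(toy_x, toy_y)
```

Resubstitution tends to be optimistic and the zero bootstrap pessimistic, which is the motivation for weighting the two; the abstract's analytical results concern which weight removes the bias or minimizes the RMS error.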
