Browsing by Subject "Bootstrap"
Now showing 1 - 7 of 7
Item Bootstrapping in a high dimensional but very low sample size problem (Texas A&M University, 2006-08-16) Song, Juhee

High Dimension, Low Sample Size (HDLSS) problems have received much attention recently in many areas of science. Analysis of microarray experiments is one such area. Numerous studies are ongoing to investigate the behavior of genes by measuring the abundance of mRNA (messenger RiboNucleic Acid), that is, gene expression. The HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations. We assume a statistical model in which measurements from the same subject have the same expected value and variance. All subjects have the same distribution up to location and scale, and information from all subjects is shared in estimating this common distribution. Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis (the t-test, the sign test and traditional bootstrapping) do not necessarily provide reliable results, since there are only a few observations in each data set. We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To obtain reasonable initial parameter estimates when density estimation methods are applied, we use clustering methods, including agglomerative and K-means clustering. The Bayes Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within-Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods, including a maximum likelihood unimodal density estimator and kernel density estimation, are used to estimate the unknown density. Once the density is estimated, a bootstrapping algorithm that draws samples from the estimated density is used to approximate the distribution of test statistics.
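The final step described above, resampling small samples from an estimated density to approximate the distribution of a test statistic, can be sketched roughly as follows. This is a minimal illustration with simulated data and a Gaussian-kernel smoothed bootstrap; the bandwidth rule, the plain t-statistic, and all variable names are assumptions, not the dissertation's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pooled, standardized residuals from many small data sets
# (simulated here; in practice they come from the real data).
pooled = rng.standard_t(df=5, size=500)

# Smoothed bootstrap: sampling from a Gaussian kernel density
# estimate is equivalent to resampling the pooled values and
# adding N(0, h^2) noise, with h the kernel bandwidth.
h = 1.06 * pooled.std(ddof=1) * len(pooled) ** (-1 / 5)  # Silverman's rule

def sample_from_kde(n):
    return rng.choice(pooled, size=n) + rng.normal(0.0, h, size=n)

def t_stat(x):
    return np.sqrt(len(x)) * x.mean() / x.std(ddof=1)

# Approximate the null distribution of the t-statistic for a
# small sample of size n drawn from the estimated density.
n, B = 4, 2000
boot_t = np.array([t_stat(sample_from_kde(n)) for _ in range(B)])

# Two-sided bootstrap p-value for one observed small data set.
x_obs = np.array([0.3, -0.1, 0.4, 0.2])
p_val = np.mean(np.abs(boot_t) >= abs(t_stat(x_obs)))
```

The point of drawing from the estimated density, rather than resampling the four observations directly, is that the pooled estimate carries information from all subjects into each tiny sample.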
The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all small data sets. Simulated data sets and a set of cDNA (complementary DeoxyriboNucleic Acid) microarray experiment data are analyzed by the proposed methods.

Item Estimation of multiple mediator model (2013-05) Wen, Sibei; Beretvas, Susan Natasha

Models for mediation are widely used in psychology, behavioral science and education because they help researchers understand how a causal effect happens through one or several mediating variables, and more complex mediation models that incorporate multiple mediators are increasingly being assessed. This report uses a generated dataset to provide an overview of the assessment of direct effects and indirect effects in multiple mediator models. A multiple-comparison-based procedure for testing a set of hypotheses simultaneously while controlling the experiment-wise Type I error rate is used to calculate a confidence interval for each pairwise contrast of mediated effects. Three approaches are used to test hypotheses concerning the contrast between pairs of mediator effects: 1) assuming zero covariance between parameters from different models, 2) assuming a non-zero covariance between parameters from different models, and 3) using bootstrapping. Results are provided and discussed.

Item On Parametric and Nonparametric Methods for Dependent Data (2011-10-21) Bandyopadhyay, Soutir

In recent years, there has been a surge of research interest in the analysis of time series and spatial data. While, on the one hand, more and more sophisticated models are being developed, on the other hand the resulting theory and estimation procedures have become more and more involved.
This dissertation addresses the development of statistical inference procedures for data exhibiting dependencies of varied form and structure. In the first work, we consider estimation of the mean squared prediction error (MSPE) of the best linear predictor of (possibly) nonlinear functions of finitely many future observations in a stationary time series. We develop a resampling methodology for estimating the MSPE when the unknown parameters in the best linear predictor are estimated. Further, we propose a bias-corrected MSPE estimator based on the bootstrap and establish its second-order accuracy. Finite sample properties of the method are investigated through a simulation study. The next work considers nonparametric inference for spatial data. Here the asymptotic distribution of the Discrete Fourier Transformation (DFT) of spatial data under pure and mixed increasing-domain spatial asymptotic structures is studied under both deterministic and stochastic spatial sampling designs. The deterministic design is specified by a scaled version of the integer lattice in R^d, while the data sites under the stochastic spatial design are generated by a sequence of independent random vectors with a possibly nonuniform density. A detailed account of the asymptotic joint distribution of the DFTs of the spatial data is given which, among other things, highlights the effects of the geometry of the sampling region and the spatial sampling density on the limit distribution. Further, it is shown that in both the deterministic and stochastic design cases, for "asymptotically distant" frequencies the DFTs are asymptotically independent, but this property may be destroyed if the frequencies are "asymptotically close".
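As a small numerical illustration of the DFT of lattice spatial data (not taken from the dissertation; the lattice size and data are simulated assumptions), the transform at the Fourier frequencies 2*pi*j/n reduces to the usual two-dimensional FFT:

```python
import numpy as np

rng = np.random.default_rng(1)

# Spatial data on an n x n integer lattice (deterministic design).
n = 16
Z = rng.normal(size=(n, n))

def spatial_dft(Z, lam):
    """DFT of lattice data at a frequency pair lam = (l1, l2)."""
    n1, n2 = Z.shape
    s1 = np.arange(n1)[:, None]
    s2 = np.arange(n2)[None, :]
    phase = np.exp(-1j * (lam[0] * s1 + lam[1] * s2))
    return (Z * phase).sum() / np.sqrt(n1 * n2)

# At the Fourier frequencies 2*pi*j/n this agrees (up to the
# sqrt(n1*n2) normalization) with the entries of the 2-D FFT.
j1, j2 = 3, 5
lam = (2 * np.pi * j1 / n, 2 * np.pi * j2 / n)
d = spatial_dft(Z, lam)
fft_val = np.fft.fft2(Z)[j1, j2] / n
assert np.allclose(d, fft_val)
```

The theoretical results concern the joint limit behavior of such quantities at several frequencies at once, which the direct-sum form above makes easy to evaluate at arbitrary (non-Fourier) frequencies as well.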
Some important implications of the main results are also given.

Item Robust Clock Synchronization Methods for Wireless Sensor Networks (2011-10-21) Lee, Jae Han

Wireless sensor networks (WSNs) have received considerable attention in recent years due to their applications in a large number of areas, such as environmental monitoring, health and traffic monitoring, surveillance and tracking, and monitoring and control of factories and home appliances. Rapid developments in micro-electro-mechanical systems (MEMS) technology and circuit design have also led to a faster spread and adoption of WSNs. Wireless sensor networks consist of a number of nodes, in general equipped with energy-limited sensors capable of collecting, processing and transmitting information across short distances. Clock synchronization plays an important role in designing, implementing, and operating wireless sensor networks, and it is essential in ensuring a meaningful information processing order for the data collected by the nodes. Because the timing message exchanges between different nodes are affected by unknown, possibly time-varying network delay distributions, the estimation of clock offset parameters represents a challenge. This dissertation presents several robust approaches to estimating the clock offset parameters necessary for time synchronization of WSNs via the two-way message exchange mechanism. The main emphasis is on building clock phase offset estimators that are robust with respect to the unknown network delay distributions. Under the assumption that the delay characteristics of the uplink and the downlink are asymmetric, a clock offset estimation method using the bootstrap bias correction approach is derived. Also, a clock offset estimator using the robust M-estimation technique is presented, assuming that one underlying delay distribution is mixed with another delay distribution.
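The bootstrap bias correction idea for clock offset estimation can be sketched roughly as follows. The exponential delay model, the minimum-based estimator, and all parameter values here are illustrative assumptions, not the dissertation's exact setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-way message exchange model (illustrative assumptions):
# uplink:   U_i = theta + d + X_i,   downlink: V_i = -theta + d + Y_i,
# with asymmetric exponential random delays X_i, Y_i, a fixed
# propagation delay d, and clock offset theta.
theta, d, N = 5.0, 2.0, 25
U = theta + d + rng.exponential(1.0, N)
V = -theta + d + rng.exponential(3.0, N)   # asymmetric link

def offset_est(U, V):
    # Minimum-based clock offset estimator; biased when the
    # uplink and downlink delay distributions differ.
    return (U.min() - V.min()) / 2.0

est = offset_est(U, V)

# Nonparametric bootstrap bias correction: resample the observed
# timestamps, re-estimate, and subtract the estimated bias.
B = 1000
boot = np.array([
    offset_est(rng.choice(U, N), rng.choice(V, N)) for _ in range(B)
])
bias = boot.mean() - est
corrected = est - bias
```

The same recipe applies to other offset estimators; only `offset_est` changes.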
Next, although computationally complex, several novel, efficient, and robust estimators of clock offset based on the particle filtering technique are proposed to cope with the Gaussian or non-Gaussian delay characteristics of the underlying networks. One is the Gaussian mixture Kalman particle filter (GMKPF) method. Another is the composite particle filter (CPF) approach, viewed as a composition of the Gaussian sum particle filter and the Kalman filter (KF). Additionally, the CPF using bootstrap sampling is also presented. Finally, the iterative Gaussian mixture Kalman particle filter (IGMKPF) scheme, combining the GMKPF with a procedure for noise density estimation via an iterative mechanism, is proposed.

Item The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics (2012-07-16) Vu, Thang

The small sample size issue is a prevalent problem in genomics and proteomics today. The bootstrap, a resampling method that aims to increase the efficiency of data usage, is one effort to overcome the problem of limited sample size. This dissertation studies the application of the bootstrap to two problems of supervised learning with small-sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations.
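To make the estimators under study concrete, here is a small numerical sketch of resubstitution, zero bootstrap, and .632 bootstrap error estimation for LDA on simulated Gaussian data. The data, the hand-rolled LDA, and the Monte Carlo loop are illustrative only; the dissertation's contribution is exact analytical formulas, not simulation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussian classes in 2-D (illustrative small-sample data).
n0 = n1 = 15
X = np.vstack([rng.normal(0, 1, (n0, 2)), rng.normal(1.2, 1, (n1, 2))])
y = np.array([0] * n0 + [1] * n1)

def lda_fit(X, y):
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    S = np.cov(X[y == 0].T) * (sum(y == 0) - 1)
    S += np.cov(X[y == 1].T) * (sum(y == 1) - 1)
    S /= len(y) - 2                        # pooled covariance
    w = np.linalg.solve(S, m1 - m0)
    b = w @ (m0 + m1) / 2
    return lambda Z: (Z @ w > b).astype(int)

clf = lda_fit(X, y)
resub = np.mean(clf(X) != y)               # resubstitution estimate

# Zero bootstrap: average error on points left out of each
# bootstrap sample.
B, err = 200, []
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)
    if len(oob) == 0 or min(sum(y[idx] == 0), sum(y[idx] == 1)) < 2:
        continue                           # skip degenerate resamples
    c = lda_fit(X[idx], y[idx])
    err.append(np.mean(c(X[oob]) != y[oob]))
zero_boot = np.mean(err)

# The .632 bootstrap is a fixed convex combination of the two.
e632 = 0.368 * resub + 0.632 * zero_boot
```

The convex bootstrap generalizes the last line by treating the weight 0.632 as a free parameter, which is what the optimal-weight results concern.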
We derive exact formulas for the first and second moments of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain exact formulas for the bias, the variance, and the root mean squared error of the deviation of these bootstrap estimators from the true error. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weights for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, an application of the bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly.
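Bagging itself can be sketched in a few lines. The decision stump below stands in for an unstable base classifier such as a CART tree; the data and all names are simulated assumptions, not the study's actual experiments:

```python
import numpy as np

rng = np.random.default_rng(4)

# Small-sample two-class data (illustrative).
n = 30
X = np.vstack([rng.normal(0, 1, (n // 2, 5)),
               rng.normal(0.8, 1, (n // 2, 5))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

def stump_fit(X, y):
    """Best single-feature threshold rule (a one-level tree)."""
    best = (1.0, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in X[:, j]:
            for sign in (1, -1):
                pred = (sign * (X[:, j] - t) > 0).astype(int)
                e = np.mean(pred != y)
                if e < best[0]:
                    best = (e, j, t, sign)
    _, j, t, sign = best
    return lambda Z: (sign * (Z[:, j] - t) > 0).astype(int)

def bagging_fit(X, y, B=25):
    """Train B stumps on bootstrap resamples; majority vote."""
    clfs = []
    for _ in range(B):
        idx = rng.integers(0, len(y), len(y))
        clfs.append(stump_fit(X[idx], y[idx]))
    return lambda Z: (np.mean([c(Z) for c in clfs], axis=0) > 0.5).astype(int)

bag = bagging_fit(X, y)
train_err = np.mean(bag(X) != y)
```

Averaging over bootstrap resamples reduces the variance of the unstable stump, which is the mechanism behind the improvements reported for overfitting classifiers.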
We give an explicit definition of the out-of-bag estimator, intended to remove estimator bias, by formulating carefully how the error count is normalized, and we investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data. The out-of-bag estimator is compared against common error estimators such as resubstitution, leave-one-out, cross-validation, the basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering. The results of the numerical experiments indicate that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.

Item Using the bootstrap to analyze variable stars data (Texas A&M University, 2005-02-17) Dunlap, Mickey Paul

Often in statistics it is of interest to investigate whether or not a trend is significant. Methods for testing such a trend depend on assumptions about the error terms, such as whether their distribution is known and whether they are independent. Likelihood ratio tests may be used if the distribution is known, but in some instances one may not want to make such assumptions. In a time series, these errors will not always be independent; in this case, the error terms are often modelled by an autoregressive or moving average process. There are resampling techniques for testing the hypothesis of interest when the error terms are dependent, such as model-based bootstrapping and the wild bootstrap, but the error terms need to be whitened.
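A model-based bootstrap of the kind just mentioned can be sketched as follows, assuming, purely for illustration, AR(1) errors and a linear trend test; the whitening step (fitting the AR coefficient and extracting innovations) is the part the abstract refers to:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated series: constant level, AR(1) errors, no true trend.
n, phi = 100, 0.6
e = np.zeros(n)
for t in range(1, n):
    e[t] = phi * e[t - 1] + rng.normal()
y = 10.0 + e
x = np.arange(n, dtype=float)

def trend_slope(y, x):
    return np.polyfit(x, y, 1)[0]

slope = trend_slope(y, x)

# Model-based bootstrap under H0 (no trend):
# 1) detrend, 2) fit AR(1), 3) whiten to innovations,
# 4) resample innovations and rebuild AR(1) errors.
resid = y - np.polyval(np.polyfit(x, y, 1), x)
phi_hat = resid[1:] @ resid[:-1] / (resid[:-1] @ resid[:-1])
innov = resid[1:] - phi_hat * resid[:-1]     # whitened errors
innov = innov - innov.mean()

B, boot_slopes = 500, []
for _ in range(B):
    eb = np.zeros(n)
    draws = rng.choice(innov, n)
    for t in range(1, n):
        eb[t] = phi_hat * eb[t - 1] + draws[t]
    boot_slopes.append(trend_slope(y.mean() + eb, x))

p_val = np.mean(np.abs(boot_slopes) >= abs(slope))
```

Resampling the whitened innovations, rather than the dependent residuals themselves, is what keeps the bootstrap valid under serial dependence.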
In this dissertation, a bootstrap procedure is used to test the hypothesis of no trend for variable stars when the error structure assumes a particular form. In some cases, the bootstrap implemented here is preferred over large-sample tests in terms of the level of the test. The bootstrap procedure is able to correctly identify the underlying distribution, which may not be χ².