Browsing by Author "Huang, Jianhua"
Item A Likelihood Based Framework for Data Integration with Application to eQTL Mapping (2014-06-24) Feng, Shuo

We develop a new way of thinking about and integrating gene expression data (continuous) and genomic information data (binary) by jointly compressing the two data sets and embedding their signals in low-dimensional feature spaces with an information-sharing mechanism, which connects the continuous data to the binary data, under the penalized log-likelihood framework. In particular, the continuous data are modeled by a Gaussian likelihood, and the binary data are modeled by a Bernoulli likelihood formed by transforming the feature space of the genomic information with a logit link. The smoothly clipped absolute deviation (SCAD) penalty is placed on the basis vectors of the low-dimensional feature spaces for both data sets. This is motivated by the assumption that only a small set of genetic variants is associated with a small fraction of gene expression, and by the fact that those basis vectors can be interpreted as weights assigned to the genetic variants and gene expression, much as the loading vectors of principal component analysis (PCA) or canonical correlation analysis (CCA) are interpreted. Algorithmically, a majorization-minimization (MM) algorithm with a local linear approximation (LLA) to the SCAD penalty is developed to solve the optimization problem effectively and efficiently, yielding closed-form updating rules. The effectiveness of our method is demonstrated by simulations in various setups, with comparisons to popular competing methods, and by an application to eQTL mapping with real data.
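The MM-with-LLA device named above is easiest to see in a plain penalized regression. The following minimal sketch is our own simplification (assuming numpy; the dissertation's actual objective is the joint two-block likelihood, and all function names here are ours): each sweep majorizes the SCAD penalty by a weighted L1 penalty at the current iterate, so every coordinate update is a closed-form soft-thresholding step.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan-Li form) evaluated at |t|."""
    t = np.abs(t)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))

def soft_threshold(z, w):
    """Closed-form minimizer of 0.5*(b - z)^2 + w*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

def lla_scad_regression(X, y, lam, n_sweeps=20):
    """MM iterations: each sweep majorizes SCAD by a weighted L1 penalty
    at the current iterate (LLA), then does closed-form coordinate updates."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # unpenalized starting value
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        w = scad_deriv(beta, lam)                # LLA weights, fixed per sweep
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r_j / col_ss[j]
            beta[j] = soft_threshold(z, n * w[j] / col_ss[j])
    return beta

# toy check: two true signals among ten predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ np.r_[3.0, -2.0, np.zeros(8)] + rng.standard_normal(100)
print(np.round(lla_scad_regression(X, y, lam=0.2), 2))
```

Because the SCAD derivative vanishes for large coefficients, the weights on strong signals shrink to zero across sweeps, which is what removes the bias a plain lasso would leave.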
Item Bivariate B-splines and its Applications in Spatial Data Analysis (2011-08-09) Pan, Huijun

In the field of spatial statistics, it is often desirable to generate a smooth surface for a region over which only noisy observations of the surface are available at some locations, possibly across time. Kriging and kernel estimation are two of the most popular methods. However, both become problematic when the domain is irregular, that is, neither rectangular nor convex. Bivariate B-splines, developed by mathematicians, provide a useful nonparametric tool for bivariate surface modeling. They inherit several appealing properties of univariate B-splines and are applicable to various modeling problems. More importantly, bivariate B-splines have advantages over kriging and kernel estimation when dealing with complicated domains. The purpose of this dissertation is to develop a nonparametric surface-fitting method using bivariate B-splines that can handle complex spatial domains. The dissertation consists of four parts. The first part explains the challenges of smoothing over complicated domains and reviews existing methods. The second part introduces bivariate B-splines and explains their properties and implementation techniques. The third and fourth parts discuss applications of bivariate B-splines to two nonparametric spatial surface-fitting problems. In particular, the third part develops a penalized B-splines method to reconstruct a smooth surface from noisy observations; a numerical algorithm is derived, implemented, and applied to simulated and real data. The fourth part develops a reduced-rank mixed-effects model for functional principal component analysis of sparsely observed spatial data; a numerical algorithm is used to implement the method and is tested on simulated and real data.

Item Dimension Reduction and Covariance Structure for Multivariate Data, Beyond Gaussian Assumption (2012-10-19) Maadooliat, Mehdi

Storage and analysis of high-dimensional datasets are always challenging. Dimension reduction techniques are commonly used to reduce the complexity of the data and extract its informative aspects. Principal component analysis (PCA) is one of the most commonly used dimension reduction techniques. However, PCA does not work well when there are outliers or the data distribution is skewed. Gene expression index estimation is an important problem in bioinformatics. Some popular methods in this area are based on PCA and thus may not work well when there is non-Gaussian structure in the data. To address this issue, a likelihood-based data transformation method with a computationally efficient algorithm is developed. Also, a new multivariate expression index is studied, and its performance is compared with that of the commonly used univariate expression index. As an extension of the gene expression index estimation problem, a general procedure that integrates data transformation with PCA is developed. In particular, this general method can handle missing data and data with functional structure. It is well known that PCA can be obtained from the eigendecomposition of the sample covariance matrix. Another focus of this dissertation is the study of covariance (or correlation) structure under non-Gaussian assumptions. An important issue in modeling the covariance matrix is the positive-definiteness constraint. The modified Cholesky decomposition of the inverse covariance matrix has been considered in the literature to address this issue. An alternative Cholesky decomposition, of the covariance matrix itself, is considered and used to construct an estimator of the covariance matrix under a multivariate-t assumption. The advantage of this alternative decomposition is that it decouples the correlations from the variances.
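To make the decoupling in the entry above concrete, here is a small numerical sketch (our own illustration, assuming numpy, and not necessarily the dissertation's exact parameterization): Cholesky-factor only the correlation matrix, so the triangular factor carries the dependence while a diagonal matrix carries the scales, and rescaling the variances leaves the factor untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Sigma = A @ A.T + np.eye(6)                 # a positive definite covariance

# Split scale from dependence: Sigma = D R D, D diagonal, R a correlation
sd = np.sqrt(np.diag(Sigma))
D = np.diag(sd)
R = np.diag(1.0 / sd) @ Sigma @ np.diag(1.0 / sd)

# Cholesky-factor only the correlation part: R = C C^T
C = np.linalg.cholesky(R)
assert np.allclose(Sigma, D @ C @ C.T @ D)

# Rescaling all variances changes D but leaves C untouched, so the
# dependence (C) and the variances (D) can be modeled separately.
D2 = 2.0 * D
Sigma2 = D2 @ C @ C.T @ D2
R2 = np.diag(1.0 / np.diag(D2)) @ Sigma2 @ np.diag(1.0 / np.diag(D2))
assert np.allclose(R, R2)
```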
Item Estimation of Large Spectral Function and Its Application (2014-07-28) Qu, Yuan

Routine monitoring of waves and currents in nearshore seas is of fundamental interest both to scientists and to the general public, because waves and currents play an important role in coastline erosion and significantly affect nearshore recreational activities. In this work, we show how to estimate both wave height and wave direction from data observed by a bottom-mounted, upward-looking Acoustic Doppler Current Profiler. One of the most challenging tasks is to estimate the wave-number spectra using all of the gathered observations from the receiving antennas. The observed data form a 100-dimensional time series with T = 2399. Because there is only one realization of this multivariate time series, conventional methods are either applicable only to univariate time series or appropriate only in low-dimensional settings. In this work, we propose a new regularized estimator of the wave-number spectral density with three merits: positive definiteness, smoothness, and sparsity. The method can also be used to regularize any complex or real tensor so that the resulting estimator has these three merits. We describe our proposed algorithm and prove its convergence, and we compare the proposed estimator with the sample wave-number spectra and with two other regularized estimators: banding and extended tapering. The numerical results show that the estimation performance of our approach is substantially better than that of the other estimators. The proposed estimator and the extended tapering estimator are comparable in smoothness and positive definiteness; unlike the other estimators, our approach produces a sparse estimator, which greatly reduces the computational complexity of further analysis.

Item Model-based Pre-processing in Protein Mass Spectrometry (2011-02-22) Wagaman, John C.

The discovery of proteomic information through mass spectrometry (MS) has been an active area of research in the diagnosis and prognosis of many types of cancer. This process involves feature selection through peak detection but is often complicated by many forms of non-biological bias. The need to extract biologically relevant peak information from MS data has led to the development of statistical techniques that aid in spectra pre-processing. Baseline estimation and normalization are important pre-processing steps because the subsequent quantification of peak heights depends on the baseline estimate. This dissertation introduces a mixture model that estimates the baseline and peak heights simultaneously through the expectation-maximization (EM) algorithm and a penalized-likelihood approach. Our model-based pre-processing performs well on raw, unnormalized data with few subjective inputs. We also propose a model-based normalization solution for use in subsequent classification procedures, where misclassification results compare favorably with existing normalization methods. The performance of our pre-processing method is evaluated on popular matrix-assisted laser desorption and ionization (MALDI) and surface-enhanced laser desorption and ionization (SELDI) datasets as well as through simulation.

Item Modeling covariance structure in unbalanced longitudinal data (2009-05-15) Chen, Min

Modeling covariance structure is important for efficient estimation in longitudinal data models. The modified Cholesky decomposition (Pourahmadi, 1999) is used as an unconstrained reparameterization of the covariance matrix. The resulting new parameters have transparent statistical interpretations and are easily modeled using covariates. However, this approach is not directly applicable when the longitudinal data are unbalanced, because a Cholesky factorization for the observed data that is coherent across all subjects usually does not exist. We overcome this difficulty by treating the problem as a missing-data problem and employing a generalized EM algorithm to compute the ML estimators. We study the covariance matrices in both fixed-effects and mixed-effects models for unbalanced longitudinal data. We illustrate our method by reanalyzing Kenward's (1987) cattle data and by conducting simulation studies.
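For readers unfamiliar with the modified Cholesky decomposition used in the entry above, the following sketch computes it for a single balanced covariance matrix (a minimal illustration with numpy; the dissertation's contribution, handling unbalanced data through a generalized EM algorithm, is not attempted here). Row t of the unit lower-triangular factor holds the negated coefficients from regressing measurement t on its predecessors, and the diagonal holds the innovation variances, which is what gives the new parameters their transparent interpretation.

```python
import numpy as np

def modified_cholesky(Sigma):
    """Pourahmadi (1999): unit lower-triangular T and diagonal D with
    T @ Sigma @ T.T = D.  Row t of T holds the negated coefficients from
    regressing measurement t on measurements 0..t-1; D holds the
    innovation variances.  Both are unconstrained, hence easy to model."""
    m = Sigma.shape[0]
    T = np.eye(m)
    D = np.zeros(m)
    D[0] = Sigma[0, 0]
    for t in range(1, m):
        phi = np.linalg.solve(Sigma[:t, :t], Sigma[:t, t])  # AR coefficients
        T[t, :t] = -phi
        D[t] = Sigma[t, t] - Sigma[:t, t] @ phi             # innovation variance
    return T, D

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)
T, D = modified_cholesky(Sigma)
assert np.allclose(T @ Sigma @ T.T, np.diag(D))
```

The incoherence for unbalanced data is visible here: each row of T is tied to a subject's full measurement sequence, so subjects observed at different time points do not share one factorization, which is the gap the generalized EM treatment fills.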
Item Three Essays on Semiparametric Econometrics: Theory and Application (2014-04-25) Li, Hongjun

This dissertation investigates the theory and application of semiparametric econometrics. I first examine the selection of an optimal bandwidth, via cross-validation, for kernel estimation of cumulative distribution and survivor functions. I then analyze the determination of the number of factors using principal components and information criteria. Finally, I apply semiparametric methods to the "purchasing power parity" puzzle. First, I propose a data-driven least-squares cross-validation method to optimally select smoothing parameters for the nonparametric estimation of cumulative distribution and survivor functions. The multivariate covariates may be continuous, discrete or ordered categorical, or a mix of the two. I establish the asymptotic optimality of the least-squares cross-validation method and show that estimators of cumulative distribution and survivor functions using the selected smoothing parameters are asymptotically normally distributed. Monte Carlo simulation verifies the finite-sample properties of the method. Second, I discuss econometric theory for factor models of large dimensions in which the number of factors (r) is allowed to increase with the two dimensions, the cross-section size (N) and the time dimension (T). I focus mainly on the determination of the number of factors, extending existing panel criteria to the high-dimensional case where r may increase with N or T. I show that the number of factors can be consistently estimated using these criteria, and Monte Carlo simulation demonstrates the finite-sample properties of the proposed estimation method. Lastly, I consider an empirical application of semiparametric econometrics to testing the purchasing power parity (PPP) hypothesis. Traditional linear cointegration tests of the PPP hypothesis often reject it. More recent studies that allow for some form of nonlinearity in the econometric model give mixed results and leave the problem unresolved. I therefore analyze the PPP hypothesis within a semiparametric framework, using a varying-coefficient model with integrated variables, which can capture nonlinearity in the economic structure. Applying a semiparametric functional cointegration test, I conduct cointegration tests of the PPP hypothesis between the U.S. and Canada, the U.S. and Japan, and the U.S. and the U.K. In contrast to the usual findings based on linear-model PPP tests, the semiparametric tests provide evidence supporting the PPP hypothesis.

Item Variable Selection and Function Estimation Using Penalized Methods (2012-02-14) Xu, Ganggang

Penalized methods are becoming increasingly popular in statistical research. This dissertation covers two major applications of penalized methods: variable selection and nonparametric function estimation. The following two paragraphs give brief introductions to the two topics. Infinite-variance autoregressive models are important for modeling heavy-tailed time series. We use a penalty method to conduct model selection for autoregressive models with innovations in the domain of attraction of a stable law with index α ∈ (0, 2). We show that by combining the least absolute deviation loss function with the adaptive lasso penalty, we can consistently identify the true model; at the same time, the resulting coefficient estimator converges at the rate n^(-1/α). The proposed approach gives a unified variable selection procedure for both finite- and infinite-variance autoregressive models.
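Here is a minimal sketch of the LAD-loss-plus-adaptive-lasso combination just described (our own simplification, assuming scipy; no intercept, a plain unpenalized LAD pilot fit for the weights, and none of the thesis's asymptotic tuning). Both the pilot fit and the penalized fit are cast as linear programs, since an absolute-value objective with L1 penalty is linear after introducing slack variables.

```python
import numpy as np
from scipy.optimize import linprog

def lad_adaptive_lasso_ar(y, p, lam):
    """LAD loss + adaptive-lasso penalty for an AR(p) model, cast as an LP:
    min sum|e_t| + sum w_j |beta_j|, with w_j = lam / |beta_pilot_j|."""
    n = len(y) - p
    # column j holds lag (j+1) of the series
    X = np.column_stack([y[p - j - 1: n + p - j - 1] for j in range(p)])
    z = y[p:]

    def lad(weights):
        # LP variables: [beta (p, free), u (n, >=0), v (p, >=0)]
        # with |z - X beta| <= u and |beta| <= v
        c = np.concatenate([np.zeros(p), np.ones(n), weights])
        A = np.block([[ X,         -np.eye(n),        np.zeros((n, p))],
                      [-X,         -np.eye(n),        np.zeros((n, p))],
                      [ np.eye(p),  np.zeros((p, n)), -np.eye(p)],
                      [-np.eye(p),  np.zeros((p, n)), -np.eye(p)]])
        b = np.concatenate([z, -z, np.zeros(2 * p)])
        bounds = [(None, None)] * p + [(0, None)] * (n + p)
        res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
        return res.x[:p]

    pilot = lad(np.zeros(p))                    # unpenalized LAD pilot fit
    w = lam / np.maximum(np.abs(pilot), 1e-8)   # adaptive weights
    return lad(w)
```

Large pilot coefficients receive small penalties and small ones receive large penalties, which is what lets the procedure keep the true lags while zeroing the rest, even when the innovations have infinite variance and least squares is unreliable.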
While automatic smoothing parameter selection for nonparametric function estimation has been extensively researched for independent data, it is much less developed for clustered and longitudinal data. Although leave-subject-out cross-validation (CV) has been widely used, its theoretical properties are unknown and its minimization is computationally expensive, especially when there are multiple smoothing parameters. Focusing on penalized modeling methods, we show that leave-subject-out CV is optimal in the sense that its minimization is asymptotically equivalent to minimization of the true loss function. We develop an efficient Newton-type algorithm to compute the smoothing parameters that minimize the CV criterion. Furthermore, we derive a simplification of the leave-subject-out CV that leads to a more efficient algorithm for selecting the smoothing parameters. We show that the simplified CV criterion is asymptotically equivalent to the unsimplified one and thus enjoys the same optimality property. This CV criterion also provides a completely data-driven approach to selecting the working covariance structure when using generalized estimating equations in longitudinal data analysis. Our results apply to additive, linear, varying-coefficient, and nonlinear models with data from exponential families.
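The leave-subject-out idea is easy to state in code. Below is a brute-force sketch for a single smoothing parameter (our own illustration, assuming numpy, with a toy truncated-line spline basis; the dissertation derives a Newton-type algorithm and an asymptotically equivalent simplification precisely to avoid this naive refitting): for each candidate value, each subject's entire record is held out, the penalized fit is recomputed, and the prediction error on the held-out subject accumulates into the criterion.

```python
import numpy as np

def basis(t, knots):
    """Truncated-line spline basis: [1, t, (t - k)_+ for each knot]."""
    cols = [np.ones_like(t), t] + [np.maximum(t - k, 0.0) for k in knots]
    return np.column_stack(cols)

def fit(B, y, lam, pen):
    """Penalized least squares: solve (B'B + lam * pen) coef = B'y."""
    return np.linalg.solve(B.T @ B + lam * pen, B.T @ y)

def leave_subject_out_cv(t, y, subject, knots, lams):
    """Drop one whole subject at a time, refit, and accumulate the
    squared prediction error on that subject for each candidate lam."""
    B = basis(t, knots)
    pen = np.diag([0.0, 0.0] + [1.0] * len(knots))  # penalize only the kinks
    scores = []
    for lam in lams:
        err = 0.0
        for s in np.unique(subject):
            out = subject == s
            coef = fit(B[~out], y[~out], lam, pen)
            err += np.sum((y[out] - B[out] @ coef) ** 2)
        scores.append(err / len(y))
    return lams[int(np.argmin(scores))], scores

# toy longitudinal data: 20 subjects, 10 observations each,
# with a shared within-subject random effect
rng = np.random.default_rng(2)
subject = np.repeat(np.arange(20), 10)
t = rng.uniform(0.0, 1.0, 200)
y = (np.sin(2 * np.pi * t) + rng.normal(0.0, 0.3, 200)
     + rng.normal(0.0, 0.3, 20)[subject])
lam_best, _ = leave_subject_out_cv(t, y, subject,
                                   knots=np.linspace(0.1, 0.9, 9),
                                   lams=np.logspace(-4, 2, 13))
```

Holding out whole subjects, rather than single observations, is what keeps the criterion honest under within-subject correlation: ordinary leave-one-out would let a subject's remaining observations leak information about the held-out point.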