High Dimension, Low Sample Size (HDLSS) problems have received much attention
recently in many areas of science. Analysis of microarray experiments is one
such area. Numerous studies are on-going to investigate the behavior of genes by
measuring the abundance of mRNA (messenger RiboNucleic Acid), gene expression.
HDLSS data investigated in this dissertation consist of a large number of data sets
each of which has only a few observations.
We assume a statistical model in which measurements from the same subject
have the same expected value and variance. All subjects have the same distribution
up to location and scale. Information from all subjects is shared in estimating this
common distribution.
Our interest is in testing the hypothesis that the mean of measurements from a
given subject is 0. Commonly used tests of this hypothesis, the t-test, sign test and
traditional bootstrapping, do not necessarily provide reliable results since there are
only a few observations for each data set.
We motivate a mixture model having C clusters and 3C parameters to overcome
the small sample size problem. Standardized data are pooled after assigning each
data set to one of the mixture components. To get reasonable initial parameter estimates
when density estimation methods are applied, we apply clustering methods
including agglomerative and K-means.
Bayes Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean
of within Cluster Variance estimates), are used to choose an optimal number of clusters.
Density estimation methods including a maximum likelihood unimodal density
estimator and kernel density estimation are used to estimate the unknown density.
Once the density is estimated, a bootstrapping algorithm that selects samples from
the estimated density is used to approximate the distribution of test statistics. The
t-statistic and an empirical likelihood ratio statistic are used, since their distributions
are completely determined by the distribution common to all subject. A method to
control the false discovery rate is used to perform simultaneous tests on all small data
sets.
Simulated data sets and a set of cDNA (complimentary DeoxyriboNucleic Acid)
microarray experiment data are analyzed by the proposed methods.