Functional Data Analysis For Environmental And Biomedical Problems
Abstract
Advances in sensing technology are generating vast amounts of data. One of the most common types, found across many disciplines, is functional data. Because functional data are generally collected over a wide area of interest and over a relatively long period, their analysis should take both temporal and spatial characteristics into account. Furthermore, combining observations from multiple locations, each with a large number of serially correlated values, poses a great challenge to analytical and computational capabilities. In contrast, data from medical and biomedical research are usually collected from a very small number of test subjects. Because all medical data collection procedures require direct interaction with the subjects, they must be carried out with great care to ensure that the experiments have no side effects or adverse consequences. Such experiments generally require approval from a review board before they can proceed. As a result, even over a long period of time, only a very small set of data can be obtained from a medical study.
Data mining methods are useful tools for efficiently extracting implicit patterns from both kinds of data: large and complicated as well as small and scarce. Despite their great potential, appropriate methods for such data remain immature and insufficient. The major aim of this dissertation is to present data mining methods, applied to real data, as tools for analyzing the complex behavior of functional data. In the first part, this dissertation presents a data mining application to: (1) identify an efficient way to characterize the spatial variations of PM2.5 concentrations based solely upon their temporal patterns, and (2) analyze the temporal and seasonal patterns of PM2.5 concentrations in spatially homogeneous regions.
This study used 24-hour average PM2.5 concentrations measured every third day between 2001 and 2005 at 522 monitoring sites in the continental United States. A k-means clustering algorithm using the correlation distance was employed to investigate the similarity between temporal profiles observed at the monitoring sites. The clustering analysis produced six clusters of sites with distinct temporal patterns, which identified and characterized spatially homogeneous regions of the United States. The study also presents a rotated principal component analysis (RPCA), a method that has been used for characterizing spatial patterns of air pollution, and discusses the differences between the clustering algorithm and RPCA.
A data mining application for investigating the behavior of ozone concentrations is presented in the following chapter. Ozone is known to be associated with human health effects. Ozone data are generally collected over a long period of time at locations of interest. However, constructing ozone monitoring sites may not be possible or cost-effective because of limitations such as hazardous environments or inaccessible areas. The objectives of this study are: (1) to interpolate ozone concentrations as a functional response at an unsampled location, and (2) to reduce model complexity by constructing a data compression and reduction model that retains as much accuracy as possible. This study used daily maximum 8-hour ozone concentrations between 2003 and 2006 at 14 monitoring sites in the Dallas-Fort Worth area. Wavelet decomposition broke the data down into multiple scales for analysis. Regression analysis was used as the data compression method, and kriging was applied for spatial interpolation. In addition, a model-refinement step helped tune the ozone concentration estimates for different levels of variability.
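The multiscale idea behind the wavelet step can be illustrated with a minimal sketch. The abstract does not say which wavelet family or how many levels were used, so the Haar wavelet, the function names, and the sample series below are all assumptions chosen for simplicity, not the dissertation's actual model.

```python
import math

SCALE = 1 / math.sqrt(2)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return (coarsest approximation, list of detail bands ordered coarsest first)."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details[::-1]


def haar_reconstruct(approx, details):
    """Invert haar_decompose; details must be ordered coarsest first."""
    current = list(approx)
    for detail in details:
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.extend([(a + d) * SCALE, (a - d) * SCALE])
        current = rebuilt
    return current


# Illustrative values only (hypothetical ppb readings, not data from the study).
series = [40.0, 44.0, 52.0, 60.0, 58.0, 50.0, 46.0, 42.0]
approx, details = haar_decompose(series, levels=3)
print("coarse trend coefficient:", approx)
print("detail bands, coarse to fine:", details)
```

Each detail band isolates variation at one time scale, so a later modeling step could treat slow trends and short-term fluctuations separately. A production analysis would more likely rely on an established wavelet library than on hand-written transforms.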
This study shows that our model achieves a mean absolute error (MAE) as low as 6.99 ppb and a mean absolute error on high-ozone days (MAE75) as low as 9.76 ppb.
Finally, an efficient strategy for classifying prostate cancer from near-infrared spectra is illustrated. Prostate cancer is the most common cancer and the second leading cause of cancer death among men in the United States. The main purpose of this study is to develop an efficient tool that classifies near-infrared (NIR) spectroscopic data taken from ex vivo human prostate glands as normal or cancerous. The proposed procedure consists of several steps. First, to ensure comparability between spectra, each spectrum was normalized by dividing every spectral point by the area of the spectrum's total intensity. Second, clustering analysis was performed on the normalized spectra to separate those that represent the normal pattern from a mixed group containing both normal and tumor spectra. Third, we conducted a two-stage classification: the first stage constructs a classification model using the labels obtained from the preceding clustering analysis, and the second stage focuses on the mixed group identified by the first model. To increase accuracy, the second classification model was built on selected features that capture important characteristics of the spectral data. The proposed procedure was evaluated by its classification performance on test samples using leave-one-out cross-validation, yielding an accuracy of 90%.
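The area normalization and leave-one-out evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the area is computed with the trapezoidal rule, a nearest-centroid classifier stands in for the classifiers actually used (the abstract does not name them), and the wavelengths, spectra, and labels are made-up toy values.

```python
def trapezoid_area(wavelengths, intensities):
    """Area under the intensity curve by the trapezoidal rule (an assumed choice)."""
    return sum(
        0.5 * (intensities[i] + intensities[i + 1]) * (wavelengths[i + 1] - wavelengths[i])
        for i in range(len(wavelengths) - 1)
    )


def normalize_by_area(wavelengths, intensities):
    """Divide every spectral point by the total area so spectra become comparable."""
    area = trapezoid_area(wavelengths, intensities)
    if area == 0:
        raise ValueError("spectrum has zero area and cannot be normalized")
    return [value / area for value in intensities]


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: return the label whose class mean is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and report the hit rate."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data only: hypothetical wavelengths (nm), intensities, and labels.
wavelengths = [700.0, 750.0, 800.0, 850.0]
raw_spectra = [
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.4], [0.5, 1.0, 1.6, 2.0],
    [4.0, 3.0, 2.0, 1.0], [8.0, 6.2, 4.0, 2.0], [2.0, 1.5, 1.1, 0.5],
]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

spectra = [normalize_by_area(wavelengths, s) for s in raw_spectra]
print("leave-one-out accuracy:", leave_one_out_accuracy(spectra, labels))
```

Normalizing by area removes overall intensity differences between measurements, so the classifier compares spectral shape rather than scale. Leave-one-out cross-validation suits the small sample sizes typical of biomedical studies because every sample is used for testing exactly once.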