Browsing by Subject "Data mining"
Item: A simple transformantion [sic] model of HTML into ROF [sic] for the semantic web (Texas Tech University, 2004-08). Baquero Oleas, Jorge A.
Several aspects of data mining, and ways to apply it to the World Wide Web, have recently been studied by the research and developer communities. The main problem these communities face is that data on the Web is primarily unstructured, which makes applying data mining techniques to the Web difficult. However, new semantic structures, which make the application of data mining techniques easier, are currently being developed as an extension of the Web. The first revolutionary technique for content and structural modeling was the Extensible Markup Language (XML). More recently, even more effective techniques, such as the Resource Description Framework (RDF) and ontologies, have incorporated semantic and structural features that can be exploited to help users find relevant information on the next generation of the Web. The aim of introducing semantics and structure to the Web is to enhance the precision of search engines and to enable the use of logical reasoning over Web documents to answer users' queries.
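As a concrete illustration of the structured representation that the abstract above contrasts with raw HTML, here is a minimal sketch of an RDF description in Python using the rdflib package; the namespace, resource URI, and property names are illustrative assumptions, not taken from the thesis.

```python
# A minimal sketch of describing a document as RDF triples with rdflib.
# The namespace, resource URI, and properties below are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")            # hypothetical namespace
doc = URIRef("http://example.org/thesis/42")     # hypothetical resource

g = Graph()
# Unlike raw HTML, each statement is an explicit (subject, predicate, object)
# triple that a mining tool can query directly.
g.add((doc, RDF.type, EX.Thesis))
g.add((doc, EX.subject, Literal("Data mining")))
g.add((doc, FOAF.maker, Literal("Jorge A. Baquero Oleas")))

print(g.serialize(format="turtle"))
```

The point of the sketch is structural: once content is expressed as triples, queries and inference operate on explicit relations instead of scraping markup.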
Item: Adaptive Evolutionary Monte Carlo for Heuristic Optimization: With Applications to Sensor Placement Problems (2010-01-14). Ren, Yuan.
This dissertation presents an algorithm to solve optimization problems with "black-box" objective functions, i.e., functions that can only be evaluated by running a computer program. Such optimization problems often arise in engineering applications, for example, the design of sensor placement. Due to the complexity of engineering systems, the objective functions usually have multiple local optima and depend on a huge number of decision variables. These difficulties make many existing methods less effective. The proposed algorithm, called adaptive evolutionary Monte Carlo (AEMC), combines sampling-based and metamodel-based search methods, incorporating the strengths of both and compensating for the limitations of each. Specifically, AEMC combines a tree-based predictive model with an evolutionary Monte Carlo sampling procedure for the purpose of heuristic optimization. AEMC is able to escape local optima due to its random sampling component, and it improves the quality of solutions quickly by using information learned from the tree-based model. AEMC is also an adaptive Markov chain Monte Carlo (MCMC) algorithm, and is in fact the first adaptive MCMC algorithm that simulates multiple Markov chains in parallel. The ergodicity of the AEMC algorithm is studied: it is proven that the distribution of samples obtained by AEMC converges asymptotically to the "target" distribution determined by the objective function. This means that AEMC has a larger probability of collecting samples from regions containing the global optimum than from other regions, which implies that AEMC will reach the global optimum given enough run time. The AEMC algorithm falls into the category of heuristic optimization algorithms and is applicable to problems that can be solved by other heuristic methods, such as genetic algorithms. The advantages of AEMC are demonstrated by applying it to a sensor placement problem in a manufacturing process, as well as to a suite of standard test functions. It is shown that AEMC enhances optimization effectiveness and efficiency compared to a few alternative strategies, including genetic algorithms, Markov chain Monte Carlo algorithms, and metamodel-based methods. The effectiveness of AEMC for sampling purposes is also shown by applying it to a Gaussian mixture distribution.
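To make the sampling side of AEMC concrete, the following is a highly simplified sketch of evolutionary Monte Carlo: a population of chains at different temperatures proposes mutation and crossover moves and accepts them with a Metropolis rule. The toy objective, the tuning constants, and the per-chain acceptance of crossover moves are our simplifying assumptions; the adaptive tree-based metamodel that distinguishes AEMC is omitted.

```python
# Simplified evolutionary Monte Carlo: a population of Markov chains explores
# a "black-box" objective via mutation and crossover moves with Metropolis
# acceptance. Objective and constants are illustrative assumptions; the
# adaptive, tree-based metamodel component of AEMC is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    # Multimodal toy function to be minimized.
    return np.sum(x**2) + 10.0 * np.sin(3.0 * x).sum()

def accepts(f_old, f_new, temp):
    # Metropolis rule for minimization at a given temperature.
    return np.log(rng.random() + 1e-300) < (f_old - f_new) / temp

dim, n_chains, n_iters = 5, 8, 2000
temps = np.linspace(1.0, 5.0, n_chains)          # temperature ladder
pop = rng.uniform(-5, 5, size=(n_chains, dim))   # one state per chain
fvals = np.array([objective(x) for x in pop])

for _ in range(n_iters):
    for i in range(n_chains):
        if rng.random() < 0.8:   # mutation: Gaussian random-walk proposal
            cand = pop[i] + rng.normal(0.0, 0.5, dim)
        else:                    # crossover: mix coordinates with another chain
            j = rng.integers(n_chains)
            mask = rng.random(dim) < 0.5
            cand = np.where(mask, pop[i], pop[j])
        f_cand = objective(cand)
        if accepts(fvals[i], f_cand, temps[i]):
            pop[i], fvals[i] = cand, f_cand

print("best value found:", fvals.min())
```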
Item: Classification of encrypted cloud computing service traffic using data mining techniques (2011-12). Qian, Cheng; Ghosh, Joydeep.
In addition to wireless network providers' need for traffic classification, the need is increasingly common in the Cloud Computing environment. A data center hosting Cloud Computing services needs to apply priority policies and Service Level Agreement (SLA) rules at the edge of its network. Growing requirements for user privacy protection and the trend of IPv6 adoption will contribute to significant growth in encrypted Cloud Computing traffic. This report presents experiments applying data mining based Internet traffic classification methods to encrypted Cloud Computing service traffic. By combining TCP session level attributes, client and host connection patterns, and Cloud Computing service Message Exchange Patterns (MEP), the best method identified in this report yields 89% overall accuracy.
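A hedged sketch of the kind of session-level classification the report describes; the synthetic feature matrix and service labels below are placeholders for the real TCP session attributes, and the random forest is our choice of classifier, not necessarily the report's best method.

```python
# Sketch: classifying encrypted traffic from TCP session-level attributes.
# Features and labels are synthetic placeholders; on random data the score
# is chance-level, whereas real session statistics carry actual signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 1000
# Hypothetical per-session features: packet count, mean packet size,
# duration, bytes client->server, bytes server->client.
X = rng.normal(size=(n, 5))
y = rng.integers(0, 3, size=n)   # three hypothetical service classes

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```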
Item: Clinically interpretable models for healthcare data (2015-12). Ho, Joyce Carmen; Ghosh, Joydeep; Vishwanath, Sriram; Vikalo, Haris; Sanghavi, Sujay; Sun, Jimeng.
The increasing availability of electronic health records (EHRs) has spurred the adoption of data-driven approaches that provide additional insights for diagnoses, prognoses, and cost-effective patient treatment and management. The records are composed of a diverse array of data that includes both structured information (e.g., diagnoses, medications, and lab results) and unstructured clinical narrative notes (e.g., physicians' observations and progress notes). Thus, EHRs are a rich source of patient information. However, several formidable challenges have so far limited their utility for clinical research, including data quality; high-dimensional, heterogeneous information from various sources; privacy; and interoperability across institutions. Further hampering the acceptance of data-driven models is the lack of interpretability of their results. Physicians are accustomed to reasoning with concise clinical concepts (or phenotypes) rather than directly with high-dimensional EHR data. Unfortunately, these records do not readily map to simple phenotypes, let alone more sophisticated and multifaceted ones. This dissertation investigates the development of clinically interpretable models for EHR data using dimensionality reduction techniques. We posit that clinical concepts are representations in lower-dimensional latent spaces; yet standard dimensionality reduction techniques alone are insufficient to derive concise and relevant medical concepts from EHR data. We explore two approaches: (1) state space models to dynamically track a patient's cardiac arrest risk, and (2) non-negative matrix and tensor factorization models to generate concise and clinically relevant phenotypes. Our approaches yield clinically interpretable models with minimal human intervention and provide a powerful, data-driven framework for transforming high-dimensional EHR data into medical concepts.
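To illustrate the factorization approach in the abstract above, here is a minimal sketch of plain NMF applied to a synthetic patient-by-diagnosis count matrix; the thesis develops richer matrix and tensor factorizations, and the data below are invented.

```python
# Sketch: non-negative matrix factorization of a patient-by-diagnosis count
# matrix into candidate "phenotypes". The data are synthetic; the thesis
# develops more elaborate (including tensor) factorizations than plain NMF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
counts = rng.poisson(1.0, size=(200, 50))   # 200 patients x 50 diagnosis codes

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(counts)   # patient loadings on each phenotype
H = model.components_             # phenotype definitions over diagnoses

# Each row of H is a candidate phenotype: a small set of diagnosis codes
# with large weights, which a clinician can inspect directly.
for k, row in enumerate(H):
    top = np.argsort(row)[::-1][:3]
    print(f"phenotype {k}: top diagnosis columns {top.tolist()}")
```

The interpretability argument is visible in the output: each phenotype is a short, non-negative combination of diagnosis codes rather than an opaque projection.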
Item: Community Detection in Social Networks (2016-12).

Item: Computational discovery of genetic targets and interactions: applications to lung cancer (2016-05). Young, Jonathan Hubert; Marcotte, Edward M.; Gonzalez, Oscar; Dhillon, Inderjit; Elber, Ron; Wilke, Claus.
We present new modes of computational drug discovery in each of three key themes: target identification, mechanism, and therapy regimen design. In identifying candidate targets for therapeutic intervention, we develop novel applications of unsupervised clustering of whole-genome RNAi screening to prioritize biological systems whose inhibition differentially sensitizes diseased cells apart from a normal population. When applied to lung cancer, our approach identified protein complexes on which various tumor subtypes are especially dependent. Consequently, each complex represents a candidate drug target specifically intended for a particular patient sub-population. The cellular functions impacted by the protein complexes include splicing, translation, and protein folding. We obtained experimental validation for the predicted sensitivity of a lung adenocarcinoma cell line to Wnt inhibition. For our second theme, we focus on genetic interactions as a mechanism underlying sensitivity to target inhibition. Experimental characterization of such interactions has relied on brute-force assessment of gene pairs. To alleviate the experimental burden, our hypothesis is that functionally related genes tend to share common genetic interaction partners. We therefore examine a method that recognizes functional network clusters to generate high-confidence predictions of different types of genetic interactions across yeast, fly, and human. Our predictions are leave-one-out cross-validated on known interactions. Moreover, using yeast as a model, we simulate the degree to which further human genetic interactions need to be screened in order to understand their distribution in biological systems. Finally, we employ yeast as a model organism to assess the feasibility of designing synergistic or antagonistic drug pairs based on genetic interactions between their targets. The hypothesis is that if the target genes of one chemical compound are close to those of a second compound in a genetic interaction network, then synergistic or antagonistic growth effects will occur. Proximity between sets in a gene network is quantified through graph metrics, and predictions of synergy and antagonism are validated against literature-curated gold standards. Ultimately, integrating knowledge of druggable targets, how gene perturbations interact with the genetic background of an individual, and the design of personalized therapeutic regimens will provide a general framework to treat a variety of diseases.

Item: Data driven analysis of fast oxide ion diffusion in solid oxide fuel cell cathodes (2015-08). Miller, Alexander Scot; Benedek, Nicole; Yu, Guihua.
The goal of this study was to determine whether atomic-scale features (related to composition and crystal structure) of perovskite and perovskite-related materials could be used to predict fast oxide ion diffusion for Solid Oxide Fuel Cell (SOFC) applications; materials that can be used as SOFC cathodes were a particular focus. One hundred twenty-six pairs of diffusion (D*) and surface exchange (k*) coefficients for a variety of materials were collected from literature sources published between 1991 and 2015, and a website was created to make these data publicly viewable. Statistical tests revealed that diffusion measurements differ significantly at 400 K, 700 K, and 1000 K when grouped according to material family and sample type. Models predicting diffusion rates were created from atomic-scale features at several temperatures between 400 K and 1000 K. Perovskite and double-perovskite models explained more than 85% of the variance in ln(D*k*) at 800 K-1000 K, and 55%-75% of the variance at lower temperatures (400 K-700 K). Materials whose B-site cations had the highest electron affinities showed the fastest diffusion at all temperatures. Thus, these models suggest that using B-site cations with high electron affinities (i.e., atoms that are easily reduced) may increase fuel cell performance, even at low and intermediate temperatures.
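The modeling step described above amounts to regressing a diffusion measure on atomic-scale descriptors and reporting variance explained. Below is a minimal sketch under synthetic data, with placeholder features standing in for descriptors such as B-site cation electron affinity; none of the numbers are the study's.

```python
# Sketch: regressing a toy ln(D*) response on atomic-scale features and
# reporting cross-validated R^2 (variance explained). Feature values are
# synthetic placeholders for descriptors like B-site electron affinity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 126   # the study collected 126 (D*, k*) pairs from the literature
X = rng.normal(size=(n, 4))                        # hypothetical descriptors
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)  # toy ln(D*) response

reg = LinearRegression()
r2 = cross_val_score(reg, X, y, cv=5, scoring="r2")
print("cross-validated R^2 (variance explained):", r2.mean())
```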
Item: Data mining techniques for classifying RNA folding structures (2016-08). Kim, Vince; Garg, Vijay K. (Vijay Kumar), 1963-; Gutell, Robin R.
RNA is a crucial biological molecule that is critical for protein synthesis. Significant research has been done on folding algorithms for RNA, in particular the 16S rRNA of bacteria and archaea. Rather than modifying current work on these folding algorithms, this report undertakes pioneering work in data mining the same 16S rRNA. Initial work was based on a single complex helix across seven organisms; however, classification analysis proved inaccurate due to severe multicollinearity in the data set. A second data mining analysis was done on the entire RNA sequence of the same seven organisms and was successfully used to sequentially predict the categorical characteristics of a given nucleotide in the RNA sequence.

Item: Data-mining the Ubuntu Linux Distribution for bug analysis and resolution (2012-08). Arges, Christopher John; Stewart, Kate; Ghosh, Joydeep.
The Ubuntu Linux Distribution represents a massive investment of time and human effort to produce a reliable computing experience for users. To accomplish these goals, software bugs must be tracked and fixed. However, as the number of users increases and bug reports grow, advanced tools such as data mining must be used to increase the effectiveness of all contributors to the project. This report therefore involved collecting a large number of bug reports into a database and calculating relevant statistics. Because of the diversity and quantity of bug reports, contributors must find which bugs are most relevant and important to work on. One study in this report created an automatic way to determine, using classification techniques, who is best suited to solve a particular bug. In addition, this report explores how to classify at submission time whether a bug report will eventually be marked invalid.
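A small sketch of text-based bug triage in the spirit of the report's classification study; the miniature corpus and team names below are invented for illustration.

```python
# Sketch: routing a bug report to a likely assignee from its text.
# The tiny corpus and assignee names are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "kernel panic on boot after update",
    "sound card not detected on login",
    "kernel oops in network driver",
    "no audio output from speakers",
]
assignees = ["kernel-team", "audio-team", "kernel-team", "audio-team"]

# TF-IDF features feed a naive Bayes classifier; real triage would train
# on thousands of historical reports and their eventual assignees.
triage = make_pipeline(TfidfVectorizer(), MultinomialNB())
triage.fit(reports, assignees)

print(triage.predict(["audio crackles after suspend"]))  # likely 'audio-team'
```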
Item: Dataflow parallelism for large scale data mining (2010-08). Daruru, Srivatsava; Ghosh, Joydeep; Marin, Nena.
The unprecedented and exponential growth of data, along with the advent of multi-core processors, has triggered a massive paradigm shift from traditional single-threaded programming to parallel programming. A number of parallel programming paradigms have thus been proposed and have become pervasive and inseparable from any large production environment. With massive amounts of data available, and an ever-increasing business need to process and analyze this data quickly at minimum cost, there is also much more demand for implementing fast data mining algorithms on cheap hardware. This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through a graph of operators. This paradigm exhibits pipeline, horizontal, and vertical parallelism, and requires only the data of the active operators in memory at any given time, allowing it to scale easily to very large datasets. The thesis describes the dataflow implementation of two data mining applications on huge datasets. We first develop an efficient dataflow implementation of a Collaborative Filtering (CF) algorithm based on weighted co-clustering and test its effectiveness on the large, sparse Netflix data. This implementation of the recommender system was able to rapidly train and predict over 100 million ratings within 17 minutes on a commodity multi-core machine. We then describe a dataflow implementation of a non-parametric, density-based clustering algorithm called Auto-HDS to automatically detect small and dense clusters in a massive astronomy dataset. This implementation was able to discover dense clusters at varying density thresholds and generate a compact cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability to scale to millions of points as the number of available resources increases. Our experimental results illustrate the ability of this model to "scale" well to massive datasets and to rapidly discover useful patterns in two different applications.
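The operator-graph model the abstract describes can be illustrated with ordinary Python generators, which stream one record at a time so that only active operators hold data in memory; this single-threaded sketch omits the pipeline and partition parallelism of a real dataflow engine.

```python
# Sketch of the dataflow idea: computation organized as a chain of operators
# through which records stream one at a time. A real engine would run the
# operators concurrently; generators only illustrate the streaming model.
def read_ratings(rows):
    # Source operator: emit (user, item, rating) records.
    for row in rows:
        yield row

def filter_high(records, threshold=4):
    # Transform operator: keep only high ratings.
    for user, item, rating in records:
        if rating >= threshold:
            yield user, item, rating

def count_by_user(records):
    # Sink operator: aggregate as records stream past.
    counts = {}
    for user, _item, _rating in records:
        counts[user] = counts.get(user, 0) + 1
    return counts

rows = [("u1", "m1", 5), ("u2", "m1", 3), ("u1", "m2", 4), ("u3", "m3", 2)]
print(count_by_user(filter_high(read_ratings(rows))))   # {'u1': 2}
```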
Item: Delivery data mining as e-services in the world wide web (Texas Tech University, 2005-05). Chen, Nianen; Hernandez, Hector J.; Mengel, Susan A.
The field of knowledge discovery and data mining emerged in the recent past as a result of the dramatic evolution of technology for information storage, access, and analysis. Distributed data mining (DDM) is a result of further evolution of data mining technology, and it embraces the growing trend of merging computation with communication. At the same time, the Internet and the World Wide Web (WWW) have provided a platform where organizations conduct commercial transactions. This transformation led to the onset of electronic commerce (e-commerce), a new medium that brought with it the opportunity and ability to do business in a global marketplace. Recently, rapid strides in e-commerce research and development have been triggering the emergence of the next evolutionary phase, namely the e-service. This model envisages the Internet and WWW evolving from a global arena for selling goods into a virtual marketplace of services, particularly one where businesses and organizations conduct their transactions via the Internet. Today, the goods that can be transacted in an e-services-modeled (B2B) virtual marketplace are not restricted to physical items such as electronics, furniture, or tickets; they can also be resources such as software, computational capacity, or useful datasets, which can be sold or rented to clients as e-services. Data mining, in conjunction with other business intelligence applications, is emerging as intuitively suitable for delivery as such an e-service, mainly because many small to medium businesses are constrained by the high cost of setting up and maintaining the infrastructure of support technologies and software required for business intelligence. To deliver data mining efficiently and effectively as an e-service on the World Wide Web, Web service technologies are introduced to provide a layer of abstraction above existing software systems. Unlike existing distributed computing systems, Web services are adapted to the Web; the default network protocol is HTTP. With Web services, the communication protocol among e-service agents and service providers is already in place: the World Wide Web. Web services work at a level of abstraction that is capable of bridging any operating system, hardware platform, or programming language, just as the Web is. The data mining e-service model requires interactions between clients and e-service agents, as well as between e-service agents and e-service providers, and these interactions need to be implemented in a reliable, stable, and scalable way. Java 2 Enterprise Edition (J2EE) provides APIs and design patterns for designing sound architectures and building quality web-based applications, which makes Internet-based data mining e-services feasible. This thesis investigates the delivery of Internet-based data mining e-services and proposes an architecture supported by Web services and J2EE technologies to address the specific infrastructure requirements imposed on data mining systems by the e-services domain.
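The delivery model above, mining functionality exposed behind a web endpoint, can be sketched with Python's standard library; the thesis itself targets SOAP-style Web services on J2EE, so this HTTP/JSON server is only an illustrative stand-in, and the endpoint and payload shape are our assumptions.

```python
# Sketch: exposing a data mining routine as a web service endpoint. The
# thesis builds on SOAP-era Web services and J2EE; this minimal HTTP/JSON
# server only illustrates the delivery model, not the proposed architecture.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def mine(values):
    # Placeholder "mining" routine: return the mean of submitted numbers.
    return {"mean": sum(values) / len(values)}

class MiningService(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        result = json.dumps(mine(payload["values"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(result)

if __name__ == "__main__":
    # POST {"values": [1, 2, 3]} to http://localhost:8000/ to invoke it.
    HTTPServer(("localhost", 8000), MiningService).serve_forever()
```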
Item: Distributed learning using generative models (2006). Merugu, Srujana; Ghosh, Joydeep.

Item: Essays on inflation forecast based rules, robust policies and sovereign debt (2004). Rodriguez, Arnulfo; Kendrick, David A.
The success of inflation reduction in industrial countries, along with the adoption of inflation targeting regimes by many central banks, has prompted considerable interest in "feedback rules" for inflation targeting. Over the past few years, much research has been devoted to assessing the performance of these rules. The first essay derives the optimal responsiveness of inflation forecast based (IFB) rules to inflation and/or output shocks, so as to lead the inflation and output state variables back to their equilibrium values. The resulting system of equations is bilinear and becomes the constraint for the quadratic criterion function used in control theory problems. There has been recent interest in the use of robust control techniques for economic policy. Analyzing a control variable's response as the degree of robustness rises is important for advancing our understanding of how robust control methods apply to economic models. The second and third essays provide an analytical framework, derived from a one-state, one-control robust control problem, for understanding the relationship between the control variable and unstructured model uncertainty. Seeking a robust policy rule across a variety of structural macroeconomic models is an important exercise for determining whether an IFB rule would meet some performance criteria in the face of model uncertainty. Robust performance means finding a rule that achieves similar, if not equal, performance across different models; however, before a rule can be robust performance-wise, it must be robust stability-wise. The fourth essay provides a way of searching for IFB rules that are robust across two different structural macroeconomic models. First, we find the IFB rules that achieve robust stability by using the root-locus method. Second, we find a subset of such rules that achieve similar performance across both models, i.e., robust rules. Finally, the fifth essay uses and compares data mining techniques to understand and predict sovereign debt rescheduling. Receiver operating characteristic (ROC) curves are used to measure the discrimination power of the models, and the issue of interpretability of models for sovereign debt rescheduling is addressed.
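To fix ideas, here is a stylized simulation of an IFB feedback rule in a toy two-equation economy; the dynamics, coefficients, and shock sizes are invented for illustration and are not the dissertation's estimated system.

```python
# Stylized inflation forecast based (IFB) rule: the policy rate reacts to a
# one-step-ahead inflation forecast rather than to current inflation alone.
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
a, b, c = 0.8, 0.1, 0.3   # toy persistence and transmission parameters
phi = 1.5                 # rule's response to the inflation forecast

pi, y = 2.0, 0.0          # initial inflation and output gap
for t in range(20):
    pi_forecast = a * pi + b * y              # one-step-ahead forecast
    r = phi * pi_forecast                     # IFB rule sets the rate
    pi = a * pi + b * y + rng.normal(0, 0.1)  # inflation dynamics
    y = 0.7 * y - c * r + rng.normal(0, 0.1)  # output responds to the rate
    print(f"t={t:2d}  inflation={pi:5.2f}  output gap={y:5.2f}")
```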
Item: Eukaryotic transcriptional regulation: from data mining to transcriptional profiling (2008-12). Morgan, Xochitl Chamorro; Iyer, Vishwanath R.
Survival of cells and organisms requires that each of thousands of genes is expressed at the correct time in development, in the correct tissue, and under the correct conditions. Transcription is the primary point of gene regulation. Genes are activated and repressed by transcription factors, proteins that become active through signaling, bind (sometimes cooperatively) to regulatory regions of DNA, and interact with other proteins such as chromatin remodelers. Yeast has nearly six thousand genes, several hundred of which are transcription factors; transcription factors comprise around 2,000 of the 22,000 genes in the human genome. When and how these transcription factors are activated, and which subsets of genes they regulate, is an active area of research essential to understanding the transcriptional regulatory programs of organisms. We approached this problem in two divergent ways: first, an in silico study of human transcription factor combinations, and second, an experimental study of the transcriptional response of yeast mutants deficient in DNA repair. First, to better understand the combinatorial nature of transcription factor binding, we developed a data mining approach to assess whether transcription factors whose binding motifs are frequently proximal in the human genome are more likely to interact. We found many instances in the literature in which over-represented transcription factor pairs co-regulated the same gene, so we used co-citation to assess the utility of this method on a larger scale. We determined that over-represented pairs were more likely to be co-cited than would be expected by chance. Because proper repair of DNA is an essential and highly conserved process in all eukaryotes, we next used cDNA microarrays to measure differentially expressed genes in eighteen yeast deletion strains with sensitivity to the DNA cross-linking agent methyl methane sulfonate (MMS); many of these mutants were transcription factors or DNA-binding proteins. Combining these data with tools such as chromatin immunoprecipitation, gene ontology analysis, expression profile similarity, and motif analysis allowed us to propose a model for the roles of Iki3 and of YML081W, a poorly characterized gene, in DNA repair.

Item: An exploratory study of teacher retention using data mining (2014-05). Krause, Gladys Helena; Marshall, Jill Ann; Carmona Domínguez, Guadalupe de la Paz.
The object of this investigation is to report a study of mathematics teacher retention in the Texas Education System by generating a model that identifies crucial factors associated with teachers remaining in the profession. This study answers the research question: given a new mathematics teacher with little or no service in the Texas Education System, how long might one expect her to remain in the system? The basic categories used in this study to describe teacher retention are: long term (10 or more years of service), medium term (5 to 9 years of service), and short term (1 to 4 years of service). The research question is addressed by generating a model, through data mining techniques and using teacher data and variables from the Texas Public Education Information Management System (PEIMS), that allows a descriptive identification of the factors crucial to teacher retention. Research on mathematics teacher turnover in Texas has not yet focused on teacher characteristics. The literature review presented in this investigation shows that teacher characteristics are important in studying factors that may influence teachers' decisions to stay or to leave the system. This study presents the field of education, and the state of Texas, with an opportunity to isolate the crucial factors that keep mathematics teachers from leaving the teaching profession, which has the potential to inform policy makers and other educators when making decisions that could have an impact on teacher retention. The methodology applied, data mining, also allows this study to take full advantage of a collection of valuable resources provided by the Texas Education Agency (TEA) through PEIMS, which had not yet been used to study the phenomenon of teacher retention.
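A hedged sketch of the classification task just described: a decision tree assigning teachers to the study's short/medium/long retention categories. The feature names and synthetic values are placeholders for the PEIMS variables actually used.

```python
# Sketch: classifying teachers into retention categories with a decision
# tree. Features and labels are synthetic stand-ins for PEIMS variables.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
n = 500
# Hypothetical features: age at entry, alternative certification (0/1),
# starting salary in thousands of dollars.
X = np.column_stack([
    rng.integers(22, 60, n),
    rng.integers(0, 2, n),
    rng.normal(45, 5, n),
])
y = rng.choice(["short", "medium", "long"], size=n)  # 1-4, 5-9, 10+ years

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# A shallow tree keeps the model descriptive: each split names a factor.
print(export_text(tree, feature_names=["entry_age", "alt_cert", "salary"]))
```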
Item: The fluviageny, a method for analyzing temporal river fragmentation using phylogenetics (2015-05). Gordon, Andrew Lloyd; Howison, James; Arctur, David K.
Phylogenetic trees have historically been used to determine evolutionary relatedness between organisms. In the past few decades, as increasingly powerful computational algorithms and toolsets for phylogenetic analysis have been developed, the use of these trees has expanded into other areas, including biodiversity informatics and geoinformatics. This report proposes using phylogenetic methods to create "fluviagenies": trees that represent the effects of river fragmentation over time caused by damming. Faculty at the Center for Research in Water Resources at the University of Texas worked to develop tools and documentation for automating the creation of river segment codes (a.k.a. "fluvcodes") based on spatiotemporal data. Python was used to generate fluviageny trees from lists of these codes, and the resulting trees can be exported into the appropriate data format for use with various phylogenetics programs. The Fishes of Texas Database (fishesoftexas.org), a comprehensive geospatial database of Texas fish occurrences aggregated and normalized from 42 museum collections around the world, was employed to create an example of how this tool might be used to analyze and hypothesize changes in fish populations as a consequence of river fragmentation. Additionally, this paper theorizes about and analyzes past and potential future uses of phylogenetic trees in various other fields of informatics.

Item: Genetic algorithms with functional mutation and mating operators in time series data mining (Texas Tech University, 2004-08). Huang, Jianyong.
Recently, genetic algorithms (GAs) and artificial neural networks (ANNs) have been widely used in time series data mining (TSDM). Both GAs and ANNs are inspired by natural processes. A GA can be used to find optimized parameters for a given model, while an ANN can approximate unknown functions to any desired degree of accuracy without knowing the model. There are limitations to using GAs or ANNs individually in TSDM. For example, ANNs generally use backpropagation learning algorithms, which are based on the steepest descent algorithm; a solution from the ANN is therefore usually only a local optimum. The purpose of this thesis is to develop innovative algorithms that overcome the limitations of using GAs or ANNs alone in TSDM. The first part of this research involves designing a new genetic algorithm (called mGA), which can analyze not only polynomial but also non-polynomial time series. The mGA automatically searches for a polynomial function of minimal degree for a non-polynomial time series. The rest of this research focuses on developing a neural network based genetic algorithm (called nGANN). The nGANN represents a chromosome as a neural network and uses genetic operators to select a global solution for a time series. The nGANN introduces a new mating scheme (called NN_mate), which uses a backpropagation learning network to produce offspring; NN_mate can therefore mate two parents with different models. The solution found by the nGANN has two attractive features: a network with a small number of hidden neurons, and a small mean squared error. From the solution network, it is possible to discover relationships among different variables. Three different types of time series data are used to evaluate the performance of these algorithms. The two algorithms work well for one-variable polynomial and one-variable non-polynomial time series data; for two or more variables, they do not produce very good results. The last part of this thesis discusses future work.

Item: Integrating top-down and bottom-up approaches in inductive logic programming: applications in natural language processing and relational data mining (2003). Tang, Lap Poon Rupert; Mooney, Raymond J. (Raymond Joseph).

Item: Matrix nearness problems in data mining (2007). Sra, Suvrit, 1976-; Dhillon, Inderjit S.
This thesis addresses some fundamental problems in data mining and machine learning that may be cast as matrix nearness problems. Some examples of well-known nearness problems are: low-rank approximation, sparse approximation, clustering, co-clustering, kernel learning, and independent components analysis. In this thesis we study two types of matrix nearness problems. In the first type, we compute a low-rank matrix approximation to a given input matrix, thereby representing it more efficiently and hopefully discovering the latent structure within the input data. In the second type, we seek either to learn a parameterized model of or from the input data, or, when the data represent noisy measurements of some underlying objects, to recover the original measurements. Both types of problems can be naturally approached by computing an output model/matrix that is "near" the input. The specific nearness problems studied in this thesis are: (i) nonnegative matrix approximation (NNMA), (ii) incremental low-rank matrix approximations, (iii) general low-rank matrix approximations via convex optimization, (iv) learning a parametric mixture model for data, specifically for directional data, and (v) metric nearness. NNMA is a recent, powerful matrix decomposition technique that approximates a nonnegative input matrix by a low-rank approximation composed of nonnegative factors. It has found wide applicability across a broad spectrum of fields, ranging from problems in text analysis, image processing, and gene microarray analysis to music transcription. We develop several new generalizations of the NNMA problem and derive efficient iterative algorithms for computing the associated approximations; we also provide efficient software implementing many of the derived algorithms. With growing input matrix sizes, low-rank approximation techniques can themselves become computationally expensive. For such situations, and to aid model selection (the rank of the approximation), we develop incremental versions of low-rank matrix approximation, in which the approximation is obtained one rank at a time. There are several applications of such a scheme, for example, topic discovery from a collection of documents. We also develop methods based on large-scale convex optimization for computing low-rank approximations to the input data. Our approach can deal with large-scale data while permitting the incorporation of constraints more general than nonnegativity, if desired. It also has some beneficial byproducts: it yields new methods for solving the nonnegative least squares problem, as well as l1-norm regression. The next nearness problem that we examine is learning a parametric probabilistic mixture model for the data. Here one estimates a parameter matrix given the input data, where the estimation process is implicitly regularized to avoid over-fitting. In particular, we solve the parameter estimation problem for two fundamental high-dimensional directional distributions, namely the von Mises-Fisher and Watson distributions. Parameter estimation for these distributions is highly non-trivial, and we present efficient methods for it. The final nearness problem we study is a more typical matrix nearness problem called metric nearness. The goal here is to find a distance matrix (i.e., a matrix whose entries satisfy the triangle inequality) that is "nearest" to an input matrix of dissimilarity values. For most of the algorithms developed in this thesis, we also provide software implementations, which will be of use to both researchers and practitioners who want to experiment with our algorithms.
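As a concrete instance of the NNMA technique discussed in the last abstract, here is a sketch of the classic Lee-Seung multiplicative updates for minimizing ||A - WH||_F^2 with nonnegative factors; this is the textbook baseline algorithm, not the generalized variants the thesis derives.

```python
# Sketch: Lee-Seung multiplicative updates for nonnegative matrix
# approximation, minimizing ||A - W H||_F^2. Nonnegativity is preserved
# because every update multiplies by a ratio of nonnegative quantities.
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((30, 20))   # nonnegative input matrix
r = 4                      # target rank of the approximation
W = rng.random((30, r))
H = rng.random((r, 20))
eps = 1e-9                 # guard against division by zero

for _ in range(500):
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print("relative approximation error:", err)
```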