Browsing by Subject "Classification"

Break down the walls : how the “folder effect” influences the transfer of learning
(2011-05) He, Jingjie; Svinicki, Marilla D., 1946-; Markman, Arthur
Categorizing knowledge into different disciplines and units may block knowledge within separate “folders”, which could limit its later retrieval and transfer to new contexts. To test this hypothesis, two experiments were conducted. In one experiment, participants memorized a list of words with or without cues indicating which category the words belonged to. One week later, they were asked to recall all the positive adjectives, which required them to retrieve words from different categories. In the other experiment, participants read exactly the same story, either embedded in one of two different subject domains or presented with no context. A survey report was then presented to test whether participants in different contexts would show different transfer effects. The current study replicated previous findings that successful transfer is hard to observe in laboratory settings without explicit prompts. The memory test and transfer task in this study were too difficult and resulted in poor participant performance. The initial hypothesis was neither supported nor rejected. To test the hypothesis, future studies could reduce the time interval between study and test and modify the transfer task to lower the difficulty of the experiment.
Characterization of aggregate shape properties using a computer automated system
(Texas A&M University, 2005-02-17) Al Rousan, Taleb Mustafa
Shape, texture, and angularity are among the properties of aggregates that have a significant effect on the performance of hot-mix asphalt, hydraulic cement concrete, and unbound base and subbase layers. Consequently, there is a need to develop methods that can quantify aggregate shape properties rapidly and accurately. In this study, an improved version of the Aggregate Imaging System (AIMS) was developed to measure the shape characteristics of both fine and coarse aggregates. Improvements were made in the design of the hardware and software components of AIMS to enhance its operational characteristics, reduce human errors, and enhance the automation of the test procedure. AIMS was compared against other test methods that have been used for measuring aggregate shape characteristics. The comparison was based on statistical analysis of the accuracy, repeatability, reproducibility, cost, and operational characteristics (e.g. ease of use and interpretation of the results) of these tests. Aggregates that represent a wide range of geographic locations, rock types, and shape characteristics were used in this evaluation. The comparative analysis among the different test methods was conducted using the Analytical Hierarchy Process (AHP). AHP is a process of developing a numerical score to rank test methods based on how well each method meets certain criteria of desirable characteristics. The outcomes of the AHP analysis clearly demonstrated the advantages of AIMS over other test methods as a unified system for measuring the shape characteristics of both fine and coarse aggregates. A new aggregate classification methodology based on the distribution of their shape characteristics was developed in this study. This methodology offers several advantages over current methods used in practice. It is based on the distribution of shape characteristics rather than average indices of these characteristics. 
The coarse aggregate form is determined based on three-dimensional analysis of particles. The fundamental gradient and wavelet methods are used to quantify angularity and surface texture, respectively. The classification methodology can be used for the development of aggregate shape specifications.
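The AHP scoring step mentioned in this abstract can be illustrated with a minimal sketch. Everything numeric below is hypothetical (the criteria comparisons, the two placeholder methods, and their scores are not taken from the thesis), and the row geometric-mean method is used here as a common approximation of AHP priority weights; the thesis may have used a different procedure.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical pairwise comparisons among three criteria
# (accuracy, repeatability, cost). Entry [i][j] states how much more
# important criterion i is than criterion j on a 1-9 scale.
criteria = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(criteria)

# Hypothetical per-criterion scores (0-1) for two placeholder test methods.
method_scores = {
    "method_A": [0.9, 0.8, 0.4],
    "method_B": [0.6, 0.7, 0.9],
}

# Rank methods by their weighted score, best first.
ranking = sorted(
    method_scores,
    key=lambda m: sum(w * s for w, s in zip(weights, method_scores[m])),
    reverse=True,
)
```

The weights sum to one, so each method's final score is a weighted average of how well it meets each criterion.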
Classification of internet memes
(2015-12) Kolawole, Olamide Temitayo; Barber, Suzanne; Grauman, Kristen
This paper explores a system that could be used to classify internet memes by certain characteristics. The anatomy of these viral images is explored to find the best indicators for classifying an internet meme. Although more than one indicator was found, the paper focuses on using image data to perform the classification. Further research is done to determine which type of feature descriptor to use, based on the past successes of other projects. A dataset is scraped from a popular repository of memes on the internet and its features are extracted. The features are passed into an SVM classifier to derive a unique listing of potential labels that an image could have. Although training times remained very reasonable as the number of classes increased, accuracy degraded as more classes were trained on the same model.
Classification rule induction with an ant colony optimization algorithm
(Texas Tech University, 2004-08) Xie, Xuepeng
Ant colony optimization is a meta-heuristic approach inspired by the behavior of natural ants. It has been applied to solve a variety of combinatorial optimization problems because of its advantages in cooperation and adaptation. Applied to classification rule induction, an ant colony optimization system may be able to perform a flexible, robust search for a set of high-quality classification rules. In this thesis, a new ant colony optimization system called Ant-Rule is proposed to learn a set of unordered classification rules from a training data set. Ant-Rule implements three different heuristic functions and two different fitness functions. The roles played by the heuristic function and the fitness function in rule induction with Ant-Rule are investigated. Experiments show that applying the Laplace estimate error function for both the heuristic function and the fitness function produces the best predictive accuracy for most of the data sets studied in this thesis. The performance of Ant-Rule is also compared to Ant-Miner, the first ant colony optimization algorithm for classification rule induction, and CN2, a well-known rule induction algorithm. Results show that Ant-Rule achieves the same or better performance in classification rule induction than both CN2 and Ant-Miner on the data sets tested in this thesis, which provides evidence that ant colony optimization is a viable approach to the classification rule induction problem.
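As a rough illustration of the Laplace estimate mentioned above, the sketch below uses the form commonly seen in CN2-style rule induction, (correct + 1) / (covered + number of classes). Whether Ant-Rule uses exactly this form is an assumption, and the rule counts are hypothetical.

```python
def laplace_accuracy(n_correct, n_covered, n_classes):
    """Laplace-corrected accuracy of a rule: (n_correct + 1) / (n_covered + n_classes)."""
    return (n_correct + 1) / (n_covered + n_classes)

def laplace_error(n_correct, n_covered, n_classes):
    """Laplace error estimate: one minus the Laplace-corrected accuracy."""
    return 1.0 - laplace_accuracy(n_correct, n_covered, n_classes)

# Two hypothetical rules in a two-class problem. Both classify every
# covered case correctly, but one covers far more cases than the other.
specific_rule = laplace_error(n_correct=2, n_covered=2, n_classes=2)
general_rule = laplace_error(n_correct=50, n_covered=50, n_classes=2)
```

The correction penalizes rules that cover very few cases, so the rule with broad coverage receives the lower error estimate even though both are perfectly accurate on the training cases they cover.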
Classifying learning management platforms by examining features and educational affordances
(2011-08) Sung, Woon Hee; Liu, Min, Ed. D.; Veletsianos, George
Learning management systems (LMSs) have become one of the most common computer systems adopted at universities, colleges and distance learning organizations. In order to identify the different features and affordances of each LMS, LMS features were compared across four categories: communication tools, productivity and student involvement tools, course delivery tools, and administration tools. Based upon the comparison of the different features affecting different usage patterns, this paper proposes a classification of seven selected LMSs: ANGEL, Blackboard, Moodle, Sakai, WebCT, Ning and Elgg. These seven LMSs are classified into three groups according to the systems’ pedagogical adaptability and technological usability. The classification seeks to understand the possibilities and limitations of what these groups of LMSs can accomplish and is used to suggest suitable usage in order to support teaching and learning. The proposed classification implies the need for future exploratory case studies analyzing teaching and learning practices according to the classification.
Combining classifier and cluster ensembles for semi-supervised and transfer learning
(2012-05) Acharya, Ayan; Ghosh, Joydeep; Mooney, Raymond J.
Unsupervised models can provide supplementary soft constraints to help classify new, "target" data since similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This contribution describes two general frameworks that take as input class membership estimates from existing classifiers learnt on previously encountered "source" data, as well as a set of cluster labels from a cluster ensemble operating solely on the target data to be classified, and yield a consensus labeling of the target data. One of the proposed frameworks admits a wide range of loss functions and classification/clustering methods and exploits properties of Bregman divergences in conjunction with Legendre duality to yield a principled and scalable approach. The other approach is built on probabilistic mixture models and provides additional flexibility of distributed computation that is useful when the target data cannot be gathered in a single place for privacy or security concerns. A variety of experiments show that the proposed frameworks can yield results substantially superior to those provided by popular transductive learning techniques or by naively applying classifiers learnt on the original task to the target data.
Criteria Combinations in the Personality Disorders: Challenges Associated with a Polythetic Diagnostic System
(2011-08-08) Cooper, Luke D.
Converging research on the diagnostic criteria for personality disorders (PDs) reveals that most criteria have different psychometric properties. This finding is inconsistent with the DSM-IV-TR PD diagnostic system, which weights each criterion equally. The purpose of the current study was to examine the potential effects of using equal weights for differentially-functioning criteria. Using data from over 2,100 outpatients, response patterns to the diagnostic criteria for nine PDs were analyzed and scored within an item response theory (IRT) framework. Results indicated that combinations that included the same number of endorsed criteria (the same "raw score") yielded differing estimates of PD traits, depending on which criteria were met. Moreover, trait estimates from subthreshold criteria combinations often overlapped with diagnostic combinations (i.e., at threshold or higher), indicating that there were subthreshold combinations of criteria that indicated as much or more PD traits than some combinations at the diagnostic threshold. These results suggest that counting the number of criteria an individual meets provides only a coarse estimation of their PD trait level. Suggestions for the improved measurement of polythetically-defined mental disorders are discussed.
Exploring Sequence-Structure-Function Relationships in Proteins Using Classification Schemes
(2005-12-19) Cheek, Sara Anne; Grishin, Nick V.
With the rapid growth in the number of available protein sequences and structures, the necessity of interpreting this data in comprehensive and meaningful ways becomes increasingly apparent. Identifying and categorizing the functional, structural, and evolutionary relationships between proteins is a key step in understanding protein evolution. Protein classification is a useful means of organizing biological data for the purpose of exploring these sequence-structure-function relationships in proteins. In this work, two-tier classification schemes are constructed for the organization of large protein classes. One level of this hierarchy reflects structural similarity ("fold groups"), while the second level indicates an evolutionary relationship between members ("families"). Kinases are a ubiquitous group of enzymes that participate in a variety of cellular pathways. Although all kinases catalyze similar phosphoryl transfer reactions, they display remarkable diversity in structural fold and substrate specificity. All available kinase sequences and structures have been classified into fold groups and families. This classification presents the first comprehensive structural annotation of a large functional class of proteins. The question of how different structural folds accomplish the same fundamental elements of the kinase reaction is investigated. Disulfide-rich domains are small protein domains whose global folds are stabilized predominantly by disulfide bonds. In order to understand the structural and functional diversity among available disulfide-rich proteins, a comprehensive classification of these domains has been performed. The resulting fold groups and families describe more distant structural and evolutionary relationships than previously acknowledged among disulfide-rich domains. Variations in disulfide bonding patterns of these domains are also evaluated. 
Several existing classification databases have been developed for the purpose of cataloguing all available protein structures. Because such databases are often manually curated, recently solved structures are not included and useful information regarding their relatedness to other proteins is not immediately available. To address this limitation, an algorithm has been developed to make classification assignments with evolutionary relevance for domains in newly solved structures, with the objective of reliably reproducing assignments to an existing classification scheme in an automatic manner.
A field guide to observable phenomena : a tool for aesthetic practice
(2004) Bash, Katherine E.; Taylor, Chris, 1965-
A Field Guide to Observable Phenomena is the result of various observational investigations of natural phenomena that I have performed during the last two years. In this guide, I write a commentary on the role of naming and active perception, suggest tools for observation, and give examples of named and identified phenomena. Structurally, I am putting forward a mode of classification that can hold both current and future findings. The guide is considered to be an open work as it also lacks a formal conclusion. Operationally, it puts forward questions and rather than answering them directly, it relies upon the reader to participate actively with the text in order that the answers be revealed.
Investigation for Genetic Determinants of Flexion Contractures and Contracted Foal Syndrome in Neonatal Thoroughbred Foals
(2014-09-04) Caldwell, Jana Denise
Musculoskeletal disorders are one of the leading causes of mortality in neonatal Thoroughbred foals. Contracted Foal Syndrome (CFS) has accounted for up to 48% of such disorders in foals submitted for necropsy according to the Kentucky Livestock Diagnostic Center and is reportedly a concern to clinicians and breeders. CFS is primarily characterized by limb contractures and other malformations of the appendicular and/or axial skeleton. Foals are often euthanized in severe cases and successful rehabilitation in moderate cases does not entirely negate secondary complications. Because of the economic implications associated with treatment costs, owners may opt to euthanize foals even though they potentially could have led productive lives. A familial predisposition was observed in some cases. In addition, veterinarians reported increased incidence of contracted foals in one particular sire line. This, coupled with model genetic disorders in other species, prompted us to conduct the first molecular genetics study on congenital flexion and CFS. The inconsistent nature of clinical documentation and variable phenotypes pose a challenge to researchers investigating such complex conditions. We therefore conducted a detailed analysis of the phenotypes and used the data to propose a preliminary classification system that could be used by clinicians and researchers. The implementation of such a classification system will reduce ambiguity of clinical documentation and provide the basis for future study designs. Our hypothesis states that, in some cases, flexion contractures and CFS are major gene disorders with the likelihood of genetic heterogeneity. Our first approach was to sequence the candidate gene, tropomyosin beta 2. This gene encodes a component of the skeletal muscle contractile apparatus and has been implicated in congenital distal limb contractures in humans. 
Next, we utilized the newly available Equine SNP50 Beadchip for a case/control population-based genome-wide association mapping approach, followed by a family validation study and a family-based genome-wide association study. These approaches resulted in the identification of associations between various subtypes of contracted foals and at least 3 disease susceptibility loci. In summary, this study provides insight into the genetics underlying flexion contractures and CFS in the neonatal foal and has provided the first evidence for a genetic cause. Furthermore, it provides a solid foundation for future research targeting candidate genes for resequencing.
Large-scale non-linear prediction with applications
(2016-08) Si, Si, Ph.D.; Dhillon, Inderjit S.; Grauman, Kristen; Keerthi, Selvaraj Sathiya; Mooney, Raymond
With an immense growth in data, there is a great need for training and testing machine learning models on very large data sets. Several standard non-linear algorithms based on either kernels (e.g., kernel support vector machines and kernel ridge regression) or decision trees (e.g., gradient boosted decision trees) often yield superior predictive performance on various machine learning tasks compared to linear methods; however, they suffer from severe computation and memory challenges when scaling to millions of data instances. To overcome these challenges, we develop a family of scalable kernel-approximation-based and decision-tree-based algorithms to reduce the computational cost of non-linear methods in terms of training time, prediction time and memory usage. We further show their superior performance on a wide range of machine learning tasks including large-scale classification, regression, and extreme multi-label learning. In particular, we make the following contributions: (1) We develop a family of memory efficient kernel approximation algorithms by exploiting the structure of kernel matrices. The proposed kernel approximation scheme can significantly speed up the training phase of kernel machines; (2) We make the connection between forming a kernel approximation and predicting new instances using kernel machines, and propose a series of improvements over the classical Nyström kernel approximation method. We show that these improvements result in an order of magnitude speed-up in prediction time on large-scale classification and regression tasks with millions of training instances; (3) We overcome the challenges of applying decision trees to the extreme multi-label classification problem, which can have more than 100,000 different labels, and develop the first Gradient Boosting Decision Tree (GBDT) algorithm for extreme multi-label learning. We show that the modified GBDT algorithm achieves substantial reductions in prediction time and model size.
Lithologic heterogeneity of the Eagle Ford Formation, South Texas
(2014-05) Ergene, Suzan Muge; Milliken, K. L.
Grain assemblages in organic-rich mudrocks of the Eagle Ford Formation of South Texas are assessed to determine the relative contributions of intra- and extrabasinal sediment sources, with the ultimate goal of producing data of relevance to prediction of diagenetic pathways. Integrated light microscopy, BSE imaging, and X-ray mapping reveal a mixed grain assemblage of calcareous allochems, biosiliceous grains (radiolaria), quartz, feldspar, lithics, and clay minerals. Dominant fossils are pelagic and benthic foraminifers and thin-walled and prismatic mollusks; echinoderms, calcispheres, and oysters are present. Early-formed authigenic minerals, including calcite, kaolinite, dolomite, albite, pyrite, quartz, and Ca-phosphate, some reworked, add to the overall lithologic heterogeneity. Point counting of images produced using energy-dispersive X-ray mapping in the SEM provides observations at a scale appropriate to classifying the mudrocks based on the composition of the grain assemblage, although grains and other crystals of clay size cannot be fully characterized even with the SEM. Each sample is plotted on a triangle whose vertices correspond to terrigenous and volcanic grains (extrabasinal components), calcareous allochems, and biosiliceous grains. As a subequal mix of grains of intrabasinal and extrabasinal origins, the detrital grain assemblage of the Eagle Ford presents a formidable challenge to the task of lithologic classification of this unit, as neither conventional limestone nor sandstone classifications can be readily applied. The abundant marine skeletal debris in the Eagle Ford is accompanied by abundant calcite cementation, and the dissolution and replacement of biosiliceous debris is accompanied by authigenic quartz, suggesting that mudrock grain classification has potential for yielding diagenetic predictions.
New media communication in education
(2012-06) Livingston, Kat; Bichard, Shannon; Baake, Ken; Stoker, Kevin
Research and teaching are the crossroads at which higher education exists. Great scholar-researchers in the field understand that new media communication in education is a very fluid area of study, rich with opportunities to glean context and insight in every interaction. This project evaluates the learning processes and experiences that took place in my pursuit of a Master of Science in the interdisciplinary studies of new media communication in education. The research included in this portfolio is a reflection of my growth and development as a professional scholar. The content provides an assessment of the academic work I completed, and a means for self-examination and exploration. The papers within this portfolio draw attention to research and literature related to different elements within the realm of Mass Communications, Educational Instructional Technology, Technical Communication and Rhetoric, and Educational Psychology. The content, research, and subject matter seek to explore various concepts and challenges within these four areas of study. Additionally, this research provides a bridge of understanding in regards to the role of new media communication in education, and analyzes the relationship and connectedness of new media and instructional learning. In the study and exploration of these areas of interest, I was able to gain great focus on a research agenda that concentrates on generating research pertaining to the psychological effects of new media on teachers and students, and how these areas work together to better pedagogy and instruction in education. In analyzing the various issues surrounding Mass Communications, Educational Instructional Technology, Technical Communication and Rhetoric, and Educational Psychology, I was able to develop a greater understanding of the world and a foundation upon which my interest in higher education is built.
Nonparametric Inference for High Dimensional Data
(2013-04-23) Mukhopadhyay, Subhadeep
Learning from data, especially "Big Data", is becoming increasingly popular under names such as Data Mining, Data Science, Machine Learning, Statistical Learning and High Dimensional Data Analysis. In this dissertation we propose a new related field, which we call "United Nonparametric Data Science" - applied statistics with "just in time" theory. It integrates the practice of traditional and novel statistical methods for nonparametric exploratory data modeling, and it is applicable to teaching introductory statistics courses that are closer to modern frontiers of scientific research. Our framework includes small data analysis (combining traditional and modern nonparametric statistical inference) and big and high dimensional data analysis (by statistical modeling methods that extend our unified framework for small data analysis). The first part of the dissertation (Chapters 2 and 3) is oriented toward the goal of developing a new theoretical foundation to unify many cultures of statistical science and statistical learning methods using the mid-distribution function, custom-made orthonormal score functions, comparison density, copula density, LP moments and comoments. We also examine how this elegant theory yields solutions to many important applied problems. In the second part (Chapter 4) we extend the traditional empirical likelihood (EL), a versatile tool for nonparametric inference, to the high dimensional context. We introduce a modified version of the EL method that is computationally simpler and applicable to a large class of "large p small n" problems, allowing p to grow faster than n. This is an important step in generalizing the EL in high dimensions beyond the p ≤ n threshold where the standard EL and its existing variants fail. We also present a detailed theoretical study of the proposed method.
The perceptibility of duration in the phonetics and phonology of contrastive consonant length
(2012-05) Hansen, Benjamin Bozzell; Myers, Scott P.; Crowhurst, Megan; King, Robert; Lindblom, Björn; Sussman, Harvey
This dissertation investigates the hypothesis that the more vowel-like a consonant is, the more difficult it is for listeners to classify it as geminate or singleton. A perceptual account of this observation holds that more vowel-like consonants lack clear markers to signal the beginning and ending of the consonant, so listeners don’t perceive the precise duration and consequently the phonological contrast may be neutralized in some languages. Three experiments were performed to address these questions using data from Persian speakers. In Experiment I, four speakers produced singleton and geminate tokens of the voiced oral consonants [d,z,n,l,j] and the glottals [h] and glottal stop at three speaking rates. It was found that Persian speakers do distinguish geminate durations from singleton durations for all manners even at very fast speaking rates, and vowels preceding geminates are slightly longer than those preceding singletons. Speaking rate had more of an effect on geminates than on singletons for all segments studied: the durations of the geminates decreased more in fast speech than the durations of the singletons did. In Experiment II, listeners heard manipulated continua of consonants ranging from singletons to geminates. Subjects’ identification curves were modeled using the cumulative Gaussian model. The modeled standard deviation was interpreted as the breadth of the perceptual threshold, and a broader threshold understood to indicate a less distinct perceptual boundary between the two categories. Obstruents [d,z] had smaller breadth values than the sonorants [n,l,j], and the glottals had the largest breadth values of all. This indicates that while sonorants were more difficult for listeners to categorize than obstruents, the glottals were the most difficult to categorize of the segments tested. 
Experiment III tested whether the modification of a specific parameter, the formant transition duration, would affect the perceptibility of the geminate/singleton contrast. A single token containing the glide [j] was manipulated to produce three different continua, each having a distinctly different manipulated transition: short, normal or long. It was found that the longer the transition was, the broader the perceptual threshold, thus making the consonant harder to categorize.
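The cumulative Gaussian identification model described in this abstract can be sketched as follows. The boundary and breadth values are hypothetical and chosen only to show how a larger standard deviation flattens the identification curve; they are not estimates from the dissertation.

```python
from statistics import NormalDist

def p_geminate(duration_ms, boundary_ms, breadth_ms):
    """Probability of a 'geminate' response under a cumulative Gaussian model.

    boundary_ms is the mean (the 50% crossover point) and breadth_ms is the
    standard deviation, read here as the breadth of the perceptual threshold.
    """
    return NormalDist(mu=boundary_ms, sigma=breadth_ms).cdf(duration_ms)

boundary = 120.0             # hypothetical category boundary in milliseconds
durations = (100, 120, 140)  # hypothetical points along a duration continuum

# A small breadth gives a sharp category boundary; a large breadth gives
# a shallow curve, i.e. a less distinct boundary between the categories.
narrow = [p_geminate(d, boundary, 10.0) for d in durations]
broad = [p_geminate(d, boundary, 30.0) for d in durations]
```

Both curves cross 50% at the boundary, but the broad curve stays closer to 50% on either side, which corresponds to consonants that listeners find harder to categorize.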
Phylogeography of the cottonmouth, Agkistrodon piscivorus, using AFLP and venom protein profiles.
(2011-05-25) Strickland, Jason L.; Strickland, Jason; Ammerman, Loren K.; Ross, Linda; Maxwell, Terry C.; Parkinson, Christopher L.; Angelo State University. Department of Biology.
The objective of this study was to examine population structure in cottonmouths (Agkistrodon piscivorus) using Amplified Fragment Length Polymorphism (AFLP) and compare genetic and venom protein profiles in Texas. AFLP profiles using 622 fragments were generated for 105 individuals to understand the level of variation within Agkistrodon. In Texas, there was a significant lack of gene flow detected and support for the isolation of Concho Valley individuals. Cottonmouths showed the greatest genetic variation when compared to other Agkistrodon species but there was not complete support for two species of cottonmouths as currently proposed. RP-HPLC was used to examine venom protein profiles in 86 Texas cottonmouths. Relative peak heights were analyzed using PCA and the MANOVA demonstrated separation of populations based on profiles (p<0.001). Genetic and venom variation did not follow the same pattern indicating that there may be other selection pressures acting on the venom proteins.
Predicting success of bank telemarketing with classification trees and logistic regression
(2016-05) Yang, Chuanfeng; Zhou, Mingyuan (Assistant professor); Gawande, Kishore
The success of a bank marketing campaign is predicted from customer features, campaign information and economic attributes. To predict whether or not clients will subscribe to a long-term deposit, logistic regression is applied with backward variable selection and principal components analysis. Random forests and stochastic gradient boosting, both tree-based classification methods, are also built for comparison. Based on visualization and quantitative predictive performance, gradient boosting (AUC = 0.791) is slightly better than the other two models. Variable importance from the three models remains consistent for most variables. Social and economic attributes, such as euribor3m, are among the most important variables.
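For readers unfamiliar with the AUC figure quoted above, a minimal sketch of how the statistic can be computed is given below. It uses the pairwise-comparison definition (the probability that a randomly chosen positive case is scored above a randomly chosen negative case); the labels and scores are hypothetical and unrelated to the thesis data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = subscribed) and predicted probabilities.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.1, 0.8]
example_auc = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the three models are compared.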
Secondary Analysis of Case-Control Studies in Genomic Contexts
(2011-10-21) Wei, Jiawei
This dissertation consists of five independent projects. In each project, a novel statistical method was developed to address a practical problem encountered in genomic contexts. For example, we considered testing for constant nonparametric effects in a general semiparametric regression model in genetic epidemiology; analyzed the relationship between covariates in the secondary analysis of case-control data; performed model selection in joint modeling of paired functional data; and assessed the prediction ability of genes in gene expression data generated by the CodeLink System from GE. In the first project in Chapter II we considered the problem of testing for constant nonparametric effects in a general semiparametric regression model when there is the potential for interaction between the parametrically and nonparametrically modeled variables. We derived a generalized likelihood ratio test for this hypothesis, showed how to implement it, and gave evidence that it can improve statistical power when compared to standard partially linear models. The second project in Chapter III addressed the issue of score testing for the independence of X and Y in the secondary analysis of case-control data. The semiparametric efficient approaches can be used to construct semiparametric score tests, but they suffer from a lack of robustness to the assumed model for Y given X. We showed how to adjust the semiparametric score test to make its level/Type I error correct even if the assumed model for Y given X is incorrect, and thus the test is robust. The third project in Chapter IV took up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model. We showed how to estimate the regression parameters in a rare disease case even if the assumed model for Y given X is incorrect, and thus the estimates are model-robust. 
In the fourth project in Chapter V we developed novel AIC and BIC-type methods for estimating the smoothing parameters in a joint model of paired, hierarchical sparse functional data, and showed in our numerical work that they are many times faster than 10-fold crossvalidation while at the same time giving results that are remarkably close to the crossvalidated estimates. In the fifth project in Chapter VI we introduced a practical permutation test that uses cross-validated genetic predictors to determine if the list of genes in question has "good" prediction ability. It avoids overfitting by using cross-validation to derive the genetic predictor and determines if the count of genes that give "good" prediction could have been obtained by chance. This test was then used to explore gene expression of colonic tissue and exfoliated colonocytes in the fecal stream to discover similarities between the two.
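The general idea of a permutation test, as used in the fifth project, can be sketched in a much simplified form. The example below is not the dissertation's procedure: it swaps in a plain difference-in-means statistic instead of cross-validated genetic predictors, and the data values are hypothetical. It only shows how shuffling labels yields a reference distribution for asking whether a result could have arisen by chance.

```python
import random

def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Permutation p-value for the absolute difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def stat(lbls):
        g1 = [v for v, l in zip(values, lbls) if l == 1]
        g0 = [v for v, l in zip(values, lbls) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between value and label
        if stat(shuffled) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements for two groups of four samples each.
values = [5.1, 4.9, 5.3, 5.0, 2.1, 1.9, 2.2, 2.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p_value = permutation_p_value(values, labels)
```

A small p-value indicates that few random relabelings reproduce a difference as large as the observed one.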
Sentiment-based Classification of Tweeters and University Programs
(2014-08-21) Huang, Bolun
The rapidly growing World Wide Web (WWW) is no longer a passive information provider. Nowadays, Internet users themselves have become contributors to the WWW. A large amount of user-generated data, along with non-user-generated data, makes our world an informative, though perhaps over-informed, society. The increasing amount of unorganized, disordered, unstructured, or even randomly generated data has driven the momentum of big data analysis, which aims to discover and learn the hidden patterns behind the data. In this thesis, we look at two problems of mining knowledge from data. In the first project, we classify "democrats" and "republicans" on Twitter. We first propose a sentiment-based classification model, with the aim of addressing the problem that conventional quantitative features, such as tweet count, follower-to-following ratio, and election tweet count, cannot reflect the opinion alignment of tweeters. We therefore use sentiment scores over multiple topics as the feature vector in the classification model. We propose an automatic topic selection model to learn the distinguishing topics, making the sentiment feature selection domain independent. However, the sentiment-based classification model does not perform much better than the non-sentiment model. To improve on it, we propose using social relationship graph information to adjust the sentiment vectors. The graph-adjusted sentiment model achieves a classification accuracy higher than 80 percent. In addition, we deploy a completely graph-based model, the Belief Propagation (BP) model, on the social graph, which achieves a prediction accuracy higher than 85 percent. We conclude that the social relationship graph is more important than the sentiment of tweets for classifying users into "democrats" and "republicans".
In the second project, we propose a new, alternative way to rank graduate schools using algorithms, instead of the qualitative surveys that U.S. News uses. Based on the assumption that "schools tend to hire PhD graduates from better or peer schools" to become their faculty members, we propose deploying link-based ranking algorithms on the "hiring graph" among universities. We refine the PageRank (PR) algorithm and the Hyperlink-Induced Topic Search (HITS) algorithm by taking edge weights into consideration, as our own way to rank graduate programs. To validate our approach, we collect two separate data sets to construct the "hiring graph": faculty data from the top 50 Computer Science (CS) programs and faculty data from the top 50 Mechanical Engineering (ME) programs across the United States. By comparing our new rankings with the U.S. News ranking, we discover that some programs are either under-ranked or over-ranked by U.S. News. We also conduct extensive analysis of our data, revealing many interesting patterns and cases behind the U.S. News ranking. Finally, we conduct sensitivity analysis on each proposed algorithm to see how sensitive it is to changes in the "hiring graph".
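For readers unfamiliar with edge-weighted link ranking, the idea can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `weighted_pagerank`, the edge direction (hiring school pointing to the school that granted the PhD), the damping factor of 0.85, and the three-school toy graph are all assumptions made for illustration.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    edges maps (src, dst) -> weight. In this hypothetical hiring graph,
    src is the hiring school, dst is the school that granted the PhD,
    and weight is the number of such hires (an assumed convention).
    """
    nodes = sorted({n for edge in edges for n in edge})
    n_nodes = len(nodes)

    # Total outgoing weight per node, used to normalize each edge.
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes
        new_rank = {n: base for n in nodes}
        # Each node passes rank along its edges in proportion to edge weight.
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Toy data (invented): B hired 3 graduates of A, C hired 2 of A and 1 of B,
# and A hired 1 graduate of B.
hires = {("B", "A"): 3, ("C", "A"): 2, ("C", "B"): 1, ("A", "B"): 1}
scores = weighted_pagerank(hires)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, school A places the most graduates and no school hires from C, so `ranking` lists A first and C last. An unweighted variant would ignore how many hires each edge represents.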
Simultaneous partitioning and modeling : a framework for learning from complex data
(2010-05) Deodhar, Meghana; Ghosh, Joydeep; John, Lizy; Chase, Craig; Dhillon, Inderjit; Saar-Tsechansky, Maytal
While a single learned model is adequate for simple prediction problems, it may not be sufficient to represent heterogeneous populations that difficult classification or regression problems often involve. In such scenarios, practitioners often adopt a "divide and conquer" strategy that segments the data into relatively homogeneous groups and then builds a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider prediction problems on bi-modal or dyadic data with covariates, e.g., predicting customer behavior across products, where the independent variables can be naturally partitioned along the modes. A pivoting operation can now result in the target variable showing up as entries in a "customer by product" data matrix. We present a model-based co-clustering framework that interleaves partitioning (clustering) along each mode and construction of prediction models to iteratively improve both cluster assignment and fit of the models. This Simultaneous CO-clustering And Learning (SCOAL) framework generalizes co-clustering and collaborative filtering to model-based co-clustering, and is shown to be better than independently clustering the data first and then building models. Our framework applies to a wide range of bi-modal and multi-modal data, and can be easily specialized to address classification and regression problems in domains like recommender systems, fraud detection and marketing. Further, we note that in several datasets not all the data is useful for the learning problem and ignoring outliers and non-informative values may lead to better models. We explore extensions of SCOAL to automatically identify and discard irrelevant data points and features while modeling, in order to improve prediction accuracy. 
Next, we leverage the multiple models provided by the SCOAL technique to address two prediction problems on dyadic data, (i) ranking predictions based on their reliability, and (ii) active learning. We also extend SCOAL to predictive modeling of multi-modal data, where one of the modes is implicitly ordered, e.g., time series data. Finally, we illustrate our implementation of a parallel version of SCOAL based on the Google Map-Reduce framework and developed on the open source Hadoop platform. We demonstrate the effectiveness of specific instances of the SCOAL framework on prediction problems through experimentation on real and synthetic data.
