A machine learning approach to automate classification of literature in a SAM research database

Date

2004-08

Publisher

Texas Tech University

Abstract

In the mid-eighties, researchers at the University of Miami confronted the problem of information overload while investigating worker performance. They required literature sources from various fields, such as engineering, business, and psychology, to name a few. To cope with this overload, they devised a research methodology that partitions information resources into category matrices in order to find patterns, trends, or voids. The approach was termed State-of-the-Art Matrix, or SAM, Analysis. SAM Analysis is a manual process, which restricts the amount of information that can inform category decisions. During the first phase of the manual process, researchers construct models or categories that best describe the research area. In the next phase, articles from the information sources are read and assigned to the pre-defined categories based on the judgment of assessors. The manual approach presents major challenges to researchers who must identify and utilize the information hidden in a large corpus. It is practical only for a small number of articles, and categorization relies on the subjective judgment of assessors. A more scalable and flexible approach is therefore needed, such as using machine learning and data mining techniques to automate the categorization of articles in large volumes of data. In this research, automation is approached through a machine learning technique known as a Learning Classifier System (LCS). The LCS performs the data mining task of categorizing articles using the SAM approach by utilizing training and testing datasets extracted from SAM EndNote bibliographic databases related to a specific area of research. To evaluate the ability of the LCS to predict category membership, accuracy-based metrics borrowed from the field of medicine are applied. The metrics include sensitivity, specificity, positive predictive value, and negative predictive value. After training, the evaluation results indicate that the predictive ability of the LCS is greater than 90%. These results were obtained during the second trial of a five-trial experiment.
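The four metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes a single SAM category treated as a binary decision (1 = the article belongs to the category, 0 = it does not). The names `ConfusionCounts`, `count_outcomes`, and `predictive_metrics`, as well as the sample labels, are hypothetical and are not taken from the thesis; the sketch shows only the standard definitions, not the author's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for one category (hypothetical helper type)."""
    tp: int  # article in category, predicted in category
    fp: int  # article not in category, predicted in category
    tn: int  # article not in category, predicted not in category
    fn: int  # article in category, predicted not in category


def count_outcomes(actual: Sequence[int], predicted: Sequence[int]) -> ConfusionCounts:
    """Tally the four outcomes from paired 0/1 labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator: int, denominator: int) -> float:
    """Return NaN instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def predictive_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Standard definitions of the four metrics."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # share of true members found
        "specificity": _ratio(c.tn, c.tn + c.fp),  # share of non-members rejected
        "ppv": _ratio(c.tp, c.tp + c.fp),          # share of positive calls that are right
        "npv": _ratio(c.tn, c.tn + c.fn),          # share of negative calls that are right
    }


# Made-up labels for ten articles, for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = predictive_metrics(count_outcomes(actual, predicted))
```

Under this reading, sensitivity and specificity describe how well known members and non-members of a category are recognized, while the two predictive values describe how far a given prediction can be trusted.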
