Investigations Into Using Machine Learning Models to Automate the Sorting of Digitized Texas State Publications

Date

2023-05-16

Authors

Rikka, Praneeth

Publisher

Texas Digital Library

Abstract

Over the past ten years, the UNT Libraries has been digitizing Texas State Publications it receives from the Texas State Library and Archives Commission as part of the Texas State Depository program. During this time, over 19,000 items have been digitized and made available in The Portal to Texas History’s Texas State Publications Collection (https://texashistory.unt.edu/explore/collections/TXPUB/). Each year, batches of publications are sent to a digitization vendor, digitized, and returned to UNT, where each publication is sorted so that similar items are grouped together to assist in metadata creation. This sorting usually happens with sets of over 1,000 publications at a time. The manual sorting process is time-consuming and requires expert knowledge of the subject matter. Recent advances in machine learning offer a way to automate this manual sorting of documents. This poster presents a research project to build and test a classification model to assist librarians in sorting digitized Texas State Publications into groups. It discusses the labeled dataset that was created to test different machine learning approaches and presents the findings of text-based and image-based classification models. We hope that this poster encourages others in two specific ways: first, to build datasets that highlight specific problems in the library and archives space that can be worked on by students interested in real-world problems; and second, to think about processes in their own institutions that might benefit from judicious use of machine learning to complement human decisions in making resources available to users.
