Increasing Access to Content-Rich Publications from Web Archives with Machine Learning Models




Caragea, Cornelia
Fox, Nathan
Patel, Krutarth
Phillips, Mark


Texas Digital Library


The University of North Texas (UNT) Libraries, in partnership with the University of Illinois at Chicago, were awarded a National Leadership Grant (IMLS:LG-71-17-0202-17) from the Institute of Museum and Library Services (IMLS) to research the efficacy of using machine-learning models to identify and extract content-rich publications (publications considered to be “within scope” for a given collection or repository) located in web archives. This research project combines machine learning and traditional qualitative research methods to improve the team's ability to identify documents and publications in web archives that align with existing collections held by cultural heritage organizations. Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. It is posited that qualitative methods will help identify and formulate features that machine learning models can use to pull content-rich publications from web archives. The study has two phases. The first phase, performed by researchers at the University of North Texas, focuses on interviewing subject matter experts who have experience collecting publications from the web. Efforts have been made to collect a representative sample of collection types that align with the three use cases in this study: populating an institutional repository from a university domain crawl, extracting state publications from a domain crawl, and identifying technical reports from a large web archive of a federal agency. The interviews and subsequent analysis are aimed at identifying potential features that can inform the machine learning algorithms being developed and refined in phase two of the study.
Interviews have been conducted with librarians and archivists to better understand how they approach collecting publications from the web and to determine what kinds of workflows and features help them identify documents of interest for the collections they are building. The interviews were then transcribed and analyzed using qualitative analysis software (NVivo 12). Recommendations for features to incorporate into the machine learning models were then made to the research partners, who will implement them in later stages of the study. The hope is that these features, when integrated with machine learning models, can identify content-rich publications among the massive amount of material available in web archives, and that those publications can, in turn, aid libraries and archives in their collection efforts. It is also hoped that these methods will inform future research aimed at breaking down barriers to the access and use of the wealth of resources available through web archives. This poster will present the research design of the project and the workflow for the qualitative data collection, transcription, and analysis. It will also discuss findings from the analysis of the interviews and give examples of feature suggestions identified for further testing in machine learning models.


Presented by the University of Illinois at Chicago, University of North Texas, and Kansas State University, Poster Minute Madness, at TCDL 2019.