Batch Importing into DSpace with the SAFCreator

Date

2016-05-24

Authors

Creel, James

Abstract

A common and difficult use case for any digital repository is the ingest of large batches of items. Batches can come from all sorts of campus and community stakeholders with varying types and quantities of content, differing ways of representing metadata, and unique needs for access control and licensing. The heterogeneous nature of batches presents a fundamental challenge to automating the importation workflow and has led to ad hoc and brittle solutions.

The DSpace institutional repository software enjoys wide adoption in academia and industry, and is a flagship service of the Texas Digital Library to its member institutions. DSpace offers a simple but powerful batch import format called SAF (Simple Archive Format) that allows for metadata assignment, licensing, organization of files into bundles, and authorization management. SAF is simpler than other programmatic means of importation into DSpace, such as the METS SIP (Submission Information Package) used by SWORD, or HTTP POST requests to the REST API. However, generating SAF batches usually still requires external software, programming work, or a combination of both.
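To make the shape of a SAF batch concrete, here is a minimal sketch of writing one item directory by hand. It assumes the commonly documented layout: one subdirectory per item containing a `dublin_core.xml` file of `dcvalue` elements and a `contents` file listing one bitstream per line. The function name `write_saf_item`, the item name `item_000`, and the choice to place every file in the `ORIGINAL` bundle are illustrative assumptions, not part of SAFCreator. The sketch does not copy the files themselves.

```python
import os
import xml.etree.ElementTree as ET


def write_saf_item(batch_dir, item_name, dc_values, filenames):
    """Write the metadata and contents files for one SAF item directory.

    dc_values: list of (element, qualifier_or_None, value) tuples.
    filenames: bitstream names to list in the contents file
               (assumed here to belong to the ORIGINAL bundle).
    """
    item_dir = os.path.join(batch_dir, item_name)
    os.makedirs(item_dir, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata value.
    root = ET.Element("dublin_core")
    for element, qualifier, value in dc_values:
        attributes = {"element": element, "qualifier": qualifier or "none"}
        node = ET.SubElement(root, "dcvalue", attributes)
        node.text = value
    ET.ElementTree(root).write(
        os.path.join(item_dir, "dublin_core.xml"),
        encoding="utf-8",
        xml_declaration=True,
    )

    # contents: one file per line, with a tab-separated bundle assignment.
    with open(os.path.join(item_dir, "contents"), "w", encoding="utf-8") as handle:
        for name in filenames:
            handle.write(f"{name}\tbundle:ORIGINAL\n")
    return item_dir
```

Calling `write_saf_item("batch", "item_000", [("title", None, "A Title")], ["thesis.pdf"])` would produce `batch/item_000/dublin_core.xml` and `batch/item_000/contents`; the bitstream `thesis.pdf` would still need to be copied into the same directory before import.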

There have been some efforts to provide generalized tools for processing metadata and content into SAF (notably Peter Dietz’s SAFBuilder https://github.com/DSpace-Labs/SAFBuilder), but when batches have special requirements regarding licensing and permissions, processing them has usually entailed custom code. In addition, the spreadsheets often used to encode metadata are prone to errors such as invalid field labels and incorrect or missing filenames. Validating the input prior to generating the archive and attempting to import it into DSpace greatly accelerates a batch loading workflow.
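The kind of up-front validation described above can be sketched as follows. This is only an illustration, not SAFCreator's verifier code: the column name `filename`, the label pattern, and the function `validate_batch` are all assumptions made for the example. It checks that each metadata column header looks like `schema.element` or `schema.element.qualifier`, and that each row names a file that exists on disk.

```python
import csv
import io
import os
import re

# Hypothetical rule: labels look like schema.element or schema.element.qualifier.
LABEL_PATTERN = re.compile(r"^[A-Za-z0-9]+\.[A-Za-z0-9]+(\.[A-Za-z0-9]+)?$")
FILE_COLUMN = "filename"  # hypothetical name of the column holding file references


def validate_batch(csv_text, files_dir):
    """Return a list of human-readable problems found in the spreadsheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))

    # Check every header except the file column against the label pattern.
    for label in reader.fieldnames or []:
        if label != FILE_COLUMN and not LABEL_PATTERN.match(label):
            problems.append(f"invalid field label: {label!r}")

    # Check that each row references a file that actually exists.
    for row in reader:
        line = reader.line_num  # line of the source just read
        name = (row.get(FILE_COLUMN) or "").strip()
        if not name:
            problems.append(f"row {line}: missing filename")
        elif not os.path.isfile(os.path.join(files_dir, name)):
            problems.append(f"row {line}: file not found: {name}")
    return problems
```

An empty list means the spreadsheet passed these two checks; anything else can be reported to the person preparing the batch before an archive is generated.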

A new tool called SAFCreator aims to provide enough flexibility to eliminate programming requirements for a wide variety of batch loads, and has been used by librarians at Texas A&M to ingest content into several collections this past year. The tool is packaged as a lightweight desktop Java application. Important features include: input of metadata and file references as CSV spreadsheets; support for any number of schema.element.qualifier labels; support for multiple values in a field; wildcards to select all the files in a directory; customizable item licenses; customizable read access policies on items; modular verifiers for batches. The code is open source at https://github.com/jcreel/SAFCreator and under active development. I welcome and encourage pull requests for new features and verifiers. In this workshop, I will demonstrate the tool and provide instruction on DSpace batch imports with SAF.
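Three of the features listed above (schema.element.qualifier labels, multiple values in a field, and wildcards for files) can be illustrated with a short sketch. It is written in Python for brevity and does not reproduce SAFCreator's Java implementation. The separator `||` and the helper names `parse_label`, `split_values`, and `expand_files` are assumptions made for this example; consult the project repository for the conventions the tool actually uses.

```python
import glob
import os

VALUE_SEPARATOR = "||"  # hypothetical separator for multiple values in one cell


def parse_label(label):
    """Split 'schema.element[.qualifier]' into a (schema, element, qualifier) tuple."""
    parts = label.split(".")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"not a schema.element.qualifier label: {label!r}")
    qualifier = parts[2] if len(parts) == 3 else None
    return parts[0], parts[1], qualifier


def split_values(cell):
    """Split one spreadsheet cell into its individual, non-empty values."""
    return [value.strip() for value in cell.split(VALUE_SEPARATOR) if value.strip()]


def expand_files(base_dir, pattern):
    """Expand a wildcard such as 'scans/*' into sorted paths relative to base_dir."""
    matches = glob.glob(os.path.join(glob.escape(base_dir), pattern))
    return sorted(os.path.relpath(path, base_dir) for path in matches if os.path.isfile(path))
```

For example, `parse_label("dc.date.issued")` yields the schema, element, and qualifier separately, and `split_values("Smith, A. || Jones, B.")` yields two author values that would each become their own metadata entry.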

Description

Workshop presentation slides for the 2016 Texas Conference on Digital Libraries (TCDL).
