Toward Automatic Metadata Assignment in the Texas A&M University Digital Repository
We are researching and developing tools to automatically assign high-quality metadata to documents in the Texas A&M University Digital Repository, a DSpace repository. Reliable automation of the metadata assignment process confers several advantages. First, the submission workflow may be streamlined by providing the user with pre-filled input fields. Second, items in existing collections may be automatically augmented with new metadata fields, a task that is usually lengthy and painstaking for human catalogers. Finally, the repository may be systematically reviewed for metadata errors.
Metadata in a DSpace repository exist at several levels and in various forms. Some forms are suitable for automatic assignment whereas others remain beyond the scope of modern artificial intelligence and natural language processing. Though we typically think of metadata in a DSpace repository as being applied at the item level, metadata are also applied at the bitstream, bundle, collection, and community levels. Among these levels, we are currently interested in the bitstream and item levels.
At the bitstream level, the metadata are restricted to name (always a filename), type (typically a MIME type), and description. The type is currently determined on the basis of filename extensions. We propose to employ the UNIX file command, which inspects file contents rather than names, to achieve less error-prone type assignment. The description field, consisting of unrestricted text, poses greater challenges and will be the focus of future research.
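The contrast between extension-based and content-based type assignment can be illustrated with a minimal Python sketch. It is not the repository's implementation: the function names are hypothetical, and it assumes a file command that accepts the --brief and --mime-type options, as common UNIX and Linux distributions provide.

```python
import mimetypes
import shutil
import subprocess


def type_from_extension(filename):
    """Guess a MIME type from the filename extension alone."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def type_from_content(path):
    """Ask the UNIX `file` command for a MIME type based on file contents.

    Returns None when `file` is not installed or reports an error.
    (Hypothetical helper; assumes `file` supports --brief and --mime-type.)
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # "--" guards against paths that begin with a dash.
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

A PDF that was uploaded under the name thesis.txt, for example, would be typed as text/plain by the extension-based function, whereas the content-based function would be expected to report a PDF type because file examines the leading bytes of the bitstream.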
At the item level, metadata are restricted only by schemas registered in the repository, typically including Dublin Core (DC). Among the unqualified DC fields, some are highly amenable to automatic assignment while others pose formidable difficulties in light of the high degree of accuracy required.
At present, we find it appropriate to tailor metadata assignment tools on a collection-by-collection basis, because the necessary metadata fields vary between collections. Geographic metadata may be relevant to a collection of maps, whereas committee member metadata may be relevant to a collection of theses. Additionally, by focusing on a particular collection, the software may assume particular structural consistencies between documents. For example, our experimental software can accurately assign title, author, committee member, degree level, abstract, and subject area metadata to TAMU electronic theses and dissertations (ETDs) because these strings appear in virtually guaranteed locations within the documents. Furthermore, consistencies of style and subject matter between documents in a collection can support statistically and linguistically informed inferences about those documents.
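To illustrate how fixed string locations make collection-specific extraction tractable, the following Python sketch pulls a few fields from the text of a title page. The page layout, the sample text, and the function name are all hypothetical, invented for illustration; they do not reproduce the actual TAMU ETD template or our experimental software.

```python
import re

# Hypothetical title-page text; the real ETD layout may differ.
SAMPLE_PAGE = """\
A STUDY OF EXAMPLE THINGS

A Thesis
by
JANE DOE

Submitted in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE

Chair of Committee, John Smith
Major Subject: Computer Science
"""


def extract_etd_fields(text):
    """Extract fields by position and fixed labels (assumed layout)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {"title": None, "author": None, "chair": None, "subject": None}

    # Assumption: the title is the first non-blank line.
    if lines:
        fields["title"] = lines[0]

    # Assumption: the author is the line immediately after a lone "by".
    for i, ln in enumerate(lines[:-1]):
        if ln.lower() == "by":
            fields["author"] = lines[i + 1]
            break

    # Assumption: chair and subject follow fixed labels on one line.
    chair = re.search(r"^Chair of Committee,\s*(.+)$", text, re.MULTILINE)
    if chair:
        fields["chair"] = chair.group(1).strip()
    subject = re.search(r"^Major Subject:\s*(.+)$", text, re.MULTILINE)
    if subject:
        fields["subject"] = subject.group(1).strip()

    return fields
```

Rules of this kind are brittle outside the collection they were written for, which is one reason we tailor the tools collection by collection rather than attempting a single general extractor.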
Experimentally, we are employing named entity recognition software to identify references to people, places, and things within documents by using statistical linguistic clues. These identifications may be used to provide subject metadata, geographic metadata, and other types of metadata fields depending on the collection. In the future, we plan to integrate named entity recognition with a formal knowledge base of entities, events, and relations, which can dynamically grow with the repository. This knowledge-based integration is the first step in a long journey toward automatic assignment of complex fields like descriptions and summaries of documents.