
    Toward Automatic Metadata Assignment in the Texas A&M University Digital Repository

    View/Open
    TCDL2009_Creel_Toward_Automatic.pdf (660.6Kb)
    TCDL2009_Creel_Toward_Automatic.ppt (729.5Kb)
    Date
    2009-05-27
    Author
    Creel, James
    Maslov, Alexey
    Mikeal, Adam
    Phillips, Scott
    Abstract
    We are researching and developing tools to automatically assign high-quality metadata to documents in the Texas A&M University Digital Repository, a DSpace repository. Reliable automation of metadata assignment confers several advantages. First, the submission workflow may be streamlined by providing the user with pre-filled input fields. Second, existing collections may have their items automatically augmented with new metadata fields, a task that is usually lengthy and painstaking for human catalogers. Finally, the repository may be systematically reviewed for metadata errors.

    Metadata in a DSpace repository exist at several levels and in various forms. Some forms are suitable for automatic assignment, whereas others remain beyond the scope of modern artificial intelligence and natural language processing. Though we typically think of metadata in a DSpace repository as being applied at the item level, metadata are also applied at the bitstream, bundle, collection, and community levels. Among these levels, we are currently interested in the bitstream and item levels. At the bitstream level, the metadata are restricted to name (always a filename), type (typically a MIME type), and description. The type is currently determined on the basis of filename extensions. We propose to employ the UNIX file command to achieve less error-prone type assignment. The description field, consisting of unrestricted text, poses greater challenges and will be the focus of future research. At the item level, metadata are restricted only by the schemas registered in the repository, typically including Dublin Core (DC). Among the unqualified DC fields, some are highly amenable to automatic assignment, while others pose formidable difficulties in light of the high degree of accuracy required.

    At present, we find it appropriate to tailor metadata assignment tools on a collection-by-collection basis. This is because the necessary metadata fields vary between collections: whereas geographic metadata may be relevant to a collection of maps, committee member metadata may be relevant to a collection of theses. Additionally, by focusing on a particular collection, the software may assume particular structural consistencies between documents. For example, our experimental software can accurately assign title, author, committee member, degree level, abstract, and subject area metadata to TAMU electronic theses and dissertations (ETDs) because the locations of these strings are virtually guaranteed within these documents. Furthermore, stylistic and subject-matter consistencies between documents in a collection can facilitate inferences about those documents that are informed by statistical linguistics.

    Experimentally, we are employing named entity recognition software to identify references to people, places, and things within documents by using statistical linguistic clues. These identifications may be used to provide subject metadata, geographic metadata, and other types of metadata fields, depending on the collection. In the future, we plan to integrate named entity recognition with a formal knowledge base of entities, events, and relations, which can grow dynamically with the repository. This knowledge-based integration is the first step in a long journey toward automatic assignment of complex fields like descriptions and summaries of documents.
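
    The bitstream type proposal in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (type_from_extension, type_from_content, assign_bitstream_type) are hypothetical, and the sketch assumes a UNIX-like system where the file command accepts the --brief and --mime-type options. It contrasts the extension-based guess with a content-based one and falls back to the extension when file is unavailable.

```python
import mimetypes
import os
import shutil
import subprocess
from typing import Optional


def type_from_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the filename extension alone."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


def type_from_content(path: str) -> Optional[str]:
    """Ask the UNIX `file` command for a MIME type based on file contents.

    Returns None when the path is not a regular file, when `file` is not
    installed, or when the command exits with an error.
    """
    if not os.path.isfile(path) or shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def assign_bitstream_type(path: str) -> Optional[str]:
    """Prefer content-based detection; fall back to the extension (assumed policy)."""
    return type_from_content(path) or type_from_extension(path)
```

    A PDF that was uploaded with a .txt extension would be labeled text/plain by the extension-based guess, whereas the content-based check inspects the leading bytes of the file and is therefore less sensitive to mislabeled filenames.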
    URI
    http://hdl.handle.net/123456789/67063
    Collections
    • 2009 Texas Conference on Digital Libraries

    DSpace software copyright © 2002-2016  DuraSpace
    TDL
    Theme by @mire NV
     

     
