    Beyond the Early Modern OCR Project

    TCDL15-Beyond-eMOP.pdf (6.891Mb)
    Date
    2015-04-27
    Author
    Christy, Matthew
    Grumbach, Elizabeth
    Mandell, Laura
    Abstract
    The Early Modern OCR Project (eMOP) is a Mellon Foundation grant-funded project, nearing completion at the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University. eMOP’s goal is to improve optical character recognition (OCR) output for early modern printed English-language texts by creating and using open-source tools and workflows. In addition to establishing an impressive OCR workflow infrastructure, eMOP has produced several open-source post-processing tools to evaluate and improve the text output of Google’s Tesseract OCR engine. Work on eMOP is due to finish this summer, and the team is now looking beyond eMOP towards sharing its accrued knowledge and tools.

    As a Mellon-funded project, eMOP is tasked with sharing the results of its work whenever possible. This is in line with the IDHMC’s stated goals of helping Humanities scholars conduct digital research, create digital outcomes of their research, or both. We are therefore pursuing several ways to disseminate the products of our work. We are creating open-source code repositories for all software created by and for eMOP. We are creating an open-source repository of all eMOP typeface training created for the Tesseract OCR engine. We are creating a publicly available database of early modern printers, publishers, and booksellers based on the imprint metadata of the entire Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary collections. We are making the recently released Phase I hand transcriptions of EEBO by the Text Creation Partnership (TCP) available for full-text searching via the Advanced Research Consortium’s (ARC’s) 18thConnect website. We are making the first OCR transcriptions ever produced of the entire EEBO catalog available via 18thConnect’s online crowd-sourced transcript correction tool, TypeWright.
    TypeWright will provide free access to the EEBO transcriptions, and anyone who corrects an entire document will receive a text or XML version of the corrected transcription.

    In addition, the eMOP team is committed to continuously improving the accuracy and robustness of our workflow. We are in discussion with, or actively engaged in partnerships with, teams at Notre Dame, Penn State, and the University of Texas to apply eMOP’s workflow to different collections. These partnerships will allow us to improve eMOP by: adding more OCR engines to our workflow alongside Tesseract, the engine currently in use; expanding our collected dictionaries beyond the early modern English currently used with eMOP; expanding our database of Google 3-grams beyond the early modern period to aid post-processing OCR correction of documents from outside that period; and expanding our printers and publishers database to include data from outside the ECCO and EEBO collections.

    We are proud of the work we have done with eMOP and are eager to continue to find ways to build upon what we have accomplished. We feel that much of our work would be of interest to libraries and librarians. We look forward to sharing the outcomes of eMOP and our vision for future work with the participants at TCDL this April.
    URI
    http://hdl.handle.net/2249.1/68405
    Collections
    • 2015 Texas Conference of Digital Libraries

    DSpace software copyright © 2002-2016  DuraSpace