Browsing by Subject "open source"
Now showing 1 - 7 of 7
Item All Aboard: Bringing the Community Forward to Fedora 6.0 (Texas Digital Library, 2021-05-24) Wilcox, David; Griffith, Arran

Item Flowcharting a Course Through Open-Source Waters, an eMOP guide to OCR (2014-03-14) Christy, Matthew; Texas A&M University
The Early Modern OCR Project (eMOP), an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, intends to use font and book history techniques to train modern Optical Character Recognition (OCR) engines. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research (Mandell, 2013). Now in year two, eMOP is turning towards one of its main goals: to produce a workflow, published in Taverna, for use by individuals and institutions with similar projects. Matthew Christy and Liz Grumbach, eMOP Co-Project Managers for Year Two, will present a series of interconnected workflows that represent the work being done by eMOP and give an idea of how eMOP’s work will benefit the library and larger academic communities. Our presentation will include flowcharts covering:
Wrangling the eMOP data and metadata. Our data set consists of the 45 million pages that make up the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) commercial databases, as well as over 46,000 hand-transcribed texts from the Text Creation Partnership (TCP). We have created our own DB and query/download tools to manage and access that data.
The eMOP Font History database being created.
This DB is based on parsing the natural-language imprint lines of every document in EEBO.
Training Tesseract. We have developed our own tools and methods to optimize training of Google’s open source OCR engine Tesseract for work on pre-modern printed texts.
The eMOP controller. The controller is a software process that controls work from OCR’ing to scoring of results.
The eMOP post-processing process. This process will score OCR results per page, and then decide which of two post-processes to route the page through. Pages that score well will be routed for further correction. Pages that score badly will be routed to a triage system, which will determine what is causing each page to fail OCR’ing and tag it for appropriate pre-processing to rectify problems before later re-OCR’ing.
The eMOP post-processing scoring method.
The process for training eMOP’s triage system’s machine learning applications.
We will conclude with information on where to find out more about eMOP, as well as our open source code and workflows.

Item Managing Assets as Linked Data with Fedora 4 (2016-05-24) Woods, Andrew; DuraSpace
Fedora is a flexible, extensible, open source repository platform for managing, preserving, and providing access to digital content. Fedora is used in a wide variety of institutions including libraries, museums, archives, and government organizations. Fedora 4 introduces native linked data capabilities and a modular architecture based on well-documented APIs and ease of integration with existing applications. Both new and existing Fedora users will be interested in learning about and experiencing Fedora 4 features and functionality first-hand. Attendees will be given pre-configured virtual machines that include Fedora 4 bundled with the Solr search application and a triplestore that they can install on their laptops and continue using after the workshop.
These virtual machines will be used to participate in hands-on exercises that will give attendees a chance to experience Fedora 4 by following step-by-step instructions. Participants will learn how to create and manage content in Fedora 4 in accordance with linked data best practices, and how to search and run SPARQL queries against content in Fedora using the included Solr index and triplestore. This workshop is intended to be an introduction to Fedora 4 - no prior experience with the platform is required. Repository managers and librarians will get the most out of this workshop, though developers new to Fedora would likely also be interested. Attendees can expect to come away with a working understanding of Fedora's main features and benefits, and a clear path for adopting Fedora as a new repository platform or migrating from a previous version of Fedora.

Item Session 1A | Introduction to Fedora 6.0 (Texas Digital Library, 2021-05-24) Wilcox, David
Fedora 6.0 is quickly approaching a production release. This workshop will provide an overview of the software and basic concepts, examples of deployments, and an overview and demonstration of the core features, with a particular focus on new features in version 6.0. We will also discuss the product roadmap and ways to get involved with the Fedora community. This is a technical workshop pitched at an introductory level, so no prior Fedora experience is required.
Attendees who wish to participate in the optional hands-on sections will need to access an online sandbox via a URL that will be provided ahead of the workshop.

Item Texas Digital Library Collaboration: Pooling Resources to Avoid Drowning (Texas Digital Library, 2017-10-23) Mumma, Courtney
Presented at Digital Library Federation Forum, October 2017: TDL has tried to mitigate some of the problems caused by excess technological needs and diminishing resources.

Item Transforming Access to Texts with 18thConnect and TypeWright (2014-03-14) Grumbach, Elizabeth; Texas A&M University
18thConnect is a digital aggregator and virtual research environment (VRE) for eighteenth-century researchers. As part of a larger community of VREs, all organized under the Advanced Research Consortium (ARC) and based on the NINES (Networked Infrastructure for Nineteenth-Century Electronic Scholarship) model for peer review and scholarship, 18thConnect has to tackle issues relevant to its period-specific research community. As a result, the TypeWright application was built for the 18thConnect platform in order to provide an easily accessible, crowd-sourced correction tool for eighteenth-century texts. The TypeWright tool was designed to solve issues with Optical Character Recognition (OCR) for early printed texts, specifically those in Gale/Cengage Learning’s Eighteenth Century Collections Online (ECCO) subscription database, to provide accurate text for full-text searching, data mining, and the creation of digital scholarly editions. Because these texts were photographed, microfilmed, and then digitized over a period of 40 years, their quality negatively impacts OCR text output. In addition, early printing conventions, especially early typefaces and paper quality, cause OCR engines to mis-recognize the word images on a page.
To foster the sustainability and use of these texts in scholarship, TypeWright was created to enable users to correct texts by hand, then save and share their edits with the 18thConnect community. For this poster presentation, I intend to focus on illuminating the following three aspects of the TypeWright tool:
1. Correcting a text in TypeWright, or, briefly explaining the accessible user interface. When a user accesses the 18thConnect site, they can search for “TypeWright-enabled” texts, right now consisting of the 183,000 documents contained in ECCO. Once a user has selected a text, they are ported into the editing interface, which displays snippets of the page image for transcription in the text editing box below. The text editing box already contains the text generated by a previous OCR process, so that the user can either edit the text or confirm that the current text is correct.
2. Liberating a text in TypeWright, or, how users can request full text and XML for a document after completing correction. After a user, or a group of users working collaboratively, has completed correcting a document, their work is reviewed by TypeWright administrators. If the work passes the evaluation process, then the user(s) are able to receive the corrected plain text or XML/TEI-encoded files. If the work fails evaluation (which is rare), users are instructed to look for common “correction” mistakes and fix them.
3. Using a text after TypeWright correction, or, the benefit of crowdsourcing correction for the academic community. Once a user has received their corrected text files, 18thConnect administrators advise users to use this data in their digital project, then submit that digital project for peer review to 18thConnect.
In addition, the corrected text, per our agreements with Gale/Cengage Learning, returns to that database to improve the searchability of this proprietary product, which constitutes an important resource for the eighteenth-century scholarly community.

Item Using Islandora for Open-Source Powered Digital Collections (2015-04-27) Keswick, Tommy; The Cherry Hill Company
Islandora brings together the Fedora Repository Project, the Drupal content management system, and the Apache Solr search platform to enable librarians and other content managers to easily ingest and create collections with all types of digital assets. This presentation will demonstrate the features of Islandora that make it a compelling choice for building online digital collections. We will also highlight the potential for customizations through the open source architecture, including using Drupal for the administrative interface and Solr for search and indexing.