Typesafe NLP pipelines on Spark

dc.contributor.advisor: Baldridge, Jason
dc.contributor.advisor: Erk, Katrin
dc.creator: Hafner, Simon [en]
dc.date.accessioned: 2015-02-24T16:51:06Z [en]
dc.date.accessioned: 2018-01-22T22:27:30Z
dc.date.available: 2018-01-22T22:27:30Z
dc.date.issued: 2014-12 [en]
dc.date.submitted: December 2014 [en]
dc.date.updated: 2015-02-24T16:51:08Z [en]
dc.description: text [en]
dc.description.abstract: Natural language pipelines consist of various natural language algorithms that use the annotations of a previous algorithm to compute more annotations. These algorithms tend to be expensive in terms of computational power. Therefore it is advantageous to parallelize them in order to reduce the time necessary to analyze a large document collection. The goal of this project was to develop a new framework to encapsulate algorithms such that they may be used as part of a pipeline without any additional work. The framework consists of a custom-built data structure called Slab which implements type safety and functional transparency to integrate itself into the Scala programming language. Because of this integration, it is possible to use Spark, a MapReduce framework, to parallelize the pipeline on a cluster. To assess the performance of the new framework, a pipeline based on the OpenNLP library was created. An existing pipeline implemented in UIMA, an industry standard for natural language pipeline frameworks, served as a baseline in terms of performance. The pipeline created from the new framework processed the corpus in about half the time. [en]
dc.description.department: Linguistics [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: http://hdl.handle.net/2152/28654 [en]
dc.language.iso: en [en]
dc.subject: Natural language processing [en]
dc.subject: NLP [en]
dc.subject: Pipelines [en]
dc.subject: Spark [en]
dc.subject: Slab [en]
dc.title: Typesafe NLP pipelines on Spark [en]
dc.type: Thesis [en]
