Typesafe NLP pipelines on Spark

Hafner, Simon

dc.contributor.advisor	Baldridge, Jason
dc.contributor.advisor	Erk, Katrin
dc.creator	Hafner, Simon	en
dc.date.accessioned	2015-02-24T16:51:06Z	en
dc.date.accessioned	2018-01-22T22:27:30Z
dc.date.available	2018-01-22T22:27:30Z
dc.date.issued	2014-12	en
dc.date.submitted	December 2014	en
dc.date.updated	2015-02-24T16:51:08Z	en
dc.description	text	en
dc.description.abstract	Natural language pipelines consist of various natural language algorithms, each of which uses the annotations of a previous algorithm to compute further annotations. These algorithms tend to be expensive in terms of computational power. Therefore, it is advantageous to parallelize them in order to reduce the time necessary to analyze a large document collection. The goal of this project was to develop a new framework that encapsulates algorithms such that they may be used as part of a pipeline without any additional work. The framework consists of a custom-built data structure called Slab, which implements type safety and functional transparency to integrate itself into the Scala programming language. Because of this integration, it is possible to use Spark, a MapReduce framework, to parallelize the pipeline on a cluster. To assess the performance of the new framework, a pipeline based on the OpenNLP library was created. An existing pipeline implemented in UIMA, an industry standard for natural language pipeline frameworks, served as the performance baseline. The pipeline created with the new framework processed the corpus in about half the time.	en
dc.description.department	Linguistics	en
dc.format.mimetype	application/pdf	en
dc.identifier.uri	http://hdl.handle.net/2152/28654	en
dc.language.iso	en	en
dc.subject	Natural language processing	en
dc.subject	NLP	en
dc.subject	Pipelines	en
dc.subject	Spark	en
dc.subject	Slab	en
dc.title	Typesafe NLP pipelines on Spark	en
dc.type	Thesis	en

Collections

University of Texas at Austin