Browsing by Subject "Parsing"
Item: Inducing grammars from linguistic universals and realistic amounts of supervision (2015-05)
Authors: Garrette, Daniel Hunter; Baldridge, Jason; Mooney, Raymond J. (Raymond Joseph); Ravikumar, Pradeep; Scott, James G.; Smith, Noah A.

The best performing NLP models to date are learned from large volumes of manually annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of widening the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that even a small amount of annotation input, such as what can be collected in just a few hours, can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation.
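The idea of exploiting type-level annotation can be made concrete with a small sketch: a tag dictionary lists, per word type, the tags that word may ever take, and a learner restricts its hypothesis space accordingly, leaving only unseen words fully ambiguous. The function name, tag inventory, and toy data below are illustrative assumptions, not the dissertation's actual model or data.

```python
# Sketch: constraining per-token tag choices with a type-level tag
# dictionary, as in type-supervised tagging. All names and the toy
# data here are hypothetical stand-ins.

def allowed_tags(word, tag_dict, all_tags):
    """Return the tags a model may consider for `word`.

    Words listed in the dictionary are restricted to their listed
    tags; unknown words stay ambiguous over the full tag set.
    """
    return tag_dict.get(word, all_tags)

tag_dict = {"the": {"DT"}, "dog": {"NN"}, "runs": {"VBZ", "NNS"}}
all_tags = {"DT", "NN", "VBZ", "NNS", "JJ"}

sentence = ["the", "dog", "runs", "quickly"]
lattice = [allowed_tags(w, tag_dict, all_tags) for w in sentence]
# "the" is pinned to a single tag, while the unseen "quickly"
# remains open to every tag in the inventory.
```

A learner then only has to resolve the remaining ambiguity (e.g. "runs" and the unknown "quickly") from unannotated text, which is the sense in which cheap type-level information stands in for expensive token-level labels.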
Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.

Item: Supervision for syntactic parsing of low-resource languages (2016-05)
Authors: Mielens, Jason David; Baldridge, Jason; Erk, Katrin; Mooney, Ray; Dyer, Chris; Beavers, John

Developing tools for computational linguistics work in low-resource scenarios often requires creating resources from scratch, especially in highly specialized domains or in languages with few existing tools or little prior research. Due to practical constraints on project costs and sizes, the resources created in these circumstances often differ from large-scale resources in both quantity and quality, and working with them poses a distinctly different set of challenges than working with larger, more established resources. There are many approaches to handling these challenges, including variations aimed at reducing or eliminating the annotation needed to train models for various tasks. This work considers the task of low-resource syntactic parsing and examines the relative benefits of different methods of supervision. I argue that the benefits of doing some amount of supervision almost always outweigh the costs of that annotation; unsupervised or minimally supervised methods are often surpassed with surprisingly small amounts of supervision.
This work is primarily concerned with identifying and classifying sources of supervision that are both useful and practical in low-resource scenarios, and with analyzing the performance of systems that use these supervision sources and the behavior of the minimally trained annotators who provide them. Additionally, I demonstrate several cases where linguistic theory and computational performance are directly connected. Maintaining a focus on the linguistic side of computational linguistics can provide many benefits, especially when working with languages where the correct analysis of various phenomena may still be very much unsettled.

Item: Unknown word sequences in HPSG (2014-05)
Authors: Mielens, Jason David; Baldridge, Jason

This work investigates the properties of unknown words in HPSG, and in particular the phenomenon of multi-word unknown expressions: multiple unknown words occurring in sequence. It first presents a study determining the relative frequency of multi-word unknown expressions, then surveys the efficacy of a variety of techniques for handling them. The techniques include modified versions of techniques from the existing unknown-word prediction literature as well as novel ones, and they are evaluated with specific attention to how they fare on sentences with many unknown words and long unknown sequences.

Item: Unsupervised partial parsing (2011-08)
Authors: Ponvert, Elias Franchot; Baldridge, Jason; Bannard, Colin; Beaver, David I.; Erk, Katrin E.; Mooney, Raymond J.

The subject of this thesis is the problem of learning to discover grammatical structure from raw text alone, by a computer or computational process, without access to explicit instruction or annotation: in other words, unsupervised parser induction, or simply, unsupervised parsing.
This work presents a method for raw-text unsupervised parsing that is simple but nevertheless achieves state-of-the-art results on treebank-based direct evaluation. The approach adopts a different way of constraining learned models than has been deployed in previous work. Specifically, I focus on a sub-task of full unsupervised parsing called unsupervised partial parsing. In essence, the strategy is to learn to segment a string of tokens into a set of non-overlapping constituents, or chunks, each one or more tokens in length. This strategy has a number of advantages: it is fast and scalable, it is based on well-understood and extensible natural language processing techniques, and it produces predictions about human language structure that are useful for human language technologies. The models developed for unsupervised partial parsing recover base noun phrases and local constituent structure with high accuracy compared to strong baselines. Finally, these models may be applied in cascaded fashion to predict full constituent trees: first segmenting a string of tokens into local phrases, then re-segmenting to predict higher-level constituent structure. This simple strategy leads to an unsupervised parsing model that produces state-of-the-art results for constituent parsing of English, German, and Chinese. This thesis presents, evaluates, and explores these models and strategies.
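The cascaded strategy can be sketched in a few lines: segment tokens into chunks, collapse each chunk to a single unit, and segment again to obtain a second level of structure. The dissertation learns its segmenter from raw text; the function-word split rule, names, and toy sentence below are purely illustrative stand-ins for that learned model.

```python
# Sketch of the cascaded partial-parsing strategy. The chunker here
# is a hypothetical rule (split at function words), not the learned
# segmenter described in the dissertation.

FUNCTION_WORDS = {"the", "a", "of", "in"}

def chunk(tokens):
    """Stand-in chunker: start a new chunk at each function word,
    unless the previous token was also a function word."""
    chunks, current = [], []
    for tok in tokens:
        if tok in FUNCTION_WORDS and current and current[-1] not in FUNCTION_WORDS:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks

def cascade(tokens):
    """Two-level cascade: chunk, represent each chunk by one unit
    (here its first token), and chunk that sequence again."""
    level1 = chunk(tokens)          # local phrases
    heads = [c[0] for c in level1]  # one unit per level-1 chunk
    level2 = chunk(heads)           # higher-level constituents
    return level1, level2

level1, level2 = cascade(["the", "dog", "in", "the", "park", "barked"])
# level1 groups tokens into local phrases; level2 groups those
# phrases into higher-level constituents.
```

The point of the cascade is that the same segmentation machinery is reused at each level, so a fast, flat chunker can be iterated into a full constituent tree.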