Browsing by Subject "Computational linguistics"
Now showing 1 - 12 of 12
Item: Data-rich document geotagging using geodesic grids (2011-05)
Wing, Benjamin Patai; Baldridge, Jason; Erk, Katrin
This thesis investigates automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document's raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.
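The grid-based prediction described in the abstract above can be illustrated with a minimal sketch (an editorial illustration, not the thesis's actual models or data): documents are tied to fixed-resolution latitude/longitude cells, a prediction is a cell center, and error is the great-circle distance in kilometers. The one-degree resolution and the Austin coordinates are assumptions for the example.

```python
# Minimal sketch of grid-based geolocation: discretize coordinates into cells,
# predict a cell center, and measure error as great-circle distance in km.
import math

def cell_of(lat, lon, degrees_per_cell=1.0):
    """Map a latitude/longitude pair to a discrete grid cell id."""
    row = int((lat + 90.0) // degrees_per_cell)
    col = int((lon + 180.0) // degrees_per_cell)
    return (row, col)

def cell_center(cell, degrees_per_cell=1.0):
    """Return the center coordinates of a grid cell."""
    row, col = cell
    return (row * degrees_per_cell - 90.0 + degrees_per_cell / 2,
            col * degrees_per_cell - 180.0 + degrees_per_cell / 2)

def great_circle_km(p, q):
    """Haversine distance in kilometers, used as the prediction error."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Toy usage: "predict" the cell containing a gold location and measure the error.
gold = (30.2672, -97.7431)           # assumed example point (Austin, TX)
pred = cell_center(cell_of(*gold))   # center of the 1-degree cell containing it
print(cell_of(*gold), round(great_circle_km(gold, pred), 1), "km")
```

Finer grids reduce this discretization error but leave less textual evidence per cell, which is presumably why the thesis evaluates grids at several resolutions.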
Item: Dependency based CCG derivation and application (2010-12)
Brewster, Joshua Blake; Baldridge, Jason; Erk, Katrin
This paper presents and evaluates an algorithm to translate a dependency treebank into a Combinatory Categorial Grammar (CCG) lexicon. The dependency relations between a head and a child in a dependency tree are exploited to determine how CCG categories should be derived by making a functional distinction between adjunct and argument relations. Derivations for an English (CoNLL08 shared task treebank) and for an Italian (Turin University Treebank) dependency treebank are performed, each requiring a number of preprocessing steps. In order to determine the adequacy of the lexicons, dubbed DepEngCCG and DepItCCG, they are compared via two methods to preexisting CCG lexicons derived from similar or equivalent sources (CCGbank and TutCCG). First, a number of metrics are used to compare the state of the lexicon, including category complexity and category growth. Second, to measure the potential applicability of the lexicons in NLP tasks, the derived English CCG lexicon and CCGbank are compared in a sentiment analysis task. While the numeric measurements show promising results for the quality of the lexicons, the sentiment analysis task fails to generate a usable comparison.

Item: The dynamics of collocation: a corpus-based study of the phraseology and pragmatics of the introductory-it construction (2005)
Mak, King Tong; Blyth, Carl S. (Carl Stewart), 1958-
Through corpus linguistics, words are found to co-occur in regular patterns, and such collocational behavior turns out to be an essential aspect of meaning itself. While corpus linguists have made much headway in describing language use in terms of collocations and their associated lexical sequences – known as phraseology – their corpus-driven approach to grammar has not adequately demonstrated the subtlety and valency involved in the mutual interaction of lexical choices. Such a descriptive model of Pattern Grammar (Hunston and Francis, 1999) is also found to lack firm theoretical underpinnings. The structure under analysis in this study was the introductory-it construction: "it + (modal) + link verb + (modifier) + lexical slot + complement." By focusing on this construction, which allows immense lexicalization possibilities, the present study aimed to illustrate the complex dynamics of collocation by showing how the way each element is lexicalized affects not only the lexical choice of the other elements, but also the semantic meaning and pragmatic function of the construction as a whole. It also aimed to explore how corpus findings can draw on insights from one of the offshoots of generative theory – Construction Grammar. The corpus used in this study was the British National Corpus. More than 160 searches were conducted to identify how the categories of link verbs, modals, modifiers and complements co-pattern with one another in formulating the different phraseologies of the introductory-it construction. It is thus argued, along the lines of Construction Grammar, that the various phraseologies could be considered different constructions in their own right, each with a specific pragmatic function. It is also suggested that a constructional-schematic view of introductory-it provides a coherent account for the variability in fixity and idiomaticity of its phraseologies, where constructions that show similarities in their specifications of form or meaning to the prototypes are embraced in a family resemblance relationship. Pedagogically, it is argued that typical collocations and phraseologies play a significant part not only in building up fluency, but also in empowering learners so that they have the pragmatic competence to linguistically comport themselves in ways felicitous for their illocutionary goals.
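As a rough illustration of the kind of pattern query such a study involves (not the actual BNC searches), the sketch below matches a simplified version of the "it + (modal) + link verb + (modifier) + lexical slot + complement" pattern over POS-tagged tokens; the tag labels, verb list, and example sentence are assumptions.

```python
# Hedged sketch: match a simplified introductory-it pattern over (lemma, POS) pairs.
LINK_VERBS = {"be", "seem", "appear", "look", "sound", "feel"}

def match_introductory_it(tagged):
    """Return the matched span of a (very simplified) introductory-it pattern, or None."""
    if not tagged or tagged[0][0] != "it":
        return None
    i = 1
    if i < len(tagged) and tagged[i][1] == "MD":          # optional modal
        i += 1
    if i >= len(tagged) or tagged[i][0] not in LINK_VERBS:
        return None                                        # link verb required
    i += 1
    if i < len(tagged) and tagged[i][1] == "RB":           # optional modifier
        i += 1
    if i >= len(tagged) or tagged[i][1] not in {"JJ", "NN"}:
        return None                                        # lexical slot
    i += 1
    if i < len(tagged) and tagged[i][0] in {"that", "to"}:
        return tagged[: i + 1]                             # complement marker
    return None

print(match_introductory_it(
    [("it", "PRP"), ("may", "MD"), ("be", "VB"),
     ("really", "RB"), ("important", "JJ"), ("to", "TO")]))
```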
Item: Inducing grammars from linguistic universals and realistic amounts of supervision (2015-05)
Garrette, Daniel Hunter; Baldridge, Jason; Mooney, Raymond J. (Raymond Joseph); Ravikumar, Pradeep; Scott, James G; Smith, Noah A
The best performing NLP models to date are learned from large volumes of manually-annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input — even that which can be collected in just a few hours — can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation. Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.
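The idea of exploiting faster-to-obtain type-level annotation can be sketched as follows; this is an editorial illustration under assumed data structures, not the dissertation's algorithms. A tag dictionary listing the possible tags for some word types restricts each word's label space, giving a constrained starting point (for example, for EM training of a tagger) from otherwise unannotated text.

```python
# Hedged sketch: use a tag dictionary (type-level supervision) to constrain the
# label space of each word when initializing a tagger from unannotated text.
from collections import defaultdict

ALL_TAGS = ["NOUN", "VERB", "DET", "ADJ"]   # assumed toy tag set

# Faster-to-obtain type-level annotation: possible tags per word type.
tag_dict = {"the": {"DET"}, "dog": {"NOUN"}, "runs": {"VERB", "NOUN"}}

def initial_emissions(corpus, tag_dict, all_tags=ALL_TAGS):
    """Uniform soft tag assignments, restricted by the tag dictionary where known.

    Words missing from the dictionary stay ambiguous over all tags; the resulting
    table could serve as a starting point for EM training of an HMM tagger.
    """
    init = defaultdict(dict)
    for sentence in corpus:
        for word in sentence:
            allowed = tag_dict.get(word, set(all_tags))
            for tag in allowed:
                init[word][tag] = 1.0 / len(allowed)
    return init

corpus = [["the", "dog", "runs"], ["the", "fast", "dog"]]   # unannotated text
for word, dist in initial_emissions(corpus, tag_dict).items():
    print(word, dist)
```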
Item: Methods and applications of text-driven toponym resolution with indirect supervision (2013-08)
Speriosu, Michael Adrian; Baldridge, Jason
This thesis addresses the problem of toponym resolution. Given an ambiguous placename like Springfield in some natural language context, the task is to automatically predict the location on the earth's surface the author is referring to. Many previous efforts use hand-built heuristics to attempt to solve this problem, looking for specific words in close proximity such as Springfield, Illinois, and disambiguating any remaining toponyms to possible locations close to those already resolved. Such approaches require the data to take a fairly specific form in order to perform well, thus they often have low coverage. Some have applied machine learning to this task in an attempt to build more general resolvers, but acquiring large amounts of high quality hand-labeled training material is difficult. I discuss these and other approaches found in previous work before presenting several new toponym resolvers that rely neither on hand-labeled training material prepared explicitly for this task nor on particular co-occurrences of toponyms in close proximity in the data to be disambiguated. Some of the resolvers I develop reflect the intuition of many heuristic resolvers that toponyms nearby in text tend to (but do not always) refer to locations nearby on Earth, but do not require toponyms to occur in direct sequence with one another. I also introduce several resolvers that use the predictions of a document geolocation system (i.e. one that predicts a location for a piece of text of arbitrary length) to inform toponym disambiguation. Another resolver takes into account these document-level location predictions, knowledge of different administrative levels (country, state, city, etc.), and predictions from a logistic regression classifier trained on automatically extracted training instances from Wikipedia in a probabilistic way. It takes advantage of all content words in each toponym's context (both local window and whole document) rather than only toponyms. One resolver I build that extracts training material for a machine-learned classifier from Wikipedia, taking advantage of link structure and geographic coordinates on articles, resolves 83% of toponyms in a previously introduced corpus of news articles correctly, beating the strong but simplistic population baseline. I introduce a corpus of Civil War-related writings not previously used for this task on which the population baseline does poorly; combining a Wikipedia-informed resolver with an algorithm that seeks to minimize the geographic scope of all predicted locations in a document achieves 86% blind test set accuracy on this dataset. After providing these high-performing resolvers, I form the groundwork for more flexible and complex approaches by transforming the problem of toponym resolution into the traveling purchaser problem, modeling the probability of a location given its toponym's textual context and the geographic distribution of all locations mentioned in a document as two components of an objective function to be minimized. As one solution to this incarnation of the traveling purchaser problem, I simulate properties of ants traveling the globe and disambiguating toponyms. The ants' preferences for various kinds of behavior evolve over time, revealing underlying patterns in the corpora that other disambiguation methods do not account for. I also introduce several automated visualizations of texts that have had their toponyms resolved. Given a resolved corpus, these visualizations summarize the areas of the globe mentioned and allow the user to refer back to specific passages in the text that mention a location of interest. One visualization presented automatically generates a dynamic tour of the corpus, showing changes in the area referred to by the text as it progresses. Such visualizations are an example of a practical application of work in toponym resolution, and could be used by scholars interested in the geographic connections in any collection of text on both broad and fine-grained levels.
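Two of the ideas above, the population baseline and resolution that minimizes the geographic scope of a document's predicted locations, can be sketched as follows. The toy gazetteer and the exhaustive search over candidate combinations are illustrative assumptions, not the resolvers actually built in the thesis.

```python
# Hedged sketch: population-baseline resolution vs. picking the candidate
# combination with the smallest geographic spread across a document.
import math
from itertools import product

def km(p, q):
    """Great-circle distance between (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# toponym -> candidate (lat, lon, population) triples (toy gazetteer, assumed values)
gazetteer = {
    "Springfield": [(39.80, -89.64, 116000),   # Illinois
                    (42.10, -72.59, 153000)],  # Massachusetts
    "Chicago":     [(41.88, -87.63, 2700000)],
}

def population_baseline(toponyms):
    """Resolve every toponym to its most populous candidate."""
    return {t: max(gazetteer[t], key=lambda c: c[2])[:2] for t in toponyms}

def minimize_scope(toponyms):
    """Pick the candidate combination with the smallest total pairwise distance."""
    combos = product(*(gazetteer[t] for t in toponyms))
    best = min(combos, key=lambda cs: sum(km(a[:2], b[:2]) for a in cs for b in cs))
    return dict(zip(toponyms, [c[:2] for c in best]))

doc = ["Springfield", "Chicago"]
print(population_baseline(doc))
print(minimize_scope(doc))   # prefers Springfield, Illinois: it is closer to Chicago
```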
Item: Semi-automated annotation and active learning for language documentation (2009-12)
Palmer, Alexis Mary; Baldridge, Jason; Erk, Katrin; England, Nora; Mooney, Raymond; Woodbury, Anthony
By the end of this century, half of the approximately 6000 extant languages will cease to be transmitted from one generation to the next. The field of language documentation seeks to make a record of endangered languages before they reach the point of extinction, while they are still in use. The work of documenting and describing a language is difficult and extremely time-consuming, and resources are extremely limited. Developing efficient methods for making lasting records of languages may increase the amount of documentation achieved within budget restrictions. This thesis approaches the problem from the perspective of computational linguistics, asking whether and how automated language processing can reduce human annotation effort when very little labeled data is available for model training. The task addressed is morpheme labeling for the Mayan language Uspanteko, and we test the effectiveness of two complementary types of machine support: (a) learner-guided selection of examples for annotation (active learning); and (b) annotator access to the predictions of the learned model (semi-automated annotation). Active learning (AL) has been shown to increase the efficacy of annotation effort for many different tasks. Most of the reported results, however, are from studies which simulate annotation, often assuming a single, infallible oracle. In our studies, crucially, annotation is not simulated but rather performed by human annotators. We measure and record the time spent on each annotation, which in turn allows us to evaluate the effectiveness of machine support in terms of actual annotation effort. We report three main findings with respect to active learning. First, in order for efficiency gains reported from active learning to be meaningful for realistic annotation scenarios, the type of cost measurement used to gauge those gains must faithfully reflect the actual annotation cost. Second, the relative effectiveness of different selection strategies in AL seems to depend in part on the characteristics of the annotator, so it is important to model the individual oracle or annotator when choosing a selection strategy. And third, the cost of labeling a given instance from a sample is not a static value but rather depends on the context in which it is labeled. We report two main findings with respect to semi-automated annotation. First, machine label suggestions have the potential to increase annotator efficacy, but the degree of their impact varies by annotator, with annotator expertise a likely contributing factor. At the same time, we find that implementation and interface must be handled very carefully if we are to accurately measure gains from semi-automated annotation. Together these findings suggest that simulated annotation studies fail to model crucial human factors inherent to applying machine learning strategies in real annotation settings.

Item: Speech data compression (Texas Tech University, 1996-08)
Ho, Chien-Te
The analysis-by-synthesis method is the most useful application of the parametric representation. The necessary components for the model are derived from signal analysis procedures, while the output speech waveform is obtained from the synthesis procedure. This method, exemplified by the Code Excited Linear Prediction (CELP) coder [1], was first implemented in the time domain. The basic approach is to model the correlation among the speech samples by using a linear time-varying filter. An excitation model can then be obtained by removing the correlation. Since the filter will not ignore the noise, the parametric representation does have problems with noisy speech data. An alternative procedure is to implement the technique in the frequency domain. This leads to a flexible method for lower bit-rate transmission. Furthermore, it provides a suitable way to model the filter in a noisy environment. Methods such as the harmonic vocoder and the Multiband Excitation Coder (MBE) [4] are all frequency-domain techniques. Since the speech data is recovered from the parametric model, the output depends on the model parameters, which may greatly affect the quality of the speech. The objective of this thesis is to develop efficient algorithms for implementing the harmonic vocoder in the frequency domain. A reliable method is developed to realize the analysis procedure and to recover the correct fundamental elements of the speech signal. An efficient method is proposed to synthesize the output speech signal and to improve speech quality. The techniques of model refinement and enhancement are also described in this thesis. In practice, the analogue speech signal is sampled at 8000 Hz, and this rate is used throughout this research. The research concentrates on methods for speech data compression and speech quality improvement rather than on coding schemes.
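As a rough illustration of frame-based analysis of the fundamental at the 8000 Hz sampling rate mentioned above (a stand-in assumption, not the analysis algorithm developed in the thesis), the sketch below estimates the fundamental frequency of a single frame by autocorrelation peak picking.

```python
# Hedged sketch: autocorrelation-based F0 estimate for one speech frame at 8 kHz.
import numpy as np

def estimate_f0(frame, fs=8000, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency of one speech frame, in Hz."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / f0_max)            # smallest plausible pitch period (samples)
    hi = int(fs / f0_min)            # largest plausible pitch period (samples)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Toy usage: a synthetic 200 Hz "voiced" frame of 30 ms (240 samples at 8 kHz).
t = np.arange(240) / 8000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
print(round(estimate_f0(frame), 1), "Hz")
```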
Item: Speech recognition system (Texas Tech University, 1996-08)
Mehta, Milan G.
Automatic Speech Recognition (ASR) has progressed considerably over the past several decades, but still has not achieved the potential imagined at its very beginning. Almost all of the existing applications of ASR systems are PC based. This thesis is an attempt to develop a speech recognition system that is independent of any PC support and is small enough in size to be used in a daily-use consumer appliance. This system would recognize isolated utterances from a limited vocabulary, provide speaker independence, require less memory and be cost-efficient compared to present ASR systems. In this system, speech recognition is performed with the help of algorithms such as Vector Quantization and Zero Crossing. Several features of a Digital Signal Processor (DSP) have been utilized to generate and execute the algorithms for recognition. The final system has been implemented on a Texas Instruments TMS320C30 DSP. The system, when implemented using the Vector Quantizer approach, achieved an accuracy of 94% for a vocabulary of 6 words and a recognition time of 6 seconds. The zero-crossing approach resulted in an accuracy of 89% for the same vocabulary, while the recognition time was 0.8 seconds.
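The zero-crossing feature mentioned above can be sketched as a per-frame zero-crossing rate, an inexpensive cue that is easy to compute on a DSP. The frame length and the toy signals below are assumptions for illustration only.

```python
# Hedged sketch: per-frame zero-crossing rate as a cheap speech feature.
import numpy as np

def zero_crossing_rate(signal, frame_len=160):           # 20 ms frames at 8 kHz
    """Return the fraction of sign changes per frame of the input signal."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        rates.append(crossings / (frame_len - 1))
    return np.array(rates)

fs = 8000
t = np.arange(fs) / fs                                    # one second of audio
tone = np.sin(2 * np.pi * 100 * t)                        # low-ZCR "voiced" stand-in
noise = np.random.default_rng(0).standard_normal(fs)      # high-ZCR "unvoiced" stand-in
print(zero_crossing_rate(tone)[:3], zero_crossing_rate(noise)[:3])
```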
Item: Supervision for syntactic parsing of low-resource languages (2016-05)
Mielens, Jason David; Baldridge, Jason; Erk, Katrin; Mooney, Ray; Dyer, Chris; Beavers, John
Developing tools for doing computational linguistics work in low-resource scenarios often requires creating resources from scratch, especially when considering highly specialized domains or languages with few existing tools or research. Due to practical considerations in project costs and sizes, the resources created in these circumstances are often different from large-scale resources in both quantity and quality, and working with these resources poses a distinctly different set of challenges than working with larger, more established resources. There are different approaches to handling these challenges, including many variations aimed at reducing or eliminating the annotations needed to train models for various tasks. This work considers the task of low-resource syntactic parsing, and looks at the relative benefits of different methods of supervision. I will argue here that the benefits of doing some amount of supervision almost always outweigh the costs associated with doing that annotation; unsupervised or minimally supervised methods are often surpassed with surprisingly small amounts of supervision. This work is primarily concerned with identifying and classifying sources of supervision that are both useful and practical in low-resource scenarios, along with analyzing the performance of systems that make use of these different supervision sources and the behaviors of the minimally trained annotators that provide them. Additionally, I demonstrate several cases where linguistic theory and computational performance are directly connected. Maintaining a focus on the linguistic side of computational linguistics can provide many benefits, especially when working with languages where the correct analysis for various phenomena may still be very much unsettled.

Item: Text-based document geolocation and its application to the digital humanities (2015-12)
Wing, Benjamin Patai; Baldridge, Jason; Erk, Katrin; Beaver, David; Mooney, Ray; Lease, Matt
This dissertation investigates automatic geolocation of documents (i.e. identification of their location, expressed as latitude/longitude coordinates), based on the text of those documents rather than metadata. I assert that such geolocation can be performed using text alone, at a sufficient accuracy for use in real-world applications. Although in some corpora metadata is found in abundance (e.g. home location, time zone, friends, followers, etc. in Twitter), it is lacking in others, such as many corpora of primary-source documents in the digital humanities, an area to which document geolocation has hardly been applied. To this end, I first develop methods for accurate text-based geolocation and then apply them to newly-annotated corpora in the digital humanities. The geolocation methods I develop use both uniform and adaptive (k-d tree) grids over the Earth's surface, culminating in a hierarchical logistic-regression-based technique that achieves state-of-the-art results on well-known corpora (Twitter user feeds, Wikipedia articles and Flickr image tags). In the second part of the dissertation I develop a new NLP task, text-based geolocation of historical corpora. Because there are no existing corpora to test on, I create and annotate two new corpora of significantly different natures (a 19th-century travel log and a large set of Civil War archives). I show how my methods produce good geolocation accuracy even given the relatively small amount of annotated data available, which can be further improved using domain adaptation. I then use the predictions on the much larger unannotated portion of the Civil War archives to generate and analyze geographic topic models, showing how they can be mined to produce interesting revelations concerning various Civil War-related subjects. Finally, I develop a new geolocation technique for text-only corpora involving co-training between document-geolocation and toponym-resolution models, using a gazetteer to inject additional information into the training process. To evaluate this technique I develop a new metric, the closest toponym error distance, on which I show improvements compared with a baseline geolocator.

Item: Unsupervised partial parsing (2011-08)
Ponvert, Elias Franchot; Baldridge, Jason; Bannard, Colin; Beaver, David I.; Erk, Katrin E.; Mooney, Raymond J.
The subject matter of this thesis is the problem of learning to discover grammatical structure from raw text alone, without access to explicit instruction or annotation -- in particular, by a computer or computational process -- in other words, unsupervised parser induction, or simply, unsupervised parsing. This work presents a method for raw text unsupervised parsing that is simple, but nevertheless achieves state-of-the-art results on treebank-based direct evaluation. The approach to unsupervised parsing presented in this dissertation adopts a different way to constrain learned models than has been deployed in previous work. Specifically, I focus on a sub-task of full unsupervised parsing called unsupervised partial parsing. In essence, the strategy is to learn to segment a string of tokens into a set of non-overlapping constituents or chunks which may be one or more tokens in length. This strategy has a number of advantages: it is fast and scalable, based on well-understood and extensible natural language processing techniques, and it produces predictions about human language structure which are useful for human language technologies. The models developed for unsupervised partial parsing recover base noun phrases and local constituent structure with high accuracy compared to strong baselines. Finally, these models may be applied in a cascaded fashion for the prediction of full constituent trees: first segmenting a string of tokens into local phrases, then re-segmenting to predict higher-level constituent structure. This simple strategy leads to an unsupervised parsing model which produces state-of-the-art results for constituent parsing of English, German and Chinese. This thesis presents, evaluates and explores these models and strategies.
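The cascaded strategy described above can be sketched generically: segment tokens into chunks, collapse each chunk to a single symbol, and re-segment at the next level. The greedy pair-frequency chunker below is a toy stand-in assumption, not the thesis's unsupervised models.

```python
# Hedged sketch of cascaded chunking: chunk, collapse chunks to symbols, re-chunk.
from collections import Counter

def chunk(tokens, pair_counts, threshold=2):
    """Greedy left-to-right segmentation into one- or two-token chunks."""
    chunks, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pair_counts[(tokens[i], tokens[i + 1])] >= threshold:
            chunks.append(tokens[i:i + 2])
            i += 2
        else:
            chunks.append(tokens[i:i + 1])
            i += 1
    return chunks

def cascade(tokens, pair_counts, levels=2):
    """Apply chunking repeatedly, treating each chunk as a unit at the next level."""
    layer, tree = tokens, []
    for _ in range(levels):
        segmented = chunk(layer, pair_counts)
        tree.append(segmented)
        layer = ["_".join(c) for c in segmented]   # collapse chunks into symbols
    return tree

sents = [["the", "dog", "barked"], ["the", "dog", "slept"]]
pair_counts = Counter(p for s in sents for p in zip(s, s[1:]))
print(cascade(["the", "dog", "barked"], pair_counts))
```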
Item: Word meaning in context as a paraphrase distribution: evidence, learning, and inference (2011-08)
Moon, Taesun, Ph. D.; Erk, Katrin; Baldridge, Jason; Bannard, Colin; Dhillon, Inderjit; Mooney, Raymond
In this dissertation, we introduce a graph-based model of instance-based, usage meaning that is cast as a problem of probabilistic inference. The main aim of this model is to provide a flexible platform that can be used to explore multiple hypotheses about usage meaning computation. Our model takes up and extends the proposals of Erk and Pado [2007] and McCarthy and Navigli [2009] by representing usage meaning as a probability distribution over potential paraphrases. We use undirected graphical models to infer this probability distribution for every content word in a given sentence. Graphical models represent complex probability distributions through a graph. In the graph, nodes stand for random variables, and edges stand for direct probabilistic interactions between them. The lack of edges between any two variables reflects independence assumptions. In our model, we represent each content word of the sentence through two adjacent nodes: the observed node represents the surface form of the word itself, and the hidden node represents its usage meaning. The distribution over values that we infer for the hidden node is a paraphrase distribution for the observed word. To encode the fact that lexical semantic information is exchanged between syntactic neighbors, the graph contains edges that mirror the dependency graph for the sentence. Further knowledge sources that influence the hidden nodes are represented through additional edges that, for example, connect to the document topic. The integration of adjacent knowledge sources is accomplished in a standard way by multiplying factors and marginalizing over variables. Evaluating on a paraphrasing task, we find that our model outperforms the current state-of-the-art usage vector model [Thater et al., 2010] on all parts of speech except verbs, where the previous model wins by a small margin. But our main focus is not on the numbers but on the fact that our model is flexible enough to encode different hypotheses about usage meaning computation. In particular, we concentrate on five questions (with minor variants):
- Nonlocal syntactic context: Existing usage vector models only use a word's direct syntactic neighbors for disambiguation or for inferring some other meaning representation. Would it help to have contextual information instead "flow" along the entire dependency graph, each word's inferred meaning relying on the paraphrase distributions of its neighbors?
- Influence of collocational information: In some cases, it is intuitively plausible to use the selectional preference of a neighboring word towards the target to determine its meaning in context. How does building selectional preferences into the model affect performance?
- Non-syntactic bag-of-words context: To what extent can non-syntactic information in the form of bag-of-words context help in inferring meaning?
- Effects of parametrization: We experiment with two transformations of the MLE. One interpolates various MLEs, and another transforms the MLE by exponentiating pointwise mutual information. Which performs better?
- Type of hidden nodes: Our model posits a tier of hidden nodes immediately adjacent to the surface tier of observed words to capture dynamic usage meaning. We examine the model by varying the hidden nodes such that in one variant the nodes have actual words as values and in the other the nodes have nameless indexes as values. The former has the benefit of interpretability, while the latter allows more standard parameter estimation.
Portions of this dissertation are derived from joint work between the author and Katrin Erk [submitted].
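The model structure described in this abstract can be illustrated with a brute-force toy: one hidden paraphrase variable per content word, unary factors scoring paraphrases against the observed word, and pairwise factors along dependency edges, multiplied together and marginalized. The scores and the two-word sentence below are assumptions; the dissertation uses proper potentials and inference rather than enumeration.

```python
# Hedged sketch: marginal paraphrase distribution in a tiny factor graph whose
# edges mirror a dependency graph, computed by brute-force enumeration.
from itertools import product

# candidate paraphrases for each observed content word (toy values)
candidates = {"coach": ["trainer", "bus"], "fired": ["dismissed", "shot"]}

def unary(word, para):
    """Score a paraphrase against the observed word (toy out-of-context scores)."""
    scores = {("coach", "trainer"): 2.0, ("coach", "bus"): 1.5,
              ("fired", "dismissed"): 2.0, ("fired", "shot"): 1.8}
    return scores[(word, para)]

def pairwise(para_a, para_b):
    """Score how well two paraphrases fit together along a dependency edge."""
    scores = {("trainer", "dismissed"): 3.0, ("trainer", "shot"): 0.5,
              ("bus", "dismissed"): 0.2, ("bus", "shot"): 0.4}
    return scores[(para_a, para_b)]

edges = [("coach", "fired")]          # dependency edge between the two content words

def paraphrase_distribution(target):
    """Marginal distribution over the target word's paraphrases (brute force)."""
    words = list(candidates)
    marginal = {p: 0.0 for p in candidates[target]}
    for assignment in product(*(candidates[w] for w in words)):
        value = dict(zip(words, assignment))
        weight = 1.0
        for w in words:
            weight *= unary(w, value[w])         # unary factors
        for a, b in edges:
            weight *= pairwise(value[a], value[b])   # dependency-edge factors
        marginal[value[target]] += weight
    total = sum(marginal.values())
    return {p: v / total for p, v in marginal.items()}

print(paraphrase_distribution("coach"))   # "trainer" wins given the verb context
```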