Supervised language models for temporal resolution of text in absence of explicit temporal cues

Date

2013-12

Abstract

This thesis explores the temporal analysis of text using the implicit temporal cues present in a document. We consider the case in which all explicit temporal expressions, such as specific dates or years, are removed from the text and a bag-of-words approach is used to predict the timestamp of the text. A set of gold-standard text documents with timestamps is used as the training set. We also predict time spans for Wikipedia biographies based on their text. Our training texts range from 3800 BC to the present day. We partition this timeline into equal-sized chronons and build a probability histogram for a test document over this chronon sequence. The document is assigned to the chronon with the highest probability.
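The chronon assignment described above can be illustrated with a minimal sketch. It assumes a unigram bag-of-words model per chronon, add-alpha smoothing, and a prior proportional to the number of training documents in each chronon; the function names (`chronon_index`, `train`, `predict`) and the parameters `start`, `width`, and `alpha` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict

def chronon_index(year, start, width):
    """Map a year to the index of its equal-width chronon (BC years are negative)."""
    return (year - start) // width

def train(docs, start, width):
    """docs: iterable of (tokens, year) pairs.
    Returns per-chronon word counts and per-chronon document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, year in docs:
        c = chronon_index(year, start, width)
        word_counts[c].update(tokens)
        doc_counts[c] += 1
    return word_counts, doc_counts

def predict(tokens, word_counts, doc_counts, alpha=1.0):
    """Return (best_chronon, histogram) for a test document.
    The histogram is the normalized posterior over chronons seen in training."""
    vocab = set().union(*word_counts.values())
    vocab_size = len(vocab) + 1  # assumption: one extra slot for unseen words
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for c, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[c] / total_docs)  # log prior
        for w in tokens:
            score += math.log((counts[w] + alpha) / (total_words + alpha * vocab_size))
        log_scores[c] = score
    # Normalize in log space to obtain a probability histogram.
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    histogram = {c: v / z for c, v in unnorm.items()}
    best = max(histogram, key=histogram.get)
    return best, histogram
```

With `start=-3800` and a fixed `width`, the same indexing covers the full timeline mentioned above; the choice of chronon width is left open here because the abstract does not state it.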

We use two approaches: 1) a generative language model with Bayesian priors, and 2) a KL-divergence-based model. To counter the sparsity in the documents and chronons, we use three different smoothing techniques across models. We use three diverse datasets to test our models: 1) Wikipedia Biographies, 2) Gutenberg Short Stories, and 3) the Wikipedia Years dataset.
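As a rough illustration of the second approach, the sketch below picks the chronon whose smoothed word distribution has the smallest KL divergence from the word distribution of the document. The abstract does not name the three smoothing techniques, so additive (add-alpha) smoothing is used here purely as an assumption; the names `smoothed_prob`, `kl_divergence`, and `closest_chronon` are hypothetical.

```python
import math
from collections import Counter

def smoothed_prob(word, counts, vocab_size, alpha=0.5):
    """Additive smoothing (an assumed choice): no word gets zero probability."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

def kl_divergence(doc_tokens, chronon_counts, vocab_size, alpha=0.5):
    """KL(document || chronon), summed over words that occur in the document."""
    doc_counts = Counter(doc_tokens)
    n = len(doc_tokens)
    kl = 0.0
    for w, c in doc_counts.items():
        p = c / n  # maximum-likelihood document probability
        q = smoothed_prob(w, chronon_counts, vocab_size, alpha)
        kl += p * math.log(p / q)
    return kl

def closest_chronon(doc_tokens, chronon_models, alpha=0.5):
    """chronon_models: dict mapping a chronon label to a Counter of word counts.
    Returns the label with the smallest divergence from the document."""
    vocab = set(doc_tokens)
    for counts in chronon_models.values():
        vocab.update(counts)
    return min(
        chronon_models,
        key=lambda c: kl_divergence(doc_tokens, chronon_models[c], len(vocab), alpha),
    )
```

Smoothing matters here because a single document word missing from a chronon would otherwise make the divergence infinite, which is the sparsity problem the paragraph above refers to.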

Our models are trained on a subset of Wikipedia biographies. We concentrate on two prediction tasks: 1) timestamp prediction for a generic text, or mid-span prediction for a Wikipedia biography, and 2) life-span prediction for a Wikipedia biography. We achieve an f-score of 81.1% for the life-span prediction task and a mean error of around 36 years for mid-span prediction for biographies from 3800 BC to the present day. The best model gives a mean error of 18 years for publication date prediction for short stories that are uniformly distributed in the range 1700 AD to 2010 AD. Our models exploit the temporal distribution of text to associate a time with it. Our error analysis reveals interesting properties of the models and datasets used.

We also combine explicit temporal cues extracted from the document with its implicit cues to obtain a combined prediction model. We show that a combination of the date-based predictions and language model divergence predictions is highly effective for this task: our best model obtains an f-score of 81.1%, and the median error between actual and predicted life-span midpoints is 6 years. This will be one of the emphases of our future work.

The above analyses demonstrate that there are strong temporal cues within texts that can be exploited statistically for temporal predictions. We also create good benchmark datasets along the way for the research community to further explore this problem.

Description

text
