Search Text Based on Locations

Zhang, Weiwei

Date

2014-11-21

Authors

Zhang, Weiwei

Abstract

To meet the growing need to find queried information quickly, search engines, data mining systems, and many other applications have been developed in recent years. Some of these applications look for documents containing phrases on a particular topic, such as historical events from a certain time period. Among them, queries based on geographical data are receiving significant attention from the research community and industry. This thesis therefore studies text search based on locations, contributing to Geographical Information Retrieval (GIR) systems.

In addition to the traditional use of GIR systems for finding locations in documents, GIR can be applied to other fields as well. First, it can retrieve location information in text and search for answers to questions of a spatial nature (such as "Where is College Station?"). Second, location information can improve the presentation of search results, for example by displaying them on a map. GIR also contributes to spatial diversity search, which allows users to express preferences and constrain the search results to a particular geographical region. In addition, it finds related documents based on location information from different sources and then represents their similarities graphically. In this way, readers can see the data visually, which helps them understand the document correlations intuitively.

However, most previous research involves keyword searches in spatial databases rather than raw (unlabeled) text. Although there is some work on raw text processing, that work uses matching techniques and limits the geographical range to small regions such as a single country. This thesis therefore adopts a new clustering method, which uses a geographical dictionary to locate any place by its coordinates. This method reduces ambiguity and improves accuracy over the previous research. This study also implements a new word-clustering method to detect a combination of topics in raw text. This method is more accurate than latent Dirichlet allocation (LDA), a state-of-the-art method based on a probabilistic model. In addition, a novel graphic illustration is used to visually represent the relevance ranking between documents.
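
The abstract does not spell out the clustering method, so the following is only a minimal sketch of the general idea of using a geographical dictionary and coordinates to reduce place-name ambiguity. It is not the thesis's algorithm. The toy `GAZETTEER`, its approximate coordinates, and the helper names `haversine_km`, `spread_km`, and `disambiguate` are all assumptions made for illustration. For each place name found in a text, the sketch picks the combination of candidate locations that lies closest together on the globe.

```python
import math
from itertools import combinations, product

# Hypothetical toy gazetteer with approximate coordinates:
# place name -> list of (label, latitude, longitude) candidates.
GAZETTEER = {
    "College Station": [("College Station, TX, USA", 30.63, -96.33)],
    "Paris": [("Paris, France", 48.86, 2.35),
              ("Paris, TX, USA", 33.66, -95.56)],
    "Austin": [("Austin, TX, USA", 30.27, -97.74),
               ("Austin, MN, USA", 43.67, -92.97)],
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spread_km(places):
    """Sum of pairwise distances: small when the places cluster together."""
    return sum(haversine_km(p[1:], q[1:]) for p, q in combinations(places, 2))


def disambiguate(names):
    """Choose one candidate per known name so that the set is tightest."""
    known = [n for n in names if n in GAZETTEER]
    # Brute force over every combination of candidates (assumption: few names).
    best = min(product(*(GAZETTEER[n] for n in known)), key=spread_km)
    return {name: place[0] for name, place in zip(known, best)}


if __name__ == "__main__":
    for name, label in disambiguate(["Paris", "College Station", "Austin"]).items():
        print(f"{name} -> {label}")
```

With this toy data, a text mentioning Paris, College Station, and Austin resolves all three names to their Texas candidates, because that combination has the smallest total spread. The exhaustive search grows exponentially with the number of ambiguous names, so a real system would need a proper clustering step and a full gazetteer.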