Incremental Retrieval And Ranking Of Complex Patterns From Text Repositiories

Date

2007-09-17T17:07:33Z

Authors

Journal Title

Journal ISSN

Volume Title

Publisher

Computer Science & Engineering

Abstract

As the volume of information accessible via the electronic medium (such as the Internet) is staggeringly large and growing rapidly, users have to sift through vast reservoirs of information to retrieve relevant data of their choice. It has been estimated that, the Internet consists of 2.5 billion unique, publicly accessible web-pages and this figure is growing at an alarming rate of 7.3 million pages per day. Currently, the only way to wade through such colossal information is by using search engines (such as Google, Live, Yahoo, etc.). Although the popularity of search engines has increased manifold due to -- simplicity of usage, speed of retrieval and amount of results generated, their ability to intelligently retrieve relevant information is significantly hampered due to over-reliance on Boolean operators for data retrieval. Consider searching for complex patterns involving pattern frequency, e.g., (at least 5 occurrences of the phrase research experiences''), proximity e.g., (metal'' near traders'', in any order, within 10 words of each other), sequence of sub-patterns e.g., (soya'' followed by plantings'', within 5 words of each other), all occurrences of the word (contract'' and its synonyms) or structural patterns like (France'' within the occurrence of sunflower plantings'' and ``harvest'') etc.. Such queries are not supported by currently available search engines. Researchers looking for information in domains such as security, biology, legal research etc. need focused, objective and precise semantics to specify such patterns. In order to deal with such complex requirements of specific domain users, the design of a document retrieval system that consists of -- a pattern specification language and an efficient detection engine -- that allows specification of such expressive patterns is needed. To address these issues, we have designed a framework (made up of two interdependent yet distinct systems -- InfoFilter and InfoSearch) based on an expressive pattern specification language and a set of novel detection algorithms that handles streaming as well as static data. InfoFilter handles pattern detection for streaming data (news feeds, IP packets, etc.) where freshness of search results is paramount. However, in the case of data that resides in the form of large yet static repositories, and when the freshness of data is not critical, the InfoSearch system handles pattern detection using a pre-computed index. The initial design of InfoSearch, for complex pattern detection, focused on fetching all matching occurrences of the pattern in the data repositories. It further processed all the tuples of the operands that constituted the pattern. However, this approach proves to be inefficient in terms of processing time, memory utilization and number of computations. Moreover, the results are generated in the order in which they are detected, thus ignoring the relevancy of results with respect to user preferences. In order to address these problems, an incremental approach for complex pattern detection is needed. Moreover, in order to generate results based on the relevancy rather than the order of detection, ranking mechanisms for appropriately filtering the results also needs to be addressed. In this thesis, we investigate several approaches for incremental detection, retrieval and ranking of results based on user specified criteria. We also investigate the need for novel data structures as well as the types of structural meta-data to be associated with the data stored in the index for ranking the fetched results. We propose a novel ranking algorithm that utilizes the structural boundaries of the data to rank results based on the location and occurrence of a complex pattern in a document. We also present algorithms for each operator encountered in a pattern, that are based on ensuring optimal utilization of computational and memory resources in the least possible time.

Description

Keywords

Citation