|dc.description.abstract||The colossal amount of information available online has resulted in overloading users who need to navigate this information for their routine requirements. Although search engines have been effective in reducing this information overload, they support only queries involving keywords and Boolean operators. There are certain application domains where more expressive ways of searching are necessary. Consider searching a full-text patent database for documents containing more than n occurrences of a pattern, or for documents that have a particular pattern followed by another pattern within a specified interval. Such complex patterns involving pattern frequency and sequence, as well as patterns involving proximity, structural boundaries and synonyms are not supported by current search engines. Users searching documents based on focused, precise semantics may need to specify such patterns. This calls for a Pattern Specification Language (PSL) as well as an efficient detection engine that allows such expressive patterns.
Presently, we have an expressive PSL and detection algorithms that work on stream data. That is, data coming as a stream of text is fed into the Pattern Detection Graph (PDG) using a dataflow approach. This is suitable and required for searching patterns over frequently updated data sources, such as news feeds in order to ensure freshness of the search results. However, this approach is inefficient and impractical for searching patterns over data sources that are extremely large or relatively static, or where freshness is not critical. One alternative is to use a pre-computed index over the stored documents to efficiently detect expressive queries.
This thesis investigates the type of information needed as part of the index to detect patterns that include frequency, proximity and sequence of patterns, phrases, synonyms etc. We present algorithms for each operator that uses the index and detects the pattern. Grouping of common sub expressions and their detection is done in an efficient manner. We describe InfoSearch, a system for retrieving the documents that contain the specified patterns, by using an inverted index over the document collection. For searching patterns and retrieving the matching documents, InfoSearch uses PDGs to filter the results returned by the index.
Interestingly, very little additional information is needed in the index to detect all the patterns that were detected by the streaming approach. Furthermore, the complexity of the algorithms is not very different from the earlier approach.||en_US