Searching and Ranking XML Data in a Distributed Environment

Date

2009-09-16T18:20:01Z

Publisher

Computer Science & Engineering

Abstract

Due to the increasing number of independent data providers on the web, there is a growing number of web applications that require searching and querying data sources distributed at different locations over the internet. Since XML is rapidly gaining in popularity as a universal format for data exchange and integration, locating and ranking distributed XML data on the web are gaining importance in the database community. Most existing XML indexing techniques combine structure indexes and inverted lists extracted from XML documents to fully evaluate a full-text query against these indexes and return the actual XML fragments of the query answer. In general, these approaches are better suited to a centralized data repository, because they evaluate full-text XML queries by performing costly containment joins over long inverted lists, which does not scale well to large distributed systems.

In this thesis work, we present a novel framework for indexing, locating, and ranking schema-less XML documents based on concise summaries of their structural and textual content. Instead of indexing every element or term in a document, we extract a structural summary and a small number of data synopses from the document, which are indexed in a way suitable for query evaluation. The search query language used in our framework is XPath extended with full-text search. We introduce a novel data synopsis structure to summarize the textual content of an XML document; it correlates textual with positional information in a way that improves query precision. In addition, we present a two-phase containment filtering algorithm based on these synopses that speeds up the search process. To return a ranked list of answers, we integrate an effective aggregated document ranking scheme, inspired by TF*IDF ranking and term proximity, into the query evaluation to score documents and return a ranked list of document locations to the client.
Finally, we extend our framework to structured peer-to-peer systems, routing a full-text XML query from peer to peer, collecting relevant documents along the way, and returning a list of document locations to the user. We conduct extensive experiments over XML benchmark data to demonstrate the advantages of our indexing scheme, the query precision improvement of our data synopses, the efficiency of the optimization algorithm, the effectiveness of our ranking scheme, and the scalability of our framework.

We expect that the framework developed in this thesis will serve as an infrastructure for collaborative work environments within public web communities that share data and resources. The best candidates to benefit from our framework are collaborative applications that host online repositories of data and operate on a very large scale. Other good candidates are applications that need high system and data availability and must scale as the network grows. Finally, our framework can also benefit applications that require complex or hierarchical data (such as scientific data), schema flexibility, and complex querying capabilities, including full-text search and approximate matching.
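
As an illustration of the kind of ranking the abstract describes, the following Python sketch combines a TF*IDF score with a term-proximity bonus and returns a ranked list of document locations. It is not the scoring formula used in the thesis: the function names (rank, score, min_span), the smoothed IDF, the proximity weighting, and the sample locations are all assumptions made for this example, and documents are treated as flat token lists rather than XML.

```python
import math
from collections import Counter


def min_span(positions_by_term):
    """Length (in tokens) of the smallest window containing every term once."""
    events = sorted((p, t) for t, ps in positions_by_term.items() for p in ps)
    need = len(positions_by_term)
    counts = Counter()
    covered = 0
    left = 0
    best = None
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == need:
            span = pos - events[left][0] + 1
            if best is None or span < best:
                best = span
            left_term = events[left][1]
            counts[left_term] -= 1
            if counts[left_term] == 0:
                covered -= 1
            left += 1
    return best


def score(query_terms, doc_tokens, doc_freq, num_docs):
    """TF*IDF sum, boosted when all query terms occur close together."""
    terms = set(query_terms)
    tf = Counter(doc_tokens)
    tfidf = 0.0
    for t in terms:
        if tf[t] == 0:
            continue
        # Smoothed IDF (an assumption of this sketch, not the thesis formula).
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(t, 0))) + 1
        tfidf += (tf[t] / len(doc_tokens)) * idf
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                 for t in terms}
    if any(not p for p in positions.values()):
        return tfidf  # a term is missing, so no proximity bonus applies
    proximity = len(terms) / min_span(positions)  # 1.0 when terms are adjacent
    return tfidf * (1 + proximity)


def rank(query, docs):
    """Return document locations with a non-zero score, best first."""
    tokenized = {loc: text.lower().split() for loc, text in docs.items()}
    query_terms = query.lower().split()
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scored = [(score(query_terms, tokens, doc_freq, len(docs)), loc)
              for loc, tokens in tokenized.items()]
    return [loc for s, loc in sorted(scored, reverse=True) if s > 0]


# Hypothetical document locations and contents, for illustration only.
docs = {
    "peer1/a.xml": "xml ranking in distributed systems",
    "peer2/b.xml": "xml data is stored and later ranking is applied in a cluster",
    "peer3/c.xml": "relational databases",
}
ranked = rank("xml ranking", docs)
```

In this sketch, a document in which the query terms appear next to each other outranks one in which they are far apart, and a document containing none of the terms is dropped from the result list.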