Active learning and compilation of higher order schema integration queries



Journal Title

Journal ISSN

Volume Title



After nearly 30 years, database integration remains the province of engineers and application developers. In an informal proof, Krishnamurthy, Litwin and Kent [KLK91] demonstrated that only higher order relational languages such as SchemaSQL and SchemaLog [LSS96] are general enough to concisely describe the merging of data from multiple heterogeneous sources. However, those languages are incrementally harder to program than SQL. A modern solution is to provide a GUI language with the same capabilities. However, a Query-by-Example (QBE) [Zloof77] type interface does not allow the unambiguous specification of higher order data integrating queries. We propose an architecture comprising three layers. In the middle layer, the user expresses the desired federated view through a QBE inspired user interface. Further learning proceeds via a sample selection method asking the user to validate examples of federated records. This interaction ends when the system is satisfied it has converged to the exact view definition the user intends. The bottom layer provides the execution mechanism for higher order data manipulations by compiling higher order relational definitions into first-order SQL programs. The top layer component of the architecture assists the learning algorithm by collecting meta-data and catalog statistics. Our primary contributions comprise a taxonomy examining trade-offs between complexity and completeness and identifying various classes of higher order relational data manipulations. The architecture delimits three separate challenges, which must be overcome in order to propose a solution. Our compilation for SchemaSQL proposes novel theoretical complexity guarantees. Type-based vertical partitioning of the meta- data ensures that the result can be properly optimized by existing SQL engines. Sample selection constraints specific to databases require the introduction of a third kind of instance label in the training set of our learner. We derive a new algorithm by modifying Mitchell’s version spaces [Mitchell82] in order to handle this new kind of label. We prove that the modified algorithm preserves the original properties of version spaces and avoids the possibility of deadlock. We introduce a sample selection heuristic that converts catalog statistics into a classic inductive bias. Finally, we develop the Sphinx prototype, carry out experiments and demonstrate the system on an application.