Extracting structured data from web pages with maximum entropy segmental Markov models




Texas Tech University


Conventional ways of retrieving information from web pages are time-consuming. A possible solution is to integrate useful data from across the Internet under uniform schemas so that people can easily access and query the data with relational database techniques. Many approaches have been proposed to solve this problem. Based on the degree of user involvement, these approaches can be classified into three categories: manual, semi-automatic, and automatic.

This dissertation proposes a novel semi-automatic approach, based on the maximum entropy segmental Markov model, to extract structured data from web pages. Its main purpose is to overcome two shortcomings of current semi-automatic approaches: they require many training web pages, and the models (or templates) they learn are too general or too specific. The approach decreases the number of training web pages by modeling the sequences that embed structured data instead of their context. These sequences are modeled with segmental Markov models, in which each state corresponds to a subsequence embedding one data item. Finally, the maximum entropy principle is applied to learn the transition distributions, which prevents the learned models from being too general or too specific. The approach can therefore reduce users' labor in preparing training data while maintaining good performance. Experimental results on thirty web sites show that this approach performs better than Stalker, a well-performing semi-automatic approach, when only one training web page is provided.
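
To make the idea of one state per subsequence concrete, the following is a minimal sketch of segmental (semi-Markov) Viterbi decoding over a token sequence. It is not the dissertation's implementation. The function names, the token list, the state names `name` and `price`, and the hand-set log-scores in `toy_score` are all hypothetical; in the approach described above, the transition distributions would be learned with the maximum entropy principle rather than fixed by hand.

```python
import math

def segmental_viterbi(tokens, states, score, max_len):
    """Find the highest-scoring segmentation of `tokens` into labeled segments.

    `score(prev_state, state, segment)` returns a log-score for emitting
    `segment` in `state` after `prev_state` (None for the first segment).
    Returns a list of (start, end, state) tuples, end exclusive.
    """
    n = len(tokens)
    # best[i][s]: best log-score for tokens[:i] when the last segment is labeled s
    best = [{s: -math.inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = tokens[start:end]
            for s in states:
                # The first segment has no predecessor state.
                prevs = [None] if start == 0 else states
                for p in prevs:
                    base = 0.0 if p is None else best[start][p]
                    cand = base + score(p, s, segment)
                    if cand > best[end][s]:
                        best[end][s] = cand
                        back[end][s] = (start, p)
    # Trace back from the best final state.
    segments = []
    end = n
    s = max(states, key=lambda x: best[n][x]) if n else None
    while end > 0:
        start, prev = back[end][s]
        segments.append((start, end, s))
        end, s = start, prev
    segments.reverse()
    return segments

# Hypothetical scoring: each data item is assumed to be wrapped in its own tag pair.
TAGS = {"name": ("<b>", "</b>"), "price": ("<i>", "</i>")}

def toy_score(prev, state, segment):
    open_tag, close_tag = TAGS[state]
    emit = 0.0 if (segment[0] == open_tag and segment[-1] == close_tag) else -5.0
    trans = -1.0 if prev == state else 0.0  # mildly discourage repeating a state
    return emit + trans

tokens = ["<b>", "Title", "</b>", "<i>", "$9", "</i>"]
result = segmental_viterbi(tokens, ["name", "price"], toy_score, max_len=3)
print(result)
```

Each returned tuple covers a whole subsequence (tags plus text) rather than a single token, which is what distinguishes a segmental model from an ordinary per-token Markov model.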