Using partial string matching for coding of hierarchical grammars

Date

2002-12

Journal Title

Journal ISSN

Volume Title

Publisher

Texas Tech University

Abstract

Text data can be compressed by extracting repeated strings in the input. Therefore, the input can be expressed using a context-free grammar, where each repeated string is denoted by a rule of the grammar. Sequitur [3] is one such method that derives a context-free grammar, which generates the given input precisely as its language. Sequitur is a two-pass method; in the first pass, it determines a symbolic representation of the grammar, and in the second pass, it encodes this representation using Arithmetic coder to realize a very efficient compressed representation of the input.

This thesis presents a complete and robust implementation of Sequitur. First, it presents the design and then gives a detailed description of the algorithm. Next, it presents a context based coding of the grammar; the purpose of this coding method is to code the next symbol in the context of the input character, which is the last character generated by the preceding rule. This coding method is known as partial matching method, and hence this implementation of sequitur is termed here as sequitur-PM.

The thesis presents a number of compression results; the first set of results is given to show that our implementation is able to attain as good compression ratio as presented in [1] for all benchmark test input. The next set of results is given to show further compression obtained by using PM method. These results where on par with the published results [3].

Description

Citation