Int. J. Electron. Commun. (AEÜ) 65 (2011) 239–243
LETTER
Rule reduction—A method to improve the compression performance of grammatical compression algorithms

Martin Bokler, Eric Hildebrandt
T-Systems International GmbH, Deutsche-Telekom-Allee 7, D-64295 Darmstadt, Germany
Article history: Received 13 December 2005; accepted 18 February 2010
Keywords: Compression; Grammatical compression; Sequential; Grammar; Coding; Huffman code
Abstract

Many compression algorithms use Huffman codes or arithmetic codes for coding the output of a preceding redundancy reduction step. The technique presented here connects these usually separated steps in order to improve the overall compression performance. By estimating the code word length for each output symbol of the redundancy reduction algorithm, it is checked whether or not using a specific structure found in the first step yields an overall improvement in compression. Hence some structures found in the first step are discarded in the final representation of the data. This improves the compression performance, since the final length of the code words is taken into account. Empirical test results showing the beneficial aspects of rule reduction are presented for a grammatical compression algorithm based on the so-called Sequential algorithm.

© 2010 Elsevier GmbH. All rights reserved.
1. Introduction

Widely used compression programs like gzip attain their good overall compression performance by using a two-stage compression process. In the first stage a pattern matching algorithm detects redundant structures in the input data and represents them by references to their first occurrence, producing a shorter representation of the data. This representation is then coded efficiently in a following step by taking advantage of its probability distribution(s). Grammatical compression algorithms, in a similar vein, employ a grammar deduction algorithm in the first stage that outputs a context free grammar, and use a statistical coder to further reduce the space requirements of the grammar representation. Pattern matching and coding are usually separated from each other and no feedback from the second stage to the first stage is used. This separation is suboptimal, as it is the overall compression that one seeks to optimize: structures found in the first stage that are optimal with respect to the detection of redundancies may nevertheless cause a non-optimal overall compression by unfavorably changing the probability distribution(s) used in the coding stage. The proposed method of rule reduction is a feedback method that improves the overall compression performance by estimating the effects of local changes to the output distribution(s) of the first stage.
We will present rule reduction for a grammatical compression algorithm based on Sequential [1] combined with Huffman coding as the second stage. Nevertheless, the basic principles of rule reduction can easily be adapted to other compression algorithms that generate a grammar, and also to compression schemes that use different (e.g. arithmetic) codes, if a sufficiently good estimate for the code word lengths is available. In particular, it should be possible to apply this procedure to compression algorithms of the Lempel–Ziv family, not only [2] but also [3].1 In this contribution we mainly focus on efficiency improvements for Sequential used in combination with Huffman coding. For our application we need fast decompression and random access to the compressed text; therefore we use a Huffman code instead of arithmetic coding for the improved Sequential [1] algorithm, accepting a slightly larger overall coding size. A similar method of adding a reduction step to the Sequitur [4] algorithm for the compression of DNA sequences was independently developed in [5]. In our paper we focus on the compression of a standard text corpus, present different ways of applying the reduction step and show performance measures concerning compression ratio and execution time.
1 NB: LZ78 produces a rather trivial grammar and LZ77 does not produce a grammar, but uses rules that are statistically coded and therefore might also benefit from rule reduction.
1.1. Notational conventions
Let G be a context free grammar. Let VT be the alphabet of terminal symbols and let VN be the set of non-terminal symbols. We call Σ := VN ∪ VT the alphabet of all symbols, and let Σ∗ denote the set of strings over Σ. For each non-terminal symbol A ∈ VN there exists a grammatical rule A → X with X ∈ Σ∗. In the following we will often refer to the production X simply as the 'right-hand side' of the rule A.
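To make the later sketches concrete, the following minimal Python representation of a grammar is used throughout this rewrite; the names grammar and symbol_frequencies are our own illustration and not part of the paper or the authors' implementation.

```python
from collections import Counter

# Illustration only: a grammar is represented as a mapping from non-terminals
# (strings such as "S0" or "C") to their right-hand sides, i.e. lists of
# symbols from VN ∪ VT.
grammar = {"S0": ["C", "C", "C"], "C": ["A", "B"]}   # e.g. S0 -> CCC, C -> AB

def symbol_frequencies(grammar):
    """Number of occurrences #s of every symbol s in the representation of
    the grammar, i.e. in the concatenation of all right-hand sides."""
    freq = Counter()
    for rhs in grammar.values():
        freq.update(rhs)
    return freq
```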
2. Reduction algorithm

Before we proceed to a formal description of the reduction algorithm, we present the following simple but illustrative example, which motivates the need for grammar reduction.
2.1. Example

Let the input data be the string

ABABAB

Algorithms like Sequitur [4] or Sequential [1], which deduce context free grammars from arbitrary strings, will generate for this string a grammar like the following:

S0 → CCC
C → AB

Here S0 is the so-called start rule and C is a grammatical rule to be substituted into S0. For an efficient representation of the grammar it is sufficient to save only the right-hand sides of the rules, provided there is an escape symbol to separate the rules. As the data consists of six symbols and the right-hand sides contain five symbols, we could expect to obtain a compression by using this grammar. The output of the first step of compression could therefore be

CCC|AB

where | is the above mentioned escape symbol. In the second step this sequence has to be coded. Possible Huffman code words are:

C → 1
| → 01
A → 000
B → 001

Hence the compressed text would be 11101000001, i.e. 11 bits. On the other hand, if we just used the trivial grammar

S0 → ABABAB

this grammar could be represented by ABABAB. Possible code words are

A → 1
B → 0

Therefore the "compressed" text would now be represented by the binary string 101010, with only 6 bits. This shows that using the structure AB does not improve the overall compression performance: in this case the detection of a redundant structure in the input data does not lead to a better overall compression. Admittedly this example is somewhat artificial, as the input sequence is very short and contains only two different symbols; nevertheless the general problem of how to optimize a grammar with regard to a minimal code size is also relevant for large input data streams with many different symbols.
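As a quick check of the bit counts in this example, the following sketch (ours, not part of the paper) computes Huffman code word lengths for both representations. The individual code words may differ from the ones given above, but the totals of 11 and 6 bits agree; the helpers huffman_code_lengths and coded_size are hypothetical names of our own.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freq):
    """Map each symbol to the length of a Huffman code word for the given
    symbol frequencies (each merge of two subtrees adds one bit)."""
    if len(freq) == 1:
        return {s: 1 for s in freq}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freq}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, j, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, j, syms1 + syms2))
    return lengths

def coded_size(symbols):
    """Total number of bits when the sequence is Huffman coded."""
    freq = Counter(symbols)
    lengths = huffman_code_lengths(freq)
    return sum(lengths[s] * n for s, n in freq.items())

print(coded_size("CCC|AB"))   # 11 bits for the grammar S0 -> CCC, C -> AB
print(coded_size("ABABAB"))   #  6 bits for the trivial grammar
```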
2.2. The basic reduction step

After a context free grammar has been deduced from the input data, it is checked for every grammar rule A → X, A ∈ VN, X ∈ Σ∗, whether the use of this rule actually improves the compression performance, i.e. whether the code for a grammar G with this rule is shorter than the code for a grammar G′ which equals G except that every occurrence of A in G is replaced by X and the rule for A is therefore obsolete. Unfortunately, the complete codes for both grammars G and G′ would have to be considered to determine exactly which of the two grammars leads to the more compact coding. Doing this for every single rule of G would be computationally inefficient, hence we use an estimate. When we examine a single grammar rule A at a time, we consider only the sum of the code word lengths for all occurrences of A, plus the sum of the code word lengths of the symbols of X used in the definition of A, as an estimate for the overall code length contribution of A in G. The analogous code length contribution in G′ is just the sum of the code word lengths of the symbols in X times the number of occurrences that the rule A would have had. Let lG be a function serving as an estimate for the length of a code word, i.e. a mapping of each symbol occurring in the grammar G onto the length of the corresponding Huffman code word. Let #A be the number of occurrences of the symbol A in G. The rule for A, say A → X, where X stands for an arbitrary string over the alphabet of all symbols, will only remain in the grammar G when the following easily checked inequality is fulfilled:
#A · lG(A) + lG(X) + offset ≤ #A · lG(X),

where lG is extended to strings by summing the code word lengths of their symbols, and offset denotes the extra bits needed to code the rule for A, e.g. to mark the end of the rule. The offset depends on the way the grammar is coded and on properties of the rule for A (length of X, ...).2 Note that this estimate does not take into account the effects on the code words of the other symbols: the code for G′ contains one code word (the one for the symbol A) fewer than the code for G, and generally code words are shorter if the number of different code words is smaller.
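The check itself is a one-line comparison. The following sketch is our reading of the inequality above, not the authors' code; counts[A] is #A, length_of(s) an estimate of the code word length of symbol s, and offset the extra bits needed to code the rule (cf. footnote 2).

```python
def rule_is_worth_keeping(A, rhs, counts, length_of, offset):
    """Keep the rule A -> rhs iff coding the rule and all references to A
    is estimated to be no longer than expanding A everywhere."""
    rhs_length = sum(length_of(s) for s in rhs)              # l_G(X)
    keep_cost = counts[A] * length_of(A) + rhs_length + offset
    drop_cost = counts[A] * rhs_length
    return keep_cost <= drop_cost
```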
2.3. Estimation of code lengths

For using the inequality above we need an explicit expression for the estimation function lG, which maps symbols of Σ to the lengths of their corresponding code words. Unfortunately, for Huffman codes the length of a code word depends on the frequencies of all code words. Since it is again inefficient to consider the frequencies of all code words in every step, an estimate that can easily be calculated from the frequency of the code word alone is called for. We suggest using a Shannon-Fano code (cf. [6]) as such an estimate for the code word length l. Shannon-Fano codes use only the probability of a symbol to determine the corresponding code word. Admittedly, the code lengths of the Shannon-Fano code equal the code lengths of the Huffman code only in special cases. For a symbol A with probability of occurrence p and an output coding alphabet of r code symbols,3 the length of the corresponding Shannon-Fano code word is an integer l such that

logr(1/p) ≤ l ≤ logr(1/p) + 1.

We define lG(A) := l. An alternative estimate for the code lengths is to compute the exact code for the grammar at the beginning and then to use a list of the minimal frequency for each code length in the following way: if the frequency n of a symbol is known, this list is searched for the smallest frequency f larger than n. Let lf be the code length of the symbol with frequency f. Then the estimate ln for the code length of the symbol with frequency n is set to lf − 1.

2 In our implementation we mark right-hand sides that contain more than two symbols with an escape symbol before and behind the rule. Hence if the right-hand side of a rule contains two symbols, the offset equals 0; otherwise it equals twice the length of the code word for the escape symbol.
3 For the usual binary alphabet, logr will be log2, i.e. the logarithm to base two.
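A minimal sketch of the Shannon-Fano length estimate for the usual binary code alphabet (r = 2) could look as follows; this is our own illustration, and the function name estimated_code_length is hypothetical.

```python
import math

def estimated_code_length(symbol_count, total_count):
    """Return an integer l with log2(1/p) <= l <= log2(1/p) + 1, where
    p = symbol_count / total_count is the probability of the symbol."""
    p = symbol_count / total_count
    # ceil(log2(1/p)) satisfies the bound; we never return less than 1 bit.
    return max(1, math.ceil(math.log2(1.0 / p)))
```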
Table 1
Compression performance in bits per character for the files of the Canterbury Corpus and the large corpus.

File           bzip2   gzip   gzip -9   grzip -0   grzip -1   grzip -2   grzip*   grzip**   grzip***
alice29.txt     2.27   2.86      2.85       3.00       2.86       2.86     2.86      2.86       2.87
asyoulik.txt    2.52   3.13      3.12       3.26       3.08       3.08     3.08      3.08       3.09
cp.html         2.48   2.60      2.60       3.14       2.95       2.95     2.97      2.96       2.97
fields.c        2.18   2.26      2.25       2.88       2.76       2.76     2.77      2.76       2.77
grammar.lsp     2.76   2.67      2.68       3.56       3.37       3.37     3.40      3.38       3.41
kennedy.xls     1.01   1.61      1.63       1.83       1.82       1.82     1.82      1.82       1.83
lcet10.txt      2.02   2.72      2.71       2.62       2.50       2.50     2.51      2.50       2.51
plrabn12.txt    2.42   3.24      3.23       3.04       2.89       2.89     2.89      2.90       2.90
ptt5            0.77   0.88      0.82       0.96       0.90       0.90     0.90      0.90       0.91
sum             2.70   2.70      2.67       3.35       3.17       3.17     3.18      3.17       3.18
xargs.1         3.33   3.32      3.32       4.18       3.96       3.94     3.97      3.97       4.00
Cant. Corpus    1.54   2.08      2.08       2.17       2.09       2.09     2.09      2.09       2.13
E. coli         2.16   2.31      2.24       2.22       2.05       2.04     2.05      2.05       2.11
bible.txt       1.67   2.35      2.33       1.95       1.86       1.86     1.87      1.87       1.87
world192.txt    1.58   2.34      2.33       1.92       1.84       1.84     1.85      1.84       1.85
large           1.85   2.33      2.29       2.05       1.94       1.93     1.94      1.94       1.97
Table 2
Estimated compression in bits per character for the files of the Canterbury Corpus and the large corpus when arithmetic coding is used (the measure H(G)/|X| of Section 3.1).

File           grzip -0   grzip -1   grzip -2
alice29.txt        2.99       2.86       2.86
asyoulik.txt       3.24       3.08       3.08
cp.html            3.11       2.92       2.92
fields.c           2.81       2.71       2.71
grammar.lsp        3.37       3.19       3.19
kennedy.xls        1.82       1.82       1.82
lcet10.txt         2.61       2.50       2.50
plrabn12.txt       3.04       2.89       2.89
ptt5               0.96       0.90       0.90
sum                3.32       3.14       3.14
xargs.1            4.00       3.79       3.79
E. coli            2.21       2.03       2.03
bible.txt          1.95       1.86       1.86
world192.txt       1.92       1.84       1.84
2.4. The algorithm and its implementation variants

The basic rule reduction procedure proceeds in two logical steps:

(1) After the grammar deduction algorithm has generated a context free grammar, the frequency distribution of the symbols used in the representation of the grammar is calculated. Most grammar deduction algorithms, especially Sequitur and Sequential, allow the frequency distribution to be updated efficiently in parallel to the grammar deduction step, so an additional scan through the representation of the grammar is often not necessary.
(2) Then for every rule of the grammar the basic reduction step is executed. As the frequencies of the symbols change with every deleted rule, the order in which the rules are checked and the number of checks have an impact on the reduced grammar. There are different ways of implementing the basic reduction step, which differ in the following regards:
(a) Order of execution. When the grammar is deduced, a natural order of the rules arises. The basic reduction step can be applied (i) starting with the rules generated last, (ii) in the order of the original generation of the rules, or (iii) in some other chosen order.
(b) Adaptation of code lengths during reduction. If a symbol on the right-hand side of a rule examined in the basic reduction step is already marked for reduction, this can be taken into account during the computation of the code lengths, i.e. by replacing this symbol with its right-hand side. Note that a recursive algorithm is needed for this, as the inserted right-hand side may contain symbols that are also marked for reduction.
(c) Number of executions. The basic step may be executed multiple times for the remaining rules of the grammar.

A sketch of one such reduction pass is given below.
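The following sketch shows one backward reduction pass, corresponding to variant 2(a)(i) with the frequency changes ignored, i.e. the configuration used by grzip -1 in Section 3.1. It is our own illustration, not the authors' implementation; it reuses the grammar representation, symbol_frequencies and rule_is_worth_keeping from the earlier sketches, and offset_of is a hypothetical helper implementing the offset convention of footnote 2.

```python
def reduce_rules(grammar, length_of, offset_of, start="S0"):
    """One backward reduction pass (variant 2(a)(i), frequency changes
    ignored): delete every rule whose removal is estimated to shorten the
    code, expanding its occurrences in the remaining right-hand sides."""
    counts = symbol_frequencies(grammar)     # computed once, then kept fixed
    # Dict insertion order is assumed to reflect the order of rule generation.
    for A in reversed([r for r in grammar if r != start]):
        rhs = grammar[A]
        if rule_is_worth_keeping(A, rhs, counts, length_of, offset_of(rhs)):
            continue
        for B in grammar:                    # expand every occurrence of A
            grammar[B] = [x for s in grammar[B]
                          for x in (rhs if s == A else [s])]
        del grammar[A]                       # the rule for A is now obsolete
    return grammar
```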
3. Test results

3.1. The effects of rule reduction on the compression performance

For the empirical evaluation of the different variants of rule reduction we use the files of the Canterbury Corpus [7] as a test sample. As grammatical compression algorithms generally perform better when the files are large and have sufficiently high redundancy, we also include additional large files4 in our test sample that are commonly used together with the Canterbury Corpus. Besides our grammatical compression algorithm, called grzip, we also quote results for the well known compression programs bzip2 and gzip as an orientation for the overall performance of our grammatical compression scheme.
4 The large corpus consists of the files E.coli (complete genome of the E. coli bacterium), bible.txt (the King James version of the Bible) and world192.txt (the CIA world fact book), cf. [8].
Table 3
Run-time requirements in seconds.

                         bzip2   gzip   gzip -9   grzip -0   grzip -1   grzip -2
Compress Canterbury       1.68   0.84      5.99       1.48       1.54       1.69
Decompress Canterbury     0.56   0.09      0.08       0.18       0.16       0.15
Compress large           10.85   6.77     24.14       8.01       8.61       8.53
Decompress large          3.51   0.24      0.30       0.61       0.51       0.50
The following algorithms are therefore used in the test suite:

• bzip2 [9], a compression algorithm using the Burrows–Wheeler transform, cf. [10], with its default compression settings.
• gzip, the commonly used implementation [11] of the LZ77 algorithm [3], with its default compression settings.
• gzip -9, the gzip algorithm with a more thorough search for long pattern matches. This often leads to better compression than gzip, but increases the run time.
• grzip -0, the Sequential algorithm [1] with canonical Huffman coding [12,13] of the symbols. The grammar is converted to its canonical form before being coded.
• grzip -1, the algorithm of grzip -0, except that after generation of the grammar the basic reduction step is applied once for every rule, starting with the rules generated last, i.e. backwards.
• grzip -2, the same algorithm as grzip -1, but the basic reduction step is applied twice for every rule, starting with the rules generated last.
• grzip*, the same algorithm as grzip -1, except that the basic reduction step is applied in the order in which the rules were generated, i.e. the rules generated first are checked first, cf. Section 2.4, rule reduction variant 2a.
• grzip**, the same algorithm as grzip -1, except that the changes of the frequencies caused by the deleted rules are taken into account during execution of the basic step, cf. Section 2.4, rule reduction variant 2b.
• grzip***, the same algorithm as grzip -1, except that we use the lengths of a Huffman code computed before reduction as an estimate for the code lengths, cf. Section 2.3.

As a quantitative measure of the compression performance we use bits per character, i.e. the ratio of the number of bits of the compressed file to the number of characters in the uncompressed file. The results are shown in Table 1.

Let X be the uncompressed text represented by the grammar G. Let A be the alphabet used to express G, and for s ∈ A let #s be the number of occurrences of s in the representation of the grammar. Let |X| be the length of the uncompressed text. Set sum := Σs∈A #s and compute

H(G) = Σs∈A #s · log2(sum/#s).
The value H(G)/|X| might serve as a measure of the compression rate of the grzip algorithms when arithmetic coding is used.5 Any additional overhead needed to store the compressed data in a file is neglected here; in our implementation this overhead is quite small, around 80 bytes. The values of H(G)/|X| for the files of the Canterbury Corpus [7] and the large corpus are presented in Table 2.
5 Our definition differs slightly from the normalized grammar entropy defined in [14] to allow a direct comparison with the values shown in Table 1.
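The measure H(G)/|X| is straightforward to compute from the grammar representation. The sketch below is our own reading of the definition above, reusing the symbol_frequencies helper from the Section 1.1 sketch; escape symbols and the small file overhead are ignored, and the function name bits_per_character is hypothetical.

```python
import math

def bits_per_character(grammar, text_length):
    """The measure H(G)/|X| defined above."""
    freq = symbol_frequencies(grammar)            # #s for every symbol s
    total = sum(freq.values())                    # sum := sum of all #s
    h_g = sum(n * math.log2(total / n) for n in freq.values())
    return h_g / text_length
```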
3.2. Run time requirements

As the compression performance of grzip*, grzip** and grzip*** is generally worse than that of grzip -1 and their run times are similar to that of grzip -1, they are omitted from Table 3. The tests were carried out on a standard PC equipped with a Pentium III processor running at a clock speed of 800 MHz, with 512 KByte cache and 256 MByte RAM. As there are considerable time variations between different runs of the same algorithm, the values presented here serve only as a rough comparative measure of run-time efficiency. For every algorithm we measured the time it takes to compress and decompress all files of the Canterbury Corpus and the large corpus. All values in Table 3 denote the execution time in seconds.
3.3. Discussion of the results

As expected, the results for the compression performance vary for the different types of data in the Canterbury Corpus. Nevertheless, for all files our rule reduction algorithm improves the compression performance. Taken together, the compressed files of the Canterbury Corpus are 3.6% smaller and those of the large corpus even 5.9% smaller; savings of up to 8% were found. The reduction algorithm works best if the basic step is applied backwards (cf. 2(a)(i)) and if the changes to the frequencies are ignored during execution (cf. 2b). The best compromise between execution time and compression performance seems to be to apply the basic reduction step once for every rule, as a second reduction yields only a minor improvement. Using the Huffman code lengths computed for the grammar before reduction as an estimate for the code lengths (cf. Section 2.3) leads to slightly worse compression; hence grzip uses Shannon-Fano codes to estimate code word lengths.

Compared to gzip, the grzip algorithms with rule reduction offer similarly good compression for small files and better compression for large files. grzip requires approximately twice the time of gzip for compression and decompression. The main disadvantage of grzip compared to gzip is the large amount of memory required by grzip during compression. In comparison to bzip2, the compression performance of grzip is somewhat worse, but there are types of files for which grzip compresses better. The main advantage of grzip compared to bzip2 is the faster compression and especially decompression. Our implementation of the rule reduction algorithm is not optimized, but as the algorithm is quite fast compared to the deduction of the grammar and the statistical coding, we did not consider further optimization worthwhile.
4. Conclusion

We have found that the rule reduction procedure described in this paper consistently improves the compression performance of grammatical compression algorithms, by up to eight percent. The procedure is quite fast and does not require a substantial memory overhead; therefore it can easily be placed as an additional optimization step between an arbitrary grammatical deduction algorithm and a subsequent (canonical) Huffman coding scheme. While presented here for grammatical compression only, similar processing should also improve other two-stage compression schemes that use Huffman codes for coding the output symbols of a redundancy reduction stage.
Acknowledgement

We acknowledge valuable comments from Klaus Huber concerning the estimation of code lengths.

References

[1] Yang E, Kieffer JC. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform—Part One: Without context models. IEEE Trans Inf Theory 2000;46(3):755–77.
[2] Ziv J, Lempel A. Compression of individual sequences via variable length coding. IEEE Trans Inf Theory 1978;24:530–6.
[3] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Trans Inf Theory 1977;23:337–43.
[4] Nevill-Manning CG, Witten IH. Identifying hierarchical structure in sequences: a linear-time algorithm. J Artif Intell Res 1998;7:67–82.
[5] Cherniavsky N, Ladner R. Grammar-based compression of DNA sequences. In: DIMACS working group on the Burrows–Wheeler transform: ten years later; 2004.
[6] Hamming RW. Coding and information theory. 2nd ed. Prentice Hall; 1986.
[7] http://corpus.canterbury.ac.nz/.
[8] http://corpus.canterbury.ac.nz/descriptions/.
[9] http://sources.redhat.com/bzip2/.
[10] Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. DEC SRC Res Rep 1994;124.
[11] http://www.gzip.org/zlib/.
[12] Witten IH, Moffat A, Bell TC. Managing gigabytes: compressing and indexing documents and images. 2nd ed. San Diego: Morgan Kaufmann; 1999.
[13] Hirschberg DS, Lelewer DA. Efficient decoding of prefix codes. Commun ACM 1990;33(4):449–59.
[14] Yang E, Kieffer JC. Grammar based codes: a new class of universal lossless source codes. IEEE Trans Inf Theory 2000;46(3):737–54.

Martin Bokler studied mathematics and informatics at Justus-Liebig-University, Giessen. He did his Ph.D. work in the area of projective geometries over finite fields. In 2002 he joined the Technologiezentrum of Deutsche Telekom AG, working in the department for security research and development. He left Deutsche Telekom AG in 2008 and currently works as a teacher at the Heinrich-Emanuel-Merck-Schule, Darmstadt.

Eric Hildebrandt studied physics and mathematics at Goethe University of Frankfurt am Main. From 1993 to 1998 he was with the Institute of Physics at Goethe University, where he did his Ph.D. work in the area of experimental quantum optics. In 1998 he joined the Technologiezentrum of Deutsche Telekom AG, working in the department for security research and development. He is currently working as a security consultant at T-Systems International GmbH. Since 2007 he has lectured on information theory, data compression and machine learning at Goethe University.