Information Processing Letters 70 (1999) 223–228
Adding state merging to the DMC data compression algorithm

Matthew Young-Lai
Computer Science Department, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Email: [email protected]

Received 26 March 1998; received in revised form 28 April 1999
Communicated by D. Gries
Abstract

Dynamic Markov Compression (DMC) generates Markov models for data compression by starting with an initial model and then expanding it one state at a time using a cloning operation. Typically, expansion continues until memory is full, at which point the model is discarded and restarted. We present an alternative method that contracts the model by selectively merging similar states. This requires extra time, but allows DMC to attain its best compression using much less memory. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Data structures; Dynamic Markov Compression
1. Introduction

1.1. Data compression

Two broad categories of lossless data compression methods are dictionary and statistical coding. Dictionary methods use heuristics to construct dictionaries of common substrings in the data, then achieve compression by replacing these substrings with pointers into the dictionary. An example of a dictionary compression algorithm is LZ77, which is used in the widely available utility gzip. Statistical methods construct predictive models for the data. At any point in the input, a model is based on the data seen so far and attempts to predict the next character. A prediction takes the form of a vector of probabilities, one for each character in the input alphabet. The probabilities must sum to 1.0, and the higher the probability predicted for a
character that actually occurs, the less space is needed to encode that character, according to the relationship size = log2(1/p), where size is the number of bits required and p is the predicted probability. This theoretical minimum size can be approximated arbitrarily closely using arithmetic coding [5]. Therefore, a better modeling technique directly translates into a better compressor.
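As a quick illustration of this relationship (not part of the original paper), the following C fragment prints the ideal code length for a few example probabilities; the specific probability values are arbitrary.

    #include <math.h>
    #include <stdio.h>

    /* Ideal code length in bits for a symbol predicted with probability p. */
    static double code_length(double p)
    {
        return log2(1.0 / p);   /* equivalently -log2(p) */
    }

    int main(void)
    {
        const double probs[] = { 0.5, 0.9, 0.99, 0.01 };
        for (size_t i = 0; i < sizeof probs / sizeof probs[0]; i++)
            printf("p = %.2f -> %.3f bits\n", probs[i], code_length(probs[i]));
        return 0;
    }

A symbol predicted with probability 0.99 costs well under a tenth of a bit, while one predicted with probability 0.01 costs more than six bits, which is why prediction accuracy translates directly into compression.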
1.2. Dynamic Markov Compression

Dynamic Markov Compression (DMC) is a modeling technique that achieves state-of-the-art compression performance by constructing a Markov model for the data [4]. Informally, a Markov model is a finite automaton that associates a probability with every transition. DMC is most efficient and conceptually simplest when used to model the data as a stream of bits rather than characters. This way, every state in the model has only two outgoing transitions: 0 and 1. We start with an initial model in which any state can be reached from any other. Fig. 1 shows two examples. Every transition has an associated counter. Each bit of input results in a transition from the current state to a new state, and an increment of the associated counter. The counters are used to generate predictions: the probability of a 0 (or 1) transition is the value of the 0 counter (or 1 counter) divided by the sum of the two counters at the current state. A cloning operation is used to change the model structure by replicating nodes, as demonstrated in Fig. 2. The counters of C′ are set so that its predictions are initially the same as those of C (the hope, of course, is that they will eventually diverge and thus model the data more accurately). The counters of both C and C′ are scaled so that their total outgoing frequencies remain equal to their incoming frequencies (a condition analogous to Kirchhoff's law). The decision to clone is made with a simple heuristic: if the count on a transition exceeds a chosen threshold, clone the destination state. The total incoming count of the destination node is also required to exceed the transition count by another threshold since, for example, it makes no sense to clone a node with only one incoming transition.
Fig. 1. A 1-state and a 4-state initial model.
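To make the preceding description concrete, here is a minimal C sketch of a DMC node, its prediction, and the cloning step. The structure layout, the field names, and the two threshold constants are illustrative assumptions rather than the reference implementation.

    #include <stdlib.h>

    /* Illustrative thresholds; the paper does not fix particular values here. */
    #define CLONE_THRESHOLD  2.0f   /* minimum count on the transition        */
    #define CLONE_EXCESS     2.0f   /* minimum other incoming traffic at dst  */

    typedef struct Node Node;
    struct Node {
        float counts[2];   /* frequencies of the 0 and 1 transitions   */
        Node *next[2];     /* destinations of the 0 and 1 transitions  */
    };

    /* Predicted probability of bit b at node n. */
    static double predict(const Node *n, int b)
    {
        return n->counts[b] / (double)(n->counts[0] + n->counts[1]);
    }

    /* Follow bit b from cur, cloning the destination when the heuristic fires,
       then record the observed bit.  Returns the new current state. */
    static Node *transition(Node *cur, int b)
    {
        Node *dst = cur->next[b];
        /* By the Kirchhoff-like invariant, dst's outgoing total equals its
           incoming total, so it serves as the incoming frequency here. */
        float in = dst->counts[0] + dst->counts[1];

        if (cur->counts[b] > CLONE_THRESHOLD &&
            in - cur->counts[b] > CLONE_EXCESS) {
            Node *clone = malloc(sizeof *clone);
            if (clone != NULL) {
                float ratio = cur->counts[b] / in;
                /* The clone initially predicts like dst; counts are split so
                   that outgoing totals still match incoming totals. */
                clone->counts[0] = dst->counts[0] * ratio;
                clone->counts[1] = dst->counts[1] * ratio;
                dst->counts[0] -= clone->counts[0];
                dst->counts[1] -= clone->counts[1];
                clone->next[0] = dst->next[0];
                clone->next[1] = dst->next[1];
                cur->next[b] = clone;
                dst = clone;
            }
        }
        cur->counts[b] += 1.0f;
        return dst;
    }

Splitting the destination's counts in proportion to the traffic arriving through the cloned transition is what preserves the balance between incoming and outgoing frequencies.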
1.3. Overview of the proposed method

The main problem with DMC is that the model continues to grow without bound, with the following consequences:
(1) Separate components of the model can come to represent the same sequences in the data. These occupy unnecessary space and record less accurate frequency information, since they are visited less often separately than they would be together.
(2) Memory limitations eventually require that the model be frozen, or that it be discarded and restarted.
Techniques exist for reducing the impact of the second problem: a history buffer can be maintained to partially rebuild the model before proceeding after a flush, or a second model can be built starting when the first is half full and switched to when the first is discarded. Neither of these techniques addresses the first problem, and both reduce the maximum size of the model assuming available memory is constant (although it is possible for the tradeoff to give better overall performance).

The method introduced here does not throw out the model when a memory limit is reached. Rather, it recovers nodes by merging those that appear similar. This addresses both of the problems above. The decision to merge is made on the basis of a heuristic just as simple as that used for cloning. Even so, the set of node pairs that we compare for merging is quite restricted, both for efficiency reasons (we do not want to take quadratic time comparing every pair) and for semantic reasons (the nature of the model produced by DMC is such that arbitrary merges will usually result in nondeterminism). The following section gives a detailed description of the method.
2. A state merging method

2.1. Testing equivalence
Fig. 2. An example of the cloning operation. The new node C′ is cloned from C.
Define Nx to be the frequency of the symbol x recorded at node N. Then consider two nodes N, M and their frequencies N0, N1, M0, M1. The absolute difference in the probability predictions at the two nodes is:
probDiff = |N0/(N0 + N1) − M0/(M0 + M1)| = |N1/(N0 + N1) − M1/(M0 + M1)|.

We define two nodes as equivalent if probDiff is less than some threshold. (We experimented with both simpler and more sophisticated equivalence criteria; simpler ones gave worse performance, while more sophisticated ones gave equivalent compression but were slower.) Note that a node is always equivalent to itself.

Let δ(N, x) denote the state that the automaton ends in when it starts in state N and reads the string x. Two nodes N, M are defined to be mergeable if they are equivalent and, for every string x ∈ {0, 1}*, δ(N, x) is equivalent to δ(M, x). Thus an algorithm for testing whether two nodes N, M are mergeable is:
(1) if N = M return true;
(2) if NOT equivalent(N, M) return false;
(3) if NOT mergeable(δ(N, 0), δ(M, 0)) return false;
(4) if NOT mergeable(δ(N, 1), δ(M, 1)) return false;
(5) return true.
There is no guarantee that the recursion will terminate for all choices of N and M. Therefore only a subset of all possible pairs of nodes should be tested.
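As an illustration only, a direct C transcription of this test might look like the following. The Node layout, the field names, and the EQUIV_THRESHOLD value are assumptions, and, as noted above, the recursion is only guaranteed to terminate for the restricted pairs discussed in the next subsection.

    #include <math.h>
    #include <stdbool.h>

    #define EQUIV_THRESHOLD 0.05   /* illustrative value only */

    typedef struct Node Node;
    struct Node {
        float counts[2];   /* frequencies of the 0 and 1 transitions */
        Node *next[2];     /* delta(N, 0) and delta(N, 1)            */
    };

    /* probDiff test: do the two nodes make sufficiently similar predictions? */
    static bool equivalent(const Node *n, const Node *m)
    {
        double pn = n->counts[0] / (double)(n->counts[0] + n->counts[1]);
        double pm = m->counts[0] / (double)(m->counts[0] + m->counts[1]);
        return fabs(pn - pm) < EQUIV_THRESHOLD;
    }

    /* Direct transcription of steps (1)-(5).  The recursion bottoms out when
       both parses reach the same node, since a node is equivalent to itself. */
    static bool mergeable(const Node *n, const Node *m)
    {
        if (n == m)
            return true;                        /* (1) */
        if (!equivalent(n, m))
            return false;                       /* (2) */
        if (!mergeable(n->next[0], m->next[0]))
            return false;                       /* (3) */
        if (!mergeable(n->next[1], m->next[1]))
            return false;                       /* (4) */
        return true;                            /* (5) */
    }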
2.2. Which nodes to test?

We show that termination of the recursion is guaranteed when a node is compared to the node from which it was originally cloned.

Lemma 1. If node N is cloned from node M by the DMC cloning operation, then the following assertion holds regardless of how many subsequent clonings take place: ∃n such that ∀x ∈ {0, 1}*, |x| > n ⇒ δ(N, x) = δ(M, x). (Actually, a small semantic change must be made to the cloning operation if the initial model contains self-referential states.)

Proof. By induction on m, the number of subsequent cloning operations.
• Base case (m = 0): Since N was just cloned from M, δ(N, 0) = δ(M, 0) and δ(N, 1) = δ(M, 1), so n = 1 satisfies the assertion.
• Induction hypothesis: The assertion holds for all m < k.
• Induction step: There are two cases, depending on what effect the kth cloning operation has on δ:
  – Case 1: ∀x, |x| > n, there is no effect on δ(N, x) or δ(M, x). Obviously, the assertion continues to hold with the same value of n.
  – Case 2: ∃x, |x| > n, for which δ(N, x) ≠ δ(M, x). Then the state δ(N, x) must just have been cloned from δ(M, x), or vice versa. This means that δ(N, x.0) = δ(M, x.0) and δ(N, x.1) = δ(M, x.1) (where x.a denotes concatenation of a to x). So the assertion must hold for n = |x| + 1. □

Thus, parsing a bit string starting from two different nodes, one of which was previously cloned from the other, is guaranteed to eventually reach the same state. Since a state is equivalent to itself, the recursion in the test for whether two nodes are mergeable is guaranteed to terminate if one node was cloned from the other.

2.3. Implementation details

An extra pointer is needed in each node to locate the node from which it was originally cloned. This increases the total number of pointers per node from two to three. Two of the pointers in a node are also used when the node is unallocated: one points to the next node in the free list (i.e., memory management), and the other is used to resolve dead links after a merge phase (i.e., it stores the new destination for any link that points to the deleted node).

Initially, the model is allowed to grow normally. When the number of nodes reaches the limit of available memory, every node is tested for merging with the node from which it was originally cloned. There is no clear semantically correct choice of testing order, so we simply use the most efficient one: testing the nodes in the order they are stored.

The worst-case time complexity of testing every node in this way is O(n²), where n is the number of nodes, because every test can potentially recurse through O(n) other nodes. On the other hand, if no nodes test equivalent then no recursion takes place and the time is O(n). In practice, the time requirement falls closer to the latter: the recursion usually terminates quickly when two nodes are found not to be equivalent, and nodes are merged immediately upon being determined mergeable, leaving fewer nodes to be compared later.
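The merge phase itself can be sketched as a single sweep over the node array. The clonedFrom field, the in_use array, and the merge() helper below are hypothetical names introduced for illustration, not identifiers from the paper's implementation.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Node Node;
    struct Node {
        float counts[2];    /* transition frequencies                   */
        Node *next[2];      /* 0 and 1 successors                       */
        Node *clonedFrom;   /* node this one was originally cloned from */
    };

    /* Assumed helpers: the recursive test of Section 2.1, and an operation that
       folds node n into node m -- summing counts, recording a forwarding pointer
       so dead links can be resolved, and returning n to the free list. */
    bool mergeable(const Node *n, const Node *m);
    void merge(Node *n, Node *m);

    /* One merge phase: sweep the nodes in storage order and merge each one with
       its clone origin when the test succeeds.  Merging immediately leaves fewer
       nodes to compare later, which keeps the phase near-linear in practice. */
    static void merge_phase(Node *nodes, size_t num_nodes, const bool *in_use)
    {
        for (size_t i = 0; i < num_nodes; i++) {
            Node *n = &nodes[i];
            if (!in_use[i] || n->clonedFrom == NULL)
                continue;
            if (mergeable(n, n->clonedFrom))
                merge(n, n->clonedFrom);
        }
    }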
Fig. 3. Model growth and contraction during compression.
2.4. Dynamic thresholds

Without merging, the average frequency of a node in the model always remains roughly constant: every bit of input increases the total frequency by 1; the number of nodes increases linearly with the total frequency; and, when the maximum model size is reached, frequencies are discarded along with the model structure. With merging, the average frequency per node continues to increase: when the maximum model size is reached, part of the structure is discarded by merging, but frequencies continue to accumulate. Fig. 3 is an example of the type of model size oscillation that results during compression. Two effects are apparent as frequencies accumulate:
(1) model size increases more rapidly, and
(2) the number of nodes recovered by each merging phase decreases.
These result in the model filling, and having to be merged, more and more often as compression proceeds. The majority of the time is therefore spent on the final part of the input. To avoid this, we can gradually change the cloning and equivalence thresholds. There are many ways to do this, and the strategy that proves best may depend on the data. For this reason, we mention only simple strategies. We recalculate the thresholds at the beginning of every merge phase. This is done in terms of the average node frequency

avgFreq = totalAccumulatedFrequency / numNodes.
Clone thresholds are calculated with the simplest possible linear relationship: α × avgFreq, where α is a scaling parameter. The equivalence threshold is β/√avgFreq, where β is another scaling parameter. Note that this was chosen over the simpler relationship β/avgFreq because the width of a confidence interval for the difference between two binomial proportions decreases as the inverse square root of the frequency rather than as its inverse.
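A minimal C sketch of this recalculation, assuming hypothetical global counters and reusing the values α = 1 and β = 0.5 reported in Section 3:

    #include <math.h>

    /* Hypothetical globals; the names and the update rule follow the formulas
       above, with alpha = 1 and beta = 0.5 as used in Section 3. */
    static double totalAccumulatedFrequency;
    static long   numNodes;
    static double alpha = 1.0;    /* clone-threshold scaling parameter       */
    static double beta  = 0.5;    /* equivalence-threshold scaling parameter */

    static double cloneThreshold;
    static double equivThreshold;

    /* Recalculated at the beginning of every merge phase. */
    static void recalc_thresholds(void)
    {
        double avgFreq = totalAccumulatedFrequency / (double)numNodes;
        cloneThreshold = alpha * avgFreq;          /* linear in avgFreq         */
        equivThreshold = beta / sqrt(avgFreq);     /* scales as 1/sqrt(avgFreq) */
    }

As frequencies accumulate, the clone threshold rises and the equivalence threshold tightens, which slows model growth and keeps merge phases from being triggered ever more often.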
Fig. 4. Compression ratio versus memory size for book1.
Fig. 5. Compression time versus memory size for book1.
3. Results

Here we compare the performance of two implementations:

DMC-1: the reference implementation of DMC from http://plg.uwaterloo.ca/~ftp/dmc.
DMC-2: the above program augmented with state merging and dynamic thresholds, as described in the preceding sections.

Runs were conducted on a DEC Alpha system with 64-bit pointers and 32-bit floats. This means the extra pointer increases the size of every node by 33%, as opposed to
25% on a machine where pointers and floats are the same size.

Fig. 4 shows compression ratios for book1 from the Calgary corpus [2,3] at various memory sizes. The maximum model size tested was 32 megabytes, the point at which neither program ever needed to flush or merge the model. Both programs level off at about the same compression ratio, but DMC-2 reaches that point using much less memory. The dynamic merging parameters for DMC-2 were α = 1, β = 0.5.

Fig. 5 shows compression times. DMC-2 is relatively slower the more cramped it is in memory, since the model fills and has to be merged more frequently. With 1 MB (163,308 nodes), for example, it executes the merging routine 144 times, each time comparing between 100,000 and 200,000 nodes. Indications are that both time and compression performance can be improved by tuning the dynamic threshold parameters (or the shape of the functions used to calculate them) to the data. This was not done extensively for these results: both α and β were initially set to 1.0, and β was then halved to 0.5 when it was observed that far too few nodes were being merged.
4. Conclusion

By taking more time, the method attains the best compression of unmodified DMC using less memory.
State merging techniques have the potential to infer models that use the full power of the Markov model representation [6], which DMC does not [1]. This shortcoming is not improved here because of the necessarily restricted nature of the merging method.
Acknowledgement

Financial assistance from the Natural Sciences and Engineering Research Council of Canada, the Institute for Computer Research, and the University of Waterloo is gratefully acknowledged. Also, thanks to Gordon Cormack for reading and commenting on a draft of this paper.

References

[1] T.C. Bell, A.M. Moffat, A note on the DMC data compression scheme, Computer J. 32 (1) (1989) 16–20.
[2] T.C. Bell, I.H. Witten, J.G. Cleary, Modeling for text compression, ACM Computing Surv. 21 (4) (1989) 557–591.
[3] T.C. Bell, I.H. Witten, J.G. Cleary, Text Compression, Prentice Hall, Englewood Cliffs, NJ, 1990.
[4] G.V. Cormack, R.N. Horspool, Data compression using dynamic Markov modelling, Computer J. 30 (6) (1987) 541–550.
[5] G.G. Langdon, An introduction to arithmetic coding, IBM J. Res. Dev. 28 (2) (1984) 135–149.
[6] E. Vidal, Grammatical inference: An introductory survey, in: R. Carrasco, J. Oncina (Eds.), Proc. Second Annual Colloquium on Grammatical Inference and Applications (ICGI), Lecture Notes in Computer Science, Vol. 862, Springer, Berlin, 1994, pp. 1–4.