Information Processing Letters 97 (2006) 133–137 www.elsevier.com/locate/ipl
Compressing probability distributions
Travis Gagie
Department of Computer Science, University of Toronto, Ontario M5S 3G4, Canada
Received 16 May 2004; received in revised form 19 October 2005; accepted 25 October 2005; available online 17 November 2005
Communicated by A. Moffat
Keywords: Data structures; Data compression
1. Introduction

Manipulating probability distributions is central to data compression, so it is natural to ask how well we can compress probability distributions themselves. For example, this is useful for probabilistic reasoning [2] and query optimization [7]. Our interest in it stems from designing single-round asymmetric communication protocols [1,5]. Suppose a server with high bandwidth wants to help a client with low bandwidth send it a message; the server knows the distribution from which the message is drawn but the client does not. If the distribution compresses well, then the server can just send that; conversely, if the server can help the client in just one round of communication and without sending too many bits, then the distribution compresses well—we can view the server’s transmission as encoding it. Compressing probability distributions must be lossy, in general, and it is not always obvious how to measure fidelity. In this paper we measure fidelity using relative entropy because, in the asymmetric communication example above, the relative entropy is roughly how many more bits we expect the client to send with
the server’s help than if it knew the distribution itself.

Let P = p_1, …, p_n and Q = q_1, …, q_n be probability distributions over the same set. Then the relative entropy [10] of P with respect to Q is defined as
\[
D(P\|Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}.
\]
By log we mean log_2. Despite sometimes being called Kullback–Leibler distance, relative entropy is not a true distance metric: it is not symmetric and does not satisfy the triangle inequality. However, it is widely used in mathematics, physics and computer science as a measure of how well Q approximates P [4].

We consider probability distributions simply as sequences of non-negative numbers that sum to 1; that is, we do not consider how to store the sample space. In Section 2 we show how, given a probability distribution P = p_1, …, p_n, we can construct a probability distribution Q = q_1, …, q_n with D(P‖Q) < 2 and store Q exactly in 2n − 2 bits of space. Constructing, storing and recovering Q each take O(n) time. We also show how to trade compression for fidelity and vice versa. Finally, in Section 3, we show how to store a compressed probability distribution and query individual probabilities without decompressing it.
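As a concrete reference for this fidelity measure, the following small Python function (our own illustration, not part of the original paper) computes D(P‖Q) directly from the definition:

```python
import math

def relative_entropy(p, q):
    """D(P||Q) = sum_i p_i * log2(p_i / q_i), in bits.
    Terms with p_i = 0 contribute nothing; the result is infinite
    if some p_i > 0 has q_i = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")
            total += pi * math.log2(pi / qi)
    return total
```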
2. An algorithm for compressing probability distributions

The simplest way to compress a probability distribution P is to construct and store a Huffman tree [6] for it. This lets us recover a probability distribution Q with D(P‖Q) < 1 [11,15] but takes both Θ(n log n) time and Θ(n log n) bits of space. In this section, we show how to use the following theorem, due to Mehlhorn [12], to compress P by representing it as a strict ordered binary tree. A strict ordered binary tree is one in which each node is either a leaf or has both a left child and a right child. We show how to trade compression for fidelity, by applying this result repeatedly, or trade fidelity for compression, using another approach.

Theorem 1 [12]. Given a probability distribution P = p_1, …, p_n, we can construct a strict ordered binary tree on n leaves that, from left to right, have depths less than log(1/p_1) + 2, …, log(1/p_n) + 2. This takes O(n) time.

Proof (Sketch). For 1 ≤ i ≤ n, let
\[
S_i = \sum_{j=1}^{i-1} p_j + \frac{p_i}{2}.
\]
Consider the code in which the ith codeword is the first ⌈log(2/p_i)⌉ bits of the binary expansion of S_i; these bits suffice to distinguish S_i, so the code is prefix-free. Notice the ith leaf of the corresponding code-tree has depth less than log(1/p_i) + 2. □

Once we have used Theorem 1 to get a strict ordered binary tree T, we store T. It is important that T be ordered; otherwise, it would only store information about the multiset {p_i : 1 ≤ i ≤ n}, rather than the sequence P, so we would also need a permutation on n elements, which takes Θ(n log n) bits.

Theorem 2. Given a probability distribution P = p_1, …, p_n, we can construct a probability distribution Q = q_1, …, q_n with max_{1≤i≤n} {p_i/q_i} < 4, so D(P‖Q) < 2, and store Q exactly in 2n − 2 bits of space. Constructing, storing and recovering Q each take O(n) time.

Proof. We apply Theorem 1 to P to get a strict ordered binary tree T on n leaves that, from left to right, have depths d_1, …, d_n with d_i < log(1/p_i) + 2. We store T in 2n − 2 bits of space, represented as a sequence of balanced parentheses. For 1 ≤ i ≤ n, let
\[
q_i = \frac{2^{-d_i}}{\sum_j 2^{-d_j}}.
\]
Since T is strict, by the Kraft Inequality [9],
\[
\sum_{j=1}^{n} 2^{-d_j} = 1;
\]
thus, q_i = 2^{-d_i} > p_i/4. □
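The construction behind Theorems 1 and 2 is simple enough to sketch directly. The following Python code is our own illustration (the function names and the use of floating-point arithmetic are ours): it computes the codeword lengths ⌈log(2/p_i)⌉ from the proof sketch and then an approximation Q with q_i proportional to 2^{−ℓ_i}. Since the leaf depths in the strict tree of Theorem 1 are at most these lengths, the resulting Q still satisfies max p_i/q_i < 4; the sketch does not, however, produce the 2n − 2-bit balanced-parentheses encoding itself.

```python
import math

def mehlhorn_lengths(p):
    """Codeword lengths from the proof sketch of Theorem 1: the ith
    codeword is the first ceil(log2(2/p_i)) bits of the binary
    expansion of S_i = p_1 + ... + p_{i-1} + p_i/2, so the ith leaf
    of the code-tree has depth less than log2(1/p_i) + 2."""
    return [math.ceil(math.log2(2.0 / pi)) for pi in p]

def theorem2_distribution(p):
    """Sketch of the Q of Theorem 2: q_i proportional to 2^(-d_i).
    The codeword lengths stand in for the leaf depths of the strict
    tree (which can only be smaller), so max p_i/q_i < 4 and hence
    D(P||Q) < 2 still hold."""
    d = mehlhorn_lengths(p)
    weights = [2.0 ** -di for di in d]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    P = [0.4, 0.3, 0.2, 0.05, 0.05]
    Q = theorem2_distribution(P)
    print(max(pi / qi for pi, qi in zip(P, Q)))  # always < 4
```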
Using Theorem 2 as a starting point, we can improve fidelity at the cost of using more space. One approach is given below; we leave as future work finding better tradeoffs.

Theorem 3. Given a probability distribution P = p_1, …, p_n and an integer k ≥ 2, we can construct a probability distribution Q = q_1, …, q_n with max_{1≤i≤n} {p_i/q_i} < 2 + 1/2^{k−3}, so D(P‖Q) < log(2 + 1/2^{k−3}), and store Q exactly in kn − 2 bits of space. Constructing, storing and recovering Q each take O(kn) time.

Proof. By induction on k. By Theorem 2, the claim is true for k = 2. Let k ≥ 3 and assume the claim is true for k − 1. Let Q′ = q′_1, …, q′_n be the probability distribution we construct when given P and k − 1. Let B = b_1 ⋯ b_n be the binary string with b_i = 1 if p_i/q′_i ≥ 1 + 1/2^{k−3} and b_i = 0 otherwise. For 1 ≤ i ≤ n, let
\[
q_i =
\begin{cases}
\dfrac{2 q'_i}{2 \sum_{b_j = 1} q'_j + \sum_{b_j = 0} q'_j} & \text{if } b_i = 1,\\[3mm]
\dfrac{q'_i}{2 \sum_{b_j = 1} q'_j + \sum_{b_j = 0} q'_j} & \text{if } b_i = 0.
\end{cases}
\]
Notice we can store Q exactly in kn − 2 bits of space, using (k − 1)n − 2 bits for Q′ and n bits for B. Also,
\[
2 \sum_{b_i = 1} q'_i + \sum_{b_i = 0} q'_i
= \sum_{i=1}^{n} q'_i + \sum_{b_i = 1} q'_i
\le 1 + \sum_{b_i = 1} \frac{p_i}{1 + 1/2^{k-3}}
\le \frac{2^{k-2} + 1}{2^{k-3} + 1}.
\]
If b_i = 1, then q_i ≥ 2q′_i · (2^{k−3} + 1)/(2^{k−2} + 1). Since, by assumption, q′_i > p_i/(2 + 1/2^{k−4}), we have
\[
q_i > \frac{2 p_i}{2 + 1/2^{k-4}} \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}
\]
and so p_i/q_i < 2 + 1/2^{k−3}. If b_i = 0, then q′_i > p_i/(1 + 1/2^{k−3}). Since q_i ≥ q′_i · (2^{k−3} + 1)/(2^{k−2} + 1), we have
\[
q_i > \frac{p_i}{1 + 1/2^{k-3}} \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}
\]
and so p_i/q_i < 2 + 1/2^{k−3}. By assumption, constructing Q′ takes O((k − 1)n) time and constructing B and Q from Q′ takes O(n) time. Thus, constructing, storing and recovering Q each take O(kn) time. □
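The refinement step in this proof can be sketched in a few lines. The following Python code is our own illustration (it builds on the theorem2_distribution sketch above and omits the bit-level encoding into kn − 2 bits): given P and the distribution Q′ built for k − 1, it computes the flag string B and the rescaled distribution Q.

```python
def refine(p, q_prev, k):
    """One step of Theorem 3: double q'_i wherever p_i/q'_i >= 1 + 1/2^(k-3),
    renormalize, and record the choices in the bit string B."""
    threshold = 1.0 + 2.0 ** -(k - 3)
    b = [1 if pi / qi >= threshold else 0 for pi, qi in zip(p, q_prev)]
    denom = sum((2.0 if bi else 1.0) * qi for bi, qi in zip(b, q_prev))
    q = [(2.0 if bi else 1.0) * qi / denom for bi, qi in zip(b, q_prev)]
    return b, q

def theorem3_distribution(p, k):
    """Sketch of Theorem 3: refine the Theorem 2 distribution k - 2 times,
    so that max p_i/q_i < 2 + 1/2^(k-3)."""
    q = theorem2_distribution(p)   # the case k = 2, from the earlier sketch
    for kk in range(3, k + 1):
        _, q = refine(p, q, kk)
    return q
```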
It may be possible to strengthen Theorem 2 using results about alphabetic Huffman codes (e.g., [14]). We base it on Theorem 1 for two reasons: Mehlhorn’s construction takes O(n) time, whereas known algorithms for constructing alphabetic Huffman codes take Ω(n log n) time [8], and the guarantee that max_{1≤i≤n} {p_i/q_i} < 4 makes the proof of Theorem 3 cleaner.

Using a different approach, related to two-layer coding (see, e.g., [3]), we can also reduce the space used at the cost of reducing fidelity.

Theorem 4. Given a probability distribution P = p_1, …, p_n and c ≥ 1, we can construct a probability distribution Q = q_1, …, q_n with D(P‖Q) ≤ c · H(P) + log(π²/3) and store Q exactly in at most n^{1/(c+1)}(log n + 1) bits of space. Constructing, storing and recovering Q each take O(n) time.

Proof. Let t ≤ n^{1/(c+1)} be the number of probabilities in P that are at least 1/n^{1/(c+1)}. Let r_1, …, r_t be such that p_{r_j} is the jth largest probability in P, and let R = {r_1, …, r_t}. Thus,
\[
p_{r_1} \ge \cdots \ge p_{r_t} \ge \frac{1}{n^{1/(c+1)}} > \max_{i \notin R} \{p_i\}.
\]
Computing the set R takes O(n) time and sorting it takes O(n^{1/(c+1)} log n) ⊂ O(n) time. For 1 ≤ j ≤ t, let q_{r_j} = 3/(πj)²; since
\[
\sum_{j=1}^{\infty} \frac{1}{j^2} = \frac{\pi^2}{6},
\]
we have
\[
\sum_{j=1}^{t} q_{r_j} < \frac{1}{2}.
\]
For i ∉ R, let
\[
q_i = \frac{1 - \sum_{j=1}^{t} 3/(\pi j)^2}{n - t} > \frac{1}{2n}.
\]
Storing Q as the binary representations of r_1, …, r_t takes at most n^{1/(c+1)}(log n + 1) bits of space and O(n) time.
For 1 ≤ j ≤ t, since p_{r_j} is the jth largest probability in P, we have p_{r_j} ≤ 1/j. Therefore,
\[
H(P) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}
\ge \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n^{1/(c+1)}
= \sum_{j=1}^{t} p_{r_j} \log j + \frac{\log n}{c+1} \sum_{i \notin R} p_i.
\]
Compare this with
\[
\begin{aligned}
D(P\|Q) &= \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}\\
&= \sum_{j=1}^{t} p_{r_j} \log \frac{1}{q_{r_j}} + \sum_{i \notin R} p_i \log \frac{1}{q_i} - H(P)\\
&< \sum_{j=1}^{t} p_{r_j} \log \frac{(\pi j)^2}{3} + \sum_{i \notin R} p_i \log(2n) - H(P)\\
&\le 2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n + \log \frac{\pi^2}{3} \Bigl( \sum_{j=1}^{t} p_{r_j} + \sum_{i \notin R} p_i \Bigr) - H(P)\\
&= 2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n + \log \frac{\pi^2}{3} - H(P).
\end{aligned}
\]
Since c ≥ 1,
\[
\frac{D(P\|Q) - \log(\pi^2/3)}{H(P)} + 1
\le \frac{2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n}{\sum_{j=1}^{t} p_{r_j} \log j + \frac{\log n}{c+1} \sum_{i \notin R} p_i}
\le c + 1;
\]
that is, D(P‖Q) ≤ c · H(P) + log(π²/3). □
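The construction in the proof of Theorem 4 is likewise easy to sketch. The following Python code is our own illustration (the function name, the example distribution and the omission of the binary encoding of r_1, …, r_t are ours); together with relative_entropy above, it can be used to check the bound D(P‖Q) ≤ c·H(P) + log(π²/3) numerically.

```python
import math

def theorem4_distribution(p, c):
    """Sketch of Theorem 4: give the t largest probabilities (those at
    least n^(-1/(c+1))) the values 3/(pi*j)^2 by rank, and spread the
    remaining mass evenly over the other indices (assumes t < n)."""
    n = len(p)
    cutoff = n ** (-1.0 / (c + 1))
    # indices of probabilities >= cutoff, sorted by decreasing probability
    r = sorted((i for i in range(n) if p[i] >= cutoff), key=lambda i: -p[i])
    t = len(r)
    q = [0.0] * n
    for j, rj in enumerate(r, start=1):
        q[rj] = 3.0 / (math.pi * j) ** 2
    rest = (1.0 - sum(q)) / (n - t) if t < n else 0.0
    for i in range(n):
        if q[i] == 0.0:
            q[i] = rest
    return q

if __name__ == "__main__":
    P = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]
    c = 1
    Q = theorem4_distribution(P, c)
    H = sum(pi * math.log2(1.0 / pi) for pi in P if pi > 0)
    assert relative_entropy(P, Q) <= c * H + math.log2(math.pi ** 2 / 3)
```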
As an aside, we note that the number t of probabilities at least 1/n^{1/(c+1)} is less than (c + 1)H(P)n^{1/(c+1)}/log n + 1. To see why, consider that H(P) is minimized, relative to t, when t − 1 probabilities are 1/n^{1/(c+1)}, one is 1 − (t − 1)/n^{1/(c+1)} and the rest are 0. Thus,
\[
H(P) > \frac{t-1}{n^{1/(c+1)}} \log n^{1/(c+1)},
\]
so
\[
t < \frac{(c+1) H(P)\, n^{1/(c+1)}}{\log n} + 1.
\]
Therefore, an alternative bound is that we store Q exactly in at most
\[
\left( \frac{(c+1) H(P)\, n^{1/(c+1)}}{\log n} + 1 \right) (\log n + 1)
\]
bits; presumably (c + 1)H(P) is significantly less than log n, otherwise we would just approximate P by the uniform distribution.

If space is at a premium, we may need to work with Q without decompressing it. Notice we can do this by storing {(r_1, 1), …, (r_t, t)} in order by first component, which takes at most
\[
n^{1/(c+1)} \left( \log n + \frac{\log n}{c+1} + 2 \right)
\]
bits of space. Given i between 1 and n, we can compute
\[
q_i =
\begin{cases}
3/(\pi j)^2 & \text{if } i = r_j,\\[2mm]
\dfrac{1 - \sum_{j=1}^{t} 3/(\pi j)^2}{n - t} & \text{if } i \notin \{r_1, \dots, r_t\}
\end{cases}
\]
in O(log t) ⊆ O(log(n)/c) time.
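A small Python sketch of this two-level query structure, our own illustration (the class name is ours, and a sorted Python list of pairs stands in for the stated bit-level encoding):

```python
import bisect
import math

class CompressedDistribution:
    """Query structure for the Q of Theorem 4: the pairs (r_j, j) are
    kept sorted by first component, so q_i is found by binary search."""

    def __init__(self, p, c):
        n = len(p)
        cutoff = n ** (-1.0 / (c + 1))
        ranked = sorted((i for i in range(n) if p[i] >= cutoff),
                        key=lambda i: -p[i])
        # pairs (r_j, j), sorted by r_j
        self.pairs = sorted((rj, j + 1) for j, rj in enumerate(ranked))
        self.keys = [rj for rj, _ in self.pairs]
        t = len(self.pairs)
        top_mass = sum(3.0 / (math.pi * j) ** 2 for _, j in self.pairs)
        self.rest = (1.0 - top_mass) / (n - t) if t < n else 0.0

    def query(self, i):
        """Return q_i in O(log t) time."""
        pos = bisect.bisect_left(self.keys, i)
        if pos < len(self.keys) and self.keys[pos] == i:
            j = self.pairs[pos][1]
            return 3.0 / (math.pi * j) ** 2
        return self.rest
```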
3. A data structure for compressed probability distributions

In this section, we show how to work with a probability distribution compressed with Theorem 2 without decompressing it, using a succinct data structure due to Munro and Raman [13]. This data structure stores a strict ordered binary tree on n leaves in 2n + o(n) bits of space and supports queries that, given a node, return its parent, left child, right child and number of descendants. Each of these queries takes O(1) time. Notice that, given i between 1 and n, we can find the depth d of the ith leaf in O(d) time.

Theorem 5. Given a probability distribution P = p_1, …, p_n, we can construct a data structure that uses 2n + o(n) bits of space and supports a query that, given i between 1 and n, returns q_i in O(log(1/q_i)) time. Here, Q = q_1, …, q_n is a probability distribution with max_{1≤i≤n} {p_i/q_i} < 4, so D(P‖Q) < 2 and O(log(1/q_i)) ⊆ O(log(1/p_i)).

Proof. As for Theorem 2, but with the sequence of balanced parentheses replaced by an instance of Munro and Raman’s data structure. □
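To make the query of Theorem 5 concrete, here is a sketch of how the depth of the ith leaf is found by descending from the root. This is our own illustration: an ordinary pointer-based tree with cached subtree leaf counts stands in for Munro and Raman’s succinct representation, whose constant-time "number of descendants" query yields the same counts.

```python
class Node:
    """A node of a strict ordered binary tree; `leaves` caches the number
    of leaves in the subtree (in a strict binary tree this is computable
    in O(1) from the number of descendants)."""
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self.leaves = 1 if left is None else left.leaves + right.leaves

def leaf_depth(root, i):
    """Depth of the ith leaf (1-based), found in O(depth) steps by walking
    down and comparing i with the left subtree's leaf count."""
    node, depth = root, 0
    while node.left is not None:
        if i <= node.left.leaves:
            node = node.left
        else:
            i -= node.left.leaves
            node = node.right
        depth += 1
    return depth

# q_i is then 2 ** (-leaf_depth(root, i)), as in the proof of Theorem 2.
```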
A drawback to Theorem 5 is that querying a very small probability might take Θ(n) time. We can fix this by smoothing the given probability distribution slightly.

Theorem 6. Given a probability distribution P = p_1, …, p_n and ε > 0, we can construct a data structure that uses 2n + o(n) bits of space and supports a query that, given i between 1 and n, returns q_i in O(log(1/q_i)) time. Here, Q = q_1, …, q_n is a probability distribution with D(P‖Q) < 2 + ε and log(1/q_i) ∈ O(log min(1/p_i, n/ε)).

Proof. Let P′ = p′_1, …, p′_n, where
\[
p'_i = \frac{p_i}{1 + \varepsilon/4} + \frac{\varepsilon/4}{(1 + \varepsilon/4)\, n}.
\]
We apply Theorem 5 to P′; let Q be the stored distribution. Notice
\[
\max_{1 \le i \le n} \frac{p_i}{q_i}
\le \max_{1 \le i \le n} \frac{p_i}{p'_i} \cdot \max_{1 \le i \le n} \frac{p'_i}{q_i}
\le \left(1 + \frac{\varepsilon}{4}\right) \cdot 4 = 4 + \varepsilon.
\]
Since log is concave, it follows that D(P‖Q) < 2 + ε. Since
\[
q_i > \max\left\{ \frac{p_i}{4 + \varepsilon},\ \frac{\varepsilon}{(4 + \varepsilon)\, 4n} \right\},
\]
we have log(1/q_i) ∈ O(log min(1/p_i, n/ε)). □
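The smoothing used in this proof is a one-line transformation; the following Python sketch is our own illustration of it:

```python
def smooth(p, eps):
    """The P' of Theorem 6: mix P with the uniform distribution so that
    every probability is at least (eps/4) / ((1 + eps/4) * n), which keeps
    the leaf depths, and hence the query times, in O(log min(1/p_i, n/eps)).
    The Theorem 5 data structure is then built for smooth(P, eps) rather
    than for P itself."""
    n = len(p)
    return [pi / (1.0 + eps / 4.0) + (eps / 4.0) / ((1.0 + eps / 4.0) * n)
            for pi in p]
```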
References

[1] M. Adler, B.M. Maggs, Protocols for asymmetric communication channels, J. Comput. System Sci. 63 (2001) 573–596.
[2] D. Bellot, P. Bessière, Approximate discrete probability distribution representation using a multi-resolution binary tree, in: Proc. 15th Internat. Conf. on Tools with Artificial Intelligence, 2003, pp. 498–503.
[3] D. Chen, Y.-J. Chiang, N. Memon, X. Wu, Optimal alphabet partitioning for semi-adaptive coding of sources of unknown sparse distributions, in: Proc. of the Data Compression Conf., 2003, pp. 372–381.
[4] T. Cover, J. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[5] T. Gagie, Dynamic asymmetric communication, submitted for publication.
[6] D.A. Huffman, A method for the construction of minimum-redundancy codes, Proc. IRE 40 (9) (1952) 1098–1101.
[7] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, T. Suel, Optimal histograms with quality guarantees, in: Proc. 24th Internat. Conf. on Very Large Databases, 1998, pp. 275–286.
[8] M.M. Klawe, B. Mumey, Upper and lower bounds on constructing alphabetic binary trees, SIAM J. Discrete Math. 8 (1995) 638–651.
[9] L.G. Kraft, A device for quantizing, grouping, and coding amplitude-modulated pulses, Master's thesis, Massachusetts Institute of Technology, 1949.
[10] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Statistics 22 (1951) 79–86.
[11] G. Longo, G. Galasso, An application of informational divergence to Huffman codes, IEEE Trans. Inform. Theory 28 (1) (1982) 36–43.
[12] K. Mehlhorn, A best possible bound for the weighted path length of binary search trees, SIAM J. Comput. 6 (1977) 235–239.
[13] J.I. Munro, V. Raman, Succinct representation of balanced parentheses and static trees, SIAM J. Comput. 31 (2001) 762–776.
[14] D. Sheinwald, On binary alphabetical codes, in: Proc. of the IEEE Data Compression Conf., 1992, pp. 112–121.
[15] C. Ye, R.W. Yeung, A simple upper bound on the redundancy of Huffman codes, IEEE Trans. Inform. Theory 48 (7) (2002) 2132–2138.