APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS 3, 388–392 (1996)
ARTICLE NO. 0032
LETTER TO THE EDITOR

Unconditional Bases and Bit-Level Compression

David L. Donoho¹

Communicated by Charles K. Chui on June 14, 1996

1. INTRODUCTION

A previous article [2] gave results showing that an orthogonal basis which is an unconditional basis for a functional class $\mathcal{F}$ furnishes an optimal representation of elements of $\mathcal{F}$ for certain de-noising and compression tasks. Since publication of that article, the author has received several queries pointing out that the definition of compression in that article was based on counting the number of significant transform-domain coefficients which must be retained to get acceptable reconstruction error in transform coding. These queries asked whether results could instead be formulated in terms of the number of bits stored.

The purpose of this note is to point out that results analogous to those of [2] hold under a model which measures bits encoded. There are two key results:

• The sparsity of the coefficients in an unconditional basis determines the rough asymptotics in $\epsilon$ for the number of bits which must be stored to reconstruct any member of $\mathcal{F}$ to within $\epsilon$-accuracy.
• A simple transform coding scheme based on uniform quantization and run-length encoding of the coefficients in the unconditional basis can achieve near-optimal asymptotics for the number of bits needed to represent any member of $\mathcal{F}$ to accuracy $\epsilon$.

In short, when an unconditional basis for a class $\mathcal{F}$ exists, transform coding in that basis offers near-optimal representation of elements of $\mathcal{F}$.

Settings where these results apply include:

• $L^2$ Sobolev classes on the circle, Fourier basis;
• $L^p$ Sobolev classes on the interval, nice wavelets basis;
• bounded variation classes on the interval, Haar basis.

In all these cases, simple transform coding in the indicated basis gives bitlengths which achieve the same rough asymptotic behavior as the $\epsilon$-entropy of the corresponding functional classes.

Our discussion in this note will follow closely the notation and vocabulary of [2].

¹ Department of Statistics, Stanford University, and University of California, Berkeley.
1063-5203/96 $18.00. Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

2. ASYMPTOTICS OF BIT-LEVEL COMPRESSION
Let $\mathcal{F}$ be a compact set of functions in $L^2(X)$, where, depending on the case at hand, the functions might have domain $X = [0,1]$, $X = [0,1]^2$, etc. We are interested in approximately representing elements $f \in \mathcal{F}$ by encoding short bit strings and later decoding such strings to approximately reconstruct $f$. We are interested in knowing the length of such strings which is required in order to reconstruct elements with accuracy $\epsilon$.

Let $\ell$ be a fixed counting number and let $E_\ell : \mathcal{F} \to \{0,1\}^\ell$ be a functional which assigns a bit string of length $\ell$ to each $f \in \mathcal{F}$. Let $D_\ell : \{0,1\}^\ell \to L^2(X)$ be a mapping which assigns to each bit string of length $\ell$ a function. The coder–decoder pair $(E_\ell, D_\ell)$ will be said to achieve distortion $\le \epsilon$ over $\mathcal{F}$ if

$$\sup_{f \in \mathcal{F}} \| D_\ell(E_\ell(f)) - f \|_{L^2(X)} \le \epsilon. \tag{1}$$
We define the minimax code length as

$$L(\epsilon, \mathcal{F}) = \min\{\ell : \exists (E_\ell, D_\ell) \text{ achieving distortion} \le \epsilon \text{ over } \mathcal{F}\}. \tag{2}$$
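To make definitions (1) and (2) concrete, here is a minimal sketch for a toy class that is not taken from the article: the constant functions on $[0,1]$ with value $c \in [0,1]$, for which the $L^2$ distance between two members is just $|c - c'|$. The names `encode`, `decode`, `worst_case_distortion` and `code_length` are hypothetical, and the midpoint quantizer is only one possible coder–decoder pair.

```python
def encode(c, ell):
    """Hypothetical E_ell: index of the dyadic cell of width 2**-ell containing c."""
    if ell == 0:
        return ""
    cells = 2 ** ell
    index = min(int(c * cells), cells - 1)  # c == 1.0 falls in the last cell
    return format(index, "b").zfill(ell)


def decode(bits):
    """Hypothetical D_ell: midpoint of the cell named by the bit string."""
    ell = len(bits)
    index = int(bits, 2) if bits else 0
    return (index + 0.5) / 2 ** ell


def worst_case_distortion(ell, grid=10_001):
    """Approximate the sup in (1) on a uniform grid of constants c in [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(decode(encode(c, ell)) - c) for c in points)


def code_length(eps):
    """Smallest ell whose midpoint quantizer has distortion 2**-(ell + 1) <= eps."""
    ell = 0
    while 2.0 ** -(ell + 1) > eps:
        ell += 1
    return ell
```

For this toy class the midpoint quantizer with $2^\ell$ cells has worst-case error $2^{-(\ell+1)}$, so `code_length(eps)` returns the smallest $\ell$ for which this particular pair meets the distortion constraint in (1).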
This measures precisely the number of bits it is necessary to retain in order to be sure that the reconstruction of any $f \in \mathcal{F}$ will be accurate to within $\epsilon$.

There is a metric characterization of $L(\epsilon, \mathcal{F})$ relating to ideas of Kolmogorov and Tikhomirov. Let $B_\epsilon(f_0) = \{g : \|g - f_0\|_{L^2(X)} \le \epsilon\}$ denote the ball of radius $\epsilon$ about $f_0$. An $\epsilon$-net for $\mathcal{F}$ is a finite collection of points $(f_i)_{i=1}^N$ in $L^2(X)$ such that

$$\mathcal{F} \subset \bigcup_{i=1}^{N} B_\epsilon(f_i).$$
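Let $N(\epsilon, \mathcal{F})$ denote the minimum possible cardinality of any such $\epsilon$-net; $N(\epsilon, \mathcal{F}) < \infty$ by compactness of $\mathcal{F}$. Then $L(\epsilon, \mathcal{F}) = \lceil \log_2 N(\epsilon, \mathcal{F}) \rceil$. The (Kolmogorov–Tikhomirov) $\epsilon$-entropy of $\mathcal{F}$ is (by definition) $H_\epsilon(\mathcal{F}) = \log_2 N(\epsilon, \mathcal{F})$, and so $L(\epsilon, \mathcal{F})$ is (to within one bit) the $\epsilon$-entropy of $\mathcal{F}$.

When $\mathcal{F}$ is not a finite set, $H_\epsilon(\mathcal{F}) \to \infty$ as $\epsilon \to 0$, and the rate of this growth becomes of interest. In many interesting cases, $H_\epsilon(\mathcal{F}) \asymp \epsilon^{-1/\alpha}$ or $H_\epsilon(\mathcal{F}) \asymp \epsilon^{-1/\alpha} \log(1/\epsilon)^\beta$ for some $\alpha, \beta > 0$. A crude measure of growth, insensitive to the difference between $\epsilon^{-1/\alpha}$ and $\epsilon^{-1/\alpha} \log(1/\epsilon)^\beta$, is the optimal exponent

$$\alpha^*(\mathcal{F}) = \sup\{\alpha : H_\epsilon(\mathcal{F}) = O(\epsilon^{-1/\alpha}),\ \epsilon \to 0\}. \tag{3}$$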
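The insensitivity of the optimal exponent to logarithmic factors can be checked numerically. The sketch below assumes a purely hypothetical growth law $H_\epsilon = \epsilon^{-1/\alpha} \log(1/\epsilon)^\beta$ (the function names are illustrative, not from the article) and reads off an empirical exponent as $\log(1/\epsilon) / \log H_\epsilon$.

```python
import math


def entropy_model(eps, alpha, beta=0.0):
    """Hypothetical growth law H = eps**(-1/alpha) * log(1/eps)**beta."""
    return eps ** (-1.0 / alpha) * math.log(1.0 / eps) ** beta


def local_exponent(eps, alpha, beta=0.0):
    """Empirical exponent log(1/eps) / log(H); tends to alpha as eps -> 0."""
    return math.log(1.0 / eps) / math.log(entropy_model(eps, alpha, beta))
```

With `beta=0` the empirical exponent equals `alpha` up to rounding. With `beta > 0` it approaches `alpha` from below as `eps` decreases, but only slowly, which is why the exponent is described as a crude measure.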
We will interpret this below as saying that the entropy grows roughly as $\epsilon^{-1/\alpha^*}$ even though there might be an extra factor which makes the actual behavior $O(\epsilon^{-1/\alpha^*} \log(1/\epsilon)^\beta)$ or something similar.

Our main result in this note will be to show that one can calculate $\alpha^*(\mathcal{F})$ from sparsity properties of expansions in an unconditional basis. Using this, one can give the following examples on the unit interval, $X = [0,1]$.

• Bounded variation space. Let $TV(C) = \{f : \int_0^1 |f(x)|\,dx + \int_0^1 |df(x)| \le C\}$. Then $\alpha^*(TV(C)) = 1$.
• Sobolev space. Let $1 \le p < \infty$ and $W_p^m(C) = \{f : \int_0^1 |f(x)|^p\,dx + \int_0^1 |f^{(m)}(x)|^p\,dx \le C^p\}$. Then $\alpha^*(W_p^m(C)) = m$.
• Besov/Triebel space. Let $\sigma > 0$ and $1 \le p, q \le \infty$. Let $B^\sigma_{p,q}(C)$ (resp. $F^\sigma_{p,q}(C)$) denote balls of functions with Besov (resp. Triebel) norm $\le C$. These are functions that are in some sense $\sigma$-times differentiable (with the proviso that $\sigma$ may be fractional). Then $\alpha^*(B^\sigma_{p,q}(C)) = \sigma$, $\alpha^*(F^\sigma_{p,q}(C)) = \sigma$.

In dimension 2, let $X = [0,1]^2$, and fix $1 \le p_1, p_2 \le \infty$, $1 \ge \sigma_1, \sigma_2 > 0$. Let $N^{\sigma_1,\sigma_2}_{p_1,p_2}(C_1, C_2)$ denote the collection of $f(x,y)$ with $0 \le x, y \le 1$ and controlled differences

$$\|\Delta^1_h f\|_{L^{p_1}(Q^1_h)} \le C_1 h^{\sigma_1}, \qquad 0 < h < 1,$$

where $(\Delta^1_h f)(x,y) = f(x+h, y) - f(x,y)$ and $Q^1_h = [0, 1-h] \times [0,1]$, and

$$\|\Delta^2_h f\|_{L^{p_2}(Q^2_h)} \le C_2 h^{\sigma_2}, \qquad 0 < h < 1,$$

where $(\Delta^2_h f)(x,y) = f(x, y+h) - f(x,y)$ and $Q^2_h = [0,1] \times [0, 1-h]$. Here we require $p_i > 0$, $C_i > 0$, $0 < \sigma_i \le 1$. Such functions have possibly different smoothness in the two directions. Then

$$\alpha^*(N^{\sigma_1,\sigma_2}_{p_1,p_2}(C_1, C_2)) = \frac{\sigma_1 \sigma_2}{\sigma_1 + \sigma_2}.$$

3. SPARSITY IN AN ORTHOGONAL EXPANSION

Let $(\phi_i)_{i=1}^\infty$ be an orthogonal basis of $L^2(X)$ and let $\theta_i(f) = \langle f, \phi_i \rangle$ denote the $i$th coefficient of the expansion

$$f \sim \sum_{i=1}^{\infty} \theta_i \phi_i.$$

As in [2] we are going to measure the sparsity of the transform coefficients based on the weak-$\ell^p$ "norm" for $0 < p < 2$. For $(\theta_i)$ the coefficients of $f$, we let $|\theta|_{(i)}$, $i = 1, 2, \ldots$, denote the decreasing rearrangement of the absolute values. Thus, $|\theta|_{(1)}$ is the size of the largest coefficient, $|\theta|_{(2)}$ the size of the second largest, and so on. We define the weak $\ell^p$ "norm" by

$$|\theta|_{w\ell^p} = \sup_{i \ge 1} i^{1/p} |\theta|_{(i)}.$$

It was shown in [2] that this quasi-norm controls various measures of asymptotic sparsity (e.g., Lemma 1 in that paper). For example, the number of coefficients exceeding $\delta$ in amplitude obeys

$$\#\{i : |\theta_i| > \delta\} \le |\theta|^p_{w\ell^p}\, \delta^{-p}, \qquad \delta > 0, \tag{4}$$

which can be used to infer that a sequence with small $w\ell^p$ norm is asymptotically sparse. Also, if we define $m = 1/p - 1/2$ then

$$\Big( \sum_{i=N+1}^{\infty} |\theta|^2_{(i)} \Big)^{1/2} \le c_p \cdot N^{-m} \cdot |\theta|_{w\ell^p}, \qquad N \ge 1, \tag{5}$$

which can be used to show that the energy of a sequence with small $w\ell^p$ norm is concentrated in its few biggest coefficients. Finally, Lemma 1 in [2] also gives the inequality, with $r = 1 - p/2$,

$$\sum_{i=1}^{\infty} \min(\theta_i^2, \epsilon^2) \le c'_p \cdot |\theta|^p_{w\ell^p} \cdot (\epsilon^2)^r. \tag{6}$$
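The weak-$\ell^p$ quantities of this section are easy to compute for a finite coefficient vector. The following is a minimal sketch with hypothetical function names; the counting bound it supports is inequality (4). For the tail bound (5) the article does not give the constant, so the test values used with this sketch rely on $c_p = (2/p - 1)^{-1/2}$, which is one admissible choice obtained by comparing the tail sum with an integral; that constant is an assumption of the example, not a statement from [2].

```python
import math


def weak_lp_norm(theta, p):
    """sup over i of i**(1/p) * |theta|_(i), using the decreasing rearrangement."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    return max(i ** (1.0 / p) * a for i, a in enumerate(mags, start=1))


def count_exceeding(theta, delta):
    """Number of coefficients with amplitude strictly greater than delta."""
    return sum(1 for t in theta if abs(t) > delta)


def tail_energy(theta, n):
    """l2 norm of everything except the n largest-magnitude coefficients."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    return math.sqrt(sum(a * a for a in mags[n:]))
```

For the sequence $\theta_i = (-1)^i / i$, $1 \le i \le 1000$, the weak-$\ell^1$ norm is 1, and the number of coefficients above $\delta = 0.01$ is 99, within the bound $\delta^{-1} = 100$ given by (4).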
Properly interpreted, these inequalities can be used to bound the performance of various compression and de-noising schemes. For example, we could determine a bound on how many expansion coefficients $N$ must be stored in order to reconstruct with accuracy $\epsilon$. The answer: $N(\epsilon)$ determined by $\epsilon = c_p N^{-m} |\theta|_{w\ell^p}$ suffices, i.e., $N(\epsilon) = \lceil (c_p |\theta|_{w\ell^p} / \epsilon)^{1/m} \rceil$.

In general, the weak $\ell^p$ norms with small $p$ ($p$ close to zero) are "asymptotically more powerful" than those with large $p$ ($p$ close to 2), in the sense of providing stronger inequalities in an asymptotic sense. Relations (4)–(6) all obey this rule of thumb. Suppose that $p < p'$ and we have $\theta, \theta'$ with $\theta \in w\ell^p$ and $\theta' \in w\ell^{p'}$ but $\theta' \notin w\ell^p$; then asymptotically, the rearrangement $|\theta|_{(i)}$ is decaying faster than $|\theta'|_{(i)}$, hence $\theta$ is more sparse than $\theta'$.

In view of this, let $\Theta$ be a collection of vectors $\theta$ and define the critical index by

$$p^*(\Theta) = \inf\{p : \Theta \subset w\ell^p\}.$$

This is a measure of the common sparsity of members of $\Theta$; if $p^*$ is very small, then sequences in $\Theta$ are, in an asymptotic sense, quite sparse.

We may use this remark to compare orthogonal expansions. Let $(\phi_i)$ and $(\xi_i)$ be two orthogonal bases for $L^2(X)$, and let $\theta_i = \langle f, \phi_i \rangle$ and $\omega_i = \langle f, \xi_i \rangle$ be the corresponding expansion coefficients. Let $\mathcal{F}$ be a specific class of functions and let $\Theta_{\mathcal{F}} = \{\theta(f) : f \in \mathcal{F}\}$ and $\Omega_{\mathcal{F}} = \{\omega(f) : f \in \mathcal{F}\}$ be sets of coefficients in two different bases. Then if $p^*(\Theta_{\mathcal{F}}) < p^*(\Omega_{\mathcal{F}})$, we may conclude that the expansion in terms of $(\phi_i)$ yields asymptotically sparser coefficient sequences than the one in terms of $(\xi_i)$.

4. SPARSITY IN AN UNCONDITIONAL BASIS
A key fact about certain orthogonal bases is that they serve as unconditional bases for certain functional spaces. On the circle $X = [0, 2\pi)$, the standard Fourier basis is an unconditional basis of the $L^2$-Sobolev space $W^m_{2,\mathrm{per}}[0, 2\pi)$. On the interval $X = [0,1]$, wavelets make an unconditional basis for all the standard $L^2$ Sobolev spaces, as well as $L^p$ Sobolev spaces, and in fact all the Besov and Triebel spaces.

The property of unconditionality can be given an appealing geometric interpretation. When an orthogonal basis is an unconditional basis for a function space $F$, it means that there is an equivalent norm for the space, call it $\|f\|_F$, such that the ball $\mathcal{F}(C) = \{f : \|f\|_F \le C\}$ corresponds to a set of coefficient sequences $\Theta(C) = \{\theta(f) : f \in \mathcal{F}(C)\}$ which is solid and orthosymmetric. Informally, these two properties mean that if $\theta \in \Theta$ and $\theta'$ is produced by coordinatewise shrinkage of $\theta$, then $\theta' \in \Theta$ as well. Formally,

$$\text{if } \theta \in \Theta \text{ and } |\theta'_i| \le |\theta_i|\ \ \forall i, \text{ then } \theta' \in \Theta. \tag{7}$$
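Property (7) can be illustrated on a truncated ellipsoid. The sketch below is only an illustration: the weights $i^m$ are an assumed stand-in for a Sobolev-type ellipsoid in a suitable basis, and all function names are hypothetical. The unit sphere is included as a contrasting set that is not solid, since shrinking a coordinate moves a point off the sphere.

```python
import random


def in_ellipsoid(theta, m, radius):
    """Membership in {theta : sum_i (i**m * theta_i)**2 <= radius**2}."""
    return sum((i ** m * t) ** 2 for i, t in enumerate(theta, start=1)) <= radius ** 2


def on_sphere(theta, tol=1e-12):
    """Membership in the unit sphere {sum_i theta_i**2 = 1}, which is not solid."""
    return abs(sum(t * t for t in theta) - 1.0) <= tol


def shrink(theta, factors):
    """Coordinatewise shrinkage theta'_i = s_i * theta_i with |s_i| <= 1."""
    assert all(abs(s) <= 1.0 for s in factors)
    return [s * t for s, t in zip(factors, theta)]
```

Shrinkage factors may be negative, so the same check covers the sign changes mentioned in the discussion of orthosymmetry.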
Solid orthosymmetric bodies are highly symmetric about the coordinate axes, since whenever $\theta \in \Theta$, then $\theta' = (\pm_i \theta_i) \in \Theta$, where $(\pm_i)$ is any sequence of signs.

The simplest example is on the circle $X = [0, 2\pi)$ with $\mathcal{F} = W^m_{2,\mathrm{per}}$. Then geometrically $\mathcal{F}$ is an ellipsoid whose major axes are the sinusoids, and $\Theta$ is the rotation of this ellipsoid into a standard form in which the major axes are the standard coordinates. Hence the transformation of a functional class $\mathcal{F}$ into a coordinate system where the coordinate body is symmetric has, as a special case, the classical problem of rotating an ellipsoid into standard form, i.e., diagonalizing a quadratic form. And so an unconditional basis in some sense "diagonalizes" a functional class.

It should be evident that an unconditional basis, when it exists, is very nice. In [2] it was shown that if $\Theta$ is the set of coefficients of functions in $\mathcal{F}$ in an unconditional basis and $\Omega$ is the set of coefficients in some other basis, then $p^*(\Theta) \le p^*(\Omega)$. So expansions in orthogonal unconditional bases have a kind of optimal sparsity.

5. MAIN RESULT
Suppose now that we have a class $\mathcal{F}$ arising as a ball of a functional space $F$ which has an unconditional basis. Consider representing elements of $\mathcal{F}$ by transform coding in the basis $(\phi_i)_i$, that is to say, by encoding the coefficient sequence $\theta(f)$ as a bit string of length $\ell$ and later decoding an approximate coefficient sequence $\tilde\theta$ from this bit string. Because of the orthogonality of the basis, if we approximate $f$ by $\tilde f = \sum_i \tilde\theta_i \phi_i$, the error of approximation is $\|f - \tilde f\|_{L^2(X)} = \|\theta - \tilde\theta\|_{\ell^2}$. This isometry means that $H_\epsilon(\mathcal{F}, \|\cdot\|_{L^2(X)}) = H_\epsilon(\Theta, \|\cdot\|_{\ell^2})$ and that $\alpha^*(\mathcal{F}) = \alpha^*(\Theta)$.

We now show that the orthosymmetry of $\Theta$ allows for a computation of $\alpha^*(\Theta)$ in terms of $p^*(\Theta)$. For this, we need one extra condition [2].

Definition 1. We say that a set $\Theta \subset \ell^2$ is minimally tail compact if for some $\beta_1, \beta_2 > 0$ every $\theta \in \Theta$ obeys

$$\sum_{i > N} \theta_i^2 \le \beta_1 N^{-\beta_2}, \qquad N = 1, 2, \ldots. \tag{8}$$
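Definition 1 can be turned into an explicit truncation rule: given a tolerance $q$, the tail energy beyond index $N$ is at most $q^2$ as soon as $\beta_1 N^{-\beta_2} \le q^2$. The sketch below computes the smallest such $N$; the function name and the floating-point guards are illustrative choices, not part of the article.

```python
import math


def truncation_index(q, beta1, beta2):
    """Smallest integer N >= 1 with beta1 * N**(-beta2) <= q**2."""
    target = q * q
    n = max(1, math.ceil((beta1 / target) ** (1.0 / beta2)))
    # Guard against rounding in the closed-form estimate.
    while beta1 * n ** (-beta2) > target:
        n += 1
    while n > 1 and beta1 * (n - 1) ** (-beta2) <= target:
        n -= 1
    return n
```

Since $N$ is roughly $(\beta_1 / q^2)^{1/\beta_2}$, it grows only polynomially in $1/q$, so $\log_2 N$ is of order $\log(1/q)$.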
Theorem 2. Let $\Theta$ be a bounded subset of $\ell^2$ which is solid, orthosymmetric, and minimally tail compact. Then $\alpha^* = 1/p^* - 1/2$. Moreover, coder–decoder pairs achieving the optimal exponent of code length can be derived from simple uniform quantization of the coefficients $(\theta_i)_i$, followed by simple run-length coding.

In words, the asymptotic rate of growth of the minimax code length as $\epsilon \to 0$ is determined by the degree of sparsity of the coefficients in an unconditional basis, and there is a natural way to get a reasonably good code by using the coefficients in the unconditional basis.

6. UPPER BOUND
We first prove that $\alpha^* \ge 1/p^* - 1/2$. We do this by showing that for each $\alpha$ with $0 < \alpha < 1/p^* - 1/2$ we can construct a coder–decoder pair achieving distortion $\le \epsilon$ and coding length $\ell(\epsilon) \le \mathrm{Const} \cdot \epsilon^{-1/\alpha} \log(\epsilon^{-1})$ as $\epsilon \to 0$.

Fix such an $\alpha$, and define $p = p(\alpha)$ by $\alpha = 1/p - 1/2$. Then $p^* < p < 2$. As $p > p^*$, $\Theta \subset w\ell^p$ and so for some $C_p > 0$, $\sup\{\|\theta\|_{w\ell^p} : \theta \in \Theta\} \le C_p$.

The family of coders/decoders will be indexed by a parameter $q > 0$ and will be constructed in cascade form: $E^q = E^q_1 \circ E^q_0$ and $D^q = D^q_0 \circ D^q_1$. Here $(E^q_0, D^q_0)$ is a lossy quantization and $(E^q_1, D^q_1)$ is a lossless run-length coding scheme.

The construction of $(E^q_0, D^q_0)$ relies on the assumption of minimal tail compactness (8), which furnishes an $N = N(q)$ such that

$$\sum_{i=N+1}^{\infty} \theta_i^2 \le q^2 \quad \text{and} \quad N(q) \le \gamma_1 q^{-\gamma_2}, \qquad \forall q < 1, \tag{9}$$
with $\gamma_i = \gamma_i(\beta_1, \beta_2)$. The encoding $E^q_0(\theta)$ is then the sequence $k = (k_i)_{i=1}^{N(q)}$ of $N(q)$ integers

$$k_i = \mathrm{sgn}(\theta_i) \lfloor |\theta_i| / q \rfloor, \qquad 1 \le i \le N(q),$$

produced by uniform quantization with quantum $q$, and the decoder is $D^q_0(k) = \tilde\theta$, where

$$\tilde\theta_i = \begin{cases} k_i q, & 1 \le i \le N(q), \\ 0, & i > N(q). \end{cases}$$

The reconstruction error of this scheme obeys

$$\sum_{i=1}^{\infty} (\theta_i - \tilde\theta_i)^2 \le \sum_{i=1}^{N(q)} \min(\theta_i^2, q^2) + \sum_{i=N(q)+1}^{\infty} \theta_i^2 \le \sum_{i=1}^{\infty} \min(\theta_i^2, q^2) + q^2.$$

By the properties of the weak $\ell^p$ norm (6) we have, with $r = 1 - p/2$,

$$\sum_{i=1}^{\infty} (\theta_i - \tilde\theta_i)^2 \le c'_p \cdot |\theta|^p_{w\ell^p} (q^2)^r + q^2 = C q^{2r} + q^2, \tag{10}$$

where $C = C(p, \Theta)$.

The lossless coder–decoder pair $(E^q_1, D^q_1)$ is designed to represent the vector $k = (k_i)_{i=1}^{N(q)}$ by a bitstring $b = E^q_1(k)$ with relatively few bits. Key fact: because $k$ is the uniform quantization of a vector in weak $\ell^p$, it is reasonably sparse: most entries are zero. Indeed, given $\theta \in w\ell^p$, we have by (4) at most $M \le (\|\theta\|_{w\ell^p} / q)^p$ nonzero entries. To exploit this sparsity we can develop a coding scheme that records just the positions of the few nonzero entries and the values of the $k_i$ in just those positions.

Lemma 3. If there are at most $M$ nonzero entries in a list $(k_i)$ of at most $2^n$ integers, and if each entry obeys $|k_i| < 2^m$, then the whole list can be represented losslessly by run-length coding using no more than $\ell = M \cdot (n + m + 1)$ bits.

Proof. Let $1 \le i_1 < i_2 < \cdots < i_M \le 2^n$ denote the positions of nonzero entries in the sequence. The list of the $(i_j, k_{i_j})$ pairs allows us to reconstruct losslessly. Encode the list by recording pairs $(h_j, k_{i_j})$ with short bit strings, where $h_j = i_j - i_{j-1} - 1$ is the run length of zeros occurring between adjacent nonzero elements (take $i_0 = 0$, say). Because $0 \le h_j < 2^n$, each $h_j$ can be represented using a fixed-length code of $n$ bits, and because $-2^m < k_{i_j} < 2^m$ a fixed-length code with $m + 1$ bits will represent $k_{i_j}$.

Combining the above estimates we see that encoding the vector $k$ by run-length encoding gives a representation $b$ using $\ell(q)$ bits, where

$$\ell(q) \le M \cdot (n + m + 1) \le (|\theta|_{w\ell^p} / q)^p \cdot (\log_2 N(q) + \log_2(\|\theta\|_2 / q) + 3).$$

By (9) we have $\log_2(N(q)) \le \gamma_2 \log(q^{-1}) + \log(\gamma_1)$, and as $\Theta$ is bounded we have $\|\theta\|_2 \le C$; so

$$\ell(q) \le q^{-p} \cdot (A \log(q^{-1}) + B)$$

for constants $A = A(p, \Theta)$, $B = B(p, \Theta)$.

We now assemble the above inequalities. Setting now $q = q(\epsilon) = (\epsilon / \sqrt{2C})^{1/r}$, we have that $C q^{2r} + q^2 \le \epsilon^2$ as soon as $\epsilon \le \epsilon_0 = \epsilon_0(p, \Theta)$. Hence from (10), the cascaded coder $(E^{q(\epsilon)}, D^{q(\epsilon)})$ achieves a distortion $\le \epsilon$ for every $\theta \in \Theta$, as soon as $\epsilon \le \epsilon_0$. As $q^{-p} \asymp \epsilon^{-1/\alpha}$, the bit length of the encoding $b$ obeys, say,

$$\ell(\epsilon) \le \epsilon^{-1/\alpha} \cdot (A' \log(\epsilon^{-1}) + B'), \qquad \epsilon < \epsilon_0,$$

where $A' = A'(p, \Theta)$, $B' = B'(p, \Theta)$. This completes the upper bound.
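The cascade of uniform quantization followed by run-length coding can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the field layout (run length, then one sign bit, then magnitude) is one way to realize the $n + m + 1$ bits per nonzero entry of Lemma 3, and the decoder is assumed to know the number of kept coefficients and the two field widths.

```python
import math


def quantize(theta, q, n_keep):
    """Lossy step: k_i = sgn(theta_i) * floor(|theta_i| / q) for the first n_keep entries."""
    return [int(math.copysign(math.floor(abs(t) / q), t)) for t in theta[:n_keep]]


def dequantize(k, q, length):
    """Decoder of the lossy step: k_i * q on kept indices, 0 beyond them."""
    return [ki * q for ki in k] + [0.0] * (length - len(k))


def field_widths(k):
    """Widths (n, m) with len(k) <= 2**n and every |k_i| < 2**m."""
    n_bits = max(1, (len(k) - 1).bit_length())
    m_bits = max(1, max((abs(ki) for ki in k), default=0).bit_length())
    return n_bits, m_bits


def run_length_encode(k, n_bits, m_bits):
    """Lossless step: per nonzero entry emit zero-run length, sign bit, magnitude."""
    pieces = []
    prev = -1  # 0-based position of the previous nonzero entry
    for i, ki in enumerate(k):
        if ki == 0:
            continue
        run = i - prev - 1
        pieces.append(format(run, "b").zfill(n_bits))
        pieces.append("1" if ki < 0 else "0")
        pieces.append(format(abs(ki), "b").zfill(m_bits))
        prev = i
    return "".join(pieces)


def run_length_decode(bits, n_bits, m_bits, length):
    """Inverse of run_length_encode for a list of known length."""
    k = [0] * length
    step = n_bits + 1 + m_bits
    pos = -1
    for s in range(0, len(bits), step):
        run = int(bits[s:s + n_bits], 2)
        sign = -1 if bits[s + n_bits] == "1" else 1
        magnitude = int(bits[s + n_bits + 1:s + step], 2)
        pos += run + 1
        k[pos] = sign * magnitude
    return k
```

The bit string has length exactly $M \cdot (n + m + 1)$, where $M$ is the number of nonzero quantized entries, so the cost is driven by the sparsity of the quantized vector rather than by the number of coefficients kept.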
7. LOWER BOUND
We now show that $\alpha^* \le 1/p^* - 1/2$. To do this we fix $\alpha$ so that $\alpha > 1/p^* - 1/2$ and we exhibit a sequence $\epsilon_n \to 0$ as $n \to \infty$ for which

$$\epsilon_n^{1/\alpha} H_{\epsilon_n}(\Theta) \to \infty. \tag{11}$$
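It follows from this that $H_\epsilon(\Theta) \ne O(\epsilon^{-1/\alpha})$ for any $\alpha$ obeying $\alpha > 1/p^* - 1/2$.

The argument proceeds by constructing a sequence of finite-dimensional hypercubes $\Theta_n$. The constructed hypercubes obey

$$\Theta_n \subset \Theta \tag{12}$$

and

$$\epsilon_n^{1/\alpha} H_{\epsilon_n}(\Theta_n) \to \infty. \tag{13}$$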
Since $\Theta_n \subset \Theta$ implies $H_\epsilon(\Theta_n) \le H_\epsilon(\Theta)$, (12) and (13) yield (11). The hypercube construction needs two technical facts which we state without proof.

Lemma 4. Let $\Theta = \{\theta : |\theta_i| = \delta,\ 1 \le i \le n, \text{ and } \theta_i = 0,\ i > n\}$ be a standard $n$-dimensional hypercube of side $\delta$. Then

$$H_{\Delta/2}(\Theta) \ge A \cdot n, \qquad n = 1, 2, 3, \ldots,$$

with $\Delta^2 = \delta^2 n$ and $A > 0$ a universal constant.
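A linear lower bound of the kind stated in Lemma 4 can be checked numerically with an elementary counting argument. This sketch is a substitute route, not the rate-distortion derivation the text cites. Two vertices that differ in $d$ coordinates are at distance $2\delta\sqrt{d}$, and two points of a common ball of radius $\Delta/2$ are at distance at most $\Delta = \delta\sqrt{n}$, so they differ in at most $n/4$ coordinates. Each ball therefore holds at most as many vertices as a Hamming ball of radius $\lfloor n/4 \rfloor$, and covering all $2^n$ vertices needs at least $2^n$ divided by that count. The function names are hypothetical.

```python
import math


def hamming_ball_size(n, radius):
    """Number of binary strings of length n within Hamming distance radius of a fixed one."""
    return sum(math.comb(n, d) for d in range(radius + 1))


def cube_entropy_lower_bound(n):
    """Lower bound, in bits, on the (Delta/2)-entropy of the 2**n cube vertices."""
    vertices_per_ball = hamming_ball_size(n, n // 4)
    return n - math.log2(vertices_per_ball)
```

By the standard entropy bound on binomial sums, this quantity is at least $(1 - h(1/4))\, n$, about $0.188\, n$, where $h$ is the binary entropy function, so under this argument $A = 0.18$ is an admissible constant.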
The lemma says in effect that it requires at least $A \cdot n$ bits to describe all the vertices of an $n$-cube within 50% accuracy per coordinate. This "intuitively plausible" fact can be derived from rate-distortion theory for Bernoulli channels under so-called single-letter difference-distortion measures; see [1].

Lemma 5. Suppose $\Theta \not\subset w\ell^p$. Then there is a sequence $m_n \to \infty$ and $\theta^{(n)} \in \Theta$ with

$$|\theta^{(n)}|_{(m_n)} \ge m_n^{-1/p}. \tag{14}$$
To use these two facts, fix $\alpha > 1/p^* - 1/2$. Let $p = p(\alpha)$ be chosen so that both $p < p^*$ and $\alpha > 1/p - 1/2$. It follows that $\Theta \not\subset w\ell^p$. Lemma 5 equips us with a sequence $(m_n)_{n \ge 1}$ and an associated sequence $(\theta^{(n)})_{n \ge 1}$.

Let $\delta_n = m_n^{-1/p}$ and let $i_{n,1}, \ldots, i_{n,m_n}$ denote the coordinates of $\theta^{(n)}$ with $|\theta^{(n)}_{i_{n,j}}| \ge \delta_n$. Let $\Theta_n$ denote the hypercube formed with vectors $\theta$ having $|\theta_{i_{n,j}}| = \delta_n$, $1 \le j \le m_n$, and $\theta_i = 0$, $i \notin \{i_{n,1}, \ldots, i_{n,m_n}\}$. Note that if $\theta \in \Theta_n$ then

$$|\theta_i| \le |\theta^{(n)}_i|, \qquad \forall i.$$
Hence, by the solid orthosymmetry of $\Theta$ (7), we have the inclusion (12). Setting $\epsilon_n^2 = m_n \delta_n^2 / 4$ and applying Lemma 4, we conclude that $H_{\epsilon_n}(\Theta_n) \ge A \cdot m_n$. Hence,

$$\epsilon_n^{1/\alpha} H_{\epsilon_n}(\Theta_n) \ge \epsilon_n^{1/\alpha} \cdot A \cdot m_n = A'' m_n^{1 + (1/2 - 1/p)/\alpha},$$

with $A'' = A''(p)$. Now $p$ has been chosen so that $1 + (1/2 - 1/p)/\alpha > 0$, and so (13) follows.

8. DISCUSSION
8.1. Examples

Let $X = [0,1]$, and let $(\phi_i)_i$ be a nice wavelets basis which is an unconditional basis for the Sobolev space $W_p^m$. Under an equivalent norm for $W_p^m$, the coefficients $\theta$ of objects with $W_p^m$ norm $\le C$ make up a Triebel body $f^m_{p,2}(C)$ which is solid and orthosymmetric. From, e.g., calculations in [2] we have $p^*(f^m_{p,2}(C)) = 1/(m + 1/2)$. Hence $\alpha^*(W_p^m(C)) = m$: $H_\epsilon(W_p^m(C))$ diverges with $\epsilon \to 0$ roughly as $\epsilon^{-1/m}$, and run-length coding of uniformly quantized coefficients achieves roughly this bitlength for a given distortion $\epsilon$.

Let $TV(C)$ be as in Section 2. Let $\theta(f)$ denote the Haar transform coefficients of $f$ and let $\Theta_{TV(C)}$ be the collection of Haar coefficients of functions in $TV(C)$. The class of functions of bounded variation does not have an unconditional basis, but as in [2] we can bracket the coefficient body between two Besov balls, $b^1_{1,1}(1/4) \subset \Theta_{TV(C)} \subset b^1_{1,\infty}(1)$. Since a larger set has larger $\epsilon$-entropy and hence a smaller optimal exponent, $\alpha^*(b^1_{1,1}(1/4)) \ge \alpha^*(\Theta_{TV(C)}) \ge \alpha^*(b^1_{1,\infty}(1))$. As in [2], $p^*(b^1_{1,1}(1/4)) = p^*(b^1_{1,\infty}(1)) = 2/3$, so we have $\alpha^*(b^1_{1,1}(1/4)) = \alpha^*(b^1_{1,\infty}(1)) = 1$. It follows that $\alpha^*(TV(C)) = 1$: $H_\epsilon(TV(C))$ diverges with $\epsilon \to 0$ roughly as $\epsilon^{-1}$, and run-length coding of uniformly quantized coefficients achieves roughly this bitlength for a given distortion $\epsilon$.

8.2. Comparison

Simple transform coding in other transform domains will not generally achieve bitlength comparable to what can be gotten by simple quantization and run-length coding in an unconditional basis.

Consider the class $\mathcal{F}$ of functions of bounded variation on the circle $[0, 2\pi)$. We know that for this class, the Haar transform (adapted to $[0, 2\pi)$) works well. Consider instead the Fourier basis, and let $\omega$ be the sequence of Fourier coefficients. Examples like the function $f(x) = 1_{[a,b]}(x)$ show that in general, the Fourier coefficients decay slowly and obey estimates no better than

$$\#\{i : |\omega_i| > q\} \ge C q^{-1}; \qquad \sum_{i > N} |\omega_i|^2 > C' N^{-1}.$$
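The contrast between the two bases for an indicator function can be seen on a discrete surrogate. The sketch below is an illustration under assumptions: it samples $1_{[a,b)}$ at $n$ equispaced points, uses the orthonormal discrete Haar and Fourier transforms in place of the continuous expansions, and all function names are hypothetical. The direct $O(n^2)$ Fourier sum is used only to keep the example free of external libraries.

```python
import cmath
import math


def haar_transform(x):
    """Orthonormal discrete Haar transform; len(x) must be a power of two."""
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = [(approx[2 * i], approx[2 * i + 1]) for i in range(len(approx) // 2)]
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def dft(x):
    """Orthonormal discrete Fourier transform, computed directly in O(n**2)."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x)) / math.sqrt(n)
        for k in range(n)
    ]


def count_above(coeffs, q):
    """Number of coefficients whose modulus exceeds the quantum q."""
    return sum(1 for c in coeffs if abs(c) > q)


def sampled_indicator(n, start, stop):
    """Samples of an indicator, scaled by 1/sqrt(n) to mimic L2 normalization."""
    return [(1.0 if start <= t < stop else 0.0) / math.sqrt(n) for t in range(n)]
```

For a step with two jumps, at most two Haar detail coefficients per scale are nonzero, so at most $2 \log_2 n + 1$ Haar coefficients can exceed any quantum, whereas the number of Fourier coefficients above $q$ keeps growing as $q$ decreases.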
It follows that for a given distortion of size $\approx \epsilon$ one must pick a quantum $q \approx \epsilon^2$, and so one needs to keep $N(q) = O(\epsilon^{-2})$ coefficients. Moreover, the vector of coefficients is dense (most entries of the quantized sequence exceed $q$ in amplitude) and oscillatory. Keeping even just one bit per coefficient, e.g., the sign bit, will require a coding length of order $\epsilon^{-2}$ bits. Hence crude scalar quantization will work poorly in the Fourier domain, compared to working in a "correct" transform domain for this class, the Haar domain, which we have shown will require a coding length only of order $O(\epsilon^{-1} \log(\epsilon^{-1}))$ bits.

8.3. Limitations

The optimal exponent is defined in such a way that $H_\epsilon(\mathcal{F}) \asymp \epsilon^{-1/\alpha}$ and $H_\epsilon(\mathcal{F}) \asymp \epsilon^{-1/\alpha} \log(1/\epsilon)^\beta$ both have the same optimal exponent $\alpha$. Hence the simple coder/decoder combination developed here might be suboptimal in concrete cases by as much as a "log factor." Refinements to uniform quantization and run-length coding can be developed for certain concretely specified transforms; compare Shapiro's EZW tree-based coding of the wavelet transform. Such schemes might conceivably improve on the bit rates here by as much as "log factors."

ACKNOWLEDGMENTS

This research was partially supported by NSF DMS-92-09130. The author thanks Emmanuel Candès, Ingrid Daubechies, Iain Johnstone, and Bob Gray for helpful discussion.
REFERENCES

1. T. Berger, "Rate Distortion Theory," Prentice Hall, 1970.
2. D. L. Donoho, Unconditional bases are optimal bases for data compression and for statistical estimation, Appl. Comput. Harmonic Anal. 1 (1993), 100–115.