ARTICLE IN PRESS
Real-Time Imaging 10 (2004) 285–295 www.elsevier.com/locate/rti
An embedded wavelet image coder with parallel encoding and sequential decoding of bit-planes Yufei Yuan, Mrinal K. Mandal Department of Electrical and Computer Engineering, Multimedia Computing and Communications Laboratory, University of Alberta, Edmonton, Alta., Canada T6G 2V4
Abstract Wavelet-based coders are widely used in image compression. Many popular embedded wavelet coders are based on a data structure known as zerotree. However, there exists another category of embedded wavelet coders that are fast and efficient without employing zerotree. These coders are based on three key concepts: (1) wavelet coefficient reordering, (2) bit-plane partitioning, and (3) encoding of bit-planes with certain efficient variants of run-length coding. In this paper, we propose a novel method to construct a bit-plane encoder that can be used in this category of non-zerotree coders. Instead of encoding the bit-planes progressively, the bitplane encoding process can be finished in one pass when multiple bit-plane encoders are activated concurrently. With this proposed method, traditional partitioned-block based parallel processing strategy is enhanced with another dimension (depth of bit-planes) of processing flexibility. This bit-plane encoder inherently targets parallel processing architecture. The final output bitstream can be compatible with that of the original sequential coder if compatibility is preferred over speed and memory efficiency. r 2004 Published by Elsevier Ltd.
1. Introduction Wavelet transform-based image coding schemes have already found their wide usage in the scientific and industrial community. In many application scenarios such as image database indexing and retrieval, parallel processing capability is desirable because of the huge volume of data and real-time processing demand. Although parallel wavelet image compression is an interesting research topic, so far a significant amount of work has only been done on parallel wavelet transform algorithms, which function as the first stage of a complete image coder. When it comes to parallel processing of the quantization and entropy coding stages, less work has been reported in the literature, which in the authors’ opinion results from the difficulty in parallelizing the zerotree-based coding methods since Corresponding author. Tel.: +1 780 492 0152; +1 780 492 1811. E-mail addresses:
[email protected] (Y. Yuan),
[email protected] (M.K. Mandal).
1077-2014/$ - see front matter r 2004 Published by Elsevier Ltd. doi:10.1016/j.rti.2004.08.006
fax:
many wavelet image coding methods proposed up-todate are based on the zerotree data structure. The first zerotree-based embedded coding method proposed was EZW [1], in which clusters of insignificant (small magnitude) coefficients are encoded through single symbols. These clusters of coefficients are called zerotrees because of the quadtree-like arrangement of coefficients which exploits the residual cross-scale statistical dependency in a wavelet pyramid. Later, an enhanced zerotree-based technique was proposed as SPIHT [2], which uses a different pattern of parent– children relationship and achieves good compression performance even without an explicit entropy encoding stage. Parallelization of the EZW algorithm was first reported in [3], where two approaches were proposed: (1) a straightforward parallelization which ensures that each processing element (PE) contains only entire zerotrees of wavelet coefficients, and (2) one PE is reserved for the collection of the symbols that have to be encoded. This PE reorders the symbols and encodes them. This approach results in higher complexity. In
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
286
fact, the first approach can be generalized as local processing principle under which an image (a decomposed wavelet pyramid) can be partitioned into pixel (coefficient) blocks and each block can be compressed locally by a PE. This one-block-per-PE approach is also suggested in [4] for consideration of parallel execution in the JPEG2000 image coding system. The parallelization of the SPIHT algorithm is more difficult because of the list structures it uses. The lists in the SPIHT cause variable and data-dependent memory requirement. The task of memory management on adding, dropping, and removing list nodes is complicated and undesirable in memory-constraint environments. To make it worse, managing a distributed list across multiple PEs is even more difficult. Some techniques have recently been proposed to remove the list structures from zerotree-based coders to make them more suitable for hardware implementation and parallel processing [5–7]. Because of the popularity of zerotree-based techniques, virtually no effort has been dedicated to parallelization of other equally efficient techniques, which may potentially be parallelized more easily and efficiently. One category of such algorithms is based on bit-plane coding [8] and run-length coding of binary sequence [9–11,13]. The common features of these methods are: (1) the wavelet pyramid is mapped into a one-dimensional array and thereafter the coefficients within it are linearly indexed, (2) the wavelet pyramid is treated as multiple bit-planes and bit-planes are encoded following the order of decreasing significance, and (3) twodimensional bit-planes are converted to one-dimensional binary sequences by linear reordering before the binary sequences are encoded by efficient run-length coders. This paper demonstrates that the category of embedded coding methods mentioned above can easily
and efficiently be mapped onto a multiple instruction multiple data (MIMD) architecture in the form of parallel bit-plane encoding. In other words, in addition to being segmented into blocks, wavelet pyramids can be sliced in another dimension (depth of bit-planes) into bit-planes as shown in Fig. 1. Furthermore, a bit-plane can be encoded by a PE without any communication with other PEs as proposed in this paper. In the next section, we will describe our proposed parallel encoding and sequential decoding (PESD) method of parallelization. We use a modified wavelet difference reduction (MWDR) algorithm [12], which does not employ arithmetic coding, as our example although the method we propose can readily be applied to other data-independent schemes such as progressive wavelet coding (PWC) [11]. The method proposed in Section 2 is extended to cover bitstream compatibility in Section 3. The decoding procedure is discussed in Section 4. Experimental results are presented in Section 5. The paper ends with concluding remarks in Section 6.
2. Parallelization through data-independent bit-plane encoding The success of wavelets in the area of still image compression has been attributed to the utilization of certain types of data dependency. For example, zerotrees in the EZW and the SPIHT, and other context models in [4,10] all utilize the residual data dependency among wavelet coefficients. One disadvantage of exploiting data dependency is that tracking the intermediate state information of the coder becomes more difficult than it does in the MWDR and the PWC. In the following sections, we will review the common features of [10–13], and discuss how to design a data-independent parallel bit-plane encoder based on the MWDR.
Pyramid Width Pyramid Height
Bit-plane Depth
Block Partitioning
Wavelet Pyramid Joint Block & Bit-plane Partitioning
Fig. 1. Comparison between conventional parallelization strategy and the new strategy established in this paper. The new strategy at the bottom provides another dimension of processing flexibility for SIMD architecture.
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
2.1. Coefficient reordering by linear indexing Linear indexing denotes the process of mapping a two-dimensional pyramid to a one-dimensional array. For a pyramid of size W H, the length of the array will be WH. Each coefficient will be assigned a unique linear index in the range of [1,WH]. Because we scan the coefficients according to their indices and want to encode the coefficients with larger magnitude earlier, ideally we hope to: (1) assign smaller indices to larger (in terms of magnitude) coefficients, and (2) cluster significant coefficients together to increase the ‘‘skewness’’ of the significant symbol distribution, thereby improving the efficiency of run-length encoding. Since wavelets are known to have good energy compaction ability, it is reasonable to assign indices starting from coarsest scale to finest scale. Hence, in all the techniques [10–12], the coefficients are scanned following that order. The major difference in indexing lies in the scan order within each scale and/or each subband. The MWDR takes a simple strategy in the design of the linear indexing. Within the same scale, the scanning priority of subbands is LH-HL-HH. Here, LH (HL) denotes low-pass (high-pass) filtering by row/width and high-pass (low-pass) filtering by column/height. Within an HL subband, coefficients are raster-scanned columnby-column, while coefficients in the other three types of subbands are raster-scanned row-by-row. Instead of employing raster-scanning inside each subband, the linear indexing scheme in [10] divides subbands into macro-blocks and zig-zag scans the macro-blocks. Within each macro-block, the coordinates of the ith coefficient in the scan are obtained by deinterleaving the odd and even bits of i. The concept of macro-block is also used in the PWC [11]. However, the PWC alternatively scans the LH and HL coefficients within the same scale assuming that a particular feature in the original image tends to generate salient coefficients in both the HL and LH that correspond to that specific spatial location. The sole objective of the linear indexing is to increase the probability of long runs of zeros in the bit-planes. Our exemplar algorithm is based on the MWDR, so the WDR’s indexing scheme is implemented in this paper. 2.2. Bit-plane partitioning and run-length encoding As shown in Fig. 1, a wavelet pyramid can be seen as a ‘‘bit cube’’ composed of W H (D+1) bits, where D is the bit-plane depth determined by the coefficient with the largest magnitude, i.e., D ¼ blog2 jcmax jc þ 1: In addition to the D bits for magnitude of each coefficient, one more bit is needed to record the sign information for each coefficient. The principle of the embedded coding is
287
to encode this bit cube progressively according to the effect of each bit on the rate-distortion quality of the final decoded image. Bit-plane p, Bp, is composed of the pth bits of all coefficients {ci}, iA[1,y,WH], where bit-plane 0 denotes the least significant bit in each coefficient. Bp is further partitioned into two or more sets (fractional bit-planes). In the WDR and the PWC, two sets are obtained for Bp : Bp;s ¼ fbp;i jbq;i ¼ 0; 8poqpDg and Bp,r=Bp Bp,s. The set Bp,s corresponds to coefficients that are insignificant above bit-plane p. Bp,r corresponds to refinement bits for coefficients whose most significant bit have already been encoded. Bp,s is then encoded by an efficient run-length encoder. Reduced index coding is employed as the run-length encoder in the WDR while run-length/Rice encoder is employed in the PWC. The sign bit of the stop symbol (significant coefficient) is interleaved uncoded into the output following the encoding of each stop symbol. The sequence of bits in Bp,r is appended behind the run-length encoding output. Bit-planes are encoded from the most significant towards the least significant until the targeted bit-rate is met. Instead of the two sets used in the WDR and the PWC, four sets are resulted from bit-plane partitioning in [10]. Three of the four sets except the refinement bit sequence are encoded by an elementary Golomb encoder. The refinement sequence is appended to the output uncoded. 2.3. Parallel bit-plane encoding The feasibility of encoding bit-planes in parallel originates from the asymmetric information availability at the encoder and the decoder. At the sender side, information about the whole wavelet pyramid is available to the encoder. At the receiver side, however, only those already decoded information is available to the decoder. To keep the synchronization between the encoding and the decoding process at any intermediate point, the sender side pretends that not all information are known and encodes bit-planes in the progressive order. To make it clearer, let us examine how a bit-plane of the pyramid is encoded in the WDR. First, a sorting pass is applied to insignificant coefficients remaining in the insignificant coefficient set (ICS) after encoding of the immediate higher bit-plane. Run length between two neighboring significant coefficients are then encoded. The interesting point is, an ICS is in fact not necessary in the algorithm because the scan order is fixed and we know exactly which coefficients are excluded from current sorting pass based on their magnitudes. If the magnitude of a coefficient is greater than 2Tp, where Tp is the threshold for current bit-plane p, it is excluded from the sorting pass because we know it is already labeled significant in a certain higher bit-plane. Hence,
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
288
The output bitstream from each bit-plane encoder consists of output of the sorting pass followed by output of the refining pass. The output from each bit-plane encoder is buffered before they are ultimately assembled by a bitstream assembler and truncated to fit in the rate budget, as shown in Fig. 2. The pseudo-code for the bitplane encoder is shown in Listing 1. The run length bits and the refining bits are not interleaved in the final generated bitstream. These two different types of data are buffered separately before the refining bits are finally appended to the run length bits. Note that Listing 1 only describes the algorithm for a bit-plane encoder. To encode the entire wavelet pyramid, this procedure should be executed multiple times (concurrently or sequentially) within a main encoder. The decomposed wavelet pyramid P and the bit-plane number b are passed to the procedure as the only parameters. A nice feature of this algorithm is that no communication among bit-plane encoders is needed, resulting in a very simple parallel architecture since each PE can process individual bit-plane(s) independently
we can follow the exact scan order within sorting pass without using the ICS. The major problem for bit-plane encoding is the refinement pass. Coefficients that need refining are stored in the significant coefficients set (SCS) in the WDR. Within the SCS, the coefficients are ordered according to their most significant bits (MSB) and their indices in the scan order. If we want to follow the same order as in the SCS in the bit-plane encoding, we need to reorder those already significant coefficients and construct an SCS. Here, we argue that each output bit within a bit-plane contributes on the same order to the reduction of distortion. Therefore, the identical output order of refining bits is indeed not necessary. This coding strategy causes bitstream incompatibility between the proposed algorithm and the MWDR. However, we need to emphasize that the bitstream compatibility is not difficult to achieve if we would like to sacrifice buffer size and algorithmic complexity. We will consider the bitstream compatibility in the next section.
Bit-Plane Encoder
BN
N Bit-Plane Encoder
BN−1
Bitstream Assembler & Truncator
BNBN−1
N−1 Input Pyramid
B0 Bit-Plane Encoder 0
Fig. 2. Processing architecture of the proposed bit-plane encoder, where Bn denotes the generated bitstream for the nth bit-plane.
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
without resorting to information from other PEs when this algorithm is implemented with parallel processing hardware.
3. Bitstream compatibility in parallel bit-plane encoding In the last section, we have already explained that an intermediate list like the ICS is not necessary because along the path of the fixed scan order, we know exactly which coefficients should be skipped solely by testing their magnitudes. Note that this magnitude test is exerted on all coefficients while only coefficients in the ICS undergo such test in the MWDR. However, the extra computational cost of the magnitude test is just one logic instruction, which is justified compared to the memory requirement and computational load imposed by management of dynamic list structures. We have mentioned before that the proposed algorithm is designed for parallel encoding, and the algorithm described in Section 2.3 generates a bitstream that is incompatible with the bitstream output by the original sequential encoder. However, we now present a bitstream compatible algorithm with which the original sequential decoder works just fine. Before we present the listing of pseudo-code, one example is illustrated in Fig. 3 to show how bitstream compatibility increases the encoder complexity. The encoding of bit-planes B3 and B4 is shown in Fig. 3. Those bits falling into the two different passes, i.e., the sorting pass and the refining pass, are labeled in different shades. From the figure, we can see that the bits in the sorting pass are scanned in the same order. However, the bits in the refinement pass are scanned differently compared to the MWDR. When we encode
289
B4, the refining bits are scanned as {163, 0 34, 149}. However, the refining bits are scanned as {163, 0 34, 1 31, 023, 049, 1 25} when B3 is encoded. This shows that the position of the refining bits in output bitstream is solely determined by the scan order, which can be expressed as Iðcðx; yÞÞp;s ¼ f ðSÞ;
jcðx; yÞjo2T p ;
Iðcðx; yÞÞp;r ¼ f ðSÞ;
jcðx; yÞjX2T p ;
ð1Þ
where I(c(x,y))p,s denotes the index of c(x,y) in the sorting pass of Bp if the magnitude of c(x,y) is less than two times of the threshold Tp for Bp. Similarly, I(c(x,y))p,r denotes the index of c(x,y) in the refinement pass of Bp. f( ) denotes the index mapping function while S represents the scan order. When we use the MWDR to encode the pyramid in Fig. 3, the scan order for the sorting pass still remains the same. However, the order of the refining bits is changed because the significant coefficients in B4 are stored in the SCS and they are refined first when we output refining bits for B3. Now the refining bits are scanned as {163, 0 34, 049, 1 31, 023, 1 25}. It is clear that the position of the refining bits is determined not only by the fixed scan order but also by the refining bits ordering of higher bit-planes. Iðcðx; yÞÞp;s ¼ f ðSÞ;
jcðx; yÞjo2T p ;
Iðcðx; yÞÞp;r ¼ f ðS; Bfqjq4pg Þ;
jcðx; yÞjX2T p :
ð2Þ
To make the bitstream compatible with that of the MWDR, we need to output the refining bits according to the order set by the SCS. Since no list is available in our proposed algorithm, we need to parse all the coefficients to collect information that is provided by
Coefficient Scanning Order
Memory 63
+
-34 -31 23 49 -25 14 -7 8 -12 3 14 -13 10 -14 -9
+ + + + + + + -
1 1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 1 1 0 1 1 0 1 1 1 1 1
1 0 1 1 0 0 1 1 0 1 0 1 1 0 1 0
1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 0
1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 1
0
63
-34
49
14
-31
23
-25
-7
count(P)
-12 -13
10
count(P−1)
8 31
4
-14
Memory
1
count(P)−1
-9
0 Two-level wavelet pyramid of size 4x4
count(p+1)
count(P−1) −1 Bits for refining pass Bits for sorting pass
Bitplane Scanning Order
Fig. 3. One example to illustrate the fixed scan order and bit-plane encoding.
P: the highest bit−plane p: current bit−plane count(x): number of coefficients identified as significant in bit−plane x
0
Fig. 4. The data structure that stores the order of refining bits for the encoding of Bp.
ARTICLE IN PRESS 290
Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
the SCS in the MWDR. This parsing stage now makes our algorithm a two-pass algorithm. In the first pass, we collect information on how to order the refining bits. Then in the second pass, we output refining bits according to the order set after the first pass. Fig. 4 illustrates the data structure that handles the output order of the refining bits. In the first pass, we need to determine the priority of the refining bits, i.e., those coefficients which are significant in the highest bit-plane BP will be refined first. This is followed by the coefficients significant in BP 1. Significant coefficients from bit-plane Bp+1 will be refined last. When the first pass is completed, the total number of refining bits for Bp and the numbers of significant coefficients for each higher bit-plane are known. Hence, we can allocate memory for all refining bits of Bp, and
the offset for each bit-plane can be calculated based on the significant coefficients for each bit-plane. When we output a refining bit bp,i, its position in the bitstream is jointly determined by the highest bit-plane pmaxðci Þ reached by the corresponding coefficient ci, the offset for refining bits in pmaxðci Þ in the buffer, and the index i in the fixed scan order.The pseudo-code for the bitstream compatible bit-plane encoder is shown in Listing 2.
4. Sequential bit-plane decoding Contrary to the encoding process that can be truly parallelized, the decoding process is inherently sequential because the information is incrementally available to the receiver following the progressive decoding of
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
bit-planes. However, by proper algorithm analysis, we can still potentially optimize the execution of the decoding process.The pseudo-code for bit-plane decoder is shown in Listing 3. It is assumed that all bit-planes higher than b have already been decoded error-free. When we decode each bit-plane, the bitstream generated by the sorting pass is first decoded, and the decoded run length and sign information are stored into a run length list and a sign list. The wavelet pyramid is then scanned and the refining bits are read in to refine the coefficients at the correct positions. When the incremented run length equals one value in the run length list, the coefficient in current position becomes significant, and its value is initialized by current threshold and the corresponding sign. Note that the decoder is designed to be multiple bit-plane decoders, which can be executed on a parallel system. However, the communication among the PEs are now inevitable as dependency existed among the bitplane decoding process. One possible parallelization strategy is to label the size and offset of the sub-stream for each bit-plane within the final bitstream when we encode an image and segment them later to feed them into multiple PEs. The section of the codes in a bit-plane decoder does not depend on the result from higher bit-planes can be executed in parallel. In the remaining code section when we decode Bp, we need to know the number of coefficients that fall into the sorting pass and the refining pass, which depends on the correct decoding of bit-planes higher than Bp. Another parallelization strategy can be pipelining. In Listing 3 it can be seen that
291
the source code of a bit-plane decoder consists of two major sections. The first section is the generation of an intermediate list which contains the decoded run-length and sign information, which does not depend on the result from higher bit-planes. The second section is the reconstruction process that does depend on the decoding of higher bit-planes. These two sections of code can be efficiently pipelined. For decoding the bitstream generated by the bitstream compatible bit-plane encoder, the decoder of the MWDR can be directly employed without any modification. Alternatively, a two-pass bit-plane decoder similar to that described in Listing 3 can be designed. In the first pass of such a decoder, a list like the SCS is constructed before any refining bit is read in.
5. Experimental results To verify the correctness of the proposed algorithm, two versions of the algorithm have been implemented. The first, PESD-A, is a non-threaded version. The second, PESD-B, is a threaded version which uses the Pthread library for Linux operating system and is able to take advantage of the symmetric multi-processor (SMP) kernel of Linux. To demonstrate the simplicity of the proposed algorithm, the run time of non-threaded PESD-A is compared with that of the QccPack [14] implementation of the SPIHT without arithmetic coding (AC), which is well-known for its fastness and highPSNR performance. This implementation [14] of the SPIHT without AC is chosen because there is no AC
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295 Lena
1300
PESD-A SPIHT
1200 1100 Execution time (ms)
stage in the PESD and our source code is also based on the QccPack library. This arrangement can ensure a fair comparison between the PESD and the SPIHT. We have encoded two gray-scale images, Lena and Barbara, at different bit-rates and have used the profiling tool Gprof [15] to record the execution time. The size of both images is 512 512 at 8 bpp. Each image is consecutively encoded 30 times at each bit-rate. The program is run on a Pentium III 800 MHz workstation with 256 MB PC-100 SDRAM running Linux uni-processor (UP) kernel 2.4.20-8 and GCC 3.3.2-1. Fig. 5 illustrates the mean value and the range of samples at each bitrate for the two algorithms. Fig. 6 compares the mean execution time of the two methods at each bitrate on same scale. The mean execution time are compared in Fig. 7 on the coding of Barbara. The reconstructed images processed by PESD-A at very low bitrate can be visually compared with those processed by the SPIHT in Fig. 8. For a closer look at the images, we blow up parts of the reconstructed Barbara in Fig. 9. It is clearly observed that the PESD-A-coded image does offer more details around the knee. To demonstrate the potential of parallel bit-plane encoding, PESD-B implements the bit-plane encoders as multiple threads. We ran PESD-B on a Pentium III 700 MHz dual-processor server with 256 MB PC-100 SDRAM running Linux SMP kernel 2.2.24-7.0.3 and GCC 2.96. Since PESD-B is a threaded implementation, we expected that bit-plane coding can be accelerated as threads can be scheduled to run on both processors at the same time. To demonstrate how threading improves timing performance, we ran PESD-A, PESD-B, and PESD-B with bitstream compatibility (PESD-B/BC) on the server. It is worth noting that we have not tried to optimize any of the three implementations. All bitplanes are encoded in PESD-B and PESD-B/BC before the bitstreams are assembled and truncated to fit in rate budget. However, since PESD-A is non-threaded, the
1000 900 800 700 600 500 400 300 200 0
0.5
1
2 3 Bitrate (bpp)
4
5
Fig. 6. Comparison of execution time of SPIHT (without arithmetic coding) and PESD-A on Lena. Barbara 1300 PESD-A SPIHT
1200 1100 Execution time (ms)
292
1000 900 800 700 600 500 400 300 200 0
0.5
1
2 3 Bitrate (bpp)
4
5
Fig. 7. Comparison of execution time of SPIHT (without arithmetic coding) and PESD-A on Barbara.
PESD-A on Lena
SPIHT on Lena
700 650
1400 Execution time (ms)
Execution time (ms)
600 550 500 450 400 350 300 250
1200 1000 800 600 400
200 0 (a)
0.5 1
1.5 2 2.5 3 3.5 4 Bitrate (bpp)
4.5 5
0 (b)
0.5 1
1.5 2
2.5 3
3.5 4
4.5 5
Bitrate (bpp)
Fig. 5. The mean and the range of samples of the execution time of SPIHT (without arithmetic coding) and PESD-A on Lena. (a) PESD-A and (b) SPIHT.
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
(a)
(b)
(c)
(d)
293
Fig. 8. Comparison on visual quality of reconstructed Lena and Barbara when they are encoded by PESD-A and SPIHT at 0.15 bpp. (a) PESD-A, PSNR=31.12 dB, (b) SPIHT, PSNR=31.33 dB, (c) PESD-A, PSNR=25.67 dB and (d) SPIHT, PSNR=25.49 dB.
(a)
(b)
Fig. 9. Visual comparison of the reconstructed Barbara at 0.15 bpp. It is clear that PESD-A suffers less aliasing effect around the knee. (a) PESD-A and (b) SPIHT.
bit-plane encoding in PESD-A is actually sequential and bit-plane encoding can be stopped once rate budget is reached. To make result more revealing, we encoded all bit-planes using PESD-A to compare timing performance. Since threaded program profiling is not supported by Gprof, we have used an alternative method to collect the timing information. In PESD-A, a timer is reset just before the start of bit-plane encoding and stopped when all bit-planes are encoded. In PESD-B, a timer is reset just before the main thread creates all children threads and stopped once all children threads are joined. The result of running them 30 times is shown
in Fig. 10, where PESD-B almost doubles the speed on bit-plane encoding because of the presence of dualprocessor. With respect to the PSNR quality of the proposed algorithm, it is observed in Table 1 that the proposed PESD algorithm provides similar performance with the MWDR algorithm. The PESD also provides quite similar performance to the SPIHT, especially at very low bit-rates. Note that the PSNR performance of PESD-B/BC is identical to that of the MWDR. As shown in Fig. 10, to keep bit-stream compatible with the MWDR, an additional 30–40% percent of processing
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295
294
1000
900
950
850
900
800
Execution time (ms)
Execution time (ms)
Barbara
Lena
950
750 700 650
PESD-B PESD-B (bitstream compatible) PESD-A
600
850 800 750 700 650
550
600
500
550
450
5
10
(a)
15
20
25
500
30
(b)
Execution instance
PESD-B PESD-B (bitstream compatible) PESD-A
5
10
15
20
25
30
Execution instance
Fig. 10. Comparison of execution time on a dual-processor server using PESD-A, PESD-B, and PESD-B with bitstream compatibility using Lena and Barbara. (a) Lena and (b) Barbara.
Table 1 Comparison of SPIHT, SPIHT-AC [14], MWDR [12], and the proposed PESD technique Image
Output bit-rate (bpp) 0.15
0.25
0.5
1.0
Lena
SPIHT SPIHT-AC MWDR PESD
31.33 31.71 31.1257 31.1248
33.62 34.00 33.2419 33.2419
36.74 37.11 36.3603 36.3603
39.90 40.29 39.5416 39.5416
Goldhill
SPIHT SPIHT-AC MWDR PESD
28.59 28.89 28.6421 28.6419
30.16 30.44 30.2451 30.2451
32.57 32.96 32.6099 32.6094
35.86 36.38 35.8604 35.8604
Boat
SPIHT SPIHT-AC MWDR PESD
28.37 28.63 28.3197 28.3198
30.34 30.74 30.3450 30.3440
33.74 34.20 33.5517 33.5517
38.34 38.91 38.2289 38.2285
SPIHT SPIHT-AC MWDR PESD
25.49 26.08 25.6714 25.6714
27.60 28.20 27.2310 27.2307
31.60 32.20 31.2009 31.2009
36.79 37.42 36.0531 36.0530
Barbara
time is required on encoding all the bit-planes. This additional processing time obviously does not bring back any PSNR performance improvement as shown in Table 1. This proves our hypothesis that the order of the refinement bits is not a factor that can significantly influence the final reconstruction quality.
6. Conclusions In this paper, we proposed a wavelet image coding algorithm that is based on parallel execution of multiple bit-plane encoding subroutines. In addition to the
conventional parallelization strategy of segmenting the wavelet pyramid into multiple code-blocks, we show that using our proposed algorithm, each code-block can be further segmented into multiple bit-planes that can be encoded simultaneously. The output bitstreams of the bit-plane encoders are buffered, assembled, and finally truncated to meet the rate budget. The proposed algorithm has been implemented as threaded Linux application program and successfully tested on a dualprocessor workstation. We have also demonstrated by experiments that in embedded wavelet image coding the ordering of refinement bits does not significantly influence the PSNR quality of the reconstructed image.
References [1] Shapiro JM. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Acoustics, Speech and Signal Processing 1993;41(12):3445–62. [2] Said A, Pearlman WA. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems Video Technology 1996;6(3):243–50. [3] Creusere CD. Image coding using parallel implementations of the embedded zerotree wavelet algorithm. Proceedings of the SPIE digital video compression: algorithms and technologies 1996;2668:82–92. [4] Taubman D, Ordentlich E, Weinberger M, Seroussi G. Embedded block coding in JPEG2000. HPL-2001-35 (External), HP, February 2001. [5] Lin W-K, Burgess N. Listless zerotree coding for color images. Proceedings of the 32nd asilomar conference on signals, systems and computers, vol. 1, 1998. p. 231–5. [6] Wheeler FW, Pearlman WA. SPIHT image compression without lists. Proceedings of the IEEE ICASSP, 4, 2000. p. 2047–50. [7] Kutil R. Approaches to zerotree image and video coding on MIMD architectures. Parallel Computing 2002;28(7–8):1095–109. [8] Shwartz JW, Baker RC. Bit-plane encoding: a technique for source encoding. IEEE Transactions on Aerospace and Electronic Systems 1966;2:385–92.
ARTICLE IN PRESS Y. Yuan, M.K. Mandal / Real-Time Imaging 10 (2004) 285–295 [9] Tian J, Wells Jr RO. A lossy image codec based on index coding. Proceedings of the IEEE DCC, 1996. p. 456. [10] Ordentlich E, Weinberger MJ, Seroussi G. A lowcomplexity modeling approach for embedded coding of wavelet coefficients. Proceedings of the IEEE DCC, 1998. p. 408–17. [11] Malvar HS. Fast progressive wavelet coding. Proceedings of the IEEE DCC, 1999. p. 336–43. [12] Yuan Y, Mandal MK. Fast embedded image coding technique using wavelet difference reduction. Proceedings of the SPIE
295
electronic imaging and multimedia technology III 2002;4925: 200–7. [13] Yuan Y, Mandal MK. Context-modeled wavelet difference reduction coding based on fractional bit-plane partitioning. Proceedings of the IEEE ICIP, vol. 2, 2003. p. 251–4. [14] Fowler JE. QccPack—quantization, compression, and coding library. http://qccpack.sourceforge.net/. [15] Graham S, Kessler P, McKusick M. Gprof: a call graph execution profiler. Proceedings of the SIGPLAN 82 symposium on compiler construction, vol. 17, 1982. p. 120–6.