ARTICLE IN PRESS
INTEGRATION, the VLSI journal 39 (2005) 1–11 www.elsevier.com/locate/vlsi
A high-throughput, memory efficient architecture for computing the tile-based 2D discrete wavelet transform for the JPEG2000

G. Dimitroulakos, M.D. Galanis, A. Milidonis, C.E. Goutis

VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece

Received 26 November 2003; received in revised form 1 November 2004; accepted 30 November 2004
Abstract

In this paper, the design and implementation of an optimized hardware architecture, in terms of speed and memory requirements, for computing the tile-based 2D forward discrete wavelet transform for the JPEG2000 image compression standard are described. The proposed architecture is based on a well-known architecture template for calculating the 2D forward discrete wavelet transform. It is derived by replacing the filtering units with our previously published throughput-optimized ones and by developing a scheduling algorithm suited to the special features of our filtering units. The architecture exhibits high-performance characteristics due to the throughput-optimized filters. Also, the extra clock cycles required by the tile-based version of the discrete wavelet transform are partially compensated by the proper scheduling of the filters. The developed scheduling algorithm results in reduced memory requirements compared with existing architectures.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Discrete wavelet transform; JPEG2000 standard; Tile-based 2D wavelet transform; Scheduling
Corresponding author. Tel.: +30 2610 997324; fax: +30 2610 994798.
E-mail address: [email protected] (G. Dimitroulakos).

0167-9260/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.vlsi.2004.11.002
1. Introduction

The discrete wavelet transform (DWT) has been introduced as a highly efficient and flexible method for the subband decomposition of signals [1]. In digital signal processing (DSP), it has demonstrated good algorithmic performance, especially in the fields of image and video compression. The inclusion of the DWT in contemporary multimedia compression standards, such as JPEG2000 [2,3] and MPEG-4 [4], has led to intensive research efforts to improve the implementation aspects of the transform. Many architectures implementing the separable 2D forward (2D-DWT) and inverse (2D-IDWT) DWT on 2D signals have been presented in the past [5–8]. These architectures consist of filters for performing the one-dimensional (1D) DWT and memory units for storing the results of the transformation. The need to optimize the design of the filters in terms of speed is imposed by the fact that the multimedia applications in which the DWT is present are characterized by high-throughput requirements. The storage size can be minimized by setting a proper sequence of the computations—a setting called time scheduling—that takes the algorithmic dependencies into account. Given an architecture template and an algorithm for computing the DWT, an intelligent time schedule is required to achieve improved performance and memory characteristics. The goal of scheduling is to maximize the utilization of the filters along with minimizing the memory buffering between successive stages of computation. In this paper, an optimized architecture in terms of speed and memory requirements for computing the tile-based (TB) 2D-DWT, tailored to a JPEG2000 encoder, is presented. The proposed architecture is based on a well-established architecture template presented in [9], where the four conventional filtering units [6,8] were replaced by the four throughput-optimized ones presented in [10].
An intelligent scheduling algorithm, adapted to the filters' architecture and based on the line-based algorithm for computing the 2D-DWT, is also proposed. The proposed scheduling algorithm minimizes the memory requirements of the 2D-DWT computation between levels of decomposition, while improved throughput characteristics derive from the use of our filters in the encoder architecture. The rest of the paper is organized as follows: In Section 2, the related work on hardware architectures for the 2D-DWT is presented. In Section 3, a brief description of the special characteristics of the 1D-DWT filters employed is given. Sections 4.1 and 4.2 illustrate the 2D-DWT encoder architecture and present the scheduling algorithms. The memory requirements and performance of the proposed 2D-DWT encoder architecture, and their comparison with existing encoders, are illustrated in Section 4.3. Finally, Section 5 concludes the paper and presents ongoing and future work.
2. Related work

Numerous architectures have been proposed for computing the 1D-DWT [5–7,11,12]. The 1D-DWT architectures can, in principle, be extended to architectures for computing the separable 2D-DWT. This is because the separable 2D-DWT can be computed by 1D-DWT filtering on rows followed by 1D-DWT filtering on columns. In contrast, the non-separable 2D-DWT processes 2D
data directly. Because the minimization of latency and of storage requirements are important goals for most streaming multimedia applications, mapping 1D-DWT architectures to 2D-DWT architectures is not a trivial issue. Lewis and Knowles [13] were the first to propose an architecture for the 2D-DWT. Their architecture was tuned to the Daubechies four-tap filters, so it suffered from poor scalability, since it is strongly dependent on the specific properties of the filters used. Chakrabarti and Vishwanath [5] have proposed a scalable architecture for the encoder based on the non-separable 2D-DWT. Their architecture consists of two parallel computation units of size K² and a storage unit of size NK, where N × N is the size of the input image and K is the number of filter taps of each high-pass and low-pass filter. A parallel computation unit of size M consists of M multipliers and a tree of adders to add the M products. Vishwanath et al. [7] have proposed an architecture for the separable 2D-DWT, which consists of two systolic arrays of size K, two parallel computational units of size K, and a storage unit of size N(2K + J), where J is the number of DWT decomposition levels. A drawback of this architecture is that two rows of inputs are fed into the two systolic arrays every two cycles; as a result, an additional data converter is required to convert the raster-scan input (one sample per cycle) into two samples every two cycles. Chakrabarti and Mumford [9] introduced an architecture for the analysis (synthesis) filters based on the 2D-DWT, together with two scheduling algorithms for computing the forward (inverse) 2D-DWT. The goal was to minimize the storage requirements and keep the data-flow regular. Denk and Parhi [14] have proposed an encoder architecture based on the 2D-DWT, which uses lapped block processing techniques to exploit the overlap in the input data blocks.
The architecture uses a minimum number of registers and the forward–backward allocation scheme to allocate the variables. However, this comes at the expense of scalability and complex routing. The three main architectures for computing the 2D-DWT—level-by-level, line-based [15] and block-based [16]—are thoroughly studied and compared in terms of memory requirements, throughput and energy dissipation by Zervas et al. [8].
3. 1D-DWT filters characteristics

As mentioned in Section 2, the implementation of the 2D-DWT is based on the application of two 1D-DWTs on the rows and columns of the input image. The 1D-DWT is a two-channel subband decomposition (Fig. 1a) of an input signal x[n] that produces two subband coefficients
Fig. 1. (a) 1D-DWT filter bank, (b) 2D-DWT filter bank.
$y_0[n]$ and $y_1[n]$ for one stage of decomposition [1] according to the following equations:

$$y_1[n] = \sum_{k} x[k]\, h[2n-k], \qquad (1)$$

$$y_0[n] = \sum_{k} x[k]\, w[2n-k]. \qquad (2)$$
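Eqs. (1) and (2) are ordinary convolutions evaluated at even output indices, i.e. filtering followed by downsampling by 2. The following is a minimal NumPy sketch of this two-channel decomposition (a software model only, not the hardware data-path; the Haar-like filter coefficients in the usage example are illustrative and are not taken from the paper):

```python
import numpy as np

def analysis_1d(x, h, w):
    """One stage of the two-channel decomposition of Eqs. (1)-(2).

    y1[n] = sum_k x[k] h[2n - k]  (high-pass subband H)
    y0[n] = sum_k x[k] w[2n - k]  (low-pass subband L)

    Filtering is a full convolution; keeping every second output
    implements the downsampling by a factor of 2.
    """
    y1 = np.convolve(x, h)[::2]  # high-pass coefficients
    y0 = np.convolve(x, w)[::2]  # low-pass coefficients
    return y0, y1

# Usage: a constant signal has (almost) no high-frequency content,
# so y1 vanishes away from the boundaries.
y0, y1 = analysis_1d(np.ones(8), h=[0.5, -0.5], w=[0.5, 0.5])
```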
The h[n] and w[n] are the high-pass and low-pass filters, respectively, while y1[n] and y0[n] are the high-pass and low-pass subband coefficients (denoted hereafter H and L), respectively. After the filtering operations, the produced sequences are downsampled by a factor of 2. The four-channel subband decomposition of an input image is derived from the separable 2D-DWT (Fig. 1b) by the successive application of two two-channel 1D-DWT decompositions along the rows and columns, producing the (LL), (LH), (HL) and (HH) coefficients. The coefficients produced by the 2D-DWT after a stages of decomposition are denoted as (LL)^{a−1}LL, (LL)^{a−1}LH, (LL)^{a−1}HL and (LL)^{a−1}HH. Throughout this paper, we assume that: (a) the image size is N × N; (b) the number of stages of decomposition is J; (c) the maximum number of filter taps among the high-pass and low-pass filters is K. The 1D-DWT filter architecture employed in the 2D-DWT computation is based on the equations for the 1D-DWT as given in [2], and its detailed architecture is described in [10]. There are basically two types of parallel convolutional filters for computing the 1D-DWT, which are shown in Fig. 2. The first was presented in our previous work [10] and will hereafter be called throughput-optimized (TO); the second, called conventional, is presented in [6,8]. The conventional filter architecture (Fig. 2a) consists of one delay line and one data-path for low-pass and high-pass filtering, which works in an interleaved manner. For this reason, the conventional architecture produces one sample (low-pass or high-pass) per clock cycle. On the other hand, the TO filters (Fig. 2b) consist of a modified delay line that receives two input samples per clock cycle and separate data-paths for high-pass and low-pass filtering, in order to produce two subband coefficients per clock cycle.
Also as described in [10], the TO filters can work with higher clock frequencies than the corresponding conventional ones. Hence, the throughput of the TO filters is at least twice as much as that of the conventional filters.
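The throughput difference can be captured by a toy cycle count: for a row of n input samples, the conventional unit emits one coefficient per cycle (L and H interleaved), while the TO unit consumes two samples and emits an (L, H) pair per cycle. This back-of-the-envelope model is ours; it deliberately ignores pipeline fill and the higher clock frequency that the TO design additionally achieves:

```python
def row_cycles(n_samples, unit):
    """Cycles for one row of n_samples (even) through a 1D filtering unit.

    'conventional': one shared data-path, one L or H coefficient per
    cycle (interleaved) -> n_samples cycles per row.
    'TO': two samples in, separate L and H data-paths, one (L, H) pair
    out per cycle -> n_samples // 2 cycles per row.
    """
    if unit == "conventional":
        return n_samples
    if unit == "TO":
        return n_samples // 2
    raise ValueError(unit)
```

Even before accounting for the clock-frequency advantage, the TO unit halves the cycle count, which is why the paper states its throughput is at least twice that of the conventional filter.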
Fig. 2. (a) Conventional filter architecture, (b) TO architecture.
As described in [10], there is a different design for each combination of the number of taps (even, odd) and the number of samples to be processed (even, odd). Hence, for the four types of TO analysis filters, called the 5/3, 9/7, 2/10 and 10/18 filter banks, there are eight different designs. In the filter notation x/y, x corresponds to the number of taps of the low-pass filter, while y refers to the number of taps of the high-pass filter.
4. DWT encoder architecture

In this section, the 2D-DWT encoder architecture for computing the tiled 2D-DWT, along with a proper scheduling algorithm, is described. Prior to the application of the 2D-DWT, the image is optionally decomposed into a number of non-overlapping rectangular blocks called tiles. Hereafter, to express the degree to which an image of size N × N is decomposed into tiles, we use the tile factor l, which means that the image is separated into 2^{2l} tiles of size N/2^l × N/2^l. The typical values of the tile factor, and the ones we use in our experiments, are 1, 2 and 3. After decomposing the image into tiles, the separable 2D-DWT is applied on each tile independently by performing two successive 1D-DWTs along the rows and the columns of the specific tile. The decomposition of the image into tiles and the application of the 2D-DWT on each tile independently impose additional computation cycles compared to the non-tiled case. The cycle overhead comes from the symmetric extension which takes place at the tile boundaries and increases as the number of tiles into which the image is decomposed increases. The hardware architecture used to perform the 2D-DWT in each tile is based on a well-known architecture template presented in [9] and is shown in Fig. 3. The proposed architecture is derived from this template by replacing the filtering units with the proper filters
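The tile decomposition itself is straightforward. The sketch below splits a power-of-two-sized image according to the tile factor l (the function name is ours, not from the paper); each resulting tile would then be transformed independently by the separable 2D-DWT:

```python
import numpy as np

def split_into_tiles(image, l):
    """Split an N x N image into 2**(2*l) non-overlapping tiles of
    size (N / 2**l) x (N / 2**l), where l is the tile factor."""
    n = image.shape[0]
    t = n >> l  # tile side N / 2**l
    return [image[r:r + t, c:c + t]
            for r in range(0, n, t)
            for c in range(0, n, t)]

# Usage: tile factor l = 1 yields 2**2 = 4 tiles of half the side length.
tiles = split_into_tiles(np.arange(64).reshape(8, 8), l=1)
```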
Fig. 3. Block diagram of the proposed architecture.
presented in [10] and by estimating the proper size for the memory units. As shown in Fig. 3, the encoder architecture consists of: (a) two parallel computational units—Filter 1 (F1) and Filter 2 (F2)—for computing the L, H subband coefficients and the (LL)^a L, (LL)^a H coefficients, respectively, along the rows; (b) two parallel computational units—Filter 3 (F3) and Filter 4 (F4)—for computing the (LL)^a LL, (LL)^a LH and (LL)^a HL, (LL)^a HH coefficients, respectively, along the columns; and (c) two storage units (SUs) for storing the intermediate results among stages of computation.

The scheduling algorithm is separated into two parts. The first part refers to the scheduling of the filtering units in order to perform the 2D-DWT in the tile and is based on the line-based approach [15], where the input image is scanned first in a row-by-row and then in a column-by-column manner. The objective of this algorithm is to minimize the memory requirements and maximize the throughput. The second part refers to the scheduling of the filtering units among the tiles, which is done in a way that reduces, as much as possible, the cycle overhead imposed by the tile-based 2D-DWT. Both the architecture and the scheduling algorithm comply with the following concept: proceed to the next layer's filtering as soon as possible (ASAP), without interleaving filtering along a row.

4.1. Proposed scheduling algorithm description inside the tile

The proposed scheduling algorithm is based on an algorithm analogous to the 1D recursive pyramid algorithm (RPA) [7]. In both column- and row-wise filtering, the DWT coefficients are produced in a row-major order. The filters' operations are scheduled as follows. Filter 1 scans the N rows of the input image and, for each row, produces N/2 high-frequency transform coefficients and N/2 low-frequency coefficients, denoted H and L, respectively. The coefficients are produced in a row-major order, one L and one H coefficient per clock cycle, and are stored into SU 1.
This filtering unit is always responsible for the calculations of the L and H subbands. Filter 3 reads the low-pass coefficients (LL)^a L from SU 1 and computes the (LL)^a LL and (LL)^a LH coefficients—by applying vertical filtering along the columns—in a row-major order, where 0 ≤ a < J. In order to accomplish the row-major production of the (LL)^a LL coefficients, the vertically positioned filter mask slides, each time, one position to the right on the image at the current stage of decomposition. In this way, the consumption of the coefficients resembles the way they are produced. Because Filter 3 is used for all (LL)^a LL coefficients, where 0 ≤ a < J, a higher priority is given to the lower-stage calculations. In this way, the row-wise production of the coefficients of a stage cannot be interrupted until the row is finished. Filter 4 works in parallel with Filter 3 but reads the high-pass coefficients (LL)^a H from SU 1 and computes the (LL)^a HL and (LL)^a HH coefficients along the columns in the same way as Filter 3. Both filters load their coefficients for each filtering in parallel. Hence, one (LL)^a LL, (LL)^a LH, (LL)^a HL and (LL)^a HH coefficient is produced per clock cycle. Filter 2 computes the (LL)^a L and (LL)^a H coefficients along the rows as shown in Fig. 3, where 1 ≤ a ≤ J. The outputs are stored into SU 1. The filtering is not interleaved along a row. Moreover, the low-pass coefficients (LL)^a LL are consumed by Filter 2 in the same way that they are produced by Filter 3; thus, Filter 2 and Filter 3 operate in a lock-step manner. The (LL)^a LL coefficients are fed directly to the delay line of Filter 2 as soon as they are produced. Therefore, there
is no need for buffering between Filter 2 and Filter 3, so SU 2 is not needed because of the way the operation of these two filters is scheduled. SU 1 loads the (LL)^a L coefficients for applying the filtering to subsequent stages. For each decomposition level, the horizontal sliding of the vertically positioned filter mask of Filters 3 and 4 imposes the need for storing $K(N/2^{i-1})$, $1 \le i \le a$, coefficients in order to fulfill the requirements for the operation of the aforementioned filters. Hence, the total memory size (in words) needed for these calculations corresponding to J stages is equal to:

$$\sum_{a=1}^{J} K\frac{N}{2^{a-1}} = KN + K\frac{N}{2} + \cdots + K\frac{N}{2^{J-1}} = 2KN\left(1-\frac{1}{2^{J}}\right). \qquad (3)$$

In Fig. 4, an example of the proposed architecture is presented for the case of an input image of size 16 × 16 pixels, J = 3 levels of decomposition and the 5/3 filter bank. The proposed scheduling algorithm consists of the following steps:

(i) Schedule the $L_{i,j}$ and $H_{i,j}$ coefficients at time instance:

$$t_{L,H}(i,j) = \frac{N}{2}\,i + j. \qquad (4)$$

(ii) Schedule the $(LL)^a LL_{i,j}$, $(LL)^a HL_{i,j}$, $(LL)^a LH_{i,j}$, $(LL)^a HH_{i,j}$ coefficients at time instance:

$$t_{(LL)^a}(i,j) = \frac{(2^{a}-1)N}{2}\,\frac{K-1}{2} + \left(1-\frac{1}{2^{a-1}}\right)N + 2^{a-1}Ni + j + 1. \qquad (5)$$

Fig. 4. Example of the proposed architecture.
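The closed form in Eq. (3) is the geometric series of the per-level buffers summed over the J stages; it can be cross-checked numerically (N a power of two, so the divisions are exact):

```python
def buffer_words_sum(N, K, J):
    """Left-hand side of Eq. (3): per-level storage K * N / 2**(a-1)
    for the vertical filter masks, summed over levels a = 1..J."""
    return sum(K * N // 2 ** (a - 1) for a in range(1, J + 1))

def buffer_words_closed(N, K, J):
    """Right-hand side of Eq. (3): 2*K*N*(1 - 1/2**J)."""
    return int(2 * K * N * (1 - 1 / 2 ** J))
```

For instance, a 128 × 128 tile with the 5/3 filter bank (K = 5) and J = 3 needs 640 + 320 + 160 = 1120 words either way.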
(iii) Schedule the $(LL)^a L$, $(LL)^a H$ coefficients at time instance:

$$t_{(LL)^a L}(i,j) = \frac{(2^{a}-1)N}{2}\,\frac{K-1}{2} + \left(1-\frac{1}{2^{a-1}}\right)N + 2^{a-1}Ni + 2j + \frac{K+3}{2}, \qquad (6)$$

where $i = \{0, 1, 2, \ldots, N/2^{a}-1\}$, $j = \{0, 1, 2, \ldots, N/2^{a}-1\}$, $1 \le a \le J$, and i, j stand for the ith row and the jth column, respectively.
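Assuming the forms of Eqs. (4)–(6) as reconstructed above (with N a power of two so that integer arithmetic is exact), the schedule can be tabulated directly; note that Eq. (5) at a = 1, i = j = 0 reproduces the first-phase length c1 = N(K − 1)/4 + 1 used in Section 4.2:

```python
def t_LH(N, i, j):
    """Eq. (4): cycle at which Filter 1 emits the L[i,j], H[i,j] pair."""
    return (N // 2) * i + j

def t_LL(N, K, a, i, j):
    """Eq. (5): cycle of the (LL)^a LL/LH/HL/HH coefficients at (i, j)."""
    return ((2 ** a - 1) * N * (K - 1)) // 4 \
        + (N - N // 2 ** (a - 1)) + 2 ** (a - 1) * N * i + j + 1

def t_LLL(N, K, a, i, j):
    """Eq. (6): cycle of the (LL)^a L, (LL)^a H coefficients at (i, j)."""
    return ((2 ** a - 1) * N * (K - 1)) // 4 \
        + (N - N // 2 ** (a - 1)) + 2 ** (a - 1) * N * i + 2 * j + (K + 3) // 2
```

The j + 1 versus 2j terms reflect the consumption pattern: Filters 3 and 4 emit one coefficient of each subband per cycle, while Filter 2 advances two cycles per output along a row.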
4.2. Proposed scheduling algorithm description among tiles

By taking into account the scheduling Eqs. (4)–(6), it can be concluded that the 2D-DWT computation on a tile can be separated into three phases (Fig. 5a). In each phase, a filtering unit may be idle, meaning that it either waits for the appropriate number of input samples or has finished its computation for the specific tile. In the first phase, only F1 operates, while F2, F3 and F4 are idle waiting for the appropriate samples. This phase lasts until the first sample of the (LL) level is produced; hence, according to Eq. (4), it holds for $c_1 = N(K-1)/4 + 1$ cycles. After this time instant, F2, F3 and F4 operate until the filtering of the tile ends. However, F1 does not work all the time: it goes idle once the L and H subbands have been produced, at time instant $c_2 = N^2/2 - 1$. According to Eq. (5), the number of cycles needed to perform the filtering in one tile is

$$c_3 = (2^{a}-1)\frac{N(K-1)}{4} + N\left(1-\frac{1}{2^{a-1}}\right) + 2^{a-1}N\left(\frac{N}{2^{a}}-1\right) + \frac{N}{2^{a}}. \qquad (7)$$

So, according to Fig. 5a, it can be deduced that the time during which F1 remains idle is $c_4 = c_3 - c_2$. Instead of letting F1 go idle, it starts processing the next tile without conflicting with the previous one. This is assured by assuming that the number of decomposition levels is greater than 1; then $c_4 > c_1$ is always satisfied and interleaving is possible. Therefore, after F2, F3 and F4 have finished the filtering of the previous tile, they start processing the already produced subband samples (L) of the next tile without passing through the idle state (Fig. 5b). Therefore, the number of cycles to perform the tile-based 2D-DWT with interleaved computations among tiles is

$$C_{t-int}(N,a) = c_3(N/2^{l},a) + (2^{2l}-1)\,c_2(N/2^{l},a), \qquad (8)$$

Fig. 5. (a) Phases in the tile computation, (b) interleaved and non-interleaved scheduling.
where l is the tile factor. Additionally, the number of cycles without interleaving is

$$C_{t}(N,a) = 2^{2l}\,c_3(N/2^{l},a). \qquad (9)$$

By subtracting Eq. (8) from Eq. (9) we obtain

$$C_{t}(N,a) - C_{t-int}(N,a) = (2^{2l}-1)\left[\frac{1}{4}\frac{N}{2^{l}}\left(2^{a}(K-3) + (5-K) - \frac{1}{2^{a-2}}\right) + 1\right]. \qquad (10)$$
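Assuming the reconstructed c2, c3 and Eqs. (8)–(10) above, the interleaving gain can be verified numerically; the difference of Eqs. (9) and (8) is exactly $(2^{2l}-1)(c_3 - c_2)$ evaluated on the tile side $N/2^l$:

```python
def c2(N):
    """Cycle at which Filter 1 finishes the L and H subbands of a tile."""
    return N * N // 2 - 1

def c3(N, K, a):
    """Eq. (7): cycles to complete the 2D-DWT of one N x N tile."""
    return (2 ** a - 1) * N * (K - 1) // 4 + (N - N // 2 ** (a - 1)) \
        + 2 ** (a - 1) * N * (N // 2 ** a - 1) + N // 2 ** a

def total_cycles(N, K, a, l, interleave):
    """Eqs. (8)-(9): total cycles for the 2**(2*l) tiles of side N/2**l."""
    m = N // 2 ** l
    if interleave:
        return c3(m, K, a) + (2 ** (2 * l) - 1) * c2(m)
    return 2 ** (2 * l) * c3(m, K, a)
```

For example, with N = 128, K = 5, a = 3 and l = 1 (tiles of side 64), interleaving saves 3 × (2296 − 2047) = 747 cycles, matching the bracketed factor of Eq. (10).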
Based on the prerequisite a > 1, the second factor of Eq. (10) is always positive irrespective of K, i.e. $C_t(N,a) > C_{t-int}(N,a)$. Thus, by interleaving the filtering among tiles, a reduction in the additional cycles imposed by the tile-based 2D-DWT can be achieved.

4.3. Comparisons with existing architectures

In this sub-section, the proposed and existing 2D-DWT encoder architectures are compared in terms of memory size and performance. The results for the total cycles needed to perform the 2D-DWT on a 128 × 128 image are shown in Fig. 6. They are based on 12 different scenarios regarding the tile factor l (1, 2, 3) and the number of decomposition stages J (2, 3, 4). Each graph corresponds to a different selection of the
Fig. 6. Performance of the proposed and previously published 2D-DWT architectures.
filter bank (5/3, 9/7, 2/10, 10/18) implemented on the proposed architecture. For each scenario, the total cycles needed to perform the 2D-DWT with the proposed architecture—in three different ways of implementing the 2D-DWT (tiled with interleaving, tiled without interleaving and non-tiled)—are compared with the cycles needed by the architectures of [5,8]. As can be concluded, the performance of the proposed architecture exceeds that of the existing architectures in all scenarios when the 2D-DWT is applied without tiling. The computation of the 2D-DWT without tiling in the proposed architecture is faster than the computation in the existing architectures by approximately a factor of two. Even when tiling is used, the performance of the proposed architecture is still better than that of the existing architectures in most cases. The results for image sizes ranging from 64 × 64 to 256 × 256 are similar. In Fig. 7, the comparison of the memory requirements of the proposed 2D-DWT encoder and the existing encoding architectures [5,6,8] is illustrated for the case of an input image of size 128 × 128 without the use of tiling. The results are based on 12 different scenarios regarding the levels of decomposition J (2, 3, 4) and the filter length (5, 9, 10, 18). As can be deduced, the memory requirements of the existing architectures exceed those of the proposed architecture in all cases for the same input image size. Specifically, the memory requirements of the existing encoder architectures are 2–9.3% higher than those of the proposed architecture. This result is based on the comparison of the proposed architecture with the most memory-efficient existing one [8]. In conclusion, the proposed architecture is characterized by low memory requirements and higher performance compared to previously published 2D-DWT encoder architectures.
It partially overcomes the performance bottleneck imposed by the tile-based 2D-DWT by interleaving the computations among tiles and still benefits from the low-memory requirements of this type of wavelet transform.
Fig. 7. Memory requirements of the proposed and previously published 2D-DWT architectures.
5. Conclusions—future work

The proposed JPEG2000-compliant 2D-DWT architecture, which uses as its data-path the filters previously described in [10] and performs the computations using the introduced scheduling algorithm, achieves higher performance and consumes less memory area than other existing architectures. Ongoing work focuses on the development of a 2D-DWT decoder for the JPEG2000 standard that embeds the proposed TO synthesis filters. We are also conducting research on the design and implementation of a complete JPEG2000 encoder–decoder (codec).

Acknowledgements

We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program HERAKLEITOS, for funding the above work.

References

[1] S. Mallat, A Wavelet Tour of Signal Processing, second ed., Academic Press, New York, 1999.
[2] JPEG 2000 Image Coding System, ISO/IEC FCD15444-1, 2000.
[3] A. Skodras, C. Christopoulos, T. Ebrahimi, The JPEG 2000 still image compression standard, IEEE Signal Process. Mag. 18 (5) (2001) 36–58.
[4] MPEG-4, ISO/IEC JTC1/SC29/WG11, FCD 14496, Coding of Moving Pictures and Audio, May 1998.
[5] C. Chakrabarti, M. Vishwanath, Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to SIMD parallel computers, IEEE Trans. Signal Process. 43 (3) (1995) 759–771.
[6] C. Chakrabarti, M. Vishwanath, R.M. Owens, Architectures for wavelet transforms: a survey, J. VLSI Signal Process. 4 (2) (1996) 171–192.
[7] M. Vishwanath, R.M. Owens, M.J. Irwin, VLSI architectures for the discrete wavelet transform, IEEE Trans. Circuits Syst. II 42 (5) (1995).
[8] N.D. Zervas, G.P. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos, C.E. Goutis, Evaluation of design alternatives for the 2-D discrete wavelet transform, IEEE Trans. Circuits Syst. Video Technol. 11 (2) (2001) 1246–1262.
[9] C. Chakrabarti, C. Mumford, Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 3256–3259.
[10] G. Dimitroulakos, N.D. Zervas, N. Sklavos, C.E. Goutis, An efficient VLSI implementation for forward and inverse wavelet transform for JPEG2000, in: Proceedings of the 14th International Conference on Digital Signal Processing (DSP2002), Santorini, Greece, 2002, pp. 233–236.
[11] F. Fridman, E.S. Manolakos, Distributed memory and control VLSI architectures for the 1-D discrete wavelet transform, in: IEEE VLSI Signal Processing VII, 1994, pp. 388–397.
[12] Grzeszczak, et al., VLSI implementation of discrete wavelet transform, IEEE Trans. VLSI Syst. 4 (1996) 421–433.
[13] A.S. Lewis, G. Knowles, VLSI architecture for 2-D Daubechies wavelet transform without multipliers, Electron. Lett. 27 (5) (1991) 171–173.
[14] T. Denk, K. Parhi, Calculation of minimum number of registers in 2-D discrete wavelet transforms using lapped block processing, in: Proceedings of the International Symposium on Circuits and Systems, 1994, pp. 77–81.
[15] C. Chrysafis, A. Ortega, Line-based, reduced memory, wavelet image compression, IEEE Trans. Image Process. 9 (3) (2000) 378–389.
[16] G. Lafruit, et al., Optimal memory organizations for scalable texture codecs in MPEG-4, IEEE Trans. Circuits Syst. Video Technol. 9 (2) (1999) 218–242.