A high-throughput, memory efficient architecture for computing the tile-based 2D discrete wavelet transform for the JPEG2000


ARTICLE IN PRESS

INTEGRATION, the VLSI journal 39 (2005) 1–11 www.elsevier.com/locate/vlsi

G. Dimitroulakos, M.D. Galanis, A. Milidonis, C.E. Goutis

VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece

Received 26 November 2003; received in revised form 1 November 2004; accepted 30 November 2004

Abstract

This paper describes the design and implementation of a hardware architecture, optimized for speed and memory requirements, that computes the tile-based 2D forward discrete wavelet transform for the JPEG2000 image compression standard. The proposed architecture is based on a well-known architecture template for calculating the 2D forward discrete wavelet transform. It is derived by replacing the filtering units with our previously published throughput-optimized ones and by developing a scheduling algorithm suited to the special features of these filtering units. The architecture exhibits high-performance characteristics due to the throughput-optimized filters. In addition, the extra clock cycles required by the tile-based version of the discrete wavelet transform are partially compensated by the proper scheduling of the filters. The developed scheduling algorithm results in reduced memory requirements compared with existing architectures. © 2005 Elsevier B.V. All rights reserved.

Keywords: Discrete wavelet transform; JPEG2000 standard; Tile-based 2D wavelet transform; Scheduling

Corresponding author. Tel.: +30 2610 997324; fax: +30 2610 994798.

E-mail address: [email protected] (G. Dimitroulakos).

0167-9260/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2004.11.002


1. Introduction

The discrete wavelet transform (DWT) has been introduced as a highly efficient and flexible method for subband decomposition of signals [1]. In digital signal processing (DSP), it has demonstrated good algorithmic performance, especially in the fields of image and video compression. The inclusion of the DWT in contemporary multimedia compression standards, such as JPEG2000 [2,3] and MPEG-4 [4], has led to intensive research efforts to improve the implementation aspects of the transform.

Many architectures that implement the 2D separable forward DWT (2D-DWT) and inverse DWT (2D-IDWT) for 2D signals have been presented in the past [5–8]. These architectures consist of filters for performing the one-dimensional (1D) DWT and memory units for storing the results of the transformation. The filters must be optimized for speed because the multimedia applications in which the DWT is present are characterized by high-throughput requirements. The storage size can be minimized by properly ordering the computations (this ordering is called time scheduling) while taking into account the algorithmic dependencies. Given an architecture template and the algorithm for computing the DWT, an intelligent time scheduling is required to achieve improved performance and memory requirements. The goal of scheduling is to maximize the utilization of the filters while minimizing the memory buffering between successive stages of computation.

In this paper, an architecture optimized in terms of speed and memory requirements for computing the tile-based (TB) 2D-DWT, tailored to a JPEG2000 encoder, is presented. The proposed architecture is based on a well-established architecture template presented in [9], where the four conventional filtering units [6,8] are replaced by the four throughput-optimized ones presented in [10]. An intelligent scheduling algorithm, adapted to the architecture of these filters and based on the line-based algorithm for computing the 2D-DWT, is also proposed. The proposed scheduling algorithm minimizes the memory required between levels of decomposition in the 2D-DWT computation and results in improved throughput, which derives from the use of our filters in the encoder architecture.

The rest of the paper is organized as follows. In Section 2, the related work in developing hardware architectures for the 2D-DWT is presented. In Section 3, a brief description of the special characteristics of the 1D-DWT filters employed is given. Sections 4.1 and 4.2 illustrate the 2D-DWT encoder architecture and present the scheduling algorithms. The memory requirements and performance of the proposed 2D-DWT encoder architecture, and their comparison with existing encoders, are illustrated in Section 4.3. Finally, Section 5 concludes the paper and presents ongoing and future work.

2. Related work

Numerous architectures have been proposed for computing the 1D-DWT [5–7,11,12]. The 1D-DWT architectures can, in principle, be extended to architectures for computing the separable 2D-DWT. This is because the separable 2D-DWT can be computed by 1D-DWT filtering on rows followed by 1D-DWT filtering on columns. In contrast, the non-separable 2D-DWT processes 2D


data directly. Because minimizing latency and storage requirements is an important goal for most streaming multimedia applications, mapping 1D-DWT architectures to 2D-DWT architectures is not a trivial issue.

Lewis and Knowles [13] were the first to propose an architecture for the 2D-DWT. Their architecture was tuned to the Daubechies four-tap filters, so it suffered from limited scalability, since it depends strongly on the particular properties of the filters used. Chakrabarti and Vishwanath [5] proposed a scalable architecture for the encoder based on the non-separable 2D-DWT. Their architecture consists of two parallel computation units of size $K^2$ and a storage unit of size $NK$, where $N \times N$ is the size of the input image and $K$ is the number of filter taps for each high-pass and low-pass filter. A parallel computation unit of size $M$ consists of $M$ multipliers and a tree of adders to add the $M$ products. Vishwanath et al. [7] proposed an architecture for the separable 2D-DWT, which consists of two systolic arrays of size $K$, two parallel computational units of size $K$, and a storage unit of size $N(2K + J)$, where $J$ is the number of DWT decomposition levels. A drawback of this architecture is that two rows of inputs are fed into the two systolic arrays every two cycles; as a result, an additional data converter is required to convert the raster scan input (one sample per cycle) into an output of two samples every two cycles.

Chakrabarti and Mumford [9] introduced an architecture for the analysis (synthesis) filters based on the 2D-DWT, together with two scheduling algorithms for computing the forward (inverse) 2D-DWT. The goal was to minimize the storage requirements and keep the data flow regular. Denk and Parhi [14] proposed an encoder architecture based on the 2D-DWT, which uses lapped block processing techniques to exploit the overlap in the input data blocks. The architecture uses a minimum number of registers and the forward–backward allocation scheme to allocate the variables. However, this comes at the expense of scalability and complex routing. The three main architectures for computing the 2D-DWT (level by level, line-based [15] and block-based [16]) are thoroughly studied and compared in terms of memory requirements, throughput and energy dissipation by Zervas et al. [8].

3. 1D-DWT filters characteristics

As mentioned in Section 2, the implementation of the 2D-DWT is based on the application of two 1D-DWTs on the rows and columns of the input image. The 1D-DWT is a two-channel subband decomposition (Fig. 1a) of an input signal $x[n]$ that produces two subband coefficients

Fig. 1. (a) 1D-DWT filter bank, (b) 2D-DWT filter bank.


$y_0[n]$ and $y_1[n]$ for one stage of decomposition [1] according to the following equations:

$$y_1[n] = \sum_{k} x[k]\, h[2n-k], \qquad (1)$$

$$y_0[n] = \sum_{k} x[k]\, w[2n-k]. \qquad (2)$$

Here $h[n]$ and $w[n]$ are the high-pass and low-pass filters, respectively, while $y_1[n]$ and $y_0[n]$ are the high-pass and low-pass subband coefficients (denoted hereafter H and L), respectively. After the filtering operations, the produced sequences are downsampled by a factor of 2. The four-channel subband decomposition of an input image is derived from the separable 2D-DWT (Fig. 1b) by the successive application of two two-channel 1D-DWT decompositions along the rows and columns, producing the (LL), (LH), (HL) and (HH) coefficients. The coefficients produced by the 2D-DWT after $a$ stages of decomposition are denoted as $(LL)^{a-1}LL$, $(LL)^{a-1}LH$, $(LL)^{a-1}HL$ and $(LL)^{a-1}HH$. Throughout this paper, we assume that: (a) the image size is $N \times N$; (b) the number of stages of decomposition is $J$; (c) the maximum number of filter taps among the high-pass and low-pass filters is $K$.

The architecture of the 1D-DWT filters employed in the 2D-DWT computation is based on the equations for the 1D-DWT as given in [2], and it is described in detail in [10]. There are basically two types of parallel convolutional filters for computing the 1D-DWT, which are shown in Fig. 2. The first was presented in our previous work [10] and is hereafter called throughput optimized (TO); the second, called conventional, is presented in [6,8]. The conventional filter architecture (Fig. 2a) consists of one delay line and one data-path shared by low-pass and high-pass filtering, which works in an interleaved manner. For this reason, the conventional architecture produces one sample (low-pass or high-pass) per clock cycle. On the other hand, the TO filter (Fig. 2b) consists of a modified delay line that receives two input samples per clock cycle and separate data-paths for high-pass and low-pass filtering, so that it produces two subband coefficients per clock cycle. Also, as described in [10], the TO filters can work at higher clock frequencies than the corresponding conventional ones. Hence, the throughput of the TO filters is at least twice that of the conventional filters.
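As an illustration of Eqs. (1) and (2) and of the row-then-column application described above, the following Python sketch computes one decomposition stage in software. It is only a behavioural model under stated assumptions, not the hardware filters of [10]: the function names `analysis_stage` and `dwt2d_one_level` are hypothetical, the taps are indexed from 0, samples outside the signal are taken as zero (the encoder itself relies on symmetric extension), and the two-tap filters in the usage lines are placeholders rather than one of the JPEG2000 filter banks.

```python
def analysis_stage(x, h, w):
    """One 1D-DWT stage: returns (low, high) following Eqs. (1)-(2).

    h and w are the high-pass and low-pass taps, indexed from 0 (assumption).
    Samples outside x are treated as zero (assumption, no symmetric extension).
    """
    def tap(f, m):
        # Filter coefficient f[m], zero outside the filter support.
        return f[m] if 0 <= m < len(f) else 0.0

    low, high = [], []
    for n in range(len(x) // 2):  # evaluating at 2n realises the downsampling
        high.append(sum(x[k] * tap(h, 2 * n - k) for k in range(len(x))))
        low.append(sum(x[k] * tap(w, 2 * n - k) for k in range(len(x))))
    return low, high


def dwt2d_one_level(image, h, w):
    """Separable 2D stage: filter the rows, then the columns of L and H."""
    row_out = [analysis_stage(row, h, w) for row in image]
    L = [lo for lo, _ in row_out]
    H = [hi for _, hi in row_out]

    def filter_columns(block):
        col_out = [analysis_stage(list(col), h, w) for col in zip(*block)]
        lo = [list(r) for r in zip(*[c[0] for c in col_out])]
        hi = [list(r) for r in zip(*[c[1] for c in col_out])]
        return lo, hi

    LL, LH = filter_columns(L)  # column filtering of the L subband
    HL, HH = filter_columns(H)  # column filtering of the H subband
    return LL, LH, HL, HH


if __name__ == "__main__":
    w_demo = [0.5, 0.5]    # placeholder low-pass taps (illustrative only)
    h_demo = [0.5, -0.5]   # placeholder high-pass taps (illustrative only)
    print(analysis_stage([1, 2, 3, 4], h_demo, w_demo))
    LL, LH, HL, HH = dwt2d_one_level([[1.0] * 4 for _ in range(4)], h_demo, w_demo)
    print(LL)
```

Each pass of the loop in `analysis_stage` yields one L and one H coefficient together, which mirrors the two-coefficients-per-cycle behaviour attributed to the TO filters; the conventional filter would instead emit them on alternate cycles.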

Fig. 2. (a) Conventional filter architecture, (b) TO architecture.


As described in [10], the design differs according to whether the number of taps is even or odd and whether the number of samples to be processed is even or odd. Hence, for the four types of TO analysis filters, called the 5/3, 9/7, 2/10 and 10/18 filter banks, there are eight different designs. In the filter notation x/y, x is the number of taps of the low-pass filter and y is the number of taps of the high-pass filter.

4. DWT encoder architecture

In this section, the 2D-DWT encoder architecture for computing the tiled 2D-DWT is described, along with a proper scheduling algorithm. Prior to the application of the 2D-DWT, the image is optionally decomposed into a number of non-overlapping rectangular blocks called tiles. Hereafter, to express the degree to which an image of size $N \times N$ is decomposed into tiles, we use the tile factor $l$, which means that the image is separated into $2^{2l}$ tiles of size $N/2^l \times N/2^l$. The typical values of the tile factor, which are also the ones used in our experiments, are 1, 2 and 3. After decomposing the image into tiles, the separable 2D-DWT is applied to each tile independently by performing two successive 1D-DWTs along the rows and the columns of the specific tile. Decomposing the image into tiles and applying the 2D-DWT to each tile independently imposes additional computation cycles compared to the non-tiled case. The cycle overhead comes from the symmetric extension which takes place at the image boundaries, and it increases with the number of tiles into which the image is decomposed.

The hardware architecture used to perform the 2D-DWT in each tile is based on a well-known architecture template presented in [9] and is shown in Fig. 3. The proposed architecture is derived from this template by replacing the filtering units with the proper filters

Fig. 3. Block diagram of the proposed architecture.


presented in [10] and by estimating the proper size for the memory units. As shown in Fig. 3, the encoder architecture consists of: (a) two parallel computational units, Filter 1 (F1) and Filter 2 (F2), for computing the L, H subband coefficients and the $(LL)^aL$, $(LL)^aH$ coefficients, respectively, along the rows; (b) two parallel computational units, Filter 3 (F3) and Filter 4 (F4), for computing the $(LL)^aLL$, $(LL)^aLH$ and the $(LL)^aHL$, $(LL)^aHH$ coefficients, respectively, along the columns; and (c) two storage units (SUs) for storing the intermediate results between stages of computation.

The scheduling algorithm is separated into two parts. The first part refers to the scheduling of the filtering units in order to perform the 2D-DWT within a tile and is based on the line-based approach [15], where the input image is scanned first in a row-by-row and then in a column-by-column manner. The objective of this part is to minimize the memory requirements and maximize throughput. The second part refers to the scheduling of the filtering units among the tiles, which is done so as to reduce, as much as possible, the cycle overhead imposed by the tile-based 2D-DWT. Both the architecture and the scheduling algorithm comply with the following concept: proceed to the next layer of filtering as soon as possible (ASAP), without interleaving filtering along a row.

4.1. Proposed scheduling algorithm description inside the tile

The proposed scheduling algorithm is based on an algorithm analogous to the 1D recursive pyramid algorithm (RPA) [7]. In both column-wise and row-wise filtering, the DWT coefficients are produced in row-major order. The filters' operations are scheduled as follows.

Filter 1 scans the $N$ rows of the input image and produces, for each row, $N/2$ high-frequency and $N/2$ low-frequency transform coefficients, denoted H and L, respectively. The coefficients are produced in row-major order, one L and one H coefficient per clock cycle, and are stored into SU 1. This filtering unit is always responsible for the calculation of the L and H subbands.

Filter 3 reads the low-pass coefficients $(LL)^aL$ from SU 1 and computes the $(LL)^aLL$ and $(LL)^aLH$ coefficients, by applying vertical filtering along the columns, in row-major order, where $0 \le a < J$. In order to produce the $(LL)^aLL$ coefficients in row-major order, the vertically positioned filter mask slides each time one position to the right on the image at the current stage of decomposition. In this way, the coefficients are consumed in the same order in which they are produced. Because Filter 3 is used for all $(LL)^aLL$ coefficients, where $0 \le a < J$, a higher priority is given to the calculations of the lower stages. In addition, the row-wise production of the coefficients of a stage cannot be interrupted until the row is finished.

Filter 4 works in parallel with Filter 3 but reads the high-pass coefficients $(LL)^aH$ from SU 1 and computes the $(LL)^aHL$ and $(LL)^aHH$ coefficients along the columns in the same way as Filter 3. Both filters load their coefficients for each filtering operation in parallel. Hence, one $(LL)^aLL$, one $(LL)^aLH$, one $(LL)^aHL$ and one $(LL)^aHH$ coefficient are produced per clock cycle.

Filter 2 computes the $(LL)^aL$ and $(LL)^aH$ coefficients along the rows as shown in Fig. 3, where $1 \le a \le J$. The outputs are stored into SU 1. The filtering is not interleaved along a row. Moreover, the low-pass coefficients $(LL)^aLL$ are consumed by Filter 2 in the same order in which they are produced by Filter 3. Thus Filter 2 and Filter 3 operate in a lock-step manner. The $(LL)^aLL$ coefficients are fed directly to the delay line of Filter 2 as soon as they are produced. Therefore, there


is no need for buffering between Filter 2 and Filter 3. So, SU 2 is not needed because of the way the operation of these two filters is scheduled. SU 1 loads the $(LL)^aL$ coefficients for applying the filtering to subsequent stages.

For each decomposition level, the horizontal sliding of the vertically positioned filter mask of Filters 3 and 4 imposes the need for storing $K(N/2^{i-1})$, $1 \le i \le a$, coefficients in order to fulfill the requirements for the operation of the aforementioned filters. Hence, the total memory size (in words) needed for these calculations over $J$ stages is equal to:

$$KN + K\frac{N}{2} + \cdots + K\frac{N}{2^{J-1}} = KN\sum_{a=1}^{J}\frac{1}{2^{a-1}} = 2KN\left(1 - \frac{1}{2^{J}}\right). \qquad (3)$$

In Fig. 4, an example of the proposed architecture is presented for the case of an input image of size $16 \times 16$ pixels, for $J = 3$ levels of decomposition and using the 5/3 filter bank. The proposed scheduling algorithm consists of the following steps:

(i) Schedule the $L_{i,j}$ and $H_{i,j}$ coefficients at time instance:

$$t_{L,H}(i,j) = \frac{N}{2}\,i + j. \qquad (4)$$

(ii) Schedule the $(LL)^aLL_{i,j}$, $(LL)^aHL_{i,j}$, $(LL)^aLH_{i,j}$, $(LL)^aHH_{i,j}$ coefficients at time instance:

$$t_{(LL)^a}(i,j) = \frac{2^a - 1}{2}\,\frac{N}{2}\,(K-1) + \left(1 - \frac{1}{2^{a-1}}\right)N + 2^{a-1}N\,i + j + 1. \qquad (5)$$

Fig. 4. Example of the proposed architecture.


(iii) Schedule the $(LL)^aL$, $(LL)^aH$ coefficients at time instance:

$$t_{(LL)^aL}(i,j) = \frac{2^a - 1}{2}\,\frac{N}{2}\,(K-1) + \left(1 - \frac{1}{2^{a-1}}\right)N + 2^{a-1}N\,i + 2j + \frac{K+3}{2}, \qquad (6)$$

where $i = \{0, 1, 2, \ldots, N/2^a - 1\}$, $j = \{0, 1, 2, \ldots, N/2^a - 1\}$, $1 \le a \le J$, and $i$, $j$ stand for the $i$th row and the $j$th column, respectively.
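The three scheduling steps can be checked numerically with a short Python sketch. It assumes the following reading of Eqs. (4)–(6): $t_{L,H} = (N/2)i + j$; $t_{(LL)^a} = (2^a-1)N(K-1)/4 + (1 - 1/2^{a-1})N + 2^{a-1}Ni + j + 1$; and the same leading terms followed by $2j + (K+3)/2$ for Eq. (6). The function names are hypothetical, and exact rational arithmetic is used because $(K+3)/2$ is not an integer for even $K$.

```python
from fractions import Fraction


def t_lh(N, i, j):
    """Eq. (4): cycle at which Filter 1 emits L[i][j] and H[i][j]."""
    return Fraction(N, 2) * i + j


def _stage_offset(N, K, a, i):
    """Terms shared by Eqs. (5) and (6), as read in this sketch (assumption)."""
    return (Fraction((2**a - 1) * N * (K - 1), 4)
            + (1 - Fraction(1, 2**(a - 1))) * N
            + 2**(a - 1) * N * i)


def t_ll(N, K, a, i, j):
    """Eq. (5): cycle for the column-filtered coefficients of stage a."""
    return _stage_offset(N, K, a, i) + j + 1


def t_ll_rows(N, K, a, i, j):
    """Eq. (6): cycle for the row-filtered coefficients (LL)^a L and (LL)^a H."""
    return _stage_offset(N, K, a, i) + 2 * j + Fraction(K + 3, 2)


if __name__ == "__main__":
    # Parameters of the example of Fig. 4: 16 x 16 image, J = 3, 5/3 filter bank.
    N, K, J = 16, 5, 3
    last = N // 2**J - 1
    print("last L/H pair:        ", t_lh(N, N - 1, N // 2 - 1))
    print("first stage-1 output: ", t_ll(N, K, 1, 0, 0))
    print("last stage-J output:  ", t_ll(N, K, J, last, last))
```

Under this reading, the $16 \times 16$, $J = 3$, $K = 5$ case gives 127 cycles for the last L/H pair, 17 cycles for the first stage-1 coefficient and 190 cycles for the last stage-3 coefficient.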

4.2. Proposed scheduling algorithm description among tiles

By taking into account the scheduling Eqs. (4)–(6), it can be concluded that the 2D-DWT computation on a tile can be separated into three phases (Fig. 5a). In each phase, a filtering unit may be idle, meaning either that it is waiting for the appropriate number of input samples in order to start or that its computation has ended for the specific tile. In the first phase, only F1 operates, while F2, F3 and F4 are in the idle state waiting for the appropriate samples. This phase lasts until the first sample of the (LL) level is produced. Hence, according to Eq. (4), this phase lasts for $c_1 = N(K-1)/4 + 1$ cycles. After this time instant, F2, F3 and F4 operate until the end of the filtering of the tile. However, F1 does not work all the time. It goes to the idle state after the L and H subbands have been produced, which takes place at time instant $c_2 = N^2/2 - 1$. According to Eq. (5), the number of cycles needed to perform the filtering in one tile is

$$c_3 = (2^a - 1)\frac{N(K-1)}{4} + N\left(1 - \frac{1}{2^{a-1}}\right) + 2^{a-1}N\left(\frac{N}{2^a} - 1\right) + \frac{N}{2^a}. \qquad (7)$$

So, according to Fig. 5a, it can be deduced that the time during which F1 remains idle is $c_4 = c_3 - c_2$. Instead of letting F1 go to the idle state, it starts processing the next tile without conflicting with the previous one. This can be assured by assuming that the number of decomposition levels is greater than 1. Then $c_4 > c_1$ is always satisfied and interleaving is possible. Therefore, after F2, F3 and F4 have finished the filtering of the previous tile, they start processing the already produced subband samples (L) of the next tile without passing through the idle state (Fig. 5b).

Fig. 5. (a) Phases in the tile computation, (b) interleaved and non-interleaved scheduling.

Therefore, the number of cycles needed to perform the tile-based 2D-DWT with interleaved computations among tiles is

$$C_t^{\mathrm{int}}(N,a) = c_3(N/2^l, a) + (2^{2l} - 1)\,c_2(N/2^l, a), \qquad (8)$$


where $l$ is the tile factor. Additionally, the number of cycles without interleaving is

$$C_t(N,a) = 2^{2l}\,c_3(N/2^l, a). \qquad (9)$$

By subtracting Eq. (8) from Eq. (9) we obtain

$$C_t(N,a) - C_t^{\mathrm{int}}(N,a) = (2^{2l} - 1)\left[\frac{1}{4}\,\frac{N}{2^l}\left(2^a(K-3) + (5-K)\right) - \frac{N}{2^{l+a}} + 1\right]. \qquad (10)$$
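The cycle counts above can be evaluated with a short Python sketch. It assumes that $c_1 = N(K-1)/4 + 1$, $c_2 = N^2/2 - 1$ and that Eq. (7) reads $c_3 = (2^a-1)N(K-1)/4 + N(1 - 1/2^{a-1}) + 2^{a-1}N(N/2^a - 1) + N/2^a$, with the tile size $N/2^l$ substituted for $N$ in Eqs. (8) and (9). The function names are hypothetical, and the figures printed are evaluations of these formulas only, not measured results.

```python
from fractions import Fraction


def c1(N, K):
    """Cycles until the first (LL) sample is produced."""
    return Fraction(N * (K - 1), 4) + 1


def c2(N):
    """Time instant at which F1 has produced all L and H coefficients."""
    return Fraction(N * N, 2) - 1


def c3(N, K, a):
    """Eq. (7) as read in this sketch: cycles to filter one N x N tile."""
    return (Fraction((2**a - 1) * N * (K - 1), 4)
            + N * (1 - Fraction(1, 2**(a - 1)))
            + 2**(a - 1) * N * (Fraction(N, 2**a) - 1)
            + Fraction(N, 2**a))


def cycles_interleaved(N, K, a, l):
    """Eq. (8): tile-based 2D-DWT with interleaving among tiles."""
    tile = N // 2**l
    return c3(tile, K, a) + (2**(2 * l) - 1) * c2(tile)


def cycles_non_interleaved(N, K, a, l):
    """Eq. (9): tile-based 2D-DWT, one tile strictly after the other."""
    tile = N // 2**l
    return 2**(2 * l) * c3(tile, K, a)


if __name__ == "__main__":
    N, K, a = 128, 5, 2  # illustrative choice: 128 x 128 image, 5 taps, 2 stages
    for l in (1, 2, 3):
        with_int = cycles_interleaved(N, K, a, l)
        without = cycles_non_interleaved(N, K, a, l)
        print(f"l={l}: interleaved={with_int}, non-interleaved={without}, "
              f"saved={without - with_int}")
```

The saving per tile boundary is $c_3 - c_2$ evaluated at the tile size, so the total saving is $(2^{2l} - 1)(c_3 - c_2)$, which is the quantity expressed by Eq. (10).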

Based on the prerequisite $a > 1$, the right-hand side of Eq. (10) is always positive irrespective of $K$, i.e. $C_t(N,a) > C_t^{\mathrm{int}}(N,a)$. Thus, by interleaving the filtering among tiles, a reduction in the additional cycles imposed by the tile-based 2D-DWT can be achieved.

4.3. Comparisons with existing architectures

In this sub-section, the proposed and existing 2D-DWT encoder architectures are compared in terms of memory size and performance. The results for the total cycles needed to perform the 2D-DWT on a $128 \times 128$ image are shown in Fig. 6. They are based on 12 different scenarios regarding the tile factor $l$ (1, 2, 3) and the number of decomposition stages $J$ (2, 3, 4). Each different graph corresponds to a different selection of the

Fig. 6. Performance of the proposed and previously published 2D-DWT architectures.


ﬁlter bank (5/3, 9/7, 2/10, 10/18) implemented on the proposed architecture. For each scenario, the total cycles needed to perform the 2D-DWT with the proposed architecture—in three different ways for implementing the 2D-DWT (tile with interleaving, tile without interleaving and nontile)—are compared with the cycles needed by the architectures of [5,8]. As it can be concluded, the performance of the proposed architecture overcomes that of the existing architectures for all scenarios when applying the 2D-DWT without tiling. The computation of the 2D-DWT without tiling in the proposed architecture is faster than the computation in the existing architectures approximately by a factor of two. Even when tiling is used, the performance of the proposed architecture is still better than that of the existing architectures in most cases. The results for image sizes ranging from 64 64 to 256 256 are similar. In Fig. 7, the comparison of the memory requirements of the proposed 2D-DWT encoder and the existing encoding architectures [5,6,8] is illustrated for the case of an input image with size 128 128 without the use of tiling. The results are based on 12 different scenarios regarding the levels of decomposition J (2, 3, 4) and the ﬁlter length (5, 9, 10, 18). As it is deduced, the memory requirements of the existing architectures overcome those of the proposed in all cases for the same size of input image. Speciﬁcally, the memory requirements of the existing encoder architectures are 2–9.3% higher than those of the proposed architecture. This result is based on the comparison of the proposed architecture with the most efﬁcient one [8] in terms of memory requirements. Concluding, the proposed architecture is characterized by low-memory requirements and higher performance compared to previously published 2D-DWT encoder architectures. 
It partially overcomes the performance bottleneck imposed by the tile-based 2D-DWT by interleaving the computations among tiles and still beneﬁts from the low-memory requirements of this type of wavelet transform.
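To make the storage estimate of Eq. (3) concrete, the following Python sketch evaluates it for the image size and parameter ranges used in the memory comparison. It assumes that Eq. (3) reads $KN\sum_{a=1}^{J} 1/2^{a-1} = 2KN(1 - 1/2^{J})$ words. The function names are hypothetical, and the printed values are only evaluations of that formula, not the data plotted in Fig. 7.

```python
from fractions import Fraction


def memory_words(N, K, J):
    """Left-hand side of Eq. (3): K * N / 2**(a-1) words summed over the J stages."""
    return sum(Fraction(K * N, 2**(a - 1)) for a in range(1, J + 1))


def memory_words_closed(N, K, J):
    """Closed form of Eq. (3): 2KN(1 - 1/2**J)."""
    return 2 * K * N * (1 - Fraction(1, 2**J))


if __name__ == "__main__":
    N = 128  # image size of the memory comparison
    for K in (5, 9, 10, 18):        # filter lengths considered in Section 4.3
        for J in (2, 3, 4):         # decomposition levels considered in Section 4.3
            words = memory_words(N, K, J)
            # The summed and closed forms must agree for every scenario.
            assert words == memory_words_closed(N, K, J)
            print(f"K={K:2d}  J={J}  words={words}")
```

Because the per-stage buffers shrink geometrically, the total stays below $2KN$ words for any number of stages.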

Fig. 7. Memory requirements of the proposed and previously published 2D-DWT architectures.


5. Conclusions and future work

The proposed JPEG2000-compliant 2D-DWT architecture, which uses as its data-path the filters previously described in [10] and performs the computations using the introduced scheduling algorithm, achieves higher performance and consumes less memory area than other existing architectures. Ongoing work focuses on the development of a 2D-DWT decoder for the JPEG2000 standard that embeds the proposed TO synthesis filters. We are also conducting research on the design and implementation of a complete JPEG2000 encoder–decoder (codec).

Acknowledgements

We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program HERAKLEITOS, for funding the above work.

References

[1] S. Mallat, A Wavelet Tour of Signal Processing, second ed., Academic Press, New York, 1999.
[2] JPEG 2000 Image Coding System, ISO/IEC FCD15444-1, 2000.
[3] A. Skodras, C. Christopoulos, T. Ebrahimi, The JPEG 2000 still image compression standard, IEEE Signal Process. Mag. 18 (5) (2001) 36–58.
[4] MPEG-4, ISO/IEC JTC1/SC29/WG11, FCD 14496, Coding of Moving Pictures and Audio, May 1998.
[5] C. Chakrabarti, M. Vishwanath, Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to SIMD parallel computers, IEEE Trans. Signal Process. 43 (3) (1995) 759–771.
[6] C. Chakrabarti, M. Vishwanath, R.M. Owens, Architectures for wavelet transforms: a survey, J. VLSI Signal Process. 4 (2) (1996) 171–192.
[7] M. Vishwanath, R.M. Owens, M.J. Irwin, VLSI architectures for the discrete wavelet transform, IEEE Trans. Circuits Syst. II 42 (5) (1995).
[8] N.D. Zervas, G.P. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos, C.E. Goutis, Evaluation of design alternatives for the 2-D-discrete wavelet transform, IEEE Trans. Circuits Syst. Video Technol. 11 (2) (2001) 1246–1262.
[9] C. Chakrabarti, C. Mumford, Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 3256–3259.
[10] G. Dimitroulakos, N.D. Zervas, N. Sklavos, C.E. Goutis, An efficient VLSI implementation for forward and inverse wavelet transform for JPEG2000, in: Proceedings of 2002 14th International Conference on Digital Signal Processing (DSP2002), Santorini, Greece, 2002, pp. 233–236.
[11] F. Fridman, E.S. Manolakos, Distributed memory and control VLSI architectures for the 1-D discrete wavelet transform, in: IEEE VLSI Signal Processing VII, 1994, pp. 388–397.
[12] Grzeszczak, et al., VLSI implementation of discrete wavelet transform, IEEE Trans. VLSI Syst. 4 (1996) 421–433.
[13] A.S. Lewis, G. Knowles, VLSI architecture for 2-D Daubechies wavelet transform without multipliers, Electron. Lett. 27 (5) (1991) 171–173.
[14] T. Denk, K. Parhi, Calculation of minimum number of registers in 2-D discrete wavelet transforms using lapped block processing, in: Proceedings of the International Symposium on Circuits and Systems, 1994, pp. 77–81.
[15] C. Chrysafis, A. Ortega, Line-based, reduced memory, wavelet image compression, IEEE Trans. Image Process. 9 (3) (2000) 378–389.
[16] G. Lafruit, et al., Optimal memory organizations for scalable texture codecs in MPEG-4, IEEE Trans. Circuits Syst. Video Technol. 9 (2) (1999) 218–242.
