2D discrete wavelet transform

2D discrete wavelet transform

ARTICLE IN PRESS JID: MICPRO [m5G;September 6, 2016;19:50] Microprocessors and Microsystems 0 0 0 (2016) 1–15 Contents lists available at ScienceD...

3MB Sizes 0 Downloads 29 Views

ARTICLE IN PRESS

JID: MICPRO

[m5G;September 6, 2016;19:50]

Microprocessors and Microsystems 0 0 0 (2016) 1–15

Contents lists available at ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform Mohamed Asan Basiri M., Noor Mahammad Sk∗ Indian Institute of Information Technology Design and Manufacturing Kancheepuram, Chennai, Tamil Nadu, India

a r t i c l e

i n f o

Article history: Received 31 March 2016 Revised 25 August 2016 Accepted 30 August 2016 Available online xxx Keywords: Convolution DSP processor Wavelet transform

a b s t r a c t In this paper, high performance VLSI architectures for lifting based 1D and 2D-Discrete wavelet transforms (DWTs) are proposed. The proposed logic used for area efficient lifting based DWT is to perform the whole operation with one processing element. Similarly, the proposed logic used for delay efficient lifting based DWT is to perform the whole operation with multiple processing elements in parallel. In both the cases, the processing element consists of one floating point adder and one proposed fused multiply add design. The proposed and existing lifting based 1D and 2D lifting based DWTs are implemented with 45 nm technology. The results show that the proposed designs achieve significant improvement compared with existing architectures. For example, 9-point 2-parallel proposed (9, 7) single level 1D-DWT achieves 33.5% of reduction in total cycle delay compared with direct form. Similarly, 9-point single PE proposed (9, 7) single level 1D-DWT achieves 59.8% and 75.5% of reduction in total area and net power over direct form respectively. © 2016 Elsevier B.V. All rights reserved.

1. Introduction Wavelet is used to represent the signal properties in time and frequency [1] efficiently. DWT decomposes the input signal sample values into a set of low and high pass outputs. The low pass outputs are sent as input for next level of operation recursively. In multimedia applications, DWT is used to reduce the redundant information and hence the storage requirement of image pixels will be reduced [2]. DWT is used in image processing for the analysis of multi resolution [3]. The image reconstruction [4], image fusion [5], and image coding [6] can be done easily with DWT. The general architectures of DWT are (1) convolution based (2) lifting based. In this paper, we considered only the lifting based DWT architectures. The lifting based DWT [7] has an advantage over convolution based DWT. Here, the low pass and high pass filters are replaced by upper and lower triangular matrices and hence computation complexity has been reduced as compared to convolution based DWT. In N-point 1D or N × N-point 2D DWT, the maximum number of possible levels is log − 2N. The papers [8] and [9] proved that the DWT with decomposition levels 2 or 3, the image quality will be good and the degradation is reasonable. Beyond the decomposition levels of 2 or 3, the image quality will be poor. Therefore, number of decomposition levels in the implementation of our pro∗

Corresponding author. Fax: +91 4427476301. E-mail addresses: [email protected] (M. Asan Basiri M.), [email protected] (N. Mahammad Sk).

posed 1D and 2D-DWT designs is 3 and 2 respectively. Fig. 1 shows lifting based 3 levels DWT. Here, P and U represent predict and update modules. The input sample values are decomposed into odd and even sample values using split unit. The output from update unit is sent to the next stage recursively. The low pass and high pass analysis filters are taken as h (z ) and g (z ) respectively. The corresponding synthesis filters are h(z) and g(z) respectively. The corresponding polyphase matrices are defined as,





 h (z ) h o (z ) P (z ) = e  g ( z ) g e o (z )



and P (z ) =



he ( z ) ho ( z ) ge ( z ) go ( z )

If ( h,  g) is a complementary filter pair, then P (z ) can be factored into the following lifting steps,









m K 0  1 S i (z ) P (z ) = 1 0 K 0 1 i=1

 P (z ) =







1 0 t ( z ) 1 i



m −1 ) 0  1 −S i (z 0 K 0 1 i=1 1 K

(1)



1

0

−1 ) 1 −t i (z

(2)

The most efficient factorization of the polyphase matrix for (5, 3) and (9, 7) filters [24] are shown in Eqs. (3) and (7) respectively. Here, a, b, c, d, e, f, and K are constants.

P ( z ) ( 5,3 ) =





1 e (1 + z ) 0 1



1 0 − f (1 + z−1 ) 1

(3)

http://dx.doi.org/10.1016/j.micpro.2016.08.007 0141-9331/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO 2

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

Fig. 1. Lifting based DWT with 3 levels.

 P1 (z ) =



1 a(1 + z−1 ) 0 1





1 c (1 + z−1 ) P2 (z ) = 0 1

 P3 (z ) =

[m5G;September 6, 2016;19:50]

K 0 0 K1



1 0 b( 1 + z ) 1

(4)



1 0 d (1 + z ) 1

(5)



P (z )(9,7) = P1 (z )P2 (z )P3 (z )

(6) (7)

The 2D-lifting based discrete wavelet transform outputs can be produced in 2 steps, namely (1) row process and (2) column process. During the row process, each row of N × N-input signal matrix is 1D transformed and the results are stored in N × N2 buffer. After completing all the N rows of N × N-input signal matrix in row process, transpose matrix of the N × N2 -buffer is taken for column process. In column process, each row of transposed buffer matrix is 1D transformed and results are the final 2D-DWT values ( N2 × N2 ). Fig. 2 shows the example for lifting based 2D discrete wavelet transform with 2 levels of decomposition. 1.1. Related works Most of the 2D-DWT lifting based architectures use any one of the following 1D-DWT architectures (1) direct form [11]; (2) recursive form [12]; and (3) flipping form [19]. In all the above mentioned architectures, the set of processing elements (PEs) are used. Each PE consists of two adders and one multiplier. The (9, 7) direct form [11] and recursive form [12] architectures are the traditional designs for (9, 7) DWT, where, 4 and 2 PEs are used respectively. In the folded recursive [12] lifting based DWT, the half of the direct form (9, 7) is used. So, the whole operation using [12] takes more cycles than direct form (9, 7) DWT [11] to complete the operation. In (5, 3) direct form [10] lifting based DWT, two PEs are connected in series. In the flipping based DWT [19], the scaling constants (multipliers) are inversed to reduce the critical path delay of the lifting structure. In all these lifting based DWTs, the drawback is the worst path delay that includes two adders and one multiplier (refer Fig. 4(a)). In the proposed designs, one floating point adder is combined with floating point multiplier and it is renamed as fused multiply add design (refer Fig. 4(b)). Therefore, the worst path delay of proposed techniques is less than the existing designs. Another drawback with the above mentioned 1D-DWTs is the requirement of more cycles. In this paper, 2-parallel DWT architectures are proposed, where two PEs are connected in parallel instead of series in the direct form [11] of (5, 3) DWT and recursive form [12] of (9, 7) DWT. Therefore, the number of cycles required for the proposed 2-parallel DWT architectures is much less than the existing designs. The SCLA (spatial combinative lifting algorithm) based DWT is proposed in [18]. Here, the line buffers are added with four PEs of direct form, where the line buffers act as transpose buffer to perform column process. The four PEs are used for row and column processes by hardware reuse strategy. The similar reuse strategy based 2D-DWT is proposed in [21], where the direct form (9, 7)

DWT with four PEs is used for both row and column processes. In the flippling based 2D-DWT [22], two flipping based 1D-DWTs (row and column processors) are used. Each processor contains four PEs and each with two adders and one multiplier. The similar strategy is proposed in [14], where two DWTs with four PEs are used dedicatedly to row and column processes. In all the above mentioned 2D-DWTs, the drawback is the requirement of more cycles and this has been resolved in our proposed N-parallel architectures, where N 1D-DWTs are connected in parallel. Also, the area efficient 2D-DWTs are proposed using single PE instead of two and four PEs in each row and column processes of (5, 3) and (9, 7) 1D-DWTs respectively. In the recent trends, multiple input and multiple output (parallel) lifting based architectures are proposed, where the transpose buffer is not used. High speed and fast 2D-DWT architectures are shown in [16], where two and one 1D-DWTs are used in column process respectively. In both the cases, the row process contains two 1D-DWTs. Therefore, the area for fast architectures is less than high speed design and the number of cycles for fast architectures is greater than high speed design. The multiply accumulate circuit (MAC) based DWT is shown in [17]. Here, the drawback is the critical path that contains two add-shift based multipliers and four adders. Therefore, the critical path of [17] is much greater than proposed designs. The P-block parallel 2D-DWT is proposed in [15], where 2P PEs are used in both row and column processes and each with four PEs. Therefore, Basant et al. [15] require 4P multipliers and 8P adders in parallel to complete the DWT operation. In [20], M recursive architecture [12] based PEs are used in row process and M direct form [11] PEs are used in column process. Therefore, Tian et al. [20] require 6M multipliers and 12M adders in parallel. In all the aforementioned transpose buffer free 2D-DWT parallel designs, the main drawback is the critical path delay that includes two adders and one multipliers, which has been reduced in the proposed designs. 1.2. Contribution of this paper The major objective of this work is to optimize the delay, area and power requirement of lifting based 1D/2D-DWT architecture. In general, any one of these parameters can be optimized at the cost of others. The proposed logic used for area efficient lifting based DWT is to perform the whole operation with one processing element. Similarly, the proposed logic used for delay efficient lifting based DWT is to perform the whole operation with multiple processing elements in parallel. In both the cases, the processing element consists of one floating point adder and one proposed fused multiply add design. The worst path delay for proposed techniques is less than other existing techniques because the existing techniques require two adders and one multiplier. In the proposed designs, one floating point adder is combined with floating point multiplier and it is renamed as fused multiply add design. In this paper, IEEE-754 based floating point single precision (32 bits) is considered because high dynamic range (HDR) image processing applications require IEEE-754 single/half precision arithmetic [25–27]. These proposed DWTs can be used in JPEG standards for an efficient image compression. The rest of the paper is organized as follows, Section 2 states the proposed lifting based 1D and 2D-DWT proposed architectures. Section 3 discusses the design modeling, implementation, and results. The Section 4 states the conclusion. 2. The proposed lifting based discrete wavelet transform architecture Fig. 3 shows the proposed lifting based single PE (area efficient) N × N-point 2D-DWT architecture. In this proposed lifting

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

JID: MICPRO

ARTICLE IN PRESS

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

3

Fig. 2. 2D-lifting based discrete wavelet transform with decomposition of two levels.

Fig. 3. Proposed lifting based single PE N × N-point 2D-DWT architecture.

Fig. 5. Storage buffer architecture (Example – 8 × 4) used in the proposed 2D-DWT as shown in Fig. 3.

Fig. 4. Processing element (PE) used in lifting based DWT (a) conventional (b) proposed.

based area efficient DWT, only one processing element (PE) is used to find all the values in row process. In existing systems, more number of processing elements are used. So, the area of the proposed lifting based single PE DWT is less than existing systems with trade off in number of cycles. Fig. 4(a) shows the structure of conventional processing element. The processing element consists of two floating point adders and one floating point multiplier. Here, the inputs are x, y, Bi , and Ci . The difference between the floating point multiply accumulation (MAC) and fused multiply add (FMA) is clearly shown in the Eqs. (8) and (9) respectively. Here, Ai and Bi are the present inputs to be multiplied, Fi is the present MAC/FMA result, Fi−1 is the previous MAC result, and Ci is the third input for FMA. The proposed processing element shown in Fig. 4(b) that contains one floating point adder and one proposed FMA, which is explained in [13]. A new number is added with present multiplication in FMA, whereas in the MAC, the previous MAC result is added with present multiplication.

Fi = (Ai × Bi ) + Fi−1

(8)

Fi = (Ai × Bi ) + Ci

(9)

Fig. 6. The signal flow graph for lifting based 9-point (5, 3) 1D-DWT with 3-levels.

process has to be stored in the first row of transpose buffer, then en0 = 1 and en1 = ..en7 = 0. This storage buffer is made up of 2-to-1 multiplexers and registers. The shaded boxes represent the hardware registers. This storage (transpose or temporal) buffer architecture is used in the proposed single PE (5, 3) and (9, 7) 2D-DWT. Temporal buffer is used to store the output from column process of the previous level, that is used as input to the next level. 2.1. Lifting based (5, 3) DWT

Fig. 5 shows the 8 × 4 storage buffer architecture, where the control line eni = 1 to store the new data and = 0 for storing the previous data. For example, if the data from 1D-DWT of row

The signal flow graph [24] for lifting based (5, 3) DWT with 3-levels is shown in Fig. 6. Here, HP and LP are high pass and low pass outputs respectively. The initial set of inputs are x0 , x1 , . . . x8 . The high pass outputs of first level are y1 , y3 , y5, and y7 . The low

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO 4

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15 Table 1 Operation of 9-point 3-levels lifting based (5, 3) 1D-DWT using proposed single PE recursive architecture. Cycle

xi1

xi2

xi3

xi4

xx1

xx2

xx3

xx4

Sel

Sel1

Sel2

ss3

ss2

ss1

S4

S3

S2

S1

S0

S(-1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

X X X X x0 X X X x8 X X X X X X X X X

x0 x2 x4 x6 X X X X X X X X X X X X X X

x2 x4 x6 x8 X X X X X X X X X X X X X X

x1 x3 x5 x7 X x2 x4 x6 X X X X X X X X X X

X X X X x0 X X X x8 X X y0 X y8 yy0 X yy4 X

x0 x2 x4 x6 X y1 y3 y5 X y0 y4 X yy1 X X yy0 X X

x2 x4 x6 x8 X y3 y5 y7 X y4 y8 X yy3 X X yy4 X X

x1 x3 x5 x7 X x2 x4 x4 X y2 y6 X y4 X X yy2 X X

0 0 0 0 X 1 1 1 X 2 3 X 4 X X 3 X X

X X X X 0 X X X 0 X X 1 X 2 3 X 3 X

0 0 0 0 X 0 0 0 X 1 2 X 3 X X 2 X X

0 0 0 0 X 1 1 1 X 0 0 X 1 X X 0 X X

0 0 0 0 X 1 1 1 X 0 0 X 1 X X 0 X X

1 1 1 1 0 1 1 1 0 1 1 0 1 0 0 1 0 0

X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0 yy2 yy4 yyy0 yyy1 yyy2

X X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0 yy2 yy4 yyy0 yyy1

X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0 yy2 yy4 yyy0

X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0 yy2 yy4

X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0 yy2

X X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 yy1 yy3 yy0

output. Here, the main thing is to keep the appropriate select line for the multiplexers. According to the area efficient proposed technique, only one processing element will be used. In (5, 3) DWT, each level contains two stages (high and low pass). The number of cycles required to produce high pass outputs (N(HP ) in proposed N-point single PE 5,3 ) (5, 3) 1D-DWT is shown in Eq. (10),

N(HP 5,3 ) =

N 2

(10)

The number of cycles required to produce low pass outputs (N(LP5,3 ) ) in proposed single PE N-point (5, 3) 1D-DWT is shown in Eq. (11),

N(LP5,3) = Fig. 7. Proposed single PE recursive architecture for 3-levels lifting based (5, 3) 1DDWT.

pass outputs of first level are y0 , y2 , y4 , y6 , and y8 . The high pass outputs of second level are yy1 and yy3 . The corresponding low pass outputs are yy0 , yy2 and yy4 . Similarly, the high pass output of third level is yyy1 and its corresponding low pass values are yyy0 and yyy2 . The multiplication constants e and f are used to find the high pass and low pass results respectively. Here, both the constants are considered as floating point values. The low pass outputs are sent to the input of next level recursively. The low-high pass outputs for each levels can be found by processing element shown in Fig. 4. Throughout this document, X symbol is used to represent do not care condition in the tables. The recursive based lifting architecture for DWT is explained in [24]. Here, the computation of low pass and high pass values of the next immediate stage will be started if the required inputs are available from the previous stage low pass output. The proposed single PE recursive architecture for 9-point 3-levels lifting based (5, 3) 1D-DWT is shown in the Fig. 7 and its operation is explained in Table 1, which includes only one processing element with 18 cycles. According to Table 1, the first level outputs (y0 to y8 ) are obtained in 10 cycles and the second level outputs (yy0 to yy4 ) are obtained from the cycles 11 to 15. Similarly, the third level outputs (yyy0 to yyy2 ) are obtained from the cycles 16 to 18. In all the proposed systems, the hardware registers are used beyond the processing element to store the intermediate results that are sent back to the processing element to find the required low/high pass

N 2

(11)

Therefore, the number of cycles required to complete the single pro level (N(5,3 ),single ) using proposed single PE N-point (5, 3) 1D-DWT pro

is shown in Eq. (12). In other words, N(5,3 ),single represents the

total number of terms produced in a single level N-point (5, 3) 1D-DWT.

N(pro =N 5,3 ),single

(12)

In case of direct form of N-point (5, 3) 1D-DWT [24], two PEs will be used. During first clock cycle, first PE alone is busy and from second cycle onwards, both the PEs will be busy. Therefore, from second cycle onwards, two terms will be produced during each cycle of single level N-point (5, 3) 1D-DWT. So, the maximum number of cycles required to complete the single level N-point (5, direct 3) 1D-DWT using direct form (N − (5, 3 ), single ) is shown in Eq. (13)

N(direct 5,3 ),single = 1 +

N−1 N+1 = 2 2

(13)

The maximum number of cycles required to complete the L-levels of N-point (5, 3) 1D-DWT using direct (N(direct ) and proposed 5,3 ),L pro single PE (N(5,3 ),L ) forms are shown in Eqs. (14) and (15) respectively.

N(direct 5,3 ),L =

N(pro = 5,3 ),L

L−1 

N 2i

i=0

+1 2

L−1  N 2i

(14)

(15)

i=0

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

5

Table 2 Proposed 2-parallel architecture for 9-point 3-levels lifting based (5, 3) 1D-DWT. Cycle

Se1

Se2

Se3

Se4

s1

s2

s3

s4

d1

d2

d3

d4

S4

S5

R3

R4

0 1 2 3 4 5 6 7 8

0 0 0 1 X X X X X

0 0 2 X 1 2 2 3 X

0 0 2 X 1 2 3 1 X

0 0 0 0 0 1 2 3 X

0 1 1 X 1 1 1 1 X

0 0 0 X 0 0 0 0 X

0 0 1 1 1 0 1 0 X

1 1 1 0 1 1 1 1 X

x0 x2 y3 X y3 y5 yy1 yy0 X

x2 x4 y1 X y5 yy0 yy2 yy4 X

x1 x3 x2 X x4 y2 y4 yy2 X

x0 X X y0 X X X X X

X y1 y3 y2 yy0 y4 yy1 yy2 yyy1

X y0 y1 y3 y2 yy0 y4 yy1 yy2

X X y0 y1 y3 y2 yy0 y4 yy1

X X X y0 y1 y3 y2 yy0 y4

Cycle 0 1 2 3 4 5 6 7

SSe1 0 0 0 1 3 X 2 X

SSe2 0 0 1 X 1 2 X X

SSe3 0 0 1 X 1 X X X

SSe4 0 0 0 X 0 X X X

ss1 0 1 1 X 1 1 1 X

ss2 0 0 0 X 0 0 0 X

ss3 0 0 1 1 1 0 1 X

ss4 1 1 1 0 0 1 1 X

dd1 x4 x6 y7 X X y4 X X

dd2 x6 x8 y5 X X y8 X X

dd3 x5 x7 x6 X X y6 X X

dd4 x8 X X y8 yy0 X yy4 X

SS4 X y5 y7 y6 yy4 yyy0 yy3 yyy2

SR5 X y8 y5 y7 y6 yy4 yyy0 yy3

RR3 X X y8 y5 y7 y6 yy4 yyy0

RR4 X X X y8 y5 y7 y6 yy4

Fig. 9. Proposed single PE 13-point 2-levels lifting based (5, 3) 1D-DWT architecture for proposed single PE 13 × 13-point 2D-DWT with 2-levels.

Fig. 8. Proposed 2-parallel architecture for 9-point three levels lifting based (5, 3) 1D-DWT.

Fig. 8 shows the proposed 2-parallel 3-levels 9-point (5, 3) 1D-DWT architecture that contains 2 PEs but the operation is different from direct form. In direct form (5, 3) DWT, both the PEs will work on serially but in proposed 2-parallel (5, 3) 1D-DWT, both the PEs will work on parallel. So, the number of cycles required to complete the operation using proposed 2-parallel (5, 3) 1D-DWT, will be less than direct form and half of the proposed single PE (5, 3) 1D-DWT. The Eq. (16) shows the number of cycles 2 pro (N(5,3 ),L ) required to complete L-levels of N-point (5, 3) 1D-DWT using proposed 2-parallel architecture. The operation of proposed 2-parallel 3-levels 9-point (5, 3) 1D-DWT architecture is explained in Table 2.

N(25pro = ,3 ),L

L−1 1N 2 2i

(16)

i=0

In proposed 2D-DWT implementation, 13 × 13-point input image matrix is considered. Therefore, 13 × 6 and 13 × 7-point

transpose buffers will be used for high and low pass outputs during row process respectively. In column process, 6 × 6 and 7 × 7-point temporal buffers will be used for high and low pass outputs respectively. In the direct implementation of 13-point single level lifting based (5, 3) 1D-DWT, 2 processing elements are used. The proposed architecture for two levels lifting based single PE (5, 3) 1D-DWT is shown in Fig. 9, Here, only one processing element is used. So, the area of proposed system is less than direct implementation. The architecture shown in Fig. 9 is used for the row process of proposed single PE 13 × 13-point 2D-DWT with 2-levels (shown in Fig. 3). The operation of 2-levels lifting based (5, 3) 1D-DWT using proposed single PE architecture is shown in Table 3, which takes 13 cycles for level-1 row/column process. Here, the level-2 operations can be done with fa0 , fa1 , ... fa6 , which are sent as feed back from low pass filter of column process (HH1), which is shown in Fig. 3. Here, fa0 , fa1 , ... , fa6 are the outputs of 7th column of HH1 7 × 7-temporal buffer. Fig. 10 shows the proposed 9-parallel architecture for 9 × 9-point lifting based (5, 3) 2D-DWT. Here, row process contains 9 singe PE proposed DWTs that is shown in Fig. 9 and column process contains seven proposed FMA based PEs. This architecture is designed for 2-level 2D-DWT. Here, first level will be started if s = 0, otherwise second level 2D-DWT will be carried. Therefore, 5 × 5-temporal buffer is used to send the low pass outputs (HH1) of first level into second level. In general, the number terms generated in column process of proposed N-parallel N × N-point

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO 6

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15 Table 3 Operation of lifting based 13-point (5, 3) 1D-DWT using proposed single PE architecture for proposed single PE 13 × 13-point 2D-DWT with 2-levels. Level

Cycle

xi1

xi2

xi3

xx1

xx2

xi4

Sel

ss3

ss2

ss4

ss1

R1

R2

R3

1

1 2 3 4 5 6 7 8 9 10 11 12 13

x0 X X X X X X X X X X x12 X

x0 x2 x4 x6 x8 x10 X X X X X X X

x2 x4 x6 x8 x10 x12 X X X X X X X

x0 x2 x4 x6 x8 x10 y3 y5 y7 y9 y11 X X

x2 x4 x6 x8 x10 x12 y1 y3 y5 y7 y9 X X

x1 x3 x5 x7 x9 x11 x2 x4 x6 x8 x10 X X

0 0 0 0 0 0 1 1 1 1 1 X X

0 0 0 0 0 0 1 1 1 1 1 X X

0 0 0 0 0 0 1 1 1 1 1 X X

1 1 1 1 1 1 1 1 1 1 1 0 X

0 1 1 1 1 1 1 1 1 1 1 1 X

X y1 y3 y5 y7 y9 y11 y2 y4 y6 y8 y10 y12

X y0 y1 y3 y5 y7 y9 y11 y2 y4 y6 y8 y10

X X y0 y1 y3 y5 y7 y9 y11 y2 y4 y6 y8

2

1 2 3 4 5 6 7

fa0 X X X X fa6 X

fa0 fa2 fa4 X X X X

fa2 fa4 fa6 X X X X

fa0 fa2 fa4 y3 y5 X X

fa2 fa4 fa6 y1 y3 X X

x1 fa3 fa5 x2 x4 X X

0 0 0 1 1 X X

0 0 0 1 1 X X

0 0 0 1 1 X X

1 1 1 1 1 0 X

0 1 1 1 1 1 X

X y1 y3 y5 y2 y4 y6

X y0 y1 y3 y5 y2 y4

X X y0 y1 y3 y5 y2

Fig. 10. Proposed 9-parallel architecture for 9 × 9-point lifting based (5, 3) 2DDWT.

2D-DWT is shown in Eq. (12). Therefore, the number of adders N−par (5,3 ) N−par (5,3 ) (NADD ) and FMAs (NF MA ) is calculated based on N single level proposed single PE (5, 3) 1D-DWTs, which is shown in equation (17). Similarly, the number of cycles required to finish the proposed N-parallel N × N-point (5,3) 2D-DWT depends on one single level proposed single PE (5, 3) 1D-DWT and column process 1D-DWT. Here, column process requires 2 cycles to produce all the N−par (5,3 ) outputs. The shows the number of cycles (Ncycle ) required to complete the operation of proposed N-parallel 1-level N × N-point (5,3) 2D-DWT. N−par (5,3 ) ( 5,3 ) NADD = NFN−par = 2N MA

(17)

N−par (5,3 ) Ncycle = (N + 2 )

(18)

2.2. Lifting based (9, 7) DWT The signal flow graph for lifting based (9, 7) DWT [24] with 3-levels is shown in Fig. 11. The main difference between the (9,

Fig. 11. The signal flow graph for lifting based 9-point (9, 7) 1D-DWT with 3-levels.

7) and (5, 3) DWT is the number of multiplication constants and scaling factors. Here, the number of multiplication constants is 4. They are a, b, c, and d. The scaling factors 1k and k are used to obtain the high pass and low pass outputs respectively. Due to an additional multiplication constants, the height of the signal flow graph for (9, 7) DWT is greater than (5, 3) DWT. Hence four processing elements are required to get one low/high pass output for a particular level. In the proposed area efficient system, only one processing element is used to get all the outputs. So, the area of the proposed single PE system is less than existing (9, 7) DWTs. The proposed single PE recursive 3-levels lifting based (9, 7) 1D-DWT is shown in Fig. 12. Here, only one processing element is used. The operation for the proposed single PE recursive 3-levels (9, 7) 9-point 1D-DWT requires more cycles than existing recursive systems [11,12] to complete the operation.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

7

Fig. 12. Proposed single PE recursive architecture of 3-levels lifting based 9-point (9, 7) 1D-DWT.

In (9, 7) DWT, each level contains four stages. Here, third and fourth stages are high and low pass outputs respectively. The number of cycles required to produce first stage (N(19,7 ) ) and

second stage outputs (N(29,7 ) ) in proposed single PE N-point (9, 7) 1D-DWT is shown in Eq. (19),

N N(19,7) = N(29,7) = 2

=

N(LP9,7)

N = 2

(20)

Therefore, the maximum number of cycles required to complete pro the 1-level (N(9,7 ),single ) of proposed single PE N-point (9, 7) 1Dpro

DWT is shown in Eq. (21). In other words, N(9,7 ),single represents the total number of terms produced in a 1-level N-point (9, 7) 1D-DWT.

N(pro = 2N 9,7 ),single

(21)

In case of direct form [11] of N-point (9, 7) 1D-DWT, four PEs will be used. During first clock cycle, first PE alone is busy. During second and third cycles, 2 and 3 PEs are busy respectively. So, six terms can be produced up to third cycle. From fourth cycle onwards, all the four PEs will be busy. Therefore, from fourth cycle onwards, four terms will be produced during each cycle of 1-level N-point (9, 7) 1D-DWT. The maximum number of cycles required to complete the single level N-point (9, 7) 1D-DWT using direct direct form (N − (9, 7 ), single ) is shown in Eq. (22)

N(direct 9,7 ),single

2N − 6 3+N =3+ = 4 2

(22)

In case of recursive form of N-point (9, 7) 1D-DWT [12], two PEs will be used. During first clock cycle, first PE alone is busy and from second cycle onwards, both the PEs will be busy. Therefore, from second cycle onwards, two terms will be produced during each cycle of single level N-point (9, 7) recursive 1D-DWT. The maximum number of cycles required to complete the 1-level Nrecurse point (9, 7) 1D-DWT using recursive form (N − (9, 7 ), single ) is shown in Eq. (23)

N(recurse 9,7 ),single = 1 +

2N − 1 1 + 2N = 2 2

N(direct 9,7 ),L =

(19)

The number of cycles required to produce third stage (high pass) outputs (N(HP ) and fourth stage (low pass) outputs (N(LP9,7 ) ) in 9,7 ) proposed single PE N-point (9, 7) 1D-DWT is shown in Eq. (20),

N(HP 9,7 )

pro

(N(recurse ), and proposed single PE (N(9,7 ),L ) forms are shown in 9,7 ),L Eqs. (24), (25), and (26) respectively.

(23)

The maximum number of cycles required to complete the L-levels of N-point (9, 7) 1D-DWT using direct (N(direct ), recursive 9,7 ),L

N(recurse 9,7 ),L =

N(pro = 9,7 ),L

L−1  3+

N 2i

(24)

2

i=0

L−1  1 + 2 2Ni

(25)

2

i=0 L−1  N 2 i 2

(26)

i=0

Fig. 14 shows the proposed 2-parallel 1-level 9-point (9, 7) 1DDWT architecture that contains 2 PEs but the operation is different from recursive form [12]. In recursive form (9, 7) DWT, both the PEs will work on serially but in case of proposed 2-parallel (9, 7) DWT, both the PEs will work on parallel. So, the number of cycles required to complete the operation using proposed 2-parallel (9, 7) DWT will be less than recursive form. The number of cycles required for proposed 2-parallel (9, 7) DWT is half of the proposed single PE 1-level (9, 7) DWT. The operation of proposed 2-parallel 1-level 9-point (9, 7) 1D-DWT architecture is explained in Table 4. 2 pro The Eq. (27) shows the number of cycles (N(9,7 ),single ) required to complete 1-level of N-point (9, 7) 1D-DWT using proposed 2-parallel architecture.

N(29pro =N ,7 ),single

(27) 2 pro

The Eq. (28) shows the number of cycles (N(9,7 ),L ) required to complete L-levels of N point (9, 7) 1D-DWT using proposed 2-parallel architecture.



N(29pro ,7 ),L

L−1 1  N = 2 i 2 2



(28)

i=0

In the recursive (folded) architecture of lifting based (9, 7) wavelet [12], two processing elements are involved instead of four elements of direct implementation. So, the area for [12] is less than direct form (9, 7) wavelet. The proposed architecture of 9-point 1-level lifting based (9, 7) 1D-DWT is shown in Fig. 13. Here, only one processing element is used to find all the low pass/high pass values. So, the area for the proposed system is less than direct form and folded form of single level (9, 7) 1D-DWT and its operation is explained in Table 5, which requires 19

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO 8

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15 Table 4 Operation of proposed 2-parallel 9-point 1-level lifting based (9, 7) 1D-DWT. Cycle

s1

s2

s3

d1

d2

d3

s4

sc

K1

ss1

ss2

ss3

dd1

dd2

dd3

ss4

ssc

KK1

0 1 2 3 4 5 6 7 8 9

0 0 1 3 3 3 3 2 3 X

0 0 2 2 3 2 2 2 1 X

0 0 0 0 0 1 1 2 2 X

x0 x2 X y1 y3 y0 y2 y11 y9 X

x2 x4 y1 y3 y5 y2 y4 y9 0 X

x1 x3 x0 x2 x4 y1 y3 y2 y0 X

0 0 1 1 1 2 2 3 3 X

2 2 2 2 2 0 0 1 1 X

X y1 y3 y0 y2 y4 y9 y11 y12 y10

0 0 2 3 2 4 2 4 1 X

0 0 1 2 2 2 2 2 3 X

0 0 0 0 2 1 2 3 4 X

x4 x6 y7 y5 y6 y4 y13 y11 0 X

x6 x8 X y7 y8 y6 y15 y13 y15 X

x5 x7 x8 x6 y7 y5 y6 y4 y8 X

0 0 1 1 2 2 3 3 3 X

2 2 2 2 0 0 1 1 1 X

X y5 y7 y8 y6 y15 y13 y16 y14 y18

Table 5 Operation of 9-point 1-level lifting based (9, 7) 1D-DWT using proposed single PE architecture. Cycle

xi1

xi2

xi3

xx1

xx2

xx3

S1

S2

S3

S4

sc

K1

K2

K3

R1

R2

K4

K5

K6

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

x0 x2 x4 x6 X X X X X X X X X X X X X X X

x2 x4 x6 x8 X X X X X X X X X X X X X X X

x1 x3 x5 x7 x0 x2 x4 x6 x8 X X X X X X X X X X

x0 x2 x4 x6 y1 y3 y5 y7 0 y2 y4 y6 y8 y9 y11 y13 y15 0 X

x2 x4 x6 x8 0 y1 y3 y5 y7 y0 y2 y4 y6 0 y9 y11 y13 y15 X

x1 x3 x5 x7 x0 x2 x4 x6 x8 y1 y3 y5 y7 y0 y2 y4 y6 y8 X

0 0 0 0 2 2 2 2 1 2 2 2 2 2 2 2 2 1 X

0 0 0 0 1 2 2 2 2 2 2 2 2 1 2 2 2 2 X

0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 X

0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 3 3 X

2 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 X

X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15 y10 y12 y14 y16 y18

X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15 y10 y12 y14 y16

X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15 y10 y12 y14

X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15 y10 y12

X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15 y10

X X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13 y15

X X X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11 y13

X X X X X X X X y1 y3 y5 y7 y0 y2 y4 y6 y8 y9 y11

Fig. 13. Proposed architecture of 9-point single PE single level lifting based (9, 7) 1D-DWT.

cycles. Table 5 shows the operation of row process of 1-level of proposed single PE based 9 × 9-point 2D-DWT (from cycles 0 to 9). Similarly, level 2 row process can be done with fa0 , fa1 , … fa4 , which are sent as feed back from low pass filter of column process (HH1). Here, fa0 , fa1 , … , fa4 are the outputs of 5th column of HH1 5 × 5-temporal buffer. Fig. 15 shows the proposed 9-parallel architecture for 9 × 9-point lifting based (9, 7) 2D-DWT. Here, row process contains nine proposed singe PE DWTs that is shown in Fig. 13 and column process contains 18 proposed FMA based PEs. This architecture is designed for 2-level 2D-DWT. Here, first level will be done if s = 0, otherwise second level 2D-DWT will be carried. Therefore, 5 × 5temporal buffer is used to send the low pass outputs (HH1) of first level into second level. In general, the number terms generated in column process of proposed N-parallel N × N-point 2D-DWT is

Fig. 14. Proposed 2-parallel architecture for 9-point 1-level lifting based (9, 7) 1DDWT.

N−par (9,7 )

shown in Eq. (21). Therefore, the number of adders (NADD ) is calculated based on N numbers of 1-level proposed single PE (9, 7) 1D-DWTs, which is shown in equation (30). The number of multipliers depends on the number of high and low pass terms generated column process. Here, these terms should be multiplied by scaling factors (k and 1/k). The number of high pass and low pass

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

9

Fig. 15. Proposed 9-parallel architecture for 9 × 9-point lifting based (9, 7) 2D-DWT.

Fig. 17. Overall architecture of proposed 1D lifting based DWT.

first column of buffer, then en0 = en1 = ..en4 = 1. This buffer is made up of 2-to-1 multiplexers and registers. The shaded boxes represent the hardware registers. Fig. 16. Storage (temporal buffer) buffer architecture (Example - 5 × 5) used in the proposed parallel (9, 7) 2D-DWT as shown in Figs. 10 and 15.

terms generated in N-point single level (9, 7) 1D-DWT is shown in N−par (9,7 ) equation (20). The Eq. (29) shows the number of FMAs (NF MA ) required for the proposed N-parallel N × N-point (9,7) 2D-DWT. ( 9,7 ) NFN−par = 2N + N + MA

N N + = 4N 2 2

(29)

Similarly, the number of cycles required to finish the operation of proposed N-parallel N × N-point (9, 7) 2D-DWT depends on one 1-level proposed single PE (9, 7) 1D-DWT and column process of proposed N-parallel N × N-point (9, 7) 2D-DWT. The column process of proposed N-parallel N × N-point (9, 7) 2D-DWT requires 4 cycles to produce all the outputs. The Eq. (31) shows the N−par (9,7 ) number of cycles (Ncycle ) required to complete the operation of proposed N-parallel 1-level N × N-point (9,7) 2D-DWT. N−par (9,7 ) NADD = 2N + N = 3N

(30)

N−par (9,7 ) Ncycle = 2N + 4

(31)

Fig. 16 shows the storage (temporal) buffer architecture used in the proposed parallel 2D-DWTs, where the control line eni = 1 to store the new data and = 0 for storing the previous data. For example, if the data from column process has to be stored in

3. Design modeling, implementation, and results The proposed and existing 1D/2D lifting based DWTs are modeled in Verilog HDL. These Verilog HDL models are simulated and synthesized using Cadence 6.1 ASIC design tool with 45 nm technology that is used to find the worst path delay, area, power, and EOP details with operating voltage 0.88 v. The energy per operation (EOP) [30] or power delay product (PDP) stands for the average energy consumed per switching event, which can be found by multiplying worst path delay with sum of dynamic (switching) and leakage (static) power. The proper select lines of multiplexers are used to perform the particular 1D/2D-lifting based DWT using the proposed architecture. Fig. 17 shows the overall proposed architecture of lifting based 1D-DWT. Here, the select lines of multiplexers are stored in look up table with corresponding address line. The Address will be incremented by one during each clock cycle (initially it is 0). The proper select lines Sel[Address] are obtained during each clock cycle to perform the required 1D lifting based DWT. Tables 6 and 7 show the theoretical comparison of lifting based N-point 1D-DWT and N × N-point 2D-DWT architectures. Tables 8, 9, and 10 show the performance analysis of various 1D/2D-DWT architectures. The 13-point 1-level proposed single PE (5, 3) 1D-DWT requires 13 cycles, which is shown in Table 3. Therefore, 13 × 13-point proposed single PE (5, 3) 2-levels 2D-DWT requires 13 × 13 = 169 cycles in row process of first level and 13 × 13 2 = 85 cycles in column process of first level. In the second level, row and column

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO 10

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15 Table 6 Comparison of lifting based N-point 1D-DWT architectures with L-stages. 1D-DWT

# multipliers

# adders

# cycles

L−1

(5, 3) direct [24]

2

4

(5, 3) single PE proposed

1

1

(5, 3) 2-parallel PEs proposed

2

2

(9, 7) direct [11]

6

8

L−1

(9, 7) recurse [12]

3

4

L−1

(9, 7) flipping based [19]

4

8

(9, 7) single PE proposed

2

1

(9, 7) 2-parallel PEs proposed

4

2

i=0

N 2i

Critical path depth

+1

TMUL + 2TADD

2

L−1

N i =0 2 i 1 L−1 N i=0 2i 2 N 2i

3+

i=0

TF MA + TADD TF MA + TADD TMUL + 2TADD

2 1+2

i=0

N 2i

TMUL + 2TADD

2



N L−1 3+ 2i 2 i=0

L−1 N i=0 2 2i

1 2

L−1 i=0

TMUL + TADD TF MA + TADD

2 2Ni

TF MA + TADD

TMUL , TADD , and TFMA , are critical path depth of floating point multiplier, adder, and proposed fused multiply add respectively. Table 7 Comparison of lifting based N × N-point 2D-DWT architectures with single level. 2D-DWT

# multipliers

# adders

# cycles

Critical path depth N 2N+1 . 2 2

3N (N +1 ) 4

Transpose buffer size

TMUL + 2TADD

2 (N ×

TF MA + TADD

2(N × N2 )

6

12

N. N+1 + 2

3

3

N .N +

(5, 3) P-block parallel [15]

2P

4P

N2 P

TMUL + 2TADD

4N

(5, 3) High speed architecture [16]

8

16

N2 4

TMUL + 2TADD

Not required

(5, 3) Fast architecture [16]

4

8

N2 2

(5, 3) adder based MAC [17]

0

8

3 N 2

(5, 3) SCLA based [18]

2

4

(5, 3) proposed N-parallel

2N

(9, 7) direct [11]

18

=

2 N .N = 3N2 2

N 2

)

(5, 3) direct [24] (5, 3) single PE proposed

TMUL + 2TADD

Not required

4TADD

Not required

= N (N + 1 ) 2N. N+1 2

TMUL + 2TADD

2 (N ×

2N

N+2

TF MA + TADD

Not required

24

N. 3+2N +

TMUL + 2TADD

2 (N ×

N 2

)

TMUL + 2TADD

2 (N ×

N 2

)

TF MA + TADD

2(N × N2 ) 4N

+3

N 3+N . 2 2

=

3N (N +3 ) 4 3N (2N +1 ) 4

)

(9, 7) recurse [12]

9

12

N N. 1+2 2

(9, 7) single PE proposed

6

3

N .2N +

(9, 7) 2-parallel [14]

10

16

N2 2

TMUL + 2TADD

(9, 7) P-block parallel [15]

4P

8P

N2 P

TMUL + 2TADD

Not required

(9, 7) High speed architecture [16]

18

32

N2 4

TMUL + 2TADD

Not required

(9, 7) Fast architecture [16]

10

16

N2 2

(9, 7) adder based MAC [17]

0

16

3 N 2

(9, 7) SCLA based [18]

6

8

+

N 1+2N . 2 2

N 2

=

N .2N =3N 2 2

TMUL + 2TADD

Not required

2TMUL + 4TADD

Not required

2N. 3+2N = N (3 + N )

TMUL + 2TADD

2 (N ×

TMUL + 2TADD

Not required 4N

+3

N 2

)

(9, 7) existing M-parallel [20]

6M

12M

N2 M

(9, 7) [21]

4

8

3 N4

TMUL + 2TADD TMUL

4N

TF MA + TADD

Not required

2

(9, 7) flipping based [22]

10

16

N2 2

(9, 7) proposed N-parallel

4N

3N

2N + 4

TMUL , TADD , and TFMA , are critical path depth of floating point multiplier, adder, and proposed fused multiply add respectively. Size of the block is P. Number of cycles is the sum of cycles required for row and column processes. In all these cases, the temporal buffer size is equal to N2 × N2 that is to store the output from column process of first level to start second level of DWT operation. Here, the transpose buffer is used to store the output from row process to start column process. Table 8 Performance analysis of lifting based 9-point single level 1D-DWT architectures. Worst path delay (ps)

Frequ -ency (MHz)

Total area (μ m2 )

(5, 3) direct [24] (5, 3) single PE proposed

4597.8 2883.8

217.5 346.7

9782.3 7313

(9, 7) direct [11] (9, 7) recurse [12] (9, 7) flipping based [19] (9, 7) single PE proposed (9, 7) 2-parallel PEs proposed

4561.7 4792.8 4335.6 2807.6

219.2 208.6 230.6 356.1

2731.9

366.1

Net power (nw)

Switching power (nw)

Leakage power (nw)

EOP (f J)

Total number of cycles

Total cycle delay (ps)

79890 46986.8

221072 155206.8

433129 339626.1

3007.8 1426.9

5 9

436791 25954.2

19202 10921.4 18887.3 7701.4

180350 93948.6 134635.8 44167.2

491729 269910.7 371179.5 146815.6

863615 488251.9 849559.5 347266.1

6182.6 3633.7 5292.6 1387.1

9 12 7 19

41055.3 57513.6 30349.2 53344.4

13950.6

86600.1

268654.1

634440.8

2467.1

10

27319.0

Total cycle delay = Total number of cycles × Worst path delay. All these implementations have been done with IEEE-754 single precision floating point number (32-bits precision).

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

11

Table 9 Performance analysis of lifting based 9-point 3-levels 1D-DWT architectures. Worst path delay (ps)

Frequ -ency (MHz)

Total area (μ m2 )

(5, 3) direct [24] (5, 3) single PE proposed (5, 3) 2-parallel PEs proposed

4681.2 2658.0

213.6 376.2

10654 8376.8

2760.4

362.3

(9, 7) direct [11] (9, 7) recurse [12] (9, 7) flipping based [19] (9, 7) single PE proposed (9, 7) 2-parallel PEs proposed

4728.9 4712.1 4410.7 2631.3 2730

Net power (nw)

Switching power (nw)

Leakage power (nw)

EOP (f J)

Total number of cycles

Total cycle delay (ps)

92061.9 10219.2

257491.6 50141.8

476278.1 385358.2

3434.9 1157.5

12 19

13342.9

140998.1

410319.2

648185.4

2921.8

9

24840

211.5 212.2 226.7 380.1

20076.7 11123 19629.97 8194.6

165716.9 98988 150630.8 19463.8

460522.6 271312.1 414951.4 81228.2

907683.1 490362.4 893551.0 390534.8

6470.1 3589.1 5771.4 1241.3

24 29 13 39

113493.6 136648.0 57339.1 102620.7

366.3

14167.6

87896.2

279754

654347.1

2550.1

17

46410

56174.4 50502

Total cycle delay = Total number of cycles × Worst path delay. All these implementations have been done with IEEE-754 single precision floating point number (32-bits precision). Table 10 Comparison of lifting based N × N-point 2D-DWT architectures with 2-levels.

(5, 3) direct [24] (5, 3) single PE proposed (5, 3) P-block parallel [15] (5, 3) High speed architecture [16] (5, 3) Fast architecture [16] (5, 3) adder based MAC [17] (5, 3) SCLA based [18] (5, 3) proposed N-parallel (9, 7) direct [11] (9, 7) recurse [12] (9, 7) single PE proposed (9, 7) 2-parallel [14] (9, 7) P-block parallel [15] (9, 7) High speed architecture [16] (9, 7) Fast architecture [16] (9, 7) adder based MAC [17] (9, 7) SCLA based [18] (9, 7) existing M-parallel [20] (9, 7) [21] (9, 7) flipping based [22] (9, 7) proposed N-parallel

Worst path delay (ps)

Freq -uency (MHz)

Total area (μ m2 )

4508.3 2676.2

221.8 346.5

51254.1 32303.6

4792.6

208.7

4822.2

Net power (nw)

Switching power (nw)

Leakage power (nw)

EOP (f J)

Total number of cycles

Total cycle delay (ps)

262620.7 228670.6

1022748.6 1053732.8

2206115.5 1965754.8

14556.6 8080.7

1032400.7 877793.6

290 0 0.3

176+53=229 254+74= 328 28+10=38

100884.5

521745.8

1528159.8

4522913.4

207.4

53740.6

286333.4

926427.1

2363365.6

15864.0

44+15=59

284509.8

4824.0

207.3

34547

207670.4

699029.9

1506288.5

10638.4

87+26=113

545112.0

4891.4

204.4

33488.1

203332.6

702018.5

1479489

10670.6

25+15=40

195656.0

4446.6 2824.7

224.8 354.0

48015.1 179271.4

288640.5 1160429.8

1357014.5 3468783.0

1980735.4 8236974.9

14841.6 33065.2

182+56=238 15+9=24

1058290.8 67792.8

4633.6 4667.7 2801.0

215.8 214.2 357.0

54808.5 37836.7 31795.7

327960.1 254023.5 160323.1

1060725.1 888948.2 645726.3

2422821.2 1599349.6 1379464.9

161413.6 11614.6 5672.5

810880.0 896198.4 929932.0

4777.2 4615.3

209.3 216.7

38537.1 88442.1

265122.5 272592.7

895213.3 783395.4

1697369.3 4018359.7

12385.2 22161.5

122+53=175 162+30=192 257+75= 332 42+27=69 22+7=29

4637.4

215.6

83965.3

501684.2

1458554.4

3736259.1

24090.4

21+7=28

129847.2

4679.8

213.7

38613.4

299117.5

907245.8

1718699.4

12288.8

42+14=56

262068.8

4909.9

203.7

34882.9

233462.8

785764.2

1368409.1

10576.7

18+12=30

147297.0

4521.9 4715.6

221.2 212.1

38326.6 182170.9

179162.1 1196879.2

637703.9 3353252.0

1654255.8 8240150.6

10364.1 54669.8

108+40=148 13+6=19

669241.2 89596.4

4500.1 4660.2 2776.2

225.4 214.6 360.2

53454.3 38543.2 175711.7

271621.5 309137.5 969626.8

1123241.1 898235.3 2886869.1

2110011.2 1807519.1 8322798.3

14549.9 12609.35 31120.2

162+70=232 42+14=56 23+14=37

1044023.2 260971.2 58300.2

182118.8

329281.8 133843.7

N = 13 for (5, 3) 2D-DWT, N = 9 for (9, 7) 2D-DWT, P = 6 for (5, 3) 2D-DWT, P = 4 for (9, 7) 2D-DWT and M = 7 (M-inputs can be processed in parallel) for (9, 7) 2D-DWT. Total number of cycles = Number of cycles required for first level 2D DWT + Number of cycles required for second level 2D DWT. Total cycle delay = Total number of cycles × Worst path delay. All these implementations have been done with IEEE-754 single precision floating point number (32-bits precision).

processes of proposed single PE (5, 3) 2D-DWT requires 7 × 7 = 49 and 7 × 72 = 25 cycles respectively, because 7-point proposed single PE (5, 3) 1D-DWT requires 7 cycles, which is shown in Table 3. Therefore, proposed single PE (5, 3) 2-levels 13 × 13-point 2D-DWT requires (169 + 85 ) + (49 + 25 ) = 328 cycles. Here, 7 × 7-point low pass output will be produced after first level, which is sent as the input for second level operation.

Similarly, the 9-point 1-level single PE proposed (9, 7) 1D-DWT requires 19 cycles that is shown in Table 5. Therefore, 9 × 9-point proposed single PE (9, 7) 2-levels 2D-DWT requires 19 × 9 = 171 cycles in row process of first level and 19 × 92 = 86 cycles in column processes for first level. In the second level, row and column processes of proposed single PE (9, 7) 2D-DWT requires 10 × 5 = 50 and 10 × 52 = 25 cycles respectively, because 5-point

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

JID: MICPRO 12

ARTICLE IN PRESS

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

Fig. 18. Layout chip diagram for proposed 13-parallel 13 × 13-point (5, 3) lifting based 2-levels 2D-DWT with core area as 215, 125.2 μm2 , die space around core as 30 μ m and total chip area as 274, 381.2 μm2 using 45-nm technology.

Fig. 19. Layout chip diagram for proposed 9-parallel 9 × 9-point (9, 7) lifting based 2-levels 2D-DWT with core area as 210, 853.2 μm2 , die space around core as 30 μ m and total chip area as 269, 557.2 μm2 using 45-nm technology.

proposed single PE (9, 7) 1D-DWT requires 10 cycles in practice. Therefore, proposed single PE (9, 7) 2-levels 9 × 9-point 2D-DWT requires (171 + 86 ) + (50 + 25 ) = 332 cycles. Here, 5 × 5-point low pass output will be produced after first level that is sent as the input for second level operation. The 13 × 13-point proposed N-parallel (5, 3) 2-levels 2D-DWT requires 13 + 2 = 15 cycles in first level. In the second level, it requires 7 + 2 = 9 cycles, because 7-point proposed single PE (5, 3) 1D-DWT requires 7 cycles, which is shown in Table 3. Here, extra 2 cycles in each level is added according to Eq. (18). So, the total

number of cycles required for N-parallel (5, 3) 2-levels 2D-DWT is 15 + 9 = 24, Here, N = 13. The 9 × 9-point proposed N-parallel (9, 7) 2-levels 2D-DWT requires 19 + 4 = 23 cycles in first level. In the second level, it requires 10 + 4 = 14 cycles, because 5-point proposed single PE (9, 7) 1D-DWT requires 10 cycles. Here, extra 4 cycles in each level is added according to Eq. (31). So, the total number of cycles required for N-parallel (9, 7) 2-levels 2D-DWT is 23 + 14 = 37, Here, N = 9. The total cycles required for existing/proposed 1D/2D-DWTs can be calculated from Tables 6 and 7. In Table 7, number of cycles

Fig. 20. Comparison of total cycle delay (Tdelay ), area, and energy per operation (EOP) between existing and proposed designs of (a) 9-point (5, 3) lifting based 1D-DWT with 3 levels, (b) 9-point (9, 7) lifting based 1D-DWT with 3 levels, and (c) 13 × 13-point (5, 3) lifting based 2D-DWT with 2 levels, and (d) 9 × 9-point (9, 7) lifting based 2D-DWT with 2 levels.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

ARTICLE IN PRESS

JID: MICPRO

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

is represented for single level 2D-DWT. Therefore, N = 13 for (5, 3), N = 9 for (9, 7) 2D-DWTs during first level and N = 7 for (5, 3), N = 5 for (9, 7) 2D-DWTs during second level. There is a small tolerance between the theoretical number of cycles (obtained from Tables 6 and 7) and the actual number of cycles (obtained from the Tables 8, 9, and 10). Fig. 18 and 19 show the layout chip diagram for proposed 13-parallel 13 × 13-point (5, 3) and 9-parallel 9 × 9-point (9, 7) lifting based 2D-DWTs using 45 nm technology respectively. In general, the proposed 2-parallel (5, 3) and (9, 7) 1D-DWTs are designed to optimize the total cycle delay. Here, the trade-off is total area and power dissipation (Example - Super computer). The proposed single PE (5, 3) and (9, 7) 1D-DWTs are designed to optimize the total area and power dissipation. Here, total cycle delay is the trade-off (Example - Hand held devices). Similarly, the proposed N-parallel (5, 3) and (9, 7) 2D-DWTs are designed to optimize the total cycle delay (high throughput). Here, the trade-off is total area and power dissipation (Example – Super computer). The proposed single PE (5, 3) and (9, 7) 2D-DWTs are designed to optimize the total area and power dissipation. Here, total cycle delay is the trade-off (Example – Hand held devices). For example, 9-point 2-parallel proposed (9, 7) single level 1D-DWT achieves 33.5% of reduction in total cycle delay compared with direct form. Similarly, 9-point proposed single PE (9, 7) single level 1D-DWT achieves 59.8% and 75.5% of reduction in total area and net power over direct form respectively. According to [23], the mean square error (MSE) is defined as,

MSE = σ 2 =

W 1 ((xn − yn )2 ) N

(32)

n=1

where xn is the input sequence, yn is the output sequence, and W is the length of data sequence. Peak-Signal-To-Noise ratio (PSNR) measures the size of error relative to peak value xpeak (for 8 bit pixel x2peak equals 255) of the signal and it is given by:

P SNR = 10log10

x2peak

σ2

(33)

All the existing and proposed designs are implemented using IEEE-754 single precision (32-bits) format with the standard round off noise reduction procedure [13]. In floating point arithmetic, the following three bits are used beyond the least significant bit of the mantissa for rounding. They are (1) Guard bit (G); (2) Round bit (R); and (3) sticky bit (S). These are used to perform round up, down, and even. During the rounding process, unit in the last place (ulp) plays major role [28,29]. The maximum of weight of the bits beyond the mantissa (unit in the last place) during IEEE-754 based rounding is ul p = 2−1 .G + 2−2 .R + 2−3 .S. Therefore, the value of ulp < 1 (it is always true). Therefore, MSE < 1 for floating point rounding. Therefore, the PSNR value is always high and nearly equal to the peak signal value. In other words, the output image using IEEE-754 based rounding technique will give very less rounding noise ( < 1). Fig. 20 (a), (b), and (c) show the comparison of total cycle delay (Tdelay ), area, and energy per operation (EOP) between existing and proposed designs of 9-point (5, 3) lifting based 1D-DWT with 3 levels, 9-point (9, 7) lifting based 1D-DWT with 3 levels, 13 × 13-point (5, 3) lifting based 2D-DWT with 2 levels, and 9 × 9-point (9, 7) lifting based 2D-DWT with 2 levels respectively. 4. Conclusion In this paper, high performance VLSI architectures for lifting based 1D and 2D-DWTs are proposed. The proposed logic used for area efficient lifting based DWT is to perform the whole operation with one processing element. Similarly, the proposed logic used

[m5G;September 6, 2016;19:50] 13

for delay efficient lifting based DWT is to perform the whole operation with multiple processing elements in parallel. In both the cases, the processing element consists of one floating point adder and one proposed fused multiply add design. The proposed and existing 1D/2D lifting based DWTs are implemented with 45 nm technology. The results show that the proposed designs achieve significant improvement compared with existing architectures. For example, 9-point 2-parallel proposed (9, 7) single level 1D-DWT achieves 33.5% of reduction in total cycle delay compared with direct form. Similarly, 9-point single PE proposed (9, 7) single level 1D-DWT achieves 59.8% and 75.5% of reduction in total area and net power over direct form respectively. References [1] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Trans. Inf. Theory 36 (5) (1990) 961–1005. [2] R.A. DeVore, B. Jawerth, B.J. Lucier, Image compression through wavelet transform coding, IEEE Trans. Inf. Theory 38 (2) (1989) 719–746. [3] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989) 674–693. [4] Y. Choi, J.-Y. Koo, N.-Y. Lee, Image reconstruction using the wavelet transform for positron emission tomography, IEEE Trans. Med. Imaging 20 (11) (2001) 1188–1193. [5] J. Nunez, X. Otazu, O. Fors, A. Prades, V. Pala, R. Arbiol, Multiresolution-based image fusion with additive wavelet decomposition, IEEE Trans. Geosci. Remote Sens. 37 (3) (1994) 1204–1211. [6] M. Antonini, M. Barlaud, P. Mathieu, I. Daubechies, Image coding using wavelet transform, IEEE Trans. Image Process. 1 (2) (1992) 205–220. [7] K. Andra, Chaitali chakrabarti and tinku acharya, a VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Trans. Signal Process. 50 (4) (2002) 966–977. [8] P.S. Pradhan, R.L. King, S. Member, N.H. Younan, D.W. Holcomb, Estimation of the number of decomposition levels for a wavelet-based multiresolution multisensor image fusion, IEEE Trans. Geosci. Remote Sens. 44 (12) (2006) 3674–3686. [9] B.U. Toreyn, O. Yilmaz, Y.M. Mert, F. Turk, Lossless hyperspectral image compression using wavelet transform based spectral decorrelation, in: IEEE International Conference on Recent Advances in Space Technologies (RAST), 2015, pp. 251–254. [10] B.K. Mohanty, P.K. Meher, Vlsi, in: IEEE International Asia Pacific Conference on Circuits and Systems, 2006, pp. 458–461. [11] L. Hong-jin, S. Yang, H. Xing, Z. Tie-jun, W. Dong-hui, H. Chao-huan, A novel VLSI architecture for 2-d discrete wavelet transform, in: IEEE ASIC, 2007, pp. 40–43. [12] G. Shi, W. Liu, L. Zhang, F. Lii, An efficient folded architecture for lifting-based discrete wavelet transform, IEEE Trans. Circuits Syst.-II 56 (4) (2009) 290–294. [13] Mohamed Asan Basiri M, N.M. Sk, An efficient hardware based higher radix floating point MAC design, ACM Trans. Des. Autom. Electr. Syst. (TODAES) 20 (1) (2014) 15:1–15:25. [14] W. Zhang, Z. Jiang, Z. Gao, Y. Liu, An efficient VLSI architecture for lifting-based discrete wavelet transform, IEEE Trans. Circuits Syst.-II 59 (3) (2012) 158–162. [15] B.K. Mohanty, A. Mahajan, P.K. Meher, Area- and power-efficient architecture for high-throughput implementation of lifting 2-d DWT, IEEE Trans. Circuits Syst.-II 59 (7) (2012) 434–438. [16] C. Xiong, J. Tian, J. Liu, Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme, IEEE Trans. Image Process. 16 (3) (2007) 607–614. [17] C.-H. Hsia, J.-S. Chiang Member, J.-M. Guo, Memory-efficient hardware architecture of 2-d dual-mode lifting-based discrete wavelet transform, IEEE Trans. Circuits Syst. Video Technol. 23 (4) (2013) 671–683. [18] L. Liu, N. Chen, H. Meng, L. Zhang, Z. Wang, H. Chen, A VLSI architecture of JPEG20 0 0 encoder, IEEE J. Solid State Circuits 39 (11) (2004) 2032–2040. [19] C.-T. Huang, P.-C. Tseng, L.-G. Chen, Flipping structure: an efficient VLSI architecture for lifting-based discrete wavelet transform, IEEE Trans. Signal Process. 52 (4) (2004) 1080–1089. [20] X. Tian, L. Wu, Y.-H. Tan, J.-W. Tian, Efficient multi-input/multi-output VLSI architecture for two-dimensional lifting-based discrete wavelet transform, IEEE Trans. Comput. 60 (8) (2011) 1207–1211. [21] B.H. Ang, U.U. Sheikh, M.N. Marsono, 2-d DWT system architecture for image compression, J. Signal Process. Syst., Springer (78) (2015) 131–137. [22] A. Darji Member, S. Agrawal, A. Oza, V. Sinha, A. Verma, S.N. Merchant, A.N. Chandorkar, Dual-scan parallel flipping architecture for a lifting-based 2-d discrete wavelet transform, IEEE Trans. Circuits Syst.-II 61 (6) (2014) 433–437. [23] H. ZainEldin, M.A. Elhosseini, H.A. Ali, Image compression algorithms in wireless multimedia sensor networks: a survey, Ain Shams Eng. J., Elsevier (6) (2015) 481–490. [24] T. Acharya, C. Chakrabarti, A survey on lifting-based discrete wavelet transform architectures, J. VLSI Sig. Process. 42 (2006) 321–339. [25] T. Murofushi, T. Dobashi, M. Iwahashi, H. Kiya, An integer tone mapping operation for HDR images in openexr with denormalized numbers, in: IEEE International Conference on Image Processing (ICIP), 2014, pp. 4497–4501.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

JID: MICPRO 14

ARTICLE IN PRESS

[m5G;September 6, 2016;19:50]

M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

[26] W. Liu, L. Chen, C. Wang, M.O. Neill, F. Lombardi, Inexact floating point adder for dynamic image processing, in: IEEE International Conference on Nanotechnology, 2014, pp. 239–243. [27] M.e.L. Pendu, C. Guillemot, D. Thoreau, Adaptive re-quantzation for high dynamic range video compression, in: IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), 2014, pp. 7367–7371.

[28] C. Lee, Multistep gradual rounding, IEEE Trans. Comput. 38 (4) (1989) 595–600. [29] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surv. 23 (1) (1991) 5–48. [30] R. Gonzalez, B.M. Gordon, M.A. Horowitz, Supply and threshold voltage scaling for low power CMOS, IEEE J. Solid State Circuits 32 (8) (1997) 1210–1216.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007

JID: MICPRO

ARTICLE IN PRESS M. Asan Basiri M., N. Mahammad Sk / Microprocessors and Microsystems 000 (2016) 1–15

[m5G;September 6, 2016;19:50] 15

Mohamed Asan Basiri M obtained Ph.D. degree from Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, India in July 2016. Currently, he is working as Post Doctoral Fellow in the department of Computer Science and Engineering, Indian Institute of Technology (IIT) Kanpur, India. His research interest includes VLSI for signal processing/cryptography and reconfigurable systems.

Noor Mahammad Sk obtained Ph.D. from Indian Institute of Technology (IIT) Madras, and he is currently working as Assistant professor in the department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, Chennai, India. His research interest includes software for VLSI, network on chip, open flow networking, software defined radio, and evolvable hardware.

Please cite this article as: M. Asan Basiri M., N. Mahammad Sk, An efficient VLSI architecture for lifting based 1D/2D discrete wavelet transform, Microprocessors and Microsystems (2016), http://dx.doi.org/10.1016/j.micpro.2016.08.007