Microprocessors and Microsystems 71 (2019) 102881
An efficient non-separable architecture for Haar wavelet transform with lifting structure

Serwan Ali Bamerni∗, Ahmed Kh. Al-Sulaifanie
College of Engineering, University of Duhok, Iraq
Article history: Received 4 November 2017; Revised 5 August 2019; Accepted 28 August 2019; Available online 29 August 2019.
Keywords: Haar wavelet, Lifting scheme, 2-D wavelet transform, FPGA.
Abstract

In this paper, a memory-efficient, fully integer-to-integer, parallel architecture for the 2-D Haar wavelet transform with lifting scheme is proposed. The main problem in most 2-D wavelet architectures is the intermediate or internal (on-chip) memory, which is usually proportional to the data size; increasing the internal memory increases the die area and the control complexity. The proposed algorithm is a non-separable Haar wavelet architecture derived by rearranging and combining the lifting steps carried out in the vertical and horizontal directions and performing them in a single, simple step. In addition to eliminating the internal memory, the proposed algorithm outperforms existing architectures in terms of hardware utilization, latency, number of arithmetic operations, power consumption, and area. The proposed algorithm is fully parallel, which makes it well suited for implementation on FPGA devices. Finally, an FPGA development board with a Zynq-series XC7Z020-1CLG484 device was used as the design platform, and all results and tables were estimated with the Vivado® Design Suite. © 2019 Elsevier B.V. All rights reserved.
1. Introduction

Over the past few decades, a considerable number of studies have been conducted on two-dimensional (2-D) discrete wavelet transforms (DWT) for image and video signals. Unfortunately, not all of the proposed algorithms are suitable for real-time processing because of their complexity or processing time. Ever since Daubechies and Sweldens [1] introduced the lifting scheme to implement the DWT, there has been renewed interest in employing this technique in hardware and software implementations for many applications, especially for attaining high-throughput and low-latency processing of high-resolution image and video signals, as for example in [2–6]. Although the lifting scheme has sped up the wavelet transform by reducing the number and complexity of the arithmetic operations, it still suffers from high memory storage requirements, and these memory issues dominate the hardware cost and complexity of 2-D DWT architectures. The storage resources include transposition memory, temporal memory and frame memory. Transposition memory is used in the 2-D DWT to transpose the intermediate results produced by the DWT in the row direction, for the input to the subsequent DWT in the column direction.
∗ Corresponding author.
E-mail addresses: [email protected] (S.A. Bamerni), [email protected] (A.Kh. Al-Sulaifanie).
https://doi.org/10.1016/j.micpro.2019.102881
0141-9331/© 2019 Elsevier B.V. All rights reserved.
Temporal memory is required to store the partial results produced by applying the DWT in each direction; these two types together are commonly called internal memory. Frame memory is needed in a multi-level DWT to store the intermediate data produced at each level for the succeeding level, and it is usually recommended to be implemented as external memory. Different algorithms have been proposed to decrease the memory required by lifting architectures, with internal memory ranging from 2N to N² (where N is the width and height of the N × N image) for line-based scanning [7–9]. Recently, Iwahashi et al. [10] presented a non-separable lifting scheme that recombines the update and predict processing units of the row and column directions for both the 5/3 and 9/7 lifting schemes, so that all the subbands (LL, LH, HL and HH) are generated at the same time. The advantage of this algorithm is the elimination of the transposition memory and a 25% decrease in latency, even though the number of arithmetic operations increases proportionally to the square of the filter size. In this paper, we propose a more efficient algorithm for implementing the Haar wavelet transform with a lifting structure. The Haar wavelet has found many applications in both signal and image processing [11–13]. The proposed algorithm has superiority over the other algorithms in terms of memory usage, latency, number of arithmetic operations, power consumption, and area.

The rest of the paper is organized as follows. Section 2 introduces the underlying concepts of the Haar wavelet transform and the derivation of the proposed algorithm equations.
Fig. 1. Block diagram of Haar wavelet lifting scheme.
Fig. 2. Block diagram of reverse Haar wavelet lifting scheme.
Section 3 presents the FPGA implementation and compares the proposed architecture with other related studies. Section 4 discusses the results. Finally, a brief conclusion is given in Section 5.

2. Lifting Haar wavelet transform
2.1. 1-D lifting scheme of Haar wavelet transform

To grasp the idea underlying the lifting scheme, we first have to decompose the analysis and synthesis filters into their polyphase components. The unnormalized analysis and synthesis Haar filter banks have the following coefficients [1]:

H_a(z) = 1/2 + (1/2) z^{-1}    (analysis LPF)    (1)

G_a(z) = -1 + z^{-1}    (analysis HPF)    (2)

H_s(z) = 1 + z^{-1}    (synthesis LPF)    (3)

G_s(z) = -1/2 + (1/2) z^{-1}    (synthesis HPF)    (4)

Using the Euclidean algorithm, the polyphase matrix can be written as

P(z) = \begin{bmatrix} 1 & -1/2 \\ 1 & 1/2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & -1/2 \\ 0 & 1 \end{bmatrix}    (5)

and on the analysis side

\hat{P}(1/z) = \begin{bmatrix} 1 & 1/2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}    (6)

Thus, the computation of the output coefficients can be summarized in the following steps:

Step 1, Splitting: the input signal x(n) at sampling rate f_s is split into even x_e(k) = x(2k) and odd x_o(k) = x(2k − 1) sequences at half the sampling rate, f_s/2.

Step 2, Prediction: the odd sample is predicted from the even sample through a prediction operator P (which is 1 in the Haar transform), and the prediction detail signal is d_1(k) = −x_e(k) + x_o(k).

Step 3, Updating: the detail signal d_1(k) is updated through an updating operator U to obtain the approximation signal a_1(k) = x_e(k) + 0.5 d_1(k).

These steps can be implemented as shown in the block diagram of Fig. 1. According to this procedure, the forward transform of the Haar wavelet can be implemented as

d_0(k) = x_o(k)    (7)

a_0(k) = x_e(k)    (8)

d(k) = d_0(k) − a_0(k)    (9)

a(k) = a_0(k) + (1/2) d(k)    (10)

The reconstruction process is easily obtained by reversing the direction of the data flow and the operators in the original formulas and then applying a merging process instead of splitting, as shown in Fig. 2. The even samples can be computed as

x_e(k) = a(k) − (1/2) d(k)

while the odd samples are

x_o(k) = d(k) + x_e(k)
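As a minimal software illustration of Eqs. (7)–(10) and of the reconstruction formulas above (this sketch is ours, not part of the paper; it assumes NumPy, an even-length input, and illustrative function names), the 1-D Haar lifting and its inverse can be written as:

```python
import numpy as np

def haar_lifting_1d_forward(x):
    """1-D Haar lifting, Eqs. (7)-(10): split, predict, update (no scaling)."""
    x = np.asarray(x, dtype=float)     # even-length signal assumed
    a0 = x[0::2]                       # even samples,  Eq. (8)
    d0 = x[1::2]                       # odd samples,   Eq. (7)
    d = d0 - a0                        # predict step,  Eq. (9)
    a = a0 + 0.5 * d                   # update step,   Eq. (10)
    return a, d

def haar_lifting_1d_inverse(a, d):
    """Inverse lifting: undo the update, undo the predict, then merge."""
    a0 = a - 0.5 * d                   # x_e(k) = a(k) - d(k)/2
    d0 = d + a0                        # x_o(k) = d(k) + x_e(k)
    x = np.empty(2 * len(a))
    x[0::2] = a0                       # interleave even samples
    x[1::2] = d0                       # interleave odd samples
    return x

# quick self-check: the lifting transform is perfectly invertible
x = np.array([3.0, 7.0, 1.0, 8.0, 2.0, 6.0])
a, d = haar_lifting_1d_forward(x)
assert np.allclose(haar_lifting_1d_inverse(a, d), x)
```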
2.2. Non-separable 2-D Haar wavelet lifting scheme

Generally, the 2-D wavelet transform is implemented by applying the 1-D transform in the row direction and then in the column direction, or vice versa, as shown in Fig. 3. The drawback of this method is the need for a large transposition memory, which reaches at least 2N for an N × N matrix; alternatively, the structure of the 1-D transform can be modified to reduce the transposition memory, but at the expense of an increase in frame memory.
Fig. 3. Conventional 2-D wavelet transform.
Fig. 4. Segment of a 2-D data.

In this paper, we propose a non-separable 2-D Haar wavelet lifting scheme which, unlike other architectures, requires neither transposition nor frame memory. Before proceeding to the derivation of the non-separable 2-D Haar wavelet lifting scheme, it is worthwhile to review the formulas that represent the 2-D bior(5,3) lifting scheme as an example. Consider the structure shown in Fig. 1 as part of a 2-D data set of size N × M; to obtain the HH subband, for example, the following equations must be carried out. First, the data are scanned in the row direction (column direction) to obtain the high band (H):

H(m, n) = x(m, n − 1) + α [x(m, n) + x(m, n − 2)]    (11)

where α in the bior(5,3) lifting structure is −1/2. The HH subband is then obtained by scanning all the resulting H components in the column direction (row direction):

HH(m, n) = H(m − 1, n) + α [H(m, n) + H(m − 2, n)]    (12)

So, according to Fig. 4, obtaining the HH subband requires nine data samples; in other words, a block of 3 × 3 samples must be scanned. The non-separable algorithms reported in [14,15], which address only bior(5,3) and bior(9,7), scan a 2 × 2 sample block and derive new predict and update units with different weights, optimized so that the result is as close as possible to that of the separable transform [14].

The equations of the proposed method can be derived as follows. For an N × N image matrix, with the data read in the direction shown in Fig. 5, the low-pass and high-pass subbands resulting from applying the 1-D Haar wavelet transform are obtained as

H_{11} = (X_{11} − X_{12}) × (1/√2) = (X_{11} − X_{12})/√2    (13)

L_{11} = (X_{12} + (√2/2) H_{11}) × √2 = (X_{11} + X_{12})/√2    (14)

and, in the same way,

H_{12} = (X_{21} − X_{22}) × (1/√2) = (X_{21} − X_{22})/√2    (15)

L_{12} = (X_{22} + (√2/2) H_{12}) × √2 = (X_{21} + X_{22})/√2    (16)

Now, to obtain the subbands LL, LH, HL and HH, the 1-D transform is applied in the other direction on both the L and H components:

LH_{11} = (L_{11} − L_{12}) × (1/√2) = (X_{11} − X_{21} + X_{12} − X_{22})/2    (17)

LL_{11} = (L_{12} + (√2/2) LH_{11}) × √2 = (X_{11} + X_{21} + X_{12} + X_{22})/2    (18)

HH_{11} = (H_{11} − H_{12}) × (1/√2) = (X_{11} − X_{21} − X_{12} + X_{22})/2    (19)

HL_{11} = (H_{12} + (√2/2) HH_{11}) × √2 = (X_{11} + X_{21} − X_{12} − X_{22})/2    (20)

The implementation of these equations is shown in Fig. 6. These equations have several important properties:

1. All subbands require the same sample components at the same time, in this case (X_{11}, X_{21}, X_{12} and X_{22}), which represent a 2 × 2 block of samples moving line by line to the end of the input matrix.
2. The equations are fully parallel, which exploits the parallelism advantage of the FPGA when implementing the architecture.
3. No transposition memory is needed, since all computations are performed directly on the spatial samples of the input data; this saves implementation area as well as the time lost in communicating with internal or external memory. Moreover, no temporal memory or delay registers are required, whereas these are present in all lifting-based and convolution-based structures.
Fig. 5. The data flow direction.
Fig. 6. The forward Haar wavelet lifting scheme.
Fig. 7. The reverse Haar wavelet lifting scheme.
4. Since all equations consist of additions and subtractions only, the transform maps integers to integers, which is an important property in many applications, such as lossless image compression [16].
Since all subband coefficients are divided by 2, this division can be omitted in all applications without any effect on the results. The same equations can be used to implement the reconstruction process by simply reversing the direction of the data flow, as shown in Fig. 7.
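To make the non-separable formulation concrete, the following sketch (an illustrative Python model written for this discussion, not the paper's XSG/FPGA design; the function names are ours and an even-sized image is assumed) evaluates Eqs. (17)–(20) directly on each 2 × 2 block and checks the result against the conventional separable row-then-column Haar transform:

```python
import numpy as np

def haar_2d_nonseparable(X):
    """Non-separable 2-D Haar: each 2x2 input block is mapped directly to one
    (LL, LH, HL, HH) coefficient per Eqs. (17)-(20); no row/column intermediate
    results are stored, so no transposition or temporal memory is modeled."""
    X = np.asarray(X, dtype=float)
    X11, X12 = X[0::2, 0::2], X[0::2, 1::2]   # top-left, top-right of each block
    X21, X22 = X[1::2, 0::2], X[1::2, 1::2]   # bottom-left, bottom-right
    LL = (X11 + X12 + X21 + X22) / 2          # Eq. (18)
    LH = (X11 + X12 - X21 - X22) / 2          # Eq. (17)
    HL = (X11 - X12 + X21 - X22) / 2          # Eq. (20)
    HH = (X11 - X12 - X21 + X22) / 2          # Eq. (19)
    return LL, LH, HL, HH                     # the /2 may be dropped for an
                                              # integer-only implementation

def haar_1d_pairs(A):
    """Orthonormal 1-D Haar on adjacent column pairs (separable reference)."""
    return ((A[:, 0::2] + A[:, 1::2]) / np.sqrt(2),
            (A[:, 0::2] - A[:, 1::2]) / np.sqrt(2))

# Check against the conventional separable transform: rows first, then columns.
X = np.random.rand(8, 8)
L, H = haar_1d_pairs(X)                        # row-direction pass
LLs, LHs = (m.T for m in haar_1d_pairs(L.T))   # column-direction pass on L
HLs, HHs = (m.T for m in haar_1d_pairs(H.T))   # column-direction pass on H
LL, LH, HL, HH = haar_2d_nonseparable(X)
assert all(np.allclose(a, b) for a, b in
           [(LL, LLs), (LH, LHs), (HL, HLs), (HH, HHs)])
```

Because every output coefficient is a plain sum or difference of the four samples of one block (optionally divided by 2), all four subbands can be produced concurrently from the same 2 × 2 block, which is exactly the property exploited by the fully parallel hardware implementation.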
3. Architecture implementation and comparison

The proposed architectures were implemented on an FPGA using the Xilinx System Generator (XSG). An FPGA development board with a Zynq-series XC7Z020-1CLG484 device was used as the design platform, and all the tables shown below are estimated by the Vivado® Design Suite. First, a comparison is made between the separable and non-separable Haar wavelet transforms in terms of hardware resources, power consumption, and system speed; the implementations of the non-separable 2-D Haar and the separable 2-D Haar are shown in Figs. 9 and 10, respectively. Then a comparison is made with other lifting schemes in terms of computation time, internal memory, latency and the number of adders and multipliers.
Fig. 8. The power consumption diagram for both the separable and non-separable Haar lifting scheme architectures.

Table 1
Hardware resource utilization for each architecture.

Method                                      | Slice LUTs (avail. 53,200) | Slice Registers (avail. 106,400) | DSPs (avail. 220) | Block RAM Tiles (avail. 140)
                                            | Used    Util. (%)          | Used    Util. (%)                | Used   Util. (%)  | Used   Util. (%)
Separable 2-D lifting scheme                | 725     1.36               | 262     0.25                     | 0      0          | 1      0.71
Proposed non-separable 2-D lifting scheme   | 160     0.3                | 0       0                        | 0      0          | 0      0
3.1. Hardware resources utilization comparison

Table 1 summarizes the performance of the proposed method compared with the separable one in terms of hardware resources, with the 2-D test input assumed to be 8 × 8 in dimension, which is nearly the smallest input size that can be used for comparison.

3.2. System speed comparison

Another important factor that should be evaluated in our comparison is the system speed, i.e. the maximum speed at which the system can operate. The system speed is determined by the critical path (the longest-delay path) in the circuit: data on shorter paths arrive at a node earlier and must wait for the data on the longest path to propagate, which limits the speed performance of the system. Table 2 summarizes the results of this comparison in terms of the worst negative slack (WNS), i.e. the path with the largest slack violation, and the maximum working frequency.

Table 2
The maximum speed and the worst negative slack for each architecture.

Method                                      | WNS                | Max. frequency
Separable 2-D lifting scheme                | −0.131 ns          | 97 MHz
Proposed non-separable 2-D lifting scheme   | No negative slack  | Up to the maximum system clock

3.3. Power consumption comparison

The power consumption depends on a number of factors such as clock frequency, resource density, interconnect structure, and power supply voltage levels. As in other embedded systems, the consumed power is composed of two components, static power and dynamic power. The static power is virtually independent of the implemented architecture and results mainly from transistor leakage current when the device is powered but not configured, while the dynamic power depends entirely on the implemented design and is accumulated from logic power, signal power and clock power. Fig. 8 and Table 3 summarize the power consumption comparison for each of the implemented architectures.

Table 3
The power consumption for each architecture.

Method                                      | Dynamic power | Static power
Separable 2-D lifting scheme                | 0.027 W       | 0.118 W
Proposed non-separable 2-D lifting scheme   | 0.001 W       | 0.118 W

3.4. Comparison with other architectures

Since, to our knowledge, there is no previous architecture for the 2-D non-separable Haar wavelet transform, a comparison is made with other 2-D architectures for the Haar, bior(5,3) and bior(9,7) wavelets in terms of computation time, internal memory, latency and the number of adders and multipliers. Table 4 summarizes the results of this comparison, while Table 5 compares the proposed architecture with the non-separable architectures for bior(5,3) and bior(9,7) proposed by Iwahashi et al. [14].

4. Discussion of the results

According to the results presented in the previous section, the following notes can be summarized:

1. The results show that the proposed method outperforms the other architectures in most of the criteria.
2. The proposed algorithm surpasses even the 1-D lifting-based Haar and other 1-D wavelet filters in most of the criteria, as a result of combining the lifting operations into a single parallel summation of the input samples.
Fig. 9. XSG system implementation of 2-D non-separable Haar wavelet transform.
Fig. 10. XSG system implementation of 2-D separable Haar wavelet transform.
Table 4
Comparisons of various 2-D DWT architectures with the proposed one.

                  | [17] Haar | [18] (5/3)    | [9] (5/3)             | [8] (9/7)     | [19] (9/7) | Our Haar
Computing time    | 2N²       | N²/2 + N + 5  | (3/4)N² + (3/2)N + 7  | N² + 3N + 22  | N²/4       | N²/4
Internal memory   | 2N²       | 3.5N          | 2N                    | 5.5N          | 5.5N       | 0
Latency           | N × N     | 2N + 5        | (3/2)N + 3            | –             | N          | 1
Adders            | 4         | 8             | 8                     | 8             | 32         | 12
Multipliers       | –         | 4             | 0                     | 6             | 18         | 4
Table 5
Comparisons of the non-separable architectures proposed in [14] with the proposed one.

                     | [14] (5/3) | [14] (9/7) | Our Haar
Computing time       | N²/4 + 11  | N²/4 + 11  | N²/4
Transposition memory | 0          | 0          | 0
Line buffers         | 8          | 28         | 0
Latency              | 5          | 11         | 1
Adders               | 10         | 24         | 12
Multipliers          | 10         | 24         | 4
3. The advantage of eliminating the internal memory is not limited to hardware utilization; the most important factor is its effect on system speed, since there is no need to communicate with any external memory or to use any internal memory to store intermediate results, which is the main factor affecting the system speed and the consumed dynamic power.
4. All the architectures used in the comparisons, even the non-separable ones, need some pipeline stages to improve their speed, unlike the proposed one, which does not need any pipelining stage since all processing takes place in a single clock cycle.
5. Compared with the separable architectures, the internal memory of the proposed non-separable architecture is reduced by about 2N relative to bior(5,3), 5.5N relative to bior(9,7), and 2N² relative to the separable Haar architecture.
6. The latency of the proposed architecture, which is 1 clock cycle, outperforms all the previous architectures listed in Tables 4 and 5.
7. The proposed architecture uses fewer multipliers and adders than the architectures listed in Tables 4 and 5, and hence outperforms them in terms of maximum operating speed, as shown in Table 2, which indicates that the proposed architecture can operate up to the maximum system clock of the FPGA.
5. Conclusion

An efficient architecture for implementing the Haar wavelet transform has been proposed in this paper. The algorithm has many advantages, the main one being the elimination of both the transposition and frame memory, in addition to a fully integer-to-integer transform due to the absence of multiplication operations, low hardware utilization, low power consumption and fully parallel processing. Finally, the algorithm has been successfully verified using an FPGA development board with a Zynq-series XC7Z020-1CLG484 device.

Declaration of Competing Interest

None.

References

[1] I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting steps, J. Fourier Anal. Appl. 4 (3) (1998) 247–269.
[2] J. Azar, C. Habib, R. Darazi, A. Makhoul, J. Demerjian, Using adaptive sampling and DWT lifting scheme for efficient data reduction in wireless body sensor networks, in: 2018 14th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), 2018, pp. 1–8.
[3] O. Prakash, C.M. Park, A. Khare, M. Jeon, J. Gwak, Multiscale fusion of multimodal medical images using lifting scheme based biorthogonal wavelet transform, Optik (Stuttg) 182 (2019) 995–1014.
[4] P. Sendamarai, M.N.G. Prasad, S.M. Prabhu, K.N. Nagesh, A DCF based lifting scheme for the satellite image compression, in: 2016 3rd MEC International Conference on Big Data and Smart City (ICBDSC), 2016, pp. 1–4.
[5] R. Mehta, N. Rajpal, V.P. Vishwakarma, Robust image watermarking scheme in lifting wavelet domain using GA-LSVR hybridization, Int. J. Mach. Learn. Cybern. 9 (1) (2018) 145–161.
[6] Y. Ma, T. Li, Y. Ma, K. Zhan, Novel real-time FPGA-based R-wave detection using lifting wavelet, Circuits Syst. Signal Process. 35 (1) (2016) 281–299.
[7] P.-Y. Chen, VLSI implementation for one-dimensional multilevel lifting-based wavelet transform, IEEE Trans. Comput. 53 (4) (2004) 386–398.
[8] B.-F. Wu, C.-F. Lin, A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec, IEEE Trans. Circuits Syst. Video Technol. 15 (12) (2005) 1615–1628.
[9] C.-H. Hsia, J.-S. Chiang, J.-M. Guo, Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform, IEEE Trans. Circuits Syst. Video Technol. 23 (4) (2013) 671–683.
[10] M. Iwahashi, H. Kiya, A new lifting structure of non separable 2D DWT with compatibility to JPEG 2000, in: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 1306–1309.
[11] D.L. Fugal, Conceptual Wavelets in Digital Signal Processing, 174, Sp. Signals Tech. Publ., San Diego, CA, 2009.
[12] S. Tedmori, N. Al-Najdawi, Image cryptographic algorithm based on the Haar wavelet transform, Inf. Sci. 269 (2014) 21–34.
[13] C.-C. Tu, C.-F. Juang, Recurrent type-2 fuzzy neural network using Haar wavelet energy and entropy features for speech detection in noisy environments, Expert Syst. Appl. 39 (3) (2012) 2479–2488.
[14] M. Iwahashi, H. Kiya, Non separable 2D factorization of separable 2D DWT for lossless image coding, in: Image Processing (ICIP), 2009 16th IEEE International Conference on, 2009, pp. 17–20.
[15] M. Kaaniche, J.-C. Pesquet, A. Benazza-Benyahia, B. Pesquet-Popescu, Two-dimensional non separable adaptive lifting scheme for still and stereo image coding, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010, pp. 1298–1301.
[16] A.R. Calderbank, I. Daubechies, W. Sweldens, B.-L. Yeo, Wavelet transforms that map integers to integers, Appl. Comput. Harmon. Anal. 5 (3) (1998) 332–369.
[17] A. Ahmad, A. Amira, P. Nicholl, B. Krill, Dynamic partial reconfiguration of 2-D Haar wavelet transform (HWT) for face recognition systems, in: 2011 IEEE 15th International Symposium on Consumer Electronics (ISCE), 2011, pp. 9–13.
[18] K. Andra, C. Chakrabarti, T. Acharya, A VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Trans. Signal Process. 50 (4) (2002) 966–977.
[19] C. Xiong, J. Tian, J. Liu, Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme, IEEE Trans. Image Process. 16 (3) (2007) 607–614.

Serwan Bamerni received the B.Sc. degree in Electrical Engineering from the University of Baghdad and the M.Sc. degree in solid state electronics from the University of Technology, Baghdad, Iraq, in 1997 and 2002, respectively. He is now a Ph.D. student in Electrical and Computer Engineering at the University of Duhok.
His research fields include VLSI architectures and algorithms for signal processing, and reconfigurable computing for multimedia systems.
Ahmed K. Al-Sulaifanie received his B.Sc. and M.Sc. degrees in Electrical Engineering (Electronics and Communications) from the College of Engineering at the University of Mosul in 1977 and 1980, respectively, and his Ph.D. in Electrical Engineering (Computer Architectures) from the University of Mosul in 2003. His research interests include VLSI architectures and algorithms for signal processing and data scheduling optimization, especially for multimedia applications.