Available online at www.sciencedirect.com
ScienceDirect Procedia Computer Science 54 (2015) 605 – 611
Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015)
An Efficient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture Amita Nandala,∗ , T. Vigneswarnb , Ashwani K. Ranaa and Arvind Dhakac a Department of ECE, NIT Hamirpur b Department of ECE, VIT University, Chennai c Deparment of CSE, NIT Hamripur
Abstract This paper discusses FPGA implementation of Finite Impulse Response (FIR) filters using Distributed Arithmetic (DA) which substitute multiply and accumulate operations with a series of Look-Up-Table (LUT) accesses. Parallel FIR digital filter can be used either for high speed or low-power applications. The distributed arithmetic provides a multiplication-free method for calculating inner products of fixed-point data, based on table lookups of pre calculated partial products. The implementation results are provided to demonstrate a high-speed and low power proposed architecture. The proposed filter is implemented in very high speed integrated circuit hardware description language (VHDL) and verified via simulation. The proposed method offers average reductions of 60% in the number of LUT, 40% reduction in occupied slices and 50% reduction in the number gates for parallel FIR filter implementation. © 2015 2015The TheAuthors. Authors. Published Elsevier © Published by by Elsevier B.V.B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015). Processing-2015 (IMCIP-2015)
Keywords:
Distributed arithmetic; DSP; Finite impulse response; LUT; MAC and parallel filters.
1. Introduction Due to the intensive use of Finite Impulse Response (FIR) filters in video and communication systems, high performance in speed, area and power consumption is demanded. Basically, digital filters are used to modify the characteristic of signals in time and frequency domain and have been recognized as primary digital signal processing element. In DSP, the design methods were mainly focused in multiplier-based architectures to implement the Multiply-And-Accumulate (MAC) blocks that constitute the central piece in FIR filters and several functions. Fast parallel filter structures have been discussed in detail in1–9 . Finite Impulse Response (FIR) filters are important building blocks for various Digital Signal Processing (DSP) applications. Recently, because of the increasing demand for video-signal processing and transmission, high-speed and high-order programmable FIR filters have frequently been used to perform adaptive pulse shaping and signal equalization on the received data in real time, such as ghost cancellation10, 11 and channel equalization12. Hence, an efficient VLSI architecture for a high-speed programmable FIR filter is needed. ∗ Corresponding author. Tel.: 09736494921.
E-mail address: amita−
[email protected]
1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015) doi:10.1016/j.procs.2015.06.070
606
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611
Fig. 1.
Block diagram of conventional parallel FIR filtering.
In this work we are addressing optimizations in the multiplier block of FIR filters by changes in architectural level improvements using distributed arithmetic concept of digital filtering which improves device utilization. Distributed Arithmetic (DA) is a high speed multiplication technique used for implementation of digital filters and signal up conversions21, 22. The DA is bit serial word parallel approach where throughput rate does not depend on filter length or data size. This paper is organized as follows. Section 2 briefly describes background study. We present a distributed arithmetic based filtering scheme in section 3. In section 4 we present our proposed FIR filter design flow. Section 5 shows the synthesis results obtained by XILINX. Section 6 summarizes the conclusions and presents proposals for future work. 2. Background Study 2.1 Parallel FIR filters A FIR filter can be mathematically expressed by the equation (1)13, 26. y[n] =
N−1
h[i ]x[n − i ]
(1)
i=0
where X represents the input signal, H the filter coefficients, Y the output signal, Y [n] is the current output sample, and N is the number of taps of the filter. In the sequential implementation a set of Multiply-And-Accumulate (MAC) operations is performed for each sample of the input data signal, multiplying the N delayed input samples by coefficients and summing up the results together to generate the output signal. In parallel implementations, we can have two main architectures. The first one consists of unrolling of MAC loop where we have several delayed versions of the input signal entering in a fully parallel multiplier block, followed by a summation block. The other one consists of a multiplier block, which takes the same input signal and delivers each output to an input of a delayed summation block. Fig. 1 shows the basic block diagram of conventional parallel FIR filtering. Several techniques for optimizing the multiplier block of parallel FIR filters were proposed in the literature. Most of them consider the use the fixed-point representation and transposed form implementation, because it is easier to obtain common sub expressions to be shared along two or more multipliers in this form14, 15. Many consider the use of some kind of Signed Digit (SD) representation, mainly the Canonical Signed Digit (CSD) representation16, 17, which results in fewer non-zero digits in each coefficient, usually resulting in a smaller multiplier block. Previous research has been shown reductions of more than 50%16 in the number of adders by using these techniques. The great advantage of these techniques is that the optimized filter has the same behavior of the original non-optimized one (i.e. same impulse response or transfer function). Another approach consists of representing each coefficient as a sum of power-of-two terms and limiting the number of power-of-two terms in each coefficient17–20 . An FIR digital filter of degree N is described by the impulse response, h n , n = 0, 1, 2, . . . , N and its transfer function is represented as follows: N H (z) = h n z −n (2) n=0
607
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611
Fig. 2. Concept of existing distributed arithmetic based filtering.
the traditional L-parallel FIR filter can be derived using polyphase decomposition3 as, L−1
Y p (z L )z − p =
p=0
L−1
X q (z L )z −q
q=0
L−1
Hr (z L )z −r
(3)
r=0
A two-parallel FIR filter can be expressed as Y0 + z −1 H1 = (H0 + z −1 H1 )(X 0 + z −1 X 1 ) = H0 X 0 + z −1 (H0 X 1 + H1 X 0 ) + z −2 H1 X 1
(4) (5)
Equation 4 and equation 5 implies that, Y0 = H0 X 0 + z −2 H1 X 1
(6)
Y1 = H0 X 1 + H1 X 0
(7)
2.2 Distributed arithmetic based filtering scheme Distributed Arithmetic was first brought up by Croisier23 , and was extended to cover the signed data system. Then it was introduced into FPGA design to save MAC blocks with the development of FPGA technology. High performance FIR filter based on DA using LMS architecture is implemented in24, 25. If h[n] is the filter coefficient and x[n] is the input sequence to be processed, the N-length FIR filter can be described as final form of distributed arithmetic as, y = −2 B x B [n]h[n] +
B−1 b=0
2b
N−1
h[n]x b [n]
(8)
N=0
Figure 2 shows the existing DA based filtering scheme27. 3. Proposed Work The basic LUT-DA scheme on an FPGA would consist of three main components: the input registers, the 4-input LUT unit and the shifter/accumulator unit. Additionally, it would require a control unit to manipulate the filter operation, and an adder tree unit to perform addition on partial filter results. Applying this approach the 4-input LUT unit will not be directly accessed instead 2-input LUT is used based on multiplexer select. The concept of multiplexer based DA filtering scheme is shown in Fig. 4. The proposed DA based filter architecture is shown in Fig. 4 for 256-tap parallel FIR filter. This architecture uses the concept of multiplexer based DA filtering scheme shown in Fig. 3. The particular 2-input LUT is selected which represent all the possible sum combinations of filter coefficients. It implies about 50% reduction in the number of LUT used with increased speed. There are two main aspects to be considered when designing a parallel filter, namely the number of bits required for the signal and the required transfer function of the filter. The former one determines the word length of the entire data path. The later one is determined by two parameters, namely the number of taps, and the number of bits in each coefficient. The multipliers are the most expensive blocks in terms of area, delay,
608
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611
Fig. 3. Multiplexer based DA filtering scheme.
Fig. 4.
Proposed parallel digital FIR filter architecture.
and power in a FIR filter when considering custom implementation. In fact, even for a dedicated implementation of constant-coefficient multipliers, the amount of hardware needed is very high as we have several multipliers in the entire filter. To evaluate the performance of the proposed scheme, 4-tap, 8-tap, 16-tap, 32-tap, 64-tap, 128-tap and 256-tap parallel FIR filters are implemented using VHDL and synthesis is carried out in XILINX-ISE8.1i. The results are compared with the conventional parallel FIR filtering scheme and existing DA based implementation. 4. Results and Discussion The simulation has been done using MODEL SIM 6.4 and XILINX Integrated Software Environment (ISE) is used for performing synthesis and implementation of designs using ‘Spartan-3’ device. The power analysis has been
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611
Fig. 5. Comparison of number of LUTs used.
Fig. 7.
Comparison of number of gates used.
Fig. 6.
Comparison of number of occupied slices.
Fig. 8.
Comparison of delay.
done using XILINX XPOWER tool. The evaluation of device utilization using proposed DA architecture can be comprehended easily with the help of the results in graphs shown below. Figure 5 reports the comparison of number of LUTs used among the various filter architectures designed using existing and proposed method. It is shown that proposed multiplexer based DA filter comprehends the existing DA based parallel FIR filter. The number of LUTs are reduced by 66% using proposed method as compared to DA based method for parallel FIR filter implementation. This reduction is 19% when compared with conventional filter implementation. Figure 6 plots the comparison of number of occupied slices among various parallel FIR filter architectures. It is shown that the proposed multiplexer based DA filter has 42% reduced number of slices as compared to existing DA based filter architecture. This reduction is 20% when compared to conventional architecture. Figure 7 reports the comparison of number of gates used among various filter architectures designed. It is shown that the proposed multiplexer based DA filter has 55% reduced number of slices as compared to existing DA based filter architecture. This reduction is 14% when compared to conventional architecture. Figure 8 reports the delay comparison among various filter architectures designed. It is shown that the proposed multiplexer based DA filter is 10% faster as compared to existing DA based filter architecture. This improvement is 5% when compared to conventional architecture.
609
610
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611
Fig. 9. Comparison of power consumption.
Fig. 10.
Comparison of Power Delay Product (PDP).
Figure 9 reports the power consumption comparison among various filter architectures designed. It is shown that the proposed multiplexer based DA filter has 55% reduced power consumption as compared to existing DA based filter architecture. This reduction is 27% when compared to conventional architecture. The overall performance metric (i.e. power, delay product) for proposed design is also improved as shown in Fig. 10. 5. Conclusion We have presented an efficient multiplexer based DA scheme which is used to implement parallel FIR filters. The device utilization of the proposed architecture is relatively less since it uses split LUT technique. Proposed method is implemented for 4-tap to 256-tap parallel FIR filters and can be even extended for more taps. A high speed and less area implementation is achieved. The test results indicate that the designed filter using proposed distributed arithmetic can work stable with high speed and can save almost 50 percent hardware resources. For various digital signal processing applications the various parameters and order of filter can be changed accordingly. References [1] D. A. Parker and K. K. Parhi, Low-Area/Power Parallel FIR Digital Filter Implementations, Journal of VLSI Signal Process. Syst., vol. 17(1), pp. 75–92, (1997). [2] J. G. Chung and K. K. Parhi, Frequency-Spectrum-Based Low-Area Low-Power Parallel FIR Filter Design, EURASIP J. Appl. Signal Process., vol. 9, pp. 444–453, (2002). [3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, (New York, Wiley), (1999). [4] Z. J. Mou and P. Duhamel, Short-Length FIR Filters and their use in Fast Nonrecursive Filtering, IEEE Trans. Signal Process., vol. 39(6), pp. 1322–1332, (1991). [5] C. Cheng and K. K. Parhi, Hardware Efficient Fast Parallel FIR Filter Structures based on Iterated Short Convolution, IEEE Trans. Circuits Syst., 51(8), pp. 1492–1500, (2004). [6] J. I. Acha, Computational Structures for Fast Implementation of L-Path and L-Block Digital Filters, IEEE Trans. Circuits Syst., 36(6), pp. 805–812, (1989). [7] I. S. Lin and S. K. Mitra, Overlapped Block Digital Filtering, IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 43, pp. 586–596, (1996). [8] C. Cheng and K. K. Parhi, Further Complexity Reduction of Parallel FIR Filters, Proc. IEEE Int. Symp. Circuits Syst., (Kobe, Japan), pp. 1835–1838, (2005). [9] R. C. Agarwal and J. W. Cooley, New Algorithms for Digital Convolution, IEEE Trans. Acoust. Speech, Signal Process., vol. 25(5), pp. 392–410, (1977). [10] B. Edwards, A. Corry, N. Weste and C. Greenberg, A Single Chip Ghost Canceller, IEEE 1992 Custom Integrated Circuits Conference, pp. 26.5.1–26.5.14, (1992).
Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611 [11] J. R. Choi, L. H. Jang, S. W. Jung and J. H. Choi, Structured Design of a 288-tap FIR Filter by Optimized Partial Product Tree Compression, IEEE J. Solid-State Circuits, 32(3), pp. 468–476, (1997). [12] D. J. Pearson, S. K. Reynolds, A. C. Megdanis, S. Gowda, K. R.Wrenner, M. Immediato, R. L. Galbraith and H. J. Shin, Digital FIR Filters for High Speed PRML Disk Read Channels, IEEE J. Solid-State Circuits, vol. 30(12), pp. 1517–1523, (1995). [13] R. W. Hamming, Digital Filters (Prentice Hall, 3rd edition, 1989). [14] M. Potkonjak, M. B. Srivastava and A. Chandrakasan, Efficient Substitution of Multiple Constant Multiplication by Shifts and Addition using Iterative Pairwise Matching. Proc. 31st ACM/IEEE Design Antomation Conf., pp. 189–194, (1994). [15] M. Mehendale, S. D. Sherlekar and G. Venkatesh, Synthesis of Multiplier-less FIR Filters with Minimum Number of Additions, Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 668-671, (1995). [16] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde and D. Iuraekova, A New Algorithm for Elimination of Common Subexpressions, IEEE Trans. Computer-Aided Design, vol. 18, pp. 58–68 (1999). [17] H. Samueli, An Improved Search Algorithm for the Design of Multiplier-Less FIR Filters with Powers-of-Two Coefficients, IEE Trans. Circuits Syst., vol. 36, pp. 1044–1047, (1989). [18] K. H. Chen and T. D. Chiueh, Design and Implementation of a Reconfigurable FIR Filter, Proc of 2003 Int. Symp. Circuits Systems, pp. 25–28, (2003). [19] C. Lim, J. B. Evans and B. Liu, Decomposition of Binary Integers into Signed Power-of-Two Terms, IEEE Trans. Circuits Syst., vol. 38, pp. 667–672 (1991). [20] J. Portela, E. Costa and J. Monteiro, Optimal Combination of Number of Taps and Coefficient Bit-Width for Low Power FIR Filter Realization, IEEE European Conference on Circuit Theory and Design, pp. 145–148, (2003). [21] B. New, A Distributed Arithmetic Approach to Designing Scalable DSP Chips, EDN. (1995). [22] W. P. Burleson and L. L. Scharf, A VLSI Design Method for Distributed Arithmetic, VLSI Sig. Proc, vol. 2, (1991). [23] Croisier, D. J. Esteban, M. E. Levilion and V. Rizo, Digital Filter for PCM Encoded Signals, U.S. Patent, no. 3,777,130, (1973). [24] B. K. Mohanty and P. K. Meher, A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter based on New Distributed Arithmetic Formulation of Block LMS Algorithm, IEEE Transactions on Signal Processing, vol. 61(4), pp. 921–930, (2013). [25] E. Trakultritrung and E. Thanangchusin, Distributed Arithmetic LMS Adaptive Filter Implementation without Look-up Table, 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 1–4, (2012). [26] Yu-Chi Tsao and Ken Choi, Area-Efficient Parallel FIR Digital Filter Structures for Symmetric Convolutions based on Fast FIR Algorithm, IEEE Transactions on VLSI Systems, vol. 20(2), pp. 366–371, (2012). [27] T. Vigneswaran and P. Subbaramani Reddy, Design of Digital FIR Filter based on Dynamic Distributed Arithmetic Algorithm, Indian Journal of Applied Sciences, vol. 7(9), pp. 2908–2910, (2007).
611