An Efficient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 54 (2015) 605 – 611 Eleventh International Multi-Conference on Inf...

Download PDF

1MB Sizes 7 Downloads 92 Views

Report

PDF Reader
Full Text

Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 54 (2015) 605 – 611

Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015)

An Efﬁcient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture Amita Nandala,∗ , T. Vigneswarnb , Ashwani K. Ranaa and Arvind Dhakac a Department of ECE, NIT Hamirpur b Department of ECE, VIT University, Chennai c Deparment of CSE, NIT Hamripur

Abstract This paper discusses FPGA implementation of Finite Impulse Response (FIR) ﬁlters using Distributed Arithmetic (DA) which substitute multiply and accumulate operations with a series of Look-Up-Table (LUT) accesses. Parallel FIR digital ﬁlter can be used either for high speed or low-power applications. The distributed arithmetic provides a multiplication-free method for calculating inner products of ﬁxed-point data, based on table lookups of pre calculated partial products. The implementation results are provided to demonstrate a high-speed and low power proposed architecture. The proposed ﬁlter is implemented in very high speed integrated circuit hardware description language (VHDL) and veriﬁed via simulation. The proposed method offers average reductions of 60% in the number of LUT, 40% reduction in occupied slices and 50% reduction in the number gates for parallel FIR ﬁlter implementation. © 2015 2015The TheAuthors. Authors. Published Elsevier © Published by by Elsevier B.V.B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015). Processing-2015 (IMCIP-2015)

Keywords:

Distributed arithmetic; DSP; Finite impulse response; LUT; MAC and parallel ﬁlters.

1. Introduction Due to the intensive use of Finite Impulse Response (FIR) ﬁlters in video and communication systems, high performance in speed, area and power consumption is demanded. Basically, digital ﬁlters are used to modify the characteristic of signals in time and frequency domain and have been recognized as primary digital signal processing element. In DSP, the design methods were mainly focused in multiplier-based architectures to implement the Multiply-And-Accumulate (MAC) blocks that constitute the central piece in FIR ﬁlters and several functions. Fast parallel ﬁlter structures have been discussed in detail in1–9 . Finite Impulse Response (FIR) ﬁlters are important building blocks for various Digital Signal Processing (DSP) applications. Recently, because of the increasing demand for video-signal processing and transmission, high-speed and high-order programmable FIR ﬁlters have frequently been used to perform adaptive pulse shaping and signal equalization on the received data in real time, such as ghost cancellation10, 11 and channel equalization12. Hence, an efﬁcient VLSI architecture for a high-speed programmable FIR ﬁlter is needed. ∗ Corresponding author. Tel.: 09736494921.

E-mail address: amita− [email protected]

1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015) doi:10.1016/j.procs.2015.06.070

606

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611

Fig. 1.

Block diagram of conventional parallel FIR ﬁltering.

In this work we are addressing optimizations in the multiplier block of FIR ﬁlters by changes in architectural level improvements using distributed arithmetic concept of digital ﬁltering which improves device utilization. Distributed Arithmetic (DA) is a high speed multiplication technique used for implementation of digital ﬁlters and signal up conversions21, 22. The DA is bit serial word parallel approach where throughput rate does not depend on ﬁlter length or data size. This paper is organized as follows. Section 2 brieﬂy describes background study. We present a distributed arithmetic based ﬁltering scheme in section 3. In section 4 we present our proposed FIR ﬁlter design ﬂow. Section 5 shows the synthesis results obtained by XILINX. Section 6 summarizes the conclusions and presents proposals for future work. 2. Background Study 2.1 Parallel FIR ﬁlters A FIR ﬁlter can be mathematically expressed by the equation (1)13, 26. y[n] =

N−1

h[i ]x[n − i ]

(1)

i=0

where X represents the input signal, H the ﬁlter coefﬁcients, Y the output signal, Y [n] is the current output sample, and N is the number of taps of the ﬁlter. In the sequential implementation a set of Multiply-And-Accumulate (MAC) operations is performed for each sample of the input data signal, multiplying the N delayed input samples by coefﬁcients and summing up the results together to generate the output signal. In parallel implementations, we can have two main architectures. The ﬁrst one consists of unrolling of MAC loop where we have several delayed versions of the input signal entering in a fully parallel multiplier block, followed by a summation block. The other one consists of a multiplier block, which takes the same input signal and delivers each output to an input of a delayed summation block. Fig. 1 shows the basic block diagram of conventional parallel FIR ﬁltering. Several techniques for optimizing the multiplier block of parallel FIR ﬁlters were proposed in the literature. Most of them consider the use the ﬁxed-point representation and transposed form implementation, because it is easier to obtain common sub expressions to be shared along two or more multipliers in this form14, 15. Many consider the use of some kind of Signed Digit (SD) representation, mainly the Canonical Signed Digit (CSD) representation16, 17, which results in fewer non-zero digits in each coefﬁcient, usually resulting in a smaller multiplier block. Previous research has been shown reductions of more than 50%16 in the number of adders by using these techniques. The great advantage of these techniques is that the optimized ﬁlter has the same behavior of the original non-optimized one (i.e. same impulse response or transfer function). Another approach consists of representing each coefﬁcient as a sum of power-of-two terms and limiting the number of power-of-two terms in each coefﬁcient17–20 . An FIR digital ﬁlter of degree N is described by the impulse response, h n , n = 0, 1, 2, . . . , N and its transfer function is represented as follows: N H (z) = h n z −n (2) n=0

607

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611

Fig. 2. Concept of existing distributed arithmetic based ﬁltering.

the traditional L-parallel FIR ﬁlter can be derived using polyphase decomposition3 as, L−1

Y p (z L )z − p =

p=0

L−1

X q (z L )z −q

q=0

L−1

Hr (z L )z −r

(3)

r=0

A two-parallel FIR ﬁlter can be expressed as Y0 + z −1 H1 = (H0 + z −1 H1 )(X 0 + z −1 X 1 ) = H0 X 0 + z −1 (H0 X 1 + H1 X 0 ) + z −2 H1 X 1

(4) (5)

Equation 4 and equation 5 implies that, Y0 = H0 X 0 + z −2 H1 X 1

(6)

Y1 = H0 X 1 + H1 X 0

(7)

2.2 Distributed arithmetic based ﬁltering scheme Distributed Arithmetic was ﬁrst brought up by Croisier23 , and was extended to cover the signed data system. Then it was introduced into FPGA design to save MAC blocks with the development of FPGA technology. High performance FIR ﬁlter based on DA using LMS architecture is implemented in24, 25. If h[n] is the ﬁlter coefﬁcient and x[n] is the input sequence to be processed, the N-length FIR ﬁlter can be described as ﬁnal form of distributed arithmetic as, y = −2 B x B [n]h[n] +

B−1 b=0

2b

N−1

h[n]x b [n]

(8)

N=0

Figure 2 shows the existing DA based ﬁltering scheme27. 3. Proposed Work The basic LUT-DA scheme on an FPGA would consist of three main components: the input registers, the 4-input LUT unit and the shifter/accumulator unit. Additionally, it would require a control unit to manipulate the ﬁlter operation, and an adder tree unit to perform addition on partial ﬁlter results. Applying this approach the 4-input LUT unit will not be directly accessed instead 2-input LUT is used based on multiplexer select. The concept of multiplexer based DA ﬁltering scheme is shown in Fig. 4. The proposed DA based ﬁlter architecture is shown in Fig. 4 for 256-tap parallel FIR ﬁlter. This architecture uses the concept of multiplexer based DA ﬁltering scheme shown in Fig. 3. The particular 2-input LUT is selected which represent all the possible sum combinations of ﬁlter coefﬁcients. It implies about 50% reduction in the number of LUT used with increased speed. There are two main aspects to be considered when designing a parallel ﬁlter, namely the number of bits required for the signal and the required transfer function of the ﬁlter. The former one determines the word length of the entire data path. The later one is determined by two parameters, namely the number of taps, and the number of bits in each coefﬁcient. The multipliers are the most expensive blocks in terms of area, delay,

608

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611

Fig. 3. Multiplexer based DA ﬁltering scheme.

Fig. 4.

Proposed parallel digital FIR ﬁlter architecture.

and power in a FIR ﬁlter when considering custom implementation. In fact, even for a dedicated implementation of constant-coefﬁcient multipliers, the amount of hardware needed is very high as we have several multipliers in the entire ﬁlter. To evaluate the performance of the proposed scheme, 4-tap, 8-tap, 16-tap, 32-tap, 64-tap, 128-tap and 256-tap parallel FIR ﬁlters are implemented using VHDL and synthesis is carried out in XILINX-ISE8.1i. The results are compared with the conventional parallel FIR ﬁltering scheme and existing DA based implementation. 4. Results and Discussion The simulation has been done using MODEL SIM 6.4 and XILINX Integrated Software Environment (ISE) is used for performing synthesis and implementation of designs using ‘Spartan-3’ device. The power analysis has been

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611

Fig. 5. Comparison of number of LUTs used.

Fig. 7.

Comparison of number of gates used.

Fig. 6.

Comparison of number of occupied slices.

Fig. 8.

Comparison of delay.

done using XILINX XPOWER tool. The evaluation of device utilization using proposed DA architecture can be comprehended easily with the help of the results in graphs shown below. Figure 5 reports the comparison of number of LUTs used among the various ﬁlter architectures designed using existing and proposed method. It is shown that proposed multiplexer based DA ﬁlter comprehends the existing DA based parallel FIR ﬁlter. The number of LUTs are reduced by 66% using proposed method as compared to DA based method for parallel FIR ﬁlter implementation. This reduction is 19% when compared with conventional ﬁlter implementation. Figure 6 plots the comparison of number of occupied slices among various parallel FIR ﬁlter architectures. It is shown that the proposed multiplexer based DA ﬁlter has 42% reduced number of slices as compared to existing DA based ﬁlter architecture. This reduction is 20% when compared to conventional architecture. Figure 7 reports the comparison of number of gates used among various ﬁlter architectures designed. It is shown that the proposed multiplexer based DA ﬁlter has 55% reduced number of slices as compared to existing DA based ﬁlter architecture. This reduction is 14% when compared to conventional architecture. Figure 8 reports the delay comparison among various ﬁlter architectures designed. It is shown that the proposed multiplexer based DA ﬁlter is 10% faster as compared to existing DA based ﬁlter architecture. This improvement is 5% when compared to conventional architecture.

609

610

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611

Fig. 9. Comparison of power consumption.

Fig. 10.

Comparison of Power Delay Product (PDP).

Figure 9 reports the power consumption comparison among various ﬁlter architectures designed. It is shown that the proposed multiplexer based DA ﬁlter has 55% reduced power consumption as compared to existing DA based ﬁlter architecture. This reduction is 27% when compared to conventional architecture. The overall performance metric (i.e. power, delay product) for proposed design is also improved as shown in Fig. 10. 5. Conclusion We have presented an efﬁcient multiplexer based DA scheme which is used to implement parallel FIR ﬁlters. The device utilization of the proposed architecture is relatively less since it uses split LUT technique. Proposed method is implemented for 4-tap to 256-tap parallel FIR ﬁlters and can be even extended for more taps. A high speed and less area implementation is achieved. The test results indicate that the designed ﬁlter using proposed distributed arithmetic can work stable with high speed and can save almost 50 percent hardware resources. For various digital signal processing applications the various parameters and order of ﬁlter can be changed accordingly. References [1] D. A. Parker and K. K. Parhi, Low-Area/Power Parallel FIR Digital Filter Implementations, Journal of VLSI Signal Process. Syst., vol. 17(1), pp. 75–92, (1997). [2] J. G. Chung and K. K. Parhi, Frequency-Spectrum-Based Low-Area Low-Power Parallel FIR Filter Design, EURASIP J. Appl. Signal Process., vol. 9, pp. 444–453, (2002). [3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, (New York, Wiley), (1999). [4] Z. J. Mou and P. Duhamel, Short-Length FIR Filters and their use in Fast Nonrecursive Filtering, IEEE Trans. Signal Process., vol. 39(6), pp. 1322–1332, (1991). [5] C. Cheng and K. K. Parhi, Hardware Efﬁcient Fast Parallel FIR Filter Structures based on Iterated Short Convolution, IEEE Trans. Circuits Syst., 51(8), pp. 1492–1500, (2004). [6] J. I. Acha, Computational Structures for Fast Implementation of L-Path and L-Block Digital Filters, IEEE Trans. Circuits Syst., 36(6), pp. 805–812, (1989). [7] I. S. Lin and S. K. Mitra, Overlapped Block Digital Filtering, IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 43, pp. 586–596, (1996). [8] C. Cheng and K. K. Parhi, Further Complexity Reduction of Parallel FIR Filters, Proc. IEEE Int. Symp. Circuits Syst., (Kobe, Japan), pp. 1835–1838, (2005). [9] R. C. Agarwal and J. W. Cooley, New Algorithms for Digital Convolution, IEEE Trans. Acoust. Speech, Signal Process., vol. 25(5), pp. 392–410, (1977). [10] B. Edwards, A. Corry, N. Weste and C. Greenberg, A Single Chip Ghost Canceller, IEEE 1992 Custom Integrated Circuits Conference, pp. 26.5.1–26.5.14, (1992).

Amita Nandal et al. / Procedia Computer Science 54 (2015) 605 – 611 [11] J. R. Choi, L. H. Jang, S. W. Jung and J. H. Choi, Structured Design of a 288-tap FIR Filter by Optimized Partial Product Tree Compression, IEEE J. Solid-State Circuits, 32(3), pp. 468–476, (1997). [12] D. J. Pearson, S. K. Reynolds, A. C. Megdanis, S. Gowda, K. R.Wrenner, M. Immediato, R. L. Galbraith and H. J. Shin, Digital FIR Filters for High Speed PRML Disk Read Channels, IEEE J. Solid-State Circuits, vol. 30(12), pp. 1517–1523, (1995). [13] R. W. Hamming, Digital Filters (Prentice Hall, 3rd edition, 1989). [14] M. Potkonjak, M. B. Srivastava and A. Chandrakasan, Efﬁcient Substitution of Multiple Constant Multiplication by Shifts and Addition using Iterative Pairwise Matching. Proc. 31st ACM/IEEE Design Antomation Conf., pp. 189–194, (1994). [15] M. Mehendale, S. D. Sherlekar and G. Venkatesh, Synthesis of Multiplier-less FIR Filters with Minimum Number of Additions, Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 668-671, (1995). [16] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde and D. Iuraekova, A New Algorithm for Elimination of Common Subexpressions, IEEE Trans. Computer-Aided Design, vol. 18, pp. 58–68 (1999). [17] H. Samueli, An Improved Search Algorithm for the Design of Multiplier-Less FIR Filters with Powers-of-Two Coefﬁcients, IEE Trans. Circuits Syst., vol. 36, pp. 1044–1047, (1989). [18] K. H. Chen and T. D. Chiueh, Design and Implementation of a Reconﬁgurable FIR Filter, Proc of 2003 Int. Symp. Circuits Systems, pp. 25–28, (2003). [19] C. Lim, J. B. Evans and B. Liu, Decomposition of Binary Integers into Signed Power-of-Two Terms, IEEE Trans. Circuits Syst., vol. 38, pp. 667–672 (1991). [20] J. Portela, E. Costa and J. Monteiro, Optimal Combination of Number of Taps and Coefﬁcient Bit-Width for Low Power FIR Filter Realization, IEEE European Conference on Circuit Theory and Design, pp. 145–148, (2003). [21] B. New, A Distributed Arithmetic Approach to Designing Scalable DSP Chips, EDN. (1995). [22] W. P. Burleson and L. L. Scharf, A VLSI Design Method for Distributed Arithmetic, VLSI Sig. Proc, vol. 2, (1991). [23] Croisier, D. J. Esteban, M. E. Levilion and V. Rizo, Digital Filter for PCM Encoded Signals, U.S. Patent, no. 3,777,130, (1973). [24] B. K. Mohanty and P. K. Meher, A High-Performance Energy-Efﬁcient Architecture for FIR Adaptive Filter based on New Distributed Arithmetic Formulation of Block LMS Algorithm, IEEE Transactions on Signal Processing, vol. 61(4), pp. 921–930, (2013). [25] E. Trakultritrung and E. Thanangchusin, Distributed Arithmetic LMS Adaptive Filter Implementation without Look-up Table, 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 1–4, (2012). [26] Yu-Chi Tsao and Ken Choi, Area-Efﬁcient Parallel FIR Digital Filter Structures for Symmetric Convolutions based on Fast FIR Algorithm, IEEE Transactions on VLSI Systems, vol. 20(2), pp. 366–371, (2012). [27] T. Vigneswaran and P. Subbaramani Reddy, Design of Digital FIR Filter based on Dynamic Distributed Arithmetic Algorithm, Indian Journal of Applied Sciences, vol. 7(9), pp. 2908–2910, (2007).

611

An Efficient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture

An Efficient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture

Recommend Documents