Novel digital filter implementations using hybrid RNS-binary arithmetic

Signal Processing 40 (1994) 287-294

M.K. Ibrahim

Department of Electrical and Electronic Engineering, University of Nottingham, University Park, Nottingham NG7 2RD, UK

Received 22 July 1993; revised 20 December 1993 and 18 March 1994

Abstract

The implementation of an FIR filter using a new hybrid RNS-binary arithmetic is presented for the first time. In the new arithmetic, the data samples are represented using RNS, and hence the carry-free advantage of RNS computations is retained. However, the computation performed for each modulo is implemented using conventional binary arithmetic elements, which overcomes the drawback of ROM-based RNS arithmetic elements that become inefficient for large moduli. The conventional binary arithmetic elements are also faster and require less area than existing memoryless RNS arithmetic elements. It is shown that, for large moduli, the filter structures based on the new arithmetic perform better than those based on either conventional binary or conventional RNS arithmetic.

Keywords: Digital filter implementation; VLSI architectures; RNS

0165-1684/94/$7.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0165-1684(94)00082-B


1. Introduction

The implementation of digital filters and transform processors using the Residue Number System (RNS) has received significant research interest, especially for the high-speed implementation of many signal processing applications [4]. Filter designs based on RNS have also been reported that clearly show advantages over structures based on conventional binary arithmetic [6]: the filter designed using RNS arithmetic used about 150 000 gates, compared to more than two million gates using conventional binary arithmetic [6].

The main merit of RNS arithmetic is that the complete wordlength computation is divided into computations on independent moduli, which are represented with a smaller number of bits. Since the computation on each modulo is independent of the others, RNS arithmetic is free from the carry propagation across a large wordlength that occurs in conventional arithmetic. Another advantage of RNS is that it is suitable for the design of fault-tolerant signal processors due to its inherent fault detection capability [5].

In applications where high accuracy is needed, or in operations that require a large wordlength, large moduli are required. For large moduli, modulo multipliers are the major bottleneck in the implementation of RNS-based digital signal processors in terms of cost and/or speed [3]. Although many ROM-based multipliers [9] have been proposed, the major drawback of using ROM is that it becomes inefficient for large moduli [3, 4]. Another common approach has been to design modulo-specific RNS multipliers. However, the major drawback of this approach for VLSI implementation is that the resulting structure is not modular, since the path of each modulo cannot be realised using a single hardware unit. The only universal modulo multiplier in the literature for large moduli that is not based on ROM is the one proposed in [3]. This structure requires two $n$-bit multipliers, one $2n$-bit multiplier and two $(n + 1)$-bit adders, where $n$ is the number of bits required to represent a modulo.

In this paper, a new architecture is proposed which allows the exploitation of both the carry-free advantage of RNS and the use of efficient binary arithmetic elements.

2. Vector inner product implementation using hybrid RNS-binary arithmetic

In the new arithmetic the numbers are coded using RNS. However, the processing of each modulo is realised using conventional binary arithmetic elements. To illustrate the concept of the new arithmetic, we will consider the case of the vector inner product. Many digital signal processing operations, including filtering and transforms, can be formulated as a vector inner product operation. The vector inner product is given by

$$Y = X_0 A_0 + X_1 A_1 + \cdots + X_K A_K, \qquad (1)$$

where $A_k$ and $X_k$ are the $k$th elements of the two vectors to be multiplied. Using modulo notation, where $\langle c \rangle_m$ denotes the residue of $c$ modulo $m$, and from the properties of modulo arithmetic, it is known that $\langle c \cdot d \rangle_m = \langle \langle c \rangle_m \langle d \rangle_m \rangle_m$, and hence

$$\langle Y \rangle_m = \langle \langle X_0 \rangle_m \langle A_0 \rangle_m + \cdots + \langle X_K \rangle_m \langle A_K \rangle_m \rangle_m.$$

It is worth noting at this stage that in the conventional RNS arithmetic, the vector inner product is calculated recursively according to [10]

$$\langle Y_k \rangle_m = \langle \langle Y_{k-1} \rangle_m + \langle \langle X_k \rangle_m \langle A_k \rangle_m \rangle_m \rangle_m, \qquad (2)$$

where $\langle Y_K \rangle_m = \langle Y \rangle_m$. Now, in order to see how the new arithmetic is used to calculate the inner product, recall from the definition of residue [10] that a number and its residue differ by an integer multiple of $m$. We denote $Y^h_{m,k}$ to be given as

$$Y^h_{m,k} = Y^h_{m,k-1} - C_k m + \langle X_k \rangle_m \langle A_k \rangle_m, \qquad (3)$$

where $C_k$ is an integer. Eq. (3) can also be used to calculate $\langle Y \rangle_m$ as follows. Let $Y^h$ be given by $Y^h_{m,K}$, viz.

$$Y^h = -Cm + \langle X_0 \rangle_m \langle A_0 \rangle_m + \langle X_1 \rangle_m \langle A_1 \rangle_m + \cdots + \langle X_K \rangle_m \langle A_K \rangle_m,$$

where $C = C_0 + \cdots + C_K$. Since $\langle Cm \rangle_m = 0$, it can be easily shown that $\langle Y^h \rangle_m = \langle Y \rangle_m$.

The new arithmetic proposed here is based on Eq. (3). The fundamental difference between the new arithmetic and RNS is that in the new arithmetic, the multiplication and accumulation of $\langle X_k \rangle_m \langle A_k \rangle_m$ is performed using conventional arithmetic elements, while in the case of RNS arithmetic this is performed using modulo arithmetic elements. Furthermore, the possible wordlength growth in the new arithmetic is avoided by using the correcting terms $C_k m$ at each stage of the process, as will be illustrated for the case of systolic digital filters in the next section.
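The relationship between Eqs. (2) and (3) can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the function names, the sample data, the modulus and the choice of correction multipliers are all illustrative assumptions. It only demonstrates that subtracting any integer multiple of $m$ at each step leaves the final residue unchanged.

```python
def rns_inner_product(xs, coeffs, m):
    """Conventional RNS recursion, Eq. (2): reduce modulo m after every step."""
    y = 0
    for x, a in zip(xs, coeffs):
        y = (y + ((x % m) * (a % m)) % m) % m
    return y


def hybrid_inner_product(xs, coeffs, m, corrections):
    """Hybrid recursion, Eq. (3): ordinary integer multiply-accumulate with an
    integer multiple C_k * m subtracted at each step (no modulo reduction)."""
    y_h = 0
    for x, a, c_k in zip(xs, coeffs, corrections):
        y_h = y_h - c_k * m + (x % m) * (a % m)
    return y_h


# Illustrative data only (assumed values, not from the paper).
xs = [17, 250, 93, 4]
coeffs = [31, 7, 128, 66]
m = 13
corrections = [0, 1, 0, 3]   # arbitrary integers C_k

y_exact = sum(x * a for x, a in zip(xs, coeffs))
y_h = hybrid_inner_product(xs, coeffs, m, corrections)

# A single final reduction of the hybrid result recovers the residue of Y.
print(y_h % m == y_exact % m == rns_inner_product(xs, coeffs, m))
```

In hardware the $C_k$ are not arbitrary: the next section chooses them so that the partial sum never outgrows a fixed wordlength.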

3. Architecture of systolic FIR filter using hybrid RNS-binary arithmetic

The implementation of a systolic FIR filter using RNS number representation is shown in Fig. 1. It consists of a binary-to-RNS converter, a filter structure for each modulo, and an RNS-to-binary converter. The architecture used to implement the systolic FIR filter for each modulo based on the hybrid arithmetic is shown in Fig. 2. The basic cell, which is illustrated in Fig. 3, consists of an $n$-bit multiplier, a $(2n + 1)$-bit adder, a $2n$-bit adder, and a $2n$-bit 2:1 multiplexer, where $m$ is less than or equal to $2^n$. Each cell basically performs the operations given in Eq. (3). The function of the multiplexer in the $k$th cell is simply to choose the correcting term, $C_k m$, in Eq. (3) to ensure that there is no wordlength growth. This is proved as follows.

Fig. 1. Filter implementation using RNS number representation.

We will show that if the input of the $k$th cell, which is the output of the $(k-1)$th cell (i.e. $Y^h_{m,k-1}$), is $2n$ bits wide, the output of the $k$th cell, $Y^h_{m,k}$, will also be $2n$ bits wide. We then generalise the result by induction. Let $z$ represent the output of the first adder, which is the result of adding $Y^h_{m,k-1}$ to the output of the multiplier. The value of $z$ falls into one of two cases:

Case 1. If the output carry of the first adder is zero, $z$ is less than $2^{2n}$ and as a result no correction term is needed. In this case the multiplexer output will be zero, and the output of the second adder, and hence of the cell, is $2n$ bits wide.

Case 2. If the output carry of the first adder is one, $z$ is greater than $(2^{2n} - 1)$. Since the maximum value of a residue digit with modulo $m$ is $(m - 1)$, the maximum value of the output of the multiplier is $(m^2 - 2m + 1)$. Now since $Y^h_{m,k-1}$ is assumed to be $2n$ bits wide, the maximum value it can have is $(2^{2n} - 1)$. Therefore, the output of the first adder, $z$, in this case will be in the range $[2^{2n}, 2^{2n} + m^2 - 2m]$. In this case, the multiplexer output is selected to be $-m^2$ (i.e. $C_k = m$), and hence the output of the second adder, $Y^h_{m,k}$, which is also the output of the cell, will be in the range $[2^{2n} - m^2, 2^{2n} - 2m]$. Since $m$ is less than or equal to $2^n$, $Y^h_{m,k}$ can be represented by $2n$ bits.

Since the input to the first cell is zero and hence can be represented in $2n$ bits, the above two cases can be generalised for all $k$ by induction. Finally, since the output of the modulo FIR filter must be $n$ bits wide, the output of the final cell of the proposed structure, which is $2n$ bits wide, is fed to a binary-to-residue converter to obtain the final result, i.e. $\langle Y \rangle_m$.

Fig. 2. Systolic FIR filter implementation using the hybrid RNS-binary arithmetic.

Fig. 3. The hybrid RNS-binary basic cell architecture.

Fig. 4. Modified hybrid RNS-binary basic cell architecture.

It is worth pointing out that if the correction terms are not used, the wordlength of the accumulating partial sum will grow, and the final size will depend on the filter order as well as the number of bits needed to represent the output of the multiplier. This implies that adders with larger wordlengths are needed, especially for high-order filters. In the next section, a new cell is presented where it is shown that the time overhead of introducing the correction terms will be that of a single carry save adder.
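The two-case argument for the basic cell of Fig. 3 can be exercised with a small behavioural model. This is a sketch under stated assumptions rather than a description of the actual hardware: the function names, the example modulus $m = 251$ with $n = 8$, and the number of cells are hypothetical, and the worst-case residue $m - 1$ is fed to every tap to stress the bound.

```python
def basic_cell(y_prev, x_res, a_res, m, n):
    """Behavioural model of one multiply-add cell: Eq. (3) with C_k in {0, m}."""
    z = y_prev + x_res * a_res            # first adder: up to (2n + 1) bits
    carry = z >> (2 * n)                  # carry-out of the first adder (0 or 1)
    correction = -m * m if carry else 0   # multiplexer output: -m^2 or 0
    return z + correction                 # second adder: result fits in 2n bits


def run_chain(x_residues, a_residues, m, n):
    """Feed a zero input through a chain of cells, checking the 2n-bit bound."""
    y = 0
    for x_res, a_res in zip(x_residues, a_residues):
        y = basic_cell(y, x_res, a_res, m, n)
        assert 0 <= y < (1 << (2 * n)), "wordlength grew beyond 2n bits"
    return y


# Assumed example values (not from the paper).
n = 8                         # bits per modulus
m = 251                       # modulus, m <= 2**n
taps = 100                    # number of cells in the chain
x_residues = [m - 1] * taps   # worst case: largest residue on every tap
a_residues = [m - 1] * taps

y_final = run_chain(x_residues, a_residues, m, n)
expected = sum(x * a for x, a in zip(x_residues, a_residues)) % m
print(y_final % m == expected)
```

Because every correction is a multiple of $m$, the final binary-to-residue conversion of `y_final` gives the same residue as an unbounded accumulation would.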

4. Basic cell with improved performance

A more efficient implementation of the basic cell is shown in Fig. 4. In this cell, the second adder is replaced with a carry save adder, which is faster and requires less area. Also, rather than adding the correction term as a second stage in the same cell, in the new cell this is performed within the first stage of the next cell. This does not change the functionality of the basic structure. In the new architecture, the carry save adder and the adder within the dashed box are equivalent to adding three numbers, namely the output of the multiplier, the correction term, and the accumulating partial sum from the previous cell. With an argument similar to that in Section 3, we can show that in the modified cell architecture, the output will always be $(2n + 1)$ bits wide.

We will show that if the input of the $k$th cell passed from the previous cell, which we will denote $Y'_{m,k-1}$, is $(2n + 1)$ bits wide, the output of the $k$th cell will also be $(2n + 1)$ bits wide. We then generalise the result by induction. The value of $Y'_{m,k-1}$ falls into one of two cases depending on its most significant bit (m.s.b.):

Case 1. If the m.s.b. is 0, $Y'_{m,k-1}$ is effectively a $2n$-bit number. In this case the correction term is set to zero. Since the output of the multiplier will also be a $2n$-bit number, the output of the cell is the sum of two $2n$-bit numbers, and hence it can be represented by $(2n + 1)$ bits.

Case 2. If the m.s.b. of $Y'_{m,k-1}$ is 1, the value of $Y'_{m,k-1}$ is in the range $[2^{2n}, 2^{2n+1} - 1]$. In this case the correction term is set to $-m^2$. Since the maximum value of a residue digit with modulo $m$ is $(m - 1)$, the maximum value of the output of the multiplier is $(m^2 - 2m + 1)$. Therefore the output of the cell will be in the range $[2^{2n} - m^2, 2^{2n+1} - 2m]$. Since $m$ is less than or equal to $2^n$, the output of the cell can be represented by $(2n + 1)$ bits.

Since the input to the first cell is zero and hence can be represented in $(2n + 1)$ bits, the above two cases can be generalised for all $k$ by induction.
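The modified cell can be modelled in the same behavioural style. In this sketch (hypothetical names and example values, not the paper's circuit), the correction is derived from the m.s.b. of the incoming partial sum and added together with the product, mirroring the three-operand addition described for Fig. 4. The example deliberately uses the boundary case $m = 2^n$.

```python
def modified_cell(y_prev, x_res, a_res, m, n):
    """Behavioural model of the modified cell: the correction for the previous
    partial sum is applied in this cell, selected by its most significant bit."""
    msb = (y_prev >> (2 * n)) & 1          # bit 2n of the (2n + 1)-bit input
    correction = -m * m if msb else 0      # deferred correction term
    return y_prev + correction + x_res * a_res   # three-operand addition


def run_modified_chain(x_residues, a_residues, m, n):
    """Feed a zero input through the chain, checking the (2n + 1)-bit bound."""
    y = 0
    for x_res, a_res in zip(x_residues, a_residues):
        y = modified_cell(y, x_res, a_res, m, n)
        assert 0 <= y < (1 << (2 * n + 1)), "wordlength grew beyond 2n + 1 bits"
    return y


# Assumed example values (not from the paper): boundary case m = 2**n.
n = 4
m = 16
taps = 50
x_residues = [m - 1] * taps
a_residues = [m - 1] * taps

y_out = run_modified_chain(x_residues, a_residues, m, n)
expected_residue = sum(x * a for x, a in zip(x_residues, a_residues)) % m
print(y_out % m == expected_residue)
```

The output of the last cell may still carry an uncorrected m.s.b.; since any outstanding correction is a multiple of $m$, the final residue conversion is unaffected.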

5. Performance evaluation

Let $A_m(n)$ = area of an $n$-bit multiplier, $A_a(n)$ = area of an $n$-bit adder, $A_{csa}(n)$ = area of an $n$-bit carry save adder, $A_l$ = area of a 1-bit latch, $A_{mux}(n)$ = area of an $n$-bit 2:1 multiplexer, $T_m(n)$ = time of an $n$-bit multiplier, $T_a(n)$ = time of an $n$-bit adder, $T_{csa}$ = time of a carry save adder, $T_l$ = delay of one latch, and $T_{mux}$ = delay of a 2:1 multiplexer.

5.1. New structure

From Fig. 4, the time, $T_{n,c}$, of the multiply-add cell used in the new structure for modulo $m$ is

$$T_{n,c} = T_m(n) + T_{csa} + T_a(2n) + T_l. \qquad (4)$$

Now, in order to complete the computation, the output of the new structure, which is $2n$ bits wide, must be converted to its modulo representation as shown in Fig. 2. In this paper, the pipelined universal converter reported in [7] is adopted. This converter has a cycle time of $T_a(n)$, which is less than the propagation delay of the cell. Therefore, the throughput of the structure is given by the throughput of the inner product cell.

Clearly, the delay of a single latch and the carry save adder is negligible compared to the delay of the multiplier and the adder. Furthermore, the $n$-bit multiplier in Fig. 4 can be implemented using carry save arithmetic [8]. One of the most efficient implementations of multipliers using carry save arithmetic is the one based on the array multiplier. An $n$-bit array multiplier consists of $n^2$ full adders, and when used as a carry save multiplier, it has a multiplication time of $nT_{FA}$, where $T_{FA}$ is the propagation delay of a full adder [1]. Although the final adder in Fig. 4 must be implemented using fast adders, since the carry-out bit is needed for the specification of the subsequent correction term, this is not a major drawback: fast carry look-ahead adders can be used whose addition time, which is of the order $O(\log n)$ [11], is much less than that of the multiplier for large $n$ (i.e. for large moduli). Hence, from Eq. (4), the cycle time of the new structure will be dominated by the multiplier, and will be approximately given by

$$T_{n,c} = nT_{FA}. \qquad (5)$$
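To see how the multiplier comes to dominate Eq. (4), the cell time can be evaluated under a simple delay model. The sketch below is illustrative only. The multiplier delay $nT_{FA}$ follows the text; a carry save adder is taken as one full-adder delay; the fast adder is assumed to cost $\lceil \log_2 (2n) \rceil$ unit gate delays, which is one possible reading of the $O(\log n)$ figure; and the unit delay values themselves are arbitrary assumptions.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()


def cell_time(n, t_fa=1.0, t_gate=1.0, t_latch=1.0):
    """Eq. (4) under assumed unit delays (all defaults are hypothetical)."""
    t_mult = n * t_fa                    # carry save array multiplier: n * T_FA
    t_csa = t_fa                         # carry save adder: one full-adder delay
    t_add = ceil_log2(2 * n) * t_gate    # assumed O(log n) fast 2n-bit adder
    return t_mult + t_csa + t_add + t_latch


def multiplier_share(n, t_fa=1.0, t_gate=1.0, t_latch=1.0):
    """Fraction of the cell time spent in the multiplier."""
    return n * t_fa / cell_time(n, t_fa, t_gate, t_latch)


for bits in (8, 16, 32, 64):
    print(bits, cell_time(bits), round(multiplier_share(bits), 3))
```

Under these assumed delays the multiplier's share of the cycle grows with $n$, which is the behaviour the approximation in Eq. (5) relies on.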

The cost of the new multiply-add cell, $A_{n,c}$, is given by

$$A_{n,c} = A_m(n) + A_a(2n) + A_{csa}(2n) + A_{mux}(2n) + 4nA_l + (n + 1)A_l. \qquad (6)$$

Four $n$-bit latches and one $(n + 1)$-bit latch are needed to hold the values of $\langle A \rangle_m$, $Y'_{m,k}$ and $-m^2$. Note that the last two values are $(2n + 1)$ bits and $2n$ bits wide, respectively. As indicated earlier, the universal converter proposed in [7] is used to convert the $2n$-bit output of the new structure to its modulo representation. The cost of this converter is given by

$$2nA_a(n) + (4n^2 - n)A_l. \qquad (7)$$

Now, it is known that the cost of a $2n$-bit multiplexer is twice the cost of an $n$-bit multiplexer, i.e. $A_{mux}(2n) = 2A_{mux}(n)$. Also, in this paper a $2n$-bit adder is assumed to be implemented using a cascade of two $n$-bit adders, i.e. $A_a(2n) = 2A_a(n)$. Also, for large $n$, $A_a(n + 1)$ is approximately the same as $A_a(n)$, and $(n + 1)A_l$ is approximately the same as $nA_l$. Therefore, from Eqs. (6) and (7), the total cost per one modulo, $A_{n,m}$, is

$$A_{n,m} = K[A_m(n) + 2A_a(n) + A_{csa}(2n) + 2A_{mux}(n) + 5nA_l] + 2nA_a(n) + (4n^2 - n)A_l, \qquad (8)$$

where $K$ is the filter order.

5.2. Comparison with the structure based on the multiplier in [3]

The systolic FIR filter based on the conventional RNS arithmetic is shown in Fig. 5. The corresponding RNS multiply-add cell used to implement Eq. (2) is illustrated in Fig. 6 and consists of an $n$-bit modulo multiplier, two $n$-bit registers and an $n$-bit modulo adder. The $n$-bit modulo multiplier proposed in [3] for large moduli consists of two $n$-bit multipliers, one $2n$-bit multiplier, two $(n + 1)$-bit adders, one $2n$-bit register, one $(n + 1)$-bit register, and a 2:1 $n$-bit multiplexer. A conventional $n$-bit modulo adder consists of two $n$-bit adders and a 2:1 $n$-bit multiplexer [4]. The register needed to store the value of the modulo in the adder is not needed, because this value is already stored in the modulo multiplier used.

Fig. 5. Systolic FIR filter implementation using conventional RNS arithmetic.

Fig. 6. The RNS basic cell architecture.

Therefore, the time of the RNS multiply-add cell, $T_{RNS}$, is given by

$$T_{RNS} = 2T_m(n) + T_m(2n) + 3T_a(n) + T_{mux} + T_l,$$

where $3T_a(n)$ is required because the two adders used in modulo addition operate in parallel, i.e. the time of modulo addition is one $T_a(n)$. Since the multiplier delay will be the dominating factor, following similar arguments as for the new structure, the time of the RNS multiply-add cell can be approximated as

$$T_{RNS} = 4nT_{FA}. \qquad (9)$$

From Eqs. (5) and (9), it is clear that the new structure is four times faster than that based on the multiplier in [3]. Considering the comparison of the area, the total cost per one modulo of the RNS structure, $A_{RNS,m}$, is

$$A_{RNS,m} = K[2A_m(n) + A_m(2n) + 4A_a(n) + 2A_{mux}(n) + 5nA_l]. \qquad (10)$$

From Eqs. (8) and (10), the new structure is clearly better when

$$K[A_m(2n) + A_m(n)] > 2nA_a(n) + (4n^2 - n)A_l. \qquad (11)$$

This condition also implies that the new structure becomes even more efficient as the filter order increases.
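Eqs. (8), (10) and (11) can be compared numerically once unit areas are fixed. The sketch below is a rough model under loudly stated assumptions: $A_m(n) = n^2$ full adders follows the array multiplier described in the text, while the adder area $n \lceil \log_2 n \rceil$, the one-full-adder-per-bit carry save adder, and the unit latch and multiplexer costs are hypothetical choices made only so the expressions can be evaluated. All function names are illustrative.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()


# Assumed unit-area models, measured in full-adder areas (hypothetical).
def a_mult(n):
    return float(n * n)               # array multiplier: n^2 full adders

def a_add(n):
    return float(n * ceil_log2(n))    # assumed O(n log n) fast adder

def a_csa(n):
    return float(n)                   # one full adder per bit

def a_mux(n, a_mux_bit=1.0):
    return n * a_mux_bit              # assumed cost per multiplexer bit


def area_hybrid(K, n, a_latch=1.0):
    """Eq. (8): K hybrid cells plus one residue converter per modulo."""
    cell = a_mult(n) + 2 * a_add(n) + a_csa(2 * n) + 2 * a_mux(n) + 5 * n * a_latch
    converter = 2 * n * a_add(n) + (4 * n * n - n) * a_latch
    return K * cell + converter


def area_rns(K, n, a_latch=1.0):
    """Eq. (10): K conventional RNS multiply-add cells."""
    cell = 2 * a_mult(n) + a_mult(2 * n) + 4 * a_add(n) + 2 * a_mux(n) + 5 * n * a_latch
    return K * cell


def condition_11(K, n, a_latch=1.0):
    """Eq. (11): the paper's condition for the hybrid structure to be cheaper."""
    return K * (a_mult(2 * n) + a_mult(n)) > 2 * n * a_add(n) + (4 * n * n - n) * a_latch


for K in (1, 3, 16, 64):
    print(K, area_hybrid(K, 16), area_rns(K, 16), condition_11(K, 16))
```

With these assumed unit areas and $n = 16$, the fixed converter cost outweighs the per-cell saving for a single tap, and the hybrid structure becomes cheaper from a filter order of a few taps upwards, in line with the remark that the advantage grows with $K$.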

5.3. Comparison with structures based on conventional binary arithmetic

The cell architecture based on the conventional arithmetic which is used in implementing a systolic FIR filter is shown in Fig. 7. Here, $N$ represents the data wordlength and $2N + \log K$ is the wordlength of the final output. The cell computation time of this structure is $T_{BIN} = T_m(N) + T_a(2N + \log K) + T_l$. In this case, both multiplication and addition are assumed to be implemented using carry save arithmetic [8]. Since the throughput of the cell will be dominated by the multiplier speed, and using a carry save array multiplier [1], the delay time of the cell can be approximately given by

$$T_{BIN} \simeq NT_{FA}. \qquad (12)$$

Fig. 7. The basic cell architecture based on conventional binary arithmetic.

From Eqs. (5) and (12), it is clear that the improvement in speed of the new structure over the conventional approach is of the order $O(N/n)$. The speed improvement increases as the ratio $N/n$ increases. It is worth noting that the process of converting from binary to RNS and vice versa shown in Fig. 1 for the new structure has the throughput of an $n$-bit adder when using the universal converter reported in [7], and hence the conversion process does not affect the cycle time of the new structure, since its delay will be much less than that of the multiplier used in the basic cell.

With regard to the area comparison, since we are concerned more with large moduli, the multiplier area will be the dominating factor in the total area calculation of the basic cell. Therefore, using Eq. (8), the total area of the structure based on the new hybrid arithmetic, including the cost of conversion from and to binary, can be simplified to the following:

$$A_H = LKA_m(n) + L(2nA_a(n) + (4n^2 - n)A_l) + A_{B\text{-}RNS}(N, L) + A_{RNS\text{-}B}(N + \log K, L), \qquad (13)$$

where $L$ is the number of moduli, and $A_{B\text{-}RNS}(N, L)$ and $A_{RNS\text{-}B}(N + \log K, L)$ are the areas required for converting from binary to RNS and vice versa, respectively. In what follows we will assume that fast adders with an area of the order $O(n \log n)$ will be used for the addition of two $n$-bit numbers [11], and that an $n$-bit multiplication is implemented using carry save arithmetic with an area of $A_m(n) = n^2 A_{FA}$ [8], where $A_{FA}$ is the area of a full adder. It can be shown that when $A_{B\text{-}RNS}(N, L)$ and $A_{RNS\text{-}B}(N + \log K, L)$ are implemented using the universal converter in [7], asymptotically they are of the order $O(LNn \log n)$. Finally, it can be shown that the second term in Eq. (13) is asymptotically of the order $O(Ln^2 \log n)$. Therefore, the area of the new structure is approximately given by

$$A_H = LKn^2 A_{FA} + O(Ln^2 \log n) + O(LNn \log n).$$

Considering high filter orders, and keeping higher order terms only, the total area of the new structure can be approximated as

$$A_H = LKn^2 A_{FA}. \qquad (14)$$

In the case of the conventional binary implementation, the area of the multiplier will be the dominating factor when using carry save arithmetic. Hence, the area of the structure based on the conventional binary arithmetic, $A_{BIN}$, can be approximated by the area of the multipliers, viz.

$$A_{BIN} = KN^2 A_{FA}. \qquad (15)$$

From Eqs. (14) and (15), we have

$$\frac{A_{BIN}}{A_H} = \frac{N^2}{Ln^2}.$$

Therefore, the new architecture can result in less area, but the amount of reduction depends on the choice of $n$ and $L$.
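The approximate figures in Eqs. (5), (9), (12) and the ratio $N^2 / (Ln^2)$ are simple enough to tabulate. The following sketch uses hypothetical function names and example parameters. In particular, the number of moduli $L$ is an assumed input: the values below were picked only so that $Ln$ roughly covers an output wordlength of about $2N + \log_2 K$ bits for $K = 16$, which is a sizing assumption and not a result from the paper.

```python
from fractions import Fraction


def cycle_times(N, n, t_fa=1):
    """Approximate cell cycle times in units of T_FA: Eqs. (5), (9) and (12)."""
    return {"hybrid": n * t_fa, "rns": 4 * n * t_fa, "binary": N * t_fa}


def area_ratio_binary_over_hybrid(N, L, n):
    """A_BIN / A_H = N^2 / (L n^2), from Eqs. (14) and (15)."""
    return Fraction(N * N, L * n * n)


# Assumed example: 32-bit data, nine 8-bit moduli.
N, n, L = 32, 8, 9
times = cycle_times(N, n)
ratio = area_ratio_binary_over_hybrid(N, L, n)
print(times, ratio)

# A second assumed example where the moduli are wide relative to the data:
# here the ratio drops below one, i.e. the hybrid structure is larger.
print(area_ratio_binary_over_hybrid(16, 5, 8))
```

The second example illustrates the closing remark of this section: whether area is saved depends on how $n$ and $L$ are chosen relative to $N$.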

6. Conclusion

In this paper, the implementation of a digital FIR filter using a new hybrid RNS-binary arithmetic is introduced. It is based on representing numbers with the RNS numbering system and using conventional binary arithmetic elements in performing the arithmetic on each modulo. It is shown that the new architecture results in a better performance with respect to cost and time than those based on the conventional RNS arithmetic for large moduli. It is also shown that the throughput of the new structures is superior to that based on the conventional binary arithmetic, as expected.

References

[1] A. Aggoun, A. Ashur and M.K. Ibrahim, "Bit-level pipelined digit-serial multiplier", Internat. J. Electronics, Vol. 75, No. 6, 1993, pp. 1209-1219.
[2] G. Alia and E. Martinelli, "VLSI binary-residue converters for pipelined processing", The Comput. J., Vol. 33, No. 5, 1990, pp. 473-474.
[3] G. Alia and E. Martinelli, "A VLSI modulo m multiplier", IEEE Trans. Comput., Vol. 40, No. 7, July 1991, pp. 873-878.
[4] K. Elleithy and M. Bayoumi, "A O(1) algorithm for modulo addition", IEEE Trans. Circuits and Systems, Vol. 37, No. 5, May 1990, pp. 628-631.
[5] W.K. Jenkins and E.J. Altman, "Self checking properties of residue number error checkers based on mixed radix conversion", IEEE Trans. Circuits and Systems, Vol. 35, No. 2, February 1988, pp. 159-167.
[6] M. Kameyama et al., "Highly parallel residue arithmetic chip based on multiple-valued current mode logic", IEEE J. Solid-State Circuits, Vol. 24, No. 5, October 1989, pp. 1404-1411.
[7] S.J. Meehan et al., "An universal input and output RNS converter", IEEE Trans. Circuits and Systems, Vol. 37, No. 6, June 1990, pp. 799-803.
[8] T. Noll, "Carry-save arithmetic for high speed digital signal processing", Proc. IEEE Internat. Symp. on Circuits and Systems, ISCAS 1990, pp. 982-986.
[9] D. Radhakrishnan and Y. Yuan, "Novel approaches to the design of VLSI RNS multipliers", IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 39, No. 1, January 1992, pp. 52-57.
[10] N. Szabo and R. Tanaka, Residue Arithmetic and Its Application to Computer Technology, McGraw-Hill, New York, 1957.
[11] C.L. Wey and T.Y. Chang, "Design and Analysis of VLSI-based parallel multipliers", IEE Proc., Pt. E, Vol. 137, No. 4, July 1990, pp. 328-336.