Fast Fourier Transform Processors Using Gaussian Residue Arithmetic*

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 2, 219-237 (1985)

ALVIN M. DESPAIN
Computer Sciences Division, University of California, Berkeley, California 94720

ALLEN M. PETERSON
Department of Electrical Engineering, Stanford University, Stanford, California 94305

OSCAR S. ROTHAUS
Department of Mathematics, Cornell University, Ithaca, New York 14850

ERLING H. WOLD
Computer Sciences Division, University of California, Berkeley, California 94720

Residue arithmetic using moduli which are Gaussian primes is suggested as a method for computing the fast Fourier transform (FFT). For complex operations, a substantial savings in hardware and time over residue arithmetic using real moduli can be obtained using complex moduli. Gaussian residue arithmetic is discussed and methods for conversion into and out of the complex residue representation are developed. This representation lends itself to table-driven computation, allowing very low latency designs to be developed. A 64-point Cooley-Tukey style processor and a 60-point Rader-Winograd style processor using this technique are described and compared. Hardware savings are realized by approximating the rotations necessary to perform the FFT by small complex integers and by scaling the results of the computation at an intermediate point. It is shown that further hardware reductions can be made by developing custom integrated circuits at the expense of latency. © 1985 Academic Press, Inc.

1. INTRODUCTION

The fast Fourier transform (FFT) is widely employed in many signal processing tasks. As a result, there have been many approaches to designing special-purpose, high-performance FFT processors. The most critical design

*This work was supported in part by MITRE Corporation and the Defense Advanced Research Projects Agency under Grants N00039-82-C-0235 and N00039-K-0251.


problem of such processors is typically the attainment of high processing bandwidth. The propagation of carry signals in binary adders works against the speed at which data can be processed. The purpose of this paper is to examine a new approach to the design of FFT processors which addresses the carry propagation problem. In addition, the techniques examined here are interesting from the viewpoint of VLSI implementation. This is due to the "table-driven" design approach and the simple partitioning that can be employed.

It has long been known that residue arithmetic can speed up arithmetic computations by avoiding long carry propagation times [1]. For this reason, it has often been suggested for general-purpose computers, although the difficulty of doing common operations such as test for sign and divide renders it unattractive in some settings. However, none of the FFT algorithms explicitly requires these operations, and they are thus possible targets for the use of residues. The algorithms which are especially interesting are the Cooley-Tukey algorithm [2], the Good prime factor algorithm [3], the Rader algorithm for prime-length transforms [4], and the Winograd technique of centralizing the multiplications of the prime factor algorithm [5, 6].^1

In the first section of this paper, we quickly review the basic concepts of real residue arithmetic and introduce complex residue arithmetic. We will then show its application in a few processors whose structures reflect the limitations of this style of computation. Since we have been especially interested in the construction of high-performance pipeline FFT processors for radar and image processing applications, where the data are applied to the processor in a serial fashion at rates up to 200 MHz, the processors which we have chosen as examples are of this form. However, the usefulness of complex residue arithmetic is not limited to this style of processor.

2. RESIDUE ARITHMETIC

2.1. Real Integer Moduli

Given an integer u and r relatively prime moduli m_1, . . . , m_r, we can represent u by (u_1, . . . , u_r), where u_j = (u)_{m_j} ((.)_m is the residue mod m). The Chinese Remainder Theorem guarantees that, if u is always in the range c <= u <= (c - 1) + m_1 m_2 . . . m_r, where c is a fixed constant, then the set of residues mod m_j uniquely defines u and is uniquely defined by u. The convenience of the representation comes from the fact that

(u + v)_{m_j} = (u_j + v_j)_{m_j},    (uv)_{m_j} = (u_j v_j)_{m_j}.    (1)

^1 Number theoretic transforms (NTTs) [7] have some similarities to residue arithmetic versions of the FFT, since both work in a finite field, but are actually distinct. First, NTTs do not directly compute spectra, only convolutions. To compute a spectrum with an NTT, you would need to convert the Fourier transform to convolutions (by the Rader-Winograd techniques) and use the NTT to compute these convolutions. Second, NTTs, given exact input data, can compute exact convolutions. However, since input data are never exact, the answers are actually only approximate. Therefore, nothing is lost by approximating the traditional Fourier transform coefficients appropriately and using these as integers in a finite field computation of the FFT as is done here. Of course, the FFT has the convolution property as do the NTTs.

Thus, multiplication or addition of large integers can be performed by several small adders and multipliers working independently and in parallel without the need for carries to be propagated between them. This provides a natural "bit-slice" architectural approach. ROMs are often suggested for use in performing operations in residue arithmetic since the number of bits needed to represent the inputs and outputs in all operations is fixed. For real operations, r ROMs are required to perform a multiply or add operation. For complex operations, 2r ROMs are required for a complex add and 6r ROMs for a complex multiply using real moduli. The disadvantages of the residue representation come from the difficulty of doing division, which requires communication between the various residues [8]. In fact, it is usually simplest to convert out of the representation to do the division, which can be expensive. Even though the FFT algorithms do not explicitly require division, the use of residue arithmetic requires that all the calculations be exact integer calculations. Since the number of bits needed to represent the results of such calculations increases very quickly, one would like to be able to scale intermediate results. Scaling requires division, except that one usually can choose scaling constants which do not require a general-purpose divider to implement. We will make use of this in our example processors.
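To make the componentwise arithmetic concrete, here is a small Python sketch (ours, not from the paper; the moduli 5 and 13 are purely illustrative). Each residue "digit" is processed independently, and the Chinese Remainder Theorem recovers the integer at the end:

```python
MODULI = [5, 13]  # pairwise relatively prime moduli

def to_residues(u):
    # the residue representation (u_1, ..., u_r)
    return tuple(u % m for m in MODULI)

def add(x, y):
    # each digit is added independently -- no carries between moduli
    return tuple((a + b) % m for a, b, m in zip(x, y, MODULI))

def mul(x, y):
    return tuple((a * b) % m for a, b, m in zip(x, y, MODULI))

def from_residues(x):
    # Chinese Remainder Theorem reconstruction, range 0 .. 5*13 - 1
    M = 1
    for m in MODULI:
        M *= m
    u = 0
    for a, m in zip(x, MODULI):
        Mi = M // m
        u += a * Mi * pow(Mi, -1, m)  # pow(.., -1, m): modular inverse
    return u % M

print(from_residues(mul(to_residues(7), to_residues(8))))  # 56
```

In hardware each tuple position corresponds to one small adder, multiplier, or ROM; no information crosses between positions until conversion out of the representation.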

2.2. Gaussian Integer Moduli

For operations on complex numbers, it is possible to save both computation time and hardware through the use of complex moduli. This has been previously suggested for use in digital filter computations by Leung [9] and by Soderstrand and Poe [10]. The properties of the set of complex numbers where the real and imaginary parts are integers were first studied by Gauss [11]. As with the familiar Archimedean integers, every Gaussian integer can be uniquely factored into Gaussian primes up to the units ±1, ±j. The Gaussian primes can be described as follows:

1. All integer primes of the form p = 4k + 3 are among the Gaussian primes.
2. Since all integer primes of the form p = 4k + 1 can be expressed as a sum of two squares, they may be factored into two distinct Gaussian primes as given by p = a^2 + b^2 = (a + bj)(a - bj), where j = sqrt(-1). Note that (a + bj) and (-b + aj) are not considered distinct primes, since they only differ by a factor of j, which is a unit element of the Gaussian integers. (This is analogous to the treatment of -1 with regard to the Archimedean primes.)
3. The prime number 2 can be factored as (1 + j)(1 - j), but only one of these factors is distinct.

Table I shows the Gaussian primes < 2^7 which are derived from integer primes of the form p = 4k + 1.

TABLE I
GAUSSIAN PRIMES LESS THAN 2^7

Archimedean primes    Corresponding Gaussian primes
5                     2 + j
13                    3 + 2j
17                    4 + j
29                    5 + 2j
37                    6 + j
41                    5 + 4j
53                    7 + 2j
61                    6 + 5j
73                    8 + 3j
89                    8 + 5j
97                    9 + 4j
101                   10 + j
109                   10 + 3j
113                   7 + 8j

The conversion into residue representation is performed in the following manner. If, given a complex integer input sample (m + nj), we define

x^+ = (m + nj)_{(a+bj)},    (2)

then

x^+(a - bj) = ((m + nj)(a - bj))_{(a+bj)(a-bj)},    (3)

x^+ a - x^+ bj = ((am + bn) + (an - bm)j)_p,    (4)

where p = a^2 + b^2. Since p is prime, the set of residues mod p forms a field, and thus we can find a multiplicative inverse of a, which we denote by a^{-1}, where (a a^{-1})_p = 1. Multiplying both sides of Eq. (4) by a^{-1} gives us

x^+ - x^+ a^{-1}b j = ((m + a^{-1}bn) + (n - a^{-1}bm)j)_p,    (5)

which is satisfied by

x^+ = (m + a^{-1}bn)_p.    (6)
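The decompositions p = a^2 + b^2 underlying Table I can be recovered by a direct search; the following Python sketch (ours, purely illustrative) reproduces the table entries:

```python
def gaussian_factor(p):
    # split a prime p = 4k + 1 into a^2 + b^2, giving the Gaussian
    # prime factors (a + bj)(a - bj); brute-force search over a
    assert p % 4 == 1
    for a in range(1, int(p ** 0.5) + 1):
        b2 = p - a * a
        b = int(b2 ** 0.5)
        if b * b == b2:
            return (max(a, b), min(a, b))
    raise ValueError("not a sum of two squares")

for p in [5, 13, 17, 29, 37, 41, 53, 61]:
    a, b = gaussian_factor(p)
    print(f"{p} = ({a} + {b}j)({a} - {b}j)")
```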


Thus, x^+ is a real integer. For a particular choice of p, (a^{-1}b)_p is unique, so we designate it as K_p. For the other Gaussian prime factor of p, we can similarly show that

x^- = (m - a^{-1}bn)_p = (m - K_p n)_p.    (7)

The two real integers x^+ and x^- uniquely define the complex number m + nj as long as c_m <= m <= c_m + p - 1 and c_n <= n <= c_n + p - 1, where c_m and c_n are fixed constants. Typically we take c_m = c_n = -(p - 1)/2, so that we can represent a fixed square of complex integers centered about the origin. As with the real case, the use of several Gaussian primes as moduli will extend the range of representable Gaussian integers. Also, the conversion can again be accomplished by ROMs which take m and n as inputs and bring out x^+ and x^- as results. Alternatively, the constant K_p can be precomputed and actual multipliers and adders can be used. As an example, the residue (1 + 2j)_{(3+2j)} is computed as follows. Setting a = 3, b = 2, and p = 13, we find a^{-1} = 9, since (3 × 9)_13 = (27)_13 = 1, which gives us K_13 = (9 × 2)_13 = 5. Applying (6) and (7) results in x^+ = (1 + 5 × 2)_13 = 11 and x^- = (1 - 5 × 2)_13 = 4.

Complex addition and multiplication mod Gaussian primes are simple operations. Let r denote the number of Archimedean primes used in the residue representation. Complex addition requires the use of 2r ROMs with complex or real moduli, but complex multiplication requires 6r ROMs with real moduli and only 2r ROMs with complex moduli. As an example, we choose moduli (2 + j), (2 - j), (3 + 2j), and (3 - 2j), which are derived from the Archimedean primes 5 and 13. To add the two complex integers (1 + 2j) and (-2 + j), we first convert them to the residue representation by looking them up in Table II. The first two residues are added modulo 5, and the second two are added modulo 13. The final answer is again found by looking up in Table II.

(1 + 2j) + (-2 + j) -> ((2, 0, 11, 4) + (1, 0, 3, 6)) = (3, 0, 1, 10) = (-1 + 3j)

Note that, in the addition of the two residue numbers, (11 + 3)_13 = 1. Multiplication is handled similarly:

(1 + j)(1 + j) -> ((4, 3, 6, 9) × (4, 3, 6, 9)) = (1, 4, 10, 3) = (0 + 2j).
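The conversion of Eqs. (6) and (7) amounts to a few modular operations. A Python sketch (ours; the function name is hypothetical) reproduces the worked example above:

```python
def gaussian_residues(m, n, a, b):
    # x+ and x- of Eqs. (6)-(7) for the moduli (a + bj), (a - bj)
    p = a * a + b * b
    Kp = (pow(a, -1, p) * b) % p      # K_p = (a^{-1} b)_p
    return (m + Kp * n) % p, (m - Kp * n) % p

# (1 + 2j) mod (3 +/- 2j): p = 13, a^{-1} = 9, K_13 = 5
print(gaussian_residues(1, 2, 3, 2))  # (11, 4)
```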

TABLE II
EXAMPLE OF RESIDUE REPRESENTATION OF GAUSSIAN INTEGERS

Gaussian residue representation

Gaussian integers     p = 5                     p = 13
m + nj                (2 + j)      (2 - j)      (3 + 2j)     (3 - 2j)
                      (m + 3n)_5   (m + 2n)_5   (m + 5n)_13  (m + 8n)_13

0 + 0j                0            0            0            0
0 + 1j                3            2            5            8
0 + 2j                1            4            10           3
0 - 2j                4            1            3            10
0 - 1j                2            3            8            5
1 + 0j                1            1            1            1
1 + 1j                4            3            6            9
1 + 2j                2            0            11           4
-2 + 0j               3            3            11           11
-2 + 1j               1            0            3            6
-2 + 2j               4            2            8            1
-1 + 0j               4            4            12           12
-1 + 1j               2            1            4            7
-1 + 2j               0            3            9            2
-1 + 3j               3            0            1            10

Assume we wish to have the magnitude squared of a Gaussian integer:

(1 + 2j)(1 - 2j) -> ((2, 0, 11, 4) × (0, 2, 4, 11)) = (0, 0, 5, 5) = (5 + 0j).

Since we know a priori that the imaginary part is zero, we need do only one of the modular multiplications for each modulus, as the other will always yield the same result.

To convert from the complex residue format to complex binary, we first separate the real and imaginary parts. Then, residue to binary conversion as explained in [1] is employed. Given the 2r "digit" residue representation

(x_1^+, x_1^-, x_2^+, x_2^-, . . . , x_r^+, x_r^-),

we form

y_k' = (x_k^+ + x_k^-)_{p_k}.


From (6) above, this is (for a given kth "digit")

y_k' = ((m + a^{-1}bn)_{p_k} + (m - a^{-1}bn)_{p_k})_{p_k} = (2m)_{p_k}.

Now, let

y_k = ((2^{-1})_{p_k} (2m)_{p_k})_{p_k} = (m)_{p_k},

which is the residue of the real part mod p_k. Similarly,

z_k' = (x_k^+ - x_k^-)_{p_k} = ((m + a^{-1}bn)_{p_k} - (m - a^{-1}bn)_{p_k})_{p_k} = (2a^{-1}bn)_{p_k}.

Now, let

z_k = ((2^{-1}ab^{-1})_{p_k} (2a^{-1}bn)_{p_k})_{p_k} = (n)_{p_k},

which is the residue of the imaginary part mod p_k.

The next step is to convert from the residue representation to a mixed radix representation as reported in Knuth [1]. We first must find "reciprocal" constants c_ij such that

(c_ij m_i)_{m_j} = 1,    1 <= i < j <= r,    (8)

where m_i is the ith real modulus. That is, c_ij is the multiplicative inverse of m_i mod m_j. Then, we perform the following:

u_2 <- ((u_2 - u_1)c_12)_{m_2}
u_3 <- (((u_3 - u_1)c_13 - u_2)c_23)_{m_3}
. . .
u_r <- ((. . . ((u_r - u_1)c_1r - u_2)c_2r - . . . - u_{r-1})c_{(r-1)r})_{m_r}.    (9)

The desired result is now given by

u = u_r m_{r-1} . . . m_2 m_1 + . . . + u_3 m_2 m_1 + u_2 m_1 + u_1.    (10)
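Putting the pieces together, the conversion out of the representation — recovering the real and imaginary residues y_k and z_k and then applying the mixed radix conversion above — can be sketched in Python (ours, illustrative; moduli 5 and 13 with K_p values 3 and 5):

```python
MODULI = [5, 13]   # p_k for the prime pairs derived from 5 and 13

def real_imag_residues(xp, xm, p, Kp):
    # y_k = (2^{-1}(x+ + x-))_p = (m)_p; z_k uses K_p^{-1} = (a b^{-1})_p
    inv2 = pow(2, -1, p)
    y = (inv2 * (xp + xm)) % p
    z = (inv2 * pow(Kp, -1, p) * (xp - xm)) % p
    return y, z

def mixed_radix(res):
    # Eq. (9): mixed radix digits, then Eq. (10) to rebuild the integer
    digits = []
    for u, mj in zip(res, MODULI):
        for d, mi in zip(digits, MODULI):
            u = ((u - d) * pow(mi, -1, mj)) % mj   # c_ij = m_i^{-1} mod m_j
        digits.append(u % mj)
    total, scale = 0, 1
    for d, m in zip(digits, MODULI):
        total += d * scale
        scale *= m
    return total

# (1 + 2j) has residues (2, 0) for p = 5 and (11, 4) for p = 13
y5, z5 = real_imag_residues(2, 0, 5, 3)
y13, z13 = real_imag_residues(11, 4, 13, 5)
print(mixed_radix([y5, y13]), mixed_radix([z5, z13]))  # 1 2
```

As the text notes, residues centered around zero should be used when the complex integers are centered around the origin; this sketch keeps everything in the range [0, p).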


It should be noted that, if one is using the residue representation to handle complex numbers which are contained in a fixed square centered around the origin, one should work with residues which are centered around zero. In this case, the mixed radix conversion algorithm above will again result in complex integers which are centered around the origin.

3. RESIDUE PROCESSOR EXAMPLES

3.1. Cooley-Tukey Radix-8 Processor

3.1.1. ROM Based Version

Our first example is a Cooley-Tukey style FFT processor of transform size N = 64 and an input precision of 8 bits which uses a radix-8 cascade organization. We have constructed the circuit out of shift registers for data rearrangement and 64K-bit ROMs with internal pipeline registers. The shift registers could be constructed from RAMs with control provided by ROMs programmed as counters.

In Fig. 1, the processor is seen to consist of the necessary conversion circuits separated by eight pipelines which perform the FFT calculations modulo the Gaussian integer with which they are labeled. The intermediate results are scaled to reduce the dynamic range required to represent the result of the computation. Figure 2 shows the binary to residue converter in more detail. It requires six 64K × 1 ROMs to derive each residue from the 16 bits of input. The residue to binary converter is displayed in Fig. 3, where the c_ij and m_j are the same as in (9). Note that the residue reductions have been distributed over the calculations in (9) to allow the use of ROMs.

The base 8 pipelines, shown in Fig. 4, are implemented internally as a base 4 and a base 2 DFT (discrete Fourier transform) with a complex multiplier (rotator) in between. The multiplies can be considered to be rotations in the complex plane since the twiddle factors are of the form e^{-(2πk/N)j}. This representation makes it easy to factor the twiddle factor rotations into rotations which are simpler to implement in hardware. The base 4 DFT requires a

FIG. 1. Sixty-four-point Gaussian residue Cooley-Tukey FFT processor.

FIG. 2. Binary to residue conversion (residues mod (6 ± 5j), (7 ± 2j), (5 ± 4j), (6 ± j); 48 64K × 1 PROMs, 30 4K × 6 PROMs, and 6 adders required).

FIG. 3. Residue to binary conversion.

FIG. 4. FFT pipeline.

rotation by π/2, which is a trivial operation. The twiddle factor multiply between the two base 8 pipelines applies the 64th roots of unity as required by the Cooley-Tukey algorithm. The blocks marked "BF" are expanded in Fig. 5, and are capable of performing a 2-point DFT on the data stream. Note that the shift register for stage m is of length N2^{-m-1} - 1. This allows the 0th stage to perform DFTs on the data elements which are separated by N/2, the 1st stage on elements separated by N/4, and so on. It should also be pointed out that the use of a table-driven implementation makes the control of the processor very simple. The control PROM shown in Fig. 4 is just a counter, and is only required once in the processor. The ADD/PASS signals can all be derived from the counter output, since they all have periods which are powers of two times the clock period. The ROMs which apply the twiddle factors have two inputs, the data stream and the count from the control ROM. The actual twiddle factors applied at each count are encoded in the multiplier ROMs themselves.

Since all of the computations in the residue representation must be done as integer operations, we must carry enough bits of precision so that we can represent our answers exactly. Note that this may require many more bits than are necessary to represent the one-bit increase in real precision due to the additions at each stage. At the output, only 8 + log2 64 = 14 bits would be meaningful. Since the twiddle factors would require infinite precision to represent exactly, they must be rounded off to whatever precision is required at the stage at which they are applied. For example, the first rotator comes after the first add in the FFT and thus the coefficients must be correct to an accuracy of 9 bits. A straightforward implementation would require 9 extra bits of dynamic range at this point to handle the product. The total extra bits

FIG. 5. Butterfly for stage m (3 8K × 6 PROMs and 1 shift register required).


for rotations is 9 + 11 + 12 = 32, bringing the total to 14 + 32 = 46 bits. This is inefficient and costly.

To reduce the number of bits required, we make use of the methods in [12] to find small integer approximations to the rotations. By biasing all of the rotations required to compute the base 8 DFT by -π/8, we can reduce the hardware for the rotation between the base 4 and base 2 DFT to a π, π/2, and ±π/8 rotator [13]. Since the π and π/2 rotations do not cause any increase in the dynamic range requirements, we will examine only the ±π/8 rotation. To minimize the number of bits which need to be carried through the computation, we need to find complex integers of the form x + yj whose arguments approximate the angles we wish to rotate by to the accuracy desired, and for which x and y are small. Multiplication by such an x + yj will perform the desired rotation and will introduce a known, correctable gain. To rotate by the same angle in the opposite direction, we multiply by x - yj. We will use the same accuracy criterion as that used by Despain in [12]. Note that the gains introduced by these rotations can be corrected by scaling the twiddle factors which are applied between the two base 8 modules if desired. A computer program was written to search for these integers and produced the results shown in Table III. From the table, we choose 5 + 2j for the first ±π/8 rotator and 29 + 12j for the second. These correspond to gains of 5.385 and 31.39, respectively.

For the central rotation between the base 8 sections, we must find small integer approximations to the rotations nπ/32, n = 0, 1, . . . , 8, with the added constraint that the gain introduced by all of these rotations must be a constant to within 11-bit accuracy. Again, a computer program was written to search for integers with these characteristics. The results are shown in Table IV. The product of the gains for the various rotations is equal to 6.81 × 10^4, which corresponds to a 16-bit gain. Thus, a dynamic range of 8 + log2 64 + 16 = 30 bits would be necessary to implement this FFT in complex residue arithmetic. We need the primes to be less than 2^6 due to the ROM size we have chosen. From Table I, this would require use of the primes 61, 53, 41, 37, 29, and 17, allowing a dynamic range of 31.2 bits.

TABLE III
SMALL COMPLEX INTEGER APPROXIMATIONS TO π/8 ROTATION

x     y     Accuracy (bits)
2     1     6.4
5     2     9.0
12    5     11.5
29    12    14.0
70    29    16.6

TABLE IV
ELEVEN-BIT APPROXIMATIONS TO nπ/32 WITH GAIN OF 403

n    x      y
0    403    0
1    401    40
2    395    79
3    386    116
4    372    155
5    356    184
6    335    224
7    312    255
8    285    285
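A brute-force version of the search that produced Table III might look like the following (our sketch; it minimizes plain angular error, which is not necessarily the exact accuracy criterion of [12]):

```python
import math

def best_rotator(target, max_coord):
    # smallest angular error among complex integers x + yj with
    # 1 <= x <= max_coord and 0 <= y <= max_coord
    return min(
        ((x, y) for x in range(1, max_coord + 1)
         for y in range(max_coord + 1)),
        key=lambda xy: abs(math.atan2(xy[1], xy[0]) - target),
    )

for bound in (5, 12, 29):
    x, y = best_rotator(math.pi / 8, bound)
    print(f"{x} + {y}j  (gain {math.hypot(x, y):.3f})")
```

With bounds 5, 12, and 29 this recovers the Table III entries 5 + 2j, 12 + 5j, and 29 + 12j; the gain of each rotator is simply its magnitude.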

A processor of this design would require approximately 500 chips (where we are counting shift registers as a ROM and a RAM). However, one can reduce the number of primes to four by scaling the intermediate results after the central multiply is performed using the circuit shown in Fig. 6. This circuit first separates out the real and imaginary parts of the residues and converts these to a mixed radix representation by the technique of the last section. The boxes marked c_ij, m_j are the same as those in the residue to binary circuit. The real part of the mixed radix representation is given by

u = v_1 + v_2 m_1 + v_3 m_1 m_2 + v_4 m_1 m_2 m_3,

where we have denoted the ith digit of the mixed radix representation by v_i.

FIG. 6. Scaling circuit (34 4K × 6 PROMs required).


If we scale u by m_1 m_2 and truncate, the result is

u_s = v_3 + v_4 m_3.

Thus, we only need to use v_3, v_4, and the corresponding mixed radix digits of the imaginary part to get back to the residue representation. Note that we have factored out some common subexpressions to save ROMs in the circuit which converts the mixed radix representation back to the residue representation. Also, we have distributed the calculation involving c_34 and m_4 over the first part of the mixed radix to residue conversion so that we can eliminate two ROMs. By including the scaling circuit at this point, we now need to have only 22 bits of dynamic range available. This can be achieved by using the primes 61, 53, 41, and 37.

The final design requires approximately 350 chips to realize a 64-point FFT processor. The throughput of this design will be determined by the maximum clock rate of the PROM chips. The latency (time from first input until first output) is 85 clocks. It should be noted that the previous version without the intermediate scaling unit has a latency of 80 clocks. The small difference between the two versions reflects the fact that most of the latency is in the data rearrangement necessary to compute the transform in a pipeline fashion.

A software simulator of the hardware was written to experimentally determine the nature of the errors due to the various approximations above. Table V lists the average and worst-case magnitude errors of the simulator output compared to an exact FFT for various input signals. The output of the simulator was corrected in phase and magnitude as discussed earlier. For comparison, Table VI has the output magnitude errors for simulators of two more standard fixed point FFT processors averaged over a set of uniform [-127, +127] white inputs. Both of these processors are radix-8 and locate the rotators at the same place in the computation as the residue processor above. The fixed precision simulator keeps the number of bits for the representation of intermediate results and multiplier coefficients constant throughout the computation and uses the standard one-half bit per stage scaling. This scaling program is known to avoid overflow for a Gaussian white input signal [14]. The varying precision simulator increases the number of bits by one bit per stage, ending with the precision given in the first column. Varying the precision in this way avoids overflow for all possible input signals. Both simulators use normal rounding to reduce the number of bits where needed after multiplies and scaling operations. By comparing the last entry in Table V with Table VI, it is seen that the residue processor has an error similar to a 10.5 bit fixed precision processor or a varying precision processor with an output precision of 14.5 bits.

TABLE V
OUTPUT MAGNITUDE ERROR OF EXAMPLE PROCESSOR

Input                                                Average    Worst case
Delta function, amplitude = 127                      0.080      0.080
Constant function = 127                              0.047      3.005
Rectangular wave from -127 to 127, duty cycle 0.1    0.677      2.074
Uniform white noise from -63 to 63                   0.669      2.768
Uniform white noise from -127 to 127                 1.918      10.653

TABLE VI
OUTPUT MAGNITUDE ERROR OF FIXED POINT PROCESSORS

Output precision    Fixed precision         Varying precision
(bits)              Average     Worst       Average     Worst
16                  —           —           0.705       2.222
15                  0.095       0.251       1.380       3.972
14                  0.184       0.485       2.987       9.012
13                  0.350       0.856       7.840       21.845
12                  0.628       1.715       20.076      47.077
11                  1.411       3.744       50.462      121.931
10                  2.810       8.118       92.148      228.314
9                   5.494       13.499      179.954     430.523
8                   12.102      27.156      312.952     687.638
7                   27.195      66.771      —           —
6                   59.250      123.214     —           —

3.1.2. Custom VLSI Version

To use fewer chips, one could build parts which were capable of replacing several ROMs in the above design. A 64K-bit ROM can only perform one 6-bit complex residue multiplication or add, whereas we can put several 6-bit nonresidue adders and multipliers on one gate array or MOS chip. We know that the residue butterfly and coefficient multiply can be performed as in the ROM based version above,

where a residue reduction is done after every operation, and is combined with the operation inside the ROM. Note that the A's, B's, and W's are actually the residues of those quantities mod m_i. This is equivalent to

A_0 = ((B_0 + B_1)W_0)_{m_i}
A_1 = ((B_0 - B_1)W_1)_{m_i},

which allows the use of nonresidue adders and multipliers and a separate residue reduction step. Note that all the arithmetic must be exact integral arithmetic. Thus, if the number of bits of precision required for each residue is given by M, the number of bits at the output of the adder is M + 1, and the number of bits at the output of the multiplier is 2M + 1. The residue reduction step is implemented by dividing the number to be reduced by m_i and keeping only the remainder. A simple algorithm for this is to first subtract m_i 2^{M+1}, choose between the original number and the output of the subtractor based on whether or not overflow occurred, then subtract m_i 2^M, and so on, until M + 1 subtractions and tests have been performed. Since the reduction operation is separate from the addition and multiply, the same hardware can be used with any modulus less than 2^M. A block diagram of a circuit which performs one stage of a radix-2 FFT is shown in Fig. 7.

FIG. 7. Custom version of residue butterfly.

Special-purpose chips could also be developed for the conversion into and out of the residue representation and for scaling. Such an approach would use silicon area much more efficiently and would require fewer packages (although with more pins) than the approach taken in the last section. However, the disadvantage of this technique is the requirement that a number of distinct chips be designed. Also, a table-driven approach allows arbitrarily complex 2-input, 1-output functions to be computed in one clock cycle, resulting in very low latency designs. As an example, the table-driven butterfly implementation has a latency of 2 plus the shift register length, whereas the version above has a latency of 13 plus the shift register length.

3.2. PROM Based Rader-Winograd Processor

A DFT whose length is a prime p can be expressed in terms of a circular correlation of size p - 1 using a technique developed by Rader in [4].
For primes where p - 1 is highly composite, the (p - 1)-point correlation can be computed efficiently by the use of any of the FFT algorithms. This reduces the computation of the p-point DFT to two (p - 1)-point FFTs, two additions, and p - 1 central multiplies. This form is developed for DFTs of lengths 3, 5, 7, 13, and 17 in [13]. It is especially useful to put prime-length DFTs into this form when one is combining them to build up larger DFTs through the use of the prime factor algorithm. Winograd [5, 6] showed how


all of the central multiplies for all of the factors of a large DFT could be combined into one set of central multiplies. The algorithm which results can be used to build pipeline FFT processors which require only one full complex multiplier in the pipeline, instead of a multiplier or rotator after every stage of a Good or Cooley-Tukey pipeline.

In the derivation of the Cooley-Tukey style pipeline of the last section, it can be seen that the number of bits required to represent the intermediate results goes up very quickly with the number of multipliers in the pipeline. Therefore, the Winograd centralized multiplier form would seem to be natural for this type of implementation. For N = 60, we can use a Rader-Winograd style pipeline for a factor of 3 × 5 = 15, and follow that with a base 4 module. Since 15 and 4 are relatively prime, no multiplies are required between the two sections of the pipe. Also, the sections of the base 3 and 5 DFTs which surround the central multiplies do not require multiplies themselves.

An analysis of the dynamic range required for this calculation shows that we can use the same four moduli we use for the Cooley-Tukey pipeline as long as we again include a scaling circuit. The processor is of the same general form as before, except that a permutation circuit is required at the beginning. This permutation circuit would typically be made up of a RAM which stores the serial data as they arrive and an address ROM which is used to read the data out in the correct sequence. The permutation which is required internal to the base 5 module as developed in [13] can be included in the initial permutation. The actual circuitry in the pipeline is shown in Fig. 8. The blocks marked "BF" are the same as those in the Cooley-Tukey pipeline

FIG. 8. Sixty-point Gaussian residue Rader-Winograd FFT processor.

FIG. 9. Half butterfly circuit, B_{1/2}.

except for the shift register lengths which are listed on the blocks. These lengths are not difficult to implement if the shift registers are made from a ROM and a RAM. Since the ADD/PASS signal periods are not powers of two times the clock period as they were before, we need to have some control PROMs which decode the binary count to produce the control signals. The blocks marked "B_{1/2}" are expanded in Fig. 9.

In this example, 48 chips are required to implement one of the pipelines, ignoring the scaling and residue/binary conversion circuits. This compares to 29 chips required for one pipeline of the Cooley-Tukey processor. The total number of chips required for the 60-point processor is approximately 500. The reason that this design takes more chips than the last is that the reduction in multiplies afforded by the Winograd technique is offset by the increase in the total number of operations required. Note that multipliers are no more expensive than adders in this implementation. Also, the slightly higher irregularity of the Rader algorithm over the single-radix Cooley-Tukey algorithm requires the inclusion of a few extra processing elements.

The throughput of this processor would be identical to that of the Cooley-Tukey style processor of the last section due to their pipelined nature. However, the latency of this processor is 209 clocks, most of which is due to the extra data rearrangement required by the Rader algorithm.
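Rader's reduction that underlies this processor can be sketched as follows (our Python illustration; here the (p - 1)-point circular correlation is evaluated directly rather than with an FFT, which is where the efficiency would come from in practice):

```python
import cmath

def rader_dft(x):
    # p-point DFT of x, p an odd prime, via Rader's permutation:
    # the k != 0 outputs become a (p-1)-point circular correlation
    p = len(x)
    g = next(h for h in range(2, p)          # a primitive root mod p
             if len({pow(h, k, p) for k in range(p - 1)}) == p - 1)
    W = [cmath.exp(-2j * cmath.pi * k / p) for k in range(p)]
    X = [sum(x) + 0j] + [0j] * (p - 1)       # X[0] is just the sum
    a = [x[pow(g, q, p)] for q in range(p - 1)]    # permuted inputs
    b = [W[pow(g, -q, p)] for q in range(p - 1)]   # permuted twiddles
    for r in range(p - 1):
        X[pow(g, -r, p)] = x[0] + sum(
            a[q] * b[(r - q) % (p - 1)] for q in range(p - 1))
    return X
```

Because the correlation has length p - 1, a highly composite p - 1 (as for the lengths 3, 5, 7, 13, and 17 used in [13]) lets any FFT algorithm compute it efficiently, which is exactly the property exploited above.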

4. LARGER TRANSFORMS

For larger transform sizes, Rader-Winograd style processors require a smaller dynamic range than the corresponding Cooley-Tukey style processor. For example, an 8-bit 256-point radix-16 processor optimized by the techniques of this paper requires about 29 bits of dynamic range. A 240-point Rader-Winograd processor can be built by adding a base 4 module to the front of the 60-point processor of the last section. The multiplies which would normally be required between the two base 4 circuits can be included in the central multiplier. For this processor, the dynamic range is approximately 26 bits. Since the Rader-Winograd pipeline requires some extra hardware, these two designs are about the same size. For transforms above this size, the Rader-Winograd form would be expected to be slightly better.

5. CONCLUSIONS

Residue arithmetic using Gaussian primes as moduli has been described. Gaussian residue arithmetic requires less hardware and exhibits lower latency for complex operations than real residue arithmetic. The advantages of residue representation for arithmetic are:

1. Multiplication, addition, and subtraction can be performed by a number of small, independent processors in parallel without carry propagation.
2. By using small residues, a table-driven implementation is made feasible, which results in low-latency computation.

The disadvantages of the residue representation are:

1. Division requires communication between all of the residues.
2. There is a dynamic range limitation due to the integer representation of all quantities. This problem is made worse by the difficulty of scaling operations, which are similar to division.
3. Overhead is incurred due to the need for conversion into and out of the residue representation.

The fast Fourier transform is a prime candidate for the residue arithmetic technique since it requires only additions and multiplications. Several residue processor designs have been developed to allow a comparison between different FFT algorithms and between this technique and more traditional computational methods. An 8-bit 60-point Cooley-Tukey style processor using Gaussian residue arithmetic requires approximately 350 chips, and consists primarily of 64K-bit ROMs and a few RAMs. Hardware savings were realized by approximating the complex rotations in the FFT by small complex integers and by scaling the results at an intermediate point. This processor exhibits a latency of 85 clock cycles, primarily due to the necessary data rearrangement. A conventional FFT processor built with the techniques of Despain and Wold [13] would require fewer than 20 chips with more pins but similar silicon area. Such a processor would have a latency of approximately 105 clock cycles since the rotations required by the FFT would have to be computed as complex multiplies or CORDIC rotations [15]. For larger word sizes, the residue processor's latency would not increase significantly, whereas the conventional processor would have a greater increase in latency due to the rotations. Also, a conventional processor will have a decreasing throughput due to the need for longer carry propagation.

An 8-bit 60-point Rader-Winograd style processor has also been developed, and consists of approximately 500 chips. The increase in component count over the Cooley-Tukey style processor is due mainly to the increase in the total number of computational elements. For larger transform sizes, however, it is expected that the Rader-Winograd style might be less expensive since it requires less dynamic range. However, the extra data rearrangements in pipeline versions of the Rader algorithm cause the above processor to have a latency of 209 clock cycles.
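The carry-free parallelism of residue channels can be illustrated with a small sketch, not taken from the paper, of a single complex residue channel in the quadratic style of the complex residue number systems discussed by Leung [9] and Soderstrand and Poe [10]. For a prime p congruent to 1 (mod 4) there is a j with j*j ≡ -1 (mod p), so a Gaussian integer a+bi maps to the residue pair ((a+jb) mod p, (a-jb) mod p), and complex multiplication becomes two independent scalar multiplications mod p. The helper names below are our own.

```python
def root_of_minus_one(p):
    """Find j with j*j == -1 (mod p); such j exists when p == 1 (mod 4)."""
    for g in range(2, p):
        j = pow(g, (p - 1) // 4, p)
        if (j * j) % p == p - 1:
            return j
    raise ValueError("p must be a prime congruent to 1 (mod 4)")

def to_qrns(a, b, p, j):
    """Map a+bi to its pair of residues mod p."""
    return ((a + j * b) % p, (a - j * b) % p)

def from_qrns(z, zbar, p, j):
    """Recover (a, b) with a+bi matching the residue pair (mod p)."""
    inv2 = pow(2, p - 2, p)            # 2^{-1} mod p (Fermat)
    inv2j = pow(2 * j, p - 2, p)       # (2j)^{-1} mod p
    return (z + zbar) * inv2 % p, (z - zbar) * inv2j % p

p = 13                                 # 13 == 1 (mod 4)
j = root_of_minus_one(p)
# multiply (2+3i)(4+5i) = -7 + 22i, i.e. 6 + 9i (mod 13)
x, xc = to_qrns(2, 3, p, j)
y, yc = to_qrns(4, 5, p, j)
prod = ((x * y) % p, (xc * yc) % p)    # two independent multiplies
print(from_qrns(prod[0], prod[1], p, j))   # (6, 9)
```

In a hardware realization each channel is small enough to be a lookup table, which is the basis of the table-driven, low-latency designs described above; several such channels run in parallel and are recombined only at the output.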


At this point, cost-effectiveness would still seem to favor more traditional approaches to FFT computation. However, a table-driven implementation of the FFT using Gaussian residue arithmetic can achieve significantly lower latency.

REFERENCES

1. Knuth, D. E. The Art of Computer Programming, Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, Mass., 1969, p. 248.
2. Cooley, J. W., and Tukey, J. W. An algorithm for the machine calculation of complex Fourier series. Math. Comp. 19 (Apr. 1965), 297-301.
3. Good, I. J. The interaction algorithm and practical Fourier analysis. J. Royal Statist. Soc. Ser. B 20 (1958), 361-372.
4. Rader, C. M. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56, 6 (June 1968), 1107-1108.
5. Kolba, D. P., and Parks, T. W. A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. ASSP-25 (Aug. 1977), 281-294.
6. Winograd, S. On computing the discrete Fourier transform. Math. Comp. 32, 141 (Jan. 1978), 175-199.
7. McClellan, J. H., and Rader, C. M. Number Theory in Digital Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1979.
8. Szabo, N. S., and Tanaka, R. I. Residue Arithmetic and Its Applications to Computer Technology. McGraw-Hill, New York, 1967.
9. Leung, S. H. Application of residue number system to complex digital filters. Proc. 15th Asilomar Conf., 1981, pp. 70-73.
10. Soderstrand, M. A., and Poe, G. D. Applications of quadratic-like complex residue number system arithmetic to ultrasonics. Proc. ICASSP, 1984.
11. Gauss, C. F. Theoria residuorum biquadraticorum. In Carl Friedrich Gauss Werke. Georg Olms Verlag, New York, 1973, Vol. 2, pp. 65-178.
12. Despain, A. M. Very fast Fourier transform algorithms for hardware implementation. IEEE Trans. Comput. C-28 (May 1979), 333-341.
13. Wold, E. H., and Despain, A. M. Pipeline and parallel-pipeline FFT processors for VLSI implementations. IEEE Trans. Comput. C-33 (May 1984), 414-426.
14. Oppenheim, A. V., and Schafer, R. W. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.
15. Despain, A. M. Fourier transform computers using CORDIC iterations. IEEE Trans. Comput. C-23 (Oct. 1974), 993-1001.