Fast Fourier Transformation Based on Number Theoretic Transforms

Fast Fourier Transformation Based on Number Theoretic Transforms

Fast Fourier Transformation Based on Number Theoretic Transforms by REZAADHAMI Electrical Huntsville, and ROBERTJ.POLGE and Computer Huntsville, ...

643KB Sizes 0 Downloads 65 Views

Fast Fourier Transformation Based on Number Theoretic Transforms by REZAADHAMI Electrical

Huntsville,

and ROBERTJ.POLGE

and Computer

Huntsville,

Engineering

Department,

University

of Alabama in

AL 35899, U.S.A.

Many fast algorithms have been proposed for computing the discrete Fourier transformation. Most of them are based on factorization with the goal of reducing the number of multiplications. They usejoating point arithmetic to avoid repetitious scaling and a sizeable wordlength to minimize quantization errors. This paper shows that the DFT of a sequence with prime length, P, can be computed eficiently, for selected P’s, using number theoretic transforms. The proposed technique, denoted as FFT/NTT, is illustrated for p = 2M + 1. Advantages include availability offast algorithms for a set ofprime lengths, residue arithmetic with bene$t in speed and hardware costs, parallel implementation with short wordlengths through the use of the Chinese remainder theorem, and exact computation except for scaling and round offfor the input array and the trigonometric sequences. ABSTRACT:

I. Introduction In recent years, numerous techniques have been reported for computing the discrete Fourier transform (DFT). However, there is a need for faster algorithms that use the computer hardware more efficiently and can be executed in real time (e.g. in the field of image processing, radar processing,, . . etc.) Historically, the most important event in fast algorithm development was the fast Fourier transform (FFT), introduced by Cooley and Tukey in 1965 (1). This algorithm reduces the number of multiplications required for computing the DFT to N log, N for radix-2 and even more for radix-4. In addition, the same computational cells can be used through the entire algorithm. This is an important factor both in software and hardware implementation. More recent algorithms (210) combine algorithms for short-length DFT’s into algorithms for computing the DFT’s of greater lengths. These algorithms require fewer multiplications but more additions. Most algorithms minimize the number of multiplications and storage. This reduction translates directly into more complex control logic for hardware implementations, and generally increases algorithm complexity for software implementations. Advances in VLSI technology are removing the hardware cost and execution time constraints previously associated with multiplication operations. Single components are now available (11) that carry out multiplications and additions with equal speed. Clearly, the traditional-approach for designing new algorithms is no longer valid for systems employing these new components.

C The FranklinInsl~tutcO01~~32,'8X $3.00+0.00

547

.

Reza Adhami and Robert J. Polge

The above referred FFT techniques require floating point arithmetic to avoid repetitious scaling, and a sizeable wordlength to minimize quantization errors. These limitations can be removed using residue arithmetic. This paper shows that the DFT of a sequence with a prime length, P, can be computed efficiently, for selected P’s, using number theoretic transforms (NTT’s). In the proposed technique, which will be denoted as FFT/NTT, the dc component is computed as usual. The remainder of the DFT is expressed as a cyclic convolution of two arrays of Size P- 1. One is the permuted version of the last P- 1 elements of the input sequence, {x(n)}, and the other is the permuted version of the sequence of sines and cosines. The sequence of sines and cosines are the powers of w’s in the DFT. These two arrays, {a’(n)} and {b’(n)}, are obtained by the following two equations : i.e. a’(n) = x((gp’“+“),), b’(n) = W
2.

root of the prime number

IV=cos($)-jsin(F),

and

j’=

(2) P, (.),

indicates

-1.

the

(3)

In general, (a’(n)} is a real array and {b’(n)} is a complex array of size P- 1. Later, we show that because of symmetry, the computation of the transform of {b’(n)} is nearly the same as for a real array. These two sequences are normalized and quantized to yield the integer arrays {a(n)> and {b(n)). Once the cyclic convolution of these two arrays is computed, the remaining (P- 1) samples of the DFT can be obtained through denormalization, addition of x(O), and permutation. The cyclic convolution of integer sequences can be computed efficiently using NTT’s (12-20). Since all operations are executed in a ring of integers module q, the computation is exact, except that results are congruent to the exact values. Equality is guaranteed only if the upper bound of the convolution terms is less than q, otherwise the ambiguity due to congruence can be removed using the Chinese remainder theorem. That is, first, the cyclic convolution is computed with respect to relatively prime moduli q,, q2,. . . , q,. Second, using the Chinese remainder theorem, one finds the results to be congruent to the product of the moduli which is larger than the estimated upper bound. Section II presents the process involved in expressing the DFT with prime length as a cyclic convolution. In section III, the FFT/NTT algorithm is derived. Section IV discusses selection of parameters. Section V shows an efficient computer implementation of the FFT/NTT for P = 2”+ 1. An example is provided in section VI.

II. Expressing the DFT as a Cyclic Convolution The

DFT

of a sequence

x(n)

with

a prime

length

P can

be represented

as Journal

548

of the Franklin Pergamon

Institute Press plc

X(k) =

r x(O)+

Fast Fourier Transformation n=P-

1

nzo

x(n),

k

=

(4)

0.

n=PpI x(n) W”sk, k = 1,2,. . . , P1 fl= 1

1.

(5)

Since P is a prime number, there exists some integer g not necessarily unique, such that there is one-to-one mapping of the integers IZ = 0, 1, . . . , P- 2 to the integersm = 1,2 ,..., P-1,givenby

m=

p.

(6)

To simplify the notation, we use m = g”% where modulo P is understood. To illustrate this mapping, we let P = 17. Since 17 is a prime number, there exists an integer g = 3 such that it generates all the elements of the ring { 1, 2,. . . ,16} through the sequence 3”%, n = 0, 1, . . . , 15. The integer 3 is called a primitive root of 17. A table of primitive roots can be found in (21). Table I shows the mapping ofn=1,2 ,..., 16ontom=O,l,..., 15. TABLEI no12345678 m

Knowing

1

3

9

10

the mapping

13

5

15

11

16

9

10

11

12

13

14

1.5

14

8

7

4

12

2

6

process, we make the following

substitution

into (4) : i.e.

il + g-Y! k+$%. It follows that n-P-2 qgk+'%)

=

x(O)+

c

x(

g-(n+l)%)Wg~“+‘%,

n=O k=O,l,...,

(7)

P-2.

Let (c’(k)} represent the summation on the right hand side of (6). It can be shown that (c’(k)} is equivalent to the cyclic convolution of two sequences given in (1) and (2). That is c’(n) = a’(n) * b’(n),

n = 0, 1, . . . , P- 2.

(8)

where * denotes cyclic convolution. It is important to note that a different permutation on the input sequence and the corresponding trigonometric sequences may result in a different operation. For instance, Rader (23) has shown that through a permutation, the DFT can be viewed as a circular correlation. Vol. 325, No. 5, pp. 547-557, Printed in Great Britain

1988

549

Reza Adhami and Robert J. Polge III. Cyclic Convolution

Using NTT’s

In general, the sequences given in (7) are real. To compute their cyclic convolution using NTT’s, one must scale and round off {a’(n)} and {b’(n)} to obtain the integer sequences (a(n)} and {b(n)). That is - NORM],

a(n) = NINT[a’(n) and b(n) = NINT[b’(n)

n = 0, 1, . . , P- 2.

- NORM],

(9)

In (8), NORM is the scaling factor and NINT takes the nearest product [ -1. The direct NTT of {a(n)} and {b(n)} are given by A(k) s

c

a(n)

integer

(10)

* Yk%%

n=O n=N-

B(k)=

of the

I

c

b(n)*Srk%%,

n=O,l,...,

(11)

N-l,

?I=0

where = indicate the congruence relationship ; N= P-l; S is the transform factor and it is a root of unity of order N module q ; %% denotes the residue with respect to q and (11) represents two NTT’s since {b(n)} is complex. The product of (A(k)} and {B(k)}, (C(k)}, is the and (b(n)} in the transform domain. To compute {C(k)} must be calculated. The inverse exists and it is of integers modulo q satisfies the following conditions

cyclic convolution of {a(n)> {c(n)}, the inverse NTT of unique if and only if the ring ;

1. the inverse of the transform factor, s- ‘, exists and it is unique ; 2. the inverse of N exists and it is unique. Under

these conditions,

the inverse NTT of C(k) = A(k) - B(k) is n=N-

c(n) L N-’

c

I

C(k)S-“‘k%%,

n = 0, I,. . . , N-

1,

(12)

n=O

where {c(n)} is the cyclic convolution of {a(n)} and {b(n), N- ’ is the inverse of N and S ’ is the inverse of the transform factor, in the ring of integers module q. To compute the cyclic convolution of the original sequences, {a’(n)} and {b’(n)}, one must divide {c(n)) by the NORM. That is, c’(n) = Ko+ti Now using (6), we find the last P-

c(n),

n=O,l,...,

I components

P-2.

of the DFT. That is

(13)

Fast Fourier Transformation

m ‘+I%) As k varies from 0 to P-2,

k = O,l,. . .,P-2.

= x(O)+c’(k), $+I%

generates

integers

(14)

from 1 to P- 1.

IV. Selection of Parameters The first step in the computer programming of the proposed method ing the DFT, is the determination of the following parameters : 1. a primitive root, g, of the transform 2. the modulus q.

for comput-

length P;

The primitive root, g, is an integer of the set {2,3,. . . , P- 2) that generates a cyclic group containing all non-zero integers smaller than P, through the sequence f% for-n = 0, l,... , P - 2. This integer g, which is read from a table of primitive roots or determined through computer search, is used in the permutation of the last P- 1 elements of the input sequence and the sequence of sines and cosines. The modulus, q, plays an important role in NTT’s and it must satisfy the following conditions : (a) existence of, S, a root of unity of order N = P- 1 ; (b) existence of a unique inverse of S; (c) existence of a unique inverse of N. If q is chosen such that S is 2 or a power of 2, and moreover q itself is a power of 2 plus or minus 1, NTT’s can be executed using only bitshifts and additions. The ambiguity due to congruence could be removed by selecting a value of q larger than the estimated maximum output. For this purpose, the wordlength should be very large (typically larger than 128), which means the design of special hardware. Thus, it is more practical to select a set of smaller values of q, relatively prime, and to use the Chinese remainder theorem. To optimize the implementation, we determined through computer search the root of unity of various orders smaller than 17 for prime numbers less than 60,000 (22). The restriction to prime numbers is not necessary but it guarantees conditions (b) and (c).

V. Implementation The NTT can be computed fast and in place using algorithms similar to those are in the FFT. In our example N = P- 1 = 2M and radix-2 type algorithms applicable. The direct NTT subroutine is based on frequency decimation, while the inverse NTT is based on time decimation. In this manner, unscrambling can be avoided entirely. The program also takes advantage of the symmetry of the trigonometric array. A block diagram for FFT/NTT algorithm is given in Fig. 1. Each block is named “BK” followed by its corresponding number, with the top block being number 1. (BKl) Initialization : transform length = P, N = P- 1, primitive root of P = g, scaling factor = NORM, i = (1, I), transform factor (i) = S,, modulus (i) = qi, inverse of transform factor (i) = S- ‘, inverse of N(i) = N17‘. Store x(n) in an array Vol. 325, No. 5, pp. 547-557, Printed in Great Britain

1988

551

Reza Adhami and Robert J. Polge BK2 BKI Initialize store x(n) compute a(n)+ b(n)

Y+

BK4

Compute cyclic convolution using NTT with respect to moduli q,

Remainder

Denormalize add x (0)

-

permute

-

c(i,n)

DFT -

FIG. 1. A block diagram for a FFT/NTT

algorithm.

of size P. Compute and store W, n = 1, 2,. . . , P- 1, where W is given in (3). of DFT. Calculate {a(n)} and {b(n)) using (8). Compute the zero component Compute the first step of the direct NTT of {b(n)} given in (16) and (17). (BK2) Compute the cyclic convolution of a(n) and b(n) using NTT’s with respect to qi and store the results in ci(n), an array of size N. A flow diagram of this block is given in Fig. 2. To compute the cyclic convolution using NTT’s, one can use the subroutines DNTT, direct NTT and INTT, inverse NTT, given in the Appendix. To avoid unscrambling, the direct NTT is based on radix-2 frequency decimation and the inverse NTT is based on radix-2 time decimation (22). The NTT of {a(n)} is straight forward since {a(n)} is a one-dimensional integer array. However, b(n) = b,(n) + jbi(n) where b,(n) = NINT (cos (0) - NORM) for n = 0, 1, . . . , N-

hi(n) = NINT (sin(v) -NORM)

1 and v = (-2”/P)

(g”%)

The subscripts r and i denote the real and imaginary parts, respectively. It appears that two transformations are needed for computing B(k), one for (b,(n)} and the other for {hi(n)}. However, it is shown (22) that {b(n)} is symmetrical in the following sense

a(n)

A(k) -

DNTT

Multiply

-C(k)=

c A(k).

B(k)

Last (M-l) steps of DNTT

b(n)

First -

step

of

-

DNTT

B(k) C(k) First

(M-l)

steps of INTT

I

FIG. 2. A block diagram

for computing

cyclic convolutions

using NTT’s. Journal

552

of the Franklin Pergamon

Institute Press plc

Fast Fourier Tramformation

forr=O,l,

. . ..(N/2-1)

(15)

where * denotes the complex conjugate. Using the above property of {b(n)) and the fact that SNj2 = - I%%, it can be shown that the first step for computing (1 I), the direct NTT of {b(n)) is 2

~,vwl

=

C

b,(n)LP%%,

k = even

(16)

tZ=O

k = odd

0, N/2-

I

2 C

L[Wk)l =

b,(n)S”‘“%%,

k = odd (17)

n=O

k = even

0,

where R, and I, denote the real and imaginary parts of the array B,(k), and only the first half of {b,(n)} and {b,(n)} need to be stored. Since N = 2”, one can transform (16) and (17) simultaneously using a radix-2 type frequency decimation algorithm, except for the first step where the storage is halved by taking advantage of the zeros. The remainder of the transformation is performed using the subroutine DNTT which computes the last (M- 1) steps in parallel for both {b,(n)} and {hi(n)}. Cyclic convolution corresponds to a product in the transform domain. Therefore, the next step is to multiply {A(k)} and {B(k)} to obtain (C(k)). Note that C(k) is in a compressed form. Let c,(n) = b,.(n) * a(n) c2(n) = bi(rz) *a(n) and let (C,(k)} and {C2(k)} be the direct NTT of {c,(n)} binary scrambled order. Then C,(k) C,(k)

= 0,

= C(k),

C,(k+N/2)

C,(k+N/2)

= C(k+N/2),

and {cl(n)}

stored

in

= 0 k = 1,2,, . . , N/2,

and C(k) = C,(k)+&(k),

k=

I,2 ,...,

N.

To obtain {c(n)} from {C(k)}, we carry the first (M- 1) steps of the inverse NTT using the subroutine INTT listed in the Appendix. It computes the first (M- 1) steps of the inverse transformation of (C,(k)) in the first half of the transformed array, obtaining cc,“‘- ‘)(n). Simultaneously, the first (M- 1) steps of the inverse transformation of {C,(k)) are computed and the results, cl”- ‘j(n), are stored in the second half of the transformed array. Thus, we can use a compressed representation through the (M- 1)th step because half of C(,“- ‘j(k) and Cc,“- ‘j(k) are zeros. However, the compressed representation cannot be used for Vol. 325, No. 5, pp. 547-557. Printed in Great Britain

1988

553

Reza Adhami and Robert J. Polge the last step which would merge these two sequences. Instead, c’;“- ‘j(n) with zeros and finish the transformation to obtain

one can expand

c,(n) = c’;“P”(n) c,(n+N/2)

= c,(n),

n = 1,2 ,...,

N/2.

Similarly, cZ(n) = c”‘--“(n+N/2).tnp’, c,(n+N/2)

= -c,(n),

n = 1,2,.

, N/2.

where t = Sp’. (BK3) Apply the Chinese remainder theorem when more than one modulus is used. (BK4) Denormalize {c,(n)} and {c~(Iz)} to obtain {c’,(n)} and {c;(n)}, add x(0) to {c’,(n)}, and permute {c’,(n)} and {c;(n)} according to c\(g”) + c;(n) c;(g”)+c;(n),

it=

1,2 ,...,

N.

Now {c’,(a)} and {c’&)} contain the real and imaginary parts of {X(k)}, the DFT of the input array x(n), for k = 1, 2,. . . , P- 1. The zero component is computed by simply adding all the elements of x(n) together. VI. Example To illustrate the FFT/NTT technique for computing the DFT, we chose randomly generated sequences of length P = 5, 17 and 257. For P = 257, a primitive root, g = 3, is selected from a primitive root table to determine two sequences {a’(H)) and {b’(n)) g’iven in (1) and (2), respectively. For the purpose of scaling these two sequences, we chose different values for the NORM. These values are in the range of 100Ck20,000. The results are compared to a direct method for computing the DFT using real arithmetic. A graph representing the error with respect to different values of NORM is given in Fig. 3. It is shown that when NORM is

Norm

FIG. 3. Error due to round-off

554

x E3

when Norm varied from 1000 to 20,000.

Journalofthe Franklin

lnst~tute Pergamon Press

plc

Fast Fourier

Transformation

greater ihan 4000, the error is negligible and is less than 0.05%. We use NORM = 2”. The modulus q = 12,289 is found through a computer search. This modulus is chosen because it has a root of unity of order 256, S = 9. Since q is a prime number, the inverses of S = 9 and N = 256 exist and they are found from the following two congruence relationships : 9S-’ 256N-’

E 1 module

12,289 + S- ’ = 2731,

= 1 modulo

12,289 + N-’

= 12,241.

VZZ.Conclusions The FFT/NTT technique for computing the DFT has several advantages. First, it provides another set of transform length (prime numbers) for which a fast algorithm can be developed. Second, the complex arithmetic is replaced by integer arithmetic with benefits in speed and hardware costs. Third, the Chinese remainder theorem allows a parallel implementation which should be useful in many applications including hybrid optical computing. Fourth, the computation is exact except for the scaling and rounding off the input array and of the trigonometric sequence. The FFT/NTT technique has been implemented only for the sequences with prime length of the form 2”+ 1. However, fast algorithms can be developed for arbitrary prime numbers provided that P- 1 can be factorized. We are in the process of developing a multiple radix FFT/NTT for any prime number P of the form P = 2k’ - 3k2 - qk3 * gk4 * 7k5+ 1 where ki are arbitrary integers.

References (1) J. W. Cooley

(2) (3) (4) (5) (6) (7)

(8) (9)

and J. W. Tukey, “An algorithm for machine calculation of complex Fourier series”, Math. Comput., Vol. 19, pp. 197-301, 1965. G. D. Bergland, “A fast Fourier transform algorithm using base 8 iterations”, Math. Comput., Vol. 22, pp. 275-279, 1968. R. C. Singleton, “An algorithm for computing the mixed radix fast Fourier transform”, IEEE Trans. AU-17, pp. 93-103, 1969. C. M. Rader and N. M. Brenner, “A new principle for fast Fourier transformation”, IEEE Trans. ASSP-24, pp. 264265, 1976. S. Winograd, “On computing the discrete Fourier transform”, Math. Comput., Vol. 32, pp. 175-199, Jan. 1978. E. Dubois and A. N. Venetsanopoulos, “A new algorithm for the radix-3 FFT”, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-26, pp. 2222225, June 1978. R. J. Polge and B. K. Bahagavan, “Efficient fast Fourier transform program for arbitrary factors with one step loop unscrambling”, IEEE Trans. on Comput., C-27, pp. 534-539, May 1976. S. Prakash and V. V. Rao, “A new radix-6 FFT algorithm”, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-29, pp. 939-941, Aug. 198 1. C. S. Burrus and P. W. Eschenbacher, “An in-place, in-order prime factor FFT algorithm”, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-29, pp. 806816, Aug. 1981.

Vol. 325, No. 5, pp. 547-557, Printed in Great Britain

1988

555

Reza Adhami

and Robert

J. Polge

(10) J. B. Martens, “Recursive cyclotomic factorization-A new algorithm for calculating the discrete Fourier transform”, IEEE Trans. Acoust., Speech, Signal, Signal Processing, Vol. ASSP-32, pp. 750-761, Aug. 1984. (11) B. A. Bowen, “VLSI System Design for Digital Signal Processing”, Prentice-Hall, Englewood Cliffs, NJ, 1982. (12) P. J. Nicholson, “Algebraic theory of finite Fourier transform”, J. Comput. Syst. Sci., Vol. 5, pp. 524547, 1971. (13) C. M. Rader, “Discrete convolution via Mersenne transforms”, IEEE Trans. Comput., Vol. C-21, pp. 126991273, Dec. 1972. (14) R. C. Agarwal and C. S. Burrus, “Fast convolution using Fermat number transforms with application to digital filtering”, Proc. IEEE, Vol. 63, pp. 550-560, 1975. (15) 1. S. Reed and T. K. Truong, “The use of finite field to compute convolution”, IEEE Trans. Inf. Theory, Vol. IT-21, pp. 208-213, Mar. 1975. (16) J. H. McClellan, “Hardware realization of a Fermat number transform”, IEEE Trans. ASSP-24, pp. 216225, 1976. (17) H. J. Nussbaumer, ‘Complex convolutions via Fermat number transforms”, IBM J. Res. Dec., Vol. 20, pp. 282-344, 1976. (18) W.-C. Siu and Constantinides, “On the computation of discrete Fourier transform using Fermat number transforms”, ZEE Pro.-F, Communication, Radar, and Signal Processing, Vol. 131, no. 1, pp. 7-14, Feb. 1984. (19) H. J. Nussbaumer, “Relative evaluation of various number theoretic transforms for digital filtering applications”, IEEE Trans. ASSP-26, pp. 8893, 1978. (20) J. M. Pollard, “Implementation of number theoretic transforms”, Electron Lett., Vol. 11, pp. 294295. 1976. (21) Hua Loo Keng, “Introduction to Number Theory”, Springer, New York, NY, 1968. (22) R. Adhami, “Spectral analysis using number theoretic transforms with fast Fourier transform implementation”, Ph.D. dissertation. The University of Alabama in Huntsville, 1985. (23) C. M. Rader, “Discrete Fourier transforms when the number of data samples is prime”, Proc. IEEE, Vol. 56, no. 6, pp. 1107-8, June 1968.

Appendix

C

SUBROUTINE DNTT(lX,M,IG,IQ,ISTRUT) IX = ARRAY, M = POWER OF 2, IG = TRANSFER MODULUS DIMENSION 1X(257) N=2*M NX=N IF(ISTRUT.EQ.l) NX=N+N LG=N LS=N/2 INCW = IG DO 3 J=l,M KSWG= 1 DO4KS=l,LS DO 5 K = KS,NX,LG KP=K+LS IT=MOD(IX(K)-IX(KP),IQ)

FACTOR,

Journal

556

IQ =

of the

Franklin Pergamon

lnst~tute Press plc

Fast Fourier Transformation

5 4

3

4

6

’ IX(K) = IX(K) + IX(KP) IX(KP) = MOD(KSWG*INCW,IQ) KSWG = MOD(KSWG*INCW,IQ) LG=LS LS = LGj2 INCW = MOD(INCW*INCW,IQ) RETURN END SUBROUTINE INTT(IX,M,INVS,IQ,INVN,INVFIN) INVS = INVERSE OF G, INVN= INVERSE OF N INCW(M) = INVS DO2J=M,2,-1 JJ=J-1 INCW(JJ) = MOD(INCW(J)*INCW(J),IQ) N=2*M LG=l MX=M IF(INVFIN.EQ. 1) MX = M - 1 D03J=I,MX LS=LG LG=LS+LS KSWG= 1 DO4KS=l,LS DO 5 K = KS,N,LS KP=K+LS IT = MOD(IX(KP)*KSWG,IQ) IX(K) = MOD(IX(K) +IT,IQ) IX(KP) = MOD(IX(K) - IT,IQ) KSWG = MOD(KSWG*INCW(J),IQ) CONTINUE DO6K=l,N IX(K) = MOD(IX(K),IQ) IX(K) = MOD(IX(K)*INVN,IQ) RETURN END

Vol. 325, No. 5, pp. 547-557, Printed in Great Britain

1988

557