journal of parallel and distributed computing 54, 4858 (1998) article no. PC981472
A Novel Approach to Fast Discrete Fourier Transform J. G. Liu Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science H Technology, State Education Commission Laboratory for Image Processing H Intelligent Control, People's Republic of China
H. F. Li Department of Computer Science, University of Concordia, Montreal, Quebec, Canada
and F. H. Y. Chan and F. K. Lam Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong E-mail: fhychanhkueee.hku.hk Received July 1, 1996; accepted June 7, 1998
Discrete Fourier transform (DFT) is an important tool in digital signal processing. In the present paper, we propose a novel approach to performing DFT. We transform DFT into a form expressed in discrete moments via a modular mapping and truncating Taylor series expansion. From this, we extend the use of our systolic array for fast computation of moments without any multiplications to one that computes DFT with only a few multiplications and without any evaluations of exponential functions. The multiplication number used in our method is O(N log 2 N log 2 log 2 N) superior to O(N log 2 N) in FFT. The execution time of the systolic array is only O(N log 2 N log 2 log 2 N ) for 1-D DFT and O(N k ) for k-D DFT (k2). The systolic implementation is a demonstration of the locality of dataflow in the algorithms and hence it implies an easy and potential hardwareVLSI realization. The approach is also applicable to DFT inverses. 1998 Academic Press
1. INTRODUCTION It is well known that Fourier transforms have played an important role in science and engineering and DFT (discrete Fourier transform) is very important in digital signal processing. A variety of algorithms for computing DFT have been proposed [57, 14]. 0743-731598 25.00 Copyright 1998 by Academic Press All rights of reproduction in any form reserved.
48
File: DISTL1 147201 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 3850 Signs: 1995 . Length: 58 pic 2 pts, 245 mm
49
A NOVEL APPROACH TO FAST DFT
On the other hand, computer vision and image analysis have propelled the advancement of fast computation of discrete moments (DM) [3, 8, 9, 11, 15]. These two research threads have been developed independently. In this paper, we first construct the bridge between DFT and discrete moments (DM) by modular mapping and making use of Taylor expansions and hence we can transform DFT into computation involving moments. We also discuss the problems of convergence and errors. Based on our approach to the fast calculation of moments [3], new systolic arrays to perform 1-D and 2-D DFT are presented, followed by a complexity analysis.
2. USING MOMENTS TO COMPUTE DFT The DFT of a sampled sequence x(0), x(1), ..., x(N&1) is given by N&1
X(k)= : x(r) e &j 2?rkN
k=0, 1, 2, ..., N&1.
(2.1)
r=0
The above equation bears some resemblance to the 1-D moments defined by N&1
m(k)= : x(r) r k
k=0, 1, 2, ..., N&1
(2.2)
r=0
Recently we developed a fast method to compute Eq. (2.2) that involves addition only and demonstrated its realizability in the form of a systolic array [2]. In this paper we attempt to construct a bridge between the two equations above so that the advantage of the fast method of computing moments can also be obtained in computing DFT. The first step partitions the set [0, 1, 2, ..., N&1] into N disjoint subsets, depending on k and N. Specifically, S k i =[r | kr#i (mod N ), r # [0, 1, 2, ..., N&1]]
i, k=0, 1, 2, ..., N&1. (2.3)
For example, for N=4, k=0, 1, 2, 3, S 0 0 =[0, 1, 2, 3], S 0 1 =S 0 2 =S 0 3 =8, S 1 0 =[0], S 1 1 =[1], S 1 2 =[2], S 1 3 =[3], S 2 0 =[0, 2], S 2 1 =8, S 2 2 =[1, 3], S 2 3 =8, S 3 0 =[0], S 3 1 =[3], S 3 2 =[2], S 3 3 =[1].
File: DISTL1 147202 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 2544 Signs: 1575 . Length: 52 pic 10 pts, 222 mm
50
LIU ET AL.
The sum of each of these subsets of sampled values is denoted by
y k (i)=
{
: x(r)
if
S ki {, k, i=0, 1, 2, ..., N&1
r # Ski
0
(2.4)
otherwise.
So for N=4, y 0 (0)=x(0)+x(1)+x(2)+x(3), y 0 (1)=0, y 0 (2)=0, y 0 (3)=0; y 1 (0)=x(0), y 1 (1)=x(1), y 1 (2)=x(2), y 1 (3)=x(3); y 2 (0)=x(0)+x(2), y 2 (1)=0, y 2 (2)=x(1)+x(3), y 2 (3)=0; y 3 (0)=x(0), y 3 (1)=x(3), y 3 (2), y 3 (3)=x(1). In other words, using Eqs. (2.3) and (2.4), the sampled sequence x(r) is mapped into a new sequence y k (i) for each value of k by summing the terms of the sampled sequence that have the same multiplier of complex exponential function. We are then ready to rewrite Eq. (2.1) in terms of y k (i) instead of x(r) as follows: N&1
X(k)= : x(r) e &j 2?rkN r=0
k=0, 1, 2, ..., N&1.
N&1
= : y k (i ) e
(2.5)
&j 2?iN
i=0
In deriving (2. 5), we have used the property that e &j 2?rkN =e &j 2?(rk mod N )N =e &j 2?iN. The second step involves the application of series expansion to Eq. (2.5), yielding N&1
X(k)= y k (0)+ : y k (i) i=1
_
p
&
: (&j 2?iN ) r r! +R p r=0
p
N&1
= y k (0)+ : [&j 2?i )$N r r!] : y k (i) i r +R p r=0
k=0, 1, ..., N&1
(2.6)
i=1
p
= y k (0)+ : a r m k (r)+R p , r=0
where a r =(&j 2?) rN r r!
(2.7)
N&1
m k (r)= : y k (i) i r i=1
File: DISTL1 147203 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 2660 Signs: 1060 . Length: 52 pic 10 pts, 222 mm
(2.8)
A NOVEL APPROACH TO FAST DFT
51
and N&1
: y k (i)(cos(! i 1 +( p+1) ?2)(2?iN ) p+1( p+1)! i=1
& j sin(! i 2 + p?2)(2?iN ) pp!) 0
Rp=
if p is even.
(2.9)
N&1
: y k (i)(cos(! i 1 +p?2)(2?iN ) pp! i=1
& j sin(! i 2 +(p+1) ?2)(2?iN ) p+1(p+1)!) 0
if p is odd.
(from the theorem of the extended law of the mean[4, 10, 12]). Suppose p>2? and p is even, thus the error introduced by ignoring R p is bounded by
}
N&1
|R p | = : y k (i )(cos(! i 1 +( p+1) ?2)(2?iN ) p+1( p+1)! i=1
& j sin(! i 2 + p?2)(2?iN ) pp!)
}
N&1
: | y k (i )| |cos(! i 1 +(p+1) ?2)(2?iN ) p+1(p+1)! i=1
& j sin(! i 2 + p?2)(2?iN ) pp! | N&1
: | y k (i )| (cos 2(! i 1 +(p+1) ?2)(2?iN ) 2( p+1)((p+1)!) 2 i=1
+sin 2(! i 2 + p?2)(2?iN ) 2p( p!) 2 ) 12 max |x(r)| 2 12(N&1)(2?) pp!.
(2.10)
r
When p>2? and p is odd, the same result can be proved. For example, when max |x(r)| 256 (0rN&1), N2K, p=34, |R p | 3.45_10 &6. Increasing p to 35, we get |R p | 6.20_10 &7. Thus the error converges to zero very rapidly and uniformly and the approximation can satisfy the accuracy requirement of most applications without computing too many terms. It is obvious that p is a slow-growing function of N. Suppose max |x(r)| 256 (0rN&1); some cases of N, p and R p are listed in Table 1.
File: DISTL1 147204 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 2654 Signs: 1147 . Length: 52 pic 10 pts, 222 mm
52
LIU ET AL.
TABLE 1 Some Cases of N, p, R p 2 10
N: 2 log 2 N p Rp
2 15
2 20
235
2 30
2 35
20 30 40 50 60 70 35 37 39 41 43 45 3.1_10 &7 2.9_10 &7 2.5_10 &7 1.9_10 &7 1.3_10 &7 8.6_10 &8
2 40
2 50
80 47 5_10 &8
100 50 1.1_10 &7
Furthermore, we can prove that the least upper bound of p is not more than O(log 2 N log 2 log 2 N ) as N tends to infinity. Let f (N, p)=max |x(r)| 2 12(N&1)(2?) pp!=A(N&1)(2?) pp!
(2.11)
r
Substituting [2 log 2 N log 2 log 2 N] for p in Eq. (2.11) ([x] denotes an integer closest to x), we get f (N, p)=F(N )=A(N&1)(2?) [2 log2 N log2 log2 N ][2 log 2 N log 2 log 2 N]! Let 2 log 2 Nlog 2 log 2 N=t, then log 2 N=t(log 2 log 2 N )2, and F(N)AN(2?) t+1[t]!. So N=2 t(log2 log2 N )2 =(log 2 N) t2 and lim
N
t=
lim (log 2 N ) 12t= lim (log 2 N) 12 log 2 log 2 N2 log 2 N
N
N
= lim (log 2 log 2 N )2(log 2 N ) 12 N
= lim (log 2 M )2M 12 =0, M
then lim F(N ) lim A(2?) t+1 (log 2 N ) t+1 (log 2 N ) t2[t]!
N
t
= lim A(2?) t+1 (log 2 N ) t2t te &t - 2?t t
= lim A(2?)(2?e(log 2 N ) 12t ) t- 2?t=0. t
The final result is obtained by using the famous Stirling formula [1]. In Table 1, p can be expressed approximately in the form of [2 log 2 N log 2 log 2 N ]+32.
File: DISTL1 147205 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 2728 Signs: 1005 . Length: 52 pic 10 pts, 222 mm
A NOVEL APPROACH TO FAST DFT
53
The computation of X(k) using the approximation of Eq. (2.6) involves the generation of the r th order moments of the transformed data sequence y k (i ) and then performing a dot product of these moments with a constant vector (a r ) and an addition with y k (0). Thus if the moments can be computed very quickly and efficiently, so can the above procedure.
3. IMPLEMENTING DFT FROM 1-D MOMENTS The general network for implementing DFT is shown in Fig. 1. It consists of the moment generator [3], the processing arrays, and a dot product array. The moment generator formed of N&2 sections of Pascal triangles with a row of adderlatch could be used to generate the 1-D moments. It receives y k (i ) (i=0, 1, 2, ..., N&1) that the processing arrays produce. The dot product array performs the dot product and produces X(k) at the only output. The numbers in square brackets that are above the horizontal line denote the amount of time delay to keep synchronous pace in Fig. 1. The scheduling of the dataflow is such that m k (0) V a 0 moves from left to right, and accumulates with m k (1) V a 1 to produce the partial sum and this repeats until X(k) emerges at the far right of the output. As m k (i ) is produced one cycle earlier than m k (i+1)(i=0, 1, ..., p), as shown in Fig. 1, so the right most term m k ( p) takes two cycles to merge into X(k). This is just the time cost for the dot product. It remains to consider the generation of y k (i ). One interesting property of the mapping of x(r) to y k (i ) is noteworthy: If (k, N )=1, (i.e., k and N are coprime) then [ki | i=0, 1, 2, ..., N&1] is a complete system of residues modulo N [13] and |S ki | =1. This means y k (i ) is a permutation of x(r). Particularly, if N is prime then S00 =<, 1, ..., N&1i and S 0 j =< for j{0, but |S ki | =1 for all k{0. In other words, y k (i ) is a permutation of x(r) whenever k{0. In the remaining case, y 0 (0)= N&1 r=0 x(r), and y 0 ( j)=0 for j{0. So in the case where the sample size is a prime number, a preprocessor can be used to produce the y k (i ) via permutations of the sampled sequence x(r), after producing y 0 = N&1 r=0 x(r) in the first pass. In the more general case involving arbitrary values of N, a permutation network is inadequate to produce y k (i ). For example, when N=4, y 2 (0)=x(0)+x(2) and y 2 (2)=x(1)+x(3). In the general case, we propose using a special linear array to implement y k (i ), k=0, 1, ..., N&1. Each element x(r) is tagged with a rank=kr(mod N ) before it is sent into the array. For example, for k=2 and N=4, the samples become (x(0), 0), (x(1), 2), (x(2), 0), (x(3), 2). These are sent into the linear array one element at a time. An element is percolated from left to right until it reaches a cell in the array whose position is identical to its rank. When this happens, the value is accumulated in that cell in case it receives multiple values (i.e., |S ki | >1). Figure 2 illustrates an example. A cell in a linear array contains an adder only if the corresponding partition S ki has a size greater than 1. In 8 clocks, y 2 (0) will emerge at the right-hand side, as shown in Fig. 2. In general, it takes 2N clocks for the vector to appear on the right. It can be proved that the number of additions required in
File: DISTL1 147206 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 3732 Signs: 3189 . Length: 52 pic 10 pts, 222 mm
54
LIU ET AL.
FIG. 1.
The systolic array for 1-D DFT.
the preprocessing array for each k is equal to N&Ngcd(k,N). So the total number of additions required in the preprocessing array is equal to N&1 k=0 (N&Ngcd(k, N )= 2 N&1 N & k=0 Ngcd(k, N). Two-dimensional discrete Fourier transform is given by the equation below: N&1 N&1
X(k, m)= : i=0 N&1
= : i=0
: x(i, j ) e &j 2?(ki+mt)N t=0 N&1
\:
t=0
(k=0, 1, 2, ..., N&1; m=0, 1, 2, ..., N&1).
+
x(i, t ) e &j 2?mtN e &j 2?kiN
A NOVEL APPROACH TO FAST DFT
FIG. 2.
55
The linear array to implement y 2 (i ).
Let N&1
y(i, m)= : x(i, t) e &j 2?mtN, t=0
then N&1
X(k, m)= : y(i, m) e &j 2?kiN. i=0
So a two-dimensional discrete Fourier transform can be transformed into 2 onedimensional discrete Fourier transforms. The approach above is easily extended to computing 2-D DFT by taking 1-D DFT row-wise and column-wise. The systolic arrays shown in Fig. 1 are simplified in the center of Fig. 3 By this simplification
FIG. 3.
The simplified systolic array for 2-D DFT.
56
LIU ET AL.
the systolic array used for directly computing two-dimensional discrete Fourier transform via data feedback is shown in Fig. 3. The total execution time of the computing procedure is N+( p+1)(N&2)+2+(N 2 &1)+N 2 =2N 2 +( p+2) N&(2p+1) network clock ticks (every tick is one addition or multiplication). The method can be extended to multidimensional discrete Fourier transform (higher than two dimensions). The higher the dimension is, the more obvious the advantages of the method become. It is proved by mathematical induction that the execution time of the systolic array is O(N k ) for k-D DFT (k2).
4. COMPUTATION COMPLEXITY ANALYSIS According to the algorithm above, the fast discrete Fourier transform includes three steps: preprocessing, computation of moments, and dot product. We analyze the computation complexity of each step in the following. Preprocessing involves three equations, Eqs. (2.3), (2.4), and (2.7). The parameters a r and the index set S ki can be computed in advance, so these computations are not part of the real-time system. It is noteworthy that a r is a real number or a pure imaginary one and can be obtained through 2p+1 real multiplications. The total number of additions required in Eq. (2.4) is equal to N 2 & N&1 k=0 N gcd(k, N ) and N parallel linear arrays can produce all the y k (i ) in 2N clock cycles. There are all together ( p+1)( p+2)(N&2)2 additions in the network for computing the moment, and the execution time of the systolic array is ( p+1)(N&2) [3]. The final dot product and summing y k (0) need ( p+1)N additions and pN multiplications (a 0 =1 means no multiplication) to obtain all X(k) (k=0, 1, ..., N&1). However, because of the overlap in timing, this leads to an increase of only 3 cycles in the systolic computation. So there are all together N 2 & N&1 k=0 Ngcd(k, N)+( p+1)(p+2)(N&2) N2+ (p+1) N additions and pN multiplications in the algorithm. If N is a prime number, then there are ( p+1)( p+2)(N&2) N2+(p+1) N additions and pN multiplications in the algorithm. The total execution time of the systolic arrays is N+( p+1)(N&2)+2+N&1=(p+3) N&2p&1 In the above the first term N is for preprocessing (the preprocessing array can provide data processed after N clock ticks) and the last term N&1 is for collecting all X(k) (k=1, ..., N&1). All the p's mentioned above can be expressed in the form of [2 log 2 N log 2 log 2 N ]+32 in the case of N<2 100 and max |x(r)| 256, (0rN&1). In any cases, p=O(log 2 N log 2 log 2 N ). Compared with the direct method having a computational complexity O(N 2 ) and the conventional FFT having O(N log 2 N ), our method is superior to the direct
File: DISTL1 147209 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 3303 Signs: 2591 . Length: 52 pic 10 pts, 222 mm
57
A NOVEL APPROACH TO FAST DFT
method and seems to be inferior to FFT if the calculations are executed in a sequential machine (for the example given in Section 2, N2k, log 2 N11, p=34 or 35). However, if the sampled data are real numbers, since a r (r=0, 1, 2, .... p) is either a real number or a pure imaginary one, then additions and multiplications involved in our method are all real operations. Even though sampled data are complex numbers, the multiplication involved in our method is equivalent to two real multiplications and one real addition for the same reason. Since one complex addition is equivalent to two real additions and one complex multiplication is equivalent to four real multiplications and two real additions or to three real multiplications and five real additions [2], our method has an advantage over a FFT which involves complex operations. The traditional FFT requires O(N log 2 N ) complex multiplications and additions, for example, radix-2 FFT require N log 2 N 2 multiplications that are equivalent to 2N log 2 N real multiplications and N log 2 N real additions. When N is large enough (N2 20 ), p can get a value much lower than 2 log 2 N and still satisfy the accuracy requirement as demonstrated in Table 1 and p=O(log 2 N log 2 log 2 N ). In other words, in the case of N large enough, the number of multiplications in our method is lower than that in FFT although the number of additions in our method is still higher than that in FFT.
5. CONCLUSIONS We have proposed new algorithms and systolic arrays for 1-D and 2-D discrete Fourier transforms (DFT) and their inverses and have analyzed their computational complexity. We have displayed the relationship between the discrete Fourier transform and discrete moments and transformed the discrete Fourier transform into a discrete moment transform. Our method has the following advantages: 1. Exponential functions have been replaced by simple polynomial functions and many multiplications have been changed into additions. This decreases the computational cost and memory requirement. 2. The systolic arrays consist of only latches, adder-latches and a few multipliers. This results in an easy hardware implementation and is very suited to a realtime processing. 3. It can accommodate data samples of arbitrary length shorter than a given constant which depends directly on the number of multipliers used and produces nice accuracy and convergence property; 4. If the sampled data are real numbers, then the additions and multiplications involved in our method are all real operations while in other methods they are complex operations. 5. The multiplication number for 1-D DFT in O(N log 2 N log 2 log 2 N) superior to the O(N log 2 N ) in FFT.
our
method
is
6. The method is very easily extendible to multidimensional fast discrete Fourier transform and DFT inverses and the time complexity is O(N k ) for k-D DFT (k2).
File: DISTL1 147210 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 3477 Signs: 2868 . Length: 52 pic 10 pts, 222 mm
58
LIU ET AL.
ACKNOWLEDGMENT This report project was supported in part by University of Hong Kong research grants. The revision work was also supported in part by NSFC 69775011.
REFERENCES 1. E. F. Beckenbach and B. Bellman, ``Inequalities,'' Springer-Verlag, New YorkBerlin, 1961. 2. R. E. Blahut, ``Fast Algorithms for Digital Signal Processing,'' AddisonWesley, Reading, MA, 1984. 3. F. H. Y. Chan, F. K. Lam, H. F. Li, and J. G. Liu, An all adder systolic structure for fast computation of moments, J. VLSI Signal Process. 12 (1996), 159175. 4. J. B. Conway, ``Functions of One Complex Variable,'' Springer-Verlag, New YorkBerlin, 1986. 5. J. W. Cooley and J. W. Tukey, An algorithm for machine computation of complex Fourier series, Math. Comput. 19 (1965), 297301. 6. P. Duhamel, ``Paper on the Fast Fourier Transform,'' IEEE Press, New York, 1995. 7. P. Duhamel and M. Vetterli, Fast Fourier transforms: A tutorial review and a state of the art, Signal Processing 19 (1990), 259299. 8. M. Hatamian, A real-time two-dimensional moment generating algorithm and its single chip implementation, IEEE Trans. Acoust. Speech Signal Process. 34 (1986), 546553. 9. M. K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inform. Theory 8 (1962), 179187. 10. S. Lang, ``Complex Analysis,'' Springer-Verlag, New York, 1993. 11. B.-C. Li and J. Shen, Pascal triangle transform approach to the calculation of 3-D moments, Comput. Vision, Graphics, and Image Process.: Graph. Models and Image Process. 54 (1992), 301307. 12. J. M. H., ``Appleton-Century-Crofts,'' 1956. 13. H. E. Rose, ``A Course in Number Theory,'' Oxford Univ. Press, Oxford, 1994. 14. W. W. Smith and J. M. Smith, ``Handbook of Real-Time Fast Fourier Transform, IEEE Press, New York, 1995. 15. M. F. Zakaria, L. J. Vroomen, P. J. A. Zsombar-Murray, and J. M. H. M. Van Kessel, Fast algorithm for computation of moment invariants, Pattern Reconignition 20 (1987), 634643.
File: DISTL1 147211 . By:GC . Date:29:09:98 . Time:11:48 LOP8M. V8.B. Page 01:01 Codes: 6044 Signs: 1931 . Length: 52 pic 10 pts, 222 mm