Signal Processing 14 (1988) 237-241 North-Holland
237
ON DISCRETE FOURIER T R A N S F O R M S (NUMBER-THEORETIC T R A N S F O R M S ) W I T H M I N I M U M N U M B E R OF ROTATIONS ( S H I F T S ) P. DUHAMEL and H. H O L L M A N N * CNET/ PAB/ RPE, 38-40, Avenue de la R~publique, 92131 lssy-les-Moulineaux, France Received 30 June 1986 Revised 4 March 1987 and 2 November 1987
Abstract. First we present some short-length DFTs with minimum number of rotations in a given class of algorithms. In the case of prime length ( N = p) we give a systematic algorithm which is optimum for N = 3 and N = 5. We then discuss how to combine short-length DFTs when N - p" and N = P~P2through some examples.
Zusammenfassung. Zun~ichst werden einige Algorithmen bestimmter Form fiir DFTs kurzer LSnge mit der kleinstm6glichen Anzahl yon Rotationen angegeben, lm Falle von Primzahl-L~ingen ( N = p) wird ein systematischer Algorithmus angegeben, der f/ir N = 3 und N - 5 optimal ist. Danach wird diskutiert anhand einiger Beispiele, wie DFTs kurzer L~inge zu kombinieren sind, wenn N = pn oder N = PiP2 (mit Primzahlen P1, Pc) vorliegt. R6sum6. Tout d'abord, nous pr6sentons des algorithmes de calcul de Transform6es de Fourier Discr~tes (TFD) de petites longueurs n6cessitant le nombre minimum de rotations (pour une classe d'algorithmes). Puis, nous donnons un algorithme syst6matique qui est optimal pour N - 3 et N = 5. Enfin, nous discutons, ~ l'aide d'exemples, de la manibre de combiner ces algorithmes courts de TFD quand N = p" et N = P~ P2, P~ et '°2 6rant premiers entre eux. Keywords. fast Fourier transform, number-theoretic transform, CORDIC, minimum-rotation algorithm.
I. Introduction Short-length discrete Fourier transforms (DFTs) are the basic building blocks used by nearly every fast DFT algorithm. Nevertheless, the usual shortlength algorithms are generally designed for a minimum number of multiplications and, as a consequence, are not well suited in a number of circumstances. A first case, which in fact originated this work, is the computation of number-theoretic transforms (NTTs) that have been chosen in such a way that the Nth root of unity is simple (2, as an example)
[4]. A second one is the implementation of fast DFT algorithms when the elementary operator is the rotation (e.g., CORDIC processors) [1,7].
In both cases, if a is the corresponding root of unity, an expression of the type (aS+ak)x, although being a single multiplication (complex in the case of DFTs) is only attainable with two rotations (shifts for NTTs with 2 as a root of unity) and an addition. This is due to the fact that, mathematically, the set of constants has not the field structure. The problem of designing minimum rotation (shifts) algorithms, is then quite different from the usual one (minimum multiplication algorithm) since the underlying mathematical structure has completely changed. Therefore, it is not possible to use the fundamental results of Winograd anymore, and we could not derive any analytical result, due to the weakness of the structure of this problem, stated in terms of mathematics.
* H. Hollmann is now with Philips, Eindhoven, The Netherlands.
0165-1684/88/$3.50 O 1988, Elsevier Science Publishers B.V. (North-Holland)
P. Duhamel, H. Hollmann / D F T with minimum number of rotations
238
Nevertheless, we were able to derive by combinatorial means some (new) results providing both new algorithms and proof of optimality (if the number of rotations is considered) for length-P NTTs (DFTs), for P = 3, P = 5, and P = 7. More precisely, we first present some shortlength DFTs with minimum number of rotations in a given class of algorithms. For several prime values of the length N, we find the minimum number of rotations R to be N=3,
R=l,
N=5,
R =6,
N=7,
R=13.
In the case of N prime, we also derived from symmetry considerations on the transform matrix a systematic algorithm, giving an upper bound for the number of rotations. This upper bound meets the minimum number for N = 3, N = 5. We then give some additional results for which we could not obtain any proof of optimality. By working on some examples of composite length N = p", and N = N n N 2 , N 1 and N2 being coprime, we show that different algorithms for DFT computations give the same number of rotations, which led to some conjectures on the 'best' algorithms to combine short-length DFTs. Let us point out that, since this work was originated by the study of NTTs, some of the results given here should be adapted when working on DFTs. In fact, in the case of NTTs, a multiplication by j - - ~ is not a simpler shift than the other ones (in many cases, it is not even a shift, when the order of 2 cannot be divided by 4), while obviously, in the case of complex numbers, a multiplication by j is much simpler than performing a general multiplication by e i°. In the remainder of this paper we shall try to point out when such differences will occur. Anyway, our main intent here is to state (and partly solve) a mathematical problem which is exactly the opposite of the work that was done by Winograd on DFTs [8, 9]: We want to use only rotations, while Winograd avoids them completely. Signal Processing
(Winograd's DFTs make systematic use of multiplications by (e j° + e-J°), which is either purely real or purely imaginary.) Nevertheless, it should be clear that, for people wanting to implement FFTs using C O R D I C processors, the real problem is somewhere in-between [10].
2. Optimality--The class of algorithms considered In this paper, we shall only consider algorithms in the following class: • The only allowed operations are addition and rotation, • each rotation is applied to a sum of previously computed terms. Under those conditions, the algorithms that will be given for N = 3, 5, 7 were proven to be optimum. Due to the mathematical structure of this problem, the proofs of optimality were obtained by combinatorial techniques based mainly on the structure of the DFT matrix (symmetry, appearance of every possible power of a, the N t h root of unity in each row and column). These proofs are very tedious and will not be given here.
3. 'Optimum' algorithm 3.1. D F T length 3 X o = xo+ Xl + X2, X 1 = X o + o~x I -q- 13t2X2 ~
(1)
X 2 = Xo-~ o/2Xl q- or,x2 '
with a 3 = 1 (a is the 3rd root of unity). Since a 3 - 1 = (a
-
1)(or2+ ot + 1) = O,
and ( a - l ) has an inverse both in the case of complex numbers (DFTs) and NTTs (when the transform exists), we get a2+a+l
=0.
(2)
239
P. Duhamel, H. Hollmann / DFT with minimum number of rotations
Inserting (2) into (1), we easily get the following algorithm: R o t a t i o n step: R 1 = a ( X 1 - - X 2 ) , X o : X o + X 1 +X2,
3.3. D F T length 7
If a is now the 7th root of unity, one of the minimal multiplication algorithms is given as follows: i = 1 , 2 , 3 , 4 , 5.
yi=xi--X6,
Xl = Xo-X2+ Rl,
R o t a t i o n steps:
X 2 -~ X 0 -- X 1 -- R~.
R 1= a4yl, R2 = a2y2, R3 = a 6y2,
3.2. D F T length 5
R 4 = a 6Y3, R 5 = a2y4,
X o = Xo + XI + X2-1- X3 + X 4 , X 1 = X 0 -~- a x 1 -[- a 2X 2 At- ol 3X 3 -[- a 4 x 4 , X 2 = Xo~- a 2 x ! -[- a 4 x 2 At- a x 3 -~- a 3 2 4 ,
(3)
R 6 = a 6y4, R7 = a 3Ys, R8 = a 5Ys, R9 = aa(y4 + RI +
X 3 ~-- X o - 1- a 3 x 1 .-[- a x 2 - 1 - of 4 x 3 --t- a 2 x 4 ,
R4),
Rio = a2(yl + R2 + R6),
X 4 ~-- x 0 At- 19/4x 1 ~- a 3 x 2 + a 2 x 3 --b a x 4 ,
with a s = 1 ( a is the 5th root of unity). For the same reason as for N = 3, we get
Rll = a3(yl + g 4 + R5 + R8), R12 = a6(y5 + g 2 + R4), Ri3 = a ( y 3 + RL + R2 + R7); 6
a4+ a3+a2+a
+ 1=0
(4)
Yo=2
xi,
i=1
and, with that expression, we can delete by changes of variables the last row and column, to get the following algorithm: Yi = xi - x4,
i = 1, 2, 3.
Y~ = - x 6 + R2+ R8 + R9, Y2 = - - X 6 + R4+ RT+ Rio, Y3 = --X6+ R3+ Rll, Y4 = - x 6 + RI + R5 + R12,
y~ = --X6+ R6+ R13 ; i = 0 , 1, 2, 3,4, 5,
Xi=xo+Yi,
5
R o t a t i o n steps:
i=0
R 1 = a3yl, R2 = a 2y2,
3.4. A s y s t e m a t i c algorithm f o r N p r i m e
R 3 = cry3, R4 -----a3(Rl +Y3), R~-- a2(R2+ y 0 , R6 = a4( R~ + Y3), Yo = Xl + X2 + X3 + X4,
YI =
-x4-{-
X~ = X o - ~ Y,.
R4+ R2,
As an example, let us consider the N---5 algorithm: Since a4 + a3 + a2 + a + 1 = 0, we can replace each coefficient along the main diagonal in the 'working part' of equations (3) (i.e., the part that involves multiplications by powers of a): - - X l -- a 2 ( X l
- - X 2 ) -- a 3 ( X l
- - X 3 ) -- 0 ' 4 ( 2 1 -- a 4 ) '
}I2 = --X4 "~ R5 + R3,
Y3 = - x 4 + R6+ R1, i = 0 , 1, 2,3, X4 = Xo- Yo- Y l - Y2- g3.
Xi=xo+Yi,
a 2 ( X l -- X2) -- X 2 -- 0£ ( X 2 -- X3) -- a 3 ( x 2
-- x 4 ) ,
Ot'3(Xl -- X3) -~- a ( X 2 -- X3) -- X 3 -- a 2 ( x 3
-- x 4 ) ,
(5) a4(xl
-- x 4 ) -~ a 3 ( x 2
-- x 4 ) -~- a 2 ( x 3
-- x 4 ) -- x 4 , Vol. 14, No 3, April 1988
240
P. Duhamel,
H.
Hollmann
/ DFT
Since the resulting matrix is antisymmetric, the DFT of length 5 can be obtained with six rotations. P l o c i = 0 , for oc This method is general, since Y.~=o the pth root of unity. Then,
OC X i . . . . . . . . . . . . . . . . . . . . I I I I I
OC Xj I I I I I I I I
I I I' J ji OC X i . . . . . . . . . . . . . . . . . . . .
1
' "" OlYJ~,j
with
minimum
number
of rotations
sidered since it requires more rotations than the PFA). The first one (WFTA) is not useful here, since it is based on a property of the usual short-length DFT algorithms (multiplications are 'nested' at the center of the algorithm, which corresponds to the following factorization of the transform matrix: Tm = Uioc~V~, where the multiplications are met only in the diagonal matrix oct), which does not hold in minimum rotation algorithms. Let us now consider PFA: if R~ and R2 are the numbers of rotations for DFTs of length p~ and p2 respectively, the total number of operations is
can be replaced by RpFA ......................
-x,
=
PI R2 + P2R1
and, for N = 6 for example, this gives
oc 'j (XJ - x,)
I I I I I I I I I I I I
R6= 2.1 +0.3 = 2.
ocJ' ( x~ - x j ) . . . . . . . . . . . . . . . . .
Another algorithm for N = 6 can be obtained by making use of the properties of the 6th root of unity:
x
oc3 = - - 1 , This is the base of an algorithm computing a length-p D F T with a number of rotations equal to the number of elements in the lower triangular part of the 'working part' of the D F T matrix: R =½(p- 1)(p-2),
(6)
which gives p=3,
R = 1 (optimum" 1),
p=5,
R = 6 (optimum:6),
p=7,
R = 15 (optimum: 13).
a 2 -a+l
(7) =0.
After substitution of (7) in the length-6 DFT, it becomes quite obvious that the D F T can be obtained with two rotations (the same number as PFA). This was the same situation as for N = 12: decomposed as 4 x 3 , we obtained R12=7 with PFA, which could not be improved by any of the algorithms we tried. This is a general observation, which led us to the following conjecture.
Conjecture. The minimum number of rotations in 4. Algorithms for composite lengths Case 1: N = P~P2. In this case, if PI and P2 are mutually prime, two kinds of algorithms are well known in the usual case: the Winograd algorithm [8] (WFTA) and the Prime Factor algorithm (PFA) [5] (the mixed-radix algorithm has not been conSignal Processing
computing a D F T of length N = P1P2, with P~ and P2 different primes, is given by the PEA combining two minimum-rotation algorithms of length P~ and P2.
Case 2: N = P " . As an example, let us take N = 3 2. At first algorithm is obtained as follows.
P. Duhamel, H. Hollmann / DFT with minimum number qf rotations
Considering that O~6"~ O~3"Jt- 1 = 0, and Ot'3 is the 3rd root o f unity, it is easy to see that Xo, X3, and X6 are outputs of a length-3 DFT, and are then c o m p u t e d with one rotation. It is also easy to see that columns 3 and 6 o f the transform are obtained with one rotation: Og3(X3- X6). The 'working part' o f the D F T remains and, after a permutation, is given as a convolution o f length 6. This convolution can be c o m p u t e d with a length-6 D F T (the 6th root o f unity is then - a 3 ) , and the whole procedure gives a total o f ten rotations. A radix-3 algorithm would require the same n u m b e r o f rotations: six rotations inside the 'butterflies' (one rotation each), and four twiddle factors. It is our experience that, for any example we considered, we could never obtain fewer rotations than with a radix-P algorithm, provided that a multiplication by every p o w e r o f c~ (except a N/2 = - 1 ) is considered to have the same complexity. This corresponds to one o f the cases considered here since, in the case o f NTTs with 2 as a root of unity, there is no such simplification as a N/4 = ±j. (NTTs are not usually defined on extension rings.) Nevertheless, when one is interested in implementing DFTs with C O R D I C processors, it is legitimate to consider that a multiplication by j_a ~ is m u c h simpler than a general rotation. In that case, when multiplications by j are not c o u n t e d as rotations, the 'split-radix' algorithm ( S R F F T ) [3] seems to be the optimum. In fact, we have shown [2] that the S R F F T meets the lower b o u n d on the n u m b e r o f nontrivial complex multiplications needed to c o m p u t e a length-2 n D F T up to and including N = 64. Hence, since minimizing the n u m b e r o f rotations is a special case o f minimizing the n u m b e r of complex multiplications, the S R F F T is also o p t i m u m with the criterion considered in this paper at least up to length 64.
241
5. Conclusion We have considered in this paper a s o m e w h a t new problem that arises, for example, in NTTs with 2 as the root of unity, or in DFTs implemented by C O R D I C processors. In those cases, we give o p t i m u m short-length D F T algorithms for N = 3, 5, 7, and a n u m b e r o f conjectures on the best way to combine them to obtain long transform. (Note that this problem is not very well understood in the usual case, either.) Further results would need a lot o f research, due to the weakness of the underlying mathematical structure. References [ 1] A. Despain, "'Very fast Fourier transform algorithms hardware for implementation", IEEE Trans. Comput., Vol. C-28, No. 5, May 1979, pp. 333-341. [2] P. Duhamel, "'Algorithms meeting the lower bounds on the multiplicative complexity of length-2" DFT's and their connection with practical algorithms", IEEE Trans. Acousr, Speech, Signal Process., submitted for publication. [3] P. Duhamel and H. Hollmann, "Split-radix FFT algorithm", Electronics Letr, Vol. 20, No. 1, January 1984, pp. 14-16. [4] H. Hollmann and P. Duhamel, "Longer NTT's with 2 as a root of unity", Proc. ICASSP 83, 1983, pp. 159-162. [5] D.P. Kolba and T.W. Parks, "A prime-factor FFT algorithm using high-speed convolution", IEEE Trans. Acousr, Speech, Signal Process., Vol. ASSP-25, 1977, pp. 90-103. [6] H.J. Nussbaumer, Fast Fourier TransJorm and Convolution Algorithms, Springer, Berlin, 1981.
[7] J.E. Voider, "The CORDIC trigonometric computing technique", IRE Trans. Elec. Compur, Vol. EC-8, No. 3, 1959, pp. 330-334. [8] S. Winograd, "On computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. U.S.A., Vol. 73, 1976, pp. 1005-1006. [9] S. Winograd, "'On the multiplicative complexity of the Discrete Fourier Transform", Adv. in Math., Vol. 32, 1979, pp. 83-117. [10] E.H. Wold and A.M. Despain, "'Pipeline and parallelpipeline FFT processors for VLSI implementations", IEEE Trans. Cornpur, Vol. C-33, No. 5, May 1984, pp. 414-426.
Vol 14, No. 3, April 1988