A parallel 1-D FFT algorithm for the Hitachi SR8000

Parallel Computing 29 (2003) 679–690
www.elsevier.com/locate/parco

Daisuke Takahashi
Institute of Information Sciences and Electronics, University of Tsukuba, 1-1-1 Tennodai, Tsukuba-shi, Ibaraki 305-8573, Japan

Received 17 September 2001; received in revised form 20 December 2002; accepted 24 January 2003

Abstract

In this paper, we propose a high-performance parallel one-dimensional fast Fourier transform (FFT) algorithm on clusters of vector symmetric multiprocessor (SMP) nodes. The four-step FFT algorithm can be altered into a five-step FFT algorithm to expand the innermost loop length. We use the five-step algorithm to implement the parallel one-dimensional FFT algorithm. In our proposed parallel FFT algorithm, since we use cyclic distribution, all-to-all communication takes place only once. Moreover, the input data and output data are both in natural order. Performance results of one-dimensional power-of-two FFTs on clusters of pseudo-vector SMP nodes, the Hitachi SR8000, are reported. We succeeded in obtaining performance of over 61 GFLOPS on a 16-node Hitachi SR8000/MPP.
© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Fast Fourier transform; Distributed-memory parallel computer; Vector symmetric multiprocessor; Cyclic distribution; All-to-all communication

1. Introduction

The fast Fourier transform (FFT) [1] is an algorithm widely used today in science and engineering. Parallel FFT algorithms on distributed-memory parallel computers have been well studied [2–5]. Many vendors support parallel FFTs on clusters of scalar symmetric multiprocessors (SMPs) [6], but few vendors support parallel FFTs on clusters of vector SMP nodes.

E-mail address: [email protected] (D. Takahashi).
doi:10.1016/S0167-8191(03)00039-5


In this paper, we propose a high-performance parallel one-dimensional FFT algorithm on clusters of vector SMP nodes. Our proposed parallel one-dimensional FFT algorithm is based on the four-step FFT algorithm [7,8]. On a vector SMP node, the innermost loop length is particularly important for obtaining high performance. The four-step FFT algorithm can be altered into a five-step FFT algorithm to expand the innermost loop length, and we use the five-step algorithm to implement the parallel one-dimensional FFT algorithm. We have implemented the parallel one-dimensional FFT algorithm on clusters of pseudo-vector SMP nodes, the Hitachi SR8000. In this sense, the present work plays a complementary role to FFTW [5], which is intended mainly for RISC processors. Although FFTW works well for cache-based processors, it does not work well for vector processors [9].

Section 2 describes the Hitachi SR8000 architecture and its unique features. Section 3 describes the four-step FFT algorithm. Section 4 describes a five-step FFT algorithm. In Section 5, we propose a parallel one-dimensional FFT algorithm. Section 6 describes a power-of-two FFT algorithm on a single vector SMP node. Section 7 gives performance results. In Section 8, we provide some concluding remarks.

2. Hitachi SR8000 architecture

The Hitachi SR8000 system [10] is a distributed-memory parallel computer with pseudo-vector SMP nodes. An SR8000 system consists of from 4 to 512 nodes. The node processor of the SR8000/MPP is a 2.2 ns PowerPC-based processor with major enhancements made by Hitachi [11]. Each node contains eight RISC microprocessors which have the "pseudo-vector processing" (PVP) feature [12]. This allows data to be fetched from main memory, in a pipelined manner, without stalling the succeeding instructions. The result is that data is fed from memory into the arithmetic units as effectively as in a vector supercomputer. This feature was already available on the CP-PACS [12], and experiments have shown that the idea works well for vector processing. The peak performance of a processor, or instruction processor (IP), is attained with two simultaneous multiply/add instructions, resulting in a speed of 1.8 GFLOPS on the SR8000/MPP. Each processor has a first-level write-through 128 KB four-way set-associative on-chip data cache with 128-byte cache lines; the instruction cache is a two-way set-associative 64 KB on-chip cache.

Eight processors are coupled to form one processing node, all addressing a common part of the memory. Hitachi refers to this node configuration as COMPAS: cooperative microprocessors in single address space. The peak performance of a node of the SR8000/MPP is 14.4 GFLOPS, and the maximum memory capacity is 16 GB. The nodes of the SR8000 are interconnected through a multidimensional crossbar network. The communication bandwidth available at a node of the SR8000/MPP is 1.6 GB/s (single direction) × 2. An automatic parallelizing compiler is provided to perform parallelization across the shared-memory multiprocessors within each node of the SR8000.


3. The four-step FFT algorithm

The discrete Fourier transform (DFT) is given by

$$y_k = \sum_{j=0}^{n-1} x_j\,\omega_n^{jk}, \qquad 0 \le k \le n-1, \qquad (1)$$

where $\omega_n = e^{-2\pi i/n}$ and $i = \sqrt{-1}$. If $n$ has factors $n_1$ and $n_2$ ($n = n_1 \times n_2$), then the indices $j$ and $k$ can be expressed as

$$j = j_1 + j_2 n_1, \qquad k = k_2 + k_1 n_2. \qquad (2)$$

We can define $x$ and $y$ as two-dimensional arrays (in Fortran notation):

$$x_j = x(j_1, j_2), \qquad 0 \le j_1 \le n_1 - 1,\ 0 \le j_2 \le n_2 - 1, \qquad (3)$$

$$y_k = y(k_2, k_1), \qquad 0 \le k_1 \le n_1 - 1,\ 0 \le k_2 \le n_2 - 1. \qquad (4)$$

Substituting the indices $j$ and $k$ in Eq. (1) with those in Eq. (2), and using the relation $n = n_1 \times n_2$, we can derive the following equation:

$$y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2}\,\omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1}. \qquad (5)$$
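To see where the three twiddle factors in Eq. (5) come from, the exponent can be expanded directly (a short check we add here, using only the definitions above):

$$\omega_n^{jk} = \omega_n^{(j_1 + j_2 n_1)(k_2 + k_1 n_2)} = \omega_n^{j_2 n_1 k_2}\,\omega_n^{n j_2 k_1}\,\omega_n^{j_1 k_2}\,\omega_n^{j_1 k_1 n_2} = \omega_{n_2}^{j_2 k_2}\cdot 1\cdot \omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1},$$

since $\omega_n^{\,n} = 1$, $\omega_n^{\,n_1} = \omega_{n_2}$, $\omega_n^{\,n_2} = \omega_{n_1}$, and $n = n_1 n_2$.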

This derivation leads to the following four-step FFT algorithm [7,8]:

Step 1: $n_1$ simultaneous $n_2$-point multirow FFTs:
$$x_1(j_1, k_2) = \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2}.$$

Step 2: Twiddle factor multiplication:
$$x_2(j_1, k_2) = x_1(j_1, k_2)\,\omega_{n_1 n_2}^{j_1 k_2}.$$

Step 3: Transpose:
$$x_3(k_2, j_1) = x_2(j_1, k_2).$$

Step 4: $n_2$ simultaneous $n_1$-point multirow FFTs:
$$y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} x_3(k_2, j_1)\,\omega_{n_1}^{j_1 k_1}.$$

A minimal code sketch of these four steps is given after the feature summary below. The distinctive features of the four-step FFT algorithm can be summarized as:

• If $n_1$ is equal to $n_2$ ($n_1 = n_2 \equiv \sqrt{n}$), the innermost loop length can be fixed to $\sqrt{n}$. This feature makes the algorithm suitable for vector processors.
• A matrix transposition takes place just once (Step 3).
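To make Steps 1–4 concrete, here is a minimal, unoptimized Fortran sketch (our illustration, not the implementation evaluated in Section 7): the subroutine name is ours, the sub-DFTs of Steps 1 and 4 are computed naively in O(n^2) time for self-containedness, and the twiddle factors are computed on the fly rather than read from a precomputed table.

    ! Four-step DFT sketch: x(j1,j2) as in Eq. (3) in, y(k2,k1) as in Eq. (4) out.
    subroutine four_step_dft(n1, n2, x, y)
      implicit none
      integer, intent(in) :: n1, n2
      complex(8), intent(in)  :: x(n1, n2)
      complex(8), intent(out) :: y(n2, n1)
      complex(8) :: x1(n1, n2), x3(n2, n1), s
      real(8), parameter :: pi = 3.141592653589793d0
      integer :: j1, j2, k1, k2
      do k2 = 1, n2            ! Step 1: n1 simultaneous n2-point DFTs
         do j1 = 1, n1
            s = (0.0d0, 0.0d0)
            do j2 = 1, n2
               s = s + x(j1, j2) * exp(cmplx(0.0d0, &
                   -2.0d0*pi*real((j2-1)*(k2-1), 8)/real(n2, 8), 8))
            end do
            ! Step 2: twiddle factor omega_{n1*n2}^{j1*k2}
            x1(j1, k2) = s * exp(cmplx(0.0d0, &
                -2.0d0*pi*real((j1-1)*(k2-1), 8)/real(n1*n2, 8), 8))
         end do
      end do
      x3 = transpose(x1)       ! Step 3: x3(k2,j1) = x2(j1,k2)
      do k1 = 1, n1            ! Step 4: n2 simultaneous n1-point DFTs
         do k2 = 1, n2
            s = (0.0d0, 0.0d0)
            do j1 = 1, n1
               s = s + x3(k2, j1) * exp(cmplx(0.0d0, &
                   -2.0d0*pi*real((j1-1)*(k1-1), 8)/real(n1, 8), 8))
            end do
            y(k2, k1) = s
         end do
      end do
    end subroutine four_step_dft

Replacing the two naive loops with the multirow FFT kernels of Section 6 recovers the actual algorithm.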


4. A five-step FFT algorithm

We can extend the four-step FFT algorithm in another way into a three-dimensional formulation. If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then the indices $j$ and $k$ can be expressed as

$$j = j_1 + j_2 n_1 + j_3 n_1 n_2, \qquad k = k_3 + k_2 n_3 + k_1 n_2 n_3. \qquad (6)$$

We can define $x$ and $y$ as three-dimensional arrays (in Fortran notation):

$$x_j = x(j_1, j_2, j_3), \qquad 0 \le j_1 \le n_1 - 1,\ 0 \le j_2 \le n_2 - 1,\ 0 \le j_3 \le n_3 - 1, \qquad (7)$$

$$y_k = y(k_3, k_2, k_1), \qquad 0 \le k_1 \le n_1 - 1,\ 0 \le k_2 \le n_2 - 1,\ 0 \le k_3 \le n_3 - 1. \qquad (8)$$

Substituting the indices $j$ and $k$ in Eq. (1) by those in Eq. (6) and using the relation $n = n_1 n_2 n_3$, we can derive the following equation:

$$y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3}\,\omega_{n_2 n_3}^{j_2 k_3}\,\omega_{n_2}^{j_2 k_2}\,\omega_{n}^{j_1 k_3}\,\omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1}. \qquad (9)$$

This derivation leads to the following five-step FFT:

Step 1: $n_1 n_2$ simultaneous $n_3$-point multirow FFTs:
$$x_1(j_1, j_2, k_3) = \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3}.$$

Step 2: Twiddle factor multiplication and transpose:
$$x_2(k_3, j_1, j_2) = x_1(j_1, j_2, k_3)\,\omega_{n_2 n_3}^{j_2 k_3}.$$

Step 3: $n_3 n_1$ simultaneous $n_2$-point multirow FFTs:
$$x_3(k_3, j_1, k_2) = \sum_{j_2=0}^{n_2-1} x_2(k_3, j_1, j_2)\,\omega_{n_2}^{j_2 k_2}.$$

Step 4: Twiddle factor multiplication and rearrangement:
$$x_4(k_3, k_2, j_1) = x_3(k_3, j_1, k_2)\,\omega_{n}^{j_1 k_3}\,\omega_{n_1 n_2}^{j_1 k_2}.$$

Step 5: $n_3 n_2$ simultaneous $n_1$-point multirow FFTs:
$$y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} x_4(k_3, k_2, j_1)\,\omega_{n_1}^{j_1 k_1}.$$

We note that we combined some of the operations with data movements as in Steps 2 and 4 to gain efficiency in utilizing the memory bandwidth.
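As an illustration of this fusion, a loop nest of the following shape (our sketch; the subroutine form is illustrative, and a production code would read the factors $\omega_{n_2 n_3}^{j_2 k_3}$ from a precomputed table rather than call exp in the loop) performs the Step 2 multiplication during the transpose, so x1 is read and x2 written exactly once:

    ! Step 2 sketch: twiddle multiplication fused with the transpose.
    subroutine twiddle_transpose(n1, n2, n3, x1, x2)
      implicit none
      integer, intent(in) :: n1, n2, n3
      complex(8), intent(in)  :: x1(n1, n2, n3)
      complex(8), intent(out) :: x2(n3, n1, n2)
      real(8), parameter :: pi = 3.141592653589793d0
      integer :: j1, j2, k3
      do j2 = 1, n2
         do j1 = 1, n1
            do k3 = 1, n3
               ! x2(k3,j1,j2) = x1(j1,j2,k3) * omega_{n2*n3}^{j2*k3}
               x2(k3, j1, j2) = x1(j1, j2, k3) * exp(cmplx(0.0d0, &
                    -2.0d0*pi*real((j2-1)*(k3-1), 8)/real(n2*n3, 8), 8))
            end do
         end do
      end do
    end subroutine twiddle_transpose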


The distinctive features of the five-step FFT can be summarized as:

• If $n_1$, $n_2$ and $n_3$ are equal ($n_1 = n_2 = n_3 \equiv n^{1/3}$), the innermost loop length can be fixed to $n^{2/3}$. Thus the five-step FFT algorithm is more suitable for vector processors than the four-step FFT algorithm.
• A matrix transposition with twiddle factor multiplication takes place twice (Steps 2 and 4).

The five-step FFT algorithm is based on representing the one-dimensional array $x(n)$ as a three-dimensional array $x(n_1, n_2, n_3)$. This idea can also be found in Hegland's approach [3]. One difference between the five-step FFT algorithm and Hegland's approach is in the choice of the sizes of the $n_i$ ($1 \le i \le 3$). Both Hegland's approach and the five-step FFT algorithm choose $n_1 = n_3$. Hegland chooses $n_2$ to be of minimal size [3]. Taking the opposite approach, we suggest that all the $n_i$ ($1 \le i \le 3$) should be of equal size, because this maximizes the minimum innermost loop length, i.e. $\min(n_1 n_2,\, n_2 n_3,\, n_3 n_1)$; a short justification is given below. The minimum loop length needed to obtain peak performance is of the order of 1000 on the Hitachi SR8000 [13]; in this respect the innermost loop length is particularly important for high performance on the SR8000. In view of the innermost loop length, the five-step FFT algorithm is more advantageous than Hegland's approach. On the other hand, Hegland's approach requires only one transpose step, whereas the five-step FFT algorithm requires two.
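The maximization claim follows from a one-line argument we add for completeness: the product of the three candidate loop lengths is fixed,

$$(n_1 n_2)(n_2 n_3)(n_3 n_1) = (n_1 n_2 n_3)^2 = n^2,$$

so $\min(n_1 n_2,\, n_2 n_3,\, n_3 n_1) \le n^{2/3}$, with equality exactly when $n_1 = n_2 = n_3 = n^{1/3}$.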

5. Parallel FFT algorithm based on the five-step FFT

Let us consider how to perform long-vector FFTs on clusters of vector SMP nodes. Let $N$ have factors $N_1$, $N_2$ and $N_3$ ($N = N_1 \times N_2 \times N_3$). The original one-dimensional array $x(N)$ can be defined as a three-dimensional array $x(N_1, N_2, N_3)$ (in Fortran notation). On a distributed-memory parallel computer which has $P$ nodes, the array $x(N_1, N_2, N_3)$ is distributed along the first dimension $N_1$. If $N_1$ is divisible by $P$, each node has distributed data of size $N/P$. We introduce the notation $\hat{N}_r \equiv N_r/P$ and we denote the corresponding index as $\hat{J}_r$, which indicates that the data along $J_r$ are distributed across all $P$ nodes. Here, we use the subscript $r$ to indicate that this index belongs to dimension $r$. The distributed array is represented as $\hat{x}(\hat{N}_1, N_2, N_3)$. At node $m$, the local index $\hat{J}_r(m)$ corresponds to the global index according to the cyclic distribution:

$$J_r = \hat{J}_r(m) \cdot P + m, \qquad 0 \le m \le P - 1,\ 1 \le r \le 3. \qquad (10)$$

To illustrate the all-to-all communication it is convenient to decompose $N_i$ into two dimensions $\tilde{N}_i$ and $P_i$, where $\tilde{N}_i \equiv N_i/P_i$. Although $P_i$ is the same as $P$, we use the subscript $i$ to indicate that this index belongs to dimension $i$.
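As a concrete illustration of the cyclic mapping in Eq. (10) (our example, not from the paper): with $P = 4$ and $N_r = 16$, node $m = 1$ owns the global indices $J_r \in \{1, 5, 9, 13\}$. In code, the two directions of the mapping are one line each (illustrative helpers, with 0-based indices as in Eq. (10)):

    ! Cyclic layout helpers (illustrative; 0-based indices as in Eq. (10)).
    integer function local_to_global(jhat, m, p)
      implicit none
      integer, intent(in) :: jhat, m, p   ! local index, node number, node count
      local_to_global = jhat*p + m        ! Eq. (10): J_r = Jhat_r(m)*P + m
    end function local_to_global

    integer function owner_node(j, p)
      implicit none
      integer, intent(in) :: j, p
      owner_node = mod(j, p)              ! global index J_r lives on node mod(J_r, P)
    end function owner_node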


Starting with the initial data $\hat{x}(\hat{N}_1, N_2, N_3)$, the five-step FFT-based parallel FFT can be performed according to the following steps:

Step 1: $(N_1/P)\,N_2$ simultaneous $N_3$-point multirow FFTs:
$$\hat{x}_1(\hat{J}_1, J_2, K_3) = \sum_{J_3=0}^{N_3-1} \hat{x}(\hat{J}_1, J_2, J_3)\,\omega_{N_3}^{J_3 K_3}.$$

Step 2: Twiddle factor multiplication and transpose:
$$\hat{x}_2(K_3, \hat{J}_1, J_2) = \hat{x}_1(\hat{J}_1, J_2, K_3)\,\omega_{N_2 N_3}^{J_2 K_3}.$$

Step 3: $N_3\,(N_1/P)$ simultaneous $N_2$-point multirow FFTs:
$$\hat{x}_3(K_3, \hat{J}_1, K_2) = \sum_{J_2=0}^{N_2-1} \hat{x}_2(K_3, \hat{J}_1, J_2)\,\omega_{N_2}^{J_2 K_2}.$$

Step 4: Twiddle factor multiplication and transpose:
$$\hat{x}_4(\tilde{K}_3, \hat{J}_1, K_2, P_3) = \hat{x}_3(P_3, \tilde{K}_3, \hat{J}_1, K_2)\,\omega_{N}^{\hat{J}_1(K_3 + K_2 N_3)} \equiv \hat{x}_3(K_3, \hat{J}_1, K_2)\,\omega_{N}^{\hat{J}_1(K_3 + K_2 N_3)}.$$

Step 5: All-to-all communication:
$$\hat{x}_5(\hat{K}_3, \tilde{J}_1, K_2, P_1) = \hat{x}_4(\tilde{K}_3, \hat{J}_1, K_2, P_3).$$

Step 6: Rearrangement:
$$\hat{x}_6(\hat{K}_3, K_2, J_1) \equiv \hat{x}_6(\hat{K}_3, K_2, P_1, \tilde{J}_1) = \hat{x}_5(\hat{K}_3, \tilde{J}_1, K_2, P_1).$$

Step 7: $(N_3/P)\,N_2$ simultaneous $N_1$-point multirow FFTs:
$$\hat{y}(\hat{K}_3, K_2, K_1) = \sum_{J_1=0}^{N_1-1} \hat{x}_6(\hat{K}_3, K_2, J_1)\,\omega_{N_1}^{J_1 K_1}.$$

The distinctive features of the five-step FFT-based parallel FFT algorithm can be summarized as:

• $N^{2/3}/P$ simultaneous $N^{1/3}$-point multirow FFTs are performed in Steps 1, 3 and 7 for the case of $N_1 = N_2 = N_3 = N^{1/3}$. Therefore the innermost loop length can be fixed to $N^{2/3}/P$.
• Only one all-to-all communication is required (see the sketch after this list). Moreover, the input data $x$ and the output data $y$ are both in natural order.
• These ideas can be adopted not only for the radix-2, 4 and 8 parallel FFT but also for the radix-3, 5 and 16 parallel FFT. Similar algorithms could also be used for higher-dimensional FFT algorithms [14].

If both $N_1$ and $N_3$ are divisible by $P$, the workload on each node is also uniform.
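Since Step 5 is the only interprocessor communication, it maps onto a single collective call. A hedged sketch of this step with MPI follows (our illustration: the buffer names, the assumption that the block destined for each node is already packed contiguously in $\hat{x}_4$, and the count nelem = N/P^2 complex elements per node pair are ours; the paper states only that MPI was used as the communication library):

    ! Step 5 sketch: the single all-to-all exchange.
    subroutine step5_alltoall(x4, x5, nelem, comm)
      use mpi
      implicit none
      integer, intent(in) :: nelem, comm    ! nelem = N/P**2 complex elements
      complex(8), intent(in)  :: x4(*)      ! packed send buffer (Step 4 output)
      complex(8), intent(out) :: x5(*)      ! receive buffer (Step 6 input)
      integer :: ierr
      call MPI_Alltoall(x4, nelem, MPI_DOUBLE_COMPLEX, &
                        x5, nelem, MPI_DOUBLE_COMPLEX, comm, ierr)
    end subroutine step5_alltoall

Because the exchange happens once per transform, the communication cost remains low as $P$ grows, which is the point of using the cyclic distribution.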


5.1. Innermost loop length considerations

We now consider the innermost loop length of the four-step and five-step FFT-based parallel FFT algorithms. The innermost loop length is $\sqrt{N}/P$ in the four-step FFT-based parallel FFT algorithm [4], and $N^{2/3}/P$ in the five-step FFT-based parallel FFT algorithm. For example, given $N = 2^{24}$ and $P = 128$, the innermost loop length of the four-step FFT-based parallel FFT is $32\ (= \sqrt{2^{24}}/128)$, whereas that of the five-step FFT-based parallel FFT is $512\ (= (2^{24})^{2/3}/128)$. We can conclude that the five-step FFT-based parallel FFT algorithm is more advantageous than the four-step FFT-based parallel FFT algorithm on clusters of vector SMP nodes. On the other hand, we can combine the Pease algorithm [15] and the six-step FFT algorithm [7,8]; then the innermost loop length can be fixed to $\sqrt{N}/2$. However, the Pease algorithm needs digit-reversal operations and must access the trigonometric table in the innermost loop as well as the data.

6. Power-of-two FFT algorithm on a single vector SMP node

Parallel FFT algorithms on vector SMPs have been proposed [2,7]. On a single vector SMP node we use radix-2, 4 and 8 FFT kernels based on the Stockham algorithm [16]. The SR8000's microprocessors have a three-operand multiply-add instruction, which computes $a = \pm a \pm bc$, where $a$, $b$ and $c$ are floating-point registers. Goedecker [17] reduced the number of instructions necessary for radix-2, 3, 4 and 5 FFT kernels by maximizing the use of fused multiply-add instructions. Although Goedecker's technique works well for a four-operand multiply-add instruction which computes $d = \pm a \pm bc$, where $a$, $b$, $c$ and $d$ are floating-point registers, it does not work well for the three-operand multiply-add instruction. Therefore we use the conventional decimation-in-frequency (DIF) version of the Stockham algorithm.

Table 1 shows the number of operations required for radix-2, 4 and 8 FFT kernels on a single node of the Hitachi SR8000. The higher radices are more efficient in terms of both memory operations and floating-point operations. A high ratio of floating-point instructions to memory operations is particularly important on a vector SMP node; in view of this ratio, the radix-8 FFT is more advantageous than the radix-4 FFT. Every power-of-two FFT (except the 2-point FFT) can be performed by a combination of radix-8 and radix-4 steps containing at most two radix-4 steps. That is, the power-of-two FFTs can be performed as a length $n = 2^p = 4^q 8^r$ ($p \ge 2$, $0 \le q \le 2$, $r \ge 0$).


Table 1
Real inner-loop operations for radix-2, 4 and 8 FFT kernels based on the DIF Stockham FFT on a single node of the Hitachi SR8000

                                                 Radix-2   Radix-4   Radix-8
    Loads and stores                             8         16        32
    Multiplications                              4         12        32
    Additions                                    6         22        66
    Total floating-point operations (n log2 n)   5.000     4.250     4.083
    Floating-point instructions                  8         28        84
    Floating-point/memory ratio                  1.000     1.750     2.625

6.1. The multirow FFT

Let $n = 2^p$ and $\omega_q = e^{-2\pi i/q}$. An $ns$ simultaneous $n$-point radix-2 multirow FFT based on the DIF Stockham FFT is as follows:

    do t = 1, p
       l = 2**(p-t)
       m = 2**(t-1)
       complex*16 X_{t-1}(ns, m, 2*l), X_t(ns, 2*m, l)
       do j = 1, l
          do k = 1, m
             do row = 1, ns
                c0 = X_{t-1}(row, k, j)
                c1 = X_{t-1}(row, k, j+l)
                X_t(row, k,   j) = c0 + c1
                X_t(row, k+m, j) = omega_{2l}**(j-1) * (c0 - c1)
             end do
          end do
       end do
    end do

Here the variables c0 and c1 are temporary variables.

6.2. Parallelism on a single vector SMP node

On a single node of the Hitachi SR8000, the outer loop of each FFT kernel is distributed across the node's eight microprocessors. In the multirow FFT kernel, this corresponds to the do j loop. The difficulty with distributing the outer loop is that its range, $l = 2^{p-t}$, shrinks to one as $t$ increases. When the range of $j$ is less than eight, it is necessary to interchange the loop indices $j$ and $k$. This permits the compiler to distribute the do k loop in a way that keeps all eight microprocessors active.
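A sketch of the interchanged kernel follows (our addition; the subroutine form and names are illustrative). Since $m \cdot l = n/2$ is fixed within a transform, $m = n/(2l)$ is large exactly when $l$ is small, so for $l < 8$ the do k loop becomes the profitable outer loop:

    ! Interchanged radix-2 stage for small l (illustrative sketch).
    subroutine radix2_stage_interchanged(ns, l, m, w, xin, xout)
      implicit none
      integer, intent(in) :: ns, l, m
      complex(8), intent(in)  :: w(l)            ! w(j) = omega_{2l}**(j-1), precomputed
      complex(8), intent(in)  :: xin(ns, m, 2*l)
      complex(8), intent(out) :: xout(ns, 2*m, l)
      integer :: j, k, row
      complex(8) :: c0, c1
      do k = 1, m                  ! outer loop is now long enough to distribute
         do j = 1, l
            do row = 1, ns         ! innermost loop kept long for the PVP feature
               c0 = xin(row, k, j)
               c1 = xin(row, k, j+l)
               xout(row, k,   j) = c0 + c1
               xout(row, k+m, j) = w(j) * (c0 - c1)
            end do
         end do
      end do
    end subroutine radix2_stage_interchanged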

7. Performance results

To evaluate the proposed parallel one-dimensional FFT, we compared its performance against that of the Hitachi MATRIX/MPP library parallel one-dimensional FFT routine HZFT4MHP. We averaged the elapsed times obtained from 10 executions of complex forward FFTs.

Table 2
Performance of five-step FFT-based parallel one-dimensional FFTs on the Hitachi SR8000/MPP

            P = 1              P = 2              P = 4              P = 8              P = 16
    n       Time     GFLOPS    Time     GFLOPS    Time     GFLOPS    Time     GFLOPS    Time     GFLOPS
    2^20    0.02171  4.830     0.01394  7.522     0.00801  13.090    0.00526  19.952    0.00378  27.731
    2^21    0.04815  4.573     0.03009  7.318     0.01725  12.765    0.01062  20.736    0.00620  35.516
    2^22    0.09666  4.773     0.05953  7.750     0.03392  13.602    0.02013  22.918    0.01118  41.263
    2^23    0.19607  4.920     0.12003  8.037     0.06884  14.013    0.03995  24.149    0.02144  45.004
    2^24    0.39716  5.069     0.24246  8.304     0.13928  14.455    0.08044  25.027    0.04233  47.565
    2^25    0.79802  5.256     0.48168  8.708     0.28455  14.740    0.16030  26.165    0.08140  51.527
    2^26    1.60397  5.439     0.96709  9.021     0.55766  15.644    0.31841  27.399    0.16283  53.577
    2^27    3.22622  5.616     1.94065  9.337     1.11703  16.221    0.63136  28.699    0.32672  55.459
    2^28    *        *         4.23113  8.882     2.25992  16.629    1.26482  29.712    0.64760  58.031
    2^29    *        *         *        *         4.84238  16.076    2.68237  29.021    1.35994  57.243
    2^30    *        *         *        *         *        *         5.32969  30.220    2.71541  59.314
    2^31    *        *         *        *         *        *         *        *         5.44977  61.078

Table 3
Performance of the Hitachi MATRIX/MPP library parallel one-dimensional FFT routine HZFT4MHP on the Hitachi SR8000/MPP

            P = 1              P = 2              P = 4              P = 8              P = 16
    n       Time     GFLOPS    Time     GFLOPS    Time     GFLOPS    Time     GFLOPS    Time     GFLOPS
    2^20    0.01690  6.203     0.02118  4.950     0.01071  9.786     0.00861  12.174    0.00404  25.986
    2^21    0.04444  4.955     0.04689  4.696     0.02148  10.251    0.01678  13.120    0.00943  23.353
    2^22    0.09572  4.820     0.08629  5.347     0.04301  10.727    0.03565  12.943    0.01671  27.604
    2^23    0.18921  5.098     0.19131  5.043     0.08729  11.051    0.06813  14.160    0.02984  32.331
    2^24    0.48305  4.168     0.37048  5.434     0.16830  11.962    0.14927  13.487    0.23032  8.741
    2^25    0.96154  4.362     0.81736  5.132     0.33366  12.571    0.28679  14.625    0.29945  14.007
    2^26    1.99744  4.368     1.52944  5.704     0.67305  12.962    0.56104  15.550    0.41822  20.860
    2^27    *        *         3.35317  5.404     1.31926  13.735    1.21680  14.891    2.18530  8.291
    2^28    *        *         6.29681  5.968     2.69698  13.934    1.80079  20.869    3.06919  12.245
    2^29    *        *         *        *         *        *         3.82921  20.330    4.70021  16.562
    2^30    *        *         *        *         *        *         7.59953  21.194    20.37182 7.906

The parallel FFTs were performed on double-precision complex data, and the table for twiddle factors was prepared in advance. The Hitachi SR8000/MPP (144 nodes, 2304 GB total main memory, 2073.6 GFLOPS peak performance) was used as the distributed-memory parallel computer. In the experiment, we used 1–16 nodes of the SR8000/MPP. MPI was used as the communication library on the SR8000. All routines were written in Fortran. The compiler used was the optimizing Fortran 77 compiler V01-03 of Hitachi Ltd., and the optimization option "-Oss -rdma" was specified.

Tables 2 and 3 compare the five-step FFT-based parallel FFT algorithm and the Hitachi MATRIX/MPP library parallel one-dimensional FFT routine HZFT4MHP in terms of their run times and GFLOPS. The column headed by $n$ shows the number of points of the FFTs. The next ten columns contain the average elapsed time in seconds and the average execution performance in GFLOPS. The GFLOPS values are each based on $5n\log_2 n$ operations for a transform of size $n = 2^m$.

For $n \le 2^{23}$, $P = 1$, the five-step FFT-based parallel FFT algorithm is slower than the library parallel FFT routine HZFT4MHP, whereas for $n \ge 2^{24}$, $P = 1$ and for $n \ge 2^{20}$, $P \ge 2$ it is faster. This is because the innermost loop length of the five-step FFT-based parallel FFT algorithm is longer than that of HZFT4MHP. These results clearly indicate that the five-step FFT-based parallel FFT algorithm is superior to the Hitachi MATRIX/MPP library routine HZFT4MHP. We note that on the SR8000/MPP with 16 nodes, over 61 GFLOPS was realized for size $n = 2^{31}$ with the five-step FFT-based parallel one-dimensional FFT algorithm, as shown in Table 2.

Table 4 shows the results of the all-to-all communication timings on the SR8000. The column headed by $P$ shows the number of nodes. The next eight columns contain the average elapsed time in seconds and the average bandwidth in MB/s.

Table 4
All-to-all communication performance on the Hitachi SR8000/MPP

            P = 2               P = 4               P = 8               P = 16
    n       Time     MB/s       Time     MB/s       Time     MB/s       Time     MB/s
    2^20    0.00310  1354.01    0.00236  1330.40    0.00215  853.45     0.00762  128.93
    2^21    0.00605  1386.75    0.00443  1419.11    0.00398  921.84     0.00287  686.07
    2^22    0.01196  1403.11    0.00854  1474.22    0.00765  960.01     0.00436  902.00
    2^23    0.02377  1411.50    0.01673  1503.98    0.01492  984.03     0.00802  980.85
    2^24    0.04743  1414.79    0.03314  1518.94    0.02949  995.68     0.01555  1011.49
    2^25    0.09470  1417.27    0.06593  1526.73    0.05895  996.07     0.03030  1038.36
    2^26    0.18923  1418.55    0.13993  1438.74    0.11664  1006.83    0.05999  1048.69
    2^27    0.37921  1415.77    0.37178  1083.03    0.23283  1008.81    0.11820  1064.50
    2^28    1.08000  994.21     0.86058  935.77     0.46519  1009.83    0.23380  1076.39
    2^29    *        *          1.31953  1220.60    0.92991  1010.33    0.46736  1076.94
    2^30    *        *          *        *          1.85933  1010.61    0.93166  1080.47
    2^31    *        *          *        *          *        *          1.87053  1076.31


8. Conclusion

In this paper, we proposed a parallel one-dimensional FFT algorithm for clusters of vector SMP nodes. The five-step FFT-based parallel one-dimensional FFT algorithm results in high-performance one-dimensional parallel FFTs suitable for clusters of vector SMP nodes. In our proposed parallel FFT algorithm, since we use cyclic distribution, all-to-all communication takes place only once; moreover, the input data and output data are both in natural order. We succeeded in obtaining a performance of over 61 GFLOPS on a 16-node Hitachi SR8000/MPP. The performance results demonstrate that the proposed algorithm has low communication cost and long vector length. Implementation on clusters of vector SMP nodes of the extended split-radix FFT algorithm [18], which has a lower operation count than the radix-8 FFT, remains an important problem for the future.

Acknowledgement

This work was supported by a Grant-in-Aid for Encouragement of Young Scientists (No. 12780190) from the Japan Society for the Promotion of Science.

References

[1] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19 (1965) 297–301.
[2] P.N. Swarztrauber, Multiprocessor FFTs, Parallel Comput. 5 (1987) 197–210.
[3] M. Hegland, A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing, Numerische Mathematik 68 (1994) 507–547.
[4] M. Hegland, Real and complex fast Fourier transforms on the Fujitsu VPP 500, Parallel Comput. 22 (1996) 539–553.
[5] M. Frigo, S.G. Johnson, The fastest Fourier transform in the west, Technical Report MIT-LCS-TR-728, MIT Laboratory for Computer Science, 1997.
[6] IBM Corporation, Parallel Engineering and Scientific Subroutine Library for AIX Version 2 Release 3 Guide and Reference (SA22-7273-04), fifth ed., 2001.
[7] D.H. Bailey, FFTs in external or hierarchical memory, J. Supercomput. 4 (1990) 23–35.
[8] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM Press, Philadelphia, PA, 1992.
[9] S. Yamamoto, Effective implementations of multi-dimensional radix-2 FFT, Comput. Phys. Commun. 125 (2000) 1–7.
[10] Y. Tamaki, N. Sukegawa, M. Ito, Y. Tanaka, M. Fukagawa, T. Sumimoto, N. Ioki, Node architecture and performance evaluation of the Hitachi Super Technical Server SR8000, in: Proceedings of the 12th International Conference on Parallel and Distributed Computing Systems, 1999, pp. 487–493.
[11] K. Shimada, T. Kawashimo, M. Hanawa, R. Yamagata, E. Kamada, A superscalar RISC processor with 160 FPRs for large scale scientific processing, in: Proceedings of the International Conference on Computer Design (ICCD '99), 1999, pp. 279–280.


[12] K. Nakazawa, H. Nakamura, T. Boku, I. Nakata, Y. Yamashita, CP-PACS: a massively parallel processor at the University of Tsukuba, Parallel Comput. 25 (1999) 1635–1661.
[13] M. Brehm, R. Bader, H. Heller, R. Ebner, Pseudovectorization, SMP, and message passing on the Hitachi SR8000-F1, in: Proceedings of the 6th International Euro-Par Conference (Euro-Par 2000), Vol. 1900 of Lecture Notes in Computer Science, Springer-Verlag, 2000, pp. 1351–1361.
[14] M. Hegland, An implementation of multiple and multi-variate Fourier transforms on vector processors, SIAM J. Sci. Comput. 16 (1995) 271–288.
[15] M.C. Pease, An adaptation of the fast Fourier transform for parallel processing, J. ACM 15 (1968) 252–264.
[16] P.N. Swarztrauber, FFT algorithms for vector computers, Parallel Comput. 1 (1984) 45–63.
[17] S. Goedecker, Fast radix 2, 3, 4, and 5 kernels for fast Fourier transformations on computers with overlapping multiply-add instructions, SIAM J. Sci. Comput. 18 (1997) 1605–1611.
[18] D. Takahashi, An extended split-radix FFT algorithm, IEEE Signal Process. Lett. 8 (2001) 145–147.