Computer Physics Communications 26 (1982) 331—334 North-Holland Publishing Company
331
FAST METHODS ON PARALLEL AND VECTOR MACHINES Clive TEMPERTON Meteorological Office, Bracknell, Berkshire, UK
The implementation of fast numerical methods on parallel and vector computers is illustrated by describing the development of fast Fourier transform routines for the vector-processing Cray-I and Cyber 205 machines. Various vectorization methods are presented for FFT’s on the Cray-l. By performing a number of transforms in parallel, “super-vector” performance can be achieved. By modifying the algorithms slightly, multiple transforms can be implemented faster on the Cyber 205 (using 64-bit arithmetic on the 2-pipe model) than on the Cray-l, provided that enough transforms (of order 100) can be performed in parallel.
I. Introduction The recent advent of vector and parallel cornputers has made possible a very significant increase in the scope of practicable scientific computation. To take an example from numerical weather prediction (NWP), a few years ago computing resources restricted operational forecast models to coarse resolution grids covering limited areas of the globe, predicting developments for two to three days into the future. Now higher resolution global models can be run for 10 days ahead, on a daily operational basis. However, such improvements in the scale of scientific computing have required considerable effort in order to exploit the new hardware effectively. Many computational procedures and numerical algorithms have had to be reconsidered or redesigned in the transition from scalar to vector or parallel machines. The Fast Fourier Transform (FFT) is a numerical algorithm which plays an important role in many areas of computational physics, including numerical weather prediction models. Some examples of its use in both grid point and spectral NWP models were detailed in ref. [101. A global 10-day forecast using current operational models may require millions of FFT’s, and it is clearly important that they be performed efficiently. In this paper we consider the implementation of the FFT algorithm on the vector computers Cray- 1
and Cyber 205. Much of the experience will be relevant to other numerical algorithms, and to parallel as well as vector machines. The FFT, originally publicized by Cooley and Tukey [3], is a fast way of evaluating summations of the form N—I
x1~
Ck
exp(2yklT/N),
0’~j~N—1,
k0
where the Ck’S are given complex Fourier coefficients. In many cases we have the symmetry condition CNk = cZ (* denoting complex conjugate), which implies that the x, ‘s are real. Before proceeding, it is worthwhile disposing of two myths about the FFT. Firstly, N does not have to be a power of 2, and a factorization of the form N 2~°3”5’ can be implemented with little loss of efficiency. Secondly, provided that a work space of length N is available (rarely a problem on modern computers), both the input and output data can be serially ordered. The original Cooley— Tukey formulation required a permutation of the data either before or after the transform. The “self-sorting” variants [8] have in fact been known for a long time (Cochran et al. [1] attribute the idea to Stockham), and are more suitable for vector machines since the unscrambling operation is essentially unvectorizable.
00l0-4655/82/0000—0000/$02.75 © 1982 North-Holland
332
C. Temperton
/ Fast
methods on parallel and vector machines
2. The Cray-i and Cyber 205
3. Complex transforms on the Cray-i
In this section we catalogue some of the important differences between the two machines considered. The unique feature of the Cray-i is the provision of eight vector registers (each holding 64 words) which act as input and output to all arithmetic operations. Transferring data between main memory and the vector registers requires additional operations, but temporary results can be held in the registers. For the CAL (Cray Assembly Language) programmer there is considerable scope for organizing the computation so as to minimize the number of transfers to and from memory, thus achieving “super-vector” computation rates [5] which are limited only by the number of func‘ional units available. On the Cyber 205 all vector oerations are directly memory-to-memory. A vector in Cray-i memory can have its elements spaced at any constant increment; the definition is more restrictive on the Cyber 205, where a vector must have its elements stored contiguously. The optimum vector lengths on Cray-i are multiples of 64, the maximum length of a single vector operation. On the Cyber 205 the optimum vector length is 64 K. Perhaps a more important aspect is revealed by Hockney’s parameter fl 1/2 [4], the vector length at which half maximum speed is reached. This value is roughly 10 on the Cray-i compared with 100 on the Cyber 205, implying that the Cray-I is much better at handling short vectors. On the other hand the Cyber 205 has the higher maximum possible speed, 200 megaflops (millions of floating-point operations per second; here we assume 64-bit arithmetic on the 2-pipe version of the machine) compared with 160 megaflops on the Cray-I. Finally, additions and multiplications on the Cray-I can be performed simultaneously (in parallel), or in “chained” mode for triadic operations of the form a b*(c + d). The Cyber 205 has a “link” facility analogous to chaining, but independent additions and multiplications cannot be performed in parallel.
The implementation of complex FFT’s on the Cray-I was considered in detail in ref. [9]; here only a summary will be given. There is one pass through the data for each factor of N, and each pass has a nested loop structure which can be vectorized in various ways. The self-sorting variant of the algorithm leads to a fairly natural vectorization (scheme A), in which the vector elements are consecutive complex numbers. The trip count for the inner loop (i.e., the vector length) increases in successive passes; for example N = 256 44 gives vector lengths of 1, 4, 16 and 64 in the four stages of the transform. Alternatively the nested loop structure can be turned “inside out” (scheme B) to give decreasing vector lengths, e.g. 64, 16, 4, 1 in the above example. The vector elements are no longer consecutive, and additional vector loads and stores are required; also memory bank conflicts can occur. Since schemes A and B give the same answers, an obvious choice is to start with scheme B and to switch to scheme A later. Using the example N = 44 again, two passes with scheme B followed by two passes with scheme A would give vector lengths of 64, 16, 16, 64. An alternative version of the FFT algorithm was given by Pease [7], with a natural vector length of N/(current factor); however, the results emerge in scrambled order, and to compensate for this a scalar sorting step is required.
Table I Complex transform of length N = 96 on Cray-I Vectorization scheme
Time per transform
Speed (Mflops)
(jss) scalar B B+A Pease (scrambled)
540 150 124
4.5 16 20
Pease (sorted) multiple (Fortran) * multiple (CAL) *
167 52 28
15 47 87
*
64 transforms in parallel.
C. Temperton
/ Fast methods on parallel and vector machines
For modest values of N, none of the above schemes really harnesses the full power of the machine. Table I shows the times and speeds achieved for a complex transform of length N = 96, typical of the size used in NWP models. In order to achieve better performance, we recognize that in fact we normally perform many transforms of the same length simultaneously, and the data can easily be organized so that a vector contains one element from each transform. The vector length is then just the number of transforms being performed in parallel (scheme M, the multiple transform). As demonstrated in table 1, this is much more efficient. Finally, reprogramming the multiple transform in CAL gives a further improvement by almost a factor of two, at which point “super-vector” performance has been achieved, It is easily shown that the number of additions required in the FFT is considerably greater than the number of multiplications. Apart from the radix-2 case, the number of additions is also greater than the number of memory references [9]. Hence the time required depends essentially on the number of additions, Recently, Winograd [11] has proposed an ingenious variant of the FFT algorithm requiring many fewer multiplications, at the cost of a slightly increased number of additions. Unfortunately, the above argument demonstrates that Winograd’s algorithm would actually run more slowly on the Cray-I than the conventional FFT. (In fact the time could well be dominated by the number of memory references, which would make matters even worse.)
4. Real transforms on the Cray-i Suppose now we want to perform a transform N1 Ck
exp(2ykir/N),
0~J~N l,
k=O
where CNk = c~and the x1’s are consequently real. The conventional approach [2] is to form an artificial complex series of length N/2 to which a complex transform of length N/2 is applied; the real and imaginary parts of the output then yield
333
Table 2 Real periodic transforms on Cray-I (64 in parallel; CAL)
N
Conventional (ps)
Special real FFT
(,ss)
(Mflops)
__________________________________________________
120 ~ 240 256 320 360
23 3~ 51 52 67 79
16 25 36 38 53 62
100 100 104 95 92 110
the even and odd values of x1. Alternatively a complex transform of length N can be used to carry out two simultaneous real transforms of length N. A more efficient approach [10] is to take the full complex transform of length N and to eliminate redundant operations. This leads to a slightly more involved algorithm, but gives savings of typically 30% in the operation count compared with the conventional approach. A vectorization of the real transform algorithm analogous to scheme A for the complex case is possible, but it is the simple multiple scheme M which can most readily be adapted. Table 2 shows times and megaflop rates for real periodic transforms on the Cray- 1, using the multiple approach programmed in CAL, and comparing the conventional procedure with the special real FFT. The latter is significantly faster, and again “super-vector” speeds are attained.
5. Real transforms on the Cyber 205 In transferring the FFT algorithm from the Cray-i to the Cyber 205, several significant diiferences between the two machines have to be taken into account. To take the simplest first, multiplications on the 205 can only be overlapped with additions in “linked” mode, whereas on the Cray- 1 they can simply be performed in parallel. However, with a little ingenuity the computation can be reorganized so that each multiplication appears linked with an addition. .
C. Temperton
334
/
Fast methods on parallel and vector machines
More seriously, in NWP applications the number of transforms which we wish to perform in parallel is rather typically of order 64, so that the simple multiple vectorization scheme gives vector lengths on the Cray-I which are close to optimum. On the Cyber 205 with its much larger value of ~ /2’ the corresponding vector lengths are distinctly too short for optimum performance. At the same time, the most natural layout of the data (all the elements of the first transform, then all the • elements of the second transform,...) is the wrong way round for the Cyber’s definition of a vector, which must have contiguously stored elements. We must therefore transpose the data layout, so that we have all the first points of each transform, then all the second points of each transform,... It turns out that this transposition of the data simultaneously cures the vector length problem, for we can then implement the “direct product” of vectorization scheme A (see section 3) with the multiple scheme M. Suppose for example we perform 100 simultaneous transforms of length N 256 = 44, At the first stage we have a vector length of 100. At the second stage the vector length becomes 400 (in fact this stage is exactly equivalent to the first stage of doing 400 transforms each of length 64), at the third stage 1600 and at the fourth stage 6400 (equivalent to doing 6400 transforms of length 4). An analogous result was presented by Korn and Lambiotte [6]. This trick works for the special real transform as well as for the purely complex transform; in fact in the special case, real and imaginary numbers are interlaced in such a way that they can sometimes be treated together, giving a further
Table 3 0 Real periodic transforms on Cray-i and Cyber 205 time in ~ss (speed in Mflops) N
120 125 192 240 256
Cray-I 64 in II
16 (100) 19(111) 25 (100) 36 (104) 38 (95)
Cyber 205 64 in 1
256 in II
21(75) 27(78) 33 (76) 50 (76) 44 (79)
13 (121) 17(128) 21(122) 31(123) 28 (123)
doubling of the vector length. Table 3 shows the results of applying this approach to real transforms on the Cyber 205, cornpared with the simple multiple approach in CAL on the Cray- 1. We see that transforms can be implemented faster on the Cyber 205, provided that enough transforms (of order 100) can be taken in parallel.
6. Concluding remarks The transition from scalar to vector or parallel computing requires a reconsideration of many numerical algorithms. It has been the author’s slightly rueful experience that the transition from one vector computer to another requires yet another reconsideration. Significantly perhaps, the procedure developed for efficient implementation of multiple FFT’s on the Cyber 205 could be translated back to the Cray-l to provide a more efficient way of performing a small number of simultaneous FFT’s on that machine, much as the multiple approach developed for the Cray- 1 turned out to be a very efficient procedure on a scalar machine [10].
References [1] W.T. Cochran et al., IEEE Trans. Audio and Electroacoustics AU-15 (1967) 45. [2] J.W. Cooley, P.A.W. Lewis and P.D. Welch, J. Sound Vib. 12 (1970) 315. [31J.W. Cooley and J.W. Tukey, Math. Comput. 19 (1965) 297. [4] R.W. Hockney and C.R. Jesshope, Parallel computers: architecture, programming and algorithms (Adam Huger, Bristol, 1981). [5] T.L. Jordan and K. Fong, in: High speed computer and algorithm organization, eds. D.J. Kuck, D.H. Lawrie and A.H. Sameh (Academic Press, New York, London, 1977). [6] DO. Korn and J.J. Lambiotte, Math. Comput. 33 (1979) 977. [7] M.C. Pease, JACM 15 (1968) 252. [81 C. Temperton, Technical Report No. 3, European Centre for Medium Range Weather Forecasts (1977). [9] C. Temperton, in: Supercomputers, Infotech State of the Art Report, Infotech International Ltd., Maidenhead, UK (1979). [10] C. Temperton, in: Special topics of applied mathematics, eds. J. Frehse, D. Pallaschke and U. Trottenberg (NorthHolland, Amsterdam, 1980). [11] S. Winograd, Math. Comput. 32 (1978) 175.