Computers chem. Engng, Vol. 13, No. 9, pp. 1065-1073, 1989. Printed in Great Britain. Copyright 1989 Pergamon Press plc.

COUPLED TRANSPORT/CHEMISTRY CALCULATIONS ON THE MASSIVELY PARALLEL PROCESSOR COMPUTER

G. R. CARMICHAEL, D. M. COHEN, S.-Y. CHOW and M. H. OGUZTUZUN
Chemical & Materials Engineering Department and Computer Science Department, University of Iowa, Iowa City, IA 52242, U.S.A.

(Received 2 February 1988; final revision received 28 December 1988; received for publication 16 February 1989)
Abstract--Coupled transport/chemistry problems are solved on the Massively Parallel Processor (MPP) computer. The MPP is a single-instruction multiple-data (SIMD) machine with 16,384 individual processors arranged in a 2-D array with nearest-neighbor connections between processors. The suitability of this architecture for coupled transport/chemistry applications is investigated. Substantial speedups can be achieved on the MPP when concurrency in the calculation can be exploited. For example, speedup factors on the MPP of 450 and 10 relative to the VAX 11/780 and Cray-2, respectively, are achieved for a 3-D chemistry calculation. Various aspects of parallel computing on the MPP are discussed.
1. INTRODUCTION
The solution of large numbers of coupled partial differential equations (PDEs) is necessary to analyze coupled transport/chemistry (CTC) problems associated with many physical systems (e.g. analysis of the acid rain problem, catalysis and reactor design, combustion, and many other chemical and biological problems). As with other computationally intensive problems, the solution of realistic examples is often infeasible given the limitations of current computer technology, and reducing the size of the problem, the number of dimensions, etc. has been necessary in order to solve instances of such problems. For example, we have developed and are applying a CTC model for the study of acid rain (Carmichael et al., 1986). This model treats 50 chemical species involved in ~100 chemical reactions in each of four phases (e.g. gas, cloud water, rain and snow). In total, 120 coupled 3-D time-dependent, nonlinear, stiff PDEs (and additional algebraic equations) must be integrated. This model is currently being run on the Cyber 205 and IBM 3090/300, and has been run on the Cray-2 and Cray X-MP. Typical applications utilizing 3000 grid points run in 1/3-1/2 of real time (i.e. to simulate a 1 day event requires 10 CPU h on the Cyber 205), with 90% of the CPU time spent performing the stiff chemical calculations. Ideally one would like to use the models to provide seasonal and even yearly averages of acid deposition so that policy issues can be examined (requiring simulations of 365 days of real time).
†Present address: Department of Mechanical Engineering, Princeton University, Princeton, NJ, U.S.A.
However, such long-term simulations are not feasible with present computers. New and more powerful parallel computers are becoming available and hold promise for carrying out such detailed calculations (Gabriel, 1986). These newly-created parallel computers differ substantially in many dimensions: number of processors; speed and word length of processors; organization, amount and accessibility of storage to each processor; number of instruction streams; number and length of pipelines, etc. It is currently not known which architecture is best suited for CTC applications or how to exploit architectural parallelism in the numerical calculations.

The new parallel computers may be well-suited for performing CTC calculations. The numerical solution of CTC problems is difficult because the equations are nonlinear, highly coupled and extremely stiff. The problem involves both calculation of the transport processes (in which both convection and diffusion can be important) and of the chemical reaction processes, which are stiff, coupled and nonlinear. Typically, the solution of these problems makes use of some splitting technique which separates the transport and the chemistry calculations (Mitchell, 1969). This may be an advantage on parallel computers because the splitting process decouples the spatial nature of the calculation, i.e. within each transport time step the chemistry at each node is uncoupled from that of the other nodes.

We have begun to evaluate various parallel computers for performing CTC calculations. In this paper we discuss our results on NASA Goddard's MPP (Massively Parallel Processor), a 16,384-processor machine.
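The splitting structure just described can be sketched in a few lines of illustrative Python (the function names transport_step and chemistry_step are placeholders, not part of the acid rain model code); the loop shows why, within each transport step, the chemistry at different grid points becomes independent:

    def advance_split(conc, dt, transport_step, chemistry_step, n_steps):
        """Operator-split time stepping (schematic sketch only).

        conc            : array of shape (nx, ny, nz, n_species)
        transport_step  : advances convection/diffusion for all species (spatially coupled)
        chemistry_step  : integrates the stiff kinetics at one grid point (spatially uncoupled)
        """
        nx, ny, nz, _ = conc.shape
        for _ in range(n_steps):
            # Transport: couples neighboring grid points.
            conc = transport_step(conc, dt)
            # Chemistry: every grid point is an independent stiff ODE system, so these
            # calls could all run concurrently (one per processor on an SIMD machine).
            for i in range(nx):
                for j in range(ny):
                    for k in range(nz):
                        conc[i, j, k, :] = chemistry_step(conc[i, j, k, :], dt)
        return conc

It is this inner, per-grid-point loop that a massively parallel machine can execute concurrently.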
2. OVERVIEW OF THE MASSIVELY PARALLEL PROCESSOR

Fig. 1. MPP system block diagram.
The MPP was built by Goodyear Aerospace for the NASA Goddard Space Flight Center. It operates as a peripheral processor to a host machine (a DEC VAX 11/780). The four major components of the MPP are shown in Fig. 1. The array unit (ARU) consists of a 128 x 128 array of processing elements (PEs), either open at the edges or connected to form a horizontal cylinder, a vertical cylinder or a torus. Each PE contains a full adder, a variable-length shift register (2-30 bits), logic circuitry, 1024 bits of random access memory and various registers. Instructions for the ARU are issued by the array control unit (ACU). The ACU has three distinct parts, any number of which may execute simultaneously: the processing element control unit (PECU), the I/O control unit (IOCU) and the main control unit (MCU). The PECU and the IOCU send instructions to the ARU, while the MCU serves as a fast serial processor which invokes the PECU and IOCU functions as subroutines. All array manipulations are performed by means of the PECU; all scalar arithmetic is programmed in the MCU. The IOCU manages the flow of data into the ARU. The program and data management unit (PDMU) of the MPP consists primarily of a DEC PDP-11/34 minicomputer. It manages data flow into the ACU, performs diagnostic tests of the hardware, and buffers output to secondary storage devices, terminals and printers. The staging memory unit (SMU), consisting of 32 Mbyte of storage, lies in the data path between the host computer and the ARU-PDMU components of the MPP. Its function is to buffer and/or reformat arrays of data.
The ARU has an instruction cycle time of 100 ns. Peak performance of the MPP with respect to manipulation of 128 x 128 arrays has been reported in the literature (Hwang and Briggs, 1984). Using 12-bit integers, array addition can be performed at 4428 mops (millions of operations s⁻¹), element-by-element multiplication of two arrays (Hadamard product) at 910 mops, and multiplication of an array by a scalar at 1260 mops. Using 32-bit floating point numbers, the peak performance rates are 430, 216 and 373 mops, respectively.

The MPP belongs to the class of SIMD (single instruction stream, multiple data stream) computers. Unless disabled for a particular machine cycle, each PE performs the same operation as all of the other PEs. Addition and multiplication of arbitrary-length integers and floating point numbers are accomplished through successive operations on the individual bits of the operands. A bitplane is an assignment of bit values to each PE; each bitplane represents one bit of each of 16,384 numbers. A 32-bit integer, for example, is represented as 32 bits in the local store of a single PE, and a collection of 32 bitplanes characterizes a "parallel array" of 32-bit integers.

The assembly language for the PECU is called PEARL (PE Array Language). All operations on bitplanes or on parallel arrays are accomplished by the use of PEARL subroutines. Microprogram conflicts are detected by the PEARL assembler. The programming language for the IOCU and the MCU is MCL (Main Control Language). Macroprocessing is available, as well as the ability to define parallel array variables corresponding to collections of bitplanes in the ARU. Parallel array variables may be of several types, including (arbitrary length) integer, 32-bit floating point and bitplane.
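As an illustration of the bitplane idea (a sketch in modern array notation, not MPP code): a parallel array of 16,384 32-bit integers, one per PE of the 128 x 128 grid, is equivalent to a stack of 32 one-bit planes and can be taken apart and reassembled bit by bit:

    import numpy as np

    # One 32-bit integer per PE of the 128 x 128 array (16,384 values in all).
    values = np.random.randint(0, 2**31, size=(128, 128), dtype=np.uint32)

    # Bitplane b holds bit b of every PE's value -- one bit per processor.
    bitplanes = [((values >> np.uint32(b)) & np.uint32(1)).astype(bool) for b in range(32)]

    # Reassembling the 32 bitplanes recovers the original parallel array.
    reassembled = np.zeros((128, 128), dtype=np.uint32)
    for b, plane in enumerate(bitplanes):
        reassembled |= plane.astype(np.uint32) << np.uint32(b)

    assert np.array_equal(values, reassembled)

Bit-serial arithmetic on the MPP proceeds plane by plane in exactly this representation.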
The MPP is designed to run in tandem with the VAX 11/780. The Parallel Pascal compiler accepts a language which is an extension of Pascal (allowing for parallel array variables, among other things) and produces code for both the MPP and the VAX (see Fig. 2). A symbolic debugger is available to help the user correct errors in MPP programs; CAD (control and debug) is an interactive tool allowing for the display and modification of values in the MPP registers and for the alteration of the control flow of the program. VAX-resident code is assembled and linked with host libraries, user-supplied subroutines and the MCU symbol table. MPP-resident code is assembled and linked with MCU libraries and routines and PECU routines ("primitives"). The VMS load module, PECU load module and MCU load module are the outcome of the compilation and link steps.
With regard to program organization, one may choose to let the Pascal main program reside in either the VAX or the MPP. Alternatively, the MPP-resident code can be called as subroutines from VAX-resident programs written in other languages (e.g. FORTRAN).

The performance of the MPP may be compared to that of two other SIMD computers (Gabriel, 1986). The Connection Machine (Thinking Machines, Inc.) has 65,536 1-bit processors connected in a Boolean n-cube; each processor has 4096 bits of local memory and 16 registers. The Non-Von computer (Columbia University) has 64 processors (to be increased to 8000 in the near future) which may operate in either an SIMD mode or an MIMD (multiple instruction stream, multiple data stream) mode. There are two categories of processors, small PEs (8-bit) and large
Fig. 2. MPP Pascal program development steps (Parallel Pascal compiler, P-code, code generator, VMS macro assembler and MCL assembler, primitive library, VAX runtime and VMS linker).
PEs (32-bit). The small PEs are connected as a complete binary tree, with the leaf nodes connected to each other in a 2-D grid. Subsets of small PEs act in unison as an SIMD machine, under the control of a large PE. The MPP, the Connection Machine and Non-Von have maximum speeds of 6.5, 10 and 16 bips (billions of instructions s⁻¹), respectively.

3. TEST PROBLEMS
We have begun investigating the NASA/Goddard MPP machine as a means of increasing the speed of performing CTC calculations. We have performed several test applications using subsets of our CTC model. The results of these tests are discussed below.

Test Problem 1--chemistry

Description. Over 90% of the CPU time in our CTC acid rain model is spent performing the chemistry calculations. For example, a 3000 grid point calculation for gas-phase-only (no rain) conditions, involving 50 chemical species in some 100 chemical reactions, requires 10 CPU h per 24 h simulated on a VAX 11/780, of which 9 CPU h is spent on the chemistry calculation and 1 CPU h on the 3-D transport. (A full simulation with rain would take ~40 CPU h per 24 h simulated on the VAX.) In order to see if the MPP can help eliminate the bottleneck associated with the chemistry portion of the model, we evaluate the MPP's ability to solve sets of stiff ordinary differential equations arising from the chemistry calculations. Consider a single four-species test problem in which species C1, C2, C3 and C4 are involved in a small set of chemical reactions with rate constants k1, ..., k4:

    C1 + C2 --> C3,      2C1 --> C4,      C3 --> 2C2,      ...

One way of numerically solving complex chemistry network problems is to split the governing mass balance equations into transport and chemistry parts. The chemistry calculation using this technique requires solving the set of equations

    dC_i/dt = R_i(C_j),    i, j = 1, ..., number of species,    (1)

at each grid point in the discretized space, where R_i(C_j) is the chemical reaction rate for species i. The use of the semi-implicit Euler method (Preussner and Brand, 1981) to solve equation (1) results in the following set of ordinary differential equations:

    dC_i/dt = -C_i Σ_j [d_j^i Π_{k=1}^{n} C_k] + Σ_j [p_j^i Π_{k=1}^{n} C_k],    i = 1, ..., 4,    (2)

where the d_j^i are the rate constants associated with destruction reaction j of species i, the p_j^i are the rate constants of the reactions which generate species i, and the C_k are the chemical concentrations. This set of initial value problems is solved within each transport time step, i.e. for t_0 <= t <= t_f.

Now consider the case in which we have 32 x 32 x 16 grid points in a discretized 3-D spatial domain. On a serial computer the following FORTRAN loop would be performed to solve for the chemical compositions within the transport time step:

      DO 25 I = 1, 32
      DO 25 J = 1, 32
      DO 25 K = 1, 16
         CALL CHEMRXN (S, T0, TF, C(I, J, K, S))
   25 CONTINUE

In the above DO loop, S is the number of chemical species, C(I, J, K, S) is the concentration of species S at grid point (I, J, K), and CHEMRXN is the subroutine which solves the S coupled ODE initial value problems represented in equation (2). Therefore each chemistry calculation within each transport step requires the solution of 16,384 sets of equation (2).

To implement the solution of these equations on the MPP requires first a choice of how to map the equations onto the architecture. One procedure is simply to view each processor as a grid point in the discretized space, and to have each processor solve its own set of equation (2). In this mapping each processor integrates the set of chemical equations describing the concentrations at that grid point concurrently with the other processors [i.e. all 16,384 sets of equation (2) are solved at the same time]. This mapping is shown in Fig. 3. The algorithm for solution of equation (2) is written in Parallel Pascal and resides on the VAX (the host computer). The initial conditions and constants are distributed to each processor; then the algorithm is executed on the MPP and the output is sent back to the VAX.

The CPU time required for execution of 100 time steps of this four-species mechanism at 16,384 grid points on the MPP is 0.293 CPU-s. The same problem executed on the VAX 11/780 required 138 CPU-s. Thus, for this chemical network problem, the MPP executed a factor of 470 times faster than the VAX 11/780. The same problem was also executed on the Cray-2; a fully vectorized version of the code required 2.7 CPU-s.
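A minimal sketch of this grid-point-per-processor mapping, written here in Python/NumPy rather than Parallel Pascal: every element of a 128 x 128 array plays the role of one processor holding its own concentrations, and a linearized (semi-implicit) Euler step of the production/destruction form of equation (2), C_new = (C + Δt P)/(1 + Δt D), is applied to all of them at once. The two-reaction mechanism and the rate constants below are illustrative only; they are not the four-species test mechanism of the paper.

    import numpy as np

    def semi_implicit_euler_step(A, B, C, k1, k2, dt):
        """One linearized (semi-implicit) Euler step, applied element-wise to every
        grid point / processor at once.  Illustrative reactions:
            A + B --k1--> C        C --k2--> A + B
        Each species obeys dX/dt = P_X - D_X * X and is updated as
            X_new = (X + dt * P_X) / (1 + dt * D_X).
        """
        P_A, D_A = k2 * C, k1 * B
        P_B, D_B = k2 * C, k1 * A
        P_C, D_C = k1 * A * B, k2
        A_new = (A + dt * P_A) / (1.0 + dt * D_A)
        B_new = (B + dt * P_B) / (1.0 + dt * D_B)
        C_new = (C + dt * P_C) / (1.0 + dt * D_C)
        return A_new, B_new, C_new

    # One concentration field per species; one value per "processor" (grid point).
    rng = np.random.default_rng(0)
    A = rng.uniform(0.5, 1.5, (128, 128))
    B = rng.uniform(0.5, 1.5, (128, 128))
    C = np.zeros((128, 128))

    for _ in range(100):          # 100 chemistry time steps, as in Test Problem 1
        A, B, C = semi_implicit_euler_step(A, B, C, k1=1.0, k2=0.5, dt=0.01)

Because the update involves only element-wise array operations, all 16,384 grid points advance in lockstep, which is the SIMD behaviour the MPP exploits.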
The above test calculation indicates that the MPP and similar array processors may be well-suited for chemical network problems where each node can hold the entire mechanism. Current memory restrictions on the MPP limit the size of the chemical mechanism that can be solved in this fashion; at present each processor can hold 32 32-bit variables. (It is planned to increase the storage in the near future.) However, it is possible to handle larger chemical mechanisms. One way is to group processors together. For example, if 128 32-bit words are required at each node, then four processors can work together; this in turn would reduce the maximum number of grid points possible by a factor of four. Another way is to make better use of the staging memory module of the MPP.

Analysis. We can extrapolate our timing measurements on 100 time steps of the reaction chemistry calculations to estimate the CPU requirements of a full simulation, including both the chemistry and the transport processes, where the transport calculations are carried out on the VAX and the chemistry calculations are conducted on the MPP. This extrapolation can be carried out with various assumptions. The most common assumption is that the size of the problem remains fixed. The expected speedup of the MPP/VAX combination compared to the VAX alone can then be calculated by means of Amdahl's law (Gustafson, 1988). The speedup for running the entire simulation under this assumption (both chemistry and transport processes) will be:
    Speedup = T_VAX / [σ T_MPP + (1 - σ) T_VAX],    (3)

where T_MPP and T_VAX equal the computation times on the MPP and VAX computers, respectively, and σ represents the fraction of all calculations in the simulation which are performed on the MPP. Substituting the values obtained from Test Problem 1, and taking the fraction performed on the MPP to be ~0.95 (essentially the whole of this test calculation is chemistry), we obtain a speedup of 138/7.178, or roughly 19. Using Amdahl's law to calculate the speedup of the MPP/VAX combination compared to the Cray-2, we obtain a speedup of

    Speedup = T_CRAY-2 / [σ T_MPP + (1 - σ) T_VAX],    (4)

where T_CRAY-2 equals the computation time on the Cray-2. The numerical value of this speedup is 2.7/7.178, or 0.38. We estimate, therefore, that the entire simulation would execute approx. 7.178/2.7, or 2.7, times faster on the Cray-2 than on the MPP/VAX combination.

Fig. 3. Mapping used for chemistry calculations on the MPP. Each grid point (I, J, K) in the 3-D domain is mapped to a single processor on the MPP. The chemistry calculations are then done concurrently on the MPP.
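The fixed-size estimate above amounts to a few arithmetic operations; in the sketch below the value σ = 0.95 is an assumption chosen to reproduce the denominator of 7.178 CPU-s quoted in the text.

    # Fixed-size (Amdahl-type) estimate from the Test Problem 1 timings.
    T_VAX = 138.0      # CPU-s for 100 chemistry steps at 16,384 grid points, VAX 11/780
    T_MPP = 0.293      # CPU-s for the same calculation on the MPP
    T_CRAY2 = 2.7      # CPU-s for the same calculation on the Cray-2 (fully vectorized)

    sigma = 0.95       # assumed fraction of the work performed on the MPP (the chemistry)

    combined = sigma * T_MPP + (1.0 - sigma) * T_VAX     # ~7.18 CPU-s for the MPP/VAX pair
    print(f"MPP/VAX vs VAX alone: {T_VAX / combined:.1f}x")    # ~19x
    print(f"MPP/VAX vs Cray-2:    {T_CRAY2 / combined:.2f}x")  # ~0.38x (Cray-2 is ~2.7x faster)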
Gustafson (1988) points out that, in practice, the size of the problem which is to be solved increases with the number of processors available to the programmer. This assumption leads to a different analysis. Consider the general problem of combined transport/chemistry calculations where the transport portion is calculated on the VAX and the chemistry portion is performed on a multiprocessor system. Furthermore, assume that the VAX and the multiprocessor run in serial. Then the total CPU time can be expressed as:

    Total_CPU = T_Trans-CPU + T_Chem-CPU = (t_0 + t_1 g S) t_VAX + [(c_0 + c_1 g S)/N] t_MPP,    (5)

where:

(t_0 + t_1 g S) = number of VAX statements needed to do S time steps of transport calculations with g grid points,
t_0 = overhead of the transport calculations,
t_1 = VAX statements for the transport calculation per grid point per time step,
g = number of grid points,
S = number of time steps,
(c_0 + c_1 g S) = number of VAX statements needed to do S time steps of chemistry calculations with g grid points,
c_0 = overhead of the chemistry calculations,
c_1 = VAX statements for the chemistry calculation per grid point per time step,
N = the number of processors,
t_VAX = the average time (s) per VAX statement,
t_MPP = the average time (s) for a processor of the MPP to execute the equivalent of one VAX statement.

The CPU requirements for the VAX/MPP system can be estimated using the data from Test Problem 1 and the VAX timing profile for a full CTC simulation (i.e. a 24-h simulation with 3000 grid points requires 10 CPU-h: 1 CPU-h for the transport and 9 CPU-h for the chemistry). Assuming that t_0 and c_0 are negligible, then:

    t_1 g S_s t_VAX = 3600 CPU-s for the transport, calculated on the VAX with 3000 grid points,    (6)

    c_1 g S_s t_VAX = (9 x 3600) CPU-s for the chemistry, calculated on the VAX with 3000 grid points,    (7)

where S_s equals the number of time steps of the 24-h simulation on the VAX. For calculations of the chemical reactions only (no transport), for 100 time steps of the four-species mechanism mentioned earlier, the MPP took 0.293 s and the VAX took 138 s. This information can be used to estimate the ratio of t_MPP to t_VAX:

    0.293 CPU-s on MPP / 138 CPU-s on VAX = [c_1 x 16,384 x 100 x t_MPP / N] / [c_1 x 16,384 x 100 x t_VAX],

    i.e. t_MPP = (0.293 x 16,384 / 138) t_VAX = 34.8 t_VAX    (with N = 16,384).    (8)

Substituting these values into equation (5) and neglecting c_0 and t_0 yields:

    Total_CPU ≈ 1.2 g + 376 g/N    (CPU-s).    (9a)

This equation holds for the conditions of a 24-h simulation with 90% of the code done on the MPP. It can be written more generally by introducing the parameter p, the fraction of the CPU time spent on the chemistry calculation:

    Total_CPU = 12 (1 - p) g + 418 p g/N.    (9b)

Fig. 4. Execution time of the coupled transport/chemistry calculation vs the fraction of the time spent on the parallel processor, for 100- and 16,384-processor arrays.

Fig. 5. Dependency of the execution time on the number of processors for different problem sizes (i.e. the number of grid points used in the calculation).
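Equation (9b) can also be evaluated directly; the short sketch below (illustrative Python, using the coefficients 12 and 418 from the fit above) reproduces the problem sizes discussed in connection with Fig. 6.

    def total_cpu_seconds(g, N, p=0.9):
        """Equation (9b): CPU-s for a 24-h simulation with g grid points,
        N processors for the chemistry and fraction p of the work in the chemistry."""
        return 12.0 * (1.0 - p) * g + 418.0 * p * g / N

    def max_grid_points(budget_s, N, p=0.9):
        """Largest g whose 24-h simulation fits in budget_s CPU-s (the model is linear in g)."""
        return budget_s / (12.0 * (1.0 - p) + 418.0 * p / N)

    budget = 10 * 3600.0                       # the 10 CPU-h allotment used for Fig. 6
    for N in (100, 16_384):
        print(f"N = {N:6d}: largest problem ~ {max_grid_points(budget, N):,.0f} grid points")
    # prints ~7,255 and ~29,437 grid points, in line with the 7300 and ~29,500 quoted below

    # For the 3000 grid point model the chemistry cost all but disappears:
    print(f"{total_cpu_seconds(3000, 16_384) / 3600.0:.2f} CPU-h")   # ~1.02 CPU-h, transport-dominated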
These expressions show the dependency of the CPU time on the size of the problem (i.e. the number of grid points), the number of processors (N) and the fraction of the calculation done in parallel. Further insight into the relationship between the parameters of the problem is given in Fig. 4, where the CPU time is shown as a function of the parallelizable fraction of the code for a fixed problem size of 3000 grid points. Using a 16,384-processor machine, a 10-fold decrease in CPU time can only be achieved if over 90% of the code can be done in parallel; a 100-processor machine can only achieve a speedup of about a factor of three for this problem. The dependency of the CPU time on the number of processors is shown in Fig. 5. Recall that for a VAX 11/780 uniprocessor, 3000 grid points required 10 CPU-h, and that the CPU load increases proportionally with the number of grid points. The dependency of the CPU time on the number of processors, as a function of problem size when 90% of the code can be done in parallel, is shown in Fig. 6. This figure can be used to estimate the size of the problem
which can be calculated given a maximum CPU time and number of processors. For example, if we were to fix the CPU time at 10 h we would find that for N = 100 we could solve a problem with 7300 grid points, and with N = 16,384 we could solve a problem with ~29,500 grid points.

Fig. 6. The relationship between the size of the problem and the number of processors for a coupled transport/chemistry calculation with 90% of the code done in parallel and a fixed CPU allotment of 10 CPU-h per simulation.

Test Problem 2--combined transport/chemistry calculations

To evaluate the use of the MPP to solve combined transport/chemistry problems, we solved the combined transport/chemistry problem for a four-species system using 16,000 grid points on the VAX/MPP. The chemistry calculation was done on the MPP and the transport calculations were done on the VAX. The relative execution profile for the MPP/VAX combination is compared to the VAX-only calculation in Fig. 7. The chemistry calculations no longer constitute the computational bottleneck; instead, the transport calculation on the VAX now represents the bottleneck. As indicated in equation (9), when using the MPP/VAX combination the transport portion of the calculation scales linearly with the size of the problem (the number of grid points) and is not influenced by the number of processors. This suggests that further speedups in the calculation may be achieved by performing both the transport and the chemistry on the MPP. Transport calculations on the MPP are discussed in the following subsection.

Test Problem 3--transport calculations on the MPP

The MPP architecture, with its 2-D array of processors with nearest-neighbor connections, is well-suited for performing 2-D transport calculations. For example, the solution of Burger's transport equation using an explicit finite difference expression can be represented schematically as:

                    [i, j+1]
                        |
                        v
    [i-1, j] ----> [i, j] <---- [i+1, j]
                        ^
                        |
                    [i, j-1]

where each bracketed term represents a processor and the arrows represent the transfer of information between processors. Thus, to calculate the new concentration at processor [i, j], the concentrations at the four surrounding points must be transferred to processor [i, j], and the new concentration is then calculated according to:

    C_{i,j}(t+Δt) = C_{i,j}(t) - (Δt/2Δx) [(C_{i+1,j} - C_{i-1,j}) + (C_{i,j+1} - C_{i,j-1})],    (10)

where Δt and Δx represent the time step and spatial grid spacing, respectively. On the MPP the transfer of data from the neighbors to [i, j] is accomplished by a number of shift/rotate operations. The solution of Burger's equation on a 128 x 128 grid requires 0.93 ms per time-integration step on the MPP; the same calculation takes 17 ms on the IBM 3090 operating in serial mode.

Fig. 7. Timing profile of a CTC model for acid rain on a Cray X-MP (a), and for the MPP/VAX combination where the chemistry portion of the calculation is done on the MPP (b).
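On the MPP the four neighbour values are brought onto each processor by shift/rotate operations applied to the whole array at once. The sketch below mimics that data movement with np.roll (periodic boundaries); the particular combination of neighbours in the update is illustrative and is not meant to reproduce equation (10) or the MPP code exactly.

    import numpy as np

    def explicit_step(C, dt, dx):
        """One explicit finite-difference step on a 128 x 128 field.
        Each np.roll plays the role of a whole-array shift/rotate on the MPP,
        delivering a neighbour's value to every processor simultaneously."""
        east  = np.roll(C, -1, axis=1)   # C[i, j+1]
        west  = np.roll(C,  1, axis=1)   # C[i, j-1]
        north = np.roll(C, -1, axis=0)   # C[i+1, j]
        south = np.roll(C,  1, axis=0)   # C[i-1, j]
        return C - dt / (2.0 * dx) * ((north - south) + (east - west))

    C = np.zeros((128, 128))
    C[60:68, 60:68] = 1.0                # an initial patch of concentration
    for _ in range(100):                 # 100 explicit time-integration steps
        C = explicit_step(C, dt=0.1, dx=1.0)

Each step involves only whole-array shifts and element-wise arithmetic, the operations for which the ARU's peak rates were quoted in Section 2.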
Fig. 8. A direct mapping of the 3-D CTC model grid system to the MPP 2-D processor array. For this example the x, y and z directions have 32, 32 and 16 grid points, respectively. The x-y slabs are mapped to 32 x 32 square arrays, and the z-level of each such unit is indicated by the numbers in the squares. The numbers along the top and side are the row and column indices (0-127) of the 128 x 128 array of processors.

The implementation of 3-D transport problems on the MPP is not as straightforward. The basic problem is how to map the 3-D solution domain onto the 2-D processor domain. For maximum computational speed, the mapping should minimize the amount of processor-to-memory data movement. For example, consider a transport problem carried out on a spatial domain of x = 32, y = 32 and z = 16 grid points. One direct mapping is to map each 32 x 32 x-y array onto a 32 x 32 portion of the MPP array, so that each of the 16 submatrices corresponds to an index k (k = 1, ..., 16) in the z direction. This mapping is shown in Fig. 8; the numbers inside the squares show the corresponding index values. In this mapping, neighbors in the x- and
(y-) directions can be reached by a single column (row) shift. However, neighbors in the z-direction can only be reached by a sequence of 32 rotations and 32 shifts. This is demonstrated below in simplified excerpts from the FORTRAN program:

      DO 1 I = 1, 32
      DO 1 J = 1, 32
      DO 1 K = 1, 16
         SUMW = W(I, J, K) + W(I, J, K+1)
    1 CONTINUE                                                (III)

and its Parallel Pascal equivalent:

    wtemp := rotate(W, 0, 32);
    where (colindex >= 96) do
       wtemp := shift(wtemp, 32, 0);
    sumw := w + wtemp;                                        (IV)

An alternative mapping for the z-direction calculation is shown in Fig. 9. In this configuration the immediate neighbor's data are obtained by a single column shift. However, three different mappings are then necessary for the calculations in the x, y and z directions. Further work is needed to implement 3-D transport algorithms efficiently on the MPP.

Fig. 9. An alternative set of maps of the 3-D CTC model grid system to the MPP's 2-D array of processors. Different maps are used for calculating the transport in the x, y and z directions. Each map enables the necessary nearest neighbors in the transport algorithm to be obtained by a single column shift.

4. SUMMARY
Transport/chemistry calculations have been performed on the NASA/Goddard MPP computer. The MPP is an SIMD machine with 16,384 individual processors arranged in a 2-D array with nearest-neighbor connections between processors. This architecture is well-suited for performing the chemical calculations arising from 3-D transport/chemistry problems. When operator-splitting techniques are used, each grid point in a 3-D domain can be mapped to
a processor, and the chemistry at each grid point calculated in parallel. Speedup factors of over 450 and 10 were achieved on the MPP relative to the VAX 11/780 and Cray-2, respectively, for the chemistry calculations. The MPP is also well-suited for 2-D transport calculations. However, due to the 2-D physical arrangement of the processors, 3-D transport problems are more difficult on the MPP. These test calculations suggest that the MPP and similar machines are well-suited for coupled transport/chemistry applications and that more robust interconnect topologies (e.g. hypercube) are needed. However, further work is necessary to gain the knowledge and experience required to fully exploit these machines and to evaluate their utility.

Acknowledgements--The authors wish to thank NASA/Goddard for making the MPP computer available and for their assistance in using it, and the Graduate College at the University of Iowa for providing funds to carry out this study. We thank the reviewers for their helpful suggestions.
REFERENCES
Carmichael G. R., Peters L. K. and Kitada T., A second generation model for regional-scale transport/chemistry/deposition. Atmospheric Environ. 20, 173-188 (1986).
Cray Research, The Cray-2 Series of Computer Systems. Cray Research, Inc. (1987).
Gabriel R. P., Massively parallel computers: the Connection Machine and Non-Von. Science 231, 975-978 (1986).
Gustafson J. L., Reevaluating Amdahl's law. Commun. ACM 31, 532-533 (1988).
Hwang K. and Briggs F. A., Computer Architecture and Parallel Processing. McGraw-Hill, New York (1984).
Mitchell A. R., Computational Methods in Partial Differential Equations. Wiley, New York (1969).
Preussner P. R. and Brand K., Application of a semi-implicit Euler method to mass action kinetics. Chem. Engng Sci. 10, 1633-1641 (1981).
Thinking Machines Corp., Introduction to Data Level Parallelism. Technical Report 86.14, Cambridge (1986).