Computer Physics Communications
Computer Physics Communications 73 (1992) 72—92 North-Holland
Introduction to programming on distributed memory multiprocessors Johnny Petersen The IBM Bergen Environmental Sciences & Solutions Centre, Bergen, Norway
Often computations occunng in the physical sciences are very compute intensive and thus time consuming. Many of these problems are too big for today’s supercomputers. To solve some of these problems one needs a large number of supercomputers in order to both fit the problems and find a solution in a reasonable time. Parallel distributed memory multiprocessing will be the way to cope with many ofthese large computing problems. This paper will touch on some of the most common tools of an established parallel message passing environment, Trollius. Several examples of solving PDE common to the physical sciences will be discussed. Finally, the paper will introduce an efficient method for doing parallel volume rendering.
1. Introduction There is always a need for faster computing! It is the same as it is with research budgets (or private budgets). If the budget increases, we will always think of ways to spend it and at least 50% more. In scientific computing there will always be problems that will need large/fast computers, such as in QCD (quantum chromodynamics), quantum chemistry, fluid dynamics, global climate modeling etc. Computers running at teraflops with terawords of memory will not be too big for some of these problems. Present day serial (single processor) computers will be hard pressed to achieve such speeds. The smallest clock cycle on a commercial supercomputer today is 2.9 ns (the SX-3 computer from NEC). This is remarkable considering that light in vacuum only travels 87 cm in that time (the distance is less than half in some glass materials). With todays technology it is unlikely that this time will be greatly reduced. In order to get a teraflop out of a 2.9 ns computer 2900 floating —
______
Correspondence to: J. Petersen, The IBM Bergen Environmental Sciences & Solutions Centre, Thormohlensgt. 55, N-5008 Bergen, Norway. 0110-4655/92/s 05.00
point operations will be needed per cycle. The NEC SX-3 uses vector hardware and pipelining to get a “maximum” (unreachable) computational speed of 5.5 gigaflops (16 floating point operations per cycle). The most likely way to get a teraflop computer is to move to multiprocessor (parallel) computation. In all fairness, it is incorrect to call todays supercomputers “single processor” or serial. Computers with vector hardware can easily be defined as parallel, however in this paper “parallel computers” is defined as computers with several central processor units that can be programmed individually, often referred to as MIMD (multipie instruction stream multiple data stream) parallel computers. This definition will also leave out some of the massively parallel computers often referred to as SIMD (single instruction stream multiple data stream), such as the Connection Machine, MasPar and DAP. The parallel computers (MIMD) we are left with will be divided into two categories, the shared memory and the distributed memory computers; see fig. 1. We will leave out the “hybrid” category. The shared memory computers are computers connected via a fast bus or network to a large
© 1992 Elsevier Science Publishers B.V. All rights reserved —
J. Petersen / Programming on distributed memory multiprocessors
Computer Computer Computer Compute Compute
_______
_________
~ompute~_____ ______ ________
pilers for such machines making writing parallel programs “fairly” simple. Compilers and preprocessors are available which search out do-loops and subroutines that can be run in parallel. Thus a large number of “dusty deck” programs can Memory make use of shared~memoryparallelism. However, programs written with parallelism in mind will gain the most from these architectures. The focus of this paper will be programming on distributed memory computers, or Message Passing computers. Each of these computers is a full computer with memory, processor, and in some cases a hard disk. The most important features are (a) one or ~norecommunication links to other processors, and (b) each processor has a unique name (usually a number). The speed Computed of the link is usually much lower than the bus speed of the shared memory computers. With this shortcoming the programming strategy will ~omputei~ be different from th4 of computers with shared memory. If the processor can do a lot of computation in the time it takes to send a message, it _______
________
Computet~_____
0
.~
.:
_____Computed
Computeu~____ ~
.~
Pcomp~
.
0 .
.~ ..~..: ..~.
.
.: .
_____Compute .:*..ji~.H::.~~::: .:~:f ..
mputer_____ .~
E
0
p Computer_____ .
.j
.,.
.~7
.:,:
73
.
_____Compute
.:M~~y .:
_____Computer ..
~
__________
.
..
::.~*m~Y .
Fig. 1. Comparison of shared (top) and distributed (bottom) memory computers.
common memory. Often each of the processors will have a private cache memory as for example in IBM-3090 and Crays multiprocessor cornputers. The shared memory computers seldom have more than 20 processors (Alliant FX/2800 has a maximum of 14, IBM-3090 has a maximum of 6 and Cray Y-MP has a maximum of 8). Communication with the memory is fast as long as a small number of processors are using the bus. Much work has been done on parallelcorn-
is important to do as~much work as possible between communication. It will then be important from the outset to divide the problem domains that can be efficiently solved in parallel. “Dusty decks”, or old serial programs, will not run efficiently on a distributed memory computer. 1.1. Parallel computlng vocabulary As in most speciaUzed topics there are a few expressions belonging to parallel computing. We have already mentioned some of them, MIMD and SIMD fOr example. .
1.1.1. Communication time The communicatic~ntime is the time it takes one processor to communicate a message to another processor. A rqugh model for the communication time to a nearest neighbor is Tc
=
Tsetup + B
X
Naytes.
(1)
On some of the earlier machines the ~ was usually a little less th~n1 ms, and B around 0.002 to 0.01 ms/byte, see ~unigan [1]. The additional cost in making several hops before reachingthe
74
J. Petersen / Programming on distributed memory multiprocessors
destination has gone from being linear with the number of hops to almost nothing. The “wormhole” hardware[2] has greatly shortened the communication time. The computational time has also decreased with the new faster processors, so the ratio of communication time to cornputational time has probably not decreased. 1.1.2. Parallel efficiency Parallel efficiency (PE) is defined simply as the ratio of the time it takes to run a program on one processor to the time it takes to run it on a parallel processor multiplied by the number of processors. PE
— —
T~erjai
N x Tpar~iei~
(2)
We notice that the parallel efficiency can not be greater than 1, and that a parallel efficiency of 0.75 on 128 processors will give a speedup of 96. The time it takes for a program to run on N parallelprocessors is a function of the communication time between processors, TC; how many operations that must be done in serial; the load balance (the program can not be faster than the slowest part); the synchronization; and ofcourse the speed of the processors. One important parameter coupled with the parallel efficiency is the ratio of the communication rate to the computation rate, Mcomm/Mcomp. The Mcomp is the rate at which each processor can do a multiplication or addition of real nurnbers (often double precision). On todays RISC processors it will be highly dependent on how the data are laid out in memory. If each parallel processor is capable of a communication rate of 10 Mbytes/s and a computational rate of 20 Mflops/s (double precision) then the transfer of each double precision number is “costing” 16 floating point operations. 1.1.3. Amdahl’s law Amdahl’s law[3] is probably the main reason for the choice of “low” number of processors, N, among todays parallel supercomputer vendors; Cray with maximum 8, IBM with maximum 6, etc. Gene Amdahl argued that all pro-
grams had a fraction ofnecessary serial computation, s, with the parallelfraction, p being (1 s). The speedup, Sp, is thus limited by us: —
s
— —
s +p s + p/N
— —
1 s + p/NO
3
With a serial fraction of “only” 10% Amdahl argued that with an infinite number of processors all you can achieve is a speedup of 10. With 6 processors you would get a speedup of4 (12 processors would only achieve a speedup of 5.7). The problem with this reasoning is that it assumes that p is independent of N, or, in other words, that the problem size is held fixed. For many scientific applications how you address a problem will be a function of the computational resources available. With many more computers, you may wish to increase the computational grid, add more variables, do more exact (timeconsuming) computations per grid point, etc. Gustafson [4] argues that the parallelpart ofthe program scales with the problem size, i.e. the Serial components such as times for program loading, serial bottlenecks, and I/O do not scalewith problem size. Gustafson thus suggested an alternative to Amdahl’s law. His question is “how long will a given parallel program take to run on a serial processor?” Here s’ and p’ are the serial and parallel relative times spent on the parallel systern, s’ + p’ = 1. A uniprocessor requires 5’ + N p’ to perform the same task. Thus the scaled speedup becomes: ~‘ + Np’ Scaled Sp = / / S + P = N (1 s’) + s’. (4) —
This equation yields a more realistic performance prediction than Amdahl’s law. 1.2. Future in parallel computing Coarsly grained parallel machines, such as the Intel Hypercube, NCUBE, the Transputer based machines, are getting more and more sophisticated processors. Recently, RISC proces-
J. Petersen / Programming on distributed memory multiprocessors
sors such as the processor ofthe IBM RISC System/6000 computers can boast of achievable computational speeds of more than 50 Mflops. The memory chips are getting more compact (IBM is producing 4 MBit chips commercially) and there is a lot of research going on in high speed communication ( gigabit/s Using these present day parameters I believe that each processor ofa (near) future massively parallel computer (> 500 processors) will be capable of running at 100 Mflops and having upwards of 32 Mbytes ofmemory. We will then be well on the way towards a teraflop computer. ).
2. Introduction to Trollius parallel computing tools The parallel tools are needed to set up the systern around doing parallel computing. The message passing, the distribution of the programs, debugging, etc. are necessary ingredients of serious parallel software. Trollius [5] was initially made for parallel computers based on the Inmos Transputers but work is underway to port it to Unix workstations on a Local Area Network (LAN). The Inmos Transputer chips with 4 communication channels are built primarily for parallel processing. The T800-20 (20 MHz) Transputer has floating point circuits, 4 Kbytes “cache” memory and is capable of communicating at 40 Mbits/s both ways, i.e. 20 Mbits/s in one direction [6]. Inmos claims to sustain 1.5 Mflops on a single T800-20 (the reported Linpack benchmark is 0.37 Mflops using the 3L Fortran compiler [7]). Bergen Scientific Centre IBM has a Transputer based parallel computer, Victor 16 (built at IBM’s T.J. Watson research center), with 19 Transputersand two 300 Mbytes “parallel” hard disks, see fig. 2. Trollius is similar to other parallel computing tools such as ParaSoft’s Express, Intel 1PSC parallel computing tools, Meikos CSTools, and Helios.
75
2.1. System processes Under Trollius the system daemons residing on all the active computers (nodes) in the parallel system are[8]: bootdboots directly connected boards, bufferdcreates, kills, sweeps, shows buffer states, castd sends and receives messages from many nodes, dl.inomcreates datalink process pairs on Transputer links, echodechos test messages to test node and links, filesys accesses files from remote nodes, flatdprovides symbolic access to memory, kenyadcreates, dooms and shows process state, kernel coordinates message passing, loaddloads executable files, mallocdallocates and frees memory on nodes, memdreads and writes memory on nodes, routermaintains routing tables. A programmer may add his/her own processes or change the ones above. The most common one to change is the router. When doing changes in the connections in a Victor type Transputer system, the router needs to be updated with the “map” of the new connections. In Trollius the message router can be programmed using the NaIL (Node and Interconnect Language) [5] , a grammar which defines the router. NaIL has 6 types of functions (or “topology tuples”): TNODEspecifies node identifiers and types, TLINKlists physicai links between nodes, TWEIGHTspecifies default weights used by map when generating optimal paths, TROUTE specifies an entry in one node’s routing table, TORDERprioritizes the generation of an optimal path between the specified nodes, TSOLDERassociates TNODE node identifiers with actual nodes, reconfigures links. A router must contain TNODE and TLINK tuples. This is the minimum of data needed to generate a topology. The followinglisting is an example ofthe NaIL file for fig. 3.
76
J. Petersen / Programming on distributed memory multiprocessors
1 1
1
1
1
1
1
1
____
1
1
1
__
1
1
1
__
U Token Ring
~
2
1-2OMHzllOOwl4Mbyte.
2-20MHzT800-scsI
300MB
Fig. 2. Victor 16 at Bergen Scientific Centre IBM.
~1oo1~~BOO8~
Lii!~
~
~rj~
Fig. 3. Example of/* using NaIL.Connection ~TOJ~ /* Node Numbering,and Typeioool definition*/ B008 to Node 0*/ TNODE=(l000,0); /* HQst machine, Only TLINK=(100l, 2,,0,0,,)(0,0,,lOOl,2,,); node with type 0 */ TNODE=(l001,l); /* Inmos B008 board */ /* Node Connections */ TLINK=(0,3,,l,0,,)(l,0,,0,3,,) ; /* 0- 1 TNODE=(0,l); /* Lower left */ TLINK=(0,2,,2,l,,)(2,1,,0,2,,) ; /* 0-2 TNODE=(l,l); /~Top left */ TLINK=(l,2,,3,l,,)(3,1,,l,2,,) ; /* 1 -3 TNODE=(2,1); /* Lower right ~/ TLINK=(2,3,,3,0,,)(3,0,,2,3,,) ; /* 2-3 TNODE=(3,l); /~Top right */ TLINK=(0,u,,3,2,,)(3,2,,0,l,,) ; /* 0-3 ~2Hi2~
*/ */ */ */ */
/~TLINK Parameters: */
1* (Me,Outlink,,Node linked to, Node’s link,,)
2.2. System functions
(Same but Switched)~/ /* Host Connection to B008*/ TLINK= (1000, 0,”/dev/parcomp”, 1001,0,,) l00l,0,,l000,0,”/dev/parcomp”,);
The most used Trollius system functions
are [5] spread, loadgo, state and tkill. Some often used functions are tping, spawn, fish, sweep, bfstate, mstate and bfstate.
J. Petersen / Programming on
sprea~d; puts the operating system on the nodes. loadgonodes program; loads the parallel program, program, on the nodes defined by nodes. nodes can be either the host, h, or one or several of the nodes, n. For example, loadgo nO-3,5,79 myprog; loads myprog on nodes 0,1,2,3,5,7,8 and 9. The same user may load another program on the same nodes while the first one is running. state nodes; lists user processes on the nodes. state -g nodes will list both the user processes and the system processes. tkill; releases the nodes and destroys the system on the nodes: ,, tping nodes; will do a Unix type ping on the specified nodes. spawnsource destination; copies a processon one node (source) to another node (destination), fish; resets the parallel processors. sweep nodes; “sweeps” the buffer on the given nodes. bfstate nodes; is a very useful debugging tool. It causes the buffer states to be displayed. mstate nodes processes; reports the status of the memory allocation of the specified processes. .
.
2.3. Message passing primitives All interactions between processes running under Trollius[9] are done by means of message passing. A message is like a letter, it contains an address identifying a destination node and a tuple ( event,type). The rest of the message is a block of data. Messages are routed to the destination node by a router. The router acts like a postmaster. There is one router on each node. When a router receives a message it passes it to the next node in the path towards the destination node. At the destination node the message is delivered to a process waiting for that message or it is put in a buffer until a process asks for it. Trollius provides several types of primitives with different functionality. Increased functionality means more overhead. There are two levels of message passing in Trollius. The kernel level, or local level, and the
distributed memory multiprocessors
77
network level. The kernel level allows communication within a node and the network level between nodes as wellas within the node. Both network and kernel level offers blocking and nonblocking functions. The message passing primitives are send, trysend, recv and tryrecv. sendsends a message in a blocking way, it waits until the message is delivered. recv receives a message in a blocking way, it waits until the message is received. trysend sends a message in a non-blocking way. If a process is waiting for the message, it is sent. tryrecv receives a massage in a non-blocking way. If there is a message it is received, if not the rocess continues The parameters in a send subroutine are usually destination, event and/or type specification, number of bytes in the array to be sent, and a pointer to the starting position of the array. The recv subroutine sometimes need not know which node sent it, this information could be hidden in the event value. On the kernel level (between processes on one physical processor) the calls have a “k” prefix, k.send, krecv, etc.. The network level (i.e. between separate processors) has four layers, the “physical” layer, the “datalink” layer, the “network” layer and the “transport” layer. The “physical” layer is the most basic communication layer. It uses the “p” prefix, psend, precv, and it is the least “user friendly” but the fa~test.It is almost never used directly in Fortran programs. The “datalink” layer is the next layer up. It only communicates with the nearest neighbors (specified by the processors com~nunicationlink identifier). It has “d” prefix ai~dcomes with dtrysend and dtryrecv as well as dsend and drecv. The most used layer is the ne*work layer, with “n” prefix.
This layer allows “Mrect” communication with any processor (including the host). The “transport” layer communication calls have “t” prefix. tsend will block until the receiving process is ready. The transport layer functions are typically 4sed in debugging (naturally they have high overhead).
78
J. Petersen / Programming on distributed memory multiprocessors
I
0
I
I
1000
2000
Bytes Fig. 4. Comparison of performance of NSEND and DSEND.
Figure 4 shows performance in message passing between two neighboring processors using nsend-nrecv and dsend-drecv. It is clear that when one does nearest neighbor communication, as in the wave equation discussed below, dsend-drecv will give substantially better performance. In Trollius it is possible to send messages from one node to a group of nodes using the multicast function. When using multireeling, messages will come from a predetermined number of nodes and combine in some way in one node. The cornbination possibilities include adding all corresponding integers, taking the maximum, taking the minimum, and applying bitwise functions (AND, OR and XOR). 3. Parallel computing There are more than one way to produce an efficient parallel program. In this discussion we
will cover some examples using divide and conquer methods. Divide and conquer is a programming philosophy where the problem is divided into smaller sections which are solved separately and then put together in order to solve the whole problem. (Often the division is much easier than the conquering!) Domain decomposition is a set of divide and conquer methods, see ref. [10], which has had a new awakening due to parallel computing. We will discuss the 2-D acoustic wave equation solved using explicit forward timestepping. This is a fairly simple program to parallelize using “domain decomposition”. Two iterative solvers, SOR and conjugate gradient, follow next using communications similar to the wave equation example. The last example, “one-way dissection”, is a Gauss elimination method which has proved to be efficient, but much more difficult than the other examples.
1. Petersen / Programming on distributed memory multiprocessors
~
__________
clear which points at the old time levels the new
__________
_________
point depends on. This is helpful not only for comparing formulae but also in the design ofthe
_________
parallel algorithm.
~
_________
79
______
.4
~+ I
“ ‘ Fig. 5. Computational stencil for 2nd order wave equation.
3.1. Simple example, acoustic wave equation
We notice that we only need two P (x, y) arrays since the (t + öt) timestep can overwrite the (t öt) tirnestep. Since a value at one timestep only depends on the neighbors it is possible to do a simple split of the domain as in fig. 6. There is a duplica—
tion of data on the edges of the domains. The Consider the linear acoustic wave equation in 2 dimensions [11]: V2 ~‘
=
~1 ~r’ ~92~
(5)
edge data ofthe domain on one processor are not computed on that processor but on his neighbor. They need to be communicated after each time iteration. The parallel program on each subdomain will have the following parts:
with P
— —
P (x y) and 0
=
Initialize: Create the parameters for this domain, Velocity array and Boundary Conditions.
ç at ~ = 0
‘
‘
where P(x,y, t) is the pressure amplitude ofthe wave and v (x, y) its velocity, dependent on the density and bulk modulus of the medium. Using 2nd order finite differencing and eq. (5), the updated P becomes: P(x,y, t + =
ot)
2P(x,y,t)
—
P(x,y,t
—
cit)
2
Initialize this domains Pressure array
at t = 0 and t = —ot. Communicate Boundaries. Solve problem: 1. Do One Timestep on “inside” domain. 2. If Boundary at Global Boundary, then 3. Do Boundary Computation. 4. Communicate Boundaries. 5. If necessary: Output partial results. 6. Go to 1 until the end of computation. Output results.
[P(x+öx,y,t) +P(x
—
+P(x,y
ôx,y, t) + P(x,y + —
oy, t)
—
.
oy, t)
4P(x,y, t)].
.
Each point needs 9 floating point operations (6)
Here ox, Oy are the grid sizes in the x and y directions and ot is the time increment. In our casewe use a square grid so that Ox = oy = h. This formula is suitable for all points not lying on the spatial boundary. Reference [11] covers what needs to be done on the boundary, see also ref. [12] for a further discussion on boundary conditions. In fig. 5 the difference equation, eq. (6), is represented by a stencil. From this stencil it is
(flops) per timestep. If the total domain is M~ by M~ points, 9M~M~ flops are needed per timestep. Since no extra floating point work is done in doing this in parallel, each processor of N processors does 9Af~M,,/Nflops. The speedup when doing this on processors is then:
s
— —
T~(9M~M~) T~(9M~M~/r4) + T~omm
(7)
The T~(k) is the computer processing time for k flops. This used to be a simple function, but as new improved RISC processors are getting developed this isn’t easy anymore. For example,
80
J. Petersen / Programming on distributed memory multiprocessors
‘_~jH._
I1II~~IIj~ ___
~
~
~
:j:i Fig. 6. Message passing on finite difference domain for the wave equation.
on some calculations (provided the data is positioned right for certain computations) the IBM RS/6000 processor can do 2 double precision floating point operations per clock cycle. Reference [13] covers speedups using special floating point hardware in parallel. If we communicate on all 4 sides, we will do 4 send’s ( and 4 receive’s) per processor. Thus TComm ~ 4T 0 + B x (2 x 8 x M~ f~/W+ 2 x 8 x M~/v’N)for double precision computation. As an example, let us look at a processor capable of 0.5 Mflops, with T0 = 500~usand B = 1~us/byte.Let us compute how big M (= M~= M~)or equivalently M/ ../R, the subdomain size, must be in order to achieve a speedup of 0.9N (i.e. 90% efficiency): 0.9N =
9M2x2
N
2 > 2~ 32M (8) 9M ous-~-4x500~us+~~us .
Assuming perfect synchronization of processors we will get 0.9N speedup with a subdomain of 41 x 41 no matter if N is 16 or 256. By doing 4th order accuracy in space (by using 9 point stencil), the minimum computation per time step per point is 12 floating point operations. Since 4th order computations has a larger stencil anotherlayer ofedge values must be cornmunicated. After some simple computations we find that in order to achieve 0.9N speedup the subdomain arrays must be 42 x 42 or above. Vertical or horizontal strips can be used instead ofsquare subdomains. Strips are more efficient for small domains when T0 is “large” since a strip subdomain only communicates with two neighbors as opposed to four. On many problems the strips have advantages with vector type hardware. us determine how large the domain must Let be when the square subdornains are equal in communication overhead to strips for
J. Petersen / Programming on distributed memory multiprocessors
81
Recipes [15] describes an automatic method,
N processors:
Chebyshev acceleration, to obtain an optimal co. (squares)
4T0 + 4B8M/V’~=
(strips)
On a rectangular grid the above equation be-
comes:
2T0 + 2B8M; x~,1= x,,3 + (W/Cj,j) x [b~
valid when N> 4 or M=
8B
T0
—
16B/V’W
—
when N> 4.
Using the above parameters, M = 125 double precision values on 16 processors, and M = 83 on 64 processors. The square subdomains in this case will be about 31 and 10, respectively, For larger square domains, the square subdomains will be most efficient. 3.2. Red—black SOR One standard method for solving sparse elliptic partial differential equations is the simultaneous over-relaxation (SOR) [14]. A short cookbookstyle description is found in Numerical Recipes [15]. We will not go into the theory behind the method, just a brief description of the algorithm. Consider an elliptic equation, Lu = p’ where L is the elliptic operator, u the variable and p the source term. Relaxation methods relaxes an mitial chosen u ofthe diffusion equation, au/at = Lu p, to an equilibrium solution as t oc. As an example, let us choose L to be the Lapla2. in 2 dimensions. On a finite difference cian, grid, V the Laplacian operator is replaced with a second order differential matrix operator, A, reducing the elliptic equation to amatrix equation: Ax = b. b represents the p on the grid and x, the grid variable, replaces u. Solving via SOR becomes performing the iter—
ations: 1(b Ax), (9) wD where D is the diagonal part of A and w is the overrelaxation parameter. The convergence is highly dependent upon w. Numerical —
= ~ old +
~
~
x x,~+ ~ ~~
x
+ li,j x
+re,~X x•+~,~)],
(10)
where b~is the element of b corresponding to
the (i, J )th position on the grid. Similarly c,,~ are the elements along the diagonal of A, u,~ (up), d,~(down), 4,~(left) and r~,3(right) are the off-diagonal elei~ientsof A operating on the up, down, left and right neighbors of ~ respectively. Notice that the new x,3 depends on the new x_1,3 and x,~_imaking this algorithm impossible for vectoriz~tionand unsuitable for parallelization. If we Use red—black ordering, i.e. painting half of the x’s red and half black in a checkerboard patterji, and first do all of the red (and then all the black) this data dependency problem goes away. Since the operations for each iteration is “local”, i.e. the change in one x (i, j) only dependents on the nearest neighbors, it is easy to divide the domain and do each section in parallel, see fig. 7. Once the red x’s are computed, the edge values neec~to be communicated to the neighbor processors, The points are then ready for computatiGn. So black for each full iteration, red and black, two cpmmunications are needed. The amount of work done by each processor scales as 0(M2) an4I the communication scales as 0(M) when we are dealing with an M by M array. Since the communication time, T~ = T 0 + ~ is usually strongly dominated by the startup time, T0, it is wise to reduce the number cornmui~icationsnecessary increase theofnumber ofbytes sent. The ratioand To/Be and the ratio of compute time to communication time are the deciding factors. Thus by sending over a “compute ahead” modified black with the red it is possible to do only one communication
82
J. Petersen / Programming on distributed memory multiprocessors
~
lu;.
U Fig. 7. Red—black SOR checkerboards.
with each full iteration. The modified black x’s
are computed with the red x’s within the senders domain. Extra red information is needed in the receivers domain in order to make a “true” black x. The creation of the modified black can cause extra computation (still only of 0(M)). Let us use fig. 7 as an example. Let us assume that the whole domain is M~by M~,(16 by 16 in our case). The subdomains are (m~+ 1) by (m~+ 1) such that m~+ m~= M~in our case. If we let Xm~jbe a red x then Xm~,j~1is black. If these two x’s belong to the upper left square and if we number the arrays by columns and rows, then they are situated on the right edge of the square subdomain. Just outside the edge is Xmx + 1 j which is computed on the domain to the right where it is called X2,j. If the red Xmx,j has been just updated then he is ready to be sent to
the right to become x1,3 on the top right subdomain. Instead of sending xm~,j+ after the next step, we can send a modified ~ 1: ~
= xm~,j+1+
(w/cmx,j÷i)X
—(cm~,j+lX Xmx,j+1 +
+dmx,j+i
X Xmx,j
[bmx,j+i
Umx,j+1 X xm~,j+2
+ ‘~nx,j+1x xm~_i,j+i)].
(11) When x’ (mi, I + 1) arrives on the right side, it becomes x’ (1,1 + 1) and will become a true black x by Xi~+1 = x
rl,j+1 x X2,j+1,
(12)
J. Petersen / Programming on distributed
thus saving us from another communication, The Chebyshev acceleration (in order to find optimal w) mentioned above will need additional communication as often as the w is updated. The updating of w needs global communication. The new co is a function of data spread across all the processors. In the above example it is possible to add the necessary data to the one communication needed per iteration, i.e. no extra communication calls. With up to 16 processors connected in a torus (4 links per processor) this can be done, but if more processors are added additional communication will be needed. A parallel computer program should contain the following parts:
memory multiprocessors
83
the diagonal. This is smart since it will save us from one multiplication in the “matrix vector product” part. If D is the matrix only containing the diagonal values of A, we can compute the “normalized” matrix and vectors: (V”2AD”2)V112x ~
Ax
= b.
=
(13)
The standard conjugate gradient algorithm goes as follows: Initialize: x 0
=
r0
=
initial guess b Ax0 Po = r~’r0 Po = Do k Until Converged (i.e. until Pk
Initialize: Create the part of the matrix belonging to this domain. Create the r.h.s. belonging to this domain. Solve problem: 1. Do Red Area 2. Do Boundary: 3. Compute ahead Blacks using (11) 4. If necessary: Prepare update of co co information) 5. Communicate Boundary (and 6. If necessary: Update cv 7. Do Black Area 8. Update Black Boundary using (12) 9. Go to 1 until Converged As opposed to the previous section, more than Output results. one communication is needed during each iteration, i.e. for the matrix vector multiplication and for each dot product. In the matrix vector 3.3. Polynomial preconditioned conjugate multiplication one only needs to exchange inforgradient mation with the nearest neighbor. The dot product needs “global” communication and can thus Problems ending in an linear equation, Ax = be more costly on machines with a large numb, where A is a symmetric, positive definite mabar of processors. In the above algorithm neartrix can be solved with the efficient conjugate est neighbor commuilication is needed in step 1, gradient method [16]. We will not discuss the and global communi~ationis needed in steps 2 theory behind the method only its use. Considerand 5. ing the types of problems discussed in the previPolynomial preconditioning is one preconous section, A will be sparse and associated with ditioning method suggested by many authors, a computational stencil as discussed in the wave among them Dubois et al. [17], Johnson et al. equation section. [18], and Saad [19]. Since the preconditioning Let us first “normalize” the problem, i.e. transis done by sparse matrix vector products, it is form it such that the matrix will have “1” along very efficient on vector hardware. -
J. Petersen / Programming on distributed memory multiprocessors
84
The idea of preconditioning is to find a ma-
S~
:
trix,M,whichissomeapproximateinverseofA and is easy (and cheap) to compute. One such approximate inverse is found by using the Neuman series~ 1 +
=
—
~
+
—
~3 +
We let A = 1 + B, keeping in mind that we are assuming A is such that B 1. This gives 11<
—
2B3 + +B~ 1B+B 1 B(1 —B(1 —B(...))). (15) “~
= =
S3
::~ :::. :
Li
;
S4
S3
57
—
:
:
:
:H::
I I
Fig. 8. Dissected computational grid with seven separators.
(14)
A’ ~M
S2
Since B is the same as A without the diagonal, no extra matrix is needed in storage. Using this preconditioner the above algorithm will only be changed in 2 places, when r 0 is computed (r0 = M ( b Ax0)); and in statement number 1 (tic = MApk). With polynomial preconditioning we must do I + 1 ( I is the order of the polynomial) nearest neighbor communications in statement 1 instead of just one. It is possible to communicate a wider boundary instead, thus doing only one nearest neighbor communication. Since the setup time for communication, T0, is usually large compared to the transfer time, this will pay. Doing this, however, will cause overlapping computations on the boundary which is a relatively small job compared to the whole domain for large domains. The 1 is usually pretty small, seldom greater than 3, even on problems very suitable for this method such as problems with highly variable coefficients. The idea is to reduce the total number of iterations such that it pays to do extra nearest neighbor communications in order to do less global communications. -
3.4. One-way dissection One-way dissection[20,21] is in essence a labeling strategy for sparse linear systems. The
method used to solve the problem (finite difference or finite element) and the discretized differential operators used will determine the relationships between the gridpoints. Once this has been established the differential equation can be transformedintoamatrixform,Ax = f,where A takes the place of the differential operator, x isthesolutionvector,andfisthesourcevector. Consider solving a partial differential equation on a regular M~x M~grid, with M~ M~ (fig. 8 illustrates this with M~ = 31 and M~ = 6). The “natural” way of labeling grid.
.
.
points is consecutively row by row (or column by column). Doing “natural” ordering produces a large (M~x M~by M~x M~)sparse matrix with a semi-bandwidth of about M~(depending on the computational method used). The semibandwidth is the primary factor in the time it takes to solve the problem using Gauss elimination. One-way dissection reduces the “effective” bandwidth by letting a verticalgrid lines (which will be referred to as separators) dissectthe grid into a + 1 “independent” areas, e.g. see fig. 8 where a = 7. The gridpoints within each independent area are numbered first, consecutively row by row (i.e. first the gridpoints within A,, then A2, A3, A8). After all the gridpoints in all the areas have been numbered, the separator gridpoints are numbered in a nested dissection manner (see ref. [21]) (i.e. first the gridpoints within S,, then S3, S2, S5, S7, S6 and finally S4). Figure 9 shows the resulting A matrix using this numbering. This numbering of the separators is optimal for parallel computing. When using oneway dissection on a single processor, however, natural numbering of the separators is optimal. The numbered diagonal matrices, or “area” matrices, describe the dependences between the ...,
J. Petersen / Programming on distributed memory multiprocessors
L
~1i!I1~ ___
85
L~12’
__
~
H
Fig. 9. The matrix~resulting from a one-way dissection of the grid in the previous figure.
gridpoints within each area A, through A8, respectively. Likewise, the lower right diagonal matrices, or “separator matrices”, describe the dependences between the gridpoints within the separators S, through S7 of fig 8 respectively The off-diagonal matrices describe the dependences of the area matrices on the separator matrices. All the (sub) matrices will be banded. The area matrices relate to a grid of M~by m~gridpoints, where m~is an integer approximately equal to (M~ a )/ (a + 1) (in general, m~will not be the same for all subdivisions). These matrices are thus m~x M~by m~x M~large with semibandwidth of m~.The separator matrices (M~ by M~)are usually tridiagonal (semi-bandwidth of 1) and the rectangular off-diagonal matrices will be sparse. In many cases, the total matrix will be sparse and diagonally dominant and will thus not require any pivoting. The forward elimination is done first on the area matrices. This creates fill-in near the separator matrices, see fig. 10. Forward elimination is then done on the separator matrices whichare now dense due to fill-in. The last operation of Gauss eliminiation is back substitution (first on separator and then on area matrices), The parallelism of this method is obvious, see refs. [22—24].We let each node work on “it’s own” area matrix and the separators on both
S~S~
Fig. 10. Lower ~ right part (separator ~ part) S66 of the previous figure after the initial forward elimination.
~
.
a)
.~
a ~
.~ .~
~
.
—
Left Separator
SL
Right Separator
[II
SR
. Fig. 11. The problem as seen by one processor before forward elimination ofarea matrix.
sides. This will cause a small redundancy in storage but not in compuling. The situation at each node is pictured in figs. 11 and 12. In fig. lithe node does forward elimination on its area matrix and updates the separator matrices. When this is done the rest ofthe work, the elimination of the separators, is dOne in a tree structure way; fig. 13. When doing this elimination we will let the separator number determine which processor
J. Petersen / Programming on distributed memory multiprocessors
86
~Jj-1
_____
trices.matrices After theofelimination of the left side four the lower nght corner will the be
:~:.::.~:
I
~Jj+i _______
.:
.
...
a) .
I Si-i ~
I
L~_____ .~
.g~
sj+1,j~•i
fl~:T~?~T: ~ _____
‘~
HJ~ ~
.:.
I
J+14+1
_______
Fig. 12. Elimination of the separator matrices on one processor.
4
2
1
6
3
I
I
5
I
7
Fig. 13. Tree structure corresponding to the solving of fig. 10.
works on it, i.e. node 3 will be responsible for the forward elimination of separator 3. Because of the elimination of the area matrices, each separator, S~,becomes coupled to S,_, and S~1. The odd numbered nodes will thus receive matrices “related” and coupled to the separator it is working on. For example, node 3 will receive matrices from node 4, but not from 2 since the matrix coupling separator 3 to separator 2 was already on node 3 since node 3 contains both separator 2 and 3. Those nodes doing forward elimination on separators (all but node 8) have a 3 by 3 block matrix elimination problem shown in fig. 12. The two matrices below the top left one will be eliminated. This creates fill-in of the other ma-
sent to the appropriate even nodes (the second row of matrices will be sent to (i 1) and the bottomrowto(i+ 1)). The even nodes will receive the matrices and combine them and/or put them in the appropriate separator matrix arrays Due to the elimination of the odd separators, the coupling of the even will be between S~and 5i-2, and S~ and 5,nodes + 2 The matrix elimination picture is still fig 12 with j- 1 refemng to i — 2 and ,j + 1 referring to i + 2 When the forward elimination is done, this node will communicate with the i — 2 and i + 2 nodes. At the top of the tree only one matrix will remain (S 4 in our case). When forward elimination is finished, back substitution is done at the top of the tree first (S4) and then down along the tree structure until in the end all the area matrices are done. See ref.[23] for a more detailed description and timings on test problems. One drawback of this method is the loss of parallel efficiency because of the idle processors. Doing the separators in a tree structure causes the majority of the processors to be idle during this part of the computation. If M~is much greater than the number ofprocessors the majority of the work lies in the first part, i.e. elimination ofthe area matrices and the long rectangular matrices on the bottom. In this case the idle time is small. If this is not the case and when doing largeproblems it will be profitable to use a different algorithm for this part so that all of the processors could help in doing the separators. For example by using a dense matrix method similar to that of Chamberlain [25] Using such a method 2 processors could be used on each separator on the bottom ofthe tree, 4 on each of the next level of the tree, then 8 on the next level up, and so on. In order to share the workload more evenly among processors in the first part of the calculation the Areas on both ends should be larger than the ones in the middle, i.e. A, and A8 are greater than the other A’s, see a discussion on this in ref. [23]. —
.
J. Petersen / Programming on distributed memory multiprocessors
4. Graphics tools “One picture is worth a thousand words” says an old proverb. One picture is also easier to understand than a list of one thousand numbers. Thus graphics should become an integral part of computation. It eases debugging, it eases understanding of results, and it saves time. Since parallel computers are dealing with many processors and often larger data sets, more things can go wrong, and graphics is of even greater need. The difficulties with parallel graphics are the same as with parallel computing. There has been some activity in creating graphics packages as a part of parallel tools, e.g. Plotix [26], a part of Express, and Bergen Scientific Centre’s PGraph [27]. PGraph was initially developed to work under Trollius, but it is soon available for a system of workstations using other parallel tools. PGraph works with X-windows on the host machine. It contains a library of subroutines for Fortran and C such as setting colors, line drawing, polygon fill, dot drawing, opening and closing windows, etc. Several nodes can also produce graphics on the same window, Another effort at Bergen Scientific Centre, IBM, is rendering of data in 3 spatial dimensions. The rest of the chapter will be discussing this problem. 4.1. Motivation Computations on large 3-dimensional data sets are both compute intensive and consumers of large memory. Due to the size of large 3-D problems ( usually stretching over many timesteps) there is a problem in monitoring the solution process. Often just a few global parameters are being monitored due to the difficulty ofdisplaying all of the data, the cost involved in transferring the data to a graphics workstation or the space limitation of the graphics workstations. A graphical display of the whole domain is often desirable since it gives the scientist an overall view of the solution thus helping in the understanding of the results. During the debugging
87
and testing phase this allows early detection of errors. These errors include errors in the program, errors in the numerical method and errors in the model or the model parameters. It is our opinion that graphical output should become as common as writing numbers to a file. In order to achieve this graphics calls must be simple to use and the graphics must be quick. There is no doubt that if fast 3-D graphics was easily available it would be one of the most important debugging tools for the scientist. Let us illustrate the problem with a “back of the envelope” computation. Assume we have a fairly large 3-D problem, a 512 by 512 by 512 computing domain. We have several variables on each point ofthe domain. Forjust one double precision (8 bytes) variable, we need 1 Gbyteof memory. Let us assume that we need to transfer all the data to a graphics workstation in order to view the data. Transferringone (scalar) variable at 10 Mbytes/s will take around 100 seconds. Assuming we have 2 variables we wish to view, the transfer time to a graphics workstation will be 200 seconds. This will be true whether we do this on a single processor or a multiprocessor machine. We are proposing doing raytrace imaging locally on each processor. This should result in (a) reducing the amount of data to transfer out of the parallel computer and (b) a great speedup of the actual imaging computation since it is done in parallel. Let us return to the example above. Assume the problem was distributed on 512 large message passing processors using domain decomposition (.8 by 8 by 8 processors). Each processor will do a domain of 64 by 64 by 64 thus needing 2 Mbytes per double precision scalar variable. Assuming we want an image of 1024 by 1024 pixels each processor will be responsible for 128 by 128 pixels. Using the method discussed below approximately 9 floating point operations will be needed per point, i.e. 9 Mflops for the whole array on one processor. With processors Such as the one on IBM’s RS/6000 we can assume a computational speed of around 30 Mflops,, thus doing the raytracing in ~ seconds. When the raytracing is done we need to communicate. in a tree structure and do
88
J. Petersen / Programming on distributed memory multiprocessors
about 3 times 128 x 128 operations. This is small compared to the ~ seconds. Transfer of the final image (1—2 Mbytes) to the screen is about ~ seconds. In this hypothetical experiment less than 1 second is thus needed to display the image by doing parallel graphics as opposed to the 200 seconds of just high speed data transfer. 4.2. Volume rendering Simple raytracing is based on making each pixel in the window of the graphics screen the end point of a ray. The value of the pixel is directly related to the value of the ray at the endpoint. Often the ray is defined by a red, green and blue value as opposed to a pixel number which is usually a one or two byte number in a colortable. (Pixels can also be defined by sets of 3 RGB numbers.) The simple raytrace (back to front) is based on the reduction of the rays magnitude as it passes through regions of absorption and the increase in magnitude as it passes through regions oflight sources. From basic optics [28] we learn that the absorption is an exponential function of the opacity. The source term isjust additive. The ray at position x is a function of the ray at x — Ox: I~(x)= 1~(x Ox) e_0c~ox + S~(x), —
C E {r, g, b},
(16)
where 1(x) is the light intensity at position x, 0(x) ~5 the opacity at x, Ox is the distance between sample points in the light ray direction and S(x) is the source at x. The c subscript refers to the particular color: red, green or blue. I~ (0) is the background value. Although this sort of volume rendering has been extensively used by medical physicists (in X-ray imaging) the method has great potential for imaging of other types of data. In order to make the image useful to the scientist, he/she must create source and opacity functions which are related to the interesting parameters in the model. Some scalar variables can be represented by color while others by opacity. Porter [29] warns that if color and opacity vary
Fig. 14. Equipotential surfaces of octopole field.
together the resulting image will have little contrast. For example, in a fluid dynamics calculation Porter [29] used an opacity function related to the square magnitude of the vorticity, and a source function dependent on the helicity. Many parameters can be displayed since color can be used to distinguish different opacities and sources. The opacities can act as filters for particular colors. Figure 14 shows potential surfaces of 4 positive and 4 negative charged particles, all equally charged in magnitude. The charges are positioned at the corners of a cube such that each edge has one positive and one negative charge. Since only a small number of colors were available from the hardware, the opacity was set high at a particular potential surface and the source was a function of the gradient vector of this surface. When passing rays through a 3-D finite difference type mesh one must use numerical volume integration. Several integrating methods are available. See Mitchell and Dickof [301 for a comparison of a few different line integral algorithms.
J. Petersen / Programming on
—Proc.3—- —~Proc.2-——Proc.1 T ~~“1
/2’ //
distributed memory multiprocessors
89
containing the far back plane will also have a “global” background value ‘bc’ When M processors are involved, the global endpoint of the ray will thus have the value:
,/ “
~M+L~fx(I~M_l+L~f_l
1/’
//
Back
x(I~2 + L~2 x (...L~x Ibc)...)))(l9)
/
Front
Fig. 15. Path of ray through three processors.
4.3. Parallel volume rendering There is a problem in parallelizing eq. (16) since the present value ofI~depends on the previous. We will use the ideas of “alpha channels” [31] Figure 15 shows the difficulty with passing a ray through a volume situated on several processors (3 in this case). The front computers must wait until all the computers behind are done. Figure 16 shows a bird-eye view of 9 processor parallel raytracing, illustrating the rays passing through several processors. If we add another “screen”, the L screen (sirnilar to the alpha channel), this problem is solved: m~ Ox + S~’(x), jm(x) = Icm(x — Ox) e°c 1~’(0)= 0 V c, (17)
L~(x)
=
L~’(0)
=
L~’(x Ox) e_0r~ Ox, —
1 V c,
(18)
where the m superscript refers to the processor. The Im~screenholds all the images within processor m and assumes a “black” background. The Lm~screenstarts out with a “white” background, and the “white” rays get absorbed as they propagate through the “volume”. The Lm~screenthus only holds information about the absorptions in the m volume, When the ray has passed through to the other end of its volume the processor is left with two endpoints, 17’ (Xf) and L~’(xf). The processor
This saysthat thefinal endpoint is the I~added to the Lc multiplied by a background unknown at the start. Thus for any two processors where the first is situated in front of the next, they can combine their final I and L values: jf,fleW
f new
L~’
+
=
f =
(L~x It),
(20)
b
L~ x L~.
( 1
After the processors have done their raytracing in parallel (without message passing) the final image can be combined in a treestructure type communication which is 0 (log 2 N) where N is the number of processors. If for example we have 8 processors in depth, 1 and 2, 3 and 4, 5 and 6,them and 7in and 8 will combine images The and deposit the ~ownumber processor. next communication will then be done between 1 and 3, and 5 and 7. The finished image will be completed by cothmunication between 1 and 5. This work is usu~llynegligible compared to computingthe rays. E~igures17 through 19 shows each of the “sub-im~ges”of 3 processors. When combined they become fig. 14. In parallel raytrading the storage needed for the rays is double tiliat of single processor raytracing, and the exfra computational work per step is one extra mt4tiplication. This extra storage is usually small, however, compared to the storage of the volum~arrays used in the computation. The extra multiplication is negligible if the full exponential I~unctionis calculated, but if the exponential fun~tionis approximated with (1 o~(x )Ox) then~the extra work can become measurable (3 multiplications as opposed to 2 multiplications per iteration). —
90
J. Petersen / Programming on distributed memory multiprocessors
Back / /
Back
~0 ~/// ///
Back
7 ~/ 2~77 7
// /
/~,/
~
//
/
/
//1 //
/
77
/ ~
___vLi~12’//A7 ~
~ I
/ / /~7 //
~/ ~//
~
7
7~7~/
/
L//~~
/
F/
7~77 / /
K2~~ Front
Fig. 16. Paths of ray through nine processors at an arbitrary angle.
Fig. 17. Equipotential surfaces of the “back” processor.
Fig. 18. Equipotential surfaces of the “middle” processor.
J.
Petersen / Programming on
distributed memory multiprocessors
91
future machines. The ideas behind distributed memory computing Still works for shared memory machines. For example optimal use of such machines is often dQne by domain decomposition such that the decomposed part fits in one processors cache. The shared memory is then used as a communication link. The change to massively parallel computers will probably not be quick since the large group ofscientific users are comfortable with the speed (and low cost) oftodays powerful workstations and with the ease ofprogramming single processor machines. They have more software packages and easier tools.(graphics, debugging, etc.). Since there are no st~nd.ardsfor parallel computing, only those that need the extrapower ofthese machines and enjo~ithe challenge will venture into this “brave new, world”. Fig. 19. Equipotential surfaces of the “front” processor.
Acknowledgements 5. Conclusions This paper was meant to be a brief introduction to parallel computing on message passing machines. First by discussing some of the features of fairly standard parallel tools, Trollius. The other parallelmessage passing tools will contam similar features. The second major topic was numerical algorithms. Four standard algorithms often used in scientific computing were presented. These particular algorithms were presented in order to give an insight to the way one can approach a problem. There are many different ways to solve a problem with parallel computers, however. The reader interested in this should look through the literature, I would recommend a book by Fox et al. [32], proceedings from the many conferences on parallel computing, and many scientific penodicals on computation. The graphics section was included since I felt that graphics is such an important part oftodays scientific computing. The last word on parallel computing will not be said for a longtime. I believe that it is the only way to do large physics type codes in the future, but I am not sure of the architecture of these
The author wishçs to thank Finn Lundbo, Brit Gunn Ersland, John Grieg for their parallel help, Alan Tuchman and Børre Johnsen for LAT~Xhelp,and Jan Eileng for PostScript help. I~Mand RISC System/6000 are trademarks of IBM Corporatiob. Express and Plotix are trademarks of Paras~ftCorporation. Helios is a trademark of Perih~lionSoftware Ltd. Trollius is a trademark ofTh~Ohio State University and Cornell Theory Ceniter. iPSC is a trademark of Intel Corporation. 1~ransputeris a trademark of Inmos. NCUBE is aitrademark of NCUBE Corporation. Connection Machine is a trademark of Thinking MachiÜes Corporation. MasPar is a trademark of MasPar Computer Corporation. References [1] T.H. Dunigan, Hyp~rcubeperformance, in: Hypercube Multiprocessors 1987, ed. M.T. Heath (SIAM, Philadelphia, 1987). [2] Conference on Hyp~rcubeConcurrent Computers and Applications, 1988,I Vol. 1, ed. Geoffrey Fox (ACM, Baltimore, 1988).
92
.1. Petersen / Programming on distributed memory multiprocessors
[3] G. Amdahl, Validity of the single-processor approach to achieving large-scale computer capabilities, in: AFIPS Conf. Proc., Vol. 30 (1967). [4] J.L. Gustafson et al, Development of parallel methods for a 1024-processor hypercube, SIAM J. Sci. Stat. Comput. 9 (4) (July 1988). [5] Trollius User’s Reference, Research Computing The Ohio State University and Advanced Computing Research Institute Cornell Theory Center, Document Series 2/1 (22 January 1990). [6] Product overview, IMS T800 transputer, Inmos, 72TRN 117 01 (November 1989). [7] J.J. Dongarra, Performance of Various Computers Using Standard Linear Equation Software’, CS-8985, Computer Science Dept. Univ. of Tennessee (31 January 1991). [8] Trollius Man Pages, Research Computing The Ohio State University and Advanced Computing Research Institute Cornell Theory Center (31 October 1989). [9] Trollius Reference Manual for Fortran Programmers, Research Computing The Ohio State University and Advanced Computing Research Institute Cornell Theory Center, Document Series 2/4 (7 February 1990). [10] See articles in First International Symposium on Domain Decomposition Methods for Partial Differential Equations, eds. R. Glowinsky, G.H. Golub, G.A. Meurant and J. Périaux (SIAM, Philadelphia, 1988), and proceedings from subsequent Symposia. [11] J. Petersen and R. Renaut, Synthetic 2-D seismic wave propagation using a hypercube parallel computer, Geophys. Trans. 34 (1988) 309. [12] R. Renaut and’ J. Petersen, Stability of wide angle absorbing boundary conditions for the wave equation, Geophysics 54 (1989) 1153. [13] R. Renaut and J. Petersen, Evaluation of a vector hypercube for seismic modelling, in: Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, 1988, Vol. 2, ed. G. Fox (ACM, Baltimore, 1988). [14] L.A. Hageman and D.M. Young, Applied Iterative Methods (Academic Press, New York, 1981). [15] W.H. Press et al., Numerical Recipes: The Art of Scientific Computing (Cambridge Univ. Press, New York, 1986). [16] GH. Golub and C.F. Van Loan, Matrix Computations (The Johns Hopkins Univ. Press, Baltimore, 1983).
[17] P.F. Dubois et a!., Approximating the inverse of a matrix for use on iterative algorithms on vectors processors, Computing 22 (1977) 257. [18] O.G. Johnson, C.A. Micchelli and G. Paul, Polynomial Preconditions for conjugate gradient calculations, SIAM J. Numer. Anal. 20 (1983) 362. [19] Y. Sand, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Stat. Comput. 6 (1985) 865. [20]A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems (Prentice-Hall, Englewood Cliffs, NJ, 1981). [2l]S. Pissanetsky, Sparse Matrix Technology (Academic Press, London, 1984). [22] PH. Worley and R. Schreiber, Nested dissection on a mesh-connected prossesor array, in: New Computing Environments: Parallel, Vector and Systolic, ed. A. Wouk (SIAM, Philadelphia, 1986). [23] T.-H. Olesen, and J. Petersen, Vectorized dissection on the hypercube, Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, 1988, Vol. 2, ed. G. Fox (ACM, Baltimore, 1988). [24] R. Wait, Partitioning and perconditioning of finite element matrices on the DAP, Parallel Comput. 8 (1988) 275. [25] R.M. Chamberlain, An alternate view of LU factorization with partial pivoting on a hypercube multiprocessor, in: Hypercube Multiprocessors 1987, ed. M.T. Heath (SIAM, Philadelphia, 1987). [26] Express Fortran User’s Guide Version 3.0, Parasoft Corporation, Pasadena, CA (1988, 1989, 1990). [27] F.C. Lundbo, PGraph Reference Manual for FORTRAN Programmers, IBM Bergen Scientific Centre (December 1990), unpublished. [28] See any basic optics text. For example: M. Born and E. Wolf, Principles ofOptics, 6th ed. (Pergamon, Oxford, 1980). [29] D.H. Porter, Perspective volume rendering, MSI report 91 / 149 Minnesota Supercomputer Inst. Preprint Series, Univ. of Minnesota, Minneapolis, MN. [30] J.R. Mitchell and R. Dickof, A comparison of line integral algorithms, Comput. Phys. 4 (2) (March/April 1990). [31] T. Porter and T. Dufl Compositing digital images, Comput. Graph. 18 (1984) 253. [32] G. Fox et a!., Solving Problems on Concurrent Processors, Vol. 1 (Prentice-Hall, Englewood Cliffs, NJ, 1988).