Parallel processing for PHAST: a three-dimensional reactive-transport simulator

David L. Parkhurst and Kenneth L. Kipp

U.S. Geological Survey, MS-413, P.O. Box 25046, Denver Federal Center, Denver, Colorado, U.S.A. 80225
A parallel algorithm for the reactive-transport simulator PHAST was developed for a Beowulf cluster of Linux PCs. The Local Area Multicomputer implementation of the Message Passing Interface standard was used for communication among processors. PHAST simulates reactive transport by operator splitting the calculation into a flow and transport step and a chemical reaction step. A load-balancing algorithm was developed that uses random mapping of cells to the processors for the reaction task and rebalancing after each time step. Testing with a demonstration field-scale example showed that the parallel algorithm was scalable over a range of processors and problem sizes. Load balancing is an important element of the parallel algorithm. Super-linear speedup indicated that the sequential program is not optimal in cache usage. Relative speedups on 16.8 effective processors ranged from 10 to 24, which can bring field-scale reactive-transport simulations into a reasonable timeframe.

1. INTRODUCTION

The numerical simulation of ground-water flow and solute transport with geochemical reactions has become more widely used over the past decade. Applications have been made to environmental and industrial contamination problems as well as to investigations of natural hydro-geochemical processes. However, this type of simulation is very computationally intensive and often has required the use of large vector or massively parallel computers. Three-dimensional simulation of dozens of components and chemical reactions can easily exceed the capacity of desktop workstations or personal computers (PCs). Speeding up multicomponent reactive-transport simulation in a PC environment would make such simulations accessible to a much wider group of ground-water modelers.

One class of parallel computer is a cluster of workstations or PCs on a private, high-speed, local area network (LAN), running a free-software operating system like Linux. This class is known as a Beowulf cluster. An early prototype study on parallelizing a ground-water flow simulator on this class of computer was done a decade ago [1]. However, little has been published on the parallelization of ground-water reactive-transport simulators on Beowulf clusters.

In this paper, we present the results of our initial efforts at designing and implementing a parallel version of PHAST, a three-dimensional, multicomponent, reactive-transport simulator. Our primary objective was to parallelize PHAST such that field-scale simulations (tens of thousands of nodes and dozens of reactions for hundreds of time steps) can be done in a reasonable time (hours) with a modest cost in computer hardware (PCs). The basis of our parallel design, the results of testing and evaluation of scalability and load balancing, and conclusions are described.
2. PARALLELIZATION OF PHAST

The computer program PHAST (PHREEQC and HST3D) simulates multicomponent, reactive solute transport in three-dimensional saturated ground-water flow systems. It was created by linking PHREEQC [2], a computer program for geochemical calculations, with HST3D [3], [4], a simulator of variable-density ground-water flow, heat, and solute transport. The heat-transport and variable-density portions of HST3D have been removed or disabled. The equations that are solved numerically are (1) the saturated ground-water flow equation; (2) a set of solute-transport equations, one for each solute component of the chemical reaction system; and (3) a set of chemical reaction equations comprising equilibrium and kinetically controlled reactions. The solute-transport equations and the chemical reaction equations are coupled through the chemical concentration terms. Numerical solutions are obtained for potentiometric head and for concentrations of solute components and species using a sequential approach. Operator splitting [5] is used to separate the solute-transport calculations from the chemical reaction calculations. Finite-difference techniques are used for spatial and temporal discretization of the equations. Simulation of a time step is performed by a sequential solution approach without iteration (SNIA [5]).

A demonstration problem run on a single processor required 98% of the simulation time for the reaction-chemistry calculations; only 2% of the time was used for flow and transport calculations. Because the overwhelming majority of time was spent on the reaction calculations, we focused on parallelization of the reaction calculations.
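The operator-split time stepping can be pictured with a short sketch. The following C fragment illustrates only the sequence described above (a flow-and-transport step followed by independent per-cell reaction steps); the function and array names are placeholders, not PHAST routines.

/* Sketch of one operator-split (SNIA) time step: flow and transport
   for all components, then an independent chemistry calculation for
   each cell.  Names and sizes are illustrative placeholders only. */
#include <stdio.h>

#define NCELL 8          /* number of model cells (tiny demo value)   */
#define NCOMP 3          /* number of transported chemical components */

/* Advance flow and advective-dispersive transport for all components. */
static void flow_and_transport(double conc[NCELL][NCOMP], double dt) {
    (void)conc; (void)dt;        /* ... solve flow and transport ...   */
}

/* React the chemistry of one cell in place (equilibrium + kinetics). */
static void react_cell(double *cell_conc, double dt) {
    (void)cell_conc; (void)dt;   /* ... geochemical calculation ...    */
}

int main(void) {
    double conc[NCELL][NCOMP] = {{0.0}};
    double dt = 0.5;                              /* time step, e.g. 0.5 yr */
    for (int step = 0; step < 40; step++) {       /* e.g. 20 yr total       */
        flow_and_transport(conc, dt);             /* ~2% of run time        */
        for (int cell = 0; cell < NCELL; cell++)  /* ~98% of run time;      */
            react_cell(conc[cell], dt);           /* independent per cell   */
    }
    printf("done\n");
    return 0;
}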
2.1 Hardware and software environment

The parallel machine model [6] is a number of von Neumann computers (nodes) coupled by an interconnection network. Each computer executes its own program on its own local data and communicates with other computers by sending messages over the network. The memory is distributed among the processors. This configuration forms a distributed-memory, multiple-instruction, multiple-data (MIMD) computer. Our Beowulf cluster comprises 9 dual-processor nodes with Pentium III processors running at 700-800 MHz and 1 dual-processor node with an AMD processor running at 1200 MHz, all using the Linux operating system. Local memory amounts ranged from 0.75 to 1 GB. Network connections are by dual 100-Mbit Ethernet with an effective data transfer rate of about 150 Mbit/s. The PHAST code was compiled with the Lahey/Fujitsu Fortran 90 compiler and the GNU C compiler.

The parallel programming model [6] consists of tasks connected by message queues called channels. One or more tasks can be mapped to a processor node. We used the Message Passing Interface (MPI) [7] variant of the task/channel model as implemented by the Local Area Multicomputer (LAM) project [8]. Message-passing programs create a fixed number of tasks at program startup, each using local data. Each task executes the same program on its own data, following a single-program, multiple-data (SPMD) model. Named tasks interact with messages by calling MPI library routines. We used only 6 of the 129 functions in the MPI standard.
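A minimal SPMD skeleton of the kind described above is sketched below. The six MPI routines shown are an assumed, typical minimal set; the paper does not list which six functions PHAST actually calls.

/* Minimal SPMD skeleton (compile with mpicc).  The division of roles
   between rank 0 and the other ranks is illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);               /* fixed number of tasks at startup */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* identity of this task            */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of tasks            */

    if (rank == 0) {
        /* e.g., the flow-and-transport task sends work to the others */
        double work = 1.0;
        for (int p = 1; p < size; p++)
            MPI_Send(&work, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
    } else {
        /* e.g., a reaction task receives its share of the work */
        double work;
        MPI_Recv(&work, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("task %d of %d received %g\n", rank, size, work);
    }

    MPI_Finalize();
    return 0;
}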
2.2 Parallel algorithm for PHAST

In designing our parallel algorithm, we followed the four steps of Foster [6]: partitioning, communication, agglomeration, and mapping.

Partitioning was done by functional decomposition. The flow and transport tasks are logically separate in constant-density flow. The operator-splitting algorithm in PHAST separates functions into a transport task and a chemical reaction task. The data requirements for the reaction task are independent for each mesh cell. Data communication is required between the flow, transport, and reaction steps. Figure 1 shows the flow control graph [9] for our partitioning for N cells (tasks) for a single time step; additional time steps are indicated by the feedback loop. The nodes of the graph represent tasks and the directed edges represent execution dependencies. Data dependencies do not exist for a task with no incoming edges. The parallel structure of the reaction tasks is apparent. Although the transport of each chemical component could be done in parallel, we did not partition this task. The cell data for each reaction task are disjoint; thus, the partition is complete.

Data communication patterns can be characterized as local/global, structured/unstructured, static/dynamic, and synchronous/asynchronous [6]. Our parallel algorithm for PHAST has a communication pattern that is local, with each reaction task communicating with only the transport task; structured, with a partitioning that may have dynamic mapping; static, with reaction and transport tasks as fixed communication partners; and synchronous, with communication after the sequential tasks of transport and reaction. Parallel communication includes scattering from the transport task to the reaction tasks and gathering from the reaction tasks to the summary calculation task, which includes the workload-rebalancing task.

We agglomerate the flow and transport tasks. We also agglomerate sets of cells into a few reaction tasks. Flexibility is maintained by allowing the size and population of each set to vary dynamically for each time step. A fixed number of agglomerated tasks is defined at initialization; later versions of MPI do allow a dynamic number of processes. In mapping the reaction tasks to the processors, there is full concurrency (task independence) and locality (data independence), because the reaction computation is independent for each cell. Each agglomerated task is mapped to a different processor.

Agglomeration and mapping are handled by a load-balancing algorithm comprising both a static and a dynamic component. For our demonstration example, the reaction chemistry gives a larger computational workload for cells within the contamination plume than for cells outside it. As the plume moves through the ground-water system, the group of cells with higher computational workload changes.
Figure 1. Flow control graph for PHAST showing nodes (circles) and edges (arrows). E, completion task; F, flow task; I, initialization and load-balancing task; N, next-time-step task; R_i, reaction task i; S, summary calculation and load-balancing task; T, transport task.
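The scatter/gather exchange between the transport task and the reaction tasks shown in Figure 1 might be coded as in the following sketch. The use of the collective MPI_Scatterv/MPI_Gatherv calls, the data layout, and all names are illustrative assumptions, not the actual PHAST implementation.

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for the per-cell chemistry calculation. */
static void react_cell(double *cell_conc, double dt) { (void)cell_conc; (void)dt; }

/* Scatter cell concentrations to the reaction tasks, react, and gather
   the updated values back on rank 0 (the transport/summary task).
   ncells[p] is the number of cells currently assigned to rank p, and
   ncomp is the number of transported components per cell. */
void exchange_and_react(double *all_conc, const int *ncells, int ncomp,
                        double dt, int rank, int size)
{
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int offset = 0;
    for (int p = 0; p < size; p++) {
        counts[p] = ncells[p] * ncomp;   /* doubles destined for rank p */
        displs[p] = offset;
        offset += counts[p];
    }

    double *local = malloc(counts[rank] * sizeof(double));
    MPI_Scatterv(all_conc, counts, displs, MPI_DOUBLE,
                 local, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int c = 0; c < ncells[rank]; c++)      /* independent chemistry */
        react_cell(&local[c * ncomp], dt);      /* for each local cell   */

    MPI_Gatherv(local, counts[rank], MPI_DOUBLE,
                all_conc, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(local); free(counts); free(displs);
}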
The static component of the algorithm consists of randomly selecting the cells that form the agglomerated tasks during the initialization task, so that no processor has only cells with little reaction chemistry. To adjust for heterogeneity in processor capability, a small calculation is run on each processor to measure its relative speed; the results are used to establish the initial number of cells in each agglomerated task.

The dynamic component of the algorithm consists of rebalancing at the end of each time step. Mappings are revised by moving cells to or from processors to equalize the reaction-task computation time on each processor. We determine the $n_p$ such that

$$n_p \tau_p = T \quad \text{for } p = 1, \ldots, P, \qquad \text{subject to the constraint } \sum_{p=1}^{P} n_p = N, \qquad (1)$$

where $n_p$ is the number of cells on processor $p$; $\tau_p$ is the time per cell for the reaction task on processor $p$; $T$ is the time for all processors to complete the reaction task; $P$ is the number of processors; and $N$ is the total number of cells.
An optimal load balance will reduce the total idle time for all of the reaction tasks to zero. A damping factor is used to eliminate flutter in rebalancing: only half of the cells identified for remapping are actually transferred at any rebalancing step. This algorithm is very inexpensive, typically consuming about 0.03% of the total computation time.
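A sketch of this rebalancing step, assuming it is implemented directly from equation (1) with the 50% damping described above, is given below; the names and the example numbers are hypothetical, not taken from PHAST.

#include <stdio.h>

#define MAXP 32   /* maximum number of processors handled by this sketch */

/* Given the measured reaction time on each processor and the current
   cell counts, compute damped new counts for the next time step. */
void rebalance(int P, const double task_time[], int ncells[])
{
    double tau[MAXP];
    double inv_sum = 0.0;
    int N = 0;

    for (int p = 0; p < P; p++) {
        tau[p] = task_time[p] / ncells[p];   /* time per cell on processor p */
        inv_sum += 1.0 / tau[p];
        N += ncells[p];
    }
    double T = N / inv_sum;                  /* common completion time, eq. (1) */

    for (int p = 0; p < P; p++) {
        int target = (int)(T / tau[p] + 0.5);    /* n_p = T / tau_p            */
        ncells[p] += (target - ncells[p]) / 2;   /* damping: transfer only     */
    }                                            /* half of the proposed move  */
    /* A full implementation would also reconcile rounding so that the
       counts still sum exactly to N. */
}

int main(void) {
    /* Hypothetical two-processor example: processor 0 is twice as slow
       per cell as processor 1. */
    double task_time[] = {400.0, 250.0};   /* seconds spent on reactions */
    int    ncells[]    = {200, 250};       /* current mapping, N = 450   */
    rebalance(2, task_time, ncells);
    printf("new counts: %d %d\n", ncells[0], ncells[1]);  /* prints 175 275 */
    return 0;
}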
The four steps described above produced a parallel algorithm for PHAST on the Beowulf cluster. This is a coarse-grain parallelization in that the ratio of computation time to communication time is large.

3. TESTING AND EVALUATION

To assess the performance of this algorithm, we used a single demonstration example. Both strong and weak testing were done for scale-up evaluation [10]. Strong testing increases the number of processors while keeping the problem size fixed; weak testing increases the number of processors while also increasing the problem size, so that the workload per processor remains either constant or increases. Speedup and efficiency are two common metrics, defined as [6]

$$S_p = \frac{t_1}{t_p} \quad \text{and} \quad E_p = \frac{S_p}{p}, \qquad (2)$$
respectively, where $t_1$ is the execution time with one processor; $S_p$ is the relative speedup with $p$ processors; $t_p$ is the execution time with $p$ processors; $E_p$ is the relative efficiency; and $p$ is the number of processors. Relative speedup measures the reduction factor for execution time with an increasing number of processors and is limited by the non-parallel segments of the program and by the communication load. Relative efficiency measures the fraction of time that the parallel program is doing useful work.

Our performance evaluation focuses on execution time and parallel scalability. Total execution time can be split into computation time, communication time, idle time, and other time. The computation time can be further split into transport (and flow) task and reaction task times. Load balancing and other overhead tasks are placed in the other-time category. Experimental timing with five replicates showed timing variations of less than 1%.
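As a worked illustration with hypothetical timings (not measurements from this study): if $t_1 = 1000$ s and $t_4 = 200$ s, equation (2) gives $S_4 = 5$ and $E_4 = 1.25$; an efficiency greater than 1 corresponds to the super-linear speedup discussed in Section 3.2.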
3.1 Description of the demonstration example

The demonstration example problem was adapted from a simulation study done on the sewage plume at the U.S. Geological Survey Toxic Substances Hydrology Research Site [11]. The transport of phosphorus was simulated by a reactive-transport model having 17 aqueous components, 12 surface-complexation species, 50 aqueous-complexation species, 2 mineral precipitates, and about 66 reactions. The set of chemical reactions included phosphorus sorption, dissolution and precipitation of iron and manganese oxides, two kinetic organic-carbon decompositions, and cation sorption. The two organic decomposition reactions were kinetically controlled, while all other reactions were equilibrium controlled. The model domain was 1600 m long, ranged from 1000 m wide at the sewage-treatment plant to 1600 m wide at the downgradient boundary, and was 45 m thick. Ground-water flow was generally from north to south, with some flow into Ashumet Pond. We simulated 20 years of sewage-plume movement with a time step of 0.5 yr, covering two distinct infiltration rates from the sewage-disposal beds, which cause two sets of reaction-chemistry conditions.
3.2 Scalability

Strong scalability was tested using five fixed problem sizes consisting of about 1700, 4200, 8400, 16000, and 31000 cells for the discretization, denoted N1 through N5, respectively. The number of processors was increased by powers of two up to 16, and then the maximum of 20 processors was used for problems N3, N4, and N5. The numbers of processors were converted to an effective processor count by adjusting for the different clock speeds. This adjustment was important only for the 16- and 20-processor cases, giving effective numbers slightly greater than the actual processor counts. A 700-MHz processor was used for all the single-processor runs.

Plots of relative speedup and efficiency, including the line of perfect speedup, are shown in Figure 2. Observe that super-speedup (values greater than the increase in the number of processors) was obtained when using 2 to 8 processors on problem N4 and when using 2 to 22.3 effective processors on problem N5; the corresponding efficiency values were greater than 1. For problem N3, a slight super-speedup was found with 2 and 4 processors. The two smallest problems (N1 and N2) did not exhibit super-speedup. These results indicate poor cache usage by suboptimal sequential code. As the number of processors increases, the number of cache misses is reduced because of the smaller amount of data on each processor. Smaller amounts of data can also remove the need for paging of virtual memory.
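The formula used for the clock-speed adjustment mentioned above is not given in the paper; a natural choice, assumed here, is to normalize each processor's clock rate to the 700-MHz reference, $P_{\mathrm{eff}} = \sum_{p=1}^{P} f_p / (700\ \mathrm{MHz})$, under which a 1200-MHz processor counts as $1200/700 \approx 1.7$ effective processors.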
Figure 2. Relative speedup and relative efficiency for demonstration problems of size N1 through N5. The straight line shows perfect speedup.
However, no paging was detected during our testing. The relative speedup and efficiency curves for problems N1 and N2 (Fig. 2) appear normal, indicating that these problems make sufficiently good use of the cache in the sequential program. Note also that we have not reached a point of severely declining speedup for this example within the range of processors available.

The total execution time is split into flow plus transport and reaction task times, communication time, idle time, and other time for problem size N4, as shown in Figure 3. The execution time for 20 time steps (10 yr) and 16000 cells ranged from 12000 s to 900 s (200 minutes to 15 minutes). The flow plus transport time is essentially constant, as expected, because these are non-parallelized tasks. With 22.3 effective processors, the time spent on the flow plus transport task is about equal to the time spent on the reaction tasks. The idle time decreases and the communication time increases slightly with an increasing number of processors. These times form an increasing fraction of the total time but are still relatively minor overhead costs.

Weak scalability was tested by abstracting data from the relative efficiency curves in Figure 2 for problem sizes N1 through N4. The 90% isoefficiency function (Fig. 4) gives insight into the scalability of the algorithm to larger problem sizes. With limited data, we see that the number of processors has to increase slightly faster than the problem size in order to maintain the same efficiency (workload per processor). Thus, our algorithm is almost perfectly scalable in problem size over the range tested, because the isoefficiency function is $N(P)\big|_{E=0.9} = P^{1.1}$. However, this result could change if the super-speedup were eliminated.
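If the fitted function is read as a scaling relation, $N(P)\big|_{E=0.9} \propto P^{1.1}$, then maintaining 90% efficiency while doubling the number of processors requires increasing the number of cells by a factor of $2^{1.1} \approx 2.14$, only about 7% more than a straight doubling of the problem size.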
Figure 3. Total execution time and its components for demonstration problem size N4.
Figure 4. Ninety-percent isoefficiency function for the parallel PHAST algorithm.

3.3 Load balancing

The efficiency of load balancing is defined as the ratio of the minimum reaction-task time predicted by the load-balancing algorithm to the actual reaction-task time. The performance of the load-balancing algorithm is illustrated by the efficiency over simulation time shown in Figure 5 for a simulation of example N4 using 16 processors. Four cases are shown, comprising the various combinations of initial randomization of the cells and dynamic adjustment of the cell mapping. Randomization is more effective than dynamic adjustment of the cell mapping when each is used alone. Using both algorithm components, we achieve better than 95% load-balancing efficiency over most of the time steps.

Figure 5. Efficiency of load balancing during a simulation using 16 processors.

From the test results presented, it appears that we have a very scalable parallel algorithm for PHAST, both in the number of processors and in problem size. We have not reached diminishing effectiveness for the number of processors, and the demonstration example is highly scalable up to at least 31000 cells. We see that the communication and idle times are reasonably small and that the reaction task time can be reduced to take about the same time as the flow plus transport tasks.

4. CONCLUSIONS

The development and testing of a parallel algorithm for the PHAST simulator on a Beowulf cluster yielded the following observations and conclusions:
• Our objective of reducing the total computation time for a field-scale example of reactive transport in ground-water flow to a feasible timeframe has been achieved.
• Our parallel algorithm is highly scalable over the range of numbers of processors tested, and almost perfectly scalable over the range of problem sizes tested.
• Load balancing is very important for the overall efficiency of the parallel algorithm, with the initial random mapping providing the most effective part of the balancing.
• The sequential program needs further optimization for more efficient usage of the memory cache.
• Future directions for development will include parallelization of the transport task.

Disclaimer: The use of firm, trade, and brand names in this article is for identification purposes only and does not constitute endorsement by the U.S. Geological Survey.

REFERENCES
1. M.J. Eppstein, J.F. Guarnaccia, and D.E. Dougherty. Parallel Groundwater Computations Using PVM. In T.F. Russell, R.E. Ewing, C.A. Brebbia, W.G. Gray, and G.F. Pinder, eds., Computational Methods in Water Resources IX, v. 1, Numerical Methods in Water Resources. Computational Mechanics Publications, London, and Elsevier Applied Science, Boston, p. 713-720, 1992.
2. D.L. Parkhurst. User's guide to PHREEQC (Version 2)--A computer program for speciation, batch-reaction, one-dimensional transport, and inverse geochemical calculations. U.S. Geological Survey Water-Resources Investigations Report 99-4259, 1999.
3. K.L. Kipp, Jr. HST3D: A computer code for simulation of heat and solute transport in three-dimensional ground-water flow systems. U.S. Geological Survey Water-Resources Investigations Report 86-4095, 1987.
4. K.L. Kipp, Jr. Guide to the revised heat and solute transport simulator: HST3D--Version 2. U.S. Geological Survey Water-Resources Investigations Report 97-4157, 1997.
5. C.I. Steefel and K.T.B. MacQuarrie. Approaches to modeling of reactive transport in porous media. In P.C. Lichtner, C.I. Steefel, and E.H. Oelkers, eds., Reactive Transport in Porous Media, Reviews in Mineralogy, v. 34. Mineralogical Society of America, Washington, D.C., p. 83-130, 1996.
6. I.T. Foster. Designing and Building Parallel Programs. Addison-Wesley, Reading, MA, 1995.
7. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, MA, 1995.
8. G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In J.W. Ross, ed., Proceedings of Supercomputing Symposium '94, University of Toronto, p. 379-386, 1994.
9. J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
10. S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes. SIAM, Philadelphia, PA, 2001.
11. D.A. Walter, B.A. Rea, K.G. Stollenwerk, and J. Savoie. Geochemical and hydrologic controls on phosphorus transport in a sewage-contaminated sand and gravel aquifer near Ashumet Pond, Cape Cod, Massachusetts. U.S. Geological Survey Water-Supply Paper 2463, 1996.