Parallel Computing 24 (1998) 181–203
Practical aspects
Implementation and performance evaluation of a parallel ocean model

Manu Konchady a,1, Arun Sood b,*, Paul S. Schopf c,2

a Mitre Corporation, McLean, VA 22102, USA
b Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
c Dept. of Computational Science, George Mason University, Fairfax, VA 22030, USA
Received 18 June 1996; revised 21 May 1997; accepted 13 July 1997
Abstract

Ocean models have been used to study global phenomena such as the El Nino Southern Oscillation. The effects of El Nino are felt worldwide and are of importance to all segments of society. High resolution ocean models are being developed to run more sophisticated simulations. The demand for computing power is expected to increase drastically with the use of high resolution models. Running extended simulations with high resolution models on current vector supercomputers will require computing power which is not available or prohibitively expensive. We have developed a parallel implementation suitable for running high resolution simulations. We studied performance issues such as the communication overhead and cache hit rate. Our experiments with grid partitioning and inter-processor communication techniques enabled us to obtain an efficient parallel implementation. Three grid partitioning techniques are evaluated. Two schemes to gather and scatter global grid data among processors are compared. A cache model to analyze performance is described. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Ocean model; High resolution simulation; Communication overhead; Cache misses; Grid partitioning
* Corresponding author. E-mail: [email protected]
1 E-mail: [email protected]
2 E-mail: [email protected]

1. Introduction

The use of high resolution models for studying ocean circulation was proposed by Semtner and Chervin in 1988 [17]. The prohibitive cost of running high resolution ocean
models on vector supercomputers has limited the use of such models. To lower the cost of simulation, several parallel ocean models have been successfully implemented on multi-processor distributed memory machines [2,11,13]. The use of parallel distributed memory computers for climate modelling has gained popularity because of the development of high performance processors and increased inter-processor communication bandwidth.

In this research, we have used a Cray T3D as our experimental platform, and have focussed on developing techniques that will increase performance. In particular, we have explored grid partitioning approaches, techniques to collect and distribute data and limitations of the cache on a Cray T3D processor. We show that the proper choices lead to enhanced performance.

Regular 3D grids have been traditionally used in ocean models, and thus the implementation of an ocean model on a parallel distributed memory machine requires the partitioning of a grid with a periodic exchange of shared data. We have analyzed the performance of three grid partitioning techniques for regular grids.

We describe a parallel implementation and evaluate the performance of Poseidon [15], a coupled ocean model, on the Cray T3D. Our implementation is similar to prior implementations of parallel ocean models [2,11] with respect to grid partitioning. The distinguishing feature of our model is the coupling interface. We have developed a mechanism for coupling the ocean model with land, ice, lake and atmospheric models. The ocean model we used does not include the treatment of ice and was coupled to a simplified atmospheric model. A coupled system consisting of multiple models including ocean, ice, atmospheric and land models can be used to address some of the complex climate simulation problems.

The exchange of forcing data between models is a significant overhead since data must be distributed and collected for all processors. In our implementation, we have studied two techniques to distribute and collect data efficiently. The communication overhead problem also occurs during I/O for restart and history files. We have not used any parallel I/O facilities, which can reduce some of the communication overhead. Our results show that global data can be distributed and collected with less overhead using a recursive scheme.

A new algorithm to find an optimized processor grid for a given topography using an irregular blocking technique has been developed. This algorithm is used to obtain a high resolution processor grid for a fixed number of physical processors. Our irregular blocking scheme is similar to the one developed for the MICOM ocean model [2]. We have evaluated performance using optimized processor grids for several configurations.

One of the limitations of running on the Cray T3D is the size of the cache on a single processor (8 Kbytes). While the 64 Mbyte per processor memory limit is sufficient to run most applications, the small cache can degrade performance. For a simple application, it is possible to rearrange data to obtain a high cache hit rate. It is more complex for a large application to determine the optimum data layout. Cache performance tools [4] can be used to determine bottlenecks. We have developed a technique to evaluate cache performance for multiple processor configurations and cache sizes without the use of a large multi-processor simulator. It can be extended to handle different cache models. We have used the technique to study the performance of a primary direct mapped cache.
Table 1
Parallel ocean models

Model       Platform              Grid partitioning   Coupled
SWEM [1]    SGI Onyx              regular 1D          no
MICOM [2]   Cray T3D, SGI Onyx    irregular 2D        no
POP [19]    CM-5 and others       irregular 2D        yes
POCM [18]   Cray YMP              irregular 2D        no
1.1. Parallel ocean models

Several organizations have developed parallel versions of ocean models. Table 1 lists some parallel ocean models. The platform, grid partitioning technique and coupling mode are described for each model in Table 1. All the models in Table 1 obtain parallelism by running the same code on multiple processors for separate regions of a grid. The partitioning techniques differ among models, and we have evaluated three techniques for the Poseidon ocean model [15]. Coupled parallel ocean models have the added complexity of exchanging forcing data with other climate models. In Section 3 we describe a scheme which can be used to reduce the communication overhead of exchanging forcing data.

The Poseidon ocean model was developed to study seasonal to decadal variability of the global ocean. Coupled with a land–atmosphere model, the ocean model has been used to study the El Nino phenomenon. Poseidon is a three dimensional general ocean circulation model (Fig. 1) extended from a two layer upper ocean model [14] with the following key features: generalized isopycnic vertical coordinates [2], a reduced gravity approximation [21], a staggered grid [11] and the provision for a turbulent surface mixed layer. The Poseidon Ocean Model User's Guide Version 4 [15] includes a detailed description of the model.

1.2. Coupled model

The ocean model includes provisions for coupling with other climate models.
Fig. 1. Ocean model layers.
Current model runs can be coupled with a simplified or a complete atmospheric general circulation model (AGCM). The coupling interface between the ocean and atmospheric models can be extended to other models such as ice, lake, or land models. Obviously, an ocean model run coupled with a complete AGCM takes much longer than a run with a simplified AGCM. The decision to run with a simplified or complete AGCM is made based on the experiment. In our experiments to evaluate performance, we have used the ocean model coupled with a simplified AGCM. The provision for this type of coupling is important for testing theories regarding El Nino, which is based on a complex interaction between the ocean and atmosphere.

The coupling process consists of the exchange of forcing data between the two models. The atmospheric model provides wind stresses and other forcing data, and the ocean model provides the sea surface temperature. The coupling interface between any two models is defined by the forcing data exchanged. Since each model requires forcing data to begin a simulation, there are two ways to initiate a coupled model run. The first technique is to run the two models alternately on a single machine; the forcing data for one of the models is read from a restart file, and it generates forcing data for the other model. The second technique is to run both models in parallel (Fig. 2); the forcing data for both models is read from restart files, and both models run concurrently on separate machines or on a single machine. A physical coupling exists between the ocean and atmospheric models, i.e. the forcing data generated by the ocean model affects the results of the atmospheric model and vice versa.

A distributed coupled model has been implemented using a Cray T3D, YMP and C90 [9]. The ocean model ran on the T3D while the atmospheric model ran on the C90, and a driver program ran on the YMP. The driver program handled communication between the models and synchronization, and performed interpolation of forcing data. One of the limitations of such a system is the bandwidth between machines. For a coupled model, high volume intermittent exchange of data is required. A low bandwidth connection will lower performance, since both models are idle until forcing data has been received.
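As a minimal sketch of the first technique described above (alternating the two models on one machine), the toy driver below exchanges SST and wind stress at each coupling interval. It is not Poseidon's actual driver; the routine names, fields and placeholder physics are assumptions.

    program coupled_driver
      implicit none
      integer, parameter :: ncouple = 4           ! coupling intervals in this toy run
      real :: sst(4), taux(4), tauy(4)            ! toy forcing fields
      integer :: ic

      sst = 300.0                                 ! stands in for data read from a restart file
      do ic = 1, ncouple
         call step_atmosphere(sst, taux, tauy)    ! AGCM uses SST, returns wind stress
         call step_ocean(taux, tauy, sst)         ! ocean uses wind stress, returns new SST
      end do
      print *, 'final SST:', sst

    contains

      subroutine step_atmosphere(sst, taux, tauy)
        real, intent(in)  :: sst(:)
        real, intent(out) :: taux(:), tauy(:)
        taux = 0.01 * (sst - 295.0)               ! placeholder physics
        tauy = 0.0
      end subroutine step_atmosphere

      subroutine step_ocean(taux, tauy, sst)
        real, intent(in)    :: taux(:), tauy(:)
        real, intent(inout) :: sst(:)
        sst = sst - 0.1 * taux                    ! placeholder physics
      end subroutine step_ocean

    end program coupled_driver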
Fig. 2. Coupled ocean–atmosphere model.
Running on a distributed system introduces a load balancing problem. The atmospheric and ocean models do not run for the same amount of time between coupling intervals. The model completing a simulation sooner must wait until the other model is ready for coupling. The use of other climate models such as an ice or land model together with the ocean and atmospheric models increases the complexity of running the coupled models on a distributed system. This is due to the differences between the times to run each model on a particular machine. Models taking less time for computation will be idle until forcing data is received from a more computation intensive model. Parallelizing and distributing multiple coupled climate models for a complete global climate simulation is a complex and challenging task. We have successfully executed simulations with the atmospheric and ocean models on a network of heterogeneous machines. The results from the distributed simulation were similar to the results from a simulation on a single machine. We did not study issues related to running distributed computations, such as load balancing, nor did we perform a rigorous verification of results.

1.3. Serial implementation

The fundamental purpose of Poseidon is to advance the state of the ocean. It can be called repeatedly and executes functions based on alarms. A global timekeeper maintains simulation time and alarms. Alarms are set for individual functions to run at specific time intervals during the simulation.

The overall program flow is shown in Fig. 3.
Fig. 3. Ocean model algorithm.
The initial ocean state is read from a restart file. The clock and alarms are set for the current run. External forcing data is exchanged with the atmospheric model. The coupling interface is built into the main driver routine. Coupling can be performed with ice, land, lake and atmospheric models. The ocean state is advanced by calling the hydrodynamics and other routines. The vertical diffusion, mixed layer and filtering [16] routines are executed less frequently than the hydrodynamics. A new ocean state is determined at the end of each hydrodynamics computation. Each of the hydrodynamic routines contributes to a partial derivative which is then used by the update routine to compute a new state. The ocean state variables are layer thickness, temperature, salinity, and zonal and meridional advection. Filtering, vertical diffusion and the mixed layer computation consume less than 1/3 of the total computation time. The hydrodynamics computations use about 2/3 of the total computation time. A large number of grid points implies loops with a higher iteration count and a proportionally longer computation time.

An explicit finite difference scheme is used by the update routine to compute a new state. A Shapiro filtering scheme [16] has been used to maintain numerical stability near the poles. Since the Arctic ocean was not included in the simulation, a stripe of land points is assumed near the North pole. Stripes of land points are used near the South pole. In our parallel implementation we have not used a compressed grid due to the problems of handling inter-processor communication.

1.4. Parallel implementation

About 15,000 lines of Fortran code have been ported to the Cray T3D from a Cray vector supercomputer. Our implementation includes the capability to use three communication protocols: PVM, MPI and shmem [6,8,12]. We have successfully run the model using each of the three protocols on the Cray T3D. Parallelizing the code involved three tasks: (1) developing code for inter-processor and global communication, (2) modifying the number of iterations executed for a do loop based on the number of processors used and (3) optimizing code to run efficiently with a small cache. We accomplished all three tasks while still maintaining the capability to run the sequential and distributed versions from a single source. We created a preprocessor file with a number of options for the number of processors, the type of grid partitioning, the communication protocol (PVM, MPI, or shmem), shadow border width and the tracing of communication calls. The C preprocessor was used to generate code from the source based on the options defined in the preprocessor file.

The first task of developing code for inter-processor communication and global communication involved implementing a grid partitioning technique and efficiently performing global computations. We analyze three grid partitioning techniques in Section 2. On the Cray T3D, the number of processors allocated for a job is always a power of two [3]. We have coded global computations assuming a complete binary tree. When running on a machine such as the Intel Paragon, additional code must be included to handle processor configurations which are not powers of 2.

The second task is a tedious job of changing the iteration counts of every do loop. The number of iterations is
based on the options in the preprocessor file. The start and end iterations are computed depending on the number of processors, shadow width and grid partitioning technique. The third task involves analysis of performance for a direct mapped cache. We discuss cache performance in Section 4. A number of the performance recommendations for a T3D processor may not be appropriate for a vector processor. Therefore, optimizing code to run on multiple machines involves iterative changes to verify that performance on any one of the machines has not degraded significantly.

The performance of the parallel algorithm depends on the communication overhead. Minimizing communication overhead while performing the required computation is a challenging task. A single program multiple data (SPMD) coding model was chosen for easier maintenance and simplicity; the SPMD model is easier to develop from uni-processor code. The implementation uses a single program on each processor working on a section of the 3D grid, with message passing to share common grid data.

Verification of results from the sequential and parallel algorithms must take into account the differences between processors and computations such as global sums. Since filtering is used in both algorithms, values will not exceed bounds. We have compared the results of the sequential and parallel algorithms to validate the parallel implementation. Successful extended runs were executed to generate a higher confidence in the use of a parallel implementation.
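As an illustration of the second task, the fragment below computes per-processor start and end iterations for a one-dimensional (striped) decomposition from the processor count and shadow border width. The sizes and names are assumptions, not the model's actual code.

    program loop_bounds
      implicit none
      integer, parameter :: ny = 256       ! global grid size along the partitioned axis
      integer, parameter :: npes = 8       ! processors along that axis
      integer, parameter :: shadow = 1     ! shadow (halo) border width
      integer :: mype, jstart, jend, chunk

      chunk = ny / npes                    ! assumes ny is divisible by npes (balanced load)
      do mype = 0, npes - 1
         jstart = mype * chunk + 1 - shadow         ! interior rows plus shadow rows
         jend   = (mype + 1) * chunk + shadow
         if (mype == 0)        jstart = 1           ! no shadow beyond the grid edge
         if (mype == npes - 1) jend   = ny
         print *, 'PE', mype, ': j =', jstart, 'to', jend
      end do
    end program loop_bounds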
2. Grid partitioning techniques

In this section, we briefly review three grid partitioning techniques for the ocean model. Each grid partitioning technique has pros and cons in terms of the volume of data communicated and the number of neighbor processors. The performance of each technique is assessed assuming that the computation for a grid point requires data from four neighbors (north, south, east, and west). The striping and blocking grid partitioning techniques have been compared in earlier research [5,10].

2.1. Striping

Striping can be performed along any one of the three axes. A stripe is assigned to a single processor and consists of a sub-grid. The size of the sub-grid is determined by the number of stripes (Fig. 4). The advantages of striping are simpler coding and the fact that the number of neighbor processors is always 2. The number of bytes exchanged is equal to the size of one dimension of the grid (a shadow row). The process of dividing the grid is simple since the entire grid can be viewed as a one dimensional array which is divided into sections. During an exchange, the shadow row is not copied into a temporary message buffer; the shadow data can be exchanged by simply providing an address and a length. A one dimensional distribution of the grid makes the task of gather/scatter of global grid data simple (see Section 3). For striping, the maximum number of processors that can be used is limited to the size of the y dimension of the grid.
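A hedged sketch of such an exchange for striping, written with MPI (one of the three protocols the implementation supports) rather than the model's own routines; the array name and sizes are assumptions.

    program stripe_halo
      implicit none
      include 'mpif.h'
      integer, parameter :: nx = 288, nyloc = 32     ! local stripe plus two shadow rows
      real(kind=8) :: t(nx, 0:nyloc+1)
      integer :: ierr, mype, npes, north, south, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
      t = real(mype, 8)

      south = mype - 1                               ! no neighbour beyond the first
      north = mype + 1                               ! and last stripes
      if (south < 0)     south = MPI_PROC_NULL
      if (north >= npes) north = MPI_PROC_NULL

      ! a shadow row is contiguous in memory, so only an address and a length are needed
      call MPI_SENDRECV(t(1, nyloc),   nx, MPI_DOUBLE_PRECISION, north, 1, &
                        t(1, 0),       nx, MPI_DOUBLE_PRECISION, south, 1, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(t(1, 1),       nx, MPI_DOUBLE_PRECISION, south, 2, &
                        t(1, nyloc+1), nx, MPI_DOUBLE_PRECISION, north, 2, &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
    end program stripe_halo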
Fig. 4. Striping and blocking.
2.2. Regular blocking

Regular blocking is a two dimensional partitioning technique which reduces the volume of inter-processor data communication compared to striping (Figs. 4 and 5). The grid is divided across the latitude and longitude dimensions into blocks of equal areas (or equal numbers of grid points). Each processor must exchange data with 4 processors. The implementation is more complex compared to striping. The grid cannot be viewed as a 1-dimensional array. Additional code must be written to handle the gather and scatter of shadow column data, which is non-contiguous. This additional code causes a degradation in performance compared to striping for large grids on small processor configurations. The number of processors that can be used in regular blocking exceeds that of striping.

2.3. Irregular blocking

For a global ocean model grid, approximately 35% of the grid points are land points and the remainder are water points (Fig. 6). No useful computation is performed on land points. A lower execution time can be obtained through the use of an irregular blocking technique [2]. We will show how an optimal processor grid can be obtained for a given data grid and processor configuration. For a topography with over 50% land points, the benefits of irregular blocking are significant.
Fig. 5. Regular blocking for oceans.
Using the algorithm below for a given processor configuration and topography, we can calculate an optimized processor grid. The optimized processor grid has the maximum number of processors that can be used while still including all water points. The best solution results in a small sub-grid for a processor. This results in fewer iterations per do loop and a corresponding lower execution time. With a lower communication overhead and computation demand, irregular blocking will give better performance results than regular blocking. The algorithm can be used for any given topography. It should be noted that by altering the topography, such as adding extra columns or rows of land along the edges of the grid, optimal solutions can be obtained which are not possible for the original topography. An altered topography can have more factors for processor grid configurations. For example, in Fig. 7 we have used a 12 × 13 processor grid for a 288 × 260 data grid. The dimensions of each sub-grid are 24 × 20, or 480 grid points per processor. For regular blocking, a processor grid of 8 × 16 is used for a 288 × 256 data grid. The dimensions of each sub-grid are 36 × 16, or 576 grid points per processor. The size of the sub-grid for irregular blocking is smaller by 96 points despite the use of a larger data grid.

Algorithm for optimized processor grid:
– Read topography mask
– Set procnum to maximum # of processors
– Do while procnum > # of real processors
  – Find all factors for procnum and the data grid, i.e. compute numbers x and y such that x*y = procnum and the dimensions of the data grid are divisible by x and y
  – Do for each factor pair
    – Compute the number of land processors
    – If (# of land procs + # of real procs >= procnum) then save the factors and exit from the do loops
  – Enddo
  – Decrement procnum
– Enddo
Fig. 6. Irregular blocking for oceans.
The purpose of the above algorithm is to obtain two numbers x and y such that the data grid can be divided into x · y blocks, with every block containing water points assigned to a physical processor. The algorithm begins by reading a binary topography mask. Procnum is assigned a large number based on the number of physical processors; for example, if 64 physical processors are used, then procnum can be set to 2000. The number 2000 is calculated based on estimated maximum values for x and y. In each iteration of the loop, factors x and y are computed for a procnum value. For each set of factors, the allocation of all water points to at least one processor is verified. This is accomplished by comparing the sum of the number of processors with only land points and the number of physical processors against procnum. If procnum is less than the sum, optimized factors have been obtained. Otherwise, procnum is decremented and iterations continue until either optimized factors are obtained or procnum is equal to the number of physical processors.
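A self-contained sketch of this search is given below, assuming a toy topography mask and illustrative sizes; it is not the model's code.

    program find_proc_grid
      implicit none
      integer, parameter :: nx = 288, ny = 260       ! data grid dimensions
      integer, parameter :: nreal = 128              ! physical processors available
      logical :: land(nx, ny)                        ! .true. where a grid point is land
      integer :: procnum, p, q, nland, bx, by, bi, bj
      logical :: found

      land = .false.
      land(:, 1:40) = .true.                         ! toy topography: a band of land rows

      found = .false.
      procnum = 2000                                 ! generous upper bound on virtual processors
      do while (procnum > nreal .and. .not. found)
         do p = 1, procnum                           ! factor pairs with p*q = procnum
            if (mod(procnum, p) /= 0) cycle
            q = procnum / p
            if (mod(nx, p) /= 0 .or. mod(ny, q) /= 0) cycle
            bx = nx / p                              ! sub-grid dimensions
            by = ny / q
            nland = 0                                ! count blocks containing only land
            do bj = 1, q
               do bi = 1, p
                  if (all(land((bi-1)*bx+1:bi*bx, (bj-1)*by+1:bj*by))) nland = nland + 1
               end do
            end do
            if (nland + nreal >= procnum) then       ! water blocks fit on the real processors
               print *, 'processor grid', p, 'x', q, 'with', nland, 'land blocks'
               found = .true.
               exit
            end if
         end do
         if (.not. found) procnum = procnum - 1
      end do
    end program find_proc_grid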
Fig. 7. Optimal processor grid for 128 processors.
The differences between the two blocking schemes are in the inter-processor communication and the collection/distribution of global data. Shadow data is not exchanged with a neighbouring land processor, since no physical processor is assigned to a land block. Processor 0 does not transmit or receive global data from a land processor.

Table 2 shows the improvement of optimal solutions for larger processor grids for a 288 × 260 data grid. With larger processor configurations, the number of land processors increases and a high resolution processor grid can be used. The % increase represents the percentage of processors in excess of the real number of processors. For the 512 processor configuration, we use a 36 × 20 grid, or 720 processors; the number of processors in excess of 512 is 208. The % increase will reach a stable value with more processors and should be approximately the same as the percentage of land in the data grid.
Table 2
Optimized processor grids for the ocean grid

No. of processors   Processor grid dimensions   No. of land processors   % increase in processors
64                  8 × 9                       8                        12.5
128                 12 × 13                     33                       21.9
256                 32 × 10                     67                       25.0
512                 36 × 20                     220                      40.6
1024                72 × 20                     466                      40.6
The irregular decomposition technique yields significant benefits when high resolution processor and data grids are used.

2.4. Execution timing analysis

We have analyzed the execution time (excluding I/O) of the model. The time for I/O is approximately 30% of the total time to complete a one month simulation. I/O time has not been modelled here and the focus is on execution time alone. We ran experiments with the ocean model coupled to a simplified AGCM. The processing time per timestep, without overlapping computation and communication, is

t_p = t_comm + t_comp

where t_comm and t_comp are the communication and computation times per timestep, respectively. A perfectly balanced load is assumed for calculating t_p. This is possible when an equal number of grid points can be assigned to every processor. Due to small differences in the computation or communication times for processors, the processing time t_p will not be identical for every processor.

The computation time is approximately a linear function of the number of grid points, n_g. This was observed by measuring the computation time for several grid sizes on a single processor of the Cray T3D. The computation time by component is described in Section 1.3. n_g is based on the number of processors used and the dimensions of the data grid. For a balanced load, n_g must be the same on all processors. This is true when the same computations are performed for every grid point and the values at grid points do not affect the type of computation performed.

The communication time, t_comm, is based on inter-processor bandwidth and latency. The latency for the Cray T3D is approximately 2.5 μs [12]. The time to exchange data is a function of the bandwidth, the latency and the number of bytes transmitted and received. Data can be exchanged using PVM [8] send and receive calls. On the Cray T3D, shared memory calls provide a higher bandwidth than PVM send and receive calls. We have used shared memory calls in the model and have included code for communication using PVM (version 3.3.7) or MPI (version 1.0) [6]. In our experiments, we found the Cray shmem protocol to be the fastest, followed by MPI and PVM. Table 3 lists the execution times for a 10 day run using 32 and 64 processors for a 288 × 256 × 8 grid.

2.4.1. Striping vs. regular blocking

The three components of communication time are the latency time, the transmission time and the gather/scatter time.
Table 3
Execution time (s) using PVM, MPI, and shmem

Number of processors   Cray shmem   MPI    PVM
32                     235          280    477
64                     121          161    552
The latency time cannot be changed by a program; however, the transmission time can be reduced by sending data packets which are near the peak of the bandwidth curve. For example, on the Cray T3D the latency time is of the order of microseconds, while the transmission time is in tens of microseconds for packets of 40 Kbytes or less and in hundreds of microseconds for packets up to 100 Kbytes. We note that in [12], Numrich has used a similar model to assess communication cost. In the striping grid partitioning technique, data can be sent by providing an address and the number of bytes to be sent, since shadow data are stored in contiguous locations. In the regular blocking grid partitioning technique, pre-processing must be performed prior to sending data and after receiving data (the gather/scatter operation) [7]. This additional overhead for exchanging data must be added to the latency time and transmission time. The communication time, t_comm, to exchange shadow data for the striping and regular blocking techniques can be defined as

t_comm = t_lat + t_tran + t_gs

where t_lat, t_tran and t_gs are the latency time, the transmission time and the gather/scatter time, respectively. In striping,

t_lat = 2 t_l,  t_tran = 2 t_x,  t_gs = 0

where t_l is the latency time to transfer data from the application to a system message buffer and t_x is the time to transmit a row of x bytes. In regular blocking,

t_lat = 4 t_l,  t_tran = 2 (t_xp + t_yq),  t_gs ≠ 0

where t_xp and t_yq are the times to transmit xp and yq bytes, respectively. If p and q are the dimensions of the processor configuration along the x and y axes, then for a grid with dimensions x × y, xp = x/p and yq = y/q. t_gs, t_xp and t_yq vary depending on the processor configuration and grid dimensions. In general, striping is better than regular blocking when

2 t_x < 2 (t_l + t_xp + t_yq) + t_gs.

2.4.2. Regular blocking vs. irregular blocking

Irregular blocking is similar to regular blocking and includes preprocessing to determine land and water processors (Figs. 5 and 6). Processor 0 reads the topography mask and builds a table consisting of real processor numbers, virtual processor numbers and a flag indicating a water or land processor. This table is broadcast to all processors. A virtual processor number is based on the location of a processor on the processor grid. A real processor number is the T3D PE number associated with the virtual processor number. Shadow data is exchanged with the four neighbor processors to the north, south, east and west. An exchange is not performed when the neighbor processor is a land processor. The benefits of irregular blocking are smaller sub-grid sizes for a fixed number of processors. With a smaller sub-grid, t_comp and t_comm will both be smaller, giving better performance. The experimental results below demonstrate the improvement in performance that can be expected with irregular blocking for the global topography.
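To make the inequality from Section 2.4.1 concrete, the short program below evaluates t_comm for both techniques using the 2.5 μs latency quoted earlier together with an assumed bandwidth and an assumed gather/scatter cost; neither of the latter is a measured value from this paper.

    program comm_model
      implicit none
      real(kind=8), parameter :: t_l = 2.5d-6        ! latency in seconds (from the text)
      real(kind=8), parameter :: bw  = 100.0d6       ! assumed bandwidth in bytes/s
      integer, parameter :: x = 288, y = 256         ! grid dimensions
      integer, parameter :: p = 8, q = 8             ! assumed 64-processor blocking layout
      real(kind=8) :: t_x, t_xp, t_yq, t_gs, t_stripe, t_block

      t_x  = 8.0d0 * x / bw                          ! one shadow row of 8-byte words
      t_xp = 8.0d0 * (x / p) / bw                    ! shadow segments for regular blocking
      t_yq = 8.0d0 * (y / q) / bw
      t_gs = 4.0d0 * t_l                             ! assumed pack/unpack (gather/scatter) cost

      t_stripe = 2.0d0 * t_l + 2.0d0 * t_x                     ! t_comm for striping
      t_block  = 4.0d0 * t_l + 2.0d0 * (t_xp + t_yq) + t_gs    ! t_comm for regular blocking
      print *, 'striping t_comm (s):', t_stripe
      print *, 'blocking t_comm (s):', t_block
    end program comm_model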
3. Distribution and collection of data

For coupled model runs, there is a frequent exchange of forcing data between the ocean and the atmospheric models. In a sequential implementation, all forcing data is available on a single processor. In a parallel implementation, forcing data must be collected and distributed at each coupling interval. An all-to-one and one-to-all approach is used during coupling. This approach is used because our interpolation algorithm requires the complete ocean and atmospheric grids. A parallel interpolation scheme would eliminate the need to distribute and collect forcing data at a single node.

The distribution and collection of data also occurs during I/O. When a run is initiated, data from a restart file is distributed to all processors. At the end of a run, data is collected for a restart file. During the run, data is collected for history files. For small processor configurations, the overhead is not excessive. With a large processor configuration (more than 64 processors), the overhead for collecting and distributing data increases rapidly. In our 256 processor implementation, approximately 10% of the execution time was spent in distributing and collecting data.

We have used two approaches for data distribution and collection, a hub scheme and a recursive scheme. These two schemes have been used on other platforms such as the Intel iPSC/860 and Touchstone Delta [20]. In the hub scheme, a single processor collects and distributes global data. This scheme is sequential since only one sub-grid can be sent to or received from the other processors at a time. An alternative is the recursive scheme. The idea is to recursively halve the size of the sub-grid and transmit data in parallel when global data is distributed (Fig. 8). A global grid can be built in log(n) steps, where n is the number of processors, instead of the (n − 1) steps required by the hub scheme.
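The sketch below prints the communication schedule of such a recursive scatter for a 1D decomposition on a power-of-two number of processors; it only illustrates the halving pattern (the sizes are assumptions and no message passing is actually performed).

    program recursive_scatter
      implicit none
      integer, parameter :: npes = 8, ny = 256      ! processors and global rows (assumed)
      integer :: lo(0:npes-1), hi(0:npes-1)         ! row range currently held by each PE
      integer :: step, dist, pe, partner

      lo = 0; hi = -1                               ! initially only PE 0 holds the grid
      lo(0) = 1; hi(0) = ny
      dist = npes / 2
      step = 0
      do while (dist >= 1)
         step = step + 1
         do pe = 0, npes - 1
            if (mod(pe, 2*dist) /= 0) cycle         ! only current holders send at this step
            partner = pe + dist
            lo(partner) = lo(pe) + (hi(pe) - lo(pe) + 1) / 2   ! pass the upper half on
            hi(partner) = hi(pe)
            hi(pe) = lo(partner) - 1
            print *, 'step', step, ': PE', pe, '-> PE', partner, &
                     ' rows', lo(partner), 'to', hi(partner)
         end do
         dist = dist / 2
      end do
    end program recursive_scatter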
Fig. 8. Gather/scatter of global data for 1D decomposition.
Fig. 9. Gather/scatter of global data for 2D decomposition.
The recursive scheme can be used for 1D and 2D grid partitionings (Fig. 9). For a 2D grid partitioning, the grid is recursively divided along the x and y dimensions.

4. A cache performance model

A parallel implementation of the ocean model on the Cray T3D with 256 processors achieved a throughput equivalent to a gigaflop, or about 3.8 Mflops per processor, before optimization. The peak performance of the alpha processor of the Cray T3D is 150 Mflops. The reason for the low megaflop rate of the ocean model was assumed to be the small cache size. In order to determine the effect of cache size on performance, a cache model was built to calculate the hit percentage for a subroutine.

The primary cache consists of 256 lines [12]. Each line is allocated a 32 byte segment, or 4 words, of the cache. The cache is direct mapped; the cache address of any element is the element address modulo 1024, where an element is an 8 byte field. The goal was to maximize the hit percentage. The example below shows the cache hit percentage for array elements. For every cache miss, a cache line is loaded with 4 words from main memory. This operation can take 21 or 30 clock periods depending on whether there is a page hit or miss in main memory.

Example:
    real a(37), b(37), c(37)
    integer i, j
    do j=1, 5
      do i=1, 37
        c(i)=a(i)*b(i)
      enddo
    enddo

Cache table

Name   Hits   Misses
a      148    37
b      148    37
In the first iteration of the outer loop in the example above, all references to elements of a and b result in cache misses. For all other iterations of the outer loop, the references to elements of a and b are in cache. Four write buffers, each one cache line long, are used for memory writes.

The cache hit rate cannot be estimated for a complex program by observation alone. Several cache performance tools have been developed [4]. Some of the tools have a graphical user interface to examine cache performance and profile the cache hit percentage by variable. The performance tool we developed was used to simulate cache performance for multiple processor configurations and various cache sizes. Instead of using a multiprocessor simulator, we replaced the subroutine to be profiled with generated code.
Fig. 10. Execution time for 744 timesteps.
A function call statement was made to a cache simulator with an address and a string as arguments. The cache simulator used the arguments to update a cache table. The generated code was included with the rest of the model code intact. A single timestep of the simulation was executed to generate the cache table.

Input Subroutine → Code Generator → Output Profile Code
Output Profile Code → Cache Table

Using the above technique, a single subroutine of a complex application can be studied in isolation without profiling the entire application. A multiprocessor simulator is not required since the generated code is used with the application to build the cache table. The cache table is an array of 1024 unsigned long integers. Each element of the cache table contains a memory address. The cache table is maintained by updating table entries as needed, based on the memory references generated. A cache line is the unit of replacement for the table; replacements are made starting from a cache line boundary.

After running a Cray performance tool to analyze the performance of the ocean model, one of the subroutines, update, was found to be the most time consuming subroutine of the ocean model. It consumed approximately 15% of the total execution time excluding I/O. All the remaining subroutines consumed less time. The cache performance tool was used to analyze the update subroutine of the model alone.
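The sketch below shows, in hedged form, the kind of bookkeeping such a cache table performs: a direct-mapped table of 256 four-word (32-byte) lines indexed by the word address modulo 1024. The reference stream is an arbitrary strided sweep, not the model's actual address trace.

    module cache_table_mod
      implicit none
      integer, parameter :: nlines = 256, wperline = 4
      integer(kind=8) :: line_base(0:nlines-1) = -1   ! line-aligned address held by each line
      integer :: hits = 0, misses = 0
    contains
      subroutine reference(addr)
        integer(kind=8), intent(in) :: addr           ! address measured in 8-byte words
        integer(kind=8) :: base
        integer :: line
        base = addr - mod(addr, int(wperline, 8))     ! first word of the 32-byte line
        line = int(mod(addr / wperline, int(nlines, 8)))
        if (line_base(line) == base) then
           hits = hits + 1
        else
           misses = misses + 1                        ! the whole line is (re)loaded on a miss
           line_base(line) = base
        end if
      end subroutine reference
    end module cache_table_mod

    program cache_demo
      use cache_table_mod
      implicit none
      integer(kind=8) :: i
      integer :: pass
      do pass = 1, 3
         do i = 0, 2047, 2                            ! strided sweep over 2048 words
            call reference(i)
         end do
      end do
      print *, 'hits =', hits, ' misses =', misses
    end program cache_demo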
Fig. 11. Interprocessor communication time for 744 timesteps.
The function of the update subroutine is to calculate a new state of the ocean after the computation of partial derivatives from the hydrodynamic calculations. It consists of a single outer loop over each layer of the ocean. The grid point values for layer thickness, horizontal velocities and tracers are revised. Two inner loop divisions are performed. Since update uses the most CPU time in our analysis, we decided to focus on this subroutine. Similar approaches can be used to analyze other program components.
5. Experimental results

Our results from experiments with grid partitioning techniques, distribution and collection of data, and a cache performance model are described in this section. In all experiments, we have used the ocean model coupled to a simplified AGCM. The first set of results consists of timings for various combinations of processor configurations and grid partitioning techniques; we have plotted results for three parameters: execution time excluding I/O, communication time and speedup. The second set of results compares the hub and recursive schemes to distribute and collect data for 1D and 2D partitioning. The final results are from the cache performance model for multiple cache sizes and processor configurations.
Fig. 12. Speed up for 744 timesteps using a 288 × 256 × 8 grid.
5.1. Grid partitioning

The results for grid partitioning are based on a 1 month simulation (744 timesteps). The number of processors used ranges from 64 to 512. A 288 × 256 × 8 grid was used for all experiments. No results are plotted for striping with 512 processors since the maximum number of processors that can be used with striping for the 288 × 256 grid is 256.

Irregular blocking gives the best results and striping gives the worst results of the three grid partitioning techniques (Figs. 10–12). The reason for the poor performance of striping is the high overhead for inter-processor communication. For example, striping uses 121 seconds and regular blocking uses 62 seconds for inter-processor communication when a 64 processor configuration is used. Clearly, the volume of communication affects performance, as demonstrated by the difference in timing results. The benefit of communicating with only two neighbours is not sufficient to overcome the lower communication volume associated with regular blocking. This is especially true for large grids. The superior performance of irregular blocking can be explained by the smaller communication overhead and sub-grid size. With higher processor configurations, the difference between the sub-grid sizes for regular blocking and irregular blocking decreases and therefore the benefits of irregular blocking are correspondingly lower. The largest execution time difference between the two techniques is for the 64 processor configuration. A larger problem size will give better results for irregular blocking.
Fig. 13. Execution time for 744 timesteps using 256 processors.
Fig. 14. Recursive versus hub scheme for gather/scatter in 1D.
In our next experiment (Fig. 13), we evaluated processor grid dimensions for a particular processor configuration. A processor grid dimension must be selected for a fixed number of processors. We ran seven tests with a 256 processor configuration. The processor grid dimensions ranged from 128 × 2 to 2 × 128. The 128 × 2 and 8 × 32 configurations took 185 and 113 seconds, respectively, for a 1 month simulation, a difference of 72 seconds between the two configurations. The poor performance of the 128 × 2 configuration can be explained by the long columns of its sub-grids: the time to gather/scatter data from long columns during inter-processor communication is responsible for the high execution time.

5.2. Recursive vs. hub

In Figs. 14 and 15, the results from the recursive and hub schemes are compared for a 288 × 256 × 8 grid. The recursive scheme scales well and gives better performance than the hub scheme at higher processor configurations. The disadvantage of the recursive scheme is the added communication of data through multiple processors. If a grid of dimension x × y is to be distributed over a processor grid p × q, then the time to distribute the grid using the hub scheme is
(t_l + t_tran(xy/pq) + t_g) · pq
Fig. 15. Recursive versus hub scheme for gatherrscatter in 2D.
where t_tran(xy/pq) is the transmission time for xy/pq bytes and t_g is the time to build a block of size xy/pq. The time to distribute the grid using the recursive scheme is
Σ_{i=1}^{dim} ( t_tran(xy/2^i) + t_l + t_gi )
where dim is the dimension of the processor grid and t_gi is the time to build a block of size xy/2^i bytes. The results for both decompositions are better when the recursive scheme is used. The rate of increase of communication time is higher with the hub scheme than with the recursive scheme for regular blocking.

5.3. Cache performance model

In this experiment, the cache size was varied from 8 Kbytes to 1 Mbyte and the number of processors was varied from 64 to 256. In Fig. 16, the hit rate versus cache size is plotted for three processor configurations for one subroutine (update) of the model. The cache size was varied by changing the size of the cache table in the code generator. The highest cache hit percentage, 61.6%, was obtained for the 256 processor configuration with a 1 Mbyte cache. The lowest cache hit percentage, 8.22%, was obtained for the 128 processor configuration with an 8 Kbyte cache. From the plot in Fig. 16, we can conclude that for a Cray T3D processor, the cache hit percentage for the update subroutine is a function of grid size, processor configuration and cache size.
Fig. 16. Cache hit rate percentage for a 288 × 256 × 3 grid.
Using the information provided by the cache performance model, the data layout was altered to group memory references, and shorter loops were used. With these changes, the performance of the update routine improved to about 8.2 Mflops per processor from the earlier performance of 5.4 Mflops per processor for a 64 processor run. Some dependencies were eliminated and an inner loop division operation was pre-computed for an iteration of the loop. These changes were made for the parallel version of the code alone. By using pre-processor directives, a single source was maintained with different sections of code for the parallel and vector versions. With these modifications, the execution time for the update subroutine fell from 15.7 seconds to 10.5 seconds, an improvement of approximately 33%. There is an obvious benefit to having a higher cache hit rate. However, a higher cache hit rate does not always result in a lower execution time; other factors mentioned earlier, such as dependencies and loop optimization, also affect performance.
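As an illustration of the last kind of change (a hypothetical fragment, not the actual update code), a division by a loop-invariant quantity inside the inner loop can be replaced by a multiplication with a reciprocal computed once:

    subroutine scale_field(a, b, dx, nx, ny)
      implicit none
      integer, intent(in) :: nx, ny
      real(kind=8), intent(in)  :: b(nx, ny), dx
      real(kind=8), intent(out) :: a(nx, ny)
      real(kind=8) :: rdx
      integer :: i, j

      rdx = 1.0d0 / dx                 ! reciprocal computed once, outside the loops
      do j = 1, ny
         do i = 1, nx
            a(i, j) = b(i, j) * rdx    ! previously: a(i, j) = b(i, j) / dx
         end do
      end do
    end subroutine scale_field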
6. Conclusions

The performance of a parallel coupled ocean model on the Cray T3D is based on the grid partitioning technique, the global data collection/distribution algorithm and the optimization of memory access. The irregular blocking technique is the most effective grid partitioning technique for the ocean grid with lower execution and communication
times than regular blocking and striping. An efficient algorithm to distribute and collect the global data lowers the communication overhead. Optimizing memory accesses by restructuring code to obtain a higher cache hit rate improves computation performance. Future developments include a parallel ocean and atmosphere model with parallel communication between the models.
References

[1] J.L. Bickham, Parallel ocean modelling using Glenda, Master's thesis, University of Southern Mississippi, Hattiesburg, MS, 1995.
[2] R. Bleck et al., A comparison of data-parallel and message passing versions of the Miami Isopycnic Coordinate Ocean Model (MICOM), Parallel Comput. 21 (1995) 1695–1720.
[3] Cray Research, Cray MPP Fortran Reference Manual, Mendota Heights, MN, 1993.
[4] J. Dongarra, O. Brewer, J.A. Kohl, S. Fineberg, A tool to aid in the design, implementation and understanding of matrix algorithms for parallel processors, J. Parallel Distrib. Comput. 9 (1990) 185–202.
[5] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[6] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, Cambridge, MA, 1994.
[7] J.L. Gustafson, G.R. Montry, R.E. Benner, Development of parallel methods for a 1024-processor hypercube, SIAM J. Sci. Stat. Comput. 9 (1988) 609–638.
[8] D. Morton, K. Wang, O.D. Ogbe, Lessons learned in porting Fortran/PVM code to the Cray T3D, IEEE Parallel Distrib. Technol. (1995) 4–11.
[9] M. Konchady, Implementation of a parallel coupled climate model, Proceedings of the Second International Conference on High Performance Computing, New Delhi, India, 1995.
[10] M. Konchady, A. Sood, Analysis of grid partitioning techniques for an ocean model, Proceedings of the Seventh High Performance Computing Symposium, Montreal, Canada, 1995, pp. 407–419.
[11] C.R. Mechoso et al., Parallelization and distribution of a coupled atmosphere–ocean general climate model, Mon. Wea. Rev. 121 (1991) 2062–2076.
[12] R. Numrich, P.L. Springer, J.C. Peterson, Measurement of the communication rates on the Cray T3D interprocessor network, Proceedings of the High Performance Computing and Networking Symposium, Munich, Germany, 1994.
[13] L. DeRose, K. Gallivan, E. Gallopoulos, A. Navara, Parallel ocean circulation modelling on Cedar, in: Proceedings of the 5th SIAM Conference on Parallel Processing for Scientific Computing, 1992, pp. 401–405.
[14] P. Schopf, M. Cane, On equatorial dynamics, mixed layer physics and SST, J. Phys. Oceanogr. (1983) 917–935.
[15] P. Schopf, Poseidon Ocean Model User's Guide, Version 4, Goddard Space Flight Center, Greenbelt, MD, 1994.
[16] R. Shapiro, Smoothing, filtering and boundary effects, Rev. Geophys. Space Phys. 8 (1970) 359–387.
[17] J.A. Semtner, R.M. Chervin, A simulation of the global ocean circulation with resolved eddies, J. Geophys. Res. 93 (C12) (1988) 15502–15522.
[18] J.A. Semtner, R.M. Chervin, Ocean general circulation from a global eddy resolving model, J. Geophys. Res. 97 (1992) 5493–5550.
[19] R.D. Smith, J.K. Dukowicz, New numerical methods for ocean modelling on parallel computers, Los Alamos Science, Los Alamos, NM, 1993.
[20] S. Takkella, S. Seidel, Broadcast and complete exchange algorithms for mesh topologies, Technical Report CS-TR93-04, Michigan Technological University, Houghton, MI, 1996.
[21] K.E. Trenberth, Climate System Modelling, Cambridge University Press, Cambridge, UK, 1992.