Parallel Computing 29 (2003) 1–20
www.elsevier.com/locate/parco

Practical aspects

Large scale computing at Rijkswaterstaat

E.A.H. Vollebregt a,b,*, M.R.T. Roest a,b, J.W.M. Lander c,1

a VORtech Computing, P.O. Box 260, 2600 AG Delft, The Netherlands
b Large scale modeling group, Delft University of Technology, Delft, The Netherlands
c Rijkswaterstaat/RIKZ, P.O. Box 20907, 2500 EX The Hague, The Netherlands
Received 3 February 2001; received in revised form 19 February 2002; accepted 10 August 2002
Abstract

The Dutch Rijkswaterstaat uses simulation models extensively in carrying out its various tasks, among which are the protection of the country from flooding and the management of shipping routes and ports. Different applications of the models lead to large scale computations. Furthermore, the continuing increase in the level of detail of the simulations demands more and more computing power. In the past few years, Rijkswaterstaat/RIKZ, Delft University of Technology and VORtech Computing have developed and implemented techniques to make these large scale simulations possible. First the 3D shallow water model TRIWAQ has been parallelized. Then this parallel version has been extended to allow for different forms of domain decomposition. Finally, various on-line couplings with different models have been established. In this paper we give an overview of these developments and of our approach towards parallel computing, which enabled us to carry out all of these developments in a single conceptual framework.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Large scale computing; Shallow water models; Parallel computing strategy; Model-coupling; Communication library
* Corresponding author. Address: VORtech Computing, P.O. Box 260, 2600 AG Delft, The Netherlands. Tel.: +31-15-2850125; fax: +31-15-2850126. E-mail address: [email protected] (E.A.H. Vollebregt).
1 The author died suddenly on April 10, 2001, at the age of 46.
1. Introduction

In this paper we describe the various activities concerning large scale computing and simulation of water systems at the Dutch Rijkswaterstaat, the directorate-general of Public Works and Water Management. Important tasks of Rijkswaterstaat are:

• the protection of the country from flooding;
• the management of shipping routes and ports;
• water management in terms of quantity and quality.

Furthermore, Rijkswaterstaat is responsible for the construction, management and maintenance of public works, and for research and regulations related to the public works.

Rijkswaterstaat uses simulation models extensively in carrying out its tasks. The models are used for operational purposes, for supporting policy making, and for research, to gain a better understanding of the physical processes involved. The computational demands of the simulations vary from a few minutes on a simple PC to several days on today's most powerful supercomputers.

The development of simulation models (programs) for use within Rijkswaterstaat is concentrated within the Dutch National Institute for Coastal and Marine Management (Rijkswaterstaat/RIKZ). A general model infrastructure "SIMONA" has been built, and a number of general simulation models have been realized within this framework. The motivation for the framework, besides re-use of the generic infrastructure, is to enhance the exchange of information between different models. The most important applications are WAQUA and TRIWAQ, which implement 2D and 3D simulation models for shallow waters: rivers, estuaries and coastal areas.

There is a continuing need to resolve more and more details of the processes that are simulated. This leads to the use of finer grids and smaller time steps in numerical simulations, and to the incorporation of more aspects of the physical processes in the models, such as density variations and turbulence. These demands continually increase the computational complexity of the simulations to be performed.

In the past few years, Rijkswaterstaat/RIKZ, Delft University of Technology and VORtech Computing have developed and implemented techniques to make these large scale simulations possible. First of all, the 3D shallow water model TRIWAQ has been parallelized and tested on various supercomputers. A somewhat uncommon viewpoint has been taken for the parallel implementation, in which parallelization consists of the coupling of different instances of the sequential program rather than breaking up the calculation. This strategy opened the way to a large number of other developments. For instance, the parallel version of TRIWAQ has been extended to allow for using different vertical resolutions in the subdomains that are computed by the different instances of the program, and is being further extended towards horizontal grid refinement. Furthermore, on-line interactions have been established between the hydrodynamic model TRIWAQ and the particle simulation model SIMPAR, and
between WAQUA and WL|Delft Hydraulics' sedimentation model MORSYS [1]. Finally, a parallel version of the Kalman filter for WAQUA/TRIWAQ has been realized by first decomposing the monolithic WAQUA/TRIWAQ+Kalman program into separate computing processes, and then parallelizing each of these in its own most appropriate way.

In this paper we give an overview of these developments. First, Section 2 presents the framework SIMONA and different applications of the simulation models, along with their relevance and an indication of the amount of computation involved. Then Section 3 introduces the approach that is used for the parallelization of WAQUA and TRIWAQ. Section 4 describes the extension of this approach to domain decomposition with vertical and horizontal grid refinement. In Section 5 we show techniques and ideas for the extension towards coupling of different models. Section 6 illustrates the viability of the approach by presenting some recent performance results. Finally, Section 7 presents conclusions and further developments.
2. Simulation models and applications

2.1. Introduction of the SIMONA framework

In the last decade the demand for computing details of complex phenomena has increased tremendously. The required accuracy of computations has become much higher, and the need has grown for combining different phenomena into one computation. This is all made possible by the availability of powerful computers.

The demand for integrated simulation systems inspired the start of the development of a generic framework for (mathematical) simulation models called SIMONA in the early 1990s. SIMONA is an information system for the simulation of physical processes in water in support of the tasks of the Dutch Rijkswaterstaat. Water levels, water velocities, wave heights and dispersion of dissolved substances in rivers, lakes, estuaries and seas are computed by means of numerical simulation models. The most important simulation models are:

• WAQUA––two-dimensional computation of water movement and transport of dissolved substances;
• TRIWAQ––three-dimensional computation of water movement and transport of dissolved substances;
• SIMPAR––particle computation for computing the dispersion of dissolved substances with sharp gradients;
• KALMAN/WAQAD––adjusting the computational results of WAQUA and TRIWAQ according to measured values.

These simulation models are called complex in the sense that they have high resolution in two or three spatial directions. They form the numerical part of SIMONA. Some applications of WAQUA, TRIWAQ and KALMAN are
described later in this section and are referred to in the numerical experiments later in this paper.

The numerical part of SIMONA is built upon the information technology part, the SIMONA tool-box. This tool-box is designed as a general layer on which the specific simulation models are built. The tool-box provides the following functionality:

• memory management and I/O access to the SIMONA data storage file, specifically designed for handling/transport of huge amounts of data;
• a generic pre-processing mechanism for input data, providing a uniform, flexible input format to all different simulation models;
• interpolation of function values defined on different spatial grids in various combinations (e.g., from spherical coordinates to a curvilinear grid).

The structure of SIMONA formed an important motivation for the parallelization strategy that was adopted, and which is the main subject of this paper.

2.2. Applications of the simulation model WAQUA

The most important of Rijkswaterstaat's simulation models/programs are WAQUA and TRIWAQ, which are used for simulation of water movement and transport processes in shallow waters, where the waves to be modeled are much longer than the depth of the water. WAQUA is based on a vertically averaged (2D) approach of the flow field. Its development started with the work of Leendertse in 1967 [7], but most of the methods that are currently used were developed by Stelling in 1983 [12].

The 2D approach of WAQUA is sufficiently accurate for many applications where water levels are of primary concern. For instance, schematizations of the rivers Meuse, Rhine and IJssel were developed by Rijkswaterstaat/RIZA (Dutch National Institute for Inland Water Management and Waste Water Treatment) for computing the water levels in exceptional circumstances, e.g., occurring with probability 1/250 per year. These high water levels are used to decide on the required height of dikes, to reduce the risk of flooding to an acceptable level. The model of the Meuse is now also used by the "Maaswerken", a collaboration between different governmental institutes and various (hydraulic) engineering firms that is developing measures to enlarge the river bed and in this way improve safety, ecology and shipping routes via the Meuse. With the model, the effects of different alternatives can be compared, and the requested safety level can be verified.

The model of the Meuse consists of about 210,000 active grid points. A typical simulation period is 7–14 days, i.e., a period of heavy rainfall plus the time needed for discharge of the water to the sea. The time step used in the simulation is 15 s, which results in 40,000–80,000 steps for the entire period. The required simulation time on 16 processors of an SGI Origin2000 supercomputer is about half an hour per day to be simulated. On a single-processor modern workstation this would amount to 2–5 days of computing time for a single simulation.
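As a quick check on these numbers (our own arithmetic, using the figures above): a simulation period of 7–14 days at a time step of 15 s corresponds to

\[ N_{\mathrm{steps}} = \frac{7 \times 86\,400\ \mathrm{s}}{15\ \mathrm{s}} \approx 40\,320 \quad\text{up to}\quad \frac{14 \times 86\,400\ \mathrm{s}}{15\ \mathrm{s}} \approx 80\,640, \]

in agreement with the 40,000–80,000 steps quoted above.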
2.3. Application of Kalman filtering

WAQUA is also used in the operational storm surge forecasting system. Here new forecasts for the Dutch Continental Shelf model CSM-16 are produced every 6 h at the Royal Netherlands Meteorological Institute KNMI, using the latest available weather predictions. The storm surge forecasting system is used, among others, for deciding whether or not to close the storm surge barrier in the New Waterway, which cuts off the entrance to the port of Rotterdam. Because of the huge economic interests that are at stake, only small errors in the water level predictions are accepted. Therefore WAQUA has been extended with a steady-state Kalman filter which allows for real-time assimilation of observed water levels at various tide gauge locations on the Continental Shelf.

The Kalman filter uses the latest available observations from eight measurement stations along the English and Dutch coasts. It is further configured with expected levels of uncertainty in the measurements and the magnitudes and correlation length scales of errors in the boundary and wind forcings of the model. This information is used to compute the theoretically best adjustments ("gain matrices") of the model state and forcings to deviations between model and observations in the measurement locations. Then, at every time step in the start-up phase of the simulation, the predictions and observations are compared and the model state is adjusted accordingly. The application of the Kalman filter significantly improves the CSM-16 model forecasts along the Dutch coast during the first 12 h of the forecast window [5].

The Kalman filter described above is not very compute intensive; it requires at most a few minutes of computing time on a modern PC for simulation of several days of real time. But further improvement of the predictions does make this a large scale problem. First of all, the CSM-16 model should be replaced by the four times larger DCSM model of the Continental Shelf area, with a grid size of about 8 km. Secondly, the steady-state Kalman filter should be replaced by the more flexible RRSQRT Kalman filter [13]. The RRSQRT Kalman filter allows for incorporating measurements with irregular locations in space and time, such as satellite altimeter data [8] and HF radar data. Further, it allows for attuning to the actual weather situation, i.e., expecting larger errors in the predictions in storm situations than in calm weather. However, the RRSQRT Kalman filter represents far more computing work than the steady-state filter; it works by keeping track of the largest "error modes" and updating these through the real model dynamics. The amount of computation is about linear in the number of modes used, which is of the order of 100. This means that a simulation with WAQUA with the RRSQRT Kalman filter takes about 100 times as much work as a standard WAQUA run for the same model.

2.4. Introduction of the simulation model TRIWAQ

The 3D model TRIWAQ is applicable for accurate simulation of transport processes, where the vertical velocity profile needs to be resolved in some detail.
Dissolved substances that are concentrated primarily near the water surface are sometimes transported much faster than would be predicted on the basis of the vertically averaged flow velocity, because flow velocities are generally larger at the surface than near the bottom.

The simulation model TRIWAQ has been realized through the extension of WAQUA. A first version of TRIWAQ with strictly horizontal layer interfaces was built in 1989; a second version was realized in 1993 with a much more flexible vertical grid schematization [17]. In this version the water column in each grid point is subdivided into a constant number of layers, with the thicknesses of some layers prescribed in an absolute sense (meters) and the thicknesses of other layers given as a percentage of the remaining water depth. A sketch of this schematization is given at the end of this section.

One particularly important transport process concerns salinity. In coastal areas the interaction of fresh and salt water provides an important mechanism for local flow phenomena (density currents). For a proper representation of this interaction, the vertical mixing of water must be modeled too. Therefore TRIWAQ has been extended with different variants of the k–ε turbulence model for turbulent kinetic energy k and dissipation rate ε, see e.g., [16].

One application example of TRIWAQ concerns research into the effects of partially opening the Haringvliet sluices in the Netherlands.
Fig. 1. Detail of the RYMAMO grid schematization for the harbor of Rotterdam and the Haringvliet estuary, in "real" curvilinear coordinates.
These sluices were built in 1970 as part of the Delta works that defend the Netherlands against flooding, and are currently used only for letting fresh water out into the sea. This way of operating the sluices has several disadvantages for the ecosystem in the estuary Haringvliet/Hollands Diep behind the sluices: contaminated silt is piling up in the estuary, the dynamics of ebb and flood have decreased, and the transition region between fresh and salt water has become much smaller. The opening of the sluices is investigated in order to compensate for these effects and to restore the area's important natural values. But the opening also has consequences for the safety of the land around the estuary, and for agriculture, intake of drinking water, fishing and recreation. Therefore the intrusion of salt water into the estuary needs to be predicted.

Because of the density variations due to salinity, this requires the application of a 3D model with at least 8–10 layers in the transition zone. Furthermore, these simulations depend on the extensions of WAQUA/TRIWAQ for barrier constructions. For instance, in TRIWAQ vertically moving barriers are modeled through (partially) closing of layers near the bottom or the water surface.
Fig. 2. Manual partitioning of the RYMAMO grid in logical coordinates into 32 subdomains, with indication of region covered by Fig. 1.
These studies are typically done with the RYMAMO grid schematization shown in Figs. 1 and 2. Simulations with this model on a modern workstation may require about 2 h of computing time per hour that is simulated. In Section 6 we show that this can be reduced to less than 3 min per hour by the application of parallel computing.
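The vertical grid schematization of Section 2.4 can be made concrete with a small sketch. The following Python fragment is our own illustration, not TRIWAQ code; the function name and the input convention (a list of layer specifications, each either a fixed thickness in meters or a percentage of the remaining depth) are assumptions chosen for the example.

    def layer_thicknesses(depth, layers):
        """Compute layer thicknesses for one water column.

        depth  : total water depth in this grid point (m)
        layers : list of ('m', value) for a fixed thickness in meters, or
                 ('pct', value) for a percentage of the remaining water depth.
        """
        fixed = sum(v for kind, v in layers if kind == 'm')
        remaining = depth - fixed        # depth left for the percentage layers
        thick = []
        for kind, v in layers:
            if kind == 'm':
                thick.append(v)          # absolute thickness, independent of depth
            else:
                thick.append(remaining * v / 100.0)
        return thick

    # Example: a 20 m column with two fixed 1 m layers near the surface and
    # the rest split 30/30/40 over the remaining 18 m.
    print(layer_thicknesses(20.0, [('m', 1.0), ('m', 1.0),
                                   ('pct', 30), ('pct', 30), ('pct', 40)]))
    # -> [1.0, 1.0, 5.4, 5.4, 7.2]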
3. Parallelization strategy

The amount of computation involved in typical relevant simulations with WAQUA and TRIWAQ has inspired research into the vectorization [6] and parallelization [10,15] of the simulation models and computer codes involved. Initially the parallelization research focused on the development of numerical simulation techniques that were suited for parallelization, and on the development of techniques for partitioning finite difference meshes with optimal load balance and communication. However, gradually the goals of the research shifted. Ultimately it appeared that the development of parallel simulation software for suitable numerical techniques was much more of a problem than the development of these techniques themselves. Further, the distribution of all data structures of the simulation software according to a partitioning of the mesh appeared more complicated than the partitioning itself. Therefore the primary result of the research turned out to be a coherent strategy for the parallelization of existing simulation software. In this section we present an overview of this strategy. The following sections present the extension of the same strategy towards the coupling of separate models for different domains or for different physical processes.

Different strategies have been considered for the parallelization, such as using automatic parallelizing compilers, restructuring of the program, use of data parallel programming languages, and manual parallelization using shared memory mechanisms or using message passing. Ultimately our analysis of the algorithms and data structures in the original software revealed that no good speedup could be expected from (semi-)automatic parallelization tools at that time, or in the future [15]. The reasons for this are, among others:

• The original WAQUA/TRIWAQ software uses various computational kernels that are inherently sequential. For instance, tridiagonal systems of equations must be solved for rows or columns of the grid due to the ADI nature of the time integration scheme, and for this the inherently sequential double sweep method is used (Thomas' algorithm, see e.g., [11] and the sketch below). For other systems of equations different variants of the Gauss–Seidel iteration method are used. The numerical efficiency of these solvers depends strongly on the detailed ordering of the computations. Therefore replacement of these solution techniques requires insight into the problems being solved.
• Efficient parallelization requires that all data items related to a single grid point are rapidly available to the process that carries out the corresponding computations. But different types of data are stored in different data structures, and alignment of these data requires insight into the relations between the data structures, i.e., insight into the (indirect) addressing schemes that are used.
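For reference, a minimal sketch of the double sweep (Thomas) algorithm mentioned above. This is the textbook method in Python, not code from WAQUA/TRIWAQ; its forward elimination and back substitution both carry a loop dependence, which is what makes the kernel inherently sequential.

    def thomas(a, b, c, d):
        """Solve a tridiagonal system with sub-, main- and super-diagonals
        a, b, c and right-hand side d (textbook double sweep). Each step of
        both loops depends on the previous one, so neither loop can be
        parallelized over i."""
        n = len(d)
        cp = [0.0] * n
        dp = [0.0] * n
        cp[0] = c[0] / b[0]
        dp[0] = d[0] / b[0]
        for i in range(1, n):                    # forward elimination
            denom = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / denom if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):           # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # Example: solve [[2,1,0],[1,2,1],[0,1,2]] x = [4,8,8]  ->  x = [1,2,3]
    print(thomas([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0]))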
The same problems exist if a shared data paradigm [4,15] is used for the parallelization, where the data structures of the original program are largely left untouched and made available via a kind of (distributed) shared memory mechanism to the different computing processes that perform the calculations of the original program. Therefore we have adopted a process/channel paradigm [4,15], in which the computing processes that are introduced work on private data (structures) only and (conceptually) communicate with each other via "channels" (e.g., message passing).

The approach that is taken for the parallelization can now be described as coupling different instances of the original program. The coupling requires some extensions of the program regarding the cooperation and communication between the programs. Furthermore, it requires distribution of the initial problem data, in such a way that the indirect addressing schemes are obeyed. For the communication a set of abstract subroutines has been devised (CouPLib) that hides the details of the interactions from the calling program. For the data distribution a generic automatic partitioning program has been devised. The resulting structure of the parallel software is illustrated in Fig. 3. It is implemented in such a way that the external behavior of the parallel version is the same as for the original software.
Fig. 3. Overview of the parallelization approach for WAQUA/TRIWAQ. A single data input file is partitioned automatically into separate––complete––data input files per subdomain, WAQUA/TRIWAQ processes per subdomain are started by the master process CouPEx and communicate with each other via the communications library CouPLib, and the output data per subdomain are joined together by the collector program.
In particular, all standard pre- and post-processing of the SIMONA framework can still be used, and the usage of a parallel computer is nearly invisible to end users, except for the shorter execution time.

One special feature of the communications library CouPLib is that the interface is defined at a high level of abstraction; it is formulated in terms of entities that are meaningful to application programmers, such as grid functions and stencils. This implies that users do not have to worry about the exact details of which grid points are close to subdomain boundaries or which data must be sent to other processors. Also the precise details of message passing (process IDs, message tags) are hidden. Therefore CouPLib provides a strict separation between the calculations and the communications in the code. The implementation provides support for arbitrarily shaped subgrids, such that any partitioning of the grid can be accommodated. A sketch of this style of interface is given below.

An important advantage of the approach is that only a single version of the program needs to be maintained, which can be used on all kinds of computers. For standard computations this code can be used on a single PC or workstation (by itself, i.e., without partitioner, master and collector), for larger computations on a cluster of workstations or a small scale (SMP) parallel computer, and for large scale simulations on parallel supercomputers at a supercomputing center.
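To give a flavor of such a high-level interface, here is a mock-up of our own in Python; it is not the actual CouPLib API, and all names in it are invented. The partitioner determines once which local values each neighbor needs ("send" index sets) and which guard-band entries are owned by each neighbor ("recv" index sets); afterwards the computational code triggers a whole exchange without ever seeing process IDs, message tags or subdomain shapes. In the sketch the two exchange phases are explicit because the demo runs in one sequential process; in a real message passing setting the processes run concurrently and a single update call, like the cocupd call shown in Section 4.1, suffices.

    class Coupler:
        def __init__(self, rank, mail):
            self.rank, self.mail = rank, mail    # 'mail' stands in for message passing
            self.data, self.plan = {}, {}

        def register(self, name, values):
            self.data[name] = values             # a grid function owned by this process

        def configure(self, name, stencil, plan):
            # plan: [(neighbor, send_idx, recv_idx), ...], produced by the partitioner
            self.plan[(name, stencil)] = plan

        def send(self, name, stencil):
            v = self.data[name]
            for nbr, send_idx, _ in self.plan[(name, stencil)]:
                self.mail[(self.rank, nbr)] = [v[i] for i in send_idx]

        def recv(self, name, stencil):
            v = self.data[name]
            for nbr, _, recv_idx in self.plan[(name, stencil)]:
                for i, val in zip(recv_idx, self.mail[(nbr, self.rank)]):
                    v[i] = val                   # fill our guard band

    # Two subdomains of a 1D grid [10, 20, 30, 40], each with one guard cell:
    mail = {}
    left, right = Coupler(0, mail), Coupler(1, mail)
    left.register('salin', [10, 20, None])       # guard cell at the right end
    right.register('salin', [None, 30, 40])      # guard cell at the left end
    left.configure('salin', 'stc2', [(1, [1], [2])])
    right.configure('salin', 'stc2', [(0, [1], [0])])
    for c in (left, right): c.send('salin', 'stc2')
    for c in (left, right): c.recv('salin', 'stc2')
    print(left.data['salin'], right.data['salin'])   # [10, 20, 30] [20, 30, 40]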
4. Extension towards domain decomposition

4.1. Domain decomposition with vertical grid refinement

In the area of shallow water simulation, the dominating physical processes are largely different for the horizontal versus the vertical coordinate directions. As described in Section 2.4, the detailed resolution of vertical effects, and consequently the use of computational grids with large numbers of layers in the vertical direction, is needed primarily in the transition region from fresh to salt water. Therefore a reduction of computing times and an increase in modeling flexibility are possible by allowing variations in the number of layers in different regions of the area under consideration. This has been realized by extending the approach for parallelization of TRIWAQ towards domain decomposition with vertical refinement [9].

In the realization of this form of domain decomposition, the fact is exploited that the vertical direction is of minor importance in TRIWAQ in terms of the description in the input file. Nearly all features in the user input refer to the horizontal grid only (tide openings, barrier constructions, etc.); almost none refer to the number of layers used. Therefore it is possible to use the following approach: the user provides an input file for the entire domain to be modeled (using a single horizontal grid schematization), a decomposition of the horizontal grid into different regions, and the characteristics of the different regions with respect to the vertical direction. The software then provides an output file for the entire domain with the highest vertical resolution, only computed in less time than otherwise possible. This greatly simplifies the
usage of the domain decomposition version for end-users. For instance, it allows for direct usage of all existing post-processing programs.

The realization of this form of domain decomposition required extension of the communications library CouPLib and of the partitioner and collector programs of parallel TRIWAQ. Here the true nature of CouPLib can be seen: it provides a high level facility for the exchange of information between different computing processes. One of the aspects of this exchange is now the conversion of the data from the representation for one process into the representation that is used by another process. In the current situation this conversion consists of vertical interpolation. The communications library provides different interpolation mechanisms that can be configured by the calling program. After the initial configuration, no more details about the communication can be seen in the computing routines of TRIWAQ than that a variable (e.g., concentration of salt) with a certain interpretation (cell-averaged quantity) and data structure (full matrix storage scheme) is updated for a certain guard band region (stencil operation):

   call cocupd(salin, 'full kmax', 'lay avg', 'stc2')

The partitioner program has been extended with a more user-friendly format for the specification of the desired decomposition of the domain. Note that in this case the decomposition is largely determined by physical motivations, instead of by motivations of computational efficiency. The collector is extended with interpolation of the output data of all domains to the finest vertical resolution used. This largely characterizes the programming strategy that is employed for domain decomposition. With respect to the numerical issues involved, it is remarked that quite simple interpolation methods suffice in our case (constant, linear), but that more complex methods can be accommodated in CouPLib too; a sketch of such a layer-averaged conversion is given at the end of Section 4. More details of the numerical methods are presented in [9].

4.2. Domain decomposition with horizontal grid refinement

A further reduction of computing times and increase of modeling flexibility is achieved by the implementation of domain decomposition with horizontal grid refinement for WAQUA and TRIWAQ. Contrary to the situation for vertical refinement, it is in this case not practical to work with a single horizontal grid schematization. Therefore the user must now provide separate data input files for the different regions to be modeled by separate grids. The different regions must be non-overlapping and must coincide with each other at their mutual interfaces. However, input files may be constructed using the partitioner program, such that also parts of existing ("stand alone", overlapping) data input files may be combined in a single simulation. The partitioner is also used for the extension of the non-overlapping domains with a guard band of additional grid cells and grid points for the storage of interpolated function values. Different interpolation mechanisms are provided for by CouPLib; up to now these mechanisms are restricted to the case of (variable) integer refinement factors, where the grid points of a coarser domain coincide with grid points of the neighboring finer domain.
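As an illustration of this last restriction, consider the following sketch (ours, with an assumed data layout; not CouPLib code): with an integer refinement factor, each fine-grid point either coincides with a coarse point or lies between two of them, so constant or linear interpolation reduces to simple index arithmetic.

    def coarse_to_fine_1d(coarse, factor):
        """Interpolate a coarse-grid line of values onto a fine grid refined by
        an integer 'factor', with coinciding end points. Linear interpolation;
        fine index j maps to coarse position j/factor."""
        n_fine = (len(coarse) - 1) * factor + 1
        fine = []
        for j in range(n_fine):
            i, r = divmod(j, factor)
            if r == 0:
                fine.append(coarse[i])            # coinciding grid point
            else:
                w = r / factor                    # between coarse points i and i+1
                fine.append((1 - w) * coarse[i] + w * coarse[i + 1])
        return fine

    # Example with refinement factor 3: 3 coarse points -> 7 fine points
    print(coarse_to_fine_1d([0.0, 0.3, 0.9], 3))
    # -> [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]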
Fig. 4. Illustration of restrictions that are imposed by the necessity of a global logical grid. Left: actual geometry and desirable discretization, available through domain decomposition with horizontal refinement. Right: possible solution that can be fitted onto a single logical grid, using a bend in one channel and (dashed) additional grid lines.
The extension to domain decomposition with horizontal refinement improves both the user-friendliness and the accuracy of WAQUA and TRIWAQ. The user-friendliness is improved because it is made easier to operate detailed models. Previously, users had to perform a large number of steps for the simulation of a detailed model: running a global model, extracting boundary conditions, and inserting these into the detailed model. Now the user must provide suitably matching grids only. The user-friendliness is improved further via increased modeling flexibility: some restrictions are circumvented that arise when a region as a whole must be matched onto a single structured (curvilinear) grid, see Fig. 4. The accuracy of simulations is improved because local features of the flow that are resolved in the detailed model can be incorporated in a global model, through the two-way coupling that has been established. Further, the domain decomposition allows for using fine resolution only in those areas where it is really needed, and thus avoids certain restrictions on the accuracy that are imposed by required computing times.

The presentation above mainly aims to show that the complex numerical technique of horizontal grid refinement can be implemented in a large simulation program as a conceptually small extension to our strategy of parallel computing. Numerical aspects such as which interpolation methods to use (bi-linear, constant, etc.) are of less importance here; a comparison of different techniques is being carried out and will be presented in a future publication.
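Returning to the vertical conversion of Section 4.1, the following is a minimal sketch (ours; the interface conventions are assumed, not TRIWAQ's) of what a conservative, layer-averaged remap behind a call such as cocupd(salin, 'full kmax', 'lay avg', 'stc2') could look like for one water column: each target layer receives the thickness-weighted average of the source layers it overlaps, so that the depth-integrated quantity is preserved.

    def remap_layer_averages(src_thick, src_vals, dst_thick):
        """Conservatively remap cell-averaged values from one vertical layer
        distribution to another over the same water column (equal total depth).
        Each destination layer gets the thickness-weighted mean of the source
        layers that it overlaps."""
        dst_vals = []
        i, used = 0, 0.0      # current source layer, and how much of it is consumed
        for dt in dst_thick:
            need, acc = dt, 0.0
            while need > 1e-12:
                take = min(src_thick[i] - used, need)
                acc += take * src_vals[i]
                used += take
                need -= take
                if used >= src_thick[i] - 1e-12:
                    i, used = i + 1, 0.0          # move to the next source layer
            dst_vals.append(acc / dt)
        return dst_vals

    # Example: 4 equal layers of 2.5 m remapped onto 2 equal layers of 5 m.
    print(remap_layer_averages([2.5] * 4, [1.0, 2.0, 3.0, 4.0], [5.0, 5.0]))
    # -> [1.5, 3.5]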
5. Extension to coupling of different models/processes

5.1. Coupling of different models

The strategy for parallelization of TRIWAQ is equally well applicable to the coupling of different simulation programs. This has been demonstrated first by realizing an on-line coupling between TRIWAQ and the particle simulation model SIMPAR [2]. SIMPAR allows for simulation of transport processes through tracing of the stochastic motion of particles. This is particularly convenient for simulation problems where sharp variations in the concentration of dissolved substances occur, which
frequently cause difficulties for finite difference discretization methods. SIMPAR uses the flow velocity fields that are computed by TRIWAQ at all time steps. The on-line coupling therefore allows for avoiding the huge storage requirements of an off-line coupling. Its implementation appeared to be quite easy, since it involved replacing file I/O (read) by communication operations (receive) only.

An on-line coupling has also been realized between WAQUA and WL|Delft Hydraulics' sedimentation module MORSYS [1]. MORSYS simulates the transport, deposition and re-suspension of silt in rivers, estuaries and coastal areas, and the consequent changes of the bottom topography. For this it uses the flow pattern that is computed by a flow simulation program, which determines the transport as well as the intensity of deposition and re-suspension. But this flow pattern depends on the bottom topography, so the processes are mutually dependent. Therefore the integrated simulation is required for obtaining the most realistic results.

5.2. Generic concepts for coupling of different models/processes

One of the differences between coupling of different domains and coupling of different processes lies in the coordination of the overall solution procedure. In parallel WAQUA/TRIWAQ and in the domain decomposition software all computing processes perform the same sequence of steps, and a single program multiple data (SPMD) approach can be used. Therefore little attention has to be paid to the coordination of the processes, besides that convergence criteria in iterative solution procedures must be implemented in such a manner that all processes perform the same number of iterations. In the coupling of different models, however, all processes involved must agree upon the information that is exchanged (physical interpretation, units, time, accuracy, etc.), and on the order in which this information is communicated. Therefore the coordination of the processes has become an important issue.

Our approach towards this coordination is that for each computing process a so-called coupling algorithm must be given, which specifies the entire sequence of interactions with other processes, i.e., which information can be made available to and needs to be obtained from other processes, in which order. Furthermore, these algorithms for the different processes must be matched with each other in a coupling configuration file for the specification of a coupled run. This can be explained by saying that processes "know in advance" which information must be provided to other processes. Processes should not, however, depend explicitly on the presence of other processes in the system, but should be designed as independent modules as much as possible. Therefore this knowledge exists only implicitly in the coupling algorithm that is used.

The basic assumption of our approach is that the interactions between the processes that need to carry out a joint simulation can be specified beforehand. Consequently there is no need for a dynamic mechanism with which processes can request information from other processes at unpredictable times. The motivation for this assumption is that one cannot analyze the mathematical properties of a combined simulation method (e.g., with respect to accuracy or numerical stability) if one does not know the status of the information that is used (such as the bottom topography at
an unknown time instance). Therefore we expect that the use of coupling algorithms is beneficial for a wide range of integrated modeling applications.

5.3. Coupling of different computing processes for enabling parallelization

These concepts for the coupling of different processes have been put to the test in research carried out by Delft University of Technology for Rijkswaterstaat/RIKZ, in which the parallelization of the RRSQRT Kalman filter for WAQUA/TRIWAQ is pursued. A large complication in this parallelization is that the WAQUA/TRIWAQ computations have entirely different needs for the parallelization than the Kalman filter computations. Furthermore, the interfacing between the different data structures of the separate modules in the original software is implemented in such a way that an SPMD approach à la parallel TRIWAQ is practically impossible. Therefore the parallelization is carried out in two steps: by first decomposing the entire program into four cooperating modules, and then parallelizing each of the modules in its own most appropriate way.

The four modules that have been created are illustrated in Fig. 5. Currently only the two most compute-intensive components are parallelized. The linear algebraic computations for the RRSQRT Kalman filter in COVMAT are parallelized by a row-wise decomposition of their central data structure, the so-called "L-matrix". The propagation steps for the error modes in PROPWAQ using the WAQUA model dynamics are parallelized by performing the computations for different error modes concurrently.
Fig. 5. Overview of the structure of the parallel version of the WAQUA/TRIWAQ+Kalman software: four distinct modules, each parallelized in its own most appropriate way.
Fig. 6. Excerpt from the coupling algorithm used in the parallel version of the WAQUA/TRIWAQ+Kalman software.
This can be seen as a column-wise decomposition of the L-matrix. The CouPLE communication infrastructure takes care of the exchange of information between these computing processes. Here the re-structuring of data is done automatically using abstract index sets, see [14,15].

Our experience with the componentization is that it gives much insight into the interactions between different parts of the algorithms used, because the information that is exchanged is made explicit in the coupling algorithm that is used, see Fig. 6. Further, the components appear to be much smaller and easier to understand than the original compound program. Finally, it has become easier to experiment with alternative versions of the Kalman filter, such as ensemble Kalman filters [3]. In that case the complete infrastructure and the modules WAQUA, KALMAN and PROPWAQ can be entirely maintained, and only COVMAT needs to be replaced.
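To give an impression of what such a coupling algorithm amounts to, here is a hypothetical, simplified version of ours in Python; it is not the actual SIMONA/CouPLE format shown in Fig. 6, and the quantity names are invented (only the module names COVMAT and PROPWAQ come from the text). Each process declares its ordered sequence of coupling points as data, and a coupled run is accepted only if the declared puts and gets of all processes can be matched up; this is what makes off-line consistency checking against deadlock possible.

    # Per process: an ordered list of coupling points, each a tuple
    # (operation, peer process, quantity, time level).
    ALGORITHMS = {
        'covmat':  [('put', 'propwaq', 'error_modes', 't'),
                    ('get', 'propwaq', 'propagated_modes', 't+1')],
        'propwaq': [('get', 'covmat', 'error_modes', 't'),
                    ('put', 'covmat', 'propagated_modes', 't+1')],
    }

    def interactions_with(algorithm, peer):
        """The sub-sequence of coupling points of one process involving 'peer'."""
        return [(op, qty, time) for op, p, qty, time in algorithm if p == peer]

    def check_consistency(algorithms):
        """Off-line check: for every pair of processes, the puts/gets of one must
        mirror the gets/puts of the other, in the same order. A mismatch would
        mean a message without a receiver, or a deadlock."""
        mirror = {'put': 'get', 'get': 'put'}
        for a in algorithms:
            for b in algorithms:
                if a < b:
                    seq_a = interactions_with(algorithms[a], b)
                    seq_b = interactions_with(algorithms[b], a)
                    mirrored = [(mirror[op], qty, t) for op, qty, t in seq_b]
                    assert seq_a == mirrored, f'inconsistent coupling: {a} vs {b}'
        print('coupling algorithms are consistent')

    check_consistency(ALGORITHMS)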
6. Performance results on an SGI Origin2000 supercomputer

In the last few years parallel WAQUA/TRIWAQ has been used on a wide variety of parallel computing platforms: homogeneous and inhomogeneous networks of workstations (HP, SUN, Linux) and Windows NT PCs, different SMP-type workstation clusters, and various parallel supercomputer platforms including IBM SP2, Cray T3E and SGI Origin2000. In this section we present some new performance results for the parallel version on the latter platform. These experiments were carried out early in 1999 to evaluate the applicability of the then-new SGI Origin2000 supercomputer for large scale computations with TRIWAQ. These results show that our strategy indeed allows for tackling the various large scale computing problems at Rijkswaterstaat.

The experiments use the RYMAMO model (see Figs. 1 and 2), which covers the mouths of the rivers Rhine and Meuse in the North Sea and which is used, among others, for studying the complicated flow patterns at the entrance to the harbor of Rotterdam. The version of the model that was used consists of 320 × 592 grid points in the horizontal direction, of which about 61,000 are actually used in the simulation (32%). The vertical direction is subdivided into 10 layers. The computations include transport of salinity and the computation of turbulent energy and dissipation according to the k–ε turbulence model. The simulation concerns a period of January 17, 1998, when the Haringvliet sluices were partially opened. The simulation period was restricted to 2 h in order to reduce the computing time of the experiments. A time step of 30 s is used.

Suitable partitionings of the model grid are constructed manually in an iterative fashion. Starting from an initial partitioning, a short simulation is carried out. Then the computing times per subdomain are used to locate bottlenecks and to determine which subdomains must be enlarged or made smaller. A graphical tool VISIPART is used for making these changes. Then another short simulation is carried out. This process is repeated until no further progress is made or until the differences in the workload per subdomain are sufficiently small. The main problem for automatic
partitioning methods is that the computing time per subdomain cannot be estimated well in advance. This is largely due to effects of cache memory, which cause a strong dependence on the precise geometry of a subdomain, such as the fill ratio, the number of boundary points and the number of computational rows and columns of a subdomain.

Another difficulty of the Origin2000 system is that computing times may vary by up to 20% even between identical simulations on a fully loaded system. This is caused by the fact that users do not have control over the placement of processes on processors, and that processors are not reserved exclusively for a single application. This means that the WAQUA/TRIWAQ processes may have to share a processor with a process from another user. But when one WAQUA/TRIWAQ process is delayed, ultimately all other WAQUA/TRIWAQ processes will have to wait longer for the required boundary information. This synchronization overhead quickly increases with the number of processors used in the WAQUA/TRIWAQ run. Not only the overhead caused by a single "delay event", but also the number of delay events is proportional to the number of processors used. Therefore this places a restriction on the maximum number of processors that can be effectively used.

After the optimization stage, simulations have been carried out for the full simulation period of 240 time steps on 1 through 32 processors. Besides standard runs in the presence of processes from other users, a few additional runs could be carried out in a short period in which the system was reserved exclusively for our experiments. The computing times and speedup factors for these runs are shown in Table 1. These results show a super-linear speedup of up to a factor 41 on 32 processors on a dedicated system, which is obtained through a reduction of computation time by cache effects that is larger than the communication overhead. The efficiency figures for the non-dedicated runs indicate that the maximum number of processors is effectively reached. On the other hand, the efficiency of the dedicated runs shows that the parallel software itself is scalable to larger numbers of processors. For a more detailed analysis of the performance of parallel TRIWAQ the reader is referred to [15].

Table 1
Computing times and speedup factors for simulations with the RYMAMO model with 10 layers on an Origin2000 system

Number of processors                      1        4        8       16       32
Total computing time (s)              14309     3470     1533    795.9    434.2
Computing time per hour simulation  1:59:14    28:55    12:47     6:38     3:37
Speedup w.r.t. one processor            1.0      4.1      9.3     18.0     33.0
Efficiency                              1.0     1.03     1.17     1.12     1.03

Number of processors                     16       32
Total computing time (s)              676.4    348.5
Computing time per hour simulation     5:38     2:54
Speedup w.r.t. one processor           21.2     41.1
Efficiency                             1.32     1.28

1–32 processors: standard situation with processes from other users; 16/32 processors: results for runs on a dedicated system.
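For reference, the standard definitions behind these figures (our own restatement, consistent with the table): with \(T_1\) the single-processor computing time and \(T_p\) the time on \(p\) processors,

\[ S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} . \]

For the dedicated 32-processor run this gives \(S_{32} = 14309/348.5 \approx 41.1\) and \(E_{32} = 41.1/32 \approx 1.28\), the super-linear values in Table 1.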
Note that the total computing time is reduced to less than 3 min per hour that is simulated. This is about 30 times faster than on a single processor of a Cray C90 vector computer, and about 150 times faster than on a single processor of an HP K460 workstation. This tremendous speedup opens the way to new applications of the model, since computations now proceed significantly faster than reality.

More experiments were carried out with parallel WAQUA and with TRIWAQ with vertical grid refinement, with largely similar findings. For parallel WAQUA, the speedup saturates at somewhat smaller numbers of processors than for parallel TRIWAQ because of the computational methods used (higher ratio of communication to calculation). An additional result for the domain decomposition version is that the number of grid points could be halved without significant implications for the accuracy of the simulation results, by using 1, 4 and 10 layers in different parts of the computational domain instead of 10 layers everywhere. This also reduces the required CPU and wall-clock times by a factor of 2. Finally, our initial experiments with the parallel Kalman software indicate an acceptable (communication) overhead of less than 5% due to the splitting of the program into separate components, and good speedup values for the parts that have been parallelized (efficiency > 95%).
7. Conclusions

In this paper we have given an overview of the various activities concerning large scale computing and simulation at the Dutch Rijkswaterstaat. Rijkswaterstaat is the Dutch governmental institute responsible for the management of the Dutch water systems. Among its tasks are the protection of the country against flooding and the management of shipping routes and ports. Simulation models for, among others, water movement, transport processes and sedimentation have become indispensable tools for carrying out these tasks.

The continuing increase in the level of detail of the simulations demands more and more computing power. We have shown various applications (Meuse, Haringvliet) where simulation of a single scenario requires days of computing time on a single modern workstation. In many cases there is a large number of scenarios to be computed, for instance for the calibration of unknown parameters in the model to get the best fit with available measurements. Also the required level of detail leads to the incorporation of more and more aspects of the physical processes in the models, such as density variations and turbulence. This again increases the computational complexity of the simulations to be performed. Therefore there is a strong need for high performance computing.

In the past few years, Rijkswaterstaat/RIKZ, Delft University of Technology and VORtech Computing have developed and implemented techniques for making these large scale simulations possible. First of all, the 3D shallow water model TRIWAQ has been parallelized and tested on various supercomputers. A somewhat uncommon viewpoint has been taken for the parallel implementation, in which parallelization consists of the coupling of different instances of the sequential program rather than breaking up the calculation. This strategy opened the way for a large number of other developments.
We have shown the extension of the strategy towards different forms of domain decomposition, by coupling instances of the program that employ grids with different vertical and horizontal resolutions. Furthermore, the approach has been used for on-line coupling between models for different physical processes (flow and transport, flow and sedimentation). Finally, the approach has been used for the separation of the different submodels in the WAQUA/TRIWAQ program with Kalman filtering into different computing processes. This opened the way for the parallelization of each submodel in its own most appropriate way.

Central elements in our strategy for coupling of different models are a generic program for the automatic distribution of a single global model input file into separate, complete input files for the different subdomains, and a high level communications library for the exchange of information between different computing processes. Through abstract concepts such as "index sets" (grids) and "stencils" the latter allows for a convenient description of the required communications. In the domain decomposition versions this has been extended with various mechanisms for data conversion (interpolation). In the parallel Kalman filter software the calls of the communication routines are called coupling points, and the entire sequence of coupling points per program is specified externally in a parameterized form. This allows for off-line consistency checking (to avoid deadlock), and for the replacement of one module by another that conforms to the same coupling algorithm.

Performance experiments on an SGI Origin2000 supercomputer employing 32 processors illustrate the viability of the approach. Good parallel speedups have been achieved even on larger numbers of processors, especially for simulations with TRIWAQ. The computing time for the simulation of one important model has been reduced to about 3 min per hour to be simulated, 30 times faster than on a Cray C90 vector machine. A further reduction by a factor of 2 has been achieved using domain decomposition with vertical refinement. Similar improvements in terms of computing time and modeling flexibility are expected from domain decomposition with horizontal refinement. These developments together open the way to a new range of applications of 2D and 3D shallow water models.
References

[1] H.H. ten Cate, S. Hummel, M.R.T. Roest, An open model system for 2D/3D hydrodynamic simulations, in: Proceedings of the 4th International Hydroinformatics Conference, 2000 (cd-rom).
[2] H.H. ten Cate, R. Salden, H.X. Lin, A.E. Mynett, A case study on integrating software packages, in: V. Babovic, L.C. Larsen (Eds.), Proceedings of the Third International Conference on Hydroinformatics, A.A. Balkema, 1998.
[3] G. Evensen, P.J. van Leeuwen, Advanced data assimilation based on ensemble statistics, in: WMO (Ed.), Second International Symposium on Assimilation of Observations in Meteorology and Oceanography, WMO, 1995, pp. 153–158.
[4] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, Reading, MA, 1995.
[5] H. Gerritsen, J.W. de Vries, M.E. Philippart, The Dutch Continental Shelf model, in: D. Lynch, A. Davies (Eds.), Quantitative Skill Assessment for Coastal Ocean Models, Coastal and Estuarine Studies, vol. 47, American Geophysical Union, 1995.
[6] E. de Goede, Numerical methods for the three-dimensional shallow water equations on supercomputers, Ph.D. thesis, University of Amsterdam, The Netherlands, 1992.
[7] J.J. Leendertse, Aspects of a computational model for long period water wave propagation, Technical Report RM-5294-PR, Rand Corporation, Santa Monica, CA, 1967.
[8] M.E. Philippart et al., DATUM2––data assimilation with altimetry techniques used in a tidal model, 2nd program, Technical Report NRSP-2 98-19, BCRS, P.O. Box 5023, 2600 GA Delft, October 1998.
[9] L.M. Riemens, H.H. ten Cate, B. van 't Hof, M.R.T. Roest, Domain decomposition with vertical refinement in TRIWAQ, in: Proceedings of the 4th International Hydroinformatics Conference, 2000 (cd-rom).
[10] M.R.T. Roest, Partitioning for parallel finite difference computations in coastal water simulation, Ph.D. thesis, Delft University of Technology, 1997.
[11] Z.W. Song, Parallelization of hydrodynamic models for distributed memory computers, Ph.D. thesis, K.U. Leuven, 1995.
[12] G.S. Stelling, On the construction of computational methods for shallow water flow problems, Ph.D. thesis, Delft University of Technology, 1983.
[13] M. Verlaan, Efficient Kalman filtering algorithms for hydrodynamic models, Ph.D. thesis, Delft University of Technology, April 1998.
[14] E.A.H. Vollebregt, Abstract level parallelization of finite difference methods, Scientific Programming 6 (1997) 331–344.
[15] E.A.H. Vollebregt, Parallel software development techniques for shallow water models, Ph.D. thesis, Delft University of Technology, 1997.
[16] V. Yakhot, S.A. Orszag, S. Thangam, T.B. Gatski, C.G. Speziale, Development of turbulence models for shear flows by a double expansion technique, Physics of Fluids A 4 (1992) 1510–1520.
[17] M. Zijlema, Technical documentation TRIWAQ, Technical Report SIMONA 99-01, National Institute for Coastal and Marine Management, The Hague, The Netherlands, 1999.