Parallel Computing 23 (1997) 2135-2142

Preface

Regional weather modeling on parallel computers

C. Baillie a,1, J. Michalakes b,*, R. Skålin c

a NOAA Forecast Systems Laboratory, R/E/FS, 325 Broadway, Boulder, CO 80303, USA
b Argonne National Laboratory, Argonne, IL, USA
c Norwegian Meteorological Institute, P.O. Box 43, Blindern, 0313 Oslo, Norway

Received 16 June 1997; revised 16 July 1997

* Corresponding author.
1 Jointly affiliated with the Cooperative Institute for Research in the Atmosphere (CIRA), Colorado State University, Ft. Collins, Colorado, USA.
Abstract

This special issue on ‘regional weather models’ complements the October 1995 special issue on ‘climate and weather modeling’, which focused on global models. In this introduction we review the similarities and differences between regional and global atmospheric models. Next, the structure of regional models is described and we consider how the basic algorithms applied in these models influence the parallelization strategy. Finally, we give a brief overview of the eight articles in this issue and discuss some remaining challenges in the area of adapting regional weather models to parallel computers. © 1997 Elsevier Science B.V.

Keywords: Regional weather modeling; Parallel computing
1. Introduction
Numerical weather prediction, the application of numerical algorithms to predict the future state of the atmosphere from an approximation of its initial state, revolutionized the practice of weather forecasting. Beginning in the 1960s, vector and then vector shared-memory computers (parallel vector processors, or PVP machines) offered performance first in the tens, then hundreds, then thousands of megaflops (millions of floating point operations per second). Ever-increasing computational power has allowed continuing advances toward finer scales, longer simulations and increasing sophistication in the modeling of atmospheric processes. Today, scalable parallel computers offer the potential of performance in the hundreds of gigaflops and memory capacity to allow problem sizes of previously intractable dimensions. The door is now open for real-time forecasts of individual features of a severe storm, photochemical generation of pollutants during an urban afternoon, or multicentury simulations of climate.

The 1995 special issue of Parallel Computing on climate and weather modeling, edited by Drake and Foster [3], explored the issues of adapting global models of the earth's atmosphere and oceans to parallel computers. This issue focuses on the special characteristics, problems and implementation strategies for a closely related branch of geophysical applications: regional-scale models. Of the eight papers presented, six describe efforts to parallelize regional weather models; the last two describe related regional-scale codes, an ocean model and a metropolitan air quality model. Each paper describes the capabilities and uses of the particular model, the underlying numerics and algorithmics, the motivation, approaches and problems for parallelization, the computer platforms targeted and the results of those efforts.

In terms of their impact and relevance to human concerns and activities, regional-scale models have much in common with, and are in fact operationally dependent upon, their global counterparts. The differences are more than simply a matter of scale, however, and warrant separate treatment of regional models. Indeed, regional models represent a distinct class of applications with unique requirements and challenges for parallel computing.
2. Regional and global models

Regional, or ‘limited-area’, models are closely related historically, methodologically and operationally to global models. As members of the broad class of scientific computing applications, they are more alike than different: they solve the same basic equations that describe atmospheric motion, though the formulations may differ, and they use similar modules to simulate atmospheric processes such as clouds and radiation. Regional and global models are also used for essentially the same purposes: weather forecasting, climate prediction, observational data assimilation and analysis, generation of dynamic meteorological fields for input to sub-gridscale models (atmospheric chemistry, for example) and basic atmospheric research.

One of the principal reasons for using a regional model instead of a global model is cost. The computational cost of an atmospheric model is a function of the number of cells in the domain and of the time step. For a given domain size, the cost of explicit three-dimensional hydrodynamics codes is a function of n^4, where n is a grid dimension. The fourth-order term reflects the additional refinement of the time dimension (a smaller time step) necessary to maintain numerical stability as the spatial dimensions of each cell are decreased. Because the atmosphere is shallow, and because the number of vertical layers is independent from and more or less constant with respect to the horizontal resolution, the cost behaves more like a function of n^3. Even so, a doubling of resolution (a halving of the width of a cell) results in an eightfold increase in computational cost. Therefore, resolution is a precious commodity, and there is a strong incentive not to waste it.
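To spell out the scaling argument in symbols (a restatement of the reasoning above; the notation is ours, not the paper's):

% Cost of an explicit 3D code: one update per cell per time step.
% n_x, n_y: horizontal grid dimensions; n_z: vertical layers; n_t: steps.
\[
  \mathrm{cost} \;\propto\; n_x \, n_y \, n_z \, n_t .
\]
% With n_x = n_y = n and a stability-limited step
% \Delta t \propto \Delta x \propto 1/n, covering a fixed simulated
% period requires n_t \propto n, so
\[
  \mathrm{cost} \;\propto\; n \cdot n \cdot n_z \cdot n
  \;=\; O(n^4) \ \text{if } n_z \propto n, \qquad
  O(n^3) \ \text{for fixed } n_z .
\]
% Doubling resolution at fixed n_z:
% \mathrm{cost}(2n)/\mathrm{cost}(n) = 2^3 = 8.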
Regional models permit computational effort to be concentrated over a limited area for the purpose of computing higher-resolution simulations than would be possible globally. This capability is useful, for example, for simulating the effects of complex terrain on an evolving weather system. Nesting, the ability to refine the mesh over subdomains within a regional simulation, extends the ability of some regional models to efficiently concentrate resolution.

Regional and global models differ in several other important respects. Whereas global models must evolve the model solution on a sphere, regional models employ regularly spaced Cartesian grids requiring only minor adjustments to account for spherical distortion in the arc covered by the domain. Unlike global models, which are periodic in the east-west dimension and treat cross-polar flows as special cases, regional models employ fixed or forced lateral boundaries that may also include some form of relaxation. Often, the data used in the lateral boundary conditions are taken from a global model. In terms of numerical methods, regional models can be simpler than global models, employing finite-difference methods without the special treatment required around a pole. The use of finite-difference methods also has favorable implications for parallelization, since such methods are nearest-neighbor and therefore more straightforward to implement efficiently than the spectral method employed in some global models.
3. Structure and parallelization

Like global models, regional atmospheric models are composed of dynamics, which integrate the differential equations of fluid motion, and physics, which calculate the forcing terms in those equations based on the model state. Algorithms are chosen and implemented with concern for computational cost, stability and accuracy. These choices, in turn, have implications for the parallelization issues: data decomposition, communication and load balance.

Dynamics, which comprises advection and gravity-pressure balance, is based on the system of nonlinear time-dependent partial differential equations known as the primitive equations. The system of equations governs the transport of the computed fields, such as wind, temperature, pressure and humidity. Dynamics may employ a hydrostatic approximation, which assumes that the force of gravity is balanced by the vertical component of the pressure gradient, or a nonhydrostatic formulation. There may also be a high-order diffusion operator to maintain stability of the model.

The system of equations is traditionally solved by an Eulerian time integration scheme. Such schemes are referred to as explicit if the computation of a new time step is based on values from previous time steps only; otherwise, they are known as implicit. For an explicit scheme, the time step is limited by the fastest-moving waves, which in a hydrostatic model are gravity waves. To increase the time step, the terms responsible for the fastest-moving waves may be treated by an implicit scheme; such schemes are referred to as semi-implicit. Alternatively, the terms responsible for the fastest-moving waves may be integrated with a smaller time step than the remaining terms; schemes applying this technique are referred to as split-explicit.
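To make the explicit/implicit distinction concrete, the following is a minimal sketch, in C, of one explicit leapfrog step for one-dimensional linear advection. It is our illustration, not code from any of the models in this issue; the names and the boundary treatment are assumptions.

#include <stddef.h>

/* One explicit leapfrog step for du/dt + c du/dx = 0:
 *   u^{n+1}_i = u^{n-1}_i - (c dt/dx) (u^n_{i+1} - u^n_{i-1}).
 * Stability (CFL) requires |c| dt/dx <= 1, which is the explicit
 * time-step limit discussed in the text: halving dx forces halving dt. */
void leapfrog_step(const double *u_prev,  /* u at time level n-1 */
                   const double *u_curr,  /* u at time level n   */
                   double *u_next,        /* u at time level n+1 */
                   size_t n, double c, double dt, double dx)
{
    double nu = c * dt / dx;              /* Courant number */
    for (size_t i = 1; i + 1 < n; i++)    /* interior points only */
        u_next[i] = u_prev[i] - nu * (u_curr[i + 1] - u_curr[i - 1]);
    /* The end points i = 0 and i = n-1 would be filled from lateral
     * boundary conditions, e.g. values supplied by a driving model. */
}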
At meso- and microscale resolutions below 10 km, when the resolvable horizontal and vertical scales are of the same order of magnitude, the hydrostatic approximation begins to break down. Typically, small-scale models integrate either the anelastic or the fully compressible Euler/Navier-Stokes governing equations. The acoustic waves admitted by compressible models impose severe time-step restrictions due to the numerical stability constraints of explicit time integration schemes. The anelastic approximation filters sound waves directly in the governing equations by neglecting the pressure time derivative in the mass continuity equation. The anelastic equations can then be effectively integrated using an explicit scheme, where the pressure is obtained from an elliptic problem derived from the mass continuity equation. Specialized time integration schemes are required in compressible atmospheric models to overcome the stability constraints.

Even when the limitation imposed by the fast-moving waves is removed, the time step will be determined by stability rather than accuracy [15]. Hence, valuable computational resources are spent on operations that do not improve the forecast. To overcome this problem, semi-Lagrangian time integration schemes have become popular in atmospheric models. (In other disciplines, these schemes are referred to as Eulerian-Lagrangian methods or the modified method of characteristics.) Semi-Lagrangian schemes are unconditionally stable for advection problems, although the presence of forcing terms may introduce stability restrictions [1]. Furthermore, semi-Lagrangian schemes require a factor of five to ten times more floating-point operations per grid point per time step than the leapfrog scheme [1]. Nevertheless, semi-Lagrangian schemes are efficient from a computational point of view if the time step can be substantially increased and/or the physics and the treatment of the fast-moving waves account for a sufficiently large part of the total number of operations.

The physics component of a regional model includes physical processes such as long- and short-wave radiation, condensation and precipitation and surface processes. These processes occur mainly on a subgrid scale and are modeled by parameterization rather than by differential equations.

As with any application, implementing an atmospheric model on a parallel computer involves partitioning the work among processors so that they can work together to perform the computation faster than if the work were performed sequentially. Atmospheric models perform essentially the same set of computations in each vertical column of their three-dimensional, regular and almost universally Cartesian domains (the exception is icosahedral meshes, which are regular but non-Cartesian; these are used infrequently and only in global models). Therefore, an SPMD (single program, multiple data stream) data-domain decomposition approach to parallelism is favored. With this method, one or more dimensions of the domain are divided into subdomains, often rectangular tiles, and allocated to separate processors. On a distributed-memory machine, this implies that the state and intermediate data representing the subdomain are transferred to the local memory of a processor, where the data are readily accessible by that processor but accessible to other processors only through intermemory transfer: explicit message passing, shared-memory ‘gets’ and ‘puts’, or implicit data movement through cache-coherency hardware and software. The cost of data movement is a key efficiency concern because, unmitigated, it is pure overhead.
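As a concrete, deliberately simplified illustration of this decomposition, the sketch below assumes a one-dimensional slab decomposition and explicit message passing with MPI; the storage layout and function names are our own, not taken from any model in this issue. Each processor holds its interior columns plus a one-cell halo at each end, refreshed from the neighbors before each finite-difference step.

#include <mpi.h>
#include <stddef.h>

/* Halo exchange for a field decomposed into slabs along one horizontal
 * dimension. The local array holds nlocal interior columns of npts points
 * each, plus one halo column at each end:
 *   column 0          = left halo
 *   columns 1..nlocal = interior
 *   column nlocal + 1 = right halo
 * A one-cell halo suffices for second-order differencing; higher-order
 * schemes widen the exchange, as noted below. */
void exchange_halo(double *field, int npts, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    /* MPI_PROC_NULL makes the exchange a no-op at lateral boundaries,
     * where forced boundary values are applied instead. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* rightmost interior column -> right neighbor's left halo */
    MPI_Sendrecv(&field[(size_t)nlocal * npts], npts, MPI_DOUBLE, right, 0,
                 &field[0],                     npts, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* leftmost interior column -> left neighbor's right halo */
    MPI_Sendrecv(&field[(size_t)npts],                npts, MPI_DOUBLE, left,  1,
                 &field[(size_t)(nlocal + 1) * npts], npts, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}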
Partly because of this data-movement cost, model physics dictate the processor decomposition in atmospheric models: physics routines contain strong data dependencies in the vertical direction, but none in the horizontal. Hence, by decomposing only over the two horizontal dimensions, latitude and longitude, communication is avoided within physics, making this part of the code almost perfectly parallel. As Foster points out, it is also desirable from a software perspective to use existing sequential physics packages without modification, since these are the parts most liable to change over the life of the model code [4].

Dynamics, however, does entail significant horizontal data dependencies. For explicit Eulerian finite-difference schemes, communication involves exchanging data between edges of neighboring subdomains, as in the sketch above. The width of the edge exchange is fixed, a function of the order of the differencing operation: one cell for second-order schemes, two for fourth-order and so on. Semi-Lagrangian schemes also communicate data between neighboring subdomains, but the situation is complicated by the fact that the edge-exchange width is a dynamic function of the model state. A fixed overlap region can still be used, but it will be larger than for Eulerian schemes.

Most of the models discussed in this special issue also employ implicit methods, in which the value computed at a point depends not only on its neighbors but on the values of all the other points in the domain. The amount and pattern of the required global communication depend on the elliptic solver employed. Direct methods employ parallel data transposition; parallel iterative solvers use global summation (and other communication) along with preconditioners. The preconditioners may also entail communication, or they may be strictly local algorithms.
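For the iterative case, the global summation mentioned above typically amounts to one collective reduction per iteration. A minimal sketch, again with MPI and with names of our own invention:

#include <mpi.h>
#include <stddef.h>

/* Global dot product for a parallel iterative solver: each processor
 * computes a partial sum over its subdomain, then a single collective
 * produces the global value, e.g. for a residual norm or a
 * conjugate-gradient coefficient. */
double global_dot(const double *x, const double *y, size_t nlocal,
                  MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (size_t i = 0; i < nlocal; i++)
        local += x[i] * y[i];
    /* One reduction per iteration; this synchronization point is also
     * where any load imbalance shows up as idle time. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}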
In addition to communication cost, load imbalance may be a source of inefficiency in a parallel model. If some processors have more work than others (or if some processors are more powerful than others), the faster processors will reach a common synchronization point (communication or I/O) and then idle until the slower processors catch up. Sources of load imbalance include an uneven distribution of points to processors in a dimension, different amounts of work at domain boundaries, different amounts of work arising from the localized state of the atmosphere in part of the domain (physics imbalance), unequal processor speeds and unequal loading of processors by other jobs if the model does not have exclusive access to the nodes. Imbalances can be addressed by reallocation of work to processors, either statically at the beginning of a run or dynamically over the course of a run; frequently, however, such imbalances are simply ignored. The last option is not unreasonable if the cost or complexity of addressing an imbalance outweighs or marginalizes the benefit; nevertheless, several of the papers in this issue quantify and attempt to remedy load imbalance, with some success.

Many models in use today provide some form of mesh refinement for concentrating resolution over a particular feature of interest in a simulation. One-way nesting, in which lateral boundary forcing of the nest is provided by the parent but without feedback, is computationally the simplest mesh refinement strategy, because forcing is infrequent (on the order of hours between each new set of lateral boundary conditions) and the parent and nest can run asynchronously in batch mode. Two-way nesting entails feedback and has the advantage of keeping the parent and nest solutions from diverging into separate trajectories. Two-way nesting is more computationally demanding, however, in that the forcing and feedback occur at every time step and the parent and nest must run synchronously. Parallelization of such schemes is also more complex, since it requires a choice between costly communication between parent and nest running in parallel, or suffering load imbalance on the parent to avoid having to communicate forcing and feedback data between processors.
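The synchronous coupling is easiest to see as control flow. The following schematic assumes a 3:1 refinement ratio; the Domain type and the three routines are hypothetical placeholders for the interpolation, averaging and interdomain communication a real model performs.

/* Schematic two-way nesting step (illustration only). With 3:1 grid
 * refinement the nest takes three steps of dt/3 for each parent step of
 * dt, and forcing/feedback occur every parent step, so parent and nest
 * cannot run asynchronously the way one-way nesting allows. */
typedef struct Domain Domain;                        /* hypothetical */

void step(Domain *d);                                /* advance one time step */
void force_nest(const Domain *p, Domain *n);         /* parent -> nest lateral forcing */
void feedback_to_parent(const Domain *n, Domain *p); /* nest -> parent feedback */

void two_way_nested_step(Domain *parent, Domain *nest)
{
    step(parent);                     /* one parent step, dt */
    force_nest(parent, nest);         /* interpolate boundary forcing */
    for (int k = 0; k < 3; k++)       /* three nest steps of dt/3 */
        step(nest);
    feedback_to_parent(nest, parent); /* average nest solution onto parent */
}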
4. Overview of articles

The articles discuss various algorithmic, computational and software issues involved in the parallel implementation of regional models. All of the models are at least partly based on explicit Eulerian solvers using finite differencing, and the technique employed to parallelize these modules is a common thread: processors exchange messages to update halo regions surrounding the local subdomains. The exchanges may be implemented using direct calls to a message-passing library such as MPI [5] or using higher-level stencil communication libraries developed for the purpose. Two papers, by Bjørge and Skålin and by Thomas et al., explore semi-Lagrangian alternatives to Eulerian methods.

Most of the papers also discuss implicit methods for dealing with fast-moving waves, employing a number of approaches for generating solutions on parallel machines. Direct methods with parallel matrix transposes are employed by Bjørge and Skålin and by Schättler and Krenzien. Parallel versions of iterative solvers with preconditioners are used by Ashworth et al., by Bjørge and Skålin and by Thomas et al.

Load balancing is a concern for a number of authors. Schättler and Krenzien quantify the load imbalance in model physics in the Deutschland-Modell. Michalakes presents the design and implementation of dynamic load balancing in the parallel version of MM5.

The papers by Sathye et al. and Michalakes discuss nesting and the problems associated with designing efficient interdomain communication when nested domains are computed in parallel.

Sathye et al., Michalakes, and Wallcraft and Moore discuss the impact parallelization has on model software, in particular the ability to maintain a single model to run on diverse computing platforms. These papers discuss various software techniques aimed at enhancing portability, especially to parallel architectures, including the use of macro preprocessors and source translators prior to compilation.

The last two papers in this issue do not describe regional weather models but deal with important related parallelization efforts in regional geophysical modeling. Wallcraft and Moore discuss parallelization of the Navy Layered Ocean Model, a regional ocean model with many algorithmic and computational similarities to regional atmospheric codes. In this paper, the authors also address issues involved with writing and maintaining a single model source code on different high-performance computer architectures. The last paper, by Dabdub and Manohar, describes parallelization of an air quality model, the California Institute of Technology model, which includes transport and interaction of numerous chemical species in air over the Los Angeles metropolitan area.
5. Challenges

The work presented here represents a sample of the accomplishments in adapting regional models to parallel computers. Efforts continue by these groups and many others worldwide. Significant challenges remain, including processor performance, software issues, I/O, data assimilation and adaptive mesh refinement.

Obtaining good single-node performance on parallel computers using RISC processors with hierarchical memories has been difficult. Typical achieved performance is only 5 to 15% of theoretical peak, compared with efficiencies close to 50% on vector processors. Restructuring model codes to better utilize cache, and thereby attain better performance, has important implications for software maintenance and portability.
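One common form such restructuring takes, sketched below under our own assumptions (the array layout and routine name are invented for the example), is to order loops so that the innermost loop runs over the stride-1 index of the array:

#include <stddef.h>

/* Illustrative only: a second-order horizontal diffusion tendency written
 * so that the inner loop traverses the contiguous index i of a row-major
 * array. On cache-based RISC processors, contiguous traversal (rather
 * than the large strides some vector-oriented orderings used) is often a
 * large fraction of the difference between a few percent of peak and a
 * respectable fraction of it. */
void diffuse(const double *u, double *du, int nx, int ny, double k)
{
    for (int j = 1; j < ny - 1; j++)      /* rows */
        for (int i = 1; i < nx - 1; i++)  /* stride-1 inner loop */
            du[(size_t)j * nx + i] =
                k * (u[(size_t)j * nx + i + 1] + u[(size_t)j * nx + i - 1]
                   + u[(size_t)(j + 1) * nx + i] + u[(size_t)(j - 1) * nx + i]
                   - 4.0 * u[(size_t)j * nx + i]);
}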
Operational weather models are large, expensive and nontrivial to maintain even for one computer architecture, and still more so for many different ones. Tools and techniques that facilitate a single-source approach to multiple platforms must be developed.

The cost of input and output in a parallel weather model is a serious impediment to performance and, especially, scalability. Scalable I/O will become an increasing concern. Ad hoc efforts to address this problem in the context of parallelization libraries [14] and standards-based approaches [10,12,13] are being watched with interest. We also refer to a recent special issue of Parallel Computing [17] for further information on parallel I/O.

Load balancing is important to efficient use of parallel machines, but must be lightweight enough to deal effectively with imbalances that arise quickly as the result of model physics, for example, from the onset of unstable convection. Lightweight and automatic remapping mechanisms will also be of use in exploiting distributed computing resources.

Data assimilation provides a method for improving the quality of the forecast by including the influence of observations in a running simulation. Assimilation schemes include Newtonian nudging [16], optimal interpolation (OI) and four-dimensional variational assimilation. Four-dimensional data assimilation can, however, add significantly to the cost of a simulation; moreover, it requires efficient parallel input and may introduce additional load imbalance [8].

Nesting schemes in parallel models provide an efficient way of concentrating resolution over subregions in a forecast. However, in most models today (excepting Sathye et al.) the user must specify the location and duration of nests prior to the run, using a priori knowledge of where the resolution is likely to be needed. This falls short of true adaptive mesh refinement, in which gradient steepness, error bounds or other convergence criteria are assessed at run time and used to direct the automatic generation of finer meshes within the simulation. Interest in truly adaptive mesh refinement in weather modeling is growing, and research to address the numerical, computational and parallelization issues is under way [2,11].

This issue and other sources [3,6,7,9] show that scalable distributed-memory computers have been a prime target for weather and climate modeling groups since the lure of commodity-based cost-performance and unlimited scalability first began drawing attention away from the large vector mainframes. (Notwithstanding, the surge in enthusiasm for MPP has been tempered by at least a half-decade's experience and by the continued vitality of traditional vector supercomputing.) Distributed-memory clusters of symmetric multiprocessor nodes are coming to the fore. Finally, networked clusters of PCs are beginning to assume the distributed-memory message-passing mantle, as explicit two-sided message passing gives ground to very low-latency one-sided messaging and more automatic schemes for maintaining coherency between processor memories on the large parallel machines. Which of these architectures will dominate the future of atmospheric modeling in general, and regional models in particular, is unclear. In the meantime, the experiences and lessons learned in the development of parallel weather models will be crucial to satisfying the current and future operational and research computing requirements of the atmospheric modeling community.
Acknowledgements

This work was supported by the Mathematical, Information and Computational Sciences Division subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.
References

[1] P. Bartello, S.J. Thomas, The cost-effectiveness of semi-Lagrangian advection, Mon. Weather Rev. 124 (1996) 2883-2897.
[2] N. Chrisochoides, K. Droegemeier, G. Fox, K. Mills, M. Xue, A methodology for developing high performance computing models: Storm-scale weather prediction, in: High Performance Computing, 1993: Grand Challenges in Computer Simulation, The Society for Computer Simulation, 1993.
[3] J.B. Drake, I. Foster (Eds.), Parallel Computing, vol. 21, Special issues on applications: Climate and weather modeling, North-Holland, 1995.
[4] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, 1995.
[5] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1995.
[6] G.-R. Hoffmann, T. Kauranne (Eds.), Parallel Supercomputing in Atmospheric Science, World Scientific, River Edge, New Jersey, 1993.
[7] G.-R. Hoffmann, N. Kreitz (Eds.), Coming of Age, World Scientific, River Edge, New Jersey, 1995.
[8] L. Isaksen, S.R.M. Barros, IFS 4D-variational analysis: Overview and parallel strategy, in: Coming of Age, World Scientific, River Edge, New Jersey, 1995, pp. 337-351.
[9] F.X. LeDimet (Ed.), High Performance Computing in the Geosciences, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995.
[10] Message-Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, July 1997, to be released.
[11] J. Michalakes, RSL: A parallel runtime system library for regional atmospheric models with nesting, in: Proceedings of the IMA Workshop ‘Structured Adaptive Mesh Refinement Grid Methods’, March 12-13, Minneapolis, 1997, to appear. Also preprint ANL/MCS-P663-0597.
[12] P. Corbett et al., MPI-IO: A parallel file I/O interface for MPI, Version 0.3, Tech. Rep. NAS-95-002, NASA Ames Research Center, 1995.
[13] J. Pool, Scalable I/O initiative, World-Wide Web page at http://www.cacr.caltech.edu/SIO, Sept. 1995.
[14] B. Rodriguez, L. Hart, T. Henderson, A library for the portable parallelization of operational weather forecast models, in: Coming of Age, World Scientific, River Edge, New Jersey, 1995, pp. 148-161.
[15] A. Staniforth, J. Côté, Semi-Lagrangian integration schemes for atmospheric models: A review, Mon. Weather Rev. 119 (1991) 2206-2223.
[16] D.R. Stauffer, N.L. Seaman, Use of four-dimensional data assimilation in a limited-area mesoscale model. Part I: Experiments with synoptic-scale data, Mon. Weather Rev. 118 (1990) 1250-1277.
[17] D.E. Womble, D.S. Greenberg (Eds.), Parallel Computing, vol. 23, Special double issue: Parallel I/O, North-Holland, 1997.