Environmental Modelling & Software 25 (2010) 398–411
A comparison of three parallelisation methods for 2D flood inundation models

Jeffrey C. Neal a,*, Timothy J. Fewtrell b, Paul D. Bates a, Nigel G. Wright c

a School of Geographical Sciences, University Road, University of Bristol, Bristol BS8 1SS, UK
b Willis Research Network, School of Geographical Sciences, University Road, University of Bristol, Bristol BS8 1SS, UK
c UNESCO-IHE Institute for Water Education/TU Delft, Delft, The Netherlands
Article history: Received 29 July 2009; received in revised form 10 November 2009; accepted 14 November 2009; available online 1 December 2009.

Abstract
For many applications two-dimensional hydraulic models are time-intensive to run due to their computational requirements, which can adversely affect the progress of both research and industry modelling projects. Computational time can be reduced by running a model in parallel over multiple cores. However, there are many parallelisation methods and these differ in terms of difficulty of implementation, suitability for particular codes and parallel efficiency. This study compares three parallelisation methods based on OpenMP, message passing and specialised accelerator cards. The parallel implementations of the codes were required to produce near identical results to a serial version for two urban inundation test cases. OpenMP was considered the easiest method to develop and produced similar speedups (of ~3.9) to the message passing code on up to four cores for a fully wet domain. The message passing code was more efficient than OpenMP, and remained over 90% efficient on up to 50 cores for a completely wet domain. All parallel codes were less efficient for a partially wet domain test case. The accelerator card code was faster and more power efficient than the standard code on a single core for a fully wet domain, but was subject to a longer development time (2 months compared to <2 weeks for the other methods). © 2009 Elsevier Ltd. All rights reserved.
Keywords: Hydraulic model; LISFLOOD-FP; Urban flooding; OpenMP; MPI; SIMD; ClearSpeed
1. Introduction

Two-dimensional flood inundation models have become an integral part of flood risk assessment in both urban and rural areas, providing a means of converting catchment discharge into inundation extent, depth and in some cases flow velocity. These models vary in complexity from solutions of the two-dimensional shallow water equations (e.g. Villanueva and Wright, 2006a) to storage cell models based on Manning's equation (e.g. Bates and De Roo, 2000). Common to all these models, however, is the computational expense (or burden) for many real world applications. Recently, developments in computer hardware have moved away from increasing clock speed towards providing many computational cores that operate in parallel. For example, desktop computers are now routinely sold with dual-core x86 general purpose central processing units (CPUs) and this trend towards larger multi-core machines is expected to continue in the future. Developers of flood inundation models have begun to acknowledge this change by producing versions of their codes that
are capable of running in parallel. The methods chosen to parallelise a code are usually based on message passing techniques that use cores with distributed memory (MPIF, 2008), shared memory multiprocessor approaches such as OpenMP (Chapman et al., 2008), or the massively parallel nature of graphics processor units (GPUs). These methods have been proven to facilitate simulation speedup over single core implementations of various flood inundation model codes (Rao, 2005; Villanueva and Wright, 2006b; Neal et al., 2009; Lamb et al., 2009). To date, however, no study has undertaken a controlled comparison of these methods to analyse their computational performance and determine (at least qualitatively) the ease with which they can be implemented for a given hydraulic model. This is, therefore, the fundamental task undertaken in this paper. Section 2 reviews past efforts to implement parallel computing in hydraulics and outlines the research design to be followed here. Section 3 describes the implementation of a variety of parallel computing methods for a cut-down version of the LISFLOOD-FP hydraulic model (Bates and De Roo, 2000), whilst Section 4 presents the test case data used to benchmark model performance and verify the parallel codes. The results of these tests are described in Section 5 and discussed in Section 6. Conclusions are drawn in Section 7.
2. Review of past studies

Hydraulic models, including TRENT (Villanueva and Wright, 2006a,b), CalTWiMS (Pau and Sanders, 2006), TELEMAC-2D (Hervouet, 2000) and RMA (Rao, 2005), have used the message passing interface (MPI) and domain decomposition to run simulations in parallel. In these cases, the domain decomposition approach breaks the two-dimensional model domain into a number of contiguous sub-domains, with each sub-domain run on a different processing core. Pau and Sanders (2006) achieved speedups of 1.5–2 on four cores for an implicit, total variation diminishing finite volume scheme. They demonstrated the value of non-blocking message passing on machines with relatively low interconnect bandwidth (100 Mb/s) compared to currently available hardware, such as the InfiniBand interconnect (850 Mb/s) that will be used in this study. Hervouet (2000) achieved speedups of 3 and 5 on four and eight cores respectively for the TELEMAC-2D implicit finite element model when run on a CRAY T3E vector machine, whilst Rao (2005) reported almost linear speedups on four cores for the RMA2 finite element model. Villanueva and Wright (2006b) obtained speedups of up to 6, 12 and 30 on 6, 18 and 32 processors respectively for a coupled 1D-2D full shallow water model, where the 1D model was run in series on a master processor and the 2D model distributed amongst slave processors.

The LISFLOOD-FP explicit storage cell model (Bates and De Roo, 2000) has been parallelised using OpenMP. Neal et al. (2009) tested the OpenMP version of the LISFLOOD-FP code for a range of test cases with domain sizes that varied from 3k to 3M cells, where some of the models included a one-dimensional channel model run in series. Here the computation was distributed between processor cores using data decomposition, such that the data to be processed were distributed between cores by breaking up the loop that ran through rows of floodplain cells in the two-dimensional domain. The study showed that explicit storage cell codes are straightforward to parallelise with OpenMP, that the test case used could have a significant effect on the speedup achieved, and that larger model domains run on fewer cores tended to parallelise most efficiently.

The range of speedups reported above reflects some of the considerable number of influences on the speedup of a particular code, including differences between model types, specific test cases and the hardware used. This makes it difficult to compare the parallelisation methods that have been applied to hydraulic models in the literature, despite the often detailed descriptions of the model development and implementation.

The techniques above are multiprocessor parallelisation methods that use multiple general purpose CPU cores to solve equations concurrently and thus reduce model computation time. An alternative method is to use a coprocessor in conjunction with a general purpose CPU to handle some of the computationally intensive arithmetic. A coprocessor will have limited functionality in comparison to the general purpose CPU, enabling its design to be optimised for specific tasks, often floating point arithmetic. There are many forms of coprocessor that integrate with modern computers, the most common being graphics processing units (GPUs), which are dedicated to performing the computation associated with graphics processing. Recently, the use of this hardware as general purpose graphics processing units (GPGPUs) has allowed these coprocessors to conduct floating point arithmetic not associated with graphics, including finite element applications (Göddeke et al., 2007a). The main attraction of these coprocessors is their very high peak single precision floating point performance. For example, an NVIDIA C1060 Tesla (NVIDIA, 2009) has a peak performance of 933 billion floating point operations per second (GFLOPS) compared with around 71.5 GFLOPS for a double precision Intel E5462 quad-core CPU (SPEC.org, 2009).
The JFLOW two-dimensional diffusive wave model (Bradbrook et al., 2004) has recently been rewritten for GPGPUs to yield substantially faster computation on single accelerator processors, with the limitations of single precision and a complete recoding to a graphics-oriented language. Lamb et al. (2009) reported a speedup of 112 (simulation time 9.5 min) for the GPU code run on an NVIDIA GeForce 8800GTX over the serial JFLOW code, which took 18 h, for a test case in Glasgow, UK. Although the numerical schemes employed are almost identical apart from subtle differences in the time step calculation, they are entirely separate codes with different structures, to the extent that not all of the speedup was hardware related.

ClearSpeed™ accelerator cards are another example of a coprocessor, although these are specifically designed for computationally intensive science and engineering applications. This means the processors primarily perform double precision arithmetic, with an emphasis on maximising the number of floating point operations per second (FLOPS) per watt of power consumed (ClearSpeed, 2008a). The CSX600 processor, of which there are two on each accelerator card, is quoted as being capable of 33 GFLOPS of sustained double precision matrix multiplication for just 10 W of power. This is achieved by having 96 processing elements with relatively slow clock speeds of 210 MHz on a processor chip (ClearSpeed, 2007). Both the GPGPUs used by Lamb et al. (2009) and ClearSpeed accelerator cards take advantage of a parallelisation method known as single instruction multiple data (SIMD), where many processing elements (PEs) perform the same task on different data. This allows more of the chip space to be dedicated to arithmetic operations. However, it can be difficult to adapt codes or numerical schemes designed for use on CPUs to a SIMD environment. Other examples of coprocessors include the IBM Cell and field-programmable gate arrays (FPGAs). Göddeke et al. (2007b) undertook a GPU and FPGA comparison for FEM solvers, but these will not be covered in this paper.

There is relatively little information regarding the speedups achieved on benchmark test cases for flood inundation models through parallelisation as opposed to code and compiler optimisation. Even less common are direct comparisons between different parallelisation methods for the same model code. Thus, the novel task undertaken here is a comparison of three parallelisation techniques, representing the main families of method described above, for a single flood inundation model. A simplified serial version of the LISFLOOD-FP hydraulic model called LISMIN is used as a benchmark for three parallel versions that solve the same problem. This yields four models to be compared:

1. A serial version of the model (LISMIN)
2. A version of the LISMIN model that uses the OpenMP application programming interface (API) and shared memory (LISMIN-OMP)
3. A version that uses the message passing interface (MPI) and distributed memory (LISMIN-MPI)
4. A version of the code where the computationally expensive floodplain components of the model are run on a ClearSpeed accelerator card/coprocessor (LISMIN-CS)

An additional aim of the paper is to maintain all the functionality of the serial version of the model, LISMIN, such that direct comparisons can be made between codes, and to qualitatively document the time taken, technical difficulty and code structural changes required to implement each parallel method.
The paper will emphasise the practicality of coding these three methods for the given application and the speedups achieved. It does not necessarily present the optimal use of the method, which may have
been achieved with more development time; results should therefore be viewed in this context.

3. Parallel code development

LISFLOOD-FP, first published by Bates and De Roo (2000), has since been developed substantially. The current C++ version is written as a modular code that is capable of running in parallel using the OpenMP API (Neal et al., 2009). The code includes functionality for:

- two-dimensional floodplain flows based on a raster storage cell model (Bates and De Roo, 2000);
- one-dimensional kinematic (Bates and De Roo, 2000) and diffusive channel models (Trigg et al., 2009);
- adaptive time stepping (Hunter et al., 2005);
- time varying, fixed and free water level or flux boundary conditions;
- point sources;
- options to export model states at snapshot times or as time series;
- options to calculate maximum levels and inundation times.

In this paper, only the floodplain component of the model with zero-flux boundaries and no dry checking function (Bates and De Roo, 2000) will be considered, although ease of integration with the channel model and other components is discussed in Section 5. Removing all functionality apart from the two-dimensional floodplain model with adaptive time stepping leaves the LISMIN code used in this study, which is described in the next section. This is followed by a description of the three parallel implementations of the LISMIN model in Sections 3.2–3.4.

3.1. Serial version (LISMIN)

The hydraulic model LISMIN is a two-dimensional model for floodplain flows based on the LISFLOOD-FP model (Bates and De Roo, 2000; Hunter et al., 2005). The floodplain is treated as a Cartesian grid of storage cells with flows Q in the x and y directions between cells calculated using Manning's equation. The model is described in Fig. 1, where h is the water free surface height [m], Δx and Δy are the cell dimensions [m], n is the effective Manning's friction coefficient [s m^(-1/3)], and Q describes the volumetric flow rates between floodplain cells [m^3 s^(-1)]. The flow depth, h_flow [m], represents the depth through which water can flow between two cells. This approach assumes that the flow is a function of the free surface height difference between adjacent cells and differs from a 2D diffusive wave only in the sense that the x- and y-components of the flow are decoupled. An explicit, first-order finite difference time marching scheme with a five point spatial stencil is used, where the water heights are only a function of water heights calculated at the previous time step. Stability is maintained by adapting the model time step using a diffusive analogy to the classical CFL condition for advective flows described in Hunter et al. (2005). This leads to the following equation for the maximum stable time step Δt between two cells:
Δt = (Δx²/4) (2n |∂h/∂x|^(1/2) / h_flow^(5/3))    (1)

where ∂h is the difference in water free surface height between two cells. For very low free surface gradients a linear scheme is used to prevent the time step tending to zero (Hunter et al., 2005). The approach requires fewer state variables and instructions per time step than a full two-dimensional shallow water model, meaning it is in principle easier to make alterations to the code to run in parallel (Neal et al., 2009). Furthermore, as an explicit code, it is conceptually quite simple to parallelise through data decomposition of the spatial domain, which is the approach adopted by all three parallel methods implemented here. Water can enter the domain as a point source (time varying) or as an initial depth at the beginning of the simulation.

3.2. OpenMP (LISMIN-OMP)

The OpenMP API is implemented through compiler directives that instruct segments of code to be run as parallel threads. It supports both function and data decomposition approaches to parallelisation, where function decomposition involves running different operations on each processor core, e.g. one core could run the model described here whilst another calculates carbon fluxes over the domain. Here only data decomposition is considered. Although calculations are done on multiple processor cores, all threads must share a common memory space, which places a restriction on the number of cores available to OpenMP codes. For example, in this study, up to eight shared memory cores will be used, although in the future larger multi-core machines are likely to become available. OpenMP was the least labour-intensive of the three methods discussed here to implement and can be applied incrementally to the code (e.g. one section of the code can be parallelised whilst the others remain as serial components). More information on OpenMP can be found at http://openmp.org.

Data decomposition was used to split the loops called in calcQx, calcQy and UpdateH by the LISMIN code (Fig. 1) into threads run on different processor cores, as illustrated by Fig. 2. By doing this, rows of the model grid, each containing many cells, are distributed between separate threads. The final adjustment to the LISMIN code was to update the maximum stable time step in the calcQx and calcQy functions within an OpenMP critical section. The OpenMP critical section ensures that only one thread can execute the code within the defined section at any particular time. Thus, it can be used to prevent multiple threads attempting to update the time step simultaneously. The implementation requires each thread to calculate its own private maximum stable time step. Once all the cells allocated to a thread have been processed, the critical section ensures that each thread updates a shared maximum stable time step with its private maximum stable time step in series, resulting in the same time step as calculated by LISMIN. Choosing to run a section of code in series may appear to be a limitation. However, the critical section contains a single if statement and a line of code to update the shared maximum stable time step with the thread maximum stable time step if it is shorter. As this is called only once per thread, the serial computation associated with the critical section is small relative to that of the parallelised code sections. The time update, point source and file I/O were not parallelised.

As OpenMP uses compiler directives to parallelise a code, these can be ignored during compilation by not including the required compiler flags for OpenMP. This means that the LISMIN-OMP code can be compiled to form either a parallel or a serial model, or compiled as a serial model by a compiler that does not support OpenMP.
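To make this pattern concrete, the sketch below shows how a storage-cell flow loop of the kind described above can be threaded with OpenMP, with each thread tracking a private maximum stable time step that is merged in a critical section. It is a minimal illustration rather than the LISMIN-OMP source: the array names, the Manning-type flux expression and the wet/dry test are assumptions made for the example.

```cpp
// Minimal sketch (not the published LISMIN-OMP source) of an OpenMP-threaded
// flow loop with a per-thread time step merged in a critical section.
// Array layout, names and the flux/time-step expressions are illustrative.
#include <algorithm>
#include <cmath>
#include <vector>

void calcQx(int rows, int cols, double dx, double n_manning,
            const std::vector<double>& h,  // water free surface height per cell
            const std::vector<double>& z,  // bed elevation per cell
            std::vector<double>& Qx,       // flow between cell (i,j) and (i,j+1)
            double& dt_shared)             // shared maximum stable time step
{
    const double dt_in = dt_shared;        // read once before the parallel region
    #pragma omp parallel
    {
        double dt_private = dt_in;         // each thread tracks its own minimum
        #pragma omp for schedule(static)   // rows are divided between the threads
        for (int i = 0; i < rows; ++i) {
            for (int j = 0; j < cols - 1; ++j) {
                const int p = i * cols + j;
                const double dh    = h[p] - h[p + 1];              // free surface difference
                const double hflow = std::max(h[p], h[p + 1]) -
                                     std::max(z[p], z[p + 1]);     // depth available for flow
                if (hflow <= 0.0 || dh == 0.0) { Qx[p] = 0.0; continue; }
                const double slope = std::fabs(dh) / dx;
                // Manning-type flux between storage cells (illustrative form)
                Qx[p] = (std::pow(hflow, 5.0 / 3.0) / n_manning) *
                        std::sqrt(slope) * dx * (dh > 0.0 ? 1.0 : -1.0);
                // Cell-wise stable time step following Eq. (1)
                const double dt_cell = 0.25 * dx * dx * 2.0 * n_manning *
                                       std::sqrt(slope) / std::pow(hflow, 5.0 / 3.0);
                dt_private = std::min(dt_private, dt_cell);
            }
        }
        #pragma omp critical               // one thread at a time merges its minimum
        {
            if (dt_private < dt_shared) dt_shared = dt_private;
        }
    }
}
```

Compiled with an OpenMP flag (e.g. -fopenmp for GCC, or the equivalent Intel option) the rows of the loop are shared between threads; without the flag the pragmas are ignored and the same source builds as a serial model, mirroring the dual-build behaviour of LISMIN-OMP noted above.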
3.3. MPI (LISMIN-MPI)

Message passing is a means of parallelising code to run on multiple processors with distributed memory space. As each processor core holds variables in its own local memory space, messages must be passed between processors if the code on one processor core requires some or all of the result of a process run on another processor core.
Fig. 1. Schematic of the LISMIN model function calls.
In most cases (including here) the whole code must be parallelised at once to avoid passing excessive amounts of data between processor cores, whilst the code must also be compiled for the interconnect on the machine it is to be run on. This means that, unlike the OpenMP version, the effect of parallelisation on the operation of point sources, time keeping and file I/O components of the LISMIN code must also be considered. Like OpenMP, the code can be decomposed using function or data decomposition. Here a special case of data decomposition called domain decomposition, where the domain is split into contiguous sub-domains, is used. LISMIN-MPI adopts a single program multiple data (SPMD) approach where identical code runs on each core but with different data. The 2D Cartesian model domain is decomposed into sub-domains by rows, with different sub-domains processed by each core (see Fig. 3B). Message passing between cores occurs at four points in the code, as illustrated by the schematic in Fig. 3. First, the MPI_Isend and MPI_Irecv functions are used to pass Qy from a core of rank i to i + 1. To hide the time required for message passing due to latency and bandwidth limitations, these calls are
non-blocking, such that the message passing is hidden behind the computation of Qx, which doesn't need to be passed between cores. Once Qx and Qy have been calculated, the maximum stable time step on all the cores must be found. To achieve this, the MPI_Allreduce command was used to find the shortest maximum stable time step calculated and broadcast this value to all cores before the function UpdateH is called. As all processors must synchronise before UpdateH can be called, this is a blocking command, which needs to be undertaken to ensure an identical time step to LISMIN is used. When the UpdateH function is called, the first row of data is updated, after which MPI_Irecv and MPI_Isend commands are used to pass this value from the core with rank i to i − 1. These non-blocking calls will hide the message passing provided the time taken to update the remaining water levels is longer than the time needed to message pass. At the end of the simulation an MPI_Gather command is used to gather the simulated water levels from each processor on the processor with rank 0, from which the file output is executed. This function could also be used to take snapshots of water levels in the domain during the simulation, although this is not implemented here so that the functionality of the code matches that of LISMIN. By using the MPI functions in this way, the results of a simulation are near identical to those of LISMIN and LISMIN-OMP.
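The sketch below illustrates this communication pattern in a self-contained form: a non-blocking exchange of the boundary row of Qy overlapped with the Qx computation, a global minimum reduction of the time step, a second non-blocking exchange during UpdateH, and a final gather on rank 0. It is an illustrative outline only; the flow calculations are replaced by placeholders and the buffer names, sizes and layout are assumptions, not the LISMIN-MPI source.

```cpp
// Illustrative outline of the LISMIN-MPI message-passing pattern described
// above. Flow computations are represented by placeholder comments; buffer
// names, sizes and layout are assumptions for the example.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int cols = 384, rows_local = 4;           // e.g. 200 rows split over 50 cores
    std::vector<double> h(rows_local * cols, 0.1);  // local water levels (sub-domain)
    std::vector<double> Qy(rows_local * cols, 0.0);
    std::vector<double> Qy_halo(cols, 0.0), h_halo(cols, 0.0);

    for (int step = 0; step < 100; ++step) {
        double dt_local = 10.0;                     // would come from calcQy/calcQx

        // Non-blocking exchange of the boundary row of Qy (rank i -> i + 1),
        // hidden behind the calcQx work, which needs no communication.
        MPI_Request reqs[2]; int n = 0;
        if (rank + 1 < size)
            MPI_Isend(&Qy[(rows_local - 1) * cols], cols, MPI_DOUBLE,
                      rank + 1, 0, MPI_COMM_WORLD, &reqs[n++]);
        if (rank > 0)
            MPI_Irecv(Qy_halo.data(), cols, MPI_DOUBLE,
                      rank - 1, 0, MPI_COMM_WORLD, &reqs[n++]);
        /* ... calcQx over the sub-domain while the messages are in flight ... */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

        // Blocking reduction: every core adopts the shortest stable time step.
        double dt = 0.0;
        MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

        // Update the first row, send it to rank i - 1 while the remaining
        // rows are updated, then complete the exchange.
        n = 0;
        if (rank > 0)
            MPI_Isend(h.data(), cols, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD, &reqs[n++]);
        if (rank + 1 < size)
            MPI_Irecv(h_halo.data(), cols, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &reqs[n++]);
        /* ... UpdateH for the remaining rows of the sub-domain using dt ... */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }

    // Gather the simulated water levels on rank 0 for file output.
    std::vector<double> h_all(rank == 0 ? size * rows_local * cols : 0);
    MPI_Gather(h.data(), rows_local * cols, MPI_DOUBLE,
               h_all.data(), rows_local * cols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```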
Fig. 2. Schematic of the LISMIN-OMP model parallelisation with compiler directives.
3.4. Accelerator cards (LISMIN-CS)

The ClearSpeed accelerator cards used here comprise two CSX600 processors, where each processor has a single instruction multiple data (SIMD) linear array of 96 processing elements and access to 1 GB of DDR2 SDRAM memory known as mono memory. The use of the SIMD array means that all processing elements execute the same instructions on different data. Each processing element has its own physically distinct 6 kbyte memory space known as poly memory. Data can be copied to a processing element's poly memory from the mono memory space or from the poly memory of a neighbouring processing element in the linear array. The processing elements have a separate thread for copying data between mono and poly memory spaces, allowing the compute
thread to continue running. Therefore, data input/output (I/O) operations can be done asynchronously with computation on the processing element. All processing elements have 32 or 64-bit capabilities. Unlike OpenMP and MPI, function decomposition is not possible on the processor due to the use of a single instruction stream, although separate functions could run on the host and accelerator card (e.g. a one-dimensional channel on the host and two-dimensional floodplain model on the card). LISMIN-CS comprises two executables that run on the host and accelerator card, respectively. The host code is written in C and was compiled using the Intel C compiler with flags to dynamically link to various ClearSpeed libraries necessary to access the accelerator card (ClearSpeed, 2008a). This executable was responsible for all file I/O, loading the accelerator executable onto the card, moving data between card and host and calling external functions on the card (as illustrated at the top of Fig. 4). To maintain consistency, the file I/O section of the code is identical to that used by LISMIN.
Fig. 3. Schematic of the LISMIN-MPI model parallelisation with MPI functions and illustration of domain decomposition method (B).
In this application, data was initially loaded from the host machine into the accelerator card mono memory using the CSPX application programming interface (ClearSpeed, 2008b). This interface also handles the initialisation and function calls on the accelerator card (Fig. 4). The mono memory space on the card was easily large enough to store the test cases used here, meaning that communication of data between the host machine and card was limited to loading the model data onto the card and retrieving simulation results. To begin a model simulation, the host machine calls a function on the card and passes to it the parameters of the data that have previously been loaded onto the card. The host then waits for that function to finish before retrieving the simulation results from the card. The accelerator code, written in the Cn language, was compiled using the ClearSpeed Cn compiler. This executable runs a simulation on the accelerator card using data received from the host. The on-card functions used by LISMIN-CS are illustrated by sections A and B of Fig. 4. When a call to run the model is issued by the host
machine, the on-card function begins to run the numerical scheme set out in LISMIN for the prescribed simulation time. To achieve this on many processing elements, the data associated with a subset of cells was moved from mono to poly memory using the function async_memcpym2p. Since each processing element has only 6 kbytes of memory, the whole domain could not be stored in poly memory. In order to fit within the memory constraints, the code divides the data necessary to calculate flow Q and water depth h on the first row of the domain equally between the poly memories on each processing element at the beginning of each time step (see Fig. 4A). Then, whilst computing these values, the data for the next row of cells is loaded into poly memory, such that the next row of flows and depths can then be calculated.
Fig. 4. Schematic of the LISMIN-CS model parallelisation showing transfer of data from the host machine to the accelerator card (top), processing of the data (A) and calculation of flows and levels at each time step on a processing element (B).
At the boundary between the domains on each processing element, calculations of flow were passed between neighbouring poly memories of the linear array of processing elements, eliminating the need to duplicate computation on processing elements (see Fig. 4B). Updated water levels are written back to mono memory asynchronously using the function async_memcpyp2m. To further reduce computational time, flows are calculated in groups of four such that the vectorised versions of the expensive power and square root functions available in the ClearSpeed vector math library could be utilised (ClearSpeed, 2008c). The process of loading and computing rows of data continues until all rows have been computed. Once all the rows have been processed for a given time step, the reduction function cs_reduce_min is used to find the maximum stable time step calculated by each processing element. The inflows from point sources are calculated in mono memory by a single mono-processing element, although this could be done in parallel if many point sources are needed. The only difference between the implementation of the model in LISMIN-CS and LISMIN was in the treatment of the time step calculation. LISMIN computes flows and maximum stable time step for every cell in the domain. It then finds the minimum of
these time steps and updates water levels accordingly. However, LISMIN-CS computes the maximum stable time step, flows and updated water levels for sections of the domain, with the smallest of the maximum stable time steps only known after all sections of the domain have had their water levels updated. Therefore, the time step used was based on a linear extrapolation of the previous two calculated maximum stable time steps or the previous maximum stable time step, using whichever was smaller. This allows the time step to both increase and decrease over a simulation. For the test cases used here this ensured a stable solution and reduced the amount of data movement between poly and mono memory. However, it was slightly less efficient in terms of the number of time steps required for a given simulation time (see Section 4.1). The suitability of the time step can be easily checked at the end of each time step where the time step used and maximum stable time step are both known. All the memory copy, vector maths and reduction functions used here are described in the Cn standard reference manual (ClearSpeed, 2008c).
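One way to write the extrapolation rule described above explicitly (notation introduced here for illustration, with Δt_(n−1) and Δt_(n−2) denoting the maximum stable time steps found on the two previous steps) is:

Δt_n = min( Δt_(n−1), Δt_(n−1) + (Δt_(n−1) − Δt_(n−2)) ) = min( Δt_(n−1), 2Δt_(n−1) − Δt_(n−2) )

Read this way, the linearly extrapolated value is adopted when the stable step is shrinking, while the previous value is retained when it is growing; the step can therefore still increase over a simulation because Δt_(n−1) itself increases, and the check against the true maximum stable step at the end of each time step confirms that the choice was safe.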
4. Study site and test cases
The study site is an urban area of Glasgow, UK, previously used by Hunter et al. (2008) to benchmark six hydraulic model codes and by Fewtrell et al. (2008) to test the effect of grid resolution on simulations by the LISFLOOD-FP code. The test case was also used by Lamb et al. (2009) to benchmark a GPU version of the JFLOW code and by Schubert et al. (2008) to study the effects of building representation. The test case comprises a 2 m resolution digital terrain model (DTM) with buildings re-inserted from Ordnance Survey MasterMap data and a point source hydrograph from a culvert overflow in the north-east corner. A uniform roughness parameter of 0.035 was used based on Fewtrell et al. (2008). On average, this is greater than the spatially varying roughness used by Hunter et al. (2008) and Lamb et al. (2009), which is likely to allow for longer stable time steps here given Eq. (1). The DEM used here is slightly smaller than that used by Hunter et al., such that the domain has 200 by 384 cells. This was done so that the DTM columns were divisible by the 96 processing elements of the CSX600 processor and the number of rows was divisible by the number of processor cores used to test the OpenMP and MPI implementations. This ensured that all methods were able to equally balance the number of cells on a core or processing element, allowing a fairer and more controlled test of each method, although it is not a requirement for any of the codes. Fig. 5 shows the DTM and the result of the point source hydrograph test case using the LISMIN code. This test case begins as a dry domain that inundates over a period of 7200 s, thus providing an example of a partially wet domain. A second test case, with no point source, where the whole domain was initialised with a water depth of 0.10 m, was also created. Fig. 6 shows the resulting water levels from this simulation after 10 s. The intention of the fully wet domain test case is to illustrate the effect of the load balancing problems caused by having a partially wet domain in the point source test case. The 10 s simulation time was adequate to compare the different codes and kept computation time short to avoid wasting processor time; however, real applications are likely to have much longer simulation times. Simulated depth from the point source test case after 7200 s using LISMIN differed from the LISFLOOD-FP results at some locations by 0.001 m. These differences were not unexpected due to the complexity of the codes but prompt the following brief section on model verification.

4.1. Model verification
This section presents the results of the partially wet test case for the three parallel codes, LISMIN and serial LISFLOOD-FP. Water depths were sampled at the four point locations marked in Fig. 5 at 10 s intervals for the duration of the simulation. These point locations were also used by Hunter et al. (2008), Fewtrell et al. (2008) and Lamb et al. (2009). The top row of Fig. 7 plots the water depths from LISMIN and the three parallelised LISMIN codes. Below these is a matrix of plots showing the difference between simulations from each of the codes, including LISFLOOD-FP, at the four point locations over time. The simulated levels from all codes differ, with LISMIN-OMP and LISMIN-CS the least alike, with water level differences of up to 0.02 m at location 1. At the end of the simulation, the greatest difference in water level for any code was 4 × 10⁻⁴ m, whilst the majority of points differed by less than 1 × 10⁻⁴ m. LISMIN-OMP and LISMIN-MPI were the most alike, with a maximum difference of 1 × 10⁻⁴ m at any sample point throughout the simulation, whilst LISMIN-CS simulated levels that were most like those from LISFLOOD-FP. The greatest differences between all codes occurred during the initial wetting phase, whilst the time step was decreasing at its fastest rate. Fig. 8 plots the LISFLOOD-FP time step (a) and the difference between the time step used by LISFLOOD-FP and the other codes (b). As expected, LISMIN-CS tends to have the shortest time step due to its time step extrapolation routine. LISMIN, LISMIN-OMP and LISMIN-MPI have similar time steps except at 500 s. As the precision of the water level sample times is determined by the time step, the sampling time differences between codes will account for some of the water level differences in Fig. 7 and time step differences in Fig. 8. Although simulation results are not identical, it is assumed here that the codes are functionally equivalent for practical purposes. Furthermore, the differences between the two serial codes suggested that the parallelisation techniques are not the principal cause of the differences in results.

5. Results

All test cases were run on the University of Bristol high performance computer BlueCrystal. MPI and OpenMP jobs were run on quad-core Intel Harpertown E5462 processors with clock speeds of 2.8 GHz and 12 MB cache. Message passing between cores was handled using an 850 Mb/s InfiniBand interconnect.
Fig. 5. Point source test case at end of simulation and DEM.
Fig. 6. Fully wet domain test case at end of simulation.
The OpenMP jobs were also run on dual-core AMD Opteron 2218 processors with clock speeds of 2.6 GHz and cache memory of 1024 KB. The ClearSpeed accelerator cards were connected to PCI-X slots on the AMD nodes. The BlueCrystal system runs the ClusterVisionOS Linux-based cluster operating system. The next section presents profiling results for the serial code followed by the results of the parallel implementations.

5.1. Code profiling

The serial implementations of the code were compiled and profiled for the two test cases using the Intel C++ compiler
version 10.1.015. The wet domain test case took 80.8 s to run on a single Intel E5462 core, whilst the point source test case took 2849.4 s. An AMD Opteron 2218 core took 127.8 and 4738.8 s for the fully wet and point source test cases, respectively. A summary of the profiler results (Table 1) indicated that both test cases spent over 99% of their execution time in sections of the code that can be run in parallel. In the wet test case, the percentage serial time predominantly represents the time required for file I/O, looping through the time steps and setup costs incurred during the model start-up. In the point source case the serial section of the code also includes the time spent calculating the point source inflow after each time step (2.7 s or <0.01% of the runtime).
Fig. 7. Plots of simulated water depths over time at the four control points for LISMIN and the three parallelised models (top row). Other plots (rows 2–5) show simulated water depth from the plots on the top row minus simulated water levels from other codes, including LISFLOOD-FP.
Fig. 8. Plots of LISFLOOD-FP time step over time (a) and LISFLOOD-FP time step minus time steps from the other codes over time (b).
Since the same functions are parallelised by the MPI and accelerator card versions of the code, the problem is well suited to parallelisation because the majority of the computationally expensive code segments can be run in parallel. Table 1 also presents profiles for the OpenMP code when executed on four and eight cores. The intention is not to present a detailed profile of the model code on differing numbers of cores. Instead, the table indicates the load balancing problem caused by a mixture of wet and dry cells in the domain. When simulating the fully wet test case, each core has an equal amount of computation to do during each time step. For example, on four cores the total computation time lost (sum over all cores) to load imbalances (1.8 s) and waiting in the OMP critical section (1.6 s) was 3.4 s over the simulation, which if equally divided over the cores is 3.8% of the parallel computation time. This compares to 1642.5 s for the point source test case, which is 31.9% of the parallel computation time on four cores. The increase in time lost to imbalances on a partially wet domain occurs because flow between wet cells requires more computation than the no-flow conditions between dry cells. Thus, when the number of wet cells in each thread differs, so will the computational resources required to complete that thread. Incidentally, the serial code spent 0.6 and 129.5 s on the section of code in the critical section for the wet and point source test cases, respectively.
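To make the derivation of those percentages explicit: for the wet domain the summed losses are 1.8 + 1.6 = 3.4 s, or about 0.85 s per core, which is roughly 3.8% of the ~22 s four-core runtime in Table 1; for the point source case they are 1079.6 + 562.9 = 1642.5 s, about 410.6 s per core, or 410.6/1286.6 ≈ 31.9% of the parallel computation time.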
Table 2 shows the percentage of computational time (excluding file I/O) spent by the serial code in each of the function calls described in Fig. 1. The code spent over 96% of the computation time calculating flows and the model time step (CalcQy, CalcQx) in the fully wet test case. This was expected because these functions calculate two square roots and a power function for each wet cell in the domain. The point source test case spends less time in these functions relative to UpdateH (13.78%) due to the presence of dry cells in the domain where fluxes need not be calculated.

5.2. Parallel run-times

This section presents the parallel speedups achieved by the three methods. The HPC machine was not dedicated exclusively to running these test cases, although a queuing system did ensure cores were allocated to a single user at a time. To mitigate anomalous results caused by using a busy machine, the timing of each model run started after and finished before the file I/O, thus ignoring hard disk access. Furthermore, each test case was run three times with the shortest time retained in these results, because the occasional run was slowed down by background processes. The LISMIN-MPI code was run on 1, 2, 4, 8, 10, 20, 40, 50 and 100 Intel E5462 cores, using an InfiniBand interconnect for message passing. These numbers were chosen such that the 200 rows of the model domain could always be equally split between the cores as described in Section 3, to ensure a more controlled test. The maximum number of shared memory cores available on the AMD and Intel processors was eight; therefore the LISMIN-OMP code was run on 1, 2, 4 and 8 cores, again always ensuring an equal number of rows on each core.
Table 1. Results of code profiling for the serial job and comparison with the OpenMP profile on Intel E5462 processors. Total time is the length of the simulation (in seconds), which is divided into either sequential or parallel time. The imbalances row shows the sum of the time each core spent waiting for other cores, not the proportion of total time spent in this section. The "Critical section" row also shows the sum of time spent in this section of code by each core, which may be due to the computation required by the section or a core waiting for another to exit the critical section.

Wet domain test case:
                    Serial              4 cores             8 cores
                    Time (s)    %       Time (s)    %       Time (s)    %
Total time          80.8        –       22.2        –       14.5        –
Sequential          80.8        100     0.2         0.7     0.1         0.7
Parallel            0           0       22.0        99.2    14.4        99.2
Imbalances          0           –       1.8         –       2.6         –
Critical section    0           –       1.6         –       3.4         –

Point source test case:
                    Serial              4 cores             8 cores
                    Time (s)    %       Time (s)    %       Time (s)    %
Total time          2849.4      –       1293.6      –       1089.9      –
Sequential          2849.4      100     6.94        0.5     6.6         0.6
Parallel            0           0       1286.6      99.4    1083.3      99.4
Imbalances          0           –       1079.6      –       2283.0      –
Critical section    0           –       562.9       –       2168.7      –
Table 2. Breakdown of function call execution time (excluding file I/O).

                Serial wet    Serial point
CalcQy          47.38%        43.85%
CalcQx          49.17%        42.36%
UpdateH         3.44%         13.78%
Point source    N/A           <0.01%
The LISMIN-CS code was run on one CSX600 processor with 96 processing elements, attached to an AMD node via a PCI-X interconnect. The timing of these runs includes the data transfer to and from the card, although this was small (<0.1 s, or <1%) relative to the simulation times. Computational times for the fully wet (top) and point source (bottom) test cases are plotted in Fig. 9a and d. The computational time plots show a decrease in runtime as the number of cores increases, except for the 8 core AMD node simulations, which were as slow as or slower than the 4 core implementations on the same processor type. In terms of single core run-times, the Intel E5462 processors were 1.6 times faster than the AMD 2218 processors. This was not unexpected given that the Intel E5462 processors have
a higher clock speed and were released a year later, in quarter 4 of 2007 as opposed to quarter 3 of 2006. The speedup over the serial runtime (Eq. (2)) and the parallel efficiency (Eq. (3)) are also plotted in Fig. 9b and e and Fig. 9c and f, respectively. LISMIN-MPI was more efficient than LISMIN-OMP on an equivalent number of cores, although for the fully wet test case the differences in parallel speedups between LISMIN-MPI and -OMP were less than 10% of the computation time for runs on the Intel processors. The differences were greater for the point source test case, where the efficiency of the AMD cores was also greater than that of the Intel chips, which was not the case for the wet domain. This may have occurred because the memory transfer speed to the processors becomes more of an issue for dry cells, where little computation is required. However, to avoid a lengthy discussion on processor architecture, this paper will simply note that parallel efficiency may differ between processor models in a complex manner.
Speedup = Serial run time / Parallel run time    (2)
Fig. 9. Parallel speedups.
Efficiency = Speedup / Number of cores    (3)
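For example, applying Eqs. (2) and (3) to the profiling times in Table 1, the wet-domain OpenMP run on four Intel cores took 22.2 s against 80.8 s in serial, giving a speedup of 80.8/22.2 ≈ 3.6 and a parallel efficiency of 3.6/4 ≈ 0.91 (91%).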
Parallel efficiency tended to decrease as the number of cores increased, with the drop off in efficiency more pronounced for the point source test case. The point source case efficiency may be limited by the transfer of data between memory and the processor for the dry regions where computation is negligible. However, the profiler results in Table 1 indicate that the most significant factor affecting parallel efficiency for the point source test case was load imbalance, caused by cores having different numbers of wet cells. This is the main reason why the LISMIN-MPI parallel efficiency falls below 50% on more than 50 cores. For the wet test case, where load imbalances were minimal, LISMIN-MPI was still over 90% and 70% efficient on 50 and 100 cores, respectively. All the parallelisation approaches used here decompose the model domain such that an equal number of cells are processed by each core. The results of this study suggest that a domain decomposition that dynamically balances the computational load on each core, by considering the number of wet cells in a domain, would improve parallel efficiency when dealing with partially wet domains.

The computation times for the LISMIN-CS and LISMIN codes are shown in Table 3. The LISMIN-CS code was 1.3 times faster than LISMIN on a single core of an Intel E5462 processor for the fully wet domain, but 5.3 times slower for the point source domain. The reason for the slow performance of the point source test case is that the single instruction stream used by the CSX600 processor means that all processing elements must implement the same numerical scheme. In other words, if one processing element has a wet cell, all processing elements will go at the speed of the wet processing element, despite the dry cells requiring only negligible computational resource. Table 3 also details the power consumption of the processors used. The CSX600 processor is designed to be power efficient and this was indeed the case here for the wet test case, given manufacturer guidelines. LISMIN-CS required 570 J of energy to complete the computation, compared with 1520 and 6033 J for LISMIN on the Intel and AMD processors, respectively. The calculated energy use considers only the energy used by the processors. Other components within a computer node and any cooling requirements will substantially increase the actual energy usage compared with the values shown here. These other considerations are likely to increase the power efficiency of the accelerator cards relative to the CPUs, as the entire card with two CSX600 processors uses only 25 W. Of course, this assumes that the host CPU node is doing some other useful work during the computation or that many accelerator cards are connected to a single dedicated host. This comparison of LISMIN-CS and LISMIN is most relevant to Monte Carlo type applications of the code, where many model simulations can be run by distributing them amongst the available cores, rather than running individual jobs faster through parallelisation.

For the wet domain, the LISMIN-CS model was both more power efficient and faster than the LISMIN model running on newer CPU hardware. However, the combination of the single instruction stream and the data decomposition approach used by LISMIN-CS was very inefficient compared to LISMIN for the partially wet domain test case. Serial inundation model codes like LISMIN tend to scale with the number of wet cells (Sanders, 2008), as do LISMIN-OMP and -MPI when the processor loads can be balanced (e.g. if there are the same number of wet cells in each thread/sub-domain). LISMIN-CS scales with the number of cells if any processing element encounters a wet cell, which is a problem in the point source test case because there are many more dry cells than wet cells. Codes that decompose only the wet cells of the model domain may reduce this problem, but if not addressed it will lead to an inefficient use of SIMD-type resources, as demonstrated here. Nevertheless, the fact that SIMD-type coprocessors appear to deliver benefits for fully wet domains means they are well suited to simulating surface water flooding based on rainfall inputs, where the whole domain is typically inundated. This has been seen in practice in the application of JFLOW-GPU to produce "initial maps of areas susceptible to surface water flooding" based on rainfall inputs (Environment Agency, 2009; Lamb et al., 2009). Although this functionality was not included in LISMIN, the structure of all three parallel LISMIN codes is such that it would be possible to add volume due to rainfall at the beginning of each time step with relative ease.

6. Discussion

The analysis of these parallelisation options would not be complete without a discussion of the time required to code the models, the ease with which additional functionality can be added and any specialist software and hardware requirements. Table 4 attempts to summarise the time taken for each parallelisation method to be implemented. As these estimates are clearly unique to the person coding these models and to the LISMIN model code, the following discussion is derived from this experience and no other. In this case, the code was developed by someone from a geo-sciences rather than a computer science background. The table shows that none of the methods were particularly difficult to implement relative to coding a model such as LISMIN in the first place. Development time was shortest for LISMIN-OMP and longest for LISMIN-CS, simply because the latter required more changes to the code. OpenMP is currently used by the standard version of LISFLOOD-FP (Neal et al., 2009) because it was possible to parallelise the computationally expensive function calls in the code without having to alter the majority of the functions. The use of LISMIN in this study was necessary so that the code could be ported to the accelerator card; this will also have significantly reduced the development time required for LISMIN-MPI. LISFLOOD-FP would be more difficult to parallelise with MPI than LISMIN because the one-dimensional channel model, which is likely to span multiple sub-domains, would also have to be considered. Thus, in our experience, OpenMP is an easy-to-implement parallelisation method that is likely to be worth trying ahead of the other, potentially more efficient and scalable, methods presented here, particularly when parallelising an existing code.

The introduction to this paper presented results from Lamb et al. (2009) using JFLOW-GPU for a slightly different setup of the point source test case used here. Their paper focuses on the performance and practicalities of a GPU-based parallel implementation of JFLOW within an industrial setting and reports a simulation time of 9.5 min for the 2 h hydrograph.
Table 3. Processor power efficiency. Source: AMD (2009) and ClearSpeed (2007).

                Intel Harpertown             AMD Opteron                  ClearSpeed CSX600
                Seconds   Watts   Joules     Seconds   Watts   Joules     Seconds   Watts   Joules
Wet             76        20      1520       127       47.5    6033       57        10      570
Point source    3094      20      61 k       4738      47.5    225 k      16,543    10      165 k
Table 4. Summary of implementation issues for each parallelisation method.

             Development time (a)   Specialised hardware   Specialised software                                   Difficulty
LISMIN-OMP   1–2 days               None                   OpenMP-capable compiler                                Easy
LISMIN-MPI   1–2 weeks              Fast interconnect      Must be compiled for the interconnect hardware and     Easy–Moderate
                                                           requires specialised MPI library files
LISMIN-CS    1–2 months             Accelerator cards      ClearSpeed SDK                                         Moderate

(a) These times are approximate and based on no prior knowledge of the parallelisation techniques used, although the MPI version benefited from being the last to be developed and from a training course.
A direct comparison with the computation time and water depths reported here should be avoided due to the differences between the test case setups, especially as the higher roughness used here will allow for longer and fewer model time steps. Nevertheless, on a qualitative level, the GPU-based flood inundation modelling was quick and the results here support the conclusion of Lamb et al. that commodity graphics hardware can lead to significant performance gains.
7. Conclusions

This paper has demonstrated the application of three parallelisation methods to a cut-down version of the LISFLOOD-FP storage cell hydraulic model. A key constraint on the model development was that each code had to give effectively the same answer for the two test cases as the serial version. Although many factors influence the performance of a parallel code, the following conclusions can be drawn from the results obtained in this paper:

- Two-dimensional explicit storage cell flood inundation models can be parallelised with the three methods presented here with relative ease.
- Parallelisation through message passing was slightly more efficient than OpenMP on the processors and interconnects used. OpenMP was by far the easiest method to implement and provided a similar level of speedup to the message passing code on up to four cores.
- The message passing code allowed a greater number of cores to be utilised, due to the limit on the number of shared memory cores currently available. In fact, the parallelisation with message passing was still over 70% efficient on 100 cores for a fully wet domain with just 200 rows (i.e. two rows per core).
- OpenMP is used by the LISFLOOD-FP model because computationally expensive sections of the code can be parallelised without considering the rest of the code.
- The accelerator card code was faster and more power efficient than the standard code on a modern x86 core, given a test case that could be distributed evenly over the processing elements. SIMD-type coprocessors deliver efficiency gains for fully wet domains, meaning they are well suited to surface water flooding problems based on rainfall inputs. However, for partially wet domains load balancing is a critical factor for consideration and can lead to a very inefficient use of the hardware.
- The accelerator code took the longest to develop. However, the extensions to the C language required to code the cards were not conceptually difficult and should not present a problem to an experienced modeller.

Acknowledgements

Jeffrey Neal was funded by the Flood Risk Management Research Consortium, which is supported by grant number EP/F20511/1 from the EPSRC and the DEFRA/EA joint Research Program on Flood and Coastal Defence. Timothy Fewtrell was funded by Willis Re as part of the Willis Research Network (http://www.willisresearchnetwork.com). This work was carried out using the computational facilities of the Advanced Computing Research Centre, Bristol University. Finally, the authors would like to thank the reviewers for their constructive comments.

References

AMD, 2009. AMD Opteron processor solutions. Available online at: http://products.amd.com/en-us/OpteronCPUDetail.aspx?id=317.
Bates, P.D., De Roo, A.P.J., 2000. A simple raster-based model for flood inundation simulation. Journal of Hydrology 236 (1–2), 54–77.
Bradbrook, K., Lane, S.N., Waller, S., Bates, P.D., 2004. Two dimensional diffusion wave modelling of flood inundation using a simplified channel representation. International Journal of River Basin Management 2 (3), 1–13.
Chapman, B., Jost, G., Pas, R., 2008. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge, Massachusetts. ISBN-13: 978-0-262-53302-7.
ClearSpeed, 2007. CSX Processor Architecture. ClearSpeed Technology, Bristol, UK. Available online at: http://developer.clearspeed.com/resources/library/.
ClearSpeed, 2008a. Software Development Kit: Introductory Programming Manual. ClearSpeed Technology, Bristol, UK. Available online at: http://developer.clearspeed.com/resources/library/.
ClearSpeed, 2008b. User Guide: The CSPX Accelerator Interface Library. ClearSpeed Technology, Bristol, UK. Available online at: http://developer.clearspeed.com/resources/library/.
ClearSpeed, 2008c. User Guide: Cn Standard Reference Library. ClearSpeed Technology, Bristol, UK. Available online at: http://developer.clearspeed.com/resources/library/.
Environment Agency, 2009. Flooding in England: a National Assessment of Flood Risk. Report GEHO0609BQDS-E-P. Environment Agency, Bristol.
Fewtrell, T.J., Bates, P.D., Horritt, M.S., Hunter, M.N., 2008. Evaluating the effect of scale in flood inundation modelling in urban environments. Hydrological Processes 22, 5107–5118.
Göddeke, D., Strzodka, R., Mohd-Yusof, J., McCormick, P., Buijssen, S.H.M., Grajewski, M., Turek, S., 2007a. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33, 685–699.
Göddeke, D., Strzodka, R., Turek, S., 2007b. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems 22 (4), 221–256.
Hervouet, J., 2000. A high resolution 2-D dam-break model using parallelization. Hydrological Processes 14 (13), 2211–2230.
Hunter, N.M., Bates, P.D., Néelz, S., Pender, G., Villanueva, I., Wright, N.G., Liang, D., Falconer, R.A., Lin, B., Waller, S., Crossley, A.J., Mason, D.C., 2008. Benchmarking 2D hydraulic models for urban flooding. Proceedings of the Institution of Civil Engineers – Water Management 161 (1), 13–30.
Hunter, N.M., Horritt, M.S., Bates, P.D., Wilson, M.D., Werner, M.G.F., 2005. An adaptive time step solution for raster-based storage cell modelling of floodplain inundation. Advances in Water Resources 28 (9), 975–991.
Lamb, R., Crossley, A., Waller, S., 2009. A fast 2D floodplain inundation model. Water Management (Proceedings of the Institution of Civil Engineers) 162.
Message Passing Interface Forum, 2008. MPI: a message-passing interface standard, version 2.1. Available online at: www.mpi-forum.org.
Neal, J.C., Fewtrell, T.J., Trigg, M.A., 2009. Parallelisation of storage cell flood models using OpenMP. Environmental Modelling & Software 24, 872–877. doi:10.1016/j.envsoft.2008.12.004.
NVIDIA, 2009. Board Specification: Tesla C1060 Computing Processor Board. NVIDIA Corporation, Santa Clara, USA. Available online at: http://www.nvidia.com/docs/IO/43395/BD-04111-001_v05.pdf.
Pau, J.C., Sanders, B.F., 2006. Performance of parallel implementations of an explicit finite-volume shallow-water model. Journal of Computing in Civil Engineering 20 (2), 99–110.
Rao, P., 2005. A parallel RMA2 model for simulating large-scale free surface flows. Environmental Modelling & Software 20 (1), 47–53.
Sanders, B.F., 2008. Integration of a shallow water model with a local time step. Journal of Hydraulic Research 46 (4), 466–475.
Schubert, J.E., Sanders, B.F., Smith, M.J., Wright, N.G., 2008. Unstructured mesh generation and landcover-based resistance for hydrodynamic modeling of urban flooding. Advances in Water Resources 31 (12), 1603–1621.
SPEC.org, 2009. SPEC's benchmarks and published results. Available online at: http://www.spec.org/benchmarks.html.
Trigg, M.A., Wilson, M.D., Bates, P.D., Horritt, M.S., Alsdorf, D.E., Forsberg, B.R., Vega, M.C., 2009. Amazon flood wave hydraulics. Journal of Hydrology. doi:10.1016/j.jhydrol.2009.06.004.
Villanueva, I., Wright, N.G., 2006a. Linking Riemann and storage cell models for flood prediction. ICE Journal of Water Management 159, 27–33.
Villanueva, I., Wright, N.G., 2006b. An Efficient Multi-processor Solver for the 2D Shallow Water Equations. Hydroinformatics 2006, Nice, France.