Environmental Modelling & Software 43 (2013) 124–132
Parallelization of a hydrological model using the message passing interface

Yiping Wu a,b,*, Tiejian Li c,**, Liqun Sun b, Ji Chen b

a ASRC Research and Technology Solutions, U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center, Sioux Falls, SD 57198, USA
b Department of Civil Engineering, The University of Hong Kong, Pokfulam, Hong Kong, China
c State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China

* Corresponding author. ASRC Research and Technology Solutions, USGS EROS Center, Sioux Falls, SD 57198, USA. ** Corresponding author.
E-mail addresses: [email protected] (Y. Wu), [email protected] (T. Li).

Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.envsoft.2013.02.002

Article history: Received 29 October 2012; received in revised form 29 January 2013; accepted 10 February 2013; available online 15 March 2013.

Keywords: Hydrological model; Message passing; Parallelization; SWAT

Abstract

With increasing knowledge about natural processes, hydrological models such as the Soil and Water Assessment Tool (SWAT) are becoming larger and more complex, with growing computation times. Additionally, procedures such as model calibration, which may require thousands of model iterations, further lengthen running time and thus hinder rapid modeling and analysis. Using the widely applied SWAT as an example, this study demonstrates how to parallelize a serial hydrological model in a Windows environment using a parallel programming technology, the Message Passing Interface (MPI). With a case study, we derived the optimal values for the two parameters of the parallel SWAT (P-SWAT), namely the number of processes and the corresponding percentage of work to be distributed to the master process, on an ordinary personal computer and on a work station. Our study indicates that model execution time can be reduced by 42%–70% (a speedup of 1.74–3.36) using multiple processes (two to five) with a proper task-distribution scheme between the master and slave processes. Although the computation time decreases as the number of processes increases (from two to five), the gain diminishes because of the accompanying increase in message passing between the master and all slave processes. Our case study demonstrates that a five-process run of the P-SWAT may reach the maximum speedup, and that its performance is quite stable (fairly independent of project size). Overall, the P-SWAT can substantially reduce the computation time of an individual model run, of manual and automatic calibration procedures, and of the optimization of best management practices. In particular, the parallelization method we used and the scheme for deriving the optimal parameters can be valuable for, and easily applied to, other hydrological or environmental models. Published by Elsevier Ltd.

Software availability

Name of software: P-SWAT
Description: The watershed model SWAT is parallelized using a parallel programming technology (MPI) to enhance execution efficiency on the Microsoft Windows platform
Developers: Y. Wu and T. Li
Source language: Fortran
Software availability: Contact the developers

1. Introduction

Hydrological and environmental models are useful tools for mimicking real-world problems when investigating, planning, designing, and managing natural and anthropogenic systems (Borah and Bera, 2002; Howarth et al., 1996; Liu et al., 2008; Wang et al., 2007; Wu and Xu, 2006). These numerical models tend to be complex, requiring many input data and parameters to represent real situations (Sharma et al., 2006). Moreover, with deepening understanding of the causal mechanisms to be simulated and increasing demands on spatiotemporal resolution, such models become more sophisticated and thus require more computing resources (Beck, 1999; Brun et al., 2001). The rapid development of the watershed hydrological/water quality model Soil and Water Assessment Tool (SWAT) (Arnold et al., 1998; Neitsch et al., 2005) is such an example (Douglas-Mankin et al., 2010; Gassman et al., 2007). First, the model is likely to keep expanding, given its open-source character and wide application around the globe (Chen and Wu, 2012; Li et al., 2010; Panagopoulos et al., 2011, 2012; Srinivasan et al., 2010; Tuppad et al., 2010; Wu and Liu, 2012; Wu et al., 2012; Zhang et al., 2009, 2010). Second, in model application, more detailed delineation of a watershed (spatial resolution) and a finer time step (daily or hourly) can substantially raise the computation time even for the same study area. Although the speed and capacity of computers have increased multi-fold in the past several decades, the time consumed by hydrological model runs (especially for complex, physically based, distributed models) is still a concern for hydrologic practitioners (Zhang et al., 2009).

On the other hand, these mathematical models, even the physically based ones, often contain parameters that cannot be measured directly because of measurement limitations and scaling issues (Beven, 2001). Thus, model inversion is required to derive the optimal parameters that bring simulations into agreement with observations (Ng et al., 2010; Tolson and Shoemaker, 2007; Zhang et al., 2010). Although a variety of global optimization algorithms can serve this purpose, they usually need a large number of iterations of the model itself before finding optimal solutions (Gupta et al., 1998). Further, the popularity of physically based models can lead to substantial increases in the complexity of model calibration and uncertainty analysis (Gupta et al., 1998). In addition, decision making for optimal or sustainable river basin management can be enhanced by optimizing the selection and placement of best management practices (BMP) across the landscape, and SWAT has been selected as the modeling tool for deriving optimal BMP (Maringanti et al., 2009; Panagopoulos et al., 2011, 2012). This search procedure, however, also requires thousands of model iterations and may be time-costly for large hydrologic systems. Consequently, the sophistication of the model itself, the calibration algorithm, and the optimization of BMP to support decision making may require substantial computing resources and running time.

In numerical simulation, parallel processing or parallel computing generally refers to the simultaneous execution of multiple operations or tasks on a number of central processing unit (CPU) cores in order to make a program run faster (Li et al., 2011; Rouholahnejad et al., 2012). To parallelize an existing hydrological model with an optimization procedure, either the model structure or the optimization algorithm can be parallelized to reduce execution time. The former is often difficult, as it requires deep familiarity with the model itself to modify it in such a way that different portions can be executed on separate CPU cores with dynamic inter-communication (Li et al., 2011; Rouholahnejad et al., 2012). For the widely applied SWAT model, a few studies have focused on the latter to reduce model calibration time (Denisa et al., 2012; Gorgan et al., 2012; Rouholahnejad et al., 2012; Yalew et al., 2013). Although Yalew et al.
(2013) accomplished distributed computation of SWAT via grid computing (splitting the model input files, submitting the sub-models to the Grid, conducting distributed computation on connected computers, and merging output from the sub-models), their product requires Grid infrastructure and the Internet for file transfer between the Grid and a local computer, which may also cost substantial time for large projects. Parallelizing the structure of a model by modifying its source code can be a universal way to reduce computation time for an individual model run, automatic calibration, and manual calibration; manual calibration is still a commonly used approach (Gupta et al., 1999) for incorporating expert knowledge of a specific area and thus cannot be totally replaced by automatic techniques. Parallelization of the model itself would therefore be of importance in the labor-intensive manual trial-and-error procedure. Additionally, a parallelized SWAT can also help accelerate the optimization of selection and placement of BMP for supporting decision making (as described previously) (Maringanti et al., 2009; Panagopoulos et al., 2011, 2012), which cannot benefit from parallelizing the calibration algorithm alone.

With the development of computer technology, especially parallel computing techniques, computation capacity has advanced rapidly (Jordi and Wang, 2012; Li et al., 2011; Wang and Shen, 2012; Zhao et al., 2013). Parallel programming standards, such as the Message Passing Interface (MPI) (MPI Forum, 2008) and the Open Multi-Processing (OpenMP) application program interface (Chapman et al., 2007), have been proposed and widely applied to support the parallelization of numerical models. These standards make it possible to parallelize an existing program by adding functions, compiler directives, and inter-process communication to the original serial code (Li et al., 2011). In contrast to grid computing, which utilizes a group of loosely coupled computers (nodes) to execute computing jobs (Chen et al., 2009; Schwiegelshohn et al., 2010), MPI uses inter-process communication and thus supports a tightly bundled model. Because almost all current personal computers have a multi-core (generally 2 to 20 logical cores) CPU, executing a serial model tends to waste some computing resource even on a single computer. Therefore, both the availability of parallel programming tools and the popularity of multi-core CPUs encourage the parallelization of existing numerical models.

The objective of our study is to address the high computation time of hydrological models through parallelization of their existing model structures using the MPI parallel programming library. With the representative hydrological model SWAT as an example, we demonstrate how to parallelize a hydrological model to run on a Windows platform with multiple CPU cores. A parallelized hydrological model can not only enhance the computing efficiency of an individual model run and of manual/automatic calibrations, but can also accelerate any environmental wrapper that uses the model as a sub-tool, such as the SWAT-BMP optimization framework (Maringanti et al., 2009; Panagopoulos et al., 2011, 2012).

2. Method

2.1. SWAT structure

The SWAT model (Arnold et al., 1998; Neitsch et al., 2005) was developed by the U.S. Department of Agriculture (USDA) Agricultural Research Service (ARS) for exploring the effects of climate and land management practices on water, sediment, and agricultural chemical yields (Douglas-Mankin et al., 2010; Gassman et al., 2007). The Hydrological Response Unit (HRU) is the basic simulation unit, defined as a lumped land area composed of a unique land cover, soil type, and slope (Neitsch et al., 2005); HRUs are typically delineated with the ArcSWAT interface (Winchell et al., 2009). Although SWAT is computationally efficient (Arnold et al., 1998), a single run may take seconds to hours depending on the project size (i.e., the number of HRUs).

To parallelize a numerical model, the first step is to identify the parts (modules) of the model that can be executed independently, without being influenced by other parts. For hydrological models, the land phase hydrological cycle is usually simulated for each subbasin before water is routed along channels.
Therefore, these parts can be divided into sub-parts, and each sub-part can be executed on a different CPU core simultaneously by means of a parallel programming technology. Scrutiny of the SWAT code shows that all the hydrological processes (land phase and routing phase) are simulated at a daily time step, nested within the annual loop, as shown in the left panel of Fig. 1. Within the daily loop, the subbasin (or HRU) level hydrological simulation (i.e., the land phase cycle) is executed, followed by the simulation of channel and pond/reservoir routing processes (i.e., the water or routing phase cycle) (see Fig. 1). The general sequence of the land phase hydrological cycle at the subbasin/HRU level is shown in the right panel of Fig. 1. The computation of the land phase cycle is the core part of SWAT and usually costs more computing resources than the routing cycle. In the model's source code, the land phase hydrological cycle is executed serially (i.e., one subbasin is finished before the next begins) until all subbasins are finished, and only then does the routing cycle start. However, there is no interaction between any two individual subbasins, because they are spatially discrete, although the coding is serial (from the first subbasin to the last). The routing cycle simulation then follows the upstream-to-downstream relationship described by the configuration file to route water and solutes stored at the hydrological nodes (one node per subbasin). That is, the routing cycle is executed serially in the sequence that reflects the actual river network, and this sequence cannot be reversed.

[Fig. 1. The basic structure of SWAT. The left panel illustrates how the model handles the time frame:
    do year = begin year, end year        (annual loop)
      do day = 1, 365                     (daily loop)
        subbasin command -> route command -> reservoir command
      end do
    end do
The right panel demonstrates the landscape hydrological simulation for each subbasin:
    do subbasin = 1, number of subbasins  (subbasin loop)
      read or generate precipitation, temperature, solar radiation, wind speed, relative humidity
      compute soil temperature
      if rainfall + snowmelt > 0, compute surface runoff and infiltration
      if surface runoff > 0, compute peak rate, transmission loss, sediment, nutrient and pesticide yields
      compute soil water, ET, crop growth, pond/pothole/wetland, groundwater flow
    end do]

From Fig. 1 and the above description, only when all land phase (subbasin) processes are finished can the serial routing (channel) processes be executed. Therefore, we proposed to parallelize only the land phase (subbasin/HRU level) hydrological simulation and to keep the serial routing part unchanged. Thus, we needed to modify the SWAT code to start subbasin computing processes on different CPU cores and then converge their results to the master process before the channel routing procedure begins.

2.2. MPI parallel library

The Message Passing Interface (MPI) is a specification of the user interface to message-passing libraries for parallel computers (Muttil et al., 2007). MPI can be used to write programs that execute efficiently on a wide variety of parallel machines, including massively parallel supercomputers, shared-memory multi-processors, and networks of workstations. MPI coordinates a program running as multiple processes in a distributed-memory environment, yet it is flexible enough to be used in a shared-memory system. MPI programs always work with processes, although these are commonly viewed as processors; for maximum performance, one process per CPU core is selected as part of the mapping activity, which occurs at runtime through the agent that starts the MPI program, normally called 'mpiexec'. The standardization of the MPI library is one of its most powerful features: a parallel programmer can write code containing MPI subroutine and function calls that will work on 'any' machine on which the MPI library is installed, without having to change the code (Muttil et al., 2007). Complete details of the MPI are provided in Gropp et al. (1994) and Pacheco (1997).

MPICH (Balaji et al., 2011) is a public-domain, portable implementation of the MPI standard that was originally developed for distributed-memory applications in parallel computing and was released by Argonne National Laboratory (ANL) and Mississippi State University (MSU) in parallel with the standardization effort (Luecke et al., 2003; Muttil et al., 2007). According to Muttil et al. (2007), MPICH is freely available for Microsoft Windows and for most UNIX-like operating systems (including Linux and Mac OS X). Du (2008) introduced the detailed design and application of MPICH in the Microsoft Windows environment on the Fortran and C++ platforms. More information, including tutorials, can be found on the MPICH website at http://www.mcs.anl.gov/research/projects/mpich2/.
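To make the MPI process model concrete, consider the following minimal Fortran program (our illustration, not part of the SWAT or P-SWAT source). Launched with, for example, 'mpiexec -n 4 mpi_hello.exe', it starts four processes, each of which reports its rank; rank 0 plays the master role in the design described in Section 2.3.

    ! Minimal MPI example in Fortran: each process reports its rank.
    ! Illustrative only; not part of the P-SWAT source.
    program mpi_hello
      implicit none
      include 'mpif.h'
      integer :: ierr, myid, numprocs
      call MPI_INIT(ierr)                                 ! start the MPI runtime
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)      ! rank of this process (0..p-1)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)  ! total number of processes (p)
      print '(A,I0,A,I0)', 'Process ', myid, ' of ', numprocs
      call MPI_FINALIZE(ierr)
    end program mpi_hello

The same code runs unchanged under MPICH on Windows or on UNIX-like systems, which is the portability property noted above.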
In this study, MPICH2 (version 1.4.1) and the Intel Fortran Compiler (version 11.1) with Microsoft Visual Studio 2010 were used to develop the parallel SWAT (P-SWAT) in the Microsoft Windows environment. For Windows users, especially those who use Microsoft Windows High Performance Computing (HPC) Server 2008, MS-MPI can be used for the P-SWAT with only minor changes to the compiler configuration and the execution environment.

2.3. SWAT parallelization

Because the land phase hydrological simulations can be allocated, subbasin by subbasin, to more than one program process, as described in Section 2.1, we used MPICH to parallelize SWAT, distributing the computing loads and converging the results through message transfer and coordination between the master and slave processes. We used Microsoft Windows on a personal computer (PC) as the development platform, and the major steps are presented below (see Section A of the Appendix for the major source code; a minimal sketch of steps 2 and 3 is given at the end of this subsection).

1) Load the MPI libraries into the model's main program. Details about how to use MPICH with a customized program can be found on its official website (Balaji et al., 2011).

2) Break the subbasin loop (see Fig. 1) to divide the subbasins into sub-groups. All subbasins can be divided into multiple groups depending on the number of processes (p) that will be launched. Generally, each process occupies one CPU core, so each core is responsible for only a single group of subbasins during model execution.

3) Converge the results from all processes. Each slave process must send its results once its work is finished, and the master process must wait for and receive them from all slave processes. This is accomplished through the message transfer functions (i.e., the send and receive functions) of the MPI.

The operational flowchart of the parallelized SWAT is shown in Fig. 2, which illustrates how the model works with and without auto-calibration. The left panel of the figure shows the procedure executed on the master process, while the right panel is for a slave process, with the hidden layers representing multiple slave processes. For ease of description, we take a single slave process as an example. Let the total number of subbasins delineated in the SWAT setup be n: the computation for the first m subbasins (from 1 to m) is allocated to the master process, while the remaining n − m subbasin simulations (from m + 1 to n) are allocated to the slave process. Thus, two different groups of tasks are assigned to the two processes. Once the subbasin simulations on both processes have finished, the slave process sends the simulation results for subbasins m + 1 to n (to be used for water and solute routing along channels) to the master process, which is set to receive them. This procedure, denoted as MT1 in Fig. 2, is accomplished with the message transfer functions of the MPI. After the master process receives the simulation results sent by the slave process, the routing processes are executed on the master process only, until the end. This is the basic operation of the parallelized SWAT for a normal model run.
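The following Fortran sketch illustrates steps 2 and 3 for a normal run. It is our simplified illustration under stated assumptions, not the actual P-SWAT source: the names my_range, simulate_subbasin, and sub_results are hypothetical, the daily and annual loops of Fig. 1 are omitted, and one real array per subbasin stands in for the full set of land-phase outputs.

    ! Sketch of the subbasin-loop decomposition (steps 2 and 3).
    ! Illustrative names only; the real P-SWAT works inside the daily loop.
    program decompose_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 231, m = 57, nvars = 10   ! assumed sizes
      integer :: ierr, myid, numprocs, isub, islave, i1, i2, j1, j2
      integer :: status(MPI_STATUS_SIZE)
      real :: sub_results(nvars, n)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      call my_range(myid, numprocs, i1, i2)               ! step 2: split the loop
      do isub = i1, i2
        call simulate_subbasin(isub, sub_results(:, isub))! land phase only
      end do

      if (myid == 0) then
        ! Step 3 (MT1): the master receives each slave's block of results.
        do islave = 1, numprocs - 1
          call my_range(islave, numprocs, j1, j2)
          call MPI_RECV(sub_results(1, j1), nvars*(j2 - j1 + 1), MPI_REAL, &
                        islave, 1, MPI_COMM_WORLD, status, ierr)
        end do
        ! ... channel routing would run here, on the master only ...
      else
        ! Each slave sends its block of results to the master (rank 0).
        call MPI_SEND(sub_results(1, i1), nvars*(i2 - i1 + 1), MPI_REAL, &
                      0, 1, MPI_COMM_WORLD, ierr)
      end if
      call MPI_FINALIZE(ierr)

    contains

      subroutine my_range(rank, nproc, lo, hi)
        ! Master takes subbasins 1..m; the other n - m are split as
        ! evenly as possible among the nproc - 1 slaves (Section 2.4).
        integer, intent(in)  :: rank, nproc
        integer, intent(out) :: lo, hi
        integer :: per, extra
        if (rank == 0) then
          lo = 1; hi = m
        else
          per   = (n - m) / (nproc - 1)
          extra = mod(n - m, nproc - 1)
          lo = m + 1 + (rank - 1)*per + min(rank - 1, extra)
          hi = lo + per - 1
          if (rank <= extra) hi = hi + 1
        end if
      end subroutine my_range

      subroutine simulate_subbasin(isub, out)
        integer, intent(in) :: isub
        real, intent(out)   :: out(:)
        out = real(isub)   ! dummy stand-in for SWAT's land-phase results
      end subroutine simulate_subbasin

    end program decompose_sketch

In the actual P-SWAT, this distribution is implemented in the source file command.f (see Appendix A), and the send/receive pair corresponds to the MT1 transfer in Fig. 2.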

[Fig. 2. Operational flowchart of the parallelized SWAT with and without auto-calibration. Master process (left panel): read input -> parameter initialization -> calculate subbasins 1 to m -> receive subbasins m+1 to n (MT1) -> channel routing -> results output -> repeat until the last day of simulation; under auto-calibration, test the stopping criteria, change the parameters, and send the new parameters to the slaves (MT2). Slave process (right panel, possibly multiple): read input -> parameter initialization -> calculate subbasins m+1 to n -> send the results to the master (MT1); under auto-calibration, receive the new parameters (MT2), change the parameters, and repeat.]

If the auto-calibration procedure is activated, the objective function is calculated for each model run, followed by a search for a new parameter set with the optimization algorithm. Because the new parameter set is derived on the master process only, its values must be passed to the slave process before a new round of the subbasin loop begins (see Figs. 1 and 2); otherwise, the slave process would retain only the initial parameter values for each model run. This transfer of parameters from the master process to the slave, denoted as MT2 in Fig. 2, also relies on the message transfer functions of the MPI.

2.4. Parameters of the P-SWAT

The above description of the P-SWAT operation, with and without auto-calibration, is based on two processes. However, the developed P-SWAT, as shown in Fig. 2, can use multiple slave processes, in which case the message transfer procedures (MT1 and MT2) are executed between the master process and all slave processes. Users decide how many processes may take part in the model execution by setting the parameter p in the following command:

mpiexec -n p P-SWAT.exe

When the P-SWAT model runs on a cluster of computers in a local network, a configuration file listing the computer names and the number of processes on each computer should be given to MPICH. When the computer cluster runs Microsoft Windows HPC Server 2008, the Job Manager should be used.

As Fig. 2 shows, the master process performs both the subbasin and routing calculations, whereas the slave processes perform the subbasin calculation only. Obviously, the master process should take a smaller subbasin computing load than each slave process. What, then, is the optimal task-distribution scheme? We set a rule for the task distribution between the master and slave processes: the first m subbasins are assigned to the master process, whereas the other n − m subbasins are divided among the slave processes as equally as possible. Users can thus decide how many tasks (i.e., how many subbasins, m) are taken by the master process. To facilitate understanding and usage, we use the percentage f of the number of subbasins to designate the master's share: m = n × f. In brief, users need to set values for the two parameters (p and f) of the P-SWAT; the recommended values from our case study are given in Section 4.
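As a worked example of m = n × f, the following sketch derives the master's share for Project 1 (n = 231, described in Section 3.1) at f = 25%. Reading f from the Log.in file mirrors Appendix B; the integer truncation is our assumption, chosen to reproduce the approximately 57 subbasins quoted in Section 4.

    ! Sketch: derive the master's subbasin count m from the fraction f.
    ! Reading f from Log.in follows Appendix B; int() truncation is assumed.
    program master_share
      implicit none
      real    :: f
      integer :: n, m
      n = 231                       ! subbasins in Project 1 (Section 3.1)
      open(10, file='Log.in', status='old')
      read(10, *) f                 ! e.g., 0.25
      close(10)
      m = int(n * f)                ! 231 * 0.25 -> 57 subbasins for the master
      print '(A,I0)', 'Master process takes subbasins 1..', m
    end program master_share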

3. Case study

3.1. Study area and model setup

To examine our developed P-SWAT, we used a previously well-established project for the East River Basin in South China (Chen and Wu, 2012; Wu and Chen, 2012a,b, 2013). A detailed description of the study area and the model input data is given by Chen and Wu (2012), where the basin was delineated into 39 subbasins. In the current study, however, we used two more detailed subbasin-delineation schemes to generate more subbasins: a relatively smaller project with 231 subbasins (denoted as Project 1) and a relatively larger project with 1090 subbasins (denoted as Project 2). The option of a single HRU per subbasin was selected to facilitate the illustration of the task distribution between the master and slave processes when using the P-SWAT (see Section 2.4). Other model input data, such as weather, the Digital Elevation Model (DEM), land use, and soil attributes, remain unchanged.

3.2. Computer configuration

Model running speed also depends on the computer configuration. In this study, we used two different computers, one ordinary personal computer (PC) and one work station (WS), to evaluate the running time of the P-SWAT. These two categories of computers are common in the market, and their configurations are described in Table 1. As the table shows, two and eight logical cores are available on the PC and the WS, respectively. The logical cores, as opposed to physical ones, are provided by the Intel Hyper-Threading (HT) technology. To avoid confusion with a program thread, which is the smallest unit in executing a program process, the term logical core is used here instead of CPU thread. Although the cores are logical, and performance may be affected when the pair of logical cores on the same physical core is used simultaneously, the HT feature is generally enabled, and a user does not need to distinguish logical from physical cores at levels above the operating system.

Table 1. Computer settings for running the parallel SWAT.

Item              Personal computer (PC)                   Work station (WS)
CPU               One Intel(R) Core 2 Duo E6600 @ 2.4 GHz  Intel(R) Xeon(R) QUAD X5560 @ 2.8 GHz
Physical cores    2                                        4
Logical cores     2                                        8
RAM               3.25 GB                                  4.0 GB
System type       32-bit operating system                  64-bit operating system
Windows edition   Windows XP Professional                  Windows 7 Enterprise

4. Results and discussion

To demonstrate and evaluate the execution of the P-SWAT, we employed the two East River Basin projects (see Section 3.1) and the two computers (see Section 3.2). As described in Section 2.4, two parameters must be set when using the P-SWAT: the number of processes (p) and the percentage of work for the master process (f). A different number of cores and a different task distribution can result in different execution times; the larger f is, the more work is taken by the master process. Taking the two-core PC as an example (p = 2), as shown in Fig. 3a, we used a series of f values (from 5% to 45%) to assess the effect of this parameter on the model execution time. Fig. 3a indicates that 25% (i.e., 231 × 25% ≈ 57 subbasins for Project 1) is the optimal value, because it balances the master and slave processes; either larger or smaller values shift excess computing time onto the master or the slave processes, respectively. As Fig. 3a shows, the P-SWAT with p = 2 and f = 25% took much less time (56.9 seconds (s)) than the original serial SWAT (99.1 s), a notable reduction of 42.6% in the model running time, or a speedup of 1.74 (see Table 2). For Project 2, the running time is reduced by up to 45.1% (see Fig. 4a), with a speedup of 1.82 (Table 2) at f = 20%; this performance is quite close to that for Project 1. Therefore, the performance of the P-SWAT on the PC is fairly independent of the project size.
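For reference, the time reductions (r) and speedups (S) quoted here and in Table 2 are related by the standard definitions, illustrated with the Project 1 PC numbers above:

    \[
      S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} = \frac{99.1\,\mathrm{s}}{56.9\,\mathrm{s}} \approx 1.74,
      \qquad
      r = 1 - \frac{1}{S} = 1 - \frac{56.9}{99.1} \approx 42.6\%
    \]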

[Fig. 3. Model execution time with different numbers of processes and different task-distribution schemes between the master and slave processes for Project 1 with 231 subbasins, using a personal computer (a) and a work station (b).]

Table 2. Recommended parameter settings and the expected performance of the parallel SWAT.

Computer            p(a)    Project 1(c)          Project 2(c)
                            f(b) (%)   Speedup    f(b) (%)   Speedup
Personal computer   2       25         1.74       20         1.82
Work station        2       35         1.80       30         1.90
Work station        3       15         2.49       5          2.62
Work station        4       4          3.09       0          3.06
Work station        5       1          3.36       0          3.09

(a) p is the number of processes used.
(b) f is the percentage of the work load for the master process.
(c) Project 1 is a relatively smaller project with 231 subbasins; Project 2 is a relatively larger project with 1090 subbasins.

Using the WS, which has a larger computation capacity than the PC, we also implemented a group of model runs with a series of f values (from 0% to 50%) and different numbers of processes (from one to five).

As shown in Fig. 3b for Project 1, the optimal value of f is 35% for a two-process execution, with a running time of 39.5 s, a reduction of 44.4%, or a speedup of 1.80 (Table 2), compared to the serial SWAT (71 s) on the same WS (also see Fig. 5). Compared with the PC, the WS runs faster overall (whether with a single process or two processes), but the relative reductions in execution time from the one-process to the two-process run are quite close on the two machines (i.e., 42.6% and 44.4% on the PC and WS, respectively). Similarly, we found the optimal value of f to be 15%, 4%, and 1% for model runs using three, four, and five processes, respectively (see Fig. 3b), with the model execution time correspondingly reduced by 59.9%, 67.6%, and 70.3% (see Fig. 5). For running Project 2 on the WS, we found the optimal value of f to be 30%, 5%, 1%, and 0% for a two-, three-, four-, and five-process run (Fig. 4b), respectively, with similar time-reduction percentages. From Table 2, the speedup for Project 1 varied from 1.80 to 3.36, close to the range of 1.90–3.09 for Project 2. Therefore, the project size does not significantly affect the performance of the P-SWAT.

In addition, using the results of Project 1 as an example, Figs. 3b and 5 illustrate that the more CPU cores are involved, the less time a model run needs; however, the marginal speedup shrinks with every added core. For example, a three-process run uses 29% less time than a two-process run, and a four-process run uses 19% less time than a three-process run, whereas a five-process run reduces the time by only 8% compared with a four-process run. As shown in Fig. 5a, there is a notable decreasing power relationship between the normalized model execution time (running time per subbasin) and the number of processes. Although the sizes of Project 1 and Project 2 are quite different, Fig. 5a and b demonstrates a similar power relationship for the two projects (the coefficients of the fitted equations are close). This finding again indicates that the P-SWAT performs stably, fairly independently of the project size. Our tests with more processes indicate that a six-process run cannot reduce the execution time further, and a seven-process run leads to a slightly longer execution time than a five-process run: when more processes leave less work for each process, more message transfer and coordination between the master and all slave processes is required, which also consumes computing resources. Therefore, our study suggests that a five-process run may reach the speedup potential of the P-SWAT; a further increase in the number of cores would not result in a notable reduction in the execution time.
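The fitted curves in Fig. 5 take the power-law form below; the coefficients a and b appear only in the figure and are therefore left symbolic here, with similar values reported for the two projects:

    \[
      t_{\mathrm{norm}}(p) = a\,p^{-b},
      \qquad
      t_{\mathrm{norm}} = \frac{\text{model execution time}}{\text{number of subbasins}}
    \]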

[Fig. 4. Model execution time with different numbers of processes and different task-distribution schemes between the master and slave processes for Project 2 with 1090 subbasins, using a personal computer (a) and a work station (b).]

[Fig. 5. Relationship between the normalized model execution time and the number of processes used on the work station, for Project 1 with 231 subbasins (a) and Project 2 with 1090 subbasins (b). The normalized model execution time is the model execution time divided by the number of subbasins of each project.]

Although the speedup reported above is for an individual model run, it also applies to sensitivity analysis, to calibration (e.g., using SWAT-CUP (Abbaspour, 2011) or R-SWAT-FME (Wu and Liu, 2012)), and to BMP optimization procedures (Maringanti et al., 2009; Panagopoulos et al., 2011, 2012), because these procedures simply iterate the model run. Therefore, the P-SWAT can deliver nearly the same speedup in those procedures, which usually cost substantial time with the original serial SWAT.

The developed P-SWAT using MPI can be implemented not only on a single computer with more than one CPU core but also on a group of computers within a local network. In theory, the more computers used, the less time is needed for a model run; but, as when executing the P-SWAT on a single computer, the speedup becomes less significant as more computers are involved, owing to the increased message transfer. Further, for the same number of cores (e.g., five), a single computer with five cores would outperform a group of computers because of the slower communication between individual machines. Therefore, we recommend a WS with multiple CPU cores rather than a group of connected individual computers, although the latter remains an alternative.

5. Conclusions

In this study, we modified the widely applied hydrological model SWAT into a parallel version on Microsoft Windows using a parallel programming technology (MPI), with the aim of reducing the model execution time. Using a case study of a relatively small project with 231 subbasins, we derived that the optimal percentages (f) of work to be distributed to the master process are 35%, 15%, 4%, and 1% for a two-, three-, four-, and five-process model execution on a WS, respectively. The corresponding model execution time can be reduced by 44.4%, 59.9%, 67.6%, and 70.3% (a speedup of 1.80–3.36), respectively, compared with the running time of the original serial SWAT. For a relatively larger project with 1090 subbasins, the recommended percentages of work for the master process are 30%, 5%, 1%, and 0% for a two-, three-, four-, and five-process model execution, respectively, and the speedup ranges from 1.90 to 3.09. The comparison of the two projects of different sizes indicates that the P-SWAT enhances model efficiency with relatively stable performance (independent of the project size). Although the speedup grows as more cores are involved, the degree of enhancement diminishes, following a notable decreasing power relationship between the normalized model execution time and the number of cores used. Although the P-SWAT can also be used with a group of computers within a local network, a multi-core computer (e.g., with five cores and 1% of the work load assigned to the master process) is recommended, because the group of computers would not be faster given the communication cost between individual machines.

Because the P-SWAT reduces the execution time of an individual model run, it can help shorten the time required for both manual and automatic calibrations. The time reduction reported herein (for an individual model run and for our case study) is also applicable to auto-calibration procedures and to other cases, although the degree of reduction may vary slightly with computer and network configurations. In addition, the parallelization method we presented and the derivation of the optimal parameters in this study can be useful for, and easily applied to, other hydrological models.

Acknowledgments

This study was funded by Hong Kong RGC GRF Projects HKU711008E and HKU710910E, and the Natural Science Foundation of China Project 51109114. Part of this work was performed under USGS contract G08PC91508. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government. We thank Devendra Dahal (Stinger Ghaffarian Technologies, a contractor to USGS EROS) for his comments on the early draft and Sandra Cooper (USGS) for further reviews to improve the paper quality. We also thank the editor and the three anonymous reviewers for their constructive comments and suggestions.

Appendix

This appendix supplements Section 2.3 (SWAT parallelization). The following shows the major source code for the P-SWAT development and the steps for using the P-SWAT with a customized project.

A. Major source code of the P-SWAT development

(1) Load the MPI library in the declaration part of the main program and of the subroutines where MPI functions are used.

include 'mpif.h'

(2) Launch the MPI and get the parameters (e.g., numprocs = p) at the beginning of the main program.

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

(3) Distribute the subbasin simulations to the different processes, depending on the number of processes involved (numprocs = p), the percentage of work for the master process (f), and the current process number (myid), before calling the subbasin modeling work. This procedure is described in Section 2.3 (SWAT parallelization) and is implemented in the source file command.f.

(4) Send or receive the message by the slave and master processes, respectively, once the subbasin modeling work ends.

call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)

(5) End the MPI at the end of the main program.

call MPI_FINALIZE(ierr)

B. Steps for using the P-SWAT for a customized project

(1) Make sure the PC can run a parallel program. The MPICH2 installation package and example programs can be accessed at http://www.mpich.org.

(2) Create a file (Log.in) containing a single fraction (0.00–0.50) indicating the percentage of work for the master process (f).

(3) Make several copies of a specific SWAT project (i.e., the TxtInOut folder) on Drive D and rename them "simulation0", "simulation1", "simulation2", ., etc. The number of copies must be equal to or larger than the number of processes (p) to be used.

(4) Use the following command to execute the P-SWAT, setting the number of processes (p) to be used.

mpiexec -n p P-SWAT.exe

Note that users need to define the two parameters: p, as stated in step (4), and f, as described in step (2).

References

Abbaspour, K.C., 2011. SWAT Calibration and Uncertainty Programs 4.3.2. Swiss Federal Institute of Aquatic Science and Technology, Eawag, Duebendorf, Switzerland, 103 pp.
Arnold, J.G., Srinivasan, R., Muttiah, R.S., Williams, J.R., 1998. Large area hydrologic modeling and assessment – part 1: model development. Journal of the American Water Resources Association 34, 73–89.
Balaji, P., Buntinas, D., Butler, R., Chan, A., Goodell, D., Gropp, W., et al., 2011. MPICH2 User's Guide (Version 1.4.1). http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.4.1-userguide.pdf.
Beck, M.B., 1999. Coping with ever larger problems, models, and data bases. Water Science and Technology 39, 1–11.
Beven, K.J., 2001. Rainfall-runoff Modelling. John Wiley & Sons, Chichester, 360 pp.
Borah, D.K., Bera, M., 2002. Watershed-scale Hydrologic and Nonpoint-source Pollution Models: Review of Mathematical Bases, Chicago, Illinois, pp. 1553–1566.
Brun, R., Reichert, P., Kunsch, H.R., 2001. Practical identifiability analysis of large environmental simulation models. Water Resources Research 37, 1015–1030.
Chapman, B., Jost, G., van der Pas, R., 2007. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Massachusetts, USA, 353 pp.
Chen, A., Di, L., Wei, Y., Bai, Y., Liu, Y., 2009. Use of grid computing for modeling virtual geospatial products. International Journal of Geographical Information Science 23.
Chen, J., Wu, Y., 2012. Advancing representation of hydrologic processes in the Soil and Water Assessment Tool (SWAT) through integration of the TOPographic MODEL (TOPMODEL) features. Journal of Hydrology 420–421, 319–328.
Denisa, R., Victor, B., Dorian, G., 2012. Comparative parallel execution of SWAT hydrological model on multicore and grid architectures. International Journal of Web and Grid Services 8, 304–320.
Douglas-Mankin, K.R., Srinivasan, R., Arnold, J.G., 2010. Soil and water assessment tool (SWAT) model: current developments and applications. Transactions of the ASABE 53, 1423–1431.
Du, Z., 2008. High Performance Computation Parallel Programming Technology – MPI Parallel Programming Design (in Chinese).
Gassman, P.W., Reyes, M.R., Green, C.H., Arnold, J.G., 2007. The soil and water assessment tool: historical development, applications, and future research directions. Transactions of the ASABE 50, 1211–1250.
Gorgan, D., Bacu, V., Mihon, D., Rodila, D., Abbaspour, K., Rouholahnejad, E., 2012. Grid based calibration of SWAT hydrological models. Natural Hazards and Earth System Sciences 12, 2411–2423.
Gropp, W., Lusk, E., Skjellum, A., 1994. Using MPI: Portable Parallel Programming with the Message-passing Interface. MIT Press, Cambridge, Mass.
Gupta, H.V., Sorooshian, S., Yapo, P.O., 1998. Toward improved calibration of hydrologic models: multiple and noncommensurable measures of information. Water Resources Research 34, 751–763.
Gupta, H.V., Sorooshian, S., Yapo, P.O., 1999. Status of automatic calibration for hydrological models: comparison with multilevel expert calibration. Journal of Hydrologic Engineering 4, 135–143.
Howarth, R.W., Billen, G., Swaney, D., Townsend, A., Jaworski, N., Lajtha, K., et al., 1996. Regional nitrogen budgets and riverine N&P fluxes for the drainages to the North Atlantic Ocean: natural and human influences. Biogeochemistry 35, 75–139.
Jordi, A., Wang, D.-P., 2012. sbPOM: a parallel implementation of Princeton Ocean Model. Environmental Modelling & Software 38, 59–61.
Li, T.J., Wang, G.Q., Chen, J., Wang, H., 2011. Dynamic parallelization of hydrological model simulations. Environmental Modelling & Software 26, 1736–1746.
Li, Z.L., Shao, Q.X., Xu, Z.X., Cai, X.T., 2010. Analysis of parameter uncertainty in semi-distributed hydrological models using bootstrap method: a case study of SWAT model applied to Yingluoxia watershed in northwest China. Journal of Hydrology 385, 76–83.
Liu, H.C., Zhang, L.P., Zhang, Y.Z., Hong, H.S., Deng, H.B., 2008. Validation of an agricultural non-point source (AGNPS) pollution model for a catchment in the Jiulong River watershed, China. Journal of Environmental Sciences-China 20, 599–606.
Luecke, G.R., Kraeva, M., Ju, L.L., 2003. Comparing the performance of MPICH with Cray's MPI and with SGI's MPI. Concurrency and Computation: Practice & Experience 15, 779–802.
Maringanti, C., Chaubey, I., Popp, J., 2009. Development of a multiobjective optimization tool for the selection and placement of best management practices for nonpoint source pollution control. Water Resources Research 45.
MPI Forum, 2008. MPI: A Message-passing Interface Standard. Version 2.1. http://www.mpi-forum.org/docs/mpi21-report.pdf.
Muttil, N., Liong, S.Y., Nesterov, O., 2007. http://eprints.vu.edu.au/767/1/AParallelShuffled_s48_Muttil_.pdf, pp. 1940–1946.
Neitsch, S.L., Arnold, J.G., Kiniry, J.R., Williams, J.R., King, K.W., 2005. Soil and Water Assessment Tool Theoretical Documentation. Grassland, Soil and Water Research Laboratory, Temple, TX.
Ng, T.L., Eheart, J.W., Cai, X.M., 2010. Comparative calibration of a complex hydrologic model by stochastic methods GLUE and PEST. Transactions of the ASABE 53, 1773–1786.
Pacheco, P.S., 1997. Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, CA.
Panagopoulos, Y., Makropoulos, C., Mimikou, M., 2011. Reducing surface water pollution through the assessment of the cost-effectiveness of BMPs at different spatial scales. Journal of Environmental Management 92, 2823–2835.
Panagopoulos, Y., Makropoulos, C., Mimikou, M., 2012. Decision support for diffuse pollution management. Environmental Modelling & Software 30, 57–70.
Rouholahnejad, E., Abbaspour, K.C., Vejdani, M., Srinivasan, R., Schulin, R., Lehmann, A., 2012. A parallelization framework for calibration of hydrological models. Environmental Modelling & Software 31, 28–36.
Schwiegelshohn, U., Badia, R.M., Bubak, M., Danelutto, M., Dustdar, S., Gagliardi, F., 2010. Perspectives on grid computing. Future Generation Computer Systems 26.
Sharma, V., Swayne, D.A., Lam, D., Schertzer, W., 2006. In: 20th International Symposium on High-performance Computing in an Advanced Collaborative Environment (HPCS).
Srinivasan, R., Zhang, X., Arnold, J., 2010. SWAT ungauged: hydrological budget and crop yield predictions in the Upper Mississippi River Basin. Transactions of the ASABE 53, 1533–1546.
Tolson, B.A., Shoemaker, C.A., 2007. Cannonsville reservoir watershed SWAT2000 model development, calibration and validation. Journal of Hydrology 337, 68–86.
Tuppad, P., Kannan, N., Srinivasan, R., Rossi, C.G., Arnold, J.G., 2010. Simulation of agricultural management alternatives for watershed protection. Water Resources Management 24, 3115–3144.
Wang, G., Wu, B., Li, T., 2007. Digital Yellow River model. Journal of Hydro-Environment Research 1, 1–11.
Wang, J., Shen, Y., 2012. On the development and verification of a parametric parallel unstructured-grid finite-volume wind wave model for coupling with ocean circulation models. Environmental Modelling & Software 37, 179–192.
Winchell, M., Srinivasan, R., Di Luzio, M., Arnold, J.G., 2009. ArcSWAT 2.3.4 Interface for SWAT2005. Grassland, Soil and Water Research Laboratory, Temple, TX.
Wu, K., Xu, Y.J., 2006. Evaluation of the applicability of the SWAT model for coastal watersheds in southeastern Louisiana. Journal of the American Water Resources Association 42, 1247–1260.
Wu, Y., Chen, J., 2012a. Modeling of soil erosion and sediment transport in the East River Basin in southern China. Science of the Total Environment 441, 159–168.
Wu, Y., Chen, J., 2012b. An operation-based scheme for a multiyear and multipurpose reservoir to enhance macro-scale hydrologic models. Journal of Hydrometeorology 12, 1–14.
Wu, Y., Chen, J., 2013. Estimating irrigation water demand using an improved method and optimizing reservoir operation for water supply and hydropower generation: a case study of the Xinfengjiang reservoir in southern China. Agricultural Water Management 116, 110–121.
Wu, Y., Liu, S., 2012. Automating calibration, sensitivity and uncertainty analysis of complex models using the R package Flexible Modeling Environment (FME): SWAT as an example. Environmental Modelling & Software 31, 99–109.
Wu, Y., Liu, S., Abdul-Aziz, O.I., 2012. Hydrological effects of the increased CO2 and climate change in the Upper Mississippi River Basin using a modified SWAT. Climatic Change 110, 977–1003.
Yalew, S., van Griensven, A., Ray, N., Kokoszkiewicz, L., Betrie, G.D., 2013. Distributed computation of large scale SWAT models on the Grid. Environmental Modelling & Software 41, 223–230.
Zhang, X.S., Srinivasan, R., Bosch, D., 2009. Calibration and uncertainty analysis of the SWAT model using genetic algorithms and Bayesian model averaging. Journal of Hydrology 374, 307–317.
Zhang, X.S., Srinivasan, R., Van Liew, M., 2010. On the use of multi-algorithm, genetically adaptive multi-objective method for multi-site calibration of the SWAT model. Hydrological Processes 24, 955–969.
Zhao, G., Bryan, B.A., King, D., Luo, Z., Wang, E., Bende-Michl, U., et al., 2013. Large-scale, high-resolution agricultural systems modeling using a hybrid approach combining grid computing and parallel processing. Environmental Modelling & Software 41, 231–238.