Parallelisation of the Lagrangian atmospheric dispersion model NAME

Computer Physics Communications 184 (2013) 2734–2745, http://dx.doi.org/10.1016/j.cpc.2013.06.022

Eike H. Müller a,∗, Rupert Ford b,1, Matthew C. Hort a, Lois Huggett a, Graham Riley b, David J. Thomson a

a Met Office, Fitzroy Road, Exeter EX1 3PB, United Kingdom
b School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom

Article history: Received 27 November 2012; Received in revised form 15 June 2013; Accepted 24 June 2013; Available online 29 July 2013.

Keywords: Atmospheric modelling; Lagrangian dispersion model; Parallel computing; OpenMP

Abstract

The NAME Atmospheric Dispersion Model is a Lagrangian particle model used by the Met Office to predict the propagation and spread of pollutants in the atmosphere. The model is routinely used in emergency response applications, where it is important to obtain results as quickly as possible. This requirement for a short runtime and the increase in core number of commonly available CPUs, such as the Intel Xeon series, has motivated the parallelisation of NAME in the OpenMP shared memory framework. In this work we describe the implementation of this parallelisation strategy in NAME and discuss the performance of the model for different setups. Due to the independence of the model particles, the parallelisation of the main compute intensive loops is relatively straightforward. The random number generator for modelling subgrid scale turbulent motion needs to be adapted to ensure that different particles use independent sets of random numbers. We find that on Intel Xeon X5680 CPUs the model shows very good strong scaling up to 12 cores in a realistic emergency response application for predicting the dispersion of volcanic ash in the North Atlantic airspace. We implemented a mechanism for asynchronous reading of meteorological data from disk and demonstrate how this can reduce the runtime if disk access plays a significant role in a model run. To explore the performance on different chip architectures we also ported the part of the code which is used for calculating the gamma dose from a cloud of radioactive particles to a graphics processing unit (GPU) using CUDA-C. We were able to demonstrate a significant speedup of around one order of magnitude relative to the serial CPU version.

Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.

1. Introduction

1.1. The NAME model

The UK Met Office's Numerical Atmospheric-dispersion Modelling Environment (NAME) is a Lagrangian particle model [1] which is used to predict the transport and dispersion of pollutants in the atmosphere [2,3]. It was originally designed to provide advice on the spread of airborne radioactive material released in nuclear accidents, such as the Chernobyl disaster [4,5]. Since then the model has undergone extensive development and NAME is now used in a wide range of applications. These include research into air quality [6,7], support to regulatory bodies in emission monitoring [8], emergency response applications [9], and the spread of vectors for bluetongue and foot and mouth disease [10–12].

∗ Correspondence to: Department of Mathematical Sciences, University of Bath, Claverton Down, BA2 7AY, United Kingdom. Tel.: +44 7957463462. E-mail addresses: [email protected], [email protected] (E.H. Müller).
1 Present address: Science and Technology Facilities Council, Daresbury Laboratory, Warrington, Cheshire, WA4 4AD, United Kingdom.

As part of the emergency response work NAME is used to provide forecasts for the London volcanic ash advisory centre (VAAC). Recently the model provided predictions for the spread of volcanic ash in the European and North Atlantic airspace following the eruption of the Icelandic volcano Eyjafjallajökull [13–16] and the May 2011 eruption of Grímsvötn. NAME is available under license and requests for obtaining the source code should be directed to the Met Office by emailing [email protected].

Lagrangian particle models track individual model particles, each of which represents a certain amount of pollutant, through the atmosphere. The mean velocity of a particle is given by the flow field U evaluated at the particle position, which can, for example, be output from a numerical weather prediction model such as the Met Office Unified Model (UM) [17]. Turbulence below the grid scale is modelled by adding a random increment du^(turb) to the particle velocity vector at each timestep [18,19]; particle position and velocity are updated according to the Langevin equation [20], which for homogeneous turbulence is given by

\[
dx_i = \left(U_i + u_i^{(\mathrm{turb})}\right) dt,
\qquad
du_i^{(\mathrm{turb})} = -\frac{u_i^{(\mathrm{turb})}}{\tau_i}\, dt + \sqrt{\frac{2\sigma_i^2}{\tau_i}}\; d\xi_i(t).
\tag{1}
\]


The dξ_i are Gaussian random variables with mean zero and variance dt which are independent from time step to time step and from component to component (so that the dξ_i can be regarded as increments of independent Wiener processes). For inhomogeneous turbulence, where τ_i and σ_i^2 are time dependent functions of position and depend on the meteorological conditions [21], this equation has to be modified as described in [1,19]. The modifications usually assume the fixed point velocity distributions are Gaussian, but non-Gaussian options are also implemented in NAME [22]. Instead of updating both velocity and position of the particle using (1), a simplified integration scheme can be used in which only the position is updated according to a Wiener process without velocity memory, as described by
\[
dx_i = U_i\, dt + \sqrt{2 K_{\mathrm{eff}}}\; d\xi_i(t).
\tag{2}
\]

This assumes that the parameters do not depend on position, or vary slowly in the horizontal direction (but different values are used in the atmospheric boundary layer and the free troposphere).

Compared to Eulerian models, which solve the diffusion equation on a fixed three dimensional grid, Lagrangian models allow for the resolution of steep gradients near the source and, as they do not contain an internal grid scale, can easily be used on a wide range of spatial scales. Also, when used with velocity memory as in (1), they allow more sophisticated turbulence parameterisations than are possible in Eulerian models. The independence of particles in most applications allows an obvious approach to parallelisation, which, in contrast to Eulerian models, does not require any interprocessor communication during the propagation step.

In addition to the core routine which simulates the propagation of particles, NAME models other physical processes such as wet and dry deposition, deep convective mixing, gravitational settling and buoyant plume rise [22]. Although the particles do not interact as they propagate through the atmosphere, the model contains a chemistry scheme, which can be used to describe chemical reactions between the various species carried by the model particles and/or held as background fields on an Eulerian grid. For a plume of radioactive particles decay chains and half life decays can also be calculated. Usually the model output is a set of concentration fields, which can represent instantaneous snapshots or temporal and spatial averages or integrals, and which are written to disk. Due to the stochastic nature of the model, the relative error in these fields decreases in proportion to the inverse square root of the particle number. Flow information (i.e. meteorological data, in the following abbreviated as ''met data'') is read from disk at regular time intervals, and multilinearly interpolated in time and space for each Lagrangian particle and model time step. NAME is a mature and complex model and consists of around 110,000 lines of Fortran 90 source code; it should be kept in mind that the main focus in the original design of the code was modularity and maintainability.

1.2. Parallelisation

For quick decision making, emergency response applications require NAME to provide reliable results as quickly as possible. This need for a reduction of the model runtime (or the suppression of statistical noise by an increase in the number of particles at a constant runtime) motivated the parallelisation of the model which is described in this article. Research applications, which typically run for days to months and/or require hundreds of separate simulations, will also benefit from this effort, although in this case it is sometimes more effective to simply run different jobs simultaneously on separate CPUs.

As the particles do not interact while they are propagated according to Eq. (1), parallelisation of certain parts of the model is straightforward. Each compute unit propagates a set of independent model particles.


Synchronisation is only necessary at times when met data is read from disk or output fields are written. As discussed in Section 2, several subtleties have to be taken into account to ensure that met data is read from disk efficiently and to guarantee that the correctness of the parallel code can be verified.

Chemistry calculations are often carried out at larger timesteps than the intrinsic integration timestep; however, they require particles to 'communicate' in order to compute the chemical reactions. The information carried by the particles is converted to an Eulerian grid, the chemistry calculations are carried out on this grid and then the results are put back on the particles. The Eulerian grid calculation can be parallelised by dividing the grid between different compute units.

Before particles can be propagated through the atmosphere they have to wait for the flow information for the next met data time to be read from disk. This can be avoided by reading the met data asynchronously, i.e. by assigning one compute unit to reading flow information for met data time t + 2 while the other compute units propagate particles through the flow field which is interpolated in the time interval [t, t + 1]. We have implemented all three strategies within NAME, and also parallelised some less important, but easily parallelisable, loops in the output calculation.

Although in principle this general strategy could be implemented in other schemes, such as MPI [23], we decided to use the shared memory model OpenMP [24] for the following reasons:

• Commonly available CPUs, such as the Intel Xeon series, contain more than one core, so that existing hardware can be used (at the Met Office, Intel Xeon processors with four to twelve physical cores per machine are used to run NAME both for operational emergency response and research applications).
• OpenMP is easily implemented by adding compiler directives to the existing source code.
• As OpenMP directives are ignored in serial builds, the parallel and serial source code are identical, which improves maintainability and simplifies verification of results. This is particularly important as NAME is used for operational forecasts at the Met Office, for which reliability and correctness of the results have to be guaranteed.

On commonly available CPUs, such as the Intel Xeon series, the number of cores is of the order of tens, and this limits the achievable speedup if only OpenMP is used for parallelisation. In contrast to a distributed memory approach, communication is not explicit but is realised through shared memory access. This makes memory and cache access of an individual core less transparent and can limit the scalability. In this article we demonstrate that this relatively straightforward approach leads to a significant speedup and scales well up to around 10 cores for our typical applications. We have found that some loops scale better than others and that scaling is dependent on the hardware that is used.

Previous work on parallelising Lagrangian atmospheric dispersion models has been reported in [25], where the shared and distributed memory parallelisation of the Lagrangian Operational Dispersion Integrator (LODI) [26] is described. In the distributed memory version of the model each processor has a complete copy of the meteorological data and output fields are calculated separately by each processing unit. Parallel communication is required to calculate global output fields. The MPI implementation shows a nearly linear speedup as long as the problem size is large enough so that the overhead from communication is negligible; for 50,000 model particles strong scaling is good on up to 50–100 processors and improves for larger particle numbers. In addition a shared memory approach was used to parallelise the main time step loop with OpenMP. As in the NAME model this loop contains the particle advection and physical processes such as gravitational settling and deposition processes. In the LODI model this loop typically amounts to ∼85% of the total runtime. Using only the shared



memory parallelisation the authors find a speedup of 1.44 for a benchmark run with fifty thousand particles on one four core CPU of the ASCI Blue machine at LLNL. The distributed and shared memory parallelisations can be combined by using OpenMP on each node and MPI communication between the nodes. On 5 nodes (20 cores in total) they report a speedup of 7.3 for this hybrid OpenMP/MPI approach.

Some earlier work is described in [27], where a Lagrangian model for the one dimensional dispersion in the boundary layer is parallelised on a Distributed Array of Processors (DAP) machine. The DAP consists of 4096 single-bit processing elements which receive a common instruction stream from a central Master Control Unit and execute code in a SIMD fashion. Relative to the sequential code on an IBM 3084 mainframe computer a 14-fold speedup could be achieved on a DAP 610 machine and the authors estimate that this could be increased by another factor of five on a DAP 610C, which has more powerful compute elements that can process instructions on the byte level.

1.3. Overview

This article is organised as follows. In Section 2 we describe the structure and the parallelisation strategy of the NAME model and show how it can be implemented in OpenMP both for the computationally most expensive loops and for the asynchronous reading of met data. We then investigate performance improvements and in particular the scalability of the model in Section 3, where we focus on results from a realistic emergency response application for the spread of volcanic ash over the North Atlantic. The parallelisation of the cloud gamma calculation on a graphics processing unit (GPU) is described in Section 4. Our conclusions are presented in Section 5, where we also outline ideas for further performance improvements and parallelisation. Some more technical aspects and further results can be found in the appendices: the parallel random number generator used in NAME is described in Appendix A. The correctness of results is verified in Appendix B and further strong and weak scaling tests are reported in Appendix C. Whereas at the Met Office the NAME model is mainly run on Intel Xeon chips, we also show results for the scaling of the code on other systems, most importantly on a 32 core Power 6 node of the Met Office IBM supercomputer, in Appendix D. Thread binding and hyperthreading on Intel processors is discussed in Appendix E.

2. Model structure and parallelisation strategy

The NAME model contains two main timesteps. The intrinsic timestep dt is used to integrate the Langevin Eq. (1). At regular sync time intervals (typically 5 min, 10 min or 15 min) the particles are synchronised to write output to disk, perform chemistry calculations and release new particles. If the ''cheap'' integration scheme in (2) is used, only one internal time step is performed in each sync time interval. (Often the full scheme is only used near the source, i.e. for the first 30 min of travel time of each particle.) Met data is usually only available at even larger time intervals; for runs with global UM data, this can be every three or six hours.² Flow information is only available at the model grid points (these are tens of kilometres apart in the global UM, whereas the resolution is 1.5 km for the UKV model over the UK, which is a limited area configuration of the UM).

2 Output from higher-resolution limited area versions of the UM can be more frequent and is typically available every hour.

Fig. 1. Simplified NAME model structure. Met data is typically generated by the Met Office Unified Model, but other sources can be used as well.

The position of Lagrangian model particles is not restricted to grid points and therefore flow information has to be multilinearly interpolated in space. Interpolation in time is also necessary as the NAME timestep is smaller than the time interval of the met data from the UM. The main loop in the model consists of the following steps (Fig. 1):

1. Check if flow information is available to propagate the particles over the next synctime step, read met data from disk if necessary and update flow fields.
2. Release new particles from the source, if necessary.
3. Loop over particles: Propagate particles to next synctime.
4. Perform chemistry calculation for chemically active species (loop over grid boxes).
5. Loop over particles: Calculate contribution of particles to output at synctime.
6. Process output and write to disk.
7. Continue with step 1, or exit main loop at end of run.

Note that in runs with chemically inert particles the chemistry calculation is disabled. Also, additional physical processes such as decay chains of radioactive particles can be simulated in the model. The cloud gamma calculation discussed in Section 4 is part of the output loop. The parallelisation of the different loops (steps 3, 4 and 5 above) is described in the following sections.

The model contains an additional outer loop over different ''cases'' (not shown in Fig. 1), which could be runs based on an ensemble of met data. Results from these cases can be combined statistically at the end of the run to obtain information on the derived probability densities, e.g. to calculate percentiles and exceedance probabilities. In the following we will not consider this loop, as most operational runs are based on one case only.

2.1. Particle loops

Typically most of the computational work is done in the following two loops over model particles.

2.1.1. Main particle loop

The main loop for propagating model particles through the atmosphere is contained in one subroutine which integrates the particle velocity and position according to (1) or (2) and is often the computationally most expensive part of a model run, as demonstrated for an operational emergency response run in Section 3.1. To parallelise this loop, calls to the subroutine are surrounded by an OpenMP parallel region and the loop is parallelised by adding OpenMP directives.
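To make the structure concrete, the following is a minimal, self-contained Fortran sketch of an OpenMP-parallelised particle loop using the simplified update of Eq. (2). The module, routine and array names are ours and the physics is heavily reduced; this is an illustration rather than the actual NAME code.

! Minimal sketch of an OpenMP-parallelised particle loop (illustrative names,
! simplified physics following Eq. (2)); not the actual NAME implementation.
module particle_loop_sketch
  implicit none
contains
  subroutine propagate_particles(x, u_flow, k_eff, dt)
    real, intent(inout) :: x(:, :)       ! particle positions, shape (3, n_particles)
    real, intent(in)    :: u_flow(:, :)  ! interpolated flow velocity at each particle
    real, intent(in)    :: k_eff, dt
    integer :: ip
    real    :: dxi(3)
    !$omp parallel do private(ip, dxi) schedule(static)
    do ip = 1, size(x, 2)
       ! dx_i = U_i dt + sqrt(2 K_eff) dxi_i(t), cf. Eq. (2)
       dxi = gaussian_increment(ip, dt)
       x(:, ip) = x(:, ip) + u_flow(:, ip)*dt + sqrt(2.0*k_eff)*dxi
    end do
    !$omp end parallel do
  end subroutine propagate_particles

  function gaussian_increment(ip, dt) result(dxi)
    ! Placeholder with variance ~dt so that the sketch compiles stand-alone;
    ! the real model draws from the parallel random number generator of Appendix A.
    integer, intent(in) :: ip
    real, intent(in)    :: dt
    real :: dxi(3)
    dxi = sqrt(dt) * [cos(real(ip)), sin(real(ip)), cos(2.0*real(ip))]
  end function gaussian_increment
end module particle_loop_sketch

In the full model the loop body calls the integration scheme (1) or (2), applies the additional physical processes mentioned in Section 1.1 and draws its random increments from the per-particle generator described in Appendix A.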


NAME can be set up so that met data is only read from disk ''on demand'', i.e. only if the corresponding flow information is needed for particle propagation. Often this reduces the amount of data that is read from disk, especially if the plume is very localised. The most important example of this is the use of global met data, which can be split up into fourteen smaller regions, each covering a small part of the globe. Data for one of these regions is only loaded if it contains model particles. This implies that data can also be loaded from inside the particle loop, as soon as a particle enters one of these regions. If the model is run in parallel it is necessary to protect any met updates inside the particle loop to prevent two particles which are processed by different threads from loading the same segment of the globe when they enter it. This was implemented with OpenMP locks (one for each region of the globe), which can be set and unset through API calls; this ensures that only one thread can access the critical subroutine at a given time. This approach can also be used if nested met data is used: if data is available at different resolutions, NAME will only use the most accurate flow field and again it has to be ensured that this flow information is only read from disk once.
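As an illustration of this locking pattern, the following sketch guards the loading of one region with a dedicated OpenMP lock; the module, array and routine names are invented for the example and do not correspond to the NAME identifiers.

! Sketch: protect on-demand loading of met-data regions with one OpenMP
! lock per region of the globe.  Illustrative names only.
module met_region_locks
  use omp_lib
  implicit none
  integer, parameter :: n_regions = 14
  integer(omp_lock_kind) :: region_lock(n_regions)
  logical :: region_loaded(n_regions) = .false.
contains
  subroutine init_region_locks()
    integer :: ir
    do ir = 1, n_regions
       call omp_init_lock(region_lock(ir))
    end do
  end subroutine init_region_locks

  subroutine ensure_region_loaded(ir)
    ! Called from inside the parallel particle loop when a particle enters
    ! region ir; only one thread at a time can perform the read.
    integer, intent(in) :: ir
    call omp_set_lock(region_lock(ir))
    if (.not. region_loaded(ir)) then
       call load_region_from_disk(ir)
       region_loaded(ir) = .true.
    end if
    call omp_unset_lock(region_lock(ir))
  end subroutine ensure_region_loaded

  subroutine load_region_from_disk(ir)
    integer, intent(in) :: ir
    print *, 'reading met data for region', ir   ! placeholder for the real read
  end subroutine load_region_from_disk
end module met_region_locks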


2.1.2. Particle output loop

Output is calculated at 'sync' times in a separate loop over particles, which in the following is referred to as the ''particle output loop''. In this loop concentrations are calculated from the particles and in addition individual particle trajectories can also be output. Again, parallelisation of this loop in OpenMP is straightforward.

Internally particles are stored in an array structure. In the body of this loop, particles which are no longer needed in the simulation (for example, because they have left the domain of interest or their mass has been depleted by deposition to the ground or radioactive decay) are marked as inactive and removed from the array of model particles, and their slot becomes available for newly released particles. Turbulence is modelled by a random process and we will see below that this has implications for the implementation of the random number generator and the reproducibility of results.

2.1.3. Parallel random number generator

The turbulent component of the velocity is calculated using (pseudo-)random numbers which can be initialised with a fixed seed. Using the Fortran intrinsic random number generator is potentially dangerous because the sequence of random numbers used by different particles might overlap, or, depending on the implementation of the random number generator, several threads might update the seed simultaneously. The period of the intrinsic generator might also be too short. All this could mean that the statistics of the final results are compromised. To overcome this problem, we implemented a portable random number generator following [28–30] with a very long period (>10^18), which generates a separate set of random numbers for each particle index by using separate subsequences of this long sequence. The seeds of the generator at each position in the particle array are spaced so that the random sequences for different particle indices do not overlap; details of the algorithm can be found in Appendix A.

Keeping in mind that particles are removed from the run in the parallelised particle output loop, this implies a difficulty in obtaining reproducible results and hence in verifying the results of a parallel run by comparing it to an equivalent serial run. If the model is run in parallel, the order in which particles are marked as inactive and in which new particles are assigned to the free positions in the particle array depends on the order in which the different threads process their particles: if a new particle is created with a different index it will use a different sequence of random numbers. Consequently the difference between a serial and a parallel run (and potentially between two parallel runs) can grow, after much particle recycling, to be of the order of the statistical noise. To still allow comparison of parallel and serial runs for testing, a reproducibility flag was introduced. If this flag is set, the particles are removed in a separate loop within an OpenMP single region so that the list of particle indices available for reuse has a reproducible order. Otherwise, they are removed in the particle output loop inside an OpenMP critical region. To obtain the timing results presented in this paper we will always assume that the reproducibility flag has not been set; further results on verification can be found in Appendix B.

Using our implementation instead of the Fortran intrinsic random number generator is important for another reason: it was found that, when the code was compiled with the IBM xlf compiler and the intrinsic random number generator was used, only one thread at a time can pick a new random number. While this avoids the problem of overlapping sequences of random numbers, it effectively serialises this part of the code and leads to a huge increase in runtime. In fact we found that it causes the code to antiscale. On AMD Opteron processors a twofold increase in the runtime was observed if the internal random number generator is used.
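Schematically, the two ways of releasing particle slots can be written as follows; the routine and array names are hypothetical and the bookkeeping is much simpler than in the actual code.

! Sketch: release slots of inactive particles either inside the parallel
! output loop (critical section, non-deterministic order) or in a serial
! sweep inside a single region when the reproducibility flag is set.
! Intended to be called by all threads from within the parallel output region.
subroutine release_inactive_slots(active, free_list, n_free, reproducible)
  implicit none
  logical, intent(in)    :: active(:)
  integer, intent(inout) :: free_list(:), n_free
  logical, intent(in)    :: reproducible
  integer :: ip

  if (reproducible) then
     !$omp single
     do ip = 1, size(active)          ! one thread, reproducible index order
        if (.not. active(ip)) then
           n_free = n_free + 1
           free_list(n_free) = ip
        end if
     end do
     !$omp end single
  else
     !$omp do
     do ip = 1, size(active)
        if (.not. active(ip)) then
           !$omp critical (free_list_update)
           n_free = n_free + 1        ! order depends on thread scheduling
           free_list(n_free) = ip
           !$omp end critical (free_list_update)
        end if
     end do
     !$omp end do
  end if
end subroutine release_inactive_slots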

2.2. Chemistry loop

The model particles in a NAME run can carry a number of different chemical species. This is the case in air quality runs, which model the concentrations of different chemicals affecting human health, such as ozone, nitric oxides, sulphates, organic compounds and other chemicals [6,7,31]. Over the length of a run, these species react and transform into each other. For chemistry calculations the species carried by the particles are first converted to concentrations on a three dimensional Eulerian grid. In each gridbox the concentrations at the next timestep are then calculated according to the chemical reaction rates determined by these concentrations. The resulting concentrations are mapped back to the particles by assigning a relative proportion to each particle. The calculations in each gridbox are independent and we parallelised the outermost loop over the three dimensional grid. An example of a NAME run dominated by the chemistry loop is discussed in Section 3.2.

2.3. Asynchronous reading of met data

In some cases reading NWP met data from disk can make up a significant proportion of the runtime. This process has been parallelised by creating a separate IO thread which prefetches met data for the next met data time while the worker threads propagate the particles using already available flow information. The implementation of this strategy requires nested OpenMP parallelisation. The IO thread is created in an outer OpenMP parallel region; it loops over the different met modules³ and updates the flow information by reading new data from disk if one of the modules has been marked for prefetching by the worker threads.

Three buffers are used for storing met data: data for the previous and next timestep is stored in the old and new buffers (at the beginning of the run these buffers are read from file by the worker threads). Both buffers are needed for the interpolation of flow fields between these two times. Data for the following timestep is loaded into a prefetch buffer by the IO thread. Once this prefetched data is available and the worker thread is ready to propagate the particles over the next time interval, the buffer indices are swapped around according to prefetch → new, new → old, old → prefetch, and a new prefetch request is launched (note that it is not necessary to copy data between the buffers as they are only relabelled). The process is illustrated in Fig. 2.
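The following stand-alone sketch illustrates the idea of a prefetching IO thread next to the workers and the three-way buffer rotation. It is a deliberate simplification: the real implementation synchronises the threads with OpenMP locks rather than at the end of a sections construct, and all names and sizes here are invented.

! Sketch: outer parallel region with one IO section (prefetch) and one
! worker section (particle propagation with nested parallelism), plus the
! old/new/prefetch buffer rotation.  Toy sizes, illustrative names.
program async_met_sketch
  use omp_lib
  implicit none
  integer, parameter :: n_met_times = 5
  real    :: met_buffer(1000, 3)
  integer :: i_old, i_new, i_pre, itmp, t

  i_old = 1; i_new = 2; i_pre = 3
  call read_met(met_buffer(:, i_old), 0)   ! first two met times read by the workers
  call read_met(met_buffer(:, i_new), 1)
  call omp_set_nested(.true.)

  do t = 1, n_met_times
     !$omp parallel sections num_threads(2)
     !$omp section
     call read_met(met_buffer(:, i_pre), t + 1)          ! IO thread: prefetch the following met time
     !$omp section
     call propagate_all(met_buffer(:, i_old), met_buffer(:, i_new))   ! worker threads
     !$omp end parallel sections
     ! rotate buffer indices: prefetch -> new, new -> old, old -> prefetch
     itmp = i_old; i_old = i_new; i_new = i_pre; i_pre = itmp
  end do

contains

  subroutine read_met(buf, it)
    real, intent(out)   :: buf(:)
    integer, intent(in) :: it
    buf = real(it)                     ! placeholder for reading and processing met data
  end subroutine read_met

  subroutine propagate_all(met_old, met_new)
    real, intent(in) :: met_old(:), met_new(:)
    integer :: ip
    real    :: w
    !$omp parallel do private(ip, w)   ! nested parallel loop over particles
    do ip = 1, 100000
       w = 0.5*(met_old(1) + met_new(1))   ! stand-in for interpolation and advection
    end do
    !$omp end parallel do
  end subroutine propagate_all
end program async_met_sketch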

3 NAME can use different types of met data in a single run, for example nested global, European and UK wide data.



Fig. 2. Asynchronous reading of met data. The worker threads propagate particles over the time interval [t, t + 1] while the IO thread reads met data for t + 2 from disk. Before the next iteration the met data buffer indices are swapped according to prefetch → new, new → old, old → prefetch. The worker threads then propagate particles in the time interval [t + 1, t + 2] while the IO thread reads met data for t + 3, and so on.

Updating the flow fields consists of two steps: met data is read from disk and subsequently processed in a separate subroutine. To allow more fine-tuned load balancing between the IO and worker threads, either both steps or only the first can be executed by the IO thread. Synchronisation between the IO and worker threads is realised with OpenMP locks. In Section 3.3 we demonstrate how the asynchronous reading of met data can reduce the total runtime by around 30%. It should also be noted that currently this approach cannot be combined with loading of met data ''on demand'' as discussed in Section 2.1.1, as this would require knowledge of the particle positions at prefetch time.

3. Results

We verified that the changes resulting from parallelisation of the code have no impact on the final output. As discussed in detail in Appendix B, results typically differ from those of a serial run in the least significant digit, and the differences are much smaller than statistical fluctuations if the reproducibility flag is set. Comparing a serial run to a parallel run with one thread shows that the overhead from parallelisation is around 5% (see Appendix C for details).

In the following we present results for the absolute runtime and scaling of the parallelised NAME code for a set of realistic applications, in particular for an operational emergency response run for the forecast of volcanic ash over the North Atlantic. We also consider a run dominated by the chemistry calculation and a different short range emergency response run which requires the reading of large data volumes from hard disk. These runs demonstrate the speedup gained by parallelising different parts of the code, as described in the previous section. Results for an idealised test case are reported in Appendix C, where both weak and strong scaling are studied in more detail. All results presented in this section are obtained with the Intel Linux Fortran compiler and on Intel Xeon CPUs. The performance and scalability of the code on other hardware, in particular on a node of the Met Office IBM supercomputer, is shown in Appendix D.

3.1. Scaling of an operational emergency response run

In the following we consider a realistic emergency response run for a volcanic eruption. Runs in this configuration, but with a reduced particle release rate, were used to predict the spread of the ash cloud during the 2010 eruption of the Eyjafjallajökull volcano in Iceland.

Table 1. Timing results for an emergency response run for predicting the spread of volcanic ash over the North Atlantic. All times are measured in minutes; the code was run on an Intel Xeon X5680 processor with 2 × 6 cores. The results for the 1 thread run are linearly extrapolated from those of a run with a smaller number of particles.

Number of threads       1        4        6        8        11       16       24
Particle loop           853.98   220.08   148.38   114.76   89.45    86.23    68.46
Particle output loop    –        8.72     8.72     8.72     8.75     9.37     9.6
Output                  –        15.68    15.67    15.9     15.81    14.97    14.92
MetRead                 –        5.75     5.62     5.62     5.69     6.85     5.28
Total                   –        248.06   176.36   142.98   117.63   114.38   96.41

Table 2. Timing results for an emergency response run for predicting the spread of volcanic ash over the North Atlantic with reduced number of particles (half as many particles as in Table 1). All times are measured in minutes; the code was run on an Intel Xeon X5680 processor with 2 × 6 cores. The results for the 1 thread run are linearly extrapolated from those of a run with a smaller number of particles.

Number of threads       1        4        6       8       11
Particle loop           426.99   113.87   75.73   58.52   44.83
Particle output loop    –        4.36     4.36    4.36    4.37
Output                  –        15.79    15.77   15.86   15.8
MetRead                 –        5.59     4.82    4.64    5.28
Total                   –        137.47   99.1    81.96   68.46

The runs were carried out on a Dell server with two Intel Xeon X5680 six-core processors with a 3.33 GHz clock rate, 49.4 GB of main memory and an L2 (L3) cache of 256 kB (12 MB). To investigate the dependence on the number of particles we carried out runs with two different particle release rates, as described below.

Table 1 gives a detailed breakdown of the time spent in different sections of the code for the first choice of release rate. The Intel Xeon X5680 processor supports hyperthreading (see also Appendix E) and we include results for runs with 16 and 24 threads. In some preliminary runs (not shown here) with this version of the model we found that the particle output loop antiscales. This loop only accounts for a minute fraction of the runtime, and because of this antiscaling we decided to assign only one thread to processing the particle output loop. The antiscaling turned out to be caused by a technical problem, which was later fixed. We find that the main particle loop (see Section 2.1.1) indeed accounts for most of the runtime; the overhead from reading met data from disk and outputting NAME data is small. A fit of the total run time in Table 1 to Amdahl's law



\[
T(n; T_0, r) = T_0 \left( 1 - r + \frac{r}{n} \right)
\tag{3}
\]
gives:
\[
T_0 = 865\ \text{min}, \qquad r = 95\%, \qquad (1 - r)\, T_0 = 41\ \text{min}.
\tag{4}
\]

Here n denotes the number of cores, T_0 is the total runtime of a 1-thread run and r the parallel fraction of the code; (1 − r)T_0 is the sequential part of the runtime. We repeated the same analysis with a release rate that was reduced by a factor of two. The results are shown in Table 2 and plotted in Fig. 3 (filled circles) together with the data from Table 1 (empty circles). Here we find the following fit results:
\[
T_0 = 463\ \text{min}, \qquad r = 94\%, \qquad (1 - r)\, T_0 = 28\ \text{min}.
\tag{5}
\]
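As a quick consistency check (our arithmetic, not part of the original analysis), the fit (5) reproduces the measured 11-thread total runtime of the reduced-release-rate run in Table 2 quite closely:
\[
T(11) \approx (1-r)\,T_0 + \frac{r\,T_0}{11}
      = 28\ \text{min} + \frac{463 - 28}{11}\ \text{min}
      \approx 67.5\ \text{min},
\]
compared with a measured total of 68.46 min.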

From this we conclude that, as expected, the runtime is approximately proportional to the number of particles. The original operational runs were carried out in 2010 with an older version of the model which did not include some serial improvements that are contained in the version used to obtain the results above. During the 2010 eruption of Eyjafjallajökull the serial version of the model was run on an Intel Xeon X5560 processor with a clock rate of 2.80 GHz and the release rate was a factor of 10 smaller than in Table 2; these runs typically took 1 h and 30 min to complete.



Table 3. Time spent in different sections of the code for an air quality benchmark; results for the coarse grid run are shown at the top and for the fine grid run at the bottom. The code was run on an Intel Xeon X5680 processor. All times are measured in minutes.

Coarse grid:
Number of threads       1       2       4       6       8       12
Chemistry loop          209.0   106.1   73.1    59.5    63.1    56.7
Particle loop           73.0    37.1    21.7    14.9    14.3    12.5
Particle output loop    194.3   98.0    50.7    34.0    28.4    21.7
Total                   491.5   255.4   160.4   123.7   121.1   107.4

Fine grid:
Number of threads       1       2       4       6       8       12
Chemistry loop          607.6   341.7   233.3   206.0   195.8   212.5
Particle loop           178.7   99.3    53.1    37.8    32.2    26.9
Particle output loop    86.3    49.3    28.4    20.9    17.5    14.8
Total                   970.0   589.9   413.3   363.5   344.0   354.0

Fig. 3. Total runtime and time spent in the particle loop for two different release rates in an emergency response run for predicting the spread of volcanic ash over the North Atlantic. The results were obtained on an Intel Xeon X5680 processor with a clock rate of 3.33 GHz. Only the results for 4, 6, 8 and 11 threads were used for the linear fit. The total runtime for a run with the larger release rate but on an Intel Xeon X7542 processor with a clock rate of 2.67 GHz is also shown (open squares). We also show the runtime of the original run (using an older serial version of the model) during the 2010 eruption of Eyjafjallajökull with a reduced release rate. The serial run was carried out on an Intel Xeon X5560 processor with a clock rate of 2.80 GHz.

Taking all these improvements into account, we conclude that we are now able to repeat these runs with a tenfold increase in particle number (and the resulting reduction in statistical noise) in the same length of time as before, if the model is run on 8 cores of an Intel Xeon X5680 processor. We repeated the above runs on a system with four Intel Xeon X7542 six core processors (2.67 GHz, 64 GB main memory, 256 kB/18 MB L2/L3 cache) in the hope of demonstrating scaling beyond 12 threads. As can be seen from Fig. 3, the performance of NAME on this processor is not as good as on the X5680 system. While part of the difference can be explained by the ratio of clock speeds (3.33 GHz versus 2.67 GHz), the X7542 does not show the same scaling as the X5680 even up to 12 threads. 3.2. Air quality runs NAME can be used operationally to provide air quality forecasts. Although these runs typically use a large number of particles, a significant amount of time is spent in the parallelised chemistry routine. In the following we show results for an air quality run on an Intel Xeon X5680 processor. In the operational setup this run consists of two parts. A coarse run on a grid that covers central Europe is used to initialise the boundary conditions of a fine grid run with a higher spatial resolution which covers the United Kingdom. When run on one thread, 42% of the time of the coarse grid run is spent in the chemistry loop, followed by the particle output loop (39%) and the particle loop (14%). The runtime of the fine grid run is dominated by the chemistry loop (62%), the particle loop and particle output loop only take 18% and 8% of the total runtime. The total runtime and the time spent in different sections of the code is shown for both runs in Table 3 and the self-relative speedup is shown in Fig. 4. For the particle loop the scaling is good up to 6 threads in both cases and then deteriorates; the scaling is not as good as for the operational volcano run in Section 3.1. The scaling of the chemistry loop is much poorer and this has a big impact on the total runtime, in particular for the fine grid run, which is dominated by this part of the code. Initially it was suspected that the poor scaling of the chemistry loop can be traced back to cache and memory access problems

Fig. 4. Self-relative speedup of different sections of the air quality run (see Section 3.2), both for the coarse (top) and fine (bottom) grid. Both runs were carried out on an Intel Xeon X5680 processor. The percentages given are the time fractions for the 1 thread run.

Initially it was suspected that the poor scaling of the chemistry loop could be traced back to cache and memory access problems due to inefficient array layout. Reordering of the relevant array indices gave a speedup of around 10% and slightly improved the scaling of the chemistry loop, but this did not fully explain the poor performance (see Fig. 5).
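For illustration, a minimal sketch of the grid-box parallelisation described in Section 2.2 is given below. The array layout (species index first, so that the per-box slice is contiguous in memory, in the spirit of the index reordering mentioned above), the routine names and the trivial stand-in 'chemistry' are all ours, not the NAME implementation.

! Sketch: parallelise the chemistry step over the Eulerian grid; the grid
! boxes are independent, so the spatial loops can be distributed between
! threads.  Illustrative names and a dummy reaction step only.
subroutine chemistry_step(conc, dt_chem)
  implicit none
  real, intent(inout) :: conc(:, :, :, :)   ! (n_species, nx, ny, nz)
  real, intent(in)    :: dt_chem
  integer :: ix, iy, iz

  !$omp parallel do collapse(3) private(ix, iy, iz) schedule(static)
  do iz = 1, size(conc, 4)
     do iy = 1, size(conc, 3)
        do ix = 1, size(conc, 2)
           ! conc(:, ix, iy, iz) is a contiguous slice (species index first)
           call react_in_box(conc(:, ix, iy, iz), dt_chem)
        end do
     end do
  end do
  !$omp end parallel do

contains

  subroutine react_in_box(c, dt)
    real, intent(inout) :: c(:)
    real, intent(in)    :: dt
    c = max(0.0, c - 0.01*dt*c)   ! placeholder for the reaction-rate integration
  end subroutine react_in_box
end subroutine chemistry_step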


Table 4. Total runtime for a cloud gamma calculation described in Section 4 both for the serial Fortran code on an Intel Xeon E5504 processor with 2.0 GHz clock speed and for the CUDA-C code on an nVidia Tesla GT200 C1060 GPU. All times are given in seconds. The last column shows the speedup achieved by porting the code to the GPU.

Number of particles   Fortran (CPU)   CUDA-C (GPU)   Speedup
5,000                 147             19             7.7
10,000                293             32             9.2
20,000                586             59             9.9
50,000                1500            146            10.3

Fig. 5. Scaling of the chemistry loop before and after reordering of the data structures. The results were obtained on a machine with two Intel Xeon X5470 quad core CPUs with a clock speed of 3.33 GHz.

Fig. 7. Speedup achieved by porting the cloud gamma calculation to a GPU for different numbers of particles. An nVidia Tesla GT200 C1060 card with 240 processor cores and 4 GB memory was used.

Fig. 6. Activity charts for asynchronous reading of met data (see Section 3.3), with wall clock time shown on the horizontal axis. Note that the line marked ‘‘workerthread’’ refers to the activity of all worker threads, and not only a single thread. All met data is read and processed by the worker threads in the top example; on the bottom the corresponding chart is shown for the case where an additional IO thread is used. LPPAT denotes the time spent in the main particle loop. The number of worker threads is two in both cases. Note that met data for the first time step is always read by the worker threads.

3.3. Asynchronous reading of met data

The last example we consider is an operational emergency response run in which 200,000 particles of a radioactive species (caesium-137) were released into the atmosphere. The run uses global met data. The runtime is dominated by the particle loop and the reading and processing of met data. In Fig. 6 we show an activity chart for this run, both without (top) and with (bottom) parallel reading of met data. Two worker threads were used in both cases, i.e. the latter run uses three threads in total; using parallel reading of met data reduces the total runtime by around 30%.

4. Cloud gamma calculation on GPUs

The results reported so far have been obtained by OpenMP parallelisation on a multi-core CPU. To explore potential performance benefits on novel chip architectures we also ported parts of the code to a General Purpose Graphics Processing Unit.

For a plume of radioactive particles, the cloud gamma dose on a fixed receptor grid can be calculated inside the particle loop. For each particle this requires looping over all grid boxes, calculating the distance between the particle and the receptor (which might require the use of spherical coordinates) and deriving the cloud gamma dose. We identified a benchmark with a receptor grid of size 101 × 51 × 1 for which the runtime is almost completely dominated by the dosage calculation. As the body of the loop over the receptor grid boxes is very simple it provides an ideal candidate for a kernel that can be run on a graphics processing unit (GPU). This required rewriting this kernel in C, linking the resulting object file to the rest of the Fortran code, and then adding CUDA-C calls for launching this kernel on the device.

It is important to minimise memory transfers between host and device. To achieve this, the parameters needed for the cloud gamma calculation (such as physical parameters and the grid layout) are copied to the device outside the particle loop and output fields are copied back to the host after all particles have been processed.

Runs with an nVidia Tesla GT200 C1060 card with 240 processor cores and 4 GB memory gave a speedup of around a factor of 10 relative to a one core CPU run of the original Fortran code; Table 4 compares the runtimes of the serial Fortran code and the device code for different numbers of particles. The speedup for different numbers of model particles is plotted in Fig. 7. Note that there is potential for further optimisation in the GPU kernel by exploiting the device specific memory hierarchy (all data is copied to global device memory in our implementation). As in the CPU implementation, single precision arithmetic was used on the GPU. The output fields of both runs were compared and found to differ in the last decimal places, which is the expected size of differences for floating point arithmetic (see also Appendix B).
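On the Fortran side, the coupling to such a C/CUDA kernel can be sketched with an ISO_C_BINDING interface block along the following lines; the wrapper name, argument list and data layout are hypothetical and are not the actual NAME/CUDA interface.

! Hypothetical Fortran binding for a CUDA-C cloud-gamma kernel wrapper,
! assumed to be implemented in a separate CUDA-C file and linked in.
module cloud_gamma_gpu
  use iso_c_binding, only: c_int, c_float
  implicit none
  interface
     subroutine cloud_gamma_kernel(n_particles, part_pos, part_activity, &
                                   nx, ny, dose) bind(c, name="cloud_gamma_kernel")
       import :: c_int, c_float
       integer(c_int), value         :: n_particles, nx, ny
       real(c_float),  intent(in)    :: part_pos(3, n_particles)     ! particle positions
       real(c_float),  intent(in)    :: part_activity(n_particles)   ! activity per particle
       real(c_float),  intent(inout) :: dose(nx, ny)                 ! receptor grid dose field
     end subroutine cloud_gamma_kernel
  end interface
end module cloud_gamma_gpu

A single call of this form would replace the serial loop over the receptor grid boxes; keeping the physical constants and grid description resident on the device between calls, as described above, avoids repeated host–device transfers.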



5. Conclusion

In this article we described the parallelisation of the NAME atmospheric dispersion model. The most compute intensive loops have been parallelised with OpenMP and a mechanism for asynchronously reading meteorological data from disk has been implemented. The random number generator for the simulation of sub-gridscale turbulence was adapted to ensure that the results for different particles are statistically independent and that the results of parallel runs can be verified by comparison to a serial run, if necessary.

We find that the scaling of the main particle loop is very good on Intel Xeon processors. The chemistry loop, which is relevant for air quality type runs, scales less well. The good scaling of the code was demonstrated for a real-world emergency-response model configuration which is currently used routinely for volcanic ash forecasting at the UK Met Office. We also investigated the performance on other computer architectures such as IBM Power 6 processors; details can be found in Appendix D. For this specific processor we found that scaling of the particle loop is good if not more than eight physical cores are used. The scalability of individual loops can also depend on the model configuration; for example, we observed that the particle output loop scales much better for the coarse air quality type run described in Section 3.2 than for the run described in Appendix C.

We have explored the use of General Purpose Graphics Processing Units (GPGPUs) and successfully ported the calculation of the cloud gamma dose from a plume of radioactive particles with CUDA-C. A speedup (relative to a serial CPU run) of around ten was achieved for the test case we investigated when the code was run on an nVidia Tesla GT200 C1060 GPU. This speedup has to be compared to the one obtained through OpenMP parallelisation. Of course, ideally the entire particle loop should be ported to the GPU, as has been done for a simple Lagrangian dispersion model in [32,33]. While it is worth investigating this approach, it should be noted that NAME is a well established model consisting of about 110,000 lines of source code and complex data structures, which make memory transfers between host and device very challenging; this would require rewriting large parts of the code.

To extend the scalability to a larger number of processes, parallelisation of the particle and chemistry loops with MPI should be explored, and there is also potential for further parallelisation of the model. The outer loop over different ''cases'', which can be based on different members of an ensemble of meteorological data (see Section 2), is an obvious candidate for parallelisation with MPI or even simple task farming, if the results are combined outside the NAME model itself.

Acknowledgements

We thank all external users who tested the parallel implementation for feedback, in particular Pete Bedwell (Health Protection Agency) and Glenn Carver (University of Cambridge) for useful discussions and for porting the NAME code to the SUN architecture. Paul Selwood and Matthew Glover from the Met Office HPC optimisation team and Martyn Foster (IBM) provided invaluable help for improving the scalability of the code on a node of the Met Office IBM supercomputer. We are particularly grateful to Matthew Glover for pointing out problems with the Fortran intrinsic random number generator. The work reported in Section 4 was carried out during a collaboration with the Edinburgh Parallel Computing Centre (EPCC).

Fig. A.8. Sequence of random numbers. Each particle index uses a subsequence of length N from the sequence; N has to be chosen greater than the number of timesteps to ensure that the sequences do not overlap.

Appendix A. Parallel random number generator

This section describes details of the implementation of the parallel random number generator which is used for modelling turbulent motion in NAME. As already discussed in the main text, it is important that each particle uses a separate sequence of random numbers. The particles are stored in a static array structure, which might contain more slots than the number of particles currently used in the simulation. Conceptually we consider a single random sequence which is then partitioned into separate subsequences for each particle index as shown in Fig. A.8.

The random number generator is based on two linear congruential generators as described in [28] which are combined to obtain a single generator with a longer period [29]. For each position i in the particle array a pair of random number generators is used to generate two sequences r^{(1,2)}_{i,j} of uniformly distributed integer random numbers via the recursion relations
\[
r^{(1)}_{i,j+1} = a^{(1)}\, r^{(1)}_{i,j} \bmod m^{(1)},
\qquad
r^{(2)}_{i,j+1} = a^{(2)}\, r^{(2)}_{i,j} \bmod m^{(2)}.
\tag{A.1}
\]

The multipliers a^{(1,2)} are suitably chosen integers and the moduli m^{(1,2)} are prime numbers. The index j loops over the model timesteps. To ensure that the sequences do not overlap for different particles, the seeds s^{(1,2)}_i = r^{(1,2)}_{i,1} are defined recursively as follows:
\[
s^{(1)}_{i+1} = \left(a^{(1)}\right)^{N} s^{(1)}_{i} \bmod m^{(1)},
\qquad
s^{(2)}_{i+1} = \left(a^{(2)}\right)^{N} s^{(2)}_{i} \bmod m^{(2)},
\tag{A.2}
\]
with
\[
s^{(1)}_{1} = m^{(1)} - 1,
\qquad
s^{(2)}_{1} = m^{(2)} - 1.
\tag{A.3}
\]

Here N is a number that is greater than the number of time steps multiplied by the number of random numbers required at each time step, but smaller than the period of the random number generator divided by the maximal number of particles. If N is a power of two, N = 2^p, the calculation of
\[
\left(a^{(1,2)}\right)^{N} s \bmod m
\tag{A.4}
\]
can be carried out in p = \log_2 N steps as follows: define a_0 = a^{(1,2)} and calculate

\[
a_k = a_{k-1}^{2} \bmod m \qquad \text{for } k = 1, \dots, p.
\tag{A.5}
\]

Then
\[
\left(a^{(1,2)}\right)^{N} \bmod m = a_p,
\qquad
\left(a^{(1,2)}\right)^{N} s \bmod m = a_p\, s \bmod m.
\tag{A.6}
\]

In other words, to calculate the (i + N)th number from the ith number in the sequence it is not necessary to calculate all N intermediate numbers, but only \log_2 N steps of increasing size are required. Also, the two a_p values (corresponding to a^{(1)} and a^{(2)}) only need to be calculated once. Once the a_p are known the cost is



the same as generating a new random number with the underlying serial generator.

Finally, r^{(1)}_{i,j} and r^{(2)}_{i,j} are combined to obtain a uniformly distributed random number in the unit interval,
\[
r_{i,j} = \frac{r^{(1)}_{i,j} - r^{(2)}_{i,j} + \alpha\,\left(m^{(1)} - 1\right)}{m^{(1)}},
\tag{A.7}
\]

where the integer α is chosen such that 0 < r_{i,j} < 1. This sequence is shown in Fig. A.8. In the current implementation the two multipliers a^{(1)} = 40 014, a^{(2)} = 40 692 and moduli m^{(1)} = 2 147 483 563, m^{(2)} = 2 147 483 399 are used [29,30]. With these values the pseudorandom sequence that is obtained by subtracting the two numbers as in (A.7) has a period of ≈ 2.3 · 10^18 ≈ 2^61. The value of N is set to N = 2^32 ≈ 4 · 10^9, which restricts the number of particles to around 5 · 10^8.

An alternative strategy is to set the seeds for the different particle indices to be successive random numbers in the underlying serial generator and, for each index i, to move down the sequence in jumps of size N (where N is larger than the number of particles). This is more similar to what occurs with a serial generator in serial runs and would be no more expensive than the approach that is implemented in the model. This alternative approach follows that used with a single linear congruential generator in the simulations described in [34] which were conducted on a Cyber 205 vector machine.
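A compact sketch of this skip-ahead is given below, using 64-bit integers so that the intermediate products (below 2^62 for moduli smaller than 2^31) cannot overflow; the module and function names are ours.

! Sketch: compute a_p = a**(2**p) mod m by p modular squarings and use it
! to jump a seed forward by N = 2**p places, cf. Eqs. (A.2) and (A.4)-(A.6).
module rng_skip_sketch
  use iso_fortran_env, only: int64
  implicit none
contains
  function skip_multiplier(a, m, p) result(ap)
    integer(int64), intent(in) :: a, m
    integer,        intent(in) :: p
    integer(int64)             :: ap
    integer :: k
    ap = a
    do k = 1, p                  ! a_k = a_{k-1}**2 mod m
       ap = mod(ap*ap, m)        ! stays below 2**62 for m < 2**31
    end do
  end function skip_multiplier

  function skip_seed(s, ap, m) result(s_next)
    integer(int64), intent(in) :: s, ap, m
    integer(int64)             :: s_next
    s_next = mod(ap*s, m)        ! seed for the next particle index, Eq. (A.2)
  end function skip_seed
end module rng_skip_sketch

With a = 40 014, m = 2 147 483 563 and p = 32 this advances a seed by N = 2^32 places in 32 modular multiplications, and the two a_p values only have to be computed once per run.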

To generate Gaussian random numbers, two independent normally distributed random numbers n_1, n_2 can be obtained from a pair of independent uniformly distributed random numbers r_1, r_2 ∈ (−1, 1) using a Box–Muller transformation [35,36]:
\[
n_1 = r_1 \sqrt{\frac{-2 \log s}{s}},
\qquad
n_2 = r_2 \sqrt{\frac{-2 \log s}{s}},
\tag{A.8}
\]
where s = r_1^2 + r_2^2 and all r_1, r_2 which result in s ≥ 1 are rejected. This uses the idea that transforming a uniform distribution on the unit disk to a two dimensional normal distribution is easier than the corresponding transformation in one dimension.
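A self-contained sketch of this rejection (polar) variant is shown below; for brevity the uniform deviates are drawn from the Fortran intrinsic generator, whereas in NAME they would come from the parallel generator described above.

! Sketch of the polar Box-Muller method of Eq. (A.8): draw r1, r2 uniformly
! in (-1, 1), reject pairs with s = r1**2 + r2**2 >= 1 (or s = 0), and
! return two independent standard normal deviates.
subroutine gaussian_pair(n1, n2)
  implicit none
  real, intent(out) :: n1, n2
  real :: r1, r2, s, f
  do
     call random_number(r1)      ! placeholder uniform generator
     call random_number(r2)
     r1 = 2.0*r1 - 1.0
     r2 = 2.0*r2 - 1.0
     s  = r1*r1 + r2*r2
     if (s > 0.0 .and. s < 1.0) exit
  end do
  f  = sqrt(-2.0*log(s)/s)
  n1 = r1*f
  n2 = r2*f
end subroutine gaussian_pair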

Fig. B.9. Concentration fields for a simple particle run (see Appendix B) with a total release of 100,000 particles. The model was run both in serial and with four threads and the output of the serial run is shown in grey. The relative differences between the output of the serial and parallel, four thread run are highlighted in colour, with grid boxes where the results agree exactly shown as uncoloured. NAME uses single precision (32-bit) floating point arithmetic with a machine epsilon of around 10^−7. The reproducibility flag (see Section 2.1.2) was set for the parallel run.

Table C.5. Time spent in different sections of the code for a simple ''particle'' type run described in Appendix C with 5 million particles. We show results both for a serial run and for a parallel run with different numbers of threads. All times are given in minutes; the code was run on a 4 core Intel Xeon E5520 processor.

Number of threads       Serial   1       2       3       4
Particle loop           –        69.5    35.66   24.34   18.64
Particle output loop    –        0.16    0.1     0.08    0.06
Total                   66.9     70.03   36.13   24.79   19.07

Appendix B. Verification

To verify that parallelisation of the code does not change the results we ran the model both in serial mode and with four threads. Fig. B.9 demonstrates that, if the reproducibility flag discussed in Section 2.1.3 is set, the relative error in the concentration field never exceeds one in a million and is much smaller than the statistical fluctuations in the concentrations due to the modelling of turbulence by random trajectories. Achieving bit reproducibility is hard, because the output fields are obtained by summing over the contributions of several particles and floating point addition is not associative. Note also that for a run without the chemistry scheme the individual particle trajectories agree between a serial and the equivalent parallel run if the reproducibility flag is set, which provides another useful test of the parallel code.

Appendix C. Further scaling tests

In this section we present more results on a simple particle type run with a large number of model particles and study both strong and weak scaling. The chemistry scheme is not used in these runs. In an idealised test case we use met data derived from observations at a single location. The meteorology is assumed to be horizontally homogeneous, to change in the vertical following idealised semi-empirical formulae, and to change in time. This simple setup is used for short range applications that only require a limited prediction range (of the order of a few kilometres).

Fig. C.10. Strong scaling. Self-relative speedup of different sections of the code for a simple ‘‘particle’’ type run described in Appendix C with a total number of 5 million particles. The code was run on a 4 core Intel Xeon E5520 processor. Note that the particle output loop only accounts for a negligible amount of the runtime.

The following results were obtained on an Intel Xeon E5520 processor with 4 physical cores (2.27 GHz, 5.9 GB main memory, 256 kB/8 MB L2/L3 cache). Five million particles are released over the course of the run, which is a significant number of model particles (of the same order of magnitude as in the emergency response run studied in Section 3.1), but is designed to result in the run time being dominated by the particle loop. This is confirmed by Table C.5, which shows the time spent in different sections of the code. In this instance the particle output loop only makes up a very small fraction of the runtime (less than 1% in a one core run). The total time spent on a serial run is 67 min; the overhead from parallelisation is about 5% for a one thread run.

Strong scaling. In Fig. C.10 we plot the self-relative speedup T(1, n_part)/T(n, n_part) for the total runtime and the time spent in the two particle loops.

Table C.6. Time spent in different sections of the code for a simple ''particle'' type run described in Appendix C; the total particle number per thread is kept constant at 1.25 million particles. The model was run on a 4 core Intel Xeon E5520 processor.

Number of threads       1       2       3       4
Particle loop           17.37   17.83   18.24   18.62
Particle output loop    0.04    0.05    0.06    0.06
Total                   17.54   18.08   18.59   19.05


Here n_part denotes the number of model particles and n the number of threads. The plot demonstrates the very good scaling of the code up to 4 cores. To quantify this further, we fit the total runtime to Amdahl's law (3), where T_0 and r are fit parameters. We observe that the data is described very well by this functional relationship and find for the total runtime T_0 of a one thread run, the parallel fraction r and the serial proportion of the runtime (1 − r)T_0:
\[
T_0 = 70\ \text{min}, \qquad r = 97\%, \qquad (1 - r)\, T_0 = 2\ \text{min}.
\tag{C.1}
\]
Intel Xeon processors support hyperthreading and the assignment of threads to physical cores can be controlled by setting the KMP_AFFINITY environment variable. If this is set to compact, multiple threads are assigned to one physical core, whereas scatter will attempt to distribute the threads between different physical cores as far as possible. We carried out runs with both values of KMP_AFFINITY, but found that the best results are obtained if this variable is not set, i.e. the assignment is done automatically. Further details can be found in Appendix E.

Weak scaling. Alternatively the computational load (i.e. the number of particles) per thread can be kept constant and one can investigate how the runtime changes as the number of threads is increased. This is known as weak scaling and is usually easier to achieve. To quantify this we define the parallel efficiency as T(1, n_part)/T(n, n × n_part), which should be as close to unity as possible. We have repeated the above runs with a total particle release of 1.25 million model particles per thread. Results for the runtime are shown in Table C.6 and the parallel efficiency is plotted in Fig. C.11. The results demonstrate the very good weak scaling of the particle loop and the total run time, which increases by less than 10% when going from one to four threads. Investigating parallel efficiency is appropriate if the extra threads are not used to reduce the solution time but to increase the number of model particles. The total runtime and the time spent in the particle loop stay constant as the number of particles increases. As the (relative) error in the final results decreases in proportion to the inverse square root of the particle number, going from 1 to n threads reduces this error by 1/√n.

Fig. C.11. Weak scaling. Parallel efficiency of different sections of the code for a simple ''particle'' type run described in Appendix C; the total particle number per thread is kept constant at 1.25 million particles. The code was run on a 4 core Intel Xeon E5520 processor. Note that the particle output loop only accounts for a negligible amount of the runtime.

Appendix D. Porting to other hardware

All results in the main text (Section 3) were obtained on Intel Xeon CPUs. The code has also been successfully ported to other architectures and in the following we report results on IBM Power 6 and AMD Opteron chips.

D.1. IBM Power 6

The model was ported to a node of the IBM supercomputer at the Met Office. The system consists of IBM Power 6 nodes with 32 cores each. The cores on the chip are organised into four multi-core modules (MCMs), with 8 cores each, which share memory. For each core access to memory local to its MCM is fast, and remote (off-MCM) memory access is slower. Different strategies for distributing the threads between the cores on a node were used:

• Each thread has exclusive use of one core.
• Two threads share one physical core.

The second option is known as simultaneous multithreading (SMT), the equivalent of hyperthreading on Intel Xeon processors. For a fixed number of threads we expect a trade-off between two factors. Without SMT, each thread has exclusive use of the compute resources of one physical core, but (in the case of more than four threads) the amount of remote memory access increases as the threads are spread over a larger number of cores. On the other hand, with SMT threads have to share physical cores, but there is less remote memory access as the threads are spread over a smaller number of MCMs. Fig. D.12 shows the speedup of different parts of the code both with and without SMT. For the particle loop, which dominates the total runtime, scaling is better up to 8 threads if each thread has exclusive use of one core; beyond 8 threads the scaling breaks down. If, on the other hand, threads share a core, scaling is slightly worse initially but extends beyond 8 threads, with the curve only flattening out beyond 16 threads. This implies that for this loop off-MCM memory access is the limiting factor: scaling breaks down as soon as the threads are spread over more than one MCM. A similar effect can be observed for the particle output loop, although here the explanation is less obvious and further investigation is required to understand the factors limiting the scalability; in contrast to the particle loop, this loop contains frequent writes to shared data structures. As for the results on Intel Xeon chips presented in Appendix C, the particle output loop does not scale as well as the particle loop.

D.2. AMD Opteron

The model was also successfully ported to a SUN X4200 server with two AMD Opteron 275 dual core processors by Glenn Carver at the Chemistry Department of the University of Cambridge. Timing results for different thread numbers are listed in Table D.7, and Fig. D.13 shows the self-relative speedup on this machine for a simple particle run with 5 million particles as described in Appendix C. Note that an older version of the model was used, which explains the poor scaling of the particle output loop (as discussed in the main text, this problem has since been solved). The overall scaling is not quite as good as on Intel Xeon CPUs but is still reasonable up to four threads. It is interesting to note that when the Fortran intrinsic random number generator was used instead of the NAME-specific parallel implementation, the runtime increased by a factor of two.
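The last observation underlines how much the random number generator can affect the overall performance. The following C/OpenMP fragment is a purely illustrative sketch (it is not the scheme used in NAME, and the function and variable names are invented for this example) of one way to give each thread its own Park–Miller generator state so that no generator state is shared between threads. Note that guaranteeing statistically independent streams in a production code requires more care (e.g. well-separated subsequences) than simply choosing different seeds.

  /* Compile with -fopenmp (gcc) or the equivalent OpenMP flag. */
  #include <omp.h>
  #include <stdio.h>

  #define NSTREAMS 64          /* maximum number of thread-private streams */

  static long stream_state[NSTREAMS];

  /* Park-Miller 'minimal standard' generator, x <- 16807*x mod (2^31-1),
   * implemented with Schrage's trick to avoid overflow. Each thread only
   * ever updates its own entry of stream_state, so no locking is needed. */
  static double ran_stream(int s) {
    const long a = 16807L, m = 2147483647L, q = 127773L, r = 2836L;
    long x = stream_state[s];
    long hi = x / q, lo = x % q;
    x = a * lo - r * hi;
    if (x <= 0) x += m;
    stream_state[s] = x;
    return (double)x / (double)m;
  }

  int main(void) {
    for (int s = 0; s < NSTREAMS; ++s) stream_state[s] = 1000 + s; /* distinct nonzero seeds */
    double sum = 0.0;
    long nsamples = 0;
    #pragma omp parallel reduction(+:sum, nsamples)
    {
      int s = omp_get_thread_num() % NSTREAMS;  /* one stream per thread */
      for (int i = 0; i < 1000000; ++i) { sum += ran_stream(s); nsamples++; }
    }
    printf("mean of %ld uniform samples = %.4f (expect about 0.5)\n",
           nsamples, sum / (double)nsamples);
    return 0;
  }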


Fig. E.14. Self-relative scaling of a simple particle run on an Intel Xeon E5520 quad core processor with hyperthreading enabled.

Fig. D.12. Scaling of a simple particle run on a Power 6 node of the Met Office IBM supercomputer. Results are shown both for runs where each thread had exclusive use of a core (empty symbols) and for runs with simultaneous multithreading (SMT), where two threads share a single core (filled circles).

Table D.7
Time spent in different sections of the code for a simple ‘‘particle’’ type run as described in Appendix C. All times are given in minutes; the code was run on two AMD Opteron 275 dual core processors.

                        Number of threads
                        1        2        3        4
Particle loop           92.96    50.98    35.25    27.43
Particle output loop     0.59     0.69     1.41     2.03
Total                   94.06    52.19    37.16    29.96

Appendix E. Thread binding on Intel processors

Many Intel Xeon processors support hyperthreading, which allows more than one thread to run on a physical core. On a processor with n physical cores, each core is split into two logical cores, which we label 1.1, 1.2, 2.1, . . . , n.1, n.2. As Table 1 (and Fig. 3) demonstrate, this can be used to reduce the model runtime further for emergency response runs on Intel Xeon X5680 processors. On Intel systems thread binding is controlled at runtime by the environment variable KMP_AFFINITY; binding threads to cores can reduce the amount of data movement between them, as reported for the UM (Matthew Glover and Paul Selwood, private communications). For NAME we explored three different values:

• none (default): Threads are automatically distributed between the cores by the operating system.

• scatter: Threads are spaced out as widely as possible, i.e. the first n threads are assigned to the n physical cores in the order 1.1 → 2.1 → 3.1 → · · · → n.1, and subsequent threads are assigned to 1.2 → 2.2 → 3.2 and so on.
• compact: Threads are kept as close together as possible, i.e. the physical cores are filled in the order 1.1 → 1.2 → 2.1 → 2.2 and so on.

We compared the performance of these three settings on an Intel Xeon E5520 quad core processor for a simple run similar to the one studied in Appendix C, with a total release of 2 million particles. When run on one core, more than 95% of the runtime was spent in the particle loop. The results in Fig. E.14 show that an additional performance gain of 20% to 30% can be achieved by using hyperthreading and running 8 threads on the four physical cores of the CPU. For fewer than four threads the runtime increases if more than one thread is bound to each physical core, as the threads then have to share resources. Not binding threads to cores at all appears to be the best strategy. These results should be compared with those obtained on a node of the IBM supercomputer reported in Appendix D.1; in the latter case it can be beneficial to force threads to share a physical core beyond a certain thread number, as this improves memory throughput.
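One simple way to check where threads actually end up (an illustrative sketch, not part of NAME; sched_getcpu() is Linux-specific) is to print the logical CPU id seen by each OpenMP thread:

  #define _GNU_SOURCE
  #include <omp.h>
  #include <sched.h>
  #include <stdio.h>

  int main(void) {
    /* Each thread reports its id and the logical CPU it is currently
     * running on; compile with -fopenmp. */
    #pragma omp parallel
    {
      #pragma omp critical
      printf("thread %d of %d on logical CPU %d\n",
             omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
  }

Running this with KMP_AFFINITY set to scatter or compact, or left unset, makes the differences between the three settings directly visible.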

Fig. D.13. Self-relative speedup on a SUN X4200 server with two AMD Opteron 275 dual core processors. A simple particle test case with 5 million particles as described in Appendix C was used. Note that an older version of the model was used, in which the scaling of the particle output loop is poor. However, this part of the code accounts for only a very small fraction of the runtime and has a negligible effect on the scalability of the overall runtime.

