GPU and Multi-core based Reaction Ensemble Monte Carlo method for non-ideal thermodynamic systems

M. Tuttafesta (a,*), A. D'Angola (b), A. Laricchiuta (c), P. Minelli (c), M. Capitelli (a), G. Colonna (c)

(a) Università di Bari, via Orabona 4, 70126 Bari, Italy
(b) Scuola di Ingegneria SI, Università della Basilicata, via dell'Ateneo Lucano 10, 85100 Potenza, Italy
(c) CNR-IMIP Bari, via Amendola 122/D, 70126 Bari, Italy
(*) Corresponding author. E-mail address: [email protected] (M. Tuttafesta).

Article history: Received 28 January 2013; Received in revised form 6 October 2013; Accepted 16 October 2013; Available online xxxx.

Keywords: REMC; Thermodynamics of plasmas; Monte Carlo; Real gases; CUDA; OpenMP

Abstract

A Graphics Processing Unit (GPU)-CUDA C version and a (Multi-core)-OpenMP version of the Reaction Ensemble Monte Carlo (REMC) method are presented. The REMC algorithm is a powerful tool to investigate the equilibrium behavior of chemically reacting systems in highly non-ideal conditions. Both the GPU and the Multi-core versions of the code are particularly efficient when the total potential energy of the system must be calculated, as in constant-pressure systems. Results, obtained in the case of Helium plasma at high pressure, show differences between the real and ideal cases.

1. Introduction

Reaction Ensemble Monte Carlo (REMC) [1,2] is a molecular-level computer simulation technique for predicting chemical equilibrium and thermodynamic properties under non-ideal conditions. Non-ideal plasmas are currently of great experimental and theoretical interest for astrophysics and for applications in power engineering such as catalyst development, nanoporous material manufacturing, supercritical fluid separation, propulsion and combustion science, and novel energy storage [3–7]. Accounting for intermolecular forces between reacting components in non-ideal systems is critical for optimizing processes and applications involving chemical reactions. REMC is a computational tool for the exact calculation of macroscopic plasma properties, considering the interaction between particles and simultaneous multiple reactions. The information required by molecular-level computer simulation techniques consists of the interaction potentials and the internal thermodynamic properties of the reacting components. REMC is based on the random sampling (Metropolis algorithm) of the grand-canonical ensemble for a multicomponent reacting plasma.




It consists in a combination of three types of state transitions: particle displacements, reaction moves and volume changes. The REMC approach, which accounts for the interaction potential among the particles, is particularly suitable at high pressure, where non-ideal effects become important. The method has been successfully applied to Helium [8], Argon and air plasmas [9,2], consisting of 2, 7 and 26 ionization reactions respectively. The interactions between charged particles in these calculations were described by Deutsch potentials, while both the neutral–neutral and the neutral–ion interactions were approximated by Exp-6 potentials [10]. In this work a GPU-CUDA C version of REMC is presented and applied to a high-pressure helium plasma. A considerable speed-up with respect to the CPU calculation is obtained by using the GPU code, in particular for the calculation of the total interaction energy. To assess the efficiency of the GPU calculation we have also compared the computational times with an OpenMP Multi-core calculation and with a multi-GPU computer [11].

2. The REMC method

Reaction equilibrium in plasmas at specified temperature and pressure is attained when the Gibbs free energy of the system is minimized subject to mass conservation and charge neutrality. The REMC method involves molecular and atomic production and deletion, as well as changes of particle identities during the simulation.


Adding to the total partition function the internal contribution $Q_i^{int}$ of the isolated molecule i (which includes rotational, vibrational, electronic and nuclear contributions), the grand canonical partition function for a mixture of s reacting components is [12]

$$Q = \sum_{N_1=0}^{\infty}\cdots\sum_{N_s=0}^{\infty}\int\cdots\int d\mathbf{x}^{N_1}\,d\omega^{N_1}\cdots d\mathbf{x}^{N_s}\,d\omega^{N_s}\;\exp\!\left[\Upsilon-\beta\,\Phi\!\left(V,\mathbf{x}^{N_i}\right)\right] \qquad (1)$$

where

$$\Upsilon=\sum_{i=1}^{s}\left[\beta N_i\mu_i-\log\left(N_i!\right)+N_i\log\!\left(\frac{V\,Q_i^{int}}{\lambda_{th}^{3}}\right)\right] \qquad (2)$$

and $\beta = 1/kT$. Deviations from ideality in plasmas arise from the long-range Coulomb interaction between charged particles, from the lowering of the ionization potential, which depends on the composition, and from the short-range neutral–neutral and neutral–charge interactions. Thermodynamic and transport properties of plasmas used in fluid dynamic codes [13–20] are calculated in the framework of Debye–Hückel theory, which takes into account the first two sources of non-ideality. In a plasma, the Coulomb interactions lower the ionization potential of the ions and the thermochemical quantities become composition dependent, resulting in an inherently nonlinear process. This effect can be accounted for in the REMC method. The REMC method generates a Markov chain to simulate the properties of the system governed by Eq. (1); the chain consists of a combination of three main types of Monte Carlo steps: (1) particle displacements, D; (2) reaction moves, R; (3) volume changes at fixed pressure, V. Particles are initially placed in a simulation box and the total energy of the system is evaluated considering that only pairwise interactions are relevant. The initial configuration may correspond to a regular crystalline lattice or to a random configuration with no hard-core overlaps [21,22]. A particle is randomly selected and a random displacement is given by

$$\mathbf{r}_l = \mathbf{r}_k + \Delta\,(\varsigma - 0.5) \qquad (3)$$

where k and l are respectively the initial and the final state of the system, Δ are small displacements properly selected and ς are random numbers uniformly distributed between 0 and 1. The reverse trial move is equally probable. The particle displacement is accepted with a transition probability k → l given by

$$W_{kl}^{D} = \min\left[1, \exp\left(-\beta\,\Delta\Phi_{kl}\right)\right]. \qquad (4)$$
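As an illustration of Eqs. (3) and (4), a minimal C sketch of a single displacement trial move is given below. It is not taken from the REMC code described here, and the names (pos, pair_u, energy_of, displacement_move, uniform) are hypothetical placeholders; the pair potential and the neglect of periodic images are for brevity only.

#include <math.h>
#include <stdlib.h>

#define NPART 100

/* Hypothetical particle coordinates and pair potential (placeholder ~ r^-6). */
static double pos[NPART][3];
static double pair_u(double r2) { return 1.0 / (r2 * r2 * r2); }

/* Uniform random number in [0,1). */
static double uniform(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

/* Interaction energy of particle k with all other particles (no periodic images here). */
static double energy_of(int k)
{
    double e = 0.0;
    for (int j = 0; j < NPART; j++) {
        if (j == k) continue;
        double r2 = 0.0;
        for (int d = 0; d < 3; d++) {
            double dr = pos[k][d] - pos[j][d];
            r2 += dr * dr;
        }
        e += pair_u(r2);
    }
    return e;
}

/* One displacement trial move, Eqs. (3)-(4). delta: maximum displacement, beta = 1/kT. */
void displacement_move(double delta, double beta)
{
    int k = (int)(uniform() * NPART);            /* randomly selected particle */
    double old[3] = { pos[k][0], pos[k][1], pos[k][2] };
    double e_old = energy_of(k);

    for (int d = 0; d < 3; d++)                  /* Eq. (3): r_l = r_k + Delta*(zeta - 0.5) */
        pos[k][d] = old[d] + delta * (uniform() - 0.5);

    double dPhi = energy_of(k) - e_old;
    /* Eq. (4): accept with probability min[1, exp(-beta*dPhi)] */
    if (uniform() >= exp(-beta * dPhi)) {
        pos[k][0] = old[0]; pos[k][1] = old[1]; pos[k][2] = old[2];   /* rejected: restore */
    }
}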

When a particle crosses the boundary of the box, the problem of surface effects can be overcome by implementing periodic boundary conditions. The cubic box is replicated throughout space to form an infinite lattice. There are no walls at the boundary of the central box and no surface molecules; the box simply forms a convenient axis system for measuring the coordinates of the N molecules. When periodic boundary conditions are adopted, an infinite number of terms would be required to evaluate the total energy. For a short-range potential energy function, the summation can be restricted by considering each particle at the center of a region with the same size and shape as the basic simulation box, so that it interacts only with the molecules whose centers lie within this region. In REMC simulations at fixed pressure and temperature, the volume is simply treated as an additional coordinate, and trial moves in the volume must satisfy the same rules as trial moves in positions. A volume trial move consists of an attempted change of the volume from $V_k$ to $V_l$ [21,22]

$$V_l = V_k + \Delta V\,(\varsigma - 0.5) \qquad (5)$$

and such a random volume-changing move will be accepted with the probability

$$W_{kl}^{V} = \min\left[1,\; \exp\left(-\beta\,\Delta\Phi_{kl} - \beta\,(V_l - V_k)\,P + N\log\frac{V_l}{V_k}\right)\right]. \qquad (6)$$
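For clarity, the acceptance test of Eq. (6) can be sketched as a small C routine. This is illustrative only (volume_move_accepted and its arguments are hypothetical names), and the recomputed configurational energy change dPhi is assumed to be supplied by the caller.

#include <math.h>
#include <stdlib.h>

/* Acceptance test for a volume trial move, Eq. (6).
   dPhi : change in configurational energy after rescaling the box,
   V_old, V_new : volumes before and after the trial move,
   P : imposed pressure, beta = 1/kT, N : current number of particles. */
int volume_move_accepted(double dPhi, double V_old, double V_new,
                         double P, double beta, int N)
{
    double arg = -beta * dPhi
                 - beta * (V_new - V_old) * P
                 + (double)N * log(V_new / V_old);
    double w = exp(arg);                               /* acceptance probability min[1, w] */
    double zeta = (double)rand() / ((double)RAND_MAX + 1.0);
    return zeta < ((w < 1.0) ? w : 1.0);
}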

When a volume change is accepted, the size of the box and the relative positions of the particles change. In order to compute $W_{kl}^{V}$, the configurational energy of the new configuration must be recalculated considering the new distances between particles. To simulate a chemically reacting system at specified temperature and pressure, rather than at constant temperature and volume, a trial volume change as in Eq. (6) must be considered. Chemical reactions are treated as follows. In the case of a multiple-reaction system [23], for any linearly independent set of $n_R$ chemical reactions given by

$$\sum_{i=1}^{s_j} \nu_{ij}\,X_i = 0, \qquad j = 1, 2, \ldots, n_R \qquad (7)$$

and considering that

$$N_i = N_i^{0} + \sum_{j=1}^{n_R} \nu_{ij}\,\xi_j, \qquad i = 1, \ldots, s \qquad (8)$$

the transition probability for a step $\xi_j$ is [12]

$$W_{kl}^{\xi_j} = \min\left[1,\; \left(\beta\,p^{0}\,V\right)^{\bar{\nu}_j\,\xi_j}\left(K_{p,j}^{0}\right)^{\xi_j}\prod_{i=1}^{s_j}\frac{N_i!}{\left(N_i+\xi_j\,\nu_{ij}\right)!}\;\exp\left(-\beta\,\Delta\Phi_{kl}\right)\right] \qquad (9)$$

where $\bar{\nu}_j = \sum_{i=1}^{s_j} \nu_{ij}$ is the net change in the total number of molecules for reaction j. Finally, the procedure for a reaction move is as follows: (a) a reaction is randomly selected; (b) the reaction direction, forward or reverse, is randomly selected; (c) a set of reactants and products is randomly selected according to the stoichiometry of the selected reaction; (d) the reaction move is accepted by evaluating the probability associated with performing the changes of particle identities, together with particle insertions and deletions if the total number of molecules changes during the selected reaction.

3. Case study: Helium plasma

For the Helium plasma (4 species: He, He+, He++, e−) we consider, for the He–He+ interaction, three different potentials: Exp-6, an average and a statistical selection of the attractive and repulsive Hulburt–Hirschfelder potentials [24]. The following formulas have been used:

1. Exp-6 [8]:

$$u(r) = \begin{cases} \infty, & r < r_c \\ A\exp(-Br) - C_6\, r^{-6}, & r \ge r_c \end{cases}$$

where $A = 6.161\cdot10^{-17}$ J, $B = 5.2802\cdot10^{10}$ m$^{-1}$, $C_6 = 0.642\cdot10^{-79}$ J m$^6$, $r_c = 0.6008\cdot10^{-10}$ m.

2. Hulburt–Hirschfelder average (HH-modRep-mix) [24]: $u_{mix}(r) = \left(u_{HH}(r) + u_{modRep}(r)\right)/2$, where

$$u_{HH}(r) = \bar{u}_{HH}(\bar{r}(r)) = \phi_0\left\{\exp\left[-2\alpha_{HH}\bar{r}\right] - 2\exp\left[-\alpha_{HH}\bar{r}\right] + \beta_{HH}\,\bar{r}^3\left[1+\gamma_{HH}\bar{r}\right]\exp\left[-2\alpha_{HH}\bar{r}\right]\right\}$$

$$\bar{r} = \frac{r}{r_e} - 1, \quad \phi_0 = 2.4730\ \text{eV}, \quad r_e = 1.081\ \text{Å}, \quad \alpha_{HH} = 2.23, \quad \beta_{HH} = 0.2205, \quad \gamma_{HH} = 4.389

and

$$u_{modRep}(r) = \phi_1 \exp\left(-a r - b r^2 - c r^3\right), \quad \phi_1 = 359\ \text{eV}, \quad a = 4.184\ \text{Å}^{-1}, \quad b = -0.649\ \text{Å}^{-2}, \quad c = 0.08528\ \text{Å}^{-3}.$$

3. Hulburt–Hirschfelder statistical selection (HH-modRep-stat) [24]:

$$u_{stat}(r, \rho) = \begin{cases} u_{HH}(r), & 0 < \rho \le 0.5 \\ u_{modRep}(r), & 0.5 < \rho \le 1 \end{cases}$$

where ρ is a random number uniformly distributed in [0, 1], selected at every numerical evaluation of $u_{stat}$.

These potentials are reported and compared in Fig. 1.

Fig. 1. Exp-6, $u_{HH}$, $u_{modRep}$ and $u_{mix}$ potentials as a function of the He–He+ interparticle distance.

4. Numerical methods and performance profiles

The REMC base algorithm is schematized in Table 1 for a given temperature and pressure. The algorithm implementation for a single CPU has been compared with three high-performance versions: Single-GPU CUDA, Multi-core OpenMP and Multi-GPU CUDA. The details of these comparisons are explained below; a minimal code sketch of the cycle/move loop follows Table 1.

Table 1
General algorithm.
Input: pressure, temperature, initial particle distribution, potential types
Assign: initial particle positions, pair-potential pointers to function
Start cycles loop for equilibration
  Start moves loop inside a cycle
    Move-function: randomly generated movement (volume, particle, reaction) with fixed probabilities
  End moves loop inside a cycle
  Output: variables as a function of cycles
End cycles loop for equilibration
Cycles loop for averages: same scheme as equilibration
Output: variables averaged at a given temperature and pressure
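The loop structure of Table 1 can be sketched in plain C as follows. This is an illustrative skeleton, not the authors' implementation, and all names (pick_move, run_cycles, moves_per_cycle) are hypothetical; the move bodies are left empty.

#include <stdlib.h>

/* Hypothetical move selector mirroring Table 1: position, volume and reaction
   moves are attempted with fixed probabilities (here in the ratio N : 1 : N). */
enum move_type { MOVE_POSITION, MOVE_VOLUME, MOVE_REACTION };

static enum move_type pick_move(int n_particles)
{
    int r = rand() % (2 * n_particles + 1);
    if (r < n_particles) return MOVE_POSITION;
    if (r == n_particles) return MOVE_VOLUME;
    return MOVE_REACTION;
}

/* One block of cycles (equilibration or averaging), as in Table 1. */
void run_cycles(int n_cycles, int moves_per_cycle, int n_particles)
{
    for (int c = 0; c < n_cycles; c++) {
        for (int m = 0; m < moves_per_cycle; m++) {
            switch (pick_move(n_particles)) {
            case MOVE_POSITION: /* trial displacement, Eqs. (3)-(4) */ break;
            case MOVE_VOLUME:   /* trial volume change, Eqs. (5)-(6) */ break;
            case MOVE_REACTION: /* trial reaction move, Eq. (9)      */ break;
            }
        }
        /* Output: accumulate variables as a function of cycles */
    }
}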

4.1. Single-CPU and single-GPU

In order to compare single-CPU and single-GPU execution of the algorithm we built a computer program which can optionally run either the GPU or the CPU module. It must be pointed out that both the CPU and the GPU modules implement the same algorithm schematized in Table 1. The basic steps of the CPU and GPU programs (Move-functions) are reported in Table 2, where the functions marked by a box contain a large loop that is parallelized in GPU kernels. GPUdeltaU is the parallel version of CPUdeltaU; both functions calculate the change in configurational energy for a given movement. B and T are, respectively, the number of blocks per grid and of threads per block in the kernel configuration. Typically we adopt T = 128 for all kernels, B = 32 for particle or reaction movements and B = 64 for volume movements.

In order to evaluate the speed-up = (CPU execution time)/(GPU execution time) of our code we perform a test calculation considering only 20 cycles for equilibration and 80 cycles for averages, but varying the number of particles N for the Helium plasma. In order to perform a speed-up test, every cycle consists of exactly 1 volume move, N̄ position moves and N̄ reaction moves, where N̄ is the half-sum of the maximum and the minimum number of particles during a simulation run. Temperature and pressure are fixed to 20 000 K and 400 MPa respectively. The GPU module is executed using the hardware and the CUDA Toolkit described in Table 3, which has been freely provided by Exxact Corporation [11]. The CPU module, executed separately from the GPU one, was run on the hardware described in Table 4. Fig. 2 shows the single-CPU (Table 4) total execution time for each movement and for the whole module as a function of the number of starting particles (100% He). The volume change move takes a lower total computational time than the position and reaction ones because position, volume and reaction movements are in the ratio N̄ : 1 : N̄. In order to compare the execution time of different code versions we adopt the following definition [25]:

$$\text{Speed-up (code-1)/(code-2)} = \frac{\text{Execution time of code-1}}{\text{Execution time of code-2}}. \qquad (10)$$

In Fig. 3 the speed-ups (single-CPU/single-GPU) are shown for each movement and for the whole module as a function of the number of starting particles (100% He). The Move-volume kernel speed-up is greater than the Move-position and Move-reaction ones because the former involves ∝ N² threads while the latter involve ∝ N threads. Due to the move ratio Move-position : Move-volume : Move-reaction = N̄ : 1 : N̄, the lower speed-ups dominate that of the whole module. The execution time of a code belonging to either the CPU or the GPU version is measured by the standard C clock() function and by two variables t_start and t_total according to the following scheme:

Listing 1: Timing the CPU or GPU code
t_start = clock();
// ... CPU or GPU code to be timed
cudaDeviceSynchronize(); // only for GPU code
t_total += ((double)(clock() - t_start)) / CLOCKS_PER_SEC;

Both the CPUdeltaUmove function and the corresponding GPUdeltaUmove kernel, pointed out in Table 2, calculate the change in configurational energy ΔΦ (Eq. (6) for volume, Eq. (4) for position and Eq. (9) for reaction movements). If N is the number of particles, the calculation consists of independent terms, whose number is proportional to N in the case of position and reaction movements and to N² for volume movements.
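The division of labor summarized in Table 2 can be illustrated by the following host-side CUDA C sketch. It is not the authors' code (gpu_Pmove, gpu_move_function and d_deltaU are hypothetical names): the trial configuration and the acceptance test remain on the CPU, while only the ΔU evaluation runs on the GPU.

#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

__global__ void gpu_Pmove(double *d_deltaU)
{
    // placeholder kernel: a real GPUdeltaU kernel would accumulate the change in
    // pair energies for the trial configuration into *d_deltaU (cf. Listing 14)
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_deltaU = 0.0;
}

void gpu_move_function(double beta, int B, int T)
{
    double *d_deltaU = NULL, deltaU = 0.0;
    cudaMalloc((void **)&d_deltaU, sizeof(double));

    // Random Move-configuration: generated on the host (CPU random function)

    // Move-configuration probability: evaluate delta U on the device
    gpu_Pmove<<<B, T>>>(d_deltaU);

    // Copy delta U back to the host
    cudaMemcpy(&deltaU, d_deltaU, sizeof(double), cudaMemcpyDeviceToHost);

    // Accept/Reject on the host with a Metropolis-type test
    // (extra factors for volume and reaction moves omitted)
    double zeta = (double)rand() / ((double)RAND_MAX + 1.0);
    if (zeta < exp(-beta * deltaU)) {
        // accepted: update the device-side configuration (cudaMemcpy)
    }
    cudaFree(d_deltaU);
}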


Table 2
CPU and GPU subroutines in a Move-function.

Operation | CPU main subroutine | GPU main subroutine
Random Move-configuration | CPU random function | CPU random function
Move-configuration probability: evaluate ΔUmove | CPU Pmove function | GPU Pmove function <<<Bmove, Tmove>>>(...)
Copy ΔUmove from DEVICE to HOST | – | cudaMemcpy
Accept/Reject the Move-configuration | CPU random function | CPU random function
Update configuration | – | cudaMemcpy

Table 3
Single-GPU basic hardware and CUDA Toolkit employed (thanks to Exxact Corporation [11]).

GPU: GeForce GTX 680
  CUDA capability: 3.0
  Base clock: 1006 MHz
  Boost clock: 1058 MHz
  Global memory: 4 GB
  Memory speed: 6.0 Gbps
  MP × cores/MP = cores: 8 × 192 = 1536
  Constant memory: 65 536 bytes
  Shared memory per MP: 49 152 bytes
  Registers per MP: 65 536
  Threads in warp: 32
  Max threads per block: 1024
  Max sizes of a block: 1024 × 1024 × 64
  Max sizes of a grid: 2 147 483 647 × 65 535 × 65 535
CPU: Intel(R) Xeon(R) CPU E5-2660
  Clock rate: 2.20 GHz
  Cache size: 20 480 kB
CUDA Toolkit: Release 5.0

Fig. 2. Single CPU (Table 4) total execution time for each movement and for the complete module as a function of the number of initial particles (100% He).

Table 4
Single CPU employed.

CPU: Intel(R) Xeon(R) X5690
  Clock rate: 3.47 GHz
  CPU family: 6
  Model: 44
  Stepping: 2
  CPU MHz: 1600.000
  Cache size: 12 288 kB
  Siblings: 6
  Bogomips: 6933.02
  Clflush size: 64
  Cache_alignment: 64
  Address sizes: 40 bits physical, 48 bits virtual

The CPUdeltaUmove function performs a sequential loop over the N threads and a contextual summation of N scalar terms. The GPUdeltaUmove kernel parallelizes this loop using 32 or 64 blocks per grid and 128 threads per block, and uses the shared memory to perform a final summation–reduction of the N terms. Inside a single thread, the pair-potential energy of a pair of particles is calculated by means of a multi-potential function, whose implementation is described in Appendix A.1. In order to obtain good generality, flexibility and reusability of the code, C language structs and pointers to functions have been widely used. So, for instance, the information about particles is stored inside an array of structs, as described in Appendix A.2. We did not use the curand CUDA library; the implementation of the multi-potential function calling, described in Appendix A.3, uses our hand-coded pseudo-random number generator, which has been suitably tested (see Appendix B).

4.2. Single-CPU and Multi-core

As in the previous section, temperature and pressure are fixed to 20 000 K and 400 MPa respectively.

Fig. 3. Detailed speed-up single-CPU (Table 4)/single-GPU (Table 3) for each movement and for the complete module as a function of the number of initial particles (100% He).

The Multi-core module has been implemented using the OpenMP API (www.openmp.org), so hereafter we refer to it as the OMP module. It was built from the Single-CPU module following the same parallelization scheme as the GPU one. The basic steps of the CPU and OMP Move-functions are reported in Table 5 where, as in the GPU case, the functions marked by a box contain a large loop that is parallelized using suitable OpenMP directives (see Appendix D for details). The standard C clock() function cannot be used to measure execution times in a Multi-core system, so the code belonging to either the CPU or the OMP version is timed through the omp_get_wtime function, according to the following scheme:

Listing 2: Timing the CPU or OMP code
t_start = omp_get_wtime();
// ... serial-CPU or parallel-OpenMP code to be timed
t_total += (omp_get_wtime() - t_start);

Another important change that we had to make in the OMP module with respect to the CPU one concerns the use of the function rand. To avoid race conditions, typical of multithreading parallelization systems such as OpenMP, we replace the rand() function in parallelized loops with a hand-coded (pseudo) random number generator, which has the same implementation as the device function d_randREMC defined in Appendix A.3, List. 15.


Table 5
CPU and OMP subroutines in a Move-function.

Operation | CPU main subroutine | OMP main subroutine
Random Move-configuration | CPU random function | CPU random function
Move-configuration probability: evaluate ΔUmove | CPU Pmove function | OMP Pmove function
Accept/Reject the Move-configuration | CPU random function | CPU random function
Update configuration | – | –

Table 6
Multi-core employed: 6 + 6 = 12 cores of the type described in Table 4.
6 cores Intel(R) Xeon(R) X5690 + 6 cores Intel(R) Xeon(R) X5690

Fig. 4. Detailed speed-up Single-CPU (Table 4)/Multi-core (Table 6) for each movement and for the complete module as a function of the number of initial particles (100% He).

Fig. 5. Execution time of the whole single modules CPU, GPU, OMP as a function of the number of initial particles (100% He).

The hardware employed is schematized in Table 6, consisting of two chips, each of which integrates 6 CPUs of the type described in Table 4. The detailed speed-ups, reported in Fig. 4 as a function of the number of particles, are mostly lower than those provided by the Single-GPU implementation (see Fig. 3), but weakly dependent on the number of particles. In our simulations, the OpenMP code uses 12 cores and reaches its optimal performance already at 500 particles, while the 1536-core GPU device reaches full occupancy for a higher number of particles [26]. It should be noted that the OpenMP/CPU speed-up for the move-volume step reaches a maximum of 6, with an efficiency (speed-up/number of cores) of 50%, while the GPU/CPU one reaches a maximum speed-up higher than 100, with an efficiency lower than 33%. This large speed-up shows that the GPU code is particularly suitable for the move-volume step, making the computational time of this step comparable with that of the other movements.

4.3. Single-GPU and Multi-core

Referring to the tests described in the previous two sections, it is useful to compare the execution times of the complete single modules: CPU, GPU, OMP. We then obtain the profiles shown in Fig. 5, from which it is clear that, in our implementation, the OMP approach is more convenient for a low number of particles while the GPU paradigm is definitely more powerful for a large number of particles.

4.4. Multi-GPU and Multi-core for a multi-temperature analysis

The use of Multi-GPU is really effective when the communication between different GPUs is limited and the memory used by

each GPU is not too large. Therefore, having n devices, we can run n instances of the single-GPU module. Each instance, through the cudaSetDevice function, lets one device perform the simulation over a group of temperatures, so the total execution time of the entire simulation may decrease by a factor n. Similarly, in an n-core system, we can achieve an n-fold reduction of the whole execution time simply by running n instances of the single-CPU module, using the same strategy as for the Multi-GPU calculation.

5. Numerical results

GPU-CUDA C and (Multi-core)-OpenMP versions of REMC are presented in order to investigate the equilibrium behavior of chemically reacting systems in highly non-ideal conditions. Results, obtained in the case of Helium plasma at high pressure, show that both the GPU and the OpenMP versions of the code are efficient especially for volume movements, where the total energy of the system (proportional to N²) must be calculated. We performed REMC simulations for Helium plasma at a fixed pressure of p = 400 MPa and for temperatures ranging from 20 000 to 100 000 K. For a given temperature we followed these steps:
1. Set the initial number of particles: [He, He+, He++, e−] = [500, 0, 0, 0];
2. REMC simulation in the Ideal Gas (IG) condition, obtaining the corresponding particle distribution;
3. REMC simulation starting from the IG distribution and considering the Exp-6 or u_stat interaction potential.
In all simulations we generated 300 cycles for equilibration and 700 cycles to accumulate averages of the desired quantities. Each cycle consists of a fixed number of total moves n = n_D + n_V + n_R. Particle moves n_D, volume moves n_V and reaction moves n_R are selected randomly with the probability ratio N̄ : 1 : N̄, where N̄ is the half-sum of the maximum and the minimum number of particles during a simulation run. Figs. 6 and 7 show the molar fractions, molar volume, molar enthalpy and excess molar energy obtained at 400 MPa by using the Exp-6 potential. In particular, the excess molar energy represents the molar potential energy [8].


Fig. 6. Molar fractions and molar volume obtained at 400 MPa by using Exp-6 potential. Dashed lines refer to the IG case.

Fig. 7. Molar enthalpy and excess molar energy obtained at 400 MPa by using Exp-6 potential. Dashed lines refer to the IG case.

Fig. 8. Mean particle spacing obtained at 400 MPa by using Exp-6 potential. Dashed lines refer to the IG case.

Results have also been obtained in the ideal gas (IG) case, where no interactions are considered. Some differences can be observed, especially for the molar volume. The mean particle distance, shown in Fig. 8, is larger than 5 Å, so no differences between the multi-potential curves of Fig. 1 can be observed. At higher pressures, where the mean interparticle distances become lower, the effects of the different multi-potential curves can be appreciated. Results obtained with the CPU, GPU and OMP codes are in excellent agreement. As an example, Fig. 9 reports the excess molar energies obtained with the different codes. Statistical errors, which depend on the number of particles and on the number of cycles for averages, are also shown. In these simulations the initial number of particles (100% He) is 200 and 500 for the CPU and GPU/OMP cases respectively.

6. Conclusions

In this paper we have presented different parallel versions of the REMC algorithm to calculate thermodynamic properties of non-ideal plasmas and gases. The method is exact in the sense

that it is equivalent to the complete virial expansion series, starting from the interaction potentials. REMC is suitable for GPU calculation, especially for constant-pressure simulations where the total potential energy of the system must be recalculated, significantly increasing the execution speed of a single-GPU calculation with respect to a single-CPU calculation. The GPU is also more efficient than the Multi-core calculation when the number of particles is larger than 1000. The maximum speed-up obtained by our OpenMP implementation on a 12-core hardware is about 5. A strategy for the use of multi-GPU is also presented, considering independent calculations on each GPU. A similar approach can also be used for Multi-core calculations, further improving the speed-up. The speed-up obtained with the GPU calculation is not negligible, making the execution time on GPU devices comparable with that of Multi-core CPUs.

Acknowledgments

The authors would like to thank Dr. Mike Sucharew and Dr. Andrew Nelson of the ''Exxact Corporation'' [11] (Corporate Headquarters: 45445 Warm Springs Blvd., Fremont, CA 94539, USA) for their generous and comprehensive support in the remote use of the GeForce GTX 680 GPUs.

Fig. 9. Excess molar energy obtained with CPU, GPU and OMP codes. Standard deviations of the Monte Carlo method are also reported. In the simulations the initial number of particles (100% He) is 200 and 500 respectively for the CPU and GPU/OMP cases.

Appendix A. CPU and GPU implementation details

A.1. Potential energy calculation

The implementation of the multi-potential function, which calculates the pair-potential energy of a pair of particles, is for CPU

Listing 3: mpot (CPU)
double mpot(double r, DATAMPOT *dmp, double rnd)
{
    int i;
    for (i = 0; i < (dmp->npot)-1 && rnd > (dmp->w[i]); i++);
    return dmp->pot[i](r, &(dmp->dp[i]));
}

and for GPU

Listing 4: mpot (GPU)
__device__ double d_mpot(double r, DATAMPOT *dmp, double rnd)
{
    // ... same as mpot
}

where r is the interparticle distance, rnd is a random number uniformly distributed in [0, 1] and DATAMPOT is a typedef of the following struct:

Listing 5: DATAMPOT
typedef struct {
    int npot;
    double w[NPOT_MAX];
    POT pot[NPOT_MAX];
    DATAPOT dp[NPOT_MAX];
} DATAMPOT;

where npot is the number of single-potentials belonging to a given DATAMPOT instance, whose maximum value is NPOT_MAX. The array w contains the statistical limits in the interval [0, 1], that is $0 < w[0] < w[1] < \cdots < w[npot-2] < 1$ (with $w[-1] \equiv 0$), so $w[i] - w[i-1]$ is the statistical weight of the i-th single-potential. The single-potential calculation is obtained by using an array of pointers to function, pot, defined as type POT

typedef double (*POT)(double r, DATAPOT *dp);   (A.1)

The parameter r in (A.1) is, as above, the interparticle distance and DATAPOT is a typedef of the following struct:

Listing 6: DATAPOT
typedef struct {
    double pr[PR_MAX];
} DATAPOT;

where the array pr contains the scalar parameters associated with a single-potential function.

A.2. Input and assignment

Initial particle species and distributions are read from file at the Input stage of the general algorithm (see Table 1), together with the potential types for every different species pair. The information about particles is stored in an array of structs allocated as follows:

part = (PARTICLE *)malloc(Npmax * sizeof(PARTICLE));          // for CPU
cudaMalloc((void **)&dev_part, Npmax * sizeof(PARTICLE));     // for GPU

where Npmax is the maximum number of particles in a simulation run and PARTICLE is a typedef of the struct

Listing 7: PARTICLE
typedef struct {
    // Coordinates
    double x;
    double y;
    double z;
    int s;      // Species index
    double m;   // Mass
    int c;      // Ionic charge (in elementary units)
} PARTICLE;

The information about the species-pair potentials is stored in an array of DATAMPOT structs allocated as follows:

Listing 8: datampot array allocation
// For CPU
datampot = (DATAMPOT *)malloc(NSPES*NSPES*sizeof(DATAMPOT));
...
// For GPU
cudaMalloc((void **)&dev_datampot, NSPES*NSPES*sizeof(DATAMPOT));

where NSPES is the number of species in a simulation run. Specific functions for each species-pair potential are defined according to the prototype in (A.1). They are defined in double versions, e.g.:

Listing 9: Species-pair potentials
// For CPU
double fpotname1(...);
double fpotname2(...);
...
// For GPU
__device__ double d_fpotname1(...);
__device__ double d_fpotname2(...);
...

The pot member of the datampot instance is assigned to the above potential functions in the Assign stage of the general algorithm (see Table 1) according to the following scheme:

Listing 10: pot member assignment
for (i = 0; i < NSPES; i++) {
    for (j = 0; j <= i; j++) {
        for (k = 0; k < datampot[NSPES*j+i].npot; k++) {
            // read potname from input ...
            if (!strcmp(potname, "fpotname1")) {
                // Assign single-potential function
                datampot[NSPES*j+i].pot[k] = fpotname1;
                ...
            } else if (!strcmp(potname, "fpotname2")) {
                // Assign single-potential function
                datampot[NSPES*j+i].pot[k] = fpotname2;
                ...
            }
            ...
        }
        // Symmetric copy
        datampot[NSPES*i+j] = datampot[NSPES*j+i];
    }
}

The dev_datampot instance is assigned by first copying the whole datampot array:

Listing 11: dev_datampot assignment, stage 1/2
cudaMemcpy(dev_datampot, datampot, NSPES*NSPES*sizeof(DATAMPOT), cudaMemcpyHostToDevice);

After that copy, the pot member of dev_datampot is not assigned correctly (it still contains host function addresses), so we complete the assignment as follows:

Listing 12: dev_datampot assignment, stage 2/2
// for every i, j, k indexes and every fpotname function
...
cudaMemcpyFromSymbol(&pot_pointer, d_fpotname1_pointer, sizeof(POT));
cudaMemcpy(&(dev_datampot[NSPES*j+i].pot[k]), &pot_pointer, sizeof(POT), cudaMemcpyHostToDevice);

where pot_pointer is a local POT variable and d_fpotname1_pointer is an auxiliary pointer that is associated with the d_fpotname1 function and globally assigned as

__device__ POT d_fpotname1_pointer = d_fpotname1;

A.3. Multi-potential function calling

In the CPU module, the implementation scheme of the CPUdeltaUvolume function, named CPUdeltaU_vol, is as follows:

Listing 13: CPUdeltaU_vol
void CPUdeltaU_vol(double *delta_ept, DATAMPOT *dmp, double Lold, double Lnew)
{
    ...
    for (k = 0; k < Np*Np; k++) {
        // the pair indexes i, j and the distances r_old, r_new are derived from k here
        ...
        if (j > i) {
            ...
            *delta_ept += mpot(r_new, dmp+NSPES*part[i].s+part[j].s, (double)rand()/RAND_MAX)
                        - mpot(r_old, dmp+NSPES*part[i].s+part[j].s, (double)rand()/RAND_MAX);
        }
    }
}

where delta_ept is the returned change in the volume configurational energy (the ΔΦ in (6)), dmp is the current CPU datampot array allocated in List. 8, Lold (Lnew) is the cube root of the current (changed) volume, Np is the instantaneous number of particles, and r_old (r_new) is the distance between particles i and j in the current (changed) volume. It is important to note that the third argument of the function mpot, defined in List. 3, which has to be a random number uniformly distributed in [0, 1], is here obtained by the function rand, belonging to the stdlib C library.

In the GPU module, the implementation scheme of the GPUdeltaUvolume kernel, named GPUdeltaU_vol, is as follows:

Listing 14: GPUdeltaU_vol
#include "lock.h"
...
__global__ void GPUdeltaU_vol(double *delta_ept, DATAMPOT *dmp, double Lold, double Lnew,
                              Lock lock, PARTICLE *part, int Np, int NSPES, int *MPr)
{
    __shared__ double cache[threadsPerBlock];
    int cacheIndex = threadIdx.x;
    ...
    double delta_ept_temp = 0.0;
    k = threadIdx.x + blockIdx.x*blockDim.x;
    while (k < Np*Np) {
        // the pair indexes i, j and the distances r_old, r_new are derived from k here
        ...
        if (j > i) {
            ...
            delta_ept_temp += d_mpot(r_new, dmp+NSPES*part[i].s+part[j].s, d_randREMC(MPr+k))
                            - d_mpot(r_old, dmp+NSPES*part[i].s+part[j].s, d_randREMC(MPr+k));
        }
        k += blockDim.x*gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = delta_ept_temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int ir = blockDim.x/2;
    while (ir != 0) {
        if (cacheIndex < ir)
            cache[cacheIndex] += cache[cacheIndex+ir];
        __syncthreads();
        ir /= 2;
    }

    if (cacheIndex == 0) {
        // wait until we get the lock
        lock.lock();
        // we have the lock at this point, update and release
        *delta_ept += cache[0];
        lock.unlock();
    }
}

where the first four arguments are the same as those of the CPUdeltaUvolume function. The variable lock is local, simply declared without initialization before the kernel call; its type Lock is defined in the header lock.h, listed in Appendix C (for details see [27]). Such a variable, together with the shared array cache, allows us to perform a well known [27] parallel algorithm of summation–reduction. The relevant aspect here is that the random argument of the device function d_mpot is provided by the hand-coded (pseudo) random number generator d_randREMC, according to the scheme $x_{i+1} = (a \cdot x_i + c) \bmod m$, implemented as follows:

Listing 15: d_randREMC
#define M_RND 2147483647   // 2^31 - 1, a prime number
#define A_RND 16807
#define C_RND 0

__device__ double d_randREMC(int *x)
{
    (*x) = (A_RND*(*x) + C_RND) % M_RND;
    return ((double)(*x)) / RAND_MAX;
}

The array MPr, the last argument of the kernel GPUdeltaU_vol, contains the seeds used by the function d_randREMC and is allocated with dimension Npmax*Npmax to ensure a unique random sequence for each thread. Every single element of MPr is initialized by a rand call and is subsequently passed by reference to the d_randREMC function, which in turn updates its value at every call.

Appendix B. Validation of the multi-potential statistical selection

The extent to which the multi-potential selections, performed by both the mpot (List. 3) and d_mpot (List. 4) functions, fit the assumed distribution, given by the w array in List. 5, is evaluated by the reduced chi-square, $\tilde{\chi}^2$ (see [28, p. 261 and Appendix D]), defined as

$$\tilde{\chi}^2 = \frac{1}{d}\sum_{k=0}^{n-1}\frac{(O_k - E_k)^2}{E_k} \qquad (B.1)$$

which in general refers to a series of measurements grouped into bins k = 0, ..., n−1, where $O_k$ is the number of measurements observed in bin k, $E_k$ is the number of expected ones, on the basis of some distribution, and d is the number of degrees of freedom (d = n − c, where c is the number of constraints, that is the number of parameters that had to be calculated from the data to compute the expected numbers $E_k$). As a test case we considered the simulation described in Section 5 for only T = 50 000 K and with an He–He+ interaction that is always Exp-6 but is made artificially multi-potential by setting npot=8 and

w = [0.125, 0.225, 0.375, 0.5, 0.625, 0.825, 0.875, 1]

The chi-square in (B.1) is calculated by accumulating data from the simulation start to the end of every cycle, considering: n = npot (in the struct described in List. 5), M = total number of multi-potential function (mpot for CPU or d_mpot for GPU) calls for the He–He+ interaction, $O_k$ = fraction of M corresponding to the number of calls of the k-th single-potential function (pot element of the struct described in List. 5), $E_k = M\cdot(w[k]-w[k-1])$ and then d = n − 1. As is known [28], if the probability

$$Q(\chi^2|d) = \frac{1}{2^{d/2}\,\Gamma(d/2)}\int_{\chi^2}^{\infty} x^{\frac{d}{2}-1}\exp(-x/2)\,dx \qquad (B.2)$$

that the observed chi-square will exceed its value by chance even for a correct model is greater than 0.05, then one can consider the observed distribution compatible with the expected one. So we used the routine gammq from Numerical Recipes [29, Chapter 6] to calculate the probability (B.2) for every chi-square calculation, that is, as a function of cycles. The results are reported in Fig. B.10, from which one can deduce a high quality of the multi-potential statistical selection for both the CPU and GPU modules, seeing that the Q values are always greater than 5%, frequently greater than 40% and sometimes greater than 90%.

Fig. B.10. Percentage probability (B.2), for the same CPU and GPU test case simulation, that the chi-square (B.1) will exceed the value calculated from the simulation start to the end of a given cycle, as a function of cycles.

Appendix C. Atomic locks

In List. 16 one can find the lock.h header, defining the Lock type used by the kernel in List. 14.

Listing 16: lock.h
#ifndef __LOCK_H__
#define __LOCK_H__

struct Lock {
    int *mutex;
    Lock(void) {
        cudaMalloc((void **)&mutex, sizeof(int));
        cudaMemset(mutex, 0, sizeof(int));
    }
    ~Lock(void) {
        cudaFree(mutex);
    }
    __device__ void lock(void) {
        while (atomicCAS(mutex, 0, 1) != 0);
    }
    __device__ void unlock(void) {
        atomicExch(mutex, 0);
    }
};
#endif

Appendix D. OpenMP parallelization scheme

The following code is the OpenMP version of the kernel in List. 14.

Listing 17: OMPdeltaU_vol
void OMPdeltaU_vol(DATAMPOT *dmp, REAL Lold, REAL Lnew)
{
    int k;

    #pragma omp parallel for reduction(+:deltaE)
    for (k = 0; k < Np*Np; k++) {
        int i = k/Np;      // decode the particle pair (i, j) from the flat index k
        int j = k%Np;
        if (j > i) {
            REAL dx = fabs(part[i].x - part[j].x);
            REAL dy = fabs(part[i].y - part[j].y);
            REAL dz = fabs(part[i].z - part[j].z);
            REAL dx2 = (dx < 0.5 ? dx*dx : (1-dx)*(1-dx));
            REAL dy2 = (dy < 0.5 ? dy*dy : (1-dy)*(1-dy));
            REAL dz2 = (dz < 0.5 ? dz*dz : (1-dz)*(1-dz));
            REAL r_old = Lold*sqrt(dx2+dy2+dz2);
            REAL r_new = Lnew*r_old/Lold;

            DATAMPOT *ptr = dmp + NSPES*part[i].s + part[j].s;
            int tid = omp_get_thread_num();
            deltaE += (ptr->npot > 1 ?
                       mpot(r_new, ptr, randREMC(MPrand+tid)) - mpot(r_old, ptr, randREMC(MPrand+tid)) :
                       ptr->pot[0](r_new, ptr->dp) - ptr->pot[0](r_old, ptr->dp));
        }
    }
}
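As a usage illustration of the structures in Listings 3, 5 and 6, the following self-contained sketch (not part of the original code; u_HH_pot and u_modRep_pot are hypothetical placeholder functions with the POT prototype of (A.1)) fills a DATAMPOT instance reproducing the HH-modRep-stat selection of Section 3 (npot = 2, equal statistical weights) and evaluates it through mpot:

#include <stdio.h>
#include <stdlib.h>

/* Types from Listings 5-6 (NPOT_MAX and PR_MAX values assumed for this sketch). */
#define NPOT_MAX 8
#define PR_MAX 16
typedef struct { double pr[PR_MAX]; } DATAPOT;
typedef double (*POT)(double r, DATAPOT *dp);
typedef struct { int npot; double w[NPOT_MAX]; POT pot[NPOT_MAX]; DATAPOT dp[NPOT_MAX]; } DATAMPOT;

/* mpot as in Listing 3 */
double mpot(double r, DATAMPOT *dmp, double rnd)
{
    int i;
    for (i = 0; i < (dmp->npot)-1 && rnd > (dmp->w[i]); i++);
    return dmp->pot[i](r, &(dmp->dp[i]));
}

/* Hypothetical single-potential placeholders standing in for u_HH and u_modRep. */
static double u_HH_pot(double r, DATAPOT *dp)     { (void)dp; return -1.0/r; }
static double u_modRep_pot(double r, DATAPOT *dp) { (void)dp; return  1.0/r; }

int main(void)
{
    DATAMPOT stat_sel;
    stat_sel.npot = 2;
    stat_sel.w[0] = 0.5;            /* 0 < rho <= 0.5 -> u_HH     */
    stat_sel.w[1] = 1.0;            /* 0.5 < rho <= 1 -> u_modRep */
    stat_sel.pot[0] = u_HH_pot;
    stat_sel.pot[1] = u_modRep_pot;

    double r = 2.0e-10;             /* interparticle distance, m */
    double rho = (double)rand() / RAND_MAX;
    printf("u_stat(r) = %g\n", mpot(r, &stat_sel, rho));
    return 0;
}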

References

[1] J.H. Johnson, A.Z. Panagiotopoulos, K.E. Gubbins, Molecular Physics 81 (1994) 717.
[2] W.R. Smith, B. Triska, The Journal of Chemical Physics 100 (1994) 3019.
[3] V. Bezkrovniy, M. Schlanges, D. Kremp, W. Kraeft, Physical Review E 69 (2004) 061204.
[4] V. Bezkrovniy, et al., Physical Review E 70 (2004) 057401.
[5] M. Lisal, J.K. Brennan, W.R. Smith, Journal of Chemical Physics 124 (2006) 064712.
[6] C.H. Turner, K.E. Gubbins, Journal of Chemical Physics 119 (2003) 021105.
[7] E. Bourasseau, V. Dubois, N. Desbiens, J.B. Maillet, Journal of Chemical Physics 127 (2007) 084513.
[8] M. Lisal, W. Smith, I. Nezbeda, Journal of Chemical Physics 113 (2000) 4885.
[9] M. Lisal, W. Smith, M. Bures, V. Vacek, J. Navratil, Molecular Physics 100 (2002) 2487.
[10] F.H. Ree, The Journal of Physical Chemistry 87 (1983) 2846.
[11] http://www.exxactcorp.com.
[12] M. Capitelli, G. Colonna, A. D'Angola, Fundamental Aspects of Plasma Chemical Physics: Thermodynamics, first ed., in: Atomic, Optical, and Plasma Physics, vol. 66, Springer, New York, 2011.
[13] D. Bernardi, V. Colombo, G. Coppa, A. D'Angola, European Physical Journal D 14 (2001) 337.
[14] M. Capitelli, G. Colonna, C. Gorse, A. D'Angola, European Physical Journal D 11 (2000) 279.
[15] G. Colonna, A. D'Angola, M. Capitelli, Physics of Plasmas 19 (2012).
[16] G. Colonna, A. D'Angola, A. Laricchiuta, D. Bruno, M. Capitelli, Plasma Chemistry and Plasma Processing 33 (1) (2013) 401.
[17] G. Colonna, A. D'Angola, Computer Physics Communications 163 (2004) 177.
[18] A. D'Angola, et al., European Physical Journal D 66 (2012).
[19] A. D'Angola, G. Colonna, C. Gorse, M. Capitelli, European Physical Journal D 46 (2008) 129.
[20] A. D'Angola, G. Colonna, C. Gorse, M. Capitelli, European Physical Journal D 65 (2011) 453.
[21] M.P. Allen, D.J. Tildesley, Computer Simulation of Liquids, second ed., Oxford University Press, USA, 1989.
[22] D. Frenkel, B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, second ed., Academic Press, 2002.
[23] H. Turner, et al., Molecular Simulation 34 (2008) 119.
[24] D. Bruno, et al., Physics of Plasmas 17 (2010) 112315.
[25] A.H. Karp, H.P. Flatt, Communications of the ACM 33 (5) (1990) 539.
[26] M. Tuttafesta, G. Colonna, G. Pascazio, Computer Physics Communications 184 (2013) 1497.
[27] J. Sanders, E. Kandrot, CUDA by Example, Addison-Wesley, 2011.
[28] J.R. Taylor, Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 1997.
[29] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C (second ed.): The Art of Scientific Computing, Cambridge University Press, New York, NY, USA, 1992.