Computer Physics Communications 182 (2011) 989–993
Simulation of stochastic processes using graphics hardware

Alison Barros a, Euler de Vilhena Garcia b, Rafael Morgado Silva a,b,*

a Instituto de Física, Universidade de Brasília, 70919-970 Brasília-DF, Brazil
b Faculdade Gama, Universidade de Brasília, 72405-610 Gama-DF, Brazil
* Corresponding author at: Instituto de Física, Universidade de Brasília, 70919-970 Brasília-DF, Brazil. Tel.: +55 61 3384 4023. E-mail: rafael@fis.unb.br (R. Morgado Silva).
Article history: Received 17 July 2009; received in revised form 8 December 2010; accepted 28 December 2010; available online 30 December 2010.

Keywords: Generalized Langevin equation; stochastic simulation; high performance computing; graphics processing unit acceleration

Abstract
Graphics Processing Units (GPUs) were originally designed to manipulate images, but thanks to their intrinsically parallel nature they have turned into a powerful tool for scientific applications. In this article we evaluate GPU performance in an implementation of a traditional stochastic simulation – correlated Brownian motion. This motion can be described by the Generalized Langevin Equation (GLE), a stochastic integro-differential equation with applications in many areas, such as anomalous diffusion, transport in porous media, noise analysis and quantum dynamics, among many others. Our results show the power inherent in GPU programming when compared to traditional CPUs: we observed acceleration values of up to sixty times by using an NVIDIA GPU in place of a single-core Intel CPU.
1. Introduction

Correlated Brownian motion is customarily described by the Generalized Langevin Equation (GLE)
m \frac{dv}{dt} = -m \int_0^t \Gamma(t - t')\, v(t')\, dt' + \zeta(t), \qquad (1)
where m, v and Γ(t) stand respectively for the particle mass, the particle velocity and the memory function. The latter is related to the stochastic noise ζ(t) by the fluctuation–dissipation theorem
\langle \zeta(t)\, \zeta(t') \rangle = m K_B T\, \Gamma(t - t'), \qquad (2)
in which ⟨···⟩ denotes the ensemble average, K_B is Boltzmann's constant and T is the system temperature. This is a pivotal equation in statistical mechanics: it is equivalent to the Heisenberg equation of motion for a quantum operator [1] and also to classical Hamiltonian dynamics [2]. It is very useful for modeling relevant phenomena in many distinct areas of knowledge, such as engineering [3], physics [4], chemistry [5] and economics [6], among others. Obtaining analytical solutions of stochastic integro-differential equations like the GLE remains a great challenge from the theoretical point of view. This calls for the development of numerical methods to integrate it correctly, but this cannot be done in general because of the demanding step of correlated noise generation and the need for large amounts of computational time (or
power). Even simple algorithms show an ever-growing cost at each iteration, making GLE simulation prohibitive for large times. However, large times are exactly what most applications aim for, because of the decay of the transient and the thermodynamic limit (infinite time). This naturally leads to the requirement of very powerful computational systems, making this problem a good test case for new architectures and languages.

The graphics processing unit (GPU) has evolved from a fixed-instruction processor into generally programmable hardware [7,8], paving the way for the development of applications that can harness the power of this multi-core architecture – several orders of magnitude above that of currently available central processing units (CPUs). The simulation of the GLE is a trivially parallelizable problem, since its overall aim is to generate as many independent stochastic trajectories as possible and then use this set to estimate quantities such as averages, correlations and probability distribution functions. We believe that the performance gain from using the GPU as a mathematical co-processor can be considerable in problems of this kind. In this work we compare similar GLE simulations running in three environments: a single-core Intel CPU, a cluster of quad-core Intel CPUs and two models of GPU.

The article is organized as follows: in Section 2 we describe the methodology, introducing the numerical method and the algorithm, including the correlated random number generation; in Section 3 we present the benchmarks; and in Section 4 we conclude our work.

2. Methodology

This work was developed using the NVIDIA G92 (GeForce 9800GTX video board) and GT200b (GeForce GTX280 and GeForce GTX295 video boards) GPUs. Simulation times obtained from three different
approaches were compared: a serial code running on a single CPU core (one core of an Intel Q6600); an MPI (Message Passing Interface) parallel code running on a cluster of three nodes interconnected by a gigabit network, with one Intel Q6600 CPU per node; and the parallel code running the main mathematical routines on the GPU. The GPU code was compiled with NVCC, the NVIDIA compiler shipped with the CUDA platform [8], while the serial and MPI versions were compiled with the Intel C Compiler version 10.1, as we observed that it has better optimization routines than the GNU gcc compiler. Applications were executed ten times for each version, and the global processing time, i.e. the time to execute all the code, was measured with a six-digit-precision timing function.

2.1. Numerical algorithm

The development of a numerical algorithm to simulate GLE problems poses two main issues. First, subdividing the integral term of the GLE into homogeneous time steps requires recalculating the integral at each iteration, with the number of subdivisions growing steadily with time; this can make the simulation of large times prohibitive, which is exactly what we want to avoid. Second, we need to generate correlated random numbers, because the stochastic noise in the GLE must satisfy the fluctuation–dissipation theorem, which states that the noise correlation must be proportional to the memory function.

In this paper we propose a method that solves both problems. First, without loss of generality, we rewrite the noise and the memory functions as Fourier cosine series, since we work with discrete times and frequencies. The noise expression then becomes
\zeta(t) = \sum_i g(\omega_i) \cos\!\left[\omega_i t + \phi(\omega_i)\right] \Delta\omega, \qquad (3)
where g(ω) is the noise density of states and the φ(ω_i) are delta-correlated random phases,
\langle \phi(\omega)\, \phi(0) \rangle = \frac{\delta(\omega)}{2}. \qquad (4)
Using the fluctuation–dissipation theorem, the correlation of φ above and some trigonometric algebra [4], we find that the memory function is given by
\Gamma(t) = \frac{1}{2 K_B T} \sum_i g(\omega_i)^2 \cos(\omega_i t)\, \Delta\omega^2. \qquad (5)
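In code, Eqs. (3) and (5) translate into two short routines. The sketch below is only illustrative: the array names, the uniform phases and the overall structure are assumptions based on the text, not the authors' implementation.

```cuda
#include <math.h>

/* Noise of Eq. (3): a cosine series over the discrete frequencies w[i],
 * weighted by the density of states g[i] and shifted by the random
 * phases phi[i]. */
double noise(double t, const double *g, const double *w,
             const double *phi, int nw, double dw)
{
    double zeta = 0.0;
    for (int i = 0; i < nw; ++i)
        zeta += g[i] * cos(w[i] * t + phi[i]) * dw;
    return zeta;
}

/* Memory function of Eq. (5), handy for checking that the generated noise
 * satisfies the fluctuation-dissipation theorem, Eq. (2). */
double memory(double t, const double *g, const double *w,
              int nw, double dw, double kbT)
{
    double gamma = 0.0;
    for (int i = 0; i < nw; ++i)
        gamma += g[i] * g[i] * cos(w[i] * t) * dw * dw;
    return gamma / (2.0 * kbT);
}
```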
Inserting Eqs. (3) and (5) into the GLE, Eq. (1), we get

m \frac{dv}{dt} = \sum_i \left\{ -K(\omega_i, t) + g(\omega_i) \cos\!\left[\omega_i t + \phi(\omega_i)\right] \Delta\omega \right\}, \qquad (6)
where K(ω_i, t) is given by

K(\omega_i, t) = \frac{g(\omega_i)^2\, \Delta\omega}{2 K_B T} \left[ \cos(\omega_i t)\, I_1(t) + \sin(\omega_i t)\, I_2(t) \right], \qquad (7)

with

I_1(t) = \int_0^t \cos(\omega_i t')\, v_k(t')\, dt', \qquad (8)

and

I_2(t) = \int_0^t \sin(\omega_i t')\, v_k(t')\, dt'. \qquad (9)

The derived equations correctly address the issues raised at the beginning of this subsection. The new time integrals I_1 and I_2 can be accumulated at each iteration, so the algorithm takes a fixed amount of time per iteration. Also, Eqs. (3) and (5) can be used to generate the desired correlated noise from the uncorrelated random set φ(ω_i), as long as the noise correlation can be expanded in a Fourier series, which covers most applications of interest.

The complete algorithm is depicted by the flowchart in Fig. 1. It starts by calculating the density of states for the noise term and the random number matrix. After that, it enters the main loop – the particle loop. Inside it, two further loops are started: the time loop and the frequency loop. The frequency loop encompasses the solution of the integrals given in Eqs. (8) and (9) as well as of K(ω, t) in Eq. (7); as the frequency loop runs, the sum of the calculated K(ω, t) is accumulated. When it finishes, the GLE is integrated using Eq. (6) and the algorithm proceeds to the next time step, calling the frequency loop again. When the time loop finishes, the algorithm moves on to the next particle, running until the total number of particles is reached (Fig. 1).
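As an illustration of the fixed cost per time step, a minimal serial sketch of the time and frequency loops for one particle could look like the following. The explicit Euler update and all variable names are assumptions made for the purpose of illustration; the paper does not list its integration rule.

```cuda
#include <stdlib.h>
#include <math.h>

/* Integrate one particle trajectory, accumulating I1 and I2 (Eqs. (8)-(9))
 * so that every time step costs the same amount of work. */
void trajectory(double *v_out, const double *g, const double *w,
                const double *phi, int nw, double dw,
                int nsteps, double dt, double m, double kbT)
{
    double *I1 = (double *)calloc(nw, sizeof(double));
    double *I2 = (double *)calloc(nw, sizeof(double));
    double v = 0.0;                       /* initial velocity (illustrative) */

    for (int n = 0; n < nsteps; ++n) {    /* time loop      */
        double t = n * dt, force = 0.0;
        for (int i = 0; i < nw; ++i) {    /* frequency loop */
            I1[i] += cos(w[i] * t) * v * dt;                   /* Eq. (8) */
            I2[i] += sin(w[i] * t) * v * dt;                   /* Eq. (9) */
            double K = g[i] * g[i] * dw / (2.0 * kbT)
                     * (cos(w[i] * t) * I1[i] + sin(w[i] * t) * I2[i]);  /* Eq. (7) */
            force += -K + g[i] * cos(w[i] * t + phi[i]) * dw;  /* Eq. (6) summand */
        }
        v += force * dt / m;              /* integrate the GLE, Eq. (6) */
        v_out[n] = v;
    }
    free(I1);
    free(I2);
}
```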
2.2. Simulation structure

All developed simulation codes follow the algorithm described in the previous subsection. They differ according to the parallelization strategy implemented (none, MPI or GPU). In the parallel codes, we used one thread for each particle instead of a loop. We developed the GPU code using the CUDA language [8], in which the code is split into kernels executed on the GPU. The kernels created were the following: a non-correlated random number generation kernel (using the Mersenne–Twister algorithm [9]), a density-of-states array filling kernel, a GLE solution kernel and a data processing kernel (Fig. 2).

The random number generation kernel generates pseudo-random numbers between −π and π using the Mersenne–Twister algorithm [9] adapted to the GPU. Since each particle has its own set of random numbers, generated from distinct seeds, we decided to generate in a single call all the numbers needed for all particles. The amount of video memory available would then impose a limit on the number of particles and on the maximum simulation time, but this limitation can be lifted by calling the kernels inside a loop, so that each iteration computes a block of trajectories.

After the random number generation, we fill in the density-of-states vectors g(ω_i) and g(ω_i)²/(2K_B T). These are small vectors, since a large number of frequencies is not needed for good precision in the oscillatory integrals. Due to their limited size, they can be stored in the constant memory of the GPU, which has lower latency and faster access times than the global memory. These vectors and the random numbers are then used to solve the GLE in the next kernel. Since each particle has an independent trajectory, one thread is assigned to each particle. While running the GLE time-domain solution, this kernel concomitantly stores the resulting positions and velocities in an array of results. This array is finally fed into the data processing kernel, which computes averages, correlations and probability distribution functions and saves them in vectors that are later copied to the main memory and stored on the hard disk drive.
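A sketch of the GLE solution kernel is shown below: one thread integrates one particle and writes its velocities to the results array, with the small density-of-states tables kept in constant memory. The kernel name, launch parameters, the Euler update and the choice of storing velocities only are illustrative assumptions, not the authors' code.

```cuda
#include <math.h>

#define N_FREQ 64                     /* illustrative number of frequencies */

/* Small density-of-states tables, stored in low-latency constant memory. */
__constant__ double c_g[N_FREQ];      /* g(w_i)               */
__constant__ double c_w[N_FREQ];      /* discrete frequencies */

/* One thread per particle: phi holds N_FREQ random phases for each particle
 * and v_out receives its nsteps velocities. */
__global__ void gle_kernel(const double *phi, double *v_out,
                           int n_particles, int nsteps,
                           double dt, double dw, double m, double kbT)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_particles) return;

    const double *my_phi = phi + (size_t)p * N_FREQ;
    double I1[N_FREQ] = {0.0}, I2[N_FREQ] = {0.0};
    double v = 0.0;

    for (int n = 0; n < nsteps; ++n) {
        double t = n * dt, force = 0.0;
        for (int i = 0; i < N_FREQ; ++i) {
            I1[i] += cos(c_w[i] * t) * v * dt;                  /* Eq. (8) */
            I2[i] += sin(c_w[i] * t) * v * dt;                  /* Eq. (9) */
            double K = c_g[i] * c_g[i] * dw / (2.0 * kbT)
                     * (cos(c_w[i] * t) * I1[i] + sin(c_w[i] * t) * I2[i]);
            force += -K + c_g[i] * cos(c_w[i] * t + my_phi[i]) * dw;
        }
        v += force * dt / m;                                    /* Eq. (6) */
        v_out[(size_t)p * nsteps + n] = v;
    }
}
```

On the host, this kernel would be launched after the random number and density-of-states kernels and before the data processing kernel, following the sequence of Fig. 2.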
3. Results

All three code versions were executed with a varying number of particles. The measured global execution times are shown in Table 1. From this table we note that the performance gain from using the GPU as a mathematical co-processor is negligible for a small number of particles, but it increases as the number of particles grows. To better compare the versions and numbers of particles, we define the Speed as
\text{Speed} = \frac{\text{Number of particles}}{\text{Total simulation time}}. \qquad (10)
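As a worked example using the data in Table 1, the serial run with 128 particles took 86.69 s, giving Speed = 128/86.69 ≈ 1.48 particles per second, in agreement with the corresponding entry in Table 2.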
Fig. 1. Flowchart of the developed algorithm. WMAX is the number of subdivisions used to calculate the oscillatory integrals, N_PART is the total number of particles, totaltime is the final simulation time, and dt is the time step used in the integration of the GLE.
Fig. 2. Stream diagram of the developed code, stressing the data flow and the sequential calling of the kernels.
Table 1
Measured total simulation times in seconds. Times in parentheses correspond to the numbers of particles given in parentheses.

Number of particles | Serial    | MPI      | 9800GTX | GTX280   | GTX295
128 (120)           | 86.69     | 30.08    | 30.64   | (41.04)  | (45.26)
256 (240)           | 172.93    | 58.27    | 29.50   | (41.01)  | (45.29)
512 (480)           | 345.91    | 113.71   | 30.30   | (41.09)  | (45.64)
1024 (960)          | 694.04    | 225.44   | 31.98   | (45.97)  | (51.67)
2048 (1920)         | 1396.43   | 449.33   | 42.67   | (48.36)  | (54.04)
4096 (3840)         | 2765.95   | 899.82   | 101.91  | (57.58)  | (66.60)
8192 (7680)         | 5629.84   | 1793.83  | 210.18  | (87.76)  | (105.39)
16384 (15360)       | 10990.24  | 3580.51  | –       | (164.05) | (214.42)
Table 2
Calculated Speed (particles per second). Speeds in parentheses correspond to the numbers of particles given in parentheses.

Number of particles | Serial | MPI   | 9800GTX | GTX280   | GTX295
128 (120)           | 1.477  | 4.255 | 4.178   | (2.924)  | (2.651)
256 (240)           | 1.479  | 4.386 | 8.677   | (5.853)  | (5.299)
512 (480)           | 1.479  | 4.505 | 16.897  | (11.682) | (10.517)
1024 (960)          | 1.475  | 4.545 | 32.021  | (20.883) | (18.578)
2048 (1920)         | 1.445  | 4.566 | 47.991  | (39.703) | (35.531)
4096 (3840)         | 1.481  | 4.545 | 40.194  | (66.694) | (57.658)
8192 (7680)         | 1.456  | 4.566 | 38.977  | (87.509) | (72.875)
16384 (15360)       | 1.49   | 4.566 | –       | (93.633) | (71.637)
Fig. 3. Number of particles per unit of time as a function of the number of particles.

Table 2 and Fig. 3 show the Speed defined above as a function of the number of particles. We note that the Speed is approximately constant in the Serial and MPI cases, whereas on the GPU it shows an initial increase and then stabilizes as the number of particles grows. The flat curves in the Serial and MPI cases are expected: since the occupation rate of the CPU does not change appreciably with the number of particles, the average time spent per particle is constant. In contrast, the GPU suffers a great impact from kernel-call latencies and from its occupancy rate, as can be observed in Table 2. Execution time on the GPU is constant for a small number of particles; a small number of particles also means that there are not enough threads to keep all GPU processors busy, which is the origin of the low occupancy rate. With a larger number of particles the occupancy rate of the GPU processors grows until saturation and the kernel-call latency is diluted in the total execution time, so the Speed on the GPUs tends to stabilize in a plateau.

Fig. 3 also shows that, if the number of particles is large enough, the GPU code running on the GTX280 graphics board is about 60 times faster than the Serial code and nearly 20 times faster than the MPI code. Comparison between the MPI and Serial versions shows a performance gain of nearly three times. Interestingly, the MPI version runs on 12 processors and could thus be expected (theoretically) to be twelve times faster than the Serial code. We postulate that the low parallel efficiency of the MPI implementation comes from the fact that the MPI code uses one process per particle, so data must be copied from one machine to another and a new process started for each particle. Since each MPI process finishes its execution in a relatively short time, this causes extra overhead and, consequently, low parallel performance. One could do a better job with MPI by simulating blocks of particles in each process, but this would make the comparison with the GPU programming model a very complex task. This also raises questions about the scalability of clusters and future multi-core architectures: the overhead of copying data and starting new MPI processes may be (or become) a major bottleneck that drags down the overall performance as the number of cores (nodes) increases [10].

4. Conclusion
In this article we have presented a stochastic simulation using graphics hardware. Specifically, we used the simulation of the Generalized Langevin Equation as a tool to benchmark the GPU as a mathematical co-processor for scientific applications. To carry out this kind of Brownian motion simulation, we developed an algorithm that can be further extended to general use, as it generates correlated random numbers in an easy and robust way.

Regarding the physical results obtained with the GPU code, we did not observe significant differences caused by single-precision arithmetic. Using this code, we were able to reproduce the results of Refs. [4] and [11], which were produced using double-precision arithmetic. Finally, we found that the GPU code was about 60 and 20 times faster than the Serial and MPI versions, respectively. Our results also indicate that the biggest bottleneck in the MPI case is associated with data copying and the launching of new processes, which raises questions about the viability, performance and scalability of clusters and future multi-core architectures [10] when compared to the GPU architecture and programming model.

References

[1] M.H. Lee, J. Math. Phys. 24 (1983) 2512.
[2] R. Kubo, Rep. Prog. Phys. 29 (1966) 255.
[3] L. Weiss, W. Mathis, IEEE International Symposium on Circuits and Systems, Hong Kong, June 9–12, 1997.
[4] R. Morgado, F.A. Oliveira, G.G. Batrouni, A. Hansen, Phys. Rev. Lett. 89 (2002) 100601.
[5] S. Roy, I. Mitra, R. Llinas, Phys. Rev. E 78 (2008) 041920.
[6] J.-P. Bouchaud, R. Cont, Eur. Phys. J. B 6 (1998) 543–550.
[7] I. Buck, Brook Language Specification 0.2, http://merrimac.stanford.edu/brook, 2003.
[8] NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, 2008.
[9] M. Matsumoto, T. Nishimura, ACM Trans. Model. Comput. Simul. 8 (1) (1998) 3–30.
[10] S.K. Moore, IEEE Spectrum 45 (11) (2008) 15.
[11] L.C. Lapas, R. Morgado, M.H. Vainstein, J.M. Rubi, F.A. Oliveira, Phys. Rev. Lett. 101 (2008) 230602.