Sampling Compton Scattering by Kahn's method on GPU


Available online at www.sciencedirect.com

Procedia Engineering

Procedia Engineering 15 (2011) 3495 – 3499
www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

Jing Xie a,b,*

a School of Information, Xi'an University of Finance and Economics, Xi'an 710100, China
b School of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

The Monte Carlo simulation of particle transport does not solve an explicit equation; instead, it obtains answers by simulating individual particles and recording some aspects of their average behavior. For incident particle energies below 1.5 MeV, the Compton scattering process is sampled exactly by Kahn's method. This paper focuses on accelerating the sampling of Compton scattering by Kahn's method on a GPU. The sampling procedures are distributed across the massive thread-level parallelism of the GPU architecture using the CUDA programming model. Experimental results show that, with full double-precision floating-point operations, the speedup of the GPU implementation on one NVIDIA M2050 chip ranges from 3.10 compared with one Intel Xeon X5670 chip to 8.51 compared with one Intel Xeon X5355 chip.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011]

Keywords: Compton scattering; Klein-Nishina formula; Kahn's method; CUDA; GPU

* Corresponding author. Tel.: +86 150 2924 7182. E-mail address: .
1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.654

1. Introduction

The Monte Carlo (MC) method is widely used to simulate particle transport, including the transport of neutrons, electrons and protons [1]. The physical processes treated here are the photoelectric effect, pair production, and Compton scattering on free electrons. The photoelectric effect is treated as pure absorption (without fluorescence) by implicit capture, with a corresponding reduction in the photon weight. The electron-positron pair created by pair production is transported further, and the photon disappears [2]. Compton scattering is taken to occur on free electrons (without the use of form factors), and the highly forward coherent Thomson scattering is ignored. Thus the total cross section $\sigma_t$ is regarded as the sum of three components [1]:

$$\sigma_t = \sigma_{pe} + \sigma_{pp} + \sigma_{cs} \qquad (1)$$

where $\sigma_{pe}$, $\sigma_{pp}$ and $\sigma_{cs}$ stand for the cross sections of the photoelectric effect, pair production and Compton scattering, respectively. For incident particle energies below 1.5 MeV, the Compton scattering process is sampled exactly by Kahn's method [3], which is the subject of this paper.

Today's GPUs greatly outpace CPUs in arithmetic throughput and memory bandwidth, which makes them well suited to accelerating a variety of computation-intensive applications. The NVIDIA Compute Unified Device Architecture (CUDA) [4] programming model has matured and simplifies the development of non-graphics applications. GPUs have been successfully applied in many computation-intensive domains, such as random number generation [5,6], stochastic (Monte Carlo) particle transport [7,8,9], and deterministic particle transport [10].

2. Background and related works

In a collision, Compton scattering on a free electron occurs with probability $\sigma_{cs}/(\sigma_t - \sigma_{pe})$. In the event of such a collision, the objective is to determine the energy $E'$ of the scattered photon and $\mu = \cos\theta$ for the angle $\theta$ of deflection from the line of flight. The energy deposited at the point of collision can then be used to make a Compton recoil electron for further transport [1]. The differential cross section for the process is given by the Klein-Nishina formula [11,12,13]:

$$K(\alpha, \mu)\,d\mu = \pi r_0^2 \left(\frac{\alpha'}{\alpha}\right)^2 \left[\frac{\alpha'}{\alpha} + \frac{\alpha}{\alpha'} + \mu^2 - 1\right] d\mu$$

where $r_0$ is the classical electron radius, $2.817938 \times 10^{-13}$ cm, and $\alpha$ and $\alpha'$ are the incident and final photon energies in units of 0.511 MeV. Here $\alpha = E/(mc^2)$, where $m$ is the mass of the electron and $c$ is the speed of light, and $\alpha' = \alpha/[1 + \alpha(1 - \mu)]$. The Compton scattering process is sampled exactly by Kahn's method [3] below 1.5 MeV and by Koblinger's method [14] above 1.5 MeV, as analyzed by Blomquist and Gelbard [15].

Compton scattering is a basic sampling process in the MC simulation of particle transport. Thomas et al. [5] compared different PRNG algorithms on four types of platform. Tickner [8] implemented photon transport on GPU. Heimlich et al. [7] presented a neutron transport simulation by the MC method on GPU. Gong et al. accelerated the PRNG for MCNP on GPU [6], and studied a parallel Monte Carlo benchmark on both a GPU [16] and a heterogeneous CPU/GPU platform [17]. The sampling of the angular distribution in MCNP-based neutron transport was also accelerated on GPU [18].

3. Implementation details

The pseudo-random number generator (PRNG) is the basis of Monte Carlo simulation. All random numbers are uniformly distributed on the interval (0, 1). The original serial implementation on CPU and the parallel implementation on GPU are described in [6], which gives the details of the PRNG for MCNP on GPU.
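To illustrate how each GPU thread might draw uniform numbers on (0, 1) from its own non-overlapping segment of a single sequence, here is a minimal sketch. It is not the MCNP generator of [6]: the 64-bit linear congruential constants (Knuth's MMIX values), the names Lcg64, make_stream and KAHN_HD, and the block-splitting scheme are all assumptions chosen for illustration. The code compiles as plain C++ and, through the KAHN_HD macro, as CUDA C++.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define KAHN_HD __host__ __device__
#else
#define KAHN_HD
#endif

// Stand-in generator s <- kMult*s + kInc (mod 2^64).
// Knuth's MMIX constants; NOT the MCNP generator described in [6].
constexpr std::uint64_t kMult = 6364136223846793005ULL;
constexpr std::uint64_t kInc  = 1442695040888963407ULL;

struct Lcg64 {
    std::uint64_t state;

    // Advance one step and return a double strictly inside (0, 1).
    KAHN_HD double next() {
        state = kMult * state + kInc;  // unsigned arithmetic wraps modulo 2^64
        // Keep the top 52 bits; the +0.5 offset excludes both 0 and 1.
        return (static_cast<double>(state >> 12) + 0.5) * (1.0 / 4503599627370496.0);
    }

    // Jump ahead by n steps in O(log n) by repeatedly squaring the affine map.
    KAHN_HD void skip(std::uint64_t n) {
        std::uint64_t acc_mult = 1, acc_inc = 0;        // identity map
        std::uint64_t cur_mult = kMult, cur_inc = kInc; // one step
        while (n > 0) {
            if (n & 1) {                                // compose cur after acc
                acc_mult *= cur_mult;
                acc_inc = acc_inc * cur_mult + cur_inc;
            }
            cur_inc = (cur_mult + 1) * cur_inc;         // cur composed with itself
            cur_mult *= cur_mult;
            n >>= 1;
        }
        state = acc_mult * state + acc_inc;
    }
};

// Hypothetical helper: thread `tid` owns numbers [tid*per_thread, (tid+1)*per_thread).
KAHN_HD inline Lcg64 make_stream(std::uint64_t seed, std::uint64_t tid,
                                 std::uint64_t per_thread) {
    Lcg64 g{seed};
    g.skip(tid * per_thread);
    return g;
}
```

With this layout, two threads never consume the same random number as long as each draws at most per_thread values, and the logarithmic-time skip keeps stream setup cheap even for large thread indices.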


3.1. Implementing Kahn's method on GPU

The implementation of Kahn's method on GPU is shown in Algorithm 1. The __device__ qualifier declares a function that is executed on the device and callable from the device only. The input parameter Ein is the incident energy in units of 0.511 MeV and ranges from 0 to 3, that is, up to about 1.5 MeV. In a real application, the device function is called many times, so the parallel implementation must ensure that no random numbers are reused.
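A minimal sketch of Kahn's rejection scheme, written in terms of the ratio x = α/α' and consistent with the Klein-Nishina formula in Section 2, might look as follows. It is not the author's Algorithm 1: the function name kahn_sample, the KAHN_HD macro, the returned trial count, and the convention that rng is any callable returning uniform doubles in (0, 1) are assumptions made for illustration. The parameter names Ein, Eout and cs follow the text.

```cpp
#ifdef __CUDACC__
#define KAHN_HD __host__ __device__
#else
#define KAHN_HD
#endif

// Sample one Compton scattering event by Kahn's rejection method.
//   ein  : incident photon energy alpha in units of 0.511 MeV (must be > 0)
//   rng  : hypothetical interface, any callable returning uniforms in (0, 1)
//   eout : scattered photon energy alpha' in the same units
//   cs   : cosine mu of the scattering angle
// Returns the number of rejection-loop trials that were needed.
template <class Rng>
KAHN_HD int kahn_sample(double ein, Rng& rng, double* eout, double* cs) {
    const double two_a = 2.0 * ein;
    const double p_first = (two_a + 1.0) / (two_a + 9.0);  // weight of branch 1
    int trials = 0;
    for (;;) {
        ++trials;
        const double r1 = rng();
        const double r2 = rng();
        const double r3 = rng();
        double x;      // x = alpha / alpha' lies in [1, 1 + 2*alpha]
        bool accept;
        if (r1 <= p_first) {
            // Branch 1: x uniform on [1, 1 + 2*alpha].
            x = 1.0 + two_a * r2;
            accept = r3 <= 4.0 * (1.0 / x - 1.0 / (x * x));
        } else {
            // Branch 2: x drawn with density proportional to 1/x^2.
            x = (two_a + 1.0) / (two_a * r2 + 1.0);
            const double mu = 1.0 - (x - 1.0) / ein;
            accept = r3 <= 0.5 * (mu * mu + 1.0 / x);
        }
        if (accept) {
            *eout = ein / x;                    // alpha' = alpha / x
            *cs = 1.0 - (x - 1.0) / ein;        // from x = 1 + alpha*(1 - mu)
            return trials;
        }
        // Rejected: draw three fresh random numbers and try again.
    }
}
```

Each trial consumes three random numbers, and the number of trials differs from call to call, which is why a per-thread random stream with no overlap matters in the parallel version.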

There are three branches in this device function. Individual threads composing a SIMD warp start together at the same program address but are otherwise free to branch and execute independently [4]. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths [4]. These branches therefore affect the performance of the GPU implementation.

The Compton scattering sampling kernel written with the CUDA programming model is shown in Algorithm 2. In a real simulation of photon or electron transport, the outputs of the device function Kahn(), such as Eout and cs, are used by subsequent steps and cannot be omitted.
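One common way to organize such a kernel is a strided loop in which thread tid handles samples tid, tid + nthreads, tid + 2*nthreads, and so on. The sketch below shows that work assignment; it is not the author's Algorithm 2. The names sample_strided and compton_kernel, the KAHN_HD macro, and the Sampler callable (invoked as sampler(ein, i, &eout[i], &cs[i]), with i selecting the random substream) are hypothetical. The __global__ kernel is compiled only under nvcc, while the shared loop body can also be exercised on the host.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define KAHN_HD __host__ __device__
#else
#define KAHN_HD
#endif

// Work assignment shared by host and device code: worker `tid` out of
// `nthreads` processes samples tid, tid + nthreads, tid + 2*nthreads, ...
template <class Sampler>
KAHN_HD void sample_strided(Sampler sampler, double ein, std::size_t n,
                            std::size_t tid, std::size_t nthreads,
                            double* eout, double* cs) {
    for (std::size_t i = tid; i < n; i += nthreads) {
        // The sample index i is passed on so the sampler can pick a
        // random substream that no other sample uses.
        sampler(ein, i, &eout[i], &cs[i]);
    }
}

#ifdef __CUDACC__
// Kernel wrapper: derive the global thread index and the total thread
// count from the launch configuration, then run the strided loop.
template <class Sampler>
__global__ void compton_kernel(Sampler sampler, double ein, std::size_t n,
                               double* eout, double* cs) {
    const std::size_t tid =
        blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    const std::size_t nthreads =
        gridDim.x * static_cast<std::size_t>(blockDim.x);
    sample_strided(sampler, ein, n, tid, nthreads, eout, cs);
}
#endif
```

Under nvcc, a launch would take the form compton_kernel<<<blocks, threads>>>(sampler, ein, n, d_eout, d_cs), where sampler is a functor whose operator() is marked __host__ __device__. Because the results are written to eout and cs in global memory, the compiler cannot discard the sampling work.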

4. Results

The experiments use three platforms: an NVIDIA M2050 GPU (CUDA 3.2), an Intel Xeon X5670 (6 cores, Ifort 11.1, MPICH2 1.3.2) and an Intel Xeon X5355 (4 cores, Ifort 10.1, MPICH2 1.07).


The impact of the CUDA thread block size is shown in Fig. 1. The Base is 11000320, which is the product of 4297, 10 and 256. One tuning option in the CUDA programming model is the thread block size, that is, how many GPU threads run in one thread block. Executions with 128 and 192 threads per thread block outperform those with 64 and 256 threads per thread block. For a random number size of 128 × Base, the runtimes with 128 and 256 threads per thread block are 3.06 and 3.34 seconds, respectively.

Fig. 1. Impact of the CUDA thread block size. (Runtime in seconds versus random number size in multiples of Base, at 32, 64 and 128, for 64, 128, 192 and 256 threads per thread block.)
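As a worked example of the launch arithmetic behind the thread block size, the sketch below computes how many blocks a given block size requires. It assumes, hypothetically, one thread per random number; the text does not state how samples are mapped to threads. The names kBase, kMaxGridX, blocks_for and fits_1d_grid are illustrative, and 65535 is the maximum x-dimension of a grid on compute capability 2.x devices such as the M2050.

```cpp
#include <cstdint>

// Base as quoted in the text: 4297 * 10 * 256.
constexpr std::uint64_t kBase = 4297ULL * 10ULL * 256ULL;
static_assert(kBase == 11000320ULL, "Base should equal 11000320");

// Maximum gridDim.x on compute capability 2.x GPUs (Fermi, e.g. M2050).
constexpr std::uint64_t kMaxGridX = 65535ULL;

// Smallest number of blocks with blocks * threads_per_block >= n_threads.
constexpr std::uint64_t blocks_for(std::uint64_t n_threads,
                                   std::uint64_t threads_per_block) {
    return (n_threads + threads_per_block - 1) / threads_per_block;
}

// True if a one-dimensional grid can hold that many blocks.
constexpr bool fits_1d_grid(std::uint64_t blocks) {
    return blocks <= kMaxGridX;
}
```

Under this assumption, Base threads need 171880, 85940, 57294 and 42970 blocks for 64, 128, 192 and 256 threads per block. The first two counts exceed the one-dimensional grid limit, which is one reason to let each thread process several samples rather than one.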

Fig. 2. Performance comparisons among M2050, X5670 and X5355. (Speedup versus random number size in multiples of Base, for X5670/X5355, M2050/X5670 and M2050/X5355.)

The performance comparisons among M2050, X5670 and X5355 are shown in Fig. 2. The CPU implementations run on all cores of the CPU chips. The speedup of M2050 over X5670 is up to 3.10. M2050 is 8.51 times faster than X5355. X5670 is 2.74 times faster than X5355.

5. Conclusions

In this paper, sampling Compton scattering by Kahn's method is accelerated on GPU. It has been implemented with the CUDA programming model. The advantage of Kahn's method on GPU has been demonstrated on an NVIDIA M2050 GPU compared with two Intel CPUs. Future work includes implementing other crucial kernels of MC simulations.

Acknowledgements

This research work is supported by the National Natural Science Foundation of China under grant No. 60970033, and by the 973 Program of China under grant No. 61312701001.

References

[1] Briesmeister J, et al. MCNP-A general Monte Carlo N-particle transport code Version 4C. Tech. Rep. LA-13709-M, Los Alamos National Laboratory; 2000.
[2] Mitra M, Sarkar P. Monte Carlo simulations to estimate the background spectrum in a shielded NaI(Tl) γ-spectrometric system. Applied Radiation and Isotopes 2005; 63:415–422.
[3] Kahn H. Applications of Monte Carlo, AEC-3259. The Rand Corporation; 1956.
[4] NVIDIA Corporation. CUDA Programming Guide Version 3.2. 2010.
[5] Thomas DB, Howes L, Luk W. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. FPGA '09. New York: ACM; 2009, p. 63–72.
[6] Gong C, Liu J, Chi L, Hu Q, Deng L, Gong Z. Accelerating Pseudo-Random Number Generator for MCNP on GPU, in: AIP Conference Proceedings ICNAAM 2010; 2010, 1281, p. 1335.
[7] Heimlich A, Mol A, Pereira C. GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation evaluation. Progress in Nuclear Energy 2011; 53:229–239.
[8] Tickner J. Monte Carlo simulation of X-ray and gamma-ray photon transport on a graphics-processing unit. Computer Physics Communications 2010; 181:1821–1832.
[9] Gong C, Liu J, Yang B, Deng L, Li G, Li X, Hu Q, Gong Z. Accelerating MCNP-based Monte Carlo Simulations for Neutron Transport on GPU. International Journal of Radiation Oncology * Biology * Physics, Proceeding of ASTROS 2011 Annual Meeting; accepted.
[10] Gong C, Liu J, Chi L, Huang H, Fang F, Gong Z. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method. Journal of Computational Physics 2011; 230:6010–6022.
[11] Carter L, Cashwell E. Particle-transport simulation with the Monte Carlo method. Tech. rep., Los Alamos Scientific Lab., N. Mex. (USA), ERDA Critical Review Series, TID-26607; 1975.
[12] Singhal R, Burns A. Verification of Compton collision and Klein–Nishina formulas: an undergraduate laboratory experiment. Am. J. Phys. 1978; 46:646–649.
[13] Moderski R, Sikora M, Coppi P, Aharonian F. Klein–Nishina effects in the spectra of non-thermal sources immersed in external radiation fields. Monthly Notices of the Royal Astronomical Society 2005; 363:954–966.
[14] Koblinger L. Direct sampling from the Klein–Nishina distribution for photon energies above 1.4 MeV. Nucl. Sci. Eng. 1975; 56:218–219.
[15] Blomquist R, Gelbard E. An assessment of existing Klein-Nishina Monte Carlo sampling methods. Nucl. Sci. Eng. 1983; 83:380–384.
[16] Gong C, Liu J, Qin J, Hu Q, Gong Z. Efficient Embarrassingly Parallel on Graphics Processor Unit. in: Education Technology and Computer (ICETC), 2nd International Conference on. IEEE; 2010, 4:400–404.
[17] Gong C, Liu J, Qin J, Hu Q, Gong Z. Hybrid Embarrassingly Parallel on heterogeneous platform. in: Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on. IEEE; 2010, 3:95–99.
[18] Gong C, Liu J, Yang B, Hu Q, Gong Z. Sampling Angular Distribution of MCNP based Neutron Transport on GPU. in: Computer Science and Information Technology (ICCSIT), 2011 4th IEEE International Conference on. IEEE; 2011, accepted.
