A fast scalable implementation of the two-dimensional triangular Discrete Element Method on a GPU platform

Lin Zhang a,*, Steven F. Quigley a, Andrew H.C. Chan b

a School of Electronic, Electrical & Computer Engineering, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
b School of Civil Engineering, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
Article info
Article history: Received 9 October 2012; Accepted 19 October 2012; Available online xxxx
Keywords: Discrete Element Method; Parallel computing; Domain decomposition; General purpose graphic processing unit; FDEM; CUDA
Abstract
Real-time solution of the Discrete Element Method is a computational challenge that is hardly achievable on standard PCs, especially when a large number of triangular shaped particles is involved. This paper presents a scalable architecture, including a domain decomposition technique, of a GPU accelerator for the two-dimensional Discrete Element Method for triangular shaped particles. Compared with a similar algorithm run on a fast desktop PC, this approach achieved a speed up of about 140 times for the single-domain DEM core and about 80 times after domain decomposition on a consumer level GPU.
© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

The Discrete Element Method (DEM) for simulation of granular media is widely used both by academics and industrialists, especially in the civil and mechanical engineering fields [1,2]. The most serious barrier to the practical usage of the DEM is its high computational cost, which means that this promising approach normally requires a long run time even on expensive workstations and is unable to provide real-time results.

Previous attempts at parallelisation of the DEM have mostly focused on the exploitation of coarse grained parallelism on distributed memory parallel computers [3–5]. The hardware required to achieve high levels of speed up using this approach is relatively expensive. An alternative approach is to exploit massive fine grained parallelism using a low cost special purpose board plugged into a single PC. Several approaches have been tried on Field Programmable Gate Arrays (FPGAs) [6] and GPUs [7–10]. The earliest application of GPUs to the DEM [7] used a simple approach based on a uniform grid and texture lookups to demonstrate the feasibility of the GPU approach. Thread-per-particle and thread-per-cell algorithms [8] provided superior exploitation of the GPU's parallel capacity. GPUs have been shown to be a powerful accelerator for granular flow simulations involving several million particles [9] and have been extended to a GPU based multi-scale HPC system for multi-scale discrete simulation [10].
* Corresponding author. Tel.: +44 7949299976. E-mail address: [email protected] (L. Zhang).
However, all of the above approaches used circular shaped particles for DEM simulation and did not consider problems involving divided domains. Previous experiments [11] have shown that circular particles may not represent real granular material well. One approach to generalising the shape of particles is to use triangular elements (which may be bonded together to form more complex shapes). However, when triangular shaped particles are used, solution of the DEM becomes much more challenging than with formulations that use circular particles. This paper presents a design for a two-dimensional triangular DEM solver using the NVIDIA CUDA technique, implemented on NVIDIA GeForce 8600M GT and GeForce GTS 250 GPUs. A dynamic domain decomposition method has also been included.

2. Heterogeneous multicore system and general purpose graphic processing unit

Heterogeneous multicore computer systems have been around for many years, but historically have been high cost systems found in research institutes or specialist facilities. The last decade has seen a proliferation of low cost heterogeneous multicore solutions targeted at the mass consumer market. One of the most promising architectures, from the point of view of massively parallel scientific computation, is the general purpose Graphic Processing Unit (GPU). Nowadays, the GPU is playing an increasingly important role in the general purpose computation field in addition to its original function, the acceleration of
computer graphics. It is an accessible, extensible and easily programmed platform and has great promise for extensive use as a general purpose computation unit compared to other novel High Performance Computing (HPC) architectures on the market. A GPU can readily be found in most modern computers, even in consumer level home PCs and portable laptops. Even low cost mainstream GPUs in the mass market contain many high-frequency floating point cores and hundreds of megabytes of memory [12,13], which make supercomputing more affordable and more accessible than ever before. The combination of built-in support for floating point arithmetic and parallel processing means that GPU computing is a highly efficient solution, especially well suited to problems that can be expressed as data-parallel computations with high arithmetic intensity [14]. Meanwhile, there is also a large number of high-end, high cost GPU devices, such as the NVIDIA Tesla [15] and the AMD FireStream series [16], which are specifically aimed at numerically intensive computing. Recently, NVIDIA and ARM have joined forces for "Project Denver", aimed at the high performance computation market, to challenge the industry leader, Intel, which still holds a 77.4% share of the Top500 supercomputing list [17]. In 2010 AMD started to deploy its new Accelerated Processing Unit (APU) supporting its Fusion computational architecture. This architecture integrates the CPU and GPU into one chip, thus removing the latency between host and processors and also the separate memory addressing schemes, which are normally considered the fundamental bottlenecks of server-client style GPU computation. Another benefit of the single chip design is a reduction in system power consumption.

A further advantage of using the GPU as a general purpose computation co-processor is that it can provide impressive performance with a relatively fast learning curve for programmers, whether it is used for theoretical research, industrial practice or small exploratory designs. GPU programming languages use very similar coding syntax to languages widely used on standard CPUs, such as C, C++ and Java. Existing programs written for standard CPUs can be run on GPUs after relatively modest modifications, although to fully benefit from the GPU's parallel architecture the code needs to be restructured for deep optimisation.

2.1. CUDA architecture

The Compute Unified Device Architecture (CUDA) is used on NVIDIA's computational GPUs. A typical member of the family is shown in Fig. 1. The computational units are grouped into a number of Stream Multiprocessors (SMs). Each SM contains eight scalar processors (cores), each of which contains an integer ALU and a single precision floating point ALU. Each SM also contains one double precision floating point unit, two special function units, a shared memory and a set of registers. Table 1 shows the common execution model of CUDA programs. The smallest unit of an executable program running on a GPU is called a thread. Threads are executed by the scalar processors. A number of threads are grouped into one block, and these threads are divided into one or more warps of no more than 32 threads each. All threads in the same warp follow exactly the same instruction sequence when executed. Blocks do not migrate and are executed on the multiprocessors. A computational kernel is launched as a grid of blocks, and only one kernel can execute on a device at a time.
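As a concrete illustration of this hierarchy, the following minimal sketch (a hypothetical kernel and launch, not taken from the DEM code described later) shows how a grid of blocks of threads is launched and how each thread derives a global index:

```cuda
// Minimal illustration of the CUDA execution hierarchy (hypothetical kernel).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleForces(float *force, float dt, int n)
{
    // Each thread handles one array element: the global index is built
    // from the block index and the thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        force[i] *= dt;   // all 32 threads of a warp execute this in lock-step
}

int main()
{
    const int n = 1024;
    float *d_force;
    cudaMalloc(&d_force, n * sizeof(float));
    cudaMemset(d_force, 0, n * sizeof(float));

    // A kernel is launched as a grid of blocks; each block is split into warps of 32 threads.
    dim3 block(256);                        // threads per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n elements
    scaleForces<<<grid, block>>>(d_force, 0.5f, n);
    cudaDeviceSynchronize();

    cudaFree(d_force);
    return 0;
}
```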
3. The Discrete Element Method

The Discrete Element Method has been used by academia and industry for more than thirty years and was originally used for analysing the mechanical behaviour of discontinuous media, such as rocks [18,19]. The main idea underlying the DEM is to divide these discontinuous objects into an aggregate of rigid elements; the equations of motion are then used to calculate their subsequent movement. The most important benefit of the DEM is that in most approaches relative motion of particles is permitted and there is no requirement for continuity of displacement or compatible deformation conditions. Despite originating from traditional discontinuous medium problems, the application area of the DEM has been extended to continuous objects and to mechanical transformation problems between continuous and discontinuous media. One typical example is the damage and destruction of brittle material, such as concrete, under dynamic loading conditions such as shock and penetration. This kind of problem is normally hard to solve using algorithms based exclusively on continuum mechanics, such as the finite element method [20,21].

A simple flow chart of the DEM is shown in Fig. 2. The simulation starts from some initial condition and then proceeds to Contact Detection. By assuming that particles only exert forces on one another when they are in contact, the resultant force on each particle can be calculated. The increment of velocity can then be obtained using Newton's second law. Each particle's behaviour is calculated individually within a single time step, and the results of the method will be correct only under the assumption that no particle can travel beyond its immediate neighbours within one time step. Thus, the time step needs to be very small to meet this requirement.

3.1. Contact detection

The easiest way to determine the positional relationship between two oblique triangles is the so called direct area method. The space vector of any triangle △ABC can be expressed as
\vec{S}_{\triangle ABC} = \frac{1}{2}\,\overrightarrow{AB} \times \overrightarrow{AC} = \frac{1}{2}\left[(x_B - x_A)(y_C - y_B) - (x_C - x_B)(y_B - y_A)\right]    (1)
This equation can also be used to determine the positional relationship between the vector \overrightarrow{AB} and a point C, as shown in Fig. 3. For the point C1, which is to the right of \overrightarrow{AB}, S_{\triangle ABC_1} > 0; for the point C2, which lies exactly on \overrightarrow{AB}, S_{\triangle ABC_2} = 0; and for the point C3, which is to the left of \overrightarrow{AB}, S_{\triangle ABC_3} < 0. The factor of 1/2 does not affect the sign and can be omitted, so evaluating Eq. (1) costs 5 additions and 2 multiplications. Fig. 4 shows some example relative placements of △ABC and a point P.
(a) S_{\triangle ABP} > 0, S_{\triangle BCP} > 0, S_{\triangle CAP} > 0.
(b) S_{\triangle ABP} > 0, S_{\triangle BCP} > 0, S_{\triangle CAP} < 0.
(c) S_{\triangle ABP} > 0, S_{\triangle BCP} < 0, S_{\triangle CAP} < 0.

To determine whether or not two triangles, say △ABC and △A′B′C′, contact each other, it is necessary to determine the relationship between △ABC and each of the points A′, B′ and C′, respectively. This entails nine copies of the space area calculation, i.e. 45 additions and 18 multiplications, together with some control operations. Compared to circular shaped particles [22], triangular shaped elements give greater flexibility and accuracy of simulation for the DEM, but also involve a much higher computational requirement. One possible way to reduce this high computational cost is to add a pre-filter that runs before contact detection (using the Direct Area Method) takes place.
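The direct area test and the nine-area contact check can be summarised in code as follows; this is a sketch with hypothetical helper names that follows the sign convention and operation counts quoted above, not the authors' published code:

```cuda
// Sketch of the direct area method of Eq. (1); helper names are hypothetical.
__host__ __device__ inline float signedArea2(float2 a, float2 b, float2 c)
{
    // Twice the signed area of triangle ABC; the factor 1/2 is omitted
    // because only the sign is needed (5 additions, 2 multiplications).
    return (b.x - a.x) * (c.y - b.y) - (c.x - b.x) * (b.y - a.y);
}

__host__ __device__ inline bool pointInTriangle(float2 a, float2 b, float2 c, float2 p)
{
    // Point P lies inside (or on) triangle ABC when the three areas
    // S(ABP), S(BCP), S(CAP) all share the same sign (case (a) in Fig. 4).
    float s1 = signedArea2(a, b, p);
    float s2 = signedArea2(b, c, p);
    float s3 = signedArea2(c, a, p);
    return (s1 >= 0.0f && s2 >= 0.0f && s3 >= 0.0f) ||
           (s1 <= 0.0f && s2 <= 0.0f && s3 <= 0.0f);
}

__host__ __device__ inline bool contactCheck(const float2 tri[3], const float2 pts[3])
{
    // Nine space-area evaluations in total (3 points x 3 edges), i.e.
    // 45 additions and 18 multiplications, as quoted in the text.
    for (int k = 0; k < 3; ++k)
        if (pointInTriangle(tri[0], tri[1], tri[2], pts[k]))
            return true;
    return false;
}
```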
Fig. 1. CUDA architecture.
Table 1 CUDA execution model.
Fig. 3. Position relationship between vector and point.
The filter is fairly simple, as shown in Fig. 5: it is a bounding box around each triangle. If the bounding boxes of two triangles do not overlap, then the two triangles within the boxes cannot be in contact with one another. This procedure needs to determine the smallest and largest x and y values of the vertex coordinates for both triangles and then carry out four comparisons. As a comparison has a similar cost to an addition or subtraction on most hardware platforms, this procedure costs no more than 16 additions. This block reduces the processing time of the whole contact detection stage because it is much cheaper to compute than the full contact check. The technique is especially useful when the particles are widely dispersed over the problem domain, where it can greatly reduce the computation time.

Fig. 2. Simple flow chart of the DEM.

3.2. Contact interaction

The contact interaction step calculates the force generated between every pair of triangles that are in collision. In the combined finite-discrete element method, individual discrete elements are discretized into finite elements, and each discrete element can be represented as the union of its finite elements [23]. The contact force is obtained as shown in Eq. (2):

f_c = \sum_{i=1}^{n} \sum_{j=1}^{m} \int_{\Gamma_{\beta_{c_i}} \cap\, \beta_{t_j}} \mathbf{n}_{\Gamma} \left( \varphi_{c_i} - \varphi_{t_j} \right) \mathrm{d}\Gamma    (2)
where β_c and β_t are the contactor and target discrete elements, and φ_c and φ_t are their potentials, each obtained as a sum of the potentials associated with the individual finite elements. The contact force between overlapping discrete elements is then calculated by summation over the edges of the corresponding finite elements that overlap. The implementation of the discretized contact force in 2D can be explained with reference to Figs. 6 and 7. The total contact force exerted by the target triangle on the edge AB is given by the area of the potential over the edge, as shown in Fig. 7, i.e.
f_{c,AB} = \frac{\mathbf{u}}{u^{2}} \int_{0}^{L} p\,\varphi(v)\, \mathrm{d}v    (3)
Fig. 4. Direct area method (cases (a)–(c)).
Fig. 5. Pre-filter.
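A sketch of how the bounding-box pre-filter of Fig. 5 might be coded is given below; the helper name and its use of float2 vertex arrays are assumptions, not the authors' implementation:

```cuda
// Possible form of the bounding-box pre-filter of Fig. 5 (hypothetical helper).
// Two triangles whose axis-aligned bounding boxes do not overlap cannot contact,
// so the expensive nine-area test can be skipped for that pair.
__host__ __device__ inline bool boxesOverlap(const float2 a[3], const float2 b[3])
{
    float axMin = fminf(a[0].x, fminf(a[1].x, a[2].x));
    float axMax = fmaxf(a[0].x, fmaxf(a[1].x, a[2].x));
    float ayMin = fminf(a[0].y, fminf(a[1].y, a[2].y));
    float ayMax = fmaxf(a[0].y, fmaxf(a[1].y, a[2].y));

    float bxMin = fminf(b[0].x, fminf(b[1].x, b[2].x));
    float bxMax = fmaxf(b[0].x, fmaxf(b[1].x, b[2].x));
    float byMin = fminf(b[0].y, fminf(b[1].y, b[2].y));
    float byMax = fmaxf(b[0].y, fmaxf(b[1].y, b[2].y));

    // Four comparisons decide the overlap once the extrema are known.
    return axMin <= bxMax && bxMin <= axMax &&
           ayMin <= byMax && byMin <= ayMax;
}
```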
Fig. 6. Contact of contactor and target triangles and contact of an edge of a contactor triangle with a target triangle [23].
where p is the penalty term, and the term u² arises because the vectors u and v are not unit vectors. The same process is repeated for all the remaining edges of the contactor triangle. The overall force on one particular triangle is the vector sum of the forces obtained from each collision with its neighbours.

3.3. Position update

During the position update procedure, each particle is treated as a point mass located at its centroid. The total force on a particle P_i is the sum of the forces generated by all other particles in contact with it. Newton's equations are integrated according to

F_{P_i}^{k} = \sum_{\substack{j=1 \\ j \neq i}}^{n} f(P_i, P_j) = m_i a_i    (4)
Fig. 7. Distribution of contact force between the target triangle and an edge of contactor triangle [23].
where a_i is the acceleration of the particle. Assuming that all particles have uniform density, the velocity and the angular velocity are updated as
v_i^{k+1} = v_i^{k} + a_i\,\Delta t    (5)

\dot{\theta}_i^{k+1} = \dot{\theta}_i^{k} + \frac{M_i}{I_i}\,\Delta t    (6)
Thus, the new coordinates for the particles after one time step will be
P_i^{k+1} = P_i^{k} + f\left(F_{P_i}^{k}\right)    (7)
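A minimal thread-per-particle sketch of this update step is given below. The array names are hypothetical, the forces and moments are assumed to have been accumulated already, and the centroid advance is written with the usual explicit step rather than the generic f(F) of Eq. (7):

```cuda
// Minimal thread-per-particle sketch of Eqs. (4)-(7): explicit update of
// velocity, angular velocity and centroid position (hypothetical names).
__global__ void positionUpdate(float2 *pos, float2 *vel,
                               float *theta, float *omega,
                               const float2 *force, const float *moment,
                               const float *mass, const float *inertia,
                               float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one particle per thread
    if (i >= n) return;

    // a_i = F_i / m_i (Newton's second law, Eq. (4))
    float2 a = make_float2(force[i].x / mass[i], force[i].y / mass[i]);

    // v_i^{k+1} = v_i^k + a_i * dt (Eq. (5))
    vel[i].x += a.x * dt;
    vel[i].y += a.y * dt;

    // omega_i^{k+1} = omega_i^k + (M_i / I_i) * dt (Eq. (6))
    omega[i] += (moment[i] / inertia[i]) * dt;

    // Advance the centroid and orientation over one time step (cf. Eq. (7)).
    pos[i].x += vel[i].x * dt;
    pos[i].y += vel[i].y * dt;
    theta[i] += omega[i] * dt;
}
```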
4. Two dimensional DEM implementation on GPU

The 2D DEM approach is based on Professor Antonio Munjiza's Combined Finite-Discrete Element Method [24].
Table 2
Specification for GeForce 8600M GT and GeForce GTS 250.

Device model                                      GeForce 8600M GT       GeForce GTS 250
CUDA capability major revision number             1                      1
CUDA capability minor revision number             1                      1
Total amount of global memory                     25,119,948 bytes       1,073,414,144 bytes
Number of multiprocessors                         4                      16
Number of cores                                   32                     128
Total amount of constant memory                   65,536 bytes           65,536 bytes
Total amount of shared memory per block           16,384 bytes           16,384 bytes
Total number of registers available per block     8192                   8192
Warp size                                         32                     32
Maximum number of threads per block               512                    512
Maximum sizes of each dimension of a block        512 × 512 × 64         512 × 512 × 64
Maximum sizes of each dimension of a grid         65,535 × 65,535 × 1    65,535 × 65,535 × 1
Maximum memory pitch                              2,147,483,647 bytes    2,147,483,647 bytes
Clock rate                                        0.95 GHz               1.78 GHz
Some source code can be found in his Virtual Experimentation Labs online [25]; it was written in C++ for a standard CPU implementation. The main contribution of this paper is the modification of this application to make it suitable for GPU hardware. Specifically, the design is optimised to balance high parallelism against efficient use of memory bandwidth and granularity. The execution resources provided by the GPU's stream multiprocessors (SMs), including registers, thread block slots and thread slots, can be dynamically partitioned to support the execution of programs with different parameter combinations. The efficiency of a design is greatly affected by these groups of variables, and there are a few technical tricks for finding the best values for the parameters, especially the block size [26].

Early DEM research was carried out using single precision floating point arithmetic. Nowadays, most DEM computations are performed in double precision when high accuracy is needed. Early GPUs could only handle single precision floating point numbers. More modern GPUs, with compute capability 1.3 (late GT200 architecture) or higher, can support double precision computation but with a factor of eight speed penalty. GPUs with
compute capability 2.0 (introduced in the Fermi architecture) have much improved double precision support, with a speed penalty of only a factor of two compared to single precision. Our approach was originally designed for the NVIDIA GeForce 8600M GT, whose detailed specification is listed in Table 2. With CUDA capability 1.1, this GPU can only deal efficiently with single precision floating point numbers. However, our approach generalises easily to double precision if a different GPU is used.

The architecture of the DEM solver core on the GPU is shown in Fig. 8. The data structure for each element is shown in Fig. 9, where the variable en is used to indicate whether or not a particular memory slot contains a valid particle. float2 is a built-in CUDA data type for two-dimensional items. Using this data structure, the particles are more manageable on both the CPU and GPU sides, and coalesced data accesses between global memory and shared memory on the GPU are easily achieved; this is normally considered a requirement for good performance because it gives more efficient utilisation of the global memory bandwidth.
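One possible C++/CUDA rendering of such a particle record is sketched below; the exact field set of Fig. 9 is not listed in the text, so the members other than the en flag and the float2 coordinates are assumptions:

```cuda
// One possible layout for the per-particle record of Fig. 9 (an assumption:
// the exact field set is not given in the text). float2 members and
// __align__(16) padding help keep global-memory accesses coalesced.
struct __align__(16) Particle
{
    float2 vertex[3];   // current vertex coordinates of the triangle
    float2 centroid;    // centroid position
    float2 velocity;    // translational velocity
    float  omega;       // angular velocity
    float  mass;        // particle mass
    float  inertia;     // moment of inertia about the centroid
    int    en;          // 1 if this memory slot holds a valid particle, 0 otherwise
};
```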
Fig. 8. DEM solver core: preprocess (volume, mass) → contact detection (pre-detection filter, direct area method) → interaction force → position update, staged through shared memory, with data held in host/global memory.
Fig. 9. Data structure for a particle.
The above structure was adjusted for execution on the GPU to meet the requirement that data be aligned to 16-word boundaries in order to ensure coalesced data transfer and thus achieve maximum speed. The parameters, such as block size and grid size, are chosen to achieve zero-overhead scheduling, keeping the execution units fully utilised by hiding the latency of accesses to data located in GPU global memory. Normally, this latency can be tolerated by introducing a large number of threads, keeping the stream processors busy for long enough to cover the global memory latency. Limited by the number of registers available to each block, the number of threads cannot be made arbitrarily large. If the domain for the DEM problem is treated as a whole, it was found that each stream processor dealing with 256 threads contained in four blocks makes full use of the registers and shared memory capacity, reducing the total number of data transfers with global memory whilst ensuring that all blocks work at full load. In this algorithm, each particle in the simulation is handled by a single GPU thread.

Fig. 10 shows the time line of this implementation as it executes on the GPU. In this figure, the upper row shows computation within the GPU's cores; the middle row shows data transfers within the GPU; the bottom row shows the data transfers and computation that involve the host CPU.

CUDA programs use the Single-Instruction Multiple-Thread/Data (SIMT or SIMD) model. This works well when all 32 threads in the same warp follow the same control flow path. By contrast,
when the threads meet different conditions and need to pass through divergent computation paths, the threads actually process the different paths sequentially, requiring more execution time. Fig. 11 [27] shows an example control flow with divergent execution paths. On the left of the figure is an instruction sequence; on the right is the number of threads within a warp that are active for each instruction. It can be seen that the warp needs to execute the contents of both the 'if' path and the 'else' path sequentially.

The pre-filter module can provide significant acceleration when the computation runs on Single-Instruction Single-Data (SISD) systems, such as CPUs, as well as on Multiple-Instruction Multiple-Data (MIMD) systems, such as FPGAs, especially for medium to sparse particle distributions. However, because of the SIMT model used by the GPU, the pre-filter module can only increase the computation time unless none of the particles processed within one warp are in contact at all. If even a single pair of particles is in collision, all other threads within the same warp have to wait until the calculations processing that collision are finished, so the hardware resources and time spent on the pre-filter are wasted. For problems with medium to dense element density, it is therefore reasonable to assume that there is no benefit in including the pre-filter module when executing on the GPU, although it is conceivable that there may be some benefit for problems with a sparse element distribution.
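The effect can be illustrated with a small hypothetical kernel (not the authors' code): whenever at least one thread of a warp takes the expensive branch, the whole warp pays for it.

```cuda
// Illustration of warp divergence under the SIMT model (hypothetical kernel).
// The cheap branch only pays off if *all* 32 threads of a warp skip the
// expensive work, which is why the pre-filter rarely helps on a GPU.
__global__ void divergenceDemo(const int *mayContact, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (mayContact[i])
    {
        // Expensive path (stands in for the full nine-area contact test).
        float acc = 0.0f;
        for (int k = 0; k < 256; ++k)
            acc += sinf(out[i] + (float)k);
        out[i] = acc;
    }
    else
    {
        // Cheap path: still serialised after the expensive path for any
        // warp that contains at least one potentially contacting pair.
        out[i] = 0.0f;
    }
}
```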
Fig. 10. Time line of the DEM implementation on the GPU (initial phase: data transfer from host to GPU and preprocessing; main loop phase: contact detection, interaction force and position update with their data transfers; final phase: data transfer from GPU to host).
Fig. 11. Control flow with divergent execution path.
Table 3
Time cost for 1 time step for 256 particles.

Processing time on GPU                    0.173 ms
Transfer time between host and GPU        3.702 ms
Processing time on CPU                    4.173 ms
If the GPU is fully loaded, containing four blocks with 64 particles each, the time spent on a single time step is shown in Table 3. All the data shown are averaged across 10 runs. The transfer time between host and GPU is incurred only once (one read from the host and one write back to the host) no matter how many time steps the simulation uses. After all the data have been read into GPU global memory, all calculations can be based on that copy of the data. The processed data are written back to host memory only if they need to be recorded; otherwise the results can be visualised directly using the GPU's graphics capability. If the data of every time step need to be recorded, the data transfer time is hidden by the data processing except for the very last time step. The processing time on the CPU refers to a similar algorithm optimised for the CPU and run sequentially on a single core on the host side, which is powered by an Intel® Core™ 2 Duo CPU ([email protected] GHz) with 4 GB RAM running the Windows 7 operating system.

The 'particle per thread' algorithm guarantees that there is no interaction between different threads. If the program is run for more time steps, the time cost grows linearly. Fig. 12 shows the speed up for different numbers of time steps. Overall, this simple DEM core gives a speed up of around 140 times compared to the similar algorithm running on the CPU.
Owing to the architecture of a CUDA enabled GPU, if the threads in different blocks do not need to synchronise with each other, the CUDA runtime system can execute the blocks in any order. This scalability means that a single algorithm design can work on different hardware platforms sharing the same microarchitectural features. As shown in Table 2, the GeForce GTS 250 has an identical hardware specification to the GeForce 8600M GT except for its global memory size, core count and clock rate. These three factors have only a minor effect on the architecture of the program design compared to the other parameters. Fig. 13 shows the time cost for the same code run on these two GPUs. According to Fig. 13b, the run time difference between the two GPUs is a factor of 5 ± 0.1, although the theoretical value should be (1.78 × 16)/(0.95 × 4) = 7.49. This is because the number of blocks is small compared to the number of stream multiprocessors in the GeForce GTS 250, which means that the cores were not fully loaded all the time. When a larger number of elements is involved, more blocks of threads run on these cores, and the theoretical value of the time cost difference is approached. These results still support the scalability principle that when GPUs differ only in the amount of resource, but share the same hardware architecture, a program can run on all of them and provide a near-linear speed boost.
5. Domain decomposition

In a practical problem, the DEM usually needs to process in excess of 100,000 particles, which is far beyond the capacity of the single domain described in Section 4.
Fig. 12. Speed up for different numbers of time steps.
Fig. 13. Time cost and speed up on GeForce 8600M GT and GTS 250.
Moreover, as the total number of particles increases, the computational cost of the Contact Detection stage grows as O(N²). Handling a DEM problem as a single domain therefore soon becomes a "mission impossible" as the particle number increases. From the nature of the DEM algorithm, it is clear that particles are only affected by their neighbouring particles. This provides the possibility of splitting the whole domain into several sub-domains.

There are two different approaches to decomposing the problem domain: one is based on the physical positions of the particles involved; the other is based on the number of particles that can be contained within one sub-domain. Both methods have their pros and cons.

The domain decomposition method based on physical position is very straightforward. First, the whole problem domain is divided into several sub-domains, most likely all of the same size. Particles are then assigned depending on the coordinates of their centroids. This method is easy to apply, in that it does not need a complex pre-processing procedure before starting the DEM calculation itself. However, it has the disadvantage that the number of particles in a particular sub-domain may well be rather small. For an implementation with a fixed number of threads per block, the hardware resources used for empty threads are obviously wasted and the computational efficiency decreases. This method is only suited to problems with a limited domain size and a relatively high density of particles, or to problems in which the particles are evenly distributed with little variation.

The alternative method uses quasi-fixed particle numbers within each sub-domain and offers more flexibility for different problem types compared to the fixed-size approach. The physical size of each sub-domain is automatically adjusted as the calculation proceeds, so the stream processors are well utilised. Systems using this method normally achieve a better load balance on each processor and can generate higher computational efficiency [28,29] compared to the former method. The drawback of this method is that the pre-processing procedure that groups particles into suitable sub-domains is complicated, and the control unit that modifies the borders of the sub-domains has to be invoked after each time step in which a particle moves from one sub-domain to another.

Fig. 14 shows a design for domain decomposition that uses a given particle count within each sub-domain; Fig. 16 shows a single sub-domain from this problem. Each sub-domain (including its outer overlap area) can contain no more than 256 particles, to match the capability of the DEM core described in Section 4 according to the device resource specification. At the edge of each sub-domain there is an overlap area whose width is L, the largest side length of any of the particles; this value can be extracted by the pre-processing unit, or can be given when the initial data is read in. For any particular particle, the identity of the sub-domain in which it is contained can be determined from its centroid.
Fig. 14. Dynamic domain decomposition.
Assume that the boundary for the problem domain is
x \in (0, x_{max}), \quad y \in (0, y_{max})    (8)
and the length of the sub-domain along the y axis is ΔL_y, a fixed value once assigned. All the particles involved in the problem can be placed into a pre-grouped grid with cell size L × L. Thus, a particle with centroid o(x_o, y_o) belongs to the cell with index CellInf[x][y], where
x = \lfloor x_o / L \rfloor, \quad y = \lfloor y_o / L \rfloor    (9)
The read-in data are organised as a linked list, as shown in Fig. 15, from which the number of particles within the area S, GroupCount[group.x, group.y/ΔL_y], shown in Fig. 16, can easily be obtained. The image address pointers are used for the later particle flush process (shown in Fig. 18). If the number of particles contained in the sub-domain D_{x_D, y_D} is denoted by Par_real[x_D, y_D], and the rightmost border of this sub-domain by DB[x_D, y_D], then the domain decomposition information can be calculated by the pseudo code in Fig. 17. This initial grouping process needs to be done only once, before the data of all the particles are read into GPU global memory, and is more suitably carried out on the CPU side.
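A host-side sketch of this initial grouping, using the cell index of Eq. (9), is given below. The paper stores the read-in data as a linked list (Fig. 15); a vector-of-cells container and the function name used here are illustrative assumptions only.

```cuda
// Host-side sketch of the initial grouping step based on Eq. (9).
// Containers and names are illustrative; the paper uses a linked list (Fig. 15).
#include <algorithm>
#include <cmath>
#include <vector>

struct Centroid { float x, y; };

void groupParticles(const std::vector<Centroid>& centroids,
                    float L,                         // largest particle side length
                    int cellsX, int cellsY,
                    std::vector<std::vector<int>>& cell) // particle indices per cell
{
    cell.assign(static_cast<size_t>(cellsX) * cellsY, {});
    for (int i = 0; i < static_cast<int>(centroids.size()); ++i)
    {
        // Cell indices from Eq. (9): x = floor(xo / L), y = floor(yo / L)
        int cx = static_cast<int>(std::floor(centroids[i].x / L));
        int cy = static_cast<int>(std::floor(centroids[i].y / L));
        cx = std::min(std::max(cx, 0), cellsX - 1);   // clamp to the problem domain
        cy = std::min(std::max(cy, 0), cellsY - 1);
        cell[static_cast<size_t>(cy) * cellsX + cx].push_back(i);
    }
}
```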
Fig. 15. Read-in cell structure.
Fig. 16. Single sub-domain for dynamic domain decomposition.
All these reference values are saved in constant memory on the GPU side for further use. If the number of particles in the overlap area is Par_overlap, the DEM computation within one sub-domain involves Par_real × (Par_real + Par_overlap) instances of contact detection. Only the information for particles within the sub-domain itself is updated; the information for the particles in the overlap area is updated when the calculation is carried out in their own sub-domain. Thus, there are no data conflicts across sub-domains. The order in which the data of different sub-domains are handled does not affect the overall result, and they can be processed simultaneously.

After all calculations of one time step are completed, some particles may have travelled between sub-domains. To ensure that these particles are registered in the correct sub-domain, their centroids are compared with the sub-domain borders and their group information is changed if needed. This is followed by sub-domain boundary modification: after changes of particle groups, it is possible that the number of particles contained in some sub-domains exceeds Par_max, in which case the borders of these sub-domains are extended by a length L each time. This is the so called Dynamic Domain Decomposition concept.
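A much simplified host-side sketch of this end-of-step bookkeeping is given below; it treats only borders along the x axis, omits the adjustment of neighbouring borders and overlap lists, and all names apart from Par_max and L are assumptions.

```cuda
// Host-side sketch of end-of-step re-grouping and border extension.
// Simplified: only x borders are handled; a particle outside every
// sub-domain is silently dropped in this sketch.
#include <vector>

struct SubDomain
{
    float xLeft, xRight;          // current borders along the x axis
    std::vector<int> particles;   // indices of particles registered here
};

void refreshSubDomains(std::vector<SubDomain>& domains,
                       const std::vector<float>& centroidX,
                       int parMax, float L)
{
    // 1. Re-register particles whose centroids have crossed a border.
    for (size_t d = 0; d < domains.size(); ++d)
    {
        auto& list = domains[d].particles;
        for (size_t k = 0; k < list.size(); )
        {
            float x = centroidX[list[k]];
            if (x < domains[d].xLeft || x >= domains[d].xRight)
            {
                // Move the particle to whichever sub-domain now contains it.
                for (auto& other : domains)
                    if (x >= other.xLeft && x < other.xRight)
                    { other.particles.push_back(list[k]); break; }
                list[k] = list.back();
                list.pop_back();
            }
            else
                ++k;
        }
    }

    // 2. Extend overloaded sub-domains by L, as in the dynamic decomposition scheme.
    for (auto& dom : domains)
        if (static_cast<int>(dom.particles.size()) > parMax)
            dom.xRight += L;
}
```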
Fig. 17. The pseudo code for sub-domain parameters.
Fig. 18. Overall data flow of the dynamic domain decomposition model.
Fig. 19. Time cost for different particle number with 100 time steps.
Fig. 20. Time cost of 2D domain decomposition on GeForce 8600M GT of different sub-domain dimensions.
The overall data flow of one Dynamic Domain Decomposition model for the Discrete Element Method is shown in Fig. 18. When the problem size is small, the number of sub-domains along the y axis can be set to 1, so that the decomposition degenerates to one dimension. This simplified design was implemented on the GeForce 8600M GT, the GTS 250 and a standard Intel CPU for 100 time steps; the results obtained are shown in Fig. 19. From this figure it can be seen that the time cost on the CPU side is almost linear, whilst on the GPU side there is a threshold related to the number of particles. When the particle number is below this threshold, the program performs inefficiently; when the particle number exceeds the threshold, the time cost on the GPU side becomes linear in the number of particles. Thus, when the domain decomposition method is introduced, the CPU is more efficient for a small number of particles, whereas the GPU has the advantage for computations involving a large number of elements. For different GPUs with a similar architecture, the larger the computational resources they contain, the larger this threshold should be.

The use of the simple 1D domain decomposition model is limited by the device resource capacity. When the height of the computational domain exceeds the cap on the maximum particle count of each sub-domain (256 in this design), it is highly possible that even the pre-domain mapping procedure could fail, especially at high particle densities. Moreover, when the domain is so thin that the sub-domain length in the x direction is very short, a large amount of overlap area is included, and the reduction in computational efficiency due to these doubly computed areas becomes too great to ignore. In these circumstances, a 2D domain decomposition should be preferred.
Fig. 21. 1D and 2D domain decomposition on CPU and GPU.
An implementation of a system initialised with a high initial particle density was tested for 100 time steps. Each test was repeated 10 times for the same parameter combination and the average time taken. This 2D domain decomposition system was run on the NVIDIA GeForce 8600M GT card. As shown in Fig. 20, the legend on the right indicates the column and row combination of the initialised sub-domain size. The time spent by this system climbs almost linearly with the number of particles. The highest efficiency was achieved by the 14 × 14 decomposition, which involved the least overlap area among all the tested scenarios.

As shown in Fig. 21, the 2D dynamic Domain Decomposition is slightly slower than the 1D version, whether executed on the CPU or the GPU side. This is because the 2D domain decomposition method occupies more overlap area and invokes more complicated control logic during the computation. The speed-ups achieved by both methods are around a factor of 80.

6. Conclusion

The DEM is a promising approach widely used in many different fields, with growing importance. However, its usefulness is limited by its high computational demands, and real-time solution is hard to achieve. GPUs are best suited to data-parallel computations with a high ratio of arithmetic operations to memory operations. This paper has presented an implementation of the core of a DEM method for triangles on a consumer-level GPU, based on Antonio Munjiza's Combined Finite-Discrete Element Method, which achieved a speed up of about 140 times compared to a similar algorithm running on a mainstream PC. This paper has also presented a design for Dynamic Domain Decomposition using this DEM core for solving DEM problems with a large number of particles, which achieved a speed up of about 80 times.

7. Future work

As described in Section 5, when the Domain Decomposition method is used, no data conflicts occur across sub-domains. The order in which the data of different sub-domains are handled does not affect the result, and they can be processed simultaneously. This makes it possible to split the whole problem across multiple GPUs for higher computational capacity. Moreover, approaches such as "thread per cell" and "thread per particle" can also be used within each discrete sub-domain. Further work may include multi-GPU designs for a higher level of acceleration towards real-time Discrete Element Method solution. With the new features introduced by the Fermi architecture with CUDA compute capability 2.0 (GPUDirect, peer-to-peer communication and Unified Virtual Addressing), a high acceleration result can be expected. Other complicating factors for the DEM, such as nonlinear elastic forces and long range forces such as gravitation or electrostatic forces, may also be added to the design to generate more realistic simulation results. A real-time visualisation output interface is also under way. Implementation of 3D DEM problems on a GPU platform can use a domain decomposition method similar to that described in this paper; this will involve a more complicated particle contact detection algorithm for 3D tetrahedral elements, and handling the overlap between sub-domains also becomes more complicated in 3D space.

References

[1] Williams JR, Mustoe GGW. Modal methods for the analysis of discrete systems. Comput Geotech 1987;4(1):1–19.
[2] Zhu H, Zhou Z, Yang R, Yu A. Discrete particle simulation of particulate systems: a review of major applications and findings. Chem Eng Sci 2008;63(23):5728–70. http://dx.doi.org/10.1016/j.ces.2008.08.006.
[3] Maknickas A, Kačeniauskas A, Kačianauskas R, Balevičius R, Džiugys A. Parallel DEM software for simulation of granular media. Informatica 2006;17:207–24.
[4] Kačianauskas R, Maknickas A, Kačeniauskas A, Markauskas D, Balevičius R. Parallel discrete element simulation of poly-dispersed granular material. Adv Eng Softw 2010;41(1):52–63. http://dx.doi.org/10.1016/j.advengsoft.2008.12.004.
[5] Weatherley D, Boros V, Hancock W, Abe S. Scaling benchmark of ESyS-Particle for elastic wave propagation simulations. In: IEEE sixth international conference on e-Science (e-Science); 2010. p. 277–83. http://dx.doi.org/10.1109/eScience.2010.40.
[6] Carrión Schäfer B, Quigley SF, Chan AHC. Acceleration of the discrete element method (DEM) on a reconfigurable co-processor. Comput Struct 2004;82(20–21):1707–18.
[7] Green S. Particle simulation using CUDA. Tech. rep., NVIDIA; May 2007.
[8] Fritz KP, Herrero-Lopez S. Particle simulation using discrete element methods with CUDA. Massachusetts Institute of Technology; 2010.
[9] Radeke CA, Glasser BJ, Khinast JG. Large-scale powder mixer simulations using massively parallel GPU architectures. Chem Eng Sci 2010;65(24):6435–42. http://dx.doi.org/10.1016/j.ces.2010.09.035.
[10] Chen F, Ge W, Guo L, He X, Li B, Li J, et al. Multi-scale HPC system for multi-scale discrete simulation: development and application of a supercomputer with 1 petaflops peak performance in single precision. Particuology 2009;7(4):332–5. http://dx.doi.org/10.1016/j.partic.2009.06.002.
[11] Cleary PW, Sawley ML. DEM modelling of industrial granular flows: 3D case studies and the effect of particle shape on hopper discharge. Appl Math Modell 2002;26(2):89–111. http://dx.doi.org/10.1016/S0307-904X(01)00050-6.
[12] NVIDIA. GeForce family; 2010.
[13] AMD. ATI Stream technology – consumer; 2010.
[14] NVIDIA. NVIDIA CUDA C programming guide. NVIDIA Corporation; 2010.
[15] NVIDIA. The world's first mass market parallel processors; 2010.
[16] AMD. ATI Stream technology – commercial; 2010.
[17] Top500.org Supercomputer Sites. Top500 list; June 2011.
[18] Cundall P. A computer model for simulating progressive, large scale movements in blocky rock systems. ISRM; 1971.
[19] Cundall PA, Strack ODL. A discrete numerical model for granular assemblies. Géotechnique 1979;29(1):47–65.
[20] Liu K, Gao L. The application of discrete element method in solving three-dimensional impact dynamics problems. Acta Mechanica Solida Sinica 2003;16(3):256–61. http://dx.doi.org/10.1007/s10338-003-0134-8.
[21] Sawamoto Y, Tsubota H, Kasai Y, Koshika N, Morikawa H. Analytical studies on local damage to reinforced concrete structures under impact loading by discrete element method. Nucl Eng Des 1998;179(2):157–77.
[22] Carrión Schäfer B, Quigley SF, Chan AHC. Analysis and implementation of the discrete element method using a dedicated highly parallel architecture in reconfigurable computing. In: Proceedings of the 10th annual IEEE symposium on field-programmable custom computing machines; 2002. p. 173–81.
[23] Munjiza A. Processing of contact interaction in the combined finite-discrete element method. In: The combined finite-discrete element method. John Wiley & Sons, Ltd; 2004. p. 35–72. http://dx.doi.org/10.1002/0470020180.ch9.
[24] Munjiza A. The combined finite-discrete element method. Wiley; 2004.
[25] Munjiza A. Virtual experimentation labs; 2004.
[26] Ryoo S, Rodrigues CI, Baghsorkhi SS, Stone SS, Kirk DB, Hwu W-mW. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 13th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP '08. New York, NY, USA: ACM; 2008. p. 73–82. http://dx.doi.org/10.1145/1345206.1345220.
[27] Fung W, Sham I, Yuan G, Aamodt T. Dynamic warp formation and scheduling for efficient GPU control flow. In: 40th annual IEEE/ACM international symposium on microarchitecture, MICRO 2007; 2007. p. 407–20. http://dx.doi.org/10.1109/MICRO.2007.30.
[28] Fleissner F, Eberhard P. Parallel load-balanced simulation for short-range interaction particle methods with hierarchical particle grouping based on orthogonal recursive bisection. Int J Numer Meth Eng 2008;74(4):531–53. http://dx.doi.org/10.1002/nme.2184.
[29] Zhang D, Jiang C, Li S. A fast adaptive load balancing method for parallel particle-based simulations. Simul Modell Practice Theory 2009;17(6):1032–42. http://dx.doi.org/10.1016/j.simpat.2009.03.003.