Computers and Structures 112–113 (2012) 193–204
GPU-based parallel algorithm for particle contact detection and its application in self-compacting concrete flow simulations
Jingwei Zheng, Xuehui An*, Miansong Huang
State Key Laboratory of Hydroscience and Engineering, Tsinghua University, 100084 Beijing, China
Article history: Received 8 September 2011; accepted 19 August 2012; available online 29 September 2012.
Keywords: GPU; contact detection; DEM; SCC flow simulations; GPU–DEM platform
Abstract
In the present paper, we propose a novel parallel, graphic processing unit (GPU)-friendly algorithm for contact detection, and we develop a GPU-discrete element method (GPU–DEM) platform for self-compacting concrete (SCC) flow simulations. The algorithm takes full advantage of the parallel architecture of the GPU and performs contact detection far more rapidly than a CPU algorithm. Its scalability and robustness make it suitable for particle contact detection over different particle size distributions. The proposed GPU–DEM platform implements all steps of the DEM on the GPU, which avoids transferring data between the GPU and CPU during the simulation. The proposed methodology is expected to be relevant in computational mechanics, with applications to SCC flow simulations, where the number of particles ranges from millions to billions.
© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

Self-compacting concrete (SCC), first developed in Japan [1], has been widely used because of its outstanding advantages in terms of lower labor cost, compact structures, and efficient construction. SCC is one of the most promising technologies in civil engineering construction. However, as reinforcement becomes much denser and formworks become more complex, the risk of improper form filling has become one of the main obstacles to SCC applications [2]. At present, such problems can only be predicted by physical experiments, which are time-consuming and expensive. Therefore, developing effective numerical tools to predict the flow and filling behavior of SCC before actual construction is imperative.

The discrete element method (DEM) was first proposed by Cundall and Strack [3] for simulating rock and granular flow and analyzing the micromechanical behavior of granular media. Since then, DEM has developed rapidly into an effective tool for simulating the dynamics of many industrial and scientific applications, such as packing and mixing processes. Recently, DEM has been adopted in the area of concrete simulation, especially in analyzing the flow behavior of SCC, where the SCC is assumed to be represented by a collection of discrete elements [4–7]. However, DEM simulations require a large number of elements to accurately reproduce the behavior of SCC, which lowers the computational efficiency. Contact detection is the most time-consuming part of a DEM calculation and can account for up to approximately 80% of the total computation time [8].
With recent developments in computer technology, the focus in processor architecture has shifted from increasing clock rates to increasing parallelism. The most successful instance of this architecture is the graphic processing unit (GPU), which, together with the continuously updated Compute Unified Device Architecture (CUDA) programming model presented by the NVIDIA Corporation in 2007, has become a general-purpose processor with a strong focus on achieving performance through high parallelism [9]. Initially, the GPU was designed as a parallel processor specialized for graphics tasks. Nowadays, the GPU is programmable, so shader and compute programs can be written to control it. GPUs that accelerate graphics applications can also be employed in other areas, e.g., computational physics, computational biology, and mathematical finance, and the maximum speed-up ratio can reach several hundred compared with the corresponding serial CPU implementations [10]. As far as we know, the GPU has not yet been applied to accelerate SCC flow simulations. However, contact detection algorithms, such as the bounding volume hierarchy (BVH) method, the k-d tree method, and the uniform grid method, have been widely implemented in smoothed particle hydrodynamics (SPH) and ray tracing applications. Harada et al. proposed the following requirements for an efficient contact detection data structure: low construction cost, easy access to the data in memory, and small memory cost [11]. Measured against these requirements, all three methods mentioned above have advantages and disadvantages, and, to varying degrees, none of them applies GPU technology well to SCC flow simulations.

The BVH method has been widely used for accelerating contact detection, including discrete and continuous contact detection and self-collisions. Lauterbach et al. proposed a GPU-based linear
BVH approach for both discrete and continuous contact detection, and they also introduced a parallel front-based traversal method for contact detection that greatly reduces the bottlenecks in collision traversal [9]. Pabst et al. employed a highly parallel and scalable hybrid GPU collision detection approach for the detection of discrete and continuous contacts of rigid and deformable objects. Their approach runs on off-the-shelf hardware on single-GPU or multi-GPU systems and can handle complex scenarios involving a large number of self-contacts in a few milliseconds [12]. However, all BVH approaches share one limitation with other, similar spatial subdivision parallel techniques: the grid size needs to be large enough that the largest primitive (in their studies, a triangle) can be contained in a single cell, i.e., the largest primitive must be smaller than the cell. This seriously affects performance in some cases, as it leads to relatively large grid cells and, consequently, a high number of collision candidates per cell.

The k-d tree is a well-known space-partitioning data structure for organizing points in k-dimensional space. As an acceleration structure, the k-d tree has been used in a variety of graphics applications, such as nearest neighbor search in point cloud modeling and particle-based fluid simulations. Zhang et al. employed a method to efficiently construct a real-time dynamic k-d tree on the GPU to accelerate the query for the closest particles [13]. Zhou et al. developed a k-d tree algorithm capable of achieving real-time performance on the GPU. This algorithm builds k-d trees in BFS order to exploit the large-scale parallelism of modern GPUs, and its potential has been demonstrated in various applications, including GPU ray tracing, GPU photon mapping, and point cloud modeling [14]. Nevertheless, k-d tree structures are not the best choice for DEM-like computations because the structure needs to be rebuilt at every time step; moreover, the k-d structure differs from one time step to the next, which lowers the computational efficiency and wastes memory [14–16].

The uniform grid method is a regular spatial subdivision method that simplifies neighbor search over the computational domain. The axis-aligned bounding box of the entire domain is subdivided into equally sized cells along each of the three main axes X, Y, and Z. A typical grid representation stores a list of the balls contained in each cell. Therefore, contact detection can be carried out in one or a few cells instead of over the whole domain, which raises the efficiency. Harada et al. used a fixed uniform grid to search for neighboring particles and stored the particle indices on the grid to simulate rigid bodies; however, the maximum number of particles stored in a cell is finite (four particles in their implementation), which is the biggest constraint of their method [17]. A further drawback of grid structures is that too much memory is consumed, because the buckets are allocated with a predefined capacity and, in a typical simulation, many of these buckets are not used; the occupied GPU memory therefore cannot be used for any other purpose. To save memory, two main approaches have been used: adopting a sliced data structure based on a dynamic grid, and sorting the particles in advance. Harada et al.
also constructed a sliced data structure to simulate fluids [11], but this new structure uses as much memory as the uniform grid method itself in the worst case and may make the direct mapping of SPH onto CUDA more complicated; both aspects limit its usage in fluid simulations [18]. As an alternative method for grid construction, Purcell et al. used a form of sorting, wherein particle indices are first sorted by voxel index and the values in the grid are then located using a binary search [19]. Several studies use this method to save memory. In the simple particle-interaction sample of the CUDA SDK, sorting is used to construct a uniform grid structure with an upper bound of eight neighboring particles [20]. However, the SDK method cannot handle particles with
different radii. To address this, Ivson et al. developed a more general data-parallel uniform grid construction algorithm that handles geometric primitives overlapping any number of cells, used as an acceleration structure for ray tracing geometric surfaces [21]. Similarly, Kalojanov et al. presented a scalable parallel algorithm for constructing grids over a scene of geometric primitives that can also handle primitives overlapping any number of cells; by sorting primitive-cell pairs, its performance does not depend on the triangle distribution in the scene. However, these two methods still do not handle cases with strongly non-uniform triangle distributions well [22]. In summary, the sort-based method is efficient from the viewpoint of memory usage because there are no empty buckets, but it is more expensive at identifying the memory address of the cell to which a point belongs than a uniform grid, in which the address of the data storing the cell values can be obtained with a few arithmetic operations [11]. Moreover, the sorting itself consumes the largest share of the contact detection time and is therefore inefficient in terms of its proportion of the total time. In conclusion, a method that has strong parallel ability and saves memory is required for SCC flow simulations.

In the current paper, we propose a new contact detection algorithm based on the uniform grid method. In this new parallel algorithm, we exploit the computational power of the GPU, access the particle data in memory easily, and save memory through the choice of an optimal cell size. We first present a novel GPU-based parallel algorithm to speed up the contact detection. To achieve high efficiency and simulate the motion of particles using the DEM, the uniform grid method is applied to divide the space. In contact detection, both the broad phase and the narrow phase are executed in parallel to determine the particles in contact with each other. Then, the calculation of the contact forces is incorporated into the program to form a full GPU–DEM program. Therefore, all the computations, including contact detection, contact force calculation, and trajectory update, are conducted on the GPU, thereby avoiding any CPU–GPU transfer overhead in the SCC flow simulations. This approach yields more efficient DEM simulations. This paper has three main contributions: first, it introduces a novel, memory-optimized DEM implementation approach using the uniform grid method; second, it builds a universal GPU–DEM platform for particle-based simulations; and third, it applies the GPU-based platform to SCC flow simulations.

2. The new GPU-based parallel algorithm for particle contact detection

In the following section, we first describe the data structure of the parallel algorithm, after which we introduce the implementation of the novel uniform grid method in CUDA in detail. The algorithm is divided into two parts, broad-phase contact detection and narrow-phase contact detection. In the broad phase, particle parallelism is adopted to determine which balls lie in each cell independently, i.e., which balls each cell contains according to the uniform grid. A ball is allowed to overlap an arbitrary number of cells, which is the significant difference compared with other algorithms and methods, such as the CUDA SDK [20].
In the narrow phase, cell parallelism is executed to determine, for each ball, which balls are in contact with it, in preparation for the contact force calculation.

2.1. Data structure

In previous work, such as the CUDA SDK [20,22], the data structure is stored in two parts. The first part contains the ball references. The second part contains the beginning of the
interval for the current cell and the end of the interval for the previous cell; it uses sorting and a hash table to determine the correspondence between the ball indices and the cell indices. As mentioned previously, the algorithm and the data structure in the current paper are divided into two parts. In the broad phase, we store the data structure in two arrays. The first array holds the number of balls each cell contains, and its size equals the total number of cells of the uniform grid in the three dimensions. The second array contains the ball indices for each cell, and its size equals the total number of cells multiplied by a threshold value, which is the maximum number of balls belonging to one cell. In the narrow phase, we store the data structure in two arrays similar to those of the broad phase. The first array holds the number of balls each ball comes into contact with, and its size equals the total number of balls. The second array contains the contacting ball indices for each ball, and its size equals the total number of balls multiplied by a threshold value, which is the maximum number of balls one ball can be in contact with. Although some empty voxels store no balls and thus waste memory, the memory size of the GPU is not the key problem in current GPU development (e.g., a GTX580 has a memory size of 1536 MB). On the other hand, coalesced access can be achieved with the uniform grid method if we store the data in column-major order and set the threshold value to a multiple of 64 on the Fermi architecture. Moreover, we can study the effect of the cell size to determine the optimal cell size, which also saves memory and avoids waste.

2.2. Steps of the algorithm

The proposed approach introduces an efficient uniform grid method for contact detection. A typical grid representation stores a list of the balls contained in each cell, and consecutive lists are usually arranged contiguously in memory. In practice, however, the grid structure must support random access during the SCC flow simulations; thus, it is necessary to determine all balls that overlap a given cell using only its ID. When using the GPU, the main challenge is to develop an algorithm that writes the balls into the several cell lists in a way that can easily be parallelized. An atomic operation is typically used to avoid concurrent writes to the same memory position and thereby solve this data-parallel problem. Although an atomic operation is relatively time-consuming compared with other operations, we use it only once per insertion, and on the Fermi architecture the atomic operation speed increases by as much as 20–30 times compared with the previous architecture. In previous studies, the cell size was required to be bigger than the radii of the balls [20,23]. In the present paper, the values of the cell size and the radius can be arbitrary, which means that a ball can overlap several cells; if a ball overlaps a cell, the ball belongs to that cell.
Different cell sizes lead to different speed-up ratios, and the size effect of the grid is considered in a later section. An example of contact detection using the new algorithm is shown in Fig. 1, which contains five balls and four cells. The steps of the algorithm are summarized as follows. First, the preparation for the GPU parallel calculation, such as the partition of the grid and the allocation of GPU memory, is carried out. Second, particle parallelism is applied in the broad phase to determine which balls lie in each cell. Third, cell parallelism is executed in the narrow phase to determine which balls are exactly in contact with each other. The following subsections present additional details of each step of the grid building process.

2.2.1. Initialization

In the first step, a fixed grid is prepared in the computational space. The size of the cells should be set to an optimal value to realize the maximum speed-up ratio compared with the CPU calculation. Then, we allocate the GPU memory for the arrays of the data structure, the coordinates and radii of the balls, and the information of the divided cells, such as the cell size and the number of cells in the three dimensions. Finally, we transfer the CPU data to the GPU. In addition, the material attributes used in the SCC flow simulations are transferred to the GPU and are usually stored in constant memory if they are not time dependent.

2.2.2. Broad phase

In the second step, the broad phase of contact detection is executed to determine which balls lie in each cell. This phase is divided into two stages. A GPU thread is assigned to each active ball in both stages to perform particle-parallel collision detection. This is effective because all the balls are processed at the same time instead of serially. The pseudo code of the broad phase based on the uniform grid method can be written as Algorithm 1.

Algorithm 1. Broad phase of contact detection
For each ball do
  Stage 1: Calculate the bounds of the ball
  Stage 2: Insert the ball into the corresponding cells
    For upper bound to lower bound
      For front bound to back bound
        For left bound to right bound
          atomicAdd operation on the array d_lenCells for the cell
          record the index of the ball into the array d_Cells for the cell
        End
      End
    End
End
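As an illustration of how Algorithm 1 maps onto CUDA, a minimal sketch of the broad-phase insertion kernel is given below. The array names d_lenCells and d_Cells follow Algorithm 1, whereas the grid descriptor (origin, cellSize, cellGrid) and the constant numParticlePerCell are illustrative assumptions and not necessarily the exact code used in this work.

// Minimal sketch of the broad-phase insertion kernel: one thread per ball.
// d_Cells is stored in column-major order: slot n of cell c is at d_Cells[n * numCells + c].
__global__ void broadPhaseInsert(const float3 *pos, const float *radius, int numBalls,
                                 float3 origin, float cellSize, int3 cellGrid,
                                 int *d_lenCells, int *d_Cells, int numParticlePerCell)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBalls) return;
    int numCells = cellGrid.x * cellGrid.y * cellGrid.z;

    // Stage 1: face boundaries of the ball, converted to cell indices and clamped to the grid.
    int xlo = max((int)((pos[b].x - radius[b] - origin.x) / cellSize), 0);
    int xhi = min((int)((pos[b].x + radius[b] - origin.x) / cellSize), cellGrid.x - 1);
    int ylo = max((int)((pos[b].y - radius[b] - origin.y) / cellSize), 0);
    int yhi = min((int)((pos[b].y + radius[b] - origin.y) / cellSize), cellGrid.y - 1);
    int zlo = max((int)((pos[b].z - radius[b] - origin.z) / cellSize), 0);
    int zhi = min((int)((pos[b].z + radius[b] - origin.z) / cellSize), cellGrid.z - 1);

    // Stage 2: insert the ball into every overlapped cell.
    for (int i = zlo; i <= zhi; i++)
        for (int j = ylo; j <= yhi; j++)
            for (int k = xlo; k <= xhi; k++) {
                int cell = i * cellGrid.x * cellGrid.y + j * cellGrid.x + k;   // formula (1)
                int n = atomicAdd(&d_lenCells[cell], 1);                        // count balls in this cell
                if (n < numParticlePerCell)
                    d_Cells[n * numCells + cell] = b;                           // column-major store
            }
}

A host-side launch would use one thread per ball, e.g. broadPhaseInsert<<<(numBalls + 255) / 256, 256>>>(...), after d_lenCells has been reset to zero at the start of the time step.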
Fig. 1. Example of contact detection.

In Algorithm 1, d_lenCells represents the number of balls each cell contains, and d_Cells records the indices of the balls in each cell. The two stages of the broad phase are presented in detail below.

Stage 1: Calculate the bounds of the balls
As the balls can overlap multiple cells and have different radii, it is necessary to determine the minimum and maximum boundaries of each ball. Based on the spatial subdivision, we first calculate the bounding box of each ball in a particle-parallel fashion by determining its six face boundaries in the three dimensions: left, right, front, back,
up, and down. Fig. 2 shows the boundaries of ball no. 3 in two dimensions, together with the four bounds that define the cells the ball overlaps. Although the bounding box does not follow the surface of the ball exactly, it is the most practical way to define the boundaries of the ball.

Fig. 2. Boundaries of the ball in two dimensions.

Stage 2: Insert the balls into the corresponding cells

After the boundaries of the balls are determined in stage 1, the balls belonging to each cell are determined independently. Because each ball overlaps a different number of cells, a loop over the overlapped cells is executed for each ball in a particle-parallel fashion in stage 2. In the loop, the index of the cell about to be inserted is first calculated as formula (1):

cell number = i × cellGrid.x × cellGrid.y + j × cellGrid.x + k,    (1)

where cellGrid.x, cellGrid.y, and cellGrid.z are the numbers of divided cells in the x, y, and z dimensions, respectively, and i, j, and k are the indices of the overlapped cell in the z, y, and x dimensions, respectively. The entire ball must fit within its boundaries; thus, the number of cells that the ball intersects can be determined quickly by counting the number of cells between the boundaries along each axis and multiplying these counts. Next, each ball uses the atomicAdd function to atomically increment the counter of every cell it overlaps, so each ball performs as many increments as the number of cells it overlaps. In this way, the number of balls each cell contains is saved into the array d_lenCells, and the ball indices are recorded in the array d_Cells; both arrays are initialized at the start of each time step. Once the loop is over, the insertion of the balls is finished, and the two arrays are ready for the narrow phase of contact detection. The data structure of the arrays d_lenCells and d_Cells is shown in Fig. 3, where "numParticlePerCell" is the threshold value of the number of balls in each cell.

The balls are distributed unevenly and have different radii, which means that the loop length differs from ball to ball; thus, we must wait for the last ball to be inserted. The cell size is a major factor that strongly affects the efficiency of inserting the balls into the cells. If the cell size is much bigger than the radius of the balls, more balls are counted in each cell and the insertion loop becomes shorter, but the amount of work in the narrow phase of contact detection increases. Conversely, if the cell size is too small compared with the radius of the balls, fewer balls appear in each cell; however, the number of stored ball entries may not decrease quickly, because a ball can overlap several cells in the proposed algorithm, so the insertion loop becomes longer while the work in the narrow phase decreases. An optimal cell size must therefore be determined to raise the efficiency of the computation. In this step, we also store the array d_Cells in column-major order to ensure coalesced access for writing. Alternatively, the array d_Cells can be allocated with "cudaMallocPitch()" instead of "cudaMalloc()", which makes the accesses more efficient at the cost of some additional memory.

2.2.3. Narrow phase

In the third step, the narrow phase of contact detection is executed to determine which balls are in exact contact with each other. This phase is divided into four stages, in which cell parallelism is implemented: each cell corresponds to a thread block executed in parallel. This procedure is effective because all the cells are processed at the same time instead of serially. Moreover, the shared memory is used in this step instead of abundant writing and reading operations in the global memory, because access to the shared memory is far faster than access to the global memory.
Fig. 3. Data structure of the broad phase: array d_lenCells records the number of balls each cell contains, and array d_Cells records the indices of the balls in each cell, with numParticlePerCell entries reserved per cell.
Using the uniform grid method, once a ball overlaps several cells, all the overlapped cells record the information of that ball, so contact detection only needs to be carried out within individual cells. This means that the candidate contacts of a ball are limited to the balls whose indices are stored in the same cell of the uniform grid. Therefore, the computational domain is reduced, and acceleration through the shared memory becomes possible. In each cell, the contact detection between the balls of the same cell adopts a brute-force calculation, i.e., a serial loop; in the pseudo code, stages 2, 3, and 4 are all inside this loop. The maximum number of loop iterations is the threshold value of the number of balls in each cell, so the amount of calculation is very small if we control the cell size. The pseudo code of the narrow phase based on the uniform grid method can be written as Algorithm 2.

Algorithm 2. Narrow phase of contact detection
For each cell do
  Stage 1: Initialize the shared memory
    global memory -> shared memory
  Stage 2: Choose the judge cell
    cell parallel and loop in the cell
    if (blockIdx.x == judge cell) do stage 3
  Stage 3: Distance detection
    if d12 < (r_ball1 + r_ball2) do stage 4
  Stage 4: Write the array d_lenContactBall and the array d_ContactBall
    atomic operation to count the number into the array d_lenContactBall
    record the ball into the array d_ContactBall
End
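The following CUDA sketch shows one way Algorithm 2 can be organized, with one thread block per cell and the cell's ball list staged in shared memory. The shared variables (s_size, s_Cells) mirror the arrays s_sizes and s_Cells described below; the judge-cell test is written here as a three-dimensional generalization of the "larger down boundary" rule explained in stage 2, so this is only an illustrative sketch and not necessarily the exact kernel of the paper.

// Illustrative sketch of the narrow phase: one thread block per cell.
__global__ void narrowPhase(const float3 *pos, const float *radius,
                            const int *d_lenCells, const int *d_Cells,
                            int numCells, int numParticlePerCell,
                            float3 origin, float cellSize, int3 cellGrid,
                            int *d_lenContactBall, int *d_ContactBall,
                            int numParticlePerAimBall, int numBalls)
{
    __shared__ int s_size;          // number of balls in this cell
    __shared__ int s_Cells[64];     // ball indices of this cell (numParticlePerCell <= 64 assumed)

    int cell = blockIdx.x;
    if (threadIdx.x == 0) s_size = min(d_lenCells[cell], numParticlePerCell);
    __syncthreads();

    // Stage 1: copy the cell's ball list from global to shared memory.
    for (int n = threadIdx.x; n < s_size; n += blockDim.x)
        s_Cells[n] = d_Cells[n * numCells + cell];
    __syncthreads();

    // Each thread takes one ball of the cell and loops over the other balls of the cell.
    for (int a = threadIdx.x; a < s_size; a += blockDim.x) {
        int i = s_Cells[a];
        for (int b = 0; b < s_size; b++) {
            if (b == a) continue;
            int j = s_Cells[b];
            // Stage 2: judge cell -- the pair is handled only in the cell containing the
            // component-wise larger lower boundary of the two balls, so each pair is tested once.
            int jx = min(max((int)((fmaxf(pos[i].x - radius[i], pos[j].x - radius[j]) - origin.x) / cellSize), 0), cellGrid.x - 1);
            int jy = min(max((int)((fmaxf(pos[i].y - radius[i], pos[j].y - radius[j]) - origin.y) / cellSize), 0), cellGrid.y - 1);
            int jz = min(max((int)((fmaxf(pos[i].z - radius[i], pos[j].z - radius[j]) - origin.z) / cellSize), 0), cellGrid.z - 1);
            if (jz * cellGrid.x * cellGrid.y + jy * cellGrid.x + jx != cell) continue;
            // Stage 3: distance detection against the sum of the radii.
            float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
            float rsum = radius[i] + radius[j];
            if (dx * dx + dy * dy + dz * dz < rsum * rsum) {
                // Stage 4: record ball j in the contact list of ball i (ball j records i in its own turn).
                int n = atomicAdd(&d_lenContactBall[i], 1);
                if (n < numParticlePerAimBall)
                    d_ContactBall[n * numBalls + i] = j;
            }
        }
    }
}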
In Algorithm 2, d_lenContactBall records the number of balls each ball comes in contact with, and d_ContactBall records the indices of the balls each ball comes in contact with. The four stages of the narrow phase are presented in detail below.

Stage 1: Initialize the shared memory

We first initialize two arrays in the shared memory of the narrow phase, corresponding to the arrays d_lenCells and d_Cells of the previous step. Then, we copy the data of the arrays d_lenCells and d_Cells from the global memory into the shared memory array s_sizes, which stores the number of balls in each cell, and into s_Cells, which stores the ball indices of each cell. The narrow contact detection is then conducted in the shared memory to speed it up.

Stage 2: Choose the judge cell

Before comparing the distance between two balls with the sum of their radii, a judge cell must be chosen, because in some cases two balls appear together in two or more cells. For example, in Fig. 4, balls 1 and 3 are both in cells 0 and 1. Thus, a judge cell must be chosen in advance to avoid repeated judgments before the narrow contact detection is implemented. The principle is simple: the larger of the two down boundaries of the two candidate balls can only be contained within one cell. For example, in Fig. 4, balls 2 and 3 are both in cells 1 and 3. As the down boundary of ball 2 is larger than that of ball 3, we only calculate the collision in cell 3 and ignore the calculation in cell 1; thus, the procedure is completed without repeated judgment. Therefore, before the distance detection, we first determine whether the current cell is the judge cell of the two candidate balls, in which the distance detection must be conducted. The narrow contact detection is only undertaken in the judge cell, which guarantees that two contacting balls are detected only once and decreases the amount of detection work.

Fig. 4. Method for determining the judge cell.

Stage 3: Distance detection

After the judge cell is determined, the distance detection is implemented. Because all objects are spheres, checking a collision between two balls only requires calculating the distance between their centers: contact can only occur when the distance between the centers of the two objects is less than or equal to the sum of their radii. Therefore, when the distance between two balls is smaller than the sum of their radii, a collision occurs; otherwise, they are not in contact.

Stage 4: Write the arrays d_lenContactBall and d_ContactBall

The final stage of the narrow phase is writing the contact information of the balls from the shared memory into the global memory. Two further atomicAdd operations are adopted to count the balls that come in contact with each ball, after which the index of the contacting ball is stored in the list of the ball it contacts. In this way, the number of balls each ball comes in contact with is saved into the array d_lenContactBall, and the contacting ball indices are recorded in the array d_ContactBall; both arrays are initialized at the start of each time step. After the loop in each cell is over, the writing is completed, and the two arrays are produced for the contact force calculation. The data structure of the arrays d_lenContactBall and d_ContactBall is shown in Fig. 5.

Fig. 5. Data structure of the narrow phase: array d_lenContactBall records the number of balls each ball contacts, and array d_ContactBall records the indices of the balls each ball contacts, with numParticlePerAimBall entries reserved per ball.

The number of recorded contacts is not limited in advance, but it does
not exceed the threshold value we set. In Fig. 5, "numParticlePerAimBall" is the threshold value of the number of balls one ball can be in contact with. In stage 4, the atomic operation is a non-critical part of our implementation, used for convenience, because it is operated in the shared memory and does not hurt performance.

2.3. Optimization and analysis

The optimizations can be summarized in two aspects, with the aim of accelerating the uniform grid method. One aspect controls the parameters of the grid and the boundary condition, and the other exploits the strengths of the GPU. Both aspects are analyzed to realize the maximum speed-up ratio, which can be pursued in several stages of the parallel algorithm to increase the speed and reduce the memory usage.

2.3.1. Determination of the optimal cell size

An important part of the algorithm is the choice of the grid resolution. The optimal cell size is discussed further in this section as it relates to the efficient use of the GPU; it is the only external factor we can vary to influence the efficiency. Mono-sized and poly-dispersed materials are considered hereafter. In the SCC flow simulations, the mortar is considered a mono-sized ball, whereas the aggregate is assumed to be a poly-dispersed material composed of particles of different sizes. Usually, the radius of the mortar is 0.25 cm, and the radius of the aggregate in the SCC ranges from 0.25 to 1.5 cm; the factor of particle distribution is therefore the ratio of the maximum radius to the minimum radius. We use several cube cases as examples to verify the efficiency of the presented algorithm. To generate the required particle composition, packing is utilized to realize different numbers and different grading distributions of particles: particles are allowed to drop into a container under gravity. The undersurface of the container measures 20 cm × 20 cm. Different particle size and cell size cases for contact detection are executed using the new algorithm and the CPU algorithm, and the elapsed time is measured for each run. In the cube cases of the current investigation, both mono-sized compositions and poly-dispersed compositions are used in the simulations. Six sets with different cell factors and an identical number of particles for each composition are generated for modeling purposes. The minimum radius of the particles remains at 0.25 cm,
whereas the maximum radius changes from 0.25 to 1.5 cm in steps of 0.25 cm, i.e., the factor of particle distribution, rmax/rmin, varies from 1.0 to 6.0. The cell size is also varied in each condition to determine the fastest cell size. The contact detection time of each condition is normalized to the case of 50,000 particles to remove the effect of the number of particles on the time. These types of distributions cover the most important cases for checking the efficiency of the presented algorithm. The parameters of the different cell factors are listed in Table 1, which gives an example of the detailed simulation conditions used to study the effect of the cell size and the particle size. Here, rmin is the minimum radius, rmax is the maximum radius, ravg is the average radius, and the cell factor is defined as formula (2):

cell factor = rmax / rmin.    (2)
We implement our algorithm in the CUDA programming language, and later also as a part of the SCC flow simulation application. All tests are performed on three different platforms:

1. 64-bit Windows Server 2008, 2.93 GHz Intel Core i7 CPU, NVIDIA GeForce GTX 280 with 1024 MB VRAM, CUDA 3.0.
2. 64-bit Windows 7, 2.80 GHz Intel Core i7 CPU, NVIDIA GeForce GTX 580 with 1536 MB VRAM, CUDA 3.0.
3. 64-bit Windows Server 2008, 2.93 GHz Intel Core i7 CPU, Intel Fortran Compiler v10.1.013.

Figs. 6 and 7 show the influence of the different cell sizes on the elapsed time on the GPU and on the CPU, respectively. The elapsed time on the GPU is only a few milliseconds (less than 6 ms), whereas the elapsed time on the CPU is several hundred milliseconds.
Table 1. Parameters of the different cell factors.

Cell factor   rmin (cm)   rmax (cm)   ravg (cm)
1             0.25        0.25        0.25
2             0.25        0.50        0.25
3             0.25        0.75        0.26
4             0.25        1.00        0.27
5             0.25        1.25        0.28
6             0.25        1.50        0.29
Fig. 6. Influence of the different cell sizes on the elapsed time on GPU.

Fig. 7. Influence of the different cell sizes on the elapsed time on CPU.

Fig. 8. Cases for the different number of particles compared on different GPUs and CPU.
Thus, the rates of improvement from optimizing the cell size on the GPU and on the CPU are different. Moreover, the optimal cell size is related to the average radius of the particles rather than to the maximum or minimum radius. On the GPU, the optimal cell size ranges from 1 to 1.5 cm, which is about 4 times the average radius of the particles, whereas on the CPU it ranges from 0.5 to 1 cm, about 2 times the average radius. Therefore, the optimal cell sizes DoptCPU and DoptGPU can be written as

DoptCPU = 2 ravg = (2/N) Σ(i=1..N) ri,    (3)

DoptGPU = 4 ravg = (4/N) Σ(i=1..N) ri.    (4)
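As a quick check of Eq. (4), consider the poly-dispersed case with a cell factor of 4 in Table 1: with ravg = 0.27 cm, DoptGPU = 4 × 0.27 ≈ 1.1 cm, which lies inside the 1–1.5 cm range observed in Fig. 6 and matches the 1 cm cell size used in the test cases below.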
To evaluate the efficiency of the new parallel algorithm, seven sets of cases with different numbers of particles are tested using the GPU algorithm, where the numbers of particles are 2969, 6476, 20,653, 48,928, 73,760, 100,196, and 140,063. The factor of particle distribution is 4 and the cell size is 1 cm; both are kept fixed in these cases.
Table 2 shows the influence of the different numbers of particles on the elapsed time on the GPU and CPU; the unit of time in Table 2 is milliseconds. The speed-up ratio on the GTX580 is over 65, and the maximum speed-up ratio is over 73; thus, implementing the contact detection on the GPU is very efficient. A comparison between different GPUs is also carried out in this paper. Fig. 8 shows the influence of the different numbers of particles on the time consumption for the different GPUs and the CPU. The GTX285 is also efficient, with a speed-up ratio that can reach 35, about half of that of the GTX580. Based on these results, using the GPU to accelerate contact detection is feasible.

2.3.2. Optimization of GPU

The optimization of the GPU has two aspects: controlling the number of threads per block and using the memory efficiently. All threads in a grid execute the same kernel functions, but these threads are organized into a two-level hierarchy of block indices and thread indices in CUDA; thus, properly distributing the two-level parallel resources is a critical problem that must be solved. In the GPU structure, the thread number should be a multiple of 64 and must not exceed 512. The bigger the thread number, the smaller the block number. In the new algorithm, we first implement a particle-parallel operation in the broad phase and then a cell-parallel operation in the narrow phase. In the broad phase, the number of threads is recorded as thread number 1, whereas in the narrow phase it is recorded as thread number 2. An orthogonal test is carried out for the cube case with a cell factor of 4, in which thread numbers 1 and 2 are each set to 64, 128, 256, and 512. The block number is automatically determined by the number of particles: if the number of particles is not a multiple of the thread number, the block number is the quotient plus one.
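In CUDA terms, the block count just described is the usual ceiling division; a minimal sketch (the kernel name refers to the broad-phase sketch given earlier and is illustrative):

// Launch configuration for the broad phase: one thread per ball.
int threadNumber1 = 256;                                              // optimal value reported below
int blockNumber = (numBalls + threadNumber1 - 1) / threadNumber1;     // quotient plus one when not a multiple
broadPhaseInsert<<<blockNumber, threadNumber1>>>(/* arguments as in the earlier sketch */);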
Table 2. Cases for the different numbers of particles compared on the GPUs and CPU (time in ms).

Number of particles   CPU time   GTX285 time   GTX580 time   Speed-up ratio on GTX285   Speed-up ratio on GTX580
2969                  15.63      0.50          0.22          31.58                      70.41
6476                  31.25      1.02          0.43          30.58                      73.01
20,653                93.75      2.92          1.39          32.08                      67.35
48,928                218.75     7.02          3.36          31.15                      65.10
73,760                359.38     10.69         5.10          33.62                      70.51
100,196               468.75     14.87         6.90          31.53                      67.95
140,063               671.88     21.34         9.79          31.49                      68.65
Fig. 9. Determination of the thread number in the broad phase (a) and in the narrow phase (b).

Fig. 9(a) and (b) show the optimal choices of thread numbers 1 and 2, respectively. When thread number 1 is set to 256 and thread number 2 is set to 128, we obtain the optimal result. The other technique considered for improving performance is the efficient use and reuse of the memory available on the GPU, in three areas. The first area is the use of the GPU constant memory, rather than storing variables such as the grid division information and the SCC parameters in the global memory or in the shared memory of a kernel function, because storing such variables in the constant memory is more efficient: unlike the global memory, the constant memory is cached, and unlike the shared memory, it does not need to be copied for each kernel call and copied back to the global memory; it is copied once to the GPU and then read by any kernel with very low latency. The second area is reducing the repeated writing and reading of the global memory. As mentioned previously, the global memory is not cached, so repeatedly writing and reading it is very inefficient; an alternative is to use registers to avoid the repeated accesses. However, the register capacity in CUDA is restricted: if too many registers are used, the occupancy of the GPU is reduced and the performance is drastically affected. The last area is the occupancy of the GPU, which is a comprehensive evaluation of the number of registers used, the amount of shared memory, and the number of warps in a block, and reflects the utilization of the device. The occupancy considers the utilization of the GPU rather than the efficiency of the calculation: changing the number of threads can increase the occupancy for a given configuration, but the efficiency may not change. The CUDA occupancy calculator uses this information to automatically determine the maximum occupancy that the GPU can achieve; an occupancy of 1 means the streaming multiprocessors of the GPU are fully utilized. Table 3 shows the occupancy and its influencing factors in the present paper. The occupancy of both phases exceeds 0.5, which means the utilization of the GPU is good when the proposed algorithm is used.

3. The application of the algorithm to SCC flow simulation
In this section, we first introduce the mechanical model used for the contact force calculation in the DEM for SCC flow simulations, and then build a GPU–DEM platform that combines the contact detection and the contact force calculation. Finally, simulations of the slump flow, V-funnel, and L shape-Balls box (L-B box) experiments are implemented on the proposed GPU–DEM platform, which validates the efficiency of the SCC flow simulations on the GPU.

3.1. Mechanical model using DEM for SCC

The main calculation of the DEM consists of three steps: (1) contact detection; (2) calculation of the forces; and (3) update of the trajectories. These processes are looped until the prescribed number of time steps is finished. Of the three, contact detection carries the heaviest calculation load in the DEM [8]; in the current paper, it is accelerated on the GPU as described above. The second step is the calculation of the contact forces. In the current paper, we utilize the most common linear model, the linear spring-dashpot model proposed by Cundall and Strack [3]. In this model, the spring accounts for the elastic deformation, whereas the dashpot accounts for the viscous dissipation, as shown in Fig. 10. The normal contact force model adopts the theory of Hertz to describe the elastic contact between two spheres in the normal direction, in which the relationship between the normal force and the normal displacement is nonlinear [26]. The tangential contact force model adopts the Mindlin and Deresiewicz model, in which the force–displacement relationship depends on the whole loading history and the instantaneous rate of change of the normal and tangential force or displacement [27]. To simulate the adhesive and cohesive behavior, a set of joints in both the normal and the shear directions is added as failure criteria (Fig. 11). In the fresh stage, SCC behaves as a Bingham material with two rheological constants, namely, the yield value and the plastic
viscosity [4–7]. In the present paper, a combined model is adopted, as shown in Fig. 12 [25]: if the deformation is bigger than the thickness of the mortar, the deformations of the mortar and the aggregate are both considered, acting in series between the mortar and the aggregate. Following Obara and Huang [24,25], two rheological parameters of the SCC elements, the shear strength fs and the damping coefficient h, are input to simulate the flow behavior of SCC. These two parameters model the two rheological constants of SCC, namely, the yield stress and the plastic viscosity, and different values of fs and h lead to different flow behaviors of SCC. The constitutive law implemented by Obara and Huang [24,25], which uses rheological parameters in DEM simulations, has also been adopted in the present research. The modeling of SCC is not the key point of this paper, so it is not described in detail. With these models and failure criteria, the contact forces of the particles are obtained. The third step is to update the trajectories of the particles using Newton's second law. With the three steps above, one time step of a DEM simulation is conducted, after which the output of this step becomes the input of the next step, with which the next time step begins. When all the steps are finished, the DEM simulation is over.

Fig. 10. Cundall–Strack model (normal spring kn and dashpot cn; tangential spring ks, dashpot cs, and slider).

Fig. 11. Failure criteria of the mortar model in the normal (a) and shear (b) directions [28].

Table 3. Occupancy of GPU.

Phase          Threads   Shared memory   Register   Occupancy
Broad phase    256       0               17         1
Narrow phase   128       528             28         0.667
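To make the force step concrete, the following is a minimal sketch of a purely linear spring-dashpot (Cundall–Strack type) normal-force kernel that reads the contact lists produced by the narrow phase. The stiffness kn, the damping cn, and all array names other than d_lenContactBall and d_ContactBall are illustrative assumptions; the actual SCC model additionally involves the Hertz and Mindlin–Deresiewicz terms, the joints, and the rheological parameters described above, which are omitted here.

// Illustrative normal contact force kernel (linear spring-dashpot only), one thread per ball.
__global__ void contactForceNormal(const float3 *pos, const float3 *vel, const float *radius,
                                   const int *d_lenContactBall, const int *d_ContactBall,
                                   int numBalls, int numParticlePerAimBall,
                                   float kn, float cn, float3 *force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBalls) return;

    float fx = 0.f, fy = 0.f, fz = 0.f;
    int n = min(d_lenContactBall[i], numParticlePerAimBall);
    for (int c = 0; c < n; c++) {
        int j = d_ContactBall[c * numBalls + i];                 // column-major contact list
        float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
        float dist = sqrtf(dx * dx + dy * dy + dz * dz);
        float overlap = radius[i] + radius[j] - dist;            // > 0 for balls in contact
        if (overlap <= 0.f || dist <= 0.f) continue;
        float nx = dx / dist, ny = dy / dist, nz = dz / dist;    // unit normal from j to i
        float vn = (vel[i].x - vel[j].x) * nx + (vel[i].y - vel[j].y) * ny + (vel[i].z - vel[j].z) * nz;
        float fn = kn * overlap - cn * vn;                        // spring + dashpot
        fx += fn * nx;  fy += fn * ny;  fz += fn * nz;
    }
    force[i] = make_float3(fx, fy, fz);    // ball-plane forces are added by a separate kernel
}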
3.2. GPU–DEM platform

To avoid time-consuming CPU–GPU transfers, we execute the mechanical contact force calculation and the displacement update on the GPU. Thus, the GPU–DEM platform is established, on which all steps of the DEM simulation are executed fully on the GPU. After the contact detection on the GPU, four kernels are carried out. The first kernel calculates the force between the balls in contact, based on the results of the contact detection, in parallel over particles. The second kernel calculates the force between each ball and the planes of the container. Because the planes are few, a parallel brute-force method, in which each ball computes the contact force with every plane, is used. The third kernel adds the ball-ball force and the ball-plane force together to obtain the resultant force, which is the input of the next kernel for the displacement update. Newton's second law is applied in the fourth kernel to update the displacement of the particles, which serves as the initial value for the next time step. Through the contact detection and these four kernels, we build a GPU–DEM system for particle-based simulations such as the SCC flow simulations of the current paper. Fig. 13 shows the flow chart of the GPU–DEM platform.

Given that rendering is not performed on the GPU in the present research, we display the simulation after the calculation on the GPU. Therefore, the data of some steps must be transferred from the GPU to the CPU in the simulations. This process does not affect the performance, as shown below. The time step in the DEM is usually very small. In our simulations, we set the time step to 10^-5 s/step, which means that simulating 5 s of real time requires 500,000 steps in the SCC flow simulation. Therefore, we transfer the coordinates of the particles from the GPU to the CPU every 2000 steps. Although this transfer is time consuming, it occurs only 250 times for a total of 500,000 steps; compared with the simulation time of the 500,000 steps, the transfer time can be ignored.

3.3. Performance

There are plenty of methods for testing the workability of SCC, but only three typical tests are simulated in the current paper to validate the accuracy and efficiency of the algorithm and the GPU–DEM platform: the slump test, the V-funnel test, and the L-B box test.
Fig. 12. DEM modeling of concrete [25].
Fig. 13. Flow chart of the GPU–DEM platform: begin -> transfer CPU data to GPU -> loop on the GPU (contact detection -> contact force, i.e., ball-ball and ball-plane forces -> update of the positions by Newton's second law) until the required steps are completed -> transfer GPU data to CPU -> output -> finish.
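Read as code, the flow chart of Fig. 13 corresponds to a host-side loop along the following lines. The kernel names echo the sketches given earlier and are empty illustrative stubs here (their bodies and arguments are omitted), so this only shows the structure of one run, not the exact routines of the platform.

// Host-side sketch of the GPU-DEM time-stepping loop of Fig. 13 (stub kernels, arguments omitted).
__global__ void broadPhaseStub() {}    // broad phase, particle parallel
__global__ void narrowPhaseStub() {}   // narrow phase, cell parallel
__global__ void contactForceStub() {}  // ball-ball and ball-plane forces
__global__ void integrateStub() {}     // Newton's second law, position update

void runGpuDem(int numBalls, int numCells, int totalSteps, int printSteps)
{
    int threads1 = 256, threads2 = 128;                       // optimal thread numbers from Section 2.3.2
    int blocksBalls = (numBalls + threads1 - 1) / threads1;
    // particle data are transferred from the CPU to the GPU once, before the loop
    for (int step = 0; step < totalSteps; step++) {
        // d_lenCells and d_lenContactBall are assumed to be reset to zero here
        broadPhaseStub<<<blocksBalls, threads1>>>();
        narrowPhaseStub<<<numCells, threads2>>>();
        contactForceStub<<<blocksBalls, threads1>>>();
        integrateStub<<<blocksBalls, threads1>>>();
        if (step % printSteps == 0) {
            // copy particle coordinates back to the CPU for display only
            // (about 250 transfers for 500,000 steps, negligible compared with the run time)
        }
    }
    cudaDeviceSynchronize();
}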
Fig. 14. SCC experiments on the GPU–DEM platform.
Fig. 15. Comparison between the experiments and the simulations.
Table 4. Results of SCC cases simulated on different platforms (time in s).

Cases      Number of particles   Total steps   Print steps   CPU time   GTX285 time   GTX580 time   Speed-up ratio on GTX285   Speed-up ratio on GTX580
L-box      29,580                50,000        2000          19302.8    1425.48       663.86        13.54                      29.08
Slump-1    14,647                50,000        2000          8457.7     769.16        369.36        11.00                      22.90
Slump-2    8360                  50,000        2000          4402.7     557.80        293.31        7.89                       15.01
V-funnel   4723                  50,000        2000          1599.1     379.55        202.20        4.21                       7.91
Slump-3    2680                  50,000        2000          1339.4     350.67        224.40        3.82                       5.97
Fig. 16. Elapsed time comparison on GPUs and CPU.
The slump test is the simplest and the most commonly used test, in research and in practical projects, for determining the workability of self-compacting concrete. In this test, the SCC is placed (by a prescribed procedure) into a conical formwork, which is then suddenly released, and the concrete begins to spread under gravity loading (Fig. 14(a)). In its standard form, only the final value of the slump is measured to evaluate the flowing behavior of the SCC.
The V-funnel test is a simple method of evaluating the ability of SCC to pass through spaces. In its standard form, the time needed for the container of SCC to flow through the bottom opening is measured to estimate both the passing ability and the deformation ability (Fig. 14(b)). In the GPU–DEM simulations, both the slump and the V-funnel models are simulated as one quarter of the real apparatus to reduce the enormous number of particles and increase the efficiency. The L-B box, which characterizes the flowability of SCC, consists of two containers. The first container holds the concrete to be poured into the second. The second container has two parts: a vertical column and a horizontal section. The SCC is dumped into the second container from the first, which is joined at its end to the top of the vertical column of the L-B box (Fig. 14(c)). All three experiments are simulated on the GPU–DEM platform, and the results are compared with the experimental results to validate the correctness of the proposed method. The SCC mixtures in the experiments are similar, and the values of the shear strength fs and the damping coefficient h are set to 7.5 and 1.7 in the simulations, respectively. As shown in Fig. 15, the proposed GPU–DEM platform simulates the slump and L-B box experiments well. For the V-funnel experiment, the measured flow time is 14.7 s, whereas the simulated time is 13.1 s.
In addition, the results of the simulations remain relatively unchanged across repeated tests, so good stability and accuracy are found in the SCC flow simulations.

Having established correctness, five test cases are simulated to validate the efficiency of the contact detection algorithm and of the GPU–DEM platform. Table 4 compares the results of the five SCC cases simulated on the GPU–DEM platform and on the CPU; the unit of time in Table 4 is seconds. The results show that the GPU–DEM platform proposed in the present paper can simulate SCC accurately and improves the efficiency of the calculation. The maximum speed-up ratio can reach 35 with nearly 30,000 particles. As the number of particles increases, the speed-up ratio also increases, which shows that the GPU–DEM platform can scale to larger numbers of particles and makes simulations at the construction-site scale possible. A comparison between different GPUs is also carried out: Fig. 16 shows the time costs on the different GPUs and the CPU. The GTX285 is also efficient, with a speed-up ratio that can reach 15, about half of that of the GTX580, and both the GTX285 and the GTX580 are far superior to the CPU. These results verify that the GPU–DEM platform proposed in the current research performs outstandingly for particle-based simulations. The implementation of the SCC flow simulations shows that this GPU parallel algorithm is tremendously faster than previous CPU implementations for the same particle numbers in particle-based simulations. The parallel algorithm is also stable, because the elapsed time appears to increase linearly as the number of particles increases.

4. Conclusion and future work

The current paper presents a robust and scalable parallel GPU algorithm for contact detection based on the uniform grid method. First, particle parallelism in the broad phase of contact detection is implemented to determine which balls each cell contains. Then, cell parallelism in the narrow phase of contact detection is carried out to determine the balls each ball comes in contact with. The proposed algorithm shows a tremendous increase in efficiency, with a speed-up ratio of about 73 compared with the CPU algorithm. We also establish a GPU–DEM platform that executes three different SCC flow simulations completely on the GPU, which avoids data transfer between the GPU and CPU during the DEM simulations. The results verify that the GPU–DEM platform proposed in the current paper performs outstandingly in particle-based simulations.

There are many avenues for future work. An extended application of the present research would be to investigate the use of the parallel algorithm for different primitive types, for example, triangles in ray tracing. Other variants of contact detection similar to this algorithm can also be implemented in the future. We aim to further study multi-level grid acceleration structures for cases in which the largest particle is more than 10 times bigger than the smallest particle.
For particle-based simulations, the proposed GPU–DEM platform is largely orthogonal to the particular particle method used; thus, it would be interesting to investigate the implementation of other methods in our framework.

References

[1] Okamura H, Ouchi M. Self-compacting concrete. J Adv Concrete Technol 2003;1:5–15.
[2] Roussel N, Geiker MR, Dufour F, Thrane LN, Szabo P. Computational modeling of concrete flow: general overview. Cement Concrete Res 2007;37:1298–307.
[3] Cundall PA, Strack ODL. A discrete numerical model for granular assemblies. Geotechnique 1979;29:47–65.
[4] Chu H, Machida A, Suzuki N. Experimental investigation and DEM simulation of filling capacity of fresh concrete. Trans JCI 1997;18:9–14.
[5] Puri U. Numerical simulation of shotcrete by distinct element method. Ph.D. thesis. Tokyo: University of Tokyo; 1999.
[6] Noor MA, Uomoto T. Three-dimensional discrete element simulation of rheology tests of self-compacting concrete. RILEM Symp on SCC 1999;51(4):25–8.
[7] Mechtcherine V, Shyshko S. Virtual concrete laboratory – continuous numerical modelling of concrete from fresh to the hardened state. Adv Constr Mater 2007:479–88.
[8] Mio H, Shimosaka A, Shirakawa Y, Hidaka J. Optimum cell size for contact detection in the algorithm of the discrete element method. J Chem Eng Jpn 2005;38:969–75.
[9] Lauterbach C, Mo Q, Manocha D. Hierarchical GPU based operations for collision and distance queries. Eurograph Assoc 2010:419–28.
[10] Luebke D. The democratization of parallel computing. In: Astro GPU Conference; 2007.
[11] Harada T, Koshizuka S, Kawaguchi Y. Sliced data structure for particle-based simulations on GPUs. Proc of Graphite 2007:55–62.
[12] Pabst S, Koch A, Straber W. Fast and scalable CPU/GPU collision detection for rigid and deformable surfaces. Eurograph Symp on Geom Process 2010;29(5):1605–12.
[13] Zhang Y, Solenthaler B, Pajarola R. Adaptive sampling and rendering of fluids on the GPU. In: IEEE Symposium on PBG; 2008. p. 1–10.
[14] Zhou K, Hou Q, Wang R, Guo B. Real-time kd-tree construction on graphics hardware. ACM Trans Graph 2008;29(5):126–36.
[15] Hunt W, Mark WR, Stoll G. Fast kd-tree construction with an adaptive error-bounded heuristic. In: IEEE Symposium on Interactive Ray Tracing; 2006. p. 81–8.
[16] Shevtsov M, Soupikov A, Kapustin A. Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes. Eurograph Assoc 2007:395–404.
[17] Harada T. Real-time rigid body simulation on GPUs. GPU Gems 3 2007;3:611–32.
[18] Goswami P, Schlegel P, Solenthaler B, Pajarola R. Interactive SPH simulation and rendering on the GPU. Eurograph Assoc 2010:55–64.
[19] Purcell TJ, Donner C, Cammarano M, Jensen HW, Hanrahan P. Photon mapping on programmable graphics hardware. Eurograph Assoc 2003:41–50.
[20] Green S. CUDA particles. NVIDIA Whitepaper; 2007.
[21] Ivson P, Duarte L, Celes W. GPU-accelerated uniform grid construction for ray tracing dynamic scenes. Master thesis. Rio de Janeiro: Pontificia Universidade Catolica; 2009.
[22] Kalojanov J, Slusallek P. A parallel algorithm for construction of uniform grids. In: Proceedings of HPG; 2009. p. 23–8.
[23] Grand S. Broad-phase collision detection with CUDA. GPU Gems 3 2007;3:697–721.
[24] Obara T. Development and application of a discrete element simulation system for mixing materials. Ph.D. thesis. Beijing: Tsinghua University; 2007.
[25] Huang MS. The application of discrete element method on filling performance simulation of self-compacting concrete in RFC. Ph.D. thesis. Beijing: Tsinghua University; 2010.
[26] Hertz H. Ueber die Berührung fester elastischer Körper. J für die reine und angewandte Mathematik 1882;92:156–71.
[27] Mindlin R, Deresiewicz H. Elastic spheres in contact under varying oblique forces. J Appl Mech 1953;20:327–44.
[28] Huang MS, An XH, Obara T, Ouchi M. Numerical modeling of self-compacting mortar flow using discrete element method. JCI Annu Conv 2009:1477–82.