J. Parallel Distrib. Comput. 97 (2016) 24–34
Hardware-accelerated generation of 3D diffusion-limited aggregation structures

Attila Michael Zsaki
Department of Building, Civil and Environmental Engineering, Concordia University, Montreal, Quebec, Canada

Highlights
• Implementation of diffusion-limited aggregation (DLA) on parallel hardware.
• Use of OpenCL running on multi-core CPU, GPU and FPGA.
• Performance evaluation of the accelerated DLA algorithm.
Article info

Article history: Received 10 August 2015; Received in revised form 31 March 2016; Accepted 20 June 2016; Available online 28 June 2016.

Keywords: Diffusion-limited aggregation; Simulation; Hardware acceleration; OpenCL; Multi-core CPU; GPU; FPGA
Abstract

The diffusion and aggregation of particles in a medium can result in complex geometric forms with an artistic interpretation, yet these aggregates can represent many natural processes as well. Although the method is quite simple, it takes many particles to form an aggregation; if the process is simulated on a computer, this translates directly into lengthy computation times. In this paper, the acceleration of diffusion-limited aggregation was investigated. The aggregation algorithm was implemented on a serial single-core CPU, which served as the base case. With the aim of reducing run times, the algorithm was implemented on three accelerator architectures using OpenCL as the connecting software framework. Performance testing of the OpenCL implementation was done on a multi-core CPU, a GPU and an FPGA. Metrics such as run time, relative speedup and speedup-per-watt were used to compare the hardware-accelerated implementations. Even though using a GPU is not the most economical alternative energy-wise, its performance resulted in the highest speedup, while an FPGA or a multi-core CPU offered other viable options in accelerating the creation of diffusion-limited aggregation structures.

© 2016 Elsevier Inc. All rights reserved.
1. Introduction

Diffusion-limited aggregation (DLA) structures are generated by the accumulation of particles diffusing through a medium. These structures possess a fractal-like appearance with wispy, tendril-like arms, such as the one shown in Fig. 1. Even though they have captured the imagination of computer graphics artists [16], they commonly represent physical processes such as the formation of river networks [12], dielectric breakdown [12], electro-deposition [10] or, in fluid dynamics, the Hele-Shaw flow [8]. In a granular geological medium, such as soils and rocks, DLA structures can be found forming during the injection of air or water into the fractured medium for natural gas or oil extraction [8]. The process of aggregation is characterized by the diffusion of a relatively small number of particles (ideally a single particle, but at most a few dozen) at a time through a
medium [10]. When a moving particle comes into contact with an already deposited particle, it attaches itself to the structure, enlarging it. If the aggregation process is allowed to occur over an extended period of time, particularly in 3D, formations with millions of particles can be generated. However, the creation of such structures comes at a cost. Particles diffuse in the medium obeying a Brownian-like random motion, thus it can take a particle a large number of steps to find the aggregated structure. This translates into potentially lengthy computation times when millions of particles are to be aggregated. A number of algorithmic acceleration methods have been developed, which are reviewed in the next section. This paper, in turn, implements, tests and evaluates the performance of a hardware-accelerated DLA algorithm in 3D on a single CPU core, on multiple CPU cores, on a GPU and on an FPGA, using OpenCL as the language of implementation across the platforms. The performance of each implementation is assessed using metrics such as time-to-completion, relative speedup and speedup-per-watt to evaluate which hardware platform provides an ideal environment for accelerated DLA simulations.
Fig. 1. A 3D DLA structure, grown while immersed in a velocity field.
Fig. 2. A 3D DLA structure bound by an outer spherical shell.
2. Diffusion-limited aggregation simulation

There are two prevalent approaches for generating DLA structures: on-lattice and off-lattice. The on-lattice variant allows particles to diffuse on a 3D lattice only; the domain is thus discretized into cells, or voxels. The off-lattice variant relaxes this condition, permitting particles to move freely in space. The on-lattice version can therefore be thought of as an integer-only variant, while the off-lattice version requires floating point computations to determine particle locations and collisions with the DLA structure. This paper considers an on-lattice algorithm for ease of implementation across hardware platforms. Although the on-lattice method can appear restrictive, since particles are placed at lattice centers only, if the domain is discretized into a reasonable number of cells (512³ or more), the underlying lattice structure becomes much less evident, as seen in Fig. 1.

In order for the aggregation process to start, there have to be some seed particles placed in the domain. Most simulations place a single seed at the center of the lattice; however, interesting structures can be grown with multiple dispersed seeds. The implementation used in this paper allows an arbitrary number of seed particles to be placed prior to simulation. With the seed particles in place, a new non-stationary particle is introduced into the system. Its starting point is selected to be on a sphere bounding the current aggregation structure, instead of randomly selecting an unoccupied location in the lattice. The use of such a sphere is one of the traditional acceleration methods: no matter how a particle diffuses in the lattice, it eventually has to cross this bounding sphere. Since the particle obeys a random Brownian motion, selecting a random point on the sphere as a starting point is equivalent to simulating all the diffusion (movement) that takes place before the particle crosses the sphere from the outside.

From then onward, the process of aggregation is quite simple: a particle diffuses, or moves, by stepping into one of the 26 adjacent lattice cells at random, representing a Brownian motion. Ultimately, a diffusing particle reaches one of two conditions: it either wanders outside the lattice or it stumbles upon a cell already occupied by a particle attached to the aggregation structure. In the first case the particle can be destroyed, and a new one generated on the bounding sphere, or, if periodicity of the lattice is to be enforced, the particle is reintroduced at the adjacent face of the bounding box.
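To make the on-lattice stepping concrete, the following is a minimal sketch of the 26-neighbour Brownian step on a flattened occupancy array. The indexing macro, the xorshift generator and all identifiers are illustrative assumptions, not the article's actual code.

```c
#include <stdint.h>

#define N 768                                        /* lattice dimension           */
#define IDX(x, y, z) ((size_t)(z) * N * N + (size_t)(y) * N + (size_t)(x))

/* Simple xorshift generator; any uniform RNG would do. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t s = *state;
    s ^= s << 13;  s ^= s >> 17;  s ^= s << 5;
    return *state = s;
}

/* One Brownian step: move into one of the 26 adjacent lattice cells at random. */
static void random_step(int *x, int *y, int *z, uint32_t *rng) {
    int dx, dy, dz;
    do {                                             /* reject the "stay put" move  */
        dx = (int)(xorshift32(rng) % 3) - 1;
        dy = (int)(xorshift32(rng) % 3) - 1;
        dz = (int)(xorshift32(rng) % 3) - 1;
    } while (dx == 0 && dy == 0 && dz == 0);
    *x += dx;  *y += dy;  *z += dz;
    /* The caller then inspects the neighbourhood of grid[IDX(*x, *y, *z)] for
       occupied cells and applies the stickiness rule, or kills/re-emits the
       particle if it has left the lattice or the outer destroying sphere.    */
}
```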
To reduce the amount of diffusion in the unoccupied space of the lattice, another sphere can be used at some distance farther away from the aggregation structure. This time the particle is re-started if it wanders outside this farther sphere, which is another common algorithmic acceleration technique. If the second condition is met, i.e. the particle comes into contact with the existing aggregation structure, it is decided whether the particle becomes attached to the structure. Parameters, such as 'stickiness' or the number of adjacent occupied cells, can decide if a particle is to be attached. All these parameters or rules of attachment can be deterministic or can be assigned a probability, either for the whole structure or affecting a local region only, depending on the physics being modeled.

Although the diffusion or movement of a particle is generally simulated via a random Brownian motion, it is possible to 'steer' this motion by defining a velocity vector field within the lattice. The vector field has to be computable at each lattice point. There are a few methods for incorporating the effect of a velocity field into the simulation. The algorithm used in this paper assigns a probability to each step: either one of the 26 adjacent cells is chosen at random, or the adjacent cell to which the local velocity field points is selected. The use of velocity fields enables the generation of complex shapes. In conjunction with velocity fields, geometric restraints can be incorporated into the model as well. These restraints can take the form of surfaces which limit the growth of the aggregation, as shown in Fig. 2, where a sphere acts as an outer bounding surface.

3. Hardware acceleration of diffusion-limited aggregation simulation

Historically, the acceleration of algorithms relied on either special-purpose application-specific integrated circuits (ASICs) or the use of multiple computing units and the paradigm of parallel processing. With the introduction of programmable graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) into the area of general-purpose computing, many algorithms can now be implemented to run on almost any accelerator hardware. To exploit the computing power of massively parallel FPGAs or GPUs, a set of software toolkits has been developed, such as NVIDIA's CUDA [17], which primarily targets the fine-grained parallelism of GPUs, or OpenCL [13], which is capable of exploiting both fine-grained and coarse-grained parallelism.
Until recently, CUDA and OpenCL were available only for GPUs and/or multi-core CPUs (the latter with OpenCL), but Altera has since introduced FPGAs that can be programmed using its OpenCL SDK [1], and there are already some algorithms benefiting from this development [6,7]. Thus, the hardware acceleration of algorithms can draw on multi-core CPUs, GPUs and FPGAs. OpenCL was chosen for the acceleration of DLA because it is available on all platforms (CPUs, GPUs and FPGAs) and its use requires minimal modification of the implementation to run on CPUs, GPUs and FPGAs. The development environment was Microsoft Visual Studio Ultimate 2012, and the programs were developed in C/C++. The computer was equipped with 32 GB RAM, running Microsoft Windows 7 Professional. For the FPGA, the OpenCL kernels were compiled using Altera's Quartus II 12.0 Suite [1].

Following [21,3,5], the DLA simulation can be described as follows. On a 3D computation lattice of dimension x by y by z, with seed particles in place, a randomly placed particle diffuses until (a) it finds the existing aggregation and attaches itself to it, or (b) it wanders outside the computation lattice. In case (a), a probability of attachment can be modeled using another toss of the dice: the particle either attaches or continues to diffuse. The probability of attachment can be a random number, or it can depend on the number of already attached particles in the neighborhood of the particle under consideration. For case (b), as described earlier, the particle is either re-introduced at the adjacent face of the lattice (periodicity) or re-placed at a random location within the lattice. Traditional acceleration structures, such as bounding spheres for particle birth and death, were implemented as well.

The hardware acceleration of a DLA simulation can be envisioned using two approaches: either by partitioning the computational domain based on the available number of accelerators (or cores or computing units), or by concurrently diffusing a number of particles. The first approach was dismissed, since for a single or a very small number of particles (diffusing particle numbers equal to the number of accelerators and/or number of concurrent threads), a number of domains/accelerators would be idle if no particle diffuses within them. Thus the second approach was adopted in the implementation. However, the idea of even a small number of concurrent particles (equal to the number of accelerators and/or number of concurrent threads), as opposed to a single particle, is in disagreement with the traditional DLA simulation, in which a single particle diffuses in the domain [12,10]. Thus for these particles it is theoretically possible that two or more particles collide outside the existing DLA structure, or that they end up in a position where the order of attachment (the local shape of the existing aggregation structure) depends on the strict order in which the particles are processed. Since most hardware accelerators derive their performance from independently running their concurrent processes, a deterministic order of processing of particles cannot be guaranteed. The probability of multiple, concurrently diffusing particles trying to attach to the aggregation at the same cell location is a function of the number of concurrently diffusing particles and the number of available (empty) and occupied cells. Thus the probability changes as the simulation progresses.
Even though the possibility is remote, multiple particles trying to attach at the same location can occur. The algorithm, for the sake of speed, allows multiple particles to attach themselves to the aggregation structure in any given kernel invocation. To justify this, a small experiment was conducted in which the occurrence of such a case was monitored. Out of 50 simulation runs on a 768³ grid with 256 concurrently diffusing particles, the probability of two particles attempting to attach at the same location was 8.12 × 10⁻⁶ percent, which is an acceptably low probability. No case was observed in which more than two particles tried to attach at the same location.
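The "one diffusing particle per work-item" mapping can be illustrated with a kernel sketch such as the one below. The buffer names, the in-kernel RNG, the argument list and the attachment rule are assumptions made for illustration only; birth on the generating sphere and the destroying-sphere test are omitted for brevity.

```c
/* OpenCL C sketch: each work-item diffuses one particle on the lattice.        */
__kernel void dla_step(__global const uchar *grid,       /* occupancy lattice   */
                       __global int4        *particle,   /* in: start position,
                                                            out: attached cell or (-1,...) */
                       __global uint        *rng_state,  /* per work-item seed  */
                       const int n,                      /* lattice dimension   */
                       const int max_steps,
                       const float stickiness)           /* e.g. 1/3            */
{
    const size_t gid = get_global_id(0);
    uint s  = rng_state[gid];
    int4 p  = particle[gid];
    particle[gid] = (int4)(-1, -1, -1, 0);               /* default: nothing attached */

    for (int step = 0; step < max_steps; step++) {
        /* xorshift RNG and a random move drawn from the 3x3x3 neighbourhood
           (the rare zero move simply wastes a step)                            */
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;  p.x += (int)(s % 3) - 1;
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;  p.y += (int)(s % 3) - 1;
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;  p.z += (int)(s % 3) - 1;

        /* give up if the particle leaves the lattice (periodicity omitted)     */
        if (p.x < 1 || p.y < 1 || p.z < 1 ||
            p.x > n - 2 || p.y > n - 2 || p.z > n - 2)
            break;

        /* does the particle touch the existing aggregation?                    */
        int touching = 0;
        for (int k = -1; k <= 1 && !touching; k++)
            for (int j = -1; j <= 1 && !touching; j++)
                for (int i = -1; i <= 1; i++) {
                    if (i == 0 && j == 0 && k == 0) continue;
                    if (grid[(size_t)(p.z + k) * n * n +
                             (size_t)(p.y + j) * n + (size_t)(p.x + i)]) {
                        touching = 1;
                        break;
                    }
                }

        if (touching) {
            s ^= s << 13; s ^= s >> 17; s ^= s << 5;
            if ((float)(s & 0xFFFF) / 65536.0f < stickiness) {
                particle[gid] = p;                        /* report the new cell */
                break;
            }
        }
    }
    rng_state[gid] = s;
}
```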
The algorithm, as implemented in OpenCL, was developed using a three-phase approach. The first phase set up and initialized the OpenCL environment, defined and allocated the buffers for data transfer, compiled the kernels and, finally, transferred all the data needed to the accelerator. The second phase executed the kernel on the accelerator, while the third phase wrote the data back to the host. The pseudo code, based on the actual C code, describes the program flow: first the host side, then the OpenCL kernel.
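A condensed, illustrative sketch of the host-side flow is given below. It mirrors the three phases just described but is not the article's listing: the kernel name dla_step, the buffer names and the workgroup sizes are assumptions, error handling is omitted, and the per-thread RNG seeds, the control-parameter buffer and the birth positions on the generating sphere would be set up analogously to the buffers shown.

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define N        768                 /* lattice dimension                        */
#define GLOBAL   1024                /* concurrently diffusing particles         */
#define LOCAL    2                   /* local workgroup size                     */
#define TARGET   1700000L            /* particles to aggregate                   */

int main(void)
{
    /* --- Phase 1: set up OpenCL, the buffers and the kernel ----------------- */
    cl_platform_id platform;  cl_device_id device;  cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);   /* or _CPU / _ACCELERATOR */

    cl_context       ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, &err);

    size_t grid_bytes = (size_t)N * N * N;                 /* one byte per cell  */
    unsigned char *grid = calloc(grid_bytes, 1);
    grid[((size_t)384 * N + 384) * N + 384] = 1;           /* single seed        */

    cl_mem d_grid   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  grid_bytes, NULL, &err);
    cl_mem d_placed = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                     GLOBAL * 4 * sizeof(cl_int), NULL, &err);

    const char *src = "/* kernel source (e.g. dla_step) loaded from a .cl file */";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "dla_step", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_grid);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_placed);
    /* remaining arguments (RNG seeds, lattice size, step limit, stickiness)
       would be set here                                                         */

    /* --- Phases 2 and 3: repeated kernel invocation and read-back ----------- */
    size_t global = GLOBAL, local = LOCAL;
    cl_int placed[GLOBAL * 4];                              /* one int4 per work-item */
    long attached = 0;
    while (attached < TARGET) {
        /* (a) upload the current lattice                                        */
        clEnqueueWriteBuffer(q, d_grid, CL_TRUE, 0, grid_bytes, grid, 0, NULL, NULL);
        /* (b) diffuse GLOBAL particles, one per work-item                       */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
        /* (c) read back the coordinates of the newly attached particles         */
        clEnqueueReadBuffer(q, d_placed, CL_TRUE, 0, sizeof(placed), placed, 0, NULL, NULL);
        /* (d) update the lattice on the host                                    */
        for (int i = 0; i < GLOBAL; i++) {
            cl_int x = placed[4*i], y = placed[4*i+1], z = placed[4*i+2];
            if (x >= 0) { grid[((size_t)z * N + y) * N + x] = 1; attached++; }
        }
    }
    printf("aggregated %ld particles\n", attached);
    free(grid);
    return 0;
}
```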
The OpenCL environment was set up by querying the availability of platforms and resources and selecting the proper accelerator using CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_ACCELERATOR (for FPGAs), as appropriate. The buffers, including the lattice (or grid), were set up next. The maximum size of the lattice was dictated by the lowest maximum available memory among the accelerators, which is summarized, along with other key simulation parameters, in the next section. To maximize the lattice size, a boolean data type was adopted to denote empty/occupied lattice elements. Other input data buffers included an array of floats for initializing a random number generator running on the accelerator and a short buffer for the control parameters of the simulation, such as lattice dimensions, boundary condition treatment, stickiness/rules of attachment, radii of the spherical emitter and destroyer, and, if specified, the definition of a vector field steering particle movements. These few key simulation parameters were stored in shared memory on the accelerator, since they are used often by the threads during the simulation. The only output buffer needed was one to store the 3D locations of newly aggregated particles. Next, the buffers were enqueued and the kernels were compiled using the global and local workgroup sizes, as specified.

Ideally, the whole simulation should be run on the accelerator without communicating back to the host until all particles are aggregated. However, the continuous use of the GPU interferes with its primary purpose of updating the screen, so it was decided that the simulation would consist of repeated invocations of the computing kernel, allowing the system to remain responsive during a simulation. This decision necessitated the output buffer, the update of the lattice on the host and the re-upload of the updated lattice structure to the accelerator after placing n particles, as dictated by the global workgroup size. Note that only the GPU requires multiple invocations of the kernel; both the CPU and the FPGA can run the simulation in OpenCL without communicating with the host.

The decision to split up the work into blocks was based on the operating system's (Microsoft Windows 7) reaction to operations that execute for a long time. Any operation, including GPU-related ones, that runs beyond what the OS deems 'too long' triggers a Timeout Detection and Recovery response from the OS. If the timeout response is invoked by the OS while a simulation is running, the simulation is effectively canceled. Since the DLA simulation can take minutes of continuous running to complete, it lies well beyond the OS's timeout limit. On the tested GPU platform, which is summarized in Section 3.2, this timeout limit was approximately 2.8 s; any kernel that attempted to run longer was interrupted and canceled. The literature therefore proposes three ways to deal with the time limit [17,13]: (a) run the simulation on a GPU that is not participating in displaying graphics, (b) disable the OS's watchdog timer responsible for Timeout Detection and Recovery, or (c) reduce the OpenCL (and equally valid for CUDA) kernel run times.

Option (a) was appealing if the system is equipped with a redundant GPU; however, most computer systems are equipped with a single GPU, so this option was ruled out because it greatly reduces the usability of the proposed method. Option (b) was not considered wise, since it can interfere with the normal operation of the computer system and can lead to crashes. Thus, option (c) was adopted, resulting in the subdivision of the problem into smaller blocks, but at the cost of increased overhead associated with data transfer. The implications of the repeated invocation of a kernel, along with the transfer of the grid structure from the host to the accelerator, can be quantified as follows: (a) prior to kernel invocation, the grid structure is transferred to the accelerator (CPU, GPU or FPGA); (b) the kernel is run; (c) the accelerator returns an array containing the coordinates of the newly placed particles; (d) the host updates the grid structure.

Thus the major overhead occurs in step (a), since the whole grid needs to be uploaded. Step (c), the communication from the accelerator back to the host, is minimal compared to step (a), since only the newly placed particles (up to the number of threads used) are returned. The update of the grid structure on the host is relatively fast as well, for the same reason as step (c), since only a small number of operations (relative to the grid size) are required for the newly placed particles. To estimate the effect of this communication overhead, a set of experiments was devised in which a grid structure was uploaded to the accelerator but the kernel did no work; the running time of this operation was measured. The numbers revealed that, on average, the data transfer comprises about 12% of the simulation run time.

3.1. Model and simulation parameters

The lattice for the DLA model comprised a 768 by 768 by 768 grid (about 432 MB of storage) with a boolean data structure per cell (booleans on most platforms are represented by bytes). The dimensions were selected such that the lattice, represented as a one-dimensional array, was within the most restrictive limit on a single memory allocation among the hardware platforms. Although a memory compression scheme could have been used to represent each cell's occupancy as a true one-bit boolean, for the sake of implementation clarity this was not done. A single seed particle was inserted at the center of the lattice prior to the start of the simulation. A generating sphere was specified to be 25 lattice points away from the maximum radius of the current DLA structure. Similarly, a destroying sphere was defined 50 lattice points outward from the generating sphere. Both spheres had their centers fixed at the center of the aggregation structure, while their radii were updated during the simulation as new particles were added to the aggregation. The outer, destroying sphere's radius was limited to half the width of the lattice once the sphere's expansion reached the lattice boundaries. As a consequence, towards the end of the simulation, the bounding spheres resulted in the formation of an outer 'shell' of particles, as seen in Fig. 2. Particles were attached to the aggregation using a stickiness probability of one in three. The simulation was run until at least 1.7 million particles were attached to the aggregation structure. All simulation parameters are summarized in Table 1.
Table 1
DLA simulation parameters.

Parameter                                              Value
Grid width                                             768 points
Grid height                                            768 points
Grid depth                                             768 points
Probability of attachment upon contact                 1/3
Generator type                                         Spherical
Generator start radius                                 120 points
Generator radius increment (from DLA structure)        25 points
Destroyer type                                         Spherical
Destroyer radius (beyond current emitter radius)       50 points
Seed particle                                          At 384, 384, 384
Number of particles to place                           1,700,000
Number of particle steps before giving up              300,000
3.2. Accelerator hardware

The study considered a range of accelerators: a single core of a CPU, a multi-core CPU, a GPU and an FPGA. At the time of writing, these accelerators represented a reasonable cross-section of available hardware. The features of each accelerator are summarized in Table 2. Where possible, a reference and a link to the manufacturer's documentation are given. Since there is communication between the host and an accelerator during the simulation to update the aggregation structure and the lattice, it was anticipated that the high memory bandwidth available to the CPU would positively affect (i.e. reduce) the simulation time compared to both the GPU and the FPGA, since the latter two are limited by the relatively low bandwidth of the PCI-E bus.

3.3. Single-core CPU—base case

The serial implementation of the DLA algorithm, in which a single particle is allowed to diffuse at a time, was run on a single core of a CPU. Fig. 3 summarizes the run times of the DLA algorithm. The time to attach the requested number of particles to the aggregation structure was recorded. Since each run should ideally result in a different DLA structure due to the randomness of the process, the run times shown in Fig. 3 represent an average of 10 runs. This non-OpenCL implementation of the DLA algorithm serves as the benchmark to which the accelerated implementations of the algorithm are compared.

As evident from Fig. 3, the plot of the number of particles versus run time comprises an initial curved portion and a linear portion for higher particle numbers. This shape is attributable to the aggregation process. At the beginning, there are relatively few cell faces where a new particle can attach itself, thus it takes more time for a particle to diffuse before it attaches itself. Subsequently, even though the generating sphere expands as the structure grows, the number of available cell faces increases at a greater rate. Thus, once an established DLA structure is reached, a balance seems to form between the rate of increasing volume
within the generating sphere and the time it takes for a particle to attach itself. This is represented by the almost linear portion of the curve for larger numbers of particles.

3.4. Multi-core CPU accelerated DLA

The multi-core implementation of the DLA algorithm using OpenCL was executed using all four cores of a CPU. The OpenCL specification allows both a global workgroup size (the total number of concurrent threads) and a local workgroup size to be specified. It was anticipated that these two settings could influence the run time of the program, thus different global and local workgroup sizes were tried. For the global workgroup size a range from 8 to 1024 was used (as 2^i, i = 3–10), while the local workgroup sizes ranged from 1 to 256 (as 2^i, i = 0–8). These sizes were selected based on the literature [15,19] and the requirement that a relatively small number of particles (equal to the global workgroup size, or the number of concurrent threads) should diffuse at any given time. The performance results also revealed that a global workgroup size of 8 resulted in run times noticeably slower (for large numbers of particles to be generated) than the serial implementation, while beyond a global workgroup size of 1024 there was only marginal improvement in the reduction of run times, as seen in Fig. 4. For each combination, 10 runs were performed and the average times were recorded. For each global workgroup size, the local workgroup size that resulted in the lowest run time was kept. Although the plots of the variation of run times within each global–local workgroup size combination are not included in the manuscript, it was observed that smaller (1 or 2) local workgroup sizes resulted in the shortest run times. Fig. 4 summarizes these best-performing results.

As evident from Fig. 4, the size of the global workgroup significantly influences the run time of an OpenCL program. Runs with larger global workgroup sizes take less time to execute; from the collected data, there is an almost 13-fold decrease in computation time if a global workgroup size of 1024 is used instead of 8. The curves exhibit a behavior similar to the serial implementation: a larger curvature at lower particle numbers and a relatively linear increase at higher particle numbers.

3.5. GPU accelerated DLA

The GPU implementation of the DLA algorithm using OpenCL was executed on the platform listed in Table 2. Similar to the multi-core CPU case, a combination of global and local workgroup sizes was considered. For the global workgroup size a range from 16 to 4096 was used (as 2^i, i = 4–12), while the local workgroup sizes ranged from 1 to 256 (as 2^i, i = 0–8), where a given global–local workgroup size combination was possible. These sizes were selected considering that GPUs prefer a larger global workgroup size, while bearing in mind the constraint of the problem, set forth in Section 3, on the number of concurrently diffusing particles.
Table 2
Accelerator specifications.

                                  CPU                               GPU                          FPGA
Device                            Intel i7-4770K [11]               GTX 760 [9]                  Terasic DE5-Net using Altera Stratix V [20]
Cores                             4 (8 threads)                     1152                         Configurable, see text
Memory                            8 MB L3 cache (32 GB system)      2 GB                         8 GB
Max. single allocatable memory    512 MB                            512 MB                       4096 MB
Frequency                         3.5 GHz                           1072 MHz                     71.46 MHz [2]
Max. memory bandwidth             25.6 GB/s                         192.2 GB/s                   17 GB/s
Host bus bandwidth                –                                 15.75 GB/s (PCI-E 3.0)       15.75 GB/s (PCI-E 3.0)
Thermal Design Power (TDP)        84 W                              170 W                        21.5 W [2]
Fig. 3. Serial DLA algorithm runtimes.
Fig. 4. DLA algorithm runtimes on the multi-core CPU.
Similar to the multi-core case, beyond a global workgroup size of 4096 there is marginal improvement, and below 16 the run times are inferior to the serial base case. For each combination, 10 runs were performed and the average times were recorded. For each global workgroup size, the local workgroup size that resulted in the lowest run time was kept. Although the plots of the variation of run times within each global–local workgroup size combination are not included in this manuscript, it was observed that a local workgroup size of one was the most efficient for global workgroup sizes up to 64, while medium-sized (16 or 32) local workgroups resulted in the shortest run times for global workgroup sizes beyond 64. The exception was the 4096 global workgroup size, for which a local workgroup size of 128 was marginally the best. Fig. 5 summarizes these best-performing results.

As apparent from Fig. 5, the size of the global workgroup considerably influences the run times of the program. In accordance with the observation from the multi-core CPU, runs with larger global workgroup sizes take less time to run. Similarly, it was calculated that there is an almost 85-fold decrease in computation time if a global workgroup size of 4096 is used instead of 16.
The general shape of the curves follows the same trend as the ones obtained from the serial algorithm: an initial sharper curve followed by a more linear portion.

3.6. FPGA accelerated DLA

The third accelerator platform, an FPGA, is somewhat different from the previous two. In this case, the hardware synthesis of the DLA algorithm can be influenced by decisions such as how many cores one intends to use, which is limited by the number of logic elements and the size of the algorithm to be implemented in an OpenCL kernel. Other factors, such as the available memory, can influence the design. The DLA algorithm was synthesized using Altera's Quartus II suite. Due to the size of the DLA algorithm, at most two cores could be implemented using the FPGA logic. It was anticipated that the performance would increase as more cores are synthesized. However, this was not the case: the single-core kernel version outperformed the dual-core version in all test runs. Similar to the multi-core CPU and GPU test cases, a range of global and local workgroup sizes was considered in the measurements.
Fig. 5. DLA algorithm runtimes on the GPU platform.
Fig. 6. DLA algorithm runtimes on the FPGA platform.
Global workgroup sizes were tried in the range of 32 to 4096 (as 2^i, i = 5–12), while the local workgroup sizes ranged from 1 to 256 (as 2^i, i = 0–8). The lower and upper limits of these ranges were determined by trial and error. The lower value of 32 was selected because it already resulted in performance below that of the serial version, as shown in Fig. 6, while the upper value of 4096 was chosen since any higher value yields only marginal performance improvements while violating the condition on the number of simultaneously diffusing particles. Interestingly, the local workgroup size did not affect the run times. The experiments were repeated 10 times for each combination and the average times were recorded. For example, Fig. 7 shows the run times for a global workgroup size of 128: there is no discernible effect on the run times as the local workgroup size is varied. A similar situation was observed for the other global workgroup sizes as well. Judging from Fig. 6, the general shape of the curves is similar to that of the serial implementation: a curved portion for low particle numbers and a linear one for higher numbers of particles. The ratio between the slowest and fastest runtime curves was about 4, considerably less than for either the multi-core CPU or the GPU implementation.
3.7. Performance comparisons

The individual performance of each implementation running on a particular accelerator was presented in the preceding three sections. Although the serial implementation was plotted in Figs. 4 through 6 as well, its use was limited to justifying the lower limits of the workgroup sizes. Having run the DLA algorithm on all three hardware platforms, a direct comparison can now be drawn between them. Traditionally, run times serve as a means of comparing the effectiveness of an algorithm, its implementation and the benefits offered by a particular hardware platform. In addition, the relative speedup, as defined by [18], can be used to compare the various implementations. More recently, another metric, the speedup-per-watt, has gained attention [4]; it reflects the energy consumption of a hardware platform in performing a task. In this paper all three of these metrics are considered.
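Written out, the two derived metrics amount to the following; normalizing the speedup by the accelerator's power draw is the reading adopted here, since the text does not spell it out explicitly:

\[
S = \frac{T_{\text{base}}}{T_{\text{accel}}}, \qquad
\text{speedup-per-watt} = \frac{S}{P_{\text{accel}}},
\]

where T_base is the run time of the reference implementation (a single-core or a multi-core CPU), T_accel the run time of the accelerated implementation, and P_accel the power attributed to the accelerator (manufacturer TDP or measured system-level power, depending on the comparison).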
Fig. 7. Effect of local workgroup size on FPGA runtimes.
Fig. 8. Side-by-side performance of platforms.
By plotting the single-core CPU base case along with the best-performing multi-core CPU, GPU and FPGA cases, as shown in Fig. 8, the following observations can be made. All three hardware platforms are more than capable of accelerating the DLA algorithm. At the largest particle numbers (equal to or greater than 1.7 million), the multi-core CPU reduced the run time by a factor of about 8, while the GPU achieved an almost 55-fold reduction. Similarly, the FPGA reduced the run time by a factor of 4. Thus, the GPU has a clear advantage over the other two accelerator platforms. At lower particle numbers, the advantage of the GPU over the multi-core CPU diminishes, as seen in Fig. 8. Both the GPU and the FPGA reach an almost steady state as the particle numbers increase, while the multi-core CPU's run time appears to increase, albeit linearly.

By considering the relative speedup, a further comparison can be drawn, as shown in Fig. 9. The accelerators' (GPU and FPGA) performance was compared to both a single-core CPU and a multi-core CPU. Traditionally, accelerated parallel implementations have been compared to a serial, single-core CPU base case.
More recently, however, the literature reports comparisons to multi-core CPUs as well [14]. The multi-core CPU, compared to the base case of a single core, achieves its highest speedup of 13 at a relatively low particle number. Subsequently, the speedup erodes and levels off at about 8 for large particle numbers. The speedup of 13 is superlinear; it is attributable to cache locality for smaller aggregation structures. Since a relatively large number of particles (1024 for the best-performing multi-core CPU case) access a relatively small volume of the grid, many particles (threads) find the required data in either the L1 cache (32 kB per core), the L2 cache (256 kB per core) or the L3 cache (8 MB per CPU, shared between the cores). As the aggregation structure grows, the diffusing particles occupy an increasing volume in space. This necessitates a larger amount of memory to hold the part of the grid accessed by the currently diffusing particles. Thus, beyond a certain point, the advantage of data locality in the cache diminishes, because data stored in the caches cannot be re-used by multiple particles.
Fig. 9. Speedup achieved in accelerating the DLA algorithm.
Fig. 10. Speedup-per-watt achieved in accelerating the DLA algorithm.
The GPU, compared to a single core of a CPU, starts at a speedup of about 22, first increases sharply and then seems to level off at a speedup of about 55 for high particle counts. The FPGA hardware accelerator starts at about 3 and then slowly but steadily increases to a speedup of 4 for large particle numbers. In this comparison the GPU has a clear advantage over the other accelerators. When compared to the multi-core CPU, the GPU starts off with a speedup of about 3, which slowly increases to about 8 for large particle numbers. In these experiments the FPGA, unfortunately, offers no speedup greater than unity when compared to the multi-core CPU.

The performance of an accelerator can be measured not only by its ability to reduce run times, but also by the energy it consumes while performing a task. Using the manufacturers' data provided in Table 2, the speedup-per-watt was computed for each accelerator and the results were plotted in Fig. 10. Even though the GPU consumes 170 W, more than twice the CPU's 84 W, it still provides the most economical performance in comparison with a single-core CPU. Since the FPGA consumes only 21.5 W, its speedup-per-watt performance beats that of the multi-core CPU for particle numbers greater than 500,000, by an increasing margin, when compared against a single core of a CPU.
Similarly, if compared to the multi-core CPU's performance, the GPU is a better performer than the FPGA, even though the FPGA consumes much less power.

The basis of the previous comparison was the manufacturer-supplied data for power consumption. However, some components can only run as part of a larger system; neither the GPU nor the FPGA can run without a CPU, motherboard and RAM being online as well. To reflect this compound power consumption, a set of experiments was performed with the goal of measuring the power consumption for a number of scenarios, as summarized in Table 3. For each scenario, the model was run for 10 simulations. Within each run, the power was sampled at one-second intervals using an energy usage monitor to which the computer system was connected. The power consumption numbers in Table 3 are an average within each run, further averaged over the 10 runs.

The actual system-level power consumption, as seen in Table 3, is somewhat different from the component-level predictions. The single-core CPU scenario consumed on average only 52 W to run the simulation, while the multi-core CPU needed 68 W.
Table 3
Scenarios for power consumption testing.

Scenario           Active components                        Average power consumption (W)
Single-core CPU    CPU, Motherboard, RAM^a                  52
Multi-core CPU     CPU, Motherboard, RAM^a                  68
GPU                CPU, Motherboard, RAM plus GPU           128
FPGA               CPU, Motherboard, RAM plus FPGA^a        75

^a Note: the system was run headless, accessed via VNC over a network.
Fig. 11. Speedup-per-watt achieved in accelerating the DLA algorithm, using real system-level power consumption.
When a GPU was added to the system and the simulation was executed on it, the average power requirement rose to 128 W. If an FPGA was added to the base system, the average total power consumption was 75 W. For the FPGA, the additional real power consumption was close to what the component-level data suggested: about 23 W. However, for both the CPU and the GPU, the real power consumption was considerably less than what the component-level data suggested (84 W for the CPU and 170 W for the GPU alone). Thus, with the real power consumption measured at the system level, the speedup-per-watt performance curves were updated, as shown in Fig. 11.

Even though the real power consumption at the system level is different from that at the component level, the most speedup-per-watt is still obtained if a GPU is used in accelerating a DLA simulation. At the component level, the FPGA presents the second-best alternative for a large number of particles. However, once the power consumption of the needed base-system components is factored in, the FPGA no longer presents this advantage; it is the multi-core CPU that performs best after the GPU when compared to a single-core CPU. In comparisons with a multi-core CPU as the base case, the GPU still has the advantage over the FPGA.

4. Conclusion

Many natural phenomena, representing complex systems, can be modeled using a computer simulation. The geometric structures characterizing diffusion-limited aggregation can represent both an artistic expression and a natural process such as the formation of river networks. Although the process of DLA can be quite simple, it requires considerable time to generate large structures. This paper presented a brief introduction to the aggregation
method and some literature-reported techniques to make it more efficient. Starting with an already sound algorithm, an OpenCL-based implementation was developed and tested on accelerator platforms such as a multi-core CPU, a GPU and an FPGA. Each hardware platform presented unique challenges, because some were designed for parallel execution of a large number of concurrent threads (the GPU) while others were designed to run a relatively small set of concurrent threads (the number of cores on a multi-core CPU). While FPGAs are relatively new among OpenCL-based accelerators, their low power consumption represents an energy-conscious alternative. Given the constraint of requiring only a small set of simultaneously diffusing particles, the maximum workgroup size was limited. Nevertheless, the GPU emerged as the platform achieving not only the largest speedup but also the most economical top performance as measured by speedup-per-watt. The multi-core CPU was the second-best performer if modest-sized DLA structures are sought, while the FPGA approached the multi-core CPU's performance for large DLA structures. If the speedup-per-watt is considered, the FPGA presents an alternative to GPUs and multi-core CPUs in accelerating the creation of DLA structures.

References

[1] Altera Inc., 2014. http://www.altera.com/products/software/opencl/openclindex.html.
[2] Altera Inc., Altera Quartus II – PowerPlay Power Analyzer Tool, 2015.
[3] R. Ball, M. Nauenberg, T.A. Witten Jr., Diffusion-controlled aggregation in the continuum approximation, Phys. Rev. A 29 (4) (1984) 2017–2020.
[4] C. Bischof, Parallel Computing: Architectures, Algorithms, and Applications, IOS Press, Amsterdam, The Netherlands, 2008.
[5] P.D. Bourke, Constrained diffusion limited aggregation in 3 dimensions, Comput. Graph. 30 (4) (2006) 646–649.
[6] D. Chen, D. Singh, Using OpenCL to evaluate the efficiency of CPUs, GPUs and FPGAs for information filtering, in: 22nd International Conference on Field Programmable Logic and Applications, FPL, Aug. 29–31, 2012, Oslo, Norway, 2012.
[7] D. Chen, D. Singh, Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms, in: 18th Asia and South Pacific Design Automation Conference, ASP-DAC, January 22–25, 2013, Yokohama, Japan, 2013.
[8] J. Duran, Sands, Powders, and Grains, an Introduction to the Physics of Granular Materials, Springer, 2000.
[9] EVGA, GTX760, 2015. http://www.evga.com/Products/Specs/GPU.aspx?pn=D34D9B88-00D7-4F24-A92D-76ECD7BB6346.
[10] T.C. Halsey, Diffusion-limited aggregation: A model for pattern formation, Phys. Today 53 (11) (2000) 36–41.
[11] Intel, Intel® Core™ i7-4770K Processor, 2015. http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz.
[12] B. Jamtveit, P. Meakin, Growth, Dissolution and Pattern Formation in Geosystems, Springer, 1999.
[13] Khronos Group, OpenCL, 2014. https://www.khronos.org/opencl/.
[14] V.W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A.D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, in: Proceedings of the 37th Annual International Symposium on Computer Architecture, ACM, New York, NY, USA, 2010, pp. 451–460.
[15] J.H. Lee, N. Nigania, H. Kim, OpenCL performance evaluation on modern multicore CPUs, Sci. Program. 2015 (2015) Article ID 859491.
[16] A. Lomas, Aggregation: Complexity out of simplicity, in: SIGGRAPH 2005, Los Angeles, CA, USA, August 2, 2005.
[17] NVIDIA, CUDA Toolkit Documentation, 2014. http://docs.nvidia.com/cuda.
[18] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, San Francisco, USA, 1997.
[19] S. Seo, J. Lee, G. Jo, J. Lee, Automatic OpenCL work-group size selection for multicore CPUs, in: PACT'13 Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, Edinburgh, UK, Sept. 7–11, 2013, pp. 387–398.
[20] Terasic Inc., DE5-Net User Manual, 2015. http://www.terasic.com.tw/attachment/archive/526/DE5NET_OpenCL.pdf.
[21] T.A. Witten Jr., L.M. Sander, Diffusion-limited aggregation, a kinetic critical phenomenon, Phys. Rev. Lett. 47 (19) (1981) 1400–1403.

Attila Michael Zsaki is an Associate Professor in the Department of Building, Civil and Environmental Engineering. He obtained his B.Eng. degree from Ryerson University in 1996 and his M.A.Sc. and Ph.D. degrees in civil engineering from the University of Toronto in 1999 and 2003, respectively. Dr. Zsaki's research is focused on the modeling and computational aspects of the geosciences (rock and soil mechanics), with particular interest in multiphysics modeling of continuum and discontinuum. His other areas of interest are scientific computing, parallel computing, computer graphics and mesh generation. He joined the faculty of Concordia in June 2005. In addition to academia, Dr. Zsaki has worked in industry as a software developer and consultant on various projects ranging from 3D digital content creation and geomechanics analysis software to, more recently, high-performance scientific computing applications for modeling continuum behavior. His interests in scientific computing are performance optimization and parallel computing on scalable, shared-memory multiprocessor systems, graphics processing units (GPU) and FPGAs.