Journal of Computational and Applied Mathematics 270 (2014) 496–505
A fast and interactive heat conduction simulator on GPUs

Zhangping Wei a,∗, Byunghyun Jang b, Yafei Jia a

a National Center for Computational Hydroscience & Engineering, The University of Mississippi, University, MS 38677, USA
b Department of Computer & Information Science, The University of Mississippi, University, MS 38677, USA

Article history: Received 5 October 2013
Keywords: Heat conduction; ADI; GPGPU; CUDA; OpenGL
Abstract

The GPU offers a number of unique benefits to scientific simulation and visualization. Its superior computing capability and its interoperability with graphics libraries are two of the features that make the GPU the platform of choice. In this paper, we present a fast and interactive heat conduction simulator on GPUs using CUDA and OpenGL. The numerical solution of a two-dimensional heat conduction equation is decomposed into two directions, each requiring the solution of tridiagonal linear systems. To achieve fast simulation, a widely used implicit solver, the alternating direction implicit (ADI) method, is accelerated on GPUs using GPU-based parallel tridiagonal solvers. We investigate the performance bottleneck of the solver and optimize it with several methods. In addition, we conduct a thorough evaluation of the GPU-based ADI solver performance with three different tridiagonal solvers. Furthermore, our design takes advantage of efficient CUDA–OpenGL interoperability to make the simulation interactive in real time. The proposed interactive visualization simulator can serve as a building block for numerous advanced emergency management systems in engineering practice. © 2013 Elsevier B.V. All rights reserved.
1. Introduction

Traditionally, numerical simulation and visualization is a two-step process. First, the computation is carried out on either the CPU or the GPU and the output data are written to system memory or a file; then a third-party software package (e.g., Matlab) is used to visualize the data on screen. This process typically involves multiple data copies between the CPU and the GPU, making real-time interactive visualization difficult. With general-purpose computing and graphics interoperability support on GPUs, however, it becomes possible to do both simulation and visualization in real time without any CPU intervention. This GPU-based scientific computation paradigm, shown in Fig. 1, has been widely adopted in many science and engineering fields (e.g., [1–4]).

In this research, we focus on the numerical simulation of heat conduction for two reasons. The first reason concerns its physical properties. Heat conduction involves two important processes: unsteadiness and diffusion [5]. From the numerical solution point of view, the heat conduction equation can be treated as a basic building block of the more complicated equations used in other scientific computations (e.g., the Navier–Stokes equations in computational fluid dynamics [6]). Given the same numerical discretization method, the discretized heat conduction equation yields algebraic equations similar to those of more advanced equations, so various algebraic equation solvers (e.g., the ADI solver) can easily be tested on the heat conduction simulation before they are applied to the modeling of other complex systems. The other reason is related to numerical modeling in engineering practice. Taking flood simulation as an example, unexpected levee breaching may happen during a flood, and the breaching location and time are usually unknown, so it is very important for the model user to be able to interact with the model to handle such emergencies.
∗ Corresponding author. Tel.: +1 662 915 7788; fax: +1 662 915 7796.
E-mail addresses: [email protected] (Z. Wei), [email protected] (B. Jang), [email protected] (Y. Jia).
Fig. 1. GPU-based scientific simulation and visualization paradigm.
Since an interactive heat conduction simulator can dynamically take the user's input during the simulation, it is a good starting point for building more advanced interactive systems for engineering applications.

When a two-dimensional heat conduction equation is discretized implicitly by a standard finite difference method [7], the resulting algebraic system is pentadiagonal, and it can be solved efficiently by the ADI method [8]. The basic idea behind this method is to decompose each numerical step into two sub-steps, each involving a tridiagonal linear algebraic system, and to solve them implicitly in alternating order. For the solution of a tridiagonal system, the Tri-Diagonal Matrix Algorithm (TDMA) (e.g., [6]) has been widely used in practice (e.g., [9,10]). In addition, several parallel alternatives to TDMA, e.g., cyclic reduction (CR) [11] and parallel cyclic reduction (PCR) [12], have also been developed. With the advent of the GPU, scientific computing has been significantly accelerated by its data-driven parallel execution model, and the aforementioned parallel tridiagonal solvers and other hybrid methods have also been implemented on GPUs, see e.g., [13–15], among others.

OpenGL is a cross-platform API for 2D and 3D computer graphics rendering [16,17]. This library provides a collection of APIs that allow the programmer to interact with the hardware to achieve hardware-accelerated rendering. Although CUDA and OpenGL were developed for different purposes, they can share a number of hardware resources as long as both operate in the same graphics context. For instance, an OpenGL buffer may be mapped to a memory object generated by CUDA and vice versa [16,18–20].

In [15], based on the solution of a heat conduction equation, we presented a GPU-based ADI solver that addressed two important issues. The first is the solver scalability: we proposed two mapping methods to eliminate a hardware resource constraint, namely the limited amount of shared memory. The second is the uncoalesced memory access caused by the alternating calculations, which we mitigated by leveraging hardware texture memory and matrix transpose techniques. However, regarding the inefficient memory access in the vertical sweep, the influence and the contribution of different numerical parameters (e.g., domain size and thread block size) were still not clear at that stage. Moreover, we only implemented PCR within the framework of the ADI method, so the performance of other tridiagonal solvers had not been fully investigated either.

In this paper, we develop an interactive heat conduction simulator that harnesses modern GPUs on the basis of our previous work. First, we introduce a source term into the previously used heat conduction equation to enable an interactive simulation. Second, we conduct a detailed study of how numerical parameters contribute to the inefficient memory access in the vertical sweep. Next, we evaluate the GPU-based ADI solver performance by testing three tridiagonal solvers. Finally, the CUDA–OpenGL interoperability technique is employed to visualize the simulation on the fly and to run the simulation interactively. Our overall design aims at keeping both simulation and visualization on the GPU as much as possible, to minimize the overhead associated with conventional CPU–GPU interactions.

2. Numerical discretization and solution of heat conduction equation

2.1. Heat conduction equation discretization

A two-dimensional unsteady heat conduction equation [5] is given by
$$\rho c \frac{\partial T}{\partial t} = k\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}\right) + S \qquad (1)$$
where T is the temperature, ρ is the density of the conduction medium, c is the specific heat of the conduction medium, and k is a diffusion coefficient. It should be pointed out that, different from our previous work [15], a heat source term S is introduced here in order to develop the interactive simulator presented in Section 4. In this study we also apply the assumption $k/(\rho c) = 1$; as a result, the above equation is rewritten as
$$\frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + S. \qquad (2)$$
To solve Eq. (2) numerically, we consider a rectangular computation domain as discretized in Fig. 2, where i and j are node indices in horizontal and vertical directions, respectively, and M and N are the total number of nodes in horizontal and vertical directions, respectively.
Fig. 2. Discretized two-dimensional computation domain.
Next, we implicitly discretize Eq. (2) using the finite difference method; at a typical node (i, j), the resulting algebraic equation is

$$A_P T_{i,j}^{n+1} + A_E T_{i+1,j}^{n+1} + A_W T_{i-1,j}^{n+1} + A_N T_{i,j+1}^{n+1} + A_S T_{i,j-1}^{n+1} = T_{i,j}^{n} + \Delta t\, S_{i,j}^{n} \qquad (3)$$

with the coefficients at internal nodes defined as

$$A_P = 1 + \frac{2\Delta t}{\Delta x^2} + \frac{2\Delta t}{\Delta y^2}, \qquad A_W = A_E = -\frac{\Delta t}{\Delta x^2}, \qquad A_N = A_S = -\frac{\Delta t}{\Delta y^2}$$

where n is the time level, Δt is the time step, and Δx and Δy are the mesh sizes in the horizontal and vertical directions, respectively. At the domain boundary, a Dirichlet boundary condition is applied, and the coefficients become

$$A_P = 1, \qquad A_W = A_E = A_N = A_S = 0.$$
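The coefficient setup maps naturally onto a simple CUDA kernel. The sketch below is our illustration, not the authors' code: the array names, the kernel signature, and the row-major storage with linear index j·M + i (i fastest, matching the horizontal ordering used later in Section 3) are assumptions.

```cuda
// Hedged sketch: fill the coefficients and right-hand side of Eq. (3) on an M x N
// grid stored row-major (index = j*M + i). Boundary nodes receive the Dirichlet
// coefficients A_P = 1 and A_W = A_E = A_N = A_S = 0.
__global__ void setupCoefficients(float* aP, float* aW, float* aE,
                                  float* aN, float* aS, float* rhs,
                                  const float* T, const float* S,
                                  int M, int N, float dt, float dx, float dy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= M || j >= N) return;
    int idx = j * M + i;                          // row-major, horizontal ordering

    if (i == 0 || i == M - 1 || j == 0 || j == N - 1) {
        aP[idx] = 1.0f;                           // Dirichlet boundary node
        aW[idx] = aE[idx] = aN[idx] = aS[idx] = 0.0f;
        rhs[idx] = T[idx];                        // boundary value is kept fixed
    } else {
        float rx = dt / (dx * dx), ry = dt / (dy * dy);
        aP[idx] = 1.0f + 2.0f * rx + 2.0f * ry;   // internal node, coefficients of Eq. (3)
        aW[idx] = aE[idx] = -rx;
        aN[idx] = aS[idx] = -ry;
        rhs[idx] = T[idx] + dt * S[idx];          // right-hand side of Eq. (3)
    }
}
```

With a 16 × 16 thread block, a launch such as setupCoefficients<<<dim3((M + 15) / 16, (N + 15) / 16), dim3(16, 16)>>>(...) covers the whole grid.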
2.2. Heat conduction equation solution

Following Wei et al. [15], we solve Eq. (3) by the ADI method [8] in two sub-steps. In the first sub-step, Eq. (3) is solved in the i direction:

$$a_i T_{i-1,j}^{n+1/2} + b_i T_{i,j}^{n+1/2} + c_i T_{i+1,j}^{n+1/2} = d_i \qquad (4)$$

where $a_i = A_W$, $b_i = A_P$, $c_i = A_E$ and $d_i = T_{i,j}^{n} + \Delta t\, S_{i,j}^{n} - A_N T_{i,j+1}^{n} - A_S T_{i,j-1}^{n}$. In the next sub-step, Eq. (3) is solved in the j direction:

$$a_j T_{i,j-1}^{n+1} + b_j T_{i,j}^{n+1} + c_j T_{i,j+1}^{n+1} = d_j \qquad (5)$$

where $a_j = A_S$, $b_j = A_P$, $c_j = A_N$ and $d_j = T_{i,j}^{n+1/2} + \Delta t\, S_{i,j}^{n+1/2} - A_E T_{i+1,j}^{n+1/2} - A_W T_{i-1,j}^{n+1/2}$. It is noted that Eqs. (4) and (5) are linear equation systems of the form $[A]\{T\} = \{d\}$, where $A$ is the tridiagonal matrix

$$A = \begin{bmatrix}
b_0 & c_0 &        &        &         & 0 \\
a_1 & b_1 & c_1    &        &         &   \\
    & a_2 & b_2    & c_2    &         &   \\
    &     & \ddots & \ddots & \ddots  &   \\
    &     &        & \ddots & \ddots  & c_{M-2} \\
0   &     &        &        & a_{M-1} & b_{M-1}
\end{bmatrix}.$$
Traditionally, Eqs. (4) and (5) are solved by the TDMA algorithm, as presented by Wei et al. [15]. However, because of the data dependency between successive iterations, TDMA is not suitable for GPU implementation.
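For reference, a serial TDMA (Thomas algorithm) solver can be sketched as below (our illustration; names and the single-system interface are assumptions). The forward elimination at row i uses the modified coefficients of row i − 1, and the back substitution proceeds in reverse order; this chain of dependencies is what prevents a straightforward per-row parallelization on the GPU.

```cuda
// Hedged sketch of the serial Thomas algorithm (TDMA) for one tridiagonal system
// a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1 (a[0] and c[n-1] unused).
// Host code; every step depends on the result of the previous one.
void tdma(const float* a, const float* b, const float* c,
          const float* d, float* x, int n)
{
    float* cp = new float[n];                     // modified superdiagonal
    float* dp = new float[n];                     // modified right-hand side
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {                 // forward elimination
        float m = b[i] - a[i] * cp[i - 1];        // uses row i-1: sequential dependency
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }
    x[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; --i)              // back substitution, reverse order
        x[i] = dp[i] - cp[i] * x[i + 1];
    delete[] cp;
    delete[] dp;
}
```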
Fig. 3. Two thread mapping techniques. A whole system with M equations is mapped to (a) one thread block, (b) k thread blocks.
3. Solution of heat conduction equation on GPUs

This section discusses the solution of a two-dimensional heat conduction equation by implementing the ADI solver on GPUs. We begin with our previous GPU implementation and provide a thorough investigation of the memory access pattern in the vertical sweep; we then present the data flow and memory access patterns of three existing parallel tridiagonal solvers and use them to evaluate the GPU-based ADI solver performance.

3.1. GPU-based ADI solver implementation

In the GPU-based ADI solver of Wei et al. [15], we first addressed the solver scalability by providing two mapping methods that make the solver suitable for large-scale simulation; in this study, the more scalable one is implemented. The other topic we studied is the uncoalesced memory access. We identified the issue and optimized the solver with two methods, but we did not analyze the individual contributions of the numerical parameters; therefore, we examine this process more closely here.

3.1.1. Solver scalability

For parallel computation on GPUs, one of the most important considerations is thread mapping. A well designed thread mapping may achieve several orders of magnitude speedup over a poorly designed one by fully exploiting the hardware resources. The thread mapping of the ADI method depends strongly on that of its underlying tridiagonal solver. Zhang et al. [13,14] presented a series of tridiagonal solvers on GPUs; as their intention was to develop efficient solvers, the application of those solvers to large systems was not clearly addressed, and they implemented the thread mapping shown in Fig. 3(a). This mapping can only handle small system sizes up to 1024, since all coefficients are stored in shared memory, which is a scarce hardware resource in the current generation of GPUs. As we aim to apply the ADI solver to large-scale simulations involving millions of nodes, a more scalable mapping method is needed to make full use of the hardware resources on GPUs. Fig. 3(b) shows the mapping method proposed by Wei et al. [15], which maps a whole system to multiple thread blocks; as a result, the thread block size and the amount of shared memory used are no longer restricted by the system size. In this study, this mapping method is again employed.

3.1.2. Memory access efficiency

In the current generation of GPUs, global memory access is still relatively costly compared with ALU operations. For a two-dimensional heat conduction simulation, the coefficient variables are stored linearly in the global memory space and ordered horizontally. During the alternating direction calculations, the vertical sweep therefore accesses vertically adjacent values that are separated in global memory by a stride equal to the horizontal domain size. To understand how different parameters influence the solver performance, we run the baseline implementation C2 of [15] with multiple domain sizes and thread block configurations and present the results in Fig. 4. We make two important observations: first, the speedup decreases as the domain size increases; second, different thread block sizes also yield different speedups, with thread block size 128 giving better results across domain sizes. In Fig. 5(a), we show the memory addresses of all threads within one thread block during the vertical sweep.
Before each computation, the data are loaded from global memory into shared memory; if the domain size M becomes larger, it takes more time to load the data before computing. Similarly, once the computation is finished, it also takes more time to write the data back from shared memory to global memory when M is larger. Regarding the influence of the thread block configuration, we compare the normalized execution time of the vertical sweep over the horizontal sweep for two thread block sizes, 64 and 256, as shown in Fig. 5(b). The vertical sweep with thread block size 256 is relatively slower than the one with thread block size 64. This can also be explained using Fig. 5(a): as the thread block size S becomes larger, it takes more time to load the data before computing, and the same observation applies to writing the data back after the computation.
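To illustrate the access patterns analyzed above, the following fragment (our sketch; the kernel names, the 256-thread tile, and the assumption that M and N are multiples of the block size are ours) contrasts the per-thread global-memory indices of the two sweeps for the row-major layout (index = j·M + i): in the horizontal sweep neighbouring threads read consecutive addresses (coalesced), while in the vertical sweep their addresses are M elements apart (uncoalesced).

```cuda
// Hedged sketch of the per-thread global loads in the two sweeps for an M x N grid
// stored row-major (index = j*M + i). Assumes blockDim.x <= 256 and that M and N
// are multiples of blockDim.x, so no bounds checks are shown.
__global__ void horizontalSweepLoad(const float* d, float* out, int M)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // position along the row
    int j = blockIdx.y;                              // row index
    tile[threadIdx.x] = d[j * M + i];                // consecutive addresses: coalesced
    __syncthreads();
    out[j * M + i] = tile[threadIdx.x];              // stand-in for the actual sweep work
}

__global__ void verticalSweepLoad(const float* d, float* out, int M)
{
    __shared__ float tile[256];
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // position along the column
    int i = blockIdx.y;                              // column index
    tile[threadIdx.x] = d[j * M + i];                // stride-M addresses: uncoalesced
    __syncthreads();
    out[j * M + i] = tile[threadIdx.x];
}
```

The matrix transpose optimization converts the second pattern into the first by transposing the coefficient arrays before (and after) the vertical sweep, at the cost of the transpose itself.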
Fig. 4. Parallel ADI solver speedup over serial ADI solver with four domain sizes and three thread block configurations.
Fig. 5. Analysis of the unbalanced memory access pattern. (a) Memory access in the vertical sweep with horizontal domain size = M and number of threads per block = S. (b) Normalized execution time of the vertical sweep over the horizontal sweep with number of threads per block equal to 64 and 256, respectively.
In [15], we proposed an optimization technique that accesses the data through a special hardware unit on GPUs, the texture unit, and compared its performance with an existing method based on matrix transpose, which eliminates the uncoalesced memory access in the vertical sweep by transposing the matrix coefficients so that they are accessed consecutively. These two methods are again taken into account when we present the solver performance in this study.

3.2. Tridiagonal solvers

In [15], we implemented only one tridiagonal solver, PCR, within the GPU-based ADI solver. As presented by Zhang et al. [13,14], there are several other choices in addition to PCR, and they exhibit different memory access patterns and require different amounts of hardware resources. In this study, we further investigate the ADI solver performance using three tridiagonal solvers (i.e., CR, PCR, and hybrid CR–PCR) and two optimization techniques (i.e., texture memory and matrix transpose).

3.2.1. Cyclic reduction (CR)

CR [11] is a numerical method for solving large linear systems in two phases, forward reduction and backward substitution. The forward reduction phase successively reduces a system to a smaller system with half of the unknowns, until a system of two unknowns is reached. The backward substitution phase then successively determines the other half of the unknowns using the previously solved values [13]. Fig. 6 shows the data flow of CR for an 8-equation system. In steps 1 and 2, the 8-equation system is reduced to a 2-equation system, which is solved in parallel in step 3. In steps 4 and 5, the already calculated values are substituted backward to determine the rest of the unknowns. CR can perform up to n/2 parallel operations and takes 2 log2 n steps to finish the computation on a parallel computer with n processors [14].
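The forward reduction phase can be sketched as follows (our illustration, not the solver of [13]; the shared-memory arrays, the power-of-two system size, the convention a[0] = c[n−1] = 0, and the requirement of at least n/2 threads are assumptions). At each level, the thread responsible for equation i = 2·stride·(t + 1) − 1 eliminates its two neighbours located stride rows away, so that only those equations survive into the next, half-sized level.

```cuda
// Hedged sketch of CR's forward reduction for one n-equation system held in shared
// memory (n a power of two, blockDim.x >= n/2). The calling kernel loads the system
// into a, b, c, d and afterwards solves the remaining 2x2 system and back-substitutes.
__device__ void crForwardReduction(float* a, float* b, float* c, float* d, int n)
{
    int t = threadIdx.x;
    for (int stride = 1; stride < n / 2; stride *= 2) {
        int i = 2 * stride * (t + 1) - 1;          // equations kept for the next level
        if (i < n) {
            int iL = i - stride;                   // upper neighbour always exists
            float k1 = a[i] / b[iL];
            if (i + stride < n) {                  // interior equation: two neighbours
                int iR = i + stride;
                float k2 = c[i] / b[iR];
                b[i] -= c[iL] * k1 + a[iR] * k2;
                d[i] -= d[iL] * k1 + d[iR] * k2;
                a[i] = -a[iL] * k1;
                c[i] = -c[iR] * k2;
            } else {                               // last equation: upper neighbour only
                b[i] -= c[iL] * k1;
                d[i] -= d[iL] * k1;
                a[i] = -a[iL] * k1;
                c[i] = 0.0f;
            }
        }
        __syncthreads();                           // level finished before halving again
    }
    // The two equations at indices n/2 - 1 and n - 1 now form a 2x2 system, which is
    // solved directly before the backward substitution phase (not shown).
}
```

The backward substitution phase traverses the same reduction tree in the opposite order and is omitted here; note that every thread must reach the __syncthreads() at the end of each level, which is why the guard surrounds only the update and not the loop body.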
Fig. 6. Data flow of CR with an 8-equation system.
Fig. 7. Data flow of PCR with an 8-equation system. (Arrows in step 2 are omitted for simplicity.)
3.2.2. Parallel cyclic reduction (PCR)

PCR [12] is another powerful parallel alternative to the TDMA algorithm for tridiagonal linear systems. Unlike CR, PCR performs only forward reduction: in each step it reduces the current system into two systems of half the size, and finally it solves M (the domain size) one-equation systems simultaneously. Its data flow is illustrated in Fig. 7 for an 8-equation system. In the first reduction step, the 8-equation system is reduced into two 4-equation systems; the two 4-equation systems are then reduced into four 2-equation systems; and in the final step (step 3), the four systems are solved. PCR takes 12n log2 n operations and log2 n steps to finish [13]. (A minimal kernel sketch of this reduction is given below.)

3.2.3. Hybrid CR–PCR solver

Zhang et al. [13] observed that the complexity of CR is O(n), which is smaller than those of the other solvers, but its parallelism is low at the end of the forward reduction and at the beginning of the backward substitution, whereas PCR maintains more parallelism through all steps. To combine the merits of the different solvers, Zhang et al. [13] proposed several hybrid solvers; the hybrid CR–PCR solver is implemented in this study. This solver first reduces the system to a certain size using the forward reduction phase of CR, then solves the reduced (intermediate) system with PCR, and finally substitutes the solved unknowns back into the original system using the backward substitution phase of CR. Regarding the performance of the three solvers, Zhang et al. [13] found that CR is less efficient than PCR and hybrid CR–PCR, mainly because of bank conflicts when it accesses shared memory, while the hybrid CR–PCR solver performs best for a thread block size of 512 because it has fewer steps and the intermediate system is solved by PCR without any bank conflicts.

3.3. Performance analysis

In this part, we implement the three tridiagonal solvers in the GPU-based ADI solver using the scalable mapping method shown in Fig. 3(b) and the two optimization techniques (i.e., texture memory and matrix transpose). Our test machine is configured with an Intel Core i7-3770K CPU running at 3.5 GHz, 16 GB of system memory, an NVIDIA GTX 680 GPU with 48 kB of shared memory (Kepler architecture), CUDA 5.5, and the Ubuntu 12.04 operating system.
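To make the PCR reduction of Section 3.2.2 concrete, the following is a minimal, hedged sketch of a PCR kernel for a single system that fits in one thread block (one thread per equation, n = blockDim.x a power of two, and a[0] = c[n−1] = 0). It is our illustration under those assumptions, not the solver of [13–15]; the ADI solver in this study maps each system to multiple thread blocks as in Fig. 3(b) and handles general sizes.

```cuda
// Hedged sketch: PCR for a single system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
// One thread per equation; after log2(n) steps every equation is decoupled.
__global__ void pcrSolveSingleSystem(const float* a_g, const float* b_g,
                                     const float* c_g, const float* d_g,
                                     float* x_g, int n)
{
    extern __shared__ float s[];                   // 4*n floats supplied at launch
    float *a = s, *b = a + n, *c = b + n, *d = c + n;
    int i = threadIdx.x;
    a[i] = a_g[i]; b[i] = b_g[i]; c[i] = c_g[i]; d[i] = d_g[i];
    __syncthreads();

    for (int delta = 1; delta < n; delta *= 2) {
        int iL = max(i - delta, 0);                // neighbour delta rows above
        int iR = min(i + delta, n - 1);            // neighbour delta rows below
        float alpha = (i - delta >= 0) ? -a[i] / b[iL] : 0.0f;
        float gamma = (i + delta <  n) ? -c[i] / b[iR] : 0.0f;
        float bNew = b[i] + alpha * c[iL] + gamma * a[iR];
        float dNew = d[i] + alpha * d[iL] + gamma * d[iR];
        float aNew = alpha * a[iL];                // coupling now 2*delta rows away
        float cNew = gamma * c[iR];
        __syncthreads();                           // all old values have been read
        a[i] = aNew; b[i] = bNew; c[i] = cNew; d[i] = dNew;
        __syncthreads();                           // all new values have been written
    }
    x_g[i] = d[i] / b[i];                          // each row is a 1-equation system
}
```

A launch of the form pcrSolveSingleSystem<<<1, n, 4 * n * sizeof(float)>>>(a, b, c, d, x, n) covers one system; the GPU-based ADI solver instead launches one or more blocks per row (or column) of the domain.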
Fig. 8. Performance of the GPU-based ADI solver over a serial solver with three tridiagonal algorithms and three thread block configurations (domain size: 4096 × 4096). Texture memory (top panel); matrix transpose (bottom panel).
Simulation parameters are configured as follows. The mesh sizes are uniform in both directions with Δx = Δy = 0.01 m, the numerical time step is Δt = 0.01 s, and the total simulation time is 1.0 s, resulting in 100 time steps in total. In contrast to the four domain sizes tested in [15], only the domain size 4096 × 4096 is considered here, since it is representative of the large-scale simulations encountered in engineering practice. We also experiment with three thread block configurations (i.e., 64, 128, and 256). In all implementations, the single-precision floating point data type is used.

Fig. 8 presents the GPU-based ADI solver performance with the two optimization techniques and the three tridiagonal solvers. In general, the matrix transpose performs better than the texture memory, since it eliminates the uncoalesced memory access and the overhead of the transpose is relatively small for this specific case [15]. Regarding the performance of the three solvers under the texture memory optimization, it is very interesting to find that PCR performs better than the hybrid CR–PCR. As mentioned before, the hybrid CR–PCR is the most efficient of the three when a single tridiagonal system is considered [13], a setting that does not involve the alternating calculations we have here. The performance degradation observed in this work should therefore be attributed to the dynamic large-scale system involving a complex source term and data transformation. Specifically, although the hybrid CR–PCR has fewer steps, it still suffers from bank conflicts during its CR phase, and under the uncoalesced memory access pattern shown in Fig. 5(a) the bank conflicts may become worse. PCR, on the other hand, is free of bank conflicts, which means it is less restricted than CR and the hybrid CR–PCR. Once the thread block size reaches 256, the advantage of PCR (i.e., freedom from bank conflicts) is almost entirely offset by the uncoalesced memory access; as a result, the three tridiagonal solvers perform similarly. With the matrix transpose, the uncoalesced memory access penalty is no longer present, and the major issue becomes the bank conflict. Since the forward reduction of CR suffers from more serious bank conflicts as it goes deeper, and there is no bank conflict in PCR, it is reasonable that PCR outperforms the other two solvers and that the hybrid CR–PCR is better than CR, as shown in the bottom panel of Fig. 8.

4. Interactive heat conduction simulation using OpenGL

In this section, we develop a visualization system to run the heat conduction simulation interactively using CUDA–OpenGL interoperability. Since this technique has been widely used in practice, step-by-step implementations can be found in the references (e.g., [18–20]). Instead of repeating the detailed steps, we first briefly describe the principle of CUDA–OpenGL interoperability, and then present our implementation procedure and demonstrate the interactive simulation.

4.1. CUDA and OpenGL interoperability

CUDA was developed by NVIDIA for data-parallel computing, while OpenGL was designed for rendering 2D and 3D computer graphics; since both interact with the GPU, it is possible to run the simulation and display the results on the fly. Achieving the interoperability involves three steps, described below.
Fig. 9. Flowchart of an interactive simulation using CUDA and OpenGL.
(1) The OpenGL resource (e.g., a texture or renderbuffer object) is registered with CUDA, after which a device pointer associated with the registered resource can be manipulated by CUDA. (2) The registered resource can be mapped and unmapped by CUDA as many times as necessary; while it is mapped, a CUDA kernel can read and write data through the pointer. (3) When it is unmapped, OpenGL is able to process the image data.

4.2. Interactive heat conduction simulation

In our implementation, we choose a pixel buffer object (PBO) as the resource shared by CUDA and OpenGL. The OpenGL Utility Toolkit (GLUT) [21], a window-system-independent toolkit for writing OpenGL programs, is used to integrate the CUDA and OpenGL programs in the C language. Since GLUT has its own event loop, the computation procedure shown in Fig. 9 is designed to handle the interaction between CUDA and OpenGL. The procedure can be divided into the following steps.

4.2.1. Initialization

Before the main simulation loop, all involved components are initialized as follows. (1) CUDA initialization. To achieve interoperability with OpenGL, the CUDA device is specified by cudaGLSetGLDevice(), which tells CUDA and OpenGL to communicate through the designated GPU. (2) OpenGL initialization. This includes creating the display window, defining GLUT events, and registering callback functions; the OpenGL extensions are also initialized. (3) Heat conduction simulation setup. To run the simulation correctly, the boundary condition is first specified and the internal domain values are initialized. The coefficient variables presented in Section 2 are calculated and allocated in both the host and device memory spaces.

4.2.2. GLUT events

Once CUDA, GLUT, and the heat conduction simulation parameters are initialized properly, the GLUT event loop is started by calling glutMainLoop(). Among the numerous callback functions provided by GLUT, three are used in this development.

glutMouseFunc(). Although glutMouseFunc() reports several kinds of events, we are only interested in the coordinates of the clicked points; heat sources are dynamically added at these points according to certain rules.

glutKeyboardFunc(). In the GLUT keyboard event, an Esc key callback is set up to end the simulation and free the resources.

glutDisplayFunc(). This display callback function involves both CUDA computation and OpenGL rendering, and it performs the following steps. (1) As presented in Section 3, the heat conduction equation is solved by the ADI solver on GPUs; although this part is complex and involves several kernels, it appears as a single function call in the display callback, as indicated in Fig. 9, and heat sources may be added to the computation if necessary. (2) Once CUDA finishes the computation, the heat field value (i.e., the temperature) is converted to RGBA image data; OpenGL then takes over the shared PBO, creates the texture from the PBO, and displays the image on screen. (3) As the simulation runs continuously, the newly calculated temperature is displayed by calling glutSwapBuffers(), which performs a flush of the OpenGL pipeline and does a buffer swap.
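The following sketch summarizes how the pieces above fit together (our illustration, not the authors' code). The kernel temperatureToRGBA, the stub runADISolverStep(), the simple colour map, and the fixed window size are assumptions; only the CUDA–OpenGL interoperability calls (cudaGraphicsGLRegisterBuffer, cudaGraphicsMapResources, cudaGraphicsResourceGetMappedPointer, cudaGraphicsUnmapResources) and the GLUT/OpenGL calls are the standard APIs.

```cuda
#include <GL/glew.h>
#include <GL/freeglut.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

static int M = 1024, N = 1024;                 // assumed domain size
static GLuint pbo = 0, tex = 0;                // PBO shared with CUDA, display texture
static cudaGraphicsResource* pboRes = nullptr;
static float* d_T = nullptr;                   // device temperature field

// Hypothetical stand-in for the ADI kernels of Section 3: advance d_T by one step.
void runADISolverStep() { /* launch the horizontal and vertical sweep kernels here */ }

// Convert temperatures (assumed normalized to [0,1]) into RGBA pixels in the PBO.
__global__ void temperatureToRGBA(uchar4* pix, const float* T, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        unsigned char v = (unsigned char)(255.0f * fminf(fmaxf(T[idx], 0.0f), 1.0f));
        pix[idx] = make_uchar4(v, 0, 255 - v, 255);   // simple blue-to-red colour map
    }
}

static void createSharedResources()
{
    glGenBuffers(1, &pbo);                     // the PBO that CUDA will write into
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, M * N * 4, nullptr, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    cudaGraphicsGLRegisterBuffer(&pboRes, pbo, cudaGraphicsMapFlagsWriteDiscard); // step (1)
    glGenTextures(1, &tex);                    // the texture used to display the PBO
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, M, N, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}

static void display()                          // registered with glutDisplayFunc(display)
{
    runADISolverStep();                        // CUDA part of the display callback
    uchar4* pix = nullptr; size_t bytes = 0;
    cudaGraphicsMapResources(1, &pboRes, 0);   // step (2): CUDA owns the PBO while mapped
    cudaGraphicsResourceGetMappedPointer((void**)&pix, &bytes, pboRes);
    temperatureToRGBA<<<(M * N + 255) / 256, 256>>>(pix, d_T, M * N);
    cudaGraphicsUnmapResources(1, &pboRes, 0); // step (3): OpenGL may now use the PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, M, N, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    // ... draw a textured quad covering the window (omitted) ...
    glutSwapBuffers();                         // flush the pipeline and swap buffers
    glutPostRedisplay();                       // keep the simulation loop running
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(M, N);
    glutCreateWindow("Interactive heat conduction");
    glewInit();
    cudaGLSetGLDevice(0);                      // pair the CUDA device with the GL context
    cudaMalloc(&d_T, M * N * sizeof(float));
    cudaMemset(d_T, 0, M * N * sizeof(float));
    createSharedResources();
    glutDisplayFunc(display);
    // glutMouseFunc()/glutKeyboardFunc() callbacks for sources and Esc are omitted
    glutMainLoop();
    return 0;
}
```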
Fig. 10. Snapshot of an interactive heat conduction simulation.
Fig. 10 shows a snapshot of an interactive two-dimensional heat conduction simulation with heat sources added arbitrarily in the computation domain.

5. Conclusions

In this paper, we have developed a fast and interactive heat conduction simulator that harnesses modern GPUs. We solved a two-dimensional heat conduction equation with the ADI method using GPU-based tridiagonal solvers, building on our previous work. We conducted a detailed study to understand the influence of numerical parameters on the inefficient memory access in the vertical sweep. Furthermore, we explored the GPU-based ADI solver performance by experimenting with three tridiagonal solvers. Additionally, real-time interactive visualization of the simulation has been developed with CUDA–OpenGL interoperability using a pixel buffer object. Although the GPU-based ADI solver and the interactive simulation technique presented in this research are based on the numerical solution of a two-dimensional heat conduction equation on GPUs, they can be easily extended and adopted in other research. Future work includes extending the fast and interactive heat conduction simulator into an interactive flood modeling system for engineering practice, and further optimizing the parallel ADI solver.

Acknowledgments

This work is supported by the USDA Agriculture Research Service under the Specific Research Agreement No. 58-64081-609 monitored by the USDA-ARS National Sedimentation Laboratory and The University of Mississippi. The first author would like to thank the Graduate School and the Engineering Dean's Office of The University of Mississippi for providing travel grants for him to present this work at the FEMTEC2013 Conference.

References

[1] T.D. Zuk, S. Carpendale, Interactive simulation and visualization using the GPU, in: Proc. SPIE, 2005, pp. 262–273.
[2] T. Umenhoffer, L. Szirmay-Kalos, Interactive Distributed Fluid Simulation on the GPU, Technical Report, Budapest University of Technology and Economics, Hungary, 2008.
[3] P. Goswami, P. Schlegel, B. Solenthaler, R. Pajarola, Interactive SPH simulation and rendering on the GPU, in: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Eurographics Association, 2010, pp. 55–64.
[4] C. Huang, H. Sun, S. Liu, P. Li, Interactive soft-fabrics watering simulation on GPU, Comput. Anim. Virtual Worlds 22 (2011) 99–106.
[5] S. Patankar, Numerical Heat Transfer and Fluid Flow, Taylor & Francis, 1980.
[6] H.K. Versteeg, W. Malalasekera, An Introduction to Computational Fluid Dynamics: The Finite Volume Method, Prentice Hall, 2007.
[7] A.R. Mitchell, D.F. Griffiths, The Finite Difference Method in Partial Differential Equations, Wiley, Chichester, 1980.
[8] D.W. Peaceman, H.H. Rachford Jr., The numerical solution of parabolic and elliptic differential equations, J. Soc. Ind. Appl. Math. 3 (1955) 28–41.
[9] M. Kass, A. Lefohn, J. Owens, Interactive Depth of Field Using Simulated Diffusion on a GPU, Pixar Animation Studios Technical Report, 2006.
[10] S. Sengupta, M. Harris, Y. Zhang, J.D. Owens, Scan primitives for GPU computing, in: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, 2007, pp. 97–106.
[11] R.W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. ACM 12 (1965) 95–113.
[12] R.W. Hockney, C.R. Jesshope, Parallel Computers, Adam Hilger, 1981.
[13] Y. Zhang, J. Cohen, J.D. Owens, Fast tridiagonal solvers on the GPU, ACM SIGPLAN Not. 45 (2010) 127–136.
[14] Y. Zhang, J. Cohen, A.A. Davidson, J.D. Owens, A hybrid method for solving tridiagonal systems on the GPU, in: GPU Computing Gems Jade Edition, 2011, pp. 117–132.
[15] Z. Wei, B. Jang, Y. Zhang, Y. Jia, Parallelizing alternating direction implicit solver on GPUs, Procedia Comput. Sci. 18 (2013) 389–398.
[16] D. Shreiner, OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1, Volume 1, Addison-Wesley Professional, 2010.
[17] The Khronos Group, 2013. http://www.opengl.org/ (accessed 2.01.2013).
[18] J. Stam, What every CUDA programmer should know about OpenGL, in: GPU Technology Conference, San Jose, CA, 2009.
[19] V. Demir, A.Z. Elsherbeni, CUDA–OpenGL interoperability to visualize electromagnetic fields calculated by FDTD, 2012, pp. 206–214.
[20] NVIDIA, NVIDIA CUDA Programming Guide 4.2, 2012.
[21] M.J. Kilgard, The OpenGL Utility Toolkit (GLUT) Programming Interface, API version 3, 1996.