Optimized Sparse Cholesky Factorization on Hybrid Multicore Architectures

Meng Tang (a), Mohamed Gadou (a), Steven Rennich (c), Timothy A. Davis (b), Sanjay Ranka (a)

(a) University of Florida, Gainesville, Florida, USA
(b) Texas A&M University, College Station, Texas, USA
(c) NVIDIA, Santa Clara, California, USA

Abstract

We present techniques for supernodal sparse Cholesky factorization on a hybrid multicore platform consisting of a multicore CPU and a GPU. The techniques are the subtree algorithm, pipelining, and multithreading. The subtree algorithm [15] minimizes PCIe transmissions by storing an entire branch of the elimination tree in the GPU memory (the elimination tree is a tree data structure describing the workflow of the factorization), and also reduces the total kernel launch time by launching BLAS kernels in batches. The pipelining technique overlaps the execution of GPU kernels and PCIe data transfers. The multithreading technique [17] creates multiple threads for both the CPU and the GPU, to exploit the concurrency exposed by the elimination tree. Our experimental results on a platform consisting of an Intel multicore processor along with an NVIDIA GPU indicate a significant improvement in performance and energy over CHOLMOD (SuiteSparse 4.5.3), a supernodal sparse Cholesky factorization package, after these techniques are applied.

Keywords: sparse matrices, sparse direct methods, Cholesky factorization, GPU, CUDA

This is an extended version of our conference paper that was invited to the ICCS special issue (https://doi.org/10.1016/j.procs.2017.05.260)

1. Introduction

Sparse direct methods are used to solve linear algebraic problems [4], including systems of linear equations, sparse least squares problems, and eigenvalue problems, and they form the backbone of many applications in computational science [21]. The computational workflow of sparse Cholesky factorization [5] is structured as a tree where each node represents the factorization of a dense sub-matrix, which we refer to as a "supernode", and each edge represents an assembly operation of portions of a factorized descendant supernode into one of its ancestors. A supernode must have all its descendants assembled before being factorized itself, and all sub-matrices with no children or with all their descendants already assembled may be factorized in parallel.

Hybrid multicore processors (HMPs) are poised to dominate the landscape of the next generation of computing on the desktop as well as on exascale systems. HMPs consist of general purpose CPU and GPU cores along with specialized cores, and are expected to provide benefits to a wide spectrum of applications at significantly lower energy requirements per floating-point operation. Linear algebra problems are examples of applications that can exploit GPU-accelerated systems; many of these problems are based on matrix operations that can benefit from GPUs. Ltaief et al. [12] provided an efficient implementation of dense Cholesky factorization on HMP systems with multiple CPU cores and multiple GPUs. Tomov et al. [18] provided a set of techniques for efficient dense linear algebra algorithms on hybrid systems with GPUs, and applied their techniques to the implementation of a hybrid LU factorization. Tomov et al. [19] provided a programming model for dense matrix factorization techniques such as QR, LU, and Cholesky on a system with GPUs, together with a set of optimizations relevant to this type of system. Volkov and Demmel [20] presented results for dense matrix factorization using QR, LU, and Cholesky algorithms; they were able to reach up to 80-90% of the GEMM rate and 300 Gflops on two GPUs. Ayguadé et al. [2] provided an extension of the Star Superscalar programming model that targets the parallelization of applications on hybrid platforms with GPUs, and tested their model on dense matrix factorization applications.

In this paper, we describe efficient multithreaded and pipelined algorithms for sparse Cholesky factorization on HMPs consisting of CPUs and GPUs. This algorithm is shown to provide significant improvement over the baseline supernodal algorithm, CHOLMOD, in both its CPU [3] and GPU [15]
versions. CHOLMOD is a SuiteSparse module provided by Chen, Davis, Hager, and Rajamanickam, capable of performing sparse Cholesky factorization on one GPU. CHOLMOD has been under continuous development for quite some time. Version 4.0.0 [14] was the first to benefit from GPU acceleration. This was substantially improved starting in version 4.3.0 and continuing to version 4.5.5, using what we will refer to as the "root" algorithm on the GPU. Later CHOLMOD was extended with the "subtree" algorithm, first made available in v4.6.0-beta (however, portions of v4.6.0-beta still use the root algorithm). In this work we investigate extending the root algorithm using multiple CPU threads and, separately, further improving the subtree algorithm by applying pipelining. Both the root algorithm and the subtree algorithm are left-looking supernodal algorithms.

The baseline root algorithm uses a single CPU thread to factorize a sparse matrix on the entire GPU. When the matrix size is large, this method results in good performance. For small matrices, the performance can be improved by setting up multiple threads and letting each thread use a portion of the GPU. The multithreading technique leverages the structure of the matrix to factorize in parallel as many sub-matrices as possible. Using the streaming feature of the GPU, the factorization of different sub-matrices can be done concurrently, resulting in higher performance on the GPU.

By storing a subtree instead of a single supernode in the GPU memory, the subtree algorithm avoids copying factorization results back and forth between the GPU memory and the main memory, reducing the cost of PCIe transfers. By using pipelines on top of the subtree algorithm, the overlapping of factorizations and data transfers can reduce the overall running time.

Our experimental results on sparse matrices from the SuiteSparse Matrix Collection (formerly called the UF Sparse Matrix Collection [7]) show that CHOLMOD v4.5.3 with multithreading achieves speedups of up to 3.5x over v4.5.3 as published. The results also show that pipelining adds a 10% to 25% improvement in performance for the subtree algorithm. We also measured the energy requirements of our implementation and show that similar gains in energy can be achieved. This represents a significant improvement over implementations that are already finely tuned for the NVIDIA GPU.

In Section 2, we provide background information on Cholesky factorization, and present previous research related to sparse direct methods and sparse Cholesky factorization. Section 3 describes the workflow for
CHOLMOD. In Section 4, we describe the subtree algorithm. In Section 5, we describe how pipelining is implemented in our algorithm, and how the pipelines add concurrency to the workflow. In Section 6, we describe how the scheduler handles the factorization workflow, and how multithreading is implemented on a single GPU. Experimental results are presented in Section 7. Finally, we conclude in Section 8.

2. Background

The Cholesky factorization is a decomposition of a real symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose, i.e. given a real symmetric positive-definite matrix $A$, find a lower triangular matrix $L$ such that $LL^T = A$. Example:
$$\begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} = \begin{bmatrix} 3 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 3 & 2 \\ 0 & 1 \end{bmatrix}.$$

There are three major techniques for sparse Cholesky factorization: right-looking, up-looking, and left-looking. Rose et al. [16] describe the major differences between the three techniques. In our algorithm, we use a left-looking method for both dense and sparse Cholesky factorization. The left-looking method computes $L$ one column at a time. It can be derived from the following expression (we use uppercase letters for matrices and lowercase letters for column vectors and scalars) [8]:
$$\begin{bmatrix} A_{11} & a_{12} & A_{31}^T \\ a_{12}^T & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{bmatrix} = \begin{bmatrix} L_{11} & & \\ l_{12}^T & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{bmatrix} \begin{bmatrix} L_{11}^T & l_{12} & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{bmatrix}$$
where the first block column of $L$, namely $L_{11}$, $l_{12}$, $L_{31}$, is already computed, and the next column elements $l_{22}$, $l_{32}$ are to be computed. Note that $l_{22}$ is a scalar. We have $l_{12}^T l_{12} + l_{22}^2 = a_{22}$, so $l_{22} = \sqrt{a_{22} - l_{12}^T l_{12}}$; and $a_{32} = L_{31} l_{12} + l_{32} l_{22}$, so $l_{32} = (a_{32} - L_{31} l_{12}) / l_{22}$. If we let
$$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} a_{22} \\ a_{32} \end{bmatrix} - \begin{bmatrix} l_{12}^T \\ L_{31} \end{bmatrix} l_{12},$$
where $c_1$ is a scalar, then $l_{22} = \sqrt{c_1}$ and $l_{32} = c_2 / l_{22}$. In practice, we use a supernodal approach [11] to compute multiple columns at a time; in this case $l_{22}$ becomes a dense submatrix rather than a scalar. In each iteration, we first compute the product $\begin{bmatrix} l_{12}^T \\ L_{31} \end{bmatrix} l_{12}$ and hence $c$. We then compute $l_{22}$ with a dense Cholesky factorization, followed by $l_{32}$ with a triangular solve. The supernodal method exploits the fact that sparse matrices often
contain rows and columns with identical or similar non-zero pattern [13]. As a result, $l_{32}$ becomes a submatrix whose columns all have identical pattern (or nearly so), allowing it to be represented as a single dense submatrix holding just the nonzero entries. By clustering these rows or columns together, thus leveraging the CPU's and GPU's ability to perform "dense math", the supernodal method can reduce memory consumption and gain performance.

CHOLMOD is an implementation of the left-looking supernodal method, which factorizes one or multiple columns at a time. Descendant supernodes that the current supernode depends on are assembled into the current supernode immediately before factorizing the current supernode. It is composed of three steps: 1) fill-reducing permutation, 2) symbolic analysis, and 3) numeric factorization. These steps are described in detail below.

A fill-reducing permutation is performed to minimize the number of nonzero entries, so that less computation is required to factorize the matrix. The fill-minimizing problem is NP-hard, but efficient heuristic algorithms exist [1] [5] [6]. In CHOLMOD we use METIS [10] for the permutation, which provides an ordering with low fill-in that also exposes good parallelism in the tree.

After the permutation, a symbolic analysis is performed to determine the workflow of the sparse matrix factorization. The workflow is represented by the elimination tree. For example, the matrix boneS10 has an elimination tree with 53030 nodes, and Fig. 1 shows the structure of its top four levels. Each node of the elimination tree represents the factorization of a supernode. A sparse Cholesky factorization algorithm needs to factorize all supernodes in the elimination tree, and the data dependencies need to be met so that no parent node is processed before any of its children. In Fig. 1 there exist nodes that have no dependency between each other, e.g. nodes 0, 2, 12, 16, and 18. This is a common case in sparse matrices, and indicates possible parallelism in factorization algorithms. This "tree parallelism" is solely dependent on the structure of the elimination tree, and can be exploited by using an elimination-tree-based scheduler.

Figure 1: Top four levels of boneS10's elimination tree

The scheduler
is expected to find the supernodes that are next to be factorized, and to call the factorization functions for them. The factorization of the entire sparse matrix is done by repeatedly factorizing supernodes from the leaves to the root.

A supernode is an m × k sub-matrix of A (m ≥ k). It is composed of a square sub-matrix on the top and a rectangular sub-matrix on the bottom. First the contribution blocks from its descendants are computed during the apply operation and then assembled into the rectangular sub-matrix. Let the parent front be $A_p$ and the factorized child front be
$$L_c = \begin{bmatrix} L_{c1} \\ L_{c2} \end{bmatrix},$$
where $L_{c1}$ is square and $L_{c2}$ is rectangular; then the contribution block is
$$C_{p,c} = \begin{bmatrix} L_{c1} \\ L_{c2} \end{bmatrix} L_{c1}^T.$$
The assembly operation is done by subtracting contribution blocks from $A_p$. After the apply and assembly, a dense Cholesky factorization is performed on the square sub-matrix, and the rectangular sub-matrix is then processed with a triangular solve.

3. Supernodal sparse Cholesky factorization

The workflow of a sparse Cholesky factorization is structured as an elimination tree, where each tree node stands for a dense Cholesky factorization, and each edge stands for the parent's data dependency on its children. Contribution blocks from descendants are required in each supernode's assembly phase. Factorizations with no data dependency may be processed independently and in parallel.

A single-threaded algorithm such as the CHOLMOD [3] root algorithm (v4.5.3) orders the nodes in ascending order with regard to the column index, and factorizes them in that order. Since no lower-indexed columns depend on higher-indexed columns, this workflow ensures that no dependency is violated. The factorization workflow is straightforward, and no scheduler is needed in the single-threaded case. The update, assembly, and triangular solve workflow is integrated into the factorization workflow.

The left-looking Cholesky requires that all descendants be assembled before a supernode is factorized. Each node keeps a list of descendant nodes that are needed in its assembly phase; the assembly list of each node is initialized to an empty list. After choosing a supernode for factorization, the thread creates an empty contribution block for the node. The algorithm then goes through the assembly list, updating the contribution block with each list member, removing it from the current list, and putting it in the assembly list of the current node's parent. After this
step, the contribution block is subtracted from the current supernode, followed by a dense Cholesky factorization on the leading triangular matrix. Finally, the output of the dense Cholesky is used for a triangular solve on the lower rectangular matrix. After the triangular solve, the current node is added to the assembly list of its parent.

The root algorithm uses both the GPU and the CPU for the factorization and other computations. Because data transfers between the main memory and the GPU memory are expensive, the GPU is used only for sufficiently large computations. The processing of each node is composed of four steps: update, assembly, factorize, and triangular solve.

The list of descendants to assemble is first sorted by their sizes, in descending order. It is assumed that the GPU will process the larger ones, while the CPU will process the smaller ones. Whenever the GPU is not busy, a descendant is picked from the head of the list and copied to the GPU memory. The update operation is essentially a matrix multiplication of the form
$$C = \begin{bmatrix} C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} L_1 \\ L_2 \end{bmatrix} L_1^T,$$
where $C_1 = L_1 L_1^T$ is computed with the cuBLAS DSYRK routine and $C_2 = L_2 L_1^T$ is computed with the cuBLAS DGEMM routine. A zero contribution block is initialized in the GPU memory before the update operations for the current node are performed; each time a $C$ for a descendant is computed, it is added to this contribution block. A zero contribution block is also initialized in the main memory before all updates. If the GPU is not available, or the descendant matrix is too small, then a descendant is picked from the tail of the list. The update operation is completed with the BLAS DSYRK and DGEMM routines, and the result is added to the contribution block.

The assembly operation is a matrix subtraction $A' = A - C$, where $A$ is the supernode to be factorized and $C$ is the contribution block from its descendants. Since the factorization operation is done by the GPU only for large supernodes, the choice of where the assembly takes place (either in the GPU memory or in the main memory) also depends on the size of the matrix. For supernodes above the user-provided threshold, the contribution block in the main memory is copied into the GPU memory and summed with the in-GPU-memory contribution block; this sum is then subtracted from the supernode. For supernodes below the threshold, the contribution block in the GPU memory is copied into the main memory, and the assembly operation takes place in the main memory instead.
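
The per-descendant update just described maps directly onto two cuBLAS calls. The following is a minimal sketch of that mapping, assuming column-major storage; the names (m, k, n, d_L, d_C, ldl, ldc) are illustrative and not taken from the CHOLMOD sources, and the contribution block d_C is assumed to be zero-initialized so the result can be accumulated with beta = 1.

    /* Sketch: C = [C1; C2] = [L1; L2] * L1^T for one descendant panel.
     * d_L: m-by-n descendant panel in GPU memory (column-major, leading dim ldl);
     *      its top k rows form L1, the remaining m-k rows form L2.
     * d_C: m-by-k contribution block in GPU memory (leading dim ldc).            */
    #include <cublas_v2.h>

    static void update_descendant(cublasHandle_t handle, int m, int k, int n,
                                  const double *d_L, int ldl,
                                  double *d_C, int ldc)
    {
        const double one = 1.0;

        /* C1 (k-by-k, symmetric) += L1 * L1^T : DSYRK on the top k rows */
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    k, n, &one, d_L, ldl, &one, d_C, ldc);

        /* C2 ((m-k)-by-k) += L2 * L1^T : DGEMM on the remaining rows */
        if (m > k)
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                        m - k, k, n, &one,
                        d_L + k, ldl,   /* L2 starts at row k of the panel   */
                        d_L, ldl,       /* L1 is the top k rows of the panel */
                        &one,
                        d_C + k, ldc);  /* C2 starts at row k of the block   */
    }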

After the assembly, a dense Cholesky factorization is performed on the top triangular part of the supernode. The triangular matrix is divided into column panels. Each panel consists of a user-defined number of columns (currently at most 384) and is factorized using a left-looking approach. Before a column panel (m × k) is factorized, it is first updated with previous columns using the cuBLAS DSYRK and DGEMM routines. The k × k leading matrix is then factorized with the LAPACK potrf routine. The remaining (m − k) × k rectangular matrix is triangular-solved with the cuBLAS DTRSM routine, using the output of the k × k factorization as the divisor.

The objective of the triangular solve is to find a matrix R' such that R = R' L^T, where R is the rectangular part of the supernode and L is the output of the dense Cholesky factorization. The triangular solve is completed using the cuBLAS DTRSM routine. Finally, the entire supernode is copied back to the main memory.
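
The triangular solve that produces the rectangular part can likewise be expressed as a single right-side DTRSM call. Below is a hedged sketch with illustrative names (mr, k, d_L, d_R), not the actual CHOLMOD code; it solves R' L^T = R in place, i.e. R := R * inv(L^T).

    #include <cublas_v2.h>

    /* d_L: k-by-k lower-triangular Cholesky factor of the supernode's top block.
     * d_R: mr-by-k rectangular part of the supernode; overwritten with R'.       */
    static void solve_rectangular_part(cublasHandle_t handle, int mr, int k,
                                       const double *d_L, int ldl,
                                       double *d_R, int ldr)
    {
        const double one = 1.0;
        cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                    mr, k, &one, d_L, ldl, d_R, ldr);
    }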

4. Subtree algorithm

The subtree algorithm was developed by Rennich, Stosic and Davis [15]. It is derived from the root algorithm. The root algorithm requires each supernode to be copied from the main memory into the GPU memory every time it is used to update an ancestor. The subtree algorithm avoids this by repeatedly copying an entire branch of the elimination tree into the GPU memory, factorizing its supernodes in batches, and copying the results back. Compared with the root algorithm, the subtree algorithm avoids copying supernodes back and forth between the GPU memory and the main memory, and therefore reduces the time spent on PCIe communication. In each GPU function call such as DSYRK and DGEMM, there is a launch latency before the actual computations begin, which reduces the overall factorization efficiency. By launching DGEMMs in batches, the subtree algorithm greatly reduces the total number of launch latencies incurred, and also increases the concurrency of the DGEMM kernels.

The subtree algorithm groups supernodes into an array of "levels". A level is composed of supernodes that are independent of each other, and supernodes in a level only depend on supernodes in previous levels. If supernodes are factorized one level at a time, from lower levels to higher levels, no dependencies will be violated.

Before doing the actual factorization, the subtrees need to be constructed. The maximum size of a subtree is defined as the capacity of the GPU
memory. Despite being called a "subtree" algorithm, our algorithm also accepts "subforests" as valid subtrees. The subtrees are constructed such that each subtree is a subforest of the elimination tree, and any of its supersets that is also a subforest of the elimination tree would exceed the maximum allowed size. Fig. 2 shows an example of finding a subtree in the elimination tree. The subtree in this example has two levels.

Figure 2: Determining subtrees in the elimination tree
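
The grouping of a subtree's supernodes into levels described above can be computed in one pass, as sketched below. The sketch assumes the subtree is given as a postordered parent[] array (children appear before their parents); the function and variable names are illustrative, not CHOLMOD's. Supernodes that end up with the same level value have no dependencies among themselves, so their DSYRK, DGEMM, DPOTRF, and DTRSM calls can be launched as one batch.

    /* level[s] = 0 for leaves, otherwise 1 + max(level of s's children).
     * Returns the number of levels in the subtree.                       */
    static int build_levels(int n, const int *parent, int *level)
    {
        int nlevels = 0;
        for (int s = 0; s < n; s++) level[s] = 0;
        for (int s = 0; s < n; s++)          /* postorder: children before parents */
        {
            int p = parent[s];               /* p < 0 marks a root of the subtree  */
            if (p >= 0 && level[p] < level[s] + 1)
                level[p] = level[s] + 1;     /* a parent comes after all children  */
            if (level[s] + 1 > nlevels) nlevels = level[s] + 1;
        }
        return nlevels;
    }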

After determining the subtrees, the factorization of each subtree runs as follows. In the first step, all the supernodes in the subtree are copied into the GPU memory. In the second step, the supernodes are factorized level by level; the factorization of each level consists of calls to DSYRK, DGEMM, DPOTRF and DTRSM that can be processed in batches. Finally, after the factorization of the last level, the results are copied back to the main memory. Algorithm 1 illustrates the implementation of the subtree factorization.

For large matrices it is quite possible that the entire matrix cannot fit in the GPU memory. After processing the subtrees, there will be higher-level supernodes left unfactorized. While these supernodes might be small, they are likely to have a large number of descendants that need to be assembled before factorizing them, making it impossible to use the subtree
technique. In this case, the algorithm falls back to the original supernodal algorithm, which is named the "root algorithm".

Algorithm 1: Subtree algorithm

procedure FACTORIZE_SUBTREE
  for all supernodes in subtree do
    copy supernode data to GPU memory
  end for
  device synchronization
  for all levels in subtree do
    for all supernodes in level do
      compute contribution blocks' upper panels (DSYRK, batched)
      compute contribution blocks' lower panels (DGEMM, batched)
    end for
    device synchronization
    for all supernodes in level do
      assemble contribution blocks (batched)
    end for
    device synchronization
    for all supernodes in level do
      factorize supernode (DPOTRF, batched)
    end for
    device synchronization
    for all supernodes in level do
      compute supernode lower panel (DTRSM, batched)
    end for
    device synchronization
  end for
  for all supernodes in subtree do
    copy supernode data back to main memory
  end for
end procedure

5. Pipelining technique

The subtree algorithm is already several times superior in performance to the root algorithm, and still more techniques can be applied to enhance
the performance even further. In our algorithm, we use pipelines to overlap factorizations and data transfers. Specifically, we overlap the copy-back of one level with the factorization of the next level. Fig. 3 shows the timing of the factorization of a subtree, without and with pipelines respectively. The idea is that, since the PCIe bus can work in parallel with the GPU cores, there is no need to wait for the completion of PCIe transmissions before factorizing the next level. Our code uses separate CUDA streams for copying back factorization results, so that PCIe transmissions can be mostly hidden.
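
A minimal sketch of this stream-based overlap is shown below: an event recorded on the compute stream marks the completion of one level, and a dedicated copy-back stream waits on that event before starting the device-to-host transfer, which then runs while the kernels of the next level execute on the compute stream. The names are illustrative, and the destination buffer is assumed to be pinned host memory.

    #include <cuda_runtime.h>

    static void start_copy_back(double *h_dst, const double *d_src, size_t bytes,
                                cudaStream_t compute, cudaStream_t copyback,
                                cudaEvent_t level_done)
    {
        cudaEventRecord(level_done, compute);           /* current level finished here  */
        cudaStreamWaitEvent(copyback, level_done, 0);   /* copy stream waits for it     */
        cudaMemcpyAsync(h_dst, d_src, bytes,            /* D->H transfer overlaps the   */
                        cudaMemcpyDeviceToHost,         /* next level's kernels, which  */
                        copyback);                      /* are launched on "compute"    */
    }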

Figure 3: Using pipelines in the subtree algorithm (“Fx” for factorizing level x and “CBx” for copying back level x)

The pseudocode for the implementation of the pipelining technique is presented in Algorithm 2. Data copies between the main memory and the GPU memory, as well as the cuBLAS kernels, all run asynchronously; therefore a synchronization barrier needs to be put before the second for loop, to ensure that all the data are already present in the GPU memory before starting the cuBLAS kernels. After factorizing each level, there is also a synchronization barrier to ensure that the factorization is finished before the copy-back is started.

Note that the results are not copied from the GPU memory directly to the destination. To get the best performance, and to support asynchronous copies, pinned host memory is used for buffers. The data need to be first transferred to the pinned host memory buffer, then copied to the destination. A CUDA event is used to meet the dependencies between the two stages of the copy-back process. The event is recorded after the data are transferred to the pinned host memory buffer, and synchronized before the data are copied
to the destination. It essentially serves as a synchronization barrier between the two stages.

Algorithm 2: Pipelining in the subtree algorithm

procedure FACTORIZE_SUBTREE_PIPELINED
  for all supernodes in subtree do
    copy supernode data to GPU memory
  end for
  device synchronization
  for all levels in subtree do
    for all supernodes in level do
      compute contribution blocks' upper panels (DSYRK, batched)
      compute contribution blocks' lower panels (DGEMM, batched)
    end for
    event synchronization
    copy the previous level from buffer to destination
    device synchronization
    for all supernodes in level do
      assemble contribution blocks (batched)
    end for
    device synchronization
    for all supernodes in level do
      factorize supernode (DPOTRF, batched)
    end for
    device synchronization
    for all supernodes in level do
      compute supernode lower panel (DTRSM, batched)
    end for
    device synchronization
    for all supernodes in level do
      copy supernode data back to buffer (cudaMemcpyAsync)
    end for
    record event
  end for
  copy the last level from buffer to destination
end procedure

A number of moderately sized pinned host buffers are used. As there is
sufficient host memory bandwidth, this approach permits substantial concurrency (copying into pinned host buffers, copying from host buffers to the device, copying from the device to host buffers, and copying out of host buffers simultaneously) while benefiting from maximum PCIe transfer rates and avoiding the cost and system issues associated with pinning large portions of the system memory. We expect the copy-back of each level's data to run in parallel with the factorization of the next level; therefore, there is no device synchronization after the last for loop, and the statements that correspond to the transfer from the pinned host memory buffer to the destination are postponed until the next batch of DSYRK and DGEMM kernels has been launched. In this way, the GPU can compute the contribution blocks while, at the same time, data are either transferred through the PCIe bus or flushed out of the pinned host memory buffer.
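
The loop below is a simplified sketch of this two-stage, pipelined copy-back. It assumes one pinned staging buffer per level (the actual implementation cycles through a small pool of buffers), an assumed helper factorize_level() that launches the batched kernels of one level, and illustrative pointer and event names; none of this is taken verbatim from the CHOLMOD sources.

    #include <cuda_runtime.h>
    #include <string.h>

    void factorize_level(int lev, cudaStream_t s);  /* assumed: batched DSYRK/DGEMM/DPOTRF/DTRSM */

    void factorize_subtree_pipelined(int nlevels,
                                     double *d_level[], double *h_pinned[],
                                     double *h_dest[], const size_t bytes[],
                                     cudaStream_t compute, cudaStream_t copyback,
                                     cudaEvent_t level_done, cudaEvent_t in_buffer[])
    {
        for (int lev = 0; lev < nlevels; lev++)
        {
            factorize_level(lev, compute);              /* launch this level's batches   */

            /* stage 1: device -> pinned buffer, queued behind this level's kernels
             * on a separate stream, so it overlaps the next level's computation    */
            cudaEventRecord(level_done, compute);
            cudaStreamWaitEvent(copyback, level_done, 0);
            cudaMemcpyAsync(h_pinned[lev], d_level[lev], bytes[lev],
                            cudaMemcpyDeviceToHost, copyback);
            cudaEventRecord(in_buffer[lev], copyback);

            /* stage 2 for the previous level: pinned buffer -> final destination,
             * performed on the CPU while the GPU is busy with the current level   */
            if (lev > 0)
            {
                cudaEventSynchronize(in_buffer[lev - 1]);
                memcpy(h_dest[lev - 1], h_pinned[lev - 1], bytes[lev - 1]);
            }
        }
        cudaEventSynchronize(in_buffer[nlevels - 1]);   /* flush the last level */
        memcpy(h_dest[nlevels - 1], h_pinned[nlevels - 1], bytes[nlevels - 1]);
    }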

6. Multithreading technique for GPU

The first parallel supernodal algorithm for Cholesky factorization was developed by Ng and Peyton [13]. It leverages the nature of sparse matrices to factorize the supernodes in parallel. The structure of the elimination tree holds information about the data dependencies between supernodes, and the underlying tree structure [9] can help us determine which supernodes may be factorized concurrently.

In our algorithm, we use multithreading (using OpenMP on the CPU) to leverage the task parallelism in the underlying tree structure. To keep all threads busy, as long as enough supernodes are available, they are run asynchronously and independently. Upon finishing the factorization of a supernode, each thread finds its next target in the elimination tree. We use tree-based scheduling to maximize the number of supernodes being factorized in parallel. The supernodes that are ready to be factorized are either represented by leaf nodes or have all their descendants assembled into them. At any given time, all "ready" supernodes are independent of each other. These supernodes may be factorized in parallel, but not necessarily synchronously. At the beginning of the execution, the set of "ready" supernodes is exactly the set of supernodes corresponding to the elimination tree's leaf nodes, so the scheduler starts with a complete list of the elimination tree's leaf nodes. A number of threads are initiated, and each thread processes a leaf node. Upon
successful factorization of a supernode, the thread checks the status of the parent of its current node. If the parent is ready, the thread proceeds to the parent, assembles all descendants, and factorizes it. If not, it implies that a child of the parent has not finished factorizing yet, so the thread searches for the next leaf in the leaf list. When the end of the leaf list has been reached, the thread terminates. Our algorithm ensures that 1) all supernodes represented by leaf nodes are factorized, and 2) among a node's children, only the child that finishes factorizing last proceeds to its parent. Therefore each supernode is factorized exactly once.

Our algorithm is essentially a multithreaded extension of CHOLMOD. Each thread is expected to find its own workspace, run a scheduler, and do its computations independently, except in critical sections. The critical section is limited to updating the assembly lists; this is needed because there could be a data conflict when two nodes with a common ancestor are being put into the ancestor's assembly list. Since CHOLMOD uses both the CPU and the GPU in factorization, each thread should have access to both the CPU and the GPU. The multithreading of the CPU portion of the code is relatively straightforward, as OpenMP handles it naturally. In order to use multiple threads on a single GPU, we divide the GPU memory so that each thread only uses a portion of the GPU. To run k threads on the GPU, we split the memory of the GPU into k regions of the same size. Each thread is assigned to a region, and maintains its own set of data structures, buffers, and workspaces. Each thread also creates its own set of CUDA streams. Each of these k regions essentially provides its thread with an interface identical to using the entire GPU.

Fig. 4 shows how multithreading is implemented on a GPU. The main program launches CUDA kernels and copies data to and from the GPU memory. After splitting the GPU memory into k regions and creating the streams, memory copies and computational kernels in different threads are independently initiated using different CUDA streams.

Figure 4: Using multiple threads on a single GPU. The upper figure is for a single thread; the lower figure is for multiple threads.
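
The following sketch illustrates the GPU partitioning just described: one large device allocation is split into k equal regions, and each OpenMP thread works only inside its own region with its own CUDA stream and cuBLAS handle. All names are illustrative; this is not the actual CHOLMOD code.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <omp.h>

    void run_with_k_gpu_threads(int k, size_t gpu_bytes)
    {
        char *d_pool;
        cudaMalloc((void **)&d_pool, gpu_bytes);
        size_t region_bytes = gpu_bytes / k;              /* k equal-sized regions */

        #pragma omp parallel num_threads(k)
        {
            int tid = omp_get_thread_num();
            char *d_region = d_pool + (size_t)tid * region_bytes;

            cudaStream_t stream;                          /* per-thread stream ... */
            cublasHandle_t handle;                        /* ... and cuBLAS handle */
            cudaStreamCreate(&stream);
            cublasCreate(&handle);
            cublasSetStream(handle, stream);

            /* ... the thread factorizes the supernodes assigned to it by the
             * tree-based scheduler, using only d_region, stream and handle ... */
            (void)d_region;

            cublasDestroy(handle);
            cudaStreamDestroy(stream);
        }
        cudaFree(d_pool);
    }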

Table 1: Test matrices used for comparing (Single GPU method) and MAGMA

matrix        problem type                  dimension    nonzeros
Emilia_923    structural                      923,136    40,373,538
Fault_639     structural                      638,802    27,245,944
Flan_1565     structural                    1,564,794   114,165,372
Geo_1438      structural                    1,437,960    60,236,322
Hook_1498     structural                    1,498,023    59,374,451
Serena        structural                    1,391,349    64,131,971
StockF_1465   computational fluid dynamics  1,465,137    21,005,389
audikw_1      structural                      943,695    77,651,847
bone010       model reduction                 986,703    47,851,783
nd24k         2D/3D problem                    72,000    28,715,634

Though the real hardware may limit the threads' average bandwidth and computing power, from the software's point of view different threads act as if they were executing on different portions of the GPU. Our algorithm creates k threads when there are k split regions on the GPU; k is limited by the total number of available CPU cores. Since the GPU significantly outperforms the CPU when processing large matrices, creating more threads than k would result in using CPU cores for factorization, generally deteriorating the overall performance. A counter is used to indicate the number of available CPU threads. The counter is initialized to k' (the maximum number of allowed CPU threads). When a thread finds its assigned GPU memory region in use, it checks the counter: if the counter has a positive value, it is decremented and the thread continues the computation on the CPU; otherwise the thread waits until the assigned GPU memory region is free again. When it finishes using the CPU, the thread increments the counter.

Our algorithm can be easily extended to utilize multiple GPUs; the experimental results presented in this paper are, however, limited to a single GPU.
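
The tree-based scheduling described in this section can be summarized by the following sketch. It assumes the elimination tree is given as a parent[] array, that pending_children[p] is initialized to the number of children of p, and that factorize_supernode() is an assumed helper performing the assembly and factorization of one supernode; none of these names come from the CHOLMOD sources. Each thread repeatedly takes an unprocessed leaf, factorizes it, and climbs toward the root only if it was the last child of the parent to finish, so every supernode is factorized exactly once and only after all of its children.

    #include <omp.h>

    void factorize_supernode(int s);   /* assumed helper: assemble + factorize one supernode */

    void schedule_tree(const int *parent, int *pending_children,
                       const int *leaves, int nleaves)
    {
        int next_leaf = 0;
        #pragma omp parallel
        {
            for (;;)
            {
                int s;
                #pragma omp critical (leaf_list)        /* grab the next unprocessed leaf */
                { s = (next_leaf < nleaves) ? leaves[next_leaf++] : -1; }
                if (s < 0) break;                       /* leaf list exhausted            */

                while (s >= 0)
                {
                    factorize_supernode(s);
                    int p = parent[s], last = 0;
                    if (p >= 0)
                    {
                        #pragma omp critical (tree_state)   /* update the parent's state  */
                        { last = (--pending_children[p] == 0); }
                    }
                    s = last ? p : -1;                  /* only the last child climbs up  */
                }
            }
        }
    }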

7. Experimental Results

We have conducted extensive experiments to evaluate the performance, power, and energy requirements of our algorithm. Given space limitations, we briefly present representative results. The experiments were carried out on a platform consisting of a dual-socket Intel(R) Xeon(R) CPU E5-2695 v2 at 2.4 GHz and an NVIDIA Tesla P100 GPU with 3840 CUDA cores and 16 GB of physical memory. Table 1 provides summary information about the matrices used in our performance measurements; these matrices were chosen from the SuiteSparse Matrix Collection [7].

As indicated in our algorithm description, we have the ability to vary the number of threads. Although the performance should generally increase as we increase the number of threads, there is a threshold after which the overhead of a new stream outweighs the benefits (both in terms of time and power requirements). In our experiments, we measured performance, power, and energy for our test matrices using different algorithms and configurations. We compare these objectives using CHOLMOD (v4.5.3) on multiple CPUs [3], the single-GPU CHOLMOD (v4.5.3) of Rennich et al. [15], and our technique using one, two, and four threads. Figures 5, 6, and 7 show, respectively, the Gflops, power, and energy used by each of the described techniques to factorize our test matrices.

Figure 5 shows that our proposed techniques are effective for most of the tested matrices. Clearly, the amount of improvement depends on the matrix. In our experiments, the multithreading technique, based on the root algorithm (SuiteSparse 4.5.3), improves the performance by up to 3.5 times. Though there are exceptions, the performance generally increases when more threads are used. The subtree technique (v4.6.0-beta), based on the root algorithm (v4.5.3), improves the performance by up to 4 times. The pipelining technique, based on the subtree algorithm (v4.6.0-beta), improves the performance by 10% to 25%. We might expect to gain the best performance when all three techniques are combined, but in practice the multithreading technique is not as effective when the subtree technique is present: the number of threads quickly hits the threshold when the subtree algorithm is used. The experiments show the best performance when we use the pipelined subtree algorithm and the number of GPU threads is 1 or 2.

The power results in Fig. 6 show that the power increases as the performance of the algorithm improves. This is expected, because higher performance usually indicates more intense computation on the GPU. Despite the higher power consumption, the total energy consumed actually decreases when the performance is better (Fig. 7), because the decrease in factorization time offsets the increase in power.

Figure 5: Cholesky factorization performance for sparse matrices

From the power results (Fig. 6) and the energy results (Fig. 7) we see that the use of the pipelining technique causes the GPU to use more power, but since the factorization time is reduced, the total energy consumption stays roughly the same.

8. Conclusions

In this paper we present techniques to improve the performance of sparse Cholesky factorization. Our multithreaded algorithm works by using separate streams on the GPU for multiple supernodes. It uses the workflow derived from the elimination tree to schedule the tasks corresponding to the factorization of sub-matrices. Experimental results clearly demonstrate the performance and energy improvements over existing implementations. Using the root algorithm in v4.5.3 as a reference, the multithreaded algorithm presented here improves performance by up to 3.5x and demonstrates the benefits of using streams to extract more concurrency from both the CPU and the GPU. The subtree algorithm in the existing v4.6.0-beta version shows a performance improvement of up to 4x vs. v4.5.3 and demonstrates the benefit of minimizing the bottleneck of host↔device communication. However, the pipelined version presented here shows how this host↔device communication bottleneck can be further reduced, resulting in an additional 25% performance improvement, or up to a 5x improvement as compared to the reference performance of v4.5.3.

Figure 6: Power Consumption for Cholesky Factorization

Figure 7: Energy Consumption for Cholesky Factorization

The efficiency of data copies between the host and device memory is a key factor in the algorithm's performance. The bus bandwidth is bounded by hardware limits, but work can be done to increase bandwidth utilization. The subtree technique and the pipelining technique both aim to reduce the time spent on PCIe data transfers. It might still be possible to further improve the performance by refining the scheduling policy to better overlap the computations and data transfers, and this is part of our current investigation. Additionally, we are developing extensions to our work to support efficient parallelization over multiple GPUs.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CNS 1514116. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

[1] Amestoy, P. R., Davis, T. A., and Duff, I. S. (2004). Algorithm 837: Amd, an approximate minimum degree ordering algorithm. ACM Transactions on Mathematical Software (TOMS), 30(3):381–388.



[2] Ayguad´e, E., Badia, R. M., Igual, F. D., Labarta, J., Mayo, R., and Quintana-Ort´ı, E. S. (2009). An extension of the starss programming model for platforms with multiple gpus. In European Conference on Parallel Processing, pages 851–862. Springer. [3] Chen, Y., Davis, T. A., Hager, W. W., and Rajamanickam, S. (2008). Algorithm 887: Cholmod, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS), 35(3):22. [4] Collins, G. et al. (1990). Fundamental numerical methods and data analysis. Fundamental Numerical Methods and Data Analysis, by George Collins, II. [5] Davis, T. A. (2006). Direct methods for sparse linear systems. SIAM. [6] Davis, T. A., Gilbert, J. R., Larimore, S. I., and Ng, E. G. (2004). Algorithm 836: Colamd, a column approximate minimum degree ordering algorithm. ACM Transactions on Mathematical Software (TOMS), 30(3):377–380. 19



[7] Davis, T. A. and Hu, Y. (2011). The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1.


[8] Davis, T. A., Rajamanickam, S., and Sid-Lakhdar, W. M. (2016). A survey of direct methods for sparse linear systems. Acta Numerica, 25:383– 566.


[9] Duff, I. S. (1986). Parallel implementation of multifrontal schemes. Parallel Computing, 3(3):193–204. [10] Karypis, G., Schloegel, K., and Kumar, V. (2014). Metis–serial graph partitioning and fill-reducing matrix ordering, version 5.



[11] Liu, J. W., Ng, E. G., and Peyton, B. W. (1993). On finding supernodes for sparse matrix computations. SIAM Journal on Matrix Analysis and Applications, 14(1):242–252.


[12] Ltaief, H., Tomov, S., Nath, R., Du, P., and Dongarra, J. (2010). A scalable high performant cholesky factorization for multicore with gpu accelerators. In International Conference on High Performance Computing for Computational Science, pages 93–101. Springer.



[13] Ng, E. and Peyton, B. W. (1993). A supernodal Cholesky factorization algorithm for shared-memory multiprocessors. SIAM Journal on Scientific Computing, 14(4):761–769. [14] Rennich, S. C., Stosic, D., and Davis, T. A. (2014). Accelerating sparse Cholesky factorization on GPUs. In Proceedings of the Fourth Workshop on Irregular Applications: Architectures and Algorithms, pages 9–16. IEEE Press. [15] Rennich, S. C., Stosic, D., and Davis, T. A. (2016). Accelerating sparse Cholesky factorization on GPUs. Parallel Computing, 59:140–150.

[16] Rose, D., Whitten, G., Sherman, A., and Tarjan, R. (1980). Algorithms and software for in-core factorization of sparse symmetric positive definite matrices. Computers & Structures, 11(6):597–608.

[17] Tang, M., Gadou, M., and Ranka, S. (2017). A multithreaded algorithm for sparse Cholesky factorization on hybrid multicore architectures. Procedia Computer Science, 108:616–625. 20



[18] Tomov, S., Dongarra, J., and Baboulin, M. (2010a). Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36(5-6):232–240.


[19] Tomov, S., Nath, R., Ltaief, H., and Dongarra, J. (2010b). Dense linear algebra solvers for multicore with gpu accelerators. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE.



[20] Volkov, V. and Demmel, J. (2008). LU, QR and Cholesky factorizations using vector capabilities of GPUs. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, 49.





[21] Yeralan, S. N., Davis, T., and Ranka, S. (2013). Sparse QR factorization on GPU architectures. University of Florida, Tech. Rep.



Highlights

- A multithreading technique is proposed to utilize the tree parallelism of Cholesky factorization
- The subtree technique increases the concurrency of the Cholesky factorization and reduces the total kernel launch latency
- The pipelining technique overlaps floating point computations and data transfers to reduce total factorization time


M





Meng Tang is a PhD student in the Department of Computer & Information Science & Engineering at the University of Florida. He received his master's degree from the University of Chinese Academy of Sciences and his bachelor's degree from Shanghai Jiaotong University. His research focus is high performance computing.




Mohamed Gadou is a PhD student in the Department of Computer Information Science and Engineering at University of Florida. He got his MSc in Engineering Mathematics and Physics and BSc in Computer Systems and Engineering, both from Alexandria University, Egypt. His research interests are on High Performance Computing and Scientific Computing.

Steven Rennich is a Senior HPC Developer Technology engineer at NVIDIA whose primary interest is in the development of massively parallel algorithms. Current activities involve supporting the deployment of the latest GPUaccelerated supercomputers and all of developing, demonstrating, and supporting high performance sparse linear algebra methods on GPUs. Prior to joining NVIDIA, he spent time at MSC Software and at MIT’s Lincoln Laboratory. He received his BS from UIUC and his MS and PhD (all in

AeroAstro) from Stanford where he performed computational studies of vortex instabilities.








Tim Davis is a Professor in the Computer Science and Engineering Department at Texas A&M University. His primary scholarly contribution is the creation of widely-used sparse matrix algorithms and software (SuiteSparse). His software is relied upon by a vast array of commercial, government lab, and open source applications, including MATLAB (x=A\b when A is sparse), Apple, Mathematica, Google (Street View, Photo Tours, and 3D Earth), Octave, Cadence, MSC NASTRAN, Mentor Graphics, and many more. As an NVIDIA Academic Partner, he is creating a new suite of highly-parallel sparse direct methods that can exploit the high computational throughput of recent GPUs. NVIDIA has designated his work Texas A&M as a CUDA Research Center. He was elected in 2013 as a SIAM Fellow, in 2014 as an ACM Fellow, and in 2016 as an IEEE Fellow. He serves as an associate editor for ACM Transactions on Mathematical Software, the Journal of Parallel and Distributed Computing, and the SIAM Journal on Scientific Computing. Tim is a Master Consultant to The MathWorks.

Sanjay Ranka is a Professor in the Department of Computer Information Science and Engineering at University of Florida. His current research interests are high performance and parallel computing with a focus on energy efficiency; and big data science with a focus on data mining/machine learning algorithms for spatiotemporal applications. His work is driven by applications in CFD, remote sensing, health care and transportation. He teaches courses on data science (three course curriculum), data mining and parallel computing. From 1999–2002, he was the Chief Technology Officer at Paramark (Sunnyvale, CA). At Paramark, he developed a real-time optimization service called PILOT for marketing campaigns. PILOT served more than 10 million optimized decisions a day in 2002 with a 99.99% uptime. Paramark was recognized by VentureWire/Technologic Partners as a top 100 Internet technology company in 2001 and 2002 and was acquired in 2002. He has also


held a tenured faculty position at Syracuse University and positions as a researcher/visitor at IBM T.J. Watson Research Labs and Hitachi America Limited.




Sanjay earned his Ph.D. (Computer Science) from the University of Minnesota and a B. Tech. in Computer Science from IIT, Kanpur, India. He has coauthored four books, 230+ journal and refereed conference articles. His recent co-authored work has received a best student paper runner up award at IGARSS 2015, best paper award at BICOB 2014, best student paper award at ACM-BCB 2010, best paper runner up award at KDD-2009, a nomination for the Robbins Prize for the best paper in journal of Physics in Medicine and Biology for 2008, and a best paper award at ICN 2007.






He is a fellow of the IEEE and AAAS, and a past member of IFIP Committee on System Modeling and Optimization. He is an associate Editor-in-Chief of the Journal of Parallel and Distributed Computing and an associate editor for ACM Computing Surveys, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Sustainable Computing: Systems and Informatics, Knowledge and Information Systems, and International Journal of Computing. Additionally, he is a book series editor for CRC Press for Bigdata. In the past, he has been an associate editor for IEEE Transactions on Parallel and Distributed Systems and IEEE Transactions on Computers.
