A Multithreaded Algorithm for Sparse Cholesky Factorization on Hybrid Multicore Architectures

Procedia Computer Science 108C (2017) 616–625


International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

Meng Tang, Mohamed Gadou, and Sanjay Ranka

University of Florida, Gainesville, Florida, USA
[email protected], [email protected], [email protected]

Abstract

We present a multithreaded method for supernodal sparse Cholesky factorization on a hybrid multicore platform consisting of a multicore CPU and a GPU. Our algorithm can utilize concurrency at different levels of the elimination tree by using multiple threads in both the CPU and the GPU. The elimination tree is a tree data structure describing the workflow of the factorization. Our experimental results on a platform consisting of an Intel multicore processor along with an Nvidia GPU indicate a significant improvement in performance and energy over the single-threaded supernodal algorithm.

Keywords: sparse matrices, sparse direct methods, Cholesky factorization, GPU, CUDA

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science

1 Introduction

Sparse direct methods are used to solve linear algebraic problems [3], including systems of linear equations, sparse least-squares and eigenvalue problems, and form the backbone of many applications in computational science [15]. The computational workflow of sparse Cholesky factorization [5] is structured as a tree, where each node represents the factorization of a dense frontal matrix and each edge represents an assembly operation of a factorized descendant frontal matrix into its ancestors. A frontal matrix is a dense submatrix of the entire sparse matrix. A sub-matrix must have all its descendants assembled before being factorized itself, and all sub-matrices with no children or with all their descendants already assembled may be factorized in parallel.

Hybrid multicore processors (HMPs) are poised to dominate the landscape of the next generation of computing on the desktop as well as on exascale systems. HMPs consist of general-purpose CPU cores along with specialized GPU cores and are expected to provide benefits to a wide spectrum of applications at significantly lower energy requirements per FLOP (FLoating-point OPeration). In this paper, we describe an efficient multithreaded algorithm for sparse Cholesky factorization on HMPs consisting of CPUs and GPUs. This algorithm is shown to provide a significant improvement over the supernodal algorithm, CHOLMOD [4], in both its CPU [2] and GPU [13] versions. CHOLMOD is a SuiteSparse module provided by Chen, Davis,




Hager and Rajamanickam, capable of performing sparse Cholesky factorization on one GPU. Their algorithm uses a single thread for factorizing a matrix on the entire GPU. When the matrix size is large, this results in good performance. For small matrices, the performance can be improved by using only a portion of the GPU. Our algorithm leverages the structure of the matrix to factorize in parallel as many sub-matrices as possible. Using the streaming feature of the GPU, the factorization of sub-matrices and other tasks can be done concurrently, resulting in higher performance on the GPU. Each dense sub-matrix is factorized with cuBLAS on GPUs or OpenBLAS on CPUs. Our experimental results on sparse matrices derived from SuiteSparse (previously the UF Sparse Matrix Collection) show that speedups of up to 3.5 times can be achieved by our algorithm. We also measured the energy requirements of our implementation and show that similar gains in energy can be achieved. This represents a significant improvement over implementations that are already finely tuned for the NVIDIA GPU.

In Section 2, we provide background information for Cholesky factorization. We also present previous research related to sparse direct methods and sparse Cholesky factorization. Section 3 describes the workflow of CHOLMOD. In Section 4, we describe how the scheduler handles the factorization workflow and how multithreading is implemented on a single GPU. Experimental results are presented in Section 5. Finally, we conclude in Section 6.

2 Background

The Cholesky factorization is a decomposition of a real symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose, i.e., given a real symmetric positive-definite matrix $A$, find a lower triangular matrix $L$ such that $LL^T = A$. Example:
$$\begin{pmatrix} 9 & 6 \\ 6 & 5 \end{pmatrix} = \begin{pmatrix} 3 & 0 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 0 & 1 \end{pmatrix}.$$
There are three major techniques for sparse Cholesky factorization: right-looking, up-looking, and left-looking. Rose et al. [14] describe the major differences among the three techniques. In our algorithm, we use a left-looking method for both dense and sparse Cholesky factorization.

The left-looking method computes $L$ one column at a time. It can be derived from the following expression (we use uppercase letters for matrices and lowercase letters for column vectors and scalars) [8]:
$$\begin{pmatrix} L_{11} & & \\ l_{12}^T & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & l_{12} & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{pmatrix} = \begin{pmatrix} A_{11} & a_{12} & A_{31}^T \\ a_{12}^T & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{pmatrix}$$
where the first column entries of $L$, namely $L_{11}$, $l_{12}$, $L_{31}$, are already computed, and the next column elements, $l_{22}$, $l_{32}$, are to be computed. Note that $l_{22}$ is a scalar. We have $l_{12}^T l_{12} + l_{22}^2 = a_{22}$, so $l_{22} = \sqrt{a_{22} - l_{12}^T l_{12}}$; and $a_{32} = L_{31} l_{12} + l_{32} l_{22}$, so $l_{32} = (a_{32} - L_{31} l_{12}) / l_{22}$. If we let
$$c = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} a_{22} \\ a_{32} \end{pmatrix} - \begin{pmatrix} l_{12}^T \\ L_{31} \end{pmatrix} l_{12},$$
where $c_1$ is a scalar, then $l_{22} = \sqrt{c_1}$ and $l_{32} = c_2 / l_{22}$. In practice, we could use a supernodal approach [11] to compute multiple columns at a time. In each iteration, we first compute $\begin{pmatrix} l_{12}^T \\ L_{31} \end{pmatrix} l_{12}$ and subtract it to obtain $c$. We then compute $l_{22}$ with a dense Cholesky factorization, followed by $l_{32}$ with a triangular solve. The supernodal method exploits the fact that sparse matrices often contain rows and columns with identical or similar non-zero pattern [12]. By clustering these rows or columns together, the supernodal method can reduce memory consumption and gain performance.
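As a concrete illustration of the column formulas above, the following C sketch (ours, not the paper's code; the dense column-major storage and the function name are assumptions) computes column $j$ of $L$ from the already-computed columns, i.e., it forms $c$ and then applies $l_{22} = \sqrt{c_1}$ and $l_{32} = c_2 / l_{22}$.

```c
/* Illustrative sketch of the left-looking column computation described
 * above; not the authors' code.  A is a dense symmetric positive-definite
 * n x n matrix stored column-major, and the lower triangular factor L
 * (also column-major) is built one column at a time.                     */
#include <math.h>

void left_looking_column(const double *A, double *L, int n, int j)
{
    /* c = A(j:n-1, j) - L(j:n-1, 0:j-1) * L(j, 0:j-1)^T */
    for (int i = j; i < n; i++) {
        double c = A[i + (size_t)n * j];
        for (int k = 0; k < j; k++)
            c -= L[i + (size_t)n * k] * L[j + (size_t)n * k];
        L[i + (size_t)n * j] = c;
    }
    /* l22 = sqrt(c1); l32 = c2 / l22 */
    L[j + (size_t)n * j] = sqrt(L[j + (size_t)n * j]);
    for (int i = j + 1; i < n; i++)
        L[i + (size_t)n * j] /= L[j + (size_t)n * j];
}
```

Calling this routine for j = 0, ..., n-1 yields the full dense factor; the supernodal variant replaces these loops with blocked BLAS calls (syrk/gemm, potrf, trsm) on groups of columns, as described in the following sections.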


CHOLMOD is an implementation of the left-looking supernodal method, which factorizes one or multiple columns at a time. Descendant frontal matrices that the current frontal matrix depends on are assembled into the current frontal matrix immediately before factorizing it. It is composed of three steps: 1) fill-reducing permutation, 2) symbolic analysis, and 3) numeric factorization. These steps are described in detail below.

A fill-reducing permutation is performed to minimize the number of nonzero entries, so that less computation is required to factorize the matrix. The fill-minimizing problem is NP-hard, but efficient heuristic algorithms exist [1] [5] [6]. After the permutation, a symbolic analysis is performed to determine the workflow of the sparse matrix factorization. The workflow is represented by the elimination tree. For example, in our algorithm the matrix boneS10 (Table 1) has an elimination tree with 53030 nodes. Fig. 1 shows the structure of the top four levels of boneS10's elimination tree.

Each node of the elimination tree represents the factorization of a dense frontal matrix. A sparse Cholesky factorization algorithm needs to factorize all frontal matrices in the elimination tree, and data dependencies need to be met so that no parent node is processed before any of its children. In Fig. 1, there exist nodes that have no dependency between each other, e.g. nodes 0, 2, 12, 16, 18. This is a common case in sparse matrices, and indicates possible parallelism in factorization algorithms. This "tree parallelism" depends solely on the structure of the elimination tree, and could be utilized by using an elimination-tree-based scheduler. The scheduler is expected to find the frontal matrices that are the next to be factorized, and to call functions for their factorization. The factorization of the entire sparse matrix is done by repeatedly factorizing frontal matrices from leaf to root.

Figure 1: Top four levels of boneS10's elimination tree

A frontal matrix is an $m \times k$ sub-matrix of $A$ ($m \geq k$). It is composed of a square sub-matrix on the top and a rectangular sub-matrix on the bottom. First the contribution blocks from its descendants are computed during the apply operation and then assembled into the rectangular sub-matrix. Let the parent front be $A_p$ and the factorized child front be $L_c = \begin{pmatrix} L_{c1} \\ L_{c2} \end{pmatrix}$, where $L_{c1}$ is square while $L_{c2}$ is rectangular; then the contribution block is $C_{p,c} = \begin{pmatrix} L_{c1} \\ L_{c2} \end{pmatrix} L_{c1}^T$. The assembly operation is done by subtracting contribution blocks from $A_p$. After the apply and assembly, a dense Cholesky factorization is performed on the square sub-matrix. Then the rectangular sub-matrix is processed with a triangular solve.
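The per-front operations just described can be sketched with CPU BLAS/LAPACK calls as follows. This is an illustrative sketch rather than the CHOLMOD code: the routine names are ours, the sparse scatter of the contribution block into the parent's row/column pattern is omitted, and the use of OpenBLAS/LAPACKE (the paper's CPU path mentions OpenBLAS) is an assumption.

```c
#include <cblas.h>
#include <lapacke.h>

/* Compute the contribution block C (mc x kc, column-major, ldc = mc) of a
 * factorized child front Lc = [Lc1; Lc2] (mc x kc, Lc1 is the kc x kc top
 * block):  C1 = Lc1 * Lc1^T (syrk, lower triangle), C2 = Lc2 * Lc1^T (gemm). */
static void contribution_block(const double *Lc, int mc, int kc, double *C)
{
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                kc, kc, 1.0, Lc, mc, 0.0, C, mc);
    if (mc > kc)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    mc - kc, kc, kc, 1.0, Lc + kc, mc, Lc, mc, 0.0, C + kc, mc);
}

/* Factorize a frontal matrix F (m x k, column-major, ld = m) after assembly:
 * dense Cholesky of the k x k square part, then a triangular solve of the
 * (m-k) x k rectangular part against its transpose.                        */
static int factorize_front(double *F, int m, int k)
{
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', k, F, m);
    if (info != 0) return info;              /* not positive definite */
    if (m > k)
        cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                    CblasNonUnit, m - k, k, 1.0, F, m, F + k, m);
    return 0;
}
```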

3 Single-thread sparse Cholesky factorization

The workflow of a sparse Cholesky factorization is structured as an elimination tree, where each tree node stands for a dense Cholesky factorization, and each edge stands for the parent's data dependency on its children. Contribution blocks from descendants are required in each frontal matrix's assembly phase. Factorizations with no data dependency may be processed independently. A single-threaded algorithm such as CHOLMOD [2] orders the nodes in ascending order with regard to the column index, and factorizes them in that order. Since no lower-indexed column depends on a higher-indexed column, this workflow ensures that no dependency is violated. The factorization workflow is straightforward, and no scheduler is




needed in the single-threaded case. The update, assembly, and triangular solve workflow is integrated into the factorization workflow.

The left-looking Cholesky requires that all descendants be assembled before a frontal matrix is factorized. Each node keeps a list of descendant nodes that are needed in its assembly phase. The assembly list of each node is initialized to a null list. After choosing a frontal matrix for factorization, the thread creates an empty contribution block for the node. The algorithm then goes through the assembly list, updating the contribution block with each list member, removing it from the current list, and putting it in the assembly list of the current node's parent. After this step, the contribution block is subtracted from the current frontal matrix, followed by a dense Cholesky factorization on the leading triangular matrix. Finally, the output of the dense Cholesky is used for a triangular solve on the lower rectangular matrix. After the triangular solve, the current node is added to the assembly list of its parent.

CHOLMOD uses both the GPU and the CPU for the factorization and other computations. Because data transfers between the main memory and the GPU memory are expensive, the GPU is used only for sufficiently large computations. Thus, the processing of each node is composed of four steps: update, assembly, factorize, and triangular solve. The list of descendants to assemble is first sorted by their sizes, in descending order. It is assumed that the GPU will process the larger ones, while the CPU will process the smaller ones. Whenever the GPU is not busy, a descendant is picked from the head of the list and copied to the GPU memory.

The update operation is essentially a matrix multiplication of the form $C = \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} L_1^T$. $C_1 = L_1 L_1^T$ is computed with the cuBLAS syrk routine. $C_2 = L_2 L_1^T$ is computed with the cuBLAS gemm routine. A zero contribution block is initialized in the GPU memory before all update operations corresponding to the current node are performed. Each time a $C$ for a descendant is computed, it is added to the contribution block. A zero contribution block is also initialized in the main memory before all updates. If the GPU is not available, or the descendant matrix is too small, then a descendant is picked from the tail of the list. The update operation is completed with BLAS syrk and gemm routines, and the result is added to the contribution block.

The assembly operation is a matrix subtraction $A = A - C$, where $A$ is the frontal matrix to be factorized and $C$ is the contribution block from its descendants. Since the factorization operation would be done by the GPU only for large frontal matrices, the choice of where the assembly takes place (either in the GPU memory or in the main memory) also depends on the size of the frontal matrix. For frontal matrices above the user-provided threshold, the contribution block in the main memory is copied into the GPU memory and summed with the in-GPU-memory contribution block. This sum is then subtracted from the frontal matrix. For frontal matrices below the threshold, the contribution block in the GPU memory is copied into the main memory, and the assembly operation takes place in the main memory instead.

After the assembly, a dense Cholesky factorization is performed on the top triangular part of the frontal matrix. The triangular matrix is divided into column panels. Each panel consists of at most 384 columns and is factorized using a left-looking approach.
Before a column panel ($m \times k$) is factorized, it is first updated with previous columns using cuBLAS syrk and gemm routines. The $k \times k$ leading matrix is then factorized with the LAPACK potrf routine. The remaining $(m - k) \times k$ rectangular matrix is triangular-solved using the cuBLAS trsm routine, with the output of the $k \times k$ factorization as the divisor.
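A hedged sketch of this panel step using cuBLAS is shown below. The function name, workspace handling, and pointer conventions are ours; also, while the text uses the LAPACK potrf routine for the $k \times k$ leading block, the sketch substitutes cuSOLVER's dense potrf so the whole panel stays on the device.

```c
/* Illustrative panel step (not the CHOLMOD code).  P is the m x k panel on
 * the device, column-major with leading dimension ld; Lprev points to the
 * rows of the j previously factorized columns aligned with the panel
 * (same leading dimension ld).                                           */
#include <cublas_v2.h>
#include <cusolverDn.h>

void factorize_panel(cublasHandle_t cb, cusolverDnHandle_t cs,
                     double *P, int m, int k, int ld,
                     const double *Lprev, int j,
                     double *work, int lwork, int *devInfo)
{
    const double one = 1.0, minus_one = -1.0;

    if (j > 0) {
        /* Update with previously factorized columns:
         * P(0:k,:) -= Lprev(0:k,0:j) * Lprev(0:k,0:j)^T   (syrk)
         * P(k:m,:) -= Lprev(k:m,0:j) * Lprev(0:k,0:j)^T   (gemm)        */
        cublasDsyrk(cb, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    k, j, &minus_one, Lprev, ld, &one, P, ld);
        if (m > k)
            cublasDgemm(cb, CUBLAS_OP_N, CUBLAS_OP_T, m - k, k, j,
                        &minus_one, Lprev + k, ld, Lprev, ld, &one, P + k, ld);
    }

    /* Dense Cholesky of the k x k leading block (stand-in for LAPACK potrf). */
    cusolverDnDpotrf(cs, CUBLAS_FILL_MODE_LOWER, k, P, ld,
                     work, lwork, devInfo);

    /* Triangular solve of the (m-k) x k rectangular part: P(k:m,:) * L^{-T}. */
    if (m > k)
        cublasDtrsm(cb, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m - k, k,
                    &one, P, ld, P + k, ld);
}
```

In practice the workspace size would be obtained with cusolverDnDpotrf_bufferSize, and both handles would be bound to the calling thread's CUDA stream (cublasSetStream, cusolverDnSetStream), which is what allows panels of different frontal matrices to proceed concurrently in Section 4.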


The objective of the triangular solve is to find a matrix $\bar{R}$ such that $\bar{R} L^T = R$, where $R$ is the rectangular part of the frontal matrix and $L$ is the output of the dense Cholesky factorization. The triangular solve is completed using the cuBLAS trsm routine. Then the entire frontal matrix is copied back to the main memory.
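For completeness, a minimal sketch of this last step (again with our own names, and assuming the host buffer is pinned so the asynchronous copy can overlap work queued on other streams):

```c
/* Illustrative final triangular solve and copy-back.  dFront is the m x k
 * front in device memory (leading dimension m): the k x k factor L on top,
 * the (m-k) x k rectangular part R below.  hFront is a pinned host buffer. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void finish_front(cublasHandle_t cb, cudaStream_t stream,
                  double *dFront, double *hFront, int m, int k)
{
    const double one = 1.0;

    /* Solve Rbar * L^T = R in place: R <- R * L^{-T}. */
    cublasDtrsm(cb, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m - k, k,
                &one, dFront, m, dFront + k, m);

    /* Copy the entire frontal matrix back to the main memory on the
     * thread's stream.                                                  */
    cudaMemcpyAsync(hFront, dFront, sizeof(double) * (size_t)m * k,
                    cudaMemcpyDeviceToHost, stream);
}
```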

4 Multi-threaded algorithm

The multifrontal method was developed by Duff and Reid [10]. It leverages the nature of sparse matrices to factorize the frontal matrices in parallel. The structure of the elimination tree holds information about data dependencies between frontal matrices. The underlying tree structure [9] can help us determine which frontal matrices may be factorized concurrently. In our algorithm, we use multithreading (using OpenMP) to leverage the task parallelism in the underlying tree structure. To keep all threads busy, as long as enough frontal matrices are available, they are run asynchronously and independently. Upon finishing factorizing a frontal matrix, each thread finds its next target in the elimination tree.

We use tree-based scheduling to maximize the number of frontal matrices being factorized in parallel. The frontal matrices that are ready to be factorized are either represented by leaf nodes or have all their descendants assembled into them. At any given time, all "ready" frontal matrices are independent of each other. These frontal matrices may be factorized in parallel, but not necessarily synchronously. At the beginning of the execution, the set of "ready" frontal matrices is exactly the set of frontal matrices corresponding to the elimination tree's leaf nodes. So the scheduler starts with a complete list of the elimination tree's leaf nodes. A number of threads are initiated, and each thread processes a leaf node. Upon successful factorization of a frontal matrix, the thread checks the status of the parent of its current node. If the parent is ready, the thread proceeds to the parent, assembles all descendants, and factorizes it. If not, it implies that a child of the parent has not finished factorizing yet, so the thread searches for the next leaf in the leaf list. When the end of the leaf list has been reached, the thread terminates. Our algorithm ensures that 1) all frontal matrices represented by leaf nodes are factorized, and 2) among a node's children, only the child that finishes factorizing last proceeds to its parent. Therefore each frontal matrix is factorized exactly once.

Our algorithm is essentially a multi-threaded extension of CHOLMOD. Each thread is expected to own its workspace, run a scheduler, and independently do its computations, except in critical sections. The critical section is limited to updating the assembly list. This is because there could be a data conflict when two nodes with a common ancestor are being put into the ancestor's assembly list. Since CHOLMOD uses both the CPU and the GPU in factorization, each thread should have access to both the CPU and the GPU. The multithreading of the CPU portion of the code is relatively straightforward, as OpenMP handles it naturally. In order to use multiple threads on a single GPU, we need to virtualize it so that each thread only uses a portion of the GPU. To run k threads on the GPU, we split the memory of the GPU into k regions of the same size. Each thread is assigned to a region, and maintains its own set of data structures, buffers and workspaces. Each thread also creates its own set of CUDA streams. Each region essentially provides to its thread an interface identical to using the entire GPU. Fig. 2 shows how multi-streaming is implemented on a GPU. The main program launches CUDA kernels, and copies data to and from the GPU memory.
After splitting the GPU memory into k regions and creating the streams, memory copies and computational kernels in different threads are independently initiated using different CUDA streams.
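A minimal sketch of this setup is given below; the structure and names are ours and the single cudaMalloc pool is an assumption, but it shows the idea of one region, one stream, and one cuBLAS handle per thread.

```c
/* Illustrative "virtual GPU" setup, not the paper's code: the device memory
 * pool is split into k equal regions, and each OpenMP thread gets its own
 * region, CUDA stream, and cuBLAS handle.                                 */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

typedef struct {
    double        *region;   /* this thread's slice of device memory   */
    size_t         bytes;    /* size of the slice                      */
    cudaStream_t   stream;   /* all copies/kernels of this thread      */
    cublasHandle_t cublas;   /* BLAS calls bound to the stream         */
} VirtualGpu;

void init_virtual_gpus(VirtualGpu *v, int k, size_t pool_bytes)
{
    double *pool;
    cudaMalloc((void **)&pool, pool_bytes);      /* one big device pool */
    size_t slice = pool_bytes / k;

    for (int t = 0; t < k; t++) {
        v[t].region = (double *)((char *)pool + (size_t)t * slice);
        v[t].bytes  = slice;
        cudaStreamCreate(&v[t].stream);
        cublasCreate(&v[t].cublas);
        cublasSetStream(v[t].cublas, v[t].stream); /* per-thread async BLAS */
    }
}
```

Each thread t then passes v[t].cublas, v[t].stream, and pointers inside v[t].region to the routines of Section 3, so its memory copies and kernels are ordered only with respect to its own stream.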




Table 1: Test matrices used in our performance measurements

problem      type                           rows      columns   nonzeros
Emilia_923   structural                     923136    923136    40373538
Fault_639    structural                     638802    638802    27245944
Flan_1565    structural                     1564794   1564794   114165372
Geo_1438     structural                     1437960   1437960   60236322
Hook_1498    structural                     1498023   1498023   59374451
StocF-1465   computational fluid dynamics   1465137   1465137   21005389
audikw_1     structural                     943695    943695    77651847
bone010      model reduction                986703    986703    47851783
boneS10      model reduction                914898    914898    40878708
Though the real hardware may limit the threads' average bandwidth and computing power, from the software's point of view, different threads act as if they were executing on different portions of the GPU.

Our algorithm creates k threads when there are k split regions on the GPU. k is limited by the total number of available CPU cores. Since the GPU significantly outperforms the CPU when processing large matrices, creating more threads than k would result in using CPU cores for factorization, generally deteriorating the overall performance. A counter is used to indicate the number of available CPU threads. The counter is initialized to k. When a thread finds its virtual GPU busy, it checks the counter. If the counter has a positive value, it is decremented, and the thread continues using the CPU for the computation. Otherwise the thread waits until the virtual GPU is free again. When finishing using the CPU, the thread increments the counter.

Figure 2: Using multiple threads and streams on a single GPU. The upper figure shows the GPU used with a single stream. The lower figure shows the GPU interface with multiple streams.

Our algorithm can be easily extended to utilize multiple GPUs. This can be done by adding virtual GPUs on the additional GPUs. Experimental results presented in this paper are, however, limited to using only a single GPU.
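The tree-based scheduler described in this section can be sketched with OpenMP as follows. The node layout, the atomic counters, and factorize_node() are our assumptions (the CPU-fallback counter is omitted for brevity); the control flow mirrors the description above: start from a leaf, and climb to the parent only when the current node is the parent's last child to finish.

```c
/* Illustrative sketch of the elimination-tree scheduler, not the paper's
 * code.  Each thread grabs unprocessed leaves and climbs toward the root
 * as long as it is the last child of the parent to finish.               */
#include <omp.h>

typedef struct {
    int parent;        /* index of the parent node, -1 for the root       */
    int nchildren;     /* number of children in the elimination tree      */
    int done_children; /* children factorized so far (updated atomically) */
} EtreeNode;

extern void factorize_node(int node, int thread_id);  /* Sections 3-4 */

void factorize_tree(EtreeNode *t, const int *leaves, int nleaves, int k)
{
    int next_leaf = 0;                          /* shared leaf-list cursor */
    #pragma omp parallel num_threads(k)
    {
        int tid = omp_get_thread_num();
        for (;;) {
            int leaf;
            #pragma omp atomic capture
            leaf = next_leaf++;                 /* grab the next unprocessed leaf */
            if (leaf >= nleaves) break;         /* leaf list exhausted: terminate */

            int node = leaves[leaf];
            while (node != -1) {
                factorize_node(node, tid);      /* assemble + factorize (Sec. 3) */
                int parent = t[node].parent;
                if (parent == -1) break;        /* root factorized               */
                int done;
                #pragma omp atomic capture
                done = ++t[parent].done_children;
                if (done < t[parent].nchildren)
                    break;                      /* parent not ready: next leaf   */
                node = parent;                  /* last child: proceed to parent */
            }
        }
    }
}
```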

5 Experimental Results

We have conducted extensive experiments to evaluate the performance, power, and energy requirements of our algorithm. Given space limitations, we briefly present representative results. The experiments were carried out on a platform consisting of a dual-socket Intel(R) Xeon(R) CPU E5-2695 v2 at 2.4 GHz and an NVIDIA Tesla K40m GPU with 2880 CUDA cores and 12 GB of physical memory. Table 1 provides summary information about all the matrices used in our performance measurements. These matrices were chosen from the SuiteSparse Matrix Collection


(formerly called the UF Matrix Collection [7]).

Figure 3: Factorization time for different Sparse Cholesky Factorization techniques


Figure 4: Average power consumption for different Sparse Cholesky Factorization techniques

As indicated in our algorithm description, we have the ability to vary the number of streams. Although the performance should generally increase as we increase the number of streams, there is a threshold after which the overhead of a new stream outweighs the benefits (both in terms of time and power requirements). In our experiments, we measured performance, power and energy for our test matrices using different algorithms and configurations. We compare these objectives using the multi-CPU version of CHOLMOD [2], the GPU version of CHOLMOD by Rennich et al. [13], and our technique using 2, 4, 8, and 12 streams.



Figure 5: Energy consumption for different Sparse Cholesky Factorization techniques


Figure 6: Relationship between the speedup achieved by using eight streams over a single stream and the operational intensity (Flop count/Byte)

Figures 3, 4, and 5 show respectively the performance, power and energy consumed by each of the described techniques to factorize our testing matrices. Figure 3 shows that the performance of our proposed technique is the best for all the tested matrices. Clearly, the amount of improvement is dependent on the matrix. Figure 6 shows the relation between the operational intensity of each matrix and the performance improvement achieved by factorizing that matrix using 8 GPU streams as compared to a single GPU stream.


Figure 7: Cholesky factorization time for sparse matrices versus the size of the matrix

These results show that as the operational intensity of a matrix decreases, the speedup achieved by the multithreaded algorithm increases. When the operational intensity is low, the computation required for the factorization of the dense matrices (especially the ones at lower levels) is small compared to the data transferred, and factorizing multiple such matrices simultaneously should increase GPU utilization. For high operational intensity, the GPU is generally well utilized using a single stream, and performing multiple factorizations simultaneously using multiple streams is not as useful.

The power results in Figure 4 show that the power is nearly constant, although it increases slightly as the number of streams is increased. Since energy is the product of time and power, this should result in a local minimum in terms of the total number of streams used. Our results show that using only four or eight streams instead of twelve has lower energy consumption for all tested matrices (see Figure 5).

In our experiments, we also find that the total factorization time has a linear relationship with the size of the matrix (see Figure 7). This shows that the total time requirements are proportional to the amount of data that is transferred between the main memory and the GPU memory, and that further optimizations on using the GPU more efficiently should only have a limited impact, unless the communication bandwidth is significantly improved.

6 Conclusions

In this paper we present an improved algorithm for sparse Cholesky factorization. Our multithreaded algorithm works by using multiple streams on the GPU. It uses the workflow derived from the elimination tree to schedule the tasks corresponding to the factorization of sub-matrices. Experimental results clearly demonstrate the performance and energy improvements over existing implementations. Our results show that up to a factor of 3.5 reduction in time requirements can be achieved for some matrices.

The efficiency of data copies from host to device memory and vice versa is a key factor in the algorithm's performance. The bandwidth of the buses is bounded by hardware limits, but work can be done to increase the bandwidth utilization. It is also possible to further improve the performance by refining the scheduling policy and applying pipelining schemes, and this is part of our current investigation. Additionally, we are developing extensions to our work to support efficient parallelization over multiple GPUs.




Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CNS 1514116. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank Tim Davis and Steven Rennich for several insightful discussions on the parallelization of sparse matrix factorization and for providing us access to the CHOLMOD code, and Tim Davis for providing us access to the Backslash computing platform at Texas A&M University.

References

[1] Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. Algorithm 837: AMD, an approximate minimum degree ordering algorithm. ACM Transactions on Mathematical Software (TOMS), 30(3):381–388, 2004.
[2] Yanqing Chen, Timothy A. Davis, William W. Hager, and Sivasankaran Rajamanickam. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS), 35(3), 2008.
[3] George Collins, II. Fundamental Numerical Methods and Data Analysis. 1990.
[4] Tim Davis, W. W. Hager, and I. S. Duff. SuiteSparse. http://faculty.cse.tamu.edu/davis/suitesparse.html, 2014.
[5] Timothy A. Davis. Direct Methods for Sparse Linear Systems, volume 2 of Fundamentals of Algorithms. Society for Industrial and Applied Mathematics, 2006.
[6] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G. Ng. Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm. ACM Transactions on Mathematical Software (TOMS), 30(3):377–380, 2004.
[7] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
[8] Timothy A. Davis, Sivasankaran Rajamanickam, and Wissam M. Sid-Lakhdar. A survey of direct methods for sparse linear systems. Acta Numerica, 25:383–566, 2016.
[9] Iain S. Duff. Parallel implementation of multifrontal schemes. Parallel Computing, 3(3):193–204, 1986.
[10] Iain S. Duff and John K. Reid. The multifrontal solution of indefinite sparse symmetric linear systems. ACM Transactions on Mathematical Software (TOMS), 9(3):302–325, 1983.
[11] Joseph W. H. Liu, Esmond G. Ng, and Barry W. Peyton. On finding supernodes for sparse matrix computations. SIAM Journal on Matrix Analysis and Applications, 14(1):242–252, 1993.
[12] Esmond Ng and Barry W. Peyton. A supernodal Cholesky factorization algorithm for shared-memory multiprocessors. SIAM Journal on Scientific Computing, 14(4):761–769, 1993.
[13] Steven C. Rennich, Darko Stosic, and Timothy A. Davis. Accelerating sparse Cholesky factorization on GPUs. Parallel Computing, 59:140–150, 2016.
[14] D. J. Rose, G. G. Whitten, A. H. Sherman, and R. E. Tarjan. Algorithms and software for in-core factorization of sparse symmetric positive definite matrices. Computers & Structures, 11(6):597–608, 1980.
[15] Sencer Nuri Yeralan, Timothy A. Davis, Sanjay Ranka, and Wissam M. Sid-Lakhdar. Sparse QR factorization on GPU architectures. Submitted for publication, ACM Transactions on Mathematical Software (TOMS), 2015.
