J. Parallel Distrib. Comput. 70 (2010) 779–782

Research note

Parallel implementation of an efficient preconditioned linear solver for grid-based applications in chemical physics. III: Improved parallel scalability for sparse matrix–vector products

Wenwu Chen, Bill Poirier*

Department of Chemistry and Biochemistry, Texas Tech University, Box 41061, Lubbock, TX 79409-1061, USA
Department of Physics, Texas Tech University, Box 41061, Lubbock, TX 79409-1061, USA

* Corresponding author at: Department of Physics, Texas Tech University, Box 41061, Lubbock, TX 79409-1061, USA. E-mail address: [email protected] (B. Poirier).

doi:10.1016/j.jpdc.2010.03.008

Article info

Article history:
Received 19 February 2007
Received in revised form 5 March 2010
Accepted 15 March 2010
Available online 1 April 2010

Keywords: Sparse matrix; Preconditioning; Block Jacobi; Matrix–vector product; Chemical physics; Parallel computing; Eigensolver; Linear solver

Abstract

The linear solve problems arising in chemical physics and many other fields involve large sparse matrices with a certain block structure, for which special block Jacobi preconditioners are found to be very efficient. In two previous papers [W. Chen, B. Poirier, Parallel implementation of efficient preconditioned linear solver for grid-based applications in chemical physics. I. Block Jacobi diagonalization, J. Comput. Phys. 219 (1) (2006) 185–197; W. Chen, B. Poirier, Parallel implementation of efficient preconditioned linear solver for grid-based applications in chemical physics. II. QMR linear solver, J. Comput. Phys. 219 (1) (2006) 198–209], a parallel implementation was presented. Excellent parallel scalability was observed for preconditioner construction, but not for the matrix–vector product itself. In this paper, we introduce a new algorithm with (1) greatly improved parallel scalability and (2) generalization to arbitrary numbers of nodes and data sizes.

1. Introduction

This paper explores the parallel scalability of sparse iterative linear algebra calculations for block-structured matrices, such as arise in chemical and molecular physics, nuclear reaction physics, and other scientific and engineering fields. The authors' specific focus is the quantum dynamics of molecular systems, which requires the solution of the non-Coulombic Schrödinger partial differential equation (PDE), one of the most challenging computational problems of current interest, motivating the development of scalable algorithms for massively parallel computers. The methods developed here may be applied to a very broad range of applications, including any multidimensional PDE simulation on a structured grid. In all such cases, the matrices are characterized by off-diagonal blocks that are diagonal, and diagonal blocks that are either dense, or themselves (recursively) block-structured as described above. We refer to such matrices as "A matrices".

Before proceeding, we discuss how the discretized Schrödinger and related PDEs, giving rise to the generic A matrices described


above, differ from the most typical elliptic PDE applications, e.g. discretized Poisson/Helmholtz equations. Both involve the Laplacian,

$$\nabla^2 = \left(\frac{\partial}{\partial x_1}\right)^2 + \cdots + \left(\frac{\partial}{\partial x_d}\right)^2, \qquad (1)$$

where x_k is the kth dimension in a physical configuration space of d total dimensions. However, the Schrödinger PDE has a "potential" contribution, V(x_1, ..., x_d), which renders it more difficult than the Poisson/Helmholtz PDEs, though for their part, the latter typically involve highly non-trivial boundary conditions, unlike the Schrödinger PDE. As a consequence, discretizations of the Schrödinger PDE generally employ structured grids, rather than unstructured grids, leading to the aforementioned block structure. The most important difference, however, is simply this: in almost all engineering applications, the configuration space encompasses no more than d = 3 dimensions, whereas in quantum scattering applications, d = 3 × the number of particles. The chief difficulty of the latter stems from the fact that the size N of the matrices scales exponentially with d, i.e., N = n_1 × n_2 × ··· × n_d, where n_k is the number of grid points for the kth dimension. To minimize the impact of exponential scaling, standard discretization schemes such as nearest-neighbor finite difference (FD) are almost never


used in the Schrödinger context. Instead, more general and efficient pseudospectral schemes are used [2,5], for which each n_k is typically reduced by an order of magnitude. For large d, the corresponding reduction in N is astronomical, but at the expense of less matrix sparsity than for FD. Specifically, for fixed values of all x_j with j ≠ k, the kth term on the right-hand side of Eq. (1) couples together all grid points along x_k, rather than just nearest neighbors, thus leading to the A matrix form.

Many of the numerical algorithms and toolkits developed in the applied linear algebra field, though in principle applicable to general problems, have in practice been largely driven by d ≤ 3 Poisson/Helmholtz/fluid dynamics applications, and are not so effective for Schrödinger and other A matrices. The same is true of parallel linear algebra developments, which is unfortunate, given the great demand for massive parallelization which Schrödinger PDE applications present. This state of affairs has motivated the present work.

The linear solve problem of interest is w = (H − λI)⁻¹v, where H and I are N × N matrices, x, w, and v are vectors, and λ is a scalar. For A matrix applications, H is dense for d = 1, but becomes increasingly sparse with increasing d. We thus use iterative sparse methods, operating via a sequence of matrix–vector products. Extreme ill-conditioning (introduced by the potential) requires that preconditioning be used, in order to keep the total number of iterations, M, reasonably small. Standard preconditioners are found lacking in the A-matrix context, prompting the development of the optimal separable basis (OSB) approach [17,12,13], based on block Jacobi diagonalization, which has proven very effective [14–16]. A schematic sketch of such a preconditioned iterative solve is given at the end of this section.

The technical challenges for massive parallelization are profound, primarily due to remote coupling across the entire space (in a given dimension). The two fundamental operations are (1) OSB preconditioner construction using block Jacobi diagonalization and (2) sparse matrix–vector products. These were considered in two previous articles, respectively referred to as Paper I [3] and Paper II [4]. (1) was found to parallelize well, with an isoefficiency study suggesting effective application across tens of thousands of nodes. (2) did not exhibit good parallel scalability in the most challenging (i.e., most sparse) cases considered.

In this paper, the causes of the poor parallel scaling of (2) are identified and repaired, by introducing direct non-blocking point-to-point communication between appropriate node pairs, rather than the higher-level MPI (MPICH) "gather" and "scatter" commands. The new scheme requires substantial additional computational overhead, vis-à-vis keeping explicit track of which vector portions get mapped between which node pairs. However, the (memory bandwidth limited) communication time is dramatically reduced, so that (2) now scales as effectively as (1).

A second key issue addressed in this paper is load balancing. In Paper I and Paper II, the number of nodes, p, was restricted to be divisible by the number of blocks below a certain recursive level, k ≤ s. In addition, there were severe limitations on individual n_k values, thus leading to perfect load balancing by design. Such restrictions on p and n_k are artificial and unreasonable when imposed on real applications, however, for which these values ought to be allowed to be completely arbitrary.
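As a schematic illustration only, the preconditioned solve just described can be prototyped with an off-the-shelf QMR routine. The sketch below uses Python/SciPy rather than the paper's Fortran 90/MPI code; the helpers apply_H and apply_Pinv are hypothetical stand-ins for the sparse matrix–vector product and the OSB preconditioner of Section 2, and explicit left preconditioning is used to keep the call simple.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, qmr

def preconditioned_solve(apply_H, apply_Pinv, v, lam):
    """Solve (H - lam*I) w = v via left-preconditioned QMR (illustrative sketch).

    apply_H(x)    -> H @ x        (sparse matrix-vector product)
    apply_Pinv(x) -> P^{-1} @ x   (OSB preconditioner, Section 2)
    Both H and P^{-1} are assumed symmetric, as in the Schrodinger application.
    """
    N = v.shape[0]

    def matvec(x):                 # P^{-1} (H - lam*I) x
        return apply_Pinv(apply_H(x) - lam * x)

    def rmatvec(x):                # transpose product; equals (H - lam*I) P^{-1} x by symmetry
        y = apply_Pinv(x)
        return apply_H(y) - lam * y

    A = LinearOperator((N, N), matvec=matvec, rmatvec=rmatvec)
    w, info = qmr(A, apply_Pinv(v))        # QMR requires both A*x and A^T*x products
    if info != 0:
        raise RuntimeError(f"QMR did not converge (info = {info})")
    return w
```

This is a sketch under the stated assumptions, not the paper's solver; in the actual implementation both operations are custom parallel kernels, which are the subject of Sections 2 and 3.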
2. Method overview

The quasi-minimal residual (QMR) linear solve algorithm [4,6] requires that two different matrix–vector products be performed, i.e., (H − λI)x and P⁻¹x, with

$$H_{i_1 j_1 \ldots i_d j_d} = h^1_{i_1 j_1}\,\delta_{i_2 j_2}\cdots\delta_{i_d j_d} + \cdots + \delta_{i_1 j_1}\cdots\delta_{i_{d-1} j_{d-1}}\,h^d_{i_d j_d} + V\!\left(x^1_{i_1},\ldots,x^d_{i_d}\right)\delta_{i_1 j_1}\cdots\delta_{i_d j_d}, \qquad (2)$$
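As a concrete illustration of the structure of Eq. (2), the following NumPy sketch builds the d = 2 case explicitly and applies it term by term without forming H. It is illustrative only (the paper's implementation is Fortran 90 with MPI), and the random h1, h2, V are placeholders for the actual kinetic and potential operators.

```python
import numpy as np

# d = 2 case of Eq. (2): H = I (x) h1 + h2 (x) I + diag(V), with the grid index
# i1 varying fastest.  Diagonal blocks are dense; off-diagonal blocks are diagonal.
n1, n2 = 8, 8                                   # hypothetical grid sizes
rng = np.random.default_rng(0)

def sym(a):
    return 0.5 * (a + a.T)                      # symmetrize, as for a real Hamiltonian

h1 = sym(rng.standard_normal((n1, n1)))         # dense 1-D operator for x1
h2 = sym(rng.standard_normal((n2, n2)))         # dense 1-D operator for x2
V = rng.standard_normal((n2, n1))               # potential on the 2-D grid

# Explicit matrix, composite row index I = i2*n1 + i1:
H = np.kron(np.eye(n2), h1) + np.kron(h2, np.eye(n1)) + np.diag(V.ravel())

def matvec(x):
    """Apply H to x term by term, without ever forming H."""
    X = x.reshape(n2, n1)                       # axis 1 = i1 (fast), axis 0 = i2
    return (X @ h1.T + h2 @ X + V * X).ravel()

x = rng.standard_normal(n1 * n2)
assert np.allclose(H @ x, matvec(x))
```

The h2 term contributes the diagonal off-diagonal blocks, while the h1 and potential terms fill the dense diagonal blocks; for d > 2 the same term-by-term reshaping generalizes, with one contraction per dimension.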

Fig. 1. Schematic of data distribution as employed by the parallel implementation at the highest four dimensional levels, k = d, k = d − 1, k = d − 2 and k = d − 3, with n_k = 2 and s = d − 2. Diagonal lines indicate non-zero matrix elements to be stored. The k = d and k = d − 1 cases (left side) use "cyclical" storage, whereas the others (right side) use "sequential" storage. The three different diagonal line types (dotted, solid and dashed) indicate how data at each level would be distributed across p = 3 nodes.

and P the preconditioner. The OSB P⁻¹ matrix is constructed as follows. For the simplest (d = 2) case, one applies an orthogonal transformation, V,

$$(H' - \lambda I) = V^{\mathrm T}(H - \lambda I)\,V = (H_D - \lambda I) + H_O, \qquad (3)$$

such that (a) the A matrix form of Eq. (2) is preserved and (b) the off-block-diagonal matrix elements are minimized. In Eq. (3) above, H_D and H_O comprise the (transformed) diagonal and off-diagonal blocks of H′, respectively. The resultant block Jacobi preconditioner, P′ = (H_D − λI), is thus an excellent approximation to (H′ − λI). If lexicographical ordering of the grid points is used, so that the grid point index i_1 for dimension x_1 varies more quickly than i_2, then in order to satisfy (a), V must adopt the "diagonal-block" form, V_{i_1 j_1 i_2 j_2} = δ_{i_1 j_1} v_{i_2 j_2}, where v is itself an orthogonal matrix on the (one-dimensional) x_2 grid. To satisfy (b), the matrix v is constructed via block Jacobi diagonalization so as to minimize ‖H_O‖ [17,12–14]. For d > 2, a recursive implementation may be applied, which minimizes the ‖H_O‖ contribution at one block level, k, at a time, via a succession of transformations,

$$\left(V_k\right)_{i_1 j_1 \ldots i_d j_d} = \delta_{i_1 j_1} \cdots \delta_{i_{k-1} j_{k-1}}\, \bigl(v_k^{\,i_d,\ldots,i_{k+1}}\bigr)_{i_k j_k}\, \delta_{i_d j_d} \cdots \delta_{i_{k+1} j_{k+1}}, \qquad (4)$$

from the outermost, k = d, down to k = 1. These satisfy the recursion relation,

$$V_{k-1}^{\mathrm T}\, H_{D_k}\, V_{k-1} = H_{D_{k-1}} + H_{O_{k-1}}. \qquad (5)$$

In the transformed representation, the OSB P′ = (H_{D_1} − λI), which corresponds to the following product of sparse matrices in the original grid representation:

$$P^{-1} = V_d V_{d-1} \cdots V_2 V_1 \,\bigl(H_{D_1} - \lambda I\bigr)^{-1}\, V_1^{\mathrm T} V_2^{\mathrm T} \cdots V_{d-1}^{\mathrm T} V_d^{\mathrm T}. \qquad (6)$$
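To make Eqs. (3) and (6) concrete, here is a minimal d = 2 sketch of applying the OSB P⁻¹ in NumPy. It is illustrative only: the block Jacobi construction of the orthogonal matrix v (Paper I) is not reproduced, any orthogonal v exhibits the structure, and the production code never forms V or H′ as dense matrices.

```python
import numpy as np

def osb_apply_inverse(w, v, H, lam):
    """Apply P^{-1} = V (H_D - lam*I)^{-1} V^T to w for the d = 2 case of Eq. (6).

    v is an orthogonal n2 x n2 matrix on the x2 grid; with i1 varying fastest,
    V_{i1 j1 i2 j2} = delta_{i1 j1} v_{i2 j2} corresponds to kron(v, I_{n1}).
    """
    n2 = v.shape[0]
    n1 = H.shape[0] // n2
    Vfull = np.kron(v, np.eye(n1))                        # "diagonal-block" transformation
    Hp = Vfull.T @ (H - lam * np.eye(n1 * n2)) @ Vfull    # Eq. (3): (H_D - lam*I) + H_O
    y = Vfull.T @ w
    z = np.empty_like(y)
    for b in range(n2):                                   # block solves use H_D only
        sl = slice(b * n1, (b + 1) * n1)
        z[sl] = np.linalg.solve(Hp[sl, sl], y[sl])
    return Vfull @ z                                      # back to the original grid
```

In an actual implementation, the diagonal blocks of the transformed matrix would typically be factorized (or prediagonalized) once and reused, so that each application of P⁻¹ during the QMR iterations reduces to sparse transforms plus cheap block solves.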

3. Parallel implementation

In Paper I and Paper II of this series, parallel implementations were developed for operations (1) and (2) of Section 1. The d system dimensions give rise to a d-level hierarchical block structure, such that the requisite matrices from each k level are block diagonal, with block size N_k = n_1 × n_2 × ··· × n_k; moreover, each of the blocks has a "diagonal-block" structure [Fig. 1]. This is very advantageous vis-à-vis minimizing parallel communication. If the total number of nodes is p = N/N_s = n_{s+1} n_{s+2} ··· n_d, for some integer s with 0 < s < d, then s represents the k-level value below which all processing occurs on a single node without communication. However, Eq. (6) also involves higher-level (k > s) matrix–vector products for which communication amongst groups of


N_k/N_s nodes is required to effect vector reordering. At the highest level (k = d), essentially all nodes belong to the same group, leading to the communication bottleneck, i.e., an MPICH gather call from all nodes to a single node, followed by a scatter to all appropriate nodes at the next level (see Paper II, Fig. 3). One contribution of the present paper is to replace these MPICH routines with lower-level calls that effectively provide direct "many-to-many" access amongst nodes, i.e., by employing multiple communication streams in parallel (direct non-blocking point-to-point communication between appropriate node pairs). Other authors have considered similar strategies in the context of dense [11,8,9] and unstructured mesh [1] matrices, as well as tensor-product matrices somewhat similar to the A matrices considered here [10]. The resultant communication is asynchronous and two-sided, but occurs primarily in parallel.

A second key contribution is a generalization of the previous parallel algorithm to accommodate arbitrary numbers of nodes, p, and data sizes, n_k, for which load balancing issues are now a critical consideration. The data distribution scheme used here is presented schematically in Fig. 1, for the case n_k = 2, p = 3, s = d − 2. For k ≤ s (right side of the figure), the distribution is sequential (block), meaning that the vector data is divided into exactly p contiguous partitions, each stored on a different node. The sizes of these sequential partitions are determined by the level-k block size, N_k, as the corresponding matrix data is distributed by block (i.e., each matrix block is stored on one node). The number of blocks is greater than or equal to p, but not generally divisible by p, with blocks distributed as evenly as possible across node partitions. Since the number of blocks N/N_k increases with decreasing k, the worst case with respect to load balancing is k = s, whereas the best case is k = 1.

For the higher levels, k > s, a cyclical storage scheme is used, meaning that vector data is divided into more than p partitions, with some or all nodes storing data from multiple non-contiguous partitions (e.g. the left side of Fig. 1). Our use of the term "cyclical" is related to, but broader than, the conventional meaning of "cyclic" or even "block-cyclic" distribution in the context of target processor element ownership [10]. The average partition size is on the order of N_{s−1}, as opposed to N_s for the sequential scheme; i.e., each node now stores around n_k partitions. More precisely, as there are now more nodes than blocks, each block is assigned an integral number of nodes (instead of the other way around), in as evenly divided a manner as possible. Matrix data for a given block is stored across the corresponding node group by row, as follows. The uppermost N_{k−1} rows of the block are divided as evenly as possible amongst all nodes of the group in turn (thus defining the partition sizes), and this pattern is then repeated cyclically for the remaining n_k − 1 sets of N_{k−1} rows in the block. Load balancing is therefore least effective for k = s + 1, and most effective for k = d, for which all nodes belong to a single group.

There are three fundamentally distinct types of vector reordering communication that must be performed: (1) cyclical–cyclical; (2) cyclical–sequential (between levels s and s + 1); and (3) sequential–sequential.
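The "as evenly as possible" assignments described above (blocks over nodes for k ≤ s, node groups over blocks for k > s) can be summarized by a small helper. This is a hypothetical Python illustration, not the paper's Fortran bookkeeping.

```python
def even_split(num_items, num_bins):
    """Sizes of num_bins contiguous partitions of num_items items, as even as possible."""
    base, extra = divmod(num_items, num_bins)
    return [base + 1 if b < extra else base for b in range(num_bins)]

# k <= s ("sequential"): more blocks than nodes, so blocks are dealt out to nodes:
print(even_split(11, 3))   # [4, 4, 3] blocks stored on nodes 0, 1, 2

# k > s ("cyclical"): more nodes than blocks, so node groups are dealt out to blocks:
print(even_split(3, 2))    # [2, 1] nodes assigned to blocks 0 and 1
```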
Each node stores the local portion of the matrix and vector needed to perform the local matrix–vector product operation at a given level, k. Once this operation is complete, the updated local vector must be appropriately subdivided, and the pieces distributed among the nodes that require them for the next-level matrix–vector product. Although complicated, the primary advantage of the above implementation is that many pairs of processors can transfer data simultaneously, in whatever precise manner is desired. Since the total vector is of length N, the amount of data transferred by any given node is O(N/p), whereas in the old implementation it was O(N), owing to the use of a single master node required to perform the gather and scatter operations.
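The vector reordering itself can be sketched with non-blocking point-to-point calls. The snippet below uses Python/mpi4py purely for readability (the paper's code is Fortran 90 over MPICH), and the send_plan/recv_plan maps (peer rank to local index array) are hypothetical stand-ins for the precomputed partition bookkeeping described above.

```python
from mpi4py import MPI
import numpy as np

def reorder(comm, local_vec, send_plan, recv_plan, new_len):
    """Redistribute vector pieces between node pairs with Isend/Irecv.

    send_plan[peer] / recv_plan[peer] give the local indices to send to /
    receive from each peer rank (hypothetical bookkeeping, precomputed once).
    """
    rank = comm.Get_rank()
    new_vec = np.empty(new_len, dtype=local_vec.dtype)
    reqs, send_bufs, recv_bufs = [], [], {}
    for peer, idx in recv_plan.items():
        if peer == rank:
            continue
        buf = np.empty(len(idx), dtype=local_vec.dtype)
        recv_bufs[peer] = (buf, idx)
        reqs.append(comm.Irecv(buf, source=peer, tag=0))
    for peer, idx in send_plan.items():
        if peer == rank:
            continue
        buf = np.ascontiguousarray(local_vec[idx])
        send_bufs.append(buf)                        # keep send buffers alive until Waitall
        reqs.append(comm.Isend(buf, dest=peer, tag=0))
    if rank in send_plan and rank in recv_plan:      # purely local piece: plain copy
        new_vec[recv_plan[rank]] = local_vec[send_plan[rank]]
    MPI.Request.Waitall(reqs)                        # all node pairs communicate concurrently
    for buf, idx in recv_bufs.values():
        new_vec[idx] = buf
    return new_vec
```

The non-blocking calls are what allow the many node pairs to transfer their pieces at the same time, i.e., the "multiple communication streams in parallel" behavior described above.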


Fig. 2. Speedup (a) and parallel efficiency (b) of two parallel QMR linear solver iterations, as a function of the number of nodes, p, for various data sizes N = 8^d, ranging from d = 4 to d = 8, using the new algorithm.

4. Numerical testing and results

In this section, the parallel scalability of the new recursive OSB-preconditioned QMR linear solver codes is investigated. The benchmark application is the d-dimensional Schrödinger PDE model of Paper I and Paper II (n_k = n = 8 grid points per dimension, N = n^d), defined as follows:

$$V\!\left(x^1_{i_1},\ldots,x^d_{i_d}\right) = \sum_{k=1}^{d}\left[(1/2)\,i_k - 9/4\right]^2 + 0.1 \sum_{k=1}^{d-1}\left[(1/2)\,i_k - 9/4\right]\left[(1/2)\,i_d - 9/4\right], \qquad (7)$$

$$\left[h_k\right]_{i_k, j_k} = (-1)^{\,i_k - j_k} \times \begin{cases} \pi^2/3, & \text{if } i_k = j_k; \\ 2/(i_k - j_k)^2, & \text{otherwise.} \end{cases} \qquad (8)$$
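For reference, the benchmark quantities of Eqs. (7) and (8) are simple to generate. The NumPy sketch below is illustrative only (the timed code is the Fortran 90/MPI implementation), and the function names are ours.

```python
import numpy as np

def benchmark_potential(d, n=8):
    """V of Eq. (7) on the full n**d grid; returns a d-dimensional array."""
    q = 0.5 * np.arange(1, n + 1) - 9.0 / 4.0          # (1/2) i_k - 9/4, for i_k = 1..n
    grids = np.meshgrid(*([q] * d), indexing="ij")     # grids[k] varies with index i_{k+1}
    V = sum(g ** 2 for g in grids)                     # first sum of Eq. (7)
    V += 0.1 * sum(grids[k] * grids[d - 1] for k in range(d - 1))   # coupling terms
    return V

def kinetic_1d(n=8):
    """One-dimensional matrix h_k of Eq. (8)."""
    i = np.arange(1, n + 1)
    diff = i[:, None] - i[None, :]
    mag = np.where(diff == 0, np.pi ** 2 / 3.0,
                   2.0 / np.where(diff == 0, 1, diff) ** 2)   # inner where avoids divide-by-zero
    return ((-1.0) ** diff) * mag
```

Together with the term-by-term matrix–vector product sketched after Eq. (2), these define the N = 8^d test problems whose timings are reported in Fig. 2.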

The code was written in FORTRAN 90 using MPICH, and all numerical tests were performed on the Argonne National Laboratory Jazz cluster [3,4] (2.4 GHz Xeons with 2 GB RAM per node and Myrinet 2000 connectivity). Parallel scalability studies were performed by varying both the system size, d, and the number of processors, p, simultaneously. All system dimensionalities from d = 4 to d = 8 were considered, corresponding to a wide range of N values from N = 4096 to N ≈ 16 million. The number of nodes, p, varied from 1 to 128, though unlike the previous work, it was not restricted to power-of-two values. By comparing power-of-two to non-power-of-two p values, we can assess the impact of load balancing disparities.

Parallel scalability testing was performed on a single pair of QMR iterations. Plots of the parallel speedup and efficiency are presented in Fig. 2(a) and (b), respectively, for all system sizes over the range of p values considered. In comparison with the previous


algorithm, the parallel performance improvement is phenomenal. In particular, parallel speedups in the previous study were never greater than around 5×, even when the greatest number of processors, p = 128, and the largest system size, d = 8, were employed (behavior that has been termed "unacceptable" [7]). In contrast, the new algorithm results in speedups of 75× to 125× at p = 128 for the largest three data sizes considered (d = 6, 7, 8). Reasonable efficiencies are obtained for these three data sets throughout the range of p values considered, although the variability between power-of-two and non-power-of-two values is substantial. The effect of uneven load balance is more pronounced for the larger data sizes, probably due to communication rather than computation. From a practical perspective, however, load imbalances per se do not appear to be so profound as to lead to untenable efficiencies or parallel scalability. The worst efficiencies observed for d = 6, 7, 8 are around 0.2, and in these cases they can easily be improved merely by switching to a nearby p value. Feasible extrapolation to p values in the tens of thousands is therefore anticipated.

Acknowledgments

This work was supported by the Office of Advanced Scientific Computing Research, Mathematical, Information, and Computational Sciences Division of the US Department of Energy under contract DE-FG03-02ER25534, and by the Welch Foundation (D-1523).

References

[1] F. Bertrand, R. Bramley, D.E. Bernholdt, J.A. Kohl, S. Sussman, J.W. Larson, K.B. Damevski, Data redistribution and remote method invocation for coupled components, J. Parallel Distrib. Comput. 66 (2006) 931–946.
[2] J.P. Boyd, Chebyshev and Fourier Spectral Methods, in: Lecture Notes in Engineering, Springer-Verlag, Berlin, 1989.
[3] W. Chen, B. Poirier, Parallel implementation of efficient preconditioned linear solver for grid-based applications in chemical physics. I. Block Jacobi diagonalization, J. Comput. Phys. 219 (1) (2006) 185–197.
[4] W. Chen, B. Poirier, Parallel implementation of efficient preconditioned linear solver for grid-based applications in chemical physics. II. QMR linear solver, J. Comput. Phys. 219 (1) (2006) 198–209.
[5] D.T. Colbert, W.H. Miller, A novel discrete variable representation for quantum mechanical reactive scattering via the S-matrix Kohn method, J. Chem. Phys. 96 (3) (1992) 1982–1990.
[6] R.W. Freund, N.M. Nachtigal, QMR: a quasi-minimal residual method for non-Hermitian linear systems, Numer. Math. 60 (1) (1991) 315.
[7] R. Geus, S. Röllin, Towards a fast parallel sparse symmetric matrix–vector multiplication, Parallel Comput. 27 (7) (2001) 883–896.
[8] M. Guo, I. Nakata, Y. Yamashita, Contention-free communication scheduling for array redistribution, Parallel Comput. 26 (2000) 1325–1343.
[9] M. Guo, Y. Pan, Improving communication scheduling for array redistribution, J. Parallel Distrib. Comput. 65 (2005) 553–563.
[10] S.K.S. Gupta, C.-H. Huang, P. Sadayappan, R.W. Johnson, A framework for generating distributed-memory parallel programs for block recursive algorithms, J. Parallel Distrib. Comput. 34 (1996) 137–153.

[11] S.K.S. Gupta, S.D. Kaushik, C.-H. Huang, P. Sadayappan, Compiling array expressions for efficient execution on distributed memory machines, J. Parallel Distrib. Comput. 32 (1996) 152–172.
[12] B. Poirier, Optimal separable bases and series expansions, Phys. Rev. A 56 (1) (1997) 120–130.
[13] B. Poirier, Efficient preconditioning scheme for block partitioned matrices with structured sparsity, Numer. Linear Algebra Appl. 7 (2000) 715–726.
[14] B. Poirier, Quantum reactive scattering for three-body systems via optimized preconditioning, as applied to the O + HCl reaction, J. Chem. Phys. 108 (13) (1998) 5216–5224.
[15] B. Poirier, T. Carrington Jr., Accelerating the calculation of energy levels and wave functions using an efficient preconditioner with the inexact spectral transform method, J. Chem. Phys. 114 (21) (2001) 9254–9264.
[16] B. Poirier, T. Carrington Jr., A preconditioned inexact spectral transform method for calculating resonance energies and widths, as applied to HCO, J. Chem. Phys. 116 (4) (2002) 1215–1227.
[17] B. Poirier, W.H. Miller, Optimized preconditioners for Green function evaluation in quantum reactive scattering calculations, Chem. Phys. Lett. 265 (1–2) (1997) 77–83.

Wenwu Chen received his B.S. in Chemistry from the University of Science and Technology of China in 1991, and his Ph.D. in Chemical Physics from the Institute of Chemistry, Chinese Academy of Sciences in 1998. Since receiving his Ph.D. degree, Dr. Chen has worked at the Pacific Northwest National Laboratory (PNNL), the Advanced Light Source at the Ernest Orlando Lawrence Berkeley National Laboratory (LBNL), the University of Nevada, Texas Tech University and the National Institute of Standards and Technology (NIST) Center for Neutron Research. He is an expert in molecular dynamics, quantum chemistry and computer science, developing high-performance, massively parallel scientific software for molecular reactions, quantum dynamics, and neutron scattering. He has published more than 30 peer-reviewed papers in distinguished journals, including Science, the Journal of Chemical Physics and the Journal of Physical Chemistry. Email: [email protected].

Bill Poirier received his Sc.B. in Physics from Brown University in May 1988, and his Ph.D. in Theoretical Chemical Physics from the University of California, Berkeley in 1997; from 1997 to 2000 he was at the University of Chicago, and from 2000 to 2001 at the University of Montreal, studying theoretical chemical physics. He has held the following academic appointments, all at Texas Tech University. 2001–2006: Department of Chemistry, Assistant Professor; 2002–present: Department of Physics, Joint Professor; 2006–2009: Department of Chemistry, Associate Professor; 2009–present: Department of Chemistry, Professor. He is an expert in accurate time-independent quantum dynamics methods and applications, achieving major reductions in CPU effort for real molecular systems on serial computers. Massively parallel versions of these quantum dynamical algorithms have now been incorporated into a suite of codes called ScalIT, which is also extremely efficient for real molecular applications. He has published over 50 peer-reviewed papers in distinguished journals in Chemistry, Physics, and Computer Science, and presented over 50 invited talks. He has served as a journal article referee nearly 50 times, and as a research proposal reviewer for all of the major funding agencies in his field (NSF, DoE, ACS, Research Corporation, Israeli Science Foundation). He has also organized seven scientific meetings. He is the 2008 recipient of the Texas Tech System Chancellor's Council Distinguished Research Award.