Cannon-type triangular matrix multiplication for the reduction of generalized HPD eigenproblems to standard form

Cannon-type triangular matrix multiplication for the reduction of generalized HPD eigenproblems to standard form

Parallel Computing 91 (2020) 102597 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Ca...

1MB Sizes 0 Downloads 25 Views

Parallel Computing 91 (2020) 102597

Contents lists available at ScienceDirect

Parallel Computing journal homepage: www.elsevier.com/locate/parco

Cannon-type triangular matrix multiplication for the reduction of generalized HPD eigenproblems to standard form Valeriy Manin∗, Bruno Lang University of Wuppertal, Mathematics and Natural Sciences, Wuppertal 42097, Germany

a r t i c l e

i n f o

Article history: Received 12 September 2018 Revised 15 May 2019 Accepted 14 November 2019 Available online 22 November 2019 2010 MSC: 65F30 65F15 65Y05

a b s t r a c t We first develop a new variant of Cannon’s algorithm for parallel matrix multiplication on rectangular process grids. Then we tailor it to selected situations where at least one triangular matrix is involved, namely “upper triangle of (full × upper triangular),” “lower triangle of (lower triangular × upper triangular),” and “all of (upper triangular × rectangular).” These operations arise in the transformation of X = X , and making generalized hermitian positive definite eigenproblems AX = BX  to standard form A use of the triangular structure enables savings in arithmetic operations and communication. Numerical results show that the new implementations outperform previously available routines, and they are particularly effective if a whole sequence of generalized eigenproblems with the same matrix B must be solved, but they can also be competitive for the solution of a single generalized eigenproblem. © 2019 Elsevier B.V. All rights reserved.

Keywords: Cannon’s multiplication Triangular matrices Reduction of generalized eigenproblems Parallel

1. Introduction

X = X  for X  and 4. Solve the standard hermitian eigenproblem A

In ab initio electronic structure computations [1–3] often a significant portion of the overall time is spent in so-called self consistent field (SCF) cycles, which involve solving a sequence of generalized hermitian positive definite (HPD) eigenproblems A(i ) X (i ) = BX (i ) (i ) . The n × n matrices A(i) and B are hermitian (the real symmetric case is similar), and in addition B is positive definite. (i) is a k × k diagonal matrix with the k ≤ n eigenvalues of interest on the diagonal, and X(i) is an n × k matrix of associated B-orthogonal eigenvectors. While the “overlap matrix” B remains unchanged during an SCF cycle, the “Hamiltonian” A(i) depends on the solution (X (i−1 ) , (i−1 ) ) of the previous problem. A standard approach for solving a generalized HPD eigenproblem first reduces it to an equivalent standard eigenproblem, AX = X = X , and proceeds as follows: BX  ⇒ A

 (back-transformation of the eigenvectors) 5. X = U −1 X

1. B⇒UH U (Cholesky decomposition of B; U is upper triangular, and UH denotes the conjugate transpose of U) 2. Optional: U ⇒ U −1 (triangular matrix inversion)  = U −H AU −1 3. A ⇒ A



Corresponding author. E-mail addresses: [email protected] (V. Manin), [email protected] (B. Lang). https://doi.org/10.1016/j.parco.2019.102597 0167-8191/© 2019 Elsevier B.V. All rights reserved.



Step 3 can be implemented without explicitly building the upper triangular matrix U −1 (see [4] for an overview of possible realizations), and this is done in the function TwoSidedTrsm in the ELEMENTAL library [5] and PDSYNGST in ScaLAPACK [6]. Then the back-transformation (step 5) amounts to a triangular solve. This approach is particularly attractive if only a single eigenproblem with matrices A and B must be solved. For a sequence of eigenproblems with the same matrix B, however, steps 1 and 2 must be performed only once. Then, explicitly building U −1 and using triangular multiplications in steps 3 and 5 may be superior. This approach is followed in the ELPA library [7] and in the present paper. Here we concentrate on scalable routines to realize steps 3 and 5 based on Cannon’s algorithm. We optimize Cannon’s idea for triangular matrices and extend Cannon’s algorithm to the case of rectangular process grids. We show that our approach may even be beneficial for solving a single eigenproblem because the gain obtained from the optimized steps 3 and 5 may outweigh the time spent for the explicit computation of U −1 . Note that the matrix U −1 in steps 3 and 5 is again upper triangular. To simplify the notation we therefore assume that U has been

2

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

overwritten with U −1 and simply consider multiplication with U, i.e.,  = U H AU and the reduction and back-transformation take the form A , respectively. X = UX While the back-transformation in step 5 is a straight-forward multiplication with a triangular matrix, it is well known (cf., e.g., [8]) that considerable savings are possible by looking more closely at the two-sided triangular matrix multiplication in step 3. Since A  = U H AU, and therefore it is sufficient to calis hermitian, so is A . Step 3 may be implemented as a seculate only one triangle of A quence of four substeps as follows.

Algorithm 1: Cannon’s algorithm for a square pc × pc process grid. 1

2

3 4 5 6

(i) Compute the upper triangle Mu of M = AU (“multiplication 1” in following); this takes approximately 23 n3 arithmetic operations. (ii) Transpose Mu to obtain the lower triangle Ml of MH = U H AH = U H A.  = M U = U H AU (“multiplica(iii) Compute the lower triangle of A l tion 2” in following; ≈ 13 n3 operations).  is needed for further computations (iv) If the whole matrix A then reflect its lower triangle across the diagonal. Transposition [9] is a cheap operation in comparison to the multiplications, and therefore this paper focuses on efficient parallel routines for the triangular matrix multiplications. Multiplication 1 stands for (a routine for) computing the upper triangular part of a product “hermitian full times upper triangular”, whereas multiplication 2 yields the lower triangle of a product “lower triangular times upper triangular.” Also, a function for the product “upper triangular times rectangular full” has been developed and can be used for an efficient back-transformation to restore eigenvectors of the initial problem (step 5). The remainder of this work is organized as follows. In Section 2 we first review the classical algorithm by Cannon for multiplying two full matrices that are block-distributed (or 2D block cyclically distributed) over a square process grid. Then we propose an extension to rectangular grids, where the number of process columns is a multiple of the number of rows. In Section 3 we adapt this algorithm to the situation of the above multiplication 1, “upper triangle of full × triangular,” such that communication volume and arithmetic complexity are reduced, and in Section 4 we consider multiplication 2, “lower triangle of lower triangular × upper triangular.” Combining the two multiplications in one routine allows for further optimizations, which are discussed in Section 5. In Section 6 we evaluate our routines in the overall context of a generalized eigenproblem or a sequence of these. Section 7 concludes the paper.

7

8

Circular shift for matrix A (by i positions to the left in process row i); Circular shift for matrix B (by j positions upward in process column j); Cloc := 0; for k = 0 to pc − 1 do Local multiplication: Cloc := Cloc + Aloc · Bloc ; Circular shift for A: Send current Aloc to left neighbor and receive new Aloc from right neighbor; Circular shift for B: Send current Bloc to upper neighbor and receive new Bloc from lower neighbor; end

The first two lines of the algorithm effect an initial “skewing” of the matrices, i.e., each process Pi,j in the ith process row sends its Aloc to the process i positions to its left, Pi, j−i , and receives a new Aloc from the process i positions to its right, Pi, j+i . Similarly, the jth block column of B is shifted by j positions upward in the process grid, such that Pi,j now holds Ai, j+i and B j+i, j ; see Fig. 1. Later on, in pass k of the for loop, Pi,j will hold Ai, j+i+k and B j+i+k, j , and therefore at the end of the algorithm this process will have computed the (i, j) block of the product A · B, pc −1

Cloc =

 k=0

pc −1

Ai, j+i+k B j+i+k, j =



Ai,m Bm, j = (A · B )i, j .

Note that in the above description all indices must be taken “modulo pc .” That is, for pc = 6, in the initial skewing of A, process P4,2 sends to P4,2−4 ≡ P4,4 and receives from P4,2+4 ≡ P4,0 , corresponding to wrap-around connections in the process grid. Similarly, for the indices of the block received by this process we have A4,2+4 ≡ A4,0 ; cf. Fig. 1. We will use this convention throughout the article to keep the notation concise. The moduli will be indicated in each case. In fact Algorithm 1 is applicable in a more general setting. Let A and B be partitioned into size-(nb × nb ) blocks with arbitrary block size nb ≥ 1 and distributed over a pc × pc process grid in a 2D block cyclic way, i.e., Pi,j holds exactly those blocks A,m and B,m such that  ≡ i mod pc and m ≡ j mod pc . (“2D block cyclic” is the standard distribution for matrices in ScaLAPACK and ELPA.) Then, for the example shown in Fig. 2, process P2,4 obtains P2,4+2 ≡ P2,0 ’s part of A and P2+4,4 ≡ P0,4 ’s part of B during the skewing, that is, afterward P2,4 will hold Aloc =

2. Cannon’s algorithm for full matrices In this section we briefly discuss Cannon’s algorithm for nontriangular matrices. First we review the method for square matrices on square process grids. Then we extend it to the case of rectangular process grids and compare our approach to other parallel matrix multiplication schemes.

(1)

m=0



A2,0 A2,6 A8,0 A8,6



and Bloc =



B0,4 B0,10 B6,4 B6,10



.

Note that we do not shift the local block rows/columns individually – as might be assumed from Fig. 2, – but Aloc and Bloc are shifted as a whole. This keeps the local blocks sorted by increasing row/column number. It is not hard to verify that during the course of Algorithm 1, P2,4 computes “its” portion Cloc =



C 2,4 C 2,10 C 8,4 C 8,10



of the

product, and similarly for the other processes. All process indices are modulo 6 = pc .

2.1. Cannon’s algorithm for square process grids 2.2. Cannon’s algorithm for rectangular process grids Cannon’s method [10] for a parallel matrix–matrix product A · B on a square grid of processes is summarized in Algorithm 1. As in most textbook descriptions (e.g., [11]) we first assume that, for a pc × pc process grid, the matrices are partitioned into size-(nb × nb ) blocks, where nb = n/pc . Then each process Pi,j , where 0 ≤ i, j < pc , initially holds exactly one block of the matrices, namely Aloc ≡ Ai, j ≡ A(inb + 1 : (i + 1 )nb , jnb + 1 : ( j + 1 )nb ) and Bloc ≡ Bi, j ≡ B(inb + 1 : (i + 1 )nb , jnb + 1 : ( j + 1 )nb ).

In [12], Cannon’s algorithm has been generalized to rectangular pr × pc process grids. Essentially, the process grid is emulating a virtual p × p grid, where p is the least common multiple of pr and pc . Thus, p steps are needed in the algorithm, each involving two shifts. As a consequence, process grids such as “5 × 7” cannot be a good choice for Cannon’s algorithm. However, usually there is no

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

3 4 5

5

4

3

2

1

2

3

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1 1,1 1,2 1,3 1,4 1,5 1,0

0,0 1,1 2,2 3,3 4,4 5,5

2,2 2,3 2,4 2,5 2,0 2,1

2,0 3,1 4,2 5,3 0,4 1,5

3,3 3,4 3,5 3,0 3,1 3,2

3,0 4,1 5,2 0,3 1,4 2,5

4,4 4,5 4,0 4,1 4,2 4,3

4,0 5,1 0,2 1,3 2,4 3,5

5,5 5,0 5,1 5,2 5,3 5,4

5,0 0,1 1,2 2,3 3,4 4,5

A

B

1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 3,0 3,1 3,2 3,3 3,4 3,5 4,0 4,1 4,2 4,3 4,4 4,5 5,0 5,1 5,2 5,3 5,4 5,5

0,0 0,1 0,2 0,3 0,4 0,5

1,0 2,1 3,2 4,3 5,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 3,0 3,1 3,2 3,3 3,4 3,5 4,0 4,1 4,2 4,3 4,4 4,5 5,0 5,1 5,2 5,3 5,4 5,5

Fig. 1. Distribution of the blocks of the matrices A and B after the initial skewing for the case that every process holds exactly one block of each matrix. For each block, the block number is shown with upright font, followed by the process coordinates in slanted font. The numbers next to the arrows indicate by how many positions the blocks in the respective block row (block column) have been shifted to the left (upward).

1

2

2,0 3,1 4,2 5,3 6,4 7,5 2,6 3,7 4,8 5,9 6,a 7,b

3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,a 3,b 3,0 3,1 3,2

3,0 4,1 5,2 6,3 7,4 8,5 3,6 4,7 5,8 6,9 7,a 8,b

4,4 4,5 4,6 4,7 4,8 4,9 4,a 4,b 4,0 4,1 4,2 4,3

4,0 5,1 6,2 7,3 8,4 9,5 4,6 5,7 6,8 7,9 8,a 9,b

5,5 5,6 5,7 5,8 5,9 5,a 5,b 5,0 5,1 5,2 5,3 5,4

5,0 6,1 7,2 8,3 9,4 a,5 5,6 6,7 7,8 8,9 9,a a,b

6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7 6,8 6,9 6,a 6,b

6,0 7,1 8,2 9,3 a,4 b,5 6,6 7,7 8,8 9,9 a,a b,b

1 7,1 7,2 7,3 7,4 7,5 7,6 7,7 7,8 7,9 7,a 7,b 7,0

7,0 8,1 9,2 a,3 b,4 0,5 7,6 8,7 9,8 a,9 b,a 0,b

8,2 8,3 8,4 8,5 8,6 8,7 8,8 8,9 8,a 8,b 8,0 8,1

8,0 9,1 a,2 b,3 0,4 1,5 8,6 9,7 a,8 b,9 0,a 1,b

9,3 9,4 9,5 9,6 9,7 9,8 9,9 9,a 9,b 9,0 9,1 9,2

9,0 a,1 b,2 0,3 1,4 2,5 9,6 a,7 b,8 0,9 1,a 2,b

a,4 a,5 a,6 a,7 a,8 a,9 a,a a,b a,0 a,1 a,2 a,3

a,0 b,1 0,2 1,3 2,4 3,5 a,6 b,7 0,8 1,9 2,a 3,b

b,5 b,6 b,7 b,8 b,9 b,a b,b b,0 b,1 b,2 b,3 b,4

b,0 0,1 1,2 2,3 3,4 4,5 b,6 0,7 1,8 2,9 3,a 4,b

A

B

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

5,0 5,1 5,2 5,3 5,4 5,5 5,0 5,1 5,2 5,3 5,4 5,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

2 3 4 5

3

5

2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,a 2,b 2,0 2,1

4,0 4,1 4,2 4,3 4,4 4,5 4,0 4,1 4,2 4,3 4,4 4,5

5

1

2

4

0,0 1,1 2,2 3,3 4,4 5,5 0,6 1,7 2,8 3,9 4,a 5,b

3,0 3,1 3,2 3,3 3,4 3,5 3,0 3,1 3,2 3,3 3,4 3,5

4

3

5

0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b 1,0 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

3

2

4

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 3,0 3,1 3,2 3,3 3,4 3,5 3,0 3,1 3,2 3,3 3,4 3,5 4,0 4,1 4,2 4,3 4,4 4,5 4,0 4,1 4,2 4,3 4,4 4,5 5,0 5,1 5,2 5,3 5,4 5,5 5,0 5,1 5,2 5,3 5,4 5,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1,0 2,1 3,2 4,3 5,4 6,5 1,6 2,7 3,8 4,9 5,a 6,b 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 3,0 3,1 3,2 3,3 3,4 3,5 3,0 3,1 3,2 3,3 3,4 3,5 4,0 4,1 4,2 4,3 4,4 4,5 4,0 4,1 4,2 4,3 4,4 4,5 5,0 5,1 5,2 5,3 5,4 5,5 5,0 5,1 5,2 5,3 5,4 5,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 3,0 3,1 3,2 3,3 3,4 3,5 3,0 3,1 3,2 3,3 3,4 3,5 4,0 4,1 4,2 4,3 4,4 4,5 4,0 4,1 4,2 4,3 4,4 4,5 5,0 5,1 5,2 5,3 5,4 5,5 5,0 5,1 5,2 5,3 5,4 5,5

Fig. 2. Initial skewing of matrices A and B for a 2D block cyclic distribution. The matrices A and B have 12 block rows and block columns, the process grid is of size 6-by-6 (shown by thick black lines). For each block, the block number is shown with upright font, followed by the process coordinates in slanted font. The block indices are hexadecimal, i.e., a ≡ 10 and b ≡ 11. The numbers next to the arrows indicate by how many positions the blocks in the respective block row (block column) have been shifted to the left (upward). The blocks ending up in P2,4 are shaded.

need to use such grids in practice. In the following we focus on rectangular grids where the number of process columns is a multiple of the number of rows, with an aspect ratio r = pc /pr ∈ N. In this situation the algorithm from Lee et al. [12] would take pc steps. By contrast, the method described below requires only pr steps and therefore reduces the number of communication operations by a factor of r (with the same overall communication volume), at the price of increased memory usage for keeping r copies of A. More precisely, the Aloc of r processes will have to be combined during the skewing as follows. Again we start from a 2D block cyclic distribution of the matrices. Since a block column of Bloc contains r times more blocks than a block row of Aloc (cf. Fig. 3 for pr = 3, pc = 6, i.e., r = 2, and matrices with 12 block columns and rows), the local multiplications in Algorithm 1 could not be carried out if each process Pi,j in the ith row obtained just the Aloc from the process i positions to its right, Pi, j+i , as in the skewing on a square grid. In

the rectangular case Pi,j will in addition get the A blocks of Pi, j+i+ pr , Pi, j+i+2 pr , ..., Pi, j+i+(r−1 ) pr , where all column indices must be taken modulo pc . This in turn implies that the r processes Pi,j , Pi, j+ pr , ..., Pi, j+(r−1 ) pr (with “stride” pr ) will hold the same blocks of A, and these correspond to the locally available block columns of B. Since the Aloc and Bloc are exchanged as a whole, the ascending order of the block rows in each Bloc is preserved, and therefore the block columns of the combined Aloc must also be in ascending order. This is done by interleaving the block columns of the Aloc from the processes Pi,i+ j , Pi,i+ j+ pr , ..., Pi,i+ j+(r−1 ) pr , starting with the block column that has the lowest global number. For the 3 × 6 grid shown in Fig. 3, the skewing and combination will lead to process P2,4 holding Aloc = ⎡ ⎤ ⎡ ⎤ A2,0 A2,3 A2,6 A2,9 B0,4 B0,10 A5,3 A5,6 A5,9 ⎥ B ⎢ A5,0 ⎢B ⎥ ⎣A ⎦ and Bloc = ⎣B3,4 B3,10 ⎦. P2,1 will A8,3 A8,6 A8,9 8,0 6,4 6,10 A11,0 A11,3 A11,6 A11,9 B9,4 B9,10 hold the same Aloc , but a different Bloc .

4

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

blocks to exchange and combine

1

2

1

2

1,0 2,1 3,2 1,3 2,4 3,5 1,6 2,7 3,8 1,9 2,a 3,b

2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,a 2,b 2,0 2,1

2,0 3,1 4,2 2,3 3,4 4,5 2,6 3,7 4,8 2,9 3,a 4,b

3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,a 3,b

3,0 4,1 5,2 3,3 4,4 5,5 3,6 4,7 5,8 3,9 4,a 5,b

1 4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8 4,9 4,a 4,b 4,0

4,0 5,1 6,2 4,3 5,4 6,5 4,6 5,7 6,8 4,9 5,a 6,b

5,2 5,3 5,4 5,5 5,6 5,7 5,8 5,9 5,a 5,b 5,0 5,1

5,0 6,1 7,2 5,3 6,4 7,5 5,6 6,7 7,8 5,9 6,a 7,b

6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7 6,8 6,9 6,a 6,b

6,0 7,1 8,2 6,3 7,4 8,5 6,6 7,7 8,8 6,9 7,a 8,b

1 7,1 7,2 7,3 7,4 7,5 7,6 7,7 7,8 7,9 7,a 7,b 7,0

7,0 8,1 9,2 7,3 8,4 9,5 7,6 8,7 9,8 7,9 8,a 9,b

8,2 8,3 8,4 8,5 8,6 8,7 8,8 8,9 8,a 8,b 8,0 8,1

8,0 9,1 a,2 8,3 9,4 a,5 8,6 9,7 a,8 8,9 9,a a,b

9,0 9,1 9,2 9,3 9,4 9,5 9,6 9,7 9,8 9,9 9,a 9,b

9,0 a,1 b,2 9,3 a,4 b,5 9,6 a,7 b,8 9,9 a,a b,b

1 a,1 a,2 a,3 a,4 a,5 a,6 a,7 a,8 a,9 a,a a,b a,0

a,0 b,1 0,2 a,3 b,4 0,5 a,6 b,7 0,8 a,9 b,a 0,b

b,2 b,3 b,4 b,5 b,6 b,7 b,8 b,9 b,a b,b b,0 b,1

b,0 0,1 1,2 b,3 0,4 1,5 b,6 0,7 1,8 b,9 0,a 1,b

A

B

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

2

1

1 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b 1,0

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

2

2

0,0 1,1 2,2 0,3 1,4 2,5 0,6 1,7 2,8 0,9 1,a 2,b

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2

1

0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

2

2

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

Fig. 3. 2D block cyclic distribution of the matrices A and B on a rectangular process grid after the skewing. A and B have 12 block rows and block columns each, and the process grid is of size 3 × 6 (shown by thick black lines). Block and process numbers, as well as communication distance (next to the arrows), are denoted as in Fig. 2. The blocks ending up in P2,4 are shaded.

The resulting overall procedure is given in Algorithm 2. After skewing and combining A as just described (lines 1 and 2) and a “standard” skewing of B (line 3) the computations can be done as in Algorithm 1, in pr steps. The algorithm shows in particular the use of non-blocking send operations for the shifts in order to overlap communication and computation. This requires two buffers for each local matrix: one is used for computation and sending, and in the meantime the other buffer can take up the incoming data for the next step. Note that the shifts in the last step can be dropped if we need not restore the distribution after the initial skewing; then the whole last iteration i = pr − 1 is replaced with out Cloc := Cloc + Aout loc · Bloc . In our approach each process receives every pr th block column of A, and therefore the amount of memory to store the local part of the matrix A depends only on the number of process rows. Thus, r = pc /pr should not be too large if local memory size is an issue. A similar scheme can be derived for process grids where the number of rows is a multiple of the number of columns, pr /pc ∈ N. Then B is replicated instead of A, and the Bloc are combined (and interleaved) among all “stride-pc ” processes within a column of the grid. In Section 3.4 we will comment on the respective advantages of both variants. Parallel matrix multiplication has been considered for decades, and several approaches have been proposed for 2D block cyclically distributed full matrices. They mainly differ in the way they bring together the blocks needed in (1) for the computation of a C block and in the order in which the products in (1) are computed in each process. We briefly point to some approaches that were related to the development of ScaLAPACK. PUMMA (Parallel Universal Matrix Multiplication Algorithm) [13] extends an earlier algorithm [14] to this distribution. In each step, it shifts the matrix A along rows of the process grid, and in each column of the grid one process broadcasts its local portion of B, and in our situation (pr × pc grid, pc = r · pr ) the algorithm would take pc steps. It would also be possible to shift B and broadcast A, leading to pr steps. SUMMA (Scalable Universal Matrix Multiplication Algorithm) [15] uses an outer product approach,

broadcasting block columns of A and block rows of B (along rows and columns, resp., of the process grid). These 2 × n/nb

broadcasts can be implemented in a pipelined fashion such that they do not lead to a factor log p in the overall communication time. DIMMA (Distribution-Independent Matrix Multiplication Algorithm) [16] essentially rearranges these broadcasts and aggregates the computation to optimize the local GEMM performance and the overlapping of computation with communication. Holding several copies of the matrices A and B may reduce the communication costs, as done in the so-called 3D and 2.5D algorithms [17–19]. The former arrange the p processes in a threedimensional p1/3 × p1/3 × p1/3 grid, generate p1/3 copies of A and B by broadcasting along two axes from two-dimensional “slices,” and then reduce the C blocks along the third dimension. The latter allow to adjust the memory requirements to c ∈ {1, . . . , p1/3 } copies of A and B; they essentially run c “truncated” instances of Cannon’s algorithm (not the full number of steps) on these copies and reduce the partial results at the end. Our approach does not involve collective communication and allows for overlapping of communication and computation in a natural way. 3. A Cannon-type algorithm for multiplication 1 This routine performs step (i) from Section 1, i.e., it computes the upper triangular part Mu of M := AU, where U is the inverse of the Cholesky factor of the matrix B from the initial generalized eigenproblem AX = BX . Here A is a full matrix, U is upper triangular, and only the upper triangular part of their product is to be computed. Essentially, we use Algorithm 2, with Bloc replaced by a buffer Ubuf (see discussion below) and suitable modifications to lines 3 and 10. These are explained in the following. 3.1. Initial skewing for multiplication 1 Since A is full, the initial skewing and combination for this matrix is done exactly as described in Section 2.2 and Algorithm 2.

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

Algorithm 2: Multiplication “full × full” on a pr × pc process grid with non-blocking communication; (myRow, myCol ) are the coordinates of the current process.

1 2

3 4 5

6 7

8

9

10 11 12 13 14

Algorithm 3: Initial skewing of U for multiplication 1; (myRow, myCol ) are the coordinates of the current process. This algorithm replaces line 3 in Algorithm 2.

/* Initial skewing; the communication for the shift of the Aloc and their combination (lines 1 and 2) can be done together */ Shift Aloc to the left by myRow positions Combine my Aloc with those of PmyRow, myCol+ pr mod pc , …, PmyRow, myCol+(r−1 ) pr mod pc and arrange the block columns by increasing global number; this gives Aout /* see main text loc ; for more details */ Shift Bloc up by myCol mod pr positions; this gives Bout loc Cloc := 0 for i = 0 to pr − 1 do /* Initiate shift for A */ MPI_Isend(Aout loc ) to my left neighbor, PmyRow, myCol−1 mod pc ; MPI_Irecv(Ain loc ) from my right neighbor, PmyRow, myCol+1 mod pc ; /* Initiate shift for B */ MPI_Isend(Bout loc ) to my upper neighbor, PmyRow−1 mod pr , myCol ; MPI_Irecv(Bin loc ) from my lower neighbor, PmyRow+1 mod pr , myCol ; out Cloc := Cloc + Aout loc · Bloc ; MPI_Wait() for A shift; MPI_Wait() for B shift; out in out Swap pointers Ain loc ↔ Aloc and Bloc ↔ Bloc ; end

For U, we make use of the triangular structure to reduce the communication volume. More precisely, we pack the nonzero blocks from each Uloc contiguously into a buffer Ubuf , which is then used for the skewing and the ensuing shifts. We note that, for any process Prow,col in an arbitrary pr × pc grid (not necessarily pc a multiple of pr ), the number of nonzero blocks in column jloc of Prow,col ’s local portion of the upper triangular matrices U and Mu is given by:

numBlocks =

β ( jloc , row, col ) := u

col + jloc · pc − row + 1

pr

5

, (2)

where · denotes rounding upward to the nearest integer and col + jloc · pc =: jglob is the number of that column in the overall matrix. For the situation before the skewing the reader may consider the matrix Mu in Fig. 4, as Mu is also upper triangular and non-skewed. The packing is summarized in Algorithm 3. Note that processes in the “lower part” of the grid (i.e., myRow > myCol) effectively skip the pass jloc = 0 of the loop because their first column in Uloc is empty (β u (0, ·, · ) = 0). The skewing also bears potential for asynchronous communication. One might just initiate the shift for Aloc and then overlap the actual communication with packing Uloc into Ubuf . We have not done this in our implementation because we expect only minor benefits from it. Communication time is a significant issue only for large process grids (cf. also Fig. 6), and then the initial skewing is almost negligible compared to the pr shifts between the updates.

/* Loop to form Ubuf block column by block column */ 1

2 3 4 5 6

7 8 9

myBlockColsInU =



n/nb − myCol pc



;

/* number of block

columns in my Uloc */ Ubuf = ∅; for jloc = 0 to myBlockColsInU − 1 do numBlocks = β u ( jloc , myRow, myCol ) according to (2); if (numBlocks > 0) then append block column j of Uloc to the buffer Ubuf ; /* numBlocks · n2b elements */ end end out ; Shift Ubuf up by (myCol mod pr ) positions; this gives Ubuf

ceed by block columns, and we will consider the matrices from Fig. 4 and two processes, P0,1 and P2,0 , to explain the four possible situations. Since P0,1 belongs to the “upper part” of the grid (myRow ≤ myCol), according to (2) the first column in Mloc is nonempty in this process; see Fig. 5.a). After the skewing, that is, in iteration i = 0 of Algorithm 2, P0,1 holds the Aout loc (after combination; cf. Section 2.2) and Ubuf that originally came from P0,1 (itself) and P1,1 , respectively, and since P1,1 also is in the upper part of the grid, Ubuf contains a block U1,1 from the first block column in P1,1 ’s Uloc ; see Fig. 5.a1). Then the local update is performed in two steps,

M0,1 = M0,1 + A0,1 · U1,1 for the first block column in Mloc , and



M0,7 M3,7 M6,7





=

M0,7 M3,7 M6,7





A0,1 + A3,1 A6,1

A0,4 A 3,4 A 6,4

A0,7 A3,7 A6,7

 

U1,7 · U4,7 U7,7



for the second block column. In the final iteration, i = 2, the situation is similar; see Fig. 5.a3), whereas the Ubuf for i = 1 comes from P2,1 , which is in the lower part of the grid, and thus Ubuf does not contain a block from P2,1 ’s first block column; see Fig. 5.a2). Therefore the first block column of Mloc is not updated in this case. For P2,0 the situation is slightly different. Since P2,0 is in the lower part of the grid, the first block column in its Mloc is empty and therefore is never updated; see Fig. 5.b). For i = 0 and i = 2 this matches the current Ubuf s, which also come from processes in the lower part of the grid and therefore do not contain blocks from the first block column of Uloc ; see Fig. 5.b1) and b3). By contrast, the Ubuf in iteration i = 1 comes from a process in the upper part, and therefore it contains a block U0,0 from block column 0; see Fig. 5.b2). This block is ignored in the local update, but it cannot be deleted from Ubuf because it might be needed in later local updates in other processes. Overall, there are four cases, depending on whether the current process and the origin of Ubuf are in the upper or lower part of the grid, thus determining whether the first block column in Mloc is updated and whether all block columns in Ubuf are used. Algorithm 4 summarizes the local update. 3.3. Results for multiplication 1

3.2. Local update for multiplication 1 If we want to use the triangular structure of U and Mu to save arithmetic operations then the local updates cannot be done in a single matrix product as in line 10 of Algorithm 2. Instead we pro-

In this section we present results obtained on the COBRA system at the Max Planck Computing and Data Facility (MPCDF) in Garching. Each COBRA node contains two 20-core 2.4GHz Intel Skylake processors running a SUSE Linux Enterprise Server 12 SP3

6

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

blocks to exchange and combine

1 0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b 1,0 2

2

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

9,0 9,1 9,2 9,3 9,4 9,5 9,6 9,7 9,8 9,9 9,a 9,b 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1 a,1 a,2 a,3 a,4 a,5 a,6 a,7 a,8 a,9 a,a a,b a,0 2





− 1,3 2,4 3,5 1,6 2,7 3,8 1,9 2,a 3,b

0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b





− 2,3 3,4 4,5 2,6 3,7 4,8 2,9 3,a 4,b





− 3,3 4,4 5,5 3,6 4,7 5,8 3,9 4,a 5,b











− 4,6 5,7 6,8 4,9 5,a 6,b











− 5,6 6,7 7,8 5,9 6,a 7,b











− 6,6 7,7 8,8 6,9 7,a 8,b

















− 7,9 8,a 9,b

















− 8,9 9,a a,b

















− 9,9 a,a b,b



− 0,2 −

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

− 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5



*

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

− 0,5 −

− 0,8 −

− 0,b

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

b,2 b,3 b,4 b,5 b,6 b,7 b,8 b,9 b,a b,b b,0 b,1

− 0,1 1,2 − 0,4 1,5 − 0,7 1,8 − 0,a 1,b

− 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,a 2,b

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5





− 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,a 3,b







− 4,4 4,5 4,6 4,7 4,8 4,9 4,a 4,b









− 5,5 5,6 5,7 5,8 5,9 5,a 5,b











− 6,6 6,7 6,8 6,9 6,a 6,b













− 7,7 7,8 7,9 7,a 7,b















− 8,8 8,9 8,a 8,b

















− 9,9 9,a 9,b



















− a,a a,b





















0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

5,2 5,3 5,4 5,5 5,6 5,7 5,8 5,9 5,a 5,b 5,0 5,1

8,2 8,3 8,4 8,5 8,6 8,7 8,8 8,9 8,a 8,b 8,0 8,1

2

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1 4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8 4,9 4,a 4,b 4,0

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

1 7,1 7,2 7,3 7,4 7,5 7,6 7,7 7,8 7,9 7,a 7,b 7,0

2

1

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,a 2,b 2,0 2,1

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7 6,8 6,9 6,a 6,b

1

0,0 1,1 2,2 0,3 1,4 2,5 0,6 1,7 2,8 0,9 1,a 2,b

3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,a 3,b 2

2

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

=

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

− b,b

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

A

U

Mu

Fig. 4. Initial skewing of the matrices A and U and distribution of the resulting matrix Mu (upper triangle of M) on a 3 × 6 process grid (shown by thick black lines). Block and process numbers, as well as communication distance (next to the arrows), are denoted as in Fig. 2. The blocks ending up in P2,4 are shaded. “ - ” marks a zero block, which is not touched.

Fig. 5. Local matrix data during Algorithm 2 after the initial skewing for two processes, P0,1 in the “upper part” of the grid (i.e., myRow ≤ myCol), and P2,0 in the “lower part” (myRow > myCol). Pictures a) and b) show the local parts of the matrix Mu that are held and updated by these processes, and pictures a1)–a3) and b1)–b3) indicate the out current contents of the buffers Aloc and Uloc (more precisely, the Aout loc and Ubuf used for the update) available in the three iterations i = 0 to 2. Below each of these buffers we indicate the process from which it came originally, and thick horizontal lines are used when no block from a block column of U had been packed into a Ubuf . Matrix sizes and process grid are as in Fig. 4. The block indices are hexadecimal, i.e., a ≡ 10 and b ≡ 11.

operating system. We used the MPI from the Intel Parallel Studio 2018.4, as well as the MKL from the same release for local BLAS and ScaLAPACK functionality. We compare our algorithm with two other functions providing similar functionality,



PDTRMM from ScaLAPACK and



elpa_mult_at_b_real_double from the ELPA package, calculating the lower triangular part of an “upper triangular transposed × full” matrix product (with implicit transpositions).

All matrices were double precision real of size n = 30,0 0 0. The routines used only MPI parallelization. Note that the optimum grid shape, pr × pc , and block size, nb , for a given overall number of

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597 Table 1 Timings (in seconds) for rowwise and columnwise ordering of the processes in multiplication 1 (matrix size n = 30, 0 0 0, block size nb = 64).

Algorithm 4: One local update during multiplication 1; this algorithm replaces line 10 in Algorithm 2. (myRow, myCol ) are the coordinates of the current process, and i is the loop counter from Algorithm 2. /* Loop over the block columns of Mloc for the local update */ 1

2 3

4

5

6 7

8 9

myBlockColsInM =



n/nb − myCol pc



;

/* number of block

columns in my Mloc */ for jloc = 0 to myBlockColsInM − 1 do blo cksToUp date = β u ( jloc , myRow, myCol ); /* blocks in block column jloc of Mloc */ origRowU = (myRow + myCol + i ) mod pr ; /* row number out */ of origin process of Ubuf blocksInUBuf = β u ( jloc , origRowU, myCol ); /* blocks in out for this block column */ Ubuf if ((blo cksToUp date > 0 ) and (blocksInUBuf > 0 )) then Mact = Mact + Aact · Uact /* The ‘‘active’’ submatrices are Mact ≡ Mloc (0 : blo cksToUp date − 1, jloc ), Aact ≡ Aout loc (0 : blo cksToUp date − 1, 0 : blocksInUBuf − 1 ), out , Uact ≡ next blocksInUBuf blocks in Ubuf and indices refer to blocks, not to individual entries */ end end

7

Grid size

rowwise

columnwise

8 × 8 16 × 16 32 × 32 64 × 64 128 × 128

8.68 2.56 0.69 0.35 0.17

9.11 2.48 0.92 0.63 0.42

This is not important for a general matrix–matrix product. However, during “multiplication 1” blocks of a full matrix A are transferred along process rows, whereas only an upper triangular matrix U is moved within the columns. Thus, the communication volume along grid rows is significantly larger than along grid columns, and therefore it is reasonable to make best use of the faster direction by setting up the grid with a rowwise ordering. Table 1 shows that this choice can have a significant impact on the performance, in particular for larger grids. If we start with a square pr × pr grid and double the number of processes then the volume of communication per process can be halved along one dimension. If we set up a 2pr × pr grid, this affects the shifts of Aloc along rows, whereas for a pr × 2pr grid, the shifts of Ubuf along columns become faster. Thus we prefer “flat” grids with pr ≤ pc over “tall” ones because the benefits from saving on column communications are larger. If communication within grid columns is faster then tall grids may be superior. 4. A Cannon-type algorithm for multiplication 2

processes, p, and a matrix size, n, may depend on many parameters, e.g., the performance of the node-local BLAS operations with respect to the shape and size of the matrices involved in them, the speed of collective and point-to-point communication operations, etc. A systematic investigation of this topic is out of the scope of this paper. As a practical approach, we chose the process grid “as square as possible,” i.e., pr × pr if the process numbers was a perfect square, and twice as wide as high, pr × 2pr , for the remaining p (all p were powers of 2); cf. Section 3.4 why we prefer “flat” grids, pr ≤ pc . For each p we tried three block sizes, nb = 32, 64, 128, and report the fastest of nine runs (three runs with each block size). These settings are also assumed in the following unless explicitly stated otherwise. Fig. 6 reveals that our Cannon-based new routine is significantly faster and exhibits very good speedup up to 2k processes even for matrices of moderate size. 3.4. Setting up the process grid The time for data transfers along columns and rows of the process grid typically depends on the process ordering used for the grid setup. The latter can be influenced in two ways. First, on a machine with p∗ multi-core nodes one would often run pnode ≥ 1 processes per node. Then typically one can specify at job start-up whether the distribution is, e.g., by contiguous chunks, mapping Pk·pnode , . . . , P(k+1 )·pnode −1 to the same node k, or in a round-robin fashion, mapping every p∗ th process Pk , Pp∗ +k , . . . , P( pnode −1 )·p∗ +k to node k. Throughout the paper, we assume a distribution by chunks. In a second step, e.g., using the routine BLACS_GRIDINIT, the p = p∗ · pnode processes P0 , . . . Pp−1 are configured as a pr × pc grid. With our assumption, a columnwise grid configuration, Pi, j ≡ Pi+ j·pr , leads to faster communication along columns of the grid, since the processes in a column are mapped to physically closer nodes. Similarly, a rowwise ordering of the processes, Pi, j ≡ Pi·pc + j , makes data transfers along rows to be faster.

This function performs step (iii) from Section 1, i.e., it calculates  = M U. Again, U is the inverse of B’s the lower triangular part of A l Cholesky factor, and therefore upper triangular, and Ml is lower triangular. For simplicity we will write L instead of Ml in the following. 4.1. Initial skewing for multiplication 2 Since the second factor U in multiplication 2 is the same as in multiplication 1 (step (i) in Section 1), its skewing is done with Algorithm 3, first packing the local blocks tightly into a buffer Ubuf and then shifting within the process column, yielding the initial out for the local updates. (In Section 5 we will come back to U Ubuf being the same in the two multiplications.) Now the other factor is also (lower) triangular, and therefore each local portion Lloc is also packed into a buffer Lbuf before the skewing along process rows; see lines 1 through 8 in Algorithm 5. Similarly to (2), for a grid with integer aspect ratio r = pc /pr the number of blocks in the jloc th column of Prow,col ’s Lloc is given by:

  n

β ( jloc , row, col ) = l

nb

− row

pr





 col − row  pr

− jloc · r,

(3)

where the first term is the maximum length of an Lloc column held by any process of the current grid row, the second term takes a value from {0, . . . , r − 1} and accounts for missing blocks due to the lower triangular structure, and the third term reflects the fact that each local block column is r blocks shorter than the previous one. As described in Section 2.2, every process has to combine blocks of L from r processes, and therefore also sends its Lbuf to r processes. This is done in lines 9 through 28 of Algorithm 5, together with the interleaving of the received block columns out in lines 23 into a buffer Lout buf . For the computation of pos

8

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

Algorithm 5: Initial skewing of L for multiplication 2; (myRow, myCol ) are the coordinates of the current process, and r = pc /pr is the grid’s aspect ratio. /* Pack my own Lloc into the buffer Lbuf blockcolumn by block column */ 1 2 3 4 5 6 7 8

9 10 11

12 13 14 15 16

17 18

19 20 21 22 23 24 25 26 27 28

myBlockColsInL =



n/nb − myCol



pc

;

/* number of block columns in my Lloc */

Lbuf = ∅; for jloc = 0 to myBlockColsInL − 1 do numBlocks = β l ( jloc , myRow, myCol ) according to (3); if (numBlocks > 0) then append block column jloc of Lloc to the buffer Lbuf ; /* numBlocks · n2b elements */ end end /* Shift the Lbuf s by myRow positions to the left and combine this skewing with interleaving data from r processes with ‘‘stride’’ pr ; collect all received data in a buffer Lout buf */ Lout buf = ∅;

/* will hold

β l (0, myRow, (myCol + myRow ) mod

maxBlocks·(maxBlocks+1 ) blocks, 2

each containing n2b elements */

maxBlocks = pr ))according to(3); /* longest column in Lout buf */ for i = 0 to r − 1 do /* Communication partners are at a distance myRow (accounting for the shift) + i · pr (for combination) */ whereToSend = (myCol − myRow − i · pr + pc ) mod pc ; fromWhereToReceive = (myCol + myRow + i · pr ) mod pc ; if (whereToSend = myCol) then Send Lbuf to PmyRow,whereToSend andreceive Lin buf from PmyRow,fromWhereToReceive ; else /* Then also fromWhereToReceive = myCol, i.e.,I use my own data */ Lin /* Lbuf contains my portion of L before shift;copy only pointer */ buf = Lbuf ; end out in points to the next ( j th) block column toextract from /* Append the contents of received Lin loc buf to Lbuf ;pos in in / out Lbuf ;pos refer tonb × nb blocks, not individual elements */ jloc = 0 ; posin = 0; while (posin has not reached the end of Lin buf ) do numBlocks = β l ( jloc , myRow, fromWhereToReceive mod pc ) according to (3); if (numBlocks > 0) then +1 ) +1 ) posout = maxBlocks·(maxBlocks − numBlocks·(numBlocks ; /* see main text for details */ 2 2 in out ); Copy numBlocks blocks from Lbuf (starting at blockposition posin ) to Lout buf (starting atpos jloc = jloc + 1 ; posin = posin + numBlocks end end end

and 10 note that in the buffer Lout buf , each column of L has one block less than the preceding one, and the longest among these columns is the first one ( jloc = 0) received from that “skewing partner” PmyRow,myCol+myRow , . . . , PmyRow,myCol+myRow+(r−1 ) pr with the lowest column number; cf. also Figs. 7 and 8. Thus the first term in line 23 is the overall length of the buffer (in blocks), and the second term gives the offset of the current block column from the end. 4.2. Local update for multiplication 2 loc The local update is similar to Algorithm 4 and works on A block column by block column, with one important difference. act = A act + Lact · Uact Since the left factor Lact in each update A comes from a packed buffer (with different leading dimension for each block column in Lact ), it typically cannot be done in a single matrix multiplication, as in line 7 of Algorithm 4. To give an example, consider the updates in iteration i = 1 in process P1,1 ; see also Figs. 7 and 8. While the first block column can be updated in a single multiplication,

⎡ ⎤

⎡ ⎤ ⎡ ⎤ A 1,1 A 1,1 L1,0 ⎢ A4,1 ⎥ ⎢ A4,1 ⎥ ⎢ L4,0 ⎥   ⎣ A ⎦ = ⎣ A ⎦ + ⎣ L ⎦ · U0,1 , 7,0 7,1 7,1 10,1 10,1 L10,0 A A

the update of the second block column,



7,7 A  A10,7



 =







7,7 A L7,0 +  L10,0 A10,7

L7,3 L10,3





 =







Lact

 U0,7 L7,6 · U3,7 L10,6  U6,7 





7,7     A L7,0 L7,3 10,7 + L10,0 · U0,7 + L10,3 · U3,7 A





  L + 7,6 · U6,7 L10,6

requires three multiplications because even if all three block columns of Lact comprise the same number of blocks (two), their distance in Lout buf is different: three blocks from L7,0 to L7,3 , and two blocks from L7,3 to L7,6 . In general, blocksInUBuf multiplications are needed, where blocksInUBuf is the number of blocks in the jloc th column of out ; cf. line 5 of Algorithm 4. Therefore, line 7 of that algorithm Ubuf must be replaced with a straight-forward loop taking blocksInUBuf

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

Timings for multiplication 1

4

Speedup over the best p=32 run

8 Time [seconds]

Speedups for multiplication 1 512

ScaLAPACK PDTRMM nb=128 ELPA UL nb= 32 ELPA UL nb= 64 ELPA UL nb=128 New nb= 32 New nb= 64

16

2 1 1/2 1/4 1/8 32

64

128

4x8

8x16 8x8

256

9

512 1024 2048 4096 8192 16384

ideal ScaLAPACK PDTRMM ELPA UL New

256 128 64 32 16 8 4 2 1

16x32 32x64 64x128 16x16 32x32 64x64 128x128

32

64

128 256 512 1024 2048 4096 8192 16384

Number of processes and grid size

Number of processes

Fig. 6. Strong scaling timings and speedups for multiplication 1 with double precision real data, matrix size n = 30,0 0 0, 32 processes per node, and rowwise grid setup √ (cf. Section 3.4). For p ∈ {64, 256, 1024, 4096, 16, 384}, a square process grid was used, pc = pr = p, for p ∈ {32, 128, 512, 2048, 8192} the grid was twice as wide as high,



pc = 2 pr = 2 p. (Left) For each p, the fastest of the runs with block sizes nb ∈ {32, 64, 128} is shown, with the marker specifying the best nb . (Right) Corresponding speedups over the best p = 32 run.





















0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b





















1,0 1,1 −



















− 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b

1,0 1,1 −



















2,0 2,1 2,2 −



















− 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,a 2,b

2,0 2,1 2,2 −

















3,0 3,1 3,2 3,3 −



















− 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,a 3,b

3,0 3,1 3,2 3,3 −















4,0 4,1 4,2 4,3 4,4 −



















− 4,4 4,5 4,6 4,7 4,8 4,9 4,a 4,b

4,0 4,1 4,2 4,3 4,4 −













5,0 5,1 5,2 5,3 5,4 5,5 −



















− 5,5 5,6 5,7 5,8 5,9 5,a 5,b

5,0 5,1 5,2 5,3 5,4 5,5 −











6,0 6,1 6,2 6,3 6,4 6,5 6,6 −



















− 6,6 6,7 6,8 6,9 6,a 6,b

6,0 6,1 6,2 6,3 6,4 6,5 6,6 −









7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7 −



















− 7,7 7,8 7,9 7,a 7,b

7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7 −







8,0 8,1 8,2 8,3 8,4 8,5 8,6 8,7 8,8 −



















− 8,8 8,9 8,a 8,b

8,0 8,1 8,2 8,3 8,4 8,5 8,6 8,7 8,8 −





9,0 9,1 9,2 9,3 9,4 9,5 9,6 9,7 9,8 9,9 −



















− 9,9 9,a 9,b

9,0 9,1 9,2 9,3 9,4 9,5 9,6 9,7 9,8 9,9 −





















− a,a a,b

a,0 a,1 a,2 a,3 a,4 a,5 a,6 a,7 a,8 a,9 a,a −





















0,0 −

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

a,0 a,1 a,2 a,3 a,4 a,5 a,6 a,7 a,8 a,9 a,a −

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

b,0 b,1 b,2 b,3 b,4 b,5 b,6 b,7 b,8 b,9 b,a b,b

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 −

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

*

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

− b,b

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

L

U

=

2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5 0,0 0,1 0,2 0,3 0,4 0,5 0,0 0,1 0,2 0,3 0,4 0,5 1,0 1,1 1,2 1,3 1,4 1,5 1,0 1,1 1,2 1,3 1,4 1,5

b,0 b,1 b,2 b,3 b,4 b,5 b,6 b,7 b,8 b,9 b,a b,b 2,0 2,1 2,2 2,3 2,4 2,5 2,0 2,1 2,2 2,3 2,4 2,5

~

A

 before the skewing for multiplication 2 on a 3 × 6 process grid (shown by thick black lines). Block and Fig. 7. Distribution of the matrices L, U, and the lower triangle of A process numbers are denoted as in Fig. 2. “ − ” marks a zero block, which is not touched.

 that is held and updated by this Fig. 8. Local matrix data during Algorithm 2 after the initial skewing for process, P1,1 . The left picture shows the local part of the matrix A out process, and the remaining pictures indicate the current contents of the buffers Lbuf and Ubuf (more precisely, the Lout buf and Ubuf used for the update) available in the three iterations i = 0 to 2. Below each of these buffers we indicate the process from which it came originally, and thick horizontal lines are used when no block from a block column of L or U had been packed into the buffer. The block indices are hexadecimal, i.e., a ≡ 10 and b ≡ 11.

passes, with possibly blocksInUBuf = 0. (The first block column in Mloc is not updated during iteration i = 0 since there is no matchout .) We skip the explicit pseudocode and just state ing block in Ubuf that the block columns of Lact are readily accessed once the quantity maxBlocks, the length of the first (and longest) block column in the current Lout buf , is known. This quantity must be determined w.r.t. the process that built this buffer before the skewing, that is,

maxBlocks = β l (0, myRow, (myCol + myRow + i ) mod pr ) in iteration i. , all Since we are updating only blocks in the lower triangle of A currently available U blocks will be used in the update. By contrast, in most cases not all blocks of Lout buf are used; again, these cannot be removed from the buffer because they may be needed in later updates in other processes.

10

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

Timings for multiplication 2 ScaLAPACK PDTRMM nb=128 ELPA UU nb= 64 ELPA UU nb=128 New nb= 32 New nb= 64 New nb=128

16

Time [seconds]

8 4 2 1 1/2 1/4 1/8 32

64

128

256 512 1024 2048 4096 8192 16384 Number of processes

Fig. 9. Strong scaling timings for multiplication 2 with double precision real data (matrix and grid setup as in Fig. 6).

Timings for transformation (double real)

32

Time [seconds]

16 8 4

32 16 8 4

2

2

1

1

1/2

1/2 64

128

256

ScaLAPACK PZHENGST nb= 32 ScaLAPACK PZHENGST nb= 64 ScaLAPACK PZHENGST nb=128 Inv+new (one fct, (B)) nb= 64 Inv+new (one fct, (B)) nb=128 ELPA nb= 32 ELPA nb= 64 ELPA nb=128 New (one fct, (B)) nb= 32 New (one fct, (B)) nb= 64 Inversion nb= 64 Inversion nb=128

64

Time [seconds]

ScaLAPACK PDSYNGST nb= 64 ScaLAPACK PDSYNGST nb=128 Inv+new (one fct, (B)) nb= 64 Inv+new (one fct, (B)) nb=128 ELPA nb= 32 ELPA nb= 64 ELPA nb=128 New (one fct, (B)) nb= 32 New (one fct, (B)) nb= 64 New (one fct, (B)) nb=128 Inversion nb=128

64

Timings for transformation (double complex)

512 1024 2048 4096 8192 16384 Number of processes

64

128

256

512 1024 2048 4096 8192 16384 Number of processes

 using a precomputed U −1 (current ELPA implementation, curve “ELPA”, and new Cannon-based implementation Fig. 10. Timings for the complete transformation A → A with type-(B) buffering, curve “New”) and without precomputed U −1 (one-step ScaLAPACK routines P{DSY,ZHE}NGST, using U −1 only implicitly, and “invert and multiply” approach with triangular inversion from ELPA and Cannon-based multiplications, curve “Inv+new”), and for the triangular inversion from ELPA (curve “Inv”). Matrix and grid setup are as in Fig. 6. Left (right): double precision real (complex) data.

Timings for back-transformation (33%)

Time [seconds]

8 4 2 1

8 4 2 1

1/2

1/2

1/4

1/4

1/8

ScaLAPACK PDTRTRS nb= 64 ScaLAPACK PDTRTRS nb=128 ELPA nb= 32 ELPA nb= 64 ELPA nb=128 New nb= 32 New nb=128

16

Time [seconds]

ScaLAPACK PDTRTRS nb= 32 ScaLAPACK PDTRTRS nb= 64 ScaLAPACK PDTRTRS nb=128 ELPA nb= 32 ELPA nb= 64 ELPA nb=128 New nb= 32 New nb= 64 New nb=128

16

Timings for back-transformation (100%)

1/8 64

128

256

512

1024 2048 4096 8192 16384

Number of processes

64

128

256

512

1024 2048 4096 8192 16384

Number of processes

Fig. 11. Timings for the back-transformation of one third of the eigenvectors (k = 9,900, left) and of all eigenvectors (right). Matrix and grid setup are as in Fig. 6.

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

Timings for multiplication 1 (nb=64)

128 64 Time [seconds]

32 16 8 4 2 1

Timings for multiplication 2 (nb=64) ScaLAPACK PDTRMM n=15k ScaLAPACK PDTRMM n=30k ScaLAPACK PDTRMM n=60k ELPA UU n=15k ELPA UU n=30k ELPA UU n=60k New n=15k New n=30k New n=60k

256 128 64 32 Time [seconds]

ScaLAPACK PDTRMM n=60k ScaLAPACK PDTRMM n=30k ScaLAPACK PDTRMM n=15k ELPA UL n=60k ELPA UL n=30k ELPA UL n=15k New n=60k New n=30k New n=15k

256

11

16 8 4 2 1

1/2

1/2

1/4

1/4

1/8

1/8

1/16

1/16

1/32

1/32 32

64

128 256 512 1024 2048 4096 8192 16384

32

64

128 256 512 1024 2048 4096 8192 16384

Number of processes

Number of processes

Fig. 12. Timings of multiplication 1 (left) and multiplication 2 (right) for different matrix sizes, n, with fixed block size nb = 64. The grid setup was as in Fig. 4.

Table 2 Size of the required buffer for full buffering of U according to (4) in MiB per process (1 MiB = 10242 bytes) for different numbers of process columns, pc , and matrix sizes, n, with double precision real data, i.e., 8 bytes per element, and block size nb = 64. .

4.3. Results for multiplication 2 In Fig. 9 we compare the runtimes of our new implementation and of two other functions, • PDTRMM from ScaLAPACK and • elpa_mult_at_b_real_double from the ELPA package, calculating the upper triangular part of an “upper triangular transposed × upper triangular” matrix product (with implicit transpositions), on the COBRA system described in Section 3.3. In the computedominated regime, the new multiplication 2 takes about half of the time of multiplication 1, in close correspondence to the arithmetic complexity ( 31 n3 vs. 23 n3 operations), whereas for high numbers of processes communication dominates and the two routines need almost the same time. 5. Combining both multiplications in one function In this section we point out potential benefits of implementing the two-sided triangular matrix multiplication A → UH AU in one function. (Again, U is the inverse of B’s Cholesky factor.) Recall the steps (i)–(iv) from Section 1. Since the multiplications 1 and 2 in steps (i) and (iii) have the same right operand U, in both of them exactly the same communications for U are carried out in order to implement initial skewing and circular shifts. To avoid these duplicated data exchanges, it is possible to buffer some of the blocks of U that are received in multiplication 1 in local storage and use them in multiplication 2. The available memory puts an upper limit on how much data we can buffer. We have tested three versions of the function: (A) No buffering at all. Even in this case a small speedup may be possible through an optimized memory allocation for buffers (only once for both multiplications) and thus perhaps more efficient cache usage. (B) Buffering only of the skewed U. Here we avoid the initial skewing of U for multiplication 2. (C) Full buffering. Here we save all U blocks received during multiplication 1 and thus completely avoid communication for U during the second multiplication. Of course intermediate levels of buffering are also possible. In our implementation it can be adjusted by a parameter specifying

pc

n = 30,0 0 0

n = 60,0 0 0

n = 10 0,0 0 0

8 16 32 64 128

437 222 115 61.3 34.6

1733 874 444 230 123

4796 2410 1217 621 323

how many of the circular shifts to store. We also recall the discussion on process ordering in Section 3.4 and that we organize the process grid in a way making the data transfers along rows faster. Thus, the buffering idea helps us to avoid slow communications along columns, and only fast data transfers are used for the second multiplication. 5.1. Additional memory requirements for buffering To obtain an upper bound for the additional memory required for full buffering, we first determine the maximum number of blocks that must be held in one process. If a process holds blocks from a certain block column in his Uloc then with the vertical shifts in multiplication 1 it will eventually receive all blocks of that block column, and therefore it must buffer the whole block column. Therefore the largest buffer must be allocated in the processes holding the longest block columns of the upper triangular matrix U, that is, the processes holding the last (n∗ = n/nb )th block column, which contains n∗ blocks. Due to the block cyclic distribution over the pr × pc grid, these processes also must buffer every pc th block column to the left, containing n∗ − pc , n∗ − 2 pc ,..., blocks. In total, they hold k∗ = n∗ /pc nonempty block columns with a total of ∗ k −1

k=0

( n∗ − k · pc ) = k∗ n∗ −

( k∗ − 1 )k∗ 2

· pc

(4)

blocks, each containing n2b elements. This is just slightly more than the n(n + 1 )/(2 pc ) elements that a perfectly balanced distribution of U’s entries would take. Table 2 presents the memory requirements in MiB for the full buffering case. The benefits from full buffering depend on the

12

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597 Table 3 Running times (in seconds) for the two-sided multiplication with double precision real data. Matrix and grid setup are as in Fig. 6, with fixed block size nb = 64. All timings include multiplications 1 and 2, as well as the two transpositions of Ml . and A Number of processes

Separate functions

(A) No buffering

(B) Buffering of initial skewing

(C) Full buffering

256 1024 4096 16,384

4.07 1.16 0.65

4.08 1.16 0.63

4.03 1.14 0.63 0.43

4.06 1.15 0.50 0.27

number of communication steps saved. In fact, buffering pays only on relatively large process grids, and in this case there should be enough memory to run the algorithm. To take an example, for matrices of size n = 30,0 0 0 full buffering is faster for 4, 096 processes or more (cf. Table 3 below), corresponding to a 64 × 64 grid. Then at most 61.3 MiB (1 MiB = 10242 bytes) per process are required to completely avoid communications along columns in the second multiplication, and thus with 32 processes per node less than 2 out of the 96 GiB (1 GiB = 10243 bytes) of a COBRA node’s memory are consumed. However, as we have already mentioned, it is also possible to use intermediate levels of buffering, storing only some of the shifts of U. As a rough estimate, buffering s shifts requires s times the memory for the local data of the matrix U, ranging from one to pr “additional copies” of U for configurations (B) and (C), respectively. Sharper bounds similar to (4) may be used to adjust the buffering level to the amount of available memory. 5.2. Results of experiments In Table 3 we compare the above configurations (A)–(C) of the single-function approach with separate calls of the “multiplication 1” and “multiplication 2” routines. It can be seen that buffering is beneficial for large process grids and improves the scaling. For low process numbers it may be slower because for small grids the calculation is dominating in comparison to communication. Then the optimized data transfer scheme does not play a significant role, whereas copying data to buffers takes more time since the buffers are larger for lower number of processes. 6. Further considerations In this section we consider additional aspects, such as using our routines in the context of a single eigenproblem instead of a whole sequence of such, as well as the back-transformation of the eigenvectors, and we present timing results for these. 6.1. Computation for a single eigenproblem In the previous sections we have assumed that the inverse of B’s Cholesky factor is already available explicitly (this was our matrix U). As pointed out before, in an important application area of the ELPA library, electronic structure computations, typically a sequence of generalized eigenproblems has to be solved with different A and the same B, and then the Cholesky decomposition and the inversion have to be done only once and thus the cost for explicit inversion is amortized over the sequence. If only a single eigenproblem AX = BX  must be solved then  are attractive, other approaches for the transformation A → A which combine both multiplications and work directly with the Cholesky factor, using its inverse only implicitly. The ScaLAPACK function PDSYNGST proceeds this way. In the left picture of Fig. 10 we compare the latter routine to our approach (curve “Inv+new”: time for the inversion with

elpa_invert_trm plus the time for the transformation A → A with two multiplications in one routine, version (B), i.e., buffering of the skewed U) on the COBRA system. We see that explicit inversion, combined with our Cannon-based matrix multiplications, can be highly competitive even for solving a single generalized eigenproblem. The actual transformation (curve “New”) is faster than its current ELPA counterpart (curve “ELPA”) and significantly faster than PDSYNGST (if the inverse can be reused from an earlier computation), and for large process numbers it even takes less time than the inversion (curve “Inv”). With full buffering, it would be another 36% faster for 16,386 processes. The right picture indicates that for complex data the behavior is similar, but the transformation takes roughly four times longer than in the real case for small numbers of processes, p, and two times longer for large p, and the complex routines scale to higher p. This is to be expected because the number of arithmetic operations increases by a factor of four – one complex addition (multiplication) requires two (six) real operations, – whereas the communication volume is only doubled, and the number of communication steps remains unchanged. 6.2. Back-transformation with a Cannon-type algorithm A Cannon-type algorithm can be also applied in step 5 from  of the eigenvectors. Section 1, the back-transformation X = U −1 X Here, the columns of the matrix X contain the desired eigenvectors  are the computed of the generalized eigenproblem AX = BX , X X = X , and U is the eigenvectors of the standard eigenproblem A Cholesky factor of B. With our approach for the transformation, U −1 is already available explicitly, and thus the back-transformation can be done with a matrix product. The algorithm is very similar to the one described in Section 3, but here the triangular matrix is on the left side, and the second factor is not necessarily square. ScaLAPACK does not build U −1 , and therefore it uses a triangular solve with the matrix U (routine PDTRTRS) for the backtransformation. Fig. 11 compares the two approaches for the backtransformation of one third of the eigenvectors (left) and of all eigenvectors (right). The former situation is common in SCF computations, where typically a significant percentage, but not all, of the eigenvectors are needed. Note that our back-transformation has been optimized for this case. If all eigenvectors are needed then the second matrix in the  contains more data than the first one, and therefore product U −1 X most communication is along process columns. Then a columnwise grid setup would be superior. Alternatively, if the remaining computations in the simulation make a rowwise setup mandatory, one might implement an optimized routine for right-multiplying with a triangular matrix (similarly to our multiplication 1) and use that H U −H )H . routine to compute (X 6.3. Quality of the computed eigensystems To assess the quality of the computed eigensystems, generalized eigenproblems AX = BX  with different size and condition

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

13

Table 4 Maximum residuals and deviation from B-orthonormality for varying dimensions n and condition numbers cond(B) ≈ λmax (A, B). n

1,000

30,000

λmax (A, B) 5.01E+02 5.01E+05 5.01E+08 1.50E+04 1.50E+07 1.50E+10

Residual max j ||Ax j − Bx j λ j ||2

B-orthonormality maxi, j |xTi Bx j − δi, j |

ScaLAPACK

ELPA

new

ScaLAPACK

ELPA

new

2.19E−12 6.44E−08 2.00E−03 8.15E−10 1.25E−05 3.57E−01

3.28E−12 7.05E−08 3.60E−03 3.55E−10 2.52E−05 6.21E−01

2.60E−12 6.21E−08 3.95E−03 4.07E−10 2.17E−05 1.35E+00

1.35E−13 3.56E−12 3.39E−09 3.83E−11 3.97E−11 7.38E−09

1.02E−14 4.06E−12 4.16E−09 2.78E−13 7.39E−12 7.54E−09

1.24E−14 3.87E−12 3.27E−09 2.88E−13 7.54E−12 8.01E−09

number have been solved. The entries of the test matrices were chosen as follows: ai, j = cos i cos j + sin j sin i, and B = B0 + σ I, where b0i, j = sin j sin i and σ > 0. Thus, cond(B) can be controlled by varying σ . Table 4 reports some of the results indicating that the two methods with explicit inversion (ELPA, new) perform comparably to ScaLAPACK (using implicit inversion). Note that, for these matrices, the maximum computed eigenvalue λmax (A, B) is almost identical to cond(B).

We expect the new algorithms to be included in future releases of the ELPA library. Since their restriction to pr × pc process grids, where pc is a (small) multiple of pr , does not apply to other ELPA routines, we propose a hybrid overall strategy, using the new (faster) algorithms if possible and fall back to the standard ones otherwise.

7. Concluding remarks

This work was supported by the Federal Ministry of Education and Research within the project “ELPA-AEO, Eigenvalue soLvers for Petaflop Applications – Algorithmic Extensions and Optimizations” under Grant 01IH15001. The authors want to thank the Max Planck Computing and Data Facility (MPCDF), Garching, and the Riken Advanced Institute for Computational Science, Kobe, for providing access to the COBRA, HYDRA and K computer systems, respectively, and to Martin Galgon and Andreas Marek for many fruitful discussions. Data from HYDRA and K featured in an earlier version of this paper. The authors are also grateful to the unknown referees, whose valuable comments helped to improve the quality of the presentation.

We have presented new Cannon-type algorithms for the triangular matrix multiplications that arise in the reduction of generalized HPD eigenproblems to standard form, if the inverse of the Cholesky factor is built explicitly. Numerical experiments demonstrate that the new implementations can outperform their counterparts from the current ELPA library and the ScaLAPACK triangular multiplication P_TRMM with a pure MPI parallelization. This also holds for matrix sizes different from the n = 30, 0 0 0 considered so far; cf. Fig. 12, and it can be explained roughly as follows. For simplicity, we consider a square process grid, p = p2r , and ignore the effects of non-even distribution of the matrices to the processes, i.e., we assume that each process holds n2 /p entries of full matrices and n2 /2p entries of triangular matrices. We also neglect the initial skewing and combination, which is dominated by the ensuing pr shifts if pr is not too small. Then, in our Cannonbased multiplication 1, each process passes a total of pr · n2 /p elements (1/pr of A) to its left neighbor and pr · n2 /2p elements (1/pr of B) to its upper neighbor, and in multiplication 2 the volume of communication along process rows is halved because the first matrix is triangular as well. ELPA uses the same amount of communication, but with collective communications, which in particular increase the level of synchronization. This also holds for multiplication 1 with P_TRMM, and in addition multiplication 2 does make use of both matrices being triangular when using that routine, such that then the communication volume is higher. In contrast to P_TRMM, our Cannon-based multiplications, as well as their ELPA counterparts, make full use of the facts that only one triangle must be computed in both multiplications and that both matrices are triangular in multiplication 2. Thus they have comparable arithmetic complexity, which is lower than P_TRMM’s. Note that a square grid leads to only one copy of the matrices being stored. Therefore the speedup is not due to keeping redundant data, as in the 3D or 2.5 matrix multiplication schemes. The limit of scalability may be pushed slightly further by using a mixed “MPI & multithreading” parallelization, which is currently under investigation. For simplicity the algorithms presented in this paper assume that the matrix size n is a multiple of the block size nb . They do, however, not require the number of blocks, n/nb , to be a multiple of the grid dimensions pc and pr . Our implementation also does not need the former assumption, i.e., it can handle arbitrary matrix sizes without artificially filling the last block column/row.

Acknowledgments

References [1] V. Fock, Näherungsmethode zur Lösung des quantenmechanischen Mehrkörperproblems, Z. Phys. 1930 (1–2) (1930) 126–148, doi:10.1007/BF01340294. [2] C.C.J. Roothaan, Modern developments in molecular orbital theory, Rev. Mod. Phys. 23 (2) (1951) 69–89, doi:10.1103/RevModPhys.23.69. [3] W. Kohn, L.J. Sham, Self-consistent equations including exchange and correlation effects, Phys. Rev. 140 (4A) (1965) A1133–A1138, doi:10.1103/PhysRev.140. A1133. [4] J. Poulson, R.A. van de Geijn, J. Bennighof, Parallel Algorithms for Reducing the Generalized Hermitian-Definite Eigenvalue Problem, Technical Report TR-11-05, The University of Texas at Austin, Department of Computer Science, 2011. FLAME Working Note #56. [5] J. Poulson, B. Marker, R.A. van de Geijn, J.R. Hammond, N.A. Romero, Elemental: a new framework for distributed memory dense matrix computations, ACM Trans. Math. Softw. 39 (2) (2013) 13:1–13:24, doi:10.1145/2427023.2427030. [6] L.S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK Users’ Guide, SIAM, Philadelphia, PA, 1997, doi:10.1137/1.9780898719642. [7] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H.-J. Bungartz, H. Lederer, The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science, J. Phys. 26 (21) (2014) 213201, doi:10.1088/0953-8984/26/21/213201. [8] J.J. Dongarra, L. Kaufman, S. Hammarling, Squeezing the most out of eigenvalue solvers on high-performance computers, Linear Algebra Appl. 77 (1986) 113– 136, doi:10.1016/0024-3795(86)90164-3. [9] J. Choi, J.J. Dongarra, D.W. Walker, Parallel matrix transpose algorithms on distributed memory concurrent computers, Parallel Comput. 21 (9) (1995) 1387– 1405, doi:10.1016/0167-8191(95)0 0 016-H. [10] L.E. Cannon, A Cellular Computer to Implement the Kalman Filter Algorithm, Montana State University, Bozeman, MT, 1969 Ph.D. Thesis. [11] G.H. Golub, C.F. Van Loan, Matrix Computations, forth ed., The Johns Hopkins University Press, Baltimore, MD, 2013. [12] H.-J. Lee, J.P. Robertson, J.A.B. Fortes, Generalized Cannon’s algorithm for parallel matrix multiplication, in: Proc. ICS ’97, Intl. Conf. Supercomputing, July 7– 11, 1997, Vienna, Austria, ACM Press, 1997, pp. 44–51, doi:10.1145/263580. 263591. [13] J. Choi, D.W. Walker, J.J. Dongarra, Pumma: parallel universal matrix multiplication algorithms on distributed memory concurrent computers, Concurrency 6 (7) (1994) 543–570, doi:10.10 02/cpe.4330 060702.

14

V. Manin and B. Lang / Parallel Computing 91 (2020) 102597

[14] G.C. Fox, S.W. Otto, A.J.G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4 (1987) 17–31, doi:10.1016/0167-8191(87) 90060-3. [15] R.A. van de Geijn, J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency 9 (4) (1997) 255–274, doi:10.1002/(SICI) 1096- 9128(199704)9:4<255::AID- CPE250>3.0.CO;2- 2. [16] J. Choi, A new parallel matrix multiplication algorithm on distributed-memory concurrent computers, Concurr. Comput. 10 (8) (1998) 655–670, doi:10.1002/ (SICI)1096-9128(199807)10:8<655::AID-CPE369>3.0.CO;2-O. [17] E. Dekel, D. Nassimi, S. Sahni, Parallel matrix and graph algorithms, SIAM J. Comput. 10 (4) (1981) 657–675, doi:10.1137/0210049.

[18] S.L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19 (11) (1993) 1235–1257, doi:10.1016/ 0167- 8191(93)90029- K. [19] E. Solomonik, J. Demmel, Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms, in: E. Jeannot, R. Namyst, J. Roman (Eds.), Euro-Par 2011 Parallel Processing. 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29–September 2, 2011, Proceedings, Part II, 6853, Springer, Berlin, Heidelberg, 2011, pp. 90–109, doi:10.1007/ 978- 3- 642- 23397- 5_10.