Parallel Computing 85 (2019) 131–140
Scalable eigenvector computation for the non-symmetric eigenvalue problem
Angelika Schwarz∗, Lars Karlsson
Department of Computing Science, Umeå University, Sweden
∗ Corresponding author. E-mail addresses: [email protected] (A. Schwarz), [email protected] (L. Karlsson).
Article history: Received 2 October 2018; Revised 26 February 2019; Accepted 10 April 2019; Available online 11 April 2019.
Abstract
We present two task-centric algorithms for computing selected eigenvectors of a non-symmetric matrix reduced to real Schur form. Our approach eliminates the sequential phases present in the current LAPACK/ScaLAPACK implementation. We demonstrate the scalability of our implementation on multicore, manycore and distributed memory systems.
Keywords: Eigenvectors; Real Schur form; Tiled algorithms; MPI + OpenMP parallel programming
1. Introduction

Eigenvalue problems are ubiquitous. They arise, for example, in image processing, mechanics, structural engineering or stability analysis. The standard eigenvalue problem aims at computing an eigenvalue λ and a corresponding eigenvector y ≠ 0 such that Ay = λy. If ℓ distinct eigenvalues and eigenvectors are sought, Ay_i = λ_i y_i, i = 1, ..., ℓ, can be written equivalently as AY = YΛ, where Λ = diag(λ_1, ..., λ_ℓ) is the diagonal matrix of eigenvalues and Y = [y_1 ... y_ℓ] is the eigenvector matrix. Here we focus on the case where A ∈ R^{n×n} is non-symmetric and dense. The eigenvalues and eigenvectors are, as a consequence, potentially complex. This paper contributes two task-parallel algorithms for efficiently computing eigenvectors using real arithmetic.

The eigenvalue problem can be solved through three phases, see for example [1]. First, A is reduced to upper Hessenberg form H = Q_0^T A Q_0. The similarity transform with the orthogonal matrix Q_0 ensures that the eigenvalues are preserved. Second, the Hessenberg matrix is reduced to real Schur form by means of the QR algorithm. This iterative procedure employs orthogonal matrices whose accumulation eventually yields the real Schur decomposition A = QTQ^T. The eigenvalues of A are given on the diagonal of
\[
Q^T A Q = T =
\begin{bmatrix}
T_{1,1} & T_{1,2} & \cdots & T_{1,m} \\
        & T_{2,2} & \cdots & T_{2,m} \\
        &         & \ddots & \vdots  \\
        &         &        & T_{m,m}
\end{bmatrix},
\tag{1}
\]
where the diagonal blocks T_{k,k}, k = 1, ..., m, are 1-by-1 or 2-by-2. A 1-by-1 block corresponds to a real eigenvalue; a 2-by-2 block corresponds to a pair of complex conjugate eigenvalues. Third, an eigenvector y_k corresponding to λ_k can be computed with a two-step procedure. Finding a non-trivial solution to (T − λ_k I)x_k = 0 yields an eigenvector of T. Due to 2-by-2 blocks on the diagonal of T, this step requires a modified backward substitution. The backtransform y_k = Qx_k gives an eigenvector that satisfies Ay_k = λ_k y_k.

The libraries LAPACK [2] and ScaLAPACK [3] attempt to provide the scientific community with reliable and efficient implementations of widely used linear algebra algorithms. LAPACK expresses an algorithm in terms of BLAS operations; ScaLAPACK extends LAPACK to distributed memory and replaces BLAS with parallel BLAS (PBLAS). Their performance relies on an efficient implementation of BLAS and PBLAS routines, respectively. The resultant programming model is commonly referred to as the fork-join model, which reflects alternating sequential and parallel BLAS operations. The transition from sequential to parallel corresponds to a fork, whereas the reverse transition yields a join. The sequential phases limit scalability. In order to make more efficient use of the compute resources, algorithms have to be redesigned.

This paper addresses the eigenvector computation from the real Schur form for the non-symmetric eigenvalue problem. The corresponding LAPACK (version 3.7.0) routine is xTREVC3, which computes
eigenvectors in both complex and real arithmetic. ScaLAPACK (version 2.0.2) supports only complex arithmetic. The corresponding routines are PCTREVC and PZTREVC. Our parallel algorithms, in some sense, amend ScaLAPACK with the missing implementation in real arithmetic.

We present two task-centric algorithms. The input to the algorithms consists of a real Schur decomposition and a set of user-selected blocks T_{k,k} on the diagonal of T. The output is a complex matrix whose columns contain the computed eigenvectors, represented in the original basis, corresponding to the user-selected blocks. The first task-parallel algorithm targets shared-memory systems. The second algorithm targets distributed memory systems with shared-memory nodes and utilizes the first algorithm as a fundamental building block. The scalability in terms of memory footprint and time-to-solution is analyzed on multicore, manycore and distributed memory systems.

Our two task-centric implementations solve a large class of problems. For now, LAPACK is able to handle even more problems, namely those where floating-point overflows occur. Consider λ_k ≈ λ_ℓ, k < ℓ. The matrix T − λ_ℓ I is tiny at position (k, k). Solving for the kth entry of the eigenvector x_ℓ yields growth. This situation occurs repeatedly if eigenvalues are clustered, eventually exceeding the overflow threshold (see [4–6]). LAPACK uses a specialized solver xLATRS, which downscales eigenvector columns to avoid overflow. xLATRS corresponds to a level-2 BLAS operation. An alternative, which is based on level-3 BLAS, is described in [7–9] for standard backward substitution. This overflow protection scheme has to be extended to the modified backward substitution faced here. This is ongoing work made difficult by the need to rigorously validate all possible cases. Our implementation, however, is prepared for overflow protection. We emphasize that our implementation is designed in such a way that adding robustness will not hamper scalability.

The rest of this paper is structured as follows. Section 2 shows how a single eigenvector can be computed from the real Schur form. It describes how the backward substitution is modified to execute all computations in real arithmetic. The algorithm is extended to multiple eigenvectors in Section 3. The first task-based algorithm for shared-memory systems is introduced in Section 4, followed by the second algorithm for distributed memory in Section 5. Section 6 compares these two algorithms qualitatively to existing library implementations. The experiment setup is described in Section 7, and Section 8 presents performance results for both algorithms.
Algorithm 1 Computation of single eigenvectors.
Require: Real Schur decomposition A = QTQ^T, position s of a selected block T_{s,s}.
Ensure: An eigenvector y corresponding to λ such that Ay = λy.
function ComputeEigenvector(Q, T, s)
    x ← SingleBacksolve(T, s)
    y ← SingleBacktransform(Q, x)
    return y
function SingleBacksolve(T, s)
    if T_{s,s} is a 1-by-1 block then  // Construct real eigenvector
        λ ← T_{s,s}
        b ← −T_{1:s−1,s}
        // Solve (T_{1:s−1,1:s−1} − λI) x_1 = b
        x_1 ← BacksolveReal(T, b, λ, s)
        x ← [x_1; 1; 0]
    if T_{s,s} is a 2-by-2 block then  // Construct complex eigenvector
        [α β; γ α] ← T_{s,s}
        λ ← α + i √|βγ|
        if |β| ≥ |γ| then
            x_2 ← 1
            x_3 ← i √|βγ| / β
        else
            x_2 ← −√|βγ| / γ
            x_3 ← i
        b ← −T_{1:s−1,s} [x_2; x_3]
        // Solve (T_{1:s−1,1:s−1} − λI) x_1 = b
        x_1 ← BacksolveComplex(T, b, λ, s)
        x ← [x_1; x_2; x_3; 0]
    return x
function SingleBacktransform(Q, x)
    Partition x = [x_1; 0] and Q = [Q_1 Q_2] conformally
    y ← Q_1 x_1
    return y
2. Computation of single eigenvectors

A common approach (see, e.g., [5]) to compute an eigenvector y ≠ 0 such that Ay = λy given the Schur decomposition A = QTQ^T is to first find a non-trivial solution to (T − λI)x = 0 and then backtransform x to the original basis by y = Qx. This section elaborates on the first step. We describe how the initial right-hand sides are formed for a real and a complex eigenvalue. Further, we discuss what modifications of the backward substitution algorithm are required to solve the resulting system in real arithmetic.
1-by-1 blocks. Let λ ∈ R be a selected 1-by-1 block. A corresponding eigenvector x is any non-trivial solution to (T − λI)x = 0. In

\[
\left(
\begin{bmatrix}
T_{11} & t_{12}  & T_{13} \\
0      & \lambda & t_{23}^T \\
0      & 0       & T_{33}
\end{bmatrix}
- \lambda I
\right)
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = 0,
\tag{2}
\]

setting x_3 = 0 trivially satisfies (T_{33} − λI)x_3 = 0. With this choice the first equation becomes (T_{11} − λI)x_1 + t_{12} x_2 = 0. This system is underdetermined. We follow the choice in LAPACK and set x_2 = 1, see Algorithm 1. Then we solve (T_{11} − λI)x_1 = −t_{12} through a modified backward substitution, see routine BacksolveReal in Algorithm 2. Note that the computation of an eigenvector only involves the leading submatrix of T up to and including the selected eigenvalue.
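As a concrete illustration of this 1-by-1 case, the following C sketch (our own, not the paper's code) carries out the backward substitution for an upper triangular T without 2-by-2 blocks, stored in column-major order; the 2-by-2 diagonal blocks handled by BacksolveReal and the overflow protection discussed in Section 1 are deliberately omitted.

```c
#include <stdio.h>

/* Backsolve for a real eigenvalue lambda = T(s,s) of an upper TRIANGULAR T
 * (no 2-by-2 blocks). T is n-by-n, column-major with leading dimension n.
 * On return, x[0..s-1] solves (T(0:s-1,0:s-1) - lambda*I) x = -T(0:s-1,s),
 * x[s] = 1 and x[s+1..n-1] = 0, so that T x = lambda x. */
static void real_eigenvector(int n, const double *T, int s, double *x) {
    double lambda = T[s + s * n];
    for (int i = 0; i < n; i++)
        x[i] = 0.0;
    x[s] = 1.0;
    for (int i = 0; i < s; i++)           /* right-hand side b = -T(0:s-1,s)   */
        x[i] = -T[i + s * n];
    for (int j = s - 1; j >= 0; j--) {    /* backward substitution             */
        x[j] /= (T[j + j * n] - lambda);  /* no overflow guard, cf. Section 1  */
        for (int i = 0; i < j; i++)       /* update remaining right-hand side  */
            x[i] -= T[i + j * n] * x[j];
    }
}

int main(void) {
    /* 3-by-3 upper triangular example with eigenvalues 2, 5, 7; select s = 1. */
    double T[9] = { 2, 0, 0,   1, 5, 0,   3, 4, 7 };   /* column-major */
    double x[3];
    real_eigenvector(3, T, 1, x);
    printf("x = [%g, %g, %g]\n", x[0], x[1], x[2]);    /* expect [1/3, 1, 0] */
    return 0;
}
```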
2-by-2 blocks. Following LAPACK's notion of the Schur canonical form, the 2-by-2 blocks are assumed to be in the normalized form¹

\[
\begin{bmatrix} \alpha & \beta \\ \gamma & \alpha \end{bmatrix} \in \mathbb{R}^{2 \times 2},
\tag{3}
\]

where γ and β are non-zero with opposite signs, i.e., γβ < 0. The eigenvalues of such a block are λ = α + i√|βγ| and λ̄ = α − i√|βγ|. As the corresponding eigenvectors also occur in complex conjugate pairs, only one eigenvector has to be computed. The second one can be obtained by complex conjugation. The system

\[
\left(
\begin{bmatrix}
T_{11} & t_{12} & t_{13} & T_{14} \\
0      & \alpha & \beta  & t_{24}^T \\
0      & \gamma & \alpha & t_{34}^T \\
0      & 0      & 0      & T_{44}
\end{bmatrix}
- \lambda I
\right)
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = 0
\tag{4}
\]

can be solved similarly to the real case. Setting x_4 = 0 satisfies (T_{44} − λI)x_4 = 0. With x_4 = 0, the second and third equations read

\[
\left(
\begin{bmatrix} \alpha & \beta \\ \gamma & \alpha \end{bmatrix} - \lambda I
\right)
\begin{bmatrix} x_2 \\ x_3 \end{bmatrix} = 0,
\tag{5}
\]

for which a non-trivial solution x_2, x_3 is sought. Substituting in the eigenvalue λ = α + i√|βγ| reveals the dependency between x_2 and x_3 as

\[
x_2 = i \frac{\sqrt{|\beta\gamma|}}{\gamma}\, x_3
\quad\text{or}\quad
x_3 = i \frac{\sqrt{|\beta\gamma|}}{\beta}\, x_2.
\tag{6}
\]

Thus, a non-trivial solution is obtained by choosing a non-zero value for either x_2 or x_3 and computing the value of the other. We again adopt the choice made in LAPACK and set either x_2 = 1 or x_3 = i, see Algorithm 1. Having set x_2, x_3 and x_4, the first equation can be solved with BacksolveComplex in Algorithm 2.

Since the goal is to compute eigenvectors entirely in real arithmetic, Algorithm 2 shows the steps for the backward substitution, which is subtly different from a standard backward substitution. Since T is upper quasi-triangular, four cases arise. 1) A real eigenvalue encounters a 1-by-1 block. This case can be solved with standard backward substitution. 2) A real eigenvalue encounters a 2-by-2 block. The real-valued 2-by-2 system can be solved with Gaussian elimination. 3) A complex eigenvalue encounters a 1-by-1 block. The complex division is executed in real arithmetic. 4) A complex eigenvalue encounters a 2-by-2 block. The complex 2-by-2 system has two complex unknowns and can be solved with Gaussian elimination.

The backtransform of the result of the modified backward substitution is implemented with SingleBacktransform (see Algorithm 1), a matrix-vector multiplication.

A single real eigenvector of length w requires w divisions, w subtractions, (w² − w)/2 additions and (w² − w)/2 multiplications for the backsolve if T is upper triangular. Additional flops are required for solving 2-by-2 systems. The backtransform can be executed through a matrix-vector multiplication with n(2w − 1) flops. If a complex eigenvector is computed in real arithmetic, the flop count approximately doubles, accounting for the computation of the real and the imaginary part. A slight increase in flops arises from the complex division in real arithmetic and the solution of complex 2-by-2 systems. Since only one of the complex conjugate eigenvectors is computed per selected 2-by-2 block, the flop count per eigenvector – real or complex – is roughly the same.

¹ See DLANV2 in LAPACK version 3.7.0.
Algorithm 2 Backsolve in real arithmetic.
Require: Upper quasi-triangular matrix T, right-hand side b, eigenvalue λ, position s of the selected block T_{s,s} = λ.
Ensure: Solution to (T − λI)x = b.
function BacksolveReal(T, b, λ, s)
    for j ← s−1, s−2, ..., 1 do
        if T_{j,j} is a 1-by-1 block then
            b_j ← b_j / (t_{jj} − λ)
            b_{1:j−1} ← b_{1:j−1} − T_{1:j−1,j} b_j
        if T_{j,j} is a 2-by-2 block then
            Solve the 2 × 2 real system (T_{j,j} − λI_2) x = b_j
            b_j ← x
            b_{1:j−1} ← b_{1:j−1} − T_{1:j−1,j} b_j
    return b
function BacksolveComplex(T, b, λ, s)
    Let λ = α + iβ
    x ← Re(b), y ← Im(b)
    for j ← s−1, s−2, ..., 1 do
        if T_{j,j} is a 1-by-1 block then
            // Execute the complex division (x_j + iy_j)/(t_{jj} − (α + iβ)) in real arithmetic
            e ← ((t_{jj} − α)x_j − βy_j) / ((t_{jj} − α)² + β²)
            f ← ((t_{jj} − α)y_j + βx_j) / ((t_{jj} − α)² + β²)
            x_j ← e
            y_j ← f
            x_{1:j−1} ← x_{1:j−1} − T_{1:j−1,j} x_j
            y_{1:j−1} ← y_{1:j−1} − T_{1:j−1,j} y_j
        if T_{j,j} is a 2-by-2 block then
            Solve the 2 × 2 complex system (T_{j,j} − (α + iβ)I_2)(e + if) = x_j + iy_j
            x_j ← e
            y_j ← f
            x_{1:j−1} ← x_{1:j−1} − T_{1:j−1,j} x_j
            y_{1:j−1} ← y_{1:j−1} − T_{1:j−1,j} y_j
    return x + iy
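The 1-by-1 step of BacksolveComplex amounts to a single complex division carried out with real operations only. A minimal C sketch of that step (our own illustration of the unscaled textbook formula exactly as written in Algorithm 2; LAPACK's DLADIV computes the same quotient with additional scaling to avoid harmful intermediate over/underflow):

```c
#include <stdio.h>

/* Compute (x + i*y) / (t - (alpha + i*beta)) using real arithmetic only,
 * as in the 1-by-1 branch of BacksolveComplex (Algorithm 2). */
static void complex_div(double x, double y, double t, double alpha, double beta,
                        double *e, double *f) {
    double dr  = t - alpha;               /* real part of the denominator   */
    double den = dr * dr + beta * beta;   /* squared magnitude of the denominator */
    *e = (dr * x - beta * y) / den;       /* real part of the quotient      */
    *f = (dr * y + beta * x) / den;       /* imaginary part of the quotient */
}

int main(void) {
    double e, f;
    /* (1 + 2i) / (3 - (1 + 1i)) = (1 + 2i) / (2 - 1i) = 0 + 1i */
    complex_div(1.0, 2.0, 3.0, 1.0, 1.0, &e, &f);
    printf("quotient = %g + %gi\n", e, f);   /* expect 0 + 1i */
    return 0;
}
```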
3. Towards compute-bound eigenvector computation

The computation of a single eigenvector as in Section 2 is memory-bound. Both the memory requirement and the number of arithmetic operations are O(n²), which corresponds to an arithmetic intensity of O(1). The arithmetic intensity can be increased if several eigenvectors are computed simultaneously. For this purpose, let S ⊆ {T_{1,1}, ..., T_{m,m}} be a set of selected blocks whose corresponding eigenvectors are sought. Recall that T_{k,k} denotes a 1-by-1 or 2-by-2 block on the diagonal of T ∈ R^{n×n}, see (1). We assume that all blocks T_{k,k} are distinct. Algorithm 3 computes an eigenvector for each selected block. The eigenvector columns are stored compactly in an n-by-|S| complex matrix X^S. The backtransform can then be computed by a sparsity-exploiting matrix-matrix multiplication.

Algorithm 3 Multi-backsolve.
Require: Real Schur form T = Q^T A Q, set S of selected blocks.
Ensure: n-by-|S| complex eigenvector matrix X^S such that T X^S = X^S Λ^S.
function Backsolve(T, S)
    s ← |S|
    for k ← m, m−1, ..., 1 do
        if T_{k,k} ∈ S then
            x_s ← SingleBacksolve(T, k)
            s ← s − 1
    X^S ← [x_1, ..., x_{|S|}]
    return X^S

Throughout the paper, the term block denotes a 1-by-1 or 2-by-2 block T_{k,k} on the diagonal of T as defined in (1). General contiguous submatrices are called tiles. In the following, we present a tiled algorithm. For this purpose, the upper quasi-triangular matrix T is partitioned into n_b^2 tiles such that 2-by-2 blocks on the diagonal are not split and diagonal tiles are square, see Fig. 1 (left). We partition the eigenvector matrix X ∈ C^{n×n} accordingly, see Fig. 1 (center). The shape of T carries over to the shape of the eigenvector matrix X. Analogously to Algorithm 3, the eigenvectors should be stored compactly if fewer than m blocks are selected.
Fig. 1. Partitioning of X carried over from T and the compact storage as X^S. The top rectangle marks the position of selected blocks.
The resulting matrix X^S attains a generalized quasi-triangular shape, see Fig. 1 (right). A tiled multi-backsolve is given in Algorithm 4.

Algorithm 4 Tiled multi-backsolve.
Require: Real Schur form T = Q^T A Q, set S of selected blocks.
Ensure: n-by-|S| complex eigenvector matrix X^S such that T X^S = X^S Λ^S.
function TiledBacksolve(T, S)
    X^S ← 0
    for k ← n_b, n_b − 1, ..., 1 do
        for j ← k, k − 1, ..., 1 do
            if k = j then
                X^S_{kk} ← Backsolve(T_{kk}, S)
            else
                X^S_{jk} ← Solve(X^S_{jk}, T_{jj}, blocks(T_{kk}) ∩ S)
            for h ← 1, ..., j − 1 do
                X^S_{hk} ← Update(X^S_{hk}, T_{hj}, X^S_{jk})
    return X^S

Besides Backsolve, two more functions are needed. First, Solve is a multi-shift backsolve T_{jj} X_{jk} = X_{jk} Λ_{kk}, similar to Backsolve. The shifts, the eigenvalues Λ_{kk} of the selected blocks in the tile T_{kk}, are an input argument. Second, Update is a matrix-matrix multiplication of the form C ← C − AB. With the partitioning introduced in the tiled backsolve, the backtransform becomes a structure-exploiting tiled matrix-matrix multiplication. While for readability some equations are written assuming complex numbers, all operations are executed in real arithmetic.

Next we outline the storage of the compact complex eigenvector matrix X^S illustrated in Fig. 1 (right). We denote the memory representation of X^S with X̂^S. The multiplication of a real-valued matrix Q and a complex-valued matrix X^S operates component-wise:

\[
Q X^S = Q [x_1, \ldots, x_{|S|}]
      = [Q\,\mathrm{Re}(x_1) + i\,Q\,\mathrm{Im}(x_1), \ldots, Q\,\mathrm{Re}(x_{|S|}) + i\,Q\,\mathrm{Im}(x_{|S|})]
      = [\mathrm{Re}(y_1) + i\,\mathrm{Im}(y_1), \ldots, \mathrm{Re}(y_{|S|}) + i\,\mathrm{Im}(y_{|S|})]
      = [y_1, \ldots, y_{|S|}] = Y^S.
\]

This motivates representing complex vectors in X^S as real and imaginary parts in adjacent columns in X̂^S. Consequently, X̂^S interleaves real eigenvectors and real and imaginary parts of complex eigenvectors. A complex eigenvector x_i has the memory representation [x̂_{i1}, x̂_{i2}] in X̂^S, where x̂_{i1} corresponds to the real part and x̂_{i2} to the imaginary part of x_i. The result of the matrix-matrix multiplication [Q x̂_{i1}, Q x̂_{i2}] can be interpreted as real and imaginary part of y_i based on the column index. As a consequence, the matrix-matrix multiplications arising in the backtransform and Update tasks can operate on the memory representation of tiles of X^S unaware of what type of eigenvector is stored.
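To make the last point concrete, here is a small C sketch of ours (not the paper's code) in which a single real matrix-matrix multiplication applied to the interleaved representation produces Re(y_i) and Im(y_i) as adjacent columns; a plain triple loop stands in for the DGEMM call an actual implementation would use.

```c
#include <stdio.h>

/* Plain column-major GEMM C = A*B (A is m-by-k, B is k-by-n, C is m-by-n).
 * In a real implementation this would be a single BLAS DGEMM call. */
static void gemm(int m, int n, int k, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int l = 0; l < k; l++)
                s += A[i + l * m] * B[l + j * k];
            C[i + j * m] = s;
        }
}

int main(void) {
    /* Q is 2-by-2 real; Xhat stores one complex eigenvector x = xr + i*xi
     * as two adjacent real columns [xr, xi], like Xhat^S in the text. */
    double Q[4]    = { 1, 2,   3, 4 };   /* column-major: [[1,3],[2,4]]     */
    double Xhat[4] = { 1, 0,   2, 1 };   /* xr = (1,0), xi = (2,1)          */
    double Yhat[4];
    gemm(2, 2, 2, Q, Xhat, Yhat);
    /* The columns of Yhat are Re(Q*x) and Im(Q*x). */
    printf("Re(y) = (%g, %g), Im(y) = (%g, %g)\n",
           Yhat[0], Yhat[1], Yhat[2], Yhat[3]);   /* expect Re=(1,2), Im=(5,8) */
    return 0;
}
```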
If only a few blocks are selected, the eigenvector tiles may be tall and skinny as illustrated in Fig. 1 (right). In the extreme case, the eigenvector tiles degrade into vectors. Due to limited cache reuse, matrix-matrix multiplications with thin matrices perform poorly. Since the widths of the eigenvector tiles are derived from the tile size in T and the number of selected blocks in a tile column, a larger tile size is a means to mitigate thin eigenvector tiles. A larger tile size, however, lowers the total number of tiles and reduces parallelism. A compromise between efficient matrix-matrix multiplications and exposure of parallelism can be found by tuning.

4. Shared-memory systems

In the following we present and justify the taskification of the tiled algorithms for the backsolve and the backtransform. Our implementation utilizes OpenMP as the shared memory API.

Backsolve. The sequential tiled algorithm in Algorithm 4 naturally defines tasks – each function call corresponds to a task. Next we specify data dependences between tasks. The starting point is Backsolve with solely outgoing dependences. Backsolve and Solve tasks are a prerequisite to Update tasks for above-lying tiles. That is, having computed X^S_{jk}, the tiles X^S_{hk}, h = 1, ..., j − 1, can be updated. Before a Solve on X^S_{jk} can be executed, all Update tasks on X^S_{jk} must have been completed. Fig. 2 illustrates the dependences for one tile column of X^S. Tile columns in the eigenvector matrix X^S do not exhibit any dependences. In other words, there is one task graph for each tile column.
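A minimal OpenMP sketch of this dependence pattern (our illustration, not the authors' implementation): one dummy payload per tile of a single eigenvector tile column lets the runtime derive exactly the task graph of Fig. 2; the tile kernels are replaced by prints.

```c
/* Compile with: cc -fopenmp taskgraph.c */
#include <stdio.h>
#include <omp.h>

#define NB 4              /* number of tile rows in the column, as in Fig. 2 */

int main(void) {
    double X[NB];          /* one payload per tile X_{j,k}; used only for depend() */
    int k = NB - 1;        /* index of the tile column (0-based)                   */

    #pragma omp parallel
    #pragma omp single
    {
        for (int j = k; j >= 0; j--) {
            if (j == k) {
                #pragma omp task depend(inout: X[j]) firstprivate(j)
                printf("Backsolve  X[%d]\n", j);
            } else {
                #pragma omp task depend(inout: X[j]) firstprivate(j)
                printf("Solve      X[%d]\n", j);
            }
            for (int h = 0; h < j; h++) {
                #pragma omp task depend(in: X[j]) depend(inout: X[h]) firstprivate(j, h)
                printf("Update     X[%d] -= T[%d][%d] * X[%d]\n", h, h, j, j);
            }
        }
    }   /* all tasks of this tile column are complete at the implicit barrier */
    return 0;
}
```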
Backtransform. The backtransform implements Y^S ← Q X^S as a structure-exploiting tiled matrix-matrix multiplication. We consider two options to define tasks. First, the computation of a tile Y^S_{hj} constitutes a task, see Algorithm 5 (left). Second, a fine-grained approach defines single updates Y^S_{hj} ← Y^S_{hj} + Q_{hk} X^S_{kj} as tasks, see Algorithm 5 (right). In order to keep the critical region as short as possible, the expensive matrix-matrix multiplication W ← Q_{hk} X^S_{kj} writes to a local buffer W; only the cheap accumulation Y^S_{hj} ← Y^S_{hj} + W is protected by a lock associated with Y^S_{hj}.
Algorithm 5 Task-based backtransform (left: coarse-grained; right: fine-grained).
Require: Orthogonal matrix Q partitioned analogously to T in Fig. 1 (left), block upper triangular matrix X^S partitioned as in Fig. 1 (right).
Ensure: Y^S = Q X^S.

Coarse-grained (left):
function TiledBacktransform(Q, X^S)
    Partition Y^S analogously to X^S
    Y^S ← 0
    for j ← 1, ..., n_b do
        for h ← 1, ..., n_b do
            task {
                Y^S_{hj} ← Q_{h,1:j} X^S_{1:j,j}
            }
    return Y^S

Fine-grained (right):
function TiledBacktransform(Q, X^S)
    Partition Y^S analogously to X^S
    Y^S ← 0
    for j ← 1, ..., n_b do
        for h ← 1, ..., n_b do
            for k ← 1, ..., j do
                task {
                    W ← Q_{hk} X^S_{kj}
                    lock(Y^S_{hj})
                    Y^S_{hj} ← Y^S_{hj} + W
                    unlock(Y^S_{hj})
                }
    return Y^S
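A compilable OpenMP sketch of the fine-grained variant (ours, with scalar stand-ins for the tiles and a plain multiplication in place of DGEMM): the product is formed outside the lock and only the accumulation into Y^S_{hj} is guarded by the lock attached to that tile.

```c
/* Compile with: cc -fopenmp finegrained.c */
#include <stdio.h>
#include <omp.h>

#define NB 3

int main(void) {
    double Q[NB][NB], X[NB][NB], Y[NB][NB] = {{0}};
    omp_lock_t lock[NB][NB];

    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            Q[i][j] = i + j + 1;            /* arbitrary test data          */
            X[i][j] = (i <= j) ? 1.0 : 0.0; /* block upper triangular X^S   */
            omp_init_lock(&lock[i][j]);
        }

    #pragma omp parallel
    #pragma omp single
    for (int j = 0; j < NB; j++)
        for (int h = 0; h < NB; h++)
            for (int k = 0; k <= j; k++) {
                #pragma omp task firstprivate(h, j, k) shared(Q, X, Y, lock)
                {
                    double W = Q[h][k] * X[k][j];  /* expensive part, lock-free */
                    omp_set_lock(&lock[h][j]);     /* short critical region     */
                    Y[h][j] += W;
                    omp_unset_lock(&lock[h][j]);
                }
            }
    /* all tasks are complete at the end of the parallel region */

    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            omp_destroy_lock(&lock[i][j]);
    printf("Y[0][2] = %g\n", Y[0][2]);   /* = Q[0][0]+Q[0][1]+Q[0][2] = 6 */
    return 0;
}
```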
Fig. 2. Task graph constructed for the tiled backsolve for k = 4 in Algorithm 4 (task types: Backsolve, Solve, Update). Tasks are laid out according to what tile T_{hj} is required.
Fusion. In a task-based implementation, the successive execution of backsolve and backtransform yields a synchronization point. If the backtransform exhibits fine-grained parallelism as in Algorithm 5 (right), backtransform tasks can become ready before the backsolve is completed. Fusion of the two separate task graphs removes the synchronization point and improves the performance for problems that are small relative to the number of cores available.

5. Distributed memory systems

This section extends the shared-memory implementation to distributed memory. We describe the initial data distribution of the matrices and depict how data is replicated to allow usage of the task-parallel algorithm from Section 4.

We assume T and Q to be 2D tile cyclically distributed as in ScaLAPACK.² X^S and Y^S are distributed tile column-wise. This choice is legitimate for X^S because X^S is an internal variable. Y^S, by contrast, is an output. A column distribution facilitates the column-wise validation of the eigenvectors. Furthermore, there is no communication during the backtransform. A 2D tile cyclic distribution of Y^S is equally feasible and incurs only a marginal performance penalty, which we discuss after the introduction of the algorithm.

² We refrain from terming this a '2D block cyclic distribution' to avoid ambiguity with a 'block' denoting 1-by-1/2-by-2 blocks on the diagonal of T.

The distributed eigenvector computation is listed in Algorithm 6. We replicate T and Q through asynchronous broadcasts. This allows us to reuse the task-parallel algorithm from Section 4 as a basic building block.
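As an aside on the 2D tile cyclic distribution used for T and Q, the owner of tile (i, j) on a p_r-by-p_c process grid can be computed as in this generic sketch (ours, following the usual ScaLAPACK-style cyclic mapping, not code from the paper):

```c
#include <stdio.h>

/* Rank that owns tile (i, j) in a 2D tile cyclic distribution over a
 * pr-by-pc process grid (row-major rank numbering, 0-based tile indices). */
static int tile_owner(int i, int j, int pr, int pc) {
    return (i % pr) * pc + (j % pc);
}

int main(void) {
    /* 2-by-2 process grid: print the owners of a 4-by-4 tile grid. */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
            printf("%d ", tile_owner(i, j, 2, 2));
        printf("\n");
    }
    return 0;
}
```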
Algorithm 6 Distributed memory + OpenMP eigenvector computation.
Require: 2D tile cyclically distributed upper quasi-triangular matrix T, 2D tile cyclically distributed orthogonal matrix Q, set S of selected blocks, where T and Q constitute the real Schur decomposition A = QTQ^T.
Ensure: Eigenvector matrix Y^S such that A Y^S = Y^S Λ^S.
function DistributedEigenvectors(T, Q, S)
    Compute partitioning of X^S, Y^S into n_r tile columns and distribute the tile columns among the r processes
    Replicate T through asynchronous all-to-all broadcast
    Wait for broadcast of T to complete
    Replicate Q through asynchronous all-to-all broadcast
    for k ← 1, ..., n_r do
        if my rank owns X^S_{:k} then
            Allocate X^S_{:k}, Y^S_{:k}
            Thread-parallel X^S_{:k} ← TiledBacksolve(T, blocks(T_{:k}) ∩ S)
    Wait for broadcast of Q to complete
    for k ← 1, ..., n_r do
        if my rank owns X^S_{:k} then
            Thread-parallel Y^S_{:k} ← TiledBacktransform(Q, X^S_{:k})
    return Y^S
As soon as a process holds a full copy of T, it computes its share of the eigenvectors starting with the backsolve. The process proceeds with the backtransform as soon as a copy of Q is available. Since there is no synchronization between processes beyond what is implied by the replications, load balancing happens over the successive execution of backsolve and backtransform. We use flops as the metric. The ideal workload is computed by dividing the total amount of flops by the process count. 1-by-1 and 2-by-2 blocks are successively assigned to a process until the ideal workload is reached, see Algorithm 7. To enable task parallelism, each process partitions its share of eigenvectors further into tiles. The tile size is chosen as in the shared memory implementation, i.e., the tile heights conform with the partitioning of T and the tile widths are derived. We aim for one rank per node, using solely task parallelism within a node. Our implementation uses the master thread to handle all communication. While the backsolve is computed, the master thread handles the broadcast of Q and then joins the team of threads to process the tasks. Fig. 3 summarizes the assignments of the threads in an ideal case, where the replication of Q can be
nearly fully overlapped with the backsolve. Since only the master thread issues MPI calls, the MPI implementation is only required to have a low level of thread support, which avoids costly synchronization to guarantee thread safety within MPI.

Fig. 3. Program execution of DistributedEigenvectors (per node: the master thread first handles the communication of T and Q and then joins the workers; worker threads execute backsolve and backtransform tasks).
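The structure of Algorithm 6 on one rank can be sketched in a few lines of MPI host code (ours, heavily simplified: a single broadcast root stands in for the all-to-all replication, dense buffers replace the tiled storage, and trivial stubs replace the OpenMP-parallel kernels of Section 4):

```c
/* Compile with: mpicc overlap.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void tiled_backsolve(const double *T, int n, double *X) {
    (void)T; (void)n; (void)X;   /* placeholder for Algorithm 4 */
}
static void tiled_backtransform(const double *Q, int n, double *X) {
    (void)Q; (void)n; (void)X;   /* placeholder for Algorithm 5 */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n = 1000;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *T = calloc((size_t)n * n, sizeof(double));
    double *Q = calloc((size_t)n * n, sizeof(double));
    double *X = calloc((size_t)n * n, sizeof(double));  /* this rank's tile columns */

    MPI_Request req_T, req_Q;
    /* Asynchronous replication of T and Q. */
    MPI_Ibcast(T, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req_T);
    MPI_Ibcast(Q, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req_Q);

    MPI_Wait(&req_T, MPI_STATUS_IGNORE);   /* the backsolve needs all of T      */
    tiled_backsolve(T, n, X);              /* overlapped with the broadcast of Q */

    MPI_Wait(&req_Q, MPI_STATUS_IGNORE);   /* the backtransform needs Q         */
    tiled_backtransform(Q, n, X);

    if (rank == 0) printf("done\n");
    free(T); free(Q); free(X);
    MPI_Finalize();
    return 0;
}
```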
Algorithm 7 Load balancing.
Require: Upper quasi-triangular T as in (1), process count r, set S of selected blocks.
Ensure: Assignment of blocks to processes.
function DistributeLoad(T, r, S)
    for k ← 1, 2, ..., m do
        w_k ← 0
        if T_{k,k} ∈ S ∧ T_{k,k} is a 1-by-1 block then
            w_k ← k² + 2m(k − 1)
        if T_{k,k} ∈ S ∧ T_{k,k} is a 2-by-2 block then
            w_k ← 2k² + 4m(k − 1)
    w_total ← Σ_{k=1}^{m} w_k
    w_ideal ← w_total / r
    k ← 1
    for process j ← 1, 2, ..., min{r, m} do
        w_j ← 0
        while w_j ≤ w_ideal ∧ k ≤ m do
            Assign T_{k,k} to process j
            w_j ← w_j + w_k
            k ← k + 1
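A direct C transcription of Algorithm 7 (our illustration; block indices are 1-based as in the pseudocode, and every block, selected or not, is assigned an owner):

```c
#include <stdio.h>

/* Greedy load balancing as in Algorithm 7 (1-based block indices k = 1..m).
 * selected[k] != 0 marks a selected block T_{k,k}; size2[k] != 0 marks a
 * 2-by-2 block. owner[k] receives the 1-based process id of block k. */
static void distribute_load(int m, int r, const int *selected, const int *size2,
                            int *owner) {
    double w[m + 1], w_total = 0.0;
    for (int k = 1; k <= m; k++) {
        w[k] = 0.0;
        if (selected[k])
            w[k] = size2[k] ? 2.0 * k * k + 4.0 * m * (k - 1)
                            : 1.0 * k * k + 2.0 * m * (k - 1);
        w_total += w[k];
        owner[k] = 0;
    }
    double w_ideal = w_total / r;
    int k = 1;
    for (int j = 1; j <= (r < m ? r : m); j++) {   /* processes 1..min(r,m) */
        double wj = 0.0;
        while (wj <= w_ideal && k <= m) {
            owner[k] = j;          /* assign T_{k,k} to process j */
            wj += w[k];
            k++;
        }
    }
}

int main(void) {
    enum { m = 8 };
    int selected[m + 1] = { 0, 1, 1, 1, 1, 1, 1, 1, 1 };  /* all blocks selected */
    int size2[m + 1]    = { 0 };                          /* all blocks 1-by-1   */
    int owner[m + 1];
    distribute_load(m, 2, selected, size2, owner);
    for (int k = 1; k <= m; k++)
        printf("block %d -> process %d\n", k, owner[k]);  /* 1-6 -> 1, 7-8 -> 2 */
    return 0;
}
```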
In the algorithm, Y^S is distributed column-wise. Previously we claimed that a 2D tile cyclic distribution is equally reasonable. The process in charge of computing the tile Y^S_{hj} has all data readily available because Q was broadcast and X^S_{:j} was computed on this process. The output tile Y^S_{hj} possibly has to be sent to its final location. Given the ratio of computation and communication, the transmission can be fully overlapped with the computation of other tiles. A small performance penalty arises from transmitting the very last tile when all computations have been completed.

Next we consider the memory requirement. The upper quasi-triangular matrix T is replicated across r processes, amounting to Θ(n²r) bytes. Due to the tiled memory layout, tiles on the diagonal of T exhibit some explicitly stored zero entries. Q is replicated, too, and allocates Θ(n²r) bytes. X^S and Y^S are split across processes, allocating an equal amount of Θ(n|S|) bytes. Due to the replication of T and Q, the set of problems solvable through the algorithm is limited by the amount of memory available on one node. On the machine available to us (see Section 7.1), this set covers matrix dimensions n ≤ 100,000 using double-precision arithmetic.

Next, we develop an analytic model to analyze the strong scaling behavior of Algorithm 6. This allows us to determine the point up to which it is beneficial to add execution units for a fixed problem size n. The execution units comprise r processes with q cores each, summing to rq cores in total. Since our problem is rich in level-3 BLAS operations, we assume that the computation is compute-bound and flops are a reasonable metric to model the compute time. Assuming perfect load balancing and computation at the peak rate R_peak, the compute time of the backsolve and the backtransform can be modeled as

\[
t_{\mathrm{backsolve}}(r, q, n) = \frac{F_{\mathrm{backsolve}}(n)}{rqR_{\mathrm{peak}}}
\quad\text{and}\quad
t_{\mathrm{backtransform}}(r, q, n) = \frac{F_{\mathrm{backtransform}}(n)}{rqR_{\mathrm{peak}}},
\]

where F_backsolve and F_backtransform denote the flops of the backsolve and the backtransform, respectively. Initially, the matrices T and Q are distributed over r processes such that each node holds roughly 1/rth of the (n² + n)/2 and n² non-zero entries of T and Q, respectively. We model the two all-gather collective communications to be composed of a startup time and a transmission time. The startup time is a small constant depending on the processor count r. The data is transmitted at the peak data transmission rate R_comm. If the collective communication is implemented using pipelining on a ring topology, the communication can be modeled as

\[
t_{\mathrm{comm}\,T}(r, n) = O(r) + \frac{(r - 1)n^2}{2rR_{\mathrm{comm}}}
\quad\text{and}\quad
t_{\mathrm{comm}\,Q}(r, n) = O(r) + \frac{(r - 1)n^2}{rR_{\mathrm{comm}}}.
\]

Coupling these components, the model of the total execution time is

\[
t_{\mathrm{total}}(r, q, n) = t_{\mathrm{comm}\,T}(r, n)
+ \max\{t_{\mathrm{comm}\,Q}(r, n),\; t_{\mathrm{backsolve}}(r, q, n)\}
+ t_{\mathrm{backtransform}}(r, q, n).
\]

Fig. 4. Execution time model of DistributedEigenvectors (t_total as a function of the node count; the communication cost is added in the multi-node environment, and t_comm Q = t_backsolve marks the point beyond which the replication of Q can no longer be fully overlapped).

Fig. 4 illustrates the model for realistic values of R_comm, R_peak and q when computing 1/3 of the eigenvectors for n = 40,000. It is an upper bound for the achievable speedup. As long as the computation is executed on one node, there is no communication cost. When the computation is executed on more than one node, the cost of replicating T is added to the total execution time. When the replication of Q cannot be overlapped with the backsolve, the replication cost of Q contributes fully and the speedup slows down.
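To see how the model behaves numerically, the following sketch evaluates t_total over a few node counts. It is purely illustrative: the machine parameters follow Section 7.1, but the flop counts F_backsolve and F_backtransform are rough stand-ins of our own (the paper does not state them here), the O(r) startup terms are dropped, and R_comm is interpreted in bytes per second with 8 bytes per matrix entry.

```c
#include <stdio.h>

/* Illustrative evaluation of the execution time model of Section 5.
 * The flop counts below are assumptions chosen for illustration only. */
int main(void) {
    const double n      = 40000.0;
    const double q      = 28.0;      /* cores per node (Broadwell)        */
    const double Rpeak  = 41.6e9;    /* flops/s per core (Section 7.1)    */
    const double Rcomm  = 6.0e9;     /* bytes/s effective node bandwidth  */
    const double Fsolve = n * n * n / 18.0;       /* assumed backsolve flops     */
    const double Ftrans = 2.0 * n * n * n / 3.0;  /* assumed backtransform flops */

    for (int r = 1; r <= 16; r *= 2) {
        double t_bs = Fsolve / (r * q * Rpeak);
        double t_bt = Ftrans / (r * q * Rpeak);
        double t_cT = (r - 1) * n * n * 8.0 / (2.0 * r * Rcomm);
        double t_cQ = (r - 1) * n * n * 8.0 / (r * Rcomm);
        double t    = t_cT + (t_cQ > t_bs ? t_cQ : t_bs) + t_bt;
        printf("r = %2d nodes: t_total = %7.2f s\n", r, t);
    }
    return 0;
}
```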
6. Qualitative comparison with existing implementations

Next, we relate our two algorithms to two library implementations. First, we compare with the multi-threaded implementation that is part of MAGMA [5]. Second, we contrast our algorithms with the distributed memory implementation in Elemental [10].

MAGMA. Eigenvectors are computed by the successive execution of the backsolve and, optionally, the backtransform. The backsolve relies on the triangular matrix T − λ_k I, which has a different diagonal for each eigenvector. Parallelism is enabled by casting the backsolve as a quasi-triangular backward substitution such that the shift λ_k I is applied without modifying T. As a consequence, multiple eigenvectors can be processed in parallel. Since the backsolve is parallelized over the eigenvector columns, each backsolve corresponds to a level-2 BLAS operation. The backtransform of all eigenvectors is parallelized as in Algorithm 5 (left) with a static tile size.

Fig. 5. Logical partitioning in Elemental. The dark region is a single update task; the backsolve/solve task is colorized lightly.

Elemental. We compare with SafeMultiShiftTrsm, extrapolating how a routine tailored for eigenvectors could look. Hence, we have a rectangular matrix of right-hand sides rather than eigenvectors. All matrices are distributed over the processes in an element-wise fashion. An iteration proceeds as follows. The matrix T is logically partitioned into a square tile, used for backsolve or solve tasks, and the remaining tile column, which is used in linear updates, see Fig. 5 (left). The logical partitioning of the right-hand sides into tile rows is conforming, see Fig. 5 (right). Backsolve and solve tasks operate on a column level as our distributed memory algorithm does on a tile column level: the relevant submatrix of T is replicated. Parallelism is introduced through the independent computation of the right-hand side columns. Prior to an update, the rows over the relevant submatrix of T and the right-hand sides are replicated across a subset of ranks such that one rank holds all data necessary to execute the update as a matrix-vector multiplication. The communication pattern corresponds to a row-all-gather and a column-all-gather, respectively. For a reasonable ratio of ranks and right-hand sides, a rank is in charge of many right-hand sides. Consequently, an update corresponds to a level-3 BLAS operation. The output matrix is again distributed element-wise. This completes the current iteration and the algorithm proceeds with the next tile column in T. Since the communication is realized with synchronous MPI communication, backsolve/solve and trailing matrix updates are executed alternatingly.

Our algorithm partitions the matrices into regular tiles, recall Fig. 1. The execution order is up to the OpenMP runtime system. It is typically the case that independent solve and update tasks of the same eigenvector tile column are executed in parallel.

Elemental implements robustness with global scaling factors, one scaling factor for each eigenvector. The eigenvector is scaled in its entirety whenever a backsolve, solve or update task triggers scaling, which may result in a cascade of scalings being applied to parts of the eigenvector without intermediate calculations. Our algorithms are designed with robustness in mind. We aim for local scaling factors, one scaling factor for each eigenvector segment per tile, compare with [9]. As a consequence, the tiled structure and the asynchronous execution model can be preserved while the overhead from scaling is minimized. Every scaling of a tile is immediately followed with some computation on that tile. The local scaling factors have to be consolidated into a global scaling factor in a postprocessing step.

7. Experiment setup

In the following, we describe the hardware and software used for the experiments and how measurements were conducted.
7.1. Test environment

Our experiments were run on Kebnekaise, a supercomputer situated at High Performance Computing Center North (HPC2N) in Umeå, Sweden. We employed Intel Xeon E5-2690 v4 ("Broadwell") nodes and Intel Xeon Phi 7250 ("KNL") nodes. Kebnekaise, by default, had dynamic frequency scaling enabled.

Each Broadwell node has 128 GB of memory and contains 28 cores organized in two NUMA islands with 14 cores each. In double-precision arithmetic, the theoretical peak performance is 41.6 GFLOPS/s per core and 1164.8 GFLOPS/s per node at the base frequency of 2.6 GHz. The real peak performance deviates due to dynamic frequency scaling and hinges on the usage of AVX instructions and the core count. The communication between the Broadwell nodes relies on a Mellanox Technologies MT27600 Infiniband interconnect with an effective bandwidth of 6000 MB/s.

The KNL features 68 cores. Pairs of cores are grouped into tiles that share an L2 cache. A total of 34 tiles are arranged in a 2D mesh and operated in quadrant mode. In this mode, the chip is divided into four quadrants. Memory requests are handled by the tag directory and the memory channel located on the same quadrant as the core that generates the request. The theoretical peak performance in double-precision arithmetic is 44.8 GFLOPS/s per core and 3046.4 GFLOPS/s per full node. The 16 GB MCDRAM is configured as L3 cache. One node reaches 330 GB/s in the STREAM Triad benchmark.

Measurements on Broadwell used the Intel compiler 17.0.4 linked against single-threaded MKL 2017.3.196. Measurements on KNL used Intel compiler 18.01 and MKL 2018.1.163. The optimization flags were set to -Ofast -unroll-aggressive -xHost -qopt-prefetch -malign-double. Moreover, interprocedural optimizations were activated with -ipo. We used OpenMP 4.5 as the shared-memory API. Threads were bound to cores by setting KMP_AFFINITY to compact. Thereby threads are placed as close as possible, such that a NUMA island is filled completely before another NUMA island is used.

7.2. Selection of eigenvectors

When the Schur decomposition is computed, eigenvalues can occur in any order. It is therefore reasonable to randomly select 1-by-1 or 2-by-2 blocks on the diagonal, independent of their value and their position. We follow the selection procedure in [11] and select blocks randomly with some probability p. Joining the sets of probabilities used in [11] and [4], we consider p ∈ {0.05, 0.15, 0.35, 0.5, 1}.

7.3. Choice of the tile size

The eigenvector computation spends the majority of the time in matrix-matrix multiplications, which require a certain size in order to be efficient. Based on experience, we propose as tile size
\[
b =
\begin{cases}
\dfrac{161}{p^{1/3}} & \text{if } p \in \{0.05, 0.15\}, \\[1ex]
\dfrac{192}{p^{1/3}} & \text{if } p \in \{0.35, 0.5, 1\}.
\end{cases}
\tag{7}
\]
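A literal transcription of (7) into C (our own snippet; truncating the quotient to an integer reproduces the b column of Table 1):

```c
/* Compile with: cc tilesize.c -lm */
#include <math.h>
#include <stdio.h>

/* Tile size heuristic of Eq. (7); truncating the quotient to an integer
 * reproduces the b column of Table 1. */
static int tile_size(double p) {
    double c = (p <= 0.15) ? 161.0 : 192.0;
    return (int)(c / cbrt(p));
}

int main(void) {
    const double ps[] = { 0.05, 0.15, 0.35, 0.5, 1.0 };
    for (int i = 0; i < 5; i++)
        printf("p = %.2f  ->  b = %d\n", ps[i], tile_size(ps[i]));
    /* expected: 437, 303, 272, 241, 192 */
    return 0;
}
```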
We validated the values of b experimentally. We searched for an optimal tile size through a blind sweep. We emphasize that the tile size considered optimal can be biased due to a single evaluation with a random selection of the eigenvalues. Table 1 compares b with the best values obtained through blind sweeps. There are typically several local minima, such that there may be more than one optimal tile size. The performance with b chosen as in (7) was less than 20% worse than the optimal tile size found by the sweep.

Table 1
Comparison of the generic tile size b specified in (7) against optimal values obtained from tuning. The parentheses give the performance degradation in %. The thread counts are set to 68 for KNL and 28 for Broadwell.

p     | b   | KNL, n = 10,000 | Broadwell, n = 10,000 | Broadwell, n = 40,000
0.05  | 437 | 396 (2)         | {376, 408} (9)        | 936 (19)
0.15  | 303 | 284 (2)         | {296, 360} (7)        | 360 (13)
0.35  | 272 | 136 (14)        | 352 (7)               | 176 (4)
0.5   | 241 | 356 (9)         | {264, 276, 288} (3)   | 328 (11)
1     | 192 | 356 (2)         | 192 (0)               | 328 (7)

8. Performance results

This section presents strong scaling results relative to the capabilities of the machine. Strong scaling experiments fix the problem size and then increase the number of execution units. Ideally the efficiency is constant.

8.1. Shared memory

In the following, performance results of the shared-memory implementation are presented, first for Broadwell and then for KNL. All results report the median performance of three runs. All measurements were executed with exclusive node access.

Broadwell. Our implementation employed the tile layout and implemented the backtransform as in Algorithm 5 (left). The test matrices exhibited 50% complex eigenvalues, i.e., the diagonal of T has n/2 1-by-1 and n/4 2-by-2 blocks. Fig. 6 shows the results. The multi-threaded MKL implementation of DGEMM serves as reference. The work of the matrix-matrix multiplication C ← AB with matrix dimensions (m̂, n̂, k̂) := (n, n, 2n/3) approximately matches the flops in the case p = 1. In the case of n = 20,000, all tile sizes were obtained through tuning. For the remaining matrix sizes, the tile sizes were only tuned for 28 cores and were otherwise chosen according to (7). The performance generally improved as the work increased. The nodes had dynamic frequency scaling enabled, which explains why runs can exceed the theoretical peak performance. It also explains the wedge in the performance degradation. The frequency boost reduces to the minimum when 8 cores of the first NUMA island are used. With proper tuning, we expect the performance to follow the shape of the DGEMM curve more closely. The speedup obtained with 28 cores ranges between 8 and 16 for n = 20,000. The speedup is measured relative to a sequential run of our algorithm executed without OpenMP tasks, i.e., the execution order follows the loop structure of Algorithms 4 and 5 (left). Due to reduced frequency boosts, the best achievable speedup is 23.2. Under the assumption that the sequential run is close to the best possible sequential run, 28 cores achieve a speedup in the range of 11 and 16 for n = 10,000 and 15 and 18 for n = 40,000.

Fig. 6. Strong scaling for various select ratios on shared memory (Broadwell). DGEMM performance included for reference.

KNL. Fig. 7 shows the strong scaling results on KNL with 100% complex eigenvalues using the tile layout. Fig. 7 (center) fused backsolve and backtransform and implemented the backtransform as in Algorithm 5 (right). The other experiments were set up analogously to those on Broadwell.
For reference, the average performance of the micperf MKL DGEMM benchmark [12] is included. The dimensions of the matrix-matrix multiplications are identical to the DGEMM reference on Broadwell. In addition, the matrix dimensions were rounded to multiples of 8 to benefit from alignment. For n = 5000, the performance was measured for all cores. Successive core counts yielded the same runtime, which explains the wild oscillations for small core counts in the performance plot. For n = 40,000, the performance was only measured using 68 cores. The performance improves with increasing load only as long as the matrices fit into the MCDRAM. For n = 40,000, the matrices require more than 16 GB of storage, which explains why the performance of the DGEMM benchmark is about 10% better for n = 5000 than for n = 40,000. Since hyperthreading did not give any observable performance improvement, tuning was restricted to 68 cores. The tile sizes were obtained through a coarse sweep over a wide range of values and a fine sweep for regions of interest. The performance curve exhibited several local minima. More fine-grained tuning can probably improve the performance even for 68 cores. The tile sizes for all other core counts are chosen according to Section 7.3. We expect proper tuning to remove performance wiggles and to restore that larger select ratios yield better performance.

Fig. 7. Strong scaling for various select ratios on shared memory (KNL). DGEMM performance included for reference.

Fig. 7 (top, center) evaluates fusion for n = 5000. The unfused scheme outperforms the fused scheme for small core counts, most notably the sequential run. For larger worker counts the fused scheme is slightly faster. For larger problems the additional parallelism from fine-grained tasks could not compensate for the performance advantage of coarse-grained DGEMM tasks, see Algorithm 5. For n = 40,000 fusion degraded the performance. Since the matrices do not fit into the MCDRAM, the out-of-order task execution increases the cache misses. The speedup ranges between 7 and 23 (n = 5000) and 30 and 48 (n = 40,000) over a sequential run of our algorithm executed without OpenMP.
8.2. Distributed memory with OpenMP

This section summarizes the results for the distributed memory algorithm on Broadwell. The reference pure shared-memory run, too, was executed with MPI functionality. Fig. 8 shows the strong scaling results for up to four nodes.

Fig. 8. Strong scalability of DistributedEigenvectors (Broadwell).

With one MPI process per node and 28 OpenMP threads per MPI process, up to 112 cores are used in total. Expanding to more than one node slows down the computation for small select ratios. The shared memory strong scaling results on Broadwell (Fig. 6) showed that the performance for p = 0.05 is poor and cutting the work degrades the efficiency further. For p = 0.05, four nodes are 2.1 times (n = 10,000) slower than the shared-memory implementation. In the case of n = 40,000, the runtime stays approximately constant with an increasing number of nodes. Speedup can be observed for larger
loads. For four nodes and p = 1, the speedup ranges between 1.4 (n = 10,000) and 2.8 (n = 40,000).

Fig. 9. Runtime decomposition for n = 40,000, p = 1, four separately timed nodes (per-node backsolve and backtransform runtimes in seconds).

Fig. 9 analyzes the load balance for an example with n = 40,000 and all blocks selected (p = 1) on four full nodes. To analyze only the load balance, without possible performance penalties from communication, all communication is completed before the computation is started. More precisely, T and Q are replicated, all nodes are synchronized through a barrier, and then the computation is started and timed. Since there is no synchronization between backsolve and backtransform, the compute times of these two algorithmic steps sum up. Striving for perfect load balance, the added (rather than separately timed) execution time of backsolve and backtransform should be approximately equal.

Next we analyze to what degree computation and communication can be overlapped. The replication of T and the replication of Q were exclusively timed: the processes are synchronized with a barrier, submit the MPI messages necessary for the asynchronous replication, wait for the arrival of all tiles and are synchronized again. The communication costs are 2.7 s for T and 5.1 s for Q. The replication of the dense matrix Q takes approximately double the time of replicating the upper quasi-triangular matrix T. The timings indicate that the non-overlapped transmission of T is tolerable compared to the compute times. The replication of Q can mostly be overlapped with the computation of the backsolve. Only node 1 has a small window of idle time. Note that these numbers may change slightly when the communication and the computation are actually overlapped rather than being separately timed.

9. Conclusion and further work

This paper explored the scalability of eigenvector computation from the real Schur form. We demonstrated that task parallelism, possibly extended to distributed memory, has the potential to solve the eigenvector problem efficiently. Our task-parallel implementation
removes the sequential phases in the LAPACK and ScaLAPACK fork-join implementations.

Our next step is to add robustness, i.e., protection against floating-point overflows. If eigenvalues are clustered and one of the corresponding eigenvectors is sought, the matrix T − λI can be ill-conditioned. In order to avoid overflow during the backsolve, eigenvector columns have to be downscaled. We plan to augment our backsolve routine with robustness as advocated in [7,8] and compare against the current state-of-the-art routine DTREVC3 in LAPACK. The approach described in both papers must be extended to also cover (a) complex divisions executed in real arithmetic and (b) the solution of small linear systems arising from 2-by-2 blocks on the diagonal of T. We highlight that our distributed memory + OpenMP scheme is particularly well suited for robustness. Since the eigenvectors are column-distributed, scaling factors propagate only within nodes and thus do not require any inter-node communication.

Acknowledgments

The authors highly appreciate the quick and competent help of the staff at High Performance Computing Center North (HPC2N), Umeå. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671633. Support has also been received by eSSENCE, a collaborative e-Science programme funded by the Swedish Government via the Swedish Research Council (VR) grant no. UFV 2010/149.

References

[1] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, 1996.
[2] E. Anderson, Z. Bai, C. Bischof, L.S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, et al., LAPACK Users' Guide, SIAM, 1999.
[3] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK: A portable linear algebra library for distributed memory computers – Design issues and performance, Comput. Phys. Commun. 97 (1–2) (1996) 1–15.
[4] B. Adlerborn, C.C. Kjelgaard Mikkelsen, L. Karlsson, B. Kågström, Towards highly parallel and compute-bound computation of eigenvectors of matrices in Schur form, Technical Report NLAFET Working Note 10, 2017.
[5] M. Gates, A. Haidar, J. Dongarra, Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem, in: International Conference on High Performance Computing for Computational Science, Springer, 2014, pp. 182–191.
[6] M.R. Fahey, New Complex Parallel Eigenvalue and Eigenvector Routines, Technical Report, Computational Migration Group, Computer Science Corporation, 2001.
[7] C.C. Kjelgaard Mikkelsen, L. Karlsson, Blocked algorithms for robust solution of triangular linear systems, in: International Conference on Parallel Processing and Applied Mathematics, Springer, 2017, pp. 68–78.
[8] C.C. Kjelgaard Mikkelsen, L. Karlsson, Robust Solution of Triangular Linear Systems, Technical Report NLAFET Working Note 9, 2017.
[9] C.C. Kjelgaard Mikkelsen, A.B. Schwarz, L. Karlsson, Parallel robust solution of triangular linear systems, Concurr. Comput. (2018) e5064.
[10] T. Moon, J. Poulson, Accelerating eigenvector and pseudospectra computation using blocked multi-shift triangular solves, arXiv:1607.01477 (2016).
[11] M. Myllykoski, C.C. Kjelgaard Mikkelsen, L. Karlsson, B. Kågström, Task-Based Parallel Algorithms for Eigenvalue Reordering of Matrices in Real Schur Forms, Technical Report NLAFET Working Note 11, 2017.
[12] Micperf User Guide: A Unified Interface for Benchmark Tools on the Intel Xeon Phi Processor X200 Product Family, 2016.