A generalized Block FSAI preconditioner for nonsymmetric linear systems

Massimiliano Ferronato*, Carlo Janna, Giorgio Pini
Department of ICEA, University of Padova, via Trieste 63, 35121 Padova, Italy

Journal of Computational and Applied Mathematics 256 (2014) 230–241

Article history: Received 1 August 2012
Keywords: Parallel preconditioning; Factorized approximate inverse; Unsymmetric matrices

Abstract

The efficient solution of nonsymmetric linear systems is still an open issue, especially on parallel computers. In this paper we generalize to the unsymmetric case the Block Factorized Sparse Approximate Inverse (Block FSAI) preconditioner, which has already proved very effective on symmetric positive definite (SPD) problems. Block FSAI is a hybrid approach combining an "inner" preconditioner, aimed at transforming the system matrix structure to block diagonal, with an "outer" one, a block diagonal incomplete or exact factorization intended to improve the conditioning of each block. The proposed algorithm is tested on a number of large matrices and shows both good robustness and scalability.

© 2013 Elsevier B.V. All rights reserved.
1. Introduction

The solution of:

$$A \mathbf{x} = \mathbf{b} \tag{1}$$

where $A \in \mathbb{R}^{n \times n}$ is large and sparse and $\mathbf{x}, \mathbf{b} \in \mathbb{R}^n$, is a central task in scientific computing. The continuous improvement of parallel computers affords the opportunity to implement numerical models with up to several million unknowns, provided that an efficient solver is available. Iterative methods based on Krylov subspaces can be easily adapted to parallel computations, including the most recent hardware technology such as GPU accelerators and hybrid CPU–GPU architectures, e.g. [1–4]. However, an efficient parallel implementation of most traditional preconditioners, such as incomplete LU (ILU) factorizations, is still a challenge, so that a great interest in the research and development of novel parallel preconditioners can be found in the recent literature, e.g. [5–10] just to cite a few.

A novel preconditioner for the parallel solution of symmetric positive definite (SPD) problems is the Block Factorized Sparse Approximate Inverse (Block FSAI) algorithm [11,12]. Block FSAI is a generalization of the FSAI preconditioner developed by Kolotilina and Yeremin [13], with the aim of minimizing the distance, in the sense of the Frobenius norm, between the preconditioned matrix and an arbitrary block diagonal matrix. Such a minimization can also be performed using a dynamically generated non-zero pattern, giving rise to the Adaptive Block FSAI (ABF). Then another block diagonal preconditioner, an "outer" preconditioner so to say, can be applied, e.g., a block diagonal Incomplete Cholesky (IC) decomposition or even a direct solver for each block. In combination with IC, Block FSAI has proved to be quite a robust and efficient tool for the parallel solution of both linear systems [11,12] and eigenproblems [14] in different engineering applications, ranging from computational fluid dynamics, to contact problems, to structural mechanics.
* Corresponding author. Tel.: +39 049 8271332; fax: +39 049 8271333. E-mail address: [email protected] (M. Ferronato). URL: http://www.dmsa.unipd.it/~ferronat (M. Ferronato).

http://dx.doi.org/10.1016/j.cam.2013.07.049
Fig. 1. Schematic representation of the non-zero patterns $S_{BD}$, $S_{BL}$ and $S_{BU}$.
Preconditioning unsymmetric indefinite matrices is generally much more difficult than preconditioning SPD ones. The reason is twofold. First, clustering the eigenvalues of the preconditioned matrix away from 0 does not theoretically guarantee that the Krylov subspace method is actually accelerated, as the convergence of nonsymmetric solvers such as GMRES also depends on the conditioning of the matrix of the eigenvectors [15]. Second, preconditioners for nonsymmetric matrices are typically much less robust than those for SPD ones, with unstable numerical behaviors and quasi-singularities as frequent and undesired occurrences. Currently available nonsymmetric preconditioners include incomplete LU decompositions (e.g. ILUT, [16,17]), factorized and non-factorized approximate inverses (e.g. AINV, [18], and SPAI, [19], respectively), and algebraic multigrid (e.g. AMG with smoothed aggregation, [20,21]). Among the above mentioned algorithms, SPAI and AMG can be efficiently implemented on parallel computers, while ILUT and AINV require sophisticated graph partitioning techniques to gain some parallelism, e.g. [22].

The aim of the present work is to extend the Block FSAI approach developed in [11] to unsymmetric indefinite matrices. The Unsymmetric Block FSAI (UBF) formulation is derived by properly modifying the function measuring the distance of the preconditioned matrix from an arbitrary block diagonal matrix. In contrast with the SPD case, existence and uniqueness of UBF can become an issue deserving an appropriate discussion. Then, UBF is coupled with an ILU decomposition as the "outer" preconditioner. The numerical performance of the proposed algorithm is compared to that of two competing alternatives in a number of test cases arising from different applications. A discussion of the results, along with some hints on possible future developments, closes the paper.

2. The Unsymmetric Block FSAI (UBF) preconditioner

The following notation is used throughout the paper:
• $[A]_{ij}$ denotes the coefficient in row $i$ and column $j$ of matrix $A$;
• $\mathbf{v}[I]$ denotes the subvector of $\mathbf{v}$ made of the components with index belonging to the set $I$;
• $A[I, J]$ denotes the submatrix of $A$ made of the coefficients $[A]_{ij}$ such that $i \in I$ and $j \in J$;
• $A[I, :]$ denotes the submatrix of $A$ made of the rows with index belonging to the set $I$;
• $A[:, J]$ denotes the submatrix of $A$ made of the columns with index belonging to the set $J$.
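For concreteness, this notation maps directly onto numpy-style indexing. The following sketch is our own illustration (numpy and all names here are assumptions, not part of the paper):

```python
import numpy as np

A = np.arange(16.0).reshape(4, 4)   # small dense stand-in for a matrix A
v = np.arange(4.0)
I, J = [0, 2], [1, 3]               # index sets I and J (0-based here)

a_ij = A[0, 1]               # [A]_{ij}: coefficient in row i, column j
v_I  = v[I]                  # v[I]: subvector with indices in I
A_IJ = A[np.ix_(I, J)]       # A[I, J]: coefficients [A]_{ij} with i in I, j in J
A_I  = A[I, :]               # A[I, :]: rows with index in I
A_J  = A[:, J]               # A[:, J]: columns with index in J
```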
The basic idea of Block FSAI for SPD matrices [11] is to generalize the native FSAI algorithm [13]. Let $S_{BD}$ and $S_{BL}$ be $n \times n$ block diagonal and lower block triangular non-zero patterns, respectively, with $n_b$ diagonal blocks of variable size (Fig. 1), and $\mathcal{W}_{BD}$ and $\mathcal{W}_{BL}$ the sets of matrices with patterns $S_{BD}$ and $S_{BL}$, respectively. The Block FSAI preconditioner of an SPD matrix $A$ reads:

$$M_{BF}^{-1} = F^T F \tag{2}$$

where $F \in \mathcal{W}_{BL}$ is computed such that:

$$\| D - FL \|_F \rightarrow \min \tag{3}$$

with $D \in \mathcal{W}_{BD}$ an arbitrary full-rank block diagonal matrix and $L$ the lower Cholesky factor of $A$. Setting:

$$\Phi_{sym}(F) = \mathrm{Tr}\left[ (D - FL)(D - FL)^T \right] \tag{4}$$

condition (3) is equivalent to:

$$\frac{\partial \Phi_{sym}}{\partial [F]_{ij}} = 0 \quad \forall (i,j) \in S_{BL} \tag{5}$$

i.e., finding the point where the scalar function $\Phi_{sym}(F)$ is stationary. UBF can be defined as a generalization of (2) for an unsymmetric matrix $A$:

$$M_{UBF}^{-1} = F_U F_L \tag{6}$$
Fig. 2. Block structure of $F_L$ and $F_U$.
where $F_L \in \mathcal{W}_{BL}$, $F_U \in \mathcal{W}_{BU}$, and $\mathcal{W}_{BU}$ is the set of matrices with pattern $S_{BU}$ (Fig. 1). Extending condition (5), the factors $F_L$ and $F_U$ are computed as the point where $\Phi_{uns}(F_L, F_U)$ is stationary:

$$\frac{\partial \Phi_{uns}}{\partial [F_L]_{ij}} = 0 \quad \forall (i,j) \in S_{BL} \tag{7}$$

$$\frac{\partial \Phi_{uns}}{\partial [F_U]_{ij}} = 0 \quad \forall (i,j) \in S_{BU} \tag{8}$$

with $\Phi_{uns}(F_L, F_U)$ generalizing $\Phi_{sym}(F)$ in Eq. (4) to the unsymmetric case:

$$\Phi_{uns}(F_L, F_U) = \mathrm{Tr}\left[ (D_L - F_L L)(D_U - U F_U) \right]. \tag{9}$$
A similar procedure has already been followed in [23] for generalizing FSAI to unsymmetric matrices. In Eq. (9) $D_L, D_U \in \mathcal{W}_{BD}$ are arbitrary full-rank block diagonal matrices, and $L$ and $U$ are the lower and upper triangular factors of $A$, respectively. As $D_L$ and $D_U$ are arbitrary, a number of coefficients in both $F_L$ and $F_U$ can also be set arbitrarily. The most effective choice, as already shown in [11], relies on setting the diagonal blocks of both $F_L$ and $F_U$ equal to the identity, so that $F_L$ and $F_U$ are actually unit lower and unit upper triangular matrices, respectively, with the following block structures (Fig. 2):
$$F_L = \begin{bmatrix} I & 0 & \cdots & 0 \\ F_{21}^L & I & & \vdots \\ \vdots & & \ddots & \vdots \\ F_{n_b 1}^L & F_{n_b 2}^L & \cdots & I \end{bmatrix}, \qquad F_U = \begin{bmatrix} I & F_{12}^U & \cdots & F_{1 n_b}^U \\ 0 & I & \cdots & F_{2 n_b}^U \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & \cdots & I \end{bmatrix} \tag{10}$$
Before proceeding, let us introduce the following simple result.

Lemma 2.1. Let $A, B \in \mathbb{R}^{n \times n}$; then

$$\frac{\partial}{\partial [A]_{ij}} \mathrm{Tr}(AB) = \frac{\partial}{\partial [A]_{ij}} \mathrm{Tr}(BA) = [B]_{ji} = [B^T]_{ij}. \tag{11}$$
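Eq. (11) is also easy to verify numerically before reading the formal proof below; the following sanity check is our own illustration (all names are hypothetical), comparing a finite-difference derivative of $\mathrm{Tr}(AB)$ with $[B^T]_{ij}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, i, j, h = 5, 1, 3, 1e-7
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Perturb [A]_{ij} and difference the trace of AB
E = np.zeros((n, n)); E[i, j] = 1.0
fd = (np.trace((A + h * E) @ B) - np.trace(A @ B)) / h

print(fd, B[j, i])   # both approximate [B]_{ji} = [B^T]_{ij}, as in Eq. (11)
```

Since $\mathrm{Tr}(AB)$ is linear in $A$, the finite difference is exact up to roundoff.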
Proof. The proof follows trivially by recalling that $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ and writing by components the trace and the matrix-by-matrix product $AB$. □

With reference to Fig. 3, we are now able to state the following.

Theorem 2.1. Let $A \in \mathbb{R}^{n \times n}$, $D_L, D_U \in \mathcal{W}_{BD}$, $F_L \in \mathcal{W}_{BL}$ and $F_U \in \mathcal{W}_{BU}$, with $F_L$ and $F_U$ having the block structure (10). Then the point where $\Phi_{uns}(F_L, F_U)$ is stationary is such that:

$$A_{i_b} \mathbf{f}_i^U = -\mathbf{c}_i^{(i_b)} \quad \forall i = m_1 + 1, \ldots, n \tag{12}$$

$$A_{j_b}^T \mathbf{f}_j^L = -\mathbf{r}_j^{(j_b)} \quad \forall j = m_1 + 1, \ldots, n \tag{13}$$

where $m_k$ is the size of the $k$-th diagonal block, $i_b$ and $j_b$ are the indices of the diagonal blocks containing the $i$-th column and the $j$-th row of $A$, respectively, $A_{k_b}$ is the square submatrix built from $[A]_{11}$ to $[A]_{mm}$ with $m = \sum_{k=1}^{k_b - 1} m_k$, $\mathbf{f}_j^L$ and $\mathbf{f}_i^U$ are the non-zero parts of the $j$-th row of $F_L$ and the $i$-th column of $F_U$, respectively, and $\mathbf{r}_j^{(j_b)}$ and $\mathbf{c}_i^{(i_b)}$ are the first $m$ components of the $j$-th row and the $i$-th column of $A$, respectively.
Fig. 3. Submatrices $A_{i_b}$ and $A_{j_b}$, and vectors $\mathbf{f}_i^U$, $\mathbf{f}_j^L$, $\mathbf{c}_i^{(i_b)}$, and $\mathbf{r}_j^{(j_b)}$ of Eqs. (12) and (13).
Proof. Write explicitly Eqs. (7) and (8):

$$-\frac{\partial}{\partial [F_L]_{ij}} \mathrm{Tr}(F_L L D_U) + \frac{\partial}{\partial [F_L]_{ij}} \mathrm{Tr}(F_L A F_U) = 0 \quad \forall (i,j) \in S_{BL} \tag{14}$$

$$-\frac{\partial}{\partial [F_U]_{ij}} \mathrm{Tr}(D_L U F_U) + \frac{\partial}{\partial [F_U]_{ij}} \mathrm{Tr}(F_L A F_U) = 0 \quad \forall (i,j) \in S_{BU} \tag{15}$$

and use Lemma 2.1 on (14) and (15):

$$-[L D_U]_{ji} + [A F_U]_{ji} = 0 \quad \forall (i,j) \in S_{BL} \tag{16}$$

$$-[U^T D_L^T]_{ij} + [A^T F_L^T]_{ij} = 0 \quad \forall (i,j) \in S_{BU}. \tag{17}$$

Recalling (10), the $i$-th column of $F_U$ and the $j$-th row of $F_L$ can be written as (Fig. 3):

$$F_U[:, \{i\}] = \left[ \mathbf{f}_i^{U\,T} \;\; \mathbf{e}_i^{(i_b)\,T} \;\; \mathbf{0}^T \right]^T \tag{18}$$

$$F_L[\{j\}, :] = \left[ \mathbf{f}_j^{L} \;\; \mathbf{e}_j^{(j_b)} \;\; \mathbf{0} \right] \tag{19}$$

where $\mathbf{e}_k^{(k_b)}$ is the $k$-th row or column of the identity matrix with size $m_{k_b}$. Recalling that $m = \sum_{k=1}^{k_b-1} m_k$, let us define the sets $M_{k_b}$ and $M_{k_b}^{(k_b)}$ as:

$$M_{k_b} = \{ l : 1 \le l \le m \} \tag{20}$$

$$M_{k_b}^{(k_b)} = \{ l : m+1 \le l \le m + m_{k_b} \}. \tag{21}$$

Then, introducing Eq. (18) into (16) gives rise to the set of linear systems:

$$\begin{cases} A[M_{i_b}, M_{i_b}]\,\mathbf{f}_i^U + A[M_{i_b}, M_{i_b}^{(i_b)}]\,\mathbf{e}_i^{(i_b)} = \mathbf{0} \\ A[M_{i_b}^{(i_b)}, M_{i_b}]\,\mathbf{f}_i^U + A[M_{i_b}^{(i_b)}, M_{i_b}^{(i_b)}]\,\mathbf{e}_i^{(i_b)} = \mathbf{v} \end{cases} \qquad \forall i = m_1+1, \ldots, n \tag{22}$$

where $\mathbf{v} \in \mathbb{R}^{m_{i_b}}$ is arbitrary because of the arbitrariness of $D_U$. Hence, only the upper equation in (22) has to be considered as a condition for $\mathbf{f}_i^U$. Observing that (Fig. 3) $A[M_{i_b}, M_{i_b}] = A_{i_b}$ and $A[M_{i_b}, M_{i_b}^{(i_b)}]\,\mathbf{e}_i^{(i_b)} = \mathbf{c}_i^{(i_b)}$, the upper equation in (22) reads:

$$A_{i_b} \mathbf{f}_i^U = -\mathbf{c}_i^{(i_b)} \quad \forall i = m_1+1, \ldots, n \tag{23}$$
Fig. 4. Submatrices of $A$, rows and columns of $F_L$ and $F_U$ with sparse patterns $S_{BL} \neq S_{BU}^T$.
i.e., Eq. (12). Similarly, introducing Eq. (19) into (17) provides:

$$\begin{cases} A^T[M_{j_b}, M_{j_b}]\,\mathbf{f}_j^L + A^T[M_{j_b}, M_{j_b}^{(j_b)}]\,\mathbf{e}_j^{(j_b)} = \mathbf{0} \\ A^T[M_{j_b}^{(j_b)}, M_{j_b}]\,\mathbf{f}_j^L + A^T[M_{j_b}^{(j_b)}, M_{j_b}^{(j_b)}]\,\mathbf{e}_j^{(j_b)} = \mathbf{w} \end{cases} \qquad \forall j = m_1+1, \ldots, n \tag{24}$$

with $\mathbf{w} \in \mathbb{R}^{m_{j_b}}$ an arbitrary vector. As $A^T[M_{j_b}, M_{j_b}] = A_{j_b}^T$ and $A^T[M_{j_b}, M_{j_b}^{(j_b)}]\,\mathbf{e}_j^{(j_b)} = \mathbf{r}_j^{(j_b)}$, the upper equation in (24) becomes:

$$A_{j_b}^T \mathbf{f}_j^L = -\mathbf{r}_j^{(j_b)} \quad \forall j = m_1+1, \ldots, n \tag{25}$$

completing the proof. □
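Theorem 2.1 can be checked numerically on a small example. The sketch below is our own illustration (names and the diagonal shift are assumptions): it uses $n_b = 2$ blocks with dense off-diagonal block patterns, solves (12)–(13), and verifies that $F_L A F_U$ comes out block diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1 = 6, 3                          # two blocks of size m_1 = m_2 = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shift keeps A_{k_b} nonsingular

FL, FU = np.eye(n), np.eye(n)         # identity diagonal blocks, as in (10)
A1 = A[:m1, :m1]                      # A_{k_b} for the second block
for l in range(m1, n):                # l = m_1+1, ..., n (0-based here)
    FU[:m1, l] = np.linalg.solve(A1,   -A[:m1, l])   # Eq. (12): column of F_U
    FL[l, :m1] = np.linalg.solve(A1.T, -A[l, :m1])   # Eq. (13): row of F_L

P = FL @ A @ FU                       # off-diagonal blocks vanish to roundoff
print(np.abs(P[m1:, :m1]).max(), np.abs(P[:m1, m1:]).max())
```

With sparse, economical patterns the off-diagonal blocks no longer vanish exactly; this is precisely the situation discussed next.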
According to the thesis of Theorem 2.1, the UBF factors $F_L$ and $F_U$ can be computed by solving the $2(n - m_1)$ independent linear systems (12) and (13), which can be easily scheduled on different processors in a parallel computing environment. The use of dense patterns, however, would make the computation of $F_L$ and $F_U$ unfeasible. Let us assume $S_{BU}$ and $S_{BL}$ to be sparse, and denote by $J_l^U$ and $J_l^L$ the sets of indices of the non-zero components of $\mathbf{f}_l^U$ and $\mathbf{f}_l^L$, respectively. Recalling Eq. (16), the $i$-th column of $F_U$ depends on the set $J_i^L$ of non-zero entries in the $i$-th row of $F_L$. In the same way, according to Eq. (17), the $j$-th row of $F_L$ depends on the set $J_j^U$ of non-zero entries in the $j$-th column of $F_U$. Hence, systems (12) and (13) become (Fig. 4):

$$A_{i_b}[J_i^L, J_i^U] \, \mathbf{f}_i^U[J_i^U] = -\mathbf{c}_i^{(i_b)}[J_i^L] \quad i = m_1+1, \ldots, n \tag{26}$$

$$A_{j_b}^T[J_j^U, J_j^L] \, \mathbf{f}_j^L[J_j^L] = -\mathbf{r}_j^{(j_b)}[J_j^U] \quad j = m_1+1, \ldots, n. \tag{27}$$

The solution of Eqs. (26) and (27) can raise some issues according to the number of elements belonging to the sets $J_l^L$ and $J_l^U$, $l = m_1+1, \ldots, n$. Denoting by $|J|$ the cardinality of the set $J$:

(1) if $|J_l^L| = |J_l^U|$, then both matrices in Eqs. (26) and (27) are square, with $\mathbf{f}_l^L$ and $\mathbf{f}_l^U$ uniquely defined provided that $A_{l_b}[J_l^L, J_l^U]$ and $A_{l_b}^T[J_l^U, J_l^L]$ are non-singular;
(2) if $|J_l^L| \neq |J_l^U|$, then both matrices in Eqs. (26) and (27) are rectangular. If $|J_l^L| > |J_l^U|$ there are generally no solutions to (26) and infinitely many solutions to (27). For instance, a least-squares approximation for $\mathbf{f}_l^U$ can be computed and at least $|J_l^L| - |J_l^U|$ components of $\mathbf{f}_l^L$ can be set arbitrarily. If $|J_l^L| < |J_l^U|$ the contrary occurs;
(3) if $J_l^L = \emptyset$ and $J_l^U \neq \emptyset$, then $\mathbf{f}_l^L = \mathbf{0}$ and $\mathbf{f}_l^U$ cannot be computed;
(4) if $J_l^L \neq \emptyset$ and $J_l^U = \emptyset$, then $\mathbf{f}_l^L$ cannot be computed and $\mathbf{f}_l^U = \mathbf{0}$;
(5) if $J_l^L = J_l^U = \emptyset$, then $\mathbf{f}_l^L = \mathbf{f}_l^U = \mathbf{0}$.

As a consequence, existence and uniqueness of both $F_L$ and $F_U$ cannot be guaranteed. A possible practical remedy for points 2 through 4 above is enlarging either $J_l^L$ or $J_l^U$ so that the number of elements of both sets equals $\max(|J_l^L|, |J_l^U|)$. In any case, if either $A_{l_b}[J_l^L, J_l^U]$ or $A_{l_b}^T[J_l^U, J_l^L]$ or both are singular, the solution of systems (26) and (27) is not unique.
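As a minimal illustration of points 1–2, the gathered systems (26)–(27) under the constraint $|J_l^L| = |J_l^U|$ (the remedy adopted in Section 2.1) can be solved as follows; the dense stand-in and all names are our assumptions, not the paper's implementation:

```python
import numpy as np

def ubf_local_solve(A, l, m, JL, JU):
    """Local solves (26)-(27) for one index l (0-based sketch).

    A      : dense ndarray standing in for the sparse system matrix
    m      : size of A_{l_b}, i.e. the first index of the block containing l
    JL, JU : index sets J_l^L, J_l^U (subsets of range(m)), already enlarged
             so that |JL| == |JU|
    Returns the non-zero parts of f_l^U and f_l^L, or None on singularity.
    """
    Alb = A[:m, :m]
    try:
        fU = np.linalg.solve(Alb[np.ix_(JL, JU)],   -A[JL, l])  # Eq. (26)
        fL = np.linalg.solve(Alb.T[np.ix_(JU, JL)], -A[l, JU])  # Eq. (27)
    except np.linalg.LinAlgError:
        return None   # singular gathered block: skip, i.e. set f to zero
    return fU, fL
```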
The factors $F_L$ and $F_U$ are computed with the aim of building a preconditioned matrix $F_L A F_U$ that resembles $D_L D_U$, i.e., a block diagonal matrix, as closely as possible, whatever $D_L$ and $D_U$. In other words, $F_L A F_U$ has the block structure:

$$F_L A F_U = \begin{bmatrix} B_1 & R_{12} & \cdots & R_{1 n_b} \\ R_{21} & B_2 & \cdots & R_{2 n_b} \\ \vdots & \vdots & \ddots & \vdots \\ R_{n_b 1} & R_{n_b 2} & \cdots & B_{n_b} \end{bmatrix} \tag{28}$$

where:

$$R_{ij} \rightarrow 0 \quad \forall i, j = 1, \ldots, n_b, \; i \neq j \tag{29}$$

as the sparsity of $F_L$ and $F_U$ decreases, and $B_{k_b}$, $k_b = 1, \ldots, n_b$, tends to the $k_b$-th diagonal block of $D_L D_U$. However, as $D_L$ and $D_U$ are arbitrary, it is not ensured that $F_L A F_U$ behaves better than $A$ in a Krylov subspace method, so it is necessary to precondition $F_L A F_U$ again. Recalling (29), it appears reasonable to assume:

$$F_L A F_U \simeq \begin{bmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_{n_b} \end{bmatrix} \tag{30}$$
which can be efficiently preconditioned in parallel by a block diagonal incomplete, or whenever feasible even exact, factorization, where each diagonal block $B_{k_b}$ is managed by a single processor. Hence, the most effective choice is setting the number of blocks $n_b$ equal to the number of available processors $n_p$. Using an ILU factorization of each block $B_{k_b}$, e.g., as implemented in [16,17], the "outer" preconditioner can be factorized as $J_U^{-1} J_L^{-1}$, where:

$$J_L = \begin{bmatrix} \tilde{L}_1 & 0 & \cdots & 0 \\ 0 & \tilde{L}_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde{L}_{n_b} \end{bmatrix}, \qquad J_U = \begin{bmatrix} \tilde{U}_1 & 0 & \cdots & 0 \\ 0 & \tilde{U}_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde{U}_{n_b} \end{bmatrix} \tag{31}$$

and $\tilde{L}_{k_b}$ and $\tilde{U}_{k_b}$ are the lower and upper incomplete factors of $B_{k_b}$, respectively. The resulting preconditioned matrix is:

$$J_L^{-1} F_L A F_U J_U^{-1} = W_L A W_U \tag{32}$$

where:

$$W_L = J_L^{-1} F_L \tag{33}$$

$$W_U = F_U J_U^{-1}. \tag{34}$$

Therefore, the final UBF-ILU preconditioner can be written in the factorized form:

$$M^{-1} = W_U W_L. \tag{35}$$

The structure (35) is well suited to the implementation of a split preconditioning strategy.
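For illustration, one application of the preconditioned matrix (32), as needed at each Krylov iteration, can be sketched in serial scipy terms as follows (the actual implementation is parallel and LAPACK-based; the data layout and all names below are our assumptions):

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def apply_preconditioned(A, FL, FU, Ltil, Util, starts, x):
    """y = W_L A W_U x, with W_L = J_L^{-1} F_L and W_U = F_U J_U^{-1}.

    A, FL, FU : scipy.sparse CSR matrices
    Ltil, Util: lists of sparse triangular factors L~_k, U~_k of the blocks B_k
    starts    : block offsets, starts[k]:starts[k+1] spans block k
    """
    nb = len(Ltil)
    z = np.empty_like(x)
    for k in range(nb):                       # J_U^{-1} x, blockwise (Eq. (31))
        s, e = starts[k], starts[k + 1]
        z[s:e] = spsolve_triangular(Util[k], x[s:e], lower=False)
    w = FL @ (A @ (FU @ z))                   # F_L A F_U (J_U^{-1} x)
    y = np.empty_like(w)
    for k in range(nb):                       # J_L^{-1} ( . ), blockwise
        s, e = starts[k], starts[k + 1]
        y[s:e] = spsolve_triangular(Ltil[k], w[s:e], lower=True)
    return y
```

Since $J_L$ and $J_U$ are block diagonal, each triangular solve touches only one block, which is what makes the outer preconditioner embarrassingly parallel.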
2.1. Implementation details

The algorithm for the UBF-ILU computation is summarized in Table 1. The main body is represented by a loop over the blocks, which can easily be set up in parallel. The solution of the dense systems (26) and (27), where the additional constraint $|J_l^L| = |J_l^U|$ is enforced, can be efficiently carried out by LAPACK subroutines, while the computation of $B_{k_b}$, along with its ILU factorization, is performed locally by each single processor. A possible breakdown of the algorithm can occur only if either $A_{k_b}[J_l^L, J_l^U]$ or $A_{k_b}^T[J_l^U, J_l^L]$ is singular. In this case, the solution of the corresponding system is skipped, setting either $\mathbf{f}_l^U$ or $\mathbf{f}_l^L$ to the null vector.

Table 1
Algorithm for the UBF-ILU computation.

1.  Choose the number of processors $n_p$
2.  Set $n_b = n_p$
3.  Select the block sizes $m_1, m_2, \ldots, m_{n_b}$ such that $n = \sum_{k=1}^{n_b} m_k$
4.  Select the sets of non-zero positions $J_l^U$ and $J_l^L$ for $l = m_1+1, \ldots, n$
5.  $m = 0$
6.  For $k_b = 1, \ldots, n_b$
7.      $m \leftarrow m + m_{k_b}$
8.      For $k = 1, \ldots, m_{k_b}$
9.          $l = m + k$
10.         Solve $A_{k_b}[J_l^L, J_l^U] \, \mathbf{f}_l^U[J_l^U] = -\mathbf{c}_l^{(k_b)}[J_l^L]$
11.         Solve $A_{k_b}^T[J_l^U, J_l^L] \, \mathbf{f}_l^L[J_l^L] = -\mathbf{r}_l^{(k_b)}[J_l^U]$
12.     End For
13.     $B_{k_b} = F_L[M_{k_b}^{(k_b)}, :] \, A \, F_U[:, M_{k_b}^{(k_b)}]$
14.     Compute $\tilde{L}_{k_b} \tilde{U}_{k_b} \simeq B_{k_b}$
15. End For

The other options to be specified by the user are the following:

(1) the number of processors $n_p$, which basically depends on the available machine and the size $n$ of $A$;
(2) the sizes $m_1, m_2, \ldots, m_{n_b}$ of the blocks $B_1, B_2, \ldots, B_{n_b}$. The simplest choice is to subdivide $A$ uniformly among the processors, thus setting $m_2 = \cdots = m_{n_b} = \lfloor n/n_b \rfloor$ and $m_1 = n - (n_b - 1)\lfloor n/n_b \rfloor$. A preliminary reordering by the Reverse Cuthill–McKee (RCM) algorithm [24] can help reduce communications among the processors. Further improvements can be obtained by using a multilevel graph partitioning technique, such as those implemented in the Metis [25] or Scotch [26] libraries, and setting the block sizes correspondingly [27];
(3) the sets $J_l^U$ and $J_l^L$, $l = m_1+1, \ldots, n$, containing the positions of the non-zero entries of the $l$-th column of $F_U$ and the $l$-th row of $F_L$, respectively. The definition of these sets is not trivial, and the optimal a priori selection of the sparsity pattern for a factorized approximate inverse is still an open issue. In the SPD case an efficient adaptive pattern search for Block FSAI has been defined by Janna and Ferronato [12]; however, the approach they follow is based on the minimization of the Kaporin conditioning number and has no theoretical foundation for an unsymmetric indefinite matrix. Other adaptive searches for factorized and non-factorized approximate inverses can be found in the works by Huckle [19,28,29]. A generally accepted choice is using the upper and lower non-zero patterns of a power of $A$, or of a sparsified $A$ [28,30], for $F_U$ and $F_L$, respectively. In order to ensure that the constraint $|J_l^L| = |J_l^U|$ is satisfied for all $l = m_1+1, \ldots, n$, the sparsity pattern of a power of the symmetric part of $A$, i.e., $(A + A^T)^\kappa$, can be selected. In practical problems, however, the power $\kappa$ can hardly be larger than 2 or 3, because the number of non-zeros per row can easily become unacceptably high. Larger $\kappa$ values are feasible by computing the power of the symmetric part of a sparsified $A$, i.e., of $\tilde{A}$ obtained by dropping each entry $[A]_{ij}$ such that:

$$|[A]_{ij}| \le \delta \, \| A[\{i\}, :] \| \, \| A[:, \{j\}] \| \tag{36}$$

where $\delta$ is a user-specified tolerance and the symbol $\| \cdot \|$ denotes a proper vector norm, e.g., the 1-, 2-, or $\infty$-norm. Hence, the cost of allowing for $\kappa$ values larger than 3 is that of tuning the additional user-specified parameter $\delta$;

(4) the fill-in of the ILU factorization $\tilde{L}_{k_b} \tilde{U}_{k_b}$ of $B_{k_b}$, $k_b = 1, \ldots, n_b$. Several different ILU implementations have been proposed in the past with the aim of controlling the number of non-zeros added per row, their magnitude, or both. The selection of the retained entries can be carried out in different ways, e.g., based on the level-of-fill [31,32] or on a dual threshold strategy [16,17,33].

The selected implementation is based on a uniform partition of $A$ among the processors after a preliminary RCM reordering. The non-zero patterns of $F_U$ and $F_L$ are set to the upper and lower off-block-diagonal pattern of $(A + A^T)^\kappa$, respectively, with no sparsification; the constraint $|J_l^L| = |J_l^U|$ is therefore automatically enforced for all $l = m_1+1, \ldots, n$. Finally, the dual threshold strategy in [16] is used for the ILU factorization of each block $B_{k_b}$. Therefore, the user-specified parameters required for the preconditioner setup are:

• $\kappa$: integer value denoting the power of $(A + A^T)$ used to define the sparsity pattern of $F_U$ and $F_L$;
• $\rho$: integer value denoting the number of terms retained per row in $\tilde{L}_{k_b}$ and $\tilde{U}_{k_b}$;
• $\tau$: real value denoting the relative magnitude, with respect to the 2-norm of a row or column, below which an entry of $\tilde{L}_{k_b}$ or $\tilde{U}_{k_b}$ is dropped.
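The pattern strategy of item (3) can be sketched as follows: drop entries by criterion (36), symmetrize, take the $\kappa$-th power, and split into upper and lower parts. This is our own serial sketch (the choice of 1-norms and all names are assumptions; the implementation further restricts the patterns to off-block-diagonal positions, which depend on the chosen block sizes):

```python
import numpy as np
import scipy.sparse as sp

def ubf_pattern(A, kappa=2, delta=0.0):
    """Candidate patterns for F_U and F_L from the power (A~ + A~^T)^kappa."""
    A = sp.csr_matrix(A)
    if delta > 0.0:
        rn = np.asarray(abs(A).sum(axis=1)).ravel()   # ||A[{i}, :]||_1
        cn = np.asarray(abs(A).sum(axis=0)).ravel()   # ||A[:, {j}]||_1
        C = A.tocoo()
        keep = np.abs(C.data) > delta * rn[C.row] * cn[C.col]  # criterion (36)
        A = sp.csr_matrix((C.data[keep], (C.row[keep], C.col[keep])),
                          shape=A.shape)
    B = (abs(A) + abs(A.T)).tocsr()   # pattern of the symmetric part of A~
    B.data[:] = 1.0                   # values are irrelevant, only positions
    P = B
    for _ in range(kappa - 1):
        P = P @ B                     # pattern of (A~ + A~^T)^kappa
    return sp.triu(P, k=1), sp.tril(P, k=-1)   # upper part for F_U, lower for F_L
```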
3. Numerical results

The UBF-ILU computational performance and parallel efficiency are investigated on seven large test matrices arising from different engineering applications. All matrices are currently, or will soon be, available in the University of Florida Sparse Matrix Collection [34]. In all test cases the Bi-Conjugate Gradient Stabilized (Bi-CGSTAB, [35]) solver with split preconditioning is used, with the right-hand side corresponding to a unitary solution and convergence achieved when the relative residual is smaller than $10^{-10}$. The UBF-ILU performance is compared to the unsymmetric version of FSAI [23] as implemented in [36], and to AMG as implemented in the Hypre software library [37]. The computational performance is evaluated in terms of the number of iterations $n_{it}$, the wall clock times in seconds $T_p$ and $T_s$ for the preconditioner computation and for Bi-CGSTAB to converge, respectively, with the total time $T_t = T_p + T_s$, and the speed-up $S_p$ defined as the ratio between $T_t$ elapsed with 1 and with $n_p$ processors; the closer $S_p$ to $n_p$, the more scalable the algorithm.
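This protocol can be emulated with scipy's Bi-CGSTAB by solving the explicitly split-preconditioned system $(W_L A W_U)\mathbf{y} = W_L \mathbf{b}$ and recovering $\mathbf{x} = W_U \mathbf{y}$; the sketch below is ours, under assumed helper callables (apply_WL, apply_WU, e.g. built as in Section 2). Note that the residual checked here is the preconditioned one, and that the tolerance keyword is named rtol only in recent SciPy versions (older ones call it tol):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, bicgstab

def solve_split(A, apply_WL, apply_WU, rtol=1e-10):
    """Bi-CGSTAB on (W_L A W_U) y = W_L b, with b from a unitary solution."""
    n = A.shape[0]
    b = A @ np.ones(n)                  # right-hand side for the solution x = 1
    op = LinearOperator((n, n), matvec=lambda v: apply_WL(A @ apply_WU(v)))
    y, info = bicgstab(op, apply_WL(b), rtol=rtol, atol=0.0)
    return apply_WU(y), info            # info == 0 signals convergence
```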
Table 2
Test matrices.

Matrix           Size        # non-zeros    Problem description
ThermoMech-dK    204,316     2,846,228      Finite Element thermomechanics
ML_Laplace       377,002     27,689,972     Meshless Poisson equation
CoupCons3D       416,800     22,322,336     3D Finite Element coupled consolidation
AtmosModd        1,270,432   8,814,880      Computational Fluid Dynamics
Tmt_Unsymm       917,825     4,584,801      Finite Difference electromagnetics
ML_Geer          1,504,002   110,879,972    Meshless poroelasticity
Transport        1,602,111   23,500,731     3D Finite Element flow and transport
Table 3
UBF-ILU performance with the test matrices of Table 2. For each case the triple (κ, ρ, τ) of user-specified parameters yielding the best outcome is also provided.

ThermoMech-dK (3, 20, 10^-4)
np    n_it   T_p    T_s     T_t     S_p
1     632    8.2    117.0   125.2   –
2     704    4.6    68.6    73.2    1.7
4     856    2.6    40.7    43.3    2.9
8     930    1.9    27.9    29.8    4.2
16    1076   1.2    17.0    18.2    6.9
32    1340   0.8    11.1    11.9    10.5

ML_Laplace (1, 30, 10^-4)
np    n_it   T_p    T_s     T_t     S_p
1     224    19.6   136.6   156.2   –
2     249    8.8    80.6    89.4    1.8
4     259    4.9    46.9    51.8    3.0
8     270    2.6    31.5    34.1    4.6
16    305    1.4    18.8    20.2    7.8
32    324    0.7    10.5    11.2    14.0

CoupCons3D (1, 10, 10^-4)
np    n_it   T_p    T_s    T_t    S_p
1     59     13.0   13.0   26.0   –
2     71     7.2    9.2    16.4   1.6
4     70     4.5    6.0    10.5   2.5
8     71     2.3    2.7    5.0    5.2
16    73     1.3    1.3    2.6    10.2
32    78     0.8    0.6    1.4    18.2

AtmosModd (1, 10, 10^-2)
np    n_it   T_p   T_s    T_t    S_p
1     88     4.6   32.8   37.4   –
2     94     2.0   19.8   21.8   1.7
4     100    1.1   10.0   11.1   3.4
8     102    0.6   5.4    6.0    6.2
16    106    0.4   2.9    3.3    11.0
32    119    0.4   1.7    2.1    17.6
Table 4
The same as Table 3.

Tmt_Unsymm (2, 10, 10^-4)
np    n_it   T_p   T_s    T_t    S_p
1     337    4.6   89.7   94.3   –
2     411    2.3   63.1   65.4   1.4
4     392    1.2   32.7   33.9   2.8
8     421    0.6   16.6   17.2   5.5
16    368    0.4   7.6    8.0    11.8
32    354    0.4   3.9    4.2    22.3

ML_Geer (1, 60, 10^-4)
np    n_it   T_p     T_s     T_t     S_p
1     205    204.9   779.5   984.4   –
2     212    107.8   424.3   532.1   1.9
4     219    41.1    224.2   265.3   3.7
8     221    27.0    105.9   132.9   7.4
16    290    10.6    73.1    83.7    11.8
32    300    7.3     39.8    47.1    20.9

Transport (1, 15, 10^-2)
np    n_it   T_p   T_s    T_t    S_p
1     168    7.0   84.0   91.0   –
2     180    4.0   53.6   57.6   1.6
4     186    2.0   28.2   30.2   3.0
8     201    1.4   16.0   17.4   5.2
16    195    0.9   8.2    9.1    10.0
32    209    0.8   4.6    5.4    16.7
The test matrices used for the comparison are the following:

• ThermoMech-dK: Finite Element simulation of temperature and deformation of a steel cylinder;
• ML_Laplace: Poisson equation of elliptic type discretized with the Meshless Local Petrov–Galerkin method, see [38];
• CoupCons3D: 3D linear Finite Element coupled consolidation simulation using a nonsymmetric matrix structure, see [39];
• AtmosModd: computational fluid dynamics for atmospheric turbulent flow problems;
• Tmt_Unsymm: Finite Difference electromagnetic simulation based on the Helmholtz equation;
• ML_Geer: poroelastic problem discretized with the Meshless Local Petrov–Galerkin method, see [40];
• Transport: 3D Finite Element density-driven coupled flow and transport, see [41].
The size and the number of non-zero terms of each matrix are provided in Table 2. All tests are performed on the IBM SP6/5376 cluster at the CINECA Centre for High Performance Computing, equipped with IBM Power6 processors at 4.7 GHz, with 168 nodes, 5376 computing cores, and 21 TB of internal network RAM.

Tables 3 and 4 provide the best UBF-ILU performance in the different tests with 1–32 processors. Finding the optimal user-specified parameters does not prove overly difficult, as acceptable $\kappa$ values are smaller than, or equal to, 3, and setting the pair $(\rho, \tau)$ is quite similar to tuning a standard ILU factorization with dual threshold. Fig. 5 compares the overall wall clock time $T_t$ elapsed using the unsymmetric FSAI, AMG as implemented in the Hypre package, and UBF-ILU, all with the user-specified parameters providing the best performance. For the sake of comparison, the ideal profile obtained by scaling the best one-processor outcome with $n_p$ is also reported. Although the UBF-ILU convergence tends to deteriorate as the number of processors increases, as is theoretically expected [11], scalability appears to be still pretty good up to 32 processors. Using a larger number of processors does not appear to be justified in view of the size of the test matrices.
Fig. 5. Total wall clock time vs. number of processors for the test matrices of Table 2.
Fig. 6 shows the ratio between the total wall clock time elapsed with the unsymmetric FSAI (left) and AMG (right), and UBF-ILU. UBF-ILU appears to be a robust and efficient preconditioner ensuring convergence for all the different problems. In all test matrices it outperforms the unsymmetric version of FSAI, with the exception of AtmosModd with np = 16 and 32.
Fig. 6. Ratio between the total wall clock time elapsed with the unsymmetric FSAI (left) and Hypre AMG (right), and with UBF-ILU, vs. the number of processors.

Table 5
Relative efficiency index $\eta_{n_p}^{(2)}$ for the test matrices of Table 2.
                 np = 4   np = 8   np = 16   np = 32
ThermoMech-dK    85%      61%      50%       38%
ML_Laplace       86%      66%      56%       50%
CoupCons3D       78%      82%      80%       72%
AtmosModd        98%      91%      81%       64%
Tmt_Unsymm       97%      95%      99%       97%
ML_Geer          99%      99%      79%       71%
Transport        95%      83%      79%       66%
Hypre AMG can scale very well with the number of processors, as in the Tmt_Unsymm test case, and can sometimes be more efficient, as with Tmt_Unsymm and Transport. However, it proves less robust, failing to converge in a coupled problem such as CoupCons3D and performing quite unsatisfactorily whenever the problem discretization is highly irregular.

Finally, the parallel performance of UBF-ILU is investigated by considering the relative efficiency index $\eta_{n_p}^{(i)}$ in the different test cases. This index is defined as the ratio between the actual speed-up $S_{n_p}^{(i)}$, obtained with $n_p$ processors relative to $i$ processors, and the ideal speed-up:

$$\eta_{n_p}^{(i)} = \frac{S_{n_p}^{(i)}}{n_p / i} = \frac{i \, T_t(i)}{n_p \, T_t(n_p)}. \tag{37}$$
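As a concrete check of (37) against the data (the choice $i = 2$ is motivated just below), take Tmt_Unsymm from Table 4:

$$\eta_{32}^{(2)} = \frac{2 \, T_t(2)}{32 \, T_t(32)} = \frac{2 \times 65.4}{32 \times 4.2} \simeq 0.97,$$

matching the 97% entry of Table 5.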
In this case we have selected $i = 2$ because with one processor UBF-ILU actually coincides with a standard ILUT factorization of the overall matrix [11]. The closer $\eta_{n_p}^{(2)}$ is to 1, the more scalable the parallel code with respect to two processors. It is expected that the efficiency of any parallel code for a fixed problem decreases as the number of processors increases, mainly because of the growing computational burden required by interprocessor communications. The relative efficiency index $\eta_{n_p}^{(2)}$ for the test matrices of Table 2 is provided in Table 5. Whereas $\eta_{n_p}^{(2)}$ decays rapidly for the smallest problems, it is still pretty good with 32 processors for the largest ones, achieving values up to 97%.

4. Conclusions

Block FSAI is a novel and efficient parallel preconditioner developed for SPD problems. In the present work it has been extended to unsymmetric and indefinite matrices, giving rise to the UBF-ILU preconditioner. UBF-ILU has been obtained by generalizing the Block FSAI formulation along the lines of the unsymmetric version of FSAI, and coupling the resulting block approximate inverse with an ILU factorization based on a dual threshold strategy. The new preconditioner has been implemented in a parallel Bi-CGSTAB method and tested on seven large matrices arising from different applications. The results appear promising, showing both good robustness and scalability, also in ill-conditioned problems. A numerical comparison with the outcomes obtained by the unsymmetric version of FSAI and by AMG as implemented in the Hypre software package provides evidence that UBF-ILU is always superior to FSAI, and superior to AMG in five cases out of seven. However, AMG appears to be less robust, as it is not able to ensure convergence in one problem.
The most significant improvement with respect to FSAI and AMG is found using a relatively small number of processors, so that UBF-ILU can be particularly attractive for the modern multi-core processor technology. In contrast with the native Block FSAI for SPD problems, UBF-ILU existence and uniqueness are not theoretically guaranteed for any choice of the sparsity patterns $S_{BL}$ and $S_{BU}$ of the factors $F_L$ and $F_U$. Moreover, an adaptive procedure for optimally generating $S_{BL}$ and $S_{BU}$ cannot be straightforwardly derived from the SPD case. More research is currently ongoing in order to address the issues above.

Acknowledgment

This work has been supported by the ISCRA project "SCALPREC: Scalable Preconditioners for Engineering Applications" at the CINECA Centre for High Performance Computing, Italy.

References

[1] J.M. Elble, N.V. Sahinidis, P. Vouzis, GPU computing with Kaczmarz's and other iterative algorithms for linear systems, Parallel Computing 36 (2010) 215–231.
[2] M. Papadrakakis, G. Stavroulakis, A. Karatarakis, A new era in scientific computing: domain decomposition methods in hybrid CPU–GPU architecture, Computer Methods in Applied Mechanics and Engineering 200 (2011) 1490–1508.
[3] H. Knibbe, C.W. Oosterlee, C. Vuik, GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method, Journal of Computational and Applied Mathematics 236 (2011) 281–293.
[4] R. Helfenstein, J. Koko, Parallel preconditioned conjugate gradient algorithm on GPU, Journal of Computational and Applied Mathematics 236 (2012) 3584–3590.
[5] S.A. Funken, E.P. Stephan, Fast solvers with block-diagonal preconditioners for linear FEM-BEM coupling, Numerical Linear Algebra with Applications 16 (2009) 365–395.
[6] L. Bergamaschi, A. Martinez, FSAI-based parallel mixed constraint preconditioners for saddle point problems arising in geomechanics, Journal of Computational and Applied Mathematics 236 (2011) 308–318.
[7] M.F. Wheeler, T. Wildey, I. Yotov, A multiscale preconditioner for stochastic mortar mixed finite elements, Computer Methods in Applied Mechanics and Engineering 200 (2011) 1251–1262.
[8] M. Ferronato, C. Janna, G. Pini, Parallel solution to ill-conditioned FE geomechanical problems, International Journal for Numerical and Analytical Methods in Geomechanics 36 (2012) 422–437.
[9] M. Ferronato, C. Janna, G. Pini, Shifted FSAI preconditioners for the efficient parallel solution of non-linear groundwater flow models, International Journal for Numerical Methods in Engineering 89 (2012) 1707–1719.
[10] C. Janna, M. Ferronato, G. Gambolati, Parallel inexact constraint preconditioning for ill-conditioned consolidation problems, Computational Geosciences 16 (2012) 661–675.
[11] C. Janna, M. Ferronato, G. Gambolati, A block FSAI-ILU parallel preconditioner for symmetric positive definite linear systems, SIAM Journal on Scientific Computing 32 (2010) 2468–2484.
[12] C. Janna, M. Ferronato, Adaptive pattern research for block FSAI preconditioning, SIAM Journal on Scientific Computing 33 (2011) 3357–3380.
[13] L.Y. Kolotilina, A.Y. Yeremin, Factorized sparse approximate inverse preconditionings I. Theory, SIAM Journal on Matrix Analysis and Applications 14 (1993) 45–58.
[14] M. Ferronato, C. Janna, G. Pini, Efficient parallel solution to large size sparse eigenproblems with block FSAI preconditioning, Numerical Linear Algebra with Applications 19 (2012) 797–815.
[15] A. Greenbaum, V. Pták, Z. Strakoš, Any nonincreasing convergence curve is possible for GMRES, SIAM Journal on Matrix Analysis and Applications 17 (1996) 465–469.
[16] Y. Saad, ILUT: a dual threshold incomplete LU factorization, Numerical Linear Algebra with Applications 1 (1994) 387–402.
[17] N. Li, Y. Saad, E. Chow, Crout version of ILU for general sparse matrices, SIAM Journal on Scientific Computing 25 (2003) 716–728.
[18] M. Benzi, M. Tůma, A sparse approximate inverse preconditioner for nonsymmetric linear systems, SIAM Journal on Scientific Computing 19 (1998) 968–994.
[19] M. Grote, T. Huckle, Parallel preconditioning with sparse approximate inverses, SIAM Journal on Scientific Computing 18 (1997) 838–853.
[20] M. Sala, R.S. Tuminaro, A new Petrov–Galerkin smoothed aggregation preconditioner for nonsymmetric linear systems, SIAM Journal on Scientific Computing 31 (2008) 143–166.
[21] M. Brezina, T. Manteuffel, S. McCormick, J. Ruge, G. Sanders, Towards adaptive smoothed aggregation (αSA) for nonsymmetric problems, SIAM Journal on Scientific Computing 32 (2010) 40–61.
[22] M. Benzi, M. Tůma, A parallel solver for large-scale Markov chains, Applied Numerical Mathematics 41 (2002) 135–153.
[23] A.Y. Yeremin, A.A. Nikishin, Factorized-sparse-approximate-inverse preconditionings of linear systems with unsymmetric matrices, Journal of Mathematical Sciences 121 (2004) 2448–2457.
[24] E. Cuthill, J. McKee, Reducing the bandwidth of sparse symmetric matrices, in: Proceedings of the 1969 24th National Conference, 1969, pp. 157–172.
[25] G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, Journal of Parallel and Distributed Computing 48 (1998) 96–129.
[26] F. Pellegrini, A parallelisable multi-level banded diffusion scheme for computing balanced partitions with smooth boundaries, in: Proceedings of EuroPar 2007, in: Lecture Notes in Computer Science, vol. 4641, 2007, pp. 191–200.
[27] C. Janna, N. Castelletto, M. Ferronato, The effect of graph partitioning techniques on parallel Block FSAI preconditioning, Numerical Algorithms (submitted for publication).
[28] T. Huckle, Approximate sparsity patterns for the inverse of a matrix and preconditioning, Applied Numerical Mathematics 30 (1999) 291–303.
[29] T. Huckle, Factorized sparse approximate inverses for preconditioning, The Journal of Supercomputing 25 (2003) 109–117.
[30] E. Chow, A priori sparsity patterns for parallel sparse approximate inverse preconditioners, SIAM Journal on Scientific Computing 21 (2000) 1804–1822.
[31] I. Gustafsson, A class of first order factorization methods, BIT 18 (1978) 142–156.
[32] J.W. Watts III, A conjugate gradient-truncated direct method for the iterative solution of the reservoir simulation pressure equation, SPE Journal 21 (1981) 345–353.
[33] C. Lin, J.J. Moré, Incomplete Cholesky factorizations with limited memory, SIAM Journal on Scientific Computing 21 (1999) 24–45.
[34] T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software 38 (2011) 1–25.
[35] H.A. van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing 13 (1992) 631–644.
[36] L. Bergamaschi, A. Martinez, Parallel acceleration of Krylov solvers by factorized approximate inverse preconditioners, in: Proceedings of VecPar 2004: High Performance Computing for Computational Science, in: Lecture Notes in Computer Science, vol. 3402, 2005, pp. 623–636.
[37] Lawrence Livermore National Laboratory, HYPRE user's manual—software version 1.6.0, Center for Applied Scientific Computing (CASC), University of California, 2001.
[38] G. Pini, A. Mazzia, F. Sartoretto, Accurate MLPG solution of 3D potential problems, CMES-Computer Modeling in Engineering & Sciences 36 (2008) 43–64.
[39] M. Ferronato, G. Pini, G. Gambolati, The role of preconditioning in the solution to FE coupled consolidation equations by Krylov subspace methods, International Journal for Numerical and Analytical Methods in Geomechanics 33 (2009) 405–423.
[40] M. Ferronato, A. Mazzia, G. Pini, G. Gambolati, A meshless method for axi-symmetric poroelastic simulations: numerical study, International Journal for Numerical Methods in Engineering 70 (2007) 1346–1365.
[41] A. Mazzia, M. Putti, High order Godunov mixed methods on tetrahedral meshes for density driven flow simulations in porous media, Journal of Computational Physics 208 (2005) 154–174.