Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions

Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions

Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 www.elsevier.com/locate/cma Parallel iterative solution for the Helmholtz equation with exact...

723KB Sizes 0 Downloads 31 Views

Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 www.elsevier.com/locate/cma

Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions q Cristian Ianculescu, Lonny L. Thompson

*

Department of Mechanical Engineering, Clemson University, Innovation Building, 102 Fluor Daniel Engineering, Clemson, SC 29634-0921, United States Received 11 November 2004; received in revised form 24 January 2005; accepted 13 February 2005

Abstract Efficient and scalable parallel solution methods are presented for the Helmholtz equation with global non-reflecting DtN boundary conditions. The symmetric outer-product structure of the DtN operator is exploited to significantly reduce inter-processor communication required by the non-locality of the DtN to one collective communication per iteration with a vector size equal to the number of harmonics included in the DtN series expansion, independent of the grain size. Numerical studies show that in the context of iterative equation solvers, and for the same accuracy, the global DtN applied to a tightly fitting spheroidal boundary and implemented as a low-rank update with the multiplicative split is more cost-effective (both in wall-clock times and memory) compared to local approximate boundary conditions.  2005 Elsevier B.V. All rights reserved. Keywords: Helmholtz equation; DtN; Finite element; Parallel methods

1. Introduction Numerical solution of acoustic radiation and scattering problems on unbounded domains requires a large number of degrees-of-freedom (dof) to resolve waves at high driving frequencies (wavenumbers), especially in three dimensions. Domain based methods such as the finite element method (FEM) are able to solve problems in non-homogeneous media and allow for a natural coupling with complex, including q *

Portions of this manuscript originally presented in [1,2]. Corresponding author. Tel.: +1 864 656 5631; fax: +1 864 656 4435. E-mail address: [email protected] (L.L. Thompson). URL: http://www.ces.clemson.edu/~lonny/ (L.L. Thompson).

0045-7825/$ - see front matter  2005 Elsevier B.V. All rights reserved. doi:10.1016/j.cma.2005.02.030

3710

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

elastic, structures. The numerical advantage of the FEM is they lead to sparse matrices, which by avoiding calculations on zeroes, significantly speedup computations and reduce memory requirements. Complexity estimates [3] and numerical evidence [4] have shown that the FEM is an effective alternative to boundary integral methods (BEM) for exterior acoustics problems. With the recent developments in fast boundary integral methods [5–7], the method with the best efficiency is less clear, yet the FEM retains advantages of robustness and natural integration with other discrete models in coupled problems. Domain based methods such as the finite element method (FEM) usually introduce an artificial boundary C, which divides the original unbounded domain into two regions: a finite computational domain X suitable for discretization and an infinite residual region D ¼ R n X, see Fig. 1. The exterior acoustics problem in unbounded domains presents a special challenge for finite element methods. Reducing the size of the bounded domain reduces the computation cost but must be balanced by the ability of a computationally efficient and geometrically flexible truncation boundary treatment to minimize any spurious wave reflection to a level below that of the discretization error. Methods for modelling the exterior complement D ¼ R n X, i.e., the infinite region exterior to the artificial boundary C, can be divided into three main categories: infinite elements [4,8–14], absorbing PML layers [15–18], and absorbing (non-reflecting) boundary conditions. Infinite element methods represent the exterior complement by assuming a radial approximation with suitable outgoing wave behavior. Matched absorbing layers attempt to rapidly decay outgoing waves in a relatively thin layer exterior to C. For the non-reflecting (absorbing) boundary conditions, the outgoing wave solution in D is represented by a relation of the unknown solution and its normal derivative on the artificial truncation boundary C. Options include matching exact analytical series solutions as used in the global Dirichlet-to-Neumann (DtN) map [19–23] and various local approximations [24–27]. Numerical and implementation aspects of these alternative boundary treatments are reviewed in [28]. The computational merits of infinite elements, absorbing layers, and local absorbing boundary conditions, are that they incorporate approximating basis functions with compact support, thus preserving the sparsity of the discretization for the bounded domain. In the case of the local absorbing boundary conditions the partitioning of the computational domain for parallel solution on distributed memory computers is straightforward. In the case of infinite elements, and absorbing layers, a partitioning with variable computational weights may be required. The first- and second-order local absorbing conditions of [24] on a truncation boundary of spherical shape have been widely used because of there simple implementation [9,29–31]. High-order terms in the sequence of local boundary conditions are difficult to implement due to the presence of high-order derivatives. To obtain a tight fit around elongated objects, the second-order local conditions of [24] have been generalized to spheroidal [22,23] and arbitrary convex surfaces [26]. A tight fit reduces the size of the computational domain X, thus reducing the overall cost of solution. Local approximate absorbing boundary

Fig. 1. Artificial boundary C defining finite computational domain X for the exterior problem.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3711

conditions preserve the sparsity of the matrix equations within the bounded domains, but must be used with caution, as errors produced by spurious reflections are generally unknown. This sometimes requires multiple solutions and remeshing, where the truncation boundary must be systematically moved further away from the scatterer to provide solutions with reflection error below the discretization error. Complexity estimates show that it is usually more efficient to use high-order accurate absorbing conditions which enable smaller computational domains. Direct finite element implementation of high-order local operators are problematic in conventional finite element methods since regularity in angular derivatives higher than standard C0(C) are required [32]. The development of high-order local boundary conditions for which the order can be easily increased to a desired level are usually based on using auxiliary variables to eliminate higher-order derivatives [27,33–37]. While generally derived for the time-dependent case,pffiffiffiffiffiffi time-harmonic counterparts are readily implemented with time derivatives replaced by ix, where ffi i ¼ 1 and x is the driving frequency. An alternative to high-order local absorbing conditions are global DtN non-reflecting boundary conditions defined on a separable truncation boundary [20,21]. Uniqueness is ensured by modifying the DtN map to incorporate the first- and second-order local absorbing conditions in combination with the series representation [22,38]. The extension of the modified DtN map in spheroidal coordinates suitable for finite element implementation is given in [23,39]. The number of terms taken in the series representation of the DtN boundary condition determines the accuracy of the procedure. By coupling all the nodes on the truncation boundary, the DtN map is non-local giving rise to a full submatrix. As a result, the DtN was considered computationally prohibitive. However, researchers [19,40,41] recognized the symmetric outer-product structure of the DtN map could be exploited to avoid storage or direct computation of a full dense matrix. Exploiting this feature, efficient matrix–vector updates were implemented for Krylov subspace iterative solvers (e.g. QMR, BICG-STAB), rendering the DtN condition competitive with local absorbing conditions on sequential computer systems. For large three-dimensional problems at high-wavenumbers k, resolution requirements require a large number of elements leading to large system matrices. In this case, iterative solvers are preferred over direct factorization methods due to the lower memory requirements and parallel computing performance. To accommodate a growing demand for solving large scale exterior acoustic scattering problems, especially in three dimensions and for higher frequencies (wavenumbers), research efforts have been directed to designing parallel algorithms for the Helmholtz equation on unbounded domains. Domain decomposition methods provide an effective means of problem subdivision for parallel processing. Classical Schur complement based domain-decomposition methods have difficulties when applied to the Helmholtz equation since the inversion of the system matrix defined on each interior subdomain will be singular when the wavenumber corresponds to a resonance frequency (eigenvalue) of the subdomain Dirichlet problem. The first resonance will occur at a resolution of less than two subdomains per wavelength [42]. 
Schwarz-type domain decomposition methods have been used with Robin-type transmission interface conditions [43–45]. However, a difficulty is that the iteration count increases with many subdomains. An overlapping Schwarz method is demonstrated in [46] in combination with GMRES acceleration and coarse grid corrections to improve converge rates. Another additive Schwarz domain decomposition method with the Robin-type subdomain interface transmission conditions has been proposed in [47] where the non-local DtN non-reflecting boundary condition is computed with an iterative lag to maintain sparsity of the subdomain matrices. In [48] a preconditioned restarted GMRES iterative method is used for the solving the Helmholtz equation, including non-local non-reflecting boundary conditions. Domain decomposition is used with a Schur complement algorithm and fast preconditioners for the subdomains to accelerate convergence. In [49,50], a non-overlapping domain decomposition method based on two-level Lagrange multipliers and the alternating Robin-type transmission conditions at subdomain interfaces given in [43,44], is used to solve the Helmholtz equation with second-order local absorbing boundary conditions. The matrices

3712

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

restricted to each subdomain include regularization matrices associated with the simple impedance operator defined by integration over subdomain interfaces. The use of regularization matrices for the Helmholtz operator provides for a unique solution on each subdomain. Inclusion of the alternating regularization matrix on the interface boundaries cancel upon global assembly, thus reverting the original problem, and lead to non-singular and invertible subdomain system matrices. The transmission conditions can be interpreted as a simple local preconditioner of the linear system condensed on the interface. Improved transmission conditions with tangential derivatives have been derived in [51]. Optimized coefficients have been chosen to minimize the convergence rate of the Jacobi algorithm in the closely related additive Schwarz method with no overlap [52]. In [53] it is shown that the optimal augmented interface operator is the Schur complement of the outer domain. Approximations of this Schur complement with sparse approximate inverse methods and incomplete factorization are investigated. An important aspect for efficient parallel algorithms, especially for distributed memory systems, is minimizing the amount of data interchanged (communicated) between subdomains. Maximum decoupling between subdomains is pursued when designing parallel algorithms, especially for massively parallel architectures. In this framework, local boundary conditions are natural for parallel efficient algorithms due to their locality. However, local first- and second-order operators such as those used in [30,54] require extra computational domain to produce accuracy comparable to DtN (as demonstrated in [55] and the numerical studies presented in this paper). For a limited number of subdomains, it is reasonable to believe an optimized implementation of the DtN operator would be more computationally effective than local conditions. As the number of subdomains increases, the parallel performance degrades faster by applying DtN versus local conditions, as communication costs will eventually overwhelm computation costs, due to a higher coupling between subdomains. However, for large to medium grain size (hundreds of processors) the DtN boundary condition may prove more cost-effective. In this work, an efficient parallel algorithm is presented for computing DtN matrix–vector products in Krylov-based iterative methods, which takes advantage of the low-rank matrix update resulting from the exact non-reflecting DtN map into a parallel iterative process. To minimize the computational domain around elongated structures, a spheroidal DtN non-reflecting boundary is used. We demonstrate using linear tetrahedral elements that for the same accuracy, the modified DtN non-reflecting boundary condition applied to a spheroidal boundary and implemented as a low-rank update is more cost-effective than using the local approximate boundary conditions in the context of iterative equation solvers. The low-rank matrix implementation of the discrete DtN map should be even more efficient for dispersion reducing high-order interpolation where high-accuracies are needed. Parallel scale-up studies show that the effect on the overall communication of our parallel DtN matrix-by-vector product algorithm is roughly that of a relatively small dot-product inter-processor communication. 
Numerical results for a three-dimensional benchmark problem show that speedup and parallel efficiency have not been significantly impacted by DtN versus a purely approximate local boundary condition.

2. The exterior problem in unbounded domains Consider time-harmonic scattering and radiation of waves in an infinite three-dimensional region D  R3 , truncated by an artificial boundary C thus enclosing a finite computational domain X. The Helmholtz equation is satisfied within X, ðr2 þ k 2 Þu ¼ 0;

x 2 X;

ð1Þ

where u is the scalar field, k 2 C is the wave number, Im k P 0. The solution of the Helmholtz equation exterior to the truncating surface satisfies the Sommerfeld radiation condition at infinity, ensuring outgoing waves in the far-field.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3713

On the artificial boundary, a non-reflecting boundary condition is defined by the Dirichlet-to-Neumann (DtN) map, ou ¼ MðuÞ; on

x 2 C;

ð2Þ

where M: H1/2(C) ! H1/2(C) is an integral operator relating Dirichlet data to the outward normal derivative of the solution to the Helmholtz problem exterior to the truncating surface, restricted on C. Using the DtN map (2), the originally unbounded exterior problem is replaced by an equivalent reduced problem defined on the bounded domain X: Given g 2 C; Find uðxÞ 2 C, such that (Fig. 2), ðr2 þ k 2 Þu ¼ 0; in X; ou ¼ g; on S; on ou ¼ Mu; on C: on

ð3Þ ð4Þ ð5Þ

The corresponding weak form is (W): Find u 2 H1(X), such that, for all w 2 H1(X): Bðw; uÞ  ðw; MuÞC ¼ ðw; gÞS ; where

Z

 uÞ dx; ðr w  ru  k 2 w Z  Mu dC; ðw; MuÞC :¼ w Z C  g ds: ðw; gÞS :¼ w Bðw; uÞ :¼

ð6Þ

ð7aÞ

X

ð7bÞ ð7cÞ

S

Formally, the L2(C) inner-product of the trace w 2 H1/2(C) and the DtN map Mu 2 H1/2(C) is defined by the dual pairing, ðw; MuÞC ¼ ðw; MuÞH 1=2 H 1=2 . For a general DtN map, the solution is unique if the innerproduct Im(u, Mu)C > 0 (or <0), for all u 2 H1/2(C), u 5 0, [22,56]. Complexity estimates show that it is usually more efficient to use high-order accurate conditions which enable small computational domains. To this end, the artificial boundary C is often defined in separable coordinates such as a sphere or spheroid,

Fig. 2. Exterior scattering problems. Exterior domain D is reduced to computational domain X enclosed by the scatterer S and the truncation boundary C.

3714

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

or a rectangular shape in Cartesian coordinates. The use of spheroidal or rectangular coordinates allows the artificial boundary to obtain a tight fit around elongated objects. A history on the origins of the DtN finite element method for acoustics and other wave problems in exterior domains is given in [38].

3. The DtN non-reflecting boundary condition in prolate spheroidal coordinates Consider the DtN map defined in prolate spheroidal coordinates. The angular coordinate is defined as t = (g, u) where g = cos h, and u is the polar coordinate for spheroidal systems, see Fig. 3. A prolate spheroid is defined by a constant radial value l, with an ellipse revolving around the major z-axis. Alternatively, the spheroid pffiffiffiffiffiffiffiffiffiffiffiffiffimay be parameterized by x = x(n, t), where n = cosh l, and g = cos h, so that a = fn, and b ¼ f n2  1 are the semi-major and semi-minor axis of the ellipse. In prolate spheroidal coordinates the Helmholtz equation may be written as   o2 o f ðn2  1Þ 2 þ 2f n þ DC þ k 2 hn hg hu u ¼ 0; ð8Þ on on where   o hn hu ou hn hg o2 u þ DC u :¼ og hg og hu ou2 is the surface Laplacian, and f(n2  1) = hghu/hn. In the above, qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hn ¼ f ðn2  g2 Þ=ðn2  1Þ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hg ¼ f ðn2  g2 Þ=ð1  g2 Þ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hu ¼ f ð1  g2 Þðn2  1Þ

ð9Þ

ð10aÞ ð10bÞ ð10cÞ

are the metrics for the prolate coordinate system.

Fig. 3. Prolate spheroidal coordinate system defined by the parameterization x ¼ xðl; h; uÞ ¼ ½b sin h cos u; b sin h sin u; a cos h, 0 6 hp<ffiffiffiffiffiffiffiffiffiffiffiffiffiffi p, and ffi 0 6 u < 2p, where a and b are the semi-major and semi-minor axis of the ellipse, defined by (a, b) = f(cosh l, sinh l), and f ¼ a2  b2 is the semi-interfocal distance.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3715

The artificial boundary C corresponds to a constant radial coordinate n0 = cosh l0. The outgoing solution to the Helmholtz equation (8) in the unbounded region n P n0 = cosh l0 can be expressed the expansion: uðn; tÞ ¼

n 1 X X  0 Rmn ðc; nÞ  amn wcmn ðc; tÞ þ bmn wsmn ðc; tÞ ; R ðc; n Þ mn 0 n¼0 m¼0

ð11Þ

where c = kf is a normalized wave number, and Rmn(c, n) are the radial prolate spheroidal wave functions of the third kind [57,58]. In (11), the prime after the sum indicates that terms with m = 0 are multiplied by 1/2. The even and odd orthogonal eigenfunctions are defined by 1 wcmn ðc; tÞ :¼ pffiffiffiffiffiffiffiffiffiffiffi S mn ðc; gÞ cos mu; pN mn 1 wsmn ðc; tÞ :¼ pffiffiffiffiffiffiffiffiffiffiffi S mn ðc; gÞ sin mu; pN mn where Smn(c, n) are the angular prolate spheroidal wave functions of the first kind, and Z p Z 1 amn ðcÞ ¼ uðn0 ; tÞwcmn ðc; tÞ dt; p 1 Z p Z 1 uðn0 ; tÞwsmn ðc; tÞ dt: bmn ðcÞ ¼ p

ð12aÞ ð12bÞ

ð13aÞ ð13bÞ

1

In the above, dt = dg du, and Nmn are the standard normalization coefficients used by Flammer [58]: N mn ¼ 2

1 X 0 l¼0;1

ðl þ 2mÞ! 2 ðd mn Þ ; ð2l þ 2m þ 1Þl! l

ð14Þ

where d mn l ðcÞ are the coefficients in the expansion for the angular functions Smn(c, g) in terms of associated Legendre functions. In (14), the prime on the summation sign indicates that the summation is carried out for even l when (n  m) is even and for odd l when (n  m) is odd. Evaluating the normal derivative of the eigenfunction expansion (11) on the boundary at n = n0 = cosh l0 defines the DtN map for a the prolate spheroidal truncation boundary [22]. The DtN operator defines an exact non-reflecting condition on the artificial boundary; i.e., there are no spurious reflections introduced at C. The operator is non-local since integration is required over the whole surface to compute the coefficients amn and bmn. In practice, the sum over n in the expansion is truncated at a finite value Nm and is implemented naturally in the variational equation as a symmetric form. The DtN map exactly represents all harmonics in the solution up to the number of terms included in the truncated series expansion as measured by Nm. For fixed Nm, the harmonics n > Nm are evaluated with a homogeneous Neumann boundary condition on C. If an insufficient number of harmonics is included in the series, non-unique solutions may result when k2 matches one of an infinite number of interior resonances (real eigenvalues) associated with the Laplacian operator. As a result, the accuracy of the harmonics n > Nm are poorly represented and a minimum value is required to ensure uniqueness. To ensure uniqueness more terms in the DtN map than may be necessary to achieve a desired accuracy, leading to a potential for excessive computation. To eliminate the bound Nmin for uniqueness, a modified DtN may be formulated by generalizing the normal derivative applied to the harmonic expansion for outgoing waves (11), with a local differential operator representing an approximate radiation boundary condition [22]. The resulting modified DtN condition is unique for any choice of Nm, and approximates the harmonics n > N with greater accuracy than the original DtN condition.

3716

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Local boundary conditions are easily constructed by extending the procedures employed in [24] for a circle or sphere, where radial terms in a multipole expansion for outgoing waves are absorbed. The generalization to spheroidal coordinates is given by the asymptotic expansion given by [4]: 1 gj ðh; u; cÞ expðicnÞ X ; ð15Þ u j cn ðcnÞ j¼0 where c = kf, is the normalized wave number and n is the radial coordinate in spheroidal coordinates. The sequence of local operators which annihilate radial terms in the expansion (15) is constructed as a product of normalized radial derivatives. The first two boundary conditions are:   1 o B1 u ¼ þ a1 u ¼ 0; on C; ð16Þ f on   1 o2 o þ a3 u ¼ 0; on C; B2 u ¼ 2 þ a2 ð17Þ f on2 o; n where a1 = (1  icn0)/n0, a2 = (4  2icn0)/n0, and a3 ¼ ð2  4icn0  ðcn0 Þ2 Þ=n20 . Both conditions B1 and B2 satisfy the uniqueness condition, Im(u, Bu)C 5 0. The local condition B2 is preferred since it is substantially more accurate compared to B1. Local approximate conditions require careful placement when computing the response over a range of frequencies; the size of the computational domain and the mesh density must be carefully selected to achieve a prescribed accuracy. Applying the B1 operator to the eigenfunction expansion (11), evaluated at n0, gives [23], n Nm X X 0 ð1Þ ou 1 ou 1 1 ¼ ¼  z1 u þ Z mn ðcÞ Dmn ðc; tÞ; on hn on Js Js n¼0 m¼0

ð18Þ

z1 ¼ f ðn20  1Þa1 ;

ð19Þ

2 Z ð0Þ mn ðcÞ ¼ f ðn0  1Þ

R0mn ðc; n0 Þ ; Rmn ðc; n0 Þ

ð0Þ Z ð1Þ mn ðcÞ ¼ Z mn ðcÞ þ z1 ; Z 1 0 0 0 Dmn ðc; tÞ :¼ 0 uðn0 ; t Þwmn ðtjt Þ dC ; 0 J C s

wmn ðtjt0 Þ :¼

1 S mn ðc; gÞS mn ðc; g0 Þ cos mðu  u0 Þ: pN mn

ð20Þ ð21Þ ð22Þ ð23Þ

In the above, dC = Js dg du, where Js = hghu is the surface Jacobian evaluated at n0 = cosh l0. When the condition (18) is truncated at the finite value Nm, it is exact for harmonics n 6 Nm, and approximates the harmonics n > Nm with the local approximate condition B1u = 0. The normal derivative form derived in (18) is convenient for finite element implementation as a natural boundary condition in a Galerkin variational equation. The modified condition given in [22] is left in terms of a derivative with respect to l, and is suitable for finite difference implementation. Applying the local B2 operator to the eigenfunction expansion (11) with second-order radial derivatives replaced in favor of tangential derivatives via the Helmholtz equation in spheroidal coordinates (8) leads to the modified DtN [23]: n Nm X X

0 ð2Þ ou 1 1 ¼ DC  fc2 g2  fz2 u þ Z mn ðcÞ Dmn ðc; g; uÞ; on mJ s Js n¼0 m¼0

ð24Þ

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3717

where 2n0 z2 ¼ ðn20  1Þa3  c2 n20 ; 1 " # f m2 ð2Þ ð0Þ kmn þ 2 Z mn ðcÞ ¼ Z mn ðcÞ þ þ z2 : m n0  1

m ¼ a2 

n20

ð25Þ ð26Þ

In the above, kmn(c) are the characteristic values of the prolate spheroidal wave functions. The form of the modified DtN condition (24), after integrating-by-parts the second-order tangential derivatives appearing in the surface Laplacian DC, leads to a symmetric Galerkin variational form with standard finite element spaces in H1(C). In [22], the B2 operator is left with second-order derivatives in the radial coordinate l, suitable for finite difference approximation. For the modified DtN boundary condition, the term (7b) is composed of both a local and global DtN part: Z ðjÞ ðjÞ  Mu dC ¼ BC ðw; uÞ þ Z C ðw; uÞ; j ¼ 1; 2: ð27Þ w |fflfflfflfflffl{zfflfflfflfflffl} |fflfflfflfflffl{zfflfflfflfflffl} C DtN

local

The local part of the modified DtN operators using B1 and B2, respectively, are Z 1 ð1Þ  u dC; BC ðw; uÞ :¼ z1 w J ZC s Z f 1 1 ð2Þ 2 2   rs u dC; ðz2 þ c g Þ wu dC  hn r s w BC ðw; uÞ :¼  m C Js m C

ð28Þ ð29Þ

where rs u :¼

1 ou 1 ou þ : hg og hu ou

ð30Þ

The non-local part is ðjÞ

Z C ðw; uÞ :¼ 

n 1 X X 0 n¼0

  Z ðjÞ w; wcmn ÞC  ðu; wcmn ÞC þ ð w; wsmn ÞC  ðu; wsmn ÞC mn ð

ð31Þ

m¼0

for j = 1, 2 corresponding to B1- or B2-modified DtN, respectively. The inner-products over the boundary C are defined by Z 1 ðu; wcmn ÞC :¼ uwcmn ðc; g; uÞ dC; ð32aÞ C Js Z 1 uwsmn ðc; g; uÞ dC: ð32bÞ ðu; wsmn ÞC :¼ J s C The local boundary operator in (27) provides an approximation to the spectral properties of the exact DtN map, and thus can be used to construct an efficient preconditioner for an iterative solution with the complete DtN operator [23,41].

4. Finite element discretization SN e h The domain X is discretized into a finite number of elements, X  Xh ¼ e¼1 Xe . The total number of nodes for the finite element mesh is N. Using a standard Galerkin approximation uh(x) = NT(x)d, where

3718

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

N 2 RN is a column vector of standard C0 basis functions, and d 2 CN is a column vector containing the N unknowns nodal values the following indefinite complex-symmetric matrix system results: ðK S þ K DtN Þd ¼ f ;

ð33Þ

where K S ¼ ðS  k 2 MÞ þ K C is a sparse matrix, decomposed into a part associated with the discretization of the Helmholtz equation in X, defined by the real symmetric arrays S and M defined in (7a) and a complex part KC associated with the local boundary operator BC (28) and (29). Additional wavenumber dependent matrices may also be present when using stabilized finite element methods designed for improved accuracy of the discrete wavenumberfrequency relation, e.g., the Galerkin Least-Squares (GLS) and related residual based stabilized methods. The DtN contribution couples all the nodes on the boundary C, and if assembled, is given by the matrix K DtN ¼ PC  Z C  PTC ;

ð34Þ

N C N C

where Z 2 C is a full submatrix of size equal to the number of nodes on the boundary NC, and PC is a permutation matrix used to represent pointer arrays which extract truncation boundary unknowns from the total number of unknowns. Due to the special structure of the DtN map defined on a separable boundary, the complex-symmetric (non-Hermitian) block matrix KDtN is defined by a summation of vector outer products (rank-1 vector updates), ZC ¼

n Nm X X 0 n¼0

  T T Z ðjÞ mn cmn cmn þ smn smn ;

ð35Þ

m¼0

with even and odd vectors of size NC, equal to the number of unknowns on the truncation boundary C, Z 1 T c cmn ¼ ðN T ; wcmn ÞC ¼ N wmn ðc; tÞ dC; ð36aÞ C Js Z 1 T s N wmn ðc; tÞ dC: ð36bÞ smn ¼ ðN T ; wsmn ÞC ¼ C Js For a direct solve of (33), the storage of the full dense matrix KDtN, associated with the non-local DtN operator, requires OðN 2C Þ complex numbers. The storage for the sparse matrix KS is O(N), so that the total storage required for (KS + KDtN) is OðN þ N 2C Þ. For large models, the fully populated submatrix KDtN becomes prohibitively expensive to store and factorize. 4.1. Iterative solution methods for the modified DtN The situation is significantly different when iterative solution methods are used. Iterative solution methods are generally the algorithm of choice for solving systems with large numbers of unknowns due to significantly reduced storage requirements. Suitable iterative solvers for the indefinite complex symmetric matrix systems arising from discretization of the Helmholtz equation include BiCG-Stab [59], and QMR [60] methods. To accelerate convergence of the iterative process, efficient implementations of algebraic and other preconditioners can be constructed based on the sparse matrix KS. The local operator KS provides a good approximation to the spectral properties of the complete system matrix, KS + KDtN [23,41]. The computationally intensive kernel in Krylov subspace iterative solvers is the repeated operation of matrix-by-vector products of the form vj = ZCuj, at each iteration j. The special structure of the DtN matrix ZC, as a summation of rank-1 vector updates can be exploited to significantly reduce storage and cost. Here, the vectors cmn and smn can be organized in order and stored as columns in the matrix C, of dimension

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3719

(NC · NT), where NT = Nm(Nm + 1) denote the total number of harmonics included in the truncated DtN expansion. The DtN matrix contribution can then be expressed in the form of a generalized rank-NT update [39], Z C ¼ C  D  C T;

ð37Þ

C ¼ ½c00 ; . . . ; cN m N m ; s00 ; . . . ; sN m N m ; 1 D ¼  diagðZ 00 ; . . . ; Z N m N m ; Z 00 ; . . . ; Z N m N m Þ; p

ð38Þ ð39Þ

where D is a NT · NT diagonal matrix of impedance coefficients, and C contains the vector harmonics. Using this construction, a matrix–vector product with the DtN matrix v = ZCu, can be computed efficiently in block form using the following algorithm: 1. Set w = CTu, 2. Set w = Dw, 3. Set v = Cw. The dense matrix ZC is never assembled; only C and the diagonal coefficients in D need be stored. This implementation avoids the direct evaluation of the full submatrix associated with the unknowns on C producing significantly reduced storage costs. The storage requirements and number of operations are reduced to O(N + NT NC). Since NT NC, the storage and cost is considerably lower than a straightforward matrix–vector product. An effective preconditioner to accelerate convergence of the iterative solution is the SSOR algebraic decomposition together with Eisenstats trick [61]. Here, the sparse matrix KS, which is stored and split into, KS = L + D + LT, where D :¼ diag{KS} is a diagonal matrix and L is a strictly lower triangular matrix. The SSOR preconditioner is defined by P ¼ LD1 LT , where L ¼ ðD þ xLÞ. Here 0 < x < 2, is a relaxation parameter. With x ! 0, the SSOR preconditioner reduces to the diagonal (Jacobi) preconditioner, i.e., P = D. Using this sparse SSOR preconditioner leads to the modified system, Ax ¼ b; 1=2

A ¼ D1=2 L1 ðK S þ K DtN ÞLT D1=2 ; T

1=2

ð40Þ

1

where x ¼ D L d, b ¼ D L f . Iterative solution of the preconditioned system (40) requires a matrix– vector product v = Au, where A is the complex-symmetric preconditioned matrix and u is an iteration vector. Here the matrix–vector product v = Au, is computed efficiently using Eisenstats trick [61], together with the special structure of KDtN written in the matrix outer-product form (37), giving rise to:   1 A ¼ D1=2 LT þ L1 ðI þ ½ðx  2ÞD  xCDC T LT Þ D1=2 : ð41Þ x

5. Parallel iterative solution method Parallel iterative methods provide a means for dividing the problem into subsystems which when solved in parallel, provide compute time speedup, and for distributed-memory computer systems, the ability to scale-up to very large systems. A computationally intensive kernel in Krylov subspace iterative solvers is the repeated operation of matrix-by-vector products. The goal of an efficient algebraic parallel solution is to compute the matrix-by-vector products concurrently and to minimize communication. Consider domain Xh divided into p non-overlapping subdomains (substructures) Xhi such Sp h the discretized h that i¼1 Xi ¼ X . The unknowns within each subdomain are further partitioned into interior and interface nodes, see Fig. 4. The following nomenclature will be used throughout this section: • • • •

N—total number of unknowns for the global problem, Ns—total number of interior unknowns, Nv—total number of unknowns on the interface, Ni—number of unknowns in the interior of subdomain i,

3720

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

interior nodes – numbered first interface nodes – numbered last

Fig. 4. Illustration of a subdomain Xi partitioned into interior and interface nodes. The number unknowns in the interior of subdomain i is denoted Ni. The number of unknowns on the interface of subdomain i is denoted N vi .

• • • • • •

N vi —number of unknowns on the interface of subdomain i, NC—total number of DtN unknowns on the boundary C, N Cs —total number of interior DtN unknowns, N Cv —total number of DtN unknowns on the interface, N Ci —number of DtN unknowns in the interior of subdomain i, N Cvi —number of DtN unknowns on the interface of subdomain i.

5.1. Sparse matrix–vector product Partitioning the computational domain into p non-overlapping subdomains and numbering the nodes interior to the subdomains first, and nodes on the interfaces numbered last, then given the global vector iterate d 2 CN , a sparse system matrix-by-vector iterate f = KSd, takes the form of a block arrowhead matrix structure: 8 9 2 38 9 A1 B1 > f1> d1 > > > > > > > > > > 6 > 7 >f > > > > > > A B d > > > > 7 6 2 2 2 2 > > > = < = 6 7< > .. 7 .. .. 6 .. ¼6 ð42Þ 7 . ; . . . > > > 7> 6 > > > > > > > > 7 6 > 4 > > Ap B p 5> > > > > >fp > > dp > > > ; ; : > : > T T T B1 B 2    Bp Av dv fv where each di, i = 1, . . . , p, represents the subvector of unknowns that are interior to subdomain Xi, and dv represents the vector of all interface unknowns. In the above, Ai = Si  k2Mi + KCi are sparse matrices of size Ni · Ni associated with internal unknowns for each subdomain, and Av is a square matrix of size Nv · Nv, associated with unknowns on the interfaces. The global interface matrix Av is defined as the assembly of contributions of local interface submatrices Avi , but is never explicitly assembled for the iterative solution process. The Bi are sparse matrices of size Ni · Nv representing the subdomain to interface coupling. In general, the number of interface unknowns Nv will be considerably less than the number of internal unknowns, Nv Ns.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3721

Local interior to local interface coupling matrices B ivi of size N vi  N vi are formally defined by the operation B i ¼ B ivi Pi where Pi is a permutation matrix of size (N vi  N v ), is a global-to-local interface mapping for subdomain i. For iterative solutions on a parallel computer, the sparse matrix–vector products for each local subdomain can be computed in parallel for each processor i, # " Ai Bivi fi di ¼ ; ð43Þ B Tivi Avi f vi d vi where Avi and d vi ¼ Pi d v are local interface subarrays and B ivi are local coupling matrices assembled from elements on each subdomain and stored on each distributed processor system. Both the vectors fi and f vi can be computed concurrently on each processor since the local arrays Ai , Bivi , d i , d vi are localized on each computing node i. The global interface vector fv is updated by assembly of each subvector f vi using fv ¼

p X

PTi f vi :

ð44Þ

i¼1

Inter-processor communication from the above operation is minimized by taking advantage of the fact that nodes on interfaces are shared by only a few local subdomains so that communication for vector updates need only occur between adjacent subdomains sharing common interface data. Therefore, the assembly operation in (44) does not require global communication between all computing nodes, but rather local data updates between adjacent subdomains sharing interface data. Another optimization may be applied by using non-blocking communication subroutines, allowing for data communication and computation to overlap. In this way, communication costs may be hidden to a certain extent, depending on the hardware support for such operations. In summary, using the mapping between interface and interior nodes, the sparse matrix–vector product f = KSd is computed from 9 8 A 1 d 1 þ E 1 d v1 > 8 9 > > > > > f 1> > > > > > > > .. > > > > > > > > = < .. = < . . : ð45Þ ¼ A d þE d p p p vp > > > > > > > > f > > > > p > > > > P > ; p > : > > > PTi f vi > fv ; : i¼1

5.2. DtN matrix–vector product Given a vector iterative d 2 CN , the matrix–vector contribution g = KDtNd from the DtN matrix is given by " #  ds M s Z s M Ts M s Z sv M Tv g¼ ; ð46Þ T T dv M v Z vs M s M v Z v M v where

2

Z 11 6 . Zs ¼ 6 4 .. Z p1

 .. . 

3 Z 1p .. 7 7 . 5; Z pp

2

3 Z 1v 6 . 7 . 7 Z sv ¼ 6 4 . 5; Z pv

2

3 d1 6 . 7 . 7 ds ¼ 6 4 . 5 dp

3722

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

and Ms = diag[M1, . . . , Mp]. The blocks Mi, of size (N i  N Ci ) are the permutation matrices mapping the local interior DtN unknowns to the local interior, for subdomain i. Similarly, Mv, of size (N v  N Cv ) maps the interface DtN unknowns to total interface; Zij, Ziv, represent the discretized DtN operators coupling subdomain i with subdomain j and the total interface v, respectively; Zvv represents the DtN operator along the global interface v. Fig. 5 illustrates the coupling of DtN boundary nodes for a domain partitioned into p = 4 subdomains. Denote d Cv ¼ M Tv d v as the vector of unknowns corresponding only to the DtN part of the total interface. For efficient parallel computation, the interface unknowns for the local interface vj are split between adjacent subdomains sharing the same interface node. Given the vector of total DtN interface unknowns d Cv , the averaged local interface vectors are formed by, for j = 1, . . . , p 8 0 if node corresponding to k > > > < n o not on the interface vj C ð47Þ d vj ¼  C  > k d v k =s if node corresponding to k > > : on the interface vj ; where s is the number of subdomains sharing the node corresponding to position k in the vector d Cv . As a result of this decomposition, the vector of total DtN interface unknowns is defined by the summation over all subdomains, d Cv ¼

p X

C d vj :

ð48Þ

j¼1

Formally, this operation can be expressed as d Cv ¼

p X

Rj d Cvj ;

ð49Þ

j¼1

where Rj, of size (N Cv  N Cvj ), maps the local DtN interface to the global DtN interface. Using this construction, the interior to interface matrix vector products may be expressed as Z iv d Cv ¼

p X

Z ivj d Cvj ;

ð50Þ

j¼1

where Z ivj ¼ Z iv Rj . Defining d Cj ¼ M Tj d j to be the interior DtN part of subdomain j, the DtN matrix–vector product in (46) can be expressed as Z 22 Z 33

Z 11

2 1

3 4 Z 44

Fig. 5. Illustration of coupling between nodes on the DtN boundary for a domain partitioned into four subdomains. DtN matrix partitions Zii associated with interior nodes on the boundary are highlighted.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

8 9 p  P C C > > > > M Z d þ Z d 1 1j 1v > > j j v j > 8 9 > > > j¼1 > > > > g 1 > > > > > > > > . > > > > > > > > . = < .. = < . .   p : ¼ P > > Mp Z pj d Cj þ Z pvj d Cvj > > > > > gp > > > > > > > > > j¼1 > : ; > > > > gv > > > p  > P > >M C C > > Z vj d j þ Z vvj d vj > ; : v

3723

ð51Þ

j¼1

The interior matrix–vector products are thus g i ¼ M i f Ci ; i ¼ 1; 2; . . . ; p, where p   X Z ij d Cj þ Z ivj d Cvj ; f Ci ¼

ð52Þ

j¼1

while the interface matrix–vector product gv ¼ M v f Cv , where p   X Z vj d Cj þ Z vvj d Cvj : f Cv ¼

ð53Þ

j¼1

The vectors d Cvs ¼ Z vj d Cj and d Cvv ¼ Z vvj d Cvj can be written as sums over interface DtN vectors for each subdomain i, and applying a similar decomposition as in (47) and (49), the total interface matrix–vector product can be expressed as the sum, gv ¼

p X

M vi f Cvi ;

ð54Þ

i¼1

where f Cvi ¼

p   X Z vi j d Cj þ Z vi vj d Cvj ;

ð55Þ

j¼1

where M vi of size (N v  N Cvi ) maps the local DtN interface for processor i to the global interface. Combining (52) and (55) in matrix form, each processor i must perform the following DtN operations: ( ) " #( C ) p X dj Z ij Z ivj f Ci ¼ : ð56Þ C Z vi j Z vi vj f vi d Cvj j¼1 The non-local property of the DtN map manifests itself in the summation over each processor j. On each processor i, contributions from all processors j have to be gathered. For the corresponding sparse-matrix contribution in (43), no gather operation is required because of the local property of the arrays. A naive implementation of the DtN matrix–vector operation for each processor i would be to compute and store full submatrices Zij, Z ivj , Z vi vj , and perform a collective global communication (all-to-all type communication) with vectors of relatively large size (N Ci þ N Cvi ). Since the gather operation is grain-size dependent, this implementation would have a potentially high negative impact on the scalability and parallel efficiency of the computation. The grain size is the number of boundary nodes per domain. To address this communication problem, we never explicitly form the full DtN matrices Zij, Ziv, Z vi ¼ Z Tvi , and Zvv. Instead, the symmetric outer-product form (37) is exploited to reduce communication costs in a parallel implementation on distributed memory computer systems. Here the multiplicative split (37) is defined on each subdomain, Z ij ¼ C i  D  C Tj ;

Z iv ¼ C i  D  C tv ;

Z vv ¼ C v  D  C Tv ;

ð57Þ

3724

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

where C i ¼ ½ci00 ; . . . ; ciN m N m ; si00 ; . . . ; siN m N m ; cimn

i ¼ 1; . . . ; p

simn

and are the angular function vectors defined in (36) restricted to subdomain i, which are comand puted and stored on each processor concurrently. The matrix Z vi vj ¼ C vi DC Tvj are the discretized DtN operators coupling the local interface vi and vj, and C vi are the restrictions of the matrix angular functions to the boundary interfaces. Using the decompositions (57), the collective communication of the DtN matrix–vector product (56) can be formed as ( )

X p Ci f Ci ¼ uj ; D ð58Þ f Cvi C vi j¼1 where uj ¼ C Tj d Cj þ C Tvj d Cvj

ð59Þ

is a vector of size equal to NT = Nm(Nm + 1), the total number of harmonics, and computed concurrently on each processor. Prior to interface vector updates, the vectors on the left-side of the summation in (58) are added to the sparse matrix–vector product (43) concurrently on each processor. The total of the sparse and DtN contributions are, bi ¼ f i þ M i f Ci , and bvi ¼ f vi þ M vi f Cvi , i = 1, . . . , p. The summation in (58) is carried out by summing each uj on each processor j and storing the result back onto each of them (requires all-to-all communication). As a result, collective communication of the non-local DtN has been reduced to a relatively small vector of size N T N Ci , independent of the grain size (number of boundary nodes per domain). Using this procedure, the DtN matrix–vector product can be computed with one collective communication per iteration with a vector size equal to the number of harmonics included in DtN series expansion. The effect on the overall communication is roughly that of an extra dot-product communication (The basic BiCG-STAB iteration method for sparse matrices requires two matrix–vector products and six dotproducts per iteration). The algorithm can be summarized in the following. 5.3. Parallel matrix–vector product algorithm For each subdomain i: (1) Extract the DtN unknowns on the boundary from the local vector ( ) ( ) d Ci M Ti d i ¼ : d Cvi M Tvi d vi

ð60Þ

(2) Compute the vectors of size NT = Nm(Nm + 1) ui ¼ C Ti d Ci þ C Tvi d Cvi :

ð61Þ

(3) Sum the vectors uis from all processors and store the result back onto each of them (requires all-to-all type communication) ui

p X

uj :

j¼1

Dui (4) Compute ui (5) Add the sparse and DtN matrix–vector products

ð62Þ

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741



bi b vi

"

¼

Ai BTivi

Bivi Avi

#

di d vi

 M iC i þ ui : M vi C vi

3725



ð63Þ

(6) Update interface bv ¼

p X

PTi bvi :

ð64Þ

i¼1

6. Numerical studies Prior to performing parallel scale-up studies, we first demonstrate that for the same accuracy, the modified DtN(B1) non-reflecting boundary condition applied to a spheroidal boundary defined in (18) and implemented as a low-rank update with the multiplicative split given in (37) is more cost-effective than using the local B1 operator in the context of iterative equation solvers. While the results reported here are for the modified DtN(B1) condition, similar conclusions are reported in [39,55] for the modified DtN(B2) condition defined in (24). 6.1. Validation and comparison of local and global DtN conditions Consider a prolate spheroidal scatterer with focal distance fs = 10q and elliptic radial coordinate ls = 0.1, ffiffiffiffiffiffiffiffiffiffiffiffi ffi

such that ns ¼ cosh ls  1:005, L = 2a = 2fsns  20.1, D ¼ 2b ¼ 2f s n2s  1  2:0, L/D  10. The incident field is a plane wave ui(x) = eikxÆm propagating along the direction m =(cos a, sin a cos b, sin a sin b) with incident angles a and b. The scatterer is assumed to be sound-hard, i.e. a homogeneous Neumann condition is imposed for the total field along the scatterer boundary S, or for the scattered field ouðn; g; /Þ oui ðn; g; /Þ ¼ on on

n ¼ ns ; 1 6 g 6 1; 0 6 / 6 2p:

ð65Þ

For prolate spheroidal objects, an infinite series solution can be obtained for the homogeneous Helmholtz equation on unbounded domains, via plane wave expansion on spheroidal coordinates [62]. Matching normal derivatives of incident and scattered fields along the scatterer boundary, the analytical solution is /ðn; g; /Þ ¼

n 1 X X 0 n¼0

  ð3Þ im Amn Bmn Rmn ðn; cÞS mn ðg; cÞ cosðmð/  bÞÞ ;

ð66Þ

m¼0

where Amn ¼ 4

R0mn ðns ; cÞ ; N mn R0ð3Þ mn ðns ; cÞ

Bmn ¼ S mn ðg0 ; cÞ

and g0 = cos a. The scatterer described above is enclosed by a DtN truncation boundary, also in the shape of a prolate spheroid, placed at a distance dx = sxk and dy = dz = syk, where k = 2p/k and k are the wave length and wave number of incident field (see Fig. 6). Two different wave numbers are considered corresponding to D/k = (0.2, 0.4). To ensure accurate mesh resolution, k/h = 25 tri-linear tetrahedral elements per wave length are used. Comparative studies are performed for local B1 and non-local modified DtN(B1) operator defined in (18). Figs. 7 and 8 show the accuracy of the numerical solution for D/k = 0.2 and D/k = 0.4, respectively.

3726

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Fig. 6. Computational domain enclosed by a prolate spheroidal DtN truncation boundary placed at dx = sxk and dy = syk from the embedded spheroidal scatterer.

Relative errors measured in the L2 norm on the scatterer boundary S are reported as a function of circumferential harmonics Nm included in the modified DtN map eigenfunction expansion. By truncating the infinite series at Nm, the modified DtN condition defined in (18) becomes exact for circumferential modes n 6 Nm in the exact solution, and it reduces to an approximate B1 condition for higher modes. If Nm = 0 then DtN reduces to B1 for all the harmonics in the exact solution. To quantify the accuracy with boundary position, the truncation boundary is placed successively farther away from the scatterer. Three different methods are considered for increasing the distance of the DtN truncation boundary from the scatterer: (a) equal offset parameters sx = sy = s, (b) constant horizontal distance sx = 0.25, with increasing sy, and (c) maintaining a confocal (f = constant) spheroidal boundary with increasing radial coordinate n. In case (a), the DtN spheroidal boundary is increased with equal aspect ratio. In case (b), the DtN boundary approaches a sphere as sy is increased. In case (c) the DtN spheroidal boundary is increased in proportion to the radial coordinate with varying aspect ratio. The DtN boundary approaches a sphere as n is increased. Using a sufficient number of harmonics Nm in the DtN condition to obtain an effectively exact non-reflecting boundary condition reduces the error to the discretization level of the finite element approximation. Examining Figs. 7 and 8, we find the relative error of the finite element discretization with mesh resolution kh ¼ 25, corresponding to D/k = 0.2, is approximately 8%, while the error for D/k = 0.4 is approximately 4.5%. As the wavelength is decreased from D/k = 0.2 to D/k = 0.4, the number of harmonics needed to reach the discretization level increases from Nm = 6 to Nm = 8. The accuracy of the local boundary conditions generally deteriorate as the wavelength is decreased. The relative errors for the approximate local B1 operator corresponding to Nm = 0 and with equal offsets sx = sy = s are also shown in Figs. 7a and 8b and are summarized in Table 1. The values corresponding to s = 1/4, 1/2 indicate that boundary reflection error is substantially higher than the discretization error levels when the truncation boundary is positioned near the scatterer. Even after the truncation boundary is moved to s = 1, far away from the scatterer, the discretization levels still have not been reached. These results indicate that moving the truncation boundary with equal distance is not a cost effective strategy when using the local B1 operator. A similar behavior is seen in Figs. 7c and 8c, for the case with keeping the spheroid confocal (f = constant) with increasing radial coordinate n. In this case, the error produced by the B1 condition is reduced but still does not reach the level of the discretization error.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3727

45 s=0.25 (N=6,759) s=0.50 (N=15,829) s=1.00 (N=42,758)

40 35

η(%)

30 25 20 15 10 5 0

a

5

Nm

10

45

15

s y=0.25 (N=6,759) sy=0.50 (N=13,788) s y=1.00 (N=29,576)

40 35

η(%)

30 25 20 15 10 5 0

5

b 40

Nm

10

15

μ=0.4 (sx=0.08, sy=0.31, N=7,051) μ=0.5 (sx=0.12, sy=0.42, N=10,156) μ=0.6 (sx=0.18, sy=0.53, N=14,144)

35

30

η(%)

25

20

15

10

c

5 0

5

Nm

10

15

Fig. 7. Accuracy of the modified DtN(B1) operator for Dk ¼ 0:2; incident angles a ¼ p4, b ¼ p4; (a) equal distance sx = sy = s; (b) fixed sx = 0.25 with increasing sy; (c) confocal, f = constant, with increasing radial coordinate n.

3728

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 90 s=0.25 (N=16,931) s=0.50 (N=32,349) s=1.00 (N=69,505)

80 70

η(%)

60 50 40 30 20 10 0 0

a

5

Nm

90

10

sy=0.25 sy=0.50 sy=1.00 sy=2.00

80 70

15

(N=16,931) (N=29,591) (N=56,102) (N=117,856)

η(%)

60 50 40 30 20 10 0 0

b

5

40

Nm

10

15

μ=0.4 (sx=0.15, sy =0.62, N=33,600) μ=0.5 (sx=0.25, sy =0.84, N=47,855) μ=0.6 (sx=0.36, sy =1.07, N=62,318)

35 30

η(%)

25 20 15 10 5 0 0

c

5

Nm

10

15

Fig. 8. Accuracy of the modified DtN(B1) operator for Dk ¼ 0:4; incident angles a ¼ p4, b ¼ p4; (a) equal distance sx = sy = s; (b) fixed sx = 0.25 with increasing sy; (c) confocal, f = constant, with increasing radial coordinate n.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3729

Table 1 Relative errors for local B1 operator (Nm = 0), with increasing equal offsets sx = sy = s D k

ns

0.2 0.4

1 4

1 2

1

45 83

24 44

12 25

Figs. 7b and 8b show results when the spheroidal DtN boundary is placed very close to the poles of the scatterer with a fixed sx = 0.25, and increasing sy to approach a sphere, as shown in Fig. 9. The results for the B1 operator are summarized in Table 2. The results demonstrate that, when using the local B1 operator, truncation boundaries with improved aspect ratios, yield less spurious reflection error than farther, but more stretched truncation boundaries. For example, the discretization level of 8% error is reached with a truncation boundary placed at sx = 0.25, sy = 1.0, which yields a lower problem size—N = 29,576 degrees-of-freedom (DOF), compared to the case when the truncation boundary placed with equal offset parameters sx = sy = 1.0, yielding a much large number N = 42,758 DOF. 6.2. Cost comparison between local B1 and DtN operators Timings are reported using the bi-conjugate gradient stabilized BICG-STAB iterative method [59] in conjunction with a SSOR preconditioner using Eisenstats algorithm [61]. Sequential timings results are obtained using optimized Sun Fortran90 compilers on a Sun Enterprise Ultra450 server, with 300 MHz UltraSPARC-II cpu. The sparse matrix KS is stored in compressed sparse row (CSR) format [63]. The DtN is computed using the multiplicative split (37) combined with the efficient matrix–vector product described in Section 4.1. In Fig. 10 the performance of the B1 operator is analyzed for various positions and geometries of the truncation boundary. Curve (a) represents cases where the truncation boundary is successively placed farther away from the scatterer with equal offset parameters sx = sy = s = (0.3, 0.4, 0.5, 0.9, 1.6); Curve (b) corresponds to cases with a constant horizontal offset parameter sx = 0.25 and increasing vertical parameter sy = (1.0, 1.35, 1.5, 1.6, 1.8, 2.0); Curve (c) represents cases with a confocal truncation boundary f0 = fs = 10, and l = (0.4, 0.5, 0.6, 0.8, 0.9), corresponding to pairs (sx, sy) = {(0.15, 0.62), (0.25, 0.84), (0.36, 1.07), (0.66, 1.58), (0.96, 1.85)}. Consistent with our earlier findings, Fig. 10 indicates that Case (b) is the most effective, yielding the lowest cost to obtain a given accuracy. Fig. 11 gives solution times versus relative error comparing the approximate B1 operator, the DtN(B1) operator with the minimum number of harmonics Nm0 = 8 needed to reach the discretization error, and the DtN(B1) operator with Nm0  1 harmonics; for the case of constant horizontal offset parameter sx = 0.25 and increasing vertical parameter sy = (1.0, 1.35, 1.5, 1.6, 1.8, 2.0). By applying the DtN(B1) operator with Nm0 = 8, the relative error was reduced to the discretization level 4.5%, for all geometries of the truncation boundary. For the DtN(B1) the optimal geometry which yields the discretization error with the smallest solution times (approximately 88 s) is to place the truncation boundary position near the scatterer at sx = sy = 0.25. To reach the 4.5% discretization level with the B1 operator requires a much larger computational domain with the truncation boundary placed at the far-field position sx = 0.25, and sy = 2.0, and a wall-clock time of over 300 s. Table 3 summarizes these timing results. For the DtN(B1) condition these timing results are broken down into a percentage required to compute the sparse (interior) and DtN (boundary) parts of the iterative solution. 
Here, Nit is the number of iterations, Tit—time/iteration, Twc—the wall-clock time, and T Xwc , T Cwc are percentages of the wall-clock time associated with interior and boundary parts, respectively. A significant reduction in wall-clock solving time is observed when applying DtN(B1), by a factor of approximately 8 times versus the case with local B1 operator. As expected, for the DtN(B1) with a small interior domain X and tight fit of the truncation boundary C positioned at

3730

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Fig. 9. Left: Contours of real-part of the scattered field in the plane z = 0 with fixed sx = 0.25, and increasing sy; Right: Real and imaginary parts of the scattered field along the circumferential direction h on the truncation boundary C, at the extremes u = 0 and p4, respectively; (a) sy = 0.25; (b) sy = 0.5; (c) sy = 1.5.


Table 2
Relative errors η (%) for the B1 operator (Nm = 0), with fixed sx = 1/4 and increasing sy

D/λ     sy = 1/4    sy = 1/2    sy = 1    sy = 2
0.2     45          13          8         8
0.4     83          41          21        4.5

Fig. 10. Wall-clock time (s) versus relative error η (%) for the B1 operator; curves: (a) sx = sy = s, (b) sx = const., (c) f = const.; D/λ = 0.4, λ/h = 25.

Fig. 11. Wall-clock time (s) versus relative error η (%) comparing the B1 operator (Nm = 0), the DtN operator with the minimum number of included harmonics Nm0 needed to reach the discretization error, and the DtN operator with Nm0 − 1 harmonics; D/λ = 0.4, λ/h = 25.


Table 3
Comparison of wall-clock timings for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  Nit    Tit (s)    Twc (s)    T^Ω_wc (s)    T^Ω_wc (%)    T^Γ_wc (s)    T^Γ_wc (%)
DtN (sy = 0.25)   74     1.2        88         35            (40)          53            (60)
B1 (sy = 2.0)     316    2.2        684        684           (100)         –             –

In Table 4 storage costs are compared. The quantities N^Ω_w and N^Γ_w denote the number of double precision values required to store the arrays associated with the sparse interior and the DtN boundary, respectively, while Nw is the aggregate value. For the DtN boundary, the non-local contributions require roughly twice the storage of the sparse interior arrays, because a large fraction of the total nodes lies on the truncation boundary and NT = 72 harmonics (Nm = 8) are stored in the array C (note that N^Γ × NT = 5586 × 72 ≈ 402 × 10^3, which accounts for the N^Γ_w entry in Table 4). Nevertheless, the aggregate storage cost for the DtN case with a closely fitting truncation boundary remains lower (approximately half) than the storage required for the local B1 operator employed on a truncation boundary large enough to achieve the same accuracy (sx = 0.25, sy = 2).

In Table 5 the additional timings required to set up the linear system are compared. Efficient algorithms have been coded to assemble the global matrices into compact sparse storage formats (e.g. the compressed sparse row (CSR) format [63]) directly from the element connectivity arrays, resulting in low timings for the interior assembly. For the DtN operator, the computational cost of forming the matrices of angular and radial functions C and D in (38), (39) is higher than the interior assembly cost, due to the time required to evaluate the spheroidal function routines of [64]. Here, Ta is the total time to set up the global linear system, T^Ω_a is the time associated with assembling the interior (sparse) part of the global system, and T^Γ_a is the time associated with forming the arrays of radial and angular spheroidal functions. Similar conclusions can be drawn: for the DtN operator, the boundary contribution becomes dominant as the truncation boundary is placed closer to the scatterer (69%), yet the overall timing remains lower than that required for the local B1 condition applied on a far-field truncation boundary (sx = 0.25, sy = 2) positioned such that the 4.5% discretization error level is met.
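For readers who wish to reproduce the kind of boundary set-up cost reported in Table 5, the sketch below indicates how tables of prolate spheroidal angular and radial functions, playing the role of the arrays C and D of (38), (39), might be filled using SciPy's special-function routines in place of the routines of [64]; the mode ordering and normalization used here are assumptions of this illustration, not the paper's definitions:

```python
import numpy as np
from scipy.special import pro_ang1, pro_rad1

def spheroidal_tables(c, xi0, eta_nodes, Nm):
    """Tabulate prolate spheroidal angular functions S_mn(c, eta) at boundary
    nodes and radial functions R_mn^(1)(c, xi0) on the boundary xi = xi0.

    c         : spheroidal parameter (wavenumber times the semi-focal distance)
    xi0       : radial spheroidal coordinate of the truncation boundary (> 1)
    eta_nodes : angular coordinates eta of the boundary nodes, |eta| < 1
    Nm        : number of circumferential harmonics m = 0..Nm (illustrative set)
    """
    modes = [(m, n) for m in range(Nm + 1) for n in range(m, Nm + 1)]
    S = np.empty((len(eta_nodes), len(modes)))
    R = np.empty(len(modes))
    for i, (m, n) in enumerate(modes):
        # pro_ang1 / pro_rad1 return (function value, derivative)
        S[:, i] = np.array([pro_ang1(m, n, c, eta)[0] for eta in eta_nodes])
        R[i] = pro_rad1(m, n, c, xi0)[0]
    return modes, S, R

modes, S, R = spheroidal_tables(c=3.0, xi0=1.2,
                                eta_nodes=np.linspace(-0.99, 0.99, 50), Nm=8)
```

The point of the sketch is simply that every mode requires a separate special-function evaluation at every boundary node, which is why the boundary set-up time in Table 5 grows relative to the sparse interior assembly as the truncation boundary tightens.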

Table 4
Storage comparison for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  N         N^Γ     Nm    N^Ω_w (×10^3)    N^Γ_w (×10^3)    Nw (×10^3)
DtN (sy = 0.25)   16,931    5586    8     222              402              624
B1 (sy = 2.0)     85,199    –       –     1161             –                1161

Table 5
Comparison of assembly timings for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  Ta (s)    Interior T^Ω_a (s)    Interior (%)    Boundary T^Γ_a (s)    Boundary (%)
DtN (sy = 0.25)   39        12                    (31)            27                    (69)
B1 (sy = 2.0)     51        51                    (100)           –                     –


7. Benchmark submarine example

Consider a submarine-shaped three-dimensional scatterer subject to a plane-wave incident field. The submarine model is shown in Fig. 12, with overall length L = 20 and diameter D = 2. The conning tower has height h = 1 and an elliptic cross-section with semi-minor and semi-major axes b = 1 and l = 2, respectively. The incident field is a plane wave u^i(x) = e^{ik x·m} propagating along the direction m = (cos α, sin α cos β, sin α sin β), with incidence angles α and β. The scatterer is assumed to be sound-hard, yielding prescribed Neumann data for the scattered field along the scatterer surface,

∂u(x)/∂n = −∂u^i(x)/∂n,   x ∈ S.   (67)
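For concreteness, the data in (67) can be evaluated in closed form: for the plane wave u^i(x) = e^{ik x·m}, the prescribed normal derivative is −∂u^i/∂n = −ik (m·n) e^{ik x·m} at a surface point with outward unit normal n. A minimal NumPy sketch follows (the surface node coordinates and normals are assumed to be supplied by the surface mesh, which is not specified here):

```python
import numpy as np

def sound_hard_neumann_data(xs, normals, k, alpha, beta):
    """Neumann data g = -d(u_i)/dn for the scattered field on a rigid surface S,
    driven by the incident plane wave u_i(x) = exp(i k x . m).

    xs      : (Ns, 3) surface node coordinates
    normals : (Ns, 3) outward unit normals at those nodes
    k       : acoustic wavenumber
    alpha, beta : incidence angles, m = (cos a, sin a cos b, sin a sin b)
    """
    m = np.array([np.cos(alpha),
                  np.sin(alpha) * np.cos(beta),
                  np.sin(alpha) * np.sin(beta)])
    phase = np.exp(1j * k * xs @ m)          # incident field at the nodes
    return -1j * k * (normals @ m) * phase   # g = -grad(u_i) . n

# Example: broadside incidence (alpha = pi/2, beta = 0) at a single sample point.
g = sound_hard_neumann_data(np.array([[0.0, 1.0, 0.0]]),
                            np.array([[0.0, 1.0, 0.0]]),
                            k=2 * np.pi, alpha=np.pi / 2, beta=0.0)
```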

The computational domain is enclosed by a prolate spheroidal truncation boundary placed at a horizontal distance dx = sx λ from the tips of the submarine and at dz = sz λ from the top of the conning tower (see Fig. 13). The finite domain is then discretized with unstructured grids of tri-linear tetrahedral elements. In order to benchmark the numerical results, a reference solution was obtained by positioning the DtN truncation boundary at a distance given by sx = 0.25, sz = 1.8, using λ/h = 35 elements per wavelength and Nm = 10 circumferential harmonics (NT = 10 · 11 = 110 total modes). The number of harmonics and the mesh resolution were intentionally overestimated relative to the validation studies of the spheroidal scatterer with known analytical solution in Section 6.1, to ensure a reference solution with minimal boundary error.

7.1. Cost comparison between local B1 and DtN operators

The numerical studies in this section consider the case D/λ = 0.4 with a mesh resolution of λ/h = 25 elements per wavelength. Fig. 14 depicts a direct performance comparison between the local B1 and modified DtN(B1) conditions on truncation boundaries placed with offset parameters sx = 0.25 and sz = (0.25, 0.75, 1.25, 1.8); as sz increases, the shape of the boundary moves from spheroidal to nearly spherical. One concludes that an effective strategy for finding an optimal position of the truncation boundary is to employ the modified DtN condition on boundaries tightly following the scatterer, which are then gradually moved farther away (by keeping sx constant and increasing sz) until the required tolerance is reached.

Fig. 12. Plane wave impinging upon a submarine-shaped scatterer.


Fig. 13. Computational domain Ω enclosed by a spheroidal truncation boundary Γ placed at dx = sx λ and dz = sz λ from the submarine-shaped scatterer, where λ = 2π/k is the acoustic wavelength.

Fig. 14. Comparison between the B1 (Nm = 0) and DtN(B1) (Nm = Nm0) operators for the submarine problem, for truncation boundaries at sx = 0.25 with sz = 0.25, 0.75, 1.25, 1.8 (wall-clock time, s, versus relative error η, %); D/λ = 0.4, λ/h = 25.

As the modified DtN performance curve in Fig. 14 is almost vertical, a satisfactory tolerance is likely to be obtained with any reasonably close truncation boundary, provided an appropriate number of harmonics is employed in the DtN map. This number can be found with a few trial-and-error runs with an increasing number of modes, without changing the mesh. In contrast, for the local conditions several re-mesh and solve cycles are necessary in order to find the position of the truncation boundary that yields the desired tolerance, a disadvantage of using low-order local boundary operators.
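The trial-and-error procedure just described amounts to a short driver loop over the number of harmonics on a fixed mesh. In the sketch below the solver and the error indicator are hypothetical callables standing in for the finite element solve with the DtN(B1) condition and for an error estimate; neither is specified in this paper:

```python
def find_min_harmonics(solve, error_of, mesh, tol, Nm_max=20):
    """Smallest number of DtN harmonics reaching the tolerance on a fixed mesh.

    solve(mesh, Nm) -> solution : hypothetical FEM solve with DtN(B1), Nm harmonics
    error_of(solution) -> float : hypothetical error indicator
    tol                         : target (discretization) error level
    """
    for Nm in range(Nm_max + 1):       # Nm = 0 corresponds to the local B1 operator
        u = solve(mesh, Nm)            # same mesh every time; only Nm changes
        if error_of(u) <= tol:
            return Nm, u               # this Nm plays the role of Nm0 in the text
    raise RuntimeError("tolerance not reached with Nm <= Nm_max; refine the mesh")
```

The contrast with the local B1 operator is that each pass of the loop reuses the existing mesh and matrices, whereas tightening the tolerance with B1 requires regenerating and repartitioning a larger computational domain.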


7.2. Parallel performance of the DtN map

In the context of iterative methods, the matrix–vector products, dot-products, and vector updates are the most computationally intensive operations. For distributed memory systems it is important to distribute the load evenly across the computing nodes, in order to achieve optimal wall-clock timings and reduce the storage cost per node. It is also important to minimize the size of the interface between subdomains, in order to reduce communication costs. Sophisticated graph partitioning algorithms have been developed for these tasks. In the present work an implementation [65] of the partitioning algorithm of [66] is employed, allowing fast partitions of the element adjacency graph (element partitions).

To obtain a scalable algorithm, the computational load has to be balanced across the computing nodes. This may be achieved via graph weighting, i.e. degrees of freedom throughout the mesh are weighted differently depending on the computational work they incur. METIS uses a multi-criteria objective function which minimizes the interface size while balancing the computational load, based on multiple weights for each graph vertex. In the present work the interest is in placing the truncation boundary as close as possible to the scatterer, to demonstrate the true effectiveness of the DtN boundary condition. For such cases, and for unstructured free meshes with a uniform average element size, partitions with minimal interface are radial slices which include regions of the scatterer surface S as well as of the truncation boundary Γ. Thus, for close truncation boundaries with nearly uniform average element sizes, uniform weights for all elements in the mesh yield balanced graphs, since each partition contains equally sized subregions of the interior and of the truncation boundary. A sample three-dimensional partitioned grid, with the scattered field solution on the surface of the submarine, is depicted in Fig. 15. The computational domain is discretized with unstructured meshes of tri-linear tetrahedral elements, and the element grid partitions are obtained using the METIS partitioning package [65].
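To make the radial-slice observation concrete, the toy partitioner below assigns tetrahedral elements to p azimuthal slices about the scatterer's long axis, so that each subdomain receives comparable shares of interior, wet-surface, and truncation-boundary elements; it only illustrates the geometric idea and is not the METIS multilevel k-way algorithm [65,66] actually used:

```python
import numpy as np

def radial_slice_partition(nodes, elems, p, axis=0):
    """Assign each element to one of p azimuthal slices about a given axis.

    nodes : (N, 3) node coordinates
    elems : (Ne, 4) tetrahedral connectivity (node indices)
    p     : number of subdomains (slices)
    axis  : index of the scatterer's long axis (x here)
    Returns an (Ne,) array of subdomain ids in 0..p-1.
    """
    centroids = nodes[elems].mean(axis=1)             # (Ne, 3) element centroids
    lateral = np.delete(centroids, axis, axis=1)      # coordinates normal to the axis
    theta = np.arctan2(lateral[:, 1], lateral[:, 0])  # azimuth in (-pi, pi]
    return np.floor((theta + np.pi) / (2 * np.pi) * p).astype(int) % p
```

With a nearly uniform element size, each slice then carries approximately the same number of interior and boundary elements, which is the load-balance property the uniform-weight METIS partitions are expected to reproduce automatically.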

Fig. 15. Sample grid partition for the three-dimensional submarine scatterer; N ≈ 350,000 nodes, 1 million tri-linear tetrahedral elements, D/λ = 0.4, λ/h = 50; p = 10 subdomains.


Several alternative parallel methods for preconditioning the linear system are available, including algebraic SSOR and ILU [63,67,68] and others, e.g. [69]. In [55], a hybrid parallel preconditioner is implemented, with a block-Jacobi component associated with the interface nodes and SSOR-type blocks associated with the interior nodes. The interface nodes are conditioned with diagonal scaling to minimize inter-processor communication costs, while the SSOR preconditioning is used together with a parallel implementation of Eisenstat's algorithm [61] for the relatively large number of interior nodes.
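The hybrid preconditioner of [55] is only summarized in words here; the sketch below shows one plausible realization of the idea on a single subdomain (diagonal scaling of the interface dofs and a symmetric Gauss–Seidel/SSOR sweep with ω = 1 on the interior dofs, written without Eisenstat's reformulation), and should be read as a schematic rather than the implementation behind the reported timings:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def hybrid_precondition(A, r, interior, interface):
    """Apply z = M^{-1} r: Jacobi (diagonal) scaling on the interface dofs and
    SSOR with omega = 1 on the interior block of the locally owned matrix A."""
    z = np.zeros_like(r)

    # Interface dofs: diagonal scaling only, so no extra inter-processor coupling.
    z[interface] = r[interface] / A.diagonal()[interface]

    # Interior dofs: z_I = (D+U)^{-1} D (D+L)^{-1} r_I  (symmetric Gauss-Seidel).
    A_II = A.tocsr()[interior, :][:, interior]
    d = A_II.diagonal()
    L = sp.tril(A_II, k=-1, format="csr")
    U = sp.triu(A_II, k=1, format="csr")
    y = spsolve_triangular((sp.diags(d) + L).tocsr(), r[interior], lower=True)
    z[interior] = spsolve_triangular((sp.diags(d) + U).tocsr(), d * y, lower=False)
    return z
```

In the actual implementation the two triangular sweeps are folded into the matrix–vector product via Eisenstat's trick, so the preconditioned iteration costs little more than an unpreconditioned one.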

Fig. 16. (a) Speedup and (b) parallel efficiency versus number of processors for the case N ≈ 350,000 dof (1 million elements) on the NCSA PIII-1 GHz IA-32 Linux cluster; Nm = 15 and Nm = 0.


In the following, the parallel performance of the DtN matrix–vector product relative to the sparse matrix–vector product is studied; the parallel performance of the hybrid Jacobi–SSOR preconditioner, in terms of the grain size and of the reduction in the number of iterations required for convergence, is reported in [55]. To quantify the impact of the DtN term on the algorithm, speedup and parallel efficiency studies are performed. The speedup S_n when using n processors is defined as

S_n = 2 t_2 / t_n,   (68)

where t_2 and t_n are the solution times on 2 and n processors, respectively. The parallel efficiency is defined as

P_n = S_n / n.   (69)

The 2-processor case was chosen as the reference point in order to quantify the parallelism of the code and to observe the impact of the DtN upon it, rather than to compare against a highly optimized serial code. The performance of the DtN algorithm is studied on up to 32 processors (Fig. 16), for 200 BICG-STAB iterations, on the IA-32 Linux cluster at NCSA. This cluster is based on IBM eServer x330 thin servers, each with two 1 GHz Intel Pentium III processors, running Red Hat Linux, with a Myrinet II cluster interconnect network. The results show a linear speedup and near-optimal parallel efficiency, as the interior computational work on each processor significantly exceeds the communication load. Any negative effect of using Nm = 15 non-local DtN modes, compared to the purely local operator corresponding to Nm = 0, is insignificant, showing that the efficient parallel implementation of the DtN matrix–vector product is effective in preserving the near-optimal speedup of the local sparse-matrix iterative solution.
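For completeness, the speedup and efficiency curves of Fig. 16 follow directly from (68) and (69); the timings in the snippet below are placeholder values chosen only to show the bookkeeping, not the measured data:

```python
# Wall-clock solve times (s) per processor count: illustrative placeholders only.
t = {2: 1500.0, 4: 760.0, 8: 385.0, 16: 195.0, 32: 98.0}

t2 = t[2]
for n in sorted(t):
    S_n = 2.0 * t2 / t[n]   # speedup relative to the 2-processor run, Eq. (68)
    P_n = S_n / n           # parallel efficiency, Eq. (69)
    print(f"n = {n:2d}: S_n = {S_n:5.2f}, P_n = {P_n:4.2f}")
```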

Fig. 17. Timings for 200 iterations for the mesh with N ≈ 350,000 dof (approximately 1 million elements) on the NCSA PIII-1 GHz IA-32 Linux cluster, using Nm = 15 modes in the DtN term; 2 matrix–vector products and 6 dot-products per BICG-STAB iteration. Bars break the time into matrix–vector computation, interface update, DtN computation, DtN communication, dot-product computation, and dot-product communication.


In Fig. 17 the timings are broken down into the computation and communication segments within 200 BICG-STAB iterations. By design, the computation segments are significantly larger than the communication segments, which remain barely noticeable up to 32 processors. The collective communication required by the non-local DtN map in step (62) costs about the same as a single dot-product, owing to the small, fixed vector size equal to the number of DtN harmonics. The DtN communication cost is independent of the number of boundary unknowns (the grain size) and remains lower than the cost of the 6 dot-product communications per BICG-STAB iteration, even though the amount of data communicated in the single DtN operation exceeds the total amount of data communicated in those 6 dot-products. This is because the latency of calling a global communication routine 6 times per iteration, rather than once as for the DtN contribution, dominates the total communication cost.

Further studies of two-dimensional Helmholtz problems with elliptic DtN boundary conditions [22,23] have been carried out in [1,55] using the NCSA IA-32 Linux cluster with Intel PIII-1 GHz processors and Myrinet II interconnects, a distributed memory Linux cluster using Intel PIII-900 MHz processors with Myrinet I interconnects, and a 14-processor Sun UltraSparc shared memory system treated as a distributed memory system through the use of Message Passing Interface (MPI) calls. The parallel DtN algorithm with the multiplicative split was shown to scale with minimal extra communication cost compared to the local B1 and B2 operators defined on elliptic truncation boundaries, for problems partitioned with varying grain size.
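The single collective communication in step (62) can be pictured as follows. The sketch uses mpi4py rather than the authors' Fortran/MPI code, and again assumes the outer-product form of the boundary operator, with each process owning the rows of the angular-function array C for its share of the truncation-boundary dofs:

```python
import numpy as np
from mpi4py import MPI

def dtn_matvec_local(C_loc, D, x_bnd_loc, comm=MPI.COMM_WORLD):
    """Local part of y_bnd = C D C^H x_bnd with one Allreduce of length N_T.

    C_loc     : (n_bnd_local, N_T) rows of the angular-function array owned locally
    D         : (N_T,) radial-function (diagonal) coefficients, replicated on all ranks
    x_bnd_loc : (n_bnd_local,) local entries of x on the truncation boundary
    """
    a_loc = C_loc.conj().T @ x_bnd_loc      # local partial harmonic amplitudes
    a = np.empty_like(a_loc)
    comm.Allreduce(a_loc, a, op=MPI.SUM)    # the only collective: N_T complex values
    return C_loc @ (D * a)                  # local rows of the rank-N_T update

# Each process adds this contribution to its locally assembled sparse product
# K_S x, exactly as in the serial sketch given earlier; the message size is N_T,
# independent of the number of boundary unknowns per process (the grain size).
```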

8. Conclusions

The low-rank matrix representation of the discrete DtN map is exploited to construct a storage- and time-efficient implementation for distributed memory parallel computers. In particular, the symmetric outer-product structure of the DtN matrix K_DtN is exploited to compute matrix-by-vector products in parallel with one collective communication per iteration, with a vector size equal to the number of harmonics included in the DtN series expansion; the effect on the overall communication is roughly that of one additional, relatively small, dot-product inter-processor communication. The speedup and parallel efficiency are not significantly impacted by the DtN compared with a purely local approximate boundary condition (Nm = 0). Numerical results show that the collective communication within the DtN rank update is not the most communication-intensive operation in the overall BICG-STAB iterative solution process. The amount of extra memory for the DtN is small, consisting only of the storage of the angular modes distributed over the processors.

Numerical studies show that, in the context of iterative equation solvers and for the same accuracy, the modified DtN non-reflecting boundary condition applied to a tightly fitting spheroidal boundary and implemented as a low-rank update with the multiplicative split is more cost-effective, both in wall-clock time and in memory, than local approximate boundary conditions. While linear tetrahedral meshes have been used in our studies, the low-rank matrix implementation of the discrete DtN map should be even more efficient for dispersion-reducing high-order interpolation, where high accuracy is needed. Optimal scale-up to hundreds of processors is expected by exploiting the outer-product structure of the DtN operator; the DtN implementation is, however, expected to be sub-optimal for massively parallel systems using thousands of processors, since the speedup and parallel efficiency may drop as the truncation boundary is placed closer to the scatterer. On the other hand, improved timings are obtained with the DtN precisely by moving the truncation boundary closer to the scatterer, thus eliminating a significant portion of the interior domain. This reduction in cost may offset the need to utilize thousands of processors; hundreds of processors may be sufficient and cost-effective.


Acknowledgements

Support for this work was provided in part by the National Science Foundation under Grant CMS-9702082, in conjunction with a Presidential Early Career Award for Scientists and Engineers (PECASE), and by the National Computational Science Alliance under Grant MSS020008N, which provided access to the IA-32 Linux Cluster at NCSA.

References

[1] C. Ianculescu, L. Thompson, Parallel iterative methods for the Helmholtz equation with exact non-reflecting boundaries, in: Proceedings of the 2002 ASME International Mechanical Engineering Congress and Exposition, Paper IMECE2002/NCA-32744, 2002.
[2] C. Ianculescu, L. Thompson, Parallel iterative finite element solution methods for three-dimensional acoustic scattering, in: Proceedings of the 2003 ASME International Mechanical Engineering Congress & Exposition, Washington, DC, November 16–21, 2003, Paper IMECE2003-44266.
[3] I. Harari, T. Hughes, A cost comparison of boundary element and finite element methods for problems of time-harmonic acoustics, Comput. Methods Appl. Mech. Engrg. 97 (1992) 77–102.
[4] D. Burnett, A 3-D acoustic infinite element based on a generalized multipole expansion, J. Acoust. Soc. Am. 96 (1994) 2798–2816.
[5] W. Chew, J. Song, T. Cui, S. Velarnparambil, M. Hastriter, B. Hu, Review of large scale computing in electromagnetics with fast integral equation solvers, CMES—Comput. Model. Engrg. Sci. 5 (4) (2004) 361–372.
[6] E. Bleszynski, M. Bleszynski, T. Jaroszewicz, Development of new algorithms for high frequency electromagnetic scattering, CMES—Comput. Model. Engrg. Sci. 5 (2004) 295–317.
[7] O. Bruno, New high-order integral methods in computational electromagnetism, CMES—Comput. Model. Engrg. Sci. 5 (2004) 319–330.
[8] R. Astley, G. Macaulay, J.-P. Coyette, L. Cremers, Three-dimensional wave-envelope elements of variable order for acoustic radiation and scattering. Part I. Formulation in the frequency domain, J. Acoust. Soc. Am. 103 (1998) 49–63.
[9] J. Shirron, I. Babuska, A comparison of approximate boundary conditions and infinite element methods for exterior Helmholtz problems, Comput. Methods Appl. Mech. Engrg. 164 (1–2) (1998) 121–139.
[10] K. Gerdes, Conjugated versus the unconjugated infinite element method for the Helmholtz equation in exterior domains, Comput. Methods Appl. Mech. Engrg. 152 (1998) 125–145.
[11] F. Ihlenburg, On fundamental aspects of exterior approximations with infinite elements, J. Comput. Acoust. 8 (1) (2000) 63–80.
[12] R. Astley, Infinite elements for wave problems: a review of current formulations and an assessment of accuracy, Int. J. Numer. Methods Engrg. 49 (2000) 951–976.
[13] R. Astley, J. Hamilton, Numerical studies of conjugated infinite elements for acoustical radiation, J. Comput. Acoust. 8 (1) (2000) 1–24.
[14] D. Dreyer, O. von Estorff, Improving conditioning of infinite elements for exterior acoustics, Int. J. Numer. Methods Engrg. 58 (6) (2003) 933–953.
[15] J.-P. Berenger, A perfectly matched layer for the absorption of electromagnetic waves, J. Comput. Phys. 114 (2) (1994) 195–200.
[16] J.-Y. Wu, D. Kingsland, J.-F. Lee, R. Lee, A comparison of anisotropic PML to Berenger's PML and its application to the finite-element method for EM scattering, IEEE Trans. Antennas Propagat. 45 (1) (1997) 40–50.
[17] E. Turkel, A. Yefet, Absorbing PML boundary layers for wave-like equations, Appl. Numer. Math. 27 (4) (1998) 533–557.
[18] I. Harari, M. Slavutin, E. Turkel, Analytical and numerical studies of a finite element PML for the Helmholtz equation, J. Comput. Acoust. 8 (1) (2000) 121–137.
[19] A. Bayliss, C. Goldstein, E. Turkel, The numerical solution of the Helmholtz equation for wave propagation problems in underwater acoustics, Comput. Math. Appl. 11 (7–8) (1985) 655–665.
[20] J. Keller, D. Givoli, Exact non-reflecting boundary conditions, J. Comput. Phys. 81 (1989) 172–192.
[21] L.W. Pearson, R.A. Whitaker, L.J. Bahrmasel, An exact radiation boundary condition for the finite-element solution of electromagnetic scattering on an open domain, IEEE Trans. Magn. 25 (1989) 3046–3048.
[22] M. Grote, J. Keller, On non-reflecting boundary conditions, J. Comput. Phys. 122 (1995) 231–243.
[23] L. Thompson, R. Huan, C. Ianculescu, Finite element formulation of exact Dirichlet-to-Neumann radiation conditions on elliptic and spheroidal boundaries, in: 1999 ASME International Mechanical Engineering Congress & Exposition, Nashville, TN, November 14–19, 1999, ASME Noise Control and Acoustics Division—1999, NCA-vol. 26, pp. 497–510.
[24] A. Bayliss, M. Gunzberger, E. Turkel, Boundary conditions for the numerical solution of elliptic equations in exterior domains, SIAM J. Appl. Math. 42 (2) (1982) 430–451.


[25] B. Engquist, A. Majda, Absorbing boundary conditions for the numerical simulation of waves, Math. Comput. 31 (1977) 629–651.
[26] X. Antoine, H. Barucq, A. Bendali, Bayliss–Turkel-like radiation conditions on surfaces of arbitrary shape, J. Math. Anal. Appl. 229 (1) (1999) 184–211.
[27] D. Givoli, High-order local non-reflecting boundary conditions: a review, Wave Motion 39 (2004) 319–326.
[28] L. Thompson, P. Pinsky, Acoustics, in: E. Stein, R.D. Borst, T.J. Hughes (Eds.), Encyclopedia of Computational Mechanics, vol. 2, Wiley InterScience, 2004 (Chapter 22).
[29] R. Bossut, J.-N. Decarpigny, Finite element modeling of radiating structures using dipolar damping elements, J. Acoust. Soc. Am. 86 (4) (1989) 1234–1244.
[30] R. Tezaur, A. Macedo, C. Farhat, R. Djellouli, Three-dimensional finite element calculations in acoustic scattering using arbitrarily shaped convex artificial boundaries, Int. J. Numer. Methods Engrg. 53 (6) (2002) 1461–1476.
[31] R. Kechroud, A. Soulaimani, Y. Saad, S. Gowda, Preconditioning techniques for the solution of the Helmholtz equation by the finite element method, Math. Comput. Simul. 65 (2004) 303–321.
[32] D. Givoli, I. Patlashenko, J. Keller, High order boundary conditions and finite elements for infinite domains, Comput. Methods Appl. Mech. Engrg. 143 (1997) 13–39.
[33] T. Hagstrom, S. Hariharan, A formulation of asymptotic and exact boundary conditions using local operators, Appl. Numer. Math. 27 (1998) 403–416.
[34] T. Hagstrom, Radiation Boundary Conditions for the Numerical Simulation of Waves, vol. 8, Cambridge University Press, 1999.
[35] R. Huan, L. Thompson, Accurate radiation boundary conditions for the time-dependent wave equation on unbounded domains, Int. J. Numer. Methods Engrg. 47 (2000) 1569–1603.
[36] L. Thompson, R. Huan, D. He, Accurate radiation boundary conditions for the two-dimensional wave equation on unbounded domains, Comput. Methods Appl. Mech. Engrg. 191 (2001) 311–351.
[37] V. van Joolen, D. Givoli, B. Neta, High-order non-reflecting boundary conditions for dispersive waves in Cartesian, cylindrical and spherical coordinate systems, Int. J. Comput. Fluid Dynam. 17 (4) (2003) 263–274.
[38] D. Givoli, Recent advances in the DtN FE method, Arch. Comput. Methods Engrg. 6 (2) (1999) 71–116.
[39] L. Thompson, C. Ianculescu, R. Huan, Exact radiation conditions on spheroidal boundaries with sparse iterative methods for efficient computation of exterior acoustics, in: Proceedings of the 7th International Congress on Sound and Vibration, 2000, pp. 2101–2108.
[40] M. Malhotra, P. Pinsky, A matrix-free interpretation of the non-local Dirichlet-to-Neumann radiation boundary condition, Int. J. Numer. Methods Engrg. 39 (1996) 3705–3713.
[41] A. Oberai, M. Malhotra, P. Pinsky, On the implementation of the Dirichlet-to-Neumann radiation condition for iterative solution of the Helmholtz equation, Appl. Numer. Math. 27 (1998) 443–464.
[42] L. Thompson, L. Zhang, R. Ingel, Domain decomposition methods with frequency band interpolation for computational acoustics, in: ASME International Mechanical Engineering Congress and Exposition, November 11–16, 2001, New York, NY, The American Society of Mechanical Engineers, 2001, IMECE2001, Paper NCA-23532.
[43] B. Despres, Domain decomposition method and the Helmholtz problem (Part II), in: 2nd International Conference on Mathematical and Numerical Aspects of Wave Propagation, SIAM, Newark, DE, 1993.
[44] J. Benamou, B. Despres, A domain decomposition method for the Helmholtz equation and related optimal control problems, J. Comput. Phys. 136 (1997) 68–82.
[45] S. Kim, Domain decomposition iterative procedures for solving scalar waves in the frequency domain, Numer. Math. 79 (1998) 231–259.
[46] X. Cai, M. Cararin, F. Elliott, O. Widlund, Overlapping Schwarz algorithms for solving Helmholtz's equation, in: M. et al. (Eds.), Domain Decomposition Methods 10, Contemp. Math., vol. 218, AMS, Providence, RI, 1998, pp. 391–399.
[47] R. Susan-Resiga, H. Atassi, A domain decomposition method for the exterior Helmholtz problem, J. Comput. Phys. 147 (1998) 388–401.
[48] E. Larsson, A domain decomposition method for the Helmholtz equation in a multilayer domain, SIAM J. Sci. Comput. 20 (5) (1999) 1713–1731.
[49] C. Farhat, A. Macedo, M. Lesoinne, F. Roux, F. Magoules, A. La Bourdonnaie, Two-level domain decomposition methods with Lagrange multipliers for the fast iterative solution of acoustic scattering problems, Comput. Methods Appl. Mech. Engrg. 184 (2000) 213–239.
[50] R. Tezaur, A. Macedo, C. Farhat, Iterative solution of large-scale acoustic scattering problems with multiple right-hand sides by a domain decomposition method with Lagrange multipliers, Int. J. Numer. Methods Engrg. 51 (10) (2001) 1175–1193.
[51] F. Collino, S. Ghanemi, P. Joly, Domain decomposition method for harmonic wave propagation: a general presentation, Comput. Methods Appl. Mech. Engrg. 184 (2000) 171–211.
[52] M. Gander, F. Magoules, F. Nataf, Optimized Schwarz methods without overlap for the Helmholtz equation, SIAM J. Sci. Comput. 24 (2002) 38–60.
[53] F. Magoules, F.-X. Roux, S. Salmon, Optimal discrete transmission conditions for a nonoverlapping domain decomposition method for the Helmholtz equation, SIAM J. Sci. Comput. 25 (5) (2004) 1497–1515.


[54] R. Djellouli, C. Farhat, A. Macedo, R. Tezaur, Finite element solution of two-dimensional acoustic scattering problems using arbitrarily shaped convex artificial boundaries, J. Comput. Acoust. 8 (2000) 81–99.
[55] C. Ianculescu, Iterative solution of exterior Helmholtz equation with exact non-reflecting boundary conditions on distributed memory systems, Ph.D. thesis, Department of Mechanical Engineering, Clemson University, August 2003.
[56] I. Harari, T. Hughes, Analysis of continuous formulations underlying the computation of time-harmonic acoustics in exterior domains, Comput. Methods Appl. Mech. Engrg. 97 (1992) 103–124.
[57] M. Abramowitz, I. Stegun, Handbook of Mathematical Functions, 1968.
[58] C. Flammer, Spheroidal Wave Functions, Stanford University Press, Stanford, CA, 1957.
[59] H. Van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Stat. Comput. 13 (2) (1992) 631–644.
[60] R. Freund, A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems, SIAM J. Sci. Comput. 14 (2) (1993) 470–482.
[61] S. Eisenstat, Efficient implementation of a class of preconditioned Conjugate Gradient methods, SIAM J. Sci. Stat. Comput. 2 (1984) 130–138.
[62] P.M. Morse, H. Feshbach, Methods of Theoretical Physics, McGraw-Hill, 1953.
[63] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, 1996.
[64] S. Zhang, J. Jin, Computation of Special Functions, John Wiley & Sons, Inc., 1996.
[65] METIS: Unstructured graph partitioning and sparse matrix ordering system. Available from: .
[66] G. Karypis, V. Kumar, A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm, in: 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[67] A. Gupta, G. Karypis, V. Kumar, Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Syst. 8 (5) (1997) 502–520.
[68] J.R. Gilbert, R. Schreiber, Highly parallel sparse Cholesky factorization, SIAM J. Sci. Stat. Comput. 13 (5) (1992) 1151–1172.
[69] M. Malhotra, P. Pinsky, Parallel preconditioning based on h-hierarchical finite elements with application to acoustics, Comput. Methods Appl. Mech. Engrg. 155 (1998) 97–117.