Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions

Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions

Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 www.elsevier.com/locate/cma Parallel iterative solution for the Helmholtz equation with exact...

723KB Sizes 0 Downloads 31 Views

Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 www.elsevier.com/locate/cma

Parallel iterative solution for the Helmholtz equation with exact non-reflecting boundary conditions q Cristian Ianculescu, Lonny L. Thompson

*

Department of Mechanical Engineering, Clemson University, Innovation Building, 102 Fluor Daniel Engineering, Clemson, SC 29634-0921, United States Received 11 November 2004; received in revised form 24 January 2005; accepted 13 February 2005

Abstract Efficient and scalable parallel solution methods are presented for the Helmholtz equation with global non-reflecting DtN boundary conditions. The symmetric outer-product structure of the DtN operator is exploited to significantly reduce inter-processor communication required by the non-locality of the DtN to one collective communication per iteration with a vector size equal to the number of harmonics included in the DtN series expansion, independent of the grain size. Numerical studies show that in the context of iterative equation solvers, and for the same accuracy, the global DtN applied to a tightly fitting spheroidal boundary and implemented as a low-rank update with the multiplicative split is more cost-effective (both in wall-clock times and memory) compared to local approximate boundary conditions.  2005 Elsevier B.V. All rights reserved. Keywords: Helmholtz equation; DtN; Finite element; Parallel methods

1. Introduction Numerical solution of acoustic radiation and scattering problems on unbounded domains requires a large number of degrees-of-freedom (dof) to resolve waves at high driving frequencies (wavenumbers), especially in three dimensions. Domain based methods such as the finite element method (FEM) are able to solve problems in non-homogeneous media and allow for a natural coupling with complex, including q *

Portions of this manuscript originally presented in [1,2]. Corresponding author. Tel.: +1 864 656 5631; fax: +1 864 656 4435. E-mail address: [email protected] (L.L. Thompson). URL: http://www.ces.clemson.edu/~lonny/ (L.L. Thompson).

0045-7825/$ - see front matter  2005 Elsevier B.V. All rights reserved. doi:10.1016/j.cma.2005.02.030

3710

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

elastic, structures. The numerical advantage of the FEM is they lead to sparse matrices, which by avoiding calculations on zeroes, significantly speedup computations and reduce memory requirements. Complexity estimates [3] and numerical evidence [4] have shown that the FEM is an effective alternative to boundary integral methods (BEM) for exterior acoustics problems. With the recent developments in fast boundary integral methods [5–7], the method with the best efficiency is less clear, yet the FEM retains advantages of robustness and natural integration with other discrete models in coupled problems. Domain based methods such as the finite element method (FEM) usually introduce an artificial boundary C, which divides the original unbounded domain into two regions: a finite computational domain X suitable for discretization and an infinite residual region D ¼ R n X, see Fig. 1. The exterior acoustics problem in unbounded domains presents a special challenge for finite element methods. Reducing the size of the bounded domain reduces the computation cost but must be balanced by the ability of a computationally efficient and geometrically flexible truncation boundary treatment to minimize any spurious wave reflection to a level below that of the discretization error. Methods for modelling the exterior complement D ¼ R n X, i.e., the infinite region exterior to the artificial boundary C, can be divided into three main categories: infinite elements [4,8–14], absorbing PML layers [15–18], and absorbing (non-reflecting) boundary conditions. Infinite element methods represent the exterior complement by assuming a radial approximation with suitable outgoing wave behavior. Matched absorbing layers attempt to rapidly decay outgoing waves in a relatively thin layer exterior to C. For the non-reflecting (absorbing) boundary conditions, the outgoing wave solution in D is represented by a relation of the unknown solution and its normal derivative on the artificial truncation boundary C. Options include matching exact analytical series solutions as used in the global Dirichlet-to-Neumann (DtN) map [19–23] and various local approximations [24–27]. Numerical and implementation aspects of these alternative boundary treatments are reviewed in [28]. The computational merits of infinite elements, absorbing layers, and local absorbing boundary conditions, are that they incorporate approximating basis functions with compact support, thus preserving the sparsity of the discretization for the bounded domain. In the case of the local absorbing boundary conditions the partitioning of the computational domain for parallel solution on distributed memory computers is straightforward. In the case of infinite elements, and absorbing layers, a partitioning with variable computational weights may be required. The first- and second-order local absorbing conditions of [24] on a truncation boundary of spherical shape have been widely used because of there simple implementation [9,29–31]. High-order terms in the sequence of local boundary conditions are difficult to implement due to the presence of high-order derivatives. To obtain a tight fit around elongated objects, the second-order local conditions of [24] have been generalized to spheroidal [22,23] and arbitrary convex surfaces [26]. A tight fit reduces the size of the computational domain X, thus reducing the overall cost of solution. Local approximate absorbing boundary

Fig. 1. Artificial boundary C defining finite computational domain X for the exterior problem.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3711

conditions preserve the sparsity of the matrix equations within the bounded domains, but must be used with caution, as errors produced by spurious reflections are generally unknown. This sometimes requires multiple solutions and remeshing, where the truncation boundary must be systematically moved further away from the scatterer to provide solutions with reflection error below the discretization error. Complexity estimates show that it is usually more efficient to use high-order accurate absorbing conditions which enable smaller computational domains. Direct finite element implementation of high-order local operators are problematic in conventional finite element methods since regularity in angular derivatives higher than standard C0(C) are required [32]. The development of high-order local boundary conditions for which the order can be easily increased to a desired level are usually based on using auxiliary variables to eliminate higher-order derivatives [27,33–37]. While generally derived for the time-dependent case,pffiffiffiffiffiffi time-harmonic counterparts are readily implemented with time derivatives replaced by ix, where ffi i ¼ 1 and x is the driving frequency. An alternative to high-order local absorbing conditions are global DtN non-reflecting boundary conditions defined on a separable truncation boundary [20,21]. Uniqueness is ensured by modifying the DtN map to incorporate the first- and second-order local absorbing conditions in combination with the series representation [22,38]. The extension of the modified DtN map in spheroidal coordinates suitable for finite element implementation is given in [23,39]. The number of terms taken in the series representation of the DtN boundary condition determines the accuracy of the procedure. By coupling all the nodes on the truncation boundary, the DtN map is non-local giving rise to a full submatrix. As a result, the DtN was considered computationally prohibitive. However, researchers [19,40,41] recognized the symmetric outer-product structure of the DtN map could be exploited to avoid storage or direct computation of a full dense matrix. Exploiting this feature, efficient matrix–vector updates were implemented for Krylov subspace iterative solvers (e.g. QMR, BICG-STAB), rendering the DtN condition competitive with local absorbing conditions on sequential computer systems. For large three-dimensional problems at high-wavenumbers k, resolution requirements require a large number of elements leading to large system matrices. In this case, iterative solvers are preferred over direct factorization methods due to the lower memory requirements and parallel computing performance. To accommodate a growing demand for solving large scale exterior acoustic scattering problems, especially in three dimensions and for higher frequencies (wavenumbers), research efforts have been directed to designing parallel algorithms for the Helmholtz equation on unbounded domains. Domain decomposition methods provide an effective means of problem subdivision for parallel processing. Classical Schur complement based domain-decomposition methods have difficulties when applied to the Helmholtz equation since the inversion of the system matrix defined on each interior subdomain will be singular when the wavenumber corresponds to a resonance frequency (eigenvalue) of the subdomain Dirichlet problem. The first resonance will occur at a resolution of less than two subdomains per wavelength [42]. 
Schwarz-type domain decomposition methods have been used with Robin-type transmission interface conditions [43–45]. However, a difficulty is that the iteration count increases with many subdomains. An overlapping Schwarz method is demonstrated in [46] in combination with GMRES acceleration and coarse grid corrections to improve converge rates. Another additive Schwarz domain decomposition method with the Robin-type subdomain interface transmission conditions has been proposed in [47] where the non-local DtN non-reflecting boundary condition is computed with an iterative lag to maintain sparsity of the subdomain matrices. In [48] a preconditioned restarted GMRES iterative method is used for the solving the Helmholtz equation, including non-local non-reflecting boundary conditions. Domain decomposition is used with a Schur complement algorithm and fast preconditioners for the subdomains to accelerate convergence. In [49,50], a non-overlapping domain decomposition method based on two-level Lagrange multipliers and the alternating Robin-type transmission conditions at subdomain interfaces given in [43,44], is used to solve the Helmholtz equation with second-order local absorbing boundary conditions. The matrices

3712

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

restricted to each subdomain include regularization matrices associated with the simple impedance operator defined by integration over subdomain interfaces. The use of regularization matrices for the Helmholtz operator provides for a unique solution on each subdomain. Inclusion of the alternating regularization matrix on the interface boundaries cancel upon global assembly, thus reverting the original problem, and lead to non-singular and invertible subdomain system matrices. The transmission conditions can be interpreted as a simple local preconditioner of the linear system condensed on the interface. Improved transmission conditions with tangential derivatives have been derived in [51]. Optimized coefficients have been chosen to minimize the convergence rate of the Jacobi algorithm in the closely related additive Schwarz method with no overlap [52]. In [53] it is shown that the optimal augmented interface operator is the Schur complement of the outer domain. Approximations of this Schur complement with sparse approximate inverse methods and incomplete factorization are investigated. An important aspect for efficient parallel algorithms, especially for distributed memory systems, is minimizing the amount of data interchanged (communicated) between subdomains. Maximum decoupling between subdomains is pursued when designing parallel algorithms, especially for massively parallel architectures. In this framework, local boundary conditions are natural for parallel efficient algorithms due to their locality. However, local first- and second-order operators such as those used in [30,54] require extra computational domain to produce accuracy comparable to DtN (as demonstrated in [55] and the numerical studies presented in this paper). For a limited number of subdomains, it is reasonable to believe an optimized implementation of the DtN operator would be more computationally effective than local conditions. As the number of subdomains increases, the parallel performance degrades faster by applying DtN versus local conditions, as communication costs will eventually overwhelm computation costs, due to a higher coupling between subdomains. However, for large to medium grain size (hundreds of processors) the DtN boundary condition may prove more cost-effective. In this work, an efficient parallel algorithm is presented for computing DtN matrix–vector products in Krylov-based iterative methods, which takes advantage of the low-rank matrix update resulting from the exact non-reflecting DtN map into a parallel iterative process. To minimize the computational domain around elongated structures, a spheroidal DtN non-reflecting boundary is used. We demonstrate using linear tetrahedral elements that for the same accuracy, the modified DtN non-reflecting boundary condition applied to a spheroidal boundary and implemented as a low-rank update is more cost-effective than using the local approximate boundary conditions in the context of iterative equation solvers. The low-rank matrix implementation of the discrete DtN map should be even more efficient for dispersion reducing high-order interpolation where high-accuracies are needed. Parallel scale-up studies show that the effect on the overall communication of our parallel DtN matrix-by-vector product algorithm is roughly that of a relatively small dot-product inter-processor communication. 
Numerical results for a three-dimensional benchmark problem show that speedup and parallel efficiency have not been significantly impacted by DtN versus a purely approximate local boundary condition.

2. The exterior problem in unbounded domains Consider time-harmonic scattering and radiation of waves in an infinite three-dimensional region D  R3 , truncated by an artificial boundary C thus enclosing a finite computational domain X. The Helmholtz equation is satisfied within X, ðr2 þ k 2 Þu ¼ 0;

x 2 X;

ð1Þ

where u is the scalar field, k 2 C is the wave number, Im k P 0. The solution of the Helmholtz equation exterior to the truncating surface satisfies the Sommerfeld radiation condition at infinity, ensuring outgoing waves in the far-field.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3713

On the artificial boundary, a non-reflecting boundary condition is defined by the Dirichlet-to-Neumann (DtN) map, ou ¼ MðuÞ; on

x 2 C;

ð2Þ

where M: H1/2(C) ! H1/2(C) is an integral operator relating Dirichlet data to the outward normal derivative of the solution to the Helmholtz problem exterior to the truncating surface, restricted on C. Using the DtN map (2), the originally unbounded exterior problem is replaced by an equivalent reduced problem defined on the bounded domain X: Given g 2 C; Find uðxÞ 2 C, such that (Fig. 2), ðr2 þ k 2 Þu ¼ 0; in X; ou ¼ g; on S; on ou ¼ Mu; on C: on

ð3Þ ð4Þ ð5Þ

The corresponding weak form is (W): Find u 2 H1(X), such that, for all w 2 H1(X): Bðw; uÞ  ðw; MuÞC ¼ ðw; gÞS ; where

Z

 uÞ dx; ðr w  ru  k 2 w Z  Mu dC; ðw; MuÞC :¼ w Z C  g ds: ðw; gÞS :¼ w Bðw; uÞ :¼

ð6Þ

ð7aÞ

X

ð7bÞ ð7cÞ

S

Formally, the L2(C) inner-product of the trace w 2 H1/2(C) and the DtN map Mu 2 H1/2(C) is defined by the dual pairing, ðw; MuÞC ¼ ðw; MuÞH 1=2 H 1=2 . For a general DtN map, the solution is unique if the innerproduct Im(u, Mu)C > 0 (or <0), for all u 2 H1/2(C), u 5 0, [22,56]. Complexity estimates show that it is usually more efficient to use high-order accurate conditions which enable small computational domains. To this end, the artificial boundary C is often defined in separable coordinates such as a sphere or spheroid,

Fig. 2. Exterior scattering problems. Exterior domain D is reduced to computational domain X enclosed by the scatterer S and the truncation boundary C.

3714

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

or a rectangular shape in Cartesian coordinates. The use of spheroidal or rectangular coordinates allows the artificial boundary to obtain a tight fit around elongated objects. A history on the origins of the DtN finite element method for acoustics and other wave problems in exterior domains is given in [38].

3. The DtN non-reflecting boundary condition in prolate spheroidal coordinates Consider the DtN map defined in prolate spheroidal coordinates. The angular coordinate is defined as t = (g, u) where g = cos h, and u is the polar coordinate for spheroidal systems, see Fig. 3. A prolate spheroid is defined by a constant radial value l, with an ellipse revolving around the major z-axis. Alternatively, the spheroid pffiffiffiffiffiffiffiffiffiffiffiffiffimay be parameterized by x = x(n, t), where n = cosh l, and g = cos h, so that a = fn, and b ¼ f n2  1 are the semi-major and semi-minor axis of the ellipse. In prolate spheroidal coordinates the Helmholtz equation may be written as   o2 o f ðn2  1Þ 2 þ 2f n þ DC þ k 2 hn hg hu u ¼ 0; ð8Þ on on where   o hn hu ou hn hg o2 u þ DC u :¼ og hg og hu ou2 is the surface Laplacian, and f(n2  1) = hghu/hn. In the above, qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hn ¼ f ðn2  g2 Þ=ðn2  1Þ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hg ¼ f ðn2  g2 Þ=ð1  g2 Þ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hu ¼ f ð1  g2 Þðn2  1Þ

ð9Þ

ð10aÞ ð10bÞ ð10cÞ

are the metrics for the prolate coordinate system.

Fig. 3. Prolate spheroidal coordinate system defined by the parameterization x ¼ xðl; h; uÞ ¼ ½b sin h cos u; b sin h sin u; a cos h, 0 6 hp<ffiffiffiffiffiffiffiffiffiffiffiffiffiffi p, and ffi 0 6 u < 2p, where a and b are the semi-major and semi-minor axis of the ellipse, defined by (a, b) = f(cosh l, sinh l), and f ¼ a2  b2 is the semi-interfocal distance.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3715

The artificial boundary C corresponds to a constant radial coordinate n0 = cosh l0. The outgoing solution to the Helmholtz equation (8) in the unbounded region n P n0 = cosh l0 can be expressed the expansion: uðn; tÞ ¼

n 1 X X  0 Rmn ðc; nÞ  amn wcmn ðc; tÞ þ bmn wsmn ðc; tÞ ; R ðc; n Þ mn 0 n¼0 m¼0

ð11Þ

where c = kf is a normalized wave number, and Rmn(c, n) are the radial prolate spheroidal wave functions of the third kind [57,58]. In (11), the prime after the sum indicates that terms with m = 0 are multiplied by 1/2. The even and odd orthogonal eigenfunctions are defined by 1 wcmn ðc; tÞ :¼ pffiffiffiffiffiffiffiffiffiffiffi S mn ðc; gÞ cos mu; pN mn 1 wsmn ðc; tÞ :¼ pffiffiffiffiffiffiffiffiffiffiffi S mn ðc; gÞ sin mu; pN mn where Smn(c, n) are the angular prolate spheroidal wave functions of the first kind, and Z p Z 1 amn ðcÞ ¼ uðn0 ; tÞwcmn ðc; tÞ dt; p 1 Z p Z 1 uðn0 ; tÞwsmn ðc; tÞ dt: bmn ðcÞ ¼ p

ð12aÞ ð12bÞ

ð13aÞ ð13bÞ

1

In the above, dt = dg du, and Nmn are the standard normalization coefficients used by Flammer [58]: N mn ¼ 2

1 X 0 l¼0;1

ðl þ 2mÞ! 2 ðd mn Þ ; ð2l þ 2m þ 1Þl! l

ð14Þ

where d mn l ðcÞ are the coefficients in the expansion for the angular functions Smn(c, g) in terms of associated Legendre functions. In (14), the prime on the summation sign indicates that the summation is carried out for even l when (n  m) is even and for odd l when (n  m) is odd. Evaluating the normal derivative of the eigenfunction expansion (11) on the boundary at n = n0 = cosh l0 defines the DtN map for a the prolate spheroidal truncation boundary [22]. The DtN operator defines an exact non-reflecting condition on the artificial boundary; i.e., there are no spurious reflections introduced at C. The operator is non-local since integration is required over the whole surface to compute the coefficients amn and bmn. In practice, the sum over n in the expansion is truncated at a finite value Nm and is implemented naturally in the variational equation as a symmetric form. The DtN map exactly represents all harmonics in the solution up to the number of terms included in the truncated series expansion as measured by Nm. For fixed Nm, the harmonics n > Nm are evaluated with a homogeneous Neumann boundary condition on C. If an insufficient number of harmonics is included in the series, non-unique solutions may result when k2 matches one of an infinite number of interior resonances (real eigenvalues) associated with the Laplacian operator. As a result, the accuracy of the harmonics n > Nm are poorly represented and a minimum value is required to ensure uniqueness. To ensure uniqueness more terms in the DtN map than may be necessary to achieve a desired accuracy, leading to a potential for excessive computation. To eliminate the bound Nmin for uniqueness, a modified DtN may be formulated by generalizing the normal derivative applied to the harmonic expansion for outgoing waves (11), with a local differential operator representing an approximate radiation boundary condition [22]. The resulting modified DtN condition is unique for any choice of Nm, and approximates the harmonics n > N with greater accuracy than the original DtN condition.

3716

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Local boundary conditions are easily constructed by extending the procedures employed in [24] for a circle or sphere, where radial terms in a multipole expansion for outgoing waves are absorbed. The generalization to spheroidal coordinates is given by the asymptotic expansion given by [4]: 1 gj ðh; u; cÞ expðicnÞ X ; ð15Þ u j cn ðcnÞ j¼0 where c = kf, is the normalized wave number and n is the radial coordinate in spheroidal coordinates. The sequence of local operators which annihilate radial terms in the expansion (15) is constructed as a product of normalized radial derivatives. The first two boundary conditions are:   1 o B1 u ¼ þ a1 u ¼ 0; on C; ð16Þ f on   1 o2 o þ a3 u ¼ 0; on C; B2 u ¼ 2 þ a2 ð17Þ f on2 o; n where a1 = (1  icn0)/n0, a2 = (4  2icn0)/n0, and a3 ¼ ð2  4icn0  ðcn0 Þ2 Þ=n20 . Both conditions B1 and B2 satisfy the uniqueness condition, Im(u, Bu)C 5 0. The local condition B2 is preferred since it is substantially more accurate compared to B1. Local approximate conditions require careful placement when computing the response over a range of frequencies; the size of the computational domain and the mesh density must be carefully selected to achieve a prescribed accuracy. Applying the B1 operator to the eigenfunction expansion (11), evaluated at n0, gives [23], n Nm X X 0 ð1Þ ou 1 ou 1 1 ¼ ¼  z1 u þ Z mn ðcÞ Dmn ðc; tÞ; on hn on Js Js n¼0 m¼0

ð18Þ

z1 ¼ f ðn20  1Þa1 ;

ð19Þ

2 Z ð0Þ mn ðcÞ ¼ f ðn0  1Þ

R0mn ðc; n0 Þ ; Rmn ðc; n0 Þ

ð0Þ Z ð1Þ mn ðcÞ ¼ Z mn ðcÞ þ z1 ; Z 1 0 0 0 Dmn ðc; tÞ :¼ 0 uðn0 ; t Þwmn ðtjt Þ dC ; 0 J C s

wmn ðtjt0 Þ :¼

1 S mn ðc; gÞS mn ðc; g0 Þ cos mðu  u0 Þ: pN mn

ð20Þ ð21Þ ð22Þ ð23Þ

In the above, dC = Js dg du, where Js = hghu is the surface Jacobian evaluated at n0 = cosh l0. When the condition (18) is truncated at the finite value Nm, it is exact for harmonics n 6 Nm, and approximates the harmonics n > Nm with the local approximate condition B1u = 0. The normal derivative form derived in (18) is convenient for finite element implementation as a natural boundary condition in a Galerkin variational equation. The modified condition given in [22] is left in terms of a derivative with respect to l, and is suitable for finite difference implementation. Applying the local B2 operator to the eigenfunction expansion (11) with second-order radial derivatives replaced in favor of tangential derivatives via the Helmholtz equation in spheroidal coordinates (8) leads to the modified DtN [23]: n Nm X X

0 ð2Þ ou 1 1 ¼ DC  fc2 g2  fz2 u þ Z mn ðcÞ Dmn ðc; g; uÞ; on mJ s Js n¼0 m¼0

ð24Þ

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3717

where 2n0 z2 ¼ ðn20  1Þa3  c2 n20 ; 1 " # f m2 ð2Þ ð0Þ kmn þ 2 Z mn ðcÞ ¼ Z mn ðcÞ þ þ z2 : m n0  1

m ¼ a2 

n20

ð25Þ ð26Þ

In the above, kmn(c) are the characteristic values of the prolate spheroidal wave functions. The form of the modified DtN condition (24), after integrating-by-parts the second-order tangential derivatives appearing in the surface Laplacian DC, leads to a symmetric Galerkin variational form with standard finite element spaces in H1(C). In [22], the B2 operator is left with second-order derivatives in the radial coordinate l, suitable for finite difference approximation. For the modified DtN boundary condition, the term (7b) is composed of both a local and global DtN part: Z ðjÞ ðjÞ  Mu dC ¼ BC ðw; uÞ þ Z C ðw; uÞ; j ¼ 1; 2: ð27Þ w |fflfflfflfflffl{zfflfflfflfflffl} |fflfflfflfflffl{zfflfflfflfflffl} C DtN

local

The local part of the modified DtN operators using B1 and B2, respectively, are Z 1 ð1Þ  u dC; BC ðw; uÞ :¼ z1 w J ZC s Z f 1 1 ð2Þ 2 2   rs u dC; ðz2 þ c g Þ wu dC  hn r s w BC ðw; uÞ :¼  m C Js m C

ð28Þ ð29Þ

where rs u :¼

1 ou 1 ou þ : hg og hu ou

ð30Þ

The non-local part is ðjÞ

Z C ðw; uÞ :¼ 

n 1 X X 0 n¼0

  Z ðjÞ w; wcmn ÞC  ðu; wcmn ÞC þ ð w; wsmn ÞC  ðu; wsmn ÞC mn ð

ð31Þ

m¼0

for j = 1, 2 corresponding to B1- or B2-modified DtN, respectively. The inner-products over the boundary C are defined by Z 1 ðu; wcmn ÞC :¼ uwcmn ðc; g; uÞ dC; ð32aÞ C Js Z 1 uwsmn ðc; g; uÞ dC: ð32bÞ ðu; wsmn ÞC :¼ J s C The local boundary operator in (27) provides an approximation to the spectral properties of the exact DtN map, and thus can be used to construct an efficient preconditioner for an iterative solution with the complete DtN operator [23,41].

4. Finite element discretization SN e h The domain X is discretized into a finite number of elements, X  Xh ¼ e¼1 Xe . The total number of nodes for the finite element mesh is N. Using a standard Galerkin approximation uh(x) = NT(x)d, where

3718

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

N 2 RN is a column vector of standard C0 basis functions, and d 2 CN is a column vector containing the N unknowns nodal values the following indefinite complex-symmetric matrix system results: ðK S þ K DtN Þd ¼ f ;

ð33Þ

where K S ¼ ðS  k 2 MÞ þ K C is a sparse matrix, decomposed into a part associated with the discretization of the Helmholtz equation in X, defined by the real symmetric arrays S and M defined in (7a) and a complex part KC associated with the local boundary operator BC (28) and (29). Additional wavenumber dependent matrices may also be present when using stabilized finite element methods designed for improved accuracy of the discrete wavenumberfrequency relation, e.g., the Galerkin Least-Squares (GLS) and related residual based stabilized methods. The DtN contribution couples all the nodes on the boundary C, and if assembled, is given by the matrix K DtN ¼ PC  Z C  PTC ;

ð34Þ

N C N C

where Z 2 C is a full submatrix of size equal to the number of nodes on the boundary NC, and PC is a permutation matrix used to represent pointer arrays which extract truncation boundary unknowns from the total number of unknowns. Due to the special structure of the DtN map defined on a separable boundary, the complex-symmetric (non-Hermitian) block matrix KDtN is defined by a summation of vector outer products (rank-1 vector updates), ZC ¼

n Nm X X 0 n¼0

  T T Z ðjÞ mn cmn cmn þ smn smn ;

ð35Þ

m¼0

with even and odd vectors of size NC, equal to the number of unknowns on the truncation boundary C, Z 1 T c cmn ¼ ðN T ; wcmn ÞC ¼ N wmn ðc; tÞ dC; ð36aÞ C Js Z 1 T s N wmn ðc; tÞ dC: ð36bÞ smn ¼ ðN T ; wsmn ÞC ¼ C Js For a direct solve of (33), the storage of the full dense matrix KDtN, associated with the non-local DtN operator, requires OðN 2C Þ complex numbers. The storage for the sparse matrix KS is O(N), so that the total storage required for (KS + KDtN) is OðN þ N 2C Þ. For large models, the fully populated submatrix KDtN becomes prohibitively expensive to store and factorize. 4.1. Iterative solution methods for the modified DtN The situation is significantly different when iterative solution methods are used. Iterative solution methods are generally the algorithm of choice for solving systems with large numbers of unknowns due to significantly reduced storage requirements. Suitable iterative solvers for the indefinite complex symmetric matrix systems arising from discretization of the Helmholtz equation include BiCG-Stab [59], and QMR [60] methods. To accelerate convergence of the iterative process, efficient implementations of algebraic and other preconditioners can be constructed based on the sparse matrix KS. The local operator KS provides a good approximation to the spectral properties of the complete system matrix, KS + KDtN [23,41]. The computationally intensive kernel in Krylov subspace iterative solvers is the repeated operation of matrix-by-vector products of the form vj = ZCuj, at each iteration j. The special structure of the DtN matrix ZC, as a summation of rank-1 vector updates can be exploited to significantly reduce storage and cost. Here, the vectors cmn and smn can be organized in order and stored as columns in the matrix C, of dimension

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3719

(NC · NT), where NT = Nm(Nm + 1) denote the total number of harmonics included in the truncated DtN expansion. The DtN matrix contribution can then be expressed in the form of a generalized rank-NT update [39], Z C ¼ C  D  C T;

ð37Þ

C ¼ ½c00 ; . . . ; cN m N m ; s00 ; . . . ; sN m N m ; 1 D ¼  diagðZ 00 ; . . . ; Z N m N m ; Z 00 ; . . . ; Z N m N m Þ; p

ð38Þ ð39Þ

where D is a NT · NT diagonal matrix of impedance coefficients, and C contains the vector harmonics. Using this construction, a matrix–vector product with the DtN matrix v = ZCu, can be computed efficiently in block form using the following algorithm: 1. Set w = CTu, 2. Set w = Dw, 3. Set v = Cw. The dense matrix ZC is never assembled; only C and the diagonal coefficients in D need be stored. This implementation avoids the direct evaluation of the full submatrix associated with the unknowns on C producing significantly reduced storage costs. The storage requirements and number of operations are reduced to O(N + NT NC). Since NT NC, the storage and cost is considerably lower than a straightforward matrix–vector product. An effective preconditioner to accelerate convergence of the iterative solution is the SSOR algebraic decomposition together with Eisenstats trick [61]. Here, the sparse matrix KS, which is stored and split into, KS = L + D + LT, where D :¼ diag{KS} is a diagonal matrix and L is a strictly lower triangular matrix. The SSOR preconditioner is defined by P ¼ LD1 LT , where L ¼ ðD þ xLÞ. Here 0 < x < 2, is a relaxation parameter. With x ! 0, the SSOR preconditioner reduces to the diagonal (Jacobi) preconditioner, i.e., P = D. Using this sparse SSOR preconditioner leads to the modified system, Ax ¼ b; 1=2

A ¼ D1=2 L1 ðK S þ K DtN ÞLT D1=2 ; T

1=2

ð40Þ

1

where x ¼ D L d, b ¼ D L f . Iterative solution of the preconditioned system (40) requires a matrix– vector product v = Au, where A is the complex-symmetric preconditioned matrix and u is an iteration vector. Here the matrix–vector product v = Au, is computed efficiently using Eisenstats trick [61], together with the special structure of KDtN written in the matrix outer-product form (37), giving rise to:   1 A ¼ D1=2 LT þ L1 ðI þ ½ðx  2ÞD  xCDC T LT Þ D1=2 : ð41Þ x

5. Parallel iterative solution method Parallel iterative methods provide a means for dividing the problem into subsystems which when solved in parallel, provide compute time speedup, and for distributed-memory computer systems, the ability to scale-up to very large systems. A computationally intensive kernel in Krylov subspace iterative solvers is the repeated operation of matrix-by-vector products. The goal of an efficient algebraic parallel solution is to compute the matrix-by-vector products concurrently and to minimize communication. Consider domain Xh divided into p non-overlapping subdomains (substructures) Xhi such Sp h the discretized h that i¼1 Xi ¼ X . The unknowns within each subdomain are further partitioned into interior and interface nodes, see Fig. 4. The following nomenclature will be used throughout this section: • • • •

N—total number of unknowns for the global problem, Ns—total number of interior unknowns, Nv—total number of unknowns on the interface, Ni—number of unknowns in the interior of subdomain i,

3720

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

interior nodes – numbered first interface nodes – numbered last

Fig. 4. Illustration of a subdomain Xi partitioned into interior and interface nodes. The number unknowns in the interior of subdomain i is denoted Ni. The number of unknowns on the interface of subdomain i is denoted N vi .

• • • • • •

N vi —number of unknowns on the interface of subdomain i, NC—total number of DtN unknowns on the boundary C, N Cs —total number of interior DtN unknowns, N Cv —total number of DtN unknowns on the interface, N Ci —number of DtN unknowns in the interior of subdomain i, N Cvi —number of DtN unknowns on the interface of subdomain i.

5.1. Sparse matrix–vector product Partitioning the computational domain into p non-overlapping subdomains and numbering the nodes interior to the subdomains first, and nodes on the interfaces numbered last, then given the global vector iterate d 2 CN , a sparse system matrix-by-vector iterate f = KSd, takes the form of a block arrowhead matrix structure: 8 9 2 38 9 A1 B1 > f1> d1 > > > > > > > > > > 6 > 7 >f > > > > > > A B d > > > > 7 6 2 2 2 2 > > > = < = 6 7< > .. 7 .. .. 6 .. ¼6 ð42Þ 7 . ; . . . > > > 7> 6 > > > > > > > > 7 6 > 4 > > Ap B p 5> > > > > >fp > > dp > > > ; ; : > : > T T T B1 B 2    Bp Av dv fv where each di, i = 1, . . . , p, represents the subvector of unknowns that are interior to subdomain Xi, and dv represents the vector of all interface unknowns. In the above, Ai = Si  k2Mi + KCi are sparse matrices of size Ni · Ni associated with internal unknowns for each subdomain, and Av is a square matrix of size Nv · Nv, associated with unknowns on the interfaces. The global interface matrix Av is defined as the assembly of contributions of local interface submatrices Avi , but is never explicitly assembled for the iterative solution process. The Bi are sparse matrices of size Ni · Nv representing the subdomain to interface coupling. In general, the number of interface unknowns Nv will be considerably less than the number of internal unknowns, Nv Ns.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3721

Local interior to local interface coupling matrices B ivi of size N vi  N vi are formally defined by the operation B i ¼ B ivi Pi where Pi is a permutation matrix of size (N vi  N v ), is a global-to-local interface mapping for subdomain i. For iterative solutions on a parallel computer, the sparse matrix–vector products for each local subdomain can be computed in parallel for each processor i, # " Ai Bivi fi di ¼ ; ð43Þ B Tivi Avi f vi d vi where Avi and d vi ¼ Pi d v are local interface subarrays and B ivi are local coupling matrices assembled from elements on each subdomain and stored on each distributed processor system. Both the vectors fi and f vi can be computed concurrently on each processor since the local arrays Ai , Bivi , d i , d vi are localized on each computing node i. The global interface vector fv is updated by assembly of each subvector f vi using fv ¼

p X

PTi f vi :

ð44Þ

i¼1

Inter-processor communication from the above operation is minimized by taking advantage of the fact that nodes on interfaces are shared by only a few local subdomains so that communication for vector updates need only occur between adjacent subdomains sharing common interface data. Therefore, the assembly operation in (44) does not require global communication between all computing nodes, but rather local data updates between adjacent subdomains sharing interface data. Another optimization may be applied by using non-blocking communication subroutines, allowing for data communication and computation to overlap. In this way, communication costs may be hidden to a certain extent, depending on the hardware support for such operations. In summary, using the mapping between interface and interior nodes, the sparse matrix–vector product f = KSd is computed from 9 8 A 1 d 1 þ E 1 d v1 > 8 9 > > > > > f 1> > > > > > > > .. > > > > > > > > = < .. = < . . : ð45Þ ¼ A d þE d p p p vp > > > > > > > > f > > > > p > > > > P > ; p > : > > > PTi f vi > fv ; : i¼1

5.2. DtN matrix–vector product Given a vector iterative d 2 CN , the matrix–vector contribution g = KDtNd from the DtN matrix is given by " #  ds M s Z s M Ts M s Z sv M Tv g¼ ; ð46Þ T T dv M v Z vs M s M v Z v M v where

2

Z 11 6 . Zs ¼ 6 4 .. Z p1

 .. . 

3 Z 1p .. 7 7 . 5; Z pp

2

3 Z 1v 6 . 7 . 7 Z sv ¼ 6 4 . 5; Z pv

2

3 d1 6 . 7 . 7 ds ¼ 6 4 . 5 dp

3722

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

and Ms = diag[M1, . . . , Mp]. The blocks Mi, of size (N i  N Ci ) are the permutation matrices mapping the local interior DtN unknowns to the local interior, for subdomain i. Similarly, Mv, of size (N v  N Cv ) maps the interface DtN unknowns to total interface; Zij, Ziv, represent the discretized DtN operators coupling subdomain i with subdomain j and the total interface v, respectively; Zvv represents the DtN operator along the global interface v. Fig. 5 illustrates the coupling of DtN boundary nodes for a domain partitioned into p = 4 subdomains. Denote d Cv ¼ M Tv d v as the vector of unknowns corresponding only to the DtN part of the total interface. For efficient parallel computation, the interface unknowns for the local interface vj are split between adjacent subdomains sharing the same interface node. Given the vector of total DtN interface unknowns d Cv , the averaged local interface vectors are formed by, for j = 1, . . . , p 8 0 if node corresponding to k > > > < n o not on the interface vj C ð47Þ d vj ¼  C  > k d v k =s if node corresponding to k > > : on the interface vj ; where s is the number of subdomains sharing the node corresponding to position k in the vector d Cv . As a result of this decomposition, the vector of total DtN interface unknowns is defined by the summation over all subdomains, d Cv ¼

p X

C d vj :

ð48Þ

j¼1

Formally, this operation can be expressed as d Cv ¼

p X

Rj d Cvj ;

ð49Þ

j¼1

where Rj, of size (N Cv  N Cvj ), maps the local DtN interface to the global DtN interface. Using this construction, the interior to interface matrix vector products may be expressed as Z iv d Cv ¼

p X

Z ivj d Cvj ;

ð50Þ

j¼1

where Z ivj ¼ Z iv Rj . Defining d Cj ¼ M Tj d j to be the interior DtN part of subdomain j, the DtN matrix–vector product in (46) can be expressed as Z 22 Z 33

Z 11

2 1

3 4 Z 44

Fig. 5. Illustration of coupling between nodes on the DtN boundary for a domain partitioned into four subdomains. DtN matrix partitions Zii associated with interior nodes on the boundary are highlighted.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

8 9 p  P C C > > > > M Z d þ Z d 1 1j 1v > > j j v j > 8 9 > > > j¼1 > > > > g 1 > > > > > > > > . > > > > > > > > . = < .. = < . .   p : ¼ P > > Mp Z pj d Cj þ Z pvj d Cvj > > > > > gp > > > > > > > > > j¼1 > : ; > > > > gv > > > p  > P > >M C C > > Z vj d j þ Z vvj d vj > ; : v

3723

ð51Þ

j¼1

The interior matrix–vector products are thus g i ¼ M i f Ci ; i ¼ 1; 2; . . . ; p, where p   X Z ij d Cj þ Z ivj d Cvj ; f Ci ¼

ð52Þ

j¼1

while the interface matrix–vector product gv ¼ M v f Cv , where p   X Z vj d Cj þ Z vvj d Cvj : f Cv ¼

ð53Þ

j¼1

The vectors d Cvs ¼ Z vj d Cj and d Cvv ¼ Z vvj d Cvj can be written as sums over interface DtN vectors for each subdomain i, and applying a similar decomposition as in (47) and (49), the total interface matrix–vector product can be expressed as the sum, gv ¼

p X

M vi f Cvi ;

ð54Þ

i¼1

where f Cvi ¼

p   X Z vi j d Cj þ Z vi vj d Cvj ;

ð55Þ

j¼1

where M vi of size (N v  N Cvi ) maps the local DtN interface for processor i to the global interface. Combining (52) and (55) in matrix form, each processor i must perform the following DtN operations: ( ) " #( C ) p X dj Z ij Z ivj f Ci ¼ : ð56Þ C Z vi j Z vi vj f vi d Cvj j¼1 The non-local property of the DtN map manifests itself in the summation over each processor j. On each processor i, contributions from all processors j have to be gathered. For the corresponding sparse-matrix contribution in (43), no gather operation is required because of the local property of the arrays. A naive implementation of the DtN matrix–vector operation for each processor i would be to compute and store full submatrices Zij, Z ivj , Z vi vj , and perform a collective global communication (all-to-all type communication) with vectors of relatively large size (N Ci þ N Cvi ). Since the gather operation is grain-size dependent, this implementation would have a potentially high negative impact on the scalability and parallel efficiency of the computation. The grain size is the number of boundary nodes per domain. To address this communication problem, we never explicitly form the full DtN matrices Zij, Ziv, Z vi ¼ Z Tvi , and Zvv. Instead, the symmetric outer-product form (37) is exploited to reduce communication costs in a parallel implementation on distributed memory computer systems. Here the multiplicative split (37) is defined on each subdomain, Z ij ¼ C i  D  C Tj ;

Z iv ¼ C i  D  C tv ;

Z vv ¼ C v  D  C Tv ;

ð57Þ

3724

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

where C i ¼ ½ci00 ; . . . ; ciN m N m ; si00 ; . . . ; siN m N m ; cimn

i ¼ 1; . . . ; p

simn

and are the angular function vectors defined in (36) restricted to subdomain i, which are comand puted and stored on each processor concurrently. The matrix Z vi vj ¼ C vi DC Tvj are the discretized DtN operators coupling the local interface vi and vj, and C vi are the restrictions of the matrix angular functions to the boundary interfaces. Using the decompositions (57), the collective communication of the DtN matrix–vector product (56) can be formed as ( )

X p Ci f Ci ¼ uj ; D ð58Þ f Cvi C vi j¼1 where uj ¼ C Tj d Cj þ C Tvj d Cvj

ð59Þ

is a vector of size equal to NT = Nm(Nm + 1), the total number of harmonics, and computed concurrently on each processor. Prior to interface vector updates, the vectors on the left-side of the summation in (58) are added to the sparse matrix–vector product (43) concurrently on each processor. The total of the sparse and DtN contributions are, bi ¼ f i þ M i f Ci , and bvi ¼ f vi þ M vi f Cvi , i = 1, . . . , p. The summation in (58) is carried out by summing each uj on each processor j and storing the result back onto each of them (requires all-to-all communication). As a result, collective communication of the non-local DtN has been reduced to a relatively small vector of size N T N Ci , independent of the grain size (number of boundary nodes per domain). Using this procedure, the DtN matrix–vector product can be computed with one collective communication per iteration with a vector size equal to the number of harmonics included in DtN series expansion. The effect on the overall communication is roughly that of an extra dot-product communication (The basic BiCG-STAB iteration method for sparse matrices requires two matrix–vector products and six dotproducts per iteration). The algorithm can be summarized in the following. 5.3. Parallel matrix–vector product algorithm For each subdomain i: (1) Extract the DtN unknowns on the boundary from the local vector ( ) ( ) d Ci M Ti d i ¼ : d Cvi M Tvi d vi

ð60Þ

(2) Compute the vectors of size NT = Nm(Nm + 1) ui ¼ C Ti d Ci þ C Tvi d Cvi :

ð61Þ

(3) Sum the vectors uis from all processors and store the result back onto each of them (requires all-to-all type communication) ui

p X

uj :

j¼1

Dui (4) Compute ui (5) Add the sparse and DtN matrix–vector products

ð62Þ

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741



bi b vi

"

¼

Ai BTivi

Bivi Avi

#

di d vi

 M iC i þ ui : M vi C vi

3725



ð63Þ

(6) Update interface bv ¼

p X

PTi bvi :

ð64Þ

i¼1

6. Numerical studies Prior to performing parallel scale-up studies, we first demonstrate that for the same accuracy, the modified DtN(B1) non-reflecting boundary condition applied to a spheroidal boundary defined in (18) and implemented as a low-rank update with the multiplicative split given in (37) is more cost-effective than using the local B1 operator in the context of iterative equation solvers. While the results reported here are for the modified DtN(B1) condition, similar conclusions are reported in [39,55] for the modified DtN(B2) condition defined in (24). 6.1. Validation and comparison of local and global DtN conditions Consider a prolate spheroidal scatterer with focal distance fs = 10q and elliptic radial coordinate ls = 0.1, ffiffiffiffiffiffiffiffiffiffiffiffi ffi

such that ns ¼ cosh ls  1:005, L = 2a = 2fsns  20.1, D ¼ 2b ¼ 2f s n2s  1  2:0, L/D  10. The incident field is a plane wave ui(x) = eikxÆm propagating along the direction m =(cos a, sin a cos b, sin a sin b) with incident angles a and b. The scatterer is assumed to be sound-hard, i.e. a homogeneous Neumann condition is imposed for the total field along the scatterer boundary S, or for the scattered field ouðn; g; /Þ oui ðn; g; /Þ ¼ on on

n ¼ ns ; 1 6 g 6 1; 0 6 / 6 2p:

ð65Þ

For prolate spheroidal objects, an infinite series solution can be obtained for the homogeneous Helmholtz equation on unbounded domains, via plane wave expansion on spheroidal coordinates [62]. Matching normal derivatives of incident and scattered fields along the scatterer boundary, the analytical solution is /ðn; g; /Þ ¼

n 1 X X 0 n¼0

  ð3Þ im Amn Bmn Rmn ðn; cÞS mn ðg; cÞ cosðmð/  bÞÞ ;

ð66Þ

m¼0

where Amn ¼ 4

R0mn ðns ; cÞ ; N mn R0ð3Þ mn ðns ; cÞ

Bmn ¼ S mn ðg0 ; cÞ

and g0 = cos a. The scatterer described above is enclosed by a DtN truncation boundary, also in the shape of a prolate spheroid, placed at a distance dx = sxk and dy = dz = syk, where k = 2p/k and k are the wave length and wave number of incident field (see Fig. 6). Two different wave numbers are considered corresponding to D/k = (0.2, 0.4). To ensure accurate mesh resolution, k/h = 25 tri-linear tetrahedral elements per wave length are used. Comparative studies are performed for local B1 and non-local modified DtN(B1) operator defined in (18). Figs. 7 and 8 show the accuracy of the numerical solution for D/k = 0.2 and D/k = 0.4, respectively.

3726

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Fig. 6. Computational domain enclosed by a prolate spheroidal DtN truncation boundary placed at dx = sxk and dy = syk from the embedded spheroidal scatterer.

Relative errors measured in the L2 norm on the scatterer boundary S are reported as a function of circumferential harmonics Nm included in the modified DtN map eigenfunction expansion. By truncating the infinite series at Nm, the modified DtN condition defined in (18) becomes exact for circumferential modes n 6 Nm in the exact solution, and it reduces to an approximate B1 condition for higher modes. If Nm = 0 then DtN reduces to B1 for all the harmonics in the exact solution. To quantify the accuracy with boundary position, the truncation boundary is placed successively farther away from the scatterer. Three different methods are considered for increasing the distance of the DtN truncation boundary from the scatterer: (a) equal offset parameters sx = sy = s, (b) constant horizontal distance sx = 0.25, with increasing sy, and (c) maintaining a confocal (f = constant) spheroidal boundary with increasing radial coordinate n. In case (a), the DtN spheroidal boundary is increased with equal aspect ratio. In case (b), the DtN boundary approaches a sphere as sy is increased. In case (c) the DtN spheroidal boundary is increased in proportion to the radial coordinate with varying aspect ratio. The DtN boundary approaches a sphere as n is increased. Using a sufficient number of harmonics Nm in the DtN condition to obtain an effectively exact non-reflecting boundary condition reduces the error to the discretization level of the finite element approximation. Examining Figs. 7 and 8, we find the relative error of the finite element discretization with mesh resolution kh ¼ 25, corresponding to D/k = 0.2, is approximately 8%, while the error for D/k = 0.4 is approximately 4.5%. As the wavelength is decreased from D/k = 0.2 to D/k = 0.4, the number of harmonics needed to reach the discretization level increases from Nm = 6 to Nm = 8. The accuracy of the local boundary conditions generally deteriorate as the wavelength is decreased. The relative errors for the approximate local B1 operator corresponding to Nm = 0 and with equal offsets sx = sy = s are also shown in Figs. 7a and 8b and are summarized in Table 1. The values corresponding to s = 1/4, 1/2 indicate that boundary reflection error is substantially higher than the discretization error levels when the truncation boundary is positioned near the scatterer. Even after the truncation boundary is moved to s = 1, far away from the scatterer, the discretization levels still have not been reached. These results indicate that moving the truncation boundary with equal distance is not a cost effective strategy when using the local B1 operator. A similar behavior is seen in Figs. 7c and 8c, for the case with keeping the spheroid confocal (f = constant) with increasing radial coordinate n. In this case, the error produced by the B1 condition is reduced but still does not reach the level of the discretization error.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3727

45 s=0.25 (N=6,759) s=0.50 (N=15,829) s=1.00 (N=42,758)

40 35

η(%)

30 25 20 15 10 5 0

a

5

Nm

10

45

15

s y=0.25 (N=6,759) sy=0.50 (N=13,788) s y=1.00 (N=29,576)

40 35

η(%)

30 25 20 15 10 5 0

5

b 40

Nm

10

15

μ=0.4 (sx=0.08, sy=0.31, N=7,051) μ=0.5 (sx=0.12, sy=0.42, N=10,156) μ=0.6 (sx=0.18, sy=0.53, N=14,144)

35

30

η(%)

25

20

15

10

c

5 0

5

Nm

10

15

Fig. 7. Accuracy of the modified DtN(B1) operator for Dk ¼ 0:2; incident angles a ¼ p4, b ¼ p4; (a) equal distance sx = sy = s; (b) fixed sx = 0.25 with increasing sy; (c) confocal, f = constant, with increasing radial coordinate n.

3728

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741 90 s=0.25 (N=16,931) s=0.50 (N=32,349) s=1.00 (N=69,505)

80 70

η(%)

60 50 40 30 20 10 0 0

a

5

Nm

90

10

sy=0.25 sy=0.50 sy=1.00 sy=2.00

80 70

15

(N=16,931) (N=29,591) (N=56,102) (N=117,856)

η(%)

60 50 40 30 20 10 0 0

b

5

40

Nm

10

15

μ=0.4 (sx=0.15, sy =0.62, N=33,600) μ=0.5 (sx=0.25, sy =0.84, N=47,855) μ=0.6 (sx=0.36, sy =1.07, N=62,318)

35 30

η(%)

25 20 15 10 5 0 0

c

5

Nm

10

15

Fig. 8. Accuracy of the modified DtN(B1) operator for Dk ¼ 0:4; incident angles a ¼ p4, b ¼ p4; (a) equal distance sx = sy = s; (b) fixed sx = 0.25 with increasing sy; (c) confocal, f = constant, with increasing radial coordinate n.

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

3729

Table 1 Relative errors for local B1 operator (Nm = 0), with increasing equal offsets sx = sy = s D k

ns

0.2 0.4

1 4

1 2

1

45 83

24 44

12 25

Figs. 7b and 8b show results when the spheroidal DtN boundary is placed very close to the poles of the scatterer with a fixed sx = 0.25, and increasing sy to approach a sphere, as shown in Fig. 9. The results for the B1 operator are summarized in Table 2. The results demonstrate that, when using the local B1 operator, truncation boundaries with improved aspect ratios, yield less spurious reflection error than farther, but more stretched truncation boundaries. For example, the discretization level of 8% error is reached with a truncation boundary placed at sx = 0.25, sy = 1.0, which yields a lower problem size—N = 29,576 degrees-of-freedom (DOF), compared to the case when the truncation boundary placed with equal offset parameters sx = sy = 1.0, yielding a much large number N = 42,758 DOF. 6.2. Cost comparison between local B1 and DtN operators Timings are reported using the bi-conjugate gradient stabilized BICG-STAB iterative method [59] in conjunction with a SSOR preconditioner using Eisenstats algorithm [61]. Sequential timings results are obtained using optimized Sun Fortran90 compilers on a Sun Enterprise Ultra450 server, with 300 MHz UltraSPARC-II cpu. The sparse matrix KS is stored in compressed sparse row (CSR) format [63]. The DtN is computed using the multiplicative split (37) combined with the efficient matrix–vector product described in Section 4.1. In Fig. 10 the performance of the B1 operator is analyzed for various positions and geometries of the truncation boundary. Curve (a) represents cases where the truncation boundary is successively placed farther away from the scatterer with equal offset parameters sx = sy = s = (0.3, 0.4, 0.5, 0.9, 1.6); Curve (b) corresponds to cases with a constant horizontal offset parameter sx = 0.25 and increasing vertical parameter sy = (1.0, 1.35, 1.5, 1.6, 1.8, 2.0); Curve (c) represents cases with a confocal truncation boundary f0 = fs = 10, and l = (0.4, 0.5, 0.6, 0.8, 0.9), corresponding to pairs (sx, sy) = {(0.15, 0.62), (0.25, 0.84), (0.36, 1.07), (0.66, 1.58), (0.96, 1.85)}. Consistent with our earlier findings, Fig. 10 indicates that Case (b) is the most effective, yielding the lowest cost to obtain a given accuracy. Fig. 11 gives solution times versus relative error comparing the approximate B1 operator, the DtN(B1) operator with the minimum number of harmonics Nm0 = 8 needed to reach the discretization error, and the DtN(B1) operator with Nm0  1 harmonics; for the case of constant horizontal offset parameter sx = 0.25 and increasing vertical parameter sy = (1.0, 1.35, 1.5, 1.6, 1.8, 2.0). By applying the DtN(B1) operator with Nm0 = 8, the relative error was reduced to the discretization level 4.5%, for all geometries of the truncation boundary. For the DtN(B1) the optimal geometry which yields the discretization error with the smallest solution times (approximately 88 s) is to place the truncation boundary position near the scatterer at sx = sy = 0.25. To reach the 4.5% discretization level with the B1 operator requires a much larger computational domain with the truncation boundary placed at the far-field position sx = 0.25, and sy = 2.0, and a wall-clock time of over 300 s. Table 3 summarizes these timing results. For the DtN(B1) condition these timing results are broken down into a percentage required to compute the sparse (interior) and DtN (boundary) parts of the iterative solution. 
Here, Nit is the number of iterations, Tit—time/iteration, Twc—the wall-clock time, and T Xwc , T Cwc are percentages of the wall-clock time associated with interior and boundary parts, respectively. A significant reduction in wall-clock solving time is observed when applying DtN(B1), by a factor of approximately 8 times versus the case with local B1 operator. As expected, for the DtN(B1) with a small interior domain X and tight fit of the truncation boundary C positioned at

3730

C. Ianculescu, L.L. Thompson / Comput. Methods Appl. Mech. Engrg. 195 (2006) 3709–3741

Fig. 9. Left: Contours of real-part of the scattered field in the plane z = 0 with fixed sx = 0.25, and increasing sy; Right: Real and imaginary parts of the scattered field along the circumferential direction h on the truncation boundary C, at the extremes u = 0 and p4, respectively; (a) sy = 0.25; (b) sy = 0.5; (c) sy = 1.5.


Table 2
Relative errors η (%) for the B1 operator (Nm = 0), with fixed sx = 1/4 and increasing sy

D/λ     sy = 1/4    sy = 1/2    sy = 1    sy = 2
0.2     45          13          8         8
0.4     83          41          21        4.5

Fig. 10. Wall-clock time (s) versus relative error η (%) for the B1 operator; curves: (a) sx = sy = s, (b) sx = const., (c) f = const.; D/λ = 0.4, λ/h = 25.

Fig. 11. Wall-clock time (s) versus relative error η (%) comparing the B1 operator (Nm = 0), the DtN operator with the minimum number of included harmonics Nm0 needed to reach the discretization error, and the DtN operator with Nm0 − 1 harmonics; D/λ = 0.4, λ/h = 25.


Table 3
Comparison of wall-clock timings for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  Nit    Tit (s)    Twc (s)    T^Ω_wc (s)    T^Ω_wc (%)    T^Γ_wc (s)    T^Γ_wc (%)
DtN (sy = 0.25)   74     1.2        88         35            (40)          53            (60)
B1 (sy = 2.0)     316    2.2        684        684           (100)         –             –

In Table 4 storage costs are compared. The quantities N^Ω_w and N^Γ_w denote the number of double precision values required to store the arrays associated with the sparse interior and the DtN boundary, respectively, while Nw is the aggregate value. For the DtN boundary, the non-local contributions require roughly twice the storage of the sparse interior arrays, because a large fraction of the total nodes lies on the truncation boundary and NT = 72 harmonics (Nm = 8) are stored in the array C (note that N^Γ × NT = 5586 × 72 ≈ 402 × 10^3, which accounts for the N^Γ_w entry in Table 4). Nevertheless, the aggregate storage cost for the DtN case with a closely fitting truncation boundary remains lower (approximately half) than the storage required for the local B1 operator employed on a truncation boundary large enough to achieve the same accuracy (sx = 0.25, sy = 2).

In Table 5 the additional timings required to set up the linear system are compared. Efficient algorithms have been coded to assemble the global matrices into compact sparse storage formats (e.g. the compressed sparse row (CSR) format [63]) directly from the element connectivity arrays, resulting in low timings for the interior assembly. For the DtN operator, the computational cost of forming the matrices of angular and radial functions C and D in (38), (39) is higher than the interior assembly cost, due to the time required to evaluate the spheroidal function routines of [64]. Here, Ta is the total time to set up the global linear system, T^Ω_a is the time associated with assembling the interior (sparse) part of the global system, and T^Γ_a is the time associated with forming the arrays of radial and angular spheroidal functions. Similar conclusions can be drawn: for the DtN operator, the boundary contribution becomes dominant as the truncation boundary is placed closer to the scatterer (69%), yet the overall timing remains lower than that required for the local B1 condition applied on a far-field truncation boundary (sx = 0.25, sy = 2) positioned such that the 4.5% discretization error level is met.
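For readers who wish to reproduce the kind of boundary set-up cost reported in Table 5, the sketch below indicates how tables of prolate spheroidal angular and radial functions, playing the role of the arrays C and D of (38), (39), might be filled using SciPy's special-function routines in place of the routines of [64]; the mode ordering and normalization used here are assumptions of this illustration, not the paper's definitions:

```python
import numpy as np
from scipy.special import pro_ang1, pro_rad1

def spheroidal_tables(c, xi0, eta_nodes, Nm):
    """Tabulate prolate spheroidal angular functions S_mn(c, eta) at boundary
    nodes and radial functions R_mn^(1)(c, xi0) on the boundary xi = xi0.

    c         : spheroidal parameter (wavenumber times the semi-focal distance)
    xi0       : radial spheroidal coordinate of the truncation boundary (> 1)
    eta_nodes : angular coordinates eta of the boundary nodes, |eta| < 1
    Nm        : number of circumferential harmonics m = 0..Nm (illustrative set)
    """
    modes = [(m, n) for m in range(Nm + 1) for n in range(m, Nm + 1)]
    S = np.empty((len(eta_nodes), len(modes)))
    R = np.empty(len(modes))
    for i, (m, n) in enumerate(modes):
        # pro_ang1 / pro_rad1 return (function value, derivative)
        S[:, i] = np.array([pro_ang1(m, n, c, eta)[0] for eta in eta_nodes])
        R[i] = pro_rad1(m, n, c, xi0)[0]
    return modes, S, R

modes, S, R = spheroidal_tables(c=3.0, xi0=1.2,
                                eta_nodes=np.linspace(-0.99, 0.99, 50), Nm=8)
```

The point of the sketch is simply that every mode requires a separate special-function evaluation at every boundary node, which is why the boundary set-up time in Table 5 grows relative to the sparse interior assembly as the truncation boundary tightens.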

Table 4
Storage comparison for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  N         N^Γ     Nm    N^Ω_w (×10^3)    N^Γ_w (×10^3)    Nw (×10^3)
DtN (sy = 0.25)   16,931    5586    8     222              402              624
B1 (sy = 2.0)     85,199    –       –     1161             –                1161

Table 5
Comparison of assembly timings for the DtN and B1 operators; D/λ = 0.4, λ/h = 25, sx = 0.25, relative error η = 4.5%

                  Ta (s)    Interior T^Ω_a (s)    Interior (%)    Boundary T^Γ_a (s)    Boundary (%)
DtN (sy = 0.25)   39        12                    (31)            27                    (69)
B1 (sy = 2.0)     51        51                    (100)           –                     –


7. Benchmark submarine example

Consider a submarine-shaped three-dimensional scatterer subject to a plane-wave incident field. The submarine model is shown in Fig. 12, with overall length L = 20 and diameter D = 2. The conning tower has height h = 1 and an elliptic cross-section with semi-minor and semi-major axes b = 1 and l = 2, respectively. The incident field is a plane wave u^i(x) = e^{ik x·m} propagating along the direction m = (cos α, sin α cos β, sin α sin β), with incidence angles α and β. The scatterer is assumed to be sound-hard, yielding prescribed Neumann data for the scattered field along the scatterer surface,

∂u(x)/∂n = −∂u^i(x)/∂n,   x ∈ S.   (67)
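For concreteness, the data in (67) can be evaluated in closed form: for the plane wave u^i(x) = e^{ik x·m}, the prescribed normal derivative is −∂u^i/∂n = −ik (m·n) e^{ik x·m} at a surface point with outward unit normal n. A minimal NumPy sketch follows (the surface node coordinates and normals are assumed to be supplied by the surface mesh, which is not specified here):

```python
import numpy as np

def sound_hard_neumann_data(xs, normals, k, alpha, beta):
    """Neumann data g = -d(u_i)/dn for the scattered field on a rigid surface S,
    driven by the incident plane wave u_i(x) = exp(i k x . m).

    xs      : (Ns, 3) surface node coordinates
    normals : (Ns, 3) outward unit normals at those nodes
    k       : acoustic wavenumber
    alpha, beta : incidence angles, m = (cos a, sin a cos b, sin a sin b)
    """
    m = np.array([np.cos(alpha),
                  np.sin(alpha) * np.cos(beta),
                  np.sin(alpha) * np.sin(beta)])
    phase = np.exp(1j * k * xs @ m)          # incident field at the nodes
    return -1j * k * (normals @ m) * phase   # g = -grad(u_i) . n

# Example: broadside incidence (alpha = pi/2, beta = 0) at a single sample point.
g = sound_hard_neumann_data(np.array([[0.0, 1.0, 0.0]]),
                            np.array([[0.0, 1.0, 0.0]]),
                            k=2 * np.pi, alpha=np.pi / 2, beta=0.0)
```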

The computational domain is enclosed by a prolate spheroidal truncation boundary placed at a horizontal distance dx = sx λ from the tips of the submarine and at dz = sz λ from the top of the conning tower (see Fig. 13). The finite domain is then discretized with unstructured grids of tri-linear tetrahedral elements. In order to benchmark the numerical results, a reference solution was obtained by positioning the DtN truncation boundary at a distance given by sx = 0.25, sz = 1.8, using λ/h = 35 elements per wavelength and Nm = 10 circumferential harmonics (NT = 10 · 11 = 110 total modes). The number of harmonics and the mesh resolution were intentionally overestimated relative to the validation studies of the spheroidal scatterer with known analytical solution in Section 6.1, to ensure a reference solution with minimal boundary error.

7.1. Cost comparison between local B1 and DtN operators

The numerical studies in this section consider the case D/λ = 0.4 with a mesh resolution of λ/h = 25 elements per wavelength. Fig. 14 depicts a direct performance comparison between the local B1 and modified DtN(B1) conditions on truncation boundaries placed with offset parameters sx = 0.25 and sz = (0.25, 0.75, 1.25, 1.8); as sz increases, the shape of the boundary moves from spheroidal to nearly spherical. One concludes that an effective strategy for finding an optimal position of the truncation boundary is to employ the modified DtN condition on boundaries tightly following the scatterer, which are then gradually moved farther away (by keeping sx constant and increasing sz) until the required tolerance is reached.

Fig. 12. Plane wave impinging upon a submarine-shaped scatterer.


Fig. 13. Computational domain Ω enclosed by a spheroidal truncation boundary Γ placed at dx = sx λ and dz = sz λ from the submarine-shaped scatterer, where λ = 2π/k is the acoustic wavelength.

Fig. 14. Comparison between the B1 (Nm = 0) and DtN(B1) (Nm = Nm0) operators for the submarine problem, for truncation boundaries at sx = 0.25 with sz = 0.25, 0.75, 1.25, 1.8 (wall-clock time, s, versus relative error η, %); D/λ = 0.4, λ/h = 25.

As the modified DtN performance curve in Fig. 14 is almost vertical, a satisfactory tolerance is likely to be obtained with any reasonably close truncation boundary, provided an appropriate number of harmonics is employed in the DtN map. This number can be found with a few trial-and-error runs with an increasing number of modes, without changing the mesh. In contrast, for the local conditions several re-mesh and solve cycles are necessary in order to find the position of the truncation boundary that yields the desired tolerance, a disadvantage of using low-order local boundary operators.
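The trial-and-error procedure just described amounts to a short driver loop over the number of harmonics on a fixed mesh. In the sketch below the solver and the error indicator are hypothetical callables standing in for the finite element solve with the DtN(B1) condition and for an error estimate; neither is specified in this paper:

```python
def find_min_harmonics(solve, error_of, mesh, tol, Nm_max=20):
    """Smallest number of DtN harmonics reaching the tolerance on a fixed mesh.

    solve(mesh, Nm) -> solution : hypothetical FEM solve with DtN(B1), Nm harmonics
    error_of(solution) -> float : hypothetical error indicator
    tol                         : target (discretization) error level
    """
    for Nm in range(Nm_max + 1):       # Nm = 0 corresponds to the local B1 operator
        u = solve(mesh, Nm)            # same mesh every time; only Nm changes
        if error_of(u) <= tol:
            return Nm, u               # this Nm plays the role of Nm0 in the text
    raise RuntimeError("tolerance not reached with Nm <= Nm_max; refine the mesh")
```

The contrast with the local B1 operator is that each pass of the loop reuses the existing mesh and matrices, whereas tightening the tolerance with B1 requires regenerating and repartitioning a larger computational domain.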


7.2. Parallel performance of the DtN map

In the context of iterative methods, the matrix–vector products, dot-products, and vector updates are the most computationally intensive operations. For distributed memory systems it is important to distribute the load evenly across the computing nodes, in order to achieve optimal wall-clock timings and reduce the storage cost per node. It is also important to minimize the size of the interface between subdomains, in order to reduce communication costs. Sophisticated graph partitioning algorithms have been developed for these tasks. In the present work an implementation [65] of the partitioning algorithm of [66] is employed, allowing fast partitions of the element adjacency graph (element partitions).

To obtain a scalable algorithm, the computational load has to be balanced across the computing nodes. This may be achieved via graph weighting, i.e. degrees of freedom throughout the mesh are weighted differently depending on the computational work they incur. METIS uses a multi-criteria objective function which minimizes the interface size while balancing the computational load, based on multiple weights for each graph vertex. In the present work the interest is in placing the truncation boundary as close as possible to the scatterer, to demonstrate the true effectiveness of the DtN boundary condition. For such cases, and for unstructured free meshes with a uniform average element size, partitions with minimal interface are radial slices which include regions of the scatterer surface S as well as of the truncation boundary Γ. Thus, for close truncation boundaries with nearly uniform average element sizes, uniform weights for all elements in the mesh yield balanced graphs, since each partition contains equally sized subregions of the interior and of the truncation boundary. A sample three-dimensional partitioned grid, with the scattered field solution on the surface of the submarine, is depicted in Fig. 15. The computational domain is discretized with unstructured meshes of tri-linear tetrahedral elements, and the element grid partitions are obtained using the METIS partitioning package [65].
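To make the radial-slice observation concrete, the toy partitioner below assigns tetrahedral elements to p azimuthal slices about the scatterer's long axis, so that each subdomain receives comparable shares of interior, wet-surface, and truncation-boundary elements; it only illustrates the geometric idea and is not the METIS multilevel k-way algorithm [65,66] actually used:

```python
import numpy as np

def radial_slice_partition(nodes, elems, p, axis=0):
    """Assign each element to one of p azimuthal slices about a given axis.

    nodes : (N, 3) node coordinates
    elems : (Ne, 4) tetrahedral connectivity (node indices)
    p     : number of subdomains (slices)
    axis  : index of the scatterer's long axis (x here)
    Returns an (Ne,) array of subdomain ids in 0..p-1.
    """
    centroids = nodes[elems].mean(axis=1)             # (Ne, 3) element centroids
    lateral = np.delete(centroids, axis, axis=1)      # coordinates normal to the axis
    theta = np.arctan2(lateral[:, 1], lateral[:, 0])  # azimuth in (-pi, pi]
    return np.floor((theta + np.pi) / (2 * np.pi) * p).astype(int) % p
```

With a nearly uniform element size, each slice then carries approximately the same number of interior and boundary elements, which is the load-balance property the uniform-weight METIS partitions are expected to reproduce automatically.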

Fig. 15. Sample grid partition for the three-dimensional submarine scatterer; N ≈ 350,000 nodes, 1 million tri-linear tetrahedral elements, D/λ = 0.4, λ/h = 50; p = 10 subdomains.


Several alternative parallel methods for preconditioning the linear system are available, including algebraic SSOR and ILU [63,67,68] and others, e.g. [69]. In [55], a hybrid parallel preconditioner is implemented, with a block-Jacobi component associated with the interface nodes and SSOR-type blocks associated with the interior nodes. The interface nodes are conditioned with diagonal scaling to minimize inter-processor communication costs, while the SSOR preconditioning is used together with a parallel implementation of Eisenstat's algorithm [61] for the relatively large number of interior nodes.
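The hybrid preconditioner of [55] is only summarized in words here; the sketch below shows one plausible realization of the idea on a single subdomain (diagonal scaling of the interface dofs and a symmetric Gauss–Seidel/SSOR sweep with ω = 1 on the interior dofs, written without Eisenstat's reformulation), and should be read as a schematic rather than the implementation behind the reported timings:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def hybrid_precondition(A, r, interior, interface):
    """Apply z = M^{-1} r: Jacobi (diagonal) scaling on the interface dofs and
    SSOR with omega = 1 on the interior block of the locally owned matrix A."""
    z = np.zeros_like(r)

    # Interface dofs: diagonal scaling only, so no extra inter-processor coupling.
    z[interface] = r[interface] / A.diagonal()[interface]

    # Interior dofs: z_I = (D+U)^{-1} D (D+L)^{-1} r_I  (symmetric Gauss-Seidel).
    A_II = A.tocsr()[interior, :][:, interior]
    d = A_II.diagonal()
    L = sp.tril(A_II, k=-1, format="csr")
    U = sp.triu(A_II, k=1, format="csr")
    y = spsolve_triangular((sp.diags(d) + L).tocsr(), r[interior], lower=True)
    z[interior] = spsolve_triangular((sp.diags(d) + U).tocsr(), d * y, lower=False)
    return z
```

In the actual implementation the two triangular sweeps are folded into the matrix–vector product via Eisenstat's trick, so the preconditioned iteration costs little more than an unpreconditioned one.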

Fig. 16. (a) Speedup and (b) parallel efficiency versus number of processors for the case N ≈ 350,000 dof (1 million elements) on the NCSA PIII-1 GHz IA-32 Linux cluster; Nm = 15 and Nm = 0.


In the following, the parallel performance of the DtN matrix–vector product relative to the sparse matrix–vector product is studied; the parallel performance of the hybrid Jacobi–SSOR preconditioner, in terms of the grain size and of the reduction in the number of iterations required for convergence, is reported in [55]. To quantify the impact of the DtN term on the algorithm, speedup and parallel efficiency studies are performed. The speedup S_n when using n processors is defined as

S_n = 2 t_2 / t_n,   (68)

where t_2 and t_n are the solution times on 2 and n processors, respectively. The parallel efficiency is defined as

P_n = S_n / n.   (69)

The 2-processor case was chosen as the reference point in order to quantify the parallelism of the code and to observe the impact of the DtN upon it, rather than to compare against a highly optimized serial code. The performance of the DtN algorithm is studied on up to 32 processors (Fig. 16), for 200 BICG-STAB iterations, on the IA-32 Linux cluster at NCSA. This cluster is based on IBM eServer x330 thin servers, each with two 1 GHz Intel Pentium III processors, running Red Hat Linux, with a Myrinet II cluster interconnect network. The results show a linear speedup and near-optimal parallel efficiency, as the interior computational work on each processor significantly exceeds the communication load. Any negative effect of using Nm = 15 non-local DtN modes, compared to the purely local operator corresponding to Nm = 0, is insignificant, showing that the efficient parallel implementation of the DtN matrix–vector product is effective in preserving the near-optimal speedup of the local sparse-matrix iterative solution.
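For completeness, the speedup and efficiency curves of Fig. 16 follow directly from (68) and (69); the timings in the snippet below are placeholder values chosen only to show the bookkeeping, not the measured data:

```python
# Wall-clock solve times (s) per processor count: illustrative placeholders only.
t = {2: 1500.0, 4: 760.0, 8: 385.0, 16: 195.0, 32: 98.0}

t2 = t[2]
for n in sorted(t):
    S_n = 2.0 * t2 / t[n]   # speedup relative to the 2-processor run, Eq. (68)
    P_n = S_n / n           # parallel efficiency, Eq. (69)
    print(f"n = {n:2d}: S_n = {S_n:5.2f}, P_n = {P_n:4.2f}")
```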

Fig. 17. Timings for 200 iterations for the mesh with N ≈ 350,000 dof (approximately 1 million elements) on the NCSA PIII-1 GHz IA-32 Linux cluster, using Nm = 15 modes in the DtN term; 2 matrix–vector products and 6 dot-products per BICG-STAB iteration. Bars break the time into matrix–vector computation, interface update, DtN computation, DtN communication, dot-product computation, and dot-product communication.


In Fig. 17 the timings are broken down into the computation and communication segments within 200 BICG-STAB iterations. By design, the computation segments are significantly larger than the communication segments, which remain barely noticeable up to 32 processors. The collective communication required by the non-local DtN map in step (62) costs about the same as a single dot-product, owing to the small, fixed vector size equal to the number of DtN harmonics. The DtN communication cost is independent of the number of boundary unknowns (the grain size) and remains lower than the cost of the 6 dot-product communications per BICG-STAB iteration, even though the amount of data communicated in the single DtN operation exceeds the total amount of data communicated in those 6 dot-products. This is because the latency of calling a global communication routine 6 times per iteration, rather than once as for the DtN contribution, dominates the total communication cost.

Further studies of two-dimensional Helmholtz problems with elliptic DtN boundary conditions [22,23] have been carried out in [1,55] using the NCSA IA-32 Linux cluster with Intel PIII-1 GHz processors and Myrinet II interconnects, a distributed memory Linux cluster using Intel PIII-900 MHz processors with Myrinet I interconnects, and a 14-processor Sun UltraSparc shared memory system treated as a distributed memory system through the use of Message Passing Interface (MPI) calls. The parallel DtN algorithm with the multiplicative split was shown to scale with minimal extra communication cost compared to the local B1 and B2 operators defined on elliptic truncation boundaries, for problems partitioned with varying grain size.
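The single collective communication in step (62) can be pictured as follows. The sketch uses mpi4py rather than the authors' Fortran/MPI code, and again assumes the outer-product form of the boundary operator, with each process owning the rows of the angular-function array C for its share of the truncation-boundary dofs:

```python
import numpy as np
from mpi4py import MPI

def dtn_matvec_local(C_loc, D, x_bnd_loc, comm=MPI.COMM_WORLD):
    """Local part of y_bnd = C D C^H x_bnd with one Allreduce of length N_T.

    C_loc     : (n_bnd_local, N_T) rows of the angular-function array owned locally
    D         : (N_T,) radial-function (diagonal) coefficients, replicated on all ranks
    x_bnd_loc : (n_bnd_local,) local entries of x on the truncation boundary
    """
    a_loc = C_loc.conj().T @ x_bnd_loc      # local partial harmonic amplitudes
    a = np.empty_like(a_loc)
    comm.Allreduce(a_loc, a, op=MPI.SUM)    # the only collective: N_T complex values
    return C_loc @ (D * a)                  # local rows of the rank-N_T update

# Each process adds this contribution to its locally assembled sparse product
# K_S x, exactly as in the serial sketch given earlier; the message size is N_T,
# independent of the number of boundary unknowns per process (the grain size).
```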

8. Conclusions

The low-rank matrix representation of the discrete DtN map is exploited to construct a storage- and time-efficient implementation for distributed memory parallel computers. In particular, the symmetric outer-product structure of the DtN matrix K_DtN is exploited to compute matrix-by-vector products in parallel with one collective communication per iteration, with a vector size equal to the number of harmonics included in the DtN series expansion; the effect on the overall communication is roughly that of one additional, relatively small, dot-product inter-processor communication. The speedup and parallel efficiency are not significantly impacted by the DtN compared with a purely local approximate boundary condition (Nm = 0). Numerical results show that the collective communication within the DtN rank update is not the most communication-intensive operation in the overall BICG-STAB iterative solution process. The amount of extra memory for the DtN is small, consisting only of the storage of the angular modes distributed over the processors.

Numerical studies show that, in the context of iterative equation solvers and for the same accuracy, the modified DtN non-reflecting boundary condition applied to a tightly fitting spheroidal boundary and implemented as a low-rank update with the multiplicative split is more cost-effective, both in wall-clock time and in memory, than local approximate boundary conditions. While linear tetrahedral meshes have been used in our studies, the low-rank matrix implementation of the discrete DtN map should be even more efficient for dispersion-reducing high-order interpolation, where high accuracy is needed. Optimal scale-up to hundreds of processors is expected by exploiting the outer-product structure of the DtN operator; the DtN implementation is, however, expected to be sub-optimal for massively parallel systems using thousands of processors, since the speedup and parallel efficiency may drop as the truncation boundary is placed closer to the scatterer. On the other hand, improved timings are obtained with the DtN precisely by moving the truncation boundary closer to the scatterer, thus eliminating a significant portion of the interior domain. This reduction in cost may offset the need to utilize thousands of processors; hundreds of processors may be sufficient and cost-effective.


Acknowledgements

Support for this work was provided in part by the National Science Foundation under Grant CMS-9702082, in conjunction with a Presidential Early Career Award for Scientists and Engineers (PECASE), and by the National Computational Science Alliance under Grant MSS020008N, which provided access to the IA-32 Linux Cluster at NCSA.

References

[1] C. Ianculescu, L. Thompson, Parallel iterative methods for the Helmholtz equation with exact non-reflecting boundaries, in: Proceedings of the 2002 ASME International Mechanical Engineering Congress and Exposition, Paper IMECE2002/NCA-32744, 2002.
[2] C. Ianculescu, L. Thompson, Parallel iterative finite element solution methods for three-dimensional acoustic scattering, in: Proceedings of the 2003 ASME International Mechanical Engineering Congress & Exposition, Washington, DC, November 16–21, 2003, Paper IMECE2003-44266.
[3] I. Harari, T. Hughes, A cost comparison of boundary element and finite element methods for problems of time-harmonic acoustics, Comput. Methods Appl. Mech. Engrg. 97 (1992) 77–102.
[4] D. Burnett, A 3-D acoustic infinite element based on a generalized multipole expansion, J. Acoust. Soc. Am. 96 (1994) 2798–2816.
[5] W. Chew, J. Song, T. Cui, S. Velarnparambil, M. Hastriter, B. Hu, Review of large scale computing in electromagnetics with fast integral equation solvers, CMES—Comput. Model. Engrg. Sci. 5 (4) (2004) 361–372.
[6] E. Bleszynski, M. Bleszynski, T. Jaroszewicz, Development of new algorithms for high frequency electromagnetic scattering, CMES—Comput. Model. Engrg. Sci. 5 (2004) 295–317.
[7] O. Bruno, New high-order integral methods in computational electromagnetism, CMES—Comput. Model. Engrg. Sci. 5 (2004) 319–330.
[8] R. Astley, G. Macaulay, J.-P. Coyette, L. Cremers, Three-dimensional wave-envelope elements of variable order for acoustic radiation and scattering. Part I. Formulation in the frequency domain, J. Acoust. Soc. Am. 103 (1998) 49–63.
[9] J. Shirron, I. Babuska, A comparison of approximate boundary conditions and infinite element methods for exterior Helmholtz problems, Comput. Methods Appl. Mech. Engrg. 164 (1–2) (1998) 121–139.
[10] K. Gerdes, Conjugated versus the unconjugated infinite element method for the Helmholtz equation in exterior domains, Comput. Methods Appl. Mech. Engrg. 152 (1998) 125–145.
[11] F. Ihlenburg, On fundamental aspects of exterior approximations with infinite elements, J. Comput. Acoust. 8 (1) (2000) 63–80.
[12] R. Astley, Infinite elements for wave problems: a review of current formulations and an assessment of accuracy, Int. J. Numer. Methods Engrg. 49 (2000) 951–976.
[13] R. Astley, J. Hamilton, Numerical studies of conjugated infinite elements for acoustical radiation, J. Comput. Acoust. 8 (1) (2000) 1–24.
[14] D. Dreyer, O. von Estorff, Improving conditioning of infinite elements for exterior acoustics, Int. J. Numer. Methods Engrg. 58 (6) (2003) 933–953.
[15] J.-P. Berenger, A perfectly matched layer for the absorption of electromagnetic waves, J. Comput. Phys. 114 (2) (1994) 195–200.
[16] J.-Y. Wu, D. Kingsland, J.-F. Lee, R. Lee, A comparison of anisotropic PML to Berenger's PML and its application to the finite-element method for EM scattering, IEEE Trans. Antennas Propagat. 45 (1) (1997) 40–50.
[17] E. Turkel, A. Yefet, Absorbing PML boundary layers for wave-like equations, Appl. Numer. Math. 27 (4) (1998) 533–557.
[18] I. Harari, M. Slavutin, E. Turkel, Analytical and numerical studies of a finite element PML for the Helmholtz equation, J. Comput. Acoust. 8 (1) (2000) 121–137.
[19] A. Bayliss, C. Goldstein, E. Turkel, The numerical solution of the Helmholtz equation for wave propagation problems in underwater acoustics, Comput. Math. Appl. 11 (7–8) (1985) 655–665.
[20] J. Keller, D. Givoli, Exact non-reflecting boundary conditions, J. Comput. Phys. 81 (1989) 172–192.
[21] L.W. Pearson, R.A. Whitaker, L.J. Bahrmasel, An exact radiation boundary condition for the finite-element solution of electromagnetic scattering on an open domain, IEEE Trans. Magn. 25 (1989) 3046–3048.
[22] M. Grote, J. Keller, On non-reflecting boundary conditions, J. Comput. Phys. 122 (1995) 231–243.
[23] L. Thompson, R. Huan, C. Ianculescu, Finite element formulation of exact Dirichlet-to-Neumann radiation conditions on elliptic and spheroidal boundaries, in: 1999 ASME International Mechanical Engineering Congress & Exposition, Nashville, TN, November 14–19, 1999, ASME Noise Control and Acoustics Division—1999, NCA-vol. 26, pp. 497–510.
[24] A. Bayliss, M. Gunzberger, E. Turkel, Boundary conditions for the numerical solution of elliptic equations in exterior domains, SIAM J. Appl. Math. 42 (2) (1982) 430–451.


[25] B. Engquist, A. Majda, Absorbing boundary conditions for the numerical simulation of waves, Math. Comput. 31 (1977) 629–651.
[26] X. Antoine, H. Barucq, A. Bendali, Bayliss–Turkel-like radiation conditions on surfaces of arbitrary shape, J. Math. Anal. Appl. 229 (1) (1999) 184–211.
[27] D. Givoli, High-order local non-reflecting boundary conditions: a review, Wave Motion 39 (2004) 319–326.
[28] L. Thompson, P. Pinsky, Acoustics, in: E. Stein, R.D. Borst, T.J. Hughes (Eds.), Encyclopedia of Computational Mechanics, vol. 2, Wiley InterScience, 2004 (Chapter 22).
[29] R. Bossut, J.-N. Decarpigny, Finite element modeling of radiating structures using dipolar damping elements, J. Acoust. Soc. Am. 86 (4) (1989) 1234–1244.
[30] R. Tezaur, A. Macedo, C. Farhat, R. Djellouli, Three-dimensional finite element calculations in acoustic scattering using arbitrarily shaped convex artificial boundaries, Int. J. Numer. Methods Engrg. 53 (6) (2002) 1461–1476.
[31] R. Kechroud, A. Soulaimani, Y. Saad, S. Gowda, Preconditioning techniques for the solution of the Helmholtz equation by the finite element method, Math. Comput. Simul. 65 (2004) 303–321.
[32] D. Givoli, I. Patlashenko, J. Keller, High order boundary conditions and finite elements for infinite domains, Comput. Methods Appl. Mech. Engrg. 143 (1997) 13–39.
[33] T. Hagstrom, S. Hariharan, A formulation of asymptotic and exact boundary conditions using local operators, Appl. Numer. Math. 27 (1998) 403–416.
[34] T. Hagstrom, Radiation Boundary Conditions for the Numerical Simulation of Waves, vol. 8, Cambridge University Press, 1999.
[35] R. Huan, L. Thompson, Accurate radiation boundary conditions for the time-dependent wave equation on unbounded domains, Int. J. Numer. Methods Engrg. 47 (2000) 1569–1603.
[36] L. Thompson, R. Huan, D. He, Accurate radiation boundary conditions for the two-dimensional wave equation on unbounded domains, Comput. Methods Appl. Mech. Engrg. 191 (2001) 311–351.
[37] V. van Joolen, D. Givoli, B. Neta, High-order non-reflecting boundary conditions for dispersive waves in Cartesian, cylindrical and spherical coordinate systems, Int. J. Comput. Fluid Dynam. 17 (4) (2003) 263–274.
[38] D. Givoli, Recent advances in the DtN FE method, Arch. Comput. Methods Engrg. 6 (2) (1999) 71–116.
[39] L. Thompson, C. Ianculescu, R. Huan, Exact radiation conditions on spheroidal boundaries with sparse iterative methods for efficient computation of exterior acoustics, in: Proceedings of the 7th International Congress on Sound and Vibration, 2000, pp. 2101–2108.
[40] M. Malhotra, P. Pinsky, A matrix-free interpretation of the non-local Dirichlet-to-Neumann radiation boundary condition, Int. J. Numer. Methods Engrg. 39 (1996) 3705–3713.
[41] A. Oberai, M. Malhotra, P. Pinsky, On the implementation of the Dirichlet-to-Neumann radiation condition for iterative solution of the Helmholtz equation, Appl. Numer. Math. 27 (1998) 443–464.
[42] L. Thompson, L. Zhang, R. Ingel, Domain decomposition methods with frequency band interpolation for computational acoustics, in: ASME International Mechanical Engineering Congress and Exposition, November 11–16, 2001, New York, NY, The American Society of Mechanical Engineers, 2001, IMECE2001, Paper NCA-23532.
[43] B. Despres, Domain decomposition method and the Helmholtz problem (Part II), in: 2nd International Conference on Mathematical and Numerical Aspects of Wave Propagation, SIAM, Newark, DE, 1993.
[44] J. Benamou, B. Despres, A domain decomposition method for the Helmholtz equation and related optimal control problems, J. Comput. Phys. 136 (1997) 68–82.
[45] S. Kim, Domain decomposition iterative procedures for solving scalar waves in the frequency domain, Numer. Math. 79 (1998) 231–259.
[46] X. Cai, M. Cararin, F. Elliott, O. Widlund, Overlapping Schwarz algorithms for solving Helmholtz's equation, in: M. et al. (Eds.), Domain Decomposition Methods 10, Contemp. Math., vol. 218, AMS, Providence, RI, 1998, pp. 391–399.
[47] R. Susan-Resiga, H. Atassi, A domain decomposition method for the exterior Helmholtz problem, J. Comput. Phys. 147 (1998) 388–401.
[48] E. Larsson, A domain decomposition method for the Helmholtz equation in a multilayer domain, SIAM J. Sci. Comput. 20 (5) (1999) 1713–1731.
[49] C. Farhat, A. Macedo, M. Lesoinne, F. Roux, F. Magoules, A. La Bourdonnaie, Two-level domain decomposition methods with Lagrange multipliers for the fast iterative solution of acoustic scattering problems, Comput. Methods Appl. Mech. Engrg. 184 (2000) 213–239.
[50] R. Tezaur, A. Macedo, C. Farhat, Iterative solution of large-scale acoustic scattering problems with multiple right-hand sides by a domain decomposition method with Lagrange multipliers, Int. J. Numer. Methods Engrg. 51 (10) (2001) 1175–1193.
[51] F. Collino, S. Ghanemi, P. Joly, Domain decomposition method for harmonic wave propagation: a general presentation, Comput. Methods Appl. Mech. Engrg. 184 (2000) 171–211.
[52] M. Gander, F. Magoules, F. Nataf, Optimized Schwarz methods without overlap for the Helmholtz equation, SIAM J. Sci. Comput. 24 (2002) 38–60.
[53] F. Magoules, F.-X. Roux, S. Salmon, Optimal discrete transmission conditions for a nonoverlapping domain decomposition method for the Helmholtz equation, SIAM J. Sci. Comput. 25 (5) (2004) 1497–1515.


[54] R. Djellouli, C. Farhat, A. Macedo, R. Tezaur, Finite element solution of two-dimensional acoustic scattering problems using arbitrarily shaped convex artificial boundaries, J. Comput. Acoust. 8 (2000) 81–99.
[55] C. Ianculescu, Iterative solution of exterior Helmholtz equation with exact non-reflecting boundary conditions on distributed memory systems, Ph.D. thesis, Department of Mechanical Engineering, Clemson University, August 2003.
[56] I. Harari, T. Hughes, Analysis of continuous formulations underlying the computation of time-harmonic acoustics in exterior domains, Comput. Methods Appl. Mech. Engrg. 97 (1992) 103–124.
[57] M. Abramowitz, I. Stegun, Handbook of Mathematical Functions, 1968.
[58] C. Flammer, Spheroidal Wave Functions, Stanford University Press, Stanford, CA, 1957.
[59] H. Van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Stat. Comput. 13 (2) (1992) 631–644.
[60] R. Freund, A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems, SIAM J. Sci. Comput. 14 (2) (1993) 470–482.
[61] S. Eisenstat, Efficient implementation of a class of preconditioned Conjugate Gradient methods, SIAM J. Sci. Stat. Comput. 2 (1984) 130–138.
[62] P.M. Morse, H. Feshbach, Methods of Theoretical Physics, McGraw-Hill, 1953.
[63] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, 1996.
[64] S. Zhang, J. Jin, Computation of Special Functions, John Wiley & Sons, Inc., 1996.
[65] METIS: Unstructured graph partitioning and sparse matrix ordering system. Available from: .
[66] G. Karypis, V. Kumar, A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm, in: 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[67] A. Gupta, G. Karypis, V. Kumar, Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Syst. 8 (5) (1997) 502–520.
[68] J.R. Gilbert, R. Schreiber, Highly parallel sparse Cholesky factorization, SIAM J. Sci. Stat. Comput. 13 (5) (1992) 1151–1172.
[69] M. Malhotra, P. Pinsky, Parallel preconditioning based on h-hierarchical finite elements with application to acoustics, Comput. Methods Appl. Mech. Engrg. 155 (1998) 97–117.