Pergamon

Computers & Structures Vol. 55, No. 3, pp. 471-483, 1995
Elsevier Science Ltd. Printed in Great Britain
0045-7949/95 $9.50 + 0.00

A HIERARCHICAL TWO-LEVEL MULTIGRID SOLVER

Y. Pressburger†‡ and R. Perucchio§

†Hibbitt, Karlsson & Sorensen, Inc., 1080 Main St, Pawtucket, RI 02860, U.S.A.
§Department of Mechanical Engineering, College of Engineering and Applied Science, University of Rochester, Rochester, NY 14627, U.S.A.

(Received 18 December 1993)
Abstract-In this paper we present a fast, inherently parallel, two-level multigrid (TLMG) algorithm for solving quadratic finite element models containing tetrahedral meshes. The basic idea is to integrate a TLMG algorithm with a hierarchical octree structure based on Recursive Spatial Decomposition (RSD). The new hierarchical TLMG (HTLMG) algorithm combines direct substructuring and static condensation with a new algorithm for performing the SOR iteration hierarchically. The hierarchical SOR (HSOR) iteration is designed to operate independently on each subdomain in the octree structure. The time and space complexities of the HSOR and HTLMG iterations are derived for a regular domain. Two test problems are used to compare the computational cost of the HTLMG algorithm with that of a direct solver.
1. INTRODUCTION
In a typical application of the FEM, we approximate the solution u of a boundary value problem over a domain S by discretizing S into a mesh M consisting of a finite number of elements, each defined in terms of a set of nodal points. Within each element, u is replaced by ū, defined by interpolating nodal values ū_i with a set of shape functions N_i. The FEM solution of three-dimensional continuum problems is computationally intensive. The accuracy of the solution depends to a great extent on the level of discretization used to approximate the original domain. A finer mesh, in general, produces better results than a coarse one, but at the expense of higher computational costs. Clearly, on a standard sequential machine, time and space complexity will severely limit the size of the model. The problem of increasing storage requirements can be solved by adding more memory, the price of which has been decreasing for the past few years. However, dealing with increasing computational requirements is more challenging and can be done in two ways: by using machines of higher computational speed or by adopting parallel algorithms that allow execution of the problem in a multiprocessing environment.

This paper addresses the problem of introducing a high level of parallelism into a two-level multigrid (TLMG) procedure. It has been demonstrated that TLMG algorithms provide a good approximation of the exact solution for quadratic p-hierarchical finite element (FE) models, while offering a substantial saving of CPU time compared with direct solution via Gauss elimination [1-3]. In this context, parallelism is defined as follows. Let A be an algorithm which operates on an element in the space S and produces results in the space R. In other words, A is a function that maps elements from the domain S to elements in the range R.† The algorithm A is said to be inherently parallel if

A(s) = ∪_i A(s_i),  (1)

where the s_i are elements of a disjoint set such that

S = ∪_i s_i,  s_i ∈ S,  (2)

and A(s_i) is strictly a function of s_i only. The TLMG algorithm developed in the course of this work combines successive over-relaxation (SOR) iterations over the fine model (quadratic model) and direct Gaussian elimination over the coarse model (linear model). The time complexity of the TLMG iteration is strongly dominated by the internal SOR iteration applied to smooth the solution of the quadratic FE model. From the SOR equation [see eqn (7) below] it can be observed that before computing the ith component of the solution vector at the (s + 1) iteration, ū_i^{s+1}, all the components ū_j^{s+1} (j = 1, 2, ..., i − 1) need to be available. Thus, there is a precedence requirement among the algebraic operations performed by the SOR iteration. This implies that the traditional implementation of the SOR iteration does not satisfy eqn (1) and therefore cannot be directly implemented on a concurrent machine.

‡ Formerly at the University of Rochester, Department of Mechanical Engineering.
† For example, the SOR iteration operates on a vector ū^s defined in the space R^n (the space of real vectors of size n), and the result ū^{s+1} is defined in the same space R^n.
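The precedence requirement can be seen directly in code. The sketch below (ours, not from the paper; all names are illustrative) implements one forward SOR sweep on a dense matrix: the update of component i reads components that were already overwritten earlier in the same sweep, which is exactly what blocks a naive component-parallel implementation.

```python
import numpy as np

def sor_sweep(K, f, u, omega=1.2):
    """One forward SOR sweep, eqn (7) in residual form: u is overwritten
    component by component, so u[i] depends on u[0..i-1] at sweep s+1."""
    for i in range(len(f)):
        r = f[i] - K[i, :] @ u            # row residual with mixed-sweep values
        u[i] += omega * r / K[i, i]
    return u

# Small symmetric positive definite test system; for such systems SOR
# converges whenever 0 < omega < 2 (Ref. [9]).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
K = A.T @ A + 6 * np.eye(6)
f = rng.standard_normal(6)
u = np.zeros(6)
for _ in range(300):
    sor_sweep(K, f, u)
print(np.allclose(K @ u, f, atol=1e-8))
```

Because each sweep mutates `u` in place, the loop over `i` cannot be distributed across processors without changing the iteration; this is the motivation for the reordered, substructure-based iteration developed in Section 3.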
The problem of introducing parallelism into the SOR iteration was addressed by Missirlis [4]. Later this algorithm was improved by Robert and Trystram [5] and again by Eisenstat [6]. These algorithms divide the SOR iteration into non-interfering tasks by identifying a precedence among the different operations performed by the iterative procedure. This type of solution preserves the convergence property of the sequential SOR method, since the basic iteration remains unchanged. There is, however, the problem of large overheads due to the synchronization and communication requirements between processors. We develop a parallel version of the TLMG algorithm by exploiting the hierarchy of substructures contained in a recursively decomposed geometry in order to organize the SOR iteration into non-interfering tasks. The tetrahedral meshes used by the hierarchical TLMG algorithm are based on an octree decomposition of the solid. We first review our TLMG algorithm in Section 2. In Section 3, we present our new hierarchical two-level multigrid algorithm. In Section 4, the time and space complexities are determined for a regular mesh embedded in a unit cube. Results show that for sufficiently large problems the new algorithm requires fewer algebraic operations per iteration than the standard TLMG and SOR algorithms. In Section 5, two test problems are used to determine the efficiency of the HTLMG algorithm as compared with a direct solver.
2. A TLMG ALGORITHM

In the context of finite element analysis, the multigrid method denotes an iterative procedure for solving systems of equations obtained from a set of geometrically nested finite element meshes. The objective of the multigrid method is to accelerate the convergence of a "regular" (or one-level) iterative procedure (such as SOR, Gauss-Seidel, etc.) applied to the finest mesh by revisiting the coarser models. As discussed by Brandt [7], the main idea is to exploit the fact that the coarse and the fine models approximate the same differential problem. Hence, the coarser model can be used as an approximation to the fine model. It can be shown that one-level iterative methods are very effective in reducing the highly oscillatory parts of the residual vector but less so in reducing the low frequency parts of the residuals [7]. The multigrid algorithm solves the problem of slow convergence due to the low frequency components of the residuals by interpolating the residuals from a fine model to a coarser model, and solving the residual equation on the coarse model either iteratively or exactly [8]. It is important to realize that a key factor in the design of a multigrid algorithm is the existence of simple projection and interpolation operators between the coarse and the fine models. Such operators can be derived relatively easily when the nested meshes are regularly structured. However, in general, FE meshes representing complex geometries are not regularly structured, and the derivation of the projection and interpolation operators becomes problematic. In this paper, we present a variation of the original multigrid method that does not require regularly structured meshes.

To develop a special multigrid algorithm for the present work we exploit the notion that a sequence of nested FE models can be embedded on the same mesh by using p-hierarchical shape functions. In contrast with the conventional multigrid approach described above, progressive model refinement is now obtained by adding higher order shape functions to existing elements. The mesh geometry and topology remain unchanged. The particular algorithm described below consists of two models only, linear and quadratic. Hence the name two-level multigrid (TLMG) [1, 3]. Assume that there are n_l degrees of freedom in the linear model, and that there are p = n_l + n_q degrees of freedom in the quadratic model. For the linear model we have

K_l ū_l = f_l,  (3)

where K_l is an n_l × n_l matrix and ū_l and f_l are n_l × 1 vectors. For the quadratic model we have

K_q ū_q = f_q,  (4)

where K_q is a p × p matrix and ū_q and f_q are p × 1 vectors. Because of the hierarchical nature of the elements, eqn (4) can be partitioned as

[ K_ll  K_lq ; K_ql  K_qq ] { ū^l ; ū^q } = { f^l ; f^q },  (5)

where K_ll = K_l and f^l is identical to the load vector of the linear model. Notice that, because of the particular structure of the matrix K_q, the optimal bandwidth is larger than the size of the matrix K_ll [i.e. B(K_q) > n_l, where B(K_q) is the bandwidth of K_q]. Thus, p-hierarchic FE models require larger storage than regular FE models of the same order [the amount of storage is of O(NB)]. Furthermore, solving hierarchic FE models is computationally more expensive [the computational cost of Gaussian elimination is of O(NB²)]. This justifies the effort to develop fast iterative solution procedures specifically designed for p-hierarchical models.

The first step of the TLMG algorithm requires the solution of the linear model. This is obtained by Gaussian elimination. Next, the SOR method (or any other one-level iteration method) is used to solve iteratively for the quadratic model. The solution to the linear model ū_l is extrapolated to the quadratic one to serve as the initial guess ū_q^0 of the iterative
procedure. Since p-hierarchical elements are used, the extrapolation is simply

ū_q^0 = [ I_l ; O_ql ] ū_l,  (6)

where I_l is an n_l × n_l identity matrix and O_ql is an n_q × n_l zero matrix. Following the conventional multigrid terminology, this step is referred to as the iteration sweep or smoothing iteration. The SOR iteration is given by (see for example Ref. [9]):

ū_i^{s+1} = ū_i^s + (ω/K_ii) ( f_i − Σ_{j<i} K_ij ū_j^{s+1} − Σ_{j≥i} K_ij ū_j^s ),  i = 1, 2, ..., p,  (7)

where ū_i^{s+1} is an element of the vector ū_q^{s+1} [ū_q^{s+1} denotes the vector ū_q at the (s + 1) SOR iteration, starting from the initial guess ū_q^0], f_i is an element of the vector f_q, K_ij is an element of the matrix K_q, and ω is the relaxation factor. After v iteration sweeps the convergence rate of the SOR iteration decreases. At this point, the iteration is stopped and the residual vector r_q is computed from

r_q = f_q − K_q ū_q^v.  (8)

Subtracting eqn (4) from eqn (8), we obtain the residual equation

K_q e_q = r_q,  (9)

where e_q is the error vector (e_q = ū_q − ū_q^v). To improve the convergence rate of the solution procedure, the residual vector is projected back to the linear model using the relation

r_l = [ I_l  O_lq ] r_q,  (10)

where r_l denotes the residual vector in the linear model. Now the equation

K_l e_l = r_l,  (11)

which is an approximation to the residual eqn (9), is solved for e_l. This equation can be solved by either a direct or an iterative solver. Notice, however, that since the matrix K_l is already reduced to an upper triangular matrix (because Gaussian elimination was performed on the linear model), the exact solution of eqn (11) reduces essentially to a backward substitution step, approximately as expensive as one SOR iteration. Thus, it is convenient to solve eqn (11) exactly. Furthermore, numerical tests have shown that the convergence rate of the global solution improves when eqn (11) is solved exactly. Next, the error vector e_l is extrapolated back to the quadratic model using the relation

e_q = [ I_l ; O_ql ] e_l,  (12)

and the solution to the quadratic model is updated as

ū_q = ū_q^v + e_q.  (13)

The process of solving the residual equation on the linear model and updating the quadratic model solution is called the coarse-grid correction. The combination of the coarse-grid correction and v smoothing iterations represents one two-level cycle. These cycles are repeated until convergence of ū_q is obtained.

3. HIERARCHICAL TLMG ALGORITHM

Our present objective is to modify each of the steps in a TLMG iteration such that it can be integrated with the hierarchy of substructures contained in RSD-based meshes. The modified algorithm, called hierarchical two-level multigrid (HTLMG), includes a hierarchical version of the SOR (HSOR) iteration, specifically designed for parallel implementation. HTLMG consists of the following essential steps: directly solve the linear FEM model via hierarchical substructuring [10, 11], iterate on the quadratic model following the HSOR algorithm, compute the residual vector, interpolate and project vectors between the linear and the quadratic models, and solve directly the residual equation in the linear model via hierarchical substructuring.

3.1. Solution of the linear FE model

The first step of the TLMG algorithm determines the exact solution of the linear FE model using direct equation elimination. HTLMG solves the linear model using the hierarchical substructuring solver introduced by Kela and Perucchio [10]. The hierarchical substructuring solver is fully integrated with the RSD data structure and can be implemented on a concurrent machine as discussed by Saxena and Perucchio [11].

3.2. Hierarchical SOR (HSOR) iteration
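The TLMG cycle of eqns (3)-(13) can be condensed into a short algebraic sketch (ours, not the authors' code; all names are illustrative). Because the shape functions are hierarchical, the coarse model is simply the leading n_l × n_l block of the quadratic system, so the operators of eqns (6), (10) and (12) reduce to truncation and zero-padding:

```python
import numpy as np

def tlmg_solve(Kq, fq, nl, omega=1.2, nu=5, cycles=60):
    """Two-level multigrid sketch: nu SOR smoothing sweeps on the full
    (quadratic) system, then an exact coarse-grid correction on the
    leading nl x nl block, which plays the role of the linear model
    (K_ll = K_l in eqn (5))."""
    p = len(fq)
    Kl = Kq[:nl, :nl]                      # coarse operator, eqn (3)
    u = np.zeros(p)
    u[:nl] = np.linalg.solve(Kl, fq[:nl])  # initial guess from linear model, eqn (6)
    for _ in range(cycles):
        for _ in range(nu):                # smoothing sweeps, eqn (7)
            for i in range(p):
                u[i] += omega * (fq[i] - Kq[i, :] @ u) / Kq[i, i]
        r = fq - Kq @ u                    # residual, eqn (8)
        el = np.linalg.solve(Kl, r[:nl])   # project (eqn 10) and solve eqn (11)
        u[:nl] += el                       # extrapolate and update, eqns (12)-(13)
    return u

# Symmetric positive definite test system
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10))
Kq = A.T @ A + 10 * np.eye(10)
fq = rng.standard_normal(10)
u = tlmg_solve(Kq, fq, nl=4)
print(np.allclose(Kq @ u, fq, atol=1e-8))
```

In the paper the coarse solve reuses the triangular factor produced by the initial Gaussian elimination, so each correction costs only a backward substitution; here `np.linalg.solve` stands in for that step.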
The integration of the SOR iteration into the hierarchical tree structure follows an approach similar to the hierarchical substructuring scheme. A multi-level hierarchy of substructures is obtained by recursively assembling groups of substructures. A two-dimensional example is shown in Fig. 1, where (a) is the complete RSD-based mesh, (b) and (c) are typical leaf level substructures, (d) and (e) are typical intermediate level substructures, and (f) is the root level substructure. Each substructure is characterized by a set of internal nodes and a set of nodes on the boundary. A SOR iteration based on this hierarchy of substructures is called a hierarchical SOR (HSOR) iteration. The HSOR iteration consists of four basic steps: (a) assemble the substructure stiffness matrix and load vector; (b) iterate on the internal degrees of freedom associated with the lowest level (i.e. leaf
[Fig. 1. A RSD-based mesh (a) and typical substructures at each tree level (b-f); internal and boundary nodes are distinguished in the legend, and the tree levels 0 (root) to 3 are indicated.]
level) substructures; (c) traverse the tree from the leaves up to and including the root, and iterate on the internal degrees of freedom associated with each substructure; (d) iterate on the boundary degrees of freedom associated with the root substructure. Note that step (a) is performed only once before the first HSOR iteration, while steps (b)-(d) are repeated at every HSOR iteration.

3.2.1. Step (a): stiffness matrix assembly. The lowest level nodes of the octree consist of cells, which are mapped into tetrahedral elements. The stiffness matrices of the elements associated with a particular substructure are assembled into the matrix K̄. Similarly, the load vectors are assembled into the vector f̄. In the case considered, K̄ and f̄ are associated with a quadratic FE model generated using p-hierarchical shape functions. Let b and i denote, respectively, the nodes on the boundary and in the interior of the substructure. The equilibrium equations for the lowest level substructures are

K̄ ū = f̄.  (14)
The matrices in eqn (14) are partitioned into submatrices associated with the internal nodes and the boundary nodes as follows:

K̄ = [ K̄_ii  K̄_ib ; K̄_bi  K̄_bb ],  (15)

f̄ = { f̄_i ; f̄_b },  (16)

and

ū = { ū_i ; ū_b }.  (17)

Since the FE model consists of hierarchical p-elements, the submatrices in eqns (15)-(17) are further partitioned following the linear/quadratic classification. For example, the matrix K̄_ii is partitioned as

K̄_ii = [ K̄_ii^ll  K̄_ii^lq ; K̄_ii^ql  K̄_ii^qq ],  (18)

where the superscripts l and q refer to linear and quadratic, respectively.

Note that for any substructure, the stiffness matrices K̄_ii, K̄_ib, and the load vector f̄_i are fully assembled, since the internal degrees of freedom are shared only by elements contained within the substructure. Based on this observation, the assembly process works as follows: starting from the leaf nodes and proceeding to the root of the octree, the stiffness matrices and the load vectors of the offspring of the same cell are assembled and then partitioned as in eqns (15) and (16). Since the internal degrees of freedom are not condensed out, eqn (14) is not satisfied for the non-leaf substructures and must be replaced by

K̄ ū = f̄ + p̄_r,  (19)

where p̄_r represents partial residuals produced by SOR iterations on lower tree level substructures (an exact definition of p̄_r and its use is given below). Once all the substructures at a given level are assembled, the procedure begins to operate on the tree level immediately above. This process continues until the root node of the tree is reached.

3.2.2. Step (b): leaf level iteration. After the stiffness matrices and the load vectors of all the substructures have been assembled, the HSOR iteration begins to operate on all the lowest level substructures (i.e. leaf level substructures). Consider a typical lowest level substructure. First, a SOR iteration is applied to compute the new solution vector at the internal degrees of freedom of the substructure. Following the notation defined above for the assembly process, this iteration is given as

[ū_i]_n^{s+1} = [ū_i]_n^s + (ω/[K̄_ii]_nn) ( [f̄_i]_n − Σ_{m<n} [K̄_ii]_nm [ū_i]_m^{s+1} − Σ_{m≥n} [K̄_ii]_nm [ū_i]_m^s − Σ_{m=1}^{n_b} [K̄_ib]_nm [ū_b]_m^s ),  n = 1, 2, ..., n_i,  (20)

where n_i and n_b are the number of internal and boundary degrees of freedom, ū^{s+1} is the solution vector at the (s + 1) HSOR iteration, and [K̄]_nm refers to a particular component of the matrix K̄. Notice that all the matrices in eqn (20) are fully assembled, which indicates that this is a complete and valid SOR iteration. Next, we define a partial residual vector p̄_r, equal in dimension to the number of boundary degrees of freedom in the substructure:

p̄_r = −K̄_bi ū_i^{s+1}.  (21)

The vector p̄_r represents the contribution of the internal degrees of freedom to a SOR iteration performed on the boundary degrees of freedom. As the HSOR iteration proceeds, p̄_r is assembled into the global vector P̄_r in the usual way.

3.2.3. Step (c): intermediate level iteration. Starting from one level above the leaf level, the iteration continues by traversing the tree to the root. In each level, the following three operations are performed on each substructure.

(1) First, a generalized load vector f̂_i associated with the internal degrees of freedom is computed as

f̂_i = f̄_i + P̄_i,  (22)

where P̄_i is a subset of the global vector P̄_r. Note that, at this stage, the components of P̄_r associated with the internal degrees of freedom are fully assembled.

(2) Next, a complete SOR iteration is performed for all the internal degrees of freedom using

[ū_i]_n^{s+1} = [ū_i]_n^s + (ω/[K̄_ii]_nn) ( [f̂_i]_n − Σ_{m<n} [K̄_ii]_nm [ū_i]_m^{s+1} − Σ_{m≥n} [K̄_ii]_nm [ū_i]_m^s − Σ_{m=1}^{n_b} [K̄_ib]_nm [ū_b]_m^s ),  n = 1, 2, ..., n_i.  (23)

The only difference between eqn (20) and eqn (23) is that the vector f̂_i in eqn (23) replaces the vector f̄_i in eqn (20). The vector f̂_i includes the contribution of all degrees of freedom internal to the current substructure, including those which are not represented in the current stiffness matrices because they are already accounted for as internal degrees of freedom in lower level substructures.

(3) When the iteration over the internal degrees of freedom is completed, the partial residual vector p̄_r is updated following the process described above [eqn (21)].

3.2.4. Step (d): root level iteration. The HSOR iteration is completed by iterating on the boundary degrees of freedom associated with the root substructure. Note that the internal degrees of freedom of the root substructure have already been taken into account in step (c) above. This step starts with the derivation of a modified load vector associated with the boundary degrees of freedom as

f̂_b = f̄_b + P̄_b,  (24)

where P̄_b is a sub-vector of the vector P̄_r containing the elements of P̄_r associated with degrees of freedom on the boundaries of the substructure.
This is followed by an SOR iteration on the boundary degrees of freedom using

[ū_b]_n^{s+1} = [ū_b]_n^s + (ω/[K̄_bb]_nn) ( [f̂_b]_n − Σ_{m<n} [K̄_bb]_nm [ū_b]_m^{s+1} − Σ_{m≥n} [K̄_bb]_nm [ū_b]_m^s ),  n = 1, 2, ..., n_b.  (25)
At this point all the degrees of freedom have been updated following the SOR equation and a single HSOR iteration is completed. The next HSOR iteration follows the same path, beginning at the leaf level and terminating at the root level.

3.2.5. Convergence and parallelism. The HSOR iteration converges for the same reasons the regular SOR iteration does.† In general, however, after n HSOR iterations the solution vector may differ from the one obtained by n regular SOR iterations. This is because the solution vector obtained by the SOR equation depends on the order in which its components are updated (i.e. it depends on the node numbering). Essentially, the HSOR iteration changes the order in which these components are updated by iterating first on the internal nodes of each substructure. Thus, both iterations converge under the same conditions but may give different results after a fixed number of iterations.

The HSOR iteration satisfies the definition of parallelism given in Section 1. We define the set of elements on which the HSOR iteration operates to be the collection of all substructures at a particular level. From the description of the HSOR iteration it is clear that it can be applied to each element in this set independently of the operations performed on the other elements in the set. Since the substructures are disjoint, eqns (1) and (2) are satisfied and the HSOR is inherently parallel. The only restriction is that before operating at a particular level, all the substructures of a lower level have to be processed.

3.3. Interpolation and projection operators
The complete HTLMG algorithm contains two steps which require the projection of vectors from the linear model to the quadratic model and one step which requires the interpolation of vectors from the quadratic model to the linear one. Due to the hierarchical nature of the shape functions used in the FE model, these steps are performed in constant time.
† In Ref. [9], p. 77, it is proved that if the coefficient matrix (i.e. K_q) is positive definite, and if 0 < ω < 2, then the SOR iteration converges.
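The mechanics of steps (b)-(d) can be sketched on a one-level partition (a Python sketch of ours, not the authors' implementation; all names are hypothetical, and the octree is collapsed to leaf substructures plus a root boundary):

```python
import numpy as np

def hsor_sweep(K, f, u, interiors, bnd, omega=1.2):
    """One HSOR-style sweep on a one-level partition.

    Each interior block is swept independently of the other blocks, which
    is what makes the iteration inherently parallel; blocks communicate
    with the boundary only through the partial residuals of eqn (21),
    accumulated here in Pr and applied as the modified load of eqn (24)."""
    Pr = np.zeros(len(f))
    for ids in interiors:                        # steps (b)-(c): internal DOFs
        for i in ids:
            u[i] += omega * (f[i] - K[i, :] @ u) / K[i, i]
        Pr[bnd] -= K[np.ix_(bnd, ids)] @ u[ids]  # p_r = -K_bi u_i, eqn (21)
    for n in bnd:                                # step (d): boundary sweep, eqn (25)
        fb = f[n] + Pr[n]                        # modified load, eqn (24)
        u[n] += omega * (fb - K[n, bnd] @ u[bnd]) / K[n, n]
    return u

# Two leaf substructures (interior DOFs 0-3 and 4-7) sharing boundary DOFs 8-9.
rng = np.random.default_rng(2)
ids1, ids2, bnd = [0, 1, 2, 3], [4, 5, 6, 7], [8, 9]
K, f = np.zeros((10, 10)), rng.standard_normal(10)
for ids in (ids1, ids2):
    loc = ids + bnd                              # substructure DOFs: interior + boundary
    A = rng.standard_normal((6, 6))
    K[np.ix_(loc, loc)] += A.T @ A + 6 * np.eye(6)   # SPD substructure matrix
u = np.zeros(10)
for _ in range(400):
    hsor_sweep(K, f, u, [ids1, ids2], bnd)
print(np.allclose(K @ u, f, atol=1e-8))
```

Since the two interior blocks share no degrees of freedom (the coupling block of K between them is zero by construction), their sweeps could run on separate processors; the result is an ordinary SOR iteration under a substructure-first node ordering, which converges under the same conditions as the sequential method.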
3.4. Residual vector computation

After v SOR iterations on the quadratic model are completed, the residual vector is computed, eqn (8). Note that the residual vector is interpolated to the linear model immediately after being computed on the quadratic model. Because of the special form of the interpolation operation, eqn (10), only the first n_l components of the residual vector (where n_l is the number of degrees of freedom in the linear model) need to be computed. The computation of the residual vector r_q in the HTLMG involves traversing the octree and visiting each non-empty RSD cell. A partial residual vector p_r is associated with each cell, and is equal to

p_r = { p_ri ; p_rb },  (26)

where

p_ri = −K̄_ii ū_i^v − K̄_ib ū_b^v,  (27)

and p_rb is either

p_rb = −K̄_bi ū_i^v  (28)

for non-root substructures, or

p_rb = −K̄_bi ū_i^v − K̄_bb ū_b^v  (29)

for the root substructure. As before, i and b denote internal and boundary, respectively. Next, the vectors p_r are assembled into a global vector P_r and the residuals are computed as

r_q = f_q + P_r.  (30)
As with the case of the HSOR iteration, the algorithm for computing the residual vector satisfies the definition of parallelism. We define the set of substructures (at all levels) to be the domain on which this algorithm operates. The result of the algorithm is the set of p_r vectors. There is no requirement on the order in which substructures at any level are visited, since the vector ū remains unchanged during this step. Thus the substructures can be operated upon in parallel.

3.5. Solution of the residual equation

After the residuals are computed and interpolated to the linear model, the linear representation of the residual equation is solved for the error vector e_l, eqn (11). Recall that the linear model is solved directly and, therefore, the stiffness matrix K_l is already reduced to upper triangular form. Thus, the direct solution of the residual equation requires a backward substitution for the boundary degrees of freedom associated with the root substructure followed by recovery of the internal degrees of freedom. As
indicated in Ref. [11], recovery of the internal degrees of freedom can be parallelized.
4. TIME AND SPACE COMPLEXITY ANALYSIS
In this section we compare the computational cost (i.e. CPU time and storage) of the HTLMG algorithm with that of the TLMG algorithm. We assume that both methods converge after the same number of iterations. Therefore, it is enough to compare the costs of a single iteration of each method. The following comparison reveals that, for uniform RSD meshes constructed on a cube, HTLMG and HSOR are asymptotically superior to TLMG and SOR, respectively. This finding assumes a sequential machine and does not reveal the full advantage that the HTLMG and HSOR iterations will provide on a parallel machine. Clearly, the computational cost of the HTLMG is strongly dominated by the internal HSOR iteration. For each HTLMG cycle the cost is approximately equal to v times the cost of a single HSOR iteration, where v is the number of HSOR iterations in each HTLMG cycle. The same consideration applies to TLMG with respect to SOR iterations. In addition, the storage requirements for HSOR and HTLMG are identical and are dominated by the storage for the stiffness matrices associated with the quadratic model of each substructure. Thus, it is sufficient to study the time and space complexities of the HSOR iteration and compare them with those of the standard SOR iteration. The time and space complexities, i.e. the computational cost T and the storage requirement S, associated with the HSOR are derived below for a regular hexahedron, a unit cube, which has been subdivided to level m of the octree decomposition.† In this case, the mesh is a regular n × n × n grid, with 5n³ tetrahedral elements and (n + 1)³ vertices (nodes for the linear model) located at the grid points, where n = 2^m. Comparing only the order of complexity and ignoring the multiplicative constants in the expressions for T and S may lead to wrong conclusions, simply because very large values of n are not feasible in practical applications.
Here, in order to present a more reliable comparison, we give the explicit expressions for T and S associated with the HSOR and SOR iterations. The substructures at level i of the hierarchical structure are formed by assembling eight substructures of level i − 1. Note that in this case i = 0 is the initial mesh stored in the octree at the resolution level, while i = m is the final substructure stored at the root-node level. Let s_i denote the number of substructures at level i, and let q_i, b_i and p_i be the number of interior nodes, boundary nodes, and the total number of nodes, respectively, in each substructure at level i (p_i = q_i + b_i). It can be shown that for the quadratic model we have [3]

s_i = 2^{3(m−i)}

q_i = 12·2^{2i} − 18·2^i + 7

b_i = 24·2^{2i} + 2

p_i = 36·2^{2i} − 18·2^i + 9.  (31)

† Recall that level 0 is the cube itself, level 1 corresponds to the first eight octants, and so on.
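The counts in eqn (31) can be checked mechanically; the short script below (ours) verifies the identity p_i = q_i + b_i and tabulates the first few levels:

```python
def level_counts(i):
    """Interior, boundary and total node counts of eqn (31) for a
    substructure at level i of the quadratic model."""
    q = 12 * 4**i - 18 * 2**i + 7
    b = 24 * 4**i + 2
    return q, b, q + b

for i in range(5):
    q, b, p = level_counts(i)
    assert p == 36 * 4**i - 18 * 2**i + 9   # p_i = q_i + b_i, eqn (31)
    print(i, q, b, p)
```

For i = 0 this gives q_0 = 1, b_0 = 26 and p_0 = 27, i.e. the 27 nodes of a single quadratic cell, which is a useful sanity check on the formulas.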
Assuming only one degree of freedom per node, the computational cost in each HSOR iteration is associated with the following three operations: (1) iteration on the internal degrees of freedom of each substructure; (2) computation of the partial residual vector p_r; and (3) iteration on the boundary degrees of freedom associated with the root substructure. The total number of operations in these three steps is [3]:

T_HSOR = Σ_{i=0}^{m} s_i · q_i · (2b_i + q_i) + b_m².  (32)

Substituting s_i, q_i, and b_i into eqn (32) and summing over i results in the computational cost for the quadratic model

T_HSOR = 2016n⁴ − 1960n³ − 784n² + 108n − 7.  (33)

The space complexity S (i.e. the number of floating point numbers stored) is derived by considering that any substructure requires storing three matrices: K̄_ii, K̄_ib, and K̄_bb. The total storage required for these matrices is [3]:

S_HSOR = Σ_{i=0}^{m} s_i [ ½ q_i (q_i + 1) + q_i · b_i + ½ b_i (b_i + 1) ].  (34)

Substituting s_i, q_i, and b_i into eqn (34) results in the following storage requirement for the quadratic model

S_HSOR = 1296n⁴ − … n³ − 504n² + 57n − … .  (35)
To complete this study, we must compare the above results with the computational cost and the storage requirement associated with a single SOR iteration applied to a fully assembled system of equations in which the nodes are numbered to minimize the bandwidth. The time and space complexity for a single SOR iteration is [3]:

T_SOR = N(2B − 1) − B(B − 1)

S_SOR = B(N − B + 1) + ½ B(B − 1),  (36)

where N is the total number of degrees of freedom and B is the semi-bandwidth. For the problem in question, N and the optimal B associated with the quadratic model are [3]:

N_q = 7n³ + 12n² + 6n + 1

B_q = n³ + 9n² + 9n + 7.  (37)

Substituting eqn (37) into eqn (36) gives

T_SOR = 13n⁶ + 132n⁵ + 255n⁴ + 242n³ + 84n² − 21n − 29

S_SOR = (13/2)n⁶ + 66n⁵ + (255/2)n⁴ + (249/2)n³ + 48n² − (15/2)n − 14.  (38)

[Fig. 2. Number of operations for the SOR and HSOR iterations for the unit cube, quadratic model.]

[Fig. 3. Storage requirements for the SOR and HSOR iterations for the unit cube, quadratic model.]
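The comparison plotted in Figs 2 and 3 can be reproduced numerically by evaluating the per-iteration operation counts of eqns (33) and (38) (a sketch of ours; it takes the stated polynomials at face value):

```python
def t_hsor(n):
    """Operations per HSOR iteration on the n x n x n grid, eqn (33)."""
    return 2016 * n**4 - 1960 * n**3 - 784 * n**2 + 108 * n - 7

def t_sor(n):
    """Operations per banded SOR iteration, eqn (38)."""
    return (13 * n**6 + 132 * n**5 + 255 * n**4 + 242 * n**3
            + 84 * n**2 - 21 * n - 29)

# n = 2^level: the O(n^4) HSOR count overtakes the O(n^6) banded SOR count
# between octree levels 2 and 3, consistent with the crossover at a
# "moderate size model" noted in the text.
for level in range(1, 6):
    n = 2 ** level
    print(level, n, t_hsor(n), t_sor(n), t_hsor(n) < t_sor(n))
```

At level 2 (n = 4) the standard SOR iteration is still cheaper; from level 3 (n = 8) onward the hierarchical iteration wins, and the gap widens rapidly with n.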
The number of operations and the storage requirement (in kbyte, assuming double precision) for the SOR and HSOR iterations for a quadratic hierarchical FE model are presented in Figs 2 and 3. Each data point represents a particular level in the octree, starting from level 0 at the left up to level 9 at the right. Notice that the HSOR becomes more efficient than the SOR method for a moderate size model (n > 2000 or level 3 in the octree), and even for smaller models the HSOR and SOR iterations are comparable.

5. EXAMPLES
In this section we present two test problems used to evaluate the convergence of the HTLMG algorithm.
5.1. Test problem 1

The first test case is a short cantilever beam subject to a uniform traction applied at the free end and with the built-in end fully clamped (Fig. 4). For the problem under consideration we select P = 1600, E = 3 × 10⁷, ν = 0.3, l = 8, h = 4. Figure 5 shows a sequence of adaptively refined meshes. This includes the initial mesh (Fig. 5a) and
five adaptively refined meshes. Two different meshes are shown for the most refined level: in Fig. 5e a very non-uniform mesh is given, while a more uniform mesh is shown in Fig. 5f. The strain energies for the linear and quadratic models are shown in Fig. 6. In Fig. 6 the strain energies are presented for three different solutions associated with each FE mesh, namely: direct (exact) solutions for both linear and quadratic models based on hierarchical substructuring, and the approximate solution for the quadratic model based on the HTLMG iterative procedure. The figure clearly shows that, for sufficiently fine models, the iterative procedure gives a very good approximation to the exact FE solution. The relatively poor approximation for coarse models is due to the stopping condition for the iteration. As described in Refs [1, 3], we adopted a stopping criterion for the iteration based on the notion that the error in the iterative procedure should be of the same order as the error in the finite element approximation. The basic idea is that as the FE solution produces a better approximation to the physical problem (due to mesh refinement), the iterative procedure also gives a better approximation to the FE solution. The results shown in Fig. 6 indicate that the stopping criterion operates exactly as designed.
[Fig. 4. Short cantilever beam under bending.]
Fig. 5. Meshes for the beam problem: (a) initial mesh; (b) first refinement: (c) second refinement; (d) third refinement; (e) fourth refinement-local refinement; and (f) fourth refinement--global refinement.
Finally, we define the speed-up T_s obtained by using the HTLMG algorithm in place of hierarchical substructuring as

T_s = T_HS / T_HTLMG,    (39)

where T_HS and T_HTLMG are the CPU times in seconds required to solve the models using the direct solver based on hierarchical substructuring and the HTLMG algorithm, respectively. In Fig. 7, T_s is plotted as a function of the number of degrees of freedom in the quadratic FE model. Notice that for the very non-uniform mesh (Fig. 5e) the speed-up is smaller than for the more uniform mesh (Fig. 5f). This is in agreement with the experimental results reported in Refs [1, 3], which indicated that the efficiency of the iterative procedure decreases for very non-uniform meshes.
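The speed-up of eqn (39) is simply a ratio of measured CPU solve times. As a sketch, with hypothetical stand-ins (speed_up and timed are illustrative helpers; the actual T_HS and T_HTLMG come from the hierarchical substructuring and HTLMG solvers, which are not reproduced here):

```python
import time

def speed_up(t_hs, t_htlmg):
    """Eqn (39): T_s = T_HS / T_HTLMG."""
    return t_hs / t_htlmg

def timed(solver, *args):
    """Return (solver result, elapsed CPU seconds) for one solve,
    using process time rather than wall-clock time."""
    t0 = time.process_time()
    result = solver(*args)
    return result, time.process_time() - t0

# Example with the order of magnitude reported for the finest plate
# model: a direct solve taking 28x the HTLMG time gives T_s = 28.
assert speed_up(28.0, 1.0) == 28.0
```

Because both timings are CPU seconds on the same machine, T_s > 1 means the HTLMG iteration beats the direct solver, and the ratio grows with model size in both test problems.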
5.2. Test problem 2

The second test problem is a thick plate with a cylindrical hole under uniform uniaxial tension. As
Fig. 6. Strain energies for the beam problem: (a) using the models described in Fig. 5(a-e); (b) using the models described in Fig. 5(a-d and f).
Fig. 7. Speed-ups of the HTLMG algorithm compared with hierarchical substructuring for the beam problem.
Fig. 8. Thick plate with a circular hole under tension. Due to symmetry only one-eighth of the problem is modeled.

shown in Fig. 8, only one-quarter of the problem is modeled due to symmetry. The dimensions are a = 3.5, w = 8.0, d = 4.0 and h = 16.0. The material properties are E = 3 × 10^7 and ν = 0.3. The appropriate boundary conditions are also shown in the figure. A sequence of adaptively refined meshes is shown in Fig. 9. In Fig. 11, analysis 1 is the initial mesh, analysis 2 is the first refinement, and so on.
In Fig. 10 the efficiency of the HTLMG algorithm is compared with that of hierarchical substructuring. The speed-up T_s, defined in eqn (39), is plotted as a function of the number of degrees of freedom in the FE models. As for the beam problem, large speed-ups are registered, in particular for the finest model, where T_s ≈ 28. Also, as shown in Fig. 11, for finer models the iterative procedure gives an excellent approximation to the exact solution. Thus, as the model is refined, the HTLMG algorithm provides both higher efficiency and better accuracy.

Fig. 9. Finite element meshes for the plate problem: (a) initial mesh; (b) first refinement; (c) second refinement; (d) third refinement.

Fig. 10. Speed-ups of the HTLMG algorithm compared with hierarchical substructuring for the plate problem.

6. DISCUSSION

In this paper we introduce a hierarchical version of the TLMG algorithm previously described in Refs [1-3]. The two basic operations in each HTLMG cycle are the internal smoothing iterations using the HSOR scheme and the solution of the residual equation via hierarchical substructuring. Both operations are fully integrated with the octree data structure used for meshing. We define the sub-domains on which these operations are applied to be the hierarchy
of substructures embedded at the various levels of the tree, and we show that HTLMG satisfies the requirement for being inherently parallel in the sense that operations on substructures at the same tree level can be performed concurrently. There is, however, an ordering requirement between levels: before operating on a particular level in the tree, the operations on the lower levels need to be completed. The HTLMG algorithm is particularly efficient for solving high-order models built with hierarchical shape functions. The stiffness matrix for high-order hierarchical models is conveniently built by adding high-order terms to the matrix of the linear model. To avoid nodal renumbering (and thus to keep the linear matrix unchanged), the higher-order terms are inserted as additional rows and columns of the original linear matrix. Unfortunately, this results in a very large bandwidth for higher-order models. For example, the bandwidth of a quadratic model is always greater than the number of degrees of freedom in the linear model. Thus, for the same number of degrees of freedom, the hierarchical shape function approach requires much more computational time and storage than the standard non-hierarchical formulation [12]. In contrast to standard SOR, HSOR does not require the assembly of the global stiffness matrix and, therefore, its time and space complexities are not dominated by the bandwidth. By replacing SOR with HSOR in the TLMG algorithm we retain the advantages of the hierarchical approach, such as simple projection and interpolation operators, without paying the penalty of reduced efficiency due to the increase in bandwidth. The advantage of HSOR over SOR for quadratic models embedded on a unit cube is clearly shown in Figs 2 and 3.

Fig. 11. Strain energies for the plate problem.

Two test problems are used to compare the HTLMG algorithm with a direct solver based on hierarchical substructuring. It can be shown that, for a regular cubic grid, hierarchical substructuring is exactly equivalent to the nested dissection [13, 14] ordering of the mesh nodes, which has been found to produce the lowest time and space complexity obtainable by a direct solver on a regular grid [15]. Because of the regularity of the meshes in the two test cases, we conjecture that hierarchical substructuring produces near-optimum time complexities for these two problems. Thus, the speed-ups reported in Figs 7 and 10 should provide a good estimate of the speed-ups that would be measured with larger and more complex geometrical models. In the course of the present work, machine and compiler limitations have prevented us from applying our approach to larger models. The results show, however, that the efficiency of the iterative procedure increases with the size of the model and that speed-ups of over 30 can be achieved for models with approximately 3000 degrees of freedom.

REFERENCES

1. Y. Pressburger, R. Perucchio and D. A. Field, A two-level multigrid algorithm for solving 3-D quadratic finite element models. Proc. 1991 ASME International Computers in Engineering Conference, Santa Clara, CA, Vol. II, pp. 61-70 (1991).
2. D. A. Field and Y. Pressburger, An h-p-multigrid method for finite element analysis. Int. J. Numer. Meth. Engng 36, 893-908 (1993).
3. Y. Pressburger, A self-adaptive FE procedure based on recursive spatial decomposition and multigrid analysis. Ph.D. Thesis, Department of Mechanical Engineering, University of Rochester, NY (1991).
4. N. M. Missirlis, Scheduling parallel iterative methods on multiprocessor systems. Parallel Comput. 5, 295-302 (1987).
5. Y. Robert and D. Trystram, Comments on scheduling parallel iterative methods on multiprocessor systems. Parallel Comput. 7, 253-255 (1988).
6. S. C. Eisenstat, Comments on scheduling parallel iterative methods on multiprocessor systems II. Parallel Comput. 11, 241-244 (1989).
7. A. Brandt, Multi-level adaptive solutions to boundary-value problems. Math. Comput. 31, 333-390 (1977).
8. W. L. Briggs, A Multigrid Tutorial. Society for Industrial and Applied Mathematics, Philadelphia, PA (1987).
9. R. S. Varga, Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, NJ (1962).
10. A. Kela and R. Perucchio, Hierarchical incremental analysis through solid modeling. Proc. ASME Int. Computers in Engineering Conf., San Francisco, CA, Vol. III, pp. 611-621 (1988).
11. M. Saxena and R. Perucchio, Parallel FEM algorithms based on recursive spatial decomposition - II. Automatic analysis via hierarchical substructuring. Comput. Struct. 47, 143-154 (1993).
12. R. B. Morris, Y. Tsuji and P. Carnevali, Adaptive solution strategy for solving large systems of p-type finite element equations. IBM Research Report RJ7216 (67983) (1989).
13. A. George, Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10, 345-363 (1973).
14. O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation. Academic Press, London (1984).
15. A. J. Hoffman, M. S. Martin and D. J. Rose, Complexity bounds for regular finite difference and finite element grids. SIAM J. Numer. Anal. 10, 364-369 (1973).