Parallel methods for optimality criteria-based topology optimization

Kumar Vemaganti *, W. Eric Lawrence

CAE Research Laboratory, Department of Mechanical, Industrial and Nuclear Engineering, University of Cincinnati, P.O. Box 210072, Cincinnati, OH 45221-0072, United States

Received 11 February 2004; received in revised form 21 July 2004; accepted 6 August 2004
Abstract

Topology optimization problems require the repeated solution of finite element problems that are often extremely ill-conditioned due to highly heterogeneous material distributions. This makes the use of iterative linear solvers inefficient unless appropriate preconditioning is used. Even then, the solution time for topology optimization problems is typically very high. These problems are addressed by considering the use of non-overlapping domain decomposition-based parallel methods for the solution of topology optimization problems. The parallel algorithms presented here are based on the solid isotropic material with penalization (SIMP) formulation of the topology optimization problem and use the optimality criteria method for iterative optimization. We consider three parallel linear solvers to solve the equilibrium problem at each step of the iterative optimization procedure. These include two preconditioned conjugate gradient (PCG) methods: one using a diagonal preconditioner and one using an incomplete LU factorization preconditioner with a drop tolerance. A third substructuring solver that employs a hybrid of direct and iterative (PCG) techniques is also studied. This solver is found to be the most effective of the three solvers studied, both in terms of parallel efficiency and in terms of its ability to mitigate the effects of ill-conditioning. In addition to examining parallel linear solvers, we consider the parallelization of the iterative optimality criteria method. To tackle checkerboarding and mesh dependence, we propose a multi-pass filtering technique that limits the number of "ghost" elements that need to be exchanged across interprocessor boundaries.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Topology optimization; Parallel computing; Preconditioners
* Corresponding author. Tel.: +1 512 556 2738; fax: +1 512 556 3390. E-mail address: [email protected] (K. Vemaganti).
1. Introduction

Topology optimization, also known as layout optimization, is a relatively new and growing subfield of structural optimization. Over the last decade, it has seen widespread academic use and is now being applied to real-life industrial problems [1], especially in the automotive industry. In topology optimization, the goal is to find the material distribution that minimizes some objective function subject to user-specified constraints on the amount of material available. The shape and the connectivity of the domain are both design variables; the introduction of new boundaries is permitted (and expected). This is in contrast to shape optimization, where portions of the structural boundary are described as parametrized curves or surfaces and the goal is to find the boundary representation that minimizes the objective function for a fixed topology. Thus, topology optimization features a larger space of feasible solutions and can play an important role in the conceptual design phase of the engineering life cycle. Engineers can use topology optimization to create mechanical components that are stronger, lighter and more cost effective.

Most previous work on topology optimization for continuum structures has been devoted to methods for solving the optimization problem itself: developing new formulations, studying existence and uniqueness issues, dealing with numerical instabilities, and so on. In the last decade or so, significant progress has been made in understanding and mitigating these problems, and it continues to this day. This progress has opened the door to the study of other aspects of topology optimization that have thus far received far less scrutiny. One such class of under-examined problems deals with the computational aspects of the topology optimization problem and is the focus of this work.

Topology optimization problems are generally discretized using a conforming finite element mesh, and the material distribution problem becomes one of finding the relative density of each finite element. An element with zero relative density represents a void and an element with a relative density of 1 represents a solid element. The goal is to find a distribution of relative densities that extremizes an objective function, subject to constraints on the amount of material available. The exact number of design variables per finite element depends on the specific formulation at hand. For example, in two spatial dimensions, the solid isotropic material with penalization (SIMP) approach uses one variable per element [2,3], whereas the homogenization approach [4-6] requires three variables per element. In any case, the number of design variables is large and the problem is usually solved using iterative optimization techniques that involve repeated finite element solves. The overall computational effort involved is thus considerable even for coarse finite element discretizations. This observation serves as the motivation for our development and study of parallel methods for the solution of large-scale topology optimization problems.

In this work, we develop algorithms for the parallel solution of the topology optimization problem. A particular formulation of the topology optimization problem, the SIMP approach, is parallelized in two spatial dimensions using domain decomposition techniques. To solve this form of the optimization problem, an optimality criteria (OC) method is employed with an iterative heuristic scheme for updating the design variables.
At each OC iteration an elasticity equilibrium problem must be solved. For this purpose, we consider three parallel solvers: a diagonally preconditioned conjugate gradient solver, an incomplete LU (ILU) preconditioned conjugate gradient solver and a hybrid substructuring solver that uses static condensation followed by the preconditioned conjugate gradient method. The first of these three is similar to the solver presented by Borrvall and Petersson [7], where a parallel methodology is presented for the topology optimization of three-dimensional structures based on a one-dimensional domain decomposition. In our work, we use Hilbert space filling curves (SFC) to effect the decomposition of the domain. We study the parallel efficiency of the aforementioned solvers as well as their efficacy in mitigating the poor conditioning of the systems that result from topology optimization. A filtering technique used to ensure mesh-independent solutions of the optimization problem is modified and parallelized as well. We also address the question of how accurately the state variables (i.e., the nodal displacements) need to be solved
in order to obtain reasonable values for the design variables (i.e., those variables that are adjusted as the structure is optimized).

The rest of this paper is organized as follows. In Section 2, we define the problem and present background information. Section 3 discusses the sequential techniques typically used for solving the topology optimization problem: solvers for the linear system Ku = f, including the preconditioned conjugate gradient method, and the optimality criteria iterative method for the solution of the discrete optimization problem. The parallelization of the various aspects of the solution process is described in Section 4. Section 5 documents some of the hardware and software resources that were utilized to accomplish the research. Section 6 presents the results from several numerical experiments that study the behavior of the modified filtering scheme, the efficiency of the various solvers, and the effects of solving the topology optimization problem with various PCG tolerances.
2. Preliminaries

2.1. Problem formulation

Consider a body occupying a reference domain Ω, as shown in Fig. 1. This reference domain is often chosen to be the simplest domain possible that allows for the application of the desired boundary conditions and loading. It consists of prescribed points of material and no material, as well as design points where the existence of material is not determined until the topology optimization problem is solved. The body is subject to body forces b in the reference domain Ω and surface traction forces t on the boundary Γ. The reference domain boundary contains Γ_d, where displacements are prescribed, and Γ_t, where traction forces are applied. Note that Γ_d ∪ Γ_t = ∂Ω and Γ_d ∩ Γ_t = ∅. The space of admissible displacements is given by

V = { v | v ∈ H^1(Ω), v = 0 on Γ_d }.    (1)

Fig. 1. Topology optimization design domain.

The internal virtual work done by the body at equilibrium in moving through an arbitrary admissible virtual displacement v ∈ V is given by the bilinear form
B(u, v) = ∫_Ω ε(v) : E ε(u) dx,    (2)

where u is the equilibrium displacement field, ε(·) is the strain operator

ε(v) = (1/2)(∇v + (∇v)^T),  v ∈ V,    (3)

and E is the elasticity tensor. The external virtual work is represented as

F(v) = ∫_Ω b · v dx + ∫_{Γ_t} t · v ds.    (4)

The displacement field u is then the solution to the weak form of the equilibrium equations,

B(u, v) = F(v)  ∀ v ∈ V.    (5)
For the sake of convenience, we will assume that the goal is to maximize the stiffness of the body, or, equivalently, to minimize the compliance. This assumption has no direct bearing on the primary focus of this article: the parallel solution of the topology optimization problem. Let the compliance be defined as

C = F(u).    (6)
The standard topology optimization problem consists of finding a material distribution E ∈ E_adm that minimizes the compliance, subject to a constraint on the amount of material available. Here, E_adm is the space of admissible elasticity tensors over which the infimum of C is sought. The topology optimization problem can thus be stated as

inf_{u ∈ V, E ∈ E_adm}  F(u; E),
subject to  B(u, v) = F(v)  ∀ v ∈ V,    (7)
            + volume constraints.

Problem (7) needs further transformation and manipulation in order to ensure existence of solutions. Broadly, there exist two categories of such methods: relaxation methods that enlarge the design set, and restriction methods that reduce the design set. Homogenization [4-6] and artificial material methods fall under the first category, while perimeter control [8,9] and slope constraint [10] methods belong to the latter. For a detailed discussion of existence and uniqueness issues, we refer the reader to [11]. Here, without loss of generality, we will use the so-called Solid Isotropic Material with Penalization (SIMP) approach [2,3], a variant of the artificial material approach, to parametrize the set of admissible elasticity tensors E_adm:

E_adm = { E | E = ρ^p E^0, p ≥ 1 },    (8)

where

ρ ∈ H = { ρ | ρ ∈ L^∞(Ω), 0 ≤ ρ ≤ 1 a.e. in Ω }    (9)
and E^0 is a known positive definite elasticity tensor with the usual symmetries. The function ρ is termed the relative density function. Regions in Ω with ρ = 1 represent material regions, whereas regions with ρ = 0 represent voids. Regions with intermediate values of ρ represent non-physical "gray" areas; such values of ρ are penalized through the exponent p. As a result of this parametrization of E_adm, the bilinear form B now depends on ρ, which will be emphasized by replacing B with B_ρ. The SIMP approach is neither a relaxation approach nor a restriction approach. Therefore, to ensure existence, we use a filtering technique that involves the modification of design sensitivities using a heuristic algorithm. This is discussed in more detail in Section 2.4.
The topology optimization problem can now be stated as follows:

inf_{u ∈ V, ρ ∈ H}  F(u; ρ),
subject to  B_ρ(u, v) = F(v)  ∀ v ∈ V,    (10)
            ∫_Ω ρ dx ≤ V̄,
where V̄ is a user-defined volume limit on the structure.

2.2. Discretization of the topology optimization problem

In practice, (10) is discretized using the finite element method. The reference domain Ω is partitioned into N non-overlapping finite elements and the density function is approximated using a piece-wise constant function ρ_h = {ρ_i}, 1 ≤ i ≤ N, whereas the displacement field u is approximated using piece-wise polynomials. We will use the symbol u to denote the finite element displacement degrees of freedom as well, since the meaning is usually clear from the context. The design variables then become the element densities ρ_i, which, because they are constant per element, can be moved outside the integral and treated as simple scalings of the element stiffness matrix. The discretized topology optimization problem, with the corresponding discretized forms of the equilibrium equation and mean compliance, becomes

min_{ρ_i}  C = u^T K u = Σ_{i=1}^{N} (ρ_i)^p u_i^T k_i u_i,
subject to  K u = Σ_{i=1}^{N} (ρ_i)^p k_i u_i = f,    (11)
            Σ_{i=1}^{N} ρ_i v_i ≤ V̄,
            0 < ρ_min ≤ ρ_i ≤ 1,  i = 1, ..., N,

where K is the global stiffness matrix, f is the global force vector, k_i is the element stiffness matrix, u_i is the set of element degrees of freedom, and v_i is the element area (or volume in three dimensions). Note that setting ρ_min to a positive value prevents the stiffness matrix from becoming singular.

2.3. The optimality criteria (OC) method

The discrete topology optimization problem (11) is characterized by a large number of design variables, N in this case. It is therefore common to use iterative optimization techniques to solve this problem, e.g., the method of moving asymptotes [12] or the optimality criteria (OC) method, to name two. Here we choose the latter. At each iteration of the OC method, the design variables are updated using a heuristic scheme. We now briefly review the major aspects of this method. Details may be found in [5].

The Lagrangian for the optimization problem is defined as

L(ρ_i) = u^T K u + Λ (Σ_{i=1}^{N} ρ_i v_i − V̄) + λ_1^T (K u − f) + Σ_{i=1}^{N} λ_2^i (ρ_min − ρ_i) + Σ_{i=1}^{N} λ_3^i (ρ_i − 1),    (12)

where the first term is the compliance C, and Λ, λ_1, λ_2^i and λ_3^i are Lagrange multipliers for the various constraints. The optimality condition is given by

∂L/∂ρ_i = 0,  i = 1, 2, ..., N.    (13)
Differentiating (12) with respect to ρ_i and manipulating the terms, one can write the optimality condition as

B_i := (−∂C/∂ρ_i) / (Λ v_i) = 1.    (14)

The compliance sensitivity term ∂C/∂ρ_i is evaluated as

∂C/∂ρ_i = −p (ρ_i)^{p−1} u_i^T k_i u_i.    (15)

Based on these expressions, the design variables are updated as follows:

ρ_i^new = max(ρ_min, ρ_i − m)   if ρ_i B_i^η ≤ max(ρ_min, ρ_i − m),
          ρ_i B_i^η             if max(ρ_min, ρ_i − m) < ρ_i B_i^η < min(1, ρ_i + m),    (16)
          min(1, ρ_i + m)       if min(1, ρ_i + m) ≤ ρ_i B_i^η,
where m is called the move limit and represents the maximum allowable change in the relative density ρ_i in a single OC iteration. Also, η is a numerical damping coefficient, usually taken to be 1/2. The Lagrange multiplier for the volume constraint, Λ, is determined at each OC iteration using a bisectioning algorithm. See the paper by Sigmund [13] for a description of the OC iterative scheme, including details of its implementation.
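To make the scheme concrete, the following is a minimal Python sketch of one OC update, with bisection on Λ. The function and variable names (oc_update, dC, vol_limit) and the bracketing values are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def oc_update(rho, dC, v, vol_limit, rho_min=1e-3, move=0.2, eta=0.5):
    # rho: element densities; dC: sensitivities dC/drho_i (negative, eq. 15);
    # v: element areas/volumes; vol_limit: the volume bound V-bar.
    lo, hi = 1e-9, 1e9                       # bracket for the multiplier Lambda
    while (hi - lo) / (hi + lo) > 1e-4:      # bisection on the volume constraint
        lam = 0.5 * (lo + hi)
        B = -dC / (lam * v)                  # optimality ratio B_i, eq. (14)
        rho_new = np.clip(rho * B**eta,      # damped, move-limited update (16)
                          np.maximum(rho - move, rho_min),
                          np.minimum(rho + move, 1.0))
        if (rho_new * v).sum() > vol_limit:
            lo = lam                         # too much material: raise Lambda
        else:
            hi = lam
    return rho_new
```

A larger multiplier uniformly scales the candidate densities down, which is what makes the bisection on the volume constraint monotone.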
2.4. Checkerboarding, mesh dependence and filtering

Checkerboarding, the appearance of regions of alternating solid and void elements, is a common problem in topology optimization caused by non-convergence of the finite element solution to the exact solution of (10). A sample finite element solution with checkerboarding is shown in Fig. 2.

Fig. 2. Converged topology optimization problem with checkerboard patterns.

Mesh dependence is another common problem in topology optimization, wherein the solution to the topology optimization problem changes qualitatively as the mesh is refined. Sigmund and Petersson [11] enumerate various approaches to mitigate these problems, including the use of smoothing techniques, higher order finite elements, superelements, and filtering. While purely heuristic, the filtering technique has become popular due to its simplicity and ease of implementation (in sequential codes). Applying a filter involves modification of the design sensitivities used in each iteration of the optimality criteria update (16):
∂Ĉ/∂ρ_i = [1 / (ρ_i Σ_{j∈M_i} Ĥ_j)] Σ_{j∈M_i} Ĥ_j ρ_j (∂C/∂ρ_j),    (17)

where the hat denotes the filtered sensitivity and Ĥ_j is a convolution operator defined as

Ĥ_j = r_min − dist(i, j),  M_i = { j : 1 ≤ j ≤ N, dist(i, j) ≤ r_min },    (18)

and dist(i, j) is the distance between the centers of elements i and j. The convolution operator is zero outside the filter area and decays linearly with the distance from element i. Therefore, the closest neighbors of element i have the most influence on its sensitivity. The sensitivities calculated in (17) are used in the OC update (16) instead of the original sensitivities obtained from (15). In this form, the filtering approach is good for eliminating both checkerboarding and mesh-dependency problems if the filter size is sufficiently large.
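As a reference point for the parallel version discussed in Section 4, here is a minimal sequential sketch of the filter (17)-(18). The O(N^2) neighbor search is kept only for clarity, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def filter_sensitivities(rho, dC, centers, rmin):
    # centers: (N, 2) array of element centroids; rmin: filter radius.
    dC_hat = np.empty_like(dC)
    for i in range(len(rho)):
        dist = np.linalg.norm(centers - centers[i], axis=1)
        H = np.maximum(rmin - dist, 0.0)     # hat(H)_j from (18): linear decay,
                                             # zero outside the filter radius
        dC_hat[i] = (H * rho * dC).sum() / (rho[i] * H.sum())
    return dC_hat
```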
3. Numerical solution of topology optimization problems

The numerical solution of the discrete topology optimization problem, as presented in the previous section, is computationally very demanding. Each OC iteration is comprised of the solution of a large system of linear equations (the finite element problem) and a smaller optimization problem (the update of the design variable ρ_i on each element in the domain Ω). This procedure is summarized in Fig. 3.

Fig. 3. Topology optimization algorithm.

Before each OC update, the finite element problem must be solved so that the displacements can be used as input into the update. For problems of any practical size, the time to update the design variables is much smaller than the time to do the finite element analysis (that is, to solve the equilibrium equation (5)). Therefore, any investigation into the efficiency of solving the topology optimization problem should focus primarily on:

• The time it takes to solve the finite element problem at each optimization iteration.
• The number of optimization iterations needed to converge to a reasonable solution (so as to minimize the total number of finite element solves that need to occur).

In this section, we focus on the solution of the finite element equations, which yield the large sparse symmetric positive definite (SPD) linear system

K u = f,    (19)
where K is the global stiffness matrix of the structure. It is the size of K, and the form it takes when solving topology optimization problems, that drive decisions about the most appropriate techniques for solving (19).

3.1. Solution methods

Methods for the solution of systems of linear equations can be classified into two general groups: direct methods and iterative methods. Direct methods are usually variants of Gaussian elimination, where the matrix is factorized into a triangular form and then back-substitution is performed. In general, direct methods are not well-suited for large, sparse systems such as (19) because of the large fill-in that often occurs when factorizing the matrix. "Fill-in" refers to non-zero slots in the factorized matrix that are zero in the original system matrix. This is a significant problem because large fill-in requires much more memory to store the factorized matrix. Fill-in can be reduced, but not eliminated, if intelligent strategies are used to reorder the unknowns. Another problem with direct methods in large systems is that they cannot be parallelized efficiently, due to the large amount of communication that needs to occur between processors.

Iterative solvers work by repeatedly improving an approximate solution until it is accurate enough. These methods are only exposed to the stiffness matrix K via a matrix-vector multiply of the form y = Kx. Iterative solvers in general require much less storage space to solve sparse systems than direct methods because there is no analogue to fill-in. Furthermore, iterative solvers are much easier to parallelize and require less interprocess communication, because only the matrix-vector multiply operations need to be parallelized.

An important area of concern when working with iterative solvers is that they are often slow to converge (or fail to converge at all) for poorly conditioned systems. This is of particular interest for the solution of the topology optimization problem, since the condition of the global stiffness matrix generally deteriorates as the optimization problem begins to converge. This is because of the highly heterogeneous material distribution that results when all element density values are nearly 0 or 1. It is important to note, however, that much of this ill-conditioning can be mitigated by effective preconditioning of the stiffness matrix.
3.2. The preconditioned conjugate gradient method

The conjugate gradient (CG) method is the most prominent method for solving large, sparse, SPD systems and is the primary solution method chosen in this research. The fundamental premise of this method is that the quadratic functional

f(x) = (1/2) x^T K x − f^T x    (20)

is minimized by the solution of Ku = f if K is SPD. The method works by generating:

• successive approximations u_i to the solution u,
• residuals r_i corresponding to these approximations,
• and search directions p_i used in updating the approximations and residuals.

In every iteration of the method, two inner products are performed in order to compute update scalars: α is used to update the iterate u_i and is chosen to minimize a weighted norm of the residual, and β is used to update the search direction p_i and is chosen so that p_i is conjugate (K-orthogonal) to p_{i−1}. Convergence is achieved when the relative L2 norm of the residual vector is sufficiently small. Details can be found in many books, including [14,15]. The serial version of the preconditioned conjugate gradient algorithm is listed below:

Algorithm 1 (Serial Preconditioned Conjugate Gradient Method)

  i = 0
  r_0 = f − K u_0
  repeat
    Solve M z_i = r_i
    i = i + 1
    if i = 1
      p_1 = z_0
    else
      β_i = (r_{i−1}^T z_{i−1}) / (r_{i−2}^T z_{i−2})
      p_i = z_{i−1} + β_i p_{i−1}
    end
    α_i = (r_{i−1}^T z_{i−1}) / (p_i^T K p_i)
    u_i = u_{i−1} + α_i p_i
    r_i = r_{i−1} − α_i K p_i
  until convergence
  u = u_i
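For illustration, the following is a compact Python version of Algorithm 1 written against NumPy. Here apply_Minv stands for the preconditioner solve M z = r and is an assumed callback, not part of the paper's code; the routine also works unchanged with a scipy.sparse matrix K.

```python
import numpy as np

def pcg(K, f, apply_Minv, tol=1e-8, u0=None):
    # Algorithm 1; converges when the relative L2 residual drops below tol.
    u = np.zeros_like(f) if u0 is None else u0.copy()
    r = f - K @ u
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    fnorm = np.linalg.norm(f)
    while np.linalg.norm(r) / fnorm > tol:
        Kp = K @ p
        alpha = rz / (p @ Kp)                # step length
        u = u + alpha * p
        r = r - alpha * Kp
        z = apply_Minv(r)                    # preconditioner solve M z = r
        rz_new = r @ z
        p = z + (rz_new / rz) * p            # new K-conjugate direction
        rz = rz_new
    return u
```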
As with all iterative solution methods, the speed of convergence of the conjugate gradient method is highly dependent on the condition number of the global stiffness matrix. Qualitatively, the condition number of a matrix represents how sensitive the solution is to small changes in the input data. Quantitatively, the condition number of an SPD matrix can be expressed as

κ(K) = λ_max(K) / λ_min(K) ≥ 1,    (21)

where λ_max(K) and λ_min(K) are the maximum and minimum eigenvalues of the matrix, respectively.
If the condition number of a matrix is high, it is desirable to take steps to transform the system into one that is better conditioned. This is done by applying a preconditioner M to the system so that

M^{-1} K u = M^{-1} f.    (22)

An effective preconditioner must be spectrally close to K, such that M^{-1}K ≈ I, where I is the identity matrix. Often, in practice, the choice of an efficient preconditioner is heavily dependent on the formulation of the problem at hand. Also, a major factor in choosing a preconditioner is the amount of effort it takes to formulate the preconditioner. The usefulness of preconditioning drops when the effort to calculate the preconditioner approaches the amount of effort it would take to converge using the original system with no preconditioner. This factor is of particular importance in a parallel environment, where calculating a global preconditioner can become very expensive. There are many formulations of preconditioners. Two types are implemented for the research here: diagonal and incomplete LU (ILU) factorization.

3.2.1. Diagonal preconditioning

The diagonal preconditioner is by far the simplest to implement and the least costly to calculate. It is given by

m_ij = k_ij  if i = j,  and  m_ij = 0  otherwise.    (23)

Because this is a diagonal matrix, the storage required is only equal to the number of degrees of freedom.

3.2.2. ILU preconditioning

In the ILU preconditioner, the preconditioner matrix M is never actually assembled [14]. Instead, the equation M z_i = r_i, which is solved in the conjugate gradient update, actually becomes K z_i = r_i, where K is the original global stiffness matrix. This equation, however, is only solved "partially", because incomplete factorization techniques are used to solve it. Incomplete factorization means that, as the matrix is being factorized, fill-in is either eliminated completely, by dropping all entries in the factorized matrix that were zero in the original matrix K, or reduced using a drop tolerance approach. As with any factorization technique, this can potentially require much interprocessor communication in a parallel environment.

It turns out that the diagonal preconditioner is one of only a few where it is practical to calculate and store its inverse directly before the first CG iteration. For the ILU preconditioner, however, the equation M z_i = r_i is solved at each CG iteration without calculating M^{-1} directly. Numerical experiments involving these two preconditioners are presented in Section 6.

Remark 3.1. The condition number of the global stiffness matrix tends to degrade in topology optimization problems as the optimization problem begins to converge to the optimal solution, because of the development of interfaces between areas of material and no material. Generally, optimal designs with longer material/non-material interfaces (i.e., more porous materials) result in poorer conditioning than those with shorter boundaries. Thus, the use of the perimeter control approach [8,9] might, in addition to ensuring existence of a solution to the topology optimization problem, improve the convergence properties of the PCG solver. The filtering technique introduced earlier has a similar beneficial effect on the conditioning of the stiffness matrix, since it results in "grey" areas along the material/non-material boundaries. Measuring these effects is, however, difficult, because the topology that results from the modified approach (whether using perimeter control or filtering) is most often qualitatively different from the unmodified case.
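As a sketch of the two preconditioners described in Sections 3.2.1 and 3.2.2: the diagonal preconditioner can be applied directly, while a drop-tolerance ILU can be built with SciPy's spilu. Note that spilu wraps SuperLU, whose dropping strategy differs in detail from the SPOOLES factorization used in the paper; both helpers below return callbacks usable with the pcg sketch above.

```python
import scipy.sparse.linalg as spla

def diagonal_preconditioner(K):
    d = K.diagonal()                    # M = diag(K), eq. (23)
    return lambda r: r / d              # applying M^-1 is elementwise division

def ilu_preconditioner(K, drop_tol=1e-3):
    # Incomplete LU with a drop tolerance: small entries are discarded
    # during factorization, limiting fill-in at the cost of accuracy.
    ilu = spla.spilu(K.tocsc(), drop_tol=drop_tol)
    return lambda r: ilu.solve(r)       # triangular solves with L and U

# usage with the pcg sketch above:
#   u = pcg(K, f, ilu_preconditioner(K, drop_tol=1e-3))
```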
3.3. Hybrid substructuring solver

The substructuring or condensation solver is the other type of solver studied here. However, it is used only in the context of parallel solves and is explained in detail in the next section. The basic idea is to use a mixture of direct and iterative solvers to accomplish the solution of the entire system. Direct methods are used to "condense" the system on each subdomain owned by a processor into an interface problem. The interface problem is then solved in parallel using a PCG solver. The use of direct methods on parts of the domain collapses the problem into a much smaller, better-conditioned system that can be easily handled by the PCG solver.

3.3.1. Accuracy of FE solutions

One of the questions explored in this work is how accurately the state variables (i.e., the displacements) need to be solved in order to calculate reasonable values for the design variables ρ_i. If it is found that the PCG algorithm can be run with loose tolerances without significantly affecting the accuracy of the design variable update procedure, then the time it takes to solve the finite element problem at each optimization iteration can be significantly reduced.

Essential to this idea is understanding what acceptable values for the design variables mean in the context of topology optimization problems. Topology optimization is often used as a preprocessing phase to obtain the general material distribution of a structure while providing an estimate for the boundaries. The resulting topology is then used as the initial guess for shape optimization, which can more easily control the details of the boundaries with a moderate number of design variables (such as spline control points). The density data is transferred by visual interpretation by a designer or by automated image processing or smoothing techniques. Using either process, there will be no significant changes in the resulting input to the shape optimization problem if there are only slight variations in the density values along the boundary elements. Therefore, the use of looser tolerances for the iterative solution of the finite element problem is a viable option if it only results in such small variations in the element densities and not in a qualitatively different structure.

Referring to (14), there are essentially two quantities that contribute to the update of the design variable ρ_i at each element: the sensitivity ∂C/∂ρ_i and the Lagrange multiplier Λ of the volume constraint. It is therefore these two quantities, both of which are functions of the displacements calculated from the finite element problem, that must be considered in order to understand the error in the density values that results from the error in the displacements. Section 6.2 discusses numerical experiments run to show the effect of using different PCG tolerance values on the Lagrange multiplier, the sensitivity, and the overall qualitative structure.
4. Parallel solution of the topology optimization problem

This section details parallel algorithms designed to solve topology optimization problems on distributed memory computers. Because the repeated solution of the finite element problem accounts for most of the computational effort in solving the topology optimization problem, it is the most important part of the solution algorithm to parallelize. Thus, the chief focus here will be on the parallelization of the preconditioned conjugate gradient (PCG) method, which, as described earlier, is the primary iterative solver in this work. The parallelization of the optimization problem itself (the design variable update and filtering scheme) is really just a consequence of the domain decomposition that is performed to solve the FE problem and does not account for any measurable speedup of the solution. It is, however, a necessary step and is discussed at the end of this section.
4.1. Parallel finite element analysis: domain decomposition

For the parallel solution of elliptic partial differential equations such as (5), domain decomposition techniques are quite common [16]. The basic idea of such techniques is to decompose the reference domain Ω into subdomains and let each processor be responsible for the computations and storage of data on its subdomain. The domain is decomposed such that the data is distributed more or less evenly over the participating processors and such that most of the necessary computations can be handled in parallel. This is to minimize the high latency costs associated with parallel communication.

There are various forms of domain decomposition techniques that are used for partitioning finite element meshes. They include simple geometric partitioners, graph partitioners, octrees, and space filling curves [17]. The latter is what is used here. Space filling curves (SFCs) map an n-dimensional domain into a one-dimensional domain by constructing a curve that passes through the centroid of each element. There are various approaches to doing this, but the basic rules are that each element may be visited only once and that elements visited in succession must be neighboring elements [18]. The algorithm used for generating the space filling curves for this research uses Hilbert curves [19,20]. A sample Hilbert SFC is shown for three different meshes in Fig. 4.

Fig. 4. Hilbert space filling curves for meshes of various refinements.

After applying the SFC algorithm, the global mesh is broken up into P subdomains, where P is the number of participating processors. This is done by segmenting the space filling curve into P parts; the elements that belong to a given curve segment become the elements that make up a subdomain. Note that this leads to a non-overlapping decomposition of the domain. Once this domain decomposition is applied, the discretized equilibrium problem (19) can be rewritten in its decomposed form as

(Σ_{j=1}^{P} K_j) u = Σ_{j=1}^{P} f_j,    (24)

where K_j is the subdomain contribution to the global stiffness matrix and f_j is the subdomain contribution to the global load vector.
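A sketch of the element-to-curve mapping follows, using the well-known bit-manipulation form of the Hilbert index on a 2^k × 2^k grid of element cells. The paper does not list its SFC routine, so this is an assumed stand-in that produces the same kind of ordering.

```python
def hilbert_index(n, x, y):
    # Position of cell (x, y) along the Hilbert curve on an n-by-n grid
    # (n a power of two). Standard xy2d bit-twiddling construction.
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def partition_elements(cells, n, nproc):
    # Sort element cells along the curve, then cut into nproc contiguous
    # segments -> a non-overlapping decomposition, as described above.
    ranked = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
    chunk = -(-len(ranked) // nproc)     # ceiling division
    return [ranked[k*chunk:(k+1)*chunk] for k in range(nproc)]
```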
4.2. Parallel PCG

In Section 3, the serial preconditioned conjugate gradient algorithm was presented. This algorithm is now modified, after applying the domain decomposition technique described in the previous section, so that it works in parallel. Following [21], the method employed takes the approach of storing the subdomain stiffness matrices only on the individual processors, but accumulating subdomain vectors into global ones before any vector operations are done. This method is based on the assumption that global vector operations are less expensive than local operations followed by a gather at each CG iteration. In this way, scaled vector operations and dot products are carried out globally, while matrix-vector products are carried out locally. The algorithm is as follows:

Algorithm 2 (Parallel Preconditioned Conjugate Gradient Algorithm)

  for j = 1, ..., P
    Assemble processor stiffness matrix K_j and right-hand side f_j
    Accumulate global right-hand side f = Σ_{j=1}^{P} f_j
  end for
  i = 0
  r_0 = f − K u_0
  repeat
    Solve M z_i = r_i
    i = i + 1
    if i = 1
      p_1 = z_0
    else
      β_i = (r_{i−1}^T z_{i−1}) / (r_{i−2}^T z_{i−2})
      p_i = z_{i−1} + β_i p_{i−1}
    end
    d_i = K p_i = Σ_{j=1}^{P} K_j p_{j,i}
    α_i = (r_{i−1}^T z_{i−1}) / (p_i^T d_i)
    u_i = u_{i−1} + α_i p_i
    r_i = r_{i−1} − α_i d_i
  until convergence
  u = u_i

In this algorithm, the two summations over all processors represent the points where interprocess communication must occur. Notice that only one of these is actually inside the iteration loop. What is not shown is the interprocess communication that may need to happen in the step where the preconditioner is applied (i.e., M z_i = r_i). This is because the amount of communication is different for each preconditioner. The diagonal preconditioner needs no interprocess communication at this point, because the entire diagonal matrix is stored on each processor. The ILU preconditioner requires much communication, because the global stiffness matrix must be factored in parallel. For this purpose, we use the built-in functions in the software package SPOOLES; this package is described in Section 5.

4.3. Substructuring approach

One technique that is common for solving systems that employ domain decomposition is the substructuring or condensation approach. Consider a mesh in which the degrees of freedom are classified into two categories: interface and bubble (or internal). Interface degrees of freedom (DOFs) are those on a subdomain boundary and are shared with at least one other processor. Bubble DOFs are those that are not shared
with any other processors (see Fig. 5). Eq. (25) shows how the equilibrium equation can be represented with this classification, where the subscript I denotes interface DOFs and the subscript B denotes bubble DOFs:

[ K_II  K_IB ] [ u_I ]   [ f_I ]
[ K_BI  K_BB ] [ u_B ] = [ f_B ].    (25)

Fig. 5. Interface and bubble DOFs.

The substructuring approach involves transforming the subdomain system into a problem on the interface such that all the bubble DOFs are eliminated. This interface problem is then solved, usually employing some iterative technique such as the PCG. Once this problem has been solved, the internal DOFs are recovered. A key factor in this approach is that the interface and bubble DOFs are solved in two distinct steps. With some simple matrix algebra, we can write (25) in its condensed form as

[ K̃_II  0    ] [ u_I ]   [ f̃_I ]
[ K_BI  K_BB ] [ u_B ] = [ f_B ],    (26)

where

K̃_II = K_II − K_IB K_BB^{-1} K_BI,
f̃_I = f_I − K_IB K_BB^{-1} f_B,    (27)
f̃_B = f_B − K_BI u_I.

Thus,

K̃_II u_I = f̃_I    (28)

becomes the interface problem, and

K_BB u_B = f̃_B    (29)

becomes the problem that must be solved to recover the bubble DOFs. Note that the reduction to the interface problem and the recovery of the bubble DOFs are computed independently on each subdomain with no communication between processors. Only the solution of the interface problem requires such communication.

Note that the term K_BB^{-1} that appears in (27) is never explicitly computed. Instead, we factorize K_BB and compute the terms K_BB^{-1} f_B and K_BB^{-1} K_BI as backward solves with one or more right-hand sides. These operations are performed using the sparse linear algebra package SPOOLES (see Section 5).
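A dense NumPy sketch of the per-subdomain condensation (27) and the bubble recovery (29) is given below. The index arrays I and B and the helper names are illustrative assumptions; a production code would use a sparse factorization of K_BB, as the paper does with SPOOLES.

```python
import numpy as np

def condense(K, f, I, B):
    # K, f: subdomain stiffness matrix and load vector (dense for clarity);
    # I, B: integer index arrays of interface and bubble DOFs.
    KII, KIB = K[np.ix_(I, I)], K[np.ix_(I, B)]
    KBI, KBB = K[np.ix_(B, I)], K[np.ix_(B, B)]
    # One factorization of K_BB serves both K_BB^-1 K_BI and K_BB^-1 f_B.
    X = np.linalg.solve(KBB, np.column_stack([KBI, f[B]]))
    S = KII - KIB @ X[:, :-1]        # K~_II = K_II - K_IB K_BB^-1 K_BI
    g = f[I] - KIB @ X[:, -1]        # f~_I  = f_I  - K_IB K_BB^-1 f_B
    return S, g

def recover_bubble(K, f, I, B, uI):
    # Back-substitution (29): K_BB u_B = f_B - K_BI u_I.
    KBI, KBB = K[np.ix_(B, I)], K[np.ix_(B, B)]
    return np.linalg.solve(KBB, f[B] - KBI @ uI)
```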
4.3.1. Parallel design variable update

As mentioned previously, the effort required to update the design variables at each optimization iteration is small compared to the corresponding finite element solve. However, this update must still be parallelized because of the domain decomposition performed. The parallelization of the unfiltered problem is actually quite simple, because the only global parameter needed in the heuristic update scheme (16) is the Lagrange multiplier Λ used to calculate B_i in (14). All that is needed to calculate Λ in the bisectioning algorithm employed is the material area (volume in three dimensions) on each processor. Therefore, the only communication needed between processors is the summation of the scalar areas from each processor,

V = Σ_{i=1}^{p} ρ_i v_i.    (30)
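With mpi4py as a stand-in for the C MPI calls, this communication reduces to a single all-reduce per bisection trial; the function and variable names below are illustrative.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def global_volume(rho_local, v_local):
    # rho_local, v_local: densities and element areas/volumes owned by
    # this rank. The Allreduce is the only communication needed by the
    # bisection loop for the volume-constraint multiplier.
    local = np.array([(rho_local * v_local).sum()])
    total = np.empty(1)
    comm.Allreduce(local, total, op=MPI.SUM)
    return total[0]
```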
4.4. Parallel filtering

Parallelizing the optimization update becomes more complex when filtering is applied. Recall from (17) that the filtering procedure works by taking a weighted average of the design sensitivities over the element itself and its neighbors. Assume for the moment that the elements are uniform in size (i.e., a regular mesh) and that the filter size is only large enough to include the target element's 8 immediate neighbors (see Fig. 6). If the element resides on a subdomain boundary, it will need the latest sensitivities and densities from neighboring elements that are on another processor's subdomain. However, the space filling curve technique used here results in a non-overlapping domain decomposition. Therefore, if the filter is applied, a secondary step must be performed after the SFC technique is used to break up the reference domain into subdomains. For each subdomain, all the elements on all surrounding subdomains that are its immediate neighbors are identified. These surrounding elements are then added to the subdomain's table of elements as "ghost" elements. These ghost elements do not participate in the solution of the equilibrium equations; they only serve as placeholders to share the sensitivity and density information across processors when the filter is applied. This is demonstrated in Fig. 6: the elements of subdomain 1 that are on the border with subdomain 2 (shaded in gray) are added to subdomain 2's element table (and vice versa).

Fig. 6. Neighbor elements for filtering.

Once the subdomains have been defined with the appropriate ghost elements, the filter equation (17) itself must be parallelized. The following procedure is followed on each processor (i.e., subdomain) to accomplish this (a sketch of the communication in Steps 3 and 4 is given after the list):

(1) Determine the list of elements that make up the subdomain border, not including the ghost elements. These are the elements whose information must be shared with neighboring processes.
(2) Loop through the ghost elements and generate a list of all the neighboring processors.
(3) Send all the sensitivity and density information of each border element (determined in Step 1) to each neighboring processor (determined in Step 2).
(4) Receive the latest sensitivity and density values of the ghost elements from the neighboring processors.
(5) Apply Eq. (17) to all elements owned by the processor, but not to the ghost elements.
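A minimal mpi4py sketch of Steps 3 and 4 follows. The dictionary layout of the border data is an assumption for illustration, not the paper's data structure.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def exchange_ghosts(border_data, neighbor_ranks):
    # border_data: dict mapping neighbor rank -> list of (sensitivity,
    # density) pairs for the border elements adjacent to that rank.
    # Returns the ghost data received from each neighbor.
    ghosts = {}
    for nbr in neighbor_ranks:
        # sendrecv pairs the two transfers, so opposite ranks exchanging
        # with each other cannot deadlock.
        ghosts[nbr] = comm.sendrecv(border_data[nbr], dest=nbr, source=nbr)
    return ghosts
```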
4.5. Parallel mesh-independence

A filter size that is only large enough to include the element's immediate neighbors, as described in the previous section, has been shown to be sufficient for eliminating checkerboarding. However, a filter of this size is not in general sufficient to mitigate mesh-dependency. To do this, Sigmund [22] suggests increasing the filter size to include a much larger set of elements. This approach, however, can become problematic in a parallel environment because of the amount of ghost element information that must be shared, and because the preprocessor must know the size of the filter a priori so it knows how many ghost elements are needed. To get around this, we use a modified approach where the filter size is left unchanged but the filter is applied multiple times in a single OC iteration. That is, Steps 3, 4, and 5 above are applied repeatedly until mesh independence is achieved. Using this approach did eliminate mesh-dependence for the cases studied. The computational costs of this approach are discussed in Section 6.

Remark 4.1. Though the discussion in this section is mostly focused on parallelizing optimality criteria-based topology optimization, extensions to other formulations are possible. For instance, if a direct method that treats both displacements and densities as unknowns is used to solve (11) (e.g., [23,24]), then the techniques proposed here may still be used to handle each Newton step, with appropriate modifications to the linear solvers discussed earlier. The precise nature of the modifications will depend on the characteristics (symmetry, definiteness, etc.) of the resulting tangent stiffness matrix.
5. Software and hardware support

In this section, we highlight the computing resources utilized in our research and describe how they relate to a parallel computing environment.

5.1. Software

5.1.1. PIFED and Top-Opt code

The primary software library used in this research is the Parallel Infrastructure for Finite Element Discretizations (PIFED) [25]. PIFED is an object-oriented finite element software library written in C++ and is primarily two-dimensional in its current form. It is designed to run on distributed memory architecture computers and uses the MPI standard for the message passing demanded by such an architecture.
The library essentially provides a set of FEA tools that can be linked into a customized application. It provides a set of distributed data structures and several linear solvers which work with these data structures. Some of these data structure formats are implemented specifically for use in PIFED, but it also has access to the data structures of third-party software packages such as SPOOLES (described later in this section). Additionally, PIFED has a stand-alone preprocessor application that is responsible for partitioning the mesh using the space filling curve domain decomposition techniques described in Section 4. Finally, PIFED provides a framework by which customized applications can easily plug into its functionality (see Fig. 7).

Fig. 7. Topology optimization application architecture.

The customized application used on top of the PIFED library for this work is called Top-Opt. Top-Opt contains all the knowledge for the optimality criteria update and filtering schemes. In fact, PIFED has no direct knowledge of topology optimization at all. Because of this, Top-Opt also makes direct use of the MPI standard for the interprocess message passing that is needed for the OC update and filtering.

5.1.2. MPICH

MPICH is a freely available, portable implementation of the MPI standard [26]. It is the MPI implementation used by PIFED and Top-Opt. The version of MPICH used in this work is based on version 1.1 of the MPI standard. While there are several implementations of the MPI standard in existence, MPICH attempts to distinguish itself through its design goals of portability and high performance. MPICH has exhibited positive results in benchmark performance testing against vendor implementations of MPI and other message-passing systems [26]. This combination of portability and performance makes it an ideal implementation for our research activity.

5.1.3. SPOOLES

SPOOLES (SParse Object Oriented Linear Equations Solver) is a library for solving sparse real and complex linear systems of equations [27]. The SPOOLES library is written in the C language and heavily uses object-oriented design concepts. SPOOLES contains a wide variety of data structures and solvers. The SPOOLES functionality used in this work includes:

• Classes for working with both sparse and dense matrices. These classes provide not only containers for the matrix data, but also a set of tools for matrix manipulation (multiplication, addition, partitioning, etc.).
• Parallel direct solver. Used here with a drop tolerance to provide the ILU preconditioner for the PCG solver. The SPOOLES direct solver abstracts most of the complex techniques that are needed to reduce fill-in during the LU factorization stage by employing matrix ordering techniques like multiple minimum degree, nested dissection and multi-section. It also abstracts the interprocess message-passing that accompanies such techniques when applied on a partitioned matrix.
5.2. Hardware

The bulk of the data used in this research was generated on a Beowulf cluster. Beowulf systems are computer clusters that exploit mass-market manufacturing of consumer-grade electronic components [28]. These clusters are made of (sometimes many) PCs, usually with cheap hard disks and low-cost DIMMs (dual inline memory modules) for main memory. The Beowulf cluster used for the research presented in this paper consists of 32 dual-processor nodes running the Linux operating system. The processors on each node are 1.80 GHz Intel Xeon chips with 512 kilobyte L2 caches. Each node has 1 gigabyte of random access memory (RAM).
6. Results

This section presents several numerical experiments that are carried out in support of the earlier discussion on parallel topology optimization. First, in Section 6.1, we study the effectiveness of the modified filtering scheme in eliminating mesh-dependent solutions. Next, in Section 6.2, we examine the effect of loosening the CG tolerance in the finite element solution on the resulting topology. Section 6.3 focuses on the ill-conditioning of the global system of equations, a common problem in topology optimization. Finally, in Section 6.4, we compare the various forms of CG solvers used in this research by considering the parallel efficiency of the solvers, the filtering scheme and the design update procedure.

6.1. Experiment 1: Results with modified mesh-independency filter

To study the effectiveness of the modified parallel mesh-independency filter discussed in Section 4, we now solve a simple topology optimization problem for various mesh refinements with various numbers of filter passes. The domain and boundary conditions for the problem chosen are shown in Fig. 8.
Fig. 8. Reference domain for topology optimization problem.
The reference domain is a unit square and the left edge of the domain is fixed. A load is applied on the middle few elements of the right edge of the domain. The parameters of the topology optimization problem follow Sigmund's example [13]: the penalty parameter p from (11) is set to 3, the damping coefficient η from Eq. (16) is set to 0.5, and the desired volume fraction V̄ is set to 0.5. Regular, 4-noded quadrilateral elements are used to discretize the reference domain. The problem is solved for mesh refinements of 30 × 30 elements (961 dofs), 60 × 60 elements (3721 dofs), and 120 × 120 elements (14641 dofs).

For each refinement, the problem is solved with no filter, then with a single filter pass, and then the number of filter passes is incremented until the resultant topology matches that of the single-pass filter, 30 × 30 element solution. Figs. 9-11 show the results of these solves. For the 60 × 60 mesh, 6 filter passes are needed to get a topology that qualitatively matches that of the single-pass filter solution of the 30 × 30 mesh. For the 120 × 120 mesh, 18 passes are needed to achieve this.

Figs. 9-11 also show the maximum change in relative density ρ over the entire domain from the previous OC iteration. This is the criterion used to determine when the topology optimization problem has converged.
Fig. 9. 30 × 30 elements. Converged topology (left) and maximum change in density at each OC iteration (right) for different numbers of filter passes. (a) No filter and (b) 1 filter pass.
Fig. 10. 60 × 60 elements. Converged topology (left) and maximum change in density at each OC iteration (right) for different numbers of filter passes. (a) No filter, (b) 1 filter pass and (c) 6 filter passes.
Fig. 11. 120 × 120 elements. Converged topology (left) and maximum change in density at each OC iteration (right) for different numbers of filter passes. (a) No filter, (b) 1 filter pass and (c) 18 filter passes.
Table 1
OC iterations for various meshes

Mesh size     Filter passes   OC iterations
30 × 30       0               22
              1               27
60 × 60       0               25
              1               2102
              3               37
              6               31
120 × 120     0               38
              1               1909
              6               158
              12              39
              18              36
It is notable that the convergence characteristics of the various refinements vary drastically when they do not converge to the same topology. However, when they do converge to the same topology (that is, when the filter is applied enough times to eliminate mesh-dependency), the convergence characteristics are very similar. Table 1 shows this as well. When the number of filter passes is too small to prevent mesh-dependency, the number of OC iterations to convergence can be quite high, but when the filter is applied an adequate number of times, all three cases converge in approximately the same number of OC iterations.

6.2. Accuracy of FE solutions

In this section, the relationship between the accuracy of the solution to the finite element problem, computed using a diagonal PCG solver, and the resulting topology is explored. These numerical experiments are in support of the discussion in Section 3.3.1. The domain chosen to investigate this is identical to that in the previous section. Also, the same topology optimization parameters (penalty p, damping η, and volume fraction) are used. The optimization problem is solved using a 60 × 60 mesh of bilinear elements for both the unfiltered and filtered cases. For both cases, the problem is solved with the CG tolerance set to various values ranging from 1.0 × 10^-8 to 0.1.

Tables 2-5 show the results of these runs. The first column of each table is the PCG tolerance and the second column is the number of OC iterations. The third and fourth columns show the L2 and L∞ norms of the difference in element design sensitivities between the solution at this PCG tolerance and the solution at the most accurate PCG tolerance (1 × 10^-8). The fifth column is the difference between the Lagrange multiplier of the volume constraint at this PCG tolerance and the most accurate solution. The sixth and seventh columns are the L2 and L∞ norms of the difference in the element relative densities between this PCG tolerance and the most accurate solution. All of these numbers help form a quantitative picture of how the resulting topology changes as the PCG tolerance is loosened. Also note that for each solution, a table is presented for the first OC iteration as well as for the converged OC iteration, so that the cumulative effect of the error can be seen.

Table 2
First OC iteration of the unfiltered 60 × 60 element topology optimization problem for various CG tolerances

PCG tolerance   OC iterations   ‖Δ(∂C/∂ρ_i)‖_2   ‖Δ(∂C/∂ρ_i)‖_∞   ΔΛ          ‖Δρ‖_2      ‖Δρ‖_∞
1.0e−08         1               –                –                –           –           –
1.0e−07         1               4.01e−08         3.00e−09         2.00e−11    2.95e−06    7.90e−08
1.0e−06         1               3.23e−07         3.10e−08         1.00e−11    2.43e−05    7.46e−07
1.0e−05         1               2.72e−06         1.75e−07         1.47e−09    2.19e−04    8.26e−06
1.0e−04         1               3.18e−05         2.21e−06         1.63e−08    2.73e−03    1.03e−04
1.0e−03         1               3.22e−04         3.09e−05         2.97e−07    2.50e−02    7.09e−04
1.0e−02         1               5.55e−03         3.27e−04         8.28e−06    3.70e−01    1.36e−02
1.0e−01         1               8.55e−02         9.34e−03         3.91e−06    4.56e+00    1.00e−01
Table 3
Converged solution of the unfiltered 60 × 60 element topology optimization problem for various CG tolerances

PCG tolerance   OC iterations   ‖Δ(∂C/∂ρ_i)‖_2   ‖Δ(∂C/∂ρ_i)‖_∞   ΔΛ          ‖Δρ‖_2      ‖Δρ‖_∞
1.0e−08         25              –                –                –           –           –
1.0e−07         25              6.09e−09         6.10e−10         3.00e−13    1.70e−08    6.00e−09
1.0e−06         25              1.20e−07         1.01e−08         1.01e−10    2.55e−08    9.00e−09
1.0e−05         25              2.62e−06         2.34e−07         3.24e−09    1.78e−06    6.28e−07
1.0e−04         25              4.97e−05         6.58e−06         2.93e−07    5.83e−06    2.06e−06
1.0e−03         25              3.52e−04         4.64e−05         3.79e−06    9.44e−05    3.34e−05
1.0e−02         43              8.60e−03         1.65e−03         2.78e−05    7.93e+00    9.90e−01
1.0e−01         43              6.49e−02         4.77e−03         7.47e−05    4.32e+01    9.90e−01
Table 4
First OC iteration of the filtered 60 × 60 element topology optimization problem for various CG tolerances

PCG tolerance   OC iterations   ‖Δ(∂C/∂ρ_i)‖_2   ‖Δ(∂C/∂ρ_i)‖_∞   ΔΛ          ‖Δρ‖_2      ‖Δρ‖_∞
1.0e−08         1               –                –                –           –           –
1.0e−07         1               3.21e−08         1.20e−09         1.00e−11    2.45e−06    5.80e−08
1.0e−06         1               2.36e−07         1.17e−08         1.00e−11    1.89e−05    4.33e−07
1.0e−05         1               2.15e−06         1.28e−07         1.36e−09    1.65e−04    4.34e−06
1.0e−04         1               2.55e−05         1.28e−06         1.59e−08    2.16e−03    5.23e−05
1.0e−03         1               2.79e−04         1.43e−05         3.10e−07    2.14e−02    5.29e−04
1.0e−02         1               4.97e−03         2.01e−04         7.68e−06    3.30e−01    9.16e−03
1.0e−01         1               7.97e−02         5.27e−03         1.65e−06    4.25e+00    9.53e−02
Table 5
Converged solution of the filtered 60 × 60 element topology optimization problem for various CG tolerances

PCG tolerance   OC iterations   ‖Δ(∂C/∂ρ_i)‖_2   ‖Δ(∂C/∂ρ_i)‖_∞   ΔΛ          ‖Δρ‖_2      ‖Δρ‖_∞
1.0e−08         31              –                –                –           –           –
1.0e−07         31              3.28e−09         2.70e−10         1.00e−12    1.02e−06    5.30e−08
1.0e−06         31              1.33e−07         5.23e−09         4.40e−11    2.34e−05    1.58e−06
1.0e−05         31              2.42e−06         9.68e−08         6.09e−09    9.83e−04    3.55e−05
1.0e−04         31              4.17e−05         2.66e−06         1.67e−07    1.01e−02    4.29e−04
1.0e−03         33              2.65e−04         1.12e−05         1.12e−08    1.04e−01    3.82e−03
1.0e−02         372             9.76e−04         1.16e−04         7.17e−07    9.32e−01    3.77e−02
1.0e−01         –               –                –                –           –           –
Referring to (14), there are essentially two quantities that contribute to the update of the design variable ρ_i at each element: the sensitivity ∂C/∂ρ_i and the Lagrange multiplier Λ of the volume constraint. Therefore, for each of these solution cases, the resulting sensitivity, Lagrange multiplier, and density values are compared against the baseline (very accurate) solution at a tolerance of 1.0 × 10^-8.

These results show that, for the unfiltered case, as the CG tolerance is loosened, certain elements are prone to flip from black to white. That is, some elements change from being completely solid to no material at all, and vice versa. However, the overall qualitative
topology does not change drastically (at least up to the loosest tolerance attempted: 0.1). For the filtered case, it is also seen that there is very little change in the topology as the solutions are computed less accurately. In fact, there are actually no cases in which elements undergo the extreme switch from black to white or vice versa. Instead, it is observed that the topology optimization converges at a much slower rate with very loose tolerances until, at some point (a tolerance of 0.1 in these cases), the optimization problem no longer converges at all. This is represented by the dashes in the last row of Table 5.

Fig. 12. Solve times for the 60 × 60 element problem: (a) unfiltered case, and (b) filtered case.

Fig. 12 shows the resulting solve time as a function of the PCG tolerance for the unfiltered and filtered cases. For each case, the solve times for the first OC iteration and the final (converged) OC iteration are plotted. As expected, it is found that as the tolerance is loosened (increased), the time to solve the FE problem decreases significantly. Also, the figures show that the time it takes to solve the FE problem at the first OC iteration is always significantly more than the time to solve the FE problem at the last OC iteration. This can be attributed to the fact that the solution from the FE solve in the previous OC iteration is used as the starting value for the FE solve in the following OC iteration. Otherwise, the FE solution at the last OC iteration would generally take more time than the solution at the first OC iteration, because the condition of the system is much poorer at the converged state. Finally, it is found that the difference in FE solve times between the first and last OC iterations becomes significantly greater as the PCG tolerance is loosened. This indicates that the effect of using previous solutions as a starting point for the following FE solve has a greater impact as the PCG tolerance is loosened.

6.3. Ill-conditioning in topology optimization

Three parallel iterative solvers are used in this research to solve the finite element problem at each topology optimization iteration. They are: a conjugate gradient solver with diagonal preconditioning, a conjugate gradient solver with ILU preconditioning, and a hybrid substructuring solver which uses direct methods on interior degrees of freedom and then solves the interface problem using a diagonally preconditioned conjugate gradient solver.

A primary concern in evaluating these solvers is their ability to handle the poor conditioning of the stiffness matrix caused by the heterogeneity of the black and white topology optimization solution. This ill-conditioning can easily be shown by solving a simple topology optimization problem using a conjugate
Fig. 13. Number of CG iterations at each OC iteration for a system with no preconditioning.
6.3. Ill-conditioning in topology optimization

Three parallel iterative solvers are used in this research to solve the finite element problem at each topology optimization iteration: a conjugate gradient solver with diagonal preconditioning, a conjugate gradient solver with ILU preconditioning, and a hybrid substructuring solver that applies direct methods to the interior degrees of freedom and then solves the interface problem with a diagonally preconditioned conjugate gradient solver. A primary concern in evaluating these solvers is their ability to handle the poor conditioning of the stiffness matrix caused by the heterogeneity of the black-and-white topology optimization solution.

This ill-conditioning is easily demonstrated by solving a simple topology optimization problem using a conjugate gradient method with no preconditioning. This is done on the now familiar square domain with a mesh refinement of 30 × 30 elements. To obtain an unencumbered picture of how the CG algorithm converges, the starting guess for the displacements is always taken as zero instead of the solution from the previous optimization iteration. At the initial OC iteration, when the available material is distributed evenly across the reference domain, the number of CG iterations to convergence is a manageable 214. As the optimization problem begins to converge, however, the number of CG iterations climbs quickly to well over 35,000 (see Fig. 13). In theory, of course, the number of CG iterations cannot exceed the number of degrees of freedom; but this holds only if all computations are performed in exact arithmetic, which is not the case here.
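Using the pcg sketch above, the qualitative behaviour reported in Figs. 13 and 14 can be reproduced on any small model problem; here K and f are a hypothetical assembled stiffness matrix and load vector taken near convergence of the optimization.

```python
# Hypothetical usage: K and f come from an assembled FE problem with a
# nearly black-and-white density field.
u0, k0 = pcg(K, f, precondition=False)   # unpreconditioned: iteration
                                         # count explodes as rho converges
u1, k1 = pcg(K, f, precondition=True)    # Jacobi: iteration count stays
                                         # modest across OC iterations
print(k0, k1)
```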
Fig. 14. Number of CG iterations at each OC iteration for a diagonally preconditioned system.
Now the same problem is solved using a diagonal preconditioner with the conjugate gradient solver (Fig. 14). The CG solve at the first OC iteration needs 205 iterations to converge, and the number of CG iterations actually decreases over subsequent OC iterations. This shows that it is imperative to use some form of preconditioning if one wishes to use iterative solution schemes for topology optimization problems.

Solutions to some larger and more complex problems are now presented in order to compare the three solvers in their ability to mitigate the ill-conditioning of the stiffness matrix. Complexity is added to the problem in two ways. First, the filter is not applied: the unfiltered topology is generally more porous, with more occurrences of adjacent black and white elements, so the resulting stiffness matrices are more poorly conditioned than for a filtered solution of the same problem. Second, the aspect ratio of the reference domain is changed from 1:1 to 2:1 and the applied forces are moved to the bottom of the right edge, so that the problem is no longer symmetric (Fig. 15). The problem is solved with all three solvers for mesh refinements of 120 × 60, 240 × 120, and 480 × 240 elements. Note that the ILU PCG solver uses a drop-tolerance approach; the drop tolerance is chosen ad hoc to be 1.0 × 10⁻³.

Table 6 shows the results from these solutions. For each solver, the number of CG iterations to convergence is shown along with the time (in seconds) spent inside the solver. The last three columns correspond to the solve performed at the first OC iteration, the solve performed at the last OC iteration, and the solve that required the most CG iterations over all OC iterations. The results show an overall poor performance of the diagonal PCG solver for all problem sizes compared with the other two solvers. This poor performance is attributed to the poor parallelization characteristics of the PCG solver combined with the inadequate preconditioning provided by the diagonal preconditioner.
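For prototyping the drop-tolerance ILU preconditioner in a high-level environment, a serial sketch using SciPy (not the solver library used in this work) might look like the following; K is again a hypothetical assembled sparse stiffness matrix.

```python
import scipy.sparse.linalg as spla

# Incomplete LU factorization of K; entries encountered during the
# factorization that fall below the drop tolerance are discarded,
# trading preconditioner quality for sparsity.
ilu = spla.spilu(K.tocsc(), drop_tol=1e-3)   # drop tolerance as in the text

# Expose the approximate inverse action as an operator that a Krylov
# solver can apply once per iteration.
M = spla.LinearOperator(K.shape, matvec=ilu.solve)
```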
Fig. 15. Reference domain for topology optimization in Section 6.3.
Table 6
Solver comparison: number of CG iterations and times (in seconds) for various solvers and meshes; entries are CG iterations (time)

Mesh        Solver            First OC iteration   Last OC iteration   Maximum over OC iterations
120 × 60    Diagonal PCG      923 (21.1)           2093 (47.2)         2093 (47.2)
            ILU PCG           16 (3.45)            19 (3.66)           25 (4.55)
            Substructuring    91 (4.68)            88 (4.63)           91 (4.68)
240 × 120   Diagonal PCG      1829 (182)           3590 (356)          3596 (356)
            ILU PCG           33 (26.1)            43 (32.0)           54 (38.8)
            Substructuring    128 (38.1)           125 (38.0)          128 (38.1)
480 × 240   Diagonal PCG      3635 (1550)          7205 (2945)         7283 (2986)
            ILU PCG           –                    –                   –
            Substructuring    181 (320)            178 (318)           181 (320)
The ILU PCG solver converges somewhat more quickly than the substructuring solver for the 120 × 60 and 240 × 120 cases, but fails on the 480 × 240 case because of an attempted division by a zero pivot. This is not unusual with the ILU approach, where the incomplete factorization may break down even when the full factorization is guaranteed to exist.

6.4. Parallel efficiency

6.4.1. Parallel efficiency of solvers

Turning to the parallel efficiency of the linear solvers, it can quickly be shown that the diagonal and ILU PCG solvers are quite inefficient. The topology optimization problem described above is discretized using a 120 × 60 mesh and solved with 2 through 32 processors. Only the first OC iteration is solved, since the parallel efficiency of the solvers is independent of the condition of the global stiffness matrix. Table 7 demonstrates that parallelizing the diagonal and ILU PCG solvers actually increases the overall solve time instead of decreasing it. This poor parallel performance persists for larger problems as well.

Similar tests are run on the substructuring solver using the 120 × 60, 240 × 120 and 480 × 240 meshes, again solving only the first OC iteration. Table 8 shows that, for each mesh, increasing the number of processors decreases the overall solve time until 32 processors are reached, at which point the solve time begins to increase. For the smaller meshes, this increase at 32 processors is rather large and can be attributed to the processors being starved for data: the computational effort on each processor is light compared to the interprocessor communication that must occur at each CG iteration. Fig. 16 shows that as the problem size increases, the computational effort on each processor grows and the data starvation becomes much less noticeable.

There is an additional factor that makes the substructuring solver less efficient as the number of processors increases. There are essentially two computation-intensive pieces to the substructuring solver, enumerated after Fig. 16 below.

Table 7
Parallel performance of the diagonal and ILU PCG solvers; entries are solve times in seconds

Number of processors   Diagonal PCG   ILU PCG
 2                     21.1            3.38
 4                     39.1            5.57
 8                     70.3            8.17
16                     95.8           16.1
32                     122            24.1
Table 8
Parallel performance of the substructuring solver; entries are solve times in seconds

Number of processors   120 × 60 mesh   240 × 120 mesh   480 × 240 mesh
 2                      4.47            36.0            320
 4                      1.86            13.5            120
 8                      1.77            10.6             83.1
16                      1.38             6.21            34.3
32                      4.48             9.25            37.0
Fig. 16. Time to solve the topology optimization problem using the substructuring solver for various meshes: (a) 120 × 60 mesh, (b) 240 × 120 mesh, and (c) 480 × 240 mesh.
(1) The reduction (condensation) of the bubble (interior) degrees of freedom into an interface problem using direct methods.
(2) The solution of the resulting interface problem.

For a given discretization, adding processors decreases the size of the bubble reduction on each processor, which makes the first step faster. However, it also increases the size of the interface problem, which is the parallelized piece of the solver. This growth in the interface problem is shown, in degrees of freedom, in Table 9 for the 480 × 240 element mesh, along with the corresponding increase in the number of CG iterations needed to solve the interface problem. At some point, the increased cost of the parallel interface problem overshadows the savings from the smaller bubble reductions.

Table 9
Substructuring solver performance for the 480 × 240 mesh

Number of processors   Interface DOFs   CG iterations
 2                       962              181
 4                      1442              213
 8                      2882              303
16                      4314              352
32                      7178              452
6.4.2. Parallel efficiency of the filter

Here, the parallel efficiency of the filtering algorithm is measured. The topology optimization problem on the 480 × 240 mesh is solved with both a single-pass checkerboard filter and a multi-pass mesh-independence filter (Table 10). The multi-pass filter is identical to the single-pass filter except that it is applied multiple times; it is nevertheless worth measuring both, because of the significant amount of synchronization that must occur across the processors during multiple filter passes. The third and fifth columns of the table show the percentage of the total OC iteration time (including the finite element solve) spent applying the filter.

Table 10
Parallel performance of the filtering step for the 480 × 240 mesh: the 1-pass filter is for checkerboard control, whereas the 18-pass filter is for mesh independence

                        1-pass filter               18-pass filter
Number of processors    Time (s)   % overall time   Time (s)   % overall time
 1                       9.13      13               164.8      73
 2                       8.79       3.2             177.8      36
 4                       3.01       2.4              52.9      31
 8                       1.25       1.5              24.3      23
16                       0.930      2.6              16.0      31
32                       0.454      1.2               9.04     20
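The multi-pass idea can be illustrated with a simple neighbour-averaging stand-in on a structured grid (the actual filter weights used here differ): each pass reads only a one-element-deep layer of ghost values, so k passes reach an effective radius of k elements while the interprocessor exchange per pass stays a single ghost layer.

```python
import numpy as np

def multipass_filter(x, passes=1):
    """Repeatedly average each element of the 2D field x with its four
    edge neighbours. In the parallel setting, one ghost layer is
    exchanged and re-synchronized between passes; here boundaries are
    simply replicated (np.pad with mode='edge')."""
    for _ in range(passes):
        p = np.pad(x, 1, mode='edge')
        x = (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
             + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0
    return x
```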
6.4.3. Parallel efficiency of the design variable update

For completeness, the parallel design variable update is mentioned here. The only parallel part of this algorithm is the summation of the subdomain areas, which is needed to compute the Lagrange multiplier (a global parameter). This involves the exchange of just a single double-precision value at each iteration of the bisectioning algorithm, which has been found to be computationally insignificant and is thus of no further interest.
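A sketch of this single-scalar reduction, written with mpi4py for illustration; the function and variable names are assumptions.

```python
from mpi4py import MPI

def global_volume_fraction(rho_local, vol_local, total_volume):
    """Each processor sums the material volume of its own subdomain;
    one allreduce of a single double then yields the global quantity
    needed by the bisection on the Lagrange multiplier."""
    local = float((rho_local * vol_local).sum())
    return MPI.COMM_WORLD.allreduce(local, op=MPI.SUM) / total_volume
```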
7. Conclusions

Although topology optimization problems are computationally intensive, very little has been published on the development and use of parallel computing methods to solve them. This paper develops an understanding of the factors affecting the parallel solution of topology optimization problems. All aspects of the problem are parallelized, including the finite element solution, the design variable update, and the filtering mechanism that eliminates checkerboarding and mesh-dependent solutions.

Our numerical experiments show that the multi-pass filter approach eliminates the mesh dependence of topology optimization solutions in a manner similar to a single-pass filter with a larger filter size, while significantly limiting the amount of ghost-element information that must be shared across processors in a parallel computing environment. However, the large number of filter passes that may be needed can carry a significant performance cost and thus requires careful consideration. The experiments also demonstrate good parallel efficiency for this filter approach.

The results also show that, in some cases, the tolerance of the PCG solver used to solve the finite element problem at each OC iteration can be loosened significantly without qualitatively affecting the resulting topology, leading to a considerable reduction in the solution time of the FE problem. While loosening the PCG tolerance is a convenient way to reduce the overall solution time, the analyst must realize that the error in the state variables that results from solving the finite element problem loosely accumulates through the OC iterations.

Three different preconditioned conjugate gradient solvers are used in this research. The diagonal PCG solver does a fair job of preconditioning the global stiffness matrix, but only for small problems. The ILU PCG solver mitigates the ill-conditioning much better; the increase in the number of PCG iterations from the first to the last OC iteration is quite small. Both the diagonal and ILU PCG solvers, however, exhibit very poor parallel efficiency. The substructuring solver mitigates ill-conditioning in a manner comparable to the ILU PCG solver while being far more efficient in parallel than either PCG solver.
References

[1] H. Thomas, M. Zhou, U. Schramm, Issues of commercial optimization software development, Struct. Multidiscip. Optim. 23 (2002) 97–110.
[2] M.P. Bendsøe, Optimal shape design as a material distribution problem, Struct. Optim. 1 (1989) 193–202.
[3] M. Zhou, G.I.N. Rozvany, The COC algorithm. Part II: topological, geometry and generalized shape optimization, Comput. Methods Appl. Mech. Engrg. 89 (1991) 197–224.
[4] M.P. Bendsøe, N. Kikuchi, Generating optimal topologies in structural design using a homogenization method, Comput. Methods Appl. Mech. Engrg. 71 (2) (1988) 197–224.
[5] M.P. Bendsøe, Optimization of Structural Topology, Shape, and Material, Springer, New York, Berlin, Heidelberg, 1995.
[6] B. Hassani, E. Hinton, A review of homogenization and topology optimization. I. Homogenization theory for media with periodic structure. II. Analytical and numerical solution of homogenization equations. III. Topology optimization using optimality criteria, Comput. Struct. 69 (1998) 707–717, 719–738, 739–756.
[7] T. Borrvall, J. Petersson, Large-scale topology optimization in 3D using parallel computing, Comput. Methods Appl. Mech. Engrg. 190 (2001) 6201–6229.
[8] R.B. Haber, M.P. Bendsøe, C.S. Jog, Perimeter constrained topology optimization of continuum structures, in: IUTAM Symposium on Optimization of Mechanical Systems, Stuttgart, 1995, Kluwer Academic Publishers, Dordrecht, 1996, pp. 113–120.
[9] R. Haber, C. Jog, M. Bendsøe, A new approach to variable-topology shape design using a constraint on perimeter, Struct. Optim. 11 (1996) 1–12.
[10] J. Petersson, O. Sigmund, Slope constrained topology optimization, Int. J. Numer. Methods Engrg. 41 (1998) 1417–1434.
[11] O. Sigmund, J. Petersson, Numerical instabilities in topology optimization: a survey on procedures dealing with checkerboards, mesh-dependencies and local minima, Struct. Optim. 16 (1998) 68–75.
[12] K. Svanberg, The method of moving asymptotes—a new method for structural optimization, Int. J. Numer. Methods Engrg. 24 (1987) 359–373.
[13] O. Sigmund, A 99 line topology optimization code written in MATLAB, Struct. Multidiscip. Optim. 21 (2001) 120–127.
[14] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, MD, 1996.
[15] R. Barrett, M. Berry, T.F. Chan, et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA, 1994.
[16] P. Le Tallec, Domain Decomposition Methods in Computational Mechanics, North-Holland, 1994.
[17] A. Laszloffy, AFEAPI: Adaptive Finite Elements Application Programmers Interface, M.S. thesis, State University of New York at Buffalo, August 1999.
[18] H. Sagan, Space-Filling Curves, Springer-Verlag, New York, 1994.
[19] A.K. Patra, J.T. Oden, Problem decomposition strategies for adaptive hp finite element methods, Comput. Syst. Engrg. 6 (2) (1995) 97–109.
[20] H.C. Edwards, A Parallel Infrastructure for Scalable Adaptive Finite Element Methods and its Application to Least Squares C∞ Collocation, Ph.D. thesis, The University of Texas at Austin, 1997.
[21] G. Carey, E. Barragy, R. Mclay, M. Sharma, Element-by-element vector and parallel computations, Commun. Appl. Numer. Methods 4 (1988) 299–307.
[22] O. Sigmund, On the design of compliant mechanisms using topology optimization, Mech. Struct. Mach. 25 (1997) 495–526.
[23] A. Ben-Tal, M. Kočvara, A. Nemirovski, J. Zowe, Free material design via semidefinite programming: the multiload case with contact conditions, SIAM Rev. 42 (4) (2000) 695–715.
[24] B. Maar, V. Schulz, Interior point multigrid methods for topology optimization, Struct. Multidiscip. Optim. 19 (2000) 214–224.
[25] K. Vemaganti, PIFED: Parallel Infrastructure for Finite Element Discretizations, http://www.min.uc.edu/~kumar/PIFED/, April 28, 2003.
[26] W. Gropp, E. Lusk, N. Doss, A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Comput. 22 (6) (1996) 789–828.
[27] C. Ashcraft, D. Pierce, D.K. Wah, J. Wu, The reference manual for SPOOLES release 2.2: an object-oriented software library for solving sparse linear systems of equations, Boeing Shared Services Group, available online at www.netlib.org/linalg/spooles/spooles.2.2.html, 1999.
[28] T. Sterling, Beowulf Cluster Computing with Linux, MIT Press, Cambridge, MA, 2003.