Localized tensor-based solvers for interactive finite element applications using C++ and Java


Computers and Structures 81 (2003) 423–437 www.elsevier.com/locate/compstruc

G.R. Miller *,1, P. Arduino, J. Jang, C. Choi

Department of Civil and Environmental Engineering, University of Washington, Box 352700, Seattle, WA 98195-2700, USA

Received 18 June 2002; accepted 21 November 2002

Abstract

This paper presents techniques for solving systems of equations arising in finite element applications using a localized, tensor-based approach. The approach is localized in that a major part of the solution responsibility is delegated to vector degree-of-freedom objects, rather than residing solely in a global solver working on a monolithic data structure. The approach is tensor-based in that the fundamental quantities used for computation are considered to be second-order tensors. The localized data structure strategy provides the benefits of an efficient sparse and symmetric storage scheme without requiring complex implementation code. The tensor-based aspect of the approach can bring substantial performance benefits by increasing the granularity at which solution algorithms deal with their data. Java and C++ implementations of interactive finite element programs are used to demonstrate performance that is competitive with other available solvers, especially in the case of problems for which interactive analysis is feasible on commonly available hardware. © 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Finite elements; Object-oriented programming; Interactive modeling; Numerical analysis; Visualization

* Corresponding author. Tel.: +1-206-543-0350; fax: +1-206-543-1543. E-mail addresses: [email protected] (G.R. Miller), [email protected] (P. Arduino), [email protected] (J. Jang), [email protected] (C. Choi).
1 J. Ray Bowen Professor of Civil and Environmental Engineering.

1. Introduction

The general idea of building finite element programs with explicit ties to the underlying concepts and abstractions has been around for many years, and various object-oriented finite element programs have been and are being developed based on this model [1-10]. There are many advantages to building complex programs around high-level abstractions, but there is also a general perception that one cannot carry such abstractions too deeply into numerical computations without paying a significant performance penalty. In this paper we present an approach that not only allows, but actually exploits, the use of high-level abstraction taken all the way down to the numerical solver code to obtain performance on par with traditional solvers. Among other things, our results show that with this approach, even Java-based implementations can be made competitive with contemporary Fortran and C++ solvers. Although Java is not typically considered a suitable language for numerical analysis, we include it here because there are many situations where using Java would be desirable, given the relative ease with which one can write useful cross-platform code with nontrivial user interfaces.

Numerically solving the systems of equations associated with finite element analysis is typically undertaken from the perspective of application-independent linear algebra, i.e., performing computations involving matrices populated with scalars without reference to the

0045-7949/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0045-7949(03)00014-2


meaning or context of the quantities involved. The approach presented in this paper is based on a perspective that maintains closer ties to the physical context from which the finite element model is derived. In particular, we exploit the fact that for finite element applications involving structural mechanics, stiffnesses are tensor quantities and unknowns are vector quantities, and finite element models themselves can be viewed as systems of interconnected degrees of freedom. From this perspective, implementing a solution process involves "training" degrees of freedom to work together to carry out the calculations associated with a particular algorithm, with the calculations themselves expressed in terms of fundamental, coordinate-free tensor operations. These ideas have been presented in different threads for different purposes in [1,4]; the point of the present paper is to pull these threads together in the context of equation solving and to characterize performance.

The broader context of this work is the development of highly interactive analysis environments that support live modeling. We define live modeling as a type of graphical interface in which one can add or modify loads, boundary conditions, imposed displacements, members, and member properties directly via controls or mouse gestures, and see the results of these actions immediately in an animated, quasi-real-time fashion. In effect, the system being modeled appears "live" in the sense that it responds immediately and realistically to input from its environment. The performance demands necessary for achieving live modeling, with multiple solution and image generations per second, are clearly high, but the kind of high-performance computing required is not necessarily that of handling extremely large problems on specialized hardware at batch-mode speeds. Rather, it is necessary to handle a large number of small to medium sized problems each second, keeping the displayed state of the system in sync with essentially unpredictable user actions, and to do so on typical desktop hardware. For this reason, our results and discussions here will focus on relatively small to medium problem sizes, going up to about 40,000 DOFs (this is beyond the current range of interactive analysis, but is on the horizon). This having been said, there is no reason that strategies and algorithms appropriate for larger problems cannot be implemented in the framework of a tensor-based, DOF-localized approach: for example, we ourselves have implemented a multigrid solver in Java that we have used successfully on problems with hundreds of thousands of degrees of freedom [11]. However, this lies outside the scope of this paper: here our focus will be on direct solvers and modest sized problems.

Our presentation begins with a brief illustration of how computing stiffnesses in terms of coordinate-free tensor quantities can be done. We then consider tensor-based computation, including C++ and Java performance results in the fundamental context of matrix-matrix multiplication. Next, we describe the implementation of degree-of-freedom oriented equation solution, with a particular emphasis on Gaussian decomposition. Finally, we provide results from reference implementations of tensor-based direct finite element solvers, and compare performance to other systems in various contexts.

2. Tensor-based stiffnesses

In an earlier paper, coordinate-free, tensor-based element formulations were presented for general parametric solid mechanics elements under general conditions [12]. Here we present a briefer derivation of the stiffness tensors for the case of linear kinematics and linear isotropic material behavior. We also provide tensor-based expressions for the stiffnesses characterizing a basic linear beam element. These two examples illustrate the nature of coordinate-free formulations, and also provide useful expressions for implementation.

2.1. Parametric elements

We begin with the assumption that there is some parameterized description of location, i.e., we can locate a point P via a relation of the form

P = \phi(\xi^1, \xi^2, \ldots)    (1)

in which \phi is a mapping function expressed in terms of the parametric coordinates \xi^a. This induces a basis at each point P of the form

g_a = \partial\phi / \partial\xi^a    (2)

The dual basis associated with this basis is defined such that

g^a \cdot g_b = \delta^a_b    (3)

The displacement gradient can be expressed in terms of parametric derivatives of the displacement field u(P) and the dual basis as

\nabla u = (\partial u / \partial\xi^a) \otimes g^a    (4)

For the case of isoparametric elements, we have both

P = N^q(\xi^1, \xi^2, \ldots) P_q    (5)

and

u = N^q(\xi^1, \xi^2, \ldots) u_q    (6)

in which the N^q are a suitable set of shape functions and the P_q and u_q are nodal locations and displacements, respectively. This enables us to express the induced basis as


g_a = (\partial N^q / \partial\xi^a) P_q    (7)

Similarly, the displacement gradient becomes

\nabla u = u_q \otimes (\partial N^q / \partial\xi^a) g^a = u_q \otimes B^q    (8)

in which

B^q = (\partial N^q / \partial\xi^a) g^a

Fig. 1. A simple constant-strain tetrahedron element (nodes P_1-P_4, parametric coordinates \xi^1, \xi^2, \xi^3, map \phi(\xi^1, \xi^2, \xi^3)).

We can determine the element stiffness by considering the element contribution to the strain energy variation for an arbitrary displacement as follows:

\delta U = \int_V \sigma : \nabla\delta u \, dV    (9)

  = \int_V [\mu(\nabla u + \nabla u^T) + \lambda \, \mathrm{tr}(\nabla u) \mathbf{1}] : \nabla\delta u \, dV    (10)

  = \int_V [\mu(u_j \otimes B^j + B^j \otimes u_j) + \lambda(u_j \cdot B^j)\mathbf{1}] : (\delta u_i \otimes B^i) \, dV    (11)

  = \int_V \{\mu[(B^i \cdot u_j)(B^j \cdot \delta u_i) + (u_j \cdot \delta u_i)(B^j \cdot B^i)] + \lambda(u_j \cdot B^j)(\delta u_i \cdot B^i)\} \, dV    (12)

  = \delta u_i \cdot \int_V \{\mu[B^j \otimes B^i + (B^i \cdot B^j)\mathbf{1}]    (13)

      + \lambda B^i \otimes B^j\} \, dV \cdot u_j    (14)

  = \delta u_i \cdot K_{ij} \cdot u_j    (15)

in which we have assumed an isotropic material with Lamé constants \mu and \lambda. The element stiffness tensors K_{ij} are thus given by:

K_{ij} = \int_V \{\mu[B^j \otimes B^i + (B^i \cdot B^j)\mathbf{1}] + \lambda B^i \otimes B^j\} \, dV    (16)

For cases in which the natural dimension of the element does not match the dimension of the embedding space (e.g., line and plane elements in 3-D), the identity tensor in the above expression should be calculated as \mathbf{1} = g^a \otimes g_a. It can be seen from Eq. (16) that when i = j the tensors are symmetric, whereas the off-diagonal stiffness tensors are related via K_{ji} = (K_{ij})^T.

As a simple example, consider the case of a constant-strain tetrahedron (see Fig. 1). The parametric map in this case can be expressed as

P = \phi(\xi^1, \xi^2, \xi^3) = P_1 + \xi^1 (P_2 - P_1) + \xi^2 (P_3 - P_1) + \xi^3 (P_4 - P_1)    (17)

The corresponding induced basis vectors are

g_1 = \partial\phi / \partial\xi^1 = P_2 - P_1    (18)

g_2 = \partial\phi / \partial\xi^2 = P_3 - P_1    (19)

g_3 = \partial\phi / \partial\xi^3 = P_4 - P_1    (20)

The dual basis can be calculated as follows:

g^1 = (g_2 \times g_3) / [(g_2 \times g_3) \cdot g_1],   g^2 = (g_3 \times g_1) / [(g_3 \times g_1) \cdot g_2],   g^3 = (g_1 \times g_2) / [(g_1 \times g_2) \cdot g_3]    (21)

Using an isoparametric approach, the displacements are expressed as

u(P) = u_1 + \xi^1 (u_2 - u_1) + \xi^2 (u_3 - u_1) + \xi^3 (u_4 - u_1)    (22)

from which we can identify the following shape function expressions:

N^1 = 1 - \xi^1 - \xi^2 - \xi^3    (23)

N^2 = \xi^1    (24)

N^3 = \xi^2    (25)

N^4 = \xi^3    (26)

The gradient vectors can now be determined as

B^1 = -g^1 - g^2 - g^3    (27)

B^2 = g^1    (28)

B^3 = g^2    (29)

B^4 = g^3    (30)

The stiffness tensors can then be written explicitly as

K_{11} = V[(\mu + \lambda)(B^1 \otimes B^1) + \mu(B^1 \cdot B^1)\mathbf{1}]    (31)


Fig. 2. Linear beam element configuration (nodes P_i, P_j; member axes n, s, t; end displacements u_i, u_j; rotations \theta_i, \theta_j; forces F_i, F_j; moments M_i, M_j).

K_{22} = V[(\mu + \lambda)(B^2 \otimes B^2) + \mu(B^2 \cdot B^2)\mathbf{1}]    (32)

K_{33} = V[(\mu + \lambda)(B^3 \otimes B^3) + \mu(B^3 \cdot B^3)\mathbf{1}]    (33)

K_{44} = V[(\mu + \lambda)(B^4 \otimes B^4) + \mu(B^4 \cdot B^4)\mathbf{1}]    (34)

K_{12} = V[\mu(B^2 \otimes B^1) + \lambda(B^1 \otimes B^2) + \mu(B^1 \cdot B^2)\mathbf{1}]    (35)

K_{13} = V[\mu(B^3 \otimes B^1) + \lambda(B^1 \otimes B^3) + \mu(B^1 \cdot B^3)\mathbf{1}]    (36)

K_{14} = V[\mu(B^4 \otimes B^1) + \lambda(B^1 \otimes B^4) + \mu(B^1 \cdot B^4)\mathbf{1}]    (37)

K_{23} = V[\mu(B^3 \otimes B^2) + \lambda(B^2 \otimes B^3) + \mu(B^2 \cdot B^3)\mathbf{1}]    (38)

K_{24} = V[\mu(B^4 \otimes B^2) + \lambda(B^2 \otimes B^4) + \mu(B^2 \cdot B^4)\mathbf{1}]    (39)

K_{34} = V[\mu(B^4 \otimes B^3) + \lambda(B^3 \otimes B^4) + \mu(B^3 \cdot B^4)\mathbf{1}]    (40)

in which V = (g_1 \times g_2) \cdot g_3. The primary thing to note about these stiffness tensors is that they are defined and can be computed without reference to an explicit underlying global coordinate system. By providing a set of computational classes that encapsulate vector and tensor operations, the expressions given above can be computed efficiently using the same level of abstraction as used in the formulation.

2.2. Beam elements

Space does not allow providing full derivational details here, but for the simple linear beam element shown in Fig. 2 the stiffness can be characterized in terms of the following four tensors:

K_{fu} = 12E\hat{I}/L^3 + (AE/L)(n \otimes n)    (41)

K_{mu} = K_{f\theta}^T = 6E(n \times \hat{I})/L^2    (42)

K_{m\theta} = 4E\hat{I}/L + (JG/L)(n \otimes n)    (43)

\hat{K}_{m\theta} = 2E\hat{I}/L - (JG/L)(n \otimes n)    (44)

in which E is the elastic modulus, G is the shear modulus, J is the torsional stiffness, L is the length of the element, A is the cross-sectional area, I is the moment of inertia tensor, and \hat{I} = n \times I \times n.² In terms of principal directions (s, t) and principal values (I_{ss}, I_{tt}), we have

\hat{I} = I_{ss} \, s \otimes s + I_{tt} \, t \otimes t    (45)

The relation between displacement and load quantities can be written as

{F_i; M_i; F_j; M_j} = [ K_{fu}, K_{f\theta}, -K_{fu}, K_{f\theta};
                        K_{mu}, K_{m\theta}, -K_{mu}, \hat{K}_{m\theta};
                       -K_{fu}, -K_{f\theta}, K_{fu}, -K_{f\theta};
                        K_{mu}, \hat{K}_{m\theta}, -K_{mu}, K_{m\theta} ] {u_i; \theta_i; u_j; \theta_j}    (46)

As in the case of the previous example, these stiffness expressions can be computed very efficiently using essentially direct translations of expressions (41)-(44), and there is no explicit reference to any arbitrary global coordinate system (the member's own n, s, t reference frame is not arbitrary, but rather is an inherent property of the member).

² The cross-product of a vector and a second-order tensor is defined using the basic results a \times (b \otimes c) = (a \times b) \otimes c and (b \otimes c) \times a = b \otimes (c \times a).

3. Tensor-based computations

In this section we consider the advantages associated with building computations around tensor quantities rather than scalars. In effect, computing with tensors


provides a built-in framework for loop unrolling and small-scale data blocking. It also puts one in a good position with regard to potentially piggy-backing on the technology gains focused on 3-D graphics, as tensor operations and granularities match those of common 3-D transformations quite closely (albeit not exactly, owing to the common use of 4-D data and transformations in 3-D graphics applications). Fundamental tensor/vector computations could be implemented using the computing infrastructure specialized for geometric transformations, thereby gaining increased performance on par with associated graphics sub-system improvements. It is difficult to predict when the typical engineer will have large-scale parallel computing available on his or her desktop, but small-scale parallelism is already the norm, and this trend is likely to continue (3-D gaming providing a much bigger market for hardware manufacturers than engineers).

Implementing basic tensor and vector types in languages such as Java or C++ is straightforward. One essentially encapsulates an internal representation (typically some kind of xyz system), and then provides an operational interface supporting the typical algebra required for manipulation, such as addition, scalar multiplication, inner products, and so on. The key is to keep virtually all explicit references to coordinates encapsulated within the vector and tensor classes, so that all outside users of these classes work in terms of coordinate-free expressions. In the case of element formulations of the kind presented in the previous section, the coordinate-free formulation then translates directly (and efficiently³) into code.

When working with aggregate data types, there are always issues associated with argument passing and the copying of intermediate results when operations are cascaded together (see [13] for a detailed discussion, and an alternative solution, of these issues in the case of C++).
Matrix libraries, for example, often require passing in a data structure to hold the result of an operation, in addition to the operands themselves. The implementation used to generate the results in this paper employs two strategies to deal with these issues. First, each vector and tensor class includes static (class) variables that hold circular lists, or rings, of the corresponding types to hold and pass back results. This avoids the calls to constructors and destructors that would otherwise be necessary, and in the case of C++ preserves the syntactic convenience of being able to write code of the form C = A + B. Caution is necessary to ensure that the size of the ring is sufficiently large to handle the most deeply nested calls to the functions in question, so this is not necessarily a suitable approach for a general-purpose library; but especially in the case of Java, performance improves significantly with this strategy.

³ Typical matrix expressions, such as those used to denote mapping of stiffnesses from a local to a global reference frame, are usually computationally expensive if implemented at the same level of abstraction as the formulation expression.

The second strategy for handling issues of intermediate values and copying is both simpler and less general. For particular operations that are called frequently and that involve several sub-operations, it is generally worthwhile simply to wrap the operation in a single function. For example, in the case of multiplication of matrices containing tensors, the principal operation is to compute in place the combined sum and product of tensors S = S + A \cdot B. It makes sense in this case to create a function that encapsulates this entire operation, and to optimize it internally to the degree possible. Again, this is not necessarily a suitable approach for a standalone library, but it works well for internally integrated classes, where there is typically need for very few such "super functions" for most applications, and these functions are called in very localized places.

In [14-16], Rucki showed how using tensors in both C++ and Fortran 90 generally improved performance of basic level-1 and level-2 basic linear algebra subroutine operations on various platforms and using various compilers. Fig. 3 shows more recent results comparing the use of tensors versus scalars for dense matrix multiplication on a 1 GHz Pentium III using both Java and C++. The code was virtually identical for each case, and used a standard implementation strategy. For example, the snippet below shows the core C++ code used to generate the scalar and tensor C++ results in Fig.
3:

for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++) {
        Bk[k] = B[k][j];
    }
    for (int i = 0; i < N; i++) {
#ifdef SCALAR_FIELD              // -- use scalars
        Scalar* Ai = A[i];
        Scalar s = 0;
#elif TENSOR_FIELD               // -- use tensors
        GTensor* Ai = A[i];
        GTensor s;
#endif
        for (int k = 0; k < N; k++) {
#ifdef SCALAR_FIELD
            s = s + Ai[k] * Bk[k];
#else
            DotPlusEq(Ai[k], Bk[k], s);
#endif
        }
        C[i][j] = s;
    }
}

[Fig. 3 plots MFLOPS (0-600) against matrix dimension N (0-1200) for four implementations: Java scalar-based, Java tensor-based, C++ scalar-based, and C++ tensor-based.]

Fig. 3. Java and C++ N x N dense matrix multiplication performance for tensor and scalar data types.

The results in Fig. 3 show clearly how the use of tensors can have a significant impact on performance, in this case allowing tensor-based Java code to outperform scalar-based C++ code. It is important to note that speed-ups are not always this dramatic, nor do horse-race examples of this kind necessarily predict performance in more complex contexts, but one can see the potential benefits of tensor-based computations.

4. DOF-based computation

Given an infrastructure for computation that includes vectors and tensors as well as scalars, it is not difficult to build computational classes to support vector degrees of freedom. Following the approach described in [1,4], a vector degree of freedom can be encapsulated in a DOF class, which at the simplest level has the following attributes and capabilities: a generalized displacement; a symmetric stiffness tensor; a set of Interactions; and a member function for inter-connection with other DOF objects. The Interaction class provides a means to explicitly link one DOF to another, and consists of a pointer to a DOF identifying whom the interaction is with, and a tensor containing the interaction itself. The system stiffness thus can be stored in a distributed fashion, with each DOF being responsible for storing and managing its own set of equation coefficients.

Sparse matrix techniques typically rely on treating the system of equations as a graph, and the DOF-based approach makes this explicit. In effect, what one might think of as an indexed matrix entry K_ij is stored in DOF_i's interaction set in an Interaction with a pointer to DOF_j and a tensor value equal to K_ij (see Fig. 4). Assembly consists of elements making calls of the form dof_i.Connect(dof_j, K_ij). This data structure is well suited to storing and managing sparse systems of equations, albeit at the expense of carrying additional pointers for each entry. By working with tensors rather than scalars, though, this additional storage overhead is relatively minor, requiring a single pointer for every nine scalars in the case of 3-D analysis. This overhead is reduced further in the case of block solvers, with the 4 x 4 tensor block solver requiring a single pointer for each 144 scalars. As will be seen in the performance results section, this amortization of the indexing cost over a set of scalar degrees of freedom also pays dividends during run time.

Fig. 4. DOF-based storage of stiffness tensors: DOF_i holds a list of Interactions (..., (DOF_p, K_ip), (DOF_j, K_ij), (DOF_q, K_iq), ...).


There are various possibilities for the data structure used for the interaction storage within the DOFs. Two implementation approaches were used for generating the results in this paper: a simple singly linked list, and an array-based scheme. In the array-based scheme, the implementation separates the DOF pointers and the stiffness values into distinct arrays, but the overall logic presumes linkages matching those in Fig. 4. In both cases symmetry is exploited by storing only quantities above the diagonal. The diagonal stiffness terms themselves are symmetric tensors, and these are stored with each DOF separately from the general interactions. To facilitate symmetric storage, DOFs are also given ID numbers matching their ordering, and the DOF implementation keeps its interaction lists sorted by ID. Both the linked list and array-based schemes are set up to do numerical and symbolic factorization at the same time, and so fill-in is computed on the fly. This has benefits for smaller problems, in particular, albeit with associated penalties for larger problems.

4.1. Factorization

In the DOF-based approach, factorization is based on the ability of individual DOF objects to factorize themselves. GaussElimDOF is a subclass of DOF that provides a member function named EliminateSelf, and factorization is accomplished by having each GaussElimDOF in the system perform this function. The function itself does each of the following:

(1) Computes the inverse of the DOF's own diagonal stiffness tensor.

(2) Iterates through its interaction list updating the remote diagonals of all its connected DOFs, i.e., for each interaction j in DOF_i's interaction list, do the following:

K_{ij} \to K_{ii}^{-1} \cdot K_{ij}    (47)

K_{jj} \to K_{jj} - K_{ij}^T \cdot K_{ii} \cdot K_{ij}    (48)

(3) Iterates through its interaction list a second time doing factorization and fill-in of remote interactions, i.e., for each interaction j in DOF_i's interaction list, do the following update for each interaction p in DOF_j's interaction list:

K_{jp} \to K_{jp} - K_{ij}^T \cdot K_{ip}    (49)

This is essentially just standard Gaussian elimination, with the caveat that the quantities are tensors rather than scalars. The indices referred to in the above expressions are not actually used explicitly in the iterations indicated; rather, the looping is list based, and so iteration only occurs over nonzero tensor entries (zero tensor entries are not stored). Fill-in can be handled relatively efficiently for the list-based implementation, since list insertion is a fast operation. The array-based implementation does more work to handle fill-in, but ends up providing other benefits that generally outweigh the cost.

In the event that one has a fairly dense banded structure for a given system of equations, an approach of this kind would not be expected to compete well against a more traditional banded solver were the approach implemented at the level of scalar degrees of freedom. However, when implemented at the level of tensors (or tensor blocks, as described below), the implementation ends up performing quite well even in cases for which sparse data structures would be expected to lag. This appears to be due to the fact that the tensor operations can be coded and encapsulated in an efficient manner, and one gains a modicum of loop unrolling and data localization as an inherent part of using the tensor data structure. These are well-known benefits associated with blocking data [17]: the point here is that the blocking can be considered "natural" to the degree the tensors in question are the natural objects of computation.

Although the major part of the factorization occurs within the DOF objects, there is still a need for an external Solver class to manage the DOFs and to drive the solution process. This includes forward-backward substitution as well as factorization, and in the case of Gaussian elimination we have developed a GaussElimSolver class to handle these operations, again in collaboration with the GaussElimDOF objects.

4.2. DOF blocks

Given the benefits derived from the inherent small-scale blocking of data and procedures that comes with using tensor-based data, it is likely these benefits can be extended by explicitly blocking the data further.
This indeed has been our experience, as will be illustrated in the results presented later. It is straightforward to write code for relatively small blocks without using iteration in the critical inner loop operations, and without requiring particularly fancy mechanisms for identifying appropriate blocks. With the basic unit of blocking being a vector DOF, one can use relatively small, hand-coded blocks and end up with scalar blocks sized well for common CPU caches. The trick is to form the blocks in a manner that fits well with the overall analysis framework. Our approach has been to define an additional set of classes to support blocking while maintaining consistent programming interfaces to the modeling objects (i.e., elements, nodes, and so on). The additional classes come in pairs: a BlockDOF class which extends the DOF class; and a DOFBlock which


extends the GaussElimDOF. The BlockDOF is essentially a DOF that knows how to exist within a DOFBlock, and a DOFBlock is essentially a block of DOFs that knows how to carry out block-based factorization. BlockDOFs appear to client elements, nodes, and so on just like normal DOFs, and therefore allow such operations as assembly and results extraction to be handled using the same code as for standard DOFs. These operations are actually carried out by passing information to and from the associated DOFBlock. The DOFBlocks appear to the GaussElimSolver as simply a GaussElimDOF, which can carry out factorization and related operations in the usual fashion. Again, these operations are actually carried out differently internally, using two sets of displacement and right-hand side quantities, and using BlockInteractions rather than simple Interactions. BlockInteractions are similar to simple Interactions, but instead of a single tensor interaction slot, there are four, corresponding to the 2 x 2 nature of the interaction between DOFBlocks. These block solver classes turned out to enhance performance significantly, and so the process was repeated once more with the creation of 4 x 4 block solver classes. A BlockDOF4 class was used to fill the same role as the BlockDOF class described above, a DOFBlock4 class was developed to serve the function of the DOFBlock class, and so on. Results are presented below for this solver, as well. In principle, this blocking could be expanded further, either hierarchically or by using larger blocks. To be effective, though, the blocks need to be relatively densely filled and to fit reasonably well with a CPU's high-speed memory caches. As the block size grows, the likelihood of this diminishes, and in fact the results presented in the next section show diminishing returns in moving from 2 x 2 blocks to 4 x 4 blocks for the problems considered.
Nevertheless, larger blocks could be effective in some circumstances, and the approach used here could be extended to work in these cases.

5. Performance results

In this section we present performance results for Java and C++ implementations of solvers based on the DOF-based, tensor approach in the context of basic finite element problems, with comparisons to various other C++ and Fortran implementations. Given the pace at which hardware and software technology continues to change, any such performance testing can provide only a fleeting snapshot of relative performance trends. Thus, the point here is not to provide the final word concerning the superiority of any particular implementation, but rather to illustrate the relative effectiveness of the DOF-based, tensor approach in a general sense. The results show how the approach presented

here works well both for cases in which the system ordering leads to a highly banded system of equations, and for cases in which the ordering leads to a sparse set of equations.

The present results were generated using two basic problem configurations: one two-dimensional and one three-dimensional. The 2-D configuration is a simple square mesh of rectangular elements, while the 3-D configuration uses brick elements to form the generalization of rectangular meshes in 3-D. In both cases, the number of degrees of freedom was controlled by simply increasing the size of the corresponding meshes, and base constraints were used as the support conditions. In the 2-D case the mesh was increased uniformly in both directions, while in the 3-D case two expansion variations were used: (1) the mesh size was increased in all three directions simultaneously; and (2) the mesh size was held fixed in two dimensions and then increased in the third. For an N x N 2-D mesh, the number of scalar degrees of freedom can be calculated as 2N(N + 1), while for the 3-D mesh the number of degrees of freedom is given by 3N(M + 1)^2, in which M is the cross-section mesh size and N is the mesh length in the third dimension. In the 2-D case and the uniformly expanded 3-D case, the natural problem bandwidth increases as the problem size increases, while in the single-dimension 3-D expansion, the bandwidth remains relatively constant.

The performance tests presented here were run on a 1 GHz Pentium III CPU with 512 MB of RAM. We have run portions of these tests on a variety of hardware, including both older/slower and newer/faster Pentium CPUs and PowerPC G3 and G4 CPUs. In the case of the Java tests, the results presented here were run using Sun's JDK 1.3.1 HotSpot Client, although again a number of other runtime environments have been tested, as well.
We have also used a variety of compilers (primarily Metrowerks CodeWarrior and Microsoft Visual C++) on various platforms (Windows 98, 2000, and XP; Mac OS 9 and X) with the results in virtually all cases telling a consistent story. The comparison code for these tests is OpenSEES [18,19], an open source finite element analysis program providing access to a number of solvers. OpenSEES uses an object-oriented, C++ core combined with support for external solvers written in different languages. It uses a Tcl-based [20] command language to define models and configure analyses. The particular solvers available within the OpenSEES framework used for the comparisons here include the following: BandSPD: A Fortran-based solver built around LAPACK [21] routines, optimized for use with systems of equations with a relatively constant bandwidth. ProfileSPD: A C++ solver optimized for use with banded systems with varying bandwidth.

Fig. 5. Performance results for square 2-D mesh presented in Table 1. [Four log–log panels of solution time (seconds) versus number of DOFs: the OpenSEES solvers (BandSPD, ProfileSPD, SparseSPD); the Java solvers (tensor-based, 2 × 2 block, and 4 × 4 block, all LL); the C++ solvers (tensor-based LL/Array, 2 × 2 block LL/Array, 4 × 4 block LL); and a comparison of OpenSEES ProfileSPD with the Java and C++ 4 × 4 block (LL) solvers.]

SuperLU: A widely available solver optimized for use with sparse systems of equations [22]. OpenSEES uses a C-based implementation of the solver.

SparseSPD: A generalized C-based sparse solver for symmetric, positive-definite systems.

The source code for all these solvers is available as part of the OpenSEES package, and the latter two can also be obtained independently from the referenced web pages. Comparison performance results were also generated using UMFPack [23], a high-performance solver written in Fortran that uses a number of strategies to obtain high performance for general large-scale systems. It was difficult to obtain consistent solutions for the problems we considered, so we do not present timing results; for those cases where consistent results could be obtained, however, the performance was in a similar range to the


other OpenSEES solvers. It should be noted that OpenSEES is still in an active development phase, so once again we emphasize that the results presented here are intended to be indicative rather than definitive. Identical models were set up and run using OpenSEES and the solvers outlined above, and also using the Java and C++ implementations of the DOF and DOFBlock tensor-based solvers described earlier. For the most part, the sizes of the problems considered are such that the set-up time is relatively insignificant compared to the decomposition time, but the run-times reported here include the full stiffness set-up, factorization, and solution cycle (parsing of model descriptions, etc., is not included in the timings). In all cases multiple runs were used to ensure consistent timing, which was especially important for the Java tests, for which relatively wide variations are common due to Java's automatic memory management. The meshes were set up in such a way that the default orderings led to systems of equations that were naturally banded, with an essentially constant bandwidth. Running the meshes through a Reverse Cuthill-McKee [24] re-ordering provides little benefit in these cases, so the solvers were run using the default ordering. For the BandSPD and ProfileSPD solvers, this provides an essentially ideal match between the solver and the corresponding structure of the equations. In the case of SuperLU, SparseSPD, and the tensor-based solvers presented here, there is no inherent reliance on a banded equation structure, and one could expect efficient operation also in the case of sparse orderings, such as one would get from a minimum degree ordering algorithm [25]. Such alternative orderings were tested, with some improvement in performance for larger bandwidths.

For those cases in which blocks were used, the blocking was set up in a very simple manner: the default ordering was re-grouped directly into consecutive 2's or 4's, as appropriate. Thus, the resulting blocks were of the form {1,2}, {3,4}, {5,6}, ... for the 2-DOF blocks, and {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, ... for the 4-DOF blocks. In the case of the 2-DOF blocks in particular, this simple approach is likely to produce reasonably dense blocks, since the use of linked vector degrees of freedom fits the underlying problem characteristics closely. Given the breadth of results to be presented, it is worth taking a general overview before moving to detailed descriptions. Fig. 5 and Table 1 present timing results for the two-dimensional mesh for varying problem sizes. The four sub-plots in Fig. 5 depict the range of performance for the OpenSEES solvers, the Java solvers, and the C++ solvers, followed by a comparison of the best performers from each group. Fig. 6 and Table 2 summarize similar results for the 3-D case with the uniformly expanded mesh, while Fig. 7 and Table 3 contain results for the 3-D case with mesh expansion in one direction only. Fig. 8 shows the variation in behavior for this latter class of problems with increasing base dimensions, which translates directly into increasing equation bandwidth. SuperLU was not included in the 3-D results because it exceeded the machine's memory capacity relatively early on, due to its not using symmetric storage. The various implementations of the approach presented in this paper are identified in the tables and plots as follows: Java tensor-based and C++ tensor-based refer to Java and C++ implementations of straight tensor-based solvers with no additional blocking; Java 2 × 2/4 × 4 block and C++ 2 × 2/4 × 4 block correspond similarly to 2 × 2 and 4 × 4 block solvers; and the parenthetical notation (LL) and (Array) refers to linked

Table 1. Performance results for a square 2-D mesh.

Solver                      Problem size (number of scalar equations)
                            5202     13,122   20,402   29,282   39,762
BandSPD                     0.49     2.03     4.35     7.95     13.7
ProfileSPD                  0.34     1.39     2.85     5.30     9.05
SparseSPD                   0.48     1.91     3.85     7.05     11.8
SuperLU                     0.90     4.06     9.40     19.2     49.4
Java tensor-based (LL)      0.93     3.86     8.66     18.3     41.7
Java 2 × 2 block (LL)       0.60     3.02     6.99     13.5     25.3
Java 4 × 4 block (LL)       0.71     2.58     5.29     10.6     17.2
C++ tensor-based (LL)       0.29     1.41     4.74     15.9     40.1
C++ tensor-based (Array)    0.26     1.14     2.79     6.27     15.2
C++ 2 × 2 block (LL)        0.19     0.87     1.95     3.96     8.05
C++ 2 × 2 block (Array)     0.18     0.78     1.69     3.51     7.33
C++ 4 × 4 block (LL)        0.20     0.90     1.86     3.71     7.09

Solution times in seconds. Bandwidth varies with increasing problem size.

Fig. 6. Performance results for brick 3-D mesh presented in Table 2. [Four log–log panels of solution time (seconds) versus number of DOFs for the OpenSEES, Java, and C++ solvers, plus a comparison of OpenSEES BandSPD with the Java and C++ 4 × 4 block (LL) solvers.]

list and array-based implementations of the DOF interaction lists, respectively. Turning now to the results themselves, we note first that overall the Java and C++ block solvers performed well in general, with the Java 4 × 4 block solver even being among the fastest in some cases. The efficacy of basing computations of this nature on blocks is well documented, and the tensor data structures used here apparently provide a useful way of obtaining good blocking behavior. For Java in particular, the packaging of the fundamental inner-loop computations in a form requiring no sub-block indexing or looping works well, especially in the case of the 3-D problems. As can be seen by comparing the Java and C++ results for matching solvers in Tables 1–3, the performance differential between the two languages is much less in the 3-D


Table 2. Performance results for 3-D mesh.

Solver                      Problem size (number of scalar equations)
                            5184     10,125   14,739   20,577   24,000
BandSPD                     5.34     19.8     50.4     91.3     142.5
ProfileSPD                  6.72     29.2     66.3     138.9    196.5
SparseSPD                   7.28     29.9     66.9     139.7    198.7
Java tensor-based (LL)      9.16     49.1     120.3    267.2    385.5
Java 2 × 2 block (LL)       4.45     22.7     55.6     122.8    177.7
Java 4 × 4 block (LL)       3.04     14.6     33.6     72.9     105.9
C++ tensor-based (LL)       7.16     39.5     102.8    230.2    330.0
C++ tensor-based (Array)    4.87     29.8     72.3     160.2    235.4
C++ 2 × 2 block (LL)        2.92     17.4     42.0     96.8     136.5
C++ 2 × 2 block (Array)     2.64     15.5     38.3     85.0     120.9
C++ 4 × 4 block (LL)        2.24     11.3     27.0     60.0     85.4

Solution times in seconds. The same number of elements is used in the x, y, and z directions, so that bandwidth varies with increasing problem size.

case. In the 2-D case, there is less size difference between scalars and vectors, and so in general the effect of using tensors alone is not as marked. For C++, there generally was little performance difference overall between the 2 × 2 and 4 × 4 block cases, whereas for Java the further increase in block size continued to be more noticeably beneficial. An exception to this can be seen in the uniform-expansion 3-D results in Table 2, where the C++ 4 × 4 block solver performs significantly better than the 2 × 2 block solver for the larger problem sizes. As a point of reference for all these results, the C++ 2 × 2 block array solver achieved a maximum rate for the 2-D problems of about 49% of the theoretical floating-point rate of the processor. As mentioned in the introduction, the tensor-based solvers have been developed with a particular interest in live modeling contexts, and for the smaller problem sizes the data in the tables show that they perform well in these cases. Data for even smaller problems are not included in the tables, but in the case of problems with just a few thousand degrees of freedom, the localized, tensor-based approach presented here exhibits particularly good performance. OpenSEES is based on a strategy of relying on external solvers to service the C++ modeling code, and the overhead associated with this design becomes more apparent for smaller problems, in which set-up costs become more significant relative to solution costs. The integrated approach used here does not carry this same overhead. While all the various implementations of the localized, tensor-based approach performed well for smaller problems, in many cases the linked-list implementations did not scale particularly well. There are various explanations for this, one of the most significant being the relative overhead of using linked lists versus arrays [26,27]. Comparing the linked-list and array-based tensor solvers in Table 1, this difference can be seen quite clearly. The effect is exacerbated for larger problem sizes by an ultimately inefficient dynamic memory allocation scheme associated with using the default implementation of new for allocating the list data.⁴ For the 2 × 2 block solvers in the same table, however, the difference is much less pronounced, since the increased data granularity reduces the amount of container-management work that must be done. Similar behavior can be seen in Tables 2 and 3 as well. Another important factor is that the linked-list solvers are designed to work best with sparse data, and the overhead associated with relatively expensive indexing becomes problematic in the case of large, dense bands. Fig. 9 shows examples of improved performance when minimum degree ordering is used for both the plain tensor and 2 × 2 block linked-list solvers. The focus here is on relative performance, so the plots have been normalized with respect to the largest problem size. It should also be noted that the ordering time itself is not included in these results, primarily because we used a simple ordering implementation that would not be representative of typical ordering speeds. The point of including the results in Fig. 9 is simply to show that the schemes presented here can exploit the advantages of sparse ordering when these advantages are present. It should be noted, however, that sparse ordering does not always improve performance. With the crucial role played by high-speed

⁴ A pool-based allocation scheme would make more sense in this case, and we recently have been able to obtain linked-list performance on par with the array-based performance via this mechanism. The key is to provide each DOF with its own pool allocator.

Fig. 7. Performance results for brick 3-D mesh presented in Table 3. [Four log–log panels of solution time (seconds) versus number of DOFs for the OpenSEES, Java, and C++ solvers, plus a comparison of OpenSEES BandSPD with the Java and C++ 4 × 4 block (LL) solvers.]

caches in current architectures, the reduction in data locality and density associated with increased sparsity can easily cause overall performance to drop, even though substantially less computational work is being done. We observed this in our own testing: tracking the number of actual computations during factorization for cases with and without minimum degree ordering, we often saw a substantial lag in (or complete disappearance of) the performance cross-over point relative to the cross-over point in the number of computations. It was necessary to use different mesh aspect ratios for the regular and block cases to obtain the kind of results shown in Fig. 9. Thus, Fig. 9 shows best-case scenarios for minimum degree ordering, which are not necessarily representative across the board.


Table 3. Performance results for 3-D mesh.

Solver                      Problem size (number of scalar equations)
                            5082     10,890   20,328   30,129   40,293
BandSPD                     4.32     9.91     19.2     29.1     39.8
ProfileSPD                  4.91     11.5     22.4     34.2     46.7
SparseSPD                   5.46     12.9     24.6     37.4     50.0
Java tensor-based (LL)      5.69     13.9     27.1     40.9     54.9
Java 2 × 2 block (LL)       3.01     7.74     15.1     23.0     27.6
Java 4 × 4 block (LL)       2.32     5.81     10.6     17.1     20.0
C++ tensor-based (LL)       4.45     10.8     20.9     31.8     43.4
C++ tensor-based (Array)    2.99     7.09     13.7     20.6     27.8
C++ 2 × 2 block (LL)        1.83     4.39     8.60     13.2     18.1
C++ 2 × 2 block (Array)     1.78     4.26     8.21     12.3     16.6
C++ 4 × 4 block (LL)        1.66     3.91     7.68     11.7     16.1

Solution times in seconds. The number of elements in the x and y directions is fixed, and the number of elements in the z direction varies, so that a fixed bandwidth of 363 is used.

Finally, memory use is another important issue for solvers, since it can dramatically influence performance in cases where problem size causes spill-over into virtual memory. It should be noted that standard Java implementations have a default heap size of only 64 MB, so in order to run the larger problem sizes presented here it was necessary to override this default. Memory use was monitored for the tests presented above, and in general the various symmetric schemes were comparable, with the tensor-based approaches typically requiring a little less memory than the OpenSEES solvers. The fact that memory usage was similar across the board is consistent with what one would expect given the use of the natural ordering and conceptually similar data storage schemes.

The use of object-oriented techniques is often associated with unacceptable performance trade-offs, especially at the level of numerical computation. The outcome of the numerical tests presented here, however, shows that such performance trade-offs can be circumvented with suitable formulation and implementation strategies.

Fig. 8. Relationship between performance level and bandwidth. All the tested cases have the same number of DOFs (10,500). [Single panel of solution time (seconds) for bandwidths of 243, 270, 300, 330, 363, 396, 432, 468, and 507, comparing OpenSEES BandSPD and ProfileSPD with the C++ tensor-based (Array), 2 × 2 block (Array), and 4 × 4 block (LL) solvers.]

6. Conclusions

Tensors and vectors form the natural basis for much of the theoretical foundation of structural mechanics. This paper has shown that tensor and vector formulations can also be used productively and consistently in the context of numerical implementations. The benefits include those usually associated with object-oriented approaches: simplified coding due to the support for higher-level abstractions; more direct relationships between conceptual and implementation models; inherent modularity and extensibility; and so on. These benefits are particularly useful in the context of building systems with rich, interactive interfaces. The additional benefit demonstrated in this paper, however, is that one can also achieve competitive numerical performance. In many respects, this good performance can be obtained because of, rather than in spite of, the object-oriented approach.

We note in closing that the general possibilities and potential for tensor-based computation extend beyond the basic implementations of direct solvers presented here. In most cases, the strategies and techniques available for the analysis of general scalar-based systems of linear or nonlinear equations can be adapted to accommodate tensor and vector data types.

Fig. 9. Performance increase with minimum degree ordering. [Two panels, normalized solution time versus normalized number of DOFs with and without MDO, for the tensor-based solver (LL) and the 2 × 2 block solver (LL).]

References

[1] Miller GR. A LISP-based object-oriented approach to structural analysis. Eng Comput 1988;4:197–203.
[2] Baugh Jr JW, Rehak DR. Object-oriented design of finite element programs. In: Nelson JK, editor. Computer utilization in structural engineering. New York: American Society of Civil Engineers; 1989. p. 91–100.
[3] Forde BW, Foschi RO, Stiemer SF. Object-oriented finite element analysis. Comput Struct 1990;34:355–74.
[4] Miller GR. An object-oriented approach to structural analysis and design. Comput Struct 1991;40:75–82.
[5] Zimmermann T, Dubois-Pelerin Y, Bomme P. Object-oriented finite element programming: I. Governing principles. Comput Meth Appl Mech Eng 1992;98:291–303.
[6] Lucas D, Dressler B, Aubry D. Object-oriented finite element programming using the Ada language. In: Hirsch, et al., editors. Numerical methods in engineering '92. New York: Elsevier Science Publishers; 1992. p. 591–8.
[7] Ohtsubo H, Kawamura Y, Kubota A. Development of the object-oriented finite element system MODIFY. Eng Comput 1993;9:187–97.
[8] Archer GC, Fenves G, Thewalt C. A new object-oriented finite element analysis program architecture. Comput Struct 1999;70(1):63–75.
[9] Mackie RI. An object-oriented approach to calculation control in finite element programs. Comput Struct 2000;77(5):461–74.
[10] Yu LC, Kumar AV. An object-oriented modular framework for implementing the finite element method. Comput Struct 2001;79(9):919–28.
[11] Jang J. An assessment of iterative solution techniques for interactive finite element analysis. MS thesis. Department of Civil Engineering, University of Washington; 2001.
[12] Miller GR. Coordinate-free isoparametric elements. Comput Struct 1993;49(6):1027–35.
[13] Stroustrup B. The C++ programming language. 3rd ed. New York: Addison-Wesley; 2000.
[14] Rucki MD. An algorithmic framework for flexible finite element modeling. PhD dissertation. Department of Civil Engineering, University of Washington; 1997.
[15] Rucki MD, Miller GR. An algorithmic framework for flexible finite element-based structural modeling. Comput Meth Appl Mech Eng 1996;36(3–4):363–84.
[16] Rucki MD, Miller GR. An adaptable finite element modeling kernel. Comput Struct 1998;69(3):399–409.
[17] Demmel JW. Applied numerical linear algebra. SIAM; 1997.
[18] An open system for earthquake engineering simulation. Pacific Earthquake Engineering Research Center. Available from: http://opensees.berkeley.edu. 2000.
[19] McKenna F. Object oriented finite element programming: framework for analysis, algorithms, and parallel computing. PhD dissertation. Berkeley, CA: Department of Civil Engineering, University of California at Berkeley; 1997.
[20] Tcl Developer Site. Available from: http://www.tcl.tk. 2002.
[21] Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, et al. LAPACK users' guide. 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1999.
[22] Demmel JW, Gilbert JR, Li XS. SuperLU users' guide. Technical report UCB//CSD-97-944. UC Berkeley: Computer Science Division; 1997.
[23] Davis TA, Duff IS. A combined unifrontal/multifrontal method for unsymmetric sparse matrices. Technical report TR-97-016. Computer and Information Science and Engineering Department, University of Florida; 1997.
[24] Meurant G. Computer solution of large linear systems. Studies in mathematics and its applications, vol. 28. North-Holland; 1999.
[25] George A, Liu JWH. The evolution of the minimum-degree ordering algorithm. SIAM Rev 1989;31(1):1–19.
[26] Bulka D, Mayhew D. Efficient C++: performance programming techniques. New York: Addison-Wesley; 2000.
[27] Black JR, Martel CU, Qi H. Graph and hashing algorithms for modern architectures: design and performance. In: Proceedings of the 2nd Workshop on Algorithm Engineering (WAE 98), Max-Planck-Institut für Informatik; 1998. TR MPI-I-98-1-019:37–48.