Compurers & Smrfures
PII: s0045-7949(%)00343
Vol. 63. No. 3, pp. 413-428. 1991 0 1997 Elwier Science Ltd Printed in Great Britain. All rights rewwd 004s7949197 $17.00 + 0.00
NONLINEAR FINITE ELEMENT ANALYSIS OF BEAMS AND ARCHES USING PARALLEL PROCESSORS L. N. B. Gummadi and A. N. Palazotto Department of Aeronautics and Astronautics, Air Force Institute of Technology, WPAFB, OH 45433, U.S.A. (Received 17 January 1996)
Abstract-A nonlinear finite element formulation for beams and arches is presented for implementation on a multiple instruction multiple data (MIMD) parallel machine. A twelve degree of freedom arch element is developed using Green strains and 2nd Piola Kirchhoff stresses. Concepts like loop splitting and expression splitting are used to improve the load balancing capability among the processors. Using this developed element, beams and arches that are undergoing large displacements and large rotations are analyzed, and efficiency of the program is compared with a sequential implementation. 0 1997 Elsevier Science Ltd. All rights reserved
INTRODUCTION
Ever since the serial computers based on the architecture by John Von Neumann [l] was developed, their speed has increased steadily to match the needs of ever growing engineering applications in science and engineering. However during the 197Os, it became evident that the maximum possible speed limits of computers were approached due to the fundamental physical limitations imposed by the speed of light. A viable alternative to improve the speed of a computer in solving an engineering problem is in the use of a group of computers. This led to the development of the concept of parallel processing. In a serial computer, one processor is used for all the co-mputation involved in solving a problem. In a parallel computer, many processors are available for simultaneous processing and the effectiveness becomes possible through the sharing of the computational effort among themselves thereby improving the speed of the computer. The concept of parallel processing has made tremendous impact on many areas of computer application such as biosphere modeling, pollution modeling, semi conductor material modeling, ocean modeling, computer tomography, analysis of protein structures, vehicle design and dynamics etc [2]. A parallel computer is useless unless an efficient parallel algorithm is developed. An algorithm that is developed for a serial computer can be highly inefficient in a parallel computing environment as that algorithm will make use of only one processor at a time keeping the other available processors idle. Development of a parallel algorithm for a given problem is dependent on the control mechanism of the processors (SIMD, MIMD etc.), memory sharing
mechanism of the processors (shared memory or distributed memory) and the arrangement of the processors (hypercube, mesh, ring, BBN butterfly etc.) [3]. This dependency on the architecture poses problems of portability with the parallel computers. A program written for one parallel computer may require extensive modification for a different parallel program unlike a serial computer (as a matter of fact, all serial computers fall into one sub category of parallel computers with a single instruction single data (SISD) control mechanism incorporating a single memory unit architecture). The concept of parallel processing has been successfully used in the past for various structural. analysis applications such as static analysis, dynamic analysis and optimization. The review of progress in these applications can be found in Refs [4-6]. In a typical linear static finite element analysis, the most time consuming operation is the solution of linear simultaneous equations. Hence, the majority of the efforts in the parallelization of a linear static analysis were focused on developing and implementing various equation solver algorithms on different parallel machines of different architecture. Some such efforts include linear static and transient finite element analysis on BBN butterfly parallel processor by Henno et al. [7] using parallel Gaussian elimination algorithm, LDL decomposition algorithm for linear and nonlinear static and dynamic analysis on local memory MIMD machines by Farhat et al. [8,9], and column oriented Cholesky algorithm by Goehlich et al. [lo] on shared memory MIMD machines etc. Some other techniques that are implemented for the parallel computational environment include global local analysis by Sun and
413
414
L. N. B.
Gummadi and A. N. Palazotto
Mao [ 111, substructuring formulation by Padovan et al. [12] and Farhat et al. [13]. In the case of nonlinear finite element analysis, the relative expensiveness of each operation involved depends on the type of nonlinearity, and the solution method chosen. For example, a geometrically nonlinear problem can be solved using a total Lagrangian method or an updated Lagrangian method. A total Lagrangian method requires more displacement calculations, complicated strain whereas an updated Lagrangian formulation requires addition of stresses and displacements in each update. However, both the formulations are iterative. In an iterative algorithm, the global tangential stiffness matrices need to be updated and solved. Unlike linear analysis, formulation and assembly of the stiffness matrices can be computationally expensive. For an efficient parallel nonlinear algorithm, attention must also be focused on the stiffness matrix generation. Vanluchene et a[. [14] used an updated Lagrangian technique to formulate a nonlinear finite element program. In their analysis, explicit multiplication formulae were used for stiffness and stress calculations. In a typical iteration, the solution to equations involved around 20% of the total time while assembly of the elemental stiffnesses involved 40% of the total iteration time, and the remaining time was spent on miscellaneous operations like updating the displacements, renumbering the elements etc. Reference [ 141 established parallel equation solvers as well as assembly techniques. They implemented this new program on a CYBER 20.5, a vector processed machine, which has a shared memory system. They did not parallelize the assembly process. Farhat [ IS] considered localized nonlinearities. In this reference, all the elements are divided into subdomains using sub structuring concepts. Each of the subdomains consisted of a linear and a nonlinear stiffness matrix which were later assembled. They also used updated Lagrangian method. These techniques are valid for local memory MIMD parallel systems. Chien and Sun [16] distributed the generation of stiffness matrices (for truss elements) to various processors in a shared memory MIMD system and assembled them concurrently. Local locks are set up in the shared memory pool to provide synchronization of the assembly process. They also used an updated Lagrangian method. In the present analysis, local memory MIMD parallel system with a mesh architecture is chosen as the computational environment. Geometrically nonlinear beam and arch problems are solved using a parallel processing techniques for very large displacements and large rotations. The total Lagrangian approach is used for the nonlinear analysis, and the concepts of parallelization are introduced at the formulation of the elemental stiffness level itself. Instead of focusing on the efficient equation solver, efficiency of operations is achieved in the formulation of the stiffness matrices. Two sample problems are
solved using the developed approach, and total efficiency achieved is discussed. In the next section, formulation of the stiffness matrix is described followed by the implementation aspects on a parallel machine.
THEORY
The displacements due to pure bending for a flat beam are shown in Fig. 1. In this figure, the co-ordinates y - z correspond to global co-ordinates with the total Lagrangian system. h is the thickness of the beam and I/I is the bending rotation angle. u2 and u3 are the displacements in y and z co-ordinates, respectively. The undeformed and deformed configurations for a section of the midplane and the normal perpendicular to the midplane are shown in Fig. I(b). koj corresponds to the normal to the midplane before deformation, and k’oy’ denote the normal to the midplane after deformation. jq is a part of the midplane before deformation, and j’q’ is the same part of the midplane after deformation. When these two configurations are superimposed one on top of the other making o and o’ to coincide, deformed and undeformed configurations for a part of the beam are shown in Fig. l(b). Kinematics that can be used to describe the deformed configuration can be written using Fig. 1 in which a curved arch is introduced into the discussion as
U2= u 1 - f + z sin($) ( >
#3
=
w
+
z(cos(~)
-
1)
(2)
where v and w are the displacements of point o on the midplane, and R is the radius of the arch. Throughout the analysis, small strain assumption is used so that the constitutive relationship between Lagrangian stresses and Green strains is assumed to be constant even at large displacements and rotations. These kinematics do not satisfy the condition of zero shear on the top and bottom surfaces of the beam. To satisfy the shear distribution across the thickness, higher-order coefficients of the z co-ordinates are included in the kinematics, following a similar procedure presented by Palazotto and Dennis [ 171, and Heyliger and Reddy [ 181. The total kinematic equations can be written as
uz=u
1 -f (
+zsin($) >
+ z2 sin(t)‘*’ + z3k(w,y + sin(+))
(3)
Analysis of beams and arches
415
Deformed configuration
apron
‘. (a)
Various co-ordinate
: : : ‘. i..” :i..‘*’
Deformed configuration
systems
2
z =
-h/2 or 1;= 1
(b)
Deformed kinematics
jv____ r z sin Jr
p9 j
t-4
V
x
A 7-
0
-9
z(l-cost)
Pj’ w
0’ x’
p
j"
---
JI
.,
p”
Y
PI'
0
Z
(cl
Fig. 1. Geometry and kinematics description.
‘43 =
1Y +
z(cos(l)) - I).
(4)
Here k = -(4/3/C’) and ,)’indicates differentiation with respect to y. It is nossible to establish relations consistent with a compbsite beam like structures. Thus, the simplified one dimensional constitutive relationship between the stresses and strains for the kth lamina can be expressed as (5)
a:, = s 44623
(6)
where QZ2= QZ2cos4 0 + 2Qt2 sin2 0 cos’ @+ Q,, sin’ B (7)
(8) Here Q&S represent the ply stiffness expressed in material coordinates, and 0 is the angle between the material coordinates and the geometric coordinates of individual lamina, the ai are second PiolaKirchhoff ply stresses, and t’he Eij are the Green strains. Using the explicit expressions for strain displacement relations given in Palazotto and Dennis [7],
L. N. B. Gummadi and A. N. Palazotto
416 the in-plane co-ordinates
strain as
cz2 can be written
in curvilinear
X22(3) = CW + C3VfV + 2c’v.J - 1 + cos(lfQ)) + c’(-
7 Lz2= t:z + c Z%z,,)
(9)
p=l
1 + cos(*)y
- 2&J,, C,?cos($) - 2c3j.J
- 1 + cos($))
x cos($) + qVV cos($)2 where t& is the mid plane strain and ~~~~~are functions independent of z and are given by
+ k(w,,., + $.V cos($)) + kv,, (W,,V+ $,,! cos($))
$2 = o.5c2vz + v1.)+ 0.5v;,. - cw - cv,VM’ + o.5czwz + cvw.C + 0.5w:,
(10)
- ckw(w,,
+ $., cos($))
- 2c4v sin($)
+ 2.5c%$,, sin($) + c’w,,$,, sin($) KZZ(I) = -(c’w)
- c+I,,w + c3w2 + c*vw,V+ cw: + c3 sin($)*
- c( - 1 + co@)) - CV,) (- 1 + cos($)) - l.5cz$,, sin($)* + O.Sc$$ sin($)’ + c’w(-
1 + cos($))
+ tiJ cos($) + c’kv(w,, + sin($)) + ckw,, (w,, + sin($))
+ u,,$,, cos(ll/) - cw$,?. cos($) + c(C + 22) sin($) + cw,! sin($) - cv$,, sin($) - w,&,, sin($) K22(2, =
-
(11)
-
1 .5C4V2
+
2c3v,I-w - 2c’vw,,
C2V,y
-
- c’( - 1 + cos($))
- c%,,( - 1 + cos($))
+ 0.5$.:
+
II/,?,
sin($))
- 1 + cos($))
+ tiIJJ sin($))
cos($)(~3
COW)
+ $,yF sin($)) (13)
Kw) = Ck(W,,,. + $., COS($))
- 2c2w$,, cos($)
+ cku,,.(w,,,. + $,T WI(/)) - 2c2kw(w,,? + $,v cm($))
1 + cos(~))cos(~) cos(~)z
+ c3v sin($)
cos($)
x (II/%cos($) + 0.5$.,
+ 0.5c2( - 1 + cos($))2 + qkJ cos($)
- cll/.y(-
c’w(bG
- 0.54
+ 2c3w( - 1 + cos(i+k))
+ cu.,.+,,. cos($)
+ 0.5c~,,~(C cos($) + ti,, sin($)) -
1.52$,
cos($) + $., sin($)) 2
- c4+ 4,.
+ 2c*w,) sin(+)
1 + cos($))(w,,. + $4 cos($N COSW)
x (w,?~+ $,y cosWN - c‘?,,
- 0.5c2vll/,, sin($) - 1.5cw,&,, sin($)
sin(4+)
+ c’$,~ sin($)’ - 0.875c*1j~~sin($)’
$ 0.5~’ sin($)’ - c$,, sin($)’
+ c’kv(w,?. + sin($))
+ 0.51& sin($)2
+ 2c%w,,.(w,,- + sin($)) + c*k sin($)(w,J + 0.5u,,.($f, cos(JI) + II/.TY sin($)) - O.~CW(V?,cos(+) + II/,,, sin($))
+ sin($))
- cW,?. sin($)(w.,. + sin($))
(12)
- ~‘v.~($?~COW) + $.yVsin($))
Analysis of beams and arches - c’( - 1 + cos($))
417
where
+ c$,, cos($)
.& = w,~+ sin($)
(19)
~2~~~) = 3k(w,. + sin(ti)).
(20)
Shear strain can also be written as x (ICI?ccs($) + ti.!? sin($)) ~23= (1 + 3z2k)& + 0.125($$:‘,.cos($) + $.jr sin($))’
The internal strain energy for the arch structure can be written as
Kzz(~) = - 2C2kZ~J(MJ,?J + $., cos(ll/)) - 2c%( - 1 + cos(I(I))
U = fb ssI h
x (W&Y + $,J co@))
&&
dz dy
+ 2CW.J cos($) + fb
x (w,,,. + IL,,cos($)) + 0.25c3$fr sin($)’
x (w,~+ sin($)) - 1.5~~k$,~sin($) x (w.? + sin($)) + O.Ww,,,. + $.Vcos($))
=
0.5k2(w.F!
+
ti.y
sin(ti)(w.,
dz dy.
(22)
Here I is the arc length, b is the width and h is the thickness of the beam. Using a similar procedure developed in Palazotto and Dennis [17], the strain components can be divided into linear and nonlinear parts as
x (C cos($) + II/.,, sin(+)) + 0.25c($? cos($) + $,VJsin(lC/))’
&:, SII h
- 2c%v(w,, + sin($)) + 2c3k sin($)
~22e.j
(21)
(14)
(15)
& = L;q + fq’tioq
(23)
~zz(i)= L,Tq+ fq’Hiq
(24)
623 =
(25)
cq
+
fq’H,q.
COS($))~
+
c3k$.y
+
+
0.5c2k2(w,F + sin($))’ + ck(w,,)
Here, L, are column arrays for linear parts; H, are the symmetric matrices for the nonlinear parts of the normal strain; L,Tand H, are the linear and nonlinear parts for the shear strain; while q is the displacement gradient vector which is given by
sin($))
+ $,, cos(+))(ti2,r cos($) + $,J, sin($))
qr= (0, u,,, w, w,r, w.,, , $3 ti.?, *.t,).
(26)
(16) Kzzm = ck’(w,,,. + $.r COS(~(/))~ + c3k2(w,,. + sin(l(/))2. (17)
In the above expressions, c = l/R. Scale factors h, = h3 = 1 and hl = I - (z/R) are used to obtain the above equations. The only other non zero strain is through the thickness shear strain e2,. In deriving this strain, only the linear components of Green strain are considered. This assumption is consistent since only linear shear strain components are considered in reducing the kinematics. The linear Green shear strain can be obtained using the strain displacement relations in Ref. [ 171 as
A detailed explanation regarding the form of L and H can be found in Ref. [ 171. It also should be pointed out here that the displacement gradient used in this formulation has $,YVas an extra term when compared with the displacement gradient used in the Ref. [17] formulation, and this term is due to the presence of the $.IY term in the strain expressions. If the variation of the total potential is taken, the equilibrium equations can be obtained. The variation of total potential can be taken with respect to the degrees of freedom q as
6II=b
~2~228(~22)
ssI
dz
dy
II
BqTp dy.
Q$$~.&2.1) dz dy -
+b ssI 1
sI
(27)
418
L. N. B. Gummadi and A. N. Palazotto
In terms of L, and H,, the variation of strains tZ2and cz3can be written as
R, = ii C,+j(LiqTHj + qTLjHi + HjqLT) 0 II + c%Lsq’Hs
+ qTL,Hs + H,qLT)
(34)
8~22= 2 L:Sq + SqTHiq + r?LTq+ qTGH,q (28) ,=O 6~2, = Lfsq + GqTH,q + c5Lfq + qTGHsq. (29) + f(q’HjqHi + qTHiqHi))
In Palazotto and Dennis [17], due to a simplification in the kinematic relations, all the elements in Li and H, are constants (independent of q) and hence the last two terms in the right-hand side of the above equations are identically equal to zero. In the case of the present formulation, all the elements of Li are constants and some of the elements of Hi are functions of $. That is, while the variatons of L, are equal to zero, variations of H, are not equal to zero. Now the variation of strains can be rearranged as
+ @(HsqqTH,
+ H,qqTHs
+ f(q’HsqHr + qTHsqHs))
m,= b
(C321jr~22 +
(35)
dz dy.
&&h,)
Here, the elastic terms, C,, CS i = 0, 1,2, . . . ) 14 can be defined as
(36)
when
[C,, CI, cz, . . I Cd
6622 =
i
i=o
6~23 =
Lfhq + Gq=H,q + hqTficq
L,TGq+ Gq=H,q + Gqrfisq.
(30)
=
b
(31) cs = b
022L
1, i-‘,
. . . ,
i”1 di (37)
& [l + 3ki]dc.
(38)
When Hi terms are not constants, Thus the equilibrium b
[
[ 0:2e2
( c'
L’aq + GqTHiq + GqT&qj dz dy
equations can be written as
‘. ^ R+F+F+&
1 5 qdy=
I
Pdy.
(39)
The analysis obtained so far is element independent. In the next section, a 12 degree of freedom beam element is developed based on the current formulation. +
Sq'&q)dz
= SqT
tiqTfi dy
dy -
fs+%+%+N
2
3
’1 I q-p
FINITE ELEMENT
dy.
(32)
Here # is the load vector and R, R, and I$ are matrices corresponding to the stiffness. Of these, R is constant, RI is a linear function and NZ is a quadratic function of the displacement vector q and can be expressed as
FORMULATION
The beam finite element used for the current formulation is shown in Fig: 2. The element has two nodes with degrees of freedom (DOF) u, t)+, w, w,,, $, and $,, at each node. The values of the DOF q(q) at any point can be expressed in natural coordinates in terms of the element’s DOF using the shape functions
= Td it = Ii i k Ci+j(LiL: + L,LT) 0 0 + fCS(LsL: + LsLT) (33)
(40)
Analysis of beams and arches
419
2a
t = * ..._. ... ...... h I
=
-1
... .
f
.. . . .. . .
Fig. 2. Description of element. where HI,
(HII).,
0
0
0
0
0
0
H,z
Wd.,
0
0
0
0
0
0
0
0
HI,
(Hi,),,
(HII).,,
0
0
0
0
0
H12
W12l.q
W12)m
0
0
0
0
0
0
0
F=
Using the Hermitian given as
0 0
0 0
0 0
H2i
Wn).ri
0
0
0
H22
W22).1
0
0
0
0
0
H2,
(ff211,w
W21h,n
0
0
H22
(H22L
WzzL,
0
0
0
0
0
0
0
0
shape functions
0 0
H which are
HI, (HII)., WI).,, Hu W12),4 Wdm, 0 0 0 0 0 0 0 0 0 0 0 0 H21
Vf21
Hz2
0322
(41)
12 x 8
where K=
D’izD dr]
(46)
D%,D
dv
(47)
D’fi2D dq
(48)
D%3D dq
(49)
sI
HII= a(2- 31 + q’) H12= ; (1 - q - q2 + II’) (43)
N, = sI
HI, = a(2 + 3~ - n’) H:2 = ; (-
N2 = 1 - q + q2 + 7’)
(44) N, =
and a is the half the length of the element, and standard finite element procedure, the equilibrium equations can be written as
K+$+N,
1
d-F=0
(45)
sI F=
r DTPdq.
Ji
(50)
Here D is the product of inverse Jacobian matrix f and transformation matrix T.The inverse Jacobian
L. N. B. Gummadi and A. N. Palazotto
420
matrix r is used to transform the natural coordinate system to global coordinate system and can be written as -I
1000
000
Ol/aO 0010 l-=
0
0
0 0 000
0
0 0
0
0
0
l/a
0
0
0
0
0
0
0
0
1/a*
0
0
0
0
l/a
0
0000
010
0000
0
0000
000
0 l/a’
(51) These nonlinear equilibrium equations at each load or displacement level can be solved using the modified Newton-Raphson incremental methods such as displacement control technique [ 171or modified Riks technique [191. If the equilibrium equations (eqn (45)) at d = d, _ , can be written as f(dn_ ,) = 0, equilibrium at d, = d, _, + Ad, _, can be arrived at by expanding f(d. _, + Ad,_ I) in a truncated Taylor series for a small value of Ad. _, as f(d.-,+A4-,)af(d”-,)+~Ad,-,=O.
the relative time required for each task. A typical finite element algorithm consist of tasks such as reading the necessary input data, formation of elemental matrices and vectors such as mass, stiffness, tangent stiffness, force vector etc., assembling the matrices and vectors to obtain the global matrices and vectors, application of boundary conditions, solution of global algebraic equations, and post processing operations to compute strains and stresses. In the case of nonlinear finite elements, the elemental stiffness matrices and global matrices are nonlinear in nature. In order to solve these matrices, the modified Runge-Kutta method or modified Riks method can be used. A typical flow chart for this problem can be seen in Fig. 3. Parallelization is possible only when the individual tasks are independent and there is no data dependency among the tasks. That is, if the outcome of one task is not depend on the outcome of another task, then these two tasks can be performed in parallel. For example, in a finite element program, generation of elemental stiffness matrices and assembly of these matrices are two independent tasks, but the elemental matrices cannot be assembled before the generation of these matrices. Thus, there
I
(52)
K
I
Alternatively $Adn-,
~_
NI
= -f(d._,)
af/dd can also be written as [K + N, + N2 + aN,j&ij. Thus, the incremental equations to be solved can be written as
4
N2 I N3 I N4
=-
I
Gauss integration +
Dqdrj
(54)
sI
/
and dn=d._,+Ad._,.
(55)
It should be observed here that the tangential stiffness matrix (left-hand side of the eqn (54)) has four components K, NI , Nz, Nd = aN,/ad and the stiffness matrix has four components K, N,, N2, N,. IMPLEMENTATION
ON A PARALLEL
No
PROCESSOR
In order to implement a program in a parallel processing environment, first and foremost is the determination of tasks that can be parallelized and
Next increment Fig. 3. Flow chart for a finite element program sequentially.
Analysis of beams and arches
is a data dependency between these two tasks. However, generation of the stiffness matrix of one element is independent of the generation of the stiffness matrix of another element. That is, generation of the individual element matrices can be done in parallel. Concepts of substructuring are effectively used to this end [ 131.In this technique, the structure is subdivided a priori into many parts, substructures. The stiffness matrix of each of the substructures is separately determined by an individual processor. In this paper, parallelization is introduced in the development of an element stiffness matrix itself. The focus of this paper is merely on the parallelization of the elemental stiffness matrix itself. That is, further parallelization is possible when this method is used in conjunction with the well known substructuring concepts. Another task where the parallelization is possible is in the solution of global equations. In this paper, effects of parallelizing this task are not discussed. By using an efficient parallel equation solver, further reduction in time can be possible. Most of the nonlinear parallel finite element techniques use an updated Lagrangian formulation. The total Lagrangian formulation requires more strain displacement calculations when compared with updated Lagrangian formulation. For a single iteration, relative time taken by each of the tasks is shown for a flat bea.m and a deep arch in Table 1 using the total Lagrangian approach. These percentage time calculations are arrived at by solving a beam and an arch problem, each of which is modeled using elements of equal lengths. It can be observed that formation of the element stiffness matrix use nearly 90% of the total CPU time involved in a single iteration. This percentage contribution is attained by averaging ten problems with a different number of elements. Hence, it is logical to focus the attention on the generation of the elemental stiffness matrix for efficient parallel program. Concepts of loop splitting, and expression splitting are used to parallelize the element stiffness matrix generation. Loop splitting
As previously indicated, loop splitting is the process where by the work is divided among various available processors. Thus, an ideal feature for parallelization on a finite element program is the generation of the elemental stiffness matrices as they
Table 1. CPU time percentages of various tasks for beams and arches Task Element stiffness Assembly Boundary Post processing Eqn solver Convergence
Beam (%)
Arch (%)
94.71 0.05 0.24 4.735 0.24 0.025
94.88 0.04 0.16 4.74 0.16 0.016
421
are independent to each other. In a sequential computer, this generation is done in a loop, one after the other. In a parallel computer, all the elemental matrices can be generated simultaneously using various processors. In terms of Pseudo programming language, the generation of the stiffness matrix using a sequential computer can be written as DO 10 I = 1,NELEM CALL ELSTIF CALL ASSEMBLY 10 CONTINUE Loop splitting is an efficient way to distribute the work done in a loop among various processors. Using the loop splitting concept, the program can be parallelized as follows DO 10 I = 1 + ID,NELEM,NPROC CALL ELSTIF CALL ASSEMBLY 10 CONTINUE SEND GSTIF Here, ID is used to denote the processor number and NPROC is the number of processors used. The description of the concept of loop splitting can be explained using the following example. Suppose there are 10 elements and 2 processors are used to compute the global stiffness matrix. In a sequential computer, total operations involved include the generation of the 10 element stiffness matrices and the assembly of the 10 stiffness matrices. Ifs is used to denote the time for a element stiffness generation and a for assembly, total time used in a sequential computer can be written as Tses= lO(s + a). In the case of a parallel computer, processor 1 will determine the elemental stiffness for the elements 1, 3, 5, 7 and 9 and assemble the stiffness matrix of these elements. The processor 2 will be used to determine the stiffness matrix for the elements 2, 4, 6, 8 and 10 and their assembly. By this way, the work has been distributed between the available processors resulting in the reduction of the effort of each of the processor. Total time for each processor so far becomes equal to 5(s + a). The next operation in this program is the transfer of the global stiffness matrix generated by the processor 2 to the processor 1. Then the processor will further assemble the stiffness matrix with the global stiffness matrix developed by itself. Thus the combined global stiffness matrix can be achieved by using the parallel processor. If t is the time for the transmission of the data between the processors and r is the time taken by the processor to combine the global stiffness matrices developed by itself and the processor 2, then the total time elapsed in developing the global stiffness matrix using the parallel processor with 2 processors can be written as TPil,= S(s + a) + t + r. If T,,, is less than Tq, then speed up can be achieved. The concept of the loop splitting can be generalized
422
L. N. B. Gummadi and A. N. Palazotto
for any number of elements. However, this concept has limitations in certain circumstances. For example, there are 10 elements and 11 processors for the generation of the stiffness matrix. Then, only the first 10 processors will be operational in determining the elemental stiffness leaving the 11th processor idle. This drawback can be reduced by further parallelization of the process. For this, the concept of expression splitting can be used. Expression splitting
The nonlinear elemental stiffness matrix consists of four matrices, K, N,, Nz, N, and Nq. Here, the matrix K is linear while the matrices N, to Nq are nonlinear. All these matrices are independent of each other. That is, formation of matrix K does not depend on the formation of the other matrices. This is an ideal feature for parallelization. The elemental stiffness matrix will be developed by more than one processor. This concept is called “expression splitting”, an operation in which the large expressions for the stiffness matrix are split into five smaller independent expressions by which different processors can compute simultaneously. The Pseudo code describing the expression splitting in combination with loop splitting can be written as follows. DO 10 I = 1 + ID/N,NELEM,NPROC/N CALL ELSTIF CALL ASSEMBLY 10 CONTINUE SEND GSTIF Here N is used to denote the number of expression
splits in the stiffness matrix. By this process, N number of elements are working to determine the stiffness matrix of a single element. This can be explained by using the following example. Suppose there are 10 elements in a structure and 20 processors to compute the global stiffness matrix. The stiffness matrix is divided into four submatrices and four processors will be used to determine the stiffness for each element. Processors 14 will be used to determine and assemble the stiffness matrices of elements 1 and 6. Processors 5-8 are responsible for the elements 2 and 7, processors 9-12 are responsible for the elements 3 and 8, processors 13-16 are used to determine the stiffness of the elements 4 and 9 and processors 17-20 are used to for determining the stiffness of the elements 5 and 10. In this way, work has been distributed among 20 processors. By this method, any number of processors can be used and all the processors will be sharing the work load among themselves. A flow chart for this finite element procedure can be shown in Fig. 4. It should be pointed that this loop splitting is only one kind of division of work among various processors. Many different variations can be made to this concept. For example, processors 1, 2 and 3 will be assigned to elements 1 and 2 (instead of 1 and 4), a second set of processors (4, 5 and 6) will be assigned to elements 3 and 4 and a third set of processors will be assigned to determine the stiffness matrices of the elements 5 and 6. This kind of loop splitting is also known as block splitting where a priori, the programmer assigns the element to the processors. Self scheduling is another concept by which work load
Fig. 4. Flow chart for a parallel finite element program.
Analysis of beams and arches
can be distributed a.mong processors in such a way that work will be d:lstributed evenly.
RESULTS AND
DISCUSSIONS
Intel’s Paragon parallel computer is used for the program implementation. It is a distributed memory multiple data stream and multiple instruction stream (MIMD) computer. Two sample examples are considered. In the first example, a cantilever beam with a tip concentrated load is considered. In the second example, a simply-supported arch with a crown load is considered. Element formulation differs for both the beam and the arch in terms of the curvature term in the arch. Strain displacement relations for the beam can be obtained by using l/R = 0 in equations (lOHl7). That is, the number of operations in the element stiffness generation in an arch problem are more when compared with the beam problem. In general, a sequential algorithm is usually evaluated in terms of its execution time which can be expressed as a function of the size of the problem, but a parallel algorithm is a combination of the algorithm and the architecture of the machine on which it is implemented. Two significant metrices that can be used to measure the performance of parallel systems are speed up and efficiency. Some other metrices that can also be used include run time and cost. Serial run time is the time elapsed between the beginning and the end of the execution on a sequential computer. The parallel run time is the time that elapses between the parallel computation start and the end of the parallel computation. Speed up is a measure that captures the relative benefit of solving a problem in parallel. It can be defined as the ratio of run time of the best sequential algorithm on a single processor and the run time on a parallel computer. In a very ideal situation (a situation in which there is no time elapsed during the communication between the processors), the speed up is equal to the number of processors (dotted line in Figs 7 and 11). Theoretically, speed up can never exceed the number of processors as part of the time required by the processors is spent in communication. Efficiency is a measure of the fraction of the time for which the processor is usefully aemployed. Efficiency is defined as the ratio of speed up to the number of processors used. In an ideal system, the efficiency is equal to 100% but in practice, the efficiency is between zero and 100%. Cost of solving a problem on a parallel system can be defined as the product of the parallel run time and the number of processors. Cost is a reflection of the time that each processor spends solving the problem, Cost can also be expressed as the ratio of the run up time for the sequential computer and the efficiency. As there is dependency between the four metrices described, all the following discussions consider speed up and efficiency as the
423
basic performance measuring metrices for a parallel system. The geometry and the material properties used for the beam problem are shown in Fig. 5. The beam is divided into 40 equal elements and the results are compared with the analytical results available in Ref. [20]. Displacement as a function of load is shown in Fig. 6 and a close agreement can be seen with the analytical results. Speed up for the 20 iterations in a single increment step are shown as a function of no of processors in Fig. 7. Here N = 40 correspond to the computational effort when 40 elements are used. Both the expression splitting and loop splitting concepts are implemented. It can be observed that speed up increased with the increase of the number of processors. In an ideal program, this curve should be a straight line, shown dotted in Fig 7. The reason for the less than ideal speed up is attributed to various overheads such as interprocessor communication. Efficiency is plotted as a function of number of processors in Fig. 8. When only one processor is used, the efficiency is 100%. If there is no CPU time expended for the data transmission between the processors, this efficiency will remain to be 100%. It can be observed that the efficiency is reducing with the increase in the number of processors. Speed up and the efficiency of a parallel system are also a function of the size of the problem. This was observed by increasing the size of the problem. The size of the problem is increased by increasing the number of elements. The speed up and the efficiency are measured for 80. 120,200 and 400 elements. It can be observed that as the problem size increased (as the number of elements are increased), the speed up and the efficiency increased. When the number of elements are 40, the maximum speed up achieved is
-P
b=lm 10m
t =O.lm E = 1.2kN/mm2 V= 0
Fig. 5. Geometry description of a beam with concentrated tip load.
L. N. B. Gummadi and A. N. Palazotto
424 10000
I
I
1
I
I
I
1 I I I
9000 -
8000-
7000-
600022
---
pO0 r&
Exact (Bishhopp cr ok) Preseot
4000-
3000-
2000-
1000 -
06 0
C-
_---
I 2
1
I 4
I 3
I 5
I 7
I 6
8
Vertical tip deflection (m) Fig. 6. Load displacement characteristics of the arch.
I
I
I
1
I
8 I / / / / /
Ideal speed up
,
/
,
/
/
/
/
/
/
/
/
/
/
/
/
,
/
/
/
/
N=4Ot
15 N=2M 10
0
0
5
10
15
20
25
No of processors
Fig. 7. Speed up for the beam problem.
30
35
Analysis of beams and arches
425
80
60 6 5 ‘5
50
5 u.l 40
0
II
5
10
I5
20
25
30
35
40
No of processors Fig. 8. Efficiency for the beam problem.
approximately equal to 5 and this maximum speed up is achieved using 10 processors. Increasing the processors did not increase the speed up. In the case of 400 elements, the maximum speed up is approximately equal to 15 when the number of processors are equal to 40. Also, a similar trend was observed in the case of efficiency (Fig. 8). Efficiency
of the system increased with the increase in the number of elements. When the number of elements is 40, the efficiency of the system is as low as 8% when 40 processors are used. At the same time, it can be
observed that for the case when the number of elements equals 400, the efficiency is equal to 40% when the number of processors are equal to 40. Similar trend of results were also observed for the arch problem. The geometry and the material properties used for the problem are shown in Fig. 9. Load displacement results are compared with the available analytical solution [Ref. 211 in Fig. 10. 40 elements of equal length with symmetry are used to analyze the structure. Speed up and the efficiency for
E = IE 7 psi r= lOOin a = 106.3 deg Y= 0 t=lin b=lin 8=40in c = 80 in
Fig. 9. Geometry description of an arch with crown load. CAS 63/3A
426
L. N. B. Gummadi and A. N. Palazotto I
1400
1
I
I _.
1200 -
-
---
0
5
I
_.
. .
.
.
Exact (Huddleston) Present
10
15
20
25
1 30
Vertical tip deflection (in) Fig. 10. Load displacement characteristics of the arch.
/
/
/
/
/
/
,
/
/
No of processors Fig. 11. Speedup for the arch problem.
I 35
40
Analysis of beams and arches
427
60 oh z *a 50 E ‘i; 40-
“0
5
IO
15
20
25
30
35
40
No of processors Fig. 12. Efficiency for the arch problem.
the problem are plotted in Figs 11 and 12. Dependence of the speed up on the size of the problem can also be observed in this case. A speed up of approximately 18 is obtained when the number of elements are increased to 400. All the results so far generated do not use parallel simultaneous equation solver. This algorithm can be implemented in conjunction with the proposed parallelization concepts and further speed up and efficiency can be achieved.
2. A. K. Noor and R. E. F&on, Impact of new computing on finite element computations. Impact of New Computing Systems on Computational Mechanics, ASME. United Engineering Center, NY (1983).
3. V. Kumar, A. Grama, A. Gupta and G. Karpys, Introduction to Parallel Computing: Design and Analysis of Algorithms. The Benjamin/Cummings Publishing
Company, Redwood City, CA (1994). 4. H. Adeli and P. Vishnubotla, Parallel processing. Microcomput. Civil Engng. 2, 257-269 (1987). 5. H. Adeh, M. P. Kamat, G. Kulkarni and R. D.
Vanluchene, High performance computing in structural mechanics and engineering. J. Aerospace Engng 6, 249266 (1993). 6. M. W. Fahmy and A. H. Namini, A survey of parallel
CONCLUSIONS
nonlinear dynamic dynamic analysis methodologies.
for bending type problems is presented for large displacements and large rotations. The concepts of expression splitting and loop splitting are introduced in the element stiffness matrix generation level for parallelization. A significant speed up and efficiency are achieved by these techniques. This process can also be used in conjunction with the already existing parallel processing techniques. A nonlinear
arch finite element
development
Comput. Struct. 53, 103331043 (1994). 7. A. Henno, S. Moore, F. O’Neil and E. Tenenbaum, Finite element analysis on the BBN-butterfly. Comput. Struct. 27, 13-21 (1987). 8. C. Farhat, E. Wilson and G. Powell, Solution of finite
element systems on concurrent processing computers. Engng Comput. 2, 157-165 (1987). 9. C. Farhat and E. Wilson, A parallel active column equation solver. Comput. Struct. 28, 289-304 (1988). 10. D. Goehlich, L. Komzsik and R. E. F&on, Application
of a parallel equation solver to static FEM problems. authors acknowledge the financial support received fro:m AFOSR under grant no. F33601-94 CJ026 and Dr Arnold Meyer of WL/FI/Vehicle Subsystems branch, WPAFB, for this work. The authors also would like to acknowledge Dr Alex Yui of ASC/MSRC for the PARAGON system support, Acknowledgements--The
REFERENCES 1.
S. Brawer, Introduction to Parallel Programming. Academic Press, San Diego, CA (1989).
Comput.
Struct. 31, 121-129 (1989).
11. C. Sun and K. Mao, A global-local finite element method suitable for parallel computations. Comput. Struct. 29, 309-315 (1988). 12. J. Padovan and A.
Kwang, Hierarchially parallelized constrained nonlinear equation solvers with automated substructuring. Comput. Struct. 41, 7-33
(1991). 13. C. Farhat,
domain (1988).
A simple and efficient automatic FEM decomposer. Comput. Struct. 29, 579-602
428
L. N. B. Gummadi and A. N. Palazotto
14. R. D. Vanluchene, R. H. Lee and V. J. Meyers, Large scale finite element analysis on a vector-processor. Cornnut. Struct. 24. 625-635 (1986). 15. C. Farhat, Computational strategies for finite element simulations on super computers with 4 to 65,536 processors. Struct. Cong., AXE, 177-186 (1989). 16. L. S. Chien and C. T. Sun, Parallel processing techniques for finite element analysis of nonlinear large truss structures. Comput. Strucf. 31, 1023-1029 (1989). 17. A. N. Palazotto and S. T. Dennis, Nonlinear Analysis of Shell Structures. American Institute of Aeronautics and Astronautics, Washington, D.C. (1992).
18. P. R. Heyliger and J. N. Reddy, A higher-order beam finite element for bending and vibration problems. J. Sound V&r. 126, 309-326 (1988). 19. C. T. Tsai and A. N. Palazotto, Nonlinear and multiple snapping responses of cylindrical panels comparing displacement control and Riks method. Comput. Struct. 41, 605-610 (1991).
20. K. E. Bishhopp and D. C. Drucker, Large deflection of cantilever beams. Quart. Appl. Math. 3, 272-275 (1945).
21. J. V. Huddleston, Finite deflections and snap through of high circular arches. J. Appl. Mech. 31, 763-769 (1968).