The waveform relaxation method in the concurrent dynamic process simulation

The waveform relaxation method in the concurrent dynamic process simulation

Computers them. Engng, Vol. II, No. 7, pp. 683-704, Printed in Great Britain. All rights reserved CCI98-1354/93 1993 $6.00 + 0.00 Copyright0 1993...

2MB Sizes 0 Downloads 78 Views

Computers them. Engng, Vol. II, No. 7, pp. 683-704,

Printed in Great Britain. All rights reserved

CCI98-1354/93

1993

$6.00 + 0.00

Copyright0 1993 PergamonPressLtd

THE WAVEFORM RELAXATION METHOD IN THE CONCURRENT DYNAMIC PROCESS SIMULATION A. R.

SECCHI,’

‘Chemical Engineering 21041, zEngenharia Quimica-Sala

(Received 31 October

M. MORARI’? and E. C.

BISCAIA JR*

California Institute of Technology, Pasadena, CA 91125, U.S.A.

G115, COPPE-Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, 2 1945, Brazil

1991; final revBion received 6 .TuIy 1992; received for publication

27 August 1992)

Abstract-We investigate the concurrent solution of low-index differential-algebraic equations (DAEs) by the waveform relaxation (WR) method, an iterative method for system integration. The WR method (or Picard-Lindeliif iteration) is an operator-splitting approach to the solution of a system of DAEs partitioned into several lower-order systems (subsystems). Such a method, when efficiently implemented, results in algorithms with a highly paralleliible concurrent fraction and low sequential overhead, making them especially suitable for coarse- and medium-grain MIMD distributed-memory machines. Since problems represented by DAEs cannot be %X&XI”, our performance measure is reduction in solution time for fixed problems, and the Amdahl’s-law bottleneck therefore concerns us. We describe our new simulation code DAWRS (Differential-Algebraic-Waveform Relaxation Solver), written in C, to solve DAEs on parallel machines using the WR methods. We discuss the system partitioning, and describe new techniques to improve both the local and global convergence of such methods. We demonstrate the achievable concurrent performance when solving DAEs for a class of dynamic simulation applications of chemical engineering.

1. INTRODUCIYON

Large-scale dynamic process simulation is inherently time consuming on conventional sequential computers. Supercomputers have been the traditional alternative, but their cost-performance ratio is often high, with no mention to the fact that their electronic components are already very close to their physical limitation. An alternative to supercomputers is concurrent computation on multiprocessor computers, or multicomputers. The concurrent solution of DAEs strives to make possible the dynamic simulation of industrial problems that have up to now not been simulated because of their very-large size, where by very-large size, or large-scale systems, we mean problems with several thousands of integration states. There are two basic concurrent simulation paradigms for solving DAEs, “direct methods” exploit the parallelism across the existing sequential algorithms for example, in the Jacobian formation and Gaussian elimination. Such methods maintain all numerical characteristics of the original algorithms, that is, global timesteps, formulation of overall system of nonlinear algebraic equations at each chosen timestep, and iterative solution of the resulting nonlinear system via Newton-iteration and Gaussian elimination of the sequence of linear systems

tTo whom all correspondence should be addressed. 683

generated thereby. These methods fall into the loose category of “efficient sequential algorithms”, and their performance depends on how well the parallelism was exploited [see Skjellum (1990) for more detail]. However, the direct methods can become inefficient for large systems where the different states are changing at very different rates. This is because direct techniques force every equation in the system of DAEs to be discretized at the same timepoint, and this discretization must be fine enough so that the fastest changing unknown variable in the system is accurately represented. Second, “dynamic iterative methods”, or waveform relaxation methods, exploit the parallelism across the system, iterating independent solutions of different parts of the overall system. In other words, the WR methods are operator-splitting techniques which solve DAEs iteratively. These methods are able to pick different discretization points, or timesteps, for each subsystem in the partitioned system of DAEs. Thus, each one could use the largest timestep that would accurately reflect the behavior of its associated unknown variables (multirate integration), and the efficiency of the simulation would be greatly improved. When efficiently implemented, the dynamic iterative methods have high concurrent fraction and low sequential overhead, and, consequently, fall into the loose category of “efficient concurrent algorithms”. The performance for WR methods

684

A. R. SECCHIet al.

depends, mainly, on their convergence and scheduling characteristics. The idea behind the development of the WR method originated in the late 19th century by Picard and improved by Lindelef. But it only became popular in the late 1970s from the study of the work of Newton (1979) who formulated the timing simulation algorithm in the form of a relaxation technique for solving the nonlinear algebraic equations associated with the discretized circuit differential equations. Afterwards, researchers in circuit simulation referred to such a method as waveform relaxation method (Lelarasmee et al., 1982). The first attempt to take advantage of the throughput of a multiprocessor computer by the WR method was made by Mattisson (1983), and it has proven successful for the concurrent simulation of large-scale VLSI circuits (Lelarasmee ef al., 1982; White, 1985, White and Vincentelli, 1985; Vandewalle and Roose, 1987; Smart, 1988; Saleh et al., 1987; Mattisson, 1986; Peterson, 1989). An initial implementation effort to use the WR method in the concurrent dynamic simulation in chemical engineering was made by Skjellum and co-authors (Skjellum et al., 1988). In this work the authors sketched a simplified binary distillation column in a template formulation to use the WR method analog of concurrent circuit simulators, generally device-oriented. Here, we investigate the concurrent solution of DAEs by a dynamic iterative method implemented in our DAWRS package, an easy-to-use and application-independent C-based concurrent DAE solver package. To demonstrate the achievable concurrent performance of DAWRS on multicomputers such as Symult ~2010 and Intel iPSC860 (Gamma), we utilize distillation column networks, examples of chemical process flowsheets. Such problems are modeled by large-scale, sparse and nonsymmetric systems of DAEs of index? one or two (depending on modeling sophistication), with a natural imbalance in their residuals. These models represent mass, energy and momentum balances, as well as the thermodynamic vapor-liquid equilibrium. Their subsystems often demonstrate unequal activity (latency), which can be well exploited by the WR methods. Also, some equations may involve iterative calculations (e.g. bubble-point calculations), which give more advantages to concurrent simulation when compared to the sequential case. t1nde.xis the minimum number of times that all or part of

(1) must be differentiated with respect to t in order to determine j, as a continuous function of y and t (Brenan et a/., 1989), although there are different definitions of index (Chung and Westerberg, 1991). The above definition is also called d@renrial in&x.

We discuss the system partitioning in which the DAEs to be solved must be partitioned into several lower-order subsystems, but each subsystem has to hold the correct unknown variables to maintain the problem consistency (Lelarasmee et al., 1982) (consistent assignment) and to ensure a sufficiently fast convergence (Peterson, 1989) (optimal partitioning). In the assignment process, each unknown variable is associated with an equation of the system of DAEs in which it is involved. However, no two variables can be assigned to the same equation, and the state vector of the decomposed system has to be a valid state vector of the overall system. In the partitioning process, the strongly coupled equations are included in the same subsystems so that the resulting subsystems are effectively more loosely coupled. We present new techniques to improve both the local and global convergence of the WR methods. By local convergence, we mean the timepoint convergence of the integrator within each subsystem; and by we mean the convergence global convergence, between the waveforms of the current and previous iteration. We show significant gains in performance as a result of our new approaches to manipulation of the waveforms. For example, during the dynamic simulation, DAWRS retains useful information from past waveforms which makes the convergence faster than conventional approaches. Also, a more rigorous convergence criterion shows better performance of the algorithm for our examples. To describe the algorithms we begin with an overview of the strategies to solve DAEs, including the predictor and corrector formulas, solution of the nonlinear system, stepsize and order selection and local error control, which constitute the integration layer of the WR algorithms. 2. INTEGRATOR

OVERVIEW

As the WR method exploits the parallelism across the system, each resultant subsystem may have its own integrator according to its necessity (explicit, semi-implicit or implicit methods) and characteristics (stiffness, sparsity, etc.). The interaction between subsystems is made by means of waveforms, ordered sets of timepoints, but one does not need to know how the other subsystem generated its waveform. Here, we consider only the case where all subsystems have the same integrator. The general case is the subject of future research. We have chosen the code DASSL of Petzold (1983) to carry out the subsystem integration, because of its known efficiency and robustness characteristics. DASSL is the most widely used FORTRAN-based code for solving index zero and one systems of DAEs,

Concurrent solution of DAEs with

a f&d

leading

coefficient,

variable

stepsize

and order, backward-differentiation-formula (BDF) implicit integration scheme. The l&d leading co&icient is one way of extending the fixed stepsize BDF methods to variable stepsizes, which is an essential property to exploit the latency of the subsystems during the transient analysis. There are three main approaches to extending fixed stepsize multistep methods to variable stepsize: fixed coefficient, variable coefficient and fixed leading coefficient. The flxed coefficient methods can be implemented very efficiently for smooth problems, but suffer from inefficiency, or possible instability, for problems which require frequent stepsize adjustments. The variable coefficient methods are the most stable implementation, but have the disadvantage that they tend to require more evaluations of the iteration matrix in intervals when the stepsize is changing, and hence are usually considered to be less efficient than a fixed implementation for most problems coefficient (Brenan et al., 1989). The fixed leading coefficient (Jackson and Davis, 1980) is a compromise between these two approaches, offering somewhat less stability, along with usually fewer iteration matrix evaluations, than the variable coefficient formulation. DASSL is designed for solving initial value problems of the implicit form: F(f, Y, 3, u) = 0, Y(G) =Ycl,

3(to) = 909

(1)

where F: R x RN x RN x Rr-rRN is a nonlinear function, Y(t) E RN is the vector of unknown variables, jr(t) E RN is the vector of the time derivatives of the unknown variables and u(t) E R’ is the input vector. Systems of ordinary differential equations (ODES) are included in this formulation as a special case of DAEs. DASSL approximates the derivatives 9 using a kth order BDF method, where k is chosen based on the behavior of the previous solution. Thus the resultant system of nonlinear algebraic equations is: ~(f”+I,Y.+,,~Y”+l+BI~n+,)=O, where

(2)

685

has to be solved for each timestep. In the corrector equation (2), all variables are evaluated at time f, + I, a is a constant which changes whenever the stepsize h“+, or order k changes, and /3 is a vector which remains constant while we are solving (2). The predicted value y$‘i, and its time derivative 3i”i, are obtained by polynomial extrapolation based on previous timepoints. The corrector equation, for a fixed timestep, is solved using a modified Newton-Raphson iteration, given by: y(“+ ‘) = yCm)- cJ_‘F(t,

Y(@. ay”) + /I, u),

m =0,1,2,.

(3)

..

where J is the iteration matrix:

J=E+aCF ay

ai,’

and c is a scalar constant chosen to speed up the rate of convergence of the iteration when J is the iteration matrix from some previous timepoint. For a more detailed formulation see Brenan et al. (1989). The stepsize and order selection strategies in DASSL tend to favor sequences of constant stepsize and order. There is an initial phase where the order and stepsize are increased at every step, and after that the order is lowered or raised depending on whether the estimates for the leading error term in the remainder of the Taylor series expansion, Ilhk-‘ytk-‘)II, Ilhkycy”)(l,Ilhk+‘yg+‘)II and Ilhk+‘yCkf2)ll form an increasing or decreasing sequence, respectively. The new stepsize is chosen so that the error at the new order k, estimated as if the last k + 1 steps were taken at the new stepsize, satisfies the condition: MIIYn+L

-YFi,ll

G 1.0,

(5)

where

M=m=bk+ltn

+ l),Iak+l(n

+ 1) + a, - aO(n + 1)1],

ak+l(n

+I)=

t ‘zi _k , n+L

I

a”(n + 1) = - i ajtn + l), j-1 a=

B=)’ a,= hn+1=

--,

a, hn+1

!%, -aY$‘i,, k 1 -]Q>

tn+l - f”,

which bounds the error estimate taking into account the interpolation error and the local truncation error. To control the local error the current stepsize is rejected whenever the condition (5) is not satisfied. Note that condition (5) requires that the estimated integration error be less than the requested integration error tolerances, as reflected in the weighted

A. Ii.

686 norm

of yi”i, IlY.+*

the

predictor-corrector

SIXXHI et al.

difference

11,where 11 y 11is given by:

(6) where N is the system dimension, and wt is the weight vector reflecting the relative and absolute error tolerances. 3. DYNAMIC

XTERATWE

METHODS

The basic idea behind the dynamic iterative methods is analogous to the standard Jacobi and Gauss-Seidel iterations used to solve linear systems of equations. The iteration process starts with an initial guess of the solution of the original system and is carried out repeatedly until satisfactory convergence is achieved. However, waveform relaxation operates on groups of function approximations (waveforms) rather than on groups of real values. During each iteration, each decomposed subsystem is solved for its assigned unknown variables in the given time interval by using the approximate waveforms of its neighbor subsystems (“neighbors” in connective sense). For iterations like Gauss-Seidel, the waveform solution obtained by solving one decomposed subsystem is immediately used to update its approximate waveform used by other subsystems. For Jacobi iterations, all approximate waveforms are updated at the beginning of the next iteration. Let us formulate the decoupled system for waveform relaxation. Without loss of generality, we can rewrite (1) as follows:

~p(t,Yp&.ib,dp,u)=O

Yp(~o)=Ypo,jb(~o)=)i,,9

is the vector of waveforms. Naturally, z,(7) is the discretized representation of a waveform, where t is taken to be the mesh points generated inside the subinterval [t,,, tb] by the integrator, and for any particular timepoint which is not a mesh point the waveform can provide it by interpolation. Note that the elements of z(z), generally, have different numbers of timepoints (different wnueleng&), according to the subsystem activities. Finally, we define neighbor waveforms by the following ordered sets of timepoints: a71

=

{4(2)/t

i=l,2

E 7 t7f}.

,...,

(10)

p,

and

V(~)=(V~,VZ,..-,V~)~,

(11)

as the vector of neighbor waveforms, which is held fixed during the integration process for a specific iteration, and for the jth iteration the decoupling vectors in (10) are set according to the iteration scheme utilized, for example, for i = 1,2, . . . , p: dj=(y{-I,...,

y;:;,y<;:

)...,

Jq-‘, . . . ,+{rt,j{;;

y;-I’,

,...)

j{-I),

(12)

. . ,jr;-‘),

(13)

for Jacobi iterations, and dj-

Cy’I,...,yj-,,

I-1 ...,y;-'. yi+,,

3’1,. . . ,j{_,,jj;f,.

for Gauss-Seidel iterations. Also here, any particular timepoint which is not a mesh point can be provided by the neighbor waveform interpolation. Thus, the basic structure of a WR algorithm can be abstractly described for a given subinterval as shown in Fig. 1. 7s = Pq’ $+,I

(7) where, for i = 1,2, . . _,p, yi E R”i is the subvector of the unknown variables assigned to the ith partitioned F,* R x R=’ x R”’ x R2N-2”’ x R’+R”’ subsystem, and di is the def&pling vector which contains all the unknown components of y which are not in yi and all time derivative components of 3 not in pi _ Now, if we consider the decoupling vector of each subsystem in (7) as an input vector, we can solve (1) for a given time interval by solving iteratively p independent subsystems. In the following, we represent waveforms by the ordered set of timepoints: 21(z) = (n(t)lt

f7

= It,, r,l=7,

i=l,2

,...,

= [to,

$I>,

(8)

p,

where [to, +] is the interval of interest, and Z(T)=(21rZ2,...,Zp)=,

(9)

set j = 0 and ~“(r-~) = ve (an initial

guess)

do for

i = 1,2,...

solve

,p

Fi(T*, Zj, ii+, Vf,U) = 0 { Yi(%)

= Yi(tg),

set j = j + 1 and &(T,)

I

until

?$(tq)

= $i(tq)

= v$$&T~)

convergence

Fig. I. Basic waveform relaxation algorithm for a given subinterval r4 = [tq, tq+ ,I.

Concurrent solution of DAEs

Convergence is achieved when 11 zi - zj- ’ 11 m is sui%iently small To illustrate how the waveform iteration proceeds until convergence, consider the following simple example of ordinary differential equations (Skjellum, 1990): a(t)

-y(t)

= 0,

{ 3(t) +x(t)

= 0,

ithm shown in Fig.

column their corresponding numerical results are plotted. The exact solution, represented by the finite series, is well approximated after a few Jacobi iterations, about seven iterations for this example. In the next sections we describe the basic algorithm in more detail, including our strategies to guess an

in the interval 7 = [0, T], and initial condition given for x(0) = 0 [i(O) = l] and y(O) = 1 [p(O) = 01. Now, if we assign x to the first equation and y to the second equation, we can solve (14) in r by using the algor-

jr =

-x

Picard-Lindelof initial

,

1, where p = 2. Simple initial

guesses for x,,(l) and j+,(t), t E r, are x,(t) =x(O) and yO(t) =y(O). The resulting waveforms at different iterations of the basic algorithm are shown in Fig. 2. The left column of Fig. 2 shows the results of analytical integration of each iterate, and in the right

(14)

e.g.: k = y , x(0)

687

= 0

y(0) = 1

t E co.Tl

iterations

(Jacobi)

guess: x0(0 = 0 ye(t) = 1

first iteration: x1 = yo + x&J = t 4’1 = -JL3 + y*(t) = 1

:te

1

2

3

2

3

second iteration: 4 = y, --+ xz(t) = t 92 = -x1 -+ y&t) = 1 - 5

:m

third iteration: 2. = yz -9 x3(t) = t - 5 9, = -x2 +

y,(t) = 1 - 5

e

1

fourth iteration: rt4 = y3 + x&)=t-5 y4=

-x,

--+ y&)=

1-$+$

fifth iteration: - XI - y4 --* xl(t) = t - 5

9s = -x4

+

y&)

= 1 -5

. .

. + 5 + 5

=th iteration: xJt)

=

2 <-1Yt”” ,d

y-0)

= $

(2j+l)! *

.

. . I

= sin(t) = cos(t)

Fig. 2. Waveforms at different Jacobi iterations of the basic algorithm. The left column shows the analytical integration, and the right column shows their corresponding numerical results.

A. R. SJXCXIet al.

688

initial neighbor waveform v, and to measure difference between waveforms.

4. ALGORITHM

the

DESCRIPTIDN

The WR algorithm consists of three parts: the system partition phase, the integration phase, and the relaxation phase. In the system partition phase, DAWRS executes six multioption well-defined steps (assignment, grouping, ordering, placement, process generation and neighborhood), shown in Fig. 3, including a totally or partially user-defined partitioning. Also, DAWRS allows arbitrary sets of equations to be addressed (named) by implicit mapping, whereas

y1

Y2

YJ

..-

Y.-l

Y.

in Mattisson (1986) and Skjellum et al. (1988) the approach is device- or template-oriented. That is, DAWRS addresses the variables and equations locally (in each cluster or block) as independent systems and, globally as defined by the user, allowing complete interaction between the algorithm and the user. Figure 4 illustrates how DAWRS maps the local indices to the global indices (those given by the user). The vectors col and row of each subsystem contain the global indices of the variables and equations, respectively. Thus, given a system F(t, Y, Y, u) = 0 with global indices { 1,2, . . . , N), this system is partitioned into p subsystems Fi(t,r,3, u) = 0 with local indices (1,2, . . , , n,}, where COZ,maps y, 3 into Y, I’ and rowi maps Fi into F. The subsystems

b I

X

b) Grouping

a) Assignment

c) Ordering

d) Placement

e) Process Generation

f_lNeighborhood

Fig. 3. The six partitioning steps.

Concurrent solution of DAEs y1

:=

YmMl)

. fl

:=

F,,,(l)

y2

:=

ycol,(z)

‘: f2

:=

6wv,C2>

Ycol~(nl)

: f”I

:=

~loWlh>

:=

cowv,(l)

Yn1:= user-defined global vet tor Y,, - -.Yt.l and

global functions F,, . .J=N

689

y1 := Y=&(l) YZ := Y&(2) ytn := Y&(m>

.

fl

Fl&y,~.dl,u)

= 0

. f2 := FlDw*(2> : *- fna := F,.xv,(ti

. . . YI

:=

Yc.z+(l>

. fi

:=

K.,wp
y2

:=

Yco4¶C2)

. f2 . :

:=

%wpf_2~

Fig. 4. Variable-equation addressing. The subsystems F,, Fz, , F, are treated as independent systems during each waveform iteration. The interactions are made by the decoupling vectors (i, ,4,. . . , dP. The Ui =,...., ,col,= Ui =,...., .rowi={1,2,.. ., N) and col,r\colj = row,nrow, = H, indexsetscolandrowsatisfy Vi #j.

F,,F2, _ . _, Fp are treated as independent systems of DAEs during each WR iteration. The interactions are made by the decoupling vectors d,, d2, . . . , dP, which are viewed as input vectors by the subsystems. With this local addressing formulation we can use any software available to integrate each subsystem of DAEs with no additional effort. In the waveform relaxation literature, the authors usually refer to the partitioning as the way they group the equations in different blocks. The assignment process is either considered as another phase or is not considered at all. When a system of DAEs is solved as a single system, the variables do not need to be assigned to a specific equation. However, when a system of DAEs is divided in several subsystems, it is very important to know which set of variables will he solved by each subsystem. We consider the assignment process as the first stage in the system partitioning, followed by the other five stages described in Section 4.1. The partition phase plus all other computations prior to the simulation (data file readings, parameter settings, spawning of the processes among the processors, database downloadings,...) constitute the setup phase of the algorithm. Figure 5 shows how the DAWRS algorithm was designed. The statistics phase shown in Fig. 5 does all timing and counting analyses for the simulation. 4.1. The partition phase The partition phase is the most crucial phase to the success of the WR method. It affects the consistency, convergence and efficiency characteristics of the WR algorithm. Here we describe the principal topics that tNote that the unknown variables are composed of state and nonstate variables.

involve each one of the six step executed for the complete partition of the system of DAEs. Figure 3 illustrates the function of each step. 4.1.1. Assignment process. To explain the assignment process, let us consider the following example of DAEs: 3,+v*--

=o, Yl -

Y2

=

v,(O)=O,

(15)

0.

(16)

This system has a consistent assignment if y, and y2 are assigned to equations (15) and (16), respectively, otherwise we could have a inconsistent assignment and convergence to an incorrect solution using the WR method (Lelarasmee et al., 1982). An assignment process is said to be consistent with a given dynamic system if any choice of the state vector of its associated decomposed system is also a valid choice of the state vector of the given systemt (Lelarasmee et al., 1982). Also, a system of DAEs can have several consistent assignments giving different convergence characteristics to the partitioned system. Thus, a combinatorial optimization may reduce the number of candidates to an optimal assignment (Lelarasmee, 1982; Lawler, 1976; Martello and Toth, 1987). The combinatorial optimization associated with the system interactions for high DAEs-index system is complex, but is straightforward for DAEs-index zero systems, namely, ODES. Together with this technique we make use of the information in the iteration matrix to choose one among the remaining candidates, resulting in a two-step optimization. In order to find a consistent assignment for the optimization problem, we must be able to identify directly from the system of DAEs a set of variables that can form a state vector of the system. However,

A. R. SECCHIet al.

DAVVRS Differential-Algebraic - Waveform Relaxation Solver

/

---. . .

DAWRS

Fig. 5. Global structure of the DAWRS algorithm. since we are dealing with an implicit system of DAEs, it is impractical to determine explicitly the states of the system. Therefore, the state variables are identified symbolically by means of a dependency structure of the system equations which is given in the form of a d@erentiaf-ulgebraic dependency matrix,defined by: D, = 0

1 D, = 2 D, = 3 D,,=

if if if if

the the the the

ith ith ith ith

equation does not equation involves equation involves equation involves

involve yi or gI yj but not jj 3, but not yj both yj and jjj

and the module of the d~erential-algebraic matrix of the system (1) by:

for some given a.

iteration

With the definitions above, and introducing binary variables xii which take the value 1 if and only if variable i is assigned to equation i, we can formulate our problem into an integer linear program, more specifically into a weighted matchingproblem (Lawler, 1976; Martello and Toth, 1987): N

N

max d = C C w,,xij, t-Ii-1

(17)

subject to ,i,x,j=l,

i=l,...,

N,

,i, XI/ = 1, i=l,...,

N,

x,,=O

or 1,

i,j=l,...,

N,

Concurrent solution of DAEs where wij = min(2, Dij). Naturally, the above formulation can lead to multiple optimal solutions. Therefore, another optimization step has to be carried out in order to select only one optimal solution. Let us define a matching as the set X = {(i, j)/x,. = 1, i, j E (1, __ _ , N}} with no two variables (equations) assigned to the same equation (variable). Then, r~ is the weight of the matching X. A matching that leaves no variables unassigned, denoted by Xc, is said to be complete, that is, 1Xc I= N, where 1.1denotes the cardinality of a set. In other words, a complete matching is a matching that satisfies all the constraints of the weighted matching problem. If a complete matching has the maximum weight Q* then it is said to be a maximum complete weighted matching, denoted by X*. If the above optimization leads to multiple optimal solutions, then the sum of the associated weights: max k (&

(18)

Dij,

is maximized in order to choose one optimal solution among the set of maximum complete weighted matchings, Xi = X*(a,*). Equation (19) is used if the optimization given by equation (18) still leaves more than one possible maximum weighted complete matching XL. A well known algorithm to solve this problem is the Hungarian method, a primal-dual algorithm (Lawler, 1976; Martello and Toth, 1987). The time complexity of this method is O(N’) for dense matrices, compared with O(N’) for the primal algorithms derived from the simplex method. In Lelarasmee (1982), the author gives the following result which states that the state vector of a given system of equations is associated with a maximum weighted complete matching of its assignment problem. Lemma 4.1. Given a system of DAEs in N unknown variables, let X* be a maximum weighted complete matching of its associated assignment problem. Then the set of variables {yj/(i, j)~ X* and wij = 2) is a set of state variables of the system and o * - N is the number of states where c* is the weight of x*. For example,

the system of DAEs:

itl+a,jZ+aZy,+u,=O, 31+ a& i,

I +

ad2

+ a4y2 + a5y3 + u2= 0, +

a7y3

+

u3 =

(20)

0,

wherea,#O,i=l,..., 7, and a, # a3. has four complete weighted matchings, X; = ((1, l), (2,2), (3,3)},

691

xs = {(1,2), (2, I), (3,3)}, xs = ((1, 11, C&3), (3,2)} and X: = { (1,2), (2, 3), (3, I)}, whose respective weights are Q] = 5, u2 =5, a,=4 and u,=3. The weight of the maximum weighted complete matching, Xy or X:, is o* = 5, indicating that the number of state variables is 5 - 3 = 2. Also, for this example, the second optimization steps selects the matching X* = Xi as the optimal solution for the assignment problem once Z,, Exi D, = 7 and ZciJ,Exj D, = 5. Thus, if we can %nd a maximum weighted complete matching for the assignment problem associated with the matrices D and J* and define the state variables according to Lemma 4.1, then we have found a consistent assignment for the given system. 4.1.2. Grouping. After the system assignment, we can group tightly coupled unknown variables, ideally yielding more loosely coupled subsystems. The main reason for the grouping step is to get a higher convergence rate for the overall system, but in addition we reduce the influence of the neighbor interpolation error in the residual evaluation. As the major issue for any iterative method is convergence, to enhance convergence we have to attend more carefully to those partition-phase steps that have supposedly the larger effect. In other words, we want to reduce the contraction constant y of the contractive map /zj - z*ll m < y’llz* - .z*I( m to the solution .z* in the waveform space. It is clear that the convergence rate depends on how much the unknown variables influence each other and, when we put strongly coupled unknown variables together we are reducing the contraction constant. However, it is not simple to evaluate the contraction constant for a given problem. Also, for a large system it is very expensive to estimate y for all grouping possibilities. Several approaches have been applied to group coupled unknown variables together (White and Vincentelli, 1985; Peterson, 1989). Here we describe a technique based on graph theory and information from the entries of the module of the iteration matrix J*. Obviously, the maximum number of subsystems is the number of unknown variables, that is, when no grouping is necessary or requested (or block size = 1). In the literature it is also called pointwise partitioning. As we increase the size of the subsystems to group tightly coupled unknown variables, we are also increasing, in a nonlinear fashion, the work that has to be done for solving those systems. Additionally, it becomes more difficult to load balance. Thus, large subsystems are only justified if the improvement in the convergence rate compensates for the increase in the execution time due to extra work and load imbalancing, resulting in a faster global simulation.

692

A. R. SECCH~ et al.

We can express our problem to find the strongly coupled unknown variables as a graph partitioning problem which yields a simple algorithm with low time complexity. Thus, in order to describe the grouping algorithm, let us define a graph G( V, A), is a set of vertices and where F={l,...,N} A = {(u. u)/u, 0 E V, tl # 0, O.., # 0} is a set of edges. The index e, in D represents the index equation that the uth variable was assigned to, that is, the matrix (D,“,) is a row permutation of the matrix (D,,). Each vertex v E Y corresponds to an assigned variable @)-equation (e,) pair. The edges (u, U) E A correspond to the off-diagonal nonzero elements of the matrix (D,,). If the edges in A are ordered pairs (u, v) of vertices, then the graph G is a directed graph, or a digraph, where u is the tail and v is the head of the edge. The edge (u, v) is said to Ieave vertex u and enter vertex v. The restriction u #v in A means that the graph G has no self-loops, that is, edges of the form (u, U) are not allowed. To describe the technique based on the graph theory, a few more definitions have to be introduced. A path p : uz v in G is a sequence of vertices and edges leading from u to v. A tree T is a digraph having one vertex which is the head of no edges, called the root, and such that all vertices, except the root, are the head of exactly one edge. An edge (u, v) of T is denoted by u + v, where u is the parent of v and v is a child of u, and a path from u to v in T is denoted by II--* *v, where u is an ancestor of v and v is a descendant of U. If u is a vertex in a tree T, T, is a subtree of T having as vertices all the descendants of u in T. A subgraph Gi = (Vi, Ai) s G consists of a subset of vertices Viz V and all edges (u, v) such that A, = {(u, v) E A/u, v E Vi}. A tree T is said to be a spanning tree of a digraph G if T is a subgraph of G and T contains all vertices of G. A subgraph Gi of a digraph G is said to be strongly connected if there is a path from any of its vertices to any other, that is, for each pair of vertices u, v in G, there exists paths pl: u g v and p2: v & u. If Gi is a strongly connected subgraph of G and it cannot be enlarged to another strongly connected subgraph by adding extra vertices and associated edges, then G, is said to be a strong component of G, that is, Gi is not a subgraph of any other strongly connected subgraph of G. Clearly, each vertex can belong to only one strong component, which may consist of a single vertex, so the strong compontents define a partition of a given digraph. A graph search has to be carried out to find all strong components of G. There are many ways of searching a graph, depending upon the way in which the edges to search are selected (Tarjan, 1972; Duff and Reid, 1978). The search that selects an edge to

traverse and always chooses an edge emanating from the vertex most recently reached which still has unexplored edges, is called a depth-first search. A depth-first search is very easy to program either iteratively or recursively. We utilize the Tarjan’s depth-first search algorithm (Tarjan, 1972; Duff and Reid, 1978; Duff, 1978), which has a time complexity of O(N) + G(lA I). To briefly describe Tarjan’s algorithm, consider what happens when a depth-first search is performed on a digraph G. For example, see the graph represented in Fig. 6b associated with the matrix D in Fig. 6a. The set of edges which lead to a new vertex when traversed during the search form a tree, represented by the solid arrows in Fig. 6c. The other edges fall into three classes. Some are edges running from ancestors to descendants in the tree [edge (1,4) in Fig. SC]. These edges may be ignored, because they do not affect the strongly connected components of G. Some edges, called fronds, run from descendants to ancestors in the tree. Other edges run from one subtree to another in the tree; these edges are called cross-links, see Fig. 6c. When the depth-first search algorithm is applied to a digraph G until all the edges are explored, a set of trees FC which contain all the

vertices of G, and sets of fronds and cross-links are created. FG is also called a spanning forest of G. A digraph consisting of a spanning forest and sets of fronds and cross-links is called a jungle. Figure 6 illustrates this process applied to a given problem after a consistent assignment process, where the resulting strong components, or blocks Bi, are shown in Fig. 6d. At this time we have done only a symbolic grouping, with no use of information on how strong the connections are. However, this process should be enough if the size of the blocks, or strong components, created do not overload the costs previously discussed. But, if it is not satisfied, we have to break big blocks into small ones in order to achieve the tradeoff between faster convergence and higher overheads. A big block has to be broken in such a way that a large increase in the contraction constant is avoided. It can be viewed as eliminating some small off-diagonal elements of the module of the iteration matrix associated with that block. An inexpensive but crude heuristic to break a big block into small blocks, with the sizes bounded by some specification, is to estimate the contraction constants between the elements of the block (Peterson, 1989; White and Vincentelli, 1985). If these estimates are high (above some specified value), then the elements are tied together, and the procedure is repeated until the size of the new block reaches its

Concurrent solution of DAEs

693

1

'dlld12 0 dH 0 0 0 dza dB 0 0 0 0 0 dS3 dw 0 0 000~a,000 0 0 dS, 0 dS5 a, 00000~00 0 d,dO dm 0 0 0 0 0 0 ,dsl0

0 0 0 dar d,, 0 0

0

7

3

d,, 0 d8,da

(a) dependency matrix after a row permutation.

5 (b) digraph G associated to the matrix D.

6,. BS

1

s&2 1

a

7

3

5 (c) Jungle generated by depth-first search, formed by sets of trees (q), and mnds and Qoss-wks (- ->). The edge (1,4) may be ignored.

B, 4

(d) strong connected components of the dlgraph G.

Fig. 6. A depth-first search applied to a given problem represented by its dependency matrix D. bound, or there are no more elements to add. An alternative approach is to apply the depth first search algorithm recursively in each subgraph Gi associated with a big block &. First, the edge of the subgraph Gi representing the smallest off-diagonal element of the corresponding block matrix .ijI’ (and 0,) is removed. Then, the depth-first search algorithm is applied to Gi as if it were a new problem. This procedure is repeated until all block sizes satisfy the specified bound. Naturally, the grouping strategies above are performed only once, at the beginning of the simulation. In principle, different groupings may be suitable along the simulation. However, the gain in the rate of convergence by a dynamic grouping strategy usually does not compensate for its overhead (Peterson, 1989; Peterson and Mattisson, 1991). 4.1.3. Or&ring. Another way to improve the convergence rate is to exploit the directionality of information flow within the system by means of iterations like Gauss-Seidel. These methods require some ordering of the system [e.g. n-coloring (Jordan, 1986; 198811 to get reasonable contraction Ortega, upper

CACE1717-E

constants. Of course, if the Jacobi iteration scheme is used, then no ordering is necessary. The n-coloring approach can be viewed as an intermediate scheme between Jacobi and Gauss-Seidel iterations (n = 1 =z-Jacobi iteration, n =p a Gauss-Seidel iteration). A natural way to mix the Jacobi iteration with the Gauss-Seidel iteration is when several blocks are placed in each processor. Inside of each processor a local Gauss-Seidel iteration can take place, once the execution is sequential, while among the processor we can have a global Jacobi iteration. Thus, only a local ordering is necessary. Peterson and Mattisson (1989) present an n-colored Gauss-Seidel algorithm for the global iteration in the WR method for circuit simulation. It is based on the same principle as the red-black coloring algorithm commonly used for partial differential equations (Ortega, 1988). The problem of coloring the subsystems is exactly the same as coloring a graph, with the key idea of obtaining as few colors as possible while still assuring that no neighbor subsystems have the same color. The ordering is then

A. R. SECCMet al.

694

obtained by solving subsystems with color 1 before color 2 and so on. The main difficulty with this approach is to find the starting point among the subsystems to give an adequate color. The algorithm proposed by the authors starts at the circuit nodes connected to the input voltage source with the fastest time-constant. An alternative to this approach is to exploit the directionality of the information flow given by the block structures, as explained below. It is easy to see that there must be at least one strong component, obtained by using the depth-first search algorithm, such that there is no path from any of its vertex to any vertex of another strong component. Letting this strong component be called B,, the remaining strong components B2, B3, . . . , BP may be chosen similarly so that there is no path from any vertex of one strong component to a vertex of a strong component later in the sequence (Duff and Reid, 1978). Thus, the resulting structure of this ordering strategy can be associated with a block lower trianguhr matrix, with blocks corresponding to the strong components. This procedure is illustrated in Fig. 3c, where the blocks preserve their old number to clarify the ordering process. If no color is given for that structure, a pure Gauss-Seidel iteration is the best scheme to exploit the flow, but is not well suitable for parallel machines. Therefore, to take advantage of the parallel machines, the system has to be ordered in order to obtain as many subsystems as possible that can be solved simultaneously. In other words, the coloring process should give as few colors as possible for the system. If we disregard weak-neighbor connections in the block structure, then we allow more subsystems to be solved simultaneously. Therefore, let us step down such a block structure, giving appropriate colors to the blocks by disregarding those neighbor connections for which: llJ$ II < 6,

(21)

where 6 is given a threshold and Jt is the block submatrix of J*, with i representing all row indices of the matrix Bi and j representing all column indices of the matrix Bi. Then, we obtain another n-colored system to be solved by iterations like Gauss-Seidel. 4.1.4. Placement. Each subsystem of DAEs commonly has different residual evaluation time and/or different stiffness in the integration process. Thus we need to place the subsystems optimally among the processors in order to obtain a good dynamic load balancing (placement step in DAWRS). Also we have to estimate the communication cost to exchange waveforms as another parameter to proceed with the placement step. The evaluation time for residuals is easily computed, but both the stiffness and the

communication cost are related to the wavelength and may vary during the simulation. We use only static placement schemes to avoid the overhead associated with a change in the database, but we are aware of the possibility of using dynamic placement schemes if the communication costs allow it. The investigation into dynamic placement is given in Peterson and Mattisson (1991). The authors compare static placement to the ideal case, that is, when the overhead due to the parallelization is not taken into account (time spent packing database and waveforms, or time spent waiting for waveforms). In their experiments the increase of the elapsed time compared to the time for the ideal case was always less than 50%. Moreover, in most cases it was substantially closer to the ideal. Also, this kind of comparison does not take into account the replacement overheads (resizing stacks, exchange database, etc.). It only indicates how far the static placement is from the perfect one. However, it is reasonable to say that in general a change in the placement of the subsystems during the simulation is too time consuming on a multicomputer. Dynamic load balancing may be valuable in problems that need to be simulated repeatedly, as is the case for optimization problems (Olde and Mattisson, 1991). In this case it is possible to use information from earlier runs to improve the load balancing in the following runs. There are many possibilities to distribute statically the subsystems among the processors: randomly; by uniformly distributing the number of equations per processor; by the order that the subsystem are solved; by uniformly distributing the cost of the residual evaluations (of course, this should take into account the iteration matrix computation and the number of Newton iterations to solve the subsystem); by uniformly distributing the communications costs (a reasonable estimate for these costs is the number of neighbors of each subsystem); by uniformly distributing the stiffness estimate of the subsystems. Also, it is possible to make combinations of some of the options above. However, as most of the information used to make a static placement changes along the simulation (problem dependent), it is not possible to tell which static placement scheme is the best. 4.15. Process generation. At this stage we have a disconnected partitioned system of DAEs (see Fig. 3d); in the process generation step, we actually create the processes with all clustered subsystems to

695

Concurrent solution of DABS be loaded into the available multicomputer nodes. In this step the subsystems shall receive the particular information about how they have to be solved. For example, the maximum BDF order they are allowed to use or the integration method; how they have to calculate their iteration matrices (numerically or analytically, dense or sparse); which norm they have to use; which database will be available, etc. Figure 3e illustrates how these processes are created. Each process constitutes a list of a small system of DAEs together with all necessary information to be loaded in its correspondent processor. 4.1.6. Neighborhood. As yet, we have not established the communication interconnections, which is to be done in the neighborhood step. This last step of the partitioning phase finds all interconnected subsystems either by means of user information or by perturbation over the unknown variables, and establishes the pattern of waveform communications during the dynamic simulation. It is clear that the differential-algebraic dependency matrix D is a symbolic representation of the perturbation over the unknown variables, and it can provide all the necessary information about the interconnections. Figure 3f graphically represents the results of this step for the given example, where the arrows represent the interconnections. As mentioned earlier, all six steps of the partitioning phase can be totally or partially defined by the user, because all schemes are made statically, at the beginning of the simulation. Therefore, some information could be missing during the partitioning. For example, if some model equation has to be replaced by a new one in the middle of the simulation, and this new equation includes other neighbor unknown variables than the neighbor unknown variables of the previous model equation, then a partitioning with no user information would lead to unexpected results. 4.2. Integration

phase

It is of central importance for the WR method that the integrator works effectively for stiff systems of DAEs. Also, in order to reduce the costs of communication, it is important that the integrator solves a subsystem of DAEs with as few timepoints as possible. The DAWRS’ integration phase was modeled on the DASSL package (briefly described in Section 2). First, if not supplied, the initial derivatives are obtained as in DASSL by a small implicit Euler step. This can be interpreted as WR iterations with a one-timepoint wavelength. Next, with a consistent initial condition, DAWRS performs the integration of each subsystem within the specified time interval. In addition to the prediction-correction formulas, DAWRS has a neighbor interpolation formula to

provide the values and derivatives of the neighbor waveforms for each timestep making them continuous input functions. To avoid convergence problems, the neighbor interpolation formula should be at least a CL interpolant, that is, the solution and its first time derivative should be continuous over the range of integration

(Berzins,

4.3. Relaxation

1986; Watts,

1989).

phase

In the relaxation phase DAWRS iterates the waveforms until convergence. The way this is done depends on the selected iterative methods (Jacobi, Gauss-Seidel, asynchronous iterations, etc.). An efficient scheduler process (Fig. 5) with a low sequential fraction, controls the overall system convergence. Also, as successfully implemented in the CONCISE simulator (Mattisson, 1986), a concurrent circuit simulator program designed to dynamically simulate VLSI/ULSI circuits, we incorporate a dynamic waveform splitting strategy where we form subintervals of waveforms (windows). This strategy helps us to avoid redundancies in the timepoint calculation, to address nodal memory limitations, to handle input discontinuities and, most centrally, to mitigate slow-convergence problems. That is, after a certain number of iterations, the current window is split into two new windows, so that the flrst part will converge in a few more iterations and the second part will be less expensive than the original window. In Mattisson (1986) the windowing may force a last shorter timestep in the window since integration steps are not allowed to pass a window frame except when a new window is started. Therefore, the windowing may slow down the integration algorithm for the next window. We avoid this problem by choosing the timestep to the next window not to be based on the last timestep of the current window, but on the previous timestep history. The splitting strategy is one way to enhance the waveform convergence; in the next section we discuss several techniques to improve the convergence of WR methods. 5. CONVERGENCE CRITERIA

AND

TECHNIQUES

The key principle behind the convergence of iterative methods is the concept of contraction of a map (Ortega and Rheinboldt, 1970). If we define a map W: z w z such that F+‘(z(r)) is the solution of the decomposed system (7) in T, then the WR iteration can be rewritten as a fixed-point algorithm: ~‘(7)

=

w(2j-l(7)),

j

=

1,2,.

..,

(22)

with z*(t) = W(z*(r)) the solution of the given system (1) in z. The condition (5) controls the local error to generate an approximate sequence of W. We still

A. R.

696

have

to

check

if this

sequence

is a

SECCHI

convergent

sequence in the waveform space. In order to prove that the WR algorithm is convergent (is a contraction map), the following nonuniform norm was used in Lelarasmee ef al. (1982):

-

Y’-‘(t)11

i=l,2

,...,

p,

(24)

for the current jth WR iteration, and norm given (6). Whenever the condition (24) is satisfied for timepoints in the current time interval we check the convergence of the whole waveform by following condition: ~llz~-z~-~ll,QCz


i=1,2,...,p,

by all for the

(25)

where llz’ - zj-‘ll

m = mfrx 11 y’(t)

al.

(Edwards, 1965), p is not bounded by [0, 1). In other words, values of p > 1 do not imply a divergent bound of p* = min sequence. An experimental (p, 0.99) appears to be very efficient. Associated with the splitting strategy we can derive the waveform sequence either by use of a relaxation parameter o E (0,2), yielding: zj=(l

(23)

where b > 0. Although the convergence in this norm guarantees that the WR method is robust, it does not insure computational efficiency. In fact, the computations performed on the last part of the time interval are essentially wasted. Thus, to have a better computational behavior, the WR algorithm should converge uniformly, that is, the error should be reduced over the entire time-interval at each iteration (White and Vincentelli, 1985). In order to check if the sequence (22) is a convergent sequence, the following criteria have to be satisfied in DAWRS. First, we verify the timepoint convergence by the condition: IIY+-y{-‘Il<~l,

er

- Yj- i(t)11,

6, and lz scale the weight vector, wt, to the allowed waveform tolerence, and p is an estimate of the convergence rate given by (Edwards, 1965; Shampine, 1980):

(26) Since waveforms from different iterations may have different timepoints, either y{(t) or Y{-‘(t) in condition (24) has to be interpolated at the chosen time t. Because we can easily obtain the polynomial coefficients to interpolate y<(t) from the current BDF data, t is chosen to be the nearest low timepoint of the previous waveform to the current timepoint. Also we note that because the %niform” global convergence criterion (25) is tighter than the usual condition (by using the nonuniform norm) to prove that the operator W is a contractive map in the Banach space

-eJ)z’-‘+&I+-‘),

(27)

where o changes as the simulation proceeds, or introducing a variable local error criterion replacing the condition (5) by: MIlY.+ 1 -YiOi,

II G 53

(28)

where c E [1, co) goes along with llzj - zj- ’ IIm from a prespecified value to one. While the relaxation parameter attempts to give more stability to the WR iteration when w c 1 and to accelerate the convergence when o > 1, the variable local error criterion prevents the integrator from using shorter timesteps. Higher local error criteria allow larger integration timesteps. The basic WR algorithm (Fig. 1), converges when all waveforms zi converge. However, in practice, a more rigorous global convergence criterion gives a better performance to the algorithm. For example, Fig. 7 shows a situation, using the Jacobi iteration, where at the second iteration both subsystems (zi and z2) reached convergence, but at that point their neighbor waveforms (u, and u,, respectively) had not converged yet. In the next iteration, subsystem 1, using an updated neighbor waveform, did not satisfy conditions (24) and (25), and one more iteration was needed. Therefore, we have to adopt a more stringent criterion for the waveform convergence. We consider that a waveform zi has only converged when it satisfies conditions (24) and (25)

WR

l- subsystem

iteration

1

l-subsystem

F

16t?

F”

F F

F F"

F F

2nd

T

T”

T

T”

rd

T

F

F

T

4th

T’

T

T’

T

5th

1

v2

Oth

3

2

T'

T'

Fig. 7. A typical WR iteration sequence. F: not converged, T: converged. The basic algorithm ends at the 0 sequence. and the modified algorithm ends at the l sequence.

Concurrent solution of DAEs

697

Another successful modification of the basic WR algorithm, summarized in Fig. 8, is that DAWRS retains useful information from past waveforms, which makes convergence faster than with conventional approaches. To explain this change, let us introduce three simple and representative examples:

(1) a linear system (condition number ≈ 100):

    ẏ1 + 0.1 y1 − y2 = 0,
    ẏ2 + 100 y1 + y2 = 0,        y1(0) = y2(0) = 1;        (29)

(2) a linear time-varying system:

    ẏ1 + y1 + K cos(t) X = 0,
    ẏ2 + y2 + K sin(t) X = 0,    y1(0) = y2(0) = 1,        (30)

where K = 999 and X = y1 cos(t) + y2 sin(t);

(3) a nonlinear system:

    ẏ1 + 1002 y1 − 1000 y2² = 0,
    ẏ2 − y1 + (1 + y2) y2 = 0,   y1(0) = y2(0) = 1.        (31)

Case I in Fig. 9 shows the integration results over the time interval τ = [0, 1] using the basic WR algorithm. The number of windows was fixed at 10 (printing interval = 0.1), the relative and absolute tolerances were set to 10⁻³ and zero, respectively, and ε1 = 3 × ε2 = 10⁻⁴.

Now, based on these results, we introduce modifications to the basic WR algorithm, divided into four cases. The modifications are cumulative, i.e. Case II starts from Case I and adds its own modification, and so on. First, in Case II, DAWRS follows the neighbor integration history, holding the head of the neighbor waveform from the previous window (see Fig. 8), so that higher-order polynomials can be used to interpolate the neighbor waveform at the beginning of the current window. As Fig. 9 shows, this modification reduces the number of timepoints by about 33% for Example 1 and by 16% for the other two.

Second, instead of discarding the timepoints beyond the splitting point, we incorporate a merging strategy (see Fig. 8). When the first part of a split window converges, DAWRS merges it into the second part, generating a more reliable initial guess for the next window (the second part of the split window). Because a window is only split after a reasonable number of iterations, the second part of a split window already has a good approximation of the waveform for that time interval. If splitting occurs because the user-specified number of allowed timepoints is exceeded, then the timepoints beyond the splitting point obviously have to be discarded. The merging strategy is represented by Case III in Fig. 9, where a 10% improvement is observed in Example 2. Because no splitting was necessary in Examples 1 and 3, there is no change from Case II to Case III. (The remaining modifications, Cases IV and V, are described below, following Figs 8 and 9.)
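As an illustration of the kind of iteration these examples are subjected to, the sketch below applies Jacobi waveform relaxation to Example 1, split into two one-equation subsystems. It uses fixed-step implicit Euler over a single window and a constant initial guess, whereas DAWRS uses variable-step, variable-order BDF, adaptive windows and the modifications described here; the sketch only conveys the structure of the iteration.

```c
/* Minimal Jacobi WR sketch for Example 1 (not the DAWRS implementation). */
#include <math.h>
#include <stdio.h>

#define NPTS 101      /* timepoints on the window [0, 1]   */
#define WTOL 1e-6     /* waveform convergence tolerance    */

int main(void)
{
    double h = 1.0 / (NPTS - 1);
    double y1[NPTS], y2[NPTS], v1[NPTS], v2[NPTS];
    for (int k = 0; k < NPTS; ++k)          /* initial guess: constant waveforms */
        y1[k] = y2[k] = v1[k] = v2[k] = 1.0;

    int sweeps = 0;
    double diff = 1.0;
    while (diff > WTOL) {
        /* subsystem 1:  y1' = -0.1*y1 + v2,  v2 = previous iterate of y2 */
        for (int k = 1; k < NPTS; ++k)
            y1[k] = (y1[k - 1] + h * v2[k]) / (1.0 + 0.1 * h);
        /* subsystem 2:  y2' = -100*v1 - y2,  v1 = previous iterate of y1 */
        for (int k = 1; k < NPTS; ++k)
            y2[k] = (y2[k - 1] - 100.0 * h * v1[k]) / (1.0 + h);

        /* sup-norm change of both waveforms over the window, then exchange */
        diff = 0.0;
        for (int k = 0; k < NPTS; ++k) {
            if (fabs(y1[k] - v1[k]) > diff) diff = fabs(y1[k] - v1[k]);
            if (fabs(y2[k] - v2[k]) > diff) diff = fabs(y2[k] - v2[k]);
            v1[k] = y1[k];
            v2[k] = y2[k];
        }
        ++sweeps;
    }
    printf("converged in %d WR sweeps: y1(1) = %g, y2(1) = %g\n",
           sweeps, y1[NPTS - 1], y2[NPTS - 1]);
    return 0;
}
```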

[Figure: a waveform is represented as a discrete function (a set of timepoints) with variable-order Lagrangean interpolation, a prediction horizon, minimum and maximum length, its integration history, and the retained head of the previous window's waveform.]

Fig. 8. Waveform definition and characteristics.
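A sketch of what a waveform record with the characteristics of Fig. 8 might look like follows; all field names are assumptions made for illustration, not the actual DAWRS declarations.

```c
/* Sketch of a waveform record mirroring Fig. 8; field names are assumed. */
typedef struct {
    int     npts;             /* number of stored timepoints (discrete function)    */
    int     min_pts, max_pts; /* minimum and maximum allowed waveform length        */
    double *t;                /* timepoints                                         */
    double *y;                /* values at the timepoints                           */
    int     avg_bdf_order;    /* average BDF order, used as interpolation order     */
    int     horizon_pts;      /* extrapolated points forming the prediction horizon */
    int     head_pts;         /* timepoints kept from the previous window's head,   */
    double *head_t;           /*   so that high-order interpolation is possible     */
    double *head_y;           /*   at the beginning of the current window           */
} Waveform;
```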


[Figure 9 table: for each of the three examples and for Cases I–X it lists the number of windows, WR iterations and timepoints. Cases VI–IX use fixed interpolation orders 2–5, Case X the maximum window order, and Cases I–V the average BDF order; Cases VI–X are to be compared with Case IV, since they use the same conditions. Relative tolerance = 1×10⁻³, absolute tolerance = 0, waveform tolerance = 1×10⁻⁴.]

Fig. 9. DAWRS convergence characteristics. Three simple examples showing the modifications in the basic WR algorithm.

The third modification, represented by Case IV, is the inclusion of a prediction horizon for the neighbor waveforms. When a neighbor waveform is not yet available, DAWRS predicts its initial guess, v_i(t), by polynomial extrapolation. This prediction is carried out as long as the estimates of the leading error terms in the integrator form a decreasing sequence (see Section 2); from the point at which this condition fails to the end of the window, the initial guess is kept constant. This strategy substantially improves the results: by about 70% for the first example and by a factor of four for the other two. Finally, DAWRS can optionally insert timepoints into those neighbor waveforms that have fewer timepoints than necessary for the interpolation order. The insertion is done by polynomial interpolation using their own BDF data, that is, before the waveforms are shared. This last modification is optional because the algebraic equations of the DAEs are not automatically satisfied at the interpolated points, which could compromise the WR convergence. Case V in Fig. 9 shows the results of this modification.

†In DAWRS the definition of the neighbor waveform, (10), does not include the derivatives, except at the first and last timepoints. The derivatives are obtained by differentiating the interpolation polynomials.
‡The maximum BDF order was fixed at 5 in all simulations.

Because the number of timepoints per window in Examples 1 and 3 was already sufficient for the required polynomial interpolation order, no change is observed from Case IV to Case V; only a slight improvement occurs for Example 2. In general, all of these modifications to the basic WR algorithm improved the efficiency of the WR method, with the prediction horizon strategy giving the largest gains. Next, we discuss how DAWRS chooses the order of the polynomial interpolator for the neighbor waveforms. As mentioned in the previous section, since the waveforms have timesteps independent of one another, DAWRS uses a neighbor interpolation formula to provide the values and derivatives of the neighbor waveforms at the integration steps.† For each timestep the integrator selects the best BDF order to obtain a timepoint. Thus, if each timepoint in the neighbor waveforms were tagged with its order, the interpolation polynomial could use the optimal order. However, to reduce the communication overhead among processors, the messages should be kept as short as possible. Therefore, DAWRS sends the waveforms to their neighbors together with only their average BDF order for the time interval of the current window. In Fig. 9, Cases VI–X compare this approach‡ with fixed-order and maximum-window-order interpolation polynomials.

The maximum window order scheme uses the maximum BDF order for the time interval of the current window instead of the average order. Case IV in Fig. 9 represents the average BDF order scheme, Cases VI–IX show the results for fixed orders 2–5, respectively, and Case X represents the maximum window order scheme. Although Example 3 shows better performance when the order is fixed at 5, a performance degradation would likely occur in less active regions. Also, when the linear coefficients of Example 1, (0.1, −1, 100, 1), are replaced by (0.01, −100, 0, 100), respectively, a fixed order of 3 seems to be better than a fixed order of 5. Thus, to track changes in activity and problem dependencies, a variable interpolation order given by the average BDF order seems to be a good choice.
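The following is a minimal sketch of such a neighbor interpolation, assuming the neighbor's timepoints and values have been received together with its average BDF order k; the function name and interface are illustrative, and the derivative evaluation (obtained in DAWRS by differentiating the same polynomial) is omitted.

```c
/* Sketch: evaluate a neighbor waveform at time t by Lagrange interpolation
 * of order k, using the k+1 stored timepoints nearest to t. */
#include <stddef.h>

double interp_neighbor(const double *tp, const double *yp, size_t n,
                       int k, double t)
{
    /* locate the first stored timepoint at or beyond t */
    size_t j = 0;
    while (j < n && tp[j] < t) ++j;

    /* choose k+1 points roughly centered on t, clipped to the stored range */
    size_t npts = (size_t)k + 1;
    if (npts > n) npts = n;
    size_t lo = (j > npts / 2) ? j - npts / 2 : 0;
    if (lo + npts > n) lo = n - npts;

    /* classical Lagrange form */
    double y = 0.0;
    for (size_t i = lo; i < lo + npts; ++i) {
        double li = 1.0;
        for (size_t m = lo; m < lo + npts; ++m)
            if (m != i) li *= (t - tp[m]) / (tp[i] - tp[m]);
        y += li * yp[i];
    }
    return y;
}
```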

6. APPLICATION PROBLEMS

To present more realistic results obtained with our computational package, we consider here two examples of distillation column networks. Both networks are arranged in a tree sequence, separating eight alcohols: methanol, ethanol, propan-1-ol, propan-2-ol, butan-1-ol, 2-methylpropan-1-ol, butan-2-ol and 2-methylpropan-2-ol. The basic assumptions used in the distillation model are the following:

• negligible vapor holdup on the trays;
• the pressure profile in the column is constant over time;
• liquid and vapor phases are in thermodynamic equilibrium at all times;
• constant liquid holdup and molar flowrates;
• no energy balance.

Fig. 10. Three-distillation-column network. The real relative speedup Sp(P) and efficiency η(P) are represented by the solid lines, and their lower and upper estimates by the dashed lines.


The thermodynamic model uses Wilson's correlations for the liquid activity and the Antoine equation for the vapor pressure. The vapor–liquid equilibrium is obtained by iterative calculations. The state variables are the liquid compositions; their dynamic behavior is determined from the mass balance for each component. The first network (Fig. 10) has three distillation columns with 64 trays each, totaling 1536 unknown variables, and the second network (Fig. 11) has seven distillation columns with 143 trays each, totaling 8008 unknown variables. Each tray is initialized to a nonsteady condition, and the system is relaxed to the steady state governed by a single feed stream to the first column of each sequence.
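Under the assumptions above, the state equations reduce to a component mass balance per tray. A minimal sketch of the residual for one interior tray is shown below; the constant holdup M, the flowrates L and V, and all names are illustrative assumptions (the feed tray, condenser and reboiler need additional terms, and the vapor compositions come from the iterative equilibrium calculation).

```c
/* Sketch of an interior-tray component mass balance, constant L, V and M:
 *   M * dx[c]/dt = L*x_above[c] + V*y_below[c] - L*x[c] - V*y[c]          */
#define NC 8   /* eight alcohols */

void tray_residual(double M, double L, double V,
                   const double x_above[NC],  /* liquid from the tray above      */
                   const double x[NC],        /* liquid on this tray (states)    */
                   const double y[NC],        /* vapor leaving this tray (VLE)   */
                   const double y_below[NC],  /* vapor from the tray below       */
                   const double dxdt[NC],     /* current state derivatives       */
                   double res[NC])            /* residual of the DAE subsystem   */
{
    for (int c = 0; c < NC; ++c)
        res[c] = M * dxdt[c]
               - (L * x_above[c] + V * y_below[c] - L * x[c] - V * y[c]);
}
```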

7. EXPERIMENTAL RESULTS

According to the distillation model used, each tray has eight tightly coupled equations. The best grouping for this formulation was found to be one tray per subsystem (or block). Larger subsystems are disadvantageous because of expensive subsystem calculations, and smaller subsystems are disadvantageous because of excessive WR iterations (around 7–10 against 4–5) and massive waveform exchange. Another partitioning step that deserves attention for this kind of problem is the placement. Because we assume different holdups for the reboiler and the condenser of each distillation column, we can distinguish four types of "trays" with significantly different activities (reboiler, condenser, feed tray and the other trays). Therefore, we can expect a load-balancing problem. Experimentally, the maximum/minimum node CPU-time ratio was found to be between 1.5 and 3 when the workload is evened out in terms of the number of equations per node. DAWRS automatically reduced this ratio to 1.0–1.7 by evening out the workload for the residual and Jacobian evaluations, while also taking into account the size of each subsystem (see Figs 12e and f).
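A simple placement heuristic of this kind is sketched below: blocks are distributed so that the most loaded node stays as light as possible, using per-block cost estimates (residual plus Jacobian work, weighted by block size). The longest-processing-time greedy rule shown is an illustration, not necessarily the rule implemented in DAWRS.

```c
/* Sketch: greedy (longest-processing-time) placement of blocks on nodes. */
#include <stdlib.h>

typedef struct { int id; double cost; } Block;

static int by_cost_desc(const void *a, const void *b)
{
    double d = ((const Block *)b)->cost - ((const Block *)a)->cost;
    return (d > 0) - (d < 0);
}

/* place[block id] receives the processor assigned to that block;
 * load[] returns the resulting per-processor work estimate. */
void place_blocks(Block *blocks, int nblocks, int nprocs,
                  int *place, double *load)
{
    for (int p = 0; p < nprocs; ++p) load[p] = 0.0;
    qsort(blocks, (size_t)nblocks, sizeof *blocks, by_cost_desc);
    for (int i = 0; i < nblocks; ++i) {
        int best = 0;                        /* currently least loaded node */
        for (int p = 1; p < nprocs; ++p)
            if (load[p] < load[best]) best = p;
        place[blocks[i].id] = best;
        load[best] += blocks[i].cost;
    }
}
```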

Fig. 11. Seven-distillation-column network. The dashed lines represent the lower and upper estimates for the relative speedup Sp(P) and efficiency η(P).


Fig. 12. Load balancing analysis for the distillation column networks.

Naturally, the ratio of 3 for the three-column example when running on 192 processors (Fig. 12e) cannot be decreased, because in this case we have only one block per processor. We also show in Fig. 12 the total waveform waiting time and the total waiting time per node, that is, the total time that each node spent waiting for the neighbor waveforms (Figs 12a and b) and the total time that each node spent waiting for new tasks (Figs 12c and d). The increase shown in Figs 12c and d is due to the increase of work left to the scheduler process as the number of nodes increases (see Fig. 5). This work can be reduced by applying combine operations among the processors.
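One way to realize such a combine is a tree reduction of per-node status (for instance, convergence flags), so that node 0 receives the combined value in log2(P) steps instead of P messages arriving at the scheduler. The sketch below is illustrative only; csend/crecv are placeholders supplied by the caller, not the primitives of any particular message-passing library.

```c
/* Sketch: binary-tree AND-combine of a per-node flag toward node 0.
 * csend/crecv are caller-supplied placeholders for the machine's
 * message-passing primitives. */
int combine_and(int my_id, int nprocs, int my_flag,
                void (*csend)(int dest, const void *buf, int len),
                int  (*crecv)(int src, void *buf, int len))
{
    for (int step = 1; step < nprocs; step <<= 1) {
        int partner = my_id ^ step;
        if (partner >= nprocs) continue;
        if (my_id & step) {                 /* leaf at this level: send up */
            csend(partner, &my_flag, (int)sizeof my_flag);
            break;
        } else {                            /* parent: receive and combine */
            int other;
            crecv(partner, &other, (int)sizeof other);
            my_flag = my_flag && other;
        }
    }
    return my_flag;                         /* combined result, valid on node 0 */
}
```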


[Figure: speedup and efficiency plotted against the number of nodes for the 3-column and 7-column examples; the curves show the calculated, fair and ideal speedup and efficiency together with their lower and upper estimates.]

Fig. 13. DAWRS algorithm compared to "the best" sequential algorithm for the two distillation column network examples.

Figures 10 and 11 show the speedup obtained on a 192-node Symult s2010 multicomputer.† The tolerances indicated in Fig. 9 were used in these simulations. The solid lines in the speedup and efficiency graphics of Fig. 10 are the relative speedup and efficiency, respectively, given by:

    Sp(P) = T(1)/T(P),        η(P) = Sp(P)/P,        (32)

where T(P) = Node(P) + Host(P); Host(P) and Node(P) are the host time and the maximum node CPU-time, respectively, and P is the number of processors.

†A single Symult s2010 node, based on a Motorola 68020/68882 (25 MHz) chipset, has performance similar to a Sun 3/60. The same qualitative results were obtained on a 64-node Intel iPSC/860 multicomputer, but 4.5–5 times faster.

Due to memory limitation, the relative speedup and efficiency could not be calculated for the seven-distillation-column network. Therefore, an estimate for this speedup has to be defined. Assuming that the contributions of the host and scheduler processes† to the total CPU-time are negligible when running on only one node (no-sequential-fraction assumption), then:

    T(1) ≈ Node(1).        (33)

Now, assuming that the time spent on one node, Node(1), can be equally distributed among the P processors (no-communication assumption), we can write:

    T(1) = Node(1) ≈ Σ_{i=1}^{P} Node_i(P).        (34)

Therefore, an upper estimate for the relative speedup and efficiency can be written as follows:

    Sp^u(P) = [CPU(P) − Host(P)] / Node(P),        η^u(P) = Sp^u(P)/P,        (35)

where CPU(P) = Host(P) + Σ_{i=1}^{P} Node_i(P). Note that we put only Node(P) in the denominator instead of T(P), so that this is an estimate assuming no sequential fraction and no communication. The inequality Sp^u(P) ≥ Sp(P) is easily proved considering that CPU(P) ≥ CPU(1). We can also define a lower estimate for the relative speedup and efficiency as follows:

    Sp^l(P) = [CPU(P) − Host(P) − ψ(P)] / T(P),        η^l(P) = Sp^l(P)/P,

where ψ(P) represents the parallelization overheads (time spent packing, unpacking, sending and receiving messages and waveforms, and some other overheads of the algorithm implementation). Note that we may also consider

    Sp*(P) = CPU(P) / T(P),        η*(P) = Sp*(P)/P,

as an estimate for the relative speedup that lies between the lower and upper estimates. Actually, this equation can replace equation (35) as an upper estimate for the relative speedup. The dashed lines in the speedup and efficiency graphics (Figs 10 and 11) represent the lower and upper estimates for the relative speedup and efficiency.

Finally, Fig. 13 compares the above results with "the best" sequential algorithm, which here is considered to be DASSL associated with an efficient sparse solver [the Sparse 1.3 solver package (Kundert and Vincentelli, 1988)]. The fair speedup and efficiency are determined by:

    Sp_f(P) = T_s / T(P),        η_f(P) = Sp_f(P)/P,

where T_s is the CPU-time on one node for "the best" sequential algorithm. The results show that fair speedups over 50 could be obtained if more processors were available. Also, above 3 processors the parallel algorithm starts to be faster than "the best" sequential algorithm.

†The host time includes the scheduler time.
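For concreteness, a small sketch of how these quantities can be computed from measured timings follows; the numbers are hypothetical and the field names are illustrative, not taken from DAWRS.

```c
/* Sketch: relative speedup/efficiency and the lower and upper estimates
 * of equations (32)-(35) from (hypothetical) measured timings. */
#include <stdio.h>

typedef struct {
    int    P;          /* number of processors                     */
    double host;       /* Host(P): host (+ scheduler) time         */
    double node_max;   /* Node(P): maximum node CPU-time           */
    double node_sum;   /* sum over i of Node_i(P)                  */
    double overhead;   /* psi(P): packing/communication overheads  */
} Timing;

int main(void)
{
    Timing one = {  1,  0.0, 3600.0, 3600.0,   0.0 };  /* hypothetical 1-node run  */
    Timing par = { 64, 20.0,   95.0, 3800.0, 300.0 };  /* hypothetical 64-node run */

    double T1  = one.node_max + one.host;               /* T(1)   */
    double TP  = par.node_max + par.host;               /* T(P)   */
    double CPU = par.host + par.node_sum;               /* CPU(P) */

    double Sp    = T1 / TP;                              /* eq. (32)        */
    double Sp_up = (CPU - par.host) / par.node_max;      /* eq. (35), upper */
    double Sp_lo = (CPU - par.host - par.overhead) / TP; /* lower estimate  */

    printf("Sp = %.1f  eta = %.2f  upper = %.1f  lower = %.1f\n",
           Sp, Sp / par.P, Sp_up, Sp_lo);
    return 0;
}
```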

8. CONCLUSION AND FUTURE WORK

We have shown significant gains in performance as a result of our new approaches to the manipulation of the waveforms. In future work we intend to provide new partitioning strategies, such as new ordering schemes for iterations like Gauss–Seidel, placement schemes based on communication costs, and a more detailed grouping analysis, in order to achieve higher convergence rates. The experimental results with DAWRS, applied to distillation column networks, already show that the WR method is a strong candidate as a concurrent flowsheeting simulation methodology. We expect to apply the technique to many other interesting applications in chemical engineering, and to make time comparisons with powerful sequential algorithms and other concurrent methods (e.g. direct methods).

REFERENCES

Berzins M., A C¹ interpolant for codes based on backward differentiation formulae. J. Appl. Numer. Math. 2, 109-118 (1986).
Brenan K. E., S. L. Campbell and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations. North-Holland, New York (1989).
Chung Y. and A. W. Westerberg, Solving stiff DAE systems as 'near' index problems. Technical report, Engineering Design Research Center and Dept of Chemical Engineering, Carnegie Mellon University, Pittsburgh (1991).
Duff I. S., Algorithm 529: permutation to block triangular form [F1]. ACM Trans. Math. Soft. 4, 189-192 (1978).
Duff I. S. and J. K. Reid, An implementation of Tarjan's algorithm for the block triangularization of a matrix. ACM Trans. Math. Soft. 4, 137-147 (1978).
Edwards R. E., Functional Analysis: Theory and Applications. Holt, Rinehart & Winston, New York (1965).
Jackson K. R. and R. S. Davis, An alternative implementation of variable step-size multistep formulas for stiff ODEs. ACM Trans. Math. Soft. 6, 295-318 (1980).


Jordan H., Structuring parallel algorithms in an MIMD, shared memory environment. Parallel Comput. 3, 93-110 (1986).
Kundert K. S. and A. L. S. Vincentelli, A sparse linear equation solver—user's guide. Technical report, Dept of EE and CS, University of California, Berkeley, CA (1988).
Lawler E. L., Combinatorial Optimization: Networks and Matroids. Holt, Rinehart & Winston, New York (1976).
Lelarasmee E., The waveform relaxation method for time domain analysis of large scale integrated circuits. Ph.D. Thesis, University of California, Berkeley, CA (1982).
Lelarasmee E., A. E. Ruehli and A. L. S. Vincentelli, The waveform relaxation method for time-domain analysis of large scale integrated circuits. IEEE Trans. CAD Integrated Circuits Syst. CAD-1, 131-145 (1982).
Martello S. and P. Toth, Linear assignment problems. Surveys in Combinatorial Optimization (S. Martello, G. Laporte, M. Minoux and C. Ribeiro, Eds), pp. 259-282. North-Holland, New York (1987).
Mattisson S., MOS-VLSI circuit simulation formulation for concurrent execution. Technical Report 5996:DF:83, Computer Science Dept, Caltech, Pasadena (1983).
Mattisson S., CONCISE: a concurrent circuit simulation program. Ph.D. Thesis, Tillämpad Elektronik, Dept of Applied Electronics, Lund, Sweden (1986).
Newton A. R., Techniques for the simulation of large-scale integrated circuits. IEEE Trans. Circuits Syst. CAS-26, 741-749 (1979).
Olde B. and S. Mattisson, Waveform relaxation-based circuit optimization. In Proc. 7th Conf. on the Numerical Analysis of Semiconductor Devices and Integrated Systems, Boulder, CO (1991).
Ortega J. M., Introduction to Parallel and Vector Solution of Linear Systems. Plenum Press, New York (1988).
Ortega J. M. and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, San Diego (1970).
Peterson L., A study of convergence-enhancing techniques for concurrent waveform relaxation. Technical Report, Tillämpad Elektronik, Dept of Applied Electronics, Lund, Sweden (1989).
Peterson L. and S. Mattisson, Circuit partitioning and ordering of the waveform relaxation used in circuit simulation. In Proc. IEEE Int. Symp. on Circuits and Systems (1989).


Peterson L. and S. Mattisson, Practical and ideal parallelism for circuit simulation using waveform relaxation. In Proc. 7th Conf. on the Numerical Analysis of Semiconductor Devices and Integrated Systems, pp. 67-68, Boulder, CO (1991).
Petzold L. R., DASSL: differential algebraic system solver. Technical Report Category D2A2, Sandia National Laboratories, Livermore, CA (1983).
Saleh R. A., D. Webber, E. Kia and A. S. Vincentelli, Parallel waveform-Newton algorithms for circuit simulation. In Proc. IEEE ISCAS'87, pp. 660-663 (1987).
Shampine L. F., Implementation of implicit formulas for the solution of ODEs. SIAM J. Sci. Stat. Comput. 1, 103-118 (1980).
Skjellum A., Concurrent dynamic simulation: multicomputer algorithms research applied to ordinary differential-algebraic process systems in chemical engineering. Ph.D. Thesis, California Institute of Technology, Pasadena, CA (1990).
Skjellum A., M. Morari and S. Mattisson, Waveform relaxation for concurrent dynamic simulation of distillation columns. In Proc. The Third Conf. on Hypercube Concurrent Computers and Applications, Vol. 2, pp. 1061-1071, Pasadena (1988).
Smart D. W., Parallel processing techniques for the simulation of MOS VLSI circuits using waveform relaxation. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Urbana, IL (1988).
Tarjan R., Depth-first search and linear graph algorithms. SIAM J. Comput. 1, 146-160 (1972).
Vandewalle S. and D. Roose, The parallel waveform relaxation multigrid method. In Proc. The Third SIAM Conf. on Parallel Processing for Scientific Computing, pp. 152-156 (1987).
Watts H. A., A smooth output interpolation process for BDF codes. J. Comput. Appl. Math. 397-419 (1989).
White J. and A. L. S. Vincentelli, Partitioning algorithms and parallel implementations of waveform relaxation algorithms for circuit simulation. In Proc. IEEE ISCAS'85, pp. 221-224 (1985).
White J. K., Multirate integration properties of waveform relaxation with application to circuit simulation and parallel computation. Ph.D. Thesis, University of California, Berkeley, CA (1985).