Computers chem. Engng, Vol. 13, No. 8, pp. 899-914, 1989. Printed in Great Britain. Copyright 1989 Maxwell Pergamon Macmillan plc.

PARALLEL IMPLEMENTATION OF SPARSE LU DECOMPOSITION FOR CHEMICAL ENGINEERING APPLICATIONS

A. B. COON and M. A. STADTHERR†

Department of Chemical Engineering, University of Illinois, 1209 West California Street, Urbana, IL 61801, U.S.A.

†Author to whom all correspondence should be addressed.

(Received 14 June 1988; final revision received 25 October 1988; received for publication 15 November 1988)

Abstract-Parallel, shared-memory computer architectures can provide enormous computational resources for numerical simulations in many chemical engineering applications. The efficient solution of large, sparse sets of linear equations is often a critical requirement in these applications; thus the parallel implementation of a sparse linear solver is crucial. We examine methods for the parallel LU decomposition of sparse matrices arising from 3-D grid graphs (typically associated with finite difference discretizations in three dimensions). Parallel performance of the decomposition is significantly enhanced by the use of a new diagonal variant of nested dissection, in conjunction with a parallel dense solver at certain stages. The overall parallel efficiency exceeds that of the parallel dense solver. We also consider the extension of such methods to the less structured graphs that arise from sparse linear systems in equation-based process flowsheeting. We use performance results from the 3-D grid graphs to make inferences regarding generalizations of these methods to flowsheeting applications.

INTRODUCTION

Recent developments in digital computing have presented a number of opportunities to chemical engineers in several areas. In particular, the areas of steady state process simulation and design (flowsheeting) and optimization are likely to benefit from the use of parallel computing because of the potential for much greater computational rates. These increases in computational speed are attractive not only because they will make possible the use of more realistic unit models involving complex phenomena and the tractability of larger simulations, but also because they will result in the increased productivity of the design engineer (Stadtherr and Vegeais, 1985).

Although many parallel computers have extremely fast clock cycles, their ability to achieve high computational rates is also a result of the extensive use of various forms of parallelism within each of those computers. The vector processing and multiprocessing architectures of today's supercomputers are both examples of parallelism in state-of-the-art machines. The potential for high computational rates on a parallel computer is obviously related to the extent to which its parallel architecture can be exploited, and a program that executes on one of these machines should take advantage of the architecture of that machine.

The accurate and efficient solution of large sets of sparse linear equations is a critical requirement in the simulation of many phenomena that are of interest to chemical engineers. These large sparse matrices arise in the linearization of the finite difference equations that approximate the partial differential equations (PDEs) which describe the phenomena. (Large, sparse linear systems also occur in the solution of these PDEs using finite element methods.) The role of the linear equation solver in equation-based (EB) process flowsheeting is no less crucial. The solution of large sparse sets of linear equations can require up to 90% of the total solution time in the EB flowsheeting approach (Hilton, 1981). Thus, the use of parallelism in the linear solution routine of a process flowsheeting package or in the discretized solution of PDEs is necessary to achieve satisfactory performance of these programs on a parallel architecture machine.

In this paper, we consider the parallel solution of linear systems arising from 3-D finite difference grids (with a seven-point difference operator), with the intention of generalizing successful solution strategies so that they may be applied to EB flowsheeting. We use the incidence graph of the sparse matrix to model properties of a nested dissection ordering with respect to parallel LU decomposition. The sufficiently regular structure of graphs of the 3-D grids allows us to treat this case in more detail than we are able to for the EB flowsheeting case, in which the graph of the symmetric part of the sparse matrix has a much less regular structure. Nonetheless, we are still able to use the same tools (i.e. graph theoretical descriptions) for developing generalizations of the parallel strategies that have been applied to 3-D grid problems and other sparse linear systems.

BACKGROUND

In the following section, we present some graph theoretical notation and its relationship with LU decomposition. The notation is similar to that used in George and Liu (1981) and Peters (1984). We then discuss (using this notation) the basic ideas involved in generating a parallel pivoting strategy from the nested dissection method. Finally, we briefly review some prior applications of nested dissection to 2-D discretizations.

Graph notation and LU decomposition

A graph G = (X, E) is an ordered pair of disjoint sets, where X is a nonempty set of vertices (or nodes) and E is a set of edges, with each edge being an unordered pair of distinct vertices. (In this case, the graph is said to be undirected; in the case that the edges are ordered pairs of distinct vertices, the graph is directed.) The number of vertices in X, say N, is called the order of G. Two vertices are adjacent if there is an edge that joins them, that is if (u, v) ∈ E for u, v ∈ X. In this case, u and v are said to be incident to such an edge. A graph H = (Y, F) is a subgraph of G = (X, E) if Y ⊆ X and F ⊆ E. An ordered graph G = (X, E, α) is an ordered triple, where X is a vertex set, E is an edge set and α is an ordering, which is a bijective map α: {1, 2, ..., N} → X. With an N × N structurally symmetric matrix A (i.e. the occurrence matrix of A is symmetric), one may associate an undirected ordered graph G = (X, E, α). For each i = 1, ..., N, the ith row and variable are associated with a unique element of X, and we may denote by x_i that x ∈ X for which x = α(i). (One should note that because we associate both a row and a variable with each vertex, two different orderings of X will correspond to two different symmetric permutations of A.) The edge set of G is then E(G) = {(x_i, x_j) | x_i, x_j ∈ X and a_ij ≠ 0, i ≠ j}. Hence, the vertices correspond to the locations of diagonal elements of A and the edges correspond to the locations of nonzero off-diagonal elements of A.
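This matrix-graph correspondence is straightforward to realize in code. The following sketch (ours, not from the paper; Python is used here purely for illustration) builds the vertex and edge sets of G from the occurrence pattern of a structurally symmetric matrix:

```python
import numpy as np

def ordered_graph(A):
    """Vertex and edge sets of the ordered graph G = (X, E, alpha) of a
    structurally symmetric matrix A: vertex i stands for the ith row and
    variable (alpha(i) = x_i), and an unordered pair (i, j), i < j, is an
    edge exactly when the off-diagonal coefficient a_ij is nonzero."""
    N = A.shape[0]
    X = list(range(N))
    E = {(i, j) for i in range(N) for j in range(i + 1, N) if A[i, j] != 0}
    return X, E

# a 4 x 4 arrowhead pattern: vertex 3 is adjacent to all other vertices
A = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 1., 1., 1.]])
X, E = ordered_graph(A)
print(sorted(E))          # [(0, 3), (1, 3), (2, 3)]
```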

Fig. 1. An ordered graph and the corresponding occurrence matrix.

In Fig. 1, the numbered circles represent the vertices and their ordering, and the lines connecting those circles represent the edges. A path from u to v (for u, v ∈ X) is a sequence of distinct vertices (u = x_0, x_1, ..., x_k = v) such that (x_{i-1}, x_i) ∈ E(G) for i = 1, ..., k. This path is said to have length k. A graph is connected if, for any pair of distinct vertices u and v, there is a path from u to v. A disconnected graph is a union of connected subgraphs called connected components. If Y ⊆ X, and F = {(w, z) | w, z ∈ Y} ∩ E, then H = (Y, F) is the subgraph of G induced by Y. We will denote the subgraph of G induced by Y as G[Y]. Given a connected graph G = (X, E), a set S ⊂ X is called a separator if G[X\S] (where X\S is the set of elements of X that are not in S) is not connected. A separator S is a minimal separator if no proper subset of S is also a separator. For a graph G = (X, E, α), a path (x_0, x_1, ..., x_k) is an α-path (Peters, 1984) if α^{-1}(x_i) < min{α^{-1}(x_0), α^{-1}(x_k)} for i = 1, ..., k - 1.

Assuming that the LU decomposition of A proceeds with pivots taken in order from the diagonal (no row or column interchanges), the elimination overwrites A according to

a_ij ← a_ij - a_ik (a_kk)^{-1} a_kj,   i, j = k + 1, ..., N,

where the elements a_ij for k + 1 ≤ i, j ≤ N are overwritten at the kth iteration of the outer loop. Consequently, we can associate each pivot (or, equivalently, each diagonal coefficient) of the matrix A with a vertex of its graph. Since the ordering α of the graph G = (X, E, α) will indicate the pivoting sequence, we will refer to the pivots as elements of {1, 2, ..., N}. To process the kth pivot, one determines the values of a_ik (a_kk)^{-1} a_kj for all k + 1 ≤ i, j ≤ N, and updates the appropriate a_ij as indicated above. For i, j ∈ {1, 2, ..., N}, i → j denotes that the results from the processing of pivot i are needed to process pivot j. We will denote the transitive closure of → by ⇒ (that is, i ⇒ j if i → j or if i ⇒ k ⇒ j for some k). Given a matrix A, partitioned into a 2 × 2 block matrix with submatrices A_11, A_12, A_21 and A_22, if A_11 is nonsingular then A_22 - A_21 A_11^{-1} A_12 is the Schur complement of A_11 in A.
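As a concrete illustration of the update rule above, here is a minimal sketch (our own; it assumes the pivots are processed in the natural order 1, ..., N with no interchanges, and stores L and U over A in the usual way):

```python
import numpy as np

def lu_in_place(A):
    """Outer-product (kij) LU decomposition without pivoting.  On return
    the array holds U in its upper triangle and the multipliers of L
    (unit diagonal implied) in its strict lower triangle."""
    A = A.astype(float).copy()
    N = A.shape[0]
    for k in range(N):                       # process the kth pivot
        A[k+1:, k] /= A[k, k]                # multipliers a_ik / a_kk
        # a_ij <- a_ij - a_ik (a_kk)^{-1} a_kj  for i, j = k+1, ..., N
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A

A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
LU = lu_in_place(A)
L = np.tril(LU, -1) + np.eye(3)
U = np.triu(LU)
assert np.allclose(L @ U, A)                 # reconstructs A
```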

Parallel LU decomposition and nested dissection

Nested dissection for sparse linear systems arising from an n × n regular grid was first analyzed by
George (1973). This method provides an ordering for the grid which reduces the number of arithmetic operations necessary for the factorization of the associated matrix. Since the incidence graph of a matrix arising from finite difference approximations on n × n regular grid blocks (with a block-centered discretization) is itself an n × n regular grid, the dissection strategy is applied to a graph with a highly regular structure. The basic strategy is to find a separator in the grid graph such that the resulting disconnected components are subgrids of approximately the same order. The key observation is that the elimination of a variable corresponding to a vertex in one of the two components creates fill-in only in locations corresponding to edges with one vertex in the component and the other vertex in the separator set. This means that no fill-in will occur in the blocks of the matrix corresponding to the edges with one vertex in one component and the other vertex in another component.

In this section, we present (in graph theoretical terms) the ideas behind the use of nested dissection to determine a pivot sequence for a parallel LU decomposition. The results given here are not limited to matrices whose associated graphs are grid graphs or even planar graphs; rather, they are applicable to any structurally symmetric matrix for which the previously stated assumptions about pivoting hold. The following lemma and its corollary are from Peters (1984), and although they were developed in the context of the LU decomposition of symmetric matrices, they hold for structurally symmetric matrices as well. (This can be seen by extending the proof given in Peters (1984) to consider the updating of nonzero coefficients in the pivot rows as well as the pivot columns.)

Lemma 1 (Peters, 1984). i → j if and only if there is an α-path from x_i to x_j and i < j.

Corollary (Peters, 1984). One can process i and j concurrently if and only if neither i ⇒ j nor j ⇒ i.

We present a second lemma which gives a property that can be used to implement parallel LU decomposition for matrices whose graphs are ordered with a nested dissection strategy.

Lemma 2. Given a connected G = (X, E, α) with a separator S and connected components C_1 = G[X_1] and C_2 = G[X_2] of G[X\S] (where X = S ∪ X_1 ∪ X_2): if max{α^{-1}(x) | x ∈ (X_1 ∪ X_2)} < α^{-1}(s) for all s ∈ S, then neither i ⇒ j nor j ⇒ i for all x_i ∈ X_1 and x_j ∈ X_2. Furthermore, if S is minimal, the converse holds as well.

Proof. All paths from any x_m ∈ X_1 to any x_n ∈ X_2 contain some s ∈ S, and since α^{-1}(s) > min(m, n), there is no α-path between x_m and x_n. Hence, neither m → n nor n → m for any x_m ∈ X_1, x_n ∈ X_2. The only other possibility for i ⇒ j or j ⇒ i is that there exists some s ∈ S such that m ⇒ α^{-1}(s) ⇒ n, or n ⇒ α^{-1}(s) ⇒ m, for some x_m ∈ X_1, x_n ∈ X_2. But with no s ∈ S such that m < α^{-1}(s) < n or n < α^{-1}(s) < m, neither chain can occur. For the converse, let S be minimal, so that each s ∈ S is adjacent to some x_i* ∈ X_1 and some x_j* ∈ X_2; an α-path through s would then give i* → j* or j* → i*, so α^{-1}(s) > min{i*, j*}, and since neither i* ⇒ j* nor j* ⇒ i*, α^{-1}(s) > max{i*, j*}.

Finally, in the theorem below we provide a relationship between the ordering at each step of the nested dissection method and a parallel LU decomposition strategy.

Theorem. Given a connected G = (X, E, α) with a separator S and connected components C_1 = G[X_1] and C_2 = G[X_2] of G[X\S] (with X = S ∪ X_1 ∪ X_2): if max{α^{-1}(x) | x ∈ (X_1 ∪ X_2)} < α^{-1}(s) for all s ∈ S, then one can process any pivot i concurrently with any pivot j for all x_i ∈ X_1 and x_j ∈ X_2.

Proof. By Lemma 2, neither i ⇒ j nor j ⇒ i, and using the Corollary from above, we obtain the result.

The above result can be visualized by considering the structure of a matrix whose graph G = (X, E, α) has an ordering α which maps {1, ..., n_1} into X_1, {n_1 + 1, ..., n_1 + n_2} into X_2 and {n_1 + n_2 + 1, ..., N} into S, where S is a separator, X = S ∪ X_1 ∪ X_2, and X_1 and X_2 have order n_1 and n_2, respectively. This ordering satisfies the condition that each vertex in the separator be mapped into an integer that is greater than the integers into which the vertices of the connected components are mapped. Figure 2 shows the structure of such a matrix and its LU decomposition. The connected components C_1 and C_2 correspond to the submatrices A_11 and A_22. The submatrices A_13 and A_31 are represented by the subset of E whose elements have both a vertex from X_1 and a vertex from S incident to them. Similarly, the subset of E whose elements have both a vertex from X_2 and a vertex from S incident to them represents the submatrices A_23 and A_32. The subgraph induced by the separator, G[S], corresponds to A_33. Since there are no vertices in X_1 which are adjacent to a vertex in X_2, or vice versa (otherwise G[X\S] would be connected), the coefficients of the submatrices A_12 and A_21 are all zero. Such an ordering of rows and columns in the matrix results in a structure which one can readily factor in block form. By the theorem given above, we know that processing of pivots in A_11 can proceed independently of pivoting in A_22, as we indicate in Fig. 2.

Fig. 2. The structure of the graph and the matrix after one level of the ordering.

This ordering corresponds to the first level of nested dissection. The method proceeds recursively by applying the constraint on the ordering to each connected component from the previous level, provided that component has a separator. (We will refer to a subgraph as a connected component from level k if it is a connected component of the subgraph induced by the set of original vertices less the separators of connected components from the first k - 1 levels, where the original graph itself is the connected component from the zeroth level. Hence, if a connected component from the previous level has a separator, and one removes the vertices of the separator from the vertex set of that connected component, the subgraph of the connected component from the previous level induced by the modified vertex set comprises the connected components of the current level.) At each level, the ordering of the vertices in a separator of a connected component from the previous level satisfies the condition of the theorem. That is, the vertices of the separator of a connected component are mapped into integers that are greater than the integers associated with the other vertices of that connected component.

Incomplete nested dissection (George et al., 1978) is a modification of this scheme in which the recursive process is stopped before the level at which there are no remaining connected components with separators. If one then requires that vertices of all separators be mapped into integers that are greater than those associated with connected components from the last level, the structure of the matrix with such an ordering will be bordered block diagonal form (BoBDF). After k levels of this ordering, if each connected component at every previous level has a separator, then the ordering yields a BoBDF matrix with 2^k blocks along the diagonal (see Fig. 3). These blocks can be factored independently, and hence, concurrently. After the diagonal blocks have been factored and the borders have been updated, the block factorization of the matrix A requires that the Schur complement of the block diagonal submatrix in A be factored. But this Schur complement will also have a BoBDF structure with 2^{k-1} diagonal blocks, as shown in Fig. 3. These 2^{k-1} diagonal blocks correspond to vertices from subsets (in fact, the separators) of the connected components from the k - 1 level. Since the ordering of the separators of the connected components from all previous levels (in particular, the k - 2 level) satisfies the condition given in the theorem, these diagonal blocks can also be factored independently (since the pivots correspond to subsets of the connected components from the k - 1 level). Thus, at subsequent stages of the overall block factorization, concurrent factorization of the diagonal blocks of Schur complements can occur as well.

Fig. 3. BoBDF of both the matrix and the Schur complement: the nonzero block structure in A after the first two levels (k = 2) of ordering, and in the Schur complement of the block diagonal submatrix of A (matrix coefficients corresponding to nonzero entries in the Schur complement are nonzero in the original matrix and hence are indicated in both).
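The theorem translates directly into a construction: number X_1 first, then X_2, then S, and the leading two diagonal blocks may be processed concurrently. A small sketch (ours) that builds such a permutation and checks that the coupling blocks A_12 and A_21 vanish:

```python
import numpy as np

def separator_order(X1, X2, S):
    """Permutation alpha mapping {1..n1} -> X1, {n1+1..n1+n2} -> X2
    and the largest integers to the separator S."""
    return list(X1) + list(X2) + list(S)

# Path graph 0-1-2-3-4-5-6: vertex 3 separates {0,1,2} from {4,5,6}.
N = 7
A = np.eye(N)
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

p = separator_order([0, 1, 2], [4, 5, 6], [3])
P = A[np.ix_(p, p)]                 # symmetrically permuted matrix
# The blocks coupling X1 and X2 are identically zero, so the pivots of
# A11 (rows 0-2) and A22 (rows 3-5) can be processed concurrently;
# fill-in is confined to the border rows/columns of S.
assert not P[0:3, 3:6].any() and not P[3:6, 0:3].any()
```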

In nested dissection of 2-D grids (George, 1973), the set of vertices on a mesh line whose removal disconnects the grid into two subgrids of approximately the same order is chosen as the first separator. From each of the resulting connected components (i.e. the two subgrids from the previous level), vertices on a mesh line perpendicular to the first level separator mesh line are chosen as the separator for that component. Since there may be several mesh lines in a component which are perpendicular to the mesh line of the previous level's separator, the mesh line chosen should be the one which will result in the most nearly equal order of the resulting connected components. Thus, the first stage (which comprises two levels) of the method consists of splitting the grid into four subgrids (of approximately equal order) by the choice of three separators: the second and third of these separators are on the same mesh line (chosen at the second level), and this mesh line is perpendicular to that of the first separator (chosen at the first level). The procedure is applied recursively to the subgrids that remain at each stage, until the connected components that remain cannot be dissected. The vertices of these nondissectable components are then numbered first, with the vertices of the separators numbered next, in the reverse order in which their separators were found. Hence, all of the connected components from all levels are ordered such that they satisfy the ordering condition discussed previously. (We later use a slight modification of this numbering that still satisfies the ordering criterion.) The nested dissection ordering of a 7 × 7 grid graph (from a five-point difference operator) is shown in Fig. 4.

The arithmetic complexity for the factorization of a symmetric positive-definite matrix associated with an n × n grid graph with an ordering given by nested dissection is 829n^3/84 + O(n^2 log_2 n) (George and Liu, 1981). (As is customary in matrix computations, the complexities or operation counts given here represent the number of multiplicative operations only, since the number of additive operations in matrix computations is roughly equal to the number of multiplicative operations.) For nonsymmetric systems which do not require partial pivoting, the arithmetic complexity is twice that of the symmetric case (Golub and Van Loan, 1983). With this in mind, the operation counts considered in the following will be assumed to be for the symmetric case unless otherwise stated.
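For readers who want to experiment with the ordering itself, the following compact sketch (our own, not the authors' code; it stops dissecting a component as soon as it degenerates to a single mesh line) generates a nested-dissection-style permutation of an n × n grid:

```python
def nested_dissection(rows, cols, vertical_first=True):
    """Return grid vertices (r, c) in nested dissection order:
    vertices of nondissectable components first, separator vertices
    last, in reverse order of the level at which they were found."""
    if len(rows) <= 1 or len(cols) <= 1:          # nondissectable strip
        return [(r, c) for r in rows for c in cols]
    if vertical_first:
        mid = cols[len(cols) // 2]                # separator mesh line
        a = nested_dissection(rows, [c for c in cols if c < mid], False)
        b = nested_dissection(rows, [c for c in cols if c > mid], False)
        sep = [(r, mid) for r in rows]
    else:
        mid = rows[len(rows) // 2]                # perpendicular line
        a = nested_dissection([r for r in rows if r < mid], cols, True)
        b = nested_dissection([r for r in rows if r > mid], cols, True)
        sep = [(mid, c) for c in cols]
    return a + b + sep                            # components, then separator

n = 7
order = nested_dissection(list(range(n)), list(range(n)))
assert len(order) == n * n                        # a true permutation
```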

Rose and Whitten (1976) report a complexity of 251n^3/48 for a diagonal nested dissection scheme originally proposed by Birkhoff and George (1973). In this diagonal method, the separators are sets of vertices lying along lines diagonal to the mesh lines of the grid. If the grid is thought of in the context of a Cartesian plane, and the mesh lines are described by lines of constant x and lines of constant y, the diagonal separators lie along lines of constant x + y and lines of constant x - y. Calahan (1979) has investigated the performance of a sparse elimination code on the Cray-1 using grids with this diagonal dissection ordering. He reports that the factorization of the blocks corresponding to the nondissectable components can proceed at rates of over 90 MFLOPS for the proper choice of the matrix multiply kernel. The factorization of the dense matrix corresponding to the separators can be executed at rates of over 100 MFLOPS.

George et al. (1978) have considered the implementation of incomplete nested dissection on vector computers. In incomplete nested dissection, the dissection process is stopped a given number of levels short of the level which results in the nondissectable components. Their study indicates that execution time for the factorization on vector computers is minimized by stopping the dissection process one to two levels short of completion. Dongarra and Johnsson (1987) have presented an algorithm for the solution of dense banded systems on parallel processors which is equivalent to incomplete nested dissection.

Peters (1985) and Duff (1986) have both suggested the natural extension of nested dissection on an n × n grid to parallel processing. If one considers the factorization of two submatrices whose corresponding vertex sets are such that a path from a vertex in one set to a vertex in the other must contain a vertex in a separator set, then the factorization of those two submatrices can proceed simultaneously, as we discussed in the previous section. In the next section, the generalization of nested dissection to 3-D grids and a hybrid of dissection methods for use on 3-D grids will be examined in the context of parallel processing.

PARALLEL IMPLEMENTATION FOR 3-D GRIDS

Rose and Whitten (1976) reported that the elimination complexity for an n × n × n grid graph using the regular nested dissection ordering extended to three dimensions is 3.86n^6 multiplications for a sequential implementation. A parallel implementation of 3-D nested dissection will be examined here. The reduction in the operation count that the diagonal nested dissection scheme offers over the regular nested dissection in two dimensions (the former requires roughly half the number required by the latter) provides the motivation for considering possible generalizations of this diagonal scheme to three dimensions. The proposed generalization can then be compared with the regular 3-D nested dissection in the context of parallel processing.

One possible generalization of the diagonal scheme would be to consider the vertices which lie along diagonal planes as the candidates for separator sets. If the grid is thought of in the context of a Cartesian coordinate system again, these would be defined by constant x ± y ± z. Such a dissection is depicted in Fig. 5. This suggested scheme will be rejected for two reasons: it results in connected components that are subgrids whose orders are not approximately equal, and the structure of the subgrids is sufficiently different from the original grid so as to inhibit a straightforward recursive application of the original procedure.

Fig. 5. 3-D dissection by diagonal planes (origin at the center of the cube).

Another approach is to use the diagonal dissection in one plane and one-way nested dissection in the third direction. As in regular 3-D nested dissection, the separators from each stage comprise vertices from perpendicular mesh planes. This results in connected components which are wedge shaped and diamond shaped subgrids (Fig. 6). The subgrids of the wedges and diamonds are again wedge shaped and diamond shaped, so the recursive procedure is possible. This is essentially a hybrid of two 2-D methods in different directions. Hence, each stage of this method consists of three levels, one in which one-way nested dissection is applied in one direction, and the other two in which diagonal dissection is applied in the remaining two perpendicular directions.

Fig. 6. Diagonal variant of 3-D nested dissection: first stage dissection of the entire grid, and dissection of a wedge-shaped subgrid.

The parallel complexity of this scheme, along with regular 3-D nested dissection, will be considered for the case of an n × n × m grid, with the n × n planes dissected in a diagonal manner, and the one-way nested dissection executed along the direction of the side of length m. Since some of the operations will be performed simultaneously, the terms parallel complexity and parallel operations will refer to the number of multiplicative steps (each of a duration equal to that of one multiplication) that must be taken sequentially, with possibly more than one multiplication occurring during a given step. To distinguish between these two nested dissection methods, the hybrid method will be referred to as the diagonal variant of 3-D nested dissection, or simply DV, and regular 3-D nested dissection will be denoted by ND.

Since the analysis here is primarily concerned with large granularity tasks, the number of processors considered will be restricted to eight or fewer. The multiprocessor on which these methods are implemented is also assumed to have a shared global memory. (There are several commercially used machines which have eight or fewer processors and a shared global memory, including the Alliant FX/8, the Denelcor HEP, the Cray X-MP, the Cray Y-MP and the Cray-2.) It is with the popularity of this type of architecture in mind that such restrictions will be adopted in this analysis. For both of the nested dissection methods (ND and DV), the connected

components of the first stage are eight subgrids of approximately equal size. The factorization of the submatrix associated with each of these subgrids can be executed simultaneously. Motivated by this observation, we will use a slight modification of the nested dissection numbering. In this modification, the vertices from the separators of the first three levels are still numbered last. However, the nested dissection numbering is applied to the vertices within each connected component from the third level only with respect to the vertices of that component, and the integers associated with those vertices will be consecutive. Hence, the resulting matrix will have BoBDF with eight diagonal blocks, and the vertices within the component associated with a diagonal block will have the original nested dissection numbering. The bordered block-diagonal structure of the matrix for both ND and DV with this modified numbering strategy is shown in Fig. 7. Note that fill-in will be limited to the areas of the borders that have been shaded. These areas correspond to vertices in the separators from the three levels of the first stage. Within the nonzero blocks, the corresponding vertices are completely ordered with ND or DV.

The sequential complexities for the factorizations of the diagonal blocks corresponding to the first stage subgrids for both ND and DV have the form:

θ = c_1 n^6 + c_2 n^5 m + c_3 n^4 m^2 + c_4 n^3 m^3.   (1)

The complexity coefficients c_1 through c_4 are constants that depend on the particular method (either ND or DV, since the subgrids themselves will be ordered in the stages following the first) and the planes from which the separators are chosen (see the Appendix).

After the appropriate diagonal blocks have been factored and the blocks in the borders have been modified, one may proceed with the factorization of the Schur complement of the block diagonal submatrix, which is also BoBDF. The four small (dense) diagonal blocks of this Schur complement correspond to the separators from the third level of the dissection ordering, and their factorizations can proceed concurrently. In fact, any set of blocks that correspond to distinct separators that are coplanar will be the diagonal blocks of a Schur complement, and can be factored simultaneously. The block factorization of the matrix requires the factorization of three nested Schur complements. Hence, referring to Fig. 7, the factorization of the 9th through the 12th diagonal blocks (which correspond to the four coplanar separators from the third level) can proceed simultaneously and the appropriate blocks in the remaining border modified. The 13th and 14th blocks (corresponding to the two planar separators from the second level) can then be factored concurrently and the remaining borders modified. Finally, the last diagonal block (corresponding to the single separator from the first level) is factored after all the others.

Fig. 7. Bordered block-diagonal structure of the dissection methods (locations of nested Schur complements shown in heavy outline).
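The 8-4-2-1 schedule just described can be expressed as a short driver. The sketch below (ours; it uses dense numpy blocks and explicit inverses purely for brevity, where a real implementation would use triangular factors, and the grouping of index sets into rounds is a hypothetical example) factors the diagonal blocks of each round concurrently and then eliminates them from the trailing Schur complement:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bobdf_factor(A, rounds):
    """Block elimination of a BoBDF matrix.  `rounds` lists, for each
    step, the index sets of the diagonal blocks that the theorem allows
    us to process concurrently (e.g. the 8 component blocks, then the
    4 third-level separator blocks, then 2, then 1).  Each round is
    factored in parallel and eliminated from the trailing submatrix,
    which is exactly the nested Schur complement of the text."""
    A = A.astype(float).copy()
    with ThreadPoolExecutor(max_workers=8) as pool:
        for r, blocks in enumerate(rounds):
            rest = [i for blks in rounds[r + 1:] for b in blks for i in b]
            # 'factor' this round's diagonal blocks concurrently
            # (explicit inverses for brevity -- use LU factors in practice)
            invs = list(pool.map(
                lambda b: np.linalg.inv(A[np.ix_(b, b)]), blocks))
            if not rest:
                break
            # border update / formation of the next Schur complement
            for b, Binv in zip(blocks, invs):
                A[np.ix_(rest, rest)] -= (A[np.ix_(rest, b)] @ Binv
                                          @ A[np.ix_(b, rest)])
    return A

# hypothetical layout: four component blocks, then two separator levels
rng = np.random.default_rng(0)
A = np.eye(11) * 8.0 + rng.uniform(-1, 1, (11, 11))
rounds = [[[0, 1], [2, 3], [4, 5], [6, 7]], [[8], [9]], [[10]]]
A_fact = bobdf_factor(A, rounds)
```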

The operations needed in computing the Schur complement of a particular block-diagonal submatrix (including modifications of the blocks in the border) are counted as part of the cost of factoring that block-diagonal submatrix. [That is, equation (1) includes the arithmetic cost of all the modifications associated with a block-diagonal submatrix factorization.] The delay caused by memory access conflicts that result from the execution of these modification operations during concurrent factorizations is assumed (as in Peters, 1984; Jess and Kees, 1982) to be negligible.

The factorization of the blocks corresponding to separators that are not coplanar will not be executed simultaneously (since we assume that these are separators of connected components from different levels). For the present, we assume that the factorization of a single diagonal block is done sequentially, and hence this factorization is assigned to only one processor. Thus, we have a lower bound on the complexity for the factorization of the Schur complement of the block-diagonal submatrix (with respect to the entire matrix), regardless of how many processors are used. This bound is the sum of the operation counts for the (sequential) factorizations of three submatrices: the largest diagonal blocks from each of the three nested Schur complements. These three (dense) submatrices correspond to the largest separators within each of the three perpendicular planes. The expression for the complexity of the factorization of each of the diagonal blocks of the nested Schur complements has the same form as equation (1). The expression for the overall factorization is then the sum of four expressions of this form. These four expressions are the complexities of the factorizations for the blocks associated with: (a) a connected component from the first stage (a diagonal block of the block-diagonal submatrix); and (b) a separator of a connected component from each of the first three levels (i.e. diagonal blocks from each of the three nested Schur complements).

The complexity of each block factorization is multiplied by an appropriate factor to account for the number of such blocks with respect to the number of processors being used. This can be expressed conveniently by:

θ = p^T C η,   (2)

where η^T = [n^6, n^5 m, n^4 m^2, n^3 m^3], C is the 4 × 4 matrix of complexity coefficients [see equation (1)] and p is a vector of factors accounting for the number of blocks of each kind relative to the number of processors being used. (Here ⌈k⌉ and ⌊k⌋ denote the least integer greater than or equal to k and the greatest integer less than or equal to k, respectively.)

The coefficients of the matrix C will depend on the method used and the order of the separators. In the following notation, we will denote (for either DV or ND) a mesh plane that is perpendicular to the direction of the side of length m (i.e. parallel to the n × n mesh plane) as an m-plane. Mesh planes perpendicular to m-planes will be identified as n-planes. The notation we will adopt to distinguish between the various separator orientations for DV and ND is as follows:

θ(n, n, m): the complexity when the vertices in an m-plane are chosen as the separators in the first level of each stage (i.e. numbered after the other separators at that stage);

θ(m, n, n): the complexity for the case in which the vertices in an m-plane are chosen for separators in the third level of each stage (numbered before the other separators at that stage);

θ(n, m, n): the complexity when the vertices in an m-plane are chosen as the elements of separators in the second level of each stage (numbered before the separators at the previous level and after the separators in the subsequent level of that stage).

The implementations of ND using these particular orientations of separators will be designated by ND_nnm, ND_nmn and ND_mnn, and their parallel complexities will be θ_nd(n, n, m), θ_nd(n, m, n) and θ_nd(m, n, n), respectively. Likewise, for the three different implementations of DV we will write DV_nnm, DV_nmn and DV_mnn, with complexities θ_dv(n, n, m), θ_dv(n, m, n) and θ_dv(m, n, n), respectively. (The order of the subscripts indicates the ordering of the separators within a stage.)

The separators at the first stage of both ND and DV have approximately the same number of vertices for a given orientation of the separators within the stages, and so the complexity of the factorization of the Schur complement of the block-diagonal submatrix is the same for both methods. This means that, for a given separator orientation, the last three rows of C [from equation (2)] will be the same, implying that any differences in θ_dv(i, j, k) and θ_nd(i, j, k) [for (i, j, k) = (n, n, m), or (n, m, n), or (m, n, n)] are attributable to differences in the performance of the two methods on the diagonal blocks (each of which is assumed to be executed on a single processor).
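Once the coefficients of C are tabulated (they are developed in the Appendix, which also fixes the entries of p; the values below are placeholders only), equation (2) is a one-line computation:

```python
import numpy as np

def parallel_complexity(n, m, C, p):
    """theta = p^T C eta of equation (2), with
    eta^T = [n^6, n^5 m, n^4 m^2, n^3 m^3].  C holds the complexity
    coefficients of equation (1) for the component and separator blocks,
    and p the per-processor block multiplicities; both depend on the
    method (ND or DV), the separator orientation and the processor
    count, so the values used below are placeholders, not the paper's."""
    eta = np.array([n**6, n**5 * m, n**4 * m**2, n**3 * m**3], float)
    return float(np.asarray(p) @ np.asarray(C) @ eta)

C_hypothetical = np.full((4, 4), 0.01)     # placeholder coefficients
p_hypothetical = np.ones(4)                # placeholder multiplicities
print(parallel_complexity(20, 10, C_hypothetical, p_hypothetical))
```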

RESULTS AND DISCUSSION

Results for 3-D grids

The performance of DV relative to ND [measured as a ratio of the complexities θ_dv(i, j, k)/θ_nd(i, j, k), so that DV is faster when this ratio is less than one] for various numbers of processors is shown in Figs 8-10. The DV method appears to be superior in any case for which the ratio of m to n is larger than about one-half. For large numbers of processors, the performance of both methods becomes limited by the factorization of the Schur complement of the block-diagonal submatrix. Since the time to factor these blocks increases as the cube of the number of vertices in each separator, one wishes to avoid using large separators. The mesh plane from which the separator for the third level is chosen has its vertices distributed almost equally among four separators. Thus, for the case in which the connected components from the third level are associated with submatrices that can be factored very quickly (with respect to the factorization of the matrices associated with first stage separators), one would assume that separators for the third level should be chosen from the plane with the largest number of vertices. Its vertices will then be split into four separators of roughly equal size. The plane with the next largest number of vertices should have the separators for the second level chosen from it, and its vertices distributed among two roughly equal sized separators. This leaves the vertices of the separator for the first level to be taken from the plane with the smallest number of vertices.

Fig. 8. Parallel performance of DV_nnm relative to ND_nnm for P processors.

Fig. 9. Parallel performance of DV_nmn relative to ND_nmn for P processors.

Fig. 10. Parallel performance of DV_mnn relative to ND_mnn for P processors.

For any stage, the vertices of a mesh plane will be distributed among the maximum number of separators if they are chosen for the third level separators of that stage. Likewise, the plane with the next largest number of vertices should have the separators for the second level chosen from it. The vertices of the separators for the first level of that stage should be taken from the plane with the smallest number of vertices.

Figures 11 and 12 show the performance (on eight processors) of the different implementations of ND and DV relative to the implementations of ND and DV in which the separators at the third level are selected from the mesh plane with the second largest (instead of the largest) number of vertices. (For the first stage and any ratio of m to n other than 1, the m-plane, which has n^2 vertices, is either the smallest or the largest plane, the n-planes having nm vertices. So, ND_nmn and DV_nmn are always the methods where the separators for the third level are chosen from the second largest mesh plane.) The ratio of θ(i, j, k) to θ(n, m, n) is used as the measure of relative performance. As these two figures indicate, ND_nmn and DV_nmn are always slower than one of the other methods. For ND, the separators should indeed be chosen such that vertices from the largest mesh plane are split among the 2^{k-1} separators at the last level of each stage (where k is the overall level). That is, ND_nnm is better when m is greater than n, and ND_mnn is better otherwise. This is because the relative number of vertices in each plane remains the same at all stages. The better of DV_nnm and DV_mnn, however, is not solely determined by whether n is greater than m. At subsequent stages of DV, the relative numbers of vertices in each plane are different from those in the first stage (see the Appendix). In fact, the ratio of vertices in an n-plane to vertices in an m-plane for the subgrids of DV is m/2n. So, while the first stage of DV_nnm may assign vertices from the largest mesh plane to the last level separator for m > n, the subsequent stages of DV_nnm do not do so unless m > 2n. For the case of eight processors, however, the parallel complexity is dominated by the factorization of the Schur complement associated with the first stage separators, and this effect is not pronounced.

Fig. 11. Parallel performance of DV_nnm and DV_mnn relative to DV_nmn for eight processors.

Fig. 12. Parallel performance of ND_nnm and ND_mnn relative to ND_nmn for eight processors.

The remarks above indicate that for exceedingly small or large values of m/n, a better strategy than either of these two methods would be to take all separators in the first stage (or first few stages) from mesh planes that are parallel to the plane with the smallest number of vertices; this amounts to using one-way dissection for all of these initial levels. The vertices in the largest mesh plane will then be distributed among seven or more separators initially.

The speed-up of an algorithm on P processors, S(P), is the ratio of the execution time of the algorithm on one processor to its execution time on P processors. The speed-ups for various values of P

are shown in Figs 13-16. The relatively small speed-ups realized by these methods are, in large measure, attributable to our previous restrictions on the processor task assignments. If, however, this method is used in conjunction with a parallel dense matrix method (i.e. we remove the restriction that single diagonal blocks are factored sequentially on a single processor), processors that would otherwise be idle during the factorization of the diagonal blocks of a Schur complement can be used for these factorizations. For example, with eight processors, four processors are idle while the four diagonal blocks associated with the separators from the third level of the ordering are factored concurrently; six are idle while the next set of diagonal blocks (associated with second level separators) are processed simultaneously; and there are seven idle processors during the last factorization. Assuming the availability of a parallel dense factorization routine with a speed-up of S_d(P), the complexity of these methods can still be expressed with equation (2), with the operation counts for the dense block factorizations divided by S_d(P). The above assumes that the time needed to initialize the parallel dense routine on any of eight or fewer processors is small compared to the total factorization time.

Fig. 13. Speed-up of DV for P processors.

Fig. 14. Speed-up of ND for P processors.

Fig. 15. Speed-up of DV for P processors.

Fig. 16. Speed-up of ND for P processors.

The parallel efficiency, e_p(P), of an algorithm is the ratio of the speed-up to the number of processors, S(P)/P. The speed-ups of ND and DV using a dense factorization routine with a parallel efficiency of 75% for various values of P are shown in Figs 17 and 18. The use of such a parallel dense factorization routine within DV and ND dramatically improves the speed-up attainable for the overall factorization. Note that the parallel efficiency for the overall factorization is now greater than that of the dense factorization algorithm.

Fig. 17. Speed-up of DV using a parallel dense routine with e_p = 0.75.

Fig. 18. Speed-up of ND using a parallel dense routine with e_p = 0.75.

The results discussed above indicate that significant speed-ups for these sparse factorization methods can be realized if some attempt is made to exploit the available parallelism within the tasks of large granularity. The incorporation of an efficient parallel dense factorization routine into these methods results in a parallel factorization algorithm with attractive potential speed-ups.

The implementation of partial pivoting, which was not considered above, is often a necessity in the solution of matrices which are not known to be positive-definite or diagonally dominant. However, the numerical stability of the factorization can usually be maintained with local pivoting among variables within a single diagonal block. Such a local pivoting scheme could quite readily be incorporated into the methods considered here (or any parallel method with a sufficiently large task granularity).


Should a local pivoting scheme still not prove sufficient for a particular matrix, the pivots in question could be moved to the diagonal block corresponding to the first (or previous) level separator.

Application to equation-based process flowsheeting

We now consider the extension of these strategies to the parallel implementation of a linear solver for equation-based process flowsheeting. The general structure of EB flowsheeting matrices is nearly block-banded, with some nonzero blocks occurring outside of the block bands (Vegeais and Stadtherr, 1989a; Vegeais, 1987); feedforward and recycle streams are typically represented by these off-band blocks. Although the corresponding occurrence matrices are rarely symmetric, we can still apply the appropriate results we have from the undirected graph model to the occurrence matrix of the symmetric part [i.e. (A + A^T)/2]. The orderings that we can generate for the occurrence matrix of the symmetric part can then be used for the original matrix (for, clearly, if two pivots can be processed concurrently in A + A^T, they can be processed concurrently in A as well).

The regularity in the structure of the grid graph allows us to identify good separators (i.e. relatively small separators whose removal from the vertex set induces connected components of approximately equal order) a priori. Matrices with graphs of a more general structure will require a more complicated algorithm for the selection of good separators, and the performance of such an algorithm is critical to the success of this ordering strategy for general sparse matrices.
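The symmetrization step mentioned above is inexpensive. A small sketch (ours; it works on the structural pattern and ignores the possibility of numerical cancellation in A + A^T):

```python
import numpy as np

def symmetric_part_pattern(A):
    """Occurrence (0/1) matrix of the symmetric part (A + A^T)/2 of a
    nonsymmetric matrix A, treated structurally (an entry is counted as
    nonzero if a_ij or a_ji is nonzero).  An ordering generated from
    this pattern is then applied to A itself: pivots that can be
    processed concurrently in A + A^T can be processed concurrently
    in A as well."""
    S = (A != 0) | (A.T != 0)
    np.fill_diagonal(S, True)         # keep the pivots/diagonal
    return S.astype(int)

A = np.array([[2., 1., 0.],
              [0., 3., 0.],
              [0., 4., 5.]])
print(symmetric_part_pattern(A))
# [[1 1 0]
#  [1 1 1]
#  [0 1 1]]
```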


The generalization of nested dissection to matrices whose corresponding graphs are planar was discussed by Lipton et al. (1979). Algorithms for finding separators in planar graphs and graphs of bounded genus have been examined by Lipton and Tarjan (1979) and Gilbert et al. (1984), respectively. These algorithms, however, can only guarantee an upper bound of 2N/3 on the order of the resulting components, where N is the order of the graph (or subgraph) to which the algorithm is applied. Since the upper bound is this large, these algorithms cannot guarantee that the connected components generated at a particular level will be of nearly the same order. (In fact, for small separators, this is equivalent to an upper bound of two on the ratio of those orders.) If the ratio of the orders of connected components differs substantially from unity at each level, the computational load will not be uniformly distributed among the processors and the resulting parallel performance will be poor. A heuristic algorithm for finding small separators which yield connected components of approximately equal size was presented by George and Liu (1981). Like the previously mentioned algorithms, the connected components generated at each level by this algorithm are not guaranteed to be of equal or nearly equal order.

Since we cannot make sufficiently general statements about the precise structure of the graph of EB flowsheeting matrices, we are not able to determine a priori the order or location of the separator sets that will be selected by a particular algorithm. Hence, we cannot characterize the parallel performance of the solution of EB flowsheeting matrices using this nested dissection strategy, as we could for 3-D grid problems. Moreover, we will not know beforehand how many levels of the ordering can be applied before a connected component without a separator is generated. We can, however, still make important inferences about the parallel performance of this strategy applied to EB flowsheeting matrices based on our observations of its parallel performance on grid problems. In particular, we can predict that the implementation of a good parallel dense solver will have a significant effect on the overall performance of this strategy. Furthermore, from the effect of the orientation of the separators for various values of m/n we can infer not only that it will be important to keep the size of the separators small, but also that the availability of a small separator may be affected by the choice of a separator at a previous level.

Vegeais (1987) and Vegeais and Stadtherr (1989b) have considered the parallel solution of EB flowsheeting matrices via a strategy which exploits the inherent bordered block-diagonal structure of these matrices. In this arrangement of the equations and variables, each diagonal block is associated with a process unit in the flowsheet, and the variables that correspond to that diagonal block are the variables that are strictly internal to that unit (Westerberg and Berna, 1978).

The variables corresponding to the borders are those appearing in the equations which couple the process units (e.g. stream connection equations, design equations). (In the case that a process unit has no internal variables, there will be no diagonal block associated with it, and any variables that are associated with the unit appear in the border.) The parallel solution of this matrix can then proceed by assigning a diagonal block to each processor, factoring these blocks concurrently, updating the borders concurrently, then computing the Schur complement of the block diagonal submatrix and factoring it (at which point a parallel dense factorization might possibly be used). One need not necessarily assume diagonal dominance for this approach since, in the factorization of the diagonal blocks, row interchanges within blocks may be made to maintain numerical stability. (If, at some point in the factorization of a diagonal block, no viable pivot candidates remain in the block, the remaining rows and columns can be moved to the borders.) However, the number of variables internal to a process unit may vary from unit to unit, and this strategy could yield diagonal blocks with significantly different sizes, resulting in a poor computational load distribution among processors. This strategy can also yield large borders, and the Schur complement of the block-diagonal submatrix will not necessarily have a BoBDF that can be factored in parallel.

Leiserson and Lewis (1987) have described two heuristic algorithms for finding separators in which an adjustable parameter is used to control the disparity in the orders of the components induced by the removal of the separator. Thus, the possibility of a highly unbalanced processor load distribution can be avoided by the proper selection of this parameter. For a graph G = (X, E) of order N, and a given partitioning of X into two disjoint subsets X_1 and X_2 of orders ⌈N/2⌉ and ⌊N/2⌋, respectively, we define the bisection B ⊆ E to be the set of edges with one vertex in X_1 and the other vertex in X_2. In these strategies, an algorithm that finds a bisection of the graph is executed first, then heuristic selection rules are used to cull a separator set from the vertices incident to the bisection. Since every path from a vertex in X_1 to a vertex in X_2 contains an edge from B, the set of all vertices incident to the edges of B is itself a separator (provided, of course, that both X_1 and X_2 contain some vertices that are not incident to B). It is from this incident vertex set that the separator is selected. This particular set is obviously not a minimal separator, since its intersection with either X_1 or X_2 is also a separator. Furthermore, even the smaller of these two possible intersections is not necessarily a minimal separator, and the heuristics presented by Leiserson and Lewis often produce better separators. Graph bisection algorithms have several applications, most notably VLSI design, and several algorithms for finding bisections have been proposed (e.g. Bui et al., 1984; Goldberg and Gardner, 1984; Fiduccia and Mattheyses, 1982; Kernighan and Lin, 1970).
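A minimal sketch (ours) of the culling step, using the simplest possible selection rule, the smaller one-sided incident set, where the heuristics of Leiserson and Lewis (1987) would normally be applied:

```python
def separator_from_bisection(edges, X1, X2):
    """Cull a separator from the vertices incident to a bisection B
    (the edges with one endpoint in X1 and the other in X2).  Taking
    the smaller one-sided incident set is the simplest rule; the
    heuristics of Leiserson and Lewis (1987) often do better."""
    X1 = set(X1)
    B = [(u, v) for (u, v) in edges if (u in X1) != (v in X1)]
    side1 = {u for e in B for u in e if u in X1}
    side2 = {u for e in B for u in e if u not in X1}
    return min(side1, side2, key=len)

# 3 x 3 grid, row-major vertices 0..8, bisected into {0..4} | {5..8}
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (6, 7), (7, 8),
         (0, 3), (3, 6), (1, 4), (4, 7), (2, 5), (5, 8)]
S = separator_from_bisection(edges, range(5), range(5, 9))
print(sorted(S))   # [2, 3, 4]: removing it separates {0, 1} from {5..8}
```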

However, the problem of finding the partition of the vertex set which gives the minimum bisection is NP-complete, and consequently these algorithms attempt to select small bisections using various heuristics. If a sufficiently small bisection can be found, then one will be able to select from its incident vertices a separator with the two qualities that one looks for in a separator, namely small size and components (induced by its removal) of approximately equal order. In general, however, there is no guarantee that the best separator for a graph can be found from the set of vertices incident to the best (smallest) bisection. Furthermore, there is no guarantee that these bisection algorithms will find the best bisection.

We suggest here another possible use of graph bisection algorithms for the parallel solution of EB flowsheeting matrices. This application of bisection algorithms is motivated by an algorithm for the parallel solution of narrow banded linear systems presented by Dongarra and Sameh (1984). In this algorithm, the narrow banded system is partitioned into block-tridiagonal form. For P processors and a system of order N with semibandwidth b, the diagonal blocks of this partition will have order N/P. (Here we are assuming that N is a multiple of P; otherwise the first P - 1 diagonal blocks will have order ⌈N/P⌉.) Each of the off-diagonal blocks is also of order N/P and contains all zeros except for either an upper or lower triangular matrix of order b (Fig. 19a). The diagonal blocks can be factored concurrently, with each factorization assigned to a different processor. Left multiplication of the system by the inverse of the block diagonal part creates fill-in only above the lower triangular submatrix or below the upper triangular submatrix within each off-diagonal block (Fig. 19b). (In the actual implementation, one does not compute the inverse of the block diagonal part and premultiply with it; rather, one computes the fill-in columns by forward elimination and back substitution, using the factorization of the diagonal block and the appropriate off-diagonal block column as the right-hand side.) Again, these computations can be done concurrently by assigning one block row to each processor. At this point, an independent system of order 2b(P - 1) remains, as depicted by the solid outline in Fig. 19b. This reduced system comprises the b equations above and the b equations below each block row. It can be solved independently and the rest of the overall solution obtained by substitution (which can be executed concurrently), or the overall system can be repartitioned into block-tridiagonal form with diagonal blocks of order 2N/P. In this latter case (in which we assume P = 2^k for some integer k) the factorization, premultiplication and repartitioning are applied recursively until a reduced system of order 2b remains, at which point this reduced system is solved and the rest of the overall solution is determined by concurrent substitution. One should note that for this recursive procedure, the diagonal blocks of the repartitioned matrix are not dense and can be factored as indicated in Fig. 19.

Fig. 19. Block-tridiagonal solution for a narrow banded system (partitioning for P = 8).

We now consider the use of the bisection algorithms to find a more general block-tridiagonal ordering to which we can apply the above strategy. Thus, the above strategy will not be limited to narrow banded systems. Given a bisection B of a graph and the vertex sets X_1 and X_2 into which the vertices are partitioned, a block-tridiagonal form with two diagonal blocks will result if we order the vertices as follows: (a) all the vertices in X_1 are mapped into integers less than those associated with the vertices in X_2; and (b) the vertices incident to B in X_1 are mapped into integers greater than those integers into which the rest of X_1 is mapped, while the vertices incident to B in X_2 are mapped into integers less than those integers associated with the rest of X_2. (The latter condition is equivalent to requiring that vertices incident to B be numbered consecutively.) We will let b_1 denote the number of vertices in X_1 that are incident to B and b_2 denote the number of such vertices in X_2. In the resulting matrix, the upper right off-diagonal block has zero entries everywhere except in the b_1 × b_2 subblock closest to the diagonal. The lower left off-diagonal block has zero entries everywhere except in its b_2 × b_1 subblock which is closest to the diagonal. If this system is left multiplied by the inverse of the block-diagonal portion of the matrix, fill-in will occur only above the b_1 × b_2 subblock and below the b_2 × b_1 subblock, analogous to the narrow banded case. A reduced system of b_1 + b_2 independent equations results. We consider this to constitute the first level of the ordering.
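The premultiplication step at the heart of both the narrow banded scheme and this generalization can be sketched as follows (our own illustration; for clarity it applies an explicit solve per block row, whereas the actual implementation reuses the LU factors of each diagonal block):

```python
import numpy as np

def block_premultiply(A, parts):
    """One stage of the Dongarra and Sameh (1984)-style elimination:
    left-multiply the block-tridiagonal system by the inverse of its
    block diagonal part.  Each block row is independent, so in practice
    every part is assigned to its own processor."""
    A = A.astype(float).copy()
    for p in parts:
        A[p, :] = np.linalg.solve(A[np.ix_(p, p)], A[p, :])
    return A   # diagonal blocks become identities; fill-in is confined
               # to the nonzero columns of the off-diagonal blocks

# order-8 tridiagonal system (semibandwidth b = 1), two blocks of order 4
rng = np.random.default_rng(1)
A = (np.eye(8) * 4.0 + np.diag(rng.uniform(size=7), 1)
                     + np.diag(rng.uniform(size=7), -1))
M = block_premultiply(A, [list(range(4)), list(range(4, 8))])
# rows 3 and 4 (the b equations on each side of the block row boundary)
# carry the coupling that forms the reduced system
```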

This ordering can be applied recursively to the subgraphs induced by the vertex sets of the partition in a manner similar to the recursive application described previously for the nested dissection ordering. One should note, however, that the algorithm which finds the bisection must be modified slightly. This modification should prevent the algorithm from choosing a bisection with any incident vertices that are also incident to a bisection found at a previous level (since those vertices have already been made to satisfy the ordering criteria at that level). The recursive application of the ordering stops after the kth level, where the number of processors is 2’. The resulting matrix will be block-tridiagonal with 2k diagonal blocks, each of order N/2& (for simplification, we assume N to be a multiple of P). Figure 20 shows an example for two levels of this ordering. The size of the nonzero subblocks in the off-diagonal blocks will be determined by the number of vertices incident to the corresponding bisection. If the matrix is premultiplied by the inverse of the diagonal blocks (which can be factored concurrently) the fill-in is limited to the nonzero columns of the off-diagonal blocks. The resulting matrix contains a reduced system of independent equations whose order is equal to the number of vertices which are incident to all of the bisections found in the ordering stage. As in the narrow-banded case, the reduced system can be solved and the entire solution then can be found via substitution, or the system can be repartitioned to a block-tridiagonal form with 2k- ’ diagonal blocks (Fig. 20b). The factorization of the diagonal blocks, the premultiplication by the inverse (both of which can be executed concurrently) and the repartitioning can all be applied recursively until the remaining system has only two diagonal blocks. (Diagonal blocks of each repartitioned matrix have a sparse structure similar to that shown in Fig. 19b and

Fig. 20. Proposed block-tridiagonal solution for general sparse system.

At this point, the order of the reduced system of independent equations will be equal to the number of vertices incident to the bisection found at the first level. This reduced system can be solved and the rest of the solution computed by substitution. One should note that, in this case, the order of the reduced system is determined by the number of vertices incident to the first-level bisection, whereas in the former case (in which the reduced system is solved without recursive repartitioning) the order is determined by the number of vertices incident to all bisections. In the narrow banded case, the order of the reduced system was determined by the bandwidth. The number of vertices incident to a bisection can thus be thought of as a local bandwidth, which is always less than or equal to the overall bandwidth (e.g. Fig. 20a). One might consider reordering the EB flowsheeting matrix with a bandwidth reduction algorithm and applying the narrow banded strategy to the result. But the preceding remarks indicate that the reduced system for this bandwidth reduction strategy will be larger than that for the proposed approach, except when the banded system is dense (in which case the reduced systems are of equal order). Investigations of bandwidth and profile reduction algorithms for EB flowsheeting matrices (Vegeais, 1987; Vegeais and Stadtherr, 1989b) indicate that the resulting matrices have fairly large bandwidths with sparse bands. Hence, one expects the proposed approach to be superior to the bandwidth reduction approach for EB flowsheeting matrices.
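The following dense NumPy sketch shows one level of the reduction that underlies both the narrow banded and the proposed orderings: premultiplication by the inverse of the block-diagonal part, followed by extraction of the reduced system. It is illustrative only; a real implementation would exploit sparsity and would factor the two diagonal blocks concurrently on separate processors.

```python
import numpy as np

def one_level_reduction(A, n1, b1, b2):
    """Premultiply a two-block matrix by the inverse of its block-diagonal
    part and identify the reduced system (illustrative dense sketch).

    A      : matrix in the two-block ordering described above
    n1     : order of the first diagonal block
    b1, b2 : numbers of bisection-incident vertices in each half
    """
    # Solve with each diagonal block separately; on a parallel machine
    # the two solves would proceed concurrently.
    top = np.linalg.solve(A[:n1, :n1], A[:n1, :])
    bot = np.linalg.solve(A[n1:, n1:], A[n1:, :])
    M = np.vstack([top, bot])
    # The b1 + b2 equations nearest the interface decouple from the
    # rest and form the reduced system.
    reduced = np.arange(n1 - b1, n1 + b2)
    return M, M[np.ix_(reduced, reduced)]
```

Solving the (b1 + b2)-order reduced system for the interface unknowns and substituting back recovers the rest of the solution; the recursive repartitioning described above simply repeats this step on the reduced structure.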

CONCLUSIONS

The parallel complexity of DV is less than that of ND for ratios of m to n greater than about one-half (Figs 8 and 10). However, as the number of processors increases, the difference in the complexities of the two methods decreases, since the solution time becomes limited by the time to factor the Schur complement of the block-diagonal submatrix (which is the same for either method).


The greatest difference in the complexities of the two methods is observed at very large and very small values of m/n, but at extreme values of m/n we would invoke an alternate choice of separators at the initial levels that should provide better performance than either DV or ND. For the large task granularity considered here, the use of a parallel dense factorization scheme during the factorization of diagonal blocks of the Schur complements significantly improves the parallel efficiency of the algorithms. In particular, the use of a parallel dense solver with an efficiency of 0.75 can result in overall efficiencies of 0.80-0.88 for eight parallel processors (Figs 17 and 18). The use of heuristic algorithms for finding separators in graphs of general sparse matrices was originally considered in the context of reducing the sequential complexity of the LU decomposition (Lipton et al., 1979) and has been investigated more recently as a means of producing a parallel LU decomposition ordering (Leiserson and Lewis, 1987). This latter approach, in which the relative size of the connected components can be bounded, might be useful in generating a parallel solution ordering for EB flowsheeting matrices which does not result in a severe processor load imbalance. Some previously suggested parallel solution strategies for EB flowsheeting matrices that exploit the natural BoBDF of the matrix are susceptible to processor load imbalances. We propose the use of a block-tridiagonal parallel scheme for EB flowsheeting matrices. This block-tridiagonal scheme is a generalization of the narrow banded solver presented in Dongarra and Sameh (1984), and its effectiveness will depend on the performance of the heuristic graph bisection algorithm that is used in generating the block-tridiagonal ordering. The effectiveness of these heuristic methods in the parallel solution of EB flowsheeting matrices is a topic of the research we are currently conducting.

Acknowledgement - This work was supported in part by the National Science Foundation under Grant DMC-8520663.

REFERENCES

Birkhoff G. and J. A. George, Elimination by nested dissection. Complexity of Sequential and Parallel Numerical Algorithms. Academic Press, New York (1973).
Bui T., S. Chaudhuri, T. Leighton and M. Sipser, Graph bisection algorithms with good average case behavior. Proc. 25th Annl IEEE Symp. Foundations Comput. Sci., Singer Island, Florida (1984).
Calahan D. A., Vectorized sparse elimination. Proc. Scientific Comput. Inform. Exchange Mtg, Livermore, Calif. (1979).
Dongarra J. J. and L. Johnsson, Solving banded systems on a parallel processor. Parallel Comput. 5, 219-246 (1987).
Dongarra J. J. and A. H. Sameh, On some parallel banded system solvers. Parallel Comput. 1, 223-235 (1984).
Duff I. S., Parallel implementation of multifrontal schemes. Parallel Comput. 3, 193-204 (1986).
Fiduccia C. M. and R. M. Mattheyses, A linear-time heuristic for improving network partitions. Design Automation Conf., Las Vegas, Nevada (1982).
George J. A., Nested dissection of a regular finite element mesh. SIAM J. Numer. Analysis 10, 345-363 (1973).
George A. and J. W. Liu, Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, New Jersey (1981).
George A., W. G. Poole Jr and R. G. Voigt, Analysis of dissection algorithms for vector computers. Comput. Math. Applic. 4, 287-304 (1978).
Gilbert J. R., J. P. Hutchinson and R. E. Tarjan, A separator theorem for graphs of bounded genus. J. Algorithms 5, 391-407 (1984).
Goldberg M. K. and R. Gardner, On the minimal cut problem. Progress in Graph Theory (J. A. Bondy and U. S. R. Murty, Eds). Academic Press, Orlando (1984).
Golub G. H. and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore (1983).
Hilton C. M., Numerical studies in equation-based chemical process flowsheeting. Ph.D. Thesis, University of Illinois, Urbana (1981).
Jess J. A. G. and H. G. M. Kees, A data structure for parallel L/U decomposition. IEEE Trans. Comput. C-31, 231-239 (1982).
Kernighan B. W. and S. Lin, An efficient heuristic procedure for partitioning graphs. Bell System Tech. J. 49, 291-307 (1970).
Leiserson C. E. and J. G. Lewis, Orderings for parallel sparse symmetric factorization. Third SIAM Conf. Parallel Processing Sci. Comput., Los Angeles (1987).
Lipton R. J., D. J. Rose and R. E. Tarjan, Generalized nested dissection. SIAM J. Numer. Analysis 16, 346-358 (1979).
Lipton R. J. and R. E. Tarjan, A separator theorem for planar graphs. SIAM J. Appl. Math. 36, 177-189 (1979).
Peters F. J., Parallel pivoting algorithms for sparse symmetric matrices. Parallel Comput. 1, 99-110 (1984).
Peters F. J., Parallelism and sparse linear equations. Sparsity and its Applications. Cambridge University Press (1985).
Rose D. J. and G. F. Whitten, A recursive analysis of nested dissection strategies. Sparse Matrix Computations (J. R. Bunch and D. J. Rose, Eds). Academic Press, New York (1976).
Stadtherr M. A. and J. A. Vegeais, Advantages of supercomputers for engineering applications. Chem. Engng Prog. 81, 21-24 (1985).
Vegeais J. A., Vector and parallel strategies for sparse matrix problems in equation-based chemical process flowsheeting. Ph.D. Thesis, University of Illinois, Urbana (1987).
Vegeais J. A. and M. A. Stadtherr, Sparse matrix strategies for equation-based chemical process flowsheeting on advanced computer architectures-I. Vector processing. Comput. chem. Engng (1989a). In press.
Vegeais J. A. and M. A. Stadtherr, Sparse matrix strategies for equation-based chemical process flowsheeting on advanced computer architectures-II. Parallel processing. Comput. chem. Engng (1989b). In press.
Westerberg A. W. and T. J. Berna, Decomposition of very large-scale Newton-Raphson based flowsheeting problems. Comput. chem. Engng 2, 61-63 (1978).

APPENDIX

Development of Complexity Expressions

The development of these complexity equations is analogous to the development of the 2-D complexity expressions given by George and Liu (1981). An expression can be developed for each type of subgrid (i.e. each different subgrid shape) that occurs in the dissection. For the case of DV, the dimensions and dissection of an n, m-wedge and an n, m-diamond are defined in Fig. A1. Both the n, m-wedge and the n, m-diamond can be bordered on the top or bottom or both by a separator. Let θ_w1(n, m) and θ_w2(n, m) represent the complexity for the factorization of a matrix whose grid graph is an n, m-wedge bordered on the top or bottom and an n, m-wedge bordered on both the top and bottom, respectively. θ_d1(n, m) and θ_d2(n, m) are defined similarly for the n, m-diamond. Since the n, m-wedge is composed of four n/2, m/2-wedges and two n/2, m/2-diamonds, θ_w1(n, m) can now be computed as:

θ_w1(n, m) = 2θ_w1(n/2, m/2) + 2θ_w2(n/2, m/2) + θ_d1(n/2, m/2) + θ_d2(n/2, m/2) + S_w1(n, m),   (A1)

in which S_w1(n, m) is the cost of factoring the Schur complement that is associated with the separators between the six subgrids of the n, m-wedge. The m-plane of the n, m-wedge has approximately n² vertices, while the two n-planes of the n, m-wedge have approximately nm/2 vertices each. In the n, m-diamond, the m-plane has approximately 2n² vertices and each n-plane has about nm vertices.
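To illustrate how recurrences of the form (A1) are evaluated, here is a small memoized Python evaluator for a single subgrid shape with six half-size subproblems; the base-case and separator costs below are placeholders, not the paper's coefficients.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def theta(n, m):
    """Illustrative evaluator for an (A1)-style complexity recurrence:
    six half-size subproblems plus a separator (Schur complement) cost.
    The base-case and separator costs are placeholder stand-ins."""
    if n <= 2 or m <= 2:
        return float(n * m)                 # placeholder base-case cost
    separator_cost = 0.5 * n**3 + n**2 * m  # stand-in for S_w1(n, m)
    return 6 * theta(n // 2, m // 2) + separator_cost

# The evaluated cost grows like a cubic in n and m, consistent with the
# closed forms (A7)-(A9) developed below.
print(theta(64, 64))
```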



Using Theorem 2.1.2 of George and Liu (1981), one can write:

S_w1(n, m) = Σ_{i=1}^{N_s} Σ_{j=1}^{N_i} η_j(η_j + 3)/2,

Fig. A1. Relative dimensions of diagonal variant subgrids (n, m-wedge and n, m-diamond).

where η_j is the number of off-diagonal nonzeros in the column of the factor corresponding to the jth vertex of the ith separator, and N_i is the number of vertices in the ith separator. Thus, for the n, m-wedge,

N_i = a_i nm + b_i n² + O(n),

and for all p subgrid shapes one writes:

N_pi = a_pi nm + b_pi n² + O(n).   (A2)

The outer sum in the expression for S_w1 is from 1 to N_s, over the number of separators between the subgrids. The coefficients a_pi and b_pi will be different for each subgrid and separator. Equations (A3) and (A4) define coefficients γ_p1-γ_p4 as sums, over the separators, of cubic combinations of the a_pi and b_pi; in terms of these coefficients the separator cost for the pth subgrid shape can be written:

S_wp(n, m) = Σ_{k=1}^{4} γ_pk n^(4-k) m^(k-1).   (A5)

A general expression for the complexity associated with each different subgrid shape, in terms of the complexities associated with the other subgrid shapes, can be written:

θ_p(n, m) = 2^(K_p) θ_p(n/2, m/2) + Σ_{i≠p} d_pi θ_i(n/2, m/2) + Σ_{j=1}^{4} γ_pj n^(4-j) m^(j-1),   (A6)

where γ_p1, ..., γ_p4 are the coefficients defined in equation (A5) for the pth subgrid shape, d_pi is the number of n/2, m/2-(ith subgrid)s into which the pth subgrid is dissected, and K_p is the log_2 of the number of times the n/2, m/2-(pth subgrid) occurs in the dissection of the n, m-(pth subgrid). By applying Theorem 1 from Rose and Whitten (1976) and ignoring lower order terms, equation (A6) can be rewritten as:

θ_p(n, m) = [2^(3-K_p)/(2^(3-K_p) - 1)][Σ_{j=1}^{4} γ_pj n^(4-j) m^(j-1) + Σ_{i≠p} d_pi θ_i(n/2, m/2)].   (A7)
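As a quick numerical sanity check of the geometric-series factor in (A7), consider a single uncoupled shape (all d_pi = 0) satisfying θ(n) = 2^K θ(n/2) + γn³; the values of K and γ below are arbitrary test inputs, not coefficients from the paper.

```python
def theta_recursive(n, K, gamma):
    """theta(n) = 2**K * theta(n/2) + gamma * n**3, with theta(1) = 0."""
    if n <= 1:
        return 0.0
    return 2**K * theta_recursive(n // 2, K, gamma) + gamma * n**3

K, gamma, n = 2, 0.1, 2**12
lam = 2**(3 - K) / (2**(3 - K) - 1)   # geometric-series factor from (A7)
print(theta_recursive(n, K, gamma))   # exact recursion
print(lam * gamma * n**3)             # leading-order closed form
```

For K < 3 the two printed values agree to leading order, which is the content of the bracketed factor in (A7).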

Collecting the relations (A7) over all of the subgrid shapes leads to a matrix formulation. Define:

Λ = diagonal matrix with elements λ_i = 2^(3-K_i)/(2^(3-K_i) - 1);
D = matrix with elements d_ij as defined above for i ≠ j, with d_ii = 0 for all i;
Γ = matrix with elements γ_ij as defined above;
η^T = (n³, n²m, nm², m³);
σ^T = (σ_1, σ_2, ...), the vector of subgrid complexities.

In this notation the coupled recurrences (A7) become:

σ = ΛΓη + 2^(-3)ΛDσ.   (A8)

If (I - 2^(-3)ΛD) is nonsingular, then the complexities for all of the subgrid shapes are given by the following expression:

σ = (I - 2^(-3)ΛD)^(-1)ΛΓη = Cη.   (A9)

The complexities of the remaining subgrid shapes, including the ND subblock bordered on three sides by other subblocks, are obtained in the same way. (Numerical values of the coefficient matrices C_ND and C_DV and of the resulting complexity vectors for each subgrid shape are tabulated here.)
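Equation (A9) is a small linear solve once Λ, D and Γ are known. The following NumPy sketch carries out that solve for two coupled subgrid shapes; all numerical values are illustrative stand-ins, not the tabulated values from the paper.

```python
import numpy as np

# Two coupled subgrid shapes; four monomials n^3, n^2 m, n m^2, m^3.
K = np.array([2, 1])                      # illustrative K_p values
lam = 2.0**(3 - K) / (2.0**(3 - K) - 1)   # lambda_p as defined above
Lam = np.diag(lam)
D = np.array([[0.0, 2.0],                 # d_ij: subgrid mix, zero diagonal
              [4.0, 0.0]])
Gamma = np.array([[0.10, 0.05, 0.02, 0.0],   # gamma_pj, stand-in values
                  [0.08, 0.03, 0.01, 0.0]])

I = np.eye(2)
# Equation (A9): sigma = (I - 2^-3 Lam D)^-1 Lam Gamma eta = C eta.
C = np.linalg.solve(I - 2.0**-3 * Lam @ D, Lam @ Gamma)

n, m = 32.0, 64.0
eta = np.array([n**3, n**2 * m, n * m**2, m**3])
print(C @ eta)    # leading-order complexity of each subgrid shape
```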