Parallel Computing 22 (1996) 1373-1397
Elsevier
Mapping affine loop nests

Michèle Dion, Yves Robert *

Laboratoire LIP, URA CNRS 1398, Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France

Received 14 November 1994; revised 21 May 1996
Abstract

This paper deals with the problem of aligning data and computations when mapping affine loop nests onto Distributed Memory Parallel Computers (DMPCs). We formulate the problem by introducing a new graph, the access graph, to model affine communications (with rectangular access matrices) more adequately than with the previously introduced tool, the communication graph. We show that maximizing the number of local communications in the access graph is an NP-complete problem in the strong sense and we present several heuristics based upon the access graph for mapping affine loop nests onto DMPCs.

Keywords: Affine mapping problem; Distributed memory multiprocessors; Loop nests; Communication graphs; Heuristics
1. Introduction
This paper deals with the problem of mapping affine loop nests onto Distributed Memory Parallel Computers (DMPCs). Because communication is very expensive on DMPCs, the way data arrays and computations are distributed to processors is a key factor for performance. The computations described in this paper are general non-perfect loop nests (or multiple loop nests) with uniform or affine dependences. Mapping such loop nests onto virtual processors with the aim of "minimizing" in some sense the communication volume or number is known as the alignment problem, which has drawn a lot of attention [18,8,1,6,5,23,4].
* Corresponding author. Email: [email protected]. Supported by the Project C3PRS of the French Council for Research CNRS, and by the ESPRIT Basic Research Action 6632 "NANA2" of the European Economic Community.

0167-8191/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0167-8191(96)00049-X
In previous papers ([6,5] for uniform loop nests, [1] and [23]), the communication behavior of a given loop nest is described by its communication graph. But it turns out that some communications (some edges of the communication graph) can be zeroed out or, at least, made local, only for some values of the dimension m of the target virtual architecture. In other words, the communication graph should depend upon m to cope with non-square access matrices that frequently arise within non-perfect loop nests. Therefore we introduce a new graph, the access graph, to model communications more adequately. We apply graph-theoretic heuristics on the access graph in order to zero out as many nonlocal communications as possible.

We introduce the affine mapping problem, which consists in maximizing the number of local communications in the access graph. We show that this problem is NP-complete in the strong sense (which could be expected from several similar NP-completeness results, see Section 4.3). We present heuristics based on the access graph for mapping affine loop nests onto DMPCs: the aim is to maximize the number of local communications in the graph.

For technical reasons, we delay the precise definition of the access graph and the formal setting of our new results until Section 4. Rather, we start with a motivating example which we use throughout the text to help the reader follow more easily.

The paper is organized as follows: in Section 2, we introduce our motivating example and informally explain our new approach. Section 3 is devoted to a survey of related work in the literature. We state and prove our results in Section 4. Finally, some concluding remarks are stated in Section 5. For the sake of convenience, background material on mathematical tools is collected in the Appendix.

2. Motivating example

Consider the following non-perfect affine loop nest, where a is a 2D-array, and b and c are 3D-arrays:

Example 1.
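    for i = 0 to N do
      for j = 0 to M do
        {Statement S_1}  b(i+j-2, 1, 3-j) = g_1(a(j-1, i+6), c(i-1, j+3, i-j))
        for k = 0 to N+M do
          {Statement S_2}  a(j-k+1, i+k-2) = g_2(b(i+2, j+5, k-3))
          {Statement S_3}  c(i+j+k, j-1, i-k) = g_3(a(i-2k, 2i-j-k+2))
        endfor
      endfor
    endfor

Here, $g_1$, $g_2$ and $g_3$ are arbitrary functions. The above loop nest is an affine loop nest because all array references are affine functions of the loop indices. Consider for instance the reference to array b in the left-hand side of statement $S_1$: we write

$$b(i+j-2,\ 1,\ 3-j) = b\left(F_1 \begin{pmatrix} i \\ j \end{pmatrix} + c_1\right),$$

where

$$F_1 = \begin{pmatrix} 1 & 1 \\ 0 & 0 \\ 0 & -1 \end{pmatrix}$$

is a $3 \times 2$ access matrix, and $c_1 = (-2, 1, 3)^t$ is a 3D access vector.

We rewrite the loop nest to name access matrices and vectors as follows:

    for i = 0 to N do
      for j = 0 to M do
        {Statement S_1}  b(F_1 (i, j)^t + c_1) = g_1(a(F_2 (i, j)^t + c_2), c(F_3 (i, j)^t + c_3))
        for k = 0 to N+M do
          {Statement S_2}  a(F_4 (i, j, k)^t + c_4) = g_2(b(F_5 (i, j, k)^t + c_5))
          {Statement S_3}  c(F_6 (i, j, k)^t + c_6) = g_3(a(F_7 (i, j, k)^t + c_7))
        endfor
      endfor
    endfor

where

$$F_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},\quad F_3 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix},\quad F_4 = \begin{pmatrix} 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix},\quad F_5 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

$$F_6 = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \end{pmatrix},\quad F_7 = \begin{pmatrix} 1 & 0 & -2 \\ 2 & -1 & -1 \end{pmatrix},$$

and

$$c_2 = \begin{pmatrix} -1 \\ 6 \end{pmatrix},\quad c_3 = \begin{pmatrix} -1 \\ 3 \\ 0 \end{pmatrix},\quad c_4 = \begin{pmatrix} 1 \\ -2 \end{pmatrix},\quad c_5 = \begin{pmatrix} 2 \\ 5 \\ -3 \end{pmatrix},\quad c_6 = \begin{pmatrix} 0 \\ -1 \\ 0 \end{pmatrix},\quad c_7 = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.$$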
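As a quick sanity check of the notation, the following minimal Python sketch evaluates a few of the references of Example 1 in the form $F \cdot I + c$ and compares them with the original subscript expressions. The helper names `affine` and `check_all` are hypothetical, and the matrices are transcribed by hand from the loop nest.

```python
# Access matrices/vectors read off Example 1 (transcribed by hand for this sketch).

def affine(F, c, I):
    """Evaluate the affine reference F*I + c for an iteration vector I."""
    return tuple(sum(f * x for f, x in zip(row, I)) + ci
                 for row, ci in zip(F, c))

F1, c1 = [[1, 1], [0, 0], [0, -1]], [-2, 1, 3]    # b(i+j-2, 1, 3-j), written in S1
F2, c2 = [[0, 1], [1, 0]], [-1, 6]                # a(j-1, i+6), read in S1
F4, c4 = [[0, 1, -1], [1, 0, 1]], [1, -2]         # a(j-k+1, i+k-2), written in S2
F7, c7 = [[1, 0, -2], [2, -1, -1]], [0, 2]        # a(i-2k, 2i-j-k+2), read in S3

def check_all(n=4):
    """Compare each (F, c) pair with the original subscript expressions."""
    for i in range(n):
        for j in range(n):
            assert affine(F1, c1, (i, j)) == (i + j - 2, 1, 3 - j)
            assert affine(F2, c2, (i, j)) == (j - 1, i + 6)
            for k in range(n):
                assert affine(F4, c4, (i, j, k)) == (j - k + 1, i + k - 2)
                assert affine(F7, c7, (i, j, k)) == (i - 2 * k, 2 * i - j - k + 2)
    return True
```

Calling `check_all()` walks a small iteration domain and raises an assertion error if any matrix or vector disagrees with the loop nest.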
Mapping the loop nest onto an m-dimensional virtual processor space consists in determining allocation matrices for each statement (to determine where each statement instance is computed) and for each array (to determine which processor each array element is allocated to). We define affine allocation functions as follows:
• for each statement S of depth d: $\mathrm{alloc}_S(I) = M_S I + \alpha_S$, where $M_S$ is an $m \times d$ matrix and $\alpha_S$ an m-vector. Here I is the iteration vector: $I = (i, j)^t$ for statement $S_1$ whose depth is d = 2, while $I = (i, j, k)^t$ for statements $S_2$ and $S_3$ whose depth is d = 3.
• for each array x of dimension $q_x$: $\mathrm{alloc}_x(I) = M_x I + \alpha_x$, where $M_x$ is an $m \times q_x$ matrix and $\alpha_x$ an m-vector.

By this mapping, statement instance S(I) is assigned to execute at virtual processor $M_S I + \alpha_S$ and array element x(I) is stored in the local memory of processor $M_x I + \alpha_x$. In [23], such allocation functions are introduced as statement alignment and array alignment respectively.

The choice of the dimension m of the target processor space is dictated by various considerations. Ideally, a perfect loop nest of depth d can be rewritten as a sequential loop (representing time-steps) surrounding d - 1 parallel loops to be executed concurrently on the virtual processors (in the case of perfect loop nests with uniform dependences, this is nothing other than Lamport's hyperplane method [16]). In such a case we would choose m = d - 1 to squeeze all the potential parallelism out of the nest and to achieve linear execution time. However, for general affine loop nests, multidimensional scheduling may be necessary and we would select m smaller than d - 1, where d is the maximal depth of a statement. Furthermore, the choice of m can be driven by other parts of the program that are being parallelized, or by hardware resource limitations. Anyway, it is clear that the choice of m should be an input parameter of the mapping process, as it depends upon external constraints, and not the output result of an alignment heuristic as in [1]: we come back to this point below.

In our example, which values of m are likely to be of interest? Before enumerating alternatives, note that for a statement S of depth d, $M_S$ is an $m \times d$ matrix of rank $\le \min(m, d)$. Similarly, for an array variable x of dimension $q_x$, $M_x$ is an $m \times q_x$ matrix of rank $\le \min(m, q_x)$.
(1) Mapping the loop nest onto a single processor (m = 0) is excluded!
(2) Mapping the loop nest onto a 1D-processor space is one possibility.
Selecting m = 1, we would impose the 6 allocation matrices (3 statements and 3 array variables) to be of rank 1 so as to actually distribute data and computations onto the whole linear processor space. Note that this is a somewhat arbitrary restriction, as a null allocation matrix might prove useful for some array or computation. However, our goal is to achieve the finest granularity of parallelism.
(3) Mapping the loop nest onto a 2D-processor grid is another possibility. Again, we would then impose all 6 allocation matrices to be of rank 2. Otherwise, some virtual processors would not get any element of one of the arrays or not get any computation instance of one of the statements.
(4) Mapping the loop nest onto a 3D-processor grid is still possible. However, statement $S_1$ is of depth 2, and array a is of dimension 2. Hence matrices $M_{S_1}$ and $M_a$ can be of rank at most 2, and it is not possible to distribute either the computations $S_1((i, j)^t)$ or the data elements a(i, j) over the whole virtual processor space. Therefore in such a case we would not deal with the communications induced by these items: we would rather try to align statements $S_2$ and $S_3$ together with arrays b and c, as they represent the core of the computations and data elements to be distributed.

In statement $S_1$, the value a(j - 1, i + 6) has to be read in the memory of processor
$$\mathrm{alloc}_a((j-1,\ i+6)^t) = M_a (j-1,\ i+6)^t + \alpha_a = M_a(F_2 (i, j)^t + c_2) + \alpha_a$$

and sent to the processor in charge of executing the computation $S_1((i, j)^t)$, namely processor $\mathrm{alloc}_{S_1}((i, j)^t) = M_{S_1}(i, j)^t + \alpha_{S_1}$. This results in a communication of length $\delta_{a,S_1}$, equal to the distance between a(j - 1, i + 6) and $S_1(i, j)$. We have

$$\delta_{a,S_1} = \mathrm{alloc}_{S_1}((i, j)^t) - \mathrm{alloc}_a((j-1,\ i+6)^t),$$
$$\delta_{a,S_1} = M_{S_1}(i, j)^t + \alpha_{S_1} - (M_a(F_2 (i, j)^t + c_2) + \alpha_a),$$
$$\delta_{a,S_1} = (M_{S_1} - M_a F_2)(i, j)^t - M_a c_2 + \alpha_{S_1} - \alpha_a.$$
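The split of the communication distance into an iteration-dependent part and a constant part can be illustrated numerically. The sketch below assumes m = 1 and uses hypothetical values for $M_a$, $\alpha_a$ and $\alpha_{S_1}$ (they are not taken from the paper); it evaluates $\delta_{a,S_1}$ over a small iteration domain for an aligned and a non-aligned choice of $M_{S_1}$.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

F2, c2 = [[0, 1], [1, 0]], [-1, 6]     # access a(j-1, i+6) in S1
M_a, alpha_a = [[1, 2]], [0]           # hypothetical allocation of array a (m = 1)
alpha_S1 = [5]                         # hypothetical offset for statement S1

def distance(M_S1, I):
    """delta_{a,S1}(I) = alloc_S1(I) - alloc_a(F2*I + c2)."""
    elem = [f + c for f, c in zip(matvec(F2, I), c2)]
    src = [p + a for p, a in zip(matvec(M_a, elem), alpha_a)]
    dst = [p + a for p, a in zip(matvec(M_S1, I), alpha_S1)]
    return [d - s for d, s in zip(dst, src)]

points = [(i, j) for i in range(3) for j in range(3)]

aligned = matmul(M_a, F2)              # M_S1 = M_a F2 cancels the I-dependent part
local = {tuple(distance(aligned, I)) for I in points}

misaligned = [[1, 0]]                  # arbitrary choice with M_S1 != M_a F2
spread = {tuple(distance(misaligned, I)) for I in points}
```

With the aligned choice, every iteration sees the same fixed offset $\alpha_{S_1} - M_a c_2 - \alpha_a$; with the other choice the distance varies with (i, j).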
In the expression of $\delta_{a,S_1}$, we have a nonlocal term $(M_{S_1} - M_a F_2)(i, j)^t$ which depends upon the iteration vector $(i, j)^t$, and a local term $\alpha_{S_1} - M_a c_2 - \alpha_a$. Note that the local term, called displacement in [1], is further divided into a neighbor term $M_a c_2$ and a constant term $\alpha_{S_1} - \alpha_a$ in [23]. Clearly, the main factor affecting performance is the nonlocal term, as it corresponds to an irregular pattern of communications, whose size can grow over the whole processor space. On the other hand, the local term represents regular fixed-size communications that can be more easily blocked or overlapped with other computations. Therefore, zeroing out the nonlocal term is the main goal of the mapping optimization process, as recognized by [1] and [23]. Here, we see that zeroing out the nonlocal term amounts to choosing allocation matrices $M_{S_1}$ and $M_a$ so that the equation $M_{S_1} - M_a F_2 = 0$ is satisfied. A communication-local mapping is a mapping where all nonlocal terms have been zeroed out.

Consider now the value b(i + j - 2, 1, 3 - j) which has to be written after the computation $S_1((i, j)^t)$. This results in a communication:

$$\delta_{S_1,b} = \mathrm{alloc}_b((i+j-2,\ 1,\ 3-j)^t) - \mathrm{alloc}_{S_1}((i, j)^t),$$
$$\delta_{S_1,b} = M_b(F_1 (i, j)^t + c_1) + \alpha_b - (M_{S_1}(i, j)^t + \alpha_{S_1}),$$
$$\delta_{S_1,b} = (M_b F_1 - M_{S_1})(i, j)^t + M_b c_1 + \alpha_b - \alpha_{S_1}.$$
Again, to zero out the nonlocal term, we have to fulfill the equation $M_b F_1 - M_{S_1} = 0$.

Following Darte and Robert [6,5] or Shang and Shu [23], we would define the communication graph as a directed graph G = (V, E) where
(1) each vertex $v \in V$ represents an array variable or a statement,
(2) if statement S writes variable x, then there is a directed edge from S to x; if statement S reads variable x, then there is a directed edge from x to S.
We would obtain the graph of Fig. 1.

In the framework of perfect uniform loop nests [6,5], all statements have the same depth d and all access matrices are equal to $Id_d$, the identity matrix of dimension d. Therefore, we get a communication-local mapping by choosing the same allocation matrix for each statement and for each variable. This is done in [6,5]: Darte and Robert set $M = M_S = M_x$ for each statement S and array x, and they try to zero out as many local terms as possible. They show that detecting cycles of null weight in the communication graph, considered as non-oriented, plays a key role in this optimization problem, which is shown to be NP-complete. In fact, the orientation of the arcs in the communication graph was introduced only to differentiate read and write accesses to variables.
Fig. 1. The communication graph with read-write edges.
In the general case of affine loop nests, the situation is more complicated. As outlined above, the primary goal is to zero out as many nonlocal terms as possible, possibly all, so as to get a communication-local mapping. But consider the equation of a nonlocal term linking statement S of depth d and array x of dimension $q_x$: the equation is $M_S - M_x F = 0$, where $M_S$ is an $m \times d$ matrix and $M_x$ an $m \times q_x$ matrix. The access matrix F is of dimension $q_x \times d$. As already said, if we target an m-dimensional processor space, we assume that $m \le d$ and $m \le q_x$, and we impose that matrices $M_S$ and $M_x$ are of full rank m. So far, we have made no hypothesis on F.

The simplest case is clearly when F is square and non-singular. In such a case, $d = q_x$; if $M_x$ of rank m is given, then $M_x F$ is of rank m and we can let $M_S = M_x F$ without violating the constraint that $M_S$ is of rank m. Conversely, if $M_S$ of rank m is given, then $M_S F^{-1}$ is of rank m, and we can let $M_x = M_S F^{-1}$.

However, for general affine loop nests, it might well be the case that F is not square but rectangular. Let us discuss the two cases where $q_x < d$ and $q_x > d$:

• $q_x < d$ (in such a case, F is flat). Assume that F is of full rank $q_x$. Then, given $M_x$ of rank m, we can easily prove¹ that $M_x F$ is of rank m, and we can safely let $M_S = M_x F$. However, given $M_S$, finding $M_x$ of rank m such that $M_S - M_x F = 0$ is not always possible. We know that F admits a pseudo-inverse² (or right-inverse) $F^{-1}$ of size $d \times q_x$ and of rank $q_x$ such that $F F^{-1} = Id$. Hence, if there exists $M_x$ such that $M_S = M_x F$, then $M_S F^{-1} = M_x F F^{-1} = M_x$. Unfortunately, $M_x = M_S F^{-1}$ is not always a solution of the equation $M_S = M_x F$: we have the compatibility condition³ $M_S = M_S F^{-1} F$. Furthermore, $M_S F^{-1}$ can be of arbitrary rank less than m. Consider the following example with m = 1, $q_x = 2$, d = 3, $M_S = (1\ 1\ 1)$, and
¹ See Lemma A.1.
² See the Appendix for background on pseudo-inverses.
³ See Lemma A.3.
a $2 \times 3$ matrix F of rank 2 whose pseudo-inverse $F^{-1}$ satisfies $M_S F^{-1} = 0$, while we were expecting a rank-1 matrix. To summarize, given $M_x$ of rank m, it is always possible to determine $M_S$ of rank m ($M_S = M_x F$), while the converse is not true.
• $d < q_x$ (in such a case, F is narrow). Assume that F is of full rank d. Then, F has a pseudo-inverse (or left inverse) $F^{-1}$ of size $d \times q_x$, of rank d, and such that $F^{-1} F = Id$. The situation is exactly the converse of the previous one: given $M_S$ of rank m, $M_x = M_S F^{-1}$ is a rank-m solution⁴ of the equation $M_S - M_x F = 0$; however, given $M_x$ of rank m, $M_x F$ can be of arbitrary rank less than m. Take a simple example as above: let m = 1, $q_x = 3$, d = 2, $M_x = (1\ 1\ 1)$, and a $3 \times 2$ matrix F of rank 2 whose columns each sum to zero. Then $M_x F = 0$, and we cannot let $M_S = M_x F$. To summarize, given $M_S$ of rank m, it is always possible to determine $M_x$ of rank m ($M_x = M_S F^{-1}$), while the converse is not true.
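The asymmetry between the flat and the narrow case can be checked on a small instance. The sketch below is only an illustration: the matrices `F_flat` and `F_narrow` are chosen for this example and are not claimed to be the ones used by the authors, and the pseudo-inverses are assumed to be the usual $F^t (F F^t)^{-1}$ (right inverse) and $(F^t F)^{-1} F^t$ (left inverse).

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a nonsingular 2 x 2 matrix, in exact arithmetic."""
    (a, b), (c, d) = A
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def right_inverse(F):
    """Right inverse F^t (F F^t)^{-1} of a flat, full-rank 2 x d matrix."""
    Ft = transpose(F)
    return matmul(Ft, inv2(matmul(F, Ft)))

def left_inverse(F):
    """Left inverse (F^t F)^{-1} F^t of a narrow, full-rank q x 2 matrix."""
    Ft = transpose(F)
    return matmul(inv2(matmul(Ft, F)), Ft)

# Illustrative matrices chosen for this sketch (an assumption, not the paper's).
F_flat = [[1, 0, -1], [0, 1, -1]]          # q_x = 2 < d = 3
F_narrow = transpose(F_flat)               # q_x = 3 > d = 2
ones = [[1, 1, 1]]

# Flat case: M_S = M_x F always works, but M_S F^{-1} may collapse to zero.
flat_forward = matmul([[1, 2]], F_flat)
flat_backward = matmul(ones, right_inverse(F_flat))

# Narrow case: M_x = M_S F^{-1} always works, but M_x F may collapse to zero.
M_x = matmul([[1, 2]], left_inverse(F_narrow))
narrow_forward = matmul(M_x, F_narrow)
narrow_backward = matmul(ones, F_narrow)
```

In both cases the "safe" direction returns a rank-1 row vector, while the other direction maps the rank-1 matrix (1 1 1) to the null matrix.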
⁴ See Lemma A.3.

We summarize the discussion by informally introducing the m-dimensional access graph G = (V, E, m) (see Section 4.1 for a more formal definition):
(1) m is the dimension of the target virtual architecture.
(2) Vertices are the same as in the communication graph: each vertex $v \in V$ represents an array variable or a statement.
(3) Edges are a subset of the edges of the communication graph, and they are oriented. Consider a loop nest where an array variable x of dimension $q_x$ is accessed (read or written, there is no difference) in a statement S of depth d, through an access matrix F of rank $\min(q_x, d)$ greater than or equal to the dimension m of the target architecture: then if $q_x < d$ we have an edge from x to S, to indicate that given $M_x$ of rank m it is always possible to find $M_S$ of rank m such that the communication is made local; and if $d < q_x$ we have an edge from S to x, to indicate that given $M_S$ of rank m it is always possible to find $M_x$ of rank m such that the communication is made local. In the special case where $q_x = d$, we do not draw two opposite edges (because there is a single potential communication) but rather we draw a single edge with two arrows so as to indicate that both orientations are possible.

The access graph for our target example is represented in Fig. 2. There are several points worth pointing out:
• The number of edges in the access graph depends upon the dimension m of the target architecture. The set of edges in the (m + 1)-dimensional access graph G = (V, E, m + 1) is a subset of the set of edges in the m-dimensional access graph G = (V, E, m). In our example, the access graph is the same for m = 1 and m = 2. But, for m = 3, it would consist only of the two "double" edges $S_2 \leftrightarrow b$ and $S_3 \leftrightarrow c$.
• Define the potential of a vertex in the communication graph as its depth (for a statement) or as its dimension (for an array). For all edges in the access graph, the tail has a potential at most equal to that of the head. In particular, cycles can only occur when all edges in the cycle correspond to equipotential vertices, i.e. when all access matrices in the cycles are non-singular square matrices.
• Consider a simple path in the access graph going from vertex $v_1$ to vertex $v_2$ with $v_1 \ne v_2$ (the path is not a cycle). Then given any allocation matrix $M_{v_1}$ of rank m for vertex $v_1$, the existence of the path ensures that it is always possible to make local all communications occurring between the vertices in the path.

In our example, given $M_a$ of rank 2 in G = (V, E, 2), and following the path $a \to S_1 \to b \to S_2$, we are ensured to be able to compute $M_{S_1}$, $M_b$ and $M_{S_2}$, all of rank 2, so that the communications corresponding to access matrices $F_2$ (reading a in $S_1$), $F_1$ (writing b in $S_1$) and $F_5$ (reading b in $S_2$) are made local: we successively let $M_{S_1} = M_a F_2$, $M_b = M_{S_1} F_1^{-1}$ and $M_{S_2} = M_b F_5$.

Starting from a, another path to follow from a to $S_2$ is to use the direct edge $a \to S_2$ (access matrix $F_4$ for writing a in $S_2$). Is it possible to make this communication local in addition to the three above communications? Using the edge $a \to S_2$ we get the equation $M_{S_2} = M_a F_4$ while we had $M_{S_2} = M_b F_5 = M_{S_1} F_1^{-1} F_5 = M_a F_2 F_1^{-1} F_5$. We derive the condition $M_a F_2 F_1^{-1} F_5 = M_a F_4$. This condition is satisfied for all matrices $M_a$ of rank m iff $F_2 F_1^{-1} F_5 = F_4$. In our example, let $F_{path} = F_2 F_1^{-1} F_5$. We compute

$$F_{path} = \begin{pmatrix} 0 & 0 & -1 \\ 1 & 0 & 1 \end{pmatrix} \ne F_4.$$
Fig. 2. The access graph (m = 1 or m = 2).
Note however that

$$F_{path} - F_4 = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

is a rank-1 matrix. Hence if m = 2, then the $2 \times 2$ matrix $M_a$ of rank 2 is nonsingular, and the condition $M_a F_2 F_1^{-1} F_5 = M_a F_4$ can never be satisfied. But if m = 1, then the condition can indeed be satisfied by some $1 \times 2$ matrices $M_a$ of rank 1 (take for instance $M_a = (0\ 1)$).

In fact, this analysis can be extended to the general case: each time there are two disjoint paths $p_1$ and $p_2$ both going from a vertex $v_1$ to a vertex $v_2$ in the access graph, we can make all communications on both paths local provided that the equality $F_{p_1} = F_{p_2}$ holds (where $F_p$ denotes the product of the access matrices along the edges of path p). (See Section 4.1 for more details.)

The access graph illustrates one of our basic assumptions: it is not very practical to aim at communication-local mappings and to compute the largest dimension m of the target architecture that makes it feasible. In the example, if m = 2, we already know that we cannot make the above four communications local. If m = 1, we follow the paths $a \to S_1 \to c \to S_3$ and $a \to S_3$ to get the equation $M_a F_2 F_3^{-1} F_6 = M_a F_7$. But $F_2 F_3^{-1} F_6 - F_7$ is of rank 2, and the condition can never be satisfied. Hence the largest possible dimension for a communication-local mapping is m = 0, not a very parallel architecture!

We insist on the fact that the access graph depends upon the dimension m of the target architecture. Given m, not all the communications are taken into account in the access graph G = (V, E, m). The edges in G represent only the communications with access matrices of full rank, greater than or equal to m. So, we do not try to make all the communications local but only the "most important" ones. Also, we said that the value of m should be an input parameter of the mapping process, rather than being determined during this process. Of course, we can try several values of m as input, and select the best output based both upon the number of nonlocal communications (which decreases as m decreases) and upon all external constraints.
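The two rank claims of this discussion can be reproduced with exact rational arithmetic. The following sketch assumes that the pseudo-inverse of a narrow full-rank matrix is the left inverse $(F^t F)^{-1} F^t$ (the Appendix definition is not reproduced here), and uses the access matrices read off Example 1; all helper names are hypothetical.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(A):
    """Inverse of a nonsingular 2 x 2 matrix, in exact arithmetic."""
    (a, b), (c, d) = A
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def left_inverse(F):
    """Assumed pseudo-inverse (F^t F)^{-1} F^t of a narrow q x 2 matrix."""
    Ft = transpose(F)
    return matmul(inv2(matmul(Ft, F)), Ft)

def rank(A):
    """Rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in A]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

F1 = [[1, 1], [0, 0], [0, -1]]
F2 = [[0, 1], [1, 0]]
F3 = [[1, 0], [0, 1], [1, -1]]
F4 = [[0, 1, -1], [1, 0, 1]]
F5 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
F6 = [[1, 1, 1], [0, 1, 0], [1, 0, -1]]
F7 = [[1, 0, -2], [2, -1, -1]]

# Path a -> S1 -> b -> S2 against the direct edge a -> S2.
F_path = matmul(matmul(F2, left_inverse(F1)), F5)
D_b = sub(F_path, F4)

# Path a -> S1 -> c -> S3 against the direct edge a -> S3.
D_c = sub(matmul(matmul(F2, left_inverse(F3)), F6), F7)
```

Under these assumptions `D_b` has rank 1 and is annihilated on the left by (0 1), whereas `D_c` has rank 2, so no nonzero $1 \times 2$ matrix $M_a$ can satisfy both conditions.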
3. Review of current work

The data alignment problem has motivated a vast amount of research. A brief review of some related work is presented here. The review contains only short abstracts of the presented papers and is by no means intended to be comprehensive.

Li and Chen. In [18], the authors formulate the problem of index domain alignment as finding suitable alignment functions that embed the index domains of the arrays into a common index domain. The paper contains an NP-completeness result: minimizing the cost of data movement is shown to be an NP-complete problem. Besides, a greedy heuristic algorithm is given at the end of the paper.

Feautrier. In [8], Feautrier proposes a greedy algorithm analogous to Gaussian elimination to determine a placement function. Data and computations are aligned in such a way that the owner computes rule is respected. The main idea is to zero out edges corresponding to the most important communication volume. A heuristic is given to estimate the communication volume associated to an edge.

Anderson and Lam. In [1], the authors propose an algorithm and heuristics that determine alignment for both data and computations (extension of the owner computes rule). The algorithm is based on the mathematical model of decompositions as affine functions and is structured into three components: partition, orientation and displacement. The only parallelism exploited is forall parallelism or doacross parallelism using tiling.

Darte and Robert. In [6,5], the authors introduce a communication graph that contains all the information to align data and computations. They formulate ways to reduce the amount of communications (communication rank minimization, broadcasting, message vectorization, ...). But the main result is an NP-completeness result. Darte and Robert restrict themselves to a simple case (a perfect loop nest in which all access functions are translations) and they show that, even in this case, the alignment problem is NP-complete. They give several heuristics.

Shang and Shu. In [23], the data alignment problem for single loop nests is addressed. The paper was motivated by [6]. Algorithms are classified into uniform communication algorithms, where communication patterns are regular, and affine communication algorithms, where communication patterns are affine functions of loop index variables. Necessary and sufficient conditions for the existence of a nontrivial mapping without nonlocal communications are presented.

Chen and Sheu. In [4,3], the authors analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without inter-block communication. The arrays can be partitioned under the communication-free criteria with non-duplicate or duplicate area. Finally, a heuristic method for mapping the partitioned array elements and iterations onto fixed-size multicomputers under the consideration of load-balancing is proposed.

Knobe, Lukas and Steele. In [13], the authors discuss techniques for automatic layouts of arrays in a compiler targeted to SIMD architectures. The approach to data locality is to consider each occurrence of a datum as a separately allocated object and to mark preferences among certain occurrences to indicate that they should be allocated together. This approach is extended in [20] to MIMD systems. In [19], Lukas shows that the same data alignment optimization techniques can be used in both distributed and shared memory systems. For shared memory systems, when alignment preferences can be satisfied, synchronization requirements are eliminated.

Sinharoy and Szymanski. In [24], a compile-time selection of data alignment that minimizes communication cost is discussed. In contrast with other approaches which consider only the volume of communication, the communication cost is measured in terms of the distance traveled by the messages. The problem is shown to be NP-hard. Two algorithms for exact minimum solutions are discussed; a polynomial-time algorithm for finding an approximate solution is also described.

O'Boyle and Hedayat. In [22], the authors discuss data alignment in a linear algebraic framework. Aligned data can be viewed as forming a hyperplane in the iteration space. This allows the quantification of data alignment and the determination of the existence of transformations to reduce non-local access.

Huang and Sadayappan. In [12], the authors consider the issue of communication-free hyperplane partitioning. By modeling the iteration and data spaces and the relation that maps one to another, necessary and sufficient conditions for the feasibility of communication-free partitioning along hyperplanes are characterized.

Hovland and Ni. In [11], a formal technique using augmented data access descriptors (ADADs) to determine data distribution among the processors in distributed-memory machines is presented. The problem of data alignment is viewed as an extension of data dependence analysis. An algorithm for automatically generating Fortran D alignment and distribution statements is presented.
4. Mapping affine loop nests

4.1. Statement of the problem

The problem is formulated in terms of graphs. Accesses to arrays and statement instances are represented by an oriented access graph defined as follows:

Definition 2 (access graph). The access graph is a directed graph G = (V, E, m) such that:
• m is the dimension of the target virtual architecture,
• G has v vertices $v_1, \ldots, v_v$, one per array and one per statement,
• G has e edges $e_1, \ldots, e_e$, where e depends on m. There is an edge between an array and a statement when the array is read or written in the statement and when the access matrix is of rank greater than or equal to m. There are two kinds of edges. Some edges are one-direction oriented edges represented by a simple arrow, the others can be considered as oriented in both directions and they are represented by double arrows, see Fig. 2. A weight matrix is associated to each edge.
Orientation of the edges. Assume that in a statement S, there is an access to an array x. $M_S$ and $M_x$ represent the allocation matrices of S and x. F represents the access matrix of x in S. $M_S$ is a matrix of size $m \times d$, $M_x$ a matrix of size $m \times q_x$ and F a matrix of size $q_x \times d$, where $m \le \min(q_x, d)$. We assume that F is a full rank matrix of rank greater than or equal to m. There are several cases depending on $(q_x, d)$:

• If $q_x < d$ (F is flat): F is right invertible, i.e. there exists a matrix $F^{-1}$ such that $F F^{-1} = Id$ (see Section 2 and the Appendix). There is an oriented edge from x to S, the weight of the edge is F. We have:

$$x \xrightarrow{\ F\ } S$$

• If $q_x > d$ (F is narrow): F is left invertible, i.e. there exists a matrix $F^{-1}$ such that $F^{-1} F = Id$. There is an oriented edge from S to x, the weight of the edge is $F^{-1}$. We have:

$$S \xrightarrow{\ F^{-1}\ } x$$

• If $q_x = d$ (F is square): there is an edge between x and S which can be oriented in the two directions. The edge is represented by a double arrow and the weight can be either F or $F^{-1}$ depending on the direction along which the edge is considered. We have:

$$x \xrightarrow{\ F\ } S \quad \text{and} \quad S \xrightarrow{\ F^{-1}\ } x$$
Example. The oriented access graph associated to the motivating example is given in Fig. 2.

Affine mapping problem. Given the access graph G = (V, E, m), the mapping problem can now be stated as follows:
• choose for each node an allocation matrix,
• define for each edge $e = (x, y) \in E$ the communication matrix $\delta_e$ by $\delta_e = M_y - M_x C_e$, where $C_e$ is the weight of the edge.
The goal is to maximize the number of local communications, that is, to maximize the number of null communication matrices.

Remark. In the access graph, not all the communications are taken into account. The edges represent only the communications with access matrices of full rank, greater than or equal to m. So, we do not try to make all the communications local but only the "most important" ones.

It is in general impossible to zero out all $\delta_e$. For example, when there are several oriented paths in the graph going from the same source to the same destination, it is not always possible to make all the communications local along the paths. But a sufficient condition on the access matrices that guarantees this possibility can be found. In the same way, when the graph contains an oriented cycle, the possibility to zero out all the communication matrices on the cycle depends upon the access matrices of the cycle.
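The orientation rules of the access graph can be sketched as a small function. This is a minimal illustration under two assumptions: the rank condition is read as "rank at least m" (which matches the motivating example, where rank-2 accesses are kept for m = 2), and the function and variable names (`access_edges`, `accesses`) are hypothetical.

```python
from fractions import Fraction

def rank(A):
    """Rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in A]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def access_edges(accesses, m):
    """Build the edges of G = (V, E, m).

    accesses: list of (array, statement, F), F being a q_x x d access matrix.
    Returns (tail, head, weight_label) triples; "both" marks a double arrow.
    """
    edges = []
    for array, stmt, F in accesses:
        q, d = len(F), len(F[0])
        r = rank(F)
        if r < min(q, d) or r < m:      # keep full-rank accesses of rank >= m only
            continue
        if q < d:
            edges.append((array, stmt, "F"))       # flat: x -> S, weight F
        elif q > d:
            edges.append((stmt, array, "F^-1"))    # narrow: S -> x, weight F^{-1}
        else:
            edges.append((array, stmt, "both"))    # square: double arrow
    return edges

# Accesses of Example 1 (matrices read off the loop nest).
accesses = [
    ("b", "S1", [[1, 1], [0, 0], [0, -1]]),             # F1
    ("a", "S1", [[0, 1], [1, 0]]),                      # F2
    ("c", "S1", [[1, 0], [0, 1], [1, -1]]),             # F3
    ("a", "S2", [[0, 1, -1], [1, 0, 1]]),               # F4
    ("b", "S2", [[1, 0, 0], [0, 1, 0], [0, 0, 1]]),     # F5
    ("c", "S3", [[1, 1, 1], [0, 1, 0], [1, 0, -1]]),    # F6
    ("a", "S3", [[1, 0, -2], [2, -1, -1]]),             # F7
]
```

For instance, `access_edges(accesses, 3)` keeps only the two square rank-3 accesses, while m = 1 and m = 2 give the same seven edges.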
Multiple paths. If we consider the access graph of Fig. 2, there are several ways to go from $S_1$ to $S_2$ in the graph: through a (path $S_1 \to a \to S_2$) or through b (path $S_1 \to b \to S_2$). So, if we want to zero out all the communication matrices, we must have

$$M_{S_2} = M_a F_4 = M_{S_1} F_2^{-1} F_4 \quad \text{and} \quad M_{S_2} = M_b F_5 = M_{S_1} F_1^{-1} F_5,$$

so

$$M_{S_1}(F_2^{-1} F_4 - F_1^{-1} F_5) = 0.$$

But

$$F_1^{-1} F_5 - F_2^{-1} F_4 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix}$$

is not null, so the previous condition is not satisfied for all matrices $M_{S_1}$. In the particular case where $M_{S_1}$ is a square matrix, it is not possible to satisfy the condition with a matrix of full rank.

More generally, if we denote by $F_{p_1}$ the product of the weight matrices along the first path, by $F_{p_2}$ the product along the second path, and by M the allocation matrix of the source vertex of the two paths, we have:
• if $F_{p_1} - F_{p_2} = 0$, we can choose any matrix M and make all the communications local on the two paths,
• if $F_{p_1} - F_{p_2}$ is of full rank, it is not possible to zero out all the communication matrices,
• if $F_{p_1} - F_{p_2}$ is of deficient rank, then depending on the size of M it may or may not be possible to find a matrix M satisfying $M(F_{p_1} - F_{p_2}) = 0$.
When it is impossible to find M satisfying $M(F_{p_1} - F_{p_2}) = 0$, we can, for example, make local all the communications in the first path and all the communications in the second path except the last one.
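The three cases above reduce to a rank computation: a full-rank $m \times n$ matrix M with $M(F_{p_1} - F_{p_2}) = 0$ exists exactly when m does not exceed the dimension of the left null space of the difference. The sketch below illustrates this on the two paths from $S_1$ to $S_2$ through a and through b in Example 1; the choice of paths, the left inverse $(F^t F)^{-1} F^t$ and the helper names are assumptions of this example.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(A):
    """Inverse of a nonsingular 2 x 2 matrix, in exact arithmetic."""
    (a, b), (c, d) = A
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def left_inverse(F):
    """Assumed pseudo-inverse (F^t F)^{-1} F^t of a narrow q x 2 matrix."""
    Ft = transpose(F)
    return matmul(inv2(matmul(Ft, F)), Ft)

def rank(A):
    """Rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in A]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def max_local_dimension(F_p1, F_p2):
    """Largest m such that some rank-m matrix M satisfies M (F_p1 - F_p2) = 0."""
    D = sub(F_p1, F_p2)
    return len(D) - rank(D)     # dimension of the left null space of D

F1 = [[1, 1], [0, 0], [0, -1]]
F2 = [[0, 1], [1, 0]]
F4 = [[0, 1, -1], [1, 0, 1]]
F5 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

F_via_a = matmul(inv2(F2), F4)              # S1 -> a -> S2
F_via_b = matmul(left_inverse(F1), F5)      # S1 -> b -> S2
```

Here `max_local_dimension(F_via_a, F_via_b)` is 1: both paths can be made local for m = 1 but not for m = 2, which is the rank-deficient case of the list above.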
Cycles. The problem with cycles is similar to the problem with multiple paths. Consider now the following example:
Fig. 3. The oriented access graph.
Example 3.
    for i = 0 to N
      for j = 0 to M
        for k = 0 to M
          {Statement S_1}  a(F_1 (i, j, k)^t + c_1) = g_1(b(F_2 (i, j, k)^t + c_2), d(F_3 (i, j, k)^t + c_3))
          {Statement S_2}  b(F_4 (i, j, k)^t + c_4) = g_2(c(F_5 (i, j, k)^t + c_5))
          {Statement S_3}  c(F_6 (i, j, k)^t + c_6) = g_3(d(F_7 (i, j, k)^t + c_7))
          {Statement S_4}  d(F_8 (i, j, k)^t + c_8) = g_4(a(F_9 (i, j, k)^t + c_9))
        endfor
      endfor
    endfor

The loop nest is of depth 3 and a, b, c, d are 3D-arrays. We want to map the computations and the arrays on a 2D-processor space (m = 2). We assume that the access matrices ($F_i$, i = 1, ..., 9) are of full rank, i.e. non singular. The access graph is represented in Fig. 3.

The access graph contains two elementary cycles. Assume that we want to zero out all the communication matrices in the first cycle; we must have:

$$M_{S_1} = M_a F_1 = M_{S_4} F_9^{-1} F_1 = M_d F_8 F_9^{-1} F_1,$$
$$M_{S_1} = M_{S_1} F_3^{-1} F_8 F_9^{-1} F_1.$$

If $F_3^{-1} F_8 F_9^{-1} F_1 = Id$, it is possible to arbitrarily choose a first allocation matrix and to zero out all the communication matrices on the cycle. In the same way, if we want to zero out all the communication matrices in the second cycle, we must have:

$$M_{S_1} = M_{S_1} F_3^{-1} F_7 F_6^{-1} F_5 F_4^{-1} F_2.$$
For example,
let us take:
So, we have
If we choose
all the communication matrices can be zeroed out.

In the case of a cycle, the results are the same as in the case of two multiple paths. Instead of considering the matrix $F_{p_1} - F_{p_2}$, we consider now the matrix $F_{cycle} - I$, where $F_{cycle}$ is the product of the weight matrices along the oriented cycle and I is the identity matrix. If M is one allocation matrix of the cycle, we have:
• if $F_{cycle} - I = 0$, M can be chosen arbitrarily and all the communications can be made local,
• if $F_{cycle} - I$ is of full rank, it is not possible to zero out all the communication matrices,
• if $F_{cycle} - I$ is rank deficient, then depending on the size of M it may or may not be possible to zero out the communication matrices.
When it is impossible to have $M(F_{cycle} - I) = 0$, we can zero out all the communication matrices in the cycle from M on, except the last one.

4.2. A simple heuristic
We saw in the previous section that when there are oriented cycles or multiple paths in the graph, it may be impossible to make all the communications local. So, we will now try to find a subgraph G' of G that contains neither cycles nor multiple paths. A possible solution is to choose for G' a branching of G. Before studying the mapping problem, we first need some definitions and results borrowed from graph theory. For more details, see [7].

Definition 4 (branching and arborescence). An arborescence is defined as a tree in which no two arcs are directed to the same vertex. A branching is defined as a forest in which each tree is an arborescence.
Fig. 4. A maximum branching.
Definition 5 (maximum branching). Associate an integer weight $a(x, y)$ to each arc $(x, y)$. The weight¹ of a branching is defined as the sum of the weights of the arcs in the branching. A maximum branching of graph G is any branching of graph G with the largest possible weight.
In [7], Edmonds' maximum branching algorithm, which constructs a maximum branching for any graph G, is given.

Lemma 6. If G is a branching, one can choose the allocation matrices so that all communication matrices are null.

Proof. In a branching, each vertex has at most one incoming edge. Thus, it is always possible to choose an allocation matrix that zeroes out the communication matrix. For example, construct a topological sort for each tree of G and choose a rank-$m$ allocation matrix $M_r$ for each root $r$. Then choose the remaining allocation matrices node after node, so as to satisfy $M_y - M_x C_u = 0$ for each arc $u = (x, y)$. □

Lemma 7. If we associate a weight 1 to each edge in the graph G, the number of communications (edges in the access graph) that can be made local is at least equal to the weight of a maximum branching of G.

Proof. The proof is immediate with the previous lemma. □
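The construction used in the proof of Lemma 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `allocate_on_branching`, the dictionary encoding of the branching and the matrix values are assumptions made for the example. Each non-root vertex stores its unique incoming arc, and allocation matrices are propagated from the roots so that M_child = M_parent C holds on every arc. The sketch does not check that the products keep rank m; in the paper this is guaranteed by the orientation of the edges.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def allocate_on_branching(roots, arcs):
    """Propagate allocation matrices along a branching.

    roots: {vertex: allocation matrix chosen for that root}
    arcs:  {child: (parent, weight)}, at most one incoming arc per vertex
    Returns {vertex: allocation matrix} with M_child = M_parent * weight.
    """
    alloc = dict(roots)

    def resolve(vertex):
        # Visit parents first, which amounts to following a topological order.
        if vertex not in alloc:
            parent, weight = arcs[vertex]
            alloc[vertex] = matmul(resolve(parent), weight)
        return alloc[vertex]

    for vertex in arcs:
        resolve(vertex)
    return alloc


# Hypothetical branching a -> S1 -> b with assumed 3 x 3 weights and m = 2.
M_a = [[1, 0, 0], [0, 1, 0]]
F = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
G = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
alloc = allocate_on_branching({"a": M_a}, {"S1": ("a", F), "b": ("S1", G)})
```

Because every vertex has at most one incoming arc, no equation is ever imposed twice on the same allocation matrix, which is why all the communication matrices of a branching can be zeroed out.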
The maximum branching algorithm given in [7] gives a simple way to obtain a solution. Fig. 4 represents a maximum branching for the graph of Fig. 2.

Definition 8 (weight of path). The weight of a path $k$ is defined as $W_k = \prod_{u \in k} C_u$.

¹ The integer weight defined here for the branching is not related to the previous weight (access matrix) introduced for the edges of the access graph.
Simple heuristic. With the previous lemmas and definitions, we see that a simple heuristic for the affine mapping problem consists of the following steps, for an access graph G = (V, E, m):
1. construct a maximum branching G' = (V', E', m) of G with Edmonds' algorithm;
2. for each edge in E \ E', try to add the edge to G'. If the addition of the new edge creates a cycle whose weight is the identity matrix, or a new path with the same source and destination vertices and the same weight as an already existing path, the edge can be added to E'. At this step, all the communications represented by edges in G' can be made local;
3. consider the multiple paths and the cycles with $F_{p_1} - F_{p_2}$ or $F_{cycle} - I$ of deficient rank, and try to find allocation matrices that zero out these communications as well.

4.3. NP-completeness of the affine mapping problem
We define the decision problem associated to the affine mapping problem as follows: given the access graph G = (V, E, m) of an affine loop nest, and a positive integer K, is there a set of rank-$m$ allocation matrices $M_x$ (for each array $x$ of dimension $q_x \ge m$) and $M_S$ (for each statement $S$ of depth $d \ge m$) such that at least K communications (edges with full rank access matrices in G) are made local?

It is not surprising that the affine mapping problem is NP-complete (see the proof below). Various instances of the mapping problem have been shown to be NP-complete. Li and Chen showed that the problem of determining an optimal static alignment between the dimensions of distinct arrays is NP-complete [17]. Anderson and Lam showed that the dynamic data layout problem is NP-hard in the presence of control flow between loop nests [1]. Mace discussed three different formulations of the dynamic data layout problem for interleaved memory machines [21]. Kremer proved that the inter-phase data layout problem is NP-complete [15]. Gilbert and Schreiber proved that aligning temporaries in array expressions with common sub-expressions is NP-complete [10]. Bouchitté et al. [2] proved that evaluating HPF style expressions with communication/computation overlap is NP-complete. Finally, Darte and Robert [6] proved that a very simple instance of the mapping problem, namely aligning arrays and computations when mapping perfect uniform loop nests onto distributed-memory computers, is NP-complete. Below we show that the affine mapping problem (as defined above via the access graph) is NP-complete in the strong sense:

Theorem 9. The affine mapping problem is NP-complete (in the strong sense).
Proof. The affine mapping problem is of course in NP: given a solution, one can count in polynomial time the number of local communications. The subset product problem (SPP) is the following NP-complete (in the strong sense) problem [9, p. 224]: given a finite set $A = \{a_1, a_2, \ldots, a_n\}$, an integer size $s(a) \ge 2$ for each $a \in A$, and a nonnegative integer $B$, is there a subset $A' \subseteq A$ such that the product of the sizes of the elements in $A'$ is exactly $B$? We polynomially transform the SPP problem into the following particular instance of the affine mapping problem:
Loop nest.

for i = ... to ... do
for j = ... to ... do
Statement S_1: b_2(F_1 (i, j)^t) = b_1(G_1 (i, j)^t) + b_1(H_1 (i, j)^t)
Statement S_2: b_3(F_2 (i, j)^t) = b_2(G_2 (i, j)^t) + b_2(H_2 (i, j)^t)
...
Statement S_{n-1}: b_n(F_{n-1} (i, j)^t) = b_{n-1}(G_{n-1} (i, j)^t) + b_{n-1}(H_{n-1} (i, j)^t)
Statement S_n: b_1(F_n (i, j)^t) = b_n(G_n (i, j)^t) + b_n(H_n (i, j)^t)
endfor
endfor

Each $b_i$ is a 2D-array. We let $F_i = Id_2$, the identity matrix of order 2, for $1 \le i \le n-1$, and $F_n = \begin{pmatrix} B & 0 \\ 0 & 1 \end{pmatrix}$. For $1 \le i \le n$, we let
\[ G_i = \begin{pmatrix} s(a_i) & 0 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad H_i = Id_2. \]
Access graph. We let m = 2. The access graph G = (V, E, 2) is represented in Fig. 5. There are 3n "double" edges in the graph: all edges are "double" edges because all access matrices are square nonsingular 2 × 2 matrices. Furthermore, all allocation matrices (for each array and each statement) are square nonsingular 2 × 2 matrices too.

Problem instance. Given the access graph, we let K = 2n, i.e. we ask whether we can make at least 2n communications local.

The construction of G is clearly polynomial in the size of the SPP instance. There are 3n edges in the access graph G. Each statement $S_i$ is connected to three edges in the graph, for $1 \le i \le n$:
• there are two edges between vertices $S_i$ and $b_i$, corresponding to access matrices $G_i$ and $H_i$ respectively. Let $e_{G_i}$ and $e_{H_i}$ denote these two edges;
Fig. 5. The access graph corresponding to the transformation of the Subset Product Problem.
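• there is one edge between vertices $S_i$ and $b_{i+1 \bmod n}$, corresponding to access matrix $F_i$. Let $e_{F_i}$ denote this edge.

We first point out that $e_{G_i}$ and $e_{H_i}$ cannot be made local simultaneously: it would imply $M_{S_i} = G_i M_{b_i} = H_i M_{b_i}$, hence $(G_i - H_i) M_{b_i} = 0$. But $M_{b_i}$ is invertible, and the condition is equivalent to $G_i - H_i = 0$. But
\[ G_i - H_i = \begin{pmatrix} s(a_i) - 1 & 0 \\ 0 & 0 \end{pmatrix}, \]
where $s(a_i) \ge 2$, therefore a contradiction. As a consequence, at most 2n edges can be made local. To make exactly 2n edges local, for each statement $S_i$ we must choose exactly one edge in $\{e_{G_i}, e_{H_i}\}$ and make it local together with edge $e_{F_i}$. Let $A'$ be the subset of those elements $a_i \in A$ such that edge $e_{G_i}$ has been made local. We have the following conditions:
• For $1 \le i \le n$, $M_{S_i} = G_i M_{b_i}$ if $a_i \in A'$ and $M_{S_i} = H_i M_{b_i}$ otherwise. We write $M_{S_i} = X_i M_{b_i}$, where $X_i = G_i$ if $a_i \in A'$ and $X_i = H_i$ otherwise.
• For $1 \le i \le n$, $M_{S_i} = F_i M_{b_{i+1 \bmod n}}$.
We obtain
\[ M_{S_1} = X_1 M_{b_1}, \quad M_{b_2} = F_1^{-1} M_{S_1}, \quad M_{S_2} = X_2 M_{b_2}, \quad \ldots, \quad M_{S_n} = X_n M_{b_n}, \quad M_{b_1} = F_n^{-1} M_{S_n}, \]
hence the condition
\[ M_{b_1} = F_n^{-1} X_n F_{n-1}^{-1} X_{n-1} \cdots F_1^{-1} X_1 M_{b_1}. \]
Again, $M_{b_1}$ is invertible, hence the equivalent condition
\[ F_n^{-1} X_n F_{n-1}^{-1} X_{n-1} \cdots F_1^{-1} X_1 = Id_2. \]
Given the values of $F_i$ and $X_i \in \{G_i, H_i\}$, $1 \le i \le n$, which are all diagonal matrices, we derive that
\[ \begin{pmatrix} \prod_{a_i \in A'} s(a_i) & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} B & 0 \\ 0 & 1 \end{pmatrix}. \]
The condition is exactly $\prod_{a_i \in A'} s(a_i) = B$, which completes the reduction of the SPP instance to an instance of the affine mapping problem that can be solved if and only if the original SPP instance can be solved. □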
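To make the source problem of the reduction concrete, here is a brute-force check of a Subset Product instance. It is only a toy sketch: the function name `subset_product_witness` and the instance values are assumptions made for illustration, and the enumeration is exponential in the number of elements, so it says nothing about efficient solvability. In the reduction above, the returned indices play the role of A', i.e. the statements whose edge e_{G_i} is made local.

```python
from itertools import combinations
from math import prod


def subset_product_witness(sizes, target):
    """Return the indices of a subset whose sizes multiply to target, or None."""
    for count in range(len(sizes) + 1):
        for subset in combinations(range(len(sizes)), count):
            # prod of an empty subset is 1, matching the empty product convention.
            if prod(sizes[i] for i in subset) == target:
                return subset
    return None


# Hypothetical instance: s(a_1), ..., s(a_4) = 2, 3, 5, 7.
sizes = [2, 3, 5, 7]
witness = subset_product_witness(sizes, 42)     # 2 * 3 * 7 = 42
no_witness = subset_product_witness(sizes, 11)  # 11 is not a product of these sizes
```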
4.4. A refined heuristic

Consider the following elementary case: a statement $S$ accesses two arrays $a$ and $b$ through the access matrices $F_a$ and $F_b$. To zero out the two communication matrices, we must have:
\[ M_S = M_a F_a, \qquad M_S = M_b F_b. \]
With the simple heuristic, only one communication matrix is zeroed out: the branching eliminates one of the two edges. However, under some conditions on matrices $F_a$ and $F_b$, it is sometimes possible to satisfy both equalities. We know that (see [14] and Section 6):
\[ M_a = M_b F_b F_a^{-1} \ \text{ and } \ M_b F_b F_a^{-1} F_a = M_b F_b \quad \Longrightarrow \quad M_a F_a = M_b F_b. \]
So, $F_b F_a^{-1} F_a = F_b$ (or conversely $F_a F_b^{-1} F_b = F_a$) is a sufficient condition to be able to zero out the two communication matrices. We choose $M_b$ (or $M_a$) and deduce $M_a = M_b F_b F_a^{-1}$ (or $M_b = M_a F_a F_b^{-1}$). Remark that $F_a$ is a matrix of size $q_a \times d$ and $F_b$ a matrix of size $q_b \times d$, so $F_b F_a^{-1}$ is a matrix of size $q_b \times q_a$ and $F_a F_b^{-1}$ a matrix of size $q_a \times q_b$.

Therefore, to take into account the possibility of making the two communications local, we can add virtual edges to the access graph defined in Section 4.1. The orientation and the weight of the virtual edge depend on the relative values of $q_a$ and $q_b$.
• If $q_a < q_b$ ($F_a F_b^{-1}$ is flat): if $F_a F_b^{-1}$ and $M_a$ are full rank matrices, then the matrix $M_a F_a F_b^{-1}$ will also be of full rank (see Lemma A.1). We add in the access graph a virtual edge from vertex $a$ to vertex $b$ with weight $F_a F_b^{-1}$. Note that the fact that both $F_a$ and $F_b$ are of full rank does not imply that $F_a F_b^{-1}$ is of full rank too: this is a condition on $F_a F_b^{-1}$ for the virtual edge to be inserted.
• If $q_b < q_a$ ($F_b F_a^{-1}$ is flat): in the same way, if $F_b F_a^{-1}$ and $M_b$ are full rank matrices, then the matrix $M_b F_b F_a^{-1}$ will also be of full rank. We add in the access graph a virtual edge from vertex $b$ to vertex $a$ with weight $F_b F_a^{-1}$.
• If $q_a = q_b$ ($F_a F_b^{-1}$ and $F_b F_a^{-1}$ are square): if $F_a F_b^{-1}$ and $F_b F_a^{-1}$ are of full rank, we insert a double oriented edge in the access graph between $a$ and $b$, with weight $F_a F_b^{-1}$ or $F_b F_a^{-1}$ according to the direction in which the edge is considered. Note that the two matrix products $F_a F_b^{-1}$ and $F_b F_a^{-1}$ are of the same rank: $F_a F_b^{-1} = F_a F_b^t (F_b F_b^t)^{-1}$ and $F_b F_a^{-1} = F_b F_a^t (F_a F_a^t)^{-1}$, so $\mathrm{rank}(F_a F_b^{-1}) = \mathrm{rank}(F_a F_b^t) = \mathrm{rank}((F_a F_b^t)^t) = \mathrm{rank}(F_b F_a^t) = \mathrm{rank}(F_b F_a^{-1})$.

See Fig. 6. The virtual edges are represented in dashed lines. We see that the sufficient condition $F_a F_b^{-1} F_b = F_a$ is equivalent, on the new access graph with virtual edges, to the condition of Section 4.1 on multiple paths. If the graph of Fig. 6 represents two multiple paths of same weight (virtual edge included), then it will be possible to make the two communications local. This idea can be generalized to obtain a more refined heuristic than the simple heuristic given previously.

Fig. 6. Virtual edge, case $q_a < q_b$.
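The virtual-edge test for the case q_a <= q_b can be sketched as below. This is a hedged illustration: it assumes flat access matrices of full row rank (so that the right-inverse F^t (F F^t)^{-1} exists), uses exact rational arithmetic, and the names `right_inverse` and `virtual_edge` as well as the matrix values are hypothetical. The function returns the candidate weight F_a F_b^{-1}, whether it is of full rank (condition for inserting the virtual edge), and whether F_a F_b^{-1} F_b = F_a holds (sufficient condition for making both communications local).

```python
from fractions import Fraction


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rref(rows):
    """Return (reduced row echelon form, rank), computed exactly over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(m[0])):
        if rk == len(m):
            break
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        m[rk] = [x / m[rk][col] for x in m[rk]]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[rk])]
        rk += 1
    return m, rk


def inverse(a):
    """Inverse of a square nonsingular matrix (Gauss-Jordan on [a | I])."""
    n = len(a)
    eye = [[int(i == j) for j in range(n)] for i in range(n)]
    reduced, _ = rref([list(row) + eye[i] for i, row in enumerate(a)])
    if [row[:n] for row in reduced] != eye:
        raise ValueError("matrix is singular")
    return [row[n:] for row in reduced]


def right_inverse(f):
    """Pseudo-inverse F^t (F F^t)^{-1} of a flat matrix of full row rank."""
    ft = transpose(f)
    return matmul(ft, inverse(matmul(f, ft)))


def virtual_edge(f_a, f_b):
    """Return (weight F_a F_b^{-1}, weight is full rank, F_a F_b^{-1} F_b == F_a)."""
    weight = matmul(f_a, right_inverse(f_b))
    _, rk = rref(weight)
    full_rank = rk == min(len(weight), len(weight[0]))
    both_local = matmul(weight, f_b) == f_a
    return weight, full_rank, both_local


# Hypothetical access matrices with d = 3 (assumed values).
ID3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
PROJ = [[1, 0, 0], [0, 1, 0]]
case_ok = virtual_edge(PROJ, ID3)            # q_a = 2 < q_b = 3
case_rank = virtual_edge([[0, 0, 1]], PROJ)  # weight is the zero row
case_cond = virtual_edge([[0, 1, 1]], PROJ)  # full rank, but condition fails
```

The third case illustrates why the two checks are kept separate: the virtual edge may be inserted because its weight is of full rank, while the two paths still have different weights.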
Refined heuristic.
1. Consider all pairs $(x, y)$ of vertices without incoming edges, and consider all paths going from $x$ and $y$ and joining in the same vertex $z$. Call $F_x$ the weight of the path from $x$ to $z$ and $F_y$ the weight of the path from $y$ to $z$.
• if $q_x < q_y$ and if $F_x F_y^{-1}$ is of full rank, add a virtual edge from $x$ to $y$ with weight $F_x F_y^{-1}$;
• if $q_y < q_x$ and if $F_y F_x^{-1}$ is of full rank, add a virtual edge from $y$ to $x$ with weight $F_y F_x^{-1}$;
• if $q_x = q_y$, according to the rank of $F_x F_y^{-1}$ and $F_y F_x^{-1}$, insert a double arrow between $x$ and $y$ with the adequate weight.
2. Assign a weight 1 to the real edges of the access graph and a weight 0 to the virtual edges and, as in the simple heuristic, construct a maximum branching. The weight 0 is assigned to the virtual edges because they do not represent real communications. The virtual edges must be cut off first during the construction of the branching because, when a virtual edge is cut off, it does not mean that a communication will not be made local.
3. As in the simple heuristic, add the edges corresponding to cycles of identity weight or to multiple paths of same weight.

We give an example to illustrate the refined heuristic.

Example 10.

for i = 0 to N do
for j = 0 to N do
for k = 0 to N do
{Statement S_1} e(F_1 (i, j, k)^t + c_1) = g_1(d(F_2 (i, j, k)^t + c_2))
{Statement S_2} a(F_3 (i, j, k)^t + c_3) = g_2(b(F_4 (i, j, k)^t + c_4), c(F_5 (i, j, k)^t + c_5), d(F_6 (i, j, k)^t + c_6))
endfor
endfor
endfor
The arrays a, b, c, d, e are 3-dimensional arrays and the access matrices are:
The paths going from vertices of in-degree 0 and going to the same destination are:
• $a \xrightarrow{F_3} S_2$ and $b \xrightarrow{F_4} S_2$. $F_3$ and $F_4$ are of size (2 × 3), and the two matrices $F_3 F_4^{-1}$ and $F_4 F_3^{-1}$ are of full rank. There is a double dashed arrow between $a$ and $b$.
• $a \xrightarrow{F_3} S_2 \xrightarrow{F_6^{-1}} d$ and $S_1 \xrightarrow{F_2^{-1}} d$. $F_2^{-1}$ is of size (3 × 4), $F_3 F_6^{-1}$ is of size (2 × 4), and $F_3 F_6^{-1} (F_2^{-1})^{-1}$ is a matrix of full rank. There is a dashed arrow from $a$ to $S_1$.
• $b \xrightarrow{F_4} S_2 \xrightarrow{F_6^{-1}} d$ and $S_1 \xrightarrow{F_2^{-1}} d$. $F_2^{-1}$ is of size (3 × 4), $F_4 F_6^{-1}$ is of size (2 × 4), and $F_4 F_6^{-1} (F_2^{-1})^{-1}$ is a matrix of full rank. There is a dashed arrow from $b$ to $S_1$.

The access graph with the virtual edges is given in Fig. 7.
Fig. 7. Example of graph with virtual edges.
When the virtual edges have been added in the graph, the simple heuristic defined in Section 4.2 can be used as previously. To find the maximum branching, we assign a weight 0 to the dashed edges and a weight 1 to the others. In this example, the three pairs of multiple paths are of same weight ($F_3 F_4^{-1} F_4 = F_3$ and $F_4 F_3^{-1} F_3 = F_4$, $F_3 F_6^{-1} (F_2^{-1})^{-1} F_2^{-1} = F_3 F_6^{-1}$, $F_4 F_6^{-1} (F_2^{-1})^{-1} F_2^{-1} = F_4 F_6^{-1}$). If we extract a maximum branching from the access graph and add the edges corresponding to multiple paths of same weight, we obtain all the edges of the graph. All communications can be made local. With the simple heuristic, we would have made only four communications local.
5. Conclusion

We have dealt with the alignment problem when mapping affine loop nests onto distributed memory parallel computers. The main technical contributions of our paper are the following:
• The introduction of the access graph to cope with affine communications involving rectangular access matrices. The access graph is the key to specifying the affine mapping problem when the dimension of the target virtual architecture is an input parameter of the optimization process.
• The proof that the affine alignment problem is NP-complete in the strong sense.
• The introduction of several heuristics based upon the structure of the access graph (including properties of cycles and multiple paths).

In our heuristics, alignment functions are kept fixed during the entire execution of the program. Further work could be oriented towards dynamic allocation functions, provided that the potential improvement with such allocations could pay off the cost of data and computation redistribution.
Appendix

A.1. Pseudo-inverses

Let $X$ be a rectangular $u \times v$ integer matrix, and assume that $X$ is of full rank $\min(u, v)$. If $u = v$, then $X$ is nonsingular and its inverse matrix $X^{-1}$ is such that $X X^{-1} = X^{-1} X = Id_u$, where $Id_u$ denotes the identity matrix of order $u$. If $u \ne v$, we can define a pseudo-inverse (still denoted as $X^{-1}$) as follows:
• if $u < v$ ($X$ is flat), then $X X^t$ is a square nonsingular $u \times u$ matrix whose (ordinary) inverse matrix is $(X X^t)^{-1}$. Then we define the pseudo-inverse (or right-inverse) of $X$ as $X^{-1} = X^t (X X^t)^{-1}$: $X^{-1}$ is a $v \times u$ matrix of rank $u$ such that $X X^{-1} = Id_u$. Note that $X^{-1} X \ne Id_v$ if $u \ne v$.
• if $u > v$ ($X$ is narrow), then $X^t X$ is a square nonsingular $v \times v$ matrix whose (ordinary) inverse matrix is $(X^t X)^{-1}$. Then we define the pseudo-inverse (or left-inverse) of $X$ as $X^{-1} = (X^t X)^{-1} X^t$: $X^{-1}$ is a $v \times u$ matrix of rank $v$ such that $X^{-1} X = Id_v$. Note that in general $X X^{-1} \ne Id_u$ if $u \ne v$.
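As a small worked illustration of the right-inverse, consider the flat matrix below. It is a hypothetical example chosen for easy arithmetic, not a matrix from the paper; it shows that $X X^{-1}$ is the identity while $X^{-1} X$ is not.

```latex
X = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \qquad
X X^t = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
(X X^t)^{-1} = \frac{1}{3} \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix},

X^{-1} = X^t (X X^t)^{-1}
       = \frac{1}{3} \begin{pmatrix} 2 & -1 \\ 1 & 1 \\ -1 & 2 \end{pmatrix},
\qquad
X X^{-1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = Id_2,
\qquad
X^{-1} X = \frac{1}{3} \begin{pmatrix} 2 & 1 & -1 \\ 1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix} \neq Id_3 .
```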
Note that for square nonsingular matrices, the pseudo-inverse matrix coincides with the (usual) inverse matrix. For more details, see [14].

A.2. Matrix equations

Lemma A.1. Let $A$ be an $m \times a$ matrix of rank $m$ and $F$ be an $a \times d$ matrix of rank $a$, where $m \le a \le d$. Then $AF$ is of rank $m$.

Proof. We use the Hermite normal form of $F$: $F = [H, 0] Q$, where $H$ is an $a \times a$ upper triangular matrix of rank $a$, and $Q$ is a unimodular $d \times d$ matrix. Since $Q$ is nonsingular, the rank of $AF$ is that of $A [H, 0]$, hence that of $AH$, hence finally that of $A$, as $H$ is nonsingular too. □
Lemma A.1 was used in Section 2 to prove that we can safely let $M_S = M_x F$ when $M_x$ is an $m \times q_x$ matrix of rank $m$ and $F$ a $q_x \times d$ matrix of rank $q_x$, where $m \le q_x \le d$. Now for the case where $m \le d \le q_x$, we need to solve the equation $M_S = M_x F$, where $M_S$ of rank $m$ and $F$ of rank $d$ are given. We use the following result from [14]:

Lemma A.2. Let $S$ be an $m \times d$ matrix of rank $m$ and $F$ be an $a \times d$ matrix of rank $d$. Then the equation $XF = S$ admits a solution if and only if the compatibility condition $S F^{-1} F = S$ is satisfied. In such a case, all solutions are given by the expression $X = S F^{-1} + Y (Id_a - F F^{-1})$, where $Y$ is an arbitrary $m \times a$ matrix.
We derive the following result:

Lemma A.3. Let $S$ be an $m \times d$ matrix of rank $m$ and $F$ be an $a \times d$ matrix of rank $d$, where $m \le d \le a$. Then the equation $XF = S$ admits the rank-$m$ solution $A = S F^{-1}$.

Proof. The compatibility condition is verified because $F^{-1} F = Id_d$ with our hypothesis. Hence $A = S F^{-1}$ is a solution of the equation. Finally, we apply Lemma A.1 to prove that its rank is indeed $m$. □
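A small worked instance of Lemmas A.2 and A.3 may help. The matrices below are hypothetical values chosen for illustration, with $m = 1$, $d = 2$ and $a = 3$; the left-inverse of the narrow matrix $F$ gives one rank-$m$ solution, and the term $Y (Id_a - F F^{-1})$ describes all the others.

```latex
F = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad
S = \begin{pmatrix} 1 & 2 \end{pmatrix}, \qquad
F^{-1} = (F^t F)^{-1} F^t = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad
F^{-1} F = Id_2 ,

A = S F^{-1} = \begin{pmatrix} 1 & 2 & 0 \end{pmatrix}, \qquad
A F = \begin{pmatrix} 1 & 2 \end{pmatrix} = S ,

Id_3 - F F^{-1} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
X = S F^{-1} + Y (Id_3 - F F^{-1}) = \begin{pmatrix} 1 & 2 & y_3 \end{pmatrix}
\quad \text{for } Y = \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix} .
```

Every such $X$ satisfies $XF = S$, and $A$ (the case $y_3 = 0$) has rank $1 = m$, as stated by Lemma A.3.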
Lemma A.3 was used in Section 2 to orient some arrows from statements to arrays in the access graph.
References

[1] J.M. Anderson and M.S. Lam, Global optimizations for parallelism and locality on scalable parallel machines, ACM SIGPLAN Notices 28 (6) (1993) 112-125.
[2] V. Bouchitté, P. Boulet, A. Darte and Y. Robert, Evaluating array expressions on massively parallel machines with communication/computation overlap, in: B. Buchberger and J. Volkert, eds., Parallel Processing: CONPAR 94-VAPP VI, Lecture Notes in Computer Science, Vol. 854 (Springer, Berlin, 1994) 713-724; extended version available as TR 94-10, LIP, ENS Lyon.
[3] T.-S. Chen and J.-P. Sheu, Communication-free data allocation techniques for parallelizing compilers on multicomputers, in: Proc. Internat. Conf. on Parallel Processing, Vol. 2 (CRC Press, Boca Raton, 1993) 273-277.
[4] T.-S. Chen and J.-P. Sheu, Communication-free data allocation techniques for parallelizing compilers on multicomputers, IEEE Trans. Parallel Distributed Systems 5 (9) (1994) 924-938.
[5] A. Darte and Y. Robert, Mapping uniform loop nests onto distributed memory architectures, Parallel Computing 20 (1994) 679-710.
[6] A. Darte and Y. Robert, On the alignment problem, Parallel Processing Letters 4 (3) (1994) 259-270.
[7] J.R. Evans and E. Minieka, Optimization Algorithms for Networks and Graphs (Marcel Dekker, New York, 1992).
[8] P. Feautrier, Towards automatic distribution, Parallel Processing Letters 4 (3) (1994) 233-244.
[9] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman, New York, 1991).
[10] J.R. Gilbert and R.S. Schreiber, Optimal expression evaluation for data parallel architectures, J. Parallel Distributed Comput. 13 (1) (1991) 58-64.
[11] P.D. Hovland and L.M. Ni, A model for automatic data partitioning, in: Proc. Internat. Conf. on Parallel Processing, Vol. 2 (CRC Press, Boca Raton, 1993) 251-259.
[12] C.H. Huang and P. Sadayappan, Communication-free hyperplane partitioning of nested loops, in: Banerjee, Gelernter, Nicolau and Padua, eds., Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Vol. 589 (Springer, Berlin, 1991) 186-200.
[13] K. Knobe, J.D. Lukas and G.L. Steele, Data optimization: Allocation of arrays to reduce communication on SIMD machines, J. Parallel Distributed Comput. 8 (1990) 102-118.
[14] A. Korganoff and M. Pavel-Parvu, Éléments de Théorie des Matrices Carrées et Rectangles en Analyse Numérique (Dunod, Paris, 1966) (in French).
[15] U. Kremer, NP-completeness of dynamic remapping, Tech. Rept., Center for Research on Parallel Computation, Rice University, 1993.
[16] L. Lamport, The parallel execution of DO loops, Comm. ACM 17 (2) (1974) 83-93.
[17] J. Li and M. Chen, Index domain alignment: Minimizing cost of cross-referencing between distributed arrays, in: Frontiers 90: 3rd Symp. on the Frontiers of Massively Parallel Computation, College Park, MD, 1990.
[18] J. Li and M. Chen, The data alignment phase in compiling programs for distributed memory machines, J. Parallel Distributed Comput. 13 (1991) 213-221.
[19] J.D. Lukas, Data locality for shared memory, in: Proc. 6th SIAM Conf. on Parallel Processing for Scientific Computing, Vol. 2 (SIAM Press, 1993) 836-839.
[20] J.D. Lukas and K. Knobe, Data optimization and its effect on communication costs in MIMD Fortran code, in: Dongarra, Kennedy, Messina, Sorensen and Voigt, eds., Proc. 5th SIAM Conf. on Parallel Processing for Scientific Computing (SIAM Press, 1992) 478-483.
[21] M.E. Mace, Memory Storage Patterns in Parallel Processing (Kluwer Academic Publishers, Boston, MA, 1987).
[22] M. O'Boyle and G.A. Hedayat, Data alignment: Transformations to reduce communications on distributed memory architectures, in: Proc. Scalable High-Performance Computing Conf. SHPCC-92 (IEEE Computer Society Press, Silver Spring, MD, 1992) 366-371.
[23] W. Shang and Z. Shu, Data alignment of loop nests without nonlocal communications, in: Application Specific Array Processors (IEEE Computer Society Press, Silver Spring, MD, 1994) 439-450.
[24] B. Sinharoy and B.K. Szymanski, Data and task alignment in distributed memory architectures, J. Parallel Distributed Comput. 21 (1994) 61-74.