Iteration-Level Parallel Execution of DO Loops with a Reduced Set of Dependence Relations

ZEN CHEN AND CHIH-CHI CHANG

Institute of Computer Engineering, National Chiao Tung University, Hsinchu, Taiwan, Republic of China
Received March 13, 1986

In this paper we study the problem of parallel execution of DO loops. There are several levels of parallelism in connection with this problem. The level adopted here is the iteration, or block, level. The dynamic dependence relation is the key factor in our study. It is shown that in our model only one kind of dependence relation is needed. This fact leads to a smaller set of dependence relations and, therefore, results in better parallel performance. Algorithms and examples are given to show how to rewrite an original DO loop in a parallel form, and comparisons are made with some existing methods. © 1987 Academic Press, Inc.
1. INTRODUCTION

There is a wide spectrum of algorithms that can be expressed in the form of FORTRAN DO loops. They include important operations such as matrix multiplication, recurrence evaluations, Gaussian elimination, and LU decomposition of a matrix. When a program contains DO loops, the computer often spends most of its time executing these loops. Therefore, parallel execution of loops by multiple processing elements is desired.

The type of loop that we are concerned with is the one discussed in [10, 12, 13], which is repeated here for convenience:

      DO 2 I1 = 0, u1
        DO 2 I2 = 0, u2
          ...
            DO 2 Id = 0, ud
              S1(Ī)
              S2(Ī)
              ...
              Sm(Ī)
2     CONTINUE
Here Ī denotes the index vector (I1, I2, . . . , Id), and every Sk represents an assignment statement. Let Sk(t̄) be the specific statement instance obtained when Ī takes on a certain value t̄. This simple type of loop is ideal for the purpose of illustrating our main ideas. We briefly address the other general types of loops in a later section.

In the arrangement for the parallel execution of a DO loop, two preparatory steps are required. One first needs to choose the operation level at which parallel execution is performed, for which there are at least three alternatives.

(a) The parallel execution is done at the instruction level. The arithmetic operations contained in every assignment statement are arranged so as to be performed simultaneously whenever possible. The tree height reduction technique in [12] and the method given by Brent [5] are in this category.

(b) The parallel execution is done at the statement level. Assignment statements that are simultaneously executable are performed in parallel [4, 10].

(c) The parallel execution is done at the iteration level. The entire block of statements embodied in an iteration is treated as a basic unit of operation. Different iterations that are simultaneously computable are done in parallel. The wavefront method in [12] and the two parallel methods given by Lamport [13] fall into this category.

For each of these three levels the basic unit of operation, called the granularity of the parallelism, is performed in a hardware module, i.e., a processing element (or processor). The finer granularities contained in the basic unit are executed in a serial fashion inside the processing element.

The second preparatory step is to detect the dependence relations existing between two basic units at the chosen parallelism level. As used in [10, 12, 14], there are three kinds of dependence. Consider two statements, Su and Sv, not necessarily distinct. Let t̄ and s̄ be two values of the index vector. Assuming that the execution of Su(t̄) occurs no later than that of Sv(s̄) during the sequential execution of the loop, a dependence relation is said to exist between Su and Sv when one of the following conditions holds:

(a) flow (or data) dependence: the output variable of Su(t̄) is an input variable of Sv(s̄);
(b) output dependence: the output variables of Su(t̄) and Sv(s̄) are the same;
(c) antidependence: the output variable of Sv(s̄) is an input variable of Su(t̄).
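To make the three definitions concrete, the following minimal Python sketch (ours; the paper itself gives no code) classifies the dependences from an earlier statement instance to a later one, given the variables each instance reads and writes after subscript evaluation. The function and variable names are our own; the sample instances come from Example 1 of Section 2.

# A minimal sketch (ours, not the paper's) of the three dependence tests.
# Each statement instance is described by the set of variables it reads
# and the single variable it writes, after subscripts have been evaluated
# at concrete index values (i.e., a dynamic, run-time view).

def dependences(first, second):
    """Classify dependences from `first` (executed earlier) to `second`.

    first, second: dicts with keys 'out' (the written variable, a hashable
    name such as ('A', 3)) and 'in' (the set of read variables).
    """
    kinds = []
    if first['out'] in second['in']:
        kinds.append('flow')      # earlier write feeds a later read
    if first['out'] == second['out']:
        kinds.append('output')    # both write the same location
    if second['out'] in first['in']:
        kinds.append('anti')      # earlier read, later overwrite
    return kinds

# Statement instances S1(3): A(3) = D(3)**2 and S2(3): B(4) = A(3)*C(4)
s1_3 = {'out': ('A', 3), 'in': {('D', 3)}}
s2_3 = {'out': ('B', 4), 'in': {('A', 3), ('C', 4)}}
print(dependences(s1_3, s2_3))    # ['flow']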
Heuft and Little pointed out that the static analysis of a program's structure reveals only the relationships between two statements based on variable names, which yields the static dependence relations, whereas the dynamic or run-time analysis of the program's behavior, i.e., examining the values of the variables in the two statements, gives the dynamic dependence relations. They showed that the use of dynamic dependence relations leads to improvements in the time and parallel processor bounds for the parallel execution of a loop [10].

In this paper we examine parallelism at the iteration level. There are many motivations for choosing such a large granularity, including the advantages of lower preprocessing overhead and simpler data communication links. In addition, the architecture needed is believed to be realizable with VLSI technology. This choice is consistent with the proposition of using a compound function to raise the level of parallelism discussed by Gajski et al. [9]. In our model only one kind of dependence relation is of concern: flow dependence. This reduction of the set of dependence relations results in better parallel performance. We give algorithms for obtaining the set of dependence relations for two control modes (asynchronous and synchronous) of the architecture. In an asynchronous system, each processor may be driven by its own clock, whereas in a synchronous system all processors are driven by the same clock. These algorithms are able to handle the missing index problem effectively. We present illustrative examples and make comparisons with other methods.
2. TYPES OF DEPENDENCE RELATIONS
It is well known that there are two different interprocess communication methods: communication via shared variables and communication by message passing [2, 8, 11]. The architectures corresponding to these two methods are the tightly coupled model and the loosely coupled model, as shown in Fig. 1. In order to simplify the models, it is assumed that the processors in the tightly coupled model possess no local memory; the operands and the generated data are stored in the shared memory. Conversely, the processors in the loosely coupled model possess enough local memory to store the operands and the generated data. There the generated datum is transferred only to those processes that use it as an operand.

FIG. 1. The models of interprocess communication: (a) the tightly coupled model; (b) the loosely coupled model.

Now consider EXAMPLE 1 (borrowed from [4] and [10]).
      DO 2 I = 0, 100
S1      A(I) = D(I)**2
S2      B(I + 1) = A(I)*C(I + 1)
S3      C(I + 4) = B(I) + A(I + 1)
S4      X(I + 2) = X(I) + X(I + 1)
S5      C(I + 3) = D(I) - 5
2     CONTINUE
It is assumed here that the DO loop is executed at the statement level of parallelism. One process, executed by one processor, contains only one assignment statement. The process will be identified by the statement number and the iteration number and denoted as Sk(Ī). Define a value produced by an assignment statement to be generated, denoted as G, if its occurrence is on the left-hand side of the assignment statement; otherwise, it is called a used value, denoted as U. At the beginning, the initial values are loaded into the shared memory or into the processors using them as operands, for the tightly and loosely coupled models, respectively.

Let us first consider the tightly coupled model. As far as access to the variable A(I) is concerned, process S3(I - 1) reads the value of A(I) first, and then S1(I) writes a new value to A(I). S2(I) will read the new value afterward. The information about these accesses is listed in Table I. Evidently, S1(I) is dependent on S3(I - 1) and S2(I) is dependent on S1(I). These are the well-known antidependence and flow dependence, respectively [9, 10, 12]. Information about accesses to the variable C(I) is listed in Table II. Here, S5(I - 3) is dependent on S3(I - 4). This is the so-called output dependence.

TABLE I
THE ACCESSES TO THE VALUE OF A(I) IN EXAMPLE 1

Index vector of iteration   Access type   Statement number   Range of array subscript
I - 1                       U             S3                 1 ≤ I ≤ 101
I                           G             S1                 0 ≤ I ≤ 100
I                           U             S2                 0 ≤ I ≤ 100
TABLE II
THE ACCESSES TO THE VALUE OF C(I) IN EXAMPLE 1

Index vector of iteration   Access type   Statement number   Range of array subscript
I - 4                       G             S3                 4 ≤ I ≤ 104
I - 3                       G             S5                 3 ≤ I ≤ 103
I - 1                       U             S2                 1 ≤ I ≤ 101
In the loosely coupled model (please refer to Table I) the value of A(I) is initially loaded into the processor used by S3(I - 1). Only the value of A(I) generated by S1(I) needs to be transferred to process S2(I). In other words, only the dependence between S1(I) and S2(I) exists. Consider next the accesses to the value of C(I) (refer to Table II). Since the value of C(I) generated by S3(I - 4) is not used by any other process, it is kept in local memory and is not transferred to any other process. However, the value of C(I) generated by S5(I - 3) must be transferred to S2(I - 1), because S2(I - 1) uses it as an operand. Evidently, the dependence between S3(I - 4) and S5(I - 3) that exists in the tightly coupled model does not occur here. In short, only the flow dependence is possible in the loosely coupled model; the other two side effects, i.e., the antidependence and the output dependence, do not exist at all.

One may notice that in the loosely coupled model the generated data are stored in the local memory. In other words, different versions of data may exist in the system. The dependence relations occur only when the same version of data, instead of the same variable (memory location), is accessed. On the other hand, the antidependence and output dependence in the tightly coupled model are actually derived by considering the interaction among different versions of data. We can remove these two dependences by assigning different variables (memory locations) to different versions of data. With this data version concept, it is more intuitive and simpler to derive the dependence relations. Furthermore, for the loosely coupled model only the flow dependence relations are faced. The method used by Heuft and Little [10] is conceptually based on the tightly coupled model; the derivations of dependence relations for different architecture models were not considered. Since the datum (or variable value) is a dynamic entity [10], the dependence relations derived by considering the accesses to each version of data can be categorized as dynamic dependence relations.

For some functional programming languages, such as the FP proposed by Backus [3] and the data flow languages [1], global variables are not used. Although side effects are avoided in these languages, the burden of avoiding them is implicitly handed over to the programmers. Here we hope that the computer eliminates the side effects by itself.
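The data version concept can be phrased as a single-assignment discipline. The following Python sketch (ours, not the paper's mechanism; all names are our own) tags every generated value with a fresh version number, so that an overwrite creates a new name instead of an output dependence, and a read refers to a specific version instead of a memory location:

# Sketch (ours) of the data-version idea: give every generated value a
# fresh name, so anti- and output dependences on a memory location
# disappear and only flow dependences (producer to consumer) remain.

version = {}      # location -> current version number

def write(loc):
    """Record a G access: create a new version of `loc`."""
    version[loc] = version.get(loc, 0) + 1
    return (loc, version[loc])         # versioned name, e.g. (('C', 7), 2)

def read(loc):
    """Record a U access: refer to the latest existing version."""
    return (loc, version.get(loc, 0))  # version 0 is the initial datum

# Example 1: S3(3) and S5(4) both write C(7):
v1 = write(('C', 7))    # S3(3) generates version 1
v2 = write(('C', 7))    # S5(4) generates version 2 -- no output dependence,
                        # since the two versions live under different names
u1 = read(('C', 7))     # S2(6) reads the latest version, (('C', 7), 2)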
From the discussion given above, it is more appropriate to derive the dependence relations by choosing the model of architecture first. In this paper we discuss the dependence relations of loops at the iteration level of parallelism based on the loosely coupled architecture. However, the algorithms proposed in subsequent sections are also suitable for the tightly coupled architecture, except that the renaming techniques are different.

3. THE FIRST ALGORITHM FOR DERIVING DYNAMIC DEPENDENCE RELATIONS
A new algorithm for finding and handling the dynamic dependence relations in an asynchronous system is as follows (a sketch of its first two steps in executable form is given after the step list):

Algorithm 1

Step 1. For each variable in the loop body, list the information about the value of the array variable accessed: the access type (U or G), the index vector associated with the access, the statement number involved, and the range of the array subscript. Order the list first in the lexicographical ordering of the index vectors; then order it by the statement numbers.

Step 2. Rename all occurrences of the array variables in the program. Give the same unsubscripted name to the data if their occurrences have the same variable name and access the same version of data; otherwise give different unsubscripted names to them.

Step 3. If the type of the first access to a value is U, initially load the value into the accessing processors. Thus, if the subscript range of an array variable of type U is not covered by that of the same array variable of type G, the data corresponding to this uncovered range must also be loaded at the initial time.

Step 4. If a sequence of accesses to a value in the form G0U1U2 . . . UnG1 is caused by a missing index, then broadcast the generated value associated with G0 to all processors that use it as an operand, provided that the hardware possesses the function of broadcasting.

Step 5. The final value of the array variable is stored in the processor that performs the last G access.

Step 6. For each flow dependence relation that occurs, insert a data-transferring statement after the statement containing the occurrence of a datum of type G and a data-receiving statement before the statement containing the occurrence of a datum of type U.
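Steps 1 and 2 can be mechanized by enumerating the index space and recording every access. The following Python sketch is our own rendering (the function and parameter names are ours, and it handles only a rectangular index space); it is shown here for the loop of Example 2 below:

# Sketch (ours) of Steps 1-2 of Algorithm 1: enumerate the iterations in
# lexicographical order, record every access as (index vector, type,
# statement), and group the accesses by the concrete array element, so
# that each G access starts a new data version.

from itertools import product

def access_table(bounds, accesses):
    """bounds: list of (lo, hi) pairs, one per loop index.
    accesses: list of (stmt, kind, array, subscript_fn) in statement order;
    kind is 'G' or 'U'; subscript_fn maps an index vector to a subscript.
    Returns {array element: ordered list of (index vector, kind, stmt)}."""
    table = {}
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    for ivec in product(*ranges):            # lexicographical order
        for stmt, kind, array, fn in accesses:
            elem = (array, fn(ivec))
            table.setdefault(elem, []).append((ivec, kind, stmt))
    return table

# Example 2:  S1: A(I,J) = C(I-1) + 5 ;  S2: C(I) = B(I,J-1)/5
M, N = 4, 3
tbl = access_table(
    [(2, M), (1, N)],
    [('S1', 'U', 'C', lambda v: v[0] - 1),        # reads C(I-1)
     ('S1', 'G', 'A', lambda v: v),               # writes A(I,J)
     ('S2', 'U', 'B', lambda v: (v[0], v[1] - 1)),
     ('S2', 'G', 'C', lambda v: v[0])])           # writes C(I)
print(tbl[('C', 2)])   # G's by S2 at (2,1)..(2,N), then U's by S1 at (3,*)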
To illustrate this algorithm, consider Example 2, which is simple enough to avoid unnecessary detail, but is sufficient to explain the main ideas.

EXAMPLE 2.

      DO 5 I = 2, M
        DO 5 J = 1, N
S1        A(I, J) = C(I - 1) + 5
S2        C(I) = B(I, J - 1)/5
5       CONTINUE

Applying the algorithm above to this DO loop, the information about the accesses to the variables is listed in Table III.

TABLE III
INFORMATION ABOUT THE ACCESSES TO VARIABLES IN EXAMPLE 2

(a) The accesses to the value of A(I, J)

Index vector of loop body   Access type   Statement number   Ranges of array subscripts
I, J                        G             S1                 2 ≤ I ≤ M, 1 ≤ J ≤ N

(b) The accesses to the value of B(I, J)

Index vector of loop body   Access type   Statement number   Ranges of array subscripts
I, J + 1                    U             S2                 2 ≤ I ≤ M, 0 ≤ J ≤ N - 1

Note. Initially, load B(I, J) to processor element (I, J + 1) and store it in B, where 2 ≤ I ≤ M, 0 ≤ J ≤ N - 1.

(c) The accesses to the value of C(I)

Index vector of loop body   Access type   Statement number   Range of array subscript
I, 1                        G             S2                 2 ≤ I ≤ M
 ⋮                           ⋮             ⋮                   ⋮
I, N                        G             S2                 2 ≤ I ≤ M
I + 1, 1                    U             S1                 1 ≤ I ≤ M - 1
 ⋮                           ⋮             ⋮                   ⋮
I + 1, N                    U             S1                 1 ≤ I ≤ M - 1

Note. It can be checked easily that the value of C(1) should be loaded into processor elements (2, 1), . . . , (2, N) at the initial time.
The variables A(I, J), C(I - 1), C(I), and B(I, J - 1) are renamed as A, C1, C, and B, respectively. The remaining dependence vectors are (1, *), 1 - N ≤ * ≤ 0, caused by C(I). If we now apply the hyperplane method used by Lamport [13] to the reduced set of dependence relations obtained above, a feasible hyperplane equation is given as I = k. Therefore, the original DO loop can be rewritten as follows.
Initialization: Load the value of B(I, J) to processor element (I, J + 1), for 2 ≤ I ≤ M, 0 ≤ J ≤ N - 1, and store it in B. Load the value of C(1) to processor elements (2, 1), . . . , (2, N) and store it in C1.

Execution:
      DO 5 I = 2, M
        PARA-EXECUTE for all 1 ≤ J ≤ N in an asynchronous mode
        BEGIN
          IF I > 2 THEN RECEIVE C from processor (I - 1, N) and store it in C1;
          A = C1 + 5;
          C = B/5;
          IF J = N THEN BROADCAST C to processors (I + 1, *), 1 ≤ * ≤ N
        END
5     CONTINUE

In comparison, the hyperplane method, based on the dependence relations specified by Lamport's method [13], gives a hyperplane of N × I + J = k (constant), as shown in Table IV. Note that in Table IV the output dependence relations of (0, +), which do not appear in the result of our method, make p2 greater than 0, where p2 is one of the coefficients of the hyperplane equation p1 × I + p2 × J = K. It is obvious that the values of p1 and p2 satisfying the constraints p2 > 0 and p1 + p2 × * > 0 and minimizing p1 × I + p2 × J are p1 = N and p2 = 1, respectively. By substituting the possible values of (I, J) into the hyperplane equation, one can see that the iterations are then executed serially. In our method, (M - 1) time units are required to complete the DO loop, where a time unit is defined to be the amount of time needed to complete one iteration of the loop body by one processor, including the data transfer required.
TABLE IV
THE HYPERPLANE METHOD APPLIED TO EXAMPLE 2

Sets                       Elements > 0   Constraints
(a1, a1) = (0, 0)
(c1, c1) = (0, *)          (0, +)         p2 > 0
(c1, c2) = (1, *)          (1, *)         p1 + p2 × * > 0
(c2, c1) = (-1, *)

Note. We let a1 = A(I, J), c1 = C(I), c2 = C(I - 1), and 1 - N ≤ * ≤ N - 1. The resulting hyperplane is p1 × I + p2 × J = K with p1 = N and p2 = 1.
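Under the constraints of Table IV, the coefficients p1 = N and p2 = 1 can be recovered mechanically. The brute-force search below is our own illustration, not Lamport's procedure; the loop bounds N and M are assumed values chosen only for the demonstration:

# Sketch (ours): brute-force search for the hyperplane coefficients
# (p1, p2) that satisfy the Table IV constraints
#   p2 > 0   and   p1 + p2 * t > 0  for all 1 - N <= t <= N - 1
# and minimize the schedule length of p1*I + p2*J over the index space.

N, M = 4, 6   # assumed loop bounds, for illustration only

def feasible(p1, p2):
    return p2 > 0 and all(p1 + p2 * t > 0 for t in range(1 - N, N))

candidates = [(p1, p2)
              for p1 in range(3 * N) for p2 in range(3 * N)
              if feasible(p1, p2)]
# the number of hyperplane steps over 2 <= I <= M, 1 <= J <= N is
# proportional to p1*(M-2) + p2*(N-1); pick the pair minimizing it
best = min(candidates, key=lambda p: p[0] * (M - 2) + p[1] * (N - 1))
print(best)   # (N, 1) -> hyperplane N*I + J = K, a serial schedule here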
Parallel executions of the DO loop in Example 2 by our method and by Lamport's hyperplane method are depicted in Figs. 2a and 2b, respectively.

The DO loop in Example 2 can also be implemented by either a systolic array or a wavefront array processor. Moldovan introduced variables to guide the flow of data so as to avoid the function of broadcasting [14]. However, introducing such variables is not trivial; usually, some trial and error is needed. Here, according to the dependence relations derived by our algorithm, the function of broadcasting can be naturally avoided by executing the DO loop in a wavefront fashion, with the wavefront moving from the upper left corner to the lower right corner, as shown in Fig. 2c. This needs N + M - 2 time units to complete the DO loop.
FIG. 2. Parallel execution of the program given in Example 2: (a) by our method in (M - 1) steps; (b) by Lamport's method in N × (M - 1) steps; (c) by the systolic approach in N + M - 2 steps.
4. FURTHER REDUCTION OF DYNAMIC DEPENDENCE RELATIONS IN A SYNCHRONOUS SYSTEM
We now turn our attention to a synchronous system, in which the dependence relations derived earlier can be further reduced. For convenience, we write L(t̄) → L(s̄) to denote that the iteration with index vector s̄, L(s̄), is dependent on the iteration with index vector t̄, L(t̄), due to the fact that a value generated in L(t̄) is used by L(s̄). Let Π be a mapping function: Π(L(t̄)) indicates the time unit in which L(t̄) is executed. If L(t̄) → L(s̄), then Π(L(s̄)) - Π(L(t̄)) ≥ 1 must hold. Since in the synchronous system all processors are driven by the same clock, the instructions are executed in a predetermined time ordering. For the DO loop given in Example 3, the value of A(I, J) used by L(I, J + 1) at time step t2 is already available, having been generated at time step t1, if L(I, J) and L(I, J + 1) are executed simultaneously in an SIMD system. This time sequence is shown in Table V.

EXAMPLE 3.
      DO 5 I = 1, M
        DO 5 J = 1, N
S1        A(I, J) = B(I, J) + C(I, J)
S2        D(I, J) = A(I, J - 1) + E(I, J)
5       CONTINUE
Also, in Example 3 the dependence relation caused by multiple accesses to the value of A(I, J) can be weakened in the synchronous system, i.e., to Π(L(I, J + 1)) - Π(L(I, J)) ≥ 0. In general, the weakened dependence relation cannot be removed, unless it can be ensured that L(I, J) and L(I, J + 1) will be executed simultaneously.
TABLE V
THE DETERMINISTIC TIME SEQUENCES FOR EXECUTING THE LOOP OF EXAMPLE 3 IN AN SIMD SYSTEM

       L(I, J)                              L(I, J + 1)
t1     A(I, J) = B(I, J) + C(I, J) **       A(I, J + 1) = B(I, J + 1) + C(I, J + 1)
t2     D(I, J) = A(I, J - 1) + E(I, J)      D(I, J + 1) = A(I, J) ** + E(I, J + 1)

Note. The asterisks mark the generation of A(I, J) at time step t1 in L(I, J) and its use at time step t2 in L(I, J + 1).
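The lockstep argument behind Table V can be checked with a small simulation. The Python sketch below is our own illustration (not from the paper): it executes statement S1 of every iteration at step t1 and statement S2 at step t2, assuming the initial values A(I, 0) have been preloaded, and confirms that every A(I, J - 1) read at t2 already exists:

# Sketch (ours): simulate the lockstep (SIMD) execution of Example 3.
# Every processor runs S1 at global step t1 and S2 at global step t2, so
# A(I, J-1) is always generated (or preloaded) before S2 reads it; the
# (0, 1) dependence weakens to Pi(L(I, J+1)) - Pi(L(I, J)) >= 0.

M, N = 3, 3
B = {(i, j): float(i + j) for i in range(1, M + 1) for j in range(1, N + 1)}
C = dict(B)
E = dict(B)
A = {(i, 0): 0.0 for i in range(1, M + 1)}   # assumed preloaded A(I, 0)
D = {}

iters = [(i, j) for i in range(1, M + 1) for j in range(1, N + 1)]

# time step t1: every processor executes S1 simultaneously
for (i, j) in iters:
    A[(i, j)] = B[(i, j)] + C[(i, j)]

# time step t2: every processor executes S2; each A(I, J-1) was written
# at t1 or loaded initially, never produced within the same step
for (i, j) in iters:
    D[(i, j)] = A[(i, j - 1)] + E[(i, j)]

print(D[(2, 3)])   # uses A(2, 2), produced at t1 of the same schedule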
Lamport proposed an algorithm [13], called the coordinate algorithm, in which he tried to alter the sequence of the statements in order to meet the condition depicted above and thereby weaken the dependence relations. In his method, the "DO SIM variables" are chosen first, to check whether the conditions listed in [13] are satisfied. Undoubtedly, it takes time to try all 2^n - 1 possible combinations, where n is the number of index variables, although some combinations can be tested easily. We propose a new algorithm to weaken the dependence relations that avoids examining the 2^n - 1 possible ways of choosing the DO SIM variables:
Algorithm 2

Step 1. Construct the access information tables and rename all occurrences of the array variables appearing in the loop body (as in the algorithm for the asynchronous system given previously).

Step 2. Construct a directed graph. Each node in the graph represents an assignment statement in the loop body. For every flow dependence relation, add a directed arc to the graph if the occurrences causing the dependence relation are not contained in the same statement. This arc goes from the node that contains the occurrence of type G to the node that contains the occurrence of type U. If the occurrences are contained in the same statement, it is obvious that the dependence relation cannot be removed by altering the order of the statements.

Step 3. Check whether any cycle exists in the graph. If a cycle does exist, it means that not all dependence relations can be removed by adjusting the order of the statements.

Step 4. Remove some arcs in order to break any cycles. At most K arcs need to be removed if K cycles exist in the graph. Removal of the arcs that correspond to longer dependence vectors is preferred, for these will constitute the final dependence vectors, and if the final dependence vectors are shorter, the parallel performance will be poorer. One should note here that an arc whose corresponding vector is the null vector cannot be removed from the graph, because such a dependence relation can be satisfied only when the statements are in the original order.

Step 5. Rewrite the loop body according to the sequence of the acyclic graph. The dependence relations corresponding to the remaining arcs of the graph, except those with null vectors, can now be weakened.

Step 6. Write down the weakened dependence relations and the dependence relations corresponding to the arcs removed to break cycles. Use the wavefront method (or, more generally, the hyperplane method) or some other effective method to find the processors that can be activated simultaneously based on the dependence relations listed.

A sketch of Steps 2-5 in executable form is given below.
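The following Python sketch is our own simplification of Steps 2-5 (the names and the cycle-breaking heuristic are ours; in particular, it simply drops the longest remaining arc whenever a cycle is found, and it does not implement the null-vector safeguard of Step 4):

# Sketch (ours) of Steps 2-5 of Algorithm 2: build the statement graph,
# break cycles (preferring removal of arcs with longer dependence
# vectors), and emit a reordered loop body via topological sort.

from graphlib import TopologicalSorter, CycleError

def reorder(statements, flow_deps):
    """statements: statement labels in original textual order.
    flow_deps: list of (src, dst, vector); the arc src -> dst means the
    value generated in src is used in dst. Returns (order, removed arcs)."""
    deps = sorted(flow_deps, key=lambda a: sum(abs(c) for c in a[2]))
    removed = []
    while True:
        graph = {s: set() for s in statements}
        for src, dst, _ in deps:
            graph[dst].add(src)            # dst depends on src
        try:
            order = list(TopologicalSorter(graph).static_order())
            return order, removed
        except CycleError:
            # Step 4 heuristic: drop the arc with the longest vector
            # still present (deps is sorted in ascending vector length);
            # null-vector arcs would need to be protected here.
            removed.append(deps.pop())

# Example 4: arcs 22 -> 21 for C(I), vector (0, 1), and
#            23 -> 22 for B(I, J), vector (1, 0)
order, removed = reorder(['21', '22', '23'],
                         [('22', '21', (0, 1)), ('23', '22', (1, 0))])
print(order, removed)   # ['23', '22', '21'] [] -- acyclic, nothing removed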
Now consider EXAMPLE 4 (borrowed from Lamport [13]).

      DO 24 I = 2, M
        DO 24 J = 1, N
21        A(I, J) = B(I, J) + C(I)
          (A)       (B1)     (C1)
22        C(I) = B(I - 1, J)
          (C)    (B2)
23        B(I, J) = A(I + 1, J)**2
          (B)       (A1)
24      CONTINUE
By applying Algorithm 2 to the DO loop in Example 4, we obtain the information in Table VI. Note that we have written the new name of each occurrence of an array variable just below the original variable in Example 4. There are only two dependence relations that we need to consider, namely, (1, 0) and (0, 1), caused by the accesses to the values of B(I, J) and C(I), as shown in Tables VI(b) and VI(c), respectively. By Step 2 of the algorithm, the directed graph is constructed as shown in Fig. 3. There are no cycles in the graph, so all the dependence relations can be weakened. If the wavefront method [6, 12] is adopted, the following formulation is obtained:

      min  a1 × x1 + a2 × x2
      subject to
            a1 × 1 + a2 × 0 ≥ 0
            a1 × 0 + a2 × 1 ≥ 0
      and   a1, a2 ≥ 0.

It is obvious that a1 = 0 and a2 = 0 is the solution. Therefore, the original program can be fully executed by all processor elements at the same time. The DO loop can be rewritten as shown after Table VI.
TABLE VI
INFORMATION ABOUT THE ACCESSES TO VARIABLES IN EXAMPLE 4

(a) The accesses to the value of A(I, J)

Index vector of iteration   Access type   Statement number   Ranges of array subscripts
I - 1, J                    U             23                 3 ≤ I ≤ M + 1, 1 ≤ J ≤ N
I, J                        G             21                 2 ≤ I ≤ M, 1 ≤ J ≤ N

Note. Initially load A(I, J) to processor element (I - 1, J), 3 ≤ I ≤ M + 1, 1 ≤ J ≤ N, and store it in A1.

(b) The accesses to the value of B(I, J)

Index vector of iteration   Access type   Statement number   Ranges of I and J
I, J                        U             21                 2 ≤ I ≤ M, 1 ≤ J ≤ N
I, J                        G             23                 2 ≤ I ≤ M, 1 ≤ J ≤ N
I + 1, J                    U             22                 1 ≤ I ≤ M - 1, 1 ≤ J ≤ N

Note. Initially, load B(I, J) to processor element (I, J) and store it in B1. Load B(1, J) to processor element (2, J), 1 ≤ J ≤ N, and store it in B2.

(c) The accesses to the value of C(I)

Index vector of iteration   Access type   Statement number   Range of I
I, 1                        U             21                 2 ≤ I ≤ M
I, 1                        G             22                 2 ≤ I ≤ M
I, 2                        U             21                 2 ≤ I ≤ M
I, 2                        G             22                 2 ≤ I ≤ M
 ⋮                           ⋮             ⋮                   ⋮
I, N                        U             21                 2 ≤ I ≤ M
I, N                        G             22                 2 ≤ I ≤ M

Note. Initially, load C(I) to processor element (I, 1), 2 ≤ I ≤ M, and store it in C1.
Initialization: Load the value of A(I, J) to processor element (I - 1, J), 3 ≤ I ≤ M + 1, 1 ≤ J ≤ N, and store it in A1. Load the value of B(I, J) to processor element (I, J) and store it in B1; load the value of B(1, J) to processor element (2, J), 1 ≤ J ≤ N, and store it in B2. Load the value of C(I) to processor element (I, 1), 2 ≤ I ≤ M, and store it in C1.

Execution:
      PARA-EXECUTE for all 2 ≤ I ≤ M, 1 ≤ J ≤ N in a synchronous mode
      BEGIN
        B = A1**2;
        TRANSFER B to processor element (I + 1, J);
        IF I ≥ 3 THEN RECEIVE B from processor element (I - 1, J) and store it in B2;
        C = B2;
        IF J ≤ N - 1 THEN TRANSFER C to processor element (I, J + 1);
        IF J ≥ 2 THEN RECEIVE C from processor element (I, J - 1) and store it in C1;
        A = B1 + C1
      END

FIG. 3. The directed graph of the desired statement order for Example 4: nodes 21, 22, and 23, with an arc from 22 to 21 labeled C(I) and an arc from 23 to 22 labeled B(I, J).

A remark is in order here. In our method we avoid the necessity of considering all 2^n - 1 combinations of the n index variables. Instead we can use a minimization formulation to find the proper combination. In addition, our method gives a totally parallel executable result, which is evidently superior to Lamport's result, which needs n time units to complete the same DO loop.

5. SOME PRACTICAL CONSIDERATIONS
In this section the features of our algorithms that need special hardware to implement are discussed. If the condition L(Ī) → L(J̄) holds, the execution speed can be improved if the execution of L(J̄) can be started and overlapped with the execution of L(Ī) when the required data are ready. An MSIMD system, which possesses the high-performance characteristic of the SIMD system and can be programmed row by row independently (thus the processes can be executed in an overlapping manner by processors in different rows), is presented in an upcoming paper [7].

The loosely nested loop can be executed by moving the statements between DO statements into the loop body and inserting checking conditions in the statements. However, with the architecture proposed in another paper [7] the loosely nested loops need not be moved into the loop body and can be executed more efficiently. On the other hand, because the loop body is executed serially, the IF-THEN and the IF-THEN-ELSE statements can be implemented in the SIMD system easily. The GO TO statement can also be handled, if the destination of the GO TO statement is still in the loop body. However, it is difficult for most parallel architectures to handle the case in which the parallel execution is terminated by executing an unpredictable GO TO statement; some special features of the architecture are needed. The architecture proposed in [7] can also handle this thorny problem.

Execution of loops whose dimension and number of iterations are greater than the dimension and the number of processor elements of the hardware architecture is another problem in realizing the algorithms. We have discussed this issue in another paper [6].
6. CONCLUSIONS
Based on the concept of determining the dependence relations by considering the accesses to the same version of data, as given in Section 2, an algorithm to derive the dependence relations of the asynchronous system at the iteration level of parallelism has been proposed. It is more intuitive and effective than the method given by Lamport [13]. Only the flow dependence relations are needed in our method. For the synchronous system, a new algorithm to reduce the dependence relations further by altering the order of the statements has also been proposed. With this method, one does not need to consider the 2^n - 1 possible cases for the DO SIM variables, as is required by the coordinate method proposed by Lamport. Furthermore, the result of our method is superior to that of the coordinate method.

The algorithms proposed in this paper have been presented for the loosely coupled architecture. However, the same algorithms can also be used in the tightly coupled architecture with only minor changes. The algorithms and examples for this tightly coupled model are included in the Appendix.

APPENDIX
A. Algorithm for the Tightly Coupled, Asynchronous Architecture

Only Steps 1 to 3 of Algorithm 1 given in Section 3 are required, but with the following modifications:

(1) In Step 2, if the occurrences of the subscripted array refer to the same version of data, then keep the same name; otherwise, give different names to them, but retain the part with the original subscript functions.

(2) In Step 3, all initial data are loaded into the memory instead of being loaded into the processors.

Example 2 can be rewritten as follows:
Initialization: Load the value of B(I, J) to the memory location of B(I, J), 2 ≤ I ≤ M, 0 ≤ J ≤ N - 1. Load the value of C(1) to the memory location of C(1, N). (Note that C(1, K) and C(1, K'), K ≠ K', are the different data names assigned in Step 2.)
Execution:
      DO 5 I = 2, M
        PARA-EXECUTE for all 1 ≤ J ≤ N in an asynchronous mode
        BEGIN
          A(I, J) = C(I - 1, N) + 5
          C(I, J) = B(I, J - 1)/5
        END
5     CONTINUE
B. Algorithm for the Tightly Coupled, Synchronous Architecture

The renaming technique in Step 1 of Algorithm 2 is modified in the same manner as that given above; the other steps remain the same. Example 4 can be rewritten as follows:
Initialization: Load the value of A(I, J) into the memory location of A1(I, J), 3 ≤ I ≤ M + 1, 1 ≤ J ≤ N. Load the value of B(I, J) into the memory location of B1(I, J), 2 ≤ I ≤ M, 1 ≤ J ≤ N, and the value of B(1, J) into the memory location of B(1, J), 1 ≤ J ≤ N. Load the value of C(I) into the memory location of C(I, 0), 2 ≤ I ≤ M.
Execution:
      PARA-EXECUTE for all 2 ≤ I ≤ M, 1 ≤ J ≤ N in a synchronous mode
      BEGIN
        B(I, J) = A1(I + 1, J)**2
        C(I, J) = B(I - 1, J)
        A(I, J) = B1(I, J) + C(I, J - 1)
      END
ACKNOWLEDGMENT

The authors thank the referees for their constructive comments and for their help in correcting a mistake in Fig. 2, which made the current version of the paper possible.
REFERENCES

1. Ackerman, W. B. Data flow languages. Computer (Feb. 1982), 15-25.
2. Andrews, G. R., and Schneider, F. B. Concepts and notations for concurrent programming. Comput. Surveys 15, 1 (March 1983), 3-43.
3. Backus, J. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Comm. ACM 21 (Aug. 1978), 613-641.
4. Banerjee, U., Chen, Shyh-Ching, Kuck, D. J., and Towle, R. A. Time and parallel processor bounds for Fortran-like loops. IEEE Trans. Comput. C-28, 9 (1979), 660-670.
5. Brent, R. P. The parallel evaluation of general arithmetic expressions. J. Assoc. Comput. Mach. 21, 2 (April 1974), 201-206.
6. Chen, Zen, and Chang, Chih-Chi. A wave band method for parallel execution of DO loops. Submitted for publication.
7. Chen, Zen, Chang, Chih-Chi, and Chia, Tsorng-Lin. On the design of VLSI architectures for parallel execution of general DO loops. Submitted for publication.
8. Dubois, M., and Briggs, F. A. Performance of synchronized iterative processes in multiprocessor systems. IEEE Trans. Software Engrg. SE-8, 4 (July 1982), 419-431.
9. Gajski, D. D., Padua, D. A., Kuck, D. J., and Kuhn, R. H. A second opinion on data flow machines and languages. IEEE Comput. (Feb. 1982), 58-69.
10. Heuft, R. W., and Little, W. D. Improved time and parallel processor bounds for Fortran-like loops. IEEE Trans. Comput. C-31, 1 (1982), 78-81.
11. Hwang, Kai, and Briggs, F. A. Computer Architecture and Parallel Processing. McGraw-Hill, New York, 1984.
12. Kuck, D. J., Muraoka, Y., and Chen, Shyh-Ching. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans. Comput. C-21 (1972), 1293-1310.
13. Lamport, L. The parallel execution of DO loops. Comm. ACM 17, 2 (1974), 83-93.
14. Moldovan, D. I. On the design of algorithms for VLSI systolic arrays. Proc. IEEE 71, 1 (1983), 113-120.