JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 9, 154-170 (1990)
A Mechanism for Keeping Useful Internal Information in Parallel Programming Tools: The Data Access Descriptor*

VASANTH BALASUNDARAM

Caltech Concurrent Computation Program, Mail Code 206-49, Caltech, Pasadena, California 91125
An important aspect of any parallel programming tool is its ability to provide useful information that can help the user optimize a program for efficient parallel execution or debug a parallel program. In many parallel programming tools, data dependence information is a fundamental requirement for the implementation of the various utilities provided by the tool. Experience has shown that while data dependence is a powerful concept, it sometimes causes several complexities both in internal analysis and in programmer-tool interaction. These problems can be overcome by summarizing the effect of data dependences in parts of the program. This paper presents a mechanism for summarizing data accesses in numerical scientific programs that is easy to implement and manipulate in a programming tool. Data dependence is viewed as an intersection between data access summaries, which allows data dependence and data access to be treated in a unified manner. © 1990 Academic Press, Inc.
1. INTRODUCTION
There are several factors that complicate the task of parallel programming. The creation of an efficient parallel program involves decisions that are intimately related to the nature of the problem being solved, the algorithmic techniques used in its solution, and details that are specific to the particular parallel computer being used. Humans are generally good at making these decisions. However, the decisions have to be based on a consideration of the effect they have on aspects such as correctness and performance improvement. This often requires a detailed analysis of the data dependences in the program. Because of the large number of data dependences that may need to be examined, and the detail of the information that must be processed, this is best performed by a software tool. Many parallel programming tools are based on this paradigm: the programmer requests information to help make some decision involving parallelization of a program segment, the tool performs the necessary analysis to extract any relevant information, this information is presented to the programmer, and the programmer makes the decision based on this information.

There is a fundamental problem that arises in this kind of programmer-tool interaction, where the tool does most of the analysis and the programmer makes most of the decisions: the results of the analysis must be presented to the programmer in terms of the source program, and must bear some relevance to the point on which the programmer needs to make a decision. Several experimental parallelizing compilers and parallel programming tools have been built, and valuable lessons have been learned from them [21, 2, 4, 28, 24, 20, 1, 9, 7, 16]:

• Data dependence is very low-level information, and can often be nonintuitive to the programmer. A tool that interacts with a programmer only in terms of data dependences can easily overwhelm the programmer with a large amount of very low-level information.

• The data dependence graph is a dense structure. For large programs it may pose storage problems. The data dependence graph also contains a lot of redundant information from the point of view of the parallelization process, much of it due to conservative analysis done with statically available information.

• Another problem is the large number of dependences generated by scalar variable references. Much of the parallelism, especially in numerical scientific applications, comes from the fact that different parts of a program can simultaneously access independent portions of the same array. Scalars offer no such possibility: they have to be either shared by all processes (with accesses guarded by appropriate synchronization) or replicated in the local memories of all the processors. Thus, dependences involving array variables are generally of greater interest.

• Sometimes, dealing with data dependences gives rise to complexities that may make the analysis prohibitively expensive to implement. One instance of this is the occurrence of a subroutine call within a loop that is being analyzed for parallelism. Data dependences within the subroutine, and some that may involve references outside the subroutine, would have to be analyzed. The analysis must take into account factors such as aliasing of variables at the call site, other subroutines that are called by the subroutine being analyzed, and so on. This makes the analysis complex and more difficult to implement.

• In real programs, it is rarely the case that every statement is involved in a compute-intensive segment. Experience has shown that concentrating our efforts on efficiently parallelizing the few computation-intensive regions of a program can lead to impressive speedups. Parallelizing the other regions leads to marginal gain or none at all. It seems wasteful, then, to compute and store data dependence information for regions of the program that are not computation intensive.

We cannot, however, do away with traditional data dependences, because dependence information is a fundamental requirement for determining the validity of many optimizing and parallelism-improving transformations. In the case of analyzing dependences in the presence of subroutine calls, some methods that can be used as an alternative to data dependences have recently been proposed. They are briefly described in the following section.

* This research was partly supported by the NSF under Grant ASC8518578 (while the author was in the computer science department at Rice University) and by a Prize Fellowship in Concurrent Computing at Caltech. 0743-7315/90 $3.00 Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

2. SUMMARIZING DATA ACCESS

Callahan and Kennedy introduced a structure called the Regular Section Descriptor (RSD), which was used to summarize accesses to array variables within a subroutine [13, 12]. For each array variable referenced in the subroutine, an RSD that describes the portion of the array that is referenced within the subroutine is computed. In order to make implementation easy, the accessed portion of an array was defined by specifying the bounds of the induction variables involved in the subscripts of the array references. This limited the shapes describable by an RSD to a single row, a single column, a single diagonal, or the entire array. The RSD is used when testing for a data dependence between two references to the same array variable, one outside a subroutine and the other within. Instead of examining the body of the subroutine, the RSD summary for the array variable in that subroutine is used. This allowed dependence testing in the presence of subroutine calls to be done much faster than a brute-force approach. However, since only a single RSD is used to summarize the effects of multiple references to an array, the RSD can introduce a lot of imprecision into the dependence testing process. Moreover, the inability of the RSD to represent more complex shapes such as triangular sections also made it too imprecise in some cases.

Li and Yew suggested a different method of representing array accesses in subroutines, using a structure called the atom [23, 22]. An atom encodes an array reference, essentially by storing the subscript expressions and the bounds of any loop induction variables involved in those expressions. Every single reference to an array within a subroutine has an associated atom. No attempt is made to compute a single atom for all references to the same array variable. These atoms are "translated" to atom images, which are in terms of variables at the call site, using the parameter binding functions. When testing for dependence, a reference to an array variable inside the subroutine is replaced by its corresponding atom image. This method allows much more precise dependence testing than the RSD, but is clearly a lot more storage intensive. The problem of summarizing these atoms into a single representation for each variable is equivalent to the problem of constructing Regular Section Descriptors.

Yet another technique, proposed by Triolet, uses a set of linear inequalities to describe the boundaries of the accessed portion [27, 29] of an array. Other than the linearity requirement, there is no further restriction on these inequalities, so that arbitrary convex shapes can be described. Unfortunately, this generality also makes the representation expensive to use and implement. Such arbitrary access shapes are rarely encountered in practice, so such a general representation seems unnecessary.

All of the above methods are designed for the purpose of interprocedural data dependence analysis. These representations can be modified for more general use, such as a means of presenting dependence information to the programmer in a more conceptual form. However, the goal of this paper is to take the idea of summarizing information a step further. The proposed method of summarizing data access information attempts to improve on the existing methods, and also aims to unify the concepts of data dependence and data access. A data dependence between two statements is treated as an intersection of the data access summaries of the two statements. This concept can be extended to regions that are composed of several statements, by letting the data access summary of the region be the union of the data access summaries of its statements. Then, a data dependence between two regions is the intersection of their data access summaries. These ideas are elaborated in the subsequent sections. We will not be concerned with the problem of how the information should be presented to the programmer, or how extracted information can be maintained in some optimal way that makes subsequent requests easier to service. These are considered to be user-interface and database issues, respectively, concerns that are not relevant to the topic of this paper.

3. SIMPLIFYING ASSUMPTIONS

Some assumptions are made in the paper, primarily to simplify the treatment of the topic. We will only consider programs that are written for scientific or engineering applications. Such programs typically use numerical techniques to model the mechanics of the entities whose behavior is being studied. Spatial structures are usually modeled by 1-, 2-, or 3-dimensional grids, which are represented in the program by 1-, 2-, or 3-dimensional arrays, respectively. The elements of the array correspond to grid points in the model. The numerical algorithms that are used to perform computations on the array elements (or grid points) generally access the array in very regular ways [26, 25, 19, 14]. Many of the parallelization strategies used for scientific programs exploit the fact that different portions of the program access different sections of an array and can therefore be executed simultaneously on parallel processors. The program is assumed to be a sequence of loops, each of which may enclose other loops and statements. A set of statements that are not enclosed by a loop can be assumed to be in a dummy loop that iterates only once. Nonrecursive subroutine and function calls are allowed within these loops.

4. AN EXTENDED DEFINITION OF DATA DEPENDENCE
Traditionally, a data dependence is said to exist between a pair of statements if both access one or more common memory locations. We can generalize the notion of data dependence to groups of statements (or regions) of a program as follows. Let R1 and R2 be two regions in a program. Then, there exists a data dependence between R1 and R2 if there is some set of memory locations that is accessed by both regions. The dependence between the two regions is quantified by the set of memory locations that are accessed by both. For the purposes of this paper, we will only deal with "regions" that belong to the following hierarchy:

• The entire program is a region. This is the largest possible region.
• Every loop, along with the statements comprising the body of that loop, is a region.
• The body of the innermost loop in a loop nest is a region.
• A subroutine or function call statement is a region. This region consists of the regions that make up the body of the subroutine.

The definition of what can constitute a region can be refined further to include if-then-else compound statements, basic blocks, and single statements. This will result in finer-granularity internal information, allowing more precise internal analyses at the cost of increased storage and computation time requirements. This may be a worthwhile price to pay in some cases, for instance, when the programming tool is being designed to help uncover fine-grain parallelism.

The set of memory locations accessed in region R1 is denoted by A(R1), and the set accessed in region R2 by A(R2). A dependence between the two regions can be represented by A(R1, R2) = A(R1) ∩ A(R2), consisting of all locations referenced by both regions. A(R) is called the Data Window of the region R. The Data Window of a region describes the portion of the program's data structures that may be accessed during an execution of the statements in that region. The Data Window consists of a set of Data Access Descriptors, one for each variable referenced in the region. The Data Access Descriptor of a variable A in a given region R, denoted by δR(A), contains a summary of the accesses to the variable A within the region R. A(R1, R2) is the data overlap between R1 and R2 and represents the portion of the program's data structures that is accessed by both the regions R1 and R2. A nonempty data overlap indicates a dependence between R1 and R2.

5. THE DATA ACCESS DESCRIPTOR
Given an n-dimensional array A that is accessed within a region R of a program, the Data Access Descriptor for A in R, denoted by δR(A), is a 3-tuple (θ | S | τ), where θ is the reference template, S is the simple section approximating the portion of A that is accessed within R, and τ is the traversal order of S. The terms simple section, traversal order, and reference template will be defined shortly. δR(A) actually consists of two components, called the write component and read component, denoted by δRw(A) and δRr(A), respectively. δRw(A) is a Data Access Descriptor (θw | Sw | τw) that describes the simple section of A that is written or stored into within region R, and δRr(A) is a Data Access Descriptor (θr | Sr | τr) that describes the simple section of A that is read within region R.

5.1. Simple Section

The simple section is designed to provide a compact representation of commonly encountered array access shapes. It also allows certain frequent operations involving simple sections to be fast and easy to implement. A simple section is any convex polytope that is bounded by simple boundaries. Given an n-dimensional space with coordinate axes x1, x2, …, xn, a simple boundary is a hyperplane of the form xi = c or xi ± xj = c, where xi, xj are any two different coordinate axes, and c is an integer constant. Thus, a simple boundary can either be parallel to any one coordinate axis or at a 45° angle to a pair of coordinate axes. The above definition can be extended to allow a symbolic expression to occur in place of the integer constant. Many commonly occurring access patterns, such as access of any rectangular section, triangular section, diagonal, or band of diagonals, can be described precisely by a simple section. For a section that does not have simple boundaries, we will determine the smallest simple section that contains it. Simple sections have several useful properties. These are outlined without proofs in the remainder of this section.
Their proofs can be found in earlier work by the author [3, 6].

Any simple section in n dimensions can be described by at most 2n² simple boundaries. Consider an n-dimensional space with coordinate axes x1, x2, …, xn. For each coordinate axis xi, there can be a pair of boundary hyperplanes of the form xi = c, which are simple boundaries. For every pair of coordinate axes xi and xj there can be a pair of "diagonal" bounding hyperplanes of the form xi + xj = c and another pair of diagonal bounding hyperplanes of the form xi − xj = c, all of which are also simple boundaries. Since there are C(n, 2) = n(n − 1)/2 possible ways of choosing a pair of coordinate axes from n axes, there can be at most 2·C(n, 2) = n(n − 1) pairs of diagonal boundaries. Thus, the maximum number of simple boundaries possible is 2(n + n(n − 1)) = 2n² boundaries. A simple section in n dimensions can have fewer than 2n² boundaries, in which case the remaining boundaries that do not contribute to the shape of the section are redundant boundaries. Figure 1 shows a simple section in 2 dimensions with its redundant boundaries shown as broken lines. Although redundant boundaries do not contribute to the shape of a section, they are very useful when computing the union of simple sections. Thus, a simple section in n dimensions is described by specifying all its 2n² simple boundaries (i.e., both nonredundant and redundant boundaries). All the boundaries are required to be tight boundaries (i.e., each boundary must contain at least one point of the simple section). Since redundant boundaries do not contribute to the shape of the section, a tight redundant boundary must touch the section at a vertex.
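The 2n² count can be checked mechanically. The fragment below (Python, a sketch not taken from the paper) enumerates the ψ expressions of an n-dimensional simple section and counts two boundaries, one lower and one upper, per expression:

```python
from itertools import combinations

def boundary_expressions(n):
    """The psi expressions of an n-dimensional simple section: one per
    axis, plus a sum and a difference for every pair of distinct axes."""
    exprs = [f"x{i}" for i in range(1, n + 1)]
    for i, j in combinations(range(1, n + 1), 2):
        exprs += [f"x{i}+x{j}", f"x{i}-x{j}"]
    return exprs

# n + 2*C(n,2) = n^2 expressions, each contributing a lower and an
# upper boundary: 2*n^2 boundaries in all.
for n in (1, 2, 3, 4):
    assert 2 * len(boundary_expressions(n)) == 2 * n * n
```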
5.1.1. Representation of a Simple Section: Simple Boundary Pairs

A convenient way to represent a simple section S is to specify its simple boundary pairs. A simple boundary pair is an inequality of the form α ≤ ψ(x1, …, xn) ≤ β, where ψ(x1, …, xn) is a function of the coordinate axes x1, …, xn such that ψ(x1, …, xn) = α and ψ(x1, …, xn) = β are both simple boundaries of the section S. ψ(x1, …, xn) = α is called a lower boundary and ψ(x1, …, xn) = β is called an upper boundary. For example, the 2-dimensional simple
section in Fig. 1 can be represented by the simple boundary pairs 2 ≤ x ≤ 7 and 1 ≤ y ≤ 6, together with the corresponding diagonal boundary pairs bounding x + y and x − y.

FIG. 2. Intersection and union of simple sections: S1 ∩ S2, S1 ∪ S2, and simp(S1 ∪ S2).
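For concreteness, a simple section can be stored as a map from each ψ expression to its (α, β) boundary pair. The sketch below (Python; the representation and names are illustrative, not from the paper) encodes a 2-dimensional rectangular section together with the tight redundant diagonal pairs implied by its x and y ranges:

```python
# A 2-D simple section stored as boundary pairs: psi -> (alpha, beta).
# For 2 <= x <= 7 and 1 <= y <= 6, the tight redundant diagonal
# boundaries are 3 <= x+y <= 13 and -4 <= x-y <= 6.
section = {
    "x":   (2, 7),
    "y":   (1, 6),
    "x+y": (3, 13),
    "x-y": (-4, 6),
}

def contains(sec, x, y):
    """A point lies in the section iff it satisfies every boundary pair."""
    vals = {"x": x, "y": y, "x+y": x + y, "x-y": x - y}
    return all(lo <= vals[psi] <= hi for psi, (lo, hi) in sec.items())
```

Storing the redundant pairs explicitly, as the text recommends, is what makes the UNION operation of the next section a simple per-pair min/max.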
5.1.2. Intersection and Union of Simple Sections

Intersection and union operations are frequently performed on simple sections (see Fig. 2). Algorithms INTERSECT and UNION compute the intersection and union of simple sections, respectively:
FIG. 1. A simple section in 2 dimensions and its redundant boundaries.

Algorithm INTERSECT
Input: two simple sections S1 and S2, both of dimension n, represented by their simple boundary pairs.
Output: a simple section S = S1 ∩ S2 if the intersection is nonempty, and φ otherwise.
begin
1  for each simple boundary pair Bi = αi ≤ ψ(x1, …, xn) ≤ βi ∈ S1 do
2    get its corresponding boundary pair Bi′ = αi′ ≤ ψ(x1, …, xn) ≤ βi′ ∈ S2;
3    if max(αi, αi′) > min(βi, βi′) then return φ
4    else add the boundary pair B = max(αi, αi′) ≤ ψ(x1, …, xn) ≤ min(βi, βi′) to S;
5  end for
6  return S
end
Algorithm UNION
Input: two simple sections S1 and S2, both of dimension n, represented by their simple boundary pairs.
Output: S = simp(S1 ∪ S2), the simple section containing the union S1 ∪ S2.
begin
1  for each simple boundary pair Bi = αi ≤ ψ(x1, …, xn) ≤ βi ∈ S1 do
2    pick its corresponding boundary pair Bi′ = αi′ ≤ ψ(x1, …, xn) ≤ βi′ ∈ S2;
3    add the boundary pair B = min(αi, αi′) ≤ ψ(x1, …, xn) ≤ max(βi, βi′) to S;
   end for
4  return S
end

The intersection and union of simple sections have some important properties, which are stated here without proof. The proofs are given elsewhere [3, 6].

• Simple sections are closed under intersection, but are not closed under union. Since it is not possible to compute the exact union and still stay within the simple section framework, algorithm UNION computes an approximation to the exact union, S = simp(S1 ∪ S2).
• The approximate union, simp(S1 ∪ S2), computed by algorithm UNION is the smallest simple section containing the exact union S1 ∪ S2.
• The union computed by algorithm UNION is associative; i.e., if S1, S2, and S3 are all simple sections of the same dimension, then

simp(S1 ∪ simp(S2 ∪ S3)) = simp(simp(S1 ∪ S2) ∪ S3).

This result states that the final shape of a simple section computed by unioning together several different simple sections is independent of the order in which the unions are performed.
• Algorithms INTERSECT and UNION each take time proportional to the square of the number of dimensions of the section. When the maximum number of dimensions is fixed, they take constant time.
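Algorithms INTERSECT and UNION map directly onto a boundary-pair representation. The sketch below (Python; a dict from each ψ expression to its (α, β) pair is my own encoding, not the paper's) returns None for an empty intersection in place of φ:

```python
def intersect(s1, s2):
    """Algorithm INTERSECT: pairwise max of lower bounds and min of
    upper bounds; the result is empty if any pair crosses."""
    result = {}
    for psi, (lo1, hi1) in s1.items():
        lo2, hi2 = s2[psi]                    # corresponding pair in s2
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:
            return None                       # empty intersection (phi)
        result[psi] = (lo, hi)
    return result

def union(s1, s2):
    """Algorithm UNION: simp(S1 u S2), the smallest simple section
    containing the exact union -- min of lowers, max of uppers."""
    return {psi: (min(lo1, s2[psi][0]), max(hi1, s2[psi][1]))
            for psi, (lo1, hi1) in s1.items()}
```

Note that `union` over-approximates: a point of `union(s1, s2)` need not lie in either argument, exactly as the simp(S1 ∪ S2) discussion above states.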
5.2. Traversal Order

The traversal order field of the Data Access Descriptor contains information about how the section described by the simple section field is traversed. To gather this information from an array reference, we must first determine the dominant induction variable of each subscript position of the reference:

Let A(x1, x2, …, xn) be an array reference contained within m loops with induction variables I1, I2, …, Im. Consider the loop at nesting depth k with induction variable Ik (1 ≤ k ≤ m). The dominant induction variable of the ith subscript position, with respect to the Ik loop, is the most deeply nested of the induction variables Ik, Ik+1, …, Im that occur in the subscript expression xi; if none of them occurs in xi, the ith subscript position has no dominant induction variable with respect to the Ik loop.

We assume that for a given array reference and a loop Ik that contains the reference, the dominant induction variables of the subscript positions are available as a vector IDOMIk. If the ith subscript position has no dominant induction variable, IDOMIk(i) = "−", and if its dominant induction variable is Is, s ≥ k, then IDOMIk(i) = Is. The length of the vector IDOMIk is equal to the dimension of the array (i.e., there is one component in the vector for each subscript position in the array reference). IDOMIk, for a given loop Ik containing the array reference, can be computed in a straightforward manner using the method outlined in the definition above.

The traversal order of a section is determined from the dominant induction variables of the array reference:

Let A(x1, x2, …, xn) be an array reference contained within m loops with induction variables I1, I2, …, Im. Consider the loop at nesting depth k with induction variable Ik (1 ≤ k ≤ m). Let the distinct dominant induction variables of the subscript positions, listed in decreasing order of nesting depth, be Is1, …, Isr, and let Vj be the set of subscript positions whose dominant induction variable is Isj. The traversal order of the reference with respect to the Ik loop is then τ = V1 > V2 > ⋯ > Vr.

The computation of traversal order is illustrated for the following reference:

DO I = 1, 100
  DO J = 1, 100
    DO K = 1, 100
      A(I+J, K+2, J) = ⋯
    ENDDO
  ENDDO
ENDDO

If we embed the array A in a 3-dimensional space with coordinates x1, x2, x3, then any reference to A can be represented as A(x1, x2, x3). The dominant induction variables
for x1, x2, and x3 with respect to the outermost loop are J, K, and J, respectively (i.e., IDOMI = {J, K, J}). The list of distinct dominant induction variables in decreasing order of nesting depth is K, J. We must construct two sets, V1 and V2, that consist of the subscript positions of the array reference that have K and J as dominant induction variables, respectively. In this case, only the second subscript position has K as a dominant induction variable. Thus V1 = {x2}. The first and third subscript positions of the reference have J as their dominant induction variables. Thus V2 = {x1, x3}. The traversal order for this array reference with respect to the I loop is thus τ = x2 > {x1, x3}. Figure 3 shows the traversal order represented by τ = x2 > {x1, x3}. This traversal order indicates that access along dimension x2 occurs faster than access along dimensions x1 and x3.

Two traversal orders τ1 and τ2 are said to match if the orders described by both are consistent with one another. This is denoted by τ1 ≡ τ2. We will henceforth assume the existence of a procedure MATCH that takes as arguments two traversal orders and returns the "matched portion" of the traversal orders. If the traversal orders do not match, the procedure returns the symbol "∗". Consider two other traversal orders, given by τ′ = x2 > x1 and τ″ = x2 > x1 > x3. MATCH(τ, τ′) will return x2 > x1, while MATCH(τ, τ″) will return "∗".
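The IDOM and traversal-order computations can be sketched as follows (Python; representing each subscript position simply by the set of induction variables it mentions is my own simplification, not the paper's):

```python
def idom(subscripts, loops):
    """IDOM vector: for each subscript position, its dominant induction
    variable (the most deeply nested loop variable occurring in it),
    or None in place of '-'. `loops` lists variables outermost first."""
    vec = []
    for vars_in_subscript in subscripts:
        present = [v for v in loops if v in vars_in_subscript]
        vec.append(present[-1] if present else None)   # deepest one
    return vec

def traversal_order(subscripts, loops):
    """tau = V1 > V2 > ... > Vr: subscript positions (1-based) grouped
    by dominant induction variable, deepest-nested group first."""
    vec = idom(subscripts, loops)
    order = []
    for v in reversed(loops):                          # deepest loop first
        group = [i + 1 for i, d in enumerate(vec) if d == v]
        if group:
            order.append(group)
    return order

# The paper's example: A(I+J, K+2, J) nested in DO I / DO J / DO K
subs = [{"I", "J"}, {"K"}, {"J"}]
print(idom(subs, ["I", "J", "K"]))              # ['J', 'K', 'J']
print(traversal_order(subs, ["I", "J", "K"]))   # [[2], [1, 3]], i.e. x2 > {x1, x3}
```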
5.3. Reference Template

The reference template is used to translate the Data Access Descriptor for a variable from the context of one region to the context of an enclosing region. This is necessary when, for example, the data access information within a subroutine is required in terms of the variables at its call site. Since the data access information for the subroutine is in terms of the formal parameters to the subroutine, it must be translated to be in terms of the actual parameters passed at the call site. Translation of Data Access Descriptors is the subject of a later section.

Let A(x1, x2, …, xn) be an array reference contained within m loops with induction variables I1, I2, …, Im. The reference template for this reference with respect to the Ik loop is a vector with n components. The ith component of the vector is determined as follows:

• If xi has a dominant induction variable with respect to the Ik loop, the ith component of the reference template is the variable "xi". It indicates that the ith subscript (or ith dimension) is variant with respect to the Ik loop, and IDOMIk(i) ≠ "−".
• If xi has no dominant induction variable with respect to the Ik loop, the ith component of the reference template is the subscript expression xi. In this case, the ith subscript is invariant with respect to the Ik loop, and IDOMIk(i) = "−".

The simple section of an array reference in a loop consists of the range of those "xi" that correspond to subscripts (or dimensions) that are variant with respect to that loop. The variables "xi" that correspond to invariant subscripts do not appear in the simple boundary pairs that describe the simple section. These invariant subscripts still contribute to the shape of the simple section, since they represent simple boundary pairs of the form γ ≤ xj ≤ γ, where γ is the expression in the jth subscript position that is invariant with respect to the loop in question.

6. COMPUTATION OF THE DATA ACCESS DESCRIPTOR

This section describes the algorithms used to compute the Data Access Descriptor of a variable in a given region.

6.1. Data Access Descriptor of a Scalar Variable

Scalar variables can be treated differently from arrays by defining the Data Access Descriptor for a scalar reference to be simply the name of the scalar variable. The simple section and traversal order fields will not contain any useful information and can be ignored. All references to the same scalar variable will thus have the same Data Access Descriptor.

6.2. Data Access Descriptor of an Array Accessed within a Loop

Consider a reference to an n-dimensional array A enclosed within m loops:
DO I1 = L1, U1
  DO I2 = L2, U2
    …
      DO Im = Lm, Um
        A(x1, x2, …, xn) = ⋯
      ENDDO
    …
  ENDDO
ENDDO

where the Ik, 1 ≤ k ≤ m, are loop induction variables, and Lk, Uk are the lower and upper bounds of Ik, respectively. L1 and U1 are integer constants, and Li and Ui, 2 ≤ i ≤ m, are of the form p + qIs, 1 ≤ s < i.

FIG. 3. Traversal order τ = x2 > {x1, x3} (row-wise access along dimension x2).
Each subscript expression xi of the reference, and each sum xi + xj or difference xi − xj of two subscripts, gives a boundary expression ψ that is a linear function of the induction variables:

ψ = a0 + a1 I1 + a2 I2 + ⋯ + am Im.

Given an array reference, we can compute the simple boundary pairs of the array section referenced within the region enclosed by the Ih loop by determining the lower bound lb(ψ) and the upper bound ub(ψ) for each of the n² possible ψ. These bounds are given by

lb(ψ) = a0 + Σ (ak+ Lk − ak− Uk) ≤ ψ ≤ a0 + Σ (ak+ Uk − ak− Lk) = ub(ψ),   (1)

where the sums run over k = 1, …, m, and a+ and a− are the positive and negative parts of a number, defined as a+ = max(a, 0) and a− = max(−a, 0). The inequality (1) is a property of linear functions and is adapted from the following result due to Banerjee [8, 9]: if the rectangle ℛ ⊆ Rⁿ, defined by ℛ = {(z1, z2, …, zn) : p1 ≤ z1 ≤ q1, p2 ≤ z2 ≤ q2, …, pn ≤ zn ≤ qn}, is nonempty, then the minimum and maximum values of the function f(z) = Σ ak zk in ℛ are Σ (ak+ pk − ak− qk) and Σ (ak+ qk − ak− pk), respectively. In our case, ℛ is the iteration space ℛ = {(I1, I2, …, Im) : L1 ≤ I1 ≤ U1, L2 ≤ I2 ≤ U2, …, Lm ≤ Im ≤ Um} defined by the loop induction variables I1, I2, …, Im. The array reference maps a portion of the iteration space into memory. The simple section is an approximate description of this portion.

The bounds for a given ψ cannot be evaluated in a single step using Eq. (1), because the Lk, Uk, k ≥ 2, can be of the form p + qIs, 1 ≤ s < k. Algorithm SIMPLESECTION therefore computes them with the recursive procedures LBOUND and UBOUND:

Algorithm SIMPLESECTION
Input: dimension n of the array, the nesting depth m of the array reference, the bounds Lh, Uh, …, Lm, Um of the loop induction variables Ih, Ih+1, …, Im, and the subscript expressions x1, x2, …, xn of the array reference.
Output: the simple section S that describes the portion of the array referenced in the region enclosed by the Ih loop.
begin
1  let δh(x1, x2, …, xn) = {xi | IDOMIh(i) ≠ "−"};
2  let l = ‖δh‖;  /* δh is the set of xi that are variant with respect to the Ih loop. The simple section consists of l² simple boundary pairs. */
3  for each boundary expression ψ formed from the xi ∈ δh (each xi, and each sum or difference of two of them) do
4    add the boundary pair LBOUND(h, ψ) ≤ ψ ≤ UBOUND(h, ψ) to S;
5  endfor
6  return S
end

procedure LBOUND(h, ξ)
begin
1  if ξ does not contain any terms in Ih, Ih+1, …, Im
2  then return ξ
3  else let Ik, h ≤ k ≤ m, be the highest remaining term in ξ;
4    replace the Ik in ξ by (ak+ Lk − ak− Uk);
5    let ξ′ be the resulting expression;
6    return LBOUND(h, ξ′)
   endelse
end

procedure UBOUND(h, ξ)
begin
1  if ξ does not contain any terms in Ih, Ih+1, …, Im
2  then return ξ
3  else let Ik, h ≤ k ≤ m, be the highest remaining term in ξ;
4    replace the Ik in ξ by (ak+ Uk − ak− Lk);
5    let ξ′ be the resulting expression;
6    return UBOUND(h, ξ′)
   endelse
end
Consider the following loop nest: DOI=l,lOO DOJ=l,lOO DOK=l,lOO A(I, I+J,K+2, ENDDO
50, IPVT(J))=
-.a
The IDOM vector of dominant induction variables and the reference templates 0 for the above reference with respect to the K, J, and Z loops are, respectively: Kloop: IDOMK = { -, -, K, -, - ) andoK= {Z,Z+ J,xg,50,3}. Jloop: IDOMJ = { -, J, K, -, J> andBJ= {Z,x~,x3,50,V}. Z loop: IDOM, = {I, J, K, -, J} ando,= {x,,x2,x3,50,V’). l
l
l
6.4. Dealing with Symbolic Variables
Symbolic variables can occur in the lower and upper The algorithm calls proceduresLBOUND and UBOUND bounds of the loop induction variables as well as in the subto compute the lower bound and upper bound of the simple script expressions of an array reference. Although the techboundary pairs of the simple section. Note that after each niques presented here only attempt to be precise up to symbolic execution, it is possible to extend the Data Access Derecursive call to the procedures LBOUND and UBOUND, the expression [ has at least one less induction variable than scriptor representation to keep a limited number of it did prior to that call. Eventually, Ewill only contain terms symbolic variables. The details of this are beyond the scope in I,, . . . , Zhpl, and the recursion will terminate. In the of this paper, but can be found elsewhere [ 31. Figure 4 shows some loops RI , R2, R3 that contain referworst case, if each induction variable is triangular with respect to the induction variable of its containing loop, proce- ences to a 2-dimensional array A, of size 100 X 100, and dures LBOUND and UBOUND would require 0( nesting the corresponding simple sections described by them depth) recursive calls to complete. For fixed dimensions, (dotted lines represent redundant boundaries). The Data the number of possible J/ expressions for a simple section is Access Descriptors for each of these regions are given by a small constant (see Section 5.1.1) . Algorithm SIMPLER, (entire array) SECTION thus has a worst-case time complexity of 0( nesting depth) of the array reference with respect to the 1 I I 1 GX, x2 -99 G x, - x2 < 99 1 6.3. Dealing with Complex Subscript Expressions i I i l
Sometimes, a subscript expression may be a nonlinear expression in the loop induction variables, or a subscripted variable (such as an array element). Such subscripts are called complex subscripts. When an invariant subscript position has a complex expression, its corresponding entry in the reference template vector will be denoted as “3”. This indicates that the subscript position is a symbolic constant whose value is unknown. When a variant subscript position has a complex expression, it may not be possible to accurately determine the range of values that it can have. When constructing the simple section, it is necessaryto be conservative and assume that such a subscript position can have any value within its declared range. This is denoted by a “V” symbol in its corresponding entry in the reference template.
R1:  DO I = 1, 100        R2:  DO I = 1, 100        R3:  DO I = 1, 100
R'1:   DO J = 1, 100      R'2:   DO J = 1, 100      R'3:   DO J = 1, I
         A(J,I) = ...              A(I,I) = ...               A(I,J) = ...
       ENDDO                     ENDDO                      ENDDO
     ENDDO                     ENDDO                      ENDDO

FIG. 4. Some array references within loops.

• R1 (entire array)
  δ_{R1}(A) = (x1, x2 | 1 ≤ x1 ≤ 100, 1 ≤ x2 ≤ 100, 2 ≤ x1 + x2 ≤ 200, −99 ≤ x1 − x2 ≤ 99 | x1 > x2)

• R'1 (single column)
  δ_{R'1}(A) = (x1, I | 1 ≤ x1 ≤ 100 | x1)

• R2 (single diagonal)
  δ_{R2}(A) = (x1, x2 | 1 ≤ x1 ≤ 100, 1 ≤ x2 ≤ 100, 2 ≤ x1 + x2 ≤ 200, 0 ≤ x1 − x2 ≤ 0 | x1 = x2)

• R'2 (single element)
  δ_{R'2}(A) = (I, I | − | −)

• R3 (triangular section)
  δ_{R3}(A) = (x1, x2 | 1 ≤ x1 ≤ 100, 1 ≤ x2 ≤ 100, 2 ≤ x1 + x2 ≤ 200, 0 ≤ x1 − x2 ≤ 99 | x2 > x1)

• R'3 (part of a single row)
  δ_{R'3}(A) = (I, x2 | 1 ≤ x2 ≤ I | x2)

7. TRANSLATION OF DATA ACCESS DESCRIPTORS

In an actual implementation, it may be expensive to compute and store the Data Access Descriptors for each variable with respect to every region that contains the variable. For instance, if a variable reference is contained within m loops, the Data Access Descriptor for that variable with respect to each of the loops may be different, so that m different Data Access Descriptors may have to be computed. Fortunately, it is sufficient to compute the Data Access Descriptor for a variable with respect to the innermost region that contains it. This can be converted to give the Data Access Descriptor for the variable in terms of the immediately enclosing region incrementally, without recomputing the reference template, simple section, and traversal order, by a process called translation. The reference template and the simple section fields of the Data Access Descriptor contain sufficient information to perform the translation. Two kinds of translation are discussed in this section. The first involves translation of the Data Access Descriptor of a region to a loop that contains the region. The second kind involves translation of the Data Access Descriptor for a variable accessed within a subroutine to the call site of the subroutine.

7.1. Translation to a Surrounding Loop

Consider an array reference A(x1, x2, ..., xn). Let δ_{R'}(A) be the Data Access Descriptor for this reference to A in the region R'. Let I_k be the induction variable of the loop immediately surrounding the region R' (i.e., if R' is a loop, then the two loops are assumed to be perfectly nested). Translation to a loop that encloses an imperfectly nested loop must factor in those accesses that occur in the portion of code that belongs only to the outer loop. Let R be the region described by the I_k loop. Algorithm TRANSLATE_TO_LOOP translates δ_{R'}(A) to δ_R(A), the Data Access Descriptor for the reference with respect to the region R described by the I_k loop:

Algorithm TRANSLATE_TO_LOOP
Input:  The Data Access Descriptor δ_{R'}(A) = (θ_{R'} | S_{R'} | τ_{R'}) for A in the region R'.
Output: The Data Access Descriptor δ_R(A) = (θ_R | S_R | τ_R) for A with respect to the region R (i.e., the region described by the I_k loop).
begin
    /* Computation of θ_R */
 1  for i = 1 to n do
 2      if θ_{R'}(i) contains an I_k term then
            /* the ith subscript is variant in R */
 3          replace θ_{R'}(i) with the variable "x_i";
 4          add x_i to NEW_VARIANTS;
 5      else if θ_{R'}(i) is variant in R' then
 6          add x_i to OLD_VARIANTS;
 7      else if θ_{R'}(i) = "∃" or θ_{R'}(i) = "∀" then
 8          if θ_{R'}(i) = "∃" then set θ_{R'}(i) = "∀";
 9          add x_i to UNKNOWN_SYMBOLICS;
10      else add x_i to INVARIANTS;
        endif
    endfor
    /* Computation of τ_R */
11  τ_R = "τ_{R'} > {NEW_VARIANTS}";
    /* Computation of S_R */
12  S_R = S_{R'};
    /* In the following, refer to Alg. SIMPLE_SECTION for procedures LBOUND and UBOUND */
13  for all x_i ∈ OLD_VARIANTS do
14      for all x_j ∈ NEW_VARIANTS do
15          add to S_R the simple boundary pairs
                LBOUND(k, (lb(x_i) + θ_{R'}(j))) ≤ x_i + x_j ≤ UBOUND(k, (ub(x_i) + θ_{R'}(j)));
            ⋮
        endfor
    endfor
20  for all simple boundary pairs α ≤ ψ ≤ β ∈ S_R do
21      if α contains an I_k term then replace α with LBOUND(k, α);
22      if β contains an I_k term then replace β with UBOUND(k, β);
    endfor
23  return δ_R(A) = (θ_R | S_R | τ_R)
end
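For intuition, the LBOUND/UBOUND substitution on lines 21-22 can be sketched for the simple case where a bound is an affine function a·Ik + c of the induction variable and the loop bounds are known constants (a simplifying assumption; the algorithm itself handles symbolic, possibly triangular bounds recursively):

```python
# Sketch of the LBOUND/UBOUND idea, assuming a bound of the form a*Ik + c
# with constant loop bounds lo <= Ik <= hi.

def lbound(a, c, lo, hi):
    """Minimum of a*Ik + c over lo <= Ik <= hi."""
    return min(a * lo + c, a * hi + c)

def ubound(a, c, lo, hi):
    """Maximum of a*Ik + c over lo <= Ik <= hi."""
    return max(a * lo + c, a * hi + c)

# Boundary pair K+1 <= x1 <= N with K running over 1..N-1 (here N = 100):
# translating out of the K loop widens it to 2 <= x1 <= 100.
N = 100
print(lbound(1, 1, 1, N - 1), ubound(0, N, 1, N - 1))   # 2 100
```

Taking the min/max over both endpoints handles negative coefficients (a decreasing bound) as well.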
The algorithm translates a "∃" in the reference template to a "∀". The "∃" in the reference template indicates that its corresponding subscript position is a symbolic constant whose value is unknown. When this is translated out to an enclosing region, we conservatively assume that this subscript position can vary between its declared lower and upper bounds within the new region. This is indicated by the "∀" symbol in the reference template. The reader can use the examples shown in Fig. 4 to verify the translation algorithm.

7.2. Translation from a Subroutine to a Call Site

Translation of a Data Access Descriptor from within a subroutine to its call site is more complex. The complexity of the translation process depends on the parameter passing rules defined by the programming language. Algorithm TRANSLATE_TO_CALL translates a Data Access Descriptor for an array reference within a subroutine to the call site of the subroutine:

Algorithm TRANSLATE_TO_CALL
Input:  The Data Access Descriptor δ(A) = (θ | S | τ) for an n-dimensional array A, and the binding function b that maps the formals to actuals at the call site.
Output: The Data Access Descriptor δ(b(A)) = (θ' | S' | τ').
begin
 1  if A and b(A) are of the same dimension and size then
 2      δ(b(A)) = δ(A) with all symbolic variables X replaced by b(X);
 3      return δ(b(A)) = (θ' | S' | τ').
    endif
 4  if dimension of A > 1 then
 5      linearize all references to A;
    endif
    /* now for the case where dimension of A = 1 and dimension of b(A) > 1 */
 6  let δ(A) be of the form (x1 | α ≤ x1 ≤ β | x1);
 7  and let the size of b(A) be N × N × ... × N (N^n elements in total);
 8  the case β = α is the trivial single-element access, so assume β > α;
    /* compute the bounds for all dimensions */
 9  L(1) = (α mod N == 0) ? N : α mod N;
10  U(1) = (β mod N == 0) ? N : β mod N;
11  for i = 2 to n do
12      L(i) = (α mod N^(i-1) == 0) ? (α div N^(i-1)) mod N : (α div N^(i-1) + 1) mod N;
13      U(i) = (β mod N^(i-1) == 0) ? (β div N^(i-1)) mod N : (β div N^(i-1) + 1) mod N;
    enddo
14  θ'(1) = "x1";   /* first dimension of b(A) is variant */
15  for i = 2 to n do
16      if (β / N^(i-1) > 1) then
17          θ'(i) = "x_i";   /* ith dimension of b(A) is variant */
18          add to S' the boundary pair ...
            ⋮
26          add lb(x_i) − ub(x_j) ≤ x_i − x_j ≤ ub(x_i) − lb(x_j) to S';
    endfor
27  τ' = x1 > x2 > ... > xk;
28  return δ(b(A)) = (θ' | S' | τ');
end
Translation of the Data Access Descriptor of a scalar variable is trivial; the name of the scalar is replaced with the actual name bound to it at the call site. It is assumed that for every call site in the program, we are given a binding function b. The binding function maps formal parameter names in the subroutine to actual parameter names at the call site. If X is a formal variable in the subroutine, the actual name bound to it is denoted by b(X). If X is a global or COMMON variable (or passed by reference), b(X) = X. In the case of an array variable, the actual parameter is assumed to be the address of the first element of the array. If this is not the case, the algorithm presented here can be modified in a straightforward manner to take this into account. Algorithm TRANSLATE_TO_CALL performs the translation of a Data Access Descriptor in a subroutine to the call site. The algorithm considers three cases:

Case 1: The first case is when the array variable A and its actual parameter b(A) are of the same dimension and size. This implies that the shape of the accessed portion of A within the subroutine is preserved upon translation to the call site. In other words, the simple section of b(A) has the same shape as the simple section of A. The translation is accomplished by replacing all symbolic variables X ∈ δ(A) by b(X).

Case 2: The second case is when A is of dimension m > 1, and b(A) is of dimension n > 1, m ≠ n. All references to the array A are linearized, so that the references map onto a linear array (of dimension 1). They can now be treated as part of Case 3. Details of the linearization process are fairly straightforward and have been omitted from the algorithm.

Case 3: The third case is when the array A within the subroutine is a linear array (i.e., a vector), but b(A) is a higher-dimensional array. Clearly, shapes are not preserved upon translation, because a 1-dimensional section of A must be translated to a multidimensional section of b(A). Let the array b(A) be of size N × N × ... × N (i.e., N^n elements). The translation done in algorithm TRANSLATE_TO_CALL is based on the following observation. Assuming column-major storage for the multidimensional array b(A), the first N elements of A will map to the first column of b(A), the first N² elements of A will map to the first x1-x2 plane of b(A), and so on. Thus, an access of the first N elements of the vector A translates to an access of a single column of b(A), so that accesses in b(A) are along dimension x1 only, and the other dimensions are invariant. An access of more than N but fewer than N² consecutive elements of A translates to accesses along dimensions x1 and x2 in b(A).
That is, we have "spilled over" onto the next dimension. The translation must take into account the element of A that the accesses start from. The algorithm given here assumes that accesses always start from the first element. The extension to the more general case is straightforward and has been omitted for simplicity of exposition.
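The spill-over effect can be made concrete with a small sketch. This is an independent formulation, not the algorithm's exact L(i)/U(i) formulas: it computes, per dimension, the bounding index range touched when elements α..β of the linear formal map column major onto an N × ... × N actual array:

```python
# Sketch (own formulation): bounding box of dimensions touched when
# elements alpha..beta of a linear formal array map, column major, onto
# an actual array of shape N x N x ... x N (n dimensions, 1-based).

def spill_bounds(alpha, beta, N, n):
    """1-based (low, high) index bounds in each dimension of the actual."""
    bounds = []
    for i in range(1, n + 1):
        period = N ** (i - 1)       # linear distance of one step in dim i
        container = N ** i          # elements per slab of dims 1..i
        # If the range crosses a slab boundary, dimension i wraps around
        # and is fully spanned; otherwise take the start/end indices.
        if (alpha - 1) // container != (beta - 1) // container:
            bounds.append((1, N))
        else:
            lo = (alpha - 1) // period % N + 1
            hi = (beta - 1) // period % N + 1
            bounds.append((lo, hi))
    return bounds

# Elements 1..25 of a vector mapped onto a 10 x 10 array: rows 1..10 of
# columns 1..3 are touched (the access "spills over" into column 3).
print(spill_bounds(1, 25, 10, 2))   # [(1, 10), (1, 3)]
```

Note the result is a bounding box, so it may overestimate the exact element set, which is the same kind of conservatism the simple section representation accepts.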
Algorithm TRANSLATE_TO_CALL assumes that the strides along the dimensions of A within the subroutine are all unit strides. While this does not have any real consequence for the first case, the other two cases depend on this assumption being true. The problem with nonunit strides is discussed in detail in a subsequent section, and further discussion of this issue is postponed until then. Translation of a Data Access Descriptor is denoted by the symbol "⇑". This will denote translation either to an enclosing loop or from a subroutine to a call site, depending on the context in which it is used.

8. AN EXAMPLE

Let us now examine a realistic program segment and compute the Data Access Descriptors for variables at various points in the program. The subroutine LUDCOMP is adapted from the LINPACK subroutine DGEFA, which factors a real matrix by Gaussian elimination [15]. Subroutine LUDCOMP takes as input the 2-dimensional matrix A to be factored and the order N of the matrix. It returns the factored matrix A in the form of an upper triangular matrix and the multipliers that were used to obtain it. Also returned is an integer vector IPVT, of size N, containing the pivot indices:

      Subroutine LUDCOMP (A, N, IPVT)
 1*     DO K = 1, N-1
 2*       L = IDAMAX(N-K+1, A(K,K), 1) + K - 1
 3        IPVT(K) = L
 4*       swap ( A(IPVT(K),K), A(K,K) )
 5        C = 1 / A(K,K)
 6*       call DSCAL (N-K, C, A(K+1,K), 1)
 7*       DO J = K+1, N
 8          T = A(L,J)
 9*         swap ( A(IPVT(K),J), A(K,J) )
10*         call DAXPY (N-K, T, A(K+1,K), 1, A(K+1,J), 1)
11        ENDDO
12      ENDDO
      end

      Function IDAMAX (M, DX, INCX)
13*     find P, such that |DX(P)| = max( |DX(1)|, ..., |DX(M)| )
14      RETURN P
      end

      Subroutine DSCAL (M, DA, DX, INCX)
15      J = 1
16*     DO I = 1, M
17        DX(J) = DA * DX(J)
18        J = J + INCX
19      ENDDO
      end

      Subroutine DAXPY (M, DA, DX, INCX, DY, INCY)
20      J = 1
21      K = 1
22*     DO I = 1, M
23        DY(J) = DY(J) + DA * DX(K)
24        J = J + INCY
25        K = K + INCX
26      ENDDO
      end
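To make the parameter aliasing in this example concrete, here is a small sketch (illustrative helper, 1-based Fortran-style indices and column-major storage assumed) that maps the strided sweep DSCAL performs on its formal DX back to elements of A for the call at line 6, where DX is bound to A(K+1,K):

```python
# Sketch: (row, col) pairs of an n x n column-major array hit by a
# strided sweep of `count` elements starting at (start_row, start_col).

def touched_elements(n, start_row, start_col, count, stride):
    """Map linear strided offsets back to 1-based (row, col) pairs."""
    base = (start_col - 1) * n + (start_row - 1)   # linear offset of start
    elems = []
    for i in range(count):
        off = base + i * stride
        elems.append((off % n + 1, off // n + 1))  # back to (row, col)
    return elems

# Call at line 6 with N = 6, K = 2: DSCAL touches M = N-K elements of DX
# with INCX = 1, i.e. rows K+1..N of column K of A.
N, K = 6, 2
print(touched_elements(N, K + 1, K, N - K, 1))
# [(3, 2), (4, 2), (5, 2), (6, 2)]
```

This column-of-A view is exactly what the translation to the call site must recover from the subroutine-level summary of DX.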
The Data Access Descriptors for the array variable A are computed at the points denoted by the * line numbers in subroutine LUDCOMP. The subscript on δ is used to denote the line number for which it is computed. The region can be discerned from the context. The Data Access Descriptors for the arrays DX and DY within function IDAMAX and the subroutines DSCAL and DAXPY are:

• In function IDAMAX
  δ^r_13(DX) = (x1 | 1 ≤ x1 ≤ M | x1)

• In subroutine DSCAL
  δ^w_17(DX) = δ^r_17(DX) = (x1 | 1 ≤ x1 ≤ M | x1)

• In subroutine DAXPY
  δ^w_23(DY) = δ^r_23(DY) = (x1 | 1 ≤ x1 ≤ M | x1)
  δ^r_23(DX) = (x1 | 1 ≤ x1 ≤ M | x1).

The Data Access Descriptor for A at line 2 of LUDCOMP can be computed by translating δ^r_13(DX) in function IDAMAX outward to the call site at line 2.

• At line 2 of LUDCOMP
  δ^r_2(A) = ⇑(δ^r_13(DX)) = (x1, K | K ≤ x1 ≤ N | x1)

Similarly, we can get the Data Access Descriptors for A at lines 6 and 10 by translating the Data Access Descriptors in DSCAL and DAXPY outward to their call sites in LUDCOMP.

• At line 6 of LUDCOMP
  δ^w_6(A) = δ^r_6(A) = ⇑(δ^w_17(DX)) = (x1, K | K+1 ≤ x1 ≤ N | x1)

• At line 10 of LUDCOMP
  δ^w_10(A) = ⇑(δ^w_23(DY)) = (x1, J | K+1 ≤ x1 ≤ N | x1)
  δ^r_10(A) = ⇑(δ^r_23(DY)) ∪ ⇑(δ^r_23(DX)) = (x1, J | K+1 ≤ x1 ≤ N | x1) ∪ (x1, K | K+1 ≤ x1 ≤ N | x1)

The writes performed by the swaps at lines 4 and 9 are

  δ^w_4(A) = (∃, K | − | −) ∪ (K, K | − | −)
  δ^w_9(A) = (∃, J | − | −) ∪ (K, J | − | −).

To compute the Data Access Descriptor for A in the region enclosed by the loop at line 7, we must translate the descriptors at lines 8-10 outward to the loop header at line 7. The derivation of the write component of the Data Access Descriptor (i.e., δ^w(A)) is shown below. The read component δ^r(A) can be derived in a similar manner:

  δ^w_7(A) = ⇑(δ^w_9(A) ∪ δ^w_10(A))
           = ⇑( (∃, J | − | −) ∪ (K, J | − | −) ∪ (x1, J | K+1 ≤ x1 ≤ N | x1) )
           = (∀, x2 | K+1 ≤ x2 ≤ N | x2) ∪ (K, x2 | K+1 ≤ x2 ≤ N | x2)
             ∪ (x1, x2 | K+1 ≤ x1 ≤ N, K+1 ≤ x2 ≤ N | x1 > x2)

By translating δ^w_4(A), δ^w_6(A), and δ^w_7(A) outward again, the Data Access Descriptor for A in the region enclosed by the "DO K" loop at line 1 of LUDCOMP can be computed:

  δ^w_1(A) = ⇑( δ^w_4(A) ∪ δ^w_6(A) ∪ δ^w_7(A) )
           = ⇑( (∃, K | − | −) ∪ (K, K | − | −) ∪ (x1, K | K+1 ≤ x1 ≤ N | x1)
                ∪ (∀, x2 | K+1 ≤ x2 ≤ N | x2) ∪ (K, x2 | K+1 ≤ x2 ≤ N | x2)
                ∪ (x1, x2 | K+1 ≤ x1 ≤ N, K+1 ≤ x2 ≤ N | x1 > x2) )
           = (∀, x2 | 1 ≤ x2 ≤ N | x2)
             ∪ (x1, x2 | 1 ≤ x1 ≤ N, 1 ≤ x2 ≤ N−1, 3 ≤ x1 + x2 ≤ 2N−1, 1−N ≤ x1 − x2 ≤ N−1 | x1 > x2)

The sections represented by δ^w_1(A) are shown in Fig. 5. The dotted boundaries are the boundaries of the complete array A. The dashed lines represent redundant simple boundaries.

FIG. 5. Data Access Descriptors for the stores to A within LUDCOMP.

9. DEALING WITH NONUNIT STRIDES

In all of the examples thus far, the strides of loop induction variables were ignored when the Data Access Descriptor was computed. We implicitly made the assumption that every point of the simple section described by a Data Access Descriptor is actually accessed. This assumption, although conservative in most cases, can give incorrect results for Data Access Descriptors that are translated from a subroutine to its call site. For example, consider the subroutine DAXPY in the LUDCOMP program. One of the parameters to DAXPY is INCY, which is used as a stride for the auxiliary loop induction variable J on line 24 of the program. When INCY = 1, the access to the M elements of DY within DAXPY translates to an access of N − K consecutive elements of the array A at line 10, starting from the A(K+1, K)th element. Assuming column-major storage for Fortran, this is an access of N − K elements along the Kth column of the array A. Suppose the actual parameter bound to the formal INCY at the call site on line 10 is N instead of 1. Within DAXPY, M elements of the array DY are accessed at a distance of INCY apart. This translates to an access of N − K elements of the array A, starting at the A(K+1, K)th element, each separated by N elements from the previous access. This is an access of N − K elements along the (K+1)th row of A. If the stride INCY is ignored and assumed to be 1, the translation of the Data Access Descriptors from DAXPY to its call site at line 10 of LUDCOMP cannot detect this rowwise access. Algorithm TRANSLATE_TO_CALL will translate the access to DY in DAXPY as an access of N − K elements along the Kth column of A, which is incorrect. Thus, the strides of loop induction variables cannot always be ignored. The Data Access Descriptor representation can be extended to keep information about strides.

When the stride along a particular dimension k can be determined, we will denote it as a superscript on x_k in the traversal order field. For example, the Data Access Descriptor for the store into DY in subroutine DAXPY is δ^w_23(DY) = (x1 | 1 ≤ x1 ≤ M | x1^INCY), indicating that within DAXPY, the x1th dimension of variable DY is traversed with a stride of INCY. Absence of a superscript denotes the default unit stride. Algorithm TRANSLATE_TO_CALL can be modified to take the stride into account when the translated Data Access Descriptor is computed.

10. INTERSECTION AND UNION OF DATA ACCESS DESCRIPTORS

Intersection of Data Access Descriptors is used as a test for data dependence. Procedure TEST_DEPENDENCE is a definition of Data Access Descriptor intersection:

Procedure TEST_DEPENDENCE (δ1(A), δ2(A))
/* A is of dimension n, δ1(A) = (θ1 | S1 | τ1) and δ2(A) = (θ2 | S2 | τ2).
   Compute δ(A) = (θ | S | τ) = δ1(A) ∩ δ2(A). Return φ if the intersection
   is empty (i.e., δ1(A) and δ2(A) are independent). Else return the data
   overlap S1 ∩ S2. */
for all θ1(i), 1 ≤ i ≤ n that are invariant do
    add the boundary pair θ1(i) ≤ x_i ≤ θ1(i) to S1;
endfor
for all θ2(i), 1 ≤ i ≤ n that are invariant do
    add the boundary pair θ2(i) ≤ x_i ≤ θ2(i) to S2;
endfor
S = S1 ∩ S2;   (see algorithm INTERSECT)
return S

The procedure takes two Data Access Descriptors for a variable and returns their "intersection." If there is no data overlap between the two Data Access Descriptors, the procedure returns φ, indicating that the two Data Access Descriptors are independent. When nonunit strides occur in the traversal order fields of the two Data Access Descriptors, procedure TEST_DEPENDENCE should be modified to check if actual overlap occurs. The following example, due to Paul Havlak, illustrates why this is necessary [18]:

      DO I = 1, 100, 2
        A(I) = ...
      ENDDO
      DO J = 2, 100, 2
        A(J) = ...
      ENDDO

The Data Access Descriptor for A in the region described by the I loop is δ_I(A) = (x1 | 1 ≤ x1 ≤ 100 | x1^2). The Data Access Descriptor for A in the region described by the J loop is δ_J(A) = (x1 | 2 ≤ x1 ≤ 100 | x1^2). Procedure TEST_DEPENDENCE will find that the ranges 1 ≤ x1 ≤ 100 and 2 ≤ x1 ≤ 100 overlap, so that the intersection of the simple section fields of δ_I(A) and δ_J(A) is nonempty. Thus, it would conclude that a dependence exists between the I and the J loop on variable A.
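A stride-aware version of the overlap test, sketched here with plain integer ranges and strides standing in for simple sections and traversal-order superscripts, reports the two loops independent:

```python
# Sketch: conservative overlap test for two strided access sequences
# {l1, l1+s1, ...} up to u1 and {l2, l2+s2, ...} up to u2.

from math import gcd

def strided_overlap(l1, u1, s1, l2, u2, s2):
    """True if the two strided ranges can touch a common element."""
    if u1 < l2 or u2 < l1:
        return False                     # ranges do not even intersect
    # gcd criterion: a common element requires gcd(s1, s2) to divide
    # the difference between the starting points.
    return abs(l1 - l2) % gcd(s1, s2) == 0

# Havlak's example: I = 1,100,2 touches odd elements, J = 2,100,2 even
# ones; the plain range test misses their disjointness.
print(strided_overlap(1, 100, 2, 2, 100, 2))   # False
print(strided_overlap(1, 100, 2, 3, 100, 2))   # True
```

The check is still conservative (it ignores where within the ranges the collision would fall), but it is exact for the alternating-elements pattern above.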
This conclusion is overly conservative, because there is actually no dependence between the I and the J loops. Both loops access A with strides of 2, and since the accesses do not start at the same element of A, there is no data overlap at all. Loops I and J access alternate elements of A. This can be detected by making a simple modification to procedure TEST_DEPENDENCE. If the stride on some dimension x_i in δ1 is a multiple of the stride on the same dimension x_i in δ2, then there is a nonempty data overlap (i.e., a data dependence) between δ1 and δ2 only if the greatest common divisor of the two strides divides the difference between the lower bounds of x_i in the simple section fields of δ1 and δ2. More formally, let s1 be the stride along dimension x_i in δ1, and let s2 be the stride along x_i in δ2. Let l1 and l2 be the lower bounds of the variable x_i in the simple section fields of δ1 and δ2, respectively. Then, there is a dependence between δ1 and δ2 if gcd(s1, s2) | (|l1 − l2|). Otherwise they are independent.

If there are multiple references to the same array variable within a region, a Data Access Descriptor is computed for each reference and unioned together to give a single Data Access Descriptor that summarizes all accesses to the array within that region. Procedure DAD_UNION returns the union of two Data Access Descriptors for the same array variable:

Procedure DAD_UNION (δ1(A), δ2(A))
/* A is of dimension n, δ1(A) = (θ1 | S1 | τ1) and δ2(A) = (θ2 | S2 | τ2).
   Compute δ(A) = δ1(A) ∪ δ2(A). */
    /* Computation of S */
 3  S = Simp(S1 ∪ S2);   (see algorithm UNION)
    /* Computation of θ */
 4  for i = 1 to n do
 5      if θ1(i) = θ2(i)
 6          then θ(i) = θ1(i);
 7      else if both θ1(i) and θ2(i) are invariant
 8          then θ(i) = "∀";
 9      else if only one of them is invariant then
10          suppose θ1(i) is invariant and θ2(i) is variant;
11          there must be a simple boundary pair α ≤ x_i ≤ β ∈ S2;
12          if θ1(i) lies outside α and β then
13              extend the range of the boundary pair x_i in S to include θ1(i);
14              recompute the lower and upper bounds of all simple boundary pairs in S that involve x_i;
            endif
15      else if it cannot be determined whether or not ...
        endelse
    endfor
    /* Computation of τ */
17  τ = MATCH(τ1, τ2);   (see Section 5.2)
18  return δ(A)

The time complexity of Data Access Descriptor union and intersection is determined by the complexity of simple section union and intersection, which is constant time for fixed dimensions.

11. THE DATA WINDOW REVISITED

The previous section described the Data Access Descriptor representation, and how it can be used to summarize the portion of an array accessed within a region. In Section 4, we introduced the concept of a Data Window. A Data Window is a set of Data Access Descriptors: Given a region R of a program, its Data Window, denoted by Δ(R), is given by

    Δ(R) = { δ_R(X) | X is referenced in R and is global to R },
where δ_R(X) is the Data Access Descriptor for the variable X in R. The above definition indicates that variables that are "local" to R do not contribute to its Data Window, since they are not visible outside the region. δ_R(X), for each variable X, consists of two components, δ^w_R(X) and δ^r_R(X). Thus, the Data Window also has two components, the write component Δ^w(R) = { δ^w_R(X) } and the read component Δ^r(R) = { δ^r_R(X) }.

11.1. Intersection and Union of Data Windows

A data dependence exists between two regions if the intersection of their Data Windows is nonempty. Thus, intersection of Data Windows can be used to check for dependence between regions. The intersection of two Data Windows is defined as follows: Given two regions R1 and R2 and their corresponding Data Windows Δ(R1) and Δ(R2), the intersection of the Data Windows is given by

    Δ(R1R2) = Δ(R1) ∩ Δ(R2),

where each component of Δ(R1) is intersected with both the read and the write components of Δ(R2). Δ(R1R2) is called the data overlap between R1 and R2. If the data overlap Δ(R1R2) is empty, it means there is no data dependence between regions R1 and R2. Since each Data Window consists of two components, and intersection between dissimilar components is allowed, the data overlap Δ(R1R2) will consist of four components:

• Δ^wr(R1R2) = Δ^w(R1) ∩ Δ^r(R2), called the write-read dependence.
• Δ^rw(R1R2) = Δ^r(R1) ∩ Δ^w(R2), called the read-write dependence.
• Δ^ww(R1R2) = Δ^w(R1) ∩ Δ^w(R2), called the write-write dependence.
• Δ^rr(R1R2) = Δ^r(R1) ∩ Δ^r(R2), called the read-read dependence.
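In set terms, the four components can be sketched as follows (finite sets of array element coordinates stand in for Data Access Descriptors; the names are illustrative):

```python
# Sketch of Data Window intersection: each region contributes a write
# set and a read set; the overlap has four components.

def window_overlap(w1, r1, w2, r2):
    """Return the write-read, read-write, write-write, read-read overlaps."""
    return {"wr": w1 & r2, "rw": r1 & w2, "ww": w1 & w2, "rr": r1 & r2}

ov = window_overlap(w1={(1, 1), (2, 1)}, r1={(3, 1)},
                    w2={(2, 1)}, r2={(1, 1)})
print(ov["wr"], ov["ww"])   # {(1, 1)} {(2, 1)}
```

A dependence requiring synchronization exists when any of the wr, rw, or ww components is nonempty; the rr component records sharing only.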
When both regions consist of single statements, the above dependences are called true (or flow), anti, output, and input dependences, respectively. If the two regions are two different iterations of the same loop, a dependence between them is also referred to as a loop-carried dependence. A region can be constructed by merging together a sequence of smaller regions. The Data Window of this larger region will be the union of the Data Windows of its component regions. Given two regions R1 and R2 and their corresponding Data Windows Δ(R1) and Δ(R2), the union of the Data Windows is a Data Window

    Δ(R) = Δ(R1) ∪ Δ(R2) = { δ_R1(X) ∪ δ_R2(X) | X ∈ R1 and R2 },

where each component of Δ(R1) is unioned with its corresponding component of Δ(R2). Δ(R) is the Data Window of the region R obtained by merging the regions R1 and R2. Δ(R) consists of two components, the write and read components.

12. USING DATA ACCESS DESCRIPTORS
We have shown how Data Access Descriptors can be used as an efficient mechanism for summarizing data access information in programs. This internal information can be used to implement several utilities in a parallel programming tool. Some examples are listed below:
• Dependence testing: Data dependence testing can be done using algorithm TEST_DEPENDENCE. The algorithm does not distinguish between a variable reference, a single statement, a set of statements, a loop nest, or a subroutine. Thus, the same algorithm can be used to test for data dependence between any two regions, regardless of what the two regions actually look like. As a particular case, when one of the regions is a subroutine call statement in a loop, we can do interprocedural dependence testing using this algorithm. It can be shown that a non-null intersection between any two simple sections S1 and S2 of the same dimension must contain at least one point with integer coordinates [3]. This important property of simple sections allows us to treat a nonempty simple section intersection as a necessary and sufficient test for the existence of a dependence. In an actual implementation it may not be a good idea to union the Data Access Descriptors for all references to the same variable into a single descriptor, because the imprecision introduced during successive unions has a cumulative effect. This may lower the accuracy of dependence testing between two regions. Maintaining the Data Access Descriptor for every reference to a variable in a region will improve the accuracy of dependence testing, at the expense of increased storage requirements. An effective compromise is to maintain only those Data Access Descriptors for an array variable that describe accesses to distinct, nonoverlapping sections, in the form of a fixed length list. Thus, if two references to an array in a region access the same section, a single Data Access Descriptor is used to represent them both. When a list exceeds the fixed length, all the Data Access Descriptors in it can be unioned together into a single descriptor, effectively reducing the length of the list to one. Further details about this can be found elsewhere [3].

• Facility for programmer assertions: Often, it is necessary to make conservative assumptions when complex subscript expressions or loop induction variable bounds are encountered. Earlier, we discussed how to handle symbolic variables in the Data Access Descriptor representation. When an invariant symbolic expression is encountered at a particular subscript position, a "∃" symbol is used to denote that the constant value of this subscript position is unknown. When a Data Access Descriptor containing a "∃" is translated outward to a containing region, the subscript with the "∃" is assumed to vary between the declared lower and upper bounds for that subscript position of the array, and this is represented by a "∀" symbol.
This is illustrated in the Data Access Descriptor δ^w_7(A) in the LUDCOMP program (Section 8). The reference to A(IPVT(K), J) on line 9 of LUDCOMP has the Data Access Descriptor (∃, J | − | −), indicating that we have no information about the value of IPVT(K). When this is translated outward to the loop at line 7, the resulting Data Access Descriptor is (∀, x2 | K+1 ≤ x2 ≤ N | x2). We could assume that the first subscript position can have values anywhere between the declared lower and upper bounds of A. This is done when translating the Data Access Descriptor outward again to the loop at line 1. However, if we had the information that IPVT is actually a permutation of the values of K, the translation would be more precise. For example, if the programmer gave the information that the first subscript position can be considered "∀K", then the outward translation of this Data Access Descriptor would give a triangular section instead of the rectangular section computed as a result of conservative translation. Thus, the accuracy of dependence testing can be improved in some cases by programmer interaction.
• Data access visualization: The Data Access Descriptor lends itself to a straightforward pictorial representation of data access patterns in a region. This could provide useful information to a programmer.

• Parallel debugging: In debugging parallel programs, errors of nondeterminacy that arise from inadvertent race conditions are of primary interest. Many static analysis methods that can help prune the sets of statements that must be monitored for possible race conditions by run-time trace analysis have been suggested. One such method, proposed by Balasundaram and Kennedy [5], shows how Data Access Descriptors can be used for this purpose.

• Deriving data domain partitions: A joint project currently underway at Caltech and Rice University is investigating methods for automatically determining data partitions in programs that will be executed on distributed memory parallel computers. The method analyzes each loop, looking at data dependences to determine a set of feasible partitions. In order to select a partitioning that is efficient, we must also consider the global effects that could arise in choosing a particular local partitioning for the loop being analyzed. The Data Access Descriptors of the other loops in the program are used for this purpose. This is more efficient than examining the traditional data dependence graph for the whole program each time a loop is analyzed.
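The fixed-length list compromise described under dependence testing can be sketched with 1-D intervals standing in for Data Access Descriptors (MAX_LEN and the merge policy are illustrative choices, not the paper's):

```python
# Sketch: keep per-reference summaries as long as they are disjoint and
# the list is short; merge overlapping ones; on overflow, collapse the
# whole list into a single enclosing summary.

MAX_LEN = 3

def add_summary(summaries, new):
    """Add interval `new` to the bounded summary list, in place."""
    merged = False
    for i, (lo, hi) in enumerate(summaries):
        if not (new[1] < lo or hi < new[0]):       # overlaps this entry
            summaries[i] = (min(lo, new[0]), max(hi, new[1]))
            merged = True
            break
    if not merged:
        summaries.append(new)
    if len(summaries) > MAX_LEN:                   # overflow: union all
        summaries[:] = [(min(l for l, _ in summaries),
                         max(h for _, h in summaries))]
    return summaries

s = []
for ref in [(1, 10), (20, 30), (5, 12), (50, 60), (80, 90)]:
    add_summary(s, ref)
print(s)   # [(1, 90)]
```

The trade-off is visible in the final collapse: precision is lost only when the reference count exceeds the fixed length, exactly the cumulative-imprecision concern the text raises about eager unioning.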
ACKNOWLEDGMENTS
l
l
I am thankful to Dr. Ken Kennedy of Rice University, who, as my thesis advisor, provided invaluable guidance in this research. I am also grateful to Andy Boyd, Paul Havlak, Uhich Kremer, and Joe Warren of Rice University, and to Douglas Hahn of the University of Wisconsin-Madison for their comments and suggestionson this work.
REFERENCES 1. Allen, F., Burke, M., Charles, P., Cytron, R., and Ferrante, J. An overview of the PTRAN analysis system for multiprocessing. Proc. International Conference on Supercomputing, Athens, Greece, 1987. 2. Allen, J. R., and Kennedy, K. PFC A program to convert Fortran to
3.
4.
5.
6.
13. CONCLUSION
parallel form. Tech. Rep. MASC-TR 82-6, Department of Mathematical Sciences,Rice University, Mar. 1982. Balasundaram, V. Interactive parallelization of numerical scientific programs. Ph.D. thesis, Department of Computer Science, Rice University, May 1989. Available as Tech. Rep. COMP TR89-95, Department of Computer Science, P.O. Box 1892, Houston, TX 7725 I, or request by email to [email protected]. Balasundaram, V., Baumgartner, D., Callahan, D., Kennedy, K., and Subhlok, J. PTOOL: A system for static analysis of parallelism in programs. Tech. Rep. TR88-7 1, Department of Computer Science, Rice University, June 1988. Balasundaram, V., and Kennedy, K. Compile-time detection of race conditions in a parallel program. Proc. International Conference on Supercomputing, Crete, Greece, June 1989. Balasundaram, V., and Kennedy, K. A technique for summarizing data accessand its use in parallelism enhancing transformations. Proc. A C M SIGPLAN 89 Conference on Programming Language Design and Implementation, Portland, Oregon, June 1989.
The Data Access Descriptor is a powerful mechanism for summarizing useful internal information. It provides a compact representation of data access and traversal order information. Instead of keeping an entire data dependence graph, a parallel programming tool can compute data access information for each loop or subroutine in the program and store it in the form of Data Windows. When necessary,traditional data dependence information for a selectedregion can be either derived from the Data Windows (by using algorithm TEST-DEPENDENCE) or computed using other techniques. The treatment of data dependenceas an intersection of data access summaries allows a uniform treatment of traditional data dependence information and data accessinformation. This simplifies a lot of the internal analyses used in parallel programming tools because the same algorithms and techniques that deal with data dependence information can also deal with data access information. Data access summaries also provide higher bandwidth information and are more conceptual than traditional data dependences.As the LUDCOMP example illustrates, data access summaries can also be used to provide a pictorial representation of the data accesspatterns in a program. The complete potential of this representation as a basis for the design of parallel programming tools, however, remains to be seen.
I. Balasundaram, V., Kennedy, K., Kremer, U., McKinley, K., and Sub-
8.
9. 10.
II.
12.
13.
14.
15.
hlok, J. The ParaScope editor: An interactive parallel programming tool. Supercomputing 89, Reno, Nevada, Nov. 1989. Banejee, U. Speedup of ordinary programs. Ph.D. thesis, University of Illinois at Urbana-Champaign, Oct. 1979. Also available as Tech. Rep. 79-989, Department of Computer Science. Banejee, U. Dependence Analysis for Supercompufing. Kluwer Academic Pubs., Norwell, MA, 1988. Brewer, O., Dongarra, J. J., and Sorensen, D. C. Tools to aid the analysis of memory access patterns for FORTRAN programs. Tech. Rep. 120, Argonne National Laboratory, 1988. Burke, M., and Cytron, R. Interprocedural dependence analysis and parallelization. In Proc. SIGPLAN ‘86 Symposium on Compiler Construction, June 1986, pp. 162-115. Callahan, D., and Kennedy, K. Analysis ofinterprocedural side effects in a parallel programming environment. In Proc. First International Conference on Supercomputing, Athens, Greece, 1987. Available as Tech. Rep. TR87-57, Department ofcomputer Science, Rice University, July 1987. Callahan, D., and Kennedy, K. Analysis of interprocedural side effects in a parallel programming environment. .I. Parallel Distrib. Comput. 5,5 (Oct. 1988), 517-550. Dongarra, J. J., Sorensen, D. C. A look at the evolution of mathematical software for dense matrix problems over the past fifteen years. International Seminar on Scientthc Supercomputers, Paris, Feb. 1987. Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. LINPACK Users Guide. Society for Industrial and Applied Mathematics, Philadelphia, 1979.
16. Guarna, V. A., Gannon, D., Gaur, Y., and Jablonowski, D. FAUST: An environment for programming parallel scientific applications. Supercomputing '88, Nov. 1988.
17. Hahn, D. Private communication, 1989.
18. Havlak, P. Private communication, 1989.
19. Heller, D. A survey of parallel algorithms in numerical linear algebra. SIAM Rev. 20, 4 (Oct. 1978), 740-777.
20. Henderson, L. A. The usefulness of dependency-analysis tools in parallel programming: Experiences using PTOOL. Tech. Rep. LA-UR-87-3135, Los Alamos National Labs., Sept. 1987.
21. Kuck, D. J., Kuhn, R. H., Leasure, B., and Wolfe, M. The structure of an advanced vectorizer for pipelined processors. Proc. 8th POPL, Jan. 1981, pp. 207-218.
22. Li, Z., and Yew, P.-C. Efficient interprocedural analysis for program parallelization and restructuring. ACM SIGPLAN PPEALS Parallel Programming: Experience with Applications, Languages and Systems, 1988.
23. Li, Z., and Yew, P.-C. Interprocedural analysis and program restructuring for parallel programs. Tech. Rep. CSRD 720, University of Illinois at Urbana-Champaign, Jan. 1988.
24. Peir, J.-K., Gajski, D. D., and Wu, M.-Y. Programming environments for multiprocessors. International Seminar on Scientific Supercomputing, Paris, Feb. 1987.
Received June 1, 1989; revised January 18, 1990
25. Rice, J. R. Matrix Computations and Mathematical Software. McGraw-Hill, New York, 1981.
26. Rice, J. R. Numerical Methods, Software and Analysis: IMSL Reference Edition. McGraw-Hill, New York, 1983.
27. Triolet, R. Interprocedural analysis based program restructuring with Parafrase. Tech. Rep., University of Illinois, Center for Supercomputing Research and Development, 1986.
28. Triolet, R. Programming environments for parallel machines. International Seminar on Scientific Supercomputing, Paris, Feb. 1987.
29. Triolet, R., Irigoin, F., and Feautrier, P. Direct parallelization of call statements. In Proc. SIGPLAN '86 Symposium on Compiler Construction, June 1986, pp. 176-185.
VASANTH BALASUNDARAM graduated from Rice University in May 1989 with a Ph.D. in computer science. He is currently a research fellow in the Caltech Concurrent Computation Program at the California Institute of Technology, Pasadena, where he is the recipient of a Prize Fellowship in Concurrent Computing. His research interests include the design of software tools and compilers for shared and distributed memory parallel computers and parallel debugging.