An efficient algorithm for communication set generation of data parallel programs with block-cyclic distribution

Gwan-Hwan Hwang
Department of Information and Computer Education, National Taiwan Normal University, Taipei, Taiwan
E-mail address: [email protected] (G.-H. Hwang)

Parallel Computing 30 (2004) 473-501. doi:10.1016/j.parco.2004.02.001
Received 7 June 2002; received in revised form 13 February 2004; accepted 14 February 2004; available online 12 April 2004.

Abstract

Data parallel programming languages, such as High Performance Fortran (HPF), are widely regarded as a promising means of writing portable programs for distributed-memory machines. In this paper, we present a new algorithm for computing the communication sets in array section movements with block-cyclic (CYCLIC(k) in HPF) distribution. Our framework handles multi-level alignments, multi-dimensional arrays, array intrinsic functions, affine indices, and axis exchanges in the array subscript. Unlike previous solutions, the new algorithm does not rely on a linear Diophantine equation solver to calculate the communication sets. We use formal proofs and experimental results to show that it is more efficient than previous solutions to the same problem. Another important contribution of this paper is a proof that the compiler can compute the communication sets of block-cyclic distributions efficiently as long as the block sizes of the arrays are identical or the lowest common multiple (LCM) of the block sizes is not a huge integer. We demonstrate this with thorough complexity analyses and extensive experimental results.
(c) 2004 Elsevier B.V. All rights reserved.

Keywords: HPF compiler; Distributed memory machines; Parallelizing compiler; Data parallel programs; Block-cyclic distributions

1. Introduction

An increasing number of programming languages, such as Fortran D [9], Vienna Fortran [4], High Performance Fortran (HPF) [20], and parallel C++ [2,23], are


providing distributed arrays to support a global name space on distributed memory architectures. In these languages (such as HPF), programmers can specify the distribution of the distributed arrays among the processors; the compiler then takes the distribution information (how data are distributed among processors) and generates the communication code [3,15,19] that lets programs emulate a shared memory space on distributed memory architectures. The distribution patterns specified by programmers in HPF include the "Block" pattern, in which each processor gets a consecutive block of elements; the "Cyclic" distribution, which is a round-robin rotation; the "Collapse" distribution, which means no distribution; and the "Block-Cyclic" distribution, also known as the CYCLIC(k) distribution, in which contiguous blocks of size k are assigned to the processors in a round-robin fashion. In this paper, we call k the block size of the block-cyclic distribution. The block-cyclic distribution is very important for expressing the "block-scatter" distribution in the design of scalable libraries for linear algebra computation [7,11,16]. Thus, it has prompted many research efforts [5,8,10,17,21,24] aimed at efficient schemes for supporting it. The general case of the block-cyclic distribution involves at least multi-level alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation), affine indices, and axis exchanges in the array subscript. We show a typical example of such cases in Code Fragment 1 below.

HPF Code Fragment 1
 1  !HPF$ ALIGN A(i_1, i_2, ..., i_n) WITH T1(u_1(i_1,...,i_n), ..., u_n(i_1,...,i_n))
 2  !HPF$ ALIGN B(i_1, i_2, ..., i_n) WITH T2(v_1(i_1,...,i_n), ..., v_n(i_1,...,i_n))
 3  !HPF$ ALIGN T1(i_1, i_2, ..., i_n) WITH T3(μ_1(i_1,...,i_n), ..., μ_n(i_1,...,i_n))
 4  !HPF$ ALIGN T2(i_1, i_2, ..., i_n) WITH T4(ν_1(i_1,...,i_n), ..., ν_n(i_1,...,i_n))
 5  !HPF$ PROCESSORS PROCS1(#P_{A,1}, #P_{A,2}, ..., #P_{A,n})
 6  !HPF$ PROCESSORS PROCS2(#P_{B,1}, #P_{B,2}, ..., #P_{B,n})
 7  !HPF$ DISTRIBUTE T3(CYCLIC(b_1), ..., CYCLIC(b_n)) ONTO PROCS1
 8  !HPF$ DISTRIBUTE T4(CYCLIC(b'_1), ..., CYCLIC(b'_n)) ONTO PROCS2
    ...
 9  FORALL (i_1 = L_1:U_1:S_1, i_2 = L_2:U_2:S_2, ..., i_n = L_n:U_n:S_n)
10    A(f_1(i_1,...,i_n), ..., f_n(i_1,...,i_n)) = F(B(g_1(i_1,...,i_n), ..., g_n(i_1,...,i_n)))
11  END FORALL

In the above code, arrays A and B are first aligned to template arrays T1 and T2, respectively. T1 and T2 are then aligned to template arrays T3 and T4, respectively. Next, T3 and T4 are distributed onto the processor grids PROCS1 and PROCS2, respectively. The b_1, b_2, ..., b_n and b'_1, b'_2, ..., b'_n specify the block sizes along each dimension of T3 and T4, respectively. In addition, #P_{A,i} and #P_{B,j} denote the numbers of processors onto which the ith and jth dimensions of T3 and T4 are distributed, respectively. Therefore, it is a program with multi-level alignments. The functions f_i and g_i are all affine functions used in the array


subscript, and F is an array function or array intrinsic function (such as the Transpose operation [1]).

In [12,13], a software framework using expression rewritings and a calculus called the CSD calculus was presented, aimed at the implementation of HPF programs with block-cyclic distributions, multi-level alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation), affine indices, and axis exchanges in the array subscript. The expression-rewriting framework is based on a new representative form, the CSD, which represents the set of accessed elements in HPF programs with block-cyclic distribution. Previous research used the regular section descriptor (RSD) to describe the set of accessed elements in HPF programs with "Block" or "Cyclic" distribution patterns, but the RSD is not sufficient to represent the accessed elements under "Block-Cyclic" distributions. Later, researchers [25] tried to use the BSD to represent the accessed elements of "Block-Cyclic" distributions. Unfortunately, the BSD is not closed under intersection or index normalization, and is therefore not appropriate for representing the communication sets of HPF programs with "Block-Cyclic" distributions. The common-stride descriptor (CSD), on the other hand, is the representative form we propose for describing the regularity of the access patterns of HPF programs with "Block-Cyclic" distribution. One of the most important properties of the CSD calculus is that the CSD is closed under intersection and index normalization.

The foundation of the CSD calculus is the definitions of the RSD, BSD, and CSD. In essence, the RSD, BSD, and CSD are section descriptors that represent sets of integers. An RSD uses a triplet notation, (l:u:s), where l, u, and s are referred to as the lower bound, upper bound, and stride, respectively [1]. Both the RSD and the BSD are special cases of the CSD. A CSD is a union of multiple RSDs that share the same stride; it can be used to represent the communication sets of data parallel programs with block-cyclic distribution, whereas a BSD can only represent the data distribution of the block-cyclic distribution. The operations of the CSD calculus consist of the intersection and index normalization of RSDs, BSDs, and CSDs. In [12,13], all the operations of the CSD calculus are derived from the intersection of two RSDs, i.e., (l1:u1:s1) ∩ (l2:u2:s2), which is obtained by solving a linear Diophantine equation. The normalization problem of an RSD, i.e., deriving i from a·i + b ∈ (l:u:s), can be reduced to the intersection problem, and the intersection and normalization problems of BSDs and CSDs are in turn derived from the subroutines for the intersection and normalization of RSDs. It follows that the implementation of all the operations of the CSD calculus in [12,13] depends on solving linear Diophantine equations. In this research, we propose a new algorithm that does not rely on a linear Diophantine equation solver to perform the intersection and normalization of RSDs, BSDs, and CSDs. Given two CSDs to be intersected, we exploit the basic property of a CSD, namely that all its RSDs have a common stride: we first enumerate a limited number of integers of the two CSDs according to their common strides, and then intersect the two enumerated integer sets to obtain the result.


Given two CSDs, C1 and C2, assume that S1 and S2 are the common strides of C1 and C2, respectively, and let |C1| and |C2| be the numbers of RSDs in C1 and C2. The running time for performing C1 ∩ C2 by solving linear Diophantine equations [12,13] is O(Min(log S1, log S2) + |C1|·|C2|). (In this paper, Min() and Max() denote the minimal and maximal value, respectively.) The new algorithm proposed in this paper, however, has complexity O(Min(log S1, log S2) + Max(|C1|·LCM(S1, S2)/S1, |C2|·LCM(S1, S2)/S2)), and we show that both LCM(S1, S2)/S1 and LCM(S1, S2)/S2 tend to be small integers, or even 1. Both formal proofs and experimental results show that our new algorithm greatly reduces the execution time; the maximal speedup obtained in the experiments is 38. Another important contribution of this paper is that we prove that the compiler is able to compute the communication sets of block-cyclic distributions efficiently as long as the block sizes of the arrays (b_i and b'_i, i = 1 to n, in Code Fragment 1) are identical or the lowest common multiple (LCM) of the block sizes is not a huge integer. We demonstrate this by thorough complexity analyses and extensive experimental results.

The remainder of the paper is organized as follows. Section 2 describes the CSD calculus and the expression-rewriting framework for generating communication sets for HPF programs with block-cyclic distributions. Section 3 presents our new algorithms for the CSD calculus, together with the related discussion and complexity proofs. Section 4 gives the experimental results. Related work is discussed in Section 5. Finally, Section 6 concludes the paper.

2. CSD calculus and communication set generation of HPF

In this section, we first present the definition of the CSD calculus and then show how to apply it to the communication set generation of HPF programs. The CSD calculus was first proposed in [12,13].

2.1. CSD calculus

In this section, we give definitions of the RSD, BSD, and CSD representations. The RSD describes the set of accessed elements in HPF programs with "Block" or "Cyclic" distribution patterns, but is not sufficient to represent the accessed elements under "Block-Cyclic" distributions. Researchers tried to use the BSD to represent the accessed elements of "Block-Cyclic" distributions; unfortunately, the BSD is not closed under intersection or index normalization, and is therefore not appropriate for calculating the communication sets of HPF programs with "Block-Cyclic" distributions. The CSD, on the other hand, is the new representative form we propose for describing the regularity of the access patterns of HPF programs with "Block-Cyclic" distribution. We demonstrate below a calculus in which the CSD is closed under intersection and index normalization. The RSD is defined first.


Definition 1. A regular section descriptor (RSD) represents a sequence of integer indices. It uses a triplet notation and is defined as follows:

(l : u : s) = { k | l ≤ k ≤ u and k = l + s·I, where I is a non-negative integer },

where l, u, and s are referred to as the lower bound, upper bound, and stride, respectively.

The RSD is defined as part of the language constructs of Fortran 90 [1]; one of its usages is to specify array section movements in Fortran 90. Researchers proposed a quadruplet notation (the BSD), similar to the RSD, to represent the block-cyclic distribution of HPF [10,25]. The definition of the BSD is given below.

Definition 2. A block-cyclic section descriptor (BSD) represents a set of integer indices. It uses a quadruplet notation and is defined as follows:

(l : u : b : c) = ∪_{k=0}^{b−1} (l + k : u : c),

where l, u, b, and c are the lower bound, upper bound, block width, and cyclic length, respectively.

An example of a BSD is given below.

Example 1. As shown in Fig. 1, array B has 36 elements and is distributed among the processors by a CYCLIC(3) distribution. We have B(1:36:3:9), B(4:36:3:9), and B(7:36:3:9) distributed onto P1, P2, and P3, respectively. As an example, the data section owned by P1 is (1:36:3:9) = (1:36:9) ∪ (2:36:9) ∪ (3:36:9) = {1,10,19,28} ∪ {2,11,20,29} ∪ {3,12,21,30} = {1,2,3,10,11,12,19,20,21,28,29,30}.

Unfortunately, the BSD is not closed under intersection. Therefore, we propose a new representative form, the CSD, defined below.

Definition 3. A common-stride section descriptor (CSD) is a set of indices. It is a union of N (N a positive integer) RSDs which all have the same stride, where the difference between the lower bounds of each pair of RSDs is less than the stride. A CSD is represented by a triplet notation as follows:

((l_1, l_2, ..., l_N) : (u_1, u_2, ..., u_N) : S) = ∪_{k=1}^{N} (l_k : u_k : S),

where (1) ∀i and j, 1 ≤ i ≤ N and 1 ≤ j ≤ N, i ≠ j ⇒ l_i ≠ l_j, and (2) |l_i − l_j| < S, 1 ≤ i ≤ N and 1 ≤ j ≤ N.

Fig. 1. Block-cyclic distribution of array B. [Elements 1-36 are distributed CYCLIC(3) over P1, P2, P3: P1 owns 1-3, 10-12, 19-21, 28-30; P2 owns 4-6, 13-15, 22-24, 31-33; P3 owns 7-9, 16-18, 25-27, 34-36.]
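The ownership pattern of Fig. 1 can be computed directly. The following C sketch (an illustration added here, not code from the paper) returns the processor that owns element i of a CYCLIC(b) distribution over P processors, using 1-based element indices:

#include <stdio.h>

/* 0-based rank of the processor owning element i (1-based index)
 * under a CYCLIC(b) distribution onto P processors. */
static int cyclic_owner(long i, long b, int P)
{
    return (int)(((i - 1) / b) % P);
}

int main(void)
{
    /* Reproduces Fig. 1: 36 elements, CYCLIC(3), 3 processors.
     * Elements 1-3, 10-12, 19-21, 28-30 map to P1 (rank 0), and so on. */
    for (long i = 1; i <= 36; i++)
        printf("element %2ld -> P%d\n", i, cyclic_owner(i, 3, 3) + 1);
    return 0;
}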


Definition 4. If C is a CSD, then |C| represents the number of RSDs in C. That is, if C = ((l_1, l_2, ..., l_N) : (u_1, u_2, ..., u_N) : S), then |C| = N.

Example 2. The CSD ((2,3):(9,15):3) = (2:9:3) ∪ (3:15:3) = {2,5,8} ∪ {3,6,9,12,15} = {2,5,8,3,6,9,12,15}. The descriptor ((1,11,12):(36,36,36):9) does not conform to the definition of a CSD because |1 − 12| > 9.

The properties of the normalization and intersection operations of the three kinds of section descriptors are summarized in Table 1; the CSD is closed under both operations. The normalization operation derives the set of elements representing i when given the set of elements representing a·i + b, where a and b are integer constants. We present the properties of the CSD below. Theorem 1 first describes the relationships between the RSD, BSD, and CSD.

Theorem 1. A set represented by an RSD can be represented by a BSD. Similarly, a set represented by a BSD can be represented by a CSD.

Proof. See [12,13]. □

Next, Theorems 2 and 3 describe the intersection properties of the CSD.

Theorem 2. If C1 and C2 are represented by CSDs, then C1 ∩ C2 can be represented by a CSD.

Proof. See [12,13]. □

Assume C1 = ((l^1_1, l^1_2, ..., l^1_n) : (u^1_1, u^1_2, ..., u^1_n) : S1) and C2 = ((l^2_1, l^2_2, ..., l^2_m) : (u^2_1, u^2_2, ..., u^2_m) : S2). We prove that C1 ∩ C2 = ((L_{1,1}, ..., L_{1,m}, L_{2,1}, ..., L_{2,m}, ..., L_{n,m}) : (U_{1,1}, ..., U_{1,m}, U_{2,1}, ..., U_{2,m}, ..., U_{n,m}) : LCM(S1, S2)), where (L_{i,j} : U_{i,j} : LCM(S1, S2)) = (l^1_i : u^1_i : S1) ∩ (l^2_j : u^2_j : S2) (among these intersections of RSDs, some may derive an empty set; if (l^1_i : u^1_i : S1) ∩ (l^2_j : u^2_j : S2) = ∅, then L_{i,j} and U_{i,j} are eliminated from C1 ∩ C2), a ≠ b or c ≠ d ⇒ L_{a,b} ≠ L_{c,d}, and |L_{a,b} − L_{c,d}| < LCM(S1, S2). Finally, Theorem 3 shows the normalization properties of the CSD.


Table 1
Normalization and intersection of RSD, BSD and CSD

Normalization:  RSD → RSD    BSD → CSD    CSD → CSD

∩      | RSD   BSD   CSD
-------+-----------------
RSD    | RSD   CSD   CSD
BSD    | CSD   CSD   CSD
CSD    | CSD   CSD   CSD
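To make the three descriptors concrete, the sketch below shows one possible C representation of RSDs, BSDs, and CSDs (a hypothetical layout for illustration only; the actual data structures of the paper's CSD_Calculus.h are not reproduced here). A CSD stores the common stride once, plus parallel arrays of lower and upper bounds for its RSDs:

#include <stdlib.h>

/* (l : u : s) -- regular section descriptor */
typedef struct { long l, u, s; } RSD;

/* (l : u : b : c) -- block-cyclic section descriptor */
typedef struct { long l, u, b, c; } BSD;

/* ((l_1,...,l_N) : (u_1,...,u_N) : S) -- common-stride section descriptor */
typedef struct {
    long *lower;   /* N lower bounds, pairwise distinct, |l_i - l_j| < stride */
    long *upper;   /* N upper bounds                                          */
    long  stride;  /* common stride S                                         */
    int   n;       /* N = |C|, the number of RSDs                             */
} CSD;

/* Allocate a CSD with room for n RSDs and common stride S. */
static CSD *csd_new(int n, long stride)
{
    CSD *c = malloc(sizeof *c);
    c->lower  = malloc(n * sizeof *c->lower);
    c->upper  = malloc(n * sizeof *c->upper);
    c->stride = stride;
    c->n = n;
    return c;
}

In this representation an RSD is the special case N = 1, and a BSD (l:u:b:c) expands to the CSD ((l, l+1, ..., l+b−1) : (u, ..., u) : c), mirroring Theorem 1.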

Theorem 3. If a·i + b ∈ (l : u : b : c), then i can be represented by a CSD.

Proof. See [12,13]. □

In the proof of Theorem 3, the normalization of a·i + b ∈ (l : u : b : c) is constructed explicitly: we can derive that i ∈ ((L_1, L_2, ..., L_N) : (U_1, U_2, ..., U_N) : |c/GCD(a, c)|), where x ≠ y ⇒ L_x ≠ L_y and |L_x − L_y| < |c/GCD(a, c)|, 1 ≤ x ≤ N and 1 ≤ y ≤ N.

2.2. Expression rewritings to generate communication sets

In [12,13], we have shown how to apply the CSD calculus presented above to the communication set computation of HPF programs. It is a five-step framework with mechanical arithmetic-rewriting rules that accompany the CSD calculus, so that the communication sets of HPF programs with "Block-Cyclic" distributions, two-level (or multi-level) alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation), and affine indices in the array subscripts can be calculated in a systematic way. This section covers the most important points. We define a distribution function for the data layout of an array in HPF programs; first, let us define the descriptor needed by a distribution function.

Definition 5. A descriptor, φ, is a set of integer vectors, defined as follows:

φ(/f_1(i_1, i_2, ..., i_n), ..., f_m(i_1, i_2, ..., i_n)/; /S_1, S_2, ..., S_m/) = { (i_1, i_2, ..., i_n) | ∀k, 1 ≤ k ≤ m, f_k(i_1, i_2, ..., i_n) ∈ S_k },

where each S_k (1 ≤ k ≤ m) represents a set of integers. In this paper, we are interested in having S_k be an RSD, BSD, or CSD. Note that m and n are the numbers of dimensions of the represented integer vectors and of the index variables, respectively.

The example below illustrates a distribution function. The general form of the distribution function of an n-dimensional array X with block-cyclic distribution is as follows:


X_dis(i_1, i_2, ..., i_n) =
  P_{v_1} | φ(/i_1, i_2, ..., i_n/; /l_{1,1} : u_{1,1} : b_1 : c_1, ..., l_{1,n} : u_{1,n} : b_n : c_n/)
  ...
  P_{v_k} | φ(/i_1, i_2, ..., i_n/; /l_{k,1} : u_{k,1} : b_1 : c_1, ..., l_{k,n} : u_{k,n} : b_n : c_n/)
  ...
  P_{v_m} | φ(/i_1, i_2, ..., i_n/; /l_{m,1} : u_{m,1} : b_1 : c_1, ..., l_{m,n} : u_{m,n} : b_n : c_n/)

The equality relation X_dis(i_1, i_2, ..., i_n) = P_{v_k} | φ(/i_1, i_2, ..., i_n/; /l_{k,1} : u_{k,1} : b_1 : c_1, ..., l_{k,n} : u_{k,n} : b_n : c_n/) means that X(i_1, i_2, ..., i_n) resides in processor P_{v_k} if (i_1, i_2, ..., i_n) ∈ φ(/i_1, i_2, ..., i_n/; /l_{k,1} : u_{k,1} : b_1 : c_1, ..., l_{k,n} : u_{k,n} : b_n : c_n/).

In Code Fragment 1, arrays A and B are aligned to template arrays T3 and T4, respectively, and T3 and T4 are then distributed among the processors by block-cyclic distribution. Since the template arrays T3 and T4 are what is finally distributed among the processors, we can derive their corresponding distribution functions. We assume that the distribution functions of T3 and T4 are as follows:

T3_dis(i_1, i_2, ..., i_n) =
  P_1 | φ(/i_1, i_2, ..., i_n/; /l_{1,1} : u_{1,1} : b_1 : c_1, ..., l_{1,n} : u_{1,n} : b_n : c_n/)
  ...
  P_q | φ(/i_1, i_2, ..., i_n/; /l_{q,1} : u_{q,1} : b_1 : c_1, ..., l_{q,n} : u_{q,n} : b_n : c_n/)
  ...
  P_x | φ(/i_1, i_2, ..., i_n/; /l_{x,1} : u_{x,1} : b_1 : c_1, ..., l_{x,n} : u_{x,n} : b_n : c_n/)                (1)

T4_dis(i_1, i_2, ..., i_n) =
  P_1 | φ(/i_1, i_2, ..., i_n/; /l'_{1,1} : u'_{1,1} : b'_1 : c'_1, ..., l'_{1,n} : u'_{1,n} : b'_n : c'_n/)
  ...
  P_r | φ(/i_1, i_2, ..., i_n/; /l'_{r,1} : u'_{r,1} : b'_1 : c'_1, ..., l'_{r,n} : u'_{r,n} : b'_n : c'_n/)
  ...
  P_y | φ(/i_1, i_2, ..., i_n/; /l'_{y,1} : u'_{y,1} : b'_1 : c'_1, ..., l'_{y,n} : u'_{y,n} : b'_n : c'_n/)        (2)

See Code Fragment 1, and note that c_k = #P_{A,k}·b_k and c'_k = #P_{B,k}·b'_k. The HPF compiler uses the owner-computes rule to partition the computation among the processors: each processor computes only the values of the data it owns [19,20]. According to the owner-computes rule, process P_r must send B(g_1(i_1, ..., i_n), ..., g_n(i_1, ..., i_n)) to process P_q, which stores it in A(f_1(i_1, ..., i_n), ..., f_n(i_1, ..., i_n)), if (i_1, ..., i_n) satisfies the intersection equation in Fig. 2; refer to [12,13] for the details of how this equation is derived. By solving the intersection of descriptors shown in Fig. 2, we can derive the communication set which processor P_r must send to processor P_q.


Fig. 2. Communication set that Pr must send to Pq .
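Only the caption of Fig. 2 survives in this extraction. From the per-dimension form given in Section 3.3.3 and the distribution functions (1) and (2), the equation displayed in the figure can be reconstructed as follows (an inferred rendering, not the original figure), where f_j and g_j abbreviate f_j(i_1,...,i_n) and g_j(i_1,...,i_n):

\begin{align*}
(i_1,\ldots,i_n) \in\ & \phi(/\mu_1(f_1,\ldots,f_n),\ldots,\mu_n(f_1,\ldots,f_n)/;\ /l_{q,1}{:}u_{q,1}{:}b_1{:}c_1,\ \ldots,\ l_{q,n}{:}u_{q,n}{:}b_n{:}c_n/)\\
{}\wedge\ & \phi(/\nu_1(g_1,\ldots,g_n),\ldots,\nu_n(g_1,\ldots,g_n)/;\ /l'_{r,1}{:}u'_{r,1}{:}b'_1{:}c'_1,\ \ldots,\ l'_{r,n}{:}u'_{r,n}{:}b'_n{:}c'_n/)\\
{}\wedge\ & \phi(/i_1,\ldots,i_n/;\ /L_1{:}U_1{:}S_1,\ \ldots,\ L_n{:}U_n{:}S_n/)
\end{align*}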

The intersection of the φ descriptors is performed by intersecting the section descriptors in each dimension. In the following, we explain how to employ the CSD calculus to solve the equation in Fig. 2. Note that μ_k(i_1, ..., i_n), ν_k(i_1, ..., i_n), f_k(i_1, ..., i_n), and g_k(i_1, ..., i_n), 1 ≤ k ≤ n, are linear affine index functions with only one index variable. According to Theorem 4, μ_k(f_1(i_1, ..., i_n), ..., f_n(i_1, ..., i_n)) and ν_k(g_1(i_1, ..., i_n), ..., g_n(i_1, ..., i_n)), ∀k, 1 ≤ k ≤ n, are also linear affine index functions with one index variable.

Theorem 4. If w_k(i_1, ..., i_n), u_1(i_1, ..., i_n), u_2(i_1, ..., i_n), ..., u_n(i_1, ..., i_n) are linear affine index functions with one index variable, then w_k(u_1(i_1, ..., i_n), ..., u_n(i_1, ..., i_n)), 1 ≤ k ≤ n, are also linear affine index functions with one index variable.

Proof. See [12,13]. □

Thus, we can continue to solve the equation in Fig. 2. We can normalize μ_k(f_1(i_1, ..., i_n), ..., f_n(i_1, ..., i_n)) ∈ (l_{q,k} : u_{q,k} : b_k : c_k), ∀k, 1 ≤ k ≤ n, according to Theorem 3. Similarly, we can normalize ν_s(g_1(i_1, ..., i_n), ..., g_n(i_1, ..., i_n)) ∈ (l'_{r,s} : u'_{r,s} : b'_s : c'_s), ∀s, 1 ≤ s ≤ n. We obtain the results below for the equation in Fig. 2:

(i_1, i_2, ..., i_n) ∈ φ(/i_1, i_2, ..., i_n/; /C_1, C_2, ..., C_n/) ∧ φ(/i_1, i_2, ..., i_n/; /C'_1, C'_2, ..., C'_n/) ∧ φ(/i_1, i_2, ..., i_n/; /L_1 : U_1 : S_1, ..., L_n : U_n : S_n/),

where C_1, C_2, ..., C_n, C'_1, C'_2, ..., C'_n are all CSDs. We can conclude further that

(i_1, ..., i_n) ∈ φ(/i_1, ..., i_n/; /C_1 ∩ C'_1 ∩ (L_1 : U_1 : S_1), ..., C_i ∩ C'_i ∩ (L_i : U_i : S_i), ..., C_n ∩ C'_n ∩ (L_n : U_n : S_n)/).    (3)

Because the intersection of CSDs is a CSD (Theorem 2), and the intersection of a CSD and an RSD (an RSD is also a CSD according to Theorem 1) is again a CSD, the final result is a CSD.

3. A new algorithm for the CSD calculus

According to Fig. 2, the intersection and normalization operations of CSDs form the kernel of the computation of the communication sets of data parallel programs


with block-cyclic distribution. We first present the old algorithm of [12,13], which solves linear Diophantine equations to perform the intersection of two RSDs [14,25,26].

3.1. The old algorithm: solving linear Diophantine equations

By solving the intersection of two RSDs with a linear Diophantine equation, the intersection of two CSDs can be implemented as follows. Assume C1 = ((l^1_1, ..., l^1_m) : (u^1_1, ..., u^1_m) : S1) and C2 = ((l^2_1, ..., l^2_n) : (u^2_1, ..., u^2_n) : S2). Then

C1 ∩ C2 = ((l^1_1, ..., l^1_m) : (u^1_1, ..., u^1_m) : S1) ∩ ((l^2_1, ..., l^2_n) : (u^2_1, ..., u^2_n) : S2)
        = (∪_{i=1}^{m} (l^1_i : u^1_i : S1)) ∩ (∪_{j=1}^{n} (l^2_j : u^2_j : S2))
        = ∪_{i=1..m, j=1..n} ((l^1_i : u^1_i : S1) ∩ (l^2_j : u^2_j : S2)).

The above equation shows that the solution of C1 ∩ C2 needs m·n intersections of two RSDs. The solution of (l^1_i : u^1_i : S1) ∩ (l^2_j : u^2_j : S2) can be derived by solving the linear Diophantine equation S1·x − S2·y = l^2_j − l^1_i. These m·n intersections need to invoke the extended Euclid's algorithm [6] only once, since all of them have the same inputs S1 and S2. Therefore, the total running time is O(Min(log S1, log S2) + m·n), where Min(log S1, log S2) is the running time of the extended Euclid's algorithm. Note that the intersected CSD does not have a sorted lower bound list: if ∪_{i=1..m, j=1..n}((l^1_i : u^1_i : S1) ∩ (l^2_j : u^2_j : S2)) = ((L_{1,1}, ..., L_{1,m}, L_{2,1}, ..., L_{2,m}, ..., L_{n,m}) : (U_{1,1}, ..., U_{1,m}, U_{2,1}, ..., U_{2,m}, ..., U_{n,m}) : LCM(S1, S2)), the lower bound list (L_{1,1}, ..., L_{1,m}, L_{2,1}, ..., L_{2,m}, ..., L_{n,m}) is in general not sorted.

The normalization of a CSD involves the intersection of an RSD and a CSD. Since an RSD is a special case of a CSD, the normalization of a CSD can be derived from the intersection algorithm of two CSDs. Assume C is a CSD; we solve a·i + b ∈ C as follows. Since a·i + b ∈ C, we have l ≤ a·i + b ≤ u, where l and u are the minimal number of the lower bound list and the maximal number of the upper bound list of C, respectively. We can easily find the smallest and largest integers of i satisfying this inequality; let them be l' and u'. Then a·i + b ∈ (l' : u' : a). According to Theorem 2, let (l' : u' : a) ∩ C = ((l'_0, l'_1, ..., l'_{b_1−1}) : (u'_0, u'_1, ..., u'_{b_1−1}) : LCM(a, c)), where l'_x ≠ l'_y and |l'_x − l'_y| < LCM(a, c) for all x ≠ y (among the intersections of RSDs, some may derive an empty set). It is then immediate that a·i + b ∈ ((l'_0, l'_1, ..., l'_{b_1−1}) : (u'_0, u'_1, ..., u'_{b_1−1}) : LCM(a, c)) implies i ∈ (((l'_0 − b)/a, ..., (l'_{b_1−1} − b)/a) : ((u'_0 − b)/a, ..., (u'_{b_1−1} − b)/a) : LCM(a, c)/a). Thus, the complexity is bounded by the cost of the intersection of an RSD and a CSD, i.e., (l' : u' : a) ∩ C. Assume S is the common stride of C; since an RSD is a special case of a CSD, the complexity of solving a·i + b ∈ C is the same as the running time of (l' : u' : a) ∩ C, which can be derived from the running time of the intersection of two CSDs. Finally, the running time for solving a·i + b ∈ C is O(Min(log a, log S) + 1·|C|) = O(Min(log a, log S) + |C|).
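The core step of the old algorithm is the intersection of two RSDs via a linear Diophantine equation. The following C sketch (written for this text under the stated assumptions; it is not the paper's CSD_Calculus.h implementation) solves (l1:u1:s1) ∩ (l2:u2:s2) with the extended Euclid's algorithm, returning an RSD with stride LCM(s1, s2) or flagging the result as empty when no common element exists:

#include <stdio.h>

typedef struct { long l, u, s; int empty; } RSD;

/* Extended Euclid: returns g = gcd(a,b) and sets x, y with a*x + b*y = g. */
static long ext_gcd(long a, long b, long *x, long *y)
{
    if (b == 0) { *x = 1; *y = 0; return a; }
    long x1, y1, g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

/* (l1:u1:s1) ∩ (l2:u2:s2): solve s1*i - s2*j = l2 - l1 for the smallest
 * common element, then step by lcm(s1, s2). */
static RSD rsd_intersect(RSD A, RSD B)
{
    RSD r = { 0, 0, 0, 1 };
    long x, y, g = ext_gcd(A.s, B.s, &x, &y);
    long d = B.l - A.l;
    if (d % g != 0) return r;              /* residues never meet: empty */
    long lcm = A.s / g * B.s;
    long m = B.s / g;                      /* solutions for i repeat with period m */
    long i0 = ((x * (d / g)) % m + m) % m; /* smallest i >= 0 with A.l + A.s*i ≡ B.l (mod B.s) */
    long first = A.l + A.s * i0;           /* >= A.l and congruent to both residues */
    if (first < B.l)                       /* bump up to >= B.l in lcm-sized steps */
        first += ((B.l - first + lcm - 1) / lcm) * lcm;
    long hi = (A.u < B.u) ? A.u : B.u;
    if (first > hi) return r;
    r.l = first; r.s = lcm; r.empty = 0;
    r.u = first + ((hi - first) / lcm) * lcm;
    return r;
}

int main(void)
{
    RSD a = { 2, 100, 12, 0 }, b = { 8, 100, 18, 0 };
    RSD r = rsd_intersect(a, b);
    if (!r.empty) printf("(%ld:%ld:%ld)\n", r.l, r.u, r.s);  /* prints (26:98:36) */
    return 0;
}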


3.2. The new algorithm: enumerating a limited number of integers in CSDs and then taking the intersection

The new algorithm does not employ a linear Diophantine equation solver. It performs the intersection of two CSDs by first enumerating a limited number of integers of the two CSDs, according to their common strides, and then intersecting the two enumerated integer sets to derive the resulting CSD. Before presenting the details of the algorithm, we prove that any CSD derived from the equation in Fig. 2 can be represented with a common upper bound.

Definition 6. A CSD ((l_1, l_2, ..., l_N) : (u_1, u_2, ..., u_N) : S) is said to have a common upper bound if u_1 = u_2 = ... = u_{N−1} = u_N.

With Definition 6, we proceed to the following theorem.

Theorem 5. Any CSD which is derived by the equation in Fig. 2 can be represented by a CSD with a common upper bound.

Proof. See Appendix A. □

According to Theorem 5, we can construct an enumeration scheme for CSDs with a common upper bound. If a CSD does not have a common upper bound, one way to enumerate its integer set is to enumerate the integer sets of its RSDs separately. To take a simple example, ((2,3):(9,15):3) consists of two RSDs, (2:9:3) and (3:15:3). Since the two RSDs do not have a common upper bound, they are enumerated separately: ((2,3):(9,15):3) = (2:9:3) ∪ (3:15:3) = {2,5,8} ∪ {3,6,9,12,15} = {2,5,8,3,6,9,12,15}. Obviously, the resulting sequence of integers is not sorted. However, a CSD with a common upper bound can be enumerated as follows:

((2,3) : (15,15) : 3) = {2, 3, 2+(3·1), 3+(3·1), 2+(3·2), 3+(3·2), 2+(3·3), 3+(3·3), 2+(3·4), 3+(3·4)} = {2, 3, 5, 6, 8, 9, 11, 12, 14, 15}.

Note that the enumerated sequence of integers is sorted in ascending order. It is not difficult to construct an algorithm that enumerates the integers of a CSD in ascending order when the CSD has a common upper bound and an ascending lower bound list; see Algorithm 1. Its input is a CSD, C = ((l_1, l_2, ..., l_N) : (U, U, ..., U) : S), with common upper bound U and an ascending lower bound


list (l_1, l_2, ..., l_N), where l_1 < l_2 < ... < l_N. The number of RSDs in C is N, i.e., |C| = N. Each iteration of the FOR loop enumerates N integers provided that none of them is greater than U. If some integer l_i + S·r, 1 ≤ i ≤ N, is greater than U, then all the N integers l_j + S·(r+1), 1 ≤ j ≤ N, are greater than U (see Corollary 1), so Algorithm 1 does not execute the next iteration of the FOR loop.

Algorithm 1. Enumerate a CSD which has a common upper bound
Input: C = ((l_1, l_2, ..., l_N) : (U, U, ..., U) : S)
       C has an ascending lower bound list, i.e., l_1 < l_2 < ... < l_N.
Begin_of_Algorithm
  r = 0
  Finish_enumeration = False
  WHILE (TRUE)
    FOR i = 1 to N
      IF (l_i + S·r ≤ U)
        Output (l_i + S·r)
      ELSE
        Finish_enumeration = True
        STOP
      END IF
    END FOR
    IF (Finish_enumeration = True) THEN STOP
    ELSE r = r + 1
  END WHILE
End_of_Algorithm
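For concreteness, a direct C transcription of Algorithm 1 follows (a sketch added for this text; arrays are 0-based, so l[0..N-1] plays the role of l_1, ..., l_N):

#include <stdio.h>

/* Algorithm 1: enumerate, in ascending order, a CSD
 * ((l[0],...,l[N-1]) : (U,...,U) : S) whose lower bound list is ascending. */
static void enumerate_csd(const long *l, int N, long U, long S)
{
    for (long r = 0; ; r++) {
        for (int i = 0; i < N; i++) {
            long v = l[i] + S * r;
            if (v > U)              /* by Corollary 1, every later value also exceeds U */
                return;
            printf("%ld\n", v);
        }
    }
}

int main(void)
{
    long l[] = { 2, 3 };
    enumerate_csd(l, 2, 15, 3);     /* prints 2 3 5 6 8 9 11 12 14 15 */
    return 0;
}

The usage in main() enumerates the CSD ((2,3):(15,15):3) from the example above.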

Corollary 1. Let C be a CSD with a common upper bound U, C = ((l_1, l_2, ..., l_N) : (U, U, ..., U) : S). If ∃i, 1 ≤ i ≤ N, such that l_i + S·r > U, then ∀j, 1 ≤ j ≤ N, l_j + S·(r+1) > U.

Proof. See Appendix A. □

In the following, we present the new CSD intersection algorithm. Assume we have two CSDs, C1 and C2. According to Theorem 5, we assume that both C1 and C2 have a common upper bound; let the common upper bounds of C1 and C2 be u_1 and u_2, respectively, and let their sizes be m and n, i.e., |C1| = m and |C2| = n:

C1 = (L1 : U1 : S1) = ((l_{1,1}, l_{1,2}, ..., l_{1,m}) : (u_1, ..., u_1) : S1)
C2 = (L2 : U2 : S2) = ((l_{2,1}, l_{2,2}, ..., l_{2,n}) : (u_2, ..., u_2) : S2)

The first point to note is that the lower bound lists of C1 and C2 are sorted in ascending order, i.e., l_{1,1} < l_{1,2} < ... < l_{1,m} and l_{2,1} < l_{2,2} < ... < l_{2,n}.


Step 1 of Algorithm 2 computes the lowest common multiple S' of S1 and S2; S' is the common stride of the output CSD C'. Without loss of generality, we assume l_{1,1} ≤ l_{2,1}. In Step 2, the lower bounds (l_{1,1}, l_{1,2}, ..., l_{1,m}) of C1 are all shifted by the same multiple of S1, the Shift_Amount of Step 2, so that in the resulting new lower bound list of C1 we have l_{1,1} + Shift_Amount ≤ l_{2,1} < l_{1,1} + Shift_Amount + S1. Fig. 3 illustrates this.

Fig. 3. Shift the integers in the lower bound list of C1. [Figure: the ascending sequences l_{1,i}, l_{1,i}+S1, l_{1,i}+2·S1, ... generated by C1 and l_{2,j}, l_{2,j}+S2, ... generated by C2, shown before and after Step 2.]


In Step 3, we enumerate the first m·(S'/S1 + 1) integers of the C1 obtained in Step 2 and store them in Enumerated_Set_C1. Similarly, we enumerate the first n·(S'/S2) integers of C2 and store them in Enumerated_Set_C2 (see Fig. 4). In Step 4, the common upper bound of C' is set to Min(u_1, u_2). By intersecting Enumerated_Set_C1 and Enumerated_Set_C2 and then deleting the numbers greater than Min(u_1, u_2), we derive the lower bound list of C'.

Algorithm 2. Intersection of two CSDs
Input:  C1 = (L1 : U1 : S1) = ((l_{1,1}, l_{1,2}, ..., l_{1,m}) : (u_1, ..., u_1) : S1)
        C2 = (L2 : U2 : S2) = ((l_{2,1}, l_{2,2}, ..., l_{2,n}) : (u_2, ..., u_2) : S2)
        Note that l_{1,1} < l_{1,2} < ... < l_{1,m} and l_{2,1} < l_{2,2} < ... < l_{2,n}.
        Without loss of generality, we assume l_{1,1} ≤ l_{2,1}.
Output: C' = (L' : U' : S'), where C' = C1 ∩ C2.
Begin_of_Algorithm
• Step 1: S' = LCM(S1, S2)
• Step 2: Shift the integers in the lower bound list of C1.
    Shift_Amount = S1 · ((l_{2,1} − l_{1,1}) DIV S1)   (the DIV operation yields the quotient)
    FOR i = 1 to m
      l_{1,i} = l_{1,i} + Shift_Amount
    END FOR
• Step 3: Enumerate some integers of C1 and C2 according to S', S1, and S2.
    Enumerated_Set_C1 = { l_{1,1}, l_{1,2}, ..., l_{1,m},
                          l_{1,1}+S1, l_{1,2}+S1, ..., l_{1,m}+S1,
                          l_{1,1}+S1·2, l_{1,2}+S1·2, ..., l_{1,m}+S1·2,
                          ...,
                          l_{1,1}+S1·(S'/S1 − 1), l_{1,2}+S1·(S'/S1 − 1), ..., l_{1,m}+S1·(S'/S1 − 1),
                          l_{1,1}+S', l_{1,2}+S', ..., l_{1,m}+S' }
    Enumerated_Set_C2 = { l_{2,1}, l_{2,2}, ..., l_{2,n},
                          l_{2,1}+S2, l_{2,2}+S2, ..., l_{2,n}+S2,
                          l_{2,1}+S2·2, l_{2,2}+S2·2, ..., l_{2,n}+S2·2,
                          ...,
                          l_{2,1}+S2·(S'/S2 − 1), l_{2,2}+S2·(S'/S2 − 1), ..., l_{2,n}+S2·(S'/S2 − 1) }
• Step 4: L' = Enumerated_Set_C1 ∩ Enumerated_Set_C2
    Remove from L' all the elements greater than Min(u_1, u_2).
    Assume |L'| = x. Let U' = {Min(u_1, u_2), ..., Min(u_1, u_2)} with |U'| = x.
End_of_Algorithm

Fig. 4. The sets Enumerated_Set_C1 and Enumerated_Set_C2, where H1 = LCM(S1, S2)/S1 = S'/S1 and H2 = LCM(S1, S2)/S2 = S'/S2: Enumerated_Set_C1 consists of l_{1,i}+S1·r for r = 0, ..., H1 (its last row being l_{1,i}+S'), and Enumerated_Set_C2 of l_{2,j}+S2·r for r = 0, ..., H2−1.

As described above, the new algorithm does not employ a linear Diophantine equation solver: it first enumerates some numbers of the two CSDs according to their common strides and then intersects the enumerated numbers to derive the resulting CSD. The following example illustrates Algorithm 2.

Example 3. Given two CSDs:

C1 = ((14, 15, 17, 19, 21) : (1000, 1000, 1000, 1000, 1000) : 12)
C2 = ((41, 45, 46, 50, 51, 52, 56) : (2000, 2000, 2000, 2000, 2000, 2000, 2000) : 18)


Step 1: S' = LCM(12, 18) = 36.
Step 2: Shift_Amount = 12 · ((41 − 14) DIV 12) = 24; after the shift, C1 becomes ((38, 39, 41, 43, 45) : (1000, 1000, 1000, 1000, 1000) : 12).
Step 3: We enumerate 5·(36/12 + 1) = 20 integers of C1 and 7·(36/18) = 14 integers of C2:
    Enumerated_Set_C1 = {38, 39, 41, 43, 45, 50, 51, 53, 55, 57, 62, 63, 65, 67, 69, 74, 75, 77, 79, 81}
    Enumerated_Set_C2 = {41, 45, 46, 50, 51, 52, 56, 59, 63, 64, 68, 69, 70, 74}
Step 4: Finally, we have C' = ((41, 45, 50, 51, 63, 69, 74) : (1000, 1000, 1000, 1000, 1000, 1000, 1000) : 36).

Corollary 2. In Algorithm 2, C' = C1 ∩ C2.

Proof. See Appendix A. □
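A compact C sketch of Algorithm 2 follows (again an illustration written for this text, not the paper's CSD_faster.h code). It shifts C1, enumerates the two bounded sets of Step 3, and intersects them by a linear merge, which is where the sortedness of the lower bound lists is exploited; run on the data of Example 3 it yields the lower bounds 41, 45, 50, 51, 63, 69, 74 with stride 36:

#include <stdio.h>
#include <stdlib.h>

static long gcd(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }

/* Enumerate lows[i] + stride*r for r = 0..rounds-1 into out; returns the count. */
static int enum_set(const long *lows, int n, long stride, long rounds, long *out)
{
    int cnt = 0;
    for (long r = 0; r < rounds; r++)
        for (int i = 0; i < n; i++)
            out[cnt++] = lows[i] + stride * r;
    return cnt;
}

/* Algorithm 2: C1 = ((l1[0..m-1]) : (u1,...) : S1), C2 = ((l2[0..n-1]) : (u2,...) : S2),
 * both lower bound lists ascending and l1[0] <= l2[0].
 * Writes the lower bounds of C' into L (capacity >= m * (S'/S1 + 1)), returns |C'|;
 * *Sout = S', *Uout = common upper bound of C'. */
static int csd_intersect(const long *l1, int m, long u1, long S1,
                         const long *l2, int n, long u2, long S2,
                         long *L, long *Sout, long *Uout)
{
    long Sp = S1 / gcd(S1, S2) * S2;                 /* Step 1: S' = LCM(S1, S2)   */
    long shift = S1 * ((l2[0] - l1[0]) / S1);        /* Step 2: shift amount       */
    long *A = malloc(m * (Sp / S1 + 1) * sizeof *A);
    long *B = malloc(n * (Sp / S2) * sizeof *B);
    long *shifted = malloc(m * sizeof *shifted);
    for (int i = 0; i < m; i++) shifted[i] = l1[i] + shift;
    int na = enum_set(shifted, m, S1, Sp / S1 + 1, A);   /* Step 3 */
    int nb = enum_set(l2, n, S2, Sp / S2, B);
    long U = (u1 < u2) ? u1 : u2;                    /* Step 4: merge-intersect    */
    int x = 0;
    for (int i = 0, j = 0; i < na && j < nb; ) {
        if (A[i] < B[j]) i++;
        else if (A[i] > B[j]) j++;
        else { if (A[i] <= U) L[x++] = A[i]; i++; j++; }
    }
    free(A); free(B); free(shifted);
    *Sout = Sp; *Uout = U;
    return x;
}

int main(void)
{
    long l1[] = { 14, 15, 17, 19, 21 }, l2[] = { 41, 45, 46, 50, 51, 52, 56 };
    long L[32], Sp, U;
    int x = csd_intersect(l1, 5, 1000, 12, l2, 7, 2000, 18, L, &Sp, &U);
    for (int i = 0; i < x; i++) printf("%ld ", L[i]);   /* 41 45 50 51 63 69 74 */
    printf(": %ld : %ld\n", U, Sp);
    return 0;
}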

3.3. Complexity analysis

In this section, we first give the complexity analysis of Algorithm 2 (Section 3.3.1). The running time of the CSD normalization derived from Algorithm 2 is then presented in Section 3.3.2. Finally, in Section 3.3.3, we analyze the complexity of computing communication sets according to the equation in Fig. 2, giving a complete picture of the running time of computing the communication set which process P_r must send to process P_q.

3.3.1. Running time for the intersection of CSDs

Assume C1 = ((l^1_1, ..., l^1_n) : (u^1_1, ..., u^1_n) : S1) and C2 = ((l^2_1, ..., l^2_m) : (u^2_1, ..., u^2_m) : S2). We have shown in Section 3.1 that the running time for C1 ∩ C2 in the algorithm presented in [12,13] is O(Min(log S1, log S2) + |C1|·|C2|). By the definition of the CSD (Definition 3), the difference between the lower bounds of each pair of RSDs is less than the common stride; thus the number of RSDs in a CSD is less than its common stride, and we have |C1| ≤ S1 and |C2| ≤ S2.

For Algorithm 2, the running time depends on the sizes of Enumerated_Set_C1 and Enumerated_Set_C2 as well as on their intersection. The running time for preparing Enumerated_Set_C1 and Enumerated_Set_C2 is proportional to their sizes; by Algorithm 2, |Enumerated_Set_C1| = |C1|·(LCM(S1, S2)/S1 + 1) and |Enumerated_Set_C2| = |C2|·(LCM(S1, S2)/S2). Since both Enumerated_Set_C1 and Enumerated_Set_C2 are ordered sets, the running time of their intersection is O(Max(|C1|·(LCM(S1, S2)/S1 + 1), |C2|·(LCM(S1, S2)/S2))) = O(Max(|C1|·LCM(S1, S2)/S1, |C2|·LCM(S1, S2)/S2)). The complexity of calculating LCM(S1, S2) is O(Min(log S1, log S2)) [6]. Finally, we obtain


the running time for C1 ∩ C2 in Algorithm 2 to be O(Min(log S1, log S2) + Max(|C1|·LCM(S1, S2)/S1, |C2|·LCM(S1, S2)/S2)).

3.3.2. Running time for normalizing a CSD

Assume S is the common stride of a CSD C. We have shown in Section 3.1 that the running time for a·i + b ∈ C in the algorithm presented in [12,13] is O(Min(log a, log S) + |C|). When Algorithm 2 is employed to solve the normalization problem a·i + b ∈ C, the running time is bounded by the complexity of Algorithm 2: the problem is an intersection of an RSD with stride a and the CSD C. The running time of solving a·i + b ∈ C by applying Algorithm 2 is therefore O(Min(log a, log S) + Max(LCM(a, S)/a, |C|·LCM(a, S)/S)).

3.3.3. Running time for calculating communication sets

Fig. 2 shows the corresponding equation for the communication set which process P_r must send to process P_q. Consider the kth dimension of Fig. 2. We have

i_k ∈ φ(/μ_k(f_1(i_1, ..., i_n), ..., f_n(i_1, ..., i_n))/; /l_{q,k} : u_{q,k} : b_k : c_k/) ∧ φ(/ν_k(g_1(i_1, ..., i_n), ..., g_n(i_1, ..., i_n))/; /l'_{r,k} : u'_{r,k} : b'_k : c'_k/) ∧ φ(/i_k/; /L_k : U_k : S_k/).

According to Theorem 4, we may assume μ_k(f_1(i_1, ..., i_n), ..., f_n(i_1, ..., i_n)) = a1·i + b1 and ν_k(g_1(i_1, ..., i_n), ..., g_n(i_1, ..., i_n)) = a2·i + b2. Then we have

{i_k} = {i | a1·i + b1 ∈ (l_{q,k} : u_{q,k} : b_k : c_k)} ∩ {i | a2·i + b2 ∈ (l'_{r,k} : u'_{r,k} : b'_k : c'_k)} ∩ {i | i ∈ (L_k : U_k : S_k)}.

Solving this equation takes four stages.
Stage 1: Normalize a BSD. Given {i | a1·i + b1 ∈ (l_{q,k} : u_{q,k} : b_k : c_k)}, we can apply the CSD normalization algorithm, since a BSD is a special case of a CSD. Assume we obtain i ∈ C1.
Stage 2: Normalize a BSD, as in Stage 1. Assume the solution of {i | a2·i + b2 ∈ (l'_{r,k} : u'_{r,k} : b'_k : c'_k)} is i ∈ C2.
Stage 3: Intersect the two CSDs: C3 = C1 ∩ C2.
Stage 4: Intersect an RSD and a CSD: C4 = C3 ∩ (L_k : U_k : S_k).

Table 2
Complexity of each stage of Dio_Solving, Pure_Enumerating, and Enumerating+Dio

Stage 1
  Dio_Solving:       O(Min(log a1, log c_k) + b_k)
  Pure_Enumerating:  O(Min(log a1, log c_k) + Max(S1/a1, b_k·S1/c_k))
  Enumerating+Dio:   O(Min(log a1, log c_k) + b_k + |C1|·log|C1|)
Stage 2
  Dio_Solving:       O(Min(log a2, log c'_k) + b'_k)
  Pure_Enumerating:  O(Min(log a2, log c'_k) + Max(S2/a2, b'_k·S2/c'_k))
  Enumerating+Dio:   O(Min(log a2, log c'_k) + b'_k + |C2|·log|C2|)
Stage 3
  Dio_Solving:       O(Min(log S1', log S2') + |C1|·|C2|)
  Pure_Enumerating:  O(Min(log S1', log S2') + Max(|C1|·S3/S1', |C2|·S3/S2'))
  Enumerating+Dio:   O(Min(log S1', log S2') + Max(|C1|·S3/S1', |C2|·S3/S2'))
Stage 4
  Dio_Solving:       O(Min(log S3, log S_k) + |C3|)
  Pure_Enumerating:  O(Min(log S3, log S_k) + Max(S4/S3, |C3|·S4/S_k))
  Enumerating+Dio:   O(Min(log S3, log S_k) + |C3|)

Here S1 = LCM(a1, c_k), S2 = LCM(a2, c'_k), S1' = LCM(a1, c_k)/a1, S2' = LCM(a2, c'_k)/a2, S3 = LCM(S1', S2'), and S4 = LCM(S3, S_k).


According to the algorithms mentioned in Section 3.1, we list the complexities of the four stages in Table 2. We call that scheme Dio_Solving; it works by solving linear Diophantine equations to perform the intersection and normalization of CSDs [12,13]. The second scheme is called Pure_Enumerating; it uses Algorithm 2 to execute Stages 3 and 4 and the normalization algorithm derived from Algorithm 2 to execute Stages 1 and 2. Comparing Dio_Solving with Pure_Enumerating shows that Dio_Solving has a better running time at Stages 1, 2, and 4; Complexity Analysis 1 illustrates this.

Complexity Analysis 1. In Stage 1, the complexities of Dio_Solving and Pure_Enumerating are O(Min(log a1, log c_k) + b_k) and O(Min(log a1, log c_k) + Max(S1/a1, b_k·LCM(a1, c_k)/c_k)), respectively. It is trivial that b_k·LCM(a1, c_k)/c_k ≥ b_k. Thus, O(Min(log a1, log c_k) + b_k) ≤ O(Min(log a1, log c_k) + Max(S1/a1, b_k·LCM(a1, c_k)/c_k)). The extreme case is that a1 is prime to c_k; in this case, O(Min(log a1, log c_k) + Max(S1/a1, b_k·LCM(a1, c_k)/c_k)) becomes O(Min(log a1, log c_k) + Max(S1/a1, b_k·a1)), which may be far greater than O(Min(log a1, log c_k) + b_k) if a1 is a large integer. Similarly, in Stage 2, if a2 is prime to c'_k, O(Min(log a2, log c'_k) + Max(S2/a2, b'_k·LCM(a2, c'_k)/c'_k)) becomes O(Min(log a2, log c'_k) + Max(S2/a2, b'_k·a2)); and in Stage 4, if S3 is prime to S_k, O(Min(log S3, log S_k) + Max(S4/S3, |C3|·LCM(S3, S_k)/S_k)) becomes O(Min(log S3, log S_k) + Max(S4/S3, |C3|·S3)).

Although Complexity Analysis 1 shows that the running time of Dio_Solving is better than that of Pure_Enumerating at Stages 1, 2, and 4, it would be fallacious to conclude that the total running time of Dio_Solving is better than that of Pure_Enumerating. As mentioned in Section 3.1, the maximal values of |C1| and |C2| are b_k and b'_k, respectively; usually, |C1| and |C2| are very close or even equal to b_k and b'_k, i.e., |C1|·|C2| is usually very close or even equal to b_k·b'_k. In the following, we focus on the complexity of Stage 3, C3 = C1 ∩ C2, in Pure_Enumerating.

Complexity Analysis 2. Consider the complexity of Stage 3, C3 = C1 ∩ C2, in Pure_Enumerating. We can easily obtain that S1' = LCM(a1, c_k)/a1 = c_k/GCD(a1, c_k) ≤ c_k and S2' = LCM(a2, c'_k)/a2 = c'_k/GCD(a2, c'_k) ≤ c'_k. Then, S3 = LCM(S1', S2') ≤ LCM(c_k, c'_k). Furthermore, |C1|·S3/S1' ≤ |C1|·LCM(c_k, c'_k)/(LCM(a1, c_k)/a1) = |C1|·(LCM(c_k, c'_k)/c_k)·GCD(a1, c_k). Similarly, |C2|·S3/S2' ≤ |C2|·(LCM(c_k, c'_k)/c'_k)·GCD(a2, c'_k).


See Code Fragment 1: in the distributions of arrays T3 and T4, c_k = #P_{A,k}·b_k and c'_k = #P_{B,k}·b'_k, where #P_{A,k} and #P_{B,k} are the numbers of processors onto which T3 and T4 are distributed along the kth dimension. We consider the complexity under the following cases.

Case 1: c_k = c'_k (i.e., #P_{A,k}·b_k = #P_{B,k}·b'_k).
|C1|·S3/S1' ≤ |C1|·(LCM(c_k, c'_k)/c_k)·GCD(a1, c_k) = |C1|·GCD(a1, c_k), so |C1|·S3/S1' ≤ |C1|·GCD(a1, c_k). Similarly, |C2|·S3/S2' ≤ |C2|·GCD(a2, c'_k). Then Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·GCD(a1, c_k), |C2|·GCD(a2, c'_k)).

Case 2: One of c_k and c'_k is a multiple of the other. Without loss of generality, assume c_k = X·c'_k.
|C1|·S3/S1' ≤ |C1|·(LCM(c_k, c'_k)/c_k)·GCD(a1, c_k) = |C1|·GCD(a1, c_k), so |C1|·S3/S1' ≤ |C1|·GCD(a1, c_k). Likewise, |C2|·S3/S2' ≤ |C2|·GCD(a2, c'_k). Then Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·GCD(a1, c_k), |C2|·GCD(a2, c'_k)).

Case 3: b_k and b'_k are prime to each other, i.e., GCD(b_k, b'_k) = 1.
|C1|·S3/S1' ≤ |C1|·(LCM(c_k, c'_k)/c_k)·GCD(a1, c_k) = |C1|·(LCM(#P_{A,k}·b_k, #P_{B,k}·b'_k)/(#P_{A,k}·b_k))·GCD(a1, c_k) = |C1|·(LCM(#P_{A,k}, #P_{B,k})·b_k·b'_k/(#P_{A,k}·b_k))·GCD(a1, c_k) = |C1|·(LCM(#P_{A,k}, #P_{B,k})·b'_k/#P_{A,k})·GCD(a1, c_k). We obtain |C1|·S3/S1' ≤ |C1|·(LCM(#P_{A,k}, #P_{B,k})·b'_k/#P_{A,k})·GCD(a1, c_k). Similarly, |C2|·S3/S2' ≤ |C2|·(LCM(#P_{A,k}, #P_{B,k})·b_k/#P_{B,k})·GCD(a2, c'_k). Then Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·(LCM(#P_{A,k}, #P_{B,k})·b'_k/#P_{A,k})·GCD(a1, c_k), |C2|·(LCM(#P_{A,k}, #P_{B,k})·b_k/#P_{B,k})·GCD(a2, c'_k)). Furthermore, if #P_{A,k} = #P_{B,k}, then Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·b'_k·GCD(a1, c_k), |C2|·b_k·GCD(a2, c'_k)).

Case 4: GCD(b_k, b'_k) = g, g > 1. Assume b_k = α·g and b'_k = β·g; since GCD(b_k, b'_k) = g, GCD(α, β) = 1.
|C1|·S3/S1' ≤ |C1|·(LCM(c_k, c'_k)/c_k)·GCD(a1, c_k) = |C1|·(LCM(#P_{A,k}·α·g, #P_{B,k}·β·g)/(#P_{A,k}·α·g))·GCD(a1, c_k) = |C1|·(g·LCM(#P_{A,k}·α, #P_{B,k}·β)/(#P_{A,k}·α·g))·GCD(a1, c_k) = |C1|·(LCM(#P_{A,k}·α, #P_{B,k}·β)/(#P_{A,k}·α))·GCD(a1, c_k). Similarly, |C2|·S3/S2' ≤ |C2|·(LCM(#P_{A,k}·α, #P_{B,k}·β)/(#P_{B,k}·β))·GCD(a2, c'_k). Finally, we derive Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·(LCM(#P_{A,k}·α, #P_{B,k}·β)/(#P_{A,k}·α))·GCD(a1, c_k), |C2|·(LCM(#P_{A,k}·α, #P_{B,k}·β)/(#P_{B,k}·β))·GCD(a2, c'_k)).


Furthermore, if #P_{A,k} = #P_{B,k}, then Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·β·GCD(a1, c_k), |C2|·α·GCD(a2, c'_k)).

From Complexity Analysis 2, we can now answer the question posed above on the comparison of the running times of Dio_Solving and Pure_Enumerating at Stage 3. The answer is that if the programmer sets c_k = c'_k, or one of c_k and c'_k is a multiple of the other, then Pure_Enumerating tends to have a better running time at Stage 3 than Dio_Solving, because Max(|C1|·S3/S1', |C2|·S3/S2') ≤ Max(|C1|·GCD(a1, c_k), |C2|·GCD(a2, c'_k)) ≤ |C1|·|C2| if GCD(a1, c_k) < |C2| and GCD(a2, c'_k) < |C1|. Note that GCD(a1, c_k) ≤ a1 and GCD(a2, c'_k) ≤ a2, and we have good reason to expect a1 and a2 to be small integers, as they are the coefficients in array section movements.

Complexity Analysis 1 shows that the running time of Dio_Solving is better than that of Pure_Enumerating at Stages 1, 2, and 4, whereas Pure_Enumerating tends to have a better running time at Stage 3 according to Complexity Analysis 2. This motivates a hybrid scheme which employs the old algorithm of Section 3.1 to perform Stages 1, 2, and 4 and invokes Algorithm 2 for the CSD intersection at Stage 3. However, because the normalized CSDs derived by the algorithm of Section 3.1 do not always have a sorted lower bound list, that algorithm cannot be plugged into the hybrid scheme directly: Algorithm 2 requires two CSDs with lower bound lists sorted in ascending order. Thus, we devise a new scheme called Enumerating+Dio. Stage 1 of Enumerating+Dio comprises Stage 1 of Dio_Solving, which derives C1, followed by a quick sort [6] of the lower bound list of C1; the complexity of Stage 1 of Enumerating+Dio is therefore O(Min(log a1, log c_k) + b_k + |C1|·log|C1|). (The complexity of applying quick sort to a list with n elements is n·log n.) For the same reason, Stage 2 of Enumerating+Dio consists of Stage 2 of Dio_Solving, which derives C2, followed by a quick sort of the lower bound list of C2; its complexity is O(Min(log a2, log c'_k) + b'_k + |C2|·log|C2|). Since both C1 and C2 now have sorted lower bound lists, we can apply Algorithm 2 at Stage 3 of Enumerating+Dio. Finally, Stage 4 of Enumerating+Dio invokes the same algorithm as Dio_Solving.

For the same equation, the CSDs derived by Dio_Solving, Pure_Enumerating, and Enumerating+Dio represent the same integer sequences, but only Pure_Enumerating generates CSDs with sorted lower bounds. Although the total numbers of data elements derived by the three schemes are the same, different orderings of the elements may affect the performance of sending and receiving the communication sets; for example, they may influence the cache hit rate of memory references. Exploring this is beyond the scope of this paper and is left for further investigation.
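Returning to the Stage 3 comparison, a concrete numeric illustration may help (using the a1 = 7, a2 = 9, b_k = b'_k = 512 configuration of Section 4 and assuming equal, power-of-two processor counts, so that GCD(a1, c_k) = GCD(a2, c'_k) = 1 and the Case 1 bound applies):

\[
\max\bigl(|C_1|\cdot\mathrm{GCD}(a_1,c_k),\ |C_2|\cdot\mathrm{GCD}(a_2,c'_k)\bigr)
\le \max(512\cdot 1,\ 512\cdot 1) = 512
\qquad\text{versus}\qquad
|C_1|\cdot|C_2| \le 512\cdot 512 = 262144 .
\]

This gap in the Stage 3 operation counts is consistent with the large Stage 3 difference measured in Fig. 5A.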


4. Experimental results

We have implemented our expression-rewriting framework and the CSD calculus to generate the communication codes of HPF programs. Our software framework includes (1) the definition of the data structures of the RSD, BSD, and CSD, (2) the intersection and normalization of RSDs, BSDs, and CSDs, and (3) the proposed expression-rewriting framework. We measure the performance on HPF codes with multi-level alignments, multi-dimensional arrays, and axis exchanges (see Code Fragment 1). The execution times of Stages 1, 2, 3, and 4 are all measured. (Because of space limitations, the detailed execution times of the different stages can be found at http://bashful.ice.ntnu.edu.tw/~ghhwang/papers/PPexp_results.pdf.) Also, following Complexity Analyses 1 and 2, we try different combinations of the block sizes of arrays T3 and T4 as well as different values of a1 and a2. Without loss of generality, we list only the measured execution times of one dimension. All measurements were done on a SUN E10K workstation with 400 MHz processors running SUN OS 5.7. Programs are compiled by gcc at the -O2 optimization level. The C source code can be downloaded at http://bashful.ice.ntnu.edu.tw/~ghhwang/CSD/CSD.zip; CSD_Calculus.h implements the old algorithm we published in [12,13], while CSD_faster.h implements Algorithm 2 and the CSD normalization algorithm derived from it.

According to Section 3.3.3, we measured the execution times for a1 and a2 with different block sizes of arrays A and B, including (1) b_k = b'_k, (2) b_k a multiple of b'_k, and (3) b_k and b'_k prime to one another. In the first part, we set a1 and a2 to be small integers. Fig. 5A-C show the execution times we measured for a1 = 7 and a2 = 9: in Fig. 5A, b_k = b'_k; in Fig. 5B, b_k is a multiple of b'_k; in Fig. 5C, b_k and b'_k are prime to one another. According to Complexity Analysis 2, Pure_Enumerating should outperform Dio_Solving at Stage 3, and Fig. 5A and B provide considerable evidence of this. Although Complexity Analysis 1 shows that Dio_Solving has a better running time than Pure_Enumerating at Stages 1, 2, and 4, Fig. 5A-C show the converse results, because a1 and a2 are small integers. The measured numbers in Fig. 5C make it clear that it is unwise for the programmer to set the block sizes of arrays T3 and T4 prime to each other; the block sizes specify only the data distribution of the arrays in the program and have no influence on the execution results.

In the second part, we set a1 and a2 to be somewhat larger integers. Fig. 5D-F show the execution times we measured for a1 = 56 and a2 = 72. As can be seen, Pure_Enumerating can no longer outperform Dio_Solving at Stages 1, 2, and 4; thus Enumerating+Dio has the better execution time in Fig. 5D and E. The measured numbers in Fig. 5F show again that it is unwise for the programmer to set the block sizes of arrays A and B prime to each other.

Fig. 6A-C show the execution times we measured for a1 = b_k − 1 and a2 = b'_k − 1. Enumerating+Dio has the better execution time in Fig. 6A and B.

Fig. 5. Experimental results: execution times (in microseconds) of Dio_Solving, Pure_Enumerating, and Enumerating+Dio versus the block sizes of T3 and T4, panels (A)-(F).

Again, the measured numbers in Fig. 6C show that it is unwise for the programmer to set the block sizes of arrays A and B prime to each other.

Fig. 6. Experimental results: execution times (in microseconds) of Dio_Solving, Pure_Enumerating, and Enumerating+Dio versus the block sizes of T3 and T4, panels (A)-(F).

Fig. 6D-F show the execution times we measured for a1 = b_k and a2 = b'_k. In this case, C1 and C2 contain few RSDs, and so do C3 and C4.


The execution times of the three schemes become very small, and Enumerating+Dio again shows its ability to achieve good performance.

From all the measured running times, we make the following observations. First, Dio_Solving always spends a tremendous amount of time on Stage 3. For example, as shown in Fig. 5A, Dio_Solving takes 77244.8 ms to finish Stage 3 when the block size is 512 * 512, whereas Pure_Enumerating needs only 50.0 ms for the same block size; the execution time of Stage 3 of Dio_Solving becomes prohibitively large as the block sizes of T3 and T4 increase. The new CSD intersection algorithm proposed in this paper (Algorithm 2), i.e., Stage 3 of Pure_Enumerating and of Enumerating+Dio, always outperforms Dio_Solving at Stage 3. Second, although Pure_Enumerating has worse complexity at Stages 1, 2, and 4, the measured execution times in Fig. 5A-C show that it can still have lower execution times at those stages as long as a1 and a2 are small integers; when a1 and a2 become larger, Pure_Enumerating consumes more execution time at Stages 1, 2, and 4 than Dio_Solving and Enumerating+Dio (see Fig. 5D-F). Third, Enumerating+Dio delivers the best average execution time over all the different combinations of a1, a2, b_k, and b'_k. Finally, and most importantly, the time needed for computing the communication sets of block-cyclic distributions becomes prohibitively large if the lowest common multiple (LCM) of the block sizes of the distributed arrays (i.e., b_k and b'_k in this paper) is a large integer; the extreme cases are those where b_k and b'_k are prime to one another (see Figs. 5C, F and 6C). In other words, we prove that the compiler is able to compute the communication sets of block-cyclic distributions efficiently as long as the block sizes of the arrays are identical or the lowest common multiple LCM(b_k, b'_k) is not a huge integer (see Figs. 5A, B, E, F and 6A, B, D, E). For example, consider a1 = 7, a2 = 9, and b_k = b'_k = 512: Dio_Solving, Pure_Enumerating, and Enumerating+Dio take 78695.9, 1070.8, and 2064.6 ms, respectively, to compute the communication sets. However, for a1 = 7, a2 = 9, b_k = 512, and b'_k = 511, they spend 249836.4, 87706.2, and 125436.6 ms, respectively.

5. Related work

The issue of generating SPMD code for HPF programs with data distribution directives was first addressed by Koelbel [19], who showed that the communication sets for data movements between identically distributed arrays with either a block or a cyclic distribution can be represented in a closed form; arrays with block-cyclic distribution, however, were not considered. The problem of array statements involving block-cyclically distributed arrays was addressed by Chatterjee et al. [5], who describe a method for enumerating local regular section indices in increasing order using a finite-state machine. Kennedy et al. solved the same problem by an integer lattice method [17,18]. Kwon et al. conducted more experiments on several local set enumeration algorithms [22]. These works mainly focused on the local set

G.-H. Hwang / Parallel Computing 30 (2004) 473–501

497

enumeration. They did not derive a closed-form representation for the communication set of block-cyclically distributed arrays. The approach proposed by Gupta et al. [8] viewed a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processor, which are cyclically (or block-wise) mapped to physical processors. However, they did not consider the cases that the arrays are first aligned to the template arrays, which are then distributed onto processor memory, i.e., two-level mapping of HPF [20]. The above approaches did not discover periodic patterns in communication sets. Hiranandani et al. presented that the pattern of the communication set repeats if both the loop stride and array subscript are unit. However, the block-cyclically distributed arrays should be one-level mapped. In [25], Stinchnoth et al. described that the communication sets for block-cyclic distributions can be represented by union of regular section descriptors and it illustrated that the communication pattern becomes even more complicated when we use nonunit access strides in the assignment statement. However, their framework can handle only one-level mapped arrays. Hwang and Lee proposed the CSD calculus as the base for the computation of communication sets for data parallel programs with block-cyclic distribution [12,13]. They consider the most general cases that multi-dimensional arrays are distributed by two-level mapping in each dimension. However, their solution to the communication sets computation problem depends on the linear diophantine equation solver and the execution time becomes prohibitively large if the block sizes of the array are set to be large number (e.g. greater than 64). 6. Conclusion Widespread use of data-parallel languages, such as High Performance Fortran, will not come about until fast compilers and efficient run-time systems are developed. In this research, new algorithms which do not rely on linear diophantine equation solver to perform the intersection and normalization of RSDs, BSDs and CSDs are proposed. The experimental results show a maximal speedup of 38 compared with algorithms which invoke linear diophantine equation solver. Another important contribution of this paper is that we prove that the compiler is able to compute efficiently the communication sets of block-cyclic distribution as long as the block sizes of the arrays are set to be identical or the lowest common multiple (LCM) of block sizes is not a huge integer. We demonstrate it by thorough complexity analyses and extensive experimental results. With this, the programmer can be sure that the compiler can always compute efficiently the communication sets of array section movements between block-cyclic distributed arrays as long as he sets up appropriate block sizes of these arrays. Acknowledgements Thanks are due to the Computer Center of the National Normal University, Taiwan, for providing access to the SUN E10K workstation. G.-H. Hwang’s work was


Appendix A. Theorem proof

Corollary 1. Let $C$ be a CSD with a common upper bound $U$, $C = ((l_1, l_2, \ldots, l_N) : (U, U, \ldots, U) : S)$. If $\exists i$, $1 \le i \le N$, $l_i + S \cdot r > U$, then $\forall j$, $1 \le j \le N$, $l_j + S \cdot (r+1) > U$.

Proof. $C$ is a CSD. According to Definition 3, $|l_i - l_j| < S$ for $1 \le i \le N$ and $1 \le j \le N$. We have $-S < l_j - l_i < S \Rightarrow -S + l_i < l_j < S + l_i \Rightarrow -S + l_i + S \cdot (r+1) < l_j + S \cdot (r+1) < S + l_i + S \cdot (r+1) \Rightarrow l_i + S \cdot r < l_j + S \cdot (r+1)$. Because we already have $l_i + S \cdot r > U$, we finally obtain $U < l_i + S \cdot r < l_j + S \cdot (r+1) \Rightarrow U < l_j + S \cdot (r+1)$. $\Box$

Theorem 5. Any CSD which is derived by the equation in Fig. 2 can be represented by a CSD with a common upper bound.

Proof. Step 1: Assume $(L' : U' : S') = (L_1 : U_1 : S_1) \cap (L_2 : U_2 : S_2)$; it is sufficient to take $U' = \mathrm{Min}(U_1, U_2)$ to enumerate all the integers in $(L' : U' : S')$.

Step 2: To solve the equation in Fig. 2, we first have to perform the normalization of BSDs. In this step we prove that the normalization of a BSD can be represented by a CSD with a common upper bound. According to Theorem 4, $l_k(f_1(i_1, \ldots, i_n), \ldots, f_n(i_1, \ldots, i_n))$ and $m_k(g_1(i_1, \ldots, i_n), \ldots, g_n(i_1, \ldots, i_n))$, $\forall k$, $1 \le k \le n$, are linear affine index functions with one index variable. That is, $l_k(f_1(i_1, \ldots, i_n), \ldots, f_n(i_1, \ldots, i_n))$ can be represented by $a \cdot i + b$. The normalization operation $l_k(f_1(i_1, \ldots, i_n), \ldots, f_n(i_1, \ldots, i_n)) \in (l_{q,k} : u_{q,k} : b_k : c_k)$ is therefore the same as $a \cdot i + b \in (l_{q,k} : u_{q,k} : b_k : c_k)$. According to Definition 2, this amounts to solving $a \cdot i + b \in \bigcup_{\kappa=0}^{b_k-1} (l_{q,k} + \kappa : u_{q,k} : c_k)$. Since $a \cdot i + b \in (l_{q,k} : u_{q,k} : b_k : c_k)$, we have $l_{q,k} \le a \cdot i + b \le u_{q,k}$. The smallest and largest admissible integers $i$ are easy to find; let $l'$ and $u'$ be the corresponding smallest and largest values of $a \cdot i + b$. To derive $i$, we compute $(l' : u' : a) \cap (l_{q,k} : u_{q,k} : b_k : c_k) = (l' : u' : a) \cap \bigcup_{\kappa=0}^{b_k-1} (l_{q,k} + \kappa : u_{q,k} : c_k)$. To solve this intersection, we have to perform $b_k$ intersections of two RSDs, $(l' : u' : a)$ and $(l_{q,k} + \kappa : u_{q,k} : c_k)$, for $\kappa = 0$ to $b_k - 1$. As proven in Step 1, it is sufficient to take $\mathrm{Min}(u', u_{q,k})$ as their common upper bound. That is, we can assume${}^{7}$
$$a \cdot i + b \in \Big((l' : u' : a) \cap \bigcup_{\kappa=0}^{b_k-1} (l_{q,k} + \kappa : u_{q,k} : c_k)\Big) = \big((l'_0, l'_1, \ldots, l'_{b_k-1}) : (\mathrm{Min}(u', u_{q,k}), \ldots, \mathrm{Min}(u', u_{q,k})) : \mathrm{LCM}(a, c_k)\big).$$
Finally, we have
$$i \in \Big(\Big(\frac{l'_0 - b}{a}, \ldots, \frac{l'_{b_k-1} - b}{a}\Big) : \Big(\frac{\mathrm{Min}(u', u_{q,k}) - b}{a}, \ldots, \frac{\mathrm{Min}(u', u_{q,k}) - b}{a}\Big) : \frac{\mathrm{LCM}(a, c_k)}{a}\Big).$$
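As a concrete check of the Step 2 construction, consider the following numeric instance (our own illustration, not from the original text; an RSD $(l : u : s)$ is read as the integers $l, l+s, l+2s, \ldots$ not exceeding $u$). Take $a = 2$, $b = 1$ and the BSD $(0 : 20 : 2 : 6)$:
$$
\begin{aligned}
2i + 1 &\in (0:20:2:6) = (0:20:6) \cup (1:20:6),\\
0 \le 2i+1 \le 20 &\Rightarrow l' = 1,\ u' = 19,\\
(1:19:2) \cap (0:20:6) &= \emptyset \quad \text{(one of the $b_k$ intersections is empty),}\\
(1:19:2) \cap (1:20:6) &= (1 : \mathrm{Min}(19,20) : \mathrm{LCM}(2,6)) = (1:19:6) = \{1, 7, 13, 19\},\\
i &\in \Big(\tfrac{1-1}{2} : \tfrac{19-1}{2} : \tfrac{6}{2}\Big) = (0:9:3) = \{0, 3, 6, 9\},
\end{aligned}
$$
which agrees with direct enumeration of the indices $i$ for which $2i+1$ lies in the BSD, and exhibits the single common upper bound $\mathrm{Min}(u', u_{q,k})$.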

7 Among the intersections of RSDs, some may derive an empty set.


Step 3: We have proven that all the normalized BSDs in Fig. 2 can be represented by CSDs with a common upper bound. That is, in equation (3), $C_i$ and $C'_i$, $i = 1$ to $n$, are all CSDs with a common upper bound. Now we want to prove that $C_i \cap C'_i \cap (L_i : U_i : S_i)$ can still be represented by a CSD with a common upper bound. According to the proof in Step 2, both $C_i$ and $C'_i$ are CSDs with a common upper bound; assume the common upper bounds of $C_i$ and $C'_i$ are $u_1$ and $u_2$, respectively. According to the proof in Step 1,
$$C_1 \cap C_2 = \big((l^1_1, l^1_2, \ldots, l^1_n) : (u_1, \ldots, u_1) : S\big) \cap \big((l^2_1, l^2_2, \ldots, l^2_m) : (u_2, \ldots, u_2) : S'\big) = \bigcup_{i=1}^{n} \bigcup_{j=1}^{m} \big((l^1_i : u_1 : S) \cap (l^2_j : u_2 : S')\big) = \big((L_{1,1}, \ldots, L_{1,m}, L_{2,1}, \ldots, L_{2,m}, \ldots, L_{n,m}) : (\mathrm{Min}(u_1, u_2), \ldots, \mathrm{Min}(u_1, u_2)) : \mathrm{LCM}(S, S')\big). \quad \Box$$

Corollary 2. In Algorithm 2, $C' = C_1 \cap C_2$.

Proof. Without loss of generality, we assume that $C_1$ is obtained after Step 2 of Algorithm 2. In the case that $a \in C_1 \cap C_2$ and $a \in \mathrm{Enumerated\_Set\_C1} \cap \mathrm{Enumerated\_Set\_C2}$, it is trivial that $a \in C'$. Now consider the case that $a \in C_1 \cap C_2$ and $a > \mathrm{Max}(\mathrm{Enumerated\_Set\_C1} \cap \mathrm{Enumerated\_Set\_C2})$. We use Fig. 7 to illustrate this proof.

Fig. 7. $a - x \cdot \mathrm{LCM}(S_1, S_2) \in C'$.


First, since $a \in C_1 \cap C_2$, we have $a \in C_1$ and $a \in C_2 \Rightarrow a = l_{1,i} + S_1 \cdot p = l_{2,j} + S_2 \cdot q$ for some $i$, $j$, $p$ and $q$ with $1 \le i \le m$, $1 \le j \le n$, $p \ge 0$, $q \ge 0$. Because $\mathrm{LCM}(S_1, S_2)$ is a multiple of $S_1$ and $a = l_{1,i} + S_1 \cdot p$, we have $a - \mathrm{LCM}(S_1, S_2) \in C_1$ if $a - \mathrm{LCM}(S_1, S_2) \ge l_{2,1}$. Similarly, we have $a - \mathrm{LCM}(S_1, S_2) \in C_2$ if $a - \mathrm{LCM}(S_1, S_2) \ge l_{2,1}$. Since $a - \mathrm{LCM}(S_1, S_2) \in C_1$ and $a - \mathrm{LCM}(S_1, S_2) \in C_2$, we obtain $a - \mathrm{LCM}(S_1, S_2) \in C_1 \cap C_2$ if $a - \mathrm{LCM}(S_1, S_2) \ge l_{2,1}$. Repeating this argument, we obtain $a - 2 \cdot \mathrm{LCM}(S_1, S_2) \in C_1 \cap C_2$ if $a - 2 \cdot \mathrm{LCM}(S_1, S_2) \ge l_{2,1}$. Let $x$ be the maximum integer such that $a - x \cdot \mathrm{LCM}(S_1, S_2) \ge l_{2,1}$. According to the definition of Enumerated_Set_C1 and Enumerated_Set_C2, we have $a - x \cdot \mathrm{LCM}(S_1, S_2) \in \mathrm{Enumerated\_Set\_C1}$ and $a - x \cdot \mathrm{LCM}(S_1, S_2) \in \mathrm{Enumerated\_Set\_C2} \Rightarrow a - x \cdot \mathrm{LCM}(S_1, S_2) \in C' \Rightarrow a \in C'$. $\Box$
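The proof of Corollary 2 suggests a simple way to picture the enumeration-based intersection used in Stage 3 of Pure_Enumerating and Enumerating+Dio: enumerate and intersect only one $\mathrm{LCM}(S_1, S_2)$ period starting at the larger first lower bound, and let the result repeat with stride $\mathrm{LCM}(S_1, S_2)$ up to $\mathrm{Min}(u_1, u_2)$. The C sketch below is our own illustration of this idea, not the paper's Algorithm 2; the CSD representation and the example values are assumptions made only for the illustration.

```c
#include <stdio.h>

/* A CSD with a common upper bound: lower bounds l[0..n-1], common upper bound u,
 * stride s.  An integer a belongs to the CSD iff a <= u and a = l[i] + s*k
 * for some i and some k >= 0. */
typedef struct { const long *l; int n; long u; long s; } csd_t;

static long gcd(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }

static int csd_member(const csd_t *c, long a) {
    if (a > c->u) return 0;
    for (int i = 0; i < c->n; i++)
        if (a >= c->l[i] && (a - c->l[i]) % c->s == 0) return 1;
    return 0;
}

int main(void) {
    /* Two small example CSDs (values chosen only for illustration). */
    const long l1[] = {3, 5};
    const long l2[] = {5, 8};
    csd_t c1 = { l1, 2, 100, 6 };   /* ((3,5) : (100,100) : 6) */
    csd_t c2 = { l2, 2, 100, 9 };   /* ((5,8) : (100,100) : 9) */

    long period = c1.s / gcd(c1.s, c2.s) * c2.s;          /* LCM(S1,S2) = 18          */
    long start  = c1.l[0] > c2.l[0] ? c1.l[0] : c2.l[0];  /* larger first lower bound */

    /* Base elements of the intersection within one period; by the argument of
     * Corollary 2, the full intersection is this set repeated with stride
     * LCM(S1,S2) up to Min(u1,u2). */
    printf("period = %ld, base elements:", period);
    for (long a = start; a < start + period; a++)
        if (csd_member(&c1, a) && csd_member(&c2, a))
            printf(" %ld", a);
    printf("\n");   /* prints: period = 18, base elements: 5 17 */
    return 0;
}
```

In this example the base elements are 5 and 17, and the full intersection {5, 17, 23, 35, 41, 53, ...} is obtained by adding multiples of 18, which is exactly the periodicity exploited by the enumeration-based Stage 3.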

References

[1] J.C. Adams, W.S. Brainerd, J.T. Martin, B.T. Smith, J.L. Wagener, Fortran 90 Handbook: Complete ANSI/ISO Reference, Intertext Publications/McGraw-Hill Book Company, 1992.
[2] F. Bodin, P. Beckman, D. Gannon, S. Narayana, S. Yang, Distributed pC++: Basic ideas for an object parallel language, Scientific Programming 2 (3) (1993).
[3] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, M.Y. Wu, Compiling Fortran 90D/HPF for distributed memory MIMD computers, Journal of Parallel and Distributed Computing 21 (1) (1994) 15–26.
[4] B. Chapman, P. Mehrotra, H. Zima, Programming in Vienna Fortran, Scientific Programming 1 (1) (1992) 31–59.
[5] S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, S. Teng, Generating local addresses and communication sets for data parallel programs, in: Proceedings of the Fourth ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, May 1993, pp. 149–158.
[6] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, second ed., MIT Press, 2001.
[7] J. Dongarra, R. van de Geijn, D. Walker, A look at scalable dense linear algebra libraries, in: Proceedings of the Scalable High-Performance Computing Conference, April 1992, pp. 372–379.
[8] S.K.S. Gupta, S.D. Kaushik, S. Mufti, S. Sharma, C.H. Huang, P. Sadayappan, On compiling array expressions for efficient execution on distributed-memory machines, in: Proceedings of the International Conference on Parallel Processing, 1993, pp. 301–305.
[9] S. Hiranandani, K. Kennedy, C.-W. Tseng, Compiling Fortran D for MIMD distributed-memory machines, Communications of the ACM 35 (8) (1992) 66–80.
[10] S. Hiranandani, K. Kennedy, J. Mellor-Crummey, A. Sethi, Compilation techniques for block-cyclic distributions, in: Proceedings of the International Conference on Supercomputing, 1994, pp. 392–403.
[11] Y.C. Hu, G. Jin, S.L. Johnsson, D. Kehagias, N. Shalaby, HPFBench: A high performance Fortran benchmark suite, ACM Transactions on Mathematical Software 26 (1) (2000) 99–149.
[12] G.-H. Hwang, J.K. Lee, An expression-rewriting framework to generate communication sets for HPF, in: The 12th International Parallel Processing Symposium, April 1998, pp. 62–68.
[13] G.-H. Hwang, J.K. Lee, Communication set generations with CSD calculus and expression-rewriting framework, Parallel Computing 25 (1999) 1105–1130.
[14] G.-H. Hwang, J.K. Lee, Dz.C. Ju, Array operation synthesis scheme to optimize Fortran 90 programs, in: Proceedings of the ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, July 1995, pp. 112–122.
[15] G.-H. Hwang, J.K. Lee, Dz.C. Ju, Array operation synthesis to optimize HPF programs, in: Proceedings of the 1996 International Conference on Parallel Processing, vol. 3, August 1996, pp. 1–8.
[16] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, A global communication optimization technique based on data-flow analysis and linear algebra, ACM Transactions on Programming Languages and Systems 21 (6) (1999) 1251–1297.
[17] K. Kennedy, N. Nedeljkovic, A. Sethi, A linear-time algorithm for computing the memory access sequence in data-parallel programs, in: Proceedings of the Fifth ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, July 1995, pp. 102–111.
[18] K. Kennedy, N. Nedeljkovic, A. Sethi, Efficient address generation for block-cyclic distributions, in: Proceedings of the International Conference on Supercomputing, 1995, pp. 180–184.
[19] C. Koelbel, Compile-time generation of regular communications patterns, in: Proceedings of Supercomputing '91, 1991, pp. 101–110.
[20] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, M. Zosel, The High Performance Fortran Handbook, MIT Press, Cambridge, 1994.
[21] K.-P. Shih, J.-P. Sheu, C.H. Huang, Table-lookup approach for compiling two-level data-processor mappings in HPF, in: Workshop on Languages and Compilers for Parallel Computing, August 1997, pp. 493–510.
[22] O.-Y. Kwon, T.-G. Kim, T.-D. Han, S.-B. Yang, S.-D. Kim, An efficient local address generation for the block-cyclic distribution, in: IEEE 1997 3rd International Conference on Algorithms and Architectures for Parallel Processing, 1997, pp. 389–396.
[23] J.K. Lee, D. Gannon, Object-oriented parallel programming: Experiments and results, in: Proceedings of Supercomputing '91, November 1991.
[24] H.J. Sips, K. van Reeuwijk, W. Denissen, Analysis of local enumeration and storage schemes in HPF, in: Proceedings of the International Conference on Supercomputing, 1996, pp. 10–17.
[25] J. Stichnoth, D. O'Hallaron, T. Gross, Generating communication for array statements: design, implementation, and evaluation, Journal of Parallel and Distributed Computing (1994) 150–159.
[26] H. Zima, B. Chapman, Supercompilers for Parallel and Vector Computers, Addison-Wesley Publishing Company, 1990.