Computer Communications 22 (1999) 1017–1033
Partitioning and scheduling loops on NOWs

S. Chen, J. Xue*
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia
Abstract

This paper addresses the problem of partitioning and scheduling loops for a network of heterogeneous workstations. By isolating the effects of send and receive and quantifying the impact of network contention on the overall communication cost, a simple yet accurate cost model for predicting the communication overhead for a pair of workstations is presented. The processing capacities of all workstations in a network are modelled based on their CPU speeds and memory sizes. Based on these models, loop tiling is used extensively to partition and schedule loops across the workstations. By adjusting tile sizes, i.e. the granularities of tasks, the impact of the heterogeneity arising from program, processor and network is minimised. Experimental results on an Ethernet of seven DEC workstations demonstrate the effectiveness of our models and parallelisation strategies. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Partitioning; Scheduling; Networks of workstations; Communication cost model
1. Introduction

Networks of workstations (NOWs) are gaining popularity as promising lower-cost alternatives to massively parallel processors because of their ability to leverage commodity high-performance workstations and networks. For example, a network of 1024 DEC Alpha workstations had a peak performance of 150 GFLOPS, while a CM-5 of the same size had a peak rate of only 128 GFLOPS [1]. In addition, with the availability of software such as PVM [2] and MPI [3], a network of heterogeneous workstations can be effectively harnessed to run programs that would otherwise have been possible only on a myriad of multiprocessors. While emerging networks such as Myrinet [4] promise better performance, standard network interfaces (even ATM or FDDI) still have larger latencies than the proprietary networks found in multicomputers and multiprocessors. Therefore, NOWs are currently used as a coarse-grain parallel processing platform.

To achieve good performance on NOWs, three issues must be adequately addressed.

• High communication overhead: since networks were not originally developed for parallel processing, communication overheads over many networks are high. For example, contention on bus-based networks makes communication overheads severe and hard to predict [5]. Therefore, an accurate communication cost model
* Corresponding author.
for each pair of workstations is required to guide parallelism exploitation.
• Heterogeneity: most networks consist of different types of workstations, such as DEC, HP, SGI, SUN and PC. Even among workstations from the same vendor, differences in CPU speed and memory size introduce heterogeneity. These heterogeneities may result in severe load imbalance. Heterogeneity arising from programs, such as loops with irregular iteration spaces, is another source of load imbalance [6,7].
• Real-time variable workload: unlike dedicated parallel machines, most workstations on a network support multiuser environments, and it is impractical to use an entire network in a dedicated way. Because the workload on a workstation changes in real time, dynamic load balancing is required [8].

Both hardware and software solutions have been proposed to address these three issues. Gillett and Kaufmann improve communication performance using a so-called memory channel network [9]. Cierniak et al. tackle heterogeneity in three aspects (program, processor and network) when scheduling parallel loops on NOWs [8,10]. They present an architecture-conscious algorithm for finding optimal and sub-optimal time schedules by minimising load imbalance and communication cost. Wang and Blum consider a class of applications formulated as iterative functions, such as neural network simulation, shortest path and linear equation computing [11]. They develop a communication cost model called reduced overhead
0140-3664/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0140-3664(99)00073-0
do i = 1, N1
  do j = 1, N2
    a(i, j) = a(i-1, j) + a(i-1, j-1)

D = [ 1  1
      1  0 ]

[Fig. 1. An example: (a) the double loop; (b) its dependence matrix D, whose columns are the dependence vectors (1,1) and (1,0); (c) its iteration space, in which the arrows represent the data dependences between iterations.]
cluster communication (ROCC) and provide two implementations: priority ROCC and parallel ROCC. Finally, many software systems have been developed over the years to support parallel and distributed computing on NOWs; examples include Linda [12], p4 [13], PVM [2], MPI [3] and ACE [14].

In this paper, we investigate how to partition and schedule loops to achieve high performance on a network of heterogeneous workstations. We present a cost model for characterising the communication overhead between a pair of workstations that takes both send and receive overheads and network contention into account. The overheads for send and receive are quantified separately, allowing their effects on the overall communication cost to be predicted accurately. To minimise load imbalance, the processing capacities of workstations are modelled based on the heterogeneity in both CPU speeds and memory sizes. To generate the coarse-grain parallelism required for efficient execution on NOWs, we make extensive use of loop tiling to partition and schedule loops. By aggregating the iterations of a program's iteration space into tiles and scheduling tiles across the workstations, tiling reduces interprocessor communication overhead by amortising large startup overheads. Based on our models for communication overheads and the processing capacities of workstations, we partition the iteration space into tiles of the same shape but different sizes to eliminate the heterogeneity in program, processor and network. A technique for statically scheduling doall loops is also provided. Our experimental results on kernel programs demonstrate the effectiveness of our parallelisation strategies.

The rest of the paper is organised as follows. Section 2 introduces our program model and defines the problem addressed. Section 3 quantifies the processing capacities of workstations and communication overheads in a heterogeneous setting.
Section 4 is concerned with deriving optimal tilings for a network of homogeneous workstations. By dividing the iteration space into tiles of the same size and shape, the problem of tiling the iteration space optimally amounts to finding a matrix transformation that minimises the total execution time. In Section 5, we consider the general case when loops are executed on a network of heterogeneous workstations. To reduce the performance impact of heterogeneity arising from machines and programs, we divide the iteration space into tiles of the same shape but different sizes. Section 6 presents our experimental results. In Section 7, the related work is
reviewed and compared. Finally, the paper is concluded in Section 8.

2. Background

2.1. Program model

This paper considers fully permutable double loops with constant data dependences [15]. A loop nest is fully permutable if all its dependence vectors are nonnegative [16]. Our results can be applied to a higher-dimensional loop nest when only two of its adjacent loops are targeted. A double loop has the following form:

do I = 1, N
  do J = p1*I + q1, p2*I + q2
    H(I, J)

where H(I, J) denotes the loop body. We use D ∈ Z^(2×m) to represent the dependence matrix whose columns are the m dependence vectors of the program: D = [d_1, d_2, …, d_m]. The iteration space of the double loop, denoted I, is the set of all iterations defined by the bounds of loops I and J:

  I = {(I, J) | 1 ≤ I ≤ N, p1·I + q1 ≤ J ≤ p2·I + q2}.

The iteration space is assumed to be either rectangular (p1 = p2 = 0) or triangular (p1 = 0 or p2 = 0, but not both). Fig. 1 gives a double loop with a rectangular iteration space: (a) lists the source code; (b) gives the dependence matrix; and (c) depicts the iteration space, where the arrows represent the data dependences between the iterations.

2.2. Tiling

Tiling decomposes a two-deep loop nest into a four-deep loop nest, where the outer two loops step between tiles and the inner two loops enumerate the iterations within a tile. Fig. 2 depicts a tiling of the program in Fig. 1: part (a) gives the tiled code and part (b) shows the tiled iteration space. Since all tiles are identical in size and shape, a tiling transformation can be represented by characterising the fundamental tile, i.e. the tile at the origin (0,0), as illustrated in part (c) of Fig. 2. The fundamental tile can be defined either by an integral matrix P ∈ Z^(2×2), whose columns are its two edge vectors emitting from the origin, or by a rational matrix H ∈ Q^(2×2), whose rows are the normal vectors to its two edge vectors adjoining at the origin. Thus,
as illustrated in Parts (c) and (d) of Fig. 2, P and H satisfy P = H^(−1).

Tiling is not always legal. A tiling transformation is legal if and only if the transformed dependences are lexicographically positive, i.e. HD ≥ 0 [17]. The tile shape is determined by its edges, i.e. p_1 and p_2 in P, or h_1 and h_2 in H. The tile size is the number of integral points contained in the tile, i.e. |det P| = 1/|det H|. Both the tile shape and the tile size have a significant impact on the performance of tiled programs [16,18–22].

For the program of Fig. 1, the tiled code and the rectangular 2×2 tiling of Fig. 2 are:

do ii = 1, N1, n1
  do jj = 1, N2, n2
    do i = ii, min(ii+n1-1, N1)
      do j = jj, min(jj+n2-1, N2)
        a(i, j) = a(i-1, j) + a(i-1, j-1)

  P = [p_1, p_2] = [ 2  0        H = [ h_1 ] = [ 1/2   0
                     0  2 ],           h_2 ]    [  0   1/2 ]

[Fig. 2. Tiling of the program in Fig. 1: (a) the tiled code; (b) the rectangular tiling of size 2×2; (c) the fundamental tile; (d) the clustering matrix; (e) the tiling transformation.]

2.3. Problem statement

The problem of deriving optimal tilings for a given cost function amounts to finding a tiling transformation H such that the cost function is minimised. Depending on what is emphasised, several cost functions have been used, including execution time [23,24], communication volume [25–29], idle time [18,19], cache miss ratio [20,21,30] and data reuse [16,22,31]. Our cost function is execution time, because it reflects the combined effect of many factors such as network contention and communication overhead.

Consider the problem of finding a tiling transformation H that minimises the total execution time on a network of P workstations, numbered 1 to P throughout this paper. Let tiles be scheduled to the P workstations in some fashion, and let T_i be the time elapsed from when the first workstation starts until workstation i finishes (1 ≤ i ≤ P). The total execution time of the program, T_total, is the time elapsed on the workstation that finishes last. Hence

  T_total = max{T_1, T_2, …, T_P}.

Thus, the problem of finding the optimal tiling can be formulated as:

  Minimise   max{T_1, T_2, …, T_P}
  Subject to HD ≥ 0    (1)

where HD ≥ 0 ensures that the data dependences are respected. Other constraints may be added if desired. Let T_last be the finishing time of the workstation that finishes last. The above optimisation problem then becomes:

  Minimise   T_last
  Subject to HD ≥ 0.

To find optimal tilings, we need to develop a closed-form formula for T_last and solve the optimisation problem analytically, since T_last is usually non-linear.

3. Machine model
A network consists of a number of workstations called nodes. We present two models for characterising a network of heterogeneous workstations: a node model and a network model. The node model characterises the computational power of a single node by considering heterogeneity in both processor and memory. The network model predicts the communication overhead between a pair of nodes by isolating the effects of send and receive and quantifying the impact of network contention on the overall communication cost.

3.1. Node model

The nodes in a network may be homogeneous but are more often heterogeneous. To minimise load imbalance and the total execution time on a network of heterogeneous workstations, a node model is required to estimate the processing capacity of each workstation. In practice, many factors contribute to the processing capacity of a node; some important ones are the number of processors on the node, the speed of each processor, instruction-level parallelism and the memory hierarchy. The multitude of machine parameters and the intricate relationships among them make their use in performance prediction very difficult. In this paper, we assume that there is only one processor on each node. For each node i, we consider two factors, processor speed and memory size, and use three parameters:

• w_i: the processor speed, which is application-independent;
• tc_i: the execution time of one iteration, which is application-dependent; and
• f_i: the memory size, which is application-independent;

where w_i and f_i can be obtained from the manufacturer of the machine and tc_i is estimated at compile time.

The processing capacity of a node is application-dependent. However, the relative processing capacities of two nodes over the same range of applications are fairly stable [32].
This means that, in most practical cases, a node with a faster processor and/or a larger memory should perform better than one without. To validate this observation, we executed three kernels (MM—Matrix Multiply, LUD—LU decomposition,
SOR—Successive Over-Relaxation) on five workstations that differ in speed and/or memory size. The same problem size was used when the same kernel was run on different nodes. The execution times are shown in Table 1 and further interpreted in Fig. 3.

Table 1
Execution times of three kernels on five workstations

Workstation  CPU type  CPU speed (MHz)  Memory size (MB)  MM (s)   LUD (s)  SOR (s)
Vieta        MIPS       33               32               16.574   20.361   14.922
Neumann      MIPS       40              256               10.239   17.528   12.586
Church       ALPHA     150              128                4.807    4.417    4.760
Euclid       ALPHA     150              256                4.586    4.130    3.989
Turing       ALPHA     333              640                1.327    1.252    0.80

[Fig. 3. Impacts of processor speed and memory size: execution times of MM, LUD and SOR for Neumann vs. Euclid (memory size fixed at 256 MB; 40 MHz vs. 150 MHz CPU) and for Church vs. Euclid (CPU speed fixed at 150 MHz; 128 MB vs. 256 MB memory).]

Our experimental results lead to the following two conclusions:

1. Both processor speed and memory size affect the performance of a workstation, and the former does so more significantly than the latter (Fig. 3).
2. The execution times of the three applications across the five nodes are consistent: an application that runs the fastest on one node runs the fastest on almost all other nodes, an application that runs the slowest on one node runs the slowest on almost all other nodes, and so on.

Therefore, we estimate the processing capacity of a node as follows.

Definition 1 (Processing Capacity). The processing capacity of a node i is a linear combination of its CPU speed w_i and its memory size f_i:

  C_i = k1·w_i + k2·f_i    (2)

where k1 and k2 are the coefficients of w_i and f_i, respectively.

The two coefficients k1 and k2 are obtained as follows. For each application, we define the processing capacity of the slowest node, called the base node, to be C_base = 1, and the capacity of node i as C_i = T_base/T_i, where T_base and T_i are the execution times of the application on the base node and on node i, respectively. This gives rise to Table 2, where Vieta is the base node in all three cases. Next, we calculate k1 and k2 individually for each application by fitting Eq. (2) to the data in each of the three columns under "Processing capacities" in Table 2 using Splus, a statistical modelling tool [33]; the results appear in Table 3. Finally, we take the averages over the three applications as the coefficients k1 and k2, given in the last row of Table 3. Hence, our node model is:

  C_i = 0.0267·w_i + 0.0057·f_i.

Applying this formula yields the processing capacities of the seven DEC workstations listed in Table 4. Note that Church and Kleene are identical to Babbage. The table also gives the normalised node capacities of the seven workstations relative to that of the base node [10].

3.2. Network model

In this section, we present a communication cost model for quantifying the communication overhead between a pair of workstations in a heterogeneous environment. In the model, the communication overheads for send and receive are quantified separately, allowing their effects on the overall communication cost to be predicted more accurately than otherwise. Furthermore, the impact of network contention on the overall communication cost is also taken into account.
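The coefficient fitting described above can be sketched with plain no-intercept least squares. This is an illustrative stand-in for the Splus fit used in the paper (whose exact regression settings are not specified), so the resulting coefficients need not match Table 3 digit for digit; the speed, memory and capacity data are taken from Tables 1 and 2.

```python
import numpy as np

# CPU speed (MHz) and memory size (MB) of the five workstations.
w = np.array([33.0, 40.0, 150.0, 150.0, 333.0])   # w_i
f = np.array([32.0, 256.0, 128.0, 256.0, 640.0])  # f_i

# Measured processing capacities C_i = T_base / T_i per kernel (Table 2).
capacities = {
    "MM":  np.array([1.000, 1.187, 3.448, 3.614, 12.497]),
    "LUD": np.array([1.000, 1.165, 4.610, 4.930, 16.265]),
    "SOR": np.array([1.000, 1.186, 3.135, 3.741, 18.653]),
}

# Fit C_i ~ k1*w_i + k2*f_i (no intercept) per application, then average
# the per-application coefficients, as in the last row of Table 3.
A = np.column_stack([w, f])
fits = [np.linalg.lstsq(A, c, rcond=None)[0] for c in capacities.values()]
k1, k2 = np.mean(fits, axis=0)
print(f"k1 = {k1:.4f}, k2 = {k2:.4f}")
```

With these data the averaged coefficients come out in the neighbourhood of the paper's k1 = 0.0267 and k2 = 0.0057, confirming that speed dominates memory in the node model.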
Table 2
Processing capacities based on individual applications

Workstation  CPU type  CPU speed (MHz)  Memory size (MB)  MM      LUD     SOR
Vieta        MIPS       33               32               1.000   1.000   1.000
Neumann      MIPS       40              256               1.187   1.165   1.186
Babbage      ALPHA     150              128               3.448   4.610   3.135
Euclid       ALPHA     150              256               3.614   4.930   3.741
Turing       ALPHA     333              640               12.497  16.265  18.653

Table 3
Coefficients k1 and k2

Application  Coefficient k1  Coefficient k2  Residual standard error
MM           0.0226          0.0061          1.02
LUD          0.0351          0.0048          1.70
SOR          0.0225          0.0062          1.18
Average      0.0267          0.0057          1.30

3.2.1. Send overhead

To transfer a message from one node to another, the sender packs the message, establishes a link with the receiver and places the message on the network. Once this point is reached, a nonblocking send is complete, but a blocking send must further wait for an acknowledgement from the receiver that the message has been safely received. Although the network latency can be partially hidden in the case of a nonblocking send, the benefit is felt only at the local node rather than at its partner if no additional hardware exists to support communication overlap. Therefore, we use blocking sends to estimate the impact of send on the overall performance. The communication cost for sending a message is estimated by a first-order polynomial:

  T_s = α_s + β_s·B    (3)

where α_s is the startup time, β_s is the time taken to transfer one byte and B is the total number of bytes sent.

Another issue that must be addressed in network-based computing is network contention. The workstations on a network often share physical media and equipment, such as cables and switches. Collisions may occur when more than one node attempts to use this hardware at the same time. Intuitively, the more nodes in use, the higher the network contention. In regularly structured computations, communication usually takes place between "neighbouring" nodes, so we incorporate the impact of network contention into Eq. (3) as follows:

  T_s = α_s + β_s·B + γ·(P − 1)

where P is the number of workstations in the network and γ is the network contention overhead coefficient.

Definition 2 (Complete Send and Send Overhead). A complete send is the process of sending a message until the message arrives at its destination. Send overhead is the time elapsed during a complete send.

We measure send overheads using the ping-pong technique [34], as illustrated in Fig. 4(a). The sender first sends a message to the receiver. As soon as the arrival of the message is detected (most parallel programming systems provide such a system call, e.g. pvm_probe() in PVM), the receiver sends the same message back. The whole process takes approximately twice the time of a complete send. To include the impact of network contention on the send overhead, we performed a series of experiments, varying the number of processors used in each experiment. The measured send overheads are presented in Table 5.

3.2.2. Receive overhead

Receiving a message involves two stages: (a) waiting, in which the receiver busy-tests whether the message has arrived; and (b) receiving, in which the receiver moves the message from the system buffer to the application memory. The time elapsed in the waiting stage is called the idle time. Because the idle time depends on many factors and is hard to predict at compile time, it is not represented in our receive model. In Section 5, we show that the idle time can be reduced by arranging workstations appropriately in the network.

Definition 3 (Receive Overhead). Receive overhead is the time spent moving a message from the system space to the application space.

Because network contention is already represented in the send overhead (3), the receive overhead is defined as:

  T_r = α_r + β_r·B

where α_r is the startup time, β_r is the time taken to receive one byte and B is the total number of bytes received.

Fig. 4(b) shows that the receive overhead is easier to measure than the send overhead. The sender sends a message to the receiver. Once the arrival of the message is detected, the receiver turns on its timer (t1) and begins to receive the message.
Table 4
Processing capacities of seven DEC workstations

Workstation  CPU type  CPU speed (MHz)  Memory size (MB)  Processing capacity  Normalised capacity
Vieta        MIPS       33               32               1.06                 1
Neumann      MIPS       40              256               2.53                 2
Babbage      ALPHA     150              128               4.73                 5
Church       ALPHA     150              128               4.73                 5
Kleene       ALPHA     150              128               4.73                 5
Euclid       ALPHA     150              256               5.46                 5
Turing       ALPHA     333              640               12.54                12
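As a check on Table 4, the node model C_i = 0.0267·w_i + 0.0057·f_i can be evaluated directly on the seven workstations' published speeds and memory sizes; this sketch reproduces the listed capacities.

```python
# Node model of Section 3.1 applied to the seven DEC workstations (Table 4).
workstations = {
    #           (w_i in MHz, f_i in MB)
    "Vieta":   (33, 32),
    "Neumann": (40, 256),
    "Babbage": (150, 128),
    "Church":  (150, 128),
    "Kleene":  (150, 128),
    "Euclid":  (150, 256),
    "Turing":  (333, 640),
}

K1, K2 = 0.0267, 0.0057  # averaged coefficients from Table 3

capacity = {name: K1 * w + K2 * f for name, (w, f) in workstations.items()}
for name, c in capacity.items():
    print(f"{name:8s} {c:6.2f}")
```

Rounded to two decimals, the computed values match the "Processing capacity" column of Table 4 exactly.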
Having received the message, the receiver turns off its timer (t2). The time interval t2 − t1 gives the receive overhead. The measured receive overheads are given in Table 5.
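Given a set of timed transfers for one pair of workstations, the startup and per-byte parameters of the linear overhead model can be recovered with an ordinary line fit. The message sizes and "measurements" below are synthetic stand-ins for the ping-pong data of Fig. 4; in practice the times would come from the t2 − t1 intervals just described.

```python
import numpy as np

# Hypothetical receive parameters used to synthesise idealised measurements.
alpha_true, beta_true = 0.0078, 3.0e-6        # startup (s), per-byte (s/byte)
sizes = np.array([1_000, 10_000, 50_000, 100_000, 500_000], dtype=float)
times = alpha_true + beta_true * sizes        # "measured" overheads T = alpha + beta*B

# Least-squares line fit T = alpha + beta*B; polyfit returns [slope, intercept].
beta, alpha = np.polyfit(sizes, times, 1)
print(f"alpha = {alpha:.6f} s, beta = {beta:.2e} s/byte")
```

On noise-free data the fit recovers the parameters essentially exactly; with real, noisy ping-pong measurements the same fit yields the α and β entries of Table 5.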
[Fig. 4. Ping-pong experiment for measuring communication overhead: (a) the send overhead, measured as half the probe()/send() round-trip time; (b) the receive overhead, measured at the receiver between probe() and the completion of recv().]

4. Tiling for homogeneity

In this section, we consider the problem of parallelising and scheduling homogeneous loops on a network of homogeneous workstations. That is, we assume that all loops have rectangular iteration spaces, all nodes have the same processing capacity and all pairs of nodes have the same communication overhead. We find optimal tilings by distinguishing loops into four cases according to their dependence patterns. Let cone{x_1, x_2, …, x_n} be the cone generated by the n vectors x_1, x_2, …, x_n [35], and let C(D) = cone{d_1, d_2, …, d_m} be the dependence cone generated by the dependence vectors in D = [d_1, d_2, …, d_m]. We distinguish four types of fully permutable loops:

Case (a): C(D) = ∅;
Case (b): C(D) has the form cone{(a,b)}, where a and b are nonnegative and (a,b) ≠ (0,0);
Case (c): C(D) has the form cone{(1,0), (a,b)} or cone{(0,1), (a,b)}, where a and b are positive; and
Case (d): C(D) = cone{(a,b), (c,d)}, where a, b, c and d are nonnegative, (a,b) ≠ (0,0) and (c,d) ≠ (0,0).

Note that each case is contained in the subsequent case. When the iteration space is rectangular, block distribution is always optimal [24]. For simplicity, we assume that P divides both N1 and N2.

4.1. Case (a)

In the absence of data dependences, both loops can be parallelised. Such parallelism can be represented as:

doall I = 1, N1
  doall J = 1, N2
    H(I, J)

The outer loop will be tiled with blocks of size N1/P.

4.2. Case (b)

In this case, C(D) = cone{(a,b)}. Two subcases are distinguished. If a = 0 or b = 0, the program can be parallelised exactly as in Case (a); Fig. 5(I) illustrates the situation when a = 0. Otherwise, we can apply a unimodular transformation so that, in the transformed program, the outer loop is a doall loop and the iteration space is a parallelogram with the shape shown in Fig. 5(II). A simple-minded tiling as shown in Fig. 5(II) will lead to load imbalance. Such a program will be dealt with either as in Case (c), discussed next, or as discussed in Section 5.5.

4.3. Case (c)

A loop interchange, if necessary, can always put the program into the form:

do I = 1, N1
  doall J = 1, N2
    H(I, J)

such that C(D) = cone{(1,0), (a,b)}, where a and b are positive.
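The block distribution used for Case (a) can be sketched directly; the helper below partitions the outer doall loop's iteration range into P contiguous blocks (the function name and the divisibility assumption mirror the simplification made in the text).

```python
# Case (a): no dependences, so the outer doall loop is split into
# contiguous blocks of N1/P iterations, one block per workstation.
def block_distribute(n_iters, n_nodes):
    """Return the (start, end) iteration range (1-based, inclusive)
    owned by each node; assumes n_nodes divides n_iters, as in the text."""
    assert n_iters % n_nodes == 0
    block = n_iters // n_nodes
    return [(i * block + 1, (i + 1) * block) for i in range(n_nodes)]

print(block_distribute(12, 3))
```

For N1 = 12 and P = 3 each node receives four consecutive outer iterations, so all nodes perform identical amounts of work with no communication.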
Table 5
Communication parameters for all pairs of workstations (rows: sending workstation; columns: partner workstation)

Sender     Param  Vieta      Neumann    Babbage/Church/Kleene  Euclid     Turing
Vieta      α_s    –          0.0134380  0.0134350              0.0123190  0.0105910
           β_s    –          0.0000088  0.0000051              0.0000050  0.0000045
           γ      –          0.0045610  0.0036390              0.0022120  0.0017980
           α_r    –          0.0078140  0.0078140              0.0078140  0.0056340
           β_r    –          0.0000030  0.0000030              0.0000020  0.0000005
Neumann    α_s    0.0108400  –          0.0085410              0.0085410  0.0072170
           β_s    0.0000100  –          0.0000098              0.0000099  0.0000031
           γ      0.0053430  –          0.0051350              0.0050840  0.0048420
           α_r    0.0078140  –          0.0078140              0.0078140  0.0053240
           β_r    0.0000030  –          0.0000030              0.0000030  0.0000009
Babbage/   α_s    0.0042230  0.0040810  0.0035320              0.0035320  0.0032730
Church/    β_s    0.0000078  0.0000078  0.0000056              0.0000056  0.0000012
Kleene     γ      0.0055240  0.0055340  0.0046190              0.0047450  0.0042130
           α_r    0.0029250  0.0029730  0.0021730              0.0021730  0.0019520
           β_r    0.0000011  0.0000009  0.0000005              0.0000005  0.0000003
Euclid     α_s    0.0040390  0.0041740  0.0035320              –          0.0032730
           β_s    0.0000061  0.0000077  0.0000056              –          0.0000012
           γ      0.0054310  0.0053330  0.0044730              –          0.0042130
           α_r    0.0029250  0.0029250  0.0029250              –          0.0019520
           β_r    0.0000011  0.0000011  0.0000011              –          0.0000006
Turing     α_s    0.0033200  0.0031390  0.0030360              0.0029810  –
           β_s    0.0000041  0.0000042  0.0000008              0.0000008  –
           γ      0.0020150  0.0020330  0.0020530              0.0020420  –
           α_r    0.0019520  0.0019520  0.0019520              0.0019520  –
           β_r    0.0000000  0.0000000  0.0000000              0.0000000  –
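To illustrate how the entries of Table 5 are used, the sketch below evaluates the send-overhead model of Section 3.2 (Eq. (3) plus the contention term) with the measured Vieta-to-Neumann parameters; the message size and node count are arbitrary example values.

```python
# Send-overhead model: T_s = alpha_s + beta_s*B + gamma*(P - 1), in seconds.
def send_cost(alpha_s, beta_s, gamma, n_bytes, n_nodes):
    return alpha_s + beta_s * n_bytes + gamma * (n_nodes - 1)

# Vieta sending to Neumann (parameters from Table 5).
alpha_s, beta_s, gamma = 0.0134380, 0.0000088, 0.0045610
t = send_cost(alpha_s, beta_s, gamma, n_bytes=1000, n_nodes=7)
print(f"predicted overhead for a 1000-byte message on 7 nodes: {t:.6f} s")
```

Note that for a 1 KB message on the full seven-node network the contention term γ·(P − 1) dominates both the startup and the per-byte terms, which is why the models of Sections 4 and 5 carry it explicitly.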
Fig. 6 illustrates how the iteration space is tiled in this case. The basic idea is to divide the iteration space into parallelograms in such a way that all nodes can start executing their first tiles simultaneously, which is possible if a tile has a prescribed shape. Consider the tile at the origin in Fig. 6. Our parallelisation strategy is feasible if the axis J splits the tile into two right triangles; n1 and n2 denote the numbers of iterations along its two edges. In the example, there are two iterations along each edge, so n1 = n2 = 2.

Let δ_max = arctan(b/a). Because C(D) = cone{(1,0), (a,b)}, δ_max is the angle between the axis I and the dependence vector with the largest slope. Let θ be the angle between the axis I and the edges of a tile that are not parallel to the axis I. To ensure that a tiling is legal, we must have θ ≥ δ_max [29,36]. When loop J is block distributed, we have n2 = N2/P, which implies that n1 = n2/tan θ = N2/(P·tan θ). Each node is therefore allocated N1/n1 = (N1/N2)·P·tan θ tiles and computes a total of N1·N2/P iterations. As illustrated in Fig. 6, the total execution time for the program is:

  T = T_comp + T_recv + T_idle = T_comp + T_recv + T_send
    = (N1·N2/P)·tc + (α_s + α_r)·P·(N1/N2)·tan θ + γ·P·(P − 1)·(N1/N2)·tan θ + (β_s + β_r)·(N1·N2/P)

where tc is the execution time of a single iteration.
[Fig. 5. Tiling for Case (b): (I) the case a = 0, where blocks are distributed along J over Nodes 1–3; (II) the parallelogram iteration space obtained after a unimodular transformation.]
[Fig. 6. An optimal tiling for Case (c): each of the three nodes executes a strip of parallelogram tiles; the accompanying timeline shows the computation, idle, send and receive phases.]
The total execution time is minimal when θ = δ_max. Hence the optimal tile size is:

  n1_opt = n2/tan δ_max = N2/(P·tan δ_max),   n2_opt = N2/P.

The tiling in Fig. 6 is optimal as far as our parallelisation strategy is concerned: since N1 = N2 = 6, D = [1 1; 1 0] and P = 3, we have tan θ = tan δ_max = 1, n1 = N2/(P·tan δ_max) = 2 and n2 = N2/P = 2.

4.4. Case (d)

A program not covered by any of the preceding three cases is parallelised as follows. Such a program can be expressed as doacross loops:

doacross I = 1, N1
  doacross J = 1, N2
    H(I, J)

Unlike Case (c), it may not be possible for all nodes to start executing at the same time. Our parallelisation strategy in this case is illustrated in Fig. 7. The iteration space is tiled by rectangles of size n1 × n2. Assuming that loop I is block distributed, we have n1 = N1/P. Due to the data dependences, the last node always completes last: it cannot start until all other nodes have executed their first tiles, after which it executes the N2/n2 tiles allocated to it. So the total execution time of the last node is:

  T = T_comp + T_recv + T_idle
    = (N2/n2 + P − 1)·[n1·n2·tc + α_s + β_s·n2 + γ·(P − 1) + α_r + β_r·n2]
    = (N1·N2/P)·tc + (P − 1)·[(N1/P)·tc + β_s + β_r]·n2 + (α_s + α_r + γ·(P − 1))·(N2/n2)
      + (P − 1)·(α_s + α_r + γ·(P − 1)) + (β_s + β_r)·N2.

By the inequality of arithmetic and geometric means, T is minimal when

  (P − 1)·[(N1/P)·tc + β_s + β_r]·n2 = (α_s + α_r + γ·(P − 1))·(N2/n2).

Hence, the optimal tile size is found to be:

  n1_opt = N1/P,
  n2_opt = sqrt( P·(α_s + α_r + γ·(P − 1))·N2 / ((P − 1)·(N1·tc + P·(β_s + β_r))) )    (4)
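As a numerical sanity check on the Case (d) analysis, the sketch below evaluates the expanded execution-time model T(n2) as reconstructed here and confirms by brute force that the closed-form AM-GM minimiser is (near-)optimal over integer tile widths. All parameter values are illustrative, not measurements from Table 5.

```python
import math

# Illustrative problem and machine parameters (not measured values).
N1, N2, P = 240, 240, 4
tc = 1.0e-5                                    # time per iteration (s)
a_s, b_s, a_r, b_r, g = 0.004, 5.0e-6, 0.003, 2.0e-6, 0.002

def T(n2):
    """Expanded execution time of the last node for tile width n2."""
    comm = a_s + a_r + g * (P - 1)
    return (N1 * N2 / P) * tc \
        + (P - 1) * ((N1 / P) * tc + b_s + b_r) * n2 \
        + comm * N2 / n2 \
        + (P - 1) * comm + (b_s + b_r) * N2

# Closed-form optimum from the AM-GM argument (Eq. (4)).
n2_opt = math.sqrt((a_s + a_r + g * (P - 1)) * N2
                   / ((P - 1) * ((N1 / P) * tc + b_s + b_r)))

best = min(range(1, N2 + 1), key=T)   # brute-force over integer tile widths
print(f"closed form: {n2_opt:.2f}, brute force: {best}")
```

The brute-force minimiser lands on the integer adjacent to the closed-form value, as expected for a convex cost of the form A·n2 + B/n2 plus constants.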
[Fig. 7. An optimal tiling for Case (d): each of the three nodes executes a strip of rectangular tiles in pipelined fashion; the accompanying timeline shows the computation, idle, send and receive phases.]
Fig. 8. Heterogeneous vs. homogeneous tilings for rectangular tiles.
The minimal execution time is:

  T_opt = (N1·N2/P)·tc + 2·sqrt( (P − 1)·(α_s + α_r + γ·(P − 1))·((N1/P)·tc + β_s + β_r)·N2 )
          + (P − 1)·(α_s + α_r + γ·(P − 1)) + (β_s + β_r)·N2.
5. Tiling for heterogeneity

In this section, we describe how to tile loops with rectangular and triangular iteration spaces on a network of heterogeneous workstations.
5.1. Heterogeneous networks

Table 5 indicates that the communication overheads for different pairs of workstations are different. In the case of n mutually different workstations, there can be up to 5n(n − 1)/2 different communication parameters to deal with, five per pair of workstations. This leads to a highly complex objective function in (1), making the underlying optimisation problem difficult to solve. By taking the "average" of the communication overheads over all pairs of workstations, we can use a single communication model for all pairs. This reduces the total number of communication parameters to just five and allows us to generalise our previous results in a homogeneous setting. Note that the workstation numbered 1 does not receive any messages and the workstation numbered P does not send any messages. The cost for sending a message between a pair of workstations is:

  T_s = ᾱ_s + β̄_s·B + γ̄·(P − 1)

where ᾱ_s, β̄_s and γ̄ are defined as:

  ᾱ_s = (1/(P − 1))·Σ_{i=1..P−1} α_s,i,   β̄_s = (1/(P − 1))·Σ_{i=1..P−1} β_s,i,   γ̄ = (1/P)·Σ_{i=1..P} γ_i    (5)

The subscript i identifies the number assigned to a workstation. Because all P workstations contribute to the network contention, γ̄ is the average of the network contention coefficients of all workstations. Similarly, we have the following cost model for the receive overhead:

  T_r = ᾱ_r + β̄_r·B

where ᾱ_r and β̄_r are defined as:

  ᾱ_r = (1/(P − 1))·Σ_{i=2..P} α_r,i,   β̄_r = (1/(P − 1))·Σ_{i=2..P} β_r,i    (6)

Finally, the execution time of a single iteration is also taken as the average of the execution times on all P workstations:

  t̄c = (1/P)·Σ_{i=1..P} tc_i    (7)
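The averaging rules of Eqs. (5)–(7) can be sketched directly. The per-node parameter lists below are illustrative stand-ins for values drawn from Table 5; the only structural points the code encodes are that send parameters are averaged over workstations 1..P−1 (node P never sends), receive parameters over 2..P (node 1 never receives), and the contention coefficient over all P nodes.

```python
# Averaged ("network-wide") communication parameters of Section 5.1.
def average_parameters(alpha_s, beta_s, gamma, alpha_r, beta_r):
    P = len(gamma)
    avg = lambda xs: sum(xs) / len(xs)
    return {
        "alpha_s": avg(alpha_s[:P - 1]),  # Eq. (5): senders are nodes 1..P-1
        "beta_s":  avg(beta_s[:P - 1]),
        "gamma":   avg(gamma),            # Eq. (5): all P nodes contend
        "alpha_r": avg(alpha_r[1:]),      # Eq. (6): receivers are nodes 2..P
        "beta_r":  avg(beta_r[1:]),
    }

# Illustrative per-node values for a three-node network.
params = average_parameters(
    alpha_s=[0.013, 0.011, 0.004],
    beta_s=[8e-6, 1e-5, 6e-6],
    gamma=[0.0046, 0.0053, 0.0055],
    alpha_r=[0.0078, 0.0078, 0.0029],
    beta_r=[3e-6, 3e-6, 1e-6],
)
print(params)
```

With these inputs, ᾱ_s averages only the first two nodes and ᾱ_r only the last two, exactly mirroring the index ranges in Eqs. (5) and (6).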
5.2. Heterogeneous workstations To achieve load balance, more powerful workstations should be given more computations to perform. The “average” cost model introduced previously will be used to estimate the communication cost for every pair of workstations. In addition, the execution time for a single iteration is the average of the execution times for all workstations, following from (7). Two cases are considered below.
[Fig. 9. Heterogeneous vs. homogeneous tilings for parallelogram tiles: (a) heterogeneous tiling; (b) homogeneous tiling.]
5.3. Rectangular tiles

We choose the same n2 for all workstations but a different n1 for each workstation, matched to its processing capacity. The tile size for workstation i is chosen according to:

  n1^i = (C_i / Σ_{j=1..P} C_j)·N1,   n2^i = n2_opt    (8)

where n2_opt is from Eq. (4), with all communication and computation parameters calculated based on Eqs. (5)–(7).

Let us consider an example of parallelising a doall/do program with a rectangular iteration space on three workstations: Vieta, Neumann and Babbage. Their normalised processing capacities are read directly from Table 4: C_Vieta = 1, C_Neumann = 2 and C_Babbage = 5, so the combined processing capacity of the three workstations is C = C_Vieta + C_Neumann + C_Babbage = 1 + 2 + 5 = 8. The parameter n1 on the three workstations is then obtained using:

  n1^Vieta = (C_Vieta/C)·N1 = N1/8,
  n1^Neumann = (C_Neumann/C)·N1 = 2·N1/8,
  n1^Babbage = (C_Babbage/C)·N1 = 5·N1/8.
Let N1 24 and N2 12. Suppose we have found that n2 3. The resulting heterogeneous tiling is depicted in Fig. 8(a). If the three identical workstations such as Babbage, Church and Kleene were used instead, they would have been allocated the same-sized slice, as illustrated in Fig. 8(b). 5.4. Parallelogram tiles Recall that in this case, the dependence cone has the form C
D cone{ 10 }; ab }; where a and b are positive. For a
ni1
9
1 min{n12 ; n22 ; …; nP2 } tan dmax
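The capacity-proportional split of N_1 used in the rectangular-tile example above is easy to sketch. The helper below is ours (the paper does not specify how leftover rows from integer rounding are assigned; giving them to the most powerful workstation is our policy):

```python
def split_n1(N1, capacities):
    """Split N1 rows among workstations in proportion to capacity C_i,
    as in Eq. (8): n1_i = (C_i / sum(C)) * N1."""
    total = sum(capacities)
    n1 = [N1 * c // total for c in capacities]
    # Rounding down can leave a few rows over; hand them to the most
    # powerful workstation (our policy, not stated in the paper).
    n1[n1.index(max(n1))] += N1 - sum(n1)
    return n1

# Worked example from the text: C_Vieta = 1, C_Neumann = 2, C_Babbage = 5
print(split_n1(24, [1, 2, 5]))  # [3, 6, 15]
```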
In Eq. (9), δ_max is given by tan δ_max = b/a. The tile side n_2 is adjusted so that more powerful workstations are allocated more work to do, while the tile side n_1 is chosen as the smallest over all workstations to ensure that all workstations can start at the same time. Fig. 9 compares heterogeneous and homogeneous tilings in this case.

5.5. Heterogeneous programs

Programs with irregular iteration spaces also introduce heterogeneity, which may lead to severe load imbalance. Previous solutions fall into two categories: static scheduling, such as block scheduling, cyclic scheduling and canonical scheduling [37], and dynamic scheduling, such as self-scheduling [38], guided self-scheduling [39] and affinity scheduling [40]. We describe a static scheduling technique that is useful for doall loops and that tends to improve the data locality of each node. Some definitions from [37] are introduced below.

Definition 4 (Load imbalance and relative load imbalance). Let the total amount of computation in a loop nest, W_tot, be distributed among P nodes. Let W_i be the amount of computation allocated to node i (1 ≤ i ≤ P) such that W_tot = Σ_{i=1}^{P} W_i. This distribution exhibits a load imbalance, L, equal to:

L = max_i (W_i − W_tot/P) = W_max − W_tot/P

where W_max = max{W_1, W_2, …, W_P}. A relative load imbalance, L_R, is defined as:

L_R = L/W_max = (W_max − W_tot/P)/W_max = 1 − W_tot/(P W_max)

If L = L_R = 0, the distribution is perfectly balanced.

Let us use the following example to explain our scheme for statically scheduling doall loops on a network of P workstations. We first consider homogeneous workstations and then heterogeneous workstations.

doall I = 1, N
  do J = I, N
    H(I,J) = J

This is a double loop with a triangular iteration space, as shown in Fig. 10. The total number of iterations in the program is W_tot = N(N + 1)/2. All nodes are perfectly load balanced if the workload on each node is W = W_tot/P. Let L_t be the load imbalance that we can tolerate, say, L_t = W_tot × 1%. To achieve a perfect load balance, our intention is to partition the doall loop (i.e. loop I) into P trapezoidal chunks of the same size, as illustrated in Fig. 10. A perfect load balance is achieved if all trapezoids have the same area, i.e. contain the same number of iterations. Consider the ith (1 ≤ i ≤ P) trapezoid, where H_i is the length of the longer base and s_i is the height. Since the iteration space is an isosceles right triangle, the length of the smaller base of the ith trapezoid is H_{i+1} = H_i − s_i. Thus, the total number of iterations contained in the ith trapezoid can be approximated as (s_i/2)(H_i + (H_i − s_i)). To achieve a perfect load balance, we want to find s_i such that

F(s_i) = W + L_t − (s_i/2)(H_i + (H_i − s_i)),

i.e.

F(s_i) = W_tot/P + L_t − (s_i/2)(H_i + (H_i − s_i)),

is minimal. F(s_i) = 0 is a quadratic equation. By noting that s_i ≤ N, we find that F(s_i) attains its minimum when

s_i = H_i − √( H_i² − 2(W_tot/P + L_t) )   (10)

Fig. 10. Static scheduling for perfect load balance.

Fig. 11. Four static scheduling schemes: (a) block scheduling; (b) cyclic scheduling; (c) canonical scheduling; (d) this paper (identically shaded chunks are allocated to the same node).
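For concreteness, the two measures of Definition 4 can be computed directly. The function name and the sample workloads below are ours:

```python
def imbalance(work):
    """Load imbalance L and relative load imbalance LR (Definition 4).

    work: list of per-node workloads W_1..W_P.
    L  = W_max - W_tot/P
    LR = L / W_max = 1 - W_tot / (P * W_max)
    """
    P = len(work)
    W_tot = sum(work)
    W_max = max(work)
    L = W_max - W_tot / P
    LR = 1 - W_tot / (P * W_max)
    return L, LR

print(imbalance([100, 100, 100, 100]))  # (0.0, 0.0): perfectly balanced
```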
Table 6
Four scheduling schemes compared. Chunks: total number of partitions of the doall loop; DL: data locality.

                        Problem size N = 1000               Problem size N = 1111
Node  Scheme        L        LR      Chunks  DL         L          LR      Chunks  DL
4     BLO           875 000  0.8571  4       Best       1 080 447  0.8571  4       Best
4     CYC           218 750  0.4267  1000    Bad        269 382    0.4267  1108    Bad
4     CAN           0        0.0001  8       Good       28         0.0001  8       Good
4     This paper    3345     0.0260  5       Better     4986       0.0313  5       Better
8     BLO           937 500  0.9333  8       Best       1 157 662  0.9333  8       Best
8     CYC           117 187  0.4661  1000    Bad        143 796    0.4630  1104    Bad
8     CAN           36       0.0005  16      Good       28         0.0003  16      Good
8     This paper    3435     0.0520  9       Better     2249       0.0283  9       Better
16    BLO           969 000  0.9677  16      Best       1 195 991  0.9677  16      Best
16    CYC           60 078   0.4793  992     Bad        74 278     0.4802  1104    Bad
16    CAN           36       0.0011  32      Good       276        0.0067  32      Good
16    This paper    1944     0.0585  17      Better     1286       0.0322  17      Better
64    BLO           992 500  0.9921  64      Best       1 224 877  0.9921  64      Best
64    CYC           14 887   0.4747  960     Bad        18 742     0.4850  1088    Bad
64    CAN           5460     0.4073  128     Good       3828       0.2808  128     Good
64    This paper    16 241   0.6750  65      Better     16 226     0.6270  65      Better
Based on Eq. (10) and the following recurrence equations:

H_0 = N,   H_i = H_{i−1} − s_i,   where 1 ≤ i ≤ P,

the first P trapezoidal chunks can be statically determined. A chunk is determined once the values of H_i and s_i are found. Any iterations left over (due to integral approximation) are assigned to the node with the smallest workload. Fig. 11 compares four static scheduling schemes; identically shaded chunks are allocated to the same node. Our scheme tends to improve the data locality of each node, since consecutive iterations are executed on the same workstation. The four scheduling schemes are also compared experimentally in Table 6. Block scheduling tends to achieve the best data locality but can result in severe load imbalance. Cyclic scheduling improves load balance but exploits little data locality. Canonical scheduling achieves the best load balance but may not exploit the full potential of data locality. The scheme proposed in this paper represents a very good compromise among the preceding three schemes.
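The chunk computation above can be sketched as follows. The function name, the default L_t = 0, the rounding policy and the leftover handling are our choices; Eq. (10) and the recurrence are the paper's:

```python
import math

def trapezoid_heights(N, P, Lt=0.0):
    """Heights s_i of the P trapezoidal chunks of the triangular
    iteration space of:  doall I = 1,N / do J = I,N.

    Each chunk should contain about W_tot/P + Lt iterations; s_i comes
    from Eq. (10) and the longer base shrinks via H_i = H_{i-1} - s_i.
    """
    target = N * (N + 1) / 2 / P + Lt     # desired iterations per node
    H, heights = N, []
    for _ in range(P - 1):
        s = round(H - math.sqrt(H * H - 2 * target))  # Eq. (10), rounded
        heights.append(s)
        H -= s                                        # recurrence
    heights.append(H)                     # leftover rows go to the last node
    return heights

print(trapezoid_heights(1000, 4))  # [134, 159, 207, 500]
```

Note how the heights grow toward the tip of the triangle: later chunks span more rows because each row there contains fewer iterations.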
Improving data locality is especially important for shared-memory machines. Our scheme can be applied to the other types of triangular iteration spaces as well as trapezoidal iteration spaces. In addition, our scheme also generalises to a network of heterogeneous workstations. In this case, we simply change the first term in Eq. (10) from

W_tot/P + L_t   to   (C_i / Σ_{j=1}^{P} C_j) W_tot + L_t.
5.6. Putting it all together

Fig. 12 depicts a framework in which all the results of this paper are integrated. The compiler accepts sequential programs as input and generates tiled programs as output, based on our node and network models and the parallelisation strategies discussed previously.
Fig. 12. A tiling-based framework.
6. Experiments

All experiments are performed on a network of seven heterogeneous DEC workstations connected via an Ethernet. The parallelised programs are run in SPMD mode with message passing based on PVM.

6.1. MM: a doall/doall loop nest

Our first example is matrix multiply (MM):

do i = 1, N
  do j = 1, N
    c(i,j) = 0.0
    do k = 1, N
      c(i,j) = c(i,j) + a(i,k) * b(k,j)

All three loops are doall loops. By ignoring the innermost loop, the program is effectively a double loop and is parallelised as discussed in Section 4.1. Essentially, loop i is parallelised as a doall loop; therefore, there is no communication among the participating workstations. This example is used to evaluate the impact of different node models on the overall performance of a program. The MM program was run on three heterogeneous workstations, Vieta, Babbage and Turing; Kleene was used as the master to monitor its execution. Three different node models were used: (a) the homogeneous node model, which assumes that all workstations have the same processing capacity; (b) the model of Ref. [10]; and (c) our heterogeneous model. In (a), all workstations are allocated same-sized chunks. In (b), the chunk size of each workstation is determined based on its CPU speed only. In (c), the computation allocated to a node is proportional to its processing capacity, so that the workload on all three nodes is balanced. The performance results for the three cases are given in Table 7. The poor performance in case (a) is caused by severe load imbalance. Although the performance is significantly improved in case (b), our model achieves the best speedup. Fig. 13 further compares the load balance behaviour for the three cases using xpvm.

Table 7
Three models compared for matrix multiply (problem size 200 × 200)

Node model           Vieta  Babbage  Turing  Execution time (s)  Speedup
Homogeneous          66     68       66      13.4230             0.87
Model in Ref. [10]   12     126      134     3.0876              3.70
Heterogeneous        11     134      55      2.2409              5.10
6.2. SOR: a doacross/doacross loop nest

Our second example is a 2D SOR (Successive Over-Relaxation) kernel:

do i = 1, N
  do j = 1, N
    a(i,j) = ω/4 * (a(i−1,j) + a(i,j−1) + a(i+1,j) + a(i,j+1)) + (1 − ω) * a(i,j)

The dependence matrix is

D = [ 1  0 ]
    [ 0  1 ]

Hence, neither of the two loops is a doall loop. The program is parallelised as doacross loops, as discussed in Section 4.4. Due to the data dependences, there is communication among all participating workstations. This example is used to evaluate the effectiveness of our node and network models and the parallelisation strategy for doacross loops. We executed the parallelised program on five workstations, Babbage, Church, Euclid, Neumann and Turing, which differ in CPU speed and/or memory size. We arranged these workstations in the order shown in Fig. 14. Again, Kleene was used as the master to monitor the execution of the program. Based on (5) and (6), we found the following communication parameters:

α_s = 0.0048637,  β_s = 0.0000080,  γ = 0.0004916,  α_r = 0.0047145,  β_r = 0.0000010.

Based on Eq. (7), the average execution time of a single iteration on the five workstations was measured to be t_c = 0.0000006. Let N_1 = N_2 = 3000. The optimal tile sizes on the five workstations were found by applying Eq. (8):

n_1^Babbage = n_1^Church = n_1^Euclid = 517,  n_1^Neumann = 207,  n_1^Turing = 1242,
n_2^Babbage = n_2^Church = n_2^Euclid = n_2^Neumann = n_2^Turing = 180.   (11)

Fig. 13. Load balance behaviour for three node models.

To validate that these tile sizes can indeed yield good speedup, the following experiments were performed:
• By fixing n_2 in Eq. (11), we measured the performance of the tiled SOR program while varying the n_1 values on the five workstations. Our experimental results indicate that the heterogeneous tiling (11) gives near-optimal performance in all test cases. In particular, Fig. 14 shows the load balance behaviour for the homogeneous tiling (n_1 = N_1/P = 3000/5 = 600 on all five nodes) and the heterogeneous tiling. With the homogeneous tiling, the last workstation, Turing, spent a significant amount of time idle. This is in sharp contrast with the situation for the heterogeneous tiling.
• By fixing the n_1 values in Eq. (11), Fig. 15 shows the execution times and speedups of the tiled SOR program for several different values of n_2. The best performance was observed when n_2 = 180 in all test cases conducted.
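Plugging the measured parameters into the pairwise model gives a quick estimate of the per-message overhead. Summing the send, receive and contention terms additively, with m as the message length in elements, is our simplification of the paper's model:

```python
# Parameters measured for the SOR experiment (averaged model, Eqs. (5)-(6)).
ALPHA_S, BETA_S = 0.0048637, 0.0000080   # send: startup + per-element
ALPHA_R, BETA_R = 0.0047145, 0.0000010   # receive: startup + per-element
GAMMA = 0.0004916                        # network contention term

def comm_cost(m):
    """Estimated cost (seconds) of transferring an m-element tile boundary."""
    send = ALPHA_S + BETA_S * m
    recv = ALPHA_R + BETA_R * m
    return send + recv + GAMMA

print(f"cost for m = 180: {comm_cost(180):.6f} s")
```

With these values the startup terms dominate until messages reach thousands of elements, which is consistent with tiling toward coarse granularity.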
7. Related work

Previous research efforts on communication costs and tiling can be found in Refs. [1,10,22,26,36,40] and the references therein. Our work on dealing with heterogeneity for NOWs is closely related to the work reported in Refs. [10,40], with the following differences. First, we are not restricted to doall loops. Second, our node and network models are different. In our node model, memory size is explicitly taken into account. In our network model, the costs for send and receive are represented separately, so that their effects on the overall communication cost can be predicted more accurately. Third, the effect of network contention on the overall communication overhead is represented explicitly in our communication cost model.

Our work on tiling is based on Ref. [24]. However, our results are derived for a network of heterogeneous workstations based on our node and network models. Our scheduling scheme described in Section 5.5 does not attempt to fully out-perform the scheduling schemes in Refs. [10,37]. Rather, we provide an alternative scheduling scheme that is simple and tends to improve load balance and spatial locality.

Fig. 14. Load balance behaviour for the homogeneous and heterogeneous tilings.
Fig. 15. Execution time and speedup of the tiled SOR for different tile sizes.
8. Conclusions

In this paper, we have studied the problem of parallelising and scheduling loops on a network of heterogeneous workstations. We described a node model for quantifying the processing capacity of a workstation and a network model for characterising the communication overhead between a pair of heterogeneous workstations. Our node model reflects not only a node's processor speed but also its memory size. In our network model, the send and receive overheads are quantified separately, allowing their effects on the overall communication cost to be predicted more accurately. Based on these cost models, we have explored the use of loop tiling to generate efficient SPMD code for a network of workstations. In the case of homogeneous workstations, loops are tiled based on their dependence patterns. The objectives are to overlap computation and communication as much as possible on all workstations, to reduce the idle time on each workstation and to keep all workstations load balanced. Our tiling techniques for homogeneous workstations are extended to the case of heterogeneous workstations by considering heterogeneity in processor, memory and program. By choosing tiles of the same shape but different sizes, we discussed how to generate tiled code that achieves good performance with good load balance. A scheme useful for statically scheduling doall loops was also provided. Our experimental results demonstrated the effectiveness of our node and network models and parallelising techniques.
Acknowledgements We wish to thank Norman Gaywood of the University of New England for his assistance with our experiments. This work is supported by an Australian Research Council Grant A49600987.
References

[1] G. Bell, Ultracomputer: a Teraflop before its time, Communications of the ACM 17 (1) (1992) 27–47.
[2] A. Geist, et al., PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing, 0-262-57108-0, The MIT Press, 1994.
[3] E. Lusk, W. Gropp, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 0-262-57104-8, The MIT Press, 1994.
[4] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, W.-K. Su, Myrinet: a gigabit-per-second local area network, IEEE Micro 15 (1995) 29–36.
[5] J.L. Hammond, J.P. Peter, Performance Analysis of Local Computer Networks, O'Reilly, 1986.
[6] S. Lucco, A dynamic scheduling method for irregular parallel programs, ACM SIGPLAN'92 (1992) 200–211.
[7] R. Sakellariou, J.R. Gurd, Compile-time minimisation of load imbalance in loop nests, in: Proc. of the International Conference on Supercomputing (ICS), Vienna, July 1997.
[8] M. Zaki, W. Li, S. Parthasarathy, Customized dynamic load balancing for a network of workstations, Technical Report TR-602, Department of Computer Science, University of Rochester, December 1995.
[9] R. Gillett, R. Kaufmann, Using the memory channel network, IEEE Micro 17 (1) (1997) 19–25.
[10] M. Cierniak, W. Li, M. Zaki, Loop scheduling for heterogeneity, in: Proc. of the Fourth IEEE International Symposium on High-Performance Distributed Computing, Pentagon City, Virginia, August 1995.
[11] X. Wang, E.K. Blum, Parallel execution of iterative computation on workstation clusters, Journal of Parallel and Distributed Computing 34 (1) (1996) 218–226.
[12] N. Carriero, D. Gelernter, Linda in context, Communications of the ACM 32 (4) (1989) 444–458.
[13] R. Butler, E. Lusk, User's guide to the p4 parallel programming system, Technical Report ANL-92/17, Argonne National Laboratory, October 1992.
[14] D.C. Schmidt, T. Suda, A high-performance endsystem architecture for real-time CORBA, IEEE Communications Magazine 14 (2) (1997).
[15] M.J. Wolfe, High Performance Compilers for Parallel Processing, Addison-Wesley, Reading, MA, 1996.
[16] M. Wolf, M. Lam, A loop transformation theory and an algorithm to maximize parallelism, IEEE Transactions on Parallel and Distributed Systems 2 (4) (1991) 452–471.
[17] J. Xue, On tiling as a loop transformation, Parallel Processing Letters 7 (4) (1997) 409–424.
[18] F. Desprez, J. Dongarra, F. Rastello, Y. Robert, Determining the idle time of a tiling: new results, Technical Report UT-CS-97-360, Department of Computer Science, University of Tennessee, May 1997.
[19] K. Hogstedt, L. Carter, J. Ferrante, Determining the idle time of a tiling, in: ACM Symposium on Principles of Programming Languages, January 1997.
[20] W.K. Kaplow, W.A. Maniatty, B.K. Szymanski, Impact of memory hierarchy on program partitioning and scheduling, in: Proc. of HICSS-28, Maui, Hawaii, January 1995.
[21] W.K. Kaplow, B.K. Szymanski, Program optimization based on compile-time cache performance prediction, Parallel Processing Letters 6 (1) (1996).
[22] M.E. Wolf, M.S. Lam, A data locality optimizing algorithm, in: Proc. of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, June 1991, pp. 173–184.
[23] R. Andonov, S. Rajopadhye, Optimal tiling of two-dimensional uniform recurrences, Technical Report 97-01, LIMAV, Université de Valenciennes, January 1997.
[24] H. Ohta, Y. Saito, M. Kainaga, H. Ono, Optimal tile size adjustment in compiling for general DOACROSS loop nests, in: Supercomputing'95, ACM Press, 1995, pp. 270–279.
[25] P. Boulet, A. Darte, T. Risset, Y. Robert, (Pen)-ultimate tiling, Integration, the VLSI Journal 17 (1994) 33–51.
[26] P. Calland, T. Risset, Precise tiling for uniform loop nests, Technical Report 94-29, École Normale Supérieure de Lyon, September 1994.
[27] J. Ramanujam, P. Sadayappan, Tiling multidimensional iteration spaces for multicomputers, Journal of Parallel and Distributed Computing 16 (2) (1992) 108–230.
[28] R. Schreiber, J.J. Dongarra, Automatic blocking of nested loops, Technical Report 90.38, RIACS, May 1990.
[29] J. Xue, Communication-minimal tiling of uniform dependence loops, Journal of Parallel and Distributed Computing 42 (1) (1997) 42–59.
[30] W.K. Kaplow, B.K. Szymanski, Tiling for parallel execution: optimizing node cache performance, in: Workshop on Challenges in Compiling for Scalable Parallel Systems, Eighth IEEE Symposium on Parallel and Distributed Processing, January 1996.
[31] J. Xue, C.H. Huang, Reuse-driven tiling for improving data locality, International Journal of Parallel Programming 26 (6) (1998) 671–696.
[32] M. Zaki, W. Li, M. Cierniak, Performance impact of processor and memory heterogeneity in a network of machines, in: Proc. of the 4th Heterogeneous Computing Workshop, Santa Barbara, California, April 1995.
[33] J.M. Chambers, T.M. Hastie, Statistical Models in S, 0-534-16764-0, Wadsworth, 1992.
[34] S. Bokhari, Communication overhead on the Intel Paragon, IBM SP2 and Meiko CS-2, Technical Report TR-28, NASA Langley Research Center, August 1995.
[35] A. Schrijver, Theory of Linear and Integer Programming, Wiley Series in Discrete Mathematics, Wiley, New York, 1986.
[36] F. Irigoin, R. Triolet, Computing dependence direction vectors and dependence cones with linear systems, Technical Report E94, Centre d'Automatique et Informatique, September 1987.
[37] R. Sakellariou, J.R. Gurd, Compile-time minimisation of load imbalance in loop nests, in: Proc. of the 11th International Conference on Supercomputing (ICS'97), Vienna, July 1997, pp. 277–284.
[38] P. Tang, P.C. Yew, Processor self-scheduling for multiple nested parallel loops, in: Proc. of the 1986 International Conference on Parallel Processing, 1986.
[39] C.D. Polychronopoulos, D. Kuck, Guided self-scheduling: a practical scheduling scheme for parallel supercomputers, IEEE Transactions on Computers 36 (12) (1987) 1425–1439.
[40] E.P. Markatos, T.J. LeBlanc, Using processor affinity in loop scheduling on shared-memory multiprocessors, in: Proc. of Supercomputing'92, 1992, pp. 104–113.