JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 11, 146-155 (1991)
(k-k) Routing on Multidimensional Mesh-Connected Arrays*
MANFRED KUNDE AND THOMAS TENSI
Institut für Informatik, Technische Universität München, Arcisstrasse 21, D-8000 München 2, Germany
In this paper we study the problem of routing packets on an r-dimensional mesh-connected array of processors. The focus is on routing with each processor containing exactly k packets, $k \ge 2$, initially and finally (so-called k-k routing). For two-dimensional $n \times n$ grids the number of transport steps is at most $\frac{5}{4}kn + O(kn/f(n))$ with a buffer size of $O(kf(n))$. In the special case of a sequence of k permutation routing problems this step count can be reduced to $kn + O(kn/f(n))$. For an r-dimensional grid, $r \ge 3$, with side length n the same technique yields an algorithm with step count $(r-1)(1 + 1/r^2)kn + O(n/f(n)^{1/(r-1)})$ and buffer $rkf(n)$. For sequences of permutation routing problems this drops to $\lceil k/r \rceil (2r-2)n + O(kn/f(n)^{1/(r-1)})$ and a buffer size of $O(kf(n))$. Furthermore it is shown that splitting large packets into smaller ones has benefits for permutation routing problems. For grids with wraparound connections these step counts and times generally can be reduced by one-half. © 1991 Academic Press, Inc.
1. INTRODUCTION

The performance of parallel computation is heavily influenced by the existence of fast data movement algorithms [1]. It is well known that the efficiency of data movement algorithms strongly depends on the underlying topology. In this paper we present various fast algorithms to perform routing on grids of processors. Grids are a particularly attractive architecture because of their simple interconnection and their easy scalability.

An $n_1 \times \cdots \times n_r$ mesh-connected array or grid is a set mesh$(n_1, \ldots, n_r)$ of $N = n_1 \cdot \ldots \cdot n_r$ identical processors where each processor $P = (p_1, \ldots, p_r)$, $0 \le p_i \le n_i - 1$, is directly connected to all its nearest neighbors only. A processor $Q = (q_1, \ldots, q_r)$ is called a nearest neighbor of P iff the Manhattan distance between P and Q is 1 ($d(P, Q) = \sum_{i=1}^{r} |p_i - q_i|$).

The control structure of the grid of processors is assumed to be of the MIMD type (Multiple Instruction Multiple Data). That is, each processor has its own program memory, different processors can perform different instructions at the [...] can send data only to its nearest neighbors during one clock period. Bidirectional communication can occur with all nearest neighbors in one clock cycle. Furthermore, each processor has only a limited number of registers for data (i.e., the buffer size is constant or log N).

Fast and conflict-free data transfer is a problem to be solved on grids as well as on networks. This problem especially arises when one network is simulated by another (e.g., an idealistic network like the complete graph simulated by a realistic one). A basic problem in this context is the so-called packet routing problem, where each processor may send packets of data to and receive them from other processors. For a partial k relation the problem is slightly restricted in that each processor is origin and destination of at most k packets [15]. The algorithms proposed in this paper are valid for partial k relations but for the purpose of simplicity we focus on k-k routing where each processor sends and receives exactly k packets. The problem can be described more formally as follows:

DEFINITION 1 (k-k Packet Routing). A k-k routing problem is given by a (k + 1)-tuple (mesh, address$_1$, ..., address$_k$), where each address$_i$ is a map from mesh to mesh and for all processors P in the mesh we have $\sum_{i=1}^{k} |\mathrm{address}_i^{-1}(P)| = k$. All the packets in $\mathrm{address}_i^{-1}(\mathrm{mesh})$ are said to lie in layer i (see Fig. 1).

In the case of k = 1, address$_1$ is a bijective function by definition and the problem is then called a permutation routing problem, which is an attractive paradigm subproblem [15] and therefore has been extensively studied in the literature. For permutation routing on $n \times n$ meshes some algorithms based on sorting need only 3n + O(low order) steps and a buffer size of 1 packet [9, 12, 14]. A randomized algorithm of Kriznac et al. [3] solves the problem by $2n + O(\log n)$ steps with constant buffer size. A deterministic approach of one of the authors [5] takes $2n + O(n/f(n))$ for arbitrary buffer size f(n), 1 [...] 3, a probabilistic algorithm of Valiant and Brebner [15] needs $(2r - 1)n$ steps with buffer size $O(\log n)$; about the same

* This research was supported by the Deutsche Forschungsgemeinschaft, Grant Ku 658/1-1, and partially by SIEMENS AG, München.
0743-7315/91 $3.00 Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.
(k-k)-ROUTING
FIG. 1. Mesh with two layers.

number of steps can be achieved by a sorting algorithm needing only buffer size 1 [4, 5]. If buffer size f(n) is given [...] permutation routing problem in the following way. Each packet of size s is split into r subpackets of size s/r, each having the same address as the original packet. Such a problem is called a k packet splitting problem. Under the assumption that the transport time of a packet of size s is $t_{step}$ and that for packet size s/r is $t_{step}/r$ we can solve the permutation routing problem asymptotically in the transportation time $(2r - 2)\,n\,t_{step}/r \approx 2n\,t_{step}$, a speedup by a factor of r.

If the k-k routing problem is not a sequence of simple permutation routing problems then the problem seems to become more difficult. In the third section we show how to route then in $\frac{5}{4}kn + O(\text{low order})$ steps on an $n \times n$ mesh and $\frac{5}{4}kn/2 + O(\text{low order})$ on an $n \times n$ torus. In the r-dimensional case, which is discussed in the fourth section, a step count of $(r-1)(1 + r^{-2})kn + O(\text{low order})$ for a mesh and $(r-1)(1 + r^{-2})kn/2 + O(\text{low order})$ for a torus is obtained. Both algorithms use buffer $krf(n)$.

2. BASIC OPERATIONS AND LOWER BOUNDS

In this section we present some results on routing on linear arrays or rings which are to be used in the sequel. Moreover, some simple lower bounds for routing on multidimensional meshes are given.

As slices of processors are often dealt with, a formal notation for them is introduced:

DEFINITION 2. Let $[l, h] := \{x \mid l$ [...]

For the routing algorithms presented in the following very often we need solutions for routing problems which are not k-k routing problems. It may happen that there are kf(n) packets in the beginning and the same number of packets at the end in a processor. This leads to the following definition:

DEFINITION 3 (Balanced Routing Problem). An r-dimensional routing problem is balanced with respect to the rth axis iff there are at most $k(h_r - l_r + 1 + f(n))$ packets with addresses in $(p_1, \ldots, p_{r-1}, [l_r, h_r])$ for all $p_j$, $0 \le p_j < n$, $j = 1, \ldots, r-1$, and all 0 [...]

LEMMA 4. Assume a linear array with n processors, each processor containing x or x + 1 packets initially and an arbitrary number finally ($x \ge 1$). The total number of packets is xn + y, with $0 \le y < n$ (see Fig. 2a). Then the number of transport steps needed for routing is at most xn + y.

Proof. W.l.o.g. we just observe packets going from left to right. Note that packets going in the opposite direction do not interfere because of the bidirectional connections between the processors. Initially each processor selects one packet going right and lets it move, picking up packets coming in from the left destined for itself and letting the others pass. Packets once started are never delayed; i.e., they keep on moving until they reach their destinations. From time to time a gap in the sequence of moving packets, which is immediately filled by the next packet going right, may occur at a processor.

Now focus on processor i and let $t = t(i) = (i + 1)x + y$. If a gap has occurred for the last right-bound packet at i destined for j (j > i) before step t, then this packet falls in and moves to its destination without delay and arrives there before step t + j - i. If no gap has occurred for this packet, that means that t packets have left processor i to the right. But to the left of i and in i there were at most t packets. Thus the packet joins the sequence and arrives at its destination at t + j - i. Hence the arrival time of that packet is less than or equal to

$$\max_{j,i}\{t(i) + j - i\} = \max_{j,i}\{i(x - 1) + j + x + y\} \le (n-2)(x-1) + (n-1) + x + y = (n-1)x + 1 + y \le nx + y \quad \text{as } x \ge 1. \qquad ∎$$

LEMMA 5. Assume a linear array with n processors and a total number of xn packets to route ($x \ge 1$). For each j let the number of packets with addresses greater than (j - 1) be bounded by $(n - j + 1)x + g(n)$ and the number of packets with addresses less than j by $(j - 1)x + g(n)$ (see Fig. 2b). Then the number of transport steps needed for routing is at most xn + g(n).

Proof. (Similar to the technique in [3], farthest-destination-first.) Route a packet from source i to destination j (w.l.o.g. i < j). A packet at i destined for j can be delayed by at most $x(n - j + 1) + g(n)$ other packets going to or past j. Thus a packet at i reaches its destination at the latest after $x(n - j + 1) + g(n) + (j - i)$ steps. The maximum value for this expression for any j is less than xn + g(n). ∎

Note that this lemma covers the case of a balanced routing problem.

FIG. 2. Routing on linear arrays (a, b) and on rings (c).

2.2. Routing on Rings

Both lemmas above can be formulated for rings (linear arrays with wraparound connections). For this section let $a \oplus b := (a + b) \bmod n$ and $a \ominus b := (a - b) \bmod n$.

The following technical lemma applies to arbitrary initial and final packet distributions as long as the buffer requirements are fulfilled during the whole routing phase.

LEMMA 6. [...] and let w(0) := 0 (Fig. 2c). Then after w(l) steps all packets in W(l) have left src(l).

Proof. Each packet goes in that direction on the ring which is shortest in going toward the destination (farthest-destination-first). Now w.l.o.g. observe only packets going toward larger indices. For l = 1 only packets going from p to q are considered. No packet coming from $p \ominus 1$ heading for q would go via p; thus packets in p have highest priority. There are at most w(1) packets and they can leave p in that number of steps.

For the induction step let $z := p \oplus (l - 1)$, i.e., $\mathrm{src}(l) = \mathrm{src}(l-1) \cup \{z\}$ and $\mathrm{dest}(l) = \mathrm{dest}(l-1) \cup \{q \oplus (l-1)\}$. First let us check how long it takes to clear src(l - 1) from packets of W(l). Note that no packet in src(l - 1) heading for $q \oplus (l - 1)$ would go via z, i.e., that all packets of W(l) coming from src(l - 1) are completely contained in W(l - 1). Also, by the induction hypothesis all packets have left src(l - 1) after w(l - 1) steps. Thus all packets from src(l - 1) heading for dest(l) have left src(l - 1) after w(l - 1) steps.

Now observe the stream of packets passing z via $z \ominus 1$. Whenever z contains packets $\in W(l)$, one of these packets will be forwarded as these packets have highest priority. The number of packets $\in W(l)$ in z is reduced whenever no packet $\in W(l)$ arrives from $z \ominus 1$. Let u denote this number of "gaps"; then w(l) - u packets $\in W(l)$ arrive from $z \ominus 1$. If a packet $\in W(l)$ were at z after w(l) steps, then the initial number of packets in z heading for dest(l), called t, would have been greater than u; i.e., $t + (w(l) - u) > w(l) \ge |W(l)|$ packets $\in W(l)$ would have had to pass z, which is a contradiction. ∎

LEMMA 7. Assume a ring with n processors, each processor containing x or x + 1 packets initially and an arbitrary number finally ($x \ge 1$). The total number of packets is xn + y, with $0 \le y < n$. Then the number of transport steps needed for routing is at most xn/2 + y.

Proof. Using Lemma 6 we first show that $w(l) \le xl + y$ for all l. Note that $|W(l)| \le xl + y$ since the number of packets in src(l) is bounded by xl + y. On the other hand by the induction hypothesis $w(l-1) \le x(l-1) + y$ and thus $w(l-1) + 1 \le xl + y$ since $x \ge 1$. Now by Lemma 6 for $l := \lfloor n/2 \rfloor$ all packets going clockwise to q have left $\mathrm{src}(\lfloor n/2 \rfloor)$ after $w(\lfloor n/2 \rfloor)$ steps. Thus all packets heading for q in clockwise direction have arrived there after $w(\lfloor n/2 \rfloor) \le xn/2 + y$ steps. ∎

LEMMA 8. Assume a ring with n processors and a total number of xn packets to route ($x \ge 1$). Let the number of packets going to processors $\{l, \ldots, m\}$ be bounded by $(m - l + 1)x + g(n)$ or, in case m [...]

[...] for dest(l) is bounded by this quantity. The rest of the argumentation is completely analogous to the proof of the previous lemma. ∎

Before carrying on with the development of the algorithms we give a simple lower bound for the problem.

2.3. Lower Bounds

THEOREM 9. k-k routing on an r-dimensional grid with side length n takes at least $\max(r(n - 1),\ k\lfloor n/2 \rfloor)$ transport steps.
Proof. The first term is simply obtained by the distance bound. We prove the validity of the second term by a cut argument. Let $y = \lfloor n/2 \rfloor$. Let the first submesh contain all those processors $P = (p_1, \ldots, p_r)$ with first coordinate $p_1 < y$ and let the second one contain all the other processors. Then the first submesh contains $y n^{r-1}$ processors. Assume a routing problem where each packet from submesh 1 travels to submesh 2. Then $k y n^{r-1}$ packets have to travel across a cut via $n^{r-1}$ communication lines. Thus the minimum number of steps is $k y n^{r-1}/n^{r-1} = k\lfloor n/2 \rfloor$. ∎

Although the above bound does not seem very tight in the general case, it is sometimes relatively close to an upper bound. For example for an $n \times n$ mesh the lower bound is valid for tile translation where each packet in an arbitrary processor $P = (p_1, p_2)$ has to be transported to processor $Q = ((p_1 + n/2) \bmod n,\ (p_2 + n/2) \bmod n)$. This can be done in kn steps using our algorithm (see 3.1). In the case of only two or three packets it is even very close to an optimal bound.

In the same way a lower bound for r-dimensional tori can be found:

THEOREM 10. k-k routing on an r-dimensional torus with side length n takes at least [...] transport steps.

Proof. Similar to the proof of Theorem 9. Note that the two submeshes are divided by two boundaries. ∎
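The bounds of this section can be checked numerically on a small linear array. The following Python sketch is only an illustration under stated assumptions: it treats the linear array as the case r = 1 of Theorem 9, uses the farthest-destination-first rule mentioned in the proof of Lemma 5, and the function names `lower_bound` and `route_linear_array` are hypothetical helpers, not part of the paper.

```python
def lower_bound(n: int, k: int, r: int = 1) -> int:
    """Bound of Theorem 9, max(r(n-1), k*floor(n/2)), for side length n."""
    return max(r * (n - 1), k * (n // 2))


def route_linear_array(dests: list[list[int]]) -> int:
    """Route packets on a linear array, farthest destination first.

    dests[i] lists the destination indices of the packets that start in
    processor i. Returns the number of transport steps until every packet
    has arrived. (Assumed model: one packet per link and direction per step.)
    """
    n = len(dests)
    # Packets still travelling, stored per processor as destination indices.
    buf = [[d for d in dests[i] if d != i] for i in range(n)]
    steps = 0
    while any(buf):
        moves = []  # (target processor, destination), chosen from the old state
        for i in range(n):
            right = [d for d in buf[i] if d > i]
            left = [d for d in buf[i] if d < i]
            if right:
                d = max(right)          # farthest destination to the right
                buf[i].remove(d)
                moves.append((i + 1, d))
            if left:
                d = min(left)           # farthest destination to the left
                buf[i].remove(d)
                moves.append((i - 1, d))
        for target, d in moves:
            if d != target:             # arrived packets are consumed
                buf[target].append(d)
        steps += 1
    return steps


n, k = 8, 3
# Reversal instance: all k packets of processor i go to processor n-1-i.
reversal = [[n - 1 - i] * k for i in range(n)]
steps = route_linear_array(reversal)
print(lower_bound(n, k), steps, k * n)
```

For the reversal instance every packet of the left half has to cross the middle link, so the measured step count lies between the cut bound $k\lfloor n/2 \rfloor$ and the upper bound kn of Lemma 5 with g(n) = 0.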
3. k-k ROUTING ON TWO-DIMENSIONAL MESHES AND TORI

In this section the k-k packet routing on square ($n \times n$) meshes and tori is discussed. The presentation concentrates on meshes. However, the argumentation also applies to tori. First we present the algorithm in its most general form, which is optimal within a factor of at most 3. Then we apply the algorithm to a sequence of simple permutation problems. After that we give a further improvement for the general k-k routing problem. At the end of this section an application to the k packet splitting problem is given.

3.1. Algorithms for k-k Routing on a Square Mesh ($k \ge 2$)

For our interleaving technique we use as special base algorithms the so-called uniaxial algorithms.

DEFINITION 11 (Uniaxial Algorithm). In a uniaxial algorithm in one time step all processors can communicate along one coordinate axis only. More formally (in the r-dimensional case) for any uniaxial algorithm we can define a function $\alpha$ from step index ($\in \mathbb{N}$) to $\{1, \ldots, r\}$. In clock step j processors P and Q may communicate iff $P - Q = \pm u_{\alpha(j)}$ (where $u_i$ denotes the ith unit vector). The axis $\alpha(j)$ is called the active axis at step j.

Note that many routing and sorting algorithms in the literature (e.g., [4-6, 9, 11, 12, 14]) are uniaxial. As an exception the algorithm in [8] is not uniaxial (this is crucial for the high performance of that algorithm!). We exploit uniaxiality by letting data streams flow along orthogonal axes.

The principal idea is to use the Sort-And-Route algorithm [5], but to exploit its uniaxiality by interleaving of orthogonal phases. For this algorithm the total mesh is partitioned into small submeshes, called blocks. Within these blocks the packets are then sorted according to their addresses. This has the effect of distributing packets with a similar destination region uniformly, which guarantees a limited buffer size. Throughout the discussion the side length of a block is denoted by b, where b divides n. A processor P = (r, c) then lies in block $(\lfloor r/b \rfloor, \lfloor c/b \rfloor)$. For the orthogonal phases we need different index schemes for the sorting.

DEFINITION 12 (Indexings, Indexed Intervals). In the following we use two indexings of the processors: the lexicographical indexing lex, defined by $\mathrm{lex}(p_1, \ldots, p_r) := \sum_{i=1}^{r} p_i n^{r-i}$, and the reversed lexicographical indexing rev, given by $\mathrm{rev}(p_1, \ldots, p_r) := \mathrm{lex}(p_r, \ldots, p_1)$. Now let g be any index function. Then an interval of processors (or addresses) with respect to indexing g is the set of processors $[P, Q]_g := \{X \mid g(P) \le g(X)$ [...]
2. Route level 0 packets along row to destination column and level 1 packets along column to destination row (similar to [5]; each packet is now in either its correct row or column): $t_{route1} = kn/2$.

3. Route level 0 packets within column and level 1 packets within row to destination processor: $t_{route2} = kn$.

THEOREM 14. General k-k packet routing on an $n \times n$ mesh can be done in $\frac{3}{2}kn + O(n/f(n))$ transport steps with a buffer size of $O(kf(n))$ by Algorithm 13.

Proof. Let b = n/f(n). For the first phase note that two layers can be sorted concurrently as their orthogonal connections and the existence of uniaxial sorting algorithms [12, 14] allow the complete exploitation of all connections. As there is no interference between two adjoining layers, sorting can happen in $\lceil k/2 \rceil \cdot O(b)$. Phase 2 is also noncritical, as sorting in phase 1 does not change the number of packets in the processors. Note that in each processor there are $\lfloor k/2 \rfloor$ or $\lceil k/2 \rceil$ packets of level i. Thus, by applying Lemma 4, we can route both the rows and the columns in time kn/2. (In the case of an odd k apply Lemma 4 with x = (k - 1)/2, y = n/2.)

To get an indication of the buffer size, w.l.o.g. we observe the level 0 packets destined for column c. Assume the number of those packets in block (x, y) to be $\tau(x, y)$ (with $\sum_i \tau(x, i) \le kn$). Phase 1 smears the packets for one column onto different rows [5]. The number of packets for column c going to processor (r, c) is at most

$$\sum_i \left( \frac{\tau(\lfloor r/b \rfloor, i)}{b} + 1 \right) \le \frac{(k+1)n}{b}.$$

On the other hand there are at most kn packets with destination address in column c in even numbered layers. Hence the last routing phase (phase 3) can be done in kn steps (by Lemma 5). ∎

COROLLARY 15. General k-k packet routing on an $n \times n$ torus can be done by Algorithm 13 in $\frac{3}{2}kn/2 + O(kn/f(n))$ transport steps with a buffer size of $O(kf(n))$.

Proof. Algorithm 13 can use the wraparound connections of the torus. Thus phases 2 and 3 can be done in one-half the time by Lemmas 7 and 8. ∎

COROLLARY 16. If the routing problem is a sequence of k permutation problems, then the total number of transportation steps reduces to kn + O(low order) for a mesh and kn/2 + O(low order) for a torus.

Proof. The improvement can be obtained for phase 3. W.l.o.g. consider only routing within a single column. Note that the number of packets in this column is now bounded by kn/2. These packets can be routed within the column in kn/2 steps on a mesh (by Lemma 5) or in (kn/2)/2 steps on a torus (by Lemma 8). ∎

Improvement. We have just seen that if we know that the k-k packet routing problem is a sequence of k permutation problems then the number of transport steps can be reduced by a third. But normally we do not have this information and it seems to be hard to get it.

According to our analysis the reason for the worse performance of general k-k routing can be easily found. As we do not touch the initial distribution of packets into the different levels, it may be that packets destined for a certain column can all lie in level 0. This overload of level 0 leads to kn transport steps for the final phase of Algorithm 13. (Bad cases for level 1 are analogous with respect to rows.) Note that once a packet is in a level it stays there until the end of the algorithm.

For k permutation routing a priori 50% of the packets for a certain destination column initially lie in level 0, but also 50% heading for the same destination column initially lie in level 1. (The same holds for the rows.) Thus the number of packets of level 0 involved in the final phase for this column is kn/2, which is also about the number of steps needed.

So an idea for improving the algorithm for general k-k routing would be to introduce some prephase where packets between level 0 and level 1 are exchanged to bound the number of packets within a level going to a certain destination column or row. The question is whether we can decide locally within a small amount of time whether a packet has to become a level 0 packet. That means we must try to guarantee that in the last phase in each column there are at most $\varepsilon kn$ packets of level 0 with $0.5 \le \varepsilon < 1$. The smaller the $\varepsilon$ the smaller the number of transport steps needed for the algorithm.

This problem seems to be very hard to solve ideally ($\varepsilon = \frac{1}{2}$), but by the prephase of the next algorithm we can at least guarantee that for the last phase of each incarnation at most 3kn/4 packets (i.e., $\varepsilon = \frac{3}{4}$) going to a certain row or column may occur in one level.

An underlying idea for this prephase is to partition the addresses into two classes (called colors) such that about half the number of processors of an arbitrary rectangle is in each class. By sorting with respect to these colors and by a subsequent distribution of every other packet into odd resp. even layers the desired separation of packets is achieved.

Since the idea of coloring is also useful for the r-dimensional case, we now present the general definition:

DEFINITION 17 (Color, Mixed Order). A processor $(x_1, \ldots, x_r)$ has color $(\sum_{i=1}^{r} x_i) \bmod r$. We say that address P precedes Q with respect to mixed order (denoted by $P \le_{mix} Q$) if either
1. color(P) < color(Q) or
2. color(P) = color(Q) and [...]
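The coloring rule of Definition 17 can be tried out directly. The short Python sketch below is an illustration only: it evaluates the color function for the two-dimensional case and checks, on a small hypothetical $6 \times 6$ mesh, the property stated in the text that about half the processors of an arbitrary rectangle fall into each color class. The helper name `color_counts` is an assumption made for this example.

```python
from itertools import product


def color(p: tuple[int, ...]) -> int:
    """Color of processor p = (x_1, ..., x_r): (sum of coordinates) mod r."""
    return sum(p) % len(p)


def color_counts(rows: range, cols: range) -> list[int]:
    """Count the processors of each color inside a rectangle of a 2-D mesh."""
    counts = [0, 0]
    for p in product(rows, cols):
        counts[color(p)] += 1
    return counts


n = 6  # hypothetical side length
# Largest imbalance between the two colors over all rectangles of the mesh.
worst = max(
    abs(c[0] - c[1])
    for r0 in range(n) for r1 in range(r0 + 1, n + 1)
    for c0 in range(n) for c1 in range(c0 + 1, n + 1)
    for c in [color_counts(range(r0, r1), range(c0, c1))]
)
print(worst)
```

For r = 2 the coloring is a checkerboard, so the two counts of any rectangle differ by at most one processor.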
Proof. Note that Algorithm 13 works independently of the distribution of the packets in different layers. Phase 3 of Algorithm 13 can be improved by phase 1 of the above algorithm. We show this only for packets handled finally in odd layers of Algorithm 13. Let the side length of a block be b = n/f(n).

For arbitrary column c and row r let us define as critical packets those packets with addresses in (1, c), ..., (r, c) and with color 1. Assume the number of critical packets in block (x, y) to be $\tau_1(x, y)$. Note that for even r we have exactly $\sum_{x,y} \tau_1(x, y) = kr/2$ critical packets. Phase 1 equally distributes the critical packets such that in each layer there are at least $\lfloor \tau_1(x, y)/2 \rfloor$ and at most $\lceil \tau_1(x, y)/2 \rceil$ of them. Thus before phase 3 of Algorithm 13 there are at most

$$\sum_{x,y} \lceil \tau_1(x, y)/2 \rceil \le \sum_{x,y} \left( \tau_1(x, y)/2 + 1 \right) = kr/4 + (n/b)^2$$

critical packets in column c in even numbered layers. Unfortunately all kr/2 packets with addresses in (1, c), ..., (r, c) and color 0 may have been in one layer and also occur in column c. That is, in each layer there are at most $kr/4 + (n/b)^2 + kr/2 = 3kr/4 + (n/b)^2$ critical packets to route in phase 3. Therefore by Lemma 5, phase 3 can be performed by $3kn/4 + (n/b)^2$ transport steps. Hence by Lemma 5, in total $O(kb) + kn/2 + 3kn/4 + O((n/b)^2)$ transport steps are needed. Since $f(n) \le n^{1/3}$ we obtain $f(n)^2 \le n^{2/3} \le n/f(n) = b$ and the theorem is proven. ∎

COROLLARY 20. General k-k packet routing on an $n \times n$ torus can be done in $\frac{5}{4}kn/2 + O(kn/f(n))$ transport steps with a buffer size of $O(kf(n))$, where $f(n) \le n^{1/3}$.

Proof. Analogous to the proof of Theorem 19. ∎

3.2. Time Analysis for k Packet Splitting

At the end of this section we demonstrate that packet splitting may dramatically reduce transport cost. In this paper usually the number of transport steps is counted (where concurrent steps count as one step). Of course, when splitting a big packet into smaller ones it is not very realistic to assume the same duration of communication. Therefore we postulate a linear relation between packet size and communication duration. Let us assume that the duration of one transport step is $t_{step} = p \cdot s + t_0$, where s is the packet size, $t_0$ is the constant overhead, and p is the transport rate.

COROLLARY 21. If packet splitting is allowed, then the simple permutation routing problem with packets of size s only needs transportation time $nsp + 2nt_0 + O(\text{low order})$ on an $n \times n$ mesh resp. $nsp/2 + nt_0 + O(\text{low order})$ on an $n \times n$ torus.

Proof. Packet splitting routing is a special case of permutation in each layer. In each layer the same permutation takes place. Split up the packets into two subpackets. Then by Corollary 16 and for k = 2 the total number of transportation steps is 2n + O(low order). Since the packet size is only s/2 the transport time for a single step reduces to $p \cdot s/2 + t_0$. Similar argumentation applies to the wraparound case. ∎

That means that for large packet size where the overhead $t_0$ can be neglected a time speedup of a factor of almost 2 is achieved. This improvement even holds when we compare the splitting approach with the optimal nonsplitting algorithm of Leighton et al. [8], which needs time $(2n - 2)(sp + t_0)$. The disadvantage of our algorithm over that algorithm is the adaptive buffer size, which is a function of n for our algorithm while Leighton et al. use constant sized buffers. However, their constant is quite large, such that up to at least a million processors buffer requirements are smaller for our algorithm.

It should be mentioned that packet splitting in the above sense becomes unrealistic when the packet size is small compared to the address information. In this situation splitting in the context of message routing seems to have benefits. A first approach in this direction was recently done by Makedon and Simvonis [10].
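The timing model of this subsection is easy to evaluate numerically. The following Python sketch is a hedged illustration: it implements only the leading terms (low order terms are dropped), and the parameter values as well as the function names `split_time_mesh` and `nonsplit_time_mesh` are assumptions chosen for the example.

```python
def step_time(s: float, p: float, t0: float) -> float:
    """Duration of one transport step for packet size s: t_step = p*s + t0."""
    return p * s + t0


def split_time_mesh(n: int, s: float, p: float, t0: float) -> float:
    """Leading term of Corollary 21 on a mesh: 2n steps with packets of size s/2."""
    return 2 * n * step_time(s / 2, p, t0)


def nonsplit_time_mesh(n: int, s: float, p: float, t0: float) -> float:
    """Time (2n - 2)(sp + t0) quoted in the text for the nonsplitting algorithm of [8]."""
    return (2 * n - 2) * step_time(s, p, t0)


n, s, p, t0 = 1024, 4096.0, 1.0, 8.0  # hypothetical values
split = split_time_mesh(n, s, p, t0)
whole = nonsplit_time_mesh(n, s, p, t0)
print(split, whole, whole / split)
```

With a packet size that is large compared to the overhead, the ratio printed last comes out just below 2, in line with the speedup discussed above.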
3.3. Partial k Relations

At the end of this section let us have a brief look at k relations, which generalize k-k routing insofar as at most k instead of exactly k packets are sent and received by each processor. Thus in a certain sense a k relation problem can be viewed as a k-k routing problem with a lack of packets.
Note that this lack of packets does not adversely affect pure routing along the axes. Moreover there are no detrimental effects on the sorting of blocks. During the sorting phase the packets are rearranged in order to avoid buffer overflow. For this goal it is not necessary to place a packet in some absolute position according to its address; rather it suffices to achieve a relative ordering of packets. This relative ordering can be preserved for k relations by introducing dummy packets which are assumed to have minimal addresses with respect to any sorting order used. Thus all of the above algorithms for k-k routing can be adapted for k relations. Furthermore this technique of handling k relations also applies to the r-dimensional case discussed in the following section.

4. k-k ROUTING ON r-DIMENSIONAL MESHES

In this section we first present a solution to the problem of routing a sequence of k permutation problems. We then apply this to the packet splitting problem. Finally we discuss an algorithm for the general k-k routing problem.

4.1. k Permutation Routing

As in the two-dimensional case the basic idea is to exploit the uniaxiality of existing algorithms [4, 5, 11, 14] for the simple permutation routing problem. All these algorithms have at each clock cycle j a well-defined active axis $\alpha(j)$. Define for each layer i the active axis to be $(\alpha(j) + i) \bmod r + 1$. The k layers, $k \ge 2$, inside the mesh are divided into min(k, r) classes and for each class a chosen uniaxial algorithm is applied concurrently, giving min(k, r) parallel incarnations.

ALGORITHM 22.
1. Take a fixed uniaxial routing algorithm ALG and let $\alpha(j)$ denote the active axis of ALG at clock step j.
2. Apply algorithm ALG to each layer i with active axis $1 + (\alpha(j) + i) \bmod r$.

THEOREM 23. For r-dimensional meshes, $r \ge 3$, a sequence of k permutation problems can be deterministically routed by
(a) $\lceil k/r \rceil (2r - 1)n + O(n^{1-1/r})$ transport steps and a buffer size of k packets;
(b) $\lceil k/r \rceil (2r - 2)n + O(kn/f(n)^{1/(r-1)})$ transport steps and a buffer size of $O(kf(n))$ packets.

Proof. For (a) take one of the fast sorting algorithms presented in [4, 5] and use them as routing algorithms. It is easily checked that these algorithms are uniaxial. They need only $(2r - 1)n + O(n^{1-1/r})$ transport steps and a buffer size of 1 packet for solving a simple permutation routing problem. Hence by applying Algorithm 22, proposition (a) is shown.

For (b) take algorithm Sort-And-Route presented recently by one of the authors [6, 7]. This algorithm is also uniaxial. The buffer size needed for solving a single permutation routing problem is $O(f(n))$. Hence a buffer size of $O(kf(n))$ packets is needed for Algorithm 22. Since the Sort-And-Route algorithm routes by $(2r - 2)n + O(n/f(n)^{1/(r-1)})$ transport steps, Algorithm 22 performs a sequence of k permutation problems by $\lceil k/r \rceil (2r - 2)n + O(kn/f(n)^{1/(r-1)})$ transport steps. ∎

At the end of this section let us briefly mention how this result can be applied to the k splitting problem. The same time assumptions as those in the last section are used.

COROLLARY 24. If packet splitting is allowed, then on an r-dimensional grid, $r \ge 3$, the simple permutation routing problem with packets of size s needs only
(a) $(2 - 1/r)nsp + (2r - 1)nt_0 + O(\text{low order})$ transportation time and a buffer size of 1 packet;
(b) $(2 - 2/r)nsp + (2r - 2)nt_0 + O(\text{low order})$ transportation time and a buffer size of $O(f(n))$ packets.

Proof. Split packets into r subpackets of equal size. Then by Theorem 23 we immediately obtain the above corollary. ∎

The above corollary again demonstrates that packet splitting is a very useful technique for routing packets of large size.

4.2. r-Dimensional k-k Routing

Things get a little more involved when we cannot separate the permutations into different layers. But the idea of interleaving orthogonal phases can still be applied.

4.2.1. General Strategy and Elementary Definitions

The principal idea for k-k routing on an r-dimensional mesh is to solve (r - 1)-dimensional k-k routing problems in parallel hyperplanes (i.e., each packet has reached its correct destination for at least (r - 1) coordinates). Then the packets are routed along the rth axis to their correct destination. Unfortunately this naive technique can lead to severe problems. Imagine that by accident all $kn^2$ packets with address $(x_1, x_2, *, *)$ are in plane $(*, *, p_3, p_4)$. They then have to gather at processor $(x_1, x_2, p_3, p_4)$ and in a plane this needs at least time $kn^2/4$. Hence some kind of preprocessing has to take place. That preprocessing will arrange the packets in such a way that they satisfy the boundary conditions for the (r - 1)-dimensional case of a balanced
routing problem. To further speed up the algorithm the same interleaving idea as before is utilized. Thus we normally describe only one of the interleaving incarnations. Before we continue in the description of the algorithm, we want to give some definitions. For the rest of this section we assume that the packets are linearly ordered with respect to the lexicographical order of their addresses; that is, $\mathrm{address}(P) \le_{lex} \mathrm{address}(Q)$ if and only if $\mathrm{lex}(\mathrm{address}(P)) \le \mathrm{lex}(\mathrm{address}(Q))$. For submeshes the indexings are used in the corresponding manner. To get rid of numerous ceiling brackets we additionally assume that r divides k throughout the rest of the paper. Additionally a block number parameter a is assumed to be restricted to 2 < a
The rearrangement principle is illustrated in Fig. 4 for the three-dimensional case. To cope with the fact that now k/r layers are involved in each incarnation, we modify the index functions by introducing the layer number as the least significant digit for the index. The complete algorithm for one incarnation is shown in Fig. 5. The correctness of the algorithm is extensively discussed in [7].
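One way to read the modified index function is sketched below in Python. This is an interpretation under stated assumptions, not the authors' definition: the lexicographical index of the processor is multiplied by the number of layers of the incarnation and the layer number is added as the least significant digit. The name `lex_with_layer` and the example sizes are hypothetical.

```python
def lex(p: tuple[int, ...], n: int) -> int:
    """Lexicographical index: sum of p_i * n**(r - i) for i = 1..r."""
    index = 0
    for coord in p:
        index = index * n + coord
    return index


def lex_with_layer(p: tuple[int, ...], layer: int, n: int, layers: int) -> int:
    """Index of a buffer slot, with the layer number as least significant digit."""
    return lex(p, n) * layers + layer


n, layers = 3, 2  # hypothetical: side length 3 and k/r = 2 layers per incarnation
slots = [((x, y), l) for x in range(n) for y in range(n) for l in range(layers)]
indices = [lex_with_layer(p, l, n, layers) for p, l in slots]
print(indices)
```

Under this reading, consecutive indices first run through the layers of one processor before moving on to the next processor in lexicographical order.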
FIG. 3. Blocks.
Performance Analysis. The correctness analysis ensured that the subphases of the algorithm fit together correctly and that the algorithm works. To guarantee correctness, the time and space requirements of the worst subcase (e.g., the linear routing within a tower with the most packets) have to be taken as the requirements of the phase under consideration. To do that we have to give worst-case estimates for the phases of one incarnation. For ease of notation the initial number of packets per processor for the ith incarnation (which is the same for all incarnations) is denoted by $k_i$ with $k_i := k/r$.

Rearrangement. The two sorting subphases per dimension take place in blocks of size at most $n/a$. Thus the sorting step time is $O(n/a)$. The shifts cost $k_i n$ steps per dimension:
⇒ step count: $(r - 2)(k_i n + O(n/a))$;
buffer: $O(k_i)$.

2D Sort-And-Route. The first routing phase can be done in $k_i n$ steps (shift register principle). For the second phase things get more difficult. The balancing condition only guarantees that we have at most $k(n + f(n))$ packets in the linear array. Thus it needs $k(n + f(n))$ steps:
⇒ step count: $k_i n + k(n + f(n))$;
buffer: $O(k \cdot f(n))$.

Correction. As before, the balancing condition can only ensure that we have at most $k(n + f(n))$ packets in the linear array. So each correction phase has to deal with at most $k(n + f(n))$ packets. Thus
⇒ step count: $(r - 2)k(n + f(n))$;
buffer: $O(k \cdot f(n))$.
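As a quick consistency check (a sketch added for illustration, not part of the original analysis), the leading terms of the three step counts above can be added up with $k_i = k/r$. The symbol $T_{\text{lead}}$ is introduced here only for this purpose, and treating the $O(n/a)$ and $k \cdot f(n)$ contributions as lower-order terms is an assumption of the sketch.

```latex
\begin{align*}
T_{\text{lead}}
  &= \underbrace{(r-2)\,k_i n}_{\text{rearrangement}}
   + \underbrace{k_i n + k n}_{\text{2D Sort-And-Route}}
   + \underbrace{(r-2)\,k n}_{\text{correction}} \\
  &= (r-1)\,\frac{k n}{r} + (r-1)\,k n \\
  &= (r-1)\Bigl(1 + \frac{1}{r}\Bigr) k n .
\end{align*}
```

This agrees with the leading term of the bound in Theorem 26; for r = 3, for instance, it evaluates to $\tfrac{8}{3}kn$.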
Note that the buffer requirements for the interleaved incarnations are r times the requirements for a single incarnation.

THEOREM 26. The k-k routing problem on an r-dimensional grid with side length n and buffer size parameter f(n) (with $f(n) \le n^{\ldots}$) can be solved within

$(r - 1)(1 + 1/r)kn + O(n/f(n)^{1/(r-1)})$

time steps and with buffer $rk \cdot f(n)$.

FIG. 4. Three-dimensional rearrangement: reversed lexicographical sort (a), shifting (b), lexicographical sort (c).

Improvement. Analogously to the two-dimensional case the result of Theorem 26 can be improved by a coloring prephase analogous to the one presented in Section 3. The improvement of the algorithm is achieved by coloring all packets according to their destination processor. It will affect the accumulation of packets for the correction phases of the incarnations. The modified algorithm is as follows:

ALGORITHM 27. Color packets according to their destination addresses.

1. Sort the packets with respect to mixed order into blocks with block size b in the following way: The layer numbers are seen as an additional coordinate with highest priority; i.e., all addresses in layer i are smaller than or equal to addresses in layer i + 1 (i = 1, ..., k - 1) and within a layer follow the lexicographical indexing of the processors (some kind of layer-major indexing). Shift packets within processors with color c from layer i to layer (i + c) mod k + 1.
2. Apply Algorithm 31. Each incarnation uses layers with same remainders modulo r.

THEOREM 28. Algorithm 27 needs

$(r - 1)(1 + 1/r^2)kn + O(\text{low order})$

steps with a buffer size of $rk \cdot f(n)$ when $r \mid k$.

Proof. W.l.o.g. focus on one of the interleaved incarnations, say the one gathering layers l with l mod r = 0 (so-called layers on level 0); i.e., there are k/r packets per processor in each level. As in the two-dimensional case the prephase does not affect the principal behavior of Algorithm 31. The prephase leaves k/r packets in each processor and thus the preconditions for the rearrangement phases are the same as before. They take time $(r - 2)kn/r + O(\text{low order})$. The sorting and first routing phase of the two-dimensional Sort-And-Route are also not affected by the prephase. Thus they add time $kn/r + O(\text{low order})$.

Now let us check that the bad time for the correction phases (kn) is somewhat lowered. From now on call those packets critical which go to $(x_1, \ldots, x_{s-1}, *, p_{s+1}, \ldots, p_r)$ and have color 0. The total number of these packets is $kn/r$. By the sorting part of the prephase all those critical packets become nearly equally distributed onto the different levels. Then the shift phase of the prephase kicks out $(r - 1)/r$ of all the color 0 packets to other levels. That means that only $1/r^2$ of the color 0 critical packets stay in level 0, i.e., at most $kn/r^2$. But we cannot say anything about the other colors. So we must assume that non-0 colored critical packets for the tower under consideration go by accident to level 0, i.e., at most $kn(r - 1)/r$. In total, we have reduced the number of packets going to level 0 (i.e., incarnation 0) from $kn$ to $kn(1/r^2 + (r - 1)/r)$. As (by Lemma 5) the routing time is linear in the number of packets involved, each correction phase now takes $kn(1/r^2 + (r - 1)/r)$ for a total of $(r - 2)kn(1/r^2 + (r - 1)/r)$. Summation completes the proof. ∎

COROLLARY 29. The k-k routing problem on an r-dimensional torus with side length n and buffer size parameter f(n) (with $f(n) \le n^{\ldots}$) can be solved within

$(r - 1)(1 + 1/r^2)kn/2 + O(n/f(n)^{1/(r-1)})$

time steps and with buffer $rk \cdot f(n)$.

Algorithm 31 (r-dimensional k-k routing)

    for s := r downto 3 do
    begin  { rearrange phase s }
      for all j := s+1, ..., r, for all p_j := 0, ..., n-1 do in parallel
      begin
        Rearrange(s, (*, ..., *, p_{s+1}, ..., p_r))
      end
    end
    for all j := 3, ..., r, for all p_j := 0, ..., n-1 do in parallel
    begin
      Sort-And-Route(*, *, p_3, ..., p_r)
    end
    for s := 3 to r do
    begin  { correction phase s }
      for all j := 1, ..., r, j ≠ s, for all p_j := 0, ..., n-1 do in parallel
      begin
        in all s-towers (p_1, ..., p_{s-1}, *, p_{s+1}, ..., p_r)
          transport packets to their correct s-address.
      end
    end

FIG. 5. r-dimensional k-k routing.

5. CONCLUSION

In this paper we presented algorithms for k-k packet routing on mesh-connected arrays. As mentioned at the end of the third section these algorithms also work for partial k relations. The two-dimensional algorithm for an n X n mesh uses a buffer size of f(n), where f(n) is some monotonic function of n. The number of transport steps for routing is at most $\tfrac{5}{4}kn + O(kn/f(n))$, but can be reduced to $kn$
$+\ O(kn/f(n))$ for the interesting subcase of a sequence of k permutation routing problems. In the r-dimensional space the same technique yields an algorithm with step count $(r - 1)(1 + 1/r^2)kn + O(n/f(n)^{1/(r-1)})$ and buffer $rk \cdot f(n)$ in the general case. A sequence of permutation problems can be done in $\lceil k/r \rceil (2r - 2)n + O(kn/f(n)^{1/(r-1)})$. If a splitting of large packets into r small ones is possible then permutation routing can be performed nearly r times faster than routing of unsplit packets. For grids with wraparound connections these step counts and times generally can be reduced by one half. It still remains open whether the algorithms of this paper are optimal or whether the simple congestion lower bound is tight. Another open problem is how to handle nonsquare grids. At the end of this paper it should be pointed out that most of the methods presented in this paper can also be applied to routing problems where the maximum manhattan distance between source and destination is known to be small (similarly to [5]). However, it is still open whether our methods may be advantageous for other topologies like butterflies or hypercubes. Results in [2, 13] indicate that, e.g., packet splitting is also beneficial for the performance of special routing algorithms on hypercubes.

REFERENCES

1. Agrawal, D. P., Janakiram, V. K., and Pathak, G. C. Evaluating the performance of multicomputer configurations. IEEE Comput. 19 (1986), 23-37.
2. Johnsson, S. L., and Ho, C. T. Optimum broadcasting and personalized communication in hypercubes. IEEE Trans. Comput. C-38 (1989), 1249-1268.
3. Krizanc, D., Rajasekaran, S., and Tsantilas, T. Optimal routing algorithms for mesh-connected processor arrays. In Reif, J. H. (Ed.). VLSI Algorithms and Architectures, Proc. 3rd AWOC 88, Lecture Notes in Computer Science Series. Springer-Verlag, New York/Berlin, 1988, Vol. 319, pp. 411-422.
4. Kunde, M. Optimal sorting on multi-dimensionally mesh-connected computers. In Brandenburg, F. J., Vidal-Naquet, G., and Wirsing, M. (Eds.). Proc. STACS 87, Lecture Notes in Computer Science Series. Springer-Verlag, Berlin/Heidelberg/New York/Tokyo, 1987, Vol. 247, pp. 408-419.
5. Kunde, M. Routing and sorting on mesh-connected arrays. In Reif, J. H. (Ed.). VLSI Algorithms and Architectures, Proc. 3rd AWOC 88, Lecture Notes in Computer Science Series. Springer-Verlag, New York/Berlin, 1988, Vol. 319, pp. 423-433.
6. Kunde, M. Parallel routing on multi-dimensional grids of processors. In Jesshope, C. R., and Reinartz, K. D. (Eds.). Proc. CONPAR 88. Cambridge Univ. Press, 1989, pp. 687-694.
7. Kunde, M. Packet routing on grids of processors. In Djidjev (Ed.). Optimal Algorithms, Proc., Lecture Notes in Computer Science Series. Springer-Verlag, Berlin/Heidelberg/New York/Tokyo, 1989, Vol. 401, pp. 254-265.
8. Leighton, T., Makedon, F., and Tollis, I. A 2n - 2 step algorithm for routing in an n X n array with constant size queues. Proc. 1989 ACM Symposium on Parallel Algorithms and Architectures, Santa Fe, NM, 1989, pp. 328-335.
9. Ma, Y., Sen, S., and Scherson, I. D. The distance bound for sorting on mesh-connected processor arrays is tight. Proc. FOCS 86, pp. 255-263.
10. Makedon, F., and Simvonis, A. On bit-serial packet routing for the mesh and the torus. Proc. 3rd Symposium of Frontiers of Massively Parallel Computation. 1990, to appear.
11. Nassimi, D., and Sahni, S. Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Comput. C-28 (1979), 2-7.
12. Schnorr, C. P., and Shamir, A. An optimal sorting algorithm for mesh-connected computers. Proc. STOC 1986, Berkeley, CA, 1986, pp. 255-263.
13. Stout, Q. F., and Wagar, B. Intensive hypercube communication. I. Prearranged communication in link-bound machines. J. Parallel Distrib. Comput., to appear.
14. Thompson, C. D., and Kung, H. T. Sorting on a mesh-connected parallel computer. Comm. ACM 20 (1977), 263-270.
15. Valiant, L. G., and Brebner, G. J. Universal schemes for parallel communication. Proc. STOC 81, 1981, pp. 263-277.

Received February 15, 1990; accepted July 11, 1990
MANFRED KUNDE received his diploma degree and his Ph.D. degree in computer science from the Christian-Albrechts-Universität Kiel (Germany) in 1975 and 1980. He is presently head of the research project on “Data Transfer on Networks of Processors” at the Computer Science Department of the Technische Universität München. His research interests include design and analysis of parallel algorithms, theory of VLSI, and complexity theory.

THOMAS TENSI received his diploma degree in computer science from the Technische Universität München (Germany) in 1985. He is presently working toward his Ph.D. degree at that university. His research interests include parallel algorithms for networks and complexity theory.