JOURNAL OF ALGORITHMS 11, 622-630 (1990)

Sorting on a Ring of Processors

YISHAY MANSOUR*
Laboratory for Computer Science, MIT, 545 Technology Square, Cambridge, Massachusetts 02139

AND

LEONARD SCHULMAN*
Department of Mathematics, MIT, Cambridge, Massachusetts 02139
Received August 1, 1989
We study the time necessary to sort on a ring of processors. We show that the amount of space available to each processor determines the time required. We prove a lower bound of $2\lfloor n/2 \rfloor - 1$ steps for sorting on a ring of $n$ processors, under the constraint that each processor retains only a single value at any time. In contrast, we show an algorithm that sorts in $\lfloor n/2 \rfloor + 1$ steps if each processor is allowed to store six values. © 1990 Academic Press, Inc.
*Part of the work was done while the first author was visiting Bell Laboratories, Murray Hill, NJ. The second author was supported by an ONR graduate fellowship. The research was supported by NSF-865727-CRR, ARO-DALLO3-86-K-017, DARPA-N00014-87-k-85, ONR-N00014-86-K-0593, AF-OSR-89-0271, and DAAL-03-86-K-0171.

0196-6774/90 $3.00 Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

1. INTRODUCTION

The ring of processors has been studied as a model for both distributed and parallel computation [1, 2, 4-6]. Each processor in the ring is limited to communication with its two neighbors. The communication is synchronous, i.e., all the processors send messages at each clock tick, and the messages arrive at their neighbors instantaneously.

The sorting problem for this network is formalized as follows. Initially there are $n$ values distributed, one per processor. On termination, processor $p_i$ halts holding the $i$th-ranked value.

The sorting problem and others have been widely studied in the ring model. In [6] a tight bound on the bit complexity of any sorting algorithm is given. Trade-offs between the number of messages and the time required are shown in [2]. In [1], lower and upper bound techniques for the message complexity are described. It is shown there that any function can be computed with $O(n \log n)$ messages, and that some functions (e.g., the exclusive-or of the inputs, where each processor has a one-bit input) require $\Omega(n \log n)$ messages to be computed.

In this work we study the time complexity of sorting on a ring. We assume that each processor in the ring is subject to some space constraint. Without this constraint, the following trivial algorithm achieves the lower bound of $\lfloor n/2 \rfloor$ steps that is dictated by the diameter of the network: Each processor broadcasts its value in both directions around the ring. After $\lfloor n/2 \rfloor$ steps, each processor holds all the values and can find the correct output locally. This solution is unattractive because it requires the system to have $\Omega(n^2)$ storage space to sort $n$ elements.

One can consider algorithms that only permute elements from step to step. (An equivalent constraint, since a sorting algorithm must not lose any of the inputs, is to require that each processor retain no more than one value from one clock tick to the next.) Bubblesort (see [3]) is such an algorithm, and it will sort in $2n - 3$ steps. The odd-even transposition sort is another such algorithm, and it sorts the inputs in $n$ steps (see [5]). Kunde [4] showed a lower bound of $3n/4 - 2$ for such algorithms.

We improve the lower bound of Kunde and show a lower bound of $2\lfloor n/2 \rfloor - 1$. This lower bound in fact depends heavily on the space constraint. We give an algorithm that allows nodes to retain six values and sorts in $\lfloor n/2 \rfloor + 1$ steps.

The paper is organized as follows. Section 2 describes the model of computation and the sorting problem. Section 3 describes the lower bound in the case of a single value. Section 4 gives the algorithm for the case of six values.
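The trivial unbounded-space algorithm described in the introduction can be sketched as a small simulation. This is only an illustration under stated assumptions: processors are indexed from 0 rather than 1, the function name `broadcast_sort` is hypothetical, and each message is modeled as an (origin, value) pair so that equal values are not confused.

```python
def broadcast_sort(values):
    """Simulate the trivial algorithm: every processor floods its value both ways.

    Processors are indexed 0..n-1 (an assumption; the paper uses 1..n).
    After floor(n/2) synchronous steps every processor has seen all n values
    and outputs the value whose rank equals its own index.
    """
    n = len(values)
    # seen[i] maps origin index -> value for everything processor i has received.
    seen = [{i: values[i]} for i in range(n)]
    # Messages currently sitting at each processor, one per direction.
    going_up = [(i, values[i]) for i in range(n)]    # travelling towards i+1
    going_down = [(i, values[i]) for i in range(n)]  # travelling towards i-1
    for _ in range(n // 2):
        # One clock tick: every message moves to the neighbouring processor.
        going_up = [going_up[(i - 1) % n] for i in range(n)]
        going_down = [going_down[(i + 1) % n] for i in range(n)]
        for i in range(n):
            for origin, value in (going_up[i], going_down[i]):
                seen[i][origin] = value
    # Each processor now holds all n values, i.e. Omega(n^2) storage in total.
    assert all(len(s) == n for s in seen)
    return [sorted(seen[i].values())[i] for i in range(n)]
```

For example, `broadcast_sort([5, 2, 9, 1, 7])` finishes after two clock ticks, with each of the five processors storing all five values.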
2. MODEL
A ring consists of $n$ processors, $p_1$ to $p_n$. Processor $p_i$ has a link to processors $p_{(i+1) \bmod n}$ and $p_{(i-1) \bmod n}$. (We will hereafter omit the mod $n$ from the subscript.) The communication between the processors is synchronous. In every clock tick each processor sends a message to its two neighbors. The sorting problem is formalized as follows. Every processor $p_i$ has some initial value $v_i$. At the termination of the protocol processor $p_i$ holds that $v_j$ such that $\mathrm{rank}(v_j) = i$, i.e., the final values of the processors are sorted in increasing order starting with $p_1$.
The algorithms we consider for the sorting problem use the comparison model: The only operation allowed on values is comparison. Messages can consist of a constant number of values and O(log n) bits. The capacity of each processor is our formulation of space. An algorithm has capacity k if each processor, after receiving the messages from its neighbors, stores at most k values to the next clock tick.
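As an illustration of a capacity-1 algorithm in this model, the odd-even transposition sort mentioned in the introduction can be sketched as follows. This is a minimal sketch, not the paper's own code: it treats $p_1, \ldots, p_n$ as a linear array (the link between $p_n$ and $p_1$ is simply not used), indexes cells from 0, and the function name is hypothetical. Each processor keeps exactly one value between clock ticks, and only comparisons are applied to values.

```python
def odd_even_transposition_sort(values):
    """Capacity-1 sorting on a line of n cells in n synchronous steps.

    In even steps the pairs (0,1), (2,3), ... compare and exchange;
    in odd steps the pairs (1,2), (3,4), ... do. Every cell holds a
    single value at all times, so the cells always form a permutation
    of the input.
    """
    cells = list(values)
    n = len(cells)
    for step in range(n):
        # All exchanges of one step are disjoint, so they can run in parallel.
        for i in range(step % 2, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells
```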
3. THE LOWER BOUND
In this section we prove the lower bound on the time required to sort $n$ values.

THEOREM 1. Any sorting algorithm that has capacity 1 requires $2\lfloor n/2 \rfloor - 1$ steps.

Proof. We will restrict our attention to sorting $n$ distinct numbers in the range 0 to $2n$. We start the lower bound arguments with a few observations. The first observation is that since each processor can hold only a single value, the values that the processors hold after each clock tick are a permutation of the original values. Therefore each value $v$ is held by a unique processor after each clock tick. Sometimes we refer to the values as moving around the ring, rather than the processors sending them.

The next observation is about information transfer. The $k$-neighborhood of a processor consists of itself and $k$ processors to each side of it. The state of a processor before the $(k + 1)$st clock tick depends only on the original values of the processors in this neighborhood. In like manner we can define the neighborhood of a value. The $k$-neighborhood of a value $v$ after the $i$th clock tick is the $k$-neighborhood of the processor then holding $v$. The movement of value $v$ up through the $k$th step depends only on the original values in its $k$-neighborhood.

The construction of the lower bound consists of two stages: the first of $\lfloor n/2 \rfloor - 1$ clock ticks and the second of $\lfloor n/2 \rfloor$ clock ticks. In the first stage we, in adversarial fashion, fix the original values depending on the movement of the value $n$ in the ring. This will be done in a manner ensuring that at the end of the first stage the value $n$ is at distance $\lfloor n/2 \rfloor$ from its destination. Then in stage two the value $n$ will need $\lfloor n/2 \rfloor$ clock ticks to get to its final location.

The original configuration is one of the following $n - 1$ configurations. The $k$th configuration, $2 \le k \le n$, has $v_j = j - 1$ for $2 \le j \le k - 1$, $v_j = n + j$ for $k + 1 \le \ldots$
FIG. 1. The 1-neighborhood of $p_1$ in the initial configuration.
We maintain the following invariant about the value $n$. Let $p_{j_i}$ be the processor that has the value $n$ after clock tick $i$ and before clock tick $i + 1$.

Inductive claim. The $2i + 1$ original values of the $i$-neighborhood of $p_{j_i}$ can be fixed such that the final position of the value $n$ is outside the $i$-neighborhood of $p_{j_i}$.
We start with $i = 0$ (Fig. 1). Clearly, the base of the induction holds. It remains to show how to maintain the inductive step for $i$. In order for the inductive claim to be correct it is necessary that $2i + 2 \le n$. (The $2i + 1$ nodes in the neighborhood and the node for the value 0.) We will show that the inductive claim holds for $0 \le i \le \lfloor n/2 \rfloor - 1$. There are three possible destinations for the value $n$ at clock tick $i$.

The first case is that the value $n$ remains in its previous position, i.e., $p_{j_{i+1}} = p_{j_i}$. In this case we fix the values of processors $p_{j_i+i+1}$ and $p_{j_i-i-1}$. The original value of processor $p_{j_i+i+1}$ is set to $j_i + i$, and the original value of $p_{j_i-i-1}$ is set to $n + j_i - i - 1$. This implies that the original location of 0 is outside the $(i + 1)$-neighborhood of $p_{j_{i+1}}$. Therefore, the final destination of $n$ is outside the $(i + 1)$-neighborhood of $p_{j_{i+1}}$.

The second case is that the value $n$ moves from $p_{j_i}$ to $p_{j_i+1}$, i.e., $p_{j_{i+1}} = p_{j_i+1}$. In this case we fix the values of processors $p_{j_i+i+1}$ and $p_{j_i+i+2}$. The original values of processors $p_{j_i+i+1}$ and $p_{j_i+i+2}$ are set to $j_i + i$ and $j_i + i + 1$, respectively. This implies that the original location of 0 is outside the $(i + 1)$-neighborhood of $p_{j_{i+1}}$. Therefore, the final destination of $n$ is outside the $(i + 1)$-neighborhood of $p_{j_{i+1}}$.

The third case is that the value $n$ moves from $p_{j_i}$ to $p_{j_i-1}$, i.e., $p_{j_{i+1}} = p_{j_i-1}$. In this case we fix the values of processors $p_{j_i-i-1}$ and $p_{j_i-i-2}$. The original values of processors $p_{j_i-i-1}$ and $p_{j_i-i-2}$ are set to $n + j_i - i - 1$ and $n + j_i - i - 2$, respectively. Therefore, the final destination of $n$ is outside the $(i + 1)$-neighborhood of $p_{j_{i+1}}$.

Thus the inductive claim holds through $i = \lfloor n/2 \rfloor - 1$. Therefore $\lfloor n/2 \rfloor - 1$ clock ticks are spent in stage one and the value $n$ is then at distance $\lfloor n/2 \rfloor$ from its final destination. Hence the overall time is at least $2\lfloor n/2 \rfloor - 1$. □
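The step count at the end of the proof can be written out as a short worked equation. The numerical instances $n = 8$ and $n = 9$ are illustrative choices and do not appear in the paper.

```latex
\[
  \underbrace{\lfloor n/2 \rfloor - 1}_{\text{stage one}}
  \;+\;
  \underbrace{\lfloor n/2 \rfloor}_{\text{stage two}}
  \;=\; 2\lfloor n/2 \rfloor - 1,
  \qquad
  n = 8:\ 3 + 4 = 7,
  \qquad
  n = 9:\ 3 + 4 = 7 .
\]
```

For even $n$ this equals $n - 1$, and for odd $n$ it equals $n - 2$.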
FIG. 2. The configuration at step $i = 2$ after the value $n$ moves from $p_1$ to $p_2$.
4. THE SORTING ALGORITHM

We now present an algorithm which sorts on a ring in $\lfloor n/2 \rfloor + 1$ steps. Theorem 1 will not apply because we will allow capacity greater than one at every processor; however, this capacity will be a constant. Also (and this was not prohibited in Theorem 1) we will use a constant number of counters of size $O(\log n)$ at each processor and allow a message between processors to relay a constant number of values and a string of length $O(\log n)$. The value of $n$ is available at every processor, as is its own address $i$.
Initialization and Broadcast
Let $v_i$ be the value initially held at processor $p_i$. We start at $p_i$ with two packets $y_i^+$ and $y_i^-$. Each packet carries the value $v_i$, a focus register $f$ (of size $\log n$), and a record of $i$. This last will be used merely to arbitrate comparisons in case of equality; it allows us in what follows to assume all values distinct. Additionally at $p_i$ we start with four copies of $v_i$ to be used for broadcast. These are labeled $v_i^+$, $v_i^-$, $v_i'^+$, and $v_i'^-$. Their trajectories are given (for step $0 \le t \le \lfloor n/2 \rfloor$) by

\[ \mathrm{loc}(v_i^+, t) = i + t, \qquad \mathrm{loc}(v_i^-, t) = i - t. \]

The "primed" broadcasts lag by one: at $t = 0$ they are, of course, located at $p_i$ and thereafter they follow

\[ \mathrm{loc}(v_i'^+, t) = i + t - 1, \qquad \mathrm{loc}(v_i'^-, t) = i - t + 1. \]

Each of the packets is simply forwarded at every step (save the initial lag). Packets superscripted with $+$ are sent from $p_j$ to $p_{j+1}$; packets superscripted with $-$ are sent from $p_j$ to $p_{j-1}$. The relevant property of these broadcasts is captured in:

Property I. A packet located at processor $p_j$ on the ring at step $t$ has seen broadcasts of precisely the values $v_{j-t}, \ldots, v_{j+t}$.
Directly related to this is the observation that:

Property II. Every packet sees two new broadcast values at every step (for $0 \le t \le \lfloor n/2 \rfloor$), no matter what trajectory the packet has followed.
If the packet stayed put at $p_j$ between steps $t - 1$ and $t$ then the new broadcasts are $v_{j-t}^+$ and $v_{j+t}^-$; if it reached $p_j$ from $p_{j-1}$ then they are $v_{j+t-1}'^-$ and $v_{j+t}^-$; and if it came from $p_{j+1}$ then they are $v_{j-t+1}'^+$ and $v_{j-t}^+$. (For even $n$ only one new value is seen at step $n/2$.)
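Property I can be checked mechanically from the four broadcast trajectories. The sketch below does so for random packet walks; it is an illustration only. The assumptions are that processors are indexed 0..n-1 with arithmetic mod $n$, that a packet "sees" every broadcast copy located at its processor at the current step, and that the packet moves by at most one position per step. All function names are hypothetical.

```python
import random


def arrivals(j, t, n):
    """Origins i whose broadcast copies are located at processor j at step t."""
    if t == 0:
        return {j % n}  # all four copies of v_j start at p_j
    return {
        (j - t) % n,      # v_i^+  with i + t = j
        (j + t) % n,      # v_i^-  with i - t = j
        (j - t + 1) % n,  # v_i'^+ with i + t - 1 = j
        (j + t - 1) % n,  # v_i'^- with i - t + 1 = j
    }


def seen_after_walk(start, moves, n):
    """Follow a packet from `start` along `moves` (each -1, 0 or +1)."""
    pos, seen = start, set(arrivals(start, 0, n))
    for t, step in enumerate(moves, start=1):
        pos = (pos + step) % n
        seen |= arrivals(pos, t, n)
    return pos, seen


def window(j, t, n):
    """Index set {j-t, ..., j+t} taken mod n."""
    return {(j + d) % n for d in range(-t, t + 1)}


def check_property_one(n, trials=200, seed=0):
    """Return True if Property I held on every prefix of every random walk."""
    rng = random.Random(seed)
    for _ in range(trials):
        start = rng.randrange(n)
        moves = [rng.choice((-1, 0, 1)) for _ in range(n // 2)]
        for t in range(len(moves) + 1):
            pos, seen = seen_after_walk(start, moves[:t], n)
            if seen != window(pos, t, n):
                return False
    return True
```

The lagging "primed" copies are what make the property independent of the trajectory: a packet that has just moved to a neighbor picks up the one value it would otherwise have missed.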
The Packets
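The foci of the packets are initialized with

\[
f_0(y_i^+) =
\begin{cases}
\dfrac{n+1}{2} & \text{for } i \le \lfloor n/2 \rfloor \\[2ex]
\dfrac{3n+1}{2} & \text{for } i > \lfloor n/2 \rfloor
\end{cases}
\qquad
f_0(y_i^-) =
\begin{cases}
\dfrac{-n+1}{2} & \text{for } i \le \lfloor n/2 \rfloor \\[2ex]
\dfrac{n+1}{2} & \text{for } i > \lfloor n/2 \rfloor .
\end{cases}
\]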
In contrast to the broadcasts, the trajectories of the packets do depend on the inputs. At every step $t > 0$ each packet compares itself with the two new broadcasts that it sees (recall that a packet $y_i^{\pm}$ carries with it the value $v_i$). It then updates its focus: if both values were less than $v_i$ then $f_t = f_{t-1} + 1$; if both were greater then $f_t = f_{t-1} - 1$; otherwise $f_t = f_{t-1}$.
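The focus update rule can be traced with a small function. This is a sketch under assumptions, not the paper's implementation: the focus starts at $(n+1)/2$, the other $n - 1$ values are presented two per step in the given order, and, for even $n$, the single value left for the last step is assumed to move the focus by $1/2$. The name `final_focus` is hypothetical.

```python
from fractions import Fraction


def final_focus(value, others):
    """Apply the focus update rule to `value` against all other inputs.

    `others` lists the remaining n-1 distinct values in the order they are
    seen, two per step. Starts from (n+1)/2 and returns the final focus.
    """
    n = len(others) + 1
    focus = Fraction(n + 1, 2)
    for k in range(0, len(others) - 1, 2):
        a, b = others[k], others[k + 1]
        if a < value and b < value:
            focus += 1          # both new broadcasts are smaller
        elif a > value and b > value:
            focus -= 1          # both new broadcasts are larger
        # one smaller and one larger: focus unchanged
    if len(others) % 2 == 1:
        # Even n: assumed half-step for the single comparison of the last step.
        focus += Fraction(1, 2) if others[-1] < value else Fraction(-1, 2)
    return focus
```

If $L$ of the other values are smaller than $v_i$ and $G$ are larger, the rule adds $(L - G)/2$ in total, so the result is $(n+1)/2 + (L - G)/2 = L + 1$, which is the rank of $v_i$ regardless of the order in which the values are seen.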
Having seen some of the input values, the packet can rule out some of the highest- and lowest-indexed processors as its final address; the focus, mod $n$, is at the middle of the segment of those not ruled out. After comparison with all $n - 1$ other values, $f(y_i^{\pm})$ will equal $\mathrm{rank}(v_i)$ mod $n$. By Property II this will happen after $\lfloor n/2 \rfloor$ steps.

The idea now is that packets travel in order to try to meet their focus. The packets keep track of their own current address as a value in $\mathbb{Z}$; that is, their address starts out as a value between 1 and $n$, but should they travel across the arc $(p_n, p_1)$ it will not remain so. Thus the difference $f(y_i^{\pm}) - \mathrm{loc}(y_i^{\pm})$ is well defined in $\mathbb{Z}$ and prescribes a direction of motion for the packet in order to try to minimize this difference. (In the interests of uniformity we adopt the convention that, for even $n$, the packet does take a step when the distance to its focus is $\tfrac12$.)

What do the packets' trajectories look like? Every $y_i^+$ packet starts traveling in the direction of increasing processor index, i.e., counterclockwise (see Fig. 2). Every $y_i^-$ packet starts traveling clockwise. They only stop or change direction if and when they coincide with or overtake their focus. Since the focus can only change by 1 with each time step, they can from then on "track" it: they manage to stay within a distance of 1 of it. (This holds for odd $n$; for even $n$ the distance is $\tfrac32$, but at $t = n/2$ only one comparison is made, $f$ changes by $\tfrac12$ instead of 0 or 1, and the distance becomes at most 1.)

Next we show that one of the pair $y_i^-$, $y_i^+$ must meet its focus by step $\lfloor n/2 \rfloor$. By meeting we mean being at a distance of at most 1. It will thereafter track the focus and at time $\lfloor n/2 \rfloor$ be within a distance of 1 of it. Recall that at time $\lfloor n/2 \rfloor$ the value of $f$ is congruent (mod $n$) to $\mathrm{rank}(v_i)$. Taking that final step, the packet reaches its destination in time $\lfloor n/2 \rfloor + 1$.

LEMMA 1. For every $i$, one of $y_i^+$ or $y_i^-$ will meet its focus by step $\lfloor n/2 \rfloor$.
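Proof. We consider two cases.

(a) If either

\[ \mathrm{rank}(v_i) \le i + \lfloor n/2 \rfloor + 1 \quad (\text{for } i \le \lfloor n/2 \rfloor) \]

or

\[ \mathrm{rank}(v_i) \le i - \lfloor n/2 \rfloor + 1 \quad (\text{for } i > \lfloor n/2 \rfloor) \]

then $y_i^+$ meets its focus by step $\lfloor n/2 \rfloor$.

(b) If either

\[ \mathrm{rank}(v_i) \ge i + \lfloor n/2 \rfloor - 1 \quad (\text{for } i \le \lfloor n/2 \rfloor) \]

or

\[ \mathrm{rank}(v_i) \ge i - \lfloor n/2 \rfloor - 1 \quad (\text{for } i > \lfloor n/2 \rfloor) \]

then $y_i^-$ meets its focus by step $\lfloor n/2 \rfloor$.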
We show (a) for $i \le \lfloor n/2 \rfloor$. (The other cases are similar.) In this case $f_0(y_i^+) = (n + 1)/2$; $\tfrac12 \le f_{\lfloor n/2 \rfloor}(y_i^+) \le n + \tfrac12$; and as observed earlier, $f_{\lfloor n/2 \rfloor}(y_i^+) = \mathrm{rank}(v_i)$. Now, $\mathrm{loc}_t(y_i^+)$ increases by 1 at each step unless $y_i^+$ overtakes its focus. If $y_i^+$ overtakes its focus then it must meet it, because $\mathrm{loc}_t(y_i^+)$ and $f_t(y_i^+)$ change by at most 1 at each step. Now we suppose that $y_i^+$ has not met its focus by step $\lfloor n/2 \rfloor$. Then it has not overtaken it, and $\mathrm{loc}_{\lfloor n/2 \rfloor}(y_i^+) = i + \lfloor n/2 \rfloor$. So $f_{\lfloor n/2 \rfloor}(y_i^+) \ge i + \lfloor n/2 \rfloor$. Also, if $f_{\lfloor n/2 \rfloor}(y_i^+) = i + \lfloor n/2 \rfloor$ or $i + \lfloor n/2 \rfloor + 1$ then a meeting would occur at time $\lfloor n/2 \rfloor$. Hence $\mathrm{rank}(v_i) = f_{\lfloor n/2 \rfloor}(y_i^+) > i + \lfloor n/2 \rfloor + 1$. □

It remains to show that large numbers of packets do not collect at individual processors.
LEMMA 2. At most six packets reside in a processor at any moment (five in case $n$ is odd).
Proof. We prove the case of even $n$. Let $V_t(k)$ denote the set of values $\{v_{k-t}, \ldots, v_{k+t}\}$. For a packet $y_i^{\pm}$ in processor $p_k$ at time $t$ (after comparisons with the incoming broadcasts but before moving),

\[ f_t(y_i^{\pm}) - \frac{n - 2t - 1}{2} \equiv \mathrm{rank}(v_i \in V_t(k)) \pmod n. \]

Each of the partial ranks is distinct and therefore so are the foci. Note that for even $n$, no packet can stay motionless except at steps 0 and $\lfloor n/2 \rfloor$. Hence of seven packets at $p_k$ at time $t$, four must have been at (say) $p_{k-1}$ at time $t - 1$, and had foci greater than $k - 1$. These foci were distinct, so at least two of them were at a distance greater than 2 away from their packets. In that case those packets were not yet "tracking" their foci, and so had never changed direction; therefore they are of the form $y^+$ and at time $t = 0$ were both at processor $p_{k-t}$. But $y_{k-t}^+$ is the only such packet.

In conclusion we have shown:

THEOREM 2. The above algorithm sorts in time $\lfloor n/2 \rfloor + 1$ with capacity 6.
ACKNOWLEDGMENT

We thank Professor Leighton, who suggested this problem.
REFERENCES

1. H. ATTIYA, M. SNIR, AND M. WARMUTH, Computing on an anonymous ring, J. Assoc. Comput. Mach. 35, No. 4 (1988), 845-875.
2. G. N. FREDRICKSON, Tradeoffs for selection in distributed networks, in "Proceedings, 2nd ACM Colloquium on Principles of Distributed Computing, Montreal, Canada, August 1983," pp. 154-160.
3. D. E. KNUTH, "The Art of Computer Programming, Vol. 3, Sorting and Searching," p. 223, Addison-Wesley, Reading, MA, 1973.
4. M. KUNDE, Bounds for l-selection and related problems on grids of processors, in "Proceedings, Parallel Algorithms, Berlin, GDR, 1988."
5. F. T. LEIGHTON, Introduction to parallel algorithms and architectures, manuscript in preparation.
6. M. C. LOUI, The complexity of sorting on distributed systems, Inform. and Control 60, Nos. 1-3 (1984), 70-85.