JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 1, 133-151 (1984)
Flux, Sorting, and Supercomputer Organization for AI Applications*

JEFFREY D. ULLMAN

Department of Computer Science, Stanford University, Stanford, California 94305
A central issue for the design of massively parallel computers is their ability to sort. We consider organizations that are suitable for fast sorting, both those that use point-to-point connections and those that connect processors with multipoint nets. We show that for fast sorting and minimal area, nets must connect at least √n nodes each, if the network has n nodes. We then discuss some of the ways that fast-sorting networks can be used to speed up other processes, such as combinatorial search.
I. SUMMARY
Section II introduces the informal notion of "flux," the ability of a multiprocessor organization to transmit large amounts of data between arbitrary partitions quickly. Then, a formal representative of flux, the ability to sort quickly, is introduced. In Section III we explore some of the consequences of requiring our multiprocessor to be able to sort quickly, and in Section IV we discuss some of the ways fast-sorting machines can be used to solve "AI" problems. Finally, in Section V, we discuss some open problems of a more theoretical nature concerning multiprocessor organizations.

II. FLUX AND THE ABILITY TO SORT
Let us consider a multiprocessing system consisting of n processors (nodes), connected in some way. For us, the connection is defined by a network, which is a set of nets; each net is a set of two or more nodes. Thus, the nets are exactly the edges of a hypergraph, although we shall not use that term. Some examples of networks are the tree network suggested in Fig. 1a and the network consisting of n processors connected to a single bus, such as a

*Work supported by DARPA Contract N00039-83-C-1036 and NSF Grant MCS-82-03405.
FIG. 1. Examples of networks. (a) Tree network. (b) Bus network.
local area network, suggested in Fig. 1b. The tree network is commonly proposed as a way to achieve massive parallelism, for example, by Browning [4], Shaw [18], and Stolfo and Shaw [20]. Notice that the tree uses nets consisting of two nodes each, so we draw it as an ordinary graph. Networks consisting of two-node nets are point-to-point networks. Figure 1b is not a point-to-point network. When more than two nodes are on one net, we assume that the net has sufficiently high bandwidth to support all the communication that the nodes on the net can generate. We do not make a distinction between a true bus, where (usually) the processors attached to the bus can gain access in a relatively short time to send a byte or a few bytes, and a local area network, where there is a large overhead and delay associated with a processor gaining access to the net. The central assumption is that the net has the capability of serving the communication requirements of all its nodes with some acceptable delay.

Flux
There is an intuitive notion of "flux" in a network that measures the ability of the network to transfer data from one (arbitrarily given) place to another. This ability is important for many, although not all, applications. For example, it is essential for computing joins in relational database systems (see Ullman [24], Maier [11]), because each tuple of one relation may have to be compared with any or all of the tuples in the other relation (although techniques such as "semijoins" reduce the need for communication in some cases). Similarly, rule-based inference, or equivalently, theorem proving, requires the same kind of communication as the join operation (Shaw [17]). We shall see in Section III how high flux can improve the performance of parallel search algorithms, as well.
To define flux formally, we must consider not single networks, but families of similar networks, N1, . . . , Ni, . . . , for an infinity of i. We assume that each Ni has more nodes than Ni-1, so the family has arbitrarily large networks. For example, we can generalize the balanced binary tree of Fig. 1a to a family where Ni has 2^i - 1 nodes and is a balanced binary tree of depth i - 1. We can generalize Fig. 1b so Ni is a collection of i nodes on a single net. We define the flux f(n) of a family N1, . . . , Ni, . . . as follows. Suppose Ni has n nodes. Then f(n) is the minimum, over all sets S with n/2 or fewer of Ni's nodes, of q/m, where

1. q is the sum over all nets in Ni, of the minimum of the number of nodes of that net that are in S and the number that are not in S; and

2. m is the number of nodes in S.

While this definition may appear intimidating, the basic idea is that each net is assumed to support all communication its nodes desire. If we draw a circle around the nodes in S, the maximum amount of information per unit time that can leave S via some particular net is proportional to the minimum of the number of nodes on that net that can send (the number in S) and the number that can receive (those outside of S). Thus, the flux is inversely proportional to the minimum time that it takes to get the data from any set of nodes S out of that set, or equivalently, the time needed to supply each of the nodes of S with a quantum of data.

EXAMPLE 1. For the tree family suggested by Fig. 1a, we may let S be the nodes in the left subtree. Then, there is only one net, the one connecting the root with its left child, that has a node in S and a node out of S. Therefore, q = 1 in this case, and m, the number of nodes in the left subtree, is (n - 1)/2. Thus, f(n) ≤ 2/(n - 1), which is O(1/n) for the tree family.¹ That is, the flux decreases as n^{-1} when n gets large in an n-node tree. Put another way, the time to get data out of all the nodes in one subtree and pass them to the other subtree grows as n, the size of the subtree.
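The definition of f(n) can be evaluated directly by brute force on small networks. The sketch below (our own helper names, not the paper's) minimizes q/m over every candidate set S; it is exponential in n, so it serves only to check small examples such as the 7-node tree and the single-bus network of Fig. 1.

```python
from itertools import combinations

def cut_ratio(nets, S):
    """q/m for a given S: q sums, over each net, the smaller of the number
    of the net's nodes inside S and the number outside."""
    S = set(S)
    q = sum(min(len(net & S), len(net - S)) for net in map(set, nets))
    return q / len(S)

def flux(nets, nodes):
    """Minimum of q/m over all nonempty S with at most n/2 nodes."""
    return min(cut_ratio(nets, S)
               for k in range(1, len(nodes) // 2 + 1)
               for S in combinations(nodes, k))

# 7-node balanced binary tree (each net is an edge, hence a 2-node net):
tree_nodes = list(range(7))
tree_nets = [{0, 1}, {0, 2}, {1, 3}, {1, 4}, {2, 5}, {2, 6}]
print(flux(tree_nets, tree_nodes))          # 1/3 = 2/(n-1); S = one subtree
# Single-bus network of Fig. 1b: one net holding all seven nodes.
print(flux([set(tree_nodes)], tree_nodes))  # 1.0
```

For the tree, the minimizing S is indeed a subtree, matching the 2/(n - 1) bound of Example 1.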
That is hardly surprising, because the root of a tree forms a significant bottleneck. For the local area network of Fig. 1b, whatever subset S of size m ≤ n/2 we pick, q/m = 1, where q is the minimum of the number of nodes in or out of S in the single net. But q = m, since m ≤ n/2 was assumed. Thus, f(n) is 1. The implication of this analysis is that the single net with all the nodes has

¹For our purposes, saying that "f(n) is O(g(n))" means that there is some constant c so that for all sufficiently large n, f(n) ≤ cg(n); i.e., f(n) grows no faster than proportionally to g(n). Similarly, we shall use "f(n) is Ω(g(n))" to mean that for some constant c > 0, for all sufficiently large n, we have f(n) ≥ cg(n); i.e., f(n) grows at least proportionally to g(n).
very high flux; it can send or receive a datum per node into or out of any set in time proportional to 1/f(n), which is 1. That only makes sense if we agree that the single net is fast enough to support all the traffic that n nodes can create. That may not be realistic, but we should recall our basic assumption that whenever we drew a net, that net was capable of supporting its traffic. Thus, the analysis of Fig. 1b simply says tautologically that if we have a net that can handle all the traffic that n processors can create, then that net has "high flux"; i.e., it can handle the traffic. ∎

Fast-Sorting Networks

A concept related to high-flux networks is the ability to sort "fast." Formally, we shall say a family of networks is a fast sorter if, given one element at each processor (node), there is a parallel algorithm to sort the elements in polylog time, i.e., O(log^k n) for some k. By "sort," we mean that there is a fixed order of the nodes, p1, . . . , pn, independent of the elements at the nodes, so that the largest element is brought to p1, the next largest to p2, and so on. Notice that a fast sorter must have flux at least Ω(1/log^k n), because if not, then there is some set S at which we could initially place elements that must go to nodes not in S, and the sorting algorithm could not move all these elements out of S in time O(log^k n). However, it is open whether the converse is true; just because a network has high flux, is there always a sorting algorithm that runs fast?

EXAMPLE 2. On the tree, sorting cannot be done quickly; it requires Ω(n) time. The local area network of Fig. 1b allows fast sorting. We can implement an O(log^2 n) parallel sort such as those of Batcher [2]. We can perform the sort in time O(log n) with high probability using the "flashsort" method of Reif and Valiant [14], or in principle, we could even use the O(log n) parallel sort of Ajtai et al. [1], although the overhead makes it unlikely that we would ever do so.
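The O(log² n) stage structure of a Batcher-style sort can be made concrete with a serial simulation of the bitonic sorting network (our own sketch; Batcher's odd-even mergesort has the same stage count). The two nested loops below run O(log² n) stages; within a stage every compare-exchange touches a disjoint pair of positions, so each stage is a single parallel step on a suitable network.

```python
def bitonic_sort(a):
    """Bitonic sorting network for a list whose length is a power of two.
    Sorts ascending in O(log^2 n) compare-exchange stages."""
    a = list(a)
    n = len(a)
    k = 2
    while k <= n:                  # size of the bitonic runs being merged
        j = k // 2
        while j >= 1:              # compare-exchange distance in this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # swap out-of-order pairs, direction alternating by block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Counting the stages: the outer loop runs log n times and the inner loop at most log n times per outer iteration, giving the (log² n)/2 + (log n)/2 stages characteristic of Batcher's networks.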
Figure 2 shows three other fast-sorting families of networks. The butterfly, or FFT, network is suggested in Fig. 2a. In this network there are k + 1 ranks, each with 2^k nodes. If we number the ranks from the bottom, starting at 0, and we number the nodes in each rank from the left, starting at 0, then there is a connection between nodes j and l on ranks i and i + 1, respectively, if and only if either j = l or the binary representations of j and l differ in only the (i + 1)st place from the left. The hypercube of Fig. 2b has the obvious generalization from three to d dimensions. Finally, the shuffle-exchange graph of Fig. 2c has nodes numbered 0 through n - 1, where n is a power of 2. The shuffle connection connects node i to node 2i (mod n - 1) for 0 ≤ i ≤ n - 2 and connects node n - 1 to itself. The exchange connection connects each even node i to
FIG. 2. Some fast-sorting networks. (a) Butterfly network. (b) Hypercube. (c) Shuffle-exchange network.
the next higher node. The shuffle-exchange network has both the shuffle and the exchange interconnections. All these networks, and several similar networks, can sort in time O(log^2 n). For example, Stone (1971) discusses sorting on the shuffle-exchange. The ultracomputer of Schwartz [16] is basically a shuffle-exchange organization, as is the CEDAR design of Gajski et al. [5]. Algorithms for the butterfly network, which they call the cube-connected cycles, are discussed by Preparata and Vuillemin [13]. These and similar networks, their relationships, and their sorting algorithms are covered generally in Siegel [19] and Ullman [25]. Also see Goodman [6] and Goodman and Despain [7] for an analysis of how these and other organizations perform on the operation of eliminating duplicates, an operation which is closely related
to the operation of sorting, and even more closely related to "binsorting," discussed in Section IV.

Figure 3 shows two gridlike networks. In Fig. 3a, a true grid is shown. Its flux is only O(1/√n), since we can let the set S be the left half of the grid, with n/2 nodes, and find that only √n nets contribute to the flux, because only that number are cut by the boundary of S. Thus, the grid is not a fast sorter. However, it is faster than the tree; we can sort in time O(√n) by an implementation of the Batcher sort discussed in Thompson and Kung [23]. Gridlike networks are important processor organizations for special-purpose applications. Many of these applications involve "systolic computations," as discussed by Kung and Leiserson [9] or Leiserson and Saxe [10], e.g. In Fig. 3b we see a grid where the rows and columns are single nets. It is easy to check that this network allows the same implementation of Batcher's sort that the usual grid does, but instead of having to slide data from one subrectangle to an adjacent one, the grid-of-nets allows the transmission of data between two adjacent rectangles in one time unit. Thus, the running time of the algorithm of Thompson and Kung [23] (i.e., the implementation of Batcher's [2] algorithm on the grid) becomes O(log^2 n) on the grid-of-nets. Thus, Fig. 3b represents another family of fast-sorting networks. ∎
FIG. 3. Two gridlike networks. (a) Grid network. (b) Grid-of-nets.
III. A LOWER BOUND FOR SORTING NETWORKS
It is known that point-to-point networks capable of sorting must exhibit certain problems; either they are large or they take much time to sort. A proof of this phenomenon appears in the work of Thompson [22], where the model of VLSI circuits, that is, point-to-point networks embedded in the plane with wires of nonnegligible width (and crossovers permitted), was studied. The essential idea is that whatever circuit we design to solve a problem, we can "slice" the circuit in half the short way, cutting no more wires than the square root of the area. Depending on properties of the problem being solved, we may be able to show that the circuit must transmit a large quantity of data across this boundary. If so, either the boundary must be long or the time taken must be long. While this idea was originally intended to apply to single integrated circuits, it is easy to see that the same ideas apply in the large, to printed circuit boards, for example, or to a roomful of processors laid out on the floor and connected by cables. The only essential is that the wires have nonnegligible width. The resulting bounds, applied to the problem of sorting n elements (which is one of the problems that requires the transmission of much data), say that if the area of the circuit is A and the time taken is T, then AT² = Ω(n²). Thus, if we want the time to be log^k n, the area of the circuit must grow almost as n²; to be precise, as n²/log^{2k} n. This result of Thompson [22] is somewhat surprising; it says that any fast-sorting network whose nodes are placed in a plane must use most of its area for interconnection wires, and therefore, the area of the multiprocessor must be substantially more than one would intuitively expect. If one builds the network in three dimensions, the volume need not grow as Ω(n²), but it must grow as Ω(n^{3/2}) (Rosenberg [15]).
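The arithmetic behind the area claim can be restated in one line (our own restatement, in the paper's notation):

```latex
% With Thompson's bound AT^2 = \Omega(n^2) and polylog sorting time
% T = O(\log^k n), the area bound follows directly:
\[
  A \;=\; \Omega\!\left(\frac{n^2}{T^2}\right)
    \;=\; \Omega\!\left(\frac{n^2}{\log^{2k} n}\right),
\]
% which is almost quadratic in n.  In three dimensions the analogous
% result is a volume bound of V = \Omega(n^{3/2}).
```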
It is interesting to note the attempt of Bianchini and Bianchini [3] to lay out the wiring of an ultracomputer, which is one of the fast-sorting networks, in 3-space. It is clear that their methods will not extend gracefully to networks much larger than the 4096-processor machine they work with, and even for that size machine, the ability of conventional wiring machines to perform the wiring in the volume they predict will be tested. More generally, the theory of VLSI complexity tells us that any attempt to build a fast-sorting network with point-to-point nets dooms us to certain subtle implementation problems. These problems include the fact that the area or volume must grow superlinearly, as discussed above. Another such problem is the fact that the longest wire must grow almost as Ω(n) if the layout is in the plane, or as Ω(√n) if the layout is in 3-space. That, in turn, makes the cycle time, or remote data access time, for the processors grow as the size of the network increases. As with all asymptotic results, we must view these remarks with a grain of salt. They do not imply that fast-sorting, point-to-point networks are infeasible, even for thousands of processors. They only
imply that the performance of such networks must degrade in several ways as we attempt to build progressively larger networks of the same family.

Proof of Lower Bound for Wire Length
Let us see why the maximum wire length must grow almost as Ω(n) in the plane; the argument for 3-space is analogous. Suppose that d is the maximum wire length. Then in any fast-sorting network, we can surely deliver a datum from any node to any other in time O(log^k n). That is, from any node there are paths of at most O(log^k n) edges to all the other nodes. Thus, pick any node as the "origin"; all nodes must be within distance O(d log^k n) of the origin. However, we know from Thompson [22] that the entire network, including wires, must take area Ω(n²/log^{2k} n), and therefore, there is some point p on a wire at least half the square root of that area, i.e., Ω(n/log^k n), from the origin. We may assume that all wires are necessary, or they could be eliminated from the network. Thus, there is a path between two nodes that uses this most remote wire and therefore goes through the point p. This path must consist of no more than O(log^k n) nodes, or it could not be used in an O(log^k n) time sorting algorithm. If d grows as Ω(n/log^{2k} n), we are done; d grows as "almost Ω(n)." However, if d grows more slowly than n/log^{2k} n, then the radius of the circle about the origin, inside which all nodes are found, must be negligible compared with the distance between any node and the point p. That is, the distance from any node to p, which is at least Ω(n/log^k n - d log^k n), grows as Ω(n/log^k n). But since this path has O(log^k n) edges, one of them must grow as Ω(n/log^{2k} n), and the proof is complete.

Networks with Multipoint Nets
It appears that if we are willing to use multipoint nets for communication, we can do somewhat better than the bounds mentioned above for fast-sorting point-to-point networks. Of course, the protocols involved in sharing multipoint nets may degrade the performance of the network, compared with nets that are dedicated to the connection of two processors, but we shall see in Section IV one possible scheme for amortizing the overhead and delay inherent in the use of such protocols. We shall generalize Thompson's result to networks in the plane where nets are allowed to include as many as M nodes. As before, we let A be the area of the network and T be the time taken to perform its task. The generalization to three-dimensional layouts is straightforward. Figure 4 shows a typical net connecting M nodes. There is a boundary drawn, cutting the layout the short way, so the length of the boundary is at most √A. It is shown crossing the boundary at one point, although our argument is not affected if it crosses at several points. The net surely cannot transmit more than one bit for each of the M processors on the net in one time
FIG. 4. A net of M nodes crossing the boundary.
unit, and at most √A nets can cross the boundary at all. As in Thompson [22], one can argue that if the boundary is drawn so about half the processors are on either side, and the network is capable of sorting, then for some runs of the sorting algorithm Ω(n) bits must cross the boundary, where n is both the number of nodes and the number of elements being sorted. It follows that √A MT = Ω(n), since, as we argued above, only √A M bits can cross the boundary in one time unit. If we square both sides, we get the familiar "area-time-squared" type law: AT²M² = Ω(n²).
If we wish T to be a power of log n, and we wish A to be minimal, i.e., O(n), then the above equation says that M = Ω(√n/log^k n); that is, the size of nets must grow almost as the square root of the number of processors. Similarly, in 3-space, the size of nets must grow as almost n^{1/3}. That result is not too bad. It tells us that if we need a fast-sorting network, and don't want to use too much area or volume, then improvements in network technology, i.e., the ratio of the network bandwidth to processor speed, must be made at a rate that is the 1/2, or even 1/3, power of the rate at which the number of processors we desire to couple grows. Similarly, we can limit the growth of the longest wire length to O(n^{1/2}) or O(n^{1/3}), depending on whether we work in two or three dimensions.

EXAMPLE 3. The grid-of-nets organization of Fig. 3b can be laid out in area O(n), if n is the number of processors. As we commented previously, an adaptation of the sorting algorithm of Thompson and Kung [23] on the ordinary grid works for the grid-of-nets in time O(log^2 n). Finally, M = √n for this organization; i.e., each net includes √n processors. Thus, AT²M² = n² log⁴ n, which is just slightly more than the theoretical lower limit of Ω(n²). A similar observation holds about the grid-of-nets in three dimensions; it is
close to optimal as far as using the smallest possible nets for a linear-volume fast sorter. ∎

IV. SOLVING COMBINATORIAL SEARCH PROBLEMS ON A FAST SORTER
Suppose that we have a fast-sorting multiprocessor; how might we use it to solve a combinatorial search problem (i.e., an "AI" problem) in a cooperative way, so that most of the processors are doing useful work most of the time? The strategies we shall outline are most appropriate in an environment where communication involves a large overhead, such as one or more local area networks, like Figs. 1b and 3b. However, they save time over obvious strategies even if there is no overhead associated with the sending of a message between processors.

Combinatorial Implosion

One way that sorting can help speed search is through the phenomenon of combinatorial implosion (Kornfeld [8]). It may be that different processors cooperating on a combinatorial search wind up trying to solve the same problem. It would be useful to have a "clearinghouse" that detected coincidences where two or more processes have been spawned at different processors and are solving the same subproblem. However, implementation of such a "clearinghouse" is not trivial on realistic processor organizations. We shall propose a framework for solving combinatorial problems where the overhead of transmitting information about the problems being solved at a given processor is amortized sufficiently and does not dominate the cost of calculation. In effect, we can obtain the performance of a machine with high-speed, low-overhead, point-to-point links, on a machine that is in reality quite loosely coupled, like those of Figs. 1b and 3b. Thus, for the right class of problems, we have to pay neither the penalties associated with point-to-point fast sorters like the ultracomputer (large wire area and long wires) nor the penalty associated with loosely coupled organizations (message-passing overhead).

EXAMPLE 4. To focus on a particular problem, consider playing generalized Nim, where we are given a configuration consisting of three rows of sticks, with i, j, and k sticks, respectively.
In a move, we are allowed to take up to three sticks from a single row; the game is won by taking the last stick. We might solve the problem of choosing the best move in a given situation by a recursive search.² That is, we invoke instances of an evaluation procedure with parameters (p, i, j, k), where p indicates which player's move it

²There is an elegant way of solving the problem based on the binary representations of i, j, and k, but let us ignore this fact and consider what would happen if we tried to solve Nim by a straightforward combinatorial search.
is, and i, j, and k indicate the number of sticks remaining in each of the three rows. The process returns 1 if player p can win, 0 if not. If the problem is not trivial, the process spawns new processes, one for each legal move, and returns 0 or 1 based on the results of the spawned processes in an obvious way. Since each spawned process represents a configuration with fewer sticks, this algorithm eventually succeeds in determining whether a given configuration is a win or loss. A simple extension finds a winning move if there is one. ∎

Evidently, if the spawned processes are distributed to different processors in an equitable way, the processors can all be kept busy, but they will frequently be evaluating a configuration that has already been evaluated by another processor (or even by the same processor). We could avoid this duplication by having each process broadcast its "state" to the other processors, and having those processors that had another process with the same state inform the first process of that fact. The first process could then suspend, wait for the second process to determine the outcome, and take the result of the latter process as its own result. In the case of Nim, the state of a process with parameters (p, i, j, k) can be taken to be

1. p, that is, the player whose turn it is; and

2. the list i, j, k in sorted order.

That is, the order of the lengths of the three rows is unimportant in determining whether a configuration is won or lost. Suppose that the solution of a Nim instance requires the generation of m processes among n processors. Also suppose that the sending or receiving of a message requires time τ₀ for overhead, plus τ for each state transmitted in the message. If each process, at startup, sends its state to all the processors, there will be mn messages sent and received, for a total message sending time of mn(τ₀ + τ). Since this cost can be divided among n processors, the elapsed time due to broadcasting states is m(τ₀ + τ).
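In a serial sketch, the "clearinghouse" can be realized as memoization on the canonical state: each cache hit stands for a duplicate process being suspended. Following the observation above, we sort the row lengths; in the serial setting the player index p can be dropped, since the value depends only on whose turn it is at the current configuration. The names below are ours, not the paper's.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def wins(rows):
    """1 if the player to move wins from `rows` (a sorted 3-tuple), taking
    one to three sticks from a single row; taking the last stick wins."""
    if sum(rows) == 0:
        return 0                    # no sticks left: previous player won
    for r in range(3):
        for take in range(1, min(3, rows[r]) + 1):
            child = list(rows)
            child[r] -= take
            if not wins(tuple(sorted(child))):   # canonical state
                return 1                          # opponent loses: we win
    return 0

print(wins((1, 2, 3)), wins((1, 1, 3)))   # 0 1
```

Without the canonicalization and caching, the recursion re-evaluates the same configurations exponentially often; with them, each distinct sorted configuration is evaluated once, which is the combinatorial implosion the text describes.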
The Batch-and-Binsort Strategy
A superior approach in many situations is to wait until a large number of processes have been spawned. At fixed intervals, the states of all the processes are sorted, so that two or more processes with identical states can be detected. When a group of processes with identical states is found, a message is sent to all but one, telling it to suspend, pending the outcome of the one process with that state that is allowed to continue.³

³If we are not careful to suspend newer processes in favor of old, then deadlocks could result. In fact, even if we use this rule for suspending processes, we must be careful that the state embodies sufficient information that two processes with the same state will spawn a set of processes that have the same set of states. If that condition is satisfied, the existence of a deadlock implies that the algorithm, run serially, would never terminate, and so is "defective" anyway (J. Skudlarek).
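One batch step can be sketched serially as follows (our own names): sort the batched (state, pid) pairs so that identical states become adjacent, let the oldest process with each state continue, and suspend the rest, observing the suspend-newer-in-favor-of-old rule of the footnote.

```python
def batch_step(batch):
    """batch: list of (state, pid) pairs, with pid increasing with spawn
    time.  Returns (survivors, suspended), where suspended maps a pid to
    the pid of the process it now waits on."""
    survivors, suspended = [], {}
    rep = {}                                  # state -> oldest pid seen
    for state, pid in sorted(batch):          # identical states adjacent
        if state in rep:
            suspended[pid] = rep[state]       # suspend newer, keep old
        else:
            rep[state] = pid
            survivors.append(pid)
    return survivors, suspended

batch = [(("a",), 2), (("b",), 1), (("a",), 5)]
print(batch_step(batch))   # process 5 waits on process 2
```

In the parallel setting the `sorted` call is replaced by the fast sort (or the binsort described next), but the grouping logic is the same.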
Actually, we can do slightly better, if we "binsort" the states. If there are n processors, we may imagine that each processor has charge of one "bin." Each state is hashed to a number from 0 to n - 1, and the states that have not previously participated in a binsort are directed through the network to the bin whose number equals their hashed value. For the fast-sorting networks of Fig. 2, this routing can be accomplished in about log n phases, where at each phase, each node sends messages, consisting of a large number of states, to each of its neighbors. Of course, in the worst case, the routing could take Ω(m) time to direct m states to their bins; for example, with bad luck all states could hash to the same bin. However, the method of Valiant and Brebner [26] allows us, with probability that approaches 1 as n gets large, to route m states distributed among n processors to their proper bins with O(log n) messages per node and messages whose length is O(m/n). Thus, the total time for the binsorting of m states is some small constant times (τ₀ + mτ/n) log n. To this time we must add the time for sending suspension messages to processes whose states are duplicates and messages that inform suspended states of the result they would have produced if allowed to run to completion. These times are each no greater than the binsorting time; in fact, if we are willing to wait until the next binsorting cycle, we shall pay close to nothing for sending these extra messages. Finally, we must add the time wasted by running processes that are eventually suspended, and by having to kill processes that were spawned by suspended processes (and that are not themselves needed because some other process with the same state is depending on them for a result). The ratio of the time required to binsort m states compared with the time to broadcast those m states on an n-processor fast-sorting multiprocessor is thus a small constant times
((τ₀ + mτ/n) log n) / (m(τ₀ + τ)).     (1)

If we assume that τ ≪ τ₀, (1) is approximately (log n)/m + (τ/τ₀)(log n)/n.
Even if τ₀ = 0, i.e., there is no message passing overhead, (1) reduces to (log n)/n.

EXAMPLE 5. Suppose that there are 1000 processors, τ₀ = 0.001 sec, and τ = 0.0001 sec. Further, suppose that m = 100,000; i.e., we wait until approximately 100 new processes per processor have been spawned before we binsort. Then the value of (1) is 0.001; that is, the binsort strategy is on the
order of 1000 times faster than the broadcast strategy. What is more to the point, the broadcast strategy would consume about a minute of each processor's time, which would most likely dominate the processor time needed to spawn 100 processes at each processor. On the other hand, the batch-and-binsort strategy takes less than a second, which is probably well matched to the time consumed spawning the new processes. Of course, there is the unknowable factor slowdown in the batch-and-binsort strategy due to the fact that processes with a duplicate state are suspended in a less timely fashion than with the broadcast approach, and due to the fact that more message passing is needed to notify the processes of their suspension. Now, suppose that all is as above, but τ₀ = 0; i.e., there is no overhead for messages. Even so, the ratio (log n)/n of the times taken by the two methods indicates a factor on the order of 100 for the speedup achieved by the batch-and-binsort method. ∎

If the network is a "slow sorter," where O(n), rather than O(log n), messages per processor have to be passed, then the factor log n in (1) would be replaced by n. If that is the case, (1) becomes

((τ₀ + mτ/n) n) / (m(τ₀ + τ)).

This ratio is approximately n/m or approximately τ/τ₀, whichever is larger. However, if τ₀ = 0, then the ratio is 1; i.e., no speedup is possible. Even when τ₀ is substantial, the benefits from the batch-and-binsort strategy are small for a slow sorter, compared to what can be achieved with a fast sorter.

A Framework for Solving Problems by Batch-and-Binsort

The parallel problem-solving technique described above with the tacit assumption that we were playing Nim can be generalized considerably if we have the "programmer" supply the following elements.

1. A collection of one or more distinguished procedures. Each invocation of one of the distinguished procedures corresponds to a process in the description above. For example, the Nim player has one distinguished procedure, which takes a configuration and evaluates it, perhaps spawning other invocations of the same procedure to do so. The distinguished procedure must have available to it a mechanism for interacting with the underlying system that passes messages from process to process. For example, if a call to the Nim procedure spawns additional calls corresponding to the possible moves, then the calling procedure must wait for each of the spawned procedures to return a value. However, if any of the spawned procedures returns the fact that it cannot win no matter what it does, then the calling procedure can return and claim that its player can win. On the other hand, if none of the spawned procedures admit they must lose, then after all have returned, the calling procedure must return the fact that it must lose.
2. A procedure that computes states. This procedure is invoked by the system every time a distinguished procedure is called. In the case of Nim, the state computation uses only the parameters of the procedure called, but in general, the state computation could have access to any data in the environment of the distinguished procedure. The result of the state computation is stored by the system for eventual participation in a binsorting step.

3. A hash function to map states to bin (processor) numbers.

4. A procedure to follow when the equality of states of two processes is discovered. In the case of Nim, the action to be taken is to suspend until the other process finishes and then return the same value as that process. If processes were states in a theorem-proving algorithm, we might instead kill all but one process with a given state, because we are only looking for one way to "win" and do not care how many ways there are.

The elements to be provided by the underlying system should be clear. We enumerate them here.

(a) Facilities to get messages to the various processes concerning their suspension and to get suspended processes the data needed to compute their own return values.

(b) Facilities to keep track of spawned processes and allow them to share processors, and perhaps a mechanism for balancing the load among processors.

(c) Facilities to monitor the spawning of processes, cause their states to be computed, record the states, and decide when it is economical to run a binsorting phase.

(d) Facilities to implement the binsort, find duplicate states, and invoke the required suspension operations for the appropriate processes.

Binsorting to Support Global Data Access

A strategy similar to the batch-and-binsort strategy just described can be used for parallel processing of problems that are not oriented toward combinatorial search, as problems in the framework described above must be.
We only require a relatively fine granularity of processes; i.e., there must be many processes per processor on the average. We may suppose that the data used by the processes are distributed at the sites of different processors, and the processors are loosely coupled, as by a local area network, so it is not economical to get the remote data required by each process as soon as the process asks for them. However, as long as there are many processes per processor, we can suspend each process that needs remote data until at some point it becomes economical to sort all the requests for remote data in a binsorting operation and satisfy the requests in a second binsorting operation. As with the original batch-and-binsort proposal, the effectiveness of this scheme depends upon the multiprocessor’s ability to sort quickly.
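The two binsorting operations described here can be sketched concretely. In the following minimal sketch (all names are hypothetical, and an in-memory hash-based grouping stands in for the actual parallel binsort), one batched round routes each suspended process’s request to the bin of the processor owning the data, answers each bin’s requests together, and then routes the answers back, grouped by the asking processor:

```python
from collections import defaultdict

def binsort_round(requests, n_procs, lookup):
    """One batched round of remote data access.

    `requests` is a list of (asking_processor, key) pairs accumulated from
    suspended processes; `lookup` maps a key to its stored value.  The first
    binsort groups requests by the processor that owns the key; the second
    groups the answers by the processor that asked."""
    bins = defaultdict(list)                  # first binsort: by data owner
    for asker, key in requests:
        bins[hash(key) % n_procs].append((asker, key))
    answers = defaultdict(list)               # second binsort: by asker
    for owner, batch in bins.items():
        for asker, key in batch:
            answers[asker].append((key, lookup(key)))
    return dict(answers)
```

The point of the batching is that the two grouping steps replace one message per request with one sorting pass per round, which is exactly where a fast sorter pays off.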
FIG. 5. Methods of exploring a graph. (a) Unidirectional search. (b) Bidirectional search with sorting.
Two-Way Search

There is another, somewhat different, way that a fast-sorting multiprocessor can improve the performance of a combinatorial algorithm. To set the stage, suppose we are trying to find a path in a graph from source node s to sink node t. To simplify the calculations involved, let us assume that the graph has a fixed degree, d + 1, so when we search either forward from s or backward from t, we fan out in a tree where each node has d children. We could start at s, explore for p levels, and see if we find a path to the node t, or we could start from t and work backward for p levels to see if s can be reached, as suggested in Fig. 5a. Either way, we explore about d^p nodes.⁴ If it takes time τₑ to explore one node, and we are able to distribute the exploration to n processors, then the time to explore p levels is d^p τₑ/n.

Now, let us consider another method, where we explore for q levels both forward from s and backward from t, as suggested by Fig. 5b. There will be two sets of about d^q nodes each, and each set will be distributed among the n processors. We now need to determine whether the two sets have one or more nodes in common. We can do so if we binsort the sets. Assuming we

⁴However, if we use the binsorting strategy mentioned previously to produce a “combinatorial implosion,” we could reduce the number of nodes explored somewhat, since we are likely to reach the same node by several different paths.
148
JEFFREYD. ULLMAN
have a fast-sorting network, with overhead τ₀ for a message and cost τ per node transmitted in a message, the time for the binsorting of the 2d^q nodes is on the order of (τ₀ + 2d^q τ/n) log n, by the analysis given for the batch-and-binsort method. Thus, the total time taken for the method of Fig. 5b is

    2d^q τₑ/n + (τ₀ + 2d^q τ/n) log n.    (2)

If we assume that there are enough nodes per processor that the message-passing overhead can be neglected, i.e., τ₀ ≪ 2d^q τ/n, then (2) reduces to

    (2d^q/n)(τₑ + τ log n).

It is also likely that the time to transmit a node will be much smaller than the time to explore it, so that we may assume τ log n ≪ τₑ. If that is the case, then (2) becomes 2d^q τₑ/n. When we compare this expression for Fig. 5b with the formula d^p τₑ/n for either of the unidirectional searches suggested in Fig. 5a, we see that for a given running time, we may pick q to be essentially p; i.e., since d ≥ 2 may be assumed, we have q ≥ p − 1. The result is that bidirectional search, coupled with binsorting, allows us to search almost twice as many plies as unidirectional search. In fact, even if the network sorted slowly, the same observation would hold asymptotically, but the effect might not be noticeable for realistic numbers.
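The two-way search just analyzed can be made concrete. In the following sketch (function names and the example graph are hypothetical, and an in-memory set intersection stands in for the binsort of the two frontier sets), we explore q levels from each end and check whether the explored sets meet:

```python
def within_depth(start, neighbors, depth):
    """All nodes reachable from `start` in at most `depth` steps (level-by-level BFS)."""
    reached, frontier = {start}, {start}
    for _ in range(depth):
        nxt = {v for u in frontier for v in neighbors(u)} - reached
        reached |= nxt
        frontier = nxt
    return reached

def meet_in_middle(s, t, neighbors, q):
    """Bidirectional search: explore q levels forward from s and backward from t
    (the example graph is undirected, so `neighbors` serves both directions) and
    intersect the two sets; nonempty iff some path of length <= 2q exists."""
    return within_depth(s, neighbors, q) & within_depth(t, neighbors, q)
```

On a path graph 0-1-...-9, for instance, depth q = 5 from each end makes the two sets meet while q = 4 does not, mirroring the observation that searching q levels from both ends covers about 2q plies.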
V. SOME OPEN PROBLEMS
In this section, we shall mention some of the research questions that are suggested by the observations of the previous sections.

Flux and Sorting

Fast sorting requires high flux. But is the opposite true? Can we sort fast, say in polylog time, on every high-flux network?

Sorting by Multipoint Nets
We saw in Section IV that we could get very close to the lower bound AT²M² = Ω(n²) with the grid-of-nets of Fig. 3b. Can we approach or reach the bound for smaller and larger M?

Sorting with Limited Total Net Size

The network of Fig. 3b has each node on exactly two nets. Another way to sort with nodes on only two nets is based on the shuffle-exchange interconnection, as illustrated in Fig. 6. In general, if we apply the shuffle repeatedly, beginning at any node, we cycle through a set (orbit) of at most log n nodes. If we allow nets of size log n, we can place one or more orbits in each of n/log n nets. In Fig. 6, we show two nets of size three, one for the 1-2-4 orbit and the other for the 3-6-5 orbit. The orbits consisting of node 0 alone and node 7 alone need not be placed in nets. Similarly, we can group the exchange edges into nets of size approximately log n, although for the case n = 8 illustrated in Fig. 6, we can place only one exchange edge in each net, because two pairs would exceed log 8. In general, we use approximately 2n/log n nets of size log n. Of course, this design does not use minimal area; it requires roughly n² area, as any fast sorter using nets of size O(log n) must.

1. Let M be the maximum net size, and define P to be the total number of nets in a given network. The grid-of-nets and the network suggested above each have MP = 2n. Is it true that MP ≥ 2n for fast sorters in general?

2. If the size of nets grows more slowly than log n, can we even reach MP = 2n? Note that even for nets of size two, we can implement the shuffle-exchange as a point-to-point network with 3n/2 nets, so MP = 3n is attainable.

Generalizing the Batch-and-Binsort Strategy
A question of utmost practical importance is how the batch-and-binsort strategy generalizes when states need to be compared for a relationship more general than equality, but less general than “everybody has to be compared with everybody else.” We already mentioned the connection between sorting, taking joins in databases, and “artificial intelligence” applications like theorem proving or the application of rules to data, as expressed by Shaw [17]. It is interesting to note from Moto-oka and Fuchi [12] that the “Japanese Fifth Generation,” despite its claims about implementing “inference engines” and the like, really boils down to building machines that do fast unification and building database back ends that can do joins quickly. Here, the joins are not exactly on equality; they are on “unifiability.” Unfortunately, it appears that the fifth-generation project has not gotten beyond the standard sort (or binsort) approach to taking joins (see Ullman [24] for a discussion of methods for joining). Thus, a generalized batch-and-binsort, where, say, states were expressions and all pairs of unifiable states were able to meet in some bin, will be essential.
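A toy version of such a generalized binsort can be sketched. In the following illustration (all names are hypothetical; this is one possible scheme, not a proposal from the text), expressions are binned by their outermost function symbol, which is a necessary condition for two non-variable terms to unify, and a small unifier without an occurs check then tests only the pairs that meet in a bin:

```python
from collections import defaultdict

def is_var(t):
    # Variables are strings beginning with "?"; compound terms are tuples
    # (symbol, arg, ...); other strings are constants.
    return isinstance(t, str) and t.startswith("?")

def walk(t, subst):
    # Follow variable bindings until an unbound term is reached.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return a substitution unifying a and b, or None (no occurs check)."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b) and a[0] == b[0]:
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

def unifiable_pairs(terms):
    """Bin terms by root symbol, then test only pairs that share a bin."""
    bins = defaultdict(list)
    for t in terms:
        bins[t[0]].append(t)     # assumes every state is a compound term
    return [(s, t)
            for group in bins.values()
            for i, s in enumerate(group)
            for t in group[i + 1:]
            if unify(s, t) is not None]
```

The binning here is only a coarse filter; the open problem in the text is precisely whether a binsort-like pass can bring all unifiable pairs together without the quadratic within-bin comparison.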
FIG. 6. Fast sorter based on shuffle-exchange.
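The orbit structure of Fig. 6 is easy to reproduce. In this sketch (function names are hypothetical), the perfect shuffle on n = 2^b nodes is a cyclic left rotation of each node number's b bits, and the orbits are the cycles of that permutation:

```python
def shuffle(i, bits):
    """Perfect shuffle: rotate the `bits`-bit number i left by one position."""
    return ((i << 1) | (i >> (bits - 1))) & ((1 << bits) - 1)

def shuffle_orbits(bits):
    """Partition nodes 0..2^bits - 1 into orbits of the shuffle permutation."""
    n = 1 << bits
    seen, orbits = set(), []
    for start in range(n):
        if start in seen:
            continue
        orbit, i = [], start
        while i not in seen:
            seen.add(i)
            orbit.append(i)
            i = shuffle(i, bits)
        orbits.append(orbit)
    return orbits
```

For bits = 3 this yields the orbits {0}, {1, 2, 4}, {3, 6, 5}, and {7} named in the text; each orbit has at most log n = 3 nodes, which is why nets of size log n suffice.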
REFERENCES
1. Ajtai, M., Komlós, J., and Szemerédi, E. An O(n log n) sorting network. Proc. Fifteenth Annual ACM Symposium on the Theory of Computing, 1983, pp. 1-9.
2. Batcher, K. Sorting networks and their applications. Proc. Spring Joint Computer Conference 32. AFIPS Press, Montvale, N.J., 1968, pp. 307-314.
3. Bianchini, R., and Bianchini, R., Jr. Wireability of an ultracomputer. DOE/ER/03077-177, Courant Institute, New York, 1982.
4. Browning, S. A. The tree machine: A highly concurrent computing environment. Ph.D. thesis, Dept. of Computer Science, CIT, Pasadena, Calif., 1980.
5. Gajski, D., Kuck, D., Lawrie, D., and Sameh, A. Construction of large scale multiprocessors. UIUCDCS-R-83-1123, Dept. of CS, University of Illinois, 1983.
6. Goodman, J. R. An investigation of multiprocessor structures and algorithms for database management. UCB/ERL M81/33, Dept. of EECS, University of California, Berkeley, 1981.
7. Goodman, J. R., and Despain, A. M. A study of the interconnection of multiple processors in a database environment. Proc. Intl. Conf. on Parallel Architectures. IEEE Press, New York, 1980.
8. Kornfeld, W. A. Combinatorially implosive algorithms. Comm. ACM 25, 10 (1982), 734-738.
9. Kung, H.-T., and Leiserson, C. E. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980, Chap. 8.
10. Leiserson, C. E., and Saxe, J. B. Optimizing synchronous systems. Proc. Twenty-Second Annual IEEE Symposium on Foundations of Computer Science, 1981, pp. 23-36.
11. Maier, D. The Theory of Relational Databases. Computer Science Press, Rockville, Md., 1983.
12. Moto-oka, T., and Fuchi, K. The architectures in the fifth generation computers. Proc. 1983 IFIP Congress. North-Holland, Amsterdam, 1983, pp. 589-602.
13. Preparata, F. P., and Vuillemin, J. E. The cube-connected cycles: A versatile network for parallel computation. Proc. Twentieth Annual IEEE Symposium on Foundations of Computer Science, 1979, pp. 140-147.
14. Reif, J. H., and Valiant, L. G. A logarithmic time sort for linear size networks. Proc. Fifteenth Annual ACM Symposium on the Theory of Computing, 1983, pp. 10-16.
15. Rosenberg, A. L. Three-dimensional VLSI: A case study. J. Assoc. Comput. Mach. 30, 3 (1983), 397-416.
16. Schwartz, J. T. Ultracomputers. ACM Trans. Programming Languages and Systems 2, 4 (1980), 484-521.
17. Shaw, D. E. Knowledge-based retrieval on a Relational Database Machine. Ph.D. thesis, Dept. of CS, Stanford University, Stanford, Calif., 1980.
18. Shaw, D. E. The NON-VON supercomputer. Memorandum, Dept. of Computer Science, Columbia University, New York, 1982.
19. Siegel, H. J. Interconnection networks and masking schemes for single instruction stream-multiple data stream machines. Ph.D. thesis, Princeton University, Princeton, N.J., 1977.
20. Stolfo, S. J., and Shaw, D. E. DADO: A tree-structured machine architecture for production systems. Memorandum, Dept. of Computer Science, Columbia University, New York, 1982.
21. Stone, H. S. Parallel processing with the perfect shuffle. IEEE Trans. Comput. C-20, 2 (1971), 153-161.
22. Thompson, C. D. Area-time complexity for VLSI. Proc. Eleventh Annual ACM Symposium on the Theory of Computing, 1979, pp. 81-88.
23. Thompson, C. D., and Kung, H.-T. Sorting on a mesh-connected parallel computer. Comm. ACM 20, 4 (1977), 263-271.
24. Ullman, J. D. Principles of Database Systems. Computer Science Press, Rockville, Md., 1982.
25. Ullman, J. D. Computational Aspects of VLSI. Computer Science Press, Rockville, Md., 1984.
26. Valiant, L. G., and Brebner, G. J. Universal schemes for parallel communication. Proc. Thirteenth Annual ACM Symposium on the Theory of Computing, 1981, pp. 263-277.