JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 45, 104–121 (1997)
ARTICLE NO. PC971372
Path-Based Multicast Communication in Wormhole-Routed Unidirectional Torus Networks
David F. Robinson,*,1,2 Philip K. McKinley,†,3 and Betty H. C. Cheng†,3
*Department of Computer Science, Quincy University, Quincy, Illinois 62301; and †Department of Computer Science, Michigan State University, East Lansing, Michigan 48824
This paper addresses the problem of one-to-many, or multicast, communication in wormhole-routed, n-dimensional torus networks. The proposed methods are designed for systems that support intermediate reception, which permits multidestination messages to be pipelined through several nodes, depositing a copy at each node. A key issue in the design of such systems is the routing function, which must support both unicast and multicast traffic while preventing deadlock among messages. An efficient, deadlock-free routing function is developed and used as a basis for a family of multicast algorithms. The S-torus multicast algorithm uses a single multidestination message to perform an arbitrary multicast operation. The M-torus algorithm is a generalized multiphase multicast algorithm, in which a combination of multidestination messages is used to perform a multicast in one or more communication steps. Two specific instances of the M-torus algorithm, the M_d-torus and M_u-torus multicast algorithms, are presented. These algorithms produce contention-free multicast operations and are deadlock-free under all combinations of network traffic. A simulation study compares the performance of the different multicast algorithms, and implementation issues are discussed. The results of this research are applicable to the design of architectures for both wormhole-routed massively parallel computers and high-speed local area networks with wormhole-routed switch fabrics.
© 1997 Academic Press
Key Words: multicast communication; wormhole routing; intermediate reception; path-based routing; collective communication.
1. INTRODUCTION
The term wormhole routing [2] refers to a message switching technique that has been widely used in the design of distributed-memory parallel computers. Processor nodes in these systems communicate by sending messages through a network. In wormhole routing, each message is divided into a number of small pieces, called flits, prior to transmission. The flits of a message are pipelined through the network by way of a router at each node, as shown in Fig. 1. The header flit(s) of the message contains routing information and governs the path from the source to the destination. Compared to store-and-forward switching, which was used in early-generation parallel computers, the pipelining characteristic of wormhole routing reduces the effect of path length on message delay. This distance insensitivity has been demonstrated through measurements on commercial systems [11]. An additional advantage of wormhole routing over other switching techniques is that only a very small, fixed-size buffer is needed for each communication channel. Wormhole routing has been adopted in many commercial parallel systems, including those that use direct networks, such as the nCUBE-2 (hypercube), the Intel Paragon (2D mesh), and the Cray T3D (3D torus), and those that use indirect switch-based networks, such as the TMC CM-5 and the IBM SP series. More recently, wormhole routing has been used in the design of switch fabrics for high-speed local area networks; an example is the Myrinet LAN.
Communication operations for parallel and distributed computing can be classified as either point-to-point, involving a single source and a single destination, or collective, involving more than two nodes. The growing interest in the use of collective routines is evidenced by their inclusion in many commercial communication libraries and in the recent Message Passing Interface (MPI) standard [10]. Perhaps the most fundamental collective operation is multicast, in which a source node must deliver a copy of a message to every node in a specified set of destinations. Special cases of multicast include unicast, in which the destination set contains exactly one node, and broadcast, in which the destination set contains every node in the network. Efficient multicast communication is useful in many parallel algorithms and in distributed multiparty applications.
1 The research described in this paper was conducted while the author was at Michigan State University.
2 E-mail: {robinda}@quincy.edu.
3 E-mail: {mckinley,chengb}@cps.msu.edu.
0743-7315/97 $25.00 Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.
FIG. 1. Pipelining operation of wormhole routing.
Multicast in wormhole-routed systems has been the subject of extensive research (see [9] for a recent survey) and at least two commercial systems, the nCUBE-2 and the CM-5, provide some hardware support for multicast. One architectural feature that can be used to support multicast communication in wormhole-routed networks is intermediate reception (IR). The IR capability allows a router to deliver an incoming message to the local host while simultaneously forwarding it to another router, as depicted in
Fig. 2. In this manner, a single worm may be routed through several destinations, depositing a copy of the message at each; such messages are called multidestination messages [12]. An important issue in multidestination message transmission is the underlying routing algorithm. The hold-and-wait property of wormhole routing makes it particularly susceptible to deadlock, and most wormhole-routed systems restrict message routing so as to prevent cycles of channel dependency [3]. For example, dimension-ordered routing, in which messages are routed through dimensions of the network in a predetermined order, has been used in many systems to route unicast messages [11]. Panda et al. [12] studied the implementation of multidestination messages using existing unicast routing algorithms. The main advantage of this approach is that it requires only minor modifications to existing router logic. However, the number of nodes that can be reached by a single worm is limited. Multiple, multidestination messages usually are needed to implement multicast and broadcast operations, and several such algorithms have been proposed [1, 6, 12]. An alternative approach is to use a different underlying routing algorithm entirely, one that is designed specifically to support both multidestination and single-destination (unicast) messages. One way to do this is to base all message routing on a Hamiltonian Path (HP) in the network. Lin et al. [8] used this path-based approach to develop a family of multicast routing algorithms for mesh and hypercube networks. The routing algorithm is designed so that messages always acquire channels in the same order as they appear on a particular HP, thereby preventing cyclic dependences and deadlock. Messages may follow shortcuts in order to reduce path length, in particular, to guarantee that a unicast message always follows a shortest path. The major advantage of this approach is that a single worm can reach an arbitrary set of
destinations. Tseng and Panda [16] generalized and improved Lin’s approach to accommodate faults and networks that do not necessarily contain HPs, and Duato [5] developed a theory governing the design of adaptive path-based multicast routing algorithms.
FIG. 2. Operation of intermediate reception.
This paper addresses the design of path-based multicast communication for wormhole-routed torus networks. The symmetry of torus networks leads to a more balanced utilization of communication links, under random traffic, than in mesh topologies [7]. However, the presence of “wrap-around” channels in the torus network necessitates different approaches to multicast routing than are possible in the mesh. Specifically, the particular HP (actually, a circuit in the case of the torus) must be carefully chosen so as to support both multicast and unicast messages emanating from any node in the network, while preventing deadlock. The contribution of this work is to use such a circuit to develop and evaluate a family of multicast algorithms for wormhole-routed torus networks. These results are applicable to the design of architectures for both wormhole-routed parallel computers and high-speed LANs based on wormhole routing.
The remainder of the paper is organized as follows. The architectural model under consideration is described in Section 2. Section 3 describes the routing function that can be used to support both unicast and multicast messages. In Section 4, this routing function is used to support multicast communication in which a single multidestination message visits all destinations of the message. Section 5 describes several multicast algorithms, based on the same routing function, but in which the operation is implemented using multiple multidestination messages in order to reduce latency under certain conditions. In Section 6, the results of a simulation study are presented.
This study compares the performance of the multicast algorithms developed in Sections 4 and 5, as well as a unicast-based multicast method based on the same routing mechanism. Implementation issues are addressed in Section 7, and conclusions are given in Section 8. In order to improve the readability of the main body of the paper, proofs of certain theorems have been deferred to the Appendix.
2. TORUS NETWORK MODEL
An n-dimensional torus has k_0 × k_1 × ⋯ × k_{n−2} × k_{n−1} nodes, with k_i nodes along each dimension i, where k_i ≥ 2 for 0 ≤ i ≤ n − 1. Each node x is identified by n coordinates, σ_{n−1}(x)σ_{n−2}(x) … σ_0(x), where 0 ≤ σ_i(x) ≤ k_i − 1 for 0 ≤ i ≤ n − 1. Two nodes x and y are neighbors if and only if σ_i(x) = σ_i(y) for all i, 0 ≤ i ≤ n − 1, except one, j, where σ_j(x) ± 1 = σ_j(y) mod k_j. In this paper, we assume for purposes of discussion that a torus is regular, that is, that k_i = k_j for all 0 ≤ i, j ≤ n − 1, and we refer to the width, or arity, of the torus as simply k. However, the results we present are also applicable to certain nonregular tori, as described later.
We further assume that the torus is unidirectional; that is, neighboring nodes are connected by physical channels in one direction only. Figure 3 shows the physical links associated with a 4 × 4 2D unidirectional torus. The unidirectional torus network permits the use of simpler routing hardware than is required for a bidirectional torus, in which direct message transmission is possible in either direction between neighboring nodes. Moreover, in unidirectional torus networks, fewer virtual channels are required, as compared to the number required in bidirectional networks [3]. Although the average distance between nodes is greater in a unidirectional torus, this property is less important in wormhole-routed networks than in earlier store-and-forward switched systems, given the relative distance-insensitivity of wormhole routing [11]. Our studies of multicast communication in bidirectional torus networks are described elsewhere [14, 15].
FIG. 3. Example of a 2D unidirectional torus network.
In torus networks, the presence of wraparound channels can lead to routing cycles among messages. Channel-dependence cycles can be broken by multiplexing virtual channels [3] on a single physical communication channel. Each virtual channel has its own flit buffer and control lines. In this paper, we assume that only two virtual channels, the minimum needed to prevent deadlock [3], are multiplexed on every physical channel.
Each router is connected to its local processor/memory by internal channels, or ports. The port model of a system refers to the number of internal channels at each node. In this paper, we assume that every node has two input ports and one output port. The two input ports are necessary in order to prevent deadlock among multiple multidestination messages [1].
3. PATH-BASED ROUTING FUNCTION
The two major components of a multicast implementation are the routing function and the message preparation algorithm. The routing function, implemented in router hardware, determines the route taken by a message between the source and the first destination, and between subsequent pairs of destination nodes. The message preparation algorithm, implemented in software at the source node, constructs the list of destination nodes and places it in the message header, thereby determining the order in which the destination nodes are visited. This order can affect the channel dependencies created by the message, and hence, the deadlock properties of the system. In other words, it is the combination of the message preparation algorithm and the routing function that must be considered in designing a deadlock-free system.
3.1. Hamiltonian Circuits
The path-based routing method developed by Lin et al. [8] for meshes and hypercubes uses a Hamiltonian path (HP) in the network. In a unidirectional torus network, a Hamiltonian circuit (HC) must be used. Figure 4 shows a 2D (6 × 6) unidirectional torus. The numbers near the upper left corner of each node define a total ordering on the nodes. This total ordering corresponds to an HC that “begins and ends” at node (0, 0). Although a unidirectional torus generally contains more than one HC, the methods described here use only the particular type of HC illustrated in Fig. 4. We refer to this special HC in a particular torus, T, as ℋ_T, or simply ℋ when it is clear which torus network is indicated. For any node u, and any dimension d, let u^d be the node adjacent to u in dimension d. Informally, ℋ is the HC that begins at node (0, 0) and, at each node u on ℋ, the next node is the neighbor, u^d, that minimizes d under the constraint that u^d does not already precede u on ℋ. Also shown in Fig. 4 are boundaries, which are communication links whose direction is from a higher numbered node to a lower numbered node. Boundaries will be used when defining a path-based routing function for torus networks.
FIG. 4. Hamiltonian circuit ℋ in a 2D torus.
The same technique can be applied to higher-dimensional torus networks. Figure 5 shows the node orderings defined by ℋ in a 3D (4 × 4 × 4) torus where, for clarity, the communication channels in dimension 2 (inter-plane) are not
shown. In addition to the boundaries shown explicitly in Fig. 5, every channel from plane 3 to plane 0 is also a boundary.
We define ℓ_T(u) to be the position, or label, of a node u in torus T as determined by ℋ_T. Again, when it is clear which torus network is under consideration, we use the notation ℓ(u). For example, in Fig. 4, ℓ(0, 0) = 0, ℓ(0, 1) = 1, ℓ(1, 0) = 7, and so forth. ℋ is formally defined in terms of labels. For a 1D torus, which is simply a ring, ℓ(u) = σ_0(u). For a 2D torus, ℓ(u) = [(σ_0(u) + σ_1(u)) mod k] + k[σ_1(u)]. For the general case of a k-ary n-dimensional torus, ℋ is defined as follows:
$$\ell(u) = \sum_{i=0}^{n-1} k^{i} \left[ \left( \sum_{j=i}^{n-1} \sigma_j(u) \right) \bmod k \right] \quad \text{for } n \ge 1. \tag{1}$$
This method is also applicable to certain classes of torus networks whose topologies are not regular. For example, if the size of the network is k_{n−1} × k_{n−2} × ⋯ × k_0, then the desired HP exists if, for all i, k_i is a multiple of k_{i−1}. We formally define boundaries as follows.
DEFINITION 1. If u and v are two neighboring nodes (that is, if u^i = v for some 0 ≤ i ≤ n − 1), then channel (u, v) is a boundary if and only if ℓ(u) > ℓ(v).
3.2. Use of Virtual Channels and Input Ports
At least two virtual channel sets are required in order to provide deadlock-free deterministic path-based (or unicast) routing in torus networks [3]. The routing algorithm used in this paper, termed unidirectional torus path-based routing (UTPR), requires only this lower bound of two virtual channel sets, which are identified as p-channels and h-channels. Briefly, the p-channels (“p” for pre-boundary) are used by messages only prior to crossing a boundary; after crossing a boundary, messages use the h-channels (“h” for high-channel) for all remaining travel. Each nonboundary physical communication link in the network has multiplexed onto it a p-channel and an h-channel, while only h-channels are required on boundary links. We use the notation of Dally and Seitz [3], where c_{dαx} represents the virtual channel leaving node x in dimension d, and belonging to the virtual channel set α, where α ∈ {p, h}.
FIG. 5. Hamiltonian circuit ℋ in a 3D torus.
FIG. 6. Virtual channels in one dimension of a unidirectional torus.
1. There is a p-channel, c_{dpu}, from u to u^d whenever (u, u^d) is not a boundary, and
2. There is an h-channel, c_{dhu}, from u to u^d.
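The labeling of Eq. (1), the boundary test of Definition 1, and the two channel rules above can be checked with a short Python sketch. It is an illustration only: the function names (`label`, `neighbor`, `is_boundary`, `virtual_channels`) are hypothetical, and the sketch assumes that a node is given as a tuple (σ_{n−1}, …, σ_0), as in the 2D examples, and that every link runs in the direction of increasing coordinate.

```python
def label(u, k):
    """Eq. (1): position of node u on the Hamiltonian circuit of a k-ary torus.

    Assumption: u is a tuple written as in the text, (sigma_{n-1}, ..., sigma_0).
    """
    sigma = u[::-1]  # sigma[i] is the coordinate in dimension i
    return sum(k**i * (sum(sigma[i:]) % k) for i in range(len(sigma)))


def neighbor(u, d, k):
    """Node u^d, reached from u over the one-way link in dimension d.

    Assumption: links run toward increasing coordinates (mod k).
    """
    sigma = list(u[::-1])
    sigma[d] = (sigma[d] + 1) % k
    return tuple(sigma[::-1])


def is_boundary(u, d, k):
    """Definition 1: channel (u, u^d) is a boundary iff label(u) > label(u^d)."""
    return label(u, k) > label(neighbor(u, d, k), k)


def virtual_channels(u, d, k):
    """Virtual channel sets multiplexed on the link leaving u in dimension d."""
    return ("h",) if is_boundary(u, d, k) else ("p", "h")


print(label((1, 0), 6))                # 7, matching the label quoted for Fig. 4
print(is_boundary((1, 4), 0, 6))       # True: the link goes from label 11 to label 6
print(virtual_channels((1, 4), 0, 6))  # ('h',)
```

Under these assumptions, consecutive labels always belong to nodes joined by a single link, which is what makes the labeling a Hamiltonian circuit.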
Figure 6 illustrates the virtual channels within a single dimension, d, of a unidirectional torus. In this example, as in each row of the 2D torus in Fig. 4, the wraparound link happens to be a boundary.
Besides deadlock caused by the usage of channels that interconnect routers, another type of deadlock, consumption channel deadlock [1, 12], must also be considered in systems that support multidestination messages. Figure 7 illustrates a scenario in which two multidestination messages are each attempting to deliver a message to nodes x and y. However, each message is holding the single input port (consumption channel) at one node, while waiting to use the input port at the other node, resulting in communication deadlock. Boppana et al. [1] showed that by equipping each node with two input ports and restricting their use (one for “forward” messages and one for “backward” messages), deadlock can be prevented in path-based systems. We can use this approach to prevent consumption channel deadlock in unidirectional torus networks under UTPR. Each router is equipped with two input ports, a p-port and an h-port. A (unicast or multicast) message for which node i is a destination, upon arriving at node i’s router via a p-channel, will be forwarded to the host via the p-port. Similarly, a message arriving via an h-channel will be forwarded by way of the h-port.
FIG. 7. Port-induced communication deadlock in a one-port system.
3.3. Formal Routing Function
A multidestination routing implementation must provide efficient support for the special case of unicast communication.
In particular, a unicast message should always follow a shortest path between the source and destination. To meet this requirement, the path-based routing function must restrict travel to those dimensions in which the addresses of the current node and the next destination node differ; these are called useful dimensions. In order to route a message from a node u to a node v, where either ℓ(u) < ℓ(v) or ℓ(u) > ℓ(v), we use the following generalized routing rule: travel occurs in the lowest useful dimension that does not cross a boundary. If every useful dimension crosses a boundary, then travel occurs in the highest useful dimension. By using p-channels prior to crossing a boundary, and h-channels thereafter, cycles of dependency among virtual channels are prevented. We show later that an arbitrary multidestination message can be routed so that at most one boundary is crossed. Limiting a message to a single boundary crossing is important, since with only two virtual channel sets, allowing messages to cycle through the network multiple times could cause deadlock.
The path routing function for UTPR is described formally by the function R_UTPR: N × {p, h} × N → C, which maps a (current node, incoming virtual channel set, destination node) triple into the next channel on the path. Let Δ(u, v) be the set of useful dimensions for routing from u to v. That is,
Δ(u, v) = {i | 0 ≤ i ≤ n − 1 and σ_i(u) ≠ σ_i(v)}.
Let Δ_f(u, v) be the useful dimensions that are not boundaries at u. Thus,
Δ_f(u, v) = {i ∈ Δ(u, v) | (u, u^i) is not a boundary}.
We define R_UTPR, based on Δ and Δ_f, as follows:
$$R_{\mathrm{UTPR}}(u, \alpha, v) = c_{d\beta u}, \quad \text{where} \quad
d = \begin{cases} \min(\Delta_f(u, v)), & \text{if } \Delta_f(u, v) \neq \emptyset \\ \max(\Delta(u, v)), & \text{otherwise} \end{cases}
\qquad
\beta = \begin{cases} p, & \text{if } \alpha = p \text{ and } (u, u^d) \text{ is not a boundary} \\ h, & \text{otherwise.} \end{cases} \tag{2}$$
A multidestination message is routed from a source node to a destination node (or between successive destination nodes) by applying the R_UTPR function, first at the source router, and then at each router through which the message travels, until the destination is reached. When a message enters the network, it is initially routed over a p-channel. As specified by Eq. (2), the message continues to be routed over p-channels unless a boundary is crossed. After crossing a boundary, a message is routed on h-channels.
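Eq. (2) and the hop-by-hop procedure just described can be traced with a small Python sketch. This is a minimal illustration under stated assumptions, not router logic from the paper: the names `utpr` and `route` are hypothetical, nodes are tuples (σ_{n−1}, …, σ_0), links are assumed to run toward increasing coordinates, and the helper functions repeat the labeling of Eq. (1) so that the block is self-contained.

```python
def label(u, k):
    """Eq. (1); u is a tuple (sigma_{n-1}, ..., sigma_0) (assumed convention)."""
    sigma = u[::-1]
    return sum(k**i * (sum(sigma[i:]) % k) for i in range(len(sigma)))


def neighbor(u, d, k):
    """Node u^d; links are assumed to run toward increasing coordinates."""
    sigma = list(u[::-1])
    sigma[d] = (sigma[d] + 1) % k
    return tuple(sigma[::-1])


def is_boundary(u, d, k):
    """Definition 1: (u, u^d) is a boundary iff label(u) > label(u^d)."""
    return label(u, k) > label(neighbor(u, d, k), k)


def utpr(u, alpha, v, k):
    """Eq. (2): return (d, beta), identifying the next channel c_{d beta u}.

    Requires u != v, so that the set of useful dimensions is nonempty.
    """
    n = len(u)
    useful = [i for i in range(n) if u[n - 1 - i] != v[n - 1 - i]]  # Delta(u, v)
    free = [i for i in useful if not is_boundary(u, i, k)]         # Delta_f(u, v)
    d = min(free) if free else max(useful)
    beta = "p" if alpha == "p" and not is_boundary(u, d, k) else "h"
    return d, beta


def route(nodes, k):
    """Trace one worm from nodes[0] through nodes[1:], in the given order.

    Returns the channels used, as (u, d, beta) triples. The message enters
    the network on a p-channel and keeps its channel class between
    successive destinations.
    """
    channels, u, alpha = [], nodes[0], "p"
    for v in nodes[1:]:
        while u != v:
            d, alpha = utpr(u, alpha, v, k)
            channels.append((u, d, alpha))
            u = neighbor(u, d, k)
    return channels


# In the 6 x 6 torus, the hop from (5, 5) to (0, 5) crosses a boundary.
print(route([(5, 4), (0, 5)], 6))  # [((5, 4), 0, 'p'), ((5, 5), 1, 'h')]
```

Because every hop is taken in a useful dimension, each hop shortens the remaining distance by one, so the traced unicast paths have minimal length in this model.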
4. SINGLE-PHASE MULTICAST COMMUNICATION
To perform a multicast operation using multidestination messages, one or more communication steps may be used. Methods that reach all destination nodes in one communication step are termed single-phase [8], while those that require more than one step are called multiphase [6, 12]. During the first phase of a multiphase multicast, the source node sends a multidestination message to a subset of the destination nodes. During subsequent phases, some (perhaps all) of the nodes that have already received the message each send a multidestination message to a distinct subset of the nodes that have not yet received the message. This process continues until the message has reached every destination node. In this section, the UTPR routing function is used to develop a single-phase multicast algorithm for torus networks, while Section 5 describes multiphase algorithms.
The S-torus multicast algorithm (“S” for single phase) implements deadlock-free multicast communication by requiring each multidestination message to visit its destination nodes in an order corresponding to ℋ. Given the constraint of only two virtual channel sets, the message must be limited to one full cycle of ℋ. In order to meet this requirement, we define an ℋ-chain, which is simply a sequence of nodes whose order is consistent with the ordering established by ℋ.
DEFINITION 2. A sequence of nodes {u_0, u_1, u_2, …, u_{m−1}} is an ℋ-chain if and only if all elements of the sequence are distinct, and ℓ(u_i) < ℓ(u_{i+1}) for 0 ≤ i < m − 1.
The labels of the elements of an ℋ-chain are strictly increasing. An ℋ-cycle, on the other hand, is a sequence of nodes whose ordering is consistent with the cyclic ordering associated with ℋ, and is defined as an end-around rotation of an ℋ-chain.
DEFINITION 3. If Φ = {u_0, u_1, u_2, …, u_{m−1}} is an ℋ-chain and u_s is an element of Φ, then {u_s, u_{s+1}, …, u_{m−1}, u_0, u_1, …, u_{s−1}} is an ℋ-cycle with respect to u_s.
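Definitions 2 and 3 translate directly into a sort followed by a rotation. The Python sketch below shows one way to build the ℋ-cycle for a source and a destination set; the function names `is_h_chain` and `h_cycle` are hypothetical, nodes are assumed to be tuples (σ_{n−1}, …, σ_0), and the labeling of Eq. (1) is repeated so that the block is self-contained.

```python
def label(u, k):
    """Eq. (1); u is a tuple (sigma_{n-1}, ..., sigma_0) (assumed convention)."""
    sigma = u[::-1]
    return sum(k**i * (sum(sigma[i:]) % k) for i in range(len(sigma)))


def is_h_chain(seq, k):
    """Definition 2: labels strictly increase (which also forces distinctness)."""
    labels = [label(u, k) for u in seq]
    return all(a < b for a, b in zip(labels, labels[1:]))


def h_cycle(source, destinations, k):
    """Definition 3: sort into a chain by label, then rotate the source to the front."""
    chain = sorted({source, *destinations}, key=lambda u: label(u, k))
    s = chain.index(source)
    return chain[s:] + chain[:s]


# 6 x 6 torus: the labels of (1, 0), (2, 1), (0, 5) are 7, 15, and 5.
print(h_cycle((1, 0), [(2, 1), (0, 5)], 6))  # [(1, 0), (2, 1), (0, 5)]
```

The rotation keeps the cyclic order of the labels, so a message visiting the nodes in this order wraps past the start of the circuit at most once.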
As an example, we consider a multicast problem in a 2D (6 × 6) torus, depicted in Fig. 8. The source node (3, 2) is to deliver a message to nine destination nodes using a single multidestination message. We assume that routing is based on the HC shown in Fig. 4. The problem is thus defined by the sequence of nodes
Φ = {(3, 2)^23, (0, 5)^5, (4, 5)^27, (3, 4)^19, (5, 4)^33, (4, 3)^25, (1, 2)^9, (2, 1)^15, (5, 1)^30, (1, 0)^7},
where the first element of Φ is the source node and the order of destination nodes is arbitrary. For convenience, the label ℓ(u) of each node u is added as a superscript to the node address. First, Φ is sorted according to labels of the node addresses to obtain the ℋ-chain
Φ′ = {(0, 5)^5, (1, 0)^7, (1, 2)^9, (2, 1)^15, (3, 4)^19, (3, 2)^23, (4, 3)^25, (4, 5)^27, (5, 1)^30, (5, 4)^33}.
Next, the ℋ-chain Φ′ is rotated so that the source node, (3, 2), appears first in the sequence, resulting in the following ℋ-cycle:
Φ″ = {(3, 2)^23, (4, 3)^25, (4, 5)^27, (5, 1)^30, (5, 4)^33, (0, 5)^5, (1, 0)^7, (1, 2)^9, (2, 1)^15, (3, 4)^19}.
A single message can be routed, starting at the source node (3, 2), to each destination node in Φ″, in turn, according to the path routing function R_UTPR (Eq. (2)). The path of this multidestination message is depicted with bold arrows in Fig. 8. Figure 9 gives the message preparation algorithm, which is executed by the source node in order to construct the multidestination message header.
FIG. 8. A multidestination message under R_UTPR in a 2D torus.
FIG. 9. The message preparation algorithm for path-based routing.
THEOREM 1. If Φ = {u_0, u_1, u_2, …, u_{m−1}} is an ℋ-cycle, then a multidestination message, routed according to
path routing function R_UTPR, beginning at source node u_0 and routed through destination nodes u_1, u_2, …, u_{m−1}, in that order, has the following properties:
1. All paths between successive destination nodes are minimal.
2. The network is deadlock-free under any and all combinations of such multidestination messages.
3. All physical channels used by the message are distinct.
Proof. See the Appendix.
Since a unicast message is a special case of a multidestination message in which there is only one destination node, it follows that minimal, deadlock-free unicast routing is also provided by the above routing mechanism. Furthermore, all combinations of multidestination and unicast messages can coexist without possibility of network deadlock.
There are several advantages to implementing multicast communication with a single multidestination message. The operation requires only one communication step. Hence, for long messages, full advantage is taken of the communication pipelining of wormhole routing. Also, every destination node receives the message directly via its router; local processors are not required to relay the message. This feature may be very useful in wormhole-routed LANs, where per-switch destination sets are typically small and the streaming of multimedia data is important.
However, the S-torus algorithm also suffers from some disadvantages. For example, a single message used to reach all N nodes in the network, as in the case of a broadcast operation, will have a total path length of N − 1. Such extremely long paths can result in poor performance in large parallel computer networks because of transmission delays and network congestion [12]. Moreover, a single-phase approach to multicast does not exploit the communication parallelism that is possible when concurrent messages are used to complete the operation [12]. These considerations motivate the study of multiphase multicast algorithms.
5. MULTIPHASE MULTICAST COMMUNICATION
The concept of using multiple multidestination messages to implement broadcast operations was proposed by Panda et al. [12] and by Ho and Kao [6] for use in broadcast algorithms for meshes and hypercubes, respectively. In this section, we present several multiphase multicast algorithms for unidirectional torus networks. Like the S-torus algorithm, all these algorithms use the UTPR routing function.
5.1. The M-Torus Generalized Multicast Algorithm
As before, we assume that a multicast operation is described by an ℋ-cycle, Φ, where the first element of Φ is the source node. Since multiple worms will be used to deliver the message to the destinations, we need to partition the ℋ-cycle.
DEFINITION 4. A partition of an ℋ-cycle Φ = {u_0, u_1, u_2, …, u_{m−1}} is a sequence of nonempty ℋ-cycles 𝒫 = {Φ_0, Φ_1, Φ_2, …, Φ_{r−1}} such that r ≥ 2 and Φ = Φ_0 ‖ Φ_1 ‖ Φ_2 ‖ ⋯ ‖ Φ_{r−1}, where the symbol “‖” represents concatenation of ℋ-cycles. Each ℋ-cycle, Φ_i, of 𝒫 is written as Φ_i = {u_{i,0}, u_{i,1}, …, u_{i,(m_i−1)}}.
The M-torus multicast algorithm, given in Fig. 10, implements multiphase multicast operations by recursively partitioning the ℋ-cycle. Let Φ be an ℋ-cycle containing the source and destinations of a multicast operation. The ordering of the nodes depends strictly on the source node and the Hamiltonian circuit used for routing. The ℋ-cycle Φ can be partitioned into smaller ℋ-cycles in a variety of ways, but the order of the nodes remains fixed. Let 𝒫 be a partition of Φ. In the first phase of the algorithm, the source node sends a multidestination message addressed to the first node in each ℋ-cycle in 𝒫 (except for the ℋ-cycle containing the source node itself). The first node of each list, having just received the message, becomes the source node for the next iteration, and the algorithm proceeds recursively.
FIG. 10. The M-torus algorithm for multicast.
Since there are many ways to partition a given ℋ-cycle, a specific instance of the M-torus algorithm is determined by the partition used in Step 1. For example, if the input ℋ-cycle (of length m) is partitioned into m individual parts (whose lengths are all equal to one), then the M-torus algorithm will produce a single-phase multicast equivalent to the S-torus algorithm. On the other hand, if the input ℋ-cycle is always partitioned into exactly two smaller ℋ-cycles, then all messages generated by the algorithm will have only a single destination, resulting in a unicast-based implementation [14]. Between these two extremes are a multitude of partitioning methods, each resulting in a different version of the M-torus algorithm.
An important property of the M-torus algorithm is that it is contention-free, regardless of the partition used in each suboperation.
THEOREM 2. The M-torus algorithm applied to an ℋ-cycle Φ = {u_0, u_1, u_2, …, u_{m−1}} results in a contention-free multicast from source node u_0 to destination nodes u_1, u_2, …, u_{m−1}.
Proof. See the Appendix.
Next, we examine two particular partitioning methods: dimensional partitioning and uniform partitioning. These two partitioning methods correspond to two instances of the M-torus algorithm termed the M_d-torus and M_u-torus multicast algorithms, respectively.
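Since Fig. 10 is not reproduced in this text, the following Python sketch reconstructs the recursive scheme from the prose description only; it should be read as an assumption-laden illustration rather than the authors' exact algorithm. The function names and the schedule format are hypothetical. Two partition functions are included, anticipating the dimensional and uniform partitioning of Sections 5.2 and 5.3, and the usage lines apply them to the 64 × 64 example of Section 5.2.

```python
from itertools import groupby


def dimensional_partition(cycle, order):
    """Group consecutive nodes that agree in every dimension i >= order.

    Assumptions: nodes are tuples (sigma_{n-1}, ..., sigma_0), and the nodes
    of each subtorus are contiguous in the cycle, as in the paper's example.
    """
    n = len(cycle[0])
    return [list(g) for _, g in groupby(cycle, key=lambda u: u[:n - order])]


def uniform_partition(cycle, r):
    """Split the cycle into min(r, m) consecutive parts of near-equal size."""
    m = len(cycle)
    r = min(r, m)  # assumption: a cycle shorter than r is split into single nodes
    q, extra = divmod(m, r)
    parts, start = [], 0
    for a in range(r):
        size = q + (1 if a < extra else 0)
        parts.append(cycle[start:start + size])
        start += size
    return parts


def m_torus(cycle, partition, phase=1, schedule=None):
    """Return {phase: [(sender, [addressed nodes]), ...]} for one multicast.

    `partition(cycle, phase)` must return consecutive, nonempty pieces of
    `cycle` and must eventually split every cycle into single nodes;
    otherwise the recursion does not terminate.
    """
    if schedule is None:
        schedule = {}
    if len(cycle) < 2:
        return schedule
    parts = partition(cycle, phase)
    if len(parts) > 1:
        # The sender addresses the first node of every part but its own.
        heads = [part[0] for part in parts[1:]]
        schedule.setdefault(phase, []).append((cycle[0], heads))
    for part in parts:
        m_torus(part, partition, phase + 1, schedule)
    return schedule


phi = [(15, 7), (15, 9), (15, 13), (15, 17), (15, 22), (15, 24), (15, 27),
       (15, 33), (15, 35), (15, 36), (2, 50), (2, 60),
       (5, 22), (5, 32), (5, 42), (5, 52)]

# M_d-torus in a 2D torus: order 1 in phase 1, order 0 in phase 2.
md = m_torus(phi, lambda c, ph: dimensional_partition(c, 2 - ph))
# M_u-torus with r = 4.
mu = m_torus(phi, lambda c, ph: uniform_partition(c, 4))

print(md[1])  # [((15, 7), [(2, 50), (5, 22)])]
print(mu[1])  # [((15, 7), [(15, 22), (15, 35), (5, 22)])]
```

Passing a partition function that splits a cycle into single nodes yields one message to all destinations, and always splitting in two yields single-destination messages only, which matches the two extremes described above.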
5.2. The M_d-Torus Multicast Algorithm
The dimensional partitioning method attempts to limit the path lengths of the constituent messages in a multicast operation. A dimensional partition of order d divides the nodes of an ℋ-cycle by their respective subtori of dimension d. For example, in a 3D torus, each ℋ-cycle in a dimensional partition of order 2 contains the nodes in a plane of the network. In a similar manner, a dimensional partition of order 1 divides the nodes according to the rings containing them. Stated formally, we have:
DEFINITION 5. A partition 𝒫 of an ℋ-cycle Φ is a dimensional partition of order d if and only if
1. For each ℋ-cycle Φ_a ∈ 𝒫, and for any two nodes u, v ∈ Φ_a, σ_i(u) = σ_i(v), for d ≤ i ≤ n − 1; and
2. For any two ℋ-cycles Φ_a, Φ_b ∈ 𝒫, where a ≠ b, and any two nodes u ∈ Φ_a and v ∈ Φ_b, there exists some integer i, d ≤ i ≤ n − 1, such that σ_i(u) ≠ σ_i(v).
In an n-dimensional torus network, the M_d-torus algorithm operates as follows (refer to Fig. 10). During the first communication phase, the partition 𝒫 is a dimensional partition of order n − 1. During the second phase, a dimensional partition of order n − 2 is used, and so forth, until the nth and last phase, where 𝒫 is a dimensional partition of order 0, which is simply a partition of Φ into individual nodes.
As an example, consider the multicast problem in a 2D (64 × 64) torus defined by the following nodes, already arranged into an ℋ-cycle. The source node is (15, 7):
Φ = {(15, 7), (15, 9), (15, 13), (15, 17), (15, 22), (15, 24), (15, 27), (15, 33), (15, 35), (15, 36), (2, 50), (2, 60), (5, 22), (5, 32), (5, 42), (5, 52)}.
Since the dimensionality of the torus is 2, the M_d-torus algorithm first performs a dimensional partitioning of order 1 on Φ, resulting in three ℋ-cycles:
Φ_0 = {(15, 7), (15, 9), (15, 13), (15, 17), (15, 22), (15, 24), (15, 27), (15, 33), (15, 35), (15, 36)}
Φ_1 = {(2, 50), (2, 60)}
Φ_2 = {(5, 22), (5, 32), (5, 42), (5, 52)}.
The source node (15, 7) sends a multidestination message addressed to the first node in Φ_1, (2, 50), and the first node in Φ_2, (5, 22). During the second phase, the first node in each of the three ℋ-cycles sends a multidestination message addressed to all other nodes in its respective ℋ-cycle. In the example, these three messages are sent to nine, one, and three destinations, respectively. The M_d-torus algorithm completes the multicast in 2 phases because the dimensionality of the network is 2.
5.3. The M_u-Torus Multicast Algorithm
The M_d-torus algorithm partitions the destination nodes based only on the structure of the underlying torus network, without regard to the actual destinations of the multicast operation. In some cases, it may be useful to consider the structure of the ℋ-cycle to be partitioned. For example, limiting the number of destination nodes reached by any single message reduces the length of the message header and tends to also reduce the message path length. A method that allows the number of destinations per message to be controlled is termed uniform partitioning, in which the ℋ-cycle is divided into a specified number of ℋ-cycles whose sizes are as nearly equal as possible. We emphasize that the word “uniform” refers to the size of the cycles, and not to any randomized approach associated with a uniform distribution. Uniform partitioning is formally defined as follows.
DEFINITION 6. A partition of an -cycle Φ is a uniform partition of size r if and only if
1. The partition contains exactly r -cycles; and
2. For each -cycle Φ_a in the partition, either |Φ_a| = ⌈m/r⌉ or |Φ_a| = ⌊m/r⌋, where m = |Φ| is the size of the multicast operation.
The M u-torus algorithm is a special case of the M-torus algorithm that uses uniform partitioning. The M u-torus algorithm is parameterized by the size of the partition, r. In all but the last phase of the algorithm, r − 1 is the number of destinations of each constituent message. In the last phase, messages may involve fewer destinations. Using this algorithm, the aggregate number of destinations reached will grow by a factor of r during each phase, leading to the following result: The M u-torus algorithm, with partitioning parameter r, applied to a multicast operation of size m, requires ⌈log_r m⌉ phases to complete the multicast operation.
We now apply the M u-torus algorithm, with parameter r = 4, to the multicast problem used to demonstrate the M d-torus algorithm. During each phase, uniform partitioning is used to create r = 4 -cycles. In the first phase these -cycles are
Φ_0 = {(15, 7), (15, 9), (15, 13), (15, 17)}
Φ_1 = {(15, 22), (15, 24), (15, 27), (15, 33)}
Φ_2 = {(15, 35), (15, 36), (2, 50), (2, 60)}
Φ_3 = {(5, 22), (5, 32), (5, 42), (5, 52)}.
The source node (15, 7) sends a multidestination message to nodes (15, 22), (15, 35), and (5, 22). During the second phase, the first node in each of the four -cycles applies uniform partitioning with parameter r = 4. In each case, the result is a partitioning of the sequence into individual nodes. Thus, each
node holding a copy of the message (specifically, nodes (15, 7), (15, 22), (15, 35), and (5, 22)) sends a multidestination message to the three remaining nodes in its respective -cycle. The M u-torus algorithm with parameter r = 4 completes the multicast of size m = 16 in 2 phases.
5.4. Discussion
The M d-torus multicast algorithm takes advantage of the underlying dimensional structure of the torus network. By partitioning the multicast problem according to dimension boundaries, subsequent phases of the multicast are confined to subtori of decreasing dimension. This confinement is beneficial in limiting interdestination path lengths of multidestination messages. The M u-torus algorithm, on the other hand, uses the strategy of limiting the number of destination nodes for each multidestination message, thereby preventing excessively long total path lengths. We compare the two algorithms through simulation in Section 6.
In addition to the two methods described above, many other partitions can be used in the M-torus algorithm. Even hybrid approaches are possible. For example, a partitioning method could combine -cycles produced by dimensional partitioning whenever two or more adjacent subtori contain relatively few destination nodes, and could partition an -cycle for a single subtorus that contains many destination nodes. This method balances the purely "network view" imposed by dimensional partitioning with the "destination set" view of uniform partitioning.
6. PERFORMANCE EVALUATION
In order to better understand the performance of the multicast algorithms presented in the previous two sections, a simulation study was conducted. To our knowledge, the proposed routing function is the first to offer a general framework for multicast algorithms in unidirectional torus networks, so it is not possible to compare our methods with alternative proposals with the same goal. However, the generality of the proposed algorithms makes it possible to compare certain special cases, some of which have appeared previously in the literature, and some of which are proposed in this paper.
Five specific multicast algorithms were simulated: the single-phase S-torus algorithm, the M d-torus algorithm, two instances of the M u-torus algorithm, and for comparison, a unicast-based multicast algorithm [14]. In order to examine the effect of the partitioning parameter, r, on the M u-torus algorithm, two such parameter values were chosen: r = 8 and r = 64. We refer to the corresponding algorithms as M u-torus(8) and M u-torus(64), respectively. The unicast-based algorithm, which also uses UTPR, is essentially the M u-torus algorithm with parameter value r = 2. To complete a multicast of size m, the algorithm therefore requires ⌈log_r m⌉ = ⌈log_2 m⌉ phases, or communication steps [14]. The S-torus algorithm, being very
similar to the original single-phase path-based methods of Lin et al. [8], is included as a way of comparing this method to our M d-torus and M u-torus algorithms. Likewise, the unicast-based algorithm we examine here is very similar to existing multicast methods [14].
The system model for the simulation is the same as that assumed throughout this paper: a k-ary n-dimensional torus with unidirectional communication links. All simulations were performed for a 4096-node torus. Both 2D (64 × 64) and 3D (16 × 16 × 16) topologies were examined, as well as various message lengths, multicast sizes, and startup latencies. In order to provide example cases for the simulation, each multicast set was produced by selecting randomly, from all nodes in the system, the appropriate number of unique nodes.
Specific values were chosen for the following parameters. The software overheads at the message source and destination nodes are represented respectively by the message send latency, τ S, and the message receive latency, τ R. The combined send and receive latencies are referred to as the message startup latency. The time required for a message in the network to advance one flit is represented by the per-flit network latency, τ n. We present performance results for two sets of message latencies: one corresponding to first-generation wormhole-routed parallel computers, and a second set of smaller latencies consistent with current systems.
6.1. Performance under Large Startup Latencies
Figure 11 shows the performance of the various multicast methods on a 2D (64 × 64) torus with message latency values of τ S = 95 µsec, τ R = 75 µsec, and τ n = 0.5 µsec. These values reflect the relatively high message startup latencies reported for early wormhole-routed systems, such as the nCUBE2 hypercube [11]. Both average (over all destinations) and maximum multicast delays are shown. Results are shown for message lengths of 8, 512, and 16,384 flits.
Under this configuration, the results show that all of the path-based methods, with their additional hardware support for intermediate reception, perform better than the unicast-based operation, except for very short messages. Even for those short messages, however, the M u-torus(8) algorithm performs much better than the unicast-based method. In general, as the message length is increased, methods that use fewer communication phases tend to perform better due to the pipelining of wormhole routing. Among path-based algorithms, the algorithm providing the best performance depends on the message length. The S-torus algorithm performs very well for long messages, due to lower relative startup costs, whereas the M u-torus(8) algorithm performs better with short and medium message lengths.
6.2. Performance under Small Startup Latencies
In another group of simulations, whose results are shown in Fig. 12, lower message startup latencies were used to
FIG. 11. Multicast delay (4096-node 2D torus, high message startup latency).
reflect recent improvements in network interface designs. The respective latency values are τ S = 10 µsec, τ R = 8 µsec, and τ n = 0.5 µsec. Again, message sizes of 8, 512, and 16,384 flits were simulated. Under this configuration, the unicast-based method exhibits performance that is nearly identical to the M u-torus(8) algorithm for short messages, but still performs poorly, compared to the path-based methods, for medium
and long messages. For long messages, the S-torus algorithm shows the best performance. The results for the 3D torus, which may be found in [13], are similar to those for the 2D torus, though the performance of the S-torus, M d-torus, and M u-torus(64) algorithms improves somewhat due to the shorter average path lengths of the 3D topology.
FIG. 12. Multicast delay (4096-node 2D torus, low message startup latency).
6.3. Effect of Message Size
In Fig. 13, the multicast delays are plotted against the message length for a multicast involving 512 (out of 4096) nodes. Results for both high and low message startup latencies are presented. The results show that the amount by which the algorithm is affected by an increase in the message size is directly related to the number of communication phases used by the algorithm. The S-torus algorithm, with only one phase, is least affected by the message length, while at the other extreme, the unicast-based method, requiring ⌈log_2 512⌉ = 9 phases, is most affected.
6.4. Link Utilization
An important attribute of any communication operation is the resultant amount of network traffic produced, which
FIG. 13. Multicast delay (4096-node 2D torus, 512 node multicast size).
may affect other communication in the network. In order to study this characteristic, the total number of link visits was recorded for each simulated multicast operation. Each link visit represents the use of one communication link by one message. Multiplying the number of link visits involved in a multicast operation by the message length and by the per-flit network latency, τ n, produces a value equivalent to the summation, over all links in the network, of the total time during which the link is being used by the operation. Figure 14 plots the resultant link usage for both 2D and 3D 4096-node torus networks, over various multicast sizes. As shown, the path-based algorithms require the use of fewer communication links than does the unicast-based method, especially with the 2D topology. 6.5. Algorithm Selection We emphasize that any and all of the multicast algorithms presented in this paper can be used on the same system (see Section 7). While the UTPR routing function is presumed to be implemented in hardware, a system could be tuned so that the source node chooses the algorithm that has been found to perform well under the current conditions. For example, in a system with low message startup latency, Fig. 12 indicates
that the single-phase S-torus algorithm is appropriate for very long messages, while the M u-torus(8) algorithm is better suited to shorter messages. Empirical studies on a particular system could be used to refine this algorithm selection process. The choice of multicast algorithm is made in software at the source node and does not affect the underlying routing hardware.
Of course, the particular HC used in the system also affects performance. There may exist other HCs in a given network, and it may be possible that certain multicast operations executed on a different HC would use fewer communication links. However, only one HC may be used in a particular system; otherwise, deadlock is possible. The particular HC (and routing function) that we have defined in Section 3.1 was chosen because it provides minimal unicast routing and prevents deadlock among all combinations of unicast and multicast traffic. Moreover, we have shown that this HC supports a wide variety of multicast algorithms, allowing performance tuning to be implemented in software as described above. To our knowledge, this is the first proposal for a unified approach to multicasting in unidirectional torus networks. Evaluation of alternative HCs is a topic for future research.
FIG. 14. Total link usage (4096-node tori).
7. IMPLEMENTATION ISSUES
Because the path routing function must be implemented in hardware, it must be simple. The path routing function must be designed so that the output channel on which an incoming message is to be forwarded can be determined by examining only the first few flits of the message header. Also, the router must be able to determine quickly if the current node is a destination of the incoming message, and if so, whether there are additional destinations to which the message must be forwarded.
7.1. Boundary Identification
In order for the router at a node u to implement UTPR given in Eq. (2), the outgoing channels at node u that are boundaries must be identified. For any node u, and any dimension d (0 ≤ d ≤ n − 1), channel (u, u^d) is a boundary if and only if
\[ \Bigl( \sum_{j=d}^{n-1} \sigma_j(u) \Bigr) \bmod k = k - 1. \]
The boundary dimensions for each node can be identified once during system startup, and need never be recomputed.
7.2. Algorithmic View of UTPR
The formal definition of the path routing function UTPR, given by Eq. (2), is intended to convey the concept of how the routing path of a message is determined. It does not, however, represent an efficient implementation of the function. An algorithmic view of the function UTPR, more suitable for implementation in a router, is given in Fig. 15. The predicate boundary simply indicates whether an outgoing channel on the specified dimension is a boundary. For messages originating at the local node, u (as opposed to those that arrive on network channels), the input parameter α is set initially to p. When the dimension n of the torus is small, as is the case with commercial torus networks, the UTPR path routing function can be easily implemented with a simple combinational logic circuit.
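The boundary test of Section 7.1 lends itself to a one-time precomputation. The following Python sketch is a software illustration only (the paper envisions combinational logic in the router); it assumes that a node is written as a tuple (σ_{n−1}, ..., σ_0), and the function names are hypothetical.

```python
from itertools import product


def boundary_dimensions(node, k):
    """Return the dimensions d whose outgoing channel at `node` is a boundary.

    Assumption: node = (sigma_{n-1}, ..., sigma_0). The channel in dimension d
    is a boundary when (sigma_d + ... + sigma_{n-1}) mod k == k - 1.
    """
    n = len(node)
    dims, running = [], 0
    for position, coordinate in enumerate(node):
        d = n - 1 - position             # dimension of this coordinate
        running += coordinate            # sigma_d + ... + sigma_{n-1}
        if running % k == k - 1:
            dims.append(d)
    return sorted(dims)


def boundary_table(k, n):
    """Precompute the boundary dimensions of every node, as done once at startup."""
    return {node: boundary_dimensions(node, k)
            for node in product(range(k), repeat=n)}


print(boundary_dimensions((15, 7), 64))    # source node of the running example
print(boundary_dimensions((15, 48), 64))   # 15 + 48 = 63 = k - 1 in dimension 0
```

Because exactly one value of σ_d satisfies the test for any fixed choice of the other coordinates, each dimension has k^(n−1) boundary channels in the whole network.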
7.3. Multidestination Message Format With unicast wormhole routing, the header flit of a message contains either the address of the destination node (absolute addressing) or the direction and distance from the current node to the destination node (relative addressing). Similarly, in multidestination routing, the message can be prefixed with either a list of absolute addresses or a list of relative addresses [8]. The order of these addresses corresponds to the order in which the destination nodes are to be visited. Once the message header reaches the first destination node in the list, the router at that node removes its own address from the front of the list and forwards the remaining flits towards the next destination node, while simultaneously copying the message contents to the local host memory. The router at the last destination node copies the message to the local host memory, but does not forward the message. In order to identify the last destination address in the message header, and consequently, the beginning of the message body, the address of the last destination node can be duplicated in the message header. The occurrence of two consecutive and identical destination addresses then signifies the end of the destination address list. The message format
FIG. 15. Implementation of the path routing function R_UTPR.
FIG. 16. Multidestination message format.
corresponding to the -cycle Φ = {u_0, u_1, u_2, ..., u_{m−1}}; see Fig. 16. As the message progresses through the network, the header becomes shorter as the routers at each destination node remove their respective addresses from the head of the message. In the case of relative addressing, rather than duplicating the address of the last destination node, the list of destination addresses can simply be appended with a zero offset (which, conceptually, is equivalent to duplicating the last address, since an offset of zero from the last destination is, indeed, the last destination).
7.4. The Flit Forwarding Algorithm
Figure 17 specifies the flit forwarding algorithm implemented in a router to support UTPR. Absolute addressing is assumed; very minor adjustments are required to support relative addressing. This algorithm can be implemented in a router using relatively simple logic similar to that described in [4]. The recv function reads the next incoming flit, while the send function transmits a flit over the specified outgoing channel. In cases where channel contention occurs so that a flit cannot be sent over the specified channel, the send function blocks until the channel is available. The send-local function copies a flit to the local host memory.
The router hardware need be only slightly more complex than routers in current systems that support single-destination messages. Only two flit buffers are required for each incoming channel (corresponding to "flit1" and "flit2" in Fig. 17). Decision logic in the flit forwarding algorithm can be implemented using simple combinational logic.
7.5. Multiphase Implementation Issues
To implement the M-torus multiphase algorithm, each intermediate destination node must be informed of the list of nodes to which it must send the message.
The following are three potential mechanisms by which the dissemination of the multicast structure can be accomplished:
(1) If a particular source node is to perform repeated multicasts to the same group of destination nodes, then the necessary information can be distributed once when the communication group is formed. During subsequent multicasts to that group, a group identifier (GID) is attached to the message body so that each destination node can refer to the local state information associated with the group.
(2) The -cycle for which an intermediate destination node is subsequently responsible can be included in the multicast message body.
(3) The required information can be incorporated into the message header so that the router relays to each intermediate destination node the appropriate -cycle.
FIG. 17. The flit forwarding algorithm for path-based routing.
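The header handling described in Sections 7.3 and 7.4 can be modeled in software. The Python sketch below is not the algorithm of Fig. 17, which operates flit by flit in hardware; it is a message-level approximation under stated assumptions: absolute addressing, addresses represented as tuples, body flits represented as strings, and non-destination routers along the path omitted. All function names are hypothetical.

```python
def make_message(destinations, body):
    """Header = destination addresses in visiting order, last address repeated."""
    return list(destinations) + [destinations[-1]] + list(body)


def route_at(node, flits):
    """Model one router. Returns (next_destination, forwarded_flits, local_copy)."""
    flit1, flit2 = flits[0], flits[1]       # the two buffered header flits
    if flit1 != node:
        return flit1, flits, None           # not a destination: forward unchanged
    # The header ends at the first pair of identical consecutive addresses.
    end = next(i for i in range(len(flits) - 1) if flits[i] == flits[i + 1])
    body = flits[end + 2:]
    if flit2 == flit1:
        return None, None, body             # last destination: deliver only
    return flit2, flits[1:], body           # strip own address, forward the rest


def deliver(destinations, body):
    """Walk the message from destination to destination and collect local copies."""
    flits, received = make_message(destinations, body), {}
    node = destinations[0]
    while flits is not None:
        node_next, flits, copy = route_at(node, flits)
        if copy is not None:
            received[node] = copy
        node = node_next
    return received


copies = deliver([(15, 22), (15, 35), (5, 22)], ["flit-a", "flit-b"])
print(sorted(copies))
```

In this model each destination receives the same body, and the header shrinks by one address at every destination, as described in the text.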
In the implementation of a system supporting path-based multicast, the choice between the above three methods of distributing the multicast structure depends on several factors. The first method, in which the structure information is distributed initially upon creation of a group of communicating nodes, is efficient only in cases where the same source and destinations will be involved in repeated multicasts. This method also requires that each destination node maintain a local group table to store such information.
If the multicast structure is added to the message body, as in the second method, one difficulty is that every destination node must receive the "structural" information intended for all of the other destination nodes. Without additional router support, the same message body must be delivered to each of the destination nodes of the message; there is no way to distinguish the structural information that is needed by a particular destination. Instead, the local host at each destination node must locate and use the correct -cycle from the list of -cycles received in the message body. An advantage of
this method is that no additional hardware support is needed beyond that required for single-phase multicast.
The third method requires that the routers be able to process compound message headers. For succinct discussion, we introduce the following notation. Let Φ_u be the -cycle that is used as input to the M-torus algorithm at node u. The -cycle Φ_u thus contains those destination nodes for which node u is responsible. If u is the original source node, then Φ_u represents the entire multicast operation. Figure 18 depicts a message that is to be transmitted from source node u_0 to destination nodes u_1, u_2, ..., u_{r−1} using a compound message header. Note that u_0 has partitioned the multicast problem represented by -cycle Φ into r sublists Φ_0, Φ_1, Φ_2, ..., Φ_{r−1}, where u_i is the first element of Φ_i (0 ≤ i ≤ r − 1). Each destination node address, u_i, in the header of a compound message is followed by a count, w_i = |Φ_{u_i}|, of the number of destination nodes for which u_i will in later phases be responsible, as well as the list, Φ_{u_i}, of those node addresses. The address of the last destination node, u_{r−1}, is duplicated to signify the end of the header.
Minor enhancements to the routing hardware are required to support the compound message format. In addition to the activities required for a basic multidestination message (Fig. 17), the router must perform two additional tasks: (1) When forwarding the message header to the next destination node, the proper -cycles must also be forwarded; and (2) when the local node is a destination of the message, the -cycle immediately following the local node address must be forwarded to the local host. When using the compound message format for a simple multidestination message, the -cycle for each destination node is null; however, in order for the router to process the message header, the associated -cycle size counts, all set to zero, must be included in the header.
Thus, the compound message format incurs an overhead of one additional header flit for each destination node when used for a simple (noncompound) message. In order to support both compound and noncompound multidestination message formats more efficiently, it is possible to place a message type indicator into the message header. This indicator would distinguish between the two types of multidestination message formats, and could even identify a unicast message, so that the most efficient message format could be used for each type of message. However, in order to interpret the message type information and adjust the router functionality accordingly, additional hardware support is required.
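The compound header layout described above can be sketched as a simple encoder and decoder. This Python sketch rests on several assumptions that the text does not settle: the count w_i and the list that follows it exclude the address u_i itself, the duplicated last address is placed after the final list, and addresses are tuples while counts are integers. The function names are hypothetical.

```python
def build_compound_header(sublists):
    """Encode [u_i, w_i, later-phase nodes of u_i ...] for each destination u_i.

    Each sublist is [u_i, node, node, ...]; u_i is a destination of this
    message and the remaining nodes are those it must reach in later phases.
    """
    header = []
    for sub in sublists:
        u_i, rest = sub[0], sub[1:]
        header += [u_i, len(rest), *rest]    # address, count w_i, node list
    header.append(sublists[-1][0])           # repeat last address as terminator
    return header


def parse_compound_header(header):
    """Decode a compound header; returns (sublists, index of first body flit)."""
    sublists, i = [], 0
    while True:
        u_i, w_i = header[i], header[i + 1]
        sublists.append([u_i] + header[i + 2:i + 2 + w_i])
        i += 2 + w_i
        if header[i] == u_i:                 # duplicated address ends the header
            return sublists, i + 1


phase_one = [
    [(15, 22), (15, 24), (15, 27), (15, 33)],
    [(15, 35), (15, 36), (2, 50), (2, 60)],
    [(5, 22), (5, 32), (5, 42), (5, 52)],
]
header = build_compound_header(phase_one)
print(len(header))
```

For a simple (noncompound) message every count is zero, so the encoder emits one extra flit per destination relative to the plain address list, which matches the overhead noted in the text.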
8. CONCLUSIONS AND FUTURE WORK
This paper has described and evaluated several path-based multicast algorithms for unidirectional wormhole-routed torus networks. The results are applicable to the design of architectures for both wormhole-routed parallel computers and high-speed LANs. The S-torus multicast algorithm uses a single multidestination message to perform an arbitrary multicast operation. The S-torus algorithm was extended to the M-torus algorithm, a generalized multiphase multicast algorithm, in which a combination of multidestination messages is used to perform a multicast in one or more communication phases. Two specific instances of the M-torus algorithm, the M d-torus and M u-torus multicast algorithms, were presented. These algorithms produce contention-free multicast operations and are deadlock-free under all combinations of network traffic.
A simulation study was conducted to compare the expected performance of the different algorithms. The results of this study show that the path-based multicast algorithms offer significant performance gains over unicast-based multicast techniques. By using the proposed algorithms, unidirectional torus systems with IR capability are able to perform efficient unicast and path-based multicast operations within a variety of environments. Finally, we described several issues that must be considered in the implementation of such systems and presented possible solutions. A particularly important issue is the manner in which multiphase multicast addressing is supported.
Our ongoing and future work in this area focuses on the design of path-based routing algorithms for other topologies and the use of path-based multicast to support "higher-level" collective operations, such as scan, global reduction, and all-to-all broadcast. A particularly intriguing topic is the use of this approach to support these operations, as well as multicast, within an "intelligent" LAN whose switch fabric is constructed from a wormhole-routed torus network.
APPENDIX: PROOFS OF THEOREMS 1 AND 2
FIG. 18. Compound message format.
In order to prove Theorem 1, we first present a series of lemmas. The lemmas are straightforward and follow directly from one another; proofs may be found in [13]. As a notational convenience, we define ℓ_i to be the coefficient of the ith term of Eq. (1); that is, for any node u and for 0 ≤ i ≤ n − 1,
\[ \ell_i(u) = \Bigl( \sum_{j=i}^{n-1} \sigma_j(u) \Bigr) \bmod k. \tag{3} \]
Thus, we can write Eq. (1) as
\[ \ell(u) = \sum_{i=0}^{n-1} \bigl[ k^i \, \ell_i(u) \bigr]. \tag{4} \]
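Equations (3) and (4) can be evaluated directly. The Python sketch below computes the label ℓ(u) and applies it to the 16-node 2D (64 × 64) example of Section 5, under the assumption that a node is written as a tuple (σ_{n−1}, ..., σ_0). The function names are hypothetical. For the source node (15, 7), ℓ_1 = 15 and ℓ_0 = (15 + 7) mod 64 = 22, so ℓ = 64 · 15 + 22 = 982.

```python
def label(node, k):
    """Node label l(u) from Eqs. (3) and (4).

    Assumption: node = (sigma_{n-1}, ..., sigma_0) in a k-ary n-dimensional torus.
    """
    n = len(node)
    total, running = 0, 0
    for position, coordinate in enumerate(node):
        i = n - 1 - position                  # dimension index of this coordinate
        running = (running + coordinate) % k  # l_i(u), Eq. (3)
        total += (k ** i) * running           # k^i * l_i(u), one term of Eq. (4)
    return total


def descents(cycle, k):
    """Count positions where the label decreases along the ordered node list."""
    labels = [label(node, k) for node in cycle]
    return sum(1 for a, b in zip(labels, labels[1:]) if a > b)


nodes = [(15, 7), (15, 9), (15, 13), (15, 17), (15, 22), (15, 24),
         (15, 27), (15, 33), (15, 35), (15, 36), (2, 50), (2, 60),
         (5, 22), (5, 32), (5, 42), (5, 52)]

print(label(nodes[0], 64))     # 982
print(descents(nodes, 64))     # 1
```

The labels of the example list increase except for a single drop between (15, 36) and (2, 50), and the label of the first node exceeds that of the last, which is the shape of label sequence considered in Case 2 of the proof of Theorem 1.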
LEMMA 1. Let u be any node in a unidirectional torus, and let d be any dimension, 0 ≤ d ≤ n − 1. Then ℓ(u) > ℓ(u^d) if and only if ℓ_d(u) > ℓ_d(u^d).
LEMMA 2. Let u and v be any two distinct nodes in a unidirectional torus and let d be the largest integer such that σ_d(u) ≠ σ_d(v). If channel (u, u^d) is a boundary, then ℓ(u) > ℓ(v).
LEMMA 3. Let u and v be any two distinct nodes in a unidirectional torus. If σ_{n−1}(u) ≠ σ_{n−1}(v), then σ_{n−1}(u) > σ_{n−1}(v) if and only if ℓ(u) > ℓ(v).
LEMMA 4. Let u and v be any two distinct nodes in a unidirectional torus. If ℓ(u) < ℓ(v), then there exists a dimension d such that σ_d(u) ≠ σ_d(v) and ℓ(u) < ℓ(u^d) ≤ ℓ(v). If ℓ(u) > ℓ(v), then there exists a dimension d such that σ_d(u) ≠ σ_d(v) and either ℓ(u) < ℓ(u^d) or ℓ(u^d) ≤ ℓ(v).
LEMMA 5. Let u and v be any two distinct nodes in a unidirectional torus and let w be the first node (after node u) on the path from u to v, as determined by the path routing function UTPR. If ℓ(u) < ℓ(v), then ℓ(u) < ℓ(w) ≤ ℓ(v). If ℓ(u) > ℓ(v), then either ℓ(u) < ℓ(w) or ℓ(w) ≤ ℓ(v).
LEMMA 6. Let u and v be any two distinct nodes in a unidirectional torus. If ℓ(u) < ℓ(v), then the path from u to v, as determined by the path routing function UTPR, does not contain a boundary. If ℓ(u) > ℓ(v), then the path contains exactly one boundary.
THEOREM 1. If Φ = {u_0, u_1, u_2, ..., u_{m−1}} is an -cycle, then a multidestination message, routed according to path routing function UTPR, beginning at source node u_0 and routed through destination nodes u_1, u_2, ..., u_{m−1}, in that order, has the following properties.
1. All paths between successive destination nodes are minimal.
2. The network is deadlock-free under any and all combinations of such multidestination messages.
3. All physical channels used by the message are distinct.
Proof. Assertion 1 (minimal paths): Since the path routing function UTPR selects only useful dimensions, all paths are minimal.
Assertion 2 (deadlock-free): In order to show that the network is deadlock-free, we define a total ordering on the virtual channels and the ports of the network, and then show that all messages reserve these resources in an order that is consistent with this total ordering. For any channel c_{dαu}, define λ(c_{dαu}) as follows:
\[ \lambda(c_{d\alpha u}) = \begin{cases} 0.\ell(u).d & \text{if } \alpha = p \\ 1.\ell(u).d & \text{if } \alpha = h. \end{cases} \]
Further, let us label the input ports at each router as follows: The p-port at node u is labeled 0.ℓ(u).(−1), and the h-port
at node u is labeled 1.ℓ(u).(−1). A lexicographical ordering of the three-part labels of each channel and port defines a total ordering of these resources, collectively referred to in the following as simply channels. All p-channels precede all h-channels in this ordering. Among the same virtual channel set, the ordering is determined by the label, ℓ(u), of the sending node, u. Finally, channels of the same virtual channel set and source node are ordered by their dimension, d, with d = −1 indicating that they are ports that connect the router to the local host. We consider two cases with respect to the -cycle, Φ.
Case 1 (Φ is an -chain). By Definition 1, ℓ(u_i) < ℓ(u_{i+1}) for 0 ≤ i < m − 1, and by Lemma 6, the path from node u_0 to node u_{m−1} does not contain a boundary. Thus, from Eq. (2), the path contains only p-channels. Channels are therefore used by the path from node u_0 to node u_{m−1} in an order that is consistent with the total ordering described above.
Case 2 (Φ is not an -chain). Since Φ is an -cycle, and is not an -chain, there exists an integer, q, 1 ≤ q ≤ m − 1, such that ℓ(u_0) > ℓ(u_{m−1}) and ℓ(u_0) < ℓ(u_1) < ... < ℓ(u_{q−1}) > ℓ(u_q) < ℓ(u_{q+1}) < ... < ℓ(u_{m−1}). Applying Lemma 6 to each pair of consecutive nodes in Φ shows that the path from node u_0 to node u_{m−1} contains exactly one boundary. Thus, by Eq. (2), the path uses p-channels prior to the boundary and h-channels thereafter, and from Definition 1, both the sequence of p-channels and the sequence of h-channels on the path visit nodes in an ascending order of node labels, ℓ(u). The label of the port used at the node immediately preceding the boundary is greater than those of all the p-channels visited thus far, and less than the label of the h-channel to be visited on the boundary. Thus, the order of all channels on the path from node u_0 to node u_{m−1} is consistent with the total ordering described above.
Since all messages reserve virtual channels in an order that is consistent with the total ordering of virtual channels defined above by λ, cycles of channel dependency, and thus deadlock, cannot occur.
Assertion 3 (distinct physical channels): The above proof of Assertion 2 shows that the path from node u_0 to node u_{m−1} does not visit any node more than once. Therefore, it cannot contain a multiple occurrence of a physical channel.
THEOREM 2. The M-torus algorithm applied to an -cycle Φ = {u_0, u_1, u_2, ..., u_{m−1}} results in a contention-free multicast from source node u_0 to destination nodes u_1, u_2, ..., u_{m−1}.
Proof. From Definition 4, each -cycle in the partition is written as Φ_i = {u_{i,0}, u_{i,1}, ..., u_{i,(m_i−1)}}. For each Φ_i, range(Φ_i) is defined as follows:
\[ \mathrm{range}(\Phi_i) = \begin{cases} \{\, u \mid \ell(u_{i,0}) \le \ell(u) \le \ell(u_{i,(m_i-1)}) \,\} & \text{if } \Phi_i \text{ is an -chain} \\ \{\, u \mid \ell(u_{i,0}) \le \ell(u) \text{ or } \ell(u) \le \ell(u_{i,(m_i-1)}) \,\} & \text{if } \Phi_i \text{ is not an -chain.} \end{cases} \]
From Lemma 6, it follows that the above -cycles are all disjoint; that is, range(Φ_i) ∩ range(Φ_j) = ∅ whenever i ≠ j. We now consider the multidestination messages generated within Φ_i. The first such message is generated by node u_{i,0}. Since the source and destination nodes of the message are ordered according to the -cycle Φ_i, then by Lemma 6, every node visited by the message is an element of range(Φ_i). We conclude that, once the original multicast operation has been partitioned into r subproblems, each corresponding to an -cycle, the messages generated by these subproblems do not visit common nodes. That is to say, there is no contention between any two subproblems. Since the message generated in Step 2 of the algorithm (Fig. 10) precedes all other messages of the multicast, it cannot contend with any other message produced by the operation. Recursively, each multicast subproblem corresponding to one of the -cycles Φ_i is itself contention-free. Therefore, the entire multicast operation is contention-free.
Since no assumptions are made in the above proof regarding the nature of any of the partitionings performed in Step 1 of the algorithm, the M-torus algorithm is contention-free regardless of the nature of the partitions used in the suboperations (for example, whether some are uniform and others are dimensional).
ACKNOWLEDGMENTS
The authors thank the anonymous reviewers for their many useful comments and suggestions for improving this paper. This work was supported in part by DOE Grant DE-FG02-93ER25167, by a grant from the EPA, and by NSF Grants MIP-9204066, CCR-9503838, CCR-9209873, and CCR-9407318.
REFERENCES
1. Boppana, R. V., Chalasani, S., and Raghavendra, C. S. On multicast wormhole routing in multicomputer networks. Proceedings of the 1994 Symposium on Parallel and Distributed Processing. 1994, pp. 722–729.
2. Dally, W. J., and Seitz, C. L. The torus routing chip. J. Distrib. Comput. 1, 3 (1986), 187–196.
3. Dally, W. J., and Seitz, C. L. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Comput. C-36, 5 (May 1987), 547–553.
4. Dally, W. J., and Song, P. Design of a self-timed VLSI multicomputer communication controller. Proc. of the International Conference on Computer Design (ICCD-87). 1987, pp. 230–234.
5. Duato, J. A new theory of deadlock-free adaptive multicast routing in wormhole networks. Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing. 1993, pp. 64–71.
6. Ho, C.-T., and Kao, M. Optimal broadcast on hypercubes with wormhole and e-cube routings. Proceedings of the 1993 International Conference on Parallel and Distributed Systems. Taipei, 1993, pp. 694–697.
7. Kessler, R. E., and Schwarzmeier, J. L. CRAY T3D: A new dimension for Cray Research. Proc. COMPCON. 1993, pp. 176–182.
8. Lin, X., McKinley, P. K., and Ni, L. M. Deadlock-free multicast wormhole routing in 2D mesh multicomputers. IEEE Trans. Parallel Distrib. Systems 5, 8 (Aug. 1994), 793–804.
9. McKinley, P. K., Tsai, Y.-J., and Robinson, D. Collective communication in wormhole-routed massively parallel computers. IEEE Comput. 28, 12 (Dec. 1995), 39–50.
10. Message Passing Interface Forum. MPI: A message-passing interface standard. Technical report, Department of Computer Science, University of Tennessee, Knoxville, Tennessee, May 1994.
11. Ni, L. M., and McKinley, P. K. A survey of wormhole routing techniques in direct networks. IEEE Comput. 26, 2 (Feb. 1993), 62–76.
12. Panda, D. K., Singal, S., and Prabhakaran, P. Multidestination message passing mechanism conforming to base wormhole routing scheme. Proceedings of the First International Parallel Computer Routing and Communication Workshop, Seattle, 1994. Springer-Verlag, Berlin/New York, pp. 131–145.
13. Robinson, D. F. Scalable multicast communication in massively parallel computers. Ph.D. thesis, Michigan State University, Department of Computer Science, 1994.
14. Robinson, D. F., McKinley, P. K., and Cheng, B. H. C. Optimal multicast communication in torus networks. IEEE Trans. Parallel Distrib. Systems 6, 10 (Oct. 1995), 1029–1042.
15. Tsai, Y.-J., and McKinley, P. K. Extended dominating node broadcast in all-port wormhole-routed torus networks. IEEE Trans. Parallel Distrib. Systems 7, 8 (Aug. 1996), 876–885.
16. Tseng, Y.-C., and Panda, D. K. Trip-based multicasting in wormhole-routed networks. Proceedings of the 7th International Parallel Processing Symposium. Newport Beach, CA, 1993, pp. 276–283.
DAVID F. ROBINSON received the B.S. in computer science from the University of Michigan-Flint in 1983 and the M.S. and Ph.D. in computer science from Michigan State University in 1989 and 1994, respectively. He has been an assistant professor in the Department of Computer Science at Quincy University since 1994. He was a member of technical staff at ROLM Corporation in Austin, Texas from 1983 to 1986. His current research interests include parallel and distributed computing, communications protocols for high performance computing, scalable software, monitoring and performance evaluation of parallel systems, and computer science education. He is a member of the IEEE and ACM.

PHILIP K. MCKINLEY received the B.S. in mathematics and computer science from Iowa State University in 1982, the M.S. in computer science from Purdue University in 1983, and the Ph.D. in computer science from the University of Illinois at Urbana–Champaign in 1989. He is an associate professor in the Department of Computer Science at Michigan State University, where he has been on the faculty since 1990. He was a member of technical staff at Bell Laboratories in Naperville, Illinois from 1982 to 1990, on leave of absence 1985–1989. His current research interests include scalable architectures and software, communications libraries for parallel and distributed computing, multicast communication, high-speed network architectures and protocols, and parallel numerical algorithms. He is a member of the IEEE and ACM.

BETTY H. C. CHENG received the B.S. in computer science from Northwestern University in 1985 and the M.S. and Ph.D. in computer science from the University of Illinois at Urbana–Champaign in 1987 and 1990, respectively. She is an associate professor in the Department
of Computer Science at Michigan State University, where she has been on the faculty since 1990. She conducts research in the areas of formal methods applied to software engineering, software development environments, object-oriented development methods, parallel and distributed computing, and multimedia information systems. Dr. Cheng was a faculty fellow at the Jet Propulsion Laboratory at California Institute of Technology in 1993, where she investigated the application of formal methods to the space shuttle software. Dr. Cheng is a member of the IEEE and ACM.

Received March 23, 1995; revised August 22, 1996; accepted July 23, 1996