A New Token-Based Channel Access Protocol for Wavelength Division Multiplexed Multiprocessor Interconnects


Journal of Parallel and Distributed Computing 60, 169–188 (2000), doi:10.1006/jpdc.1999.1599, available online at http://www.idealibrary.com



Joon-Ho Ha* and Timothy Mark Pinkston†

*Server Architecture Lab, Intel Corporation, CO3-240, 15220 NW Greenbrier Parkway, Beaverton, Oregon 97006; and †Electrical Engineering-Systems Department, University of Southern California, 3740 McClintock Avenue, EEB-208, Los Angeles, California 90089-2562. E-mail: joon-ho.ha@intel.com, tpink@charity.usc.edu

Received March 10, 1998; revised February 2, 1999; accepted September 1, 1999

This paper presents a new token-based channel access protocol for wavelength division multiplexed optically interconnected multiprocessors. Our empirical study of access protocols based on slotted time division multiplexed data and control channels reveals that such protocols typically suffer from excessive slot synchronization latency due to static slot preallocation. The proposed token-based time division multiple access protocol minimizes latency by allowing dynamic allocation of slots to use channels efficiently. Simulation results indicate that the proposed scheme can significantly increase performance over protocols based on preallocation as well as those based on preallocation-controlled reservation of multiple channels. © 2000 Academic Press

Key Words: channel access control protocol; WDM; optical broadcast-and-select network; shared-memory multiprocessor.

1. INTRODUCTION Optically interconnected networks using wavelength division multiplexing (WDM) offer connectability and reconfigurability that far exceed those of conventional, electronic-based networks. Based on broadcast-and-select protocols, such network architectures take advantage of the high fan-out capability of optics (e.g., using passive star couplers) and the wavelength routing capability of WDM [3, 19, 20]. Tightly coupled systems such as distributed shared-memory multiprocessors (DSMs) have distinctive communication needs that can benefit from both the high-bandwidth point-to-point connections and the broadcast/multicast capabilities of single-hop broadcast-and-select WDM optical interconnects [15]. 1 This research was supported by NSF Career Award Grant ECS-9624251 and by NSF Research Grant CCR-9812137.


0743-7315/00 $35.00 Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.


Channel access control protocols can help to achieve higher channel utilization and lower latency by efficiently allocating bandwidth to communication demands while ensuring fairness and minimal contention. Protocols designed to provide effective access control for diverse communication demands can significantly reduce communication latency for given network constraints. In particular, for optically interconnected shared-memory multiprocessors, where the high bandwidth capacity of optical technology and the small granularity of shared-memory communication can minimize the impact of transmission and contention latency on packet latency, efficient channel access control protocols become even more significant. This paper presents the design of an optical WDM channel access protocol based on token-controlled time division multiple access (TDMA). In this protocol, tokens are used to add flexibility to base schemes such as TDMA and TDMA-controlled reservation schemes by allowing the recovery and dynamic relocation of unused slot space. The improvements sought by augmenting those base protocols with token control are based on an extensive evaluation of two major classes of protocols: preallocation-based protocols such as TDMA [2] and reservation-based protocols with preallocated reservation control such as [1, 23, 24]. Our evaluation reveals that both classes of protocols typically suffer from excessive slot synchronization latency, the major portion of which consists of unused slot space. Consequently, these protocols fall short of providing the effective channel arbitration that would enable interprocessor communication to take full advantage of the high bandwidth and low latency capabilities of optical interconnects. It also confirms that channel access control protocols generally play a critical role in the performance of optical broadcast-and-select networks targeted for DSMs.
To overcome the performance drawbacks of the basic preallocation-based protocols, we propose to use the notion of a fast token as a supplemental control mechanism in conjunction with a preallocation-based approach. Tokens have shown the potential to be an effective mechanism for achieving mutually exclusive access to shared resources in the presence of contention [16, 21, 25]. In this study, we propose to use asynchronous optical tokens [21] that enable the efficient recovery of unused slot space and effectively reduce the slot synchronization latency it causes. Channel access control protocols have been widely investigated and a myriad of protocols have been proposed in the past. Certainly, the idea of using a token to implement channel access control is not new. However, the motivation of this work is twofold. First, we wish to demonstrate the potential viability and usefulness of fast tokens (optical or electrical) as a supplemental control mechanism in conjunction with other channel access control mechanisms in the implementation of broadcast-and-select networks for DSMs. Hence, the main focus here is not to create entirely new channel access control protocols, but rather to enhance previously proposed protocols in order to provide the performance required by DSM communication traffic characteristics. The secondary motivation is to provide an application-specific evaluation, in both a quantitative and a qualitative sense, regarding which channel access protocols


have the potential to better serve the communication behavior of shared-memory multiprocessors interconnected by optical broadcast-and-select networks. While previous access protocol studies, based predominantly on analytical modeling, can offer performance estimation for a broad spectrum of applications in very general terms, our application-specific performance analysis of protocols is based on a program-driven multiprocessor simulation using actual shared-memory programs and detailed network specifications. This approach is the first of its kind in this field of research and, in combination with previous performance studies based on analytical modeling, can provide a balanced picture of channel access protocol performance. The remainder of this paper is organized as follows. The following section describes broadcast-and-select networks and how they can be configured as multiprocessor interconnections. Section 3 gives background on the two major classes of access control protocols in this study. Section 4 gives a detailed description of our proposed channel access protocol. Section 5 explains how channel access protocols are applied to three types of broadcast-and-select network configurations. Section 6 evaluates the performance of the proposed protocol. Section 7 gives a discussion of our results and, finally, conclusions are drawn in Section 8.

2. BROADCAST-AND-SELECT NETWORKS Broadcast-and-select networks are attractive for the interconnection of shared-memory multiprocessors because they are inherently single-hop networks which can offer high-bandwidth point-to-point and broadcast/multicast capabilities with significantly less complexity than alternative electronic high-degree networks. Accordingly, a number of broadcast-and-select networks based on passive star couplers (see Fig. 1a) have been proposed for multiprocessor interconnection [4, 5, 8, 11, 13, 15, 18]. A broadcast-and-select network can be configured into a number of different single-hop network topologies as well as multihop topologies, depending on the range of channels to which a node has access. In addition, it can support point-to-point communication by selectively restricting the reception/transmission of certain wavelengths at each node. Of these different virtual topologies, there are three common ways in which broadcast-and-select networks can be configured for multiprocessor interconnection, as shown in Figs. 1b-1d: broadcast networks, point-to-point networks, and hybrids of both. The following section describes these networks in detail in the context of multiprocessor interconnection. 2.1. Network Configurations In a broadcast-and-select network, each network node is typically equipped with a set of transmitters and receivers, some of which may be tunable. The transmitter and receiver requirements depend on the type of network topology that the underlying broadcast-and-select network seeks to emulate.


FIG. 1. The multiprocessor node structure used in the evaluation and conceptual implementation of the three broadcast-and-select networks. (a) A star coupler implementation, (b) a broadcast network (BUS), (c) a point-to-point network (PTP), and (d) a hybrid network (Hybrid).

2.1.1. Broadcast Networks (BUS). This is a simple form of a broadcast-and-select network and consists primarily of an optical analog to an electrical shared bus. This network uses all the available channels for communication among all the nodes through broadcasting and mimics the functionality of a shared bus, as depicted in Fig. 1b. Channel multiplicity does not affect connectivity but contributes only to bandwidth. However, its practicality as a scalable interconnect for DSMs is limited by two factors. One is the cost of employing multiple transmitters and


receivers at each node for concurrent transmissions, which grows as O(N^2) with N nodes. The other is the speed of the electronic interfaces, since each node is exposed to the full available bandwidth. Hence, this topology is more appealing for subnetworks such as clusters within a network, where a small number of wide channels are available. BUS can also be used as a subnetwork to supplement the functionality and connectivity of other cooperating subnetworks within a network. In any case, the attraction of this topology for shared-memory multiprocessors is that it offers an effective cache coherence enforcement mechanism through broadcasting (e.g., snoopy cache protocols can be implemented). 2.1.2. Point-To-Point Networks (PTP). A point-to-point network assigns each node at least one unique receiving or transmitting channel, referred to as its home channel [4, 5, 7, 15]. Figure 1c depicts a PTP based on unique receiving home channels. In PTP, a node makes a point-to-point connection to any other node by transmitting/receiving on the unique home channel of that destination/source node. Hence, the scalability of this topology is limited by the number of available channels, although clustering and spatial reuse of wavelengths can relax this dependency. This network enables concurrent transmission of messages destined to different nodes using wavelength recognition and mimics the functionality of a fully connected network. The attraction of this topology for shared-memory multiprocessors is that a major portion of communication traffic is point-to-point in nature [9, 14, 15]. Broadcasting point-to-point messages can be detrimental to the performance of cache/memory controllers and network interfaces, as messages of limited interest unnecessarily consume a significant portion of global bandwidth.
The naming of the above two virtual topologies as BUS and PTP is according to the functional similarity that the two topologies have to conventional electrical networks: a shared bus and a point-to-point network, topologies that have been widely employed for shared-memory multiprocessor interconnection. 2.1.3. Hybrid Networks (Hybrid). The difficulties in efficiently handling a mixture of shared-memory multiprocessor traffic with either broadcast networks or point-to-point networks lead naturally to networks that can support both types of communication [4, 5, 15]. A hybrid network partitions channels into two classes: one class is used for broadcasting as in a broadcast network, and the other is used for point-to-point communication over home channels as in a point-to-point network (see Fig. 1d). In this study, we assume that all networks are single-level networks (i.e., no wavelength reuse). Such single-level networks can be used as building blocks for further generalized and more complex networks. 3. CHANNEL ACCESS CONTROL PROTOCOLS This section describes the main characteristics of two classes of previously proposed distributed channel access protocols for shared-memory multiprocessor


interconnects. Here, a distributed channel access protocol is loosely defined as a protocol that does not require a centralized control mechanism with memory for channel arbitration. A number of distributed channel access protocols have been proposed for use in optical broadcast-and-select networks for multiprocessors. I-TDMA [2] is based on time division multiple access (TDMA) and preallocates slots in an interleaved fashion between channels using a source-destination allocation map. TDMA-C [1] is a reservation-based protocol that employs a separate TDMA-controlled channel for reservation. Distributed access control is achieved through a status table at each node which holds the current reservation information of all channels. The status tables are updated by the control packets broadcast on the reservation channel. FatMAC [6, 23] and HRPTSA [24], an improved version of FatMAC, are hybrid protocols that combine preallocation and reservation with no dedicated control channels for reservation. FatMAC and HRPTSA are the first protocols to give comprehensive consideration to the characteristics of shared-memory multiprocessor traffic. Both protocols are based on the assumption that traffic in shared-memory multiprocessors can be classified into short messages and long messages. TDMA is used to control short messages, while long messages are controlled through reservation using short messages on data channels. Because FatMAC shares data channels for reservation, its application is limited by the data channels' ability to provide an adequate reservation network. The main objective is to reduce packet latency over TDMA by reducing the amount of slot segmentation due to unmatched slots. Our performance evaluation of these protocols applied to shared-memory multiprocessors indicates that channel arbitration is inefficient and the processors underutilize the high bandwidth and low latency capabilities of optical interconnects.
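The interleaved slot preallocation performed by protocols like I-TDMA can be sketched as follows; the rotation map below is a hypothetical illustration of the idea, not the exact source-destination allocation map of [2]:

```python
def i_tdma_slot_owner(channel: int, slot: int, n_nodes: int) -> int:
    """Return the node preallocated to `slot` on `channel`.

    Hypothetical interleaved map: ownership rotates with the slot index
    and is offset per channel, so a node's slots on different channels
    are staggered in time (a sketch of the I-TDMA idea, not the paper's
    exact assignment).
    """
    return (slot + channel) % n_nodes

# 4 nodes, 2 channels: each list shows which node owns each slot in turn.
for ch in range(2):
    owners = [i_tdma_slot_owner(ch, s, n_nodes=4) for s in range(6)]
    print(f"channel {ch}: {owners}")
```

Staggering a node's slots across channels also leaves room for transmitter tuning delays between them, a point revisited in Section 5.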
One major reason for the inefficiency is that these protocols do not serve the communication behavior of shared-memory multiprocessor traffic very well. The following subsections discuss the inherent limitations of these protocols. 3.1. TDMA Preallocation Protocols TDMA techniques allocate channel bandwidth to nodes by partitioning time into slots which are assigned in a static, predetermined fashion. Although TDMA requires no extra control activities other than synchronization, it does not support variable packet sizes well. Slot size is an important performance variable for TDMA because as message size varies, so does the optimal slot size. There are two opposing factors that influence the ideal slot size: channel utilization and TDMA overhead. TDMA overhead, denoted T_overhead, represents the delay between packet generation and the arrival of the assigned slot (i.e., slot synchronization latency), as shown in Fig. 2. The key system parameter affecting channel utilization and TDMA overhead is the slot size, denoted L. The major problem of static slot preallocation that adversely impacts both channel utilization and T_overhead is wasted slot space due to unused or segmented slots. Channels slotted on the boundary of smaller packets can increase channel utilization by reducing slot segmentation and unused slots. On the other hand, T_overhead

FIG. 2. Time division multiple access overhead.

is proportional to NP(2k-1)/(2k), where N is the number of nodes, P is the packet size, and k is the ratio of the packet size to the slot size, as represented by P = kL. The T_overhead of individual packets can be reduced by using larger slots (smaller k). Therefore, the slot size is typically determined by the longest packets (e.g., memory/cache blocks) in latency-critical systems such as shared-memory multiprocessors. However, larger slots can further decrease channel utilization due to unused and segmented slots. Nevertheless, T_overhead is on the order of NP, which may limit the use of TDMA in scalable systems. 3.2. TDMA-Controlled Reservation Protocols With reservation-based protocols, the notion of reservation can help to overcome some of the problems of TDMA in dealing with variable packet sizes, as it enables the allocation of matching slot space on a demand basis. Hence, reservation-based protocols have the potential to achieve higher channel utilization and lower latency than TDMA. However, the main problem with reservation-based protocols is the need to exchange explicit control information regarding the reservation of channels. Reservation packet exchange can be done out-of-band on dedicated control channels as in TDMA-C, or in-band on shared data channels as in FatMAC. Since the contention caused by reservation packet exchange is exactly the same in nature as that for data packet exchange, access control for the reservation packet exchange is also required. In this study, we restrict our attention to TDMA-controlled reservation protocols. Figure 3 shows the operation diagram of TDMA-C. In TDMA-C, before channels are grabbed for data transmission, a node makes a channel reservation by broadcasting a reservation packet on the control channel, which is controlled by TDMA. Our empirical study reveals that the major contributor to latency with TDMA-controlled reservation protocols is the control overhead caused by handling the reservation.
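The NP scaling of T_overhead, whether for data slots under TDMA (Section 3.1) or for reservation slots under TDMA-C, can be checked with a quick calculation; the parameter values below are illustrative only, not measurements from the paper:

```python
def tdma_overhead(n_nodes: int, packet_size: int, k: float) -> float:
    """Average slot-synchronization delay, proportional to
    N * P * (2k - 1) / (2k), where P = k * L ties the packet size P
    to the slot size L (units here are arbitrary)."""
    return n_nodes * packet_size * (2 * k - 1) / (2 * k)

# 64 nodes, 64-unit packets: enlarging slots (shrinking k) lowers the
# per-packet overhead, but only toward the N*P/2 floor.
for k in (4, 2, 1):
    print(f"k = {k}: T_overhead = {tdma_overhead(64, 64, k)}")
```

Note the N·P factor that remains no matter how k is chosen, which is the scalability limit the text points out.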
The two main components of the control overhead are T_overhead and the reservation process. T_overhead, as defined in the previous section, is the time required to access the assigned TDMA slot on the control channel for reservation, which is proportional to the size of the reservation slot for a given system size. The slot size is determined by the reservation process, which involves status table updating [1], reservation packet generation, and its transmission, as depicted in Fig. 3 in the context of TDMA-C. Generally, reservation-based protocols are based on the assumption that the size of reservation packets is much smaller than that of data packets. Nonetheless, we estimate that T_overhead, which is on the order of NP, can determine the control overhead in a large system. Moreover, when applied to shared-memory multiprocessors,

FIG. 3. Channel reservation using a TDMA control channel (TDMA-C).

it is expected that T_overhead can strongly influence the overall performance of reservation-based protocols, since the granularity of communication is relatively small in shared-memory multiprocessors. In what follows, we propose to use a fast token as an access control passing mechanism in an attempt to minimize the T_overhead due to unused slot space. 4. PROPOSED TOKEN-BASED TDMA The most significant problem with basic TDMA is that control of the channels is passed between nodes solely with regard to time and not with regard to communication needs. As a result, TDMA overhead typically comprises not only the time of actual data transmissions but also the time of unused slot space. Our empirical evaluation of TDMA using actual shared-memory multiprocessor communication traffic confirms that a significant portion of TDMA overhead results from unused slot space with basic TDMA. To overcome this weakness, we propose a token-based TDMA, referred to as T-TDMA, which uses a token as a supplemental control mechanism to pass control of channel access in conjunction with the slotted channel access of TDMA. The token is used to minimize the TDMA overhead due to unused slot space. T-TDMA has the potential to accomplish higher channel utilization and lower latency, as the token enables recovery of unused slot space according to the communication needs of individual nodes. 4.1. Token-Based TDMA Figure 4a illustrates the timing diagram of the proposed token-based TDMA operations. In this scheme, the use of a token to control access to TDMA slots is straightforward. Access to a slot on a shared channel is gained by acquiring a token that circulates through all nodes in the system in a predetermined order. In the figure, following the reception of a token, node i holds the token for data

FIG. 4. (a) Token-based TDMA, (b) token with a dedicated path, and (c) token sharing a data channel.

transmission on channel k. The token is released by node i before the data transmission is completed so that the token passing can overlap with the data transmission. In contrast, with no data to transmit, node i+1 immediately passes the token to node i+2. Assuming that the delay of token passing with no hold is smaller than the size of a data slot, the token permits the recovery of part of the assigned but unused slot space of node i+1. This recovered slot space is used for data transmission at a subsequent node, node i+2 in Fig. 4a, as part of a fully assigned slot space. Ideally, tokens can be implemented out-of-band using dedicated token paths to ensure fast operation, as shown in Fig. 4b. Such implementations require up to O(N) additional channels and input/output port resources (e.g., optical transceivers if optically implemented) for a system of N nodes. Alternatively, tokens can be implemented in-band using data channels if the channels can support token circulation appropriately. In such implementations, a token can be incorporated into the data packet as in Fig. 4c, where the token is carried in the header of a data packet. In this case, a node with no data to transmit can pass the token by transmitting a packet that comprises only a header without a payload (data) to the predetermined subsequent node. Consequently, the effectiveness of this scheme depends strongly on packet make-up time and the relative size of a header compared to the data.
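The slot-recovery behavior of Fig. 4a can be sketched as a toy schedule. The function below assumes a dedicated token path, a single circulation, and a fixed token-passing delay; it illustrates the mechanism only and is not the protocol's full specification:

```python
def t_tdma_round(has_data, slot_len, token_delay):
    """One token circulation under T-TDMA (sketch).

    Nodes holding data grab the token and transmit for a full slot
    (token release overlaps the transmission); idle nodes pass the
    token on after `token_delay` cycles, recovering the rest of their
    slot for downstream nodes.
    """
    t, start_times = 0, {}
    for node, ready in enumerate(has_data):
        if ready:
            start_times[node] = t   # node grabs the token and transmits
            t += slot_len
        else:
            t += token_delay        # unused slot shrinks to the pass delay
    return start_times, t

# 6 nodes, only nodes 0 and 4 ready; 20-cycle slots, 1-cycle token pass.
starts, total = t_tdma_round([1, 0, 0, 0, 1, 0], slot_len=20, token_delay=1)
print(starts, total)  # plain TDMA would need 6 * 20 = 120 cycles here
```

Making `token_delay` approach `slot_len` degrades the schedule back toward plain TDMA, which is the break-even effect discussed in Section 6.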


T-TDMA can be modeled in a fashion similar to regular TDMA. Token overhead, denoted T_toverhead, which represents the delay between packet generation and token arrival, replaces T_overhead as shown in Fig. 4a. Token-passing delay can affect T_toverhead in two different ways depending on whether the token is passed or grabbed. If the token is grabbed, token passing can either overlap data transmission on a dedicated path or be incorporated in the data packet, rendering the token delay virtually transparent. If the token is passed without being held for data transmission, the token delay contributes to the T_toverhead of subsequent nodes, as shown in Fig. 4a. Therefore, by minimizing the delay of token passing, the performance of token-based TDMA can be maximized. 4.2. Token Applied to Channel Access Control A single token may not be adequate for efficient control of multiple channels. Alternatively, the use of multiple tokens may require as many token paths, which can be very costly unless the tokens can be implemented in-band. If a single dedicated path has to be used for multiple tokens, it creates an additional problem of token path access control. Because of this, T-TDMA may be more suitable for controlling a single channel or a small number of channels unless implemented in-band. In broadcast-and-select networks, a broadcast channel can not only serve as a dedicated token path but can also allow tokens to be implemented in-band with data packets, while point-to-point channels cannot be used for token implementation due to their limited connectivity. Therefore, a point-to-point channel requires a dedicated token path for T-TDMA. The control channel, as a single broadcast channel in TDMA-C, is a perfect fit for T-TDMA implementation. Applied to the control channel of TDMA-C, T-TDMA can minimize the TDMA overhead due to unused reservation slots.
This, in turn, speeds up all the pending reservations in subsequent nodes, increases data channel utilization, and reduces packet latency. TDMA-C with a T-TDMA-controlled control channel is referred to as T-TDMA-C. A key factor that determines the effectiveness of T-TDMA is the speed of token operation, which determines the efficiency of slot recovery. It is shown in [21] that a high-speed asynchronous token can be implemented using a dedicated optical token path. Alternatively, tokens can be implemented entirely on an electronic medium, depending on the size of the slots being controlled. Moreover, a token can time-share data channels (see Fig. 4c), as in the case of the token bus and the token ring [25], if the hardware cost of a dedicated token path is significant. In such cases, a token can be incorporated into the header of a data packet, and its capture and release can be controlled by a mechanism similar to that for packet transmission and reception, where the header of a packet specifies both the receiver of the payload and the receiver of the token. In any case, the token circulates through the nodes in a predetermined, cyclic order. 5. PROTOCOLS APPLIED TO NETWORKS This section describes how channel access protocols are applied to the three network architectures in Fig. 1. TDMA and T-TDMA applied to BUS can slot all


the channels as if they were a single channel with multiple bit-lines. In such cases, each channel serves as a bit-serial line and packets are transmitted in bit-parallel mode (as wide as the number of channels), as in [10] and the AMOEBA switch [17]. A single token suffices to control all the channels as a single entity. The token can be implemented in three different ways: (1) incorporated in the data packet, (2) implemented on the control channel, and (3) implemented on a dedicated optical path (not shown in Fig. 1b). TDMA-C or T-TDMA-C applied to BUS uses the TDMA-controlled or T-TDMA-controlled control channel to reserve slots matching the size of data packets on the data channels. For T-TDMA-C, a single token controlling the control channel can be implemented either by method (1), with the reservation packets, or by method (3) described above. TDMA on PTP can be implemented similarly to BUS except that it allocates slots such that any two slots assigned to a node on two different channels are separated in time by at least the tuning delay if tunable transmitters are used (similar to I-TDMA). The point-to-point (home) channels of PTP pose a greater challenge for T-TDMA implementation. For optimal channel control, PTP may require as many tokens and token paths as there are data channels, as pointed out in Section 4.2. Considering that point-to-point channels do not allow tokens to be implemented in-band (method (1)), T-TDMA may not be desirable for PTP. In contrast, the implementation of TDMA-C and T-TDMA-C on PTP is straightforward. The home channels are reserved using reservation packets broadcast on the TDMA-controlled or T-TDMA-controlled control channel. The token for the control channel can be implemented in the same way as in BUS (method (1) or method (3)). The only difference between TDMA on Hybrid and the others is that the home channels and the broadcast channels can be slotted using two different slot sizes, depending on the packets they carry.
As with PTP, T-TDMA can be very costly for the home channels, but for the broadcast channels, T-TDMA can be applied similarly to BUS. TDMA-C and T-TDMA-C are applied to Hybrid in the same way as in PTP, and the broadcast channels are treated like the home channels. The following section examines how TDMA, T-TDMA, TDMA-C, and T-TDMA-C perform when applied to shared-memory multiprocessor interconnects based on the three broadcast-and-select network configurations.

6. PERFORMANCE The efficiency of channel access protocols can be determined by the amount of channel access control overhead that is incurred. Another indicator of protocol efficiency is channel utilization, which shows how effectively communication loads are allocated to available bandwidth. Using these two performance metrics and packet latency, we compare the performance of TDMA and TDMA-C with T-TDMA and T-TDMA-C, respectively. The three performance metrics are defined as follows.

average channel utilization = (1 / (M · C_active)) Σ_{i=1..M} C(i)_occupied    (1)


Average channel utilization is determined by Eq. (1), where M is the number of channels, C(i)_occupied is the aggregate time period in which channel i is occupied for transmission, and C_active represents the time period in which the channels are active during program execution.
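Eq. (1) transcribes directly into code; the occupancy values below are made up for illustration:

```python
def average_channel_utilization(occupied, active_time):
    """Eq. (1): aggregate occupied time across the M channels,
    normalized by M times the common active period C_active."""
    m = len(occupied)
    return sum(occupied) / (m * active_time)

# Four channels busy for 30, 10, 0, and 20 time units out of 100:
u = average_channel_utilization([30, 10, 0, 20], active_time=100)
print(u)  # 0.15
```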

TDMA overhead = (1 / N) Σ_{i=1..N} Σ_{all packets} T(i)_overhead    (2)

In this evaluation, control overhead takes the form of either TDMA overhead or token overhead. TDMA overhead is defined here by Eq. (2), where N is the number of nodes and T(i)_overhead is the TDMA overhead of node i, either for data channel access with TDMA or for control channel access with TDMA-C. The inner summation sums the T_overhead of all the packets generated from node i; the result is then averaged over the N nodes.

token overhead = (1 / N) Σ_{i=1..N} Σ_{all packets} T(i)_toverhead    (3)

Token overhead is the average aggregate token overhead defined by Eq. (3), where T(i)_toverhead, the overhead at node i, is determined either by the speed of the token with a dedicated path or by the additional packet latency with the token incorporated into a data packet, as explained in Section 4.1.
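Eqs. (2) and (3) share the same shape: an aggregate of per-packet overheads at each node, averaged over the N nodes. A direct transcription with toy per-packet values:

```python
def mean_aggregate_overhead(per_node_overheads):
    """Eqs. (2)/(3): sum each node's per-packet overheads
    (T(i)_overhead or T(i)_toverhead), then average over the N nodes."""
    n = len(per_node_overheads)
    return sum(sum(pkts) for pkts in per_node_overheads) / n

# Two nodes, each with a few packets' worth of overhead (in cycles):
print(mean_aggregate_overhead([[10, 6, 2], [8, 12]]))  # 19.0
```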

packet latency = (1 / N) Σ_{i=1..N} Σ_{all packets} ( T(i)_overhead + T(i)_contention + T(i)_transmission )    (4)

Packet latency is measured by Eq. (4), where T(i)_overhead is the control overhead in the form of either TDMA overhead or token overhead. T(i)_contention and T(i)_transmission denote the contention and transmission latency of a packet generated from node i, respectively. Here, T(i)_contention is defined as the time period in which a packet has to wait for transmission after arbitration is completed (see Fig. 3). Note that T(i)_contention is not applicable when TDMA is used. 6.1. Evaluation Methodology Previous channel access control protocol research has focused on measuring performance through analytical modeling using synthetic network traffic. Our evaluation methodology differs from previous studies in two essential ways. First, network traffic is generated from actual shared-memory parallel applications using a program-driven multiprocessor simulation platform with detailed architectural specifications. As a result, the network traffic exhibits realistic timing and correlation among packets, and the amount of traffic generated is independent of network parameter settings. Second, key network parameters are set to provide adequate bandwidth capacity for the multiprocessors to achieve reasonable performance regardless of the channel


access control protocol used. Providing adequate network bandwidth capacity is an important condition for accurately measuring channel access control protocol performance, since the metrics used to measure performance are all sensitive to network bandwidth capacity.

The cache-coherent shared-memory multiprocessor and network simulator is based on MINT [26], a program-driven multiprocessor simulation platform. A simplified node consists of a processor, a single-level cache, a main memory segment, a local bus, and a network interface which includes the channel access control function, as shown in Fig. 1. For cache coherence, a snoopy protocol is assumed for BUS, a full-map directory protocol for PTP, and the SPEED protocol for Hybrid. The Stanford SPLASH [22] shared-memory applications used in this evaluation are Barnes-Hut (BA), Cholesky (CH), LocusRoute (LR), and MP3D (MP).

Each network is assumed to have eight 2 Gbps data channels connecting 64 nodes, with one control channel of the same capacity for reservation. PTP has one tunable transmitter with 10 ns tuning delay [12] and at most two fixed receivers per node: one for the assigned home channel (i.e., the home channel as a unique receiving channel) and the other for the control channel. Hybrid has one more receiver per node for the broadcast channel. BUS has two arrays of transmitters and receivers matching the number of data and control channels.

Network traffic is divided into two or three classes depending on the channel access protocol used: short packets, long packets, and control packets (if reservation is used). Channels are slotted on the boundary of long packets containing the 64-byte cache memory block [22] for both TDMA and T-TDMA. The broadcast channel of Hybrid is slotted on the boundary of small packets containing messages such as cache invalidations, memory block requests, and acknowledgments, about 1/8th the size of the long packet containing a memory block.
The control channel for TDMA-C and T-TDMA-C is slotted on the boundary of control packets, which are about 1/4th the size of the long packet. Three types of token delays are considered to evaluate the impact of token delay on the performance of token-based protocols. The first is an absolute asynchronous token delay obtainable with a dedicated optical path. The delay of the asynchronous token on a dedicated optical path between adjacent nodes is within a single network cycle (5 ns assuming 200 MHz operation) [21], which allows slot recovery in a single cycle (i.e., single-cycle token delay). The other two are relative token delay figures calculated as the ratio of token delay to slot size. Accordingly, a token delay of 25% indicates that the token delay is 25% of the size of the slot, so that about 75% of an unused slot (in terms of cycles) can be recovered. Hence, 25 and 50% token delays simulate less than optimal token mechanisms. By way of comparison, the single-cycle token delay expressed as a relative token delay is about 7% of a T-TDMA-C control slot. For T-TDMA, a single cycle amounts to 2% of the data slot on the home channels of PTP and Hybrid, and to 15% of the data slot of BUS and the broadcast channel of Hybrid. We estimate that a token delay of about 50% may be the break-even point over which the overhead of token implementation outweighs the performance improvement.
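The relationship between token delay and recoverable slot space described above can be sketched as follows (the slot lengths in cycles are illustrative values chosen to match the ratios quoted in the text, not figures taken from the paper):

```python
def relative_token_delay(token_cycles, slot_cycles):
    """Token delay expressed as a fraction of the slot size."""
    return token_cycles / slot_cycles

def recoverable_fraction(token_cycles, slot_cycles):
    """Roughly the fraction of an unused slot that the token
    mechanism can recover for other nodes."""
    return max(0.0, 1.0 - relative_token_delay(token_cycles, slot_cycles))

# Single-cycle token: one 5 ns network cycle at 200 MHz.
CONTROL_SLOT = 14    # illustrative: 1 cycle is ~7% of a control slot
HOME_DATA_SLOT = 50  # illustrative: 1 cycle is ~2% of a home-channel data slot

print(f"{relative_token_delay(1, CONTROL_SLOT):.0%}")  # -> 7%
# A 25% token delay leaves about 75% of an unused slot recoverable:
print(f"{recoverable_fraction(0.25 * HOME_DATA_SLOT, HOME_DATA_SLOT):.0%}")  # -> 75%
```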


6.2. Simulation Results

From a practical point of view, the use of dedicated tokens and paths with T-TDMA for all the individual point-to-point channels of PTP and Hybrid is not feasible. Hence, T-TDMA on PTP and Hybrid is not considered in the first two cases. We first present results for topology and protocol combinations that either require only one dedicated token path or allow tokens to share data channels (i.e., broadcast channels). Accordingly, T-TDMA-C can be implemented for all three topologies, while T-TDMA can be implemented on BUS without violating the criteria. A direct comparison is made within the same class of protocols (i.e., TDMA versus T-TDMA and TDMA-C versus T-TDMA-C).

6.2.1. Channel utilization. Figure 5 plots the channel utilization of T-TDMA-C and T-TDMA with 25 and 50% token delay and of basic TDMA-C and TDMA, normalized to those of T-TDMA-C and T-TDMA with the single-cycle token delay (i.e., the optimal 100% normalized utilization), respectively. Each bar in the graphs represents three separate sets of data overlapped on top of each other for comparison: (1) the basic TDMA or TDMA-C, (2) T-TDMA or T-TDMA-C with token delay of 25%, and (3) token delay of 50%. With 50% token delay, T-TDMA-C can increase the channel utilization by 40% up to 65% (MP of Hybrid) over TDMA-C, except for LR (~20%). Furthermore, T-TDMA-C with 25% token delay can increase the channel utilization by 75% (BA of BUS) up to 120% (MP of Hybrid) over TDMA-C, except for LR (~25%). With single-cycle token delay, T-TDMA-C is able to increase the channel utilization by 225% (BA of BUS) up to 350% (MP of Hybrid), except for LR (~35%). T-TDMA shows a similar increase in channel utilization over basic TDMA except

FIG. 5. Channel utilization of TDMA, TDMA-C, T-TDMA, and T-TDMA-C with varying token delay normalized to T-TDMA and T-TDMA-C with single cycle token delay, respectively.


FIG. 6. Token overhead (the left bar) and packet latency (the right bar) of T-TDMA and T-TDMA-C with varying token delay normalized to TDMA and TDMA-C, respectively.

that LR shows a better performance improvement (~40%) with T-TDMA than with T-TDMA-C.

6.2.2. Control overhead and packet latency. Figure 6 plots the control overhead (the left bar) and the packet latency (the right bar) normalized to the basic TDMA and TDMA-C (i.e., the worst case). T-TDMA-C with single-cycle token delay is capable of reducing control overhead and packet latency by up to ~65% (BA of BUS) by recovering unused slots in a single network cycle. T-TDMA-C with token delay of 25% can reduce latency by up to 50% (BA of Hybrid). Even with 50% token delay, the token is able to reduce the latency by up to 35% (BA of Hybrid). T-TDMA achieves a comparable latency reduction over TDMA.

FIG. 7. Channel utilization, TDMA overhead, and packet latency of TDMA normalized to T-TDMA with single cycle token delay on PTP and Hybrid.


6.2.3. TDMA on point-to-point networks. We estimate that the performance of T-TDMA with single-cycle token delay gives the upper bound for static preallocation-based protocol performance for any given traffic load. Figure 7 shows the channel utilization, TDMA overhead, and packet latency of TDMA compared to T-TDMA with single-cycle token delay on PTP and Hybrid. Although T-TDMA on both PTP and Hybrid requires considerable token resources and thus is considered infeasible in practice, this comparison gives some indication of how TDMA performs relative to the ideal case of T-TDMA on these network configurations. Results show that the channel utilization is as low as 6% (MP3D of Hybrid) and that the packet latency can be reduced by up to 96% (BA of Hybrid). Given the low achievable channel utilization and large packet latency, TDMA does not appear practical for latency-critical applications such as shared-memory multiprocessor interconnections.

7. DISCUSSION

While channel access protocol studies based on analytical modeling can offer performance estimation for a broad spectrum of applications, this study seeks to provide an application-specific performance analysis of channel access protocols designed to meet the needs of shared-memory multiprocessor communication traffic, through simulation using program-driven benchmarks. The results confirm our intuition that TDMA overhead contributes considerably to packet latency. A noticeable trend in our results is that TDMA overhead dominates packet latency for both TDMA and TDMA-controlled reservation protocols for the given cache block size of 64 bytes. On average, TDMA overhead accounts for more than 75% of the packet latency. This trend is evident in Fig. 6, as the reduction of TDMA overhead by using tokens results in almost the same rate of packet latency reduction.
It is also observed that the relative performance of T-TDMA increases with increasing block size (e.g., 256 byte cache block) since the amount of unused slot space that a token can recover increases. In regard to TDMA-based reservation, TDMA overhead becomes more dominant as channel bandwidth increases. Consequently, the effectiveness of tokens in reducing packet latency is significant. In contrast to T-TDMA, however, the impact of reservation overhead (with or without tokens) on the packet latency decreases as the size of the packet (cache block) increases. Therefore, the performance gain with T-TDMA-C decreases with increasing cache block size if the size of reservation messages remains unchanged. Our evaluation indicates that T-TDMA-C with the optical asynchronous token achieves the lowest packet latency compared to other protocols for the network configurations analyzed. Three factors contribute to this result. First, control slots are smaller than data slots, which results in smaller TDMA overhead (see Section 3.1). Second, although control packets are also subject to T-TDMA, T-TDMA-C allows multiple reservations on different channels with a single control packet, effectively reducing the impact of TDMA overhead on packet latency. Third, T-TDMA-C manages bandwidth more efficiently by allocating slots that match the size of packets and avoids slot segmentation.
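The first factor above — smaller control slots yield smaller TDMA overhead — can be illustrated with a simplified model (an assumption of this sketch, not a formula from the paper): under static preallocation, a packet arriving at a random time waits on average about half a TDMA frame of N slots for its node's assigned slot, so shrinking the slot shrinks the expected overhead proportionally.

```python
def expected_tdma_wait(num_nodes, slot_cycles):
    """Simplified mean slot-synchronization wait under static
    preallocation: about half a frame (num_nodes slots)."""
    return 0.5 * num_nodes * slot_cycles

N = 64                        # nodes, as in the evaluated networks
DATA_SLOT = 40                # hypothetical data-slot length in cycles
CONTROL_SLOT = DATA_SLOT / 4  # control slots are ~1/4th the data-slot size

print(expected_tdma_wait(N, DATA_SLOT))     # -> 1280.0 cycles
print(expected_tdma_wait(N, CONTROL_SLOT))  # -> 320.0 cycles
```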


The significance of our results lies in the fact that the evaluation is performed under very realistic network assumptions that allow the multiprocessors to perform realistically regardless of the access control protocol used. In other words, given the results showing a high performance gain for networks of the same capacity, an efficient access control can make a network of considerably less bandwidth capacity perform comparably with networks of higher bandwidth capacity but less efficient access controls.

The results for LR and MP highlight the importance of network settings in channel access protocol performance analysis in a slightly different way. LR shows a relatively small performance improvement in both utilization and latency with T-TDMA-C. The channel utilization of MP shows a significant improvement with token-based protocols, which does not necessarily translate into latency reduction. We found that the control channel is more heavily loaded with both LR and MP, resulting in a control channel bottleneck. In an actual implementation, the control channel would have been set to provide sufficient bandwidth capacity to meet the reservation demand of the applications. In such a case, we anticipate that T-TDMA-C can offer a performance improvement with LR and MP similar to that of the other benchmarks.

Now let us consider how the results with T-TDMA-C compare relative to FatMAC. We estimate that FatMAC with no dedicated control channel will perform no better than T-TDMA-C, given that the TDMA overhead to access the dedicated control channel for reservation dominates packet latency. T-TDMA-C certainly allocates at least one channel solely for access control purposes, but the results of our current and previous study [15] indicate that a moderate increase or decrease (<20%) in the number of channels has less impact on performance than channel access protocols for networks supporting the point-to-point communication offered by WDM (e.g., PTP and Hybrid).
In other words, the performance advantage gained by using a dedicated control channel outweighs the loss of communication bandwidth, which would otherwise not be managed efficiently. Perhaps the biggest issue with token-based protocols is that dedicated fast token paths require additional transmitters and receivers system-wide. The cost of additional hardware for the token mechanism is a design trade-off. The results of this study strongly suggest that for network applications where minimizing latency is of highest priority, a token-based protocol is not only feasible, but also preferable.

8. CONCLUSION

This study presents a token-based TDMA protocol targeted for optical wavelength division multiplexed broadcast-and-select networks for DSMs. The study is motivated by our experience with previously proposed TDMA and TDMA-controlled reservation-based protocols in the development of an optically interconnected shared-memory multiprocessor. That experience revealed that these protocols were inefficient and that the processors underutilized the high bandwidth and low latency of the underlying optical interconnect. Further empirical evaluation shows that TDMA and TDMA-controlled reservation-based protocols typically suffer from excessive slot synchronization latency in dealing with shared-memory


multiprocessor network traffic. It also indicates that a large portion of the latency consists of unused slot space due to the static slot preallocation of TDMA. The proposed token-based TDMA reduces this latency due to unused slot space. A fast token is used in conjunction with TDMA to achieve recovery and dynamic reallocation of unused slot space by allowing nodes that do not need their assigned slots to pass channel access on to other nodes needing to communicate. Given the typical communication behavior of DSMs, such as relatively small granularity and high correlation among packets, the proposed token-based TDMA has tremendous potential to reduce packet latency by minimizing slot synchronization latency due to unused slot space.

The effectiveness of the new token scheme is evaluated on both TDMA and TDMA-controlled reservation-based protocols using three broadcast-and-select network configurations considered suitable for shared-memory multiprocessor interconnection. Simulation results indicate that the performance of DSMs interconnected by broadcast-and-select networks can be significantly improved: packet latency can be reduced by up to 70% and up to 65% by using token-based TDMA and token-based TDMA-controlled reservation protocols, respectively. This study further reveals that token-controlled reservation-based protocols have a clear advantage over basic TDMA and token-based TDMA in dealing with the communication behavior of DSMs under the given network constraints. Obviously, the hardware overhead of implementing these alternative high-performance demand assignment protocols is not trivial. However, token-controlled reservation-based protocols are highly recommended for latency-critical applications such as DSMs as long as hardware design constraints permit.

REFERENCES

1. K. Bogineni and P. W. Dowd, A collisionless multiple access protocol for a wavelength division multiplexed star-coupled configuration: architecture and performance analysis, J. Lightwave Technol. 10, 11 (1992), 1688–1699.
2. K. Bogineni, K. M. Sivalingam, and P. W. Dowd, Low complexity multiple access protocols for wavelength-division multiplexed photonic networks, IEEE J. Selected Areas Commun. 11, 4 (May 1993), 590–604.
3. C. A. Brackett, Dense wavelength division multiplexing networks: principles and applications, IEEE J. Selected Areas Commun. 8 (August 1990), 964–984.
4. E. V. Carrera and R. Bianchini, ``NetCache: A Network/Cache Hybrid for Multiprocessors,'' Tech. Rep. ES-45597, COPPE Systems Engineering, Federal University of Rio de Janeiro, November 1997.
5. E. V. Carrera and R. Bianchini, ``OPTNET: A Cost-Effective Optical Network for Multiprocessors,'' Tech. Rep. ES-45797, COPPE Systems Engineering, Federal University of Rio de Janeiro, December 1997.
6. P. Dowd, J. Perreault, J. Chu, D. C. Hoffmeister, R. Minnich, D. Burns, F. Hady, Y.-J. Chen, M. Dagenais, and D. Stone, LIGHTNING network and systems architecture, J. Lightwave Technol. 14, 6 (June 1996), 1371–1386.
7. P. W. Dowd, K. Bogineni, K. A. Aly, and J. Perreault, Hierarchical scalable photonic architectures for high-performance processor interconnection, IEEE Trans. Comput. 42, 9 (September 1993), 1105–1120.


8. P. W. Dowd and J. Chu, Photonic architectures for distributed shared memory, in ``Proc. First International Workshop on MPPOI,'' pp. 151–161, April 1994.
9. S. J. Eggers and R. H. Katz, A characterization of sharing in parallel programs and its application to coherency protocol evaluation, in ``Proc. 15th International Symposium on Computer Architecture,'' pp. 372–383, 1988.
10. J. Feehrer, J. R. Sauer, and L. Ramfelt, Design and implementation of a prototype optical deflection network, in ``Proc. ACM SIGCOMM'94,'' August 1994.
11. K. Ghose, R. K. Horsell, and N. K. Singhvi, Hybrid multiprocessing using WDM optical fiber interconnections, in ``Proc. First International Workshop on MPPOI,'' April 1994.
12. B. S. Glance, J. M. Wiesenfeld, U. Koren, and R. W. Wilson, New advances on optical components needed for FDM optical networks, IEEE Photon. Tech. Lett. 5, 10 (October 1993), 1222–1224.
13. M. Goodman, The LAMBDANET multiwavelength network: architecture, applications, and demonstrations, IEEE J. Selected Areas Commun. 8, 6 (August 1990), 995–1004.
14. A. Gupta and W.-D. Weber, Cache invalidation patterns in shared-memory multiprocessors, IEEE Trans. Comput. 41, 7 (July 1992), 794–810.
15. J.-H. Ha and T. M. Pinkston, SPEED DMON: cache coherence on an optical multichannel interconnect architecture, J. Parallel Distrib. Comput. 41, 1 (February 1997), 78–91.
16. T. S. Jones and A. Louri, Media access protocols for a scalable optical interconnection network, in ``Proc. 1998 International Conference on Parallel Processing,'' pp. 304–311, August 1998.
17. A. V. Krishnamoorthy, J. E. Ford, K. W. Goossen, J. W. Walker, B. Tseng, S. P. Hui, J. E. Cunningham, W. Y. Jan, T. K. Woodward, M. C. Nuss, R. G. Rozier, F. E. Kiamilev, and D. Miller, The AMOEBA chip: an optoelectronic switch for multiprocessor networking using dense WDM, in ``Proc. Third International Conference on MPPOI,'' pp. 94–100, October 1996.
18. A. Louri and R. Gupta, Hierarchical optical ring interconnection (HORN): a scalable interconnection network for multiprocessors and massively parallel systems, Appl. Optics 36, 2 (January 1997), 430–442.
19. B. Mukherjee, WDM-based local lightwave networks, Part I: single-hop systems, IEEE Network (May 1992), 12–27.
20. B. Mukherjee, WDM-based local lightwave networks, Part II: multihop systems, IEEE Network (July 1992), 20–31.
21. T. M. Pinkston, M. Raksapatcharawong, and C. Kuznia, An asynchronous optical token smart-pixel design based on hybrid CMOS/SEED integration, in ``LEOS 1996 Summer Topical Meeting on Smart Pixels Technical Digest,'' pp. 40–41, IEEE/LEOS, August 1996.
22. J.-P. Singh, W.-D. Weber, and A. Gupta, SPLASH: Stanford parallel applications for shared-memory, Comput. Architect. News 20, 1 (March 1992), 5–44.
23. K. M. Sivalingam and P. W. Dowd, A multilevel WDM access protocol for an optically interconnected multiprocessor system, J. Lightwave Technol. 13, 11 (November 1995), 2152–2167.
24. K. M. Sivalingam and J. Wang, Media access protocols for WDM networks with on-line scheduling, J. Lightwave Technol. 14, 6 (June 1996), 1278–1286.
25. A. S. Tanenbaum, ``Computer Networks,'' Prentice-Hall, Englewood Cliffs, NJ, 1988.
26. J. E. Veenstra and R. J. Fowler, MINT: a front end for efficient simulation of shared-memory multiprocessors, in ``Proc. Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS),'' pp. 201–207, January 1994.

JOON-HO HA received the B.S.E.E. and the M.S. in electrical engineering (telecommunications) from Yonsei University in Seoul, Korea in 1985 and 1987, respectively. He also received his M.S. and Ph.D. in computer engineering from the University of Southern California in 1991 and 1999, respectively. In 1999, Dr. Ha joined the Server Architecture Laboratory at Intel Corporation in Beaverton, Oregon as a senior systems engineer. His current research focuses on the design of cost-effective, high-performance, next-generation server architectures, which are expected to provide scalable cache coherence, interconnection, and I/O capabilities.


TIMOTHY MARK PINKSTON completed his B.S.E.E. at The Ohio State University in 1985 and his M.S. and Ph.D. in electrical engineering at Stanford University in 1986 and 1993, respectively. Prior to joining the University of Southern California in 1993, he was a member of technical staff at Bell Laboratories, a Hughes doctoral fellow at Hughes Research Laboratory, and a visiting researcher at IBM T. J. Watson Research Laboratory. Currently, Dr. Pinkston is an associate professor in the Computer Engineering Division of the EE-Systems Department at USC and heads the SMART Interconnects Group. His current research interests include the development of deadlock-free adaptive routing techniques and optoelectronic network router technologies for achieving high-performance communication in parallel computer systems: massively parallel processor (MPP) and network of workstations (NOW) systems. Dr. Pinkston has authored over forty refereed technical papers and has received numerous awards, including the Zumberge Fellow Award, the National Science Foundation Research Initiation Award, and the National Science Foundation Career Award. Dr. Pinkston is a member of the ACM, IEEE, and OSA. He has also been a member of the program committee for several major conferences (ICPP, IPPS/SPDP, SC, PCRCW, OC, MPPOI, IEEE LEOS, WOCS) and was the program co-chair for MPPOI'97. Currently, he serves as an associate editor for IEEE Transactions on Parallel and Distributed Systems.