Combinatorial performance modelling of toroidal cubes

Combinatorial performance modelling of toroidal cubes

Available online at www.sciencedirect.com Journal of Systems Architecture 54 (2008) 241–252 www.elsevier.com/locate/sysarc Combinatorial performance...

265KB Sizes 2 Downloads 50 Views

Available online at www.sciencedirect.com

Journal of Systems Architecture 54 (2008) 241–252 www.elsevier.com/locate/sysarc

Combinatorial performance modelling of toroidal cubes H. Hashemi-Najafabadi, Hamid Sarbazi-Azad

*

IPM School of Computer Science, Tehran, Iran Received 19 February 2007; received in revised form 7 June 2007; accepted 11 June 2007 Available online 30 June 2007

Abstract Although several analytical models have been proposed in the literature for deterministic routing in different interconnection networks, very few of them have considered the effects of virtual channel multiplexing on network performance. This paper proposes a new analytical model to compute message latency in a general n-dimensional torus network with an arbitrary number of virtual channels per physical channel. Unlike the previous models proposed for toroidal-based networks, this model uses a combinatorial approach to consider all different possible cases for the source–destination pairs, thus resulting in an accurate prediction. The results obtained from simulation experiments confirm that the proposed model exhibits a high degree of accuracy for various network sizes, under different operating conditions, compared to a similar model proposed very recently, which considers virtual channel utilization in the k-ary n-cube network.  2007 Elsevier B.V. All rights reserved. Keywords: Interconnection networks; n-D Tori; k-ary n-cubes; Deterministic wormhole routing; Virtual channels; Analytical modelling; Combinatorial performance analysis

1. Introduction The topology, routing algorithm and switching method are the most important factors determining the performance of an interconnection network. Network topology determines the way in which the processing elements of the network are interconnected. Practical multicomputers [4,17,20,21] have widely employed torus networks for low-latency and high-bandwidth inter-processor communication

* Corresponding author. Address: Sharif University of Technology, Tehran, Iran. Tel.: +98 21 22280332; fax: +98 21 22828687. E-mail addresses: [email protected] (H. HashemiNajafabadi), [email protected] (H. Sarbazi-Azad).

[12]. The n-D torus has an n-dimensional grid structure with ki nodes in each dimension such that every node is connected to its neighbouring nodes in each dimension by direct channels. k-ary n-cubes and hypercubes are restricted cases of the n-dimensional torus topology. The switching method determines the way messages visit intermediate routers. Owning to its low buffer size, wormhole switching (usually referred to as wormhole routing) has been widely employed in multicomputers. Another advantage of wormhole routing is that, in the absence of blocking, message latency is almost independent of the distance between source and destination. In this switching technique, messages are broken into flits, each of a few bytes, for transmission and flow control. The header flit, containing routing information, is used

1383-7621/$ - see front matter  2007 Elsevier B.V. All rights reserved. doi:10.1016/j.sysarc.2007.06.004

242

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

to govern routing and the remaining data flits follow in a pipelined fashion. If the header is blocked, the other flits are blocked in situ. The advantage of this technique is that it reduces the impact of message distance on the latency under light traffic. However, as network traffic increases messages may experience large delays to cross the network due to the chain of blocked channels [19]. To overcome this, the flit buffers associated with a given physical channel are organised into several virtual channels [10], each representing a ‘‘logical’’ channel with its own buffer and flow control logic. Virtual channels are allocated independently to different messages and compete with each other for the physical bandwidth. This decoupling allows messages to bypass each other in the event of blocking, using network bandwidth that would otherwise be wasted. Routing algorithms establish the path between the source and destination nodes for message transmission. Routing can be deterministic or adaptive. In deterministic routing, messages with the same source and destination always traverse the same path. This form of routing results in a simple router implementation [13] and has been used in many practical multicomputers [17,20,21]. With adaptive routing, the path taken by a message is affected by the traffic on network channels. Simulation is an approach to evaluate the performance of an interconnection network for a specific configuration. But, depending on the complexity of the interconnection network and resources available, this technique may be too time-consuming to perform. Another approach is utilization of an analytical model of the system. An appropriate analytical model can predict the performance of a specific interconnection network structure in a fraction of the time simulation would take. Thus, it is justified to be in pursuit of accurate analytical models for the performance of different network topologies, and it is for this reason that many such models have been suggested in the literature and their accuracy verified. Analytical models of networks base on wormhole switching and deterministic routing have been reported in the past [1–3,5,7,9,11,14,15,18,22,23]. There have however been few models reported in the literature that have considered the performance of such networks with any number of virtual channels per physical channel. Table 1 illustrates all analytical models proposed for deterministic routing along with their features. Of these models, only the last one [23], can capture the effect of virtual channel multiplexing for any number of virtual

channels per physical channel. The model proposed by Draper and Ghosh [11] considers only the use of a minimum requirement of virtual channels (two virtual channels) to ensure deadlock freedom according to the methodology proposed in [8], and cannot deal with any arbitrary number of virtual channels. When the number of virtual channels is large (>2), however, the effect of virtual channels on network performance cannot be ignored since this can cause the analytical model to produce inaccurate predictions of message latency, especially when the network operates under heavy traffic loads. This is because the multiplexing of virtual channels increases the latency seen by an individual message inside the network as virtual channels share the bandwidth of the physical channel in a timemultiplexed manner. The model, proposed very recently in [23], uses a different approach and has the main advantage of being simpler to derive than the existing models including Draper and Ghosh’s model [11]. Moreover, the model can support both unidirectional and bidirectional k-ary n-cubes with any number of virtual channels. However, the accuracy of the model is its main drawback especially near high traffic region. In this paper, a new combinatorial performance model1 is proposed considering all the potential source–destination node pairs when computing message latency. Thus, the proposed model, while keeping all the advantages of the model proposed in [23] compared to those proposed in [2,7,9,11], highly improves the accuracy of it for predicting the average message latency in the network, as will be shown in following sections. The rest of the paper is organised as follows. Section 2 describes the n-D torus, its router structure, and deterministic routing algorithm using virtual channels. Section 3 describes the proposed analytical model to compute the mean message latency in such interconnection networks, and extends the model for special cases. Section 4 validates the proposed model and compares its accuracy with the one proposed in [23]. Finally, in Section 5, we conclude this study. 2. The n-D Torus: topology, switch structure, and routing Qn In an n-D torus N ¼ i¼1 k i nodes are arranged in n-dimensions, where ki is called the radix of the 1

An early version of this work was appeared in [22].

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

243

Table 1 Reported analytical models for k-ary n-cubes Model

Network

Dally’s [9], 1990 Adve and Vernon’s [2], 1994 Draper and Ghosh’s [11], 1994 Cinciani, Colajanni and Paolucci’s [7], 1997 Sarbazi-Azad, Khonsari, Ould-khaoua [22], 2003

Unidirectional Unidirectional Unidirectional Unidirectional Unidirectional

a

Virtual channels considered k-ary n-cubes and Bidirectional k-ary n-cubes k-ary n-cubes and Bidirectional k-ary n-cubes and Bidirectional k-ary n-cubes

No No Noa No Yes

Only two virtual channels to ensure deadlock freedom. The model is not extendable for any number of virtual channels V > 2.

ith dimension, 1 6 i 6 n. Links can be either bidirectional or unidirectional. Each node can be identified by an n-digit mixed-radix address (a1, a2, . . . , an). The ith digit of the address vector, ai, represents the node position in the ith dimension. Nodes with address (a1, a2, . . . , an) and (b1, b2, . . . , bn) in a bidirectional n-dimensional torus are connected if and only if there exists i (1 6 i 6 n), such that ai = (bi ± 1) mod ki and aj = bj for i 5 j. Thus, each node is connected to two neighbouring nodes in each dimension. In the unidirectional torus, node (a1, a2, . . . , an) is linked to node (b1, b2, . . . , bn) if and only if there exists i (1 6 i 6 n), such that ai = (bi + 1) mod ki and aj = bj for 1 6 j 6 n, i 5 j. The k-ary n-cube is a restricted case of the torus network in which the radices of all dimensions are the equal to k. When k = 2, the k-ary n-cube topology collapses to the well-known hypercube topology. Each node, in an n-D torus, consists of a processing element (PE) and switching element (SE) or router, as shown in Fig. 1. The PE contains a processor and some local memory. The router has 2n + 1 input and 2n + 1 output channels (n + 1 input and n + 1 output channels in unidirectional network). A node is connected to its neighbouring nodes through 2n inputs and 2n output channels (n input and n output channels in unidirectional network); there are two channels (only one in a unidirectional network) for each dimension corresponding to the positive and negative direction connecting the current node to the next and previous nodes, respectively. The remaining channels are used by the PE to inject/eject messages to/from the network. Messages generated by the PE are injected into the network through the injection channel. Messages at the destination node are transferred to the PE through the ejection channel. The router contains flit buffers for each input virtual channel. The input and output channels are connected by a crossbar switch that can simultaneously connect multiple input channels to multiple output channels given that there is no contention over the output channels.

Router Dimension

1+

Dimension

1+

Dimension

1-

Dimension

1-

Dimension

2+

Dimension

2+

Dimension

2-

Dimension

2-

Dimension

n+

Dimension

n+

Dimension

n-

Dimension n

-

Injection Channel

Ejection Channel

PE Fig. 1. The node structure in the bidirectional n-dimensional torus.

The concept of virtual channels has been first introduced in the context of the design of deadlock-free deterministic routing algorithms [8,10]. The authors in [8] have shown that the wrap-around connections in torus-based networks can lead to deadlock situations due to the cyclic dependencies that can occur within a dimension. They have proposed the use of an additional virtual channel to transform dependency ‘‘cycles’’ into ‘‘spirals’’ to avoid deadlock. Although two virtual channels per physical channel are sufficient to guarantee deadlock-free routing, the use of more virtual channels

244

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

improves network performance by increasing throughput and reducing message blocking due to chained channel blocking, as pointed out in [6,10,12]. There are several ways to route messages across virtual channels when more than two virtual channels are available. In [23], two ways to organise virtual channels are discussed when more than two virtual channels are associated to each physical channel. Let VC = {v1, v2, . . . , vL} be the set of L virtual channels associated to a physical channel. Adapting the same approach taken by Dally [8] to ensure deadlock freedom in torus using two virtual channels, we can split these virtual channels into two sets of virtual channels VC 1 ¼ fv1 ; v2 ; . . . ; vL1 g and VC 2 ¼ fvL1 þ1 ; vL1 þ2 ; . . . ; vL g, where 1 6 L1 6 L  1. Let the virtual channels in VC1 be referred to as the ‘‘low virtual channels’’ and those in VC2 as the ‘‘high virtual channels’’. In this organisation (referred to as ORG.I) the routing algorithm is as follows. Assume that the message header is at a current node with address C = c1 c2    cn and the message is destined to the destination node D = d1 d2    dn. Dimensions are passed one after one according to a fixed order. At each dimension i, 1 6 i 6 n, if ci < di, the message can be routed through one of the high virtual channels. Otherwise, the message is routed on any of the low virtual channels. This algorithm merely extends the number of virtual channels, in each of theses two class of virtual channels, from one, as suggested in [8], to respectively L1 and L  L1 virtual channels and is therefore deadlock free. An organisation of virtual channels (referred to as ORG.II) based on Duato’s methodology [12] in the context of deterministic routing can also be considered. Let the virtual channels of a given physical channel be split into two sets: VC1 = {v3, v4, . . . , vL} and VC2 = {v1, v2}. Since the routing is deterministic, messages cross dimensions in a fixed predefined order. At each routing step in dimension i, a message can choose any of the L-2 virtual channels in VC1. If all these virtual channels are busy, the message crosses v1 if ci < di; otherwise it crosses v2. Adopting the same terminology as in [12], the virtual channels v1 and v2 represent ‘‘escape channels’’. Since this algorithm is a restricted version of Duato’s methodology, it is deadlock free. 3. The analytical model We base our discussion, when describing the derivation of the model, on ORG.II as it exhibits higher

performance compared to ORG.I, as discussed in [23]. We then briefly mention the necessary changes to adapt the model to ORG.I. In what follows, we first outline the assumptions made in the analysis. We then describe the derivation of the model and its validation. The model is initially described in the context of the unidirectional torus and then, in Section 3.2, the model for unidirectional k-ary n-cube is considered and completed. Only a few simple modifications are required to adapt it for the bidirectional case; the model for the bidirectional k-ary n-cube is discussed in Section 3.4. As most real machines are of bidirectional type, the final validation will be based on the bidirectional k-ary n-cubes. 3.1. Assumptions The model is based on the following assumptions, which are widely used in the literature [1– 3,5,7,9–11,14–16,18,22,23]. (a) The network is an n-dimensional torus with radix k1 for dimension 1, k2 for dimension 2, and so on. (b) Nodes generate traffic independently of each other, and follow a Poisson process, with a mean rate of kg messages/cycle. Furthermore, message destinations are uniformly distributed across the network nodes. (c) Message length is fixed (M flits). Each flit is transmitted in tc cycles from one router to the next. (d) Messages are transferred to the local PE through the ejection channel once they arrive at their destination. (e) L virtual channels (L P 2), per physical channel are used.

3.2. Description of the general model A generated message in an n-dimensional torus Pn1 traverses m hops (where 1 6 m 6 i¼0 ðk i  1Þ in a unidirectional torus) to reach its destination. The destination node of an m-hop message can be any of C mn;ðkn1 1;kn2 1;...;k0 1Þ nodes that are m hops away from the source node, where C mn;ðkn1 ;kn2 ;...;k0 Þ is the number of different m-combinations of n distinguishable types of items with each type i consisting of ki identical items. This expression can be recursively calculated as

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

C md;ðkd1 ;kd2 ;...;k0 Þ 8 1; if m ¼ 0 > > > < 0; if m < 0 or d < 0 ¼ k d1 > P > d1;ðk ;k ;...;k Þ > C mi d2 d3 0 ; otherwise : i¼0

ð1Þ The average number of dimensions that a message traverses before reaching its destination is equal to h¼

ðk 0 1Þþðk 1X 1Þ...þðk n1 1Þ

mP m

ð2Þ

m¼1

in which Pm is the probability of the length of the path of a message being equal to m hops. In a unidirectional torus with a uniform traffic pattern we have Pm ¼

C mn;ðkn1 1;kn2 1;...;k0 1Þ k 0 k 1    k n1  1

ð3Þ

in which the numerator expression corresponds to the number of nodes at a distance of m from the source node and the denominator is the number of nodes, other than the source node, in the network. Considering that messages travel an average of h hops before reaching their destination, the average rate at which messages enter nodes of the network is equal to h times the message generation rate. On the other hand, since with unidirectional routing, the n output channels of each node are equally utilized, the arrival rate of messages to any network channel is given by kc ¼

hkg n

ð4Þ

Average message latency is defined as the average amount of time for a message to reach its destination node and all its flits to be ejected out of the network. We consider the average message latency to be a measure representative of the performance of the parallel processing system. Computation of this parameter calls for the calculation of three main factors. One is the effective network latency. Another is the average degree of virtual channel multiplexing and the last factor is the average waiting time at the source node. The average network latency is defined as the sum of the average message transfer time and average blocking delay at different dimensions (excluding the effect of multiplexing). The effective network delay is then computed by inflating the average network latency by the average virtual

245

channel multiplexing degree. The naming of factors in the following description complies closely with that of [16]. When a message needs to traverse one of the hops of dimension di, 0 6 di 6 n  1, it is delayed an average amount of time, W d i , before acquiring a virtual channel. A message is actually blocked only when all the virtual channels of its current hop are busy. The probability of l virtual channels of a hop in dimension di being busy is denoted by P d i ;l . Considering ORG.II for virtual channel allocation, the probability of a message being blocked at a hop of dimension di (called the ‘‘blocking probability’’) is given by [23] P Bi ¼ P d i ;L þ

P d i ;L1 L

ð5Þ

in which two cases are considered. The first expression ðP d i ;L Þ, corresponds to the case when all the virtual channels are busy and the second, to the case when all but one of the virtual channels are busy and that virtual channel is either v1 or v2 (as defined in Section 2), such that the message cannot be traversed by the blocked message. If a message is blocked at a hop, the message is delayed by as much time as it takes all the flits of a blocking message to finish traversing that hop. If none of the messages occupying the virtual channels terminate after traversing that hop, the blocked message will additionally be delayed by the average waiting time encountered by a blocking message in the rest of its path to its destination. Let P t;d i be the probability that a message will terminate after traversing a hop in dimension di. Since messages traverse an average of ki/2 hops in dimension di, the probability of a message terminating after traversing a specific hop of dimension di is P equal to kit;d=2i . Therefore, the probability of a message being blocked at a hop of dimension di and none of the blocking messages terminating after traversing that hop, can be denoted as  L1  L P t;d i P t;d i P d i ;L P t;d i P di ¼ 1 þ 1 P d i ;L k i =2 k i =2 L k i =2  L1 P t;d i P d i ;L1 ð6Þ þ 1 k i =2 L in which three cases are considered, each corresponding to one of the products. Enumerated from left to right, the first product corresponds to the case when L virtual channels are busy, but the message allocating one of them (v1 or v2, such that it can

246

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

not be traversed by the blocked message) does terminate in the following node. The second and third cases correspond, respectively, to when L and L  1 virtual channels are busy and none of the messages allocating these virtual channels terminate in the following node. The average waiting time of a blocked message to acquire a virtual channel at a hop of dimension di, when it considered that no other message is blocked at that hop is indicated _ by W d i . This factor can be obtained as the product of the aggregate of the average waiting time of blocking messages in each of the dimensions of the remainder of their paths and the conditional probability that none of the blocking messages terminate after traversing the channel given that the channel is already known to be blocked (resulting in P d i =P Bi Þ, plus the length of a message times the channel cycle time (tc) – to account for the time it takes for the flits of a message to be transmitted over a single channel. This is expressed in the following equation: _

W di ¼

P di P Bi

d j 1 n1 kX X

d j ¼d i

d j 1 n1 kX X

Dd i ¼

d j ¼d i

W d j P d j ;sjd ifirst þ Mtc þ W ejection

ð9Þ

s¼1

where P d j ;sjd ifirst is the probability of a message traversing at least s hops in dimension dj given that it has traversed its first hop in dimension di. The probability of a message reaching its destination after using one of the hops of dimension di, with the presumption that it does take a hop in that dimension, can be expressed as Pk0d i 1 m¼0

P t;d i ¼

Pk0n1 1 m¼0

W d j P d i ;sjd j þ Mtc

d i þ1;ðk d 11;...k d 1;k d 1Þ i 1 0

P mþ1 Cm n;ðkn1 1;...kd C mþ1

n1;ðk d

P mþ1 Cm

ð7Þ

s¼1

n1

1

1;k d 1Þ 0

1;...;k d 11;...k d 1;k d 1Þ i 0 1

ð10Þ

n1;ðk n1 1;...k d 1;k d 1Þ 0 1

C mþ1

in which

in which P d i ;sjd j is the probability of a message traversing s hops of dimension dj given that it has already traversed a hop in dimension di. W d i is the average blocking time at a hop of dimension di when it is considered that other messages may be blocked at the same hop. If freed virtual channels are granted to waiting messages on a first-come-firstserve basis (which is usually the case), W d i can be calculated as e d i ;waiting b di N W di ¼ W

the other hops of di and dimensions higher than di (calculated as a weighted average of the waiting time in higher dimensions), the transfer time of all the flits of a message over a channel (Mtc) and the waiting time at the ejection channel (Wejection). Expressed mathematically:

ð8Þ

e d i ;waiting is the average number of waiting in which, N messages at a hop of dimension di. Therefore, the meaning of Eq. (8) is that, the actual average blocking time at a channel of dimension di is equal to the average waiting time of a blocked message to acquire a virtual channel at that hop, when it considered that no other message is blocked at that hop, times the average number of blocked messages at the channel. The network latency of a message is defined as the time it takes all the flits of a message to cross the network, reach the destination node, and be ejected through the ejection channel. The average network latency of messages that traverse their first hop in dimension di, excluding the blocking delay of the first hop, denoted Dd i , is defined as the sum of the average blocking delay that messages face at

k 0d i ¼

di X ðk i  1Þ

ð11Þ

i¼0

In Eq. (10), the numerator is the probability of a message traversing a hop in dimension di, but not traversing any hops in dimensions higher than di, and the denominator corresponds to the probability of a message traversing a hop of dimension di (in the first place). When a message is to traverse a hop in dimension di, it may have already traversed m hops (where 0 6 m 6 k 0d i  1Þ. The number of different paths that such a message may have taken to reach this specific d i þ1;ðk d 11;...k d 1;k d 1Þ

i 1 0 hop is therefore equal to C m , i.e. the number of different m-combinations of di + 1 distinguishable types of items with each type j, for 0 6 j 6 di  1, consisting of kj  1 identical items and type di consisting of k d i  1  1 items. But the total number of paths that an m + 1 hop message

n;ðk n1 1;...k d 1;k d 1Þ

1 0 may take is equal to C mþ1 , i.e. the number of different m-combinations of di + 1 distinguishable types of items with each type j consisting of kj  1 identical items. The probability that the last hop to be traversed by an m + 1 hop message is in dimension di, is equal to the first combination divided by the second. Therefore, the probability of a message traversing its last hop in dimension di

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

can be determined as a weighted average, d i ;ðk d 11;...k d 1;k d 1Þ Pk0d i 1 1 0 Cm i m¼0 P mþ1 n1;ðk n1 1;...k d 1;k d 1Þ . Following the same 1

C mþ1

0

approach, a message that is to traverse a hop in dimension di may use any of the n1;ðk d n 1;...;k d i 11;...k d 1 1;k d 0 1Þ Cm different paths to reach this hop. Hence, the probability that a message traverses a hop in dimension di can also be determined as a weighted average such n1;ðk d n 1;...;k d 11;...k d 1;k d 1Þ Pk0n1 1 i 0 1 Cm as P . The later mþ1 n1;ðk n1 1;...k d 1;k d 1Þ m¼0 1

C mþ1

0

weighted average divided by the former will, as expressed in Eq. (10), obviously result in P t;d i . In a similar manner, the probability of a message traversing at least s hops of dimension dj, with the presumption that it has traversed at least one hop in dimension di (where dj > di) is given by Pk0n1 s1 m¼0

P d j ;sjd i ¼

P mþ1þs

Pk0n1 1 m¼0

n1;ðk

C mþ1þsn1

1;...;k 1 1;k 0 1Þ

n;ðk n1 1;...;k d 11;...;k 0 1Þ i

P mþ1 Cm

n1;ðk n1 1;...;k 0 1Þ

C mþ1

ð12Þ In this equation, the numerator is the probability of a message traversing at least one hop in dimension di and at least s hops in dimension dj, and the denominator is the probability of a message traversing at least one hop in dimension di (as the presumption). In a similar manner, the probability of a message traversing at least s hops in dimension dj, with the presumption that it has traversed its first hop in dimension di (where dj > di) is given by P d j ;sjd i;first Pk0n1 k0d i 1 s1 m¼0

¼

nd i 1;ðk n1 1;...;k d 1s;...;k d 11Þ j i

P mþ1þs Cm

Pk0n1 k0d i 1 1 m¼0

virtual channels. The Markovian model results in the following steady-state probability (derivation explained in [23]), in which the service time of a channel has been approximated as the network latency of that channel: ( ð1  kc Dd i Þðkc Dd i Þl ; if 0 6 j < L ð14Þ P d i ;j ¼ l ðkc Dd i Þ ; if j ¼ L The average number of waiting messages at a hop in dimension di can also be determined in a similar manner. First the probability of there being j messages waiting for a hop in di is calculated (according to the Markovian model mentioned above) as ( ð1  kc Dd i Þðkc Dd i Þl ; if 0 6 j < ð2d i  1ÞL P d i ;j;waiting ¼ ðkc Dd i Þl ; if j ¼ ð2d i  1ÞL ð15Þ

n1;ðk n1 1;...;k d 1s;...;k d 11;...;k 0 1Þ i j

Cm

n1;ðk

C mþ1þsn1

1;...k 1 1;k 0 1Þ

nd i 1;ðk n1 1;...k d 1 1;k d 11Þ i i

P mþ1 Cm

n1;ðk n1 1;...k 1 1;k 0 1Þ

C mþ1

ð13Þ Once again, the numerator is the probability of a message taking its first hop in dimension di and traversing at least a number of s hops in dimension dj. Here, however, the denominator is the probability of a message taking its first hop in dimension di (the presumption). The service time of a channel is defined as the time it takes for all the flits of a message to be transmitted over the channel. The value of P d i ;j can be determined following the same approach taken in [23], using a Markovian model of the allocation of

247

where (2di  1)L is the maximum number of messages that may be waiting to acquire a hop in dimension di. Verification of this fact is straightforward when it is established that only messages entering a node from dimension di or lower dimensions, may need to acquire a virtual channel of a hop in that dimension. e d i ;waiting from Eq. (8) can now be calculated as N the weighted average of the number of waiting messages (the average number of messages that need to traverse a hop in di) as e d i ;waiting ¼ N

ð2dX i 1ÞL

j:P d i ;j;waiting

ð16Þ

j¼L

In the steady state, the rate of messages that exit the network through ejection channels is equal to the injection rate of messages, which is equal to the generation rate kg. Utilization of the ejection channel (in each node) is therefore equal to Mkg. Given that messages are of fixed length, there is no variance in service time. Using an M/G/1 queueing model (as explained in [23]), we can calculate the waiting time at an ejection channel as W ejection ¼

M 2 kg 2ð1  Mkg Þ

ð17Þ

The probability that a message uses a hop in dimension di as its first route can be calculated as a weighted average: k 0n1 k 0d

Xi 1 m¼1

nd i 1;ðk

1;...;k 11Þ

di n1 C m1 P m n1;ðk C m n1 1;kn2 1;...;k0 1Þ

ð18Þ

248

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

Since this probability is dependent on di, a weighted average is more appropriate to determine the average total network latency of messages, denoted S. Therefore, considering that the total network latency of messages that take their first hop in dimension di is equal to ðDd i þ W d i Þ, S can be determined as S¼

n1 X d i ¼0

k 0n1 k 0d

Xi 1

nd i 1;ðk

P m  ðDd i þ W d i Þ:

m¼1

1;k

3.3. An analytical model for the k-ary n-cube network The k-ary n-cube is a torus network with the restriction that the radix of each dimension is equal to k. Thus, the probability of a message traversing s hops in dimension dj, having traversed at least one hop in dimension di, becomes Pnðk1Þs1

1;...;k 11Þ

n1 n2 di C m1 C mn1;ðk n1 1;kn2 1;...;k0 1Þ

m¼0

P d j ;sjd i ¼

ð19Þ

n1;ðk1;...;k1s;...;k11;...;k1Þ

P mþ1þs Cm

Pnðk1Þ1 m¼0

P mþ1

n1;ðk1;...;k1;k1Þ C mþ1þs n1;ðk1;...;k11;...;k1Þ Cm n1;ðk1;...;k1Þ C mþ1

ð24Þ In virtual channel flow control, multiple virtual channels share the bandwidth of a physical channel. Hence, the average service time of a message should be inflated by the amount of multiplexing that takes place across the different dimensions in order to obtain the effective average service time. The average degree of virtual channel multiplexing at dimension di is given by [22] PL 2 ld ¼ Pl¼0 l P d i ;l i L l¼0 lP d i ;l

ð20Þ

But the right-hand side of this equation is independent of di and dj. Therefore, P d j ;sjd i can be considered a function of s. This tremendously reduces the complexity of the model, considering that the number of such probability factors needed to be calculated is reduced by a factor of O(n2). The probability of a message traversing s hops in dimension dj, having traversed its first hop in dimension di, becomes Pðnd i Þðk1Þs1 m¼0

Therefore, the average degree of multiplexing in the network becomes l ¼

n1 X 1 k d ld k 0 þ k 1 þ    þ k n1 d i ¼0 i i

ð21Þ

2

ðkg =V ÞS 2 ð1 þ ðS  MÞ =S 2 Þ 2ð1  ðkg =V ÞSÞ

ð22Þ

Finally, the average message latency is obtained as the summation of the effective average network delay ðSlÞ, the average waiting time at the source node ðW s Þ, and the average time for the last flit of a message to reach its destination ðhlÞ. Therefore T ¼ Sl þ W s þ hl

Pðnd i Þðk1Þ1 m¼0

P mþ1

n1;ðk1;...k1;k1Þ C mþ1þs nd i 1;ðk1;...k1;k11Þ Cm n1;ðk1;...k1;k1Þ C mþ1

ð25Þ

Hence, as explained before, the effective average network delay is equal to Sl. A message originating from a given source node sees a network latency of S. Modeling the local queue in the source node as an M/G/1 queue, with the mean arrival rate of kg/V and a service time of S 2 with an approximated variance of ðS 2  MÞ , yields the mean waiting time seen by a message at the source node as Ws ¼

P d j ;sjd i;first ¼

nd i 1;ðk1;...;k1s;...;k11Þ

P mþ1þs Cm

ð23Þ

The right-hand side of this equation is independent of di. Therefore, P d j ;sjd i;first can be considered a function of s and di. This also reduces the complexity of the model by reducing the number of such probability factors needed to be calculated by a factor of O(n). 3.4. The analytical model for the bidirectional k-ary n-cube network With bidirectional routing, the maximum number of hops that a message can take in one dimension is equal to bk/2c. Thus, a generated message in a k-ary n-cube traverses m hops, where 1 6 m 6 nbk/2c, to reach its destination, and the destination node of an m-hop message could be any of n;ðbk c;bk c;...;bk2cÞ Cm 2 2 nodes that are m hops away from the source node. But the hops taken in each dimension can be in the positive or negative direction. This is while, if the index of a dimension is even, the number of hops that can be taken in one of the directions is less than bk/2c. An example, to show this case, is illustrated in Fig. 2.

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

249

the difference that the calculation complexity of P t;d i , P d j ;sjd i (or P d j jd i , since s can attain no value other than one) and P d j ;sjd i;first (or P d j jd i;first ) are greatly reduced here.

Fig. 2. An example of the number of hops that can be taken in the positive or negative directions in a bidirectional ring when the number of nodes are odd (a) and even (b).

Therefore, changes to

the

definition

d1 ;k d2 ;...;k 0 Þ C d;ðk m 8 1; > > > > > 0; > > > < kP d1 d1;ðk ;k ;...;k Þ C mi d2 d3 0 ¼ > i¼0 > > > > k d1 > Pe d1;ðkd2 ;kd3 ;...;k0 Þ > > C mi ; : þ

of

C md;ðkn1 ;kn2 ;...;k0 Þ

if m ¼ 0 if m < 0 or d < 0

otherwise

i¼1

ð26Þ

where e is equal to one when kd1 is even and equals zero otherwise. With bidirectional routing, the average number of hops that a message makes along one dimension is equal to k/4, thus the k/2 factors in P d i ;ðB:ntÞ are replaced by k/4. The arrival rate of messages to any network channel is given as hkg ð27Þ kc ¼ 2n since the number of physical output channels is equal to 2n. The rest of the calculations for the bidirectional case are similar to that performed for the unidirectional case with the difference that the new definition of C md;ðkn1 ;kn2 ;...;k0 Þ is used, and the k  1 parameters are replaced by bk/2c. 3.5. The hypercube case The Hypercube is a unidirectional k-ary n-cube network with the restriction that the radix of each dimension is equal to 2. Thus, the maximum number of hops that can be taken in each dimension is d;ð1;1;...;1Þ equal reduces  toone and thus the function C m d to , the combination of m from d (also   m d 1 d;ð1;...;11;...;1Þ reduces to ). Moreover, the Cm m k/2 factor in W d i reduces to one. Therefore, the model presented here, reduces to almost the same model presented in [16] for the hypercube, with

3.6. Considering the first organisation of virtual channels Only the equation giving the blocking probability at dimension i, should be changed in the model when considering ORG.I, described in Section 2. A message is blocked in dimension i, if all the virtual channels in VC1 are busy and ci < di (where ci is the current node position and di is the destination node position along dimension i), or when all the virtual channels in VC2 are busy and ci P di. With the uniform traffic pattern, we have Pr(ci < di) = Pr(ci P di) = 1/2. The probability that all virtual channelsin VC1  are busy  can be approximated by PL L L  L1 . Similarly, the probabilv¼L1 P iv v Lv ity that all virtual  channels  in  VC2 are busy is given PL L1 L by v¼LL1 P iv . Adding these probaLv v bilities with their appropriate weights yields the probability of blocking as     L  L1 L1 L L 1X 1 X Lv Lv   P iv þ   P iv P Bi ¼ L L 2 v¼L1 2 v¼LL1 v v ð28Þ 4. Validation of the model The analytical model has been validated through a discrete-event simulator that mimics the behaviour of the described routing algorithms in the network at the flit level. In each simulation experiment, a total number of 100 000 messages are delivered. Statistics gathering was inhibited for the first 10 000 messages to avoid distortions due to the initial start up conditions. The simulator uses the same assumptions as the analysis, and some of these assumptions are detailed here with a view to making the network operation clearer. The network cycle time is defined as the transmission time of a single flit from one router to the next. Messages are generated at each node according to a Poisson process with a mean inter-arrival rate of kg messages/cycle. Message length is fixed at M flits. Destination nodes

250

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

are determined using a uniform random number generator. The mean message latency is defined as the mean amount of time from the generation of a message until the last data flit reaches the local PE at the destination node. The other measures include the mean network latency, the time taken to cross the network, and the mean queuing time at the source node, the time spent at the local queue before entering the first network channel. Numerous validation experiments have been performed for several combinations of network sizes,

message lengths, and number of virtual channels to validate the model. However, for the sake of specific illustration, Figs. 3 and 4 depict latency results predicted by the model plotted against those provided by the simulator for the bidirectional 16 · 16 (N = 256 nodes) and 8 · 8 · 8 (N = 512 nodes) torus networks. The message length is assumed to be M = 32, 64, and 100 flits and the number of virtual channels per physical channel was set to L = 3 and 5. The horizontal axis in the figures represents the traffic rate at which a node injects messages into 8-ary 3-cube, L=3

16-ary 2-cube, L=3 500

500

simulation model, M=32 model, M=64 model, M=100

450

Average message latency

400

Average message latency

simulation model, M=32 model, M=64 model, M=100

450

350 300 250 200 150

400 350 300 250 200 150 100

100

50

50

0 0

0 0

0.001

0.002

0.003

0.004

0.005

0.002

0.006

0.008

500 simulation model, M=32 model, M=64 model, M=100

450

0.012

8-ary 3-cube, L=5

16-ary 2-cube, L=5 500

0.01

Traffic generation rate

Traffic generation rate

simulation model, M=32 model, M=64 model, M=100

450 400

Average message latency

400

Average message latency

0.004

0.006

350 300 250 200 150

350 300 250 200 150 100

100

50 50

0 0

0 0

0.001

0.002

0.003

0.004

0.005

0.006

0.002

0.004

0.006

0.008

0.01

0.012

0.014

0.007

Traffic generation rate

Fig. 3. Average message latency predicted by the model against simulation results for a 16 · 16 torus, L = 3 and 5 virtual channels, and M = 32, 64 and 100 flits.

Traffic generation rate

Fig. 4. Average message latency predicted by the model against simulation results for a 8 · 8 · 8 torus, L = 3 and 5 virtual channels, and M = 32, 64 and 100 flits.

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

the network in a cycle. The vertical axis shows the mean message latency in crossing from source to destination, including waiting time at source and destination. The figure reveals that the model predicts the mean message latency with a good degree of accuracy when the network operates in the steady-state regions. It can be seen in the figures that the model can better predict the saturation point when the number of virtual channels increases. It is because the assumptions of the model (such as exponential arrival rate over network channels while there is strict dependency between network channels under heavy traffic loads when wormhole switching is used) and also the approximations made to ease calculating the blocking probability well suit scenarios where the number of virtual channels is larger. Also as can be seen in the figures, the accuracy of the model (even for medium traffic rates) is lower when messages are longer. This is again a direct cause of the simplifying assumption that channels are considered to be independent and have an exponential arrival rate; this assumption looses its validity when the traffic generation rate increases. Note that the tight dependency between successive channels and buffers used by a message crossing the network, when the switching method is wormhole, causes such an inaccuracy when traffic load becomes heavier. For light traffic loads and short messages the inaccuracy created by the model assumptions is lower. Fig. 5 shows the accuracy of the proposed model (compared to the one given in [23]), for a 16 · 16 torus, L = 5 and M = 32. It is observed that the sat500 simulation

Average message latency

proposed model

400

model in [22]

300

200

100

0 0

0.001

0.002

0.003

0.004

0.005

0.006

0.007

Traffic generation rate

Fig. 5. The prediction accuracy of the model proposed here (solid line) compared to that of [23] (dashed line) for a 16 · 16 torus with L = 5 virtual channels and M = 32 flits.

251

uration point of the model presented here is closer to the saturation point of the simulation results. For the sake of brevity, only one scenario has been shown. However, in all considered cases, the proposed model here predicted the average message latency in the network with a higher degree of accuracy compared to the one proposed in [23]. 5. Conclusions Analytical models of torus-based networks with deterministic routing have widely been reported in the literature. However, most of these models have not considered the effect of virtual channels on network performance. This paper has proposed a new combinatorial model that captures the effects of virtual channels multiplexing on message latency in n-D torus interconnection networks. The model is based on assumptions widely used in similar studies. Simulation experiments have revealed that the model predicts message latency with a reasonably high degree of accuracy compared to the mostrecently proposed model in [23] having similar capabilities. References [1] S. Abraham, K. Padmanabhan, Performance of multicomputer networks under pin-out constraints, Journal Parallel and Distributed Computing 12 (3) (1991) 237–248. [2] V.S. Adve, M.K. Vernon, Performance analysis of mesh interconnection network with deterministic routing, IEEE Transactions on Parallel and Distributed Systems 5 (3) (1994) 225–246. [3] A. Agrawal, Limits on interconnection network performance, IEEE Transactions on Parallel and Distributed Systems 2 (1991) 398–412. [4] E. Anderson, J. Brooks, C. Grassl, S. Scott, Performance of the Cray T3E multiprocessor, Proceedings of Supercomputing Conference, ACM Press, 1997, pp. 19–25. [5] J.R. Anderson, S. Abraham, Performance-based constraints for multi-dimensional networks, IEEE Transactions on Parallel and Distributed Systems 11 (1) (2000) 21–35. [6] A.A. Chien, A cost and speed model for k-ary n-cube wormhole routers, IEEE Transactions on Parallel and Distributed Systems 9 (2) (1998) 150–162. [7] B. Ciciani, M. Colajanni, C. Paolucci, Performance evaluation of deterministic wormhole routing in k-ary n-cubes, Parallel Computing 24 (1998) 2053–2075. [8] W.J. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Transactions on Computers 36 (5) (1987) 547–553. [9] W.J. Dally, Performance analysis of k-ary n-cubes interconnection networks, IEEE Transactions on Computers 39 (6) (1990) 775–785. [10] W.J. Dally, Virtual channel flow control, IEEE Transactions on Parallel and Distributed Systems 3 (2) (1992) 194–205.

252

H. Hashemi-Najafabadi, H. Sarbazi-Azad / Journal of Systems Architecture 54 (2008) 241–252

[11] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, Journal Parallel and Distributed Computing 23 (1994) 202–214. [12] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, Los Alamitos, CA, 1997. [13] J. Duato, Why commercial multicomputers do not use adaptive routing, IEEE Technical Committee on Computer Architecture Newsletter (1994) 20–22. [14] R. Greenberg, L. Guan, Modelling and comparison of wormhole routed mesh and torus networks, in: Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Systems, IASTED Press, 1997. Available from: . [15] W.J. Guan, W.K. Tsai, D. Blough, An analytical model for wormhole routing in multicomputer interconnection networks, in: Proceedings of the International Conference on Parallel Processing, 1993, pp. 650–654. [16] Y. Boura, C.R. Das, Modelling virtual channel flow control in n-dimensional hypercubes, in: Proceedings of the International Symposium on High Performance Computer Architecture, 1995, pp. 166–175. [17] R.E. Kessler, J.L. Swarszmeier, Cray T3D: A new dimension for Cray research, Compcon, Spring, 1993. [18] J. Kim, C.R. Das, Hypercube communication delay with wormhole routing, IEEE Transactions on Computers 43 (7) (1994) 806–814. [19] L.M. Ni, K. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Computers 26 (1993) 62–76. [20] M. Noakes et al., The J-machine multicomputer: An architectural evaluation, in: Proceedings of the 20th International Symposium on Computer Architecture, ACM Press, 1993, pp. 224–235. [21] C. Peterson et al., iWARP: A 100-MPOS LIW microprocessor for multicomputers, IEEE Micro 11 (13) (1991). [22] H. Hashemi-Najafabadi, H. Sarbazi-Azad, An accurate combinatorial model for performance prediction of deterministic wormhole routing in torus multicomputer systems, in: Proceedings of IEEE International Conference on Computer Design, 2004, pp. 548–553. [23] H. Sarbazi-Azad, A. Khonsari, M. Ould-Khaoua, Analysis of k-ary n-cubes with dimension-ordered routing, Future Generation Computer Systems 19 (4) (2003) 493–502.

Hashem Hashemi received his BSc degree in Computer Engineering from Azad University, Tehran, Iran, in 2002, and his MSc degree in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 2004. He is currently a PhD student in the Department of Electrical and Computer Engineering at the North Carolina State University, Raleigh, NC, USA. His research interests include advanced microarchitectures, memory hierarchy, and interconnection networks.

Hamid Sarbazi-Azad received his BSc degree in electrical and computer engineering from Shahid-Beheshti University, Tehran, Iran, in 1992, his MSc degree in computer engineering from Sharif University of Technology, Tehran, Iran, in 1994, and his PhD degree in computing science from the University of Glasgow, Glasgow, UK, in 2002. He is currently associate professor of computer engineering at Sharif University of Technology, and heads the School of Computer Science of the Institute for Studies in Theoretical Physics and Mathematics (IPM), Tehran, Iran. His research interests include high-performance computer architectures, NoCs and SoCs, parallel and distributed systems, performance modeling/evaluation, graph theory and combinatorics, wireless/mobile networks, on which he has published more than 150 refereed conference and journal papers. Dr. Sarbazi-Azad has served as the editor-in-chief for the CSI Journal on Computer Science and Engineering since 2005, editorial board member and guest editor for several special issues on high-performance computing architectures and networks (HPCAN) in related journals. Dr. Sarbazi-Azad is a member of ACM and CSI (computer Society of Iran).