Simulation Modelling Practice and Theory 10 (2002) 525–541 www.elsevier.com/locate/simpat
The impact of virtual channel allocation on the performance of deterministic wormhole-routed k-ary n-cubes S. Loucif, M. Ould-Khaoua
*
Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, Glasgow, Scotland, G12 8RZ, UK Received 18 December 2001; received in revised form 7 October 2002
Abstract

Virtual channels yield significant improvement in the performance of wormhole-routed networks as they can greatly reduce message blocking over network resources. K-ary n-cubes with deterministic routing have been widely analysed using analytical modelling tools. Most existing models, however, have either entirely ignored the effects of virtual channel multiplexing or have not considered the impact of virtual channel allocation on message latency. This paper discusses two different organisations of virtual channels in k-ary n-cubes, resulting in two deterministic routing algorithms. It then proposes an analytical model to compute message latency for the two routing algorithms. The proposed model is used in a case study to demonstrate the sensitivity of network latency to the way virtual channels are allocated to messages. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Parallel processing; Multicomputers; Interconnection networks; Deterministic routing; Wormhole routing; Virtual channels; Performance modelling.
1. Introduction

K-ary n-cubes have been popular networks for multicomputers [2,13,17,19] due to their desirable properties, such as ease of implementation and ability to exploit communication locality to reduce message latency. A k-ary n-cube has an n-dimensional
Corresponding author. Tel.: +44-141-3306056; fax: +44-141-3304913. E-mail address:
[email protected] (M. Ould-Khaoua).
1569-190X/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S1569-190X(02)00132-6
Nomenclature

Symbols
$C_j^i$: number of i-combinations among j distinguishable items
$\bar{d}$: average message distance in the network
$f_l$: continuing message rate arriving at a channel
$i, j, l$: indices
$k$: dimension width
$\bar{k}$: average message distance within a dimension
$L_m$: average length of messages
Latency: average network latency
$\bar{L}$: average message latency, excluding the multiplexing effects of virtual channels and the mean queuing time at the source
$N$: network size in nodes
$n$: network dimensionality
$p_s$: probability that a message skips a dimension
$p_{t,i}$: probability that a message terminates after crossing dimension i
$p_{p,j}$: probability that a message passes dimension j
$P_{i,j,v}$: probability that v virtual channels of the jth physical channel in dimension i are busy
$P_a$: probability that all adaptive virtual channels at a physical channel are busy
$P_d$: probability that adaptive and deterministic virtual channels at a physical channel are busy
$P_{bl}$: probability of message blocking at a physical channel
$q_{i,j,v}$: intermediate variable used in the calculation of $P_{i,j,v}$
$R_i^L$: average message rate entering the network through dimension i
$R_i^P$: average message rate entering dimension i from lowest dimensions
$S_i$: average service time seen by a message entering dimension i
$T_i$: average service time seen by a message exiting dimension i
$T_{i,j}$: average latency seen by a message entering the jth physical channel of dimension i
$V$: number of virtual channels per physical channel
$\bar{V}_{i,j}$: average degree of virtual channel multiplexing at the jth physical channel of dimension i
$\bar{V}$: average degree of virtual channel multiplexing in the network
$W_i$: total average blocking delay at dimension i
$w_{i,j}$: average blocking delay at the jth physical channel of dimension i
$w_{qs}$: average message queueing time at the source
$w_{qd}$: average message queueing time at the destination
$\lambda_g$: probability that a processor generates a message in a cycle
$\lambda'$: fraction of generated message traffic taking one of the possible network paths
$\lambda_c$: average message rate on a physical channel
$\Phi$: total message rate entering a channel
$\zeta$: average service time seen by a message at the entrance of a physical channel
grid structure with k nodes in each dimension such that every node is connected to two other nodes in each dimension by direct channels. The low-dimensional torus is the most common instance of k-ary n-cubes and has been used in practical systems, such as the J-machine [17], CRAY T3E [2], CRAY T3D [13], and iWARP [19].

Wormhole routing [16] has become the dominant switching technique in contemporary multicomputers [2,13,17–19]. In this switching technique, messages are broken into flits (a few bytes each) for transmission and flow control. The header flit, containing routing information, governs the route and the remaining data flits follow in a pipelined fashion. If the header is blocked, the other flits are blocked in situ. The advantage of this technique is that it reduces the impact of message distance on latency under light traffic. However, as network traffic increases, messages may experience large delays in crossing the network due to chained channel blocking. To overcome this, the flit buffers associated with a given physical channel are organised into several virtual channels [8], each representing a "logical" channel with its own buffer and flow control. Virtual channels are allocated independently to different messages and compete with each other for the physical channel bandwidth. This de-coupling allows messages to bypass one another in the event of blocking, using network bandwidth that would otherwise be wasted.

Deterministic routing [6] has been popular in practical multicomputers [13,17,19] because it is simple and requires a minimal number of virtual channels, resulting in an efficient router implementation [4]. The authors in [21] have shown that under realistic traffic patterns generated by typical parallel applications the performance of deterministic routing can even approach that of adaptive routing without the expense of a complex hardware implementation.
Analysis of k-ary n-cubes with deterministic routing has been widely reported in the past [1,3,5,7,9]. However, most existing studies have ignored the effects of virtual channel multiplexing, except for the model of [9], where the analysis has been restricted to two virtual channels per physical channel. The analytical model proposed by Gaughan and Yalamanchili [11] takes into account a number of important factors including virtual channel multiplexing. However, their model was discussed in the context of circuit switching (a variant switching method of wormhole routing) and adaptive routing that allows backtracking in case of message blocking.

When the network operates at moderate or heavy traffic loads, the probability that messages are blocked increases and this probability depends on how virtual channels are allocated to messages. Unfortunately, most existing analyses have not considered the impact of virtual channel allocation on message latency. In this paper, we discuss two different ways of allocating virtual channels to messages when crossing k-ary n-cube networks, which results in two organisations of
virtual channels at a given physical channel, and ultimately in two possible deterministic routing algorithms. We then propose an analytical model of deterministic routing in wormhole-routed k-ary n-cubes equipped with an arbitrary number of virtual channels. The model can be used to compute the mean message latency when either organisation of virtual channels is adopted. It is worth noting that the authors in [20] have very recently conducted a similar study. However, in this work, we present a totally different approach from that of [20] for modelling the behaviour of deterministic routing and capturing the effects of virtual channels on network performance. The present model is validated through extensive simulation experiments, and then used to compare the relative performance merits of the two different organisations of virtual channels.

The rest of the paper is organised as follows. Section 2 describes the k-ary n-cube and the router structure used in the analysis. Section 3 outlines the two organisations of virtual channels that lead to two different deterministic routing algorithms. Section 4 describes an analytical model to compute the mean message latency. Section 5 validates the model through simulation experiments. Section 6 compares the performance of the two virtual channel organisations. Finally, Section 7 concludes this paper.
2. The k-ary n-cube and its router structure

The k-ary n-cube contains $N = k^n$ nodes, arranged in n dimensions, with k nodes per dimension. Each node is connected to its nearest neighbours in each dimension. Let dimensions be numbered from 1 to n. A node, x, can then be labelled by an $n \times 1$ address vector with $x_i$ being the node's position in dimension i. A node at address $x = (x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)$, with $0 \le x_i \le k-1$ and $1 \le i \le n$, is connected along dimension i to the nodes $x' = (x_1, \ldots, x_{i-1}, x_i \pm 1\ [\text{modulo } k], x_{i+1}, \ldots, x_n)$. Fig. 1 shows some examples of k-ary n-cubes. The k-ary 1-cube is the well-known ring, while the k-ary 2-cube and k-ary 3-cube are best known as the 2- and 3-dimensional torus respectively, a variation of the mesh with wrap-around connections.
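The addressing scheme above can be illustrated with a minimal Python sketch. The function name `neighbours` and the tuple representation of an address are assumptions made for illustration; dimensions are indexed from 0 in the code, whereas the text numbers them from 1.

```python
from typing import List, Tuple


def neighbours(x: Tuple[int, ...], k: int) -> List[Tuple[int, ...]]:
    """Return the nodes adjacent to address x in a k-ary n-cube.

    x holds one position per dimension, each in the range 0..k-1.
    For every dimension the two neighbours are at x_i + 1 and x_i - 1,
    taken modulo k to model the wrap-around connections.
    """
    result = []
    for i in range(len(x)):
        for step in (+1, -1):
            y = list(x)
            y[i] = (x[i] + step) % k  # wrap-around link of the torus
            result.append(tuple(y))
    return result


if __name__ == "__main__":
    # Corner node of a 5-ary 2-cube (5 x 5 torus).
    print(neighbours((0, 0), 5))
```

For k greater than 2 this yields 2n distinct neighbours, which matches the 2n network channels per node described for the router; for k = 2 the two directions in a dimension lead to the same node.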
Fig. 1. Examples of k-ary n-cubes. (a) 10-ary 1-cube, (b) 5-ary 2-cube, (c) 3-ary 3-cube.
Fig. 2. The router structure in k-ary n-cubes. (Diagram labels: channels for dimensions 1+, 1−, ..., n+, n−; crossbar switch; injection and ejection channels connected to the PE.)
Each node consists of a processing element (PE) and a router, as shown in Fig. 2. The PE contains a processor and some local memory. The router has $(2n + 1)$ input and $(2n + 1)$ output channels. A node is connected to its neighbouring nodes through 2n input and 2n output channels; there are two channels in each dimension, corresponding to the positive and negative directions. The remaining channels are used by the PE to inject messages into and eject messages from the network. Messages generated by the PE are injected into the network through the injection channel. Messages at the destination node are transferred to the PE through the ejection channel. The router contains flit buffers for each input virtual channel. The input and output channels are connected by a crossbar switch that can simultaneously connect multiple input channels to multiple output channels, provided that there is no contention over the output channels.
3. Virtual channels

The concept of virtual channels was first introduced in the context of the design of deadlock-free deterministic routing algorithms [6]. The authors in [6] have shown that the wrap-around connections in k-ary n-cubes can lead to message deadlock due to the cyclic dependencies that can occur within a dimension. They proposed the use of an additional virtual channel to transform dependency cycles into "spirals" to avoid deadlock. Although two virtual channels per physical channel are sufficient to guarantee deadlock-free routing, the use of more virtual channels improves network performance by increasing throughput and decreasing message blocking delays due to the chained channel blocking inherent in wormhole routing, as pointed out in [8].

There can be several ways to route messages across virtual channels when more than two virtual channels are used per physical channel. To demonstrate how message latency can be affected by virtual channel allocation, this study proposes two ways to organise virtual channels, leading to two deterministic routing algorithms, and investigates the relative performance merits of the two routing algorithms.
3.1. The first routing algorithm

Let V be the number of virtual channels per physical channel, numbered $v_0, v_1, \ldots, v_{V-1}$. In this organisation, the V virtual channels are split into two classes containing an equal number of virtual channels. The first class contains $[v_0, v_1, \ldots, v_{(V/2)-1}]$, representing the "low" virtual channels. The second class contains $[v_{V/2}, v_{(V/2)+1}, \ldots, v_{V-1}]$, representing the "high" virtual channels. Assume that the message header is at the node address $A_c$ and that its destination is the node address $A_d$. The routing algorithm is then as follows. If $A_c < A_d$, the message is routed on any of the "high" virtual channels. Otherwise, the message is routed on any of the "low" virtual channels. This deterministic routing algorithm is similar to that described in [6], and is therefore deadlock-free; it merely extends the number of virtual channels in each class from one, as suggested in [6], to an arbitrary number.

3.2. The second routing algorithm

The routing algorithm in the second organisation is based on Duato's methodology [10] in the context of deterministic routing. Virtual channels of a physical channel are split into two classes. The first class contains $(V-2)$ virtual channels $[v_2, v_3, \ldots, v_{V-1}]$, and the second contains only two virtual channels, $v_0$ and $v_1$. Since the routing is deterministic, messages cross dimensions in a predefined order (e.g., in ascending order). At each routing step, a message can choose any of the $(V-2)$ virtual channels $[v_2, v_3, \ldots, v_{V-1}]$. If all these channels are occupied, the message crosses $v_1$ if $A_c < A_d$; otherwise it crosses $v_0$. Adopting the same terminology as in [10], the virtual channels $v_0$ and $v_1$ represent escape channels. In what follows we will refer to the first and second routing algorithms as the Dally and Seitz-based and Duato-based algorithms, respectively.
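The two allocation rules can be sketched in Python as follows. This is an illustrative sketch rather than the router logic used in the paper: the function names, the `busy` list (one flag per virtual channel of the physical channel), the choice of the lowest-numbered free channel where the text says "any", and the use of scalar positions for $A_c$ and $A_d$ are all assumptions.

```python
from typing import Optional, Sequence


def select_vc_dally_seitz(busy: Sequence[bool], a_c: int, a_d: int) -> Optional[int]:
    """Pick a free virtual channel under the first (Dally and Seitz-based) rule.

    Channels V/2..V-1 form the "high" class, used when a_c < a_d;
    channels 0..V/2-1 form the "low" class, used otherwise.
    Returns None when every channel of the required class is busy.
    """
    v = len(busy)
    half = v // 2  # assumes an even number of virtual channels
    candidates = range(half, v) if a_c < a_d else range(0, half)
    for vc in candidates:
        if not busy[vc]:
            return vc
    return None


def select_vc_duato(busy: Sequence[bool], a_c: int, a_d: int) -> Optional[int]:
    """Pick a free virtual channel under the second (Duato-based) rule.

    Channels 2..V-1 may be used by any message; if all are busy the
    message falls back on escape channel v1 (a_c < a_d) or v0 (otherwise).
    Returns None when the message is blocked.
    """
    v = len(busy)
    for vc in range(2, v):
        if not busy[vc]:
            return vc
    escape = 1 if a_c < a_d else 0
    return None if busy[escape] else escape


if __name__ == "__main__":
    state = [False, False, True, True]  # v2 and v3 occupied, V = 4
    print(select_vc_dally_seitz(state, a_c=1, a_d=3))  # high class full
    print(select_vc_duato(state, a_c=1, a_d=3))        # falls back on v1
```

With the channel state shown, the first rule leaves the message blocked while the second lets it proceed on an escape channel, which is the kind of difference the analytical model is meant to quantify.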
4. The analytical model

Since the derivation of the model is almost the same for the two deterministic routing algorithms mentioned above, we will distinguish between the two algorithms only when necessary, e.g., when the equations for some parts of the model differ. The model is based on the following assumptions, which are widely accepted in the literature [1,3,5,7,8,14].

(a) Nodes generate traffic independently of each other, following a Poisson process with a mean rate of $\lambda_g$ messages/cycle. Furthermore, message destinations are uniformly distributed across the network nodes.
(b) Message length is exponentially distributed with a mean of $L_m$ flits, and each flit requires one cycle of transmission time across a physical channel.
(c) Messages generated at the source node are put in a local queue of infinite capacity.
(d) One pair of injection/ejection channels connects the processing element to the router. Messages enter the network through the injection channel, while they are consumed at the destination through the ejection channel.

The average number of channels that a message visits within a dimension and across the network are given by [1]

$$\bar{k} = \frac{k}{4} \tag{1}$$

$$\bar{d} = n\bar{k} \tag{2}$$
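As a quick numerical check of Eqs. (1) and (2), the sketch below averages the shortest-path distance on a bidirectional ring by brute force. The function names are hypothetical, and the check assumes that the offset between source and destination within a dimension is uniformly distributed over 0..k-1 (a zero offset meaning the dimension is skipped).

```python
def mean_ring_distance(k: int) -> float:
    """Average shortest-path hop count on a bidirectional ring of k nodes.

    The destination offset d is taken as uniform over 0..k-1, and a
    message travels min(d, k - d) hops in whichever direction is shorter.
    """
    total = sum(min(d, k - d) for d in range(k))
    return total / k


def mean_network_distance(k: int, n: int) -> float:
    """Average distance across n independent dimensions, as in Eq. (2)."""
    return n * mean_ring_distance(k)


if __name__ == "__main__":
    for k in (8, 16):
        print(k, mean_ring_distance(k), k / 4)
    print(mean_network_distance(16, 2))
```

For even k the brute-force average equals k/4 exactly; for odd k it is $(k^2-1)/(4k)$, slightly below k/4.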
Following a similar analysis as in [5,15], Fig. 3 shows the diagram used to derive the mean message latency. A message generated at the source sees a mean latency, Latency, to reach its destination. Furthermore, a message can enter the network through any dimension, depending on its source and destination addresses. When a message enters the network through dimension i $(0 \le i < n)$ it sees $S_i$ as the service time to reach its destination. Similarly, after exiting the dimension, the message sees, on average, the latency $T_i$ to reach the destination. Let $p_s = 1/k$ be the probability that a message skips a dimension, and $\lambda'$ be the mean message rate entering the network from a given node. In a k-ary n-cube with bi-directional channels, messages can follow any of the $2^n$ possible paths, resulting from the possible combinations of both the left and right directions on the n dimensions. Therefore, the message rate received on each path is $\lambda' = \lambda_g / 2^n$. Without loss of generality, the analysis focuses on one path, taking into account the interaction of messages traveling on different paths and sharing one or more dimensions. The results are similar for other paths since the k-ary n-cube has a symmetric topology.

Fig. 3. A diagram of a message path in the network.

The mean message rate generated by a node and entering the network through dimension i $(0 \le i < n)$, $R_i^L$, skipping the j $(0 \le j < i)$ lowest dimensions, is given by [15]

$$R_i^L = (1 - p_s)\, p_s^{i}\, \lambda' \tag{3}$$

The probability that a message that has exited dimension j $(0 \le j < n)$ passes dimension i $(j < i < n)$ in its next step is $(1 - p_s)^2 p_s^{i-j-1}$. Therefore, the total rate of messages, $R_i^P$, entering dimension i from the lowest dimensions j $(0 \le j < i)$ can be written as

$$R_i^P = \begin{cases} 0 & i = 0 \\ \sum_{j=0}^{i-1} (1 - p_s)^2\, p_s^{i-j-1}\, \lambda' & 0 < i < n \end{cases} \tag{4}$$

The total of the rates $R_i^L$ and $R_i^P$ at any dimension i is found to be $\Phi = (1 - p_s)\lambda'$ [5]. To compute $S_i$, we need to expand each branch of the diagram given in Fig. 3 to a more detailed level, as shown in Fig. 4. The mean network latency, $S_i$, seen by a message entering the network through dimension i is the mean service time seen at the exit of the dimension, increased by the mean blocking delay encountered within the dimension, $W_i$. Therefore, we can write

$$S_i = T_i + W_i \tag{5}$$
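The statement that the rates of Eqs. (3) and (4) always add up to $\Phi = (1 - p_s)\lambda'$ can be checked numerically. The following is a small sketch with a hypothetical function name; it evaluates both rates for every dimension of a given network.

```python
from typing import List, Tuple


def dimension_entry_rates(n: int, k: int, lam_g: float) -> List[Tuple[float, float]]:
    """Return (R_i^L, R_i^P) for each dimension i = 0..n-1.

    p_s = 1/k is the probability of skipping a dimension and
    lam_path = lam_g / 2**n is the traffic rate on one of the 2^n paths.
    """
    p_s = 1.0 / k
    lam_path = lam_g / 2 ** n
    rates = []
    for i in range(n):
        # Eq. (3): traffic injected locally that first uses dimension i.
        r_l = (1 - p_s) * p_s ** i * lam_path
        # Eq. (4): traffic arriving from each lower dimension j < i.
        r_p = sum((1 - p_s) ** 2 * p_s ** (i - j - 1) * lam_path for j in range(i))
        rates.append((r_l, r_p))
    return rates


if __name__ == "__main__":
    n, k, lam_g = 3, 8, 0.01
    phi = (1 - 1.0 / k) * lam_g / 2 ** n
    for i, (r_l, r_p) in enumerate(dimension_entry_rates(n, k, lam_g)):
        print(i, r_l + r_p, phi)
```

The sum is the same for every dimension because the geometric series in Eq. (4) evaluates to $(1 - p_s)(1 - p_s^{i})\lambda'$, which complements the $p_s^{i}$ term of Eq. (3).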
After crossing dimension i $(0 \le i < n)$, the message can either be consumed if it has reached its destination, which occurs with probability $p_{t,i}$ $(= p_s^{n-i-1})$, or it may continue crossing subsequent dimensions j $(i < j < n)$, with probability $p_{p,j}$ $(= p_s^{j-i-1})$. In the former case, the mean service time seen by the message is the mean waiting time to acquire the ejection channel, $w_{qd}$, if there is blocking, and the service time of the message through the ejection channel. In the latter case, the message may experience blocking delays at each crossed subsequent dimension. So, $T_i$ can be expressed as

$$T_i = L_m + w_{qd} + (1 - p_{t,i}) \sum_{j=i+1}^{n-1} p_{p,j} W_j \tag{6}$$

where $w_{qd}$ is given as [12]

$$w_{qd} = \lambda_g L_m^2 \tag{7}$$

$W_i$ is the sum of the mean blocking delays experienced by messages at the $\bar{k}$ channels within the dimension. Therefore $W_i$ is written as

$$W_i = \sum_{l=1}^{\bar{k}} w_{i,l} \tag{8}$$
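Given the per-dimension blocking delays $W_i$, Eqs. (5) to (7) can be evaluated directly. The sketch below uses a hypothetical function name and treats the $W_i$ values as known inputs; in the full model they depend in turn on the service times, so a complete solution would need to iterate, which is not attempted here.

```python
from typing import List, Sequence, Tuple


def service_times(n: int, k: int, l_m: float, lam_g: float,
                  w: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Evaluate T_i (Eq. 6) and S_i (Eq. 5) for i = 0..n-1.

    w[i] is the total blocking delay W_i in dimension i, supplied by
    the caller (an assumption of this sketch).
    """
    p_s = 1.0 / k
    w_qd = lam_g * l_m ** 2  # Eq. (7): waiting time at the ejection channel
    t, s = [], []
    for i in range(n):
        p_t = p_s ** (n - i - 1)  # message terminates after dimension i
        # Blocking in later dimensions j, weighted by p_{p,j} = p_s^(j-i-1).
        later = sum(p_s ** (j - i - 1) * w[j] for j in range(i + 1, n))
        t_i = l_m + w_qd + (1 - p_t) * later
        t.append(t_i)
        s.append(t_i + w[i])  # Eq. (5)
    return t, s


if __name__ == "__main__":
    t, s = service_times(n=2, k=4, l_m=32, lam_g=0.001, w=[2.0, 4.0])
    print(t, s)
```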
Fig. 4. A message path within a dimension.
To calculate $w_{i,l}$, we need to find the probability of blocking at the lth $(1 \le l \le \bar{k})$ physical channel of dimension i. This probability depends on the deterministic routing algorithm used. Let us start by examining the Dally and Seitz-based deterministic routing algorithm described above. A message experiences blocking if at least $V/2$ virtual channels are occupied. In general, when $V - v$ $(0 \le v \le V/2)$ virtual channels are busy, the probability that the $V/2$ required virtual channels are busy is $P_{i,l,V-v}\, C_{V/2}^{V/2-v} / C_V^{V-v}$, where $P_{i,l,v}$ represents the probability that v virtual channels at the lth physical channel of dimension i are busy. $P_{i,l,v}$ is calculated using a Markov chain model. The transition rate $\lambda_c$ $(= f_l + 2^{n-1}\Phi)$ is the total message rate entering a given physical channel (with $f_l$ $(= 2^{n-1}\Phi\bar{k})$ being the message rate continuing in the dimension), and $\zeta$ is the mean service time seen at the exit of the channel. It is $T_{i,l+1}$ if $l < \bar{k}$; otherwise, it is $T_i$. At the steady state, the solution of this model is given by the set of recurrence relations [8]
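$$q_{i,l,v} = \begin{cases} 1 & v = 0 \\ q_{i,l,v-1}\, \lambda_c \zeta & 0 < v < V \\ q_{i,l,v-1}\, \dfrac{\lambda_c \zeta}{1/\zeta - \lambda_c} & v = V \end{cases} \tag{9}$$

$$P_{i,l,v} = \begin{cases} \left( \sum_{j=0}^{V} q_{i,l,j} \right)^{-1} & v = 0 \\ P_{i,l,v-1}\, \lambda_c \zeta & 0 < v < V \\ P_{i,l,v-1}\, \dfrac{\lambda_c \zeta}{1/\zeta - \lambda_c} & v = V \end{cases} \tag{10}$$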
In the Duato-based deterministic routing algorithm, a message is blocked at a given physical channel when the virtual channels $[v_2, v_3, \ldots, v_{V-1}]$ are busy, which occurs with probability $P_a$, and when the virtual channel $v_0$ or $v_1$ (depending on the current and destination addresses) is busy, which occurs with probability $P_d$. The probabilities $P_a$ and $P_d$ are found to be (see [14] for a more detailed derivation of these probabilities)

$$P_a = P_{i,l,V} + \frac{2P_{i,l,V-1}}{C_V^{V-1}} + \frac{P_{i,l,V-2}}{C_V^{V-2}} \tag{11}$$

$$P_d = P_{i,l,V} + \frac{2P_{i,l,V-1}}{C_V^{V-1}} \tag{12}$$
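The combinatorial terms above can be evaluated with a short Python sketch. The function names are hypothetical, and the list `p` is assumed to hold the occupancy distribution of one physical channel, with `p[v]` the probability that exactly v of its V virtual channels are busy. The second function evaluates the Dally and Seitz-based probability described earlier, i.e. the chance that all V/2 channels of the required class are busy, summed over the number v of free channels.

```python
from math import comb
from typing import Sequence, Tuple


def duato_pa_pd(p: Sequence[float]) -> Tuple[float, float]:
    """Evaluate P_a (Eq. 11) and P_d (Eq. 12) from p[0..V]."""
    v = len(p) - 1  # number of virtual channels; assumes V >= 2
    p_a = p[v] + 2 * p[v - 1] / comb(v, v - 1) + p[v - 2] / comb(v, v - 2)
    p_d = p[v] + 2 * p[v - 1] / comb(v, v - 1)
    return p_a, p_d


def dally_seitz_class_busy(p: Sequence[float]) -> float:
    """Probability that the V/2 channels of the required class are all busy.

    With V - v channels busy (v free), the required class is fully busy
    in C(V/2, V/2 - v) of the C(V, V - v) equally likely configurations.
    Assumes an even number of virtual channels.
    """
    v_total = len(p) - 1
    half = v_total // 2
    return sum(
        p[v_total - v] * comb(half, half - v) / comb(v_total, v_total - v)
        for v in range(half + 1)
    )


if __name__ == "__main__":
    # Hypothetical occupancy distribution for V = 4.
    p = [0.4, 0.3, 0.15, 0.1, 0.05]
    print(duato_pa_pd(p))
    print(dally_seitz_class_busy(p))
```

`math.comb` requires Python 3.8 or later.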
Finally, the probability of blocking at a physical channel can be written as

$$P_{bl} = \begin{cases} P_{i,l,V} + \sum_{v=1}^{V/2} P_{i,l,V-v}\, \dfrac{C_{V/2}^{V/2-v}}{C_V^{V-v}} & \text{for the Dally and Seitz-based routing algorithm} \\ P_a P_d & \text{for the Duato-based routing algorithm} \end{cases} \tag{13}$$

In the event of blocking, the message has to wait for the blocking message to release one of the virtual channels at the physical channel that is still to be crossed. This waiting time includes the blocking delays of the blocking message at the ejection channel, at the channels of subsequent dimensions j $(i < j < n)$, and at the physical channels of the current dimension i, increased by the actual transfer time of the blocking message through that physical channel. $w_{i,l}$ $(1 \le l \le \bar{k})$ is then
nv nv
1 1 wi;l ¼ Pbl 1 1 psni1 Lm þ wqd þ 1 psni1 Ti;2 ð14Þ k k with nv being equal to V =2 in Dally and Seitz-based routing algorithm, and ðV 1Þ in Duato-based routing algorithm. The mean service time seen by a message at the entrance of the lth physical channel of dimension i, Ti;l , can be then written as
$$T_{i,l} = \begin{cases} T_{i,l+1} + w_{i,l} & 1 \le l < \bar{k} \\ T_i + w_{i,l} & l = \bar{k} \end{cases} \tag{15}$$

Taking into account all possible ways that messages can take to enter the network, the mean network latency, excluding the effects of virtual channel multiplexing and the mean waiting time at the source node, is given by

$$\bar{L} = \bar{d} + \sum_{i=0}^{n-1} (1 - p_s)\, p_s^{i}\, S_i \tag{16}$$
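Eq. (16) weights each $S_i$ by the probability that a message first enters the network through dimension i. A minimal sketch, with a hypothetical function name and with the $S_i$ values supplied by the caller:

```python
from typing import Sequence


def mean_network_latency(n: int, k: int, s: Sequence[float]) -> float:
    """Evaluate Eq. (16): mean distance plus the weighted sum of S_i.

    s[i] is the mean latency seen by a message entering through
    dimension i; (1 - p_s) * p_s**i is the probability that dimension i
    is the first one the message actually crosses.
    """
    p_s = 1.0 / k
    d_bar = n * k / 4.0  # Eqs. (1) and (2)
    return d_bar + sum((1 - p_s) * p_s ** i * s[i] for i in range(n))


if __name__ == "__main__":
    # Hypothetical S_i values for a 4-ary 2-cube.
    print(mean_network_latency(2, 4, [38.024, 37.024]))
```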
The average multiplexing degree of virtual channels at the lth $(1 \le l \le \bar{k})$ physical channel of dimension i is [8]

$$\bar{V}_{i,l} = \frac{\sum_{j=0}^{V} j^2 P_{i,l,j}}{\sum_{j=0}^{V} j\, P_{i,l,j}} \tag{17}$$

The overall average multiplexing degree in the network, $\bar{V}$, is therefore

$$\bar{V} = \frac{\sum_{i=0}^{n-1} \sum_{l=1}^{\bar{k}} \bar{V}_{i,l}}{n\bar{k}} \tag{18}$$
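Eqs. (17) and (18) can be sketched in a few lines of Python. The function names and the nested-list layout `p_all[i][l]` (one occupancy distribution per physical channel) are assumptions for illustration; the sketch also assumes a non-zero load, since the denominator of Eq. (17) vanishes when a channel is never busy.

```python
from typing import Sequence


def multiplexing_degree(p: Sequence[float]) -> float:
    """Eq. (17): ratio of the second to the first moment of occupancy.

    p[j] is the probability that j virtual channels are busy.
    """
    num = sum(j * j * pj for j, pj in enumerate(p))
    den = sum(j * pj for j, pj in enumerate(p))
    return num / den


def network_multiplexing_degree(p_all: Sequence[Sequence[Sequence[float]]]) -> float:
    """Eq. (18): average of Eq. (17) over all channels of all dimensions."""
    degrees = [multiplexing_degree(p) for dim in p_all for p in dim]
    return sum(degrees) / len(degrees)


if __name__ == "__main__":
    print(multiplexing_degree([0.25, 0.25, 0.5]))
```

When at most one virtual channel is ever busy the degree is 1, so the physical channel bandwidth is not shared; the value grows towards V as occupancy rises.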
The effects of blocking at the source must also be included. A message can enter the network at any of the V virtual channels. The mean service time seen by the message at the entrance of the network is equal to $\bar{L}\bar{V}$. So, the mean waiting time at the source is

$$w_{qs} = \frac{\dfrac{\lambda_g}{V}\, \bar{L}\bar{V}}{\left( 1 - \dfrac{\lambda_g}{V}\, \bar{L}\bar{V} \right) \dfrac{1}{\bar{L}\bar{V}}} \tag{19}$$

Scaling the mean network latency by $\bar{V}$ to reflect the effects of virtual channel multiplexing, and including the message blocking at the source, gives the mean message latency as

$$\text{Latency} = w_{qs} + \bar{L}\bar{V} \tag{20}$$
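The final step can be sketched as follows, under the assumption that the source queue behaves as an M/M/1 queue with arrival rate $\lambda_g / V$ and mean service time $\bar{L}\bar{V}$. The function name and the error raised at saturation are choices made for this sketch, not part of the model description.

```python
def mean_message_latency(lam_g: float, v: int, l_bar: float, v_bar: float) -> float:
    """Combine source waiting time and scaled network latency (Eq. 20).

    Assumes an M/M/1 source queue: with utilisation rho = rate * service,
    the mean waiting time is rho * service / (1 - rho).
    """
    service = l_bar * v_bar   # network latency scaled by multiplexing degree
    rate = lam_g / v          # traffic offered to each virtual channel
    rho = rate * service
    if rho >= 1.0:
        raise ValueError("source queue is saturated; no steady state")
    w_qs = rho * service / (1.0 - rho)
    return w_qs + service


if __name__ == "__main__":
    print(mean_message_latency(lam_g=0.01, v=2, l_bar=20.0, v_bar=1.0))
```

As the utilisation approaches 1 the waiting time grows without bound, which corresponds to the saturation region where the latency curves rise sharply.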
5. Model validation

The above models have been validated through discrete-event simulators operating at the flit level. In each simulation experiment, a total of 100 000 messages are delivered. Statistics gathering was inhibited for the first 10 000 messages to avoid distortions due to the initial start-up conditions. The network cycle time in the simulator is defined as the transmission time of a single flit across a physical channel. Messages are generated at each node according to a Poisson process with a mean rate of $\lambda_g$ messages/cycle. The mean message length is $L_m$ flits. Destination nodes are determined using a uniform random number generator. The mean message latency is defined as the mean amount of time from the generation of a message until the last data flit reaches the local PE at the destination node. The other measures include the mean network latency, i.e. the time taken to cross the network, and the mean waiting time at the source node, i.e. the time spent in the local queue before crossing the first network channel. Extensive simulation experiments have been conducted on different network sizes, different numbers of virtual channels, and different message lengths, and the general conclusions have been found to be consistent across all the cases considered.
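The traffic assumptions used in the simulation can be illustrated with a short generator. This is only a sketch of the workload, not the flit-level simulator itself: the function names, the tuple layout, the exclusion of the source from the destination choice, and the rounding of message lengths to whole flits are all assumptions.

```python
import random
from typing import List, Tuple

Message = Tuple[float, int, int, int]  # (generation time, source, destination, flits)


def generate_traffic(num_nodes: int, lam_g: float, mean_len: float,
                     num_messages: int, seed: int = 0) -> List[Message]:
    """Generate messages in time order for the whole network.

    Independent Poisson sources of rate lam_g at each node are merged
    into one Poisson stream of rate num_nodes * lam_g whose source node
    is chosen uniformly at random.
    """
    rng = random.Random(seed)
    now = 0.0
    messages = []
    for _ in range(num_messages):
        now += rng.expovariate(num_nodes * lam_g)
        src = rng.randrange(num_nodes)
        dst = rng.randrange(num_nodes - 1)
        if dst >= src:
            dst += 1  # uniform over all nodes other than the source
        flits = max(1, round(rng.expovariate(1.0 / mean_len)))
        messages.append((now, src, dst, flits))
    return messages


def discard_warmup(messages: List[Message], warmup: int) -> List[Message]:
    """Drop the first `warmup` messages before gathering statistics."""
    return messages[warmup:]


if __name__ == "__main__":
    msgs = generate_traffic(num_nodes=256, lam_g=0.001, mean_len=32,
                            num_messages=1000, seed=1)
    print(len(discard_warmup(msgs, 100)))
```

In the experiments described above the corresponding figures would be 100 000 delivered messages with the first 10 000 excluded from the statistics.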
Fig. 5. Latency predicted by the model against simulation results in Dally and Seitz-based routing algorithm. $N = 16 \times 16$ nodes. (a) $V = 4$, (b) $V = 8$ and (c) $V = 16$.

Fig. 6. Latency predicted by the model against simulation results in Duato-based routing algorithm. $N = 16 \times 16$ nodes. (a) $V = 4$, (b) $V = 8$ and (c) $V = 16$.

Fig. 7. Latency predicted by the model against simulation results in Dally and Seitz-based routing algorithm. $N = 8 \times 8 \times 8$ nodes. (a) $V = 4$, (b) $V = 8$ and (c) $V = 16$.

Fig. 8. Latency predicted by the model against simulation results in Duato-based routing algorithm. $N = 8 \times 8 \times 8$ nodes. (a) $V = 4$, (b) $V = 8$ and (c) $V = 16$.
However, for illustrative purposes, the model validation is shown for the following parameters:
– network size $N = 16 \times 16$ and $N = 8 \times 8 \times 8$ nodes,
– mean message length $L_m = 32$ and 100 flits,
– number of virtual channels $V = 4$, 8, and 16.

Figs. 5–8 depict the mean message latency results predicted by the model against those provided by the simulation experiments for the Dally and Seitz-based and Duato-based routing algorithms. The x-axis in the figures represents the traffic rate at which a node injects messages into the network in a cycle. The y-axis shows the mean message latency in crossing from source to destination, including waiting time at source and destination. The figures reveal that in general the model predicts the mean message latency with a reasonable degree of accuracy when the network operates in the steady-state regions. The simplicity of the proposed model makes it a practical and cost-effective tool for studying the effects of using and organising virtual channels on the performance of k-ary n-cubes.
6. Performance comparison

As mentioned earlier, virtual channels improve network performance since they reduce message blocking at moderate and high traffic loads. This can be verified by examining Figs. 5–8 as the number of virtual channels V is increased. However, since virtual channels increase the complexity of the router hardware circuitry and therefore its cost [4], it is essential to find the best way to organise these channels in order to take full advantage of them. This section investigates this issue by comparing the performance of the two deterministic routing algorithms resulting from the two organisations of virtual channels discussed in Section 3.

Figs. 9 and 10 show the mean message latency obtained for message lengths of 32 flits and 100 flits, respectively, as a function of the traffic rate injected by each node into the network. The number of virtual channels used per physical channel has been varied between 4, 8, and 16. The results are shown for a 2-dimensional torus with a system size of 256 nodes, but the general conclusions can be applied to higher-dimensional versions of the k-ary n-cube, larger system sizes and other message lengths. The results reveal that in the absence of message contention, when the traffic load is low, both routing algorithms provide similar results, irrespective of the number of virtual channels used per physical channel. This is because when the traffic intensity in the network is light, the probability that messages are blocked is negligible, and therefore both approaches behave almost identically. As the traffic increases, however, the two routing algorithms exhibit different performance behaviour because they allocate virtual channels to messages differently. Both figures reveal that Duato-based deterministic routing provides much lower message latency. This is because the Duato-based algorithm uses the virtual channels more efficiently and so reduces the delays due to message blocking.
A message can choose at a given routing step one of $(V-1)$ virtual channels, and as a result has the opportunity of using $(V-1)$ bypass lanes to escape blocking. In contrast, the Dally and Seitz-based deterministic routing algorithm provides a message with less opportunity to avoid blocking since it can choose only one out of the $V/2$ virtual channels at a given routing step. It is worth pointing out that the results in the two figures also reveal that the improvement in performance as the number of virtual channels increases is more significant with the Duato-based deterministic routing algorithm. This is because when the number of virtual channels increases, the number of extra bypass lanes that a message can use is higher than in the Dally and Seitz-based routing algorithm.

Fig. 9. Comparison of the two deterministic routing algorithms in $16 \times 16$ torus, $L_m = 32$ flits.

Fig. 10. Comparison of the two deterministic routing algorithms in $16 \times 16$ torus, $L_m = 100$ flits.
7. Conclusions

A number of studies have shown that virtual channels improve performance by reducing message blocking delays inside the network. However, they can slow down router speed as they cause an increase in hardware complexity. Therefore, it is critical to work out the best way of organising virtual channels in order to optimise performance. To this end, this paper has discussed two different ways of organising virtual channels in wormhole-routed k-ary n-cubes, leading to two deterministic routing algorithms. A model to compute the message latency for each routing algorithm has been proposed and validated through simulation experiments. The results have shown that there is close agreement between the model and simulation. The proposed model has then been employed to evaluate the performance of k-ary n-cubes with the two different deterministic routing algorithms. The results have revealed that the deterministic routing algorithm that enables a message to select among a larger number of virtual channels at each routing step provides the better performance. An interesting line of future research would be to investigate the performance merits of virtual channels when routers with deep buffers are used, taking into account important non-uniform traffic patterns, such as hotspot and matrix transpose.
References

[1] A. Agarwal, Limits on interconnection network performance, IEEE Trans. Parall. Distr. Syst. (2) (1991) 398–412.
[2] E. Anderson, J. Brooks, C. Grassl, S. Scott, Performance of the Cray T3E multiprocessor, in: Proc. Supercomputing Conference, 1997.
[3] J.R. Anderson, S. Abraham, Multi-dimensional network performance with unidirectional links, Proc. Int. Conf. Parall. Proc. (1997) 26–33.
[4] A.A. Chien, A cost and speed model for k-ary n-cube wormhole routers, IEEE Trans. Parall. Distr. Syst. 9 (2) (1998) 150–162.
[5] B. Ciciani, M. Colajanni, C. Paolucci, Performance evaluation of deterministic wormhole routing in k-ary n-cubes, Parall. Comput. 24 (14) (1998) 235–252.
[6] W.J. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comput. 36 (5) (1987) 547–553.
[7] W.J. Dally, Performance analysis of k-ary n-cubes interconnection networks, IEEE Trans. Comput. 39 (6) (1990) 775–785.
[8] W.J. Dally, Virtual channel flow control, IEEE Trans. Parall. Distr. Syst. 3 (2) (1992) 194–205.
[9] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, J. Parall. Distr. Comput. (23) (1994) 202–214.
[10] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parall. Distr. Syst. 4 (12) (1993) 1320–1331.
[11] P.T. Gaughan, S. Yalamanchili, A performance model of pipelined k-ary n-cubes, IEEE Trans. Comput. 44 (8) (1995) 1059–1063.
[12] L. Kleinrock, Queueing Systems (1), John Wiley, New York, 1975.
[13] R.E. Kessler, J.L. Schwarzmeier, Cray T3D: A new dimension for Cray research, Compcon (1993) 176–182.
[14] S. Loucif, M. Ould-Khaoua, L.M. Mackenzie, Analysis of fully adaptive wormhole routing in tori, Parall. Comput. 25 (12) (1999) 1477–1487.
[15] S. Loucif, Performance evaluation of distributed crossbar switch hypermesh, Ph.D. Dissertation, Comp. Sci. Dept., Glasgow University, 1999.
[16] L.M. Ni, K. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Comput. (26) (1993) 62–76.
[17] M. Noakes et al., The J-machine multicomputer: An architectural evaluation, Proc. 20th Int. Symp. Comput. Architect. (1993).
[18] Paragon XP/S Product Overview, Intel Corporation, Supercomputer Systems Division, Beaverton, Or. (1991).
[19] C. Peterson et al., iWARP: A 100-MOPS LIW microprocessor for multicomputers, IEEE Micro 11 (13) (1991) 26–37.
[20] H. Sarbazi-Azad, A. Khonsari, M. Ould-Khaoua, Analysis of deterministic routing in k-ary n-cubes with virtual channels, Proc. 8th Int. Conf. Parall. Distr. Syst. (2001) 509–516.
[21] A.S. Vaidya, A. Sivasubramaniam, C.R. Das, Impact of virtual channels and adaptive routing on application performance, IEEE Trans. Parall. Distr. Syst. 12 (2) (2001) 223–237.