Future Generation Computer Systems 24 (2008) 461–474 www.elsevier.com/locate/fgcs
An accurate mathematical performance model of adaptive routing in the star graph A.E. Kiasari a,b , H. Sarbazi-Azad b,a,∗ , M. Ould-Khaoua c a IPM School of Computer Science, Tehran, Iran b Sharif University of Technology, Tehran, Iran c University of Glasgow, Glasgow, UK
Received 12 December 2006; received in revised form 22 June 2007; accepted 22 June 2007 Available online 18 July 2007
Abstract Analytical modelling is the most cost-effective method to evaluate the performance of a system. Several analytical models have been proposed in the literature for different interconnection networks. This paper proposes an accurate analytical model to predict message latency in wormhole-switched star graphs with fully adaptive routing. Although the focus of this research is on the star graph, the modelling approach can also be applied to some other regular and irregular interconnection networks. The results obtained from simulation experiments confirm that the proposed model exhibits good accuracy for various network sizes and under different operating conditions. © 2007 Elsevier B.V. All rights reserved.
Keywords: Multicomputers; Interconnection networks; Star graph; Adaptive routing; Wormhole switching; Message latency; Performance evaluation; Modelling
∗ Corresponding address: Department of Computer Engineering, Sharif University of Technology, Azadi Street, Tehran, Iran. Tel.: +98 21 22280332; fax: +98 21 22828687. E-mail addresses: [email protected] (A.E. Kiasari), [email protected], [email protected] (H. Sarbazi-Azad), [email protected] (M. Ould-Khaoua).
0167-739X/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2007.06.010
1. Introduction
It is widely recognized that one of the critical components of a multicomputer is the interconnection network used to connect the processing elements together. The star graph [3,5], as an attractive alternative to the well-known hypercube, has received considerable attention in the past [3,4,18]. It provides an interconnection topology for a large number of processors using a low number of communication channels while still providing a high level of fault tolerance [4]. The star graph has many desirable features including vertex and edge symmetry, sublogarithmic degree and diameter, recursive structure, and efficient routing and broadcasting schemes.
Modern parallel routers significantly reduce average latency by using wormhole switching [10]. Wormhole switching divides each packet into elementary units called flits, each of a few bytes for transmission and flow control, and advances each flit as soon as it arrives at a node. The
header flit (containing routing information) governs the route and the remaining data flits follow it in a pipelined fashion. If a channel transmits the header of a message, it must transmit all the remaining flits of the same message before transmitting flits of another message. Once the header is blocked, the data flits are blocked in their place. Wormhole is attractive because it reduces the latency of message delivery compared to store and forward and requires only a few flit buffers per node. Network throughput of wormhole routed networks can be increased by organizing the flit buffers associated with each physical channel into several virtual channels [10]. These virtual channels are allocated independently to different packets and compete with each other for the physical bandwidth. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be wasted. Most multicomputer interconnection networks, including star graphs [3], provide multiple physical paths for routing a message between two given nodes. This introduces the problem of choosing a route between many alternatives. Many practical multicomputers [14,24] have adopted deterministic routing where messages with the same source and destination addresses always take the same network path. This form of routing has been popular because it requires a simple
deadlock-avoidance algorithm, resulting in a simple router implementation. However, messages cannot use alternative paths to avoid congested channels and thereby reduce their latency. Fully adaptive routing has often been suggested to overcome this limitation by enabling messages to explore all available paths. Several authors, such as Lin et al. [21] and Su and Shin [30], have proposed fully adaptive routing algorithms that achieve deadlock freedom with a minimal requirement for virtual channels, allowing for an efficient router implementation [7]. Several deterministic and adaptive routing algorithms [3,8,22] have been proposed for the star graph. Recently, we proposed a new high-performance fully adaptive routing algorithm for the star graph that exhibits superior performance over other algorithms proposed in the literature [18].
Mathematical models are cost-effective and versatile tools for evaluating system performance under different design alternatives. The significant advantage of analytical models over simulation is that they can be used to obtain performance results for large systems, and for network configurations and working conditions that may not be feasible to study by simulation on conventional computers because of the excessive computation demands. Several researchers have recently proposed analytical models of popular interconnection networks, e.g. k-ary n-cubes, tori, hypercubes, and meshes [1,9,23,29]. The most difficult part in developing any analytical model of adaptive routing is the computation of the probability of message blocking at a given router, because of the number of combinations that have to be considered when enumerating the paths that a message may have used to reach its current position in the network.
Almost all studies on star interconnection networks focus on topological properties and algorithmic issues. There has been hardly any study on performance evaluation and analytical modelling of such networks.
In [19], we introduced the first analytical model to predict the average message latency as a performance measure in wormhole star networks using the high-performance routing algorithm proposed in [18]. In this paper, we improve the accuracy of the blocking probabilities computed in [19], resulting in a much more accurate performance model for predicting the average message latency in the star graph. We then use this model to analyse the effect of some important parameters on the overall performance of the network.
The rest of this paper is organized as follows. In Section 2, the star graph and its router architecture are described. In Section 3, adaptive wormhole routing in the star graph is discussed. Section 4 compares, by simulation, the performance of the algorithms defined in Section 3. In Section 5, we propose mathematical models for latency and throughput in wormhole star graphs. The proposed performance models are validated in Section 6 using results obtained from simulation experiments. In Section 7, we use the proposed analytical model to study the performance merits of the star interconnection network with fully adaptive routing and virtual channels. Finally, Section 8 concludes the paper.
Fig. 1. The star graph and its router structure; (a) S2 , (b) S3 , (c) S4 and (d) the router structure.
2. The star graph and its router structure
Let Vn be the set of all n! permutations of symbols 1, 2, 3, ..., n. For any permutation v ∈ Vn, if we denote the ith symbol of v by vi, v can be written as v1 v2 ... vn. A star graph defined on n symbols, Sn = (Vn, En), is an undirected graph with n! nodes, where each node v is connected to n − 1 nodes which can be obtained by interchanging the first and ith symbols of v, i.e. [v1 v2 ... vi vi+1 ... vn, vi v2 ... v1 vi+1 ... vn] ∈ En, for 2 ≤ i ≤ n. We call these n − 1 connections dimensions. Thus each node is connected to n − 1 nodes through dimensions 2, 3, ..., n. The Sn is also called an n-star. Fig. 1(a), (b) and (c) show S2, S3 and S4, respectively.
The star graph is an attractive alternative to the hypercube [3], and compares favourably with it in several aspects [12]. For example, the degree and diameter of Sn are n − 1 and ⌊3(n − 1)/2⌋, respectively, i.e. sublogarithmic in the number of nodes of Sn, while a hypercube with Θ(n!) nodes has a degree and a diameter of Θ(log n!) = Θ(n log n), i.e. logarithmic in the number of nodes. Much work has been done in the past to study both the topological properties and parallel algorithms of the star graph [25,26,28].
Each node in the star graph is uniquely indexed by an n-tuple using the n numbers corresponding to a permutation on the symbol set {1, 2, ..., n}. We assume that adjacent nodes are connected by two unidirectional communication links (or a bidirectional channel). Each physical channel has some, say
V , virtual channels that share the bandwidth of the physical channel in a multiplexed fashion. Also each input/output virtual channel has incoming/outgoing buffers as shown in Fig. 1(d). In this paper, a communication channel or communication link should be taken to mean a physical channel. Every physical channel, virtual channel and message originating from a node can be given unique numbers based on the address of the node.
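The definition of Sn given in this section can be checked on a small instance. The following is a minimal Python sketch, not taken from the paper: the helper names `neighbours` and `distances_from` are illustrative assumptions. It builds S4 by breadth-first search from the identity permutation and reports the number of nodes, the node degree and the diameter.

```python
from collections import deque
from math import factorial


def neighbours(node):
    """Nodes reached by swapping the first symbol with the i-th one (i = 2..n)."""
    result = []
    for i in range(1, len(node)):
        swapped = list(node)
        swapped[0], swapped[i] = swapped[i], swapped[0]
        result.append(tuple(swapped))
    return result


def distances_from(source):
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


n = 4  # small example size, chosen only for illustration
identity = tuple(range(1, n + 1))
dist = distances_from(identity)
# Number of nodes, degree, and diameter of S_n as found by the search.
print(len(dist), len(neighbours(identity)), max(dist.values()))
```

The three printed values can be compared with n!, n − 1 and ⌊3(n − 1)/2⌋ from the text.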
Fig. 2. Positive-hop (PHop) virtual channel selection rule.
3. Wormhole routing in star graphs
The routing problem requires a node (the source node) to send a message to another node (the destination node). Each wormhole routing algorithm includes two important parts: (1) physical channel selection, and (2) virtual channel selection. The physical channel selection rule chooses the next physical channel to route the message, while the virtual channel selection rule indicates the proper virtual channel of the selected physical channel. Each of these selection rules can be deterministic or adaptive.
3.1. Physical channel selection rule
Suppose C = c1 c2 ... cn and D = d1 d2 ... dn (1 ≤ ci, di ≤ n) are the current and destination nodes in Sn, respectively. Only two rules are involved in finding the next physical channel [3]:
R1 If c1 ≠ d1, exchange c1 with the symbol ci for which c1 = di.
R2 If c1 = d1, exchange c1 with any ci for which ci ≠ di.
It has been shown [3] that these rules ensure a minimum-length path between any source and destination but cannot find all the minimal paths. Misic [22] proposed a minimal fully adaptive routing algorithm by analysing paths in star graphs using transposition trees, and by deriving some of their algebraic properties. In this research, we use the adaptive physical channel selection rule. This means that each physical channel that brings the message closer to the destination may be used for the next hop.
3.2. Virtual channel selection rules
After selecting the next physical channel, a suitable virtual channel must be selected. This is usually done in order to prevent deadlock or to increase performance. In the rest of this section, three basic virtual channel selection rules are described. We will then modify them to achieve higher performance.
3.2.1. The positive-hop (PHop) policy
In the well-known positive-hop policy [8] a message is placed in a virtual channel of class 1 in the source node upon injection into the network.
During routing, a message is placed in class i + 1 at an intermediate node if it has already completed i hops, as shown in Fig. 2. Since the maximum number of hops a message can take equals the diameter of the network, the maximum number of classes required in each physical channel equals the diameter of the network; for Sn, this number equals ⌊3(n − 1)/2⌋. It is clear that any number of virtual channels can be used in each class.
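Rules R1 and R2 of Section 3.1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' router: nodes are tuples of symbols, dimensions are numbered 2..n as in Section 2, and the hypothetical helper `route` always takes the first dimension that the rules allow, so it yields one of the minimal paths rather than all of them.

```python
def next_dimensions(current, dest):
    """Dimensions (2..n) allowed by rules R1 and R2 for the next hop."""
    if current == dest:
        return []
    if current[0] != dest[0]:
        # R1: move the first symbol to the position it occupies in dest.
        return [dest.index(current[0]) + 1]
    # R2: the first symbol is already correct; any misplaced symbol may be chosen.
    return [i + 1 for i in range(1, len(current)) if current[i] != dest[i]]


def hop(node, dimension):
    """Swap the first symbol with the symbol at 1-based position `dimension`."""
    out = list(node)
    out[0], out[dimension - 1] = out[dimension - 1], out[0]
    return tuple(out)


def route(source, dest):
    """Follow R1/R2, always taking the first allowed dimension (an assumption)."""
    path, node = [], source
    while node != dest:
        dimension = next_dimensions(node, dest)[0]
        path.append(dimension)
        node = hop(node, dimension)
    return path


print(route((1, 3, 2), (1, 2, 3)))
```

Under PHop, the hop at position i (counting from zero) of the returned path would use virtual channel class i + 1, so the path length never exceeds the number of classes.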
Fig. 3. Negative-hop (NHop) virtual channel selection rule.
3.2.2. The negative-hop (NHop) policy
In the negative-hop policy [8], the network is partitioned into several subsets, such that no subset contains adjacent nodes (this is equivalent to the well-known graph colouring problem). If χ is the number of subsets, then the subsets are labelled 1, 2, ..., χ, and nodes in subset i are labelled (or coloured) as i. A hop is a negative hop if it is from a node with a higher label to a node with a lower label; otherwise, it is a nonnegative hop. If H is the diameter of the network and χ is the number of colours, then the maximum number of negative hops that can be taken by a message is H_N = ⌈H(χ − 1)/χ⌉ [15]. The structure of Sn is a bipartite graph [16], and its nodes can be partitioned into two subsets; therefore, it can be coloured using only two colours (χ = 2). Because adjacent nodes are in distinct partitions, the maximum number of negative hops a message may take is at most half the diameter of Sn, which equals ⌈⌊3(n − 1)/2⌋/2⌉. In order to derive the negative-hop wormhole routing algorithm, ⌈⌊3(n − 1)/2⌋/2⌉ virtual channels must be used for each physical channel in Sn. When a message is generated, the total number of negative hops taken is set to zero. A message occupies a virtual channel of class i + 1 at an intermediate node if and only if the message has taken exactly i negative hops to reach that intermediate node. In other words, a message that is currently in a virtual channel of class i can only wait for a virtual channel of either class i (if it is waiting for a nonnegative hop) or class i + 1 (if it is waiting for a negative hop). An example of the negative-hop scheme is shown in Fig. 3.
3.2.3. Misic's policy
The two previous policies can be used for any topology, but Misic's policy applies only to the star graph. By analysing the star graph via group theory, Misic [22] suggested a new virtual channel selection rule for the star graph. Let gi be a physical channel in dimension i and gi1 gi2 gi3 ... gim be an m-hop path from the source node to the destination node. When we write gik ≺ gik−1, it indicates that a dpo (disruption of partial ordering) has occurred in the path. When a message is generated, the total number of corresponding dpo's that have occurred is set to zero. In an intermediate node, a message reserves a virtual channel of class i if exactly i dpo's have occurred. Misic has
proved that the maximum number of dpo's in Sn is n − 2. Therefore, in wormhole routing, by providing n − 1 virtual channels for each physical channel, a deadlock-free routing algorithm is obtained. An example of Misic's policy is shown in Fig. 4. All of the virtual channel selection rules described above (PHop, NHop and Misic) are deterministic, because there is no freedom in choosing a virtual channel: at each hop the message can use just one specified virtual channel.
3.3. Improving virtual channel selection rules
The basic virtual channel selection rules (PHop, NHop and Misic) make unbalanced use of the virtual channels because all messages start their journey from virtual channel 1. However, very few messages take the maximum number of hops and use all the virtual channels, and thus virtual channels with high numbers are rarely used. It is noteworthy that, as can be seen in our simulation results, the channel usage is identical for all traffic rates and any virtual channel set size. The basic virtual channel selection rules can be improved by giving each header flit a number of bonus cards [5]. For the positive-hop with bonus cards policy (Pbc), the number of bonus cards is equal to the diameter of the network minus the number of hops the message is going to take. For the negative-hop with bonus cards scheme (Nbc), it is equal to the number of virtual channel classes minus the number of negative hops required to reach the destination node. For the Misic with bonus cards rule (Mbc), the number of bonus cards is equal to the number of virtual channel classes minus the maximum number of dpo's minus one. More details can be found in [18]. In the improved selection rules, the header flit has some flexibility in the selection of classes, depending on the distance between source and destination. Fig. 5 shows a message that starts its journey from node S, takes dS hops to reach node M, and still requires dD hops to reach the destination (node D). With the Pbc policy as the virtual channel selection rule, the range of channel classes that can be selected for the next hop at node M is [dS + 1, V − dD + 1].
Our simulation results show that the channel utilization of the improved algorithms differs across traffic rates. For low traffic rates, the virtual channel utilization is the same as that of the basic routing algorithms. Fig. 6 shows the average usage rate of the different virtual channels in S5 for the basic routing algorithms (PHop, NHop and Misic) and, at high traffic rates, for the improved routing algorithms (Pbc, Nbc and Mbc).
Fig. 4. Misic's virtual channel selection rule.
Fig. 5. Range of selectable classes in Pbc selection rule.
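The Pbc class range stated in Section 3.3 can be illustrated with a small Python sketch. The function names are hypothetical, and `num_classes` stands for the upper limit written as V in the range above; the example values correspond to S5 with six PHop classes.

```python
def pbc_class_range(d_s, d_d, num_classes):
    """Classes selectable at node M under Pbc: [d_S + 1, num_classes - d_D + 1]."""
    low = d_s + 1
    high = num_classes - d_d + 1
    if low > high:
        raise ValueError("path is longer than the number of classes allows")
    return list(range(low, high + 1))


def bonus_cards(diameter, path_length):
    """Bonus cards given to a header flit under Pbc (diameter minus hops to take)."""
    return diameter - path_length


# A 3-hop message in S5 (diameter 6) that has taken 1 hop and has 2 hops left.
print(pbc_class_range(1, 2, 6), bonus_cards(6, 3))
```

In this example the number of selectable classes is one more than the number of bonus cards, which matches the flexibility the bonus cards are meant to provide.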
3.4. Enhanced virtual channel selection rules
In this section, we further improve the virtual channel selection rules using another methodology. According to this methodology [14], virtual channels are divided into adaptive (class 1) and deadlock-free (class 2) virtual subnetworks. At each step, a message may adaptively use any available virtual channel from class 1. If all the virtual channels belonging to class 1 are busy, it uses a virtual channel from class 2 according to one of the deadlock-free routing algorithms described in the previous sections (basic or improved). The virtual channels of class 2 define a complete deadlock-free virtual subnetwork, which acts like a "drain" for the virtual subnetwork built from the virtual channels belonging to class 1. If we employ PHop or Pbc, NHop or Nbc, and Misic or Mbc routing algorithms, at least ⌊3(n − 1)/2⌋, ⌈⌊3(n − 1)/2⌋/2⌉ and n − 1 virtual channels, respectively, are required for class 2. For example, for an S7 with 5040 nodes these values are 9, 5 and 6 virtual channels per physical channel. Network performance is maximized when the extra virtual channels are added to the adaptive virtual channels in class 1 [14]. Thus, the best performance is achieved when class 2 contains the minimum required number of virtual channels and the extra virtual channels are allocated to class 1. As discussed in the previous sections, the improved routing algorithms (Pbc, Nbc and Mbc) perform better than the basic routing algorithms (PHop, NHop and Misic). Therefore we use Pbc, Nbc and Mbc for routing in class 2 and name these new routing algorithms EPbc, ENbc and EMbc, respectively.
4. Simulation experiments
To compare the performance of these routing algorithms, we have developed an event-driven simulator that works at the flit level. This simulator can be used for star networks of any size with wormhole switching.
We compare the performance of nine deadlock-free wormhole routing algorithms: three basic ones (PHop, NHop and Misic), three improved ones (Pbc, Nbc and Mbc), and three enhanced ones (EPbc, ENbc and EMbc). We have simulated these nine routing algorithms for two star networks, S5 and S6, with 120 and 720 nodes, respectively. In addition, we have considered fixed-length messages of 32 and 64 flits. Nodes generate traffic independently of each other, following a Poisson process. For the destination address of each message, we have considered the uniform traffic pattern. Messages are transferred to the local PE through the ejection channel as soon as they arrive at their destinations. We are interested in the average channel flit arrival rate and average latency. The average channel flit arrival rate refers to
Fig. 6. Virtual channel utilization in S5 for basic and improved routing algorithms: (a) PHop and Pbc, (b) Misic and Mbc, (c) NHop and Nbc.
the fraction of the physical channel bandwidth utilized in any time interval when the network is in steady state. It is also called the network utilization factor or normalized throughput of the network. The average channel flit arrival rate, denoted by ρ, is computed as the ratio of network bandwidth utilized to the raw bandwidth available:
\[ \rho = \lambda m_l d \times \frac{\text{Number of nodes}}{\text{Number of channels}}, \]
where λ is the average message generation rate at each node, m_l is the message length (in flits) and d is the average inter-node distance of the network. For an n-star graph, d can be calculated as \( n - 4 + \frac{2}{n} + \sum_{i=1}^{n} \frac{1}{i} \) for the uniform traffic pattern [3]. The number of flits in the message is fixed at 32 and 64 flits. It takes one clock cycle to transmit a flit between neighbouring nodes. Multiple virtual channels mapped to a physical channel share its bandwidth in a time-multiplexed manner; that is, f_t = 1. Therefore, the average channel flit arrival rate in an Sn can be simplified to
\[ \rho = \frac{\lambda m_l}{n-1} \left( n - 4 + \frac{2}{n} + \sum_{i=1}^{n} \frac{1}{i} \right). \]
The numerator computes the average traffic generated by a node, and the denominator gives the available bandwidth due to the physical channels originating from a node.
For each simulation, sufficient warm-up time is provided to allow the network to reach steady state. After the warm-up time, the network traffic is sampled at periodic intervals and the average message latency is calculated by the simulator program. The average latency is plotted against offered traffic (ρ) in Fig. 7 for S5 and S6 with 32- and 64-flit messages. In S5 we need 6, 3 and 4 virtual channel classes for the PHop, NHop and Misic routing algorithms. We have also used 2, 4 and 3
Fig. 7. Performance of basic and improved routing algorithms: (a) S5 with 32-flit message length, (b) S5 with 64-flit message length, (c) S6 with 32-flit message length, (d) S6 with 64-flit message length.
virtual channels in each class, respectively. Therefore, in S5 each physical channel has 12 virtual channels that can ensure a fair comparison under almost equal hardware cost. Similarly, in S6 each physical channel must have 7, 4 and 5 virtual channel classes for PHop, NHop and Misic routing algorithms. In this case, we have used 3, 5 and 4 virtual channels in each class. Therefore, in S6 each physical channel has 21 virtual channels for PHop and 20 virtual channels for NHop and Misic’s routing algorithms. Network configuration for Pbc, Nbc and Mbc routing algorithms is the same as that for PHop, NHop and Misic routing algorithms, respectively. Also in our simulation experiments, all virtual channels are of 1-flit depth.
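The class counts and per-channel totals quoted for S5 and S6 follow directly from the formulas of Section 3. The short Python sketch below recomputes them; the function names and the dictionary layout are illustrative assumptions, and the per-class sizes are the ones stated in the text.

```python
def class_counts(n):
    """Minimum number of virtual channel classes per physical channel in S_n."""
    diameter = (3 * (n - 1)) // 2            # floor(3(n - 1)/2)
    return {
        "PHop": diameter,
        "NHop": (diameter + 1) // 2,         # ceil(diameter / 2)
        "Misic": n - 1,
    }


def total_virtual_channels(n, per_class):
    """Virtual channels per physical channel, given channels per class."""
    counts = class_counts(n)
    return {name: counts[name] * per_class[name] for name in counts}


print(class_counts(5), total_virtual_channels(5, {"PHop": 2, "NHop": 4, "Misic": 3}))
print(class_counts(6), total_virtual_channels(6, {"PHop": 3, "NHop": 5, "Misic": 4}))
```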
In Fig. 7 we see that for low and medium traffic loads, all six routing algorithms have the same latency, but they begin to behave differently around the saturation region. The improved routing algorithms perform better than the three basic algorithms. Fig. 8 shows that, with an equal number of virtual channels per physical channel, the enhanced routing algorithms achieve better throughput than the improved routing algorithms. Fig. 9 shows the simulation results for some other scenarios. It can be seen that the enhanced routing algorithms perform better than the improved algorithms, because in Nbc routing there are more virtual channels in the adaptive class.
Fig. 8. Enhanced routing algorithms versus the improved routing algorithms in S5 (V = 12 and M = 32).
Fig. 9. Enhanced routing algorithms for S5 (V = 8 and M = 32).
5. The analytical model
In this section, we derive an analytical performance model for wormhole fully adaptive routing in a star graph. Owing to the superior performance of the ENbc algorithm, our analysis focuses on this routing algorithm, but the modelling approach can equally be applied to other routing schemes after a few changes to the model. The measure of interest in our model is the average message latency, taken as representative of network performance.
5.1. Model assumptions
The following assumptions are made when developing the proposed performance model. These assumptions have been widely used in the literature [1,6,9,13,17,19,23,27,29]:
(a) Messages are broken into packets of a fixed length of $M$ flits, which are the unit of switching. The flit transfer time between any two routers over a physical channel is assumed to be one cycle.
(b) Message destinations are uniformly distributed across the network nodes.
(c) Nodes generate traffic independently of each other, following a Poisson process with a mean rate of $\lambda_g$ messages/cycle.
(d) Messages are transferred to the local processor through the ejection channel once they arrive at their destination.
(e) $V$ virtual channels per physical channel are used. These virtual channels are used according to the ENbc routing algorithm.
5.2. Model description
The model computes the mean message latency as follows. First, the mean network latency, $S$, that is, the time to cross the network, is determined. Then, the mean waiting time seen by a message in the source node before being injected into the network, $W_s$, is evaluated. Finally, to model the effect of virtual channel multiplexing, the mean message latency is scaled by a factor, $\bar{V}$, representing the average degree of virtual channel multiplexing that takes place at a given physical channel. Therefore, the mean message latency can be written as
\[ \text{Latency} = (S + W_s)\,\bar{V}. \tag{1} \]
The average number of hops that a message makes across the network, $d$, is given by
\[ d = \left( n - 4 + \frac{2}{n} + \sum_{i=1}^{n} \frac{1}{i} \right) \times \frac{n!}{n! - 1}. \tag{2} \]
Fully adaptive routing allows a message to use any available channel that brings it closer to its destination, resulting in an evenly distributed traffic rate on all network channels. A router in $S_n$ has $n - 1$ output channels and the PE generates, on average, $\lambda_g$ messages per cycle. Since each message travels, on average, $d$ hops to cross the network, the rate of messages received by each channel, $\lambda_c$, can be calculated as [2]:
\[ \lambda_c = \frac{\lambda_g d}{n - 1}. \tag{3} \]
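Eqs. (2) and (3) are straightforward to evaluate numerically. The following Python sketch is an illustration only; the function names `mean_distance` and `channel_rate` and the sample generation rate are assumptions.

```python
from math import factorial


def mean_distance(n):
    """Eq. (2): average number of hops to a destination other than the source."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    nodes = factorial(n)
    return (n - 4 + 2.0 / n + harmonic) * nodes / (nodes - 1)


def channel_rate(lambda_g, n):
    """Eq. (3): message arrival rate on each of the n - 1 output channels."""
    return lambda_g * mean_distance(n) / (n - 1)


for size in (5, 6):
    # lambda_g = 0.001 messages/cycle is an arbitrary example value.
    print(size, mean_distance(size), channel_rate(0.001, size))
```

For S3, which is a ring of six nodes, the distances from any node to the other five are 1, 1, 2, 2 and 3, so the mean is 9/5, which is what Eq. (2) gives for n = 3.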
Since the star graph is symmetrical, averaging the network latencies seen by the messages generated by only one node for all other nodes gives the mean message latency in the network. Let S = 123 . . . n (identity permutation) be the source node with linear address 0 and i denotes linear address of the destination node, where 1 ≤ i ≤ n! − 1. The network latency, Si , seen by the message crossing from node 0 to node i consists of two parts: one is the delay due to the actual message
transmission time, and the other is due to the blocking time in the network. Therefore, $S_i$ can be written as
\[ S_i = M + h_i + \sum_{k=1}^{h_i} B_{i,k}, \tag{4} \]
where $M$ is the message length, $h_i$ is the distance between node 0 and node $i$, and $B_{i,k}$ is the mean blocking time seen by a message from node 0 to node $i$ on its $k$th hop. Averaging over all possible destination nodes of a typical message yields the mean network latency as
\[ S = \frac{\sum_{i=1}^{n!-1} S_i}{n! - 1}. \tag{5} \]
5.3. Message blocking time ($B_{i,k}$)
Under the uniform traffic pattern and due to the symmetry of the star graph topology, adaptive routing results in an evenly distributed traffic rate on all network channels. Furthermore, a message sees the same mean waiting time and the same mean service time across all channels regardless of their positions in the network. However, the message sees a different probability of blocking at each channel, as the number of alternative paths that can be selected changes from one channel to the next along the path from the source to the destination node. The probability of blocking depends on the number of output links, and thus on the virtual channels that a message can use at its next hop. We define $f(i, j, k)$ as the number of output channels for the $k$th hop of the $j$th path set (of all possible paths) for the destination node $i$, $1 \le i \le n! - 1$. For example, in Fig. 10, there are four different paths between the source and destination nodes. By traversing dimensions 1232 or 1323 to reach the destination, the numbers of output links that can be used in each hop are equal to 3, 2, 1, and 1, respectively. However, by using dimensions 2321 or 3231 to make a path, the number of output channels in each hop is equal to 3, 1, 1, and 1, respectively. Therefore, there are two path sets: $set_{i,1} = \{1232, 1323\}$ and $set_{i,2} = \{2321, 3231\}$. In these two path sets, a message sees a different probability of blocking at its second hop. Let $P_{\mathrm{block}_{i,k}}$ be the average probability of blocking seen by a message from node 0 to node $i$ on its $k$th hop, and $w$ be the mean waiting time when blocking occurs. The mean blocking time, $B_{i,k}$, is given by
\[ B_{i,k} = P_{\mathrm{block}_{i,k}}\, w. \tag{6} \]
The probability of blocking, $P_{\mathrm{block}_{i,k}}$, can therefore be calculated as
\[ P_{\mathrm{block}_{i,k}} = \frac{1}{N_{set_i}} \sum_{j=1}^{N_{set_i}} P_{\mathrm{block}_{i,j,k}}, \tag{7} \]
where $P_{\mathrm{block}_{i,j,k}}$ is the probability of blocking for the $k$th hop of the $j$th path set for the destination node $i$, and $N_{set_i}$ is the number of path sets for the destination node $i$.
5.3.1. Message blocking probability
Let the $V$ virtual channels associated with each physical channel be grouped into several classes. Let each class have $V_1$ virtual channels. According to the PHop policy, if a message has already completed $i$ hops, it can use only a virtual channel of class $i$. In other words, if the virtual channels of class $i$ are busy, the message is blocked even if virtual channels in other classes are free. With the PHop virtual channel selection rule, where each class contains $k$ virtual channels and the total number of virtual channels per physical channel is $V$, the probability of blocking for a message is given by
\[ \mathrm{PHop}(k) = \sum_{v=k}^{V} P_v \frac{\binom{V-k}{v-k}}{\binom{V}{v}}, \tag{8} \]
where $P_v$ ($0 \le v \le V$) is the probability that $v$ virtual channels at a given physical channel are busy. The probability $P_v$ can be determined by using a Markovian model (details of the model can be found in [11]). It is easy to see that the same probability may be derived for the NHop and Misic schemes.
Let the virtual channel selection rule be Pbc with $C$ classes of $k$ virtual channels per physical channel. Let $i$ be the virtual channel class used in the previous hop. As can be seen in Fig. 5, the range of classes that can be selected for the next hop is $[d_S + 1, C - d_D + 1]$. Then the number of usable classes is $(C - d_D + 1) - (d_S + 1) + 1 = C - d + 1$, where $d = d_S + d_D$. Fig. 5 also shows that $d_S \le i \le C - d_D$, and all of these classes have the same chance of being selected. If a message has already used a virtual channel of class $i$ in its previous hop, it can use a virtual channel of class $i + 1, i + 2, \ldots,$ or $C - d_D + 1$. Therefore, the number of classes from which a virtual channel can be selected for the next hop is $C - d_D - i + 1$. This means that with Pbc a message can use one of the available $k(C - d_D - i + 1)$ virtual channels for the next hop. Thus, the probability that the message is blocked if the virtual channel class used is $i$ can be given as
\[ Pb_{\mathrm{Pbc}}(i) = \mathrm{PHop}(k(C - d_D - i + 1)), \quad d_S \le i \le C - d_D. \]
Thus, the probability that a $d$-hop message is blocked can be given as
\[ \mathrm{Pbc}(d) = \frac{1}{C - d + 1} \times \sum_{i=d_S}^{C-d_D} \mathrm{PHop}(k(C - d_D - i + 1)). \]
By considering that $l = k(C - d_D - i + 1)$, we have
\[ \mathrm{Pbc}(d) = \frac{\sum_{i=k}^{k(C-d+1)} \mathrm{PHop}(i)}{C - d + 1}. \tag{9} \]
Using a similar approach, we can prove that the message blocking probability for the Nbc routing algorithm is equal to
\[ \mathrm{Nbc}(d) = \frac{\sum_{i=k}^{k(C-\lfloor d/2 \rfloor+1)} \mathrm{PHop}(i)}{C - \lfloor d/2 \rfloor + 1} = \mathrm{Pbc}(\lfloor d/2 \rfloor). \tag{10} \]
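The blocking probabilities of Eqs. (8)-(10) can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names and the list `p_busy` (holding $P_v$ for $v = 0, \ldots, V$) are assumptions, and `pbc` evaluates the sum in its form before the substitution $l = k(C - d_D - i + 1)$, that is, over the arguments $k, 2k, \ldots, k(C - d + 1)$. Inputs are assumed to satisfy $k(C - d + 1) \le V$.

```python
from math import comb


def phop(k, p_busy):
    """Eq. (8): probability that k given virtual channels are all busy.

    p_busy[v] is P_v, the probability that v of the V channels are busy.
    """
    V = len(p_busy) - 1
    return sum(p_busy[v] * comb(V - k, v - k) / comb(V, v) for v in range(k, V + 1))


def pbc(d, k, num_classes, p_busy):
    """Blocking probability of a d-hop message under Pbc (sum before Eq. (9))."""
    usable = num_classes - d + 1  # number of usable classes, C - d + 1
    # Assumes k * usable <= V; the sum runs over PHop(k), PHop(2k), ...
    return sum(phop(k * j, p_busy) for j in range(1, usable + 1)) / usable


def nbc(d, k, num_classes, p_busy):
    """Eq. (10): Nbc(d) = Pbc(floor(d / 2))."""
    return pbc(d // 2, k, num_classes, p_busy)


# Toy distribution (an assumption): exactly 2 of V = 4 channels are always busy.
example = [0.0, 0.0, 1.0, 0.0, 0.0]
print(phop(1, example), phop(2, example), pbc(3, 1, 4, example))
```

With exactly two of four channels busy, a single given channel is busy with probability 2/4 and a given pair with probability 1/6, which is what the binomial ratio in Eq. (8) returns.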
Note that the number of virtual channel classes ($C$) differs between the Pbc and Nbc schemes.

Now, let the virtual channel selection rule be EPbc with $V_1$ and $V_2$ virtual channels in the fully adaptive and deadlock-free classes, respectively. In what follows we determine the probability of message blocking for a $d$-hop message that still has $d_D$ hops to reach its destination. Let the virtual channels be divided into two classes as stated before: class 1 (fully adaptive) and class 2 (deadlock-free). In the EPbc routing algorithm, a message may use virtual channels of class 1 and class 2 in any order. To calculate the probability that all virtual channels belonging to class 2 are busy, we define a new concept, the virtual source: a node at which a message arrives via a virtual channel of class 1 and which it leaves via a virtual channel of class 2. Let $i$ denote the distance between the last virtual source and the destination. It has been shown that in the Pbc routing algorithm a message can use one virtual channel among $C - d + 1$ classes. In the EPbc routing algorithm, however, only one virtual channel is used in each class, which means that $C = V_2$. So, during the message's journey from the virtual source to the current node, the probability of using a virtual channel of class 2 in each hop is $(V_2 - i + 1)/(V - i + 1)$, and the distance between the virtual source and the current node is $i - d_D$ hops. Therefore, the probability that the distance between the last virtual source and the destination is $i$ can be given by

\[ P_{vs}(i) = \left(\frac{V_2-i+1}{V-i+1}\right)^{i-d_D+1}, \qquad d_D \le i \le d. \]

Thus, the probability that all virtual channels of class 2 are busy is given by

\[ \mathrm{Pb}_{\text{class 2}} = \sum_{i=d_D}^{d} \mathrm{Pbc}(i)\left(\frac{V_2-i+1}{V-i+1}\right)^{i-d_D+1}. \tag{11} \]
A $d$-hop message which is $d_D$ hops away from its destination is blocked if all virtual channels in class 1 and all usable virtual channels in class 2 are busy. Therefore, the probability that a $d$-hop message which is $d_D$ hops away from its destination is blocked can be given by

\[ \mathrm{EPbc}(d, d_D) = \mathrm{PHop}(V_1)\sum_{i=d_D}^{d} \mathrm{Pbc}(i)\left(\frac{V_2-i+1}{V-i+1}\right)^{i-d_D+1}. \tag{12} \]

Using the same approach, we can derive the probability of message blocking for the ENbc scheme as

\[ \mathrm{ENbc}(d, d_D) = \mathrm{PHop}(V_1)\sum_{i=d_D}^{d} \mathrm{Nbc}(i)\left(\frac{V_2-\lfloor i/2\rfloor+1}{V-\lfloor i/2\rfloor+1}\right)^{i-d_D+1}. \tag{13} \]
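As an illustration, Eqs. (12) and (13) can be evaluated numerically with a short Python sketch. It assumes that the per-hop blocking probabilities PHop, Pbc and Nbc are available as caller-supplied functions (they are defined by the preceding equations and are not re-derived here). The function name `extended_blocking`, its parameters and the toy inputs are hypothetical and serve only this example.

```python
from typing import Callable


def extended_blocking(
    d: int,
    d_dest: int,
    v1: int,
    v2: int,
    v: int,
    p_hop: Callable[[int], float],
    p_class2: Callable[[int], float],
    halve: bool = False,
) -> float:
    """Evaluate Eq. (12) (halve=False, EPbc) or Eq. (13) (halve=True, ENbc).

    p_hop stands in for PHop and p_class2 for Pbc (or Nbc); both are
    assumed to be supplied by the caller.
    """
    total = 0.0
    for i in range(d_dest, d + 1):        # i = d_D, ..., d
        level = i // 2 if halve else i    # floor(i/2) for ENbc, i for EPbc
        ratio = (v2 - level + 1) / (v - level + 1)
        total += p_class2(i) * ratio ** (i - d_dest + 1)
    return p_hop(v1) * total


# Toy inputs: constant probabilities, chosen only to exercise the formulas.
epbc = extended_blocking(2, 1, 2, 2, 4, lambda n: 0.5, lambda i: 1.0)
enbc = extended_blocking(2, 1, 2, 2, 4, lambda n: 0.5, lambda i: 1.0, halve=True)
```

Passing the probability functions in as arguments keeps the sketch independent of how PHop, Pbc and Nbc are actually computed.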
Fig. 10. Alternative paths in star graph.

5.4. Calculation of $w$ and $W_s$

To determine the mean waiting time, $w$, to acquire a virtual channel, a physical channel is treated as an M/G/1 queue with a mean waiting time of [20]

\[ w = \frac{\rho_s S\,(1 + C_S^2)}{2(1-\rho_s)}, \tag{14} \]

\[ \rho_s = \lambda_c S, \tag{15} \]

\[ C_S^2 = \frac{\sigma_S^2}{S^2}, \tag{16} \]

where $\lambda_c$ is the traffic rate on the channel given by Eq. (7), $S$ is its service time calculated by Eq. (9), and $\sigma_S^2$ is the variance of the service time distribution. Since the minimum service time at a channel is equal to the message length, $M$, following a suggestion given in [13], the variance of the service time distribution can be approximated as $\sigma_S^2 = (S - M)^2$. Hence, the mean waiting time becomes

\[ w = \frac{\lambda_c S^2\left(1 + (1 - M/S)^2\right)}{2(1-\lambda_c S)}. \tag{17} \]
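For convenience, the step from Eq. (14) to Eq. (17) can be spelled out as a direct substitution. It uses only Eqs. (15) and (16) together with the variance approximation $\sigma_S^2 = (S-M)^2$ stated above:

```latex
\begin{align*}
C_S^2 &= \frac{\sigma_S^2}{S^2} = \frac{(S-M)^2}{S^2} = \left(1-\frac{M}{S}\right)^2,\\[4pt]
w &= \frac{\rho_s S\,(1+C_S^2)}{2(1-\rho_s)}
   = \frac{(\lambda_c S)\,S\left(1+(1-M/S)^2\right)}{2(1-\lambda_c S)}
   = \frac{\lambda_c S^2\left(1+(1-M/S)^2\right)}{2(1-\lambda_c S)}.
\end{align*}
```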
Similarly, modelling the local queue in the source node as an M/G/1 queue, with the mean arrival rate $\lambda_g/V$ and service time $S$ with an approximated variance $(S - M)^2$, yields the mean waiting time seen by a message at the source node as [20]

\[ W_s = \frac{\dfrac{\lambda_g}{V}\,S^2\left(1 + (1 - M/S)^2\right)}{2\left(1-\dfrac{\lambda_g}{V}\,S\right)}. \tag{18} \]
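Eqs. (17) and (18) share the same M/G/1 form and differ only in the arrival rate, so both can be computed by one helper. The following Python sketch shows this; the function names and the numerical inputs are hypothetical, and the stability check (utilisation below 1) is an added safeguard rather than part of the model description.

```python
def mg1_wait(arrival_rate: float, service_time: float, msg_len: float) -> float:
    """Mean M/G/1 waiting time with variance approximated as (S - M)^2."""
    rho = arrival_rate * service_time            # utilisation, as in Eq. (15)
    if not 0.0 <= rho < 1.0:
        raise ValueError("unstable queue: utilisation must be in [0, 1)")
    cv2 = (1.0 - msg_len / service_time) ** 2    # squared coefficient of variation
    return arrival_rate * service_time ** 2 * (1.0 + cv2) / (2.0 * (1.0 - rho))


def channel_wait(lam_c: float, S: float, M: float) -> float:
    """Eq. (17): waiting time to acquire a virtual channel."""
    return mg1_wait(lam_c, S, M)


def source_wait(lam_g: float, V: int, S: float, M: float) -> float:
    """Eq. (18): waiting time at the source queue, arrival rate lam_g / V."""
    return mg1_wait(lam_g / V, S, M)


# Illustrative values only (not taken from the paper's experiments).
w = channel_wait(0.001, 100.0, 32.0)
ws = source_wait(0.004, 4, 100.0, 32.0)
```

With these toy inputs the two calls see the same arrival rate (0.004/4 = 0.001) and therefore return the same waiting time.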
5.5. Calculation of the average multiplexing degree of virtual channels, $\bar{V}$

The probability, $P_v$, that $v$ virtual channels are busy at a physical channel can be determined using a Markovian model. State $\pi_v$ ($0 \le v \le V$) corresponds to $v$ virtual channels being busy. The transition rate out of state $\pi_v$ to state $\pi_{v+1}$ is the traffic rate $\lambda_c$ (given by Eq. (3)), while the rate out of state $\pi_v$ to state $\pi_{v-1}$ is $1/S$ ($S$ is given by Eq. (5)). The transition rates out of state $\pi_v$ are reduced by $\lambda_c$ to account for the arrival of messages while a channel is in this state. The Markovian model results in the following steady-state probability (derivation explained in [20]), in which the service time of a channel has been approximated as the network latency of that channel:

\[ P_v = \begin{cases} (1-\lambda_c S)(\lambda_c S)^v, & 0 \le v < V,\\ (\lambda_c S)^v, & v = V. \end{cases} \tag{19} \]
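The distribution of Eq. (19) is easy to tabulate. The Python sketch below builds the list $[P_0, \dots, P_V]$ from $\lambda_c$, $S$ and $V$; the function name and the sample values are hypothetical, and the check that $\lambda_c S < 1$ is an added assumption so that the terms are valid probabilities.

```python
def vc_occupancy(lam_c: float, S: float, V: int) -> list[float]:
    """Return [P_0, ..., P_V] following Eq. (19)."""
    rho = lam_c * S
    if not 0.0 <= rho < 1.0:
        raise ValueError("lam_c * S must lie in [0, 1)")
    probs = [(1.0 - rho) * rho ** v for v in range(V)]   # states 0 .. V-1
    probs.append(rho ** V)                               # state V
    return probs


# Illustrative values only: lam_c * S = 0.5 with V = 2 virtual channels.
p = vc_occupancy(0.005, 100.0, 2)
```

The geometric terms for $v < V$ sum to $1 - (\lambda_c S)^V$, so adding the last state makes the probabilities sum to one.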
Fig. 11. The average message latency predicted by the model against simulation results for a 5-star with V = 4, 8 and 12 virtual channels and message lengths M = 32, 64 and 100 flits.
When multiple virtual channels are used per physical channel, they share the physical bandwidth in a time-multiplexed manner. The average degree of multiplexing of virtual channels that takes place at a given physical channel can then be estimated by [11]:

\[ \bar{V} = \frac{\sum_{v=1}^{V} v^2 P_v}{\sum_{v=1}^{V} v P_v}. \tag{20} \]
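A minimal Python sketch of Eq. (20) follows. It takes an occupancy distribution $[P_0, \dots, P_V]$ as a plain list; the function name is hypothetical, and returning 1.0 when no virtual channel is ever busy is a convention assumed here for the zero-traffic case, which the equation itself leaves undefined.

```python
def multiplexing_degree(probs: list[float]) -> float:
    """Average multiplexing degree: sum(v^2 P_v) / sum(v P_v), v = 1..V."""
    num = sum(v * v * p for v, p in enumerate(probs) if v >= 1)
    den = sum(v * p for v, p in enumerate(probs) if v >= 1)
    if den == 0.0:
        return 1.0   # assumed convention: an idle channel is not multiplexed
    return num / den


# Illustrative distribution for V = 2: P_0 = 0.5, P_1 = 0.25, P_2 = 0.25.
v_bar = multiplexing_degree([0.5, 0.25, 0.25])
```

For a single virtual channel ($V = 1$) the ratio reduces to 1, which matches the intuition that there is nothing to multiplex.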
The above equations reveal that there are several interdependencies between the different variables of the model. For instance, Eqs. (4)–(6) reveal that $S$ is a function of $w$ while Eq. (17) shows that $w$ is a function of $S$. Given that closed-form solutions to such interdependencies are very difficult to determine, the different variables of the model are computed using an iterative technique.

5.6. Network throughput

We define the throughput of an interconnection network as the number of flits per cycle (time unit) ejected from the network. When the network is stable and working normally, the throughput equals the rate at which the nodes inject flits into the network. Thus, ideally, accepted traffic should increase linearly with the injection rate. However, due to the limitation of routing resources (switches and interconnect wires), throughput saturates at a certain value of the injection rate, $\lambda_{sat}$, which is the peak traffic rate that can be sustained ($\lambda_{sat}$ can be found using the Markovian model described in the previous section). We can now model the throughput of the interconnection network as

\[ \text{Throughput} = \frac{\text{Total flits}}{\text{Total time}} = \begin{cases} n!\,M\lambda_g, & \lambda_g < \lambda_{sat},\\ n!\,M\lambda_{sat}, & \lambda_g \ge \lambda_{sat}. \end{cases} \tag{21} \]

6. Validation of the model

The proposed analytical model has been validated through a discrete event simulator that mimics the behaviour of the
described routing algorithms in the network at the flit level. In each simulation experiment, a minimum of 200 000 messages are delivered. Statistics gathering was inhibited for the first 20 000 messages to avoid distortions due to the initial start-up conditions. The simulator uses the same assumptions as the analysis, and some of these assumptions are detailed here with a view to making the network operation clearer. The network cycle time is defined as the transmission time of a single flit from one router to the next. Messages are generated at each node according to a Poisson process with a mean arrival rate of λg messages/cycle. Message length is fixed at M flits. Destination nodes are determined using a uniform random number generator. The mean message latency is defined as the mean amount of time from the generation of a message until the last data flit reaches the local processor at the destination node. The other measures include the mean network latency, i.e. the time taken to cross the network, and the mean queuing time at the source node, i.e. the time spent in the local queue before entering the first network channel.

Numerous validation experiments have been performed for several combinations of network sizes, message lengths and numbers of virtual channels. Figs. 11 and 12 depict latency results predicted by the model explained in the previous section, plotted against those provided by the simulator for the S5 and S6 interconnection networks (with 120 and 720 nodes, respectively), with V = 6, 9, 10 and 12 virtual channels per physical channel, and three different message lengths M = 32, 64 and 100 flits. The horizontal axis in the figures shows the traffic generation rate at each node while the vertical axis shows the mean message latency. The figures reveal that in all cases the analytical model predicts the mean message latency with a good degree of accuracy in the steady-state regions.
Moreover, the model predictions are still good even when the network operates in the heavy traffic region, and when it starts to approach the saturation region. However, some discrepancies around the saturation point are apparent. These can be accounted for by the approximations made to ease the derivation of different variables, e.g. the approximation made to estimate the variance of the service time distribution at a channel (Eq. (16)). Such an approximation greatly simplifies the model as it allows us to avoid computing the exact distribution of the message service time at a given channel, which is not a straightforward task due to interdependencies between service times at successive channels, as wormhole routing relies on a blocking mechanism for flow control.

Fig. 12. The average message latency predicted by the model against simulation results for a 6-star with V = 5, 10 and 15 virtual channels and message lengths M = 32, 64 and 100 flits.

Fig. 13. The throughput predicted by the model against simulation results for a 5-star with V = 4, 8 and 12 virtual channels and message lengths M = 32, 64 and 100 flits.

Fig. 13 shows the throughput predicted by the model explained in the previous section plotted against that provided by the simulator for the S5 (the same scenario as in Fig. 11). The figure shows that when the number of virtual channels increases, the maximum traffic accepted by the network also increases. It can also be seen in the figure that when the network enters the saturation region the model prediction does not completely match the simulation results. This is because the statistics gathered by the simulator in the saturation region are not reliable and fluctuate as the network operates in an unstable condition (note that the behaviour of both the model and the simulator is smooth and consistent in the steady region, i.e. before entering the saturation region). However, the analytical model can still predict the throughput fairly accurately in almost all traffic regions. Similar graphs were obtained for the throughput when the scenario used in Fig. 12 was evaluated, and hence we do not report them here for the sake of brevity.
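The model-side throughput curve discussed here follows the piecewise form of Eq. (21): linear in the generation rate up to saturation and flat afterwards. A minimal Python sketch is given below; the function name and the sample rates are hypothetical, and $\lambda_{sat}$ is simply passed in as a number rather than derived from the Markovian model.

```python
import math


def model_throughput(n: int, M: int, lam_g: float, lam_sat: float) -> float:
    """Flits per cycle ejected by an n-star (n! nodes) according to Eq. (21)."""
    nodes = math.factorial(n)             # an n-star has n! nodes
    return nodes * M * min(lam_g, lam_sat)


# Illustrative values only: 5-star, 32-flit messages, assumed lam_sat = 0.01.
below = model_throughput(5, 32, 0.001, 0.01)   # stable region
above = model_throughput(5, 32, 0.02, 0.01)    # saturated region
```

Using `min` expresses both branches of the equation at once, since the two cases differ only in which rate is the smaller.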
7. Performance analysis using the model

In this section, we use the proposed analytical model to study the performance merits of the star interconnection network with fully adaptive routing and virtual channels. The 4-star, 5-star and 6-star are used for the sake of the present discussion, but the conclusions reached here have been found to be similar when other network configurations are considered. Fig. 14 illustrates the blocking probability curves as a function of the traffic rate injected by each node into the network. Fig. 14(a) shows that the Nbc (improved) routing algorithm performs better than the NHop (basic) algorithm, and that the ENbc (enhanced) routing algorithm performs better than the Nbc (improved) algorithm for various message lengths, as seen earlier in the simulation results. The same trend can be seen in Fig. 14(b) and (c) for the PHop, Pbc and EPbc routing algorithms. In addition, Fig. 14(c) shows that the probability of blocking decreases when more virtual channels are added to class 2 in the EPbc routing algorithm.

Fig. 14. The average message blocking probability versus traffic generation rate in the 5-star and 6-star, with M = 40, 80 and 160 flits.

Fig. 15. The (a) saturation traffic rate (λsat) and (b) saturation channel flit arrival rate versus message length in the 4-star and 6-star.

Fig. 15(a) and (b) depict the traffic generation rate and the channel flit arrival rate when the network enters the saturation region as a function of the message length M. It is assumed that the network enters the saturation region when ρ ≥ 1. Fig. 15(a) reveals that although the 6-star has 30 times as many nodes as the 4-star, it can still sustain almost the same traffic generation rate. This means that the star graph is a truly scalable network, as its performance does not drop when its size increases, unlike other networks such as the torus and mesh. This is simply because the number of channels, and thus the network bandwidth, in the star graph is proportional to its size. In Section 4, we described the average channel flit arrival rate as the percentage of the physical channel bandwidth utilized in any time interval. Fig. 15(b) reveals that increasing the number of virtual channels per physical channel and the message length improves the physical channel utilization.

8. Conclusion and future work

Star interconnection networks have gained much attention during the last decade. However, most studies in this line have focused on the topological properties and algorithmic aspects of these networks. In this paper, we introduced an accurate mathematical performance model for wormhole star graphs using adaptive routing and validated it
through simulation experiments. We saw that the proposed model achieves a good degree of accuracy, making it a practical and useful evaluation tool that researchers in the field can use to gain insight into the performance behaviour of fully adaptive routing in wormhole star graphs. The approach used in constructing the proposed model can also be used to model other regular networks with some effort. We are now using this model and some other accurate models proposed for hypercubes to compare the performance merits of star graphs and their equivalent hypercubes under different technological constraints and working conditions. Our next objective in this line is to propose a new model for networks with irregular topologies. Clustered systems and networks of workstations (NOWs) are examples of high-performance computing systems in use today that employ networks with irregular topology. Such a model would be able to predict the performance of an interconnection network with an arbitrary topology.

References

[1] S. Abraham, K. Padmanabhan, Performance of the direct binary n-cube networks for multiprocessors, IEEE Transactions on Computers 37 (7) (1989) 1000–1011.
[2] A. Agarwal, Limits on interconnection network performance, IEEE Transactions on Parallel and Distributed Systems 2 (4) (1991) 398–412.
[3] S.B. Akers, D. Harel, B. Krishnamurthy, The star graph: An attractive alternative to the n-cube, in: Proceedings of the International Conference on Parallel Processing, 1987, pp. 393–400.
[4] S.B. Akers, B. Krishnamurthy, The fault tolerance of star graphs, in: 2nd International Conference on Supercomputing, 1987, pp. 270–276.
[5] S.B. Akers, B. Krishnamurthy, A group-theoretic model for symmetric interconnection network, IEEE Transactions on Computers 38 (4) (1989) 555–565.
[6] A.C. Aljundi, J. Dekeyser, M.T. Kechadi, I.D. Scherson, A universal performance factor for multi-criteria evaluation of multistage interconnection networks, Future Generation Computer Systems 22 (7) (2006) 794–804.
[7] R.V. Boppana, S. Chalasani, A comparison of adaptive wormhole routing algorithms, in: International Symposium on Computer Architecture, 1993, pp. 351–360.
[8] R.V. Boppana, S. Chalasani, A framework for designing deadlock-free wormhole routing algorithms, IEEE Transactions on Parallel and Distributed Systems 7 (2) (1996) 169–183.
[9] Y. Boura, C.R. Das, T.M. Jacob, A performance model for adaptive routing in hypercubes, in: Proceedings of the International Workshop on Parallel Processing, 1994, pp. 11–16.
[10] W.J. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Transactions on Computers C-36 (5) (1987) 547–553.
[11] W.J. Dally, Virtual channel flow control, IEEE Transactions on Parallel and Distributed Systems 3 (2) (1992) 194–205.
[12] K. Day, A. Tripathi, A comparative study of topological properties of hypercubes and star graphs, IEEE Transactions on Parallel and Distributed Systems 5 (1) (1994) 31–38.
[13] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, Journal of Parallel and Distributed Computing 32 (1994) 202–214.
[14] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 2003.
[15] I.S. Gopal, Prevention of store-and-forward deadlock in computer networks, IEEE Transactions on Communications C-33 (12) (1985) 1258–1264.
[16] J.S. Jwo, S. Lakshmivarahan, S.K. Dhall, Embeddings of cycles and grids in star graphs, Journal of Circuits, Systems, and Computers 1 (1) (1991) 43–74.
[17] A. Khonsari, H. Sarbazi-Azad, M. Ould-Khaoua, An analytical model of adaptive wormhole routing with time-out, Future Generation Computer Systems 19 (1) (2003) 1–12.
[18] A.E. Kiasari, H. Sarbazi-Azad, S.M. Rezazad, Performance comparison of adaptive routing algorithms in the star interconnection network, in: Proceedings of the International Conference on High Performance Computing in Asia Pacific Region, 2005, pp. 257–264.
[19] A.E. Kiasari, H. Sarbazi-Azad, M. Ould-Khaoua, Analytical performance modelling of adaptive routing in star interconnection network, in: Proceedings of the International Parallel and Distributed Processing Symposium, 2006.
[20] L. Kleinrock, Queueing Systems, vol. 1, John Wiley, New York, 1975.
[21] X. Lin, P.K. Mckinley, L.M. Lin, The message flow model for routing in wormhole-routed networks, in: Proceedings of the International Conference on Parallel Processing, 1993, pp. 294–297.
[22] J.V. Misic, Z. Jovanovic, Routing function and deadlock avoidance in a star graph interconnection network, Journal of Parallel and Distributed Computing 22 (2) (1994) 216–228.
[23] H.H. Najafabadi, H. Sarbazi-Azad, P. Rajabzadeh, Performance modelling of fully adaptive wormhole routing in 2D mesh-connected multiprocessors, in: Proceedings of the International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, 2004.
[24] M. Noakes, W.J. Dally, System design of the J-machine, in: Proceedings of the Advanced Research in VLSI, MIT Press, 1990, pp. 179–192.
[25] S. Rajasecaran, D.S.L. Wei, Selection, routing and sorting on the star graph, in: Proceedings of the International Parallel Processing Symposium, 1993, pp. 661–665.
[26] S. Ranka, J.C. Wang, N. Yeh, Embedding meshes on star graph, Journal of Parallel and Distributed Computing 19 (2) (1993) 131–135.
[27] H. Sarbazi-Azad, A. Khonsari, M. Ould-Khaoua, Analysis of k-ary n-cubes with dimension-ordered routing, Future Generation Computer Systems 19 (4) (2003) 493–502.
[28] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, S.G. Akl, A parallel algorithm for Lagrange interpolation on the star graph, Journal of Parallel and Distributed Computing 62 (4) (2002) 605–621.
[29] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, An accurate analytical model of adaptive wormhole routing in k-ary n-cube interconnection networks, Performance Evaluation 43 (2–3) (2001) 165–179.
[30] C. Su, K.G. Shin, Adaptive deadlock-free routing in multicomputers using one extra channel, in: Proceedings of the International Conference on Parallel Processing, 1993, pp. 175–182.
Abbas Eslami Kiasari received his B.Sc. degree in Electrical Engineering from the Ferdowsi University, Mashhad, Iran, in 2003, and his M.Sc. degree in Computer Engineering from the Sharif University of Technology, Tehran, Iran, in 2005. He is currently a Ph.D. student in the Computer Engineering Department, Sharif University of Technology, Tehran, Iran. His research interests include performance modelling and evaluation, queueing theory, high-performance computer architecture and parallel processing.

Hamid Sarbazi-Azad received his B.Sc. degree in Electrical and Computer Engineering from Shahid Beheshti University, Tehran, Iran, in 1992, his M.Sc. degree in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 1994, and his Ph.D. degree in Computing Science from the University of Glasgow, Glasgow, UK, in 2002. He is currently a faculty member in the Department of Computer Engineering at Sharif University of Technology, and heads the School of Computer Science of the Institute for Studies in Theoretical Physics and Mathematics (IPM), Tehran, Iran. Dr Sarbazi-Azad has served as a guest co-editor for the special issue “Performance modelling and evaluation of high-performance parallel and distributed systems” in Performance Evaluation journal, the special issue “Design and performance of networks for super-, cluster-, and grid-computing” in the Journal of Parallel and Distributed Computing, and the special issue “Performance evaluation of networks in parallel, cluster, and grid computing systems” in Parallel Computing journal. He is now guest-editing two other special issues in Computers & Electrical Engineering and Journal of Computer and System Sciences. He is a founding co-chair for the International Workshop of Performance Evaluation of Networks in Parallel, Cluster, and Grid Computing Systems (PEN-PCGCS), in conjunction with the International Conference on Parallel Processing (ICPP); he was a co-chair of PEN-PCGCS’2005 and is a programme co-chair for PEN-PCGCS 2006. He was also the general chair of CSICC’2006 and has been the Editor-in-Chief for the CSI International Journal of Computer Science and Engineering from January 2006. His current research interests include NoCs, mobile and wireless networks, high-performance computer architectures, parallel/distributed systems, cluster and grid computing systems, performance modelling/evaluation, graph theory and combinatorics.

Mohamed Ould-Khaoua received his B.Sc. degree from the University of Algiers, Algeria, in 1986, and the MAppSci and Ph.D. degrees in Computer Science from the University of Glasgow, UK, in 1990 and 1994, respectively. He is currently a Reader in the Department of Computing Science at the University of Glasgow, UK. His research focuses on applying theoretical results from stochastic processes and queuing theory to the quantitative study of hardware and software architectures. He has been working in the area of performance modelling and evaluation of wired and wireless networks for high-performance computing systems over the past 15 years. Dr Ould-Khaoua serves on the editorial board of IEEE Transactions on Parallel & Distributed Systems, International Journal of Parallel, Emergent and Distributed Systems, International Journal of Computers & Applications, and International Journal of High-performance Computing & Networking. He is the Guest Editor of 14 special issues related to performance modelling and evaluation of computer systems and networks in the Journal of Computation and Concurrency: Practice & Experience, Performance Evaluation, Supercomputing, Journal of Parallel & Distributed Computing, IEE-Proceedings-Computers & Digital Techniques, International Journal of High Performance Computing & Networking, and Cluster Computing. He is the Co-Chair of the international workshop series on performance modeling, evaluation, and optimization of parallel and distributed systems (PMEO-PDS) and ACM Workshop on Performance Evaluation of Wireless Ad Hoc, Sensor, and Ubiquitous Networks (PE-WASUN), ACM International Workshop on Performance Monitoring and Measurement of Heterogeneous Wireless and Wired Networks (PMMH-WN 2006), International Workshop on Networks for Parallel, Cluster and Grid Systems (PEN-PCGCS). He is Workshops Chair at the 9th ACM International Symposium on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM’ 2006). He has served on the programme committees of many international conferences and workshops. Dr Ould-Khaoua’s current research interests are performance modelling/evaluation of wired/wireless communication networks and parallel/distributed systems.