J. Parallel Distrib. Comput. 72 (2012) 1280–1294
An accurate performance model for network-on-chip and multicomputer interconnection networks

Slavko Gajin, Zoran Jovanovic
Department of Computer Engineering and Computer Science, School of Electrical Engineering, University of Belgrade, Serbia

Article history: Received 15 July 2011; Received in revised form 8 May 2012; Accepted 14 May 2012; Available online 26 May 2012.
Keywords: Deterministic routing; Interconnection network; Network on chip; Wormhole; Performance modeling

Abstract

In this paper, we present the mathematical background for a new approach to performance modeling of interconnection networks, based on analyzing the blocking and waiting time a packet spends in each channel while passing through all possible paths in the channel dependency graph. We propose a new, simple, and very accurate analytical model for deterministic routing in wormhole networks, which is general in terms of network topology and traffic distribution. An accurate calculation of the variance of the service time has been developed, which overcomes the rough approximation used, as a rule, in existing models. The model supports two-dimensional mesh topologies, widely used in network-on-chip architectures, and multidimensional topologies, popular in multicomputer architectures. It is applicable even to irregular topologies and arbitrary application-specific traffic. Results obtained through simulation show that the model achieves a high degree of accuracy. © 2012 Elsevier Inc. All rights reserved.
1. Introduction

Interconnection network architecture, which is traditionally used in multicomputers, shares significant similarities with the newer network-on-chip (NoC) architecture. The common need is to achieve efficient and high-speed message exchange over the underlying communication architecture. This is achieved by using the same mechanisms, such as message passing, routing, channel and buffer allocation, flow control, arbitration policy, and so on. The differences between NoCs and multicomputers are primarily caused by hardware implementation. Due to technological limitations and power consumption, NoC architectures are usually implemented in planar topologies, such as the two-dimensional (2D) mesh, while multicomputers can also be implemented in multidimensional meshes and k-ary n-cubes. The number of nodes in NoCs is still smaller than in multicomputers, but the communication channels are wider, which allows more bits to be transferred in parallel. However, both NoC and multicomputer interconnection networks can be analyzed using the same analytical framework.

The design of NoC and multicomputer systems is a very complex, time-consuming, and expensive process, which faces tight requirements in terms of performance, cost, and time to market. The design methodology consists of two phases: in the
first phase, a concrete architecture is derived from the general NoC template, which defines the topology, routing, switches, and other resources, while the second phase maps the application onto the chosen architecture to form a concrete product [15]. However, it is an iterative process in which performance analysis is a crucial step needed to validate the system architecture and to estimate its performance for target application workloads. The results provided during the performance analysis step are used to refine the communication architecture. Simulation is an extensive and time-consuming technique which is typically used at a later design phase. On the other hand, flexible and accurate analytical models are fast and efficient tools to determine whether or not the chosen architecture–application combination satisfies the design requirements [21]. The challenge for each model is to achieve a high level of accuracy with low complexity under the given assumptions and approximations.

The existing analytical models in the literature commonly analyze network channels using an M/G/1 queuing model [14], with independent packet arrivals in accordance with a Poisson process and a general service time distribution. A packet served in one channel may cause blocking of other packets in previous channels. The usual approach is to calculate the packet service time at each hop, taking into account the blocking probabilities and the waiting time to gain access to the next channel. Backward calculation is needed to propagate these parameters throughout the network in order to predict the average packet latency.

Early work in this area was done by Dally [5], who analyzed the performance of deterministic dimension order routing in
k-ary n-cube networks of varying dimension under the assumption of constant wire bisection. He exploited the fact that the physical channels of k-ary n-cube networks are equally loaded under a uniform traffic pattern. The model calculates the latency contribution in each dimension separately, in addition to the latency in the remaining dimensions needed to reach the destination. It is based on the probability of skipping or routing into a specific dimension, considering packet flow along an average number of nodes in the dimension. However, he used several rough approximations, which resulted in a less accurate model for both lower-dimension networks and higher traffic loads.

Kim and Chien modeled the network channels as an M/G/1 queue using a generalized Weibull exponential distribution [13]. They considered the flow rate in an average node in each dimension of a symmetrical topology (k-ary n-cubes) and the conflict probability of packets gaining the channel. They improved the calculation of the waiting time, using the second moment of the service time and the variance of the service time distribution.

Ciciani et al. [3] slightly improved Dally's model by providing a detailed flow analysis in each dimension and analyzing the conflict of two different incoming flows requesting the same channel. They calculated the waiting time seen by the packet using the residual service time of the packet in the other flow, which occupies the channel and causes the conflict. However, they also assumed zero variance of the service time, which only holds for a constant service time, and approximated the residual time as half of the service time. The authors demonstrated this methodology by modeling several network topologies (hypercube, and unidirectional and bidirectional tori), each with a different flow analysis model. In later work, Quaglia et al. [24] extended this idea to fully adaptive routing, which involves higher complexity due to multiple classes of flows and transition possibilities.

Further improvement was made by Ogras et al. [21], who proposed an accurate and less complex model for Duato's fully adaptive routing in k-ary n-cubes with virtual channels [8]. Based on the condition of symmetric network load, this model considers a path of average distance only, calculating the blocking time at each hop using an M/G/1 queuing model and the probability that virtual channels are busy, using results from combinatorial theory. The model finally multiplies the resulting mean waiting time by the average degree of multiplexing of virtual channels at a physical channel, achieving high accuracy.

The common characteristic of the models presented above, as well as of other approaches reported in the literature [7], is their dependence on specific topologies, regular traffic distributions, and other system prerequisites. Changing some of these conditions leads to different models. As an illustration, Ould-Khaoua and other authors varied the approach reported in [22], resulting in many different models for various traffic distributions [27–30], network topologies [19,18,16,20,23], and routing algorithms [12,17,26].

Recent NoC performance analysis has overcome these restrictions, resulting in more general analytical models. The study presented in [21] is based on traffic flow rates and the contention probability that two input channels compete for the same output channel. It then calculates the average number of packets in each router, which is used to compute the average buffer utilization, average packet latency, and maximum network throughput. The authors of [1] also considered the contention probability matrix and introduced an analytical model for NoCs with arbitrary topology, routing algorithm, and traffic distribution. The model additionally allows a stochastic distribution of message length and both a homogeneous and a heterogeneous buffer allocation scheme, which leads to a more complex closed-form solution with interdependencies that are computed by iteration.

In this paper, we present an analytical model which supports arbitrary traffic distributions and network topologies, including
irregular topologies and application-specific traffic. The model is accurate and simple, without interdependency between the calculated parameters. The main contributions of our work are as follows. (1) A new approach is used at the channel (local) level to calculate the blocking probability and the average waiting time in packet transition between two channels, named the Corrected M/G/1 model. (2) At the network (global) level, we provide a flow-based analysis to calculate the service time and its variance along all possible paths over the network. In contrast to similar solutions reported in [1,21], we complement the analysis with a formal mathematical approach based on the channel dependency graph. (3) The exact calculation of the variance of the service time improves the accuracy of the model compared to the existing approximations commonly used in the literature.

For the sake of simplicity, the model primarily considers the simplest case of deterministic routing with a single-flit buffer and only one virtual channel per physical channel. With this approach, which is also used in [1], we first ignore the effect of virtual channels; later, in Appendix A.1, we extend the model by taking into account virtual channels which share the physical channel in a time-multiplexed manner.

This paper is organized as follows. Section 2 briefly gives the technological background and summarizes the necessary definitions and assumptions. Section 3 presents the analytical model, especially focusing on the flow rate analysis, the model description at the local and global levels, and the model results. Section 4 validates the model against simulation results. Section 5 concludes our study, highlighting our main achievements and discussing the possibilities of extending the model to support arbitrary packet length, multi-flit buffers, and adaptive routing. Appendix A.1 describes the model extension for virtual channel architectures, while Appendix A.2 lists all symbols used in this paper.

2. Definitions and assumptions

In order to analyze the packet delay and other network performance metrics, several definitions and assumptions need to be formally introduced. Similar definitions and notation are used by Duato [8].

Definition 1. An interconnection network I is a strongly connected, directed multigraph I = G(N, C). The vertices N of the multigraph I represent the union of processing and routing nodes: N = N_P ∪ N_R. The arcs C of the multigraph I represent the union of network channels and injection and ejection channels: C = C_N ∪ C_IN ∪ C_EJ. The network channels c_n ∈ C_N are unidirectional and connect given pairs of adjacent routing nodes, denoted as c_n = (n_r1, n_r2) ∈ N_R × N_R. The injection channels c_in ∈ C_IN and ejection channels c_ej ∈ C_EJ are also unidirectional and connect given pairs of processing and routing nodes, denoted as c_in = (n_p, n_r) ∈ N_P × N_R and c_ej = (n_r, n_p) ∈ N_R × N_P, respectively. Unidirectionality means that a channel c′ connecting (n_1, n_2) is opposite to c″, which connects (n_2, n_1).

In contrast to the popular definition proposed by Duato [9,8], we have separated processing and routing nodes. This distinction introduces the injection and ejection channels and allows their transparent treatment in the following analysis.

Definition 2. A deterministic routing function has the form R: N × N_P → P(C), where P(C) is the power set of C. It supplies a set of output channels based on the current processing or routing node that owns the input channel holding the packet header, and the destination processing node addressed in the packet header.
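As an illustration of Definitions 1 and 2, the following minimal Python sketch builds the channel set of an n × n mesh with separate injection and ejection channels and implements XY routing as a deterministic routing function. All names (build_2d_mesh, xy_route, the tuple encoding of nodes and channels) are our own illustrative choices, not notation from the paper.

    from itertools import product

    def build_2d_mesh(n):
        """Channel set C of an n x n mesh (Definition 1): network channels plus
        one injection and one ejection channel per node, all unidirectional."""
        channels = set()
        for x, y in product(range(n), repeat=2):
            channels.add(('inj', ('P', x, y), ('R', x, y)))
            channels.add(('ej', ('R', x, y), ('P', x, y)))
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < n and 0 <= ny < n:
                    channels.add(('net', ('R', x, y), ('R', nx, ny)))
        return channels

    def xy_route(router, dest):
        """Deterministic routing function R (Definition 2): the single output
        channel at router ('R', x, y) for a packet destined to ('P', dx, dy)."""
        _, x, y = router
        _, dx, dy = dest
        if x != dx:                                    # correct X first
            step = 1 if dx > x else -1
            return ('net', router, ('R', x + step, y))
        if y != dy:                                    # then Y
            step = 1 if dy > y else -1
            return ('net', router, ('R', x, y + step))
        return ('ej', router, dest)                    # arrived: eject

The direct dependency relation of Definition 3 below, and hence the channel dependency graph, can be enumerated from such a routing function by checking which output channel follows each input channel for every destination.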
Even though a source processing node does not actually perform packet routing, but only forwards the packet to the associated routing node, this approach allows us to include the injection channels in the channel dependency graph in the following way.

Definition 3. A direct dependency relation DD for a given network I and a routing function R is a set of all pairs (c_i, c_j), where c_i, c_j ∈ C, c_i = (n_1, n_2), c_i ≠ c_j, and ∃n ∈ N such that c_i ∈ R(n_1, n) and c_j ∈ R(n_2, n). That is, c_j can be used immediately after c_i by a packet destined for some node n, where n_1 and n_2 are adjacent nodes connected by channel c_i.

In the direct dependency relation, where (c_i, c_j) ∈ DD, c_i is a direct predecessor of c_j, and c_j is a direct successor of c_i. The sets of all direct predecessors and all direct successors of channel c are denoted predDir(c) and succDir(c), respectively. Therefore, for (c_i, c_j) ∈ DD, we can write c_i ∈ predDir(c_j) and c_j ∈ succDir(c_i).

Definition 4. An indirect dependency relation ID for a given network I and a routing function R is a set of all pairs (c_i, c_j), where c_i, c_j ∈ C, c_i ≠ c_j, and ∃c_1, c_2, . . . , c_k ∈ C such that (c_i, c_1) ∈ DD, (c_1, c_2) ∈ DD, . . . , (c_k, c_j) ∈ DD.

In an indirect dependency relation, where (c_i, c_j) ∈ ID, c_i is an indirect predecessor of c_j, denoted c_i ∈ predInd(c_j), and c_j is an indirect successor of c_i, denoted c_j ∈ succInd(c_i). This definition of indirect dependency is derived from the direct dependency relation, in contrast to Duato's definition, which is based on a routing subfunction [9,8].

Definition 5. The channel dependency graph D for a given interconnection network I and routing function R is a directed graph D = G(C, E). The vertices of the graph D are the channels c ∈ C, and the arcs of the graph D are the pairs of channels (c_i, c_j) ∈ C × C = E such that (c_i, c_j) ∈ DD (there is a direct dependency from c_i to c_j).

We will use the acronym CDG for the channel dependency graph in the rest of this paper. The CDG is defined by the given network topology and a particular routing algorithm. In our approach, injection and ejection channels are also represented in the network topology, by beginning and ending vertices in the CDG, respectively. The intermediate vertices in the CDG are associated with network channels. The absence of cycles in a CDG is a sufficient condition for deadlock freedom for both deterministic and adaptive routing algorithms [5,9,8]. This fact is widely used in routing algorithm design, so we will consider only acyclic CDGs. However, this is not the only approach [10].

Definition 6. The traffic distribution function TD: N_P × N_P → [0 . . . 1], for every pair of nodes (s, d), supplies the probability that a packet is generated in source node s and is destined to node d. Therefore, the following equation must be satisfied:
\sum_{s \in N_P} \sum_{d \in N_P} TD(s, d) = 1.    (1)
When the traffic distribution supplies equal probability for each pair (s, d) and an equal packet generation rate to all nodes, it is referred to as a uniform traffic distribution.

In this paper, we present a model for wormhole flow control with one-flit buffers in each channel. The simplest dimension order deterministic routing, denoted XY, is considered. Other assumptions that we use in this paper are also commonly presumed in the literature [7,12,17,19,18,16,20,21,23,25–30].
1. The source nodes generate the messages independently and according to a Poisson process.
2. The packet arrival process at each channel is approximated by an independent Poisson process.
3. The flit transmission time between two adjacent channels is one cycle.
4. The CDG is acyclic.
5. The packet length is fixed to L flits and is larger than the maximum path length.

Assumptions 1 and 2 allow us to use a queuing-based approach. Even though some recent papers have reported that self-similar and non-stationary traffic characteristics are more realistic [2], this approximation leads to a less complex analysis with an acceptable degree of accuracy. Assumption 3 is commonly used in performance analysis and simulation of multicomputers, while NoC simulators and models prefer exact timing of each intermediate step: routing, flit transition over physical channels, etc. However, there is no significant difference between these two approaches. Assumption 4 is justified by all hardware implementations, and it is used in all existing models, even though some theoretical analyses allow deadlock-free deterministic routing with cyclic dependencies [10]. The last assumption is chosen for the sake of simplicity. With wormhole flow control, a packet longer than the maximum path length causes blocking in one channel to influence all previous channels on its path. The model can also be adapted to support an arbitrary distribution of packet length by taking into account the probabilities of how far the packet is spread over the path and how many channels over the path are blocked. Other extensions of the model to support multi-flit buffer sizes and adaptive routing are discussed in the conclusion of this paper.

3. The analytical model

For the sake of simplicity, in this section we define the analytical model of the average waiting time for deterministic routing in a wormhole network with only one virtual channel per physical channel. An extension of the model to support multiple virtual channels is given in Appendix A.1.

3.1. The flow rate analysis

Each processing node generates packets independently and according to a Poisson process, with an average rate λ_s. The total packet generation rate for the whole network, known as the network input rate, is
λ = \sum_{s=1}^{N_{nd}} λ_s,    (2)
where N_nd is the total number of processing nodes. In the case of a uniform packet generation rate, each node generates packets at an equal rate λ_s; therefore, the network packet generation rate is given by λ = λ_s · N_nd.

In order to calculate the flow rate in each channel, we introduce the function P(c|s, d), for every channel c and pair of nodes s and d. It gives the probability that a packet passes through channel c when it is routed from source node s to destination node d. For deterministic routing this function is discrete, taking the value either 0 or 1. The total flow rate at the channel level, denoted f(c), is determined by the generation rate λ and the total probability of channel c being on the path of any packet generated by the traffic distribution TD:

f(c) = λ \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) · P(c|s, d).    (3)
In a similar way, we can further develop the partial flow rate from channel c_i to another channel c_j, taking into account the probability
that the packet passes through both channels on the path from source s to destination d:

f(c_i, c_j) = λ \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) · P(c_i, c_j|s, d).    (4)
In general, channels c_i and c_j can be any channels in the CDG. In the case of adjacent channels in the CDG, we denote this flow rate from c_i to its direct successor channel c_j by f(c_i → c_j). According to the flow conservation law [14], the flow rate through one channel is equal to the sum of all incoming flow rates, which is the same as the sum of all outgoing flow rates (see Fig. 1):

f(c) = \sum_{c_i \in predDir(c)} f(c_i → c) = \sum_{c_j \in succDir(c)} f(c → c_j).    (5)

[Fig. 1. Channel incoming and outgoing flows: flow rate (f), waiting time (W), and service time (S).]
For a given network topology, deterministic routing algorithm, particular traffic distribution TD, and network input rate λ, all flow rates for all combinations of sources and destinations are fully determined for each individual channel, as well as the partial flows from one channel to another. In the case of a 2D square mesh and a uniform packet distribution, the flow rate for a channel in the positive direction of dimension X, denoted X+, in node (x, y) is given by

f(c_{X+(x,y)}) = \frac{λ}{n(n^2 − 1)} (x + 1)(n − x − 1).    (6)

The flow rates calculated in this way are the basic input parameters for our analytical model.
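To make the flow rate analysis concrete, here is a small Python sketch, building on the hypothetical build_2d_mesh and xy_route helpers above, that computes f(c) by enumerating all source–destination pairs as in Eq. (3) and spot-checks the closed form of Eq. (6) for a uniform distribution; it is an illustration under those assumptions, not code from the paper.

    def flow_rates(n, lam):
        """Return {channel: f(c)} by enumerating all (s, d) pairs, Eq. (3)."""
        nodes = [('P', x, y) for x in range(n) for y in range(n)]
        N = len(nodes)
        td = lam / (N * (N - 1))          # lambda * TD(s, d) for every pair
        f = {}
        for s in nodes:
            for d in nodes:
                if s == d:
                    continue
                # walk the deterministic XY path; P(c|s,d) is 1 on it, 0 elsewhere
                c = ('inj', s, ('R', s[1], s[2]))
                while True:
                    f[c] = f.get(c, 0.0) + td
                    if c[0] == 'ej':
                        break
                    c = xy_route(c[2], d)
        return f

    # Spot check against Eq. (6): f(c_{X+(x,y)}) = lam*(x+1)*(n-x-1) / (n*(n^2-1))
    n, lam = 4, 1.0
    f = flow_rates(n, lam)
    x, y = 1, 2
    c = ('net', ('R', x, y), ('R', x + 1, y))
    assert abs(f[c] - lam * (x + 1) * (n - x - 1) / (n * (n**2 - 1))) < 1e-12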
3.2. Outline of the model

The model operates on two levels. At the local level, the model focuses on a packet served in one channel c and its influence on the other packets waiting in each predecessor channel. In other words, the local model calculates the mean waiting time that packets spend in a predecessor channel c_i waiting for channel c, denoted by W(c_i → c). At the global level, the model goes back through the CDG and calculates the mean service time of each channel c in the CDG, denoted by S(c), taking into account the individual mean waiting times from channel c to all its direct successors c_j ∈ succDir(c), denoted by W(c → c_j). These waiting times have previously been calculated by application of the local model to each direct successor c_j. Additionally, in the case of wormhole flow control, which is considered in this paper, the channel service time is affected by the waiting times in all possible indirect successors on the paths to all possible destinations. As mentioned before, the flow rates are the basic input parameters for the model, and they are determined by the network topology, traffic distribution, and deterministic routing function.
3.3. The local model—Corrected M/G/1

The average waiting time which packets see in one channel while waiting for the next channel is widely calculated in the literature using the M/G/1 queuing model. The packets arrive at the channel queue independently and according to a Poisson process with mean arrival rate equal to the channel flow rate f(c). Packets occupy the channel for a certain time, which is determined by the service time random variable s(c). The distribution of this random variable is general, but for the purpose of the average waiting time calculation, W_{M/G/1}(c), it is sufficiently described by the mean service time S(c) and the variance of the service time distribution σ_s^2(c) [14]:

W_{M/G/1}(c) = \frac{f(c) S^2(c) (1 + v_s^2(c))}{2(1 − f(c) S(c))},    (7)

where v_s^2(c) is the squared coefficient of the service time variance, defined by

v_s^2(c) = \frac{σ_s^2(c)}{S^2(c)}.    (8)
However, the M/G/1 queuing model is derived for one service element which serves one input queue. It does not describe a system with one service element serving several input queues, which is the case when one channel c (the service element) is occupied by a packet and several preceding channels c_i (input queues) are occupied by packets waiting for channel c (Fig. 2a). Taking single-flit buffer wormhole flow control into account, if two subsequent packets are passing along the same path from one channel to another, the following packet can only see the tail of the heading packet, which has already reached the destination node, and therefore it cannot be blocked by the heading one. In other words, a packet in channel c_i waiting for channel c can only be blocked by a packet which currently occupies channel c and which came from some other preceding channel c_k, c_k ∈ predDir(c), c_k ≠ c_i. All preceding channels c_i have, in general, different partial flow rates to the next adjacent channel c, denoted by f(c_i → c). Considering one preceding channel c_i, the sum of the flow rates from the other preceding channels, which can actually block the packet in channel c_i, is different for each channel c_i ∈ predDir(c) and is equal to f(c) − f(c_i → c), as shown in Fig. 2b. Different flow rates will yield different blocking probabilities and, consequently, different waiting times in all adjacent preceding channels, denoted by W(c_i → c).

Considering the above analysis, we have developed a new local model named the Corrected M/G/1 model. The model first calculates the probability that the packet coming from channel c_i is blocked by another packet which is occupying the succeeding channel c. This probability is given by the ratio of the number of all packets that can cause the blocking in channel c_i to the total number of all packets passing through channel c. Taking the flow rates into account, this blocking probability can be expressed by the following equation:

P_{bl}(c_i → c) = \frac{f(c) − f(c_i → c)}{f(c)}.    (9)
We will use this blocking probability to correct the original waiting time calculation of the M/G/1 model. With this correction, the average waiting time which the packet spends in channel c_i waiting for the succeeding channel c is

W(c_i → c) = P_{bl}(c_i → c) · W_{M/G/1}(c) = (f(c) − f(c_i → c)) \frac{S^2(c)(1 + v_s^2(c))}{2(1 − f(c) S(c))}.    (10)
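The local model is only a few lines of arithmetic; the following Python sketch expresses Eqs. (7), (9), and (10) directly. The function names are ours, and the inputs f(c), f(c_i → c), S(c), and v_s^2(c) are assumed to be already known.

    def w_mg1(f_c, S_c, v2_c):
        """Plain M/G/1 waiting time of channel c, Eq. (7)."""
        return f_c * S_c**2 * (1.0 + v2_c) / (2.0 * (1.0 - f_c * S_c))

    def w_corrected(f_c, f_ci_to_c, S_c, v2_c):
        """Waiting time in predecessor ci for channel c, Eqs. (9)-(10).

        Only traffic arriving at c from other predecessors can block the
        packet in ci, hence the factor (f(c) - f(ci->c)) / f(c)."""
        p_bl = (f_c - f_ci_to_c) / f_c            # Eq. (9)
        return p_bl * w_mg1(f_c, S_c, v2_c)       # Eq. (10)

Note that for a channel fed by a single predecessor, f(c_i → c) = f(c) and the corrected waiting time vanishes, matching the observation that a packet cannot be blocked by a packet of its own flow.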
[Fig. 2. One service element and several input queues: (a) flow rates in incoming equivalent queues, (b) complementing flow rates for the selected preceding channel c_i.]

3.4. The global model
The local model calculates the average waiting time for packets in channel c_i, in transition to the succeeding channel c, based on the mean service time and the variance of the service time in channel c. To calculate these, we need to derive the mean service time and its variance for packets in channel c.

The service time for a packet in a channel is a random variable s(c), which corresponds to the time period for which the channel is occupied by a packet. It lasts from when the heading flit enters the channel until the last flit leaves the channel. If there is no blocking in the succeeding channels, the service time is equal to the packet length. This is the case with the ejection channels, since they are the last channels in the CDG, which directly deliver packets to the destinations. Otherwise, the service time is increased by the time for which the packet is blocked in the channel. If the next adjacent channel c_j has a buffer large enough to receive the whole packet, which is the case with virtual cut-through flow control, this blocking time can be represented by the random variable of the waiting time in channel c for channel c_j, giving

s(c → c_j) = w(c → c_j) + L.    (11)

In the case of wormhole flow control, a packet can be blocked in any succeeding channel on its path to the destination. In terms of the CDG, the destination node is related to the ejection channel, denoted by c_d. Therefore, the random variable of the service time for a packet in channel c leading to destination channel c_d is given by

s(c, c_d) = w(c → c_1) + w(c_1 → c_2) + w(c_2 → c_3) + · · · + w(c_{d−1} → c_d) + L,    (12)

where w(c_k → c_{k+1}) denotes the random variable of the waiting time for the transition to the next channel on the path, as shown in Fig. 3.

[Fig. 3. Flow rate from channel c to destination channel c_d.]

3.4.1. Average service time

The expectation of a sum of random variables is the sum of the expectations of the random variables (even if they are not independent) [14]. The expected service time of a given channel c in communication to channel c_d is expressed by

S(c, c_d) = W(c → c_1) + W(c_1 → c_2) + W(c_2 → c_3) + · · · + W(c_{d−1} → c_d) + L.    (13)

If we denote by path(c, c_d) the set of all channels on the path from c to c_d, the equation above can be written briefly as

S(c, c_d) = \sum_{c_k, c_{k+1} \in path(c, c_d), \, c_{k+1} = succDir(c_k)} W(c_k → c_{k+1}) + L.    (14)

In general, we can consider the probability that the packet takes the path through channel c_k, provided that it has already reached channel c, denoted by P(c_k|c). From the definition of conditional probability, we have
P(c_k|c) = \frac{P(c, c_k)}{P(c)}.    (15)

This ratio of the probability with which the packet traverses both channels c and c_k to the probability with which the packet traverses only channel c is equal to the ratio of the corresponding flow rate from channel c to c_k to the total flow rate in channel c:

P(c_k|c) = \frac{f(c, c_k)}{f(c)}.    (16)

The probability that packets take the path from channel c to channel c_d is therefore given by

P(c_d|c) = \frac{f(c, c_d)}{f(c)}.    (17)

The ejection channels are the last channels in the CDG. There are no further channels to block them, and their service time is equal to the packet length. The average service time for channel c can be calculated by summing the service time to each destination, weighted by the corresponding probability. Using the equations above, the average service time for an arbitrary channel c can be written as

S(c) = \frac{1}{f(c)} \sum_{c_d \in C_{EJ}} f(c, c_d) \left( \sum_{c_k, c_{k+1} \in path(c, c_d), \, c_{k+1} = succDir(c_k)} W(c_k → c_{k+1}) \right) + L.    (18)
We can further analyze all flows from channel c to destination channel c_d passing through some intermediate channel c_{k+1}, for which the sum of all those flow rates can be written as

\sum_{c_d \in C_{EJ} \cap succ(c)} f(c, c_{k+1}, c_d) = f(c, c_{k+1}).    (19)
Putting this expression into (18), we can group the flow rates for each waiting time W(c_k → c_{k+1}), which yields the following equation:

S(c) = \frac{1}{f(c)} \sum_{c_k \in succDir(c)} f(c, c_k) W(c → c_k) + \frac{1}{f(c)} \sum_{c_k \in succ(c)} \sum_{c_{k+1} \in succDir(c_k)} f(c, c_{k+1}) W(c_k → c_{k+1}) + L.    (20)
The first term in the equation above expresses the average waiting time in channel c caused only by all its direct successors, and it can therefore be called the direct waiting time, denoted by W_D(c):

W_D(c) = \frac{1}{f(c)} \sum_{c_k \in succDir(c)} f(c, c_k) W(c → c_k).    (21)
Using this expression for the other succeeding channels, the average service time can be further simplified as

S(c) = W_D(c) + \frac{1}{f(c)} \sum_{c_k \in succ(c)} f(c, c_k) W_D(c_k) + L.    (22)
In addition to the direct waiting time for channel c, the second term in (22) corresponds to the average waiting time of the heading flit in all succeeding channels. This additional time spent in channel c can be called the indirect waiting time, denoted by W_I(c). The average service time of channel c with wormhole flow control can finally be expressed as

S(c) = W_D(c) + W_I(c) + L.    (23)
In comparison to virtual cut-through flow control, in which only the direct waiting time contributes to channel blocking, the wormhole flow control channel service time is additionally increased by the indirect waiting time, W_I(c). In (22), we can further extract the direct waiting time of the direct successors of channel c, which gives

S(c) = W_D(c) + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c → c_j) \left( W_D(c_j) + \frac{1}{f(c → c_j)} \sum_{c_k \in succ(c_j)} f(c, c_j, c_k) W_D(c_k) + L \right).    (24)
Since we take into account the impact of the waiting time in all succeeding channels, which depends on all possible flow rates, the fraction of the total flow rate passing through channel c_j does not depend on the channels prior to channel c_j in the CDG, and the flow rate ratio in the equation above can be simplified as follows:

\frac{f(c, c_j, c_k)}{f(c, c_j)} = \frac{f(c_j, c_k)}{f(c_j)}.    (25)
As a result, (22) can be rewritten in a recursive form, in which the average service time of channel c depends only on the service times of its direct succeeding channels:

S(c) = W_D(c) + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c → c_j) S(c_j).    (26)
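Because the CDG is acyclic (Assumption 4), the recursion of Eq. (26) can be evaluated in a single backward pass, memoizing each channel. The Python sketch below illustrates this; it takes the flow rates and, for brevity, the squared coefficient of variance v_s^2 as given inputs (in the full model v_s^2 is computed in the same pass via Eq. (37)), and all names are illustrative.

    from functools import lru_cache

    def make_service_time(succ_dir, f, f_pair, L, v2):
        """succ_dir(c): direct successors of channel c in the acyclic CDG;
        f(c): total flow rate, Eq. (3); f_pair(c, cj): partial rate f(c -> cj);
        v2(cj): squared coefficient of variance of s(cj), assumed given."""

        @lru_cache(maxsize=None)
        def S(c):
            nexts = succ_dir(c)
            if not nexts:                        # ejection channel: S(c) = L
                return float(L)
            # W(c -> cj) from Eq. (10), then W_D(c) from Eq. (21)
            def W(cj):
                return ((f(cj) - f_pair(c, cj)) * S(cj) ** 2 * (1 + v2(cj))
                        / (2 * (1 - f(cj) * S(cj))))
            w_d = sum(f_pair(c, cj) * W(cj) for cj in nexts) / f(c)
            # recursion of Eq. (26)
            return w_d + sum(f_pair(c, cj) * S(cj) for cj in nexts) / f(c)

        return S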
3.4.2. Average squared coefficient of the variance of the service time

The variance of the service time is widely approximated following a suggestion proposed in [7], taking into account the minimal service time, which is equal to the packet length L:

σ_s^2(c) = (S(c) − L)^2.    (27)

However, we will derive a more accurate variance, or, more precisely, a squared coefficient of the service time variance, by analyzing the random variables s(c, c_d) from channel c to all possible destinations c_d. By definition, the squared coefficient of the variance of the service time is given by

v_s^2(c) = \frac{σ_s^2(c)}{S^2(c)} = \frac{E[s^2(c)] − S^2(c)}{S^2(c)},    (28)

where S^2(c) is the square of the expectation of the random variable s(c), i.e., the square of the mean service time, while E[s^2(c)] is the expectation of the square of the random variable s(c). The equation above can be decomposed into an equation over the random variables to all destinations, s(c, c_d), giving

v_s^2(c) = \frac{\sum_{c_d \in succ(c) \cap C_{EJ}} P(path(c, c_d)) E[s^2(c, c_d)]}{S^2(c)} − 1.    (29)

By the definition of variance [14], the expression above can be written as

v_s^2(c) = \frac{\sum_{c_d \in succ(c) \cap C_{EJ}} P(path(c, c_d)) \{σ_s^2(c, c_d) + S^2(c, c_d)\}}{S^2(c)} − 1.    (30)

Taking into account the fact that the variance of a sum of independent random variables, s(c, c_d) in (12), is equal to the sum of the variances of those random variables, w(c_k, c_{k+1}), the variance of the service time σ_s^2(c, c_d) becomes the sum of the variances of the waiting times, denoted by σ_w^2(c_k, c_{k+1}). Assuming that the waiting time random variable w(c_k, c_{k+1}) takes independent values and its probability density function is exponential, the variance is equal to the square of the mean waiting time: σ_w^2(c_k, c_{k+1}) = W^2(c_k, c_{k+1}). As a result, the squared coefficient of variance becomes the expression given in Box I. This equation can be expressed as

v_s^2(c) = \frac{\overline{W^2}(c) + \overline{S^2}(c)}{S^2(c)} − 1,    (32)

where \overline{W^2}(c) denotes the average square of the waiting time over the successors of channel c, while \overline{S^2}(c) denotes the average square of the service time from channel c to all destinations. The average square of the direct waiting time in channel c can be written as

\overline{W_D^2}(c) = \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c → c_j) W^2(c → c_j).    (33)

Similarly to the derivation of the service time in channel c given by (26), the average square of the waiting time contributed by each successor of channel c can be expressed by a recursive equation as follows:

\overline{W^2}(c) = \overline{W_D^2}(c) + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c → c_j) \overline{W^2}(c_j).    (34)

The average square of the service time in channel c can be expressed by extracting the waiting time from the remaining components of the service time along the path:

\overline{S^2}(c) = \frac{1}{f(c)} \sum_{c_j \in succDir(c)} \sum_{c_d \in succ(c_j) \cap C_{EJ}} f(c, c_j, c_d) \left[ W(c → c_j) + S(c_j, c_d) \right]^2.    (35)

Applying the flow rate ratio in channel c given by (25), after several calculation steps, we obtain

\overline{S^2}(c) = \overline{W_D^2}(c) + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c, c_j) \left[ 2 W(c → c_j) S(c_j) + \overline{S^2}(c_j) \right].    (36)
v_s^2(c) = \frac{\frac{1}{f(c)} \sum_{c_d \in succ(c) \cap C_{EJ}} f(c, c_d) \sum_{c_k, c_{k+1} \in path(c, c_d), \, c_{k+1} = succDir(c_k)} W^2(c_k → c_{k+1}) + \frac{1}{f(c)} \sum_{c_d \in succ(c) \cap C_{EJ}} f(c, c_d) \left( \sum_{c_k, c_{k+1} \in path(c, c_d), \, c_{k+1} = succDir(c_k)} W(c_k → c_{k+1}) + L \right)^2}{S^2(c)} − 1.    (31)

Box I.
v_s^2(c) = \frac{2 \overline{W_D^2}(c) + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c → c_j) \left[ \overline{W^2}(c_j) + 2 W(c → c_j) S(c_j) + \overline{S^2}(c_j) \right]}{S^2(c)} − 1.    (37)

Box II.
Putting (34) and (36) into (32), we obtain a final expression for the squared coefficient of the variance of the service time in channel c (see Eq. (37), given in Box II). This equation is simple to calculate, since it depends only on parameters related to the direct succeeding channels, which are calculated at the local level.
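The substitution that leads to (37) is short; written out (our own intermediate step, using the notation above):

\overline{W^2}(c) + \overline{S^2}(c)
  = 2\,\overline{W_D^2}(c)
  + \frac{1}{f(c)} \sum_{c_j \in succDir(c)} f(c \to c_j)
    \left[ \overline{W^2}(c_j) + 2\,W(c \to c_j)\,S(c_j) + \overline{S^2}(c_j) \right],

since both (34) and (36) contribute one copy of \overline{W_D^2}(c); dividing by S^2(c) and subtracting 1, as in (32), gives exactly the expression in Box II.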
3.5. The parameters resulting from the model

3.5.1. The average packet latency

Deterministic routing algorithms in NoC and interconnection networks correspond to an acyclic CDG. The message-passing mechanism with channel queuing behavior affects the performance of a particular channel, described by the average waiting time and the average service time, which depend on the dynamic state of the succeeding channels. Since the last channels in the CDG correspond to the network ejection channels, they are occupied by packets which are drained from the network flit by flit without blocking. Therefore, their service time is minimal and equal to the message length, S(c) = L, ∀c ∈ C_EJ.

To calculate the waiting time and the service time for all channels, we can start from the ejection channels and track back through the CDG. For each channel, the local model calculates the partial waiting times in all its direct preceding channels, W(c_i → c). Since there is only a unidirectional dependency in the model, at some point, for each channel c, all direct waiting times on all paths to all possible destinations will have been calculated: W(c_k → c_{k+1}), c_k, c_{k+1} ∈ succ(c), where succ(c) denotes the set of all direct and indirect successors of channel c, succ(c) = succDir(c) ∪ succInd(c). The calculation ends with the beginning channels in the CDG, which correspond to the injection channels. These channels have only one possible input stream, directly from the corresponding source processing element, with flow rate equal to the packet generation rate in the source node, f(c_in) = λ_s. Packet injection from source nodes through the injection channels behaves as a pure M/G/1 queue, with waiting time given by

W_{SRC}(s) = \frac{f(c) S^2(c) (1 + v_s^2(c))}{2(1 − f(c) S(c))},  ∀c ∈ C_IN.    (38)
Having calculated the direct waiting times for each pair of adjacent channels in the CDG, we are able to calculate the total delay of a packet following a path from any source node s to any destination node d. It consists of the following components.
• The packet injection waiting time, W_SRC(s), given by (38).
• The direct waiting time at each hop from one channel to its direct successor over the path, calculated by the local model. This corresponds to the total blocking time the packet spends waiting in the network.
• The packet propagation time, i.e., the time needed to move a packet from the source to the destination, traversing one hop per cycle. This is equal to the hop distance between the source and the destination, denoted by D(s, d). Since both injection and ejection channels are also included on the path in the CDG, the network distance is increased by these two additional hops: D(s, d) = D_NET(s, d) + 2.
• The packet ejection time, i.e., the time needed to deliver the packet from the ejection channel to the destination processing node, which is equal to the packet length L.

The total packet latency on the path from source node s to destination node d is therefore given by

T_{lat}(s, d) = W_{SRC}(s) + \sum_{(c, c_j) \in path(s, d), \, c_j \in succDir(c)} W_D(c → c_j) + D(s, d) + L.    (39)
The contribution to the latency of each pair of source and destination nodes is determined by the probability of all possible pairs of source and destination, defined by the traffic distribution TD. The average packet latency can be written as

T_{lat} = \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) T_{lat}(s, d),    (40)

which can be decomposed as

T_{lat} = W_{SRC} + W_{NET} + D + L,    (41)
with the components given as follows.
• L is the packet length, invariant to the input load and traffic distribution, whether it is constant or given by some distribution.
• D is the average path length, or the network diameter under the given conditions, expressed by

D = \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) D(s, d).    (42)

• W_SRC is the average waiting time in the source nodes, expressed by

W_{SRC} = \frac{1}{λ} \sum_{s \in N_P} λ_s W_{SRC}(s).    (43)
• W_NET is the average network waiting time, which can be written as

W_{NET} = \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) \sum_{c \in C} \sum_{c_j \in succDir(c)} P(c → c_j|s, d) W_D(c → c_j).    (44)
Changing the order of the summations, we obtain

W_{NET} = \frac{1}{λ} \sum_{c \in C} \sum_{c_j \in succDir(c)} \left( λ \sum_{s \in N_P} \sum_{d \in N_P, d ≠ s} TD(s, d) · P(c → c_j|s, d) \right) · W_D(c → c_j).    (45)
Using (4), the equation above yields

W_{NET} = \frac{1}{λ} \sum_{c \in C} \sum_{c_j \in succDir(c)} f(c → c_j) · W_D(c → c_j),    (46)
which can be further simplified using (21), resulting in

W_{NET} = \frac{1}{λ} \sum_{c \in C} f(c) W_D(c).    (47)
Putting the equations above into (41), the average packet latency can finally be calculated as

T_{lat} = \frac{1}{λ} \sum_{s \in N_P} λ_s W_{SRC}(s) + \frac{1}{λ} \sum_{c \in C} f(c) W_D(c) + D + L.    (48)
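Assembling Eq. (48) from the per-channel results is mechanical; here is a minimal Python sketch, with dictionaries and names that are our own illustration:

    def average_latency(lam_s, W_src, f, W_D, D_avg, L):
        """lam_s[s]: generation rate per source node; W_src[s]: Eq. (38);
        f[c], W_D[c]: per-channel flow rate and direct waiting time, Eq. (21)."""
        lam = sum(lam_s.values())                               # Eq. (2)
        w_src = sum(lam_s[s] * W_src[s] for s in lam_s) / lam   # Eq. (43)
        w_net = sum(f[c] * W_D[c] for c in f) / lam             # Eq. (47)
        return w_src + w_net + D_avg + L                        # Eq. (48)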
3.5.2. Other performance metrics

The average packet latency is the main metric describing network performance. However, the generality of our model, based on the individual packet viewpoint in each channel, gives us the ability to calculate additional parameters which offer a more detailed view into the dynamic network behavior. The model inherently calculates the service time and the waiting time (both direct and indirect) for network channels as well as injection channels.

The channel utilization factor, in the range from 0 to 1, is widely used to express the effective usage of a channel by the flits passing through it. In our model, it is equal to the flow rate scaled by the packet length:

U(c) = L · f(c).    (49)

The channel occupation factor, in the range from 0 to 1, is the fraction of time for which packets occupy the channel, either in a blocking or a running state. It is a function of the channel flow rate and the channel service time:

O(c) = S(c) · f(c).    (50)

Since the service time consists of the packet length and the waiting time, the channel occupation can be further expressed as

O(c) = (L + W(c)) · f(c) = U(c) + B(c),    (51)

where B(c) is the channel blocking factor, in the range from 0 to 1, defined as the fraction of time for which a blocked packet occupies the channel. In the case of wormhole flow control, it can be further divided into direct and indirect blocking factors, corresponding to the direct and indirect waiting times:

B(c) = W(c) f(c) = (W_D(c) + W_I(c)) · f(c) = B_D(c) + B_I(c).    (52)
3.6. Model complexity discussion

During the initial phase of the model calculation, we need to generate the CDG and calculate the flow rate through each channel, f(c), as well as the partial flow rates from each channel to all its direct successor channels, f(c → c_j). This is a straightforward process performed by passing through all source and destination nodes using (3) and (4), resulting in a time complexity of O(N_nd^2). At the global level, the model is calculated using a depth-first search algorithm, starting from an arbitrary ejection channel as the root of the search and passing through all channels (vertices in the CDG) with linear time complexity [4]. The time complexity further depends on the processing in each node, calculating the parameters given by the local model using (10), (26), and (37) for all direct successors (outgoing edges in the CDG). For orthogonal topologies, the maximum number of direct successors in the worst case is 2n, where n is the dimension of the network (for channels in the lowest dimension: 1 edge for the current dimension, 2 edges for each of the other dimensions, and 1 for the ejection channel). All parameters related to individual channels, as well as the average network parameters, including the average packet latency (48), are calculated during this process. Consequently, the time complexity can be expressed as O(N_ch · n), which means that it is linear in the total number of channels (network, ejection, and injection channels) and the network dimension. The space complexity is even less critical, since it depends only on the depth of the graph, which is related to the longest path in the network. The calculation process is therefore fast and efficient, and it scales well with network size. It is also worth noting that the model does not involve any interdependency between parameters that would lead to an iterative calculation, which is the case with some other relevant analytical models [1,22,25–30]. The model is therefore very simple and efficient in terms of processing and memory requirements.
4. Validation of the model
4.1. Simulation parameters

NoC and multicomputer interconnection networks can be globally analyzed as black boxes loaded with particular input traffic. For the sake of simplicity, we assume a uniform packet generation rate λ_0 at each source processing node, while the network packet generation rate is λ = λ_0 · N_nd. The average path length that packets take in the network, named the network diameter, D, is given by the network topology, the routing algorithm, and the traffic distribution over the source and destination nodes. During a sufficiently long time T, a total of λT packets are generated, with an average packet length of L flits. Each packet therefore contributes L · D flit-hops to the channel utilization. Assuming one hop time as the basic time unit, the applied load (also known as the normalized throughput, or the expected network channel utilization), denoted ρ, is computed as the ratio of the total number of flit-hops in the network to the total capacity of the N_C network channels over time T:

ρ = \frac{λT · L · D}{T · N_C} = \frac{λ L D}{N_C}.    (53)
Using this equation, we can set up a particular network throughput ρ , ranging between 0 and 1, and calculate the input traffic rate λ as an input value for the packet generation procedure in the simulation program.
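For example, the inversion of Eq. (53) used to drive the simulation can be sketched as follows; the 8 × 8 mesh numbers are our own illustration (N_C = 224 unidirectional network channels, and D ≈ 5.25 average XY network hops under uniform traffic, an assumption that counts network hops only):

    def rates_from_load(rho, N_C, L, D_avg, N_nd):
        """Target applied load rho -> network and per-node generation rates."""
        lam = rho * N_C / (L * D_avg)   # network input rate, from Eq. (53)
        lam0 = lam / N_nd               # uniform per-node generation rate
        return lam, lam0

    # e.g. an 8x8 mesh: N_C = 2 dims * 2 directions * 8 rows * 7 links = 224
    lam, lam0 = rates_from_load(0.2, 224, 32, 5.25, 64)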
[Fig. 4. Average packet latency obtained by the model and the simulation: (a) networks with 16 nodes, (b) networks with 64 nodes, (c) networks with 256 nodes.]
[Fig. 5. Average packet latency obtained by the model and the simulation: (a) packet lengths of 32, 64, and 128 in a network with 64 nodes, (b) packet lengths of 32, 64, and 128 in a network with 512 nodes, (c) packet length of 128 in a network with 256 nodes.]
4.2. Simulation program
The simulation was performed using a specially developed simulation program written in the high-level programming language Ada. The program supports flit-level transitions of the message-passing mechanism, where the packet header with the destination address fits in the first flit. Input queuing is assumed, allowing a packet to occupy input channel buffers when it is blocked. One hop cycle is taken as the basic time unit. During that cycle, the routing nodes which hold heading flits perform the following tasks: executing the routing function, switching to the output channel as the result of the selection function, and, finally, transferring the flit through the granted output channel to the input channel of the next routing node. An FCFS (first-come, first-served) input channel selection policy is used when more than one packet requests the same output channel. During the same transition cycle, all tailing flits also make one hop ahead and follow the heading flit in pipeline fashion through the assigned channel path. The simulation program also includes several self-checking mechanisms and control parameters which assure its validity. The same simulation program was used by the authors in [11], where network congestion was analyzed at the individual channel level for deterministic dimension order and partially adaptive routing algorithms.

Simulations were run until 100,000 packets were received at the destination nodes. Sufficient warm-up time was provided: a large number of initial packets (10,000) were fully transferred without counting any parameters, allowing the network to reach a dynamic steady state. This criterion for ending the simulation was verified experimentally by the small variations of the measured values.
4.3. Simulation results against model predictions

Several network topologies, network sizes, packet lengths, and traffic distributions were simulated to validate the model. A uniform traffic distribution was used as a baseline, gradually increasing the applied load from very light traffic up to heavily loaded traffic near the saturation point. At that point, some network regions are fully congested and newly generated packets need to be rejected by the simulation.

Fig. 4 shows the average packet latency against the applied load obtained by the simulation (SIM) and by the model (MODEL), for a packet length of 32 flits (L) and network sizes of 16, 64, and 256 nodes (N). A multidimensional mesh topology is denoted by M××DIM, where DIM is the number of dimensions and M is the number of nodes in each dimension. All results demonstrate a high degree of accuracy of the Corrected M/G/1 model for all considered topologies and traffic distributions. The model also scales well with increased packet length and network size.

Fig. 5a and b compare results obtained for 2D and 3D square networks with 64 and 512 nodes (8××2 and 8××3 mesh) and packet lengths of 32, 64, and 128 flits. A longer packet directly increases the latency, according to (48), but it also contributes to a higher network load, which leads to faster growth of the packet latency against the applied load. However, the model shows that the normalized saturation point is invariant to the packet length. For the 8××2 mesh, the model predicts saturation at an applied load around 0.293 for all packet lengths considered. These results lead to the conclusion that an increase of the packet length increases the packet latency by approximately the same factor. Fig. 5c shows results for a packet length of 128 flits used in three
[Fig. 6. Average packet latency obtained by the model and the simulation for hot-spot traffic distribution: (a) one hot-spot node in the network corner of a 7××2 mesh, (b) 5 hot-spot nodes in a 7××2 mesh (center and all corners), (c) one hot-spot node in the network center of a 5××3 mesh.]
[Fig. 7. Maximum normalized throughput (normalized saturation point) ρ_max and maximum packet generation rate λ_max.]
different network topologies with 256 nodes (16××2, 4××4, and 2××8 mesh), which confirms that the model is scalable in terms of network topology and size, as well as packet length.

To validate the model against non-uniform traffic, we used a hot-spot traffic distribution. In this case, some destinations, named hot-spot nodes, appear with higher probability (multiplied by a factor F), while the other destinations are equally distributed. The results for the hot-spot traffic distribution with multiplication factor F = 4 are given in Fig. 6, for packet lengths 16 and 32 in various scenarios: (a) a single hot-spot destination positioned in the corner of a 7××2 mesh, (b) 5 hot-spot destinations positioned in the center and all corners of a 7××2 mesh, and (c) a single hot-spot destination positioned in the center of a 3D mesh with 125 nodes (5××3). In the case of a single hot-spot destination, approximately 10% of all packets are destined to this node, whereas, with 5 hot-spot destinations in the network, up to 40% of all packets are destined to these nodes.

The results confirm that the model is also very accurate for various non-uniform scenarios with hot-spot destinations, especially for light and moderate traffic loads. At intense traffic rates near the saturation point, this traffic distribution imposes an overloaded condition, with the hot-spot nodes acting as a bottleneck in the network, which causes congestion in some far-end channels. It should be noted that the simulation counts only packets that arrive at their destinations, while packets in the congested region are not taken into the measurement, since they are mostly blocked. On the other hand, the model does not tolerate any congested channel, resulting in a diverging latency. This explains the limited and relatively small latency obtained by the simulation versus the infinite latency calculated by the model at the saturation point.
In order to better compare the performance predicted by the model with the results obtained by the simulation, Table 1 shows the relative differences between these values for the uniform and hot-spot traffic distributions in all scenarios reported above. The applied load is taken at seven points in the range from 10% to 90% relative to the saturation point, covering light, moderate, and heavy traffic loads. The saturation points, given in the first row, are taken as the traffic at which the simulation reported overload in some injection queues, which means that some packets were dropped due to congestion. The last column gives the average value of the differences between the model and the simulation. It is notable that the model predicts performance values with a very high degree of accuracy for light and moderate traffic loads. Even for intense traffic loads near the saturation point, the error varies across scenarios, but the model is still very accurate on average.

It can be noted from the figures that the saturation points differ significantly for each network topology, where networks of higher dimension achieve a lower maximum normalized throughput. However, this conclusion has a relative meaning, since these networks have a smaller network diameter (D) and a higher number of network channels (N_C), and, according to (53), they can serve a higher packet generation rate (λ). These values are summarized in Fig. 7 for all considered cases, where the applied load ρ_max is in the range from 0 to 1, while λ_max is given in packets per clock cycle. The figure shows that higher-dimensional topologies, such as 4××4, 2××8, and 5××3, have a much better ratio of normalized throughput to packet rate at the saturation point, and consequently achieve better performance.
Table 1. Relative differences (errors, %) between performance predicted by the model and results obtained by the simulation, for applied loads from 10% to 90% of the saturation point, under uniform and hot-spot traffic distributions.

UNIFORM — topology, L (saturation); errors at 10 / 25 / 50 / 60 / 70 / 80 / 90% of saturation:
4××2, L=32 (0.35):   −0.29  −0.77  −1.86  −2.65  −3.41  −3.79  −2.48
2××4, L=32 (0.27):   −0.06  −0.32  −0.93  −1.50  −2.54  −2.08  2.60
4××3, L=32 (0.28):   −0.10  −0.33  −1.40  −2.31  −3.27  −4.40  −5.98
8××2, L=32 (0.28):   −0.10  −0.33  −1.40  −2.31  −3.27  −4.40  −5.98
8××2, L=64 (0.28):   −0.10  −0.34  −1.55  −2.50  −3.43  −4.51  −3.13
8××2, L=128 (0.28):  −0.10  −0.34  −1.53  −2.56  −3.66  −4.26  −3.84
16××2, L=32 (0.25):  −0.04  −0.29  −1.27  −2.07  −1.79  −4.36  −5.66
16××2, L=128 (0.25): −0.08  −0.28  −1.43  −2.73  −3.29  −5.06  −6.06
4××4, L=32 (0.23):   −0.08  −0.37  −2.03  −3.44  −5.57  −8.65  −12.65
4××4, L=128 (0.23):  −0.08  −0.37  −2.03  −3.44  −5.57  −8.65  −12.65
2××8, L=32 (0.21):   −0.10  −0.46  −2.45  −4.30  −6.20  −11.87 −14.04
2××8, L=128 (0.18):  −0.08  −0.22  −1.65  −2.66  −4.18  −6.72  −9.16
8××3, L=32 (0.23):   −0.26  −0.85  −2.39  −3.61  −5.43  −8.56  −11.94
8××3, L=64 (0.23):   −0.30  −0.96  −2.61  −3.88  −5.98  −8.68  −13.35
8××3, L=128 (0.23):  −0.30  −0.94  −2.68  −4.08  −6.21  −9.24  −12.76
Average (0.25):      −0.12  −0.44  −1.73  −2.81  −4.10  −6.21  −7.80

HOT-SPOT — topology, L, hot-spot position (saturation); errors at 10 / 25 / 50 / 60 / 70 / 80 / 90% of saturation:
7××2, L=16, central (0.23): −0.26  −0.69  −1.65  −2.34  −3.07  −3.48  −2.26
7××2, L=32, central (0.23): −0.04  −0.21  −0.56  −0.78  −1.07  −1.68  −2.32
7××2, L=16, corner (0.21):  −0.23  −0.66  −1.38  −1.95  −2.47  −3.20  −3.31
7××2, L=32, corner (0.21):  −0.28  −0.79  −1.60  −2.19  −2.70  −3.51  −3.25
7××2, L=16, 5 HS (0.23):    −0.22  −1.33  −1.70  −2.11  −1.61  5.18   4.66
7××2, L=32, 5 HS (0.23):    −0.65  −0.83  −1.21  −1.47  −1.45  −0.96  0.58
5××3, L=16, central (0.18): −0.73  −1.23  −1.64  −2.44  −3.16  −3.94  −0.17
5××3, L=32, central (0.18): −0.25  −1.03  −1.89  −2.67  −3.40  −4.67  0.97
Average (0.21):             −0.26  −0.78  −1.57  −2.18  −2.72  −3.14  −0.08
[Fig. 8. Channel service time obtained by the model and the simulation for a network with 64 nodes (8××2) and packet length of 32 flits: (a) injection channels, (b) X-channels, (c) Y-channels.]

[Fig. 9. Channel occupation factor for a network with 64 nodes (8××2) and packet length of 32 flits: (a) injection channels, (b) X-channels, (c) Y-channels.]

[Fig. 10. Channel blocking factor obtained by the model and the simulation for a network with 64 nodes (8××2) and packet length of 32 flits: (a) injection channels, (b) X-channels, (c) Y-channels.]
to calculate other useful parameters and improve the network performance analysis. We have chosen a network of 64 nodes (8^2 mesh) to further demonstrate the strength of the model in predicting these additional parameters. Once the channel service time has been calculated directly by the model, it is easy to obtain the channel occupation factor from (51) and its fraction, the channel blocking factor, from (53). These parameters are shown in Figs. 8–10, respectively, for the injection channels and for the network channels in the X- and Y-directions (the opposite directions in the same dimension are symmetrical). The figures show a significant difference between the performance of these channel classes. While the service time in the X-channels, and especially in the injection channels, grows rapidly and reaches the saturation point, the Y-channels handle packets without any congestion. The channel occupation and blocking factors follow the same behavior and indicate network congestion in the injection and X-channels. This is consistent with results previously reported in the literature [11], where the channel load increases when moving backward through the CDG from the ejection channels, over the network channels from higher to lower dimensions (opposite to the dimension order of the routing algorithm), to the injection channels, which are the most congested. We also demonstrate the high precision of the model at the level of individual channels, showing the channel waiting time divided into its two components: the direct and the indirect waiting time. Fig. 11 shows the distribution of the direct and indirect waiting times for all channels in the X-direction for an applied load near the saturation point (ρ = 0.25).
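As an illustration of this post-processing step, the sketch below computes the derived per-channel metrics from the model outputs. The exact expressions are Eqs. (51) and (53) of the paper, which are not restated here; the forms used below (occupation as f(c)S(c), and blocking as the waiting-time share of the occupation) are illustrative assumptions, and all names are ours.

```python
# A minimal sketch, assuming O(c) = f(c) * S(c) and attributing everything
# above the pure flit-transfer time to blocking; the paper's exact
# definitions are Eqs. (51) and (53).

def occupation_factor(f_c: float, S_c: float) -> float:
    """Fraction of time channel c is held by packets (assumed O(c) = f(c) * S(c))."""
    return f_c * S_c

def blocking_factor(f_c: float, S_c: float, transfer_time: float) -> float:
    """Assumed blocking factor: the part of the occupation spent blocked,
    i.e. f(c) times the service time in excess of the pure transfer time."""
    return f_c * max(S_c - transfer_time, 0.0)

# Hypothetical channel: flow rate 0.02 packets/cycle, average service time
# 40 cycles, of which 32 cycles are the pure transfer of a 32-flit packet.
print(occupation_factor(0.02, 40.0))      # 0.8
print(blocking_factor(0.02, 40.0, 32.0))  # 0.16
```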
Fig. 11. Channel (a) direct waiting time, (b) indirect waiting time, and (c) total waiting time for an 8^2 network, a packet length of 32 flits and an applied load of 0.25: simulation results (SIM) and predicted results (MODEL).
The unequal waiting times in different channels, combined with unequal packet flows (not shown), explain the unequal distribution of all other parameters over the network (see Fig. 11). This unbalanced distribution leads to congested regions, which ultimately bound the network performance.

5. Conclusion

In this paper, we have proposed an analytical model of deterministic wormhole routing in NoC and interconnection network architectures which is general in terms of network topology and traffic distribution. In comparison with existing solutions, the model has several advantages. It is simple, fast, and gives very accurate results for arbitrary network topologies and traffic distributions. It is modular, separating the analysis at the channel level from the calculation at the global network level, which takes into account the channel position in the CDG. Despite its somewhat involved mathematical derivation, the model is simple to use as a tool for performance analysis and for the evaluation of different solutions during system design. It can be used for early validation of a proposed architecture against different workloads, which can guide engineers to reshape the design to fit the requirements while dealing with the trade-off between generality and performance under time-to-market pressure.
The main contribution of our study, which provides highly accurate performance prediction, is the proposed channel-level model, named the Corrected M/G/1 model. The accuracy is further improved by a precise calculation of the variance of the service time, taking into account the paths to all possible destinations. In addition to the average packet latency, which is commonly used as the main performance metric, our approach also calculates other useful parameters for each channel in the network. We have demonstrated a deeper performance analysis based on the channel service time and the channel waiting time, as well as the channel occupation and blocking factors. For instance, the channel occupation distributed over the network reveals the asymmetric load and points out the critical regions which lead to network congestion. It is also possible to use the model to predict the packet latency between two arbitrary source and destination nodes. With this approach, engineers can balance the network load by modeling different task-distribution scenarios to maximize the network performance for specific applications.
Extending the model to multi-flit buffers and variable packet lengths is needed in order to support recent architectures which provide better performance for both NoCs and multicomputers. In that case, the calculation of the service time for each channel at the global level must be limited to the minimal number of succeeding channels which could entirely hold the blocked packet. At the local level, the model must be adapted to support the case where one packet is blocked by another packet over the same path. Supporting adaptive routing is a more challenging task: the flow rates over the network channels and the probability of each path are not predetermined, which leads to an interdependency between the calculation of the flow rates and the service times that can be resolved by iterative calculation. Extension of the model to multi-flit buffers and adaptive routing is therefore left for future work.

Appendix

A.1. Extending the model for multiple virtual channels

When multiple virtual channels share a physical channel, packets are not blocked as long as there is at least one free virtual channel on the next hop. Consequently, the average throughput and network performance increase, but an individual packet's transition through the network is slowed down. Taking this effect into
Table 2
List of symbols used in the paper.

Symbol: Description
TD(s, d): Probability that a packet is generated in source node s and destined to node d
λs, λ(s): Packet generation rate at source node s
λ: Packet generation rate in the network
f(c): Flow rate at channel c
f(ci, cj): Flow rate from channel ci to channel cj
f(ci → cj): Flow rate from channel ci to its direct successor channel cj
f(c, ci, ck): Flow rate from channels c, ci, and ck
P(c|s, d): Probability that the packet passes through channel c, provided that it is on the path from source node s to destination node d
P(ci, cj|s, d): Probability that the packet passes through channels ci and cj, provided that it is on the path from source node s to destination node d
P(ck|c): Probability that the packet passes through channel ck, provided that it has already reached channel c
P(c, ck): Probability that the packet takes the path from channel c to channel ck
P(c): Probability that the packet takes the path through channel c
Pbl(ci → cj): Probability that a packet coming from channel ci is blocked by another packet occupying its direct successor channel cj
s(c): Random variable of the service time for a packet in channel c
s(c, cd): Random variable of the service time for a packet in channel c in transition to channel cd
s(ci → cj): Random variable of the service time for a packet in channel ci in transition to its direct successor cj
S(c): Average service time for a packet in channel c
S(c, cd): Service time for a packet in channel c in transition to channel cd
S(ci → cj): Service time for a packet in channel ci in transition to its direct successor cj
w(c, cd): Random variable of the waiting time for a packet in channel c in transition to channel cd
w(ci → cj): Random variable of the waiting time for a packet in channel ci in transition to its direct successor cj
W(c): Average waiting time for a packet in channel c
W(c, cd): Waiting time for a packet in channel c in transition to channel cd
W(ci → cj): Waiting time for a packet in channel ci in transition to its direct successor cj
WD(c): Direct waiting time for a packet in channel c
WI(c): Indirect waiting time for a packet in channel c
E[s²(c)]: Expectation of the square of the random variable s(c)
E[s²(c, cd)]: Expectation of the square of the random variable s(c, cd)
vs²(c): Squared coefficient of variation of the service time for a packet in channel c
σs²(c): Variance of the service time for a packet in channel c
σs²(c, cd): Variance of the service time for a packet in channel c in transition to channel cd
S²(c): Average square of the service time for a packet in channel c
W²(c): Average square of the waiting time for a packet in channel c
WD²(c): Average square of the direct waiting time for a packet in channel c
Tlat: Average network packet latency
Tlat(s, d): Average packet latency from source node s to destination node d
WSRC(s): Packet waiting time to enter the network at source processing node s
D: Network diameter (average path length of packets in the network)
D(s, d): Average path length of packets in transition from source node s to destination node d
L: Packet length
U(c): Channel utilization factor
O(c): Channel occupation factor
B(c): Channel blocking factor
ρ: Applied load (normalized throughput, expected network channel utilization)
NP, NR, N: Set of processing nodes, set of routing nodes, set of all nodes
CN, CIN, CEJ, C: Set of network channels, set of injection channels, set of ejection channels, set of all channels
Nnd, NC: Number of routing nodes, number of network channels
predDir(c), predInd(c), pred(c): Set of direct predecessors, set of indirect predecessors, and set of all predecessors of channel c
succDir(c), succInd(c), succ(c): Set of direct successors, set of indirect successors, and set of all successors of channel c
path(c, cj): Set of all channels on the path from channel c to channel cj
qv: Temporary variables needed for calculating the probability Pv(c)
Pv(c): Probability that v virtual channels are busy in physical channel c
V(c): Average degree of multiplexing of virtual channels in physical channel c
V(s, d): Average degree of multiplexing of virtual channels over all physical channels on the path from source node s to destination node d
account, we have extended our model by calculating the average degree of virtual channel multiplexing over the physical channel, a method first proposed by Dally [6] and commonly used in the literature. Since our model is flow based and general in terms of topology and traffic distribution, we can calculate the average degree of virtual channel multiplexing for each physical channel and use it to scale the latency for each pair of source and destination nodes.
Consider a physical channel c with V virtual channels: packets arrive and request a free virtual channel at rate f(c), while they leave channel c and release a virtual channel at rate 1/S(c). The probability that v virtual channels are busy in physical channel c, denoted Pv(c), can be determined using a Markovian model [14]. State π(v) corresponds to v virtual channels being used; the rate out of state π(v) to π(v + 1) is f(c) and the rate out of state π(v) to state π(v − 1) is 1/S(c). In the case of the last state, π(V), the rate out to state π(V − 1) must be reduced by the arrival rate from state π(V − 1) to π(V), as shown in Fig. 12.

Fig. 12. Markovian model with the transition rates between states π(v), where state π(v) denotes that v virtual channels are busy.
Solving the Markovian model in the steady state gives the probabilities that v virtual channels are busy:
\[
q_v = \begin{cases}
1, & v = 0\\[1mm]
q_{v-1}\, f(c)\, S(c), & 0 < v < V\\[1mm]
\dfrac{f(c)\, q_{v-1}}{1/S(c) - f(c)}, & v = V
\end{cases}
\qquad
P_v(c) = \begin{cases}
\dfrac{1}{\sum_{j=0}^{V} q_j}, & v = 0\\[2mm]
P_{v-1}(c)\, f(c)\, S(c), & 0 < v < V\\[2mm]
\dfrac{P_{v-1}(c)\, f(c)}{1/S(c) - f(c)}, & v = V
\end{cases}
\tag{54}
\]
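The recursion in (54) translates directly into code. The sketch below (names and structure are ours) computes the occupancy probabilities from the model outputs f(c) and S(c); it relies on the stability condition f(c) < 1/S(c).

```python
def vc_occupancy_probs(f_c: float, S_c: float, V: int) -> list[float]:
    """Steady-state probabilities P_v(c) that v of the V virtual channels
    of physical channel c are busy, following the recursion of Eq. (54).
    Requires f_c < 1.0 / S_c so that the last state is well defined."""
    q = [1.0]                                     # q_0 = 1
    for v in range(1, V):
        q.append(q[v - 1] * f_c * S_c)            # q_v = q_{v-1} f(c) S(c)
    q.append(f_c * q[V - 1] / (1.0 / S_c - f_c))  # v = V: reduced departure rate
    total = sum(q)
    return [qv / total for qv in q]               # P_v(c) = q_v / sum_j q_j
```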
Observing the virtual channel from the perspective of a packet passing through it, when v virtual channels are used the packet transition time is extended by a factor of v. The probability, as seen by the packet, that v virtual channels are used corresponds to this time fraction and is given by the ratio of $v P_v(c)$ to the sum $\sum_{v=1}^{V} v P_v(c)$. Therefore, the average factor by which the packet is slowed down, i.e. the average degree of multiplexing of the virtual channels, is
\[
V(c) = \frac{\sum_{v=1}^{V} v^2 P_v(c)}{\sum_{v=1}^{V} v P_v(c)}.
\tag{55}
\]
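Continuing the sketch above, the average multiplexing degree of (55) follows directly from these probabilities:

```python
def avg_multiplexing_degree(P: list[float]) -> float:
    """Average degree of virtual-channel multiplexing V(c) from Eq. (55):
    the ratio of sum(v^2 * P_v) to sum(v * P_v) over v = 1..V."""
    num = sum(v * v * P[v] for v in range(1, len(P)))
    den = sum(v * P[v] for v in range(1, len(P)))
    return num / den if den > 0.0 else 1.0

# Example with hypothetical values: 4 virtual channels, f(c) = 0.02
# packets/cycle, S(c) = 40 cycles.
P = vc_occupancy_probs(0.02, 40.0, 4)
print(avg_multiplexing_degree(P))  # a value between 1 and 4
```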
The average degree of multiplexing of the virtual channels on the path from source node s to destination node d is therefore
\[
V(s, d) = \frac{1}{D(s, d)} \sum_{c \in C} V(c)\, P(c \mid s, d),
\tag{56}
\]
where D(s, d) is the distance between nodes s and d. Finally, the resulting average packet latency on the path from source s to destination d, Tlat(s, d), calculated by (39), needs to be multiplied by the average degree of virtual channel multiplexing given by (56).
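Finally, a path-level wrapper (again a sketch with assumed data structures) averages V(c) over the channels of the path as in (56) and scales the latency of (39) accordingly. With deterministic routing, P(c|s, d) is 1 for channels on the unique path and 0 otherwise, so the sum reduces to the path's own channels.

```python
def path_multiplexing_degree(path_channels: list[str], V_of: dict[str, float]) -> float:
    """V(s, d) from Eq. (56) under deterministic routing: the mean
    multiplexing degree over the D(s, d) channels of the unique path,
    taking len(path_channels) as D(s, d)."""
    return sum(V_of[c] for c in path_channels) / len(path_channels)

def scaled_latency(T_lat_sd: float, V_sd: float) -> float:
    """Average latency T_lat(s, d) computed by Eq. (39), scaled by V(s, d)."""
    return T_lat_sd * V_sd
```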
A.2. List of symbols

Table 2 summarizes the symbols used throughout this paper.

References

[1] M. Arjomand, H. Sarbazi-Azad, Power-performance analysis of network-on-chip with arbitrary buffer allocation schemes, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (4) (2011) 508–519.
[2] P. Bogdan, R. Marculescu, Non-stationary traffic analysis and its implications on multicore platform design, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (4) (2011) 508–519.
[3] B. Ciciani, M. Colajanni, C. Paolucci, Performance evaluation of deterministic wormhole routing in k-ary n-cubes, Parallel Computing 24 (1998) 2053–2075.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, second ed., MIT Press and McGraw-Hill, ISBN: 0-262-03293-7, 2001.
[5] W.J. Dally, Performance analysis of k-ary n-cube interconnection networks, IEEE Transactions on Computers 39 (6) (1990) 775–785.
[6] W.J. Dally, Virtual-channel flow control, IEEE Transactions on Parallel and Distributed Systems 3 (2) (1992) 194–205.
[7] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, Journal of Parallel and Distributed Computing 32 (1994) 202–214.
[8] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Transactions on Parallel and Distributed Systems 4 (12) (1993) 1320–1331.
[9] J. Duato, A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks, IEEE Transactions on Parallel and Distributed Systems 6 (10) (1995) 1055–1067.
[10] E. Fleury, P. Fraigniaud, A general theory for deadlock avoidance in wormhole-routed networks, IEEE Transactions on Parallel and Distributed Systems 9 (7) (1998) 626–638.
[11] S. Gajin, Z. Jovanović, Explanation of performance degradation in turn model, The Journal of Supercomputing 37 (2006) 271–295.
[12] A. Khonsari, H. Sarbazi-Azad, M. Ould-Khaoua, An analytical model of adaptive wormhole routing with time-out, Future Generation Computer Systems 19 (2003) 1–12.
[13] J.H. Kim, A.A. Chien, Network performance under bimodal traffic loads, Journal of Parallel and Distributed Computing 28 (1995) 43–64.
[14] L. Kleinrock, Queueing Systems, Vol. 1: Theory, Wiley, New York, 1975.
[15] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network-on-chip architecture and design methodology, in: Proceedings of the Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, 2002, pp. 117–124.
[16] S. Loucif, M. Ould-Khaoua, On the merits of hypermeshes and tori with adaptive routing, Journal of Systems Architecture 47 (2002) 795–806.
[17] S. Loucif, M. Ould-Khaoua, The impact of virtual channel allocation on the performance of deterministic wormhole-routed k-ary n-cubes, Simulation Modelling Practice and Theory 10 (2002) 525–541.
[18] S. Loucif, M. Ould-Khaoua, A. Al-Ayyoub, Hypermeshes: implementation and performance, Journal of Systems Architecture 48 (2002) 37–47.
[19] S. Loucif, M. Ould-Khaoua, L.M. Mackenzie, Analysis of fully adaptive wormhole routing in tori, Parallel Computing 25 (1999) 1477–1487.
[20] M. Moadeli, A. Shahrabi, W. Vanderbauwhede, P. Maji, An analytical performance model for the Spidergon NoC with virtual channels, Journal of Systems Architecture 56 (2010) 16–26.
[21] U. Ogras, P. Bogdan, R. Marculescu, An analytical approach for network-on-chip performance analysis, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29 (12) (2010) 2001–2013.
[22] M. Ould-Khaoua, Message latency in the 2-dimensional mesh with wormhole routing, Microprocessors and Microsystems 22 (1999) 509–514.
[23] M. Ould-Khaoua, L.M. Mackenzie, On the design of hypermesh interconnection networks for multicomputers, Journal of Systems Architecture 46 (2000) 779–792.
[24] F. Quaglia, B. Ciciani, M. Colajanni, Performance analysis of adaptive wormhole routing in a two-dimensional torus, Parallel Computing 28 (2002) 485–501.
[25] H. Sarbazi-Azad, A mathematical model of deterministic wormhole routing in hypercube multicomputers using virtual channels, Applied Mathematical Modelling 27 (2003) 943–953.
[26] H. Sarbazi-Azad, A. Khonsari, M. Ould-Khaoua, Analysis of k-ary n-cubes with dimension-ordered routing, Future Generation Computer Systems 19 (2003) 493–502.
[27] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, Communication delay in hypercubes in the presence of bit-reversal traffic, Parallel Computing 27 (2001) 1801–1816.
[28] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, A performance model of adaptive wormhole routing in k-ary n-cubes in the presence of digit-reversal traffic, The Journal of Supercomputing 22 (2002) 139–159.
[29] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, Analytical modelling of wormhole-routed k-ary n-cubes in the presence of matrix-transpose traffic, Journal of Parallel and Distributed Computing 63 (2003) 396–409.
[30] A. Shahrabi, L.M. Mackenzie, M. Ould-Khaoua, An analytical model of wormhole-routed hypercubes under broadcast traffic, Performance Evaluation 53 (2003) 23–42.
Slavko Gajin received his Dipl. Eng., M.S., and Ph.D. degrees from the University of Belgrade, School of Electrical Engineering, Serbia, in 1993, 1999, and 2007, respectively. He is the director of the Belgrade University Computer Center, where he started working as a network engineer after receiving his bachelor's degree. He is a professor at the Department of Computer Engineering and Computer Science at the School of Electrical Engineering, University of Belgrade. His current research interests include the modeling and performance analysis of communication systems, routing algorithms in NoC and interconnection networks, and computer network management, monitoring, and performance.
Zoran Jovanovic received his Dipl. Eng., M.S., and Ph.D. degrees from the University of Belgrade, School of Electrical Engineering in 1978, 1982, and 1988, respectively. He is the Director of the National Research and Education Network of Serbia (AMRES). He is a professor at the Department of Computer Engineering and Computer Science at the School of Electrical Engineering, University of Belgrade. He has more than 15 published journal papers. His current research interests include parallel computing, interconnection networks, and concurrent and distributed programming.