Future Generation Computer Systems 23 (2007) 737–747 www.elsevier.com/locate/fgcs
Analytical communication networks model for enterprise Grid computing

Bahman Javadi (a), Mohammad K. Akbari (a), Jemal H. Abawajy (b,*)

(a) Amirkabir University of Technology, Computer Engineering and Information Technology Department, 424 Hafez Ave., Tehran, Iran
(b) Deakin University, School of Engineering and Information Technology, Geelong, VIC 3217, Australia
Received 22 April 2006; received in revised form 23 November 2006; accepted 24 November 2006 Available online 2 April 2007
Abstract

This paper addresses the problem of performance analysis based on the communication modeling of large-scale heterogeneous distributed systems, with an emphasis on enterprise Grid computing systems. The study of communication layers is important, as the overall performance of a distributed system often critically hinges on the effectiveness of this part. We propose an analytical model that is based on probabilistic analysis and queuing networks. The proposed model considers the processor as well as network heterogeneity of the enterprise Grid system. The model is validated through comprehensive simulations, which demonstrate that the proposed model exhibits a good degree of accuracy for various system sizes, and under different working conditions.

© 2006 Elsevier B.V. All rights reserved.
Keywords: Enterprise Grid; Performance analysis; Analytical modeling; Heterogeneity
* Corresponding author. Tel.: +61 3 5227 1376; fax: +61 3 5227 2028.
E-mail addresses: [email protected] (B. Javadi), [email protected] (M.K. Akbari), [email protected] (J.H. Abawajy).
doi:10.1016/j.future.2006.11.002

1. Introduction

Advances in computational and communication technologies have made it economically feasible to conglomerate multiple independent clusters, leading to the development of the large-scale distributed systems commonly referred to as Grid computing [1-3]. Grids can be classified in two ways, according to their architecture and presence: global Grids and enterprise Grids [30]. These two categories have different characteristics and are suitable for different scenarios. Global Grids are established over the public Internet, are characterized by a global presence, comprise highly heterogeneous resources, employ more sophisticated security mechanisms with a focus on single sign-on, and are mostly batch-job oriented [30]. In contrast, enterprise Grid computing systems consist of resources spread across an enterprise, provide services to users within that enterprise, and are managed by a single organization [4,5,29]. They can be deployed within large corporations that have a global presence, even though they are limited to a single enterprise [13]. Organizations may also want to go beyond their own Grids and share resources with new partners when their applications require computing resources that surpass what their own Grids can offer [29]. In this way, by using the extra resources offered by other partners, they can improve their performance as well as increase their agility.

This paper addresses the problem of performance analysis based on the communication modeling of large-scale heterogeneous distributed systems, with an emphasis on enterprise Grid computing systems. We are motivated to study this problem for a number of reasons. First, interconnection network design plays a central role in the design and development of enterprise Grid computing. Second, due to the interconnection network's contention problems [6], having a fast communication network does not necessarily guarantee good performance from the enterprise Grid computing system built on it. The contention problems, which adversely affect the overall performance, can occur in host nodes, network links, and network switches [6]. Node contention happens when multiple data packets compete to obtain a receive channel of a node, and link contention occurs when two or more packets share a communication link. Switch contention is due to unbalanced traffic flow through the switch, which can result in an overflow of the switch buffer.

The contribution of this paper is an analytical model for the communication networks of enterprise Grid systems. The proposed model is based on probabilistic analysis and queuing networks to analytically evaluate the
performance of communication networks for enterprise Grid systems. The model takes into account processor as well as network heterogeneity among clusters. The model is validated through comprehensive simulations, which demonstrate that it exhibits a good degree of accuracy for various system sizes and under different operating conditions.

Several analytical performance models of multi-computer systems have been proposed in the literature for different interconnection networks and routing algorithms (e.g., [7-12]). However, analytical models for the systems of interest here are rare, and most existing works are based on homogeneous single-cluster systems [13-16], with the exception of [17], which considered processor heterogeneity. Moreover, in [18] a queuing model based on input and server distributions was proposed to analyze a specific Grid system, VEGA 1.1. The majority of these works are based on job-level modeling; in contrast, we propose an analytical model of the communication layer of enterprise Grid systems to provide a more accurate performance analysis and prediction. To the best of our knowledge, our work is the first to deal with heterogeneous enterprise Grid environments.

The rest of the paper is organized as follows. In Section 2, a brief description of the enterprise Grid system used in this paper is presented. In Section 3, we give a detailed description of the proposed analytical communication model. In Section 4, we present an experimental evaluation to validate the proposed model. Finally, we summarize our findings and conclude the paper in Section 5.

2. System description

The computational Grid architecture used in this paper is shown in Fig. 1. The system is composed of C clusters; each cluster i, i ∈ {0, 1, ..., C − 1}, is composed of N_i computing nodes, each node with processing power τ_i and its associated memory module.
In our model, we express the processing power of the various processors in each cluster relative to a fixed reference processor [17], rather than relative to the fastest processor, as is done in most works on heterogeneous parallel systems. Although the latter choice may appear more natural, since it makes it possible to obtain the speedup by comparing the performance of the parallel system with that of the fastest single node available, we think that choosing a fixed reference allows a clearer performance analysis, especially if we vary the number and/or the power of nodes. The relative processing power of each node in cluster i can thus be stated as $s^{(i)} = \tau_i / \tau_f$, where $f$ is the index of the reference machine. Since we consider processor heterogeneity among clusters, the total relative processing power and the average relative processing power of the system are, respectively:

$$S = \sum_{i=0}^{C-1} s^{(i)} \quad (1)$$

$$\bar{s} = \frac{S}{C}. \quad (2)$$

Each cluster in the enterprise Grid system has two communication networks: an Intra-Communication Network
Fig. 1. Enterprise Grid computing architecture.
(ICN1) and an intEr-Communication Network (ECN1). The ICN1 is used for passing messages between processors within a cluster, while the ECN1 is used to transmit messages between clusters and for the management of the entire system. To interconnect clusters, the ECN1 is connected through a set of concentrators/dispatchers [19] to the external network, i.e., the ICN2. Since high-performance computing clusters typically utilize Constant Bisectional Bandwidth networks [20-22], we adopted the m-port n-tree [23] of fixed-arity switches as the topology for each cluster. An m-port n-tree consists of $N_{pn}$ processing nodes and $N_{sw}$ communication switches, which can be calculated with the following equations:

$$N_{pn} = 2 \times \left(\frac{m}{2}\right)^n \quad (3)$$

$$N_{sw} = (2n - 1) \times \left(\frac{m}{2}\right)^{n-1}. \quad (4)$$

In addition, each communication switch has m communication ports {0, 1, 2, ..., m − 1} that are attached to other switches or to processing nodes. Every switch (except the root switches) uses ports in the range {0, 1, ..., (m/2) − 1} for connections to its descendants or processing nodes, and uses ports in the range {m/2, (m/2) + 1, ..., m − 1} for connections to its ancestors. This means that the internal switches have m/2 up-links and m/2 down-links connected to their ancestors and descendants, respectively. It can be shown that the m-port n-tree is a full bisection bandwidth topology [24], so link contention does not occur in such a network. In this paper, we use wormhole flow control and deterministic routing, which are commonly used in cluster network technologies such as Myrinet, InfiniBand and QsNet [25]. We use a deterministic routing algorithm based on Up*/Down* routing [26], which is proposed in [24]. In this algorithm, each message experiences two phases: an ascending phase to reach a Nearest Common Ancestor (NCA), followed by a descending phase.
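The node and switch counts in Eqs. (3) and (4) are easy to check numerically; a small sketch (function name is ours):

```python
def tree_counts(m: int, n: int) -> tuple:
    """Processing-node and switch counts of an m-port n-tree (Eqs. (3)-(4))."""
    half = m // 2                          # m/2 up-links and m/2 down-links per switch
    n_pn = 2 * half ** n                   # Eq. (3): N_pn = 2 * (m/2)^n
    n_sw = (2 * n - 1) * half ** (n - 1)   # Eq. (4): N_sw = (2n - 1) * (m/2)^(n-1)
    return n_pn, n_sw

# For example, an 8-port 2-tree (a configuration used later in the
# validation section) has 2 * 4^2 = 32 nodes and 3 * 4 = 12 switches.
```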
Furthermore, since this algorithm achieves a balanced traffic distribution, the switch contention problem is not present [24].

3. The analytical communication model

In this section, we develop the analytical communication model for the enterprise Grid system discussed in the previous
section. The notation used in this paper is summarized in Table 2 in the Appendix. The proposed model is built on the following assumptions, which are widely used in similar studies [8-13]:

(1) Nodes generate traffic independently of each other, following a Poisson process with an average rate of $\lambda_g^{(i)}$ messages per time unit, where i ∈ {0, 1, ..., C − 1}.
(2) The number of processors is the same in all clusters ($N_0 = N_1 = \cdots = N_{C-1}$), but the clusters' nodes are heterogeneous in their processing power.
(3) The destination of each message can be any node in the system, with uniform distribution.
(4) The inter-communication and intra-communication networks are heterogeneous, i.e., they exhibit different communication characteristics.
(5) The communication switches are input-buffered, and each channel is associated with a single flit buffer.
(6) The message length is fixed (M flits).
(7) The source queue at the injection channel of the source node has infinite capacity. Moreover, messages are transferred to their node as soon as they arrive at their destinations.

There are two types of connection in this topology: node to switch (or switch to node) and switch to switch. In the first and last stages we have node-to-switch and switch-to-node connections, respectively; in the middle stages, switch-to-switch connections are employed. The service time of each type of connection is approximated as follows:

$$t_{cn} = 0.5\alpha_{net} + L_m \beta_{net} \quad (5)$$

$$t_{cs} = \alpha_{sw} + L_m \beta_{net} \quad (6)$$
where $t_{cn}$ and $t_{cs}$ are the transmission times of node-to-switch (or switch-to-node) and switch-to-switch connections, respectively, $\alpha_{net}$ and $\alpha_{sw}$ are the network and switch latencies, $\beta_{net}$ is the transmission time of one byte (the inverse of the bandwidth), and $L_m$ is the length of each flit in bytes. In the presence of network heterogeneity, we have two pairs of transmission times: $(t_{cn_I}, t_{cs_I})$ for the intra-communication networks and $(t_{cn_E}, t_{cs_E})$ for the inter-communication networks.

3.1. Outline of the model

The message flow model of the system is shown in Fig. 2, where the path of a flit through the various communication networks is illustrated. A processor, shown as a circle in this figure, sends its messages to the ICN1 or the ECN1 with probabilities $1 - P_o$ and $P_o$, respectively; the message path is depicted with arrows. Since the message rate of a processor is $\lambda_g^{(i)}$, the input rates of the ICN1 and the ECN1 that are fed by the same processor are $\lambda_g^{(i)}(1 - P_o)$ and $\lambda_g^{(i)} P_o$, respectively. Here $P_o$ is the probability of a message leaving its cluster, i.e., the probability of inter-cluster message generation, which is obtained by the following equation:

$$P_o = \frac{N - N_0}{N - 1} = \frac{(C - 1) \times N_0}{C \times N_0 - 1}. \quad (7)$$
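Eq. (7) can be read as a destination count: under the uniform-traffic assumption, of the N − 1 possible destinations of a message, N − N0 lie outside the source cluster. A quick numerical check (function name is ours):

```python
def p_out(C: int, N0: int) -> float:
    """Probability that a message leaves its source cluster (Eq. (7))."""
    N = C * N0                     # total number of nodes in the system
    return (N - N0) / (N - 1)      # equivalently (C - 1) * N0 / (C * N0 - 1)

# Both forms of Eq. (7) agree, e.g. for C = 16 clusters of N0 = 32 nodes
# p_out(16, 32) equals 480/511.
```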
Fig. 2. Message flow model in each communication network.
The external requests of cluster i go through the ECN1 with probability $P_o$, and then to the ICN2. On the return path, a message again crosses the ECN1, in cluster v, to reach the destination node. The concentrators/dispatchers work as simple buffers that interface the two external networks (i.e., ECN1 and ICN2) and thus combine the message traffic from/to one cluster to/from the other clusters. Therefore, the message rates received by the ICN1 and the ECN1 in cluster i (to cluster v) can be calculated as follows:

$$\lambda_{I1}^{(i)} = (1 - P_o)\lambda_g^{(i)} \quad (8)$$

$$\lambda_{E1}^{(i,v)} = P_o \lambda_g^{(i)} + P_o \lambda_g^{(v)}, \quad v \neq i. \quad (9)$$

In the second stage, the message rate of the ICN2 can be computed by the following equation:

$$\lambda_{I2}^{(i)} = N_0 P_o \lambda_g^{(i)}. \quad (10)$$
Given that a newly generated message in cluster i traverses 2j links to reach its destination with probability $P_j$, the average number of links that a message traverses is given by:

$$d_{avg} = \sum_{j=1}^{n} 2j \times P_j \quad (11)$$

where $P_j$ is the probability of a message crossing 2j links (j links in the ascending and j links in the descending phase) to reach its destination in an m-port n-tree topology. Different choices of $P_j$ lead to different distributions of the message destination, and consequently to different average message distances. As stated in assumption (3), we consider the uniform traffic pattern, so, based on the m-port n-tree topology, we can define this probability as follows:

$$P_j = \begin{cases} \dfrac{\left(\frac{m}{2} - 1\right)\left(\frac{m}{2}\right)^{j-1}}{2\left(\frac{m}{2}\right)^n - 1} & j = 1, 2, \ldots, n - 1 \\[2ex] \dfrac{(m - 1)\left(\frac{m}{2}\right)^{j-1}}{2\left(\frac{m}{2}\right)^n - 1} & j = n. \end{cases} \quad (12)$$

By substituting Eq. (12) into Eq. (11), the average message distance is obtained as follows:

$$d_{avg} = \frac{(nm - 2n - 1)\left(\frac{m}{2}\right)^n + 1}{\left(\frac{m}{2} - 1\right)\left[\left(\frac{m}{2}\right)^n - \frac{1}{2}\right]} \quad \forall n > 1. \quad (13)$$
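Eqs. (11)-(13) can be cross-checked against each other numerically; the sketch below computes $P_j$ from Eq. (12), the average distance by direct summation (Eq. (11)), and the closed form of Eq. (13) (all names are ours):

```python
def p_j(m: int, n: int, j: int) -> float:
    """Probability of a 2j-link journey under uniform traffic (Eq. (12))."""
    q = m // 2
    denom = 2 * q ** n - 1
    coeff = (q - 1) if j < n else (m - 1)   # top level also reaches the other half-tree
    return coeff * q ** (j - 1) / denom

def d_avg(m: int, n: int) -> float:
    """Average message distance by direct summation (Eq. (11))."""
    return sum(2 * j * p_j(m, n, j) for j in range(1, n + 1))

def d_avg_closed(m: int, n: int) -> float:
    """Closed form of Eq. (13), valid for n > 1."""
    q = m / 2
    return ((n * m - 2 * n - 1) * q ** n + 1) / ((q - 1) * (q ** n - 0.5))

# e.g. for a 4-port 2-tree both forms give 26/7 ≈ 3.714.
```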
For n = 1, the average message distance is $d_{avg} = 2$. Consequently, the rate of messages received by each channel is given as follows:

$$\eta_{I1}^{(i)} = \frac{(1 - P_o)\lambda_g^{(i)} \times d_{avg(I1)}}{4n} \quad (14)$$

$$\eta_{E1}^{(i,v)} = \frac{P_o\left(\lambda_g^{(i)} + \lambda_g^{(v)}\right) \times d_{avg(E1)}}{4n}, \quad v \neq i \quad (15)$$

$$\eta_{I2} = \frac{\sum_{i=0}^{C-1} N_0 P_o \lambda_g^{(i)} \times d_{avg(I2)}}{4 n_c C} \quad (16)$$

where $n_c$, the number of trees in the ICN2, can be computed as follows:

$$n_c = \left\lceil \frac{\log_2 C - 1}{\log_2 m - 1} \right\rceil. \quad (17)$$

Fig. 3. Markov chain used to calculate the blocking probabilities.

3.1.1. Average message latency of the intra-communication network

In this section, we find the average message latency of the intra-communication network from cluster i's point of view. Since each message may cross different numbers of links to reach its destination, we denote the network latency of a 2j-link message by $T_j^{(i)}$; averaging over all the possible destinations of a message yields the average message latency:

$$\bar{T}_{in}^{(i)} = \sum_{j=1}^{n} P_j \times T_j^{(i)}. \quad (18)$$
Our analysis begins at the last stage and proceeds backward to the first stage. The network stages are numbered according to the locations of the switches between the source and destination nodes: numbering starts at the stage next to the source node (stage 0) and increases as we approach the destination node. The number of stages for a 2j-link journey in an m-port n-tree topology is K = 2j − 1. The destination, at stage K − 1, is always able to receive a message, so the service time given to a message at the final stage is $t_{cn_I}$. The service time at the internal stages may be larger, because a channel is idled while the channel of a subsequent stage is busy. The average amount of time that a message waits to acquire a channel at stage k for cluster i, $W_{k,j}^{(i)}$, is given by the product of the channel blocking probability at stage k, $P_{B_{k,j}}^{(i)}$, and half the average service time of a channel at stage k, $T_{k,j}^{(i)}/2$ [19]:

$$W_{k,j}^{(i)} = \frac{1}{2} T_{k,j}^{(i)} P_{B_{k,j}}^{(i)}. \quad (19)$$

The value of $P_{B_{k,j}}^{(i)}$ is determined using a birth-death Markov chain [27]. Such a two-state Markov chain is shown in Fig. 3, in which the rates of transition out of and into the first state are $\eta_{I1}^{(i)}$ and $1/T_{k,j}^{(i)} - \eta_{I1}^{(i)}$, respectively. Solving this chain for the steady-state probabilities gives:

$$P_{B_{k,j}}^{(i)} = \eta_{I1}^{(i)} T_{k,j}^{(i)}. \quad (20)$$
The average service time of a message at stage k equals the message transfer time plus the waiting times at subsequent stages to acquire a channel:

$$T_{k,j}^{(i)} = \begin{cases} \sum_{l=k+1}^{K-1} W_{l,j}^{(i)} + M t_{cs_I} & 0 \le k \le K - 2 \\ M t_{cn_I} & k = K - 1. \end{cases} \quad (21)$$

According to this equation, the average network latency of a message with a 2j-link journey is $T_{0,j}^{(i)}$ $(= T_j^{(i)})$. A message originating from a given source node in cluster i thus sees a network latency of $\bar{T}_{in}^{(i)}$ (given by Eq. (18)). Due to the blocking that takes place in the network, the message latency distribution function becomes general. Therefore, a channel at the source node is modeled as an M/G/1 queue, for which the average waiting time is given by [27]:

$$\bar{W}_{in}^{(i)} = \frac{\lambda^{(i)}\left(\sigma_s^{(i)2} + \bar{X}^{(i)2}\right)}{2\left(1 - \rho^{(i)}\right)} \quad (22)$$

$$\rho^{(i)} = \lambda^{(i)} \bar{X}^{(i)} \quad (23)$$

where $\lambda^{(i)}$ is the average message rate on the network, $\bar{X}^{(i)}$ is the average service time, and $\sigma_s^{(i)2}$ is the variance of the service time distribution. Since the minimum service time of a message at the first stage equals $M t_{cn_I}$, the variance of the service time distribution is approximated, following a method proposed by Draper and Ghosh [10], as:

$$\sigma_s^{(i)2} = \left(\bar{T}_{in}^{(i)} - M t_{cn_I}\right)^2. \quad (24)$$

As a result, the average waiting time at the source queue becomes:

$$\bar{W}_{in}^{(i)} = \frac{\lambda_{I1}^{(i)}\left[\left(\bar{T}_{in}^{(i)} - M t_{cn_I}\right)^2 + \left(\bar{T}_{in}^{(i)}\right)^2\right]}{2\left(1 - \lambda_{I1}^{(i)} \bar{T}_{in}^{(i)}\right)}. \quad (25)$$
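The backward recursion of Eqs. (19)-(21) is straightforward to implement; the sketch below (function and parameter names are ours) computes the network latency $T_{0,j}$ of a journey with K stages, taking the channel rate and timing parameters as illustrative inputs:

```python
def network_latency(eta: float, K: int, M: int, t_cn: float, t_cs: float) -> float:
    """Backward recursion of Eqs. (19)-(21): returns T_{0,j}, the average
    network latency of a K-stage journey under channel arrival rate eta."""
    T = [0.0] * K
    W = [0.0] * K
    T[K - 1] = M * t_cn                   # destination always accepts (Eq. (21), k = K-1)
    W[K - 1] = 0.5 * eta * T[K - 1] ** 2  # Eqs. (19)-(20): W = (T/2) * (eta * T)
    for k in range(K - 2, -1, -1):        # walk backward toward the source (stage 0)
        T[k] = sum(W[k + 1:]) + M * t_cs  # transfer time plus downstream waits
        W[k] = 0.5 * eta * T[k] ** 2
    return T[0]

# With no contention (eta = 0) every wait vanishes and the latency at the
# first stage collapses to M * t_cs.
```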
Finally, the average message latency, $\bar{L}_{in}^{(i)}$, seen by a message crossing from its source node to its destination in cluster i, consists of three parts: the average waiting time at the source queue ($\bar{W}_{in}^{(i)}$), the average network latency ($\bar{T}_{in}^{(i)}$), and the average time for the tail flit to reach the destination ($\bar{R}_{in}$).
Therefore,

$$\bar{L}_{in}^{(i)} = \bar{W}_{in}^{(i)} + \bar{T}_{in}^{(i)} + \bar{R}_{in} \quad (26)$$

where

$$\bar{R}_{in} = \left(d_{avg(I1)} - 2\right)t_{cs_I} + t_{cn_I}. \quad (27)$$
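Given the average network latency $\bar{T}_{in}$ from Eq. (18), Eqs. (25)-(27) assemble the intra-cluster message latency; a minimal sketch with hypothetical parameter names:

```python
def intra_latency(lam_i1: float, T_in: float, M: int,
                  t_cn: float, t_cs: float, d_avg_i1: float) -> float:
    """Average intra-cluster message latency L_in (Eqs. (25)-(27)).
    T_in is the average network latency obtained from Eq. (18)."""
    var = (T_in - M * t_cn) ** 2                                   # Eq. (24)
    W_in = lam_i1 * (var + T_in ** 2) / (2 * (1 - lam_i1 * T_in))  # Eq. (25)
    R_in = (d_avg_i1 - 2) * t_cs + t_cn                            # Eq. (27)
    return W_in + T_in + R_in                                      # Eq. (26)
```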
3.1.2. Average message latency of the inter-communication networks

As mentioned before, inter-cluster messages cross both networks, the ECN1 and the ICN2, to reach their destinations in other clusters. Since the flow control mechanism is wormhole, the latencies of these networks must be calculated as a merged one. From this, and based on Eq. (18), we can write:

$$\bar{T}_{out}^{(i,v)} = \sum_{j=1}^{n} \sum_{h=1}^{n_c} P_{j,h} \times T_{j,h}^{(i,v)} \quad (28)$$

$$P_{j,h} = P_j \times P_h \quad (29)$$

where $P_j$ and $P_h$ can be calculated from Eq. (12). This means that each external message crosses 2j links through the ECN1 (j links in the source cluster i and j links in the destination cluster v) and 2h links in the ICN2 to reach its destination, so the analysis has to be carried out for K = 2(j + h) − 1 stages. Based on Eqs. (19) and (20), the average amount of time that a message waits to acquire a channel at stage k in the inter-communication networks is:

$$W_{k,(j,h)}^{(i,v)} = \frac{1}{2}\eta_k^{(i,v)}\left(T_{k,(j,h)}^{(i,v)}\right)^2 \quad (30)$$

where the average channel rate $\eta_k^{(i,v)}$ can be derived from the following equation:

$$\eta_k^{(i,v)} = \begin{cases} \eta_{I2} & j \le k < j + 2h - 1 \\ \eta_{E1}^{(i,v)} & \text{otherwise.} \end{cases} \quad (31)$$

The average service time of a channel in the inter-communication networks from cluster i's point of view can be obtained in a manner similar to that for the intra-communication network:

$$T_{k,(j,h)}^{(i,v)} = \begin{cases} \sum_{l=k+1}^{K-1} W_{l,(j,h)}^{(i,v)} + M t_{cs_E} & 0 \le k \le K - 2 \\ M t_{cn_E} & k = K - 1. \end{cases} \quad (32)$$

As in the intra-communication network, the network latency of an inter-cluster message equals the average service time of a channel at the first stage. The arithmetic average over all the latencies that a message from cluster i to any other cluster v might see gives the average latency of the inter-communication networks:

$$\bar{T}_{out}^{(i)} = \frac{1}{C - 1}\sum_{v=0, v \neq i}^{C-1} \bar{T}_{out}^{(i,v)}. \quad (33)$$

As before, the source queue in the inter-communication networks is modeled as an M/G/1 queue, and the same method is used to approximate the variance of the service time. Thus, the average waiting time at the source queue of the inter-communication networks can be calculated as:

$$\bar{W}_{out}^{(i,v)} = \frac{\lambda_{E1}^{(i,v)}\left[\left(\bar{T}_{out}^{(i,v)} - M t_{cn_E}\right)^2 + \left(\bar{T}_{out}^{(i,v)}\right)^2\right]}{2\left(1 - \lambda_{E1}^{(i,v)} \bar{T}_{out}^{(i,v)}\right)}. \quad (34)$$

Averaging over all the waiting times that a message from cluster i to any other cluster v might see at the source queue gives:

$$\bar{W}_{out}^{(i)} = \frac{1}{C - 1}\sum_{v=0, v \neq i}^{C-1} \bar{W}_{out}^{(i,v)}. \quad (35)$$
The average waiting time at the concentrator/dispatcher is calculated in a manner similar to that for the source queue (Eq. (22)). The service time of this queue is $M t_{cs_E}$, and there is no variance in the service time, since the message length is fixed. Modeling the injection channel of the concentrator/dispatcher as an M/G/1 queue, the average waiting time is given by the following equation:

$$\bar{W}_{cd}^{(i)} = \frac{\lambda_{I2}^{(i)}\left(M t_{cs_E}\right)^2}{2\left(1 - \lambda_{I2}^{(i)} M t_{cs_E}\right)}. \quad (36)$$
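Since the message length is fixed, the queue of Eq. (36) has deterministic service time $M t_{cs_E}$ with zero variance; a small sketch (names are ours):

```python
def w_cd(lam_i2: float, M: int, t_cs_e: float) -> float:
    """Average waiting time at the concentrator/dispatcher (Eq. (36)):
    an M/G/1 queue with deterministic service, i.e. effectively M/D/1."""
    x = M * t_cs_e                 # deterministic service time of one message
    rho = lam_i2 * x               # queue utilization
    assert rho < 1.0, "queue must be stable (utilization below one)"
    return lam_i2 * x ** 2 / (2 * (1 - rho))
```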
Also, we model the dispatcher buffers in the concentrator/dispatcher as an M/G/1 queue with the same rate as the concentrator buffers, so the average waiting time at the ejection channel is likewise given by Eq. (36). The average message latency in the inter-communication networks, $\bar{L}_{out}^{(i)}$, experienced by a message crossing from its source node in cluster i to its destination in cluster v, consists of four parts: the average waiting time at the source queue ($\bar{W}_{out}^{(i)}$), the average network latency ($\bar{T}_{out}^{(i)}$), the average waiting time at the concentrator/dispatcher ($2\bar{W}_{cd}^{(i)}$), and the average time for the tail flit to reach the destination ($\bar{R}_{out}$). Therefore,

$$\bar{L}_{out}^{(i)} = \bar{W}_{out}^{(i)} + \bar{T}_{out}^{(i)} + 2\bar{W}_{cd}^{(i)} + \bar{R}_{out} \quad (37)$$

where

$$\bar{R}_{out} = \left(d_{avg(E1)} + d_{avg(I2)} - 2\right)t_{cs_E} + t_{cn_E}. \quad (38)$$
Finally, we can obtain the average message latency from cluster i's point of view, based on the message flow model (see Fig. 2), as:

$$\bar{L}^{(i)} = (1 - P_o)\bar{L}_{in}^{(i)} + P_o \bar{L}_{out}^{(i)}. \quad (39)$$

To calculate the total average message latency, we use a weighted arithmetic mean:

$$\bar{L} = \sum_{i=0}^{C-1} \frac{s^{(i)}}{S} \times \bar{L}^{(i)}. \quad (40)$$
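Eqs. (39) and (40) combine the per-cluster intra- and inter-network latencies into the total average; a sketch assuming the per-cluster values have already been computed (all names are ours):

```python
def total_latency(p_o: float, L_in: list, L_out: list, s: list) -> float:
    """Total average message latency (Eqs. (39)-(40)): each cluster's latency
    is weighted by its relative processing power s[i]."""
    S = sum(s)  # total relative processing power, Eq. (1)
    # Eq. (39): mix intra- and inter-cluster latency per cluster.
    per_cluster = [(1 - p_o) * li + p_o * lo for li, lo in zip(L_in, L_out)]
    # Eq. (40): weighted arithmetic mean over clusters.
    return sum(si / S * L for si, L in zip(s, per_cluster))
```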
Fig. 4. Average message latency in the system with N = 29 , homogeneous networks, and H = 0.05.
Finally, to perform our analysis, we express the degree of processor heterogeneity of the system through a single parameter: the standard deviation of the relative processing powers,

$$H = \sqrt{\frac{1}{C}\sum_{i=0}^{C-1}\left(s^{(i)} - \bar{s}\right)^2}. \quad (41)$$

4. Model validation

In order to validate the proposed model and justify the approximations made, the model was simulated with a discrete event-driven simulator. In this section, we present the model's validation through comprehensive simulation experiments under different working conditions.

4.1. Experimental setup

Requests are generated randomly by each processor with exponentially distributed inter-arrival times with rate $\lambda_g^{(i)}$. The destination node is determined using a uniform random number generator. Each packet is timestamped upon generation, and the request completion time is checked in the "sink" module of each processor to compute the message latency. For each simulation experiment, statistics were gathered for a total of 100,000 messages. Statistics gathering was inhibited for the first 10,000 messages to avoid distortions due to the warm-up phase. There is also a drain phase at the end of each simulation, in which 10,000 generated messages are excluded from statistics gathering, to provide enough time for all packets to reach their destinations. We used the batch means method in our simulation experiments, in which the total number of messages is split into many batches and statistics are accumulated for each batch [19]. Since each sample in the batch means method is an average over many of the original samples, the variance between batch means is greatly reduced, which in turn reduces the standard deviation of the measurements and leads to greater confidence in our estimates of the mean. In our experiments, the simulation results are accurate with a confidence level of 95%.
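The measure H of Eq. (41) is simply the population standard deviation of the relative processing powers; a sketch (function name is ours):

```python
def heterogeneity(s: list) -> float:
    """Degree of processor heterogeneity H (Eq. (41)): population
    standard deviation of the relative processing powers s[i]."""
    C = len(s)
    s_bar = sum(s) / C                                    # Eq. (2)
    return (sum((x - s_bar) ** 2 for x in s) / C) ** 0.5

# A homogeneous system (all relative powers equal) gives H = 0.
```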
Table 1
Network configurations for model validation

Network parameter                          Net.1    Net.2
Network technology bandwidth (per time unit)  500      300
Network latency (time unit)                   0.02     0.05
Switch latency (time unit)                    0.01     0.02
Extensive validation experiments have been performed for several combinations of cluster sizes, network sizes, message lengths, and degrees of heterogeneity, and the general conclusions were found to be consistent across all the cases considered. To illustrate the validity of our model, the following specific configurations were examined carefully:

Number of nodes: N = 2^9 and N = 2^10
Number of clusters: C = 2^4 and C = 2^5
Switch size: m = 4 and m = 8 ports
Message length: M = 32 and M = 64 flits
Flit length: L_m = 256 and 512 bytes
Total relative processing power: S = C.

Note that we varied the degree of processor heterogeneity while the total relative processing power was kept fixed and equal to the number of clusters, i.e., S = C. Moreover, two different network configurations, listed in Table 1, were used in the validation experiments. In the first configuration, all communication networks (i.e., ICN1, ECN1, and ICN2) are of type Net.1; in the second, the ICN1 is Net.1 while the ECN1 and ICN2 are Net.2. We used the average message latency and the network throughput as the performance metrics for the analysis of the system, where throughput is the rate at which packets are delivered by the network for a particular traffic pattern.

4.2. Results and discussions

The results of our simulation and analysis for a system with the above-mentioned parameters are depicted in Figs. 4-9, in which the average message latencies are plotted against the offered traffic for different values of the degree of processor heterogeneity. The figures reveal that the analytical model predicts the average message latency with a good degree of accuracy when the system is in the steady-state region, that is, when
Fig. 5. Average message latency in the system with N = 29 , heterogeneous networks, and H = 0.0.
Fig. 6. Average message latency in the system with N = 29 , heterogeneous networks, and H = 0.2.
Fig. 7. Average message latency in the system with N = 210 , homogeneous networks, and H = 0.05.
it has not reached the saturation point. The network is assumed to enter the saturation region when the system utilization becomes greater than or equal to one. However, there are discrepancies between the results of the model and the simulation when the system is under heavy traffic and approaches the saturation point. This is due to the approximations made in the analysis to simplify the model's development. One of the most significant terms in the model under a heavily loaded system is the average waiting time at the source queues of the intra-communication and inter-communication networks. The approximation made to compute the variance of the service time received by a message at a given channel (Eq. (24)) is one source of the model's inaccuracy. Also, in this region the traffic on the links is not completely independent, as we assume in our analytical model. Since most evaluation studies focus on network performance in the steady-state region, we conclude that the proposed model can be a practical evaluation tool that helps system designers explore the design space and examine various design parameters.
Fig. 8. Average message latency in the system with N = 210 , heterogeneous networks, and H = 0.0.
Fig. 9. Average message latency in the system with N = 210 , heterogeneous networks, and H = 0.2.
Comparison of Fig. 5 with Fig. 6, and of Fig. 8 with Fig. 9, reveals that an increase in processor heterogeneity decreases the saturation point and thus diminishes the maximum throughput of the communication networks. In other words, the homogeneous system (H = 0.0) possesses the highest maximum throughput, so it can handle heavy traffic better than the heterogeneous ones. This result is in agreement with the work of [17], where it is affirmed that an increase in heterogeneity worsens performance. The optimal configuration is therefore the homogeneous system, because it yields the most uniform distribution of the total communication among the available links. For a more detailed analysis, Fig. 10 depicts the maximum network throughput as a function of processor heterogeneity. In this analysis, the system of the validation section is used with C = 32, message lengths M = 64, 128, and L_m = 256. Moreover, we used two different network configurations:

(a) The basic configuration: ICN1 is Net.1, and ECN1 and ICN2 are Net.2 (see Table 1).
(b) The configuration with a 20% increase in network bandwidth: ICN1 is Net.1 (but with the bandwidth set to 600), and ECN1 and ICN2 are Net.2 (but with the bandwidth set to 360).

The figure shows that the maximum throughput of the communication network degrades as the processor
Fig. 10. Maximum network throughput versus degree of processor heterogeneity (H ) in a system with C = 32, 8-port 2-tree, M = 64, 128, Lm = 256 and two different network configurations.
heterogeneity increases, but the performance degradation is more considerable when the processor heterogeneity is less than 50%. As can be seen in this figure, the system performance is poorer for longer messages, because increasing the service time received by a message at a channel leads to a longer blocking time, and thus to a higher latency. To consider the effects of the network bandwidth on the system's performance, the second system configuration (i.e., with more network bandwidth)
Table 2
Symbols used in the analytical model

λ_g^(i)        Message generation rate at a node in cluster i
s^(i)          Relative processing power of a node in cluster i
S              Total relative processing power
s̄              Average relative processing power
λ_I1^(i)       Arrival rate of messages in ICN1 in cluster i
λ_E1^(i,v)     Arrival rate of messages in ECN1 in cluster i (to cluster v)
λ_I2^(i)       Arrival rate of messages in ICN2 in cluster i
η_I1^(i)       Arrival rate of messages on a channel in ICN1 in cluster i
η_E1^(i,v)     Arrival rate of messages on a channel in ECN1 in cluster i (to cluster v)
η_I2           Arrival rate of messages on a channel in ICN2
M              Average length of generated messages in flits
L_m            Length of each flit in bytes
K              Number of stages in the network
k              Stage number, where 0 ≤ k ≤ K − 1
t_cs           Time to transfer a flit between two adjacent switches
t_cn           Time to transfer a flit between a switch and a node, or vice versa
P_{B k,j}^(i)  Channel blocking probability at stage k for a 2j-link journey from cluster i's point of view
P_o            Probability of a message exiting its cluster
P_j            Probability of a message crossing 2j links in the network
N              Number of nodes in the system
N_0            Number of processors in each cluster (cluster size)
n_c            Number of trees in ICN2
α_net          Network technology latency
α_sw           Network switch fabric latency
β_net          Network technology bandwidth
C              Number of clusters in the system
m, n           Parameters of the m-port n-tree topology
d_avg          Average distance in the m-port n-tree topology
W_{k,j}^(i)    Average time spent waiting to acquire a channel at stage k from cluster i's point of view
T_{k,j}^(i)    Average service time of a message at stage k from cluster i's point of view
T_in^(i)       Average latency of the intra-communication network from cluster i's point of view
T_out^(i)      Average latency of the inter-communication networks from cluster i's point of view
W_in^(i)       Average waiting time at the source queue in the intra-communication network from cluster i's point of view
W_out^(i)      Average waiting time at the source queue in the inter-communication network from cluster i's point of view
W_cd^(i)       Average waiting time at the concentrator/dispatcher from cluster i's point of view
R_in           Average time for the tail flit to reach the destination in the intra-communication network
R_out          Average time for the tail flit to reach the destination in the inter-communication networks
L_in^(i)       Average message latency in the intra-communication network from cluster i's point of view
L_out^(i)      Average message latency in the inter-communication network from cluster i's point of view
L^(i)          Average message latency from cluster i's point of view
L              Total average message latency
H              Degree of processor heterogeneity
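As a small illustration of how the timing symbols in Table 2 (M, t_cs, t_cn, d_avg) interact, the following sketch estimates the zero-load (contention-free) latency of a wormhole-routed message. This is the textbook wormhole-switching estimate under assumed link costs, not the paper's full queueing model, which additionally accounts for waiting and blocking times; the numeric values are hypothetical.

```python
def zero_load_latency(M, t_cs, t_cn, d):
    """Zero-load latency of an M-flit wormhole-routed message over a
    d-switch path: the header flit sets up the path, then the remaining
    M - 1 flits pipeline behind it at the rate of the slowest link."""
    # Path setup: source->first switch and last switch->destination use
    # node-switch links (t_cn); the d - 1 links between the d switches
    # on the path use switch-switch links (t_cs).
    header_time = 2 * t_cn + (d - 1) * t_cs
    # Body flits stream out back-to-back, limited by the slowest link.
    body_time = (M - 1) * max(t_cs, t_cn)
    return header_time + body_time

# Hypothetical values: a 64-flit message over an average path of
# d_avg = 4 switches, with assumed per-flit link times.
latency = zero_load_latency(M=64, t_cs=1.0, t_cn=2.0, d=4)
```

Under contention the actual latency is higher, since each channel acquisition at stage k adds the waiting time W_{k,j}^(i) modeled in the paper.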
has been analyzed. This upgrade yields similar behavior, but with an improvement of about 19% at all points of the system performance. Our results confirm the work of [28], which, based on measurements in a Grid test bed, found that communication bandwidth and latency have a much greater impact on system performance than processor speed.

5. Conclusions

In this paper, an analytical model for the communication layer of enterprise Grid computing systems has been presented. Both processor and network heterogeneity are considered in the model. The model is validated through simulations, which show that it predicts message latency with a good degree of accuracy. The performance analyses also show that the bandwidths of the communication networks are the most influential factor in such systems, while processor heterogeneity has only a marginal impact on overall system performance. For future work, we intend to take non-uniform traffic patterns into account, which are closer to the real traffic in such systems.

Acknowledgments

Special thanks to Mr. D. Sgro and Mr. R. Ruge from the School of Engineering and Information Technology at Deakin University for their great help in running the simulations. Thanks also to Dr. A. Khonsari from Tehran University for his remarks and discussions.

Appendix

See Table 2.

References

[1] I. Foster, The Grid: A new infrastructure for 21st century science, Physics Today 55 (2) (2002) 42–48.
[2] J.H. Abawajy, An efficient adaptive scheduling policy for high performance computing, Future Generation Computer Systems (2006) (in press).
[3] J. Dongarra, A. Lastovetsky, An overview of heterogeneous high performance and Grid computing, in: B. DiMartino, J. Dongarra, A. Hoisie, L. Yang, H. Zima (Eds.), Engineering the Grid: Status and Perspective, American Scientific Publishers, February 2006.
[4] The DAS-2 Supercomputer, http://www.cs.vu.nl/das2.
[5] B.
Boas, Storage on the lunatic fringe, in: Panel at Supercomputing Conference 2003, 15–21 November, Phoenix, AZ, Lawrence Livermore National Laboratory, 2003.
[6] A.T.T. Chun, C.L. Wang, Contention-free complete exchange algorithm on clusters, in: Proceedings of the IEEE International Conference on Cluster Computing, November 28–December 1, Saxony, Germany, 2000, pp. 57–64.
[7] A. Agarwal, Limits on interconnection network performance, IEEE Transactions on Parallel and Distributed Systems 2 (4) (1991) 398–412.
[8] H. Sarbazi-Azad, A. Khonsari, M. Ould-Khaoua, Performance analysis of deterministic routing in wormhole k-ary n-cubes with virtual channels, Journal of Interconnection Networks 3 (1–2) (2002) 67–83.
[9] M. Ould-Khaoua, A performance model for Duato's fully-adaptive routing algorithm in k-ary n-cubes, IEEE Transactions on Computers 42 (12) (1999) 1–8.
[10] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multi-computer systems, Journal of Parallel and Distributed Computing 23 (2) (1994) 202–214.
[11] Y.M. Boura, C.R. Das, Performance analysis of buffering schemes in wormhole routers, IEEE Transactions on Computers 46 (6) (1997) 687–694.
[12] A. Khonsari, H. Sarbazi-Azad, M. Ould-Khaoua, An analytical model of adaptive wormhole routing with time-out, Future Generation Computer Systems 19 (1) (2003) 1–12.
[13] P.C. Hu, L. Kleinrock, A queuing model for wormhole routing with timeout, in: Proceedings of the 4th International Conference on Computer Communications and Networks, 20–23 September, Las Vegas, NV, 1995, pp. 584–593.
[14] X. Du, X. Zhang, Z. Zhu, Memory hierarchy considerations for cost-effective cluster computing, IEEE Transactions on Computers 49 (5) (2000) 915–933.
[15] B. Javadi, S. Khorsandi, M.K. Akbari, Queuing network modeling of cluster-based parallel systems, in: Proceedings of the 7th International Conference on High Performance Computing and Grids, 20–22 July, Tokyo, Japan, 2004, pp. 304–307.
[16] B. Javadi, S. Khorsandi, M.K. Akbari, Study of cluster-based parallel systems using analytical modeling and simulation, in: Lecture Notes in Computer Science, vol. 3483, Springer-Verlag, 2005, pp. 1262–1271.
[17] A. Clematis, A. Corana, Modeling performance of heterogeneous parallel computing systems, Journal of Parallel Computing 25 (9) (1999) 1131–1145.
[18] H. Yang, Z. Xu, Y. Sun, Q. Zheng, Modeling and performance analysis of the VEGA Grid system, in: Proceedings of the IEEE International Conference on e-Science and Grid Computing, 5–8 December, Melbourne, Australia, 2005, pp. 296–303.
[19] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers, San Francisco, 2004.
[20] Thunder Statement of Work, University of California, Lawrence Livermore National Laboratory, September 2003.
[21] InfiniBand Clustering, Delivering Better Price/Performance than Ethernet, White Paper, Mellanox Technologies Inc., Santa Clara, CA, 2005.
[22] Building Scalable, High Performance Cluster/Grid Networks: The Role of Ethernet, White Paper, Force10 Networks Inc., Milpitas, CA, 2004.
[23] X. Lin, An efficient communication scheme for fat-tree topology on InfiniBand networks, M.Sc. Thesis, Department of Information Engineering and Computer Science, Feng Chia University, Taiwan, 2003.
[24] B. Javadi, J.H. Abawajy, M.K. Akbari, Modeling and analysis of heterogeneous loosely-coupled distributed systems, Technical Report TR C06/1, School of Information Technology, Deakin University, Australia, January 2006.
[25] M. Koibuchi, K. Watanabe, K. Kono, A. Jouraku, H. Amano, Performance evaluation of routing algorithms in RHiNET-2 cluster, in: Proceedings of the IEEE International Conference on Cluster Computing, 1–4 December, Hong Kong, 2003, pp. 395–402.
[26] M.D. Schroeder et al., Autonet: A high-speed, self-configuring local area network using point-to-point links, SRC Research Report 59, Digital Equipment Corporation, April 1990.
[27] L. Kleinrock, Queueing Systems: Computer Applications, vol. 2, John Wiley & Sons, New York, 1975.
[28] C. Lee, C. DeMatteis, J. Stepanek, J. Wang, Cluster performance and implications for distributed, heterogeneous Grid performance, in: Proceedings of the 9th Heterogeneous Computing Workshop, May 1, Cancun, Mexico, 2000, pp. 253–261.
[29] Enterprise Grid Alliance Reference Model v1.0, Enterprise Grid Alliance, April 2005.
[30] M.D. de Assuncao, K. Nadiminti, S. Venugopal, T. Ma, R. Buyya, An integration of global and enterprise Grid computing: Gridbus broker and Xgrid perspective, in: Proceedings of the 4th International Conference on Grid and Cooperative Computing, GCC, November 30–December 3, 2005, Beijing, China, in: LNCS, Springer-Verlag, Berlin, Germany, 2005.
Bahman Javadi received his B.Sc. degree in computer engineering from Esfahan University, Esfahan, Iran, in 1998, and his M.Sc. degree in computer engineering from the Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran, in 2001. He is a researcher at the Iranian High Performance Computing Research Centre and is currently a Ph.D. candidate at the Department of Computer Engineering and IT, Amirkabir University of Technology, Tehran, Iran. At present, he is a research scholar in the School of Engineering and Information Technology, Deakin University, Australia. His research interests include high-performance computer architecture, parallel computing, communication networks, and performance modeling and evaluation.

Dr. Mohammad K. Akbari received his B.Sc. degree in computer engineering from the National (Beheshti) University, Tehran, Iran, in 1984, and the M.Sc. and Ph.D. degrees in computer engineering from Case Western Reserve University, Cleveland, Ohio, USA, in 1991 and 1995, respectively. He is currently a faculty member in the Department of Computer Engineering and IT at the Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran. He is also Chair of the Iranian
High Performance Computing Research Centre. His research interests include parallel processing, grid and cluster computing systems, and mathematical modeling. Dr. Akbari is a member of the ACM and of the scientific committee of the Computer Society of Iran.

Dr. Jemal H. Abawajy is a faculty member of computer science at Deakin University, Department of Science and Technology, School of Engineering and Information Technology. Dr. Abawajy received his Ph.D. in computer science from the Ottawa-Carleton Institute of Technology (Canada), an M.Sc. in computer science from Dalhousie University (Canada), and a B.Sc. in computer science from St. F. X University (Canada). He has published more than 70 papers in refereed international journals and conferences. His research interests are in the areas of high-performance Grid and cluster computing, performance analysis, data management for large-scale applications, and mobile systems. Dr. Abawajy has guest-edited several journals and has served on the program committees of numerous international and national conferences. He has also chaired several special sessions and workshops in conjunction with international conferences. Dr. Abawajy has worked as a software engineer, UNIX systems administrator, and database administrator for many years.