Modeling the effect of hot spot contention on crossbar multiprocessors


Computer Standards & Interfaces 26 (2004) 483 – 488 www.elsevier.com/locate/csi

Her-Kun Chang *
Department of Information Management, Chang Gung University, 259 Wen-Hwa 1st Road, Kwei-Shan, Tao-Yuan 333, Taiwan
Received 2 October 2003; received in revised form 15 January 2004; accepted 15 January 2004

Abstract

The paper models the performance characteristics of crossbar multiprocessors under hot spot contention. The effect of hot spot contention in previous works was under-estimated because the blocked requests were assumed to be discarded. The model in the paper provides more accurate results by taking into account the retransmission of blocked requests. © 2004 Elsevier B.V. All rights reserved.

Keywords: Crossbar; Hot spot; Multiprocessors; Performance modeling

* Tel.: +886-211-8800; fax: +886-211-8700. E-mail address: [email protected] (H.-K. Chang).
0920-5489/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.csi.2004.01.001

1. Introduction

A crossbar interconnection network can support all possible distinct connections between processors and memory modules simultaneously in a tightly coupled multiprocessor system. Fig. 1 shows an $M \times N$ crossbar connecting $M$ processors and $N$ memory modules. When two or more processors attempt to access the same memory module, only one request can be accepted and the others are blocked. The bandwidth, defined as the expected number of requests accepted per unit time, is the most common performance parameter used in analyzing multiprocessors [5,8]. Surveys of analytic models of crossbar-interconnected multiprocessors can be found in Ref. [5] and Chapter 6 of Ref. [10].

Fig. 1. An $M \times N$ crossbar.

The analyses in Refs. [1–4,6,11,13] assume that blocked requests are simply discarded. As a consequence, successive requests generated by a processor are assumed to be independent. This assumption simplifies the analysis, but it is not realistic: in practice, a blocked request must be re-sent in the next memory cycle. An analysis that includes retransmission was reported in Ref. [9], but it considered only the uniform reference model (URM). (As noted in Ref. [5], the URM implies that, when a processor makes a request, the request is directed to any one of the memory modules with the same probability $1/N$, where $N$ is the number of memory modules. That is, the destination address of a memory request is uniformly distributed among the $N$ memory modules.)

Hot spot contention, due to the use of shared objects, was first noticed by Pfister and Norton [12] in multiprocessor systems with multistage interconnection networks. In the hot spot model, a fraction $h$ ($0 \le h \le 1$) of requests is directed to a specific memory module (the hot spot), and the remaining fraction $(1 - h)$ of requests is spread uniformly over the $N$ memory modules. Hot spot memory contention can lead to a phenomenon called tree saturation and cause severe performance degradation [7]. Several analyses of crossbars under hot spot contention were reported in Refs. [1,2,13], but they did not consider retransmission of blocked requests.

The objective of this paper is to provide a performance model for crossbar multiprocessors that considers both hot spot contention and retransmission of blocked requests, from which more accurate results can be derived. A Markov chain is used to model the behavior of the processors, and an efficient solution is presented to evaluate the bandwidth.

2. Model

An $M \times N$ crossbar connects $M$ processors and $N$ memory modules. Let $P_i$ denote processor $i$, $1 \le i \le M$. The model of the paper is as follows.

(1) The crossbar operates synchronously, and $P_i$ generates a request to one of the memory modules at the beginning of each memory cycle.
(2) Requests from different processors are mutually independent.
(3) A fraction $h$ ($0 \le h \le 1$) of requests is directed to a specific memory module (the hot spot, $M_h$), and the remaining fraction $(1 - h)$ of requests is spread uniformly over the $N$ memory modules. That is, the probability that a request is directed to $M_h$ is $p_h = h + (1 - h)/N$, and the probability that it is directed to one of the other memory modules is $p_x = 1 - p_h = (1 - h)(N - 1)/N$.
(4) When two or more requests are directed to the same memory module, only one of them is accepted and the others are blocked. Conflicting requests to the same memory module are resolved on a priority basis. In the paper, we assume that $P_i$ has higher priority than $P_{i+1}$, for $1 \le i \le M - 1$. That is, $P_1$ always has the highest priority and $P_M$ always has the lowest priority.
(5) If the request from $P_i$ to a memory module is blocked, $P_i$ will generate the same request in the next memory cycle. If the request is accepted, $P_i$ will generate a new request in the next cycle.

A Markov chain is used to model the behavior of the processors, and each processor $P_i$ can be in any of the following states:

- state 0 (initial state): In this state, $P_i$ will generate a new request in the next cycle. If the request is accepted, $P_i$ returns to state 0; otherwise, $P_i$ enters state h or state x as described below.
- state h: If a request to $M_h$ is blocked, $P_i$ enters state h and reissues the request to $M_h$ in the next cycle. If the reissued request is accepted, $P_i$ enters state 0; otherwise, $P_i$ remains in state h.
- state x: If a request to one of the other memory modules is blocked, $P_i$ enters state x and resends the same request in the next cycle. If the resent request is accepted, $P_i$ enters state 0; otherwise, $P_i$ remains in state x.
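As a small illustration of item (3) of the model, the destination probabilities $p_h$ and $p_x$ can be computed directly from $N$ and $h$. The following is a minimal sketch; the function name `request_probabilities` and its argument names are hypothetical and not part of the paper.

```python
from typing import Tuple


def request_probabilities(n_modules: int, h: float) -> Tuple[float, float]:
    """Return (p_h, p_x) for a crossbar with n_modules memory modules.

    p_h: probability that a new request is directed to the hot spot M_h.
    p_x: probability that it is directed to one of the other modules.
    The name and signature are illustrative assumptions.
    """
    if n_modules < 2 or not 0.0 <= h <= 1.0:
        raise ValueError("need n_modules >= 2 and 0 <= h <= 1")
    # The hot fraction h plus the hot spot's share of the uniform traffic.
    p_h = h + (1.0 - h) / n_modules
    # Every request goes somewhere, so the remainder goes to non-hot modules.
    p_x = 1.0 - p_h
    return p_h, p_x


if __name__ == "__main__":
    print(request_probabilities(16, 0.1))
```

With $N = 16$ and $h = 0.1$, this gives $p_h = 0.1 + 0.9/16 = 0.15625$ and $p_x = 0.84375$, so even a 10% hot spot rate nearly triples the load on $M_h$ relative to the uniform value $1/16$.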

3. Analysis

The state transition probabilities of $P_i$ ($1 \le i \le M$) are listed in Table 1, in which

- $q_{i0}$, $q_{ih}$ and $q_{ix}$ denote the steady-state probabilities that $P_i$ is in state 0, h and x, respectively;
- $b_{ih}$ denotes the probability that a request from $P_i$ to $M_h$ will be blocked;
- $b_{ix}$ denotes the probability that a request from $P_i$ to one of the other memory modules will be blocked.

Table 1
State transition probabilities of processor $i$ ($P_i$)

| Current state | Next state 0 | Next state h | Next state x |
|---|---|---|---|
| 0 | $1 - p_h b_{ih} - p_x b_{ix}$ | $p_h b_{ih}$ | $p_x b_{ix}$ |
| h | $1 - b_{ih}$ | $b_{ih}$ | 0 |
| x | $1 - b_{ix}$ | 0 | $b_{ix}$ |

According to the Markov model, we have

$$ q_{ih} = b_{ih} q_{ih} + p_h b_{ih} q_{i0} $$

and

$$ q_{ix} = b_{ix} q_{ix} + p_x b_{ix} q_{i0}. $$

It implies that

$$ q_{ih} = \frac{p_h b_{ih} q_{i0}}{1 - b_{ih}} $$

and

$$ q_{ix} = \frac{p_x b_{ix} q_{i0}}{1 - b_{ix}}. $$

Because

$$ q_{i0} + q_{ih} + q_{ix} = 1, $$

we can get the equation

$$ q_{i0} = \frac{1}{1 + \dfrac{p_h b_{ih}}{1 - b_{ih}} + \dfrac{p_x b_{ix}}{1 - b_{ix}}} \tag{1} $$

Let $BW_i$ be the expected number of accepted requests generated by $P_i$ per cycle. Every accepted request, and only an accepted request, takes $P_i$ to state 0, so $BW_i = q_{i0}$. Recall that the bandwidth of the crossbar is the average number of successful requests throughout the crossbar per cycle; thus

$$ BW = \sum_{i=1}^{M} BW_i = \sum_{i=1}^{M} q_{i0} = \sum_{i=1}^{M} \frac{1}{1 + \dfrac{p_h b_{ih}}{1 - b_{ih}} + \dfrac{p_x b_{ix}}{1 - b_{ix}}} \tag{2} $$

Because of the fixed priority order, the values of the $q_{i0}$'s, $b_{ih}$'s and $b_{ix}$'s can be found iteratively as follows.

Since $P_1$ has the highest priority, the request from $P_1$ will never be blocked and $P_1$ is always in state 0; therefore

$$ b_{1h} = b_{1x} = 0, \qquad q_{10} = 1 \tag{3} $$

There are two cases in which $P_i$ may request $M_h$:

(1) $P_i$ is in state 0: $P_i$ will make a (new) request to $M_h$ with (conditional) probability $p_h$ in the next cycle.
(2) $P_i$ is in state h: $P_i$ will make a (resubmitted) request to $M_h$ with (conditional) probability 1 in the next cycle.

So the probability that $P_i$ sends a (new or resubmitted) request to $M_h$ in each cycle is $q_{i0} p_h + q_{ih}$.

Similarly, there are two cases in which $P_i$ may make a request to one of the other (non-hot spot) memory modules, $M_x$:

(1) $P_i$ is in state 0: $P_i$ will make a request to $M_x$ with (conditional) probability $p_x/(N - 1)$ in the next cycle.
(2) $P_i$ is in state x and the (blocked) request in the previous memory cycle was to $M_x$: given state x, $P_i$ will request $M_x$ with (conditional) probability $1/(N - 1)$ in the next cycle.

Then the probability that $P_i$ sends a (new or resubmitted) request to $M_x$ in each cycle is $(q_{i0} p_x + q_{ix})/(N - 1)$.

In general, a request from $P_i$ to a memory module is accepted if none of $P_1, \ldots, P_{i-1}$ requests the same memory module. That is, for $2 \le i \le M$,

$$ b_{ih} = 1 - \prod_{k=1}^{i-1} \left(1 - q_{k0} p_h - q_{kh}\right) \tag{4} $$

$$ b_{ix} = 1 - \prod_{k=1}^{i-1} \left(1 - \frac{q_{k0} p_x + q_{kx}}{N - 1}\right) \tag{5} $$

Given the value of $h$, the bandwidth of a crossbar multiprocessor can be evaluated iteratively by using Eqs. (1)–(5).

In Tables 2–5, "Error (%)" is the deviation of the analytic bandwidth from the simulated bandwidth, expressed as a percentage of the simulated value.

Table 2
Comparison for 16 × 16 crossbar

| h | Simulation | With retransmission: Bandwidth | With retransmission: Error (%) | Without retransmission: Bandwidth | Without retransmission: Error (%) |
|---|---|---|---|---|---|
| 0.1 | 6.3957 | 6.4000 | 0.07 | 9.9864 | 56 |
| 0.2 | 3.9337 | 4.0000 | 1.69 | 9.3857 | 139 |
| 0.3 | 2.8778 | 2.9091 | 1.09 | 8.6662 | 201 |
| 0.4 | 2.2964 | 2.2857 | −0.47 | 7.8621 | 242 |
| 0.5 | 1.8717 | 1.8824 | 0.57 | 6.9743 | 273 |
| 0.6 | 1.5953 | 1.6000 | 0.29 | 5.9962 | 276 |
| 0.7 | 1.3937 | 1.3913 | −0.17 | 4.9193 | 253 |
| 0.8 | 1.2351 | 1.2308 | −0.35 | 3.7345 | 202 |
| 0.9 | 1.1000 | 1.1035 | 0.31 | 2.4317 | 121 |

Table 3
Comparison for 32 × 32 crossbar

| h | Simulation | With retransmission: Bandwidth | With retransmission: Error (%) | Without retransmission: Bandwidth | Without retransmission: Error (%) |
|---|---|---|---|---|---|
| 0.1 | 7.7349 | 7.80488 | 0.90 | 19.54409 | 153 |
| 0.2 | 4.3804 | 4.44444 | 1.46 | 18.21139 | 316 |
| 0.3 | 3.1001 | 3.1068 | 0.22 | 16.725 | 439 |
| 0.4 | 2.3971 | 2.38806 | −0.38 | 15.08348 | 529 |
| 0.5 | 1.9481 | 1.93939 | −0.45 | 13.27162 | 581 |
| 0.6 | 1.6181 | 1.63265 | 0.90 | 11.2724 | 597 |
| 0.7 | 1.4131 | 1.40969 | −0.24 | 9.06711 | 542 |
| 0.8 | 1.2444 | 1.24031 | −0.33 | 6.63527 | 433 |
| 0.9 | 1.1103 | 1.10727 | −0.27 | 3.95443 | 256 |

Table 4
Comparison for 64 × 64 crossbar

| h | Simulation | With retransmission: Bandwidth | With retransmission: Error (%) | Without retransmission: Bandwidth | Without retransmission: Error (%) |
|---|---|---|---|---|---|
| 0.1 | 8.7839 | 8.7671 | −0.19 | 38.5487 | 339 |
| 0.2 | 4.6827 | 4.7059 | 0.50 | 35.8346 | 665 |
| 0.3 | 3.2386 | 3.2161 | −0.70 | 32.8355 | 914 |
| 0.4 | 2.4346 | 2.4428 | 0.33 | 29.5226 | 1113 |
| 0.5 | 1.9862 | 1.9692 | −0.85 | 25.8635 | 1202 |
| 0.6 | 1.6383 | 1.6495 | 0.68 | 21.8228 | 1232 |
| 0.7 | 1.4195 | 1.4191 | −0.03 | 17.3614 | 1123 |
| 0.8 | 1.2427 | 1.2451 | 0.20 | 12.4361 | 901 |
| 0.9 | 1.1125 | 1.1092 | −0.30 | 6.9997 | 529 |

Table 5
Comparison for 128 × 128 crossbar

| h | Simulation | With retransmission: Bandwidth | With retransmission: Error (%) | Without retransmission: Bandwidth | Without retransmission: Error (%) |
|---|---|---|---|---|---|
| 0.1 | 9.3522 | 9.34307 | −0.10 | 76.52954 | 718 |
| 0.2 | 4.9851 | 4.84848 | −2.74 | 71.0783 | 1326 |
| 0.3 | 3.2824 | 3.27366 | −0.27 | 65.0547 | 1882 |
| 0.4 | 2.5071 | 2.47104 | −1.44 | 58.39917 | 2229 |
| 0.5 | 1.9674 | 1.9845 | 0.87 | 51.04599 | 2495 |
| 0.6 | 1.6595 | 1.65803 | −0.09 | 42.92265 | 2486 |
| 0.7 | 1.4271 | 1.4238 | −0.23 | 33.94921 | 2279 |
| 0.8 | 1.2541 | 1.24756 | −0.52 | 24.03746 | 1817 |
| 0.9 | 1.1082 | 1.11015 | 0.18 | 13.09014 | 1081 |

Fig. 2. Comparison for different values of M and N.
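The iterative evaluation of Eqs. (1)–(5) in Section 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `crossbar_bandwidth` and its arguments are hypothetical. As an implementation choice of this sketch, it tracks the products $1 - b_{ih}$ and $1 - b_{ix}$ from Eqs. (4) and (5) directly rather than $b_{ih}$ and $b_{ix}$, because for low-priority processors $1 - b_{ih}$ becomes so small that $b_{ih}$ rounds to exactly 1 in floating point and the division in Eq. (1) would fail.

```python
def crossbar_bandwidth(m: int, n: int, h: float) -> float:
    """Evaluate BW of Eq. (2) for an m x n crossbar with hot spot rate h.

    Hypothetical helper: names and signature are illustrative assumptions.
    """
    if m < 1 or n < 2 or not 0.0 <= h <= 1.0:
        raise ValueError("need m >= 1, n >= 2 and 0 <= h <= 1")
    p_h = h + (1.0 - h) / n
    p_x = 1.0 - p_h
    free_h = 1.0  # 1 - b_ih: no higher-priority processor requests M_h
    free_x = 1.0  # 1 - b_ix: none requests a given non-hot module M_x
    bandwidth = 0.0
    for _ in range(m):  # processors in priority order, P_1 first (Eq. (3))
        # Eq. (1) and the expressions for q_ih and q_ix, with numerators and
        # denominator multiplied through by (1 - b_ih)(1 - b_ix).
        w0 = free_h * free_x
        wh = p_h * (1.0 - free_h) * free_x
        wx = p_x * (1.0 - free_x) * free_h
        total = w0 + wh + wx
        q0 = w0 / total
        qx = wx / total
        bandwidth += q0  # BW_i = q_i0
        # Probability that P_i does not request M_h in a cycle:
        # 1 - (q_i0 p_h + q_ih) = q_i0 p_x + q_ix, since q_i0+q_ih+q_ix = 1.
        not_hot = q0 * p_x + qx
        free_h *= not_hot                      # next factor of Eq. (4)
        free_x *= 1.0 - not_hot / (n - 1)      # next factor of Eq. (5)
    return bandwidth


if __name__ == "__main__":
    for hot_rate in (0.1, 0.5, 0.9):
        print(hot_rate, round(crossbar_bandwidth(16, 16, hot_rate), 4))
```

For example, `crossbar_bandwidth(16, 16, 0.1)` evaluates to about 6.4000, which agrees with the with-retransmission entry for $h = 0.1$ in Table 2. The loop is linear in $M$ because each processor's blocking probabilities depend only on the processors of higher priority.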

4. Numerical results

In Tables 2–5, the bandwidth is shown as a function of $h$ for 16 × 16, 32 × 32, 64 × 64 and 128 × 128 crossbar multiprocessors, respectively. The analytic results are compared with those obtained from the equations in Ref. [13] and from simulation. For each combination of $M$, $N$ and $h$, the simulation is run five times with different random seeds, and each run covers $10^5$ memory cycles. The analytic results are very close to the simulation results: the error is at most 2.74% in magnitude in all cases. In contrast, a model that discards blocked requests, as proposed in Ref. [13], overestimates the bandwidth by 56% to 2495% relative to simulation, whereas our model considers retransmission of blocked requests.

In Fig. 2, the bandwidth is plotted against $h$ for various values of $M$ and $N$. As shown in Fig. 2, the bandwidth decreases rapidly to an asymptote approximating 1. Fig. 3 plots the bandwidth against the size of the crossbar ($M = N$) for different values of $h$. The bandwidth increases to an asymptote approximating $1/h$. Another observation from Figs. 2 and 3 is that the size of a crossbar has little effect on the bandwidth when hot spot contention occurs. On the other hand, hot spot contention affects the bandwidth greatly.
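The kind of simulation described above can be sketched as a cycle-by-cycle Monte Carlo loop over the model of Section 2. The paper does not list its simulator, so everything below is an assumption made for illustration: the function name `simulate_bandwidth`, labeling the hot spot as module 0, and the default of 20,000 cycles (shorter than the $10^5$ cycles used in the paper, to keep the example quick).

```python
import random


def simulate_bandwidth(m: int, n: int, h: float,
                       cycles: int = 20_000, seed: int = 0) -> float:
    """Estimate accepted requests per cycle with fixed priority and resends.

    Hypothetical sketch; not the simulator used for Tables 2-5.
    """
    rng = random.Random(seed)
    hot = 0                      # arbitrary label for the hot spot module
    pending = [None] * m         # blocked destination to resend, else None
    accepted = 0
    for _ in range(cycles):
        busy = set()             # modules already granted in this cycle
        for i in range(m):       # index order is priority order, P_1 first
            dest = pending[i]
            if dest is None:     # state 0: generate a new request
                if rng.random() < h:
                    dest = hot
                else:            # uniform over all n modules, hot included
                    dest = rng.randrange(n)
            if dest in busy:     # blocked: resend the same request next cycle
                pending[i] = dest
            else:                # accepted: new request next cycle
                busy.add(dest)
                pending[i] = None
                accepted += 1
    return accepted / cycles


if __name__ == "__main__":
    runs = [simulate_bandwidth(16, 16, 0.5, seed=s) for s in range(5)]
    print(sum(runs) / len(runs))
```

Averaging over five seeds, as in the `__main__` block, mirrors the procedure described in this section; the estimate can then be set beside the corresponding analytic value from Eqs. (1)–(5).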

5. Summary

The paper uses a Markov chain to model the behavior of the processors in a crossbar multiprocessor and analyzes the performance of the crossbar under hot spot contention. Previous studies of hot spot contention in crossbar multiprocessors were restricted to simplified models that ignore the retransmission of blocked requests. The presented analysis provides more accurate results than those simplified models.

Acknowledgements

This work was supported partially by the National Science Council under the grant number NSC-922213-E-182-017-.

Fig. 3. Comparison for different values of hot spot rate.


References

[1] M. Atiquzzaman, M.M. Banat, Effect of hot-spots on the performance of crossbar multiprocessor systems, Parallel Computing 19 (4) (1993) 455–461.
[2] M. Atiquzzaman, M.A. Sayeed, Computation availability of crossbar systems in non-uniform traffic environment, Microelectronics and Reliability 34 (12) (1994) 1931–1937.
[3] R.Y. Awdeh, H.T. Mouftah, Comment: performance of crossbar interconnection networks in presence of 'hot spots', Electronics Letters 29 (2) (1993) 218–219.
[4] L.N. Bhuyan, An analysis of processor–memory interconnection networks, IEEE Transactions on Computers 34 (3) (1985) 279–283.
[5] L.N. Bhuyan, Q. Yang, D.P. Agrawal, Performance of multiprocessor interconnection networks, IEEE Computer 22 (2) (1989) 25–37.
[6] H.-K. Chang, S.-M. Yuan, The bandwidth of crossbars for general reference model, Electronics Letters 29 (21) (1993) 1837–1838.
[7] S.P. Dandamudi, Reducing hot-spot contention in shared-memory multiprocessor systems, IEEE Concurrency 7 (1) (1999) 48–59.
[8] T.-Y. Feng, A survey of interconnection networks, IEEE Computer 14 (12) (1981) 12–27.
[9] Y.C. Liu, C.C. Wang, Analysis of prioritized multiprocessor systems, Journal of Parallel and Distributed Computing 7 (1989) 321–334.
[10] M.A. Marsan, G. Balbo, G. Conte, Performance Models of Multiprocessor Systems, MIT Press, 1986.
[11] J.H. Patel, Performance of processor–memory interconnections for multiprocessors, IEEE Transactions on Computers 30 (10) (1981) 771–780.
[12] G.F. Pfister, V.A. Norton, Hot spot contention and combining in multistage interconnection networks, IEEE Transactions on Computers 33 (10) (1985) 943–948.
[13] A. Pombortsis, C. Halatsis, Performance of crossbar interconnection networks in presence of 'hot spots', Electronics Letters 24 (3) (1988) 182–184.

Dr. Chang is an associate professor in Information Management at Chang Gung University, Taiwan. He received his BS and PhD degrees in Computer and Information Science from National Chiao Tung University, Taiwan, in 1989 and 1994, respectively. His current research interests include parallel and distributed processing, data mining, e-learning and medical informatics.