Journal of Parallel and Distributed Computing 42, 101–108 (1997)
Article No. PC971321
V_THR: An Adaptive Load Balancing Algorithm

Pallab Dasgupta, A. K. Majumder, and P. Bhattacharya

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, India
This paper presents a new adaptive algorithm for dynamic load balancing on a shared BUS architecture. We present results obtained from simulation studies and queuing analysis, which reflect the relation between BUS contention and the efficiency of load balancing. The proposed algorithm uses a scheme for monitoring the Threshold parameter to dynamically adapt itself to the limited bandwidth of the shared BUS. We have compared our algorithm with some of the known policies, and its relative performance appears to be promising. © 1997 Academic Press
1. INTRODUCTION
Load balancing is one of the fundamental issues in distributed computing. In particular, dynamic load balancing assumes a system model where the detailed task structure is not known a priori, which is often the case in actual distributed operating systems. Consequently, the choice of a dynamic load balancing algorithm is considered a crucial design decision in distributed operating systems [12]. In the past decade a great deal of research has been directed towards effective load balancing in various system architectures [2–4, 6, 7, 12–14]. In one of the earlier works, Livny and Melman [8] used simple queuing network models and simulation to show that dynamic load balancing can substantially improve average task response times. Eager et al. [4] carried the work further by systematically studying a number of dynamic load balancing algorithms with different levels of complexity. Their study brought forth one of the most interesting and significant results in this field, namely that relatively simple algorithms can provide substantial performance improvements, while more complicated algorithms are not likely to offer much further improvement.

In recent times the idea of exploiting coarse-grain parallelism over a network of workstations has been well appreciated. This has led to the development of several software systems (such as PVM [11], p4 [1], Zipcode [10], and Express [5]) which provide libraries and other support to facilitate the distribution and execution of tasks over a network of workstations. With the advent of such systems, the task of dynamic load balancing on a shared BUS model has gained significance [5]. In this paper, we consider the task of dynamic load balancing in a shared BUS model having a limited bandwidth.
For convenience of analysis, we assume a homogeneous set of processors. The contributions of the paper are as follows:

• We report the results obtained independently from queuing analysis and simulation of several dynamic load balancing strategies. The results agree to a reasonable extent, establishing the credibility of our queuing model.

• Using results obtained from the queuing analysis, we illustrate the effect of varying the Threshold (an important parameter of most dynamic load balancing policies, which determines whether a processor is overloaded) on the average response time of the tasks, as well as on the BUS utilization. The results show that the limited bandwidth of the BUS is often responsible for performance degradation.

• We present a new dynamic load balancing algorithm, V_THR, which adapts itself to the limited bandwidth of the BUS by dynamically monitoring the Threshold. Simulation studies conducted on this algorithm yield promising results, and show that the algorithm is scalable to a reasonable extent.

The paper is organized as follows. In Section 2 we briefly outline the model used for queuing and simulation. Section 3 describes the load balancing strategies used in our studies. The results obtained from queuing analysis and simulation which motivate the proposed algorithm are discussed in Section 4, while the algorithm itself is presented in Section 5.
2. AN OUTLINE OF THE MODEL
The system model used in the queuing and simulation studies is shown in Fig. 1. Each processor maintains a task queue for the tasks awaiting execution and a route queue for the tasks which are being transferred to other processors. All incoming tasks to a processor are entered into its task queue. When the load balancing policy decides to transfer a task to some destination processor, the task is moved from the task queue to the route queue and is eventually routed to the task queue of the destination processor. It is assumed that the sizes of the task and route queues are finite and bounded. Throughout this paper, the load of a processor refers to the number of tasks in its task queue. The other parameters shown in Fig. 1 have the following meanings:
a_g : Mean arrival rate of new tasks.
a_e : Mean arrival rate of transferred tasks from other processors.
a_t : Mean arrival rate at the task queue of a processor.
t : Mean rate of task transfers from a processor.
s_t : Mean service rate of the task queue.
s_r : Mean service rate of the route queue, when the BUS is free.
b_out : Task transfer rate through the BUS.
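To make the model concrete, the following skeleton (a sketch of our own; the class and method names are illustrative and not from the paper) shows how the task queue, the route queue, and the BUS transfer path described above fit together:

```python
from collections import deque

# Skeletal rendering of the system model in Fig. 1 (our own sketch).
# Each processor keeps a bounded task queue and a bounded route queue;
# a transfer moves a task from the task queue to the route queue, and
# the shared BUS (whose finite bandwidth, rate b_out, is what the
# paper's analysis models) later delivers it to the destination.

class Processor:
    def __init__(self, pid, task_cap=64, route_cap=64):
        self.pid = pid
        self.task_queue = deque(maxlen=task_cap)    # tasks awaiting execution
        self.route_queue = deque(maxlen=route_cap)  # tasks awaiting transfer

    @property
    def load(self):
        # Throughout the paper, "load" = number of tasks in the task queue.
        return len(self.task_queue)

    def enqueue_for_transfer(self):
        # The load balancing policy decided to move one task off this
        # processor: task queue -> route queue.
        if self.task_queue and len(self.route_queue) < self.route_queue.maxlen:
            self.route_queue.append(self.task_queue.pop())

def bus_deliver(src, dst):
    # The BUS routes one task from src's route queue into dst's task queue.
    if src.route_queue and len(dst.task_queue) < dst.task_queue.maxlen:
        dst.task_queue.append(src.route_queue.popleft())
```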
In this study it has been assumed that the arrival of new tasks at a processor follows a Poisson distribution with mean arrival rate a_g. The task sizes are assumed to follow an exponential distribution with mean service rate s_t. The values of a_e, t, and s_r are dictated by the load balancing scheme being used. The average response time of the tasks has been chosen as the performance index.

A queuing model has been developed for the above configuration. The state of each processor is represented by a doublet (n_1, n_2), where n_1 denotes the current size of the route queue and n_2 denotes the current size of the task queue. Since we consider a homogeneous set of processors, it follows that the steady states of the processors are similar. Using this fact, we used an identical Markov model for the state transitions of each processor, and a global queuing model for the shared BUS. A convergent predictor–corrector method was used to solve these models and obtain the steady state transition rates. Since identical models are considered for each processor, the solution approach is scalable to an arbitrarily large number of processors.

Besides the solver for the queuing model, we also constructed an independent event driven simulator for the configuration shown in Fig. 1. Results obtained from the simulator have been compared to those obtained by solving the queuing model to test whether the queuing model can correctly predict the performance trends obtained from simulation. In the simulator, arrival events at the task queues are generated by simulated Poisson arrival processes, while a completion event indicates the completion of a task (task sizes are exponentially distributed). Transfer events are generated by the load balancing scheme being simulated. In addition to the parameters shown in Fig. 1, we also consider the time spent by a processor to probe another processor (say, to determine the current load of that processor). The cumulative time spent on probing could be quite significant (as we shall discuss in Section 5), particularly when most of the processors are heavily loaded.

3. LOAD BALANCING POLICIES USED IN THIS STUDY
In this section, we briefly describe the dynamic load balancing schemes considered in our studies. These algorithms have been selected on the basis of their performance in a previous simulation study conducted by Zhou [14]. Each of the following strategies considers a task to be eligible for load balancing if it arrives when the number of tasks in the task queue is greater than or equal to the value of a parameter called Threshold.

ALGORITHM CENTRAL. At regular intervals of time, specified by a parameter called Load_Exchange_Period, one of the processors, designated as the Load Information Center (LIC), receives load updates from all the other processors. When a processor decides that a task is eligible for load balancing, it sends a request to the LIC to determine the suitable placement of the task. The LIC identifies the processor with the shortest queue length and informs the requesting processor to send a task there.

ALGORITHM THRHLD. A number of randomly selected processors, up to a limit given by a parameter Probe_Limit, are polled when a task eligible for load balancing arrives, and the task is transferred to the first polled processor whose load is below Threshold.

ALGORITHM LOWEST. Similar to THRHLD, except that instead of transferring the eligible task to the first processor whose load is below Threshold, a fixed number of processors, determined by the parameter Probe_Limit, are polled, and the most lightly loaded among them is selected.

In addition to the above policies, we considered the following three boundary cases of load balancing.

ALGORITHM NoLB. In this policy, no load balancing is attempted; all incoming tasks are processed locally.

ALGORITHM NoCOST. This algorithm depicts the unrealistic case where the current load (that is, the number of tasks) at each processor is known to the transfer decision maker(s) at no overhead cost, and the transfer of tasks is assumed to be costless.
FIG. 1. The model.
ALGORITHM PartCOST. This is the partly ideal case where current load information is assumed to be known at no cost, but task transfer costs are considered.
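To illustrate the difference between the two probing policies, here is a hedged Python sketch of the THRHLD and LOWEST placement rules (our own rendering of the descriptions above, not the authors' code; the names and the load map are assumptions):

```python
import math
import random

# Sketch of the THRHLD and LOWEST placement rules (our own rendering).
# `loads` is assumed to map processor id -> current task-queue length;
# in the real system each lookup would be a probe message on the BUS.

def probe_limit(n_procs):
    # Section 4 sets Probe_Limit = log2(N) for an N-processor system.
    return max(1, int(math.log2(n_procs)))

def thrhld_destination(loads, self_id, threshold, rng=random):
    # THRHLD: poll random processors one by one and transfer to the
    # FIRST one whose load is below Threshold.
    others = [p for p in loads if p != self_id]
    for dest in rng.sample(others, min(probe_limit(len(loads)), len(others))):
        if loads[dest] < threshold:
            return dest
    return None  # no suitable destination; the task is processed locally

def lowest_destination(loads, self_id, rng=random):
    # LOWEST: poll a fixed number of processors, then pick the most
    # lightly loaded among those polled.
    others = [p for p in loads if p != self_id]
    probed = rng.sample(others, min(probe_limit(len(loads)), len(others)))
    return min(probed, key=lambda p: loads[p])

# Example: 16 processors, Probe_Limit = 4.
loads = {p: random.randint(0, 6) for p in range(16)}
print(thrhld_destination(loads, self_id=0, threshold=3))
print(lowest_destination(loads, self_id=0))
```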
4. RESULTS OF SIMULATION AND QUEUING ANALYSIS
The relative performance of the five algorithms, namely NoCOST, PartCOST, CENTRAL, LOWEST, and THRHLD, has been analyzed using the queuing model as well as in our simulator. Simulation runs were performed over 80,000 tasks generated by an artificial workload model. The workload model simulated a Poisson arrival process at each processor, with exponentially distributed task sizes. The workload parameter values are as follows:

Average task size : 400 units of time.
Average inter-arrival time : 600 units of time.
Average time to transfer a task : 30 units of time.
Probe time : 10 units of time.
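As a quick sanity check on this workload (our own illustration, not part of the original study): with mean task size 400 and mean inter-arrival time 600, each processor runs at utilization 2/3, and under NoLB each processor behaves as an independent M/M/1 queue with theoretical mean response time 1/(1/400 - 1/600) = 1200 time units. A few lines of simulation reproduce this baseline:

```python
import random

# Hedged sketch: simulate the NoLB baseline for one processor using
# Lindley's recurrence. Under the paper's workload this is an M/M/1
# queue, so the simulated mean response time should approach
# 1 / (1/400 - 1/600) = 1200 time units. All names here are ours.

def nolb_avg_response(mean_task_size=400.0, mean_inter_arrival=600.0,
                      n_tasks=80_000, seed=1):
    rng = random.Random(seed)
    wait = 0.0      # waiting time seen by the current task
    total = 0.0     # accumulated response time (wait + service)
    for _ in range(n_tasks):
        service = rng.expovariate(1.0 / mean_task_size)
        total += wait + service
        gap = rng.expovariate(1.0 / mean_inter_arrival)
        wait = max(0.0, wait + service - gap)   # Lindley's recurrence
    return total / n_tasks

print(nolb_avg_response())   # close to the M/M/1 value of 1200
```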
As mentioned earlier, probe time represents the time spent in probing another processor. The load balancing parameters used by the different algorithms have the following values:

Load exchange period : 300 units of time.
Probe limit : log2(N).
Load exchange period represents the interval at which the load balancing scheduler runs. In particular, the algorithm CENTRAL gathers load information at such intervals. Probe limit defines the maximum number of processors polled by the THRHLD and LOWEST algorithms. N denotes the number of processors in the entire system. The value of the parameter Threshold is shown in the figures illustrating the results. If the size of the task queue of a processor exceeds Threshold, then the load balancing algorithm attempts to transfer a task from that processor.

The average response times of the five algorithms executed in simulated systems of 2, 4, 8, 16, 32, and 64 processors are shown in Fig. 2. The average response times are normalized with respect to that of NoLB (the no-load-balancing policy): a normalized response time of 1 indicates that the algorithm does not provide any performance gain over the NoLB case, and a response time below 1 indicates that performance gains are achieved by load balancing. Figure 2 may be compared with Fig. 4, which illustrates the results obtained through queuing analysis. The trends of the curves and the relative performances are roughly similar in both figures, lending credibility to our queuing analysis.

FIG. 2. Results obtained from simulation (average response time vs number of processors).

The contention for the shared BUS is a significant issue in the model considered in this paper. The five algorithms considered in our study do not adapt themselves to the limited bandwidth of the BUS. This leads to performance degradation, as described below.

• Figure 2 shows that up to systems of 16 processors, all the policies exhibit better performance gains with an increasing number of processors. This is because the probability of having a lightly loaded processor in the system increases with the number of processors, offering a chance to balance the load effectively.

• For systems having more than 16 processors, Fig. 2 shows a rise in the average response time for the CENTRAL and LOWEST policies. With more processors in the system, the number of task transfers and load message exchanges (that is, probes) is substantially higher, which results in greater contention for the BUS.

• The contention for the BUS is illustrated (by results obtained from queuing analysis) in Fig. 7, which shows the BUS utilization against the number of processors. For the LOWEST and CENTRAL policies (which involve more message passing than the other policies), we find that for 32 processors the BUS utilization is between 50 and 60%, and for 64 processors the BUS saturates.

The results discussed so far highlight the necessity of adapting to the limited bandwidth of the BUS while performing load balancing. There may be various ways to address this problem. In the following section, we describe a new algorithm which monitors the Threshold parameter to adapt to the limited bandwidth of the BUS.

5. THE ADAPTIVE ALGORITHM: V_THR
The results described in Section 4 suggest that in view of the limited bandwidth of the BUS, it may be wise to tolerate some amount of load imbalance, particularly when the contention for the BUS is high. Threshold is one of the most important parameters used by most dynamic load balancing schemes to decide the level of balance. If a task arrives at a processor whose load is greater than or equal to Threshold, then it becomes eligible for transfer. Therefore by regulating the
value of Threshold it is possible to regulate the task transfer traffic. Figures 3, 4, and 5 show the variation of the average response times of the tasks with the number of processors in the system for three different values of the Threshold. Figures 6, 7, and 8 show the BUS utilization under these conditions. These results have been obtained through queuing analysis, that is, by solving the steady state equations using predictor–corrector methods. The following observations can be made from these figures.

• Up to 16 processors, a Threshold of 3 seems to be appropriate, since it gives almost 35% performance improvement (over NoLB), compared to about 25% improvement for a Threshold of 4, and only 20% improvement when the Threshold is 5. This shows that up to 16 processors the load balancing policies succeed in maintaining the balance at three or fewer tasks per processor.

• From 32 to 64 processor systems, a Threshold of 3 causes an alarming degradation for the algorithms LOWEST and CENTRAL (as shown in Fig. 3). Maintaining the balance at three or fewer tasks per processor for so many processors becomes difficult due to the excessive amount of message passing and task transfers, which saturates the BUS (as shown in Fig. 6). The degradation is not as alarming if the Threshold is increased to 4 (Fig. 4). If the Threshold is increased to 5 (Fig. 5), then there is no degradation up to 64 processors.
FIG. 4. Results obtained from queuing analysis (average response time vs number of processors, single bus configuration).
The results described above suggest that a low value of Threshold is good as long as the balance can be maintained at
that level. With an increasing number of processors, the consequent increase in the task transfer and message traffic may make it difficult to balance the load. Under such situations it may be advisable to sacrifice the level of balance by choosing a larger Threshold, rather than to risk significant performance degradation as a result of greater BUS contention. In other words, the Threshold should be chosen in such a way that:
1. The task and message traffic should not saturate the BUS, and
2. The Threshold should be as small as possible without violating the first requirement, so that the best balance is obtained within the constraints imposed by the limited bandwidth of the BUS.

FIG. 3. Results obtained from queuing analysis (average response time vs number of processors, single bus configuration).

FIG. 5. Results obtained from queuing analysis (average response time vs number of processors, single bus configuration).

FIG. 6. Results obtained from queuing analysis (bus utilization vs number of processors, single bus configuration).

FIG. 8. Results obtained from queuing analysis (bus utilization vs number of processors, single bus configuration).

We describe an adaptive load balancing algorithm which dynamically regulates the Threshold in order to meet the above requirements. We have applied the idea of varying the Threshold to the THRHLD policy to develop the new algorithm V_THR, since the performance of the THRHLD policy is the most promising among the practical policies considered in this study. We first describe the key idea behind varying the Threshold.

• Consider a system consisting of two processors at a time when the load on both processors is above Threshold. If a new task arrives at one of the processors, it will probe the other processor to determine whether it can accept a task. This probe is futile and wastes valuable time, since both task queues are heavily loaded.

• Now consider the following policy for varying the Threshold. Whenever one processor probes the other and finds that the other is also overloaded, it increases its own Threshold to accommodate the new task and impose a stricter condition for a task to become eligible for transfer. Thus, when the system as a whole becomes heavily loaded, the processors increase their Thresholds, thereby reducing the amount of futile probing. On the other hand, when the load on some of the processors decreases, the Thresholds of the heavily loaded processors should be reduced, so that they may transfer some load. This is implemented by means of the following protocol.
FIG. 7. Results obtained from queuing analysis (bus utilization vs number of processors, single bus configuration).
• Whenever a processor is idle, it probes other processors (at random) with a request for a task. Two cases may occur, depending on the load of the probed processor.

1. If the probed processor finds that its load is equal to or above its own Threshold, then it sends a task to the requesting processor.

2. On the other hand, if the probed processor finds that its load is below its own Threshold, then it ignores the request,
allowing the requesting processor to find a more heavily loaded processor. In both cases, the probed processor decreases its Threshold by one, to acknowledge a possible decrease in the system load.

The proposed scheme uses a parameter called Base_Threshold, which acts as a lower bound for the Threshold of every processor; that is, a processor never decreases its Threshold below the given value of Base_Threshold. Base_Threshold denotes the minimum value of Threshold for which the task traffic does not saturate the BUS. The use of this parameter makes V_THR scalable to a reasonably large number of processors. Base_Threshold is essentially a heuristic value computed from prior knowledge about the system as follows.

At a given instant, the number of tasks that are eligible for transfer in a homogeneous system of N processors is given by N * P(n > Threshold), where P(n > Threshold) denotes the probability that the load on a single processor is above Threshold. The average rate T_r at which task transfers are attempted in the whole system is therefore

T_r = N * P(n > Threshold) / Load_Exchange_Period,

where Load_Exchange_Period (as described in Section 4) represents the interval at which the load balancing scheduler runs in a processor. While choosing a value for Base_Threshold, our objective is to ensure that T_r is less than the throughput of the shared BUS. The throughput of the BUS is approximated by the inverse of the time required to transfer an average sized task when the BUS is free.

The value of P(n > Threshold) in the expression for T_r can be computed by treating each processor as an M/M/1 queuing system, where the number of tasks n in the task queue of the processor defines the state of the processor. It then follows from standard analyses of M/M/1 queuing systems [9] that

P(n > K) = (Avg_Task_Size / Avg_Inter_Arrival_Time)^(K+1).

Substituting this result in the expression for T_r and putting T_r < Bus_Throughput, we find that the desired value of Base_Threshold is the minimum integer K which satisfies the inequality

N * (Avg_Task_Size / Avg_Inter_Arrival_Time)^(K+1) < Load_Exchange_Period / Avg_Task_Transfer_Time,

where Avg_Task_Transfer_Time is the inverse of the BUS throughput.
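The inequality can be solved numerically. The following worked computation (our own illustration, using the Section 4 workload values, where rho = 400/600 = 2/3 and the BUS budget is 300/30 = 10) shows the Base_Threshold values it yields for various system sizes:

```python
# Worked computation of Base_Threshold from the inequality above,
# using the workload values of Section 4 (our own illustration).

def base_threshold(n_procs,
                   avg_task_size=400.0,
                   avg_inter_arrival=600.0,
                   load_exchange_period=300.0,
                   avg_task_transfer_time=30.0):
    rho = avg_task_size / avg_inter_arrival                 # 2/3 here
    budget = load_exchange_period / avg_task_transfer_time  # = 10 here
    k = 0
    # Smallest integer K with N * rho**(K+1) < budget.
    while n_procs * rho ** (k + 1) >= budget:
        k += 1
    return k

for n in (16, 32, 64):
    print(n, base_threshold(n))   # 16 -> 1, 32 -> 2, 64 -> 4
```

These values are consistent with the observation in Section 4 that small Thresholds work well for small systems, while a 64-processor system needs a Threshold of around 4 or 5 to avoid saturating the BUS. The complete algorithm follows.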
ALGORITHM V_THR
begin
  if (eligible task has arrived at this processor) then
  begin
    for (maximum number of probes = Probe_Limit) do
    begin
      select destination processor at random;
      if (task queue length of destination < Threshold of destination) then
        generate transfer event of task to destination;
    end;
    if (suitable destination not found) then
      increment Threshold of this processor by one;
  end;
  else if (this processor is idle) then
    while (task queue is empty) do
      if (the BUS is free) then
      begin
        select source processor at random;
        request source processor for a task;
        wait for task or refusal;
      end;
  else if (request for a task has arrived) then
  begin
    if (its task queue length ≥ Threshold of this processor) then
      send a task to requesting processor;
    else
      send refusal to requesting processor;
    if (Threshold of this processor > Base_Threshold) then
      decrement Threshold of this processor by one;
  end;
end.
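For readers who prefer an executable rendering, the sketch below transcribes the three handlers of V_THR into Python under our own assumptions about the surrounding machinery; the callbacks probe, transfer, request_task, send_task, and send_refusal stand in for BUS messages and are hypothetical:

```python
import random

# Python transcription of the V_THR handlers above (a sketch under our
# own assumptions; the callback arguments model BUS messages and are
# hypothetical, not part of the paper).

class VThrProcessor:
    def __init__(self, pid, peers, base_threshold, probe_limit):
        self.pid = pid
        self.peers = peers                    # ids of the other processors
        self.threshold = base_threshold       # current (adaptive) Threshold
        self.base_threshold = base_threshold  # lower bound, never violated
        self.probe_limit = probe_limit
        self.task_queue = []

    def on_eligible_task(self, task, probe, transfer):
        # Poll up to Probe_Limit random peers; transfer to the first
        # whose load is below ITS OWN Threshold.
        for dest in random.sample(self.peers,
                                  min(self.probe_limit, len(self.peers))):
            load, dest_threshold = probe(dest)
            if load < dest_threshold:
                transfer(task, dest)
                return
        # No suitable destination: keep the task and raise our Threshold,
        # so future tasks are less likely to trigger futile probing.
        self.task_queue.append(task)
        self.threshold += 1

    def on_idle(self, request_task):
        # Idle processor: ask a random peer for work (the retry loop while
        # the task queue stays empty is elided here).
        request_task(random.choice(self.peers))

    def on_task_request(self, requester, send_task, send_refusal):
        # A peer asked for work: grant it only if we are at/above Threshold.
        if len(self.task_queue) >= self.threshold:
            send_task(self.task_queue.pop(), requester)
        else:
            send_refusal(requester)
        # Either way, acknowledge a possible drop in system load,
        # never going below Base_Threshold.
        if self.threshold > self.base_threshold:
            self.threshold -= 1
```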
5.1. Experimental Results of V_THR
Since V_THR applies the policy of varying the Threshold to the THRHLD algorithm, we performed simulation studies to compare the performance of our algorithm with variants of the THRHLD algorithm that use different static values of Threshold. The results (shown in Fig. 9) lead to the following observations.
• Figure 9 shows that the performance of V_THR is significantly better than that of THRHLD with static Thresholds
of 2, 3, and 4, respectively, suggesting that the idea of monitoring the Threshold may be quite promising for this model.

• For systems with a small number of processors the BUS contention is low, and therefore V_THR forces the Threshold to low values and achieves a better balance. For a large number of processors, V_THR plays safe by using the lower bound Base_Threshold, below which the Threshold is never reduced. Consequently, performance degradation due to saturation of the BUS is avoided, but some load imbalance remains, which is reflected in the rise of the average response time curve in Fig. 9. In spite of this rise, the performance is better than that of the algorithms with static Thresholds.

FIG. 9. Results obtained from simulation (average response time vs number of processors, single bus configuration).
6. CONCLUSION
In this paper we have studied the effect of the limited bandwidth of the BUS on the performance of dynamic load balancing policies in a shared BUS model. Queuing and simulation studies suggest that, in order to be scalable to a reasonably large number of processors, an effective dynamic load balancing policy should adapt itself to the limited bandwidth of the shared BUS. We have presented an algorithm which dynamically monitors the Threshold in order to achieve this goal to some extent. Monitoring the other load balancing parameters (such as the Load_Exchange_Period) may be an interesting topic for future studies. While we apply the idea of varying the Threshold to the THRHLD algorithm, it may be worthwhile to investigate the effect of applying similar schemes to other strategies.
REFERENCES

1. Butler, R. M., and Lusk, E. L. Monitors, messages and clusters: The p4 parallel programming system. Parallel Comput. 20 (1994), 547–564.
2. Chow, Y., and Kohler, W. Models of dynamic load balancing in a heterogeneous multiple processor system. IEEE Trans. Comput. C-28(5) (May 1979), 354–361.
3. Eager, D., Lazowska, E., and Zahorjan, J. A comparison of receiver-initiated and sender-initiated dynamic load sharing. Technical report, Dept. of Computer Science, Univ. of Washington, Apr. 1985.
4. Eager, D., Lazowska, E., and Zahorjan, J. Dynamic load sharing in homogeneous distributed systems. IEEE Trans. Software Engrg. 12(5) (May 1986), 662–675.
5. Flower, J., and Kolawa, A. Express is not just a message passing system: Current and future directions in Express. Parallel Comput. 20 (1994), 597–614.
6. Hac, A., and Johnson, T. J. A study of dynamic load balancing in a distributed system. Proc. ACM SIGCOMM Symp. Communications, Architectures and Protocols, Aug. 1986, pp. 348–356.
7. Leland, W., and Ott, T. Load balancing heuristics. Proc. ACM SIGMETRICS Conf., May 1986, pp. 54–69.
8. Livny, M., and Melman, M. Load balancing in homogeneous broadcast distributed systems. Proc. ACM Computer Network Performance Symp., Apr. 1982.
9. Sauer, C. H., and Chandy, K. M. Computer Systems Performance Modeling. Prentice–Hall, Englewood Cliffs, NJ, 1981.
10. Skjellum, A., Smith, S. G., Doss, N. E., Leung, A. P., and Morari, M. The design and evolution of Zipcode. Parallel Comput. 20 (1994), 565–596.
11. Sunderam, V. S., Geist, G. A., Dongarra, J., and Manchek, R. The PVM concurrent computing system: Evolution, experiences and trends. Parallel Comput. 20 (1994), 531–545.
12. Yang, Y., and Morris, R. Load balancing in distributed systems. IEEE Trans. Comput. C-34(3) (Mar. 1985), 204–217.
13. Zhou, S., and Ferrari, D. An experimental study of load balancing performance. Proc. 7th Int. Conf. Distributed Computing Systems, Sept. 1987, pp. 490–497.
14. Zhou, S. A trace-driven simulation study of dynamic load balancing. IEEE Trans. Software Engrg. 14(9) (Sept. 1988).
PALLAB DASGUPTA received his B.Tech., M.Tech., and Ph.D. in computer science and engineering from the Indian Institute of Technology, Kharagpur, in 1990, 1992, and 1995, respectively. He is currently working as a visiting lecturer in the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur. His research interests include VLSI CAD, distributed computing, and artificial intelligence.

A. K. MAJUMDER earned an M.Tech. in applied physics in 1968 from the University of Calcutta; a Ph.D. in applied physics in 1973, also from the University of Calcutta; and a Ph.D. in electrical engineering in 1976 from the University of Florida, Gainesville. He was associated with the Electronics and Communication Sciences Unit of the Indian Statistical Institute in Calcutta in 1976–1977. He served from 1977 to 1980 as an associate professor in the School of Computer and System Sciences of Jawaharlal Nehru University in New Delhi. Since 1980 he has been a professor in the Computer Science and Engineering Department of the Indian Institute of Technology at Kharagpur. He was a visiting professor in the Department of Computer and Information Sciences of the University of Guelph in 1986–1987. His research focuses on design automation, database management systems, artificial intelligence, and expert systems.
PRITIMOY BHATTACHARYYA received the M.Sc. and M.Phil. in mathematics and the Ph.D. in computer vision from the Indian Institute of Technology, Kharagpur. He is currently a professor in the Computer Science and Engineering Department of the Indian Institute of Technology, Kharagpur. His research interests include computer vision, distributed systems, and software engineering. He has a number of publications in these areas. He has also coauthored a book on Data Base Management Systems. Dr. Bhattacharyya is a senior member of the IEEE. Dr. Bhattacharyya is currently visiting a number of US industrial firms.

Received July 3, 1992; revised September 13, 1995; accepted March 17, 1997