Reliability of Distributed Programs under Overloads
Shmuel Rotenstreich
The George Washington University, Washington, D.C.
Investigating the possible solutions to the problem of a system being unreliable under time overloads, we have concluded that the best model is a distributed system. This way both distributed systems and modern software systems can be treated in the same manner. Within the framework of the distributed system, two types of techniques are proposed: (1) a group of passive measures that will fight predictable overloads with minimum cost, and (2) a group of active measures in the form of two algorithms, built on top of the passive measures, that will fight overloads whose occurrence and magnitude are not predictable. Because of the different nature of the passive and active measures, there is the need for the active measures to try to achieve maximum system utilization in addition to reliability.
1. INTRODUCTION
System overloading can occur in areas ranging from software systems to hardware systems [1, 2]. Common to most overloaded systems is the obvious need to avoid overloads, the implicit realization that overloads are the exception, and the reality that, once a system is overloaded, a considerable degradation in its performance is to be expected. Overloads take two basic forms: either more data enter the system than can be stored (a space overload) or there are more data than the system has time to process (a time overload). A space overload causes problems but is usually simple to detect by its very nature, and a corrective action can usually be taken at the time of detection. Time overloads, however, are difficult to detect because they manifest themselves in many different ways and at many different times. Even the common trade-off of time and space, in which one may assume that eventually every time overload will result in a space overload and thus become easy to detect, is not realistic in general because the time of detection may be after the system's collapse. This paper deals with time overloads only.
Address correspondence to Dr. Shmuel Rotenstreich, Department of Electrical Engineering and Computer Science, The George Washington University, Washington, D.C. 20052.
The Journal of Systems and Software 9, 29-4 (1989). © 1989 Elsevier Science Publishing Co., Inc.
The work described here began as an attempt in an industrial setting to solve an overload problem in a real-time system [3]. The system was running correctly under reasonable load conditions, but when an overload occurred, and this was a rare occasion, the system either failed altogether or started to produce meaningless output. It became apparent that while entangled with the overload, the system had plenty of maneuvering room in which to avoid either collapsing or being rendered useless. Parts of Section 3 of this paper were implemented in this system, but because of the industrial setting controlled testing was not feasible, nor was there any way to test more sophisticated techniques. This work is meant to develop techniques that will be capable of handling faults, input volume changes, and other causes of system overload. Section 2 outlines a basic framework under which overload tolerance is considered. Section 3 describes two algorithms that can react to different circumstances and still let the system tolerate overloads. Section 4 accumulates the results of simulation runs for Section 3.
2. A BASIS FOR TIME-OVERLOAD TOLERANCE
We assume that the input to the system is in the form of messages, a form general enough to accommodate many systems of interest. The tolerance of the system is characterized by two elements: the continued operation of the system and the successful passing of an acceptance test on its output. The acceptance test is a predicate AC(x1, x2, ..., xn), where a value of 1 indicates success. The test is system-dependent, but one may find many variables common to many systems; for example, output should appear within a limited time from the message input, the output is correct, a certain percentage of the output is correct and on time, etc. A system is said to be in a meaningful state if it passes AC, and it is said to be in a meaningless state if it fails AC. For a system to be considered for time-overload tolerance, it has to meet three preconditions:
pre1. There is a volume of messages under which the system operates correctly and stays in a meaningful state. This volume is called a normal load.
pre2. When the system crosses the line between a normal load and an overload, no additional messages can be processed. This does not imply that the normal load is either known or fixed.
pre3. The system can survive overloads by dropping any messages that cannot be processed, and in doing so it returns to a meaningful state.
pre1 guarantees a starting point for tolerance. pre2 spells out the reality of an overload. pre3 implies that overload tolerance is feasible. This work considers time-overload tolerance in a restricted sense described in two postconditions:
post1. A system is expected to be mostly overload-free and should mostly ignore the possibility of an overload as long as it does not take place.
post2. The system's behavior under time overloads should be close to its behavior under maximal normal load.
post1 means that the system is designed to be free of overloads most of the time. In particular it emphasizes that overload handling should attempt to avoid imposing changes in the normal processing of the system. post2 states that the system will continue to operate and that the overload-handling mechanism will attempt to optimize system utilization both by providing a low-overhead mechanism and by being able to optimize system usage. (The definition of this utilization appears in Section 3.3.c.) The two postconditions are stated in imprecise terms, using words like "mostly" and "close." It is the nature of the design process that causes this type of usage and makes it meaningful at the same time.
The pre- and postconditions stated above can be met by many systems, including process control systems, real-time systems such as a radar system, measurement and data collection systems, and other systems of this type [4, 5].
Systems that usually are not considered real-time can still benefit from the framework of overload tolerance. Later in Section 3, an example of time-overload tolerance handling is applied to a screen editor to achieve performance goals. Throughout this work we will deal with a restricted set of systems with a message processing time independent of the message content. Although this may seem a major restriction, it can easily include many systems with a high volume of messages in which the processing time is message-dependent but the high input rate causes the average processing time of the system before and after overload tolerance to be the same. Since the average processing time does not change, one can refer to such a system as having independent processing time.
A time overload can be regarded as a fault [6], which relates overload handling to a recovery mechanism. Thus, overload handling is either a backward recovery or a forward recovery [7]. In a backward recovery, one saves the system's state under normal load and restores it upon detection of an overload. A forward recovery implies the feasibility of changing the system's state into a new state without going backward. In our case the possibility of a forward recovery is implied by pre3. A backward recovery, consisting of saving the state and the recovery process thereafter, is not always possible in a time-overload context, and when it is possible its cost is too high.
There are two kinds of time overloads: an overload that persists all the time, called a continuous overload, which is characterized by a continuous input rate that is beyond the capacity of the system to process, and an overload that is restricted to a time interval, called an interval overload, which is characterized by a time limit to the condition of excess messages to process.
We distinguish three types of acceptance tests:
AC is applicable, where AC can be applied anywhere in the system. For instance, AC = #msgs < N consists of counting messages and can be applied wherever messages are received.
AC is partially applicable, where AC can be applied at some points in the system but not everywhere. For instance, in a distributed environment some of the variables needed to compute AC may not exist all over the system, and in places in which they do not exist AC cannot be performed.
AC is not applicable, where AC cannot be applied at all because some of its variables are not defined in the system. For instance, if the test requires the variable clock and the system does not have a clock, AC is not applicable. This can happen in both distributed and nondistributed systems.
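The three cases can be illustrated with a small sketch. It is only an illustration under stated assumptions: the helper make_acceptance_test, the variable names msgs, clock, and sent_at, and the limits 100 and 5.0 are hypothetical and do not come from the system described here. A node evaluates AC only when all of the variables AC needs exist locally; otherwise the test is reported as not applicable at that node.

```python
from typing import Callable, Mapping, Optional, Sequence


def make_acceptance_test(
    required: Sequence[str],
    predicate: Callable[[Mapping[str, float]], bool],
) -> Callable[[Mapping[str, float]], Optional[int]]:
    """Build AC for one node: 1 or 0 when applicable, None when it is not."""

    def ac(local_vars: Mapping[str, float]) -> Optional[int]:
        # AC cannot be evaluated where one of its variables does not exist.
        if any(name not in local_vars for name in required):
            return None
        return 1 if predicate(local_vars) else 0

    return ac


N = 100  # hypothetical message limit for AC = #msgs < N

# A counting test can be applied wherever messages are received.
count_ac = make_acceptance_test(["msgs"], lambda v: v["msgs"] < N)
# A delay test needs a clock, which some nodes may not have.
delay_ac = make_acceptance_test(
    ["clock", "sent_at"], lambda v: v["clock"] - v["sent_at"] <= 5.0
)

print(count_ac({"msgs": 42}))   # counting node, under the limit
print(count_ac({"msgs": 150}))  # counting node, over the limit
print(delay_ac({"msgs": 42}))   # node without a clock: not applicable
```

A test that returns None at some nodes and a value at others corresponds to the partially applicable case; one that returns None everywhere corresponds to the not applicable case.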
A common case in which AC is not applicable arises when modern programming techniques produce systems decomposed into independent modules that possess only the information pertaining to their internal function [8, 9]. The case of the applicable acceptance test is less complicated and will receive limited attention in this paper. The other two cases, where AC is partially applicable or not applicable, will receive the most attention from here on.
In this work we use a restricted form of an acceptance test in which AC has two variables, an acceptable delay and a system delay, and the predicate yields 1 when the system delay is less than or equal to the acceptable delay. This restriction is needed because a more general predicate must be application-dependent.
The distributed processing of the messages, shown schematically in Figure 1, is a suitable model for overload handling. It has the advantage of applying to distributed systems as well as to modern programming systems. Although there are systems that have more than one path of service, systems of the type depicted in Figure 1 are among those that suffer from overload problems, and the techniques developed here can be adapted to systems with multiple paths. The size of the buffer attached to each node is an encoding for AC. Once a finite capacity is associated with each node, an overflow of that capacity is similar to applying AC = 0 to overflowing messages. The model in Figure 1 has been studied extensively in queueing network theory, which provides formal or approximate solutions to many systems with different probabilistic behaviors [10, 11]. Those results can be used to obtain measures for the different capacities in the system in the initial design.
Figure 1.
3. TOLERANCE ALGORITHMS
3.1. Passive Techniques
The encoding of AC as the buffer capacity of the node leaves two possible reactions to the dropping of messages. One can either drop the incoming messages when the buffer is full (we call this a normal buffer method), or one can drop the message at the top of the buffer and add the incoming message (we call this a sliding window method, or sliding buffer). A probabilistic treatment of issues related to those methods can be found in the work of Gavish and Schweitzer [12] and Baccelli and Hebuterne [13]. The normal buffer method and the sliding buffer method have the same throughput. The amount of time a message stays in a system with a normal buffer is equal to or longer than the amount of time a message stays in the system with a sliding buffer. Furthermore, the delay of a sliding buffer is inversely proportional to the overload until an asymptotic level is reached. Consequently, interval overloads are best dealt with by normal buffers, while continuous overloads are better handled by a sliding window. Both techniques achieve reasonable results in most systems but tend to be ineffective in systems with high fluctuations or faults.
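The two drop policies can be sketched as follows. This is a minimal illustration, not the implementation used in the industrial system; the class names are hypothetical, and "the message at the top of the buffer" is read here as the oldest message, the one that would be served next.

```python
from collections import deque


class NormalBuffer:
    """Finite FIFO buffer that rejects the incoming message when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, msg):
        """Try to enqueue msg; return the message that was dropped, if any."""
        if len(self.items) >= self.capacity:
            return msg  # normal buffer: the newcomer is lost
        self.items.append(msg)
        return None

    def take(self):
        """Remove and return the next message to process, or None if empty."""
        return self.items.popleft() if self.items else None


class SlidingWindowBuffer(NormalBuffer):
    """Finite FIFO buffer that evicts the oldest message to admit a new one."""

    def offer(self, msg):
        dropped = None
        if len(self.items) >= self.capacity:
            dropped = self.items.popleft()  # sliding window: the oldest is lost
        self.items.append(msg)
        return dropped


normal, sliding = NormalBuffer(3), SlidingWindowBuffer(3)
for msg in range(1, 6):  # five arrivals, no service in between
    normal.offer(msg)
    sliding.offer(msg)
print(list(normal.items))   # the three oldest messages survive
print(list(sliding.items))  # the three newest messages survive
```

Both buffers hold the same number of messages, which is consistent with the equal throughput stated above; the sliding window keeps the more recent ones, which is why its messages spend less time in the system.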
3.2. Active Techniques
A system can be pushed into a meaningless state by a fault that slows down some of its nodes. Such a slowdown may be caused by internal problems, such as a disk failure, a loss of one of several CPUs, etc., and can result, given a fixed buffer size, in a delay larger than expected. Since nonfaulty nodes may not be aware of the failure, they cannot compensate for the longer delay, and without corrective measures the system will remain in a meaningless state.
Faults are not the only reason for the behavior just described. Similar behavior is expected if the input message rate grows beyond the estimate under which the system was designed. Errors in the calculations of the system capacity and processing speed may also push a system into a meaningless state. The last point mentioned demonstrates a possible conflict between interval overloads and continuous overloads. In a normal buffer, a fixed buffer is designed to handle interval overloads, but the same buffer may push the system into a meaningless state if the input rate increases. Similar behavior is expected from the sliding buffer; however, its inherent corrective measures might keep the system in a meaningful state whereas the normal buffer would be in a meaningless state. However, even the sliding buffer cannot guarantee a meaningful state.
A shift to a meaningless state requires both detection and correction. Detection implies that there is a node in which AC is applicable that comes after the nodes in which the fault occurred. We will call such a node a conscious node. Clearly, if AC is applicable in the last node in the chain, the node is conscious. The conscious node must originate the correction that returns the system to a meaningful state. The amount of influence one node can exert on other nodes is system-dependent. A system may consist of many nodes that cannot communicate with each other except for input messages. For instance, a physical node bought from a manufacturer will usually not comply with commands that would change its buffer size. Thus, any adjustment of such a node must be made outside the node. Even when a node accepts buffer-adjusting commands, the access to the node may require communication lines that do not exist. For a physical node, this may mean the lack of physical communication between itself and the conscious node. For a logical node, this may mean that the faulty node program and the conscious node program may not have any interface. In either case, no direct action can be taken.
3.3. The Braces Algorithm
Let i be a node preceding node j, and let both nodes be dropping messages all the time, or both buffers be full, with average processing times of ti and tj, respectively. If ti ≤ tj, then a chain in which j precedes i will have an average delay time lower than or equal to that of the original chain. In the new chain the slower node is put before the faster one; as a result, the faster node is supplied by an input rate that is slower than its throughput. In other words, the second node is not overloaded. The first node stays full constantly because either its input rate is now at least as high as it was in the original chain or its delay contribution does not change. The lowered delay of the permuted system can be used in two ways. In an ad hoc manner, slower nodes can be positioned as close as possible to the beginning of the chain. It can also be used in an active way in the algorithm presented next.
Figure 2.
a. The algorithm. Figure 2 shows a system made up of the nodes N1, N2, ..., Nn, to which an additional node was added at each end of the chain. fixer, the node added at the beginning, regulates the flow of input messages into the system. conscious, the conscious node added at the end, monitors the system with respect to AC and initiates adjustments to the system by communicating with fixer. The algorithm first returns the system to a meaningful state and then makes the system stay in a meaningful state by providing it with only the amount of messages it can process. We assume that an increase either in I (input rate) or in tp took place and that as a result the system is in a meaningless state. We also assume that the new I, or the new tp, is constant. The program in Figure 3 describes the most general elements of the algorithm. Omitted from this description are the details of the actual computation and the optimization needed according to post2; those elements will be presented later in this section.
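The reordering argument at the start of this subsection can be checked with simple arithmetic. The sketch below uses a deliberately crude steady-state model that is an assumption of this illustration rather than part of the analysis: service times are deterministic, an overloaded node keeps its buffer of B slots full so a message spends B times the service time there, and a node that is not overloaded contributes only its service time. The function name chain_delay and the numbers are hypothetical.

```python
def chain_delay(nodes, input_interval):
    """Estimate the steady-state delay of a chain of nodes.

    nodes is a list of (service_time, buffer_capacity) pairs in chain order;
    input_interval is the time between messages offered to the first node.
    """
    delay = 0.0
    interval = input_interval
    for service_time, capacity in nodes:
        if interval < service_time:
            # Overloaded node: buffer stays full, excess messages are dropped.
            delay += capacity * service_time
            interval = service_time  # output is paced by the service time
        else:
            # Node keeps up with its input: no queueing, only service.
            delay += service_time
    return delay


fast = (1.0, 4)  # node i: ti = 1, four buffer slots
slow = (2.0, 4)  # node j: tj = 2, four buffer slots

original = chain_delay([fast, slow], input_interval=0.5)
permuted = chain_delay([slow, fast], input_interval=0.5)
print(original, permuted)  # 12.0 9.0
```

Under these assumptions the permuted chain, with the slower node first, has the lower delay because the faster node no longer accumulates a queue.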
    process conscious1
        loop
            receive input message
            compute Ds
            compute t̂p
        end loop
    end

    process conscious2
        loop
            if not AC then begin
                compute δ
                send (δ, t̂p) to fixer
                wait δ × t̂p time units
            end
        end loop
    end
    (a)

    process fixer1
        loop
            receive message from conscious
            δfixer = δfixer − δ
            save t̂p
        end loop
    end

    process fixer2
        loop
            if δfixer > 0 then begin
                accept input message
                δfixer = δfixer − 1
            end
            else drop message
        end loop
    end

    process fixer3
        loop
            wait t̂p time units
            δfixer = δfixer + 1
        end loop
    end
    (b)

Figure 3. The algorithm program. (a) conscious; (b) fixer.
Figure 4. [Plot of delay against time.]
The algorithm is presented in the form of concurrent processes for which all synchronization problems, which are simple, are taken care of but are not detailed so as to avoid dwelling on minor issues.
The conscious node. This node is made up of two concurrent processes. conscious1 evaluates Ds, the
average system delay, and t̂p; conscious2 invokes AC and initiates adjustments to the system. conscious1 estimates tp in the variable t̂p using message arrival times. This estimate is useful as long as the system is overloaded or at maximum normal load. It will be a wrong estimate when the system is underloaded; however, the algorithm is applied only in a continuous overload. conscious1 computes t̂p, the average processing time of an equivalent system with one node and with an average delay of Ds. It also computes Ds using the system-supplied mechanism embedded in the computation of AC.
conscious2 monitors the system. When AC = 0, it computes δ, the number of messages the system has to lose from its buffer in order to return to a meaningful state. Here, AC is reduced to AC = (Ds ≤ DA). Ds increases from DA to a maximum value of Dsmax, which is reached when the system reaches a steady state. This process is demonstrated by Figure 4. The circles represent points at which conscious2 makes a decision to adjust the system. δi denotes the ith iteration of conscious2's attempt to adjust the system buffers by δ messages. Qi is the accumulated reduction in the system delay. Equation (1) details the computation performed in each iteration.
$$
\delta_i = \frac{\max\left(0,\; D_{S_i} - Q_{i-1} - D_A\right)}{\hat{t}_p},
\qquad
Q_{i-1} =
\begin{cases}
0, & i = 0 \\
\sum_{k=0}^{i-1} \delta_k \hat{t}_p, & i > 0
\end{cases}
\tag{1}
$$
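A short sketch may help in reading Eq. (1) together with fixer's counter in Figure 3(b). It assumes that Eq. (1) reads δi = max(0, Dsi − Qi−1 − DA)/t̂p with Q accumulating δk·t̂p; the names delta_schedule and Fixer, and all numbers, are hypothetical. δ is left as a real number here; rounding to whole messages is omitted.

```python
def delta_schedule(observed_delays, acceptable_delay, tp_hat):
    """Apply the assumed form of Eq. (1) to successive delay readings."""
    q_prev = 0.0  # Q_{i-1}: delay reduction already requested from fixer
    deltas = []
    for d_s in observed_delays:
        delta = max(0.0, d_s - q_prev - acceptable_delay) / tp_hat
        deltas.append(delta)
        q_prev += delta * tp_hat  # accumulate the requested reduction
    return deltas


class Fixer:
    """Credit counter mirroring fixer1, fixer2, and fixer3 of Figure 3(b)."""

    def __init__(self):
        self.credit = 0  # plays the role of the fixer counter

    def on_adjustment(self, delta):
        # fixer1: a request from conscious lowers the credit by delta.
        self.credit -= delta

    def on_tick(self):
        # fixer3: one credit is earned every tp_hat time units.
        self.credit += 1

    def on_input(self):
        # fixer2: accept a message only while credit is positive.
        if self.credit > 0:
            self.credit -= 1
            return True
        return False


# DA = 10, tp_hat = 2; the delay climbs to 18 and then stops growing.
print(delta_schedule([14.0, 18.0, 18.0], acceptable_delay=10.0, tp_hat=2.0))
# [2.0, 2.0, 0.0]

fixer = Fixer()
fixer.on_adjustment(2)
for _ in range(3):
    fixer.on_tick()
print(fixer.on_input(), fixer.on_input())  # True False
```

In the example the third iteration yields δ = 0 because the observed delay has stopped growing, so the reductions already requested cover the whole excess over DA.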
The computation in (1) takes place as long as Ds > DA. When the system delay drops under DA, the iterations in (1) are terminated. If the system gets above DA again, conscious restarts at iteration 0. Qi reflects the amount of delay, or in turn the number of messages, the algorithm wants the system to drop from its buffer. This, however, is not the actual drop if, during the adjustment interval, fixer accepts input messages. At this stage we will say that if no input messages arrive, then (1) guarantees the successful completion of the adjustment. In Section 3.3.b we will discuss this issue again. δi and t̂p are sent to fixer. It takes the system δi·t̂p time to lose δi entries in its queue, provided it does not receive any additional messages at this time. Then conscious2 performs the acceptance test again. At this point, either Dsmax has changed, in which case δi > 0, or there is no change in Dsmax, in which case δi = 0 and either the decline in Ds will take place with no additional adjustments or Dsmax will change again. The time between the last adjustment initiated by conscious and the return of the system to DA is the time it takes the system to process all the messages that are already in the system at the time of the last adjustment.
The fixer node. Fixer maintains a count, δfixer, of the number of input messages it can accept at any given time. When δfixer ≤ 0, messages are dropped. Input messages will reach the system at a maximum rate of 1/t̂p, and messages in excess of this rate will be dropped. Since the throughput of fixer is the throughput of the actual system, fixer will drop the same number of messages as the system will. (This, of course, applies
only to a deterministic system in which all the estimates are accurate.) fixer3 maintains the simulated rate by maintaining a clock rate of 1/t̂p. fixer1 accepts δ's from conscious and merges them into the processing stream of fixer. fixer2 decides between accepting or dropping messages and updates δfixer. The program in Figure 3 does not accept any messages when δfixer ≤ 0. This results in a sharp drop of the system delay as seen in Figure 4. There are, however, alternative approaches in which a gradual drop avoids a loss of a whole section of messages. Such approaches can be adopted to accept messages with higher priority or for other system-dependent reasons. The disadvantage of the gradual approach is the longer time necessary to arrive at a meaningful state.
b. Analysis and refinements. The algorithm in Figure 3 does not reflect the nondeterministic nature of the system and the difficulty in obtaining quantities like t̂p and Ds [14]. We assume that tp is a constant and that its measurements deviate slightly from this constant. Given the times of arrival of messages as tmsg1, tmsg2, ..., tmsgn, the average processing time has to be estimated by averaging the series
$$
t_{msg_2} - t_{msg_1},\quad t_{msg_3} - t_{msg_2},\quad \ldots,\quad t_{msg_n} - t_{msg_{n-1}}
$$
The resulting estimate must reflect changes in the processing time as the result of faults as well as maintain a smoothed constant when the system is stable. An arithmetic average is insensitive to changes, while the exponential average in (2), which updates the previous estimate t̂p′, is more sensitive to changes and at the same time maintains smoothing.
$$
\hat{t}_p = \hat{t}_p' + \alpha\left(t_{msg_n} - t_{msg_{n-1}} - \hat{t}_p'\right)
\tag{2}
$$
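The update in (2) can be sketched in a few lines. The function names, the initial estimate, and the sample arrival times below are assumptions made for illustration; only the update rule itself follows (2), with alpha as the smoothing constant.

```python
def update_tp_estimate(tp_hat, t_prev, t_last, alpha):
    """One step of Eq. (2): move the estimate toward the latest interarrival time."""
    return tp_hat + alpha * ((t_last - t_prev) - tp_hat)


def estimate_from_arrivals(arrivals, alpha, initial):
    """Fold Eq. (2) over consecutive pairs of message arrival times."""
    tp_hat = initial
    for t_prev, t_last in zip(arrivals, arrivals[1:]):
        tp_hat = update_tp_estimate(tp_hat, t_prev, t_last, alpha)
    return tp_hat


# The processing time has changed from 1 to 2 time units; the old estimate is 1.
print(estimate_from_arrivals([0.0, 2.0, 4.0, 6.0], alpha=0.5, initial=1.0))
# 1.875
```

After three samples the estimate has moved from 1 to 1.875, approaching but not yet reaching the new value of 2, which illustrates the lag that the analysis below is concerned with.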
Here msgn is the last message received, and 0 < α < 1. In (2) a large α will prevent good smoothing, while a very small α will be less sensitive to permanent changes in the average processing time. This contradiction between the two goals can be resolved by an additional forecast variable that attempts to predict permanent changes in the value of tmsgn − tmsgn−1. This approach has been avoided here, but not without a price.
The process of adjustment as reflected by the computation in Eq. (1) and the mechanism in fixer assumes that the system will lose enough messages in its buffer to return to a meaningful state. This assumption may be incorrect because of two problems. Suppose a system fault changes tp from X to Y, Y > X. In this case, (2) becomes
$$
\hat{t}_p = Y - \varepsilon = \alpha \sum_{i=0}^{n-1} (1-\alpha)^i \left(Y + \Delta_i\right) + (1-\alpha)^n X
\tag{2a}
$$
where Δi is the deviation of the interarrival time tmsgi+1 − tmsgi from Y, and ε > 0. Thus, the time it
takes conscious to reach a "good" estimate of tp is constant. This time may be longer than the time it takes (1) to arrive at Dsmax. Consequently, fixer, using a low t̂p, sends messages into the system at a rate that is higher than the output rate. This may cause the anticipated decline in the queue size to not materialize.
The other problem inherent in the process is that the time it takes the braces algorithm to reduce the system delay to DA from Dsmax is shorter than the time it takes the delay to increase from DA to Dsmax. The system delay increases from DA to Dsmax on average every tp time units, simply because this is the sampling interval, and by increments of less than or equal to tp. If t′msgi is the time of arrival of msgi, then
$$
\begin{aligned}
\text{delay increase} &= \left(t_{msg_i} - t'_{msg_i}\right) - \left(t_{msg_{i-1}} - t'_{msg_{i-1}}\right) \\
&= t_{msg_i} - t_{msg_{i-1}} - \left(t'_{msg_i} - t'_{msg_{i-1}}\right) \\
&= t_{p,msg_i} - (\text{difference of arrival time}) \le t_{p,msg_i}
\end{aligned}
\tag{2b}
$$
In other words, fixer will finish dropping one entry in time t̂p, but it will take the system longer to increase its buffer size by one entry. During the difference, fixer will accept input messages.
Solutions to the two problems mentioned above can be achieved in several ways. For instance, fixer can totally stop input until the adjustment is completed; this will take care of the wrong estimate problem. Simultaneously, for the difference between t̂p and the delay increase, one could use the delay increase instead of t̂p. The solution implemented by this algorithm is a logical consequence of the design directives as presented by the pre- and postconditions. Namely, we are seeking the simplest solution in both algorithm design complexity and execution time. The algorithm, then, stays as described in Figure 3, and the remedy implemented is a time-out. Since it takes the system at least
$$
D_{S_{max}} - D_A
\tag{2c}
$$
to get maximum delay, this is a bound that can be used for a time-out process waiting for a meaningful state. If a meaningful state is not reached within a small multiple of (2c), a readjustment of the system will be attempted. The advantage of the retry is that the two problems discussed above will disappear. First, t̂p will reach a "good" estimate, and second, Dsmax will already be achieved. Thus the retry should be successful.
Although we assume that AC is applicable in a conscious node, and thus Ds is computed as part of it, it might be of interest to discuss the nature of this function. The evaluation of Ds involves considerations similar to the ones involved in the estimate of tp. In a system with a large volume of messages, a delay in one message is of
no consequence provided all messages are of the same importance. The following formula represents a possible solution, where Ds is the estimated system delay and Dmsg is the last message delay.
$$
\hat{D}_S = D_S + \beta\left(D_{msg} - D_S\right)
\tag{3}
$$
c. Optimization. In this section the attribute optimization refers to the optimization of the utilization of the actual system. It does not refer to optimizing the algorithms within conscious and fixer. The braces algorithm attempts to minimize both the time in which the equivalent system has a queue size of zero and the time in which the system is in a meaningless state. t̂p may be a good estimate at a given time, but if tp decreases considerably the system will consume the messages in the buffer faster, possibly leading to intervals in which the queue size is zero. This underutilization is caused by the reduction of the number of messages in the buffer induced by the braces algorithm. A symmetric argument can be made when tp increases, which may lead to AC = 0. In both instances, the optimization stage of the algorithm must attempt to predict in time the trends in the system, thereby achieving optimization.
The prediction of the trends involves the understanding of the relation among fixer, the system, and conscious. If the actual system as represented by the current messages being processed is in the present, then conscious is in the past and fixer is in the future. This relation means that at the time conscious performs its computation, the estimates may not represent the state of the actual system. For example, t̂p will be used by fixer to filter in the right amount of input messages. But by the time this portion of fixer's messages gets processed by the system, the actual average processing time may have already changed. To avoid underutilization, conscious must catch the queue size at a point at which an increase command sent to fixer will affect the system queue before the queue size gets to zero. A similar argument applies when the queue size grows. A lower limit, DL, and an upper limit, DU, are defined for Ds. Whenever Ds < DL, conscious instructs fixer to increase the queue size, and whenever Ds > DU, conscious instructs fixer to decrease the queue size. This strategy allows the braces enough time to react, and lowers the likelihood of underutilization or a meaningless state. The system is adjusted to a middle boundary, DM, instead of to DA. A reasonable choice will be DM = (DU + DL)/2; however, the only requirement is that DL < DM < DU.
The optimization changes to the braces algorithm are restricted to conscious2, which is shown in Fig. 5. Fixer is not affected by them because a negative δ increases δfixer and thereby the number of messages accepted by it. The handling of the adjustment at the upper and lower limits is not symmetric, as reflected by the missing wait in the lower-bound crossing case. Waiting for the termination of the increase by fixer must take into account the input rate. In Figure 3, fixer completes an increase adjustment when δfixer = 0 in fixer2, but fixer2 depends on the input rate. Since conscious does not know the input rate, it cannot define the wait function, thereby causing extra transmissions to fixer. The extra transmissions have no other negative effect.

    process conscious2
        loop
            if Ds > DU then begin
                if first iteration then create conscious3
                compute δ
                send (δ, t̂p) to fixer
                wait δ × t̂p
            end
            else if Ds < DL then begin
                compute δ
                send (δ, t̂p) to fixer
            end
            else if adjustment terminated then begin
                abort conscious3
                reset iteration
            end
        end loop
    end

    process conscious3
        wait Dsmax
        reset iteration
    end

Figure 5.
d. Variations. This subsection describes two variations on the braces algorithm. The first variation is a special application of the algorithm, while the second shows a system in which AC must be defined externally.
Different location of fixer. Figure 6 shows a different deployment of the braces algorithm. Here, the braces embrace only a segment of the system. The validity of this deployment can be easily verified; the position of fixer, as shown in Figure 2, does not figure in its mechanism. The data rate is the only factor in its operation, and as such the data rate coming out of N2 is the data rate used by fixer.
Figure 6.
This variation will be useful when the section preceding fixer drops messages based on a system-dependent criterion, for example, dropping messages with low priority.
Screen editor. We view a screen editor as a service complex that consists of three functional modules: input, the end through which keyboard commands enter and which is in all likelihood part of the operating system; editor, the editor itself; and display, the other end of the editing process, which embodies both the operating system output mechanism and the physical display as seen by the user. When editor takes a long time to process commands, the user tends to be impatient and commands received by input are stuck in the buffer of editor. This may cause discrepancies between the actual text file and its display. For instance, if the user enters 'delete line' and those lines are not deleted from the screen, repeated 'delete line' commands may delete lines the user did not intend to delete. The screen editor is an example of a system in which AC is not applicable. Since usually a clock is not
associated with an editor, an acceptance test cannot be implemented without a major design change. Figure 7 shows an adaptation of the braces algorithm to a screen editor. fixer and conscious are joined together in the following manner: given that the editor must explicitly respond to every command, fixer sends each command's arrival time to conscious, while conscious knows the time of the command's departure. This information enables the modified algorithm to lock the keyboard instead of dropping messages when the response time of the editor is too long. This example shows that the braces algorithm can be implemented as a standard service provided by the operating system, in a manner similar to other fault tolerance services provided by modern operating systems.
Figure 7.
3.4. Half-Braces Algorithm
One of the main shortcomings of the braces algorithm is its need to monitor the system and to intervene in its processing. A system that provides the capability to communicate adjusting commands on the path of the input messages can use an algorithm that can be invoked only once with no need for the optimization follow-up procedure. The half-braces algorithm is such an algorithm. The half-braces algorithm does not have a fixer node; instead each node contains a mechanism that takes care of the reduction in the node's buffer size and the communication of the reduction to other nodes. The conscious node is similar to the process bearing the same name in the braces algorithm. The computation of Ds, t̂p, and δi is the same; the difference is in the computation of Qi, which is performed only after the reply from the nodes in the system is received. This computation is done in the following manner:
$$
Q_i =
\begin{cases}
(\delta_i - \delta_r)\,\hat{t}_p, & i = 0 \\
Q_{i-1} + (\delta_i - \delta_r)\,\hat{t}_p, & i > 0
\end{cases}
$$
δr is the δ returned from the system. (In Figure 8, δr = δn.) Figure 8 shows schematically the communication used by the half-braces algorithm, while the algorithm appears in Fig. 9.
Figure 8.
Conscious computes δ in the same way it was computed in the braces algorithm, and it then sends δ to the first node on the chain with which it communicates. Each node in the chain recognizes the message as a request to decrease its buffer size by δ. When a node gets a message with δ = 0, it passes it on to the next node. Only δ ≠ 0 is subject to node consideration. Nodes that drop messages consult their own logic to decide whether they can afford to decrease their buffer size. If a decrease cannot be accommodated, the message is passed on without change. If the node can afford a decrease d ≤ δ, then it passes on a message with the new δ, denoted δi where i is the node index: δi = δi−1 − d. Nodes that do not drop messages can reduce their buffer size by d ≤ δ if they find it fit to do so, but the δ retransmitted is decreased only by the number of occupied buffer locations actually dropped. The decrease in buffer size is not necessarily distributed uniformly among the nodes. A more uniform distribution may be implemented if a better knowledge of the system exists. The adjustment message eventually returns to conscious with δr ≥ 0.
If $\delta_i = \delta_r$ (the value returned equals the value sent, so no node reduced its buffer) and the system still fails AC, the half-braces algorithm has the option to terminate in failure or repeat the request with an appropriate retry count. Before conscious goes into the next iteration, it must wait the amount of time needed to decrease the system buffer size by $\delta_i - \delta_r$.
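The conscious side of this exchange can be sketched in the same spirit. The sketch assumes that the waiting time is $(\delta_i - \delta_r)\,t_p$, which is one reading of the garbled expression in Figure 9; the callables `accepts`, `compute_delta`, and `send_to_chain`, and the `max_retries` parameter, are hypothetical and are supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple


def wait_time(delta_sent: int, delta_returned: int, t_p: float) -> Optional[float]:
    """Time to let the system absorb the cut, or None if nothing was cut."""
    if delta_returned == delta_sent:
        return None
    # Assumed reading of Figure 9: one processing time per removed location.
    return (delta_sent - delta_returned) * t_p


def run_conscious(accepts: Callable[[], bool],
                  compute_delta: Callable[[], int],
                  send_to_chain: Callable[[int], int],
                  t_p: float,
                  max_retries: int = 3) -> Tuple[bool, List[float]]:
    """Repeat adjustment requests until AC holds or the retries run out."""
    waits: List[float] = []
    retries = 0
    while not accepts():
        delta = compute_delta()
        returned = send_to_chain(delta)
        pause = wait_time(delta, returned, t_p)
        if pause is None:
            retries += 1
            if retries > max_retries:
                return False, waits  # terminate in failure
            continue
        retries = 0
        waits.append(pause)  # a real implementation would sleep here
    return True, waits
```

The function records the waiting times instead of sleeping so that the control flow can be checked without a running system.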
The probabilistic arguments in this algorithm are similar to the probabilistic arguments in the braces algorithm. An outcome of the permanent decrease in the buffer sizes is the ability to do away with the need to monitor the fluctuations in the system’s delay. After the first adjustment, the system should always be in a meaningful state. Still, as was argued before, the adjustment activity is based on estimates that may turn out to be wrong, causing the system to reach a meaningless state or to be underutilized. If one associates the cutback of the buffer sizes with a verified logic that guarantees a resulting “reasonable” system, then the number of adjustments needed depends on the errors in the estimates.
The applicability of the half-braces algorithm is more restricted than the applicability of the braces algorithm. This results not only from the need for the cooperation of the whole system, which is a major point, but mainly from the fact that the logic within each node for decreasing the buffer size is not well defined. If the larger-than-accepted buffer sizes can be clipped down, why were they too large to begin with? A possible reason for using the half-braces algorithm is the presence of predictable changes in the input rate or predictable faults that can be better adjusted to using this algorithm. While the braces algorithm can be used as a system service independent of the system, at the cost of providing it with AC and a way to estimate $D_S$, the same cannot be said directly about the half-braces algorithm. In the latter case the cooperation between the desired system service and the system must include, on top of the elements needed by the full algorithm, a communication protocol and a buffer adjustment facility in those nodes that are willing to participate.
4. SIMULATION RESULTS
The various techniques were simulated. Each node was represented by an independent process that took its definitions of the buffer technique, buffer size, and type of processing from a parameter list. The nodes became a system through the ordering of the processes and through interprocess communication between them. This way, full concurrency as well as total independence was easily achieved, and at the same time it became easy to create any desirable system.
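The simulation structure just described, a chain of independent nodes built from a parameter list, might be sketched as below. This is an illustrative assumption, not the simulator used for the results: it uses threads and bounded queues in place of operating-system processes, it models only a blocking buffer technique, and `STOP`, `node`, and `build_system` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the simulation


def node(inbox, outbox, service):
    """One node: take a message from its buffer, process it, pass it on."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            outbox.put(STOP)  # propagate the end of the simulation
            return
        outbox.put(service(msg))


def build_system(params):
    """Build a chain from (buffer_size, service) pairs, one per node.

    Returns the system input queue, the system output queue, and the threads.
    """
    # One bounded buffer per node, plus an unbounded queue for the output.
    queues = [queue.Queue(maxsize=size) for size, _ in params]
    queues.append(queue.Queue())
    threads = [
        threading.Thread(target=node, args=(queues[i], queues[i + 1], service))
        for i, (_, service) in enumerate(params)
    ]
    for thread in threads:
        thread.start()
    return queues[0], queues[-1], threads
```

Because the system is defined only by the order of the parameter list, a different chain is obtained by passing a different list, which mirrors the point made in the text about creating any desired system.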
process conscious1
    loop
        receive input message
        evaluate t_p
        compute D_S
    end loop
end

process conscious2
    loop
        if not AC then
        begin
            compute delta_i
            send (delta_i) to first node
            wait to receive delta_r
            if delta_i = delta_r then
                terminate (or time-out)
            else
                wait (delta_i - delta_r) * t_p
        end
    end loop
end
(a)

process node-adjust
    loop
        when message from conscious
            if delta <> 0 then
                if dropping messages then
                begin
                    compute buffer decrease
                    compute new delta
                    transmit delta
                end
                else
                begin
                    compute buffer decrease
                    if queue-size > buffer-size - decrease then
                        delta = delta - (queue-size - (buffer-size - decrease))
                    transmit delta
                end
            else
                transmit delta = 0
    end loop
end
(b)

Figure 9. The half-braces algorithm program: (a) conscious; (b) buffer adjustment in each node.

The additional nodes needed for the algorithms, conscious and fixer, were also implemented as processes, and the collation between them was achieved by interprocess communication as well. They were fitted into the system in the same way as the original system was fitted.
The following is a collection of facts and clarifications about the simulations:
The input to all simulation runs was Poisson distributed with $\lambda = 1$.
The distribution of the service time used everywhere in those runs is general, with a second moment of 2.5 times the first one.
In all runs the limits are $D_L = D_A/4$, $D_U = 3D_A/4$, and $D_M = D_A/2$.
The entries in the table columns $D_L$ and $D_U$ stand for the number of times the limits were crossed after the first adjustment.
In all runs the parameters in (2) and (3) are $\alpha = 0.01$ and $\beta = 0.2$.
The different processing times found in the tables are arithmetic averages from the first to the last input arrival; the only case in which this will be very far off the correct estimate is the fault simulation in Figure 4; in that case the number stands for the last value computed by (2) in Section 3.
All simulations stop when the first node in the chain detects the end of the simulation.
The counts of delayed messages, the delay time, and the percentage of delay time given in the various tables were started after the first system adjustment was accomplished.
Tables 1 and 2 summarize four runs that terminate at time 99000. The system consists of two nodes in which the buffer size was computed erroneously, and the braces algorithm attempts to adjust the system. Three numbers in Tables 1 and 2 best describe the performance of the algorithm: the crossings of the upper and lower bounds and the number of output messages. The number of crossings grows as $D_A$ decreases, which is expected, since the interval in which the algorithm initiates adjustments shortens. The number of output messages remains within a range of less than 1% of the total output, which without any additional measure can be interpreted to indicate that the algorithm performs well.

Table 1.
t_p^1    n_1    t_p^2    n_2
1.109    780    1.223    780
1.108    780    1.221    780
1.113    780    1.221    780
1.109    780    1.220    780

Table 2.
D_A   D_U   D_L   Output   No. of delayed msgs   Delay time   Percent delay time
800     1     1    77821                     0            0                  0.0
600     6    18    77776                     0            0                  0.0
250    33    68    77631                    75        100.1                0.001
120   111   177    77336                  1681       2122.1                0.021
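As a small arithmetic check on the setup above, the limits can be computed for each $D_A$ value, and the last column of Table 2 can be compared with the delay time divided by the run length. The reading of that column as a fraction of the 99000 time units is an assumption made here; it is merely consistent with the reported figures. The function names are hypothetical.

```python
def limits(d_a: float) -> dict:
    """Lower, middle, and upper limits derived from D_A as stated in the text."""
    return {"D_L": d_a / 4, "D_M": d_a / 2, "D_U": 3 * d_a / 4}


def delay_fraction(delay_time: float, run_length: float) -> float:
    """Delay time as a fraction of the run (assumed meaning of the last column)."""
    return delay_time / run_length


if __name__ == "__main__":
    for d_a in (800, 600, 250, 120):
        print(d_a, limits(d_a))
    # Delay times reported in Table 2 for the two runs with delayed messages.
    for delay in (100.1, 2122.1):
        print(delay, round(delay_fraction(delay, 99000), 3))
```

Rounded to three decimals, the two fractions agree with the table entries 0.001 and 0.021.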
Table 3.
t_p^1    n_1    n_1^new    t_p^2    n_2    n_2^new    t_p^3    n_3    n_3^new
3.619    200    26         3.622    200    28         0        0      0
1.118    780    69         1.224    780    66         1.604    45     21

Tables 3 and 4 show the results of a simulation with the half-braces algorithm. The system is adjusted to $3D_A/4$ instead of $D_A$. The results show that the half-braces algorithm produces a more stable performance than the braces algorithm. The second entry in Table 3 is repeated in the first entry in table and shows fewer fluctuations in the half-braces algorithm. The half-braces algorithm is not compared with the output of the braces algorithm because of their different logic. Instead, the percentage of the idle time is given. The adjusted buffer sizes appear under $n_i^{new}$. In the half-braces algorithm the percentage of idle time indicates the amount of underutilization caused by the cut in buffer sizes. Since the numbers are very small, the runs indicate no adverse effect of lowering the buffer size on the performance. Another issue in this algorithm is the existence of delays after the adjustment. The numbers in the percent delay time column are low but point to the existence of the problem. The question with an actual system is whether those figures will pass the acceptance test.
Tables 5 and 6 contain the results of the braces algorithm operating with a fault at time 7000, which changes the average processing time of the third node from about 0.91 to higher than 4.5. The interpretation of the results in Tables 5 and 6 is similar to that of Tables 3 and 4, where the additional runs point to the impression that the algorithm behaves in the predicted manner.
Tables 7 and 8 are similar to Tables 1 and 2, but they run a shorter simulation, to 50000, with a different system. In all the simulation runs, the utilization of the system seems to be reasonable. The main difference between runs with different values of $D_A$ is not in the number of output messages but in the higher number of upper and lower limit crossings in the case of lower values of $D_A$. The difference is also manifested by the growing number of delayed messages.

Table 5.
t_p^1    n_1    t_p^2    n_2    t_p^3    n_3
0.925    26     0.919    26     5.571    266
0.923    26     0.913    26     6.417    266
0.913    26     0.917    26     6.591    266

Table 6.
D_A   D_U   D_L   Output   No. of delayed msgs   Delay time   Percent delay time
800    10    10    14761                     0            0                  0.0
250    26    32    14741                   151        726.8                0.015
120    52    88    14451                   341       1655.9                0.033
5. CONCLUSION
The techniques presented in this paper offer the user several options when dealing with overload problems. All the techniques are inexpensive in both CPU time and program size. However, they do not answer all possible time-overload problems. Extreme cases in which the behavior of the system is especially chaotic cannot be handled with reasonable performance by these techniques. Additional work can improve the performance of the algorithms. Possible avenues are to make conscious adapt to the behavior of a system and, if the behavior is periodic, to exploit this to improve its estimates; to explore the self-adjusting logic in the half-braces algorithm; and to handle multiple faults and fixes.

Table 4.
D_A   D_L   D_U   Idle time   No. of delayed msgs   Delay time   Percent delay time
250     0             4.0                     183        677.9                0.014
250     0           2.001                     118        189.8                0.004

Table 7.
t_p^1    n_1    t_p^2    n_2    t_p^3    n_3
1.109    780    1.220    780    1.610    45
1.115    780    1.224    780    1.600    45
1.109    780    1.218    780    1.603    45

Table 8.
D_A   D_U   D_L   Output   No. of delayed msgs   Delay time   Percent delay time
800     2    12    30324                     0            0                  0.0
250     2   128    30266                     0            0                  0.0
120    47   112    30151                   116        146.5                0.003