A Simple Approach to System Modeling

Yonathan Bard
IBM Cambridge Scientific Center, Cambridge, MA 02139, U.S.A.

Received 20 February 1981; revised version received 4 April 1981

This paper describes an approach to system modeling based on heuristic mean value analysis. The virtues of the approach are conceptual simplicity and computational efficiency. The approach can be applied to a large variety of systems, and can handle features such as resource constraints, tightly and loosely coupled multiprocessors, distributed processing, and certain types of CPU priorities. Extensive validation results are presented, including truly predictive situations. The paper is intended primarily as a tutorial on the method and its applications, rather than as an exposition of research results.

0. Foreword

This paper presents an approach to system modeling which is an outgrowth of the author's work on the VM/370 Predictor model [1]. While details of that model have appeared piecemeal in the literature [2,3], a comprehensive description of the underlying methods is lacking. We shall attempt to present here a more or less complete exposition of this approach. On the one hand, we shall drop all reference to specific features of VM/370. On the other hand, we give several hitherto unpublished extensions, including treatments of CPU priorities and of distributed systems.

Our approach is derived from mean value analysis of queuing networks [4] and its heuristic extensions [5], coupled with the Schweitzer approximation [3 (Appendix B), 6]. The presentation here is entirely heuristic. Indeed, the only known justification for this method is its practical success in a large variety of system models, as will be seen in some of the results accompanying the exposition. The presentation will be deliberately kept at an elementary level, if only to demonstrate one of the great attractions of this approach, namely its inherent simplicity (excluding, unfortunately, the I/O subsystem model).
Yonathan Bard is a staff member at the IBM Cambridge Scientific Center, where he has worked since 1969 on many aspects of computer performance analysis. He has produced several widely used programs for the measurement, analysis, and modeling of the VM/370 System. Prior to that he was a member of the IBM New York Scientific Center, where he worked on computer applications in the fields of chemical engineering, operations research, and statistics. He received his Ph.D. in chemical engineering from Columbia University in 1966. He is the author of a book on nonlinear parameter estimation, and of many papers on diverse aspects of performance analysis. He has taught a course on the latter subject at Boston University.

North-Holland Publishing Company
Performance Evaluation 1 (1981) 225-248
0166-5316/81/0000-0000/$02.75 © 1981 North-Holland
Y. Bard / A simple approach to system modeling

1. Introduction

We shall be concerned here with modeling the performance of a computer system. By this we mean estimating the performance of a system with a given configuration under a given workload. The performance measures whose mean values are to be estimated are principally:

Response time: The time it takes the system to execute a given 'transaction' submitted by a user.
Throughput: The rate at which the system completes submitted transactions.
Utilization: The fraction of time that various system components (CPU, I/O devices) are busy.

The above definitions imply that the user workload can be broken up into individual work units which we call 'transactions', but they may equally well be called 'jobs', 'job-steps', 'programs', or 'queries'. We shall allow for transactions of different types, and attempt to compute response times and throughputs for each type.

We shall need to characterize both the system configuration and workload. The former is described in terms of CPU speed, storage size, and I/O capabilities. The latter is described in terms of the number of users generating each transaction type, and the system resources (CPU time, storage, I/O) consumed by a transaction of each type. The precise list of quantities required to specify the system and workload will be generated alongside the model, since the modeling equations will themselves point out the required inputs.

2. The basic model

2.1. Basic model formulation and solution

In the following discussion we make some rather stringent simplifying assumptions, of which many will be relaxed in later sections. We assume that there is a fixed number of users, each with his own terminal, and that each user generates only one class of transactions. A transaction is defined as the work done by the computer for a user between successive terminal interactions. Furthermore, we assume that a user can have only one 'active' transaction at a time. From the computer's point of view, a user may be in any one of the following states:

(1) DORMANT: No transaction is pending. The user is reading the output from the previous transaction, thinking, entering the input for the next transaction, or taking a coffee break.
(2) WAITING: A transaction has been received, but not yet scheduled for processing by the system.
(3) ACTIVE: A transaction is being processed.

Our main objective is to estimate how long the user has to wait for the system, i.e. to compute the:

RESPONSE TIME = WAITING + ACTIVE TIME.

The flow of transactions is depicted in Fig. 1. Let subscripts $i$ and $j$ refer, respectively, to the user classes and user states.
Thus $j$ can take the values 1 (dormant), 2 (waiting), and 3 (active). Let the following quantities be defined:

$n_i$ = average number of class $i$ users (need not be integers);
$T_{ij}$ = average time that a class $i$ user spends in state $j$ per transaction;
$N_{ij}$ = average number of class $i$ users in state $j$ at any one time;
$T_i$ = average class $i$ transaction response time;
$\lambda_i$ = rate at which class $i$ transactions are completed.

Fig. 1. Transaction flow in basic model (states: DORMANT, j = 1; WAITING, j = 2; ACTIVE, j = 3).

It should be obvious that the $n_i$ must be specified as inputs. The same applies to the $T_{i,1}$, i.e. the average user dormant times (often referred to as 'think times'). It should also be obvious that the following relations hold, the first one by definition:

$$ T_i = T_{i,2} + T_{i,3}, \qquad (1) $$

$$ n_i = N_{i,1} + N_{i,2} + N_{i,3}. \qquad (2) $$
The total transaction cycle time of a class $i$ user is $T_{i,1} + T_{i,2} + T_{i,3}$, so that he completes $1/(T_{i,1} + T_{i,2} + T_{i,3})$ transactions per unit time. Since there are $n_i$ such users, the total class $i$ transaction rate is

$$ \lambda_i = n_i/(T_{i,1} + T_{i,2} + T_{i,3}). \qquad (3) $$

A similar argument can be applied to the class $i$ users in each state, resulting in:

$$ \lambda_i = N_{ij}/T_{ij}, \quad j = 1,2,3. \qquad (4) $$

Eqs. (3) and (4) are known in queuing theory as Little's Formula. This formula has natural counterparts in many branches of science and engineering. For instance, in chemical engineering one learns that the average residence time $T$ of a substance flowing through a reactor of capacity $N$ at a steady rate $\lambda$ is $T = N/\lambda$. By eliminating $\lambda_i$ between (3) and (4) we are led to:

$$ N_{ij} = n_i T_{ij}/(T_{i,1} + T_{i,2} + T_{i,3}), \quad j = 1,2,3. \qquad (5) $$
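As a small numerical illustration of eqs. (3) to (5), the following Python sketch computes the throughputs and state populations from given state times. The function name `little` and the numbers (20 users, 10 s think time, 1 s wait, 4 s active) are hypothetical, chosen only for the example.

```python
def little(n, T):
    """Apply eqs. (3)-(5).

    n[i] -- number of class i users
    T[i] -- (T_i1, T_i2, T_i3): time per transaction in the dormant,
            waiting, and active states
    Returns (lam, N): throughputs lam[i] and populations N[i][j].
    """
    lam, N = [], []
    for n_i, T_i in zip(n, T):
        cycle = sum(T_i)                      # T_i1 + T_i2 + T_i3
        lam_i = n_i / cycle                   # eq. (3)
        lam.append(lam_i)
        N.append([lam_i * T_ij for T_ij in T_i])  # eq. (4), i.e. eq. (5)
    return lam, N


# Hypothetical single-class workload.
lam, N = little([20], [(10.0, 1.0, 4.0)])
print(lam[0])   # transactions per second
print(N[0])     # average users dormant, waiting, active
```

Summing the three populations returns the number of users, which is eq. (2).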
The $\lambda_i$, $N_{ij}$, and $T_{ij}$ are the fundamental unknowns whose values must be determined. Once that is done, all the important performance measures can be computed, e.g. from eq. (1). Eqs. (4) and (5), however, are not enough to solve the problem: one needs another set of equations, say of the general form

$$ T_{ij} = f_{ij}(N, \lambda), \qquad (6) $$

where $N$ is the matrix of $N_{ij}$, and $\lambda$ is the vector of throughputs $\lambda_i$. We refer to eq. (6) as the delay equation, since it is derived by considering the delay $T_{ij}$ that a class $i$ transaction suffers, on the average, in state $j$. It is our job now to derive appropriate expressions for the $f_{ij}$ in the various system states. But before we do that, let us note that eqs. (4), (5), and (6) together can be solved for the $\lambda_i$, $T_{ij}$, and $N_{ij}$ by means of the following algorithm.
General algorithm

Inputs
$n_i$: no. of class $i$ users,
$f_{ij}$: state $j$ delay functions.

Algorithm
(1) Assign some initial guess values to the $T_{ij}$.
(2) Evaluate $N_{ij} = n_i T_{ij}/(T_{i,1} + T_{i,2} + T_{i,3})$ for all $i$, $j$, and $\lambda_i = n_i/(T_{i,1} + T_{i,2} + T_{i,3})$ for all $i$.
(3) Evaluate $T_{ij} = f_{ij}(N, \lambda)$ for all $i$, $j$.
(4) If the new values of the $T_{ij}$ are sufficiently close to the previous values, terminate. Otherwise, return to step 2.

With the exception of some extremely overloaded systems, this process seems always to converge in a few (say five to thirty) iterations (see Section 2.4 and the Appendix for discussion). We turn now to deriving the $f_{ij}$.

Dormant state (j = 1): We have already remarked that the $T_{i,1}$ are input variables which need to be specified externally from the model. Therefore, eq. (6) for $j = 1$ takes the rather trivial form $T_{i,1}$ = given constant (possibly different for each $i$). This constant is often referred to as think time. For now, transmission delays between the terminal and CPU are included in the think time (see Section 3.3).

Wait state (j = 2): A transaction's delay in the wait state depends on the system's scheduling algorithm, i.e., on how and when the system selects a waiting transaction for admission to the active set. We shall assume that the system possesses various resources, e.g., memory space or tape drives, which need to be allocated to active transactions. There is a limited amount, say $S_k$, of resource $k$ available, and the average amount of resource $k$ required by a class $i$ transaction is a quantity $S_{ik}$ known to the system. Both the $S_k$ and $S_{ik}$ must be inputs to the model.

The scheduling algorithm we shall model now is: Waiting transactions are queued in first-come first-served order. The first transaction in the queue is admitted to the active set as soon as resources are available to take care of its needs. Since this algorithm does not discriminate between user classes, it should result in all average waiting times being roughly equal (it is true that a transaction with low resource demands may have a shorter wait when it is the first one in the queue, but until that time all classes are treated alike). Hence:

$$ T_{i,2} = W, \quad i = 1,2,\ldots, \qquad (7) $$

where $W$ is an unknown average wait time. To determine $W$, we need to consider what happens in the active set. Specifically, observe that $\sum_i N_{i,3} S_{ik}$ is the total average amount of resource $k$ held by active users. The scheduling algorithm does not allow this quantity to exceed the total amount of resource $k$ available, i.e.,

$$ \sum_i N_{i,3} S_{ik} \le S_k, \quad k = 1,2,\ldots. \qquad (8) $$

By substituting (5) and (7) into (8) we obtain:

$$ \sum_i \frac{n_i T_{i,3} S_{ik}}{T_{i,1} + W + T_{i,3}} \le S_k, \quad k = 1,2,3,\ldots. \qquad (9) $$
It is clear that (9) can always be satisfied by making $W$ large enough. If we adopt the approximation that no wait occurs unless some resource is saturated on the average, then the actual value of $W$ will be the smallest to satisfy (9), and it can be estimated as follows from the current iteration values of $T_{i,1}$ and $T_{i,3}$:

Waiting time algorithm

(1) Assume $W = 0$.
(2) Test whether (9) is satisfied for all $k$. If it is, we are finished. Otherwise:
(3) Select the resource $k$, say $k^*$, for which (9) appears most violated. Find the value of $W$ which satisfies (9) for $k = k^*$ and with the inequality replaced by equality, i.e. solve for $W$ the equation

$$ \sum_i \frac{n_i T_{i,3} S_{ik^*}}{T_{i,1} + W + T_{i,3}} = S_{k^*}. $$

The Newton-Raphson method is suitable for this purpose. Let

$$ f(W) = \sum_i \frac{n_i T_{i,3} S_{ik^*}}{T_{i,1} + W + T_{i,3}} - S_{k^*}. $$

Then

$$ f'(W) = -\sum_i \frac{n_i T_{i,3} S_{ik^*}}{(T_{i,1} + W + T_{i,3})^2}. $$

Take $W^{(0)} = 0$, and compute $W^{(\nu+1)} = W^{(\nu)} - f(W^{(\nu)})/f'(W^{(\nu)})$ for $\nu = 0,1,2,\ldots$ until $|W^{(\nu+1)} - W^{(\nu)}| < \epsilon$. Then take $W = W^{(\nu+1)}$.
(4) Return to step 2.

Note: The $S_k$ used in this algorithm should be somewhat less than the amounts actually available, to allow for fragmentation.

It is easy to show that this method always produces a nonnegative value of $W$ satisfying (9) for any finite number of resource constraints. Because this method of computing the wait time is based on the approximation mentioned above, the model's accuracy may deteriorate in cases where the throughput is severely limited by the constraining resource. This question is further discussed in Section 2.5.

Perhaps the most frequently applied resource constraint is that of main storage space. In a paged system, main storage is divided into relatively small (e.g., 256 to 4096 bytes) units called page frames, and programs are split into corresponding pages. The address translation hardware makes it possible to assign any one of a program's pages to any main storage page frame. In this case, $S_k$ is the total number of main storage page frames available to programs, and $S_{ik}$ is the minimum number of page frames required, on the average, for efficiently running a class $i$ transaction. This quantity is sometimes loosely referred to as the transaction's working set.
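The Waiting time algorithm described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration rather than production code: the function name `waiting_time`, the tolerances, and the sample numbers are assumptions made for the example, and constraint violation is measured relative to $S_k$.

```python
def waiting_time(n, T1, T3, S, S_cap, eps=1e-10, tol=1e-8):
    """Smallest W >= 0 satisfying eq. (9) for every resource k.

    n[i], T1[i], T3[i] -- users, dormant time, active time of class i
    S[i][k]            -- class i demand for resource k
    S_cap[k]           -- amount of resource k available
    """
    classes, resources = range(len(n)), range(len(S_cap))

    def held(W, k):
        # Left-hand side of eq. (9): average amount of resource k held.
        return sum(n[i] * T3[i] * S[i][k] / (T1[i] + W + T3[i]) for i in classes)

    W = 0.0                                               # step 1
    for _ in resources:                                   # one pass per resource suffices
        # Step 2: relative violation of each constraint at the current W.
        excess = [held(W, k) / S_cap[k] - 1.0 for k in resources]
        k_star = max(resources, key=lambda k: excess[k])  # step 3: most violated
        if excess[k_star] <= tol:
            break
        w = 0.0                                           # Newton-Raphson from W(0) = 0
        for _ in range(100):
            f = held(w, k_star) - S_cap[k_star]
            df = -sum(n[i] * T3[i] * S[i][k_star] / (T1[i] + w + T3[i]) ** 2
                      for i in classes)
            w_next = w - f / df
            converged = abs(w_next - w) < eps
            w = w_next
            if converged:
                break
        W = w                                             # step 4: back to step 2
    return W


# Hypothetical case: 20 users, 10 s think time, 2 s active time, MPL limit of 2.
print(waiting_time([20], [10.0], [2.0], [[1.0]], [2.0]))
```

In this hypothetical case the constraint reads 40/(12 + W) = 2, so the wait converges to 8 seconds. Because the left-hand side of (9) is decreasing and convex in W, Newton-Raphson started at zero approaches the root from below without overshooting.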
Another popular resource constraint is the multiprogramming level (MPL), i.e., the number of transactions allowed in the active set at any one time. In many operating systems, the MPL is limited by the number of partitions, initiators, task monitors, etc. In this case, $S_k$ = maximum MPL, and $S_{ik} = 1$ for all $i$. Sometimes there are specific restrictions on the multiprogramming level within a given class, or set of classes, of transactions. Then $S_{ik} = 1$ for the restricted classes, and zero otherwise. See, for example, the discussion of subsystems in Section 3.7.

It is worth noting that the $S_{ik}$ used in the algorithm should be the same as the estimated resource requirements used by the operating system when making its scheduling decisions, and not the true amounts of resource used. The two values may be quite different.

Active state (j = 3): The active state has a rather complex structure. An active transaction receives service from the CPU and from various I/O devices, thus it may be in any one of many possible 'substates'. We have lumped all these substates into a single state in the overall system model, thereby decomposing the set of transaction states into system states (dormant, waiting, and active) and substates (CPU, I/O). Such a decomposition works well provided transitions between substates of a given state occur much more rapidly than transitions between states [7]. This is the case here, because CPU bursts and I/O accesses take a few milliseconds, whereas dormant and response times are of the order of seconds, or at worst hundreds of milliseconds. Since many substate transitions can occur while the MPL remains fixed, we shall analyze the active set on the assumption that the number of active transactions of each class remains fixed at its average value $N_{i,3}$.
We shall distinguish between two primary substates, which we shall number $j = 4,5$:

j = 4: transaction waiting for or receiving CPU service;
j = 5: transaction waiting for or receiving I/O service.

We shall assume that these substates are mutually exclusive, i.e., a transaction cannot overlap its CPU and I/O services. The notation $N_{ij}$, $T_{ij}$ will apply with $j = 4,5$ just as it did to the overall system states. From here on, the analysis proceeds exactly as that of the overall system model. Total time in the active set is given by:

$$ T_{i,3} = T_{i,4} + T_{i,5}. \qquad (10) $$

The equation analogous to (2) is:

$$ N_{i,3} = N_{i,4} + N_{i,5}. \qquad (11) $$

Little's formula results in an analogue to (5):

$$ N_{ij} = N_{i,3} T_{ij}/(T_{i,4} + T_{i,5}), \quad j = 4,5. \qquad (12) $$

And, finally, we shall need delay equations similar to (6):

$$ T_{ij} = f_{ij}(N, \lambda), \quad j = 4,5. \qquad (13) $$

Given any specific $N_{i,3}$ values, we can iterate between (12) and (13), just as we did between (5) and (6) in the General Algorithm. When the iterations converge, we end up with values of the $T_{i,4}$ and $T_{i,5}$, and hence of the $T_{i,3}$. But that, implicitly, amounts to evaluating the delay function of the active set, i.e., $T_{i,3} = f_{i,3}(N, \lambda)$. At this point, the model is complete, except for the delay functions (13) for the CPU and I/O substates. We proceed to derive these now.

CPU substate (j = 4): The delay suffered by a transaction that is seeking to receive CPU service depends largely on the CPU dispatching policy. Let us assume that the dispatcher uses a round-robin policy with a relatively short (compared to typical transaction CPU time) time slice. The net effect is as though all transactions currently in the CPU substate are processor-sharing the CPU. That is, if the CPU is capable of delivering 1,000,000 instructions per second, and there are 10 transactions in the CPU queue, then, on a macroscopic time scale, each transaction appears to be averaging 100,000 instructions per second. Then, if a transaction requires 2 seconds of CPU time, it would actually spend 20 seconds of elapsed time in the CPU queue. The processor-sharing delay formula can be written as:

$$ T_{i,4} = (1 + N^{(i)}) t_i, \qquad (14) $$

where $t_i$ is the average CPU time demanded by a class $i$ transaction (this is a required input), and $N^{(i)}$ is the average number of transactions, besides itself, that a class $i$ transaction finds in the CPU queue. Now, the unconditional average total number of transactions in the CPU queue is $N_4 = \sum_m N_{m,4}$, to which total a specific class $i$ active transaction contributes $N_{i,4}/N_{i,3}$. However, when that transaction arrives at the CPU queue, we know that it is not already present there, and its contribution must be missing from that total. Hence [6], approximately $N^{(i)} = N_4 - N_{i,4}/N_{i,3}$, so that (14) becomes

$$ T_{i,4} = \max\{1,\; 1 + N_4 - N_{i,4}/N_{i,3}\}\, t_i, \qquad (15) $$
and this is the CPU delay function. The maximum in the above expression is needed to prevent the absurd result $T_{i,4} < t_i$ when $N_{i,3} < 1$.

I/O substate (j = 5): At this point, we must abandon the pretense that a system can be modeled simply. The I/O subsystem, with its channels, control units, strings, devices, seeks, searches, and alternate paths, is too complex to yield to the simple kind of mean value analysis that has been employed so far. Fortunately, it is possible to construct specialized models which take into account many of the intricacies of the I/O subsystem [3,8,9]. For our purposes here it suffices to characterize such a model by listing its inputs and outputs. Suppose there are several I/O devices. Let:

$r_p$ = device $p$ access rate, i.e., number of I/O requests issued to the $p$-th device per second;
$n_{ip}$ = average number of I/O requests for the $p$-th device generated by a class $i$ transaction;
$s_p$ = average device $p$ service time, i.e., time elapsed from when the device becomes free from the previous request (or the arrival time of the current request, if the device was already free at that time), until completion of the current request. This includes all delays in the channel, control unit, etc.; in fact everything except queuing time for the device itself;
$\sigma_p$ = standard deviation of device $p$ service times;
$\theta_p$ = device $p$ access response time, i.e., time elapsed from when the I/O request was issued until its completion. It is the sum of device service time and device queuing time.
The device access rates can be computed by means of the formula:

$$ r_p = \sum_i \lambda_i n_{ip}, \quad \text{all } p. \qquad (16) $$

The $r_p$ are the inputs required by the I/O subsystem model. The model must also know about the configuration and characteristics of the devices and the files residing on them: things like average rotation, seek, and data transfer times. But these items remain constant for a given problem; they do not change from iteration to iteration. The $s_p$ and $\sigma_p$ are the outputs generated by the I/O subsystem model. There are two ways in which this information can be used in the active state model:

(1) The open model. The device access response times can be estimated using standard queuing theory formulas, based on the assumption that requests arrive from a potentially infinite population at the given rates $r_p$. In this case, the well known Pollaczek-Khinchine formula yields:

$$ \theta_p = s_p + r_p (s_p^2 + \sigma_p^2)/[2(1 - r_p s_p)]. \qquad (17) $$

Once the $\theta_p$ have been estimated, the total I/O delay suffered by a class $i$ transaction can be simply evaluated:

$$ T_{i,5} = \sum_p n_{ip} \theta_p. \qquad (18) $$
(2) The closed model. One may object to eq. (17) on the ground that it assumes a potentially infinite population, whereas in fact there can be at most $N_3 = \sum_i N_{i,3}$ transactions in any queue. It might, therefore, be more accurate to treat each device as a separate 'substate' in which the average delay per visit is:

$$ \theta_{ip} = (1 + N_p^{(i)}) s_p, $$

where $N_p^{(i)}$, the average number of transactions waiting for device $p$ as seen by a class $i$ transaction newly arrived at the queue, can be estimated in analogy to the CPU queue, as $\sum_m N_{m,p} - N_{i,p}/N_{i,3}$. The total delay suffered by a class $i$ transaction at device $p$ is then:

$$ T_{i,p} = n_{ip} \theta_{ip}. $$
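For the open model of eqs. (17) and (18), the computation is short enough to sketch directly. The Python fragment below is illustrative only: the function names `open_model_theta` and `io_delay` and the sample device parameters are hypothetical, and the service time statistics are assumed to have been supplied already by an I/O subsystem model.

```python
def open_model_theta(r, s, sigma):
    """Eq. (17): response time per access for each device p.

    r[p]     -- access rate (requests per second)
    s[p]     -- mean service time (seconds)
    sigma[p] -- standard deviation of service time (seconds)
    """
    theta = []
    for r_p, s_p, sg_p in zip(r, s, sigma):
        rho = r_p * s_p                       # device utilization
        if rho >= 1.0:
            raise ValueError("device saturated: r_p * s_p >= 1")
        theta.append(s_p + r_p * (s_p ** 2 + sg_p ** 2) / (2.0 * (1.0 - rho)))
    return theta


def io_delay(n_ip, theta):
    """Eq. (18): total I/O delay per transaction of each class i."""
    return [sum(n * th for n, th in zip(row, theta)) for row in n_ip]


# Hypothetical device: 20 accesses/s, 25 ms mean service time, and a
# standard deviation equal to the mean (as for exponential service).
theta = open_model_theta([20.0], [0.025], [0.025])
print(theta)                      # response time per access, in seconds
print(io_delay([[10.0]], theta))  # a class issuing 10 accesses per transaction
```

With the standard deviation equal to the mean, eq. (17) reduces to the familiar $s_p/(1 - r_p s_p)$, which gives 50 ms per access at 50% utilization in this example.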
The computational difference between the two approaches is that the open model gives a single delay equation, (18), for the entire I/O subsystem, so that the active state has only two substates (CPU and I/O). The closed model gives a delay equation for each device, resulting in $P + 1$ substates, where $P$ is the number of devices (or, rather, of separately maintained I/O queues). Since $P$ can be quite large, the open model has a significant computational advantage. Since, in our experience, it also produces quite accurate results for many real systems, it is the approach which will be adopted here.

We are now ready to flesh out the general algorithm by specifying the detailed evaluation of the delay functions. The result is what we call the Basic Algorithm, which suffices for the realistic modeling of many systems. Its capabilities will, however, be augmented in subsequent sections.
Basic algorithm

A. Required inputs
(1) System configuration
CPU speed (see Note 1).
$S_k$: available amount of $k$-th resource (main storage space, maximum MPL, tape drives, ...).
I/O subsystem description (see Note 2).
(2) Workload characterization
$n_i$: number of class $i$ users;
$T_{i,1}$: average class $i$ user dormant ('think') time;
$S_{ik}$: average class $i$ user demand for $k$-th resource;
$t_i$: average class $i$ transaction CPU time (see Note 1);
$n_{ip}$: average number of class $i$ transaction accesses to device $p$.

Note 1. The CPU speed enters only indirectly in scaling the $t_i$ to the particular CPU being modeled.
Note 2. The I/O subsystem description typically requires the following information: device rotation times and data rates; path topology from CPU to devices; placement of files on devices; file extents, block lengths, access patterns (e.g., random or sequential).

B. Algorithm
(1) Initialization. Assign initial guess values to the $T_{i,3}$. A reasonable value might be $t_i + \sum_p n_{ip} \theta^*_p$, where $\theta^*_p$ is the data transfer time per access for device $p$, which may be calculated from the record length and device data transfer rate.
(2) Overall iteration
(2.1) Initialization. Set $W = 0$.
(2.2) Wait time iteration
(2.2.1) Test whether, for all resources $k$,

$$ \sum_i \frac{n_i T_{i,3} S_{ik}}{T_{i,1} + W + T_{i,3}} \le S_k. $$

If so, proceed to step 2.3; otherwise:
(2.2.2) Select the value of $k$, say $k^*$, for which the above condition is most violated (relative to the magnitude of $S_k$). Find $W$ such that the condition is satisfied with equality for $k = k^*$.
(2.2.3) Return to step 2.2.1.
(2.3) Convergence test. Compute $T_i = W + T_{i,3}$. If this is not the first iteration, compare these values to the previous iteration $T_i$ values. If they agree to within, say, 1%, terminate. Otherwise,
(2.4) Compute

$$ \lambda_i = n_i/(T_i + T_{i,1}), \qquad N_{i,3} = \lambda_i T_{i,3}, \qquad r_p = \sum_i \lambda_i n_{ip}. $$

(2.5) I/O subsystem model. Invoke the I/O subsystem model with the $r_p$ as inputs, and apply eq. (17) to obtain the $\theta_p$.
(2.6) Compute

$$ T_{i,5} = \sum_p n_{ip} \theta_p. $$

(2.7) Active set iteration
(2.7.1) Initialization. Assign initial guess values to the $T_{i,4}$. In the first overall iteration, take $T_{i,4} = t_i$. Subsequently, use the current value.
(2.7.2) Compute

$$ N_{i,4} = N_{i,3} T_{i,4}/(T_{i,4} + T_{i,5}), \qquad N_4 = \sum_i N_{i,4}. $$

(2.7.3) CPU delay. Compute

$$ T_{i,4} = \max\{1,\; 1 + N_4 - N_{i,4}/N_{i,3}\}\, t_i. $$

(2.7.4) Convergence test. If the new $T_{i,4}$ values differ from the previous iteration ones by more than, say, $10^{-6}$, return to step 2.7.2. Otherwise,
(2.7.5) Compute $T_{i,3} = T_{i,4} + T_{i,5}$.
(2.8) Return to step 2.1.
C. Outputs
The following useful output values are produced directly by the algorithm:
$T_i$: average $i$-th class transaction response time,
$\lambda_i$: $i$-th class transaction throughput,
$T_{ij}$: average time spent by class $i$ transaction in state $j$ ($j = 2,3,4,5$).
In addition, one calculates:
$u_i = \lambda_i t_i/n_i$: CPU utilization per class $i$ user;
$U = \sum_i n_i u_i = \sum_i \lambda_i t_i$: total CPU utilization;
$U_k = (\sum_i N_{i,3} S_{ik})/S_k$: $k$-th resource utilization;
$N_j = \sum_i N_{ij}$: average state $j$ queue length.
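The Basic Algorithm (steps 1 and 2.1 to 2.8) can be sketched compactly in Python. The sketch makes several simplifying assumptions that are not part of the paper: the I/O subsystem model is replaced by fixed service time statistics `s[p]` and `sigma[p]`; the initial guess uses `s[p]` in place of the data transfer time; and the wait time is found by solving each saturated constraint separately and taking the largest root, which yields the same smallest feasible W as selecting the most violated resource. All function names and sample values are hypothetical.

```python
def wait_time(n, T1, T3, S, S_cap, eps=1e-10):
    """Smallest W >= 0 satisfying eq. (9) for every resource k."""
    I, W = len(n), 0.0
    for k in range(len(S_cap)):
        a = [n[i] * T3[i] * S[i][k] for i in range(I)]

        def f(w):
            return sum(a[i] / (T1[i] + w + T3[i]) for i in range(I)) - S_cap[k]

        if f(0.0) <= 0.0:
            continue                          # resource k is not saturated
        w = 0.0
        for _ in range(100):                  # Newton-Raphson started from w = 0
            df = -sum(a[i] / (T1[i] + w + T3[i]) ** 2 for i in range(I))
            step = f(w) / df
            w -= step
            if abs(step) < eps:
                break
        W = max(W, w)
    return W


def basic_model(n, T1, t, S, S_cap, n_ip, s, sigma, tol=0.01, max_iter=100):
    I, P = len(n), len(s)
    # Step 1: initial guess for T_i3 (assumption: s[p] stands in for transfer time).
    T3 = [t[i] + sum(n_ip[i][p] * s[p] for p in range(P)) for i in range(I)]
    T4, T_prev = list(t), None
    for _ in range(max_iter):
        W = wait_time(n, T1, T3, S, S_cap)                       # steps 2.1-2.2
        T = [W + T3[i] for i in range(I)]                        # step 2.3
        if T_prev is not None and all(
                abs(T[i] - T_prev[i]) <= tol * T_prev[i] for i in range(I)):
            break
        T_prev = T
        lam = [n[i] / (T[i] + T1[i]) for i in range(I)]          # step 2.4
        N3 = [lam[i] * T3[i] for i in range(I)]
        r = [sum(lam[i] * n_ip[i][p] for i in range(I)) for p in range(P)]
        if any(r[p] * s[p] >= 1.0 for p in range(P)):
            raise ValueError("an I/O device is saturated")
        theta = [s[p] + r[p] * (s[p] ** 2 + sigma[p] ** 2)       # step 2.5, eq. (17)
                 / (2.0 * (1.0 - r[p] * s[p])) for p in range(P)]
        T5 = [sum(n_ip[i][p] * theta[p] for p in range(P)) for i in range(I)]  # 2.6
        for _ in range(10000):                                   # step 2.7
            N4 = [N3[i] * T4[i] / (T4[i] + T5[i]) for i in range(I)]
            N4_tot = sum(N4)
            new_T4 = [max(1.0, 1.0 + N4_tot - N4[i] / N3[i]) * t[i] for i in range(I)]
            done = all(abs(x - y) <= 1e-6 for x, y in zip(new_T4, T4))
            T4 = new_T4
            if done:
                break
        T3 = [T4[i] + T5[i] for i in range(I)]                   # step 2.7.5
    lam = [n[i] / (T[i] + T1[i]) for i in range(I)]
    return {"T": T, "W": W, "lam": lam,
            "U": sum(lam[i] * t[i] for i in range(I))}


# Hypothetical lightly loaded system: 10 users, 10 s think time, 0.1 s CPU,
# 5 accesses per transaction to one device with 20 ms service time, MPL limit 5.
print(basic_model(n=[10], T1=[10.0], t=[0.1], S=[[1.0]], S_cap=[5.0],
                  n_ip=[[5.0]], s=[0.02], sigma=[0.02]))
```

In this lightly loaded example no resource saturates, so the wait time stays at zero and the response time is just over 0.2 seconds. Lowering `S_cap` far enough forces a positive W, which is how a bottlenecked resource shows up in the outputs.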
2.2. Interpretation of outputs

Perhaps the most useful output items are the $T_{ij}$, since they tell us where the transaction actually spends its time: If $W$ is large, the $k^*$-th resource (see step 2.2.2 of the Basic Algorithm) is a bottleneck. On the other hand, if $T_{i,4}$ or $T_{i,5}$ is large, then the CPU or I/O, respectively, is the bottleneck. In the latter two cases, we may wish to determine how much of the total time in the substate was spent waiting and how much actually receiving service. In the case of the CPU, $t_i$ is the service time and $T_{i,4} - t_i$ is the wait time. If service time is large, a faster CPU is needed. If wait time is large, we could do with the same CPU but with fewer users. In the case of I/O, $s_p$ is the service time and $\theta_p - s_p$ is the wait time for device $p$.

The $N_{ij}$ values produced by the model are average queue lengths. Because of Little's formula, average queue lengths and response times are equivalent performance measures, provided the throughputs are known.

2.3. Some benefits of iteration

The algorithm we have suggested proceeds in iterative fashion. One benefit to be derived from that is that any modeled quantity, whether input, intermediate, or output, can be made to depend on any other quantity. All we need to do is reevaluate
the relationships in the course of each iteration. This approach is most valuable when the inputs themselves depend on the outputs. Here are a few cases in point:

(1) Think time. It has been observed [10] that user think time may be affected by system response time. The theory is that if the system takes longer to respond, the user becomes sluggish, and requires longer to enter the next transaction. If the effect is linear, we could write down the equation $T_{i,1} = a + b T_i$, and if $a$ and $b$ are known, we can recompute $T_{i,1}$ in each iteration after step 2.3.

(2) Paging rate. Many systems manage their main storage space by 'paging in' portions of users' programs and data as they are required, and 'paging out' no-longer required portions. Each paging operation requires an access to a 'paging device'. For modeling purposes, paging appears as accesses to certain I/O devices. However, the $n_{ip}$ for these accesses are not determinable in advance, since the number of paging operations required by a transaction is a function of the environment in which it is run. In particular, the $n_{ip}$ may depend on the MPL, storage utilization, total demand for storage, etc. Some models exist for estimating the $n_{ip}$ as functions of these quantities [11]. The latter are not known a priori, but they have known values within each iteration, so the $n_{ip}$ can also be recomputed within each iteration.

There are many more examples of what can be done. We shall discuss some of these later under 'extensions'. Note, in particular, the subsystem model (Section 3.7), where request rates for some transaction classes depend on response times for other classes.

2.4. Convergence

The question of convergence of the algorithms that we have described is not central to the validity of our modeling approach. The latter hinges on solving the set of simultaneous eqs. (4), (5), and (6) for the unknown $\lambda_i$, $N_{ij}$, and $T_{ij}$. Any method that succeeds in solving these equations will do. Our algorithm, however, is so simple and straightforward that one would like to use it if at all possible.

At this point we should stress that as a practical matter, convergence problems appear to be a minor nuisance at worst. That is, the algorithm has always converged for any reasonably performing system. Failure to converge within thirty iterations is a sure sign that some system resource is heavily overloaded. One easily identifies the troublesome component by processing the unconverged model's outputs in the usual way (see Section 2.2). It would, however, be nice if the algorithm could be fixed up to work even in the more recalcitrant cases. This can, indeed, be done, but we leave the details to an Appendix.

2.5. Accuracy

The accuracy of a model may be tested in two distinct ways: descriptive and predictive. In the first one, we measure simultaneously a system's performance and workload, feed the latter into the model, and compare the model's outputs to the measured performance values. This is the manner in which models are typically 'validated'. In the second one we attempt to predict tomorrow's or next year's performance. While this is clearly the more interesting case, the accuracy of the model's predictions here is crucially dependent on the accuracy of the workload predictions, a question beyond the scope of this paper.

A typical case is depicted in Table 1. A workload measured in May 1979 was used to predict performance for June 1980. In the meantime, the IBM System/370 Model 158 CPU was replaced by a Model 3031 CPU, main storage size was increased from 2 to 4 megabytes, and the number of users almost doubled. The model significantly underestimated the response times. It was noticed that the average nontrivial transaction think time had been reduced by almost one half at the latter date (this was a university. Perhaps the students had learned something in the interim?) When the new think time was fed to the model, the results became quite acceptable. Note that think time is perhaps the most important workload parameter, since it strongly influences the transaction rate (see below).
Table 1
Model predictions for a real university workload

                     Average response time (sec)
                     5/25/79 (a)           6/4/80 (b)
Transaction class    Measured   Model      Measured   Predicted (c)   Predicted (d)
Trivial                0.39      0.38        0.75        0.55            0.65
Nontrivial            15.95     16.27       31.07       24.85           34.96
Batch                 44.26     38.39       77.03       47.77           56.58

Notes:
(a) CPU Model 158, 2 megabytes main storage, 68 users.
(b) CPU Model 3031, 4 megabytes main storage, 124 users.
(c) Nontrivial think time = 23.4 sec (as measured 5/25/79).
(d) Nontrivial think time = 12.7 sec (as measured 6/4/80).

There is yet a third possibility: a reproducible benchmark job stream is constructed, run, and measured on a given system. It is then both run on, and modeled for, other systems of varying configurations. The two sets of results are then mutually compared. The few studies of this kind that we have seen show nearly as good an agreement as in the simple descriptive validations discussed below. Table 2 shows results of predicting performance under a varying number of users and main storage sizes [2]. Other examples of predicted benchmark results are found in Tables 4 and 7.

Table 2
Basic model validation for VM/370 system with benchmark workloads [2]

        Main storage     No. of    Average response (a) (sec)
Run     size (K bytes)   users     Measured                        Predicted
1         512            20        0.33    32.0   36.7   44.0      0.16   5?.5   46.0   45.8
2        1024            20        0.31    15.0   15.5   ?6.3      0.21    ?     13.7   12.3
3 (b)    1024            40        0.3?    49.1    ?     45.8      0.25   49.8   46.0   43.8
4        1024            60        0.41   127.7   75.1   84.6      0.26    ?     83.7   84.6
5        2048            40        0.25    44.8   32.6   30.0      0.31   34.0   38.2   33.0
6        2048            60        0.34    78.2   61.7   57.8      0.44   76.5   60.8   55.1

Notes:
(a) Response times are given for four different transaction classes in each run.
(b) Measurements from run 3 were used to derive the workload characterization used in all the model runs. The poor predictions for run 1 are due mostly to the difficulty in estimating the paging rate for this system, which had rather inadequate storage for the given workload.

In spite of its simplicity, the basic model is capable of achieving results of good accuracy in reproducing the performance of actual systems. Table 3 is a sampling of what can be achieved routinely [2], provided that the systems' existing workloads can be measured and characterized accurately. As a general rule, utilizations and throughputs come out within 5% of the measured values, and response times within 20%. Such results are more than adequate for most purposes.

There is a very simple explanation for the accuracy of most utilization and throughput predictions. This derives from the fact that response time is often short relative to dormant time. Eq. (3) can be rewritten as

$$ \lambda_i = (n_i/T_{i,1})\,[1/(1 + T_i/T_{i,1})]. $$

If, as is often the case, $T_i/T_{i,1} < 0.1$, then even a 100% error in estimating $T_i$ results in, at most, a 9% error in the throughputs $\lambda_i$ and CPU utilization $\sum_i \lambda_i t_i$. Because of this, we do not bother to report throughputs and utilizations in the validation results reported here.

On the other hand, the accuracy of the response time predictions remains somewhat of a mystery, particularly since some of the approximations in the model may appear to be a bit farfetched. For instance, if we look closely at the manner in which the wait state was analyzed, we see that we have assumed that there is no wait at all unless the utilization of the bottlenecked resource ($k = k^*$) is 100%. In reality, even if the
Y. Bard / A simple approach to system modeling
Table 3
Validation of basic model on VM/370 systems with real workloads [2]
System/370 CPU model
Average no. of users
Average response a (sec)
Measured 0.66 19.04 0.26 3.98 0.50 26.75 0.05 !.19 0.08 2.76 11.21 21.32 0.14 5. I I
Computed
135
4
0.83 21.39 O. i 8 3.66 0.50 30.99 0.05 !.15 0.06 3.71 0.17 17.75 0. I I 3.88
145
S
145
! :i
155-II
21)
155- I 1
2?
158
3"?
15~
4(,
158
24
0.07
0.06
! 61':
72
12.76 0.02 168.70 0. ! 3
11.13 0.02 123.70 0.14
16~
I 17
7.61
7.38
0.46 7.94 0.48 13.97 0.55 19.25 0.58 27.21
0.35 8.59 0.42 13.09 0.54 17.82 0.?5 25.65
a Response times are given for several transaction classes within each system.
resource is not 100% utilized on the average, it will occasionally acquire waiting queues. Fortunately, in relatively congested systems, it does not make much difference to the total predicted response time whether the model assumes users to be in a formal 'wait state', or whether it lets them join the CPU or I/O queues and wait there, provided that the bottlenecked resource does not severely constrain the system's throughput. If it does, the model will recognize this to be the case, even though numerically the predicted response times may be quite inaccurate. Normally, the model's estimated waiting time may be too low, but the total response time remains fairly accurate.

The model deals in average values, and is likely to go wrong if these averages are not representative of what actually happens in the system. Note that nonlinearity is the enemy of averages! Consider, for instance, a system with a homogeneous workload (single transaction class) in which the number of users is the only quantity which varies significantly over time. Fig. 2 depicts the way that response time might depend on the number of users. In this example, the system saturates with approximately 30 users. The performance curve is reasonably linear in the unsaturated region (0-25 users) and in the fully saturated region (35 users or more), and not much harm would be done if the user population fluctuated within either one of these regions. Observe, however, what happens if, in the course of a monitored period, there were 20 users during one half of the run, and 40 during the other half. If the model is run with the observed average population of 30, it will (if it is an accurate model) predict a response time of 2 seconds, compared to the measured average of 3.5 seconds!

Fig. 2. Effect of nonlinearity (response time versus number of users).

When a workload is broken up into many user or transaction classes, it is likely that the bulk of the transactions of one class or another will have occurred during some short period, not typical of the entire monitored period. It should not be surprising, therefore, that among the estimated response times for many different transaction classes, a few may be in substantial error.

Many models require 'calibration', which consists of adjusting built-in or input parameters until accuracy is achieved. Our model contains essentially no built-in parameters, except for some hardware characteristics (particularly relating to I/O devices), which should be known with good precision. Therefore, the model should require no calibration, provided the inputs are known accurately. In fact, calibration would then amount to cheating. However, if some of the inputs are not known accurately, then adjusting their values to improve the accuracy of the model's outputs is quite legitimate. This is particularly true if the model's outputs to be matched against measurements are not the transaction response times, but other quantities more directly related to the missing inputs. For example, channel utilizations can be matched by adjusting unmeasured record lengths, and then device utilizations by adjusting unmeasured seek times.

Sometimes parameters are used to derive model inputs from other quantities. For instance, if $t_i$ (transaction CPU time) has not been measured directly, it can be estimated as

$$t_i = \sum_b c_{ib}\tau_b,$$

where $c_{ib}$ is the number of calls to routine b made by the transaction, and $\tau_b$ is the CPU time ('path length') for that routine. If the $c_{ib}$ are measured directly, and the $\tau_b$ are built-in model parameters, then the latter may be subject to calibration, with CPU utilization the quantity to be matched.

A possible source of model inaccuracy is an improper or insufficient breakdown of the workload into user classes. Since many queuing formulas assume exponential distribution of service times, one might suggest breaking up the entire workload into classes such that the coefficient of variation of CPU and I/O demands ($t_i$, $n_{ip}$ for heavily used devices) of the transactions within each class be approximately one.
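
The warning that nonlinearity is the enemy of averages can be checked numerically. The sketch below is only an illustration: the piecewise-linear curve `response_time` is a hypothetical stand-in for the curve of Fig. 2, with slopes chosen so that it reproduces the two figures quoted in the text (2 seconds at 30 users, and a 3.5 second average when the population alternates between 20 and 40 users).

```python
# Hypothetical piecewise-linear response curve (an assumption, not the
# actual curve of Fig. 2): unsaturated up to 30 users, saturated above.
def response_time(users: float) -> float:
    """Response time in seconds for a given user population."""
    if users <= 30:
        return 0.5 + 0.05 * users        # gentle slope, unsaturated region
    return 2.0 + 0.35 * (users - 30)     # steep slope, saturated region

# Populations observed during the two halves of the monitored period.
populations = [20, 40]

avg_population = sum(populations) / len(populations)
# What an accurate model predicts when run at the average population.
predicted = response_time(avg_population)
# What is actually measured: the average of the two responses.
measured = sum(response_time(n) for n in populations) / len(populations)

print(f"average population: {avg_population:.0f}")
print(f"response at the average population: {predicted:.2f} s")
print(f"average of the actual responses:    {measured:.2f} s")
```

The gap between the two printed response times arises purely from the bend in the curve; with a population confined to either linear region the two quantities would coincide.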
3. Extensions

We shall extend the range of systems to which the model applies by: (1) relaxing some of the underlying assumptions, (2) adding entirely new features. The following list of possible extensions should not be considered exhaustive.

3.1. Open system

By assuming that there is a fixed number of users of each class, we have created what is known as a closed system, i.e., one which no users leave or enter. In contrast to that is an open system, in which there is a potentially infinite population of users generating transactions. One important difference between the two is that the closed system is stabilized by feedback, i.e., if the system response becomes slow, then the transaction arrival rate decreases (see eq. (3)). Queue lengths are bounded by the population size. In an open system, on the other hand, transactions keep arriving at the same rate no matter how slow the response. Hence, potentially infinite queues can build up.

In reality, a closed model is a good representation of a time-sharing system used for program development or problem solving. Here, there is usually a relatively stable user population, with individual sessions often lasting an hour or more. An open model, on the other hand, may well represent a data base transaction system such as used by bank tellers or airline ticket agents, where the customer arrival rate dictates the system's throughput.

Mathematically, an open system is characterized by specifying the $\lambda_i$ (class i transaction arrival rate) in place of $n_i$. To model such a system, substitute $\lambda_i(T_{i,1} + T_{i,2} + T_{i,3})$ whenever $n_i$ is required in the algorithm (see eq. (3)). Many equations are thereby simplified. For instance, eq. (9) becomes

$$\sum_i \lambda_i T_{i,3} S_{ik} \le S_k. \tag{19}$$

Notice that W has disappeared: what this means is that the specified throughputs $\lambda_i$ can be sustained only if (19) is satisfied for all k; otherwise, a potentially infinite waiting queue will develop. Similarly, since CPU utilization is given by $\sum_i \lambda_i t_i$, the specified throughputs can only be maintained if $\sum_i \lambda_i t_i \le 1$. Contrast this to the closed system case, where response times remain finite no matter how large the user population. The I/O subsystem model also breaks down if the utilization of any I/O component (channel, device, etc.) reaches or exceeds 100% under the offered workload.

Mixed systems, in which some user classes are closed (fixed $n_i$), and some are open (fixed $\lambda_i$), can also be handled. Table 4 shows results of applying the model to a system with both open and closed transaction classes. A benchmark run was made with fifteen users in the closed classes, and a fixed rate of transaction arrival for the open class. In addition to validating the model for this workload, a projection was made to twenty-one users in the closed classes, and this was compared to the results of an actual run made with the latter workload.

Table 4
Model validation for system with both open and closed transaction classes

Transaction   Average response time (sec) with:
class         15 users                21 users
              Measured   Computed     Measured   Predicted
Closed 1      1.03       0.91         1.87       2.21
Closed 2      ?0.??      20.26        35.23      37.73
Open          0.42       0.37         0.54       0.52

In a completely open system, step 2.2 of the Basic Algorithm is omitted. The first formula in step 2.4 is superfluous since $\lambda_i$ is known. Otherwise the algorithm remains unmodified. Unfortunately, computation of the waiting time is rather difficult. One approach would be to regard the active set as a variable rate queue dependent server, and apply known queuing-theoretic methods for its solution.

3.2. Inhomogeneous user classes

We have assumed that each user class generates only one kind of transaction. Instead, let us assume that there is a certain number of transaction classes indexed by i, and a certain number of user classes indexed by q. There are $m_q$ users in class q, and a fraction $f_{qi}$ of the transactions they generate is of class i, preceded by average dormant time $D_{qi}$. The total class i transaction cycle time for a class q user is $D_{qi} + T_i$, where $T_i$ is, as before, the class i transaction average response time. Hence, the fraction of class q users engaged in entering or waiting for class i transactions is $f_{qi}(D_{qi} + T_i) / \sum_r f_{qr}(D_{qr} + T_r)$. From the system's point of view, the average number of users in 'class i transaction mode' is

$$n_i = \sum_q m_q f_{qi}(D_{qi} + T_i) \Big/ \sum_r f_{qr}(D_{qr} + T_r). \tag{20}$$

The rate at which class q users are generating class i transactions is

$$\lambda_{qi} = m_q f_{qi} \Big/ \sum_r f_{qr}(D_{qr} + T_r), \tag{21}$$

and the overall average dormant time for a class i transaction is:

$$T_{i,1} = \sum_q \lambda_{qi} D_{qi} \Big/ \sum_q \lambda_{qi}. \tag{22}$$

Once the $n_i$ and $T_{i,1}$ have been computed, the problem has been converted into the previous form. This conversion has to be redone at each iteration, since the response times $T_i$ are required. This, however, presents no problems in practice.
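
The conversion from user classes to transaction classes can be sketched in a few lines. The function below is an illustration under a stated assumption: that class q users generate class i transactions at the rate $\lambda_{qi} = m_q f_{qi} / \sum_r f_{qr}(D_{qr} + T_r)$, from which the population $n_i$ and dormant time $T_{i,1}$ follow. The name `class_populations` and all input values are hypothetical.

```python
def class_populations(m, f, D, T):
    """Convert user classes q into per-transaction-class n_i and T_{i,1}.

    m[q]    : number of users in user class q
    f[q][i] : fraction of class q transactions that are of class i
    D[q][i] : average dormant time preceding a class i transaction (class q user)
    T[i]    : current estimate of the class i response time
    """
    classes = range(len(T))
    n = [0.0] * len(T)           # users in "class i transaction mode"
    rate = [0.0] * len(T)        # sum over q of lambda_qi
    weighted_D = [0.0] * len(T)  # sum over q of lambda_qi * D_qi
    for q, users in enumerate(m):
        # Average cycle time of a class q user over all transaction classes.
        cycle = sum(f[q][r] * (D[q][r] + T[r]) for r in classes)
        for i in classes:
            lam_qi = users * f[q][i] / cycle   # assumed form of the rate
            n[i] += lam_qi * (D[q][i] + T[i])
            rate[i] += lam_qi
            weighted_D[i] += lam_qi * D[q][i]
    # Rate-weighted average dormant time per transaction class.
    dormant = [weighted_D[i] / rate[i] if rate[i] > 0 else 0.0 for i in classes]
    return n, dormant

# Hypothetical workload: two user classes, two transaction classes.
m = [10, 5]
f = [[0.8, 0.2], [0.5, 0.5]]
D = [[10.0, 30.0], [20.0, 20.0]]
T = [1.0, 5.0]
n, dormant = class_populations(m, f, D, T)
print(n, dormant)
```

A useful sanity check on any such conversion is that the $n_i$ sum to the total number of users, since every user is in exactly one transaction mode at a time.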

3.3. Transmission delays

So far the 'response time' has included only delays incurred at the host computer. In reality, the response time observed by the end-user includes also transmission delays between the terminal and the host. Quite sophisticated models have been constructed to account for these delays, which depend strongly on the topology of the transmission network, the speed of the transmission lines, the load on the network, and the protocols used. We shall content ourselves with describing a simple extension to the basic model.

Assume that the dominant source of transmission delays is queuing in a transmission control unit (TCU). Each transaction incurs two delays: one on input to the system, and one on output to the terminal. The TCU requires $\tau_i$ seconds to service a class i transaction input or output message. Transactions are served on a first-come-first-served (FCFS) basis. Let $T_{i,6}$ and $N_{i,6}$ be, respectively, the average time spent by a class i transaction at the TCU, and the average number of such transactions in the TCU queue. On arriving, a class i transaction has to wait $\sum_m N^{(i)}_{m,6}\tau_m$ seconds for all the transactions ahead of it to be served, and an additional $\tau_i$ seconds for its own service. Here, $N^{(i)}_{m,6}$ denotes the average number of class m transactions a class i transaction finds upon arrival at the TCU. In analogy to (15), one derives the following approximation for the total delay suffered by a class i transaction in its two trips through the TCU:

$$T_{i,6} = 2\Bigl[\sum_m N_{m,6}\tau_m + (1 - N_{i,6}/n_i)\tau_i\Bigr]. \tag{23}$$

This approximation works well provided the variation between individual message service times within each class is not too great, i.e., the coefficient of variation is no greater than 1.5. To include the TCU as an additional state in the system, insert additional terms $T_{i,6}$ and $N_{i,6}$ in the right hand sides of (1) and (2) respectively, and amend all other formulas (e.g., (3), (5), and (9)) accordingly. Eq. (23) serves as the delay equation (6) for state j = 6. Delays which do not depend on traffic intensity, e.g., propagation time, must be added as constants to the response time. When transmission delays are modeled explicitly, they are no longer included in the dormant times.
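
As a small illustration of the TCU delay equation, the sketch below evaluates $T_{i,6} = 2[\sum_m N_{m,6}\tau_m + (1 - N_{i,6}/n_i)\tau_i]$ for given queue populations. The function name `tcu_delay` and the numerical inputs are hypothetical; in the full model the $N_{m,6}$ would themselves be updated at each iteration rather than supplied as constants.

```python
def tcu_delay(i, N6, tau, n):
    """Total TCU delay for class i over its two trips (input and output).

    N6[m]  : average number of class m transactions in the TCU queue
    tau[m] : TCU service time per class m message, in seconds
    n[m]   : number of class m users
    """
    # Work queued ahead, as seen on average by all classes.
    queued_work = sum(N6[m] * tau[m] for m in range(len(tau)))
    # Own service, corrected because a transaction does not queue behind itself.
    own_service = (1.0 - N6[i] / n[i]) * tau[i]
    return 2.0 * (queued_work + own_service)

# Hypothetical two-class example.
N6 = [0.5, 0.2]
tau = [0.1, 0.3]
n = [20, 10]
print(tcu_delay(0, N6, tau, n))
print(tcu_delay(1, N6, tau, n))
```

With an empty TCU queue the function returns twice the class's own service time, which is the minimum delay for one input and one output message.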

3.4. Tightly coupled multiprocessors

When several instruction processing units ('processors' for short) share main storage and are controlled by a common operating system, they are said to be tightly coupled. If there are K such processors of equal power and fully interchangeable capabilities, then they can be modeled very easily. Such a configuration is referred to as a K-fold homogeneous multiprocessor.

To model such a configuration, let us reformulate the CPU processor-sharing model in terms of processing rates. We say that the processing rate $R_i$ of a class i transaction is the number of CPU seconds it consumes per second of elapsed time while it is in the CPU queue, i.e.,

$$R_i = t_i/T_{i,4}. \tag{25}$$

A system's processing capacity is the maximum rate at which it can dispense CPU time per unit of elapsed time. The processing capacity of a uniprocessor CPU is unity, and, neglecting for the moment contention and overhead, the processing capacity of a homogeneous K-fold multiprocessor is K. When all users in the CPU queue share equally in the available processing power, we find by combining (15) and (25) that

$$R_i = t_i/T_{i,4} = \min\Bigl[1,\; 1 \Big/ \Bigl(1 - N_{i,4}/N_{i,3} + \sum_m N_{m,4}\Bigr)\Bigr].$$

This formula takes into account the fact that members of different user classes see, on the average, slightly different CPU queue populations. When the total processing capacity is increased K-fold, so, potentially, is each user's processing rate. However, assuming that individual users' programs are not written to execute in parallel on multiple processors, a user cannot have a processing rate exceeding unity (which would make elapsed time less than CPU time). Hence we are led to:

$$R_i = \min\Bigl[1,\; K \Big/ \Bigl(1 - N_{i,4}/N_{i,3} + \sum_m N_{m,4}\Bigr)\Bigr]. \tag{26}$$

The delay equation for the tightly coupled multiprocessor case then takes the form

$$T_{i,4} = t_i/R_i, \tag{27}$$

with $R_i$ given by (26). This replaces (15) in the basic algorithm.

Frequently, a multiprocessor requires more CPU time to execute the same workload than the corresponding uniprocessor. There are two reasons: additional overhead incurred due to the more complex scheduling, and processor slowdown due to various synchronization (both software and hardware) requirements. The simplest way to model these effects is to increase the transaction CPU times $t_i$ by appropriate factors. These factors may be determined empirically, or they may be derived from more detailed models of the hardware and/or software. We have found the following empirical approach satisfactory: the same benchmark job stream is run both on the uni- and multiprocessor configurations, and total CPU time consumed is measured in both cases. The ratio of the two is then used to adjust the $t_i$ in all subsequent applications of the model.

Nonhomogeneous multiprocessors are more difficult to deal with. A specifically tailored solution is probably required for each case. The following example is indicative, though relatively easy. In an IBM System/370 Attached Processor configuration there are two processors with equal processing capacities, but only one of them (the so-called main processor) can execute I/O instructions and receive I/O interrupts. The result is that a certain fraction $\phi_i$ of the i-th class CPU instructions can be executed only on the main processor. The processing capacity of this configuration is K = 2 except when all the users in the CPU queue require the main processor, in which case K = 1. The probability of that event is $\prod_i \phi_i^{N_{i,4}}$. Hence, on the average,

$$K = 1 \times \prod_i \phi_i^{N_{i,4}} + 2\Bigl(1 - \prod_i \phi_i^{N_{i,4}}\Bigr) = 2 - \prod_i \phi_i^{N_{i,4}}. \tag{28}$$

It may be impractical to measure $\phi_i$ for each transaction class. Instead, an overall system average $\phi$ may be estimated or measured once and for all, so that we have

$$K = 2 - \phi^{\sum_i N_{i,4}} \tag{29}$$

(but take K = 1 when $\sum_i N_{i,4} \le 1$). In the loosely coupled system validation results (Table 5), systems 1 and 2 had attached processor configurations.

3.5. Loosely coupled multiprocessors

When several CPUs (some of which may, in themselves, be tightly coupled multiprocessors) run independently except for sharing access to a set of I/O devices, they are said to be loosely coupled. To model such a system, one needs an I/O submodel that can cope with devices accessible from several CPUs [3]. Once we have such a model, the Basic Algorithm can be modified as follows:

Shared I/O algorithm

(I) Execute step 1 of the Basic Algorithm for each one of the CPUs in the system.
(II) Execute steps 2.1-2.4 for each one of the CPUs in the system. Save the $r_p$ vectors generated by all the CPUs in step 2.4.
(III) Invoke the I/O subsystem model, using the combined $r_p$ vectors as rows in an input matrix. The outputs of the submodel will include a response matrix whose rows are the $\theta_p$ for the various CPUs.
(IV) Execute steps 2.5-2.7 for each one of the CPUs in the system, using the appropriate $\theta_p$ vector in each case. In the convergence test (step 2.7), terminate only if all CPUs have converged.
(V) Return to step II.

This algorithm can produce excellent results, as illustrated in the following example [3]. The system in question consisted of two IBM System/370 Model 168 Attached Processors (CPU1 and CPU2), and a Model 168 Uniprocessor (CPU3). The configuration is depicted in Fig. 3, and included 31 shared devices, in addition to 10 unshared devices used for paging. Three transaction classes were recognized, labeled 'trivial', 'nontrivial', and 'compute-bound'. The last kind was absent from CPU2 during the period when the systems were monitored (similarly named transaction classes on the different CPUs did not necessarily have similar resource requirements). In Table 5 we compare the measured average response times to those estimated by: (1) a model which treats each CPU separately, ignoring the device sharing, (2) a joint model employing the shared I/O algorithm. The results produced by the latter model are extremely good.

It may be that the I/O rates $r_p$ generated by some of the CPUs in the system are known in advance. For instance, if these CPUs have open user populations with known transaction rates $\lambda_i$, the $r_p$ can be computed from (16) without need for iteration. Such CPUs may be excluded from consideration in steps I, II, IV, and V of the shared I/O algorithm, but their known $r_p$ vectors must be introduced directly into step III.

Table 5
Model validation for loosely coupled multiprocessor system [3]

CPU   Transaction class   Average response time (sec)
                          Observed   Separate models   Combined model
1     Trivial             0.27       0.26              0.30
      Nontrivial          5.03       4.36              5.10
      Compute bound       881        923               900
2     Trivial             0.30       0.25              0.29
      Nontrivial          5.22       3.57              4.46
3     Trivial             0.33       0.29              0.35
      Nontrivial          7.64       6.25              7.51
      Compute bound       155        157               157

3.6. CPU priorities

So far we have assumed that the total CPU processing power is evenly divided among all the
Fig. 3. Configuration of loosely coupled system [3]. (Diagram levels: paging devices (2x2305/II), paging channels, System/370 CPUs, channels, control units, strings, devices (3330).)
users contending for it. Such a model corresponds roughly to a round-robin dispatcher which allocates equal time quanta to all users. By assigning different quanta to different users, or by dispatching some more frequently than others, the dispatcher can provide processing rates dictated by the users' respective priorities. Such dispatching schemes can be modeled approximately by assigning different processing rates depending on users' priorities and privileges. We shall analyze two types of priorities, which can be combined in a single model in an obvious way:

(1) Guaranteed utilization. Each user in class i* is guaranteed by the system to receive a fraction $u_{i^*}$ of one processor's capacity. Since, of the $n_{i^*}$ users in the class, only $N_{i^*,4}$ are in the CPU queue, the average required processing rate per user in that queue is

$$R_{i^*} = \min[1,\; u_{i^*} n_{i^*}/N_{i^*,4}]. \tag{30}$$

Note that if $u_{i^*} n_{i^*}/N_{i^*,4} > 1$, the user cannot absorb the guaranteed amount of CPU time, presumably because he spends too much time dormant, waiting, or executing I/O requests. The processing capacity used up by class i* users is not available to others. Hence, (26) must be modified for $i \ne i^*$, to:

$$R_i = \min\Bigl[1,\; (K - R_{i^*} N_{i^*,4}) \Big/ \Bigl(1 - N_{i,4}/N_{i,3} + \sum_{m \ne i^*} N_{m,4}\Bigr)\Bigr]. \tag{31}$$

The extension to multiple guaranteed-utilization classes is obvious. Suppose the value of $R_{i^*}$ computed by (30) is less than that obtained from (26) with i = i*. This means that, under the modeled load, the user receives sufficiently good service even without the guarantee. Unless the system actually restricts users to no more than their guaranteed utilizations, eq. (26) should then be used for all user classes.

(2) Relative processing rates. Suppose class i
users, while in the CPU queue, receive processing rates proportional to some specified constant $\beta_i$. For instance, if 'trivial' transactions (i = 1) are processed four times as fast as 'nontrivial' transactions (i = 2), then we may set $\beta_1 = 4$, $\beta_2 = 1$. From the point of view of a class i user, each class m user in the CPU queue is equivalent to $\beta_m/\beta_i$ class i users. Hence (26) is modified to

$$R_i = \min\Bigl[1,\; K \Big/ \Bigl(1 - N_{i,4}/N_{i,3} + (1/\beta_i)\sum_m \beta_m N_{m,4}\Bigr)\Bigr]. \tag{32}$$

Note that only the ratios of the $\beta$s matter, not their actual values. Table 6 shows that using $\beta_1/\beta_2 = 2$ improves results for some systems which give higher CPU priorities to 'trivial' than to 'nontrivial' transactions. The same ratio was used in some of the other validations shown here, e.g., Table 1.

Most operating systems have some means of assigning different priorities to different users or tasks. When somebody has absolute preemptive priority over everybody else, the effect can be modeled in the standard way by first modeling the high priority user(s) alone, then modeling the rest on a system whose processing powers have been diminished by the amounts already used up. The effects of other kinds of priority assignments can often be simulated with our relative processing rate scheme. The proper relative rates may have to be found by trial and error. The model cannot, then, tell us directly what settings of the operating system parameters are required to achieve certain performance goals, but it can suggest the range of achievable values. The actual settings will, then, have to be found once more by trial and error.

Table 6
Modeling CPU priorities

Case   Transaction   Average response time (sec)
                     Measured   Modeled
                                β1/β2 = 1   β1/β2 = 2
1      Trivial       0.16       0.26        0.18
       Nontrivial    1?.?       11.9        12.9
2      Trivial       0.3?       0.42        0.36
       Nontrivial    75.5       7?.6        75.2
3      Trivial       0.10       0.13        0.12
       Nontrivial    ?.0        7.3         7.6
4      Trivial       0.16       0.19        0.16
       Nontrivial    ?.1        7.8         8.5
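
The processing-rate formulas of Sections 3.4 and 3.6 lend themselves to a compact sketch. The function below is an illustration, assuming the rate takes the form $R_i = \min[1, K/(1 - N_{i,4}/N_{i,3} + (1/\beta_i)\sum_m \beta_m N_{m,4})]$, which reduces to the equal-sharing case when all $\beta$ are equal. The name `processing_rates` and the sample queue populations are hypothetical.

```python
def processing_rates(N4, N3, K=1.0, beta=None):
    """Per-class CPU processing rates R_i.

    N4[i] : average number of class i transactions in the CPU queue
    N3[i] : average number of class i transactions in the active set
    K     : processing capacity (1 for a uniprocessor)
    beta  : relative processing-rate constants; None means equal sharing
    """
    if beta is None:
        beta = [1.0] * len(N4)
    # Queue population weighted by relative rates.
    weighted_queue = sum(b * q for b, q in zip(beta, N4))
    rates = []
    for i in range(len(N4)):
        # Effective number of class-i-equivalent contenders seen by class i.
        contenders = 1.0 - N4[i] / N3[i] + weighted_queue / beta[i]
        # A single user can never consume more than one processor.
        rates.append(min(1.0, K / contenders))
    return rates

# Hypothetical queue populations for 'trivial' and 'nontrivial' classes.
N4 = [2.0, 3.0]
N3 = [4.0, 6.0]
print(processing_rates(N4, N3))                    # uniprocessor, equal sharing
print(processing_rates(N4, N3, K=2.0))             # two-fold multiprocessor
print(processing_rates(N4, N3, beta=[4.0, 1.0]))   # trivial favored 4:1
```

The CPU delay for each class then follows as $t_i/R_i$. Scaling all $\beta$ by a common factor leaves the result unchanged, in line with the remark that only their ratios matter.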

3.7. Software subsystems

Many systems contain software subsystems which perform services for users' transactions. For instance, there are data base subsystems and communication or networking subsystems. Such subsystems are usually capable of multi-threading, i.e., servicing a number of requests simultaneously. No special techniques are required to model a subsystem which serves end-users directly. The more interesting case arises when the subsystem serves requests generated inside the computer by transactions that have been submitted by end-users. We shall describe a model for a system containing a single such subsystem, but the extension to several is straightforward.

We shall assume that the subsystem can handle requests of several types indexed by subscript v. For instance, the types of requests handled by a data base system may be GET, GET NEXT, UPDATE, INSERT; or a teleprocessing subsystem may handle TRANSMIT and RECEIVE. We shall assume that there is a fixed upper limit V on the subsystem's degree of multithreading, i.e., on the number of simultaneously serviced requests. We shall regard each request type as a separate transaction class, also indexed by v. The multithreading limit can be regarded as a special case of a resource availability limit, for which eq. (8) takes the form:

$$\sum_v N_{v,3} \le V. \tag{33}$$

With each type v request there are associated, as with any other kind of transaction, the following quantities: $S_{vk}$: demand for resource k, $t_v$: CPU time, $n_{vp}$: number of accesses to device p. All these must be given as inputs. We shall further assume that, for each regular transaction class i, we are given the average number of times such a transaction calls on the subsystem for a type v service. We denote this number by $w_{iv}$. Then, the overall type v request rate is

$$\lambda_v = \sum_i w_{iv}\lambda_i. \tag{34}$$

We assume that a transaction waits while its subsystem requests are serviced. Thus, if $T_v$ is the response time to a type v request, the class i transaction response time is increased by $\sum_v w_{iv}T_v$. So, to estimate transaction response times, we need to calculate the $T_v$.
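
The two bookkeeping relations of this section, the request rate $\lambda_v = \sum_i w_{iv}\lambda_i$ and the response-time increment $\sum_v w_{iv}T_v$, can be illustrated as follows. This is a sketch only; the function names and the call counts, rates, and response times used in the example are hypothetical.

```python
def subsystem_request_rates(w, lam):
    """Type v request rates: lambda_v = sum over i of w[i][v] * lam[i].

    w[i][v] : average number of type v requests per class i transaction
    lam[i]  : class i transaction rate
    """
    n_types = len(w[0])
    return [sum(w[i][v] * lam[i] for i in range(len(lam))) for v in range(n_types)]

def subsystem_time_per_transaction(w, T_sub):
    """Response-time increment per class i transaction: sum_v w[i][v] * T_sub[v]."""
    return [sum(w_i[v] * T_sub[v] for v in range(len(T_sub))) for w_i in w]

# Hypothetical example: two transaction classes, two request types.
w = [[2, 0], [1, 3]]
lam = [0.5, 0.1]
T_sub = [0.2, 0.5]
print(subsystem_request_rates(w, lam))
print(subsystem_time_per_transaction(w, T_sub))
```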

The subsystem request response time $T_v$ is composed of two components: a wait time $T_{v,2}$ and an active time $T_{v,3}$. We have written $T_{v,2}$ instead of the system-wide value W because subsystem transactions may be scheduled differently from others. On the one hand, resources for which other transaction types have to wait (e.g., main storage space) may be preallocated to the subsystem so that no wait is required. On the other hand, there may be a wait for a task initiator within the subsystem to satisfy the multi-threading restriction (33). For concreteness, we suppose that all subsystem resources are preallocated, so that only (33) applies. The resources used by the subsystem must be subtracted from their capacities, so that $S_k$ is replaced with $S_k - S_k'$ in (8) and (9), where $S_k'$ is the preallocated amount. Furthermore, we shall assume that the different request types are scheduled by the subsystem with equal priorities, so that there is a common waiting time $W' = T_{v,2}$ for all v. If $n_v$ is the total number of type v requests outstanding, then the analog to eq. (5) is (since there is no dormant time associated with these requests):

$$N_{v,3} = n_v T_{v,3}/(W' + T_{v,3}). \tag{35}$$

Or, since $T_v = W' + T_{v,3}$,

$$N_{v,3} = n_v(T_v - W')/T_v.$$

Thus, (33) becomes:

$$\sum_v n_v(T_v - W')/T_v \le V. \tag{36}$$

Clearly, if $\sum_v n_v \le V$, then (36) is satisfied with W' = 0. If $\sum_v n_v > V$, then (36) (with equality sign) can be solved for W', yielding

$$W' = \Bigl(\sum_v n_v - V\Bigr) \Big/ \sum_v (n_v/T_v). \tag{37}$$

We have by now modified the Basic Algorithm extensively. Here, it is restated for the case of a single subsystem. The algorithm essentially views the subsystem as a set of open user classes, whose transaction arrival rates $\lambda_v$ are recomputed in each overall iteration according to eq. (34).

Basic algorithm with subsystem

Below, subscript i ranges over the end-user transaction classes, subscript v over subsystem request types, and subscript b over both.

A. Required inputs, in addition to those given in the basic algorithm
(1) Subsystem characteristics
V: maximum degree of multi-threading,
$S_k'$: preallocated amount of k-th resource,
$t_v$: average CPU time per type v request,
$n_{vp}$: average number of accesses to device p per type v request.
(2) Additional workload characteristics
$w_{iv}$: number of type v requests per class i transaction.

B. Algorithm
(1) Initialization. Assign initial guess values to the $T_{i,3}$ and $T_v$. Set $\lambda_i = n_i/(T_{i,1} + T_{i,3} + \sum_v w_{iv}T_v)$.
(2) Overall iteration
(2.1) Initialization
(2.1.1) Set W = 0.
(2.1.2) Compute $\lambda_v = \sum_i w_{iv}\lambda_i$ (see note 1) and $n_v = \lambda_v T_v$. Test whether $\sum_v n_v \le V$. If so, set W' = 0 and $N_{v,3} = n_v$; otherwise,
(2.1.3) Compute $W' = (\sum_v n_v - V) / \sum_v (n_v/T_v)$. Set $T_{v,3} = T_v - W'$ and $N_{v,3} = n_v T_{v,3}/T_v$.
(2.2) Wait time iteration
(2.2.1) Test whether, for all k,
$$\sum_i n_i T_{i,3} S_{ik} \Big/ \Bigl(T_{i,1} + W + T_{i,3} + \sum_v w_{iv}T_v\Bigr) \le S_k - S_k'$$
(see note 2). If so, proceed to step 2.3; otherwise,
(2.2.2) Select the value of k, say k*, for which the above condition is most violated (relative to the value of the right hand side). Find W such that the condition is satisfied with equality for k = k*.
(2.2.3) Return to step 2.2.1.
(2.3) Convergence test. Compute $T_i = W + T_{i,3} + \sum_v w_{iv}T_v$. If this is not the first iteration, compare these values to the previous iteration $T_i$ values. If they agree to within, say, 1%, terminate. Otherwise,
(2.4) Compute $\lambda_i = n_i/(T_i + T_{i,1})$, $N_{i,3} = \lambda_i T_{i,3}$, $r_p = \sum_b \lambda_b n_{bp}$.
(2.5) I/O subsystem model. Invoke the I/O subsystem model with the $r_p$ as inputs, then apply eq. (17) to obtain the $\theta_p$.
(2.6) Compute $T_{b,5} = \sum_p n_{bp}\theta_p$.
(2.7) Active set iteration
(2.7.1) Initialization. Assign initial guess values to the $T_{b,4}$. In the first overall iteration, take $T_{b,4} = t_b$. Subsequently, use the current value.
(2.7.2) Compute $N_{b,4} = N_{b,3} T_{b,4}/(T_{b,4} + T_{b,5})$.
(2.7.3) CPU delay. Compute $T_{b,4} = \max[t_b,\; t_b(1 - N_{b,4}/N_{b,3} + \sum_m N_{m,4})]$.
(2.7.4) Convergence test. If these values differ from the previous iteration ones by more than, say, $10^{-6}$, return to step 2.7.2. Otherwise,
(2.7.5) Compute $T_{b,3} = T_{b,4} + T_{b,5}$; $T_v = W' + T_{v,3}$. (Note that among the new $T_{b,3}$, we have both $T_{i,3}$ and $T_{v,3}$.)
(2.8) Return to step 2.1.

Note 1. If the subsystem is an 'open' one, i.e., if it serves requests submitted externally (rather than generated by already present transactions), then the $\lambda_v$ would be specified directly and would not be calculated here. We can also have a combination of both cases, with $\lambda_v$ the sum of given external rates and computed internal rates. The algorithm works for open subsystems only if no resource is saturated at the specified external arrival rates.

Note 2. This test assumes that a transaction is dropped from the active state while waiting for subsystem response. If this is not the case, i.e., if the transaction continues to hold resource k while waiting, then $T_{i,3}$ in the numerator should be replaced with $T_{i,3} + \sum_v w_{iv}T_v$.

This model was applied to predict the performance of a system employing a teleprocessing subsystem. First, a benchmark jobstream was run and measured using simulated locally attached terminals. The resource requirements of the teleprocessing subsystem were measured independently. The two sets of measurements were combined in the model to predict performance of the same workload run via the teleprocessing subsystem. The results are compared to actual measurements in Table 7.

Table 7
Validation of model with teleprocessing subsystem

Transaction class     Average response time (sec)
                      Measured,       Predicted,        Measured,
                      no subsystem    with subsystem    with subsystem
Class 1               0.25            0.79              0.76
Class 2               7.48            15.34             15.87
Subsystem service                     0.15              0.18
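
The computation of the common subsystem wait time can be sketched as below. The sketch assumes the closed form $W' = (\sum_v n_v - V)/\sum_v (n_v/T_v)$, obtained by imposing $\sum_v n_v(T_v - W')/T_v = V$ when the multi-threading limit is binding. The function name `subsystem_wait` and the inputs are hypothetical, and the sketch does not guard against a W' exceeding some $T_v$, which a full implementation would need to consider.

```python
def subsystem_wait(n, T, V):
    """Common subsystem wait time W' and active request counts N_{v,3}.

    n[v] : number of outstanding type v requests
    T[v] : response time of a type v request (wait plus active time)
    V    : multi-threading limit
    """
    if sum(n) <= V:
        # Limit not binding: no wait, all outstanding requests are active.
        return 0.0, list(n)
    # Limit binding: choose W' so that the active counts sum exactly to V.
    wait = (sum(n) - V) / sum(nv / Tv for nv, Tv in zip(n, T))
    active = [nv * (Tv - wait) / Tv for nv, Tv in zip(n, T)]
    return wait, active

# Hypothetical example: five outstanding requests, at most four threads.
wait, active = subsystem_wait([3.0, 2.0], [1.0, 0.5], V=4)
print(wait, active, sum(active))
```

In the example the active counts sum to the limit V, confirming that the binding constraint holds with equality.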

3.8. Distributed systems

We define a distributed system as a constellation of interconnected computers in which some transactions require service at more than one computer. Such systems can often be handled by combining the methods we have suggested for loosely coupled multiprocessors and for subsystems. For concreteness, suppose a 'parent' transaction at one computer generates 'offspring' transactions to be sent to a second computer. The parent transaction cannot continue until the response from the offspring has been received. In this case, we can iterate simultaneously on the two computer models, just as we did for loosely coupled systems. In each iteration, the rate of generated offspring transactions is computed from the parent transaction rate, just as we did for the subsystem. Conversely, the parent transaction's response time is augmented by the previous iteration's offspring transaction response time, plus whatever transmission delays are incurred.

There may be cases in which a portion of the parent transaction can continue to be executed on the first computer in parallel with the offspring's execution on the second computer. The processing of such a transaction is depicted in Fig. 4. The parent transaction consists of three phases: (1) a nonoverlapped phase before the offspring is generated; (2) an overlapped phase; (3) another nonoverlapped phase, following the receipt of the offspring's response. Let the average durations (elapsed time) of these
L B a r d / A simple approach to system modeling Parent transaction
T! !
( l - s t computer)
T2 . . . .
T3 ]
I
~1"
/,
Offspring :ransaction (2-nd computer)
T4
243
T5
'
'
I
T6
Fig. 4. Overlapped parent and offspring transaclio::s in a distributed systen,
phase be Tin, T2, and T3, respectively. Also, let T4, T~, and T6 be the average dural ions of the offspring's outward transmission, execution, and return transmission times. Then, if we take the maximum of the averages as an approximation to the average of the maxima, the parent's overall response time is: T = Tm+ max(T2,T4 + Ts + Tr) + T3.
(38)
The individual T, ( i - 1,2,3) can be computed by considering these phases as distinct transaction classes, each with its own CPU ~nd I / O demands. The nonoverlapped case corresponds to T2 = 0. For computational purposes, the two nonoverlapped phases (T l and T3) can be combined in a single phase of duration 7"1 + T3~ This treatment of overlapped processing is jus~ as applicable when the offspring transactions are serviced by a local subsystem, i.e., in the same computer as the parent.
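As a small numerical illustration of eq. (38), the following Python sketch evaluates the parent's overall response time from the six average phase durations. The function name and the sample values are assumptions made purely for illustration; they do not come from the paper's measurements.

```python
def parent_response_time(t1, t2, t3, t4, t5, t6):
    """Eq. (38): T = T1 + max(T2, T4 + T5 + T6) + T3.

    t1, t3: nonoverlapped parent phases
    t2:     overlapped parent phase (0 in the nonoverlapped case)
    t4..t6: offspring outward transmission, execution, return transmission
    """
    return t1 + max(t2, t4 + t5 + t6) + t3


# Hypothetical durations (arbitrary time units).
offspring_dominates = parent_response_time(1.0, 2.0, 3.0, 0.5, 4.0, 0.5)
parent_dominates = parent_response_time(1.0, 6.0, 3.0, 0.5, 4.0, 0.5)
nonoverlapped = parent_response_time(1.0, 0.0, 3.0, 0.5, 4.0, 0.5)
```

When the overlapped phase is shorter than the offspring's round trip, the parent simply waits for the offspring, so the first and third calls give the same result.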
3.9. Separate resource queues

The basic algorithm treats resource constraints as though each transaction had a requirement for each resource, and there was a single system-wide wait time $T_{i2} = W$. When various transaction classes require different subsets of resources, one can proceed as follows: For each resource $S_k$, find the value of $W$, say $W_k$, such that (9) is satisfied with equality (use $W_k = 0$ if that satisfies (9)). Then, for each transaction class $i$ take $T_{i2} = \max_{k:\, s_{ik} > 0} W_k$.
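The last step of this procedure can be sketched in Python as follows. The per-resource wait times $W_k$ are assumed to have been found already by solving (9); the function name, the data layout, and the choice of a zero wait for a class that needs no constrained resource are assumptions of this sketch.

```python
def class_wait_times(resource_wait, requirements):
    """Return T_i2 for each class i as the largest W_k among the
    resources that class i actually requires.

    resource_wait[k]   -- W_k, wait time found for resource k
    requirements[i][k] -- s_ik, amount of resource k needed by class i
    """
    waits = []
    for s_i in requirements:
        needed = [w for w, s in zip(resource_wait, s_i) if s > 0]
        # A class that needs no constrained resource does not wait (assumption).
        waits.append(max(needed, default=0.0))
    return waits


# Hypothetical example: three resources, three transaction classes.
w = [0.0, 1.5, 0.4]
s = [[1, 0, 2],   # class 0 needs resources 0 and 2
     [0, 3, 1],   # class 1 needs resources 1 and 2
     [0, 0, 0]]   # class 2 needs none
t_wait = class_wait_times(w, s)
```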
3.10. Overhead

Any overhead which makes a more or less fixed contribution to a transaction's resource demands can be included in the $t_i$, $n_{ij}$, etc., and need not be further considered here. On the other hand, there may be overhead components strongly dependent on the environment in which a transaction is being run. For instance, the scheduling and dispatching overhead probably depends on the size of the queue to be handled. Another CPU overhead item is associated with paging, and should be approximately proportional to $n_{ip}$, the number of paging accesses required by a transaction. However, since $n_{ip}$ is environment-dependent (see Section 2.3), so is the associated overhead.

There are many system services which are performed upon direct request from a transaction, for instance, acquisition of free storage or accessing a data base. Most of the cost of such services would be relatively constant, and therefore included in the transactions' resource demand specifications. However, sometimes the amount of resource (CPU time, disk accesses) used per request may increase as a function of system congestion. We then suppose that equations are available relating the overhead to quantities which describe the state of the system. Such equations can often be obtained empirically, using regression techniques [12]. Since at any iteration in our algorithm the values of all these quantities (e.g., $N_{i3}$, storage utilization) are known, the overhead can be recomputed and used as part of the workload description for the next iteration.

Assuming that the amounts of all overhead components can indeed be assessed at each iteration, how do we incorporate these into the model? For modeling purposes, we distinguish between the following types of overhead:

Synchronous overhead. This is overhead executed in line with the execution of a transaction. Included are paging, most direct request services, and the actual scheduling and dispatching (as opposed to general queue maintenance). Some direct request services, such as offline printing of output, are not synchronous. Synchronous overhead is modeled by adding the overhead resource demands to those of the requesting transactions. This includes CPU time added to $t_i$, I/O accesses added to the $n_{ij}$, and other resources added to the proper $s_{ik}$.

High priority asynchronous overhead. This consists of services not executed in line with any transaction, but taking precedence over transactions. Examples are: servicing real-time interrupts, running a performance monitor or system resource manager, or producing a system log. Resources consumed by such activities should be subtracted from the available pool (the $S_k$, or the total CPU processing power) at each iteration. In the case of I/O requests, the device access rates are augmented by the overhead amounts, if necessary. If these overhead requests receive high priority even in the I/O subsystem, the latter's model should be able to take that into account.

Low priority asynchronous overhead. This consists of functions performed when there is nothing else to do, for instance polling of terminals. Such overhead may be ignored until the end, when its contributions should be added to the total resource utilizations.
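The three treatments can be sketched for the CPU alone as follows. This is only an illustration under stated assumptions: the function names, the restriction to CPU time, and the sample numbers are hypothetical, and the overhead amounts are taken as already estimated for the current iteration (e.g., from regression equations).

```python
def apply_overhead(cpu_demand, sync_cpu_overhead, cpu_capacity,
                   high_priority_async_cpu):
    """Adjust the workload description before one model iteration.

    Synchronous overhead is added to each class's own CPU demand;
    high priority asynchronous overhead is removed from the CPU
    capacity available to transactions.
    """
    demand = [t + o for t, o in zip(cpu_demand, sync_cpu_overhead)]
    capacity = cpu_capacity - high_priority_async_cpu
    if capacity <= 0.0:
        raise ValueError("high priority overhead consumes the whole CPU")
    return demand, capacity


def final_cpu_utilization(model_utilization, low_priority_async_utilization):
    """Low priority asynchronous overhead is ignored during the iterations
    and only added to the reported utilization at the end."""
    return model_utilization + low_priority_async_utilization


# Hypothetical numbers: two classes, one CPU of unit capacity.
demand, capacity = apply_overhead([0.5, 1.0], [0.25, 0.5], 1.0, 0.125)
reported = final_cpu_utilization(0.5, 0.25)
```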
4. Conclusion

In this section, we shall attempt to place the model we have described within the general context of system modeling methods. Most analytic methods for modeling the performance of computer systems rely on queuing network theory. What we have called 'system states' are really queues in which customers (transactions) wait for or receive service. The movement of customers between these states creates a network of queues. Queuing network theory attempts to compute network state probabilities, i.e., the probabilities of all possible splits of the total customer population among the various queues. From these probabilities, all desired performance measures can be computed. The central achievement of queuing network theory is the product form solution: If the network and its workload satisfy certain conditions, then the probability of a given network state is the product of the individual queue state probabilities.
These probabilities are easily computed, up to a normalizing constant, and several reasonable algorithms exist for computing the latter. A survey of the product form solution and its computational algorithms can be found in [13]. The product form solution gives exact results for networks satisfying the appropriate assumptions. These assumptions, however, are rather restrictive. For instance, among things not covered, we find:
- resource constraints;
- first-come-first-serve (FCFS) queues, unless service times are exponentially distributed with the same mean for all user classes;
- priorities of all kinds;
- non-Poisson transaction arrival processes for open user classes;
- output dependent inputs, e.g., service time at one queue dependent on response time at another queue.

When confronted with a real system, users of the product form solution adopt procedures such as the following:
(1) decompose the network into simpler subnetworks;
(2) approximate each subnetwork with one that has a product form solution;
(3) solve each approximate subnetwork;
(4) put together the subnetwork solutions. This step may require iterating back to step 2.

One of the algorithms for computing the product form solution is derived from mean value analysis [4]. The algorithm employs Little's formula in the form of eq. (5) and the delay eqs. (6). It turns out that to obtain the exact product form solution, the delay equation for a given user population contains the average queue lengths for systems with one less user. Therefore, the mean value method requires solving, in turn, systems with 1, 2, ..., n - 1 users in order to get the solution for n users.
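The recursion over populations can be made concrete with a minimal Python sketch of exact mean value analysis for a single transaction class. It uses the standard textbook form of the method (fixed-rate queueing centers plus a think time), which is an assumption here and not necessarily identical to the paper's eqs. (5) and (6); all names are illustrative.

```python
def exact_mva(demands, think_time, n_users):
    """Exact single-class MVA for a closed network.

    demands[k] -- total service demand of one transaction at center k
    think_time -- average think (delay) time between transactions
    n_users    -- integer population
    Returns (throughput, response times per center, queue lengths).
    """
    queue = [0.0] * len(demands)      # queue lengths with 0 users
    throughput = 0.0
    response = list(demands)
    for m in range(1, n_users + 1):   # must build up from 1 user to n
        # Delay equation: an arriving user sees the queue of the
        # system with one less user.
        response = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Little's formula applied to the whole network.
        throughput = m / (think_time + sum(response))
        # Little's formula applied to each center.
        queue = [throughput * r for r in response]
    return throughput, response, queue


x, r, q = exact_mva([1.0, 1.0], 0.0, 2)
```

The loop body is cheap, but it has to be executed for every intermediate population, and in the multiclass case for every combination of class populations, which is where the cost of the exact method comes from.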
If the exact delay functions are replaced by approximations involving only the quantities related to the population of interest, a significant reduction in computational effort results. Such an approximation is represented by eq. (15) for a processor sharing queue. But, once the idea of using approximations for networks with product form solution is accepted, it becomes natural to apply similar techniques to other cases. For instance, one is led to equations such as (23) for FCFS queues. And, indeed, the idea of iterating between Little's formula and the delay equations applies to the network as a whole as well as to its constituent parts. The advantages of this approach, which we call heuristic mean value analysis (HMVA), are:
- applicable to a wide range of situations;
- easy to formulate a model;
- easy to explain;
- easy to program;
- computationally fast;
- low storage requirements;
- no worries about numerical overflow or underflow;
- user populations ($n_i$) need not be integers.

The last point is important, since average population sizes are not necessarily integral. In contrast, the product form solution possesses the following advantages:
- obtains exact solution when assumptions hold;
- obtains entire queue length distribution, not merely averages;
- handling of servers with queue-dependent service rates is much better;
- no worries about convergence of iterations.

Of these, the first item is rarely of importance in realistic cases, since the errors introduced in the HMVA solution are small compared to the errors introduced by making the assumptions required for product form. The last item is also of little practical importance, since the iterations usually do converge.

It is worthwhile tracing the history of a particular model (that of the VM/370 system [1,2]) which has used HMVA for its overall system model from the start, but whose method for solving the active state subnetwork has evolved through three stages (note that [1] and [2] appeared before the third stage was implemented), which are depicted in Table 8. Whereas in the early stages ten transaction classes were the outer limit of the model's capabilities, the last version can handle up to a hundred. The accuracy of the model (relative to the performance of real systems being modeled) has improved somewhat in going from each stage to the next. The last improvement may be surprising, since an 'exact' algorithm was replaced by an approximate one. However, because of the increased speed of the algorithm, it became feasible to change the overall iteration convergence criterion from 5% to 1%, thus accounting for the increased accuracy.

We have used here the simplest possible form of HMVA. Somewhat more elaborate approximations have been developed [14]. They require greater computational effort, but yield more accurate results as judged by comparison to the product form solution in cases where the latter applies. Whether or not similar improvements can be realized in the modeling of real systems is yet to be determined.

First and foremost among the deficiencies of our model is the gross way in which it handles resource constraints. Among important system features not included in the analysis one must cite software locks. These may, in principle, be treated as resource constraints directly affecting active (rather than wait) times, but a better method for handling such constraints would be required. Perhaps the method of surrogate delays [15] can be combined with HMVA in a satisfactory manner to solve the resource constraint problem.

The theoretical problems raised by HMVA remain unresolved. Most important among those are conditions and proofs of convergence, and error bounds.
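To illustrate the kind of approximation meant here, the following Python sketch replaces the exact delay equation by one that involves only the population of interest and iterates it together with Little's formula. It assumes the commonly cited Schweitzer form, in which the queue seen by an arriving user is taken as (n - 1)/n times the queue at full population; whether this matches the paper's eq. (15) term by term is not verified here, and all names are illustrative.

```python
def approximate_mva(demands, think_time, n_users, tol=1e-10, max_iter=10000):
    """Single-class approximate MVA by successive substitution.

    n_users may be non-integer (assumed >= 1 in this sketch).
    Returns (throughput, response times per center, queue lengths).
    """
    if n_users < 1:
        raise ValueError("this sketch assumes a population of at least 1")
    k = len(demands)
    queue = [n_users / k] * k                 # initial guess: even split
    for _ in range(max_iter):
        # Approximate delay equation: only quantities for population n.
        response = [d * (1.0 + (n_users - 1.0) / n_users * q)
                    for d, q in zip(demands, queue)]
        throughput = n_users / (think_time + sum(response))   # Little
        new_queue = [throughput * r for r in response]        # Little
        change = max(abs(a - b) for a, b in zip(new_queue, queue))
        queue = new_queue
        if change < tol:
            return throughput, response, queue
    raise RuntimeError("iteration did not converge")


x_int, _, _ = approximate_mva([1.0, 1.0], 0.0, 2)
x_frac, _, _ = approximate_mva([1.0, 1.0], 0.0, 2.5)
```

No systems with fewer users are solved, storage is proportional to the number of centers, and a fractional population such as 2.5 is handled without any change to the code.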
Table 8
Stages in the development of the VM/370 Predictor Model

| Representation of active state subnetwork | Relative CPU time (a) | Relative storage requirements |
|---|---|---|
| Product form solution, with one queue for each CPU, channel, and device | 20 | $\prod_i (n_i + 1)\, p$ |
| Product form solution, with two queues: PS queue for CPU, and infinite server queue for I/O, the delay time being supplied by the I/O subsystem model | 4 | $\prod_i (n_i + 1)$ |
| Heuristic mean value analysis (the "basic algorithm" of this paper) | 1 | No. of user classes |

(a) The 'relative CPU times' are approximately those required for modeling a system with ten transaction classes and thirty I/O devices, using programs written in APL.
References

[1] Y. Bard, The VM/370 performance predictor, Comput. Surveys 10 (1978) 333-342.
[2] Y. Bard, An analytic model of the VM/370 system, IBM J. Res. Develop. 22 (1978) 498-508.
[3] Y. Bard, A model of shared DASD and multipathing, Comm. ACM 23 (1980) 564-572.
[4] M. Reiser and S.S. Lavenberg, Mean value analysis of closed multichain queuing networks, J. ACM 27 (1980) 313-322.
[5] Y. Bard, Some extensions to multiclass queuing network analysis, in: M. Arato, A. Butrimenko and E. Gelenbe, Eds., Performance of Computer Systems (North-Holland, Amsterdam, 1979) pp. 51-62.
[6] P. Schweitzer, Personal communication, 1978.
[7] P.J. Courtois, Decomposability, instabilities, and saturation in multiprogramming systems, Comm. ACM 18 (1975) 371-377.
[8] N.C. Wilhelm, A general model for the performance of disk systems, J. ACM 24 (1977) 1-13.
[9] J. Zahorjan, J.N.P. Hume and K.C. Sevcik, A queuing model of a rotational position sensing disk system, INFOR 10 (1978) 199-216.
[10] S.J. Boies, User behavior on an interactive computer system, IBM Systems J. 13 (1974) 2-18.
[11] Y. Bard, A characterization of VM/370 workloads, in: H. Beilner and E. Gelenbe, Eds., Modeling and Performance of Computer Systems (North-Holland, Amsterdam, 1976) pp. 35-50.
[12] Y. Bard, Performance criteria and measurement of a time sharing system, IBM Systems J. 10 (1971) 168-192.
[13] C.H. Sauer and K.M. Chandy, Computer System Modeling: A Primer (Prentice-Hall, Englewood Cliffs, NJ, 1981).
[14] K.M. Chandy and D. Neuse, Fast accurate heuristic algorithms for queuing network models of computing systems, Techn. Rep. TR-157, Department of Computer Sciences, University of Texas, Austin (1980).
[15] P.A. Jacobson and E.D. Lazowska, The method of surrogate delays: simultaneous resource possession in analytic models of computer systems, Techn. Rep. No. 80-04-03, Department of Computer Science, University of Washington (1980).
[16] A. Brandwajn, Models of DASD subsystems, Part I: Basic model of reconnection, Performance Evaluation 1(3) (1981) 263-281.

Appendix. Convergence considerations

The basic algorithm for solving eqs. (5) and (6) proceeds by successive substitutions. Each iteration starts with a vector of $T_{i3}$ values, and ends with a new vector, say $T'_{i3}$. An iteration may be regarded as the evaluation of a vector of complicated functions, which can be written down as:

$$T'_{i3} = g_i(T_{13}, T_{23}, \ldots).$$

To simplify matters, assume there is a single transaction class, so that we may write:

$$T_3' = g(T_3).$$

The behavior of the iterations is shown, for a particular form of the function $g(T_3)$, in Fig. 5. In this case, the iterations converge rapidly to the desired solution at the intersection of the curve $T_3' = g(T_3)$ with the line $T_3' = T_3$.

Fig. 5. Iterations, population dependent delay.

Let us examine the behavior of $g(T_3)$ for two extreme cases. First, consider a CPU-bound workload, with negligible delays due to I/O and resource constraints. In this case, $T_3$ and $T_4$ are identical, and by combining (5) with (15) we can derive:

$$T_3' = n t T_3 / (T_1 + T_3).$$

This function has exactly the form shown in Fig. 5, with a horizontal asymptote at $T_3' = nt$ as $T_3$ goes to infinity. Thus, in CPU-bound cases, convergence is assured. The same argument applies whenever the delay equation is a roughly linear function of the $N_{ij}$, with no direct dependence on the throughputs $\lambda_i$.

On the other hand, consider an I/O-bound system, with negligible delays due to CPU and resource constraints. Here, $T_3$ and $T_5$ are almost identical. The delay equation contains a sum of terms like eq. (17), which have the rough form

$$T_3 = a / (1 - b\lambda),$$

where $a$ and $b$ are constants. Combining this with (4) yields

$$T_3' = a (T_1 + T_3) / (T_1 + T_3 - bn).$$

This formula has a shape as depicted in Fig. 6 if $T_1 > bn$, or Fig. 7 if $T_1 < bn$. The latter case is representative of what happens when some I/O path component saturates at the throughput that would prevail if response time were instantaneous. In the first case, there is still no convergence problem, and even in the second case convergence is assured provided that we start with a feasible initial guess.

Fig. 6. Iterations, throughput dependent delays, case 1.

Fig. 7. Iterations, throughput dependent delays, case 2.

However, should the $g(T_3)$ curve cross the $T_3' = T_3$ line at a steeper than -45 degree slope, then the iteration will not converge (Fig. 8). In fact, a diverging oscillatory behavior will be observed, even though a perfectly well defined solution exists. This case can arise whenever the delay function depends primarily on the throughputs rather than on the populations.

What can be done about this problem? One solution is to re-express the delay functions so that they depend only on the populations. Although this has actually been done for the I/O subsystem [16], the resulting equations are much more complicated and expensive to evaluate than the throughput based equations. Fortunately, a simpler solution is suggested by Fig. 8, namely that one should, in effect, rotate the $g(T_3)$ curve until it does not descend too quickly. This can be done by truncating the step taken during each iteration when nonconverging oscillatory behavior is detected, as shown in Fig. 9. When we try to generalize from one to several dimensions we are on shaky theoretical ground. Nevertheless, among several algorithms that suggest themselves, the following has been very successful in eliminating virtually all cases of non-convergence:
Fig. 8. Iterations, throughput dependent delays, case 3.

Fig. 9. Iterations, convergence.
Convergence inducement algorithm.

(1) Initialization
Truncation factor: set $K = 1$.
Step limiting factor: set $L = 10$.
Initial guess: set $T_{i3}$ as in Basic Algorithm.
Step length normalization factor: set $Y_i = T_{i3}$.
Previous step: set $D_i = 0$.

(2) Iteration
(2.1) Model. Compute $T'_{i3}$ as the new value of $T_{i3}$ according to the Basic Algorithm or one of its modifications.
(2.2) Prevention of excessive fluctuations. Set $T'_{i3} = \max[T_{i3}/L,\; \min(T'_{i3},\; L T_{i3})]$. Set $L = 2$ (for subsequent iterations).
(2.3) Compute normalized step sizes. Set $E_i = D_i$ (previous step). Set $D_i = (T'_{i3} - T_{i3})/Y_i$ (present step). If iteration number is less than 3, go to Step 2.4. Otherwise, let $Q_1 = \sum_i D_i E_i$ (inner product of present and previous steps). If $Q_1 \ge 0$, go to Step 2.4 (no oscillatory behaviour). Otherwise, compute $Q_2 = \sum_i D_i^2$ (square of present step length) and $Q_3 = \sum_i E_i^2$ (square of previous step length). If $Q_2 > 0.9\, Q_3$, set $K = K/2$ (step reduction because of unsatisfactory convergence).
(2.4) Step. Let $T_{i3} = T_{i3} + K (T'_{i3} - T_{i3})$.
(2.5) Convergence test. As in Basic Algorithm.
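The steps above can be sketched in Python as follows. Two things are assumptions of this sketch rather than part of the paper: the convergence test (the Basic Algorithm's test is not reproduced here, so a relative change between model output and input is used), and the example model, a hypothetical one-class map $g(T) = 5 - 2T$ whose slope of -2 makes plain successive substitution diverge around the solution $T = 5/3$.

```python
def induce_convergence(model, t_init, tol=1e-8, max_iter=200):
    """Successive substitution with step truncation.

    model  -- function mapping a list of T_i3 to the new list T'_i3
    t_init -- initial guess (all entries nonzero)
    Returns (solution, number of iterations used).
    """
    t = list(t_init)
    k = 1.0                 # truncation factor K
    lim = 10.0              # step limiting factor L
    y = list(t)             # step length normalization factors Y_i
    d = [0.0] * len(t)      # previous step D_i
    for it in range(1, max_iter + 1):
        t_new = model(t)                                    # step 2.1
        t_new = [max(ti / lim, min(tn, lim * ti))           # step 2.2
                 for ti, tn in zip(t, t_new)]
        lim = 2.0
        e = d                                               # step 2.3
        d = [(tn - ti) / yi for tn, ti, yi in zip(t_new, t, y)]
        if it >= 3:
            q1 = sum(di * ei for di, ei in zip(d, e))
            if q1 < 0.0:    # present step reverses the previous one
                q2 = sum(di * di for di in d)
                q3 = sum(ei * ei for ei in e)
                if q2 > 0.9 * q3:
                    k /= 2.0
        # Assumed convergence test: model output agrees with its input.
        converged = all(abs(tn - ti) <= tol * abs(ti)
                        for tn, ti in zip(t_new, t))
        t = [ti + k * (tn - ti) for ti, tn in zip(t, t_new)]  # step 2.4
        if converged:                                         # step 2.5
            return t, it
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical model with slope -2 at its fixed point T = 5/3.
solution, iterations = induce_convergence(lambda v: [5.0 - 2.0 * v[0]], [2.0])
```

In this example the first two iterations bounce between 2 and 1, the oscillation is detected at the third iteration, K is halved once, and the remaining iterations contract toward 5/3 by a factor of one half per step.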