Development criteria for a Benchmark test program


Computer Programs in Biomedicine 15 (1982) 243-248


Elsevier Biomedical Press

Kurt Böhm
German Cancer Research Center, Institute of Documentation, Information and Statistics, P.O. Box 101949, Im Neuenheimer Feld 280, D-6900 Heidelberg 1, FRG

A Benchmark test program can not only be used as a tool for comparing the capabilities of different EDP systems when deciding on a new computer. Benchmark tests can also serve as a source of dynamic information on the development of performance and to predict bottlenecks with increasing workload. To gain these advantages, each computer center should develop a representative model of its typical computer workload. Criteria for the development of such a Benchmark test program, especially for the important simulation of terminal sessions, are described and detailed in an example.

Keywords: Benchmark test; Performance evaluation; Computer simulation; Computer management; Resource optimization

1. INTRODUCTORY REMARKS

'Benchmark tests are pointless because you can prove anything with them' was the reaction in our computing center, too, to the suggestion of setting up a Benchmark test. The statistician, frequently confronted with a similar prejudice, retorts that nothing can be proven without statistics. This also applies to the field of measuring and comparing the performance of EDP installations. The criticism of Benchmark tests (performance measurement at a certain date of reference) as a means of comparing performance between computers with differing hardware configurations and/or different operating systems is definitely justified if the turn-around time of a standard program or a sequence of programs serves as the sole measure of performance. Such tests do not adequately reflect multi-user operation and the increased share of conversational usage. They are therefore unsuitable as a model for the mode of operation which is customary today on larger computers.

As in other biomedical institutions, the use of computers in the German Cancer Research Center is characterized by a broad spectrum of applications (statistics, simulation, databases, image processing) showing a strong variance in their CPU time and storage requirements. The necessity of offering users in biomedicine conversational access to the computer in the laboratory marks the beginning of their ability to develop programs. Based on the necessities arising in a computing center with a large computer (IBM 3032) and a large share of conversational jobs (75%), we need an instrument which can fulfil the following aims.

2. AIMS

2.1. Dynamic information on the performance of the computer (control)

Although a computer system - apart from an exchange of terminals - remains constant at the user interface over several years, its performance is still subject to continuous changes. These follow from hardware and software modifications and from the development of the user profile. Changes of the hardware configuration which are important for the performance of the computer are extensions of the main storage, of the fast fixed-head storage for paging, of the disk storage and of the number of terminals. In software, performance is influenced by changes of the operating system (releases) and the implementation of new program products. Less clearly definable but nonetheless important are the changes resulting from the continuous development of user behaviour. More CPU-intensive compilers, interpreters and parsers are used; current applications are optimized; new, often incalculable projects are started; database applications are intensified. All these changes, which frequently overlap in time and lead to a dynamic change of the entire system, have in common that they influence the performance of the computer. The computing center has to be able to control this process, to render account of the efficacy of hardware and software improvements, and to be able to guarantee long-term operability at acceptable conditions (response times, turn-around times). A suitable Benchmark test can serve as a control instrument, used before and after important hardware or software changes, and periodically applied to control the development of computer performance.

2.2. Punctual gain of information (decision)

The gain of information on performance at a certain time is the classical field of application of Benchmark tests. Here, they serve to determine the capability of your EDP system with a view to comparing it to alternative computers. In particular, when decisions concerning the installation of a new CPU have to be made, suitable Benchmark tests offer valuable decision aids, to be preferred to data supplied by the producer because they picture the load profile of your environment and thus permit a better evaluation of the required capacity of an alternative computer [1].

2.3. Gain of planning data (prognosis)

Next to the decision aid when choosing products and the periodically repeated control of computer performance, Benchmark tests may also be used to assist in the prognosis of the maximum load of an installed computer [2]. Dependent on the development trend of the user profile, one can test under which conditions bottlenecks in the existing computer system are likely to occur. Thus, it can be estimated with the aid of Benchmark tests, for instance, how many conversational users with EDITOR applications may additionally be managed by the existing computer system with acceptable response times, and when bottlenecks will enforce restrictions in usage or extensions of, e.g., hardware. Since control, prognosis and decision regarding the performance of an EDP installation are of essential importance for the management of any computing center, and since Benchmark tests are to be regarded as an important aid in this respect, it should be investigated which requirements are to be met by a suitable Benchmark.

3. REQUIREMENTS

The most important prerequisite for the use of a Benchmark test follows from the aims to be attained.

3.1. Representative model for the typical computer load in your center

Generally, there are applications on every general-purpose computer in the field of biomedicine which may be classified as:
- CPU-intensive
- Storage- and thus paging-intensive
- Terminal-intensive
The mix of these 3 types of applications in the total load of the computer has to be modelled as realistically as possible in the Benchmark test. The quality of results of a Benchmark test depends decisively upon the realism of the model, i.e., upon the choice of the programs representing the 3 types of applications and their mutual balance.

3.2. Simulation of terminal sessions

While we can fall back on existing programs in arranging CPU- and storage-intensive applications for the Benchmark test, it is not possible to model the terminal sessions of conversational users without additional expense. For one thing, modelling the terminal sessions by real users in a load test with 20, 40 or 60 conversational users meets with practically insurmountable obstacles; moreover, such a test cannot be reproduced in the same way at another time or place. The terminal sessions will, therefore, have to be pictured by a simulation which must be as realistic, reproducible and flexible as possible with regard to the number of users [3].
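As an illustration (not the tool actually used here), a minimal sketch of such a script-driven simulation; the command strings and the `submit` callback are hypothetical stand-ins for a real terminal interface:

```python
import time

# One dialog script: terminal commands paired with the think time
# (in seconds) a real user would need before entering them.
SCRIPT = [
    ("LOGON USER01", 0),
    ("EDIT DATASET.A", 12),
    ("COPY DATASET.A DATASET.B", 25),
    ("LOGOFF", 8),
]

def replay(script, submit):
    """Replay one scripted terminal session: wait out the recorded
    think time, then hand the command to the terminal interface."""
    for command, think_s in script:
        time.sleep(think_s)   # simulated user think time
        submit(command)

# Running e.g. 20, 40 or 60 such replays in parallel yields a
# reproducible multi-user load that real test users could not provide.
replay(SCRIPT, submit=print)
```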

3.3. Constancy of the model

Benchmark tests should be kept constant in their composition over a longer period of time to remain expressive for controlling the development of the changing computer system. Only in this way can the question of effectiveness (e.g., of a hardware improvement) be quantitatively proven. The requirement of the constancy of the model is also the main reason for picturing the terminal sessions in a simulation, since only in this way can the usual duration of the thinking time of the users be reproduced.

3.4. Manageability of the Benchmark test

To facilitate the practical execution of a Benchmark test, it is sensible to generate a transportable mini operating system and to keep the need for user data as low as possible. It is thus possible to use the Benchmark test designed for development control also for the selection of alternative computer hardware in decisive situations, without large data sets having to be transported. Moreover, if we proceed in this manner, the users remain largely undisturbed in their routine applications and problems of data protection do not arise.

4. DESIGN

The composition of a Benchmark test can be subdivided into:
- Definition phase
- Start phase
- Procedural phase
- Final phase
- Evaluation phase

4.1. Definition phase

The choice of suitable application programs of the 'CPU-intensive' and 'storage-intensive' types belongs to the definition phase. Several programs of the same type are combined in job classes. Furthermore, it is necessary to establish different job classes for both types of programs, some containing only the job execution, while the compilation or both are contained in the other (sub-)classes [4]. The conversational jobs are simulated in a job class of their own. For this purpose, different sequences of terminal commands (scripts) are to be set up, modelling different terminal sessions. In this, the thinking times required of real users have to be taken into consideration. A single dialog script with deterministic thinking times for all users is, however, unsuitable, since this would imply the risk of a synchronization with distorted response times [5]. Instead, it is sensible to use several scripts with thinking times from a pseudo-random number generator. By maintaining the initial values, which are different for each script, the reproducibility of the thinking-time sequences is guaranteed in further Benchmark test runs. The variation of the random numbers for the thinking time must, however, be restricted to avoid senseless thinking times and to reach a sufficient number of script runs. Furthermore, it must be laid down in the definition phase which parameters are to be collected at what frequency, and it must be checked whether the accounting data recorded by the system and the monitor programs are sufficient for the evaluations.
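A minimal sketch of such a think-time source, assuming one fixed seed per script and the 1-60 s bounds used later in the example (function and parameter names are illustrative):

```python
import random

def think_times(seed, n, lo=1.0, hi=60.0):
    """Bounded, reproducible think times for one dialog script.

    A private generator per script keeps different scripts
    desynchronized, while the fixed initial value reproduces the
    same sequence in every further Benchmark run."""
    rng = random.Random(seed)
    return [rng.uniform(lo, hi) for _ in range(n)]

# Different seeds per script avoid synchronization ...
script_a = think_times(seed=101, n=5)
script_b = think_times(seed=202, n=5)
# ... and rerunning with the same seed reproduces the sequence.
assert script_a == think_times(seed=101, n=5)
```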

4.2. Start phase

In the start phase, the beginning of the accounting data recording is important. The 'simultaneous' LOGON of all jobs in the Benchmark test produces a computer load which is atypical for the normal operation to be simulated. Therefore, the accounting data recording should begin only after the end of the start phase, so that only those accounting data enter the later evaluation.
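A minimal sketch of that filtering step, assuming time-stamped accounting records; the 5-min warm-up length is an assumed value, since the paper does not state the duration of its start phase:

```python
from datetime import datetime, timedelta

def usable_records(records, benchmark_start, warmup=timedelta(minutes=5)):
    """Drop accounting records written during the start phase, so
    the atypical mass-LOGON load does not enter the evaluation."""
    cutoff = benchmark_start + warmup
    return [r for r in records if r["timestamp"] >= cutoff]

# Hypothetical record layout: {"timestamp": datetime, "class": str, ...}
start = datetime(1980, 8, 15, 9, 0)
records = [{"timestamp": start + timedelta(minutes=m), "class": "A"}
           for m in (1, 4, 10, 30)]
print(len(usable_records(records, start)))   # 2: only the 10- and 30-min records
```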

4.3. Procedural phase

For the procedural phase we have to decide on the duration of the Benchmark test. Here, two different approaches are possible in principle. In one case, all programs or scripts are run only once: the Benchmark test is completed when the last application program has reached its natural end with LOGOFF. The measure for the performance of the computer system in this case is the time the Benchmark test remains on the computer (elapsed time). In the other case, the elapsed time for the Benchmark test on the machine is fixed and the test is broken off at the end of this measuring period (e.g., 1 h). In that case, the measure for the performance of the EDP system is the number of runs or transactions achieved per job class. The advantages of the latter method, sketched in the code below, are that:
- Shorter dialog scripts and programs can be used;
- The more frequent LOGON and LOGOFF phases are a better adaptation to the normal usage;
- The duration of the test can be planned exactly;
- A uniform utilization of the computer is attained over the entire phase.
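A minimal sketch of the fixed-period variant, with trivial stand-in jobs in place of the real job classes (names and the threading approach are illustrative, not the paper's implementation):

```python
import threading
import time
from collections import Counter

def loop_job(name, job, stop, counts, lock):
    """Re-run one job class in an endless loop; each completed run
    counts toward that class's performance figure."""
    while not stop.is_set():
        job()
        with lock:
            counts[name] += 1

def fixed_period_benchmark(jobs, period_s):
    """Run all job classes 'simultaneously', break off after the
    fixed measuring period (e.g. 1 h) and report runs per class."""
    stop, counts, lock = threading.Event(), Counter(), threading.Lock()
    threads = [threading.Thread(target=loop_job, args=(n, j, stop, counts, lock))
               for n, j in jobs.items()]
    for t in threads:
        t.start()
    time.sleep(period_s)
    stop.set()                       # break off; current runs still finish
    for t in threads:
        t.join()
    return dict(counts)

# Example with stand-in jobs and a 2-s period instead of 1 h:
jobs = {"G": lambda: sum(i * i for i in range(10_000)),   # CPU-intensive
        "F": lambda: time.sleep(0.01)}                    # trivial calls
print(fixed_period_benchmark(jobs, period_s=2))
```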

4.4. Final phase

The disadvantage of this method lies in the fact that in the final phase the break-off of the running programs leads to an inexactitude which causes obscurity in the evaluation. This disadvantage can be partly met by a more frequent recording of accounting data giving information on the state of completeness of the last run of a program (check points).
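A minimal sketch of such check points, assuming each job can report its progress after every step (the granularity is an assumption; the paper says only that accounting data are recorded more frequently):

```python
import threading

def run_with_checkpoints(steps, log, stop):
    """Execute one job run step by step, logging a check point after
    each step; on break-off the log still shows how far the last,
    incomplete run progressed."""
    done = 0
    for step in steps:
        if stop.is_set():
            break                  # measuring period ended mid-run
        step()
        done += 1
        log(done)                  # check point written to accounting data
    return done

# The evaluation can then credit e.g. 3/5 of a run instead of
# discarding the broken-off run entirely.
stop = threading.Event()
run_with_checkpoints([lambda: None] * 5, log=print, stop=stop)
```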

4.5. Evaluation phase

In the evaluation phase the accounting data obtained are presented by job classes in tables or graphs. Together with the monitor data they supply the desired information for the controls, prognoses and decisions concerning the performance of an EDP system which are necessary in every computing center and which increase in value with periodic repetition of the Benchmark test.

5. REALIZATION (EXAMPLE)

At the turn of the year 1979/80 the exchange of the central unit (the IBM 370/158 by an IBM 3032) was planned in our computing center. As a decision aid and to control the effectiveness of this measure, a Benchmark test was worked out according to the above criteria which - on this occasion and subsequently, in the case of minor changes in the system - proved to be a valuable instrument for measuring the development of the performance of the computer system. The job classes used consist of:

Class  Designation                            Type
A      Terminal simulation                    Dialog
B      Assembly                               Dialog
C      FORTRAN compilation                    Dialog
D      PL/I compilation                       Dialog
E      SPITBOL interpreter                    Dialog
F      Subroutine calls                       Dialog
G      Program execution (CPU-intensive)      Dialog
H      Sequential input/output                Dialog
I      Index-sequential input/output          Dialog
K      APL execution                          Batch
L      Program execution (paging-intensive)   Batch
M      PL/I execution                         Batch

To model a representative job mix, the following were included in the standard Benchmark:
In class A: 16 conversational jobs with 8 different scripts of EDITOR and time-sharing (TSS) commands.
In classes B-I: 1 job each to model the conversational-oriented program development (compilations) and production (executions).
In classes K-M: 1 job each to model the batch-oriented production.

In all, the standard Benchmark thus consists of 27 jobs which, with complete CPU utilization, run 'simultaneously' in endless loops till the break-off after 1 h. The number of executed transactions, with think times varying randomly from 1 to 60 s, is considered as evaluation criterion for the simulation of the terminal sessions (class A), while the number of job runs attained serves as the criterion for the remaining job classes (B-M). From the wealth of information obtained by means of the Benchmark test, some of the key numbers determined when comparing the two central processing units (IBM 370/158 and 3032) are presented in table 1 *.

TABLE 1
Performance relation IBM 3032 (4 MB) to 370/158 Model 1 (3 MB) for different job classes

Job class                        Performance relation 3032 to 370/158
A  Terminal simulation           2.1
B  Assembly                      4.0
C  FORTRAN compilation           3.0
D  PL/I compilation              2.6
E  SPITBOL interpreter           1.3
F  Subroutine calls              1.4
G  Program (CPU-intensive)       4.9
H  Sequential I/O                1.1
I  Index-sequential I/O          1.3
K  APL program                   26.3
L  Program (paging-intensive)    9.4
M  PL/I program                  17.6
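Each entry of table 1 is, in effect, the ratio of the runs or transactions completed in the fixed measuring period on the two machines. A minimal sketch of that evaluation step, with invented run counts (the paper reports only the resulting ratios):

```python
def performance_relation(runs_new, runs_old):
    """Per-class performance relation: work completed on the new CPU
    divided by work completed on the old one, both within the same
    fixed measuring period."""
    return {cls: round(runs_new[cls] / runs_old[cls], 1) for cls in runs_new}

# Hypothetical run counts per 1-h Benchmark (not the paper's data):
runs_370_158 = {"B": 12, "G": 10, "K": 3}
runs_3032    = {"B": 48, "G": 49, "K": 79}
print(performance_relation(runs_3032, runs_370_158))
# {'B': 4.0, 'G': 4.9, 'K': 26.3} -- ratios chosen to match table 1
```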

The values in the region of the conversational-oriented jobs (classes A-I) show an improvement in performance leading to roughly 2.5-times shorter response times. The drastic improvement in performance with the batch jobs is based on the fact that these jobs, due to the internal priority control of the scheduler in favour of the conversational jobs, have little chance on the less powerful 370/158, but a better chance on the 3032. Similarly expressive Benchmark tests have meanwhile been carried out to determine the maximum load of the installed computer, to prove the effectivity of an extension of the fixed-head storage for paging, and to control the effectivity of a new release of the operating system [6]. In spite of initial criticism, the Benchmark tests proved to be an indispensable tool for the management of change in a computing center.

* Determined and presented at the SHARE Meeting 55, TSS Performance Workshop, Atlanta (USA), 15-16 August 1980 by Mrs Hanna Hahne, German Cancer Research Center, Heidelberg.

REFERENCES

[1] R. Posch, G. Haring, G. Gell and C. Leonhardt, Ein leistungsorientiertes Bewertungsverfahren zur Auswahl von Großrechnern, Angew. Informat. (Appl. Informat.) 2 (1979) 47-52.
[2] G. Haring, Zur Ermittlung der Residualkapazität von Rechnersystemen, Angew. Informat. (Appl. Informat.) 8 (1981) 346-353.
[3] K. Sreenivasan and A. Kleinmann, Construction of a Representative Synthetic Workload, Commun. ACM 16 (1974) 247-253.
[4] A. Duey and J. Fang, Benchmarking the IBM 3033, Datamation 11 (1978) 120-122.
[5] H. Mühlenbein, Ein Vorschlag zur Leistungsbeschreibung von interaktiven DV-Systemen, in: Informatik-Fachberichte, W. Brauer, ed., vol. 15, pp. 213-229 (Springer-Verlag, Berlin, New York NY, 1978).
[6] D. Schtiefer, Leistungsbewertungsstrategien für Timesharing-Systeme, in: Informatik-Fachberichte, W. Brauer, ed., vol. 46, pp. 52-63 (Springer-Verlag, Berlin, New York NY, 1981).

APPENDIX

The following scheme gives a short characterization of the programs in the Benchmark test set and approximately shows the typical workload of the system.

Short description of Benchmark jobs (average CPU time/run measured on the IBM 3032)

Class A: Simulation of a typical terminal session (trivial commands, i.e. copying of data sets, editing of short data sets with REDITOR)
  Type: Interactive; Size: 85-95 input lines (commands and data for REDITOR); CPU time: 7-9 s/script

Class B: Assembly (CZCJT; TSS task monitor)
  Type: Interactive; Size: 4713 lines; CPU time: 19-20 s

Class C: FORTRAN compilation (spectroscopies: plots of spectroscopies, molecular structures)
  Type: Interactive; Size: 1195 lines; CPU time: 9.0-9.5 s

Class D: PL/I compilation (animal laboratory information system ALIS)
  Type: Interactive; Size: 779 lines; CPU time: 8.5-9.0 s

Class E: SPITBOL compilation (formatted strings)
  Type: Interactive; Size: 1848 lines; CPU time: 1.6-1.7 s

Class F: Subroutine calls (synthetic Benchmark program)
  Type: Interactive; Size: 2 × 150 calls of a trivial PL/I subroutine and a function; CPU time: 2.6-2.8 s

Class G: CPU-intensive execution (synthetic Benchmark program)
  Type: Interactive; Size: 3 × 5 extensive arithmetic operations in PL/I; CPU time: 10.1-10.3 s

Class H: Sequential I/O (synthetic Benchmark program)
  Type: Interactive; Size: VS-dataset, fixed record length of 2000 bytes; 2 × 350 records reading and 500 records writing; CPU time: 2.5-2.7 s

Class I: Index-sequential I/O (synthetic Benchmark program)
  Type: Interactive; Size: VI-dataset of 213 pages, variable record length of max. 132 bytes, key length 7 bytes; 3 × 2000 records reading with random key; CPU time: 17-18 s

Class K: APL execution for calculation of distributions (paging-intensive with increasing CPU-intensity dependent on size)
  Type: Batch; Size: maximal size of matrix = 12!/(1! × 2! × 2!); CPU time: 93-96 s

Class L: Execution of a paging-intensive program (synthetic Benchmark program)
  Type: Batch; Size: 2 × processing of a 320 × 320 matrix, where elements (J, I) are assigned to elements (I, J); CPU time: 2.0-3.5 s

Class M: Execution of a CPU- and storage-intensive PL/I program (analysis of secondary structures of DNA sequences)
  Type: Batch; Size: 2 × (230 × 230) matrices; CPU time: 30-32 s