Performance evaluation principles for vector- and multiprocessor systems

Ulrich HERZOG

Institute for Mathematical Machines and Data Processing, Friedrich-Alexander-University Erlangen-Nürnberg, Chair for Computer Architecture and Performance Evaluation, D-8520 Erlangen, Fed. Rep. Germany

Parallel Computing 7 (1988) 425-438, North-Holland

Abstract. Skilful computer system measurement, modelling and performance evaluation techniques are needed for supercomputer architectures. They make it possible to determine characteristic performance values accurately and to find potential hardware and software bottlenecks; they also help to distribute and schedule user tasks efficiently. This paper is an extended version of a tutorial contribution at the IEEE CompEuro 87 and surveys fundamental performance issues and their solution for supercomputer architectures.

Keywords. Vector computer, multiprocessor systems, performance measures, monitoring techniques, modelling memory interference, modelling synchronization problems.

1. Introduction

Performance evaluation means to describe, to analyze and to optimize the flow of data and control information in computer systems. Performance characteristics such as utilization, throughput or response time give information about the efficiency of the hardware/software structure and allow the detection and elimination of bottlenecks. Therefore, performance modelling and evaluation is needed from the initial conception of a system's architectural design to its daily operation after installation [1,23,35].

Today's general purpose supercomputers are SIMD machines, mainly vector processors. At the moment, however, we are on the threshold of a new era in computer architecture: many of the arguments for processing a single instruction at a time no longer apply; the MIMD principle, i.e. parallel processing, is getting more attractive for various reasons [2,11,31]. Supercomputers trying to combine both principles are already commercially available [21,26,37], and many projects are known favouring massively parallel architectures.

This paper first surveys common performance measures. We then give an overview of the fundamentals of measurement and modelling techniques.

2. Performance measures

Computer performance depends on many factors: on the architecture and realization of the hardware and system software, on the configuration of the system to be investigated, and, last but not least, on the workload to be handled by the computer. There is not just one general measure describing all these factors. And, depending on our position, we may be interested in global performance measures only, in a detailed analysis of intrinsic dependencies in the system, or in both. We first describe general performance measures and then focus our attention on particulars of vector and parallel processors.


2.1. General measures

Probably the best known measure is the Mips rate, the millions of instructions per second a processor can initiate or fetch. The Mips rate is defined by

Mips = 1 / (n · t_c)

where t_c = cycle time and n = number of cycles per instruction. Hence a computer with a 100 ns cycle time and four cycles per instruction is a 2.5 Mips computer. Different instructions usually need a different number of cycles to complete, so, dependent on the load profile, the Mips rate varies considerably. Standardized instruction mixes (Gibson-Mix, GAMM-Mix, etc.) allow, however, the determination of a mean Mips rate for a particular profile.

The Mips measure does not include a component which covers the word length. Therefore, it was proposed to multiply the Mips rate by the word length to obtain a number of logical operations per second, Lops, or Glops (GigaLops = 10^9 logical operations per second). Sometimes this measure is named memory bandwidth. The Lops rate is defined by

Lops = Mips · w

where w = word length.

Experience with early computers demonstrated that the Mips (and Glops) rating of computers gave a poor measure of relative performance on scientific computations, because the value of the supporting hardware facilities was not estimated. The historical use of floating-point operation counts as a measure of complexity in numerical algorithms suggested the use of Mflops (MegaFlops = millions of floating-point operations per second) or Gflops (GigaFlops = billions of floating-point operations per second). Note, however, that performance ratings in Mflops are also often highly misleading: they do not include a component which takes into consideration parallelism in the hardware, the length of the data vectors and other important aspects [33,35]. Moreover, vendors frequently advertise or cite the peak performance of their machines. These rates are rarely sustainable in the context of real applications; they may also vary significantly from program to program. "Peak performance is that level of performance that the manufacturer guarantees cannot be exceeded!" [10,35], cf. also Fig. 1.

The failure of all of the above yardsticks suggests concentrating on comparing the performance of real problems, using the most appropriate known algorithm for each type of system [33]. The adequate measures are the total elapsed time and the total CPU time for each of the kernels, benchmarks or scripts representing the real workload of the system. Kernels are characteristic code pieces or subroutines for mathematical functions. Benchmarking means running a single real program or, usually, a selection of real programs on the system. Scripts are synthetic workload descriptions simulating the user terminals' behaviour in a dialog system.

Besides comparing total elapsed time and CPU time, the utilization of the various components is most important, e.g. the CPU utilization, channel activity, etc. These measures may show whether a system is well balanced or not. The utilization is defined as the ratio between the time a component is used during a given interval of observation and the duration of the interval.
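These definitions translate directly into code. The following minimal sketch (the function names are mine, not the paper's) reproduces the 2.5 Mips example above:

```python
def mips_rate(cycle_time_s: float, cycles_per_instruction: float) -> float:
    """Mips = 1 / (n * t_c), expressed in millions of instructions per second."""
    return 1.0 / (cycles_per_instruction * cycle_time_s) / 1e6

def lops_rate(mips: float, word_length_bits: int) -> float:
    """Lops = Mips * w: logical operations per second (here in millions)."""
    return mips * word_length_bits

# The paper's example: 100 ns cycle time, four cycles per instruction -> 2.5 Mips.
mips = mips_rate(100e-9, 4)
print(f"{mips} Mips, {lops_rate(mips, 32)} MLops at 32-bit word length")
```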


Fig. 1. Supercomputers [37]: peak performance (roughly 10 Mflops up to the 10 Gflops range) of successive supercomputer generations over the years 1970-1990, from the ASC up to the Cray X-MP/2, Hitachi S-810/20, Cray-3 and announced future systems.

Performance evaluation in the commercial world is most often characterized in terms of the throughput rate, the volume of information processed by the system per time unit, expressed in units of work per second. Typical examples are jobs/s, transactions/s or messages/s. When investigating interactive systems, the response time (mean and distribution) allows one to measure the sensitivity of a system to individual jobs. Response time, often called turnaround or reaction time, is the time between the presentation of an input to the system and the appearance of the corresponding output [7]. It includes processing time, system overhead and waiting time; it obviously depends on the overall workload of the system.

2.2. Additional measures for vector processors

State-of-the-art supercomputers are complex systems whose performance measurement strongly depends on the workload. No amount of measurement effort will overcome the incorrect conclusions that come from an improper workload characterization. Thus the careful selection of kernels, subroutines and benchmarks is vital for correct performance evaluation [7,26,35]. When benchmarking traditional mainframe machines, the total elapsed time may vary by a factor of two or three depending on the characteristics of the workload (e.g. numeric or nonnumeric). However, when benchmarking mainframe machines with vector add-on facilities or true vector processor machines, a difference of a factor of ten or more is possible! The reason for these variations is differences in the amount of vectorization of the codes. Some typical examples of benchmark results for the total elapsed time (in seconds) are shown in Table 1 [46].

Table 1
Benchmark results: total elapsed time (in seconds) [46]

                        VP 200               1-PE CRAY X-MP
Test                    Scalar    Vector     Scalar    Vector
VORTEX I = 500           217.2      34.4      233.6      37.8
EULER I = 1000             6.3       4.8        9.0       3.1
2DMHD                     43.4       2.6       39.2       4.3
SHEAR 3                  164.4      83.6      190.3      72.7
BARO                    1107.8      41.1      756.9      76.3

For a particular workload (with a given fraction of vectorization) the effective speedup S is well estimated by Amdahl's famous approximation [3]:

S = 1 / ((1 - f) + f/k)

where f = the fraction of vectorized code and k = the speed of the vector unit relative to the scalar unit (which depends, strictly speaking, on other parameters, e.g. the vector length). Figure 2 shows the relative performance of two IBM mainframes with and without vector facility (VF) as a function of the degree of vectorization. Figure 3 shows performance estimates for some mainframe and vector machines [44]. The spectrum is relatively flat over a wide range of vectorization, increasing in steepness only at the high end, i.e. the slower (scalar) mode will dominate the overall performance unless a program is almost entirely vectorized. Amdahl's 'law' captures essential features of the workload and the CPU architecture. Note, however, that the overall performance also depends on the I/O system, the operating system and, last but not least, on the compiler. Measured performance values for different types of systems and different types of programs are presented in several publications, for example in [21,26,42,46].

Remark. There are interesting two- and three-parameter descriptions of computer performance; they are, however, not yet popular [18].
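Amdahl's approximation is easy to experiment with. The short sketch below (my own helper, not from the paper) shows the behaviour of Figs. 2 and 3: even a fast vector unit helps little until f approaches 1.

```python
def amdahl_speedup(f: float, k: float) -> float:
    """Effective speedup S = 1 / ((1 - f) + f / k) for a workload with
    vectorized fraction f and a vector unit k times faster than scalar."""
    return 1.0 / ((1.0 - f) + f / k)

# Even with a 10x vector unit, 80% vectorization yields less than 4x overall:
for f in (0.0, 0.5, 0.8, 0.95, 1.0):
    print(f"f = {f:.2f}: S = {amdahl_speedup(f, 10.0):.2f}")
```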

Fig. 2. Interpretation of Amdahl's law for mainframes (IBM 3090-180VF and 3090-200VF) [44]: relative performance versus degree of vectorization (0.0 to 1.0).

Fig. 3. Interpretation of Amdahl's law for supercomputers (e.g. VP 200) and a mainframe with vector facility [44]: relative performance versus degree of vectorization (0.0 to 1.0).


2.3. Additional measures for parallel processors

Parallel processing is not a new idea in computer architecture; rather, it is an idea whose time has come. A number of research and development projects are underway to configure 25, 64 or more one-Mips microprocessors into an MIMD multiprocessor. The goal is to overcome technological constraints and to yield supercomputer performance on selected applications, but at the cost of a small mainframe [34]. Promises and accomplishments of parallel processing, as well as the problems and work that remain, are discussed in many papers [2,34].

Parallel processor performance is mainly discussed by means of speedup diagrams. Let T_p be the time required to perform some calculation using p parallel processors. Then the speedup S is defined by

S = T_1 / T_p.

The ideal performance characteristic for a p-processor system, on which a given application problem could be partitioned into p subtasks, would be the linear relationship shown in Fig. 4, where the speedup is equal to the number of processors. Overhead due to interprocessor communication, coordination and synchronization problems usually does not allow this ideal behaviour to be attained. Related performance issues are communication delay (due to software mechanisms or communication switches), shared memory interference or hot-spot contention, and cache coherency delays [2,35]. There have been several conjectures about the actual performance gain of parallel processors, by Minsky, Lee, Kuck and many others [6,20,34]. Amdahl's law is invoked today to argue that parallel processors will never be competitive with sequential machines [37]. However, measured data of several authors [11,13,27,38] demonstrate clearly that there are classes of applications where we may come close to linear speedup: large scientific and engineering calculations, VLSI design automation, database operations, artificial intelligence, etc. [2]. Figures 5(a) and (b) show examples of such applications. In some exceptional cases we may even obtain a performance better than the above-stated 'ideal' behaviour [12,32].
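The speedup definition, together with the derived efficiency S/p, is computed directly from measured run times. The following is a minimal sketch with hypothetical measurement values (not data from the paper):

```python
def speedup(t1: float, tp: float) -> float:
    """S = T_1 / T_p: run time on one processor over run time on p processors."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency E = S / p; E = 1.0 corresponds to the ideal linear speedup."""
    return speedup(t1, tp) / p

# Hypothetical measurements: 100 s on 1 processor, 7.2 s on 16 processors.
print(speedup(100.0, 7.2), efficiency(100.0, 7.2, 16))  # ~13.9, ~0.87
```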

Fig. 4. Ideal speedup and Amdahl's law: speedup versus the number of parallel processors p (2 to 16); the ideal linear characteristic is contrasted with Amdahl curves for various degrees of parallelism.

Fig. 5. Speedup S measured for p processors on DIRMU 25 [27], compared with S_ideal (up to p = 25): (a) asynchronous solution of PDEs; (b) median filtering of digital images.

"While the world around us works in parallel, our perception of it has been filtered through 300 years of sequential mathematics, 50 years of the theory of algorithm and 30 years of FORTRAN programming [34]".

3. Instrumentation and measurement

When monitoring the behaviour of computer systems there are two different types of objectives: the measurement of characteristic performance values and the observation of the dynamic behaviour within the system. Performance measurements give information about the speed of a computer system, the utilization of hardware and software components and other global performance characteristics discussed in the preceding section. Beyond that, the idea of observing the dynamic behaviour is to uncover the cooperation over time of hardware and software components and to find explanations for the measured performance values [22].

The special objectives of performance monitoring for supercomputers are
- the correct recording of concurrent hardware and software events within or between all components of the system (components may run asynchronously!),
- the analysis and evaluation of these events to find system bottlenecks and to check program (task) decomposition, subtask allocation and resource scheduling strategies,
- the visualization of key results for system management and application programmers, and
- the validation of performance evaluation models.

Both types of monitoring, measurement and observation, may be done either electronically (hardware monitor), by special system programs (software monitor) or by a combination of both (hybrid monitor). Efficient data compaction tools and data analysis packages are most important. We next summarize the basic principles of hardware and software monitors.

- Hardware monitoring: With the hardware-measurement approach the system is typically examined by means of an independent set of hardware probes: binary information is gathered and transferred to event counters or to automatic comparators. The recorded data may incidentally be used for utilization statistics, but the main objective is often to get deep insight into the dynamic flow of data and control information. Figures 6 and 7 show typical examples of such process traces in a parallel processor environment [9,19]. The main advantage of hardware monitoring is that it provides data which are not corrupted by the operation of the monitoring system itself. It also provides data at micro-level which could not readily be obtained by other means. On the other hand, hardware monitoring requires deep insight into both system hardware and software, and the interpretation of the system-oriented event traces is only possible for experienced people.

Fig. 6. Process trace describing processor communication in the EGPA multiprocessor [9]: coordinating process and idle phases on the A- and B-processors, with an interrupt from B (time scale of milliseconds).

Fig. 7. Process trace describing processor synchronization in a DIRMU multiprocessor configuration solving PDEs by the asynchronous Gauss-Seidel algorithm: start, iteration and termination events for the master and three slave processes.

- Software monitoring: Data are gathered by special programs which are added to and run upon the examined system. The common method is to embed data gathering programs within the operating system which selectively monitor events of interest. The main advantage of this user-oriented approach is that it can be directly coupled to the software load and thereby trace operation at macro-level in synchronization with major events. The interpretation of results is possible also for application programmers. However, this approach has the disadvantage of corrupting the statistics which it is attempting to measure; we have to be aware that the measurement procedure itself uses some percentage of system resources [36,39].
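As a toy illustration of the software-monitoring principle (entirely my own sketch, not the instrumentation used at Erlangen), a probe routine can timestamp selected events into a trace buffer for later compaction and analysis. Note that every probe call itself consumes system resources, which is exactly the perturbation mentioned above:

```python
import time

trace = []  # (timestamp_ns, processor_id, event) records, analyzed offline

def record_event(processor_id: int, event: str) -> None:
    """Software probe: log an event of interest with a timestamp.
    Each call steals some CPU time from the measured system itself."""
    trace.append((time.perf_counter_ns(), processor_id, event))

# Example: instrumenting a task on processor 0.
record_event(0, "subtask start")
sum(range(100_000))            # the work being measured
record_event(0, "subtask end")
print(trace)
```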

4. Modelling

4.1. General remarks

There are various possibilities to describe the flow of data and control information within and between the components of a computer system. The most important concepts are
(1) The graph-theoretic description by means of nodes, branches and flows. Efficient algorithms are available to get first estimates of the maximum flow.
(2) The probabilistic description by means of service models and stochastic processes. A more detailed modelling of the actual system behaviour is possible and many performance values may be investigated, such as throughput, buffer utilization, response time, ..., mean values, higher moments and distribution functions. Probabilistic models may be evaluated either by mathematical tools or, if they are very complex, by means of simulation.
(3) The description by means of evaluation networks, detailed flow diagrams, etc. Here a very precise description of the actual traffic flow is possible, especially in subsystems. However, up to now no mathematical tools are available and the evaluation is only possible by means of simulation.
(4) Combinations of the above methods.
We observe that the more elaborate the description, the more difficult the performance statements.

Modelling the information flow for complex systems such as vector or parallel processors, we are often faced with the following problem: we have to find a modelling technique which allows us to describe and analyze the system behaviour in a transparent manner while still capturing the micro-operations within, and the interdependencies between, components. This means that we have to find models which are
- simple, for reasons of an efficient and obvious performance evaluation, and
- accurate, describing the actual flow of information.
The key to the solution can only be a modular approach, a hierarchical modelling technique. Such models have been introduced and successfully applied. For an illustration of this technique, see [5,15,23,25,43,45] and Fig. 8. There are, however, many open questions.

Fig. 8. Example of a hierarchy of models [23]: macro-level model of a time-sharing system (terminals, job scheduler, multiprogramming system; time between events on the order of seconds), intermediate-level model of the multiprogramming behaviour (processors; milliseconds), and micro-level model of the memory access of processors (processors and memory modules; micro- or nanoseconds).

In order to derive performance measures from such models there are two possibilities:
(1) The mathematical method, which allows the exact or approximate investigation of the performance characteristics of components or of the overall system (a small example is sketched below).
(2) The simulation of the structure and operating modes by means of special programs or simulation languages.
The main advantage of simulation compared to the mathematical methods is twofold: a very detailed modelling and evaluation of the actual traffic flow is possible, and the investigation of new problems is straightforward. This is by no means true for mathematical methods. On the other hand, the problem of system engineers is not only the analysis but also the synthesis of computer systems, and real optimization at reasonable expense is only possible with mathematical methods.

In this section we concentrate on two important problems of parallel processors: shared memory interference and synchronization of subtasks. Many other problems have been investigated at different places, and new results are published regularly in the relevant conference proceedings and journals.
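As a small illustration of the mathematical route, the closed-form results of a textbook M/M/1 service model (my own example, not one from the paper) deliver utilization and response time instantly, where a simulation would need many runs:

```python
def mm1_measures(arrival_rate: float, service_rate: float) -> dict:
    """Classical M/M/1 service model: one server, Poisson arrivals,
    exponential service times. Stable only for utilization rho < 1."""
    rho = arrival_rate / service_rate          # server utilization
    if rho >= 1.0:
        raise ValueError("unstable: arrival rate >= service rate")
    return {
        "utilization": rho,
        "mean_jobs_in_system": rho / (1.0 - rho),
        "mean_response_time": 1.0 / (service_rate - arrival_rate),
    }

# E.g. 8 jobs/s offered to a server handling 10 jobs/s:
print(mm1_measures(8.0, 10.0))   # rho = 0.8, ~4 jobs, 0.5 s response time
```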


4.2. Modelling memory interference

Memory conflicts may occur whenever two or more processors attempt to gain access to the same memory module simultaneously. The effect of memory conflicts, referred to as memory interference, may decrease the execution rate of each processor significantly. Much attention has been paid to the analysis of this phenomenon; it may be summarized as follows [4,8,9,28,41].

Processor behaviour is described as a stochastic process. Each memory access is followed by a certain amount of processing time. No distinction is made between a memory access to fetch instructions and a memory access to fetch or store operands, nor between the processing time to decode an instruction and the processing time corresponding to its execution. This simplification results in a 'unit instruction', which was first proposed by Strecker [41]. The rewrite time need not be considered because it is overlapped with the next cycle. T_a and T_p, the memory access and processing times of the unit instructions, are assumed to be discrete or continuous random variables. If two or more processors simultaneously request the same memory unit, only one of these requests can be served. There exist no priorities between the processors; each has equal probability of success. The deferred processors are queued up to be served in subsequent memory cycles. Combining these assumptions with the actual structure of our multiprocessor system, the queuing network model of Fig. 9 reflects the system behaviour: three processors access their own as well as other memories.

The interference measure I shows by how many percent the expected execution time E[T_i'] of an instruction i in a conflicting situation exceeds its expected conflict-free execution time E[T_i]:

I = ((E[T_i'] - E[T_i]) / E[T_i]) · 100.

It can be shown that the interference I depends on the actual values of the memory access time T_a and the processing time T_p only through the instruction service time ratio γ = E[T_p]/E[T_a] and the access probability p_a to each memory block. Embedding these results from the memory interference model into the classical multiserver results, realistic speedup values may be obtained.

Fig. 9. Modelling of memory conflicts for a symmetric multiprocessor system (processing elements, memories): a queuing network in which each of three processors accesses its own as well as the other memories.

Figure 10 shows speedup values for an extreme example: all processors access the same memory block. It demonstrates how each additional processor contributes relatively less to the overall performance; the curve may even drop again, which is well known as the famous breakdown phenomenon of multiprocessor projects with an unbalanced hardware and/or software structure.

Fig. 10. Speedup S for multiprocessor configurations taking into account memory conflicts, plotted over the number of processors (extreme example where all processors access the same memory block; γ = instruction service time ratio).
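The flavour of such memory-interference results can be reproduced with a crude Monte Carlo sketch (my own simplification of the unit-instruction model, not the paper's analysis; it ignores the processing phase T_p, i.e. processors re-request immediately). With m = 1, i.e. all processors contending for one module as in Fig. 10, the interference grows as (p - 1) · 100%:

```python
import random

def memory_interference(p: int, m: int, n_cycles: int = 200_000,
                        seed: int = 1) -> float:
    """Toy discrete-time simulation: each processor always has one
    outstanding request to a uniformly chosen module, one request per
    module is served per memory cycle, and losers retry with equal
    probability of success. Returns interference I in percent: extra
    time per instruction versus the conflict-free one-cycle case."""
    rng = random.Random(seed)
    pending = [rng.randrange(m) for _ in range(p)]  # module wanted per CPU
    instructions = 0
    for _ in range(n_cycles):
        by_module = {}
        for cpu, module in enumerate(pending):
            by_module.setdefault(module, []).append(cpu)
        for module, cpus in by_module.items():
            winner = rng.choice(cpus)           # equal probability of success
            pending[winner] = rng.randrange(m)  # next unit instruction
            instructions += 1
    return (p * n_cycles / instructions - 1.0) * 100.0

# Extreme case of Fig. 10: all processors contend for the same memory block.
for p in (1, 2, 3, 4, 5):
    print(p, "processors:", round(memory_interference(p, m=1), 1), "%")
```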

4.3. Modelling synchronization problems

In the classical modelling technique one usually assumes concurrent processes to be independent of each other; on the other hand, it is also standard to assume that processes which depend on each other, e.g. I/O and CPU phases, take a sequential turn. Little research considers I/O and CPU overlap and recognizes that programs may be decomposed into well-defined cooperating subtasks and processed concurrently. Modern multiprocessor projects take advantage of the inherent parallelism of many application programs. Therefore, the response time for each application may be reduced drastically. Then, however, difficult coordination problems may occur and have to be considered in modelling such systems [14,16,17,29,30,40,43]:
- synchronization between tasks and subtasks,
- process communication delays, and
- data and code sharing problems (see Section 4.2).
We only discuss the basic synchronization problem. Given a program structure as shown in Fig. 11; problems are often of this type: algorithms for the solution of linear algebraic or partial differential equation systems, optimization procedures, simulations including subruns for the purpose of estimating confidence intervals, problems of picture processing, etc.

Fig. 11. Type-1 program structure [16,17].

A possible implementation on a hierarchically organized multiprocessor such as EGPA [13] is illustrated by the timing diagram in Fig. 12:
- At first the source program is translated, loaded and then started by the coordinating B-processor.
- The B-processor initiates the execution of n independent subtasks by the application processors A1 to An.
- Having completed its subtask, each A-processor asynchronously sends a message to the B-processor.
- Postprocessing and preparation of a new loop cycle by the B-processor is only possible when all subtasks are completed.
Such synchronization problems are typical for progressive parallel processor concepts. They have never been modelled and analyzed in classical performance evaluation theory. They have, however, been a major research topic for some years, and quite some results are available today [14,16,17,29,30,40,43].

Fig. 12. Timing diagram: B-periods of the coordinating B-processor alternate with A-periods of the application processors A1 to An over successive loop cycles; the total service time is T_T.
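The cost of this synchronization can be made concrete with a small Monte Carlo sketch (my own illustration; exponentially distributed subtask times and the parameter values are arbitrary assumptions, not from the paper). The B-processor must wait for the slowest of the n subtasks, so the expected cycle time grows with n even though the mean subtask time stays fixed:

```python
import random

def mean_cycle_time(n: int, mean_subtask: float = 1.0,
                    b_overhead: float = 0.2, runs: int = 100_000,
                    seed: int = 1) -> float:
    """Expected length of one loop cycle of the Fig. 11/12 structure:
    a B-period (postprocessing/preparation) plus the maximum of n
    independent subtask times, since all subtasks must complete."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        slowest = max(rng.expovariate(1.0 / mean_subtask) for _ in range(n))
        total += b_overhead + slowest
    return total / runs

# The synchronization penalty: for exponential subtasks the wait for the
# slowest grows like the harmonic number H_n, so doubling n lengthens the
# cycle even though each subtask's mean stays at 1.0.
for n in (1, 2, 4, 8, 16):
    print(n, round(mean_cycle_time(n), 2))
```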


5. Summary

We briefly summarized performance measures, monitoring techniques and modelling methods. Research is progressing fast in all areas, and experience with commercial supercomputers and experimental systems gives new insight: new performance measures have been proposed, people try to combine the advantages of both hardware and software monitoring, and there are many ideas for accurately modelling the dynamic behaviour of supercomputers. We also tried to give an overview of the related literature. However, a great deal of new techniques and new research results appears regularly in the relevant conference proceedings and journals.

References

[1] A. Agrawala and U. Herzog, eds., Performance Evaluation of Multiple Processor Systems, Special Issue, IEEE Trans. Comput. 32 (1) (1983).
[2] G. Almasi, Overview of parallel processing, Parallel Comput. 2 (3) (1985) 191-203.
[3] G. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conf. Proc. 30 (1967) 483-485.
[4] D.P. Bhandarkar, Analysis of memory interference in multiprocessors, IEEE Trans. Comput. 24 (1975) 897-908.
[5] P. Courtois, Decomposability: Queuing and Computer System Applications (Academic Press, New York, 1977).
[6] D. Eager, J. Zahorjan and E. Lazowska, Speedup versus efficiency in parallel systems, Technical Report 86-08-01, University of Washington, 1986.
[7] D. Ferrari, Computer Systems Performance Evaluation (Prentice-Hall, Englewood Cliffs, NJ, 1978).
[8] H.J. Fromm, Modellierung und Analyse der Speicherinterferenz in hierarchisch organisierten Multiprozessorsystemen, Informatik Fachberichte 41 (Springer, Berlin, 1981) 212-226.
[9] H.J. Fromm, U. Hercksen, U. Herzog, K.H. John, R. Klar and W. Kleinöder, Experiences with performance measurement and modelling of a processor array, in: [1] 15-31.
[10] J. Hack, quotation from [35].
[11] W. Händler, Computer architecture and applications: Complexity and flexibility, Comput. Artificial Intelligence 3 (1984) 79-104.
[12] W. Händler and H. Hessenauer, Supralinear speedup, private communication and lectures, University of Erlangen-Nürnberg, 1986/87.
[13] W. Händler, U. Herzog, F. Hofmann and H.J. Schneider, Multiprozessoren für breite Anwendungsbereiche: Erlangen General Purpose Array, Informatik Fachberichte 78 (Springer, Berlin, 1984) 195-208.
[14] P. Heidelberger and K.S. Trivedi, Analytic queuing models for programs with internal concurrency, IEEE Trans. Comput. 32 (1983) 73-82.
[15] U. Herzog, Modelling the dynamic behaviour in networks (Modellierung des Ablaufgeschehens in Netzen), Lectures and Tutorials Bd. 8, Informationsverarbeitung und Kommunikation (Oldenbourg, 1979) 175-208.
[16] U. Herzog and W. Hofmann, Synchronization problems in hierarchically organized multiprocessor computer systems, Proc. 4th International Symposium on Modelling and Performance Evaluation (North-Holland, Amsterdam, 1979) 29-48.
[17] U. Herzog, W. Hofmann and W. Kleinöder, Performance modelling and evaluation for hierarchically organized multiprocessor computer systems, IEEE Conference on Parallel Processing, Bellaire, MI (1979) 103-114.
[18] R. Hockney, Parameterization of computer performance, IBM Institute on Supercomputers, Oberlech, 1986; see also: (r∞, n1/2, s1/2) measurements on the 2-CPU CRAY X-MP, Parallel Comput. 2 (1) (1985) 1-14.
[19] R. Hofmann, R. Klar, N. Luttenberger and B. Mohr, ZÄHLMONITOR 4: Ein Monitorsystem für Hardware- und Hybrid-Monitoring von Multiprozessor- und Multicomputer-Systemen, Informatik Fachberichte 154 (Springer, Berlin, 1987) 79-99.
[20] W. Hofmann, Warteschlangenmodelle für Parallelverarbeitung, Dissertation, Universität Erlangen-Nürnberg, Arbeitsberichte des IMMD Bd. 11, No. 17, 1978.
[21] K. Jordan, Performance comparison of large-scale scientific computers, Computer (March 1987) 10-23.
[22] R. Klar, Hardware/software-monitoring, Informatik-Spektrum 8 (1) (1985) 37-38.
[23] H. Kobayashi, Modelling and Analysis: An Introduction to System Performance Evaluation Methodology (Addison-Wesley, Reading, MA, 1978).
[24] Kowalik, ed., High Speed Computation, NATO ASI Series F7 (1984).
[25] E. Lazowska, J. Zahorjan, G. Graham and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queuing Network Models (Prentice-Hall, Englewood Cliffs, NJ, 1984).
[26] O. Lubeck, J. Moore and R. Mendez, A benchmark comparison of three supercomputers: Fujitsu VP200, Hitachi S810/20, and Cray X-MP/2, Computer (December 1985) 10-24.
[27] E. Maehle, K. Wirl and D. Japel, Experiments with parallel programs on the DIRMU multiprocessor kit, in: M. Feilmeier, G.R. Joubert and U. Schendel, eds., Parallel Computing 85 (North-Holland, Amsterdam, 1986).
[28] J.W. McCredie, Analytic models as aids for multiprocessor design, Proc. 7th Annual Princeton Conference on Information Science Systems (1973) 186-191.
[29] B. Mueller-Clostermann, Synchronized queuing networks: Concepts, examples and evaluation techniques, Informatik Fachberichte 154 (Springer, Berlin, 1987) 176-191.
[30] R. Nelson, D. Towsley and A. Tantawi, Performance analysis of parallel processing systems, Proc. ACM Sigmetrics (1987).
[31] C. Norrie, Supercomputers for superproblems: An architectural introduction, Computer (March 1984) 62-74.
[32] D. Parkinson, Parallel efficiency can be greater than unity, Parallel Comput. 3 (1986) 261-262.
[33] D. Parkinson and H. Liddell, The measurement of performance on highly parallel systems, in: [1] 32-37.
[34] P. Patton, Multiprocessors: Architecture and applications, Computer (June 1985) 29-40.
[35] G. Paul and J. Martin, Aspects of performance evaluation in supercomputers and scientific applications, International Workshop on Modelling Techniques and Performance Evaluation, Paris, 1987.
[36] A. Rafii, Structure and application of a measurement tool: SAMPLER/3000, ACM Sigmetrics (1981) 110-120.
[37] L. Richter, Supercomputer: Prinzipien, Entwicklung, Stand und Perspektiven, PIK 9 (1986) 8-15.
[38] C. Seitz, Concurrent VLSI architectures, IEEE Trans. Comput. 33 (12) (1984) 1247-1265.
[39] J. Shemer and J. Robertson, Instrumentation of time-shared systems, Computer (July/August 1972) 39-48.
[40] E. de Souza e Silva and R.R. Muntz, Approximate solutions for a class of non-product form queuing network models, Performance Evaluation 7 (3) (1987) 221-242.
[41] W.D. Strecker, Analysis for the instruction execution rate in certain computer structures, Ph.D. Dissertation, Carnegie-Mellon University, Pittsburgh, 1970.
[42] H. Tamura, S. Kamiya and T. Ishigai, FACOM VP-100/200: Supercomputers with ease of use, Parallel Comput. 2 (2) (1985) 87-107.
[43] A. Thomasian and P. Bay, Analysis techniques for queuing network models of multicomputer systems with shared resources, Comput. Performance 4 (3) (1983) 151-166.
[44] H. Wacker, Der Markt für Vektorrechner nach Ankündigung der IBM 3090-VF, PIK 9 (1986) 16-20.
[45] J. Wong, J. Mourra and J. Field, Hierarchical modelling of local area computer networks, IEEE National Telecommunications Conference (1980) 37.1.1-37.1.7.
[46] J. Worlton, Understanding supercomputer benchmarks, Datamation (September 1984) 121-130.