Benchmark frameworks and tools for modelling the workload profile


Ken J. McDonell

Systems Technology Laboratory, Pyramid Technology, 553 St. Kilda Road, Melbourne, Victoria 3004, Australia

Performance Evaluation 22 (1995) 23-41

Abstract

Ten years after the initial development of MUSBUS and one year after the adoption of the "son-of-MUSBUS", KENBUS, by SPEC it seems appropriate to reflect on the philosophy and methodology of this benchmarking approach. This paper presents an overview of the strengths and weaknesses of the MUSBUS approach, suggests ways in which the methodology may be applied to produce more accurate predictions of system performance, and introduces three new tool sets that may be used to increase the accuracy of performance predictions based upon synthetic benchmarks. These tool sets support realistic emulation of end-user behaviour at a terminal-like interface, provide parametric-driven emulation of both application programs and an underlying DBMS, and assist in the population of synthetic databases with records whose attribute values are both realistic and satisfy the semantic integrity constraints of the database schema. If one accepts the assertion that the construction of synthetic benchmarks with acceptable performance accuracy is a technically feasible goal, then the benefits of good benchmark development extend into the deployment and production phases of a system's life-cycle, and as a consequence it is argued that the cost of good benchmark development can be more than recouped over a system's life-time.

Keywords: Performance analysis; Benchmarking; Benchmark development

1. An overview of MUSBUS

The Monash University Software for Benchmarking UNIX 1 Systems (MUSBUS) was developed in 1982 to assist in equipment acquisition decisions for the Computer Science department at Monash University. Over the next 4 or 5 years the software evolved at Monash, with additional refinements and improvements coming from members of the world-wide UNIX community too numerous to mention; however, the original architecture remains unchanged. The aim was to provide some plausible comparison of multi-user performance amongst competitive systems, and the suite of tests in MUSBUS evolved from a realization that no adequate commercially available benchmarks or performance profiles existed at that time.

1 UNIX is a registered trademark of UNIX System Laboratories.


Fig. 1. MUSBUS multi-user test architecture.

The details of the MUSBUS architecture and design philosophy were presented in [1]. An architectural overview of the multi-user component of MUSBUS is shown in Fig. 1. The MUSBUS approach represented a significant improvement over prior art in the field of multi-user synthetic benchmarks (the notable exception being the Gaede Benchmark, independently developed by Steve Gaede at Bell Labs [2] at about the same time, which implements a very similar approach). The validity of the methodology and the robustness of the implementation have encouraged the subsequent incorporation, or adaptation, of substantial parts of MUSBUS into
- the SSBA Benchmark, sponsored by the French Unix Users Group [3],
- the Byte Benchmark [4],
- SPEC's SDM 1.0 Benchmark [5,6]: the KENBUS suite, refer to Section 4.

2. The strengths of the MUSBUS approach

A detailed retrospective review of the MUSBUS approach appears in [7]. The aspects of MUSBUS that are important and useful as criteria against which other multi-user benchmarks may be technically assessed are as follows:
- Benchmark software that is both portable and publicly available for scrutiny.
- A software engineering philosophy in which all possible error conditions are checked and an error triggers a "bells and whistles" abort of the benchmark (including killing off all concurrent activity).
- Statistically stable metrics, with repeated measurements and reported means and variances.
- The workload as an input parameter, not a universal constant.
- Real-time delays in user input (via a parameterized simulated typing rate).


- Terminal output being generated and sent to real physical devices.
- Permutation of the workload to avoid "lock-step" synchrony between concurrent simulated users.
- A "per user" directory context to provide script invariance and avoid false file sharing.
- Automatic distribution of user files and activity across multiple disk devices.

3. The pitfalls in the MUSBUS approach

The basic MUSBUS architecture has several weaknesses that either warrant special care when interpreting the results, or should be addressed in the evolution of multi-user benchmarks of this nature.
(1) MUSBUS has been used on systems whose raw processing power differs by at least three orders of magnitude. The spread of UNIX both up and down the price-performance spectrum has placed MUSBUS in environments very different from the ones in which it was originally conceived and used. In the multi-user test, the levels of concurrent activity (i.e. numbers of concurrent users to simulate) are specified as an input parameter, whereas one would prefer the selection of the range of values for the number of concurrent simulated users to be made automatically, based on the capacity of the system under test. This is a very difficult problem, and little progress has been made to date on its resolution, although the adaptive approach adopted by AIM Technology in their more recent Suite III versions * is quite interesting.
(2) There is a lack of proper send-receive synchronization, i.e. the generation of simulated user input is constrained by the nominated typing rate, not by system response - in Section 5.1 this issue is revisited and an alternative approach suggested.
(3) Beyond the inter-keystroke delays from a constrained simulated typing rate, there is no mechanism within the MUSBUS architecture to include "user think" time in the workload scripts. The facilities of a Remote Terminal Emulator (RTE) are required - see Section 5.2.
(4) Concurrent MUSBUS users compete for system resources (CPUs, memory, disk bandwidth, etc.), but there is no mechanism for contention over logical resources such as file locks or a database transaction log. Provision for this sort of contention can be made within a specific application environment, and hence a workload profile, but this is very difficult to do in a portable manner; the approach described in Section 6 shows how this can be done in a parametric fashion, independent of any particular DBMS.
(5) When measuring system throughput, it must be remembered that in addition to the applications being run, the system must support the MUSBUS benchmark code. Typically this translates into a benchmark overhead of the order of 10%, due mostly to the additional processes that are required to generate the simulated user input. RTEs provide a mechanism for optionally avoiding this perturbation of the measured system performance.

* For more details, contact AIM Technology, 4699 Old Ironsides Drive, Suite 150, Santa Clara, CA 95059, USA.


(6) The definition of the elapsed time for the multi-user benchmark covers the period from the launching of the first user until the last simulated user is finished. A statistically more stable metric would be to measure aggregate throughput across some "steady state" period, outside the benchmark's "ramp up" and "ramp down" periods, and from this determine the expected elapsed time for a prescribed amount of work. Again this is a difficult problem - steady state determination and sound throughput measurement require careful definitions (see the efforts of [8] for example) and a benchmark architecture that measures elapsed times for each unit of work, rather than for a complete session. Well-designed RTEs support the necessary model of performance measurement.
(7) Besides the concurrent multi-user benchmark, MUSBUS included some single-threaded raw speed tests - this was an unmitigated disaster! The raw speed tests purport to measure arithmetic speed, overheads associated with various system activities, file system throughput, etc. The intent was that these tests are purely diagnostic, and less effort was taken with their engineering. Consequently, the tests are very susceptible to compiler, architecture, scaling and configuration changes, and this makes them ill-suited for comparisons between heterogeneous systems. The only metric intended for comparison purposes is the system performance in the multi-user portion of MUSBUS.
(8) The original intention of MUSBUS was that the simulated user workload should be an input parameter to any performance measurement exercise. But to provide some guidance, several example workloads were included. One of these was the default workload used by the controlling scripts that ran the benchmark. Unfortunately, the majority of MUSBUS users have opted to use the default workload, even when their own processing profile has no correlation with that modelled in the default workload.
(9) MUSBUS workloads are constructed from permuted "shell" scripts. For more general performance analysis and benchmarks, this framework is too restrictive, and does not address the related modelling issues of
- highly interactive, full-screen applications (Section 5), or
- workload synthesis before the applications exist (Section 6), or
- instantiation of attribute values in the records of the shared files or database that underlies the applications (Section 7).

4. From MUSBUS to KENBUS

In May 1991, SPEC released the SDM 1.0 benchmark suite [5] that included KENBUS, a minor variant of MUSBUS Version 5.2. However, the changes are important - MUSBUS and KENBUS use the same framework, but different workloads, and hence the measured performance (either absolute or relative) is in no way comparable. The major differences are as follows:
- The default MUSBUS workload is an invariant component of the KENBUS benchmark - note point (8) in the previous section!


- The rate at which human users "peck" at keyboards is an input parameter to MUSBUS, with a default value of 2 characters per second. SPEC has fixed this value at 3 characters per second for KENBUS, in response to some more recent human-factors studies suggesting this is a more realistic "average" value.
- In the original MUSBUS framework, terminal I/O was given substantial importance and great care was taken to ensure that realistic amounts of terminal I/O were generated and sent to real serial interfaces. The rationale was simple - many systems wasted expensive cycles handling mundane terminal character traffic, with a consequent degradation in throughput for serious computation and work. SPEC is operating under some different constraints, and in particular there is no guarantee that a vendor's system will be configured with any serial line hardware, and it may not even have an Ethernet interface. Consequently, KENBUS assigns all terminal output to the "bit bucket" - all of the writes still occur, they are just directed to /dev/null (in UNIX parlance).
- Thankfully, the "raw speed" tests are not distributed with KENBUS.
- There was no single performance metric from MUSBUS; the benchmark reported the mean and variance of elapsed and CPU times for several different levels of concurrent activity. Overall system performance was gauged based upon heuristics related to degradation in elapsed time and/or increasing CPU saturation, as a function of increasing concurrent load. For KENBUS the single performance metric is "scripts per hour" and the maximum value is reported, with unconstrained freedom to vary the concurrency level.

5. Tools to support remote terminal emulation

Particularly as we move from the execution of standard UNIX programs into the realm of commercial applications and DBMS-based workloads, we must advance the techniques used for modelling of user activity beyond the MUSBUS approach. This requires benchmark tool(s) to support the following functions:
- The execution of interactive programs, driven by a "send" and "receive" model of synchronization between the actions of the simulated user and the system.
- Incorporation of real-time delays between bursts of keystroke activity, i.e. "think" times.
- More general performance measurement techniques that allow the easy capture and reduction of data describing response-time, transaction cycle-time, task throughput, real-time windowing for steady state analysis, etc.
- Dynamic non-determinism (a small illustrative sketch appears below, after this list) in
  - the selection of the simulated user task, e.g. choosing transaction types across all simulated users so that 75% of transactions are retrievals, 15% are amendments and the balance are insertions,
  - the selection of data values, e.g. apply an 80%-20% selection criterion across the set of valid part numbers, and
  - the real-time delays, e.g. the think time is drawn from a negative-exponential distribution with a mean of 15 (seconds).
- Support for alternative system responses, timeout conditions and recovery procedures.


Fig. 2. Remote terminal emulation architecture.

- Structured procedural "scripting" languages to describe a workload in terms of a functional model of the user's activity.
- Tools to reduce the effort required in workload script development.

Many modern Remote Terminal Emulation (RTE) packages provide these features (for example Pyramid Technology's sscript product), and we shall consider some of them in greater detail. The general RTE architecture is typically as shown in Fig. 2, with one system for the System Under Test (SUT) and one (or more) systems acting as the RTE(s). The interconnection uses the same communication mechanism that real users would use to connect to the SUT, i.e. a virtual circuit implemented over back-to-back serial lines, Ethernet, X.25, etc.
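
The dynamic non-determinism requirements above can be made concrete in a few lines of code. The following sketch (in Python) uses the illustrative numbers from the list - a 75/15/10 transaction mix, an 80%-20% skew over an assumed key space of 1000 part numbers, and a 15 second mean think time; none of the names correspond to any actual RTE product.

    import random

    # Hypothetical mix and think-time model for one simulated user.
    TX_MIX = {"retrieve": 75, "amend": 15, "insert": 10}   # percentages
    MEAN_THINK = 15.0                                       # seconds, negative exponential
    NPARTS = 1000                                           # assumed number of valid part numbers

    def next_interaction(rng=random):
        # Weighted selection of the next transaction type (75/15/10 split).
        tx = rng.choices(list(TX_MIX), weights=TX_MIX.values(), k=1)[0]
        # 80%-20% skew: 80% of references fall in the first 20% of the key space.
        if rng.random() < 0.80:
            part = rng.randrange(0, NPARTS // 5)
        else:
            part = rng.randrange(NPARTS // 5, NPARTS)
        # Think time drawn from a negative-exponential distribution.
        think = rng.expovariate(1.0 / MEAN_THINK)
        return tx, part, think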

5.1. Send-receive synchronization

The MUSBUS model uses a typing rate constraint, without any send-receive synchronization. This works perfectly adequately on a lightly loaded system, but as the concurrent load increases, the user input may be submitted before the application is ready for it. This "typeahead" behaviour reduces the realism, because in real systems the user's typing rate actually drops under heavy load (more accurately, the think time between interactions is extended). There is another unfortunate side-effect, namely that the benchmark architecture collapses altogether under heavy load when a customized workload has been developed for applications that periodically "flush" their typeahead buffer (e.g. screen-based applications built on top of libcurses).

The hardest part of implementing send-receive synchronization is to establish "what" to expect in the way of system response after each user input message. Note that the response may vary with new releases of the application software, operating system or DBMS product. One approach is to automate the process so that user (input) keystrokes can be easily captured and then played back to create a "dialogue" of user-system interaction. For example, the mkdialogue tool shown in Fig. 3 could take a keystroke fragment of the form

    date
    mail fred

and automatically construct the dialogue

    <- $
    -> date
    <- date
       Sun May 3 15:56:28 PDT 1992
       $
    -> mail fred
    <- mail fred
       Subject:

Fig. 3. Automatic "dialogue" generation.
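
As an illustration of how such a captured dialogue might be replayed with strict send-receive synchronization and response-time logging, the sketch below uses the pexpect package as a stand-in for an RTE's communications primitives; the dialogue contents and the shell prompt pattern are assumptions for the example only.

    import time
    import pexpect  # stand-in for an RTE's send/receive primitives

    def replay(dialogue, typing_cps=3.0, timeout=30):
        """Replay (input, expected-response) pairs against a shell, logging response times."""
        child = pexpect.spawn("/bin/sh", encoding="utf-8", timeout=timeout)
        child.expect(r"\$ ")                      # wait for the first prompt
        log = []
        for send_text, expect_pat in dialogue:
            for ch in send_text:                  # constrained simulated typing rate
                child.send(ch)
                time.sleep(1.0 / typing_cps)
            start = time.time()
            child.send("\r")
            child.expect(expect_pat)              # block until the expected response arrives
            log.append((send_text, time.time() - start))
        child.close()
        return log

    # e.g. replay([("date", r"\$ "), ("mail fred", "Subject:")])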

In this way, the benchmark development tools can "learn" (and if need be, "re-learn") what to expect in the way of a system response. The lack of send-receive synchronization in MUSBUS also means that no measurement of response time is possible. RTEs provide keystroke-level and/or message-level instrumentation, timestamp logging and data reduction tools to support a wide range of flexibility in the choice of how and what the benchmark user chooses to define as "response-time".

5.2. User think time

Such a facility would allow more realistic modelling of the delays normally encountered between high-level interactions (e.g. commands, transactions or applications). These delays are typically described by one or more underlying statistical distributions, from which the RTE software should randomly sample as required. Additional delays would also enable simulated users to generate a volume of work per unit time that is closer to the expected productivity of real users, rather than the present situation in which one MUSBUS simulated user may represent the throughput of between 2 and 10 actual users.

5.3. An RTE scripting language

A statement of language requirements is well beyond the scope of this paper; however, the following points indicate the sort of computational and expressive power required of an RTE scripting language:
- Procedures, block structure and flexible control structures - scripts are algorithmically of the same complexity as general-purpose programs, so linguistic constructs akin to C or Pascal are required.


- String manipulation primitives, since composing user input messages and decoding the system response are both string-oriented operations.
- Expression values in the language must be polymorphic - "004" is a string in some contexts, and an integer in others.
- Intrinsic functions for pseudo-random number generation, real-time delays, mutual exclusion between scripts, external file manipulations, etc.
- Communications primitives to establish a channel to the SUT, send messages under various delay regimes, receive responses, recognize alternative expected responses, handle timeouts when waiting for responses, etc.
- A mechanism to force higher-level elapsed times (e.g. transaction cycle-time for several send-receive-sleep interactions) to be measured and integrated into the response-time log.

The following representative fragment is presented as suggestive of the level of abstraction and functionality required in an RTE scripting language.

    while walltime < runtime
    do
        waitfor "SQL> " -timeout 2;
        prob = random(100);
        if prob < 25 then                /* 25% are join and aggregate (tx W) */
            send "select dept.dno, dname, count(eno)\n";
            waitfor "more? " -timeout 2;
            send "from emp, dept\n";
            waitfor "more? " -timeout 2;
            call getvardname();          /* get and save valid value for dept.dname in vardname */
            send "where emp.dno = dept.dno and dname = '" - vardname - "';\n";
            waitfor "more? " -timeout 2 -save recvbuf;
        elif ....
        /* think time in the range (10,20) seconds */
        sleep 10 + random(11);
    endwhile

5.4. The case for software tools

Developing RTE benchmarks can be a non-trivial undertaking. However, much of the script development process can be automated with the provision of the correct tool set; application of the UNIX and software tools [9] philosophy leads to the following tasks warranting support in the RTE script development environment:
- unintrusive capture of user keystrokes from an interactive session,
- synchronized execution of keystrokes to generate a "dialogue" between the user and the system,


- translating dialogues into scripts in the RTE scripting language,
- a generator for script permutations,
- data generators for populating the benchmark databases,
- a script interpreter, and an interactive script debugger,
- tools to translate scripts into executable machine code,
- large RTE benchmarks are difficult to start up, monitor and shut down - tools and interfaces are required to provide the benchmark engineer with the maximal degree of control, at the least possible effort,
- context-sensitive previewers and playback tools for studying traces of user-system interactions, and
- postprocessing log files to produce statistical summaries of response time distributions, subject to optional real-time window "clipping" (sketched below).
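
A sketch of the last of these tasks follows, assuming a log of (timestamp, response time) pairs; the field layout and the percentile choice are illustrative only.

    import statistics

    def summarize(log, t_start, t_end):
        """Summarize response times, clipped to the steady-state window [t_start, t_end)."""
        clipped = [r for t, r in log if t_start <= t < t_end]
        return {
            "count": len(clipped),
            "mean": statistics.mean(clipped),
            "p90": statistics.quantiles(clipped, n=10)[-1],   # 90th percentile
            "throughput": len(clipped) / (t_end - t_start),   # completions per second
        }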

5.5. Advantages of the RTE approach to benchmarking

To summarize, benchmarks constructed using the tool set of a reasonable RTE product offer the following advantages:
- Strict synchronization between (user) send and (system) response means that
  - bogus typeahead is avoided, so full-screen applications may be exercised, and
  - actual response times can be accurately measured and logged.
- All operational communications activity is included in the benchmark.
- Realistic user typing rates and think times can be easily parameterized and included in the scripts.
- There is a 1:1 correspondence between the load generated by a simulated user and the load expected to be generated by a production system user.
- Script development is a tool-based activity, with many of the steps well suited to automation, so that the benchmark is expected to survive, or be easily recreated, or only require minimal maintenance, after changes to the application software, DBMS or operating system.
- Beyond equipment selection, an RTE benchmark may be redeployed for capacity planning, post-delivery acceptance testing, stress testing, system tuning and software QA (functional and performance) - more of this in Section 8.

6. Modelling DBMS applications without a DBMS

Typically, benchmark workloads are application-specific, being developed in terms of user actions in the context of the production applications environment. Application-specific benchmarks presume the existence of an application. In the cases where a production application does not yet exist, the development of a skeletal prototype [10] to support the functionality required for the benchmark is recommended - the obvious benefits for the performance estimates are complemented by benefits to the software development process, and the cost need not be excessive. When an application prototype is not available, the benchmark designer is forced to build a workload against existing applications, in an attempt to synthesize the anticipated production environment.


Fig. 4. The “dumbo” benchmark architecture.

At best, this is a highly risky procedure, made even more so in cases where the production applications are using a DBMS that may not be available in the benchmark environment (this situation occurs commonly when a DBMS selection has not been made, or for benchmarks on new systems for which DBMS ports are not yet available). Within the MUSBUS framework (and hence easily extensible to an RTE framework), we have developed a large synthetic program (affectionately known as "dumbo") that implements a parameter-driven model of a wide class of applications implemented on top of a range of DBMS products. The conceptual architecture for dumbo (see Fig. 4) is similar to the independently developed sdbase benchmark within the WPI suite [11]. Dumbo models the application as a set of transactions against an arbitrarily sized file (the "database") of fixed-length records. Each transaction consists of a number of random and/or sequential record reads, a number of random and/or sequential record writes, some units of serious computation, a number of screens to be displayed, some optional transaction logging activity and some think time. All "numbers" may be constants, or else drawn from either a uniform or a negative exponential distribution. For each record read or written there may be a possible lock conflict and attendant delay - the underlying mathematical model of lock conflict ensures that the probability of conflict occurring increases with the number of concurrent simulated users. The hope is that this parameterization of the application behaviour may prove to be more attractive and accurate, because the quantification of the parameters can be based upon the sizing calculations which would normally have been performed as part of the systems analysis, database design and specification phases of the application's development (well in advance of the development of working code). The parametric specification is made via a declarative language, which is relatively simple; a detailed discussion of the language is beyond the scope of this paper, however, the syntax is briefly defined in Appendix A.


Since there is no real application involved, the dumbo code simulates all of the required activity; the code size has been artificially increased and the pattern of execution varies to emulate the large code size of both the DBMS and an application using the DBMS. All "database" I/O is implemented via "backend" processes connected to the application simulation program via pipes, with the attendant overheads of IPC, buffer passing and memory-access patterns (modelled after DBMS page cache searching and management strategies). The multi-user workload consists of multiple invocations of the application simulation program (both frontend and backend), each with its own transaction profile to be run; the synthetic database and transaction log are typically shared amongst all backend instances.
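
To make the parametric model concrete, the following sketch simulates the delay incurred by one dumbo-style transaction; the parameter names, the choice of distributions and the lock-conflict formula are illustrative assumptions, not dumbo's actual internals.

    import random

    def transaction_delay(n_users, rng=random,
                          mean_reads=8, mean_writes=2,
                          base_conflict=0.01, conflict_delay=1.0, think_mean=10.0):
        """Simulated delay (in seconds) for one parameterized transaction."""
        records = int(rng.expovariate(1.0 / mean_reads)) + int(rng.expovariate(1.0 / mean_writes))
        delay = 0.0
        for _ in range(records):
            # Probability of a lock conflict grows with the number of concurrent
            # simulated users, capped so it remains a probability.
            p_conflict = min(0.9, base_conflict * n_users)
            if rng.random() < p_conflict:
                delay += rng.expovariate(1.0 / conflict_delay)   # wait for the lock holder
        delay += rng.expovariate(1.0 / think_mean)               # think time
        return delay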

7. Populating the benchmark databases

Assuming the benchmark designer has a framework that realistically models end-user behaviour, an accurate model of the transaction profile, and the required applications programs 3, then there remains the question of how to populate the database (or files) used by the applications. If this is not done properly, the predictive accuracy of the benchmark may be eroded, due to factors such as
- insufficient I/O per unit of user work, because DB page cache hit ratios are too high,
- excessive inter-transaction interference as the number of concurrent transactions attempting to update the same record(s) is too high,
- bogus query costs because the optimization strategies are skewed, etc.
Whilst this is partly a question of non-determinism and modelling accuracy in defining how the attribute values are selected for the variable parts of the simulated user queries, it is also a question of how well the benchmark database models the production database. In particular, we are concerned with (in the relational model) table cardinality, the distribution of attribute values (especially for select and join terms), and the cardinality of association for foreign key relationships. The program rgen is another benchmark tool that is designed to produce instantiated sets of records suitable for loading into a database. The input uses an augmented SQL-like schema language to describe the relations, the attributes and the foreign key dependencies between attributes (usually in different relations). The output is one ASCII file per relation, one line per tuple. Within each tuple the attributes are fixed width and terminated with a colon (":"). Numeric attributes are right-justified, others are left-justified. The output format is designed to be easily post-processed with other UNIX tools, e.g. awk, sed, cut/paste or grep, and is similar to the format expected by the "bulk data loaders" that are typically provided by DBMS vendors.

3 Here we assume real applications, since the contents of the database files is immaterial to the behaviour of dumbo.


The rgen specification language supports the following functionality (a brief synopsis of the syntax is given in Appendix B):
- For each attribute, the domain is defined as one of
  - a discrete set of literal values that may be numbers or strings,
  - a range of alphanumeric values (encoded in radix-36 so that the values may be ordered and manipulated as numbers), e.g. "aaa".."zzz" or -3..+3, or
  - a range of dates and/or times relative to a settable origin, or
  - a regular expression (following the syntax of the UNIX program egrep).
- For DECIMAL, CHAR and VARCHAR data types, the width of the generated attribute value must be defined.
- For each attribute the underlying distribution of values is defined by one of the following choices:
  - a constant,
  - an identifier; step through the domain, with a user-defined average increment between successive values,
  - a uniform distribution,
  - a normal distribution with a prescribed mean and standard deviation,
  - a negative exponential distribution with a prescribed mean,
  - a Zipfian distribution with a prescribed "z" parameter; refer to [12] for a discussion of the importance of Zipf's distribution to modelling attribute value distributions.

The facilities of rgen, in conjunction with standard UNIX text processing tools (for intra-tuple attribute dependencies in particular), provide a reasonably sophisticated framework for generating data that may be loaded into unrelated relations. However, the more difficult problem is to instantiate the values for attributes that are "foreign keys" of other relations. The semantics of the relational data model demand that for these attributes the values must correspond to a valid value of the key attribute in some tuple of the "foreign" relation. For example, consider the skeletal rgen schema for the generic supplier-parts database,

    table suppliers (
        snum         char(5)        identifier(, 25),
        sname        varchar(10)    "a".."z" uniform(0..10),
        cr_limit     decimal(8, 2)  uniform(1000..20000),
        status       char(1)        normal('f', 2),
        xpect_delay  integer        negexp(5),
        discount     float          constant(10),
        state        char(3)        zipf(1.5, "NSW", "VIC", "QLD", "SA", "WA", "TAS")
    ) 10 tuples;

    table parts (
        pnum         char(6)        identifier(, 500),
        pname        varchar(15)    "a".."z" uniform(0..15),
        pmass        smallint       constant(123)
    ) 35 tuples;

    table sp (
        snum         char(5)        constant("?"),
        pnum         char(6)        constant("?"),
        qty          integer        normal(20, 10)
    ) 25 tuples;
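
Before turning to the foreign keys, here is a minimal sketch of sampling values from one of the distributions above - the Zipfian "state" attribute - using a simple inverse-CDF lookup; the method and names are illustrative only, not rgen's implementation.

    import bisect
    import random

    def zipf_chooser(values, z):
        """Return a sampler drawing from `values` with Zipfian rank weights 1/rank**z."""
        weights = [1.0 / (rank ** z) for rank in range(1, len(values) + 1)]
        total = sum(weights)
        cdf, acc = [], 0.0
        for w in weights:
            acc += w / total
            cdf.append(acc)
        cdf[-1] = 1.0                      # guard against floating-point round-off
        def choose(rng=random):
            return values[bisect.bisect_left(cdf, rng.random())]
        return choose

    state = zipf_chooser(["NSW", "VIC", "QLD", "SA", "WA", "TAS"], z=1.5)
    print([state() for _ in range(8)])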

For every tuple in the sp relation, the pnum attribute must have a value that is equal to the value of the pnum attribute for one tuple in parts. Similarly, for every tuple in the sp relation, the snum attribute must have a value that is equal to the value of the snum attribute for one tuple in suppliers. The "foreign key" relationship establishes a 1:N association between relations, e.g. one parts tuple may be associated with 0, 1, 2 or more tuples in the sp relation. The number of associated tuples defines the cardinality of association, and the distribution of cardinalities is a critical factor in determining the cost of join terms 4. Consequently, correct synthesis of the inter-relation associations is required for semantically acceptable data sets (if the semantics are not satisfied the DBMS should abort any attempt to load the tuples into the database) and realistic performance modelling. The approach that is used in rgen is to support the specification of foreign key associations, for example

    foreign key sp.snum -> suppliers.snum
        cardinality weights { 2:0, 1:1, 3:2, 5:3, 5:4, 3:5 };
    foreign key sp.pnum -> parts.pnum
        cardinality weights { 35:0, 65:1 };

The first part of the specification defines the attributes and direction of association (each sp tuple must be associated with exactly one suppliers tuple (via snum) and exactly one parts tuple (via pnum)). The weights are a set of numbers that define the desired relative frequency of association for specific cardinalities, 0, 1, 2, etc. For example, in the case above, we are requesting that 35% of the parts tuples are associated with no sp tuple, and the other 65% of the parts tuples are associated with one sp tuple. We now have to resolve the problem of over-specification of the number of tuples. In the example above, for the sp table, we require the following numbers of tuples:
- 25: from the "tuples" clause at the end of the table specification.
- 29: from the foreign key association between suppliers and sp that has an average cardinality of association of (1×1 + 3×2 + 5×3 + 5×4 + 3×5)/(2 + 1 + 3 + 5 + 5 + 3) = 2.85, hence 2.85 × 10 expected sp tuples.
- 23: from the foreign key association between parts and sp that has an average cardinality of association of 0.65, hence 0.65 × 35 expected sp tuples.

4 This cardinality is also known as the “join selectivity”, which is used, for example, to estimate the number of sp tuples that must be retrieved to construct the natural join between a subset of the suppliers (or parts) tuples over the common snum (or pnum) attribute.


This problem is resolved by linear adjustment 5 of the cardinality weights until the foreign key associations produce the desired expected relation sizes. In the example above, this caused the following changes to be made to the cardinality weights (which have been normalized and expressed as percentages):

    foreign key: sp.snum -> suppliers.snum
    cardinality               0      1      2      3      4      5
    initial weights        2.00   1.00   3.00   5.00   5.00   3.00
    initial percentages   10.53   5.26  15.79  26.32  26.32  15.79
    adjusted percentages  25.44   4.39  13.16  21.93  21.93  13.16

    foreign key: sp.pnum -> parts.pnum
    cardinality               0      1
    initial weights        7.00  13.00
    initial percentages   35.00  65.00
    adjusted percentages  28.57  71.43
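
The kind of computation involved can be sketched as follows. The rescaling rule below - scale the non-zero cardinality shares by the ratio of desired to expected child relation size and give the remaining probability mass to cardinality 0 - happens to reproduce the adjusted percentages in the tables above, but it is only an illustration; rgen's actual procedure is a little more involved (see the footnote).

    def adjust(weights, parent_tuples, desired_child_tuples):
        """Normalize cardinality weights and rescale them towards a desired child relation size."""
        total = float(sum(weights.values()))
        pct = {c: 100.0 * w / total for c, w in weights.items()}            # initial percentages
        expected = parent_tuples * sum(c * w for c, w in weights.items()) / total
        scale = desired_child_tuples / expected
        adjusted = {c: p * scale for c, p in pct.items() if c != 0}
        adjusted[0] = 100.0 - sum(adjusted.values())                        # remainder to cardinality 0
        return pct, adjusted

    # sp.snum -> suppliers.snum example: 10 suppliers tuples, 25 sp tuples requested
    print(adjust({0: 2, 1: 1, 2: 3, 3: 5, 4: 5, 5: 3}, 10, 25))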

As part of the foreign key associations between tables, rgen also supports a limited form of attribute inheritance for fields that are replicated (for performance reasons) in un-normalized schemas.

8. Benchmarks: the life after equipment acquisition

Irrespective of the tool set available, the development of benchmarks with good predictive accuracy is both a skilled and time-consuming process. If the associated cost is viewed merely as a "cost of acquisition", then the potential benefits are significantly reduced. Rather, the cost-benefit analysis should continue over the life of the system, because a pre-sales benchmark may be re-deployed in the following ways.
- Since the benchmark simulates real users doing real work, we can use the benchmark to validate system integrity and performance prior to live-system upgrades, e.g. on Sunday morning, use the benchmark to simulate Monday's peak load before you commit to the deployment of new software releases in the production environment.
- Use the benchmark and adjust the number of simulated users, or the distribution of transaction (or application) types, or the transaction inter-arrival time to measure the effects of anticipated changes in the processing profile on the resource consumption and performance, i.e. realistic capacity planning.
- For complex software systems with non-trivial user interfaces, the functionality and flexibility of the RTE tools can be adapted to deliver cost-effective testing procedures for functional conformance and performance measurement in the process of software quality assurance [13].

5 It is a little trickier than this, since the estimate may be an over-estimate or an under-estimate, and so the average cardinality of association may need to be decreased or increased.


9. Concluding comments

MUSBUS was built on top of the basic UNIX shell and file manipulation tools. The next generation of benchmark support tools has been built on top of the MUSBUS tools and on top of the higher-level UNIX tools, e.g. yacc, lex and network services. This "tool bootstrapping" philosophy has worked well and is critical in the development of low-cost-to-build benchmark support tools. Nevertheless, building good benchmarks (even with the best benchmark support tools) is a significant and demanding undertaking - the best results, in terms of predictive accuracy and value to the enterprise, will be achieved by taking the long-term view, and trading the "cost to build" off against the "cost of ownership over the life of a system" - often the potential savings over this longer term dwarf the initial considerations of the "cost to acquire" that historically has motivated much benchmarking activity.

Appendix A. A grammar for defining a synthetic application

The program dumbo uses a declarative language to define the parametric specification of a synthetic application. The following grammar uses a BNF-style notation in which keywords appear in UPPERCASE BOLD, non-terminal symbols are in lowercase italic and generic terminal tokens appear in lowercase (i.e. name, integer and real). The symbol '::' introduces the right-hand side of a production rule, and '|' designates an alternative production. All other non-alphanumeric characters are literal symbols.

    spec          :: stmt
                   | stmt spec

    stmt          :: TRANSACTION name optlist ;
                   | DATABASE integer RECORD integer BYTE per RECORD FILE db_opts ;

    db_opts       :: log db_opts
                   | file_list db_opts

    log           :: LOG file integer BYTE per integer TRANSACTION

    file_list     :: file
                   | file file_list

    file          :: name

    optlist       :: option
                   | option optlist

    option        :: range seqopt READ conflict_opt
                   | range seqopt WRITE conflict_opt
                   | range REWRITE
                   | CPU LOAD FACTOR number
                   | range screen_opt REFRESH rdelay_opt
                   | range CHAR per REFRESH
                   | THINK TIME range SECOND

    seqopt        :: SEQUENTIAL

    conflict_opt  :: with_opt LOCK CONFLICT number ldelay_opt

    with_opt      :: WITH

    ldelay_opt    :: DELAY range SECOND

    range         :: real
                   | UNIFORM ( number , number )
                   | NEGEXP ( number )

    number        :: integer
                   | real

    screen_opt    :: SCREEN

    rdelay_opt    :: DELAY range SECOND

    per           :: PER
                   | /

Appendix B. A grammar for generating synthetic database records

The grammar definition for rgen follows the same syntax rules as used in Appendix A.

    schema       :: stmt
                  | stmt schema

    stmt         :: TABLE name ( attlist optcomma ) nexpr TUPLES ;
                  | FOREIGN KEY attrspec -> attrspec cardspec optinherit ;
                  | name EQUAL nexpr ;
                  | DATESTAMP string ;

    attlist      :: attdef
                  | attlist , attdef

    attdef       :: name att_type distrib

    att_type     :: CHAR ( integer )
                  | VARCHAR ( integer ) vcval
                  | FLOAT
                  | DECIMAL ( integer precision )
                  | SMALLINT
                  | INTEGER
                  | DATE
                  | DATETIME optfrom dt_unit optto dt_unit

    precision    :: , number

    vcval        :: value .. value
                  | epat

    dt_unit      :: YEAR | MONTH | DAY | HOUR | MINUTE | SECOND

    optfrom      :: FROM

    optto        :: TO

    distrib      :: UNIFORM ( value_range )
                  | NORMAL ( value , rnumb )
                  | NEGEXP ( value )
                  | ZIPF ( rnumb , value_range )
                  | DISCRETE ( vfset )
                  | IDENTIFIER ( start , delta )
                  | CONSTANT ( value )

    value_range  :: value .. value
                  | vset
                  | epat

    value        :: nexpr
                  | sign_real
                  | string

    sign_real    :: real
                  | PLUS real
                  | MINUS real

    vset         :: velt
                  | vset , velt

    velt         :: string
                  | nexpr
                  | sign_real

    rnumb        :: real
                  | number

    start        :: value
                  | epat

    vfset        :: rfreq velt
                  | vfset , rfreq velt

    rfreq        :: rnumb :

    delta        :: nexpr

    optcomma     :: ,

    cardspec     :: CARDINALITY WEIGHTS { wtlist }

    wtlist       :: weight
                  | wtlist , weight

    weight       :: rnumb optcard

    optcard      :: : nexpr

    optinherit   :: INHERIT inspeclist

    inspeclist   :: inspec
                  | inspeclist , inspec

    inspec       :: attrspec FROM attrspec

    attrspec     :: name
                  | name . name

    nexpr        :: number
                  | name
                  | ( nexpr )
                  | PLUS nexpr
                  | MINUS nexpr
                  | nexpr TIMES nexpr
                  | nexpr DIVIDE nexpr
                  | nexpr PLUS nexpr
                  | nexpr MINUS nexpr

References

[1] K.J. McDonell, Taking performance evaluation out of the "Stone Age", Proc. Summer Usenix Technical Conf., Phoenix, AZ (1987) 407-417.
[2] S.L. Gaede, Tools for research in computer workload characterization and modeling, in: Experimental Comp. Performance and Evaluation (North-Holland, Amsterdam, 1981).
[3] AFUU, Suite synthétique de benchmarks de l'A.F.U.U. (S.S.B.A. 1.0), Report, Association Française des Utilisateurs d'Unix, Le Kremlin-Bicêtre, France, 1988.
[4] B. Smith, The Byte Unix benchmarks, Byte (Mar. 1990) 273-277.
[5] SPEC, SPEC announces a new benchmark suite, SPEC Newslett. 3(2) (1991) 1.
[6] S.K. Dronamraju, S. Balan and T. Morgan, System analysis and comparison using SPEC SDM 1, SPEC Newslett. 3(4) (1991) 3-8, 17.
[7] K.J. McDonell, MUSBUS - what has been learnt?, Austral. Unix Systems User Group Newslett. 11(2) (1990) 27-36.
[8] TPC, TPC Benchmark C (Order Entry), Transaction Processing Performance Council, 1991 (Draft 6.0, contact Shanley Public Relations, San Jose, California).
[9] B.W. Kernighan and R. Pike, The UNIX Programming Environment (Prentice-Hall, Englewood Cliffs, N.J., 1984).
[10] A. Albano and R. Orsini, A prototyping approach to database applications development, Database Engrg. 7(4) (1984) 64-69.
[11] D. Finkel, R.E. Kinicki, J.A. Lehmann and J. CaraDonna, Comparisons of distributed operating system performance using the WPI benchmark suite, Tech. Rep. WPI-Comp. Science-Tech. Rep.-92-2, Dept. of Comp. Science, Worcester Polytechnic Institute, Worcester, Mass., 1992.
[12] A.Y. Montgomery, D. D'Souza and S.B. Lee, The cost of relational algebraic operations on skewed data: estimates and experiment, Proc. IFIP Congress, Paris, France (North-Holland, Amsterdam, 1983) 235-241.


[13] K.J. McDonell, Remote terminal emulators and other tools for cost-effective software quality assurance, Austral. UNIX Systems User Group Newslett. 13(2) (1992) 24-33 (presented at AUUG Summer Meetings, Perth, Adelaide and Melbourne).

Ken McDonell's interest in performance analysis began with simulation studies of file access methods at Monash University and continued through his Ph.D. work on the interface between operating systems and database management systems at the University of Alberta. From 1977 to 1988 Ken was an academic at Melbourne and Monash Universities, with teaching and research interests in database management systems (particularly implementation efficiencies and the semantics of query languages) and software engineering. In 1988 Ken left academia to take up a position in California with the Performance Analysis Group at Pyramid Technology Corporation. Assignments at Pyramid have included management of pre-sales benchmarking, management of the corporate performance analysis group, development and specification of performance-related products and performance quality assurance. He was a member of the Systems Technology Laboratory (STL), a group charged with investigating core technologies for future Pyramid products, and manages the STL efforts in Australia. Currently he manages the Performance Tools Group for Silicon Graphics Inc.