D.P. Heyman, M.J. Sobel, Eds., Handbooks in OR & MS, Vol. 2 © Elsevier Science Publishers B.V. (North-Holland) 1990

Chapter 7 Simulation Experiments

Bruce Schmeiser

School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.

1. Introduction

This chapter focuses on digital-computer modeling and analysis of stochastic systems to estimate performance measures. Both dynamic-system simulation and Monte Carlo simulation of models having no time component are considered. Simulations with human interaction, such as business gaming and trainers, are not considered. Deterministic differential-equation models, whose numerical analysis is sometimes referred to as simulation, are also not considered.

Simulation, as viewed here, is a method of modeling and analyzing stochastic systems using digital computers. A simulation model is an expression of the logical relationships that cause the system to change from one state to another. Observations, generated from the simulation model, are statistically analyzed to estimate performance measures of interest.

Simulation models can differ widely. The state space can be discrete, continuous, or combined. The observations can be values assumed by random variables or can be the occurrence or nonoccurrence of events. The observations can be static or can be indexed by time, either discrete or continuous. The logical relationships can be time homogeneous or nonhomogeneous. Likewise, the performance measures to be estimated can differ widely. They can be static, a function of time, or steady-state. They can be means (including probabilities), higher-order moments, or quantiles. But always, the performance measures are unknown constants determined by a model that is completely specified. Performance measures are expected values (mathematically, integrals), except in the case of quantiles, where they are defined as the upper limit of integrals. Simulation serves the same purpose as analytical probabilistic analysis or numerical quadrature: to deduce the value of the performance measure from a logically complete specification of the model.

Given a simulation model, the following five steps are required to perform a simulation experiment:


1. Select a source of randomness.
2. Obtain basic observations from the source.
3. Transform the basic observations to input observations having known distributions.
4. Transform the input observations, via the model, to output observations.
5. Calculate statistics from the output observations to estimate the performance measures.

This chapter discusses the issues that an analyst must face in each step of a simulation experiment. Each step requires decision making. Even the model is assumed to be subject to possible change, although we do not discuss modeling explicitly. Throughout the chapter the emphasis is on fundamental issues - the central problems and general approaches to solving the problems - rather than discussion of specific procedures.

Step 1, the only nondeterministic step, is the decision whether to use historical data, truly random phenomena, or pseudorandom-number generators. Included here is the specific procedure to be used to obtain basic observations: for historical data, the specific data to be used; for truly random phenomena, the mechanism; and for pseudorandom-number generators, the form of the generators and the initial seed values. These sources of randomness are discussed in Section 2.

Step 2 is straightforward, given that the source has been decided in Step 1: We go to the source and obtain the numbers. Implementation issues are discussed in Subsection 2.3. The only other complication arises when we use correlation-induction methods to reduce variance, as discussed in Section 7, since then we must decide on the relationships among the random numbers.

Step 3 is random-variate generation, the creation of a sequence of input observations that represent the environment of the system. Random-variate generation is discussed in Section 3. The modeling issue of selecting the input distributions is discussed in Section 4.

Step 4 is the fundamental step of the simulation, converting input observations from known distributions to output observations whose distributions are unknown.

Step 5 is composed of two parts: point estimators of performance measures and estimators of the quality of the point estimators. The choice is often, but not always, obvious. Point estimation is discussed in Section 5 and methods for estimating point-estimator quality are discussed in Section 6.

A simulation experiment, then, involves all the usual issues of statistical experimentation, plus some induced by the special nature of simulation experiments: control of the source of randomness and complete knowledge of the system giving rise to the observations. The issues can be informally divided into two types: (1) those involving the choice of design and point estimators are typically discussed under the broad heading of variance reduction and (2) those involving the choice of the method of estimating properties of the resulting estimators are typically discussed under the broad heading of output analysis.


Output analysis, which assumes the simulation experiment is specified, is concerned with estimating the sampling properties of the point estimator. The prototype problem considers estimating the standard error of the sample mean of stationary autocorrelated data, an estimate of the population mean. Variations include nonstationary data (initial transient or time-varying behavior) and/or other properties, such as variance, higher-order moments and cross moments, quantiles, and percentiles. The use of the estimate of standard error in a variety of analysis procedures, such as for confidence intervals and for ranking and selection, also falls within the topic of output analysis.

Variance reduction includes the choice of the design points to be simulated, the choice of (multivariate) distribution from which the observations are drawn (including the choice of the distribution for initial conditions and dependencies among replications), the choice of point estimators, and the choice of rules for stopping the simulation. The dependencies among replications typically involve the use of random-number streams, sometimes the choice of random-variate generators. In short, variance reduction is the choice of simulation experiment. Variance reduction is discussed in Section 7.

The topics of this chapter are intertwined. For example, input modeling is in some sense logically first, since it has to do with choosing the model to be analyzed in the simulation experiment. Yet it follows random-variate generation since much of what makes input modeling for simulation special is that observations from the input model must be sampled. Random-number generators precede random-variate generators in the five steps, but the reason for using the types of pseudorandom-number generators discussed later is the relationship of random numbers to variance reduction.

The sections of this chapter differ in level of detail and level of sophistication. Although the primary purpose of this chapter is to provide an overview of simulation, several times I discuss a topic simply because I think it is interesting and the information is not available in the general references and textbooks of the last few years: Banks and Carson (1984), Banks, Carson, and Goldsman (1989), Bratley, Fox, and Schrage (1987), Fishman (1973, 1978), Law and Kelton (1982), Kleijnen (1974, 1975, 1988), Knuth (1981), Lewis and Orav (1989), Morgan (1984), Nelson (1987b), Pritsker (1986), Ripley (1987), Rubinstein (1981, 1986), Welch (1983), Wilson (1984a, b), Wilson and Pritsker (1982), and Yakowitz (1977). Likewise, I have tried to minimize discussions that duplicate these general references. And finally, I have tried to emphasize ideas of general applicability.
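To make the five steps listed above concrete, here is a minimal sketch, in Python, of a complete simulation experiment: it estimates the mean steady-state waiting time in a single-server queue with exponential interarrival and service times, using the Lindley recursion as the model. The queue, the parameter values, and the sample size are illustrative assumptions, not part of the chapter.

    import math
    import random

    # Step 1: select a source of randomness -- here Python's pseudorandom-number
    # generator with a fixed seed, so the experiment is reproducible.
    rng = random.Random(12345)

    # Step 3 helper: inverse transformation from a U(0,1) random number (Step 2
    # is the call rng.random() itself) to an exponential input observation.
    def exponential(u, rate):
        return -math.log(1.0 - u) / rate

    # Step 4: transform the input observations, via the model, to output
    # observations.  The model is the Lindley recursion
    # W_{i+1} = max(0, W_i + S_i - A_{i+1}) for waiting times in a single-server queue.
    n = 100_000
    arrival_rate, service_rate = 0.9, 1.0     # illustrative values (utilization 0.9)
    waits, w = [], 0.0
    for _ in range(n):
        waits.append(w)
        s = exponential(rng.random(), service_rate)   # service time of customer i
        a = exponential(rng.random(), arrival_rate)   # interarrival time to customer i+1
        w = max(0.0, w + s - a)

    # Step 5: calculate statistics from the output observations.
    print("estimated mean waiting time:", sum(waits) / n)

The sketch ignores the initial transient and reports no measure of estimator quality; those issues are exactly the subjects of the later sections.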

2. Sources of randomness

To obtain realizations of model behavior, a source of randomness is needed. Some criteria for selecting the source include (1) little deviation from desired statistical properties, (2) speed, (3) memory, (4) ease of implementation, (5) portability, (6) reproducible, but changeable, and (7) ease in obtaining multiple independent streams. Usually the randomness underlying simulation analysis is obtained via pseudorandom-number generators, which provide deterministic observations that are intended to mimic realizations of independent uniformly distributed random variables on the interval (0, 1). The term random number will be used in this chapter to refer to such observations, with truly and pseudo being used only for clarity. We discuss general concepts in Section 2.1, some specific pseudorandom-number generators in Section 2.2, and implementation issues in Section 2.3.

2.1. General concepts

The most common alternative to random-number generators is trace-driven experimentation - the use of historical data to drive the simulation model. Historical data are sometimes useful for validating simulation models, in some applications satisfy the first criterion perfectly, and can be reasonably easy to implement if the data are already in computer-readable form.

The use of truly random numbers is conceptually straightforward. Typically, Poisson events such as radioactive decay are observed, with the time between events being independent and exponential. The marginal distribution of the source of randomness is not important, as long as it is continuous and known, since its values can be converted to any other continuous marginal distribution via the associated distribution and inverse distribution functions, as discussed in the next section. With new computer technology, such as optical-disk storage, using truly random numbers is becoming practical, unlike using the decks of punched cards available from RAND (1955). Truly random numbers are advocated by some as being necessary for the valid application of statistical analysis to the simulation observations. Knuth (1981, Chapter 3) considers the question "what is a random sequence?" Discussions of the distinctions between stored (and possibly reused) truly random numbers and pseudorandom numbers quickly become philosophical.

Because using truly random numbers requires either a physical process or massive storage, and therefore is relatively slow and difficult, for most applications an algorithm that provides numbers deterministically as a function of a single random outcome is preferable. Such algorithms, called pseudorandom-number generators, are of the form U_i = d(U_{i-1}, U_{i-2}, ..., U_{i-p}). Such algorithms typically require only a few lines of computer code, little in the way of tables, and can reproduce results easily. Given an initial seed U_0, U_{-1}, ..., U_{-(p-1)}, the future is determined by the function d. Since the future must lie in the finite set of values that can be represented on the computer used, pseudorandom-number generators must cycle. A generator can have multiple cycles, cycles can be degenerate, and some values can lie in no cycle (such a value lies on a path that leads to a cycle).

By choosing d, the user is implicitly choosing a sample space Ω, each


element of which is an infinite sequence (stream) that cycles. By choosing the seed, the user is choosing an element ω, which can be thought of as choosing a starting point and eventual cycle. When using pseudorandom-number generators, these two choices comprise the first step discussed in Section 1.

The user can choose any generator and any seed. The generator chosen is often that provided by the software developer. Even when the choice is explicitly made by the user, the choice is usually (and ideally) from the relatively small set of generators with a record of successful use. An occasional recommendation is to use a function d chosen randomly from a set of functions, but such a procedure simply increases the size of the sample space Ω. The more conventional wisdom is to choose d so that the streams in Ω are good, as discussed below.

The choice of seeds is likewise often restricted, as when software is designed with a few predetermined streams. Two not uncommon recommendations are to choose the seed equally likely from the set of all feasible seeds (possibly via random-number tables), or truly randomly (such as via the system clock). Few users bother, and reasonably so. In the typical case of a scalar integer seed from the set {1, 2, ..., 2^31 - 1}, seed values with easy-to-remember patterns such as 11111111 and 987654321 are common. Unless the user is selecting the seed because of knowledge of the corresponding stream, no harm is done since the seed appears to the user to be an arbitrary stream label. The pattern in the seed does not imply a pattern in the stream.

The only randomness in simulation experiments based on pseudorandom numbers is the choice of generator and seed. When they are chosen, the source of randomness is established in step one; the remaining four steps simply transform this choice to the point estimators of step five. Thus one argument for applying statistical methods to simulation data generated from pseudorandom numbers is that the data are in fact random. But the stronger argument rests on the ability of pseudorandom-number generators to mimic truly random behavior. (Pseudorandom-number generators are just one of many examples in chaos theory of apparent randomness arising from deterministic mechanisms.)

The research problem is to find a function d that corresponds to a sample space Ω that has good properties, in the sense that the user's choice of seed (ω) can be made easily and without concern. Typically, the first order of business is to obtain a single cycle with a large period. Although cycles with periods in the hundreds are sometimes used to drive games on microcomputers, simulation experiments often require many more random numbers and therefore periods on the order of millions or billions. Commonly periods are about 2^31 - 1 = 2 147 483 647, the number of positive integers on a 32-bit computer. The use of supercomputers and interest in better statistical quality have increased interest in even longer periods.

With a known large period, pseudorandom-number streams have a chance of mimicking truly random behavior from a specified stochastic process. Almost without exception the desired behavior is that of independence with marginal

distributions that are continuously uniform over the unit interval, U(0, 1); equivalently, k-tuples should be uniformly distributed in the k-dimensional unit cube for k = 1, 2, ... .

2.2. Some pseudorandom-number generators

In the quest for such behavior from pseudorandom-number generators, two classic types of functions d have dominated. These are due to Lehmer (1951) and Tausworthe (1965). Both work with integers, with later conversion to values between zero and one.

Lehmer's linear congruential generators, of the form

    W_i = (aW_{i-1} + c) mod m,

obtain integers in the set {0, 1, 2, ..., m - 1} and convert them to the [0, 1) interval by dividing W_i by m. This class of generators has a variety of advantages. Its number theory is tractable, so selection of the parameters a, c, and m to obtain a long period is straightforward. Multiplicative congruential generators, the special case obtained when c = 0, are especially convenient. With proper selection of a and m, such generators do not admit the realization 0, thereby automatically eliminating the occasional glitch caused by attempting to evaluate the logarithm of 0. A full period over 1, 2, ..., m - 1 is obtained when m is prime and a is a primitive root of m. (Knuth 1981 discusses primitive roots.) Since the largest integer on a 32-bit computer, 2^31 - 1, is prime, the search for good values of the multiplier has attracted much attention, e.g., Fishman and Moore (1986) and L'Ecuyer (1986). The memory requirements are smaller and the execution time faster, although trivially so. Obtaining nonoverlapping streams from a single generator is relatively easy, as discussed in Marse and Roberts (1983), since

    W_i = (a^i W_0) mod m = [((a^i) mod m) W_0] mod m.

The innermost remainder is not a function of the randomly chosen seed W_0, so its evaluation is hard coded. If this remainder is denoted by b, then

    W_i = (bW_0) mod m = (b_1 (b_2 ( ... (b_J W_0) mod m ... ) mod m)) mod m,

where the product b_1 b_2 ... b_J = b reflects factoring of b that may be necessary to avoid overflow.

The primary disadvantage of the Lehmer generators, first noted by Marsaglia (1968), is that k-tuples W_{i+1}, W_{i+2}, ..., W_{i+k} cover at most m of the m^k grid points in k dimensions; worse, the points covered lie in at most (k!m)^{1/k} hyperplanes. These bounds are smaller in higher dimensions k and for smaller values of m. Worse yet, specific generators can have substantially fewer


hyperplanes than indicated by the bounds, the classic example being RANDU from page 77 of the IBM Scientific Subroutine Package (1966): W_i = (65539 W_{i-1}) mod 2^31, whose realizations lie on only fifteen planes in three dimensions, despite the bound being 2344. Creating an example in which RANDU invalidates the experiment is simple - estimate the probability that a point uniformly distributed in the 3-dimensional unit cube lies in a set entirely between the planes; the estimate will always be zero. This sort of example can occur when estimating rare-event probabilities, such as those which arise in nuclear physics or Monte Carlo estimation of the power of a test of hypothesis. Complex-system simulations, however, are usually insensitive to the hyperplane problem. In fact, the enforced scattering of points caused by the hyperplanes can improve estimator performance for small sample sizes, a form of variance reduction, the topic of Section 7.

More-recent research involves nonlinear congruential generators, which are of interest since they do not have the hyperplane property. For example, Eichenauer and Lehn (1986) study inverse congruential generators defined by W_i = (aW_{i-1}^{-1} + c) mod m if W_{i-1} ≥ 1 and W_i = c otherwise, where the inverse is defined so that WW^{-1} = 1 mod m. Such generators are more difficult to implement than Lehmer generators and lack the years of experience necessary to develop confidence, but are of interest to those wishing to avoid hyperplanes.

The second classical generator type, which also avoids the hyperplane structure, is the feedback-shift-register generator of Tausworthe, defined by

    a_k = (c_1 a_{k-1} + c_2 a_{k-2} + ... + c_p a_{k-p}) mod 2,

where each of c_1, c_2, ..., c_{p-1} is binary and c_p = 1. Since zero causes the generator to degenerate, the maximal period is 2^p - 1, which is obtained if c_1, ..., c_p are the coefficients of a primitive polynomial, as tabled for example in Stahnke (1973). Commonly, only two nonzero coefficients are used,

    a_k = (a_{k-p+q} + a_{k-p}) mod 2,

which then reduces the generator to an exclusive-or operation. For example, if p = 532 and q = 37, then

    a_k = (a_{k-495} + a_{k-532}) mod 2

has a period of 2^532 - 1, is fast, requires a table of 532 bits, and has numerical properties fundamentally different from Lehmer generators.

One way the Tausworthe generators differ from Lehmer generators is that bits, rather than words, are generated. Unlike the Lehmer generators, whose logic is at the word level, Tausworthe generator logic is typically computer dependent; thus code portability is low. However, the bit-level logic also allows the same sequence of numbers to be generated even on machines having


different word lengths. The process of converting k, say, bits to a numerical value on the unit interval is straightforward by viewing the bits as a binary fraction, providing accuracy to 2^{-k}. Now a machine of word size l ≤ k can obtain the same sequence of numbers as any other machine, to its own accuracy 2^{-l}, by simply using only the first l bits, wasting the other k - l. These k - l bits are not really wasted, however, since they allow each l-bit number to be followed by multiple other numbers. In particular, every l-bit number will occur 2^{k-l} times in a period, each time being followed by a different number. Thus Tausworthe generators can be substantially more dense in a k-dimensional cube than can Lehmer generators.

Given that two types of generators have been studied in some detail, a natural extension is to examine methods for combining them into a single generator. Ideas such as shuffling generators can add substantially to the period length and intuitively would seem to provide numbers that appear more random than provided by the original generators. However, because of the added complexity of such schemes, theoretical analysis is usually difficult. See, for example, Nance and Overstreet (1978).

An appealing idea for combining generators is to use the fractional part of the sum of two or more independent generators. Brown and Solomon (1979) show that, in several metrics, the resulting sequence must be no less uniform than the sequences from the component generators. Wichman and Hill (1982) provide an implementation of this idea using the three multiplicative congruential generators with parameters (a, m) equal to (171, 30269), (172, 30307), and (170, 30323). Since the three values of m are prime, the resulting period appears to be (30269)(30307)(30323), which is greater than 10^13. But as Wichman and Hill (1984) note, the period is less by a factor of four, although still large considering each modulus is less than 2^15 - 1 = 32767. McLeod (1985) notes numerical instability. Zeisel (1986) notes that the generator is equivalent to a Lehmer generator with modulus 27 817 185 604 309 and multiplier 16 555 425 264 690. The message is that years are required to obtain the empirical and theoretical results necessary to determine whether a generator is good.

If this last paragraph doesn't discourage users from creating their own generators, then I hope the following example will. In 1974 a student in my class used the following Fortran logic on a Univac computer

    iseed = 3141592653
    c = 2718281829
    do 10 i = 1, 20
    iseed = iseed * iseed + c
    if (iseed .le. 0) iseed = -iseed
    p = iseed * .2910383e-10
 10 write (6, 1) p


to obtain the observations .8644, .5903, .6438, .1694, .1885, .4452, .5141, .8781, .1694, .1885, ..., which has a period of length five. Such difficulty with ad hoc generators seems not uncommon (Knuth 1981).

One motivation of the Wichman and Hill (1984) generator is portability. Portability means that the generator, with a given seed, produces the same sequence of random numbers on each computer (to within each computer's accuracy). Having the code execute without obvious error is not particularly difficult. But the same code can produce different sequences of random numbers due to differences arising during floating-point comparisons and overflow. The two common solutions are to use double-precision arithmetic or to restrict computations to integers less than 2^15 - 1 = 32 767. (Most codes are written for computers having at least 16 bits.) IMSL (1987) uses the former approach; Wichman and Hill's generator is an example of the latter approach, as is the generator in Schrage (1979).
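As an illustration of the combined-generator idea, here is a minimal Python sketch of the three-component Wichman and Hill generator described above; it is a sketch for exposition, not a recommended production implementation, and the seed values are arbitrary.

    class WichmanHill:
        """Fractional part of the sum of three small multiplicative
        congruential generators (Wichman and Hill 1982)."""

        def __init__(self, s1=1, s2=1, s3=1):
            self.s1, self.s2, self.s3 = s1, s2, s3   # three integer seeds

        def random(self):
            self.s1 = (171 * self.s1) % 30269
            self.s2 = (172 * self.s2) % 30307
            self.s3 = (170 * self.s3) % 30323
            # Fractional part of the sum of the three scaled values.
            return (self.s1 / 30269 + self.s2 / 30307 + self.s3 / 30323) % 1.0

    gen = WichmanHill(123, 456, 789)
    print([round(gen.random(), 4) for _ in range(5)])

Because each modulus is below 2^15, every intermediate product fits comfortably in ordinary integer arithmetic, which is exactly the portability argument made in the text.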

2.3. Implementation Park and Miller (1988) and L'Ecuyer (1988) discuss implementation difficulties and make specific implementation recommendations. Park and Miller, in emphasizing the difficulty of writing valid, portable implementations, discuss a variety of incorrect implementations, mostly from textbooks. Since correct implementation is evidently so difficult, they suggest that users in doubt should use what they call their 'minimal standard generator', the multiplicative congruential generator with a = 16807 and m = 231 - 1 = 2147483647, which is relatively simple to implement and yet has reasonably good randomness. To ensure correct implementation, they encourage checking that W10000-1 043 618 065 when W1 = 1. We have not discussed how to check whether a particular generator is good. Floyd's algorithm (Knuth 1981, Section 3.1, Exercise 7) is useful if forced to use a generator with possibly small period: For each random number Ui = d ( g i _ l ) generated, also generate and s t o r e Uzi : d(d(U2(i_l))); if Ui¢ U2i, then U~ is not a repeated value. Plotting 2-tuples is a good quick check, as is plotting 3-tuples if you have the software for rotating a cloud of points. Simulating systems with known properties that are similar to those of interest is another common-sense check. Most textbooks describe an array of statistical tests of hypothesis. Researchers focus on number-theoretic properties, such as distance between hyperplanes and points in k-dimensional space. Heath and Sanchez (1986) emphasize that the statistical quality needed in a generator is a function of the number of random numbers needed for an application.
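A sketch of the minimal standard generator, with the jump-ahead relation W_i = ((a^i mod m) W_0) mod m of Section 2.2 used to obtain nonoverlapping streams. Python's built-in modular exponentiation is a convenience of the language, not part of the chapter; the check value is the one quoted from Park and Miller (1988), reached here by applying the recurrence 10 000 times to a seed of 1 (indexing conventions for the check differ by one between authors).

    M = 2_147_483_647          # 2**31 - 1, prime
    A = 16_807                 # the "minimal standard" multiplier, a primitive root of M

    def lehmer_stream(seed):
        """W_i = (A * W_{i-1}) mod M, returned as U(0,1) values W_i / M."""
        w = seed
        while True:
            w = (A * w) % M
            yield w / M

    def jump_seed(seed, i):
        """Seed of a stream starting i steps ahead: ((A**i mod M) * seed) mod M."""
        return (pow(A, i, M) * seed) % M

    # Park and Miller's correctness check: starting from 1, ten thousand
    # applications of the recurrence should give 1 043 618 065.
    w = 1
    for _ in range(10_000):
        w = (A * w) % M
    assert w == 1_043_618_065

    # Two nonoverlapping streams of one million numbers each.
    stream0 = lehmer_stream(jump_seed(1, 0))
    stream1 = lehmer_stream(jump_seed(1, 1_000_000))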

3. Random-variate generation

Simulation experiments require sequences of observations of random variables that represent the simulated system's environment. Examples of such exogenous random variables include random coefficients of differential equations, points at which a function is to be evaluated by Monte Carlo numerical integration, or interarrival times of customers in a stochastic service system. While sometimes deterministic observations are used (specific sets of coefficients, numerical quadrature formulae, or historic interevent times), often random sequences of observations having specified statistical properties are generated and input to the computer model. Such observations are termed random variates and may be observations from scalar random variables, multivariate random vectors, time series, or point processes. Random-variate generation refers to methods for transforming some given source of randomness to observations possessing desired statistical properties. Almost without exception, the given source of randomness is U(0, 1) random numbers, as discussed in Section 2. The rest of this section is a discussion of methods for transforming a sequence of random numbers {U_i} into a sequence of random variates {X_i}.

The criteria for selecting random-variate-generation algorithms include execution time, ease of implementation, portability, memory requirements, interaction with variance-reduction techniques, and the amount of approximation to the desired distribution. The amount of approximation is a function of three sources of error: error in the random numbers (discussed in Section 2), error induced by computer arithmetic, and the error inherent in the method. The first two sources of error predominate in state-of-the-art algorithms, since almost every algorithm published since 1980 contains no approximations in the mathematics. The execution time is composed of two parts: the time to set up the algorithm each time the parameter values change, and the marginal time required to generate one variate after setup is complete. The ease of implementation is often crudely measured by the number of lines of code, but also includes the type of support routines required, such as variate-generation algorithms for other distributions or routines for bit manipulation. Portability here, as for random numbers, is the ability to obtain the same sequence of random variates on different computers. Memory requirements for most algorithms are modest, except those that involve tables of constants. Even with tables, memory is seldom an issue with today's memory capacities. Finally, the method used for random-variate generation sometimes determines whether a variety of simple variance-reduction methods will be effective.

The discussion of random-variate generation is organized around four fundamental concepts from which random-variate-generation algorithms are composed: (1) inverse transformation, (2) composition, (3) acceptance/rejection, and (4) special properties. The fifth subsection discusses using conditional distributions to generate multivariate random vectors. The last subsection discusses processes correlated in time: time series and point processes. As one can quickly find by glancing through the encyclopedic Devroye (1986a), the topics discussed here are a small part of the variate-generation literature. The emphasis here is on fundamental concepts rather than on state-of-the-art algorithms, which typically are composed of more than one concept. In addition to Devroye, implementation suggestions and extended bibliographies


can be found in Kennedy and Gentle (1980), Ripley (1983), and Schmeiser (1980). ACM Transactions on Mathematical Software and Applied Statistics both publish state-of-the-art algorithms. IMSL contains state-of-the-art algorithms for many standard distributions.

3.1. Inverse transformation

The inverse transformation for a scalar random variable X is x = F_X^{-1}(u), where u ~ U(0, 1) and F_X^{-1}(u) is the smallest x such that the cumulative distribution function (cdf) of X, F_X(x), satisfies F_X(x) ≥ u. The transformation is valid for any distribution. Each point x corresponds to a subinterval of (0, 1) of length P(X = x). For continuous distributions, where each point x has zero probability, the validity of the transformation rests on F_X(X) being U(0, 1). The inverse transformation is one of many valid transformations from u_i to x_i, but it is fundamental since it is monotonically nondecreasing.

For a variety of input processes, the inverse transformation provides a closed-form generation method. Examples include (1) U(a, b): x = a + (b - a)u, (2) Weibull: x = [-α^{-1} ln(1 - u)]^{1/β}, where α is the scale parameter and β is the shape parameter, and (3) Geometric: x = ⌊ln(1 - u)/ln(1 - p)⌋, where p is the probability of a success on each Bernoulli trial, x is the number of failures observed before the first success, and ⌊s⌋ denotes the largest integer less than or equal to s.

The inversion is often not so easy. Inverse transformations for normal, gamma, and beta random variables must be performed numerically. Machine-accuracy numerical algorithms are usually an order of magnitude slower than the fastest algorithms and are about equally difficult to implement. Approximations - such as x = (u^0.135 - (1 - u)^0.135)/0.1975 for generating standard normal random variates - are often faster than exact algorithms and are often trivial to implement, but can only be used in application-specific situations. Software developers providing general-purpose codes need to use more accurate methods so that users of their software need not be concerned with accuracy. Recursive methods work well for the Poisson, binomial, and hypergeometric distributions when the mean is small, but numerical instability and slow execution times arise for large values of the means.

A variety of methods, including binary search and index tables, may be used to speed the search for the variate x that corresponds to the random number u. Index tables (Chen and Asau 1977, Fishman and Moore 1984) are a simple idea that can result in substantial reduction in execution time. Somewhat like the tabs on a dictionary, index tables allow the inverse-cdf search to begin at a point just to the left of the target. Consider the discrete case of a random variable X with mass points at {1, 2, ..., n}. Let T(i) = min x such that F(x) ≥ (i - 1)/k, where k is the number of indices. Defining T requires O(n) computation time for set-up. The generation of each variate requires a search from T(i) to T(i + 1), where i = ⌊uk⌋ + 1. Thus if many variates are to be generated from the same distribution, the set-up time is worthwhile. If k is chosen to equal n or larger, the marginal time is small. In fact, the composition-based alias method, which has received more attention in the literature, is almost dominated by index tables, whose set-up is easier, whose memory requirement is adjustable, and whose marginal times are essentially equal to that of the alias method.
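A sketch of the index-table idea for a discrete distribution on {1, 2, ..., n}, written with 0-based Python indices; the probabilities and the number of indices k are illustrative.

    import bisect
    import random

    def make_index_table(probs, k):
        """Set-up: cumulative distribution and index table T.
        T[i] = smallest mass-point index whose cdf value is >= i/k."""
        cdf, total = [], 0.0
        for p in probs:
            total += p
            cdf.append(total)
        cdf[-1] = 1.0                     # guard against floating-point roundoff
        T = [bisect.bisect_left(cdf, i / k) for i in range(k)]
        return cdf, T

    def generate(cdf, T, k, u):
        """Marginal step: start the inverse-cdf search at T[floor(u*k)]."""
        j = T[int(u * k)]
        while cdf[j] < u:
            j += 1
        return j + 1                      # mass points are 1, ..., n

    probs = [0.1, 0.2, 0.4, 0.2, 0.1]     # illustrative distribution
    cdf, T = make_index_table(probs, k=8)
    rng = random.Random(7)
    print([generate(cdf, T, 8, rng.random()) for _ in range(10)])

With k at least n, the search almost always terminates immediately, which is the sense in which the marginal time becomes comparable to that of the alias method.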

3.2. Composition

Composition (Peterson and Kronmal 1982) refers to the concept of mixing probability distributions. Let f(x) denote the density function if X is continuous or the mass function if X is discrete, where X may be multivariate. If f(x) can be written as f(x) = ∫ f_y(x) dP(y), then observations of X can be generated by first generating y from distribution P and then generating x from f_y(x), the conditional distribution given y. Here the distribution function P can be discrete or continuous and f_y(x) can be discrete or continuous for each y. The utility of composition arises when fast methods are available for generating from f_y(x) and the corresponding probability of choosing f_y(x) is high. Several state-of-the-art algorithms for normal, gamma, beta, Poisson, and binomial distributions rely on returning observations from easy-to-generate uniform and triangular densities a large percentage of the time. Composition has been central in the literature on univariate discrete distributions, in particular Marsaglia's tabling method and the alias method. These methods convert the general problem, in an initialization step, to one of choosing one of several equally likely outcomes for either the first step of choosing Y = y or the second step of choosing X given y (Schmeiser 1983).

Acceptance/complement algorithms, often viewed as a modification of acceptance/rejection algorithms, are an ingenious use of composition. To generate x from a density f, consider another density g that is similar to f and easy to generate. Three regions are defined by f and g: (1) that under both densities, (2) that under g and not under f, and (3) that under f and not under g. In composition terms, the density f is composed of Regions 1 and 3; we should sample from the density proportional to min(f, g) with probability p, the area of Region 1, and from the density proportional to max(0, f - g) with probability 1 - p. The acceptance/complement algorithm generates a point (x, v) uniformly distributed under g, the union of Regions 1 and 2, which involves sampling x from g and v uniformly over (0, g(x)). If v ≤ f(x), return x, since the point is in Region 1 and therefore under f. Otherwise, ignore the first value of x and return x from Region 3; that is, return a variate from the density proportional to max(0, f - g). The method has fast marginal generation times even if sampling from Region 3 is slow, since Region 1 is large and Region 3 is small when g is similar to f. But the beauty of acceptance/complement is that no set-up step is required to calculate the mixing probability p; the mixing probability enters the algorithm only implicitly. Regions 2 and 3 have equal area since f and g integrate to one, so we should sample from Region 3 with


the same probability as we find ourselves in Region 2, observations of which are available as a by-product of the first step. The idea extends directly to discrete and multivariate distributions.

3.3. Acceptance/rejection

Let X denote a discrete or continuous, possibly multivariate, random variable. Let t(x) be any function that satisfies t(x) ≥ f(x) for all x, with corresponding density or mass function r(x) = t(x)/c, where c = ∫ t(x) dx if X is continuous and c = Σ_x t(x) if X is discrete. Then observations of X can be generated by (1) generate x ~ r(x), (2) generate u ~ U(0, 1), (3) if u ≥ f(x)/t(x) go to step 1, (4) return x. Algorithm efficiency depends on the choice of t(x), which determines the difficulty of obtaining x in step 1, the difficulty of calculating the ratio of step 3, and the expected number of iterations c. Many state-of-the-art algorithms use acceptance/rejection as the framework, with inverse transformation or composition used to generate x in step 1.

3.4. Special properties

Occasionally a distribution has a special property that leads to a method of generation specific to that distribution. Here some intermediate random variates y_1, y_2, ..., y_m are transformed to obtain a single random variate x, with m possibly being random. Examples where special properties are useful include (1) lognormal: x = exp(y), where y is normal, (2) beta: x = y_1/(y_1 + y_2), where y_1 and y_2 are independent and gamma distributed, and (3) symmetric triangular density between a and b: x = a + ½(b - a)(u_1 + u_2), where u_1 and u_2 are independent U(0, 1). Patel, Kapadia, and Owen (1976) is a good source of special properties, as are the four volumes of Johnson and Kotz (1969, 1970a, 1970b, 1972). Note that several common distributions do not require special algorithms, since they are special cases of distributions already discussed. In particular, the Erlang distribution is a special case of the gamma distribution with integer shape parameter, the chi-square distribution with ν degrees of freedom is a special case of the gamma distribution with shape parameter ½ν and scale parameter 2, and the exponential is Weibull with shape parameter 1.

3.5. Conditional distributions

While in principle the previous concepts can be used to generate multivariate random vectors, a common approach is to generate the multivariate components one-by-one using their conditional distributions. That is, generate x_i ~ f_{X_i | x_1, x_2, ..., x_{i-1}}(x_i) for i = 1, 2, ..., n. Two common applications illustrate particularly nice results arising from conditional distributions.


Consider multivariate normal random vectors X with mean μ and covariance matrix Σ. Then x = μ + Cz yields observations of X, where C is lower triangular and satisfies CC^T = Σ and z is composed of independent standard normal random variates. A generalization to nonnormal random vectors is discussed in Section 4. The second example is the direct generation of sorted samples x_(1) ≤ x_(2) ≤ ... ≤ x_(n).
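A sketch of the first example, generating multivariate normal vectors from the Cholesky factor of the covariance matrix; numpy is assumed, and the mean and covariance values are illustrative.

    import numpy as np

    rng = np.random.default_rng(42)

    mu = np.array([10.0, 2.0, 5.0])                    # illustrative mean vector
    sigma = np.array([[4.0, 1.0, 0.5],
                      [1.0, 2.0, 0.3],
                      [0.5, 0.3, 1.0]])                # illustrative covariance matrix

    C = np.linalg.cholesky(sigma)      # lower triangular, C C^T recovers sigma

    def mvn(n):
        """x = mu + C z, with z a vector of independent standard normals."""
        z = rng.standard_normal((n, len(mu)))
        return mu + z @ C.T

    x = mvn(100_000)
    print(np.cov(x, rowvar=False).round(2))   # should be close to sigma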
3.6. Processes correlated in time

Two types of autocorrelated processes commonly used in computer experimentation are time series and point processes. Autoregressive/moving-average (ARMA) time series and nonhomogeneous Poisson point processes are straightforward to generate. ARMA time series satisfy

    X_t = μ + Σ_{i=1}^{p} φ_i (X_{t-i} - μ) - Σ_{i=1}^{q} θ_i ε_{t-i} + ε_t

for t = ..., -2, -1, 0, 1, ..., where the process mean is μ and the random shocks ε_t have mean zero, variance σ², and zero covariance. Typically the random shocks are assumed to have normal marginal distributions, in which case random variates can be generated directly from the definition starting with t = 0 after generating (X_{-1}, X_{-2}, ..., X_{-p}) and (ε_{-1}, ε_{-2}, ..., ε_{-q}) one time from the appropriate multivariate normal distribution. Barone (1987) discusses initialization and generation of multivariate time series. A generalization to nonnormal time series is discussed in Section 4.

Nonhomogeneous Poisson processes are appropriate models of times of events occurring independently with time-dependent rate function λ(t). Given the process (simulation) is at time t_{i-1}, the distribution function of the time of the next event, t_i, is F_{T_i | t_{i-1}}(t_i) = 1 - exp[-∫_{t_{i-1}}^{t_i} λ(t) dt]. If T_i is continuous, then t_i can be obtained using the inverse transformation by solving for t_i in ∫_{t_{i-1}}^{t_i} λ(t) dt = -ln(1 - u), where u ~ U(0, 1). When the inverse is not in closed form, one can thin another process having rate function λ*(t) satisfying λ*(t) ≥ λ(t) for all t (Lewis and Shedler 1979). Here the event time t arising from λ*(t) is ignored with probability 1 - λ(t)/λ*(t). Typically, λ* is chosen to have a closed-form inverse transformation, often λ*(t) = max_t λ(t).
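A sketch of thinning for a nonhomogeneous Poisson process: candidate event times are generated from a constant dominating rate λ* = max_t λ(t) and each is kept with probability λ(t)/λ*. The rate function, time horizon, and seed below are illustrative assumptions.

    import math
    import random

    rng = random.Random(2024)

    def lam(t):
        # Illustrative time-dependent rate (events per hour, 24-hour cycle).
        return 5.0 + 4.0 * math.sin(2.0 * math.pi * t / 24.0)

    LAM_STAR = 9.0      # dominating rate: lam(t) <= 9 for all t

    def nhpp_thinning(horizon):
        """Event times of a nonhomogeneous Poisson process on [0, horizon]."""
        times, t = [], 0.0
        while True:
            # Candidate event from the homogeneous process with rate LAM_STAR
            # (exponential gaps via the inverse transformation).
            t += -math.log(1.0 - rng.random()) / LAM_STAR
            if t > horizon:
                return times
            # Keep the candidate with probability lam(t) / LAM_STAR.
            if rng.random() <= lam(t) / LAM_STAR:
                times.append(t)

    events = nhpp_thinning(48.0)
    print(len(events), "events in 48 hours")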


4. Input modeling

The input to a simulation experiment is the sequence of exogenous random variables that represent the random environment of the system being simulated. Input modeling is the process of determining the model for these exogenous random variables. The input model must be both an adequate representation of the environment of interest and amenable to producing random observations. Input modeling arises in two substantially different situations: data fitting and subjective modeling using no data. When data are available, all the usual statistical methods of fitting models to data are appropriate, as discussed in Lehoczky's chapter on statistical methods (Chapter 6), including Bayesian methods. We mention here some input modeling ideas that seem particularly relevant to simulation.

4.1. General families of scalar distributions

For univariate modeling, there are several families of distributions that are general and have tractable inverse transformations. Rather than having their genesis in common physical models, these distributions are simply empirical models that have been found to work well in a variety of situations. Schmeiser (1977) discusses some early ideas. Typically the models are for continuous univariate random variables, but discrete approximations and multivariate extensions are straightforward.

Ramberg et al. (1979) discuss the inverse transformation x = λ_1 + (u^{λ_3} - (1 - u)^{λ_4})/λ_2, which by manipulating the values of the four parameters attains a variety of shapes. This family, which contains the uniform distribution as a special case and the exponential as a limiting case, can match any first four moments except those with light tails, such as those that occur with U-shaped distributions. Matching four moments leads to the standard normal approximation x = (u^{0.135} - (1 - u)^{0.135})/0.1975, mentioned in Section 3, which has a distribution function that differs from the normal by less than 0.001 for all x.

The Johnson system of distributions provides a univariate continuous distribution to match any combination of mean, variance, skewness, and kurtosis. The system is composed of S_B distributions, which are bounded and have fourth moments less than those of the lognormal distribution, and S_U distributions, which are unbounded and have fourth moments greater than the lognormal, along with the lognormal and normal distributions as boundary cases. In each of the four cases, the inverse transformation is a closed-form function of a standard normal random variate. Due to the multiple types of distributions and the complexity of the moment formulas, the Johnson family requires a variety of numerical software. De Brota, Dittus, Roberts, Wilson, Swain, and Venkatraman (1988) survey their work on using Johnson-system distributions as input models, including software for data fitting and for subjective modeling using interactive graphics based on the density function.
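A small sketch of the inverse transformation for the four-parameter family of Ramberg et al. and of the standard-normal approximation quoted above; the λ values used in the call are simply those of the normal approximation, and in practice they would come from a moment-matching fit.

    import random

    def ramberg_inverse(u, lam1, lam2, lam3, lam4):
        """x = lam1 + (u**lam3 - (1 - u)**lam4) / lam2."""
        return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2

    def approx_std_normal(u):
        """Four-moment approximation to the standard normal quoted in the text."""
        return (u ** 0.135 - (1.0 - u) ** 0.135) / 0.1975

    rng = random.Random(1)
    u = rng.random()
    z = approx_std_normal(u)                              # approximately N(0, 1)
    x = ramberg_inverse(u, 0.0, 0.1975, 0.135, 0.135)     # same member of the family
    print(z, x)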


4.2. Modifying scalar distributions

Hora (1983) takes a different tack by trying to improve the fit of a continuous univariate inverse-cdf input model F_0. Rather than generating random variates using F_0^{-1}(u), where u is a random number, he suggests using F_0^{-1}(g(u)), where

    g(u) = c u^{β_0} exp(Σ_{i=1}^{r} β_i u^i / i),

with c = exp(-Σ_{i=1}^{r} β_i / i) and r is the order of the model. Given data {X_i} and a reasonably well-fitted inverse cdf F_0, the parameters β = (β_0, β_1, ..., β_r) are estimated using the usual ordinary least-squares regression β̂ = (A^T A)^{-1} A^T W, where A is the (n - 1) × (r + 1) matrix having components a_{ij} = (j/(n + 1))^i and W is the (n - 1)-dimensional vector having components W_j = j[ln F_0(x_{(j+1)}) - ln F_0(x_{(j)})], where x_{(j)} denotes the jth order statistic. The derivation is straightforward and heuristic. Hora's regression is simple to implement, but it does not guarantee a monotonic transformation and in some cases poor, or even nonsensical, fits are obtained. Avramidis (1989) studies polynomial functions g(u), fitting them using a nonlinear-programming algorithm constrained to force monotonicity. Since g is monotonically increasing, F_0^{-1}(g(u)) is the inverse transformation with its advantages, as discussed in Section 7 on variance reduction. A natural combination is to choose F_0 from the Johnson family and to fit g(u) only if necessary.

4.3. General multivariate distributions

Johnson (1987) discusses many classical continuous multivariate distributions, with interesting graphics showing the shapes of bivariate densities as a function of parameter values. Random-variate generation is discussed, but fitting to data is not. Rather, the emphasis is on those situations where the analyst wants to subject a system to a variety of multivariate input types in a designed simulation experiment. Schmeiser and Lal (1980) survey multivariate models useful for simulation, with an emphasis on empirical models. We elaborate here on a modeling approach useful when we have specified marginal distributions F_1, F_2, ..., F_k and a specified k × k correlation matrix R = (ρ_{ij}). This context arises often in engineering models, since specifying marginal distributions and correlations, whether from data or opinion, is more feasible than specifying a complete multivariate distribution.

4.4. Modifying multivariate distributions

Hull (1977) suggests transforming a random vector from a tractable multivariate model F* into a random vector having the desired marginals and


correlations. Typically, we choose F* to be the k-dimensional multivariate normal distribution with correlation matrix R*, from which we can generate random vectors using the Cholesky decomposition from Section 3. The algorithm is

1. Generate a random vector y = (y_1, y_2, ..., y_k) from F*.
2. Calculate u = (u_1, u_2, ..., u_k), where u_i = F*_i(y_i).
3. Calculate x = (x_1, x_2, ..., x_k), where x_i = F_i^{-1}(u_i).

Now for any choice of F* with continuous marginal distributions we obtain the specified marginal distributions of X, since the marginal distributions of U are uniform (0, 1). Since both Steps 2 and 3 are nonlinear transformations, in general the correlation matrix R* of F* will not equal the correlation matrix R of F. Let μ_i and σ_i denote the mean and standard deviation, respectively, of X_i. For each of the ½k(k - 1) combinations of i and j, ρ_{ij} satisfies

    ρ_{ij} = ∫∫ [F_i^{-1}(F*_i(y_i)) - μ_i]/σ_i × [F_j^{-1}(F*_j(y_j)) - μ_j]/σ_j dF*(y_i, y_j; ρ*_{ij}),

which typically does not have a closed-form solution. When the marginals of F are similar to those of F*, choosing R* = R is often a reasonable approximation. The quality of the approximation can be easily checked by estimating R by exercising the algorithm to obtain n realizations of x. If some estimate ρ̂_{ij} is unacceptably far from ρ_{ij}, then ρ*_{ij} can be adjusted in the direction of desired change. The standard error of ρ̂_{ij} is approximately n^{-1/2}(1 - ρ_{ij}²), so n ≥ 10 000 ensures almost two-place accuracy in the estimates, satisfactory for most applications where the distribution itself is not required.
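A sketch of the three-step transformation for two components; numpy and scipy are assumed, the target marginals (gamma and exponential) and the trial value of ρ* are illustrative, and the printed correlation is the quantity one would compare to the target and adjust as described above.

    import numpy as np
    from scipy.stats import norm, gamma, expon

    rng = np.random.default_rng(0)

    # Target inverse cdfs F_1^{-1}, F_2^{-1} and a trial correlation R*.
    marginal_ppfs = [lambda u: gamma.ppf(u, a=2.0, scale=1.5),
                     lambda u: expon.ppf(u, scale=4.0)]
    r_star = 0.7
    R_star = np.array([[1.0, r_star], [r_star, 1.0]])

    def sample(n):
        # Step 1: random vectors y from the bivariate normal with correlation R*.
        y = rng.multivariate_normal(mean=[0.0, 0.0], cov=R_star, size=n)
        # Step 2: u_i = Phi(y_i), which are U(0,1) marginally.
        u = norm.cdf(y)
        # Step 3: x_i = F_i^{-1}(u_i), giving the specified marginals.
        return np.column_stack([marginal_ppfs[i](u[:, i]) for i in range(2)])

    x = sample(10_000)
    print(np.corrcoef(x, rowvar=False)[0, 1])   # achieved correlation of X_1, X_2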

4.5. Modifying time series

Just as a multivariate normal F* is the natural choice for multivariate random vectors, ARMA time series with normal marginal distributions seem reasonable for Step 1 when modeling time series with nonnormal marginal distributions. Suppose we have specified a marginal distribution F_X and an autocorrelation structure ρ_1, ρ_2, ..., ρ_q. Let F* represent a time-series model with correlation structure close to that desired, and let F*_Y denote its marginal distribution. To generate x_i from previous observations x_{i-1}, x_{i-2}, ..., we proceed as follows:

1. Generate a random variate y_i from the time-series model F*.
2. Calculate u_i = F*_Y(y_i).
3. Calculate x_i = F_X^{-1}(u_i).

Here again Steps 1 and 2 serve simply to generate correlated random


numbers for use in the inverse transformation of Step 3. The same correlation issues arise in determining the model for Step 1 as in the multivariate setting. Extension to time-varying mean and variance is straightforward. Devroye (1986b) discusses nonparametric methods of generating random variates from an existing random sample.
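A sketch of the same three steps for a time series: a stationary AR(1) base process with standard normal marginals (Step 1) is pushed through the normal cdf (Step 2) and then through the inverse cdf of a target lognormal marginal (Step 3). numpy and scipy are assumed; the autoregressive coefficient and the marginal parameters are illustrative.

    import numpy as np
    from scipy.stats import norm, lognorm

    rng = np.random.default_rng(7)

    phi = 0.8                        # AR(1) coefficient (illustrative)
    target = lognorm(s=0.5)          # target marginal F_X (illustrative)

    def correlated_series(n):
        # Step 1: stationary AR(1) with N(0, 1) marginals:
        # y_i = phi * y_{i-1} + sqrt(1 - phi**2) * e_i.
        y = np.empty(n)
        y[0] = rng.standard_normal()
        for i in range(1, n):
            y[i] = phi * y[i - 1] + np.sqrt(1.0 - phi ** 2) * rng.standard_normal()
        # Step 2: u_i = Phi(y_i) are correlated random numbers.
        u = norm.cdf(y)
        # Step 3: x_i = F_X^{-1}(u_i) have the target marginal and are autocorrelated.
        return target.ppf(u)

    x = correlated_series(10_000)
    print(np.corrcoef(x[:-1], x[1:])[0, 1])   # lag-1 autocorrelation after the transformation

As in the multivariate case, the autocorrelation of the transformed series differs somewhat from that of the base process, so the base-process correlation structure may need to be adjusted toward the target.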

5. Point estimation

The tangible result of a simulation experiment is a point estimate θ̂ of the unknown performance measure θ. Although we are typically interested in many performance measures and therefore a multivariate θ, we focus on a single performance measure. Several scalar θ̂'s may be used to estimate a multivariate θ. We discuss the single-replication/multiple-replication dichotomy of simulation experiments in Subsection 5.1, the central role of the mean in Subsection 5.2, estimators based on cumulative statistics in Subsection 5.3, other estimators in Subsection 5.4, causes of bias in Subsection 5.5, and end with a comment in Subsection 5.6.

5.1. The single-replication/multiple-replication dichotomy

For every performance measure θ, an output process Y = {Y_i} must be defined such that θ is a property of F_Y = lim_{i→∞} F_{Y_i}, the limiting distribution. The process Y is often scalar, as when estimating moments, probabilities, quantiles, autocovariances, and autocorrelations. Two-dimensional processes are required when estimating covariances and correlations. In general, the performance measure can be any property of a multivariate distribution from which we can create observations.

There are two approaches. In the first, the simulation experiment consists of n (possibly dependent) identically distributed replications, each yielding a single observation Y_i with distribution function F_Y. Terminating models always belong to this first approach; steady-state models less naturally can be simulated using this approach. In the second approach, Y_i is the ith observation within one replication; here the observations are usually not independent and are sometimes not identically distributed, with early observations affected by an initial transient being a common cause. Steady-state models fit more naturally into this second approach than the first. Nonterminating periodic models can be simulated with either approach; the second approach is accomplished by defining observations from the same point in each period.

As an example, consider a computer system for which we wish to estimate response-time properties. With the first approach (multiple replications), each replication might represent a single day with Y_i being the maximal response time for the ith day. With the second approach (single replication), Y_i might be the response time of the ith request. In both approaches, θ is the mean of the distribution of Y.


5.2. (Almost) everything is a mean

For either type of experiment and whatever property, the output process can be defined so that θ is either the expected value μ_Y = ∫_{-∞}^{∞} y dF_Y(y) or the pth quantile y_p, which by definition satisfies F_Y(y_p) = p for p ∈ (0, 1). The variance σ_Y² = ∫_{-∞}^{∞} (y - μ_Y)² dF_Y(y) can be thought of as the asymptotic mean of the output process {(Y_i - Ȳ)²}, where Ȳ is a sample average from Y. Likewise, higher-order moments can be viewed as the asymptotic mean of processes defined as a function of Y. For example, the covariance is the asymptotic mean of the output process {(Y_{1i} - Ȳ_1)(Y_{2i} - Ȳ_2)}, where now the output process is bivariate. Estimating the probability of an event A can be viewed as estimating the mean of the sequence of indicator functions I{A}, equal to one when the event occurs and zero when the event does not occur. A special case is estimating p = F_Y(y_p), the probability that Y_i is less than or equal to y_p, where y_p is given; in this case the indicator function is I{Y_i ≤ y_p}. The probabilities of more-general events, such as the sum of two adjacent requests taking more than two seconds, can be viewed as the mean of corresponding output processes of indicator functions I{Y_i + Y_{i+1} > 2}.

Thus, with the exception of quantiles, the purpose of simulation can be viewed as estimating population means, which are integrals. In this way, system simulation is linked to the conceptually simpler problem of using Monte Carlo experiments to determine the area under a curve. Pragmatically, however, we often consider estimating variances and higher-order moments separately from estimating means. The reason is that viewing variances and higher-order moments as means requires the output process to be defined using the sample mean, which is not available until the end of the experiment. Therefore, we usually use various functions of the original output process {Y_i} to estimate these quantities.

5.3. Some estimators based on cumulative statistics

Each performance measure θ is a property of the model being simulated, and not a function of the simulation experiment, whereas θ̂ is a function of the simulation experiment. Since we view variance reduction as the process of defining the simulation experiment, including choosing sample size, the choice of estimator is logically one of variance reduction, the topic of Section 7. But discussing point estimation separately allows us to discuss some relatively simple ideas here, as well as output analysis in Section 6, before introducing more-involved variance-reduction ideas. For each type of performance measure θ, we now briefly discuss the usual point estimators θ̂. Depending upon the model and the knowledge of the analyst, better estimators can often be defined, as discussed in Section 7.

The mean μ_Y ≡ lim_{i→∞} μ_{Y_i} is almost always estimated using the sample mean Ȳ = n^{-1} Σ_{i=1}^{n} Y_i, where n denotes the sample size. When the data have the common mean μ_Y (no initial transient) and a constant sample size n, then the sample mean is an unbiased estimator for μ_Y, as is any estimator of the form


Σ_{i=1}^{n} a_i Y_i, where Σ_{i=1}^{n} a_i = 1. The sample mean, obtained when the weights are all a_i = 1/n, is the minimum-variance unbiased estimator when the observations are identically distributed and independent (i.i.d.). More generally, when the observations are identically distributed and possibly dependent, the weights of the minimal-variance unbiased estimator are a = R^{-1}1/(1^T R^{-1} 1), where R is the n × n correlation matrix with elements corr(Y_i, Y_j), 1 is the n × 1 vector of ones, and a is the n × 1 vector of weights. Thus for correlated output, some reduction in variance can be obtained with unequal weights. In particular, for a covariance-stationary process with positive autocorrelation, the optimal weights are larger on the ends of the sample and smaller (possibly negative) in the middle. Intuitively, this type of weighting is optimal since the end observations carry information about the unseen observations just beyond the sample. However, the reduction in variance is small (Halfin 1982), and asymptotically the sample mean has the same variance as the optimal estimator (Fuller 1976, Chapter 6). In addition, the unknown optimal weights must be estimated in practice. Therefore, the usual sample mean is used to estimate μ_Y.

Just as for the mean, we also usually use classical i.i.d. statistics to estimate other properties, even when the observations are correlated. Probabilities are usually estimated using the sample means of the indicator functions; i.e., the fraction of the observations corresponding to a success, a special case of means. We now discuss variances, higher-order moments, and quantiles.

To estimate the population variance σ_Y², we use the sample variance S_Y² = (n - 1)^{-1}(Σ_{i=1}^{n} Y_i² - nȲ²), which has expected value n(n - 1)^{-1}(σ_Y² - var(Ȳ)) regardless of the correlation structure. Thus S_Y² is an unbiased estimator for σ_Y² when the data are i.i.d. and always is asymptotically unbiased, since var(Ȳ) goes to zero as the sample size becomes large. One could correct for the bias with an estimate of var(Ȳ) from Section 6, but again the usual i.i.d. estimator S_Y² is used almost without exception because the potential gain is small. As in the i.i.d. case, sometimes other denominators are used, but the small absolute bias of using n - 1 is advantageous in Section 6, where point estimators are sometimes averaged to estimate their standard error.

Similarly, we use the corresponding sample values for higher-order moments: for the kth noncentral moment, μ̂'_k = n^{-1} Σ_{i=1}^{n} Y_i^k; for the kth central moment, μ̂_k = Σ_{i=0}^{k} (-1)^i (k choose i) (μ̂'_1)^i μ̂'_{k-i}; for the kth standardized moment, α̂_k = μ̂_k / S_Y^k. Some care should be taken when estimating higher-order moments, since it is easy to be misled by such Monte Carlo results. That such estimators have a high variance is well known and is the reason that the method of moments does not work well for estimating the underlying distribution using higher-order moment estimators. But an additional problem is that the sampling distribution of α̂_k is skewed and has heavy tails. For example, consider estimating the kurtosis, α_4, from normal i.i.d. data. Even with a relatively large sample size, observations of α̂_4 are almost always between 2.5 and the true value α_4 = 3. The problem is that the distribution of α̂_4 has a mode that lies to the left of the true value α_4 and has a heavy right tail, so that occasionally a

Ch. 7. Simulation Experiments

315

quite-large observation of $\hat{\alpha}_4$ occurs. Without large sample sizes, seldom will the simulation experiment produce a sample kurtosis close to $\alpha_4$.
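A minimal Monte Carlo sketch of this behavior follows. It is not from the original text; it assumes numpy is available, and the sample size, replication count, and function name are illustrative only.

```python
# Sketch: the sample kurtosis of normal i.i.d. data is usually below the true value 3,
# even though occasional very large estimates keep the average near 3.
import numpy as np

rng = np.random.default_rng(12345)

def sample_kurtosis(y):
    """Moment-based standardized fourth moment: m4 / m2**2."""
    d = y - y.mean()
    return np.mean(d**4) / np.mean(d**2)**2

n, reps = 200, 10_000
kurts = np.array([sample_kurtosis(rng.standard_normal(n)) for _ in range(reps)])

print("average estimate:", kurts.mean())             # only slightly below 3
print("fraction below 3:", np.mean(kurts < 3))        # well over one half
```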

5.4. Estimating quantiles

All of the statistics mentioned above can be implemented with cumulative statistics; that is, we can calculate the estimators from sums of functions of the observations that are collected while the simulation is in progress. The inefficiency of the alternative - saving all observations for later analysis, either in memory or in peripheral storage - is less important today than a few years ago, especially with the recent emphasis on graphical display of simulation output. Quantiles are fundamentally more difficult, since their estimators are most easily defined in terms of order statistics (the sorted sample), which are not amenable to cumulative statistics. For identically distributed data, either independent or dependent, let $Y_{(i)}$ denote the $i$th observation after sorting into ascending order. A simple estimator of $F_Y^{-1}(p)$ is $Y_{(r)}$, where $r$ is $p(n+1)$ rounded to an integer. Better estimators are obtained by taking linear combinations of multiple order statistics; for example, $(1-\alpha) Y_{(r)} + \alpha Y_{(r+1)}$ with $r = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - \lfloor p(n+1) \rfloor$.
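A small sketch of the interpolated order-statistic estimator just described follows; it is not from the original text, assumes numpy, and the function name, test distribution, and sample size are illustrative.

```python
# Sketch: estimate the p-th quantile by interpolating between adjacent order statistics.
import numpy as np

def quantile_estimate(y, p):
    y_sorted = np.sort(y)
    n = len(y_sorted)
    pos = p * (n + 1)                 # position on the 1..n scale
    r = int(np.floor(pos))            # lower order-statistic index
    a = pos - r                       # interpolation weight
    r = min(max(r, 1), n - 1)         # guard the edges for extreme p or tiny n
    return (1 - a) * y_sorted[r - 1] + a * y_sorted[r]

y = np.random.default_rng(1).exponential(scale=2.0, size=10_000)
print(quantile_estimate(y, 0.9))      # true 0.9 quantile is 2*ln(10), about 4.605
```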

Histograms provide a simple, popular mechanism for estimating several quantiles simultaneously, since cell counts are cumulative statistics. Some information is lost by grouping the data into cells; in fact, quantile estimators based on fixed cell-width histograms are not consistent. But with small-width cells the estimators are often adequate. The argument for why histograms are valid when filled with dependent data is straightforward and indicative of why all of the point estimators of this section are valid for dependent data. By valid, we mean that for any sample size $n$ the probability of $Y_i$ lying in the cell covering the interval $(a, b)$ is $\int_a^b dF_Y(y)$, assuming that $Y_i$ is sampled from $F_Y$, the marginal distribution function of $Y_i$ at all times $i$. First suppose the sample size is $n = 1$; then dependence is irrelevant and the result is true by the definition of $F_Y$. For general values of $n$, the relative frequency of this cell is $n^{-1} \sum_{i=1}^{n} I\{a \le Y_i < b\}$. The expected value is identical for each term; therefore the expected value is the same as for the $n = 1$ case, demonstrating the histogram's validity. Of course, the joint probabilities associated with multiple cells do depend on the correlation structure. In particular, a classic problem in simulation is that the observations are positively correlated and therefore we sample only from part of the histogram in a short run, affecting goodness-of-fit tests and causing the severe effect of dependence on $\mathrm{var}(\hat{\theta})$.

5.5. Causes of bias

In Section 6, methods for estimating the variance of these point estimators are discussed. But more fundamentally, we need to assure ourselves that the estimators have negligible bias. There are three causes of substantial bias that
commonly arise in simulation experiments: the initial transient, random sample sizes, and nonlinear transformations. We discuss each in turn.

The argument for histogram validity above assumes there is no initial transient, the disturbance in the distribution of early observations caused by starting the simulation from other than the steady-state distribution $F_Y$, a problem incurred with every replication of most steady-state simulations. Typically we are most concerned with the effect of initial bias. Ameliorating this bias takes two forms: (a) efforts to sample the initial system state from a distribution similar to $F_Y$, and (b) modifications to the estimator to reduce the effects of early observations. The simplest useful idea for initial-bias amelioration is to use a single replication to estimate steady-state performance measures, since the initial transient is then incurred only once. The second idea is to start the system in a state that is not grossly atypical; an empty manufacturing facility often requires the least effort but is not at all typical. The third idea is to discard early observations when the first two ideas are either impractical or insufficient to convince the analyst that the transient effect will be negligible. Discarding early observations corresponds to giving those observations zero weight; more sensible from a statistical-properties point of view is to gradually increase the weights, which recognizes that in most systems the initial transient does not disappear at a particular time. A more sophisticated approach is to use all the data to fit a model that explicitly includes both the steady-state mean and the initial-transient bias, as examined by Snell and Schruben (1985) for AR(1) data processes. Welch (1983) discusses a graphical approach based on multiple replications to evaluate the initial transient.

The second cause of point-estimator bias is random sample size, which is common in simulation experiments. The randomness arises for three reasons. First, since we are often interested in multiple performance measures, typically the stopping rule will cause some samples to have deterministic size and others random size. For example, if we simulate for 5000 customers, the number of hours is random; if we simulate for 5000 hours, the number of customers is random. Second, we often are estimating conditional performance measures, such as the expected response time for those requests that must wait. Third, random sample sizes can occur by design when sequential stopping rules are used. Random sample sizes can cause bias in the point estimators, since the expected value of a ratio of random variables is not the ratio of the expected values. However, as long as the sample sizes are 'large' the bias is negligible. Two situations commonly cause the bias to be nonnegligible even for relatively large simulation experiments. The first situation arises in estimating a conditional parameter (e.g., the expected sojourn time of a customer needing a particular combination of services) when the conditioning event so seldom occurs that the sample size for the estimator is small; the obvious worst case is when the sample size is zero and no estimate is obtained. When the sample size is positive but small, the problem is easily overlooked. The second situation is caused by averaging averages. Suppose the $j$th replication has output process
$\{Y_{1j}, Y_{2j}, \ldots, Y_{m_j j}\}$ for $j = 1, 2, \ldots, k$ and that we are estimating the common mean. Either of two estimators is sometimes used. The first is to obtain the average from each replication, $\bar{Y}_j$, and to average these to obtain $\bar{\bar{Y}} = k^{-1} \sum_{j=1}^{k} \bar{Y}_j$. The second is to take the grand average $\bar{Y} = \sum_{j=1}^{k} \sum_{i=1}^{m_j} Y_{ij} / \sum_{j=1}^{k} m_j$. The former has more bias, since the bias is incurred with each replication average $\bar{Y}_j$, while the latter is a single ratio with larger sample size. The same issue exists for estimators other than the mean. The former, which is appealing because estimating its standard error from the replication averages is more straightforward than for the grand average, as discussed in Section 6, should not be used unless each replication sample size is large.

Nonlinear transformations are the third source of bias. Consider estimating the expected future value $F$ of an investment of $P$ for $k$ years with continuous compounding at interest rate $r$ per year. In the deterministic case, the future value is $F = P e^{rk}$. Now suppose the interest rate is random, $R$, as determined by a stochastic model from which we can obtain simulated realizations of the rate. A tempting approach is to simulate only interest rates, obtaining $n$ rates $R_1, R_2, \ldots, R_n$, from which we calculate an estimator of the mean interest rate, $\bar{R}$, and in turn calculate the estimator of the expected future value $\hat{E}(F) = P e^{\bar{R}k}$. The problem, of course, is that $E(\hat{E}(F)) = E(P e^{\bar{R}k}) \ne E(P e^{Rk}) = E(F)$, since exponentiation is nonlinear. Indeed, this estimator is not even consistent for $E(F)$: as $n$ grows it converges to $P e^{E(R)k}$ rather than to $E(F)$. The better (and unbiased) approach is to calculate, for each observed interest rate $R_i$, the corresponding future value $F_i$ and then to estimate $E(F)$ with the sample mean $\bar{F}$. This example illustrates the bias induced whenever we substitute sample moments for a population moment in a nonlinear function. Similarly, some common estimators are biased.
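A small numerical sketch of the future-value example follows. It is not from the original text; the interest-rate model, parameter values, and seed are purely illustrative, and numpy is assumed.

```python
# Sketch: transform each simulated rate and then average, rather than
# averaging the rates and then applying the nonlinear transformation.
import numpy as np

rng = np.random.default_rng(7)
P, k, n = 1000.0, 10, 100_000                    # principal, years, number of simulated rates
R = rng.normal(loc=0.05, scale=0.02, size=n)     # hypothetical stochastic interest-rate model

plug_in  = P * np.exp(R.mean() * k)              # biased: nonlinear function of the average rate
unbiased = np.mean(P * np.exp(R * k))            # transform each rate, then average

print(plug_in, unbiased)   # the second is larger, since E[exp(Rk)] > exp(E[R]k) by Jensen's inequality
```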

5.6. A comment

We end this section with a comment. The classical i.i.d. point estimators discussed in this section seem to be most people's natural choice, often because they have not thought to be concerned that dependence might have an adverse effect. Unfortunately, those who are concerned about the effect of dependence often hold the misconception that $S_Y^2$ cannot be used to estimate $\sigma_Y^2$ when the data are dependent. (Seldom is a similar concern expressed about using histogram data, which would also be misleading if the sample variance were misleading.) The source of the confusion about estimating $\sigma_Y^2$ probably stems from the very real difficulties caused by dependence in estimating the variance of the sample mean. As we see in Section 6, using the classical i.i.d. estimator $S_Y^2 / n$ to estimate $\mathrm{var}(\bar{Y})$ can be so biased as to be without useful meaning.

6. Output analysis

In this section we consider the problem of estimating the precision of the simulation experiment. We wish to compute a measure that provides the
analyst with a sense of the sampling error in $\hat{\theta}$, using only the same output data $Y_1, Y_2, \ldots, Y_n$ used to calculate $\hat{\theta}$. Such a measure is needed for the analyst to interpret the simulation results. For example, if we are estimating expected parts processed per shift and we obtain the point estimate $\hat{\theta} = 75.4335$, one-digit significance might have much different implications than three-digit significance, if for no reason other than deciding whether to simulate longer. As in the last section, we consider only one performance measure at a time, and therefore $\hat{\theta}$ is scalar.

We treat this problem at two levels. The first is to estimate the standard error of $\hat{\theta}$, or equivalently its square, $\mathrm{var}(\hat{\theta})$. The second is to use the standard error to calculate other measures, the most popular being confidence intervals, but also prediction intervals and the probability of correct selection when comparing systems. In Subsection 6.1 we consider estimating $\mathrm{var}(\hat{\theta})$ from i.i.d. data, as arises with multiple replications. In Subsection 6.2 we consider estimating $\mathrm{var}(\hat{\theta})$ from covariance-stationary dependent data, as arises when using one replication to estimate steady-state measures of performance. We briefly discuss confidence intervals in Subsection 6.3.

6.1. I.i.d. output

First consider estimating the probability of an event $A$, say $p$, with $\hat{\theta} = \hat{p} = n^{-1} \sum_{i=1}^{n} I\{Y_i \in A\}$, the fraction of the independent observations corresponding to a success. From the binomial-distribution context we see that $\mathrm{var}(\hat{p}) = p(1-p) E(N^{-1})$, where the sample size $N$ does not depend on the observations, but is possibly random. A good estimator of $\mathrm{var}(\hat{p})$ is $\widehat{\mathrm{var}}(\hat{p}) = \hat{p}(1 - \hat{p})(N-1)^{-1}$, which is unbiased.

Now consider a general sample mean, $\hat{\theta} = \bar{Y}$. The variance is $\mathrm{var}(\bar{Y}) = \mathrm{var}(Y_i) E(N^{-1})$, where again the sample size does not depend on the observations, but is possibly random. A good, unbiased estimator is $S_Y^2 / N$, where $S_Y^2 = (N-1)^{-1} (\sum_{i=1}^{N} Y_i^2 - N \bar{Y}^2)$ is the sample variance of the $N \ge 2$ observations.

Next consider a sample variance, $\hat{\theta} = S_Y^2$. For a fixed sample size $N = n$, the variance is $\mathrm{var}(S_Y^2) = n^{-1} \sigma_Y^4 (\alpha_4 - (n-3)/(n-1))$, where $\alpha_4 = E[((Y_i - \mu_Y)/\sigma_Y)^4]$ is the marginal kurtosis of $Y$. Substituting $S_Y^2$ for $\sigma_Y^2$ and $\hat{\alpha}_4$ for $\alpha_4$ yields the obvious estimator. However, as noted in Section 5, estimating $\alpha_4$ is difficult unless $n$ is quite large.

For i.i.d. data, each type of classical estimator has a well-developed theory for estimating estimator variance. But as we have seen for probabilities, means, and variances, the method differs depending upon the type of estimator. Also, as discussed in Subsection 6.2, such methods are inappropriate when the data are dependent. Therefore, we seek a method that works for any type of estimator and that will extend to dependent data.

The primary alternative to using a specialized method for each type of performance measure is to use macro/micro replications, which can be used to estimate the variance of any type of point estimator $\hat{\theta}$. The idea is to view the experiment as having $k$ independent macro replications of $m$ micro replications,
where $km = n$. Each micro replication produces one observation from the output process $Y$; each macro replication produces one estimator, $\hat{\theta}(j)$, which is defined analogously to $\hat{\theta}$ but is based on only $m$ observations. That is, each macro replication is a sample-size-$m$ version of the whole experiment. Logically, the implementation contains an outer (macro) loop running from $j = 1$ to $j = k$ and an inner (micro) loop running from $i = 1$ to $i = m$. The average of the results of the $k$ macro replications, $\bar{\theta} = k^{-1} \sum_{j=1}^{k} \hat{\theta}(j)$, is an alternative to the point estimator $\hat{\theta}$ from Section 5. Since $\bar{\theta}$ is a sample mean regardless of the type of performance measure, its variance can always be estimated by $\hat{\sigma}_{\bar{\theta}}^2 = k^{-1} (\sum_{j=1}^{k} \hat{\theta}_j^2 - k \bar{\theta}^2)/(k-1)$. To the extent that $\bar{\theta}$ behaves like $\hat{\theta}$, the micro/macro estimator $\hat{\sigma}_{\bar{\theta}}^2$ is a good estimator of $\mathrm{var}(\hat{\theta})$. Since $\bar{\theta} = \hat{\theta}$ when $m = n$ and $k = 1$, one can expect similar behavior when the micro-sample size $m$ is large in some sense.

The sole purpose of the micro/macro-replication structure is to provide an estimator of the sampling error of $\hat{\theta}$ by estimating the sampling error of $\bar{\theta}$. The point estimator is unchanged; we continue to use $\hat{\theta}$ from Section 5 rather than $\bar{\theta}$. Let us discuss briefly why. In the case of means, $\bar{\theta} = \hat{\theta}$ for all choices of $m$ and $k$, so there is no issue. But consider estimating the variance, $\sigma_Y^2$. For i.i.d. data, $\bar{\theta}$ is unbiased for any choice of $k$ and $m$, but $\mathrm{var}(\bar{\theta}) = \sigma_Y^4 (\alpha_4 - (m-3)/(m-1))/n$. Thus the mean squared error is minimized when $m = n$ and $k = 1$, which corresponds to $\hat{\theta}$. More generally, for all of the usual types of estimators, both bias and variance are minimized by using $\hat{\theta}$ as the point estimator.

Having chosen the point estimator, we still must choose the values of $k$ and $m$ to provide a good estimator $\hat{\sigma}_{\bar{\theta}}^2$ of $\mathrm{var}(\hat{\theta})$. In this discussion we assume that the sample size $n$ is fixed, so that choosing the number of macro replications $k$ implies that the number of micro replications is $m = n/k$. We also now denote the output from micro replication $i$ of macro replication $j$ by $Y_{ij}$.

First consider estimating the variance of a mean. Here the point estimator $\hat{\theta}$ is the sample mean $\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$ and the usual estimator of $\mathrm{var}(\bar{Y})$ is $S_Y^2 / n$. Within the micro/macro-replication structure we define $\hat{\theta}_j = \bar{Y}_j = m^{-1} \sum_{i=1}^{m} Y_{ij}$ and $\bar{\theta} = k^{-1} \sum_{j=1}^{k} \bar{Y}_j$. Note that here $\bar{\theta} = \hat{\theta}$. The estimator of $\mathrm{var}(\hat{\theta}) = \mathrm{var}(\bar{\theta})$ is $\hat{\sigma}_{\bar{\theta}}^2$, which is undefined for $k = 1$ and unbiased for any number of independent macro replications $k = 2, 3, \ldots, n$. Therefore, we choose $k$ by minimizing the variance of $\hat{\sigma}_{\bar{\theta}}^2$, which is $n^{-2}(\sigma_Y^4 / k)\left(3 + (\alpha_4 - 3)/m - (k-3)/(k-1)\right)$. Unless $\alpha_4$ is quite large, this variance is minimized at $k = n$, $m = 1$, corresponding to the usual estimator $S_Y^2 / n$.

When the performance measure is not a mean, $\bar{\theta}$ and $\hat{\theta}$ are equal only when $m = n$, and therefore choosing a good value of $k$ is more difficult. For the commonly used performance measures other than means, such as variances and quantiles, $k = n$ cannot be chosen, since $\hat{\theta}_j$ is not defined for $m = 1$. Therefore, $m$ is restricted to being at least some lower threshold value. As we saw in Section 5, when $m$ is small (and therefore $k$ is large), $\hat{\theta}_j$ typically has nonnegligible bias, which is transferred to $\bar{\theta}$. So we need to choose $m$ large enough to produce negligible bias and $k$ large enough that the variance is reasonably small.
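A minimal sketch of the micro/macro computation follows. It is not from the original text; numpy is assumed, the output is i.i.d. exponential purely for illustration, and the sample variance serves as an example of a nonmean point estimator.

```python
# Sketch: macro/micro replications to estimate the sampling error of a point estimator.
import numpy as np

rng = np.random.default_rng(42)

def point_estimator(y):
    """The point estimator theta-hat; here the sample variance of Y."""
    return np.var(y, ddof=1)

n, k = 10_000, 20               # total sample size and number of macro replications
m = n // k                      # micro replications per macro replication

Y = rng.exponential(size=n)     # stand-in for i.i.d. simulation output
theta_hat = point_estimator(Y)  # the point estimator, based on all n observations

theta_j = np.array([point_estimator(Y[j*m:(j+1)*m]) for j in range(k)])
theta_bar = theta_j.mean()                    # used only to gauge sampling error
se_hat = np.sqrt(theta_j.var(ddof=1) / k)     # estimate of the standard error of theta_bar

print(theta_hat, se_hat)
```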


For nonmeans, the amount of bias is unknown in practice, since the simulation experiment would not be performed if $\theta$ were known. Therefore it is fortunate that we can say something about the behavior of the variance as a function of $k$ in the micro/macro-replication context. In particular, although $\mathrm{var}(\hat{\sigma}_{\bar{\theta}}^2)$ decreases as $k$ increases, the rate of decrease quickly becomes small as $k$ increases. Schmeiser (1982) suggests that $k = 10$ is large enough in most cases and that $k = 30$ is the maximum value for which any practical improvement occurs for the usual estimators. The price paid for using a fixed number of batches is that some information about the standard error is lost; for better variance in estimating the standard error, the number of batches should grow with the sample size. But the bigger potential loss is due to bias, so a safely designed simulation experiment might use $k = 20$ macro replications and $n/20$ micro replications, even though $n$ may be in the millions. Lewis and Orav (1989) suggest choosing $k$ between twelve and twenty, which is not inconsistent since such bounds are subjective.

We will discuss micro/macro replications further in the next subsection, since they are particularly useful when the output is correlated. But first we discuss two alternative methods - jackknifing and bootstrapping - for estimating the variance of $\hat{\theta}$ when the output process is i.i.d. Unlike micro/macro replications, which can be implemented with cumulative statistics, both jackknifing and bootstrapping are resampling plans that are performed at the end of the experiment, when all observations are available for computation. Both can be applied using either the $n$ independent output values $Y_1, Y_2, \ldots, Y_n$ or the $k$ macro-replication estimators $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$, but the presentation here assumes the former. Both are sometimes called computationally intensive methods, for the obvious reason.

Jackknifing is based on calculating the point estimator $n$ times, each time omitting one observation. Let $\hat{\theta}_i$ denote the value of the point estimator calculated from the $n-1$ observations omitting $Y_i$. For reasonably large samples, the unconditional sampling distribution of $\hat{\theta}_i$ and $\hat{\theta}$ should be similar, with a correction factor of $(n-1)/n$ for the variance. But conditional on the value of $\hat{\theta}$, the variance of $\hat{\theta}_i$ is a function of the single omitted observation, so $n^{-1} \mathrm{var}(\hat{\theta})$ is approximately equal to $\mathrm{var}(\hat{\theta}_i \mid \hat{\theta})$, which is easily estimated using $n^{-1}(\sum_{i=1}^{n} \hat{\theta}_i^2 - n \bar{\theta}^2)$, where $\bar{\theta}$ is the sample average of the $\hat{\theta}_i$'s. Multiply by $n$ for the difference between the variance of $\hat{\theta}$ and the conditional variance of $\hat{\theta}_i$ given $\hat{\theta}$. Then multiply by the correction factor $(n-1)/n$ for the difference between the variance of $\hat{\theta}$ and the variance of $\hat{\theta}_i$. The result is $\widehat{\mathrm{var}}(\hat{\theta}) = (n-1) n^{-1} (\sum_{i=1}^{n} \hat{\theta}_i^2 - n \bar{\theta}^2)$. This estimator is algebraically identical to, but slightly simpler to compute than, the usual jackknifing definition of treating the pseudovalues $\hat{\theta}_{(i)} = n \hat{\theta} - (n-1) \hat{\theta}_i$ as independent observations. For more intuition, note that if $\hat{\theta}$ is the sample mean $\bar{Y}$, then $\widehat{\mathrm{var}}(\hat{\theta}) = S_Y^2 / n$.

Bootstrapping (Efron and Gong 1983) is quite intuitive. We assume that the empirical distribution function of the output is the true distribution function. We can then sample, with replacement, $m$ values from the set $\{Y_1, Y_2, \ldots, Y_m\}$ and compute a point estimator $\hat{\theta}_1$, which will have statistical
properties similar to those of the observed point estimator $\hat{\theta}$. We can then repeatedly sample to obtain $k$ independent bootstrap values $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$. Then $\widehat{\mathrm{var}}(\hat{\theta})$ is the sample variance of the bootstrap sample. The method works well to the extent that the empirical distribution function from the original sample mimics the population distribution function $F_Y$. Bootstrapping is doubly interesting for simulation, since in addition to being applicable to simulation output analysis, bootstrapping is itself a Monte Carlo sampling procedure. Various alternatives to crude sampling from the empirical distribution function have been proposed; these alternatives can be thought of as variance-reduction techniques, the topic of Section 7.
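The following sketch illustrates both resampling plans for i.i.d. output. It is not from the original text; numpy is assumed, and the test distribution, point estimator, and replication counts are illustrative only.

```python
# Sketch: jackknife and bootstrap estimates of the sampling variance of a point estimator.
import numpy as np

rng = np.random.default_rng(3)
Y = rng.exponential(size=500)          # stand-in for i.i.d. simulation output

def point_estimator(y):
    return np.var(y, ddof=1)           # any statistic; the sample variance as an example

n = len(Y)
theta_hat = point_estimator(Y)

# Jackknife: recompute the estimator n times, each time leaving one observation out.
theta_i = np.array([point_estimator(np.delete(Y, i)) for i in range(n)])
var_jack = (n - 1) / n * np.sum((theta_i - theta_i.mean())**2)

# Bootstrap: resample with replacement, then take the sample variance of the replicates.
B = 1000
theta_b = np.array([point_estimator(rng.choice(Y, size=n, replace=True)) for _ in range(B)])
var_boot = theta_b.var(ddof=1)

print(theta_hat, np.sqrt(var_jack), np.sqrt(var_boot))
```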

6.2. Covariance-stationary output

Compared to the analysis of data from i.i.d. output processes, the analysis of autocorrelated data is difficult. This difficulty remains even when we assume that the initial transient has been somehow ameliorated, as discussed in Section 5, and that therefore the output process is covariance stationary. The difficulty also remains even as the sample size grows. Just as for i.i.d. output, we need methods appropriate for all types of performance measures. But first we discuss the probabilistic relationships underlying estimation of the variance of a sample mean, which is the prototype problem.

Since the process $Y$ is assumed to be stationary, $\mathrm{var}(\bar{Y})$ is a known function of the variance and the autocorrelations of the observations: $\mathrm{var}(\bar{Y}) = c \sigma_Y^2 / n$, where $c = 1 + 2 \sum_{h=1}^{n-1} (1 - h/n) \rho_h$ and $\rho_h = \mathrm{corr}(Y_i, Y_{i+h})$ for all $i$. Asymptotically, as $n$ goes to infinity, the constant $c$ goes to $\gamma_0 = \sum_{h=-\infty}^{\infty} \rho_h$, which we assume to be finite. The process constant $\gamma_0$ can be viewed as the number of adjacent observations that carry the same information about $\mathrm{var}(\bar{Y})$ as one independent observation in an asymptotically large sample. For i.i.d. output, $\gamma_0 = 1$.

The difficulty in estimating $\mathrm{var}(\bar{Y})$ is that $c$ and $\gamma_0$ are unknown when the data are not i.i.d. Asymptotically, ignoring the effect of the autocorrelations by using the i.i.d. estimator $S_Y^2 / n$ leads to underestimating or overestimating $\mathrm{var}(\bar{Y})$ by a factor of $c$. Since $c$ can be arbitrarily large, assuming dependent data to be independent when estimating the variance of the estimator $\bar{Y}$ can be quite misleading.

The extensive literature on estimating $\mathrm{var}(\bar{Y})$ attacks the problem from a variety of angles, which can roughly be categorized as spectral methods, ARMA modeling, and standardized time series via transformations to i.i.d. or Brownian processes, as discussed in graduate-level simulation textbooks. These methods are mentioned briefly later. We will focus on micro/macro replications and extensions, since that approach works for any type of performance measure.

When micro/macro replications are applied to covariance-stationary data, the method is often called batching. We will avoid the common term batch means, since the method is not limited to means. The batches are obtained by partitioning the single sequence of observations $Y_1, Y_2, \ldots, Y_n$ into $k$ contiguous
batches of size $m$, thus creating data analogous to that obtained from $k$ macro replications each containing $m$ micro replications. All the computations are then identical to the micro/macro-replication logic used with i.i.d. data: calculate $\hat{\theta}_j$ for each batch $j$ and estimate $\mathrm{var}(\bar{\theta})$ with $k^{-1} \sum_{j=1}^{k} (\hat{\theta}_j - \bar{\theta})^2 / (k-1)$.
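A minimal sketch of this batching computation follows. It is not from the original text; numpy is assumed, the AR(1) process is only a stand-in for correlated simulation output, and the point estimator is the sample mean for simplicity.

```python
# Sketch: nonoverlapping batching applied to one long correlated run.
import numpy as np

rng = np.random.default_rng(11)

# AR(1) output as a stand-in for covariance-stationary simulation output.
n, phi = 100_000, 0.9
Y = np.empty(n)
Y[0] = rng.standard_normal()
for i in range(1, n):
    Y[i] = phi * Y[i - 1] + rng.standard_normal()

k = 20                                           # number of batches (macro replications)
m = n // k                                       # batch size (micro replications)
theta_j = Y[:k * m].reshape(k, m).mean(axis=1)   # one estimator per batch

theta_hat = Y.mean()                             # point estimator from the whole run
var_hat = theta_j.var(ddof=1) / k                # batching estimate of var(theta_hat)
print(theta_hat, np.sqrt(var_hat))

# For comparison, the i.i.d. formula S^2/n badly underestimates the true sampling error here.
print(np.sqrt(Y.var(ddof=1) / n))
```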

To see the validity of this analysis requires thinking about the effect of dependence among observations within a batch and the effect of autocorrelation between batches. The effect within a batch is that dependence biases the macro-replication point estimators $\hat{\theta}_j$ and therefore their average $\bar{\theta}$. To the extent that $\bar{\theta}$ is a poor estimate of $\theta$, the sum of the squared deviations from $\bar{\theta}$ divided by $k-1$ does not provide a good estimator of $\sigma_{\hat{\theta}_j}^2$. But as we saw earlier, the bias of $\hat{\theta}_j$ disappears as $m$ increases, whether or not the bias is due to dependence. So if $m$ is large enough, dependence within a batch causes no harm.

The effect of autocorrelation between batches is that the batch point estimators $\hat{\theta}_j$ are not independent of each other. Ignoring the second-order effect caused by all batch estimators being a function of the common mean $\bar{\theta}$, we can let $\tilde{\rho}_h$ denote $\mathrm{corr}(\hat{\theta}_j, \hat{\theta}_{j+h})$. Then

$$\mathrm{var}(\bar{\theta}) = k^{-1} \tilde{\sigma}^2 \left[ 1 + 2 \sum_{h=1}^{k-1} \left( 1 - \frac{h}{k} \right) \tilde{\rho}_h \right],$$

where $\tilde{\sigma}^2 = \mathrm{var}(\hat{\theta}_j)$. If the $\tilde{\rho}_h$'s are nonzero, then $k$ is not the appropriate factor relating $\tilde{\sigma}^2$ to $\mathrm{var}(\bar{\theta})$. Thus, the batch size $m$ must be chosen large enough to obtain only negligible correlation between the $\hat{\theta}_j$'s.

Therefore, the problem is again to select the number of macro replications (batches) $k$ and micro replications (batch size) $m$ to balance the bias and variance effects. The effects are the same as with i.i.d. data, except that now the dependence adds two new reasons to strive for large values of $m$. Since both within-batch and between-batch autocorrelation effects are difficult to estimate, we have even more reasons than for i.i.d. data to use small numbers of macro replications $k$. A value of $k$ greater than thirty is seldom defensible, and in the presence of autocorrelation $k$ is more often appropriately quite small.

Because small values of $k$ increase the variance of the estimator of $\mathrm{var}(\hat{\theta})$, a method to increase $k$ without decreasing $m$ for a fixed run length $n$ would be valuable. An improvement on the micro/macro-replications estimator uses all $n - m + 1$ (overlapping) batches to obtain the estimator

$$\widehat{\mathrm{var}}^{(O)}(\hat{\theta}) = \frac{\sum_{j=1}^{n-m+1} (\hat{\theta}_j - \hat{\theta})^2}{(n-m+1)\left( (n/m) - 1 \right)},$$

where $\hat{\theta}_j$ is now calculated from the batch of size $m$ running from $Y_j$ to $Y_{j+m-1}$. This estimator has been studied in detail only for $\hat{\theta} = \bar{Y}$, the original version suggested by Meketon and Schmeiser (1984). They note that this estimator has (essentially) the same bias and 2/3 the variance of the micro/macro estimator, since it is asymptotically equivalent to the spectral estimator (Moran 1975)
obtained by substituting estimates of $\sigma_Y^2$ and $\rho_h$ into the expression for $\mathrm{var}(\bar{Y})$ and truncating at $m$ autocorrelations. Goldsman and Meketon (1988) show that this estimator's asymptotic mean squared error compares well with some other generally applicable estimators. Song (1988) extends the Goldsman and Meketon results to show that the asymptotically optimal batch size for $\widehat{\mathrm{var}}^{(O)}(\bar{Y})$ is $m^* = 1 + (3n\gamma)^{1/3}$, where $\gamma = \sum_{h=0}^{\infty} h \rho_h / (1/2 + \sum_{h=1}^{\infty} \rho_h)$ is the center of gravity of the autocorrelations at nonnegative lags. The increase in computation required to use $\widehat{\mathrm{var}}^{(O)}(\hat{\theta})$ depends on the implementor's ability and desire to compute $\hat{\theta}_j$ efficiently from $\hat{\theta}_{j-1}$; properly implemented, the computation remains $O(n)$ and only negligibly more than that of nonoverlapping batching.
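A small sketch of the overlapping-batch estimator for the sample mean follows, using cumulative sums so that the computation stays O(n). It is not from the original text; numpy is assumed, and the function name, batch size, and test data are illustrative only.

```python
# Sketch: overlapping-batch estimate of var(sample mean), using all n - m + 1 batches.
import numpy as np

def obm_variance(y, m):
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    batch_means = (csum[m:] - csum[:-m]) / m      # all n - m + 1 overlapping batch means
    ss = np.sum((batch_means - y.mean())**2)
    return ss / ((n - m + 1) * (n / m - 1))

rng = np.random.default_rng(5)
y = rng.standard_normal(50_000)                   # i.i.d. data just to exercise the code
print(np.sqrt(obm_variance(y, m=500)))            # close to 1/sqrt(50_000) for i.i.d. data
```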

6.3. Confidence intervals

A classic use of an estimate of $\mathrm{var}(\hat{\theta})$ is to compute a confidence interval on $\theta$. The idea is to use the output data to compute a random interval $[L, U]$ with the property that $P\{L \le \theta \le U\} = 1 - \alpha$ for some specified value of $\alpha$. The classic interval is $L = \hat{\theta} - t_{\nu, 1-\alpha/2}\, \widehat{\mathrm{var}}^{1/2}(\hat{\theta})$ and $U = \hat{\theta} + t_{\nu, 1-\alpha/2}\, \widehat{\mathrm{var}}^{1/2}(\hat{\theta})$, where $t_{\nu, 1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the $t$ distribution with $\nu$ degrees of freedom. Here $\widehat{\mathrm{var}}^{1/2}(\hat{\theta})$ is a standard-error estimator whose square is assumed to be proportional to a chi-squared random variable with $\nu$ degrees of freedom and to be independent of the normally distributed point estimator $\hat{\theta}$. The extent to which these assumptions fail to hold depends upon the sample size $n$, the autocorrelation structure, and the type of performance measure. In the classic case of i.i.d. normal data and micro/macro-replication analysis, the procedure with the sample mean and $\nu = k - 1$ degrees of freedom covers $\mu_Y$ with the desired probability. For all the usual performance measures, in the limit as $m$ becomes large the $\hat{\theta}_j$'s become i.i.d. and normal, causing the desired coverage probability to occur. The need for each macro-replication statistic to be normally distributed is yet another reason to choose $m$ large, forcing the value of $k$ to be yet smaller. The effect of a small value of $k$ is to increase the expected width of the interval and to increase the variance of the width. In this case using the overlapping-batch estimator is tempting, since asymptotically its degrees of freedom are half again more than those of the nonoverlapping-batch estimator.

Besides methods based on batching, a variety of confidence-interval procedures for the mean have been proposed, beginning with the classical spectral work of Fishman and Kiviat (1967), and including a regression-based spectral estimator by Heidelberger and Welch (1981), standardized-time-series estimators (Schruben 1983), and estimators based on time-series modeling (Schriber and Andrews 1984). Law and Kelton (1984) and Sargent, Kang, and Goldsman (1987) empirically compare various confidence-interval procedures; Glynn and Iglehart (1990) theoretically compare the micro/macro, overlapping, and regenerative procedures, showing that regenerative analysis is asymptotically better when it is applicable.
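A minimal sketch of the classic two-sided interval built from $k$ batch estimators follows. It is not from the original text; it assumes numpy and scipy are available, and the function name, batch count, and test data are illustrative only.

```python
# Sketch: confidence interval for a steady-state mean from k batch (macro-replication) means.
import numpy as np
from scipy import stats

def batch_means_ci(y, k=20, alpha=0.05):
    y = np.asarray(y, dtype=float)
    m = len(y) // k
    batch_means = y[:k * m].reshape(k, m).mean(axis=1)
    center = batch_means.mean()
    half_width = stats.t.ppf(1 - alpha / 2, df=k - 1) * batch_means.std(ddof=1) / np.sqrt(k)
    return center - half_width, center + half_width

rng = np.random.default_rng(9)
print(batch_means_ci(rng.exponential(size=20_000)))   # should usually cover the true mean, 1.0
```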


7. Variance reduction

In the last section we discussed output analysis: how to estimate the quality of an experiment via the sampling error(s) of its point estimator(s). The experiment, including the estimator, was given, and the problem was to estimate how well the point estimator behaved as measured by some property of its distribution - typically standard error. Now we try to improve the distribution of the point estimator.

Fundamentally, variance reduction involves the choice of experiment. It is helpful to think of an original experiment, often referred to as the crude experiment, although it need not be crude in any sense other than that our objective is to obtain an experiment that is better, typically in the sense of the point estimator's bias, variance, or mean squared error. The topic is known as variance reduction since, when bias is negligible, the usual measure of estimator quality is variance, the square of the standard error. One cannot take the term variance reduction literally, since generating arbitrary (or no) data and printing $\hat{\theta} = 6$ produces a zero-variance estimator that is both trivial and worthless. Further, even when bias can be assumed to be negligible, the objective of minimizing variance is seldom made explicit, except in the theoretical literature. In practice, the objective is almost always to improve the distribution of the estimator in some unspecified way. Not infrequently, bias is allowed in the hope of a substantial decrease in variance and, therefore, mean squared error.

The simplest modification to the crude experiment is to increase the sample size, $n$. Since estimator variance is usually $O(n^{-1})$, this modification reduces variance by a known factor with no degradation (and sometimes improvement) of bias. In addition, it requires no work or detailed thought by the analyst. The cost is only in computational effort, the importance of which depends on the application. In real-time applications (e.g., interactive computing or scheduling) or when the current simulation is just one step in the search for a good design, the increase in computation is often unacceptable, leading to an interest in more fundamental improvement. A reasonable objective is to reduce mean squared error with similar computational effort or to reduce computational effort with similar mean squared error. When bias is negligible, a typical measure is the product of work (computational effort) and variance. This measure does not change (asymptotically) with sample size, since doubling the work halves the variance.

The literature on variance reduction, which is extensive, indicates the tremendous computational savings that are possible when the analyst has the time and capability to tailor the experiment to the specific application. For example, Swain and Schmeiser (1988) report reductions of orders of magnitude when estimating the behavior of methods for fitting nonlinear statistical models to data. Fishman (1983) and Stein (1987) report variance reductions that improve the rate of convergence. Such impressive reductions usually occur when the model has a structure that the analyst can exploit by tailoring the experiment to the model. For example,
Swain and Schmeiser (1988) use analytic results from linear models to improve the simulation results for nonlinear models. Fishman (1983) substitutes deterministic sequences for random numbers in models of network reliability. Stein (1987) discusses Latin hypercube sampling, which eliminates the variance due to the additive terms in models that use a fixed number of random numbers.

Moderate variance reductions are often possible with reasonably general techniques, e.g., common random numbers, antithetic variates, stratified sampling, and internal control variates. Several recent textbooks have good discussions of techniques; Bratley, Fox, and Schrage (1987, Chapter 2, pp. 44-76, and Chapter 8, pp. 281-302) discuss variance reduction in more detail than most. Meyer (1956) contains several seminal papers. Section 6 of Nelson (1985) is a nicely organized bibliography. Wilson (1984b) categorizes variance-reduction techniques into two classes: correlation methods and importance methods. Nelson (1987a) categorizes variance-reduction techniques by the part(s) of the experiment that are modified; these parts correspond closely to the last three of the five steps of simulation experiments discussed in Section 1.

The four methods mentioned in the last paragraph - common random numbers, antithetic variates, stratified sampling, and internal control variates - are general in the sense that they do not require changes within the model. Common random numbers use the same random-number streams to obtain realizations from $k \ge 2$ systems in an attempt to positively correlate the $k$ estimators, which reduces the variance of estimated mean differences between systems. Antithetic variates use transformations of one random-number stream to induce correlations among $k$ estimators of the same system; the purpose is to reduce the variance of the average of the $k$ estimators. Various forms of stratified sampling reduce variance by replacing independent sampling with sampling without replacement. Internal control variates modify the crude estimator by a correction based on the relationship between an estimate of a population property and its known value.

Each of these four methods is intimately tied to the method by which random variates are generated. In particular, all are aided by the inverse transformation. Common random numbers and antithetic variates are aided because the random variates are then monotonic functions of the random numbers. Stratified sampling is aided because stratifying the random numbers, which have strata with known probabilities, automatically stratifies the random variates. Control variates are aided because the random numbers always have known population properties; e.g., the mean is one half. The failure of common random numbers and antithetic variates to reduce variance when using many of the common simulation languages stems from most languages' use of other random-variate generation algorithms, which are typically faster or easier to code. For example, the two antithetic streams $U$ and $1 - U$ produce independent, rather than negatively correlated, normal random variates when used with the popular Box-Muller algorithm. Schmeiser and Kachitvichyanukul (1990) discuss efficient random-variate generation algorithms designed to support variance-reduction techniques.
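A small empirical sketch of this contrast follows. It is not from the original text; numpy and scipy are assumed, and the sample sizes and seeds are illustrative only.

```python
# Sketch: antithetic uniforms produce strongly negatively correlated normals under the
# inverse transform, but the strong negative correlation is lost under Box-Muller.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
u = rng.random(100_000)

# Inverse transform: the variate is monotone in u, so the pair is perfectly antithetic.
z, z_anti = norm.ppf(u), norm.ppf(1.0 - u)
print(np.corrcoef(z, z_anti)[0, 1])      # equals -1 (up to rounding)

# Box-Muller: the mapping is not monotone in the uniforms, so the pairing is largely wasted.
u1, u2 = rng.random(100_000), rng.random(100_000)
bm      = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)
bm_anti = np.sqrt(-2 * np.log(1 - u1)) * np.cos(2 * np.pi * (1 - u2))
print(np.corrcoef(bm, bm_anti)[0, 1])    # far from -1; no useful antithetic effect
```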


We conclude this section by discussing two ideas that are effective and easy to implement, although they require the analyst to recognize the context. First, let us reconsider the bias caused by a random sample size in the denominator of an estimator, as discussed in Section 5. In some cases, random denominators are helpful, despite the resulting bias. Consider estimating the ratio of two performance measures. Examples include standardized moments, $\alpha_k = \mu_k / \sigma^k$; the coefficient of variation $\sigma / \mu$; correlations $\mathrm{corr}(Y_1, Y_2) = \mathrm{cov}(Y_1, Y_2) / \sigma^2$; the ratio $\mathrm{cov}(Y_1, Y_2) / \mathrm{var}(Y_1)$ that arises in control variates; and the average wait time per customer, E(total time)/E(number of customers). In each of these cases, the analyst sometimes knows the true value of the denominator. Following the appealing folklore that we should not use simulation to estimate a known quantity, we would take as our estimator the ratio of the numerator estimate and the known denominator, with negligible bias. However, at a small cost in bias, the variance can often be reduced substantially by using the ratio of the numerator estimate to the denominator estimate. The reduction is the result of positive correlation between the two estimators. This is why we divide by the actual number of customers rather than the expected number of customers, even when the expected number of customers is known.

In the second example, we consider a situation where using known information does result in variance reduction. Suppose we wish to estimate $E(Y) = \sum_{j=1}^{k} E(Y \mid X = x_j) P(X = x_j)$. For example, $Y$ might be the throughput of one shift and $X$ might be the number of times that a key machine breaks down during the shift. If we know $P(X = x_j)$ for each $j = 1, 2, \ldots, k$, we can reduce the variance of $\hat{E}(Y)$. Let $n_j$ denote the number of observations of $Y$ for which $X = x_j$. Then we can estimate $E(Y \mid X = x_j)$ by the sample average $n_j^{-1} \sum_{i=1}^{n_j} Y_{ij}$, where $Y_{ij}$ is the $i$th observation of $Y$ when $X = x_j$. The poststratified estimate of $E(Y)$ is obtained by substituting these estimates into the total-probability equation, using the known probabilities (Wilson and Pritsker 1984); a small sketch appears at the end of this section. If the model allows the values of $n_j$ to be controlled, sampling effort can be shifted to the strata with the largest conditional expected values and those with the largest associated probability.

Despite the benefits of variance-reduction techniques, only the most straightforward are used with any frequency. Beyond analyst time, the cost of using variance-reduction techniques is that analyst error could cause the modified experiment to be invalid, in that it does not estimate the performance measure of interest. Automating variance reduction within simulation languages, so that it is always valid and consistently useful, is a difficult and important problem.
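The promised sketch of the poststratified estimator follows. It is not from the original text; numpy is assumed, and the breakdown probabilities and throughput model are entirely hypothetical.

```python
# Sketch: poststratified estimate of E(Y) when the probabilities P(X = x_j) are known.
import numpy as np

rng = np.random.default_rng(4)

p_known = {0: 0.5, 1: 0.3, 2: 0.2}                 # known P(X = x_j): breakdowns per shift
n = 1000
X = rng.choice(list(p_known), size=n, p=list(p_known.values()))
Y = 100 - 20 * X + rng.normal(scale=5, size=n)     # hypothetical throughput given X

crude = Y.mean()                                   # crude estimator: overall sample mean
post = sum(p_known[x] * Y[X == x].mean() for x in p_known)   # poststratified estimator

print(crude, post)    # both estimate E(Y) = 100 - 20*E(X) = 86; the second has lower variance
```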

Acknowledgments

Support from the National Science Foundation under Grant No. DMS871779 to Purdue University is gratefully acknowledged. Discussions with and the suggestions of Peter Glynn, David Goldsman, David Kelton, Barry Nelson, Alan Pritsker, Stephen Roberts, Lee Schruben, Wheyming Tina Song, James Swain, Michael Taaffe, and James Wilson were, as always, helpful.


References

Avramidis, A. (1989). A Flexible Method for Estimating Inverse Distribution Functions in Simulation Experiments. M.S.I.E. Thesis, School of Industrial Engineering, Purdue University, West Lafayette, IN.
Banks, J. and J.S. Carson, II (1984). Discrete-Event System Simulation. Prentice-Hall, Englewood Cliffs, NJ.
Banks, J., J.S. Carson, II, and D. Goldsman (1989). Computer simulation. In: H. Wadsworth (Ed.), Handbook of Statistical Methods for Engineers and Physical Scientists. McGraw-Hill, New York.
Barone, P. (1987). A method for generating independent realizations of a multivariate normal stationary and invertible ARMA(p, q) process. J. Time Series Anal. 8, 125-130.
Bratley, P., B.L. Fox, and L.E. Schrage (1987). A Guide to Simulation, Second Edition. Springer-Verlag, New York.
Brown, M. and H. Solomon (1979). On combining pseudorandom number generators. Ann. Statist. 7, 691-695.
Chen, H.C. and Y. Asau (1977). On generating random variates from an empirical distribution. AIIE Trans. 6, 163-166.
DeBrota, D.J., R.S. Dittus, S.D. Roberts, J.R. Wilson, J.J. Swain, and S. Venkatraman (1988). Input modeling with the Johnson system of distributions. In: M.A. Abrams, P.L. Haigh, and J.C. Comfort (Eds.), Proceedings of the Winter Simulation Conference, 165-179.
Devroye, L. (1986a). Non-Uniform Random Variate Generation. Springer-Verlag, New York.
Devroye, L. (1986b). Sample-based non-uniform random variate generation. In: J.R. Wilson, J.O. Henriksen, and S.D. Roberts (Eds.), Proceedings of the Winter Simulation Conference, 260-265.
Eichenauer, J. and J. Lehn (1986). A non-linear congruential pseudo random number generator. Statist. Hefte 27, 315-326.
Efron, B. and G. Gong (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statist. 37, 36-48.
Fishman, G.S. and P.J. Kiviat (1967). The analysis of simulation-generated time series. Management Sci. 13, 525-557.
Fishman, G.S. (1973). Concepts and Methods in Discrete Event Digital Simulation. John Wiley & Sons, New York.
Fishman, G.S. (1978). Principles of Discrete Event Simulation. John Wiley & Sons, New York.
Fishman, G.S. (1983). Estimating network reliability with accelerated convergence rates on error bounds. Technical Report ORSA-83-7, University of North Carolina at Chapel Hill.
Fishman, G.S. and L.S. Moore III (1984). Sampling from a discrete distribution while preserving monotonicity. Amer. Statist. 38, 219-223.
Fishman, G.S. and L.S. Moore III (1986). An exhaustive analysis of multiplicative congruential random number generators with modulus 2^31 - 1. SIAM J. Sci. Statist. Comput. 7(1), 24-45.
Fuller, W.A. (1976). Introduction to Statistical Time Series. John Wiley & Sons, New York.
Glynn, P.W. and D.L. Iglehart (1990). Simulation output analysis using standardized time series. Math. Oper. Res. 15(1), 1-16.
Goldsman, D.M. and M.S. Meketon (1988). A comparison of several variance estimators. Technical Report J-85-12 (revision), School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA.
Halfin, S. (1982). Linear estimators for a class of stationary queueing processes. Oper. Res. 30, 515-529.
Heath, D. and P. Sanchez (1986). On the period of pseudo-random number generators (or: how big a period do we need?). Oper. Res. Lett. 5(1), 3-6.
Heidelberger, P. and P.D. Welch (1981). A spectral method for confidence interval generation and run length control in simulation. Comm. ACM 24, 233-245.
Hora, S.C. (1983). Estimation of the inverse function for random variate generation. Comm. ACM 26, 590-594.
Hull, J.C. (1977). Dealing with dependence in risk simulations. Oper. Res. Quart. 28, 201-213.
IBM System/360 Scientific Subroutine Package (1966). Programmer's Manual, H20-0205-3. IBM Corp., White Plains, NY.
IMSL STAT/LIBRARY (1987). User's Manual, Version 1.0. IMSL Inc., Houston, TX.
Johnson, M.E. (1987). Multivariate Statistical Simulation. John Wiley & Sons, New York.
Johnson, N.L. and S. Kotz (1969, 1970a, 1970b, 1972). Distributions in Statistics, Volumes 1-4. John Wiley & Sons, New York.
Kennedy, W.J., Jr. and J.E. Gentle (1980). Statistical Computing. Marcel Dekker, New York.
Kleijnen, J.P.C. (1974). Statistical Techniques in Simulation, Part I. Marcel Dekker, New York.
Kleijnen, J.P.C. (1975). Statistical Techniques in Simulation, Part II. Marcel Dekker, New York.
Kleijnen, J.P.C. (1988). Statistical Tools for Simulation Practitioners. Marcel Dekker, New York.
Knuth, D.E. (1981). The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
L'Ecuyer, P. (1986). Efficient and portable 32-bit random variate generators. In: J.R. Wilson, J.O. Henriksen, and S.D. Roberts (Eds.), Proceedings of the Winter Simulation Conference, 275-277.
L'Ecuyer, P. (1988). Efficient and portable combined random number generators. Comm. ACM 31, 742-749, 774.
Law, A.M. and W.D. Kelton (1982). Simulation Modeling and Analysis. McGraw-Hill, New York.
Law, A.M. and W.D. Kelton (1984). Confidence intervals for steady-state simulations, I: A survey of fixed sample size procedures. Oper. Res. 32, 1221-1239.
Lehmer, D.H. (1951). Mathematical models in large-scale computing units. Ann. Comput. Lab. (Harvard University) 26, 141-146.
Lewis, P.A.W. and G.S. Shedler (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Res. Logist. Quart. 26, 403-414.
Lewis, P.A.W. and E.J. Orav (1989). Simulation Methodology for Statisticians, Operations Analysts, and Engineers, Vol. 1. Wadsworth & Brooks/Cole, Pacific Grove, CA.
Marsaglia, G. (1968). Random numbers fall mainly in the planes. Proc. Nat. Acad. Sci. U.S.A. 60, 25-28.
Marse, K. and S.D. Roberts (1983). Implementing a portable FORTRAN uniform (0, 1) generator. Simulation 41, 135-139.
McLeod, A.I. (1985). A remark on algorithm AS 183: An efficient and portable pseudo-random number generator. Appl. Statist. 34, 198-200.
Meketon, M.S. and B.W. Schmeiser (1984). Overlapping batch means: something for nothing? In: S. Sheppard, U. Pooch, and C.D. Pegden (Eds.), Proceedings of the Winter Simulation Conference, 227-230.
Meyer, H. (1956). Symposium on Monte Carlo Methods. John Wiley & Sons, New York.
Moran, P.A.P. (1975). The estimation of standard errors in Monte Carlo simulation experiments. Biometrika 62, 1-4.
Morgan, B. (1984). The Elements of Simulation. Chapman & Hall, London.
Nance, R.E. and C. Overstreet, Jr. (1978). Some observations on the behavior of composite random number generators. Oper. Res. 26, 915-935.
Nelson, B.L. (1985). A decomposition approach to variance reduction. In: D.T. Gantz, G.C. Blais, and S.L. Solomon (Eds.), Proceedings of the Winter Simulation Conference, 23-32.
Nelson, B.L. (1987a). Variance reduction for simulation practitioners. In: A. Thesen, H. Grant, and W.D. Kelton (Eds.), Proceedings of the Winter Simulation Conference, 43-51.
Nelson, B.L. (1987b). A perspective on variance reduction in dynamic simulation experiments. Comm. Statist. - Simulation Comput. 16, 385-426.
Park, S.K. and K.W. Miller (1988). Random number generators: good ones are hard to find. Comm. ACM 31, 1192-1201.
Patel, J.K., C.H. Kapadia, and D.B. Owen (1976). Handbook of Statistical Distributions. Marcel Dekker, New York.
Peterson, A.V., Jr. and R.A. Kronmal (1982). On mixture methods for the computer generation of random variables. Amer. Statist. 36, 184-191.
Pritsker, A.A.B. (1986). Introduction to Simulation and SLAM II. Halsted Press, New York.
Ramberg, J.S., P.R. Tadikamalla, E.J. Dudewicz, and E.F. Mykytka (1979). A probability distribution and its uses in fitting data. Technometrics 21, 201-214.
RAND Corporation (1955). A Million Random Digits with 100,000 Normal Deviates. Free Press, Glencoe, IL.
Ripley, B.D. (1983). Computer generation of random variables: a tutorial. Internat. Statist. Rev. 51, 301-319.
Ripley, B.D. (1987). Stochastic Simulation. John Wiley & Sons, New York.
Rubinstein, R.Y. (1981). Simulation and the Monte Carlo Method. John Wiley & Sons, New York.
Rubinstein, R.Y. (1986). Monte Carlo Optimization, Simulation and Sensitivity of Queueing Networks. John Wiley & Sons, New York.
Sargent, R.G., K. Kang, and D.M. Goldsman (1987). An investigation of small sample size behavior of confidence interval procedures. Working Paper #87-005, Department of Industrial Engineering and Operations Research, Syracuse University, Syracuse, NY.
Schmeiser, B.W. (1977). Methods for modeling and generating probabilistic components in digital computer simulation when the standard distributions are not adequate: A survey. In: H.J. Highland, R.G. Sargent, and J.W. Schmidt (Eds.), Proceedings of the Winter Simulation Conference, 50-57. (Reprinted in Simuletter 10 (Fall 1978/Winter 1978-79), 38-43, 72.)
Schmeiser, B.W. (1980). Random variate generation: A survey. In: T.I. Oren, C.M. Shub, and P.F. Roth (Eds.), Proceedings of the Winter Simulation Conference, 79-104.
Schmeiser, B.W. and V. Kachitvichyanukul (1990). Non-inverse correlation induction: Guidelines for algorithm development. Computational and Applied Mathematics, forthcoming.
Schmeiser, B.W. and R. Lal (1980). Multivariate modeling in simulation: A survey. In: 33rd Annual Technical Conference Transactions. American Society for Quality Control, 252-261.
Schmeiser, B. (1982). Batch size effects in the analysis of simulation output. Oper. Res. 30, 556-568.
Schmeiser, B.W. (1983). Recent advances in generating observations from discrete random variables. In: J.E. Gentle (Ed.), Computer Science and Statistics: The Interface. North-Holland, Amsterdam, 154-160.
Schrage, L.E. (1979). A more portable Fortran random number generator. ACM Trans. Math. Software 5, 132-138.
Schriber, T.J. and R.W. Andrews (1984). ARMA-based confidence interval procedures for simulation output analysis. Amer. J. Math. and Management Sci. 4, 345-373.
Schruben, L.W. (1983). Confidence interval estimation using standardized time series. Oper. Res. 31, 1090-1108.
Snell, M. and L. Schruben (1985). Weighting simulation data to reduce initialization effects. IIE Trans. 17(4), 354-363.
Song, W.-M.T. (1988). Estimators of the Variance of the Sample Mean: Quadratic Forms, Optimal Batch Sizes, and Linear Combinations. Ph.D. Dissertation, School of Industrial Engineering, Purdue University, West Lafayette, IN.
Stahnke, W. (1973). Primitive binary polynomials. Math. Comput. 27, 977-980.
Stein, M. (1987). Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151.
Swain, J.J. and B. Schmeiser (1988). Control variates for Monte Carlo analysis of nonlinear statistical models, II: Raw moments and variances. J. Statist. Comput. Simulation 30, 39-56.
Tausworthe, R.C. (1965). Random numbers generated by linear recurrence modulo two. Math. Comput. 19, 201-209.
Welch, P.D. (1983). The statistical analysis of simulation results. In: S.S. Lavenberg (Ed.), Computer Performance Modeling Handbook. Academic Press, New York, 267-329.
Wichman, B.A. and I.D. Hill (1982). An efficient and portable pseudo-random number generator. Appl. Statist. 31, 188-190.
Wichman, B.A. and I.D. Hill (1984). Correction to "An efficient and portable pseudo-random number generator". Appl. Statist. 33, 123.
Wilson, J.R. (1984a). Statistical aspects of simulation. In: J.P. Brans (Ed.), Operational Research '84. North-Holland, Amsterdam, 921-937.
Wilson, J.R. (1984b). Variance reduction techniques for digital simulation. Amer. J. Math. Management Sci. 4, 277-312.
Wilson, J.R. and A.A.B. Pritsker (1982). Computer simulation. In: G. Salvendy (Ed.), Handbook of Industrial Engineering. Wiley-Interscience, New York, 13.11.1-13.11.24.
Wilson, J.R. and A.A.B. Pritsker (1984). Experimental evaluation of variance reduction techniques for queueing simulations using generalized concomitant variables. Management Sci. 30, 1459-1472.
Yakowitz, S.J. (1977). Computational Probability and Simulation. Addison-Wesley, Reading, MA.
Zeisel, H. (1986). A remark on algorithm AS 183: An efficient and portable pseudo-random number generator. Appl. Statist. 35, 89.