Simulation Modelling Practice and Theory 44 (2014) 14–25
Design of experiments for interpolation-based metamodels

E. Jack Chen (a), Min Li (b,*)

(a) BASF Corporation, 100 Park Avenue, Florham Park, NJ 07932, USA
(b) College of Business Administration, California State University, Sacramento, 6000 J Street, Sacramento, CA 95819, USA
* Corresponding author. Tel.: +1 916 538 2753. E-mail addresses: [email protected] (E. Jack Chen), [email protected] (M. Li).
Article history: Received 6 March 2013; Received in revised form 11 January 2014; Accepted 18 February 2014

Keywords: Kriging metamodel; Experimental design; Gaussian processes
Abstract

A metamodel is a simplified mathematical description of a simulation model that represents the system's input–output relationship with a function. In many situations, we may not need a single formula to describe the systems being simulated. Interpolation-based metamodels are useful for providing simple estimates at non-design points to communicate the input–output relationship. This paper proposes a new approach to select an experimental design for interpolation-based metamodels. The algorithm dynamically increases the sample size and the number of design points so that the estimates obtained via the metamodel satisfy the pre-specified precision. An experimental performance evaluation demonstrates the validity of interpolation-based metamodels.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

Many simulation studies investigate system characteristics, such as the mean and the variance of certain performance measures of the system under study. In many cases, users are interested in the system responses under different design points (input combinations, scenarios). Thus, a series of design points needs to be evaluated, i.e., we need to construct a metamodel (or a response surface). For example, we may be interested in the mean waiting time (or system time) of queuing systems with different traffic intensities. The metamodel can be "used as a proxy for the full-blown simulation itself in order to get at least a rough idea of what would happen for a large number of design points" [13].

The purpose of constructing metamodels is to estimate or approximate the response surface (see Section 3.1). We can then use the metamodel to learn how the response curve behaves over various input-parameter combinations, e.g., output sensitivity. This approach is "helpful when the simulation is very large and costly, precluding exploration of all but a few input-parameter combinations" [13]. Note that the metamodel is designed to provide the overall tendency of performance measures rather than accurate estimates at all input-parameter combinations. These metamodels can also be used for visualization: graphical representations of metamodels provide a simple and easy form in which to communicate the input–output relationship.

Kriging [12], from geostatistics, is a metamodeling methodology that estimates the value at non-design points by interpolation. It was originally developed for deterministic simulation. Kriging is a classical frequentist approach that estimates the response by obtaining the best linear unbiased estimator of the response [20]. Gaussian process regression, from statistics and machine learning, is theoretically equivalent to kriging; it is a Bayesian approach that imposes a prior distribution over the response function. Only in a special case are the results from the two approaches the same [20]. We use both kriging and Gaussian process regression to estimate or approximate the response in our empirical experiments in Section 4.
Kriging requires the choice of a correlation function (i.e., a model of spatial dependence) to compute the weights, which depend on the distances between the input combination to be estimated and the input combinations already observed. Kriging assumes that the closer the input combinations are, the more strongly positively correlated the responses are. The choice of correlation function should be motivated by the underlying system we want to model. However, the (true) correlation function is unknown, and both its type and its parameter values must be estimated. Furthermore, some of the weights may be negative. The correlation function is similar to the kernel function used to smooth estimates in density estimation [5]. Gunes et al. [7] point out that kriging is in general computationally expensive and that a large amount of memory may be required to construct and evaluate the weights. See [22] for an overview of kriging.

Gaussian process regression is a Bayesian nonparametric model that places a prior probability distribution directly over the function $f$ in the model $Y = f(X) + \epsilon$ (see Section 3.4). Rasmussen and Williams [19] give a textbook treatment of Gaussian processes. Kleijnen [9] notes that kriging software such as DACE (Design and Analysis of Computer Experiments, [14]) still needs improvement. The Gaussian processes software GPML (Gaussian Processes for Machine Learning) Toolbox [18] offers a large number of covariance functions that can be used to fit all sorts of data. We find that the GPML toolbox produces reliable results, and we report them along with results from DACE and from our proposed metamodels.

Chen [4] develops a procedure to construct metamodels of quantiles that have no functional form (i.e., interpolation-based metamodels). The technique can be used to construct metamodels of various system characteristics. In this paper, we extend and refine the technique to investigate the performance of interpolation-based metamodels. This general-purpose procedure is sequential and selects design points as more information becomes available. The simulation run length at each design point is determined sequentially and independently, and the stopping rules are based on the required precision measure.

Design of experiments can be used during metamodel construction to improve both the construction process and the metamodel's quality. Metamodeling requires determining the number and placement of design points and estimating the responses to which the metamodel is fitted. In this paper, we investigate how to find both the number and the placement of design points to achieve the required precision. To account for the stochastic nature of the output, two additional metamodels are constructed: the lower and upper bounds. For non-design points, the interpolated value is a linear interpolation of the two bounding design points, not a weighted average of all design points; hence, the value at non-design points can be estimated inexpensively. This approach is a special case of the correlation function LIN in the Matlab kriging toolbox developed by Lophaven et al. [14]. We demonstrate that, without knowledge of the underlying system, a simple estimate obtained via linear interpolation is as good as one obtained via a weighted average of all observations.

The paper is organized as follows. Section 2 reviews the theoretical basis of metamodels. Section 3 illustrates our strategy for constructing interpolation-based metamodels. Section 4 lists the empirical results from experiments using interpolation-based metamodels to estimate output performance measures. Concluding remarks follow in Section 5.
2. Metamodel

A simulation model can be thought of as a function that turns input parameters into output performance measures [13]. For example, if we use simulation to estimate $\mu_\rho$ (the mean system time) of the M/M/1 queuing process with a given traffic intensity $0 < \rho < 1$, we could in principle write $\mu_\rho = G(\rho)$ for some function $G$ that is stochastic and unknown. Simulation can be used to evaluate $G$ for numerical input values of $\rho$. We are interested in $\mu_\rho$ for any point in the one-dimensional region defined by the traffic intensity $\rho$. Since $\rho$ is continuous, there is an infinite number of values, and it is impossible to evaluate all possible $\rho$. Even when the parameter is discrete, it may be too costly to evaluate all of its values. Instead, we evaluate the function at certain input combinations and fit the results to some curves. The response at other input combinations is then approximated via the fitted curves (surfaces).

When a metamodel is specified to be a standard regression model, the independent variables for the regression are the simulation input parameters and the dependent variable is the response of interest [13]. For example, we could try to approximate the response-surface function $G(\rho)$ with a simple explicit formula involving $\rho$. It would be convenient to have a single formula to represent the responses of a system, but the formula is generally unknown and likely complex. Fitting the grid points to a formula also introduces error (noise) into the estimates. Furthermore, we are investigating stochastic systems, so the grid points themselves are estimates and not exact values. Although kriging was originally developed to estimate values at non-design points of deterministic models, Ankenman et al. [1] have developed a stochastic kriging procedure that takes into account the randomness of the response at design points, extending kriging to stochastic models.

In our approach, instead of obtaining a formula or a regression model, we treat the collection of grid points themselves and the set of fitted curves as the metamodel. Consequently, the metamodel is not described by a single (simple) formula; the values at non-design points are estimated via interpolation. Furthermore, the randomness of the responses at design points is described by confidence intervals. Consequently, three response surfaces are constructed: the lower bound, the expected value, and the upper bound. The form and complexity of the metamodel are likely to vary substantially across systems (see [21]). Thus, the metamodeling procedure needs to include a scheme for model selection given the simulation data, to obtain a metamodel of the least complexity that is adequate to characterize the underlying response surface.
2.1. The natural estimators

We assume that the process under study is stationary, i.e., the covariance depends only on the distance between observations and not on time. Let $F(\cdot)$ and $F_N(\cdot)$ denote, respectively, the true and the (sampled) empirical steady-state cumulative distribution functions of the simulation-output process under study, where $N$ is the simulation run length (we assume a discrete-time output process $X_1, X_2, \ldots$). For purposes of analysis, it is convenient to express $F_N(\cdot)$ as $F_N(x) = (1/N)\sum_{i=1}^{N} I_{(-\infty,x]}(X_i)$, where $I_{(-\infty,x]}(X_i) = 1$ if $X_i \le x$ and $I_{(-\infty,x]}(X_i) = 0$ if $X_i > x$.

In addition to the autocorrelations of the stochastic output sequence approaching zero as the lag between observations increases (e.g., $\phi$-mixing, see the next section), we require two things for our method to work. First, $F_N(\cdot)$ must converge to $F(\cdot)$ as $N \to \infty$. In random (i.e., independent) sampling, $X_1, X_2, \ldots, X_N$ is a random sample of size $N$; if $N \to \infty$, then $F_N(\cdot)$ tends with probability one to $F(\cdot)$. Second, if a statistic $T$ is estimating some characteristic property $W$ of the distribution, then this characteristic must satisfy certain smoothness properties (e.g., continuity and differentiability). Most characteristics, such as moments and percentiles, are smooth, so their estimators have distributions that can be estimated. Our procedure may fail when $T$ is a statistic estimating a function that is not smooth. The property $W$ could be, for example, the mean, the variance, or a quantile. The natural point estimator for $W$, denoted by $\widehat{W}$, is typically the sample mean, the sample variance, a sample quantile, or a simple function of the relevant order statistics, chosen to imitate the performance measure $W$. Furthermore, the natural estimators are appropriate for estimating any such $W$, even in the presence of autocorrelation, since $F_N(\cdot)$ converges to $F(\cdot)$.

2.2. Determining the required sample size at design points

Recall that we are investigating stochastic systems; hence, the responses (performance measures) at design points need to be estimated from many observations. Although asymptotic results are often applicable when the amount of data is "large enough," the point at which the asymptotic results become valid generally depends on unknown factors. Furthermore, the quality of metamodels depends heavily on the quality of the estimated responses at the design points. We use the procedure of Chen and Kelton [6] to estimate responses at any given design point. The procedure is valid whether or not the output process is independent and identically distributed normal; however, it requires that the output sequences satisfy the $\phi$-mixing conditions [2]. Intuitively, a stochastic process is $\phi$-mixing if its distant future events are essentially independent of its present and past events [2].

We use the Quasi-Independent (QI) procedure [6] to determine the simulation run length at each design point. When data are correlated, the QI procedure progressively increases the simulation run length until a sequence of $n$ samples (taken from the original output sequence) appears to be independent, as determined by the runs-up/runs-down tests [11]. If the $n$ systematic samples (taken every lag-$l$ observations of the stochastic output process) appear to be independent, then the simulation run length $N$ is set to $nl$. The simulation run length becomes longer as the autocorrelation becomes stronger, and the simulation run lengths at different design points are likely to differ.
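To make the run-length determination concrete, here is a minimal sketch of the quasi-independent idea. It is not the authors' implementation: the independence check below is a simplified normal-approximation runs-up/runs-down test (the QI procedure uses the runs tests described by Knuth [11]), and `simulate`, the sample count `n`, and the lag cap are hypothetical.

```python
import numpy as np

def appears_independent(x, z_crit=1.96):
    """Simplified runs-up/runs-down check using the normal approximation
    for the total number of runs up and down in an i.i.d. sequence."""
    signs = np.sign(np.diff(x))
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))  # direction changes + 1
    n = len(x)
    mean = (2 * n - 1) / 3.0    # E[runs] under independence
    var = (16 * n - 29) / 90.0  # Var[runs] under independence
    return abs((runs - mean) / np.sqrt(var)) < z_crit

def qi_run_length(simulate, n=100, max_lag=4096):
    """Double the lag l until the n systematic (lag-l) samples pass the
    independence check; return the run length N = n*l and the samples.
    `simulate(N)` is a hypothetical stand-in returning N correlated outputs."""
    lag = 1
    while lag <= max_lag:
        output = np.asarray(simulate(n * lag))
        systematic = output[lag - 1::lag]  # every l-th observation
        if appears_independent(systematic):
            return n * lag, systematic
        lag *= 2
    raise RuntimeError("no quasi-independent spacing found up to max_lag")
```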
The default precision achieved by this strategy may not meet the specified requirement. In the next section, we discuss our strategy of increasing the simulation run length to meet the accuracy/precision requirement at design points.

3. Interpolation-based metamodels

This section introduces our approach to constructing interpolation-based metamodels. The strategy is to determine the number and locations of the design points as well as the simulation run lengths at the design points.

3.1. Construction of metamodels

A metamodel will have greater accuracy and precision when constructed with a larger set of design points. On the other hand, a larger set of design points requires more simulation effort. Furthermore, the number of design points required to achieve the pre-specified accuracy and precision depends on the underlying system. We can sequentially determine the required number of design points by observing the fitted curves. It is known that sequential procedures are more efficient and may require fewer samples than fixed-sample-size procedures. Unlike fixed-sample-size procedures, sequential procedures determine the required sample size dynamically and are able to adapt to different environments. Fig. 1 displays the flowchart of the proposed procedure.

Fig. 1. Flowchart of constructing a linear-interpolation metamodel.

We select the initial design points in a space-filling way. Assuming the range of interest of the (controllable) parameter $\rho$ is $[\rho_L, \rho_U]$, we initially simulate the system at five evenly spaced design points, e.g., $\rho_1 = \rho_L < \rho_2 < \cdots < \rho_5 = \rho_U$. Let $G_i$ be the simulated (estimated) response at input $\rho_i$, and let $\widehat{G}_i$ (for $i = L+1, L+3, \ldots, U-3, U-1$) denote the estimate obtained via linear interpolation between design points $\rho_{i-1}$ and $\rho_{i+1}$. If the relative precision $|\widehat{G}_i - G_i|/|G_i| < 4\gamma$ is satisfied for all $i$, then the procedure returns the collection of all the histograms as the interpolation-based metamodel; otherwise, design point $\rho_{j_1} = (\rho_{i-1} + \rho_i)/2$ or $\rho_{j_2} = (\rho_i + \rho_{i+1})/2$ will be added for each $i$ for which the precision requirement is not satisfied. This can be considered cross validation, which leaves out one observation and then predicts it from the remaining observations. Design point $\rho_{j_1}$ will be added when $|(G_i - G_{i-1})/(\rho_i - \rho_{i-1})| > |(G_{i+1} - G_i)/(\rho_{i+1} - \rho_i)|$; otherwise, design point $\rho_{j_2}$ will be added. This is a greedy algorithm, favoring a new design point where the largest improvement in precision is expected. Note that the value $\gamma$ is an upper bound on the relative precision of the estimates obtained at design points, i.e., $|\widehat{X}_\rho - \mu_\rho|/\mu_\rho \le \gamma$, where $\widehat{X}_\rho$ is the final point estimator of $\mu_\rho$ (see [8]).

Using the value $4\gamma$ in the relative-precision check $|\widehat{G}_i - G_i|/|G_i| < 4\gamma$ is somewhat arbitrary. The rationale is that when the $\rho_i$'s are evenly spaced, the distance between design points $\rho_{i-1}$ and $\rho_i$ is half of the distance (between $\rho_{i-1}$ and $\rho_{i+1}$) used to estimate $\widehat{G}_i$, and the angle of the line between design points is also reduced by no less than half. The step of adding more design points is performed repeatedly until the specified precision of the selected estimators is achieved. Thus, the final number of design points is not known until the simulation is complete. Basically, we allocate more input combinations around design points where the performance measure changes at different rates. Our preliminary experimental results indicate that the number of design points required to achieve the specified precision is not excessive (see Section 4).
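As a companion to Fig. 1, the refinement loop of this section can be sketched as follows. This is a simplified variant that adds one design point per iteration at the worst violation (as in the stopping rule of Section 4.1) rather than one point for every violating $i$; `simulate_response` is a hypothetical stand-in for a simulation run that returns a response estimate meeting relative precision $\gamma$.

```python
import numpy as np

def refine_design(simulate_response, rho_lo, rho_hi, gamma=0.1, max_points=50):
    """Greedy sequential selection of design points (Section 3.1 sketch)."""
    rhos = list(np.linspace(rho_lo, rho_hi, 5))        # five pilot points
    G = {r: simulate_response(r) for r in rhos}
    while len(rhos) < max_points:
        worst, worst_i = 0.0, None
        for i in range(1, len(rhos) - 1):
            # Leave-one-out check: interpolate rho_i from its neighbors.
            a, b, c = rhos[i - 1], rhos[i], rhos[i + 1]
            G_hat = G[a] + (b - a) / (c - a) * (G[c] - G[a])
            rel = abs(G_hat - G[b]) / abs(G[b])
            if rel > worst:
                worst, worst_i = rel, i
        if worst_i is None or worst < 4 * gamma:
            break                                      # precision met everywhere
        # Add the midpoint on the side with the steeper slope.
        a, b, c = rhos[worst_i - 1], rhos[worst_i], rhos[worst_i + 1]
        if abs((G[b] - G[a]) / (b - a)) > abs((G[c] - G[b]) / (c - b)):
            new = (a + b) / 2
        else:
            new = (b + c) / 2
        G[new] = simulate_response(new)
        rhos = sorted(rhos + [new])
    return rhos, G
```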
3.2. Constructing confidence intervals

Because we are constructing metamodels of stochastic systems, at a given design point (e.g., traffic intensity $\rho$) we use the batch-means (BM) procedure of Chen and Kelton [6] to estimate the confidence interval of the mean of the stochastic process. Batch means are sample means of subsets of consecutive subsamples from a simulation output sequence. In the non-overlapping batch-means method, the simulation output sequence $\{X_i : i = 1, 2, \ldots, N\}$ is divided into $b$ adjacent non-overlapping batches, each of size $m$. For simplicity, we assume that $N$ is a multiple of $m$, so that $N = bm$. The method of BM is a well-known technique for estimating the variance of point estimators computed from simulation experiments; it tries to reduce autocorrelation by batching observations. The BM variance estimator (for estimating the variance of sample means, i.e., $\mathrm{Var}[\bar{X}_j]$) is simply the sample variance computed from the batch means.
Let $\bar{X}_{\rho,j}$ denote the estimator of $\mu_\rho$ in the $j$th batch, i.e., $\bar{X}_{\rho,j} = (1/m)\sum_{i=(j-1)m+1}^{jm} X_i$, for $j = 1, 2, \ldots, b$. Note that the batch means are approximately independent when the batch size $m$ is large enough. We use $\widehat{X}_\rho = (1/b)\sum_{j=1}^{b} \bar{X}_{\rho,j}$ (the average of the batch means) as a point estimator of $\mu_\rho$. Note that the average of the batch means $\widehat{X}_\rho$ equals the grand mean, i.e., the average of all samples, $\sum_{i=1}^{N} X_i / N$. By the central limit theorem, a confidence interval (CI) for $\mu_\rho$ using the i.i.d. normal $\bar{X}_{\rho,j}$'s can be computed using standard statistical procedures. That is, the ratio $T = (\widehat{X}_\rho - \mu_\rho)/(S/\sqrt{b})$ has an approximate $t$ distribution with $b-1$ degrees of freedom (d.f.), where $S^2 = (1/(b-1))\sum_{j=1}^{b} (\bar{X}_{\rho,j} - \widehat{X}_\rho)^2$ is the usual unbiased estimator of the variance of the $\bar{X}_{\rho,j}$'s. Let $h = t_{b-1,1-\alpha_2/2}\, S/\sqrt{b}$ be the half width of the $100(1-\alpha_2)\%$ CI, where $t_{b-1,1-\alpha_2/2}$ is the $1-\alpha_2/2$ quantile of the $t$ distribution with $b-1$ d.f. ($b \ge 2$). Then the $100(1-\alpha_2)\%$ CI for $\mu_\rho$ is $(\widehat{X}_\rho - h,\, \widehat{X}_\rho + h)$.

Let $a$ and $\gamma$, respectively, be the user-specified absolute and relative precision. If the half width of the CI does not meet the user-specified absolute or relative precision (i.e., neither $h < a$ nor $h < \gamma'|\hat{x}_\rho|$ holds), then we progressively increase the number of batches $b$ until the CI meets the specified requirement. We assume that the run length that achieves $h < \gamma'|\hat{x}_\rho|$ also achieves $|\widehat{X}_\rho - \mu_\rho|/\mu_\rho \le \gamma$ with $\gamma = \gamma'$. This assumption is somewhat arbitrary, but our preliminary results support it. Furthermore, $\widehat{X}_\rho - h \le \mu_\rho \le \widehat{X}_\rho + h$; hence, $|\widehat{X}_\rho - \mu_\rho| \le h$.

We construct the curve of means as well as the curves of the lower and upper confidence limits of the means at the design points via the QI procedure. The mean estimator and CI at non-design points can then be obtained via linear interpolation. Let $\widehat{X}_{\rho_a}$ and $\widehat{X}_{\rho_b}$ respectively denote the simulated estimates of the time in system of the M/M/1 queuing process with traffic intensities $\rho_a$ and $\rho_b$. Then the mean estimate at traffic intensity $\rho_a < \rho < \rho_b$ is obtained by
$$\widehat{X}_\rho = \widehat{X}_{\rho_a} + \frac{\rho - \rho_a}{\rho_b - \rho_a}\left(\widehat{X}_{\rho_b} - \widehat{X}_{\rho_a}\right).$$
The confidence limits of $\mu_\rho$ are computed in the same manner.
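A minimal sketch of the batch-means CI and of the linear interpolation of the estimate and half width at a non-design point follows. The fixed batch count and the helper names are our assumptions; the progressive increase of $b$ until the precision requirement is met is omitted.

```python
import numpy as np
from scipy import stats

def batch_means_ci(x, b=20, alpha2=0.10):
    """Non-overlapping batch-means CI for the steady-state mean (Section 3.2).
    x is the simulation output sequence; returns (point estimate, half width)."""
    x = np.asarray(x)
    m = len(x) // b                        # batch size, N = b*m assumed
    means = x[:b * m].reshape(b, m).mean(axis=1)
    xbar = means.mean()                    # grand mean
    s = means.std(ddof=1)                  # sample std of the batch means
    h = stats.t.ppf(1 - alpha2 / 2, b - 1) * s / np.sqrt(b)
    return xbar, h                         # CI is (xbar - h, xbar + h)

def interpolate_ci(rho, rho_a, rho_b, est_a, est_b):
    """Linear interpolation of the point estimate and CI half width at a
    non-design point rho_a < rho < rho_b; est_* are (estimate, half width)."""
    w = (rho - rho_a) / (rho_b - rho_a)
    xbar = est_a[0] + w * (est_b[0] - est_a[0])
    h = est_a[1] + w * (est_b[1] - est_a[1])
    return xbar, h
```

Interpolating the half width linearly is equivalent to interpolating the lower and upper confidence limits separately, since both limits are linear in the estimate and the half width.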
Let $\rho_i$, $i = 1, 2, \ldots, n$, be the final design points, and let $w_i$ be the (kriging metamodel) weights for estimating the response at $\rho$. Then the kriging estimate is $\widehat{X}_\rho = \sum_{i=1}^{n} w_i \widehat{X}_{\rho_i}$. Consequently, the linear-interpolated estimate is a special case with $w_{\rho_b} = (\rho - \rho_a)/(\rho_b - \rho_a)$, $w_{\rho_a} = 1 - w_{\rho_b}$, and $w_i = 0$ for $i \ne a, b$.

3.3. Using common random numbers

The run-length determination in the previous section aims to ensure that the relative deviation of the estimates is within $\gamma$. However, this accuracy is not guaranteed at non-design points. Recall that the goal of our metamodel construction strategy is mainly to provide the general trend of the responses. The technique of common random numbers (see, e.g., [13]) can be used to induce (positive) correlation of the responses among design points. Consequently, the deviations of the estimates from the corresponding unknown true values at design points are positively correlated. That is, the relative position of each point on the surface is preserved when the entire response surface is lowered or elevated.

3.4. Gaussian process regression

Combining the Gaussian process prior with the Gaussian likelihood via Bayes' theorem leads to a Gaussian process posterior and a Gaussian predictive distribution that can be used to predict the output at non-design points. In contrast, kriging (DACE) minimizes the MSE of the output predictor, which is a weighted linear combination of old outputs (see [22]). We summarize the basics of Gaussian process regression below; more details can be found in Bishop ([3], Chapter 6) and Rasmussen and Williams [19].

Gaussian process regression imposes a Gaussian process prior $N(m(x), k(x,x'))$ on the function $f(x)$ in the model $y = f(x) + \epsilon$ directly, using a Bayesian approach, where $m(x)$ is the mean function of an input data point $x$, $k(x,x')$ is the covariance function of any two input data points $(x, x')$, and $\epsilon$ is independent identically distributed Gaussian noise with variance $\sigma^2$. The likelihood is Gaussian, $y \mid x, f(x) \sim N(f, \sigma^2 I)$, where $x$ and $y$ are the observed data and $f = [f_1, \ldots, f_n]^T$. The resulting Gaussian process posterior is
$$f(x) \mid x, y \sim N\!\left(k(x,x)^T [K(x,x) + \sigma^2 I]^{-1} y,\; k(x,x') - k(x,x)^T [K(x,x) + \sigma^2 I]^{-1} k(x,x')\right),$$
where $k(x,x)$ is the vector of the covariance function evaluated between the point $x$ and each element of the data vector $x$, and $K(x,x)$ is the covariance matrix of the data. The predictive distribution of a new output data point $y^*$ given a new input data point $x^*$ (e.g., a non-design point) is
$$y^* \mid x^*, x, y \sim N\!\left(k(x^*,x)^T [K(x,x) + \sigma^2 I]^{-1} y,\; k(x^*,x^*) + \sigma^2 - k(x^*,x)^T [K(x,x) + \sigma^2 I]^{-1} k(x^*,x)\right).$$
The Gaussian process posterior and predictive distribution are fully specified by the mean and covariance functions. A variety of mean and covariance functions for Gaussian processes have been implemented in software such as GPML [18] to model all sorts of data, from simple to complex. Both the functional form and the hyperparameters of $m(x)$ and $k(x,x')$ need to be determined. They are usually expressed as a parametric family of functions containing hyperparameters, which can be learned from the data by maximizing the log marginal likelihood
$$\log p(y \mid x) = -\tfrac{1}{2} y^T K^{-1} y - \tfrac{1}{2} \log|K| - \tfrac{n}{2} \log(2\pi),$$
where $K$ is the covariance matrix $K(x,x)$ and $|K|$ is its determinant. This nonlinear optimization requires the gradient of the log marginal likelihood with respect to each hyperparameter $\theta_j$, which can be found to be $\tfrac{1}{2} y^T K^{-1} (\partial K/\partial \theta_j) K^{-1} y - \tfrac{1}{2}\,\mathrm{trace}\!\left(K^{-1} (\partial K/\partial \theta_j)\right)$.
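The posterior predictive equations above can be implemented directly. The sketch below assumes a zero mean function, one-dimensional inputs, and fixed hyperparameters (in practice, GPML learns them by maximizing the log marginal likelihood); it is an illustration, not the GPML implementation.

```python
import numpy as np

def sq_exp_kernel(x1, x2, ell=1.0, sf2=1.0):
    """Isotropic squared exponential covariance sf2 * exp(-r^2 / (2 ell^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sf2 * np.exp(-d2 / (2 * ell ** 2))

def gp_predict(x_train, y_train, x_test, ell=1.0, sf2=1.0, noise=1e-2):
    """GP posterior predictive mean and variance (Section 3.4, zero mean)."""
    K = sq_exp_kernel(x_train, x_train, ell, sf2) + noise * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_test, x_train, ell, sf2)       # k(x*, x)
    K_ss = sq_exp_kernel(x_test, x_test, ell, sf2)
    L = np.linalg.cholesky(K)                            # stable inversion
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha                                   # k(x*,x)^T [K+s^2 I]^-1 y
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) + noise - np.sum(v * v, axis=0)  # predictive variance
    return mean, var
```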
The covariance (kernel) function $k(x,x')$ plays the role of the correlation function in kriging: closer $x$ and $x'$ lead to closer $f(x)$ and $f(x')$ through $k(x,x')$. A common choice for $k(x,x')$ is the Matérn class of covariance functions:
$$k_{\mathrm{Matern}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right),$$
where $r = |x - x'|$. The parameter $\nu$ controls the degree of smoothness. For example, $\nu \to \infty$ gives the infinitely differentiable squared exponential covariance function $\exp(-r^2/2\ell^2)$ that we use for the M/M/1 data in Section 4.2. The parameter $\ell$ is the characteristic length-scale, the limit beyond which two input data points $x$ and $x'$ are considered uncorrelated [17]. $K_\nu$ is a modified Bessel function (see [19], Section 4.2). An additional parameter can be added to each input dimension to determine its relevance, which is called automatic relevance determination (ARD) [15,16].

4. Empirical experiments

In this section, we present experiments on the chosen design points and empirical results obtained from simulations, using the batch-means procedure of Chen and Kelton [6] and interpolation-based metamodels to estimate means.

4.1. Choosing the design points

To compare our customized design with other approaches, we perform an experiment similar to that in van Beers and Kleijnen [22]. We want to construct a metamodel of the system time of the M/M/1 queuing system for traffic intensity $0.1 \le \rho \le 0.9$. We begin with five evenly spaced traffic intensities (i.e., pilot runs) and sequentially increase the number of design points until it reaches 10, the same number as in van Beers and Kleijnen [22], for comparison. Let $c_i = |\widehat{G}_i - G_i|/|G_i|$. To adapt to this stopping rule, we add the design point $\rho_{j_1} = (\rho_{j-1} + \rho_j)/2$ or $\rho_{j_2} = (\rho_j + \rho_{j+1})/2$ when $c_j = \max_i c_i$, where $i$ is the index of the design points.

Table 1 lists the final 10 design points. The LI and V&K rows are, respectively, for the linear-interpolated approach and for van Beers and Kleijnen [22]. The LI approach places design points more evenly because the strategy basically aims to minimize the maximum of the $c_i$ and selects relatively few design points in the region where the responses change linearly with the input. The approach of V&K, on the other hand, aims to minimize the integrated mean square error and is computationally expensive. Consequently, they place all the additional design points between traffic intensities 0.7 and 0.9, where the responses have larger values and larger variances. Fig. 2 displays the true responses and the linear-interpolated responses.

4.2. Estimating means of queuing systems via metamodels

In this experiment, we estimate the steady-state waiting time of M/M/1 queuing systems. We are interested in the response for traffic intensities between 0.7 and 0.95; the waiting times are minimal when the traffic intensities are low. The proposed procedure selects the following design points (traffic intensities): 0.7, 0.7625, 0.825, 0.8875, 0.9188, and 0.95. Traffic intensity 0.9188 ((0.8875 + 0.95)/2) is included in the design points because the relative precision between the estimate at traffic intensity 0.8875 and the linear interpolation from traffic intensities 0.825 and 0.95 is larger than the specified relative precision $4\gamma = 40\%$ (see Section 3.1).

Two types of plots are constructed to display graphically the 100 realizations of each estimator in the simulation study: (1) relative deviation plots and (2) absolute deviation plots, in which estimates are plotted around their true values. Figs. 3–5 show the results. The 100 response estimates (estimated waiting times) at each design point are obtained using the QI procedure of Chen and Kelton [6]. These response estimates and the corresponding design points are used to train the metamodels.
We then estimate (predict) the responses at traffic intensities 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95 and compare the predictions to the expected responses of the M/M/1 process. In the absolute deviation plots, the solid curve represents a piecewise linear version of the true expected-response curve across various traffic intensities, and the mean estimates are plotted as points. Note that a piecewise linear version of the mean estimates can provide information about the general tendency of the means and a rough idea of the mean's sensitivity as the traffic intensity changes. For queuing processes, the mean estimates obtained through linear interpolation are biased slightly high, which can be explained by the curves in Fig. 3: the mean vs. $\rho$ curve is concave upward, i.e., $G(\rho)$ is a convex function. As the value of $\rho$ deviates further from the design points, e.g., at $\rho = 0.8$ and $\rho = 0.9$, the accuracy of the estimates worsens. Note that the (absolute or relative) precision can be improved by increasing the number of design points.
Table 1
Chosen design points.

Procedure   Traffic intensity
LI          0.1   0.3   0.4   0.5   0.6   0.7    0.75   0.8     0.85     0.9
V&K         0.1   0.3   0.5   0.7   0.8   0.85   0.875  0.8875  0.89375  0.9
Fig. 2. Plots of the linear-interpolated versus true M/M/1 system time (system time vs. traffic intensity; solid line: true responses; dashed line: linear-interpolated responses; markers: responses at the design points of van Beers and Kleijnen and of the linear-interpolated approach).
Fig. 3. Plots of the estimates for M/M/1 via kriging with the linear correlation function (top: relative deviation plots; bottom: absolute deviation plots; both panels compare kriging with the linear correlation function against the linear-interpolated estimates over traffic intensity).
We also estimate (predict) expected responses at traffic intensities 0.7, 0.75, 0.8, 0.85, and 0.95 using kriging (the DACE toolbox) and Gaussian process regression (the GPML toolbox). For kriging, we use the linear and Gaussian correlation functions, which are, respectively, $\prod_{j=1}^{n} \max(0,\, 1 - \theta_j |x_j - x'_j|)$ and $\prod_{j=1}^{n} \exp(-\theta_j (x_j - x'_j)^2)$ between two input data points $x$ and $x'$ with components $x_j$ and $x'_j$, where $\theta_j$ is a parameter.
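For reference, the two correlation functions can be written as a short sketch (the function names are ours, not DACE's; `theta` is the parameter vector):

```python
import numpy as np

def linear_corr(x1, x2, theta):
    """LIN: prod_j max(0, 1 - theta_j * |x1_j - x2_j|)."""
    return float(np.prod(np.maximum(0.0, 1.0 - theta * np.abs(x1 - x2))))

def gauss_corr(x1, x2, theta):
    """GAUSS: prod_j exp(-theta_j * (x1_j - x2_j)^2)."""
    return float(np.exp(-np.sum(theta * (x1 - x2) ** 2)))
```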
Fig. 4. Plots of the estimates for M/M/1 via kriging with the Gaussian correlation function (top: relative deviation plots; bottom: absolute deviation plots; both panels compare kriging with the Gaussian correlation function against the linear-interpolated estimates over traffic intensity).
The results are shown in Figs. 3 and 4. Our proposed linear interpolation approach is a special case of the linear correlation function in kriging. For Gaussian process regression, we use the isotropic squared exponential covariance function $\sigma_f^2 \exp[-(x-x')^T (x-x')/(2\ell^2)]$, where $\ln \ell$ and $\ln \sigma_f$ are hyperparameters. The results are shown in Fig. 5. The squared exponential covariance function is a special case of the Matérn class of covariance functions, which contains more hyperparameters; for the relatively simple data from the M/M/1 model, the squared exponential covariance function is sufficient.

Fig. 5. Plots of the estimates for M/M/1 via Gaussian processes with the isotropic squared exponential covariance function (top: relative deviation plots; bottom: absolute deviation plots).

The estimates from DACE (kriging) using the linear correlation function or from GPML are not as good as those from LI (see Figs. 3–5). The convex structure of the underlying system has a rapid rate of change around the inflection point; we suspect this may be the reason why the estimates are poor. Kriging uses weighted sample means of all observations: even though the weights of observations far away from the point to be estimated are low, they still negatively affect the estimates.

Table 2 shows the estimated confidence intervals for M/M/1. The column labeled "Avg HW" is the average of the 90% CI half widths calculated from the 100 realizations of the estimator. The "Coverage" column lists the proportion of these 100 CIs that cover the true mean.

Table 2
Estimated 90% CI half width (HW) and coverage.

                 Traffic intensity
Model            0.7     0.75    0.8     0.85    0.9     0.95
MM1   Avg HW     0.095   0.118   0.151   0.203   0.294   0.578
      Coverage   0.9     0.9     0.64    0.09    0.69    0.86

The CI coverages at the design points (i.e., $\rho = 0.7$ and 0.95) are around the nominal value of 0.90, which indicates that the estimated CI half widths are also accurate. Note that the estimates at the design points are obtained by the QI procedure [6], and the achieved coverages are consistent with the reported performance of the QI procedure. However, the CI coverages at values of $\rho$ further away from the design points (e.g., $\rho = 0.8$, 0.85, and 0.9) are less than the nominal value, especially at $\rho = 0.85$, where the coverage is zero or just above zero. This is expected because the point estimators are greater than the true values, while the half widths are about the right length. We do not consider this a major drawback of the procedure. The main purpose of the metamodel is to capture the overall tendency of the expected responses of the system under study, not to obtain accurate CIs throughout the entire traffic-intensity range of interest. Note that accurate CIs can be obtained by designating the traffic intensity of interest $\rho$ as a design point. Furthermore, with the help of the plots, we are able to predict approximately the accuracy of the interpolated estimates.

If the CI coverage at non-design points is a concern, a conservative adjustment can be used to increase the coverage of the CI estimated via interpolation. Let $c = \max(c_{i-1}, c_i)$, where $c_i = |\widehat{G}_i - G_i|/|G_i|$. If $\rho_{i-1} < \rho < \rho_i$, set the adjustment
$$A = \frac{c\,\hat{x}_\rho}{2} \cdot \frac{\rho - \rho_{i-1}}{\rho_i - \rho_{i-1}} \quad \text{if } \rho < (\rho_{i-1} + \rho_i)/2; \qquad A = \frac{c\,\hat{x}_\rho}{2} \cdot \frac{\rho_i - \rho}{\rho_i - \rho_{i-1}} \quad \text{if } \rho \ge (\rho_{i-1} + \rho_i)/2.$$
When using interpolation between design points to estimate the CI, we set
$$\mathrm{CI} = (\hat{x}_\rho - h - A,\; \hat{x}_\rho + h) \quad \text{if } \widehat{G}_i > G_i; \qquad \mathrm{CI} = (\hat{x}_\rho - h,\; \hat{x}_\rho + h + A) \quad \text{if } \widehat{G}_i < G_i.$$
Table 3
Estimated 90% CI half width and coverage after adjustment.

                 Traffic intensity
Model            0.7     0.75    0.8     0.85    0.9     0.95
MM1   Avg HW     0.095   0.133   0.224   0.304   0.434   0.578
      Coverage   0.9     0.92    0.98    0.83    0.97    0.86
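The adjustment defined above can be sketched as follows; the argument names are hypothetical, and the caller is assumed to have already computed the interpolated estimate $\hat{x}_\rho$, the half width $h$, and $c = \max(c_{i-1}, c_i)$.

```python
def adjusted_ci(rho, rho_im1, rho_i, x_hat, h, c, interp_overestimates):
    """Conservative CI adjustment at a non-design point (Section 4.2 sketch).
    interp_overestimates: True if the cross-validation interpolation
    overestimated the design-point response (G_hat_i > G_i)."""
    if rho < (rho_im1 + rho_i) / 2:
        A = (c * x_hat / 2) * (rho - rho_im1) / (rho_i - rho_im1)
    else:
        A = (c * x_hat / 2) * (rho_i - rho) / (rho_i - rho_im1)
    if interp_overestimates:
        return (x_hat - h - A, x_hat + h)   # widen the lower side
    return (x_hat - h, x_hat + h + A)       # widen the upper side
```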
While this adjusted CI will increase the coverage to around the nominal value, the range of the CI will be much larger. Table 3 lists the results.

4.3. A roadmap to higher dimensions

The metamodel we have discussed is one-dimensional; however, higher-dimensional models are of more interest. Below we provide a roadmap for two-dimensional models, followed by an illustrative experiment. Suppose a non-design point $(a+x, b+y)$ is bounded by the design points listed below:

Design point   Response
(a, b)         A
(a, b')        B
(a', b')       C
(a', b)        D

Then the response at $(a+x, b+y)$ can be estimated by
$$A + (B-A)\frac{y}{b'-b} + \left[D + (C-D)\frac{y}{b'-b} - \left(A + (B-A)\frac{y}{b'-b}\right)\right]\frac{x}{a'-a}.$$
The linear interpolation approach can be generalized to higher dimensions similarly.

In this experiment, we test our approach on a deterministic system with two parameters, namely the inventory model in [13]. Let $s$ and $d$, respectively, denote the reorder point and the order size. The average total cost per month is
$$G(s,d) = 188.51 - 1.49s - 1.24d + 0.014sd + 0.007s^2 + 0.01d^2.$$
Of course, this is unknown to the experimenters. We are interested in the response in the ranges $0 \le s \le 100$ and $4 \le d \le 100$. Fig. 6 shows the true response surface of $G(s,d)$. The initial $5^2 = 25$ design points are obtained from evenly spaced points in each dimension: the initial values of $s$ are 0, 25, 50, 75, 100, and the initial values of $d$ are 4, 28, 52, 76, 100. With the approach discussed in Section 3.1, the procedure added the following five design points (in $(s,d)$ format): (50, 40), (75, 16), (75, 40), (100, 16), and (100, 40). The pre-selected non-design points to be estimated are (20, 20), (40, 40), (60, 60), (80, 80), (20, 80), and (80, 20).
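The two-dimensional interpolation scheme and the inventory response surface can be checked with a short sketch (function names are ours). With the corner design points $(0,4)$, $(0,28)$, $(25,28)$, $(25,4)$, the sketch reproduces the LI estimate 148.29 at the non-design point $(20, 20)$ reported in Table 4.

```python
def bilinear(a, b, a2, b2, A, B, C, D, x, y):
    """Bilinear interpolation at (a+x, b+y) from the four corner responses
    A = G(a, b), B = G(a, b2), C = G(a2, b2), D = G(a2, b) of Section 4.3."""
    left = A + (B - A) * y / (b2 - b)    # interpolate along d at s = a
    right = D + (C - D) * y / (b2 - b)   # interpolate along d at s = a2
    return left + (right - left) * x / (a2 - a)

def G(s, d):
    """True response surface of the inventory model (Section 4.3)."""
    return 188.51 - 1.49*s - 1.24*d + 0.014*s*d + 0.007*s**2 + 0.01*d**2

# Estimate G at the non-design point (20, 20) from the surrounding cell.
est = bilinear(0, 4, 25, 28,
               G(0, 4), G(0, 28), G(25, 28), G(25, 4),
               x=20, y=16)
print(est, G(20, 20))   # 148.29 vs. true value 146.31, as in Table 4
```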
Fig. 6. True response surface for the inventory model: average total cost per month $G(s,d)$ over reorder point $s$ and order size $d$.
Table 4
Comparison of estimates (for each method, first row: estimate; second row: relative precision).

Points       (20, 20)   (40, 40)   (60, 60)   (80, 80)   (20, 80)   (80, 20)
True value   146.31     128.91     136.31     168.51     148.71     115.71
LI           148.29     130.54     138.64     170.01     150.21     116.73
             0.0135     0.0126     0.0171     0.0089     0.0101     0.0088
DACE         139.23     127.43     150.39     157.06     132.59     148.77
             0.0484     0.0115     0.1033     0.0680     0.1084     0.2857
GPML         134.38     143.08     151.78     160.47     166.77     128.08
             0.0815     0.1099     0.1135     0.0477     0.1215     0.1069
Table 5
Comparison of metamodel-construction techniques.

Metamodel technique   Computational cost   Ease of implementation   Correlation function   Stopping rules   Ease of analysis
LI                    Low                  Easy                     No                     Yes              Easy
DACE                  Medium               Hard                     Yes                    No               Hard
GPML                  High                 Hard                     Yes                    No               Hard
Table 4 lists the estimates and their associated relative precision, i.e., (estimate − true value)/(true value), from linear interpolation (LI), kriging (DACE), and Gaussian process regression (GPML). The linear correlation function in DACE is adopted for comparison with linear interpolation. For Gaussian process regression, the squared exponential covariance function with ARD (automatic relevance determination) is adopted: $\sigma_f^2 \exp[-(x-x')^T \Lambda^{-2} (x-x')/2]$, where $\ln \ell$, $\ln \sigma_f$, and $\Lambda$ (a diagonal matrix with ARD parameters $\ln \lambda_1, \ldots, \ln \lambda_D$) are hyperparameters. ARD helps determine the relative importance of the two input variables $s$ and $d$. For this experiment, the responses at the design points are the true values instead of estimates. Even though there is no precision guarantee for the response at non-design points, the observed relative precisions from LI are better than the desired precision of 10%. Because the response surface is convex, all LI estimates have positive relative precision. The observed relative precisions from DACE and GPML are similar to each other but generally worse than those of LI. We are aware that the performance of the methods cannot be concluded from limited experiments; we merely want to show that estimates obtained via linear interpolation of the grid points of metamodels are as good as those obtained via other sophisticated methods when the structure of the underlying system is unknown.
4.4. Comparison of interpolation-based metamodels

Table 5 gives several traits of interpolation-based metamodels and shows how the proposed metamodel (LI, for linear interpolated) and the competing metamodels (DACE, for kriging, and GPML, for Gaussian process regression) perform with regard to these traits. Kriging and Gaussian process regression require selecting the functional form of the correlation or covariance function and tuning its parameters; they can become computationally expensive. Furthermore, the correlation functions may be difficult to implement. It may also be complicated to analyze the sensitivity of the estimates because they are weighted averages involving the values of the correlation function. The kriging metamodel in its original form uses fixed design points and is a non-sequential procedure. The stopping rule based on the specified accuracy works well for the LI metamodel and can also be used for the kriging metamodel to determine the number and locations of design points. Furthermore, the validity of the LI metamodel depends on the stopping rule; that is, the stopping rule must be used with the LI metamodel. We believe that the LI metamodel is a viable choice when there is no prior knowledge of the characteristics of the underlying system. Kleijnen and van Beers [10] indicate that using the Gaussian (GAUSS in DACE) and the linear (LIN in DACE) correlation functions results in designs that look very similar in their experiments; furthermore, the estimates based on the linear correlation function have a smaller empirical integrated mean square error. van Beers and Kleijnen [22] point out that the optimal weights are stochastic, and ignoring the randomness of the estimated optimal weights tends to underestimate the true variance of the kriging predictor.
5. Conclusions

In order to obtain valid asymptotics, some estimates require more observations than others. The quasi-independent (QI) algorithm works well in determining the required simulation run length for a valid asymptotic approximation; the empirical results of Chen and Kelton [6] indicate that the QI procedure achieves the pre-specified precision. QI can be executed without user intervention. Furthermore, the procedure is a data-based algorithm, i.e., it can be embodied in a software package whose input is the data $(X_1, \ldots, X_n)$ and whose output is the mean estimate. The procedure is easy to understand, simple to implement, and a powerful tool for estimating expected responses.

Interpolation-based metamodels providing the overall tendency of the mean can be constructed with a set of carefully selected design points. The metamodel can be used as a proxy for the full-blown simulation itself in order to get at least a rough idea of what would happen for a large number of input-parameter combinations. Furthermore, our approach is a general-purpose procedure and does not assume any structure of the underlying system being simulated. Our experiments also show that the Gaussian process software GPML is a flexible tool for interpolation in random simulation.

When the underlying system is unknown, it is not clear which correlation or covariance function should be used to obtain the greatest accuracy or precision. Hence, instead of using correlation or covariance functions that require intensive computation, it is beneficial in these situations to use linear interpolation, which requires less computation. We are investigating whether using four-point Lagrange interpolation increases the accuracy.
Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.simpat.2014.02.004.

References

[1] B. Ankenman, B.L. Nelson, J. Staum, Stochastic kriging for simulation metamodeling, in: Proceedings of the 2008 Winter Simulation Conference, pp. 362–370.
[2] P. Billingsley, Convergence of Probability Measures, second ed., John Wiley and Sons, New York, 1999.
[3] C. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[4] E.J. Chen, Metamodels for estimating quantiles of systems with one controllable parameter, Simulation: Transactions of the Society for Modeling and Simulation 85 (2009) 307–317.
[5] E.J. Chen, W.D. Kelton, Empirical evaluation of data-based density estimation, in: Proceedings of the 2006 Winter Simulation Conference, pp. 333–341.
[6] E.J. Chen, W.D. Kelton, Confidence-interval estimation using quasi-independent sequences, IIE Trans. 42 (2010) 83–93.
[7] H. Gunes, H.E. Cekli, U. Rist, Data enhancement, smoothing, reconstruction and optimization by kriging interpolation, in: Proceedings of the 2008 Winter Simulation Conference, pp. 379–386.
[8] D.R. Jones, M. Schonlau, W.J. Welch, Efficient global optimization of expensive black-box functions, J. Global Optim. 13 (1998) 455–492.
[9] J.P.C. Kleijnen, Kriging metamodeling in simulation: a review, Eur. J. Oper. Res. 192 (2009) 707–716.
[10] J.P.C. Kleijnen, W.C.M. van Beers, Application-driven sequential designs for simulation experiments: kriging metamodelling, J. Oper. Res. Soc. 55 (2004) 876–883.
[11] D.E. Knuth, The Art of Computer Programming, vol. 2, third ed., Addison-Wesley, Reading, MA, 1998.
[12] D.G. Krige, A statistical approach to some basic mine valuation problems on the Witwatersrand, J. Chem. Metall. Min. Soc. S. Afr. 52 (1951) 119–139.
[13] A.M. Law, Simulation Modeling and Analysis, fourth ed., McGraw-Hill, New York, 2007.
[14] S.N. Lophaven, H.B. Nielsen, J. Sondergaard, A Matlab Kriging Toolbox, Version 2.5, IMM, Technical University of Denmark, Lyngby, 2002.
[15] D.J.C. MacKay, Bayesian methods for backprop networks, in: E. Domany, J.L. van Hemmen, K. Schulten (Eds.), Models of Neural Networks III, Springer, New York, 1994, pp. 211–254 (Chapter 6).
[16] R.M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, vol. 118, Springer, New York, 1996.
[17] C.E. Rasmussen, Advances in Gaussian processes, tutorial at NIPS 2006, Vancouver.
[18] C.E. Rasmussen, H. Nickisch, Gaussian processes for machine learning (GPML) toolbox, J. Mach. Learn. Res. 11 (2010) 3011–3015.
[19] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[20] J. Sacks, W.J. Welch, T.J. Mitchell, H.P. Wynn, Design and analysis of computer experiments, Stat. Sci. 4 (1989) 409–435.
[21] M.I. Santos, P.M. Santos, Simulation metamodels for modeling output distribution parameters, in: Proceedings of the 2007 Winter Simulation Conference, pp. 910–918.
[22] W.C.M. van Beers, J.P.C. Kleijnen, Customized sequential designs for random simulation experiments: kriging metamodeling and bootstrapping, Eur. J. Oper. Res. 186 (2008) 1099–1113.