Design of experiments for linear regression models when gradient information is available

Journal of Statistical Planning and Inference 144 (2014) 141–151

Yiou Li, Fred J. Hickernell

Department of Applied Mathematics, Illinois Institute of Technology, Room 208, Engineering 1 Building, 10 West 32nd Street, Chicago, IL 60616, United States

Available online 10 February 2013

Abstract

When it is expensive to compute a function value via numerical simulation, obtaining gradient values simultaneously can improve model efficiency. This paper considers the design problem for linear regression models fitted using both function and gradient data. A theoretical upper bound is derived on the scaled integrated mean squared error in terms of the discrepancy of the design, and this bound can be used to choose designs that are both efficient and robust under model uncertainty. Low discrepancy designs are those whose empirical distribution functions match a fixed target distribution. Numerical experiments show that for lower degree polynomial models and for models with many interactions, low discrepancy designs approximating the arcsine distribution are superior. On the other hand, for higher degree polynomial models and those with fewer interactions, low discrepancy designs approximating the uniform distribution are superior. © 2013 Elsevier B.V. All rights reserved.

Keywords: gradient information; computer experiments; efficiency; robustness; integrated mean squared error; low discrepancy

1. Introduction

For some computer simulations, obtaining a single data point can be computationally expensive, limiting the number of data points that one can afford. Thus, the choice of the design, i.e., the set of vectors of input values chosen for running these simulations, is an important problem. In many cases, it is worthwhile to evaluate the gradient of the function relating the inputs to the output using adjoint techniques (Griewank, 2003). This can substantially improve the accuracy of estimating a surrogate model for the computer simulation (Roderick et al., 2010). The gradient of a d-variate function provides d more scalar pieces of information, at a cost of perhaps only several times the cost of a function value alone, irrespective of d, as noted by Roderick et al. (2010). For a large number of parameters (factors) d, the amount of extra information per run is significant.

When d is large and the sample size, n, is modest, low degree polynomial models are often used to approximate the response because of their simple forms. The choices of the polynomial basis and the experimental design both affect the estimation error. For example, for a simplified model of nuclear reactor simulations, Li et al. (2011) demonstrated that a better choice of polynomial basis can reduce the estimation error by a factor of 10 or more by drastically reducing the condition number of the information matrix.

This research is supported in part by Department of Energy Grant DE-SC0002100.
Corresponding author. Tel.: +1 312 567 8980. E-mail addresses: [email protected] (Y. Li), [email protected] (F.J. Hickernell).
http://dx.doi.org/10.1016/j.jspi.2013.01.010


This paper focuses on the choice of the experimental design when gradient information is easily available along with function values. For simplicity, the model misspecification and the numerical or stochastic noise in the observations at two distinct locations are assumed to have negligible correlation. This means that any misspecification or numerical noise has a small length scale compared to the spacing of the design points. The goal is two-fold: to find designs that are robust to the lack of knowledge about the model form while at the same time being efficient.

The statistical model allowing for gradient data is described in Section 2, and an expression for the scaled integrated mean squared error (IMSE) is derived in Section 3. Section 4 considers the situation in which the (polynomial) basis functions used in the regression model are unknown beforehand and must be found by model selection techniques. Our interest is in low discrepancy designs. The discrepancy of a design, $\mathbf{x} = \{x_1, \dots, x_n\}$, is defined in (7), and it measures the difference between the empirical distribution of the design, $F_{\mathbf{x}}$, and a fixed target distribution, $F_{\mathrm{tar}}$. Theorem 1 in this paper shows that low discrepancy designs protect against poor IMSE. This result generalizes Theorem 1 of Hickernell and Liu (2002) in two ways, by allowing for gradient data and by allowing the distribution defining the IMSE to differ from the target distribution for the design. Section 5 presents simulation results comparing the IMSE for different models and designs.

The study here is related to that of Morris et al. (1993), who investigated how to predict the response using both function values and gradient information under a Bayesian framework. We assume that the relationship between values of the function and the gradient at different locations is modeled by the mean, which is a linear combination of some polynomial basis, and we assume that any misspecification or noise is modeled by a very spiky covariance kernel. Morris et al. (1993) assumed a constant mean and modeled the relationship of the function and gradient values at different locations through a stationary covariance kernel of product form. Thus, the dependence of the response on the factors is contained in the mean in our setting, whereas it is contained in the covariance kernel in the setting of Morris et al. (1993).

There are several reasons for choosing a spiky covariance kernel. We prefer having a mean that is modeled by a polynomial because polynomials can capture variation of the response on both large and small length scales, and low degree polynomials are manageable even for large d. If one adopts a non-trivial covariance kernel, then there are likely many parameters defining the kernel that require estimation. To avoid the problem of too many kernel parameters, one might adopt a stationary or even radially symmetric kernel. However, we are not convinced that stationarity is necessarily an appropriate assumption for computer experiments, even though it may make sense when the inputs are spatial or temporal variables. Another drawback of radially symmetric kernels or stationary kernels of product form is that they tend to have a vanishing domain of influence as d increases. At the same time, a design with n evenly spread points in a d-dimensional cube of fixed volume has design points separated by a distance proportional to $n^{-1/d}$ as $d \to \infty$. Thus, for moderate to large d many kernels are effectively spiky anyway.
If one tries to avoid the vanishing domain of influence for radially symmetric covariance kernels by making the kernels flat, then one recovers polynomial interpolation (Driscoll and Fornberg, 2002), which is similar to assuming that the mean is modeled by a polynomial. Thus, we do not think that assuming a very spiky kernel is too restrictive, especially for moderate to large d. Morris et al. (1993) showed that in their setting and in the limiting case of a spiky covariance kernel, D-optimality corresponds to maximizing the minimum distance between design points. This maximin criterion is one way to ensure that a design is space filling. In our setting we show the value of designs with low discrepancy, another way of ensuring that a design is space filling.

2. Statistical model

Suppose that an experiment has d variables, and let $\Omega_j$ be a measurable set of all possible levels of the jth variable. Common examples of $\Omega_j$ are $[-1,1]$ and $\mathbb{R}$. The experimental region, $\Omega$, is some measurable subset of $\Omega_1 \times \cdots \times \Omega_d$. An experimental design with n points, $\mathbf{x} = \{x_i = (x_{i1}, \dots, x_{id}) : i = 1, \dots, n\}$, is a subset of $\Omega$ with multiple copies of the same point allowed.

Let $\mathcal{H}$ be some vector space of real-valued functions defined on $\Omega$, and let $\mathcal{H}$ include all possible responses to the computer experiment that one might imagine. Define an operator $L_x : \mathcal{H} \to \mathbb{R}^m$, which when applied to a d-variate function $f \in \mathcal{H}$, returns an m-vector of outputs dependent on the point $x \in \Omega$. For traditional experiments, $m = 1$, and $L_x$ is the scalar evaluation functional, i.e., $L_x f = f(x)$. For the applications that motivate this study, $m = d + 1$, and $L_x$ returns both the function and gradient values at x:
$$L_x f = \left( f(x), \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_d}(x) \right)^T.$$
One can imagine cases where $L_x$ includes higher order derivatives as well. For a vector function $\mathbf{f} = (f_1, \dots, f_\ell)^T : \Omega \to \mathbb{R}^\ell$, the definition of this operator is extended to give an $m \times \ell$ matrix: $L_x \mathbf{f}^T = (L_x f_1, \dots, L_x f_\ell)$.

A linear regression model can be written as
$$\tilde{y}_i = (L_{x_i} g^T)\beta + \tilde{e}_i, \quad i = 1, \dots, n, \qquad (1)$$
where $\tilde{y}_i$ denotes the observed $m \times 1$ vector response at the design point $x_i$, $g = (g_1, \dots, g_k)^T$ is the vector of basis functions (polynomials), $\beta$ is the $k \times 1$ regression coefficient to be estimated, and $\tilde{e}_i$ is the error in estimating the response by the linear combination of k basis functions. It is assumed for simplicity that the m-vectors $\tilde{e}_1, \dots, \tilde{e}_n$ are independent and identically distributed with zero mean and covariance matrix $\sigma^2 \tilde{\Lambda}$. Here, $\sigma^2$ is the overall magnitude of the covariance matrix, which does not affect the design, and $\tilde{\Lambda}$ is a suitably normalized version of the $m \times m$ covariance matrix, which may affect the design.

The $\tilde{e}_i$ may correspond to purely random error in the algorithm generating the data, e.g., if the numerical simulation involves random numbers. Alternatively, $\tilde{e}_i$ may correspond to $L_{x_i} h$, where h is an instance of a stochastic process whose values at different design points are (essentially) independent and identically distributed. This is a model where the numerical noise or misspecification has a very small length scale. When $\tilde{e}_i = L_{x_i} h$, the independence of observations i and $i'$ for $i \ne i'$ assumes that the points in the design, $\mathbf{x}$, are all distinct, i.e., $x_i \ne x_{i'}$. All the examples considered here assume distinct design points, but the theory below allows for repeated design points, which might be desirable in the case of purely random noise.

The above linear regression model can be written even more compactly in vector–matrix notation as $\mathbf{y} = G\beta + \mathbf{e}$, where
$$G = \begin{pmatrix} L_{x_1} g^T \\ \vdots \\ L_{x_n} g^T \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} \tilde{y}_1 \\ \vdots \\ \tilde{y}_n \end{pmatrix}, \qquad \mathbf{e} = \begin{pmatrix} \tilde{e}_1 \\ \vdots \\ \tilde{e}_n \end{pmatrix},$$
and $\mathbf{e}$ has zero mean and $mn \times mn$ covariance matrix $\sigma^2 \Lambda$, where $\Lambda = \mathrm{diag}(\tilde{\Lambda}, \dots, \tilde{\Lambda})$. The weighted least squares estimate of the regression coefficient $\beta$ is given by
$$\hat{\beta} = B\mathbf{y} = \beta + B\mathbf{e}, \qquad B = (G^T \Lambda^{-1} G)^{-1} G^T \Lambda^{-1}. \qquad (2)$$
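To make (1) and (2) concrete, the following sketch (illustrative only, not from the paper; the quadratic basis, noise level, and equally spaced design are hypothetical choices) simulates function and derivative data for one factor and computes the weighted least squares estimate.

import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.linspace(-1.0, 1.0, n)                # design points in [-1, 1]

def Lx_gT(t):
    # L_x g^T for the basis g(x) = (1, x, x^2)^T: function row, then derivative row,
    # so m = d + 1 = 2 and each block is 2 x 3.
    return np.array([[1.0, t, t**2],
                     [0.0, 1.0, 2.0 * t]])

beta_true = np.array([1.0, -0.5, 2.0])
Lam_tilde = np.eye(2)                        # normalized error covariance, Lambda-tilde

G = np.vstack([Lx_gT(t) for t in x])         # mn x k matrix of Eq. (2)
y = G @ beta_true + 0.05 * rng.standard_normal(G.shape[0])

Lam = np.kron(np.eye(n), Lam_tilde)          # block-diagonal mn x mn covariance
beta_hat = np.linalg.solve(G.T @ np.linalg.solve(Lam, G),
                           G.T @ np.linalg.solve(Lam, y))
print(beta_hat)                              # close to beta_true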

The experimental design influences $\hat{\beta}$ through the $k \times k$ information matrix. Let
$$M_x = (L_x g^T)^T \tilde{\Lambda}^{-1} (L_x g^T) \qquad (3a)$$
be the information matrix corresponding to a point distribution that puts unit probability at x. The information matrix for a general probability distribution, F, defined on $\Omega$ is given by
$$M_F = \int_\Omega M_x \, dF(x). \qquad (3b)$$
Specifically, the information matrix for a design, $\mathbf{x}$, or equivalently, for its empirical distribution, $F_{\mathbf{x}}$, is defined as
$$M_{\mathbf{x}} = \frac{1}{n} G^T \Lambda^{-1} G = \frac{1}{n} \sum_{i=1}^n M_{x_i} = \int_\Omega M_x \, dF_{\mathbf{x}}(x). \qquad (3c)$$
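A small self-contained sketch of (3a)–(3c), with a hypothetical one-factor basis $g(x) = (1, x)^T$, gradient data, and $\tilde{\Lambda} = I$: the design information matrix is the average of the point information matrices.

import numpy as np

def Lx_gT(t):                     # 2 x 2 block: function row g^T, then derivative row
    return np.array([[1.0, t],
                     [0.0, 1.0]])

def M_point(t):                   # Eq. (3a) with Lambda-tilde = I
    L = Lx_gT(t)
    return L.T @ L

x = np.linspace(-1.0, 1.0, 8)
M_design = sum(M_point(t) for t in x) / len(x)    # Eq. (3c)
print(M_design)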

Recall that for traditional linear regression where only function values are collected, $L_x g^T = g^T(x)$, $\tilde{\Lambda} = 1$, and $M_x = g(x) g^T(x)$.

3. Scaled integrated mean squared error

The goal of the linear regression is the estimation of $T(g^T \beta)$, where $T : \mathcal{H} \to L_2^p$ is some operator producing a p-vector of functions defined on $\Omega$ that are square integrable with respect to the probability distribution $F_{\mathrm{IMSE}}$. The inner product and norm for $L_2^p$ are generalizations of those for $L_2^1 = L_2$. For any vectors of functions, $u = (u_1, \dots, u_p)^T$ and $v = (v_1, \dots, v_p)^T$, let
$$\langle u, v \rangle_{L_2^p} = \int_\Omega u^T(x) v(x) \, dF_{\mathrm{IMSE}}(x), \qquad (4a)$$
$$\| u \|_{L_2^p} = \sqrt{\langle u, u \rangle_{L_2^p}} = \left\| \sqrt{u^T(\cdot)\, u(\cdot)} \right\|_{L_2}. \qquad (4b)$$

If one is only interested in approximating the response to the computer experiment, then $p = 1$ and $Tf = f$, which is the embedding operator. This is perhaps the most common case of interest. However, if one is interested in approximating the same kinds of quantities used to estimate the response, then Tf might include the gradient, or even higher order derivatives, of f. As was done for $L_x$, the definition of T is extended in the natural way so that for $\mathbf{f} = (f_1, \dots, f_\ell)^T$, $T\mathbf{f}^T$ is a $p \times \ell$ matrix whose jth column is $T f_j$. Based on this operator T one may then define the (scaled) integrated mean squared error in terms of the difference between the true response, $T(g^T \beta)$, and the fitted response, $T(g^T \hat{\beta})$:
$$\mathrm{IMSE}(\mathbf{x}, g) = \frac{n}{\sigma^2} \, E \left\| T(g^T \beta) - T(g^T \hat{\beta}) \right\|_{L_2^p}^2. \qquad (5)$$
The notation emphasizes the dependence on the design, $\mathbf{x}$, and on the vector of basis functions, g. The dependence on the type of data collected, $L_x$, and on T is suppressed. The scaling constant $n/\sigma^2$ is used to remove the dependence on the number of sample points and the overall size of the error. Under this definition, doubling the number of experiments by duplicating a design does not affect the IMSE.


Proposition 1. The IMSE can be written as the trace of the product of two matrices:
$$\mathrm{IMSE}(\mathbf{x}, g) = \mathrm{tr}(M_{\mathbf{x}}^{-1} A),$$
where the information matrix, $M_{\mathbf{x}}$, defined in (3), depends on the design, and
$$A = \left( \langle T g_i, T g_j \rangle_{L_2^p} \right)_{i,j=1}^k = \int_\Omega (T g^T)^T(x) \, (T g^T)(x) \, dF_{\mathrm{IMSE}}(x)$$
does not depend on the design.

Proof. By the linearity of T, the expression for the regression coefficient in (2), and the definition of the $L_2^p$ inner product in (4), it follows that the $L_2^p$-norm of the difference between the true and fitted response can be written in terms of the vector of errors:
$$T(g^T \hat{\beta}) - T(g^T \beta) = (T g^T)(\hat{\beta} - \beta) = (T g^T) B \mathbf{e},$$
$$\left\| T(g^T \hat{\beta}) - T(g^T \beta) \right\|_{L_2^p}^2 = \left\| \sqrt{\mathbf{e}^T B^T (T g^T)^T (T g^T) B \mathbf{e}} \right\|_{L_2}^2 = \int_\Omega \mathbf{e}^T B^T (T g^T)^T(x) (T g^T)(x) B \mathbf{e} \, dF_{\mathrm{IMSE}}(x) = \mathbf{e}^T B^T A B \mathbf{e},$$
where B is defined in (2). Using standard arguments about the quadratic forms of random vectors allows one to simplify the IMSE:
$$\mathrm{IMSE}(\mathbf{x}, g) = \frac{n}{\sigma^2} E(\mathbf{e}^T B^T A B \mathbf{e}) = \frac{n}{\sigma^2} \mathrm{tr}(\mathrm{cov}(\mathbf{e}) B^T A B) = n \, \mathrm{tr}(\Lambda B^T A B) = n \, \mathrm{tr}(B \Lambda B^T A) = \mathrm{tr}(M_{\mathbf{x}}^{-1} A)$$
by the definition of B in (2) and the definition of the information matrix in (3). □

For the traditional case where only function value data is obtained, and the goal is to estimate the response function,
$$M_{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^n M_{x_i} = \frac{1}{n} \sum_{i=1}^n g(x_i) g^T(x_i), \qquad A = \left( \int_\Omega g_i(x) g_j(x) \, dF_{\mathrm{IMSE}}(x) \right)_{i,j=1}^k = M_{\mathrm{IMSE}},$$
and again $\mathrm{IMSE}(\mathbf{x}, g) = \mathrm{tr}(M_{\mathbf{x}}^{-1} A) = \mathrm{tr}(M_{\mathbf{x}}^{-1} M_{\mathrm{IMSE}})$. The generalization of this concept in Proposition 1 is that the data gathered now include more than function values, e.g., gradients, and one may want to estimate not only the response but also its derivatives.
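As a small illustration (our example, not from the paper), the trace formula can be evaluated directly for the traditional case with a hypothetical monomial basis and $F_{\mathrm{IMSE}}$ uniform on $[-1,1]$.

import numpy as np

def g(t):                                     # hypothetical basis g(x) = (1, x, x^2)^T
    return np.array([1.0, t, t**2])

x = np.linspace(-1.0, 1.0, 8)
M_x = sum(np.outer(g(t), g(t)) for t in x) / len(x)

# A_{ij} = int_{-1}^{1} x^{i+j} dF_unif(x) = (1 + (-1)^{i+j}) / (2 (i + j + 1)).
A = np.array([[(1 + (-1)**(i + j)) / (2 * (i + j + 1)) for j in range(3)]
              for i in range(3)])

imse = np.trace(np.linalg.solve(M_x, A))      # tr(M_x^{-1} A) of Proposition 1
print("IMSE(x, g) =", imse)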

4. Low discrepancy design bounds the IMSE

While one might attempt to choose an experimental design, $\mathbf{x}^*$, that optimizes the IMSE, i.e.,
$$\mathrm{IMSE}(\mathbf{x}^*, g) \le \mathrm{IMSE}(\mathbf{x}, g) \quad \forall \mathbf{x},$$
such a design would by necessity depend strongly on the choice of basis functions, g. Yet, the selection of the final g typically happens after the experiment has been run. The approach taken in this section is different. First one chooses some (typically continuous) target distribution, $F_{\mathrm{tar}}$, whose corresponding IMSE is acceptable regardless of the choice of g. This target distribution may differ from the one defining the IMSE. It is demonstrated here that this moderate IMSE may be maintained, again insensitive to the eventual choice of g, by choosing an experimental design whose empirical distribution function, $F_{\mathbf{x}}$, closely approximates $F_{\mathrm{tar}}$.

Recall from the description of the statistical model in Section 2 that $\mathcal{H}$ is a vector space of possible response functions. In this section it is assumed that $\mathcal{H}$ is a separable Hilbert space with a reproducing kernel $K : \Omega \times \Omega \to \mathbb{R}$ (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004; Fasshauer, 2007). A reproducing kernel has the following properties:
$$K(\cdot, x) \in \mathcal{H}, \qquad f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}} \quad \forall x \in \Omega, \ f \in \mathcal{H}.$$
That is, $K(\cdot, x)$ is the representer for the functional that evaluates a function at a point x. A reproducing kernel is also symmetric in its arguments and positive semi-definite:
$$K(x, t) = K(t, x) \quad \forall t, x \in \Omega,$$
$$\sum_{i,j=1}^n c_i c_j K(x_i, x_j) \ge 0 \quad \forall n \in \mathbb{N}, \ c_1, \dots, c_n \in \mathbb{R}, \ x_1, \dots, x_n \in \Omega.$$
Moreover, any function satisfying the above two conditions is the reproducing kernel for some unique Hilbert space.

Reproducing kernel Hilbert spaces can be used to provide a tight upper bound on the numerical integration error. In particular, for a fixed probability distribution $F_{\mathrm{tar}}$, and a design $\mathbf{x} = \{x_1, \dots, x_n\}$ with empirical distribution $F_{\mathbf{x}}$ that is supposed to approximate $F_{\mathrm{tar}}$, it follows that (Hickernell, 1999)
$$\left| \int_\Omega f(x) \, dF_{\mathrm{tar}}(x) - \frac{1}{n} \sum_{i=1}^n f(x_i) \right| = \left| \int_\Omega f(x) \, d[F_{\mathrm{tar}}(x) - F_{\mathbf{x}}(x)] \right| \le D(\mathbf{x}; F_{\mathrm{tar}}) \, V(f). \qquad (6)$$

Here D is the discrepancy, which measures how much $F_{\mathbf{x}}$ differs from $F_{\mathrm{tar}}$ using the kernel K, and is defined as
$$D(\mathbf{x}; F_{\mathrm{tar}}) = \left[ \int_{\Omega^2} K(x, t) \, d\{F_{\mathrm{tar}}(x) - F_{\mathbf{x}}(x)\} \, d\{F_{\mathrm{tar}}(t) - F_{\mathbf{x}}(t)\} \right]^{1/2}$$
$$= \left[ \int_{\Omega^2} K(x, t) \, dF_{\mathrm{tar}}(x) \, dF_{\mathrm{tar}}(t) - \frac{2}{n} \sum_{i=1}^n \int_\Omega K(x_i, t) \, dF_{\mathrm{tar}}(t) + \frac{1}{n^2} \sum_{i,k=1}^n K(x_i, x_k) \right]^{1/2}. \qquad (7)$$
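For illustration, the following sketch evaluates (7) for a uniform target on $[0,1]^d$ with one possible product kernel, $K(x,t) = \prod_j [1 + \min(x_j, t_j)]$; the kernel and the test design are our assumptions, since the paper leaves the kernel unspecified. The coordinate integrals $\int_0^1 \min(x,t)\,dt = x - x^2/2$ and $\int_0^1 \int_0^1 \min(x,t)\,dx\,dt = 1/3$ give closed forms for the first two terms.

import numpy as np

def discrepancy(X):
    n, d = X.shape
    term1 = (4.0 / 3.0) ** d                  # double integral of K over F_tar x F_tar
    term2 = np.prod(1.0 + X - X**2 / 2.0, axis=1).sum() * 2.0 / n
    K = np.prod(1.0 + np.minimum(X[:, None, :], X[None, :, :]), axis=2)
    term3 = K.sum() / n**2
    return np.sqrt(term1 - term2 + term3)

X = (np.arange(1, 9)[:, None] - 0.5) / 8.0    # 8 evenly spread points in [0, 1], d = 1
print(discrepancy(X))                         # small for evenly spread points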

Although D depends on the kernel K, this dependence is suppressed for simplicity of notation. Moreover, the variation of the integrand, V(f), is the (semi-)norm of its non-constant part:
$$V(f) = \begin{cases} \| f \|_{\mathcal{H}} & \text{if } 1 \notin \mathcal{H}, \\ \left( \| f \|_{\mathcal{H}}^2 - \langle f, 1 \rangle_{\mathcal{H}}^2 / \| 1 \|_{\mathcal{H}}^2 \right)^{1/2} & \text{if } 1 \in \mathcal{H}. \end{cases} \qquad (8)$$

The discrepancy bound may be used to show that a space filling design, namely one with small discrepancy, gives a reasonable IMSE for a large class of possible g. The following theorem generalizes Theorem 1 of Hickernell and Liu (2002) by allowing more general information, such as gradients, to be observed, and allowing $F_{\mathrm{tar}} \ne F_{\mathrm{IMSE}}$.

Theorem 1. Suppose that $F_{\mathrm{tar}}$ is some target probability distribution function defined on $\Omega$, which may be different from $F_{\mathrm{IMSE}}$, and that $\mathcal{H}$ is a reproducing kernel Hilbert space of functions defined on $\Omega$ with reproducing kernel K. Consider the information matrix and the IMSE for $F_{\mathrm{tar}}$, defined analogously to those for a design with a finite number of support points:
$$M_{\mathrm{tar}} = \int_\Omega M_x \, dF_{\mathrm{tar}}(x) = \int_\Omega (L_x g^T)^T \tilde{\Lambda}^{-1} (L_x g^T) \, dF_{\mathrm{tar}}(x), \qquad \mathrm{IMSE}(\mathrm{tar}, g) = \mathrm{tr}(M_{\mathrm{tar}}^{-1} A),$$
and suppose that the function $h_a : x \mapsto a^T M_{\mathrm{tar}}^{-1/2} M_x M_{\mathrm{tar}}^{-1/2} a$ lies in $\mathcal{H}$ for any $a \in \mathbb{R}^k$. Define the variation over the basis g as
$$V_{g,\mathrm{tar}} = \sup_{\|a\|_2 \le 1} V(h_a),$$
where V is the variation defined in (8). Then it follows that the IMSE is bounded above by
$$\mathrm{IMSE}(\mathbf{x}, g) \le \frac{\mathrm{IMSE}(\mathrm{tar}, g)}{1 - D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}}},$$
provided that $D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}} < 1$.

Proof. Let $\tilde{M} = I - M_{\mathrm{tar}}^{-1/2} M_{\mathbf{x}} M_{\mathrm{tar}}^{-1/2} = M_{\mathrm{tar}}^{-1/2} (M_{\mathrm{tar}} - M_{\mathbf{x}}) M_{\mathrm{tar}}^{-1/2}$. The spectral radius of $\tilde{M}$, denoted $\rho(\tilde{M})$, can be bounded above as follows using inequality (6):
$$\rho(\tilde{M}) = \sup_{\|a\|_2 \le 1} \left| a^T \tilde{M} a \right| = \sup_{\|a\|_2 \le 1} \left| a^T M_{\mathrm{tar}}^{-1/2} (M_{\mathrm{tar}} - M_{\mathbf{x}}) M_{\mathrm{tar}}^{-1/2} a \right|$$
$$= \sup_{\|a\|_2 \le 1} \left| a^T M_{\mathrm{tar}}^{-1/2} \left( \int_\Omega M_x \, dF_{\mathrm{tar}}(x) - \int_\Omega M_x \, dF_{\mathbf{x}}(x) \right) M_{\mathrm{tar}}^{-1/2} a \right|$$
$$= \sup_{\|a\|_2 \le 1} \left| \int_\Omega h_a(x) \, d[F_{\mathrm{tar}}(x) - F_{\mathbf{x}}(x)] \right| \le D(\mathbf{x}; F_{\mathrm{tar}}) \sup_{\|a\|_2 \le 1} V(h_a) = D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}}.$$
If $D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}} < 1$, then the smallest eigenvalue of $M_{\mathrm{tar}}^{-1/2} M_{\mathbf{x}} M_{\mathrm{tar}}^{-1/2} = I - \tilde{M}$ is no smaller than $1 - D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}} > 0$. Since A is positive semi-definite, it follows that
$$\mathrm{tr}(M_{\mathbf{x}}^{-1} A) = \mathrm{tr}\left[ (M_{\mathrm{tar}}^{-1} M_{\mathbf{x}})^{-1} M_{\mathrm{tar}}^{-1} A \right] \le \rho\left( (M_{\mathrm{tar}}^{-1} M_{\mathbf{x}})^{-1} \right) \mathrm{tr}(M_{\mathrm{tar}}^{-1} A) = \rho\left( [M_{\mathrm{tar}}^{-1/2} M_{\mathbf{x}} M_{\mathrm{tar}}^{-1/2}]^{-1} \right) \mathrm{IMSE}(\mathrm{tar}, g) \le \frac{\mathrm{IMSE}(\mathrm{tar}, g)}{1 - D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}}},$$
which completes the proof. □

The upper bound on the IMSE in Theorem 1 is a product of two terms. The term $\mathrm{IMSE}(\mathrm{tar}, g)$ depends on the basis, g, the distribution defining the IMSE, $F_{\mathrm{IMSE}}$, and the target distribution, $F_{\mathrm{tar}}$, but not on the design, $\mathbf{x}$. The other term, $1/[1 - D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}}]$, depends on the basis, g, the target distribution, $F_{\mathrm{tar}}$, and the design, $\mathbf{x}$, but not on the distribution defining the IMSE, $F_{\mathrm{IMSE}}$. Breaking this down further, $M_{\mathrm{tar}}$ depends on g and $F_{\mathrm{tar}}$, while A depends on g and $F_{\mathrm{IMSE}}$. The product $M_{\mathrm{tar}}^{-1} A$ depends on g but is unchanged if g is replaced by an equivalent linear transformation, say Cg for non-singular C. Thus, one should choose $F_{\mathrm{tar}}$ to make $\mathrm{IMSE}(\mathrm{tar}, g) = \mathrm{tr}(M_{\mathrm{tar}}^{-1} A)$ small, but not necessarily minimal, for the family of bases g of interest. In fact, it would likely be impossible for $F_{\mathrm{tar}}$ to make $\mathrm{IMSE}(\mathrm{tar}, g)$ minimal over a set of possible g with more than one element. In some cases a good $F_{\mathrm{tar}}$ might be different from $F_{\mathrm{IMSE}}$.

For example, consider the case of a model that is linear in a single factor, $\Omega = [-1,1]$, $g(x) = (1, \sqrt{3} x)^T$. First, consider the case where the data is simply function values, i.e., $L_x f = f(x)$, and just the output function is to be approximated, i.e., $Tf = f$. Let the distribution defining the IMSE be the uniform distribution, i.e., $F_{\mathrm{IMSE}} = F_{\mathrm{unif}}(x) = (1 + x)/2$. We also consider the arcsine distribution, $F_{\mathrm{asin}}(x) = 1 - \cos^{-1}(x)/\pi$. For these choices
$$M_x = \begin{pmatrix} 1 & \sqrt{3} x \\ \sqrt{3} x & 3 x^2 \end{pmatrix}, \qquad A = M_{\mathrm{IMSE}} = M_{\mathrm{unif}} = \int_\Omega M_x \, dF_{\mathrm{unif}}(x) = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
$$M_{\mathrm{asin}} = \int_\Omega M_x \, dF_{\mathrm{asin}}(x) = \begin{pmatrix} 1 & 0 \\ 0 & 3/2 \end{pmatrix}, \qquad \mathrm{tr}(M_{\mathrm{asin}}^{-1} A) = \frac{5}{3} < 2 = \mathrm{tr}(M_{\mathrm{unif}}^{-1} A).$$
Thus, the arcsine distribution is a better choice for the target distribution, $F_{\mathrm{tar}}$, than the uniform distribution.

Consider now the same situation, but with the data including first derivative values, $L_x f = (f(x), f'(x))^T$ and $\tilde{\Lambda} = I$. Again the arcsine distribution is a better choice for the target distribution, but the advantage is not as great:
$$M_x = \begin{pmatrix} 1 & \sqrt{3} x \\ \sqrt{3} x & 3(1 + x^2) \end{pmatrix}, \qquad A = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
$$M_{\mathrm{asin}} = \begin{pmatrix} 1 & 0 \\ 0 & 9/2 \end{pmatrix}, \qquad M_{\mathrm{IMSE}} = M_{\mathrm{unif}} = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix},$$
$$\mathrm{tr}(M_{\mathrm{asin}}^{-1} A) = \frac{11}{9} = \frac{44}{36} < \frac{45}{36} = \frac{5}{4} = \mathrm{tr}(M_{\mathrm{unif}}^{-1} A).$$
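These closed-form values are easy to confirm numerically; the sketch below (ours, not from the paper) treats the derivative-data case, using the substitution $x = -\cos(\pi u)$, $u \sim U(0,1)$, for the arcsine law so that all integrands are smooth.

import numpy as np
from scipy.integrate import quad

def E_unif(h):                               # expectation under F_unif on [-1, 1]
    return quad(lambda t: 0.5 * h(t), -1.0, 1.0)[0]

def E_asin(h):                               # expectation under the arcsine law
    return quad(lambda u: h(-np.cos(np.pi * u)), 0.0, 1.0)[0]

# Function plus derivative data, g(x) = (1, sqrt(3) x)^T, Lambda-tilde = I,
# so M_x has entries 1, sqrt(3) x, and 3 (1 + x^2); A = I since Tf = f.
def M(expect):
    return np.array([[expect(lambda t: 1.0),            expect(lambda t: np.sqrt(3) * t)],
                     [expect(lambda t: np.sqrt(3) * t), expect(lambda t: 3 * (1 + t**2))]])

for name, E in [("uniform", E_unif), ("arcsine", E_asin)]:
    print(name, np.trace(np.linalg.inv(M(E))))   # 5/4 = 1.25 vs. 11/9 = 1.2222...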

Once the target distribution has been fixed, one should pay attention to the term $1/[1 - D(\mathbf{x}; F_{\mathrm{tar}}) V_{g,\mathrm{tar}}]$ in the upper bound in Theorem 1. This says to choose the design, $\mathbf{x}$, so that $D(\mathbf{x}; F_{\mathrm{tar}})$ is as small as possible, i.e., to choose the empirical distribution of the design to closely approximate the target distribution. This causes the IMSE to be close to $\mathrm{tr}(M_{\mathrm{tar}}^{-1} A)$, provided that $V_{g,\mathrm{tar}}$ is not too large. The term $V_{g,\mathrm{tar}}$ measures the roughness or degree of oscillation of the basis g. If g contains polynomials of too high a degree, then $V_{g,\mathrm{tar}}$ is large, and $D(\mathbf{x}; F_{\mathrm{tar}})$ must be very small to guard against aliasing.

It is emphasized that Theorem 1 represents a hybrid approach to constructing experimental designs. The optimal design approach chooses the design that minimizes $\mathrm{IMSE}(\mathbf{x}, g)$ for a particular g, but then g must be known in advance of the experiment. The approach here chooses $F_{\mathrm{tar}}$ to make $\mathrm{tr}(M_{\mathrm{tar}}^{-1} A)$ reasonably small for a variety of possible g that the experimenter believes to be most relevant. Then the design, $\mathbf{x}$, is chosen to make $D(\mathbf{x}; F_{\mathrm{tar}})$ as small as possible. The resulting design should be robust in the sense that it has reasonable IMSE efficiency over a variety of possible models, the best of which will be selected after the experimental data is observed. In the following section, we illustrate this approach.

Before presenting those examples, we wish to make some remarks about low discrepancy designs. Such designs are prevalent in the high dimensional integration literature (Novak and Woźniakowski, 2001), where d may be in the tens or hundreds. In that setting the run size is typically quite large, in the thousands or more. The run sizes considered here are much smaller. In the next section we use the Sobol' sequence, which is perhaps the best known example of a digital net (Dick and Pillichshammer, 2010), a popular family of low discrepancy designs, and find that it performs well. Both the deterministic version and the randomly scrambled version discovered by Owen (2000) are used. In practice, one might wish to construct a low discrepancy design using optimization methods. Winker and Fang (1997) and Fang et al. (2000) have developed methods to do so, but that is beyond the scope of this paper. An example of generating such points with current software is sketched below.
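The sketch uses scipy.stats.qmc, a module that postdates this paper, so it is a modern stand-in rather than the authors' code.

import numpy as np
from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=True, seed=7)   # Owen-type scrambled Sobol' sequence
u = sampler.random(32)                            # 32 points (a power of 2) in [0, 1)^2
x = 2.0 * u - 1.0                                 # rescaled to the cube [-1, 1]^2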

Low discrepancy is one measure of the ability of the design to fill space. There are other measures, such as the maximin criterion, also called the separation distance, and the minimax criterion, also called the covering radius or fill distance. These latter measures are based on balls defined in terms of Euclidean distance, and as such are subject to the curse of dimensionality. A ball of radius $\delta < 1$ only has a volume of $O(\delta^d)$, which becomes quite small as the number of factors, d, increases. On the other hand, discrepancy does not tend to suffer from this curse of dimensionality. The kernels used to define the discrepancy are typically of tensor product form. This type of construction means that the discrepancy measures how well the empirical distribution of the design, $\mathbf{x}$, approximates the target distribution, $F_{\mathrm{tar}}$, by looking at all marginals of these two distributions involving all possible numbers of factors at a time, $1, \dots, d$. Finally, the notion of space filling being used here allows for non-uniform filling, if $F_{\mathrm{tar}}$ is not the uniform distribution.

Practical constructions of low discrepancy designs typically assume a uniform target distribution. When $F_{\mathrm{tar}}$ is non-uniform, one may apply an inverse distribution transformation to the uniform design points to obtain a design that closely approximates $F_{\mathrm{tar}}$. This is the approach taken in the next section. In the examples below, designs approximating the arcsine distribution sometimes yield a smaller IMSE.

5. Numerical experiments on low discrepancy design

This section computes $\mathrm{IMSE}(\mathbf{x}, g)$ defined in (5) and simplified in Proposition 1 for a variety of different designs, $\mathbf{x}$, and bases, g, for one-factor, two-factor, and five-factor models. The aim is to show which choices of the target distribution, $F_{\mathrm{tar}}$, and the design, $\mathbf{x}$, lead to smaller IMSE. Recall that the IMSE is defined to be independent of the number of runs and the variance of the error, but to depend on the choice of basis and on the choice of support points and weights for the design. In this section, it is assumed that $Tf = f$, which means we only wish to predict the response function but not its derivatives, even though derivative data may be observed. Furthermore, $F_{\mathrm{IMSE}}$ is the uniform distribution, $F_{\mathrm{unif}}$. We consider both of the following cases: (i) observing only function data, and (ii) observing function and gradient data together with $\tilde{\Lambda} = I$.

The first example is a univariate (one-factor) situation with the domain $\Omega = [-1,1]$. We compare the IMSE of eight different designs, $\mathbf{x}$:

(a) simple uniform random points,
(b) simple arcsine random points,
(c) low discrepancy (unscrambled Sobol') points,
(d) low discrepancy points transformed to approximate the arcsine distribution,
(e) the classical D-optimal design for basis $g = (1, x, \dots, x^7)^T$,
(f) the classical arcsine design corresponding to basis $g = (1, x, \dots, x^7)^T$ (Pukelsheim, 1993, Section 9.6),
(g) the continuous uniform design, $F_{\mathrm{unif}}$, and
(h) the continuous arcsine design, $F_{\mathrm{asin}}$.

Designs (e) and (f) have only eight distinct support points, regardless of n, whereas the other discrete designs, (a)–(d), have n support points. The continuous designs are the limits of low discrepancy designs or simple random designs when the number of runs approaches infinity.

[Fig. 1. $\mathrm{IMSE}(\mathbf{x}, g)$ versus degree of model, s, for case (i), function value data only.]


One-dimensional Sobol' designs, (c), are essentially equally spaced points. In this case they have been transformed to reach the extremes of the interval and are explicitly
$$\left\{ -1, \frac{3 - n}{n - 1}, \frac{5 - n}{n - 1}, \dots, \frac{n - 3}{n - 1}, 1 \right\}.$$
Designs (d) and (f) are the same for n = 8, but not for n = 32. (A code sketch of constructing designs (c) and (d) appears after Fig. 2.)

The models considered choose the bases $g = (1, x, \dots, x^s)^T$ with $s = 1, \dots, 7$. For each of these models, the figure of merit $\mathrm{IMSE}(\mathbf{x}, g)$ given in Proposition 1 is computed for both cases (i) and (ii), and for each design (a)–(h). For case (i) only the results for n = 32 are shown, since the IMSE is quite large for n = 8 and the larger values of s. For case (ii) the effective number of data is 2n, and so results for both n = 8 and n = 32 are shown. For the random designs the average $\mathrm{IMSE}(\mathbf{x}, g)$ is recorded over 1000 replications. The results are displayed in Figs. 1 and 2.

Perhaps not surprisingly, it is observed that random designs perform the poorest. For both the cases with and without gradient information and for the models that are linear or quadratic in x (s = 1, 2), the arcsine designs, which push points towards the ends of the interval, perform the best in terms of the IMSE. However, for models with quartic or higher order terms in x ($s = 4, \dots, 7$), the uniform designs yield smaller IMSE. This illustrates the value of sometimes choosing $F_{\mathrm{tar}}$ to be different from $F_{\mathrm{IMSE}} = F_{\mathrm{unif}}$, a point not recognized by Hickernell and Liu (2002). For each s, a Sobol' design, approximating either the arcsine or the uniform distribution, is the best among the discrete designs, (a)–(f). For n = 32, Sobol' designs perform similarly to their continuous counterparts; for case (i) they actually perform a bit better and for case (ii) they perform a bit worse. It is interesting to note that although the effective number of data in case (ii) is only twice as much as in case (i), the IMSE in case (ii) is, in most cases, much less than half of that in case (i).

[Fig. 2. $\mathrm{IMSE}(\mathbf{x}, g)$ versus degree of model, s, for case (ii), function and gradient data, for n = 8 (top) and n = 32 (bottom).]
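A minimal sketch (ours) of constructing designs (c) and (d); the inverse of the arcsine distribution function $F_{\mathrm{asin}}(x) = 1 - \cos^{-1}(x)/\pi$ is $F_{\mathrm{asin}}^{-1}(u) = -\cos(\pi u)$.

import numpy as np

n = 32
u = np.arange(n) / (n - 1.0)       # equally spaced in [0, 1], endpoints included
x_unif = 2.0 * u - 1.0             # design (c): -1, (3-n)/(n-1), (5-n)/(n-1), ..., 1
x_asin = -np.cos(np.pi * u)        # design (d): inverse arcsine CDF applied to u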


The next example considers multiple factors ($d > 1$) on the domain $\Omega = [-1,1]^d$. All cases assume both function and gradient data observed together with $\tilde{\Lambda} = I$. Univariate polynomials are chosen so that $M_{\mathrm{IMSE}}$ is diagonal. The first five such univariate polynomials are
$$1, \quad x, \quad x^2 - \frac{1}{3}, \quad x^3 - \frac{9}{10} x, \quad x^4 - \frac{33}{28} x^2 + \frac{27}{140}.$$
Li et al. (2011) have shown this choice of polynomials to lead to a much better conditioned information matrix, $M_{\mathbf{x}}$. Then a tensor product basis of these univariate polynomials is constructed. There are four cases:

(iii) d = 2 factors, an additive model of polynomials up to degree 3 in each factor,
(iv) d = 2 factors, all multivariate polynomials of up to total degree 3,
(v) d = 2 factors, all multivariate polynomials of up to total degree 3 plus univariate 4th degree polynomials, and
(vi) d = 5 factors, all multivariate polynomials of up to total degree 3.

The number of runs is n = 8 for cases (iii)–(iv) and n = 32 for cases (v)–(vi). For each model, we compare the IMSEs of eight different designs: (a), (b), (g), and (h) as described above, plus

(c) randomly scrambled low discrepancy (Sobol') points,
(d) scrambled low discrepancy points transformed to approximate the arcsine distribution,
(e) maximin Latin hypercube designs, and
(f) generalized maximin Latin hypercube designs (Dette and Pepelyshev, 2010).

A sketch of constructing the case (iv) tensor product basis is given below.
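The following sketch (ours; the helper name basis is hypothetical) enumerates the case (iv) basis from the univariate polynomials listed above.

from itertools import product
import numpy as np

p = [lambda t: np.ones_like(t),              # the five... first four univariate polynomials
     lambda t: t,
     lambda t: t**2 - 1.0 / 3.0,
     lambda t: t**3 - 0.9 * t]

def basis(x1, x2, max_total_degree=3):
    # Products p_a(x1) * p_b(x2) with a + b <= max_total_degree: 10 functions for d = 2.
    return np.array([p[a](x1) * p[b](x2)
                     for a, b in product(range(4), repeat=2)
                     if a + b <= max_total_degree])

print(basis(0.3, -0.5).shape)                # (10,)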

All the discrete designs, (a)–(f), are random, and their average IMSE is computed over 1000 instances. The empirical distribution function of the IMSEs for all eight design families is plotted for cases (iii)–(vi) in Figs. 3–6.

[Fig. 3. Empirical CDF of $\mathrm{IMSE}(\mathbf{x}, g)$ for case (iii), the 2-factor cubic additive model with eight runs.]

[Fig. 4. Empirical CDF of $\mathrm{IMSE}(\mathbf{x}, g)$ for case (iv), the 2-factor full cubic model with eight runs.]


[Fig. 5. Empirical CDF of $\mathrm{IMSE}(\mathbf{x}, g)$ for case (v), the 2-factor quartic model with 32 runs.]

[Fig. 6. Empirical CDF of $\mathrm{IMSE}(\mathbf{x}, g)$ for case (vi), the 5-factor full cubic model with 32 runs.]

For lower degree polynomials with interaction terms, cases (iv) and (vi), the arcsine designs yield smaller IMSE, and for higher degree polynomials or models without interaction terms, cases (iii) and (v), the uniform designs yield smaller IMSE. In all these examples the low discrepancy designs outperform the other discrete designs.

6. Conclusion

Gradient information, when available, adds predictive power to experiments. When the exact form of the linear regression model is unknown, i.e., when one does not know which polynomials should be included, then a low discrepancy design yields good IMSE. Although the numerical examples presented here are not exhaustive, transforming the uniform, low discrepancy design to approximate the arcsine distribution appears to be advantageous when the polynomial degree of the eventual model is believed to be low and/or many interaction terms are present. Otherwise, the uniform, low discrepancy points should remain untransformed.

Acknowledgments

The authors thank the referees and the editor for their helpful comments. The second author thanks the organizers of the very fruitful International Conference on Design of Experiments, held in May 2011 at the University of Memphis, where some of this work was presented.

References

Aronszajn, N., 1950. Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404.
Berlinet, A., Thomas-Agnan, C., 2004. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston.
Dette, H., Pepelyshev, A., 2010. Generalized Latin hypercube design for computer experiments. Technometrics 52, 421–429.
Dick, J., Pillichshammer, F., 2010. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge.


Driscoll, T.A., Fornberg, B., 2002. Interpolation in the limit of increasingly flat radial basis functions. Computers & Mathematics with Applications 43, 413–422.
Fang, K.T., Lin, D.K.J., Winker, P., Zhang, Y., 2000. Uniform design: theory and applications. Technometrics 42, 237–248.
Fasshauer, G.E., 2007. Meshfree Approximation Methods with MATLAB. Interdisciplinary Mathematical Sciences, vol. 6. World Scientific Publishing Co., Singapore.
Griewank, A., 2003. A mathematical view of automatic differentiation. Acta Numerica 12, 321–398.
Hickernell, F.J., 1999. Goodness-of-fit statistics, discrepancies and robust designs. Statistics & Probability Letters 44, 73–78.
Hickernell, F.J., Liu, M.Q., 2002. Uniform designs limit aliasing. Biometrika 89, 893–904.
Li, Y., Anitescu, M., Roderick, O., Hickernell, F.J., 2011. Orthogonal bases for polynomial regression with derivative information in uncertainty quantification. International Journal for Uncertainty Quantification 1, 297–320.
Morris, M.D., Mitchell, T.J., Ylvisaker, D., 1993. Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics 35, 243–255.
Novak, E., Woźniakowski, H., 2001. When are integration and discrepancy tractable? In: Foundations of Computational Mathematics. Cambridge University Press, Cambridge, pp. 211–266.
Owen, A.B., 2000. Monte Carlo, quasi-Monte Carlo, and randomized quasi-Monte Carlo. In: Niederreiter, H., Spanier, J. (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 1998. Springer-Verlag, Berlin, pp. 86–97.
Pukelsheim, F., 1993. Optimal Design of Experiments. John Wiley & Sons, New York.
Roderick, O., Anitescu, M., Fischer, P., 2010. Polynomial regression approaches using derivative information for uncertainty quantification. Nuclear Science and Engineering 164, 122–139.
Winker, P., Fang, K.T., 1997. Application of threshold accepting to the evaluation of the discrepancy of a set of points. SIAM Journal on Numerical Analysis 34, 2028–2042.