Computational Statistics & Data Analysis 51 (2006) 823–835
MCMC-based local parametric sensitivity estimations

C.J. Pérez, J. Martín, M.J. Rufo

Faculty of Veterinary, Department of Mathematics, University of Extremadura, Avda. de la Universidad s/n, 10071 Cáceres, Spain

Received 22 November 2003; received in revised form 23 December 2004; accepted 8 September 2005. Available online 5 October 2005.
Abstract

Bayesian inferences for complex models need to be made by approximation techniques, mainly by Markov chain Monte Carlo (MCMC) methods. For these models, sensitivity analysis is a difficult task. A novel, computationally low-cost approach to estimating local parametric sensitivities in Bayesian models is proposed. The method makes it possible to estimate the sensitivity measures and their errors with the same random sample that has been generated to estimate the quantity of interest. Conditions allowing a derivative-integral interchange in the operator of interest are required. Two illustrative examples show how sensitivity computations with respect to the prior distribution and the loss function are easily obtained in practice.

Keywords: Bayesian decision theory; Bayesian inference; MCMC; Simulation
1. Introduction

Bayesian statistics has become more popular thanks to the appearance of Markov chain Monte Carlo (MCMC) methods (see Brooks, 1998 for a review and Gilks et al., 1998 for a monograph). These simulation techniques make it possible to obtain numerical solutions for problems based on really complex models. Sometimes, MCMC methods are the only computationally efficient alternative.

When performing a Bayesian analysis, inferences depend on input models such as the prior distribution, the likelihood or the loss function. Besides the model solution, some description of its sensitivity to the specification of these inputs is necessary. Sensitivity of inferences to the choice of the prior distribution has been widely investigated (see, for example, Berger, 1994). Sivaganesan (1993) and Dey et al. (1996) studied sensitivity with respect to the prior and the likelihood. Martín et al. (1998) considered the loss function, and Martín et al. (2003) investigated joint sensitivity with respect to the utility and the prior distribution. Two relevant monographs on robust Bayesian analysis are provided by Berger et al. (1996) and Ríos Insua and Ruggeri (2000).

Sensitivity analysis can be studied from two viewpoints: local and global. Local sensitivity considers the behavior of posterior quantities of interest under infinitesimal perturbations from a specified input (prior or loss in this paper). On the other hand, global sensitivity quantifies the range of a posterior quantity when the prior (loss) varies in a class. See Sivaganesan (2000) for a comparative review of the local and global approaches to Bayesian robustness.

Sensitivity analyses are demanded by several authors to be applied in complex models that need to be solved by MCMC methods (see, for example, Ríos Insua and Ruggeri, 2000). Some authors, like Richardson and Green (1997), Hall
et al. (2003), and Halekoh and Vach (2003), study parametric sensitivity by solving the model for several values of the prior parameters. Basically, they re-run the Markov chain for different parameters of the prior distributions and estimate the quantities of interest under those parameter specifications. This is computationally costly and, generally, not enough. Therefore, it would be convenient to develop a general method that can be applied to estimate local sensitivities. This is the issue addressed in this paper.

The outline of the paper is as follows. In Section 2, the problem is described and some relevant results on local sensitivity analysis are summarized. Section 3 focuses on the parametric sensitivity case. A computationally low-cost method to estimate local parametric sensitivities in models solved by MCMC methods is proposed. Section 4 presents two illustrative examples which show how the proposed method is easily applied in practice. A discussion is presented in Section 5. Finally, an appendix containing the proofs is included.
2. Local sensitivity estimations

Suppose the interest is focused on the estimation of a quantity that can be expressed as the integral of a real-valued function f over a multidimensional domain with respect to a density p, i.e.

\[ \int_{\Theta} f(\theta)\, p(\theta)\, d\theta. \tag{1} \]

When p is the posterior distribution for $\theta$, i.e., $p(\theta|x)$, the integral (1) becomes the posterior expectation of $f(\theta)$. In this case, the posterior mean is recovered when f is the identity function. In Bayesian decision theory and inference, the elicitation of the prior distribution $\pi(\theta)$ (or of $f(\theta)$ when it denotes the loss function) is far from simple. A class of prior distributions (or loss functions) is usually considered in robustness analysis. Then, expression (1) is denoted by

\[ I(\pi, f) = \frac{\int f(\theta)\, l(\theta|x)\, \pi(\theta)\, d\theta}{\int l(\theta|x)\, \pi(\theta)\, d\theta}, \qquad \pi \in \Gamma,\ f \in F, \tag{2} \]

where $\Gamma$ denotes a class of prior distributions and F represents a class of functions. Note that the likelihood $l(\theta|x)$ is considered fixed and both classes must be chosen to make (2) integrable.

Suppose that sampling directly from $p(\theta|x)$ is so complex that the use of MCMC methods is necessary. Note that this is the most usual case for real problems in Bayesian inference. Let $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}$ be a sample generated from $p(\theta|x)$ by MCMC methods, where $\pi(\cdot) \in \Gamma$ and $f \in F$ are fixed. Then, an estimate of $I(\pi, f)$ is given by

\[ \widehat{I}(\pi, f) = \frac{1}{n} \sum_{i=1}^{n} f\big(\theta^{(i)}\big). \]
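As a minimal illustration (not part of the original paper), the following Python sketch computes this estimate, together with a naive Monte Carlo standard error, from an array of stored MCMC draws; the function and argument names are hypothetical.

```python
import numpy as np

def estimate_I(theta_sample, f):
    """Estimate I(pi, f) by the sample mean (1/n) * sum_i f(theta^(i))."""
    fvals = np.asarray([f(th) for th in theta_sample])
    I_hat = fvals.mean()
    # Naive Monte Carlo standard error (it ignores the MCMC autocorrelation).
    se = fvals.std(ddof=1) / np.sqrt(len(fvals))
    return I_hat, se
```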
Now, the interest is focused on evaluating the impact on $I(\pi, f)$ when $\pi$ (or f) varies in a neighborhood, i.e., a local sensitivity analysis is required. The choice of a sensitivity analysis method depends to a great extent on (a) the sensitivity measures employed, (b) the accuracy of the estimates of the sensitivity measures, and (c) the computational cost involved. The rest of this section is devoted to measures used in local functional sensitivity, whereas the next section deals with the particular case of local parametric sensitivity.

The use of derivatives is widely studied in the sensitivity literature. Diaconis and Freedman (1986) proposed to use Fréchet derivatives with respect to the prior distribution in local sensitivity analysis, with the norm of the derivative as a sensitivity measure. Cuevas and Sanz (1988) extended the results and provided a mathematical framework. Other local approaches based on $\varepsilon$-contaminated classes or Gateaux derivatives are presented in Srinivasan and Truszczynska (1990), Ruggeri and Wasserman (1995), Dey et al. (1996) and Sivaganesan (2000), among others. Some related definitions and results are summarized below.

The notation is the following: the set of parameters endowed with a $\sigma$-algebra $\mathcal{B}$ is denoted by $\Theta$, the likelihood is $l(\theta|x)$, where x is the data vector, the class of prior distributions is $\Gamma$, the objective function belonging to a class F is represented by $f(\theta)$, and, finally, the set of signed measures with total mass zero over $(\Theta, \mathcal{B})$ is denoted by M.
Definition 1. The Fréchet derivative of $I(\pi, f)$ with respect to $\pi$ is a bounded linear operator $\dot{I}$ in the set M that satisfies

\[ I(\pi + \delta, f) = I(\pi, f) + \dot{I}(\delta) + o(\|\delta\|), \qquad \forall \delta \in M. \]
Under mild conditions, the Fréchet derivative exists and coincides with the Gateaux derivative (see, for example, Diaconis and Freedman, 1986). For example, when the Euclidean norm for I and the total variation norm for M are used, Proposition 2 is obtained. This result and its applicability conditions are shown in Diaconis and Freedman (1986) and Ruggeri and Wasserman (1993).

Proposition 2. Under mild conditions, the following expression holds:

\[ \dot{I}(\delta) = \frac{1}{D(\pi)} \big[ N(\delta, f) - I(\pi, f)\, D(\delta) \big], \tag{3} \]

where $N(\delta, f) = \int f(\theta)\, l(\theta|x)\, \delta(d\theta)$ and $D(\delta) = \int l(\theta|x)\, \delta(d\theta)$.
A sensitivity measure based on the Fréchet derivative is

\[ \|\dot{I}\| = \sup_{\delta \in M,\ \|\delta\| \le 1} \|\dot{I}(\delta)\|. \]

It measures the maximum change in I for infinitesimal changes in $\pi$. The following proposition gives an expression for $\|\dot{I}\|$.

Proposition 3. Under mild conditions, the following expression holds:

\[ \|\dot{I}\| = \frac{1}{D(\pi)} \sup_{\theta \in \Theta} |h(\theta)|, \]

where $h(\theta) = l(\theta|x)\, |f(\theta) - I(\pi, f)|$.

This local sensitivity measure can be computed in MCMC-based models by using algorithms to approximate $D(\pi)$ and $\sup_{\theta} |h(\theta)|$. It is possible to approximate $D(\pi)$ by methods like, for example, the ones in DiCiccio et al. (1997). The approximation of $\sup_{\theta} |h(\theta)|$ can be addressed by stochastic optimization methods like simulated annealing, genetic algorithms or tabu search. This sensitivity measure leads to asymptotic problems (see Gustafson et al., 1996). Moreover, the value of $\|\dot{I}\|$ is attained at a $\delta$ with two point masses, so $\pi + \delta$ is usually not a reasonable prior. These problems suggest restricting M to a subset.

In order to avoid the previous problems, the following $\varepsilon$-contaminated class is considered:

\[ \Gamma_{\varepsilon} = \{ \pi_{\varepsilon}(\theta) = (1 - \varepsilon)\, \pi(\theta) + \varepsilon\, q(\theta);\ q \in Q \}, \]

with Q a class of probability distributions over $\Theta$ and $\varepsilon > 0$ a small quantity. A possible extension for m-dimensional settings can be done by using the class
\[ \Gamma_{\varepsilon} = \left\{ \pi_{\varepsilon}(\theta) = \Big(1 - \sum_{i=1}^{m} \varepsilon_i\Big) \pi(\theta) + \sum_{i=1}^{m} \varepsilon_i\, \pi(\theta_{-i}|\theta_i)\, q_i(\theta_i);\ q_i \in Q_i \right\}, \]

where, now, $\theta = (\theta_1, \ldots, \theta_m)$, $\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_m)$, and $Q_i$ is a class of probability distributions over $\Theta_i$, $i = 1, 2, \ldots, m$. If the vector $q = (q_1, \ldots, q_m)$ is fixed, then

\[ I(\varepsilon, q) = I\Big( \Big(1 - \sum_{i=1}^{m} \varepsilon_i\Big) \pi + \sum_{i=1}^{m} \varepsilon_i\, \pi(\cdot_{-i}|\cdot_i)\, q_i,\ f \Big) \]
is an operator from $[0, 1]^m$ to the real line. The Fréchet derivative is represented by the Jacobian matrix. In this case, it is

\[ \Big( \partial_{\varepsilon_1} I(\varepsilon, q)\big|_{\varepsilon=0},\ \ldots,\ \partial_{\varepsilon_m} I(\varepsilon, q)\big|_{\varepsilon=0} \Big). \]

The following vector and its norm can be used to measure the sensitivity in the m-dimensional contaminated class:

\[ \Big( \sup_{q_1 \in Q_1} \partial_{\varepsilon_1} I(\varepsilon, q)\big|_{\varepsilon=0},\ \ldots,\ \sup_{q_m \in Q_m} \partial_{\varepsilon_m} I(\varepsilon, q)\big|_{\varepsilon=0} \Big). \]

Sivaganesan (1996) provides asymptotic results for this measure in one-dimensional settings. For the multidimensional case, it is easy to prove that

\[ \sup_{q_i \in Q_i} \partial_{\varepsilon_i} I(\varepsilon, q)\big|_{\varepsilon=0} = \frac{1}{D(\pi)} \sup_{q_i \in Q_i} \int \big( f(\theta) - I(\pi, f) \big)\, l(\theta|x)\, \pi(\theta_{-i}|\theta_i)\, q_i(\theta_i)\, d\theta. \]

Assuming that $\pi(\theta_{-i}|\theta_i)$ is known for all i, the supremum can be easily computed, or approximated, when the $Q_i$ are classes of probability distributions or quantile classes. The main drawback of this measure is that computations for restricted classes Q are difficult. Ruggeri and Wasserman (1995) proposed a dimensionless local sensitivity measure based on density ratio classes. An adaptation of this idea could be implemented for models solved by MCMC methods. Gustafson et al. (1996) recommended, with limitations, the use of parametric classes to avoid computational and asymptotic problems. Moreover, different kinds of parametric prior distributions are recommended for many Bayesian models. For example, improper priors generally lead to improper posteriors in mixture models, so conjugate prior distributions are suggested (see Richardson and Green, 1997 and Stephens, 1997).

The next section presents local parametric sensitivity as a particular case of functional sensitivity, since the proposed local parametric sensitivity measure, i.e., the gradient vector, is the Fréchet derivative.

3. Parametric classes

In this section, a parametric class of prior distributions $\Gamma = \{\pi_{\lambda} : \lambda \in \Lambda\}$ and a parametric class of functions $F = \{f_{\gamma} : \gamma \in \Psi\}$ are considered. Let $I_{\lambda}(\pi, f)$ and $I_{\gamma}(\pi, f)$ ($I_{\lambda}$ and $I_{\gamma}$ for short) denote the dependence of the posterior quantity $I(\pi, f)$ on $\lambda = (\lambda_1, \ldots, \lambda_m)$ and $\gamma = (\gamma_1, \ldots, \gamma_m)$, respectively.

Firstly, sensitivity with respect to the prior distribution is considered. Assume that a local sensitivity analysis around $\lambda = \lambda_0$ is required in the class $\Gamma$. Then, $I_{\lambda}$ is, as a function of the parameters, an operator from $\mathbb{R}^m$ to the real line. In this case, the Fréchet derivative is the gradient vector evaluated at $\lambda_0$, i.e.

\[ \nabla I_{\lambda_0} = \big( \partial_{\lambda_1} I_{\lambda_0},\ \partial_{\lambda_2} I_{\lambda_0},\ \ldots,\ \partial_{\lambda_m} I_{\lambda_0} \big). \tag{4} \]

This coincides with the measure considered in Turányi and Rabitz (2003) and references therein. The components in (4), i.e., the partial derivatives with respect to each $\lambda_j$ evaluated at $\lambda_0$, indicate how rapidly $I_{\lambda}$ changes in an infinitesimal neighborhood of $\lambda_0$ along that axis. Therefore, they can be used as rates of change with respect to the parameter components, and $\nabla I_{\lambda_0}$ can be considered a local sensitivity measure for the parameter $\lambda$ at $\lambda_0$. The gradient vector represents the precise direction of maximum increase of $I_{\lambda}$ at $\lambda_0$. Furthermore, it indicates which component has the largest influence on the output. If the gradient vector norm is calculated, i.e.

\[ \|\nabla I_{\lambda_0}\| = \sqrt{ \big(\partial_{\lambda_1} I_{\lambda_0}\big)^2 + \big(\partial_{\lambda_2} I_{\lambda_0}\big)^2 + \cdots + \big(\partial_{\lambda_m} I_{\lambda_0}\big)^2 }, \]

the steepness of the graph of $I_{\lambda}$ at $\lambda_0$ is obtained. The norm $\|\nabla I_{\lambda_0}\|$ determines the slope of the tangent hyperplane at $\lambda_0$, and it represents the maximum speed of change in any direction for $I_{\lambda}$ in the neighborhood of $\lambda_0$.

The concept just developed is also valid for sensitivity with respect to the function f. In this case, the operator $I_{\gamma}$ is considered and a local sensitivity analysis around $\gamma = \gamma_0$ is required.
In this context, the main problem is to calculate the gradient vector. A computationally low-cost method to estimate the components of the gradient vector is proposed. In order to implement the method, some conditions allowing a derivative-integral interchange are needed. These conditions are presented in Appendix A. When they are satisfied, the method allows local sensitivity measures to be obtained in the parametric case, as shown in Section 4. Next, the method is presented.

3.1. Sensitivity with respect to the prior distribution

The posterior quantity of interest is

\[ I_{\lambda_0} = \int f(\theta)\, p_{\lambda_0}(\theta|x)\, d\theta = \frac{\int f(\theta)\, l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}. \]

Under the conditions presented in Theorem 1 (Appendix A), each component of $\nabla I_{\lambda_0}$ can be expressed as

\[ \partial_{\lambda_j} I_{\lambda_0} = \int \big( f(\theta) - I_{\lambda_0} \big)\, \frac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, p_{\lambda_0}(\theta|x)\, d\theta. \]

If $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}$ is a sample generated from the posterior distribution $p_{\lambda_0}(\theta|x)$ (mainly by MCMC methods), then the estimate of $\partial_{\lambda_j} I_{\lambda_0}$ is given by

\[ \widehat{\partial_{\lambda_j} I_{\lambda_0}} = \frac{1}{n} \sum_{i=1}^{n} \Big( f\big(\theta^{(i)}\big) - \widehat{I}_{\lambda_0} \Big)\, \frac{\partial_{\lambda_j} \pi_{\lambda_0}\big(\theta^{(i)}\big)}{\pi_{\lambda_0}\big(\theta^{(i)}\big)}, \tag{5} \]

where $\widehat{I}_{\lambda_0} = (1/n) \sum_{i=1}^{n} f\big(\theta^{(i)}\big)$ is the estimate of $I_{\lambda_0}$. For each j, the estimate given in (5) is asymptotically unbiased, so its asymptotic mean squared error can be measured by its standard error. The Monte Carlo standard error estimate of (5) is given by

\[ \widehat{\mathrm{SE}}\Big( \widehat{\partial_{\lambda_j} I_{\lambda_0}} \Big) = \sqrt{ \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \Big( f\big(\theta^{(i)}\big) - \widehat{I}_{\lambda_0} \Big)\, \frac{\partial_{\lambda_j} \pi_{\lambda_0}\big(\theta^{(i)}\big)}{\pi_{\lambda_0}\big(\theta^{(i)}\big)} - \widehat{\partial_{\lambda_j} I_{\lambda_0}} \right)^2 }. \tag{6} \]

Note that the estimate in (6) can be easily obtained from the generated sample that has previously been used to estimate $I_{\lambda_0}$ and $\partial_{\lambda_j} I_{\lambda_0}$. This fact introduces a computational saving that can be interesting for Bayesian models solved by MCMC methods. Another advantage of the proposed method is that only the partial derivatives of the prior distribution are needed; the procedure avoids the need to calculate the partial derivatives of the posterior distribution.

Suppose that an importance-sampling method were implemented by using the posterior as the importance distribution. Under conditions allowing a derivative-integral interchange, each partial derivative could be expressed as
\[ \partial_{\lambda_j} I_{\lambda_0} = \partial_{\lambda_j} \int f(\theta)\, p_{\lambda_0}(\theta|x)\, d\theta = \int f(\theta)\, \frac{\partial_{\lambda_j} p_{\lambda_0}(\theta|x)}{p_{\lambda_0}(\theta|x)}\, p_{\lambda_0}(\theta|x)\, d\theta = E_{p_{\lambda_0}}\!\left[ \frac{f(\theta)\, \partial_{\lambda_j} p_{\lambda_0}(\theta|x)}{p_{\lambda_0}(\theta|x)} \right]. \]

Then, the sensitivity measures could be estimated only when a closed expression for the posterior distribution (known up to a normalization constant) and its partial derivatives could be obtained. It is well known that the explicit form for the posterior distribution $p_{\lambda_0}(\theta|x)$ is usually intractable. Besides, the use of the posterior distribution as the importance distribution might not be a good choice.
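As an illustration only (not code from the paper), the following Python sketch implements estimator (5) and the standard error (6) from stored posterior draws. Here `f` and `score_j` (the ratio $\partial_{\lambda_j}\pi_{\lambda_0}(\theta)/\pi_{\lambda_0}(\theta)$) are user-supplied functions, and all names are hypothetical.

```python
import numpy as np

def prior_sensitivity(theta_sample, f, score_j):
    """Estimate the j-th gradient component (5) and its Monte Carlo SE (6).

    theta_sample: draws theta^(1), ..., theta^(n) from p_{lambda_0}(theta | x)
    f: quantity of interest f(theta)
    score_j: ratio of the prior partial derivative to the prior density at lambda_0
    """
    fvals = np.asarray([f(th) for th in theta_sample])
    svals = np.asarray([score_j(th) for th in theta_sample])
    n = len(fvals)
    I_hat = fvals.mean()                 # estimate of I_{lambda_0} from the same draws
    terms = (fvals - I_hat) * svals      # summands of estimator (5)
    dI_hat = terms.mean()                # estimate of the partial derivative
    se = np.sqrt(((terms - dI_hat) ** 2).sum() / (n * (n - 1)))  # standard error (6)
    return I_hat, dI_hat, se
```

The same array of draws is used for $\widehat{I}_{\lambda_0}$, the derivative estimate and its error, which is precisely the computational saving described above.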
828
C.J. Pérez et al. / Computational Statistics & Data Analysis 51 (2006) 823 – 835
Partial derivatives can be used to measure the impact on the posterior quantity of infinitesimal changes in the prior distribution. However, this measure of local sensitivity is an absolute one. A dimensionless sensitivity measure is

\[ s_j = \frac{\big| \partial_{\lambda_j} I_{\lambda_0} \big|}{\sigma}, \]

where $\sigma^2 = \int \big( f(\theta) - I_{\lambda_0} \big)^2 p_{\lambda_0}(\theta|x)\, d\theta$ and can be estimated as

\[ \widehat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \Big( f\big(\theta^{(i)}\big) - \widehat{I}_{\lambda_0} \Big)^2, \]

where $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}$ is the sample previously generated from $p_{\lambda_0}(\theta|x)$. The idea is based on Ruggeri and Wasserman (1995). Dividing the absolute value of the partial derivative for component j by $\sigma$, a dimensionless quantity is obtained. One might say that the rate of change is "small" if this absolute value spans less than 1/4 of the posterior standard deviation $\sigma$. Then, the rate of change relative to the deviation of f from the posterior quantity of interest $I_{\lambda_0}$ could be considered non-substantial. Therefore, as a rule of thumb, it could be considered that inference on the posterior quantity is locally robust when $s_j$ is less than 0.25 (see Berger et al., 2000 for a similar idea).
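A small sketch (not from the paper) showing how $s_j$ could be computed by reusing the draws and the derivative estimate from the previous sketch; names are hypothetical.

```python
import numpy as np

def dimensionless_measure(theta_sample, f, dI_hat):
    """s_j = |partial derivative estimate| / posterior standard deviation of f(theta)."""
    fvals = np.asarray([f(th) for th in theta_sample])
    sigma_hat = fvals.std(ddof=1)      # estimate of sigma from the same sample
    return abs(dI_hat) / sigma_hat     # compare with the 0.25 rule of thumb
```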
3.2. Sensitivity with respect to the function f

Now, the interest is focused on

\[ I_{\gamma_0} = \int f_{\gamma_0}(\theta)\, p(\theta|x)\, d\theta = \frac{\int f_{\gamma_0}(\theta)\, l(\theta|x)\, \pi(\theta)\, d\theta}{\int l(\theta|x)\, \pi(\theta)\, d\theta}. \]

The procedure is analogous to the previous case. Under the conditions presented in Theorem 2 (Appendix A), for each j,

\[ \partial_{\gamma_j} I_{\gamma_0} = \int \partial_{\gamma_j} f_{\gamma_0}(\theta)\, p(\theta|x)\, d\theta, \]
and its estimate is given by

\[ \widehat{\partial_{\gamma_j} I_{\gamma_0}} = \frac{1}{n} \sum_{i=1}^{n} \partial_{\gamma_j} f_{\gamma_0}\big(\theta^{(i)}\big), \tag{7} \]

where $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)} \sim p(\theta|x)$. Now, the Monte Carlo standard error estimate of (7) is given by

\[ \widehat{\mathrm{SE}}\Big( \widehat{\partial_{\gamma_j} I_{\gamma_0}} \Big) = \sqrt{ \frac{1}{n(n-1)} \sum_{i=1}^{n} \Big( \partial_{\gamma_j} f_{\gamma_0}\big(\theta^{(i)}\big) - \widehat{\partial_{\gamma_j} I_{\gamma_0}} \Big)^2 }. \]

The dimensionless sensitivity measure is

\[ s_j = \frac{\big| \partial_{\gamma_j} I_{\gamma_0} \big|}{\sigma_f}, \]

where $\sigma_f^2 = \int \big( f_{\gamma_0}(\theta) - I_{\gamma_0} \big)^2 p(\theta|x)\, d\theta$ and can be estimated as

\[ \widehat{\sigma}_f^2 = \frac{1}{n-1} \sum_{i=1}^{n} \Big( f_{\gamma_0}\big(\theta^{(i)}\big) - \widehat{I}_{\gamma_0} \Big)^2. \]

The interpretation of this dimensionless measure is analogous to the previous case.
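For illustration only (not from the paper), a Python sketch of the analogous computations for sensitivity with respect to $f_{\gamma}$, assuming user-supplied functions `f_gamma0` (the function at $\gamma_0$) and `df_dgamma_j` (its partial derivative with respect to $\gamma_j$); the names are hypothetical.

```python
import numpy as np

def function_sensitivity(theta_sample, f_gamma0, df_dgamma_j):
    """Estimate (7), its Monte Carlo standard error and the dimensionless measure s_j."""
    fvals = np.asarray([f_gamma0(th) for th in theta_sample])
    dvals = np.asarray([df_dgamma_j(th) for th in theta_sample])
    n = len(fvals)
    I_hat = fvals.mean()                                          # estimate of I_{gamma_0}
    dI_hat = dvals.mean()                                         # estimator (7)
    se = np.sqrt(((dvals - dI_hat) ** 2).sum() / (n * (n - 1)))   # Monte Carlo SE of (7)
    s_j = abs(dI_hat) / fvals.std(ddof=1)                         # dimensionless measure
    return I_hat, dI_hat, se, s_j
```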
As a particular case (see Example 2 in Section 4), the practical implementation of the proposed sensitivity measures in the context of Bayesian decision theory is discussed. Among the many fine reviews are, for example, Berger (1985) and French and Ríos Insua (2000). Bayesian decision theory and inference describe a decision problem by a set of possible actions $a \in A$, a set of states or parameters $\theta \in \Theta$, a prior distribution $\pi(\theta)$, a likelihood $l(\theta|x)$ for the observed data vector x, and a loss function $l(a, \theta)$. The actions are ranked by their expected loss. The optimal decision $a^*$ is the action that minimizes the posterior expected loss:

\[ a^* = \arg\min_{a \in A} L(a), \qquad L(a) = \int l(a, \theta)\, p(\theta|x)\, d\theta = \frac{\int l(a, \theta)\, l(\theta|x)\, \pi(\theta)\, d\theta}{\int l(\theta|x)\, \pi(\theta)\, d\theta}. \]

Practical implementation is hindered by the fact that $L(a)$, and hence the minimizer $a^*$, could be sensitive to the chosen prior $\pi(\cdot)$ and/or loss function $l(\cdot)$. A skeptical decision maker will require, in addition to the optimal solution, some description of the sensitivity of $a^*$ with respect to reasonable changes and uncertainties in the specification of the inputs. In this context, the local parametric sensitivity of $L_{\gamma}(a^*)$ can be investigated. Now, $f_{\gamma}(\theta) = l_{\gamma}(a^*, \theta)$ and the quantity of interest $I_{\gamma_0}$ is $L_{\gamma_0}(a^*)$. Note that the interest is focused on expected loss sensitivity instead of decision sensitivity (see Kadane and Srinivasan, 1994 for the distinction). If sensitivity with respect to the prior distribution is of interest, then $L_{\lambda}(a^*)$ can be investigated.

The next section shows how the proposed computationally low-cost sensitivity estimations and their errors can be easily calculated in practice.

4. Applications

Two illustrative examples are considered in this section: an application to a normal mixture model and a decision problem based on a first-order autoregressive model.

Example 1 (Normal mixture model). Bowmaker et al. (1985) analyzed data on the peak sensitivity wavelengths for individual microspectrophotometric records on a small set of monkeys' eyes. Part of the analysis involves fitting a mixture of two normal distributions with common variance, so that each observation $y_i$ is assumed to be drawn from one of two groups. Let $T_i = j$ ($j = 0, 1$) be the true group of the ith observation, where group j has a normal distribution with mean $\mu_j$ and precision $\tau = \sigma^{-2}$. It is assumed that an unknown fraction P of observations are in group 1 and $1 - P$ in group 0. Then, the model is given by
\[ y_i \sim \mathrm{Normal}\big( \mu_{T_i}, \tau \big), \qquad T_i \sim \mathrm{Bernoulli}(P). \]

Robert (1994) warns of a danger when using MCMC methods with this model: it may happen that at some iteration all the data are assigned to one component of the mixture, and it is then difficult to escape from this situation. In order to solve this problem, Robert (1994) suggests a re-parameterization. A simplified version is to assume

\[ \mu_1 = \mu_0 + \delta, \qquad \delta > 0, \]

where $\tau$, $\mu_0$, and $\delta$ have independent prior distributions. These prior distributions were chosen to carry weak information about the parameters, i.e., $\tau \sim \mathrm{Gamma}(0.001, 0.001)$, $\mu_0 \sim \mathrm{Normal}(0, 0.000001)$ and $\delta \sim \mathrm{Normal}_{(0,+\infty)}(0, 0.000001)$, where the notation indicates the normal distribution truncated to $(0, +\infty)$. The vector $(P, 1 - P)$ is assumed to follow a Dirichlet(1.1, 1.1) distribution.

The interest is focused on the posterior mean of P. This quantity gives the proportion of observations in each group after the Bayes update has been done. Local sensitivity with respect to the Dirichlet parameter $\lambda = (\lambda_1, \lambda_2)$ in the neighborhood of $\lambda_0 = (1.1, 1.1)$ is investigated. In this case, the quantity of interest is $E_{\lambda_0}[P|Y]$, where Y is the data vector containing 48 measurements. In order to simplify the notation, this quantity will be denoted by $E_{\lambda_0}$.
Table 1
Estimates of $\widehat{\partial_{\lambda_k} E_{\lambda_0}}$, $\widehat{\mathrm{SE}}(\widehat{\partial_{\lambda_k} E_{\lambda_0}})$, and $s_k$

k    $\widehat{\partial_{\lambda_k} E_{\lambda_0}}$    $\widehat{\mathrm{SE}}(\widehat{\partial_{\lambda_k} E_{\lambda_0}})$    $s_k$
1    1.21 × 10^{-2}     3.12 × 10^{-4}     1.45 × 10^{-1}
2    −1.79 × 10^{-2}    2.82 × 10^{-4}     2.15 × 10^{-1}
By using WinBUGS (Spiegelhalter et al., 2003), MCMC samples can be generated from the posterior distribution of interest. After convergence has been considered to be achieved, a sample of size $n = 10\,000$ is generated. The convergence diagnostics have been assessed by using CODA (see Best et al., 1997). By using the generated sample $p^{(1)}, \ldots, p^{(10\,000)}$, it is obtained $\widehat{E}_{\lambda_0} = 0.5994$, and its Monte Carlo standard error estimate is $\widehat{\mathrm{SE}}(\widehat{E}_{\lambda_0}) = 8.34 \times 10^{-3}$. The following expressions have been obtained:

\[ \frac{\partial_{\lambda_1} \pi_{\lambda}(p)}{\pi_{\lambda}(p)} = \psi(\lambda_1 + \lambda_2) - \psi(\lambda_1) + \log(p), \qquad \frac{\partial_{\lambda_2} \pi_{\lambda}(p)}{\pi_{\lambda}(p)} = \psi(\lambda_1 + \lambda_2) - \psi(\lambda_2) + \log(1 - p), \]

where $\psi(\cdot) = \Gamma'(\cdot)/\Gamma(\cdot)$ represents the digamma function. Then, the expressions to obtain the partial derivative estimates are

\[ \widehat{\partial_{\lambda_1} E_{\lambda_0}} = \frac{1}{10\,000} \sum_{i=1}^{10\,000} \Big( p^{(i)} - \widehat{E}_{\lambda_0} \Big) \Big( 0.968048 + \log p^{(i)} \Big), \]

\[ \widehat{\partial_{\lambda_2} E_{\lambda_0}} = \frac{1}{10\,000} \sum_{i=1}^{10\,000} \Big( p^{(i)} - \widehat{E}_{\lambda_0} \Big) \Big( 0.968048 + \log\big(1 - p^{(i)}\big) \Big). \]
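To make the computation concrete, here is a hedged Python sketch of these two estimates from stored draws of P (the array `p_draws` and the function name are hypothetical; the draws themselves would come from the MCMC run, e.g., exported from WinBUGS):

```python
import numpy as np
from scipy.special import digamma

def dirichlet_prior_sensitivity(p_draws, lam=(1.1, 1.1)):
    """Partial derivatives of E[P | Y] with respect to the Dirichlet parameters at lambda_0."""
    lam1, lam2 = lam
    E_hat = p_draws.mean()                                          # estimate of E_{lambda_0}
    c1 = digamma(lam1 + lam2) - digamma(lam1)                       # = 0.968048 at (1.1, 1.1)
    c2 = digamma(lam1 + lam2) - digamma(lam2)
    d1 = np.mean((p_draws - E_hat) * (c1 + np.log(p_draws)))        # derivative w.r.t. lambda_1
    d2 = np.mean((p_draws - E_hat) * (c2 + np.log(1.0 - p_draws)))  # derivative w.r.t. lambda_2
    return d1, d2
```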
Table 1 shows the estimates of the partial derivatives, their standard errors, and the dimensionless sensitivity measures. The partial derivative measures how much infinitesimal changes in the parameter $\lambda_1$ ($\lambda_2$) affect the posterior mean $E_{\lambda_0}$. Both $s_1$ and $s_2$ are less than 0.25, so, in accordance with the interpretation given in Section 3, the inference on the posterior mean can be considered robust in an infinitesimal neighborhood of $\lambda_0 = (1.1, 1.1)$.

Example 2 (Autoregressive model). Decision making in natural resource management combines models of the dynamics of an ecological system with an objective function, which values the outcomes of alternative management actions. The goal of the manager is to develop a decision rule that prescribes actions for each time or system state that are optimal with respect to the objective function. Dorazio and Johnson (2003) argued that Bayesian inference and decision theory provide a coherent theoretical framework for decision making in natural resource management. These authors solved a decision problem based on a first-order autoregressive model to provide an adequate habitat for waterfowl. Although the problem was greatly simplified by the authors for purposes of illustration, it includes the features that are common in this context: (a) a set of management actions, (b) a predictive model for the relevant behavior of the ecological system, and (c) an objective function valuing the outcomes of actions. The choice of this example is motivated by the fact that the specification of an objective function can often be a difficult task in practical applications of decision theory. Thus, a sensitivity analysis with respect to the parameters in the objective function is appropriate.

Assume that wildlife managers must develop a strategy to provide an adequate habitat for waterfowl on a large property during a brief overwintering period. This strategy will be based on a combination of water-level regulation with one or more types of physical manipulation of the vegetation. It is believed that a proportion of 50% open water
and 50% vegetation cover is nearly ideal for these waterfowl. The regulation of water levels on the property can be performed with reasonably good precision. However, there is considerable uncertainty about how to control the vegetation growth. Burning, cutting, or grazing of vegetation are possible management actions that can be applied to control growth.

Suppose that the property is divided into m non-overlapping plots of approximately equal size and shape. In order to modify the vegetation cover, different manipulations can be applied to these plots. Let x denote a design vector that specifies which of the q possible management actions (manipulations) is applied to a particular plot. Without loss of generality, a "centered" parameterization is used for x in order to describe the model, i.e., $x = (1, 0, \ldots, 0)$ represents the first action, $x = (0, 1, \ldots, 0)$ represents the second action, and so on. In the first year, a randomization procedure is used to decide which of the q possible actions is applied to each of the m plots. Then, $x_{i1}$ represents the design vector for the ith plot at time $t = 1$. The management actions for all the plots may be summarized in an $m \times q$ matrix $X_1 = (x_{11}, x_{21}, \ldots, x_{m1})'$. After implementing these actions, the vegetation cover in each plot is observed. Let $y_1 = (y_{11}, y_{21}, \ldots, y_{m1})'$ denote the responses in vegetation cover for the first year. Given these results, a procedure for selecting a new set of actions to be implemented in the second year is required.

Dorazio and Johnson (2003) assumed that the vegetation cover in an individual plot depends both on the current type of management and on past levels of vegetation observed in that plot. These dependencies are specified by means of a first-order autoregressive model defined by
\[ Y_{it} \mid x_{it}, \beta, \sigma^2, \rho, y_{i,t-1}, x_{i,t-1} \sim \begin{cases} \mathrm{N}\big( x_{it}' \beta,\ \sigma^2/(1 - \rho^2) \big) & \text{if } t = 1, \\[4pt] \mathrm{N}\big( x_{it}' \beta + \rho (y_{i,t-1} - x_{i,t-1}' \beta),\ \sigma^2 \big) & \text{if } t > 1, \end{cases} \]

where $Y_{it}$ is the random variable for vegetation cover in plot i in year t, $x_{it}$ represents the management action applied to plot i in year t, and $\beta$, $\sigma^2$, and $\rho$ are model parameters. Each element of $\beta$ corresponds to the mean vegetation cover associated with a distinct management action, and the parameter $\rho$ represents the correlation between vegetation responses observed in consecutive years.

In order to evaluate the consequences of future management actions, predictions of the vegetation cover in each plot given the past observations are required. Let $\tilde{y}_t = (\tilde{y}_{1t}, \tilde{y}_{2t}, \ldots, \tilde{y}_{mt})'$ denote an $m \times 1$ vector representing the plot-specific predictions of vegetation cover in year t. The posterior predictive distribution of $\tilde{y}_t$ is

\[ p\big( \tilde{y}_t \mid \tilde{X}_t, y_1, \ldots, y_{t-1}, X_1, \ldots, X_{t-1} \big) = \int f\big( \tilde{y}_t \mid \tilde{X}_t, \phi, y_{t-1}, X_{t-1} \big)\, p\big( \phi \mid y_1, \ldots, y_{t-1}, X_1, \ldots, X_{t-1} \big)\, d\phi, \]
where $\phi = (\beta, \sigma^2, \rho)$. The solution of the problem for year $\tau$ is to find the management action $\tilde{X}^*_{\tau}$ that minimizes the expected loss given by

\[ L\big( \tilde{X}_{\tau} \mid y_1, y_2, \ldots, y_{\tau-1}, X_1, X_2, \ldots, X_{\tau-1} \big) = \int l\big( \tilde{y}_{\tau}, c \big)\, p\big( \tilde{y}_{\tau} \mid \tilde{X}_{\tau}, y_1, y_2, \ldots, y_{\tau-1}, X_1, X_2, \ldots, X_{\tau-1} \big)\, d\tilde{y}_{\tau}, \]

where $l(\tilde{y}_{\tau}, c)$ denotes a function representing the loss obtained when the predictions of vegetation cover differ from the target value ($c = 0.50$) in year $\tau$.

A hypothetical data set used in Dorazio and Johnson (2003) is considered to illustrate the use of the proposed local sensitivity analysis method in decision problems. Suppose that the two types of management actions are 1 ($(1, 0)$ = burning) and 2 ($(0, 1)$ = cutting). Suppose further that the property is divided into 4 similar parts. Then, there are $2^4$ possible management actions. The vegetation responses corresponding to the first two years are represented in Table 2.
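As an illustration only, here is a Python sketch of how predictive draws of $\tilde{y}_3$ could be generated for a candidate set of actions, given posterior draws of $(\beta, \sigma^2, \rho)$ obtained from the MCMC run; it follows the $t > 1$ case of the autoregressive model above, and all array and function names are hypothetical.

```python
import numpy as np

def predict_y3(beta_draws, sigma2_draws, rho_draws, X3, X2, y2, rng=None):
    """Draw y3_tilde from the posterior predictive under a candidate design X3.

    beta_draws: (n_draws, q) posterior draws of beta
    sigma2_draws, rho_draws: (n_draws,) posterior draws of sigma^2 and rho
    X3, X2: (m, q) design matrices for years 3 and 2; y2: (m,) responses in year 2
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mean of the AR(1) model for t > 1: x_it' beta + rho * (y_{i,t-1} - x_{i,t-1}' beta)
    mean = X3 @ beta_draws.T + rho_draws * (y2[:, None] - X2 @ beta_draws.T)  # (m, n_draws)
    y3 = rng.normal(mean, np.sqrt(sigma2_draws))   # sd broadcasts over the draws
    return y3.T                                    # (n_draws, m)
```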
The loss function in the third year is $l_k(\tilde{y}_3, 0.50) = \sum_{i=1}^{m} |\tilde{y}_{i3} - 0.5|^k$. The impact of small changes in the parameter k (when k varies in a neighborhood of $k_0 = 1$) on the expected predictive loss is investigated. Note that for $k = 1$, an
Table 2
Actions and responses for the first two years

Plot    X1    y1      X2    y2
1       1     0.15    1     0.25
2       2     0.55    2     0.50
3       2     0.85    2     0.75
4       1     0.45    1     0.50
Table 3
Top 4 actions

$\tilde{X}_3$     $\widehat{L}_{k_0}$    $\widehat{\mathrm{SE}}(\widehat{L}_{k_0})$    $\widehat{\partial_k L_{k_0}}$    $\widehat{\mathrm{SE}}(\widehat{\partial_k L_{k_0}})$    $\widehat{s}_1$
(1, 2, 1, 1)      1.0009    0.0048    −1.1799    0.0062    2.4671
(1, 2, 2, 1)      1.0064    0.0045    −1.1899    0.0059    2.6393
(2, 2, 1, 1)      1.0071    0.0050    −1.1909    0.0064    2.3725
(2, 2, 2, 1)      1.0091    0.0048    −1.1897    0.0062    2.4829
absolute-error loss function is obtained. To complete the Bayesian analysis, the following prior distributions are used: Uniform(0, 1) for each component of $\beta$, a conjugate Inverse-Gamma(0.1, 0.1) for $\sigma^2$, and a Uniform(−1, 1) for $\rho$. These distributions specify no strong prior opinions about the parameters of the model. The optimal action for year 3 is obtained by

\[ \tilde{X}^*_3 = \arg\min_{\tilde{X}_3} L_k\big( \tilde{X}_3 \mid y_1, y_2, X_1, X_2 \big), \qquad L_k\big( \tilde{X}_3 \mid y_1, y_2, X_1, X_2 \big) = \int l_k\big( \tilde{y}_3, c \big)\, p\big( \tilde{y}_3 \mid \tilde{X}_3, y_1, y_2, X_1, X_2 \big)\, d\tilde{y}_3. \]

Samples from the posterior predictive distribution $p(\tilde{y}_3 \mid \tilde{X}_3, y_1, y_2, X_1, X_2)$ can be generated by using MCMC methods. Now, $L_{k_0}(\tilde{X}_3 \mid y_1, y_2, X_1, X_2)$ is the quantity of interest, and it will be denoted by $L_{k_0}$. As the number of actions is finite, the action with minimum expected loss can be obtained. After convergence has been considered to be achieved, a sample of size $n = 10\,000$ is generated by using MCMC with WinBUGS. As in the previous example, the convergence is assessed by using CODA. $\widehat{L}_{k_0}$ and its Monte Carlo standard error estimate $\widehat{\mathrm{SE}}(\widehat{L}_{k_0})$ for the top four management actions are obtained. Also, the derivative of the expected predictive loss, $\widehat{\partial_k L_{k_0}}$, and its Monte Carlo standard error $\widehat{\mathrm{SE}}(\widehat{\partial_k L_{k_0}})$ are estimated. All these estimates are shown in Table 3.
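A hedged Python sketch of how $\widehat{L}_{k_0}$, $\widehat{\partial_k L_{k_0}}$, their standard errors, and the dimensionless measure could be computed from predictive draws for one candidate action (array names are hypothetical; the draws would come, for instance, from the predictive-sampling sketch given earlier):

```python
import numpy as np

def expected_loss_sensitivity(y3_pred, k0=1.0, c=0.5):
    """Expected predictive loss under l_k, its derivative w.r.t. k at k0, and s.

    y3_pred: (n_draws, m) array of predictive draws of vegetation cover
             under one candidate action.
    """
    dev = np.abs(y3_pred - c)
    loss = (dev ** k0).sum(axis=1)               # l_k(y3_tilde, c) for each draw
    # d/dk |y - c|^k = |y - c|^k * log|y - c|; the limit as dev -> 0 is 0.
    dterm = np.zeros_like(dev)
    mask = dev > 0
    dterm[mask] = dev[mask] ** k0 * np.log(dev[mask])
    dloss = dterm.sum(axis=1)
    n = len(loss)
    L_hat, dL_hat = loss.mean(), dloss.mean()
    se_L = np.sqrt(((loss - L_hat) ** 2).sum() / (n * (n - 1)))
    se_dL = np.sqrt(((dloss - dL_hat) ** 2).sum() / (n * (n - 1)))
    s_hat = abs(dL_hat) / loss.std(ddof=1)       # dimensionless sensitivity measure
    return L_hat, se_L, dL_hat, se_dL, s_hat
```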
All four estimates of the dimensionless sensitivity measure are greater than 0.25. Thus, the expected losses can be considered very sensitive to k in a neighborhood of $k_0 = 1$, so the inferences can be considered locally sensitive in expected loss for the four actions.
5. Discussion

It is widely recognized that formal sensitivity analysis is a difficult task in Bayesian modeling. Several local sensitivity measures based on functional derivatives have been proposed in the recent literature; the Fréchet derivative and its norm is one of them. In the particular case of parametric classes, the Fréchet derivative is the gradient vector. The method proposed in this paper uses the components of the gradient vector, i.e., the partial derivatives, as sensitivity measures. These measures can be interpreted as the rate of change in the posterior quantity of interest produced by infinitesimal perturbations in the parameters of the prior distribution or loss function. A dimensionless sensitivity measure has also been proposed.
In Bayesian decision theory and inference, the proposed local approach can be very useful when applicable. The proposed method is based on the following assumption: the derivative and integral operators must be interchangeable. Thus, continuity and integrability of several functions involved in the analysis and of their derivatives are needed. However, when the method is applicable, it constitutes a general technique for complex models that need to be solved by MCMC methods. The main advantage is that the MCMC simulations can be re-used to estimate the sensitivity measures and their errors, avoiding the need for further sampling. This computational saving makes the method fast in contrast to informal sensitivity methods based on re-running the chain for several values of the parameters.

The use of derivatives is widely studied in the sensitivity literature. A derivative can be considered as a measure quantifying how much an infinitesimal change in the parameter (of the prior distribution or of the function f) affects the posterior quantity. The main drawback of this measure is that it is an absolute one. It would be convenient to describe a relative local sensitivity measure varying between two fixed values. However, this is a difficult task that several authors seek to solve. Instead, a dimensionless measure based on the idea of Ruggeri and Wasserman (1995) has been proposed in this paper. This measure cannot be bounded between two values, but it allows a better interpretation, as shown in Section 3.

In particular, the use of Fréchet derivatives in local sensitivity analysis is extended, and local parametric sensitivity is presented as a particular case of functional sensitivity. The proposed sensitivity measure can be extended to vector functions. In this case, the Fréchet derivative for parametric classes is a matrix instead of a vector as in the case treated here. This measure could be used to study which parameters are more sensitive. It could also be used to assess which components of the vector function are more sensitive. Finally, it is worth noting that the proposed method is easy to apply in practice, as has been shown with two illustrative examples.
Acknowledgements

The authors thank a referee and an associate editor for comments and suggestions which have substantially improved the readability and the content of this paper. Discussions with David Ríos are also gratefully acknowledged. This work has been partially supported by Junta de Extremadura, Spain (Project IPR00A075).
Appendix A. Applicability conditions

In order to estimate the sensitivity measures, some conditions need to be satisfied to ensure the derivative-integral interchange. The following results are presented.

Theorem 1. Let $\pi_{\lambda}$ and $\partial_{\lambda_j} \pi_{\lambda}$ be continuous functions with respect to $\lambda$ in a neighborhood $N(\lambda_0)$ of $\lambda_0$, where $\lambda_0$ is a fixed value interior to $\Lambda$. Suppose that $\pi_{\lambda_0}(\theta) \neq 0$, $\forall \theta \in \Theta$, and that the functions $f(\theta)\, l(\theta|x)\, \pi_{\lambda}(\theta)$, $l(\theta|x)\, \pi_{\lambda}(\theta)$, $f(\theta)\, l(\theta|x)\, \partial_{\lambda_j} \pi_{\lambda}(\theta)$, and $l(\theta|x)\, \partial_{\lambda_j} \pi_{\lambda}(\theta)$ are integrable with respect to $\theta$, $\forall \lambda \in N(\lambda_0)$. Then, for each j,

\[ \partial_{\lambda_j} I_{\lambda_0} = E_{p_{\lambda_0}}\!\left[ \big( f - I_{\lambda_0} \big) \frac{\partial_{\lambda_j} \pi_{\lambda_0}}{\pi_{\lambda_0}} \right], \]

where $p_{\lambda_0}$ is the posterior distribution.

Proof. The conditions assumed in the theorem allow the derivatives and the integrals to be interchanged. This is the key step in the proof. The subsequent steps are basic manipulations addressed to obtain the integral of a function with respect to
the posterior distribution.

\[
\begin{aligned}
\partial_{\lambda_j} I_{\lambda}
&= \partial_{\lambda_j} \left[ \frac{\int f(\theta)\, l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta} \right] \\[4pt]
&= \frac{\int \partial_{\lambda_j} \big( f(\theta)\, l(\theta|x)\, \pi_{\lambda}(\theta) \big)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta}
 - \frac{\int f(\theta)\, l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta \int \partial_{\lambda_j} \big( l(\theta|x)\, \pi_{\lambda}(\theta) \big)\, d\theta}{\big( \int l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta \big)^2} \\[4pt]
&= \frac{\int f(\theta)\, l(\theta|x)\, \partial_{\lambda_j} \pi_{\lambda}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta}
 - \frac{\int f(\theta)\, l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta \int l(\theta|x)\, \partial_{\lambda_j} \pi_{\lambda}(\theta)\, d\theta}{\big( \int l(\theta|x)\, \pi_{\lambda}(\theta)\, d\theta \big)^2}.
\end{aligned}
\]

The partial derivative is evaluated at $\lambda_0$, and some basic manipulations are made:

\[
\begin{aligned}
\partial_{\lambda_j} I_{\lambda_0}
&= \frac{\int f(\theta)\, \dfrac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}
 - \frac{\int f(\theta)\, l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta} \cdot
   \frac{\int \dfrac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta}{\int l(\theta|x)\, \pi_{\lambda_0}(\theta)\, d\theta} \\[4pt]
&= \int f(\theta)\, \frac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, p_{\lambda_0}(\theta|x)\, d\theta
 - I_{\lambda_0} \int \frac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, p_{\lambda_0}(\theta|x)\, d\theta \\[4pt]
&= \int \big( f(\theta) - I_{\lambda_0} \big)\, \frac{\partial_{\lambda_j} \pi_{\lambda_0}(\theta)}{\pi_{\lambda_0}(\theta)}\, p_{\lambda_0}(\theta|x)\, d\theta
 = E_{p_{\lambda_0}}\!\left[ \big( f - I_{\lambda_0} \big) \frac{\partial_{\lambda_j} \pi_{\lambda_0}}{\pi_{\lambda_0}} \right]. \qquad \square
\end{aligned}
\]
Theorem 2. Let $f_{\gamma}$ and $\partial_{\gamma_j} f_{\gamma}$ be continuous functions with respect to $\gamma$ in a neighborhood $N(\gamma_0)$ of $\gamma_0$, where $\gamma_0$ is a fixed value interior to $\Psi$. Suppose that the functions $f_{\gamma}(\theta)\, p(\theta|x)$ and $\partial_{\gamma_j} f_{\gamma}(\theta)\, p(\theta|x)$ are integrable with respect to $\theta$, $\forall \gamma \in N(\gamma_0)$. Then, for each j,

\[ \partial_{\gamma_j} I_{\gamma_0} = E_{p}\big[ \partial_{\gamma_j} f_{\gamma_0} \big], \]

where p is the posterior distribution for $\theta$.
Proof. Again, the key step is the derivative-integral interchange:

\[ \partial_{\gamma_j} I_{\gamma} = \partial_{\gamma_j} \int f_{\gamma}(\theta)\, p(\theta|x)\, d\theta = \int \partial_{\gamma_j} f_{\gamma}(\theta)\, p(\theta|x)\, d\theta = E_{p}\big[ \partial_{\gamma_j} f_{\gamma} \big]. \]

Evaluating at $\gamma_0$, the result is obtained. $\square$
References

Berger, J.O., 1985. Statistical Decision Theory and Bayesian Analysis. Springer, Berlin.
Berger, J.O., 1994. An overview of robust Bayesian analysis. Test 3, 5–124.
Berger, J.O., Betro, B., Moreno, E., Pericchi, L.R., Ruggeri, F., Salinetti, G., Wasserman, L. (Eds.), 1996. Bayesian Robustness. IMS Lecture Notes Monograph Series, vol. 29. IMS.
Berger, J.O., Ríos Insua, D., Ruggeri, F., 2000. Robust Bayesian analysis. In: Ríos Insua, D., Ruggeri, F. (Eds.), Robust Bayesian Analysis. Springer, Berlin, pp. 1–32.
Best, N.G., Cowles, M.K., Vines, S.K., 1997. CODA: convergence diagnosis and output analysis software for Gibbs sampling output. Technical Report, MRC Biostatistics Unit, Cambridge.
Bowmaker, J.K., Jacobs, G.H., Spiegelhalter, D.J., Mollon, J.D., 1985. Two types of trichromatic squirrel monkey share pigment in the red-green spectral region. Vision Res. 25 (12), 1937–1946.
Brooks, S.P., 1998. Markov chain Monte Carlo method and its application. The Statistician 47, 69–100.
Cuevas, A., Sanz, P., 1988. On differentiability properties of Bayes operators. In: Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M. (Eds.), Bayesian Statistics, vol. 3. Oxford University Press, Oxford, pp. 569–577.
Dey, D., Ghosh, S., Lou, K., 1996. On local sensitivity measures in Bayesian analysis (with discussion). In: Berger, J.O., Betro, B., Moreno, E., Pericchi, L.R., Ruggeri, F., Salinetti, G., Wasserman, L. (Eds.), Bayesian Robustness. IMS Lecture Notes Monograph Series, vol. 29. IMS, pp. 21–39.
Diaconis, P., Freedman, D., 1986. On the consistency of Bayes estimates. Ann. Statist. 14, 1–67.
DiCiccio, T.J., Kass, R.E., Raftery, A., Wasserman, L., 1997. Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 (439), 903–914.
Dorazio, R.M., Johnson, F.A., 2003. Bayesian inference and decision theory—a framework for decision making in natural resource management. Ecol. Appl. 13, 556–563.
French, S., Ríos Insua, D., 2000. Statistical Decision Theory. Arnold, Paris.
Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1998. Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Gustafson, P., Srinivasan, C., Wasserman, L.A., 1996. Local sensitivity analysis (with discussion). In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, vol. 5. Oxford University Press, Oxford, pp. 197–210.
Halekoh, U., Vach, W., 2003. A Bayesian approach to seriation problems in archaeology. Comput. Statist. Data Anal. 45, 651–673.
Hall, C.B., Ying, J., Kuo, L., Lipton, R.B., 2003. Bayesian and profile likelihood change point methods for modeling cognitive function over time. Comput. Statist. Data Anal. 42 (1–2), 91–109.
Kadane, J., Srinivasan, C., 1994. Comment on "An overview of robust Bayesian analysis" by Berger. Test 3, 93–95.
Martín, J., Ríos Insua, D., Ruggeri, F., 1998. Issues in Bayesian loss robustness. Sankhya Ser. A 60, 405–417.
Martín, J., Ríos Insua, D., Ruggeri, F., 2003. Joint sensitivity in Bayesian decision theory. Test (to appear).
Richardson, S., Green, P.J., 1997. On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59 (4), 731–792.
Ríos Insua, D., Ruggeri, F., 2000. Robust Bayesian Analysis. Springer, Berlin.
Robert, C.P., 1994. The Bayesian Choice. Springer, Berlin.
Ruggeri, F., Wasserman, L., 1993. Infinitesimal sensitivity of posterior distributions. Canad. J. Statist. 21, 195–203.
Ruggeri, F., Wasserman, L., 1995. Density based classes of priors: infinitesimal properties and approximations. J. Statist. Plann. Inference 46, 311–324.
Sivaganesan, S., 1993. Robust Bayesian diagnostics. J. Statist. Plann. Inference 35, 171–188.
Sivaganesan, S., 1996. Asymptotics of some local and global robustness measures (with discussion). In: Berger, J.O., Betro, B., Moreno, E., Pericchi, L.R., Ruggeri, F., Salinetti, G., Wasserman, L. (Eds.), Bayesian Robustness. IMS Lecture Notes Monograph Series, vol. 29. IMS, pp. 195–209.
Sivaganesan, S., 2000. Global and local robustness approaches: uses and limitations. In: Ríos Insua, D., Ruggeri, F. (Eds.), Robust Bayesian Analysis. Springer, Berlin, pp. 89–108.
Spiegelhalter, D., Thomas, A., Best, N., Lunn, D., 2003. WinBUGS User Manual, Version 1.4. MRC Biostatistics Unit, Cambridge.
Srinivasan, C., Truszczynska, H., 1990. On the ranges of posterior quantities. Technical Report 294, Department of Statistics.
Stephens, M., 1997. Bayesian methods for mixtures of normal distributions. Ph.D. Thesis, University of Oxford.
Turányi, T., Rabitz, H., 2003. Local methods. In: Sensitivity Analysis. Wiley, New York, pp. 81–99.