Journal of Statistical Planning and Inference 138 (2008) 1258–1270
Robust likelihood functions in Bayesian inference

Luca Greco (a,*), Walter Racugno (b), Laura Ventura (c)

(a) PE.ME.IS. Department, University of Sannio, P.zza Arechi II, 82100 Benevento, Italy
(b) Department of Mathematics, University of Cagliari, Italy
(c) Department of Statistics, University of Padua, Italy
Received 28 February 2007; received in revised form 26 April 2007; accepted 2 May 2007. Available online 22 May 2007.
Abstract

In order to deal with mild deviations from the assumed parametric model, we propose a procedure for accounting for model uncertainty in the Bayesian framework. In particular, in the derivation of posterior distributions, we discuss the use of robust pseudo-likelihoods, which offer the advantage of preventing the effects caused by model misspecifications, i.e. when the underlying distribution lies in a neighborhood of the assumed model. The influence functions of posterior summaries, such as the posterior mean, are investigated, as well as the asymptotic properties of robust posterior distributions. Although the use of a pseudo-likelihood cannot be considered orthodox from the Bayesian perspective, it is shown, also through some illustrative examples, how a robust pseudo-likelihood with the same asymptotic properties as a genuine likelihood can be useful in the inferential process in order to prevent the effects caused by model misspecifications.

Keywords: Estimating equation; Influence function; Kullback–Leibler divergence; Model misspecification; Pseudo-likelihood; Posterior distribution; Robustness
1. Introduction

In the frequentist framework, it is well known that mild deviations from the assumed model can give rise to nonnegligible changes in inferential results. The concept of robustness related to likelihood functions has been discussed from various points of view in the literature; see, for instance, Heritier and Ronchetti (1994), Tsou and Royall (1995), Lavine (1995) and Markatou et al. (1998). This issue is also addressed in the approach based on the influence function (IF) discussed in Huber (1981) and Hampel et al. (1986), in which unbiased estimating equations are claimed to supply effective tools for robust inference. In this context, robust pseudo-likelihoods can be derived from bounded estimating functions and can be used as genuine likelihoods. In particular, in this paper we focus our attention on the quasi- and the empirical likelihoods (see e.g. Owen, 2001; Adimari and Ventura, 2002a, b), which on the one hand keep the standard first-order properties, and on the other hand take into account model inadequacies. Indeed, they provide inferential procedures which are still reliable and reasonably efficient when the underlying distribution lies in a neighborhood of the assumed model.
Obviously, inferential results in the Bayesian approach also depend on the modeling assumptions and on the observed sample, but robust Bayesian statistics is mainly concerned with global and local sensitivity to prior distributions; see, for instance, Berger (1994), Rios Insua and Ruggeri (2000) and references therein. Indeed, the problem of robustness with respect to sampling model misspecifications has been considered in only a few contributions (Lavine, 1991; Sivaganesan, 1993; Dey et al., 1996; Gustafson, 1996; Shyamalkumar, 2000). Other exceptions are discussed in Pericchi and Perez (1994), Pericchi and Sansò (1995), Dawid (1973), O'Hagan (1979, 1988, 1990), Verdinelli and Wasserman (1991), Passarin (2004) and Andrade and O'Hagan (2006), who address the topics of outliers, robust posterior summaries and model embedding. The solution of embedding the assumed model in a larger structure has the cost of eliciting a prior distribution for the extra parameters introduced in the analysis. Moreover, the statistical procedures derived under the supermodel are not necessarily robust in a broad sense, the supermodel being itself too thin in the space of all distributions.

In this paper we focus on an alternative approach: we explore the use of the quasi- and the empirical likelihoods mentioned above. Since these robust pseudo-likelihoods share most of the properties of the genuine likelihood, they can be used as a basis for Bayesian inference to obtain a "robust posterior distribution". Although this way of proceeding cannot, in principle, be considered orthodox from the Bayesian perspective, it is of interest to evaluate whether the use of a robust likelihood may be useful in the inferential process. Related work on the use of alternative likelihoods in Bayesian inference can be found in Efron (1993), Bertolino and Racugno (1992, 1994), Raftery et al. (1996) and Cabras et al. (2006). Papers more specifically related to the validation of a pseudo-posterior distribution based on an alternative likelihood function are Monahan and Boos (1992), Severini (1999), Lazar (2003), Racugno et al. (2005), Pace et al. (2006) and Schennach (2005).

We discuss two properties of robust posterior distributions, which derive from the corresponding properties of the robust pseudo-likelihoods involved. First, we show that robust posterior distributions are asymptotically normal, as genuine posterior distributions are, and that the asymptotic Kullback–Leibler divergence between a robust posterior distribution and the corresponding genuine posterior distribution is related to the required level of robustness. Moreover, to assess the stability of robust posterior inference, we show that the IFs of summaries of the robust posterior distributions are bounded, contrary to those based on the genuine posterior distribution. A bounded IF is a desirable local stability property of an inferential procedure, since it implies that, for each sample size, the effect of a small contamination in a neighborhood of the assumed model cannot become arbitrarily large. The need for bounded-influence statistical procedures for estimation, testing and prediction has been stressed by many authors (see, for instance, Hampel et al., 1986; Peracchi, 1990, 1991; Markatou and Ronchetti, 1997), and the aim of this paper is to contribute to the current literature in the Bayesian direction.
A motivating example: Let us consider the well-known Darwin's paired data (Spiegelhalter, 1985) on the heights of self- and cross-fertilized plants. This example is also discussed, in the Bayesian literature, by Pericchi and Perez (1994) and Shyamalkumar (2000), where it is pointed out that there is not enough evidence to distinguish between the normal model and a distribution with heavy tails, such as the Cauchy and the Student's t distributions, the latter being slightly preferred. Fig. 1(a) shows the normal probability plot of these data. Interest focuses on the location parameter of the distribution of the difference of the heights. Modeling likelihood uncertainty by the normal, Cauchy and Student's $t_3$ distributions and assuming the same default noninformative prior distribution for the location parameter, we obtain the posterior distributions plotted in Fig. 1(b). Notice that posterior inferential results can depend strongly on the assumed model. Though simple, this example illustrates in a practical situation the problem we deal with. In order to handle model uncertainty, we account for imprecision with respect to the likelihood by considering robust pseudo-likelihoods in Bayesian practice. Their main appeal is that they allow us to assume the simpler sampling model and still deal with mild departures from it. The proposed approach embeds model uncertainty directly in the likelihood function and, in view of this, leads to posterior inferential conclusions which prevent the effects caused by model misspecifications, i.e. when the underlying distribution lies in a neighborhood of the simpler assumed model.

The paper is organized as follows. Section 2 presents a short overview of M-estimators, the IF and the quasi- and empirical likelihood functions. Section 3 discusses robust posterior distributions, focusing on their asymptotic properties and on the computation of the IF of posterior summaries. Section 4 illustrates, through explanatory examples and Monte Carlo studies, the use of robust pseudo-likelihoods in the Bayesian framework. Finally, some concluding remarks are given in Section 5.
Fig. 1. Darwin’s data: normal probability plot and comparison of posterior distributions based on the normal (−), Cauchy (...) and Student’s t3 (−−) distributions.
2. M-estimators, IF and quasi- and empirical likelihoods

Let $y = (y_1, \ldots, y_n)$ be a random sample of size $n$, having independent and identically distributed components, according to a distribution function $F = F(y; \theta)$, with $\theta \in \Theta \subseteq \mathbb{R}^d$, $d \geq 1$. Let $\Psi = \Psi(y; \theta) = \sum_{i=1}^n \psi(y_i; \theta)$ be an unbiased estimating function for $\theta$ based on $y$, such that $E_\theta(\psi(Y; \theta)) = 0$ for every $\theta$. For some purposes, it is enough that the latter expectation is $O(1)$ as $n \to \infty$. A general M-estimator is defined as the root $\hat{\theta}$ of the estimating equation

$$\Psi = 0. \qquad (1)$$
It is worth mentioning that the class of M-estimators is rich and includes a variety of well-known estimators. For example, if $\psi$ is the likelihood score function, we obtain the maximum likelihood estimator (MLE). Under broad conditions assumed throughout this paper, an M-estimator, like the MLE within a parametric model, is consistent and approximately normal with mean $\theta$ and variance

$$V(\theta) = B(\theta)^{-1}\,\Omega(\theta)\,(B(\theta)^{-1})^T, \qquad (2)$$
where $B(\theta) = -E_\theta(\partial \psi/\partial \theta^T)$ and $\Omega(\theta) = E_\theta(\psi\, \psi^T)$. The same ingredients appearing in (2) also appear in the well-known robust sandwich estimate of the asymptotic variance of the MLE (Huber, 1967), with $\psi = \partial \ell(\theta)/\partial \theta^T$; see Royall (1986) for detailed examples.

One way of assessing the robustness of a statistic $T$ which can be represented (at least asymptotically) as a function of the empirical distribution function $\hat F_n$, that is, $T(\hat F_n)$, is by means of the IF, given by

$$\mathrm{IF}(x; T, F) = \lim_{\epsilon \to 0} \frac{T(F_\epsilon) - T(F)}{\epsilon} = \left. \frac{\partial T(F_\epsilon)}{\partial \epsilon} \right|_{\epsilon = 0}, \qquad (3)$$

where $F_\epsilon = (1-\epsilon)F + \epsilon \Delta_x$, with $\Delta_x$ the point mass at $x$. The IF measures the local stability of a statistical procedure, since it describes the effect of a small contamination ($\epsilon \Delta_x$) at the point $x$, standardized by the mass $\epsilon$ of the contamination. As a consequence, the linear approximation $\epsilon\, \mathrm{IF}(x; T, F)$ describes the asymptotic bias of $T$ under single-point contaminations of the assumed model distribution $F$. The supremum of the IF, i.e. the gross-error sensitivity, measures the worst influence on $T$, and a desirable robustness property for a statistical procedure is that the gross-error sensitivity be finite, i.e. that the IF be bounded (B-robustness). Note that the IF of the MLE is proportional to its score function and, in general, MLEs have unbounded IF. On the contrary, if $\psi(x; \theta)$ is bounded then the corresponding M-estimator $\hat\theta$ is B-robust (Hampel et al., 1986). Indeed, its IF is given by

$$\mathrm{IF}(x; \hat\theta, F) = B(\theta)^{-1}\,\psi(x; \theta). \qquad (4)$$
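To fix ideas, the following minimal Python sketch (ours, not code from the paper; function names are illustrative) computes a Huber-type location M-estimator by solving (1) numerically, and evaluates its sandwich variance (2) at the standard normal model; the cutoff k = 1.345 is the usual choice giving roughly 95% efficiency at the normal.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def huber_psi(y, theta, k=1.345):
    """Huber estimating function: the clipped residual, bounded in y."""
    return np.clip(y - theta, -k, k)

def m_estimate(y, k=1.345):
    """Root of the estimating equation sum_i psi(y_i; theta) = 0, cf. (1)."""
    return brentq(lambda t: huber_psi(y, t, k).sum(), y.min(), y.max())

def sandwich_variance(k=1.345):
    """Scalar case of (2) at the N(theta, 1) model: B^{-1} Omega B^{-1}."""
    B = norm.cdf(k) - norm.cdf(-k)                                  # E[psi'] = P(|Z| <= k)
    Omega = 2 * (norm.cdf(k) - 0.5 - k * norm.pdf(k) + k**2 * norm.cdf(-k))  # E[psi^2]
    return Omega / B**2

y = np.random.default_rng(0).normal(0.0, 1.0, size=50)
print(m_estimate(y), sandwich_variance())   # variance about 1.053, i.e. efficiency about 0.95
```

By (4), the empirical influence of a point x on this estimator is proportional to huber_psi(x, theta), hence bounded, while for the sample mean it grows linearly in x.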
Finally, notice that the IF can also be used to evaluate the asymptotic covariance matrix of $T$, since $V(\theta) = E_\theta(\mathrm{IF}(x; T, F)\, \mathrm{IF}(x; T, F)^T)$.

Robust pseudo-likelihoods may be derived from the estimating equation (1) that defines an M-estimator, and their robustness properties are inherited from the corresponding M-estimator (Heritier and Ronchetti, 1994). A first possible robust pseudo-likelihood obtained from the estimating function $\Psi$ is the quasi-likelihood (see McCullagh, 1991; Adimari and Ventura, 2002a, b), defined as

$$L_Q(\theta) = \exp(\ell_Q(\theta)) = \exp\left( \int_k^\theta A(t)\, \Psi(y; t)\, dt \right),$$
where the matrix $A(\theta)$ is such that $A(\theta)^T = \Omega(\theta)^{-1} B(\theta)$ and $k$ is an arbitrary constant. When $d = 1$, $L_Q(\theta)$ is usually easy to derive, and when $d > 1$ it exists if and only if the matrix $B(\theta)$ is symmetric. When $L_Q(\theta)$ exists, it may be used, in a standard way, for setting quasi-likelihood tests and confidence regions, since the matrix $A(\theta)$ yields a quasi-likelihood ratio statistic $W_Q(\theta) = 2\{\ell_Q(\hat\theta) - \ell_Q(\theta)\}$ with a standard asymptotic $\chi^2_d$ distribution. Moreover, the robustness properties of $\hat\theta$ do not depend on $A(\theta)$, because this matrix does not change its IF.

A different robust pseudo-likelihood for $\theta$ can be obtained by computing the empirical likelihood from $\Psi$. In fact, the empirical likelihood $L_E(\theta)$ (Owen, 2001) is a nonparametric tool which can be derived in several contexts, and in particular for parameters which are determined by estimating equations (Monti and Ronchetti, 1993; Adimari, 1997; Tsao and Zhou, 2001). Although in its natural setting the empirical likelihood does not require the assumption of a parametric model $F$, when based on unbiased estimating equations that define M-estimators related to $F$, it may be used as a parametric tool. Its main appeal is that only unbiasedness of $\psi$ is required to obtain a standard asymptotic $\chi^2_d$ distribution for the empirical likelihood ratio statistic. On the other hand, due to its nonparametric nature, $L_E(\theta)$ fails when the sample size is very small (see also Adimari and Ventura, 2002b). The empirical likelihood ratio statistic is given by $W_E(\theta) = -2 \log R_E(\theta)$, with $R_E(\theta) = \max_{p_i} \prod_{i=1}^n n p_i$, where the $p_i$-weights satisfy $p_i \geq 0$, $\sum_{i=1}^n p_i = 1$ and $\sum_{i=1}^n \psi(y_i; \theta)\, p_i = 0$. A Lagrangian argument leads to

$$W_E(\theta) = 2 \sum_{i=1}^n \log\{1 + \lambda^T \psi(y_i; \theta)\}$$

if $0$ is inside the convex hull of $\psi(y_1; \theta), \ldots, \psi(y_n; \theta)$; otherwise, it is adequate to set $W_E(\theta) = +\infty$. The Lagrange multiplier $\lambda = \lambda(\theta)$ satisfies $\sum_{i=1}^n \psi(y_i; \theta)/(1 + \lambda^T \psi(y_i; \theta)) = 0$. Finally, note that, under standard regularity conditions, it can be shown that $W_Q(\theta)$ and $W_E(\theta)$ are equivalent to the first term of their Taylor expansions.
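As an illustration of the Lagrangian computation (a sketch of ours, not part of the original paper), the following Python function evaluates $W_E(\theta)$ for a scalar Huber-type $\psi$, bracketing $\lambda$ on the interval where all weights $p_i = 1/[n(1 + \lambda \psi_i)]$ remain positive.

```python
import numpy as np
from scipy.optimize import brentq

def huber_psi(y, theta, k=1.345):
    """Huber-type estimating function, bounded in y."""
    return np.clip(y - theta, -k, k)

def W_E(theta, y, k=1.345):
    """Empirical likelihood ratio statistic W_E(theta) = 2 sum_i log(1 + lambda*psi_i)."""
    u = huber_psi(y, theta, k)
    if u.min() >= 0 or u.max() <= 0:          # 0 outside the convex hull of the psi_i
        return np.inf
    eps = 1e-12
    # positivity of all 1 + lambda*u_i restricts lambda to this open interval
    lo, hi = -1.0 / u.max() + eps, -1.0 / u.min() - eps
    lagr = lambda lam: np.sum(u / (1.0 + lam * u))   # Lagrangian score in lambda
    lam = brentq(lagr, lo, hi)
    return 2.0 * np.sum(np.log1p(lam * u))

y = np.random.default_rng(0).normal(0.0, 1.0, size=30)
print(W_E(0.0, y), W_E(1.0, y))   # approximately chi-squared with d = 1 df at the true theta
```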
3. Robust posterior distributions

By introducing a general pseudo-likelihood function $L_{PS}(\theta)$ in the Bayesian analysis, a pseudo-posterior distribution can be defined as $\pi_{PS}(\theta; y) \propto \pi(\theta)\, L_{PS}(\theta)$, where $\pi(\theta)$ denotes a default prior distribution. The properties of robust posterior distributions of the form

$$\pi_Q(\theta; y) \propto \pi(\theta)\, L_Q(\theta) \quad \text{or} \quad \pi_E(\theta; y) \propto \pi(\theta)\, L_E(\theta) \qquad (5)$$
are inherited from the pseudo-likelihood used. In general, the properties of a pseudo-likelihood $L_{PS}(\theta)$ are related to the way it is derived, and two different situations for the construction of $L_{PS}(\theta)$ need to be distinguished. In the first case, the pseudo-likelihood is based on a statistical model defined as a reduction of the original model, so that the resulting $L_{PS}(\theta)$ is a genuine likelihood function. On the contrary, when such a reduced model does not exist, all the properties have to be investigated specifically: this is the case of the quasi- and empirical likelihoods. However, it has been shown (see, for instance, Owen, 2001; Pace and Salvan, 1997, Chapter 4; Adimari and Ventura, 2002a; Lazar, 2003) that both $L_Q(\theta)$ and $L_E(\theta)$ have all the standard first-order properties which make them resemble a true likelihood and thus useful as a basis for Bayesian inference. In particular, the pseudo-MLE $\hat\theta$, solution of (1), is consistent, asymptotically normal and asymptotically efficient; the pseudo-score function has mean zero and is asymptotically normally distributed; and the pseudo-likelihood ratio statistic has an asymptotic chi-squared distribution.
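In practice, $\pi_{PS}(\theta; y)$ is known only up to proportionality. For scalar $\theta$, a minimal sketch of ours (not from the paper; the function names are illustrative) normalizes it numerically on a grid, for any pseudo-loglikelihood:

```python
import numpy as np

def pseudo_posterior(logpseudolik, logprior, grid):
    """Normalize pi_PS(theta; y) proportional to pi(theta) * L_PS(theta) on a theta grid."""
    logpost = np.array([logpseudolik(t) + logprior(t) for t in grid])
    dens = np.exp(logpost - logpost.max())        # subtract the max to avoid overflow
    dens /= np.sum(dens) * (grid[1] - grid[0])    # Riemann-sum normalization
    return dens
```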
3.1. Asymptotic properties of robust posterior distributions

The study of the asymptotic properties of a pseudo-posterior distribution associated with (1) follows essentially the route used for the proper posterior distribution, i.e. the special case where $\psi$ is the score function (see e.g. Bernardo and Smith, 2000, Section 5.3). Under regularity conditions, the pseudo-posterior distribution can be written as

$$\pi_{PS}(\theta; y) \propto \exp(\log \pi(\theta) + \ell_{PS}(\theta)),$$

with $\ell_{PS}(\theta) = \log L_{PS}(\theta)$. Expanding the two logarithms in a Taylor series about their respective maxima, i.e. the prior mode $\theta_0$ and $\hat\theta$, respectively, we obtain

$$\log \pi(\theta) = \log \pi(\theta_0) - \tfrac{1}{2}(\theta - \theta_0)^T H_0^{-1} (\theta - \theta_0) + R_0,$$
$$\ell_{PS}(\theta) = \ell_{PS}(\hat\theta) - \tfrac{1}{2}(\theta - \hat\theta)^T H(\hat\theta)^{-1} (\theta - \hat\theta) + R_n,$$

where

$$H_0^{-1} = -\left.\frac{\partial^2 \log \pi(\theta)}{\partial \theta\, \partial \theta^T}\right|_{\theta = \theta_0}, \qquad H(\hat\theta)^{-1} = -\left.\frac{\partial^2 \ell_{PS}(\theta)}{\partial \theta\, \partial \theta^T}\right|_{\theta = \hat\theta},$$

and $R_0$ and $R_n$ are remainder terms, which can be assumed small for large $n$. It follows, ignoring terms not depending on $\theta$, that, for large $n$,

$$\pi_{PS}(\theta; y) \propto \exp\left(-\tfrac{1}{2}(\theta - \theta_0)^T H_0^{-1}(\theta - \theta_0) - \tfrac{1}{2}(\theta - \hat\theta)^T H(\hat\theta)^{-1}(\theta - \hat\theta)\right) \propto \exp\left(-\tfrac{1}{2}(\theta - \theta_m)^T H_m^{-1}(\theta - \theta_m)\right) \qquad (6)$$
with $H_m^{-1} = H_0^{-1} + H(\hat\theta)^{-1}$ and $\theta_m = H_m(H_0^{-1}\theta_0 + H(\hat\theta)^{-1}\hat\theta)$. Development (6) suggests that robust pseudo-posterior distributions of form (5) are asymptotically normal with mean $\theta_m$, which is a weighted average of a prior estimate and a robust observation-based estimate, and precision $H_m^{-1}$, which is the sum of the prior precision matrix and the pseudo-information matrix. Notice that, for large $n$, $\theta_m/\hat\theta \to 1$ and the precision provided by the data tends to be large compared with the prior precision, which becomes negligible; hence $\theta_m$ and $H_m^{-1}$ are essentially indistinguishable from $\hat\theta$ and $H(\hat\theta)^{-1}$ (see e.g. Bernardo and Smith, 2000, Section 5.3). Moreover, since $H(\theta)^{-1} = V^{-1}(\theta) + o_p(1)$, we might approximate (5) by a multivariate normal distribution with mean $\hat\theta$ and variance $V(\hat\theta)$.

The asymptotic approximations of the proper and the pseudo-posterior distributions allow us to compute easily the asymptotic Kullback–Leibler divergence between $\pi(\theta; y)$ and $\pi_{PS}(\theta; y)$. Consider for simplicity $d = 1$; the following results can be extended to the more general situation $d > 1$. Using $\pi(\theta; y) \sim N(\hat\theta_{MLE}, i(\theta)^{-1})$, with $i(\theta)$ the information matrix and $\hat\theta_{MLE}$ the MLE of $\theta$, and using $\pi_{PS}(\theta; y) \sim N(\hat\theta, V(\theta))$, it can be shown that

$$KL = KL(\pi(\theta; y); \pi_{PS}(\theta; y)) = \int \pi(\theta; y)\, \log \frac{\pi(\theta; y)}{\pi_{PS}(\theta; y)}\, d\theta = \frac{1}{2}\left[ -\log e - 1 + e + e\, \frac{(\hat\theta_{MLE} - \hat\theta)^2}{i(\theta)^{-1}} \right], \qquad (7)$$
where $e = i(\theta)^{-1}/V(\theta) = [i(\theta)V(\theta)]^{-1}$ is the asymptotic efficiency of the estimator $\hat\theta$ (see e.g. Hampel et al., 1986, Chapter 2). As is well known in classical robustness theory, to build a B-robust estimator we have to put a bound on its IF. Doing so, however, entails an efficiency loss at the assumed model, and therefore it is necessary to establish a trade-off between robustness and efficiency. Typically, one fixes an efficiency loss of, say, 5% under the assumed parametric model and considers the consequent bound on the IF. This means that, for example, the robust estimator $\hat\theta$ can be built to achieve 95% efficiency at the model, i.e. $e = 0.95$, in which case the first term $[-\log e - 1 + e]$ in (7) reduces to 0.00129. Furthermore, it is interesting to note that the last term in (7) is related to the extension of the Mahalanobis distance between two normal populations with proportional variances (see e.g.
Paranjpe and Gore, 1994). This term is of order $O(1)$ and its value depends on the specific assumed model and on the chosen robust estimator (see Section 4). In view of this, the asymptotic Kullback–Leibler divergence between $\pi(\theta; y)$ and $\pi_{PS}(\theta; y)$ turns out to be determined by the required level of robustness.
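The bracketed efficiency term in (7) is easy to check numerically; the snippet below (our illustration, not from the paper) evaluates the closed-form KL divergence between two normals and reproduces the value 0.00129 quoted for $e = 0.95$.

```python
import numpy as np

def kl_normal(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) in closed form."""
    return 0.5 * (np.log(v2 / v1) + v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0)

# eq. (7): take v1 = i(theta)^{-1} and v2 = V(theta) = v1/e, so that v1/v2 = e
e = 0.95
print(-np.log(e) - 1 + e)                   # 0.001293..., the first term in (7)
print(2 * kl_normal(0.0, 1.0, 0.0, 1 / e))  # same value via the KL formula, equal means
```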
3.2. IF of posterior summaries

In this section we investigate the IF of the posterior moments. To this end, we first derive the IF of the posterior mode, which is easy to compute, and then use this result to obtain the IF of posterior moments. By considering $\log \pi_{PS}(\theta; y)$, the mode $\tilde\theta$ of the pseudo-posterior distribution is the solution of the estimating equation

$$\frac{\partial \log \pi_{PS}(\theta; y)}{\partial \theta} = \rho(\theta) + \frac{\partial \ell_{PS}(\theta)}{\partial \theta} = 0, \qquad (8)$$
where $\rho(\theta) = \partial \log \pi(\theta)/\partial \theta$, so that (8) is of form (1). Eq. (8) can be written as $\rho(\theta) + A(\theta)\Psi(y; \theta) = 0$ or, equivalently, as

$$\sum_{i=1}^n A(\theta)\,[g(\theta) + \psi(y_i; \theta)] = 0. \qquad (9)$$
From (9), for the quasi-loglikelihood we have $g(\theta) = A(\theta)^{-1}\rho(\theta)/n$, whereas for the empirical likelihood we have $g(\theta) = \rho(\theta)/n$, since $A(\theta) = I_d$, where $I_d$ is the identity matrix. Finally, for the genuine likelihood we have $A(\theta) = I_d$, $g(\theta) = \rho(\theta)/n$ and $\psi(y_i; \theta) = \partial \ell(\theta; y_i)/\partial \theta$, for $i = 1, \ldots, n$. It is easy to compute the IF of the posterior mode $\tilde\theta$ since, in view of (4), it is simply proportional to the estimating function in (9). In particular, for the posterior mode $\tilde\theta$ the IF at a point $x$ is given by

$$\mathrm{IF}(x; \tilde\theta, F) \propto A(\theta)\,[g(\theta) + \psi(x; \theta)]. \qquad (10)$$
As a consequence, the IF of a posterior mode is bounded if and only if the function $\psi(\cdot; \theta)$ is bounded. This is the case when a robust likelihood function is used in (5), while in general it is not when the genuine likelihood is considered.

Let us now consider moments of posterior distributions of form (5), given by $T = \int h(\theta)\,\exp(\log \pi_{PS}(\theta; y))\, d\theta$. For instance, if $h(\theta) = \theta$, $T$ is the posterior mean $E_{PS}(\theta; y)$. To compute $T$, we use a simple approximation based on Laplace's method (Tierney and Kadane, 1986), which provides $\tilde T = h(\tilde\theta)(1 + R(\tilde\theta))$, where $\tilde\theta$ is the mode of $\pi_{PS}(\theta; y)$. The remainder term $R(\tilde\theta)$, of order $O(n^{-1})$, can be expressed as a function of derivatives of $h(\theta)$ and $\pi_{PS}(\theta; y)$ evaluated at $\tilde\theta$. The IF of a posterior moment can then be computed simply by using the approximation $\tilde T$. In particular, denoting by $\tilde\theta_\epsilon$ the estimator computed with respect to the contaminated model $(1-\epsilon)F + \epsilon\Delta_x$, we obtain

$$\mathrm{IF}(x; T, F) = \left.\frac{\partial \tilde T_\epsilon}{\partial \epsilon}\right|_{\epsilon=0} = \left.\frac{\partial[h(\tilde\theta_\epsilon)(1 + R(\tilde\theta_\epsilon))]}{\partial \epsilon}\right|_{\epsilon=0} = \left[\frac{\partial h(\tilde\theta_\epsilon)}{\partial \tilde\theta_\epsilon}(1 + R(\tilde\theta_\epsilon)) + h(\tilde\theta_\epsilon)\frac{\partial R(\tilde\theta_\epsilon)}{\partial \tilde\theta_\epsilon}\right]\left.\frac{\partial \tilde\theta_\epsilon}{\partial \epsilon}\right|_{\epsilon=0} \propto \mathrm{IF}(x; \tilde\theta, F). \qquad (11)$$
In view of (10) and (11), a posterior moment has bounded IF if and only if the posterior mode has bounded IF. These results show that the summaries of the robust posterior distributions are stable with respect to small contaminations, and thus the proposed procedure prevents the effects caused by model misspecifications, outliers or influential observations. On the contrary, for the genuine posterior distribution, small deviations from the assumed model can have strong effects on posterior summaries. Finally, it is interesting to note that in this context the robustness of the inferential conclusions
does not depend on the particular posterior summary considered, but on the likelihood function used in the posterior distribution.

4. Examples and Monte Carlo studies

In this section we consider some examples in order to investigate the finite-sample behavior of the inferential procedures based on robust posterior distributions (5). Interest is focused on location and scale models. Although the examples discussed here are purely illustrative, the construction of (5) can be carried out without particular effort also in more complex situations and in the presence of nuisance parameters. In Examples 4.2 and 4.3 we perform Monte Carlo experiments whose objective is twofold: first, to evaluate the discrepancy between the robust posterior distributions and the corresponding genuine posterior distribution in terms of the Kullback–Leibler divergence; second, to assess the accuracy and stability of robust posterior summaries both when the model is correctly specified and under arbitrary departures from the assumed model. We consider $\epsilon$-neighborhood contamination models of the form $F_\epsilon = (1-\epsilon)F + \epsilon H$, where $H(\cdot)$ denotes the contaminating distribution and $\epsilon$ is the contamination percentage. We set $\epsilon = 0.10$ and performed 1000 trials for each simulation.

4.1. A motivating example (continued)

Let us consider the performance of (5) for Darwin's data. The quantity of interest is the location parameter $\theta$ of the distribution of the difference of the heights, and the quasi- and empirical likelihoods can be derived from a classical Huber-type estimating function (see Hampel et al., 1986, Chapter 2), given by

$$\psi(y; \theta) = (y - \theta)\,\min(1, k/|y - \theta|), \qquad (12)$$
where $\theta$ is a location parameter and $k > 0$ is a fixed constant, which can be interpreted as the regulator between robustness and efficiency: for a lower $k$ one gains robustness but loses efficiency, and vice versa. Typically, $k$ is chosen to achieve 95% efficiency at the assumed model, i.e. $e = 0.95$. Table 1 gives the values of the posterior mean and of the 0.95 highest posterior density interval (HPD) for the location parameter derived from the proper posterior distributions based on the normal, Cauchy and Student's $t_3$ models, and also from the quasi- and empirical posterior distributions based on the simpler normal model. All the posterior distributions are displayed in Fig. 2, together with the plots of minus twice the normalized loglikelihoods. In this case, the quasi-loglikelihood function for $\theta$ is given by

$$\ell_Q(\theta) = A\sum_{i=1}^n y_i\theta\, I_{(y_i - k \leq \theta \leq y_i + k)} - \frac{A\theta^2}{2}\sum_{i=1}^n I_{(y_i - k \leq \theta \leq y_i + k)} + A\sum_{i=1}^n (k\theta + c_{1i})\, I_{(\theta < y_i - k)} - A\sum_{i=1}^n (k\theta + c_{2i})\, I_{(\theta > y_i + k)}, \qquad (13)$$
where $I_{(\cdot)}$ denotes the indicator function, $c_{1i} = (y_i - k)^2/2$ and $c_{2i} = -(y_i + k)^2/2$, $i = 1, \ldots, n$, are such that $\ell_Q(\theta)$ is continuous, and $A = [\Phi(k) - \Phi(-k)]/\{2[k^2\Phi(-k) - k\phi(k) + \Phi(k) - 1/2]\}$, with $\Phi(\cdot)$ and $\phi(\cdot)$ the standard normal distribution and density functions, respectively. On the contrary, a closed form expression for the empirical likelihood is not available.
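A direct transcription of (13) into code (our illustration; the function names are ours) makes the piecewise structure explicit: each branch is the antiderivative in $\theta$ of the Huber $\psi$ in (12), glued continuously through $c_{1i}$ and $c_{2i}$.

```python
import numpy as np
from scipy.stats import norm

def huber_A(k=1.345):
    """Constant A = Omega^{-1} B for the Huber psi under the standard normal model."""
    B = norm.cdf(k) - norm.cdf(-k)
    Omega = 2 * (k**2 * norm.cdf(-k) - k * norm.pdf(k) + norm.cdf(k) - 0.5)
    return B / Omega

def quasi_loglik(theta, y, k=1.345):
    """Closed-form quasi-loglikelihood (13): piecewise antiderivative of psi in theta."""
    y = np.asarray(y)
    mid = (y - k <= theta) & (theta <= y + k)   # branch where psi = y - theta
    lo = theta < y - k                          # branch where psi = +k
    c1 = (y - k) ** 2 / 2
    c2 = -(y + k) ** 2 / 2
    terms = np.where(mid, y * theta - theta**2 / 2,
                     np.where(lo, k * theta + c1, -(k * theta + c2)))
    return huber_A(k) * terms.sum()
```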
Table 1
Darwin's data: posterior mean and 0.95 HPD for the location parameter

        Normal          t3              Cauchy          π_Q(θ; y)       π_E(θ; y)
Mean    20.58           26.27           26.27           25.74           22.69
HPD     (7.93, 33.23)   (11.16, 41.15)  (10.86, 41.72)  (12.09, 38.97)  (0.23, 42.04)
Fig. 2. Darwin’s data: comparison of posterior distributions and of minus twice normalized loglikelihood functions.
From Table 1 and Fig. 2 it can be noticed that the posterior means of $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ display their robustness properties, since they are very similar to the means of the proper posterior distributions based on the $t_3$ and Cauchy models. On the contrary, the posterior mean of $\pi(\theta; y)$ based on the normal model is shifted to the left, since it is affected by the two smallest values of the data set. Indeed, in this case, the IF of the posterior mean of $\pi(\theta; y)$ is proportional to $\mathrm{IF}(x; T, F) \propto (x - \theta)$ (see also Example 4.2). Moreover, it is important to notice that the HPDs associated with the proper posterior distributions based on the $t_3$ and Cauchy models are larger than the one associated with $\pi_Q(\theta; y)$, which avoids using a heavy-tailed sampling model. This does not happen when using $\pi_E(\theta; y)$, which shows a marked asymmetry to the left. The reason is that a characteristic of $L_E(\theta)$ is that it takes into account the configuration of the data and the sample size.

4.2. Example: location model

Let us consider $n$ observations drawn from a normal population $N(\theta, 1)$, assuming as prior the conjugate distribution
$\pi(\theta) \sim N(\mu_0, \sigma^2)$, with $\mu_0$ and $\sigma^2$ fixed. The posterior mean and the posterior mode are both equal to $E_\pi(\theta; y) = (n\bar y \sigma^2 + \mu_0)/(1 + n\sigma^2)$, which clearly has unbounded IF, since the latter is proportional to $\mathrm{IF}(x; T, F) \propto (x - \theta)$. If we write $\bar y = (n-1)\bar y_{(n-1)}/n + y_n/n$, where $\bar y_{(n-1)}$ is the sample mean computed on $(n-1)$ observations, then

$$E_\pi(\theta; y) = \frac{(n-1)\bar y_{(n-1)}\sigma^2 + \mu_0}{1 + n\sigma^2} + \frac{y_n \sigma^2}{1 + n\sigma^2}. \qquad (14)$$
By allowing $y_n$ to vary, the posterior mean (14) is largely affected for any assumed values of $\mu_0$ and $\sigma^2$, since one observation may have unbounded influence on $E_\pi(\theta; y)$. On the contrary, if we consider $\pi_Q(\theta; y)$ or $\pi_E(\theta; y)$ derived from a classical Huber-type estimating function of form (12), then the IF of the posterior mean is proportional to $\mathrm{IF}(x; T, F) \propto \psi(x; \theta)$, which is clearly bounded.

Sensitivity analysis: A plot of (14) as a function of $y_n$ yields the empirical IF of the posterior mean. This function is also known as the sensitivity curve (Hampel et al., 1986, Chapter 2; see also Passarin, 2004, for an application in the Bayesian setting), which in general enables us to understand whether model assumptions can become inadequate. To illustrate the use of the sensitivity curve to compare $\pi(\theta; y)$, $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$, let us consider a sample of $n = 10$ observations drawn from an $N(\theta, 1)$, with $\theta = 0$, and the prior $\pi(\theta) \sim N(0.5, 1)$. The largest observation is allowed to vary in the range $[-5, 5]$. For each sample, $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$, based on (12) with $k = 1.345$, and $\pi(\theta; y)$ have been computed. Attention is focused on twice the natural logarithm of the Bayes factors to choose between the simple hypotheses $H_0: \theta = 0.5$ versus $H_1: \theta = 0$ (Fig. 3). Clearly, one observation may have unbounded influence on the classical Bayes factor $B_{01}$ based on the genuine likelihood: evidence against $H_1$ increases as one observation becomes larger, whereas, moving to the left, the data support $H_1$ with growing strength. On the contrary, the use of $L_Q(\theta)$ or of $L_E(\theta)$ in the evaluation of the Bayes factors, on the one hand, leads to rejecting $H_0$ in each case, as twice the natural logarithm of the pseudo-Bayes factors is less than 2, and, on the other hand, provides us with a lower and an upper bound when measuring the evidence of $H_1$ with respect to $H_0$.
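A sensitivity curve of this kind can be traced numerically; the sketch below is ours (not from the paper), reusing quasi_loglik from the code after (13): it lets one observation vary and compares the conjugate posterior mean (14) with the quasi-posterior mean obtained by grid normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
y9 = rng.normal(0.0, 1.0, size=9)            # n - 1 = 9 fixed observations
mu0, sigma2 = 0.5, 1.0                       # prior N(0.5, 1)
grid = np.linspace(-6, 6, 2001)

def mean_conjugate(y):
    n = len(y)
    return (n * y.mean() * sigma2 + mu0) / (1 + n * sigma2)   # eq. (14)

def mean_quasi(y):
    # quasi-posterior mean via grid normalization of pi(theta) * exp(l_Q(theta))
    logpost = np.array([quasi_loglik(t, y) for t in grid]) - (grid - mu0)**2 / (2 * sigma2)
    w = np.exp(logpost - logpost.max())
    return np.sum(grid * w) / np.sum(w)

for yn in (-5.0, 0.0, 5.0):                  # sensitivity: move one observation
    y = np.append(y9, yn)
    print(yn, mean_conjugate(y), mean_quasi(y))
```

As expected, mean_conjugate drifts linearly with the moving observation, while mean_quasi stays essentially flat.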
Fig. 3. Sensitivity curves related to twice the natural logarithm of the Bayes factors evaluated from $\pi(\theta; y)$ (−), $\pi_Q(\theta; y)$ (∗) and $\pi_E(\theta; y)$ (◦).
Table 2
Location model: mean (and sd) of KL(π(θ; y); π_Q(θ; y)), KL(π(θ; y); π_E(θ; y)) and (7)

n     KL(π(θ; y); π_Q(θ; y))   KL(π(θ; y); π_E(θ; y))   (7)
5     0.02 (0.04)              0.17 (0.15)              0.02 (0.06)
10    0.02 (0.04)              0.12 (0.21)              0.02 (0.05)
50    0.02 (0.03)              0.02 (0.03)              0.02 (0.04)

The sampling model is N(0, 1); the prior distribution is N(0.5, 1).
Kullback–Leibler divergence: The aim is to evaluate the discrepancy between the pseudo-posterior distributions $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ and the proper posterior distribution $\pi(\theta; y)$, in terms of the finite-sample expression of the Kullback–Leibler divergence and of its asymptotic expression (7). To this end, a Monte Carlo study has been performed with $n = 5, 10, 50$. The results in Table 2 show the closeness of $\pi(\theta; y)$ and the robust posterior distributions $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$, since the estimated values of the Kullback–Leibler divergences are quite small. Note that the results in Table 2 also confirm the asymptotic theoretical result of formula (7), since the estimated Kullback–Leibler divergences converge to the asymptotic value. Here we report the results obtained for a single elicited prior distribution, but the same behavior has been observed for different prior distributions. It is worth noting that the distribution of $KL(\pi(\theta; y); \pi_E(\theta; y))$ is non-negligibly skewed for $n = 5, 10$, and several extreme values occurred; indeed, the empirical posterior takes into account the configuration of the data, and asymmetries arise in $\pi_E(\theta; y)$ (see also Lazar, 2003).

Stability of robust posterior summaries: The objective is to compare the finite-sample accuracy of the inferential procedures based on posterior distributions of form (5) with those based on the corresponding genuine posterior distribution. The entries in Table 3 show that $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ can be used successfully to obtain robust (in a frequentist sense) posterior summaries both when the model is correctly specified and when the true sampling model lies in an $\epsilon$-neighborhood of the assumed model. Indeed, the expected value of the genuine posterior distribution $\pi(\theta; y)$ behaves unsatisfactorily under contamination in terms of bias and variability, while the posterior summaries of $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ are more stable and less affected by departures of the data from the specified model. Their robustness properties are inherited from the estimating function used to obtain the involved pseudo-likelihood function.
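The contamination experiment of Table 3 can be mimicked in a few lines; the following sketch is ours (reusing mean_conjugate and mean_quasi from the sensitivity sketch above): it draws $\epsilon$-contaminated samples with $H = N(10, 1)$ and records both posterior means.

```python
import numpy as np

rng = np.random.default_rng(2)
eps, n, reps = 0.10, 10, 1000
out = np.empty((reps, 2))
for r in range(reps):
    bad = rng.random(n) < eps                          # epsilon-contamination flags
    y = np.where(bad, rng.normal(10.0, 1.0, n), rng.normal(0.0, 1.0, n))
    out[r] = mean_conjugate(y), mean_quasi(y)
print(out.mean(axis=0), out.std(axis=0))               # cf. the H = N(10, 1) rows of Table 3
```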
Table 3
Location model: mean (and sd) of posterior means under different scenarios

n     E π(θ; y)     E π_Q(θ; y)   E π_E(θ; y)   Mode π_E(θ; y)

True model
5     0.08 (0.37)   0.08 (0.37)   0.06 (0.40)   0.08 (0.39)
10    0.05 (0.29)   0.05 (0.29)   0.06 (0.29)   0.05 (0.30)
50    0.02 (0.13)   0.01 (0.13)   0.02 (0.13)   0.01 (0.13)

H = N(0, 10²)
5     0.06 (0.95)   0.07 (0.42)   0.08 (0.46)   0.06 (0.42)
10    0.04 (0.97)   0.05 (0.34)   0.12 (0.43)   0.07 (0.30)
50    0.03 (0.46)   0.01 (0.16)   0.09 (0.23)   0.02 (0.16)

H = N(10, 1)
5     0.94 (1.52)   0.26 (0.46)   0.36 (0.49)   0.28 (0.43)
10    0.96 (0.29)   0.21 (0.31)   0.59 (0.18)   0.24 (0.27)
50    0.99 (0.13)   0.19 (0.14)   0.35 (0.13)   0.20 (0.14)

The sampling model is N(0, 1); the prior distribution is N(0.5, 1).
Validity of Bayesian inference: Finally, we discuss the validity of (5) using a criterion for evaluating whether or not an alternative likelihood is proper for Bayesian inference (Monahan and Boos, 1992). These authors introduce a definition of validity based on the coverage properties of posterior sets. In practice, they compute the statistic $H = \int_{-\infty}^{\theta} \pi_{PS}(t; y)\, dt$, which corresponds to posterior coverage set functions of the form $(-\infty, t_\alpha]$, where $t_\alpha$ is the $\alpha$th percentile of a pseudo-posterior distribution. The pseudo-posterior $\pi_{PS}(\theta; y)$ is valid by coverage if $H$ is uniformly distributed in $(0, 1)$. Validity of Bayesian inference for the empirical likelihood was assessed in Lazar (2003); for the quasi-likelihood it is investigated here. In particular, paralleling Lazar (2003) for the empirical likelihood in a location model, we generated values of the mean $\theta$ from two normal priors, i.e. $N(0, 0.5^2)$ and $N(0, 5^2)$. Then, observations $(y_1, \ldots, y_n)$ were generated from an $N(\theta, 1)$, for $n = 5, 10, 20, 50$. For each $n$, this procedure was repeated 1000 times, and we calculated the statistic $H$ in each case, using the quasi-likelihood function. The uniform probability plots for $n = 5, 10, 20, 50$ and the two normal priors $N(0, 0.5^2)$ and $N(0, 5^2)$ are given in Fig. 4. In these cases the p-values of the Kolmogorov–Smirnov tests for uniformity are always larger than 0.5 for each value of $n$. We conclude that, contrary to the empirical likelihood, which requires $n \geq 50$, the quasi-likelihood with a conjugate prior gives valid Bayesian inference for each value of the sample size $n$.
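A minimal sketch of this coverage check (our reconstruction of the procedure, reusing quasi_loglik from the code after (13), with the $N(0, 0.5^2)$ prior): draw $\theta$ from the prior, data given $\theta$, and record the quasi-posterior CDF at the true $\theta$; approximate uniformity of $H$ indicates validity.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-8, 8, 4001)
n, reps, tau = 10, 1000, 0.5                 # prior N(0, tau^2) with tau = 0.5
H = np.empty(reps)
for r in range(reps):
    theta0 = rng.normal(0.0, tau)            # theta drawn from the prior
    y = rng.normal(theta0, 1.0, size=n)
    logpost = np.array([quasi_loglik(t, y) for t in grid]) - grid**2 / (2 * tau**2)
    dens = np.exp(logpost - logpost.max())
    cdf = np.cumsum(dens); cdf /= cdf[-1]
    H[r] = np.interp(theta0, grid, cdf)      # H = pseudo-posterior CDF at the true theta
# H should be approximately Uniform(0, 1) if the quasi-posterior is valid by coverage
```

Fig. 4. Uniform probability plots to check the validity of the quasi-likelihood function.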
4.3. Example: scale model

Consider a sample of $n$ observations drawn from an exponential model with $F = 1 - \exp(-\theta y)$, assumed to be the sampling model. Let us focus on the optimal Hampel estimator (see Hampel et al., 1986, Chapter 2), based on the estimating equation

$$\psi(y; \theta) = \max\{-b, \min(b,\; a - \theta y)\}, \qquad (15)$$
for appropriate constants $a$ and $b$. From (15) the quasi- and the empirical likelihoods are very easy to derive; in particular, a closed form expression for $L_Q(\theta)$ is discussed in Adimari and Ventura (2002b).
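For instance (our sketch, not code from the paper; it assumes the reconstruction of (15) above), the constant $a$ can be fixed by Fisher consistency: since $\theta Y \sim \mathrm{Exp}(1)$ under the model, $a$ must solve $E[\max\{-b, \min(b, a - Z)\}] = 0$ with $Z \sim \mathrm{Exp}(1)$, and the M-estimate then solves (1) in $\theta$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def psi(y, theta, a, b):
    """Hampel-type estimating function (15): clip(a - theta*y, -b, b)."""
    return np.clip(a - theta * y, -b, b)

def consistency_a(b):
    """Choose a so that E[psi] = 0 under the model, i.e. E[clip(a - Z, -b, b)] = 0, Z ~ Exp(1)."""
    mean_psi = lambda a: quad(lambda z: np.clip(a - z, -b, b) * np.exp(-z), 0, np.inf)[0]
    return brentq(mean_psi, 1e-6, 10.0)

def m_estimate(y, a, b):
    """Root in theta of the estimating equation sum_i psi(y_i; theta) = 0."""
    return brentq(lambda theta: np.sum(psi(y, theta, a, b)), 1e-6, 100.0)

b = 1.5
a = consistency_a(b)
rng = np.random.default_rng(2)
y = rng.exponential(1.0, size=50)
print(a, m_estimate(y, a, b), 1 / y.mean())   # robust fit vs. the MLE 1/ybar
```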
Assume that the prior density of the parameter $\theta$ is the Jeffreys prior $\pi(\theta) \propto \theta^{-1}$, the usual noninformative prior for a scale parameter. In this case, $\pi(\theta; y)$ is a gamma distribution with shape parameter $n$ and scale parameter $s = n\bar y$. Thus, the posterior mean is given by $E_\pi(\theta; y) = 1/\bar y$, which clearly has unbounded IF. The same result also holds for the posterior mode $\tilde\theta = (n-1)/(n\bar y)$. On the contrary, if we consider $\pi_Q(\theta; y)$ or $\pi_E(\theta; y)$ derived from (15), then the IF of the robust posterior mean or of the robust posterior mode is proportional to $\mathrm{IF}(x; T, F) \propto \psi(x; \theta)$, which is clearly bounded.

Kullback–Leibler divergence: The aim is to evaluate the discrepancy between $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ and the proper posterior distribution $\pi(\theta; y)$. Data are generated from an exponential model with mean 1 (Exp(1)). The results in Table 4 show the closeness of $\pi(\theta; y)$ and the robust posterior distributions, and confirm the asymptotic theoretical result of formula (7). In this case, the divergences are on average larger than for the location model, and so is their variability; this is due to a more marked asymmetry in their empirical distribution. Moreover, although $\pi_Q(\theta; y)$ is close to $\pi(\theta; y)$, $\pi_E(\theta; y)$ seems preferable, since it presents smaller values of the finite-sample Kullback–Leibler divergence.

Stability of robust posterior summaries: To assess the accuracy and stability of robust posterior summaries, we compare the results obtained under the true model Exp(1) and also when the true sampling model lies in an $\epsilon$-neighborhood of the assumed model. As in Example 4.2, from Table 5 we can note that both $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ can be used
Table 4
Scale model: mean (and sd) of KL(π(θ; y); π_Q(θ; y)), KL(π(θ; y); π_E(θ; y)) and (7)

n     KL(π(θ; y); π_Q(θ; y))   KL(π(θ; y); π_E(θ; y))   (7)
10    0.12 (0.13)              0.08 (0.19)              0.06 (0.07)
20    0.11 (0.12)              0.08 (0.08)              0.06 (0.07)
50    0.09 (0.12)              0.07 (0.07)              0.05 (0.06)

The sampling model is an exponential model with θ = 1; the prior distribution is the Jeffreys prior.
Table 5
Scale model: mean (and sd) of posterior means under different scenarios

      True model                                   H = Weibull(1.5, 10)
n     E π(θ; y)    E π_Q(θ; y)   E π_E(θ; y)       E π(θ; y)    E π_Q(θ; y)   E π_E(θ; y)
5     0.96 (0.24)  1.03 (0.43)   1.11 (0.79)       0.78 (0.43)  0.92 (0.48)   1.15 (0.61)
10    1.05 (0.26)  1.05 (0.39)   1.15 (0.75)       0.72 (0.40)  0.87 (0.39)   1.19 (0.67)
50    1.02 (0.15)  1.01 (0.17)   1.06 (0.16)       0.78 (0.13)  0.94 (0.16)   1.13 (0.17)

The sampling model is an exponential model with θ = 1; the prior distribution is the Jeffreys prior.
successfully to construct robust (in a frequentist sense) posterior summaries for the parameter of interest, since they provide posterior summaries that are not affected by the contamination, unlike the genuine posterior summaries.

5. Final remarks

The quasi-likelihood and the empirical likelihood represent two approaches for constructing robust posterior distributions which take into account sampling model misspecifications. The examples and simulation results discussed in the previous section show that both $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ can be used successfully to make robust Bayesian inference. Generally, our experience, based on the examples discussed in Section 4 and other examples not reported here, indicates that, contrary to $\pi_Q(\theta; y)$, $\pi_E(\theta; y)$ is not computable for very small values of $n$, and for moderate sample sizes it appears to always have heavier tails. For large values of $n$, $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ are equivalent, and in this case the choice between $L_Q(\theta)$ and $L_E(\theta)$ may depend on the problems related to their computation and on the assumptions made on the model $F$.

The examples presented in the previous section are very simple and concern models indexed by scalar parameters. However, the robust pseudo-likelihoods can also be computed in more complex situations, such as generalized linear models, and can deal with nuisance parameters: their profile versions are available and have the same robustness properties. Finally, the sensitivity of $\pi_Q(\theta; y)$ and $\pi_E(\theta; y)$ with respect to prior specifications requires more investigation. In particular, it would be interesting to investigate the relationship between Bayesian robustness and likelihood robustness, in order to understand how these two requirements can counterbalance or be linked. A possible starting point could be expression (6).

References

Adimari, G., 1997. Empirical likelihood type confidence intervals under random censorship. Ann. Inst. Statist. Math. 49, 447–466.
Adimari, G., Ventura, L., 2002a. Quasi-profile loglikelihood for unbiased estimating functions. Ann. Inst. Statist. Math. 54, 235–244.
Adimari, G., Ventura, L., 2002b. Quasi-likelihood from M-estimators: a numerical comparison with empirical likelihood. Statist. Methods Appl. 11, 175–185.
Andrade, J.A.A., O'Hagan, A., 2006. Bayesian robustness modeling using regularly varying distributions. Bayesian Anal. 1, 169–188.
Berger, J.O., 1994. An overview of robust Bayesian analysis (with discussion). Test 3, 5–124.
Bernardo, J.M., Smith, A.F.M., 2000. Bayesian Theory. Wiley Series in Probability and Statistics, New York.
Bertolino, F., Racugno, W., 1992. Analysis of the linear correlation coefficient using pseudo-likelihoods. J. Italian Statist. Soc. 1, 33–50.
Bertolino, F., Racugno, W., 1994. Robust Bayesian analysis of variance and the χ²-test by using marginal likelihoods. Statistician 43, 191–201.
Cabras, S., Racugno, W., Ventura, L., 2006. Bayesian inference on the scalar skew-normal distribution. In: Proceedings of COMPSTAT 2006, 17th Symposium of IASC-ERS, pp. 1373–1380.
Dawid, A.P., 1973. Posterior expectations for large observations. Biometrika 60, 664–667.
Dey, D.K., Ghosh, S.H., Lou, K., 1996. On local sensitivity measures in Bayesian analysis (with discussion). In: Bayesian Robustness, IMS Lecture Notes—Monograph Series, vol. 29.
Efron, B., 1993. Bayes and likelihood calculations from confidence intervals. Biometrika 80, 3–26.
Gustafson, P., 1996. Model influence functions based on mixtures. Canad. J. Statist. 24, 535–548.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 1986. Robust Statistics. The Approach Based on Influence Functions. Wiley, New York.
Heritier, S., Ronchetti, E., 1994. Robust bounded influence tests in general parametric models. J. Amer. Statist. Assoc. 89, 897–904.
Huber, P.J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. I, pp. 221–233.
Huber, P.J., 1981. Robust Statistics. Wiley, New York.
Lavine, M., 1991. Sensitivity in Bayesian statistics: the prior and the likelihood. J. Amer. Statist. Assoc. 86, 396–399.
Lavine, M., 1995. On an approximate likelihood for quantiles. Biometrika 82, 220–222.
Lazar, N.A., 2003. Bayesian empirical likelihood. Biometrika 90, 319–326.
Markatou, M., Ronchetti, E., 1997. Robust inference: the approach based on influence functions. Handbook of Statistics 15, 49–75.
Markatou, M., Basu, A., Lindsay, B.G., 1998. Weighted likelihood equations with bootstrap root search. J. Amer. Statist. Assoc. 93, 740–750.
McCullagh, P., 1991. Quasi-likelihood and estimating functions. In: Hinkley, D.V., Reid, N., Snell, E.J. (Eds.), Statistical Theory and Modelling, pp. 265–286.
Monahan, J.F., Boos, D.D., 1992. Proper likelihoods for Bayesian analysis. Biometrika 79, 271–278.
Monti, A.C., Ronchetti, E., 1993. On the relationship between empirical likelihood and empirical saddlepoint approximation for multivariate M-estimators. Biometrika 80, 329–338.
O'Hagan, A., 1979. On outlier rejection phenomena in Bayes inference. J. Roy. Statist. Soc. B 41, 358–367.
O'Hagan, A., 1988. Modelling with heavy tails. Bayesian Statist. 3, 345–359.
O'Hagan, A., 1990. Outliers and credence for location parameter inference. J. Amer. Statist. Assoc. 85, 172–176.
Owen, A.B., 2001. Empirical Likelihood. Chapman & Hall, London.
Pace, L., Salvan, A., 1997. Principles of Statistical Inference from a Neo-Fisherian Perspective. World Scientific, Singapore.
Pace, L., Salvan, A., Ventura, L., 2006. Likelihood based discrimination between separate scale and regression models. J. Statist. Plann. Inference 136, 3539–3553.
Paranjpe, S.A., Gore, A.P., 1994. Selecting variables for discrimination when covariance matrices are unequal. Statist. Probab. Lett. 21, 417–419.
Passarin, K., 2004. Robust Bayesian estimation. Working Paper 2004.11, University of Insubria, Italy.
Peracchi, F., 1990. Robust M-estimators. Econometric Rev. 9, 1–30.
Peracchi, F., 1991. Robust M-tests. Econometric Theory 7, 69–84.
Pericchi, L.R., Perez, M.E., 1994. Posterior robustness with more than one sampling model. J. Statist. Plann. Inference 40, 279–294.
Pericchi, L.R., Sansò, B., 1995. A note on bounded influence in Bayesian analysis. Biometrika 82, 223–225.
Racugno, W., Salvan, A., Ventura, L., 2005. On the use of pseudo-likelihoods in Bayesian variable selection. Working Paper 2005.12, Department of Statistics, University of Padua, Italy.
Raftery, A.E., Madigan, D., Volinsky, C.T., 1996. Accounting for model uncertainty in survival analysis improves predictive performance. Bayesian Statist. 5, 323–349.
Rios Insua, D., Ruggeri, F., 2000. Robust Bayesian Analysis. Springer, New York.
Royall, R., 1986. Model robust confidence intervals using maximum likelihood estimators. Internat. Statist. Rev. 54, 221–226.
Schennach, S.M., 2005. Bayesian exponentially tilted empirical likelihood. Biometrika 92, 31–46.
Severini, T.A., 1999. On the relationship between Bayesian and non-Bayesian elimination of nuisance parameters. Statist. Sinica 9, 713–724.
Shyamalkumar, N.D., 2000. Likelihood robustness. In: Robust Bayesian Analysis. Springer, New York.
Sivaganesan, S., 1993. Robust Bayesian diagnostics. J. Statist. Plann. Inference 35, 171–188.
Spiegelhalter, D.J., 1985. Exact Bayesian inference on the parameters of a Cauchy distribution with vague prior information. Bayesian Statist. 2, 743–750.
Tierney, L., Kadane, J.B., 1986. Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81, 82–86.
Tsao, M., Zhou, J., 2001. On the robustness of empirical likelihood ratio confidence intervals for location. Canad. J. Statist. 29, 129–140.
Tsou, T.S., Royall, R.M., 1995. Robust likelihoods. J. Amer. Statist. Assoc. 90, 316–320.
Verdinelli, I., Wasserman, L., 1991. Bayesian analysis of outlier problems using the Gibbs sampler. Statist. Comput. 1, 105–117.