Journal of Statistical Planning and Inference 166 (2015) 158–170
A note on Bayes factor consistency in partial linear models

Taeryon Choi^a,*, Judith Rousseau^b

^a Department of Statistics, Korea University, Republic of Korea
^b CEREMADE, Université Paris Dauphine, France
Article history: Received 24 December 2013; Received in revised form 15 November 2014; Accepted 28 March 2015; Available online 22 April 2015.

Keywords: Bayes factor; Consistency; Fourier series; Gaussian processes; Hellinger distance; Kullback–Leibler neighborhoods; Lack of fit testing

Abstract. It has become increasingly important to understand the asymptotic behavior of the Bayes factor for model selection in general statistical models. In this paper, we discuss recent results on Bayes factor consistency in semiparametric regression problems where observations are independent but not identically distributed. Specifically, we deal with the model selection problem in the context of partial linear models, in which the regression function is the sum of a parametric component and a nonparametric component modeled with Gaussian process priors, and Bayes factor consistency is investigated for choosing between the parametric model and the semiparametric alternative.
1. Introduction

Suppose we have two candidate models $M_0$ and $M_1$ for $y^n \in \mathcal{Y}_n$, a set of $n$ observations from an arbitrary distribution $P$ that is absolutely continuous with respect to a common measure $\mu_n$ on $\mathcal{Y}_n$. We also assume the two candidate models to have parameters $\theta$ and $\lambda$, with prior distributions $\pi_0(\theta)$ and $\pi_1(\lambda)$, respectively:

$$M_0 = \{p^n_\theta(y^n),\ \theta \in \Theta,\ \pi_0(\theta)\}, \qquad M_1 = \{p^n_\lambda(y^n),\ \lambda \in \Lambda,\ \pi_1(\lambda)\}, \tag{1}$$
where $p^n_\theta(y^n)$ and $p^n_\lambda(y^n)$ denote the densities of $y^n$ with respect to $\mu_n$ under $M_0$ and $M_1$, respectively. Based on this set of observations, a common Bayesian procedure to measure the evidence in favor of $M_0$ over $M_1$ is to assess the Bayes factor (Jeffreys, 1961). Given the two candidate models $M_0$ and $M_1$, the Bayes factor, denoted $B_{01}$, is the ratio of the two marginal densities (Kass and Raftery, 1995),

$$B_{01} = \frac{p(y^n \mid M_0)}{p(y^n \mid M_1)} = \frac{m_0(y^n)}{m_1(y^n)} = \frac{\int p^n_\theta(y^n)\,\pi_0(\theta)\,d\theta}{\int p^n_\lambda(y^n)\,\pi_1(\lambda)\,d\lambda}. \tag{2}$$
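As a minimal numerical sketch of (2), the two marginal densities can be approximated by averaging the likelihood over prior draws. The toy pair of models below ($N(0,1)$ versus $N(\theta,1)$ with $\theta \sim N(0,1)$) and all constants are illustrative assumptions, not the models analyzed in this paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=50)          # data generated under M0

def log_marginal(y, prior_draws):
    """log of the prior-averaged likelihood, stabilized with log-sum-exp."""
    ll = np.array([norm.logpdf(y, loc=t).sum() for t in prior_draws])
    return np.logaddexp.reduce(ll) - np.log(len(prior_draws))

log_m0 = norm.logpdf(y, loc=0.0).sum()               # M0: no free parameter
log_m1 = log_marginal(y, rng.normal(size=10_000))    # M1: theta ~ N(0, 1)
print("log B01 =", log_m0 - log_m1)                  # > 0 favors M0
```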
Note that a large value of $B_{01}$ is strong evidence in support of model $M_0$ (Jeffreys, 1961; Kass and Raftery, 1995). Accordingly, $B_{01}$ is expected to converge to infinity as the sample size increases when $M_0$ is the true model. This notion is called the consistency of the Bayes factor, or Bayes factor consistency, and is formulated as follows:

$$\lim_{n\to\infty} B_{01} = \begin{cases} \infty, & \text{in } P^n_{\theta_0}\text{-probability, if } p^n_{\theta_0} \in M_0, \\ 0, & \text{in } P^n_{\lambda_0}\text{-probability, if } p^n_{\lambda_0} \in M_1, \end{cases} \tag{3}$$
where $P_{\theta_0}$ represents the true probability measure belonging to the null model, and $P_{\lambda_0}$ represents the true probability measure belonging to the alternative model. In (1)–(3), we typically have in mind that $\Lambda$ is much larger than $\Theta$, and $\Theta$ is often nested in $\Lambda$. In this paper we consider a setup where $\Theta \subset \mathbb{R}^d$ for some $0 < d < \infty$ and $\Lambda = \Theta \times \mathcal{F}$, where $\mathcal{F}$ is a function space.

Recent studies of Bayes factor consistency have focused on density estimation in Bayesian goodness-of-fit testing problems. In goodness-of-fit testing of a parametric null ($M_0$) against a nonparametric alternative ($M_1$), Bayesian approaches have been investigated particularly from the nonparametric Bayesian point of view (see Tokdar et al., 2010, for a nontechnical survey). However, because of the difficulty of specifying an alternative model and of computing the Bayes factor, in addition to theoretical issues such as Bayes factor consistency, Bayesian nonparametric goodness-of-fit testing is often regarded as a challenging inferential problem. Nevertheless, there have been several advances in Bayesian goodness-of-fit testing from the nonparametric Bayesian point of view; readers are referred to Verdinelli and Wasserman (1998), Berger and Guglielmi (2001), Lenk (2003), Dass and Lee (2004), Walker et al. (2004), Ghosal et al. (2008), McVinish et al. (2009), and Tokdar et al. (2010).

These results on Bayes factor consistency in density estimation problems are based on verifying sufficient conditions related to posterior consistency and posterior convergence rates, and these conditions are mainly designed for the case of independent and identically distributed (i.i.d.) observations. In particular, one of these sufficient conditions, the Kullback–Leibler property, in which the prior puts positive mass on all Kullback–Leibler neighborhoods of the true parameter value, is known to be fundamental to Bayes factor consistency: if a model has the Kullback–Leibler property, the Bayes factor eventually favors that model over any competing model without the Kullback–Leibler property (e.g. Dass and Lee, 2004; Walker et al., 2004). However, when two competing models both have the Kullback–Leibler property, additional conditions are required for consistent model selection; Ghosal et al. (2008) and McVinish et al. (2009) investigated this issue by establishing sufficient conditions for the consistency of the Bayes factor under i.i.d. observations.

The consistency of Bayesian model selection in regression problems for independent observations has been studied largely in Gaussian linear regression models, particularly in the context of variable selection procedures (e.g. Liang et al., 2008; Casella et al., 2009; Moreno et al., 2010; Shang and Clayton, 2011). On the other hand, Bayes factor consistency in partial linear models has not been fully studied in the literature, except for a recent work by Choi et al. (2009), and the consistency of the Bayes factor for partial linear models deserves a more thorough treatment. More generally, it has become increasingly important to understand the asymptotic behavior of the Bayes factor for model selection in general statistical models, and Bayes factor consistency needs to be investigated for statistical problems with non-i.i.d. observations, such as regression problems, Markov processes, and time series models.
In this paper we discuss recent results on Bayes factor consistency for semiparametric regression problems where observations are independent but not identically distributed. Specifically, we deal with the model selection problem in the context of partial linear models, in which the regression function is the additive form of a parametric component and a nonparametric component modeled with Gaussian process priors, in Section 2; Bayes factor consistency for choosing between the parametric model and the semiparametric alternative under these partial linear models is discussed in Section 3. Concluding remarks are made in Section 4.

2. The Bayesian partial linear model using Gaussian processes

A partial linear model (PLM; see, e.g., Härdle et al., 2004) is a semiparametric regression model whose mean function consists of two additive components, a linear part and a nonparametric part,

$$y_i = \beta_0 + \beta_1^T w_i + f(x_i) + \sigma\epsilon_i, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0,1), \tag{4}$$

where the mean function of the regression model in (4) has two parts: a $p$-dimensional parametric part $\beta_1^T w_i$, with $\{w_i\}_{i=1}^n \in [-1,1]^p$, $p \ge 1$, and a nonparametric part with an unknown function $f(x_i)$, $\{x_i\}_{i=1}^n \in [0,1]$, lying in an infinite-dimensional parameter space. We consider both the case of a fixed design and the case of a random design. For the fixed design, we assume that the design points are equally spaced; for the random design, we assume that the $(w_i, x_i)$ are sampled from a probability distribution.

Bayesian approaches to partial linear models have been studied in the literature by developing different methods for estimating the nonparametric component $f(\cdot)$: Fourier series (Lenk, 1999; Choi et al., 2009), splines (Fahrmeir et al., 2013; Chib and Greenberg, 2010; Kyung, 2011), Gaussian processes (Choi and Woo, 2015), smoothing priors (Koop and Poirier, 2004), and wavelets (Qu, 2006; Ko et al., 2009). Here, we study partial linear models using two different families of Gaussian processes, which are very common nonparametric priors for Bayesian regression functions. Specifically, the unknown nonparametric function $f(\cdot)$ is a priori modeled as a Gaussian process (GP) with zero mean function and a suitable covariance function $C(\cdot,\cdot)$,

$$f(\cdot) \sim GP(0, C(\cdot,\cdot)), \tag{5}$$

where $C(x_i, x_j) = \mathrm{Cov}(f(x_i), f(x_j))$, $i, j = 1, \dots, n$.
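Before turning to the prior on $f$, it may help to fix ideas on the data-generating mechanism. The sketch below generates data from the partial linear model (4), assuming $p = 1$ and illustrative values of $(\beta_0, \beta_1, \sigma)$ and $f$ that mirror the simulation design of Section 3.4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 200, 1.0, 2.0, 0.1
x = (2 * np.arange(1, n + 1) - 1) / (2 * n)          # equally spaced on (0, 1)
w = 2 * (x - 0.5) + rng.normal(scale=0.5, size=n)    # scalar covariate (p = 1)
f = np.sin(2 * np.pi * x)                            # nonparametric component
eps = rng.normal(size=n)                             # i.i.d. N(0, 1) errors
y = beta0 + beta1 * w + f + sigma * eps              # responses from (4)
```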
Briefly speaking, the Gaussian process $f(\cdot)$ in (5) can be decomposed, according to the Karhunen–Loève orthogonal expansion (see, e.g., Grenander, 1981; Wahba, 1990), as

$$f(x) = \sum_{k=0}^{\infty} \lambda_k \phi_k(x), \tag{6}$$
where the $\lambda_k$ are independently distributed according to $N(0, \tau_k^2)$, and the covariance function $C(x, x')$ can be expanded (see, e.g., Wahba, 1990 and Grenander, 1981; this is also known as Mercer's theorem) as

$$C(x, x') = \sum_{k=0}^{\infty} \tau_k^2 \phi_k(x)\phi_k(x'), \qquad \sum_k \tau_k^2 < \infty, \tag{7}$$

where $\tau_0^2 \ge \tau_1^2 \ge \tau_2^2 \ge \cdots \ge 0$ denote the eigenvalues and $\phi_0, \phi_1, \phi_2, \dots$ are the orthonormal eigenfunctions of the integral operator with kernel $C(x, x')$. For the orthonormal basis functions $\phi_k$, a Fourier basis with cosine functions (Lenk, 1999) is considered: $\phi_0(x) = 1$ and $\phi_k(x) = \sqrt{2}\cos(\pi k x)$, $k = 1, 2, \dots$, which forms an orthonormal basis for the piecewise continuous functions on $[a, b]$ (Kreider et al., 1966).

As an alternative to the Karhunen–Loève orthogonal expansion (6) of the regression function, a particular form of covariance function can be used directly to model the desired form of dependence, as is common in Gaussian process regression (e.g. Choi and Woo, 2015; Rasmussen and Williams, 2006). For example, the well-known squared exponential covariance kernel is defined as

$$f(\cdot) \sim GP(0, C(\cdot,\cdot)), \qquad C(x_i, x_j) = v_0 \exp\left(-\tfrac{1}{2} w (x_i - x_j)^2\right), \tag{8}$$
with two hyperparameters $v_0$ and $w$. Note that $C(x_i, x_j)$ is assumed to be a stationary covariance function that depends only on the distance between $x_i$ and $x_j$, $C(x_i, x_j) = C(|x_i - x_j|)$. Since $f(x_i)$ is assumed to be a stationary Gaussian process a priori, if the covariance function $C(x_i, x_j)$ is continuous on $[0,1]$, then $f(x_i)$ admits the orthonormal expansion

$$f(x_i) = \sum_{k=0}^{\infty} \tau_k z_k \phi_k(x_i) = \sum_{k=0}^{\infty} \lambda_k \phi_k(x_i),$$

where $z_k \sim N(0,1)$, $\lambda_k \sim N(0, \tau_k^2)$, and the $\tau_k^2$ satisfy the integral equation

$$\tau_k^2 \phi_k(x_i) = \int_0^1 C(x_i, t)\phi_k(t)\,dt, \qquad k = 0, 1, \dots,$$

which implies that the two approaches using Gaussian process priors in (6) and (8) are equivalent in theory, by the Karhunen–Loève expansion from Mercer's theorem (Grenander, 1981).
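This equivalence can be made concrete numerically: a prior draw of $f$ on a grid can be produced either from a truncated expansion (6) in the cosine basis or directly from the kernel in (8). In the sketch below, the truncation level $K$, the variance decay $\tau_k^2 = k^{-3}$, and the hyperparameters $(v_0, w)$ are illustrative choices only.

```python
# Two ways of drawing a GP prior path on a grid: (i) a truncated
# Karhunen-Loeve expansion (6) in the cosine basis, and (ii) a direct
# multivariate-normal draw from the squared exponential kernel (8).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)

# (i) truncated Karhunen-Loeve / Fourier construction
K = 50
k = np.arange(1, K + 1)
phi = np.vstack([np.ones_like(x), np.sqrt(2) * np.cos(np.pi * np.outer(k, x))])
tau2 = np.concatenate([[1.0], k.astype(float) ** -3])    # tau_0^2, tau_k^2
f_kl = phi.T @ (np.sqrt(tau2) * rng.normal(size=K + 1))  # f = sum lambda_k phi_k

# (ii) squared exponential kernel construction
v0, w_len = 1.0, 100.0
C = v0 * np.exp(-0.5 * w_len * (x[:, None] - x[None, :]) ** 2)
f_se = rng.multivariate_normal(np.zeros(len(x)), C + 1e-10 * np.eye(len(x)))
```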
Bayesian inference for the partial linear model (4) begins with the specification of prior distributions for $\beta_0 \in \mathbb{R}$, $\beta_1 \in \mathbb{R}^p$, $f(\cdot)$ on a given class of measurable functions, and $\sigma \in \mathbb{R}^+$. The nonparametric prior $\pi_f$ is assumed to be a Gaussian process prior as described in (6) or (8), with reproducing kernel Hilbert space $\mathbb{H}$ (RKHS) and concentration function $\varphi_{f_0}$ defined by

$$\varphi_{f_0}(\epsilon) = \inf_{h \in \mathbb{H}:\ \|h - f_0\|_\infty < \epsilon} \|h\|_{\mathbb{H}}^2 - \log \pi_f\left[\|f\|_\infty < \epsilon\right],$$

where $\|\cdot\|_\infty$ is the supremum norm, $\|f\|_\infty = \sup\{|f(x)| : x \in [0,1]\}$ (see van der Vaart and van Zanten, 2008b, for a more complete discussion of the notions of RKHS and concentration function). Recall that the support of $\pi_f$ is the closure of $\mathbb{H}$ (e.g. van der Vaart and van Zanten, 2008a,b). Hence, for any $f_0$ that belongs to the support of $\pi_f$, there exist $\epsilon_n \to 0$ and $c > 0$ such that $\varphi_{f_0}(\epsilon_n) \le n\epsilon_n^2$ and
$$\pi_f\left[\|f - f_0\|_\infty \le \epsilon_n\right] \ge e^{-cn\epsilon_n^2}, \tag{9}$$

which follows from Lemma 5.3 of van der Vaart and van Zanten (2008b). The result (9) gives the small ball probabilities of Gaussian processes (van der Vaart and van Zanten, 2008a); it quantifies the rate at which the prior probability $\pi_f$ concentrates around the true parameter $f_0$. The small ball probability (9) is used not only to determine the rates of contraction of posterior distributions based on Gaussian process priors (van der Vaart and van Zanten, 2008a) but also to establish consistency of the Bayes factor under the partial linear model, which will be discussed in the next section.

The parametric prior $\pi_0$ for $(\beta, \sigma^2)$ is assumed to be absolutely continuous with respect to Lebesgue measure, with a positive, continuous and bounded density on $\mathbb{R}^{p+1} \times \mathbb{R}^+$. For example, we use an independent normal prior $\pi_\beta$ for $\beta = (\beta_0, \beta_1^T)^T \sim N_{p+1}(0_{p+1}, B_0)$, where $B_0$ is a $(p+1)$-dimensional diagonal matrix with diagonal elements $\gamma_0, \gamma_1, \dots, \gamma_p$. For the prior distribution of $\sigma^2$, a conjugate prior $\pi_\sigma$ can be taken to be an inverse gamma distribution, IG$(a, b)$. Based on the model structure in (4) with suitable prior distributions $\pi_0$ and $\pi_f$ for the unknown parameters, we build the posterior distribution and estimate the regression function $\beta_0 + \beta_1^T w + f(x)$ by suitable Markov chain Monte Carlo methods (see, e.g., Lenk, 1999 and Choi and Woo, 2015).
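The quantity bounded from below in (9) can be visualized by crude Monte Carlo under a truncated version of the prior (6). The sketch below, with an illustrative variance decay, radius, and target $f_0$, only estimates the small ball probability empirically; it is not part of the theoretical development.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)
K = 30
k = np.arange(1, K + 1)
phi = np.vstack([np.ones_like(x), np.sqrt(2) * np.cos(np.pi * np.outer(k, x))])
tau = np.concatenate([[1.0], k.astype(float) ** -1.5])   # sd of lambda_k

f0 = 0.3 * np.sqrt(2) * np.cos(np.pi * x)       # a fixed target in the support
draws = rng.normal(size=(20_000, K + 1)) * tau  # lambda_k ~ N(0, tau_k^2)
sup_dist = np.abs(draws @ phi - f0).max(axis=1) # ||f - f0||_inf per draw
print("pi_f[||f - f0||_inf <= 0.5] ~", (sup_dist <= 0.5).mean())
```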
The goodness of fit in this problem can thus be represented as a choice between model $M_0$ and model $M_1$,

$$M_0 : y_i = \beta_0 + \beta_1^T w_i + \sigma\epsilon_i \qquad \text{vs.} \qquad M_1 : y_i = \beta_0 + \beta_1^T w_i + f(x_i) + \sigma\epsilon_i, \tag{10}$$
which indicates that, in the alternative model $M_1$, $f(x)$ explains the covariate effect of $x$ that cannot be captured by the covariate effect of $w$ in the form $\beta_0 + \beta_1^T w$, i.e. the departure of the mean function from the parametric regression function of $w$.
Note that the model selection structure in (10) is equivalent to the testing problem for model checking, or lack-of-fit testing, in the partial linear model (4), which can be rewritten as

$$H_0: \inf_{\beta\in\mathbb{R}^{p+1}} \int_{[-1,1]^p\times[0,1]} \left|(1, w^T)\beta - f(x)\right|^2 d\mu_{w,x} = 0 \quad \text{vs.} \quad H_1: \inf_{\beta\in\mathbb{R}^{p+1}} \int_{[-1,1]^p\times[0,1]} \left|(1, w^T)\beta - f(x)\right|^2 d\mu_{w,x} > 0, \tag{11}$$
where $\mu_{w,x}$ denotes a joint probability measure of $(w, x)$ on $[-1,1]^p \times [0,1]$, and $f(x)$ is assumed to be square integrable, $f \in L^2[0,1]$. For each $f \in L^2[0,1]$, we write, in the case of a random design with common distribution $\mu_{w,x}$,

$$H(f) = \inf_{\beta\in\mathbb{R}^{p+1}} \int_{[-1,1]^p\times[0,1]} \left|(1, w^T)\beta - f(x)\right|^2 d\mu_{w,x}, \tag{12}$$

and in the case of a fixed design,

$$H(f) = \liminf_n \inf_{\beta\in\mathbb{R}^{p+1}} \int_{[-1,1]^p\times[0,1]} \left|(1, w^T)\beta - f(x)\right|^2 d\mu^n_{w,x}, \tag{13}$$
where $d\mu^n_{w,x} = n^{-1}\sum_{i=1}^n \delta_{w_i,x_i}$ is the empirical measure on the design points. It is easy to see that $H(f)$ acts as a distance to the null hypothesis. In the case of the fixed design, (12) is defined in the same way with $d\mu(w, x) = dw\,dx$, the Lebesgue measure. Hence, if the distance $H(f)$ to the null hypothesis is positive, the nonparametric component $f(x)$ is separated from the parametric component, and the parametric component alone is not enough to explain the response; the nonparametric component $f(x)$ is needed. On the other hand, if $H(f)$ is zero, the nonparametric component $f(x)$ is indistinguishable from the parametric component, and the parametric model suffices to explain the response. When $w$ and $x$ are independent under $\mu_{w,x}$, i.e. $\mu_{w,x} = \mu_w \times \mu_x$, the testing problem in (11) is equivalent to testing the constancy of $f(x)$,
$$H_0: f(x) = \text{constant} \quad \text{vs.} \quad H_1: f(x) \ne \text{constant}. \tag{14}$$
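Empirically, the quantity inside the liminf in (13) is the minimized average squared error of approximating $f(x_i)$ by $(1, w_i^T)\beta$, which a least squares fit computes directly. A sketch, reusing the illustrative design from the example after (5):

```python
import numpy as np

def H_empirical(w, fx):
    """min over beta of (1/n) sum_i ((1, w_i) beta - f(x_i))^2."""
    Z = np.column_stack([np.ones_like(fx), w])      # design matrix (1, w_i)
    resid = fx - Z @ np.linalg.lstsq(Z, fx, rcond=None)[0]
    return np.mean(resid ** 2)

rng = np.random.default_rng(4)
n = 1000
x = (2 * np.arange(1, n + 1) - 1) / (2 * n)
w = 2 * (x - 0.5) + rng.normal(scale=0.5, size=n)
print(H_empirical(w, np.sin(2 * np.pi * x)))  # positive: f not in the span
print(H_empirical(w, np.full(n, 0.7)))        # ~0: a constant f is absorbed
```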
Note that if $f(x)$ is constant, a suitable choice of the parametric coefficient $\beta$ makes $H(f)$ zero, while $H(f)$ becomes positive if $f(x)$ is a non-constant function of $x$. Also note that the testing problem (14) is then equivalent to

$$H_0: \lambda_k = 0 \ \ \forall k \ge 1 \quad \text{vs.} \quad H_1: \text{there exists } k \ge 1 \text{ such that } \lambda_k \ne 0 \tag{15}$$
in the context of the Fourier series representation (6) of $f(x)$. Here, Bayesian model selection is performed by computing the Bayes factor in (2), with $\theta = (\beta, \sigma^2)$, $\lambda = (\theta, f)$, $\pi_\theta = \pi_\beta \times \pi_{\sigma^2}$, and $\pi_\lambda = \pi_\theta \times \pi_f$.

3. Bayes factor consistency

The first asymptotic results on the Bayes factor in the partial linear model (4) were established by Choi et al. (2009), in the context of the Fourier representation (6) of the regression function, for the model selection problem in (10). The approach in Choi et al. (2009) is based on the closed form of the Bayes factor and an analytic evaluation of its asymptotic behavior; indeed, when the noise variance $\sigma^2$ is given, the Bayes factor $B_{01}$ in (2) is simply the ratio of two multivariate normal densities. They established Bayes factor consistency under two conditions, one requiring the design matrix to be orthogonal and the other a shrinking variance structure on the Fourier coefficients of $f(x)$, which complemented the work of Lenk (1999) in terms of an asymptotic analysis of the model selection procedure.

Alternatively, the large sample behavior of the Bayes factor can be assessed through the asymptotic behavior of the posterior distribution, that is, posterior consistency and rates of contraction of the posterior distribution under $M_0$ and $M_1$. Here, we investigate the consistency of the Bayes factor in the partial linear model, based on the asymptotic results of McVinish et al. (2009) for goodness-of-fit testing problems and the convergence rates of posterior distributions of Ghosal and van der Vaart (2007a) for non-i.i.d. observations.

3.1. Theorems on Bayes factor consistency

In order to discuss Bayes factor consistency, we consider the following two assumptions: Assumption A1 on the design $(w, x)$ and Assumption A2 on the parametric prior $\pi_\theta$.
Assumption A1. We assume either of the two situations:
• Fixed design: $\{w_i\}_{i=1}^n \in (-1,1)^p$ and $\{x_i\}_{i=1}^n \in (0,1)$ are equally spaced.
• Random design: For $i = 1, \dots, n$, $(w_i, x_i) \sim \mu_{w,x}$ on $[-1,1]^p \times [0,1]$ independently, with $E[w_1] = 0$ and $E[w_1 w_1^T] > 0$ in the sense that the matrix is positive definite, and the marginal distribution of the $x_i$ has a positive and bounded density with respect to Lebesgue measure.
Assumption A2 (on the parametric prior $\pi_\theta$). The prior $\pi_\theta$ on $\theta = (\beta_0, \beta_1, \sigma^2) \in \mathbb{R}^{p+1} \times \mathbb{R}^+$ has a positive density with respect to Lebesgue measure on $\mathbb{R}^{p+1} \times \mathbb{R}^+$ and satisfies: for all $\epsilon > 0$ there exist $a_\epsilon > 0$ and $N_\epsilon$ such that, for all $n \ge N_\epsilon$,

$$\pi_\theta\left(\theta : \|\beta\|_{p+1} \le e^{a_\epsilon n},\ e^{-a_\epsilon n} \le \sigma \le e^{e^{a_\epsilon n}}\right) \ge 1 - e^{-\epsilon n}.$$

This is a very weak assumption on the prior $\pi_\theta$. In particular, if $\sigma$ follows either a Gamma$(a,b)$ with $a, b > 0$, an inverse Gamma$(a,b)$, or a truncated Gaussian as in Gelman (2006), and if the prior on $\beta$ has at most polynomial tails, as normal priors do, then Assumption A2 is satisfied.

When the true model is the alternative model $M_1$ ($\equiv H_1$), that is, when the true probability measure of $y^n$ is $P^n_{\lambda_0}$, it is relatively easy to establish that $B_{01}$ converges to zero in $P^n_{\lambda_0}$-probability, compared to the case when the true model is the null model $M_0$ ($\equiv H_0$). We then have the following result, Bayes factor consistency under $M_1$, summarized in Theorem 1.

Theorem 1. Assume that the design satisfies Assumption A1. Consider two priors $\pi_{\theta,0}$ and $\pi_{\theta,1}$ on $\theta = (\beta_0, \beta_1, \sigma^2)$ satisfying Assumption A2. Suppose that for $\lambda_0 = (\theta_0, f_0) \in \mathbb{R}^{p+1} \times \mathbb{R}^+ \times L^2([0,1])$ there exists $\epsilon_n \downarrow 0$, with $n\epsilon_n^2 \to +\infty$, for which

$$\pi_f(\|f - f_0\|_2 \le \epsilon_n) \ge e^{-cn\epsilon_n^2} \tag{16}$$

for some $c > 0$, and that $H(f_0) > 0$. Also suppose that $f_0$ is contained in the support of $\pi_f$. Then there exists $\delta > 0$ such that

$$P_{\lambda_0}\left(B_{01}e^{\delta n} > \epsilon\right) = o(1).$$

That is, the Bayes factor is exponentially decreasing under the partial linear model $M_1$. The proof of Theorem 1 is provided in Section 3.2.

Now, suppose that the true model is $M_0$ ($\equiv H_0$); that is, the true regression model is the parametric linear model, $y_i = \beta_0 + \beta_1^T w_i + \sigma\epsilon_i$, $\theta_0 = (\beta_{00}, \beta_{10})$, $p^n_{\theta_0} \in M_0$. Thus, if $H(f) = 0$, the parametric vector $\eta_0^n$ has components $\eta_{0i} = \beta_0 + \beta_1^T w_i$. When the true model is a null model nested in $M_1$, as in the case of the partial linear model (4), the parametric regression model under $M_0$ can also be explained by the partial linear model under $M_1$. In other words, $p^n_{\theta_0} \in M_0$ can be approximated by densities in $M_1$, so the prior $\pi_1$ under $M_1$, which has the Kullback–Leibler property, also contains $p^n_{\theta_0}$ in its Kullback–Leibler support. In such a case, it is challenging to prove the Bayes factor consistency of (3) under $M_0$, and additional conditions are required for consistent model selection between the two competing models. In this regard, Rousseau (2007), Ghosal et al. (2008), and McVinish et al. (2009) investigated this issue for i.i.d. observations, and we extend these results to independent but not identically distributed observations in the partial linear model using Gaussian process priors. For this purpose, we consider the following assumptions on the semiparametric prior $\pi_1 = \pi_{\theta,1} \times \pi_f$:
Condition P1 (semiparametric prior $\pi_1 = \pi_{\theta,1} \times \pi_f$).

(i) There exist $u_n \downarrow 0$, $C_1, c_1, \tau > 0$ and $\mathcal{F}_{n,1} \subset L^2([0,1])$ such that

$$\pi_f(\|f^n\|_n \le u_n) \ge C_1 e^{-c_1 nu_n^2}, \qquad \pi_f(\mathcal{F}_{n,1}^c) = o\!\left(e^{-(2+c_1)nu_n^2}\, u_n^{p+1}\right),$$

and, with $\bar{\mathcal{F}}_{n,1} = \{f^n = (f(x_1), \dots, f(x_n))^t,\ f \in \mathcal{F}_{n,1}\}$,

$$\mu\left(\log N(u_n, \bar{\mathcal{F}}_{n,1}, \|\cdot\|_n) > \tau nu_n^2\right) = o(1).$$

(ii) For all $a_1, b > 0$, there exists $a_2 > 0$ such that, for all $n$ large enough,

$$\pi_{\theta,1}\left(a_1 u_n \le \sigma \le e^{e^{a_2 nu_n^2}};\ \|\beta\| \le e^{a_2 nu_n^2}\right) \ge 1 - e^{-bnu_n^2}. \tag{17}$$

Moreover, for all $\epsilon > 0$, there exists $C > 0$ such that, with $w_n = \sqrt{\log(nu_n^2)}$,

$$\pi_f\left(\|f\|_\infty > C\sqrt{n}\,u_n/w_n\right) + \pi_f\left[H(f) \le Cu_n^2\right] < \epsilon\, \frac{u_n^{p+2}}{\sqrt{n}}. \tag{18}$$
Theorem 2. Assume that Assumptions A1 and A2 and Condition P1 are satisfied. Then, for all $\theta_0 \in \mathbb{R}^{p+1}$,

$$B_{01}^{-1} \to 0 \quad \text{in } P^n_{\theta_0}\text{-probability}. \tag{19}$$

That is, the Bayes factor increases to infinity, with $P^n_{\theta_0}$-probability tending to one, under $M_0$ as the sample size increases. The proof of Theorem 2 is given in Section 3.2.
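Before turning to the proofs, the behavior described in Theorems 1 and 2 can be checked numerically in the conjugate case mentioned at the start of this section: when $\sigma^2$ is known and the priors on $\beta$ and on the Fourier coefficients $\lambda_k$ are Gaussian, both marginals are multivariate normal densities in $y^n$, so $B_{01}$ is available in closed form. In the sketch below the prior scales, the decay $\tau_k^2 = k^{-3}$, and the truncation at 25 basis functions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_B01(y, Z, Phi, tau2, sigma2, prior_var_beta=100.0):
    """log m_0(y) - log m_1(y) for the known-sigma^2 partial linear model."""
    S0 = sigma2 * np.eye(len(y)) + prior_var_beta * Z @ Z.T   # marginal under M0
    S1 = S0 + Phi @ np.diag(tau2) @ Phi.T                     # add GP part (M1)
    return (multivariate_normal.logpdf(y, cov=S0)
            - multivariate_normal.logpdf(y, cov=S1))

rng = np.random.default_rng(5)
for n in (25, 100, 400):
    x = (2 * np.arange(1, n + 1) - 1) / (2 * n)
    w = 2 * (x - 0.5) + rng.normal(scale=0.5, size=n)
    Z = np.column_stack([np.ones(n), w])
    k = np.arange(1, 26)
    Phi = np.sqrt(2) * np.cos(np.pi * np.outer(x, k))         # cosine basis
    y0 = 1.0 + 2.0 * w + 0.1 * rng.normal(size=n)             # data from M0
    y1 = y0 + np.sin(2 * np.pi * x)                           # data from M1
    print(n, log_B01(y0, Z, Phi, k ** -3.0, 0.01),            # grows under M0
             log_B01(y1, Z, Phi, k ** -3.0, 0.01))            # falls under M1
```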
3.2. Proofs of Theorems 1 and 2

This subsection proves the Bayes factor consistency results in Theorems 1 and 2. The proofs rely on a general theorem for Bayes factor consistency that is modified from the theorem of McVinish et al. (2009) and adapted to the independent and not identically distributed observations of the partial linear model we consider. We first prove Theorem 1.

3.2.1. Proof of Theorem 1

Let $H(f_0) > 0$ and denote $\lambda = (\beta_0, \beta_1, \sigma, f)$. We also denote by $\lambda_0 = (\beta_{00}, \beta_{10}, \sigma_0, f_0)$ the true value of the unknown parameters. Under the Gaussian noise assumption, it follows that

$$KL(p^n_{\lambda_0}, p^n_\lambda) = E_{p^n_{\lambda_0}}\left(\log\frac{p^n_{\lambda_0}}{p^n_\lambda}\right) = \frac{n}{2}\left(\frac{\sigma_0^2}{\sigma^2} - 1 - \log\frac{\sigma_0^2}{\sigma^2}\right) + \frac{\|\eta^n - \eta_0^n\|_n^2}{2\sigma^2},$$

$$V(p^n_{\lambda_0}, p^n_\lambda) = \mathrm{Var}_{p^n_{\lambda_0}}\left(\log\frac{p^n_{\lambda_0}}{p^n_\lambda}\right) = \frac{n}{2}\left(\frac{\sigma_0^2}{\sigma^2} - 1\right)^2 + \frac{\sigma_0^4}{\sigma^4}\|\eta^n - \eta_0^n\|_n^2, \tag{20}$$

where $\eta^n = (\beta_0 + \beta_1^T w_1 + f(x_1), \dots, \beta_0 + \beta_1^T w_n + f(x_n))^T$ and $\eta_0^n = (\beta_{00} + \beta_{10}^T w_1 + f_0(x_1), \dots, \beta_{00} + \beta_{10}^T w_n + f_0(x_n))^T$. Thus it follows that, if $\|\beta - \beta_0\|_{p+1}^2 \le \epsilon_n^2$, $\|f^n - f_0^n\|_n^2 \le n\epsilon_n^2$, and $|\sigma - \sigma_0| \le \epsilon_n$, then there exists $\tau > 0$ such that $KL(p^n_{\lambda_0}, p^n_\lambda) \le \tau n\epsilon_n^2$ and $V(p^n_{\lambda_0}, p^n_\lambda) \le \tau n\epsilon_n^2$, where $\beta = (\beta_0, \beta_1)$, $\beta_0 = (\beta_{00}, \beta_{10})$, $f^n = (f(x_1), \dots, f(x_n))^T$ and $f_0^n = (f_0(x_1), \dots, f_0(x_n))^T$.
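The closed form in (20) is elementary but easy to get wrong; as a sanity check, the Kullback–Leibler expression for two Gaussian regression models $N(\eta_{0i}, \sigma_0^2)$ versus $N(\eta_i, \sigma^2)$ (writing the quadratic term as $\sum_i(\eta_i - \eta_{0i})^2/(2\sigma^2)$) can be compared with a Monte Carlo average of the log likelihood ratio. All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma0, sigma = 50, 1.0, 1.3
eta0 = rng.normal(size=n)                      # true means under p_{lambda_0}
eta = eta0 + 0.2 * rng.normal(size=n)          # perturbed means under p_lambda

r2 = sigma0 ** 2 / sigma ** 2
kl_closed = (0.5 * n * (r2 - 1 - np.log(r2))
             + np.sum((eta - eta0) ** 2) / (2 * sigma ** 2))

y = eta0 + sigma0 * rng.normal(size=(100_000, n))   # draws from p_{lambda_0}
log_ratio = (np.log(sigma / sigma0)
             - 0.5 * ((y - eta0) / sigma0) ** 2
             + 0.5 * ((y - eta) / sigma) ** 2).sum(axis=1)
print(kl_closed, log_ratio.mean())             # the two numbers should agree
```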
Lemma 3. Let

$$J_1^{\lambda_0}(y^n) = \int_\Lambda \frac{p^n_\lambda(y^n)}{p^n_{\lambda_0}(y^n)}\,d\pi_1(\lambda), \qquad \Lambda = \{\lambda = (\theta, f) :\ \theta \in \mathbb{R}^{p+1}\times\mathbb{R}^+,\ f \in L^2([0,1])\}.$$

Then

$$P^n_{\lambda_0}\left(J_1^{\lambda_0} < e^{-2\tau n\epsilon_n^2}\,\pi_1\{\lambda : KL(p^n_{\lambda_0}, p^n_\lambda) \le \tau n\epsilon_n^2;\ V(p^n_{\lambda_0}, p^n_\lambda) \le \tau n\epsilon_n^2\}\right) \le \frac{2\tau}{n\epsilon_n^2}.$$

The proof of Lemma 3 is similar to the proof of Theorem 1 in McVinish et al. (2009). Thus, assumption (16) implies that

$$P^n_{\lambda_0}\left(J_1^{\lambda_0} < e^{-n\epsilon^2}\right) = o(1) \quad \text{for all } \epsilon > 0. \tag{21}$$
Now, consider the metric $d_n$ which is used to construct tests in regression models, defined as the average of the squared Hellinger distances for the $n$ observations,

$$d_n^2(p^n_{\lambda_1}, p^n_{\lambda_2}) = \frac{1}{n}\sum_{i=1}^n \int \left(\sqrt{\varphi_{\lambda_1}(y_i|\mathbf{w}_i)} - \sqrt{\varphi_{\lambda_2}(y_i|\mathbf{w}_i)}\right)^2 d\mu,$$

where $\varphi_\lambda(y|\mathbf{w})$ is the Gaussian density of $y$ given $\mathbf{w} = (w, x)$, with mean $\eta(w, x) = \beta_0 + \beta_1^T w + f(x)$ and variance $\sigma^2$. Direct calculations lead to

$$d_n^2(p^n_{\lambda_1}, p^n_{\lambda_2}) = 2 - 2\left(1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2}\right)^{1/2}\frac{1}{n}\sum_{i=1}^n \exp\left(-\frac{(\eta_{i1} - \eta_{i2})^2}{4(\sigma_1^2 + \sigma_2^2)}\right),$$

where $\eta_{ij} = \beta_{0j} + \beta_{1j}^T w_i + f_j(x_i)$, $i = 1, \dots, n$, $j = 1, 2$. We thus have
$$d_n^2(p^n_{\lambda_1}, p^n_{\lambda_2}) \ge 2 - 2\left(1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2}\right)^{1/2}\exp\left(-\frac{\|\eta_1^n - \eta_2^n\|_n^2}{4n(\sigma_1^2 + \sigma_2^2)}\right), \tag{22}$$

$$d_n^2(p^n_{\lambda_1}, p^n_{\lambda_2}) \le 2 - 2\left(1 - \frac{(\sigma_1 - \sigma_2)^2}{\sigma_1^2 + \sigma_2^2}\right)^{1/2}\left(1 - \frac{\|\eta_1^n - \eta_2^n\|_n^2}{4n(\sigma_1^2 + \sigma_2^2)}\right), \tag{23}$$

where $\eta_j^n = (\eta_{1j}, \dots, \eta_{nj})^T$, $j = 1, 2$.
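The identity displayed before (22), on which the bounds (22)–(23) rest, is the closed form of the Hellinger affinity between two Gaussian densities; it can be verified by numerical quadrature. The values of $(\eta, \sigma)$ in the sketch below are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def hellinger2(m1, s1, m2, s2):
    """integral of (sqrt(phi1) - sqrt(phi2))^2 over the real line."""
    f = lambda t: (np.sqrt(norm.pdf(t, m1, s1)) - np.sqrt(norm.pdf(t, m2, s2))) ** 2
    return quad(f, -np.inf, np.inf)[0]

rng = np.random.default_rng(7)
eta1, eta2 = rng.normal(size=20), rng.normal(size=20)
s1, s2 = 1.0, 1.4

dn2_quad = np.mean([hellinger2(a, s1, b, s2) for a, b in zip(eta1, eta2)])
c = np.sqrt(1 - (s1 - s2) ** 2 / (s1 ** 2 + s2 ** 2))
dn2_closed = 2 - 2 * c * np.mean(np.exp(-(eta1 - eta2) ** 2
                                        / (4 * (s1 ** 2 + s2 ** 2))))
print(dn2_quad, dn2_closed)   # the two values should agree
```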
Let $\epsilon > 0$, consider $a_\epsilon > 0$ as defined in Assumption A2, and let

$$\Theta_n = \left\{(\beta_0, \beta_1, \sigma) :\ |\beta_0| \le e^{a_\epsilon n},\ \|\beta_1\| \le e^{a_\epsilon n},\ e^{-a_\epsilon n} \le \sigma \le e^{e^{a_\epsilon n}}\right\}.$$

Then it follows that

$$\pi_\theta\left(\Theta_n^c\right) \le e^{-2n\epsilon^2}. \tag{24}$$
We now verify the existence of uniformly consistent tests, using results in Birgé (1983), Le Cam (1986), or Lemma 2 of Ghosal and van der Vaart (2007a). It follows that there exists a sequence of tests $\varphi_{1n}$ such that for all $\theta_1 \in \Theta$,

$$E^n_{\lambda_0}\varphi_{1n} \le e^{-nd_n^2(\lambda_0,\theta_1)/2}, \qquad \sup_{d_n(\theta_1,\theta) \le d_n(\theta_1,\lambda_0)/18} E^n_\theta(1 - \varphi_{1n}) \le e^{-nd_n^2(\lambda_0,\theta_1)/2},$$

where $d_n$ is the average Hellinger distance defined above. From (22), for $n$ large enough, if $\inf_{\beta_0,\beta_1}\|\eta_1^n - \eta_0^n\|_n^2 \ge \epsilon n/2$, there exists a constant $a > 0$ such that $d_n^2(\lambda_0, \theta_1) > a$ when $|\sigma_1 - \sigma_0| \le 2\sigma_0$; on the other hand, if $|\sigma_1 - \sigma_0| > 2\sigma_0$, then direct computations imply that $d_n^2(\lambda_0, \theta_1) > a$ for $a \le (\sqrt{8}-\sqrt{7})/\sqrt{2}$. This leads to

$$E^n_{\lambda_0}\varphi_{1n} \le e^{-na/2}, \qquad \sup_{d_n(\theta_1,\theta)\le a/18} E^n_\theta(1 - \varphi_{1n}) \le e^{-na/2}. \tag{25}$$

Let $t > 0$ and let $N_n(t, \Theta_n, d_n)$ denote the $t$-covering number of $\Theta_n$ with respect to $d_n$. The inequality (23) implies that, for all $\theta_j = (\beta_j, \sigma_j) \in \Theta_n$ and $\theta = (\beta, \sigma) \in \Theta_n$, if $|\beta_0 - \beta_{0j}| \le \tau\sigma_j$, $|\beta_1 - \beta_{1j}|^2 \le \tau\sigma_j$, $\|\eta^n - \eta_j^n\|^2 \le na\sigma_j^2/(18)^2$, and $|\sigma_j - \sigma|^2 \le a\sigma_j^2/[4(18)^2]$, then, choosing $\tau$ small enough,

$$d_n^2(p^n_\theta, p^n_{\theta_j}) \le 2 - 2\left(1 - \frac{(\sigma - \sigma_j)^2}{\sigma^2 + \sigma_j^2}\right)^{1/2}\left(1 - \frac{\|\eta^n - \eta_j^n\|_n^2}{4n(\sigma^2 + \sigma_j^2)}\right),$$

so that $d_n^2(p^n_\theta, p^n_{\theta_j}) \le a/(18)^2$. For each subinterval $(\sigma_j, \sigma_{j+1})$, with $\sigma_j = e^{-a_\epsilon n}(1 + \sqrt{a_\epsilon}/36)^j$, $j = 0, \dots, J_n$, where $J_n = \lfloor 36[\exp(a_\epsilon n) + a_\epsilon n]/\sqrt{a_\epsilon}\rfloor + 1$, we bound the number of intervals in $(\beta_0, \beta_1)$ satisfying the above constraints by $N_{n,j} \le 4e^{2a_\epsilon n}\tau^{-2}\sigma_j^{-2} \vee 1$. Moreover, when $j$ is such that $\sigma_j \ge 2e^{a_\epsilon n}/\tau$, uniformly in $\sigma \in (\sigma_j, \sigma_{j+1})$ and $\beta_0^2 + \|\beta_1\|^2,\ (\beta_0')^2 + \|\beta_1'\|^2 \le e^{2a_\epsilon n}$, with $\theta = (\beta_0, \beta_1, \sigma_j)$ and $\theta' = (\beta_0', \beta_1', \sigma)$, we have $d_n^2(p^n_\theta, p^n_{\theta'}) \le a/(18)^2$, so that the global covering number of $\Theta_n$ is bounded by

$$N_n\left(a/(18)^2, \Theta_n, d_n\right) \le \frac{e^{4a_\epsilon n}}{\tau^2}\sum_{j=0}^{\infty}\left(1 + \sqrt{a_\epsilon}/36\right)^{-2j} + J_n \le e^{an/4}, \tag{26}$$

if the constant $a_\epsilon$ in the definition of $\Theta_n$ is small enough. Therefore, combining (26) with (25), we have that for all $\epsilon > 0$ there exist $\kappa > 0$ and a sequence of tests $\varphi_n$ satisfying

$$E^n_{\lambda_0}[\varphi_n] = o(1), \qquad \sup_{\theta\in\Theta_n} E^n_\theta[1 - \varphi_n] \le e^{-\kappa n}. \tag{27}$$

Now, let $\epsilon > 0$ and $\delta > 0$. It then follows from (21), (24) and (27) that

$$P^n_{\lambda_0}\left(B_{01}e^{\delta n} \ge \epsilon\right) \le E^n_{\lambda_0}[\varphi_n] + P^n_{\lambda_0}\left[J_1^{\lambda_0}(y^n) < e^{-n\epsilon^2}\right] + \frac{e^{n\epsilon^2+\delta n}}{\epsilon}\left(\int_{\Theta_n} E^n_\theta[1 - \varphi_n]\,d\pi_\theta(\theta) + \pi_\theta(\Theta_n^c)\right) \le o(1) + e^{-n\epsilon^2} + e^{-n\kappa/2},$$

as soon as $\delta < \epsilon \wedge \kappa/2$. Hence, the proof of Theorem 1 is complete. □
3.2.2. Proof of Theorem 2

We now prove Theorem 2. Suppose that the true model is $M_0$; that is, the true regression model is the parametric linear model, $y_i = \beta_{00} + \beta_0^T w_i + \sigma_0\epsilon_i$, $\theta_0 = (\beta_{00}, \beta_0, \sigma_0)$, $p^n_{\theta_0} \in M_0$. Thus, if $H(f) = 0$, the parametric vector $\eta_0^n$ has components $\eta_{0i} = \beta_{00} + \beta_0^T w_i$. First, note that if

$$|\sigma - \sigma_0| \le n^{-1/2}, \qquad |\beta_0 - \beta_{00}| \le n^{-1/2}, \qquad \|\beta - \beta_0\|_p \le n^{-1/2},$$

then $KL(p^n_{\theta_0}, p^n_\theta) + V(p^n_{\theta_0}, p^n_\theta) \le C$ for some positive constant $C$, where $\theta_0 = (\beta_{00}, \beta_0, \sigma_0)$ and $\theta = (\beta_0, \beta, \sigma)$. It then follows from Assumption A2 that

$$n^{(p+2)/2}\,\pi_0\left(\theta : KL(p^n_{\theta_0}, p^n_\theta) \le 1,\ V(p^n_{\theta_0}, p^n_\theta) \le 1\right) \ge C_1$$

for a positive constant $C_1$. Next, we define

$$A_{u_n} = \{\lambda \in \Lambda :\ d_n(p^n_\lambda, p^n_{\theta_0}) < u_n\}, \tag{28}$$
and recall that

$$d_n^2(p^n_\lambda, p^n_{\theta_0}) = 2 - 2\left(1 - \frac{(\sigma - \sigma_0)^2}{\sigma^2 + \sigma_0^2}\right)^{1/2}\frac{1}{n}\sum_{i=1}^n \exp\left(-\frac{(\eta_i - \eta_{i0})^2}{4(\sigma^2 + \sigma_0^2)}\right),$$

where $\eta_i = \beta_0 + \beta^T w_i + f(x_i)$ and $\eta_{i0} = \beta_{00} + \beta_0^T w_i$, and that $d_n^2(p^n_\lambda, p^n_{\theta_0}) \le u_n^2$ implies $(\sigma - \sigma_0)^2 \le Cu_n^2$ for some positive $C$, regardless of $\eta$. We then show that Condition P1(i) implies that $\pi_1[A_{Mu_n}|y^n] = 1 + o_p(1)$, for some $M > 0$. Using (20), it follows that there exists $M_0$ such that if $\lambda = (\beta, \sigma, f)$ satisfies

$$\|f^n\|_n \le \sigma_0 u_n/2, \qquad (\beta - \beta_0)^T Z^T Z(\beta - \beta_0) \le u_n\sigma_0/2, \qquad 0 < \sigma - \sigma_0 < \epsilon u_n$$

for some $\epsilon > 0$, then

$$\lambda \in S_n = \{\lambda :\ KL(p^n_{\lambda_0}, p^n_\lambda) \le nu_n^2,\ V(p^n_{\lambda_0}, p^n_\lambda) \le M_0 nu_n^2\},$$

where $Z$ is the matrix whose $i$th row is given by $Z_i = (1, w_i^T)$, $i = 1, \dots, n$. Thus, Condition P1(i) implies that, with probability going to 1, $\pi_1(S_n) \gtrsim C_1 e^{-c_1 nu_n^2} u_n^{p+1}$, so that

$$J_1^{\lambda_0}(y^n) \gtrsim e^{-(2+c_1)nu_n^2}\, u_n^{p+1} \tag{29}$$

with probability going to 1. Moreover, define

$$F_n(a_1, a_2) = \left\{(\beta, \sigma, f) :\ \|\beta\| \le e^{a_2 nu_n^2},\ a_1 u_n \le \sigma \le e^{e^{a_2 nu_n^2}},\ f \in \mathcal{F}_{n,1}\right\};$$

then, under Condition P1(i), we have that for all $a > 0$ there exists $c_0 > 0$ with $\pi(F_n^c(a_1, a_2)) = o(e^{-(2+c_1)nu_n^2} u_n^{p+1})$. Let $\epsilon_0 > 0$ and define $S_{n,1} = \{\lambda :\ d_n^2(p^n_\lambda, p^n_{\theta_0}) \le \epsilon_0^2\}$ and $S_{n,2} = \{\lambda :\ d_n^2(p^n_\lambda, p^n_{\theta_0}) > \epsilon_0^2\}$. Then $\lambda \in S_{n,1}$ implies that there exists $C > 0$ such that

$$|\sigma - \sigma_0| \le \sigma_0/2 \qquad \text{and} \qquad \|\eta^n - \eta_0^n\|_n \le Cd_n(p^n_\lambda, p^n_{\theta_0}).$$

Thus, if $\lambda, \lambda' \in S_{n,1}$, with $\lambda = (\beta, \sigma, f)$ and $\lambda' = (\beta', \sigma', f')$, and

$$|\sigma - \sigma'| \le u_n, \qquad \|\beta - \beta'\| \le u_n, \qquad \|f^n - (f')^n\|_n \le u_n,$$

then there exists $\rho > 0$ such that $d_n(p^n_\lambda, p^n_{\lambda'}) \le \rho u_n$. Therefore, it follows that there exists $\rho$ such that

$$\log N\left(\rho u_n, F_n \cap S_{n,1}, d_n\right) \lesssim nu_n^2 + \log N\left(u_n, \mathcal{F}_{n,1}, \|\cdot\|_n\right) \lesssim nu_n^2,$$

with probability going to 1. Next, suppose that $\lambda \in S_{n,2}$ and $\sigma \in (\sigma_0/2, 3\sigma_0/2)$. Then, using the same computations as before,

$$\log N\left(\epsilon_0/18, F_n \cap S_{n,2} \cap \{\sigma \in (\sigma_0/2, 3\sigma_0/2)\}, d_n\right) \lesssim nu_n^2 + \log N\left(u_n, \mathcal{F}_{n,1}, \|\cdot\|_n\right) = o(n).$$

Set $\sigma_n = a_1 u_n$ and $\bar{\sigma}_n = e^{e^{a_2 nu_n^2}}$. We consider separately the cases $\sigma \in (\sigma_n, \sigma_0/2)$ and $\sigma \in (3\sigma_0/2, \bar{\sigma}_n)$. In the former case, we define $\sigma_j = \sigma_n(1 + a_1)^j$ with $a_1 > 0$ chosen as small as need be, and $j = 1, \dots, J_{n,1}$, where $J_{n,1} = \lfloor a_1^{-1}\log(\sigma_0/(2\sigma_n))\rfloor + 1$. For each $j \le J_{n,1}$, for all $\sigma, \sigma' \in (\sigma_j, \sigma_{j+1})$, all $\beta, \beta'$ such that $\|\beta - \beta'\| \le \rho\sigma_j$, and all $f, f'$ such that $\|f^n - (f')^n\|_n \le a_1^{-1}\sigma_j$, with $\rho$ small enough and $a_1$ large enough, $d_n(p^n_\lambda, p^n_{\lambda'}) \le \epsilon_0/18$. Thus, by a derivation of the covering number similar to that of (26), we have $\log N(\epsilon_0/18, F_n \cap \{\sigma \in (\sigma_n, \sigma_0/2)\}, d_n) = o(n)$. For the latter case, we obtain the covering number in the same manner, $\log N(\epsilon_0/36, F_n \cap \{\sigma \in (3\sigma_0/2, \bar{\sigma}_n)\}, d_n) = o(n)$. Finally, it follows from the general results on posterior consistency and posterior convergence rates (see, e.g., Ghosal and van der Vaart, 2007b) that

$$\pi_1\left[A_{Mu_n}|y^n\right] = 1 + o_{p^n_{\theta_0}}(1), \tag{30}$$

uniformly over all compact $K \subset \mathbb{R}^{p+1} \times \mathbb{R}^{+*}$, for some $M > 0$.

Now we bound $\pi_1(A_{Mu_n})$ from above. We first note that $\lambda \in A_{Mu_n}$ implies that there exist $\tau_1, \tau_2 > 0$ such that $|\sigma - \sigma_0| \le \tau_1 u_n$ and $\|\eta^n - \eta_0^n\|_n^2 \le \tau_2 u_n^2$. Thus, it follows that

$$\pi_1(A_{Mu_n}) \lesssim \pi_1\left(\{|\sigma - \sigma_0| \le \tau_1 u_n\} \cap \{\|\eta^n - \eta_0^n\|_n^2 \le \tau_2 u_n^2\}\right) \lesssim u_n\, \pi_1\left(\|\eta^n - \eta_0^n\|_n^2 \le \tau_2 u_n^2\right).$$
Let $P_z$ be the projection operator from $\mathbb{R}^n$ onto the vector space spanned by $Z$; we then obtain

$$\|\eta^n - \eta_0^n\|_n^2 = \|Z(\beta - \beta_0) - P_z f^n\|_n^2 + \|f^n - P_z f^n\|_n^2 \gtrsim \|\beta - \beta_0 - (Z^tZ)^{-1}Z^t f^n\|^2 + \|f^n - P_z f^n\|_n^2, \tag{31}$$

so that

$$\pi_1\left(\|\eta^n - \eta_0^n\|_n^2 \le \tau_2 u_n^2\right) \lesssim u_n^{p+1}\, \pi_f\left(\|f^n - P_z f^n\|_n^2 \le \tau_3 u_n^2\right)$$

for some $\tau_3 > 0$. Thus, to prove that for all $\epsilon > 0$, $\pi_1(A_{Mu_n}) < \epsilon n^{-(p+2)/2}$ with probability going to 1, it is enough to prove that

$$\mu\left(\pi_f\left(\|f^n - P_z f^n\|_n^2 \le \tau_3 u_n^2\right) > \epsilon\left(nu_n^2\right)^{-(p+2)/2}\right) = o(1),$$

which can be verified by Condition P1(ii) as follows. Condition P1(ii) implies that we can restrict ourselves to the set $\{\|f^n - P_z f^n\|_n^2 \le \tau_3 u_n^2\} \cap \{f :\ \|f\|_\infty \le C\sqrt{n}u_n\} \cap \{f :\ \|f\|_2 \le C\sqrt{n}u_n/w_n\}$, and for the sake of simplicity we still call this set $A_n$. Moreover, let $\Omega_{n,1}(B) = \{Z :\ \|Z^tZ/n - V_w\| \le B/\sqrt{n}\}$; then, for all $\epsilon > 0$, there exists $B > 0$ such that $\mu(\Omega_{n,1}(B)^c) \le \epsilon$. Since

$$\|f^n - P_z f^n\|_n^2 \ge \left(\|f^n - n^{-1}ZV_w^{-1}Z^t f^n\|_n - \|(P_z - n^{-1}ZV_w^{-1}Z^t)f^n\|_n\right)^2, \tag{32}$$

and, when $Z \in \Omega_{n,1}(B)$,

$$\|(P_z - n^{-1}ZV_w^{-1}Z^t)f^n\|_n^2 \lesssim n^{-2}\left\|Z[(Z^tZ)/n - V_w]Z^t f^n\right\|_n^2 \le \frac{B^2}{n}\left(\frac{\sum_{i=1}^n \|Z_i\|^2}{n}\right)\|f^n\|_n^2 \lesssim \frac{\|f^n\|_n^2}{n} \lesssim \frac{\|f\|_\infty^2}{n} = o(u_n^2)$$

if $f \in A_n$, it follows that, if $Z \in \Omega_{n,1}(B)$, there exists $C > 0$ such that for all $f \in A_n$, $\|f^n - n^{-1}ZV_w^{-1}Z^t f^n\|_n^2 \le Cu_n^2$. We also have, for all $A > 0$, using a Markov inequality,

$$\mu\left(\pi_f\left(A_n \cap \left\{\left\|\frac{ZV_w^{-1}Z^t f^n}{n} - \langle z(w), f\rangle\right\|_n^2 > Au_n^2\right\}\right) > \epsilon(\sqrt{n}u_n)^{-(p+2)}\right) \le \frac{(\sqrt{n}u_n)^{p+2}}{\epsilon}\int_{A_n}\mu\left(\left\|\frac{ZV_w^{-1}Z^t f^n}{n} - \langle z(w), f\rangle\right\|_n^2 > Au_n^2\right)d\pi_f(f).$$

Note that, for $Z \in \Omega_n(B)$,

$$\left\|\frac{ZV_w^{-1}Z^t f^n}{n} - \langle z(w), f\rangle\right\|_n^2 \lesssim \sum_{j=1}^{p+1}\left(\frac{\sum_{i=1}^n Z_{ij}f(x_i)}{n} - \langle z_j(w), f\rangle\right)^2,$$

where $z_j(w) = 1$ if $j = 1$ and $z_j(w) = d_{j-1}$ if $j > 1$, with $d_{j-1}$ the $(j-1)$th component of $w$. Set $\sigma_{j,f}^2 = \int (d_{j-1})^2 f(x)^2\,d\mu(d_j, x)$ if $j \ge 2$ and $\sigma_{1,f}^2 = \int f(x)^2\,d\mu(x)$; then we can write

$$\mu\left(\left|\sum_{i=1}^n \left(Z_{ij}f(x_i) - \langle z_j(w), f\rangle\right)\right| > \sqrt{A}\,nu_n\right) = \mu\left(\left|\sum_{i=1}^n \frac{Z_{ij}f(x_i) - \langle z_j(w), f\rangle}{\sqrt{n}\,\sigma_{j,f}}\right| > \frac{\sqrt{A}\sqrt{n}\,u_n}{\sigma_{j,f}}\right).$$

Since, for all $f \in A_n$, $|Z_{ij}f(x_i)|$ is bounded by $\|f\|_\infty$, Kolmogorov's exponential inequality (see Shorack and Wellner, 1986, p. 855) implies that

$$\mu\left(\left|\sum_{i=1}^n \left(Z_{ij}f(x_i) - \langle z_j(w), f\rangle\right)\right| > \sqrt{A}\,nu_n\right) \le e^{-\frac{Anu_n^2}{4\|f\|_2^2}} \quad \text{if } \sigma_{j,f}^2 \ge \sqrt{A}\,u_n\|f\|_\infty,$$

$$\mu\left(\left|\sum_{i=1}^n \left(Z_{ij}f(x_i) - \langle z_j(w), f\rangle\right)\right| > \sqrt{A}\,nu_n\right) \le e^{-\frac{\sqrt{A}\,nu_n}{4\|f\|_\infty}} \quad \text{if } \sigma_{j,f}^2 < \sqrt{A}\,u_n\|f\|_\infty.$$
In both cases the right-hand side is $o\left(e^{-Aw_n^2/4C^2}\right) = o\left((\sqrt{n}u_n)^{-(p+2)}\right)$ on $A_n$, by choosing $A$ large enough. We finally obtain that there exists $c > 0$ such that

$$\pi_f[A_n] \le \pi_f\left(\|f^n - ZV_w^{-1}\langle z(w), f\rangle\|_n^2 \le cu_n^2\right) + o_p\left((\sqrt{n}u_n)^{-(p+2)/2}\right).$$

Consider $f \in A_n$ with $H(f) > 2cu_n^2$; then

$$\mu\left(\pi_f\left(\{\|f^n - ZV_w^{-1}\langle z(w), f\rangle\|_n^2 \le cu_n^2\} \cap \{H(f) > 2cu_n^2\}\right) > \epsilon(\sqrt{n}u_n)^{-(p+2)}\right) \le \frac{(\sqrt{n}u_n)^{p+2}}{\epsilon}\int_{A_n}\mu\left(-\|f^n - ZV_w^{-1}\langle z(w), f\rangle\|_n^2 + H(f) > (H(f)/2 + cu_n^2)/2\right)d\pi_f(f).$$

Using the Kolmogorov inequality, if $H(f) \ge 2cu_n^2$, and since $(f(x_i) - Z_iV_w^{-1}\langle z(w), f\rangle)^2 \le c_0\|f\|_\infty^2$ for some positive $c_0$,

$$\mu\left(-\|f^n - ZV_w^{-1}\langle z(w), f\rangle\|_n^2 + H(f) > H(f)/4\right) \le \exp\left(-\frac{cCnH(f)^2}{w^2(f)}\right) \quad \text{if } H(f) \le \frac{16w^2(f)}{c_0\|f\|_\infty^2},$$

$$\mu\left(-\|f^n - ZV_w^{-1}\langle z(w), f\rangle\|_n^2 + H(f) > H(f)/4\right) \le \exp\left(-\frac{cCnH(f)^2}{\|f\|_\infty^2}\right) \quad \text{if } H(f) > \frac{16w^2(f)}{\|f\|_\infty^2},$$

for some positive $C$, where $w^2(f) = \int\left(f(x) - z(w)V_w^{-1}\langle z(w), f\rangle\right)^4 d\mu(w, x) - H(f)^2$. In both cases the right-hand term is bounded by $e^{-cCw_n^2} = o\left((\sqrt{n}u_n)^{-(p+2)}\right)$, by choosing $c$ large enough. Therefore, it follows that for all $\epsilon > 0$ there exists $C > 0$ such that

$$\mu\left(\pi_f\left(\|f^n - P_zf^n\|_n \le Cu_n\right) > \epsilon\left(\frac{1}{\sqrt{n}\,u_n}\right)^{p+2}\right) = o(1),$$

which implies that

$$n^{(p+2)/2}\,\pi_1\left[A_{Mu_n}\right] = o(1). \tag{33}$$
Now, define

$$S_{n,0} = \left\{\theta :\ KL(p^n_{\theta_0}, p^n_\theta) \le 1,\ V(p^n_{\theta_0}, p^n_\theta) \le 1\right\}, \qquad J_0^{\theta_0}(y^n) = \int_\Theta \frac{p^n_\theta(y^n)}{p^n_{\theta_0}(y^n)}\,d\pi_0(\theta).$$

Then, using the proof of Theorem 1 in McVinish et al. (2009), we have

$$\lim_{C\to\infty}\limsup_n P^n_{\theta_0}\left(J_0^{\theta_0} < e^{-C}\pi_0(S_{n,0})/2\right) = 0.$$

In addition, let $B_{10} = B_{01}^{-1}$, let $\epsilon > 0$, and define

$$\mathcal{A}_n = \left\{y^n :\ J_0^{\theta_0}(y^n) > e^{-C}\pi_0(S_{n,0})/2\right\} \cap \left\{y^n :\ \pi_1(A_{Mu_n}|y^n) > 1 - \epsilon\right\}.$$

Suppose that $y^n \in \mathcal{A}_n$. Then we have

$$B_{10} \le \frac{2e^C n^{(p+2)/2}}{C_1}\,J_1^{\theta_0}(y^n) \le \frac{2e^C n^{(p+2)/2}}{C_1(1-\epsilon)}\int_{A_{Mu_n}} \frac{p^n_\lambda(y^n)}{p^n_{\theta_0}(y^n)}\,d\pi_1(\lambda),$$

and therefore

$$P^n_{\theta_0}\left(B_{10} > e^{2C}n^{(p+2)/2}\pi_1(A_{Mu_n})\right) \le P^n_{\theta_0}\left(\mathcal{A}_n \cap \left\{B_{10} > e^{2C}n^{(p+2)/2}\pi_1(A_{Mu_n})\right\}\right) + P^n_{\theta_0}\left[\mathcal{A}_n^c\right] \le \frac{Ce^{-C}}{2(1-\epsilon)} + P^n_{\theta_0}\left[\mathcal{A}_n^c\right] \le o(1) + 2Ce^{-C} + 8/C^2 \to 0 \quad \text{as } C \to \infty.$$

Hence $B_{01}^{-1} \to 0$ with $P^n_{\theta_0}$-probability tending to 1, which implies that $B_{01}$ converges to infinity with $P^n_{\theta_0}$-probability tending to 1 when the true model is from $M_0$. The proof of Theorem 2 is complete. □
3.3. Remarks on Bayes factor consistency theorems

3.3.1. Remark on Theorem 1

For Theorem 1, the assumptions on $f_0$ deserve further discussion, in particular the assumptions $H(f_0) > 0$ and $f_0 \in L^2[0,1]$, and the prior concentration condition (9). Specifically, we verify that if $H(f_0) > 0$ and if $\int_0^1 f_0(x)^2\,dx < +\infty$, then

$$\pi_f\left(\|f - f_0\|_2 \le \epsilon_n\right) \gtrsim e^{-cn\epsilon_n^2},$$

where $\|g\|_2$ denotes the $L^2$ norm, $\|g\|_2 = (\int |g(x)|^2\,dx)^{1/2}$. This follows from Lemma 5.3 of van der Vaart and van Zanten (2008b). In the case of a fixed design, the same result applies by noting that if $H(f_0) > 0$, with $H(\cdot)$ defined by (13), then there exists $\delta > 0$ such that for $n$ large enough

$$\int_{[-1,1]^p\times[0,1]} \left|(1, w^T)\beta - f_0(x)\right|^2 d\mu^n_{w,x} \ge \delta$$

uniformly in $\beta \in \mathbb{R}^{p+1}$.

3.3.2. Remark on Theorem 2

For Theorem 2, we make a further remark on the assumption on $\tau_k^2$ that ensures the prior concentration conditions on $\pi_f$ for $\|f\|_\infty$ and $H(f)$ in Condition P1(i) and (ii). First, consider the prior concentration of $\pi_f$ for Condition P1(i). If $\tau_k^2 \lesssim k^{-\delta}$, $\delta \ge 3$, then, from the known results in Shen and Wasserman (2001), van der Vaart and van Zanten (2008a) and Castillo (2008), for example, it follows that if $\epsilon_n \gtrsim n^{-(\delta-1)/2\delta}\log n$, there exists, for any $C_1' > 1$, a subset $B_n$ of the set of continuous functions on $[0,1]$ such that

$$\pi_f\left[\|f\|_\infty \le 2\epsilon_n\right] \ge e^{-n\epsilon_n^2} = o\left((\sqrt{n}\epsilon_n)^{-(p+2)}\right), \qquad \pi_f(B_n^c) \le e^{-C_1'n\epsilon_n^2}, \qquad \log N\left(3\epsilon_n, B_n, \|\cdot\|_\infty\right) \le 6Cn\epsilon_n^2$$

for some positive constants $C_1'$, $C_2'$ and $C_3'$. Since $\|f^n - g^n\|_n \le \|f - g\|_\infty$ for all functions $f, g$ on $[0,1]$, Condition P1(i) is satisfied under Assumption A2. Here, $f^n$ denotes the $n$-dimensional vector $f^n = (f(x_1), \dots, f(x_n))^T \in \mathbb{R}^n$ and $\|f^n\|_n^2 = n^{-1}\sum_{i=1}^n f(x_i)^2$.

Next, writing $\|\phi\|_\infty = \max_i\|\phi_i\|_\infty$, we have $\|f\|_\infty \le \|\phi\|_\infty\sum_i \tau_i|z_i|$ and $E_f\left[e^{a\|f\|_\infty}\right] \le 2a\|\phi\|_\infty e^{a^2\|\phi\|_\infty^2\sum_i\tau_i^2/2}$ when $\sum_i \tau_i < \infty$, so that

$$\pi_f\left(\|f\|_\infty > C_2'\sqrt{n}\,\epsilon_n/\sqrt{\log n}\right) = o\left((\sqrt{n}\epsilon_n)^{-\kappa}\right), \quad \forall\kappa > 0.$$

Now, note that $H(f) = \|\tilde{f} - \sum_{j=1}^p m_j\beta_j(f)\|_2^2 + \beta(f)^T\Sigma\beta(f)$, where $\tilde{f} = f - \int_0^1 f(x)\,d\mu(x)$, $m_j(x) = E[w_j|X = x]$, $\beta_j(f) = \langle m_j, f\rangle$, $\beta(f) = (\beta_j(f),\ j = 1, \dots, p)$, and $\Sigma$ is the expectation of the conditional covariance matrix of $w$ given $X$. Thus, without loss of generality, $H(f) \lesssim \epsilon_n^2$ implies that $\|\tilde{f} - \sum_{j=1}^p m_j\beta_j(f)\|_2^2 \lesssim \epsilon_n^2$. Under both Gaussian process priors (6) and (8), $\pi_f(\|\tilde{f}\|_2 \lesssim \epsilon_n) = o(n^{-H})$ for all $H > 0$, which validates Condition P1(ii).

3.4. Numerical illustration

For a numerical illustration of Bayes factor consistency, we consider a simulation study in which synthetic data sets are generated from $Y_i = 1 + 2d_i + f(x_i) + \epsilon_i$, $i = 1, \dots, n$, where $\{\epsilon_i\}$ is a random sample with mean 0 and standard deviation $\sigma = 0.1$. The covariate values $\{x_i\}$ are equally spaced and orthogonal on $(0,1)$, with $x_i = \frac{2i-1}{2n}$, $i = 1, \dots, n$, and $d_i = 2(x_i - 0.5) + \delta_i$, where $\{\delta_i\}$ is a random sample from a normal distribution with mean 0 and standard deviation 0.5. For the nonparametric component $f(\cdot)$, we use two nonlinear curves, $f(x) = \sin(2\pi x)$ and $f(x) = x\cos(4\pi x)$, as well as the zero function $f(x) = 0$, for which the true model is $M_0$.

In the simulation study, we considered 100 data sets with sample sizes 25, 50, 100, 200 and 250 per data set, fitted these data sets using the partial linear model based on a Fourier representation of $f(x)$, and report the averaged Bayes factors in Table 1. For computing the Bayes factor, we provide numerical estimates of the logarithms of the marginal likelihoods, denoted $\log m_0$ and $\log m_1$ under $M_0$ and $M_1$ respectively, based on the Gelfand and Dey approximation (Gelfand and Dey, 1994). As summarized in the first two rows of Table 1, the numerical values of the Bayes factor $B_{01}$, i.e. the difference between $\log m_0$ and $\log m_1$, decrease as the sample size increases when the true model is $M_1$, while, when the true model is $M_0$ with $f(x) = 0$, the estimated Bayes factors increase as the sample size increases, as given in the third row of Table 1. This demonstrates the empirical evidence of Bayes factor consistency in the partial linear model structure.
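For reference, a generic form of the Gelfand–Dey (1994) estimator used above for $\log m_0$ and $\log m_1$ is sketched below. Here `loglik` and `logprior` (assumed to be vectorized over posterior draws) and the Gaussian choice of the tuning density $g$ are placeholders for the model at hand, not part of the original computation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gelfand_dey(draws, loglik, logprior):
    """log m(y) from posterior draws (S x d array) via a Gaussian g.

    Uses the identity E[ g(theta) / (p(y|theta) pi(theta)) | y ] = 1 / m(y).
    """
    g = multivariate_normal(mean=draws.mean(axis=0), cov=np.cov(draws.T))
    log_w = g.logpdf(draws) - loglik(draws) - logprior(draws)
    # mean of exp(log_w) estimates 1/m(y); stabilize with log-sum-exp
    return -(np.logaddexp.reduce(log_w) - np.log(len(draws)))
```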
Table 1. Simulation results for estimation of logarithms of marginal likelihoods and Bayes factors (log m0, log m1, log B01) under M1 and M0.

True model                 |          | n = 25 | n = 50  | n = 100 | n = 200 | n = 250
M1: f(x) = sin(2πx)        | log m0   | 14.43  | 10.985  | −22.271 | −91.99  | −131.7
                           | log m1   |  8.1   |  8.256  |  32.13  | 102.10  |  136.3
                           | log B01  |  6.33  |  2.729  | −54.41  | −194.1  | −268.1
M1: f(x) = x cos(4πx)      | log m0   |  7.616 |  1.681  | −23.71  | −73.92  | −100.84
                           | log m1   |  2.5   | 13.84   |  47.56  | 128.8   |  168.4
                           | log B01  |  5.116 | −12.158 | −71.28  | −202.7  | −269.3
M0: f(x) = 0               | log m0   | 11.480 | 24.080  |  60.100 | 141.200 | 181.600
                           | log m1   |  8.086 | 17.640  |  53.390 | 133.100 | 174.200
                           | log B01  |  3.393 |  6.447  |   6.713 |   8.128 |   7.420

4. Discussion

In this paper, we investigated the consistency of the Bayes factor for independent observations. In particular, we specialized our results to the model selection problem in the context of partial linear models, in which the regression
function is assumed to be the additive form of the linear component and the nonparametric component. Two specific partial linear models using Gaussian process priors were considered, and Bayes factor consistency was investigated for choosing between the parametric model and the semiparametric alternative. These results extended the work of McVinish et al. (2009) to independent observations and complemented their work in the context of Bayesian lack-of-fit testing for partial linear models using Gaussian process priors.

The approach taken in this paper was based on the Gaussian error assumption. A non-Gaussian error distribution can be accommodated, and an unknown error distribution can also be considered. In the proof of the consistency results, the crucial steps that depend on the Gaussian error assumption are the calculations of the Kullback–Leibler neighborhoods of the true parameter and of the Hellinger distances between two probability density functions. These calculations could be extended to non-Gaussian error distributions such as the Laplace distribution, and even to asymmetric and skewed error distributions (Sriram et al., 2013), under suitable analytic assumptions; thus, consistency results similar to Theorems 1 and 2 could also be established. More importantly, when the error distribution is unknown, nonparametric priors such as Dirichlet mixtures of normals can be used for the unknown error distribution. Indeed, the sufficient conditions for the Bayes factor in McVinish et al. (2009) and in the current work may not be easily verified for Dirichlet mixtures of normals, and the popular use of a Dirichlet mixture of normals for the unknown error distribution may lead to inconsistency (Tokdar et al., 2010), compared to, for example, the mixtures of triangular distributions (McVinish et al., 2009) and Bernstein–Dirichlet mixtures (Ghosal et al., 2008). This would be one of the most challenging issues in investigating Bayes factor consistency for independent observations, but the work of Ghosal and van der Vaart (2007a,b) could be useful for developing the underlying theory.

In addition, the consistency results for the Bayes factor considered in this paper could be extended to various Bayesian model selection problems for observations that are not required to be either independent or identically distributed. For example, in addition to Gaussian process priors for the regression function, other families of prior distributions can be incorporated, such as orthogonal priors with wavelets, nonorthogonal priors with spline bases, and the mixture priors discussed in de Jonge and van Zanten (2010). Other problems to be considered further on the basis of the current results include statistical models for dependent observations, such as Markov chains or autoregressive processes.

Acknowledgments

Research of Taeryon Choi was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2012R1A1A2041370). The work of Judith Rousseau is partially supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2010–2013 project Bandhits.

References

Berger, J.O., Guglielmi, A., 2001. Bayesian and conditional frequentist testing of a parametric model versus nonparametric alternatives. J. Amer. Statist. Assoc. 96 (453), 174–184.
Birgé, L., 1983. Robust testing for independent nonidentically distributed variables and Markov chains. In: Specifying Statistical Models (Louvain-la-Neuve, 1981). Lecture Notes in Statist., vol. 16. Springer, New York, pp. 134–162.
Casella, G., Girón, F.J., Martínez, M.L., Moreno, E., 2009. Consistency of Bayesian procedures for variable selection. Ann. Statist. 37 (3), 1207–1228.
Castillo, I., 2008. Lower bounds for posterior rates with Gaussian process priors. Electron. J. Stat. 2, 1281–1299.
Chib, S., Greenberg, E., 2010. Additive cubic spline regression with Dirichlet process mixture errors. J. Econometrics 156 (2), 322–336.
Choi, T., Lee, J., Roy, A., 2009. A note on the Bayes factor in a semiparametric regression model. J. Multivariate Anal. 100 (6), 1316–1327.
Choi, T., Woo, Y., 2015. A partially linear model using a Gaussian process prior. Comm. Statist. Simulation Comput. 44 (7), 1770–1786.
Dass, S.C., Lee, J., 2004. A note on the consistency of Bayes factors for testing point null versus non-parametric alternatives. J. Statist. Plann. Inference 119 (1), 143–152.
de Jonge, R., van Zanten, J.H., 2010. Adaptive nonparametric Bayesian inference using location-scale mixture priors. Ann. Statist. 38 (6), 3300–3320.
Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2013. Regression. Models, Methods and Applications. Springer, Heidelberg.
Gelfand, A.E., Dey, D.K., 1994. Bayesian model choice: asymptotics and exact calculations. J. R. Stat. Soc. Ser. B 56 (3), 501–514.
Gelman, A., 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1 (3), 515–533.
Ghosal, S., Lember, J., van der Vaart, A., 2008. Nonparametric Bayesian model selection and averaging. Electron. J. Stat. 2, 63–89.
Ghosal, S., van der Vaart, A., 2007a. Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist. 35 (1), 192–223.
Ghosal, S., van der Vaart, A., 2007b. Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist. 35 (2), 697–723.
Grenander, U., 1981. Abstract Inference. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Härdle, W., Müller, M., Sperlich, S., Werwatz, A., 2004. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer-Verlag, New York.
Jeffreys, H., 1961. Theory of Probability, third ed. Clarendon Press, Oxford.
Kass, R.E., Raftery, A., 1995. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.
Ko, K., Qu, L., Vannucci, M., 2009. Wavelet-based Bayesian estimation of partially linear regression models with long memory errors. Statist. Sinica 19 (4), 1463–1478.
Koop, G., Poirier, D.J., 2004. Bayesian variants of some classical semiparametric regression techniques. J. Econometrics 123 (2), 259–282.
Kreider, D.L., Kuller, R.G., Ostberg, D.R., Perkins, F.W., 1966. An Introduction to Linear Analysis. Addison-Wesley Publishing Co., Reading, MA.
Kyung, M., 2011. A computational Bayesian method for estimating the number of knots in regression splines. Bayesian Anal. 6.
Le Cam, L., 1986. Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics. Springer-Verlag, New York.
Lenk, P.J., 1999. Bayesian inference for semiparametric regression using a Fourier representation. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 (4), 863–879.
Lenk, P.J., 2003. Bayesian semiparametric density estimation and model verification using a logistic-Gaussian process. J. Comput. Graph. Statist. 12 (3), 548–565.
Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O., 2008. Mixtures of g priors for Bayesian variable selection. J. Amer. Statist. Assoc. 103 (481), 410–423.
McVinish, R., Rousseau, J., Mengersen, K., 2009. Bayesian goodness of fit testing with mixtures of triangular distributions. Scand. J. Statist. 36 (2), 337–354.
Moreno, E., Girón, F.J., Casella, G., 2010. Consistency of objective Bayes factors as the model dimension grows. Ann. Statist. 38 (4), 1937–1952.
Qu, L., 2006. Bayesian wavelet estimation of partially linear models. J. Stat. Comput. Simul. 76 (7), 605–617.
Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA.
Rousseau, J., 2007. Approximating interval hypotheses: p-values and Bayes factors. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 8: Proceedings of the Eighth International Meeting. Oxford University Press.
Shang, Z., Clayton, M.K., 2011. Consistency of Bayesian linear model selection with a growing number of parameters. J. Statist. Plann. Inference 141 (11), 3463–3474.
Shen, X., Wasserman, L., 2001. Rates of convergence of posterior distributions. Ann. Statist. 29 (3), 687–714.
Shorack, G.R., Wellner, J.A., 1986. Empirical Processes with Applications to Statistics. Wiley, New York.
Sriram, K., Ramamoorthi, R.V., Ghosh, P., 2013. Posterior consistency of Bayesian quantile regression based on the misspecified asymmetric Laplace density. Bayesian Anal. 8 (2), 479–504.
Tokdar, S.T., Chakrabarti, A., Ghosh, J.K., 2010. Bayesian nonparametric goodness of fit tests. In: Chen, M.-H., Dey, D.K., Müller, P., Sun, D., Ye, K. (Eds.), Frontiers of Statistical Decision Making and Bayesian Analysis in Honor of James O. Berger. Springer, New York.
van der Vaart, A.W., van Zanten, J.H., 2008a. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist. 36 (3), 1435–1463.
van der Vaart, A.W., van Zanten, J.H., 2008b. Reproducing kernel Hilbert spaces of Gaussian priors. In: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, vol. 3, pp. 200–222.
Verdinelli, I., Wasserman, L., 1998. Bayesian goodness-of-fit testing using infinite-dimensional exponential families. Ann. Statist. 26 (4), 1215–1241.
Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Walker, S., Damien, P., Lenk, P., 2004. On priors with a Kullback–Leibler property. J. Amer. Statist. Assoc. 99 (466), 404–408.