Bayesian consistency for regression models under a supremum distance

Journal of Statistical Planning and Inference 143 (2013) 468–478

Fei Xiang a,*, Stephen G. Walker b

a School of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, UK
b School of Mathematics, Statistics and Actuarial Science, Cornwallis Building, University of Kent, Canterbury, Kent CT2 7NF, UK

* Corresponding author. E-mail addresses: [email protected] (F. Xiang), [email protected] (S.G. Walker).

Article history: Received 8 June 2011; received in revised form 12 June 2012; accepted 6 September 2012; available online 15 September 2012.

Abstract

This paper studies the consistency of Bayesian nonparametric regression models. We concentrate on the use of the sup metric and on dealing with non-stochastic, i.e. designed, covariate values. We illustrate our results on a normal mean regression function and demonstrate the usefulness of a model based on piecewise constant functions.

Keywords: Bayesian nonparametric; Consistency; Regression model; Hellinger neighborhood; Piecewise constant function

1. Introduction

This paper focuses on the posterior Bayesian consistency of regression models under a supremum-based metric and with non-stochastic, or design, covariates. Observations are of the type $(y_i, x_i)_{i=1}^n$ and arise from some fixed but unknown regression model $f_0(y|x)$. Here, the $(x_i)$ are assumed to be fixed, non-random design points from the covariate space $\mathcal{X} = [0,1]$. However, to start the theory, we take $\mathcal{X} = (z_1, z_2, z_3, \ldots)$, although examples in later sections consider values from $[0,1]$. In our notation, however, we will simply refer to $z_l$ as $l$.

We use $f$ to denote a conditional density from the Bayesian model, and for concreteness $f(y|x)$ is assumed to be a density for all $x$ with respect to the Lebesgue measure. The Bayesian prior is written as $\Pi(df)$ on the function space $\mathcal{F}$ and combines with the data to assign posterior mass to a set of conditional densities $A$ as
$$\Pi_n(A) = \frac{\int_A R_n(f)\,\Pi(df)}{\int_{\mathcal{F}} R_n(f)\,\Pi(df)} = \frac{I_A}{I_{\mathcal{F}}},$$
where
$$R_n(f) = \prod_{i=1}^n \frac{f(y_i|x_i)}{f_0(y_i|x_i)}.$$
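To make this definition concrete, here is a minimal Monte Carlo sketch (ours, not from the paper) that approximates $\Pi_n(A)$ for a toy one-parameter normal-mean model: prior draws are weighted by their likelihood ratios $R_n(f)$, exactly as in the display above. The model, the prior and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy truth: f0(y|x) = N(y | theta0(x), 1) with theta0(x) = sin(2*pi*x).
def theta0(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 200)             # a fixed design, in the spirit of the paper
y = theta0(x) + rng.normal(size=200)

# Toy prior on f: f(y|x) = N(y | a*sin(2*pi*x), 1), a ~ N(0, 1); the truth has a = 1.
a_draws = rng.normal(size=5000)

def log_Rn(a):
    # log R_n(f) = sum_i [ log f(y_i|x_i) - log f0(y_i|x_i) ]
    return np.sum(0.5 * (y - theta0(x)) ** 2 - 0.5 * (y - a * theta0(x)) ** 2)

log_w = np.array([log_Rn(a) for a in a_draws])
w = np.exp(log_w - log_w.max())        # stabilized likelihood-ratio weights
w /= w.sum()

# Posterior mass of the set A = {f : |a - 1| < 0.1}.
print("Pi_n(A) ~", float(w[np.abs(a_draws - 1.0) < 0.1].sum()))
```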


Our aim is to study consistency with respect to a sup-based metric, and hence we are interested in
$$A = A_\epsilon = \left\{f : \sup_x H(f(\cdot|x), f_0(\cdot|x)) < \epsilon\right\},$$

where $H(f, f_0)$ is the Hellinger distance between $f$ and $f_0$, given by
$$H(f, f_0) = \left(\int \left(\sqrt{f} - \sqrt{f_0}\right)^2\right)^{1/2}.$$
We want to establish sufficient conditions on $\Pi$ which ensure that

$$\Pi_n(A_\epsilon^c) \to 0 \quad \text{in probability.}$$
There are not many results of this type in the literature. One notable exception is Shively et al. (2009), where such a consistency result is obtained for a normal mean regression function which is assumed to be monotone on a bounded interval.

Posterior consistency is an important issue in Bayesian nonparametrics; a good review is provided by Choi and Ramamoorthi (2008). Most techniques for demonstrating consistency are developed for the case of independent and identically distributed observations. A popular approach is based on sieves and the existence of uniformly consistent tests; see Barron et al. (1999) and Ghosal and Ghosh (1999) for more details. Another approach was introduced in Walker (2003, 2004) and is based on the summability of square roots of prior masses of Hellinger balls covering the parameter space. It is this latter approach that forms the basis of the current paper.

Recent attention has been given to consistency for regression models. For instance, Amewou-Atisso et al. (2003) study the weak consistency of semiparametric linear regression models, and Ghosal and Roy (2006) investigate the consistency of nonparametric binary regression models with unknown link functions using a Gaussian process prior. Choi (2005, 2007) and Choi and Schervish (2007) give alternative methods for the consistency of nonparametric regression models and multiple regression models. However, the assumption in these works is that the covariates are generated independent and identically distributed from some distribution function, say $Q$. Then, taking the integrated Hellinger metric
$$d(f_0, f) = \int H(f_0(\cdot|x), f(\cdot|x))\,dQ(x),$$
the resulting theory is identical to that for the independent and identically distributed case involving $(y_i, x_i)_{i=1}^n$. The work of establishing the conditions on $\Pi$ which provide consistency is typically harder because the model can be high dimensional. A problem with this set-up is that it is not possible to establish consistency for the predictive at a fixed point in the covariate space. Another problem is that it cannot always be justified that the covariates are independent and identically distributed.

When covariates are non-stochastic, the analogous idea is to use the average metric
$$d(f_0, f) = \int H(f_0(\cdot|x), f(\cdot|x))\,dQ_n(x),$$
where $Q_n$ is the empirical distribution of the $(x_i)$. Here consistency is established by showing $\Pi_n(A_{\epsilon,n}^c) \to 0$, where
$$A_{\epsilon,n} = \left\{f : \frac{1}{n}\sum_{i=1}^n H(f(\cdot|x_i), f_0(\cdot|x_i)) < \epsilon\right\}.$$
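To contrast the sup metric of this paper with the averaged metric above, the following sketch (our illustration; the closed form for the Hellinger distance between two normal densities is the one derived later in Section 4.2) evaluates both over a grid of covariate values for two toy normal mean functions.

```python
import numpy as np

def hellinger_normal(m1, m2, s1, s2):
    # H^2 = 2 - 2*sqrt(2*s1*s2/(s1^2+s2^2)) * exp(-(m1-m2)^2 / (4*(s1^2+s2^2)))
    aff = np.sqrt(2 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * \
          np.exp(-(m1 - m2) ** 2 / (4 * (s1 ** 2 + s2 ** 2)))
    return np.sqrt(2 - 2 * aff)

xs = np.linspace(0, 1, 1001)           # grid proxy for x in [0, 1]
theta0 = np.sin(2 * np.pi * xs)        # toy true mean function (our choice)
theta = theta0 + 0.3 * xs              # a competing mean function

H = hellinger_normal(theta, theta0, 1.0, 1.0)
print("sup_x H(f(.|x), f0(.|x)) :", H.max())    # the sup metric of this paper
print("average over the design  :", H.mean())   # the averaged metric d(f0, f)
```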

Consistency in the averaged sense is a rather weak result and still says nothing about consistency at a particular choice of $x$.

The layout of the paper is as follows. In Section 2 we describe the necessary preliminaries, including the basic assumptions. Section 3 contains the consistency results for both the posterior and predictive distributions. In Section 4 we illustrate the findings by establishing consistency for the normal model
$$f_0(y|x) = N(y|\theta_0(x), \sigma_0^2).$$
Finally, Section 5 contains a discussion and details future work.

2. Preliminaries

This section describes the basic concepts and ideas behind the strategies adopted in this paper. To start, we assume that the covariates take values in $\mathcal{X} = \{z_1, z_2, \ldots\}$ and, for simplicity, we will typically represent these as integers; i.e. $\mathcal{X} = \{1, 2, \ldots\}$. Hence, we are interested in the set
$$A_\epsilon = \{f : H(f(\cdot|l), f_0(\cdot|l)) < \epsilon\ \ \forall l = 1, 2, \ldots\}$$
and we will also define
$$A_{\epsilon,l} = \{f : H(f(\cdot|l), f_0(\cdot|l)) < \epsilon\}$$


so that
$$A_\epsilon = \bigcap_l A_{\epsilon,l}.$$
Our aim is to find conditions on $\Pi$ for which $\Pi_n(A_\epsilon^c) \to 0$ in probability, and hence we become interested in the set
$$A_\epsilon^c = \bigcup_l A_{\epsilon,l}^c.$$

We will be taking finite versions of the model $f$ for each sample size $n$ and will denote the size of the model by $N$, typically, though not always, suppressing the dependence on $n$. As $n \to \infty$ it will be that $N \to \infty$, and also $n/N \to \infty$. Our technique then relies on the idea that if
$$H(f(\cdot|l), f_0(\cdot|l)) > \epsilon$$
for some $l \in \{1, 2, \ldots\}$, then for some $l^* \in \{1, \ldots, N\}$ it is that
$$H(f(\cdot|l^*), f_0(\cdot|l^*)) > \epsilon_N$$
for some $\epsilon_N > 0$. To make this idea more general, let us say for $x \in \mathcal{X}$, now the continuous covariate space $\mathcal{X} = [0,1]$: if $H(f(\cdot|x), f_0(\cdot|x)) > \epsilon$ for some $x \in \mathcal{X}$, then for some $x^* \in C_N = \{z_1, \ldots, z_N\}$, with each $z_j \in \mathcal{X}$, it is that $H(f(\cdot|x^*), f_0(\cdot|x^*)) > \epsilon_N$ for some $\epsilon_N > 0$. We will write $\epsilon_N = \epsilon p_N$ and note that, for each $N$, $C_N$ is a fixed set. Hence, working now under the assumption $\mathcal{X} = [0,1]$, we can write, for any $N$,
$$A_\epsilon^c \subseteq \bigcup_{l=1}^N \{f : H(f(\cdot|l), f_0(\cdot|l)) > \epsilon_N\} = \bigcup_{l=1}^N A_{\epsilon_N,l}^c.$$

So, we have
$$\Pi_n(A_\epsilon^c) \le \sum_{l=1}^N \Pi_n(A_{\epsilon_N,l}^c).$$

Now each $A_{\epsilon,l}$ defines a set of densities, since the covariate value is the same. Hence, this set of densities can be covered by sets of densities based on Hellinger balls of arbitrary size. This follows from the separability of densities with respect to the Hellinger metric. Hence, for each $l \in \{1, \ldots, N\}$, there exist disjoint sets $\{A_{\epsilon_N,l,j};\ j = 1, 2, \ldots\}$ such that
$$A_{\epsilon_N,l}^c = \bigcup_j A_{\epsilon_N,l,j}.$$
Each $A_{\epsilon_N,l,j}$ is a subset of a Hellinger ball, centered on $f_{l,j}$, and of size $\frac{1}{2}\epsilon_N$. Consequently,
$$\Pi_n(A_\epsilon^c) \le \sum_{l=1}^N \sum_{j=1}^\infty \Pi_n(A_{\epsilon_N,l,j}).$$

The posterior probability for a set $A$ is now written as
$$\Pi_n(A) = \frac{\int_A R_n(f)\,\Pi(df)}{\int_{\mathcal{F}} R_n(f)\,\Pi(df)} = \frac{I_{n,A}}{I_n},$$
where we have again suppressed the dependence on $N$. Since we are dealing with $N$, which will depend on $n$, and we need to consider $\epsilon_N$, it follows that we need to proceed as though working with rates-of-convergence ideas. We shall also rewrite the prior as $\Pi_N$, as it is the belief about $f$ at each $l \in \{1, 2, \ldots, N\}$. The hard part will be dealing with the numerator, and hence we first consider the denominator here; we deal with the numerator in Section 3.


The following lemma deals appropriately with the denominator. It is similar to Lemma 8.1 in Ghosal et al. (2000). We need to define the Kullback–Leibler divergence of two densities $f$ and $g$ as
$$K(f,g) = \int f \log\frac{f}{g}$$
and we also define
$$V(f,g) = \int f \left(\log\frac{f}{g}\right)^2.$$
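For normal conditional densities both quantities are straightforward to evaluate; the sketch below (our illustration, using SciPy quadrature, with the standard closed form for the normal Kullback–Leibler divergence as a check) computes $K(f_0, f)$ and $V(f_0, f)$.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def K_and_V(m0, s0, m, s):
    """K(f0,f) = int f0 log(f0/f) and V(f0,f) = int f0 (log(f0/f))^2, by quadrature."""
    f0, f = norm(m0, s0), norm(m, s)
    lr = lambda y: f0.logpdf(y) - f.logpdf(y)
    K, _ = integrate.quad(lambda y: f0.pdf(y) * lr(y), -np.inf, np.inf)
    V, _ = integrate.quad(lambda y: f0.pdf(y) * lr(y) ** 2, -np.inf, np.inf)
    return K, V

K, V = K_and_V(0.0, 1.0, 0.3, 1.2)
# Closed form for normals: K = log(s/s0) + (s0^2 + (m - m0)^2) / (2 s^2) - 1/2.
print(K, np.log(1.2 / 1.0) + (1.0 + 0.09) / (2 * 1.44) - 0.5, V)
```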

Lemma 1. Let
$$K_l(f) = K(f_0(\cdot|z_l), f(\cdot|z_l)) \quad\text{and}\quad V_l(f) = V(f_0(\cdot|z_l), f(\cdot|z_l)).$$
Suppose that for a sequence $\tilde\epsilon_n$ with $\tilde\epsilon_n \to 0$ and $n\tilde\epsilon_n^2 \to \infty$, and some constant $C > 0$, the prior satisfies
$$\Pi_N\{f : K_l(f) < \tilde\epsilon_n^2,\ V_l(f) < \tilde\epsilon_n^2\ \ \forall l = 1, \ldots, N\} \ge \exp(-n\tilde\epsilon_n^2 C)$$
for all large $n$. Then
$$\int R_n(f)\,\Pi_N(df) \ge \exp\{-n\tilde\epsilon_n^2(C+1)\} \quad\text{in probability.}$$

Proof. See Appendix.

3. Consistency results

For reasons that will become clear later, we take the first $n$ covariates from the $N$ values discussed in Section 2. Therefore, we have $x_i = z_1$ for $i = 1, \ldots, r_n$, and $x_i = z_2$ for $i = r_n + 1, \ldots, 2r_n$, and so on, up to $x_i = z_N$ for $i = (N-1)r_n + 1, \ldots, Nr_n$, so that $Nr_n = n$. We are not implying that the order must be this way; we only require that, among the $n$ samples, we take $r_n$ observations at each covariate value $l \in \{1, \ldots, N\}$, in any order. For the mathematics, it is convenient to assign a particular order; a minimal sketch of generating such a design is given below.
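The following lines (our illustration; the choice $z_j = j/N$ matches the example of Section 4) generate the blocked design just described.

```python
import numpy as np

n, N = 1000, 10             # sample size and number of distinct design points
r_n = n // N                # r_n = n/N replicates at each design point
z = np.arange(1, N + 1) / N            # design points z_1, ..., z_N
x = np.repeat(z, r_n)                  # x_i = z_j for (j-1) r_n < i <= j r_n
assert x.shape == (n,)                 # N r_n = n
```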

Now define, for $l = 1, \ldots, N$,
$$L_{n,l,j} = \int_{A_{\epsilon_N,l,j}} R_n(f)\,\Pi(df),$$
from which we can see a standard result, e.g. Walker (2003), that
$$\frac{L_{n,l,j}}{L_{n-1,l,j}} = \frac{f_{n-1,l,j}(y_n|x_n)}{f_0(y_n|x_n)},$$
where $f_{n-1,l,j}(\cdot|x_n)$ is the predictive density for $y_n$ given $x_n$ based on a posterior restricted and normalized to the set $A_{\epsilon_N,l,j}$ and given the sample $(x_i, y_i)_{i=1}^{n-1}$. Therefore,
$$E\left(\frac{L_{n,l,j}^{1/2}}{L_{n-1,l,j}^{1/2}}\,\Big|\, s_{n-1}\right) = 1 - \tfrac12 H^2(f_{n-1,l,j}(\cdot|x_n), f_0(\cdot|x_n)),$$
where $s_n = \sigma((x_1,y_1), \ldots, (x_n,y_n))$. Taking $z_1$ for example: if $x_n \ne z_1$ then we can bound the right-hand side by 1, i.e.
$$E\left(\frac{L_{n,1,j}^{1/2}}{L_{n-1,1,j}^{1/2}}\,\Big|\, s_{n-1}\right) \le 1.$$
On the other hand, if $x_n = z_1$, then we have
$$H(f_{n-1,1,j}(\cdot|x_n), f_0(\cdot|x_n)) \ge H(f_{1,j}(\cdot|x_n), f_0(\cdot|x_n)) - H(f_{n-1,1,j}(\cdot|x_n), f_{1,j}(\cdot|x_n)),$$
where $f_{1,j}$ is the density at the center of $A_{\epsilon_N,1,j}$. Hence $H(f_{1,j}(\cdot|x_n), f_0(\cdot|x_n)) > \epsilon_N$ and, due to the convexity of the Hellinger balls, $H(f_{n-1,1,j}(\cdot|x_n), f_{1,j}(\cdot|x_n)) \le \frac12\epsilon_N$, since the Hellinger balls are of size $\frac12\epsilon_N$. Thus
$$\tfrac12 H^2(f_{n-1,1,j}(\cdot|x_n), f_0(\cdot|x_n)) \ge \tfrac14\epsilon_N^2$$


and so for $x_n = z_1$ we have
$$E\left(\frac{L_{n,1,j}^{1/2}}{L_{n-1,1,j}^{1/2}}\,\Big|\, s_{n-1}\right) \le 1 - \tfrac14\epsilon_N^2.$$
Putting this together with the covariates selected, we have
$$E\left(L_{n,1,j}^{1/2}\right) \le \left(1 - \tfrac14\epsilon_N^2\right)^{r_n} \sqrt{\Pi(A_{\epsilon_N,1,j})}.$$
Now observe that
$$\Pi_n(A_\epsilon^c) \le \frac{\sum_{l,j}\sqrt{L_{n,l,j}}}{\sqrt{I_n}}.$$
According to the condition appearing in Lemma 1, the denominator has a lower bound, i.e.
$$\sqrt{I_n} \ge \exp\left\{-\frac{n}{2}\tilde\epsilon_n^2(C+1)\right\}$$
in probability. For the numerator, we see that
$$P\left(\sum_{l,j} L_{n,l,j}^{1/2} > e^{-n\delta_n^2}\right) \le e^{n\delta_n^2}\,e^{-(1/4)n\epsilon_N^2/N}\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})},$$
and so if
$$e^{n\delta_n^2}\,e^{-(1/4)n\epsilon_N^2/N}\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})} \to 0$$
then
$$\sum_{l,j} L_{n,l,j}^{1/2} < e^{-n\delta_n^2}$$
in probability. We are obviously interested in the case when $\delta_n^2 = C_1\tilde\epsilon_n^2$ for some constant $C_1$, and hence we are interested in determining conditions on the prior for which
$$e^{n[C_1\tilde\epsilon_n^2 - (1/4)\epsilon_N^2/N]}\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})} \to 0.$$

We can put this into the following theorem.

Theorem 1. Since $H(f_N(\cdot|x), f_0(\cdot|x)) > \epsilon$ for some $x \in \mathcal{X}$ implies $H(f_N(\cdot|l), f_0(\cdot|l)) > \epsilon_N$ for some $l \in \{1, \ldots, N\}$, if the prior $\Pi_N$ satisfies the condition of Lemma 1 and
$$e^{n[C_1\tilde\epsilon_n^2 - (1/4)\epsilon_N^2/N]}\sum_{l,j}\sqrt{\Pi_N(A_{\epsilon_N,l,j})} \to 0,$$
then
$$\Pi_n(A_\epsilon^c) \to 0$$
in probability when the covariates are taken according to $x_i = z_j$ for $(j-1)r_n + 1 \le i \le jr_n$, for $j = 1, \ldots, N$ and $r_n = n/N$.

We note that it is desirable to have $\epsilon_N = p\epsilon$ for fixed $p$, or specifically that $p_N$ does not go to 0. This will become clear in the example of Section 4. The predictive based on the sequence of posteriors $\Pi_n$ and any new fixed $x$ is given by
$$\hat f(y|x) = \int f(y|x)\,\Pi_n(df).$$
The next theorem shows that this predictive is also consistent.

Theorem 2. If the posterior is consistent, i.e.
$$\Pi_n(A_\epsilon^c) \to 0$$


in probability, then $H(\hat f(\cdot|x), f_0(\cdot|x)) \to 0$ in probability.

Proof. By Jensen's inequality and the convexity of the squared Hellinger distance we have
$$H(\hat f(\cdot|x), f_0(\cdot|x)) = H\left(\int f(\cdot|x)\,\Pi_n(df),\ f_0(\cdot|x)\right) < \int H(f(\cdot|x), f_0(\cdot|x))\,\Pi_n(df)$$
$$= \int_{A_\epsilon^c} H(f(\cdot|x), f_0(\cdot|x))\,\Pi_n(df) + \int_{A_\epsilon} H(f(\cdot|x), f_0(\cdot|x))\,\Pi_n(df) < \sqrt{2}\,\Pi_n(A_\epsilon^c) + \epsilon,$$
since $H \le \sqrt{2}$. Thus the proof is completed by the fact that $\epsilon$ is arbitrary and $\Pi_n(A_\epsilon^c) \to 0$ in probability. □

4. Example

This section considers a normal example, assuming that $y$ is normally distributed with mean function $\theta(x)$ and variance $\sigma^2$, so prior distributions will be constructed for $\theta$ and $\sigma$. We also assume that the covariate space is $\mathcal{X} = [0,1]$. Following Shively et al. (2009), we need to make assumptions on the true function $\theta_0(x)$ in order to obtain results for the sup metric; the assumption here is less restrictive than that of Shively et al. (2009). We will also discuss the choice of $\theta_N$. From Sections 2 and 3 it turns out to be quite important that $\epsilon_N = p_N\epsilon$ with $p_N$ staying away from 0. For example, using a polynomial of order $N$ for modeling $\theta_N$ results in a $p_N$ which goes to 0 too fast; this precludes a consistency result, since the numerator and denominator bounds can no longer be balanced. A form of function for $\theta_N$ that does allow $p_N \to 1$ is a piecewise constant function. So we define, for $x \in ((j-1)/N, j/N]$ and $j = 1, \ldots, N$,
$$\theta_N(x) = b_j, \qquad (1)$$
and each $b_j$ is allocated an independent normal prior distribution with zero mean and variance $\tau_j^2$.
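A draw of $\theta_N$ from this prior can be sketched as follows (our illustration; the choice $\tau_j = j^{-s}$ with $s > 1$ is the one used in the proof of Theorem 4 in the Appendix).

```python
import numpy as np

rng = np.random.default_rng(1)
N, s = 20, 1.5
tau = np.arange(1, N + 1) ** (-s)     # tau_j = j^{-s}, s > 1 (Appendix choice)
b = rng.normal(0.0, tau)              # b_j ~ N(0, tau_j^2), independent

def theta_N(x):
    """Piecewise constant function (1): theta_N(x) = b_j for x in ((j-1)/N, j/N]."""
    j = np.ceil(np.asarray(x, dtype=float) * N).astype(int)
    j = np.clip(j, 1, N)              # maps x = 0 into the first interval
    return b[j - 1]

print(theta_N([0.03, 0.5, 1.0]))
```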

Lemma 2. Assume that $\theta_0$ is Lipschitz continuous on $[0,1]$, so that for any $x_1, x_2 \in \mathcal{X}$, $|\theta_0(x_1) - \theta_0(x_2)| \le C|x_1 - x_2|$ for some constant $C < +\infty$. If $|\theta_N(x) - \theta_0(x)| > \epsilon$ for some $x \in [0,1]$, then for some $j \in \{1, \ldots, N\}$ it is that $|\theta_N(j/N) - \theta_0(j/N)| > \epsilon_N$, where $\epsilon_N = \epsilon - \phi/N$ and $\phi$ depends only on $\theta_0$.

Proof. We have that, for $x \in ((j-1)/N, j/N]$,
$$|\theta_N(j/N) - \theta_0(j/N)| > |\theta_N(j/N) - \theta_0(x)| - |\theta_0(x) - \theta_0(j/N)|.$$
Since $\theta_N$ is constant on $((j-1)/N, j/N]$, the first term on the right side equals $|\theta_N(x) - \theta_0(x)|$ and so is lower bounded by $\epsilon$; the second term has an upper bound of $\phi/N$ due to the Lipschitz condition. This completes the proof. □

4.1. Denominator

We first deal with the denominator, and therefore we need to consider
$$K_l(\theta,\sigma) = K(f_{\theta_0,\sigma_0}(\cdot|l), f_{\theta,\sigma}(\cdot|l)) \quad\text{and}\quad V_l(\theta,\sigma) = V(f_{\theta_0,\sigma_0}(\cdot|l), f_{\theta,\sigma}(\cdot|l))$$
for $l \in \{1, \ldots, N\}$. The next result introduces a way to calculate the prior mass on the Kullback–Leibler neighborhood of $f_0$.

Theorem 3. For all $\delta > 0$, there exist constants $c_1 > 0$ and $c_2 > 0$, independent of $\delta$, such that if
$$\sup_x|\theta(x) - \theta_0(x)| < \delta \quad\text{and}\quad |(\sigma_0/\sigma) - 1| < \delta,$$
then it is that
$$\sup_x K_x(\theta,\sigma) < c_1\delta \quad\text{and}\quad \sup_x V_x(\theta,\sigma) < c_2\delta.$$

Proof. See Appendix.

Hence, we focus on finding a lower bound for
$$\Pi\left\{(\theta,\sigma) : \sup_{j\in\{1,\ldots,N\}}|\theta_N(j/N) - \theta_0(j/N)| < \delta,\ |(\sigma_0/\sigma) - 1| < \delta\right\}.$$
If we model $\theta$ and $\sigma$ independently, then we can separate this into
$$\Pi_\theta\left\{\theta : \sup_j|\theta_N(j/N) - \theta_0(j/N)| < \delta\right\} \quad\text{and}\quad \Pi_\sigma\{|(\sigma_0/\sigma) - 1| < \delta\}.$$


For the $\sigma$ probability, it is easy to show that
$$\Pi_\sigma\{|(\sigma_0/\sigma) - 1| < \delta\} \approx 2\pi_\sigma(\sigma_0)\,\delta\sigma_0$$
for small $\delta$, where $\pi_\sigma$ is the prior density for $\sigma$ and we assume $\Pi_\sigma$ puts positive mass around $\sigma_0$, which is easy to achieve in practice by taking the prior to be something like a gamma density. To confirm this result, it is of interest to compute
$$\Pi\left(\frac{\sigma_0}{1+\delta} < \sigma < \frac{\sigma_0}{1-\delta}\right) = \int_{\sigma_0/(1+\delta)}^{\sigma_0/(1-\delta)} \pi(u)\,du \approx 2\pi(\sigma_0)\frac{\delta\sigma_0}{(1-\delta)(1+\delta)} \approx 2\pi(\sigma_0)\,\delta\sigma_0$$
given that $\delta$ is very small. For the $\theta$ probability, we note that
$$\Pi_\theta\left\{\theta : \sup_j|\theta_N(j/N) - \theta_0(j/N)| < \delta\right\} = \Pi_b\left\{b : \sup_j|b_j - \theta_0(j/N)| < \delta\right\}.$$
This is given by
$$\prod_{j=1}^N \Pi_{b_j}\{b_j : |b_j - \theta_0(j/N)| < \delta\} = \prod_{j=1}^N \int_{\theta_0(j/N)-\delta}^{\theta_0(j/N)+\delta} \frac{1}{\tau_j\sqrt{2\pi}}\,e^{-u^2/(2\tau_j^2)}\,du,$$
which for small $\delta$ is asymptotically equivalent to
$$\prod_{j=1}^N \frac{2\delta}{\tau_j\sqrt{2\pi}}\exp\{-\theta_0^2(j/N)/(2\tau_j^2)\}.$$
Since $\theta_0$ is bounded and the $(\tau_j)$ are fixed constants, for some $c < 1$ it is that
$$\Pi\left\{(\theta,\sigma) : \sup_{j\in\{1,\ldots,N\}}|\theta_N(j/N) - \theta_0(j/N)| < \delta,\ |(\sigma_0/\sigma) - 1| < \delta\right\} > c^N\delta^{N+1}.$$
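The small-$\delta$ approximation for the $\sigma$ probability above is easy to check numerically; here is a brief sketch with a gamma prior (the paper's suggestion; the shape and scale values are our arbitrary choices).

```python
import numpy as np
from scipy.stats import gamma

sigma0, delta = 1.3, 0.01
prior = gamma(a=2.0, scale=1.0)       # gamma prior density pi_sigma for sigma

# Exact mass of {sigma : |sigma0/sigma - 1| < delta} = (sigma0/(1+delta), sigma0/(1-delta)).
exact = prior.cdf(sigma0 / (1 - delta)) - prior.cdf(sigma0 / (1 + delta))
approx = 2 * prior.pdf(sigma0) * delta * sigma0   # 2 pi_sigma(sigma0) delta sigma0
print(exact, approx)                  # the two agree up to O(delta^2)
```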

Therefore, to apply Lemma 1, we would be looking for $\tilde\epsilon_n^2$ such that
$$-N\log\tilde\epsilon_n^2 - N\log c < n\tilde\epsilon_n^2,$$
which is satisfied by
$$\tilde\epsilon_n^2 = (N/n)^{1-a}$$
for any $a > 0$.

4.2. Numerator

We now look at the numerator. First, we establish that
$$\tfrac12 H^2(f(\cdot|x), f_0(\cdot|x)) = 1 - \sqrt{t(\sigma,\sigma_0)}\exp\left\{-\frac14\,\frac{(\theta(x)-\theta_0(x))^2}{\sigma^2+\sigma_0^2}\right\},$$
where
$$t(\sigma,\sigma_0) = \frac{2\sigma\sigma_0}{\sigma^2+\sigma_0^2}.$$
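This closed form (the standard Hellinger affinity between two normal densities) can be verified against direct quadrature; a brief check, ours:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def half_H2_closed(m, m0, s, s0):
    # 1/2 H^2 = 1 - sqrt(t(s, s0)) * exp(-(m - m0)^2 / (4 (s^2 + s0^2)))
    t = 2 * s * s0 / (s ** 2 + s0 ** 2)
    return 1 - np.sqrt(t) * np.exp(-(m - m0) ** 2 / (4 * (s ** 2 + s0 ** 2)))

def half_H2_quad(m, m0, s, s0):
    # 1/2 * integral of (sqrt(f) - sqrt(f0))^2 dy, computed numerically
    g = lambda y: (np.sqrt(norm.pdf(y, m, s)) - np.sqrt(norm.pdf(y, m0, s0))) ** 2
    return 0.5 * integrate.quad(g, -np.inf, np.inf)[0]

print(half_H2_closed(0.4, 0.0, 1.2, 1.0))   # ~0.0244
print(half_H2_quad(0.4, 0.0, 1.2, 1.0))     # matches to quadrature accuracy
```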

So, if $H(f(\cdot|x), f_0(\cdot|x)) > \epsilon^*$ for some $\epsilon^*$, which depends on $\epsilon$, then either
$$|(\sigma_0/\sigma) - 1| > \epsilon \quad\text{or}\quad |\theta_0(x) - \theta(x)| > \epsilon.$$
Thus
$$A_\epsilon^c \subseteq \bigcup_{l=1}^N \{\theta : |\theta(l) - \theta_0(l)| > \epsilon_N\}\ \cup\ \{\sigma : |(\sigma_0/\sigma) - 1| > \epsilon\}.$$
The final term of the union can be dealt with straightforwardly; it is quite easy to show that
$$\int_{B_\epsilon} R_n(f)\,\Pi(df) < e^{-nc_\epsilon}$$


a.s. for all large $n$ for some $c_\epsilon > 0$, where $B_\epsilon = \{\sigma : |(\sigma_0/\sigma) - 1| > \epsilon\}$. The union of the remaining terms can be dealt with using Theorem 1. So, since we have
$$\tilde\epsilon_n^2 = (N/n)^{1-a}$$
for any $a > 0$, we need to determine $N_n$ such that $N_n\tilde\epsilon_n^2 \to 0$. This happens when we take $N$ so that
$$N^{2-a}/n^{1-a} \to 0.$$
Therefore, $N = n^{(1/2)-a}$ for any $a > 0$ is sufficient. Now we are only left with showing that
$$\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})} < M_N,$$
where $M_N < \infty$ and $M_{N_n} e^{-En/N_n} \to 0$.
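The interplay of these rates is easy to check numerically. In the sketch below (ours; $\tau$ and $E$ are placeholder constants), $N_n\tilde\epsilon_n^2 \to 0$ while $\log(M_N e^{-En/N}) \to -\infty$, taking $M_N = Ne^{\tau N}$, the form obtained in the Appendix.

```python
import numpy as np

a, tau, E = 0.05, 0.5, 1.0                  # a > 0; tau, E illustrative constants
for n in [1e4, 1e6, 1e8, 1e10]:
    N = n ** (0.5 - a)                      # N_n = n^{(1/2) - a}
    eps2 = (N / n) ** (1 - a)               # eps_n^2 = (N/n)^{1 - a}
    print(int(n),
          N * eps2,                         # N_n * eps_n^2 -> 0
          np.log(N) + tau * N - E * n / N)  # log(M_N e^{-E n/N}) -> -infinity
```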

Theorem 4. The sum of the square-rooted prior masses of the Hellinger covering balls is bounded:
$$\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})} < M_N$$
with $e^{-n/N}M_N \to 0$.

Proof. See Appendix.

In summary, we have shown that the numerator is bounded above by $e^{-nE/N}M_N$ for some constant $E > 0$, where $N_n = n^{(1/2)-a}$ for any $a > 0$, and $M_N e^{-E(n/N)} \to 0$. On the other hand, the denominator is bounded below by $e^{-n(N/n)^{1-a}}$ for any $a > 0$. Putting these together yields the desired consistency.

5. Discussion

In this paper we have concentrated on establishing consistency for Bayesian regression models with non-stochastic covariates and with respect to a sup Hellinger metric. The example has illuminated the strategy of constructing the prior based on the sample size, and we would advocate such a procedure. Bayesian infinite mixture models, with each mixture component representing a nonparametric regression model, would appear to be too unidentifiable, and with a finite data set are clearly over-parameterized. Allowing the model to increase with the data makes perfect sense. For the mean regression function we advocate a piecewise constant model with effectively $\sqrt{n}$ pieces. We would next extend the theory to mixture models, and we anticipate working with $N_n$ mixtures for a sample of size $n$, where $N_n$ would need to be determined to obtain the sup Hellinger consistency. Work on this is ongoing.

We are not pursuing the standard Bayesian consistency ideas for regression, for a number of reasons. The theory, once the covariates are assumed to be i.i.d., is just that for i.i.d. data, and the difficulty lies in verifying the conditions for the i.i.d. case, which now need to be worked out in higher dimensions. Additionally, covariates are rarely i.i.d. and are often designed. Our results give an indication of how to take covariates to ensure consistency. Finally, we are able to establish consistency for the predictive at a particular choice of covariate value, which is not possible if the covariates are assumed to be i.i.d.

Acknowledgements

The authors would like to thank two anonymous referees for their comments and suggestions, which have allowed us to improve the paper.

Appendix A

Proof of Lemma 1. Each $f(\cdot|z_l)$ is a density indexed by $z_l$, and the $(x_i)_{i=1}^n$ take values from $(z_1, \ldots, z_N)$. Therefore, $K_l(f) < \tilde\epsilon_n^2$ and $V_l(f) < \tilde\epsilon_n^2$ for $l = 1, 2, \ldots$ are equivalent to
$$K_i(f) = K(f_0(\cdot|x_i), f(\cdot|x_i)) < \tilde\epsilon_n^2 \quad \forall i = 1, \ldots, n$$
and
$$V_i(f) = V(f_0(\cdot|x_i), f(\cdot|x_i)) < \tilde\epsilon_n^2 \quad \forall i = 1, \ldots, n.$$


Hence the condition on the prior is equivalent to
$$\Pi\{f : K_i(f) < \tilde\epsilon_n^2,\ V_i(f) < \tilde\epsilon_n^2\ \ \forall i = 1, \ldots, n\} \ge \exp(-n\tilde\epsilon_n^2 C)$$
for all large $n$. The proof follows similarly to Ghosal et al. (2000). By Jensen's inequality,
$$\log\int\prod_{i=1}^n \frac{f}{f_0}(y_i|x_i)\,\Pi(df) \ge \sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df).$$
Thus
$$P\left(\int\prod_{i=1}^n\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \le \exp(-n\tilde\epsilon_n^2(C+1))\right) = P\left(\log\int\prod_{i=1}^n\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \le -n\tilde\epsilon_n^2(C+1)\right)$$
$$\le P\left(\sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \le -n\tilde\epsilon_n^2(C+1)\right) = P\left(\frac{\sqrt n}{n}\sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \le -\sqrt n\,\tilde\epsilon_n^2(C+1)\right)$$
$$= P\left(\frac{\sqrt n}{n}\sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df) - P^* \le -\sqrt n\,\tilde\epsilon_n^2(C+1) - P^*\right),$$
where
$$P^* = \frac{\sqrt n}{n}\sum_{i=1}^n\int\!\!\int f_0(y_i|x_i)\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\,dy_i,$$
so $P^* > -\sqrt n\,\tilde\epsilon_n^2$. Therefore,
$$P\left(\int\prod_{i=1}^n\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \le \exp(-n\tilde\epsilon_n^2(C+1))\right) \le P\left(\frac{\sqrt n}{n}\sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df) - P^* \le -\sqrt n\,\tilde\epsilon_n^2 C\right).$$
Since
$$E\left[\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\right] = \int\!\!\int f_0(y_i|x_i)\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\,dy_i,$$
applying Chebyshev's inequality, the previous probability is bounded above by
$$\frac{\mathrm{Var}\left[\frac{\sqrt n}{n}\sum_{i=1}^n\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\right]}{n\tilde\epsilon_n^4 C^2} = \frac{\frac1n\sum_{i=1}^n\mathrm{Var}\left[\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\right]}{n\tilde\epsilon_n^4 C^2}$$
$$\le \frac{\frac1n\sum_{i=1}^n\int f_0\left(\int\log\frac{f}{f_0}(y_i|x_i)\,\Pi(df)\right)^2 dy_i}{n\tilde\epsilon_n^4 C^2} \le \frac{\frac1n\sum_{i=1}^n\int\!\!\int f_0\left(\log\frac{f}{f_0}(y_i|x_i)\right)^2\Pi(df)\,dy_i}{n\tilde\epsilon_n^4 C^2} \quad\text{(by Jensen's inequality)}$$
$$\le \frac{\tilde\epsilon_n^2}{n\tilde\epsilon_n^4 C^2} = \frac{1}{n\tilde\epsilon_n^2 C^2} \to 0.$$
Therefore,
$$\int\prod_{i=1}^n\frac{f}{f_0}(y_i|x_i)\,\Pi(df) \ge \exp(-n\tilde\epsilon_n^2(C+1)) \quad\text{in probability,}$$
completing the proof. □

Proof of Theorem 3. If
$$\sup_x|\theta(x) - \theta_0(x)| < \delta \quad\text{and}\quad \left|\frac{\sigma_0}{\sigma} - 1\right| < \delta,$$
then
$$\sup_x|\theta(x) - \theta_0(x)|^2 < \delta^2 \quad\text{and}\quad \left|\frac{\sigma^2}{\sigma_0^2} - 1\right| < \frac{2\delta - \delta^2}{(1-\delta)^2}.$$
Therefore, let $c = 2\delta/(1-\delta)^2 > (2\delta - \delta^2)/(1-\delta)^2$, so it is also true that
$$\sup_x(\theta(x) - \theta_0(x))^2 < c \quad\text{and}\quad \left|\frac{\sigma^2}{\sigma_0^2} - 1\right| < c.$$
Now
$$K_x(\theta,\sigma) = \frac12\log\frac{\sigma^2}{\sigma_0^2} - \frac12\left(1 - \frac{\sigma_0^2}{\sigma^2}\right) + \frac{[\theta(x) - \theta_0(x)]^2}{2\sigma^2} < \frac12 c + \frac12\frac{c}{1-c} + \frac12\frac{c}{(1-c)\sigma_0^2}.$$
Hence
$$K_x(\theta,\sigma) < (2 + \sigma_0^{-2})\frac{c}{1-c}.$$
In addition,
$$V_x(\theta,\sigma) \le 2\left[\frac{\sigma_0^2}{2\sigma^2} - \frac12\right]^2 + \frac{\sigma_0^2}{\sigma^4}\{\theta_0(x) - \theta(x)\}^2 < \frac{3c}{2(1-c)^2}.$$
Since
$$\frac{c}{1-c} = \frac{2\delta}{1 - 4\delta + \delta^2} < 4\delta$$
for $0 < \delta < \frac18$, the proof can be completed by choosing $c_1 = 4(2 + \sigma_0^{-2})$ and $c_2 = 24$. □

Proof of Theorem 4. The covering Hellinger balls of the function space are constructed as follows, based on the squared Hellinger distance between the densities indexed by two parameters $\eta_1 = (\theta_1(x), \sigma_1)$ and $\eta_2 = (\theta_2(x), \sigma_2)$, which is
$$H^2[f_{\eta_1}(\cdot|x), f_{\eta_2}(\cdot|x)] = 2 - 2\sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}\exp\left\{-\frac{[\theta_1(x) - \theta_2(x)]^2}{4(\sigma_1^2 + \sigma_2^2)}\right\}.$$
Then $H[f_{\eta_1}, f_{\eta_2}] < \delta$ is equivalent to $\sup_x|\theta_1(x) - \theta_2(x)| < \delta^*$, given the information on the variance. Recall that $\theta(x)$ is piecewise constant, equal to $b_k$ on $((k-1)/N, k/N]$. Thus the partition of the $\theta$ space is a partition of the $b$ space given $\sigma$; that is,
$$A_{km_k} = \{b_k : m_k\epsilon^*\sigma^2 < b_k < (m_k+1)\epsilon^*\sigma^2\},$$
where the $m_k$ are integers in $(-\infty, +\infty)$. The Hellinger distance between $f_{\eta_1}$ and $f_{\eta_2}$ being bounded by $\delta$ is, with respect to the variance, equivalent to $|\log(\sigma_1/\sigma_2)| < \delta^{*2}$. Thus the partition of the half real line for $\sigma$ is
$$A_{m^*} = \{\sigma : m^*\delta^{*2} < \log\sigma < (m^*+1)\delta^{*2}\},$$
where $m^*$ is also an integer in $(-\infty, +\infty)$. Hence, the probability of interest is given by
$$\sum_j\sqrt{\Pi(A_{\epsilon_N,l,j})} = \sum_{m^*=-\infty}^{+\infty}\sqrt{\Pi[\log\sigma \in (m^*\delta^{*2}, (m^*+1)\delta^{*2})]}\ \prod_{k=1}^N\sum_{m_k=0}^{\infty} 2\sqrt{\Pi[b_k \in (m_k\epsilon^*\sigma^2, (m_k+1)\epsilon^*\sigma^2)\mid\sigma]}$$
$$< 4\sum_{m^*=0}^{\infty}\sqrt{\Pi[\log\sigma \in (m^*\delta^{*2}, (m^*+1)\delta^{*2})]}\ \prod_{k=1}^N\left[1 + 4(2\pi)^{1/4}\left(\frac{\tau_k}{\epsilon^*\sigma^2}\right)^{3/2}\right].$$
The calculation is similar to that for the infinite-dimensional exponential families of Walker (2004). Putting $\tau_k = k^{-s}$ with $s > 1$, we have, for some constant $\lambda = 4(2\pi)^{1/4}/\epsilon^* > 0$,
$$\sum_j\sqrt{\Pi(A_{\epsilon_N,l,j})} < 4\sum_{m^*=0}^{\infty}\sqrt{\Pi[\log\sigma \in (m^*\delta^{*2}, (m^*+1)\delta^{*2})]}\,\left(1 + \lambda e^{-3m^*\delta^{*2}}\right)^N < 4(1+\lambda)^N\sum_{m^*=0}^{\infty}\frac{\delta^{*2}}{\sqrt{2\pi}\,\tau}\exp\left\{-\frac{m^{*2}\delta^{*4}}{4\tau^2}\right\}.$$


Thus, for some constant $Q$,
$$\sum_{l,j}\sqrt{\Pi(A_{\epsilon_N,l,j})} < QN(1+\lambda)^N.$$
That is, we can take $M_N = N\exp(N\tau)$ for some $\tau = \log(1+\lambda) > 0$. Now $e^{-En/N}$ is $\exp(-En^{1/2+a})$ and $M_N = N\exp(\tau n^{(1/2)-a})$, and so $e^{-nE/N}M_N \to 0$, as required. □

References

Amewou-Atisso, M., Ghosal, S., Ghosh, J.K., Ramamoorthi, R.V., 2003. Posterior consistency for semi-parametric regression problems. Bernoulli 9, 291–312.
Barron, A., Schervish, M.J., Wasserman, L., 1999. The consistency of posterior distributions in nonparametric problems. Annals of Statistics 27, 536–561.
Choi, T., 2005. Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA.
Choi, T., 2007. Alternative posterior consistency results in nonparametric binary regression using Gaussian process priors. Journal of Statistical Planning and Inference 137, 2975–2983.
Choi, T., Ramamoorthi, R.V., 2008. Remarks on consistency of posterior distributions. IMS Collections 3, 170–186.
Choi, T., Schervish, M.J., 2007. On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis 98, 1969–1987.
Ghosal, S., Ghosh, J.K., Van der Vaart, A.W., 2000. Convergence rates of posterior distributions. Annals of Statistics 28, 500–531.
Ghosal, S., Ghosh, R., 1999. Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics 27, 143–158.
Ghosal, S., Roy, A., 2006. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics 34, 2413–2429.
Shively, T.S., Sager, T.W., Walker, S.G., 2009. A Bayesian approach to non-parametric monotone function estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71.
Walker, S.G., 2003. On sufficient conditions for Bayesian consistency. Biometrika 90, 482–488.
Walker, S.G., 2004. New approaches to Bayesian consistency. Annals of Statistics 32, 2028–2043.