Bayesian bootstrap prediction


Tadayoshi Fushiki
The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan

Journal of Statistical Planning and Inference 140 (2010) 65-74

Article history: Received 3 September 2007; received in revised form 17 June 2008; accepted 17 June 2009; available online 21 June 2009.

Keywords: Bagging; Bayesian bootstrap; Bootstrap; Kullback–Leibler divergence; Probabilistic prediction

Abstract

In this paper, bootstrap prediction is adapted to resolve some problems that arise with small sample datasets. The bootstrap predictive distribution is obtained by applying Breiman's bagging to the plug-in distribution with the maximum likelihood estimator. The effectiveness of bootstrap prediction has previously been shown, but some problems may arise when bootstrap prediction is constructed from small sample datasets. In this paper, the Bayesian bootstrap is used to resolve these problems, and the effectiveness of Bayesian bootstrap prediction is confirmed by several examples. These days, the analysis of small sample data is quite important in various fields, and some datasets of this kind are analyzed here. For real datasets, it is shown that plug-in prediction and bootstrap prediction provide very poor prediction when the sample size is close to the dimension of the parameter, while Bayesian bootstrap prediction provides stable prediction.

1. Introduction

In this paper, bootstrap prediction is adapted to resolve some problems that arise with small sample datasets. Bagging was proposed by Breiman (1996) and has been studied in machine learning (for example, Bühlmann and Yu, 2002). The bootstrap predictive distribution is obtained by applying the bagging algorithm to the plug-in distribution with the maximum likelihood estimator (MLE). The effectiveness of bootstrap prediction has been studied under the Kullback–Leibler loss (Fushiki et al., 2005; Fushiki, 2005; Harris, 1989). It is known that Bayesian prediction is admissible when a proper prior is used (Aitchison, 1975), but it is sometimes not computationally easy to obtain Bayesian prediction. We have shown that bootstrap prediction can be regarded as an approximation of Bayesian prediction and provides asymptotically better prediction than plug-in prediction with the MLE (Fushiki et al., 2005).

Let $N$ be the number of observations. The difference between the bootstrap predictive distribution and the plug-in distribution with the MLE is $O_p(N^{-1})$, so the difference is negligible if $N$ is very large. However, the difference matters for small sample datasets, even though the difference between the true distribution and the plug-in distribution with the MLE or the bootstrap predictive distribution is $O_p(N^{-1/2})$. These days, the analysis of small sample datasets is quite important in various fields, but plug-in prediction is known to sometimes provide poor prediction for such datasets, so other methods are required. Although bootstrap prediction is expected to be more effective in such cases, some problems may arise when bootstrap prediction is constructed from small sample datasets. For example, in regression problems, a matrix whose inverse is needed to obtain the MLE may be rank-deficient in a bootstrap sample, because some observations may not appear in the bootstrap sample. To overcome such problems, we adapt bootstrap prediction in this paper: instead of the ordinary bootstrap, Rubin's (1981) Bayesian bootstrap is


used to construct the estimators. The same method is known as Bayesian bagging (Clyde and Lee, 2001; Lee and Clyde, 2004) in machine learning. The Bayesian bootstrap predictive distribution is obtained by applying Bayesian bagging to the plug-in distribution with the MLE. In this paper, we study the effectiveness of Bayesian bootstrap prediction, which is confirmed by several examples. For real datasets, plug-in prediction and bootstrap prediction provide very poor prediction when the sample size is close to the dimension of the parameter, while Bayesian bootstrap prediction provides stable prediction; the risk improvement is quite large in this case. Although we are mainly interested in nonasymptotic situations, the asymptotic properties of Bayesian bootstrap prediction are also studied. It is shown that Bayesian bootstrap prediction is asymptotically more effective than plug-in prediction with the MLE. This fact guarantees that Bayesian bootstrap prediction is effective even if the sample size is not small, so it can be used in various situations.

The connection between Bayesian procedures and bootstrapping is also an interesting topic. For example, Efron (2003, 2005) discussed this connection, Boos (2003) suggested some analogies between the parametric bootstrap in frequentist inference and Markov chain Monte Carlo methods in Bayesian analysis, and Hastie et al. (2001) pointed out the relation between the bootstrap and noninformative Bayes in Chapter 8 of their book. In this paper, it is shown that Bayesian bootstrap prediction is second-order equivalent to Bayesian prediction with a prior satisfying a certain condition, and that this condition is an extension of the result obtained by Newton and Raftery (1994).

This paper is organized as follows. In Section 2, we formulate the statistical prediction problem discussed in this paper. In Section 3, the Bayesian bootstrap predictive distribution is defined. In Section 4, the effectiveness of Bayesian bootstrap prediction is shown by some examples. In Section 5, the asymptotic properties of Bayesian bootstrap prediction are studied. Section 6 concludes with a discussion.

2. Problem formulation

In this paper, the following statistical prediction problem is considered. Observations $x^N = \{x_1, \ldots, x_N\}$ are drawn independently from an unknown distribution $p_0$. We discuss probabilistic prediction of $x_{N+1}$ based on $x^N$. A statistical model $P = \{p(x; \theta) \mid \theta \in \Theta\}$ is used to predict $x_{N+1}$, and it is assumed that there exists a unique $\theta_0$ such that $p_0(x) = p(x; \theta_0)$. Let $\hat p(x_{N+1}; x^N)$ be a predictive distribution. The Kullback–Leibler divergence is used to measure the predictive performance of $\hat p$. The loss function is given by
$$D(p_0(x_{N+1}), \hat p(x_{N+1}; x^N)) = \int p_0(x_{N+1}) \log \frac{p_0(x_{N+1})}{\hat p(x_{N+1}; x^N)}\, dx_{N+1}.$$
The risk function is
$$E_{x^N}[D(p_0(x_{N+1}), \hat p(x_{N+1}; x^N))] = \int p_0(x^N) \int p_0(x_{N+1}) \log \frac{p_0(x_{N+1})}{\hat p(x_{N+1}; x^N)}\, dx_{N+1}\, dx^N.$$

The Bayesian predictive distribution
$$p_\pi(x_{N+1} \mid x^N) = \int p(x_{N+1}; \theta)\, \pi(\theta \mid x^N)\, d\theta, \qquad \pi(\theta \mid x^N) = \frac{p(x^N; \theta)\, \pi(\theta)}{\int p(x^N; \theta)\, \pi(\theta)\, d\theta}, \tag{1}$$

is an optimal predictive distribution in the sense that it minimizes the following Bayes risk:
$$E_\pi[E_{x^N}[D(p(x_{N+1}; \theta), \hat p(x_{N+1}; x^N))]] = \int \pi(\theta) \int p(x^N; \theta) \int p(x_{N+1}; \theta) \log \frac{p(x_{N+1}; \theta)}{\hat p(x_{N+1}; x^N)}\, dx_{N+1}\, dx^N\, d\theta.$$

Thus, the Bayesian predictive distribution is admissible when the prior is proper, but it is sometimes not computationally easy to calculate (1).

3. Bayesian bootstrap prediction

The bootstrap predictive distribution (Fushiki et al., 2005) is defined by
$$\hat p_B(x_{N+1}; x^N) = \int p(x_{N+1}; \hat\theta_{\mathrm{MLE}}(x^{*N}))\, \hat p(x^{*N})\, dx^{*N},$$
where $\hat p$ is the empirical distribution $\hat p(x) = (1/N)\sum_{i=1}^N \delta(x - x_i)$ and $\hat\theta_{\mathrm{MLE}}(x^{*N})$ is the MLE based on a bootstrap sample $x^{*N}$. This predictive distribution is obtained by applying Breiman's bagging to the plug-in distribution with the MLE. We have shown that the bootstrap predictive distribution can be regarded as an approximation of the Bayesian predictive distribution (Fushiki et al., 2005).
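For concreteness, the following Python sketch (not from the paper; the $N(\mu, 1)$ model and all names are illustrative assumptions) implements this bagging construction: each bootstrap sample gives an MLE $\hat\mu^*$, and the bootstrap predictive density is the average of the corresponding plug-in densities.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def bootstrap_predictive(x, grid, B=2000):
    """Bootstrap predictive density for the N(mu, 1) model: average the
    plug-in densities N(.; mu_hat(x*), 1) over B ordinary bootstrap samples x*."""
    dens = np.zeros_like(grid)
    for _ in range(B):
        x_star = rng.choice(x, size=len(x), replace=True)     # bootstrap resample of the data
        dens += norm.pdf(grid, loc=x_star.mean(), scale=1.0)  # plug-in density at the resampled MLE
    return dens / B

x = rng.normal(0.0, 1.0, size=5)       # five observations, as in Fig. 1
grid = np.linspace(-4.0, 4.0, 201)
p_B = bootstrap_predictive(x, grid)    # bootstrap predictive density on the grid
```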

[Fig. 1. Distribution of the MWLE based on five observations from N(0, 1). The assumed model is N(μ, 1) and 10 000 bootstrap samples are used to make the histograms. Left: bootstrap; right: Bayesian bootstrap.]

However, as explained in Section 1, some problems may arise when the bootstrap predictive distribution is constructed from a small sample dataset. To overcome such problems, the Bayesian bootstrap (Rubin, 1981) is used in this paper. We consider a multinomial distribution whose support is $\{x_1, \ldots, x_N\}$, with $\Pr(X = x_i) = w_i$ ($\sum_i w_i = 1$). In this paper, we assume that $x_i \neq x_j$ ($i \neq j$). We use the following Dirichlet prior with respect to $w$:
$$\pi_w(w) \propto \prod_i w_i^{\alpha_i - 1}.$$
Then, the posterior distribution is
$$\pi_w(w \mid x^N) \propto \prod_i w_i^{\alpha_i}.$$

A Bayesian bootstrap sample $w$ is a sample from this posterior distribution. The prior with $\alpha_i = 0$ is often used; in this paper, we also use the prior with $\alpha_i = 0$, so the posterior distribution becomes a uniform distribution over the probability simplex. We consider the following maximum weighted likelihood estimator (MWLE):
$$\hat\theta_{\mathrm{MWLE}}(w) = \operatorname*{arg\,max}_{\theta} \sum_{i=1}^N w_i \log p(x_i; \theta).$$

When $w_i = 1/N$ for all $i$, the estimator coincides with the MLE. In the ordinary bootstrap, each $w_i$ takes a value in $\{0, 1/N, 2/N, \ldots, (N-1)/N, 1\}$. In the Bayesian bootstrap, each $w_i$ takes a real value in $(0, 1)$, so the estimator $\hat\theta_{\mathrm{MWLE}}(w)$ is affected by all observations. This fact sometimes prevents a matrix from becoming rank-deficient in the calculation of $\hat\theta_{\mathrm{MWLE}}(w)$. The properties of the estimator $\hat\theta_{\mathrm{MWLE}}(w)$ were investigated by Newton and Raftery (1994). Fig. 1 shows the difference between the distributions of the MWLE obtained by bootstrapping and by Bayesian bootstrapping in a Gaussian model with unknown mean. The distribution of the MWLE obtained by bootstrapping is discrete; in fact, the MWLE obtained by bootstrapping can take at most 256 different values. The distribution of the MWLE obtained by Bayesian bootstrapping is continuous.

The Bayesian bootstrap predictive distribution is defined by
$$\hat p_{BB}(x_{N+1}; x^N) = \int p(x_{N+1}; \hat\theta_{\mathrm{MWLE}}(w))\, \pi_w(w \mid x^N)\, dw.$$
By drawing $w^{(1)}, \ldots, w^{(B)}$ from $\pi_w(w \mid x^N)$, a Monte Carlo estimate of $\hat p_{BB}(x_{N+1}; x^N)$ is obtained as
$$\hat p_{BB}^{(B)}(x_{N+1}; x^N) = \frac{1}{B} \sum_{b=1}^B p(x_{N+1}; \hat\theta_{\mathrm{MWLE}}(w^{(b)})).$$

The Bayesian bootstrap predictive distribution is the one obtained by applying Bayesian bagging (Clyde and Lee, 2001; Lee and Clyde, 2004) to the plug-in distribution with the MLE.
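A minimal Python sketch of this Monte Carlo construction for the same illustrative $N(\mu, 1)$ model (an assumption for the example, not the paper's code): the weights are drawn from the uniform Dirichlet posterior, the MWLE reduces to the weighted mean, and the predictive density is the average of the resulting plug-in densities.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def bayesian_bootstrap_predictive(x, grid, B=2000):
    """Monte Carlo estimate of the Bayesian bootstrap predictive density
    for the N(mu, 1) model, following the definition above."""
    N = len(x)
    dens = np.zeros_like(grid)
    for _ in range(B):
        w = rng.dirichlet(np.ones(N))    # Bayesian bootstrap weights: uniform on the simplex
        mu_hat = np.dot(w, x)            # the MWLE for N(mu, 1) is the weighted mean
        dens += norm.pdf(grid, loc=mu_hat, scale=1.0)
    return dens / B

x = rng.normal(0.0, 1.0, size=5)
grid = np.linspace(-4.0, 4.0, 201)
p_BB = bayesian_bootstrap_predictive(x, grid)
```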


4. Examples

In this section, we investigate the effectiveness of Bayesian bootstrap prediction in some examples.

4.1. Simulated data

Example 1 (Polynomial regression). We consider the following third-order polynomial regression:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon, \qquad \epsilon \sim N(0, \sigma_0^2).$$
In this case, we may not obtain a full-rank matrix when calculating the MWLE by bootstrapping if $N$ is small; such problems are discussed in detail in Lee and Clyde (2004). Even if we can obtain a full-rank matrix, the result may be unstable when the number of independent observations is small, because a complicated curve has to be estimated from very few observations. We calculated the risks of the predictive distributions for $N = 20$ and $B = 50$; the result is shown in Table 1. Fig. 2 shows a typical example in which Bayesian bootstrap prediction provides better prediction than plug-in prediction. Prediction intervals are obtained by calculating the upper and lower 2.5% points of the predictive distributions. Let $x_0 = (1, x_0, x_0^2, x_0^3)^T$ and $b = (\beta_0, \beta_1, \beta_2, \beta_3)^T$. The prediction interval at $x_0$ is usually calculated based on the distribution of $Y - \hat b_{\mathrm{MLE}}^T x_0$. The Bayesian bootstrap predictive distribution coincides with the predictive distribution based on the distribution of $Y - \hat b_{\mathrm{MLE}}^T x_0$ up to the second order. Fig. 3 compares these two asymptotically equivalent 95% prediction intervals.
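The following Python sketch (an illustration under the simulation setting of Table 1, not the paper's implementation) shows the Bayesian bagging step for this regression example: each Dirichlet weight vector yields a weighted least-squares fit, which is the MWLE of the regression coefficients under Gaussian noise, and because all weights are strictly positive every observation contributes to the fit.

```python
import numpy as np

rng = np.random.default_rng(0)

N, B = 20, 50
X = rng.uniform(0.0, 3.0, size=N)
y = 1 + 5*X - 7*X**2 + 2*X**3 + rng.normal(size=N)   # same true curve as in Table 1
design = np.vander(X, 4, increasing=True)            # columns: 1, X, X^2, X^3

x_grid = np.linspace(0.0, 3.0, 100)
grid_design = np.vander(x_grid, 4, increasing=True)
curves = []
for _ in range(B):
    w = rng.dirichlet(np.ones(N))                    # Bayesian bootstrap weights
    # Weighted least squares = MWLE of the regression coefficients under Gaussian errors.
    # Strictly positive weights keep every observation in the fit, so the weighted
    # normal equations stay full rank whenever the original design has full rank.
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)[0]
    curves.append(grid_design @ beta)
curves = np.asarray(curves)                          # B estimated curves, as in Fig. 2 (top right)
```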

Example 2 (Single hidden layer, feed-forward neural network). We consider the following feed-forward neural network model:
$$Y = \beta_0 + \sum_i \beta_i\, \sigma(w_{i0} + w_i^T X) + \epsilon, \qquad \epsilon \sim N(0, \sigma_0^2), \quad X \in \mathbb{R}^k,$$
where $\sigma(x) = 1/\{1 + \exp(-x)\}$ is the sigmoid function. We examined the case where the true distribution is given by
$$Y = f_0(X) + \epsilon, \qquad \epsilon \sim N(0, 1), \quad X \sim U(0, 3) \times \cdots \times U(0, 3),$$
$$f_0(X) = 1 - 3\,\sigma(1 + X_1 + 3X_2 - X_3 - 2X_4 + 5X_5) + 5\,\sigma(-2 + 2X_1 - 3X_2 + X_3 + 2X_4).$$
The risk for $N = 50$ and $B = 50$ is shown in Table 2. In this case, we cannot obtain the exact MLE, so we used an estimate that gives a local maximum of the objective function. The function nnet of the statistical software R (R Development Core Team, 2006) was used to implement the neural network model. In this example, the risk difference is very large. Fig. 4 shows the means of the plug-in distributions with the estimates obtained by bootstrapping or Bayesian bootstrapping at an $X$ that is rather far from all observations. In this case, the mean of the plug-in distribution with the original estimate is far from the true function value $f_0(X)$, but many of the means of the plug-in distributions with the estimates obtained by bootstrapping or Bayesian bootstrapping are near the true function value $f_0(X)$ in Fig. 4. Thus, the bootstrap predictive distribution and the Bayesian bootstrap predictive distribution work better than the plug-in distribution, and the risk difference becomes large in this case.

4.2. Real data

In this section, real data are analyzed. We use the following cross entropy as the loss function:
$$-\int p_0(x_{N+1}) \log \hat p(x_{N+1}; x^N)\, dx_{N+1}.$$
This loss is equivalent to the Kullback–Leibler loss up to a constant. The loss is estimated by
$$-\frac{1}{M}\sum_{i=1}^M \log \hat p(y_i; x^N)$$

using a test dataset $\{y_1, \ldots, y_M\}$.

Table 1
The value of the risk in a polynomial regression model.

                       Risk
Plug-in                0.506
Bayesian bootstrap     0.385

Note: The true model is $Y = 1 + 5X - 7X^2 + 2X^3 + \epsilon$, $\epsilon \sim N(0, 1)$, $X \sim U(0, 3)$. The loss is calculated by discretizing $X$ into 1000 grid points and carrying out one-dimensional numerical integration at each $X$; the risk is obtained by 10 000 Monte Carlo iterations.
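As a small illustration of the loss estimator above (a sketch with an assumed Gaussian plug-in predictive density, not taken from the paper), the estimated loss is simply the average negative log predictive density over the test points:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def estimated_loss(pred_logpdf, y_test):
    """Estimated cross-entropy loss: -(1/M) * sum_i log p_hat(y_i; x^N)."""
    return -np.mean(pred_logpdf(y_test))

# Example with a Gaussian plug-in predictive density N(mu_hat, sigma_hat^2).
x_train = rng.normal(size=10)
y_test = rng.normal(size=1000)
mu_hat, sigma_hat = x_train.mean(), x_train.std()
loss = estimated_loss(lambda y: norm.logpdf(y, mu_hat, sigma_hat), y_test)
```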

[Fig. 2. An example of a polynomial regression model. The true model is $Y = 1 + 5X - 7X^2 + 2X^3 + \epsilon$, $\epsilon \sim N(0, 1)$, and $X \sim U(0, 3)$. The thick solid lines show $Y = 1 + 5X - 7X^2 + 2X^3$. Top left: data and the estimated curve. Top right: 50 estimated curves obtained by Bayesian bootstrapping. Bottom left: 95% prediction interval based on the plug-in distribution with the MLE; the thick broken line is the mean of the plug-in distribution and the thin broken lines are the upper and lower prediction limits. Bottom right: 95% prediction interval based on the Bayesian bootstrap predictive distribution; the thick broken line is the median of the Bayesian bootstrap predictive distribution and the thin broken lines are the upper and lower prediction limits.]

[Fig. 3. Comparison of 95% prediction intervals. The dash-dotted lines show a 95% prediction interval based on the distribution of $Y - \hat b_{\mathrm{MLE}}^T x_0$. The broken lines show a 95% prediction interval obtained from the Bayesian bootstrap predictive distribution, and the solid line is $Y = 1 + 5X - 7X^2 + 2X^3$.]


Table 2
The value of the risk in a neural network model.

                       Risk
Plug-in                1.455
Bootstrap              0.220
Bayesian bootstrap     0.187

Note: The loss is calculated by 10 000 Monte Carlo iterations and the risk is also obtained by 10 000 Monte Carlo iterations.

[Fig. 4. The means of the plug-in distributions with the estimators obtained by bootstrapping (left) or Bayesian bootstrapping (right) at an X that is rather far from the observations; the vertical axis is frequency. The filled circle shows the true function value $f_0(X)$ and the filled diamond shows the estimated one.]

Table 3
The average and the standard error of the estimated losses for the prostate cancer data.

                       Av.      s.e.
Plug-in                4.837    2.066
Bayesian bootstrap     1.680    0.722

Example 3 (Prostate cancer data). This dataset was originally studied by Stamey et al. (1989); Hastie et al. (2001) also used it to study linear regression models. The response variable is the level of prostate specific antigen, and eight covariates are used to explain the response variable. First, the 97 observations are divided into five disjoint subdatasets of almost equal size. When the first subdataset is used as the training dataset, the other four subdatasets are used as the test dataset. We iterate over the five such divisions and then calculate the average and the standard error of the estimated losses. The full linear regression model is adopted to analyze these data in this paper. Since the number of observations in each training dataset is not large enough, bootstrap prediction could not be constructed in this case. The average and the standard error of the estimated losses for $B = 50$ are shown in Table 3; Bayesian bootstrap prediction provides better prediction than plug-in prediction.

Example 4 (Boston housing data). This dataset was originally studied by Harrison and Rubinfeld (1978) and has since been used by many researchers as a benchmark dataset. The response variable is the median value of owner-occupied homes, and 13 covariates, including one binary variable, are used to predict the response variable. The number of observations is 506. Single hidden layer, feed-forward neural network models are used to analyze this dataset. The covariates, except for the binary variable, are normalized so that their means are 0 and their standard deviations are 1. The estimate is obtained by finding a local maximum of the objective function; 20 randomly selected initial values are examined to obtain the estimate. Let $L$ be the number of hidden units. We examined the cases $L = 4, 5, 6$. The dataset is divided into five subdatasets as in Example 3. The average and the standard error of the estimated losses for $B = 50$ are shown in Table 4. In Table 4, plug-in prediction provides very poor prediction. Bootstrap prediction provides better prediction than Bayesian bootstrap prediction when $L = 4$, but it failed to be constructed three times in the five iterations when $L = 5$, and it could not be constructed for any of the five divisions when $L = 6$. Bayesian bootstrap prediction is stable and effective when the sample size is close to the dimension of the parameter. In this paper, our concern is to investigate the properties of bootstrap prediction and Bayesian bootstrap prediction for small sample datasets. Providing the best model is not our concern but, from the viewpoint of data analysis, it may be better to select a model, and there are many studies on the selection of $L$ (see, for example, Section 5.6 of Ripley, 1996).
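A rough Python sketch of this evaluation protocol (using synthetic stand-in data and a Gaussian linear plug-in predictive as assumptions; it is not the paper's code): each of the five subdatasets is used once for training and the remaining four for testing, and the estimated losses are then averaged.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic stand-in data: 97 observations and 8 covariates (shapes mimic the prostate data).
n, p = 97, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
X = np.hstack([np.ones((n, 1)), X])                  # add an intercept column

folds = np.array_split(rng.permutation(n), 5)        # five disjoint subdatasets
losses = []
for k in range(5):
    train = folds[k]                                 # one subdataset as the (small) training set
    test = np.concatenate([folds[j] for j in range(5) if j != k])  # the other four as the test set
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    sigma = np.sqrt(np.mean((y[train] - X[train] @ beta) ** 2))    # MLE of the noise scale
    # Estimated loss: average negative log plug-in predictive density on the test set.
    losses.append(-np.mean(norm.logpdf(y[test], X[test] @ beta, sigma)))
print(np.mean(losses), np.std(losses))               # average and spread over the five divisions
```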


Table 4
The average and the standard error of the estimated losses for the Boston housing data.

L = 4 (62 parameters)      Av.      s.e.
Plug-in                    14.46    4.54
Bootstrap                  2.174    0.235
Bayesian bootstrap         2.662    0.172

L = 5 (77 parameters)      Av.      s.e.
Plug-in                    49.61    53.54
Bootstrap                  NA       NA
Bayesian bootstrap         2.373    0.120

L = 6 (92 parameters)      Av.      s.e.
Plug-in                    105.0    32.6
Bootstrap                  NA       NA
Bayesian bootstrap         2.377    0.443

Note: 'NA' means that prediction could not be constructed.

5. Asymptotic analysis

We are interested in Bayesian bootstrap prediction when $N$ is not large, but it is difficult to deal with this case analytically. Therefore, we investigate the asymptotic properties.

Theorem 1. The Bayesian bootstrap predictive distribution $\hat p_{BB}(x_{N+1}; x^N)$ coincides with the bootstrap predictive distribution $\hat p_B(x_{N+1}; x^N)$ up to the third order:
$$\hat p_{BB}(x_{N+1}; x^N) = \hat p_B(x_{N+1}; x^N) + o_p(N^{-3/2}).$$

Proof. First, we investigate the asymptotic properties of the MWLE obtained by Bayesian bootstrapping. Let $(w_1, \ldots, w_N)$ be a sample from the uniform distribution on the probability simplex. Using the following results:
$$E_w(w_i) = \frac{1}{N}, \qquad \mathrm{Var}(w_i) = \frac{N-1}{N^2(N+1)}, \qquad \mathrm{Cov}(w_i, w_j) = -\frac{1}{N^2(N+1)} \quad \text{for } i \neq j, \tag{2}$$

we obtain
$$
\begin{aligned}
0 &= \sum_{\alpha=1}^N w_\alpha\, \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MWLE}})\\
&= \sum_{\alpha=1}^N w_\alpha\, \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MLE}}) + \sum_{\alpha=1}^N w_\alpha\, \partial_i \partial_j \log p(x_\alpha; \hat\theta_{\mathrm{MLE}})\,(\hat\theta_{\mathrm{MWLE}}^j - \hat\theta_{\mathrm{MLE}}^j) + o_p(N^{-1/2})\\
&= \sum_{\alpha=1}^N \left(w_\alpha - \frac{1}{N}\right) \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MLE}}) + \frac{1}{N}\sum_{\alpha=1}^N \partial_i \partial_j \log p(x_\alpha; \hat\theta_{\mathrm{MLE}})\,(\hat\theta_{\mathrm{MWLE}}^j - \hat\theta_{\mathrm{MLE}}^j) + o_p(N^{-1/2}),
\end{aligned} \tag{3}
$$
where $\theta^i$ is the $i$-th element of $\theta$ and $\partial_i = \partial/\partial\theta^i$. The Einstein summation convention is also used above: when an index variable appears twice in a single term, once in an upper and once in a lower position, summation over that index is implied. For example,
$$\theta^i\, \partial_i \log p(x; \theta) = \sum_{i=1}^p \theta^i\, \frac{\partial}{\partial \theta^i} \log p(x; \theta).$$

From (3), the first-order expansion of $\tilde\theta$ can be written as
$$\tilde\theta^i = \bar J^{ii'}(\hat\theta_{\mathrm{MLE}})\, Z_{1,i'}(\hat\theta_{\mathrm{MLE}}) + o_p(1),$$
where
$$\bar J_{ij}(\theta) = -\frac{1}{N}\sum_{\alpha=1}^N \partial_i \partial_j \log p(x_\alpha; \theta), \qquad Z_{1,i}(\theta) = \sqrt{N}\sum_{\alpha=1}^N \left(w_\alpha - \frac{1}{N}\right) \partial_i \log p(x_\alpha; \theta), \qquad \tilde\theta = \sqrt{N}\,(\hat\theta_{\mathrm{MWLE}} - \hat\theta_{\mathrm{MLE}}),$$


and $(\bar J^{ij})$ is the inverse matrix of $(\bar J_{ij})$. By a higher-order expansion, we obtain
$$
\begin{aligned}
\tilde\theta^i = {}& \bar J^{ii'} Z_{1,i'} + \frac{1}{\sqrt N}\,\bar J^{ii'} Z_{2,i'j}\left(\bar J^{jk} Z_{1,k} + \frac{1}{\sqrt N}\,\bar J^{jk}\bar J^{lm} Z_{2,kl} Z_{1,m} + \frac{1}{2\sqrt N}\,\bar J^{jk}\bar J^{ln}\bar J^{mo}\bar K_{klm} Z_{1,n} Z_{1,o}\right)\\
& + \frac{1}{2\sqrt N}\,\bar J^{ii'}\bar K_{i'jk}\left(\bar J^{jl}\bar J^{km} Z_{1,l} Z_{1,m} + \frac{2}{\sqrt N}\,\bar J^{jl}\bar J^{km}\bar J^{no} Z_{1,l} Z_{2,mn} Z_{1,o} + \frac{1}{\sqrt N}\,\bar J^{jl}\bar J^{km}\bar J^{np}\bar J^{oq}\bar K_{mno} Z_{1,l} Z_{1,p} Z_{1,q}\right)\\
& + \frac{1}{2N}\,\bar J^{ii'}\bar J^{jl}\bar J^{km} Z_{3,i'jk} Z_{1,l} Z_{1,m} + \frac{1}{6N}\,\bar J^{ii'}\bar J^{jm}\bar J^{kn}\bar J^{lo}\bar L_{i'jkl} Z_{1,m} Z_{1,n} Z_{1,o} + o_p(N^{-1}),
\end{aligned} \tag{4}
$$

where
$$\bar K_{ijk}(\theta) = \frac{1}{N}\sum_{\alpha=1}^N \partial_i\partial_j\partial_k \log p(x_\alpha; \theta), \qquad \bar L_{ijkl}(\theta) = \frac{1}{N}\sum_{\alpha=1}^N \partial_i\partial_j\partial_k\partial_l \log p(x_\alpha; \theta),$$
$$Z_{2,ij}(\theta) = \sqrt{N}\sum_{\alpha=1}^N \left(w_\alpha - \frac{1}{N}\right) \partial_i\partial_j \log p(x_\alpha; \theta), \qquad Z_{3,ijk}(\theta) = \sqrt{N}\sum_{\alpha=1}^N \left(w_\alpha - \frac{1}{N}\right) \partial_i\partial_j\partial_k \log p(x_\alpha; \theta),$$
and all the values in (4) are evaluated at $\hat\theta_{\mathrm{MLE}}$. From (2), we obtain
$$E_w(Z_{1,i} Z_{1,j}) = \frac{N-1}{N+1}\,\frac{1}{N}\sum_{\alpha=1}^N \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MLE}})\,\partial_j \log p(x_\alpha; \hat\theta_{\mathrm{MLE}}) - \frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MLE}})\,\partial_j \log p(x_\beta; \hat\theta_{\mathrm{MLE}}). \tag{5}$$
The second term of (5) is
$$
\begin{aligned}
\frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i \log p(x_\alpha; \hat\theta_{\mathrm{MLE}})\,\partial_j \log p(x_\beta; \hat\theta_{\mathrm{MLE}})
= {}& \frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i \log p(x_\alpha; \theta_0)\,\partial_j \log p(x_\beta; \theta_0)\\
& + \frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i\partial_k \log p(x_\alpha; \theta_0)\,\partial_j \log p(x_\beta; \theta_0)\,(\hat\theta_{\mathrm{MLE}}^k - \theta_0^k)\\
& + \frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i \log p(x_\alpha; \theta_0)\,\partial_j\partial_k \log p(x_\beta; \theta_0)\,(\hat\theta_{\mathrm{MLE}}^k - \theta_0^k) + O_p(N^{-1}).
\end{aligned} \tag{6}
$$

The first term of (6) is $O_p(N^{-1})$ from the fact that
$$E\{\partial_i \log p(x_\alpha; \theta_0)\,\partial_j \log p(x_\beta; \theta_0)\,\partial_i \log p(x_\gamma; \theta_0)\,\partial_j \log p(x_\delta; \theta_0)\} =
\begin{cases}
I_{ii}(\theta_0)\, I_{jj}(\theta_0), & \alpha = \gamma,\ \beta = \delta,\\
I_{ij}(\theta_0)^2, & \alpha = \delta,\ \beta = \gamma,\\
0, & \text{otherwise},
\end{cases}$$
when $\alpha \neq \beta$ and $\gamma \neq \delta$. Here, $I_{ij}(\theta) = E\{\partial_i \log p(x; \theta)\,\partial_j \log p(x; \theta)\}$. The second term of (6) is
$$
\begin{aligned}
\frac{1}{N(N+1)}\sum_{\alpha\neq\beta} \partial_i\partial_k \log p(x_\alpha; \theta_0)\,\partial_j \log p(x_\beta; \theta_0)\,(\hat\theta_{\mathrm{MLE}}^k - \theta_0^k)
&= \frac{J^{kl}(\theta_0)}{N^2(N+1)}\sum_{\alpha\neq\beta,\,\gamma} \partial_i\partial_k \log p(x_\alpha; \theta_0)\,\partial_j \log p(x_\beta; \theta_0)\,\partial_l \log p(x_\gamma; \theta_0) + O_p(N^{-1})\\
&= O_p(N^{-1}),
\end{aligned}
$$
where $J_{ij}(\theta) = E[\bar J_{ij}(\theta)]$ and $(J^{ij})$ is the inverse matrix of $(J_{ij})$. The third term of (6) is also $O_p(N^{-1})$. Thus,
$$E_w(Z_{1,i} Z_{1,j}) = \bar I_{ij}(\hat\theta_{\mathrm{MLE}}) + O_p(N^{-1}),$$


where
$$\bar I_{ij}(\theta) = \frac{1}{N}\sum_{\alpha=1}^N \partial_i \log p(x_\alpha; \theta)\,\partial_j \log p(x_\alpha; \theta).$$
In the same way, we obtain
$$E_w(Z_{2,ij} Z_{1,k}) = \bar\Gamma_{ij,k}(\hat\theta_{\mathrm{MLE}}) + O_p(N^{-1}), \qquad E_w(Z_{1,i} Z_{1,j} Z_{1,k}) = O_p(N^{-1/2}), \qquad E_w(Z_{2,ij} Z_{1,k} Z_{1,l}) = O_p(N^{-1/2}),$$
$$E_w(Z_{2,ij} Z_{2,kl} Z_{1,m}) = O_p(N^{-1/2}), \qquad E_w(Z_{3,ijk} Z_{1,l} Z_{1,m}) = O_p(N^{-1/2}).$$
Here,
$$\bar\Gamma_{ij,k}(\theta) = \frac{1}{N}\sum_{\alpha=1}^N \partial_i\partial_j \log p(x_\alpha; \theta)\,\partial_k \log p(x_\alpha; \theta).$$

From the above,
$$E_w(\tilde\theta^i) = \frac{1}{\sqrt N}\,\bar J^{ii'}\bar J^{jk}\,\bar\Gamma_{i'j,k} + \frac{1}{2\sqrt N}\,\bar J^{ii'}\bar J^{jl}\bar J^{km}\,\bar K_{i'jk}\,\bar I_{lm} + O_p(N^{-3/2}),$$
$$E_w(\tilde\theta^i\,\tilde\theta^j) = \bar J^{ii'}\bar J^{jj'}\,\bar I_{i'j'} + O_p(N^{-1}),$$
$$E_w(\tilde\theta^i\,\tilde\theta^j\,\tilde\theta^k) = O_p(N^{-1/2}),$$
where all the values are evaluated at $\hat\theta_{\mathrm{MLE}}$. Finally, the asymptotic expansion of $\hat p_{BB}(x_{N+1}; x^N)$ is obtained by a Taylor expansion. The asymptotic expansion of $\hat p_B(x_{N+1}; x^N)$ is shown in Fushiki et al. (2005). By comparing these results, we obtain
$$\hat p_{BB}(x_{N+1}; x^N) = \hat p_B(x_{N+1}; x^N) + o_p(N^{-3/2}). \qquad \square$$
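As an informal numerical check of Theorem 1 (not part of the paper), one can compare Monte Carlo approximations of the three predictive densities in the illustrative $N(\mu, 1)$ model: the bootstrap and Bayesian bootstrap predictive densities should agree much more closely with each other than either agrees with the plug-in density. The comparison below is only indicative because of Monte Carlo noise.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mixture_predictive(x, grid, B, draw_weights):
    """Average of plug-in N(.; weighted mean, 1) densities over B weight vectors."""
    dens = np.zeros_like(grid)
    for _ in range(B):
        w = draw_weights(len(x))
        dens += norm.pdf(grid, loc=np.dot(w, x), scale=1.0)
    return dens / B

def bootstrap_weights(n):
    return np.bincount(rng.integers(n, size=n), minlength=n) / n   # multinomial bootstrap weights

def bayesian_bootstrap_weights(n):
    return rng.dirichlet(np.ones(n))                               # Dirichlet(1, ..., 1) weights

N, B = 20, 20000
x = rng.normal(size=N)
grid = np.linspace(-4.0, 4.0, 201)
p_plug = norm.pdf(grid, loc=x.mean(), scale=1.0)
p_B = mixture_predictive(x, grid, B, bootstrap_weights)
p_BB = mixture_predictive(x, grid, B, bayesian_bootstrap_weights)
print(np.max(np.abs(p_B - p_BB)), np.max(np.abs(p_B - p_plug)))
```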



The following theorem is obtained in the same way as Theorem 4 of Fushiki et al. (2005).

Theorem 2. Bayesian bootstrap prediction provides asymptotically better prediction than plug-in prediction with the MLE.

The effectiveness of the Bayesian bootstrap predictive distribution was confirmed experimentally in Section 4 for small sample sizes, but the distribution is also asymptotically better than the plug-in distribution with the MLE. Thus, the Bayesian bootstrap predictive distribution is useful even if the sample size is not small.

The connection between Bayesian procedures and bootstrapping is an interesting topic, as explained in the Introduction. From relation (6) of Fushiki et al. (2004), the Bayesian bootstrap predictive distribution coincides with the Bayesian predictive distribution up to the second order if the prior $\pi(\theta)$ satisfies
$$\partial_i \log \pi(\theta) = \overset{e}{\Gamma}{}_{ji}{}^{j} \quad \text{for all } i, \tag{7}$$
where $\overset{e}{\Gamma}{}_{ij}{}^{k}$ is the e-connection coefficient (Fushiki et al., 2004). In exponential families, $\overset{e}{\Gamma}{}_{ij}{}^{k} = 0$ if the parameter $\theta$ is the natural parameter. Thus, the Bayesian bootstrap predictive distribution coincides with the Bayesian predictive distribution with the uniform prior over the natural parameter up to the second order in exponential families. Newton and Raftery (1994) calculated the condition under which the distribution of the MWLE obtained by Bayesian bootstrapping and the Bayesian posterior distribution coincide up to the second order in a special case. We consider the one-parameter exponential family
$$p(x; \theta) = h(x)\exp(\theta x - \psi(\theta)),$$
where $\theta$ is the natural parameter. Let $\eta$ be the expectation parameter: $E(x) = \eta$.


The uniform prior over the natural parameter satisfies
$$0 = \frac{d}{d\theta}\log \pi_\theta(\theta) = \frac{d}{d\theta}\left\{\log\left(\pi_\eta(\eta)\,\frac{d\eta}{d\theta}\right)\right\},$$
where $\pi_\theta(\theta) = \pi_\eta(\eta)\, d\eta/d\theta$. Thus, the uniform prior over the natural parameter is expressed as
$$\pi_\eta(\eta) \propto \frac{d\theta}{d\eta}. \tag{8}$$
Since $d\theta/d\eta$ is the Fisher information with respect to $\eta$, the prior satisfying (8) coincides with the prior obtained by Newton and Raftery (1994).
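As a concrete illustration (not in the original paper), consider the Bernoulli model written in exponential family form, $p(x; \theta) = \exp(\theta x - \psi(\theta))$ with $\psi(\theta) = \log(1 + e^\theta)$ and expectation parameter $\eta = E(x) = e^\theta/(1 + e^\theta)$. Then
$$\frac{d\theta}{d\eta} = \frac{d}{d\eta}\log\frac{\eta}{1-\eta} = \frac{1}{\eta(1-\eta)},$$
so the prior (8) becomes $\pi_\eta(\eta) \propto 1/\{\eta(1-\eta)\}$, which is exactly the Fisher information with respect to $\eta$ and is the prior induced on $\eta$ by a uniform prior on the natural (logit) parameter $\theta = \log\{\eta/(1-\eta)\}$.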

6. Discussion

In this paper, we proposed Bayesian bootstrap prediction to resolve some problems in bootstrap prediction. Its effectiveness was confirmed by some examples. The asymptotic properties were also investigated, and the results obtained support the use of Bayesian bootstrap prediction. For real datasets, plug-in prediction and bootstrap prediction provided very poor prediction when $N$ is close to the dimension of $\theta$. These days, analysis of such data is quite important in various fields. Bayesian bootstrap prediction provided stable prediction, and the risk difference was quite large in this case.

The connection between Bayesian procedures and bootstrapping is an interesting topic. In this paper, we described a condition under which the Bayesian bootstrap predictive distribution and the Bayesian predictive distribution coincide up to the second order, and showed that this result is an extension of the result obtained by Newton and Raftery (1994).

Acknowledgments

The author would like to thank two anonymous referees and the associate editor for their helpful comments. This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan, Grant-in-Aid for Young Scientists (B), 20700260, 2008-2009.

References

Aitchison, J., 1975. Goodness of predictive fit. Biometrika 62, 547-554.
Boos, D.D., 2003. Introduction to the bootstrap world. Statistical Science 18, 168-174.
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.
Bühlmann, P., Yu, B., 2002. Analyzing bagging. Annals of Statistics 30, 927-961.
Clyde, M.A., Lee, H.K.H., 2001. Bagging and the Bayesian bootstrap. In: Richardson, T., Jaakkola, T. (Eds.), Artificial Intelligence and Statistics, pp. 169-174.
Efron, B., 2003. Second thoughts on the bootstrap. Statistical Science 18, 135-140.
Efron, B., 2005. Bayesians, frequentists, and scientists. Journal of the American Statistical Association 100, 1-5.
Fushiki, T., Komaki, F., Aihara, K., 2004. On parametric bootstrapping and Bayesian prediction. Scandinavian Journal of Statistics 31, 403-416.
Fushiki, T., Komaki, F., Aihara, K., 2005. Nonparametric bootstrap prediction. Bernoulli 11, 293-307.
Fushiki, T., 2005. Bootstrap prediction and Bayesian prediction under misspecified models. Bernoulli 11, 747-758.
Harris, I.R., 1989. Predictive fit for natural exponential families. Biometrika 76, 675-684.
Harrison, D., Rubinfeld, D.L., 1978. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81-102.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Lee, H.K.H., Clyde, M.A., 2004. Lossless online Bayesian bagging. Journal of Machine Learning Research 5, 143-155.
Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society B 56, 3-48.
R Development Core Team, 2006. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rubin, D.B., 1981. The Bayesian bootstrap. Annals of Statistics 9, 130-134.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., Yang, N., 1989. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II. Radical prostatectomy treated patients. Journal of Urology 141, 1076-1083.