Journal of Statistical Planning and Inference 47 (1995) 79-94

Computer intensive methods for inference on parameters of complex models --- A Bayesian alternative

Ayala Cohen (a,*), Udi Makov (b), Ruth Boneh (a)

(a) Faculty of Industrial Engineering and Management, Technion, Haifa 32000, Israel
(b) Department of Statistics, University of Haifa, Israel
Abstract

Simulation techniques are gaining popularity for drawing statistical inference, particularly in the field of practical Bayesian statistics (e.g. Ann. Statist. 9 (1981), 130-134; J. Amer. Statist. Assoc. 82 (1987), 528-540; J. Amer. Statist. Assoc. 85 (1990), 398-409; J. Roy. Statist. Soc. B 56 (1994), 3-48). Smith and Gelfand (Amer. Statist. 46 (1992), 84-88) discussed two sampling-resampling methods for obtaining posterior samples from prior samples. The first is based on the rejection method; the second (the weighted bootstrap) is the sampling importance resampling (SIR) algorithm (J. Amer. Statist. Assoc. 82 (1987), 543-546; Bayesian Statistics, Vol. 3, Oxford University Press, 1988). This paper uses an example to illustrate the applicability of these methods. The example concerns cosmological data for which maximum likelihood estimates do not exist. The data were analyzed by Chernoff (Comput. Statist. Data Anal. 12 (1991), 159-178), who introduced several non-Bayesian computer intensive methods for these data. The cosmological data are used here to evaluate the two Bayesian computer intensive methods.

Key words: Bayes estimation; Simulations; Bootstrap; Sampling-resampling; SIR
1. Introduction

Several computer intensive methods were proposed by Chernoff (1991) to estimate parameters of complex parametric models. The approach was motivated by an astronomical problem related to modeling apparent superluminal motion. The study involved 24 i.i.d. observations on a random variable for which the form of the distribution is known and depends on two unknown parameters (shape and scale). The data were compiled by Porcas (1987) and the model was suggested by Segal (1987). Segal used an estimation method which is equivalent to the method of moments. Conventional maximum likelihood estimation could not be applied, since the density becomes infinite at the lowest end of its support. As an alternative to estimation by standard analytic methods, Chernoff proposed methods which exploit computer
power and use simulated data. Chernoff mentioned that one of his methods had a Bayesian interpretation, but his approach in general was not Bayesian. Chernoff's methods motivated us to analyze the same data by applying alternative Bayesian computer intensive methods.

Various simulation techniques have been developed in recent years for drawing inference, particularly in Bayesian analysis. They include iterative as well as noniterative methods (e.g. Rubin, 1981; Tanner and Wong, 1987; Gelfand and Smith, 1990; Newton and Raftery, 1994). Some of the methods require sophisticated expertise, and some are more straightforward to implement, though they are not necessarily the most efficient.

The purpose of our study is twofold: first, to illustrate the applicability of two Bayesian computer intensive methods by applying them to the cosmological data; secondly, to use this example for an evaluation of these methods. The two methods which we apply were discussed by Smith and Gelfand (1992). The first is complementary to maximum likelihood estimation and uses likelihood-based estimates to simulate posterior samples of the parameters. In the second method, posterior samples are obtained by a weighted bootstrap; this is essentially the sampling importance resampling (SIR) algorithm (Rubin, 1987, 1988). We note that a Bayesian solution has recently been used for another astronomical problem (Pettit, 1993).

In Section 2 we present the data and previous estimates. Section 3 describes the two Bayesian procedures. Section 4 includes the results of our analysis, and Section 5 concludes with a discussion.
2. The data and previous estimates

The data, which were listed by Chernoff (1991) and are displayed in Table 1, consist of 24 independent bivariate observations (z, μ). They are actually part of a larger data set of 32 observations which were displayed and discussed by several authors (e.g. Porcas, 1987; Cohen et al., 1987; Blackwell, 1991). The 32 observations included correlated observations due to multiple observations from the same source; in the reduced data, only one observation was retained from each source. A transformation of (z, μ) defines the random variable w, where

w = -2 ln(z^{1/2} μ).   (2.1)
According to the model suggested by Segal (1987), w can be expressed as

w = α + w*,   (2.2)

where α is an unknown shift parameter and the distribution function of w* is

F_{w*}(w) = P[w* ≤ w] = [1 - (1 - τ) e^{-w}]^{1/2} / [1 + τ e^{-w}]   for w > ln(1 - τ).   (2.3)
Table 1
Porcas data

  i      z      μ       w           i      z      μ       w
  1    0.538   0.50   2.00619      13    0.751   0.34   2.44397
  2    0.158   1.20   1.48052      14    1.322   0.12   3.96138
  3    0.033   2.66   1.45460      15    2.367   0.09   3.95427
  4    0.595   0.48   1.98713      16    1.250   0.13   3.85730
  5    0.846   0.19   3.48870      17    0.669   0.11   4.81652
  6    1.258   0.15   3.56472      18    0.652   0.06   6.05453
  7    0.070   0.76   3.20813      19    0.466   0.07   6.08209
  8    0.699   0.16   4.02327      20    1.029   0.11   4.38596
  9    0.635   0.64   1.34670      21    0.206   0.36   3.62318
 10    0.859   0.35   2.25163      22    0.424   0.18   4.28762
 11    0.057   0.74   3.46691      23    1.037   0.65   0.82523
 12    0.302   0.60   2.21898      24    0.306   0.28   3.73010
By its definition, the unknown shape parameter τ is in the unit interval, and according to the theory behind the model the value of τ should be close to 0.9. From (2.3), the density of w* is

f_{w*}(w; τ) = e^{-w} {1 + τ - τ(1 - τ) e^{-w}} / (2 [1 + τ e^{-w}]^2 [1 - (1 - τ) e^{-w}]^{1/2})   for w > ln(1 - τ).   (2.4)
Chernoff also considered functions of the two parameters, such as

Ψ = ln[τ/(1 - τ)],   (2.5)

ln R = (α + ln τ)/2 + 3.455.   (2.6)
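For concreteness, the quantities defined in (2.1) and (2.4)-(2.6) can be evaluated directly. The following is a minimal sketch (ours, not part of the original analysis; the function names are hypothetical), written in Python:

```python
import numpy as np

def w_from_z_mu(z, mu):
    """Transformation (2.1): w = -2 ln(z^(1/2) mu)."""
    return -2.0 * np.log(np.sqrt(z) * mu)

def density_w_star(w, tau):
    """Density (2.4) of w*; zero for w <= ln(1 - tau)."""
    w = np.atleast_1d(np.asarray(w, dtype=float))
    out = np.zeros_like(w)
    ok = w > np.log(1.0 - tau)
    e = np.exp(-w[ok])
    num = e * (1.0 + tau - tau * (1.0 - tau) * e)
    den = 2.0 * (1.0 + tau * e) ** 2 * np.sqrt(1.0 - (1.0 - tau) * e)
    out[ok] = num / den
    return out

def log_likelihood(alpha, tau, w):
    """Log-likelihood of the shifted model (2.2): w_i = alpha + w_i*."""
    f = density_w_star(np.asarray(w) - alpha, tau)
    return -np.inf if np.any(f <= 0.0) else float(np.sum(np.log(f)))

def psi(tau):
    """Eq. (2.5): Psi = ln[tau / (1 - tau)]."""
    return np.log(tau / (1.0 - tau))

def log_R(alpha, tau):
    """Eq. (2.6): ln R = (alpha + ln tau) / 2 + 3.455."""
    return (alpha + np.log(tau)) / 2.0 + 3.455
```

The density, and hence the likelihood, grows without bound as w decreases towards ln(1 - τ), which is the source of the estimation difficulties discussed below.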
According to the theory behind the model, R should equal the radius of the universe, measured in megaparsecs. As noted by Chernoff, the density of w* becomes infinite at ln(1 - τ); therefore, the conventional maximum likelihood estimator is not defined. Segal (1987) estimated the unknown parameters by the method of moments; Blackwell (1991) introduced a modified MLE; Chernoff proposed several estimators which are linear combinations of some of the order statistics, where the weights are calculated on the basis of computer intensive simulations. He used the term 'feature vector' for the subset of order statistics which form the linear combinations of the estimators. Chernoff proposed two types of estimators. The weights of the first type are obtained by regression, whereas in the second type the weights are calculated so that the resulting estimator is a locally minimum variance unbiased estimator. The procedure for the first estimator is such that a set of different values is preselected for the unknown vector of parameters (which for the specific example is of dimension two). For each vector, a set
of data is simulated. Regression analysis is then used to estimate a linear relationship between the parameters and their feature vectors, and the estimated regression is applied to the given data set to infer about the unknown parameters. This method also enables one to obtain invariant estimators, by applying least squares estimation under constraints in the regression analysis. For the second estimator, an initial guess is made for the unknown vector of parameters. A correction, which is a linear combination of the feature vector, is then added to the initial guess. The weights, which are determined so as to obtain locally minimum variance unbiased estimators, are functions of unknown expected values, variances and covariances of the order statistics; these are estimated by averaging over simulated data sets. The resulting estimates of these methods depend on various arbitrary choices, such as the number of simulations, the initial guess of the unknown parameters and the choice of the subset of order statistics used in the linear combinations. The various estimates which Chernoff calculated and presented in his paper are displayed here in Table 2, for comparison between his results and ours.

Table 2
Chernoff's estimates

 No.    Ψ      α      τ      Method
  1    2.22   3.19   0.902   Method of moments
  2    2.22   3.52   0.902   A 'robust' version of the method of moments, using the median
  3    2.27   3.19   0.906   Estimating the shift parameter by using the smallest observation and then applying the method of moments
  4    2.28   3.21   0.908   Modified MLE (Blackwell's method, 1991)
  5    2.78   3.28   0.941   Local estimator (Chernoff's estimator of type 2) with a certain feature vector
  6    2.48   3.15   0.923   Same as (5), but based on 4 times more simulations
  7    2.44   3.18   0.920   Same as (5), but with another feature vector
  8    2.54   3.21   0.927   Same as (7), but based on 4 times more simulations
  9    2.47   3.14   0.922   Same as (8), but based on 4 times more simulations
 10    2.27   3.19   0.906   Regression estimator (Chernoff's type 1) with the same feature vector as (5) and (6), and a certain prespecified set of parameters
 11    2.28   3.17   0.908   Same as (10), but based on 4 times more simulations
 12    2.18   3.46   0.898   Same as (10), but based on another prespecified set of parameters
 13    2.24   3.18   0.904   Same as (11), but with the feature vector of (7)-(9)
 14    2.34   3.09   0.912   Same as (12), but with the feature vector of (7)-(9) and (13)
 15    2.12   3.18   0.893   Same as (14), but with 4 times more simulations
 16    2.21   3.18   0.901   Same as (15), but with 4 times more simulations
 17    2.15   2.73   0.896   Same as (14), but with a different set of prespecified parameters
 18    2.20   3.42   0.900   Invariant regression estimator with the same feature vector as in (5), (6), (10)-(12)
 19    2.39   3.41   0.916   Same as (18), but with selected parameters as in (12)
 20    2.32   3.44   0.911   Same as (19), but with 4 times more simulations
3. Bayesian sampling-resampling methods

Let x denote the data, obtained from a density specified by a finite-dimensional vector of parameters θ. Let p(θ) and p(θ|x) denote the prior and posterior densities of θ, respectively. Smith and Gelfand (1992) discussed two procedures for resampling prior samples (from p(θ)) in order to obtain posterior samples (from p(θ|x)).

The first procedure is based on the rejection method (Ripley, 1986). Random number generation via the rejection method is useful when a sample of θ's has been generated from a certain density, say h(θ), and what is of interest is a sample from another density, say g(θ), which is absolutely continuous with respect to h(θ). Given the sample {θ_i, i = 1, ..., n} from h(θ), only a subset of this sample is retained, and this subset is a sample from g(θ). The procedure works as follows: let M > 0 be a known constant such that g(θ)/h(θ) ≤ M for all θ. A uniform random variable u ~ U(0, 1) is generated, and if u > g(θ_i)/[M h(θ_i)] then θ_i is rejected; otherwise, θ_i is accepted as an observation from g(θ).

In Bayesian inference, this is a useful tool for obtaining a posterior sample from a prior sample when the MLE can be calculated. Denote by l(θ; x) the likelihood of x, let θ̂ be the MLE and set M = l(θ̂; x). According to the rejection method, given a prior sample {θ_i, i = 1, ..., n}, a posterior sample is obtained if we retain θ_i with probability

l(θ_i; x)/M.   (3.1)
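As an illustration only (this sketch is ours and not from the paper; `log_lik` stands for the log of l(θ; x), for instance the `log_likelihood` sketched in Section 2 with θ = (α, τ), and `log_M` for log M), the rejection step can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_resample(prior_sample, log_lik, log_M):
    """Retain each prior draw theta_i with probability l(theta_i; x)/M, as in (3.1)."""
    kept = []
    for theta in prior_sample:
        # compare on the log scale to avoid underflow of the likelihood
        if np.log(rng.uniform()) < log_lik(theta) - log_M:
            kept.append(theta)
    return kept
```

Here `log_M` would be taken as the log-likelihood evaluated at the (modified) MLE, so that l(θ; x) ≤ M over the prior sample.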
In other words, those θ_i's in the prior sample with a high value of the likelihood are more likely to be included in the posterior sample. By applying this procedure to the cosmological data, we are able to obtain additional inference on the parameters, such as posterior confidence intervals. We exploit the point estimates presented by Chernoff to evaluate M = l(θ̂; x), and then we simulate posterior samples from prior samples.

Instead of relying on the assumption that the bound on the likelihood is known, a weighted bootstrap can be applied. This is the second resampling method discussed by Smith and Gelfand (1992), and it is essentially the sampling importance resampling (SIR) algorithm (Rubin, 1987, 1988). According to this method, given a prior sample {θ_i, i = 1, ..., n}, calculate
q_i = l(θ_i; x) / Σ_j l(θ_j; x).   (3.2)
Draw θ* from the discrete distribution over {θ_1, ..., θ_n} which places probability q_i on θ_i. The resulting θ* is approximately distributed according to the posterior. The approximation improves as n increases, but it also depends on the similarity between the proposed prior and the posterior.

We have outlined two Bayesian computer intensive methods which we used to analyze the cosmological data. These two methods are not the most efficient, but they are easy to implement. One of the nice features of the Bayesian resampling methods is that posterior inference can easily be carried out on any function of the parameters. Thus, after simulating a posterior sample of (α, τ)'s, it is straightforward to obtain posterior samples of Ψ and/or R (defined by (2.5) and (2.6), respectively). The methods can be applied for inference on the two-dimensional vector of parameters, as well as on either one of the two parameters, or on Ψ, R or ln R. As in all computer intensive methods, there are various ways of applying these methods, and there are practical questions, such as the size of the simulated samples and which prior to choose. The main interest in the current paper is the alternative look at the cosmological data: an illustration of Bayesian computer intensive methods, in contrast with the methods suggested by Chernoff.
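A corresponding sketch of the weighted bootstrap (SIR) step, again ours and purely illustrative, with the same hypothetical `log_lik` as above; the commented lines indicate how posterior draws of a function such as Ψ would be obtained:

```python
import numpy as np

rng = np.random.default_rng(1)

def sir_resample(prior_sample, log_lik, n_draws):
    """Weighted bootstrap: resample the prior draws with the weights q_i of (3.2)."""
    log_w = np.array([log_lik(theta) for theta in prior_sample])
    log_w -= log_w.max()              # stabilise before exponentiating
    q = np.exp(log_w)
    q /= q.sum()                      # normalised weights, summing to one
    idx = rng.choice(len(prior_sample), size=n_draws, replace=True, p=q)
    return [prior_sample[i] for i in idx], q

# Posterior draws of Psi = ln[tau/(1 - tau)] from posterior draws of (alpha, tau):
# posterior, q = sir_resample(prior_pairs, log_lik, n_draws=400)
# psi_draws = [np.log(t / (1.0 - t)) for (_, t) in posterior]
```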
4. Results

We wanted to include in our analysis both vague and informative priors. In a usual practical setting, the statistician might consult the expert (the astronomer, in our case) to obtain prior knowledge on the parameters. For the current illustration, we used Chernoff's estimates as the 'expert prior information'. The vector of parameters in this study was θ = (α, τ), and the inference was done mainly on α and τ. In addition, following Chernoff (1991), inference was also done on Ψ, which was defined in (2.5). Since the basic vector of parameters was θ, all the priors were constructed for θ. However, the Bayesian methods make it easy to infer on functions of these parameters, such as Ψ. The flexibility of reparametrization, which is illustrated in this study, was noted by Smith and Gelfand (1992).

Several priors were considered for the joint density of α and τ, both informative and vague. In all the cases, the joint prior density was a product of the marginals, assuming independence of α and τ. We note the inherent relationship between these parameters, as the density is nonzero only in the subspace of θ = (α, τ) defined by

{(α, τ): α + ln(1 - τ) < w_(1)},   (4.1)
where w_(1) = min{w_1, ..., w_24}. Thus, in practice, our sample was constrained to include only pairs (α, τ) which satisfy (4.1). For such pairs, the likelihood is bounded. Posterior samples were generated for each prior by applying both the rejection method and the weighted bootstrap. For the rejection method, the dominating constant M was evaluated by substituting Blackwell's modified MLE's in the likelihood function. These estimates are α̂ = 3.21, τ̂ = 0.908 (see Table 2). Since these are modified MLE's, M is not necessarily a supremum of the likelihood. Therefore, both M and a larger constant M* = 10M were used in several of the examples. Comparison of the posterior samples obtained with M vs. M* revealed essentially no difference in their distributions. Thus, M seems to be an appropriate dominating constant. Moreover, in all our simulations the ratio in (3.1) was strictly less than 1. As expected, the rejection rate with M* was higher than with M.
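To make the procedure concrete, a sketch (ours; variable names hypothetical, `log_likelihood` as in the sketch of Section 2) of the prior-sampling stage for a uniform prior, with the support constraint (4.1) and the dominating constant evaluated at Blackwell's modified MLE, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_prior(n, w_min, alpha_low, alpha_high):
    """Draw (alpha, tau) from independent uniform priors, keep pairs satisfying (4.1)."""
    alpha = rng.uniform(alpha_low, alpha_high, size=n)
    tau = rng.uniform(0.0, 1.0, size=n)
    ok = alpha + np.log(1.0 - tau) < w_min      # constraint (4.1); N2 = ok.sum()
    return list(zip(alpha[ok], tau[ok]))

# w: the 24 observed w_i of Table 1
# prior_pairs = sample_prior(50_000, w_min=w.min(), alpha_low=2.9, alpha_high=3.5)
# log_M = log_likelihood(3.21, 0.908, w)        # dominating constant from the modified MLE
# log_lik = lambda theta: log_likelihood(theta[0], theta[1], w)
```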
Table 3
Prior and posterior summary statistics

Case no.   Prior distribution
    1      τ ~ U[0, 1];  α ~ U[2.9, 3.5]
    2      τ ~ U[0, 1];  α ~ U[2.7, 3.7]
    3      τ ~ U[0, 1];  α ~ gamma
    4      τ ~ beta;     α ~ gamma
    5      τ ~ U[0, 1];  α = 3.21
    6      τ = 0.908;    α ~ gamma

[For each case the table reports, for α, τ and Ψ, the prior sample mean and SD and, under both the rejection method and the weighted bootstrap, the posterior mean, SD and 0.05 and 0.95 quantiles, together with N1, N2 and N3; the numerical entries are not reproduced here.]

Table 3 summarizes the results of our study. It displays posterior means, standard deviations, and 0.05 and 0.95 quantiles, which form 90% posterior confidence intervals. These statistics correspond to α, τ and Ψ for various priors. In the table, N1 denotes the prior sample size, N2 denotes the number of points out of N1 which passed the initial criterion (4.1), and N3 denotes the posterior sample size. In the rejection method, both N2 and N3 are random, while in the weighted bootstrap N3 is determined by the statistician.

The following is a discussion of the choice of priors and an analysis of the results. By definition, τ is in the unit interval; therefore, a natural vague prior for τ is the uniform distribution on the unit interval. It should be noted that a uniform prior on [0, 1] for τ is equivalent to a standard logistic prior for Ψ. On the other hand, α could be any positive number. At first, a noninformative prior was taken for α, on the assumption that α could not lie outside the interval [-1, 4]. Thus, a prior sample of (α, τ)'s was generated in the rectangle [-1 ≤ α ≤ 4, 0 ≤ τ ≤ 1]. Out of a prior sample of size 100000, 90% were rejected in the first stage, due to the constraint (4.1). The rest were rejected because they did not satisfy the condition specified by the dominating constant. Thus, out of 100000 points, none was retained. While a dramatic increase of the sample size could have produced posterior sample points, we decided instead to shrink the interval of the vague prior for α and let α be uniformly distributed on [2.9, 3.5]. This is a symmetric interval around the modified MLE of α (3.2 ± 0.3). Out of an initial prior sample of 10000, there was still a high percentage of rejected points (98.9%). To be more efficient, one could generate a priori only (α, τ)'s which lie in the region (4.1); our attitude, however, was based on the principle of using simple methods and accepting some computational inefficiency. When the prior sample was further increased to 50000 points, N2 increased to 4592 and the final sample included N3 = 179 points (see case 1, Table 3).

A comparison of the prior sample with the posterior sample of α showed essentially no difference in their distributions. In other words, the prior for α was relatively tightly concentrated around the MLE, so that the likelihood did not add any more information. This did not occur in case 2, where the range of the uniform prior for α was increased to 3.2 ± 0.5. While the prior standard deviation was then 0.289, it was reduced to 0.257 in the posterior sample. In both cases 1 and 2 the posterior mean of τ was 0.927, but the posterior standard deviation was larger in case 2. This is due to the inherent posterior relationship between τ and α: the larger posterior variance of α yielded a larger posterior variance of τ (and consequently also of Ψ).

The gamma distribution was considered as an informative prior for α in cases 3, 4 and 6 of Table 3. The parameters of this prior were chosen as follows: we calculated the mean and variance of the twenty α estimates of Table 2, which were derived by Chernoff (1991). The parameters of the prior gamma were determined such that its mean and variance equalled the respective calculated mean and variance of the twenty α's (mean = 3.209; variance = 0.02859). The idea behind this arbitrary choice was to exploit prior information and to study its effect. As expected, the rejection rate
in case 3 was smaller than in cases 1 and 2, and so were the posterior standard errors. In case 4 we considered informative priors for both τ and α. The beta distribution was chosen as the prior for τ, and its parameters were again determined by equating the mean and variance of the prior beta to the calculated mean and variance of the 20 estimates of τ in Table 2 (mean = 0.9098; variance = 0.000143). In this case, since the prior of (α, τ) was concentrated in a relatively small region, a much lower rejection rate was achieved, so that the prior sample size N1 could be reduced to only 10000. The standard errors, as well as the 90% confidence intervals for α and τ, were substantially reduced in case 4, as compared with the previous, less informative priors.

In the final cases (5, 6), the motivation was to study the effect of reducing the prior standard error of one of the parameters to a minimum. Thus, in case 5 we considered a degenerate prior for α with all its mass on α = 3.21, while in case 6 we considered a degenerate prior for τ with all its mass on τ = 0.908. Consider first case 5. According to (4.1), the condition α = 3.21 restricts all the τ's in the posterior sample to satisfy

ln(1 - τ) < w_(1) - α,   (4.2)

which in terms of τ is τ > 0.9079 = L. Thus, in the process of generating posterior from prior samples, we could control the size of N2 by limiting the range of the generated τ to the interval [L, 1]. The resulting posterior distribution for τ is similar to the posterior for τ in case 3, except that it is truncated in its lower tail according to (4.2). In case 6, as expected, the posterior for α had the lowest posterior standard deviation and the shortest posterior confidence interval for α.

In general, the more informative the priors, the smaller were the posterior standard deviations, the confidence intervals and the rejection rates. No substantial differences were observed among the posterior means for the various cases. All the posterior means for α were smaller than the modified MLE, and all the posterior means for τ were larger than the modified MLE.

The results of our study are also summarized in box plots, which provide an illuminating visual summary. Figs. 1 and 2 correspond to the posterior samples of α and τ under the rejection method, and Figs. 3 and 4 to the corresponding samples under the weighted bootstrap. In each of these four plots, the order of the box plots from left to right corresponds to the order in Table 3. The box plots bring out the relationship between the prior and posterior distributions of the parameters: the least informative prior produces the longest box plots, while the most informative prior yields a relatively tight box plot. The box plots also enable a comparison between the two methods, which reveals, in almost all cases, only small differences between them. Another means of comparison is provided by Q-Q plots. Fig. 5 is an example of such a plot and corresponds to case 5. We drew the 45° line to enhance the similarities and differences between the posterior samples of the two methods. The figure shows that the posterior samples obtained by the two methods are indeed very similar.
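The informative priors of cases 3, 4 and 6 were obtained by matching the prior mean and variance to those of Chernoff's estimates; a minimal sketch of such moment matching (ours; the paper does not state which gamma and beta parameterisations were used, and the shape-rate and standard beta forms are assumed here) is:

```python
def gamma_from_moments(mean, var):
    """Gamma(shape, rate) with the given mean and variance (mean = shape/rate, var = shape/rate^2)."""
    rate = mean / var
    return mean * rate, rate

def beta_from_moments(mean, var):
    """Beta(a, b) with the given mean and variance."""
    nu = mean * (1.0 - mean) / var - 1.0     # nu = a + b
    return mean * nu, (1.0 - mean) * nu

# Prior for alpha (cases 3, 4 and 6): mean 3.209, variance 0.02859
alpha_shape, alpha_rate = gamma_from_moments(3.209, 0.02859)
# Prior for tau (case 4): mean 0.9098, variance 0.000143
tau_a, tau_b = beta_from_moments(0.9098, 0.000143)
```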
Fig. 1. Rejection method for alpha.
Fig. 2. Rejection method for tau.
Fig. 3. Bootstrap method for alpha.
Fig. 4. Bootstrap method for tau.
Fig. 5. MLE vs bootstrap, alpha = 3.21, tau ~ U(0, 1).
One way to check the performance of the weighted bootstrap is to examine the weights. This was done for all six cases, but for brevity we present only cases 1 and 5. In both cases τ has the uniform (vague) prior; in case 1 the prior for α is also vague, whereas in case 5 the prior for α is concentrated on one value. Recall that relatively large weights correspond to more likely parameters and that the weights sum to one. Figs. 6 and 7, which display the weight distributions, are extremely different. They have the same mean, but while in case 1 the weights are fairly evenly distributed, in case 5 the distribution is highly skewed. In case 1, where the prior is quite vague, we never obtained 'high' weights. In case 5, however, where we used the informative prior, we had a better chance of hitting the more likely parameters, and thus we obtained some extremely large weights. In general, we would not know whether evenly distributed weights indicate a flatter likelihood in the neighborhood of the maximum or that we are far from the maximum. Studying various priors could guide us, particularly with large prior samples.

Fig. 6. Bootstrap weights (multiplied by 1000), case 5.

Fig. 7. Bootstrap weights (multiplied by 1000), case 1.

It has also been of interest to compare our posterior means and standard deviations with Chernoff's point estimates and their standard errors. The 20 estimates in Table 2 include basically four types of estimates: moment-based estimators, modified MLE's, local estimators and regression estimators. The estimators of the last type were considered by Chernoff as some kind of Bayes estimates calculated numerically. It is, therefore, interesting that among the four types, those which were most similar to our posterior means were not the regression estimates but the local estimates. This is a result of the fact that the linear regression utilizes the data and the prior sample in a different way than Bayes' theorem to infer on the unknown parameters.

Chernoff did not present standard errors for the estimates, but he derived asymptotic standard errors for his estimates of Ψ and α. He also computed by simulation
standard errors of his estimates as functions of the true parameters for some specified values. Most of those standard errors of α were close to s(α) = 0.35, while those of Ψ were close to s(Ψ) = 0.5, depending on the method and the true α and Ψ. In general, our posterior standard deviations of α were lower than the s(α)'s, while the
posterior standard deviations of Ψ were higher than the s(Ψ)'s. The reason for this is that in most cases we used a prior with a relatively large standard deviation for Ψ and a prior with a relatively small standard deviation for α. In most of our priors τ was uniform on [0, 1] (thus Ψ had a standard logistic prior, with variance π²/3), whereas Chernoff preselected the Ψ values to lie in the relatively small range 1.5 ≤ Ψ ≤ 3.5. In the only case where our prior for τ was informative, with a relatively small variance, the resulting posterior standard deviation of Ψ was indeed small, and lower than Chernoff's s(Ψ)'s. Conversely, our priors for α were constructed with relatively small variances; consequently, the posterior standard deviations of α were smaller than Chernoff's s(α)'s.
5. Discussion

Chernoff's regression method is analogous to the Bayesian approach in the sense that the θ_i's are selected a priori for the data simulations. The drawback of this method is that if the range of selected θ_i's is too small, it might not include the true values of the unknown parameters, whereas if it is too large, the linearity assumption might be invalid. In the Bayesian methods used here, one could start with a wide range for the prior without worrying about introducing potential nonlinearity. The price in this case is more extensive simulation.

The cosmological data are an example where there were difficulties in evaluating the upper bound of the likelihood. If Blackwell's modified MLE were not available, the weighted bootstrap would seem to be an appropriate method, but the current study revealed a potential danger in implementing it. When we considered a noninformative prior (a large interval for the vague prior, such as -1 ≤ α ≤ 4, 0 ≤ τ ≤ 1), the rejection method rejected all of the 100000 prior sampled points. Suppose we had not known a dominating constant and had used the weighted bootstrap. Out of the 100000 prior sampled points, 9369 would have passed the criterion (4.1), so that their q_i's (defined in (3.2)) were positive. These points had relatively low likelihoods, and therefore they were not retained by the rejection method, but the weighted bootstrap would have resampled them. Indeed, when resampling by the weighted bootstrap, we obtained an 'inadequate' posterior distribution: for α, for example, the posterior mean was 2.320 with a standard deviation of 0.622, and for τ the posterior mean was 0.860 with a standard deviation of 0.110. These posterior means are markedly lower than all the posterior means of Table 3. With the trend towards more powerful hardware, which makes it possible to simulate extremely large samples, this disadvantage could in principle be overcome.

As noted by Chernoff (1991), the cosmological example has a particularly unusual distribution which causes difficulties for any inference method. This example can, therefore, serve as an interesting test case for various inferential methods. In this study, the computations were done using SAS 6.06 on an IBM mainframe (9121). The computations, particularly for the weighted bootstrap, required large memory and relatively long running times (up to 10 min).
Acknowledgements

We thank the referees for very helpful comments.
References

Blackwell, T. (1991). Inferences from apparent superluminal motion. Technical Report No. ONR-C-8.
Cohen, M.H., J.A. Zensus, J.A. Biretta, G. Comoretto, P. Kaufman and Z. Abraham (1987). Evaluation of 3C273 at 10.7 GHz. Astrophys. J. 315, L89-L92.
Chernoff, H. (1991). Simple computer intensive methods for estimating parameters of complex models. Comput. Statist. Data Anal. 12, 159-178.
Gelfand, A.E. and A.F.M. Smith (1990). Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398-409.
Newton, M.A. and A.E. Raftery (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. B 56(1), 3-48.
Pettit, L.I. (1993). Inferences about ordered parameters - an astronomical problem. The Statistician 42(4), 491-496.
Ripley, B. (1986). Stochastic Simulation. Wiley, New York.
Rubin, D.B. (1981). The Bayesian bootstrap. Ann. Statist. 9, 130-134.
Rubin, D.B. (1987). Comment on 'The calculation of posterior distributions by data augmentation', by M.A. Tanner and W.H. Wong. J. Amer. Statist. Assoc. 82, 543-546.
Rubin, D.B. (1988). Using the SIR algorithm to simulate posterior distributions. In: J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Eds., Bayesian Statistics, Vol. 3, Oxford University Press, Oxford, 395-402.
Porcas, R.W. (1987). In: J.A. Zensus and T.J. Pearson, Eds., Superluminal Radio Sources, Cambridge University Press, Cambridge.
Tanner, M.A. and W.H. Wong (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82, 528-540.
Segal, I.E. (1987). Apparent Superluminal Motion and Comparative Cosmology. Draft, 1-12.
Smith, A.F.M. and A.E. Gelfand (1992). Bayesian statistics without tears: a sampling-resampling perspective. The Amer. Statistician 46(2), 84-88.