# Bayesian Statistical Analysis

## Bayesian Statistical Analysis

Bayesian Statistical Analysis G Kokolakis, National Technical University of Athens, Athens, Greece ã 2010 Elsevier Ltd. All rights reserved. Introduc...

Bayesian Statistical Analysis G Kokolakis, National Technical University of Athens, Athens, Greece ã 2010 Elsevier Ltd. All rights reserved.

Introduction Bayesian statistics has been considered, for quite a long time, as a branch of statistics; however, its role in the development of statistical inference is much more profound than that. Its philosophical base traces back to the very initial and rather subjective interpretation of the notion of probability and its use in everyday human activities. It is related with the notion of ‘pithanon,’ as used by Carneades (214–129/8 BC), meaning persuasive, convincing, which allows human beings to take decisions and act without necessarily having perfect knowledge of the situation and conditions they are facing. For a detailed account on Carneades’ philosophical position, the reader is referred to Bett (1989). According to the Bayesian approach to statistics, statistical inference is not just about collecting, analyzing, and interpreting data that have an endogenous variability, which is the frequentist view of statistics. Rephrasing de Finetti (1974, 1975), it is about developing a plausible way of studying uncertain events through the notion of probability. It also provides a coherent methodology for inductive mathematical reasoning when new information is available. In short, statistics is the study of uncertainty. Savage (1954) and Lindley (1965, 2000) state that the only self-consistent (coherent) way to deal with uncertainty is by expressing it through the notion of probability. When dealing with a statistical problem, uncertainty refers to data, parameters, and models. Having expressed the uncertainty about these, the rest concerns taking into account the information provided by the data and updating the uncertainty according to the rules of probability calculus. The basic mathematical tool to do it is provided by the Bayes’ theorem. Thus, the Bayesian approach provides a unified and coherent inferential procedure. It must be mentioned here that the rules of probability calculus, namely: (a) the addition rule (finite or countable), (b) the product rule, and (c) the convexity rule (that the probability lies in the convex unit interval taking the value 1 for the sure event), have to be demonstrated on the basis of axioms about the logical and, at the same time, subjective human behavior in an uncertain environment. There exist several axiomatic systems resulting in the above three probability rules. For a discussion of such axiomatic systems, see Savage (1954), DeGroot (1970), and Fishburn (1986). The Bayesian approach, although quite simple in principle, is often quite demanding, especially when dealing

with complicated and highly structured statistical models. There are two main tasks to be accomplished: (a) modeling uncertainty about the probabilistic behavior of the statistical data at hand and the parameters involved, including models, and (b) applying very efficient computational methods in order to estimate the parameters and confirm the models. Regarding the first task, much of the statistical work is based on the notions of coherence and exchangeability (both due to de Finetti (1937)), and it is greatly simplified by using the so-called objective priors (diffuse, flat, reference or noninformative priors). Regarding the second, the development of modern and quite efficient computational techniques, such as Markov-chain Monte Carlo (MCMC), has made this task a rather straightforward procedure. Their versatality had an enormous impact on the Bayesian approach and its spread in many disciplines during the last two decades.

Bayesian Inference We express the uncertainty about getting a data value x related with a population parameter u, both x and u being in general multivariate. This is achieved by a critical choice of a joint distribution for x and u, conditional on the state of knowledge we have at this moment about the population under study. An expert opinion, familiar with the specific population, might be very helpful in making such a choice. Specifically, the task is to choose a probability density pðx; ujK Þ (with respect to Lebesgue or counting measure), where K represents the present state of knowledge. In a more general setup, K might also include a set of modeling assumptions and random hyperparameters. In general, given the state of knowledge K , it is more convenient to choose separately the densities pðxju; K Þ and pðujK Þ, and then apply the product rule to get pðx; ujK Þ ¼ pðxju; K ÞpðujK Þ

½1

The product rule again allows us to write the density pðx; ujK Þ in the form pðx; ujK Þ ¼ pðujx;K ÞpðxjK Þ

½2

and thus, combining the above two results, we have pðujx; K Þ ¼ pðxju; K ÞpðujK Þ=pðxjK Þ

½3

If we apply now the addition rule on both sides of [1], with x fixed, we conclude that

37

38

Statistics Z pðxjK Þ ¼

and, thus, introducing the latter result into [3], we get the Bayes’ formula (Bayes, 1763) pðujx; K Þ ¼ R

pðxju; K ÞpðujK Þ pðxju; K ÞpðujK Þdu

½4

The denominator, either as pðxjK Þ or under the inteR gral form pðxju; K ÞpðujK Þdu, is called the predictive distribution for x. It does not convey any information about the parameter u (as far as a data value x does not raise doubts about the state of knowledge K itself ) and, thus, it acts as a normalizing constant. For simplicity in the notation, the reference to the state of knowledge K is also omitted. We may then write the Bayes’ formula [4] as follows: pðujxÞ / pðuÞpðxjuÞ

n Y

pðx i juÞ

i¼1

Q The product ni¼1 pðx i juÞ, considered as a function of the parameter u, is called the likelihood and is denoted by L(u,x(n)). The above posterior can be used as the prior for a  second dataset xðn Þ ¼ fxnþi ; i ¼ 1; :::; n g. Indeed, the posterior distribution of parameter u, given the whole  dataset x ðnþn Þ ¼ fx nþi ; i ¼ 1; :::; n þ n g, is 

pðujx ðnþn Þ Þ / pðuÞ

nþn Y i¼1

pðx i juÞ / pðujxðnÞ Þ

pðz; hjx ðnÞ Þdh

usually called marginal posterior.

Exchangeability It is essential to give an explanation about how the prior for a parameter u emerges. The answer has been given by de Finetti’s theorem (de Finetti, 1937), which constitutes the backbone of the Bayesian statistics. The basic notion here is that of exchangeability. The random variables X and Y are called exchangeable if F ðx; yÞ ¼ P½X  x; Y  y ¼ P½Y  x; X  y ¼ F ð y; xÞ for all ðx; yÞ 2 ℝ2

½5

Formula [5] allows us to go from pðxjuÞ to pðujxÞ; thus, it is also known as the inverse probability law. The first factor on the right hand side (RHS) of [5] is called a prior, since it expresses the information we have about the parameter u prior to receiving the data value x. The conditional density p(u|x) expresses the updated, according to the Bayes’ theorem, distribution of u after receiving the data value x. Thus, it is the posterior, with reference to x, distribution of u. In practice, instead of a single data value x, we have a random sample {xi, i ¼ 1, .Q . ., n}, denoted for simplicity by x(n). Then, pðxðnÞ juÞ ¼ ni¼1 pðx i juÞ, and the Bayes’ formula takes the form pðujx ðnÞ Þ / pðuÞ

Z

pðzjx ðnÞ Þ ¼

pðxju; K ÞpðujK Þdu

n Y

pðx nþi juÞ

i¼1

Thus, the posterior of u, given the first dataset x(n) ¼ {xi, i ¼ 1, . . ., n}, is the prior for the next dataset  xðn Þ ¼ fxnþi ; i ¼ 1; :::; n g, and so on. Quite often, the involved parameter u consists of two parts, namely, u ¼ (z, h), where the parameter z is the one of interest, whereas the parameter h is a nuisance parameter. In such a case, we only have to integrate out the nuisance parameter h from the joint posterior of u ¼ (z,h). Specifically, the posterior of the parameter of interest z is

This means that the pair (X,Y) is distributed as the pair (Y, X ) and, thus, the order does not matter. Similarly, a finite sequence of random variables, {Xi, i ¼ 1,2, . . ., n} is called exchangeable if any permutation, ðXi1 ; . . . ; Xin Þ say, is distributed as (X1, . . . ,Xn). The notion of exchangeability is then extended to an infinite sequence of random variables {Xn, n ¼ 1,2, . . .}, say, by requiring that every finite subsequence of it is exchangeable. Let now {Xn, n ¼ 1,2, . . .} be an infinite sequence of binary random variables, each one of which is taking the value 1 or 0 according to the realization of an event (a success) or not (a failure). Suppose we make the statement that for any n 2 ℕ, the probability to have k successes in any subsequence of length n of X’s, fXi1 ; . . . ; Xin g, say, depends only on k and n. In others words, we assume that the random variables Xn (n ¼ 1,2, . . .) are exchangeable. Let Pk,n denote the above probability, that is, Pk;n ¼ P

" n X

# Xij ¼ k ; k ¼ 0; 1; . . . ; n ðn ¼ 1; 2; . . .Þ

j ¼1

Then, the above statement of exchangeability of X’s is tantamount to the following: There exists a probability measure P over the interval [0,1] such that Pk;n ¼ ¼

n k n k

!Z

1

yk ð1  yÞnk PðdyÞ

0

!Z

1

yk ð1  yÞnk pðÞ dy;

0

k ¼ 0; 1; . . . ; n ðn ¼ 1; 2; . . .Þ

where p(y) denotes the density of the parameter y, which represents the unknown probability of success (here, p is the derivative of P with respect to Lebesgue or counting measure over the interval [0,1]).

Thus, the existence of a prior distribution p(y), for the probability of success y, does not arise out of nothing, but it is a consequence of our general statement about X’s, that is, that of exchangeability. In de Finetti (1937), a generalization of the above exchangeability result for random variables taking values in ℝd has also been derived. Parametric, together with nonparametric, versions of de Finetti’s theorem can be found in Diaconis and Freedman (1980) and in Bernardo and Smith (1994). Specifically, under additional assumptions of minor importance, when the random variables Xm (m ¼ 1, 2, . . .) are exchangeable, there exists a parametric function F ðxjuÞ with u 2 Q such that the joint distribution Q of Xi (i ¼ 1, 2, . . ., m) has the following form: Q ðx 1 ; :::; x m Þ ¼

Z (Y m Q

ðnÞ

Lðu;x Þ ¼

n Y

Bayesian Statistical Analysis

39

(

#)

n

pðxi juÞ / fgðuÞg exp

’j ðuÞ

n X

hj ðx i Þ

i¼1

n o ¼ fgðuÞgn exp wðuÞT tðx ðnÞ Þ

where t(x(n)) ¼ (t1 (x(n)), ..., tk (x(n)))T with tj ðx ðnÞ Þ ¼

n X

hj ðxi Þ; j ¼ 1;:::; k

i¼1

The quantity t ¼ t(x(n)) is the sufficient statistic for u, in other words, it conveys the same information about u as the whole dataset x(n). Applying Bayes’ theorem now, the posterior has the same form as the prior. Specifically: ( pðujx ðnÞ Þ / fgðuÞg

F ðx i juÞ pðuÞd u; m ¼ 1; 2; ::::

"

j ¼1

i¼1

)

i¼1

k X

þn

exp

 ) n P ’j ðuÞ nj þ hj ðx i Þ

k P j ¼1

i¼1

n  o ¼ fgðuÞg exp wðuÞT n þ tðxðnÞ Þ n o ¼ fgðuÞgn exp wðuÞT nn ; u 2 Q þn

It is obvious that the converse statement is true in all the above cases. Exponential Family and Conjugate Priors A density pðxjuÞ belongs to the exponential family if it can be written as follows: ( pðxjuÞ ¼ f ðxÞgðuÞexp

k X

) wj ðuÞhj ðxÞ

j ¼1

n o ¼ f ðxÞgðuÞexp wðuÞT hðxÞ ; x 2 X  ℝd and u 2 Y  ℝm

½6

Then, the conjugate prior for the parameter u is of the form ( pðuÞ / fgðuÞg exp

k X

) nj jj ðuÞ

j ¼1

  ¼ fgðuÞg exp nT wðuÞ ;u 2 Q

with updated parameters, namely, n ¼  þ n and nn ¼ nþt (x (n)). In almost all textbooks on Bayesian statistics, there are extensive tables and results for priors within the exponential family together with their marginal posteriors, predictive distributions, and their limiting objective forms, cf. DeGroot (1970), Aitchison and Dunsmore (1975), Robert (2001), Gelman et al. (2003) and Press (2003).

½7

with parameters (hyperparameters)  and n ¼ (n1, ..., nk)T. We see, therefore, that when pðxjuÞ belongs to the exponential family, so does its conjugate prior p(u). Conjugate priors, although restrictive sometimes, have the advantage of great mathematical simplicity in deriving posteriors in closed form. An additional advantage is that limiting versions of conjugate priors correspond to the so-called objective priors. Objective priors are most often improper, since their integral over the whole parameter space is not equal to one. They are, therefore, incoherent, meaning that they do not satisfy the probability axioms. Despite this fact, their posteriors can be proper, and, thus, coherent. Objective priors play a very important role in objective Bayesian statistics. Let now x(n) ¼ {xi, i ¼ 1, . . ., n}, be a random sample from [6]. The likelihood for the parameter u is then

Bayesian Statistical Methods Bayesian Decision-Making Having derived the posterior for a parameter u, all information we have about it is included in p(u|x(n)). Thus, if we have to take a decision, the consequences of which are related with the true value of u, it is logical to make best use of the information we have about it, that is, to make use of p(u|x(n)). Within the decision-making context, the parameter u represents the state of nature, whereas the decision to be taken is called the action and denoted by a. The set of all actions under consideration is denoted by a. The loss implied by taking an action a, when the true state of nature is u, is defined by a loss function L(a, u), usually bounded from bellow. Since the true state of nature u is not known, associated with an action a is the Bayes’ expected loss RðaÞ ¼ E½Lða; uÞjx ðnÞ  ¼

Z

Q

Lða; uÞpðujxðnÞ Þdu

½8

It must be mentioned that when the action a affects the distribution of the parameter u, the above expectation has to be with respect to p(u|a, x(n)). According to the above setup,

40

Statistics

the best action is that associated with the lowest Bayes’ expected loss. Thus, a decision a* is optimal if and only if Rða Þ ¼ inf fRðaÞ : a 2 ag

Such an action a* is then called the Bayes’ action. In general, a* constitutes the Bayes’ rule. For an extensive presentation of statistical decision theory, see DeGroot (1970), Berger (1985), and Lindley (1985).

Thus, the multinomial distribution is apparently within the exponential family and its conjugate prior is of the form pðuÞ ¼ k 

m Y j ¼1

a 1

yj j

; u 2 S m with aj > 0 ð j ¼ 1; . . . ; mÞ ½11

known as the Dirichlet distribution and denoted by D(a). The normalizing constant k above is given by (Z k kðaÞ ¼

Parameter Estimation When in place of the action space a we have the parameter space Q, then an action is a choice of a particular value, say ~ for the parameter u. In other words, the Bayesian estimau, tion problem is a decision problem, where the Bayes’ rule gives the optimal estimate for u, under the considered loss ~ uÞ. According to [8], the Bayes’ estimate for u function Lðu; is the quantity minimizing the posterior expected risk: ~ ¼ E½Lðu; ~ uÞjx ðnÞ  ¼ RðuÞ

Z

Y

~ uÞpðujxðnÞ Þ du Lðu;

½9

It can be easily shown that with the quadratic loss function ~ uÞ ¼ jju ~  ujj2 , the optimal estimate for u is its posteLðu; rior mean, namely, E(u|x(n)). Thus, we have ~ E½ujx ðnÞ  ¼ argmin RðuÞ ~ ðu2QÞ

Z

¼ argmin ~ ðu2QÞ

Q

~  ujj2 pðujxðnÞ Þ du jju

~ uÞ for It is well known that the class of loss functions Lðu; which the Bayes’ estimate of a parameter u coincides with the posterior mean, is very large. In fact, any loss function that belongs to the class of Bregman divergences has this property. The converse is also true. For the definition and applications of Bregman Divergences in statistics, see Kokolakis et al. (2006). Estimation of the multinomial parameter

Let Y be a discrete random variable taking values from the set P {1,2, . . ., m} with probabilities yj ( j ¼ 1, . . ., m), where mj¼1 yj ¼ 1. In what follows, for notational convenience, we introduce an m-variate random vector X ¼ (X1, . . ., Xm) such that Xj ¼ 1 when Y ¼ j and Xj ¼ 0 otherwise (j ¼ 1, . . ., m). Then, the probability distribution of X, given the parameter u, can be written as follows: pðxjuÞ ¼

m Y j ¼1

with (

(

x yj j

¼ exp

m X

m X

)

xj log yj ; xj 2 f0; 1g

xj ¼ 1 and

j ¼1 m

m X

) yj ¼ 1

j ¼1

m

½10

j ¼1

u 2 S ¼ u 2 ℝ : yj  0 ð j ¼ 1; :::; mÞ with m

m Y

The set S above is known as the (m1)-simplex.

S m j ¼1

)1 a 1 yj j du

m X GðaÞ with a ¼ aj Gða1 Þ . . . Gðam Þ j ¼1

¼

where

Z

1

GðaÞ ¼

½12

t a1 e 1 dt ; a > 0

0

is the gamma function, satisfying the following recursive relation: Gða þ 1Þ ¼ aGðaÞ for a > 0 with Gð1Þ ¼ 1

Note that the flat noninformative prior corresponds to taking: a1 ¼ a2 ¼,. . ., ¼ am ¼ 1. It can be easily seen that in order to find the mean of the above distribution, a renormalization procedure has to be applied and, thus, the mean will be expressed as the ratio of two normalizing constants. The same is true for any moment of the above distribution. Specifically, with n ¼ (n1, n2,. . .,nm)T, we have "

E

m Y j ¼1

#

n yj j

¼

kðaÞ ; nj ¼ 0; 1; 2; . . . ð j ¼ 1; . . . ; mÞ kða þ nÞ

and, thus, E½yj  ¼

aj aj ða  aj Þ ; Var½yj  ¼ 2 ; and a ða þ 1Þ a aj ak ; ð j 6¼ kÞ Cov½yj ; yk  ¼  2 a ða þ 1Þ

½13

Let x(n) ¼ {xi, i ¼ 1, . . ., n} be a random sample from the multinomial distribution [10]. Then, the likelihood function is Lðu; x ðnÞ Þ ¼ pðx ðnÞ juÞ ¼

n Y

pðx i juÞ ¼

m Y j ¼1

i¼1

n

yj j

P where n ¼ (n1, . . ., nm)T with nj ¼ ni¼1 xij ; j ¼ 1; . . . ; m, the sufficient statistics for the multinomial parameter u, P and mj¼1 nj ¼ n. Multiplying by the prior [11], we get pðujx ðnÞ Þ ¼ kðaðnÞ Þ

m Y j ¼1

ðnÞ

a 1

yj j

; u 2 Sm

where k(a(n)) is as in [12] with updated parameters a(n) ¼ a þ n and a(n) ¼ a þ n.

Bayesian Statistical Analysis

Using the quadratic loss function, the optimal estimate for u is its posterior mean. With the updated parameter a(n), the first result in [13] implies that E½yj jx ðnÞ  ¼

ðnÞ aj =aðnÞ

aj þ nj ¼ ; j ¼ 1; . . . ; m aþn

½14

It is interesting to write the posterior mean in the following form: E½ujx ðnÞ  ¼

a n

E½u þ

X n aþn aþn

 ðnÞ aj ðnÞ aðnÞ  aj

a2n ðan þ 1Þ ðaj þ nj Þða  aj þ n  nj Þ ¼ ða þ nÞ2 ða þ n þ 1Þ ð! 0 as n ! 1Þð j ¼ 1; . . . ; nÞ

could be a ¼ b ¼ 1, resulting in the flat noninformative prior pðÞ ¼ 1; 0    1

The above prior had been used by both Bayes (1763) and Laplace (1785) in their demonstration of the inverse probability law, of which Laplace was aware independently of Bayes.

½15

that is, the posterior mean of u is a convex combination of the prior mean E[u] and its sample mean X n . The mean squared errors of the above estimates are given by the associated risks, which, with the quadratic loss function used here, coincide with the posterior variances. Thus, using the second result in [13], we have Var½yj jx ðnÞ  ¼

41

½16

ItPis obvious, therefore, that when the quantity a ¼ Pmj¼1 aj is very small compared to the sample size n ¼ mj¼1 nj , the above estimates of the components of the parameter vector u are almost equal to their frequentist estimates and they agree completely when aj ¼ 0 ( j ¼ 1, . . ., m). Such a choice of hyperparameter values implies an improper prior that nevertheless has a proper posterior provided that nj  1 ( j ¼ 1, . . ., m). In addition, from the limiting behavior of the above estimates (cf. [14]–[16]), we conclude that, under the assumption of the exchangeability of X ’s, as n ! 1, the posterior mean of the parameter u will coincide with its actual value, whatever it is. A remarkable property of the Dirichlet family of priors is the following. Let u be Dirichlet distributed with parameter vector a. With l < m, let I ¼ {I P1, . . ., Il} be a partition of the set {1, . . ., m} and ’i ¼ Ii yj ði ¼ 1; . . . ; lÞ. Then, the random vector w ¼ (’1, . . ., ’l)T is Dirichlet distributed T with P parameter vector b ¼ (b1, . . ., bl) , where bi ¼ Ii aj ði ¼ 1; . . . ; lÞ. We conclude, therefore, from the above result that the requirement for coherence does not allow the use of objective flat D(1, . . ., 1) priors for both the parameters u and w. From the above result, it is obvious that the marginal distribution of y y1 is the univariate Dirichlet with P parameters a ¼ a1 and b ¼ mj¼2 aj , known as the Beta(a, b) distribution. When dealing with binomial data, a reasonable choice for the hyperparameter values a and b

Application For demonstration purposes, we present the following application. We analyze the data in Table 1 which refer to the grades (response variable) of a random sample of firstyear successful university students in relation to the parents’ educational level (explanatory variable). We have here three cell probability vectors, namely, ui ¼ (yi1, yi2, yi3) (i ¼ 1,2,3), corresponding to the three categories of parents’ educational level. We assume that the parameters u1, u2, and u3 are independent, Dirichlet distributed with a common hyperparameter vector a ¼ (a1, a2, a3). With N the 3 3 matrix representing the above contingency table and Q the 3 3 matrix representing the corresponding cell probabilities, the likelihood function is LðQ; N Þ /

3 Y 3 Y i¼1 j ¼1

n

yijij

Thus, pðQjN Þ / pðQÞpðN jQÞ /

3 Y 3 Y i¼1 j ¼1

a þnij 1

yiji

and, therefore, pðQjN Þ ¼

3 Y

Dðui jai þ ni Þ

½17

i¼1

We conclude, therefore, that the cell probability vectors ui (i ¼ 1, 2, 3), conditional on N, are again independent, Dirichlet distributed with updated hyperparameter vectors a(n) i ¼ a þ ni (i ¼ 1, 2, 3), respectively. For the common hyperparameter vector a, we assume that its components a1, a2, and a3 are all equal to 1,

Table 1 Numbers of first-year successful university students according to grades received and parents’ educational level Grades Parents’ educational level

A

B

C

Elementary Secondary University

3 7 8

32 59 93

50 63 134

42

Statistics

40

20

20

30

15

15

20

10

10

10

5

5

0

0

0

0.0 0.2 0.4 0.6 0.8 (a)

0.0 0.2 0.4 0.6 0.8 (b)

0.0 0.2 0.4 0.6 0.8 (c)

Figure 1 Posterior distributions for cell probabilities yij (i, j ¼ 1, 2, 3). Continuous lines for elementary parents’ educational level (i ¼ 1), dashed lines for secondary level (i ¼ 2), and dot lines for university level (i ¼ 3).

implying an objective flat prior for each one of the parameter vectors ui (i ¼ 1, 2, 3). Their posterior distributions, as defined by [17], are presented in Figure 1. The three cases (a), (b), and (c) correspond to the grade categories A, B, and C, respectively. Specifically, in case (a), we have the posteriors of the cell probabilities yi1 (i ¼ 1, 2, 3); in case (b), those of the cell probabilities yi2 (i ¼ 1, 2, 3); and in case (c), those of the cell probabilities yi3 (i ¼ 1, 2, 3). Comparing the three graphs within each grade category, we may realize that there is no serious discrepancy between the posterior distributions, apart from the fact that the parents’ category secondary education, presented with dashed lines, shows a slightly better response (higher located posteriors for A and B grades combined with a lower located posterior for C grades). For sensitivity analysis, several values for the components of the hyperparameter a have been considered in the range (0.5, 5.0), with step 0.5, and all were found to be in agreement with the above conclusion. Alternative methods for analyzing contingency tables are based on Hierarchical models and generalized linear model (cf. Gelman et al., 2003). For a rather extensive presentation of Bayesian models in educational research, see Novick and Jackson (1974) and Bock (1989). Interval Estimation For a scalar parameter y whose posterior p(y|x(n)) can be found in closed form, it is a rather easy task to find an interval (a, b) such that 1 F ðaÞ ¼ 1  F ðbÞ ¼ ð1  bÞ 2

where F ðaÞ ¼ P½y  ajx ðnÞ  ¼

Z

a

1

½18

pðyjx ðnÞ Þdy

and b a prespecified number in the interval (0, 1). Such a region is called the credibility interval for the parameter y

with credibility level b. This credibility interval is not necessarily the shortest one. In order to be the shortest, it has to contain the highest posterior density values. Then, it is called the highest posterior density credibility interval. It is obvious that, when the posterior is unimodal and symmetric, the credibility interval defined by [18] coincides with the shortest one. Usually, we simulate data from the posterior p(y|x(n)) using MCMC methods. Then, we only have to discard the most extreme values at appropriate equal proportions, and the region which is covered by the remaining data, is the required credibility interval.

Hypothesis Testing According to the Bayesian approach, the hypothesistesting problem, either for simple (sharp) hypotheses or for composite hypotheses, can also be treated within a decision-theoretic context. It must be mentioned here that the hypothesis-testing problem is not of prime interest within the Bayesian approach, mainly because it oversimplifies the problem of inference by requiring an answer of type yes’ or no’. Suppose that the parameter space Q is partitioned into two subsets, say Q0 and Q1. Specifically, we assume: Q ¼ Q0 [ Q1 with Q0 \ Q1 ¼Ø. Let the null hypothesis H0 states that u 2 Q0 while the alternative H1 states that u 2 Q1. Suppose, in addition, that the loss is zero when accepting the right hypothesis, while the loss implied by making a type-I error (rejecting the null hypothesis H0 when it is true) is a multiple a of the loss implied by making a type-II error (rejecting the alternative hypothesis H1 when it is true). Then, the Bayes’ rule states the following: P½Q1 jx ðnÞ  Reject H0 iff P½Q0 jx ðnÞ  Z Z pðujx ðnÞ Þd u= pðujx ðnÞ Þdu> a ¼ Q1

Q0

½19

Bayesian Statistical Analysis

The above ratio is the posterior odds ratio against H0 (and in favor of H1). Consider now two special cases. Case a: Suppose that both the sets Q0 and Q1 are single-point sets, that is, Q0 ¼ {u0} and Q1 ¼ {u1}, with prior probabilities p0 and p1 ¼ 1  p0, respectively. Then, the Bayes’ rule [19] becomes Reject H0 iff

p1 pðx ðnÞ ju1 Þ >a p0 pðx ðnÞ ju0 Þ

The ratio p1/p0 represents the prior odds ratio against H0, while the quantity Bðx ðnÞ Þ ¼ pðx ðnÞ ju1 Þ=pðx ðnÞ ju0 Þ

is called the Bayes’ factor against H0. Thus, the Bayes’ rule becomes: Reject H0 iff fprior odds against H0 g

fBayes factor against H0 g > a

½20

Case b : Suppose that the null hypothesis is a simple one, say H0: u ¼ u0, while the alternative is the composite one H1: u 6¼ u0. Suppose, as before, that the null hypothesis has a prior probability p0. Then, the alternative has a prior probability p1 ¼ 1  p0 and the Bayes’ rule is Z

Reject H0 iff p1

Q1

pðx ðnÞ juÞdu=p0 pðx ðnÞ ju0 Þ > a

with Q1 ¼ Q  {u0}. It must be mentioned here that using positive probabilities for sharp hypotheses as the above was suggested by Jeffreys (1961).

Extensions Mixtures of Models and Model Selection Suppose that, according to our present state of knowledge K , there exists a set M ¼ f1; 2; . . . ; Mg of possible models from one of which the random sample x(n) ¼ {xi, i ¼ 1,. . ., n} has been derived. Let M be the discrete random variable defining the actual model. Then, we have pðx ðnÞ Þ ¼

M X

pðx ðnÞ jmÞpðmÞ

m¼1

pðmjx ðnÞ Þ / pðx ðnÞ jmÞpðmÞ; m ¼ 1; . . . ; M

43 ½22

with proportionality constant k defined by k1 ¼

M X

pðx ðnÞ jmÞpðmÞ

m¼1

The posterior odds for the model m against the model m0 , say, are as follows: pðmjx ðnÞ Þ pm

Bm;m 0 ðx ðnÞ Þ; m 6¼ m 0 ¼ 1; . . . ; M ¼ 0 ðnÞ pðm jx Þ pm 0

where Bm,m 0 (x(n)) is the Bayes’ factor for the model m against the model m0 , that is, Bm;m 0 ðxðnÞ Þ ¼ pðx ðnÞ jmÞ=pðx ðnÞ jm 0 Þ

When using Bayes’ factors, it is interesting to see that the posterior probability of the model m can be written under the following form: ( ðnÞ

pðmjx Þ ¼

M X pm 0 m 0 ¼1

pm

)1 ðnÞ

Bm 0 ;m ðx Þ

; m ¼ 1; . . . ; M

½23

Thus, the model selection will be based on the result [22] or on its equivalent version [23]. The model selection procedures usually tend to favor models of higher complexity, that is, those with the larger number of parameters. This is a consequence of the fact that the average amount of information provided by an experiment is non-negative (cf. Lindley, 1956; Kokolakis, 1981; Kokolakis, 1985). Although proper priors protect from overfitting, a problem which is very common within the frequentist approach, it might not be suitable to estimate a very large number of parameters using a rather small sample size. In such a case, a cross-validation based on the predictive distributions (cf. Draper, 1995), or the, rather conservative, Bayesian information criterion (BIC), due to Schwarz (1978), might be applied. The latter penalizes the model for the number of parameters it has. Specifically, the model to be chosen is that which maximizes the quantity BIC ¼ log max Lðum ; x ðnÞ Þ  ½ km logkm

where km is the number of parameters within model m, that is, the dimension of um.

with pðx ðnÞ jmÞ ¼

Z Qm

pðx ðnÞ jum ; mÞpðum jmÞdum ðm ¼ 1; . . . ; MÞ ½21

the predictive distribution of x(n) when it comes from model m, and um (m ¼ 1, . . ., M), the model-specific parameters. Applying Bayes’ theorem, the model posterior probabilities will be

Random Hyperparameters and Hierarchical Structures Suppose that, according to our present state of knowledge K , the hyperparameter a, say, in the prior distribution of u, is also a random quantity with its own prior distribution. Let a be the set from which the hyperparameter a takes values. Then, we have the following results:

44

Statistics

pðx ðnÞ Þ ¼

with pðx ðnÞ jaÞ ¼

Z Y

Z a

pðx ðnÞ jaÞpðaÞda

pðx ðnÞ ; ujaÞdu ¼

Z Y

pðx ðnÞ ja; uÞpðujaÞdu

½24

the, conditional on a, predictive distribution of x(n). The required posteriors are then given by pðajx ðnÞ Þ / pðaÞpðx ðnÞ jaÞ ¼ pðaÞ

Z

Q

pðx ðnÞ ja; uÞpðujaÞdu ½25

and pðuja; x ðnÞ Þ / pðujaÞpðx ðnÞ ju; aÞ

½26

Suppose now that the dataset x consists of k subsets, namely, xðni Þ ¼ fxij ; j ¼ 1; . . . ; ni g(i ¼ 1, . . ., k), coming from k populations, each one with its own parameter vector ui, a sub-vector of the collective parameter vector ðnÞ ðn1 Þ ðnk Þ u above. Pk We have, therefore, x ðn¼i Þ ðx ; . . . ; x Þ with n ¼ i¼1 ni . We may think of x ði ¼ 1; . . . ; kÞ as random samples of the same product that come from k production lines. Now, consider the following hierarchical structure. The random samples xðni Þ (i ¼ 1, . . ., k), conditional on ui i ¼ 1, . . ., k, respectively, are independent, and the parameters ui i ¼ 1, . . ., k, conditional on the hyperparameter a, are independent and identically distributed according to p(|a). In other words, the random samples xðni Þ ði ¼ 1; . . . ; kÞ are conditionally independent and the parameters ui (i ¼ 1, . . ., k), are exchangeable. Then, in place of [24]–[26] above, we have (n)

pðx ðnÞ jaÞ ¼

k Z Y i¼1

pðajx ðnÞ Þ / pðaÞ

Y

pðx ðni Þ ja; ui Þpðui jaÞdui

k Z Y i¼1

pðuja; xðnÞ Þ /

Y

k Y

pðx ðni Þ ja; ui Þpðui jaÞdui

½27

pðui ja; x ðni Þ Þpðx ðni Þ ja; ui Þ

½28

½29

i¼1

The above hierarchical structure was introduced by Lindley and Smith (1972) in their seminal work on the Bayesian linear models. For a rather extensive range of applications of hierarchical structures in linear and generalized linear models, see Gelman et al. (2003) and Press (2003). See also: Markov Chain Monte Carlo.

Bibliography Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge, UK: Cambridge University Press.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society London V61.53, 370–418. [Reprinted with biographical note by G. A. Barnard in Biometrika 45, 293–315 (1958)]. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn. New York: Springer. Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. New York: Wiley. Bett, R. (1989). Carrneades’ Pithanon: A reappraisal of its role and status. Oxford Studies in Ancient Philosophy 7, 59–94. Bock, R. D. (1989). Multilevel Analysis of Educational Data. New York: Academic Press. De Finetti, B. (1937). La pre´vision: Ses lois logiques, ses sources subjectives. Annalles de l’Institut H. Poincare´ 7, 1–68. [Translated and reprinted in Kyburg, H. and Smokler, H. (eds.) (1964). Studies in subjective probability. pp 93–158. New York: Wiley]. De Finetti, B. (1974, 1975). Theory of Probability (2 vols). New York: Wiley. DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: McGraw-Hill. Diaconis, P. and Freedman, D. (1980). De Finetti’s generalizations of exchangeability. In Jeffrey, R. (ed.), Studies in Inductive Logic and Probability 2, 233–249. Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society B 57, 45–97 (with discussion). Fishburn, P. C. (1986). The axioms of subjective probability. Statistical Science 1, 335–358. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis, 2nd edn. London: Chapman and Hall. Kokolakis, G. E. (1981). On the expected probability of correct classification. Biometrica 68(2), 477–483. Kokolakis, G. E. (1985). A martingale approach to the problem of dimensionality and performance of Bayesian classifiers. Asymptotic results. Communications in Statistics – Theory and Methods 14, 927–935. Kokolakis, G., Nanopoulos, Ph., and Fouskakis, D. (2006). Bregman divergences in the (m k)-partitioning problem. Computational Statistics and Data Analysis 51, 668–678. Laplace, P. S. (1785). Me´moire sur les formules qui sont fonctions de tre`s grands nombres. Me´moires de l’Acade´mie Royal des Sciences 21, 1–88. Lindley, D. V. (1956). On a measure of the information provided by an experiment. Annals of Mathematical Statistics 27, 986–1005. Lindley, D. V. (1965). Introduction to Probability and Statistics, Part 2: Inference. Cambridge: Cambridge University Press. Lindley, D. V. and Smith, A. F. M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society B 34, 1–41 (with discussion). Lindley, D. V. (1985). Making Decisions, 2nd edn. Chichester: Wiley. Lindley, D. V. (2000). The philosophy of statistics. Statistician 49, 293–337 (with discussion). Novick, M. R. and Jackson, P. H. (1974). Statistical Methods for Educational and Psychological Research. New York: McGraw-Hill. Press, S. J. (2003). Subjective and Objective Bayesian Statistics. New York: Wiley. Robert, C. P. (2001). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd edn. New York: Springer. Savage, L. J. (1954). The Foundations of Statistics. New York: Wiley. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Further Reading Fox, J. P. and Glas, C. A. W. (2001). Bayesian estimation of a multi-level IRT model using Gibbs sampling. Psychometrika 66, 271–288. Jeffreys, H. (1961). Theory of Probability, 3rd edn. Oxford: Clarendon. Lewis, C. and Sheehan, K. M. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement 14, 367–386.

Bayesian Statistical Analysis Rubin, D. B. (1983). Some applications of Bayesian statistics to educational data. Statistician 32, 55–68. Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement 42(4), 375–394.

45

Sinharay, S. (2006). Bayesian statistics in educational measurement. In Upadhyay, S. K., Singh, U., and Dey, D. K. (eds.) Bayesian Statistics and its Applications, pp 422–437. van der Linden, W. J. (1998). Bayesian item-selection criteria for adaptive testing. Psychometrika 63, 201–216.