Size-biased sampling and discrete nonparametric Bayesian inference

Journal of Statistical Planning and Inference 128 (2005) 123 – 148

www.elsevier.com/locate/jspi

Andrea Ongaro*
Dipartimento di Statistica, Università di Milano-Bicocca, Edificio U7, Via Bicocca degli Arcimboldi 8, Milano 20126, Italy
Received 25 June 2003; accepted 29 October 2003. doi:10.1016/j.jspi.2003.10.005
* Tel.: +39-2-64487375; fax: +39-2-64487390. E-mail address: [email protected] (A. Ongaro). This work was supported by Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica under grant: Nonparametric Bayesian methods and their applications (2002).

Abstract

In this paper, we establish the connection between two different topics, i.e. size-biased sampling schemes and Bayesian updating mechanisms for a general class $\mathcal{P}$ of discrete nonparametric priors. By exploiting this connection, we are able to use size-biased sampling theory to find representations of the class particularly suitable for applications to inference problems and to derive new general results about its posterior and predictive distributions and about the properties of a sample from $\mathcal{P}$. The potential of the approach is illustrated via an application to the Dirichlet process and an investigation of a new class of symmetric priors.
© 2003 Elsevier B.V. All rights reserved.

MSC: primary 62G99; secondary 60G57
Keywords: Discrete random probability measure; Random relabelling scheme; Dirichlet process

1. Introduction

This paper brings together two different streams of research: random relabelling schemes of groups of a population and Bayesian nonparametric inference. In particular, we aim at establishing the connections between size-biased sampling and the Bayesian updating mechanism for a general class $\mathcal{P}$ of discrete nonparametric priors.

On one side, size-biased sampling can be informally described as follows. Consider a population whose individuals can be classified into a countable number of groups, each group having a given frequency $\pi_i$.


We randomly sample from the population, assigning an integer label to the group to which each sampled individual belongs. Labels are assigned to groups in the order in which they appear in the sampling process. This defines a random relabelling (permutation) of the groups and of the associated frequencies. This sampling scheme and the related theory have applications in many fields: population genetics (see Ewens, 1979, 1990), where the interest lies in allele frequencies; mathematical ecology (see Engen, 1979), where the interest lies in species abundances; and various probabilistic contexts such as heap processes (Donnelly, 1991), recursive splitting of intervals (Lloyd and Williams, 1988) and the cycle structure of random permutations (Billingsley, 1999).

On the other side, we shall consider Bayesian nonparametric inference problems where a prior distribution $P$ is placed on the set of all distributions on the sample space, i.e. $P$ is a random probability measure. In this setting, a general class of discrete priors can be constructed by means of countably many random weights assigned to random locations. More specifically, we shall assume that an element $P\in\mathcal{P}$ has the form $P=\sum_{i=1}^{\infty}\pi_i\,\delta_{\phi_i}$, where $\delta_x$ indicates the probability measure giving mass 1 to $x$, and the locations $\phi_i$'s are independent and identically distributed and independent of the weights $\pi_i$'s.

The two settings come into relation if one thinks of the countably many random weights $\pi_i$'s defining the prior as (random) frequencies of groups of an underlying population, namely the population of the locations $\phi_i$'s. Starting from this interpretation, one can reach a close parallel between size-biased sampling from a population with group frequencies $\pi_i$'s and computing the posterior distribution of the class $\mathcal{P}$.

The main purposes of this paper are: (1) to formalize the link between size-biased sampling and the Bayesian updating mechanism for the class $\mathcal{P}$ of discrete nonparametric priors; (2) to find representations of the class $\mathcal{P}$ particularly suitable for applications to inference problems; (3) to exploit size-biased sampling theory in order to gain insight into, and to derive new general results about, the posterior and predictive distributions of $\mathcal{P}$. The flavour will be on presenting a unifying framework providing the machinery needed for tackling inference problems rather than on focusing the attention on specific applications of the class. In this spirit, no detailed investigations will be devoted, for example, to the problem of choosing a particular member of the class or of developing computational procedures for fitting the models.

The contents of the paper are as follows. In Section 2, we introduce the class $\mathcal{P}$. The basic definitions and results about size-biased sampling are presented in Section 3. Section 4 establishes the connection between size-biased sampling and the prior-to-posterior calculations, whereas in Section 5 a particular representation of the class $\mathcal{P}$, called canonical form, is introduced, leading to a very simple expression of the posterior distribution. In the same section we also study properties of a sample taken from $\mathcal{P}$, deriving in particular the predictive distribution, the probabilities of various configurations of ties and the mean number of distinct observations. Section 6 presents an application of the derived results: it centres on the Dirichlet process and its two-parameter generalization and on a general class of symmetric processes. Finally, Section 7 discusses our findings.


2. The class $\mathcal{P}$ of discrete nonparametric priors

We shall consider the following general class $\mathcal{P}$ of discrete nonparametric prior distributions. Let $P_0$ be a diffuse probability measure on an arbitrary sample space $(\mathbb{X},\mathcal{A})$, i.e. $P_0(\{x\})=0$ for all $x\in\mathbb{X}$. Then, a random discrete probability measure $P$ belongs to $\mathcal{P}$ if it can be represented as
$$P=\sum_{i=1}^{\infty}\pi_i\,\delta_{\phi_i},\qquad(1)$$

where $\pi=(\pi_1,\pi_2,\ldots)$ is a sequence of random weights taking values on the infinite-dimensional simplex $S=\{x\in\mathbb{R}^{\infty}:\ \sum_{i=1}^{\infty}x_i=1,\ x_i\ge 0,\ i=1,2,\ldots\}$. Here $\pi$ has an arbitrary distribution $Q$, whereas $\phi_1,\phi_2,\ldots$ are independent and identically distributed (i.i.d.) random elements with common distribution $P_0$, independent of $\pi$. The Dirichlet process is a special case, corresponding to a particular choice of the distribution of the $\pi_i$'s, as discussed in detail in Section 6.1. Two features essentially set $\mathcal{P}$ apart from the general class of discrete random probability measures: (1) the random positions $\phi_i$'s are independent of the associated random weights; (2) the positions are generated by i.i.d. random elements from a diffuse measure. Notice that no restrictions are imposed on the distribution of the weights.

Several properties of the class $\mathcal{P}$ can be found in Ongaro and Cattaneo (2002), who make use of a slightly different, although equivalent, definition of the class. In particular, they derive formulas for the joint moments of the random probabilities $P(A_i)$, $A_i\in\mathcal{A}$, and for the marginal distribution of the observations. Furthermore, they discuss the problem of determining the support of the class, providing general sufficient conditions under which elements of $\mathcal{P}$ have a genuine nonparametric support. They also derive recursive formulas for computing the posterior, which involve a rather complicated mixture representation.

Representation (1) allows for a very rich and flexible structure, and many relevant classes of nonparametric prior distributions proposed in the literature can be viewed as instances of the class $\mathcal{P}$ (see Ishwaran and James, 2001 and references therein for an extensive list of such priors). Besides the Dirichlet process, we mention: the two-parameter Poisson-Dirichlet process, the beta two-parameter process, stick-breaking priors and other generalizations of the Dirichlet process (cf. Hjort, 2000; Hjort and Ongaro, 2003).

It becomes powerful for the following developments to note that representation (1) of an element of $\mathcal{P}$ is not unique: any random permutation of the $\pi_i$'s will leave the distribution of $P$ unaffected, as stated in the following proposition. Hereafter, we shall always adopt the convention that the random positions $\phi_i$'s are i.i.d. from $P_0$ and independent of all the other random elements defining the random probability measure under consideration.

Proposition 1 (Random permutation). Let us say that $\pi'$ is a random permutation of a sequence of random weights $\pi$ if
$$\pi'=(\pi_{U_1},\pi_{U_2},\ldots),$$


where $U=(U_1,U_2,\ldots)$ is a random permutation of $\mathbb{N}$ whose distribution is allowed to depend on $\pi$. If $\pi'$ is a random permutation of $\pi$, then
$$P=\sum_{i=1}^{\infty}\pi_i\,\delta_{\phi_i}\ \sim\ P'=\sum_{i=1}^{\infty}\pi'_i\,\delta_{\phi_i}.$$

Proof. It is enough to note that, a.s., $P\,|\,\pi \sim P'\,|\,\pi,U$.

A natural question is then whether there exists a representation of the random weights which is particularly suitable to deal with quantities of interest in inference problems. Size-biased sampling turns out to provide the appropriate framework to tackle the question.

3. Size-biased sampling

This section presents all the definitions and results relative to size-biased sampling which will be needed for application in the nonparametric Bayesian setting based on the class $\mathcal{P}$. Our main technical reference throughout the section will be Donnelly (1991).

Let $\pi=(\pi_1,\pi_2,\ldots)$ be a sequence of random weights, that is, a random element with values on the infinite-dimensional simplex $S$. The size-biased sampling $\pi^s=(\pi_1^s,\pi_2^s,\ldots)$ of $\pi$ can be informally described as the following random permutation of $\pi$: $\pi_1^s$ is equal to $\pi_i$ with probability $\pi_i$; having chosen $\pi_1^s=\pi_i$ as above, $\pi_2^s$ is equal to $\pi_j$, $j\ne i$, with conditional probability $\pi_j/(1-\pi_i)$; given $\pi_1^s$ and $\pi_2^s$, $\pi_3^s$ is equal to $\pi_l$, $l\notin\{i,j\}$, with conditional probability $\pi_l/(1-\pi_i-\pi_j)$, and so on. If $\pi_i$ represents the random frequency of the $i$th group of a certain population, in the new labelling the $i$th group will become the first one with probability $\pi_i$; conditionally on this first choice, the $j$th group, $j\ne i$, will become the second one with probability $\pi_j/(1-\pi_i)$, and so on.

Before formally defining size-biased sampling, we need to introduce some preliminary notation and definitions. Let $x=(x_1,x_2,\ldots)$ be a sequence of real numbers. Define $N^+(x)$ as the number of strictly positive components of $x$, and define $x^{(i_1,\ldots,i_n)}$ as the sequence obtained by moving the entries $i_1,\ldots,i_n$ of $x$ to the first $n$ positions. Formally,
$$x^{(i_1,\ldots,i_n)}=\big(x_{i_1},\ldots,x_{i_n},x_{J_1(i_1,\ldots,i_n)},x_{J_2(i_1,\ldots,i_n)},\ldots\big),\qquad(2)$$
where $J_k(i_1,\ldots,i_n)$ is the $k$th smallest integer different from $i_1,\ldots,i_n$; that is, $J_0=0$ and, for $k=1,2,\ldots$,
$$J_k(i_1,\ldots,i_n)=\inf\{l\in\mathbb{N}:\ l>J_{k-1}(i_1,\ldots,i_n)\ \text{and}\ l\notin\{i_1,\ldots,i_n\}\}.\qquad(3)$$

For notational convenience, we define $J_k(\emptyset)=k$, so that $x^{(\emptyset)}=x$, where $\emptyset$ denotes the empty set. Conditionally on $\pi$, a random permutation $T=(T_1,T_2,\ldots)$ of the integers $\mathbb{N}$ can be defined in the following way. Let $I_1,I_2,\ldots$ be i.i.d. integer-valued random variables such that
$$\Pr(I_j=i\,|\,\pi)=\pi_i,\qquad i=1,2,\ldots.\qquad(4)$$


If $N^+(\pi)=\infty$, let $L_i$, $i=1,2,\ldots$, be the first time the $i$th new value appears in the sequence $I_1,I_2,\ldots$; formally, $L_1=1$ and, for $i=2,3,\ldots$,
$$L_i=\inf\{j\in\mathbb{N}:\ j>L_{i-1}\ \text{and}\ I_j\notin\{I_1,\ldots,I_{j-1}\}\}.\qquad(5)$$
It is easy to check that $L_i<\infty$ a.s. Furthermore, $T=(I_{L_1},I_{L_2},\ldots)$ defines a random permutation of $\mathbb{N}$ such that, for distinct integers $i_1,\ldots,i_n$,
$$\Pr(T_1=i_1,\ldots,T_n=i_n\,|\,\pi)=\pi_{i_1}\,\frac{\pi_{i_2}}{1-\pi_{i_1}}\cdots\frac{\pi_{i_n}}{1-\pi_{i_1}-\cdots-\pi_{i_{n-1}}}.\qquad(6)$$
Suppose now $N^+(\pi)=n$ for some $n\ge 1$. Then only the first $n$ $L_i$'s are finite. In this case, the random permutation $T=(T_1,T_2,\ldots)$ is given by
$$T=\big(I_{L_1},\ldots,I_{L_n},\,J_1(I_{L_1},\ldots,I_{L_n}),\,J_2(I_{L_1},\ldots,I_{L_n}),\ldots\big),\qquad(7)$$

where $J_k$, $k\ge 1$, is defined as in (3). We are now fully equipped to approach size-biased sampling and its variants.

Definition 1 (Size-biased sampling). Let $\pi$ be a sequence of random weights on $S$. Then the sequence of random weights $\tilde\pi=(\tilde\pi_1,\tilde\pi_2,\ldots)$ is a size-biased sampling of $\pi$, denoted by SBS$(\pi)$, if
$$\Pr(\tilde\pi\in A\,|\,\pi)=\Pr\big((\pi_{T_1},\pi_{T_2},\ldots)\in A\,|\,\pi\big),$$
where $A$ is an arbitrary measurable subset of $S$.

Remark 1. Conditionally on $\pi$, and therefore also unconditionally, $\tilde\pi$ is a sequence of random weights on $S$. The conditional distribution of $\tilde\pi\,|\,\pi$ can be equivalently described as follows. If $N^+(\pi)=\infty$, then all the components of $\tilde\pi\,|\,\pi$ are strictly positive; they are a random permutation of the positive components of $\pi$ and their distribution is given by
$$\Pr\big((\tilde\pi_1,\ldots,\tilde\pi_n)\in A_n\,|\,\pi\big)=\sum_{i_1,\ldots,i_n\ \mathrm{distinct}}I_{A_n}(\pi_{i_1},\ldots,\pi_{i_n})\,\pi_{i_1}\,\frac{\pi_{i_2}}{1-\pi_{i_1}}\cdots\frac{\pi_{i_n}}{1-\pi_{i_1}-\cdots-\pi_{i_{n-1}}},$$
where $n\ge 1$, $A_n$ is a suitable measurable set and $I_{A_n}(\cdot)$ denotes the indicator function of the set $A_n$. If $N^+(\pi)=n$, then, conditionally on $\pi$, $\tilde\pi_1,\ldots,\tilde\pi_n$ are a random permutation of the $n$ positive components of $\pi$ with distribution given by the above formula; furthermore $\tilde\pi_k=0$ for $k>n$.

Another important concept is that of $k$th-order partial size-biased sampling, which can be thought of as a size-biased sampling stopped at the $k$th step. Roughly speaking, this means that we randomly select according to the size-biased sampling scheme only the first $k$ $\pi_i$'s, leaving all the others in their original order.


Definition 2 (Partial size-biased sampling). Let $\pi$ be a sequence of random weights on $S$. Conditionally on $\pi$, consider the following random permutation $T^k=(T_1^k,T_2^k,\ldots)$ of $\mathbb{N}$:
$$T^k=\big(I_{L_1},\ldots,I_{L_{k\wedge N^+(\pi)}},\,J_1(I_{L_1},\ldots,I_{L_{k\wedge N^+(\pi)}}),\,J_2(I_{L_1},\ldots,I_{L_{k\wedge N^+(\pi)}}),\ldots\big),$$
where $I_1,I_2,\ldots$ and $L_1,L_2,\ldots$ are defined as in (4) and (5), respectively. The sequence of random weights $\pi(k)=(\pi_1(k),\pi_2(k),\ldots)$ is a $k$-order partial size-biased sampling of $\pi$, denoted by $k$-SBS$(\pi)$, if
$$\Pr(\pi(k)\in A\,|\,\pi)=\Pr\big((\pi_{T_1^k},\pi_{T_2^k},\ldots)\in A\,|\,\pi\big),$$
where $A$ is an arbitrary measurable subset of $S$.

Notice that if $N^+(\pi)\le k$, then $k$-order partial size-biased sampling coincides with ordinary size-biased sampling. The definitions given above are tailored to suit the Bayesian framework under consideration. This explains the differences with the corresponding definitions of size-biased permutation and $k$th-order heaps mechanism given in Donnelly (1991); however, it is easy to check their equivalence. The notion of invariance is also important.

Definition 3 (Invariant distribution). A sequence of random weights $\pi$ is said to be invariant under size-biased sampling if SBS$(\pi)\sim\pi$; furthermore, $\pi$ is said to be invariant under $k$-order partial size-biased sampling if $k$-SBS$(\pi)\sim\pi$.

Characterizations and examples of distributions invariant under size-biased sampling can be found in Pitman (1996).

Remark 2. Note that size-biased sampling defines a random relabelling which is independent of the criterion originally used to order the groups. Formally, this means that if $\pi'$ is a random permutation of $\pi$, then SBS$(\pi')\sim$ SBS$(\pi)$. It follows that SBS(SBS$(\pi)$) $\sim$ SBS$(\pi)$ and therefore SBS$(\pi)$ is a random permutation of $\pi$ which is invariant under size-biased sampling.

The following result, given in Donnelly (1991, Theorem 4), turns out to be relevant for our purposes.

Theorem 1 (Characterizations of invariance). The following three conditions are equivalent for a sequence $\pi$ of random weights: (1) $\pi$ is invariant under $k$-order partial size-biased sampling for some positive integer $k$; (2) $\pi$ is invariant under $k$-order partial size-biased sampling for all $k\ge 1$; (3) $\pi$ is invariant under size-biased sampling.

The surprising aspect of the theorem is that, although the operation of partial size-biased sampling can be interpreted as just a first step towards full size-biased sampling, the corresponding notions of invariance are equivalent. In particular, this implies that invariance under size-biased sampling can be proved by checking the much simpler condition of first-order partial invariance.
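Read operationally, Definitions 1-3 and Theorem 1 suggest a small simulation experiment. The following sketch (Python/NumPy; the function names, the truncation level and the Beta(1, θ) example are our illustrative choices, not part of the paper) performs a size-biased permutation and a k-order partial size-biased sampling of a finite weight vector, and checks numerically the first-order partial invariance of the stick-breaking weights that Section 6.1 identifies as the canonical weights of the Dirichlet process.

```python
import numpy as np

rng = np.random.default_rng(0)

def size_biased_permutation(w, rng):
    """Size-biased permutation (Definition 1) of a finite weight vector w:
    positive entries drawn without replacement with probability proportional
    to their size; zero entries appended at the end."""
    w = np.asarray(w, dtype=float)
    remaining = list(np.flatnonzero(w > 0))
    probs = w[remaining].copy()
    order = []
    while remaining:
        j = rng.choice(len(remaining), p=probs / probs.sum())
        order.append(remaining.pop(j))
        probs = np.delete(probs, j)
    return np.concatenate([w[order], np.zeros(len(w) - len(order))])

def partial_sbs(w, k, rng):
    """k-order partial size-biased sampling (Definition 2): only the first
    k (or N+(w), if smaller) picks are size-biased; the remaining entries
    keep their original relative order."""
    w = np.asarray(w, dtype=float)
    remaining = list(np.flatnonzero(w > 0))
    k = min(k, len(remaining))
    probs = w[remaining].copy()
    chosen = []
    for _ in range(k):
        j = rng.choice(len(remaining), p=probs / probs.sum())
        chosen.append(remaining.pop(j))
        probs = np.delete(probs, j)
    rest = [i for i in range(len(w)) if i not in chosen]
    return w[chosen + rest]

def gem_weights(theta, n_sticks, rng):
    """Truncated stick-breaking weights with Beta(1, theta) sticks."""
    betas = rng.beta(1.0, theta, size=n_sticks)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

# Crude Monte Carlo illustration of Theorem 1: for a size-biased invariant
# sequence the first weight has the same law before and after a 1-SBS.
theta, reps = 2.0, 20000
before, after = np.empty(reps), np.empty(reps)
for r in range(reps):
    w = gem_weights(theta, 50, rng)
    before[r] = w[0]
    after[r] = partial_sbs(w, 1, rng)[0]
print(before.mean(), after.mean())   # both approximately 1/(1+theta) = 1/3
```

The truncation at 50 sticks is only a numerical shortcut; the leftover mass is negligible for θ = 2.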


We introduce here also the following new result, which will be a key ingredient in obtaining a simple expression of the posterior. A partition $\Pi=\{B_1,\ldots,B_m\}$ of $\mathbb{N}_n=\{1,\ldots,n\}$ is taken to be an ordered collection of $m$ disjoint nonempty subsets $B_i$ of $\mathbb{N}_n$ such that $\bigcup_{i=1}^{m}B_i=\mathbb{N}_n$. Hereafter we shall always adopt the convention that the $B_i$'s are ordered according to their smallest element, i.e. $B_1$ is the subset containing 1, $B_2$ is the subset containing the smallest positive integer not contained in $B_1$, and so on. Given a vector $(y_1,\ldots,y_n)$, we denote by $\Pi(y_1,\ldots,y_n)$ the partition of $\mathbb{N}_n$ induced by the equivalence relation $i\sim j\Leftrightarrow y_i=y_j$, where the subsets of the partition are taken in the above specified order, or equivalently in the order in which they appear in the vector $(y_1,\ldots,y_n)$.

Proposition 2. Let $\mathcal{I}_n$ be the set
$$\mathcal{I}_n=\{(i_1,\ldots,i_n)\in\mathbb{N}^n:\ \Pi(i_1,\ldots,i_n)=\{B_1,\ldots,B_m\}\},$$
where $\{B_1,\ldots,B_m\}$ is a partition of $\mathbb{N}_n$, and let $\pi(k)=(\pi_{T_1^k},\pi_{T_2^k},\ldots)$ be a $k$-order partial size-biased sampling of a sequence of random weights $\pi$, with $k\ge m$. Suppose that
$$\Pr(\mathbf{I}_n\in\mathcal{I}_n)>0\quad\text{or equivalently}\quad\Pr(N^+(\pi)\ge m)>0,$$
where $\mathbf{I}_n=(I_1,\ldots,I_n)$ is defined as in (4). Then the conditional distribution $F_k$ of $\pi(k)\,|\,\mathbf{I}_n\in\mathcal{I}_n$ is absolutely continuous with respect to the distribution $G_k$ of $\pi(k)$, with density
$$\frac{dF_k}{dG_k}(y)=\frac{h(y)}{c_n},\qquad(8)$$
where
$$h(y)=(1-y_1)\cdots(1-y_1-\cdots-y_{m-1})\prod_{i=1}^{m}y_i^{n_i-1},$$
$n_i$ is the cardinality of $B_i$ and
$$c_n=E[h(\pi(k))]=\Pr(\mathbf{I}_n\in\mathcal{I}_n)=\sum_{i_1,\ldots,i_m\ \mathrm{distinct}}E\big[\pi_{i_1}^{n_1}\cdots\pi_{i_m}^{n_m}\big].$$

See Appendix A for a proof of the proposition.

4. Connecting the class $\mathcal{P}$ to size-biased sampling

In this section we shall investigate the analogy between size-biased sampling and prior-to-posterior calculations for the class $\mathcal{P}$, having in mind to use the size-biased machinery to reach a simple expression for the posterior.


A sample of $n$ observations from $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ can be expressed in terms of $\pi$ and $\Phi=(\phi_1,\phi_2,\ldots)$ as follows. Conditionally on $\pi$ and $\Phi$, let us introduce i.i.d. integer-valued random variables $I_1,\ldots,I_n$ such that
$$\Pr(I_i=j\,|\,\pi,\Phi)=\pi_j,\qquad j=1,2,\ldots.$$
Note the formal analogy with the random variables $I_i$'s defined in (4). Furthermore, let us define $X_i=\phi_{I_i}$, $i=1,\ldots,n$. Then it is easy to check that $X_1,\ldots,X_n$ is a sample from $P$, that is,
$$\Pr(X_1\in A_1,\ldots,X_n\in A_n\,|\,P)=P(A_1)\cdots P(A_n)$$
for all measurable sets $A_1,\ldots,A_n$. It follows that the posterior distribution of $P$ given a sample $X_1,\ldots,X_n$ can be derived by computing the conditional distribution of $(\pi,\Phi)$ given $I_1,\ldots,I_n$. By exploiting this representation for a sample of observations, the following result establishes the connection between the posterior distribution of $P$ and $k$-order partial size-biased sampling (see Appendix A for a proof).

Theorem 2 (The posterior distribution). Let $P$ be defined as in (1) and let $\mathbf{X}_n=(X_1,\ldots,X_n)$ be a sample of observations from $P$. Suppose that $\mathbf{x}_n=(x_1,\ldots,x_n)\in\mathbb{X}^n$ has $r_n$ distinct values $v_1,\ldots,v_{r_n}$, $1\le r_n\le n$, taken in the order in which they appear in the vector $\mathbf{x}_n$, and that each of them is repeated $n_\ell$ times, $\ell=1,\ldots,r_n$. Then $P$ given $\mathbf{X}_n=\mathbf{x}_n$ is distributed as
$$\sum_{i=1}^{r_n}\rho_i\,\delta_{v_i}+\sum_{i=r_n+1}^{\infty}\rho_i\,\delta_{\phi_i},\qquad(9)$$
where
$$\rho\ \sim\ (\pi_{T_1^k},\pi_{T_2^k},\ldots)\,\big|\,\mathbf{I}_n\in\mathcal{I}(\mathbf{x}_n)\qquad(10)$$
and
$$\mathcal{I}(\mathbf{x}_n)=\{(i_1,\ldots,i_n)\in\mathbb{N}^n:\ \Pi(i_1,\ldots,i_n)=\Pi(\mathbf{x}_n)\}.\qquad(11)$$
Here $k=r_n$, $(\pi_{T_1^k},\pi_{T_2^k},\ldots)$ is the $k$-order partial size-biased sampling of $\pi$ introduced in Definition 2 and $\Pi(\cdot)$ is defined as in Proposition 2. Recall that the $\phi_i$'s are assumed to denote i.i.d. random variables from $P_0$ independent of all the other random quantities. Furthermore, by Proposition 2, the distribution $F_{r_n}$ of $\rho$ is absolutely continuous with respect to the distribution $G_{r_n}$ of $\pi(r_n)=r_n$-SBS$(\pi)$, with density
$$\frac{dF_{r_n}}{dG_{r_n}}(y)=\frac{h(y)}{c_n},\qquad(12)$$
where
$$h(y)=(1-y_1)\cdots(1-y_1-\cdots-y_{r_n-1})\prod_{i=1}^{r_n}y_i^{n_i-1}$$
and
$$c_n=E[h(\pi(r_n))]=\sum_{i_1,\ldots,i_{r_n}\ \mathrm{distinct}}E\big[\pi_{i_1}^{n_1}\cdots\pi_{i_{r_n}}^{n_{r_n}}\big].$$


For example, if the data are all distinct, then the posterior distribution is given by (9) and (10) where $k=r_n=n$ and the conditioning set $\mathcal{I}(\mathbf{x}_n)$ is formed by all the vectors $(i_1,\ldots,i_n)$ whose entries are all distinct. Furthermore, the distribution of $\rho$ can be obtained by modifying the distribution of the $n$-order partial size-biased sampling $\pi(n)$ by a density factor proportional to $(1-y_1)\cdots(1-y_1-\cdots-y_{n-1})$. Notice also that in this case the distribution of $(\rho_1,\ldots,\rho_n)$ is exchangeable. This in particular implies that the above simple density modification has the effect of making the distribution of the first $n$ components of $\pi(n)$, which originally are stochastically ordered, exchangeable. Significant results related to this comment are discussed in Section 6.2 and in Pitman (1996).

As this is the core result relating size-biased sampling to the class $\mathcal{P}$, it seems appropriate to pause a while to give an informal explanation. On one side, consider a random sample from a population $D$ with infinitely many groups with labels $i$, $i=1,2,3,\ldots$, each appearing with relative frequency $\pi_i$. The aim is to record the label of the group to which each sampled individual belongs. Size-biased sampling operates a random relabelling of the groups and of the corresponding frequencies $\pi_i$'s, so that the $i$th group appearing in the sample is given label $i$.

On the other side, consider a sample of observations from $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$. As the $\phi_i$'s can be assumed to be all distinct, they can be interpreted as new labels attached to the groups of the population $D$. If the $\phi_i$'s were known, sampling from $D$ would be equivalent to sampling from $P$. However, the $\phi_i$'s are random and, as they are i.i.d. and independent of the $\pi_i$'s, they can be thought of as labels randomly assigned to the groups and to the corresponding frequencies (weights) $\pi_i$'s. Since in sampling from $P$ we only observe the labels $\phi_i$'s, this destroys the possibility of inferring the groups to which they were associated. It follows that the derivation of the posterior distribution of weights and locations in their original order is quite cumbersome. As we are allowed to consider any random permutation of weights and locations, one is then led to search for permutations particularly suitable to express the information contained in the data. In fact, the data contain direct information only on the groups that appear during the sampling process. It follows that a natural way to reorder the groups, and therefore the locations and the weights, is to consider the (random) order in which they appear in the sample, that is, the ordering induced by (partial) size-biased sampling. If we do that, the posterior distribution of the locations is straightforward: the first $r_n$ distinct locations coincide with the first $r_n$ distinct observations and the remaining locations are i.i.d. On the other hand, the posterior distribution of the weights can be shown to coincide with the $r_n$-order partial size-biased sampling of the original weights conditionally on the sample. To complete the description it remains to assess the influence of the sample on the random weights. From the previous discussion, it is intuitive that the specific values assumed by the $X_i$'s have no influence; the only relevant information carried by the $X_i$'s is the structure of their replications, which gives the number of elements belonging to the groups appearing in the sample.


It follows that the conditional distribution of the size-biased weights given the sample coincides with the conditional distribution given the numbers of individuals belonging to the groups. In other words, it is as if we were taking a random sample from $D$ in a context where we are unable to assess the group to which individuals belong, but can only tell whether they belong to the same group.
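The latent-label construction of a sample from $P$ that underlies this discussion is straightforward to mimic numerically. The following sketch (ours; it works with a truncated weight vector and uses a standard normal in the role of the diffuse $P_0$) draws $X_1,\ldots,X_n$ by sampling labels $I_i$ with $\Pr(I_i=j\,|\,\pi)=\pi_j$ and setting $X_i=\phi_{I_i}$, and then records the partition of ties, which is the only feature of the data carrying information on the weights.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_from_P(weights, n, rng):
    """Draw X_1,...,X_n from P = sum_i pi_i delta_{phi_i} via latent labels I_i.

    `weights` is a (truncated) probability vector pi; the phi_i are i.i.d. from
    the diffuse base measure P_0 (here a standard normal, as an arbitrary example)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # renormalise the truncation
    phis = rng.standard_normal(len(weights))     # locations phi_i ~ P_0
    labels = rng.choice(len(weights), size=n, p=weights)   # latent I_i
    return phis[labels], labels

def tie_partition(x):
    """Partition of {0,...,n-1} induced by ties, blocks ordered by first appearance."""
    blocks, seen = [], {}
    for i, v in enumerate(x):
        if v in seen:
            blocks[seen[v]].append(i)
        else:
            seen[v] = len(blocks)
            blocks.append([i])
    return blocks

x, labels = sample_from_P(np.full(5, 0.2), 10, rng)
print(tie_partition(x))        # identical to tie_partition(labels): the observed
print(tie_partition(labels))   # values carry no information beyond the tie pattern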

5. Exploiting the connection

We shall here exploit the connection presented in the previous section to achieve a substantial simplification of the expression of the posterior distribution and to derive sample properties for the class $\mathcal{P}$.

5.1. Canonical form of the class $\mathcal{P}$

We are now able to answer the question, posed earlier, about the permutation of the random weights in the prior distribution which is most suitable for Bayesian inference. Theorem 2 shows that the expression of the posterior is greatly simplified if we choose for $\pi$ a distribution which is invariant under $r_n$-order partial size-biased sampling, in the following named $r_n$-partially invariant. The problem is then how to characterize such distributions and, possibly, how to eliminate the dependence on $r_n$. Notice that the $k$-order partial size-biased sampling of $\pi$ is not necessarily $k$-partially invariant, since if $\pi'$ is a random permutation of $\pi$ then, in general, $k$-SBS$(\pi')$ does not have the same distribution as $k$-SBS$(\pi)$. Theorem 1 turns out to be the best solution of the problem one could hope for: if for some $k$ a distribution is $k$-partially invariant, then it is $k$-partially invariant for all $k$. Moreover, this is equivalent to invariance under size-biased sampling. We are then led to the following definition.

Definition 4 (Canonical form). We shall say that an element $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ of the class $\mathcal{P}$ is written in the canonical form if $\pi=(\pi_1,\pi_2,\ldots)$ is invariant under size-biased sampling.

From Remark 2 and Proposition 1 it follows that any element $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ of the class $\mathcal{P}$ can be written in the canonical form by rearranging the random weights $\pi$ according to the size-biased sampling scheme. Formally, if $\tilde\pi=$ SBS$(\pi)$ then
$$P=\sum_{i=1}^{\infty}\pi_i\,\delta_{\phi_i}\ \sim\ P'=\sum_{i=1}^{\infty}\tilde\pi_i\,\delta_{\phi_i}$$
and $P'$ is in the canonical form. As a consequence of Theorems 1 and 2, for a prior distribution $P$ in the canonical form, the distribution of the random weights appearing in the posterior representation can be expressed through a simple density modification of the distribution of the prior random weights.


Corollary 1 (The posterior distribution for the canonical form). Let $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}\in\mathcal{P}$ be written in the canonical form and let $G$ be the distribution of $\pi=(\pi_1,\pi_2,\ldots)$. Then, in the notation of Theorem 2, $P\,|\,\mathbf{X}_n=\mathbf{x}_n$ is distributed as
$$\sum_{i=1}^{r_n}\rho_i\,\delta_{v_i}+\sum_{i=r_n+1}^{\infty}\rho_i\,\delta_{\phi_i},\qquad(13)$$
where $\rho=(\rho_1,\rho_2,\ldots)$ has distribution $F$ admitting density with respect to $G$ given by
$$\frac{dF}{dG}(y)=\frac{h(y)}{c_n},\qquad(14)$$
where $h(y)=(1-y_1)\cdots(1-y_1-\cdots-y_{r_n-1})\prod_{i=1}^{r_n}y_i^{n_i-1}$ and $c_n=E[h(\pi)]=\sum_{i_1,\ldots,i_{r_n}\ \mathrm{distinct}}E\big[\pi_{i_1}^{n_1}\cdots\pi_{i_{r_n}}^{n_{r_n}}\big]$.
Proof. By Theorem 1, if $\pi$ is invariant under size-biased sampling then it is invariant under $k$-order partial size-biased sampling for any $k\ge 1$. The result is then a direct consequence of Theorem 2.

The distribution of the $\rho_i$'s in (13) can be equivalently described as follows: the conditional distribution of $(\rho_{r_n+1},\rho_{r_n+2},\ldots)$ given $(\rho_1,\ldots,\rho_{r_n})$ is equal to the conditional distribution of $(\pi_{r_n+1},\pi_{r_n+2},\ldots)$ given $(\pi_1,\ldots,\pi_{r_n})$, whereas the distribution of $(\rho_1,\ldots,\rho_{r_n})$ has density proportional to $h(y)$ with respect to the distribution of $(\pi_1,\ldots,\pi_{r_n})$. It follows that, to derive the posterior distribution of $P$, we in fact only need to update the distribution of the first $r_n$ $\pi_i$'s.

To appreciate the compactness of the derived posterior expressions, it is instructive to compare them with the complexity of the Ongaro and Cattaneo (2002) representation. They develop a recursive formula for the computation of the posterior which involves quite a complicated mixture distribution for the random weights; this does not allow the derivation of an explicit general formula.

The following proposition, which makes use of Corollary 1, proceeds in the direction of Proposition 1, which showed that two elements of the class $\mathcal{P}$ have the same distribution if the random weights of one of the two can be written as a random permutation of the random weights of the other. It gives necessary and sufficient conditions, related to the canonical form, for two elements of $\mathcal{P}$ to have the same distribution.

Proposition 3. Let $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ and $P'=\sum_{i=1}^{\infty}\pi'_i\delta_{\phi'_i}$ be two elements of the class $\mathcal{P}$ such that $\phi_i\sim P_0$ and $\phi'_i\sim P'_0$. Then
$$P\sim P'\ \Longleftrightarrow\ P_0=P'_0\ \text{and}\ \mathrm{SBS}(\pi)\sim\mathrm{SBS}(\pi').$$
See Appendix A for a proof of the proposition. From Propositions 1 and 3 it follows that, when working with the class $\mathcal{P}$ of prior distributions with assigned parameter $P_0$, we need to take into consideration only elements written in the canonical form; furthermore, any two such elements have different distributions if and only if the corresponding random weights have different distributions. In other words, for a fixed $P_0$, there is a one-to-one correspondence between prior distributions in $\mathcal{P}$ and size-biased invariant distributions.


elements written in the canonical form; furthermore, any two such elements have different distribution if and only if the corresponding random weights have di0erent distribution. In other words, for a 2xed P0 , there is a one-to-one correspondence between prior distributions in P and size-biased invariant distributions. 5.2. Sample properties We derive here some results concerning the distribution and the properties of a sample of n observations Xn = (X1 ; : : : ; Xn ) from P ∈ P. From the representation of the posterior process given in Theorem 2 and in Corollary 1, it is straightforward to obtain the expression for the predictive distribution. ∞ Proposition 4 (Predictive distribution). Let P = i=1 i i be dened as in (1) and let Xn = (X1 ; : : : ; Xn ) be a sample from P. Then, in the notation of Theorem 2, we have for any A ∈ A Pr(Xn+1 ∈ A|Xn = xn ) =

rn 

wj vj (A) + (1 − w1 − · · · − wrn )P0 (A);

(15)

j=1

where



 rn E[ij ‘=1 in‘‘ ]  rn : wj = E[ j ] =  n‘ i1 ;:::;irn E[ ‘=1 i‘ ] ∞

i1 ;:::;irn distinct

distinct

If $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ is written in the canonical form, then the $w_j$'s can be expressed, in the notation of Corollary 1, as
$$w_j=E[\rho_j]=\frac{E\big[(1-\pi_1)\cdots(1-\pi_1-\cdots-\pi_{r_n-1})\,\pi_j\prod_{i=1}^{r_n}\pi_i^{n_i-1}\big]}{E\big[(1-\pi_1)\cdots(1-\pi_1-\cdots-\pi_{r_n-1})\prod_{i=1}^{r_n}\pi_i^{n_i-1}\big]}.$$
As noticed in Section 6.1, if $P$ is a Dirichlet process with parameter $\theta P_0$ then the vector $(\rho_1,\ldots,\rho_{r_n})$ has a Dirichlet distribution with parameter $(n_1,\ldots,n_{r_n},\theta)$. This gives the well-known expression $w_j=n_j/(\theta+n)$ which characterizes the predictive distribution of the Dirichlet process.
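For the Dirichlet process the weights $w_j=n_j/(\theta+n)$ turn the predictive distribution (15) into the familiar Pólya-urn scheme, which can be sampled sequentially. A minimal sketch (ours; the value of θ and the uniform base measure are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_predictive_sample(n, theta, base_sampler, rng):
    """Sequential draw of X_1,...,X_n from a Dirichlet process D(theta * P0)
    using the predictive rule w_j = n_j / (theta + m) after m observations."""
    xs = []
    for m in range(n):
        if m > 0 and rng.random() < m / (theta + m):
            xs.append(xs[rng.integers(m)])   # repeat an old value: each past draw has prob 1/(theta+m)
        else:
            xs.append(base_sampler(rng))     # new value from the diffuse P0
    return np.array(xs)

xs = dp_predictive_sample(100, theta=2.0, base_sampler=lambda r: r.uniform(), rng=rng)
print(len(np.unique(xs)))   # number of distinct values in the sample
```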


Exploiting Theorem 2 and Corollary 1, it is also possible to investigate the random structure of various configurations of ties among the observations. This is of interest, for example, in hierarchical Bayesian models where a nonparametric prior is placed at a second stage on the distribution of the parameters and ties in the parameters are interpreted as clusters in the observations (cf. Escobar and West, 1998; MacEachern, 1998).

We shall distinguish between ordered and unordered configurations of ties. These are defined as follows. Let $\Pi_n=\{B_1,\ldots,B_k\}$ be a partition of $\mathbb{N}_n=\{1,\ldots,n\}$ as defined above Proposition 2. To any such partition we can associate an ordered configuration of ties by setting
$$O(\Pi_n)=\{\mathbf{x}_n\in\mathbb{X}^n:\ \Pi(\mathbf{x}_n)=\Pi_n\}=\{\mathbf{x}_n\in\mathbb{X}^n:\ x_i=x_j\Leftrightarrow i,j\in B_l\ \text{for some}\ l=1,\ldots,k\}.$$
In words, $O(\Pi_n)$ contains all the sample values which have the same numbers of repeated values, all appearing in exactly the same order. If we disregard the order of appearance, to any set of positive integers $\{n_1,\ldots,n_k\}$ with $n=n_1+\cdots+n_k$ we can associate an unordered configuration of ties $U(n_1,\ldots,n_k)$. This is defined as the set of $\mathbf{x}_n\in\mathbb{X}^n$ which have $k$ distinct values such that one of them is repeated $n_1$ times, a second one $n_2$ times, $\ldots$, and the $k$th one $n_k$ times. For example, the vector $(1,3,1,2,3)$ belongs to the same ordered configuration as $(2,4,2,5,4)$, whereas $(1,3,1,2,3)$ and $(1,3,3,1,2)$ belong to different ordered configurations but to the same unordered configuration. It can be shown that, denoting by $m_i$ the number of $n_j$'s equal to $i$, $U(n_1,\ldots,n_k)$ is the union of the $\binom{n}{n_1,\ldots,n_k}\prod_{i=1}^{n}(1/m_i!)$ ordered configurations of ties associated to the partitions of $\mathbb{N}_n=\{1,\ldots,n\}$ into $k$ subsets with cardinalities equal to the $n_j$'s.

Exploiting Proposition 2 and the representation of the $X_i$'s used in Theorem 2, we can derive formulas for the probability of various configurations of ties.

Proposition 5 (Configurations of ties). Let $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ be defined as in (1) and let $\mathbf{X}_n=(X_1,\ldots,X_n)$ be a sample from $P$. Furthermore, let $\Pi_n=\{B_1,\ldots,B_k\}$ be a partition of $\mathbb{N}_n$ and denote by $n_j$ the cardinality of $B_j$. Then
$$\Pr(\mathbf{X}_n\in O(\Pi_n))=\sum_{i_1,\ldots,i_k\ \mathrm{distinct}}E\big[\pi_{i_1}^{n_1}\cdots\pi_{i_k}^{n_k}\big].\qquad(16)$$
If $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ is written in the canonical form, we obtain the following simplified expression:
$$\Pr(\mathbf{X}_n\in O(\Pi_n))=E\Big[(1-\pi_1)\cdots(1-\pi_1-\cdots-\pi_{k-1})\prod_{i=1}^{k}\pi_i^{n_i-1}\Big].\qquad(17)$$
Finally, we have
$$\Pr(\mathbf{X}_n\in U(n_1,\ldots,n_k))=\Pr(\mathbf{X}_n\in O(\Pi_n))\,\binom{n}{n_1,\ldots,n_k}\prod_{i=1}^{n}\frac{1}{m_i!}.\qquad(18)$$

Proof. We can assume, without loss of generality, that the $\phi_i$'s are all distinct. From the representation $X_i=\phi_{I_i}$ given above Theorem 2, it follows that $X_i=X_j$ if and only if $I_i=I_j$. This implies that
$$\Pr(\mathbf{X}_n\in O(\Pi_n))=\Pr((I_1,\ldots,I_n)\in\mathcal{I}_n),$$
where $\mathcal{I}_n=\{(i_1,\ldots,i_n)\in\mathbb{N}^n:\ \Pi(i_1,\ldots,i_n)=\Pi_n\}$. Since the distribution of the $I_i$'s is identical to the distribution of the $I_i$'s given in (4), by Proposition 2 we have
$$\Pr((I_1,\ldots,I_n)\in\mathcal{I}_n)=\sum_{i_1,\ldots,i_k\ \mathrm{distinct}}E\big[\pi_{i_1}^{n_1}\cdots\pi_{i_k}^{n_k}\big]=E[h(\pi(k))],$$
where $h(y)=(1-y_1)\cdots(1-y_1-\cdots-y_{k-1})\prod_{i=1}^{k}y_i^{n_i-1}$ and $\pi(k)$ is a $k$-order partial size-biased sampling of $\pi$. If $P$ is written in the canonical form, the latter expression is equal to $E[h(\pi)]$, since $\pi$ is invariant under $k$-order partial size-biased sampling. Finally, the expression for $\Pr(\mathbf{X}_n\in U(n_1,\ldots,n_k))$ follows from a combinatorial argument, once we note that all the ordered configurations of ties which form $U(n_1,\ldots,n_k)$ receive the same probability.
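Since formula (17) involves only the first $k$ canonical weights, it can be evaluated by plain Monte Carlo for any concrete prior. The sketch below (ours) does this for the Dirichlet process, whose canonical weights are the stick-breaking weights of Section 6.1, and compares the estimate with the classical closed-form partition probability $\theta^k\prod_i(n_i-1)!/(\theta(\theta+1)\cdots(\theta+n-1))$; that closed form is a standard Dirichlet-process fact quoted here purely for checking purposes, not a formula taken from the paper.

```python
import numpy as np
from math import factorial, prod

rng = np.random.default_rng(3)

def prob_ordered_config_mc(block_sizes, theta, reps, rng):
    """Monte Carlo evaluation of formula (17) for the Dirichlet process:
    E[(1-pi_1)...(1-pi_1-...-pi_{k-1}) * prod_i pi_i^{n_i - 1}],
    with (pi_1,...,pi_k) the first k stick-breaking (canonical) weights."""
    k = len(block_sizes)
    total = 0.0
    for _ in range(reps):
        betas = rng.beta(1.0, theta, size=k)            # U_i ~ Beta(1, theta)
        sticks = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
        cum = np.cumsum(sticks)
        h = prod(1.0 - cum[:k - 1]) * prod(sticks[i] ** (block_sizes[i] - 1) for i in range(k))
        total += h
    return total / reps

def prob_ordered_config_exact(block_sizes, theta):
    """Known closed form for the Dirichlet process (Ewens-type partition probability)."""
    n, k = sum(block_sizes), len(block_sizes)
    rising = prod(theta + i for i in range(n))
    return theta ** k * prod(factorial(m - 1) for m in block_sizes) / rising

sizes = [3, 1, 2]    # a tie configuration with n = 6, k = 3
print(prob_ordered_config_mc(sizes, theta=1.5, reps=200_000, rng=rng))
print(prob_ordered_config_exact(sizes, theta=1.5))
```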



From formula (18) one can obtain the distribution of the number of distinct observations $D_n$ and, given $D_n$, the distributions of the sizes $N_i$'s of the groups of repeated observations. An alternative description of the unordered configurations of ties in the observations can be given by considering the random variables $M_j$, $1\le j\le n$, representing the number of distinct observations which are repeated $j$ times, so that $\sum_{j=1}^{n}M_j=D_n$ and $\sum_{j=1}^{n}jM_j=n$. In ecological applications the $M_j$'s represent the number of species with exactly $j$ animals found in a random sample of size $n$ from a given population (cf. Engen, 1979). The distribution of the $M_j$'s can be obtained from the formulas in Proposition 5. In addition, following Engen (1979), it is possible to derive simple expressions for the expected values of the $M_j$'s and therefore of $D_n$.

Proposition 6 (Number of distinct observations). Let $M_j$ and $D_n$ be defined as above with respect to a sample of observations $(X_1,\ldots,X_n)$ from $P\in\mathcal{P}$. Then, in the notation of Proposition 5, and denoting by $\tilde\pi_1$ the first component of SBS$(\pi)$, we have
$$E[M_j]=\binom{n}{j}\,E\big[\tilde\pi_1^{\,j-1}(1-\tilde\pi_1)^{n-j}\big]=\binom{n}{j}\,E\Big[\sum_i\pi_i^{\,j}(1-\pi_i)^{n-j}\Big]\qquad(19)$$
and
$$E[D_n]=E\bigg[\frac{1-(1-\tilde\pi_1)^n}{\tilde\pi_1}\bigg]=E\Big[\sum_i\big(1-(1-\pi_i)^n\big)\Big].\qquad(20)$$

Proof. The random number $M_j$ can be computed by counting all the subgroups of the $X_i$'s of size $j$ which contain $j$ identical values different from all the others:
$$M_j=\sum_{i_1,\ldots,i_j\ \mathrm{distinct}}I\big(X_{i_1}=\cdots=X_{i_j};\ X_i\ne X_{i_1}\ \text{for all}\ i\in\mathbb{N}_n\setminus\{i_1,\ldots,i_j\}\big),$$
where $I(A)$ is equal to 1 if the event $A$ occurs and to zero otherwise. It follows that
$$E[M_j]=\sum_{i_1,\ldots,i_j\ \mathrm{distinct}}\Pr\big(X_{i_1}=\cdots=X_{i_j};\ X_i\ne X_{i_1}\ \text{for all}\ i\in\mathbb{N}_n\setminus\{i_1,\ldots,i_j\}\big)$$
and, because of the exchangeability of the $X_i$'s, the latter expression is equal to
$$\binom{n}{j}\Pr\big(X_1=\cdots=X_j;\ X_{j+i}\ne X_1\ \text{for}\ i=1,\ldots,n-j\big).$$
If we assume, without loss of generality, that the $\phi_i$'s are all distinct, we can write the above expression as $\binom{n}{j}\Pr(I_1=\cdots=I_j;\ I_{j+i}\ne I_1\ \text{for}\ i=1,\ldots,n-j)$, where, conditionally on $\pi$, the $I_i$'s are defined as at the outset of Section 4. An argument similar to the one employed in the proof of Proposition 2 then shows that
$$\Pr\big(I_1=\cdots=I_j;\ I_{j+i}\ne I_1\ \text{for}\ i=1,\ldots,n-j\,\big|\,\bar\pi\big)=\bar\pi_1^{\,j-1}(1-\bar\pi_1)^{n-j},$$

where $\bar\pi=(\pi_{T_1^1},\pi_{T_2^1},\ldots)$ is a first-order partial size-biased sampling of $\pi$. The first equality in formula (19) then follows by taking expectations and noticing that $\bar\pi_1\sim\tilde\pi_1$. The second equality is a consequence of the fact that, by Remark 1, $\Pr(\tilde\pi_1\in A\,|\,\pi)=\sum_i I_A(\pi_i)\,\pi_i$, which in turn implies $E[f(\tilde\pi_1)]=E[\sum_i f(\pi_i)\,\pi_i]$ for any positive measurable function $f$. Finally, computing $\sum_{j=1}^{n}E[M_j]$ one obtains (20).
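For the Dirichlet process the first size-biased weight is $\tilde\pi_1\sim$ Beta$(1,\theta)$ (the first stick of Section 6.1), so (20) can be evaluated numerically and compared with the familiar expression $E[D_n]=\sum_{i=0}^{n-1}\theta/(\theta+i)$; this closed form is again a standard Dirichlet-process fact used only as a cross-check, not a formula from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def expected_distinct_formula20(n, theta, reps, rng):
    """Monte Carlo evaluation of E[(1 - (1 - pi~_1)^n) / pi~_1], pi~_1 ~ Beta(1, theta)."""
    u = rng.beta(1.0, theta, size=reps)
    return np.mean((1.0 - (1.0 - u) ** n) / u)

def expected_distinct_closed(n, theta):
    return sum(theta / (theta + i) for i in range(n))

n, theta = 50, 2.0
print(expected_distinct_formula20(n, theta, 500_000, rng))
print(expected_distinct_closed(n, theta))
```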

We conclude the section by showing how the canonical representation for $P$ can be used to provide an appealing way of constructing a sample of observations.

Proposition 7 (Sample representation). Let $\pi$ be a sequence of random weights invariant under size-biased sampling and let the random sequence $(Y_1,Y_2,\ldots)$ have the following distribution conditionally on $\pi$: $Y_1\sim P_0$ and, for $n\ge 1$, $Y_{n+1}\,|\,Y_1,\ldots,Y_n$ is equal to $v_i$ with probability $\pi_i$, $i=1,\ldots,r_n$, and is distributed as $P_0$ with probability $(1-\sum_{i=1}^{r_n}\pi_i)$. Here the $v_i$'s are the $r_n$ distinct values in the vector $(Y_1,\ldots,Y_n)$, taken in order of appearance. Then $(Y_1,Y_2,\ldots)$ is an exchangeable random sequence with associated random probability measure $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$, i.e. $(Y_1,\ldots,Y_n)$ is distributed as a random sample from $P$.

Proof. We shall prove that the conditional distribution of $Y_{n+1}\,|\,Y_1,\ldots,Y_n$ coincides with the predictive distribution given in Proposition 4. We have
$$\Pr(Y_{n+1}\in A\,|\,Y_1,\ldots,Y_n)=E\bigg[\sum_{i=1}^{r_n}\pi_i\,\delta_{v_i}(A)+\Big(1-\sum_{i=1}^{r_n}\pi_i\Big)P_0(A)\,\bigg|\,Y_1,\ldots,Y_n\bigg].$$
The proof is completed if we can show that $\pi\,|\,Y_1,\ldots,Y_n$ has density proportional to $(1-\pi_1)\cdots(1-\pi_1-\cdots-\pi_{r_n-1})\prod_{i=1}^{r_n}\pi_i^{n_i-1}$ with respect to the distribution of $\pi$. Here $n_i$ is the number of $Y_j$'s, $j=1,\ldots,n$, equal to $v_i$. This will be proved by induction. Clearly, for $n=1$ we have $\pi\,|\,Y_1\sim\pi$. Suppose the result holds for $n$. We shall apply Bayes theorem for the infinite-dimensional parameter case to derive the result for $n+1$. We take as parameter the sequence $\pi$ and as prior distribution the conditional distribution of $\pi\,|\,Y_1,\ldots,Y_n$, which is known by hypothesis. The conditional distribution of the "observation" given the parameter appearing in Bayes theorem is taken to be the distribution of $Y_{n+1}\,|\,\pi,Y_1,\ldots,Y_n$. Notice that, for any fixed $\pi$, the latter conditional distribution admits density equal to
$$\sum_{i=1}^{r_n}\pi_i\,\delta_{v_i}(y_{n+1})+\Big(1-\sum_{i=1}^{r_n}\pi_i\Big)I_{\mathbb{X}\setminus\{v_1,\ldots,v_{r_n}\}}(y_{n+1})\qquad(21)$$
with respect to the finite measure $\sum_{i=1}^{r_n}\delta_{v_i}+P_0$. By Bayes theorem, it follows that the distribution of $\pi\,|\,Y_1,\ldots,Y_{n+1}$ admits density proportional to (21) with respect to the distribution of $\pi\,|\,Y_1,\ldots,Y_n$. This yields the desired density with respect to the distribution of $\pi$, completing the induction.


The above proposition gives a simple procedure to simulate a sample of $n$ observations from $P$: it is enough to simulate from the distribution of $(\pi_1,\ldots,\pi_{n-1})$ and from $P_0$. Moreover, it adds further insight into the relationship between size-biased invariant representations and the process of sampling from the class $\mathcal{P}$.
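A sketch of this simulation procedure (ours; truncated stick-breaking weights stand in for a generic size-biased invariant sequence, and the normal base measure is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)

def gem_weights(theta, n_sticks, rng):
    betas = rng.beta(1.0, theta, size=n_sticks)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

def sample_via_proposition7(weights, n, rng):
    """Sequential sample (Y_1,...,Y_n): the i-th distinct value so far observed
    is repeated with probability pi_i; otherwise a new value is drawn from P_0
    (here N(0,1))."""
    ys, distinct = [], []
    for _ in range(n):
        r = len(distinct)
        used = sum(weights[:r])
        u = rng.random()
        if u < used:                      # repeat the i-th distinct value w.p. pi_i
            i = int(np.searchsorted(np.cumsum(weights[:r]), u, side="right"))
            ys.append(distinct[i])
        else:                             # new value from the diffuse base measure
            y = rng.standard_normal()
            distinct.append(y)
            ys.append(y)
    return np.array(ys)

w = gem_weights(theta=1.0, n_sticks=200, rng=rng)   # only the first n-1 weights ever matter
print(len(np.unique(sample_via_proposition7(w, 30, rng))))
```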


6. Examples

To illustrate the potential of the approach developed in the previous sections, we shall consider here two special subclasses of $\mathcal{P}$: the Dirichlet process and one of its generalizations, and a new, very general class of symmetric priors.

6.1. The Dirichlet process and its two-parameter extension

As the Dirichlet process is a cornerstone in the development of Bayesian nonparametric statistics, in what follows we shall illustrate the advantages of our approach by deriving the posterior distribution for it and for its two-parameter generalization. The Dirichlet process was brought to the attention of the statistical community by Ferguson (1973, 1974). In these seminal papers, Ferguson gave a representation of the process of the type $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ considered here, where the random weights $\pi_i$ were arranged in decreasing order. The distribution of such ordered weights, known also as the Poisson-Dirichlet distribution (Kingman, 1975), is rather intractable (see Billingsley, 1999). Later on, Sethuraman (1994) gave a different and much simpler representation of the weights of the Dirichlet process, which are described through a so-called stick-breaking scheme (see Ishwaran and James, 2001 and references therein). Under this scheme the weights are defined as follows: $\pi_1=U_1$, $\pi_2=U_2(1-U_1)$, $\pi_3=U_3(1-U_1)(1-U_2)$ and so on, where $U_1,U_2,\ldots$ are independent random variables with values in $(0,1)$. In particular, the Dirichlet process with parameter $\theta P_0$, denoted by $\mathcal{D}(\theta P_0)$, corresponds to the particular case where the $U_i$'s have a Beta distribution with parameter $(1,\theta)$. The Sethuraman representation proved itself very suitable for nonparametric Bayesian applications, leading to a wealth of new results. Remarkably, such a representation coincides with the size-biased sampling of the Ferguson representation (see Donnelly and Joyce, 1989 for a proof); in our setting, this is the canonical representation of the Dirichlet process.

One of the most important facts about the Dirichlet process is that the posterior is still a Dirichlet process, with parameter $\theta P_0+\sum_{i=1}^{n}\delta_{x_i}$. This result can be easily obtained by applying Corollary 1, as shown in what follows. Let $P\sim\mathcal{D}(\theta P_0)$ be written in the canonical form $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$. The result can be obtained recursively, once we prove that it holds for one observation. Corollary 1 gives immediately the form of the posterior distribution: $P\,|\,X_1=x_1\sim\pi_1\delta_{x_1}+\sum_{i=2}^{\infty}\pi_i\delta_{\phi_i}$. As a direct consequence of the stick-breaking structure of the $\pi_i$'s, the latter is equivalent in distribution to $\pi_1\delta_{x_1}+(1-\pi_1)P$, where $P\sim\mathcal{D}(\theta P_0)$ is independent of $\pi_1$. It is then easy to check that $\pi_1\delta_{x_1}+(1-\pi_1)P\sim\mathcal{D}(\theta P_0+\delta_{x_1})$: one can simply verify, using well-known properties of the Dirichlet distribution, that the finite-dimensional distributions agree. Note the simplicity with which we could prove the result from our general formula, compared with the proof given by Sethuraman (1994).

Expressions for the posterior can be computed for the generalizations of the Dirichlet process which share the structure of the class $\mathcal{P}$ (see Hjort, 2000; Ishwaran and James, 2001 and references therein). In particular, for the so-called two-parameter Poisson-Dirichlet process (see Ishwaran and James, 2001 for applications to the Bayesian setting), posterior computations are a direct extension of the ones performed for the Dirichlet process, thanks to the stick-breaking structure of the size-biased sampling of the weights. In fact, Pitman (1996) showed that the size-biased version of the random weights of the two-parameter Poisson-Dirichlet process is the only distribution with infinitely many strictly positive weights which is at the same time size-biased invariant and stick-breaking. The two-parameter Poisson-Dirichlet process, in the following denoted by $\mathcal{D}(\alpha,\theta P_0)$, is defined by taking the $U_i$'s defining the stick-breaking representation to have a Beta distribution with parameter $(1-\alpha,\theta+i\alpha)$. The Dirichlet process corresponds to the case $\alpha=0$. By applying Corollary 1, one has that if $P\sim\mathcal{D}(\alpha,\theta P_0)$ then
$$P\,|\,\mathbf{X}_n=\mathbf{x}_n\ \sim\ \sum_{i=1}^{r_n}\rho_i\,\delta_{v_i}+\sum_{i=r_n+1}^{\infty}\rho_i\,\delta_{\phi_i},$$
where $(\rho_{r_n+1},\rho_{r_n+2},\ldots)$ given $(\rho_1,\ldots,\rho_{r_n})$ is equal to the conditional distribution of $(\pi_{r_n+1},\pi_{r_n+2},\ldots)$ given $(\pi_1,\ldots,\pi_{r_n})$, and the distribution of $(\rho_1,\ldots,\rho_{r_n})$ has density proportional to $h(y)$ with respect to the distribution of $(\pi_1,\ldots,\pi_{r_n})$. Noticing that $(1-\pi_1-\cdots-\pi_{r_n})=(1-U_1)\cdots(1-U_{r_n})$, as a direct consequence of the stick-breaking structure of the $\pi_i$'s we have
$$\sum_{i=r_n+1}^{\infty}\rho_i\,\delta_{\phi_i}\,\Big|\,(\rho_1,\ldots,\rho_{r_n})\ \sim\ (1-\rho_1-\cdots-\rho_{r_n})\,P',$$

where $P'\sim\mathcal{D}(\alpha,(\theta+r_n\alpha)P_0)$ is independent of $\rho_1,\ldots,\rho_{r_n}$. It is then straightforward, operating a change of variable, to show that $(\rho_1,\ldots,\rho_{r_n})$ has a Dirichlet distribution with parameter $(n_1-\alpha,\ldots,n_{r_n}-\alpha,\theta+r_n\alpha)$, therefore completing the description of the posterior. A different, non-constructive proof, based on an induction argument, is given in Carlton (1999).
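The stick-breaking recipes used in this subsection translate directly into code. The sketch below (ours) generates truncated canonical weights for $\mathcal{D}(\theta P_0)$ and for the two-parameter process $\mathcal{D}(\alpha,\theta P_0)$ and, for the Dirichlet case, draws the posterior weights $(\rho_1,\ldots,\rho_{r_n})$ from the Dirichlet$(n_1,\ldots,n_{r_n},\theta)$ law derived above.

```python
import numpy as np

rng = np.random.default_rng(6)

def stick_breaking_dp(theta, n_sticks, rng):
    """Canonical (Sethuraman) weights of the Dirichlet process D(theta * P0), truncated."""
    betas = rng.beta(1.0, theta, size=n_sticks)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

def stick_breaking_py(alpha, theta, n_sticks, rng):
    """Canonical weights of the two-parameter Poisson-Dirichlet process
    D(alpha, theta * P0): U_i ~ Beta(1 - alpha, theta + i * alpha)."""
    i = np.arange(1, n_sticks + 1)
    betas = rng.beta(1.0 - alpha, theta + i * alpha)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

def dp_posterior_first_weights(counts, theta, rng):
    """Posterior law of (rho_1,...,rho_{r_n}) for the Dirichlet process given a sample
    whose r_n distinct values have multiplicities `counts`: Dirichlet(n_1,...,n_{r_n}, theta),
    drawn here via normalised Gamma variables; the last coordinate (dropped) is the mass
    left to the remaining locations."""
    g = rng.gamma(shape=np.append(np.asarray(counts, dtype=float), theta))
    return (g / g.sum())[:-1]

print(stick_breaking_dp(2.0, 10, rng))
print(stick_breaking_py(0.3, 1.0, 10, rng))
print(dp_posterior_first_weights([4, 2, 1], theta=2.0, rng=rng))
```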


6.2. Symmetric processes

We shall study here the general subclass $\mathcal{P}_F$ of $\mathcal{P}$ formed by those elements $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}$ with finitely many strictly positive random weights, i.e. such that $N^+(\pi)<\infty$ a.s. To mark this particular condition, we shall use the notation $\mathcal{B}_F$ to indicate the class of distributions of random weights $\pi$ satisfying it. For such a class it is possible to introduce a new symmetric representation which leads to a simple characterization of the canonical form. We shall then use such a representation to specialize the results obtained in Section 5.

Let us say that the random vector $(Z_1,\ldots,Z_k)$ is a uniform random permutation of the vector $(y_1,\ldots,y_k)$ if the distribution of $(Z_1,\ldots,Z_k)$ gives constant probability $1/k!$ to all possible $k!$ permutations of $(y_1,\ldots,y_k)$. We can then introduce the following random permutation of $\pi$.

Definition 5 (Symmetrization). Let $\pi\in\mathcal{B}_F$. Then the sequence of random weights $q=(q_1,q_2,\ldots)$ is a symmetrization of $\pi$, denoted by SYMM$(\pi)$, if, conditionally on $\pi$, the first $N^+(\pi)$ components of $q$ are a uniform random permutation of the $N^+(\pi)$ strictly positive components of $\pi$ and $q_{N^+(\pi)+i}=0$ for $i=1,2,\ldots$. Furthermore, we shall say that a sequence of random weights $\pi\in\mathcal{B}_F$ has a symmetric distribution if $(\pi_1,\ldots,\pi_{N^+(\pi)})\,|\,N^+(\pi)=n$ has a distribution invariant under permutations of the components for any finite $n$ such that $\Pr(N^+(\pi)=n)>0$.

The notions of symmetrization and symmetric distribution are related as follows. Clearly, if $q=$ SYMM$(\pi)$, the distribution of $(q_1,\ldots,q_{N^+(\pi)})\,|\,\pi$ is invariant under permutations of the components. The same holds true for $(q_1,\ldots,q_{N^+(\pi)})\,|\,N^+(\pi)$ and, noticing that $N^+(\pi)=N^+(q)$ a.s., for $(q_1,\ldots,q_{N^+(q)})\,|\,N^+(q)$, so that $q$ has a symmetric distribution. Furthermore, a random sequence of weights is invariant under symmetrization if and only if it has a symmetric distribution.

Symmetrization, just as size-biased sampling, defines an intrinsic random relabelling, i.e. a reordering independent of the mechanism originally used to label the groups. Formally, if $\pi'$ is a random permutation of $\pi$ then SYMM$(\pi')\sim$ SYMM$(\pi)$. This implies that there is a one-to-one correspondence between symmetric distributions and distributions invariant under size-biased sampling, since SBS$(\pi)\sim$ SBS$(\pi')$ if and only if SYMM$(\pi)\sim$ SYMM$(\pi')$. By Proposition 3, this means that, for any given measure $P_0$, two elements of $\mathcal{P}_F$ have the same distribution if and only if the symmetrized representations of their random weights are the same. Equivalently, there is a one-to-one correspondence between prior distributions in $\mathcal{P}_F$ and symmetric distributions. Notice also that it is not possible to construct distributions with infinitely many strictly positive random weights which are invariant under permutations of the components. It follows that the above setup cannot be extended to the whole class $\mathcal{P}$.

By deriving the size-biased sampling version of a symmetric distribution one can make explicit the correspondence between symmetric and size-biased invariant distributions. Moreover, one also obtains a simple characterization of size-biased invariant distributions in $\mathcal{B}_F$. In the following, with a slight abuse of notation, we shall interpret, if required, a finite-dimensional vector of random weights $(\pi_1,\ldots,\pi_n)$ as the random sequence $(\pi_1,\ldots,\pi_n,0,0,\ldots)$.

Proposition 8 (The interplay between symmetrization and size-biased sampling). Let $P=\sum_{i=1}^{\infty}\pi_i\delta_{\phi_i}\in\mathcal{P}_F$. Then $P$ admits a symmetric representation of the type
$$P=\sum_{i=1}^{N}q_i\,\delta_{\phi_i},$$
where $N$ is an integer-valued random variable and, for any $n$ such that $\Pr(N=n)>0$, $(q_1,\ldots,q_N)\,|\,N=n$ has a symmetric distribution on the $n$-dimensional simplex with strictly positive components $S_n=\{x_n\in\mathbb{R}^n:\ \sum_{i=1}^{n}x_i=1,\ x_i>0,\ i=1,\ldots,n\}$. Furthermore, $P$ admits a canonical representation of the type

$$P=\sum_{i=1}^{N'}p_i\,\delta_{\phi_i},\qquad(22)$$

where $N'$ is an integer-valued random variable and, for any $n$ such that $\Pr(N'=n)>0$, $(p_1,\ldots,p_{N'})\,|\,N'=n$ has a size-biased invariant distribution. There is a one-to-one correspondence between the two representations, given by the following relationships: $N\sim N'$ and, for any $n$ such that $\Pr(N=n)>0$, $(p_1,\ldots,p_{N'})\,|\,N'=n\ \sim\ \mathrm{SBS}\big((q_1,\ldots,q_N)\,|\,N=n\big)$ and has density
$$K(p_1,\ldots,p_n)=n!\,p_1\,\frac{p_2}{1-p_1}\cdots\frac{p_n}{1-p_1-\cdots-p_{n-1}}$$
with respect to the distribution of $(q_1,\ldots,q_N)\,|\,N=n$. Conversely, $(q_1,\ldots,q_N)\,|\,N=n\ \sim\ \mathrm{SYMM}\big((p_1,\ldots,p_{N'})\,|\,N'=n\big)$ and has density $1/K$ with respect to the distribution of $(p_1,\ldots,p_{N'})\,|\,N'=n$. As a consequence, a random sequence $\pi\in\mathcal{B}_F$ is invariant under size-biased sampling if and only if, for any $n$ such that $\Pr(N^+(\pi)=n)>0$, $(\pi_1,\ldots,\pi_{N^+(\pi)})\,|\,N^+(\pi)=n$ has density $K$ with respect to a symmetric distribution on $S_n$.

Proof. Let $q=\mathrm{SYMM}(\pi)$ and $\tilde\pi=\mathrm{SBS}(\pi)$. Then the two representations are straightforward consequences of the properties of size-biased sampling and symmetrization, once we make the following identifications: $N\sim N^+(\pi)\sim N^+(q)\sim N^+(\tilde\pi)\sim N'$ and, with a slight abuse of notation, $(q_1,\ldots,q_N)\,|\,N=n\sim(q_1,\ldots,q_{N^+(q)})\,|\,N^+(q)=n$, respectively $(p_1,\ldots,p_{N'})\,|\,N'=n\sim(\tilde\pi_1,\ldots,\tilde\pi_{N^+(\tilde\pi)})\,|\,N^+(\tilde\pi)=n$. Furthermore, as $q$ is a random permutation of $\pi$, it follows that $\tilde\pi\sim\mathrm{SBS}(q)$, which proves $(p_1,\ldots,p_{N'})\,|\,N'=n\sim\mathrm{SBS}((q_1,\ldots,q_N)\,|\,N=n)$. Conversely, as $\tilde\pi$ is a random permutation of $\pi$, we have $q\sim\mathrm{SYMM}(\tilde\pi)$, which proves $(q_1,\ldots,q_N)\,|\,N=n\sim\mathrm{SYMM}((p_1,\ldots,p_{N'})\,|\,N'=n)$.

In order to prove the relation between the distributions of $(p_1,\ldots,p_{N'})\,|\,N'=n$ and $(q_1,\ldots,q_N)\,|\,N=n$, let us now derive the distribution of the sequence $\tilde\pi$ by constructing it as a size-biased sampling of $q$. Conditionally on $q$, if $N^+(q)=n$, then $\tilde\pi_{n+i}=0$, $i=1,2,\ldots$, and, by Remark 1,
$$\Pr\big((\tilde\pi_1,\ldots,\tilde\pi_n)\in A_n\,|\,q\big)=\sum_{i_1,\ldots,i_n\ \mathrm{distinct}}I_{A_n}(q_{i_1},\ldots,q_{i_n})\,q_{i_1}\,\frac{q_{i_2}}{1-q_{i_1}}\cdots\frac{q_{i_n}}{1-q_{i_1}-\cdots-q_{i_{n-1}}}.$$
Because of the symmetry properties of $q$, taking expectation conditionally on $N^+(q)=n$, we obtain
$$\Pr\big((\tilde\pi_1,\ldots,\tilde\pi_n)\in A_n\,|\,N^+(q)=n\big)=n!\,E\Big[I_{A_n}(q_1,\ldots,q_n)\,q_1\,\frac{q_2}{1-q_1}\cdots\frac{q_n}{1-q_1-\cdots-q_{n-1}}\,\Big|\,N^+(q)=n\Big].$$


Since $N^+(q)=N^+(\tilde\pi)$ a.s., this proves that the distribution of $(\tilde\pi_1,\ldots,\tilde\pi_{N^+(\tilde\pi)})\,|\,N^+(\tilde\pi)=n$ admits density $K(\cdot)$ with respect to the distribution of $(q_1,\ldots,q_{N^+(q)})\,|\,N^+(q)=n$.

It is left to prove the characterization of an invariant distribution given at the end of the proposition. If $\pi$ is size-biased invariant, then $\pi\sim\mathrm{SBS}(\pi)$ and $\pi\sim\mathrm{SBS}(\mathrm{SYMM}(\pi))$, as $\mathrm{SYMM}(\pi)$ is a random permutation of $\pi$. It follows that $\pi$ is a size-biased sampling of a symmetric distribution and therefore must have the form given in the proposition. Conversely, if $\pi$ has the form given in the proposition, then it is the size-biased sampling of a symmetric distribution and therefore is invariant.

The above proposition gives two equivalent simple ways of representing and defining priors in the class $\mathcal{P}_F$. It tells us an easy route to go from one representation to the other. It also says how to construct size-biased invariant distributions as a simple modification of symmetric distributions and how to check whether a given distribution is invariant. This is relevant in order to apply Corollary 1 to obtain the posterior distribution, which is our next task. A straightforward application of that corollary to the canonical representation (22) gives the following result. Let $P\in\mathcal{P}_F$ be written as in (22). Then, in the notation of Theorem 2, $P$ given $\mathbf{X}_n=\mathbf{x}_n$ is distributed as
$$\sum_{i=1}^{r_n}\rho_i\,\delta_{v_i}+\sum_{i=r_n+1}^{r_n+N_n}\rho_i\,\delta_{\phi_i},\qquad(23)$$
where
$$\Pr(N_n=j)\ \propto\ E\big[h(p_1,\ldots,p_{r_n})\,\big|\,N'=r_n+j\big]\,\Pr(N'=r_n+j),\qquad j=0,1,\ldots,$$
$(\rho_1,\ldots,\rho_{r_n+N_n})\,|\,N_n=j$ has density proportional to $h$ evaluated at the first $r_n$ coordinates with respect to the distribution of $(p_1,\ldots,p_{N'})\,|\,N'=r_n+j$, and
$$h(y_1,\ldots,y_{r_n})=(1-y_1)\cdots(1-y_1-\cdots-y_{r_n-1})\prod_{i=1}^{r_n}y_i^{n_i-1}.$$

Exploiting the above-given relation between the symmetric and the canonical representation, it is simple to obtain a representation of the posterior in terms of the symmetric representation $P=\sum_{i=1}^{N}q_i\delta_{\phi_i}$. An explicit expression can be found in Ongaro and Cattaneo (2002), where a class with such a symmetric structure is also considered. In particular, as $(p_1,\ldots,p_{N'})\,|\,N'=r_n+j\ \sim\ \mathrm{SBS}\big((q_1,\ldots,q_N)\,|\,N=r_n+j\big)$ and letting $k=r_n$, one has
$$E\big[h(p_1,\ldots,p_k)\,\big|\,N'=k+j\big]=\sum_{i_1,\ldots,i_k\ \mathrm{distinct}}E\big[q_{i_1}^{n_1}\cdots q_{i_k}^{n_k}\,\big|\,N=k+j\big],$$
where the indexes $i_l$ vary from 1 to $k+j$. Because of the symmetry of the distribution of the $q_i$'s, the latter quantity is equal to
$$(j+1)_{[k]}\,E\big[q_1^{n_1}\cdots q_k^{n_k}\,\big|\,N=k+j\big],$$
where $x_{[k]}$ denotes the ascending factorial $x(x+1)\cdots(x+k-1)$.

where the indexes il vary from 1 to k + j. Because of the symmetry of the distribution of the 6i s, the latter quantity is equal to (j + 1)[k] E[61n1 · · · 6knk | N = k + j]; where x[k] denotes the factorial x(x + 1) · · · (x + k − 1).


Quantities of the type $E[h(p_1,\ldots,p_k)]$, which are the basis for the computation of the predictive distribution and of the probabilities of the various configurations of ties, can then be obtained by taking expectation with respect to $N$. Similar arguments lead to the following expressions, in terms of $q_1$, for the number $M_j$ of distinct observations which are repeated $j$ times and for the total number of distinct observations $D_n$: $E[M_j\,|\,N]=\binom{n}{j}\,N\,E[q_1^{\,j}(1-q_1)^{n-j}\,|\,N]$ and $E[D_n\,|\,N]=N\,E[1-(1-q_1)^n\,|\,N]$.

Hjort and Ongaro (2003) studied a generalization of the Dirichlet process which can be viewed as a special instance of the subclass presented in this section. It is obtained by choosing a symmetric Dirichlet distribution for the vector $(q_1,\ldots,q_N)\,|\,N=n$. Note also that a prior distribution with the same Dirichlet structure for the random weights but with a non-random number of locations $N$ has been discussed by several authors in different contexts (see, for a review, Ishwaran and Zarepour, 2002, Section 4).
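As a concrete member of $\mathcal{P}_F$, one can take $(q_1,\ldots,q_N)\,|\,N=n$ symmetric Dirichlet, as in the Hjort-Ongaro generalization just mentioned. The sketch below (ours; the number of groups and the Dirichlet parameter are arbitrary illustrative values) builds the symmetric weights, obtains the canonical representation by a size-biased permutation, and evaluates $E[D_n\,|\,N]$ from the expression above by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(7)

def symmetric_weights(n_groups, a, rng):
    """Symmetric representation: (q_1,...,q_N) ~ Dirichlet(a,...,a) given N = n_groups."""
    return rng.dirichlet(np.full(n_groups, a))

def size_biased(w, rng):
    """Size-biased permutation: the canonical representation of the same prior."""
    w = np.asarray(w, dtype=float)
    order, remaining = [], list(range(len(w)))
    probs = w.copy()
    while remaining:
        j = rng.choice(len(remaining), p=probs / probs.sum())
        order.append(remaining.pop(j))
        probs = np.delete(probs, j)
    return w[order]

def expected_distinct_given_N(n_groups, a, n, reps, rng):
    """Monte Carlo of E[D_n | N] = N * E[1 - (1 - q_1)^n | N] for the symmetric Dirichlet case."""
    q1 = rng.dirichlet(np.full(n_groups, a), size=reps)[:, 0]
    return n_groups * np.mean(1.0 - (1.0 - q1) ** n)

q = symmetric_weights(5, a=0.8, rng=rng)
print(size_biased(q, rng))                       # same prior, size-biased ordering
print(expected_distinct_given_N(5, a=0.8, n=20, reps=100_000, rng=rng))
```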

7. Discussion

In this paper, we have exploited the link between size-biased sampling and the Bayesian updating mechanism for a general class $\mathcal{P}$ of discrete nonparametric priors. This has led us to the following key result: the random weights appearing in a natural representation of the distribution of $P$ given a sample of observations can be expressed, up to a simple density modification, as a (partial) size-biased sampling of the prior random weights. This compact and non-recursive expression for the posterior holds for any possible representation of any member of $\mathcal{P}$. Moreover, this result, coupled with size-biased sampling theory, has pointed us to a representation of the class $\mathcal{P}$ particularly suitable for Bayesian inference, namely the canonical form. As a consequence, we have derived several properties of the distribution of a sample from $P$. All the expressions have been derived both for the canonical form and for an arbitrary representation.

As an application of our approach, we have investigated the general subclass formed by elements with finitely many strictly positive random weights. For such a class, we have given a constructive characterization of the canonical form and a new equivalent representation in terms of finite-dimensional symmetric distributions for the random weights.

The rich nature of the class $\mathcal{P}$ offers a useful family of priors for use in nonparametric settings. For example, consider hierarchical Bayesian models where a nonparametric prior is used at a second level on the distribution of the parameters. Such models play a central role in the recent Bayesian nonparametric literature (see Escobar and West, 1998; MacEachern, 1998 and references therein), where they are largely used in problems of clustering of items. In such hierarchical settings, the structure of the clusters is determined by the configuration of ties in a sample from the nonparametric prior. The standard practice of using a Dirichlet process as a second-level prior leads to serious drawbacks, due to the severe constraints it imposes on the probabilities of the various configurations of ties (see Petrone and Raftery, 1997; Green and Richardson, 2001).


On the contrary, the class P displays a completely general structure of probabilities of ties, as this is determined by the distribution of the random weights, which is itself arbitrary. Therefore P provides a completely flexible model for such contexts. Furthermore, the size-biased characterization represents a powerful device capable of making the class P more amenable to nonparametric applications, in much the same way that Sethuraman's (1994) representation has done for the Dirichlet process.

Acknowledgements

We wish to thank Monica Chiogna for suggestions that led to considerable improvements of the initial draft. We are also grateful to two referees for helpful comments.

Appendix A. Proofs

Proof of Proposition 2. It is enough to prove that
$$\Pr(I_n \in \mathcal{I}_n \,|\, \pi(k)) = (1 - \pi_1(k)) \cdots (1 - \pi_1(k) - \cdots - \pi_{m-1}(k)) \prod_{i=1}^{m} \pi_i(k)^{n_i - 1}. \qquad \text{(A.1)}$$

This implies $c_n = E[h(\pi(k))] = \Pr(I_n \in \mathcal{I}_n)$, and the latter expression can be easily seen to be equal to
$$\sum_{i_1, \ldots, i_m \ \text{distinct}} E[\pi_{i_1}^{n_1} \cdots \pi_{i_m}^{n_m}].$$

Furthermore, it is simple to check that
$$\Pr(I_n \in \mathcal{I}_n) > 0 \iff \Pr(N^+(\pi) \geq m) > 0. \qquad \text{(A.2)}$$

Let us now prove (A.1). First notice that $N^+(\pi(k))$ is identically equal to $N^+(\pi)$, $\pi(k)$ being a permutation of $\pi$. If $N^+(\pi) = N^+(\pi(k)) < m$, then clearly $\Pr(I_n \in \mathcal{I}_n \,|\, \pi(k)) = 0$, so that (A.1) is satisfied. Suppose now that $N^+(\pi) = N^+(\pi(k)) \geq m$ and let $\mathcal{I}_j$, $j = 1, \ldots, n - 1$, be the projection of $\mathcal{I}_n$ onto the first $j$ coordinates, that is
$$\mathcal{I}_j = \{(i_1, \ldots, i_j) \in \mathbb{N}^j : (i_1, \ldots, i_n) \in \mathcal{I}_n \ \text{for some} \ (i_{j+1}, \ldots, i_n) \in \mathbb{N}^{n-j}\}. \qquad \text{(A.3)}$$
Let $m_j$ be the number of distinct values in an arbitrary vector $(i_1, \ldots, i_j) \in \mathcal{I}_j$. Then it is not difficult to check that $\Pr(I_1 \in \mathcal{I}_1 \,|\, \pi(k)) = 1$ and, for $1 \leq j < n$,
$$\Pr(I_{j+1} \in \mathcal{I}_{j+1} \,|\, I_j \in \mathcal{I}_j, \pi(k)) = \begin{cases} \pi_s(k) & \text{if } (j + 1) \in B_s,\ 1 \leq s \leq m_j, \\ 1 - \pi_1(k) - \cdots - \pi_{m_j}(k) & \text{if } (j + 1) \notin B_1 \cup \cdots \cup B_{m_j}. \end{cases}$$
Expression (A.1) then follows by multiplying the above conditional probabilities.
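As an illustration outside the paper itself, the sequential conditional probabilities above mirror the mechanism that generates a size-biased permutation of a finite weight vector: indices are drawn one at a time, without replacement, with probability proportional to the remaining weights. A minimal sketch, with hypothetical function names:

```python
# Illustrative only: generate a size-biased permutation of a finite weight vector
# by sequential sampling without replacement, proportional to the remaining weights.
import numpy as np

def size_biased_permutation(weights, rng):
    """Return the weights rearranged in size-biased (order-of-appearance) order."""
    w = np.asarray(weights, dtype=float)
    remaining = list(range(len(w)))
    out = []
    while remaining:
        probs = w[remaining] / w[remaining].sum()
        pick = rng.choice(len(remaining), p=probs)
        out.append(w[remaining.pop(pick)])
        # after a group is removed, the next index is drawn from the leftover mass
    return np.array(out)

rng = np.random.default_rng(3)
print(size_biased_permutation([0.5, 0.3, 0.2], rng))
# e.g. the first entry equals 0.5 with probability 0.5, 0.3 with probability 0.3, ...
```

Drawing all the indices gives a full size-biased permutation; stopping after a fixed number of draws corresponds, roughly, to the partial sampling used in the results above.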


Proof of Theorem 2. The posterior distribution of $P$ can be deduced from the conditional distribution of $(\pi, Q)$ given $X_n = x_n$, where $X_n = (\phi_{I_1}, \ldots, \phi_{I_n})$, as noted in the discussion just above Theorem 2. A fruitful approach in order to derive the latter is to further condition on $I_n = (I_1, \ldots, I_n)$, that is, to consider
$$(\pi, Q) \,|\, X_n = x_n,\ I_n = i_n, \qquad \text{(A.4)}$$
where $i_n = (i_1, \ldots, i_n)$. Since $P_0$ is a diffuse measure and the $\phi_i$'s are independent with distribution $P_0$, we can assume, without loss of generality, that the $\phi_i$'s are all distinct. Note that, under this assumption, $X_i = X_j$ if and only if $I_i = I_j$, so that $X_n = x_n$ implies $I_n \in I(x_n)$, where $I(x_n)$ is defined as in (11). It follows that in (A.4) we only need to consider vectors $i_n$ belonging to $I(x_n)$. Notice furthermore that $\Pr(I_n \in I(x_n))$ can be assumed to be positive, as shown by the following considerations. Let
$$D(x_n) = \{(z_1, \ldots, z_n) \in \mathcal{X}^n : \Pi(z_1, \ldots, z_n) = \Pi(x_n)\}. \qquad \text{(A.5)}$$

Then we can assume, without loss of generality, that $\Pr(X_n \in D(x_n)) > 0$. If the $\phi_i$'s are all distinct, then $X_n \in D(x_n)$ if and only if $I_n \in I(x_n)$, so that
$$\Pr(X_n \in D(x_n)) = \Pr(I_n \in I(x_n)) = \sum_{i_1, \ldots, i_{r_n} \ \text{distinct}} E[\pi_{i_1}^{n_1} \cdots \pi_{i_{r_n}}^{n_{r_n}}] > 0. \qquad \text{(A.6)}$$

Let us now derive the distribution of (A.4). It is not difficult to check that, conditionally on $X_n = x_n$ and $I_n = i_n$, $\pi$ is independent of $Q$. Clearly, $Q \,|\, X_n = x_n,\ I_n = i_n$ is a sequence of independent random variables such that $(\phi_{i_1}, \ldots, \phi_{i_n})$ is degenerate at $x_n$ while the other $\phi_i$'s have distribution $P_0$. Furthermore, conditionally on $I_n = i_n$, $\pi$ is independent of $X_n$, so that
$$\pi \,|\, (X_n = x_n,\ I_n = i_n) \sim \pi \,|\, I_n = i_n.$$
It is now possible to derive a representation for the posterior distribution of $P$ conditionally on $I_n = i_n$. Let $\{B_1, \ldots, B_{r_n}\} = \Pi(x_n)$ and denote by $l_j$ the smallest integer belonging to $B_j$, $j = 1, \ldots, r_n$. Noting that, conditionally on $X_n = x_n$ and $I_n = i_n$, $\phi_{i_{l_j}} = v_j$, we have that $P \,|\, X_n = x_n,\ I_n = i_n$ is distributed as
$$\sum_{j=1}^{r_n} \pi_{i_{l_j}} \delta_{v_j} + \sum_{m \neq i_{l_j},\ j = 1, \ldots, r_n} \pi_m \delta_{\phi_m}, \qquad \text{(A.7)}$$
where the weights $(\pi_1, \pi_2, \ldots)$ in (A.7) are distributed as $(\pi_1, \pi_2, \ldots) \,|\, I_n = i_n$.

We are left to compute the conditional distribution of $I_n$ given $X_n = x_n$. If $i_n \in I(x_n)$ then, for suitable measurable sets $A_1, \ldots, A_n$,
$$\Pr(X_1 \in A_1, \ldots, X_n \in A_n \,|\, I_n = i_n) = P_0\Bigg(\bigcap_{i \in B_1} A_i\Bigg) \cdots P_0\Bigg(\bigcap_{i \in B_{r_n}} A_i\Bigg),$$
which is independent of $i_n$. This implies that
$$\Pr(X_1 \in A_1, \ldots, X_n \in A_n \,|\, I_n = i_n) = \Pr(X_1 \in A_1, \ldots, X_n \in A_n \,|\, I_n \in I(x_n))$$


and therefore, conditionally on $I_n \in I(x_n)$, $X_n$ is independent of $I_n$. Noticing that $\{X_n = x_n\} \subseteq \{I_n \in I(x_n)\}$, we have
$$\Pr(I_n = i_n \,|\, X_n = x_n) = \Pr(I_n = i_n \,|\, X_n = x_n,\ I_n \in I(x_n)) = \Pr(I_n = i_n \,|\, I_n \in I(x_n)).$$
As a consequence we obtain the following representation for the posterior distribution:
$$P \,|\, X_n = x_n \sim \sum_{j=1}^{r_n} \tilde{\pi}_j \delta_{v_j} + \sum_{j \geq r_n + 1} \tilde{\pi}_j \delta_{\phi_j}, \qquad \text{(A.8)}$$
where
$$(\tilde{\pi}_1, \tilde{\pi}_2, \ldots) \sim \big(\pi_{I_{l_1}}, \ldots, \pi_{I_{l_{r_n}}}, \pi_{J_1(I_{l_1}, \ldots, I_{l_{r_n}})}, \pi_{J_2(I_{l_1}, \ldots, I_{l_{r_n}})}, \ldots \,\big|\, I_n \in I(x_n)\big)$$
and $J_i(\cdot)$ is defined as in (3). Now let $L_i$ be the first (random) time the $i$th new value of $I_1, I_2, \ldots$ (and consequently of $X_1, X_2, \ldots$) appears; that is, $L_1 = 1$ and, for $i = 2, 3, \ldots, r_n$,
$$L_i = \inf\{j \in \mathbb{N} : j > L_{i-1} \ \text{and} \ I_j \notin \{I_1, \ldots, I_{j-1}\}\} = \inf\{j \in \mathbb{N} : j > L_{i-1} \ \text{and} \ X_j \notin \{X_1, \ldots, X_{j-1}\}\}.$$
Then we have
$$(\tilde{\pi}_1, \tilde{\pi}_2, \ldots) \sim (T_1, T_2, \ldots \,|\, I_n \in I(x_n)),$$
where
$$(T_1, T_2, \ldots) = \big(\pi_{I_{L_1}}, \ldots, \pi_{I_{L_{r_n}}}, \pi_{J_1(I_{L_1}, \ldots, I_{L_{r_n}})}, \pi_{J_2(I_{L_1}, \ldots, I_{L_{r_n}})}, \ldots\big).$$

Noticing that $N^+(\pi) \geq r_n$ a.s. on $I_n \in I(x_n)$, it is straightforward to check that $(T_1, T_2, \ldots)$ has exactly the same structure as the $r_n$-order partial size-biased sampling $(T_1^k, T_2^k, \ldots)$; this proves expression (10). Finally, formula (12) is a direct consequence of Proposition 2.

Proof of Proposition 3. Let $\tilde{\pi} = \mathrm{SBS}(\pi)$ and $\tilde{\pi}' = \mathrm{SBS}(\pi')$. By Proposition 1, $P_0 = P_0'$ and $\tilde{\pi} \sim \tilde{\pi}'$ imply $P \sim P'$. Suppose now that $P \sim P'$. Then $E[P(A)] = P_0(A)$ must be equal to $E[P'(A)] = P_0'(A)$ for any arbitrary measurable set $A$, so that $P_0 = P_0'$. Let us now show that $\tilde{\pi} \sim \tilde{\pi}'$. Clearly we have that
$$P = \sum_{i=1}^{\infty} \pi_i \delta_{\phi_i} \sim P' = \sum_{i=1}^{\infty} \pi_i' \delta_{\phi_i'}.$$
Let $X_n$ ($X_n'$) be a random sample from $P$ ($P'$). Then by formula (A.6) and Corollary 1 it follows that, for an arbitrary vector $x_n \in \mathcal{X}^n$,
$$\Pr(X_n \in D(x_n)) = \sum_{i_1, \ldots, i_{r_n} \ \text{distinct}} E[\pi_{i_1}^{n_1} \cdots \pi_{i_{r_n}}^{n_{r_n}}] = E[h(\tilde{\pi})], \qquad \text{(A.9)}$$
where
$$h(\tilde{\pi}) = (1 - \tilde{\pi}_1) \cdots (1 - \tilde{\pi}_1 - \cdots - \tilde{\pi}_{r_n - 1}) \prod_{i=1}^{r_n} \tilde{\pi}_i^{n_i - 1}.$$


Because $X_n$ and $X_n'$ have the same distribution, the above expression must be equal to
$$\Pr(X_n' \in D(x_n)) = E[h(\tilde{\pi}')].$$
It follows that for all positive integers $n_1, \ldots, n_k$ and for all $k \geq 1$ it holds that
$$E\Bigg[(1 - \tilde{\pi}_1) \cdots (1 - \tilde{\pi}_1 - \cdots - \tilde{\pi}_{k-1}) \prod_{i=1}^{k} \tilde{\pi}_i^{n_i - 1}\Bigg] = E\Bigg[(1 - \tilde{\pi}_1') \cdots (1 - \tilde{\pi}_1' - \cdots - \tilde{\pi}_{k-1}') \prod_{i=1}^{k} (\tilde{\pi}_i')^{n_i - 1}\Bigg]. \qquad \text{(A.10)}$$

The plan now is to show that (A.10) implies $\tilde{\pi} \sim \tilde{\pi}'$, by showing that $\tilde{\pi}_n = (\tilde{\pi}_1, \ldots, \tilde{\pi}_n) \sim \tilde{\pi}_n' = (\tilde{\pi}_1', \ldots, \tilde{\pi}_n')$ for all $n \geq 1$. The distribution of a random vector taking values in a compact set is uniquely determined by the knowledge of all mixed moments of the vector. It follows that (A.10) implies that $\tilde{\pi}_1 \sim \tilde{\pi}_1'$ and that the distributions of $\tilde{\pi}_k$ and $\tilde{\pi}_k'$, for $k > 1$, coincide on the set
$$C_k = \{y_k \in \mathbb{R}^k : y_1 + \cdots + y_{k-1} < 1,\ y_1 + \cdots + y_k \leq 1,\ y_i \geq 0,\ i = 1, \ldots, k\}.$$
In order to show that $\tilde{\pi}_n \sim \tilde{\pi}_n'$ we use an induction argument. Suppose that $\tilde{\pi}_k \sim \tilde{\pi}_k'$. We know that the distributions of $\tilde{\pi}_{k+1}$ and $\tilde{\pi}_{k+1}'$ agree on the set $C_{k+1}$. On the complementary set
$$C_{k+1}^{c} = \{y_{k+1} \in \mathbb{R}^{k+1} : y_1 + \cdots + y_k = 1,\ y_1 + \cdots + y_{k+1} \leq 1,\ y_i \geq 0,\ i = 1, \ldots, k + 1\}$$
the last components of $\tilde{\pi}_{k+1}$ and $\tilde{\pi}_{k+1}'$ are degenerate at zero. This implies that the distributions of $\tilde{\pi}_{k+1}$ and $\tilde{\pi}_{k+1}'$ on $C_{k+1}^{c}$ are determined by the distributions of $\tilde{\pi}_k$ and $\tilde{\pi}_k'$ which, by hypothesis, are identical.

References

Billingsley, P., 1999. Convergence of Probability Measures, 2nd Edition. Wiley, New York.
Carlton, M.A., 1999. Applications of the two-parameter Poisson–Dirichlet distribution. Ph.D. Thesis, Department of Statistics, University of California, Los Angeles.
Donnelly, P., 1991. The heaps process, libraries and size-biased permutations. J. Appl. Probab. 28, 322–335.
Donnelly, P., Joyce, P., 1989. Continuity and weak convergence of ranked and size-biased permutations on the infinite simplex. Stochastic Process. Appl. 31, 89–103.
Engen, S., 1979. Stochastic Abundance Models with Emphasis on Biological Communities and Species Diversity. Chapman & Hall, London.
Escobar, M.D., West, M., 1998. Computing nonparametric hierarchical models. In: Dey, D., Muller, P., Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics, Vol. 133. Springer, New York.
Ewens, W.J., 1979. Mathematical Population Genetics. Springer, New York.
Ewens, W.J., 1990. Population genetics theory—the past and the future. In: Lessard, S. (Ed.), Mathematical and Statistical Developments of Evolutionary Theory. Kluwer, Dordrecht.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–230.
Ferguson, T.S., 1974. Prior distributions on spaces of probability measures. Ann. Statist. 2, 615–629.


Green, P.J., Richardson, S., 2001. Modelling heterogeneity with and without the Dirichlet process. Scand. J. Statist. 28, 355–375.
Hjort, N.L., 2000. Bayesian analysis for a generalized Dirichlet process prior. Statistical Research Report, Department of Mathematics, University of Oslo.
Hjort, N.L., Ongaro, A., 2003. Nonparametric Bayesian inference using a generalization of the Dirichlet process. Statistical Research Report no. 11, Department of Mathematics, University of Oslo.
Ishwaran, H., James, L., 2001. Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. 96, 161–173.
Ishwaran, H., Zarepour, M., 2002. Exact and approximate sum-representations for the Dirichlet process. Canad. J. Statist. 30, 1–16.
Kingman, J.F.C., 1975. Random discrete distributions. J. Roy. Statist. Soc. B 35, 1–22.
Lloyd, C.J., Williams, E.J., 1988. Recursive splitting of an interval when the proportions are identical and independent random variables. Stochastic Process. Appl. 28, 111–122.
MacEachern, S.N., 1998. Computational methods for mixture of Dirichlet process models. In: Dey, D., Muller, P., Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics, Vol. 133. Springer, New York.
Ongaro, A., Cattaneo, C., 2002. Discrete random probability measures: a general framework for Bayesian inference. Quaderno di Dipartimento n. 6, Dipartimento di Statistica, Università Milano-Bicocca.
Petrone, S., Raftery, A.E., 1997. A note on the Dirichlet process prior in Bayesian nonparametric inference with partial exchangeability. Statist. Probab. Lett. 36, 69–83.
Pitman, J., 1996. Random discrete distributions invariant under size-biased permutation. Adv. Appl. Probab. 28, 525–539.
Sethuraman, J., 1994. A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.