Chapter 10 Estimation of a Mixing Distribution
10.0. INTRODUCTION

Mixtures of distributions occur quite frequently in the biological and physical sciences. In biology, to study a certain characteristic in a natural population of fish, a random sample of fish might be taken and the characteristic measured for each member of the sample. Because the characteristic usually varies with age, the distribution of the characteristic in the total population will be a mixture of the distributions at different ages. To analyze the qualitative character of inheritance, a geneticist might observe a phenotypic value that has a mixture of distributions, because each genotype may produce phenotypic values over an interval. Mixtures of distributions also occur in the empirical Bayes procedures proposed by Robbins (1964); here the mixing distributions are the prior distributions. These examples and additional ones can be found in Choi and Bulgren (1968), where further references are provided.

The problem of interest is the estimation of the mixing distribution from a random sample of observations. Obviously this problem is meaningful only when there is a one-to-one correspondence between the mixing and the induced distributions; in other words, the problem should be identifiable.

Let $(\mathscr{X}, \mathscr{F})$ and $(\Theta, \mathscr{G})$ be measurable spaces such that $\mathscr{G}$ contains all singletons of $\Theta$. Let $\Phi = \{P_\theta, \theta \in \Theta\}$ be a family of probability measures on $(\mathscr{X}, \mathscr{F})$ such that the mapping $\theta \to P_\theta(A)$ is $\mathscr{G}$-measurable for each $A \in \mathscr{F}$. Given a probability measure $G$ on $(\Theta, \mathscr{G})$, define
$$H(A) = \int_\Theta P_\theta(A)\, dG(\theta), \qquad A \in \mathscr{F}. \tag{1}$$
This $H$ is a probability measure on $(\mathscr{X}, \mathscr{F})$; $H$ is called a mixture of $\Phi$, and $G$ a mixing distribution. Let $\Lambda$ be the class of all mixing distributions on $(\Theta, \mathscr{G})$ and $\zeta$ the corresponding class of mixtures. Let $Q : \Lambda \to \zeta$ be defined by $Q(G) = H$. Here $\Lambda$ is said to be identifiable if $Q$ is a one-to-one mapping. The problem of estimation of the mixing distribution $G$ is meaningful only when the family $\Lambda$ is identifiable. It is easy to see that if $T : (\mathscr{X}, \mathscr{F}) \to (\mathscr{Y}, \tau)$ and if the family $\Lambda$ is identifiable with respect to $\Phi T^{-1} = \{P_\theta T^{-1}, \theta \in \Theta\}$, then $\Lambda$ is identifiable with respect to $\Phi$.

A discrete mixing distribution with a finite number of mass points is called a finite mixing distribution. For finite mixing distributions, the following result holds.

THEOREM 10.0.1 A necessary and sufficient condition for the class $\Lambda$ of all finite mixing distributions to be identifiable relative to $\Phi = \{P_\theta; \theta \in \Theta\}$ is that the family of the $P_\theta$ be linearly independent (as set functions on $\mathscr{F}$).

Proof Suppose the $P_\theta$ are linearly dependent, say $\sum_{i=1}^N C_i P_{\theta_i} = 0$ with all $C_i$ nonzero and
$$C_1 \le C_2 \le \cdots \le C_M < 0 < C_{M+1} \le \cdots \le C_N.$$
(Both groups must be nonempty, since a nontrivial combination of probability measures with coefficients of one sign cannot vanish.) Then
$$\sum_{i=1}^M |C_i| P_{\theta_i} = \sum_{i=M+1}^N |C_i| P_{\theta_i}$$
and, evaluating both sides at the whole space,
$$\sum_{i=1}^M |C_i| = \sum_{i=M+1}^N |C_i| = k > 0$$
(say) because the $P_{\theta_i}$ are probability measures. Let $a_i = |C_i|/k$, $1 \le i \le N$. Then
$$\sum_{i=1}^M a_i P_{\theta_i} = \sum_{i=M+1}^N a_i P_{\theta_i}$$
are two distinct representations of the same finite mixture in $\zeta$. Hence, $\Lambda$ cannot be identifiable. Conversely, suppose that the family of the $P_\theta$ is linearly independent. Then the $P_\theta$ form a basis for the linear space $\langle \Phi \rangle$ spanned by $\Phi$. Since $\zeta \subset \langle \Phi \rangle$, the identifiability of $\Lambda$ follows from the uniqueness of the representation with respect to a basis.

Theorem 10.0.1 does not hold for arbitrary mixtures. Theorem 10.0.2 below gives a sufficient condition for identifiability of finite mixtures.
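The necessity half of Theorem 10.0.1 is easy to see numerically. The following minimal sketch (a hypothetical example, not from the text; Python with numpy is assumed) constructs three linearly dependent components on $\{0, 1, 2\}$ and exhibits two distinct finite mixing distributions inducing the same mixture.

```python
import numpy as np

# Three component distributions on {0, 1, 2}; P3 is a convex combination
# of P1 and P2, so the family {P1, P2, P3} is linearly dependent.
P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])
P3 = 0.5 * P1 + 0.5 * P2

# Mixing law A puts mass (1/2, 1/2, 0) on (P1, P2, P3); mixing law B puts
# mass (0, 0, 1). Both induce exactly the same mixture H.
H_A = 0.5 * P1 + 0.5 * P2
H_B = 1.0 * P3

print(np.allclose(H_A, H_B))  # True: one mixture, two mixing distributions
```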
THEOREM 10.0.2 Suppose that to each $P \in \Phi$ is associated a transform $\phi$ with domain of definition $D_\phi$ and that the mapping $M : P \to \phi$ is linear. Suppose further that there is a total ordering $(\preceq)$ of $\Phi$ such that $P_1 \preceq P_2 \Rightarrow D_{\phi_1} \subset D_{\phi_2}$ and, for each $P_1 \in \Phi$, there exists $t_1$ in the closure of $T_1 = \{t : \phi_1(t) \neq 0\}$ such that
$$\lim_{t \to t_1} \phi_2(t)/\phi_1(t) = 0 \tag{2}$$
for each $P_1 \prec P_2$ $(P_1, P_2 \in \Phi)$. Then the class $\Lambda$ of all finite mixing distributions is identifiable relative to $\Phi$.

Proof Suppose
$$\sum_{i=1}^N C_i P_i(\cdot) = 0, \qquad P_i \in \Phi. \tag{3}$$
Without loss of generality suppose that $P_i \prec P_j$ if $i < j$. Then, by the linearity of $M$,
$$\sum_{i=1}^N C_i \phi_i(t) = 0. \tag{4}$$
For $t \in T_1$,
$$C_1 + \sum_{i=2}^N C_i \phi_i(t)/\phi_1(t) = 0, \tag{5}$$
and hence, as $t \to t_1$ through values in $T_1$, we get that $C_1 = 0$ by (2). Hence,
$$\sum_{i=2}^N C_i P_i(\cdot) = 0. \tag{6}$$
Repeating the argument, we get that $C_i = 0$, $1 \le i \le N$, and hence we have identifiability.

We shall now discuss the identifiability problem in the arbitrary mixing case when $\mathscr{X}$ and $\Theta$ are intervals contained in the real line. Let $f$ be a nonnegative Borel-measurable function from $\mathscr{X} \times \Theta$ to $(0, \infty)$ such that
$$\int_{\mathscr{X}} f(x, \theta)\, dx = 1, \qquad \theta \in \Theta. \tag{7}$$
Let $G$ be a probability distribution on $\Theta$ and define
$$f_G(x) = \int_\Theta f(x, \theta)\, dG(\theta), \qquad x \in \mathscr{X}. \tag{8}$$
Let $\Delta = \{f(\cdot, \theta), \theta \in \Theta\}$ and $\Gamma = \{f(x, \cdot), x \in \mathscr{X}\}$, and let $C_0(\Theta)$ be the Banach space of continuous functions on the interval $\Theta$ that vanish at infinity, normed by
$$\|g\| = \sup_{\theta \in \Theta} |g(\theta)|. \tag{9}$$
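As a concrete instance of definition (8), the sketch below evaluates a mixture density by numerical quadrature. The normal location family $f(x, \theta) = \phi(x - \theta)$ and the mixing law $G = N(0, 1)$ are illustrative assumptions, not choices made in the text.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f(x, theta):
    # component density f(x, theta): normal location family (an assumption)
    return norm.pdf(x - theta)

def f_G(x):
    # f_G(x) = integral of f(x, theta) dG(theta), with dG(theta) = phi(theta) dtheta
    val, _ = quad(lambda t: f(x, t) * norm.pdf(t), -np.inf, np.inf)
    return val

# Sanity check: mixing N(theta, 1) over theta ~ N(0, 1) gives N(0, 2).
print(f_G(0.0), norm.pdf(0.0, scale=np.sqrt(2.0)))  # both ~ 0.2821
```

Here the closed form is known ($f_G$ is the $N(0, 2)$ density), which provides the check on the final line.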
THEOREM 10.0.3 Let $\Gamma \subset C_0(\Theta)$. Then the relation
$$f_G(x) = f_H(x), \; x \in \mathscr{X} \;\Rightarrow\; G(\theta) = H(\theta), \; \theta \in \Theta \tag{10}$$
holds if and only if $\Gamma$ generates $C_0(\Theta)$ under the supremum norm defined in (9).

Proof Suppose (10) holds. Let $B$ be the closed subspace of $C_0(\Theta)$ generated by $\Gamma$. Suppose there exists $g \in C_0(\Theta) - B$. By the Hahn-Banach theorem, there exists a bounded linear functional $\psi$ on $C_0(\Theta)$ such that
$$\psi(g) = 1, \qquad \psi(h) = 0, \quad h \in B. \tag{11}$$
But, by the Riesz representation theorem, there exist nondecreasing nonnegative functions $K_1$ and $K_2$ of bounded variation on $\Theta$ such that
$$\psi(f) = \int_\Theta f(\theta)\, d(K_1 - K_2)(\theta), \qquad f \in C_0(\Theta). \tag{12}$$
Hence,
$$\int_\Theta h\, dK_1(\theta) = \int_\Theta h\, dK_2(\theta), \qquad h \in B, \tag{13}$$
and in particular
$$\int_\Theta f(x, \theta)\, dK_1(\theta) = \int_\Theta f(x, \theta)\, dK_2(\theta)$$
for all $x \in \mathscr{X}$. Hence, relation (10) implies that $K_1(\theta) - K_2(\theta)$ is a constant. This shows that $\psi(g) = 0$, contradicting (11). Hence, $\Gamma$ generates $C_0(\Theta)$.

Conversely, suppose that $\Gamma$ generates $C_0(\Theta)$ and
$$f_G(x) = \int_\Theta f(x, \theta)\, dG(\theta) = f_H(x) = \int_\Theta f(x, \theta)\, dH(\theta), \qquad x \in \mathscr{X}. \tag{14}$$
Because $\Gamma$ generates $C_0(\Theta)$, it follows that
$$\int_\Theta g(\theta)\, dG(\theta) = \int_\Theta g(\theta)\, dH(\theta), \qquad g \in C_0(\Theta). \tag{15}$$
Let
$$\psi(g) = \int_\Theta g(\theta)\, dG(\theta). \tag{16}$$
Then $\psi$ is a bounded linear functional on $C_0(\Theta)$ because $G$ is of bounded variation on $\Theta$. From the uniqueness of the Riesz representation, it follows that $G - H$ is a constant. Because $G$ and $H$ are probability distributions, it follows that $G(x) = H(x)$, $x \in \mathbb{R}$.

10.1. ESTIMATION FOR FINITE MIXTURES

In this section we consider the problem of estimation for finite mixtures, assuming that the problem is identifiable.
Estimation of the Mixing Proportion in a Mixture of Two Distributions

Let
$$\zeta = \{H_\theta(x) : -\tfrac{1}{2} < \theta < \tfrac{1}{2}, \; H_\theta(x) = (\tfrac{1}{2} + \theta)F(x) + (\tfrac{1}{2} - \theta)G(x)\} \tag{1}$$
where $F$ and $G$ are known univariate distribution functions. Identifiability is obvious here. Suppose $X_1, \ldots, X_n$ are i.i.d. with distribution function $H_\theta(x)$. Note that
$$\theta = \{H_\theta(x) - \tfrac{1}{2}[F(x) + G(x)]\}/[F(x) - G(x)], \tag{2}$$
and hence one can obtain a family of estimators for $\theta$ by estimating the distribution function $H_\theta(x)$ by any of the methods described in Chapter 9. In particular, one can take
$$\hat{\theta}_n(x) = \{H_n(x) - \tfrac{1}{2}[F(x) + G(x)]\}/[F(x) - G(x)] \tag{3}$$
as an estimator of $\theta$, for any fixed $x$ with $F(x) \neq G(x)$, where $H_n(x)$ is the empirical distribution function. These estimators are unbiased, strongly consistent, and have variance of order $O(1/n)$, but they might take values outside the interval $(-\tfrac{1}{2}, \tfrac{1}{2})$. Boes (1966) studied conditions for the efficient estimation of $\theta$.
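To make (3) concrete, the following sketch computes $\hat{\theta}_n(x)$ from simulated data; taking $F$ and $G$ to be the $N(0, 1)$ and $N(2, 1)$ distribution functions is an assumption of the example, not of the text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0, n = 0.2, 2000
F = lambda x: norm.cdf(x)              # known cdf F (assumed N(0, 1))
G = lambda x: norm.cdf(x, loc=2.0)     # known cdf G (assumed N(2, 1))

# Draw X_1, ..., X_n from H_theta = (1/2 + theta) F + (1/2 - theta) G.
from_F = rng.random(n) < 0.5 + theta0
X = np.where(from_F, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

def theta_hat(x):
    # estimator (3): plug the empirical cdf H_n into relation (2)
    Hn = np.mean(X <= x)
    return (Hn - 0.5 * (F(x) + G(x))) / (F(x) - G(x))

print(theta_hat(1.0))  # near theta0 = 0.2; can fall outside (-1/2, 1/2)
```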
Estimation for Finite Mixtures (When the Number of Mixing Distributions Is Known)

Let $\{F(x, \theta_j), 1 \le j \le m\}$ be $m$ distribution functions, where the $\theta_j$, $1 \le j \le m$, are known and $m$ is known, and let
$$\zeta = \Big\{ H_G(x) : H_G(x) = \sum_{j=1}^m g_j F(x, \theta_j) \Big\} \tag{4}$$
where $G$ is any discrete distribution with support $\{\theta_1, \ldots, \theta_m\}$ and $P_G(\theta_j) = g_j$, $1 \le j \le m$. The estimator $G_{(n)}$ is obtained by minimizing
$$C_n(G) = \int_{-\infty}^{\infty} [H_G(x) - H_n(x)]^2\, dH_n(x) \tag{5}$$
subject to the constraints $\sum_{j=1}^m g_j = 1$ and $g_j \ge 0$, $1 \le j \le m$, where $H_n$ is the empirical distribution function of the sample $X_1, \ldots, X_n$. This is a problem in quadratic programming. Choi and Bulgren (1968) studied computational aspects of this programming problem.
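A minimal sketch of this quadratic program follows. The normal components, the simulated data, and scipy's SLSQP solver are illustrative assumptions; this is not the algorithm of Choi and Bulgren (1968), only a modern stand-in for the same minimization. The criterion is $C_n(G)$ of (5), using $H_n(X_{(i)}) = i/n$ at the order statistics.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
thetas = np.array([-2.0, 0.0, 3.0])    # known support points theta_j
g_true = np.array([0.3, 0.5, 0.2])     # true weights, to be recovered

n = 1000
labels = rng.choice(len(thetas), size=n, p=g_true)
X = np.sort(rng.normal(thetas[labels], 1.0))
Fmat = norm.cdf(X[:, None] - thetas[None, :])   # F(X_(i), theta_j)
target = np.arange(1, n + 1) / n                # H_n(X_(i)) = i/n

def C_n(g):
    # criterion (5) evaluated at the order statistics
    return np.mean((Fmat @ g - target) ** 2)

res = minimize(C_n, x0=np.full(3, 1.0 / 3.0), method="SLSQP",
               bounds=[(0.0, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda g: g.sum() - 1.0})
print(res.x)  # approximately (0.3, 0.5, 0.2)
```

Because $C_n$ is a convex quadratic in $g$ and the feasible set is the probability simplex, any local solution returned by the solver is global.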
THEOREM 10.1.1 Suppose $F(x, \theta)$ is continuous in $\theta$ for each $x$ and continuous in $x$ for each $\theta$. Then
$$G_{(n)} \xrightarrow{\mathrm{a.s.}} G_0 \qquad \text{as } n \to \infty, \tag{6}$$
where $G_0$ is the true mixing distribution. Furthermore, if the matrix
$$J = ((E_{G_0}\{F(X, \theta_i) F(X, \theta_j)\}))_{m \times m} \tag{7}$$
is nonsingular, then
$$n^{1/2} \Sigma_n^{-1/2} (G_{(n)} - G_0) \xrightarrow{\mathscr{L}} N(0, I), \tag{8}$$
where
$$\Sigma_n = J^{-1} \Gamma_n J^{-1}$$
and $\Gamma_n = ((\sigma_{ij}(n)))$, with $\sigma_{ij}(n)$ the $(i, j)$th component of the covariance matrix of $\dot{C}_n(G_0)$, the derivative of $C_n(G)$ evaluated at $G_0$. (Here $G_{(n)} - G_0$ is understood as the vector of differences of the weights $g_j$.) In addition,
$$\|G_{(n)} - G_0\| = O((n^{-1} \log \log n)^{1/2}) \qquad \text{a.s.}, \tag{9}$$
where $\|\cdot\|$ is the supremum norm.

Proof
Note that
$$C_n(G_{(n)}) \le C_n(G_0) \le \|H_{G_0} - H_n\|^2 \tag{10}$$
and the last term tends to zero a.s. by the Glivenko-Cantelli theorem. Furthermore,
$$\|H_{G_{(n)}} - H_{G_0}\| \le \|H_{G_{(n)}} - H_n\| + \|H_{G_0} - H_n\| \tag{11}$$
and
$$\|H_{G_{(n)}} - H_n\|^3 \le 3 \int_{-\infty}^{\infty} [H_{G_{(n)}}(x) - H_n(x)]^2\, dH_n(x) = 3\, C_n(G_{(n)}) \tag{12}$$
by Problem 1, and $C_n(G_{(n)})$ tends to zero a.s. by the earlier observation. Hence,
$$\|H_{G_{(n)}} - H_n\| \xrightarrow{\mathrm{a.s.}} 0 \qquad \text{as } n \to \infty. \tag{13}$$
Therefore,
$$\|H_{G_{(n)}} - H_{G_0}\| \to 0 \quad \text{a.s.} \qquad \text{as } n \to \infty. \tag{14}$$
In view of the discrete nature of the distribution functions $G_{(n)}$ and $G_0$, it is easy to see that
$$G_{(n)} \xrightarrow{\mathrm{a.s.}} G_0 \qquad \text{as } n \to \infty. \tag{15}$$
Note that
$$0 = \dot{C}_n(G_{(n)}) = \dot{C}_n(G_0) + \ddot{C}_n(\tilde{G})(G_{(n)} - G_0) \tag{16}$$
where $\tilde{G}$ is a convex mixture of $G_0$ and $G_{(n)}$. Furthermore,
$$\ddot{C}_n(G) = \Big(\Big(2n^{-1} \sum_{t=1}^n F_i(X_t) F_j(X_t)\Big)\Big) \tag{17}$$
for all distribution functions $G$, where $F_i(X) = F(X, \theta_i)$. Hence,
$$\ddot{C}_n(\tilde{G}) \xrightarrow{\mathrm{a.s.}} ((2 E_{G_0}\{F_i(X) F_j(X)\})) = 2J \tag{18}$$
by the strong law of large numbers, and $J$ is nonsingular by hypothesis. Hence, the asymptotic normality will be proved if
$$n^{1/2} \Sigma_n^{-1/2} (G_{(n)} - G_0) = -n^{1/2} \Sigma_n^{-1/2}\, \ddot{C}_n(\tilde{G})^{-1} \dot{C}_n(G_0) \tag{19}$$
is asymptotically normal. Observe that, if $X_{(t)}$ is the $t$th order statistic, then the $i$th component of $\dot{C}_n(G_0)$ is
$$2n^{-1} \sum_{t=1}^n F_i(X_{(t)}) \Big[ \sum_{j=1}^m g_j^0 F_j(X_{(t)}) - \frac{t}{n} \Big], \tag{20}$$
where the $g_j^0$ are the weights of $G_0$, and $n^{1/2} \dot{C}_n(G_0)$ is asymptotically multivariate normal because every linear combination of the components of $\dot{C}_n(G_0)$ essentially involves sums of i.i.d. random variables with finite variance. We leave the details to the reader [see Choi and Bulgren (1968)]. Because $\ddot{C}_n(G)$ is uniformly bounded in $G$, the rate of convergence of $G_{(n)} - G_0$ to zero is the same as that of $\dot{C}_n(G_0)$ by Eq. (16). Furthermore,
$$\Big| 2n^{-1} \sum_{t=1}^n F_i(X_{(t)}) \Big[ H_{G_0}(X_{(t)}) - \frac{t}{n} \Big] \Big| \le 2 \sup_x |H_{G_0}(x) - H_n(x)| \cdot n^{-1} \sum_{t=1}^n F_i(X_t), \tag{21}$$
since $H_n(X_{(t)}) = t/n$ and $H_{G_0}(x) = \sum_{j=1}^m g_j^0 F_j(x)$.
It follows from results in Kiefer (1961) that
$$n^{1/2} \sup_x |H_{G_0}(x) - H_n(x)| = O((\log \log n)^{1/2}) \qquad \text{a.s.} \tag{22}$$
and
$$n^{-1} \sum_{t=1}^n F_i(X_t) \to E[F_i(X)] \quad \text{a.s.} \qquad \text{as } n \to \infty. \tag{23}$$
Therefore,
$$n^{1/2} \|\dot{C}_n(G_0)\| = O((\log \log n)^{1/2}) \qquad \text{a.s.} \tag{24}$$
because $\sup_{1 \le i \le m} E[F_i(X)] < \infty$. This in turn proves that
$$n^{1/2} \|G_{(n)} - G_0\| = O((\log \log n)^{1/2}) \qquad \text{a.s.} \tag{25}$$
by (16) and the fact that $J$ is nonsingular.

We have discussed the estimation problem in the case when the component distributions are known completely. Suppose instead that the distributions $F(x, \theta_j)$, $1 \le j \le m$, are known except for the parameters $\theta_j$. One can again obtain estimators of the $g_j$ and the $\theta_j$ by minimizing $C_n(G)$ subject to the constraints
$$\sum_{j=1}^m g_j = 1, \qquad g_j \ge 0, \quad 1 \le j \le m.$$
This problem is studied in Choi (1969a). We do not discuss the results here, as they are similar to those studied above. An algorithm for constructing a consistent estimator of the mixing distribution when the number of components is finite but unknown is given in Yakowitz (1969).

10.2. ESTIMATION FOR ARBITRARY MIXTURES

Let $f$ be a nonnegative Borel-measurable function from $\mathscr{X} \times \Theta$ to $(0, \infty)$ such that
$$\int_{\mathscr{X}} f(x, \theta)\, dx = 1, \qquad \theta \in \Theta. \tag{1}$$
Let $G$ be a probability distribution on $\Theta$ and define
$$f_G(x) = \int_\Theta f(x, \theta)\, dG(\theta), \qquad x \in \mathscr{X}. \tag{2}$$
Theorem 10.0.3 gives conditions for identifiability; we shall suppose that the problem is identifiable. Let us first consider the case when $\mathscr{X} = \Theta = \mathbb{R}$, the real line. Let
$$-\infty = \theta_{n,-1} < \theta_{n,0} = -n < \theta_{n,1} < \cdots < \theta_{n,m_n} = n < \theta_{n,m_n+1} = \infty \tag{3}$$
be a partition of $(-\infty, \infty)$ such that the norm of the partition over $[-n, n]$ satisfies
$$\delta_n = \max_{1 \le j \le m_n} (\theta_{n,j} - \theta_{n,j-1}) \to 0 \qquad \text{as } n \to \infty. \tag{4}$$
For any fixed $x$ and $-1 \le \ell \le m_n$, let
$$M_{n,\ell}(x) = \sup_{\theta_{n,\ell} \le \theta \le \theta_{n,\ell+1}} f(x, \theta), \qquad m_{n,\ell}(x) = \inf_{\theta_{n,\ell} \le \theta \le \theta_{n,\ell+1}} f(x, \theta), \tag{5}$$
and let $p_{n,\ell}$, $-1 \le \ell \le m_n$, be such that

(i) $p_{n,\ell} \ge 0$ and $\sum_{\ell=-1}^{m_n} p_{n,\ell} = 1$; (6a)

(ii) $\sum_{\ell=-1}^{m_n} p_{n,\ell} M_{n,\ell}(x) \ge f_G(x)$; (6b)

(iii) $\sum_{\ell=-1}^{m_n} p_{n,\ell} m_{n,\ell}(x) \le f_G(x)$ (6c)

for $x = \theta_{n,j}$, $0 \le j \le m_n$. A solution of (6) can be obtained by linear programming techniques. It can also be seen that the solution space of (6) is nonempty, for
$$p_{n,\ell} = \int_{\theta_{n,\ell}}^{\theta_{n,\ell+1}} dG(\theta), \qquad -1 \le \ell \le m_n, \tag{7}$$
is a solution of the problem. Define
$$G_n(y) = \begin{cases} 0 & \text{for } y < \theta_{n,0}, \\ p_{n,-1} + p_{n,0} & \text{for } \theta_{n,0} \le y < \theta_{n,1}, \\ p_{n,-1} + p_{n,0} + \cdots + p_{n,\ell} & \text{for } \theta_{n,\ell} \le y < \theta_{n,\ell+1}, \quad \ell = 1, \ldots, m_n. \end{cases} \tag{8}$$
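The following sketch casts (6a)-(6c) as a linear program and solves it with scipy. The normal location family, the known target $f_G$ with $G = N(0, 1)$, the sub-grid approximation of $M_{n,\ell}$ and $m_{n,\ell}$, and the truncation of the two unbounded end cells at $\pm B$ are all simplifying assumptions of the example.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog

B, n_cells = 4.0, 40
edges = np.linspace(-B, B, n_cells + 1)     # theta-partition of [-B, B]
xs = edges[:-1]                             # grid points x = theta_{n,j}
fG = norm.pdf(xs, scale=np.sqrt(2.0))       # known f_G for G = N(0, 1)

M = np.empty((len(xs), n_cells))            # M[j, l] ~ sup of f(x_j, .) on cell l
m = np.empty((len(xs), n_cells))            # m[j, l] ~ inf of f(x_j, .) on cell l
for l in range(n_cells):
    sub = np.linspace(edges[l], edges[l + 1], 20)
    vals = norm.pdf(xs[:, None] - sub[None, :])
    M[:, l], m[:, l] = vals.max(axis=1), vals.min(axis=1)

# (6b): M p >= f_G  <=>  -M p <= -f_G;  (6c): m p <= f_G;  (6a): p >= 0, sum p = 1.
res = linprog(c=np.zeros(n_cells),
              A_ub=np.vstack([-M, m]), b_ub=np.concatenate([-fG, fG]),
              A_eq=np.ones((1, n_cells)), b_eq=[1.0],
              bounds=[(0.0, 1.0)] * n_cells)
G_n = np.cumsum(res.x)   # the step function (8); any feasible p is admissible
print(res.status, G_n)   # status 0 means a feasible point was found
```

Note that the objective vector is zero: any feasible point is acceptable, since the consistency established in Theorem 10.2.1 below comes from refining the partition rather than from optimizing a criterion.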
THEOREM 10.2.1 Let $\Gamma = \{f(x, \cdot), x \in \mathbb{R}\}$. Suppose $\Gamma \subset C_0(\mathbb{R})$ and the following conditions hold.

CONDITION 1
(i) $\lim_{y \to x} \sup_\theta |f(x, \theta) - f(y, \theta)| = 0$, $x \in \mathbb{R}$; (9)
(ii) for every $x$ and any $\varepsilon > 0$, there exist $\delta > 0$, $\delta' > 0$ such that
$$|x - y| < \delta \quad \text{and} \quad |\theta' - \theta| < \delta' \;\Rightarrow\; |f(y, \theta') - f(y, \theta)| < \varepsilon. \tag{10}$$

Then, for all $x \in \mathbb{R}$,
$$\int_{-\infty}^{\infty} f(x, \theta)\, dG_n(\theta) = f_{G_n}(x) \to f_G(x) = \int_{-\infty}^{\infty} f(x, \theta)\, dG(\theta) \tag{11}$$
and
$$G_n \xrightarrow{w} G \qquad \text{as } n \to \infty. \tag{12}$$
Proof Choose $n$ large so that $|x| < n$. Clearly there exists a sequence $j_n \to \infty$ such that
$$\theta_{n,j_n} \to x \qquad \text{as } n \to \infty. \tag{13}$$
Because $f(x, \cdot) \in C_0(\mathbb{R})$ and Condition 1(i) holds, for each $\varepsilon > 0$ there exist $\delta > 0$ and $M > 0$ such that
$$|y - x| < \delta, \quad |\theta| > M \;\Rightarrow\; f(y, \theta) < \varepsilon. \tag{14}$$
Observe that
$$\sum_{\ell=-1}^{m_n} p_{n,\ell}\, m_{n,\ell}(\theta_{n,j_n}) \le f_G(\theta_{n,j_n}) \le \sum_{\ell=-1}^{m_n} p_{n,\ell}\, M_{n,\ell}(\theta_{n,j_n}) \tag{15}$$
and hence,
$$0 \le \sum_{\ell=-1}^{m_n} p_{n,\ell} [M_{n,\ell}(\theta_{n,j_n}) - m_{n,\ell}(\theta_{n,j_n})] \le \sup\{|f(x', \theta) - f(x', \theta')| : |x - x'| < \delta_n, |\theta - \theta'| < \delta_n\} + \sup\{f(x', \theta) : |x' - x| < \delta_n, |\theta| > n\}, \tag{16}$$
and the right-hand side tends to zero by Condition 1 and relations (13) and (14) as $n \to \infty$. However,
$$\sum_{\ell=-1}^{m_n} p_{n,\ell}\, m_{n,\ell}(\theta_{n,j_n}) \le \int_{-\infty}^{\infty} f(\theta_{n,j_n}, \theta)\, dG_n(\theta) \le \sum_{\ell=-1}^{m_n} p_{n,\ell}\, M_{n,\ell}(\theta_{n,j_n}),$$
and hence, by the earlier remarks,
$$\int_{-\infty}^{\infty} f(\theta_{n,j_n}, \theta)\, dG_n(\theta) - f_G(\theta_{n,j_n}) \to 0 \qquad \text{as } n \to \infty. \tag{17}$$
But because $\theta_{n,j_n} \to x$ and Condition 1(i) holds,
$$f_G(\theta_{n,j_n}) \to f_G(x) \qquad \text{as } n \to \infty. \tag{18}$$
Furthermore,
$$\int_{-\infty}^{\infty} f(\theta_{n,j_n}, \theta)\, dG_n(\theta) - \int_{-\infty}^{\infty} f(x, \theta)\, dG_n(\theta) \to 0 \qquad \text{as } n \to \infty$$
by Condition 1(i). Hence,
$$\int_{-\infty}^{\infty} f(x, \theta)\, dG_n(\theta) \to f_G(x) \qquad \text{as } n \to \infty, \quad x \in \mathbb{R}. \tag{19}$$
Because $\Gamma \subset C_0(\mathbb{R})$ and the identifiability condition holds by assumption, $\Gamma$ generates $C_0(\mathbb{R})$ by Theorem 10.0.3. Hence, (19) holds for all $g \in C_0(\mathbb{R})$, i.e.,
$$\int_{-\infty}^{\infty} g(\theta)\, dG_n(\theta) \to \int_{-\infty}^{\infty} g(\theta)\, dG(\theta) \qquad \text{as } n \to \infty \tag{20}$$
for all $g \in C_0(\mathbb{R})$, i.e., $G_n \xrightarrow{w} G$ as $n \to \infty$.
The above theorem can be extended to the case when $\mathscr{X}$ and $\Theta$ are finite intervals on the real line [see Blum and Susarla (1977a)].

Suppose $f_G$ is unknown in the above framework. Let $\hat{f}_G$ be an estimator of $f_G(x)$ such that
$$\|\hat{f}_G(\cdot) - f_G(\cdot)\| \to 0 \quad \text{a.s.} \qquad \text{as } n \to \infty, \tag{21}$$
and solve the linear programming problem (6a)-(6c) with $f_G$ replaced by $\hat{f}_G - \varepsilon_n$ and $\hat{f}_G + \varepsilon_n$ in (6b) and (6c), respectively. Here $\varepsilon_n$ is the smallest positive number for which the linear programming problem has solutions. Let $\hat{G}_n$ be the corresponding estimator given by (8). Blum and Susarla (1977a) proved that
$$\hat{G}_n \xrightarrow{w} G \quad \text{a.s.} \qquad \text{as } n \to \infty.$$
They have also obtained the rate of convergence of $f_{\hat{G}_n}$ to $f_G$. The estimator $\hat{f}_G$ discussed above can be obtained, for instance, by the kernel-type method discussed in Section 2.1. Other methods of estimation are due to Choi (1969a), Rolph (1968), and Meeden (1972). Choi extends the method of minimum distance described in the previous section to the arbitrary case. Rolph (1968) and Meeden (1972) study Bayes methods of estimation of the mixing distribution. For results on estimation in metric spaces, see Fisher and Yakowitz (1970) and Chandra (1977).
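A kernel-type estimator $\hat{f}_G$ of the kind referred to above can be sketched as follows; the mixture sampled, the normal kernel, and the bandwidth $h_n = n^{-1/4}$ are illustrative assumptions (compare Problem 1 of Section 10.2).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
# X_i ~ f_G with f(x, theta) = phi(x - theta) and G = N(0, 1), so f_G = N(0, 2).
X = rng.normal(rng.normal(0.0, 1.0, n), 1.0)
h_n = n ** (-0.25)                     # bandwidth (an illustrative choice)

def fG_hat(x):
    # kernel estimator (n h_n)^(-1) sum_i K((x - X_i)/h_n), K standard normal
    return norm.pdf((x - X) / h_n).sum() / (n * h_n)

print(fG_hat(0.0), norm.pdf(0.0, scale=np.sqrt(2.0)))  # both ~ 0.2821
```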
BIBLIOGRAPHICAL NOTES

The basic results on identifiability of finite mixtures are due to Teicher (1961, 1963). Our presentation in Section 10.0 follows Chandra (1977). Theorems 10.0.1 and 10.0.2 are from Chandra (1977). Theorem 10.0.3 is due to Blum and Susarla (1977b). Estimation of the mixing proportion in a mixture of two distributions is discussed in Boes (1966). Theorem 10.1.1 of Section 10.1 is due to Choi and Bulgren (1968). Choi (1969a) studied extensions of these results. The material presented in Section 10.2 is due to Blum and Susarla (1977a).
PROBLEMS

SECTION 10.0

1. Let $X$ be a real-valued random variable with distribution function $F$. Let $\Phi$ be the translation-parameter family induced by $F$. If the characteristic function of $X$ does not vanish in some nondegenerate interval, show that the class $\Lambda$ of arbitrary mixing distributions is identifiable relative to $\Phi$. (Teicher, 1961)
2. Show that the class of all finite mixing distributions relative to the univariate gamma distributions is identifiable. (Teicher, 1963)

3. For a given positive integer $n$, let $\Phi$ be the family of measures corresponding to products of $n$ independent exponential densities. Show that the class of all finite mixing distributions is identifiable relative to $\Phi$. (Yakowitz and Spragins, 1968)

4. Show that the class of all mixtures of $n$-dimensional normal distributions is identifiable for $n > 1$, but that the class of all mixtures of one-dimensional normal distributions is not identifiable. (Teicher, 1960, 1967; Barndorff-Nielsen, 1965)

5. Give an example to show that the class of mixing distributions can be identifiable relative to a family of multivariate distributions without being identifiable relative to any of the corresponding families of marginal distributions. (Rennie, 1972)

SECTION 10.1

1. Show that, if $H$ is any distribution function, $K$ is a continuous distribution function, and $K_n$ is the empirical distribution function from $K$ based on i.i.d. $X_j$, $1 \le j \le n$, then
$$\sup_x |H(x) - K_n(x)|^3 \le 3 \int_{-\infty}^{\infty} \{H(x) - K_n(x)\}^2\, dK_n(x).$$
(Choi and Bulgren, 1968)

2. (Maximum likelihood estimation for mixtures) Let $f_1, f_2, \ldots$ be a sequence of density functions with support $R^+ = [0, \infty)$ such that $\sum_{k=1}^\infty p_k f_k(t)$ is identifiable and is finite for each $t > 0$ and $p \in \mathscr{P}_\infty$, where
$$\mathscr{P}_\infty = \Big\{ p = (p_1, p_2, \ldots) : p_k \ge 0, \; \sum_k p_k = 1 \Big\}.$$
Let $g(t; p) = \sum_{k=1}^\infty p_k f_k(t)$. Suppose $T_1, \ldots, T_N$ is a random sample with density $g$. Suppose the following conditions hold:

(i) $f_k(t) \to 0$ as $k \to \infty$, except possibly on a set of Lebesgue measure zero in $(0, \infty)$;

(ii) for any positive integers $n$, $k_1 < k_2 < \cdots < k_n$, there exists a subset $N'$ of $R_+^n$ with Lebesgue measure zero such that for each $t = (t_1, \ldots, t_n) \in R_+^n - N'$, the matrix $((f_{k_i}(t_j)))_{n \times n}$ is nonsingular. The set $N'$ may depend on $n, k_1, \ldots, k_n$.

Prove that the maximum likelihood estimator of $p$ exists and is unique with probability one under conditions (i) and (ii). (Hill et al., 1980)

SECTION 10.2

1. Suppose $f(x, \theta) = h(x - \theta)$, $-\infty < \theta, x < \infty$, where $h$ is a density function. Show that $G_n$ defined by Eq. (8) converges weakly to $G$ provided $h$ is a continuous density with $h(x) \to 0$ as $|x| \to \infty$. In addition, suppose $h$ is differentiable with $\sup\{|h^{(1)}(t)|, t \in \mathbb{R}\} < \infty$. Consider the estimator
$$\hat{f}_{G_n}(x) = \frac{1}{n h_n} \sum_{i=1}^n K\Big( \frac{x - X_i}{h_n} \Big),$$
where $X_i$, $1 \le i \le n$, are i.i.d. with density $f_G(x)$, $K$ is the standard normal density, and $h_n = n^{-1/4}$. Show that $\hat{G}_n \xrightarrow{w} G$ a.s. provided that the sequence $\{\varepsilon_n\}$ defining $\hat{G}_n$ is $n^{-c}$ with $0 < c < \tfrac{1}{4}$. (Blum and Susarla, 1977a)
2. Suppose that
$$H_G(x) = \int_\Theta F(x, \theta)\, dG(\theta),$$
where $F(x, \theta)$ is uniformly continuous in $(x, \theta)$ and the limits
$$F(x, \infty) = \lim_{\theta \to \infty} F(x, \theta), \qquad F(x, -\infty) = \lim_{\theta \to -\infty} F(x, \theta)$$
exist for every $x$, with neither $F(\cdot, \infty)$ nor $F(\cdot, -\infty)$ being a distribution function. Here $F(x, \theta)$ is a distribution function on an open subset of $\mathbb{R}$ for every $\theta \in \Theta$. Let $G_n$ be any discrete $n$-point distribution that minimizes
$$C(G) = \int_{-\infty}^{\infty} [H_G(x) - H_n(x)]^2\, dH_n(x),$$
where $H_n$ is the empirical distribution function based on an i.i.d. sample from $H_G$. Show that $G_n \to G$ a.s. as $n \to \infty$. (Choi and Bulgren, 1968)
3. Let $\zeta$ be a class of distributions on a compact set $\Theta \subset \mathbb{R}$ and let $F(x, \theta)$ be a distribution function on a closed subset $\mathscr{X} \subset \mathbb{R}$ for every $\theta \in \Theta$. Define
$$F_G(x) = \int_\Theta F(x, \theta)\, dG(\theta).$$
For each $n \ge 1$, let $\zeta_n$ be the class of discrete distributions on $\Theta$ with weights at $\theta_{1n}, \ldots, \theta_{nn}$, where the $\theta_{in}$ are chosen so that for any $G \in \zeta$ there exists a sequence $\{G_n\}$ in $\zeta_n$ converging weakly to $G$. Suppose $F(x, \theta)$ is continuous on $\mathscr{X} \times \Theta$ and the problem is identifiable, i.e., $F_G = F_H$ for $G, H \in \zeta$ implies $G = H$. Let $G_n^* \in \zeta_n$ be such that
$$\|F_{G_n^*} - F_n\| = \inf_{H \in \zeta_n} \|F_H - F_n\|,$$
where $\|F - G\| = \sup_x |F(x) - G(x)|$ and $F_n$ is the empirical distribution function. Show that $G_n^* \to G$ a.s. as $n \to \infty$. (Deely and Kruse, 1968)