Asymptotic theory for maximum likelihood in nonparametric mixture models


Computational Statistics & Data Analysis 41 (2003) 453 – 464 www.elsevier.com/locate/csda

Sara van de Geer
Mathematical Institute, University of Leiden, P.O. Box 9512, 2300 RA Leiden, The Netherlands
E-mail address: [email protected]
Received 1 March 2002; received in revised form 1 March 2002

Abstract

An overview of asymptotic results is presented for the maximum likelihood estimator in mixture models. The mixing distribution is assumed to be completely unknown, so that the model considered is nonparametric. Conditions for consistency, rates of convergence and asymptotic efficiency are provided. Examples include convolution models, and the case of piecewise monotone densities.

MSC: 62-02; 62G2

Keywords: Asymptotic efficiency; Entropy; Maximum likelihood; Mixture model; Rates of convergence

1. Introduction

Let $X$ be a random variable with values in some space $\mathcal{X}$ and with distribution $P_0$. Let $p_0 = dP_0/d\mu$ be the density with respect to a $\sigma$-finite dominating measure $\mu$. We observe $n$ i.i.d. copies $X_1, \ldots, X_n$ of $X$, and want to estimate the density $p_0$ or functions thereof. We consider the mixture model
\[
p_0(x) = \int k(x|y)\, dF_0(y), \qquad x \in \mathcal{X},
\]
where $y \in \mathcal{Y}$, with $\mathcal{Y}$ a measurable space, and where, for all $y \in \mathcal{Y}$, the kernel $k(\cdot|y)$ is a given density with respect to $\mu$. The mixing distribution $F_0$ is a probability measure


on $\mathcal{Y}$. We will often write $p_0 = p_{F_0}$ to express the dependence on $F_0$. This is a slight abuse of notation, but confusion is not likely. We suppose that $F_0$ is unknown, and let $\Lambda$ be the space of all probability measures on $\mathcal{Y}$. The class of possible densities is thus
\[
\mathcal{P} = \left\{ p_F = \int k(\cdot|y)\, dF(y):\ F \in \Lambda \right\}.
\]
This includes the case of finite mixtures (where $\mathcal{Y}$ is finite), but our main interest will be the case of infinite mixtures. Thus, our model is a nonparametric one.

Example 1.1 (The convolution model). Suppose $X = Y + Z$, with $Y$ and $Z$ independent, unobservable real-valued random variables, and with $Z$ having a known density $k$. The distribution $F_0$ of $Y$ is unknown. Then the density $p_0$ is the convolution of $k$ with $F_0$, i.e., $k(x|y) = k(x-y)$ and
\[
p_0(x) = \int k(x-y)\, dF_0(y), \qquad x \in \mathcal{X}.
\]

Example 1.2 (Monotone densities). Suppose it is known that $X$ has a continuous, bounded, increasing density $p_0(x)$, say with respect to Lebesgue measure on $(0,1)$ (see, e.g., Groeneboom, 1985). Then we may write
\[
p_0(x) = \int 1\{y \le x\}\, dp_0(y) = \int \frac{1\{y \le x\}}{1-y}\, dF_0(y),
\]
where $dF_0(y) = (1-y)\, dp_0(y)$. So this is a mixture model with $k(x|y) = 1\{y \le x\}/(1-y)$ (a numerical check of this representation is sketched at the end of this section).

Remark. In fact, if it is known that $p_0$ is a member of a given convex class of densities, then the model can be regarded as a mixture model. For example, $\mathcal{P}$ could be the class of all densities in a Sobolev or Besov space, or the class of all concave densities, etc.

This paper gathers some results in the literature. We will present an overview of the asymptotic theory (as $n$ tends to infinity) for the maximum likelihood estimator in the nonparametric mixture model. In Section 2, one can find the definition of the maximum likelihood estimator, and conditions for consistency of this estimator. Section 3 considers the rates of convergence, and Section 4 the asymptotic efficiency. Section 5 concludes and discusses extensions.
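The following is a minimal numerical check of the mixture representation in Example 1.2. The increasing density $p_0(x) = 2x$ on $(0,1)$ and the grid are illustrative choices of ours, not taken from the text; the sketch only verifies on a grid that $\int 1\{y \le x\}/(1-y)\,dF_0(y)$ with $dF_0(y) = (1-y)\,dp_0(y)$ reproduces $p_0$.

```python
import numpy as np

# Illustrative increasing density on (0, 1): p0(x) = 2x (our choice, not from the paper).
# Example 1.2 writes p0(x) = \int 1{y <= x}/(1 - y) dF0(y) with dF0(y) = (1 - y) dp0(y);
# here dp0(y) = p0'(y) dy = 2 dy, so dF0(y) = 2 (1 - y) dy.
y = np.linspace(0.0, 1.0, 10_001)[:-1]   # grid on [0, 1)
dy = y[1] - y[0]
dF0 = 2.0 * (1.0 - y) * dy               # mass of F0 on each grid cell

print("total mass of F0:", round(dF0.sum(), 4))   # ~ 1, so F0 is a probability measure

for x in (0.25, 0.5, 0.9):
    kernel = (y <= x) / (1.0 - y)        # k(x|y) = 1{y <= x}/(1 - y)
    p_mix = np.sum(kernel * dF0)         # Riemann sum for the mixture density at x
    print(f"x = {x}: mixture ~ {p_mix:.4f}, p0(x) = {2 * x:.4f}")
```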

2. Maximum likelihood

The maximum likelihood estimator $\hat F_n$ of $F_0$ is defined as the maximizer of
\[
\sum_{i=1}^n \log p_F(X_i)
\]
over all $F \in \Lambda$. We will assume that the estimator exists (but do not require it to be unique). More generally, we could consider $\varepsilon_n$-maximum likelihood estimation, where


the likelihood is maximized up to a factor $\varepsilon_n$. If $\varepsilon_n$ tends to zero fast enough, this does not bring in additional problems for deriving the asymptotics. Therefore, to facilitate the exposition, we take $\varepsilon_n = 0$. The maximum likelihood estimator of $p_0$ is denoted by $\hat p_n = p_{\hat F_n}$.

In this section, we will study consistency of the maximum likelihood estimator $\hat p_n$ in Hellinger distance. The squared Hellinger distance between a density $p$ and $p_0$ is defined as
\[
h^2(p, p_0) = \frac{1}{2} \int (\sqrt{p} - \sqrt{p_0})^2\, d\mu.
\]
This distance does not depend on the dominating measure, and moreover, it clearly also does not depend on a particular parametrization. It is a convenient global metric for measuring the performance of the maximum likelihood estimator in general. A discussion on the choice of a metric can be found in Birgé (1986).

Theorem 2.2 can be found in van de Geer (1993, 2000). It directly links consistency to uniform laws of large numbers. Necessary and sufficient conditions for uniform laws of large numbers can be formulated in terms of empirical entropies (see the literature on empirical processes, e.g., Pollard (1984), or van der Vaart and Wellner (1996)). For this reason, we need to introduce the concept of entropy.

Definition 2.1. Let $(T, d)$ be a (subset of a) metric space. For $\delta > 0$, the $\delta$-covering number $N(\delta, T, d)$ is defined as the number of balls with radius $\delta$ necessary to cover $T$. The $\delta$-entropy is $H(\delta, T, d) = \log N(\delta, T, d)$.

It is not within the scope of this paper to discuss entropy numbers in great detail. An example is given in Theorem 3.1.

Let $P_n = \sum_{i=1}^n \delta_{X_i}/n$ be the empirical measure, i.e., $P_n$ puts mass $1/n$ at each observation $X_i$, $1 \le i \le n$. Denote the $L_1(P_n)$-norm of a function $g: \mathcal{X} \to \mathbb{R}$ by $\|g\|_{1,P_n} = \sum_{i=1}^n |g(X_i)|/n$. Consider the space $\mathcal{G}$ defined as
\[
\mathcal{G} = \left\{ \frac{p}{p + p_0}:\ p \in \mathcal{P} \right\},
\]
with $\mathcal{P} = \{p_F:\ F \in \Lambda\}$ the collection of all possible mixtures.

Theorem 2.2. Suppose that for all $\delta > 0$,
\[
\frac{1}{n} H(\delta, \mathcal{G}, \|\cdot\|_{1,P_n}) \to 0 \quad \text{in probability}. \tag{2.1}
\]
Then $h(\hat p_n, p_0) \to 0$, a.s.
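To make the objects in this section concrete, the sketch below approximates the maximum likelihood estimator by restricting $F$ to a fixed grid of support points and iterating the EM update for the mixing weights, and then evaluates the Hellinger distance $h(\hat p_n, p_0)$ by numerical integration. The normal location kernel, the two-point $F_0$, the grid and the sample size are illustrative assumptions, and the grid-plus-EM computation is a standard numerical device rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y):
    """Normal location kernel k(x|y) = N(x; y, 1), i.e. convolution with a standard normal."""
    return np.exp(-0.5 * (x - y) ** 2) / np.sqrt(2.0 * np.pi)

# Illustrative truth: F0 puts mass 1/2 on y = 0 and on y = 3.
y_true, w_true = np.array([0.0, 3.0]), np.array([0.5, 0.5])
n = 2000
labels = rng.choice(2, size=n, p=w_true)
X = y_true[labels] + rng.standard_normal(n)

# Approximate NPMLE: restrict F to a fixed grid of support points and run EM on the weights.
y_grid = np.linspace(-3.0, 6.0, 91)
w = np.full(y_grid.size, 1.0 / y_grid.size)
K = k(X[:, None], y_grid[None, :])            # n x m matrix of kernel values k(X_i | y_j)
for _ in range(500):
    post = K * w                              # unnormalized posterior of Y given X_i
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                     # EM update of the mixing weights

# Squared Hellinger distance h^2(p_hat, p0), by a Riemann sum over an x-grid.
x_grid = np.linspace(-8.0, 11.0, 4000)
dx = x_grid[1] - x_grid[0]
p_hat = k(x_grid[:, None], y_grid[None, :]) @ w
p_0 = k(x_grid[:, None], y_true[None, :]) @ w_true
h2 = 0.5 * np.sum((np.sqrt(p_hat) - np.sqrt(p_0)) ** 2) * dx
print(f"squared Hellinger distance h^2(p_hat, p0) ~ {h2:.4f}")
```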

Condition (2.1) holds for instance in the following general situation. Suppose that $\mathcal{Y}$ is a locally compact Hausdorff space. Let $\Lambda^*$ be the set of all possibly defective probability measures on $\mathcal{Y}$ (that is, the set of all sub-probability measures: measures $F$ with $F(\mathcal{Y}) \le 1$). Let $\rho$ be the metric on $\Lambda^*$ corresponding to the vague topology (see Bauer, 1981). Assume that for $\mu$-almost all $x$, the kernel $k(x|y)$ is continuous as a function of $y$, and that it vanishes at infinity. Then it can be shown that (2.1) is met. Hence, then $h(\hat p_n, p_0) \to 0$ almost surely as $n \to \infty$. The latter result (proved in


a different way) is also in Pfanzagl (1988). Whether or not one also has consistency of $\hat F_n$ depends of course on the identifiability of the mixing distribution $F_0$. Call $F_0$ identifiable if for all $F \in \Lambda^*$, $h(p_F, p_{F_0}) = 0$ implies $\rho(F, F_0) = 0$. If we suppose that $F_0$ is identifiable, then the consistency in Hellinger metric implies that also $\rho(\hat F_n, F_0) \to 0$ almost surely. For instance, when $\mathcal{Y} = \mathbb{R}$, convergence for the vague topology implies pointwise convergence of the distribution functions at continuity points of the limit.

It should be noted, however, that the above continuity condition on the kernel $k$ is not fulfilled in some important cases. For example, in the convolution model (see Example 1.1), $k(x|y) = k(x-y)$ is not continuous as a function of $y \in \mathbb{R}$ as soon as the density $k$ has discontinuities (for example, the uniform distribution on an interval). However, condition (2.1) is of a combinatorial nature, and it certainly does not require continuity. Theorem 2.2 can be applied for example when $\mathcal{P}$ is a class of monotone densities on the real line, or more generally a class of piecewise monotone densities with fixed locations for the modes. Thus, it proves for instance the consistency in Hellinger distance of the maximum likelihood estimator for the case of convolution with the uniform distribution (see van de Geer, 2000).

We stress here that Theorem 2.2 makes use of the convexity of the class of densities $\mathcal{P}$. Thus, if for example the maximum likelihood estimator is restricted to lie in a smaller (nonconvex) class, then (2.1) will no longer guarantee consistency, even when the truth is in the smaller class.

3. Rates of convergence

Rates of convergence in Hellinger distance, for the maximum likelihood estimator $\hat p_n$, can be found in, e.g., van de Geer (1996). Related results, and the case of misspecification, are in Patilea (2001). These results are based on the general idea that it is the metric entropy of the parameter space that determines the rate of convergence (see also Birgé and Massart (1993) and Wong and Shen (1995)).

There is an important theorem for the case of mixtures, obtained by Ball and Pajor (1990). It gives a bound on the entropy of the collection of mixtures with a given kernel. This bound is in terms of the covering number of the kernel. It will play an important role in our considerations, so let us cite this bound here. Let $\mathcal{K}$ be a class of functions on $\mathcal{X}$, and let $\mathcal{F}$ be the convex hull of $\mathcal{K}$, i.e., the class of all convex combinations
\[
\mathcal{F} = \left\{ \sum_j \lambda_j k_j:\ k_j \in \mathcal{K},\ \lambda_j \ge 0,\ \sum_j \lambda_j = 1 \right\}.
\]

Consider some probability measure $Q$ on $\mathcal{X}$, and let $\|\cdot\|_{2,Q}$ be the $L_2(Q)$-norm (that is, $\|g\|_{2,Q}^2 = \int |g|^2\, dQ$).

Theorem 3.1 (Ball and Pajor, 1990). Suppose that for some positive constants $c$ and $d$,
\[
N(\delta, \mathcal{K}, \|\cdot\|_{2,Q}) \le c\,\delta^{-d}, \qquad \delta > 0. \tag{3.1}
\]


Then there exists a constant $A = A(c,d)$ such that
\[
H(\delta, \mathcal{F}, \|\cdot\|_{2,Q}) \le A\,\delta^{-2d/(2+d)}, \qquad \delta > 0. \tag{3.2}
\]
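As an informal illustration of condition (3.1), the sketch below builds a greedy $\delta$-covering of a discretized kernel class in an $L_2(Q)$-norm. The normal location kernel, the range of $y$, the choice of $Q$ and the greedy heuristic are our own illustrative assumptions; the printed sizes merely suggest the order $\delta^{-d}$ with $d \approx 1$, in which case Theorem 3.1 bounds the entropy of the convex hull (the mixtures) by a constant times $\delta^{-2/3}$.

```python
import numpy as np

# Kernel class K = {k(.|y): y in [0, 5]} with a normal location kernel, and Q uniform
# on [-5, 10]; all of these are illustrative choices, as is the greedy covering heuristic.
x = np.linspace(-5.0, 10.0, 1500)

def k(y):
    return np.exp(-0.5 * (x - y) ** 2) / np.sqrt(2.0 * np.pi)

ys = np.linspace(0.0, 5.0, 2000)                 # fine discretization of the class
funcs = np.stack([k(y) for y in ys])             # one kernel function per row

def greedy_cover_size(funcs, delta):
    """Size of a greedily built delta-covering, a rough stand-in for N(delta, K, ||.||_{2,Q})."""
    uncovered = np.ones(len(funcs), dtype=bool)
    centers = 0
    while uncovered.any():
        i = np.flatnonzero(uncovered)[0]
        dist = np.sqrt(np.mean((funcs - funcs[i]) ** 2, axis=1))   # L2(Q) distances, Q uniform
        uncovered &= dist > delta
        centers += 1
    return centers

for delta in (0.08, 0.04, 0.02, 0.01):
    N = greedy_cover_size(funcs, delta)
    print(f"delta = {delta}: covering size ~ {N},  delta * N ~ {delta * N:.2f}")
# delta * N staying roughly constant suggests N(delta) ~ c / delta, i.e. d ~ 1 in (3.1);
# Theorem 3.1 then bounds the entropy of the convex hull (the mixtures) by A * delta^(-2/3).
```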

Roughly speaking, condition (3.1) says that $\mathcal{K}$ can be seen as a subset of a finite-dimensional space, with $d$ as its "dimension". (Indeed, the number of balls of radius $\delta$ that fit into a bounded subset of $d$-dimensional Euclidean space is proportional to $\delta^{-d}$.) Inequality (3.2) says that its convex hull can however be infinite dimensional.

In mixture models, the class of possible densities $\mathcal{P} = \{p_F = \int k(\cdot|y)\, dF(y):\ F \in \Lambda\}$ is the (closure of the) convex hull of $\mathcal{K} = \{k(\cdot|y):\ y \in \mathcal{Y}\}$. Rates of convergence for densities are related to the entropy of the class of densities endowed with the Hellinger metric. Because the Hellinger distance is an $L_2$-distance between square root densities, entropy calculations may be difficult when densities can be small (since $\sqrt{x}$ is not differentiable at $x = 0$). As a consequence, one faces some technical difficulties when applying the result of Ball and Pajor to density estimation. We will present the case where these technical difficulties are kept to a minimum. Theorem 3.2 below is a special case of more general results in van de Geer (1996). It is essentially about the case where $P_0$ can be taken as a dominating measure. Like Theorem 2.2, this theorem also makes use of empirical entropies. Recall the definition of the empirical measure $P_n = \sum_{i=1}^n \delta_{X_i}/n$. We shall make use of stochastic order symbols $O_P(\cdot)$. For $Z_n$ a sequence of random variables, and $\delta_n$ a sequence of positive numbers, $Z_n = O_P(\delta_n)$ means that $Z_n/\delta_n$ remains bounded in probability.

Theorem 3.2. Let $\mathcal{P}/p_0 = \{(p/p_0)\, 1\{p_0 > 0\}:\ p \in \mathcal{P}\}$. Assume that for some $\gamma < 2$,
\[
\sup_{\delta > 0} \delta^{\gamma} H(\delta, \mathcal{P}/p_0, \|\cdot\|_{2,P_n}) = O_P(1). \tag{3.3}
\]

Then $h(\hat p_n, p_0) = O_P(n^{-1/(2+\gamma)})$.

One may regard the parameter $\gamma$ appearing in (3.3) as a measure of the complexity of the model: if $\gamma$ is large, the entropy is large and (hence) the rate of convergence is slow. Moreover, application of the result of Ball and Pajor (Theorem 3.1) gives $\gamma = 2d/(2+d)$, where $d$ is the power occurring in the bound for the covering number of the kernel class $\mathcal{K}/p_0 = \{(k(\cdot|y)/p_0)\, 1\{p_0 > 0\}:\ y \in \mathcal{Y}\}$ (note that this value for $\gamma$ meets the requirement of Theorem 3.2 that $\gamma$ should be less than 2). Indeed, in many examples where $\mathcal{Y} = \mathbb{R}^r$, the value $d = r$ holds as an upper bound. This gives $\gamma = 2r/(2+r)$ and the rate $n^{-(2+r)/(4+4r)}$ in Hellinger distance.

Example 3.3 (Convolution with the beta-distribution; van de Geer, 2000). Let $X = Y + Z$, where $Y$ and $Z$ are independent random variables on $[0,1]$. Suppose that for


some given $\alpha > 0$, $Z$ has the beta-distribution with density
\[
k(z) = \frac{\Gamma(2\alpha + 2)}{\Gamma^2(\alpha + 1)}\, z^{\alpha} (1-z)^{\alpha}, \qquad 0 \le z \le 1.
\]
Assume that $Y$ has a density $f_0$ on $[0,1]$, with $f_0 \ge \eta$ for some $\eta > 0$. We then find after some calculations that, for a constant $c$,
\[
\sup_{\delta > 0} \delta^{d} N(\delta, \mathcal{K}/p_0, \|\cdot\|_{2,P_n}) \le c
\]
almost surely, for all $n$ sufficiently large, where $d = 1$ for $\alpha \ge 1$, and $d = 1/\alpha$ for $0 < \alpha \le 1$. One thus obtains the rates
\[
h(\hat p_n, p_0) =
\begin{cases}
O_P(n^{-3/8}), & \alpha \ge 1, \\
O_P(n^{-(2\alpha+1)/(4\alpha+4)}), & 0 < \alpha \le 1.
\end{cases}
\]
Actually, for $0 \le \alpha \le 1/2$, the rate can be improved to $O_P(n^{-1/3})$ using separate arguments for the estimation of piecewise monotone densities. This rate follows from Theorem 3.2, where the value $\gamma = 1$ can be deduced from, e.g., Clements (1963). (A short arithmetic check of these exponents is given at the end of this section.)

We note that in general, there is, as yet, no clear-cut answer on the rate of convergence in mixture models. As we already pointed out, one of the problems is that the calculation of entropies for the Hellinger metric can be difficult. Another problem is that the result of Ball and Pajor, although optimal in its generality, need not be optimal in special cases. One such special case is where $\mathcal{F}$ is the class of Lipschitz functions bounded by 1, defined on $[0,1]$. Consider this as a subset of $L_2$ with Lebesgue measure. The set of extreme points of this class of functions has $d = 1$ in the bound for the covering number, but $\mathcal{F}$ has $\delta$-entropy of order $\delta^{-1/2}$.

We stress moreover that a fast rate of convergence in Hellinger distance may correspond to a slow rate (if any) for the estimator of $F_0$. The reason is that mixing can be seen as a smoothing operation. Severe smoothing gives a fast rate in Hellinger distance, but is also difficult to invert. A good example is infinite mixtures with the normal distribution. The density of the observations can then almost be estimated at a parametric rate, but it is virtually impossible to estimate the mixing distribution.

An exact rate of convergence will not always be of prime interest. In some applications (see Section 4), a rate of order $o_P(n^{-1/4})$ (in Hellinger distance) suffices to arrive at further results for estimators of functions of the density. The reason is, roughly speaking, that remainder terms in expansions often behave quadratically, and that remainders of order $o_P(n^{-1/2})$ are small enough for most purposes.

Finally, we remark that the approach we used here only produces rates in the Hellinger metric, and not in any other metric. This is due to the general connection between maximum likelihood and the Hellinger metric. Of course, in a particular problem, it may be possible to obtain rates in other metrics from rates in the Hellinger metric. This is for example the case for densities with a given number, say $s$, of derivatives. One can then deduce rates for the derivatives up to order $s - 1$. This follows from results in function theory, see e.g., Agmon (1965). However, say for pointwise rates, other techniques are generally needed. See the last section in van de Geer (1993) for an attempt to put such techniques on a common ground.
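As a quick check of the rate arithmetic in this section, the snippet below evaluates the exponent in $n^{-1/(2+\gamma)}$ from Theorem 3.2 with the Ball and Pajor value $\gamma = 2d/(2+d)$, for $d = r$ and for the two regimes of Example 3.3; the particular values of $r$ and $\alpha$ are arbitrary illustrations.

```python
from fractions import Fraction

def rate_exponent(d):
    """Exponent b in the Hellinger rate n^{-b} of Theorem 3.2, with the Ball-Pajor
    value gamma = 2d/(2+d) for the entropy exponent in (3.3)."""
    d = Fraction(d)
    gamma = 2 * d / (2 + d)
    return 1 / (2 + gamma)

# Kernel indexed by y in R^r with d = r: rate n^{-(2+r)/(4+4r)}.
for r in (1, 2, 3):
    print(f"r = {r}: Hellinger rate n^(-{rate_exponent(r)})")

# Example 3.3: d = 1 for alpha >= 1 gives n^(-3/8); d = 1/alpha for 0 < alpha <= 1
# gives n^(-(2 alpha + 1)/(4 alpha + 4)).  The alpha values below are arbitrary.
print("alpha >= 1:", f"n^(-{rate_exponent(1)})")
for alpha in (Fraction(1, 2), Fraction(3, 4)):
    b = rate_exponent(1 / alpha)
    assert b == (2 * alpha + 1) / (4 * alpha + 4)
    print(f"alpha = {alpha}: n^(-{b})")
```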


4. Asymptotic efficiency

We are now interested in estimating a linear function
\[
\theta_0 = \theta_{F_0} = \int a\, dF_0,
\]
where $a: \mathcal{Y} \to \mathbb{R}$ is given. For example, when $F_0$ is a distribution on $\mathbb{R}$, the function $a(y) = y$ corresponds to estimating the mean, and $a(y) = \exp[ty]$ corresponds to estimating the moment generating function at the point $t$. Extensions to linear functions $a$ with values in more general spaces (such as the complex numbers or a space of (real-valued or complex-valued) functions) are possible, but will not be considered here.

In some situations, rather simple estimators of $\theta_{F_0}$ are available. For example, in a convolution model, $X = Y + Z$, a naive estimator of the mean of $Y$ is simply $\bar X_n - EZ$, where $\bar X_n = \sum_{i=1}^n X_i/n$ is the empirical mean of the observations and the mean $EZ$ of $Z$ is known since the distribution of $Z$ is assumed to be known. A naive estimator of the moment generating function at the point $t$ is in this case
\[
\tilde p_n(t)/\tilde k(t), \tag{4.1}
\]
where $\tilde p_n(t) = \sum_{i=1}^n \exp[tX_i]/n$ is the empirical moment generating function of the observations, and $\tilde k$ is the (known) moment generating function of $Z$. A simulation sketch of these two naive estimators is given below.

The existence of such simple estimators, which we will call naive estimators $\hat\theta_n^{\mathrm{naive}}$, already shows that one may have $\sqrt{n}$-consistent (and asymptotically normal) estimators of functions of $F_0$, even though $F_0$ itself may only be estimated at slower rates. In this section, we study the behavior of the maximum likelihood estimator $\hat\theta_n = \int a\, d\hat F_n$. The question we address is whether the maximum likelihood estimator beats naive estimators in certain situations.

When do naive estimators exist? By the following definition, differentiable functions can be estimated by simple empirical means of a given transformation $b$ of the observations.

Definition 4.1. We say that $\theta_F = \int a\, dF$, $F \in \Lambda$, is differentiable, if for some function $b$,
\[
E(b(X)|Y = y) = a(y), \qquad y \in \mathcal{Y}.
\]

We shall use the notation
\[
A^* b(y) = E(b(X)|Y = y), \qquad y \in \mathcal{Y},
\]
and
\[
A_F h(x) = E_F(h(Y)|X = x), \qquad x \in \mathcal{X}.
\]
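The sketch below simulates a convolution model and evaluates the two naive estimators just described. The choices $Z \sim N(0,1)$ and $Y$ uniform on $[0,2]$, the sample size and the point $t$ are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Convolution model X = Y + Z with Z ~ N(0, 1) known.  As an illustration, take Y
# uniform on [0, 2], so that E[Y] = 1 and the mgf of F0 at t is (e^{2t} - 1)/(2t).
n = 100_000
Y = rng.uniform(0.0, 2.0, size=n)       # unobserved
Z = rng.standard_normal(n)              # unobserved, but with known distribution
X = Y + Z                               # observed

# Naive estimator of the mean of Y: empirical mean of X minus the known E[Z] = 0.
print("naive mean estimate:", round(X.mean() - 0.0, 3), "(truth: 1)")

# Naive estimator (4.1) of the mgf of F0 at t: empirical mgf of X over the known mgf of Z.
t = 0.5
mgf_X = np.mean(np.exp(t * X))          # \tilde p_n(t)
mgf_Z = np.exp(t ** 2 / 2.0)            # \tilde k(t) for Z ~ N(0, 1)
print("naive mgf estimate at t = 0.5:", round(mgf_X / mgf_Z, 3),
      "(truth:", round((np.exp(2.0 * t) - 1.0) / (2.0 * t), 3), ")")
```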


Definition 4.2. We call $h_{F_0}$ the worst possible subdirection for estimating $\theta_0$, if $\int h_{F_0}\, dF_0 = 0$ and
\[
A^* A_{F_0} h_{F_0}(y) = a(y) - \theta_0, \qquad y \in \mathcal{Y}.
\]
Then $b_0 = A_{F_0} h_{F_0}$ is called the efficient influence curve.

Example 4.5 (Convolution with the exponential distribution). We now consider i.i.d. observations from $X = Y + Z$, with $Y$ and $Z$ independent, and $Z$ standard exponentially distributed. The distribution of $Y$ is some unknown distribution $F_0$ on $[0,\infty)$.

Lemma 4.6. Suppose that $\dot a(y) = da(y)/dy$ exists and is bounded, and that $d(e^{-y}\dot a(y))/dF_0(y)$ exists and is bounded. Assume, moreover, that $\int e^{y}\, dF_0(y) < \infty$. Then the maximum likelihood estimator $\hat\theta_n$ is efficient and asymptotically equivalent to the naive estimator
\[
\hat\theta_n^{\mathrm{naive}} = \frac{1}{n} \sum_{i=1}^n \bigl( a(X_i) - \dot a(X_i) \bigr). \tag{4.2}
\]

For example, when $\theta_0$ is the moment generating function of $F_0$ at the point $t$, one has $a(y) = \exp[ty]$, so that by (4.2),
\[
\hat\theta_n^{\mathrm{naive}} = (1-t)\, \frac{1}{n} \sum_{i=1}^n \exp[tX_i].
\]
Indeed, the moment generating function of $Z$ is $\tilde k(t) = 1/(1-t)$, so this is exactly (4.1).

Example 4.7 (Mixture with a complete family). Recall that $\{k(\cdot|y):\ y \in \mathcal{Y}\}$ is assumed to be a family of densities with respect to a $\sigma$-finite measure $\mu$. Such a family is called complete if the equality
\[
\int g(x) k(x|y)\, d\mu(x) = 0, \qquad y \in \mathcal{Y},
\]
can only happen if $g(x) = 0$ for $\mu$-almost all $x$. Thus, there is then essentially at most one solution $b$ to the equation
\[
E(b(X)|Y = y) = a(y), \qquad y \in \mathcal{Y}. \tag{4.3}
\]
Therefore, if a solution $b$ to (4.3) exists, and if an asymptotically efficient estimator exists at all, then the naive estimator $\hat\theta_n^{\mathrm{naive}} = \sum_{i=1}^n b(X_i)/n$ is asymptotically efficient. Other asymptotically efficient estimators are then, in general, asymptotically equivalent to the naive estimator. As a special case, one may think of Poisson mixtures. In that case, $\bar X_n$ is an asymptotically efficient estimator of $EX$, $\sum_{i=1}^n (1+t)^{X_i}/n$ is an asymptotically efficient estimator of the moment generating function of $F_0$ at $t$, the empirical distribution function $P_n(x)$ at $x$ is an efficient estimator of the distribution function $P_0(x)$ at $x$, etc. Similar observations hold for mixtures with other complete exponential families, such as binomial or exponential mixtures. In general, it is unclear whether the maximum likelihood estimator is also asymptotically efficient in these cases. For the Poisson case (see Lambert and Tierney, 1984) and also for more general discrete kernels (see Milhaud and Mounime, 1996), the maximum likelihood estimator is indeed asymptotically equivalent to the naive estimator.
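As an illustration of the Poisson case in Example 4.7, the sketch below draws from a Poisson mixture and evaluates the naive estimators $\bar X_n$ and $\sum_{i=1}^n (1+t)^{X_i}/n$. The Gamma mixing distribution, the sample size and the point $t$ are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Poisson mixture: Y ~ F0 and X | Y = y ~ Poisson(y).  As an illustration, take
# F0 = Gamma(shape=2, scale=1), so E[X] = E[Y] = 2 and the mgf of F0 at t < 1 is (1 - t)^{-2}.
n = 200_000
Y = rng.gamma(shape=2.0, scale=1.0, size=n)   # unobserved mixing variable
X = rng.poisson(Y)                            # observed counts

# Naive estimator of E[X] = E[Y]: the empirical mean.
print("mean of X:", round(X.mean(), 3), "(truth: 2)")

# Naive estimator of the mgf of F0 at t: since E[(1+t)^X | Y = y] = exp(t y),
# the empirical mean of (1 + t)^{X_i} estimates \int exp(t y) dF0(y).
t = 0.3
mgf_hat = np.mean((1.0 + t) ** X)
print("naive mgf estimate at t = 0.3:", round(mgf_hat, 3),
      "(truth:", round((1.0 - t) ** -2, 3), ")")
```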


5. Conclusions and extensions

The mixture model has some special features (such as the convexity of the class of densities) that allow one to study asymptotic theory in general terms. We have summarized some results on consistency, rates of convergence, and asymptotic efficiency.

The linearity of the map $F \mapsto p_F$ sometimes also allows one to write down useful score equations (which set the derivatives of the log-likelihood to zero). An example is the identity
\[
\hat\theta_n - \theta_0 = \int \hat b_n\, d(P_n - P_0)
\]
(see van der Laan, 1995), where $\hat b_n$ is the efficient influence curve at $\hat F_n$ (assumed to exist in this case). This identity is very useful for deriving asymptotic efficiency. Unfortunately, it does not always hold, and general conditions for it to hold are as yet not known.

An important class of mixture models are models for censored observations. Examples are right censoring and interval censoring. The score equations are then often called self-consistency equations. It should be noted that the mixture models we consider correspond to the case where each observation is censored, i.e., there are no direct observations available from $F_0$. This is also the reason why the estimators of the mixing distribution converge at a nonparametric rate. For partially censored models, see e.g., van der Vaart (1994).

As an extension, one may consider the situation where the kernel $k$ depends on unknown parameters. An example where this case is studied is in Murphy and van der Vaart (1997). The convexity property of the class of possible densities is now lost, which means that there is no simple extension of the approach we used to establish the asymptotic theory.

In some cases, there are a priori restrictions on the mixing distribution $F_0$. For example, it may be known that $F_0$ has finite support. If the support is known, the situation corresponds to the one we consider in this paper. On the other hand, if it is known that $F_0$ concentrates on, say, only two points, but it is not known on which, then again the convexity of the class of densities is lost and there is no obvious adjustment of the approach we use. Dependence of the kernel $k$ on unknown parameters can be seen in this light as well.

Smoothness conditions on $F_0$ can easily be incorporated into our approach, whenever the assumed smoothness class is convex (e.g., a Sobolev or Besov ball). Smoothness conditions on $F_0$ may in fact be indistinguishable from smoothing the kernel $k$. This is because a smoothness condition on $F_0$ can generally be expressed by writing $F_0$ as the result of some integral operation on an unknown, possibly nonsmooth quantity, say $\phi_0(\cdot)$. Changing the order of integration then shows that $k$ can be replaced by a smoother (integrated) version, and that $\phi_0$ now starts playing the role of $F_0$. In general, smoothness conditions will improve the rate for $\hat F_n$, although the exact inverse calculations very much depend on the problem at hand.

The arguments used to obtain rates of convergence carry over (and are in fact simpler) for various regression problems. For example, one may think of the situation


where one wants to recover an image, when one observes a blurry version with noise. Let us call the image $f_0 = \{f_0(y):\ y \in \mathcal{Y}\}$, where $\mathcal{Y}$ is the set of pixels. Let $\{k(\cdot|y):\ y \in \mathcal{Y}\}$ describe the blurring of the image, so that the blurred image is $g_0 = \int k(\cdot|y) f_0(y)\, dy$. Suppose that we observe $X_i = g_0(Z_i) + W_i$, $i = 1, \ldots, n$, with $W_1, \ldots, W_n$ measurement error. The class of possible blurred images $\{\int k(\cdot|y) f(y)\, dy\}$ is convex, and may well be subjected to the result of Ball and Pajor (cited as Theorem 3.1 in Section 3). This will provide one with rates of convergence for various regression estimators (e.g., least squares or least absolute deviations) of $g_0$. Again, the entropy that comes out of the argument of Ball and Pajor may not be optimal, which seems to be the price to pay for its simplicity in use.

References

Agmon, S., 1965. Lectures on Elliptic Boundary Value Problems. Van Nostrand, Princeton, NJ.
Ball, K., Pajor, A., 1990. The entropy of convex bodies with "few" extreme points. In: Müller, P.F.X., Schachermayer, W. (Eds.), Proceedings of the 1989 Conference in Banach Spaces at Strobl, Austria. London Mathematical Society Lecture Note Series, Vol. 158. Cambridge University Press, Cambridge, pp. 25–32.
Bauer, H., 1981. Probability Theory and Elements of Measure Theory. Academic Press, London.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
Birgé, L., 1986. On estimating a density using Hellinger distance and some other strange facts. Probab. Theory Related Fields 71, 271–291.
Birgé, L., Massart, P., 1993. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113–150.
Clements, G.F., 1963. Entropies of sets of functions of bounded variation. Canad. J. Math. 15, 422–432.
Groeneboom, P., 1985. Estimating a monotone density. In: Le Cam, L., Olshen, R.A. (Eds.), Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. 2. University of California Press, Berkeley, pp. 539–555.
Lambert, D., Tierney, L., 1984. Asymptotic properties of maximum likelihood estimates in the mixed Poisson model. Ann. Statist. 12, 1388–1399.
Milhaud, X., Mounime, S., 1996. A Modified Maximum Likelihood Estimator for Infinite Mixtures. University of Paul Sabatier, Toulouse, Preprint.
Murphy, S.A., van der Vaart, A.W., 1997. Semiparametric mixtures in case control studies. Technical Report, Free University, Amsterdam.
Patilea, V., 2001. Convex models, MLE and misspecification. Ann. Statist. 29, 94–123.
Pfanzagl, J., 1988. Consistency of the maximum likelihood estimator for certain nonparametric families, in particular: mixtures. J. Statist. Plann. Inference 19, 137–158.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York.
van de Geer, S.A., 1993. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Statist. 21, 14–44.
van de Geer, S.A., 1995. Asymptotic normality in mixture models. ESAIM Probab. Statist. 2, 17–33 (http://www.emath.fr/ps/).
van de Geer, S.A., 1996. Rates of convergence for the maximum likelihood estimator in mixture models. J. Nonparametric Statist. 6, 293–310.
van de Geer, S.A., 2000. Empirical Processes in M-estimation. Cambridge University Press, Cambridge.
van der Laan, M.J., 1995. An identity for the nonparametric maximum likelihood estimator in missing data and biased sampling models. Bernoulli 1, 335–341.
van der Vaart, A.W., 1991. On differentiable functionals. Ann. Statist. 19, 178–204.


van der Vaart, A.W., 1994. Maximum likelihood with partially censored observations. Ann. Statist. 22, 1896–1916.
van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes, with Applications to Statistics. Springer, New York.
Wong, W.H., Shen, X., 1995. Probability inequalities for likelihood ratios and convergence rates for sieve MLE's. Ann. Statist. 23, 339–362.