Statistics & Probability Letters 64 (2003) 347–358

A nonparametric Cramér–Rao inequality for estimators of statistical functionals

Arnold Janssen
University of Düsseldorf, Mathematical Institute, Universitätsstr. 1, D-40225 Düsseldorf, Germany

Received June 2002; received in revised form May 2003

Abstract

The present paper introduces a lower bound for the mean quadratic error of estimators of differentiable statistical functionals. The result can be extended to bilinear covariance forms of vector-valued estimators. The lower bound leads to a concept of Fisher efficiency of estimators for functionals. The concept is based on tangent spaces and L2-differentiable submodels.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Cramér–Rao inequality; Differentiable statistical functionals; Fisher efficiency; Mean quadratic error of estimators; Efficient estimators
1. Introduction and the univariate Cramér–Rao bound

Let M1(Ω, A) denote the set of probability measures on a measurable space (Ω, A). In nonparametrics, Euclidean parameters are substituted by functions

κ : P → R,   (1.1)

called statistical functionals, which are defined on subsets P ⊂ M1(Ω, A).
... distributions and their tangents and derivatives of κ along these paths. Whereas their results are mostly of an asymptotic nature, the present approach is concerned with the mean quadratic error

E_P((T − κ(P))²) = Var_P(T) + b_T(P)²,   (1.4)

the sum of the variance and the square of the bias b_T(P) := E_P(T) − κ(P). The idea of the present approach can be summarized as follows. Each one-parametric regular submodel establishes a lower bound for (1.4). The nonparametric Cramér–Rao bound is the largest such bound, which corresponds to the least favorable submodel (if it exists) with lowest Fisher information. The general setup is presented via a projection method for estimators. Roughly speaking, it is shown that this projection is determined by the sum of the gradient of the functional and the gradient of the bias functional. In order to study the bias let us recall the following lemma, which can be found in van der Vaart (1988, Proposition 5.21) and Bickel et al. (1993, p. 457); see Ibragimov and Has'minskii (1981, Lemma 7.2, Chapter I) for an early version.
Lemma 1. Let t → P_t be L2-differentiable at t = 0 with tangent g. If lim sup_{t↓0} E_{P_t}(T²) < ∞ holds, then t → E_{P_t}(T) is right-sided differentiable at t = 0 with

(d/dt) E_{P_t}(T)|_{t=0+} = E_{P_0}(T g).   (1.5)
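To make (1.5) concrete, here is a minimal Monte Carlo sketch; all concrete choices (the mixture path, P = N(0,1), Q = N(1,1), and the statistic T(x) = x²) are assumptions made for illustration only. Along P_t = (1 − t)P + tQ, which is L2-differentiable with tangent g = dQ/dP − 1, both sides of (1.5) equal E_Q(T) − E_P(T) = 1 here, and t ↦ E_{P_t}(T) is linear, so a crude finite difference suffices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m = 2_000_000

T = lambda x: x**2                                 # statistic under consideration
g = lambda x: norm.pdf(x, 1.0) / norm.pdf(x) - 1   # tangent dQ/dP - 1 at t = 0

# right-hand side of (1.5): E_{P_0}(T g) under P_0 = N(0,1)
x0 = rng.normal(size=m)
rhs = np.mean(T(x0) * g(x0))

# left-hand side: finite difference of t -> E_{P_t}(T) along the mixture path
def mean_T(t):
    k = rng.binomial(m, t)                         # k samples from Q = N(1,1)
    x = np.concatenate([rng.normal(1.0, 1.0, k), rng.normal(0.0, 1.0, m - k)])
    return T(x).mean()

t = 0.1
lhs = (mean_T(t) - mean_T(0.0)) / t                # exact in expectation (linearity)

print(rhs, lhs)                                    # both approximately 1
```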
This observation motivates us to consider the following subset V_0 ⊂ T(P, P) of tangents: let V_0 be the set of tangents g such that there exists an L2-differentiable curve t → P_t in M1(Ω, A) with tangent g, P_t ∈ P for 0 ≤ t ≤ ε and some ε > 0, P_0 = P, such that

lim sup_{t↓0} E_{P_t}(T²) < ∞   (1.6)
holds. Introduce the L2(P)-closure V(T) := cl(lin V_0) of the linear hull lin V_0 of V_0 and the orthogonal projection

p_{V(T)} : L2(P) → L2(P)   (1.7)

onto the subspace V(T). It may happen that V(T) ≠ T(P, P), but we have V(T) = T(P, P) whenever Q ↦ E_Q(T²) is bounded on some total variation neighborhood of P. Let us call

κ̃_{V(T)} := p_{V(T)}(κ̃),   b̃_{V(T)} := p_{V(T)}(T − κ̃)   (1.8)

the V(T)-gradients of κ and of the bias functional b_T(·), respectively. For each curve with (1.6) we then have right-sided derivatives

(d/dt) κ(P_t)|_{t=0+} = E_P(κ̃_{V(T)} g),   (d/dt) b_T(P_t)|_{t=0+} = E_P(b̃_{V(T)} g)   (1.9)
and the differentiability calculus can be extended to the bias. The following lower bound only depends on local quantities of the estimator at P.

Theorem 2 (Cramér–Rao inequality for functionals). (a) For fixed P ∈ P we have

E_P((T − κ(P))²) ≥ E_P((κ̃_{V(T)} + b̃_{V(T)})²) + b_T(P)².   (1.10)

Moreover, κ̃_{V(T)} + b̃_{V(T)} = p_{V(T)}(T) is the projection of T on V(T) ⊂ L2(P). The lower bound is the mean quadratic error of

T_0 = κ̃_{V(T)} + b̃_{V(T)} + κ(P) + b_T(P).   (1.11)

(b) We have equality in (1.10) iff T = T_0 holds P-a.e.
Proof. (a) Let t → P_t be as in (1.6) with tangent g ∈ V_0. Consider E_{P_t}(T) = κ(P_t) + b_T(P_t). After differentiating at t = 0 we obtain by Lemma 1 and (1.9) that

E_P(p_{V(T)}(T) g) = E_P(T g) = E_P((κ̃_{V(T)} + b̃_{V(T)}) g)   (1.12)

holds. By linearity, Eq. (1.12) holds for all g ∈ lin V_0. Then p_{V(T)}(T) = κ̃_{V(T)} + b̃_{V(T)} follows. The rest is a consequence of Pythagoras' theorem since

Var_P(T) ≥ Var_P(p_{V(T)}(T)) = E_P(p_{V(T)}(T)²).   (1.13)
(b) Equality holds in (1.13) iff T − E_P(T) = p_{V(T)}(T) a.e.

Remark 3. The univariate Cramér–Rao inequality generalizes the results of Rüschendorf (1985, 1987) and van der Vaart (1988, Proposition 1.1), which are concerned with minimum variance unbiased estimation of functionals. There the arguments are also based on the projection onto tangent spaces. However, as already noticed earlier, the crucial assumption (1.6) is not superfluous.
For fixed b ∈ R, a closed subspace V ⊂ T(P, P) and b̃ ∈ V let K(b, V, b̃) denote the class of all estimators S of κ with bias b_S(P) = b such that

V(S) ⊃ V,   b̃ = p_V(b̃_{V(S)}).   (1.14)
For each S ∈ K(b, V, b̃) we then obviously have the same lower bound

E_P((S − κ(P))²) ≥ E_P((p_V(κ̃) + b̃)²) + b².

Equality holds iff S = p_V(κ̃) + b̃ + κ(P) + b. It is easy to see that (1.10) indeed coincides with the familiar Cramér–Rao inequality in a parametric situation.

Example 5. (a) Let t → P_t be L2-differentiable at t = 0 with tangent g ≠ 0, t ∈ Θ ⊂ R. Assume that 0 is an inner point of Θ and let f : Θ → R be differentiable at t = 0. If (1.6) holds for an estimator T of f then (d/dt)(E_{P_t}(T) − f(t))|_{t=0} =: a exists and (1.10) reads as

E_{P_0}((T − f(0))²) ≥ (f′(0) + a)²/E_{P_0}(g²) + (E_{P_0}(T − f(0)))².   (1.15)
To see this choose a neighborhood U ⊂ Θ of 0 and a submodel P := {P_t : t ∈ U} with V(T) = {αg : α ∈ R}. Then the projection is given by p_{V(T)}(T) = (E_{P_0}(Tg)/E_{P_0}(g²)) g and (1.5) implies f′(0) = E_{P_0}(Tg) − a.

(b) (Estimation for product measures at sample size n.) Let T : Ω^n → R be an estimator of P^n → κ(P) for the model P_n := {P^n : P ∈ P}. Assume that V(T) = T(P^n, P_n) is full and that b̃_{V(T)} = 0 holds. The last condition is a gradient unbiasedness assumption on the estimator T of κ at P, corresponding to a = 0 in (1.15). Then the Cramér–Rao inequality implies

E_{P^n}((T − κ(P))²) ≥ E_P(κ̃²)/n + b_T(P)².   (1.16)

The lower bound only depends on the bias and the L2(P)-norm of the canonical gradient κ̃. The details can be found in Lemma 18.

(c) The smallest bound within the class ⋃_{b∈R} K(b, T(P^n, P_n), 0) is just E_P(κ̃²)/n and it is attained by the unique estimator

T_0(x_1, …, x_n) = (1/n) Σ_{i=1}^n κ̃(x_i) + κ(P).   (1.17)
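As a sanity check of Example 5(c), a simulation sketch under assumptions chosen purely for illustration: for the mean functional κ(P) = ∫ x dP on the full model the canonical gradient is κ̃(x) = x − κ(P), so (1.17) is the sample mean and the bound E_P(κ̃²)/n equals Var_P(X)/n. The sample median is included only for contrast.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000
mu, sigma = 0.0, 1.0                      # P = N(0, 1); kappa(P) = mu

x = rng.normal(mu, sigma, (reps, n))
t_mean = x.mean(axis=1)                   # T_0 of (1.17): gradient-unbiased, efficient
t_med = np.median(x, axis=1)              # a competing (inefficient) estimator

bound = sigma**2 / n                      # E_P(kappa_tilde^2)/n, the bound in (1.16)
print("bound:     ", bound)
print("MSE mean:  ", np.mean((t_mean - mu) ** 2))   # ~ bound (attained)
print("MSE median:", np.mean((t_med - mu) ** 2))    # ~ (pi/2) * bound (> bound)
```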
Remark 6. (a) The present result can be viewed as a local analog of the famous Lehmann–Scheffé theorem for UMVU estimators, see Lehmann (1983), which relies on projections via conditional expectations. Under the present conditions the projection based on the subset V(T) of the tangent space may yield an estimator T_0 with equality in the Cramér–Rao bound at P.

(b) Within the parametric setup (see Example 5) it is well known that for unbiased estimation the Cramér–Rao bound is attained only for special exponential families, see Müller-Funk et al. (1989).

(c) We do not claim that the estimators T_0 always belong to the class (1.14). They just attain the lower bound. However, if the lower bound is attained within the present class then the estimator coincides with T_0.

We will now discuss the accuracy of the bound for a restricted model given by some P′ ⊂ P and the restricted functional κ|P′. Eq. (1.18) may be used to construct estimators. Let P ∈ P′.

Lemma 7. Let T be an estimator with V(T) ⊃ T(P, P′). Suppose that we have equality in (1.10) and let b̃ := b̃_{V(T)} abbreviate the gradient of its bias functional. Consider the estimator

T_1 = p_{T(P,P′)}(T) + κ(P) + b_T(P)   (1.18)

for the restriction κ|P′. Then we have the equality

E_P((T_1 − κ(P))²) = E_P(p_{T(P,P′)}(T_1)²) + b_T(P)².   (1.19)

This is just the lower bound for the mean quadratic error of all estimators for κ|P′ of the class K(b_T(P), T(P, P′), p_{T(P,P′)}(b̃)).

Proof. By Theorem 2(b) we have T = T_0 and

T − E_P(T) = p_{V(T)}(T) = κ̃_{V(T)} + b̃.   (1.20)
Thus, T_1 − E_P(T) = p_{T(P,P′)}(T − E_P(T)) = κ̃_{T(P,P′)} + p_{T(P,P′)}(b̃) holds, which yields the lower bound for the present estimators of κ|P′.
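A small simulation sketch of the projection idea behind (1.18); the submodel, the point P = N(0,1), and the functional F_P(y) are assumptions chosen for illustration. The tangent space of P′ = {N(θ, 1) : θ ∈ R} at N(0,1) is spanned by the score x, and the projection of the full-model gradient 1_{(−∞,y]} − Φ(y) onto it is −φ(y)·x, because E[X 1_{(−∞,y]}(X)] = −φ(y). The bias term of (1.18) vanishes here, so the projected estimator improves on the empirical distribution function within the submodel.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, reps, y = 100, 100_000, 0.0
phi, Phi = norm.pdf, norm.cdf

x = rng.normal(size=(reps, n))            # P = N(0,1) inside P' = {N(theta, 1)}
ecdf = (x <= y).mean(axis=1)              # full-model efficient estimator of F_P(y)
proj = Phi(y) - phi(y) * x.mean(axis=1)   # projected estimator, cf. (1.18)

print("MSE ecdf:", np.mean((ecdf - Phi(y)) ** 2))   # ~ Phi(y)(1 - Phi(y))/n
print("MSE proj:", np.mean((proj - Phi(y)) ** 2))   # ~ phi(y)^2/n (submodel bound)
```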
2. Estimation of arbitrary functionals

In this section the results are extended to more general statistical functionals

κ : P → M,   (2.1)
where M denotes a nonvoid set. A typical example is the functional given by the distribution function, which serves as motivation for our approach.

Example 8. Let P ⊂ M1(R^d, B^d) ∩ {P : P ≪ μ} be a subset of distributions on R^d, d ≥ 1, which are absolutely continuous w.r.t. some σ-finite measure μ. The functional under consideration is given by the distribution function F_P of P,

κ(P) = F_P.   (2.2)
In order to introduce lower bounds for estimators of (2.2) the problem can be reduced to the univariate case as follows. Let f : R^d → R denote a bounded measurable function. Univariate functionals are then given by

κ_f(P) := ∫ f dP.   (2.3)

Lower bounds for estimators of κ_f can now be used to introduce a Cramér–Rao bound for the vector-valued functional.

The treatment of (2.1) can be done under fairly general assumptions which are specified below. Consider a vector space

W ⊂ {f | f : M → R}   (2.4)

of functions acting on M. Suppose that W is an F-space in the sense of Rudin (1973), 1.8, i.e. a topological vector space with a topology given by a complete invariant metric. The topology is assumed to be finer than the topology of pointwise convergence.
Proof (of Lemma 9). The related proof of Ledoux and Talagrand (1991, p. 42) can be adapted. (a) It is easy to see that the graph {(f, f(T)) : f ∈ W} is closed in the product space: notice that f_n → f in W implies pointwise convergence f_n(T) → f(T). Thus the closed graph theorem, see Rudin (1973), 2.15, implies the result. (b) Since the present function is separately continuous, the result follows from Rudin (1973), 2.17.

Throughout we will always assume that f(T) ∈ L2(Ω, A, P) holds for all f ∈ W, where P ∈ P is the fixed underlying distribution. The bias of T at P is now given by

b_T(f) = E_P(f(T) − f(κ(P))).   (2.5)
The bias defines the bilinear form

B_{b,T}(f, h) := b_T(f) b_T(h).   (2.6)
The mean quadratic error of T at P is now substituted by the continuous bilinear form (corresponding to a covariance operator in special cases)

B_T : W × W → R,   B_T(f, h) = E_P((f(T) − f(κ(P)))(h(T) − h(κ(P)))).   (2.7)
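To make (2.7) concrete, a small sketch anticipating Example 13; P = N(0,1) and the coordinate evaluations f(F) = F(y_1), h(F) = F(y_2) are illustration assumptions. For the empirical distribution function as estimator T, the form B_T(f, h) is just the covariance of the two estimation errors, which equals F(y_1)(1 − F(y_2))/n for y_1 ≤ y_2.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 100, 50_000
y1, y2 = -0.5, 1.0                        # evaluation points; f(F) = F(y1), h(F) = F(y2)
F = norm.cdf                              # P = N(0,1), kappa(P) = F_P

x = rng.normal(size=(reps, n))
Fn_y1 = (x <= y1).mean(axis=1)            # f(T): empirical CDF at y1
Fn_y2 = (x <= y2).mean(axis=1)            # h(T): empirical CDF at y2

# B_T(f, h) = E_P[(f(T) - f(kappa(P))) (h(T) - h(kappa(P)))], estimated by MC
B_T = np.mean((Fn_y1 - F(y1)) * (Fn_y2 - F(y2)))
exact = F(y1) * (1 - F(y2)) / n           # covariance of the empirical CDF (y1 <= y2)
print(B_T, exact)                         # equal up to Monte Carlo error
```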
If we have two bilinear forms B_1, B_2 : W × W → R, the ordering ≥ can be introduced by defining B_1 ≥ B_2 iff B_1 − B_2 is positive semidefinite. Suppose that for some subset D ⊂ W each functional

κ_f := f ∘ κ : Q → f(κ(Q)),  Q ∈ P,   (2.8)

is differentiable at P ∈ P for f ∈ D. Consider for each f ∈ D the subspace V(f(T)) ⊂ T(P, P) of the estimator f(T) given by (1.6). Define V(T) := ⋂_{f∈D} V(f(T)).
Similarly to Lemma 9 the present closed subspace V(T) of L2(Ω, A, P) defines a continuous orthogonal projection p_{V(T)} and the bilinear form

B_0(f, h) = E_P(p_{V(T)}(f(T)) p_{V(T)}(h(T))).   (2.10)
Remark 10. (a) Suppose that κ_f := f ∘ κ (2.8) is differentiable with V(f(T)) ⊃ V(T) and canonical gradient κ̃_f for some f ∈ W. Then the projection

p_{V(T)}(f(T)) = (κ̃_f)_{V(T)} + b̃(f)_{V(T)}   (2.11)

has, according to Theorem 2, an interpretation in terms of gradients, where by definition (κ̃_f)_{V(T)} := p_{V(T)}(κ̃_f) and b̃(f)_{V(T)} := p_{V(T)}(f(T) − κ̃_f) are the V(T)-gradients of κ_f and of the bias f ↦ b_T(f), respectively.
(b) The function f → (κ̃_f)_{V(T)} + b̃(f)_{V(T)} given for f ∈ D has, by the left-hand side of (2.11), a unique continuous linear extension as a continuous linear operator from W into L2(Ω, A, P). Hence, B_0 is again completely determined by the V(T)-gradients of κ_f and the bias b_T(f) for f ∈ D. Under these assumptions we have a Cramér–Rao inequality for bilinear forms associated with T.
Theorem 11. (a) Consider the bilinear forms (2.6), (2.7) and (2.10) on W². Then B_T ≥ B_0 + B_{b,T} holds.

(b) If we have equality in part (a) then

f(T) = p_{V(T)}(f(T)) + f(κ(P)) + b_T(f)   (2.12)

holds for each f ∈ W. If in addition the assumptions of Remark 10(a) hold for some f ∈ W then

f(T) = (κ̃_f)_{V(T)} + b̃(f)_{V(T)} + f(κ(P)) + b_T(f)   (2.13)
follows P-a.e.

Proof. (a) For fixed f ∈ W apply Theorem 2 to the estimator f(T): since V(T) ⊂ V(f(T)), inequality (1.10) yields B_T(f, f) ≥ B_0(f, f) + B_{b,T}(f, f), and positive semidefiniteness of the difference is exactly the claimed ordering. Part (b) follows from Theorem 2(b).

Remark 12. (a) Consider the class of all estimators S with

f(S) ∈ K(b_T(f), V(T), b̃(f)_{V(T)})   (2.14)

for all f ∈ D (with the same bias and the same V(T)-gradient of the bias as T). The definition of the class (2.14) is taken from (1.14). The lower bound of Theorem 11 remains valid for each such S.

(b) (Uniqueness) Suppose that condition (A) holds:

(A) (Separability) There exists a countable dense subset in W, and W separates the points of M, i.e. f(x_1) = f(x_2) for all f ∈ W implies x_1 = x_2.

Let S be as in (a) above. If we have equality B_S = B_0 + B_{b,T} then S is unique within the present class (2.14) of estimators.

Proof. If there are two such estimators S_1 and S_2, the set ⋂_{f∈W_0} {f(S_1) = f(S_2)} has P-probability one for each countable subset W_0 ⊂ W.
The present result applies to the estimation of distribution functions.

Example 13 (Examples 5(b) and 8 continued). Lower bounds for (2.2) can be established as follows. Introduce the set of indicators

D = {1_A : A = (−∞, y_1] × ⋯ × (−∞, y_d], (y_1, …, y_d) ∈ Q^d}   (2.15)
and let W = cl(lin D) ⊂ L∞(R^d, B^d, μ) denote the closure of the linear hull in the L∞-norm. Then the present assumptions hold. For f ∈ D the functional κ_f has at P^n the well-known gradient

κ̇_f(x) = (1/n) Σ_{i=1}^n (f(x_i) − ∫ f dP).   (2.16)
The canonical gradient depends on the model P:

(a) Suppose that P = {P : P ≪ μ} is the full model. Then κ̃_f = κ̇_f holds. Let F̂_n(·, ·) : R^{nd} → M denote the empirical distribution function F̂_n(x, y) = (1/n) Σ_{i=1}^n ∏_{j=1}^d 1_{(−∞, y_j]}(x_{ij}). Then

∫ f dF̂_n(x, ·) = κ̃_f(x) + κ_f(P)   (2.17)

holds and F̂_n attains the nonparametric Cramér–Rao bound.

(b) Let now P be an arbitrary submodel with tangent space T(P, P) at P ∈ P.
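A quick numerical check of part (a) in dimension d = 1, with P = N(0,1) assumed for illustration: for f = 1_{(−∞,y]} the bound (1.16) for unbiased estimation is F(y)(1 − F(y))/n, and the empirical distribution function attains it exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps, y = 100, 100_000, 0.5
Fy = norm.cdf(y)                          # kappa_f(P) = F_P(y) for f = 1_(-inf, y]

x = rng.normal(size=(reps, n))
Fn_y = (x <= y).mean(axis=1)              # empirical distribution function at y

mse = np.mean((Fn_y - Fy) ** 2)           # mean quadratic error of the ECDF at y
bound = Fy * (1 - Fy) / n                 # E_P(kappa_tilde_f^2)/n from (1.16)
print(mse, bound)                         # equal up to Monte Carlo error
```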
lim inf_{n→∞} E_{P^n}(n(T_n − κ(P))²) ≥ E_P(κ̃²).   (3.2)

The present sequence T_n is called Fisher efficient at P if

lim_{n→∞} E_{P^n}(n(T_n − κ(P))²) = E_P(κ̃²)

holds. This is in accordance with the classical notion for parametric estimators. Pfanzagl (2001) derived an asymptotic Cramér–Rao inequality for asymptotically unbiased estimators (under his slightly different notation). The asymptotic lower bound (3.2) can already be found in his Theorem 11. Let

T_{0,n}(x_1, …, x_n) = (1/n) Σ_{i=1}^n κ̃(x_i) + κ(P)   (3.3)

be the estimator (1.17) attaining the lower bound for unbiased estimation.
Theorem 14. Suppose that T_n is √n-gradient unbiased (3.1). Then T_n is Fisher efficient at P iff

n ∫ (T_n − T_{0,n})² dP^n → 0  as n → ∞.   (3.4)

Proof. Notice that (3.4) implies √n-unbiasedness. Due to condition (3.1) we may assume without restriction that b̃_{V_n} = 0 holds; otherwise we may consider T_n − b̃_{V_n}. Now it is clear that (3.4) implies Fisher efficiency. Conversely, assume that T_n is Fisher efficient. Then Theorem 2 implies

E_{P^n}(n(T_n − κ(P))²) − E_{P^n}(n p_{V_n}(T_n − κ(P))²) → 0.   (3.5)

By Theorem 2(a) we have p_{V_n}(T_n − κ(P)) = T_{0,n} − κ(P). Thus, the orthogonal projection of T_n on V_n^⊥ is just T_n − T_{0,n} and Pythagoras' theorem implies (3.4).
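A sketch of criterion (3.4) for the mean functional on the full model, where T_{0,n} of (3.3) is the sample mean; the two perturbations below are illustration assumptions: an O(1/n) perturbation satisfies (3.4) and remains Fisher efficient, while an O(1/√n) shift (which is not √n-gradient unbiased) violates (3.4) and loses efficiency.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, reps, c = 0.0, 1.0, 1_000_000, 2.0

for n in (100, 1_000, 10_000):
    # under P = N(mu, sigma^2) the sample mean T_{0,n} is exactly N(mu, sigma^2/n)
    t0 = rng.normal(mu, sigma / np.sqrt(n), reps)   # T_{0,n} of (3.3)
    ta = t0 + c / n                                  # O(1/n) perturbation
    tb = t0 + c / np.sqrt(n)                         # O(1/sqrt(n)) perturbation
    print(n,
          n * np.mean((ta - t0) ** 2),   # criterion (3.4): -> 0, Fisher efficient
          n * np.mean((tb - t0) ** 2),   # stays at c^2: (3.4) fails
          n * np.mean((tb - mu) ** 2))   # normalized MSE -> sigma^2 + c^2
```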
We will briefly sketch how to extend the results to √n-gradient unbiased estimation of arbitrary functionals (2.1). Suppose that (2.8) is differentiable for each f ∈ D with canonical gradient κ̃_f and put

B_0(f, h) := E_P(κ̃_f κ̃_h)   (3.6)

for f, h ∈ D. At this stage B_0 can again be defined on W² by continuous extension. Suppose further that the limit
B(f, f) := lim_{n→∞} n E_{P^n}((f(T_n) − f(κ(P)))²)   (3.7)
exists for all f ∈ W. Then the following conditions are equivalent:

(i) B = B_0 on (lin D)² and B is the unique continuous extension of B_0 (3.6).
(ii) f(T_n) is Fisher efficient for f ∘ κ at P for each f ∈ D.

Remark 16. (a) Condition (3.4) of Theorem 14 can be compared with the asymptotically efficient estimation of functionals treated via the more technical local asymptotic normality (LAN) of experiments, see Pfanzagl and Wefelmeyer (1982) and Bickel et al. (1993). As a result of that work it turned out that the sequence T_n is asymptotically efficient (in their sense) if √n (T_n − T_{0,n}) → 0 in P^n-probability.

(b) A sequence of superefficient estimators with asymptotic variance smaller than E_P(κ̃²) (3.2) cannot be √n-gradient unbiased. The present inequality proves that (3.1) is violated for superefficient estimators. Related results concerning asymptotic unbiasedness can be found in Pfanzagl (2001, Theorem 4.2).

Finally, we would like to study the efficiency of estimators of κ for restricted models P′ ⊂ P, see Lemma 7. Obviously, the canonical gradient of κ w.r.t. P′ at P is now κ̃̃ = p_{T(P,P′)}(κ̃).
Lemma 17. Let T_n be as in Theorem 14 and Fisher efficient for κ w.r.t. the P_n-model. Consider

T_n′ = p_{T(P^n, P′_n)}(T_n) + κ(P)

as estimator for κ|P′, where P′_n := {P^n : P ∈ P′}.

(a) The estimator T_n′ is asymptotically efficient for the P′_n-model in the sense that

√n (T_n′(x_1, …, x_n) − κ(P)) − (1/√n) Σ_{i=1}^n κ̃̃(x_i) → 0   (3.8)

holds in P^n-probability.

(b) If in addition the sequence (T_n′)_n is √n-gradient unbiased in the sense of (3.1) for the P′_n-model then it is Fisher efficient for the restricted model.

Proof. Let T′_{0,n} denote the estimator (3.3) with κ̃ replaced by κ̃̃. Due to the present assumption statement (3.4) holds. Since the projection on T(P^n, P′_n) is norm-contracting in L2(P^n), this statement implies

n ∫ (T_n′ − T′_{0,n})² dP^n → 0.   (3.9)

Thus statement (a) follows. If in addition the restricted estimator sequence is √n-gradient unbiased then Theorem 14 applies to T_n′ and proves the Fisher efficiency.
Thus, one obtains the efficiency in the sense of (3.8) of the reduced estimator F̂_n of F_P for submodels. However, the construction depends on P and does in general not lead to global estimators at once.

Lemma 18. (a) The tangent space T(P^n, P_n) given by P_n of Example 5(b) consists of the L2(P^n)-closure of the convex hull of the functions

(x_1, …, x_n) → Σ_{i=1}^n g(x_i),   g ∈ T(P, P).   (3.10)
(b) If κ : P → R is differentiable with canonical gradient κ̃ at P then P^n → κ(P) is differentiable at P^n ∈ P_n with canonical gradient

(x_1, …, x_n) → (1/n) Σ_{i=1}^n κ̃(x_i).   (3.11)
Proof. (a) Let t → P_t^n be L2-differentiable at t = 0. Then it is well known that the projection t → P_t is also L2-differentiable. Let g denote its tangent. Then (3.10) is the tangent of t → P_t^n; for details see Witting (1985, Section 1.8.3) or Bickel et al. (1993).

(b) Along the present curve (P_t^n)_t the derivative of (1.3) is given by

∫ ((1/n) Σ_{i=1}^n κ̃(x_i)) (Σ_{i=1}^n g(x_i)) dP^n(x_1, …, x_n) = ∫ κ̃ g dP.   (3.12)

Thus (3.11) is a gradient and, as a member of T(P^n, P_n), it is the canonical gradient.
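For completeness, a short verification sketch of (3.12), using only the independence under P^n and the centering E_P(κ̃) = 0 = E_P(g):

```latex
\[
\int \Bigl(\tfrac{1}{n}\sum_{i=1}^n \tilde\kappa(x_i)\Bigr)
     \Bigl(\sum_{j=1}^n g(x_j)\Bigr)\, dP^n
 = \tfrac{1}{n}\sum_{i=1}^n\sum_{j=1}^n
   E_P\bigl(\tilde\kappa(X_i)\,g(X_j)\bigr)
\]
\[
 = \tfrac{1}{n}\sum_{i\neq j} E_P(\tilde\kappa)\,E_P(g)
   + \tfrac{1}{n}\sum_{i=1}^n E_P(\tilde\kappa\, g)
 = 0 + E_P(\tilde\kappa\, g)
 = \int \tilde\kappa\, g\, dP .
\]
```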
Acknowledgements

The author wishes to thank J. Pfanzagl for careful reading of the manuscript.

References

Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore.
Gill, R.D., Levit, B.Y., 1995. Applications of the van Trees inequality: a Bayesian Cramér–Rao bound. Bernoulli 1, 59–79.
Ibragimov, I.A., Has'minskii, R.Z., 1981. Statistical Estimation. Springer, Berlin.
Janssen, A., 1999. Testing nonparametric statistical functionals with applications to rank tests. J. Statist. Plann. Inference 81, 71–93 (Erratum: 92 (2001) 297).
Ledoux, M., Talagrand, M., 1991. Probability in Banach Spaces. Springer, Berlin.
Lehmann, E.L., 1983. Theory of Point Estimation. Wiley, New York.
Müller-Funk, U., Pukelsheim, F., Witting, H., 1989. On the attainment of the Cramér–Rao bound in L_r-differentiable families of distributions. Ann. Statist. 17, 1742–1748.
Pfanzagl, J., 2001. A nonparametric asymptotic version of the Cramér–Rao bound. In: de Gunst, M., Klaassen, C., van der Vaart, A. (Eds.), State of the Art in Probability and Statistics. Festschrift for Willem R. van Zwet. IMS Lecture Notes Monograph Series, Vol. 36, pp. 499–517.
Pfanzagl, J., Wefelmeyer, W., 1982. Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics, Vol. 13. Springer, Berlin.
Rudin, W., 1973. Functional Analysis, TMH Edition. McGraw-Hill, New York.
Rüschendorf, L., 1985. Unbiased estimation and local structure. In: Grossmann, W., et al. (Eds.), Proceedings of the Fifth Pannonian Symposium, Visegrád, pp. 295–306.
Rüschendorf, L., 1987. Unbiased estimation of von Mises functionals. Statist. Probab. Lett. 5, 287–292.
Strasser, H., 1985. Mathematical Theory of Statistics. De Gruyter Studies in Mathematics, Vol. 7. de Gruyter, Berlin.
van der Vaart, A.W., 1988. Statistical Estimation in Large Parameter Spaces. CWI Tract, Vol. 44. Center for Mathematics and Computer Science, Amsterdam.
Witting, H., 1985. Mathematische Statistik I. Teubner, Stuttgart.