Computational Statistics & Data Analysis 30 (1999) 19–30
A semi-Bayesian method for nonparametric density estimation R. de Bruina , D. Salomeb , W. Schaafsmab; ∗ b
a Centre for High Performance Computing, University of Groningen, Netherlands Department of Mathematics and Computer Science, University of Groningen, P.O. Box 800, 9700AV Groningen, Netherlands
Received 1 July 1995; received in revised form 1 February 1996; accepted 30 November 1998
Abstract If x[1] ; : : : ; x[n] are the ordered outcomes of an independent random sample from a distribution on the interval [x[0] ; x[n+1] ] = [a; b] with distribution function F and density function f then ‘intrinsic’ arguments (Section 2) lead to a speci c Bernstein polynomial as ‘the’ estimate of F −1 (p). Numerical inversion and dierentiation provides ‘the’ estimate of f = F 0 which we study in detail. A comparison c 1999 Elsevier Science B.V. All rights is made, extensions are indicated and literature is discussed. reserved.
1. Introduction The nonparametric estimation of distribution, density, quantile and other characterizing functions has received considerable attention since, e.g., Aggarwal (1955) and Rosenblatt (1991). The wealth of ideas developed makes it dicult to decide what one should do in practice. That is why we have tried to develop an ‘intrinsic’ approach leading to a ‘unique’ solution. Our results are in line with existing ones of, e.g., Jones (1992), Munoz-Perez et al. (1987) and Vitale (1975). Our approach is based on three elementary remarks: (1) the main purpose is to make a distributional inference about a future observation y(=x n+1 ); given the outcome x1 ; : : : ; x n of an independent random sample X1 ; : : : ; Xn from F, (2) estimating a distribution function F or its derivative f = F 0 is equivalent to estimating the quantile function F −1 , and (3) the making of a point estimate ˆp for p = F −1 (p) ∗
Corresponding author. Tel.: 31-50-3633970; fax: 31-50-3633800.
c 1999 Elsevier Science B.V. All rights reserved. 0167-9473/99/$ – see front matter PII: S 0 1 6 7 - 9 4 7 3 ( 9 8 ) 0 0 0 8 9 - 9
20
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
is facilitated by providing a distributional inference about the quantile p rst; the point estimate can then be obtained by simply taking the expectation. Assuming that the support [a; b] = [x[0] ; x[n+1] ] of F is given, Section 2 will provide Hn (p) =
n+1 X
x[i]
i=0
n+1 i
pi (1 − p)n+1−i
as ‘the’ estimate of H (p) = F −1 (p). This estimate was developed before in Munoz-Perez et al. (1987) on the basis of a dierent argument (see Sections 5 and 6 for further details). It can easily be proved that Hn : [0; 1] 7→ [a; b] is a strictly increasing function assuming the value a if p = 0 and b if p = 1. (Note that x[:] is a strictly increasing function and that the family {B(n + 1; p); 0 ¡ p ¡ 1} stochastically increases or use the explicit expression of Hn0 (p) to be given in Section 5.) The inverse Gn = Hn−1 : [a; b] 7→ [0; 1] is the smoothed analogue of the empirical distribution function Fn we shall study. The derivative gn = Gn0 is our nonparametric estimate of the density f. This estimate will be compared with some competitors (Section 3). The crux of this comparison is, of course, the choice of a concept of performance. The following dissimilarity coecients d(F; G) will be considered to characterize the loss if the true distribution function is F and its estimate is G: d1 (F; G) = = d2 (F; G) = d3 (F; G) = d4 (F; G) = d5 (F; G) = d6 (F; G) =
Z
b
a
Z
1
0
Z
b
a
Z
b
a
Z
b
a
Z
b
a
Z
b
a
|F(z) − G(z)| d z; |F −1 (p) − G −1 (p)| dp; (F(z) − G(z))2 d z; |F(z) − G(z)| dF(z); (F(z) − G(z))2 dF(z); |f(z) − g(z)| d z; (f(z) − g(z))2 d z;
d7 (F; G) = sup |F(z) − G(z)|; d8 (F; G) = d9 (F; G) =
Z
z
b
a
Z
a
b
f(z) ln
f(z) d z; g(z)
(f1=2 (z) − g1=2 (z))2 d z;
=2−2
Z a
b
f1=2 (z)g1=2 (z) d z:
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
21
Fig. 1. The estimate gn (solid curve) on the interval [0; 31:62], gn;1 (dashed curve) on an estimated ˆ gn;2 (long-dashed curve) on the interval (−∞; ∞) and g˜n (dotted curve) as computed interval [a; ˆ b], by the IMSL subroutine DDESKN.
If the support of G is dierent from [a; b] then obvious modi cations are made. The question appears which dissimilarity coecient is the most appropriate. This question does not permit an easy answer. From the nonparametric point of view di ; i = 3; 4; 5; 7; 8; 9, have the advantage that they are invariant under monotone transformations; d5 , d8 and d9 are even invariant under all piecewise continuous (dierentiable) bijections. From the density estimation point of view these invariance requirements are irrelevant because nonparametric density estimates fail to be equivariant under monotone transformations. Most comparative analyses (Section 3) are, fortunately, such that all dissimilarity concepts, d(F; G), point into the same direction. 1.1. Example Let the outcome x[1] ; : : : ; x[13] be given as 14.34, 18.09, 18.79, 19.93, 20.83, 22.56, 24.06, 24.71, 26.86, 27.18, 28.40, 28.92 and 29:67. If the support [a; b] is speci ed by a = 0 and b = 31:62, then the density gn = Gn0 is plotted in Fig. 1. A weakness of the approach of Section 2 is that the support [a; b] has to be nite and that a and b have to be speci ed. Various extensions will be made in Section 4 and some comments will be given in Section 6. If neither a(=0) nor b(=31:62) is available then two situations can be distinguished: (1) it is known that nite values a and b exist though they are unknown, (2) the case a=−∞; b=∞ cannot be excluded
22
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
Table 1 The dissimilarity coecients, di , are explained in the text. The estimated density functions gn are computed on the interval [0; 31:62] (gn ), on an unspeci ed nite interval [a; b] (gn;1 ), on an in nite interval (−∞; ∞) (gn;2 ) and by IMSL (g˜n )
d1 d2 d3 d4 d5 d6 d7 d8 d9
gn
gn;1
gn;2
g˜n
1.461 0.114 0.053 0.004 0.288 0.004 0.113 0.056 0.037
2.248 0.223 0.076 0.007 0.419 0.011 0.134 ∞ 0.206
11.263 0.989 0.076 0.007 0.739 0.010 0.190 0.529 0.418
5.187 1.011 0.152 0.030 0.508 0.012 0.246 ∞ 0.204
a priori. In the case of situation 1 we proceed by using estimates aˆ = x[1] − (x[2] − x[1] ) ˆ The and bˆ = x[n] + (x[n] − x[n−1] ) and applying Section 2 with x[0] = aˆ and x[n+1] = b. resulting estimate gn;1 is plotted together with the estimate gn;2 which appears if situation 2 occurs (see Section 4 for details). Finally, the IMSL (1987) subroutine DDESKN with the biweight kernel function K(y) = 15(1 − y 2 )2 =16; −1 ¡ y ¡ 1; and a window width of 2h=5 provides the kernel estimate g˜n . The graphs of gn ; gn;1 ; gn;2 and g˜n in Fig. 1 can be compared with the graph of f(x) = x=500, 0 ¡ x ¡ 31:62 (≈ 10001=2 ) which is the true density. The values of the dissimilarity coecients can be computed now and are given in Table 1. It is concluded from Table 1 that gn is closer to truth than gn;1 and gn;1 is better than gn;2 : if speci c information is provided then this seems to be helpful even if it is only qualitative (like the statement that a and b are de nitely nite). If further information is provided, e.g. that ln f(x) is a concave function of x (strong unimodality) or that f is unimodal in the usual sense, or that f is an increasing function of x or that the hazard function is increasing, then all density estimates can be adapted. We are not too much interested in these ‘improvements’ because the assumptions of monotonicity or unimodality, though less awkward than parametric ones, are very restrictive and not really in line with the idea of nonparametric density estimation. In practice gn cannot be compared with the true f because this is unknown. The statistician or numerical analyst would like to present a con dence band (gn ; gn ) for f instead of a simple estimate. The deviations of gn from the true f in Fig. 1 are so large that nothing useful comes out for n = 13. That is why we considered the case n = 100. After having made a nonparametric density estimate gn on the basis of a sample of size n=100 from the distribution with density f(x)=x=500; 0 ¡ x ¡ 31:62 (by taking square roots of random numbers from (0; 103 )) we did the following. From the distribution with density gn , hundred new samples were made, each of n = 100 elements. Each sample generated an estimate gˆn = gˆn; i of gn . Next, for each value of z ∈ (0; 31:62), the two largest and the two smallest gˆn; i (z) values were deleted and the range (gn ; gn ) of the remaining 96 points was plotted, see Fig. 2.
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
23
Fig. 2. The ‘con dence band’ based on 100 samples of 100 elements from the estimated density function gn .
The band looks quite satisfactory from the point of view that the true density f is contained (except for an obvious anomaly near 0). The wideness, however, indicates that nonparametric density estimates are not very accurate. The third line of Table 2 in Section 5 suggests that the width may decrease by something like 10% (from ‘20’ to ‘18’) if the sample size is doubled (from 100 to 200). See Sections 4 and 5 for further details.
2. The crucial argument The making of a distributional inference about p = F −1 (p) corresponds to assigning credibilities to all hypotheses of the form p ≤ z or, equivalently, p ≥ z. The logical equivalence F −1 (p) = p ≥ z ⇔ F(z) ≤ p; indicates that we can focus on hypotheses of the form Hz;p : F(z) ≤ p where the true value t = F(z) of P(Xi ≤ z) is unknown because F is. It seems ‘reasonable’ to use the uniform prior for t, (see also Section 6). The relevant information about t being expressed by the outcome s of S = #{1 ≤ i ≤ n; X[i] ≤ z} ∼ B(n; t);
24
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
we obtain the Beta(s + 1; n − s + 1) posterior distribution and, hence, the credibility Z 0
p
us (1 − u)n−s du = P(Bp ≥ s + 1); Beta(s + 1; n − s + 1)
where Bp ∼ B(n + 1; p) is an auxiliary random variable. Regarded as a function of z, the credibility thus assigned to p ≥ z jumps (downward) if z = x[i] . For z = x[i] − 0 the outcome s of S is equal to i − 1 and for z = x[i] + 0 this outcome is equal to i. Hence mass P(Bp ≥ i) − P(Bp ≥ i + 1) = P(Bp = i) is assigned to z = x[i] ; i = 0; : : : ; n + 1. The distributional inference about p = F −1 (p) thus obtained is such that the expectation of this discrete distribution is the estimate ˆ p = Hn (p) of p = F −1 (p) mentioned in Section 1. 2.1. Remark The estimate ˆp is an L-statistic but it diers from the sample pth quantile Fn−1 (p) that is commonly used to estimate the population quantile. The asymptotic distribution of this quantile is well known to be normal with the true value p as expectation and n−1 p(1 − p){f(p )}−2 as variance. It can be proven, using the methods of Dehling et al. (1991), that our islamic-mean-like estimator ˆp has the same asymptotic distribution. See Sheather and Marron (1990) for some alternative proposals.
3. The proof of the pudding The ‘statistical’ argument used in Section 2 is dierent from the usual ones. It requires that X1 ; : : : ; Xn is an uncensored, independent random sample from a univariate distribution with a positive density on the given support [a; b]. The competing parametric methods, kernel methods, spline approximations, etc., are more exible and open to generalization. See, e.g., Rosenblatt (1956, 1991), Silverman (1986), Barron and Sheu (1991) and Devroye (1987). The proposed method should not be taken seriously unless a comparative analysis is made. It may seem that the purpose of providing a distributional inference about y(=x n+1 ) is best served by using the empirical distribution function Fn . A bad omen, however, is that Fn assigns credibility 0 to the interval [a; x[1] ) whereas the actual (‘physical’) probability P(Y ¡ X[1] ) = (n + 1)−1 . It is natural to require of a method providing distributional inferences about y, like our method which assigns Gn = Hn−1 , see Section 1, that the distribution of Gn ; X1 ;:::;Xn (Y ) is as close as possible to U(0; 1). One may even require that the procedure is uniformity unbiased in the sense that the distribution indicated is exactly equal to U(0; 1). Hence we may compare our method Gn = Hn−1 and the empirical distribution function Fn from the point of view
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
25
Fig. 3. Points (d5 (F; Gn ); d5 (F; G˜ n )) are plotted. Each point is based on a sample of size 100 from the distribution function F as speci ed in the text for a mixture of -distributions. The optimal bandwidth h = 0:18 is used.
of closeness of LGn (Y ) and LFn (Y ) to U(0; 1). We obviously have P(Gn (Y ) ≤ p) = EF(Hn ; X1 ;:::;Xn (p)); where Bp ∼ B(n + 1; p), as before, and P(Fn (Y ) ≤ p) = (bnpc + 1)=(n + 1); where bnpc is the entier of np. If F is the distribution function of the uniform distribution on [a; b] then F(z) = (z − a)(b − a)−1 and EX[i] = a + i(b − a)(n + 1)−1 and, hence, Gn (Y ) ∼ U(0; 1) holds exactly. This suggests that Gn will ‘in general’ be closer to uniformity unbiasedness than Fn . The main purpose of our analysis, however, is to compare our method for nonparametric density estimation, based on Gn and gn = Gn0 , with other methods, providing 0 G˜ n and g˜n = G˜ n . It turned out that the dissimilarity coecients based on a direct comparison of distribution functions (d1 ; : : : ; d4 ; d7 ) were less sensitive than those based on the densities. The coecients d5 and d6 commonly used in the theory of nonparametric density estimation (see, e.g., Devroye, 1987 for d5 and Rosenblatt, 1991 for d6 ) seemed to be most satisfactory, d8 providing degenerate results and d9 being somewhat less conclusive. With respect to the distribution-function-based coecients, not much dierence exists among the preferences for the various methods. The eect of choosing the sample is much larger than that of choosing the
26
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
Fig. 4. Points (d5 (F; Gn ); d5 (F; G˜ n )) are plotted. Each point is based on a sample of size 100 from the distribution function F as speci ed in the text for a mixture of -distributions. The bandwith for each sample was obtained by likelihood cross-validation. The average of the selected bandwidths was 0:19 with standard deviation 0:06.
coecient for comparing the inferences. Another crucial issue is the choice of competitor G˜ n ; g˜n . An expert in the area suggested to use a kernel estimate and to adapt the density obtained to the support [0; 1] given (the comparison will be made for mixtures of -distributions) by re ecting the tails about 0 and about 1 and superposing. We decided to use the IMSL routine for nonparametric density estimation using the biweight 15(1 − u2 )2 =16; −1 ¡ u ¡ 1, as kernel and the bandwidth h (or rather 2h) which is optimal for the sample size (n = 100) and the true distribution 1 Beta(1; 5) + 12 Beta(7; 2), to start with. Always using 100 dierent simulations, we 2 found that the minimum ‘average d5 ’ and ‘average d6 ’ coecients are attained at about the same value h = 0:18, the minimum for d5 being about 0:18 (see also Fig. 3 and note that the average y-coordinate is 0.18%) and that for d6 about 0.068. As the bandwidth 2h = 0:36 is considerable, it is of some interest to note that the value h = 0:06 provides an average d5 (d6 ) value of 0:26(0:12). Now the interpretation of Fig. 3 is obvious. If the bandwidth h = 0:18 is chosen such that it is optimal for the true distribution function actually approximated, then the kernel method has a better performance. If a dierent bandwidth is chosen (in practice one does not know which distribution is estimated), then the advantage of the kernel method may get lost. If, e.g., h = 0:06 is chosen then the x-coordinates (our method) in Fig. 3 are preserved but the y-coordinates become larger such that in Fig. 3 the average
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
27
y-coordinate increases from 0:18 to 0:26. The x-coordinate being about 0:21 (see Fig. 3), we infer that our method is in the lead if the bandwidth h = 0:06 is chosen for the kernel estimate. We have made a number of similar comparisons. The global conclusion was that the kernel method with optimal bandwidth is in the lead but that our method is a reasonable competitor if the bandwidth is something like a factor 2 dierent from the optimal one. In practice, the optimal one is unknown, of course. We made a comparison of our method with the likelihood cross-validation method for bandwidth determination (see Fig. 4). These results and theoretical papers (Chow et al., 1983, Hall, 1987; Marron, 1985) support our opinion which is that our method is not unreasonable though some further improvement is possible (see Section 6).
4. Adaptations If additional information is provided then this should be incorporated by modifying Gn = Hn−1 and gn , or Hn itself. As such modi cations are obvious or can be found in the literature, we focus on the problem which appears if a and b are not given a priori as nite speci ed values. A number of dierent cases can be distinguished. If a and b are nite but not known then one would like to incorporate this ‘knowledge’. This can be done by using aˆ = x[0] = x[1] − (x[2] − x[1] ) and bˆ = x[n+1] = 2x[n] − x[n−1] if a or b (or both) are not available. If the support (a; b) is not nite then, for each of the possibilities [a; ∞), (−∞; b] and (−∞; ∞), a continuous distribution function : [a; b] 7→ [0; 1] will be used to transform the problem into one with [a; b] = [0; 1]. The P sample (x[1] ); : : : ; (x[n] ) thus obtained provides the estimate n+1 i=0 (x[i] )P(Bp = i) of the quantile function (F ◦ −1 )−1 (p) = ( ◦ F −1 )(p). Transforming backward, the estimate Hn (p) = −1
n+1 X i=1
(x[i] )
n+1 i
!
pi (1 − p)n+1−i
is obtained of the quantile function F −1 (p) we are interested in. Next, Gn = Hn−1 : [a; b] 7→ [0; 1] and gn = Gn0 are determined as before. The choice of the function is a matter of concern. On the one hand, should be as simple as possible and, on the other hand, one would like (x[1] ); : : : ; (x[n] ) to resemble a sample from U(0; 1) because we know from Section 3 that our method is quite reliable, from the uniformity-unbiasedness point of view if F is the distribution function of a uniform distribution. Aiming at simplicity, we choose (z) =
z−a ; z − a + n−1 (x[n] − a)
a ¡ z ¡ ∞;
28
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
Table 2 Averages of four dissimilarity coecients. Samples of size n = 10; : : : ; 200 were drawn from the density f studied in the example of Section 1
10 d1 102 d2 102 d5 102 d6 2
10
13
25
50
75
100
150
200
197 22 31 82
179 18 31 82
131 10 27 55
97 6 23 42
88 4 22 38
76 3 20 30
64 2 19 28
57 2 18 22
in case a is nite and b = ∞. Note that (x[n] ) = 1 − (n + 1)−1 is in line with the idea of resembling a sample from U(0; 1). In the case (a; b) = (−∞; ∞), the transformation z − 12 (x[1] + x[n] ) 1 1 ; (z) = + 2 2 |z − 12 (x[1] + x[n] )| + (x[n] − x[1] )=(n − 1) ∞ ¡ z ¡ ∞, is chosen such that (x[1] ) = (n + 1)−1 and (x[n] ) = 1 − (n + 1)−1 . 5. Asymptotics The estimate Hn (p) of F −1 (p) de ned in Section 1 was studied by Munoz-Perez Rb et al. (1987). One of their results is that the integrated risk a E|F −1 (p) − Hn (p)| dp is of order n−1=2 . This implies a similar result for Ed1 (F; Gn ). As Hn (p) is a strictly increasing function of p, one should not worry too much about the potential wiggling character of gn = Gn0 : the strict positivity of gn is guaranteed. The consistency Hn0 (p) → H 0 (p) (in probability) and, hence, the consistency gn (x) → g(x) (in probability) follows, under regularity conditions, from an analysis of n X n 0 Hn (p) = pi (1 − p)n−i {(n + 1)(x[i+1] − x[i] )}: i i=0 A theoretical analysis provided the result that Ln1=4 (Hn0 (p) − H 0 (p)) → N (0; (H 0 (p))2 12 n−1=2 {p(1 − p)}−1=2 ): Our numerical experiences were such that even for n = 1000 the wiggling of gn was not alarming. This is also clear from Table 2 where estimates d are presented of the risk Ed(F; Gn ) which appears if Gn and gn of the example in Section 1 are used to estimate the distribution F mentioned at the end of that section. The estimates d were obtained (for various dissimilarity coecients) by making 100 samples of size n, computing 100 corresponding dissimilarity coecients and taking the average. Table 2 displays that, indeed, d 1 behaves somewhat like cn−1=2 while the behaviour of d 2 resembles that of dn−1 . Fortunately, the density-based coecients d 5 and d 6 display a diminishing behaviour as well. The rate of convergence towards 0 for d 5 seems slightly slower than the O(n−1=4 ) result just mentioned. Similarly the convergence rate of d 6 also seems somewhat slower than O(n−1=2 ). See Devroye and Gyor (1990) for a theoretical comment.
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
29
Leaving the development of rigorous asymptotic theory to the experts, we complete this section by focussing on an asymptotic property which is of crucial interest to establish the asymptotic validity of con dence intervals for p = F −1 (p) if these are based on the distributional inference about p discussed at the end of Section 3. Using Gn;p : [a; b] 7→ [0; 1] to denote the distribution function of this distributional inference it is straightforward (using David, 1981, p. 15) to establish that lim sup |P(Gn;p (p ) ≤ u) − u| = 0;
n→∞ u∈[0;1]
i.e. that the sequence is ‘asymptotically uniformity unbiased’. 6. Discussion Ferguson (1967) Section 4:8 contains a nice discussion about the estimation of a distribution function. The empirical distribution function Fn is suggested, together with some slight modi cations. Underlying invariance considerations produced these nice results but, on the other hand, prevented the estimates from being dierentiable. Some smoothing is necessary to obtain a nonparametric density estimate. At the end of Section 1 we noticed that nonparametric density estimates are not very reliable. It is much easier to estimate a distribution function. Nevertheless, density estimates are often needed. We focus on the estimation of F −1 (p) as the basis of everything. It is important to note that our semi-Bayesian approach is not compelling. It is not a full-Bayesian approach because prescribing a uniform prior for t = F(z) for one value of z may be reasonable but making this assumption for all values of z is peculiar. As far as we know attempts initiated by Ferguson (1973), to produce a full-Bayesian approach have not led to useful density estimates. The semi-Bayesian approach of Section 2 can be replaced by another ‘Fisherian’ one providing slightly dierent results. Instead of the Bernstein polynomial Hn (p), the slightly dierent Kantorowitz polynomial Kn (p) =
n X 1 i=0
2
(x[i] + x[i+1] )
n i
pi (1 − p)n−i
is then obtained. The fact that both Hn (p) and Kn (p) were discussed before in Munoz-Perez et al. (1987) suggests that we are dealing with a satisfactory approach to the problem of estimating an (absolutely) continuous distribution. We shall now explain why an ‘intrinsic’, i.e. ‘genuinely compelling’ approach does not exist. The main idea of the present paper is to estimate the ‘distribution’ of a future observation such that the estimates of F, f = F 0 , H = F −1 , h = F 0 and of all other ‘characteristic’ functions are probabilistically coherent. To achieve this goal we focused on the estimation of H and obtained estimators Hn ; Gn ; gn , etc., which are ‘not unreasonable’ (other authors, e.g. Azzalini, 1981, used a dierent starting point). Our estimates should not really be regarded as intrinsic or compelling. Such estimates simply do not exist. Some further improvement is always possible. If a researcher uses a kernel estimate and picks a bandwidth which is close to the optimal one (relative error less than 10%, say) then it is very likely that his estimate is closer to
30
R. de Bruin et al. / Computational Statistics & Data Analysis 30 (1999) 19–30
the true density than that provided by our method (see Fig. 3). If he uses likelihood cross-validation then, on the average, his estimate may also be better (see Fig. 4). The advantage of our method is its simplicity. The uniqueness of our density estimate gn = Gn0 implies that a complicated discussion about the selection of a kernel and a bandwidth is not necessary. Acknowledgements Some comments by K. Borovkov, H.G. Dehling, A. van Es, A. Kroese, A.J. Stam and N. Veraverbeke helped us to improve the manuscript. References Aggarwal, O.P., 1955. Some minimax invariant procedures of estimating a cumulative distribution function. Ann. Math. Statist. 26, 450–462. Azzalini, A., 1981. A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68, 326–328. Barron, A.R., Sheu, C.H., 1991. Approximating densities by exponential families. Ann. Statist. 19 (3), 1347–1369. Chow, Y.S., Geman, S., Wu, I.-D., 1983. Consistent cross-validated density estimation. Ann. Statist. 11, 25–38. David, H.A., 1981. Order Statistics, 2nd ed. Wiley, New York. Dehling, H.G., Kalma, J.N., Moes, C, Schaafsma, W., 1991. The islamic mean: a peculiar L-statistic. Studia Scientiarum Mathematicarum Hungarica, vol. 26. Akademiai Kiado, Budapest, pp. 297–308. Devroye, L., 1987. A course in density estimation, Progress in Probability and Statistics. Birkhauser, Boston. Devroye, L., Gyor , L., 1990. No empirical probability measure can converge in the total variation sense for all distributions. Ann. Statist. 18 (3), 1496–1499. Ferguson, T., 1967. Mathematical Statistics: a Decision Theoretic Approach. Academic Press, New York. Ferguson, T., 1973. A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 (2), 209–230. Hall, P., 1987. On Kullback–Leibler loss and density estimation. Ann. Statist. 15 (4), 1491–1519. IMSL, 1987. Stat=Library, Houston. Jones, M.C., 1992. Estimating densities, quantiles, quantile densities and density quantiles. Ann. Inst. Statist. Math. 44, 721–727. Marron, J.S., 1985. An asymptotically ecient solution to the bandwidth problem of kernel density estimation. Ann. Statist. 13 (3), 1011–1023. Munoz-Perez, J., Fernandez-Palacini, A., 1987. Estimating the quantile function by Bernstein polynomials. Comput. Statist. Data Anal. 5, 391–397. Rosenblatt, M., 1956. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832–837. Rosenblatt, M., 1991. Stochastic curve estimation. NSF-CBMS Regional Conf. Ser. in Probability and Statistics, vol. 3. IMS, Hayward, California. Sheather, S.J., Marron, J.S., 1990. Kernel quantile estimators. J. Amer. Statist. Assoc. 85, 410–416. Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. Vitale, R.A., 1975. A Bernstein polynomial approach to density function estimation. Stoch Proc. and Related Topics, Proc. of Summer Res. Inst. on Statistical Inference for Stochastic Processes, Bloomington, pp. 87–100.