The performance of kernel density functions in kernel distribution function estimation

Statistics & Probability Letters 9 (1990) 129-132. North-Holland. February 1990.

THE PERFORMANCE OF KERNEL DENSITY FUNCTIONS IN KERNEL DISTRIBUTION FUNCTION ESTIMATION

M.C. JONES
School of Mathematical Sciences, University of Bath, Claverton Down, Bath, Avon, BA2 7AY, UK

Received February 1988; Revised July 1988

Abstract: We note that the uniform is the optimal kernel density for kernel estimation of the distribution function, or its inverse, and that several other popular kernel densities perform virtually as well. Parallels with the kernel density estimation case are made.

Keywords: canonical kernels, density estimation, distribution function estimation, optimal kernels, quantile estimation.

0167-7152/90/$3.50 © 1990, Elsevier Science Publishers B.V. (North-Holland)

1. Introduction

Suppose X_1, ..., X_n is a sample obtained by independent draws from some continuous univariate distribution with density function f and distribution function F. The kernel density estimate

    f^(x) = (nh)^{-1} sum_{i=1}^{n} K((x - X_i)/h)

is a popular nonparametric estimate of f (see e.g. Silverman, 1986). Associated with f^ is a natural estimate of F, namely

    F^(x) = ∫_{-∞}^{x} f^(t) dt = n^{-1} sum_{i=1}^{n} L((x - X_i)/h),

where

    L(x) = ∫_{-∞}^{x} K(t) dt.

The nonnegative scalar h and the function K (or L) are quantities to be chosen by the user. If K is taken to be a probability density function, f^ and F^ are themselves density and distribution functions, respectively. It is this situation that we confine attention to here, although we note that there can be advantages to be gained by relaxing this requirement (e.g. Gasser, Müller and Mammitzsch, 1985); we also take K to be symmetric about zero.

In kernel density estimation, Epanechnikov (1969) showed that although, within this class of kernels, an optimal K can be identified theoretically, the performance of many other appealing kernels is virtually as good as the best. The purpose of the current note is to make analogous observations for kernel distribution function estimation: we derive the optimal K (see also Swanepoel, 1988), which differs from that for density estimation, and show that the suboptimality of a number of other kernels is equally unimportant. The relative indifference of the performance of f^ and F^ to choice of kernel contrasts with the criticality of appropriate choice of h; the hard problems of automatically specifying good h's in practice for f^ or for F^ are not considered here.
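As a concrete illustration of the estimator F^ above (a sketch, not part of the original paper; the Gaussian kernel, the sample values and the bandwidth h are arbitrary assumptions made for demonstration), note that for the standard normal kernel K, the corresponding L is simply the standard normal distribution function:

```python
import math

def normal_L(u):
    # L(u) = integral of the standard normal kernel up to u,
    # i.e. the standard normal distribution function.
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def kernel_cdf(x, data, h, L=normal_L):
    # F^(x) = n^{-1} * sum_i L((x - X_i)/h)
    return sum(L((x - xi) / h) for xi in data) / len(data)

# Illustrative usage: an arbitrary small sample and bandwidth.
data = [0.2, -1.1, 0.9, 1.4, -0.3]
print(kernel_cdf(0.0, data, h=0.5))
```

Since K here is a density, F^ is itself a genuine continuous distribution function: nondecreasing, and running from 0 to 1.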

2. Kernel density performance

Mainly for reasons of tractability (and with no compelling argument in favour of any particular alternative loss function) we assess how well F^ estimates F by the integrated mean squared error (IMSE) given by ∫ E{F^(x) - F(x)}^2 dx. Asymptotically, as n → ∞ and h → 0, straightforward Taylor series manipulations (much as in, e.g., Azzalini, 1981) yield

    IMSE = n^{-1} ν_F - n^{-1} h ψ(K) + (1/4) h^4 μ_2^2(K) r_F;    (1)

here,

    ν_F = ∫ F(x){1 - F(x)} dx,    r_F = ∫ {f'(x)}^2 dx,    μ_2(K) = ∫ x^2 K(x) dx

and

    ψ(K) = 2 ∫ x K(x) L(x) dx.    (2a)

F having a differentiable density f with finite mean and finite r_F will suffice for (1) to hold, together with K being a symmetric density with finite variance. For such K, it is also the case that

    ψ(K) = ∫ L(x){1 - L(x)} dx.    (2b)
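The equivalence of (2a) and (2b), which follows from integration by parts, is easy to check numerically. The following sketch (not from the paper; the choice of the normal kernel, the integration range and the grid size are arbitrary) compares the two expressions for the standard normal kernel, for which ψ(K) works out to 1/√π ≈ 0.564:

```python
import math

def K(x):
    # standard normal kernel
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def L(x):
    # its distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def riemann(f, a, b, n=20000):
    # midpoint-rule quadrature, adequate for these smooth integrands
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

psi_2a = 2.0 * riemann(lambda x: x * K(x) * L(x), -10.0, 10.0)
psi_2b = riemann(lambda x: L(x) * (1.0 - L(x)), -10.0, 10.0)
print(psi_2a, psi_2b)  # both approximately 1/sqrt(pi) = 0.5642
```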

Now, n^{-1} ν_F is the IMSE of the empirical distribution function (essentially F^ with h = 0) so, since ψ(K) > 0 for symmetric K, some smoothing (h_u > h > 0 for some upper bound h_u) does indeed provide an improvement in asymptotic IMSE. Whether this provides a worthwhile improvement in practice, particularly as h must be chosen suitably, is another matter.

The traditional approach to obtaining optimal kernels starts with minimising (1) with respect to h to obtain the optimal value for h (this depends on r_F and so is not a practical proposition) and the corresponding h-optimal IMSE given by

    IMSE_h = n^{-1} ν_F - (3/4) r_F^{-1/3} {α(K)}^{4/3} n^{-4/3}.    (3)

Here,

    α(K) = ψ(K)/{μ_2(K)}^{1/2}.

K is then chosen to minimise IMSE_h by maximising α(K). However, recent work of Marron and Nolan (1988) shows how solving this maximisation problem yields the optimal kernel for all values of h and not just for the optimal one. By analogy with Marron and Nolan (1988), choose

    δ(K) = {ψ(K)}^{1/3}/{μ_2(K)}^{2/3},

so that

    IMSE = n^{-1} ν_F - {α(K_{δ(K)})}^{4/3} (n^{-1} h - (1/4) h^4 r_F)    (4)

for any h. Here,

    K_δ(x) = {δ(K)}^{-1} K({δ(K)}^{-1} x).

Thus, choosing the canonical form K_{δ(K)} of K to maximise α(K_{δ(K)}) is what is required. (Note that we should think of using K_{δ(K)} in f^ and in (1) and of measuring h relative to this scaling of K.)

At least for points x where f(x) > 0 and f'(x) ≠ 0, a similar analysis holds for pointwise MSE (e.g. Azzalini, 1981). In particular, dependence of MSE on K is just as for IMSE in (1) and (4); see also Falk (1983) and references therein. Swanepoel (1988) exhibits the same dependence on K for an alternative definition of IMSE. Note that, if K has finite support [-a, a], say, then yet another form for ψ is

    ψ(K) = a - ∫_{-a}^{a} L^2(x) dx.

The functional α(K) is invariant to changes of scale. This means that choosing K_{δ(K)} to maximise α(K_{δ(K)}) is equivalent to choosing K to maximise ψ(K) subject to μ_2(K) = constant, which we might as well take to be unity (we can rescale to get K_{δ(K)} if desired).

Result. The unique symmetric unit variance density K maximising ψ(K) is the uniform density (on [-√3, √3]).

Proof. Introduce the folded version K_0(x) = 2K(x) of K on x ≥ 0 and its distribution function L_0(x) = 2L(x) - 1. Then,

    ψ(K) = ∫_0^∞ x K_0(x) L_0(x) dx,

so that, by the Cauchy-Schwarz inequality,

    ψ(K) ≤ {∫_0^∞ x^2 K_0(x) dx}^{1/2} {∫_0^∞ K_0(x) L_0^2(x) dx}^{1/2} = {1 × (1/3)}^{1/2} = 1/√3

(the first integral is μ_2(K) = 1 and the second is ∫_0^1 u^2 du = 1/3, since dL_0(x) = K_0(x) dx), with equality if and only if x{K_0(x)}^{1/2} ∝ {K_0(x)}^{1/2} L_0(x) for all x ≥ 0, i.e., for K_0, and hence K, uniform. □

For an alternative proof of this result using the calculus of variations see Swanepoel (1988).

The uniform is not the best kernel density for density estimation. However, Bartlett (1963) and Epanechnikov (1969) (extended by Marron and Nolan, 1988) showed that a parabolic density on a finite interval (in fact, a suitably relocated and rescaled beta(2, 2) density) is optimal in that context. It is interesting to note that the uniform kernel makes for "less smooth" distribution function estimates than would use of what is often called the Epanechnikov kernel; similarly, smaller h's should be used for estimating F than are best for estimating f.

More important than this optimality result is the observation that many other popular kernels result in IMSE's negligibly worse than that associated with the uniform. This is demonstrated in Table 1. Seven kernels are considered there; the first five are simple functions on finite intervals, the sixth is the standard normal and the seventh is the Laplace. These kernel densities and their supports are given in Table 1, together with the scale factors δ(K) required to transform them to their canonical forms for distribution function estimation.

Table 1
The performance of seven popular kernels

    Kernel        K(x)                        Support      δ(K)    e(K)      φ(K)
    uniform       1/(2√3)                     [-√3, √3]    0.833   1.060     1.000 a
    triangular    (1 - |x|/√6)/√6             [-√6, √6]    0.830   1.011     0.987
    Epanechnikov  3(1 - x^2/5)/(4√5)          [-√5, √5]    0.832   1.000 a   0.995
    biweight      15(1 - x^2/7)^2/(16√7)      [-√7, √7]    0.830   1.005     0.989
    cosine        (π/4) cos(πx/(2p))/p b      [-p, p]      0.831   1.000     0.993
    normal        (2π)^{-1/2} exp(-x^2/2)     R            0.826   1.041     0.970
    Laplace       exp(-√2 |x|)/√2             R            0.809   1.247     0.893

    a The optimal choice for this functional.
    b Here, p = π/(π^2 - 8)^{1/2}.

For comparative purposes, density estimation results are given in the fifth column of the table. In that case, the ratio (called e(K)) of {∫ K^2(x) dx}^{4/5} to its value for the optimal K is given for each kernel; this represents the increase in IMSE associated with each K, compared with its optimal value over choice of K. The distribution function estimation results are given in the final column of Table 1. There, φ(K) is the ratio of {ψ(K)}^{4/3} to its value for the uniform K. These values are the proportions of the savings (= n^{-1} ν_F - IMSE) achieved by using each kernel, compared with that obtained by using the uniform. Values of φ(K) show the same kind of insignificant worsening of performance using suboptimal kernels in F^ as e(K) demonstrates for f^ (the Laplace kernel is unappealing for both). Indeed, since φ(K) is only concerned with relative savings, the effects of different K's on IMSE are even more negligible in the distribution function case.

As Silverman (1986, Section 3.3.2) points out for density estimation, this theoretical indifference to choice of K makes it especially desirable to choose K using other considerations, such as smoothness of F^ or ease of computation. We concur with Swanepoel (1988) that numbers given by Falk (1983, p. 78), which led him to conclude that the Epanechnikov kernel "is essentially better than" the uniform and normal kernels in the current context, appear to be in error.
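The δ(K) and φ(K) columns of Table 1 can be reproduced directly from (2b). A sketch (not part of the paper; the quadrature setup is an arbitrary choice) for three of the seven kernels, using their unit-variance distribution functions L:

```python
import math

SQ3, SQ5 = math.sqrt(3.0), math.sqrt(5.0)

def riemann(f, a, b, n=20000):
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# Distribution function L of each unit-variance kernel, paired with the
# half-width of the range over which psi(K) = int L(1 - L) dx is integrated.
# The uniform and Epanechnikov formulas are valid only on their supports.
kernels = {
    "uniform": (lambda x: (x + SQ3) / (2.0 * SQ3), SQ3),
    "Epanechnikov": (lambda x: 0.5 + 3.0 * x / (4.0 * SQ5)
                     - x ** 3 / (20.0 * SQ5), SQ5),
    "normal": (lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))), 10.0),
}

psis = {name: riemann(lambda x: L(x) * (1.0 - L(x)), -a, a)
        for name, (L, a) in kernels.items()}

for name, p in psis.items():
    delta = p ** (1 / 3)                    # delta(K), since mu_2(K) = 1
    phi = (p / psis["uniform"]) ** (4 / 3)  # phi(K)
    print(f"{name}: delta(K) = {delta:.3f}, phi(K) = {phi:.3f}")
```

The printed values agree with the corresponding rows of Table 1 (0.833, 0.832 and 0.826 for δ(K); 1.000, 0.995 and 0.970 for φ(K)), and in particular exhibit the uniform attaining the maximal ψ(K) = 1/√3.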

3. Further remarks

The above conclusions for choice of kernel for distribution function estimation transfer without change to the question of choice of kernel when estimating F^{-1}, the quantile function, by kernel methods. This is so whether one inverts F^ as in Azzalini (1981) or if one kernel smooths the empirical quantile function as in Falk (1984); there is, in any case, an equivalence between these two approaches.

Falk (1983, 1984) prefers to consider kernels which may take negative values and to maximise ψ(K) subject to the constraint that K has support [-1, 1]. The latter maintains the merit of minimising the variance contribution to (1) whatever h is, but has the demerit of replacing the natural scaling constraint imposed by the squared bias term by a more arbitrary one. Mammitzsch (1984) solves this maximisation problem; note that his result (when m = 1) is consistent with our calculations for the finite support kernels in Table 1.

Acknowledgements I am most grateful to Professor Steve Marron and the referee for helpful comments and to Professor Marron and Professor J.W.H. Swanepoel for providing me with preprints of their recent papers. Financial support was provided by the British Science and Engineering Research Council.


References

Azzalini, A. (1981), A note on the estimation of a distribution function and quantiles by a kernel method, Biometrika 68, 326-328.
Bartlett, M.S. (1963), Statistical estimation of density functions, Sankhyā Ser. A 25, 245-254.
Epanechnikov, V.A. (1969), Non-parametric estimation of a multivariate probability density, Theory Probab. Appl. 14, 153-158.
Falk, M. (1983), Relative efficiency and deficiency of kernel type estimators of smooth distribution functions, Statist. Neerlandica 37, 73-83.
Falk, M. (1984), Relative deficiency of kernel type estimators of quantiles, Ann. Statist. 12, 261-268.
Gasser, T., H.-G. Müller and V. Mammitzsch (1985), Kernels for nonparametric curve estimation, J. Roy. Statist. Soc. Ser. B 47, 238-252.
Mammitzsch, V. (1984), On the asymptotically optimal solution within a certain class of kernel type estimators, Statist. Decisions 2, 241-255.
Marron, J.S. and D. Nolan (1988), Canonical kernels for density estimation, to appear.
Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis (Chapman and Hall, London).
Swanepoel, J.W.H. (1988), Mean integrated squared error properties and optimal kernels when estimating a distribution function, Comm. Statist. Theory Methods, to appear.