Pattern Recognition Letters 1 (1983) 293-298
North-Holland

July 1983

Classification procedures using multivariate variable kernel density estimate

Adam KRZYZAK
School of Computer Science, McGill University, 805 Sherbrooke Street West, Montreal, P.Q., Canada H3A 2K6

Received 31 January 1983
Revised 29 March 1983
Abstract: Nonparametric classification procedures derived from the multivariate variable kernel density estimate are examined. Conditions for weak and strong Bayes risk consistency are given.
Key words: Nonparametric classification, Bayes risk consistency, density estimation, multivariate variable kernel estimate.
1. Introduction
We consider the asymptotic properties of classification procedures using the nonparametric variable kernel estimate of a multivariate probability density function. The approach to classification via nonparametric density estimation dates from the early papers of Van Ryzin (1966) and Wolverton and Wagner (1969). Several authors have studied procedures derived from Parzen kernel-type estimates (Greblicki (1978), Wolverton and Wagner (1969), Devroye (1983)), while others have used the k-nearest neighbor estimate (Greblicki (1978), Györfi (1974)). The former procedures have their parameters selected independently of the data, while the latter are based on estimates which are not themselves densities. In this paper we introduce classification procedures derived from the variable kernel density estimate, which has promising small-sample performance and is itself a density. We give sufficient conditions for weak and strong consistency of the estimate, and then prove the weak and strong Bayes risk consistency of the resulting classification procedures.
2. The estimate
Let X_1, ..., X_n be a sample of R^d-valued, independent, identically distributed random variables with common density f. The variable kernel estimate of a density function is

    f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{R_{ni}^{d}} K\left( \frac{x - X_i}{R_{ni}} \right),    (1)

where K is a given density and R_{ni} denotes the distance between X_i and its kth nearest neighbor among X_1, ..., X_n. Recently, in the field of density estimation, a great deal of attention has been devoted to improved small-sample performance and to automatization, i.e. the choice of parameters as a function of the data. Several authors have studied the choice of a smoothing parameter in the kernel estimate (e.g. Duin (1976), Scott et al. (1977), Rudemo (1982), Hall (1982)). Mack and Rosenblatt (1979) proposed the estimate

    f_n(x) = \frac{1}{n R_n^{d}(x)} \sum_{i=1}^{n} K\left( \frac{x - X_i}{R_n(x)} \right),    (2)

where R_n(x) is the distance between x and its kth nearest neighbor among X_1, ..., X_n.
Such a choice of the smoothing factor leads, however, to an estimate which is not a density. Estimate (2) becomes the k-nearest neighbor estimate of Loftsgaarden and Quesenberry (1965),

    f^{*}(x) = \frac{k}{c n R_n^{d}(x)},    (3)

if K is substituted by (1/c) I_{S(x, R_n(x))}, where I_A stands for the indicator of a set A, S(x, r) denotes the ball centered at x with radius r, and c is the volume of the unit sphere. Estimates (2) and (3) have good local properties, but since they have infinite integrals they overestimate the density in the tail area. Estimate (1) maintains the k-nearest neighbor property of local smoothing and is a density at the same time. In (1), R_{ni} tends to be large in low-density regions, spreading out the kernel, and vice versa. Estimate (1) is a simplified version of an estimate of Breiman, Meisel and Purcell (1977). A similar estimate for d = 1 was introduced by Wagner (1975). Experimental comparisons between (1) and other estimates were performed by Raatgever and Duin (1978) and Habbema et al. (1978), and their results are encouraging. So far only a few properties of (1) have been examined (uniform consistency was studied by Devroye and Penrod (1982)).
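For readers who wish to experiment with the estimates above, the following is a minimal numerical sketch, not part of the original paper, of estimate (1) with the uniform kernel K(u) = (1/c) I{‖u‖ ≤ 1} and of the k-nearest neighbor estimate (3). The function names, the brute-force nearest neighbor search and the choice of test density are illustrative assumptions only.

```python
import numpy as np
from math import gamma, pi


def unit_ball_volume(d):
    """Volume c of the unit sphere in R^d."""
    return pi ** (d / 2) / gamma(d / 2 + 1)


def kth_nn_distance(data, query, k):
    """Distance from each query point to its kth nearest neighbor among `data` (brute force)."""
    diffs = query[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))        # shape (m, n)
    return np.sort(dists, axis=1)[:, k - 1]           # kth smallest distance (1-based)


def variable_kernel_estimate(data, x, k):
    """Estimate (1) with the uniform kernel K(u) = I{||u|| <= 1} / c."""
    n, d = data.shape
    c = unit_ball_volume(d)
    # R_{ni}: distance from X_i to its kth nearest neighbor among the sample;
    # k + 1 is used because X_i itself appears in `data` at distance 0.
    r_ni = kth_nn_distance(data, data, k + 1)
    diffs = x[:, None, :] - data[None, :, :]
    inside = np.sqrt((diffs ** 2).sum(axis=-1)) <= r_ni[None, :]
    return (inside / (c * r_ni[None, :] ** d)).mean(axis=1)


def knn_estimate(data, x, k):
    """Estimate (3): f*(x) = k / (c n R_n^d(x))."""
    n, d = data.shape
    c = unit_ball_volume(d)
    r_n = kth_nn_distance(data, x, k)
    return k / (c * n * r_n ** d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal((2000, 1))           # standard normal density, d = 1
    grid = np.linspace(-3.0, 3.0, 7)[:, None]
    k = int(np.sqrt(len(sample)))
    print(variable_kernel_estimate(sample, grid, k))
    print(knn_estimate(sample, grid, k))
```

Note that estimate (1) integrates to one, while (3) does not; the sketch makes the comparison discussed above easy to reproduce on a small grid.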
3. Consistency of the estimate

Theorem 1. If

    K(x) = \frac{1}{c} I_{\{\|x\| \le 1\}}    (4)

and

    \lim_{n \to \infty} k = \infty, \qquad \lim_{n \to \infty} \frac{k}{n} = 0,    (5)

then

    |f_n(x) - f(x)| → 0  in probability as n → ∞

at every continuity point of f. If moreover

    \lim_{n \to \infty} \frac{k}{\log n} = \infty,    (6)

then

    |f_n(x) - f(x)| → 0  a.s. as n → ∞

at every continuity point of f.

Since f_n is a density, using the result of Glick (1974) we obtain the following as a simple corollary of Theorem 1:

Theorem 2. If (4) and (5) hold then

    \int |f_n(x) - f(x)| \, dx → 0  in probability as n → ∞

for any f almost everywhere continuous. If moreover

    \lim_{n \to \infty} \frac{k}{\log n} = \infty,

then

    \int |f_n(x) - f(x)| \, dx → 0  a.s. as n → ∞

for any f almost everywhere continuous.
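As an illustration of how conditions (5) and (6) can be satisfied simultaneously, one may take k = ⌈√n⌉, for which k → ∞, k/n → 0 and k/log n → ∞. The short sketch below is not from the paper; the estimator, the test density and all names are illustrative assumptions. It approximates the L1 error of Theorem 2 for a standard normal density in d = 1 under this choice of k.

```python
import numpy as np


def variable_kernel_1d(sample, x, k):
    """Estimate (1) in d = 1 with the uniform kernel K(u) = I{|u| <= 1} / 2."""
    pairwise = np.abs(sample[:, None] - sample[None, :])
    r = np.sort(pairwise, axis=1)[:, k]                # kth NN distance of each X_i (self excluded)
    inside = np.abs(x[:, None] - sample[None, :]) <= r[None, :]
    return (inside / (2.0 * r[None, :])).mean(axis=1)


def l1_error(sample, k, grid):
    """Riemann-sum approximation of the integral of |f_n - f| for a standard normal f."""
    f_hat = variable_kernel_1d(sample, grid, k)
    f_true = np.exp(-grid ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    return np.abs(f_hat - f_true).sum() * (grid[1] - grid[0])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grid = np.linspace(-5.0, 5.0, 2001)
    for n in (200, 800, 3200):
        k = int(np.ceil(np.sqrt(n)))                   # k -> oo, k/n -> 0, k/log n -> oo
        sample = rng.standard_normal(n)
        print(n, k, round(float(l1_error(sample, k, grid)), 4))
```

The printed L1 errors should shrink as n grows, in line with Theorem 2; the particular rate is not claimed here.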
4. Proofs

In what follows S(x, r) denotes the ball centered at x with radius r. All integrals are taken over the whole space.

Lemma 1. When (5) holds, then for all ε > 0 there exist c(x) > 0 and an integer N(x) such that

    P\{ |f^{*}(x) - f(x)| > \varepsilon \} \le 2 \exp(-c(x) k),    n ≥ N(x),

at every continuity point of f, where

    c(x) = \varepsilon^{2} \min\left\{ \frac{1}{(f(x)+\varepsilon)(4 f(x)+3\varepsilon)}, \; \frac{1}{4 (f(x)-\varepsilon)(8 f(x)+5\varepsilon)} \right\}.

Proof. Lemma 1 follows from Theorem 4.11 in Devroye and Wagner (1976).
Corollary 1. When (5) holds it follows from Lemma 1 that

    |f^{*}(x) - f(x)| → 0  in probability as n → ∞    (7)

at every continuity point of f. If moreover (6) holds, then Lemma 1 implies that

    |f^{*}(x) - f(x)| → 0  a.s. as n → ∞    (8)

at every continuity point of f.

Remark 1. One may prove that assumption (5) implies (7) for almost all x (with respect to Lebesgue measure). If moreover

    \lim_{n \to \infty} \frac{k}{\log\log n} = \infty,    (9)

then (8) holds for almost all x.

The next lemma is a more refined version of Lemma 2 in Devroye and Penrod (1982).

Lemma 2. There exists a constant N depending only upon d such that

    \sum_{i=1}^{n} I\{\|x - X_i\| \le R_{ni}\} \le N k,

where X_k(x) denotes the kth nearest neighbor to x among X_1, ..., X_n (so that R_{ni} = ‖X_i - X_k(X_i)‖).

Proof. Let R^d be covered by N cones centered at x having the property: if y, z belong to the same cone, then ‖x - y‖ ≤ ‖x - z‖ implies ‖y - z‖ ≤ ‖x - z‖. Among the X's in the same cone only the first k nearest neighbors to x contribute to \sum_{i=1}^{n} I\{\|x - X_i\| \le R_{ni}\}.

Lemma 3. Condition (5) implies

    \sup_{y: \|x - y\| \le R_n(y)} R_n(y) → 0  a.s. as n → ∞

at every point x ∈ supp μ, where μ denotes the probability measure of X.

Proof. Let x be fixed and cover R^d by N cones centered at x having the same property as in the proof of Lemma 2. Then for any x ∈ supp μ we have

    \sup_{y: \|x - y\| \le R_n(y)} R_n(y) \le \sum_{i=1}^{N} \sup_{y: \|x - y\| \le R_n(y), \, y \in i\text{th cone}} R_n(y).

Without loss of generality we consider only the case i = 1. By Lemma 2 we have

    \sup_{y: \|x - y\| \le R_n(y), \, y \in \text{first cone}} R_n(y) \le \sup_{y: \|x - y\| \le R_n(y), \, y \in S(x, H_n)} R_n(y) \le \sup_{y \in S(x, H_n)} R_n(y),

where H_n = ‖x - X'_{k+1}(x)‖ and X'_i(x) denotes the ith nearest neighbor to x among those X_j belonging to the first cone. Thus, for all δ > 0, Lemma 3 in Devroye and Penrod (1982) implies

    P\left\{ \sup_{y \in S(x, h)} R_n(y) > \delta \;\middle|\; H_n = h \right\} → 0  a.s. as n → ∞    (10)

for any h > 0. Application of the Lebesgue Dominated Convergence Theorem to (10) completes the proof of Lemma 3.

Proof of Theorem 1. We note that

    |f_n(x) - f(x)| \le A_n + B_n + C_n,

where

    A_n = \frac{1}{k} \sum_{i=1}^{n} |f^{*}(X_i) - f(X_i)| \, I\{\|x - X_i\| \le R_{ni}\},

    B_n = \frac{1}{k} \sum_{i=1}^{n} |f(X_i) - f(x)| \, I\{\|x - X_i\| \le R_{ni}\},

    C_n = f(x) \left| \frac{1}{k} \sum_{i=1}^{n} I\{\|x - X_i\| \le R_{ni}\} - 1 \right|.
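The decomposition above can be checked directly (a short verification added here for clarity; it is not displayed in the original). Under the uniform kernel (4), estimate (1) can be written, using (3) with R_n(X_i) = R_{ni}, as

    f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{c R_{ni}^{d}} I\{\|x - X_i\| \le R_{ni}\} = \frac{1}{k} \sum_{i=1}^{n} f^{*}(X_i) \, I\{\|x - X_i\| \le R_{ni}\},

and the triangle inequality |f*(X_i) - f(x)| ≤ |f*(X_i) - f(X_i)| + |f(X_i) - f(x)|, together with the term f(x) |(1/k) Σ_i I{‖x - X_i‖ ≤ R_{ni}} - 1| accounting for the number of contributing points, gives |f_n(x) - f(x)| ≤ A_n + B_n + C_n.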
From now on we assume that x ∈ supp μ.

Part 1. Consider A_n. For every ε > 0, Lemma 2 implies

    P\{A_n > \varepsilon\} \le \sum_{j=1}^{N} P\left\{ \frac{1}{k} \sum_{i=1}^{k} |f^{*}(X_{ij}(x)) - f(X_{ij}(x))| \, I\{\|x - X_{ij}(x)\| \le R_n(X_{ij}(x))\} > \varepsilon / N \right\},    (11)

where X_{ij}(x) denotes the ith nearest neighbor to x among the random variables X_1, ..., X_n belonging to the jth cone. If x is a continuity point of f then there exists δ_x > 0 such that f is uniformly continuous on S(x, δ_x). The right-hand side of (11) does not exceed

    \sum_{j=1}^{N} \sum_{i=1}^{k} E \, P\{ |f^{*}(y) - f(y)| \, I\{\|x - y\| \le \delta_x\} > \varepsilon / N \mid X_{ij}(x) = y \} + P\left\{ \sup_{y: \|x - y\| \le R_n(y)} R_n(y) > \delta_x \right\}
        \le 2 N k \exp(-\bar{c} k) + P\left\{ \sup_{y: \|x - y\| \le R_n(y)} R_n(y) > \delta_x \right\},

where \bar{c} = \sup_{S(x, \delta_x)} c(x). Since k → ∞ the first term tends to 0, and application of Lemma 3 completes the proof of Part 1.

Part 2. Consider B_n. Let δ be so small that |f(x) - f(y)| < ε whenever ‖x - y‖ ≤ δ. Then for η = 2Nε Lemma 2 yields

    P\{B_n > \eta\} \le \sum_{j=1}^{N} P\left\{ \frac{1}{k} \sum_{i=1}^{k} |f(X_{ij}(x)) - f(x)| \, I\{\|x - X_{ij}(x)\| \le R_n(X_{ij}(x))\} > \eta / N \right\}
        \le \sum_{j=1}^{N} P\left\{ \frac{1}{k} \sum_{i=1}^{k} |f(X_{ij}(x)) - f(x)| \, I\{\|x - X_{ij}(x)\| \le \delta\} > 2\varepsilon \right\} + P\left\{ \sup_{y: \|x - y\| \le R_n(y)} R_n(y) > \delta \right\},

and the proof of Part 2 follows from Lemma 3, since the first sum is equal to 0.

Part 3. Consider C_n. Let ε > 0 be arbitrary and let δ be such that for some T > 0

    |f(x) - f(y)| < ε/T  whenever ‖x - y‖ ≤ δ.

Let D_n be the event

    \left[ \sum_{j=1}^{n} |f^{*}(X_j) - f(X_j)| \, I\{\|x - X_j\| \le R_{nj}\} > \varepsilon / T \right]

and let E_n be the event

    \left[ \sum_{j=1}^{n} |f(X_j) - f(x)| \, I\{\|x - X_j\| \le R_{nj}\} > \varepsilon / T \right].

If f(x) ≤ ε then the proof is completed by using Lemma 2. Let us assume that f(x) > ε. It is clear that

    P\{C_n > 2\varepsilon/T\} \le P\{N f(x) I_{D_n} > \varepsilon/T\} + P\{N f(x) I_{E_n} > \varepsilon/T\} + P\{C_n I_{D_n^{c} \cap E_n^{c}} > \varepsilon/T\}.    (12)

To complete the proof it is sufficient to consider only the third term on the right-hand side of (12). On the set D_n^{c} ∩ E_n^{c} we have, provided that ‖x - X_i‖ ≤ R_{ni},

    |f^{*}(X_i) - f(x)| \le 2\varepsilon / T.

This in turn implies

    R_{-} := \left( \frac{k}{c n (f(x) + 2\varepsilon/T)} \right)^{1/d} \le R_{ni} \le \left( \frac{k}{c n (f(x) - 2\varepsilon/T)} \right)^{1/d} =: R_{+}.

Therefore

    P\{C_n I_{D_n^{c} \cap E_n^{c}} > \varepsilon/T\} \le P\left\{ \max\left( \frac{1}{k} \sum_{j=1}^{n} I\{\|x - X_j\| \le R_{+}\} - 1, \; 1 - \frac{1}{k} \sum_{j=1}^{n} I\{\|x - X_j\| < R_{-}\} \right) > \varepsilon/T \right\} = P\{W_n(x) > \varepsilon/T\}.
After some calculations we get

    W_n(x) \le \frac{3\varepsilon}{T f(x) - \varepsilon} + \max\left( \frac{n}{k} |\mu_n(S(x, R_{+})) - \mu(S(x, R_{+}))|, \; \frac{n}{k} |\mu_n(S(x, R_{-})) - \mu(S(x, R_{-}))| \right),    (13)

where

    \mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \in A\}.

Choosing T large enough we can make the first term on the right-hand side of (13) arbitrarily small. Next, for the second term we have, for any r, η > 0,

    P\left\{ \frac{n}{k} |\mu_n(S(x, R)) - \mu(S(x, R))| > \eta \right\} \le 2 \exp(-\gamma k),    (14)

where the constant γ depends only on r and η. In the last step we used Bennett's inequality (see Devroye and Wagner (1976)). This completes the proof of Part 3 and thus the proof of Theorem 1.

Remark 2. During recent discussions with Prof. Luc Devroye it turned out that the conditions in Theorem 2 may be considerably weakened:

Theorem 2a. Let K be a Riemann integrable function with compact support. If (5) holds, then

    \int |f_n(x) - f(x)| \, dx → 0  in probability as n → ∞.

If in addition (9) is satisfied, then

    \int |f_n(x) - f(x)| \, dx → 0  a.s. as n → ∞.

Note that in Theorem 2a no conditions at all are imposed on the density.

5. Classification

In classification θ takes values in the set {1, ..., M}, whose elements are called classes, while X is an R^d-valued random vector. Having observed X and

    V_n = \{(X_1, \theta_1), \dots, (X_n, \theta_n)\},

a sequence of independent, identically distributed random variables (the learning sequence), we estimate θ. Let ψ_n be an empirical classification rule defined for all x from R^d and all realizations of the learning sequence V_n, taking values in {1, ..., M}. The local performance of ψ_n is measured by

    L_n(x) = P\{\psi_n(X) \ne \theta \mid V_n, X = x\},

while the global performance by

    L_n = E(L_n(X) \mid V_n).

L*(x) and L* denote the local Bayes probability of error and the Bayes probability of error, respectively. In this section we examine a classification rule which classifies every x ∈ R^d into any class i which maximizes p_{in} f_{in}(x), where p_{in} is an estimate of the prior class probability p_i = P{θ = i} and f_{in}(x) is an estimate of the conditional density of X given θ = i. Van Ryzin (1966) has given the following bounds on the probabilities of misclassification:

    L_n(x) - L^{*}(x) \le \sum_{i=1}^{M} |p_i f_i(x) - p_{in} f_{in}(x)|,    (15)

    L_n - L^{*} \le \sum_{i=1}^{M} \int |p_i f_i(x) - p_{in} f_{in}(x)| \, dx.    (16)

p_i is estimated in an obvious manner by N_i/n, where N_i is the number of observations from class i, while the estimates of f_i(x) are of the form

    f_{in}(x) = \frac{1}{N_i} \sum_{j: \theta_j = i} \frac{1}{R_{nj}^{d}} K\left( \frac{x - X_j}{R_{nj}} \right),    (17)

with R_{nj} computed, as in (1), from the N_i observations belonging to class i. The empirical classification rule classifies every x as coming from any class which maximizes p_{in} f_{in}(x). In light of the results of Greblicki (1978) and Devroye and Wagner (1976), if f_{in}(x) is pointwise or almost everywhere consistent, then the left-hand sides of (15) and (16) are consistent. The asymptotic behaviour of the classification rule is described in the next two theorems.
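To make the rule concrete before stating the theorems, the following is a minimal sketch, not part of the original paper, of the plug-in classifier: priors are estimated by N_i/n, the class-conditional densities by the variable kernel estimate with the uniform kernel as in (17) computed within each class, and x is assigned to a class maximizing p_{in} f_{in}(x). The function names, the within-class choice of k and the Gaussian test data are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi


def unit_ball_volume(d):
    return pi ** (d / 2) / gamma(d / 2 + 1)


def class_density_estimate(class_data, x, k):
    """Variable kernel estimate of one class-conditional density (uniform kernel)."""
    n_i, d = class_data.shape
    c = unit_ball_volume(d)
    pairwise = np.sqrt(((class_data[:, None, :] - class_data[None, :, :]) ** 2).sum(-1))
    r = np.sort(pairwise, axis=1)[:, k]                # kth NN distance within the class (self excluded)
    inside = np.sqrt(((x[:, None, :] - class_data[None, :, :]) ** 2).sum(-1)) <= r[None, :]
    return (inside / (c * r[None, :] ** d)).mean(axis=1)


def plug_in_classifier(train_x, train_y, x, k):
    """Assign each row of x to a class maximizing p_in * f_in(x)."""
    classes = np.unique(train_y)
    n = len(train_y)
    scores = np.empty((len(x), len(classes)))
    for j, cls in enumerate(classes):
        class_data = train_x[train_y == cls]
        p_in = len(class_data) / n                     # prior estimated by N_i / n
        scores[:, j] = p_in * class_density_estimate(class_data, x, k)
    return classes[np.argmax(scores, axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m = 300
    class0 = rng.normal(loc=-1.0, size=(m, 2))
    class1 = rng.normal(loc=1.0, size=(m, 2))
    train_x = np.vstack([class0, class1])
    train_y = np.repeat([0, 1], m)
    k = int(np.sqrt(m))                                # k chosen within each class sample
    test = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
    print(plug_in_classifier(train_x, train_y, test, k))
```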
Theorem 3. If (4) and (5) are satisfied then for classification rule (17)

    L_n(x) → L^{*}(x)  in probability as n → ∞
at every continuity point of f. If in addition (6) holds, then

    L_n(x) → L^{*}(x)  a.s. as n → ∞

at every continuity point of f.

Theorem 4. If (4) and (5) are satisfied and f is almost everywhere continuous, then

    L_n → L^{*}  in probability as n → ∞.

If in addition (6) holds, then

    L_n → L^{*}  a.s. as n → ∞.

Remark 3. Using Theorem 2a we may substantially generalize Theorem 4:

Theorem 4a. If (4) and (5) hold then

    L_n → L^{*}  in probability as n → ∞.

If in addition (9) holds, then

    L_n → L^{*}  a.s. as n → ∞.

Acknowledgements

The author is grateful to Prof. Luc Devroye for his valuable remarks during long discussions.
References

Breiman, L., W. Meisel and E. Purcell (1977). Variable kernel estimates of multivariate densities. Technometrics 19, 261-270.

Devroye, L. and T.J. Wagner (1976). Nonparametric discrimination and density estimation. Technical Report, Information Systems Research Laboratory, University of Texas at Austin, Texas.
Devroye, L. and C.S. Penrod (1982). The strong uniform convergence of multivariate variable kernel density estimates. Technical Report, Applied Research Laboratories, The University of Texas at Austin, Texas.

Devroye, L. (1983). The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. Ann. Statist., to appear.

Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Trans. Comput. 25, 1175-1179.

Glick, N. (1974). Consistency conditions for probability estimators and integrals of density estimators. Utilitas Math. 6, 61-74.

Greblicki, W. (1978). Pattern recognition procedures with nonparametric density estimates. IEEE Trans. Systems Man Cybernet. 8, 809-812.

Györfi, L. (1974). Estimation of probability density and optimal decision function in RKHS. In: J. Gani, K. Sarkadi, I. Vincze, eds., Progress in Statistics, 281-301. North-Holland, Amsterdam.

Habbema, J.D.F., J. Hermans and J. Remme (1978). Variable kernel density estimation in discriminant analysis. In: L.C.A. Corsten and J. Hermans, eds., COMPSTAT. Physica-Verlag, Wien.

Hall, P. (1982). Cross-validation in density estimation. Biometrika 69, 383-390.

Loftsgaarden, D.O. and C.P. Quesenberry (1965). A nonparametric estimate of a multivariate density function. Ann. Math. Statist. 36, 1049-1051.

Mack, Y.P. and M. Rosenblatt (1979). Multivariate k-nearest neighbor density estimates. J. Mult. Anal. 9, 1-15.

Raatgever, J.W. and R.P.W. Duin (1978). On the variable kernel model for multivariate nonparametric density estimation. In: L.C.A. Corsten and J. Hermans, eds., COMPSTAT. Physica-Verlag, Wien.

Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist. 9, 1-15.

Scott, D.W., R.A. Tapia and J.R. Thompson (1977). Kernel density estimation revisited. Nonlinear Analysis 1, 339-372.

Van Ryzin, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhya 8, 261-270.

Wagner, T.J. (1975). Nonparametric estimates of probability densities. IEEE Trans. Inform. Theory 21, 438-440.

Wolverton, C.T. and T.J. Wagner (1969). Asymptotically optimal discriminant functions for pattern classification. IEEE Trans. Inform. Theory 15, 258-265.