Pattern Recognition Letters 2 (1983) 13-17 North-Holland
October 1983
Almost sure convergence of classification procedures using Hermite series density estimates

Wlodzimierz GREBLICKI and Miroslaw PAWLAK
Institute of Engineering Cybernetics, Technical University of Wroclaw, 50-370 Wroclaw, Poland

Received 7 December 1982
Revised 29 April 1983
Abstract: A multidimensional classification procedure derived from the multiple Hermite series estimate of probability density functions is examined. Conditions for the almost sure convergence of the integrated square error of the estimate are presented and the rate of the convergence is studied. The probability of misclassification, conditioned on a learning sequence of length n, is shown to converge to the Bayes risk almost surely as rapidly as O(n^{−1/2+δ}), δ positive.
Key words: Classification, discrimination, pattern recognition, nonparametric, density estimate, Hermite series.
1. Introduction

We examine a multidimensional classification procedure applying the Hermite series estimate of probability density functions. We show that if all the class densities have derivatives of the r-th order in all directions and the absolute multiplicative moment of the s-th order, then

    |L_n − R| = O(n^{−γ} log n)

almost surely, where

    γ = 2rs/((2r+1)(2s+1)),

L_n is the probability of misclassification conditioned on a learning sequence of length n, and R is the Bayes probability of error. In particular, for every positive δ, there exists a family of sufficiently regular class densities having all multiplicative moments for which

    |L_n − R| = O(n^{−1/2+δ})

almost surely. In order to establish these properties, we first study the rate at which the integrated square error for the Hermite series estimate of probability density functions converges to zero.

In further development all vectors have d dimensions and are written in boldface. Thus,

    x = (x^{(1)}, ..., x^{(d)}),

while |x| = max(|x^{(1)}|, ..., |x^{(d)}|). Writing Y_n = O(a_n) a.s., for a sequence {Y_n} of random variables, we mean that β_n Y_n / a_n → 0 almost surely as n tends to infinity, for all number sequences {β_n} convergent to zero. Moreover, a_n ~ b_n says that a_n/b_n has a nonzero limit as n tends to infinity.
2. The density estimate

Let

    h_j(y) = (2^j j! π^{1/2})^{−1/2} e^{−y²/2} H_j(y),

where

    H_j(y) = (−1)^j e^{y²} (d^j/dy^j) e^{−y²}

is the j-th Hermite polynomial. It is well known that {h_j}, j = 0, 1, ..., is a complete orthonormal system over the real line R. Moreover, {h_k}, k^{(1)}, ..., k^{(d)} = 0, 1, ..., constitutes a complete orthonormal system over R^d. Since k is a vector,

    h_k(x) = h_{k^{(1)}}(x^{(1)}) ⋯ h_{k^{(d)}}(x^{(d)}).

Let X_1, ..., X_n be a sample of independent observations of a random variable X taking values in R^d. By f we denote the density of X. The estimate of the density studied in this paper has the following form:

    f̂(x) = Σ_{|k|≤N} â_k h_k(x),

where

    â_k = n^{−1} Σ_{i=1}^{n} h_k(X_i)

is an unbiased estimate of a_k = E h_k(X). The following inequalities are implied by Theorem 8.91.3 in Szegő (1959, p. 242):

    max_y |h_j(y)| ≤ C (j+1)^{−1/12},    (1)

    max_{|y|≤A} |h_j(y)| ≤ C_A (j+1)^{−1/4}    (2)

for any nonnegative A, and

    max_{|y|≥A} |y^{−1/3} h_j(y)| ≤ C_A (j+1)^{−1/4}    (3)

for any positive A.

In the following theorem we examine the integrated square error and give conditions under which it converges to zero almost surely as n tends to infinity.

Theorem 1. Let f be square integrable. If

    N(n) → ∞ as n → ∞    (4)

and

    Σ_{n=1}^{∞} exp{−an/N^{5d/6}(n)} < ∞    (5)

for all positive a, then

    ∫ (f̂(x) − f(x))² dx → 0  a.s.

Remark. Condition (5) is satisfied if N^{5d/6}(n) log n / n → 0 as n → ∞.

Proof of Theorem 1. By virtue of Parseval's formula,

    ∫ (f̂(x) − f(x))² dx = Σ_{|k|≤N} (â_k − a_k)² + Σ_{|k|>N} a_k².    (6)

In turn,

    P{Σ_{|k|≤N} (â_k − a_k)² > t} ≤ Σ_{|k|≤N} P{(â_k − a_k)² > t [Π_{j=1}^{d} (k^{(j)}+1)^{−1/6}]/S_N},    (7)

where

    S_N = (Σ_{j=0}^{N} (j+1)^{−1/6})^d ≤ (6/5)^d (N+1)^{5d/6}.

Applying Hoeffding's (1963) inequality and (1), one can verify that the quantity on the right-hand side in (7) is not greater than

    2(N+1)^d exp{−nt/c₂(N+1)^{5d/6}},

and the proof is completed.

3. The convergence rate

In this section we study the rate at which the integrated square error converges to zero. In order to do so, we need to impose some smoothness restrictions on f. Let

    t_r(x; f) = t_r(x^{(1)}, ..., x^{(d)}; f) = [Π_{j=1}^{d} (x^{(j)} − ∂/∂x^{(j)})^r] f(x^{(1)}, ..., x^{(d)}).

For the random variable X, let μ_s denote the absolute multiplicative moment of s-th order, i.e.

    μ_s = E Π_{j=1}^{d} |X^{(j)}|^s.
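The estimate itself is easy to realize numerically. The sketch below is an illustration of ours, not part of the paper (d = 1; the truncation point, sample, and all names are our choices): it evaluates the normalized Hermite functions by their standard three-term recurrence h_{j+1}(y) = √(2/(j+1)) y h_j(y) − √(j/(j+1)) h_{j−1}(y) and forms f̂ from the empirical coefficients â_k.

```python
import numpy as np

def hermite_functions(y, N):
    """Normalized Hermite functions h_0,...,h_N at the points y, computed by
    the stable three-term recurrence; {h_j} is orthonormal on the real line."""
    y = np.asarray(y, dtype=float)
    h = np.empty((N + 1, y.size))
    h[0] = np.pi ** (-0.25) * np.exp(-y ** 2 / 2)
    if N >= 1:
        h[1] = np.sqrt(2.0) * y * h[0]
    for j in range(1, N):
        h[j + 1] = np.sqrt(2.0 / (j + 1)) * y * h[j] - np.sqrt(j / (j + 1.0)) * h[j - 1]
    return h

def hermite_density_estimate(sample, N):
    """f_hat(x) = sum_{k<=N} a_hat_k h_k(x), with a_hat_k = n^{-1} sum_i h_k(X_i)."""
    a_hat = hermite_functions(sample, N).mean(axis=1)  # empirical coefficients
    return lambda x: a_hat @ hermite_functions(np.atleast_1d(x), N)

# Example: a standard normal sample.  Since phi = a_0 h_0 exactly, the
# truncation error vanishes here and only sampling noise remains.
rng = np.random.default_rng(0)
f_hat = hermite_density_estimate(rng.standard_normal(20000), N=8)
print(f_hat(0.0).item())  # near phi(0) = 0.3989
```

For the standard normal target the expansion collapses to the single term a₀h₀ (all higher Hermite coefficients of e^{−y²/2} vanish), which makes this a convenient sanity check of an implementation.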
Theorem 2. Let both f and t_r(·; f) be square integrable. Let μ_s < ∞ for s > 16/9. If

    N(n) ~ n^{2/d(2r+1)},    (8)

then

    ∫ (f̂(x) − f(x))² dx = O(n^{−2r/(2r+1)} log n)  a.s.    (9)

The authors would like to remark that they are not aware of other results concerning the almost sure convergence of the integrated square error. It is nevertheless interesting to compare the rate in (9) with those obtained in the literature for the mean integrated square error. Rates given by Schwartz (1967) and Walter (1977) are O(n^{−(r−1)/r}) and O(n^{−(6r−5)/6r}), respectively. They are much worse than that given by us, since e.g. for r = 1 the former is useless, while the latter equals O(n^{−1/6}); the rate in (9) is O(n^{−2/3} log n). Hall's (1980) rate O(n^{−1/2}) is also worse. The rate obtained by us is closely related to that given by Greblicki (1981), who showed that under restrictions like those in Theorem 2, with s = 2/3, the mean integrated square error converges to zero as rapidly as O(n^{−2r/(2r+1)}).

Proof of Theorem 2. For simplicity of notation the proof of the theorem is given for d = 1. By virtue of Lemma 1 (see Appendix) and (8),

    Σ_{k>N} a_k² = O(n^{−2r/(2r+1)}).

Therefore, in view of (6), it suffices to verify that, for all positive t and any sequence {β_n} convergent to zero,

    Σ_{n=1}^{∞} p_n < ∞,

where

    p_n = P{(β_n n^α / log n) Σ_{k≤N} (â_k − a_k)² > t},

with α = 2r/(2r+1). Clearly

    p_n ≤ Σ_{k≤N} s_{kn},

where

    s_{kn} = P{(â_k − a_k)² > t(k+1)^{−1/2} log n / β_n n^α S_N}.

Here S_N = Σ_{k≤N} (k+1)^{−1/2}. Using Fuc–Nagaev's inequality (see Appendix), we get

    s_{kn} ≤ A_{kn} + B_{kn},

where

    A_{kn} = c₁ n^{1−q+αq/2} (k+1)^{q/4} (S_N β_n)^{q/2} E|h_k(X)|^q / (t log n)^{q/2}

and

    B_{kn} = 2 exp{−c₂ t n^{1−α} log n / β_n (k+1)^{1/2} S_N E h_k²(X)},

where q = 3s. By virtue of Lemma 2 (see Appendix), the fact that S_N ≤ 2(N+1)^{1/2}, and (8), we get

    A_{kn} = O((β_n / log n)^{q/2} n^γ),

where γ can be easily calculated. Since γ < −1,

    Σ_{n=1}^{∞} Σ_{k≤N} A_{kn} < ∞.

In turn, by virtue of Lemma 2,

    B_{kn} ≤ 2 exp{−(c₂ t / 2c) log n / β_n}.

Since β_n converges to zero, one can easily verify that

    Σ_{n=1}^{∞} Σ_{k≤N} B_{kn} < ∞.

The theorem is proved.

4. Classification
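In practice the truncation rule (8) fixes the number of series terms once r and d are known. The small helper below is our own illustration (the proportionality constant in (8), here taken as 1, is not specified by the theorem) and makes the scaling of N(n) and of the bound (9) explicit.

```python
import math

def truncation_size(n, r, d, c=1.0):
    """The choice (8): N(n) of order n^{2/(d(2r+1))}; c is a free constant."""
    return max(1, math.ceil(c * n ** (2.0 / (d * (2 * r + 1)))))

def ise_rate(n, r):
    """Order of the integrated square error bound (9): n^{-2r/(2r+1)} log n."""
    return n ** (-2.0 * r / (2 * r + 1)) * math.log(n)

# For d = 1 and r = 1 the estimate uses about n^{2/3} Hermite terms and the
# integrated square error decays like n^{-2/3} log n.
print(truncation_size(10000, r=1, d=1))  # 465
print(ise_rate(10000, r=1))
```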
In this section we estimate θ from X and a learning sequence

    V_n = {(X_1, θ_1), ..., (X_n, θ_n)}

consisting of independent observations of (X, θ). X takes values in R^d and has the density f, while θ takes values in a set {1, ..., M} whose elements will be called classes. We examine a classification procedure assigning every x ∈ R^d to a class m maximizing p̂_m f̂_m(x), where p̂_m and f̂_m(x) are estimates of the prior class probability p_m and the class conditional density f_m(x), respectively. In this paper p̂_m = N_m/n, where N_m is the number of observations from class m and

    f̂_m(x) = Σ_{|k|≤N} â_{km} h_k(x),

where

    â_{km} = n^{−1} Σ_{i=1}^{n} t_{im} h_k(X_i).

Here t_{im} = 1 if θ_i = m and t_{im} = 0 otherwise. Let θ̂_n be a decision made with the classification procedure defined above. Let

    L_n = P{θ̂_n ≠ θ | V_n}.

We study the convergence of L_n to the Bayes probability of error denoted by R. Using Theorem 1, one can easily verify that

    ∫ (p̂_m f̂_m(x) − p_m f_m(x))² dx → 0  a.s.,    (10)

provided that f_m is square integrable and (4), (5) are satisfied. We can now establish the first of two main theorems of this paper.

Theorem 3. Let all the class densities be square integrable. If (4) and (5) are satisfied, then

    L_n → R  a.s.

Proof. Arguing like in the proof of Theorem 3 in Wolverton and Wagner (1969), we get the following inequality:

    0 ≤ L_n − R ≤ Σ_{m=1}^{M} p_m ∫_{|x|>B} f_m(x) dx + [Σ_{m=1}^{M} ∫ (p̂_m f̂_m(x) − p_m f_m(x))² dx]^{1/2},    (11)

B positive. It is clear that for every positive ε there exists B such that the first term in (11) does not exceed ε. This and (10) complete the proof.

From Theorem 2 it follows that if f_m and t_r(·; f_m) are square integrable, μ_{sm} < ∞ for s > 16/9, and {N(n)} is selected according to (8), then

    ∫ (p̂_m f̂_m(x) − p_m f_m(x))² dx = O(n^{−2r/(2r+1)} log n)  a.s.    (12)

This and the next lemma establish the rate of the convergence of L_n to R.

Lemma 3. Let, for all m,

    ∫ (p̂_m f̂_m(x) − p_m f_m(x))² dx = O(a_n)  a.s.,    (13)

and let

    μ_{sm} = E{Π_{j=1}^{d} |X^{(j)}|^s | θ = m} < ∞.

Then

    |L_n − R| = O(a_n^{s/(2s+1)})  a.s.

Proof. Let us observe that the first term in (11) is bounded from above by

    B^{−sd} Σ_{m=1}^{M} p_m μ_{sm}.

Taking B = a_n^{−1/(2s+1)} and using (13) we complete the proof.

Combining Lemma 3 and (12) we get the next main result of this paper.

Theorem 4. For m = 1, ..., M, let f_m and t_r(·; f_m) be square integrable and μ_{sm} < ∞ for s > 16/9. If (8) holds, then

    |L_n − R| = O(n^{−2rs/(2r+1)(2s+1)} (log n)^{s/(2s+1)})  a.s.

Corollary. Under the restrictions of Theorem 4,

    |L_n − R| = O(n^{−2rs/(2r+1)(2s+1)} log n)  a.s.

Let us observe that the rate of convergence of L_n to R is independent of the dimension. It is interesting to compare the rate in Theorem 4 with that obtained for density estimates using the multiple Fourier series. Greblicki and Pawlak (1981), (1982) have shown that if, generally speaking, all class densities have r derivatives in all directions, then the procedure derived from the Fourier series converges almost surely as rapidly as O(n^{−r/(2r+1)} log n). It is clear that if all class densities have all multiplicative moments, the rate given in this paper is

    O(n^{−r/(2r+1)+δ} log n),

δ positive. Thus, it is close to that obtained by Greblicki and Pawlak (1981), (1982) for the procedure derived from the Fourier series. This fact is worth emphasizing, since the procedure studied in this paper can classify vectors coming from the whole vector space, while that applying the Fourier series classifies vectors belonging only to a bounded cube. It is obvious that, for every δ > 0, there exists a family of sufficiently smooth class densities, i.e. r and s, such that the rate is O(n^{−1/2+δ} log n).
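The procedure can be transcribed almost directly into code: estimate p̂_m = N_m/n, form the per-class coefficients n^{−1} Σ_i t_{im} h_k(X_i), and assign x to a class maximizing p̂_m f̂_m(x). The sketch below is our own illustration (d = 1; the class layout, sample sizes, truncation point, and all names are arbitrary choices, not taken from the paper).

```python
import numpy as np

def hermite_functions(y, N):
    """Normalized Hermite functions h_0,...,h_N at the points y (three-term recurrence)."""
    y = np.asarray(y, dtype=float)
    h = np.empty((N + 1, y.size))
    h[0] = np.pi ** (-0.25) * np.exp(-y ** 2 / 2)
    if N >= 1:
        h[1] = np.sqrt(2.0) * y * h[0]
    for j in range(1, N):
        h[j + 1] = np.sqrt(2.0 / (j + 1)) * y * h[j] - np.sqrt(j / (j + 1.0)) * h[j - 1]
    return h

class HermiteClassifier:
    """Assign x to a class m maximizing p_hat_m * f_hat_m(x), where p_hat_m = N_m/n
    and f_hat_m has the coefficients n^{-1} sum_i t_im h_k(X_i)."""
    def __init__(self, N):
        self.N = N

    def fit(self, X, theta):
        H = hermite_functions(X, self.N)              # shape (N+1, n)
        self.classes = np.unique(theta)
        # t_im = 1 iff observation i comes from class m
        self.p_hat = np.array([(theta == m).mean() for m in self.classes])
        self.coef = np.stack([(H * (theta == m)).sum(axis=1) / X.size
                              for m in self.classes])
        return self

    def predict(self, x):
        H = hermite_functions(np.atleast_1d(x), self.N)
        scores = self.p_hat[:, None] * (self.coef @ H)  # p_hat_m * f_hat_m(x)
        return self.classes[np.argmax(scores, axis=0)]

# Two univariate Gaussian classes centred at -2 and +2, equal priors.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(2, 1, 5000)])
theta = np.repeat([0, 1], 5000)
clf = HermiteClassifier(N=10).fit(X, theta)
print(clf.predict([-2.0, 2.0]))  # probe points near each centre recover their class
```

Note that, unlike a Fourier-series rule confined to a bounded cube, nothing here restricts the probe points to a compact set, which mirrors the remark above.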
5. Appendix

Lemma 1. Let both f and t_r(·; f) be square integrable. Then

    Σ_{|k|>N} a_k² ≤ ||t_r(·; f)||² (N+1)^{−rd},

where ||·|| is the L₂ norm. We omit the proof, since the lemma is a simple multidimensional generalization of Walter's (1977) result.

Lemma 2. Let μ_{s/3} < ∞, s > 0. Then

    E|h_k(X)|^s ≤ c Π_{j=1}^{d} (k^{(j)}+1)^{−s/4},

where c is independent of k.

Proof. Let a be a positive number. Clearly

    E|h_k(X)|^s = E{|h_k(X)|^s I{|X| ≤ a}}
        + E{|h_k(X) Π_{j=1}^{d} (X^{(j)})^{−1/3}|^s (Π_{j=1}^{d} |X^{(j)}|^{s/3}) I{|X| > a}}
        ≤ max_{|x|≤a} |h_k(x)|^s + μ_{s/3} max_{|x|>a} |h_k(x) Π_{j=1}^{d} (x^{(j)})^{−1/3}|^s,

where I is the indicator function. Using (2) and (3) we complete the proof.

Fuc–Nagaev's inequality. Let X, X₁, ..., X_n be independent identically distributed random variables and let EX = 0, E|X|^q < ∞, q ≥ 2. Then, for every positive t,

    P{|Σ_{i=1}^{n} X_i| ≥ nt} ≤ c₁ E|X|^q / t^q n^{q−1} + 2 exp(−c₂ n t² / EX²),

where c₁ and c₂ are positive constants independent of both n and t; see Fuc and Nagaev (1971).

References

Fuc, D.H. and S.V. Nagaev (1971). Probability inequalities for sums of independent random variables. Teor. Verojatnost. i Primenen. 16, 660-675.
Greblicki, W. (1981). Asymptotic efficiency of classifying procedures using the Hermite series estimate of multivariate probability densities. IEEE Trans. Inform. Theory 27, 364-366.
Greblicki, W. and M. Pawlak (1981). Classification using the Fourier series estimate of multivariate density functions. IEEE Trans. Systems Man Cybernet. 11, 726-730.
Greblicki, W. and M. Pawlak (1982). A classification procedure using the multiple Fourier series. Inform. Sci. 26, 115-126.
Hall, P. (1980). Estimating a density on the positive half line by the method of orthogonal series. Ann. Inst. Statist. Math. 32, 351-362.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13-30.
Schwartz, S.C. (1967). Estimation of probability density by an orthogonal series. Ann. Math. Statist. 38, 1261-1265.
Szegő, G. (1959). Orthogonal Polynomials. Amer. Math. Soc. Coll. Publ.
Walter, G.G. (1977). Properties of Hermite series estimation of probability density. Ann. Statist. 5, 1258-1264.
Wolverton, C.T. and T.J. Wagner (1969). Asymptotically optimal discriminant functions for pattern recognition. IEEE Trans. Inform. Theory 15, 258-265.