Using a stopping rule to determine the size of the training sample in a classification problem


Statistics & Probability Letters 37 (1998) 19-27

Subrata Kundu a, Adam T. Martinsek b,*

a Department of Statistics and Applied Probability, University of California at Santa Barbara, CA 93106-3110, USA
b Department of Statistics, University of Illinois, 101 Illini Hall, 725 S. Wright Street, Champaign, IL 61820, USA

* Corresponding author.

Received November 1996; received in revised form May 1997

Abstract

The problem of determining the size of the training sample needed to achieve sufficiently small misclassification probability is considered. The appropriate sample size is approximated using a stopping rule. The proposed procedure is asymptotically optimal. © 1998 Elsevier Science B.V.

AMS classification: 62L10; 62L12

Keywords: Classification; Discrimination; Pattern recognition; Density estimate; Stopping rule

1. Introduction

Pattern recognition based on nonparametric density estimates has been investigated by a number of authors, including Van Ryzin (1966), Wolverton and Wagner (1969), Györfi (1974), Devroye and Wagner (1976), Greblicki (1978a, b), Krzyzak (1983), and Greblicki and Pawlak (1983). The basic problem is as follows. One observes a "training sample" $(X_1, \theta_1), (X_2, \theta_2), \ldots, (X_n, \theta_n)$ of independent, identically distributed (i.i.d.) random vectors, where the $\theta_i$ take values $1, 2, \ldots, M$ and $X_i$ comes from the $j$th population if $\theta_i = j$. The unknown probability density function of the $j$th population is denoted by $f_j$. Based on the training sample, for which one can observe the population from which each $X_i$ comes, one constructs density estimates $\hat f_{1n}, \ldots, \hat f_{Mn}$ and estimates $\hat p_{jn}$ of the unknown population probabilities $p_j$ (i.e., $P(\theta_i = j) = p_j$). These estimates are used to construct a classification rule for any subsequent observation $X$ whose $\theta$ is unknown. The rule classifies $X$ as coming from the $k$th population if
$$\hat p_{kn}\,\hat f_{kn}(X) = \max_j \hat p_{jn}\,\hat f_{jn}(X). \tag{1}$$
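As an illustration only (no code appears in the original paper), the following Python sketch implements the plug-in rule (1) with $\hat p_{jn} = N_{jn}/n$ and class-wise kernel density estimates of the form used in Section 2, with the $N_{jn}^{-1/5}$ bandwidth rate adopted there; the Epanechnikov kernel is one choice satisfying the assumptions on $K$, and all function names and the toy data are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric, bounded kernel with compact support, as assumed for K."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde(x, sample, h):
    """Kernel density estimate at points x from a one-dimensional sample with bandwidth h."""
    x = np.atleast_1d(x)
    u = (x[:, None] - sample[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (len(sample) * h)

def classify(x, X, theta, M, c=1.0):
    """Plug-in rule (1): assign x to the class k maximizing p_hat_k * f_hat_k(x).

    (X, theta) is the training sample; p_hat_j = N_jn / n and the bandwidth is
    h = c * N_jn**(-1/5), the rate used in Section 2.  Assumes every class
    appears at least once in the training sample.
    """
    n = len(X)
    scores = np.empty((M, np.atleast_1d(x).size))
    for j in range(1, M + 1):
        Xj = X[theta == j]
        N_jn = len(Xj)
        h = c * N_jn ** (-0.2)
        scores[j - 1] = (N_jn / n) * kde(x, Xj, h)
    return 1 + np.argmax(scores, axis=0)

# Toy usage: two normal populations with mixing probabilities (0.4, 0.6).
rng = np.random.default_rng(0)
theta = rng.choice([1, 2], size=500, p=[0.4, 0.6])
X = np.where(theta == 1, rng.normal(-1.0, 1.0, 500), rng.normal(1.5, 1.0, 500))
print(classify(np.array([-1.0, 2.0]), X, theta, M=2))
```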


This rule is designed to approximate the Bayes rule that assigns $X$ to the $k$th population if
$$p_k f_k(X) = \max_j p_j f_j(X), \tag{2}$$
which is unavailable in practice since the $p_j$ and $f_j$ are unknown. Van Ryzin (1966) has shown that if $L^*$ denotes the probability of error using the Bayes rule (2) and if $L_n$ denotes the probability of error using the approximation (1), then
$$L_n - L^* \le \sum_{j=1}^M \left[ p_j \int |\hat f_{jn}(x) - f_j(x)|\,dx + |\hat p_{jn} - p_j| \int \hat f_{jn}(x)\,dx \right]. \tag{3}$$
Many authors, including those above, have used (3) to show that
$$L_n - L^* \to 0 \quad \text{a.s.} \tag{4}$$

as $n \to \infty$. That is, as the size of the training sample becomes infinite, the error probability of the rule (1) approaches that of the optimal, Bayes rule. It is natural to ask how large the training sample should be to bring the error probability of the rule (1) to within a prespecified amount, say $\varepsilon$, of the optimal error probability. In Section 2 we propose a stopping rule, based on kernel density estimates, that determines the size of the training sample for this purpose. We show that as the bound $\varepsilon$ decreases to zero, the excess error probability is indeed asymptotically no larger than $\varepsilon$. Moreover, the size of the training sample produced by the stopping rule is asymptotically equivalent to the size that would be used if the relevant parameters were known ("sample-size efficiency").

2. Determining the size of the training sample with a stopping rule

We will consider estimation of the densities $f_j$ by kernel estimates of the form

$$\hat f_{jn}(x) = (N_{jn} h_{N_{jn}})^{-1} \sum_{i=1}^{n} K\big((x - X_i)/h_{N_{jn}}\big)\, 1_{\{\theta_i = j\}}, \tag{5}$$

where $N_{jn}$ is the (random) number of observations among the first $n$ in the training sample that come from the $j$th population, $K$ is a symmetric, bounded probability density function with compact support, and $h_{N_{jn}} > 0$ is the bandwidth. Such estimates were first proposed by Rosenblatt (1956) and Parzen (1962). We will assume that the densities $f_j$ are twice differentiable with second derivatives satisfying
$$\sup_x |f_j''(x)| + \sup_{x \ne y} |f_j''(x) - f_j''(y)|/|x - y| < \infty, \tag{6}$$
$$\int |f_j''| < \infty \tag{7}$$
and
$$\int \sqrt{f_j} < \infty. \tag{8}$$

Then by results of Devroye and Györfi (1985), if $h_{N_{jn}} \to 0$ and $N_{jn} h_{N_{jn}} \to \infty$ as $n \to \infty$,
$$E \int |\hat f_{jn}(x) - f_j(x)|\,dx \le \alpha \big(N_{jn} h_{N_{jn}}\big)^{-1/2} \int \sqrt{f_j} + (\beta/2)\, h_{N_{jn}}^2 \int |f_j''| + o\Big(h_{N_{jn}}^2 + \big(N_{jn} h_{N_{jn}}\big)^{-1/2}\Big), \tag{9}$$


where
$$\alpha = \left( \int K^2 \right)^{1/2}, \qquad \beta = \int x^2 K(x)\,dx. \tag{10}$$

It is well known that from various points of view the optimal rate of decay for $h_{N_{jn}}$ is $N_{jn}^{-1/5}$ (note, for example, that this rate optimizes the rate of the upper bound in (9)), so we will set $h_{N_{jn}} = c N_{jn}^{-1/5}$ for a positive constant $c$. Then (9) becomes
$$E \int |\hat f_{jn}(x) - f_j(x)|\,dx \le \left[ \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right] N_{jn}^{-2/5} + o\big(N_{jn}^{-2/5}\big). \tag{11}$$

It is shown by Kundu and Martinsek (1994) that (11) holds almost surely as well, that is,
$$\limsup_{n \to \infty} \left( \int |\hat f_{jn}(x) - f_j(x)|\,dx \right) N_{jn}^{2/5} \le \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \quad \text{a.s.} \tag{12}$$
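As a quick numerical illustration (again not part of the paper), the following sketch checks the bound (12) by simulation for a standard normal density and the Epanechnikov kernel. The closed forms $\int \sqrt{f} = (8\pi)^{1/4}$ and $\int |f''| = 4\varphi(1)$ for the standard normal, and the values of $\alpha$ and $\beta$ for this kernel, are standard computations; the grid, sample sizes and seed are arbitrary choices of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def epanechnikov(u):
    """Symmetric, bounded kernel with compact support."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde_on_grid(sample, h, grid, block=2000):
    """Kernel density estimate on a grid, accumulated in blocks to limit memory use."""
    out = np.zeros_like(grid)
    for start in range(0, len(sample), block):
        u = (grid[:, None] - sample[start:start + block][None, :]) / h
        out += epanechnikov(u).sum(axis=1)
    return out / (len(sample) * h)

c = 1.0
grid = np.linspace(-6.0, 6.0, 1501)
dx = grid[1] - grid[0]
true_f = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)

# Right-hand side of (12): alpha = (int K^2)^(1/2) = sqrt(0.6), beta = int x^2 K = 0.2
# for the Epanechnikov kernel; int sqrt(f) = (8*pi)**0.25, int |f''| = 4*phi(1) for N(0,1).
alpha, beta = np.sqrt(0.6), 0.2
rhs = alpha * c**-0.5 * (8.0 * np.pi) ** 0.25 \
      + 0.5 * beta * c**2 * 4.0 * np.exp(-0.5) / np.sqrt(2.0 * np.pi)

for N in (500, 5000, 20000):
    h = c * N ** -0.2                         # the N^(-1/5) bandwidth used in the paper
    errs = [np.sum(np.abs(kde_on_grid(rng.normal(size=N), h, grid) - true_f)) * dx
            for _ in range(3)]
    print(N, round(np.mean(errs) * N ** 0.4, 3), "<=", round(rhs, 3))
```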

This inequality can be used to suggest an appropriate size of the training sample, as follows. From (3),
$$L_n - L^* \le \sum_{j=1}^M \left[ p_j \int |\hat f_{jn}(x) - f_j(x)|\,dx + |\hat p_{jn} - p_j| \int \hat f_{jn}(x)\,dx \right] = \sum_{j=1}^M \left[ p_j \int |\hat f_{jn}(x) - f_j(x)|\,dx + |\hat p_{jn} - p_j| \right] \quad \text{a.s.}, \tag{13}$$

because $\hat f_{jn}$ is a density. It is natural to put $\hat p_{jn} = N_{jn}/n$, so that the almost sure rate of convergence of $|\hat p_{jn} - p_j|$ is $\sqrt{\log\log n / n}$, which is faster than $N_{jn}^{-2/5}$. Therefore, in order to bound $L_n - L^*$ above by $\varepsilon$, asymptotically it would be enough to bound
$$\sum_{j=1}^M p_j \int |\hat f_{jn}(x) - f_j(x)|\,dx$$
above by $\varepsilon$, so an appropriate size for the training sample would be the smallest positive integer $n^*_\varepsilon$ for which
$$\sum_{j=1}^M p_j \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) N_{jn}^{-2/5} < \varepsilon. \tag{14}$$
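As an aside that is not in the original paper, the two integrals appearing in (14) have closed forms when the $f_j$ are normal densities, which makes the oracle sample size concrete: if $f_j$ is the $N(\mu_j, \sigma_j^2)$ density, then
$$\int \sqrt{f_j} = \big(8\pi\sigma_j^2\big)^{1/4}, \qquad \int |f_j''| = \frac{4\varphi(1)}{\sigma_j^2} \approx \frac{0.968}{\sigma_j^2},$$
where $\varphi$ denotes the standard normal density. Substituting these together with $N_{jn} \approx p_j n$ into (14) gives, roughly,
$$n^*_\varepsilon \approx \varepsilon^{-5/2} \left[ \sum_{j=1}^M p_j^{3/5} \left( \alpha c^{-1/2} \big(8\pi\sigma_j^2\big)^{1/4} + \frac{2\beta c^2 \varphi(1)}{\sigma_j^2} \right) \right]^{5/2},$$
consistent with the exact asymptotics in (33) below.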

Of course the (random) sample size $n^*_\varepsilon$ cannot be used in practice because it depends on the unknown $p_j$ and $f_j$. However, (14) suggests the stopping rule
$$T_\varepsilon = \text{first } n \text{ such that } \sum_{j=1}^M (N_{jn}/n) \left[ \alpha c^{-1/2} \int_{-d_n}^{d_n} \sqrt{\tilde f_{jn}} + (\beta c^2/2) \int_{-\tilde d_n}^{\tilde d_n} |\check f_{jn}''| + N_{jn}^{-\xi} \right] N_{jn}^{-2/5} \le \varepsilon, \tag{15}$$



where $\tilde f_{jn}$ and $\check f_{jn}$ are kernel estimates of $f_j$ with possibly different bandwidths from those in $\hat f_{jn}$, and $d_n$ and $\tilde d_n$ are sequences of real numbers going to infinity as $n \to \infty$. The term $N_{jn}^{-\xi}$ for $\xi > 0$ is added to ensure that $T_\varepsilon \to \infty$ a.s. as $\varepsilon \to 0$ and for technical reasons in the proof of (17) below. The following theorem summarizes the performance of the stopping rule $T_\varepsilon$.
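Before stating the theorem, and purely as an illustration that is not part of the paper, here is a rough Python sketch of how $T_\varepsilon$ might be computed. The biweight kernel, the constants $c$ and $\xi$, the truncation sequence, the numerical integration grid and the batch size are all arbitrary assumptions of the sketch, while the bandwidth rates follow the Theorem below.

```python
import numpy as np

# Biweight (quartic) kernel -- a symmetric, bounded pdf with compact support --
# and its second derivative, used to estimate f_j and f_j'' respectively.
def quartic(u):
    return (15.0 / 16.0) * (1.0 - u**2) ** 2 * (np.abs(u) <= 1.0)

def quartic_dd(u):
    return (15.0 / 4.0) * (3.0 * u**2 - 1.0) * (np.abs(u) <= 1.0)

UG = np.linspace(-1.0, 1.0, 2001)
ALPHA = np.sqrt(np.sum(quartic(UG) ** 2) * (UG[1] - UG[0]))   # (int K^2)^(1/2), cf. (10)
BETA = np.sum(UG**2 * quartic(UG)) * (UG[1] - UG[0])          # int x^2 K(x) dx
C, XI = 1.0, 0.05                                             # c and xi in (15), chosen arbitrarily

def lhs_of_15(X, theta, M):
    """Plug-in left-hand side of the stopping criterion (15) for the current sample."""
    n = len(X)
    grid = np.linspace(X.min() - 3.0, X.max() + 3.0, 1500)
    dx = grid[1] - grid[0]
    inside = np.abs(grid) <= n ** 0.1     # crude truncation sequence d_n -> infinity (an assumption)
    total = 0.0
    for j in range(1, M + 1):
        Xj = X[theta == j]
        N = len(Xj)
        if N < 3:
            return np.inf                 # class j not yet represented well enough
        h1 = N ** -0.25 * np.log(np.log(N)) ** 0.25     # bandwidth for the sqrt(f_j) estimate
        h2 = N ** -0.125 * np.log(np.log(N)) ** 0.125   # bandwidth for the f_j'' estimate
        f_tilde = quartic((grid[:, None] - Xj) / h1).sum(axis=1) / (N * h1)
        f_ddot = quartic_dd((grid[:, None] - Xj) / h2).sum(axis=1) / (N * h2**3)
        int_sqrt = np.sum(np.sqrt(f_tilde[inside])) * dx
        int_dd = np.sum(np.abs(f_ddot[inside])) * dx
        total += (N / n) * (ALPHA / np.sqrt(C) * int_sqrt
                            + BETA * C**2 / 2.0 * int_dd + N ** -XI) * N ** -0.4
    return total

def stopping_time(draw, M, eps, batch=200, max_n=100_000):
    """T_eps: enlarge the training sample until the criterion in (15) drops to eps."""
    X, theta = draw(batch)
    while lhs_of_15(X, theta, M) > eps and len(X) < max_n:
        Xb, tb = draw(batch)
        X, theta = np.concatenate([X, Xb]), np.concatenate([theta, tb])
    return len(X)

# Toy usage: two normal populations with mixture weights (0.4, 0.6).
rng = np.random.default_rng(2)
def draw(m):
    t = rng.choice([1, 2], size=m, p=[0.4, 0.6])
    return np.where(t == 1, rng.normal(-1.0, 1.0, m), rng.normal(1.5, 1.0, m)), t

print(stopping_time(draw, M=2, eps=0.25))
```

The criterion is recomputed only after each new batch of observations, which keeps the kernel evaluations manageable; this batching is a convenience of the sketch, not part of the procedure in the paper.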

Theorem. Assume (6)-(8) hold. Assume also that both $K$ and $K''$ have finite total variation. Let $\tilde f_{jn}$ be the kernel estimate with kernel $K$ and bandwidth $h_n = n^{-1/4}(\log\log n)^{1/4}$, and let $\check f_{jn}$ be the kernel estimate with kernel $K$ and bandwidth $h_n = n^{-1/8}(\log\log n)^{1/8}$. Take $d_n = \tilde d_n = n^{1/8 - \delta}$ with $0 < \delta < 1/8$ and assume $0 < \xi < 1/16$. Then as $\varepsilon \to 0$,
$$\limsup\, (L_{T_\varepsilon} - L^*)/\varepsilon \le 1 \quad \text{a.s.}, \tag{16}$$
$$\limsup\, E(L_{T_\varepsilon} - L^*)/\varepsilon \le 1, \tag{17}$$
$$T_\varepsilon/n^*_\varepsilon \to 1 \quad \text{a.s.} \tag{18}$$
and
$$E(T_\varepsilon/n^*_\varepsilon) \to 1. \tag{19}$$

Parts (16) and (17) of the Theorem assert that the stopping rule achieves the stated goal of bounding the difference between the error probability of the approximate procedure and the optimal (Bayes) error probability by $\varepsilon$, almost surely and in mean, for small values of $\varepsilon$. (18) and (19) state that the stopping rule accomplishes this with a sample size that is, for small values of $\varepsilon$, equivalent to the sample size $n^*_\varepsilon$ that one would use if all the relevant parameters defining it were known. That is, the stopping rule not only achieves the desired goal, it does so using the "right" number of observations. Note that small values of $\varepsilon$ correspond to close approximation of the Bayes error probability.

Proof. First we will prove (16). From (15) we know that

$$\varepsilon\, T_\varepsilon^{2/5+\xi} \ge \sum_{j=1}^M \big(N_{jT_\varepsilon}/T_\varepsilon\big)^{3/5-\xi} \ge \inf\left\{ \sum_{j=1}^M q_j^{3/5-\xi} : q_j \ge 0,\ \sum_{j=1}^M q_j = 1 \right\} = 1,$$
so
$$T_\varepsilon \to \infty \quad \text{a.s.} \tag{20}$$

as $\varepsilon \to 0$. It follows from the Strong Law of Large Numbers and Lemmas 2.2 and 2.4 of Kundu and Martinsek (1994) that as $n \to \infty$,
$$N_{jn}/n \to p_j \quad \text{a.s.}, \tag{21}$$
$$\int_{-d_n}^{d_n} \sqrt{\tilde f_{jn}} \to \int \sqrt{f_j} \quad \text{a.s.} \tag{22}$$
and
$$\int_{-\tilde d_n}^{\tilde d_n} |\check f_{jn}''| \to \int |f_j''| \quad \text{a.s.} \tag{23}$$

Combining (20) with (21)-(23) yields
$$N_{jT_\varepsilon}/T_\varepsilon \to p_j \quad \text{a.s.}, \tag{24}$$
$$\int_{-d_{T_\varepsilon}}^{d_{T_\varepsilon}} \sqrt{\tilde f_{jT_\varepsilon}} \to \int \sqrt{f_j} \quad \text{a.s.} \tag{25}$$
and
$$\int_{-\tilde d_{T_\varepsilon}}^{\tilde d_{T_\varepsilon}} |\check f_{jT_\varepsilon}''| \to \int |f_j''| \quad \text{a.s.} \tag{26}$$

From (3), (12), (24)-(26) and the Law of the Iterated Logarithm,
$$\begin{aligned}
(L_{T_\varepsilon} - L^*)/\varepsilon
&\le \left[ \sum_{j=1}^M \left( p_j \int |\hat f_{jT_\varepsilon} - f_j| + |\hat p_{jT_\varepsilon} - p_j| \right) \right] \Big/ \varepsilon \\
&\le \left[ \sum_{j=1}^M p_j \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) N_{jT_\varepsilon}^{-2/5} + o\big(T_\varepsilon^{-2/5}\big) \right] \Big/ \varepsilon \quad \text{a.s.} \\
&= (1 + o(1)) \left[ \sum_{j=1}^M (N_{jT_\varepsilon}/T_\varepsilon) \left( \alpha c^{-1/2} \int_{-d_{T_\varepsilon}}^{d_{T_\varepsilon}} \sqrt{\tilde f_{jT_\varepsilon}} + (\beta c^2/2) \int_{-\tilde d_{T_\varepsilon}}^{\tilde d_{T_\varepsilon}} |\check f_{jT_\varepsilon}''| + N_{jT_\varepsilon}^{-\xi} \right) N_{jT_\varepsilon}^{-2/5} \right] \Big/ \varepsilon \quad \text{a.s.}
\end{aligned} \tag{27}$$

(16) now follows from (27) and the definition of $T_\varepsilon$. To prove (18), note that
$$\varepsilon \big(n^*_\varepsilon\big)^{2/5} \ge \sum_{j=1}^M p_j\, \alpha c^{-1/2} \int \sqrt{f_j}\; \big(N_{jn^*_\varepsilon}/n^*_\varepsilon\big)^{-2/5}. \tag{28}$$

It is immediate from (28) that $n^*_\varepsilon \to \infty$ a.s. as $\varepsilon \to 0$. It follows from (21) that
$$N_{jn^*_\varepsilon}/n^*_\varepsilon \to p_j \quad \text{a.s.} \tag{29}$$
and
$$N_{j,n^*_\varepsilon - 1}/(n^*_\varepsilon - 1) \to p_j \quad \text{a.s.} \tag{30}$$


By the definition of $n^*_\varepsilon$,
$$\varepsilon \big(n^*_\varepsilon\big)^{2/5} \ge \sum_{j=1}^M p_j \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) \big(N_{jn^*_\varepsilon}/n^*_\varepsilon\big)^{-2/5} \tag{31}$$
and
$$\varepsilon \big(n^*_\varepsilon - 1\big)^{2/5} \le \sum_{j=1}^M p_j \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) \big(N_{j,n^*_\varepsilon - 1}/(n^*_\varepsilon - 1)\big)^{-2/5}. \tag{32}$$

Combining (29)-(32) shows
$$\varepsilon \big(n^*_\varepsilon\big)^{2/5} = (1 + o(1)) \sum_{j=1}^M p_j^{3/5} \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) \quad \text{a.s.} \tag{33}$$

A similar argument using the definition of $T_\varepsilon$ and (24)-(26) yields
$$\varepsilon\, T_\varepsilon^{2/5} = (1 + o(1)) \sum_{j=1}^M p_j^{3/5} \left( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right) \quad \text{a.s.} \tag{34}$$

(18) now follows from (33) and (34). In view of (18) and (28), to establish (19) it suffices to show that $\{\varepsilon^{5/2} T_\varepsilon : \varepsilon \le 1\}$ is uniformly integrable. This follows immediately from
$$\varepsilon \big(T_\varepsilon - 1\big)^{2/5} \le \sum_{j=1}^M \hat p_{j,T_\varepsilon-1}^{\,3/5} \left[ \alpha c^{-1/2} \int_{-d_{T_\varepsilon-1}}^{d_{T_\varepsilon-1}} \sqrt{\tilde f_{j,T_\varepsilon-1}} + (\beta c^2/2) \int_{-\tilde d_{T_\varepsilon-1}}^{\tilde d_{T_\varepsilon-1}} |\check f_{j,T_\varepsilon-1}''| + N_{j,T_\varepsilon-1}^{-\xi} \right] \tag{35}$$

and Lemmas 2.5 and 2.6 of Kundu and Martinsek (1994). Finally, we turn to the proof of (17). To simplify notation, let
$$\hat\Pi_{jn} = \left[ \alpha c^{-1/2} \int_{-d_n}^{d_n} \sqrt{\tilde f_{jn}} + (\beta c^2/2) \int_{-\tilde d_n}^{\tilde d_n} |\check f_{jn}''| \right] N_{jn}^{-2/5}$$
and
$$\Pi_{jn} = \left[ \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \right] N_{jn}^{-2/5},$$
so that the defining inequality of $T_\varepsilon$ in (15) reads $\sum_{j=1}^M \hat p_{jT_\varepsilon}\big(\hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi}\big) \le \varepsilon$.

In view of (16) and Fatou's Lemma, it suffices to show that $(L_{T_\varepsilon} - L^*)/\varepsilon$ is uniformly integrable as $\varepsilon \to 0$. As in (27) we have
$$(L_{T_\varepsilon} - L^*)/\varepsilon \le \sum_{j=1}^M \left[ p_j \int |\hat f_{jT_\varepsilon} - f_j| + |\hat p_{jT_\varepsilon} - p_j| \right] \Big/ \varepsilon. \tag{36}$$
By Lemma 9.1.1 of Chernoff (1972), for some $\kappa > 0$ and $\rho > 0$,
$$P\big(N_{jn}/n < p_j/2\big) \le \kappa e^{-\rho n}. \tag{37}$$


Setting $m(\varepsilon) = \varepsilon^{-1/(2/5+\xi)}$, from (37),
$$P\big(N_{jT_\varepsilon} < p_j m(\varepsilon)/2\big) \le P\Big( \inf_{n \ge m(\varepsilon)} N_{jn}/n < p_j/2 \Big) \le \sum_{n \ge m(\varepsilon)} P\big(N_{jn}/n < p_j/2\big) \le \kappa \sum_{n \ge m(\varepsilon)} e^{-\rho n} \le \kappa_1 e^{-\rho m(\varepsilon)}, \tag{38}$$
where $\kappa_1 > 0$. It follows from (38) above and (2.66) of Kundu and Martinsek (1994) that for every $r > 1$,

$$E\left[ \left( N_{jT_\varepsilon}^{2/5} \int |\hat f_{jT_\varepsilon} - f_j| \right)^r 1_{\{N_{jT_\varepsilon} < p_j m(\varepsilon)/2\}} \right] \le O(1)\, m(\varepsilon)\, e^{-\rho m(\varepsilon)/2} \to 0 \tag{39}$$

as $\varepsilon \to 0$. Combining (39) with an argument similar to (2.68) in Kundu and Martinsek (1994) shows
$$E\left[ \left( N_{jT_\varepsilon}^{2/5} \int |\hat f_{jT_\varepsilon} - f_j| \right)^r \right] = O(1) \tag{40}$$

for all $r > 1$. Using (38) above, an argument similar to (2.72)-(2.75) of Kundu and Martinsek (1994) and a result of Siegmund (1969) shows that for $j = 1, \ldots, M$ and $r > 1$,
$$E\left[ \int |\hat f_{jT_\varepsilon} - f_j| \Big/ \big(\hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi}\big) \right]^r = O(1) \tag{41}$$
and
$$E\left[ |p_j - \hat p_{jT_\varepsilon}| \Big/ \big(\hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi}\big) \right]^r = O(1). \tag{42}$$

Because
$$\left( \sum_{j=1}^M \left[ p_j \int |\hat f_{jT_\varepsilon} - f_j| + |p_j - \hat p_{jT_\varepsilon}| \right] \right) \Big/ \varepsilon \;\le\; \left( \sum_{j=1}^M \frac{p_j \int |\hat f_{jT_\varepsilon} - f_j| + |p_j - \hat p_{jT_\varepsilon}|}{p_j \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)} \right) \cdot \frac{\sum_{j=1}^M p_j \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)}{\sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)},$$
in view of (36), (41) and (42) it is enough to show uniform integrability of
$$\left[ \sum_{j=1}^M p_j \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big) \Big/ \sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big) \right]^r$$
for all $r > 1$. To show this, write

$$\frac{\sum_{j=1}^M p_j \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)}{\sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)} = 1 + \frac{\sum_{j=1}^M \big(p_j - \hat p_{jT_\varepsilon}\big) \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)}{\sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)}. \tag{43}$$

We have
$$\left| \frac{\sum_{j=1}^M \big(p_j - \hat p_{jT_\varepsilon}\big) \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)}{\sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)} \right| \le T_\varepsilon^{2/5+\xi} \sum_{j=1}^M |p_j - \hat p_{jT_\varepsilon}| \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big). \tag{44}$$

Because all powers of
$$\sqrt{T_\varepsilon / \log\log T_\varepsilon}\; |p_j - \hat p_{jT_\varepsilon}|$$
are uniformly integrable (see Siegmund, 1969) and all powers of $T_\varepsilon^{2/5+\xi}\big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)$ are uniformly integrable (as in (2.72)-(2.75) of Kundu and Martinsek, 1994), it follows from (43) and (44) that all positive powers of
$$\sum_{j=1}^M p_j \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big) \Big/ \sum_{j=1}^M \hat p_{jT_\varepsilon} \big( \hat\Pi_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5-\xi} \big)$$
are uniformly integrable, and this completes the proof of (17). □

References

Chernoff, H., 1972. Sequential Analysis and Optimal Design. Society for Industrial and Applied Mathematics, Philadelphia.
Devroye, L., Wagner, T.J., 1976. Nonparametric discrimination and density estimation. Tech. Report, Information Systems Research Laboratory, University of Texas at Austin.
Devroye, L., Györfi, L., 1985. Nonparametric Density Estimation: The L1 View. Wiley, New York.
Greblicki, W., 1978a. Asymptotically optimal pattern recognition procedures with density estimates. IEEE Trans. Inform. Theory IT-24, 250-251.
Greblicki, W., 1978b. Pattern recognition procedures with nonparametric density estimates. IEEE Trans. Systems Man Cybernet. 8, 809-812.
Greblicki, W., Pawlak, M., 1983. Almost sure convergence of classification procedures using Hermite series density estimates. Pattern Recogn. Lett. 2, 13-17.
Györfi, L., 1974. Estimation of probability density and optimal decision function in RKHS. In: Gani, J., Sarkadi, K., Vincze, I. (Eds.), Progress in Statistics. North-Holland, Amsterdam, pp. 281-301.
Kundu, S., Martinsek, A.T., 1994. Bounding the L1 distance in nonparametric density estimation. Ann. Inst. Statist. Math., submitted.
Krzyzak, A., 1983. Classification procedures using multivariate variable kernel density estimate. Pattern Recogn. Lett. 1, 293-298.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.
Rosenblatt, M., 1956. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832-837.


Siegmund, D., 1969. On moments of the maximum of normed sums. Ann. Math. Statist. 40, 527-531.
Van Ryzin, J., 1966. Bayes risk consistency of classification procedures using density estimation. Sankhyā, Ser. A 28, 261-270.
Wolverton, C.T., Wagner, T.J., 1969. Asymptotically optimal discriminant functions for pattern recognition. IEEE Trans. Inform. Theory 15, 258-265.