Pattern Recognition, Vol. 11, pp. 353-360. 0031-3203/79/1201-0353 $02.00/0
Pergamon Press Ltd, 1979. Printed in Great Britain. © Pattern Recognition Society

SOME ASPECTS OF ERROR BOUNDS IN FEATURE SELECTION

D. E. BOEKEE and J. C. A. VAN DER LUBBE

Information Theory Group, Department of Electrical Engineering, Delft University of Technology, 4 Mekelweg, Delft, The Netherlands

(Received 3 May 1978; in revised form 6 February 1979)

Abstract - In this paper we discuss various bounds on the Bayesian probability of error, which are used for feature selection, and are based on distance measures and information measures. We show that they are basically of two types. One type can be related to the f-divergence, the other can be related to information measures. This also clarifies some properties of these measures for the two-class problem and for the multi-class problem. We give some general bounds on the Bayesian probability of error and discuss various aspects of the different approaches.

Information measures    Error bounds    Distance measures    Feature evaluation    f-divergence

1. INTRODUCTION

In recent years many authors (1-10) have paid attention to the problem of bounding the Bayesian probability of error. This has been of particular interest in the field of pattern recognition related to feature selection. The problem can be stated as follows for a multi-class pattern recognition problem. Suppose we have m pattern classes C_1, C_2, ..., C_m with a priori probabilities P_i = P(C_i). Let the feature x have a class conditional density function p(x/C_i). Then the a posteriori probability for class C_i given the feature value x can be given by P(C_i/x). If the decision in the classifier is made according to the Bayes method, we select the class for which P_i p(x/C_i), or equivalently P(C_i/x), is maximal. This decision rule leads to a probability of error which is given by

P_e = 1 - \int_X \max_i [P_i\, p(x/C_i)]\, dx = 1 - E_x[\max_i P(C_i/x)],

where E_x is the expectation with respect to x. Several probabilistic distance measures and information measures have been used to obtain bounds on the Bayesian probability of error, like Kolmogorov's variational distance, the Bhattacharyya distance, the J-divergence, the (generalized) Bayesian distance, as well as many others. More details about these measures can be found in (1)-(10), where upper and lower bounds to the probability of error are also given. Much effort has been given to generalize these measures from two pattern classes with equal a priori probabilities to two pattern classes with non-equal a priori probabilities, and from there to m pattern classes, with m ≥ 2. The comparison of the various bounds on the probability of error has also received much attention.

In this paper we discuss some aspects of these distance measures, both for the two-class problem and for the multi-class problem. We show that many of the well-known distance measures belong to two general classes of distance measures. Firstly we shall study a group of distance measures which can be related to the f-divergence. Secondly we shall study a general class of distance measures, called the general mean distance, and relate it to a class of information measures.

2. THE f-DIVERGENCE

In this section we shall study the behaviour of distance measures as measures of divergence between two hypotheses. This is the simple case of two pattern classes. A very general measure of divergence, given by Csiszar (11), is defined as follows.

Definition 1. Let f(u) be a convex function which satisfies

f(0) = \lim_{u \to 0} f(u)    (1a)

0 \cdot f(0/0) = 0    (1b)

0 \cdot f(a/0) = a \lim_{u \to \infty} f(u)/u    (1c)

Then the f-divergence between two density functions p(x) and q(x) is defined by

D_f(p, q) = \int_X f\left(\frac{p(x)}{q(x)}\right) q(x)\, dx.    (2)

It is easy to show (11) that D_f(p, q) ≥ f(1) and that D_f(p, q) = f(1) if and only if p(x) = q(x) for almost all x.
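To make these definitions concrete, the following is a minimal numerical sketch (not part of the original paper). It evaluates the Bayes error rate from the decision rule above and the f-divergence of Definition 1 on a discretized feature space; the Gaussian class-conditional densities, the grid, the priors, and the choice f(u) = -u^{1/2} (a convex function satisfying Definition 1) are assumptions made only for this illustration.

```python
import numpy as np

# Assumed discretized feature space: two class-conditional densities p(x/C1), p(x/C2)
# on a grid, with a priori probabilities P1, P2.  All quantities are illustrative.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
p1 = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)   # p(x/C1)
p2 = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)   # p(x/C2)
P1, P2 = 0.4, 0.6                                         # a priori probabilities

# Bayes error: P_e = 1 - integral of max_i [P_i p(x/C_i)] dx
p_e = 1.0 - np.sum(np.maximum(P1 * p1, P2 * p2)) * dx

# f-divergence of Definition 1, D_f(p, q) = integral f(p/q) q dx,
# here with the assumed convex choice f(u) = -sqrt(u).
f = lambda u: -np.sqrt(u)
D_f = np.sum(f(p1 / p2) * p2) * dx

print(f"Bayes error P_e = {p_e:.4f}")
print(f"D_f(p1, p2)     = {D_f:.4f}  (>= f(1) = {f(1.0):.1f}, as stated above)")
```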


Furthermore, it follows from equation (2) that in general the f-divergence is not symmetric. It is possible to obtain a symmetric measure by introducing the measure

J_f(p; q) = D_f(p, q) + D_f(q, p) = \int_X f\left(\frac{p(x)}{q(x)}\right) q(x)\, dx + \int_X f\left(\frac{q(x)}{p(x)}\right) p(x)\, dx,    (3)

which by its very definition is a symmetric measure. We shall call it the symmetric f-divergence. For this measure we have J_f(p; q) ≥ 2f(1), with equality if and only if p(x) and q(x) coincide. An important property of the f-divergence is that a one-to-one transformation of x can never decrease the f-divergence between p(x) and q(x).

2.1 Two pattern classes

If we want to apply the notion of f-divergence to the feature selection problem in pattern recognition we are, for two pattern classes, concerned with the class conditional density functions p(x/C_1) and p(x/C_2) of the feature x. Then the f-divergence can be modified as follows.

Definition 2. The f-divergence between the two class conditional density functions is given by

D_f(C_1, C_2) = \int_X f\left(\frac{p(x/C_1)}{p(x/C_2)}\right) p(x/C_2)\, dx.    (4)

In a way similar to the one used in equation (3) we can define a symmetric measure of divergence by

J_f(C_1; C_2) = D_f(C_1, C_2) + D_f(C_2, C_1).    (5)

In Definition 2 we have only considered the class conditional density functions for the two classes. If we also know the a priori probabilities we can generalize equations (4) and (5), and give definitions for D_f and J_f in terms of the a posteriori probabilities of the two classes.

Definition 3. The average f-divergence between the two classes C_1 and C_2 is defined by

\bar{D}_f(C_1, C_2) = \int_X f\left(\frac{p(x/C_1) P(C_1)}{p(x/C_2) P(C_2)}\right) p(x/C_2) P(C_2)\, dx.    (6)

Similarly we can obtain a symmetric version which follows from equations (5) and (6)

\bar{J}_f(C_1; C_2) = E_x\left\{ f\left(\frac{P(C_1/x)}{P(C_2/x)}\right) P(C_2/x) + f\left(\frac{P(C_2/x)}{P(C_1/x)}\right) P(C_1/x) \right\}.    (7)

Thus, it turns out that the f-divergence as given in equations (4) and (5) can be reformulated in terms of the a posteriori probabilities of the two classes, provided we know the a priori probabilities.

Many distance measures can be considered as special cases of the f-divergence or of a monotone function of the f-divergence. It is not the purpose of this paper to give an exhaustive list; therefore we mention only some of the distance measures as an illustration. Some examples of distance measures for the f-divergence of Definition 2 can also be found in Chen (1) and Kobayashi et al. (12)

1. The Bhattacharyya distance. This distance measure is defined as B = -\log \rho, where

\rho = \int_X \sqrt{p(x/C_1)\, p(x/C_2)}\, dx    (8)

is the Bhattacharyya coefficient. If we take the a priori probabilities into account we obtain the definition

\bar{\rho} = \int_X \sqrt{p(x/C_1) P(C_1)\, p(x/C_2) P(C_2)}\, dx.    (9)

If we set f(u) = f_\rho(u) = -u^{1/2} in equations (4) and (6), we find that

\rho = -D_\rho(C_1, C_2)    (10)

and

\bar{\rho} = -\bar{D}_\rho(C_1, C_2).    (11)

2. Kolmogorov's variational distance. This distance measure can be defined as

V = \int_X |p(x/C_1) - p(x/C_2)|\, dx    (12)

and, for the case of posterior probabilities, as

\bar{V} = \int_X |P(C_1/x) - P(C_2/x)|\, p(x)\, dx.    (13)

If we set f(u) = f_V(u) = |1 - u| in equations (4) and (6) we find

V = D_V(C_1, C_2)    (14)

and

\bar{V} = \bar{D}_V(C_1, C_2).    (15)

3. The generalized Matusita distance, or M_r-distance. This distance measure can be defined for r ≥ 1 as

M_r = \left[ \int_X |p(x/C_1)^{1/r} - p(x/C_2)^{1/r}|^r\, dx \right]^{1/r}    (16)

and, in its averaged form,

\bar{M}_r = \left[ \int_X |P(C_1/x)^{1/r} - P(C_2/x)^{1/r}|^r\, p(x)\, dx \right]^{1/r}.    (17)

Note that for r = 1 we obtain the Kolmogorov distance and for r = 2 we have the usual Matusita distance.
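As a numerical illustration (added here, not part of the original text), the sketch below evaluates the Bhattacharyya coefficient, Kolmogorov's variational distance, and the generalized Matusita distance on a discretized two-class example, and checks that each agrees with the corresponding f-divergence of equation (4) for the stated choice of f(u); it anticipates the relations given in the equations that follow. The Gaussian class-conditional densities and the grid are assumptions made for the example.

```python
import numpy as np

# Assumed two-class example on a grid: p(x/C1), p(x/C2) Gaussian.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p1 = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
p2 = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)

def Df(f, pa, pb):
    """f-divergence of equation (4): D_f(C1, C2) = integral f(pa/pb) pb dx."""
    return np.sum(f(pa / pb) * pb) * dx

# 1. Bhattacharyya coefficient, eq (8); as an f-divergence with f(u) = -sqrt(u), eq (10).
rho = np.sum(np.sqrt(p1 * p2)) * dx
print("rho      :", rho, -Df(lambda u: -np.sqrt(u), p1, p2))       # should match

# 2. Kolmogorov's variational distance, eq (12); f(u) = |1 - u|, eq (14).
V = np.sum(np.abs(p1 - p2)) * dx
print("V        :", V, Df(lambda u: np.abs(1 - u), p1, p2))        # should match

# 3. Generalized Matusita distance, eq (16); f(u) = |1 - u^(1/r)|^r (cf. the equations below).
r = 3
Mr = (np.sum(np.abs(p1 ** (1 / r) - p2 ** (1 / r)) ** r) * dx) ** (1 / r)
print("M_r (r=3):", Mr, Df(lambda u: np.abs(1 - u ** (1 / r)) ** r, p1, p2) ** (1 / r))
```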


If we set f(u) = f_M(u) = |1 - u^{1/r}|^r in equations (4) and (6) we obtain

M_r = [D_M(C_1, C_2)]^{1/r}    (18)

and

\bar{M}_r = [\bar{D}_M(C_1, C_2)]^{1/r}.    (19)

As an illustration of the symmetric f-divergence we consider Kullback's J-divergence, which is given by

J = \int_X [p(x/C_1) - p(x/C_2)] \log \frac{p(x/C_1)}{p(x/C_2)}\, dx,    (20)

or, in terms of a posteriori probabilities, by

\bar{J} = \int_X [P(C_1/x) - P(C_2/x)] \log \frac{P(C_1/x)}{P(C_2/x)}\, p(x)\, dx.    (21)

We find by substitution of f(u) = f_J(u) = (u - 1) \log u in equations (5) and (7) that

J = J_J(C_1; C_2)    (22)

and

\bar{J} = \bar{J}_J(C_1; C_2).    (23)

Finally we note that the probability of error P_e itself can also be written as a special case of the f-divergence if we set f(u) = \min(u, 1 - u). Thus we can conclude that for two pattern classes these distance measures, as well as P_e itself, can be related to the f-divergence. Perhaps of more importance is the conclusion that their properties can also be obtained from similar properties of the f-divergence. Therefore, it seems necessary to study only the f-divergence. A comparison of distance measures to get more insight into their nature, as some authors have stated, results, therefore, only in a comparison of certain convex functions f(u), as can be seen from the previous examples. We may also note that the question of upper and lower bounds on P_e leads to a comparison of the function f(u) = \min(u, 1 - u) with other convex functions.

2.2 Upper bounds on P_e

In this section we shall consider upper bounds on the probability of error, using the f-divergence. For the special case P(C_1) = P(C_2) = 1/2, Vajda (13) has obtained an upper bound in terms of the f-divergence. In the next theorem we give a general bound on P_e in terms of the average f-divergence \bar{D}_f(C_1, C_2) of Definition 3 for not necessarily equal a priori probabilities.

Theorem 1. For two pattern classes with probabilities P(C_1) and P(C_2) the probability of error is upper bounded by

P_e \leq \frac{f_0 P(C_2) + f_\infty P(C_1) - \bar{D}_f(C_1, C_2)}{f_2 - f_1},    (24)

provided f_2 is finite, where

f_0 = \lim_{u \downarrow 0} f(u),  f_1 = f(1)    (25)

and

f_\infty = \lim_{u \to \infty} \frac{f(u)}{u},  f_2 = f_0 + f_\infty.    (26)

The proof is given in Appendix A. In this way, we have obtained an upper bound on the probability of error which is given by the average f-divergence \bar{D}_f(C_1, C_2) as well as some values of the convex function f(u). To demonstrate that equation (24) includes some well-known bounds, based on various distance measures, we give three examples.

1. For the Bhattacharyya coefficient \bar{\rho} we have

f(u) = -u^{1/2},  f_1 = -1,  f_0 = f_\infty = f_2 = 0.

From this it follows that

P_e \leq -\bar{D}_\rho(C_1, C_2) = \bar{\rho},    (27)

which is a well-known result.

2. For the Kolmogorov distance \bar{V} we find

f(u) = |1 - u|,  f_1 = 0,  f_0 = f_\infty = 1,  f_2 = 2

and

P_e \leq \frac{1}{2} - \frac{1}{2}\bar{D}_V(C_1, C_2) = \frac{1}{2} - \frac{1}{2}\bar{V}.    (28)

It is well-known that equation (28) holds with equality. This is due to the fact that f(u) = |1 - u| is not strictly convex.

3. For the Matusita distance \bar{M}_r we obtain

f(u) = |1 - u^{1/r}|^r,  f_0 = f_\infty = 1,  f_1 = 0,  f_2 = 2

and

P_e \leq \frac{1}{2} - \frac{1}{2}\bar{D}_M(C_1, C_2) = \frac{1}{2} - \frac{1}{2}\bar{M}_r^r.    (29)

This bound includes the Kolmogorov and the Matusita bounds for r = 1 and r = 2. For general r ≥ 1 the bound has not yet been given in the literature.
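The following sketch (an illustration added here, not from the original paper) checks the bound of Theorem 1 numerically for the three examples above on a discretized two-class problem; the densities, priors, and grid are arbitrary choices assumed only for the demonstration.

```python
import numpy as np

# Assumed two-class example: Gaussian class-conditional densities and priors.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p1 = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)   # p(x/C1)
p2 = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)   # p(x/C2)
P1, P2 = 0.3, 0.7

# Bayes error and the average f-divergence of Definition 3, equation (6).
Pe = 1.0 - np.sum(np.maximum(P1 * p1, P2 * p2)) * dx
Dbar = lambda f: np.sum(f((P1 * p1) / (P2 * p2)) * P2 * p2) * dx

def theorem1_bound(f, f0, f1, finf):
    """Right-hand side of equation (24): (f0 P(C2) + finf P(C1) - Dbar_f) / (f2 - f1)."""
    f2 = f0 + finf
    return (f0 * P2 + finf * P1 - Dbar(f)) / (f2 - f1)

r = 3
cases = {
    "Bhattacharyya (27)":   (lambda u: -np.sqrt(u), 0.0, -1.0, 0.0),
    "Kolmogorov (28)":      (lambda u: np.abs(1 - u), 1.0, 0.0, 1.0),
    f"Matusita r={r} (29)": (lambda u: np.abs(1 - u ** (1 / r)) ** r, 1.0, 0.0, 1.0),
}
print(f"P_e = {Pe:.4f}")
for name, (f, f0, f1, finf) in cases.items():
    print(f"{name}: upper bound = {theorem1_bound(f, f0, f1, finf):.4f}")
```

The Kolmogorov case reproduces P_e exactly, in line with the remark that equation (28) holds with equality.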

2.3 m pattern classes

In many applications there are m ≥ 2 pattern classes. In those cases we want to obtain multi-class bounds on the probability of error, and the distance measures have to be modified. It is well-known [Chu et al. (14)] that an approximation of the probability of error P_e in terms of the pairwise error probabilities P_e(C_i, C_j) satisfies the inequality

P_e \leq \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} P_e(C_i, C_j).    (30)

It is straightforward to define a multi-class f-divergence in a similar manner.

Definition 4. The multi-class average f-divergence is defined by

\bar{D}_f(m) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \bar{D}_f(C_i, C_j).    (31)
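As an added illustration (not in the original), the sketch below builds a three-class discrete example, computes the multi-class average f-divergence of equation (31) for the Bhattacharyya choice f(u) = -u^{1/2}, and compares the true Bayes error with the pairwise-sum bound obtained by combining equation (30) with the pairwise bound (27) (this combination is the multi-class Bhattacharyya bound discussed next). The class-conditional densities and priors are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

# Assumed three-class example on a grid.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
means, priors = [-2.0, 0.0, 2.0], [0.2, 0.3, 0.5]
dens = [np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi) for m in means]   # p(x/Ci)

# True multi-class Bayes error: P_e = 1 - integral max_i [P_i p(x/C_i)] dx.
joint = np.vstack([P * p for P, p in zip(priors, dens)])
Pe = 1.0 - np.sum(joint.max(axis=0)) * dx

# Pairwise average f-divergence of Definition 3, here with f(u) = -sqrt(u),
# so that -Dbar(i, j) equals the pairwise Bhattacharyya coefficient of eq (9).
f = lambda u: -np.sqrt(u)
def Dbar(i, j):
    return np.sum(f(joint[i] / joint[j]) * joint[j]) * dx

pairs = list(combinations(range(3), 2))
Df_m = sum(Dbar(i, j) for i, j in pairs)              # multi-class D_f(m), eq (31)
bound = sum(-Dbar(i, j) for i, j in pairs)            # eq (30) with each pair bounded as in (27)

print(f"Bayes error P_e              = {Pe:.4f}")
print(f"Multi-class D_f(m), eq (31)  = {Df_m:.4f}")
print(f"Pairwise Bhattacharyya bound = {bound:.4f}")
```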

Using equations (30), (31) and (24) for the pairwise error probabilities P_e(C_i, C_j), we can express P_e as a function of \bar{D}_f(m)

P_e \leq g[\bar{D}_f(m)].    (32)

In the light of previous examples it will be clear that a variety of multi-class distance measures are included in these results. As an example, we mention the multi-class Bhattacharyya coefficient

\bar{\rho}(m) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \bar{\rho}(C_i, C_j),    (33)

which, using equations (30) and (27), leads to the bound

P_e \leq \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \bar{\rho}(C_i, C_j).    (34)

This bound was given in Lainiotis (15). However, due to the inequality of equation (30), it can be expected that the multi-class bounds are not very useful. This has been remarked by several authors. In Section 3 we shall consider distance measures which are directly defined for m ≥ 2 classes.

3. THE GENERAL MEAN DISTANCE MEASURE

In Section 2 we introduced distance measures which are basically suitable for a two-class problem. By adding these pairwise measures in an appropriate way over all classes, it was possible to apply them to a multi-class recognition problem. Now we will discuss a general class of distance measures which are inherently based on m ≥ 2 pattern classes. As opposed to being related to the f-divergence, these measures show a close relation to information measures like the equivocation and the quadratic information.

Definition 5. The new measure, called the general mean distance, is given by Van der Lubbe (16)

G_{S,R}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S dx,    (35)

where P(C_i/x) is the a posteriori probability of class C_i given feature x and where p(x) is the unconditional probability density function of the feature x. This general mean distance includes a second class of well-known distance measures. The following is a list of these measures.

1. If we set R = 2 and S = 1 in equation (35), we find the Bayesian distance

G_{1,2}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^2 \right] dx = B(C/X),    (36)

which was given in Devijver (3), and is closely related to the average conditional quadratic entropy

h(C/X) = 1 - B(C/X).    (37)

2. If we substitute R = 3, S = 1 into equation (35), we find

G_{1,3}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^3 \right] dx.    (38)

It can be shown that H_3(C/X) = 1 - G_{1,3}(C/X) is the average conditional cubic entropy, which was proposed in Chen (2).

3. If we set R > 1 and S = 1 in equation (35), we obtain

G_{1,R}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^R \right] dx,    (39)

which can be seen as a generalization of the Bayesian distance and the distance measure given in equation (38).

4. Substitution of S = 1/R into equation (35) and choosing R > 1 leads to the distance measure

G_{1/R,R}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^{1/R} dx,  R > 1,    (40)

which is the distance measure B'_R(C/X) proposed in Trouborst et al. (10)

The general mean distance is by its very definition a symmetric measure. This measure is also positive and minimal if and only if for all i = 1, ..., m the a posteriori probability distributions P(C_i/x) are equal.

3.1 Bounds on P_e

It is possible to obtain several bounds on P_e, associated with the Bayesian decision rule, in terms of the general mean distance. Although we can obtain these bounds by the general mean distance for all possible values of R and S, we shall investigate in this paper only bounds for P_e for the case R > 1 and S > 0. The relationship between the Bayesian probability of error and the general mean distance G_{S,R}(C/X) is analyzed in the following theorems.

Theorem 2. For S > 0, R > 1, 1 ≤ RS < 1 + S it holds that

1 - G_{S,R}(C/X)^{1/RS} \leq P_e \leq 1 - G_{S,R}(C/X)^{1/S(R-1)} \leq 1 - m^{-S} G_{S,R}(C/X).    (41)

Theorem 3. For S > 0, R > 1, RS ≤ 1 it holds that

1 - G_{S,R}(C/X) \leq P_e \leq 1 - G_{S,R}(C/X)^{1/S(R-1)} \leq 1 - m^{-1/R} G_{S,R}(C/X)^{1/RS}.    (42)

Theorem 4. For S > 0, R > 1, RS ≥ 1 + S we have

1 - G_{S,R}(C/X)^{1/RS} \leq P_e \leq 1 - G_{S,R}(C/X) \leq 1 - m^{-S} G_{S,R}(C/X).    (43)

Theorem 5. For RS = 1 we have

\lim_{R \to \infty} [1 - G_{1/R,R}(C/X)] = P_e.    (44)

The proofs of Theorems 2-4 are given in Appendix B. The proof of Theorem 5 can be found in Trouborst et al. (10), and will not be repeated here.
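To illustrate Definition 5 and the behaviour of these bounds, the sketch below (added here as an illustration, not part of the paper) computes G_{S,R}(C/X) on a discrete three-class example and evaluates the bound chains of Theorems 3 and 4 together with the limit of Theorem 5; the randomly generated distributions and the specific parameter choices are assumptions made for the demonstration.

```python
import numpy as np

# Assumed discrete feature with K values: p(x) and posteriors P(Ci/x), randomly generated.
rng = np.random.default_rng(0)
K, m = 50, 3
px = rng.dirichlet(np.ones(K))                 # p(x) over K feature values
post = rng.dirichlet(np.ones(m), size=K)       # rows are P(C1/x), ..., P(Cm/x)

Pe = 1.0 - np.sum(px * post.max(axis=1))       # Bayes error

def G(S, R):
    """General mean distance of equation (35), G_{S,R}(C/X), for discrete x."""
    return np.sum(px * np.sum(post ** R, axis=1) ** S)

# Theorem 4 with R = 2, S = 1 (RS = 1 + S): the Bayesian distance B(C/X) = G(1, 2).
B = G(1.0, 2.0)
print(f"P_e = {Pe:.4f}")
print(f"Theorem 4 (R=2, S=1): {1 - np.sqrt(B):.4f} <= {Pe:.4f} <= {1 - B:.4f} <= {1 - B / m:.4f}")

# Theorem 3 with S = 1/R, R = 4 (so RS = 1).
S, R = 0.25, 4.0
g = G(S, R)
low = 1 - g
up1 = 1 - g ** (1 / (S * (R - 1)))
up2 = 1 - m ** (-1 / R) * g ** (1 / (R * S))
print(f"Theorem 3 (S=1/R, R=4): {low:.4f} <= {Pe:.4f} <= {up1:.4f} <= {up2:.4f}")

# Theorem 5: for S = 1/R, 1 - G_{1/R,R}(C/X) approaches P_e as R grows.
for R in (2, 8, 32, 128):
    print(f"R = {R:4d}:  1 - G_(1/R,R) = {1 - G(1.0 / R, R):.4f}")
```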

Note that for S = 1/R, the signs of equality in the upper and lower bounds of Theorems 2-4 hold if R → ∞. It can also be shown that many well-known error bounds can be derived from the three theorems by suitable substitutions of R and S. We give two examples.

1. The Bayesian distance and the distance measure of Lissack and Fu. For this measure we have R = 2 and S = 1. Theorem 4 can be used and thus we find

1 - \sqrt{B(C/X)} \leq P_e \leq 1 - B(C/X) \leq 1 - m^{-1} B(C/X).    (45)

The lower bound and the tightest upper bound can be found in Devijver (3). For the two-class classification problem it can be proved that

B(C/X) = \frac{1}{2} + \frac{1}{2} L_2,    (46)

where L_2 is the distance measure of Lissack and Fu (8). Equation (45) can be rewritten as

1 - \sqrt{\frac{1}{2} + \frac{1}{2} L_2} \leq P_e \leq \frac{1}{2} - \frac{1}{2} L_2 \leq 1 - \frac{1}{2}\left(\frac{1}{2} + \frac{1}{2} L_2\right).    (47)

The tightest upper bound has been proved by Lissack and Fu (8).

2. The distance measure B'_R(C/X). For this measure we have S = 1/R, R > 1. Theorems 2 and 3 hold and we find

1 - B'_R(C/X) \leq P_e \leq 1 - B'_R(C/X)^{R/(R-1)} \leq 1 - m^{-1/R} B'_R(C/X).    (48)

The lower bound has been proved by Trouborst et al. (10), and the first upper bound by Györfi et al. (4)

4. CONCLUSION

As has been shown, it is possible to distinguish two approaches to obtain distance measures, which in this paper are characterized by the (average) f-divergence and the general mean distance.

The f-divergence is basically a distance measure for two pattern classes. It includes Kolmogorov's variational distance, which is directly related to the probability of error. As we have shown, the (average) f-divergence can be extended to multi-class pattern recognition problems. With the f-divergence as well as the average f-divergence we can obtain upper bounds on the probability of error which include several well-known bounds for other distance measures. For m pattern classes, with m ≥ 2, the bound becomes less tight for increasing m.

The second class of distance measures, like the general mean distance, is basically developed for m ≥ 2 classes and is therefore easily formulated for the multi-class problem. For a suitable choice of the parameters R and S, one minus the general mean distance converges to the probability of error. Thus the error bounds can be made arbitrarily tight.

The f-divergence is in general not closely related to information measures. For the special case of Kullback's directed divergence a relation with Shannon's mutual information can be given. However, the general mean distance is directly related to information measures, as has been indicated in Section 3. Since the bounds on the Bayesian probability of error can be obtained both by measures like the f-divergence and by measures like the general mean distance, it seems sufficient to stress the relation between distance, or divergence, and probability of error.

SUMMARY

In the context of the feature selection problem, much attention has been devoted to upper and lower bounds on the Bayesian probability of error P_e. These bounds are based on distance measures or information measures. In general a pattern recognition problem is a multi-class problem. It turns out that the information and distance measures are based on two different approaches.

One approach is based on distance measures which are basically suitable for a two-class problem, such as the Kolmogorov distance and the Bhattacharyya distance. To apply them to a multi-class recognition problem one has to average these measures in an appropriate way over all classes. We shall give a comparison of these measures by relating them to a measure which is known as the f-divergence. The f-divergence for a two-class problem is defined as

D_f(C_1, C_2) = \int_X f\left(\frac{p(x/C_1)}{p(x/C_2)}\right) p(x/C_2)\, dx,

where f(·) is a convex function and p(x/C_i) is the conditional probability density function of the feature x given pattern class C_i.

A second approach is inherently based on m ≥ 2 pattern classes. This approach includes bounds which are based e.g. on information measures like the equivocation and the quadratic information. We shall introduce a general class of distance measures, called the general mean distance, which includes these two measures as well as some others. This general mean distance is given by

G_{S,R}(C/X) = \int_X p(x) \left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S dx,

where P(C_i/x) is the a posteriori probability of class C_i given a feature x and where p(x) is the probability density function of the feature. We show that with this measure we obtain upper and lower bounds of the probability of error P_e which include many previously obtained results and can be made arbitrarily tight to P_e. In conclusion we give a discussion on the two


approaches mentioned above in relation to multi-class pattern recognition.

Acknowledgement - The authors wish to thank Prof. Y. Boxma for advice and encouragement and Dr. E. Backer for his helpful comments during the preparation of this paper.

APPENDIX A

In this appendix we prove the bound for the probability of error in terms of the average f-divergence, given in Theorem 1. First we note, using Theorem 111 in Hardy et al. (17), that from the inequality

\frac{f(u_0) - f(u_1)}{u_0 - u_1} \leq \frac{f(u_2) - f(u_0)}{u_2 - u_0},    (A1)

which holds for every u_0 \in (u_1, u_2) \subset [0, \infty], it follows that f_2 > f_1. Next we introduce a different formulation for \bar{D}_f(C_1, C_2) by introducing the function

u = u(x) = \frac{p(x/C_2) P(C_2)}{p(x)}    (A2)

on X, where X is the set of values x that the feature can take on. If we set

f^*(u) = u\, f\left(\frac{1 - u}{u}\right)

for u \in (0, 1], and f^*(0) = \lim_{u \to 0} f^*(u), then it follows by substitution of (A1) and (A2) that

\bar{D}_f(C_1, C_2) = \int_X f\left(\frac{p(x/C_1) P(C_1)}{p(x/C_2) P(C_2)}\right) p(x/C_2) P(C_2)\, dx = \int_X f^*(u)\, p(x)\, dx.    (A3)

It can be shown, Vajda (13), that f^*(u) is convex on [0, 1] and is strictly convex if and only if f(u) is strictly convex. Using (25) and (26) we then find for f^*(u) that f^*(0) = f_\infty, f^*(1/2) = (1/2) f_1, and f^*(1) = f_0.

Next we note that the probability of error can be expressed as

P_e = \int_X \varphi(u)\, p(x)\, dx,    (A4)

with u as given in equation (A2), if we set

\varphi(u) = \min(u, 1 - u) on [0, 1].    (A5)

To simplify our notation we introduce the set X_1 = \{x : p(x/C_2) P(C_2) < p(x/C_1) P(C_1)\} \subset X and similarly X_2 = X - X_1 \subset X. Then the probability of error can be written as

P_e = P(X_1/C_2) + P(X_2/C_1),    (A6)

where

P(X_1/C_2) = \int_{X_1} p(x/C_2) P(C_2)\, dx    (A7)

and

P(X_2/C_1) = \int_{X_2} p(x/C_1) P(C_1)\, dx.    (A8)

In the sequel we assume that f_0 and f_\infty are finite. By the convexity of f^*(u) we can write on X_1

f^*(u) = f^*(\varphi) \leq (1 - 2\varphi) f^*(0) + 2\varphi f^*(1/2) = (f_1 - 2f_\infty) u + f_\infty.    (A9)

Similarly we find on X_2 that

f^*(u) = f^*(1 - \varphi) \leq (1 - 2\varphi) f^*(1) + 2\varphi f^*(1/2) = (f_1 - 2f_0)(1 - u) + f_0.    (A10)

Taking the expectation with respect to x on both sides of equation (A9) over the area X_1 yields

\int_{X_1} f^*(u) p(x)\, dx \leq \int_{X_1} [(f_1 - 2f_\infty) u + f_\infty] p(x)\, dx
= (f_1 - 2f_\infty) \int_{X_1} p(x/C_2) P(C_2)\, dx + f_\infty \int_{X_1} p(x)\, dx
= (f_1 - 2f_\infty) P(X_1/C_2) + f_\infty [P(C_1) - P(X_2/C_1) + P(X_1/C_2)]
= f_\infty P(C_1) + f_1 P(X_1/C_2) - f_\infty [P(X_1/C_2) + P(X_2/C_1)].    (A11)

In a similar way we obtain for X_2

\int_{X_2} f^*(u) p(x)\, dx \leq \int_{X_2} [(f_1 - 2f_0)(1 - u) + f_0] p(x)\, dx
= (f_1 - f_0) \int_{X_2} p(x)\, dx - (f_1 - 2f_0) \int_{X_2} p(x/C_2) P(C_2)\, dx
= (f_1 - f_0)[P(C_2) - P(X_1/C_2) + P(X_2/C_1)] - (f_1 - 2f_0)[P(C_2) - P(X_1/C_2)]
= f_0 P(C_2) + f_1 P(X_2/C_1) - f_0 [P(X_1/C_2) + P(X_2/C_1)].    (A12)

Summing equations (A11) and (A12) and using equation (A3) then gives

\bar{D}_f(C_1, C_2) = \int_X f^*(u) p(x)\, dx \leq f_0 P(C_2) + f_\infty P(C_1) + (f_1 - f_0 - f_\infty)[P(X_1/C_2) + P(X_2/C_1)].    (A13)

Using equation (A6) finally leads to

\bar{D}_f(C_1, C_2) \leq f_0 P(C_2) + f_\infty P(C_1) + (f_1 - f_2) P_e    (A14)

or

P_e \leq \frac{f_0 P(C_2) + f_\infty P(C_1) - \bar{D}_f(C_1, C_2)}{f_2 - f_1},    (A15)

which proves the theorem.
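As a quick numerical check of the mechanism of this proof (an illustration added here, not in the original), the sketch below verifies the pointwise bounds (A9) and (A10) on f*(u) for the Matusita choice f(u) = |1 - u^{1/3}|^3, for which f_0 = f_∞ = 1 and f_1 = 0; the choice of r = 3 and the grid are assumptions made for the check.

```python
import numpy as np

# Assumed Matusita choice with r = 3: f(u) = |1 - u**(1/3)|**3.
r = 3
f = lambda u: np.abs(1.0 - u ** (1.0 / r)) ** r
f0, f1, finf = 1.0, 0.0, 1.0                      # f(0), f(1), lim f(u)/u

# f*(u) = u * f((1 - u)/u) on (0, 1], as used in equation (A3).
fstar = lambda u: u * f((1.0 - u) / u)

u1 = np.linspace(1e-6, 0.5, 1000)                 # region X1: u < 1/2
u2 = np.linspace(0.5, 1.0 - 1e-12, 1000)          # region X2: u >= 1/2

# (A9):  f*(u) <= (f1 - 2 f_inf) u + f_inf      on X1
# (A10): f*(u) <= (f1 - 2 f0)(1 - u) + f0       on X2
ok_a9 = np.all(fstar(u1) <= (f1 - 2 * finf) * u1 + finf + 1e-12)
ok_a10 = np.all(fstar(u2) <= (f1 - 2 * f0) * (1 - u2) + f0 + 1e-12)
print("(A9) holds on X1:", ok_a9)
print("(A10) holds on X2:", ok_a10)
```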

APPENDIX B

In this appendix we prove Theorems 2-4 for different values of S and R.

Proof of Theorem 2. For simplicity we split up equation (41) into three inequalities for S > 0, R > 1 and RS ≤ 1 + S:

P_e \geq 1 - G_{S,R}(C/X)^{1/RS}    (B1)

P_e \leq 1 - G_{S,R}(C/X)^{1/S(R-1)}    (B2)

1 - G_{S,R}(C/X)^{1/S(R-1)} \leq 1 - m^{-S} G_{S,R}(C/X).    (B3)

(a) To prove equation (B1) we consider for R > 1 and S > 0 the inequality

\left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S \geq \left[ \max_i P(C_i/x) \right]^{RS}.    (B4)

By taking expectations on each side of equation (B4) we find

G_{S,R}(C/X) = E_x\left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S \geq E_x\left[ [\max_i P(C_i/x)]^{RS} \right].    (B5)

It is easy to see that the function [\max_i P(C_i/x)]^{RS} is a convex function for RS ≥ 1 or RS ≤ 0. Then, by applying Jensen's inequality, we find that

\left[ E_x[\max_i P(C_i/x)] \right]^{RS} \leq E_x\left[ [\max_i P(C_i/x)]^{RS} \right].    (B6)

Using the definition of the probability of error and equation (B5) yields

(1 - P_e)^{RS} \leq E_x\left[ [\max_i P(C_i/x)]^{RS} \right] \leq G_{S,R}(C/X)    (B7)

for RS ≥ 1 or RS ≤ 0. From equation (B7) we easily obtain equation (B1).

(b) To prove equation (B2) we note that for R > 1 and S > 0

\left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S \leq [\max_i P(C_i/x)]^{S(R-1)}.    (B8)

For S > 0 we obtain by taking expectations on both sides of equation (B8)

G_{S,R}(C/X) \leq E_x\left[ [\max_i P(C_i/x)]^{S(R-1)} \right].    (B9)

Furthermore it is easy to see that for S < RS < 1 + S, Jensen's inequality yields

(1 - P_e)^{S(R-1)} = \left[ E_x[\max_i P(C_i/x)] \right]^{S(R-1)} \geq E_x\left[ [\max_i P(C_i/x)]^{S(R-1)} \right].    (B10)

From equations (B9) and (B10) we easily obtain equation (B2).

(c) For the proof of equation (B3) it suffices to show that

G_{S,R}(C/X)^{1 - S(R-1)} \geq m^{-S^2(R-1)}.    (B11)

It has been shown (16) that G_{S,R}(C/X) \geq m^{S(1-R)}, with equality if and only if all probabilities are equal. Noting that S > 0, R > 1, and RS ≥ 1, it follows that

m^{S(1-R)(1 - S(R-1))} \geq m^{-S^2(R-1)},    (B12)

from which equation (B11) follows and thus equation (B3).

Proof of Theorem 3. First we write equation (42) as three separate inequalities for S > 0, R > 1, SR ≤ 1:

P_e \geq 1 - G_{S,R}(C/X)    (B13)

P_e \leq 1 - G_{S,R}(C/X)^{1/S(R-1)}    (B14)

1 - G_{S,R}(C/X)^{1/S(R-1)} \leq 1 - m^{-1/R} G_{S,R}(C/X)^{1/RS}.    (B15)

(a) From S > 0 and R > 1 we see that equation (B5) holds. Noting that RS ≤ 1 leads to

E_x\left[ [\max_i P(C_i/x)]^{RS} \right] \geq E_x[\max_i P(C_i/x)] = 1 - P_e.    (B16)

Equations (B5) and (B16) together prove equation (B13).

(b) It has already been shown that equation (B2) holds for S > 0, R > 1, S < RS < 1 + S. Since RS ≤ 1 in Theorem 3, it is easy to see that equation (B2) also holds under the conditions of Theorem 3.

(c) For S > 0, R > 1 it can be shown (16) that

\left[ \sum_{i=1}^{m} P(C_i/x)^R \right]^S \geq \left[ \sum_{i=1}^{m} m^{-R} \right]^S.    (B17)

Taking expectations on both sides of equation (B17) gives

G_{S,R}(C/X) \geq m^{S(1-R)}.    (B18)

For S > 0, R > 1 we find that RS(R-1) > 0, so that equation (B18) can be modified to obtain

G_{S,R}(C/X)^{1/RS(R-1)} \geq m^{-1/R},    (B19)

from which equation (B15) easily follows.

Proof of Theorem 4. In this case we have to prove the inequalities

P_e \geq 1 - G_{S,R}(C/X)^{1/RS}    (B20)

P_e \leq 1 - G_{S,R}(C/X)    (B21)

1 - G_{S,R}(C/X) \leq 1 - m^{-S} G_{S,R}(C/X).    (B22)

(a) Since equation (B1) holds for S > 0, R > 1, RS ≥ 1, it also holds for S > 0, R > 1, RS ≥ S + 1. Thus, equation (B20) is implied by equation (B1).

(b) For S > 0, R > 1 we have equation (B9). From RS ≥ S + 1, or S(R-1) ≥ 1, we then obtain, using equation (B9), that

E_x\left[ [\max_i P(C_i/x)]^{S(R-1)} \right] \leq E_x[\max_i P(C_i/x)] = 1 - P_e.    (B23)

Combining equations (B23) and (B9) leads to equation (B21).

(c) The proof of equation (B22) follows by noting that for S > 0 we have m^{-S} < 1.

REFERENCES

1. C. H. Chen, Statistical Pattern Recognition. Hayden, Rochelle Park, New Jersey (1973).
2. C. H. Chen, On information and distance measures, error bounds, and feature selection, Inf. Sci. 10, 159-171 (1976).
3. P. A. Devijver, On a new class of bounds on Bayes risk in multihypothesis pattern recognition, IEEE Trans. Comput. C-23, 70-80 (1974).
4. L. Györfi and T. Nemetz, On the dissimilarity of probability measures, Proc. Colloq. Inf. Theory, Keszthely, Hungary (1975).
5. T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. Commun. Tech. COM-15, 52-60 (1967).
6. L. Kanal, Patterns in pattern recognition: 1968-1974, IEEE Trans. Inf. Theory IT-20, 697-722 (1974).
7. J. Kittler, Mathematical methods of feature selection in pattern recognition, Int. J. Man-Mach. Studies 7, 609-637 (1975).
8. T. Lissack and K. S. Fu, Error estimation in pattern recognition via Lα-distance between posterior density functions, IEEE Trans. Inf. Theory IT-22, 34-45 (1976).
9. G. T. Toussaint, On the divergence between two distributions and the probability of misclassification of several decision rules, Proc. Sec. Int. Jt. Conf. Pattern Recognition, Copenhagen, Denmark, 27-34 (1974).
10. P. M. Trouborst, E. Backer, D. E. Boekee and Y. Boxma, New families of probabilistic distance measures, Proc. Sec. Int. Jt. Conf. Pattern Recognition, Copenhagen, Denmark, 3-5 (1974).
11. I. Csiszar, Information-type measures of difference of probability distributions and indirect observations, Stud. Sci. Math. Hungar. 2, 299-318 (1967).


12. H. Kobayashi and J. B. Thomas, Distance measures and related criteria, Proc. 5th Ann. Allerton Conf. Circuit Syst. Theory, Urbana, 491-500 (1967).
13. I. Vajda, On the f-divergence and singularity of probability measures, Per. Math. Hungar. 2, 223-234 (1972).
14. J. T. Chu and J. C. Chueh, Error probability in decision functions for character recognition, J. Ass. Comput. Mach. 14, 273-280 (1967).

15. D. G. Lainiotis, A class of upper bounds on probability of error for multihypothesis pattern recognition, IEEE Trans. Inf. Theory IT-15, 730-731 (1969).
16. J. C. A. van der Lubbe, R-norm information and a general class of measures for certainty and information, Thesis, Delft University of Technology, Delft, The Netherlands (1977) (in Dutch).
17. G. H. Hardy, J. E. Littlewood and G. Polya, Inequalities. Cambridge University Press, London (1973).

About the Author - DICK E. BOEKEE was born in The Hague, The Netherlands, on 17 March 1943. He received his M.Sc. and Ph.D. degrees in Electrical Engineering from the Delft University of Technology, Delft, The Netherlands, in 1970 and 1977, respectively. He is currently a scientific staff member at the Department of Electrical Engineering at the Delft University of Technology. His research interests include statistical information theory, signal processing and their applications in pattern recognition and image processing. Dr. Boekee is a member of the Institute of Electrical and Electronics Engineers, the Netherlands Electronics and Radio Society and the Netherlands Society for Statistics.

About the Author - JAN C. A. VAN DER LUBBE was born in Alkmaar, The Netherlands, on June 25, 1953. He received his M.Sc. degree in Electrical Engineering from the Delft University of Technology in June 1977. Presently he is working towards his Ph.D. degree at the Delft University of Technology. His research interests include the meaning of information, information measures, distance measures, coding theory and pattern recognition.