An optimal orthonormal system for discriminant analysis


Pattern Recognition, Vol. 18, No. 2, pp. 139-144, 1985. Printed in Great Britain.

0031-3203/85 $3.00 + .00
Pergamon Press Ltd
© 1985 Pattern Recognition Society

AN OPTIMAL ORTHONORMAL SYSTEM FOR DISCRIMINANT ANALYSIS

TOSHIHIKO OKADA and SHINGO TOMITA

Faculty of Engineering, Yamaguchi University, Tokiwadai, Ube, 755 Japan (Received 16 July 1982; in revised form 22 March 1984; received for publication 2 May 1984)

Abstract--This paper proposes a new discriminant analysis with orthonormal coordinate axes of the feature space. In general, the number of coordinate axes of the feature space in the traditional discriminant analysis depends on the number of pattern classes; the discriminatory capability of the feature space is therefore considerably limited. The new discriminant analysis solves this problem completely. In addition, it is more powerful than the traditional one insofar as the discriminatory power and the mean error probability for the coordinate axes are concerned. This is also shown by a numerical example.

Discriminant analysis    Pattern recognition    Feature extraction    Feature selection    Dimensionality reduction

1. INTRODUCTION

The problem of feature extraction or selection is very important in pattern recognition. Among the methods of feature extraction or selection, discriminant analysis, found in Wilks(1) and Cooley and Lohnes,(2) is one of the well-known methods. This method realizes an optimal feature space under the generalized Fisher criterion, subject to the uncorrelatedness of the coordinate axes. However, the classification performance of the feature space is very limited, because the number of coordinate axes of the feature space generally depends on the number of pattern classes. In 1975, Foley and Sammon(3) proposed a method which overcomes this shortcoming under an orthonormality condition on the coordinate axes. However, that method is suitable only for the two-class case, not for the multiclass case, and it appears difficult to extend it to the multiclass case by Lagrange's method of indeterminate multipliers.

The purpose of this paper is twofold. First, it realizes the feature space by orthonormal axes maximizing the generalized Fisher criterion; for this, a new technique quite different from Lagrange's method is used. Second, it shows by computer simulation that the new discriminant analysis is superior to the traditional one in terms of discriminatory capability and mean error probability for the coordinate axes of the feature space. Section 2 presents definitions and a technique to obtain the orthonormal coordinate axes. Section 3 shows the effectiveness of the new discriminant analysis by a numerical example.

2. ORTHONORMAL COORDINATE AXES

Let us define the generalized Fisher criterion of discriminatory capability as

    J(\xi_i) = \frac{\xi_i^T B \xi_i}{\xi_i^T W \xi_i}                                (1)

where ξ_i is the ith n-dimensional coordinate axis vector onto which the data are projected, ξ_i^T is the transpose of ξ_i, m is the number of classes, N_i the number of samples in class i, N = N_1 + N_2 + ... + N_m the total number of samples, x_ij the jth n-dimensional sample vector of class i,

    \mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}

the mean vector of class i,

    \mu = \frac{1}{N} \sum_{i=1}^{m} N_i \mu_i

the overall mean vector,

    W_i = \frac{1}{N_i} \sum_{j=1}^{N_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T

the within-class scatter matrix of class i,

    W = \frac{1}{N} \sum_{i=1}^{m} N_i W_i

the weighted sum of the within-class scatter matrices, and

    B = \frac{1}{N} \sum_{i=1}^{m} N_i (\mu_i - \mu)(\mu_i - \mu)^T

the between-class scatter matrix. Since J(ξ_i) = J(aξ_i), the value of J(ξ_i) is independent of the magnitude of ξ_i. Therefore, without loss of generality, we shall normalize so that the norm of ξ_i is one, i.e. ||ξ_i||^2 = ξ_i^T ξ_i = 1.

In the traditional discriminant analysis, the coordinate axes ξ'_i have been constituted as the ones maximizing J under the condition of uncorrelatedness of the axes, i.e.

    {\xi'_i}^T G \, \xi'_j = 0    \quad \text{for } i \ne j

where G is a covariance matrix

    G = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{N_i} (x_{ij} - \mu)(x_{ij} - \mu)^T.

However, in general, under this condition we can obtain only m - 1 axes ξ'_i, i = 1, 2, ..., m - 1, with J(ξ'_i) > 0. For instance, when m = 2 we obtain only ξ'_1. In order to solve this problem, we consider realizing optimal coordinate axes for discriminant analysis under the condition of orthonormality, i.e.

    \xi_i^T \xi_j = \delta_{ij}

where δ_ij is Kronecker's delta.
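As a concrete illustration (not part of the original paper), the scatter matrices and the criterion J of equation (1) can be computed directly from labelled data. The sketch below assumes the samples are given as a list of per-class NumPy arrays; all function and variable names are our own.

```python
import numpy as np

def scatter_matrices(class_samples):
    """W (weighted within-class), B (between-class) and total scatter G
    from a list of (N_i, n) arrays, following the definitions above."""
    N = sum(len(X) for X in class_samples)
    mu = sum(len(X) * X.mean(axis=0) for X in class_samples) / N   # overall mean
    n = class_samples[0].shape[1]
    W = np.zeros((n, n))
    B = np.zeros((n, n))
    for X in class_samples:
        Ni, mi = len(X), X.mean(axis=0)
        Xc = X - mi
        W += (Xc.T @ Xc) / N                          # (N_i / N) * W_i
        B += (Ni / N) * np.outer(mi - mu, mi - mu)
    G = W + B                                          # total scatter (standard identity)
    return W, B, G

def J(xi, W, B):
    """Generalized Fisher criterion of equation (1) for a single axis xi."""
    return float(xi @ B @ xi) / float(xi @ W @ xi)
```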

We describe the construction of these orthonormal axes in the following subsections.

2.1. The first axis

The first axis ξ_1 maximizing J is given as the eigenvector corresponding to the largest eigenvalue in the eigenvalue problem

    W^{-1} B \xi = \lambda \xi.                                                        (2)

Here we have assumed that W^{-1} exists. The objective eigenvector can be obtained by computing the following recursive equation with an initial approximation vector a_0, i.e.

    a_{k+1} = W^{-1} B a_k,    k = 0, 1, 2, ...                                        (3)

till a_{k+1} ≈ a_k. This is the well-known power method in numerical analysis. The converged a_k, say ξ_1max, gives the first axis by normalizing so that

    \xi_1 = \xi_{1\max} / ||\xi_{1\max}||.                                             (4)

Then the value of the criterion of discriminatory capability is

    J(\xi_1) = \frac{\xi_1^T B \xi_1}{\xi_1^T W \xi_1} = \lambda_{1\max}               (5)

where λ_1max is the largest eigenvalue of W^{-1}B.
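A minimal sketch of this power iteration, assuming W is nonsingular and W^{-1}B has a dominant eigenvalue (the tolerance, iteration limit and starting vector are our own choices):

```python
def first_axis(W, B, tol=1e-10, max_iter=1000):
    """Power method on W^{-1}B (equations (2)-(5)); returns (xi_1, lambda_1max)."""
    n = W.shape[0]
    WinvB = np.linalg.solve(W, B)            # W^{-1} B without forming W^{-1} explicitly
    a = np.ones(n) / np.sqrt(n)              # initial approximation vector a_0
    for _ in range(max_iter):
        a_next = WinvB @ a
        norm = np.linalg.norm(a_next)
        if norm < 1e-15:                     # B vanishes along every remaining direction
            break
        a_next /= norm
        if np.linalg.norm(a_next - a) < tol:
            a = a_next
            break
        a = a_next
    lam = float(a @ B @ a) / float(a @ W @ a)   # equals J(xi_1) = lambda_1max at convergence
    return a, lam
```

The same routine is reused below for the lower-dimensional eigenproblems on W_2^{-1}B_2 and W_r^{-1}B_r.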

2.2. The second axis

We consider finding the second axis ξ_2 that satisfies ξ_2^T ξ_1 = 0 and maximizes J. Any axis satisfying ξ_2^T ξ_1 = 0 must lie in the (n - 1)-dimensional subspace orthogonal to the first axis ξ_1; the second axis should therefore be found as the axis maximizing J within this subspace. Let S^{n-1} denote this (n - 1)-dimensional subspace orthogonal to ξ_1. The subspace S^{n-1} can be spanned by the following n - 1 vectors v_i, i = 2, 3, ..., n, obtained by the Gram-Schmidt orthonormalization procedure, i.e.

    v_2 = \alpha_2 (I - v_1 v_1^T) \psi_2
    v_3 = \alpha_3 (I - v_1 v_1^T - v_2 v_2^T) \psi_3
    ...
    v_n = \alpha_n (I - v_1 v_1^T - v_2 v_2^T - \cdots - v_{n-1} v_{n-1}^T) \psi_n     (6)

where v_1 = ξ_1, I is the n × n identity matrix, ψ_i is an arbitrary n-dimensional vector which is linearly independent of v_j, j = 1, 2, ..., i - 1, and α_i is chosen so that ||v_i|| = 1. These v_i satisfy

    v_i^T v_j = \delta_{ij}.                                                           (7)
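The basis of equation (6) is easy to generate numerically. The helper below is our own sketch, with the arbitrary vectors ψ_i drawn at random; it completes any set of k orthonormal axes to a full orthonormal basis of R^n and returns the n - k complement vectors as columns. With k = 1 its output is exactly the matrix P_1 defined next in equation (8), and it is reused for P_{r-1} in Section 2.3.

```python
def complement_basis(axes):
    """Columns of the returned (n, n-k) matrix span the orthogonal
    complement of the k orthonormal rows of `axes` (equation (6))."""
    k, n = axes.shape
    V = [v for v in axes]                    # v_1, ..., v_k (already orthonormal)
    rng = np.random.default_rng(0)
    while len(V) < n:
        psi = rng.standard_normal(n)         # arbitrary vector psi_i
        for v in V:                          # Gram-Schmidt: remove existing components
            psi -= (v @ psi) * v
        norm = np.linalg.norm(psi)
        if norm > 1e-12:                     # keep only linearly independent candidates
            V.append(psi / norm)             # alpha_i normalizes to unit length
    return np.stack(V[k:], axis=1)           # [v_{k+1} ... v_n]
```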

Let us define an n × (n - 1) matrix

    P_1 = [v_2 \; v_3 \; \cdots \; v_n]                                                (8)

and transformations

    W_2 = P_1^T W P_1                                                                  (9)

    B_2 = P_1^T B P_1                                                                  (10)

which show how the matrices W and B are transformed by the projection onto S^{n-1}. The matrices W_2 and B_2 are of size (n - 1) × (n - 1). Computing the recursive equation as before,

    a_{k+1} = W_2^{-1} B_2 a_k,    k = 0, 1, 2, ...                                    (11)

we obtain the converged a_k, say ξ_2max, which is the eigenvector corresponding to the largest eigenvalue of W_2^{-1} B_2. By using the matrix P_1 to transform ξ_2max into an n-dimensional vector without changing its direction, we obtain the second axis as

    \xi_2 = P_1 \xi_{2\max} / ||\xi_{2\max}||.                                         (12)

Then the value of the criterion is

    J(\xi_2) = \frac{\xi_2^T B \xi_2}{\xi_2^T W \xi_2}
             = \frac{\xi_{2\max}^T P_1^T B P_1 \xi_{2\max}}{\xi_{2\max}^T P_1^T W P_1 \xi_{2\max}}
             = \frac{\xi_{2\max}^T B_2 \xi_{2\max}}{\xi_{2\max}^T W_2 \xi_{2\max}}
             = \lambda_{2\max}                                                         (13)

where λ_2max is the largest eigenvalue of W_2^{-1} B_2.
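Putting equations (8)-(13) together, the second axis can be computed by projecting W and B into the complement subspace, running the same power iteration there, and mapping the result back. A sketch reusing the helpers above (names are ours):

```python
def second_axis(W, B, xi1):
    """Second orthonormal axis xi_2 and lambda_2max (equations (8)-(13))."""
    P1 = complement_basis(xi1[None, :])   # n x (n-1) matrix [v_2 ... v_n], eq. (8)
    W2 = P1.T @ W @ P1                    # equation (9)
    B2 = P1.T @ B @ P1                    # equation (10)
    a, lam2 = first_axis(W2, B2)          # power iteration in S^{n-1}, eq. (11)
    xi2 = P1 @ a                          # back to n dimensions, eq. (12)
    return xi2 / np.linalg.norm(xi2), lam2
```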

2.3. The rth axis

In general, the rth axis ξ_r that satisfies the constraints ξ_r^T ξ_i = 0 for i = 1, 2, ..., r - 1, and maximizes J can be obtained as follows.

1. Span the (n - r + 1)-dimensional subspace S^{n-r+1} orthogonal to the r - 1 axes ξ_i, i = 1, 2, ..., r - 1, by using n - r + 1 vectors v_i:

    v_i = \alpha_i \Big( I - \sum_{j=1}^{i-1} v_j v_j^T \Big) \psi_i,    i = r, r + 1, ..., n        (14)

where v_j = ξ_j for j = 1, 2, ..., r - 1, and α_i and ψ_i are normalizing constants and arbitrary vectors as mentioned before, respectively. Define an n × (n - r + 1) matrix

    P_{r-1} = [v_r \; v_{r+1} \; \cdots \; v_n].                                       (15)

2. Transform W and B into matrices in S^{n-r+1}:

    W_r = P_{r-1}^T W P_{r-1}                                                          (16)

    B_r = P_{r-1}^T B P_{r-1}.                                                         (17)

3. Compute the recursive equation

    a_{k+1} = W_r^{-1} B_r a_k,    k = 0, 1, 2, ...                                    (18)

till a_{k+1} ≈ a_k.

4. Transform the converged a_k, say ξ_rmax, into an n-dimensional vector:

    \xi_r = P_{r-1} \xi_{r\max} / ||\xi_{r\max}||.                                     (19)

Then

    J(\xi_r) = \lambda_{r\max}                                                         (20)

where λ_rmax is the largest eigenvalue of W_r^{-1} B_r. The above procedure, if we desire, can be continued till the nth axis ξ_n. The n axes then mutually satisfy

    \xi_i^T \xi_j = \delta_{ij}                                                        (21)

and can be ordered according to

    J(\xi_1) \ge J(\xi_2) \ge \cdots \ge J(\xi_n).                                     (22)

Also, the n axes ξ_i, i = 1, 2, ..., n, constitute an optimal orthonormal system with respect to J.
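The whole procedure of steps 1-4 reduces to a short loop. The sketch below, our own wrapper around the helpers above and under the same assumptions (W nonsingular, converging power iteration), returns the axes as rows together with their criterion values:

```python
def orthonormal_discriminant_axes(W, B, n_axes=None):
    """Orthonormal axes xi_1, ..., xi_r maximizing J (Section 2.3)."""
    n = W.shape[0]
    n_axes = n if n_axes is None else n_axes
    xi, lam = first_axis(W, B)                    # r = 1
    axes, lams = [xi], [lam]
    for _ in range(1, n_axes):                    # r = 2, ..., n_axes
        P = complement_basis(np.array(axes))      # eqs (14)-(15): basis of S^{n-r+1}
        Wr, Br = P.T @ W @ P, P.T @ B @ P         # eqs (16)-(17)
        a, lam = first_axis(Wr, Br)               # eq. (18): power iteration
        xi = P @ a                                # eq. (19): back to R^n
        axes.append(xi / np.linalg.norm(xi))
        lams.append(lam)                          # eq. (20): J(xi_r) = lambda_r_max
    return np.array(axes), np.array(lams)
```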

3. EXAMPLE

In this section we show a numerical example to illustrate the effectiveness of our procedure. We choose a ten-dimensional, five-class case. The data set of class i consists of 200 samples generated from the normal distribution N(M_i, Σ_i) with the following parameters.

    M_1 = (10, 7, 6, 5, 1, 7, 2, 0, 5, 3)^T
    M_2 = (1, 4, 3, 9, 5, 1, 0, 7, -5, -3)^T
    M_3 = (-5, -9, 0, -4, -1, -5, 2, 6, -3, 6)^T
    M_4 = (2, 3, 7, 15, 12, 10, 1, 3, -4, -6)^T
    M_5 = (5, -8, -4, -6, -4, 3, 1, -5, -2, 8)^T

[Σ_1, ..., Σ_5: the 10 × 10 class covariance matrices, and W, B: the resulting within-class and between-class scatter matrices; numerical entries not reproduced.]
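An experiment of this type can be reproduced by drawing 200 samples per class and applying the procedure of Section 2. The sketch below uses the published mean vectors but, since the class covariances Σ_i are not reproduced above, substitutes illustrative stand-in covariances; it reuses the functions sketched earlier.

```python
means = np.array([
    [10,  7,  6,  5,  1,  7,  2,  0,  5,  3],
    [ 1,  4,  3,  9,  5,  1,  0,  7, -5, -3],
    [-5, -9,  0, -4, -1, -5,  2,  6, -3,  6],
    [ 2,  3,  7, 15, 12, 10,  1,  3, -4, -6],
    [ 5, -8, -4, -6, -4,  3,  1, -5, -2,  8],
], dtype=float)

rng = np.random.default_rng(0)
covs = [np.eye(10) for _ in means]            # stand-ins for Sigma_1, ..., Sigma_5
class_samples = [rng.multivariate_normal(m, C, size=200) for m, C in zip(means, covs)]

W, B, G = scatter_matrices(class_samples)
axes, lams = orthonormal_discriminant_axes(W, B)
print(np.round(lams, 3))                      # J(xi_1) >= J(xi_2) >= ... >= J(xi_10)
```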

Table 1. Values of the criterion and mean error probabilities

    i             1       2       3       4       5       6       7       8       9       10
    J(ξ_i)        220     157     48.6    23.2    16.7    9.95    5.78    1.90    0.590   0.054
    J(ξ'_i)       220     145     19.0    7.07    0       0       0       0       0       0
    P_e(ξ_i)      0.118   0.006   0.156   0.136   0.164   0.250   0.256   0.342   0.458   0.548
    P_e(ξ'_i)     0.118   0.136   0.286   0.268   --      --      --      --      --      --

Table 1 shows the experimental results of the new discriminant analysis and of the traditional one. In Table 1, ξ'_i are the coordinate axes of the traditional discriminant analysis, and P_e(ξ_i) and P_e(ξ'_i) are the mean error probabilities for ξ_i and ξ'_i, respectively, calculated on a computer. From the table we see that, for all i,

    J(\xi_i) \ge J(\xi'_i)
    P_e(\xi_i) \le P_e(\xi'_i)

where equality holds only when i = 1. We should emphasize that these results are by no means special; similar results were observed frequently in the many examples we ran.
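The paper does not state which classifier was used to compute the mean error probabilities P_e. One simple possibility, shown below purely as an assumption, is to project the samples onto a single axis and classify each projected sample by the nearest projected class mean, taking the resulting misclassification rate as the error estimate for that axis.

```python
def mean_error_probability(axis, class_samples):
    """Empirical error rate on one axis, using a nearest-projected-class-mean
    rule (an assumption; the paper does not specify the classifier)."""
    proj = [X @ axis for X in class_samples]           # 1-D projections per class
    centers = np.array([p.mean() for p in proj])       # projected class means
    errors = total = 0
    for i, p in enumerate(proj):
        pred = np.abs(p[:, None] - centers[None, :]).argmin(axis=1)
        errors += int(np.sum(pred != i))
        total += len(p)
    return errors / total
```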

4. CONCLUSION

In this paper the authors have proposed a new discriminant analysis with orthonormal coordinate axes. The method is a generalization of that of Foley and Sammon and overcomes the limitations of the traditional discriminant analysis. It is also more powerful in terms of the values of J(ξ_i) and P_e(ξ_i). For these reasons, we claim that the new discriminant analysis is superior to the traditional one.

SUMMARY

This paper proposes a new discriminant analysis with orthonormal coordinate axes of the feature space. We define the generalized Fisher criterion of discriminatory capability as

    J(\xi_i) = \frac{\xi_i^T B \xi_i}{\xi_i^T W \xi_i}

where ξ_i is the ith n-dimensional coordinate axis vector onto which the data are projected, ξ_i^T is the transpose of ξ_i, m the number of pattern classes, N_i the number of samples in class i, N = N_1 + N_2 + ... + N_m the total number of samples, x_ij the jth n-dimensional sample vector of class i,

    \mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}

the mean vector of class i,

    \mu = \frac{1}{N} \sum_{i=1}^{m} N_i \mu_i

the overall mean vector,

    W_i = \frac{1}{N_i} \sum_{j=1}^{N_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T

the within-class scatter matrix of class i,

    W = \frac{1}{N} \sum_{i=1}^{m} N_i W_i

the weighted sum of the within-class scatter matrices, and

    B = \frac{1}{N} \sum_{i=1}^{m} N_i (\mu_i - \mu)(\mu_i - \mu)^T

the between-class scatter matrix. We assume that W is nonsingular. The first coordinate axis ξ_1 that maximizes J can be obtained as the eigenvector corresponding to the largest eigenvalue of the matrix W^{-1}B. Then the value of the criterion is J(ξ_1) = λ_1max, where λ_1max is the largest eigenvalue of W^{-1}B.

Next, the second coordinate axis ξ_2 that satisfies ξ_2^T ξ_1 = 0 and maximizes J must lie in the (n - 1)-dimensional subspace orthogonal to ξ_1. Let this subspace be S^{n-1}. The subspace S^{n-1} can be spanned by the following n - 1 vectors v_i, obtained by the Gram-Schmidt orthonormalization procedure:

    v_i = \alpha_i \Big( I - \sum_{j=1}^{i-1} v_j v_j^T \Big) \psi_i,    i = 2, 3, ..., n

where v_1 = ξ_1, I is the identity matrix, ψ_i is an arbitrary vector which is linearly independent of v_j, j = 1, 2, ..., i - 1, and α_i is chosen so that ||v_i|| = 1. Let us define an n × (n - 1) matrix

    P_1 = [v_2 \; v_3 \; \cdots \; v_n]

and transformations

    W_2 = P_1^T W P_1

    B_2 = P_1^T B P_1

which represent W and B in S^{n-1}. Then ξ_2 can be obtained as the ξ in S^{n-1} that maximizes J. Such a ξ, say ξ_2max, is the eigenvector corresponding to the largest eigenvalue of the matrix W_2^{-1} B_2. Therefore, the second coordinate axis is

    \xi_2 = P_1 \xi_{2\max} / ||\xi_{2\max}||.

Then

    J(\xi_2) = \lambda_{2\max}

where λ_2max is the largest eigenvalue of W_2^{-1} B_2.

In general, we can obtain the rth orthonormal coordinate axis ξ_r that satisfies ξ_r^T ξ_i = 0, i = 1, 2, ..., r - 1, and maximizes J as follows.
1. Span the (n - r + 1)-dimensional subspace S^{n-r+1} orthogonal to ξ_i, i = 1, 2, ..., r - 1, by the following n - r + 1 vectors v_i:

    v_i = \alpha_i \Big( I - \sum_{j=1}^{i-1} v_j v_j^T \Big) \psi_i,    i = r, r + 1, ..., n

where v_j = ξ_j, j = 1, 2, ..., r - 1, and α_i and ψ_i are normalizing constants and arbitrary vectors as mentioned before, respectively. Let

    P_{r-1} = [v_r \; v_{r+1} \; \cdots \; v_n].

2. Compute W and B in S^{n-r+1}:

    W_r = P_{r-1}^T W P_{r-1}

    B_r = P_{r-1}^T B P_{r-1}.

3. Compute the eigenvector ξ_rmax corresponding to the largest eigenvalue of the matrix W_r^{-1} B_r.
4. Transform ξ_rmax into an n-dimensional vector:

    \xi_r = P_{r-1} \xi_{r\max} / ||\xi_{r\max}||.

Then

    J(\xi_r) = \lambda_{r\max}.

Here λ_rmax is the largest eigenvalue of W_r^{-1} B_r. The above procedure, if we desire, can be continued till the nth coordinate axis ξ_n. The n coordinate axes ξ_i, i = 1, 2, ..., n, can be ordered according to

    J(\xi_1) \ge J(\xi_2) \ge \cdots \ge J(\xi_n)

and constitute an optimal orthonormal system with respect to J. Furthermore, the paper presents a numerical example in which, for all i,

    J(\xi_i) \ge J(\xi'_i)
    P_e(\xi_i) \le P_e(\xi'_i)

where ξ'_i are the coordinate axes of the traditional discriminant analysis, and P_e(ξ_i) and P_e(ξ'_i) are mean error probabilities for ξ_i and ξ'_i calculated on a computer, respectively.

REFERENCES

1. S. S. Wilks, Mathematical Statistics. Wiley, New York (1962).
2. W. W. Cooley and P. R. Lohnes, Multivariate Procedures for the Behavioral Sciences. Wiley, New York (1962).
3. D. H. Foley and J. W. Sammon Jr, An optimal set of discriminant vectors, IEEE Trans. Comput. C-24, 281-289 (1975).

About the Author--TOSHIHIKO OKADA was born in Yamaguchi, Japan, on 11 September 1940. He received a B.S. degree from Yamaguchi University in 1963. At present he is an Associate Professor in the Faculty of Engineering at Yamaguchi University. His research interests are the theory and application of feature selection in pattern recognition.

About the Author--SHINGO TOMITA was born in Hamamatsu, Japan, in 1934. He received a B.S. degree in Mathematics from the Science University of Tokyo, Tokyo, in 1958 and a Ph.D. degree in Engineering from Tohoku University, Sendai, Japan, in 1972. He joined NEC Corporation in 1958, where he worked on the mathematical design of several computers. From 1963 to 1975 he was with the Research Institute of Electrical Communication, Tohoku University. Since 1975 he has been a Professor at Yamaguchi University, Ube, and a technical member of the Japanese Government's future city planning for the Ube technopolis.