Pattern Recognition
Pergamon Press 1973. Vol. 5, pp. 353-363. Printed in Great Britain
Nonsupervised Classification Using the Principal Component

MASAMICHI SHIMURA and TOSHIO IMAI

Faculty of Engineering Science, Osaka University, Toyonaka, Osaka, Japan
(Received 22 August 1972 and in revised form 2 April 1973)

Abstract--This paper presents a mathematical model of a nonsupervised two-category classifier with a nonparametric learning method by using the first principal component. On the assumptions that the patterns of each category are clustered and that the mean point of all patterns used lies between the two clusters, the separating hyperplane contains the mean pattern point and is perpendicular to the line governed by the first principal component. The learning algorithm for obtaining the mean pattern vector and the first principal component is described, and some experimental results on random patterns are also presented.

Non-parametric methods    Pattern recognition    Signal detection    Mixture distribution    Unsupervised learning    Principal components
I. INTRODUCTION

Learning machines can be classified into two types, supervised and nonsupervised. In a supervised machine, the probabilistic structure of the patterns is estimated by using labeled patterns, the labels being provided by a teacher, and the decision scheme is then obtained by learning. In a nonsupervised machine such as the Decision-Directed Machine,(1,2) the patterns used are unlabeled, and the estimation of the unknown parameters must therefore be performed according to the machine's own decisions instead of labels given by a teacher. Another method of nonsupervised learning, in which a mixture distribution is estimated by using unlabeled patterns, is discussed in References (3, 4). The main shortcoming of such a machine is that the learning algorithm becomes rather complex, since no sufficient statistic exists, as mentioned in Reference (5). In particular, it is quite difficult for a nonsupervised machine with a nonparametric learning method to attain an acceptable performance, because no information regarding the input patterns is available. Jakowatz(6) presented a mathematical model of a nonsupervised machine with a nonparametric learning method; however, the analysis of his model is still incomplete. Clustering is one of the nonparametric learning methods and is applicable to the nonsupervised machine. Such a technique has been employed in self-learning systems.(7-10)

The present paper proposes a mathematical model of a nonsupervised learning machine with a nonparametric method using the first principal component. The first principal component, as is well known, is the normalized linear combination W_s^T X with maximum variance. That is, its coefficient vector W_s is the normalized eigenvector corresponding to the largest eigenvalue of the overall covariance matrix. The works most closely related to this classifying method using the first principal component are those of Ho and Agrawala,(9) Nagy and Tuong,(10) Cooper and Cooper(11) and Morishita.(12)
Cooper and Cooper proposed a nonsupervised adaptive signal detector, in which the eigenvector was obtained by the classical iterative method based on an estimate of the overall covariance matrix. Ho and Agrawala, and Nagy and Tuong, discussed a self-learning scheme using the first principal component obtained by a recursive procedure. Morishita showed that the first principal component can be found as the steady-state solution of a nonlinear differential equation.

One of the main techniques for obtaining the first principal component is the iterative "power method" discussed by Faddeev and Faddeeva.(13) To adopt this technique for pattern recognition, however, the overall covariance would have to be known or estimated by some learning procedure. The former case, a fixed and finite training sample of patterns, has been treated in References (9, 10); the latter case, estimation of the covariance, has been studied in Reference (11). The technique discussed in this paper, in contrast, is based on the direct estimation of the first principal component using unlabeled patterns. This technique is helpful in constructing a separating hyperplane, especially when sufficiently large memory is not available to store input patterns presented in a sequence, or when the properties of the input patterns change slowly with time and cannot be given as stochastic parameters. Thus the major advantage of the machine is that a nonparametric learning method is employed, so that no a priori knowledge about the input patterns is needed. Note that the use of this technique requires the assumption that the input patterns are clustered and separable by a surface perpendicular to the principal component. So far as the authors know, however, no other acceptable method has been developed for the case of unknown pattern distribution.
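For reference, the classical power method mentioned above can be stated in a few lines. The sketch below is our own illustration (in Python with NumPy, not part of the original paper); it assumes the overall covariance matrix is already known or estimated, which is exactly the requirement the direct scheme of this paper avoids.

```python
import numpy as np

def power_method(cov, n_iter=200, seed=0):
    """Classical power iteration for the first principal component of a
    known (or already estimated) covariance matrix `cov`.

    Converges to the eigenvector of the largest eigenvalue, provided that
    eigenvalue is simple and the starting vector is not orthogonal to it.
    """
    w = np.random.default_rng(seed).standard_normal(cov.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w = cov @ w              # multiply by the covariance ...
        w /= np.linalg.norm(w)   # ... and renormalize
    return w
```

The method of Section II, by contrast, updates an estimate of this eigenvector directly from the unlabeled pattern sequence, without forming the covariance matrix.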
II. DECISION RULE AND LEARNING ALGORITHM

We shall consider a nonsupervised decision scheme in which no a priori knowledge concerning the input patterns is given. Let X_k = (x_1^k, x_2^k, ..., x_n^k)^T (T: transpose) be an n-dimensional pattern vector at the k-th iteration of a learning procedure. In the following analysis, a two-category problem is considered. Assume that the patterns of each category are clustered and that the mean point Z_s of all the patterns used lies between the two clusters. Under this assumption, the direction of the coefficient vector W_s of the first principal component is parallel to the line through the centres of the two clusters, as shown in Fig. 1. An input pattern X is classified by a linear decision surface which contains Z_s and is perpendicular to W_s. That is, the decision rule is:

    decide:  X ∈ C_1  if W_s^T X > θ_s,
             X ∈ C_2  otherwise,                                    (1)

where θ_s = (W_s, Z_s).
Now we consider the learning algorithm for obtaining W_s, Z_s and θ_s. Let W_k = (w_1^k, w_2^k, ..., w_n^k)^T, Z_k = (z_1^k, z_2^k, ..., z_n^k)^T and θ_k be the coefficient vector, the mean pattern vector and the threshold value at the k-th iteration of the learning procedure, respectively. The (k+1)st coefficient vector W_{k+1} (AF) is given by normalizing the vector W'_{k+1} (AE), which is obtained by rotating W_k toward X_k by

    BE = a_k |X_k - Z_k|^2 BD,

as shown in Fig. 2.
FIG. 1. Geometric interpretation of the decision rule.
Similarly, the (k+1)st mean pattern vector Z_{k+1} is obtained by moving Z_k toward X_k by

    AG = a_k (X_k - Z_k).
Therefore, the learning algorithm is represented as follows:

    W'_{k+1} = W_k + a_k {(X_k - Z_k, W_k)(X_k - Z_k) - |X_k - Z_k|^2 W_k}          (2)

    W_{k+1} = W'_{k+1} / |W'_{k+1}|                                                 (3)

    Z_{k+1} = Z_k + a_k (X_k - Z_k)                                                 (4)

    θ_{k+1} = (W_{k+1}, Z_{k+1}),                                                   (5)

where a_k is the correction increment, which satisfies the following conditions:

    0 < a_k < 1,    Σ_{k=1}^∞ a_k = ∞,    Σ_{k=1}^∞ a_k^2 < ∞.                      (6)
The schematic diagram of the machine discussed here is shown in Fig. 3.
FIG. 2. Geometric interpretation of the learning algorithm.
FIG. 3. A block diagram of the unsupervised machine (blocks: input patterns, weights, output decision).
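The decision rule (1) and the updates (2)-(6) translate directly into code. The following sketch is our own illustration (Python/NumPy, not part of the original paper); it uses a_k = 1/k, the increment employed in the computer study of Section V, and all identifiers are ours.

```python
import numpy as np

class PrincipalComponentClassifier:
    """Nonsupervised two-category classifier of equations (1)-(6).

    W estimates the first principal component W_s, Z estimates the
    overall mean Z_s, and the threshold theta is (W, Z).
    """

    def __init__(self, n_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal(n_dim)
        self.W /= np.linalg.norm(self.W)      # start with |W_1| = 1
        self.Z = np.zeros(n_dim)
        self.k = 0

    def decide(self, x):
        # decision rule (1): category 1 if W^T x > theta = (W, Z), else 2
        return 1 if self.W @ x > self.W @ self.Z else 2

    def learn(self, x):
        self.k += 1
        a = 1.0 / self.k                      # correction increment, satisfies (6)
        d = x - self.Z
        # (2): rotate W toward x - Z
        w_prime = self.W + a * ((d @ self.W) * d - (d @ d) * self.W)
        # (3): renormalize
        self.W = w_prime / np.linalg.norm(w_prime)
        # (4): move Z toward x
        self.Z = self.Z + a * d
        # (5) is implicit: decide() always uses theta = (W, Z)
        return self.decide(x)
```

With a_k = 1/k and Z_1 = 0, the update (4) makes Z_k exactly the running sample mean of the patterns presented so far.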
III. CONVERGENCE OF THE LEARNING PROCEDURE

In this section, the convergence of the learning procedure for the coefficient vector W_k and the mean pattern vector Z_k is discussed. First, we show the convergence of Z_k.

Theorem 1
Assume that the following conditions are satisfied:

    (a)  E[|X_k|^2] < ∞                                                             (7)

    (b)  E[|Z_1|^2] < ∞.

Then Z_k converges to Z_s with probability one as k → ∞. That is,

    prob( lim_{k→∞} Z_k = Z_s ) = 1.                                                (8)
Proof

The proof is given by showing that the vectors of the learning procedure satisfy the conditions of Dvoretzky's theorem (special case).(14) Equation (4) can be rewritten as follows:

    Z_{k+1} = (1 - a_k) Z_k + a_k Z_s + a_k (X_k - Z_s).                            (9)

Hence we introduce the new vectors

    T_k(Z_1, Z_2, ..., Z_k) = (1 - a_k) Z_k + a_k Z_s
    Y_k = a_k (X_k - Z_s)
    F_k = (1 - a_k),                                                                (10)

where the same symbols as those in Reference (4) are used. Combining (9) and (10), we have

    |T_k(Z_1, Z_2, ..., Z_k) - Z_s| = F_k |Z_k - Z_s|.

From (6) and (10),

    F_k > 0,    Π_{k=1}^∞ F_k = 0
and from (7) and (10),

    Σ_{k=1}^∞ E[|Y_k|^2] = E[|X_k - Z_s|^2] Σ_{k=1}^∞ a_k^2 < ∞

    E[Y_k | Z_1, Z_2, ..., Z_k] = 0.

Therefore, all the conditions of Dvoretzky's theorem are satisfied, and the theorem is proved.

Next, we obtain the following theorem concerning the convergence of W_k.
Theorem 2

Assume that the following conditions are satisfied:

    (c)  E[|X_k|^4] < ∞                                                             (11)

    (d)  the maximum eigenvalue λ_max of the overall covariance matrix
         Σ = E[(X_k - Z_s)(X_k - Z_s)^T] is a simple root of the characteristic
         equation |Σ - λI| = 0.

Then W_k converges to W_s with probability one as k → ∞:

    prob( lim_{k→∞} |(W_s, W_k)|^2 = 1 ) = 1,                                       (12)

where W_s is the normalized eigenvector corresponding to λ_max.
Proof

From (2) and (3),

    (W_s, W_{k+1}) = {(1 - a_k |X_k - Z_k|^2)(W_s, W_k)
                      + a_k (X_k - Z_k, W_k)(W_s, X_k - Z_k)} / |W'_{k+1}|.

Therefore, we have the following equation:

    |(W_s, W_{k+1})|^2 = |(W_s, W_k)|^2 + A / |W'_{k+1}|^2,                         (13)

where

    A = 2a_k {(X_k - Z_k, W_k)(W_s, X_k - Z_k)(W_s, W_k) - (X_k - Z_k, W_k)^2 (W_s, W_k)^2}
        + a_k^2 {|X_k - Z_k|^2 (X_k - Z_k, W_k)^2 (W_s, W_k)^2
                 - 2|X_k - Z_k|^2 (X_k - Z_k, W_k)(W_s, X_k - Z_k)(W_s, W_k)
                 + (X_k - Z_k, W_k)^2 (W_s, X_k - Z_k)^2}                           (14)

and

    |W'_{k+1}|^2 = 1 - 2a_k {|X_k - Z_k|^2 - (X_k - Z_k, W_k)^2}
                   + a_k^2 |X_k - Z_k|^2 {|X_k - Z_k|^2 - (X_k - Z_k, W_k)^2}.      (15)

If k is sufficiently large, the second term of (14) and the second and third terms of (15) can be neglected, since lim_{k→∞} a_k = 0. Neglecting these terms, we have

    |W'_{k+1}|^2 = 1,
and from Theorem 1

    Z_k = Z_s

for large k. Therefore, equation (13) can be represented as follows:

    |(W_s, W_{k+1})|^2 = |(W_s, W_k)|^2 + 2a_k B_k,                                 (16)

where

    B_k = (X_k - Z_s, W_k)(W_s, X_k - Z_s)(W_s, W_k) - (X_k - Z_s, W_k)^2 (W_s, W_k)^2.
Taking the conditional (given W_k) expectation of (16) over the possible patterns X_k, we have

    E[|(W_s, W_{k+1})|^2 | W_k] = |(W_s, W_k)|^2 + 2a_k E[B_k | W_k],               (17)

where

    E[B_k | W_k] = (W_s, W_k) W_s^T Σ W_k - (W_s, W_k)^2 W_k^T Σ W_k.               (18)
Note that

    W_k^T Σ W_k < λ_max    (W_k ≠ W_s, |W_k| = 1),                                  (19)

since W_s is the normalized eigenvector corresponding to the maximum eigenvalue λ_max of Σ, that is,

    Σ W_s = λ_max W_s.                                                              (20)

From (18)-(20), therefore, we obtain

    E[B_k | W_k] = (λ_max - W_k^T Σ W_k)(W_s, W_k)^2 > 0.                           (21)

Thus, from (17) and (21), for large k

    E[1 - |(W_s, W_{k+1})|^2 | W_k] < 1 - |(W_s, W_k)|^2,                           (22)
so that {-(1 - |(W_s, W_k)|^2)} is a nonpositive submartingale.(15) Therefore, |(W_s, W_k)|^2 converges to a constant with probability one as k → ∞. That is,

    prob( lim_{k→∞} |(W_s, W_k)|^2 = u ) = 1    (u = const.).                       (23)
Next, we shall show that u must be unity. If

    lim_{k→∞} E[B_k | W_k] = (λ_max - W_∞^T Σ W_∞)(W_s, W_∞)^2 > 0,

then from (6) and (16) we have |(W_s, W_k)|^2 → ∞, which contradicts |(W_s, W_k)|^2 ≤ 1. From (21), therefore,

    lim_{k→∞} E[B_k | W_k] = 0,

and then W_∞ must satisfy the following equation:

    (λ_max - W_∞^T Σ W_∞)(W_s, W_∞)^2 = 0.

Accordingly, |(W_s, W_∞)|^2 = 1 or 0. Since the expectation of |(W_s, W_k)|^2 is monotonically increasing, we have |(W_s, W_∞)|^2 = u = 1. This proves the theorem.
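As an informal check of Theorems 1 and 2 (ours, not part of the original paper), the short simulation below generates two Gaussian clusters as in Section IV, runs the updates (2)-(4) with a_k = 1/k, and reports the quantities 1 - |(W_s, W_k)|^2 and |Z_s - Z_k| that are plotted in Figs. 5 and 6; both should approach zero as k grows. All names and parameter values are our own choices.

```python
import numpy as np

def convergence_check(n=20, eta=20.0, sigma=1.0, n_steps=2000, seed=1):
    """Empirical check of Theorems 1 and 2 for two Gaussian clusters.

    S1 = S and S2 = 0 with |S|^2 = eta * sigma^2, so that Z_s = S/2 and
    W_s = S/|S| (cf. equations (25) and (28) of Section IV).
    """
    rng = np.random.default_rng(seed)
    S = rng.standard_normal(n)
    S *= np.sqrt(eta) * sigma / np.linalg.norm(S)
    Z_s, W_s = S / 2.0, S / np.linalg.norm(S)

    W = rng.standard_normal(n)
    W /= np.linalg.norm(W)
    Z = np.zeros(n)
    for k in range(1, n_steps + 1):
        x = (S if rng.random() < 0.5 else 0.0) + sigma * rng.standard_normal(n)
        a = 1.0 / k                                    # correction increment (6)
        d = x - Z
        w_prime = W + a * ((d @ W) * d - (d @ d) * W)  # (2)
        W = w_prime / np.linalg.norm(w_prime)          # (3)
        Z = Z + a * d                                  # (4)
    return 1.0 - (W_s @ W) ** 2, np.linalg.norm(Z_s - Z)

# Both returned quantities should be small after a few thousand patterns:
# print(convergence_check())
```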
IV. GAUSSIAN DATA

Consider the case in which the input patterns are normally distributed, in order to compare the method discussed here with other, parametric methods. The problem is to detect a signal embedded in additive Gaussian noise. Let the measurement vector X_k be represented as

    X_k = S_1 + N_k ∈ C_1
    X_k = S_2 + N_k ∈ C_2,                                                          (24)

where S_1 and S_2 are the desired signals from categories C_1 and C_2, respectively, and N_k is a Gaussian random vector with mean vector 0 and covariance matrix σ^2 I. If the additive Gaussian noise is not white, a situation is sometimes encountered wherein the classification criterion given by (1) does not lead to a reasonable performance. This is because, in such a case, the first principal component is no longer an important characteristic of the input patterns. Although the "prewhitening" technique is available, the problem is still difficult if the covariance is not known; we therefore assume that the covariance is a diagonal matrix. Assume also that the probability of occurrence of each category is the same, that is, p(C_1) = p(C_2) = 1/2. In this case, the overall mean vector and overall covariance matrix become

    E[X_k] = (S_1 + S_2)/2                                                          (25)
    Σ = σ^2 I + (S_1 - S_2)(S_1 - S_2)^T / 4.

The characteristic equation is

    |(σ^2 - λ) I + (S_1 - S_2)(S_1 - S_2)^T / 4| = 0,

i.e.

    (σ^2 - λ)^{n-1} (σ^2 - λ + |S_1 - S_2|^2 / 4) = 0.
Therefore,

    λ = σ^2  (n - 1 multiple roots)    or    λ = σ^2 + |S_1 - S_2|^2 / 4.           (26)

Obviously, the maximum eigenvalue

    λ_max = σ^2 + |S_1 - S_2|^2 / 4                                                 (27)

is the simple root.
is the simple root. The solution vector W, which is the eigenvector corresponding to 2max, satisfies the following equation:
w , ~ ( ~ t +tSl -
s2){sl
-
s2)T/4) = ;~,~xw $ 2 4- IS1 - $212/4). ---- W,T(0-N
Consequently, we obtain $1 - $ 2 W~ = $2------~' ISx -
(28)
since |W_s| = 1. On the other hand, the Bayes minimum-error decision rule is as follows:

    decide:  X ∈ C_1  if (S_1 - S_2)^T X > Δ,
             X ∈ C_2  otherwise,                                                    (29)

where

    Δ = (S_1 - S_2)^T (S_1 + S_2) / 2.
Substitution of (25) and (28) into (29) leads to the following decision rule:

    decide:  X ∈ C_1  if W_s^T X > (W_s, (S_1 + S_2)/2) = (W_s, Z_s),
             X ∈ C_2  otherwise,                                                    (30)

which is just the same as the decision rule given by (1). Therefore, the machine proposed here converges to the Bayes minimum-error machine.
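The eigenstructure claimed in (26)-(28) and the equivalence of (30) with the Bayes rule (29) are easy to confirm numerically. The snippet below is our own check (not from the paper); the signal pair is drawn at random and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 20, 1.0
S1, S2 = rng.standard_normal(n), rng.standard_normal(n)
diff = S1 - S2

# Overall covariance (25): Sigma = sigma^2 I + (S1 - S2)(S1 - S2)^T / 4
Sigma = sigma2 * np.eye(n) + np.outer(diff, diff) / 4.0
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam_max, w_s = eigvals[-1], eigvecs[:, -1]

# (27): lambda_max = sigma^2 + |S1 - S2|^2 / 4
assert np.isclose(lam_max, sigma2 + diff @ diff / 4.0)
# (28): W_s is (S1 - S2)/|S1 - S2|, up to an irrelevant sign
assert np.isclose(abs(w_s @ (diff / np.linalg.norm(diff))), 1.0)

# (29) vs (30): the Bayes rule and rule (1) give identical decisions,
# since (30) is (29) divided by |S1 - S2|
W_s, Z_s = diff / np.linalg.norm(diff), (S1 + S2) / 2.0
for _ in range(1000):
    x = rng.standard_normal(n)
    assert (diff @ x > diff @ Z_s) == (W_s @ x > W_s @ Z_s)
```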
V. COMPUTER STUDY

Some results of the computer simulation of the machine are presented below. In the computer study, 20-dimensional (n = 20) patterns of normal distribution are classified into two categories C_1 and C_2, where p(C_1) = p(C_2) = 1/2 and S_1 = S, S_2 = 0. The learning procedure of the machine is illustrated in Fig. 4 as the probability of error (averaged over three runs) vs. the number of learning iterations. In the experiments, the signal S is a random sample from a Gaussian distribution with mean vector 0 and covariance matrix σ^2 I, and the correction increment a_k is a_k = 1/k. The probability of Bayes minimum error P_B is indicated by an arrow in Fig. 4.
FIG. 4. Learning process of the machine (probability of error vs. number of learning iterations).
The importance of the i-th principal component w_i^T X is measured by the ratio β_i of the variance λ_i to the total variance Σ_{i=1}^n λ_i = trace Σ,(16) i.e.

    β_i = λ_i / trace Σ,                                                            (31)

where w_i is the normalized eigenvector corresponding to the eigenvalue λ_i of the overall covariance matrix Σ, and λ_1 ≥ λ_2 ≥ ... ≥ λ_n. In this case, the values of β_i (i = 1, 2, ..., 20) become

    β_i = (1 + η/4) / (n + η/4)    for i = 1,
    β_i = 1 / (n + η/4)            for i ≠ 1,                                       (32)
where

    η = (S_1 - S_2)^T (S_1 - S_2) / σ^2.                                            (33)
The relations of η, S/N ratio, P_B and β_i are given in Table 1. In addition, the convergence processes of 1 - |(W_s, W_k)|^2 and |Z_s - Z_k| are plotted in Figs. 5 and 6, respectively.

TABLE 1. THE RELATIONS OF η, S/N RATIO, P_B AND β_i (i = 1, ..., 20) WHEN n = 20

    η     S/N ratio (dB)     P_B       β_1       β_i (i ≠ 1)
    20         0             0.012     0.240     0.040
    10        -3             0.056     0.155     0.044
     4        -7             0.158     0.095     0.047
     1       -13             0.308     0.061     0.049

It is seen from Fig. 4 that when the signal-to-noise ratios S/N are 0, -3 and -7 dB, the probability of error converges closely to the probability of Bayes minimum error after 100, 200 and 500 iterations in the learning phase, respectively. The experimental results show that the learning rate is uniformly slow when the signal-to-noise ratio is comparatively small, e.g. -13 dB.
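The entries of Table 1 can be reproduced from (32) and (33). The snippet below is our own check (not from the paper); it assumes the standard Bayes-error expression P_B = Q(√η / 2) for two equiprobable Gaussian categories with common covariance σ^2 I, and it takes the S/N ratio to be defined per component as η/n. Under these assumptions it matches the tabulated values to within one unit in the last digit.

```python
import math

def table1_row(eta, n=20):
    """Recompute one row of Table 1 from eta = |S1 - S2|^2 / sigma^2."""
    q = lambda x: 0.5 * math.erfc(x / math.sqrt(2.0))  # Gaussian tail Q(x)
    snr_db = 10.0 * math.log10(eta / n)                # assumed per-component S/N
    p_bayes = q(math.sqrt(eta) / 2.0)                  # assumed Bayes error formula
    beta_1 = (1.0 + eta / 4.0) / (n + eta / 4.0)       # (32), i = 1
    beta_i = 1.0 / (n + eta / 4.0)                     # (32), i != 1
    return snr_db, p_bayes, beta_1, beta_i

for eta in (20, 10, 4, 1):
    print(eta, ["%.3f" % v for v in table1_row(eta)])
# eta = 20 gives S/N = 0 dB, P_B = 0.013, beta_1 = 0.240, beta_i = 0.040
```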
FIG. 5. Convergence process of 1 - |(W_s, W_k)|^2.

FIG. 6. Convergence process of |Z_s - Z_k|.
This is because the difference between β_1 and β_i (i = 2, ..., 20) also becomes comparatively small, and hence it becomes more difficult to find the first principal component. However, this difficulty is not peculiar to this machine but is a common problem for any nonsupervised machine.

VI. REMARKS

In this paper, we have discussed a mathematical model of a nonsupervised machine with a nonparametric learning method using the first principal component, and we have presented some experimental results on random patterns. An analysis of its asymptotic behaviour shows that the machine converges to the Bayes minimum-error classifier if the input patterns are assumed to be samples from a Gaussian distribution, and this is also verified by the computer experiment.

Note that if the assumption on the input patterns made in this paper is violated, a situation may sometimes be encountered in which the principal component is not an important feature of the input patterns. In such a case, some other technique, such as "prewhitening", may help to obtain a reasonable performance if the overall covariance is known or can be estimated. Nevertheless, we believe that the method using the first principal component is still worthwhile because of its optimal performance, simple algorithm and nonparametric learning. It is also useful as a "starting method" for other nonsupervised machines that suffer from poor initial conditions.

We add that the estimation function of the machine can be written as

    J(W, Z) = E[|X - Z|^2 + |X - Z - (X - Z, W) W|^2]
            = 2 E[|X - Z|^2] - W^T E[(X - Z)(X - Z)^T] W.

The vectors W_s and Z_s obtained in this paper minimize this estimation function; that is,

    min_{Z, W, |W|=1} J(W, Z) = J(W_s, Z_s).

This can easily be proved and the proof is omitted.
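For completeness, here is a brief sketch of the omitted argument (ours, not the authors'), using E[X] = Z_s, Σ = E[(X - Z_s)(X - Z_s)^T] and |W| = 1:

    J(W, Z) = 2 E[|X - Z_s|^2] + 2|Z - Z_s|^2 - W^T Σ W - (W^T (Z_s - Z))^2.

Since (W^T (Z_s - Z))^2 ≤ |Z - Z_s|^2 when |W| = 1, the Z-dependent part 2|Z - Z_s|^2 - (W^T (Z_s - Z))^2 is at least |Z - Z_s|^2 ≥ 0, with equality only for Z = Z_s. The remaining term -W^T Σ W is minimized over |W| = 1 by the eigenvector of the largest eigenvalue, i.e. W = W_s, so that min J = 2 E[|X - Z_s|^2] - λ_max, attained at (W_s, Z_s).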
REFERENCES

1. H. J. Scudder, Adaptive communication receivers, IEEE Trans. IT-11 (2), 167-174 (1965).
2. K. Tanaka and S. Tamura, Some considerations on a type of pattern recognition using nonsupervised learning procedure, in IFAC Int'l Symp. Technical and Biological Problems of Control, Yerevan, Armenia (1968).
3. E. A. Patrick and J. C. Hancock, Nonsupervised sequential classification and recognition of patterns, IEEE Trans. IT-12 (3), 362-372 (1966).
4. Y. T. Chien and K. S. Fu, On Bayesian learning and stochastic approximation, IEEE Trans. SSC-3 (1), 28-38 (1967).
5. J. Spragins, A note on the iterative application of Bayes' rule, IEEE Trans. IT-11 (4), 544-549 (1965).
6. C. V. Jakowatz et al., Adaptive waveform recognition, Proc. 4th London Symp. on Information Theory, pp. 317-326 (1961).
7. G. Nagy and G. L. Shelton, Self-corrective character recognition system, IEEE Trans. IT-12 (2), 215-222 (1966).
8. K. Fukunaga and W. L. G. Koontz, A criterion and an algorithm for grouping data, IEEE Trans. C-19 (10), 917-923 (1970).
9. Y. C. Ho and A. K. Agrawala, On the self-learning scheme of Nagy and Shelton, Proc. IEEE 55 (10), 1764-1765 (1967).
10. G. Nagy and N. Tuong, On a theoretical pattern recognition model of Ho and Agrawala, Proc. IEEE 56 (5), 1108-1109 (1968).
11. D. B. Cooper and P. W. Cooper, Adaptive pattern recognition and detection without supervision, 1964 IEEE Int'l Conv. Rec., pt. 1, 246-256 (1964).
12. I. Morishita, Analysis of an adaptive threshold logic unit, IEEE Trans. C-19 (12), 1181-1192 (1970).
13. D. K. Faddeev and V. N. Faddeeva, Computational Methods of Linear Algebra, p. 179. W. H. Freeman, London (1963).
14. A. Dvoretzky, On stochastic approximation, Proc. 3rd Berkeley Symp. on Math. Stat. and Prob., Vol. 1, 39-55 (1956).
15. J. L. Doob, Stochastic Processes, p. 324. John Wiley (1953).
16. D. F. Morrison, Multivariate Statistical Methods, p. 226. McGraw-Hill (1967).
17. H. J. Scudder, Probability of error of some adaptive pattern-recognition machines, IEEE Trans. IT-11 (4), 363-371 (1965).