ORIGINAL CONTRIBUTION

Iterative Improvement of a Nearest Neighbor Classifier

HUNG-CHUN YAU AND MICHAEL T. MANRY
University of Texas at Arlington

Abstract—In practical pattern recognition applications, the nearest neighbor classifier (NNC) is often used because it does not require a priori knowledge of the joint probability density of the input feature vectors. However, the NNC is not optimal for a small number of example vectors. As the number of example vectors is increased, the error probability of the NNC approaches that of the Bayesian classifier, but at the same time the computational complexity of the NNC increases. Also, the NNC is not optimized with respect to the training data. In this paper, we attack these problems by mapping the NNC to a sigma-pi neural network, to which it is partially isomorphic. A modified form of back propagation (BP) learning is then developed and used to improve the performance of the NNC. As examples, we apply our approach to the problems of handprinted numeral recognition and geometrical shape recognition. Significant improvements in classification error percentages are observed for both the training and testing data.

Keywords—Nearest neighbor classifier, Sigma-pi network, Deformation-invariant features, Back propagation, Character recognition, Shape recognition, Isomorphic classifiers.

Requests for reprints should be sent to Prof. Michael T. Manry, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76019.

1. INTRODUCTION
As pointed out by Duda and Hart (1973) and Fukunaga (1972), the nearest neighbor classifier (NNC) approximates the minimum error Bayesian classifier in the limit as the number of reference vectors gets larger. When the feature vector joint probability density is unknown, the NNC would be the preferred classifier, except for two problems. First, a prohibitive amount of computation is required for its use. Second, the NNC's performance is usually not optimized with respect to the training data. As more hardware for parallel processing becomes available (neural or conventional), the first problem will be solved. Neural networks have already been used to attack the second problem. Lippmann (1987) has pointed out that the multilayer perceptron is very similar to the NNC and can be used as a classifier. As illustrated in Figure 1, Kohonen (1990, 1988) and Kohonen, Barna, & Chrisley (1988) have mapped the NNC to a neural network, which they have named learning vector quantization (LVQ). They have also suggested a learning rule for training the network. Hecht-Nielsen (1987, 1988) developed the two layer counter-propagation network (CPN) that combines the Kohonen self-organization and Grossberg outstar algorithms. The LVQ and CPN networks are then isomorphic to types of NNC.

In this paper we develop techniques for optimizing the NNC through the use of a sigma-pi back propagation network. In Section 2, we give a simple method for mapping the NNC's components to a sigma-pi neural network. The learning rule for the network is described in Section 3. We apply our algorithm to the improvement of classifiers for handprinted numerals and geometric shapes in Section 4. The weights, initialized by a direct mapping procedure instead of via random assignment, effectively shorten the learning procedure and increase the possibility that a global minimum of the error function can be reached. Conclusions are given in Section 5.
2. NEAREST NEIGHBOR ISOMORPHIC NETWORKS

In this section, we describe the nearest neighbor isomorphic network (NNIN), which is a back propagation (BP) network isomorphic to a type of NNC. This network uses fairly conventional sigma-pi units, described by Rumelhart, Hinton, & Williams (1986), in the hidden layer, and units similar to the product units of Durbin and Rumelhart (1989) in the output layer. A set of good initial weights and thresholds can be directly determined from the reference feature vectors via appropriate mapping equations.
FIGURE 1. Kohonen isomorphic classifier. (Layers shown: input features, Kohonen units, dummy units, output distances.)
2.1. First Layer
Each pattern is represented by a feature vector x = [x(1) x(2) ... x(Nf)]^T, where Nf is the number of elements in the feature vector x. Cluster mean vectors have been used for keeping a small number of the most representative samples. For example, r_ij = [r_ij(1), r_ij(2), ..., r_ij(Nf)]^T is the mean vector of the jth cluster of the ith class. Let D_ij represent the distance between vectors x and r_ij. The first processing step in the NNC is the calculation of the D_ij numbers. For the NNC cluster discriminant, we use the Mahalanobis distance for independent features rather than the Euclidean distance. Let v_ij denote the feature variance vector for the jth cluster of the ith class and let v_ij(k) denote its kth element. Then

D_{ij}^2 = \sum_{k=1}^{Nf} \frac{[x(k) - r_{ij}(k)]^2}{v_{ij}(k)}    (1)

As shown in Figure 2, the first processing layer consists of NC · Ns sigma-pi units which are connected to the Nf input features. Here NC is the number of classes and Ns is the number of clusters per class. Let Snet(i, j), Sθ(i, j), and Sout(i, j) denote the net input, threshold, and output of the jth unit of the ith class, respectively. Sw1(i, j, k) and Sw2(i, j, k) denote the connection weights from the kth input feature to the jth sigma-pi unit of the ith class. The sigma-pi net input and activation are, respectively,

Snet(i, j) = S\theta(i, j) + \sum_{k=1}^{Nf} [Sw1(i, j, k) x(k) + Sw2(i, j, k) x^2(k)]    (2)

Sout(i, j) = \frac{1}{1 + \exp[-Snet(i, j)]}    (3)
Comparing Snet(i, j) with D_ij^2, we may assign the initial weights and threshold of the jth sigma-pi unit of the ith class as follows:

S\theta(i, j) = \sum_{k=1}^{Nf} \frac{r_{ij}^2(k)}{v_{ij}(k)}    (4)

Sw1(i, j, k) = \frac{-2 r_{ij}(k)}{v_{ij}(k)}    (5)

Sw2(i, j, k) = \frac{1}{v_{ij}(k)}    (6)
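The mapping of eqns (1)-(6) is simple enough to sketch directly. The NumPy fragment below is a minimal illustration, not the authors' implementation; the array layout and the names ref_means, ref_vars, init_first_layer, and first_layer are assumptions made for the sketch. It initializes the first-layer weights from the cluster statistics and checks that the resulting net input reproduces the distance of eqn (1).

```python
import numpy as np

def init_first_layer(ref_means, ref_vars):
    """Map NNC reference vectors to sigma-pi weights, following eqns (4)-(6).

    ref_means, ref_vars: arrays of shape (NC, Ns, Nf) holding the cluster
    mean vectors r_ij(k) and variance vectors v_ij(k).
    Returns (S_theta, Sw1, Sw2) of shapes (NC, Ns), (NC, Ns, Nf), (NC, Ns, Nf).
    """
    S_theta = np.sum(ref_means ** 2 / ref_vars, axis=2)  # eqn (4)
    Sw1 = -2.0 * ref_means / ref_vars                    # eqn (5)
    Sw2 = 1.0 / ref_vars                                 # eqn (6)
    return S_theta, Sw1, Sw2

def first_layer(x, S_theta, Sw1, Sw2):
    """Sigma-pi first layer, eqns (2) and (3)."""
    Snet = S_theta + Sw1 @ x + Sw2 @ (x * x)             # eqn (2)
    Sout = 1.0 / (1.0 + np.exp(-Snet))                   # eqn (3)
    return Snet, Sout

# Check that, after initialization, Snet(i, j) equals the distance of eqn (1).
rng = np.random.default_rng(0)
NC, Ns, Nf = 10, 10, 16
ref_means = rng.normal(size=(NC, Ns, Nf))
ref_vars = rng.uniform(0.5, 2.0, size=(NC, Ns, Nf))
x = rng.normal(size=Nf)

S_theta, Sw1, Sw2 = init_first_layer(ref_means, ref_vars)
Snet, Sout = first_layer(x, S_theta, Sw1, Sw2)
D2 = np.sum((x - ref_means) ** 2 / ref_vars, axis=2)     # eqn (1)
assert np.allclose(Snet, D2)
```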
This first sigma-pi layer can also be mapped back to the NNC's first layer. Hence, the first layer of the NNC and the first layer of the sigma-pi network are isomorphic.

2.2. Second Layer

The second step in a NNC is the Min operation, which consists of finding i and j such that D_ij is minimized. The estimated class is then i. As seen in Figure 2, the Min operation of the NNC has been replaced in the second processing layer by sigma-pi units of a new type. These are the product units of Durbin and Rumelhart (1989), with each feature raised to the first power. For a discussion of the merits of product units versus Min units, see Appendix A.
FIGURE 2. New isomorphic classifier. (Layers shown: input features, dummy units, sigma-pi units, product units, output discriminant functions.)
Let Pθ(i), Pnet(i), and Pout(i) denote the threshold, net input, and output of the ith unit in the second layer. The Pw(i, l) are the connection weights for Pin(l) and Pnet(i).

Pin(i) = \prod_{j=1}^{Ns} Sout(i, j)    (7)

Pnet(i) = P\theta(i) + \sum_{l=1}^{NC} Pw(i, l) \, Pin(l)    (8)

Pout(i) = \frac{1}{1 + \exp[-Pnet(i)]}    (9)
It is simple to initialize the second layer as Pw(i, l) = δ(i − l) and Pθ(i) = 0. This is equivalent to replacing the Min operation by a product at the beginning of training. The output layer has desired activation values of 1 for incorrect classes and 0 for the correct class.
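Continuing the sketch above (same assumed array names), the second layer and its identity initialization can be written as follows. Again, this is only an illustration of eqns (7)-(9), not the authors' code.

```python
def init_second_layer(NC):
    """Identity initialization: Pw(i, l) = delta(i - l), Ptheta(i) = 0."""
    return np.zeros(NC), np.eye(NC)                      # (P_theta, Pw)

def second_layer(Sout, P_theta, Pw):
    """Product-unit output layer, eqns (7)-(9)."""
    Pin = np.prod(Sout, axis=1)                          # eqn (7): product over the Ns clusters
    Pnet = P_theta + Pw @ Pin                            # eqn (8)
    Pout = 1.0 / (1.0 + np.exp(-Pnet))                   # eqn (9)
    return Pin, Pnet, Pout

# With the identity initialization, Pnet(i) is just the product of class i's
# cluster outputs. Since the desired output is 0 for the correct class and 1
# for the others, the smallest Pout marks the estimated class, approximating
# the Min operation of the NNC (see Appendix A).
P_theta, Pw = init_second_layer(NC)
Pin, Pnet, Pout = second_layer(Sout, P_theta, Pw)
estimated_class = int(np.argmin(Pout))
```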
3. LEARNING RULE

In the previous section, we initialized our network with good initial weights. However, the net's performance can still be improved via a form of back propagation, as described in Rumelhart et al. (1986). Let Tout(i) be the desired output, and let Np denote the total number of training patterns. Using the q-norm, which is the p-norm of Cheney (1982) and Powell (1981), the system performance measure is the sum over the Np training patterns of

E_p = \left\{ \sum_{i=1}^{NC} [Tout(i) - Pout(i)]^{2q} \right\}^{1/q}    (11)

where E_p is the error measure for the pth input pattern and q is a positive integer. In practice we alter the error criterion from the least squares approach (q = 1) to the minimax approach (q = infinity), or vice versa, when the learning process slows. In our experiments, this adaptive-q technique results in an increase in learning speed.

3.1. Gradients for q = 1

For the q = 1 case, the traditional back propagation learning rule is applied to the isomorphic network. The gradient of E_p with respect to the net input of the second layer units is

G(i) = -\frac{\partial E_p}{\partial Pnet(i)} = -\frac{\partial E_p}{\partial Pout(i)} \frac{\partial Pout(i)}{\partial Pnet(i)} = 2 [Tout(i) - Pout(i)] \cdot Pout(i) \cdot [1 - Pout(i)]    (12)

Then

-\frac{\partial E_p}{\partial P\theta(i)} = -\frac{\partial E_p}{\partial Pnet(i)} \frac{\partial Pnet(i)}{\partial P\theta(i)} = G(i)    (13)

-\frac{\partial E_p}{\partial Pw(i, l)} = -\frac{\partial E_p}{\partial Pnet(i)} \frac{\partial Pnet(i)}{\partial Pw(i, l)} = G(i) \cdot Pin(l)    (14)

The gradients for the first layer units are

g(i, j) = -\frac{\partial E_p}{\partial Snet(i, j)} = \sum_{l=1}^{NC} \left( -\frac{\partial E_p}{\partial Pnet(l)} \right) \frac{\partial Pnet(l)}{\partial Sout(i, j)} \frac{\partial Sout(i, j)}{\partial Snet(i, j)} = Pin(i) \cdot [1 - Sout(i, j)] \sum_{l=1}^{NC} G(l) \, Pw(l, i)    (15)
Then

-\frac{\partial E_p}{\partial S\theta(i, j)} = -\frac{\partial E_p}{\partial Snet(i, j)} \frac{\partial Snet(i, j)}{\partial S\theta(i, j)} = g(i, j)    (16)

-\frac{\partial E_p}{\partial Sw1(i, j, k)} = -\frac{\partial E_p}{\partial Snet(i, j)} \frac{\partial Snet(i, j)}{\partial Sw1(i, j, k)} = g(i, j) \cdot x(k)    (17)

-\frac{\partial E_p}{\partial Sw2(i, j, k)} = -\frac{\partial E_p}{\partial Snet(i, j)} \frac{\partial Snet(i, j)}{\partial Sw2(i, j, k)} = g(i, j) \cdot x^2(k)    (18)

For 1 ≤ i, l ≤ NC, 1 ≤ j ≤ Ns, 1 ≤ k ≤ Nf, the thresholds and connection weights are updated as

P\theta(i) = P\theta(i) + \zeta \cdot G(i)    (19)

Pw(i, l) = Pw(i, l) + \zeta \cdot G(i) \cdot Pin(l)    (20)

S\theta(i, j) = S\theta(i, j) + \zeta \cdot g(i, j)    (21)

Sw1(i, j, k) = Sw1(i, j, k) + \zeta \cdot g(i, j) \cdot x(k)    (22)

Sw2(i, j, k) = Sw2(i, j, k) + \zeta \cdot g(i, j) \cdot x^2(k)    (23)

where ζ is a learning factor.

3.2. Gradients for Infinite q

As q approaches infinity, the error measure for the pth input pattern can be written as

E_p = \lim_{q \to \infty} \left\{ \sum_{i=1}^{NC} [Tout(i) - Pout(i)]^{2q} \right\}^{1/q} = \max_i [Tout(i) - Pout(i)]^2 = [Tout(i_t) - Pout(i_t)]^2    (24)

where class i_t has the maximum error. In the training we modify the weights and threshold of the unit which has the maximum error for each pattern. For the pth pattern, the gradient for the output unit with maximum error is

G(i_t) = -\frac{\partial E_p}{\partial Pnet(i_t)} = -\frac{\partial E_p}{\partial Pout(i_t)} \frac{\partial Pout(i_t)}{\partial Pnet(i_t)} = 2 [Tout(i_t) - Pout(i_t)] \cdot Pout(i_t) \cdot [1 - Pout(i_t)]    (25)

The gradients for the first layer units are

g(i, j) = -\frac{\partial E_p}{\partial Snet(i, j)} = -\frac{\partial E_p}{\partial Pnet(i_t)} \frac{\partial Pnet(i_t)}{\partial Sout(i, j)} \frac{\partial Sout(i, j)}{\partial Snet(i, j)} = Pin(i) \cdot [1 - Sout(i, j)] \cdot G(i_t) \cdot Pw(i_t, i)    (26)

Given i_t, the thresholds and connection weights are updated as

P\theta(i_t) = P\theta(i_t) + \zeta \cdot G(i_t)    (27)

Pw(i_t, l) = Pw(i_t, l) + \zeta \cdot G(i_t) \cdot Pin(l)    (28)

S\theta(i, j) = S\theta(i, j) + \zeta \cdot g(i, j)    (29)

Sw1(i, j, k) = Sw1(i, j, k) + \zeta \cdot g(i, j) \cdot x(k)    (30)

Sw2(i, j, k) = Sw2(i, j, k) + \zeta \cdot g(i, j) \cdot x^2(k)    (31)

for 1 ≤ i ≤ NC, 1 ≤ j ≤ Ns, and 1 ≤ k ≤ Nf.

3.3. Training

In order to speed up the training, we make the learning factor adaptive after each training iteration as

\zeta = \begin{cases} \alpha \cdot \zeta & \text{if } \Delta \tilde{E}(n) \ge 0 \\ \beta \cdot \zeta & \text{if } \Delta \tilde{E}(n) < 0 \end{cases}    (32)

where α is an assigned constant less than 1 and β = (1/α)^{1/c} for a positive integer constant c. We define ΔẼ(n) = Ẽ(n) − Ẽ(n − 1), where n is the iteration number, the smoothed error is Ẽ(n) = (1/2)[E(n) + Ẽ(n − 1)], and Ẽ(0) = E(1). For example, we use α = 0.9 and c = 5 in our back propagation training. If the smoothed error Ẽ(n) increases in a given iteration, the learning factor is decreased, as seen in eqn (32). If Ẽ(n) decreases, the learning factor is increased. Note that the rate of decrease is much greater than the rate of increase, preventing the error from blowing up.
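For concreteness, the q = 1 gradients of eqns (12)-(15) and the updates of eqns (19)-(23), together with the adaptive learning factor of eqn (32), can be collected into one training step. The sketch below reuses the first_layer and second_layer helpers from the earlier fragments and is a hedged illustration rather than the authors' code; in particular, β = (1/α)^{1/c} is the reading of eqn (32) adopted here.

```python
def backprop_step(x, target, params, zeta):
    """One q = 1 back propagation step for a single pattern (eqns (12)-(23)).

    params = (S_theta, Sw1, Sw2, P_theta, Pw); the arrays are updated in place.
    target: Tout(i) = 0 for the correct class, 1 otherwise (Section 2.2).
    Returns the pattern error E_p for q = 1.
    """
    S_theta, Sw1, Sw2, P_theta, Pw = params
    Snet, Sout = first_layer(x, S_theta, Sw1, Sw2)
    Pin, Pnet, Pout = second_layer(Sout, P_theta, Pw)

    # Output-layer gradient, eqn (12).
    G = 2.0 * (target - Pout) * Pout * (1.0 - Pout)
    # Hidden-layer gradient, eqn (15).
    g = Pin[:, None] * (1.0 - Sout) * (Pw.T @ G)[:, None]

    # Updates, eqns (19)-(23).
    P_theta += zeta * G
    Pw += zeta * np.outer(G, Pin)
    S_theta += zeta * g
    Sw1 += zeta * g[:, :, None] * x
    Sw2 += zeta * g[:, :, None] * (x * x)
    return float(np.sum((target - Pout) ** 2))

def update_zeta(zeta, delta_E_smooth, alpha=0.9, c=5):
    """Adaptive learning factor of eqn (32); beta = (1/alpha)**(1/c) is assumed."""
    beta = (1.0 / alpha) ** (1.0 / c)
    return alpha * zeta if delta_E_smooth >= 0 else beta * zeta
```

A full training iteration would loop this step over the Np patterns, form the smoothed error Ẽ(n), and then call update_zeta once per iteration.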
4. CLASSIFICATION EXPERIMENTS

In this section, we apply our forward mapping algorithms to two pattern recognition problems. The first problem is that of recognizing handprinted numerals. The second is a geometrical shape recognition problem. To evaluate our network's effectiveness, we compare the recognition performances of our algorithm with those of two other well known algorithms.

4.1. Handprinted Numeral Recognition Example
Our raw data consists of images of 10 handprinted numerals, collected from 3,000 people by the Internal Revenue Service (see Weideman, Manry, & Yau, 1989). We randomly chose 300 characters from each class to generate a 3,000 character training data set. A separate testing data set consists of 6,000 characters, 600 from each class, which are selected from the remaining database. Images are 32 × 24 binary matrices. An image scaling algorithm is used to remove size variation in the characters. The feature set contains 16 elements, developed by Gong and Manry (1989), which are non-Gaussian. We use the iterative K-means clustering algorithm, described by Anderberg (1973) and Duda and Hart (1973), to select 10 reference cluster mean and variance vectors for each class. This algorithm differs from the standard K-means clustering algorithm in that each cluster is constrained to have the same number of members.
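The paper does not spell out how the equal-membership constraint is enforced. The sketch below shows one simple possibility, assigning patterns to their nearest cluster that is not yet full; the function name equal_size_kmeans and the assignment rule are assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def equal_size_kmeans(X, n_clusters, n_iter=20, seed=0):
    """K-means variant in which every cluster keeps the same number of members.

    X: (n_samples, Nf) feature vectors of one class; n_samples is assumed to be
    a multiple of n_clusters. Returns the per-cluster mean and variance vectors,
    i.e., the r_ij and v_ij used to initialize the NNIN.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    size = n // n_clusters
    centers = X[rng.choice(n, n_clusters, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.full(n, -1)
        counts = np.zeros(n_clusters, dtype=int)
        # Visit patterns in order of distance to their closest center and give
        # each one its nearest cluster that is not yet full.
        for idx in np.argsort(d.min(axis=1)):
            for c in np.argsort(d[idx]):
                if counts[c] < size:
                    labels[idx] = c
                    counts[c] += 1
                    break
        centers = np.array([X[labels == c].mean(axis=0) for c in range(n_clusters)])
    variances = np.array([X[labels == c].var(axis=0) for c in range(n_clusters)])
    return centers, np.maximum(variances, 1e-6)  # floor so eqn (1) never divides by zero
```

For the numeral experiment this would be run once per class (300 training patterns, 10 clusters of 30 members each), yielding the r_ij and v_ij used in eqns (4)-(6).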
FIGURE 3. Neural nets in training (handprinted numerals). (Recognition error percentage versus training iteration number, 0-150, for the NNC, LVQ2.1, and NNIN.)
We compared the performance of the NNIN with that of the NNC and LVQ2.1 for the 10 numeral classes. We plot the recognition error percentage versus the training iteration number for the training data set for these different classifiers in Figure 3. Table 1 shows the final results for the training and testing data sets. The NNC is designed for 10 numeral classes and it has 10.43% classification error for the training data and 11.60% classification error for the testing data. The NNC was mapped to the LVQ2.1 and NNIN networks in order to initialize their weights. The LVQ2.1 network quickly reduced the error percentage and finally leveled off at 6.37% error for the training set. The LVQ2.1 network decreased the error rate for the testing set down to 9.73%. Initially, the error for the NNIN was 9.47%. Although the error percentage was initially small, back propagation with the adaptive-q technique improved the performance still further. After 150 training iterations, the NNIN classifier had 4.93% classification error for the training data and 5.5% error for the testing data. Both the LVQ2.1 and the NNIN performed much better than the NNC. However, the NNIN had better performance than LVQ2.1.

TABLE 1
Recognition experiments for handprinted numerals (in error percentage)

Classifier    Training Set    Testing Set
NNC           10.43%          11.60%
LVQ2.1        6.37%           9.73%
NNIN          4.93%           7.82%
4.2. Shape Recognition Example
In order to confirm our results, we obtained a second database for a completely different problem: the classification of geometric shapes. The four primary geometric shapes used in this experiment are (a) ellipse, (b) triangle, (c) quadrilateral, and (d) pentagon. Each shape image consists of a matrix of size 64 × 64. Each element in the matrix represents a binary-valued pixel in the image. For each shape, 200 training patterns and 200 testing patterns were generated using different degrees of deformation. The deformations included rotation, scaling, translation, and oblique distortions. The weighted log-polar transform, which computes energies in several ring- and wedge-shaped areas in the Fourier transform domain, was applied to extract 16 features for each pattern. A detailed description of these features is given in Appendix B. In Figure 4 we plot the recognition error percentage versus the training iteration number for the training data set for the NNC, LVQ2.1, and NNIN, respectively. The classifier performances are presented in Table 2 for the testing data set.
FIGURE 4. Neural nets in training (geometric shapes). (Recognition error percentage versus training iteration number, 0-150, for the NNC, LVQ2.1, and NNIN.)
Note that mapping the NNC to the NNIN increased the error rate in Figure 4, contrary to what is seen in the handprinted numeral recognition experiment in Figure 3. Nonetheless, the NNIN reduces the misclassification rate very quickly and finally reaches 0% classification error for the training data. As in the first experiment, the NNIN has better performance than the NNC and LVQ2.1 classifiers for both training and testing data sets.
TABLE 2
Recognition experiments for geometric shapes (in error percentage)

Classifier    Training Set    Testing Set
NNC           8.38%           5.88%
LVQ2.1        2.88%           4.88%
NNIN          0.00%           1.63%
5. CONCLUSIONS

In this paper we have shown that under some circumstances, a NNC can be mapped to a sigma-pi neural network and then improved through back propagation training. This method was applied to NNC's for handprinted numerals and geometric shapes. Although the error percentages were initially small, back propagation in the corresponding neural nets improved the performances still further.

The two examples, handprinted numerals and geometric shapes, show the range of performances possible with the NNIN. Mapping of the NNC to the NNIN greatly reduces the classification errors in both problems. Note that the use of the adaptive q-norm error criterion may cause the error percentage to abruptly rise or drop during training. However, we observe that the curves of error percentage versus iteration number still decrease asymptotically. Also, this learning technique may help us avoid getting trapped in local minima. Because the initial weights used are good, the error percentage quickly levels off. Since the classification performance of the network is improved for both training and testing data, it is apparent that the classifier is generalizing, and not just memorizing the training data.
REFERENCES

Anderberg, M. (1973). Cluster analysis for applications. New York: Academic Press.
Casasent, D., & Psaltis, D. (1976). Position, rotation, and scale invariant optical correlation. Applied Optics, 15, pp. 1795-1799.
Cheney, E. W. (1982). Introduction to approximation theory. New York: Chelsea Publishing.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.
Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, pp. 133-142.
Fukunaga, K. (1972). Introduction to statistical pattern recognition. New York: Academic Press.
Gong, W., & Manry, M. T. (1989). Analysis of non-Gaussian data using a neural network. International Joint Conference on Neural Networks, Washington, D.C., June 1989, 2, p. 576.
Hecht-Nielsen, R. (1987). Counterpropagation networks. IEEE First International Conference on Neural Networks, San Diego, CA, June 1987, 2, pp. 19-32.
Hecht-Nielsen, R. (1988). Applications of counterpropagation networks. Neural Networks, 1, pp. 131-139.
Kohonen, T. (1988). An introduction to neural computing. Neural Networks, 1, pp. 3-16.
Kohonen, T. (1990). Improved versions of learning vector quantization. International Joint Conference on Neural Networks, San Diego, CA, June 1990, 1, pp. 545-550.
Kohonen, T., Barna, G., & Chrisley, R. (1988). Statistical pattern recognition with neural networks: Benchmarking studies. IEEE International Conference on Neural Networks, San Diego, CA, July 1988, 1, pp. 61-68.
Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, April 1987, pp. 4-22.
Powell, M. J. D. (1981). Approximation theory and methods. Cambridge: Cambridge University Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel distributed processing, Vol. 1. Cambridge, MA: MIT Press.
Smith, & DeVoe, D. R. (1988). Hybrid optical/electronic pattern recognition with both coherent and noncoherent operations. Proceedings of SPIE, 938, pp. 170-177.
Weideman, W. E., Manry, M. T., & Yau, H. C. (1989). A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition. International Joint Conference on Neural Networks, Washington, D.C., June 1989, 1, pp. 117-120.
Yau, H. C., & Manry, M. T. Rotation and scale invariant recognition using a weighted log-polar transform. To be submitted to Optical Engineering.

APPENDIX A: PRODUCT UNITS VERSUS MIN UNITS

In Kohonen's LVQ network, Min units in the output layer determine the classification result. A Min unit passes the minimum value of its inputs through to its output. In the NNIN, it is quite possible to use Min units in the output layer. This would result in a network very similar to those of Kohonen. In this appendix, we consider alternative units.

A.1 Replacements for Min units

In the NNC, the Min units have two important properties that allow the generation of complicated (disjoint, for example) decision regions. These are as follows:

Property 1: Let r_tm be the vector among the r_ij from which the random vector x has the smallest distance. If x → r_tm, then D_tm^2 → 0, and

D_{tm}^2 = \min_{i, j} D_{ij}^2, \quad i \in [1, NC], \; j \in [1, Ns]    (A1)

Property 2: Property 1 still holds if new reference vectors are added to any class.

Two possible replacements for the Min unit are the product unit (Durbin and Rumelhart, 1989) and the sum unit. Define the activation function for the product unit as in eqn (7),

Pin(i) = \prod_{j=1}^{Ns} Sout(i, j)

The product unit has both of the properties given above. As x → r_tm, D_tm^2 and Pin(t) both approach 0. Assuming all the r_ij vectors are unequal, then D_ij^2 > 0 and Pin(i) > 0 for all i ≠ t, and Property 1 is satisfied. When a new reference vector is added, Property 1, and therefore Property 2, are still satisfied. Therefore, the product units have the required properties and can function as Min units. As a second example, consider the sum unit, whose activation function is defined as

S(i) = \sum_{j=1}^{Ns} Sout(i, j)    (A2)

If x → r_tm, then D_tm approaches 0 but S(t) does not. Property 1 is not generally satisfied for the sum unit. If a sum unit did have Property 1 and if a new example vector is added to the tth class, S(t) can no longer be driven to 0 as x approaches r_tm. Thus the sum unit does not have Properties 1 or 2, and is not a suitable replacement for the Min unit. There are two advantages to using product units rather than Min units: (a) derivatives of products are much simpler to calculate than derivatives of a minimum operation, and (b) a back propagation iteration for product units results in changes for all weights, unlike a similar iteration for Min units.

A.2 Some properties of product units

Assume that Pw(i, l) = δ(i − l). For each class, put the cluster outputs Sout(i, j) in increasing order such that Sout(i, 1) ≤ Sout(i, 2) ≤ ... ≤ Sout(i, Ns). Assume that the correct class is class 1. The Min unit assigns the vector x to class 1 if

Sout(1, 1) < Sout(i, 1)    (A3)

for all i > 1. The product unit assigns x to class 1 if Pin(1) is smaller than Pin(i) for every i > 1, which is the same condition as

Sout(1, 1) < Sout(i, 1) \cdot \frac{Rs(i)}{Rs(1)}, \quad Rs(i) = \frac{Pin(i)}{Sout(i, 1)}    (A4)

for all i > 1. If eqn (A3) is true, (A4) is true if either of the following sufficient conditions is true:

(a) Sout(1, j) < Sout(i, j) for all i > 1 and j ≥ 2;
(b) Rs(i) < Rs(1) for all i > 1.

Also, if (A4) is true, either of the above conditions is sufficient for (A3) to be true. Thus, under certain conditions, product units give the same answer as Min units.
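The agreement between product units and Min units claimed above is easy to probe numerically. The following fragment is only an illustration on random first-layer outputs, under the assumption that Sout values lie in (0.5, 1); it is not part of the paper's analysis.

```python
import numpy as np

# Numerical comparison of Min-unit and product-unit decisions on random
# first-layer outputs. Sout is drawn from (0.5, 1) because, with the
# initialization of Section 2.1, Snet(i, j) = D_ij^2 >= 0. This only
# illustrates the argument above; it is not a proof.
rng = np.random.default_rng(1)
NC, Ns, trials, agree = 10, 10, 10000, 0
for _ in range(trials):
    Sout = rng.uniform(0.5, 1.0, size=(NC, Ns))
    min_class = np.unravel_index(np.argmin(Sout), Sout.shape)[0]  # Min unit decision
    prod_class = int(np.argmin(np.prod(Sout, axis=1)))            # product unit decision
    agree += int(min_class == prod_class)
print(f"Min and product units agree on {100.0 * agree / trials:.1f}% of random trials")
```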
APPENDIX B: LOG-POLAR FEATURE DEFINITIONS
Casasent and Psaltis (1976) have proposed log-polar transform (LPT) techniques combining a geometric coordinate transformation with the conventional Fourier transform. Smith and DeVoe (1988) have applied the LPT to the magnitude-squared Fourier transform of the image. Lately, our research has led to a weighted LPT. The first step in taking the LPT is to Fourier transform the input image f(x, y) to get F(R, ψ), where R and ψ denote frequency domain radius and angle. Next, define ρ = ln(R) and L(ρ, ψ) = e^{2ρ} |F(e^ρ, ψ)|^2 = R^2 |F(R, ψ)|^2. The Fourier transformation of the LPT is taken as

M(\xi, \eta) = \int \int L(\rho, \psi) \, e^{-j(\xi \rho + \eta \psi)} \, d\psi \, d\rho    (B1)

The (ξ, η) domain is referred to here as the Mellin domain. We define the LPT features R_m and R_n as

R_m = \frac{|M(m, 0)|}{|M(0, 0)|}, \qquad R_n = \frac{|M(0, n)|}{|M(0, 0)|}    (B2)

where m, n ∈ [1, 8].
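A rough sketch of the feature computation of eqns (B1) and (B2) is given below. The sampling grid, the nearest-neighbor interpolation, and the function name weighted_lpt_features are assumptions made for the sketch; the authors' exact sampling and normalization are not specified here.

```python
import numpy as np

def weighted_lpt_features(image, n_rho=64, n_psi=64, m_max=8):
    """Sixteen weighted log-polar features in the spirit of eqns (B1)-(B2).

    image: 2-D array, e.g., a 64 x 64 binary shape image.
    Returns the 2 * m_max normalized Mellin-domain magnitudes.
    """
    F = np.fft.fftshift(np.fft.fft2(image))                  # F(R, psi), DC at the center
    rows, cols = image.shape
    cy, cx = rows // 2, cols // 2
    r_max = min(cy, cx) - 1

    # Sample L(rho, psi) = R^2 |F(R, psi)|^2 on a log-polar grid (nearest neighbor).
    rho = np.linspace(0.0, np.log(r_max), n_rho)
    psi = np.linspace(0.0, 2.0 * np.pi, n_psi, endpoint=False)
    R = np.exp(rho)[:, None]
    Y = np.clip(np.round(cy + R * np.sin(psi)[None, :]).astype(int), 0, rows - 1)
    X = np.clip(np.round(cx + R * np.cos(psi)[None, :]).astype(int), 0, cols - 1)
    L = (R ** 2) * np.abs(F[Y, X]) ** 2

    # Eqn (B1): a 2-D Fourier transform of L over (rho, psi) gives the Mellin domain.
    M = np.abs(np.fft.fft2(L))

    # Eqn (B2): features are |M(m, 0)| and |M(0, n)| normalized by |M(0, 0)|.
    return np.concatenate([M[1:m_max + 1, 0], M[0, 1:m_max + 1]]) / M[0, 0]
```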
NOMENCLATURE

D_ij:          metric distance between vectors x and r_ij.
E_p:           the error function for the pth input pattern.
E:             the error function being minimized by back propagation.
G(i):          the gradient of E_p with respect to Pnet(i).
g(i, j):       the gradient of E_p with respect to Snet(i, j).
NC:            the number of classes.
Nf:            the number of input dummy units in the neural net, and the number of elements in the feature vector x.
Np:            the total number of training patterns.
Ns:            the number of clusters per class.
Pin(i):        product of sigma-pi unit outputs of class i.
Pnet(i):       net input of the ith output unit.
Pout(i):       output of the ith output unit.
Pw(i, l):      connection weights for Pin(l) and Pnet(i).
Pθ(i):         threshold of the ith output unit.
r_ij:          the jth reference vector of the ith class.
r_ij(k):       the kth element of r_ij.
Snet(i, j):    net input of the jth sigma-pi unit of the ith class.
Sout(i, j):    output of the jth sigma-pi unit of the ith class.
Sw1(i, j, k):  the connection weight for the first order of the kth feature to the jth sigma-pi unit of the ith class.
Sw2(i, j, k):  the connection weight for the second order of the kth feature to the jth sigma-pi unit of the ith class.
Sθ(i, j):      threshold of the jth sigma-pi unit of the ith class.
Tout(i):       the desired activation for the ith output unit.
x:             feature vector.
x(k):          the kth element or feature of x.
v_ij:          the Nf by 1 variance vector for the jth cluster of the ith class.
v_ij(k):       the kth element of v_ij.
ζ:             learning factor in back propagation.