ORIGINAL CONTRIBUTION

Iterative Improvement of a Nearest Neighbor Classifier

HUNG-CHUN YAU AND MICHAEL T. MANRY

University of Texas at Arlington

Abstract-In practical pattern recognition applications, the nearest neighbor classifier (NNC) is often applied because it does not require a priori knowledge of the joint probability density of the input feature vectors. As the number of example vectors is increased, the error probability of the NNC approaches that of the Bayesian classifier. At the same time, however, the computational complexity of the NNC increases. Also, for a small number of example vectors, the NNC is not optimal with respect to the training data. In this paper, we attack these problems by mapping the NNC to a sigma-pi neural network, to which it is partially isomorphic. A modified form of back propagation (BP) learning is then developed and used to improve the classifier's performance. Significant improvements in classification error percentages are observed for both the training and testing data. We apply our approach to the problems of handprinted numeral recognition and geometrical shape recognition.

Keywords-Nearest neighbor classifier, Back propagation, Character recognition, Isomorphic classifiers, Shape recognition, Sigma-pi network, Deformation-invariant features.

Requests for reprints should be sent to Prof. Michael T. Manry, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76019.

1. INTRODUCTION

As pointed out by Duda and Hart (1973) and Fukunaga (1972), the nearest neighbor classifier (NNC) approximates the minimum error Bayesian classifier in the limit as the number of reference vectors gets larger. When the feature vector joint probability density is unknown, the NNC would be the preferred classifier, except for two problems. First, a prohibitive amount of computation is required for its use. Second, the NNC's performance is usually not optimized with respect to the training data. As more hardware for parallel processing becomes available (neural or conventional), the first problem will be solved. Neural networks have already been used to attack the second problem. Lippmann (1987) has pointed out that the multilayer perceptron is very similar to the NNC, and can be used as a classifier. As illustrated in Figure 1, Kohonen (1990, 1988) and Kohonen, Barna, and Chrisley (1988) have mapped the NNC to a neural network, which they have named learning vector quantization (LVQ). They have also suggested a learning rule for training the network. Hecht-Nielsen (1987, 1988) developed the two layer counter-propagation network (CPN) that combines the Kohonen self-organization and Grossberg outstar algorithms. The LVQ and CPN networks are then isomorphic to types of NNC.

In this paper we develop techniques for optimizing the NNC through the use of a sigma-pi back propagation network. In Section 2, we give a simple method for mapping the NNC's components to a sigma-pi neural network. The learning rule for the network is described in Section 3. We apply our algorithm to the improvement of classifiers for handprinted numerals and geometric shapes in Section 4. The weights, initialized by a direct mapping procedure instead of via random assignment, effectively shorten the learning procedure and increase the possibility that a global minimum of the error function can be reached. Conclusions are given in Section 5.

2. NEAREST NEIGHBOR ISOMORPHIC NETWORKS

In this section, we describe the nearest neighbor isomorphic network (NNIN), which is a back propagation (BP) network isomorphic to a type of NNC. This network uses fairly conventional sigma-pi units, described by Rumelhart, Hinton, and Williams (1986), in the hidden layer, and units similar to the product units of Durbin and Rumelhart (1989) in the output layer. A set of good initial weights and thresholds can be directly determined from the reference feature vectors via appropriate mapping equations.



FIGURE 1. Kohonen isomorphic classifier. (Input features x(1) through x(4) feed dummy units and Kohonen units, which produce output distances.)

2.1. First Layer

Each pattern is represented by a feature vector x = [x(1) x(2) ... x(Nf)]^T, where Nf is the number of elements in the feature vector x. Cluster mean vectors have been used to keep a small number of the most representative samples. For example, r_ij = [r_ij(1), r_ij(2), ..., r_ij(Nf)]^T is the mean vector of the jth cluster of the ith class. Let D_ij represent the distance between vectors x and r_ij. The first processing step in the NNC is the calculation of the D_ij numbers. For the NNC cluster discriminant, we use the Mahalanobis distance for independent features rather than the Euclidean distance. Let v_ij denote the feature variance vector for the jth cluster of the ith class and let v_ij(k) denote its kth element. Then

D²_ij = Σ_{k=1}^{Nf} [x(k) − r_ij(k)]² / v_ij(k)                                    (1)
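For concreteness, eqn (1) is easy to compute directly from the cluster statistics. The following NumPy sketch is illustrative only; the function and argument names are ours, not the paper's.

```python
import numpy as np

def cluster_distance_sq(x, r_ij, v_ij):
    """Eqn (1): sum_k [x(k) - r_ij(k)]^2 / v_ij(k).

    x, r_ij, and v_ij are length-Nf arrays; v_ij holds the per-feature
    variances of the cluster and is assumed strictly positive.
    """
    d = x - r_ij
    return float(np.sum(d * d / v_ij))
```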

As shown in Figure 2, the first processing layer consists of NC · Ns sigma-pi units which are connected to the Nf input features. Here NC is the number of classes and Ns is the number of clusters per class. Let Snet(i, j), SO(i, j), and Sout(i, j) denote the net input, threshold, and output of the jth unit of the ith class, respectively. Sw1(i, j, k) and Sw2(i, j, k) denote the connection weights from the kth input feature to the jth sigma-pi unit of the ith class. The sigma-pi net input and activation are, respectively,

Snet(i, j) = SO(i, j) + Σ_{k=1}^{Nf} [Sw1(i, j, k) x(k) + Sw2(i, j, k) x²(k)]        (2)

Sout(i, j) = 1 / (1 + exp[−Snet(i, j)])                                              (3)

Comparing Snet(i, j) with D²_ij, we may assign the initial weights and threshold of the jth sigma-pi unit of the ith class as follows:

SO(i, j) = Σ_{k=1}^{Nf} r²_ij(k) / v_ij(k)                                           (4)

Sw1(i, j, k) = −2 r_ij(k) / v_ij(k)                                                  (5)

Sw2(i, j, k) = 1 / v_ij(k)                                                           (6)
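Assuming the initialization of eqns (4)-(6), the mapping from NNC cluster statistics to first-layer parameters can be sketched as below. The array names and shapes are our own conventions (cluster means and variances stacked as (NC, Ns, Nf) arrays); the sketch simply makes Snet(i, j) start out equal to D²_ij.

```python
import numpy as np

def init_first_layer(means, variances):
    """Eqns (4)-(6): initial thresholds and weights from r_ij(k), v_ij(k).

    means, variances: arrays of shape (NC, Ns, Nf).
    Returns SO of shape (NC, Ns) and Sw1, Sw2 of shape (NC, Ns, Nf) so that
    the net input of eqn (2) initially equals the distance of eqn (1).
    """
    SO  = np.sum(means ** 2 / variances, axis=-1)   # eqn (4)
    Sw1 = -2.0 * means / variances                  # eqn (5)
    Sw2 = 1.0 / variances                           # eqn (6)
    return SO, Sw1, Sw2

def first_layer(x, SO, Sw1, Sw2):
    """Eqns (2)-(3): sigma-pi net input and sigmoid output for one pattern x."""
    Snet = SO + Sw1 @ x + Sw2 @ (x * x)             # shape (NC, Ns)
    return 1.0 / (1.0 + np.exp(-Snet))              # Sout(i, j)
```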

This first sigma-pi layer can also be mapped back to the NNC's first layer. Hence, the first layer of the NNC and the first layer of the sigma-pi network are isomorphic.

2.2. Second Layer

The second step in a NNC is the Min operation, which consists of finding i and j such that D_ij is minimized. The estimated class is then i. As seen in Figure 2, the Min operation of the NNC has been replaced in the second processing layer by sigma-pi units of a new type. These are the product units of Durbin and Rumelhart (1989), with each feature raised to the first power. For a discussion of the merits of product units versus Min units, see Appendix A.

FIGURE 2. New isomorphic classifier. (Input features feed dummy units, sigma-pi units, and product units, which produce the output discriminant functions Pout(1), Pout(2), and so on.)

Let PO(i), Pnet(i), and Pout(i) denote the threshold, net input, and output of the ith unit in the second layer. The Pw(i, l) are the connection weights between Pin(l) and Pnet(i).

Pin(i) = Π_{j=1}^{Ns} Sout(i, j)                                                     (7)

Pnet(i) = PO(i) + Σ_{l=1}^{NC} Pw(i, l) Pin(l)                                       (8)

Pout(i) = 1 / (1 + exp[−Pnet(i)])                                                    (9)

It is simple to initialize the second layer as Pw(i, l) = δ(i − l) and PO(i) = 0. This is equivalent to replacing the Min operation by a product at the beginning of training. The output layer has desired activation values of 1 for incorrect classes and 0 for the correct class.
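A matching sketch of the second layer under this initialization (again with our own array names, reusing `first_layer` from the previous sketch) is given below. Since the desired output is 0 for the correct class, the estimated class is the output unit with the smallest activation, consistent with Appendix A.

```python
import numpy as np

def init_second_layer(NC):
    """Pw(i, l) = delta(i - l) and PO(i) = 0, as suggested in the text."""
    return np.eye(NC), np.zeros(NC)                 # Pw, PO

def second_layer(Sout, Pw, PO):
    """Eqns (7)-(9): product units followed by sigmoid output units."""
    Pin  = np.prod(Sout, axis=1)                    # eqn (7), shape (NC,)
    Pnet = PO + Pw @ Pin                            # eqn (8)
    Pout = 1.0 / (1.0 + np.exp(-Pnet))              # eqn (9)
    return Pin, Pnet, Pout
```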

3. LEARNING RULE

In the previous section, we initialized our network with good initial weights. However, the net's performance can still be improved via a form of back propagation, as described in Rumelhart et al. (1986). Let Tout(i) be the desired output. Np denotes the total number of training patterns. Using the q-norm, which is the p-norm of Cheney (1982) and Powell (1981), the system performance measure is defined in terms of the error measure for the pth input pattern,

E_p = { Σ_{i=1}^{NC} [Tout(i) − Pout(i)]^{2q} }^{1/q}                                (11)

where q is a positive integer. In practice we alter the error criterion from the least square approach (q = 1) to the minimax approach (q = infinity) or vice versa when the learning process slows. In our experiments, this adaptive-q technique results in an increase in learning speed.

3.1. Gradients for q = 1

For the q = 1 case, the traditional back propagation learning rule is applied to the isomorphic network. The gradient of E_p with respect to the net input of the second layer units is

G(i) = −∂E_p/∂Pnet(i) = −[∂E_p/∂Pout(i)] [∂Pout(i)/∂Pnet(i)]
     = 2 [Tout(i) − Pout(i)] Pout(i) [1 − Pout(i)]                                   (12)

Then

−∂E_p/∂PO(i) = −[∂E_p/∂Pnet(i)] [∂Pnet(i)/∂PO(i)] = G(i)                             (13)

−∂E_p/∂Pw(i, l) = −[∂E_p/∂Pnet(i)] [∂Pnet(i)/∂Pw(i, l)] = G(i) · Pin(l)              (14)

The gradients for the first layer units are

g(i, j) = −∂E_p/∂Snet(i, j)
        = Σ_{l=1}^{NC} −[∂E_p/∂Pnet(l)] [∂Pnet(l)/∂Sout(i, j)] [∂Sout(i, j)/∂Snet(i, j)]
        = Pin(i) [1 − Sout(i, j)] Σ_{l=1}^{NC} G(l) Pw(l, i)                          (15)


Then

−∂E_p/∂SO(i, j) = −[∂E_p/∂Snet(i, j)] [∂Snet(i, j)/∂SO(i, j)] = g(i, j)              (16)

−∂E_p/∂Sw1(i, j, k) = −[∂E_p/∂Snet(i, j)] [∂Snet(i, j)/∂Sw1(i, j, k)] = g(i, j) x(k)  (17)

−∂E_p/∂Sw2(i, j, k) = −[∂E_p/∂Snet(i, j)] [∂Snet(i, j)/∂Sw2(i, j, k)] = g(i, j) x²(k) (18)

For 1 ≤ i, l ≤ NC, 1 ≤ j ≤ Ns, 1 ≤ k ≤ Nf, the thresholds and connection weights are updated as

PO(i) = PO(i) + ζ · G(i)                                                             (19)

Pw(i, l) = Pw(i, l) + ζ · G(i) · Pin(l)                                              (20)

SO(i, j) = SO(i, j) + ζ · g(i, j)                                                    (21)

Sw1(i, j, k) = Sw1(i, j, k) + ζ · g(i, j) · x(k)                                     (22)

Sw2(i, j, k) = Sw2(i, j, k) + ζ · g(i, j) · x²(k)                                    (23)

where ζ is a learning factor.
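Pulling eqns (12)-(23) together, one q = 1 update for a single training pattern might look like the sketch below. This is our own NumPy rendering, not the authors' code; `zeta` stands for the learning factor ζ, and all parameter arrays are NumPy float arrays updated in place.

```python
import numpy as np

def bp_update_q1(x, Tout, SO, Sw1, Sw2, PO, Pw, zeta=0.1):
    """One q = 1 back propagation step for one pattern (eqns (12)-(23)).

    x: feature vector of length Nf.  Tout: desired outputs of length NC,
    0 for the correct class and 1 for the others.
    """
    # Forward pass, eqns (2)-(3) and (7)-(9).
    Snet = SO + Sw1 @ x + Sw2 @ (x * x)
    Sout = 1.0 / (1.0 + np.exp(-Snet))               # (NC, Ns)
    Pin  = np.prod(Sout, axis=1)                     # (NC,)
    Pnet = PO + Pw @ Pin
    Pout = 1.0 / (1.0 + np.exp(-Pnet))

    # Output-layer gradient, eqn (12).
    G = 2.0 * (Tout - Pout) * Pout * (1.0 - Pout)    # (NC,)

    # Hidden-layer gradient, eqn (15):
    # g(i, j) = Pin(i) [1 - Sout(i, j)] sum_l G(l) Pw(l, i)
    back = Pw.T @ G                                  # (NC,)
    g = (Pin * back)[:, None] * (1.0 - Sout)         # (NC, Ns)

    # Updates, eqns (19)-(23).
    PO  += zeta * G
    Pw  += zeta * np.outer(G, Pin)
    SO  += zeta * g
    Sw1 += zeta * g[:, :, None] * x
    Sw2 += zeta * g[:, :, None] * (x * x)
```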

3.2. Gradients for Infinite q

As q approaches infinity, the error measure for the pth input pattern can be written as

E_p = lim_{q→∞} { Σ_{i=1}^{NC} [Tout(i) − Pout(i)]^{2q} }^{1/q}
    = max_i [Tout(i) − Pout(i)]²
    = [Tout(i_t) − Pout(i_t)]²                                                       (24)

where class i_t has the maximum error. In the training we modify the weights and threshold of the unit which has the maximum error for each pattern. For the pth pattern, the gradient for the output unit with maximum error is

G(i_t) = −∂E_p/∂Pnet(i_t) = −[∂E_p/∂Pout(i_t)] [∂Pout(i_t)/∂Pnet(i_t)]
       = 2 [Tout(i_t) − Pout(i_t)] Pout(i_t) [1 − Pout(i_t)]                         (25)

The gradients for the first layer units are

g(i, j) = −∂E_p/∂Snet(i, j)
        = −[∂E_p/∂Pnet(i_t)] [∂Pnet(i_t)/∂Sout(i, j)] [∂Sout(i, j)/∂Snet(i, j)]
        = Pin(i) [1 − Sout(i, j)] G(i_t) Pw(i_t, i)                                  (26)

Given i_t, the thresholds and connection weights are updated as

PO(i_t) = PO(i_t) + ζ · G(i_t)                                                       (27)

Pw(i_t, l) = Pw(i_t, l) + ζ · G(i_t) · Pin(l)                                        (28)

SO(i, j) = SO(i, j) + ζ · g(i, j)                                                    (29)

Sw1(i, j, k) = Sw1(i, j, k) + ζ · g(i, j) · x(k)                                     (30)

Sw2(i, j, k) = Sw2(i, j, k) + ζ · g(i, j) · x²(k)                                    (31)

for 1 ≤ i ≤ NC, 1 ≤ j ≤ Ns, and 1 ≤ k ≤ Nf.
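For the minimax case the same machinery applies, but only the output unit i_t with the largest error drives the correction (eqns (25)-(31)). A sketch in the same assumed notation as the earlier code:

```python
import numpy as np

def bp_update_qinf(x, Tout, SO, Sw1, Sw2, PO, Pw, zeta=0.1):
    """One minimax (q -> infinity) step: correct only the worst output unit."""
    Snet = SO + Sw1 @ x + Sw2 @ (x * x)
    Sout = 1.0 / (1.0 + np.exp(-Snet))
    Pin  = np.prod(Sout, axis=1)
    Pnet = PO + Pw @ Pin
    Pout = 1.0 / (1.0 + np.exp(-Pnet))

    it = int(np.argmax(np.abs(Tout - Pout)))         # class with maximum error
    G_it = 2.0 * (Tout[it] - Pout[it]) * Pout[it] * (1.0 - Pout[it])     # eqn (25)
    g = Pin[:, None] * (1.0 - Sout) * (G_it * Pw[it, :])[:, None]        # eqn (26)

    PO[it]    += zeta * G_it                         # eqn (27)
    Pw[it, :] += zeta * G_it * Pin                   # eqn (28)
    SO  += zeta * g                                  # eqn (29)
    Sw1 += zeta * g[:, :, None] * x                  # eqn (30)
    Sw2 += zeta * g[:, :, None] * (x * x)            # eqn (31)
```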

3.3. Training

In order to speed up the training, we make the learning factor adaptive after each training iteration:

ζ = α · ζ    if ΔE_s ≥ 0
ζ = β · ζ    if ΔE_s < 0                                                             (32)

where α is an assigned constant less than 1 and β = (1/α)^{1/c} for a positive integer constant c. We define ΔE_s(n) = E_s(n) − E_s(n − 1), where n is the iteration number, the smoothed error is E_s(n) = (1/2)[E(n) + E_s(n − 1)], and E_s(0) = E(1). For example, we use α = 0.9 and c = 5 in our back propagation training. If the smoothed error E_s(n) increases in a given iteration, the learning factor is decreased, as seen in eqn (32). If E_s(n) decreases, the learning factor is increased. Note that the rate of decrease is much greater than the rate of increase, preventing the error from blowing up.
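A sketch of the schedule of eqn (32), with the smoothed error carried between iterations; the function signature and the β = (1/α)^{1/c} form follow the reconstruction above and are not taken from any published code.

```python
def update_learning_factor(zeta, E_smoothed_prev, E_current, alpha=0.9, c=5):
    """Adaptive learning factor of eqn (32).

    E_s(n) = 0.5 * [E(n) + E_s(n-1)].  If the smoothed error rises, zeta is
    multiplied by alpha < 1; otherwise by beta = (1/alpha)**(1/c), so the
    decrease is much faster than the increase.
    """
    E_smoothed = 0.5 * (E_current + E_smoothed_prev)
    beta = (1.0 / alpha) ** (1.0 / c)
    if E_smoothed - E_smoothed_prev >= 0.0:
        zeta *= alpha
    else:
        zeta *= beta
    return zeta, E_smoothed
```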

4. CLASSIFICATION EXPERIMENTS

In this section, we apply our forward mapping algorithms to two pattern recognition problems. The first problem is that of recognizing handprinted numerals. The second is a geometrical shape recognition problem. To evaluate our network's effectiveness, we compare the recognition performances of our algorithm with those of two other well-known algorithms.

4.1. Handprinted Numeral Recognition Example

Our raw data consists of images of 10 handprinted numerals, collected from 3,000 people by the Internal Revenue Service (see Weideman, Manry, & Yau, 1989). We randomly chose 300 characters from each class to generate a 3,000 character training data set. A separate testing data set consists of 6,000 characters, 600 from each class, which are selected from the remaining database. Images are 32 × 24 binary matrices. An image scaling algorithm is used to remove size variation in the characters. The feature set contains 16 elements, developed by Gong and Manry (1989), which are non-Gaussian. We use the iterative K-mean clustering algorithm, described by Anderberg (1973) and Duda and Hart (1973), to select 10 reference cluster mean and variance vectors for each class. This algorithm differs from the standard K-mean clustering algorithm in that each cluster is constrained to have the same number of members.


FIGURE 3. Neural nets in training (handprinted numerals): recognition error percentage versus training iteration number (0-150).

We compared the performance of the NNIN with that of the NNC and LVQ2.1 for the 10 numeral classes. We plot the recognition error percentage versus the training iteration number for the training data set for these different classifiers in Figure 3. Table 1 shows the final results for the training and testing data sets. The NNC is designed for 10 numeral classes and it has 10.43% classification error for the training data and 11.60% classification error for the testing data. The NNC was mapped to the LVQ2.1 and NNIN networks, in order to initialize their weights. The LVQ2.1 network quickly reduced the error percentage and finally leveled off at 6.37% error for the training set. The LVQ2.1 network decreased the error rate for the testing set down to 9.73%. Initially, the error for the NNIN was 9.47%. Although the error percentage was initially small, back propagation with the adaptive-q technique improved the performance still further.

TABLE 1
Recognition experiments for handprinted numerals (in error percentage)

Classifier    Training Set    Testing Set
NNC           10.43%          11.60%
LVQ2.1         6.37%           9.73%
NNIN           4.93%           7.82%

After 150 training iterations, the NNIN classifier had 4.93% classification error for the training data and 5.5% error for the testing data. Both the LVQ2.1 and the NNIN performed much better than the NNC. However, the NNIN had better performance than LVQ2.1.

4.2. Shape Recognition Example

In order to confirm our results, we obtained a second database for a completely different problem: the classification of geometric shapes. The four primary geometric shapes used in this experiment are (a) ellipse, (b) triangle, (c) quadrilateral, and (d) pentagon. Each shape image consists of a matrix of size 64 × 64. Each element in the matrix represents a binary-valued pixel in the image. For each shape, 200 training patterns and 200 testing patterns were generated using different degrees of deformation. The deformations included rotation, scaling, translation, and oblique distortions. The weighted log-polar transform, which computes energies in several ring and wedge-shaped areas in the Fourier transform domain, was applied to extract 16 features for each pattern. A detailed description of these features is given in Appendix B. In Figure 4 we plot the recognition error percentage versus the training iteration number for the training data set for the NNC, LVQ2.1, and NNIN, respectively. The classifier performances are presented in Table 2 for the training and testing data sets.

FIGURE 4. Neural nets in training (geometric shapes): recognition error percentage versus training iteration number (0-150).

Note that mapping the NNC to the NNIN increased the error rate in Figure 4, contrary to what is seen in the handprinted numeral recognition experiment in Figure 3. Nonetheless, the NNIN reduces the misclassification rate very quickly and finally reaches 0% classification error for the training data. As in the first experiment, the NNIN has better performance than the NNC and LVQ2.1 classifiers for both training and testing data sets.

TABLE 2
Recognition experiments for geometric shapes (in error percentage)

Classifier    Training Set    Testing Set
NNC            8.38%           5.88%
LVQ2.1         2.88%           4.88%
NNIN           0.00%           1.63%

5. CONCLUSIONS

In this paper we have shown that under some circumstances, a NNC can be mapped to a sigma-pi neural network and then improved through back propagation training. This method was applied to NNC's for handprinted numerals and geometric shapes. Although the error percentages were initially small, back propagation in the corresponding neural nets improved the performances still further.

The two examples, handprinted numerals and geometric shapes, show the range of performances possible with the NNIN. Mapping of the NNC to the NNIN greatly reduces the classification errors in both problems. Note that the use of the adaptive q-norm error criterion may cause the error percentage to abruptly rise or drop during training. However, we observe that the curves of error percentage versus iteration number still decrease asymptotically. Also, this learning technique may help us avoid getting trapped in local minima. Because the initial weights used are good, the error percentage quickly levels off. Since the classification performance of the network is improved for both training and testing data, it is apparent that the classifier is generalizing, and not just memorizing the training data.

REFERENCES

Anderberg, M. (1973). Cluster analysis for applications. New York: Academic Press.

Casasent, D., & Psaltis, D. (1976). Position, rotation, and scale invariant optical correlation. Applied Optics, 15, 1795-1799.

Cheney, E. W. (1982). Introduction to approximation theory. New York: Chelsea Publishing.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.

Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133-142.

Fukunaga, K. (1972). Introduction to statistical pattern recognition. New York: Academic Press.

Gong, W., & Manry, M. T. (1989). Analysis of non-Gaussian data using a neural network. International Joint Conference on Neural Networks, Washington, D.C., June 1989, 2, p. 576.

Hecht-Nielsen, R. (1987). Counterpropagation networks. IEEE First International Conference on Neural Networks, San Diego, CA, June 1987, 2, 19-32.

Hecht-Nielsen, R. (1988). Applications of counterpropagation networks. Neural Networks, 1, 131-139.

Kohonen, T. (1988). An introduction to neural computing. Neural Networks, 1, 3-16.

Kohonen, T. (1990). Improved versions of learning vector quantization. International Joint Conference on Neural Networks, San Diego, CA, June 1990, 1, 545-550.

Kohonen, T., Barna, G., & Chrisley, R. (1988). Statistical pattern recognition with neural networks: Benchmarking studies. IEEE International Conference on Neural Networks, San Diego, CA, July 1988, 1, 61-68.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, April 1987, 4-22.

Powell, M. J. D. (1981). Approximation theory and methods. New York: Cambridge University Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.

Smith, & DeVoe, D. R. (1988). Hybrid optical/electronic pattern recognition with both coherent and noncoherent operations. Proceedings of SPIE, 938, 170-177.

Weideman, W. E., Manry, M. T., & Yau, H. C. (1989). A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition. International Joint Conference on Neural Networks, Washington, D.C., June 1989, 1, 117-120.

Yau, H. C., & Manry, M. T. Rotation and scale invariant pattern recognition using a weighted log-polar transform. To be submitted to Optical Engineering.

APPENDIX A: PRODUCT UNITS VERSUS MIN UNITS

In Kohonen's LVQ network, Min units in the output layer determine the classification result. A Min unit passes the minimum value of its inputs through to its output. In the NNIN, it is quite possible to use Min units in the output layer. This would result in a network very similar to those of Kohonen. In this appendix, we consider alternative units.

A.1 Replacements for Min units

In the NNC, the Min units have two important properties that allow the generation of complicated (disjoint, for example) decision regions. These are as follows:

Property 1: Let r_mn be the vector among the r_ij from which the random vector x has the smallest distance. If x → r_mn, then D²_mn → 0, and D²_mn = min D²_ij for all i ∈ [1, NC], j ∈ [1, Ns].

Property 2: Property 1 still holds if new reference vectors are added to any class.

Two possible replacements for the Min unit are the product unit (Durbin and Rumelhart, 1989) and the sum unit. Define the activation function for the product unit as

Pin(i) = Π_{j=1}^{Ns} Sout(i, j)                                                     (A1)

The product unit has both of the properties given above. As x → r_ij, then D²_ij and Pin(i) both approach 0. Assuming all the r_ij vectors are unequal, then D²_mn > 0 and Pin(i) > 0 for all i ≠ m, and Property 1 is satisfied. When a new reference vector is added, Property 1, and therefore Property 2, are still satisfied. Therefore, the product units have the required properties and can function as Min units. As a second example, consider the sum unit, whose activation function is defined as

S(i) = Σ_{j=1}^{Ns} Sout(i, j)                                                       (A2)

If x → r_ij, then D_ij approaches 0 but S(i) does not, so Property 1 is not generally satisfied for the sum unit. If a sum unit did have Property 1 and a new example vector were added to the ith class, S(i) could no longer be driven to 0 as x approaches r_ij. Thus the sum unit does not have Properties 1 or 2, and is not a suitable replacement for the Min unit. There are two advantages to using product units rather than Min units: (a) derivatives of products are much simpler to calculate than derivatives of a minimum operation, and (b) a back propagation iteration for product units results in changes for all weights, unlike a similar iteration for Min units.

A.2 Some properties of product units

Assume that Pw(i, l) = δ(i − l). For each class, put the cluster outputs Sout(i, j) in increasing order such that Sout(i, 1) ≤ Sout(i, 2) ≤ ... ≤ Sout(i, Ns), and let Rs(i) = Π_{j=2}^{Ns} Sout(i, j), so that Pin(i) = Sout(i, 1) · Rs(i). Assume that the correct class is class 1. The Min unit assigns the vector x to class 1 if

Sout(1, 1) < Sout(i, 1)                                                              (A3)

for all i > 1. The product unit assigns x to class 1 if Pin(1) is smaller than Pin(i) for every i > 1, which is the same condition as

Sout(1, 1) < Sout(i, 1) · Rs(i) / Rs(1)                                              (A4)

for all i > 1. If eqn (A3) holds, a sufficient condition for (A4) is that Sout(1, j) ≤ Sout(i, j) for all i > 1 and j ≥ 2, so that Rs(1) ≤ Rs(i). Conversely, if (A4) holds and Rs(i) ≤ Rs(1) for all i > 1, then (A3) also holds. Thus, under certain conditions, product units give the same answer as Min units.
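A small numerical illustration of this comparison (a toy example of our own, not from the paper): when the correct class is uniformly closest, so that every ordered output of class 0 lies below the corresponding outputs of the other classes, the Min rule and the product rule pick the same class.

```python
import numpy as np

# Toy Sout matrix (NC = 3 classes, Ns = 3 clusters per class); class 0 is
# uniformly the closest, so the sufficient condition of A.2 holds.
Sout = np.array([[0.10, 0.20, 0.30],    # class 0
                 [0.40, 0.50, 0.60],    # class 1
                 [0.35, 0.55, 0.70]])   # class 2

min_class  = int(np.argmin(Sout.min(axis=1)))    # Min-unit decision
prod_class = int(np.argmin(Sout.prod(axis=1)))   # product-unit decision
print(min_class, prod_class)                     # both print 0 here
```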

APPENDIX B: LOG-POLAR FEATURE DEFINITIONS

Casasent and Psaltis (1976) have proposed log-polar transform (LPT) techniques combining a geometric coordinate transformation with the conventional Fourier transform. Smith and DeVoe (1988) have applied the LPT to the magnitude-squared Fourier transform of the image. Lately, our research has led to a weighted LPT. The first step in taking the LPT is to Fourier transform the input image f(x, y) to get F(R, ψ), where R and ψ denote the frequency domain radius and angle. Next, define ρ = ln(R) and L(ρ, ψ) = e^{2ρ} |F(e^ρ, ψ)|² = R² |F(R, ψ)|². The Fourier transformation of the LPT is taken as

M(ξ, ζ) = ∫∫ L(ρ, ψ) e^{−j(ξρ + ζψ)} dψ dρ                                           (B1)

The (ξ, ζ) domain is referred to here as the Mellin domain. We define the LPT features g1 and g2 as

g1(m) = |M(m, 0)| / M(0, 0),     g2(n) = |M(0, n)| / M(0, 0)                         (B2)

where m, n ∈ [1, 8].
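The feature extraction of this appendix can be sketched as follows. The grid sizes, the nearest-neighbor sampling of the frequency plane, and the array handling are all our own choices; the paper specifies only the weighting L(ρ, ψ) = R² |F(R, ψ)|² and the sixteen features of eqn (B2).

```python
import numpy as np

def weighted_lpt_features(image, n_rho=64, n_psi=64, n_feat=8):
    """Sketch of the 16 weighted log-polar features (eqns (B1)-(B2))."""
    F = np.fft.fftshift(np.fft.fft2(image))
    h, w = F.shape
    cy, cx = h // 2, w // 2
    r_max = min(cy, cx) - 1

    # Sample L(rho, psi) = R^2 |F(R, psi)|^2 on a (log-radius, angle) grid
    # by nearest-neighbor lookup in the frequency plane.
    rho = np.linspace(0.0, np.log(r_max), n_rho)
    psi = np.linspace(0.0, 2.0 * np.pi, n_psi, endpoint=False)
    R = np.exp(rho)[:, None]
    Y = np.clip(np.round(cy + R * np.sin(psi)).astype(int), 0, h - 1)
    X = np.clip(np.round(cx + R * np.cos(psi)).astype(int), 0, w - 1)
    L = (R ** 2) * np.abs(F[Y, X]) ** 2

    # Eqn (B1): 2-D Fourier transform of L; eqn (B2): 16 normalized samples.
    M = np.fft.fft2(L)
    g1 = np.abs(M[1:n_feat + 1, 0]) / np.abs(M[0, 0])
    g2 = np.abs(M[0, 1:n_feat + 1]) / np.abs(M[0, 0])
    return np.concatenate([g1, g2])
```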


NOMENCLATURE

D²_ij: metric distance between vectors x and r_ij.
E_p: the error function for the pth input pattern.
E: the error function being minimized by back propagation.
G(i): the gradient of E_p with respect to Pnet(i).
g(i, j): the gradient of E_p with respect to Snet(i, j).
NC: the number of classes.
Nf: the number of input dummy units in the neural net, and the number of elements in the feature vector x.
Np: the total number of training patterns.
Ns: the number of clusters per class.
Pin(i): product of the sigma-pi unit outputs of class i.
Pnet(i): net input of the ith output unit.
Pout(i): output of the ith output unit.
Pw(i, l): connection weight between Pin(l) and Pnet(i).
PO(i): threshold of the ith output unit.
r_ij: the jth reference vector of the ith class.
r_ij(k): the kth element of r_ij.
Snet(i, j): net input of the jth sigma-pi unit of the ith class.
Sout(i, j): output of the jth sigma-pi unit of the ith class.
Sw1(i, j, k): the connection weight for the first order of the kth feature to the jth sigma-pi unit of the ith class.
Sw2(i, j, k): the connection weight for the second order of the kth feature to the jth sigma-pi unit of the ith class.
SO(i, j): threshold of the jth sigma-pi unit of the ith class.
Tout(i): the desired activation for the ith output unit.
x: feature vector.
x(k): the kth element or feature of x.
v_ij: the Nf by 1 variance vector for the jth cluster of the ith class.
v_ij(k): the kth element of v_ij.
ζ: learning factor in back propagation.