Neurocomputing 62 (2004) 39–64

Training RBF networks with selective backpropagation

Mohammad-Taghi Vakil-Baghmisheh*, Nikola Pavešić

Laboratory of Artificial Perception, Systems and Cybernetics, Faculty of Electrical Engineering, University of Ljubljana, Slovenia

Received 11 March 2002; received in revised form 8 July 2003; accepted 19 November 2003

*Corresponding author. LUKS, Fakulteta za elektrotehniko, Tržaška 25, 1000 Ljubljana, Slovenia. Tel.: +386-1-4768839; fax: +386-1-4768316. E-mail addresses: [email protected] (M.-T. Vakil-Baghmisheh), [email protected] (N. Pavešić).

Abstract

Backpropagation with selective training (BST) is applied to the training of radial basis function (RBF) networks. It improves the performance of the RBF network substantially, in terms of convergence speed and recognition error. Three drawbacks of the basic backpropagation algorithm, i.e. overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns, are solved. In addition, it has the advantages of shortening training time (up to 3 times) and de-emphasizing overtrained patterns. The simulation results obtained on 16 datasets of the Farsi optical character recognition problem prove the advantages of the BST algorithm. Three activity functions for output cells are examined, and the sigmoid activity function is preferred over the others, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Neural networks; Radial basis functions; Backpropagation with selective training; Overtraining; Farsi optical character recognition

1. Introduction

Neural networks (NNs) have been used in a broad range of applications, including: pattern classification, pattern completion, function approximation, optimization, prediction, and automatic control. In many cases, they even outperform their classical


Fig. 1. Configuration of the RBF network: inputs x_1, ..., x_{l_1} (i = 1, ..., l_1), hidden cells y_1, ..., y_{l_2} with prototype matrix V (m = 1, ..., l_2), and output cells z_1, ..., z_{l_3} with weight matrix U (j = 1, ..., l_3); for explanation see Appendix A.

counterparts. In spite of different structures and training paradigms, all NNs perform essentially the same function: vector mapping. Likewise, all NN applications are special cases of vector mapping.

Development of detailed mathematical models for NNs began in 1943 with the work of McCulloch and Pitts [12] and was continued by others. According to Wasserman [20], the first publication on radial basis functions for classification purposes dates back to 1964 and is attributed to Bashkirov et al. [4] and Aizerman et al. [1]. In 1988, based on Cover's theorem on the separability of patterns [6], Broomhead and Lowe [5] employed radial basis functions in the design of NNs.

The RBF network is a two-layered network (Fig. 1), and the common method for its training is the backpropagation algorithm. The first version of the backpropagation algorithm, based on the gradient descent method, was proposed by Werbos [21] and Parker [13] independently, but gained popularity after the publication of the seminal book by Rumelhart et al. [15]. Since then, many modifications have been offered by others, and Jondarr [10] has reviewed 65 varieties. Almost all variants of the backpropagation algorithm were originally devised for the multilayer perceptron (MLP). Therefore, any variant of the backpropagation algorithm which is used for training the radial basis function (RBF) network should be customized to suit this network, so it will be somewhat different from the variant suitable for the MLP.

Using the backpropagation algorithm for training the RBF network has three main drawbacks:
• overtraining, which weakens the network's generalization property,
• slowness at the end of training,
• inability to learn the last few percent of vector associations.

A solution offered for the overtraining problem is early stopping by employing the cross validation technique [9]. There are plenty of research reports that argue against the usefulness of the cross validation technique in the design and training of NNs. For detailed discussions the reader is invited to see [2,3,14].


From our point of view, there are two major reasons against using early stopping and the cross validation technique on our data:
(1) Cross validation stops training on both learned and unlearned data. While the logic behind early stopping is preventing overtraining on learned data, there is no logic for stopping the training on unlearned data when the data is not contradictory.
(2) In the RBF and the MLP networks, the learning trajectory depends on the randomly selected initial point. This means that the optimal number of training epochs obtained by CV is useful only if we always start training from the same initial point, and the network always traverses the same learning trajectory!

To improve the performance of the network, the authors suggest selective training, as there is no other way to improve the performance of the RBF network on the given datasets. The paper shows that if we use early stopping or continue the training with the whole dataset, the generalization error will be much higher than the results obtained by selective training.

In [19] the backpropagation with selective training (BST) algorithm was presented for the first time and was used for training the MLP network. Based on the results obtained on our OCR datasets, the BST algorithm has the following advantages over the basic backpropagation algorithm:
• prevents overtraining,
• de-emphasizes the overtrained patterns,
• enables the network to learn the last percent of unlearned associations in a short period of time.

As there is no universally effective method, the BST algorithm is not an exception. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system, but it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, when a sufficient number of proper features is extracted.

Organization of the paper: The RBF network is reviewed in Section 2. In Section 3, the training algorithms are presented. Simulation results are presented in Section 4, and conclusions are given in Section 5. In addition, the paper includes two appendices. In most of the resources, the formulations for calculating error gradients of RBF networks are either erroneous and conflicting (for instance, see formulas 4.57, 4.60, 7.53, 7.54, 7.55 in [11]), or not given at all (see for instance [20,16]). Thus, in Appendix A we obtain these formulas for three forms of output cell activity function. Appendix B presents some information about the feature extraction methods used for creating the Farsi optical character recognition datasets, which are used for the simulations in this paper.

Remark. Considering that in the classifier every pattern is represented by its feature vector as the input vector to the classifier, classifying the input vector is equivalent


to classifying the corresponding pattern. Frequently in the paper, the vector which is to be classified is referred to as the input pattern, or simply the pattern, and vice versa.

2. RBF networks

In this section, the structure, training paradigms and initialization methods of RBF networks are reviewed.

2.1. Structure

While there are various interpretations of RBF, in this paper we will consider it from the pattern recognition point of view. The main idea is to divide the input space into subclasses, and to assign a prototype vector for every subclass at its center. The membership of every input vector in each subclass is then measured by a function of its distance from the prototype (or kernel) vector, that is f_m(x) = f(‖x − v_m‖). This membership function should have four specifications:
1. Attaining the maximum value at the center (zero distance).
2. Having a considerable value in the close neighborhood of the center.
3. Having a negligible value at far distances (which are close to other centers).
4. Differentiability.

In fact, any differentiable and monotonically decreasing function of ‖x − v_m‖ will fulfill these conditions, but the Gaussian function is the common choice. After obtaining the membership values (or similarity measures) of the input vector in the subclasses, the results should be combined to obtain the membership degrees in every class. The two-layered feed-forward neural network depicted in Fig. 1 is capable of performing all these operations, and is called the RBF network. The neurons in the hidden layer of the network have a Gaussian activity function and their input–output relationship is

$$y_m = f_m(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^2}\right), \qquad (1)$$

where v_m is the prototype vector or the center of the mth subclass and σ_m is the spread parameter, through which we can control the receptive field of that neuron. The receptive field of the mth neuron is the region in the input space where f_m(x) is high.

The neurons in the output layer could be sigmoid, linear, or pseudo-linear, i.e. linear with some squashing property, so the output can be calculated using one of the following equations:

$$z_j = \begin{cases} \dfrac{1}{1+e^{-s_j}}, & \text{sigmoid;} \\[2mm] \dfrac{s_j}{l_2}, & \text{linear, with } \dfrac{1}{l_2} \text{ squashing function;} \\[2mm] \dfrac{s_j}{\sum_m y_m}, & \text{pseudo-linear, with } \dfrac{1}{\sum_m y_m} \text{ squashing function,} \end{cases} \qquad (2)$$

where

$$s_j = \sum_{m=1}^{l_2} y_m u_{mj}, \qquad j = 1, \ldots, l_3. \qquad (3)$$
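As a concrete illustration of Eqs. (1)–(3), the following minimal NumPy sketch (hypothetical names, not the authors' code) computes the forward pass for a single input vector, using the sigmoid output case:

```python
import numpy as np

def rbf_forward(x, V, sigma2, U):
    """Forward pass of the RBF network of Fig. 1 (sketch).

    x      : (l1,)    input vector
    V      : (l1, l2) matrix of prototype (kernel) vectors, one column per hidden cell
    sigma2 : (l2,)    spread parameters sigma_m^2
    U      : (l2, l3) output-layer weight matrix
    """
    # Eq. (1): Gaussian hidden activities y_m
    d2 = np.sum((x[:, None] - V) ** 2, axis=0)   # squared distances ||x - v_m||^2
    y = np.exp(-d2 / (2.0 * sigma2))
    # Eq. (3): net inputs of the output cells
    s = y @ U
    # Eq. (2), sigmoid case
    z = 1.0 / (1.0 + np.exp(-s))
    return y, s, z
```

The same routine covers the linear and pseudo-linear cases of Eq. (2) by replacing the last line with `s / U.shape[0]` or `s / y.sum()`, respectively.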

Although in most of the literature neurons with a linear or pseudo-linear activity function have been considered for the output layer, we strongly recommend using the sigmoidal activity function, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.

2.2. Training paradigms

Before starting the training, a cost function should be defined, which the training process will try to minimize. Total sum-squared error (TSSE) is the most popular cost function. Three paradigms of training have been suggested in the literature:
1. No-training: In this, the simplest case, all the parameters are calculated and fixed in advance and no training is required. This paradigm does not have any practical value, because the number of prototype vectors should be equal to the number of training samples, and consequently the network will be too large and very slow.
2. Half-training: In this case the hidden layer parameters (kernel vectors and spread parameters) are calculated and fixed in advance, and only the connection weights of the output layer are adjusted through the backpropagation algorithm.
3. Full-training: This paradigm requires the training of all parameters, including kernel vectors, spread parameters, and the connection weights of the output layer (v_m's, σ_m's and u_mj's), through the backpropagation algorithm.

2.3. Initialization methods

The method of initialization of any parameter will depend on the selected training paradigm. To determine the initial values of the kernel vectors, many methods have been suggested, among which the most popular are:
1. the first samples of the training set,
2. some randomly chosen samples from the training set,


3. subclass centers obtained by some clustering or classification method, e.g. the k-means algorithm or the LVQ algorithm.
Theodoridis [16] has reviewed some other methods and cited some related review papers.

Wasserman [20] presents a heuristic which can be useful in determining the method of calculating initial values of the spread parameters:

Heuristic: The object is to cover the input space with receptive fields as uniformly as possible. If the spacing between centers is not uniform, it may be necessary for each subclass to have its own value of σ. For subclasses that are widely separated from others, σ must be large enough to cover the gap, whereas for subclasses that are close to others, σ must have a small value.

Depending on the dataset, the training paradigm, and according to the heuristic, one of the following methods can be adopted:
1. Assigning a small fixed value, say σ = 0.05 or 0.1, which requires a large number of hidden neurons to cover the input space.
2. σ = d/√(2 l_2), where d is the maximum distance between the chosen centers, and l_2 is the number of centers.
3. In the case of using the k-means algorithm to find the kernel vectors, σ_m could be the standard deviation of the vectors in the pertaining subclass.

To assign initial values to the weights of the output layer, there are two methods:
1. Some random values in the range [−0.1, +0.1]. This method necessitates weight adjustment through an iterative process (the backpropagation algorithm).
2. Using the pseudo-inverse matrix to solve the following matrix equation:

$$\mathbf{Y}\mathbf{U} = \mathbf{Z}, \qquad \mathbf{Y} = \begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_Q \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \\ \vdots \\ \mathbf{z}_Q \end{bmatrix}, \qquad \mathbf{y}_q \in \mathbb{R}^{l_2},\ \mathbf{z}_q \in \mathbb{R}^{l_3}, \qquad (4)$$

where y_1, ..., y_Q and z_1, ..., z_Q are the row vectors obtained from the hidden and output layers, respectively, in response to the row vectors x_1, ..., x_Q in the input layer, and the equation YU = Z is constructed as follows: for each input vector x_q in the training set, the outputs of the hidden layer form a row of the matrix Y, the target outputs are placed in the corresponding row of the target matrix Z, and each set of weights associated with an output neuron forms a column of the matrix U. Considering that in large-scale problems the dimension of Y is high and (Y^T Y)^{-1} is ill-conditioned, despite the superficial appeal of the pseudo-inverse matrix method, the first (iterative) method is the only applicable one.
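As an illustration of these initialization choices, the sketch below sets the kernel vectors to class means (method 3 of the kernel-vector list), the spread parameters to the per-cluster variances (method 3 of the spread-parameter list), and the output weights to small random values; the final helper shows the pseudo-inverse alternative of Eq. (4) in least-squares form. All names are hypothetical, and this is not the authors' code.

```python
import numpy as np

def init_rbf(X, labels, n_hidden, l3, rng=np.random.default_rng(0)):
    """Initialize an RBF network (sketch, assuming one kernel vector per class).

    X      : (Q, l1) training patterns
    labels : (Q,)    class indices in 0..n_hidden-1
    l3     : number of output cells
    """
    l1 = X.shape[1]
    V = np.zeros((l1, n_hidden))
    sigma2 = np.zeros(n_hidden)
    for m in range(n_hidden):
        Xm = X[labels == m]
        V[:, m] = Xm.mean(axis=0)                        # kernel vector = cluster centre
        # spread parameter = variance of the cluster around its centre
        sigma2[m] = np.mean(np.sum((Xm - V[:, m]) ** 2, axis=1))
    # output weights: small random values in [-0.1, +0.1]
    U = rng.uniform(-0.1, 0.1, size=(n_hidden, l3))
    return V, sigma2, U

def output_weights_lstsq(Y, Z):
    """Pseudo-inverse alternative of Eq. (4): solve YU = Z in the least-squares sense."""
    U, *_ = np.linalg.lstsq(Y, Z, rcond=None)
    return U
```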


3. Training algorithms

In this section, we present two training algorithms for the RBF network. First the basic backpropagation (BB) algorithm is reviewed, and then the modified algorithm is presented.

3.1. Basic backpropagation for the RBF network

Here we consider the algorithm for the full-training paradigm; customizing it for half-training is straightforward and can be done simply by eliminating the gradient calculations and weight updates corresponding to the appropriate parameters.

Algorithm.
(1) Initialize the network.
(2) Forward pass: Insert the input and the desired output, and compute the network outputs by proceeding forward through the network, layer by layer.
(3) Backward pass: Calculate the error gradients versus the parameters, layer by layer, starting from the output layer and proceeding backwards: ∂E/∂u_mj, ∂E/∂v_im, ∂E/∂σ_m² (see Appendix A).
(4) Update parameters:

$$u_{mj}(n+1) = u_{mj}(n) - \eta_3 \frac{\partial E}{\partial u_{mj}}, \qquad (5)$$

$$v_{im}(n+1) = v_{im}(n) - \eta_2 \frac{\partial E}{\partial v_{im}}, \qquad (6)$$

$$\sigma_m^2(n+1) = \sigma_m^2(n) - \eta_1 \frac{\partial E}{\partial \sigma_m^2}, \qquad (7)$$

where η_1, η_2, η_3 are learning rate factors in the range [0, 1].
(5) Repeat the algorithm for all training inputs. If one epoch of training is finished, repeat the training for another epoch.

Remarks. (1) Based on our experience, the addition of the momentum term, as is common for the MLP, does not help in training the RBF network.
(2) If the sigmoidal activity function is used for the output cells, adding the sigmoid prime offset (footnote 1) [8] will improve training substantially, similar to the MLP.
(3) Stopping should be decided based on the results of the network test, which is carried out every T epochs after the cost function becomes smaller than a threshold value C.

[Footnote 1] As the output of a neuron approaches extreme values (0 or 1) there will be little or no learning. A solution to this problem is adding a small offset (≈ 0.1) to the derivative ∂z_j/∂s_j in Eq. (A.9), which is called "the sigmoid prime offset"; thus ∂z_j/∂s_j never reaches zero. Based on our experience, adding such a term is helpful only in the calculation of (A.11), but not in the calculation of (A.26) or (A.30).


(4) To get better generalization performance, using the cross validation method [9] as a stopping criterion has been reported in some cases; this, however, as was mentioned in the introduction, is unsatisfactory and unconvincing, because it stops training on both learned and unlearned inputs.
(5) The number of output cells depends on the number of classes and the approach of coding; however, it is highly recommended to make it equal to the number of classes.
(6) Sometimes a constant term (called the threshold term) is also considered in the net input of the sigmoid function or the linear output, implemented using a constant input (equal to −1). In some cases this term triggers the "moving target phenomenon" and hinders training, and in some other cases without it there is no solution. Therefore, it must be examined for every case separately.
(7) In the rest of this paper, by BB we mean the backpropagation algorithm with the sigmoid prime offset as explained in footnote 1, without the momentum term.

3.2. Backpropagation with selective training

The difference between the BST algorithm and the BB algorithm lies in the selective training, which is appended to the BB algorithm. When most of the vector associations have been learned, every input vector should be checked individually; if it is learned there should be no training on that input, otherwise training will be carried out. In doing so, a side effect will arise: the stability problem. That is to say, when we continue training on only some inputs, the network usually forgets the other input–output associations which were already learned, and in the next epoch of training it will make wrong predictions for some of the inputs that were already classified correctly. The solution to this side effect consists of considering a "stability margin" in the definition of correct classification in the training step. In this way we also carry out training on marginally learned inputs, which are on the verge of being misclassified.

Selective training has its own limitations, and cannot be used on conflicting data, or on a dataset with large overlapping areas of classes. Based on the obtained results, using the BST algorithm on an error-free OCR dataset has the following advantages:
• prevents overtraining,
• de-emphasizes the overtrained patterns,
• enables the network to learn the last percent of unlearned associations in a short period of time.

BST algorithm.
(1) Start training with the BB algorithm, which includes two steps:
• forward propagation,
• backward propagation.
(2) When the network has learned most of the vector mappings and the training procedure has slowed down, i.e. TSSE becomes smaller than a threshold value C, stop the main algorithm and continue with selective training.


(3) For any pattern, perform forward propagation and examine the prediction of the network:

$$z_J = \max_j(z_j), \quad j = 1, \ldots, l_3; \qquad
\begin{aligned}
&\text{if } z_J > z_j + \varepsilon \ \ \forall j \neq J \;\Rightarrow\; J \text{ is the predicted class;} \\
&\text{if } z_J < z_j + \varepsilon \ \ \exists j \neq J \;\Rightarrow\; \text{no-prediction,}
\end{aligned} \qquad (8)$$

where ε is a small positive constant, called the stability margin.
(4) If the network makes a correct prediction, do nothing; go back to step 3 and repeat the algorithm for the next input. Otherwise, i.e. if the network does not make a correct prediction (including the no-prediction case), carry out the backward propagation.
(5) If the epoch is not complete, go to step 3; else check the total number of wrong predictions: if its trend is decreasing, go to step 3 and continue the training, otherwise stop training.
(6) Examine the network performance on the test set: do only forward propagation, and then, for any input, z_J = max_j(z_j), j = 1, ..., l_3 → J is the predicted class.
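The selection step of Eq. (8) can be sketched as follows; the function names and the way the forward and backward passes are passed in are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def selective_epoch(train_set, forward, backward, epsilon=0.2):
    """One epoch of selective training (steps 3-5 of the BST algorithm), as a sketch.

    train_set : iterable of (x, target_class) pairs
    forward   : function x -> z, the network outputs
    backward  : function (x, target_class) -> None, one backpropagation step on this pattern
    epsilon   : stability margin of Eq. (8)
    Returns the number of wrong (or no-) predictions in this epoch.
    """
    wrong = 0
    for x, target in train_set:
        z = forward(x)
        J = int(np.argmax(z))
        others = np.delete(z, J)
        # Eq. (8): J is accepted only if z_J exceeds every other output by epsilon
        confident = np.all(z[J] > others + epsilon)
        if confident and J == target:
            continue                  # learned with margin: skip training on this pattern
        wrong += 1
        backward(x, target)           # train only on misclassified or marginal patterns
    return wrong
```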

Remarks. (1) An alternative condition for starting the selective training is as follows: after TSSE becomes smaller than a threshold value C, carry out a recognition test on the training set every T epochs, and if it fulfills a second threshold condition, that is (recognition error < C_1), start selective training. The recommended value for T is 3 ≤ T ≤ 5.
(2) If ε is chosen large, training will be continued for most of the learned inputs, and this will make our method ineffective. On the other hand, if ε is chosen too small, during training we will face a stability problem, i.e. with a small change in the weight values, which happens in every epoch, a considerable number of associations will be forgotten; thus the network will oscillate and training will not proceed. After training, a small change in the feature vector, causing a small change in the output values, will change the prediction of the network, or for feature vectors from the same class but with minor differences we will have different predictions. This also causes vulnerability to noise and weak performance on the test set and on real-world data outside both the training set and the test set. The optimum value of ε should be small enough to prevent training on learned inputs, but not so small as to allow the winner neuron to change with minor changes in weight values or input values. Our simulation results show that for the RBF network a value in the range [0.1, 0.2] is the optimum for our datasets.
(3) It is also possible to consider a no-prediction state in the final test, that is

$$z_J = \max_j(z_j), \quad j = 1, \ldots, l_3; \qquad
\begin{aligned}
&\text{if } z_J > z_j + \varepsilon_1 \ \ \forall j \neq J \;\Rightarrow\; J \text{ is the predicted class;} \\
&\text{if } z_J < z_j + \varepsilon_1 \ \ \exists j \neq J \;\Rightarrow\; \text{no-prediction,}
\end{aligned} \qquad (9)$$

in which 0 < ε_1 < ε. This no-prediction state will decrease the error rate at the cost of decreasing the correct prediction rate.
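A corresponding sketch of the final-test decision of Eq. (9), with the optional no-prediction (rejection) state (again with hypothetical names):

```python
import numpy as np

def predict_with_rejection(z, epsilon1=0.1):
    """Final-test decision of Eq. (9): return the class index, or None (no-prediction).

    epsilon1 should satisfy 0 < epsilon1 < epsilon (the training stability margin).
    """
    J = int(np.argmax(z))
    others = np.delete(z, J)
    if np.all(z[J] > others + epsilon1):
        return J
    return None   # rejected: not counted as an error, but not as a correct prediction either
```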

4. Experiments

In this section, we first give some explanation about the datasets on which the simulations have been carried out. Then the simulation results are presented.

4.1. Datasets

A total of 18 datasets composed of feature vectors of 32 isolated characters of the Farsi alphabet, sorted into three groups, were created through various feature extraction methods, including: principal component analysis (PCA), vertical, horizontal and diagonal projections, zoning, pixel change coding and some combinations of them, with the number of features varying from 4 to 78 per character according to Table 1. For creating these datasets, 34 Farsi fonts, which are used in publishing online newspapers and web sites, were downloaded from the Internet. Fig. 2 demonstrates a whole set of isolated characters of one sample font printed as text. Then, the 32 isolated characters of these 34 Farsi fonts were printed into an image file; 11 sets of these fonts were boldface and one set was italic. In the next step, by printing the image file and scanning it with various brightness and contrast levels, two additional image files were obtained. Then, using a 65 × 65 pixel window, the character images were separated into images of isolated characters.

Table 1
Brief information about Farsi OCR datasets

| Group | Database | Extraction method | Explanation | No. of features per character |
|---|---|---|---|---|
| A | db1 | PCA | | 72 |
| A | db2 | PCA | | 54 |
| A | db3 | PCA | | 72 |
| A | db4 | PCA | | 64 |
| A | db5 | PCA | | 48 |
| B | dbn1 to dbn5 | PCA | Normalized versions of db1 to db5 | |
| C | db6 | Zoning | | 4 |
| C | db7 | Pixel change coding | | 48 |
| C | db8 | Projection | Horizontal and vertical | 48 |
| C | db9 | Projection | Diagonal | 30 |
| C | db10 | Projection | db8+db9 | 78 |
| C | db11 | | db6+db7 | 52 |
| C | db12 | | db6+db8 | 52 |
| C | db13 | | db6+db7+parts of db8 | 72 |


Fig. 2. A sample set of machine-printed Farsi characters (isolated case).

After applying a shift-radius invariant image normalization [18], and by reducing the sizes of the character images to 24 × 24 pixels, the feature vectors were extracted as explained in Appendix B.

4.2. Simulation results

In all of the simulations, two-thirds of the feature vectors, obtained from the original image and the first scanned image, were assigned to the training set, and one-third of the feature vectors, obtained from the second scanned image, were assigned to the test set. Therefore, 68 samples per character were used for training and 34 samples per character for testing. Thus, the total numbers of samples in the training sets and test sets are 2176 and 1088, respectively.

We considered three types of activity functions for the output layer: linear, pseudo-linear, and sigmoid, and we faced numerous problems with both the linear and pseudo-linear activity functions. These problems are explained later (see Considerations, item 2, in this section). Hence, in the sequel we present only the simulation results obtained with sigmoidal output neurons. In Table 2 the results obtained on the datasets of groups A–C by the BST and the BB algorithms are compared against each other.

4.2.1. Settings and initializations

In all the cases, we considered one prototype pattern per class, i.e. 32 prototype vectors, and 32 output cells for the output layer. Thus, the network configuration is l_1–32–32, where l_1 is the dimension of the input vector, which is equal to the number of features per character (see Table 1). Adding the threshold term to the sigmoid activity functions triggered the "moving target phenomenon", so we eliminated it. Also, we did not add the momentum term, because it did not help to improve training. However, adding the sigmoid prime offset boosted the performance of the network substantially.

Two training paradigms were examined, i.e. half- and full-training. In Table 2, we present only the results obtained by the full-training paradigm; the consequences of adopting the half-training paradigm will be discussed at the end. For initializing the kernel vectors, two methods were adopted: the first samples, and the cluster centers obtained by the k-means algorithm. Considering that k was set equal to one, these centers are simply the averages of the training samples of every class.

Table 2
Comparing the recognition errors of the RBF network obtained by the BB and the BST algorithms

| Database | BB Epochs (N) | BB Error Train | BB Error Test | BST Epochs (n, N) | BST Error Train | BST Error Test | First phase η3; η2; η1 | Second phase η3; η2; η1 | Threshold | σ² |
|---|---|---|---|---|---|---|---|---|---|---|
| db1 | 100 | 11 | 25 | 64, 104 | 0 | 17 | 5; 0.04; 0.001 | 1.7; 0.013; 0.0003 | 60 | 6 |
| db2 | 100 | 22 | 32 | 57, 90 | 4 | 22 | 5; 0.04; 0.001 | 1.7; 0.013; 0.0003 | 110 | 5 |
| db2 | | | | 100, 140 | 2 | 19 | 5; 0.04; 0.001 | 1.7; 0.013; 0.0003 | — | 5 |
| db3 | 100 | 11 | 7 | 51, 90 | 0 | 0 | 5; 0.04; 0.001 | 1.7; 0.013; 0.0003 | 80 | 6 |
| db4 | 100 | 9 | 6 | 39, 55 | 0 | 0 | 8; 0.04; 0.001 | 2.7; 0.013; 0.0003 | 60 | 7 |
| db5 | 100 | 14 | 6 | 58, 95 | 0 | 0 | 5; 0.04; 0.001 | 1.7; 0.013; 0.0003 | 80 | 6 |
| db7 | 100 | 63 | 31 | 65, 105 | 3 | 1 | 1.4; 0.001; 5e-6 | 0.5; 0.00033; 1.7e-6 | 220 | 0.05 |
| db7 | | | | 100, 140 | 2 | 0 | 1.4; 0.001; 5e-6 | 0.5; 0.00033; 1.7e-6 | — | 0.05 |
| db8 | 100 | 83 | 40 | 77, 115 | 14 | 7 | 3; 0.001; 5e-6 | 1; 0.00033; 1.7e-6 | 220 | 0.2 |
| db10 | 100 | 20 | 10 | 68, 108 | 2 | 1 | 6; 0.001; 5e-6 | 2; 0.0003; 1.7e-6 | 120 | 0.2 |
| db10 | | | | 100, 140 | 1 | 0 | 6; 0.001; 5e-6 | 2; 0.0003; 1.7e-6 | — | 0.2 |
| db11 | 100 | 49 | 28 | 48, 65 | 0 | 0 | 3; 0.001; 5e-6 | 1; 0.0003; 1.7e-6 | 170 | 0.05 |
| db12 | 100 | 68 | 32 | 77, 117 | 5 | 1 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1.7e-6 | 200 | 0.15 |
| db12 | | | | 100, 140 | 5 | 1 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1.7e-6 | — | 0.15 |
| db13 | 100 | 4 | 2 | 49, 60 | 0 | 0 | 5; 0.001; 5e-6 | 2; 0.0003; 1.7e-6 | 80 | 0.15 |
| dbn1 | 100 | 12 | 18 | 71, 113 | 0 | 16 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1.7e-6 | 50 | 0.5 |
| dbn1 | | | | 100, 120 | 0 | 13 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1.7e-6 | — | 0.5 |
| dbn2 | 100 | 1 | 27 | 64, 100 | 4 | 20 | 4; 0.001; 5e-6 | 1.3; 0.0003; 1e-6 | 100 | 0.4 |
| dbn2 | | | | 100, 140 | 0 | 15 | 4; 0.001; 5e-6 | 1.3; 0.0003; 1e-6 | 0 | 0.4 |
| dbn3 | 100 | 8 | 5 | 53, 93 | 0 | 0 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1e-6 | 60 | 0.5 |
| dbn3 | | | | 100, 130 | 0 | 0 | 5; 0.001; 5e-6 | 1.7; 0.0003; 1e-6 | — | 0.5 |
| dbn4 | 100 | 3 | 1 | 45, 75 | 0 | 0 | 4; 0.001; 5e-6 | 1.3; 0.0003; 1e-6 | 70 | 0.6 |
| dbn5 | 100 | 17 | 9 | 65, 100 | 0 | 0 | 4; 0.001; 5e-6 | 1.3; 0.0003; 1e-6 | 80 | 0.6 |


For initializing the spread parameters, two policies were adopted:
(1) A fixed number slightly larger than the minimum of the standard deviations of all clusters created by the k-means algorithm, defined as σ² in Eq. (10):

$$\sigma^2 = \min_i(\sigma_i^2), \qquad (10)$$

where σ_i² is the variance of the training patterns of the ith cluster, defined by

$$\sigma_i^2 = \frac{1}{68} \sum_{\mathbf{x}_q \in \text{cluster } i} (\mathbf{x}_q - \mathbf{m}_i)^2, \qquad i = 1, \ldots, 32, \qquad (11)$$

and m_i is the average vector of the ith cluster.
(2) Different initial values, equal to the variances of every cluster obtained by the k-means algorithm.

Total sum-squared error (TSSE) was considered as the cost function, and random numbers in the range [−0.1, +0.1] were assigned to the initial weight matrix of the output layer. The results presented in Table 2 were obtained while the initial spread parameters were equal, and the initial kernel vectors were set equal to the cluster centers of the k-means algorithm. For the BST algorithm, the stability margin was set equal to 0.2. This value was obtained based on the empirical results.

4.2.2. Training algorithms

(1) BB algorithm: The network was trained for 100 epochs with the BB algorithm. The obtained results are given under column BB.
(2) BST algorithm: A threshold value for TSSE was chosen, after which the training procedure slows down. This threshold value was acquired from the first training experiment through the BB algorithm. Training was restarted by the BB algorithm from the same initial point of the first experiment; when TSSE reached the threshold value, we changed the training algorithm from unselective to selective and continued for a maximum of 40 epochs, with the values of the learning parameters, i.e. η_3, η_2 and η_1, decreased almost three times. Every five epochs the network was tested; if the recognition error on the training set was zero, training was stopped. Training was also stopped if either the dynamic recognition error (see footnote 2) reached zero or 40 epochs of training were over. The obtained results are given under column BST. In this column, n and N represent the epoch numbers where selective training starts and ends.

[Footnote 2] Dynamic recognition error is obtained while the network is under training, and after presenting any pattern the network parameters will probably change. Therefore, it is different from the recognition error which is obtained after training. Generally, in selective training, the dynamic recognition error on the training set will be larger than the recognition error; therefore we could stop training earlier by performing a test on the training set after every few epochs of training, but this would violate the stability condition. Although, in our case study, this does not cause any problem, it can increase the recognition error in real on-line operation.


(3) BST algorithm: We did not set any threshold; training was carried out for 100 epochs on datasets dbn1, db2, dbn2, dbn3, db7, db10 and db12 with the BB algorithm, then followed with selective training for a maximum of 40 epochs. The obtained results are given under column BST, where the threshold value has not been specified, or where n is equal to 100.

4.2.3. Analysis

(1) Results obtained by the BB algorithm: The recognition errors on datasets dbn4, db13, dbn3, db4, db3 and db5 are lower than those on the others, and db12, db8 and db7 must be considered the worst ones. Data normalization has improved the recognition rate in all cases, excluding dbn5. We notice that the performance on the test sets of db1 and db2 is weaker than that on the training sets of these datasets, and this must be attributed to the inappropriate implementation of the feature extraction method. The learning rates of the kernel vectors and spread parameters, i.e. η_2 and η_1, are much smaller for the datasets of groups B and C than for the datasets of group A, but the value of η_3 (the learning rate for the weight matrix of the output layer) does not change substantially for datasets from different groups. The initial spread parameters for the datasets of group A are much larger than for the datasets of groups B and C.
(2) Results obtained by the BST algorithm: The first eminent point is the decreased recognition errors on all datasets. The BST algorithm has achieved much better results in a shorter time; especially on db3, db4, db5, dbn3, dbn4, dbn5, db11, and db13 the recognition error has reached zero. For evaluating and ranking these datasets we have two other measures: convergence speed and the number of features. Regarding convergence speed, the best datasets are: db4, db13, db11, dbn4, db3, dbn3, db5, dbn5, although in some cases the differences are too slight to be meaningful; db3, dbn3 and db13 should be ruled out because of the high dimension of their feature vectors. In addition, training does not benefit from data normalization. Also, the BST algorithm solves the overtraining problem. We notice that it has decreased the error rate on the training set, but not at the cost of increased error on the test set. This is even more obvious from the results obtained on db1 and db2, and it can also be verified from the results demonstrated in Table 3.
(3) In Table 3 we have compared the recognition errors at epochs n and N, i.e. at the beginning and the end of the selective training obtained by the BST algorithm, against the recognition errors at the same epochs obtained by the BB algorithm, on four datasets. It shows that after TSSE reaches the threshold value, if we continue training with basic backpropagation the recognition error either decreases very trivially or even increases (e.g. on the test sets of db4 and db11). Some researchers use the cross validation technique to find this point and stop training at it, but we oppose applying the cross validation method to neural network training.
(4) Table 4 shows the results of another experiment performed on db4 with different network settings. The threshold value was set equal to 100, the stability margin equal to 0.1, the learning rate parameters were divided only by two for the second phase of training, and the initial weight matrix was changed. While at epoch number 22 the error rates on the training set and the test set are 26 and 14, respectively, for both algorithms, the BST algorithm has reached zero error on both sets in 23 epochs of selective training, whereas the error rates of the BB algorithm after 100 epochs of unselective training are 6 and 3, respectively, on the training set and test set.


Table 3
Comparing the recognition errors of the BB and the BST algorithms in two points of training

| Database | BB Epoch | BB Error Train | BB Error Test | BST Epoch | BST Error Train | BST Error Test |
|---|---|---|---|---|---|---|
| db1 | 65 | 20 | 29 | 65 | 20 | 29 |
| | 104 | 11 | 25 | 104 | 0 | 17 |
| db4 | 40 | 9 | 4 | 40 | 9 | 4 |
| | 100 | 9 | 6 | 55 | 0 | 0 |
| db8 | 78 | 89 | 41 | 78 | 89 | 41 |
| | 115 | 83 | 40 | 115 | 14 | 7 |
| db11 | 48 | 50 | 26 | 48 | 50 | 26 |
| | 100 | 49 | 28 | 65 | 0 | 0 |

Table 4
Comparing the recognition errors of the RBF network on db4 obtained by the BB and the BST algorithms

| BB Epoch | BB Error Train | BB Error Test | BST Epochs (n, N) | BST Error Train | BST Error Test |
|---|---|---|---|---|---|
| 22 | 26 | 14 | 22, 45 | 0 | 0 |
| 100 | 6 | 3 | 100, 120 | 0 | 0 |

(5) The reader should recall that in the selective training mode, the calculation for weight updating (backpropagation), which is the most time-consuming step of training, is carried out only for misclassified patterns, whose number at the beginning of selective training is less than 89, or 5% of all training samples (see Tables 2 and 3); thus one epoch of selective training is at least five times faster than one epoch of unselective training, and as the number of misclassified patterns decreases over time it becomes faster and faster. Therefore the BST algorithm is at least three times faster than the BB algorithm.
(6) Fig. 3 shows TSSE versus the epoch number, obtained on db4, corresponding to the experiment of Table 4. By changing the training algorithm we face a sudden descent in TSSE, and this must be attributed to the sharp decrease of the learning rate factors. Our explanation for this phenomenon is as follows: after approaching the minimum well, with large values of the learning rate and momentum factor we step over the well; but by decreasing the step size we step into the middle of the minimum well and fall to the bottom. This phenomenon inspired us to devise the BDLRF [17] and BPLRF [19] algorithms.

Fig. 3. Convergence diagrams (TSSE, on a logarithmic scale, versus epoch number) of the RBF network obtained by the BB and the BST algorithms on db4, corresponding to Table 4.

BDLRF and BPLRF are acronyms for "backpropagation with declining learning rate factor" and "backpropagation with plummeting learning rate factor", respectively. In [17,19] we have shown how to speed up training and improve the recognition rate of the MLP by decreasing the learning rate factor. We have also shown that a larger decrease in the values of the training factors can result in a larger decrease in the cost function and a better recognition rate.
(7) In addition, we notice that during training in the second phase, while the recognition error decreases, TSSE increases, which substantiates our statement that our method does not overtrain the network on learned patterns. On the contrary, if the network has been overtrained on some patterns in the first phase, by increasing TSSE and decreasing the recognition error it is de-emphasizing the already overtrained patterns. In other words, a decreased recognition error on unlearned patterns must be the result of a decreased SSE (sum-squared error) on the same patterns. Thus, for TSSE to increase while the recognition error decreases, there has to be an increase in the SSE on the already learned patterns without crossing the stability margin, and this means de-emphasizing the overtrained patterns. Therefore, our method decreases the error on the training set, but not at the cost of overtraining and increased error on the test set.

4.2.4. Considerations

(1) By starting from a different initial point, the number of training epochs will change slightly, but not so much as to affect the general conclusions.
(2) As already mentioned, we considered three types of activity functions for the output layer: sigmoid, linear, and pseudo-linear, and we faced numerous problems


with both the linear and pseudo-linear activity functions, as explained in the following:
• Slow learning: they do not allow for using large learning rate factors. In the case of using a large learning rate, the network will oscillate and will not converge to the minimum.
• High sensitivity to parameter values and a wide range in which the optimal parameter values lie for different datasets. The optimal values change enormously for different datasets (up to 4 orders of magnitude), which makes the parameter tuning procedure an exhausting task.
• Contrary to their superficial simplicity, they need far more computations per iteration (refer to the appendix for their formulations). Thus, they are slower than the sigmoid output from this aspect as well.
• Their recognition errors are higher than that of the sigmoidal activity function.
• The aforesaid problems worsen on the datasets of groups B and C.
• The pseudo-linear activity function has better performance than the linear one, in terms of convergence speed, recognition rate, and sensitivity to learning parameters.
Therefore, output cells with a sigmoid activity function are preferred over the other activity functions, as they result in less sensitivity to learning parameters, faster convergence, and lower recognition error. Although applying the BST algorithm to the RBF network with linear and pseudo-linear outputs does improve their performance, they do not surpass the RBF network with sigmoid output trained with the BST algorithm.
(3) We tried three training paradigms:
• Half training: only the weight matrix of the output layer was under training.
• Half training: the weight matrix of the output layer and the kernel vectors were under training.
• Full training: the weight matrix of the output layer, the kernel vectors and the spread parameters were under training.
If the half-training paradigm is chosen, considering that the kernel vectors will not be in the optimal positions and the spread parameters will not have the optimum values, we have to increase the number of kernel vectors, otherwise the recognition error will increase. In the case of increasing the number of kernel vectors, both training and on-line operation will slow down drastically. On the other hand, if the full-training paradigm is chosen, the number of learning parameters increases to three, that is η_3, η_2 and η_1, adjusting which is an exhausting task. All in all, the full-training paradigm seems to be the most beneficial method.
(4) We adopted two policies for initializing the kernel vectors:
• the first samples,
• the prototype vectors resulting from the k-means algorithm.


Table 5
Comparing performance of the RBF network with different initialization policies

| BB Epoch | BB Error Train | BB Error Test | BST Epochs (n, N) | BST Error Train | BST Error Test | Initial kernel vectors | Initial σ² |
|---|---|---|---|---|---|---|---|
| 22 | 26 | 14 | 22, 45 | 0 | 0 | First samples | 7 |
| 100 | 6 | 3 | 100, 120 | 0 | 0 | | |
| 21 | 25 | 14 | 21, 45 | 0 | 0 | k-means cluster centers | 7 |
| 100 | 9 | 6 | 100, 110 | 0 | 0 | | |
| 27 | 31 | 16 | 27, 45 | 0 | 0 | First samples | From k-means |
| 100 | 24 | 14 | 100, 104 | 0 | 0 | | |
| 38 | 27 | 13 | 38, 78 | 0 | 0 | k-means cluster centers | From k-means |
| 100 | 19 | 9 | 100, 110 | 0 | 0 | | |

Although the last method of initialization does yield faster convergence, the difference between these two types of initialization becomes trivial when the number of kernel vectors is kept small. More precisely, when the number of kernel vectors is kept small, using the second initialization method speeds up convergence only at the very beginning, but in the middle and at the end of training convergence slows down, and the global convergence is not better than that obtained by the first method (see Table 5). Notwithstanding, before training the RBF network we need to run the k-means algorithm anyway to get initial values for the spread parameters, so using the created cluster centers can be done in no time.
(5) A major drawback of the RBF network lies in its size, and therefore its speed. Unlike the MLP, in the RBF network we cannot increase the size of the network incrementally. For instance, in our case study the network size was l_1–32–32; if we had decided to enlarge the network, the next size would have been l_1–64–32, and this means that the network size would be doubled. Considering that speed will decrease by an order larger than one, the improved performance would cost a substantial slow-down in both training and on-line operation, which would make the RBF network unable to compete with other networks, and therefore practically useless. Consequently, usage of the RBF network is not recommended if the number of pattern classes is high.
(6) The best policy for the initial values of the spread parameters is to set equal initial values for all of them, but to change them through training. Considering the kernel vectors, the most important aspect is adjusting them during training. In this case, selecting the first samples or the prototype vectors derived from the k-means algorithm yields similar results (see Table 5).
(7) The data normalization method offered by Wasserman was tried [20, p. 161]. Wasserman suggests, for each component of the input training vectors:
1. Find the standard deviation over the training set.


2. Divide the corresponding component of each training vector by this standard deviation.
Considering that some components of the feature vectors are equal for all patterns except for some from specific classes, the standard deviations of these components are very small, and by dividing these components by their standard deviations their values increase enormously; this attenuates the impact of the other components in the norm of the difference vector ‖x_q − v_m‖ almost to zero. The net effect of Wasserman's normalization method was to destabilize the whole system; a minimal sketch illustrating this is given after this list.
(8) Decreasing the learning parameters during selective training stabilizes the network and speeds up training by preventing repeated overtraining and unlearning on some patterns.
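The effect described in item (7) can be seen in a sketch like the following; the small floor on the standard deviation is an added assumption (not part of Wasserman's recipe) that would be needed to avoid blowing up near-constant components:

```python
import numpy as np

def normalize_by_std(X_train, X_test, eps=1e-6):
    """Divide each component by its standard deviation over the training set.

    Without the eps floor, components that are constant for almost all patterns
    get huge values, which dominates ||x_q - v_m|| exactly as described in the text.
    """
    std = X_train.std(axis=0)
    std = np.maximum(std, eps)      # guard against near-zero standard deviations
    return X_train / std, X_test / std
```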

5. Conclusions

(1) In this paper we presented the BST algorithm, and showed that on the given datasets the BST algorithm improves the performance of RBF networks substantially, in terms of convergence speed and recognition error. The BST algorithm achieves much better results in a shorter time. It solves three drawbacks of the backpropagation algorithm: overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns. In addition, it has the advantages of shortening training time (up to three times) and partially de-emphasizing overtrained patterns.
(2) As there is no universally effective method, the BST algorithm is no exception and has its own shortcoming. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system. But it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, when a sufficient number of proper features is extracted.
(3) The best training paradigm is full training, because it utilizes all the capacity of the network. Using the sigmoidal activity function for the neurons of the output layer is recommended, because it results in less sensitivity to learning parameters, faster convergence and lower recognition error.

Acknowledgements

This work has been partially supported within the framework of the Slovenian–Iranian Bilateral Scientific Cooperation Treaty. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of the paper. We would also like to thank Alan McConnell Duff for linguistic revision.


Appendix A. Error gradients of the RBF network

Let the network have the configuration depicted in Fig. 1, and

x : input vector (= [x_1, x_2, ..., x_{l_1}]^T)
l_1 : dimension of the input vector
l_2 : number of neurons in the hidden layer
v_m : prototype vector corresponding to the mth hidden cell (= [v_{1m}, v_{2m}, ..., v_{l_1 m}]^T)
V : matrix of prototype vectors (= [v_1, v_2, ..., v_{l_2}])
y_m : output of the mth hidden cell
l_3 : dimension of the output vector
u_j : weight vector of the jth output cell (= [u_{1j}, u_{2j}, ..., u_{l_2 j}]^T)
U : weight matrix of the output layer (= [u_1, u_2, ..., u_{l_3}])
z_j : actual output of the jth output cell
t_j : desired output of the jth output cell
Q : number of training patterns

Let TSSE be the cost function defined as

$$\mathrm{TSSE} = \sum_q E^q, \qquad E^q = \sum_k (t_k^q - z_k^q)^2 \qquad (q = 1, \ldots, Q), \qquad (A.1)$$

and let E be the simplified notation for E^q:

$$E = \sum_j (t_j - z_j)^2. \qquad (A.2)$$

We will calculate error gradients for pattern mode training. Obtaining error gradients for batch mode training is straightforward, as explained in the remark at the end of this appendix. We will consider three types of activity functions for the output cells:

$$z_j = \begin{cases} \dfrac{1}{1+e^{-s_j}}, & \text{sigmoid;} \\[2mm] \dfrac{s_j}{l_2}, & \text{linear, with } \dfrac{1}{l_2} \text{ squashing function;} \\[2mm] \dfrac{s_j}{\sum_m y_m}, & \text{pseudo-linear, with } \dfrac{1}{\sum_m y_m} \text{ squashing function,} \end{cases} \qquad (A.3)$$

where

$$s_j = \sum_m y_m u_{mj} \qquad (A.4)$$

and

$$y_m = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^2}\right). \qquad (A.5)$$


A.1. Part 1: Error gradients versus weights of the output layer

By using the chain rule for derivatives we get

$$\frac{\partial E}{\partial u_{mj}} = \underbrace{\frac{\partial E}{\partial z_j}}_{I}\;\underbrace{\frac{\partial z_j}{\partial s_j}}_{II}\;\underbrace{\frac{\partial s_j}{\partial u_{mj}}}_{III}, \qquad (A.6)$$

and we will calculate all three terms in the three cases.

Computing (I): The first term is the same for all three cases, i.e.

$$I = \frac{\partial E}{\partial z_j} = -2(t_j - z_j) \qquad \text{for cases (1--3).} \qquad (A.7)$$

Computing (II):

$$II = \frac{\partial z_j}{\partial s_j} = \begin{cases} \dfrac{1}{l_2}, & \text{for case (1);} \\[2mm] \dfrac{1}{\sum_m y_m}, & \text{for case (2),} \end{cases} \qquad (A.8)$$

and for the third case (sigmoid output) we have

$$II = \frac{\partial z_j}{\partial s_j} = \frac{e^{-s_j}}{(1+e^{-s_j})^2} = \frac{1+e^{-s_j}-1}{(1+e^{-s_j})^2} = z_j - z_j^2 = z_j(1-z_j). \qquad (A.9)$$

Computing (III): The third term is identical for all cases:

$$III = \frac{\partial s_j}{\partial u_{mj}} = y_m. \qquad (A.10)$$

By putting all partial results together we obtain

$$\frac{\partial E}{\partial u_{mj}} = \begin{cases} -2(t_j - z_j)\dfrac{y_m}{l_2}, & \text{case (1);} \\[2mm] -2(t_j - z_j)\dfrac{y_m}{\sum_m y_m}, & \text{case (2);} \\[2mm] -2(t_j - z_j)\,z_j(1-z_j)\,y_m, & \text{case (3).} \end{cases} \qquad (A.11)$$
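A small sketch (hypothetical names, assuming a single training pattern) of Eq. (A.11) for the three output types:

```python
import numpy as np

def grad_output_weights(y, z, t, case):
    """dE/du_{mj} of Eq. (A.11) for one training pattern (sketch).

    y : (l2,) hidden activities, z : (l3,) outputs, t : (l3,) targets
    case : 1 = linear, 2 = pseudo-linear, 3 = sigmoid
    """
    delta = -2.0 * (t - z)                      # Eq. (A.7)
    if case == 1:
        factor = delta / y.shape[0]             # times 1/l_2, Eq. (A.8)
    elif case == 2:
        factor = delta / y.sum()                # times 1/sum_m y_m, Eq. (A.8)
    else:
        factor = delta * z * (1.0 - z)          # times z_j(1 - z_j), Eq. (A.9)
    return np.outer(y, factor)                  # (l2, l3) gradient matrix
```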

A.2. Part 2: Error gradients versus components of prototype vectors

We have

$$\frac{\partial E}{\partial v_{im}} = \sum_j \underbrace{\frac{\partial E}{\partial z_j}}_{I}\;\underbrace{\frac{\partial z_j}{\partial y_m}}_{II}\;\underbrace{\frac{\partial y_m}{\partial v_{im}}}_{III}. \qquad (A.12)$$

Computing (I): For all three cases we have

$$I = \frac{\partial E}{\partial z_j} = -2(t_j - z_j). \qquad (A.13)$$

Computing (II): For case (1),

$$z_j = \frac{s_j}{l_2}, \qquad II = \frac{\partial z_j}{\partial y_m} = \frac{u_{mj}}{l_2}. \qquad (A.14)$$

For case (2),

$$z_j = \frac{s_j}{\sum_m y_m} = \frac{\sum_m y_m u_{mj}}{\sum_m y_m}; \qquad (A.15)$$

since m is a dummy variable we can change it to k:

$$z_j = \frac{\sum_k y_k u_{kj}}{\sum_k y_k}, \qquad (A.16)$$

$$II = \frac{\partial z_j}{\partial y_m} = \frac{u_{mj}\sum_k y_k - \sum_k y_k u_{kj}}{\left(\sum_k y_k\right)^2} = \frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}. \qquad (A.17)$$

For case (3),

$$\frac{\partial z_j}{\partial y_m} = \underbrace{\frac{\partial z_j}{\partial s_j}}_{IV}\;\underbrace{\frac{\partial s_j}{\partial y_m}}_{V}; \qquad (A.18)$$

considering

$$z_j = \frac{1}{1+e^{-s_j}} \qquad (A.19)$$

and

$$s_j = \sum_m u_{mj} y_m, \qquad (A.20)$$

we have

$$IV = z_j(1-z_j) \qquad (A.21)$$

and

$$V = u_{mj}. \qquad (A.22)$$

Then, by putting the partial derivatives together, we obtain

$$II = \frac{\partial z_j}{\partial y_m} = z_j(1-z_j)\,u_{mj}. \qquad (A.23)$$

Computing (III): For all three cases we have

$$y_m = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^2}\right) = \exp\!\left(-\frac{\sum_i (x_i - v_{im})^2}{2\sigma_m^2}\right), \qquad (A.24)$$

then

$$III = \frac{\partial y_m}{\partial v_{im}} = y_m\left(\frac{x_i - v_{im}}{\sigma_m^2}\right). \qquad (A.25)$$

By putting the partial results together we have

$$\frac{\partial E}{\partial v_{im}} = \begin{cases} -2\displaystyle\sum_j (t_j - z_j)\,\frac{u_{mj}}{l_2}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (1);} \\[3mm] -2\displaystyle\sum_j (t_j - z_j)\,\frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (2);} \\[3mm] -2\displaystyle\sum_j (t_j - z_j)\,z_j(1-z_j)\,u_{mj}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (3).} \end{cases} \qquad (A.26)$$

A.3. Part 3: Error gradients versus spread parameters

We have

$$\frac{\partial E}{\partial \sigma_m^2} = \sum_j \underbrace{\frac{\partial E}{\partial z_j}}_{I}\;\underbrace{\frac{\partial z_j}{\partial y_m}}_{II}\;\underbrace{\frac{\partial y_m}{\partial \sigma_m^2}}_{III}; \qquad (A.27)$$

terms I and II are exactly as in Part 2, therefore we only need to calculate the third term, which has an identical formulation in all three cases:

$$y_m = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^2}\right) \qquad (A.28)$$

and

$$III = \frac{\partial y_m}{\partial \sigma_m^2} = y_m\,\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^4}. \qquad (A.29)$$

By putting the partial results together we obtain

$$\frac{\partial E}{\partial \sigma_m^2} = \begin{cases} -2\displaystyle\sum_j (t_j - z_j)\,\frac{u_{mj}}{l_2}\,y_m\,\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (1);} \\[3mm] -2\displaystyle\sum_j (t_j - z_j)\,\frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}\,y_m\,\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (2);} \\[3mm] -2\displaystyle\sum_j (t_j - z_j)\,z_j(1-z_j)\,u_{mj}\,y_m\,\frac{\|\mathbf{x}-\mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (3).} \end{cases} \qquad (A.30)$$

Remark. All the above formulas have been calculated for pattern mode training. For batch mode training, only the sum $\sum_{q=1}^{Q}$ should be added in front of Eqs. (A.11), (A.26) and (A.30), i.e. all partial results should be summed up over the training patterns. Our experience shows that batch mode training is much slower than pattern mode training, in addition to its implementation intricacy and high memory demand.
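For the sigmoid output case (case 3), the pattern-mode gradients of Eqs. (A.11), (A.26) and (A.30) can be sketched as follows; the sigmoid prime offset of footnote 1 is applied only in the output-layer gradient, as the text suggests, and all names are illustrative assumptions rather than the authors' code:

```python
import numpy as np

def rbf_gradients_sigmoid(x, t, V, sigma2, U, prime_offset=0.1):
    """Pattern-mode error gradients for the sigmoid output case (case 3), as a sketch.

    x : (l1,) input, t : (l3,) targets, V : (l1, l2), sigma2 : (l2,), U : (l2, l3)
    Returns dE/dU, dE/dV, dE/dsigma2 of Eqs. (A.11), (A.26) and (A.30).
    """
    d = x[:, None] - V                              # (l1, l2), components x_i - v_im
    d2 = np.sum(d ** 2, axis=0)                     # ||x - v_m||^2
    y = np.exp(-d2 / (2.0 * sigma2))                # Eq. (A.5)
    s = y @ U                                       # Eq. (A.4)
    z = 1.0 / (1.0 + np.exp(-s))                    # sigmoid outputs

    delta = -2.0 * (t - z)                          # dE/dz_j, Eq. (A.13)
    # Eq. (A.11), case 3, with the sigmoid prime offset added to z(1 - z)
    dE_dU = np.outer(y, delta * (z * (1.0 - z) + prime_offset))
    # Common factor sum_j dE/dz_j * dz_j/dy_m, used in Eqs. (A.26) and (A.30)
    g = U @ (delta * z * (1.0 - z))                 # (l2,)
    dE_dV = d * (g * y / sigma2)                    # Eq. (A.26), case 3
    dE_dsigma2 = g * y * d2 / (2.0 * sigma2 ** 2)   # Eq. (A.30), case 3
    return dE_dU, dE_dV, dE_dsigma2
```

These gradients plug directly into the updates of Eqs. (5)–(7), with the learning rates η_3, η_2 and η_1 applied to dE/dU, dE/dV and dE/dσ², respectively.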


Appendix B. Feature extraction methods

B.1. Group A: db1–db5

These datasets were created by extracting principal components with a single-layer feedforward linear neural network trained with the Generalized Hebbian Algorithm (GHA) [9,7], as summarized in the following. For extracting the principal components, we used an m-l single-layer feedforward linear network with the GHA.

db1: To train the network, 8 × 8 non-overlapping blocks of the image of every character were considered as input vectors. The image was scanned from top left to bottom right and l was set equal to 8. Therefore, for every character 72 features were extracted. The training was performed with 34 samples per character and the learning rate was set to η = 7 × 10⁻³. To give the network enough time to learn the statistics of the data, the training procedure was repeated for three epochs.
db2: The same as db1, but l was set equal to 6; thus for any character 54 features were extracted.
db3: The image matrix of any character was converted into a vector by scanning vertically from top left to bottom right; then this vector was partitioned into 9 vectors, which were inserted into the network as 9 input vectors. In this way, 72 features were extracted for every character.
db4: Similar to db1, but the dimension of the input blocks was considered to be 24 × 3, i.e. every three rows were considered as one input vector. In this way, for any character 64 features were extracted.
db5: Similar to db4, but l was set equal to 6; thus for any character 48 features were extracted.

B.2. Group B: dbn1–dbn5

These datasets are normalized versions of db1–db5. After creating any dataset, the feature vectors were normalized by mapping the ith component of all the vectors into the interval [0, 1].

B.3. Group C: db6–db13

db6: This dataset was created by the zoning method. Each character image was divided into four overlapping squares and the percentage of black pixels in each square was obtained. The size of the overlap was set to two pixels on each edge, which yields the best recognition rate. The best recognition rate on this dataset does not exceed 53%, so its features were used only in combination with other features.
db7: Pixel change coding was used to extract the feature vectors of this dataset.
db8: The feature vectors of this dataset were extracted by vertical and horizontal projections.


db9: The feature vectors of this dataset were extracted by diagonal projection. Ten components from the beginning and seven components from the end were deleted, because their values were zero for all characters. The best recognition rate on this dataset does not reach 85%, so its features were used only in combination with other features.
db10: The feature vectors of this dataset were created by concatenating the feature vectors of db8 and db9.
db11: The feature vectors of this dataset were created by concatenating the feature vectors of db6 and db7.
db12: The feature vectors of this dataset were created by concatenating the feature vectors of db6 and db8.
db13: The feature vectors of this dataset were created by concatenating the feature vectors of db11 and some selected features from db8, that is, 10 features from the middle of both the vertical and horizontal projections.

References

[1] M.A. Aizerman, E.M. Braverman, L.I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control 25 (1964) 821–837.
[2] S. Amari, N. Murata, K.R. Muller, M. Finke, H.H. Yang, Asymptotic statistical theory of overtraining and cross-validation, IEEE Trans. Neural Networks 8 (5) (1997) 985–996.
[3] T. Andersen, T. Martinez, Cross validation and MLP architecture selection, Proceedings of the International Joint Conference on Neural Networks, IJCNN'99, Cat. No. 99CH36339, Vol. 3 (part 3), 1999, pp. 1614–1619.
[4] O.A. Bashkirov, E.M. Braverman, I.B. Muchnik, Potential function algorithms for pattern recognition learning machines, Automat. Remote Control 25 (1964) 629–631.
[5] D.S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[6] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with application in pattern recognition, IEEE Trans. Electron. Comput. EC-14 (1965) 326–334.
[7] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York, 1996.
[8] S.C. Fahlman, An empirical study of learning speed in backpropagation networks, Technical Report CMU-CS-88-162, Carnegie Mellon University, Pittsburgh, PA 15213, September 1988.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, USA, 1999.
[10] G. Jondarr, Backpropagation family album, Technical Report, Department of Computing, Macquarie University, New South Wales, August 1996.
[11] C.G. Looney, Pattern Recognition Using Neural Networks, Oxford University Press, New York, 1997.
[12] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133.
[13] D.B. Parker, Learning logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, March 1982.
[14] L. Prechelt, Automatic early stopping using cross validation: quantifying the criteria, Neural Networks 11 (4) (1998) 761–767.
[15] D.E. Rumelhart, J.L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[16] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, USA, 1999.
[17] M.T. Vakil-Baghmisheh, N. Pavešić, Backpropagation with declining learning rate, Proceedings of the 10th Electrotechnical and Computer Science Conference, Portorož, Slovenia, Vol. B, September 2001, pp. 297–300.


[18] M.T. Vakil-Baghmisheh, Farsi character recognition using artificial neural networks, Ph.D. Thesis, Faculty of Electrical Engineering, University of Ljubljana, Slovenia, October 2002.
[19] M.T. Vakil-Baghmisheh, N. Pavešić, A fast simplified fuzzy ARTMAP network, Neural Process. Lett., 2003, in press.
[20] P.D. Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York, 1993.
[21] P.J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, MA, 1974.

Mohammad-Taghi Vakil-Baghmisheh was born in 1961 in Tabriz, Iran. He received his B.Sc. and M.Sc. degrees in electronics from Tehran University in 1987 and 1991. In 2002, he received his Ph.D. degree from the University of Ljubljana, Slovenia, with a dissertation on neural networks in the Faculty of Electrical Engineering.

Nikola Pavešić was born in 1946. He received his B.Sc. degree in electronics, M.Sc. degree in automatics, and Ph.D. degree in electrical engineering from the University of Ljubljana, Slovenia, in 1970, 1973 and 1976, respectively. Since 1970 he has been a staff member at the Faculty of Electrical Engineering in Ljubljana, where he is currently head of the Laboratory of Artificial Perception, Systems and Cybernetics. His research interests include pattern recognition, neural networks, image processing, speech processing, and information theory. He is the author and co-author of more than 100 papers and 3 books addressing several aspects of the above areas. Professor Nikola Pavešić is a member of IEEE, the Slovenian Association of Electrical Engineers and Technicians (Meritorious Member), the Slovenian Pattern Recognition Society, and the Slovenian Society for Medical and Biological Engineers. He is also a member of the editorial boards of several technical journals.

Nikola Pave0si1c was born in 1946. He received his B.Sc. degree in electronics, M.Sc. degree in automatics, and Ph.D. degree in electrical engineering from the University of Ljubljana, Slovenia, in 1970,1973 and 1976, respectively. Since 1970 he has been a staE member at the Faculty of Electrical Engineering in Ljubljana, where he is currently head of the Laboratory of Arti:cial Perception, Systems and Cybernetics. His research interests include pattern recognition, neural networks, image processing, speech processing, and information theory. He is the author and co-author of more than 100 papers and 3 books addressing several aspects of the above areas. Professor Nikola Pave*si+c is a member of IEEE, the Slovenian Association of Electrical Engineers and Technicians (Meritorious Member), the Slovenian Pattern Recognition Society, and the Slovenian Society for Medical and Biological Engineers. He is also a member of the editorial boards of several technical journals.