Neural Networks, Vol. 10, No. 1, pp. 61–68, 1997
Copyright © 1996 Elsevier Science Ltd. All rights reserved. Printed in Great Britain.
0893-6080/97 $17.00 + .00
Pergamon PII: S0893-6080(96)00071-8
CONTRIBUTED ARTICLE
Training Feedforward Neural Networks: An Algorithm Giving Improved Generalization

CHARLES W. LEE

Bolton Institute

(Received 3 May 1994; accepted 13 March 1996)
Abstract—An algorithm is derived for supervised training in multilayer feedforward neural networks. Relative to the gradient descent backpropagation algorithm it appears to give both faster convergence and improved generalization, whilst preserving the system of backpropagating errors through the network. Copyright © 1996 Elsevier Science Ltd.
Keywords—Neural networks, Learning, Generalization, Backpropagation.

1. INTRODUCTION

One of the principal concerns in supervised training of neural networks is to obtain good generalization to new input patterns from the patterns on which the network has been trained. It is usually desirable that the network should, for example, give the same response as the one it was trained to give for a particular pattern when presented with previously unseen patterns which are in some sense nearby. Similarly, if the network is trained to give the same response for a group of patterns which have some common feature which distinguishes them from the other input patterns, then the network should learn to respond to that common feature. The process of supervised training of networks is one of teaching the network to discriminate between input patterns and associate responses with them. Networks are not directly trained to generalize and so, although improvements in discrimination can usually (Wang & Malakooti, 1993) be obtained by increased training times, improvements in generalization must be sought by selecting different training algorithms or network architectures. This problem has been widely studied (Burkitt, 1991; Cottrell, Girard, Girard & Mangeas, 1993; Drucker & Le Cun, 1992; Fahlman & Lebiere, 1990; Holmstrom & Koistinen, 1992; Kamimura, 1993; Karnin, 1990; Kruschke, 1989; Lendaris & Harls, 1990; Romaniuk, 1993).

It is known that with, for example, the gradient descent backpropagation algorithm (Rumelhart & McClelland, 1986), generalization is better in smaller networks (Baum & Haussler, 1989; Schiffman, Joost, & Werner, 1993). This is because the shortage of units forces the algorithm to develop general rules to discriminate between the input patterns, whereas otherwise it would tend to learn each item of data as a special case, in which circumstances extending training time to improve discrimination will only have a detrimental effect on generalization. Unfortunately it can be difficult to select the optimum size for a network without knowing in advance the rules that the network is going to abstract from the data. A further problem is that in smaller networks training times are usually longer, and failures to learn to discriminate are more frequent, so improvements in one process must be paid for by a deterioration in the other. These difficulties have led to the development of a number of algorithms which progressively reduce the size of a network, or allow some of the connection weights to decay to zero, enabling training to be started with a network which may be larger than the optimal size (Baum & Haussler, 1989; Burkitt, 1991; Cottrell et al., 1993; Kamimura, 1993; Karnin, 1990; Romaniuk, 1993). There are, however, obvious difficulties in trying to select the correct neurons or connections to remove from a network which has already learned an inappropriate set of rules. Removing connections which are almost inactive is unlikely to improve the ability of the network to generalize, whereas removing active connections may necessitate considerable retraining. If it can be achieved, it seems likely that a more fruitful approach might be to
Requests for reprints should be sent to C. W. Lee, Mathematics Division, Bolton Institute, Deane Road, Bolton BL3 5AB, UK; e-mail: charles@bolton.ac.uk
devise methods which take an inactive network as the starting point and encourage the growth of connection weights only where necessary. The purpose of this paper is to describe an algorithm which will accept almost-zero values for the connection weights at the beginning of training. The algorithm has been developed from one described elsewhere (Lee, 1993). The derivation of the algorithm is given in the next section, together with a specification of the procedure required for calculating the adjustments to the connection weights. The third section of the paper contains the results of the computer simulations used to test the algorithm.

2. THE DERIVATION OF THE ALGORITHM

In Lee (1993) an algorithm was described for supervised training in multilayer feedforward networks which gives significantly faster convergence than the gradient descent backpropagation algorithm, whilst preserving the system of backpropagating errors through the network (Rumelhart & McClelland, 1986). This tangent plane backpropagation algorithm treats each training value as a constraint which defines a surface in the weight space. The weights are then adjusted by moving to the foot of the perpendicular from the present position in the weight space to the tangent plane to this surface, taken at a convenient nearby point. Although this tangent plane algorithm has good convergence properties, it is no better than the gradient descent method at generalizing from the training set presented. At first sight an attractive possibility for improving generalization would be to start the training with all of the connection weights close to zero, in the expectation that the training algorithm might activate only the minimum number of weights necessary. Unfortunately the two algorithms mentioned above will often fail to learn at all under these starting conditions, particularly in networks with several hidden layers, and it is in these networks that generalization is found to be the best.
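The failure at near-zero weights can be seen directly in the size of the backpropagated gradients: with all weights close to zero, the gradient with respect to the final unit's bias is of the order of the output error, while gradients deeper in a tanh network are attenuated by every near-zero weight layer they pass through. A minimal numerical illustration (the layer sizes, seed and initialization range here are my choices, not taken from the paper):

```python
import math
import random

random.seed(0)

# Illustrative network: 10 inputs, two hidden layers of 5 units, one output.
sizes = [10, 5, 5, 1]
# Near-zero initial weights; the last entry of each row is the bias weight.
W = [[[random.uniform(-0.01, 0.01) for _ in range(sizes[l] + 1)]
      for _ in range(sizes[l + 1])] for l in range(len(sizes) - 1)]

x = [random.choice([-0.5, 0.5]) for _ in range(sizes[0])]
t_k = 0.9

# Forward pass: phi_j = sum_i w_ji * o_i (with the constant unit o_0 = 1),
# o_j = tanh(phi_j).
outputs = [x]
for layer in W:
    prev = outputs[-1] + [1.0]
    outputs.append([math.tanh(sum(w * o for w, o in zip(row, prev)))
                    for row in layer])

# Backward pass for E = (t_k - o_k)^2 / 2, using f'(phi) = 1 - tanh(phi)^2.
deltas = [[(outputs[-1][0] - t_k) * (1 - outputs[-1][0] ** 2)]]
for l in range(len(W) - 1, 0, -1):
    nxt = deltas[0]
    deltas.insert(0, [(1 - outputs[l][j] ** 2) *
                      sum(W[l][m][j] * nxt[m] for m in range(len(nxt)))
                      for j in range(sizes[l])])

grad_output_bias = deltas[-1][0] * 1.0   # dE/d(bias weight of the final unit)
grad_first_layer = deltas[0][0] * x[0]   # dE/d(one first-hidden-layer weight)
print(abs(grad_output_bias), abs(grad_first_layer))
```

The first gradient is of the order of the output error, while the second has passed through two layers of weights of size at most 0.01 and is smaller by several orders of magnitude, so both algorithms at first adjust little beyond the final unit's bias.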
In the case of the tangent plane algorithm the failure to learn is because the axis in the weight space defined by the weight from the constant output, or bias, unit to the final unit is approximately perpendicular to all of the constraint surfaces when the weights are almost zero. The algorithm thus gives movement up and down this axis, trying to satisfy the training data by only adjusting the bias of the final unit. As the gradient descent algorithm gives movement in the same direction in weight space, but with different step size, it will behave in the same way. There remains, however, the possibility of finding a backpropagation algorithm which will be able to accept almost-zero initial conditions and which will be able to move away from the origin in a direction indicated by the training data.

FIGURE 1. Movement from the present position, a, in the weight space to the point d. The point d lies on the tangent plane to the surface φ_k = f⁻¹(t_k) determined by the training data, and ad is inclined away from the perpendicular to this plane at a small angle β in the direction away from the origin O.

The tangent plane algorithm is a useful starting point for this, as instead of determining a single direction in which to move on being presented with an item of training data, it determines an (n − 1) plane of suitable points to move to, within the weight space Rⁿ. The original tangent plane algorithm moves to the foot of the perpendicular from the present position to this plane, on the expectation that the smaller the distance moved the less disturbance there will be to previous learning. The efficiency of the algorithm should not be impaired, however, if a nearby point is chosen, close to the foot of the perpendicular but displaced somewhat in the direction away from the origin. The second tangent plane algorithm, which is described here, uses this idea.

As in Lee (1993), a multilayer feedforward network of units (u_j) will be assumed, where the connection from u_i to u_j is regulated by a weight, w_ji. φ_j and o_j will be the input and output of u_j, so o_j = f(φ_j) for some function f, and φ_j = Σ_i w_ji o_i. All simulation results given here have been calculated using f(x) = tanh(x). The single output unit u_k will be trained to mimic a teaching variable t_k. u_0 will be a constant output unit, with o_0 = 1. Let n be the total number of weights in the network, and let a ∈ Rⁿ represent the current state of the weights (see Figure 1). For a given set of inputs, we can consider φ_k to be a function of the weights, φ_k: Rⁿ → R. The first tangent plane algorithm adjusts the weights by moving to the foot of the perpendicular from a to the tangent plane to the (n − 1) surface φ_k = f⁻¹(t_k), taken at the point where the line from a parallel to the w_k0 axis meets this surface. The algorithm derived here uses essentially the same procedure but, instead of moving to the foot of the perpendicular to the tangent plane, moves to the foot of a line inclined at an angle β to the perpendicular, on the side of the perpendicular away from the origin. Some parts of the following derivation are given in more detail in Lee (1993).

Let the current values of the weights be w′_ji, so, if ĵ_ji is the unit vector in the direction of the w_ji axis,

    a = Σ_{j,i} w′_ji ĵ_ji.    (1)

Use the equation

    w″_k0 = f⁻¹(t_k) − Σ_{i≠0} w′_ki o_i    (2)

to find a value, w″_k0, for w_k0 from the values, w′_ki, of the other weights. Then

    f⁻¹(t_k) = w″_k0 o_0 + Σ_{i≠0} w′_ki o_i,

so the surface φ_k = f⁻¹(t_k) contains the point

    b = w″_k0 ĵ_k0 + Σ_{(j,i)≠(k,0)} w′_ji ĵ_ji.    (3)

Let n̂ be the unit normal to this surface at b, so n̂ = ∇φ_k/|∇φ_k|, where ∂φ_k/∂w_ji = x_j o_i and x_j is calculated by the backpropagation rule

    x_k = 1,    x_j = f′(φ_j) Σ_{m∈M_j} w_mj x_m,

where {u_m : m ∈ M_j} is the set of units to which u_j passes its output. If c is the foot of the perpendicular from a to the tangent plane at b,

    c − a = ((b − a) · n̂) n̂.    (4)

The vector parallel to the tangent plane and directed away from the origin at a is

    â = a − (a · n̂) n̂.

Thus if d is the point of intersection with the tangent plane of a line from a inclined at an angle β to the perpendicular and on the opposite side of the perpendicular from the origin,

    d − a = (d − c) + (c − a).    (5)

Now, from eqns (1) and (3), b − a = (w″_k0 − w′_k0) ĵ_k0, and if we use o′_j for the current values of o_j, so that w′_k0 = f⁻¹(o′_k) − Σ_{i≠0} w′_ki o_i, then using also eqn (2) we get

    b − a = (f⁻¹(t_k) − f⁻¹(o′_k)) ĵ_k0.

Let δ = f⁻¹(t_k) − f⁻¹(o′_k), the error in the input to the final unit. Then, from eqn (4),

    c − a = δ (n̂ · ĵ_k0) n̂.

Since o_0 = 1 and x_k = 1, the k0-component of ∇φ_k is 1, so n̂ · ĵ_k0 = 1/√r with r = |∇φ_k|². Using α = tan β, we obtain from eqn (5) the amount that must be added to each w_ji:

    Δw_ji = (δ/r) γ_ji + α (|δ|/√r)(w′_ji − (s/r) γ_ji)/t,

where γ_ji = x_j o_i, r = Σ_{l,m} γ²_lm, s = Σ_{l,m} γ_lm w′_lm, and t = |a − (a · n̂)n̂|.
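The geometry of the move defined by eqn (5) can be sketched numerically as follows. The variable names are mine, and the sketch uses the simplification that the w_k0 component of the gradient of φ_k equals 1 (which holds because o_0 = 1 and x_k = 1):

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def second_tangent_plane_step(a, g, delta, alpha):
    """Step from the weight vector `a` toward the surface phi_k = f^-1(t_k).

    `g` is grad phi_k (components x_j * o_i from backpropagation) and `delta`
    is the input error f^-1(t_k) - f^-1(o_k).  With the w_k0 component of `g`
    equal to 1, the perpendicular move to the tangent plane is (delta / |g|^2) g;
    the result is then tilted away from the origin by the angle whose tangent
    is `alpha`.
    """
    r = dot(g, g)
    # Foot of the perpendicular: c - a = (delta / r) g.
    perp = [delta * gi / r for gi in g]
    # Component of a parallel to the tangent plane, directed away from the origin.
    s = dot(a, g)
    a_par = [ai - (s / r) * gi for ai, gi in zip(a, g)]
    t = math.sqrt(dot(a_par, a_par))
    if t == 0.0:
        return perp                     # a is perpendicular to the plane
    # Tilt by angle beta (alpha = tan beta): add tan(beta) * |c - a| * a_par / |a_par|.
    scale = alpha * abs(delta) / math.sqrt(r) / t
    return [p + scale * ap for p, ap in zip(perp, a_par)]

# Hypothetical values for illustration; g[0] plays the role of the w_k0 component.
a = [0.5, -0.2, 0.1]
g = [1.0, 0.3, -0.4]
step = second_tangent_plane_step(a, g, delta=0.25, alpha=0.01)
```

Whatever the value of alpha, the new point a + step still lies on the tangent plane, since the tilt is applied parallel to the plane: dot(step, g) equals delta.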
The constant α, giving the tangent of the angle between the movement vector and the perpendicular to the tangent plane, requires setting to an appropriate value. Tests showed, however, that the performance of the algorithm is not particularly sensitive to the value of α chosen. The only difficulty associated with α is the possibility that large values may result in too great a movement away from the origin, giving average weight sizes large enough to render the output of the units effectively constant, resulting in very slow convergence. This was avoided in the tests of the algorithm by calculating the average, w, of the absolute values of the weight sizes at each presentation of a set of data, and multiplying α by (1 − w). This has the effect of reducing the movement away from the origin as w nears 1, and reversing the movement back towards the origin if w becomes larger than 1. This may give the algorithm an additional advantage over the first tangent plane algorithm and gradient descent backpropagation, as although there is no tendency towards large weights inherent in these methods, there is equally no mechanism for reducing them when they do occur. It may be that it would be better to drive w towards some value different from 1, particularly in networks where most of the connections are expected to be inactive. This has not been investigated.

Most of the computer simulation results given below were generated using the algorithm in this form. However, as a result of the experience gained on these tests, three adjustments were made to the algorithm before it was used for the larger data sets, where computation times can be very long. The most important of the three changes was aimed solely at reducing computational complexity. The inclusion of these adjustments was found to make only a minor difference to the results obtained with the simpler data sets. The first adjustment to the algorithm was the
replacement of the term t = |a − (a · n̂)n̂| by |a|. The original form of t is computationally expensive to calculate because it requires a separate loop through the set of weights, and replacing it by |a| does not significantly affect the behaviour of the algorithm. |a| is greater than or equal to |a − (a · n̂)n̂|, with equality holding when a is perpendicular to n̂. Its use will result in a reduction in the size of the movement away from the origin, but this has in any case been scaled by the constant α. The reduction will be greatest when a is roughly perpendicular to the tangent plane, and as it is in these circumstances that 'the direction parallel to the tangent plane away from the origin' is most sensitive to small changes of position, it seems that the adjustment is unlikely to decrease performance, and may possibly improve it.

The final layer of weights grows rapidly from almost-zero initial conditions without any induced movement away from the origin. A second, smaller, saving in computation time can therefore be made by only using the component of a which is perpendicular to the subspace of Rⁿ defined by these weights. If the average of the absolute values of the weight sizes, w, is also only calculated over the remaining weights then its use in scaling α is improved, because otherwise its size in the early stages of training is determined almost exclusively by the final layer weights.

The inclusion of a tendency to move away from (or towards) the origin can be a disadvantage in the later stages of training. In cases where the surfaces determined by two of the items of training data are nearly parallel, the network will zig-zag between these surfaces as it approaches the solution. Changing the direction of movement by an angle of similar size to the angle between the planes may be enough to destroy convergence. This effect can be eliminated by reducing α to zero when the error falls below a given level. This adjustment can also be used to give a saving in computation time.
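Putting the derivation and these three adjustments together, one presentation of one pattern (the procedure specified step by step below) might be sketched as follows. All variable names, the `error` argument used to switch α off, and the handling of the bias columns are my assumptions, not code from the paper:

```python
import math
import random

def train_step(W, x, t_k, alpha, eps, error):
    """One presentation of one pattern for the second tangent plane algorithm.

    W[l][j] holds the weights into unit j of layer l+1; the last entry of each
    row is the weight from the constant bias unit (output 1).  `error` is the
    current total error, used to decide whether alpha is switched off.
    """
    # Forward pass with f(x) = tanh(x).
    outputs = [list(x)]
    for layer in W:
        prev = outputs[-1] + [1.0]
        outputs.append([math.tanh(sum(w * o for w, o in zip(row, prev)))
                        for row in layer])
    o_k = outputs[-1][0]

    # (1) x_j by backpropagation: x_k = 1, f'(phi_j) = 1 - o_j**2.
    xs = [[1.0]]
    for l in range(len(W) - 1, 0, -1):
        nxt = xs[0]
        xs.insert(0, [(1 - outputs[l][j] ** 2) *
                      sum(W[l][m][j] * nxt[m] for m in range(len(nxt)))
                      for j in range(len(outputs[l]))])

    # (2) gamma_ji = x_j * o_i (o_i includes the bias unit's constant 1).
    gamma = [[[xs[l][j] * o for o in outputs[l] + [1.0]]
              for j in range(len(W[l]))] for l in range(len(W))]

    # (3) r over all weights; (4) s and t over weights not in the final layer.
    r = sum(g * g for layer in gamma for row in layer for g in row)
    s = sum(g * w for gl, wl in zip(gamma[:-1], W[:-1])
            for gr, wr in zip(gl, wl) for g, w in zip(gr, wr))
    t = math.sqrt(sum(w * w for layer in W[:-1] for row in layer for w in row))

    # delta = tanh^-1(t_k) - phi_k; alpha' = (1 - w_bar) * alpha while the
    # error exceeds eps (w_bar: mean |weight|, final layer excluded).
    delta = math.atanh(t_k) - math.atanh(o_k)
    n_early = sum(len(row) for layer in W[:-1] for row in layer)
    w_bar = sum(abs(w) for layer in W[:-1] for row in layer for w in row) / n_early
    alpha_p = (1 - w_bar) * alpha if error > eps else 0.0

    # (5) Perpendicular step for every weight, plus the tilted away-from-origin
    # component for weights outside the final layer.
    for l, layer in enumerate(W):
        for j, row in enumerate(layer):
            for i, w in enumerate(row):
                dw = delta * gamma[l][j][i] / r
                if l < len(W) - 1 and t > 0:
                    dw += alpha_p * (abs(delta) / math.sqrt(r)) * \
                          (w - (s / r) * gamma[l][j][i]) / t
                row[i] = w + dw
    return o_k

# Toy usage: a 2-3(sic, 2 hidden units)-1 net from near-zero weights.
random.seed(1)
W = [[[random.uniform(-0.01, 0.01) for _ in range(3)] for _ in range(2)],
     [[random.uniform(-0.01, 0.01) for _ in range(3)] for _ in range(1)]]
for _ in range(10):
    o_k = train_step(W, [0.5, -0.5], t_k=0.9, alpha=0.01, eps=1e-4, error=1.0)
# o_k approaches the teaching value 0.9 within a few presentations
```

On a single pattern the step nearly satisfies the constraint in one move, because the update sets the input of the final unit to tanh⁻¹(t_k) up to the linearization error.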
With these modifications, the procedure for adjusting the weights is then as follows:

(1) For each unit u_j, calculate

    x_j = 1                                  if j = k,
    x_j = 0                                  if u_j has no influence on u_k,
    x_j = (1 − o_j²) Σ_{m∈M_j} w_mj x_m      otherwise,

provided that f(x) = tanh(x) is used.

(2) For each weight, calculate γ_ji = x_j o_i.

(3) Calculate r = Σ_{l,m} γ²_lm.

(4) Calculate s = Σ_{l,m} γ_lm w_lm and t = √(Σ_{l,m} w²_lm), with the sums restricted to those weights not in the final layer.

(5) Add to each weight w_ji:

    (a) (δ/r) γ_ji                                          if w_ji is in the final layer,
    (b) (δ/r) γ_ji + α′ (|δ|/√r)(w_ji − (s/r) γ_ji)/t       if w_ji is not in the final layer,

where δ = tanh⁻¹(t_k) − φ_k and

    α′ = (1 − w)α    if the error is greater than ε,
    α′ = 0           otherwise,

where w is the average of the absolute values of the weight sizes, excluding the weights in the final layer; α is a constant, typically 0.01; ε is a constant.

3. COMPUTER SIMULATION RESULTS

Comparative tests were performed on the first and second tangent plane algorithms and the gradient descent backpropagation algorithm. The training sets were chosen in order to determine the degree to which the algorithms can generalize from the given data to deduce an underlying rule. In all cases a set of 10 inputs was used. Each of these inputs was allowed to take one of two values, ±0.5, and the full set of 1024 possible input patterns was presented to the network at each step. There was a single output unit, which was trained to take one of two possible values, ±0.9. On each trial of the algorithms, four of the input patterns were chosen at random as target patterns and grouped as two pairs. One pair was assigned the output 0.9 and the other pair the output −0.9. The remaining input patterns were then attached to these two groups as follows. The "distance" between two patterns was calculated as the number of inputs on which they differed, and each input pattern was grouped with the target pattern to which it was closest. If any pattern was found to be equidistant from two target patterns from different groups, then the target patterns were chosen again. Finally, each input pattern was given the output of the group to which it was attached. The weights were initially set to random values in the range [−2, 2] for the gradient descent and first
tangent plane algorithms, and in the range [−0.01, 0.01] for the second tangent plane algorithm. The order in which the set of input patterns was presented was randomized between each presentation. For the second tangent plane algorithm α was set at 0.01. Preliminary tests had shown relatively little change in performance with values of α in the range [0.001, 0.05]. A learning rate of 0.1 and a momentum term with a weighting of 0.8 were used in the gradient descent method. The error term was calculated, as in Lee (1993), as one half of the sum of the squares of the output errors, summed over all input patterns except the target ones. That is, the network was not trained on the four target patterns. This error term was calculated on a separate run through the training data after the weights had been adjusted. Training was stopped when either the error term fell below 10⁻⁴ or the number of steps exceeded 50,000, one step being one presentation of the complete training set. Trials where convergence had not taken place before 50,000 steps were excluded from calculations of convergence rates and generalization. Tests were performed under a variety of conditions with regard to the hidden layers of the network. The number of hidden layers used was either two or three, and the number of units in each layer was set at 3, 5, 10 or 15. In each case all units in each layer were connected with all units in adjacent layers. For each test the network was trained 50 times. The average number of steps to convergence was calculated using the geometric mean rather than the arithmetic mean in order to eliminate the problem, noted elsewhere (Burkitt, 1991; Lee, 1993), of occasional large values causing unreliability in the results. The geometric mean was also used to calculate the average final error for the four target patterns. In this case, however, it is possible for one of the values being averaged to be zero, which would automatically give a zero mean.
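The floored geometric mean used for averaging the final errors can be computed as below; the function name and default floor value are mine, with the floor matching the 10⁻⁷ limit described in the text:

```python
import math

def floored_geometric_mean(values, floor=1e-7):
    """Geometric mean with each value floored at `floor`, so that a single
    zero error does not drive the whole average to zero."""
    clipped = [max(v, floor) for v in values]
    return math.exp(sum(math.log(v) for v in clipped) / len(clipped))

# The geometric mean damps occasional large values: one outlier of 1e-2
# among errors of 1e-4 moves the mean to 1e-3, not to ~5e-3.
m = floored_geometric_mean([1e-2, 1e-4])   # ~1e-3
```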
This was avoided by imposing an arbitrary lower limit of 10⁻⁷ on the final output error for a target pattern. It follows that the average generalization error in the results given here must also be not less than 10⁻⁷, so the results do not distinguish between algorithms or architectures which are consistently generalizing better than this. The results obtained are given in Tables 1, 2 and 3. The ability of both the gradient descent and first tangent plane algorithms to generalize deteriorates as
TABLE 1
Final Error for Target Patterns (Average Error × 10⁷)

                         Two hidden layers            Three hidden layers
Width of hidden layers   3    5    10     15          3    5    10     15
Gradient descent         7    31   5,370  31,200      6    15   981    3,780
First tangent plane      4    23   5,420  46,500      5    10   4,360  108,000
Second tangent plane     7    13   10     18          6    8    8      8
TABLE 2
Number of Steps to Convergence

                         Two hidden layers                 Three hidden layers
Width of hidden layers   3       5       10      15        3       5       10      15
Gradient descent         33,200  33,900  17,000  16,600    12,000  12,700  11,800  15,100
First tangent plane      672     1,050   706     179       308     354     469     129
Second tangent plane     804     473     240     255       215     126     108     116
the size of the network increases. The results for the two algorithms are very similar with two hidden layers. With three hidden layers the performance of the gradient descent algorithm appears to deteriorate less rapidly. This, however, should be treated with caution, as it may be due to the large number of failures to converge found with the gradient descent algorithm, if failure to converge is associated with generalization difficulties. Apart from this, the only difference between the algorithms is the more rapid and reliable convergence of the first tangent plane method, as noted in Lee (1993). The second tangent plane algorithm has equivalent convergence properties to those of the first tangent plane algorithm. However, its ability to generalize is not affected by the size of the network, except for a possible slight deterioration when two hidden layers are used. The second tangent plane algorithm thus appears to be a relatively robust method of training a network, giving good convergence and generalization properties which are largely independent of network size.

A further test was performed with data which had been partially corrupted, in order to assess whether the robustness of the second tangent plane algorithm was also apparent with a more realistic training set. The test was carried out using two hidden layers with ten units in each of these layers. The training data was generated in the same way, but at each presentation of an item of data the teaching value, t_k, for the output was given a 1% probability of being randomized in the range (−0.9, 0.9) and a 99% probability of being randomized in the range (t_k − 0.025, t_k + 0.025). That is, 1% of the training data was incorrect and the remainder was fuzzy. The error was calculated with respect to the original, unrandomized, teaching values.

As shown in Lee (1993), it is necessary to include a progressive reduction in step size, or stiffening, into
the tangent plane method in order to make it converge to a compromise solution with fuzzy training data, as otherwise it continues to hop around trying to satisfy each new constraint in turn. This was done here in both of the tangent plane algorithms by scaling down each weight change by a factor sⁿ, where n is the number of steps. The parameter s was set at * for the second tangent plane method, but it was found that it had to be increased to * for the first tangent plane method to allow for the slower convergence of that algorithm. The results obtained are given in Table 4. The gradient descent algorithm did not converge on this test. The first tangent plane algorithm gave better generalization than it had done with uncorrupted data, although convergence was much slower. The second tangent plane method also suffered a decline in convergence speed, but continued to give good generalization.

Two further tests of the algorithm were performed using real data which has been used in earlier studies by other authors. Assuming that the generalization performance given in these studies is about the optimum that can be achieved, then it is of interest to see if the behaviour exhibited in Table 1 still applies, whereby the second tangent plane algorithm can give generalization performance close to this optimum over a range of network sizes. The first test utilized the Wisconsin breast cancer data obtained from the University of California, Irvine (ftp: /pub/machine-learning-databases on ics.uci.edu). The data has nine integer-valued attributes from which one attribute, malignant or benign, must be deduced. The first 200 cases were used for training the network and the next 167 cases were used to test for generalization. Twenty-four different sizes of network were used, with the number of hidden layers varying from two to four and the number of units in each layer varying from three to ten. There
TABLE 3
Number of Failures to Converge

                         Two hidden layers       Three hidden layers
Width of hidden layers   3    5    10   15       3    5    10   15
Gradient descent         32   29   9    10       21   2    9    20
First tangent plane      4    0    0    0        7    0    0    0
Second tangent plane     12   0    0    0        0    0    0    0
TABLE 4
Generalization from Noisy Data

                         Final error for target      Number of steps    Number of failures
                         patterns (Average × 10⁷)    to convergence     to converge
Gradient descent         —                           —                  50
First tangent plane      436                         18,000             4
Second tangent plane     6                           2,950              0
were nine inputs and a single output. In each case 50 trials were made. It was found that continuing learning until the training set had been learned perfectly resulted in a deterioration of generalization performance, as expected, but that stopping training with the error anywhere between 1 and 10 gave good generalization and acceptable performance on the training set. Generalization was found to be independent of the size of the network. With training stopped when the error fell below 5, for example, the average percentage of correct responses on the test set varied from 94.0 to 94.4%. These percentages are very similar to those given by Wolberg (1990) and Wolberg and Mangasarian (1990) in their analyses of this data by other methods. The average percentage of correct responses on the training set varied from 97.4 to 97.8%. The average number of steps to convergence varied from 1.8 to 3.9.

The second test utilized the data used by Sejnowski and Rosenberg (1987) in their study of neural networks trained using backpropagation to learn the pronunciations of English words. The NETtalk data set, obtained from Carnegie Mellon University (ftp: /afs/cs/project/connect/bench on ftp.cs.cmu.edu), consists of 20,000 English words and their associated phonetic transcriptions. Only the phonemes, and not the data on stress and syllable boundaries, were learned. The method used by Sejnowski and Rosenberg (1987) to represent the letters and phonemes was adopted, using a window of seven letters. Because of the amount of computation time required to learn this large data set it was not possible to study the behaviour of the algorithm over a range of network sizes. Only one architecture was used, with two hidden layers of 100 units each. Only one trial was used for each of the two training sets tried, instead of the 50 trials used in each of the previous tests.
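With a window of seven letters and a one-hot code per character, the input layer for this task might be built along the following lines. This is only a sketch in the spirit of the NETtalk representation: the exact symbol inventory Sejnowski and Rosenberg used, and the use of '_' as the padding and word-boundary character, are assumptions here, not details taken from the text:

```python
def encode_window(word, pos, alphabet="abcdefghijklmnopqrstuvwxyz_.-", width=7):
    """One-hot encoding of a seven-letter window centred on word[pos]."""
    half = width // 2
    padded = "_" * half + word + "_" * half
    window = padded[pos:pos + width]          # window centred on word[pos]
    vec = []
    for ch in window:
        one_hot = [0.0] * len(alphabet)
        one_hot[alphabet.index(ch)] = 1.0     # one unit on per character
        vec.extend(one_hot)
    return vec

# Window centred on the 'w' of "network"; each of the 7 positions
# contributes len(alphabet) input units, exactly one of which is active.
vec = encode_window("network", 3)
```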
Although the data being learned is exact, it was found in preliminary tests that performance appeared to be improved if the algorithms were damped slightly by a progressive reduction in step size, as used for corrupted data. A simple one-stage reduction by 50% was employed, after the error had fallen below 1000. The value of ε, the level at which α is reduced to zero, was also set at 1000. Sejnowski and Rosenberg trained their networks on the 1000 most commonly occurring words and then tested them on the whole set of words. The
original list of 1000 words is not available, but a reconstruction of it is given with the Carnegie Mellon data. The present algorithm was trained using this and also using 1000 words chosen at random. In the first case generalization stopped improving after about 30 presentations of the data and performance on the training set improved little after about 70 steps. Training was stopped after 97 presentations, at which stage the algorithm had learned 99.2% of the training set, and was correct on 76.6% of the whole set. With the training set chosen at random, training was stopped after 76 presentations, giving a performance of 97.3% on the training set and 78.7% on the whole set. Sejnowski and Rosenberg report similar performances: 98% on the training set and 77% on the whole set for a network with a single hidden layer of 120 units, and 97% on the training set and 80% on the whole set for a network with two hidden layers of 80 units.

A test was also made of the relative computational efficiencies of the three algorithms. VAX FORTRAN was used with a network having two hidden layers and ten units in each layer. It was found that, relative to the CPU time, t seconds, used by one step of the gradient descent method, one step of the first tangent plane method required 1.22t seconds and one step of the second tangent plane method, without any of the three efficiency adjustments, required 2.30t seconds.

4. CONCLUSIONS

The second tangent plane algorithm appears to give good generalization over a range of network sizes, at the same time preserving the efficient convergence of the first tangent plane algorithm. It retains the system of backpropagating errors through the network, and learns without the necessity to batch the data, although it is computationally more complex.
Unlike the first tangent plane algorithm it requires the setting of a parameter, α, but it does not appear to be particularly sensitive to the value of α used and there is no indication that α needs to be adjusted to accommodate the training set or architecture chosen.

REFERENCES

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Burkitt, A. N. (1991). Optimisation of the architecture of feedforward neural nets with hidden layers by unit elimination. Complex Systems, 5, 371–380.
Cottrell, M., Girard, B., Girard, Y., & Mangeas, M. (1993). Time series and neural network: a statistical method for weight elimination. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (pp. 157–164). Brussels: D facto.
Drucker, H., & Le Cun, Y. (1992). Improving generalisation performance using double backpropagation. IEEE Transactions on Neural Networks, 3, 991–997.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. Neural Information Processing Systems, 2, 524–532.
Holmstrom, L., & Koistinen, P. (1992). Using additive noise in backpropagation training. IEEE Transactions on Neural Networks, 3, 24–38.
Kamimura, R. (1993). Internal representation with minimum entropy in recurrent neural networks: minimizing entropy through inhibitory connections. Network: Computation in Neural Systems, 4, 423–440.
Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained networks. IEEE Transactions on Neural Networks, 1, 239–242.
Kruschke, J. K. (1989). Distributed bottlenecks for improved generalization in back-propagation networks. International
Journal of Neural Networks Research and Applications, 1, 187–193.
Lee, C. W. (1993). Learning in neural networks by using tangent planes to constraint surfaces. Neural Networks, 6, 385–392.
Lendaris, G. G., & Harls, I. A. (1990). Improved generalization in ANN's via use of conceptual graphs: a character recognition task as an example case. Proceedings IJCNN-90 (pp. 551–556). Piscataway, NJ: IEEE.
Romaniuk, S. G. (1993). Pruning divide and conquer networks. Network: Computation in Neural Systems, 4, 481–494.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Schiffman, W., Joost, M., & Werner, R. (1993). Comparison of optimized backpropagation algorithms. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (pp. 97–104). Brussels: D facto.
Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145–168.
Wang, J., & Malakooti, B. (1993). Characterization of training errors in supervised learning using gradient-based rules. Neural Networks, 6, 1073–1089.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, USA, 87, 9193–9196.