Neural Networks 11 (1998) 1059–1072
Contributed article
A modified back-propagation method to avoid false local minima

Yutaka Fukuoka a,*, Hideo Matsuki b, Haruyuki Minamitani b, Akimasa Ishida a
a Institute for Medical and Dental Engineering, Tokyo Medical and Dental University, Chiyoda-ku, Tokyo 101-0062, Japan
b Faculty of Science and Technology, Keio University, Yokohama, Kanagawa, Japan

Received 31 August 1994; revised 5 June 1998; accepted 5 June 1998
Abstract

The back-propagation method encounters two problems in practice, i.e., slow learning progress and convergence to a false local minimum. The present study addresses the latter problem and proposes a modified back-propagation method. The basic idea of the method is to keep the sigmoid derivative relatively large while some of the error signals are large. For this purpose, each connecting weight in a network is multiplied by a factor in the range of (0,1], at a constant interval during a learning process. Results of numerical experiments substantiate the validity of the method. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Back-propagation; False local minima; Premature saturation; Sigmoid derivative; Weight readjusting; Annealing
Nomenclature

D_j^l           degree of saturation for unit j in layer l
E               quadratic error function
T(t)            temperature at iteration/epoch t
T_R             readjusting interval
a               sigmoid's slant parameter
d_{i,c}         desired signal for output unit i for the cth input pattern
e_max           maximum error
e_s             error tolerance (STELA)
e_th            error tolerance
k               adjusting factor for the learning rate
t               learning iteration/epoch
w_init          range of the initial weights
w_{ij}^{l-1,l}  connecting weight between unit j in layer l and unit i in the next lower layer
Δw_{ij}^{l-1,l} change of w_{ij}^{l-1,l}
x_j^l           total input to unit j in layer l
y_j^l           activity level of unit j in layer l
α               momentum factor
β               readjusting factor
γ               coefficient used for calculating β
δ               partial derivative
δ̄               smoothed partial derivative
ε               learning rate
ε_max           maximum value of the learning rate
Δε              change of ε
θ               coefficient used for the smoothed partial derivative (DBD)
κ               increment factor for the learning rate (DBD)
ψ               threshold for the sum of Δw (STELA)
φ               decrement factor for the learning rate

1. Introduction
The error back-propagation (BP) method (Rumelhart et al., 1986) has been widely used to train multilayer, feedforward neural networks. This algorithm can evolve a set of weights to produce an arbitrary mapping from input to output by presenting pairs of input vectors and their corresponding output vectors. It is an iterative gradient algorithm designed to minimize a measure of the difference between the actual output vector of the network and the desired output vector. The BP method, however, encounters two difficulties in practice:

1. the convergence tends to be extremely slow;
2. convergence to the global minimum is not guaranteed.

Although these two seem to be closely related, as described later, we summarize various improvements to overcome the drawbacks under two aspects. For the former problem, various acceleration techniques have been proposed.

1. Dynamically modifying learning parameters. Many researchers have pointed out that a constant learning rate is not suitable for a complex error surface. Several heuristic techniques have been proposed for learning rate adaptation (Jacobs, 1988; Vogl et al., 1988; Hush & Salas, 1988), while Weir (1991) considered an optimum step length and established a method for self-determination of the adaptive learning rate. Most of these techniques can be considered as a kind of line search.
2. Adjusting the steepness of the sigmoid function. As Hush et al. (1992) have pointed out, because of the sigmoid's nonlinearity, error surfaces tend to have many flat areas as well as steep regions. If one such flat area with a high error is encountered, no significant decrease in the error occurs for some period, after which the error decreases again. This phenomenon is called premature saturation. Since considerable time is often needed to traverse such an area, premature saturation retards the learning process. The basic remedy is to adjust the sigmoid's steepness (Yamada et al., 1989; Rezgui & Tepedelenlioglu, 1990).

3. Improving the error function. Since the sigmoid derivative which appears in the error function of the original BP method has a bell shape, it sometimes causes slow learning progress when the output of a unit is near '0' or '1'. To remove it from the error signal, van Ooyen & Nienhuis (1992) and Krzyzak et al. (1990) have employed an entropy-like error function.

4. Rescaling of variables. As mentioned above, the error signal involves the sigmoid derivative, which is multiplied at each layer. Since it has a value between 0 and 1/4, the elements of the gradient vector differ greatly in magnitude across the different layers. This is one cause of the ill-conditioned nature of the BP method. Rigler et al. (1991) have proposed rescaling of the elements to overcome this problem.

5. Second-order methods. Several researchers have proposed the use of second-order gradient techniques, such as the conjugate gradient and quasi-Newton methods, instead of the simple gradient descent technique (Watrous, 1987; Parker, 1987; Kramer & Sangiovanni-Vincentelli, 1989). Watrous (1987) has demonstrated that the quasi-Newton method achieves rapid convergence near a minimum. These methods, however, require more storage capacity.

Although these techniques have been successful in speeding up learning for some problems, there is not enough discussion of, or experiment on, their ability to avoid local minima. Moreover, they usually introduce additional parameters which are problem-sensitive (Hush & Horne, 1993). The second type of drawback of the BP method is very common among nonlinear optimization techniques.

1. Annealing. One of the most famous remedies is simulated annealing (Kirkpatrick et al., 1983), which accepts deterioration in the error function with a nonzero but gradually decreasing probability. This is an analogy with annealing in solids. For the BP method, the deterioration is usually induced by adding Gaussian white noise to weight changes. In this case, the temperature in the Boltzmann distribution corresponds to the variance of the noise. It is initialized with a large value and then reduced to zero according to an annealing schedule. Geman & Geman (1984) have proved the existence of annealing schedules which guarantee convergence to the global minimum. However, all of them require an enormous amount of time, and any schedule designed to reduce the time generally leads to degradation of learning performance. Mean field annealing (MFA) (Peterson & Anderson, 1987; Bilbro et al., 1992), which is a deterministic approximation to simulated annealing, can provide faster progress. In this method, the temperature controls the steepness of the sigmoid function, i.e., learning starts with a slight slope and the steepness is then increased. Although MFA can execute 50 times faster than simulated annealing for some problems (Bilbro et al., 1992), the quality of the solution is highly dependent on the choice of the initial temperature, the annealing schedule, and the final temperature (Lee & Sheu, 1993).

2. On-line weight updating. Two different schemes of updating weights can be found in the literature. One approach is to accumulate the partial derivatives of the error function with respect to the weights over all the input–output cases before updating the weights (batch-mode learning). In this mode, a more accurate estimate of the true gradient is achieved. The other is to update the weights after every input–output case; this is called on-line mode learning. By choosing each case randomly, it produces small fluctuations which sometimes increase the error function. Consequently, it can get out of shallow local minima (Xu et al., 1992).

3. Starting with appropriate weights. It has been shown that the BP method is quite sensitive to initial weights (Kolen & Pollack, 1991). Weights are usually initialized with small random values. However, starting with inappropriate weights is one reason for getting stuck in local minima or for slow learning progress. For example, initial weights which are too large easily cause premature saturation (Lee et al., 1993). The learning progress can be accelerated by initializing weights in such a way that all hidden units are scattered uniformly in the input pattern space (Nguyen & Widrow, 1990). Wessels & Barnard (1992) have proposed a similar technique to avoid local minima for neural net classifiers. They refer to local minima which do not satisfy the criterion imposed for stopping a learning process as false local minima (FLM).

The present study proposes a modified back-propagation method for general-purpose networks to avoid FLM. The goal is to learn the desired mapping in a reasonable amount of time. Since a three-layered network is capable of forming arbitrarily close approximations to any continuous nonlinear mapping (Irie & Miyake, 1988; Funahashi, 1989; Hornik et al., 1989; Cybenko, 1989), our discussion will be limited to three-layered networks. The paper is structured as follows. Section 2 is devoted to an explanation of the original BP method and its existing improvements. We discuss convergence to FLM in Section 3 and introduce a modified BP method in Section 4. Results of numerical experiments are described in Section 5.
2. Back-propagation method

In this section, we describe the necessary notation and briefly outline the original BP method and its existing improvements.

2.1. Original BP method

The total input to unit j in layer l, x_j^l, is determined by

$$ x_j^l = \sum_i w_{ij}^{l-1,l}\, y_i^{l-1} \qquad (1) $$

where w_{ij}^{l-1,l} represents the connecting weight between unit j in layer l and unit i in the next lower layer. The activity level of unit j is a nonlinear function of its total input,

$$ y_j^l = f(x_j^l) = \frac{1}{1 + \exp(-a x_j^l)} \qquad (2) $$

where a is a slant parameter. It is usually set to 1 and often omitted. Learning is carried out by iteratively updating the connecting weights so as to minimize the quadratic error function E, which is defined as

$$ E = \frac{1}{2} \sum_c \sum_i \left( y_{i,c}^3 - d_{i,c} \right)^2 \qquad (3) $$

where c is an index over input–output vectors, y_{i,c}^3 is the actual activity level of output unit i and d_{i,c} is its desired value. We will suppress the index c in what follows, except for especially complicated cases. A weight change Δw is calculated from

$$ \Delta w_{ij}^{l-1,l} = -\varepsilon \frac{\partial E}{\partial w_{ij}^{l-1,l}} \qquad (4) $$

where ε is the learning rate. The term on the right-hand side is proportional to the partial derivative of E with respect to that weight. Instead of Eq. (4), Eq. (5) is often used to adapt the step size as a function of the local curvature of the error surface:

$$ \Delta w_{ij}^{l-1,l}(t) = -\varepsilon \frac{\partial E}{\partial w_{ij}^{l-1,l}} + \alpha\, \Delta w_{ij}^{l-1,l}(t-1) \qquad (5) $$

where α is the momentum factor and t is the iteration number. The second term makes the current search direction an exponentially weighted average of past directions. The partial derivative ∂E/∂w, which is denoted as δ, is calculated as

$$ \delta_{ij}^{l-1,l} = \frac{\partial E}{\partial w_{ij}^{l-1,l}} = \frac{\partial E}{\partial x_j^l} \frac{\partial x_j^l}{\partial w_{ij}^{l-1,l}} = \frac{\partial E}{\partial x_j^l}\, y_i^{l-1} \qquad (6) $$

where ∂E/∂x_j^l is the error signal,

$$ \frac{\partial E}{\partial x_j^l} = \begin{cases} (y_j^3 - d_j)\, f'(x_j^3) & \text{(for an output unit)} \\[4pt] f'(x_j^2) \sum_k w_{jk}^{23} \dfrac{\partial E}{\partial x_k^3} & \text{(otherwise)} \end{cases} \qquad (7) $$

Here f′(·) is the sigmoid derivative, which is calculated as

$$ f'(x) = a \left( 1 - f(x) \right) f(x) \qquad (8) $$
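To make the update rules concrete, the following is a minimal batch-mode sketch of Eqs. (1)-(8) for a three-layered network. The naming (f, bp_epoch, W1, W2) is ours and bias weights are omitted for brevity; this is an illustration of the equations above, not the authors' implementation.

```python
# Minimal batch-mode BP sketch for a three-layered network, Eqs. (1)-(8).
# Names (f, bp_epoch, W1, W2) are illustrative; bias weights are omitted.
import numpy as np

def f(x, a=1.0):
    """Sigmoid activation, Eq. (2), with slant parameter a."""
    return 1.0 / (1.0 + np.exp(-a * x))

def bp_epoch(W1, W2, X, D, eps=0.05, a=1.0):
    """One batch-mode epoch.
    X: (cases, inputs), D: (cases, outputs),
    W1: (inputs, hidden), W2: (hidden, outputs). Returns E of Eq. (3)."""
    y2 = f(X @ W1, a)                       # hidden activity, Eqs. (1)-(2)
    y3 = f(y2 @ W2, a)                      # output activity
    E = 0.5 * np.sum((y3 - D) ** 2)         # quadratic error, Eq. (3)

    # Error signals dE/dx, Eq. (7), using f'(x) = a(1 - f(x))f(x), Eq. (8)
    g3 = (y3 - D) * a * (1.0 - y3) * y3     # output layer
    g2 = (g3 @ W2.T) * a * (1.0 - y2) * y2  # hidden layer (back-propagated)

    # Accumulated partial derivatives, Eq. (6), and gradient step, Eq. (4)
    W2 -= eps * (y2.T @ g3)
    W1 -= eps * (X.T @ g2)
    return E
```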
A weight change in the nth layer, counting backwards from the output layer, involves (f′(·))^n. This causes weight changes corresponding to different layers to differ considerably in magnitude, because 0 ≤ f′(·) ≤ 1/4 when a = 1.

The weights in the network are initialized with small random values at the onset of a learning process. This starts the search at a relatively safe position (Hush et al., 1992). Two different schemes of updating weights can be used: the on-line mode and the batch mode. In the on-line mode, weights are updated after every input–output case. On the other hand, in the batch mode, δ is accumulated over all the input–output cases before updating the weights; i.e., a learning step, which is called an epoch, consists of a presentation of the whole set of patterns and a weight modification based on the accumulated δ. The learning process continues until the sum of squared errors becomes less than a preset value. In what follows, the iteration number t is expressed in units of iterations for the on-line mode, but in units of epochs for the batch mode.

2.2. Existing improvements to speed up learning and avoid FLM

As mentioned in Section 1, there are many improvements to the BP method. In Section 5, some of them are compared with the proposed method: (a) the delta-bar-delta (DBD) rule, (b) the standstill evading learning algorithm (STELA) and (c) MFA. The first two are improvements for acceleration, while the last is intended to avoid FLM. DBD is one of the learning rate adaptation techniques. STELA is a method which varies the steepness of the sigmoid function. MFA is a deterministic approximation to simulated annealing. We summarize these methods in what follows.

2.2.1. Delta-bar-delta rule

As mentioned above, an appropriate learning rate at the beginning may not be very suitable later on. Jacobs (1988) proposed a learning rate adaptation technique called the delta-bar-delta rule. It has two features:

• every weight has its own learning rate;
• every learning rate is varied according to the following equations.
$$ \varepsilon(t+1) = \varepsilon(t) + \Delta\varepsilon(t) \qquad (9) $$

$$ \Delta\varepsilon(t) = \begin{cases} \kappa & \text{if } \bar{\delta}(t-1)\,\delta(t) > 0 \\ -\varphi\,\varepsilon(t) & \text{if } \bar{\delta}(t-1)\,\delta(t) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (10) $$

where κ is an increment factor for ε, φ is a decrement factor
for ε, and

$$ \bar{\delta}(t) = (1 - \theta)\,\delta(t) + \theta\,\bar{\delta}(t-1) \qquad (11) $$

In Eq. (11), δ̄(t) is an exponential average of the current and past derivatives, and θ is a coefficient between 0 and 1. In the following simulations, Eq. (9) is slightly modified:

$$ \varepsilon(t+1) = \begin{cases} \varepsilon(t) + \Delta\varepsilon(t) & \text{if } \varepsilon(t+1) < \varepsilon_{max} \\ \varepsilon_{max} & \text{otherwise} \end{cases} \qquad (12) $$

i.e., ε(t) is limited to a predetermined value ε_max.
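For illustration, the per-weight bookkeeping of Eqs. (9)-(12) might be sketched as follows. The array-valued state (eps, dbar) and the smoothing value theta = 0.7 are assumptions of this sketch (the paper fixes only κ = 0.0025 and φ = 0.2, in Section 5.1); it is not the authors' code.

```python
# Sketch of the delta-bar-delta update, Eqs. (9)-(12); eps, dbar and delta
# are arrays of the same shape as the weight matrix they belong to.
import numpy as np

def dbd_update(eps, dbar, delta, kappa=0.0025, phi=0.2,
               theta=0.7, eps_max=0.25):
    sign = dbar * delta                          # compares delta-bar(t-1) with delta(t)
    d_eps = np.where(sign > 0, kappa,            # Eq. (10)
             np.where(sign < 0, -phi * eps, 0.0))
    eps = np.minimum(eps + d_eps, eps_max)       # Eqs. (9) and (12)
    dbar = (1.0 - theta) * delta + theta * dbar  # Eq. (11)
    return eps, dbar
```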
2.2.2. STELA

In the paper by Yamada et al. (1989), premature saturation is called a learning standstill state, and a method, STELA, is proposed for neural net classifiers. The following conditions are defined to detect the state:

• some output units have significant errors which exceed a preset value e_s;
• the summation of Δw is less than a preset value.

The latter condition can be substituted by

$$ \sum_j \sum_k \left| y_k^2\, f'(x_j^3) \right| < \psi \qquad (13) $$
When premature saturation occurs, the slant parameter a is halved and all output levels are calculated again. This procedure is repeated until the above conditions are no longer satisfied. Then the weights are updated once and a is switched back to its original value.

2.2.3. Mean field annealing

In mean field annealing, the steepness is varied instead of adding noise. The slant parameter is set by

$$ a = 1 / T(t) \qquad (14) $$

The temperature, T, is initialized at a sufficiently large value, T(0), and gradually reduced.
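As a small sketch of Eq. (14), the slope control could be written as below, using the schedule T(t) = T(0)/(t + 1) and the clipping a = 1 once T(t) < 1 that Section 5.1 adopts for the symmetry problem; the function name is ours.

```python
# Sketch of the MFA slope control, Eq. (14), with the Section 5.1 schedule.
def mfa_slant(t, T0=500.0):
    T = T0 / (t + 1.0)          # temperature, initialized large and reduced
    return 1.0 / max(T, 1.0)    # a = 1/T(t), clipped to a = 1 once T(t) < 1
```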
3. Convergence to false local minima and premature saturation

The term 'convergence' can be defined as follows: a state where Δw is small enough that the sum of squared errors is not appreciably decreased. When a network gets trapped in a false local minimum, where it will stay indefinitely, the state of the network has the following two features:

1. each Δw is quite small;
2. some of the errors between y^3_{j,c} and d_{j,c} are greater than the given tolerance, e_th.

The tolerance is usually set according to the nature of the desired output d_j. For example, if the desired output has a continuous value, e_th should be set to a value small enough for the network to map the given function accurately. On the other hand, if the desired output is represented by a binary value, e_th need not be set to a very small value, because it is satisfactory as long as the difference |y_j^3 − d_j| is less than 0.5.

As mentioned above, there are many flat areas on the error surface. When one such area with a large error is encountered, i.e., premature saturation occurs, the state of the network is similar to that in an FLM. Since considerable time is often needed to traverse such areas, these areas are mistakenly believed to be FLM (McInerney et al., 1989). By contrast, the gradient search fails to decrease the error in the case of a real FLM. As a result, once a network is trapped in such a minimum, it cannot get out of the minimum, despite large errors. Although it is sometimes believed that premature saturation is closely related to convergence to FLM, no general conclusion has been drawn about their relationship. However, it is clear that both cause difficulties.

Apparently, the following is one reason for FLM and premature saturation. As denoted in Eqs. (4)-(8), a weight change Δw is determined by the error signal and the sigmoid derivative f′(·). When the activity level of a unit approaches '0' or '1' (the saturated portion of the sigmoid curve), f′(·), and consequently Δw, are very small. This implies that the activity level of an output unit can be maximally wrong without producing a large error signal. In this case, the squared error is not appreciably reduced.
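A quick numerical check of this claim (our own illustration, with a = 1):

```python
# A unit stuck at the wrong extreme produces almost no error signal,
# even though its error is near-maximal.
y, d = 0.99, 0.0                 # actual vs desired output
fprime = (1.0 - y) * y           # Eq. (8): f'(x) = 0.0099
print((y - d) * fprime)          # error signal ~ 0.0098
# A mid-range unit (y = 0.5) has f'(x) = 0.25, about 25 times larger.
```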
4. Modified BP method

To overcome the above problem, we propose a modified BP method whose basic idea lies in keeping f′(·) relatively large until each error signal becomes small. In other words, activation levels are kept in the middle range of the sigmoid curve. To discuss this more quantitatively, we introduce the following parameter, the degree of saturation D_j^l, which increases as the sigmoid derivative decreases:

$$ D_j^l \equiv \left| y_j^l - 0.5 \right| \qquad (15) $$

To avoid FLM and premature saturation, the degree of saturation must be kept small while e_max is large:

$$ e_{max} = \max_{j,c} \left| y_{j,c}^3 - d_{j,c} \right| \qquad (16) $$
As denoted in Eq. (1), the activity level of a unit is determined by its total input, Σwy. If each weight is multiplied by a factor β, where β ∈ (0,1], the absolute value of the total input is reduced. Thus the activity level approaches the center of the sigmoid curve. This operation, which is called a readjusting operation, makes the degree of saturation small and prevents f′(·) from approaching zero. This relationship is summarized in the
following expressions:

$$ \hat{w}_{ij}^{l-1,l} = \beta_j^l\, w_{ij}^{l-1,l} \qquad (17) $$

$$ \left| \sum_i \hat{w}_{ij}^{l-1,l}\, y_i^{l-1} \right| \le \left| \sum_i w_{ij}^{l-1,l}\, y_i^{l-1} \right| \qquad (18) $$

$$ \hat{D}_j^l \le D_j^l \qquad (19) $$
The modified method employs the readjusting operation, which is carried out every T_R iterations/epochs during a learning process. The operation decreases the absolute values of the weights. For example, if a β close to 0 is used, the absolute value of w becomes nearly equal to '0'. On the other hand, ŵ is equal to w and the operation has no effect when β is set to '1'. Absolute values of the weights, in general, increase with the progress of a learning process, and ε determines the step size. To obtain good results, a balance is needed between the parameters β and ε. In determining the multiplier β_j^l, we therefore take account of the parameters ε, D_j^l and e_max. Since the magnitudes of Δw_{ij}^{12} and Δw_{ij}^{23} are considerably different, β_j^l is calculated in slightly different ways for the two layers:

$$ \beta_j^l = \begin{cases} 1 - \varepsilon\, e_{max}\, \gamma & \text{(for } w_{ij}^{23}\text{)} \\[4pt] 1 - \dfrac{\varepsilon\, e_{max}\, \gamma}{2} & \text{(for } w_{ij}^{12}\text{)} \end{cases} \qquad (20) $$

where γ is set according to Table 1. For example, if e_max ≤ 2e_th (error level I), γ is set to '0' (i.e., β = 1) without taking account of D_j^l, because the learning process is expected to be completed successfully. If 2e_th < e_max ≤ 3.4e_th (error level II), γ is set according to the value of D_j^l. In the case of a large e_max (e_max > 3.4e_th, error level III) with a large D_j^2, all weights which connect to unit j are reset to small random values and the corresponding Δw values are set to '0', because the learning process is not expected to be completed successfully in this situation. For the following reason, weight resetting is carried out only for large D_j^2. The output units easily become incorrectly saturated when the desired values are 0/1; this means that if weights were reset for large D_j^3, frequent resetting would retard the learning process. In addition, there are enough chances for the output units to be improved while the degree of saturation of some hidden units is small. At this moment, unfortunately, it is difficult to determine the values of γ theoretically. They were determined through some degree of trial and error. Alternative values provided similar results in preliminary experiments.

Table 1
Values of γ

             e_max: 0–2e_th    2e_th–3.4e_th    3.4e_th–∞
D_j^l
0.5–0.43     0.0               0.5              1.0, reset^a
0.43–0.1     0.0               0.1              0.5
0.1–0        0.0               0.02             0.1

^a Weights are reset when D_j^2 > 0.43.
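The readjusting operation of Eqs. (15)-(20) with the γ lookup of Table 1 can be sketched as follows. The γ table follows Table 1; the per-unit scaffolding, the use of the worst-case pattern for D_j^l, and the reset range (±w_init) are our reading of the text, and f is the sigmoid from the earlier sketch.

```python
# Sketch of the readjusting operation, Eqs. (15)-(20) and Table 1.
import numpy as np

def gamma_from_table(e_max, e_th, D):
    """Return (gamma, reset?) for a degree of saturation D, Eq. (15)."""
    if e_max <= 2.0 * e_th:                        # error level I
        return 0.0, False
    if e_max <= 3.4 * e_th:                        # error level II
        return (0.5 if D > 0.43 else 0.1 if D > 0.1 else 0.02), False
    if D > 0.43:                                   # error level III
        return 1.0, True                           # reset instead of rescale
    return (0.5 if D > 0.1 else 0.1), False

def readjust(W1, W2, X, D_des, eps, e_th, w_init=0.3, a=1.0):
    """One readjusting operation (applied every T_R epochs).
    W1: input->hidden, W2: hidden->output; f as in the earlier sketch."""
    y2 = f(X @ W1, a); y3 = f(y2 @ W2, a)
    errs = np.abs(y3 - D_des)
    e_max = errs.max()                             # Eq. (16)
    c = np.unravel_index(errs.argmax(), errs.shape)[0]   # worst pattern
    for j in range(W1.shape[1]):                   # hidden units
        D = abs(y2[c, j] - 0.5)                    # D_j^2, Eq. (15)
        g, reset = gamma_from_table(e_max, e_th, D)
        if reset:                                  # resetting only for large D_j^2
            W1[:, j] = np.random.uniform(-w_init, w_init, W1.shape[0])
        else:                                      # Eq. (20), w^{12} case
            W1[:, j] *= 1.0 - eps * e_max * g / 2.0
    for j in range(W2.shape[1]):                   # output units
        D = abs(y3[c, j] - 0.5)                    # D_j^3
        g, _ = gamma_from_table(e_max, e_th, D)    # no resetting here
        W2[:, j] *= 1.0 - eps * e_max * g          # Eq. (20), w^{23} case
    return W1, W2
```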
This method is similar to MFA and STELA, because the readjusting operation effectively corresponds to regulating the steepness of the sigmoid function. These methods control the slant parameter in different ways. In MFA, the parameter a is monotonically and gradually increased, whereas in the proposed method, a is effectively decreased by the readjusting operation and is not explicitly increased. In STELA, which is one of the step size adaptation techniques, the parameter is decreased when weight changes are small despite large errors, and after a weight modification it is restored to its original value. The most important point is that only the proposed method can explicitly decrease the absolute values of the weights. This implies that the ratio Δw/w is kept relatively large, which is effective for correcting the wrong sign of a weight. However, even if Δw is large, considerable time is required to change a wrong sign in the case of weights with fairly large magnitude. In this case, restarting from another set of small weights sometimes provides faster convergence than continuing the learning process. From this point of view, weights are reset when incorrect saturation occurs, in which both e_max and the degree of saturation are large. Rescaling the weights with a small β can, of course, ease the above problem. However, since the magnitude of the weights increases with the learning progress, restarting from the rescaled weights sometimes results in incorrect saturation again and thus fails to change the wrong sign. In this case, weight resetting is more effective than weight rescaling.
5. Simulation results and discussion

Two problems (a symmetry problem and a random mapping problem) were examined to compare the modified and various existing learning methods with respect to the rate of successful convergence within an iteration limit. Since the back-propagation procedure is sensitive to different learning rates and initial weights, we experimented with learning rates varying over an order of magnitude, and with trials starting from different initial weights drawn at random from a uniform distribution between −w_init and w_init. The same initial weights were used for all methods. A learning process was continued until the convergence criterion (e_max ≤ e_th) was achieved or the total number of iterations/epochs exceeded the iteration limit. The data in the following sections include only simulations that achieved the convergence criterion.

5.1. Symmetry problem
The symmetry problem, which was investigated previously by Rumelhart et al. (1986), was posed for a network consisting of a layer of six input units, a layer of two hidden units, and one output unit (Fig. 1). The numbers on the links and the labels on the units will be used for identification purposes. This problem was solved using the original BP (the batch and the on-line modes), DBD, STELA, MFA and
readjusting methods. With each method, the connecting weights were updated so that the network detects whether or not the binary activity levels of a one-dimensional array of input units are symmetrical about the center point. Except for the on-line mode, the whole set of 2^6 input vectors was presented at each epoch. When the activity levels were symmetrical, the desired output was set to '1', otherwise '0'. Learning rates of 0.01, 0.02, 0.05, 0.07, 0.1 and 0.2, and w_init of 0.01, 0.05, 0.1 and 0.3 were used for this problem. For each condition, the rate of successful convergence within the iteration limit (rate of success) was estimated over 100 trials in which α and e_th were fixed (α = 0.9 and e_th = 0.25). The iteration limit was 102 400 iterations for the on-line mode and 1600 epochs for the others. The parameters were as follows:

DBD: κ = 0.0025, φ = 0.2, ε_max = 0.25
STELA: e_s = 0.5, ψ = 0.05
MFA: T(t) = T(0)/(t + 1), T(0) = 500, a = 1 if T(t) < 1
the readjusting method: T_R = 25 epochs

Fig. 1. A schematic representation of the network for the symmetry problem.

Table 2 summarizes the experimental results. In general, a small w_init causes slow learning progress. For this problem, the batch mode provides better results than the on-line mode. Table 2(a) indicates that the rate of success obtained with the batch mode depends strongly upon the learning rate (ε = 0.05 is optimal and 0.07 is near-optimal for this problem). There are two reasons why a high rate of success is not achieved with the other values: one is a small learning rate (ε < 0.05), which causes failure to converge within the iteration limit, and the other is FLM or premature saturation, which frequently occurs with a relatively large learning rate (ε > 0.07). Actual output values of the network for some of the input vectors are often pushed towards the wrong extreme value by competition among the vectors. We say then that the output unit for such a vector is incorrectly saturated. In this case, since the activity level of the output unit is in the saturated portion of the sigmoid curve, the network hardly improves its performance. A large learning rate is apt to cause this situation. The DBD rule overcomes the problem of a small learning rate. As shown in Table 2(c),
with a small initial ε (ε(0) ≤ 0.07), the method achieves high rates of success and rapid convergence, but a large initial ε causes failure of convergence; i.e., this method cannot ease the problem of FLM and premature saturation. STELA improves the results with ε = 0.1 (see Table 2(d)), but it does not overcome either problem. MFA and the readjusting method offer similar results (in both rate of success and learning speed), as summarized in Table 2. In contrast to DBD, MFA and the readjusting method achieve high rates with a large learning rate (ε > 0.05), but cannot ease the problem of a small learning rate. This is because these methods have no ability to accelerate a slow process caused by such a small ε. On the other hand, all trials with a large ε met with success, because the features mentioned in Section 3 were avoided by readjusting the weights. As is evident from Table 2, convergence using the readjusting method is faster than that using the batch mode, or at least approximately the same. With MFA, during the initial stage, a small value of the slant parameter prevents the output unit from being incorrectly saturated.

Fig. 2 depicts the mean square error vs learning iterations for (a) the batch mode, (b) the on-line mode, (c) DBD, (d) STELA, (e) MFA and (f) the readjusting method. All runs were performed under the condition ε/ε(0) = 0.1 and started from the same weights (w_init = 0.3). Except for (b), all trials converged successfully. The trial with the on-line mode converged to a false local minimum (cf. Table 3). Roughly speaking, the successful convergence behaviors can be grouped into two types: the first contains (a), (c) and (f); the second, (d) and (e). Each member of the first group has two plateaux, which suggests that two flat areas of the error surface were encountered. The batch mode required the longest time to converge. In particular, it spent most of its time traversing the first flat region. DBD and the readjusting method can significantly reduce this time and accordingly achieve rapid convergence. The members of the second group behave in different ways, because they explicitly vary the steepness.

In Fig. 3, degrees of saturation for an input vector which gives e_max are plotted as a function of learning iterations for (a) the batch mode and (b) the readjusting method (the learning processes shown in Fig. 2(a) and 2(f), respectively). Fig. 3(c) shows the saturation degree for the output unit in (a) and (b) at an increased scale. While the output unit is incorrectly saturated (the degree of saturation is approximately 0.5), the mean square error remains on the first plateau. Some ripples coincident with the readjusting operation are observed in Fig. 3(b). Fig. 3(c) clearly shows that the degree of saturation is reduced by the operation, while no fluctuation is observed in the case of the batch mode. Although the ripples are small, the operation has a significant effect on the decrease of the degree of saturation. In Fig. 4, sums of the weight changes (Σ|Δw|) are plotted as a function of learning iterations for (a) the batch mode and (b) the readjusting method. As shown in the figure, the readjusting method can keep Δw large while the error is
large. The batch mode with ε = 0.2 failed to keep Δw large because the output unit was incorrectly saturated.

Weights obtained by the readjusting method are shown in Table 3 (Type I). The basic structure of the network is identical to that which Rumelhart et al. (1986) obtained. That is, the network is arranged so that both hidden units receive a net input of zero for only the symmetric input vectors. Almost all successful trials converged to weights of this type, but a few to Type II.
Table 2
Obtained results

(a) Batch mode
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             33   1398 ± 81
0.05     100  1060 ± 48     100  938 ± 62      100  893 ± 91      93   872 ± 221
0.07     100  1248 ± 45     100  1157 ± 131    92   1095 ± 211    68   845 ± 284
0.1      0    –             0    –             9    1484 ± 91     38   899 ± 309
0.2      0    –             0    –             0    –             0    –

(b) On-line mode (× 64 iterations)
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             0    –
0.05     0    –             3    1556 ± 31     24   1169 ± 178    0    –
0.07     0    –             8    1237 ± 165    32   855 ± 138     1    1333
0.1      0    –             4    783 ± 43      31   607 ± 102     0    –
0.2      1    555           0    –             19   303 ± 38      0    –

(c) DBD
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     100  420 ± 40      100  344 ± 26      100  326 ± 28      99   299 ± 34
0.02     100  446 ± 44      100  371 ± 47      99   334 ± 36      96   303 ± 41
0.05     82   745 ± 190     97   565 ± 111     99   481 ± 77      89   445 ± 138
0.07     100  875 ± 130     100  718 ± 85      100  663 ± 130     87   663 ± 304
0.1      0    –             16   1535 ± 44     34   1298 ± 152    47   828 ± 327
0.2      0    –             0    –             0    –             0    –

(d) STELA
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             1    1600          42   1468 ± 107
0.05     100  1020 ± 65     100  841 ± 68      100  776 ± 72      99   693 ± 102
0.07     100  794 ± 46      100  671 ± 48      100  622 ± 54      97   556 ± 86
0.1      100  809 ± 59      100  661 ± 55      100  588 ± 65      93   502 ± 110
0.2      0    –             0    –             1    1266          32   907 ± 358

(e) MFA (T(0) = 500)
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             15   1581 ± 29
0.05     100  1189 ± 53     100  1048 ± 52     100  995 ± 53      100  907 ± 55
0.07     100  941 ± 37      100  846 ± 37      100  812 ± 38      100  750 ± 39
0.1      100  777 ± 28      100  709 ± 28      100  674 ± 28      100  629 ± 30
0.2      100  609 ± 22      100  559 ± 21      100  532 ± 22      100  499 ± 23

(f) Readjusting (T_R = 25 epochs)
ε/ε(0)   w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             27   1469 ± 77
0.05     100  1283 ± 106    100  997 ± 93      100  896 ± 90      100  787 ± 109
0.07     100  1031 ± 83     100  813 ± 71      100  735 ± 69      100  644 ± 97
0.1      100  871 ± 70      100  688 ± 62      100  621 ± 65      100  546 ± 100
0.2      100  762 ± 63      100  567 ± 104     100  530 ± 166     100  461 ± 186

RS, rate of success; LI, learning epochs/iterations required for convergence. The data include only successful trials.
Fig. 2. Mean square error vs learning iterations: (a) the batch mode; (b) the on-line mode; (c) DBD; (d) STELA; (e) MFA; (f) the readjusting method.
The convergence criterion can be satisfied only with these two types, which are the global minima for the symmetry problem. The learning process shown in Fig. 2(b) converged to Type III in Table 3. This set of weights is a false minimum. The structure of the network is different from that of the above networks. The weights connected to H1 have the same sign, except for the bias. That is, this hidden unit is on only when the input vector {0,0,0,0,0,0} is presented. Similarly, only {0,1,0,0,1,0} can send a positive value to H2. The output O1 is 1 when one of the hidden units is on. The network therefore misclassifies the input vectors shown in Table 4. Type IV is another FLM. This network cannot respond correctly to the input vectors shown in Table 4, because the pairs of weights 1 and 6, and 8 and 13, have the same sign.
Table 3
Obtained weights

Weight no.   Type I     Type II    Type III   Type IV
1            -11.91     -11.43     -4.93      -4.93
2            -5.98      8.59       -5.58      10.08
3            2.99       5.64       -4.85      -5.10
4            -2.99      -5.64      -4.87      5.10
5            5.97       -8.59      -5.51      -10.07
6            11.91      11.43      -4.91      -0.24
7            -2.11      1.79       0.43       -1.70
8            11.97      -11.52     -4.57      -9.54
9            6.01       8.65       3.88       11.19
10           -3.01      5.69       -4.56      -5.70
11           3.01       -5.69      -4.61      5.71
12           -6.01      -8.65      3.99       -11.18
13           -11.97     11.52      -4.55      -6.92
14           -2.11      -1.99      -7.42      0.90
15           -15.41     16.74      8.56       -14.11
16           -15.41     -17.70     8.11       11.05
17           6.97       8.19       -2.02      -1.97
Minimum      Global     Global     Local      Local
Fig. 3. The degree of saturation vs learning iterations: (a) the batch mode; (b) the readjusting method; (c) the degree of saturation for the output unit in (a) and (b) at an increased scale.
Trials starting from these two types of weights (Types III and IV) were carried out for all the methods in this section, to assess their ability to escape from the false minima. Only the readjusting method reached the global minimum, because the weights were reset at the first readjusting operation. The other methods could not improve the performance. Weight rescaling with a small β can be used instead of weight resetting. Although it can escape the false minimum, it requires more time to get out of the minimum. In this way, weight resetting is effective in saving the time taken to escape from FLM.

As mentioned in Section 4, the proposed method is the combination of weight readjusting and resetting. The above experiment starting from FLM reveals that the latter is effective for escaping from a false minimum. Next we discuss which is more effective in a real learning process. Table 5 summarizes the average number of resettings that occurred in a trial. Each figure was calculated over the 100 trials whose RS and LI are given in Table 2(f). Almost no resetting occurred except in the trials with ε = 0.2 and w_init = 0.1/0.3. This is because premature saturation easily occurs under these conditions. However, even with large ε and w_init, the frequency of resetting is not very high: once per few trials. To ascertain the effectiveness of readjusting, additional simulations were carried out with the proposed method without resetting (i.e., with only readjusting). The results given in Table 6(a) show that this method provides results similar to those obtained by the full proposed method (Table 2(f)). On the other hand, there is no significant difference between the results with the batch mode (Table 2(a)) and the proposed method without readjusting, i.e., with only resetting (Table 6(b)). These results demonstrate that weight readjusting is the more effective component. Resetting is useful if ε and/or w_init are large.

If the rate of success depended strongly upon the readjusting interval T_R, the modified method would be impractical. To examine the dependence upon T_R, the rate of success was estimated with various T_R (5 ≤ T_R ≤ 200 epochs). In addition, for MFA, the dependence upon T(0) was also estimated with 100 ≤ T(0) ≤ 3000.
Table 4
Misclassified vectors

                  I1   I2   I3   I4   I5   I6
(a) For Type III
                  0    0    1    1    0    0
                  0    1    1    1    1    0
                  1    0    0    0    0    1
                  1    0    1    1    0    1
                  1    1    0    0    1    1
                  1    1    1    1    1    1

(b) For Type IV
                  1    0    0    0    0    1
                  1    0    1    1    0    1
                  1    1    0    0    1    1
                  1    1    1    1    1    1
Fig. 4. The sum of |Δw| vs learning iterations: (a) the batch mode; (b) the readjusting method.
These simulations were carried out using ε = 0.2 and w_init = 0.3, with which all trials using the batch mode ended in failure to converge. Tables 7 and 8 summarize the results for the proposed method and MFA, respectively. The former indicates that the modified method is not sensitive to T_R and that a wide range of T_R is acceptable, whereas the latter shows that a small T(0) degrades learning performance and too large a T(0) retards the learning process.

Next, we discuss whether or not the proposed method is effective for noise-contaminated data. To check this, noise generated from random numbers in the range [−0.03, 0.03] was added to each input vector. This means that an input vector always changed slightly and no identical vectors were presented to the network. The results in Table 9 indicate that the readjusting method works well for noise-contaminated data, but that a large amount of noise decreases the learning speed.

Table 5
Number of resettings. Each figure denotes the average number of resettings which occurred in a trial
ε        w_init = 0.01   w_init = 0.05   w_init = 0.1   w_init = 0.3
0.01     0               0               0              0.07
0.02     0               0               0              0.01
0.05     0               0               0              0.01
0.07     0               0               0              0
0.1      0               0               0              0.03
0.2      0               0.05            0.31           0.50
5.2. Random mapping problem

The purpose of treating this problem was to establish a mapping between five-dimensional input and output vectors (Kung & Hwang, 1988), each component of which was a random value in the range [0.05, 0.95]. Six pairs (Table 10) were used as learning patterns for a network consisting of five input and output units and a layer of seven hidden units. This problem was solved using the original BP, MFA and readjusting methods. Each method was performed according to both the on-line and the batch type learning rules, i.e., on-line BP, batch BP, on-line MFA, batch MFA, on-line readjusting and batch readjusting. The parameters α and e_th were fixed at constant values of 0.9 and 0.05, respectively. Learning rates of 0.2, 0.5, 0.7, 1.0 and 2.0, and w_init of 0.3 were used for this problem. The other parameters were as follows:

on-line MFA: T(t) = 100/(100T(0) − t), T(0) = 50, a = 1 if T(t) < 1
batch MFA: T(t) = 1000/(1000T(0) − t), T(0) = 1.5, a = 1 if T(t) < 1
on-line readjusting: T_R = 2500 iterations
batch readjusting: T_R = 400 epochs
Table 6
Performance of weight readjusting and resetting
(a) Only readjusting
ε        w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             21   1485 ± 68
0.05     100  1283 ± 106    100  997 ± 93      100  896 ± 90      99   807 ± 94
0.07     100  1031 ± 83     100  813 ± 71      100  735 ± 69      100  676 ± 93
0.1      100  871 ± 70      100  688 ± 62      100  621 ± 65      99   575 ± 83
0.2      100  762 ± 63      95   584 ± 93      81   496 ± 69      73   420 ± 71

(b) Only resetting
ε        w_init = 0.01      w_init = 0.05      w_init = 0.1       w_init = 0.3
         RS   LI            RS   LI            RS   LI            RS   LI
0.01     0    –             0    –             0    –             0    –
0.02     0    –             0    –             0    –             41   1468 ± 96
0.05     100  1088 ± 46     100  964 ± 55      100  917 ± 76      96   895 ± 210
0.07     100  1275 ± 40     100  1181 ± 104    93   1125 ± 172    69   925 ± 288
0.1      0    –             0    –             0    –             23   1077 ± 287
0.2      0    –             0    –             0    –             0    –

RS, rate of success; LI, learning epochs required for convergence. The data include only successful trials.
We examined combinations of T(0) ranging from 1.2 to 10 000 and various annealing schedules; the above combinations provided the best results. For each ε, the rate of success within the iteration limit was estimated over 100 trials. The iteration limit was one million iterations for the on-line type learning and 170 000 epochs for the batch type learning.

The results are shown in Table 11. In general, the methods based on the batch type learning rule offer higher rates of success and faster convergence than the methods based on the on-line type rule. For this problem, the readjusting method (both on-line and batch) achieved the highest rates of success among all the examined methods. With the on-line readjusting method, however, an excessive learning rate caused frequent weight resetting, which retarded the learning process. This inefficiency of the method was conspicuous when ε = 2.0 was used. Using the batch type rule can ease the problem of frequent resetting.

During the initial stage of MFA, which sometimes failed to converge, the absolute values of the weights tended to become large to compensate for the small a(t). As an illustration, consider a unit which has a single input of 1 and suppose that its desired output is 0.7. To generate an output of 0.7, the weight must be 0.847 when a = 1, whereas it must be 1.695 when a = 0.5. If the desired output is represented by a binary value, the absolute values can be large anyway, because the total input of a unit must be fairly large to generate an output near 1; MFA, therefore, can work well for the symmetry problem. On the other hand, if the desired output has a continuous value, the weights must take adequate values, because improperly large weights might cause incorrect saturation when a(t) increases. Once incorrect saturation occurs, MFA cannot ease this difficult situation and consequently fails to converge.
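The weight values in this example follow from inverting the sigmoid of Eq. (2) for a unit with a single input of 1 (our own worked step):

$$ f(aw) = \frac{1}{1 + e^{-aw}} = d \;\Longrightarrow\; w = \frac{1}{a}\ln\frac{d}{1-d} $$

so for d = 0.7, w = ln(7/3) ≈ 0.847 when a = 1, and w ≈ 1.695 when a = 0.5, confirming the figures quoted above.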
Table 7
Parameter dependence of the readjusting method

T_R      RS    LI
5        92    719 ± 369
10       100   405 ± 135
20       99    426 ± 133
50       99    548 ± 208
100      96    681 ± 195
200      84    948 ± 169

RS, rate of success; LI, learning epochs required for convergence. The data include only successful trials. ε = 0.2, w_init = 0.3.

Table 8
Parameter dependence of MFA

T(0)     RS    LI
100      0     –
150      0     –
200      98    992 ± 213
400      100   474 ± 38
1000     100   767 ± 24
3000     67    1571 ± 24

RS, rate of success; LI, learning epochs required for convergence. The data include only successful trials. ε = 0.2, w_init = 0.3.

Table 9
Results for noise-contaminated data
                           Readjusting            Batch mode
                           RS    LI               RS    LI
ε = 0.2, w_init = 0.3      98    715 ± 291        0     –
ε = 0.07, w_init = 0.1     99    1132 ± 233       79    1376 ± 186

RS, rate of success; LI, learning epochs required for convergence. The data include only successful trials.
Table 10
Input and output vectors for the random mapping problem

Input                              Output
0.05  0.05  0.7   0.7   0.4        0.95  0.6   0.8   0.95  0.6
0.7   0.3   0.2   0.8   0.05       0.6   0.2   0.6   0.9   0.05
0.6   0.1   0.7   0.1   0.95       0.3   0.2   0.1   0.3   0.8
0.5   0.3   0.05  0.7   0.95       0.4   0.9   0.95  0.8   0.8
0.9   0.6   0.5   0.9   0.5        0.6   0.3   0.3   0.1   0.1
0.2   0.8   0.95  0.05  0.3        0.5   0.5   0.6   0.8   0.7
5.3. Further modification

The above simulations reveal that the readjusting method fails to reach convergence when the learning rate is too small or excessively large. This suggests combining it with learning rate adaptation. Although various adaptation techniques can be used with the readjusting method, the present study examines DBD and the following simple rule (the error-level-based rule). At every readjusting operation, ε is multiplied by a factor k which is determined according to the error level:

$$ \varepsilon(t) = k\,\varepsilon(t-1) \qquad (21) $$

where ε(t) is limited to a predetermined value ε_max, i.e., ε(t) = ε_max if ε(t) > ε_max. The learning process is accelerated with k > 1, and decelerated with k < 1. The factors are 1.1 and 1.05 for error levels II and III, respectively. For error level I, no operation is needed and k is set to '1', because the process is expected to be completed successfully. For both adaptation techniques, a factor of 0.9 is applied to decelerate the learning process when an excessive learning rate causes weight resetting. This prevents frequent resets, because the learning rate is decreased at every reset.

To examine the effect of these techniques, simulations were carried out under the same conditions as in Sections 5.1 and 5.2. The values of ε in the above simulations were used as initial values of the learning rate, ε(0). The values of the parameter ε_max were 0.25 and 2.0 for the symmetry problem and the random mapping problem, respectively. For both problems, w_init = 0.3 was used. The results are summarized in Table 12 (symmetry) and Table 13 (random map). For the latter, only the error-level-based rule was examined. The tables indicate that the readjusting method with learning rate adaptation overcomes the above-mentioned shortcoming. DBD provides faster convergence for small ε(0), while the error-level-based rule achieves a higher rate of success for large ε(0). The combination of MFA and DBD was also examined. However, not even DBD can accelerate the slow learning process caused by a large T(0). The learning speed of MFA is highly dependent upon the choice of the initial temperature and the annealing schedule. From this point of view, the readjusting method is practical.
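A sketch of the error-level-based rule, applied at every readjusting operation, follows. The factors are those stated above; eps_max = 2.0 corresponds to the random mapping problem (0.25 was used for the symmetry problem), and the function name is ours.

```python
# Sketch of the error-level-based rule, Eq. (21), with the factors of
# Section 5.3 (k = 1, 1.1 and 1.05 for levels I-III; 0.9 after a reset).
def adapt_learning_rate(eps, e_max, e_th, weights_were_reset,
                        eps_max=2.0):
    if weights_were_reset:
        k = 0.9                      # decelerate to prevent frequent resets
    elif e_max <= 2.0 * e_th:        # error level I: leave eps unchanged
        k = 1.0
    elif e_max <= 3.4 * e_th:        # error level II
        k = 1.1
    else:                            # error level III
        k = 1.05
    return min(k * eps, eps_max)     # Eq. (21), capped at eps_max
```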
Table 11
Results for the random map problem
                 Original BP            MFA                    Readjusting
ε                RS    LI               RS    LI               RS    LI              NR

On-line based methods
0.2              66    47.3 ± 34.7      70    49.0 ± 32.5      77    67.2 ± 41.2     10.7 ± 9.9
0.5              57    43.3 ± 45.3      63    36.8 ± 37.2      94    51.8 ± 38.3     20.6 ± 19.3
0.7              52    43.8 ± 38.5      69    43.0 ± 49.0      99    46.5 ± 33.5     26.0 ± 24.0
1.0              43    42.7 ± 38.2      72    36.2 ± 34.0      94    47.0 ± 36.5     35.4 ± 30.6
2.0              38    42.5 ± 42.3      55    42.7 ± 43.0      73    66.3 ± 43.0     102.5 ± 76.6

Batch based methods
0.2              65    52.0 ± 42.0      69    53.8 ± 41.5      78    70.3 ± 42.7     4.5 ± 4.6
0.5              80    33.3 ± 34.8      79    33.2 ± 34.2      93    38.8 ± 35.7     4.6 ± 5.4
0.7              76    23.0 ± 24.8      83    28.7 ± 31.3      100   28.0 ± 30.2     4.6 ± 4.8
1.0              85    23.5 ± 30.5      86    22.0 ± 25.5      94    21.8 ± 25.5     5.5 ± 6.7
2.0              70    34.7 ± 45.5      84    17.7 ± 24.7      96    9.0 ± 10.5      6.1 ± 5.6

RS, rate of success; LI, learning iterations/epochs required for convergence (× 6000 iterations for the on-line based methods, × 1000 epochs for the batch based methods); NR, number of weight resettings. The data include only successful trials. w_init = 0.3.
Table 12
Results with the combinations of readjusting and learning rate adaptation techniques (the symmetry problem)
ε(0)     DBD                    Error level based
         RS    LI               RS    LI
0.01     99    396 ± 146        93    1232 ± 163
0.02     100   358 ± 80         100   826 ± 130
0.05     92    500 ± 228        100   498 ± 48
0.07     84    593 ± 275        100   448 ± 78
0.1      83    608 ± 246        99    409 ± 85
0.2      88    646 ± 282        100   457 ± 209

RS, rate of success; LI, learning epochs required for convergence. The data include only successful trials. T_R = 25 epochs, w_init = 0.3.

Table 13
Results with the combination of readjusting and learning rate adaptation technique (the random map problem)
ε(0)     On-line (T_R = 2500 iterations)          Batch (T_R = 400 epochs)
         RS    LI              NR                 RS    LI              NR
0.2      98    54.2 ± 41.2     31.3 ± 31.4        100   26.5 ± 20.0     5.3 ± 7.6
0.5      99    45.7 ± 34.0     28.4 ± 26.2        98    19.2 ± 18.0     5.0 ± 6.2
0.7      99    44.8 ± 35.7     30.3 ± 27.9        100   17.5 ± 20.3     3.9 ± 4.9
1.0      94    41.2 ± 32.3     29.3 ± 24.2        100   14.2 ± 13.8     5.4 ± 6.1
2.0      98    42.8 ± 35.0     33.9 ± 26.7        99    14.3 ± 18.7     5.3 ± 4.5

RS, rate of success; LI, learning iterations/epochs required for convergence (× 6000 iterations for the on-line based method, × 1000 epochs for the batch based method); NR, number of weight resettings. The data include only successful trials. w_init = 0.3.
5.4. Guidelines for determining parameter values

Some additional parameters are introduced in the readjusting method. Here, guidelines for determining their values are given. As shown in Section 5.1, a wide range of T_R is acceptable. Generally speaking, a value of several epochs or greater is suitable for T_R but, as expected, a very large value will produce a result similar to that of the batch mode. The same values of γ were used for both the symmetry and the random map problems, although there may be more suitable values for each problem. At this moment, it is difficult to set the values theoretically; however, they are well suited to these different types of problems. One important point is to choose the error tolerance, e_th, properly, because the boundaries of the error levels are determined by e_th. An inappropriate tolerance sometimes causes inferior performance of a network, or frequent resetting and therefore slow learning progress. As mentioned before, e_th should be set according to the nature of the problem and the required accuracy. If weights are reset often, the learning rate must be decreased (or e_th must be increased). On the other hand, if a learning process does not converge within an appropriate iteration limit and weights are not reset, the learning rate must be increased.
6. Concluding remarks

The present study has focused on one of the important problems in back-propagation learning. Some features related to FLM have been discussed, and a modified back-propagation method has been proposed to avoid them. Its basic idea lies in the readjustment of weights while the maximum error is large. Numerical experiments on two different types of test problems substantiate the validity of the method. In a reasonable amount of time, almost all trials converged to a global minimum or at least to near-optimal solutions. It is also shown that the method can detect the states related to FLM and escape from FLM. This method can be used with other techniques, such as the delta-bar-delta rule, and such combinations are usually more powerful.
References

Bilbro, G. L., Snyder, W. E., Garnier, S. J., & Gault, J. W. (1992). Mean field annealing: A formalism for constructing GNC-like algorithms. IEEE Transactions on Neural Networks, 3, 131–138.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Hush, D. R., & Horne, B. G. (1993). Progress in supervised neural networks: What's new since Lippmann? IEEE Signal Processing Magazine, 10, 8–39.

Hush, D. R., Horne, B., & Salas, J. M. (1992). Error surfaces for multilayer perceptrons. IEEE Transactions on Systems, Man, and Cybernetics, 22, 1152–1161.

Hush, D. R., & Salas, J. M. (1988). Improving the learning rate of back-propagation with the gradient reuse algorithm. In Proceedings of the IEEE Conference on Neural Networks (Vol. I, pp. 441–447). Piscataway, NJ: IEEE Press.

Irie, B., & Miyake, S. (1988). Capabilities of three-layered perceptrons. In Proceedings of the IEEE Conference on Neural Networks (Vol. I, pp. 641–648). Piscataway, NJ: IEEE Press.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.

Kirkpatrick, S., Gelatt, C. D. Jr., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

Kolen, J. F., & Pollack, J. B. (1991). Back propagation is sensitive to initial conditions. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 860–867). San Mateo, CA: Morgan Kaufmann.
Kramer, A. H., & Sangiovanni-Vincentelli, A. (1989). Efficient parallel learning algorithms for neural networks. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 40–48). San Mateo, CA: Morgan Kaufmann.

Krzyzak, A., Dai, W., & Suen, C. Y. (1990). Classification of large set of handwritten characters using modified back propagation model. In Proceedings of the International Joint Conference on Neural Networks (Vol. III, pp. 225–232). Piscataway, NJ: IEEE Press.

Kung, S. Y., & Hwang, J. N. (1988). An algebraic projection analysis for optimal hidden units size and learning rates in back-propagation learning. In Proceedings of the IEEE Conference on Neural Networks (Vol. I, pp. 363–370). Piscataway, NJ: IEEE Press.

Lee, B. W., & Sheu, B. J. (1993). Paralleled hardware annealing for optimal solutions on electronic neural networks. IEEE Transactions on Neural Networks, 4, 588–599.

Lee, Y., Oh, S. H., & Kim, M. W. (1993). An analysis of premature saturation in back propagation learning. Neural Networks, 6, 719–728.

McInerney, J. M., Haines, K. G., Biafore, S., & Hecht-Nielsen, R. (1989). Back propagation error surfaces can have local minima. In Proceedings of the International Joint Conference on Neural Networks (Vol. II, p. 627). Piscataway, NJ: IEEE Press.

Nguyen, D., & Widrow, B. (1990). Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In Proceedings of the International Joint Conference on Neural Networks (Vol. III, pp. 21–26). Piscataway, NJ: IEEE Press.

Parker, D. B. (1987). Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation, and second order Hebbian learning. In Proceedings of the IEEE Conference on Neural Networks (Vol. II, pp. 593–600). Piscataway, NJ: IEEE Press.

Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.

Rezgui, A., & Tepedelenlioglu, N. (1990). The effect of the slope of the activation function on the back propagation algorithm. In Proceedings of the International Joint Conference on Neural Networks (Vol. I, pp. 707–710). Piscataway, NJ: IEEE Press.

Rigler, A. K., Irvine, J. M., & Vogl, T. P. (1991). Rescaling of variables in back propagation learning. Neural Networks, 4, 225–229.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

van Ooyen, A., & Nienhuis, B. (1992). Improving the convergence of the back-propagation algorithm. Neural Networks, 5, 465–471.

Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59, 257–263.

Watrous, R. L. (1987). Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization. In Proceedings of the IEEE Conference on Neural Networks (Vol. II, pp. 619–627). Piscataway, NJ: IEEE Press.

Weir, M. K. (1991). A method for self-determination of adaptive learning rates in back propagation. Neural Networks, 4, 371–379.

Wessels, L. F. A., & Barnard, E. (1992). Avoiding false local minima by proper initialization of connections. IEEE Transactions on Neural Networks, 3, 899–905.

Xu, L., Klasa, S., & Yuille, A. (1992). Recent advances on techniques of static feedforward networks with supervised learning. International Journal of Neural Systems, 3, 253–290.

Yamada, K., Kami, H., Tsukumo, J., & Temma, T. (1989). Handwritten numeral recognition by multilayer neural network with improved learning algorithm. In Proceedings of the International Joint Conference on Neural Networks (Vol. II, pp. 259–266). Piscataway, NJ: IEEE Press.