Neurocomputing 69 (2006) 2301–2308
Analysis of two restart algorithms

Wei Lin, Tianping Chen

Laboratory of Nonlinear Mathematics Science, Institute of Mathematics, Fudan University, Shanghai 200433, PR China

Received 27 April 2004; received in revised form 26 April 2005; accepted 27 April 2005. Available online 19 January 2006. Communicated by M. Magdon-Ismail.
Abstract

Since the backpropagation algorithm used for neural network training suffers from slow convergence and often becomes stuck in local minima, the restart mechanism has been introduced: cut off the training process and restart it with a fresh initialization whenever it seems unlikely to converge within a relatively short time. In this paper, we give a detailed mathematical analysis of two versions of the restart algorithm. By deriving analytic expressions for the expected convergence time and the success rate, we explain why the restart algorithms work well and gain insight into the proper use of restarting. Numerical simulations are performed on the XOR problem, symmetry detection, the parity problem and Arabic numeral recognition. We show the effectiveness of the restart algorithms and compare them with simulated annealing. The analysis can also be applied to many other fields.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Restart mechanism; Neural network training; Backpropagation algorithm; Gradient descent method
1. Introduction

The backpropagation algorithm opened avenues for the application of multilayer perceptrons [18] and has been widely used for neural network training. However, as a gradient descent method, it suffers from slow convergence and often becomes stuck in local minima [5,13]. The convergence time depends strongly on the initial conditions of the synaptic weights and thresholds and varies widely from run to run [9]. Many techniques have been developed to accelerate convergence, overcome local minima and improve the success rate of learning [1,12,14,19,22]. Although these methods work well and have obtained good results, the problem has not been settled completely, and the complexity of some techniques often makes them difficult to follow.

The restart algorithm can work together with almost all training methods and improve them without any increase in complexity. It was proposed by Fahlman [3] in the context of backpropagation training, but it is not restricted to the field of neural networks [4,15]. Its idea is very simple: cut off the training process when it
has shown excessively slow convergence, restart it with new random weights and thresholds, and proceed. This restart mechanism helps the training avoid long stretches of wasted computation and find a new path that approaches the solution faster. According to when the training process should be cut short, the restart algorithm has two versions: restart after a specific period of time, or restart when the training gets stuck. These two versions are often adopted in practice but lack theoretical analysis of their efficiency.

In this paper, we analyze the two restart algorithms in terms of the expected convergence time and the success rate of training. The theorems explain why the restart algorithms work well. Numerical simulations are performed on the XOR problem, symmetry detection, the parity problem and Arabic numeral recognition. We show the effectiveness of the restart algorithms and compare them with simulated annealing. The simulation results show consistency between theory and practice.

2. Related work

According to when the training process should be cut off and restarted, there are two versions of the restart algorithms. The first one cuts off the training process when it does not converge after a specific period of time (often given as a fixed number of epochs), denoted by T. This algorithm is described as follows:
Algorithm 1.
Step 1: Initialize the synaptic weights and thresholds randomly.
Step 2: Present the training samples to the network and update the synaptic weights and thresholds accordingly.
Step 3: If the error function has converged to the goal, stop; otherwise, continue to Step 4.
Step 4: If the time since the last initialization has reached T, go to Step 1; otherwise, go to Step 2.

Algorithm 1 was investigated in [11], where some theoretical results were given. We briefly present that work here, using the notation of that paper (see Table 1).

Table 1
Nomenclature

t            Training time (number of epochs)
T            Restart time
T′           Upper limit of total training time
T̄            Expected convergence time without restart
T_RES        Expected convergence time of Algorithm 1
T′_RES       Expected convergence time of Algorithm 2
p(t)         Probability density of convergence time
α_T          Probability of restarting within time T
P_∞          Probability of not converging in training
S_RES        Success rate of Algorithm 1
S′_RES       Success rate of Algorithm 2
q_conv(t)    Probability density of the time for converging directly
q_stick(t)   Probability density of the time for getting stuck
E(t)         Error function
N            Steps in the restart criterion of Algorithm 2
σ            Tolerance in the restart criterion of Algorithm 2
η            Learning rate
ε            Stopping error

Since the synaptic weights and thresholds are picked from a specific random distribution, the convergence time obeys a certain probability density, denoted by p(t). Note that converging to a local minimum is not counted as convergence here. Let T̄ = E[t] be the expected convergence time without restart, and let T_RES(T) = E_RES[t] be the expected convergence time when Algorithm 1 is applied. Denote by α_T the probability that the restart is taken at least once, i.e.

$$\alpha_T = 1 - \int_0^T p(t)\,dt. \qquad (1)$$

The following theorems were given in [11].

Theorem 1. The optimal time to restart is the time T that minimizes

$$T_{RES}(T) = \frac{\alpha_T}{1-\alpha_T}\,T + \frac{1}{1-\alpha_T}\int_0^T t\,p(t)\,dt \qquad (2)$$

with respect to T.
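To make Theorem 1 operational, the following minimal Python sketch estimates α_T and T_RES(T) on a grid of candidate restart times from an empirical sample of convergence times, as Eqs. (1) and (2) prescribe. The data layout (a `times` array plus a `failed` mask for runs that never converged) and all names are our own illustration, not code from [11].

```python
import numpy as np

def restart_curves(times, failed, grid):
    """Estimate alpha_T (Eq. (1)) and T_RES(T) (Eq. (2)) on a grid of
    candidate restart times T, from an i.i.d. sample of training runs.

    times  : convergence time of each run (value ignored where failed=True)
    failed : boolean mask, True where the run never converged (the
             P_inf mass of p(t)); such runs only contribute to alpha_T
    grid   : candidate restart times T to evaluate
    """
    times = np.asarray(times, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    n_runs = len(times)
    alpha = np.empty(len(grid))
    t_res = np.empty(len(grid))
    for i, T in enumerate(grid):
        conv = ~failed & (times <= T)           # runs that converge within T
        alpha[i] = 1.0 - conv.sum() / n_runs    # Eq. (1)
        if alpha[i] == 1.0:                     # no mass below T: T_RES infinite
            t_res[i] = np.inf
            continue
        tail = times[conv].sum() / n_runs       # Monte-Carlo int_0^T t p(t) dt
        t_res[i] = (alpha[i] * T + tail) / (1.0 - alpha[i])  # Eq. (2)
    return alpha, t_res

# The optimal restart time of Theorem 1 is then grid[np.argmin(t_res)].
```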
Theorem 2. Assume that

$$p(t) = p_0(t) + P_\infty\,\delta(t-\infty), \qquad 0 < P_\infty < 1. \qquad (3)$$

Then it is always advantageous to restart for any T satisfying $\int_0^T p(t)\,dt > 0$, i.e., T_RES(T) < T̄ for such T.

3. Theoretical results

Theorems 1 and 2 are concerned with the expected convergence time, on which the notion of "optimal" above is based. In practice, the success rate of training, namely the probability that training succeeds, is also used as a criterion to evaluate an algorithm. To be rigorous, we first define two notions of when one algorithm is better than another.

Definition 1. A training algorithm is said to be faster if it has a smaller expected convergence time.

Definition 2. A training algorithm is said to be more successful if it has a greater success rate.

Theorems 1 and 2 can now be restated in terms of Definition 1.

Theorem 3. Algorithm 1 is made fastest by the restart time T that minimizes Eq. (2).

Theorem 4. Assume that Eq. (3) is satisfied. Then Algorithm 1 is faster than training without restart.

3.1. Success rate of Algorithm 1

Now we consider the success rate of Algorithm 1. When no upper limit is set on the total training time, i.e., the number of restarts is not restricted, the following theorem is obtained.

Theorem 5. Assume that $\int_0^T p(t)\,dt > 0$. The success rate of Algorithm 1 is S_RES(T) = 1.

Proof. From the assumption and Eq. (1), we get 0 ≤ α_T < 1. Since the probability of restarting at least n times is α_T^n, the probability of converging within nT is 1 − α_T^n. Thus the success rate of Algorithm 1 is

$$S_{RES}(T) = \lim_{n\to\infty}\,(1 - \alpha_T^{\,n}) = 1. \qquad \Box$$
Theorem 5 shows that under certain conditions Algorithm 1 is always more successful than training without restart. In fact, if $\int_0^T p(t)\,dt > 0$ and P_∞ > 0, the success rate of Algorithm 1 can reach 100% when the total training time is not restricted, strictly greater than the success rate without restart. In practice, however, the training process cannot go on forever, so we must set an upper limit on the total training time, denoted by T′. Then we have the following theorem.
Theorem 6. When T′ is finite, let n = ⌊T′/T⌋. Then the success rate of Algorithm 1 is given by

$$S_{RES}(T, T') = 1 - \alpha_T^{\,n}\,\alpha_{T'-nT}. \qquad (4)$$

Proof. Failing to converge within T′ means restarting n times and then not converging during the last T′ − nT units of time. Thus the probability of failing to converge within T′ is $\alpha_T^{\,n}\,\alpha_{T'-nT}$, and the success rate follows as Eq. (4). □

Remark. Given a fixed total training time T′, this result provides a method to make Algorithm 1 most successful: find the value of T that maximizes Eq. (4).
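As a concrete illustration, the following sketch evaluates Eq. (4) from the same kind of empirical sample used in the earlier sketch (a hypothetical `times` array and `failed` mask); maximizing it over a grid of T then picks the most successful restart time for a given budget T′.

```python
import numpy as np

def success_rate(T, T_prime, times, failed):
    """Eq. (4): S_RES(T, T') = 1 - alpha_T**n * alpha_{T'-nT}, n = floor(T'/T),
    with each alpha estimated from the empirical sample as in Eq. (1)."""
    times = np.asarray(times, dtype=float)
    failed = np.asarray(failed, dtype=bool)

    def alpha(tau):                              # empirical Eq. (1)
        return 1.0 - np.mean(~failed & (times <= tau))

    n = int(T_prime // T)
    return 1.0 - alpha(T) ** n * alpha(T_prime - n * T)

# The most successful restart time for a budget T' maximizes Eq. (4), e.g.
# best_T = max(grid, key=lambda T: success_rate(T, 10_000, times, failed))
```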
3.2. Another restart algorithm

The purpose of introducing the restart mechanism is to help the training avoid being trapped in local minima, so a natural idea is to cut off the training process immediately when it has become stuck in a local minimum. The difficulty lies in judging whether this has happened: when the training converges slowly, it may have been trapped in a local minimum, or it may just be crossing a relatively flat region of the error surface. However, restarting is still advantageous in the latter case, because such a plateau usually leads to an unbearably long convergence time. Hence, we consider another restart algorithm, different from Algorithm 1, which restarts the training process immediately when it is observed to converge too slowly. This algorithm is described as follows.

Algorithm 2.
Step 1: Initialize the synaptic weights and thresholds randomly.
Step 2: Present the training samples to the network and update the synaptic weights and thresholds accordingly.
Step 3: If the error function has converged to the goal, stop; otherwise, continue to Step 4.
Step 4: If the error function has shown excessively slow convergence, go to Step 1; otherwise, go to Step 2.

The slow convergence mentioned here may be measured in different ways and needs to be defined precisely in an implementation; we discuss the details in Section 4.1. Note that the probability density p(t) cannot be used to analyze Algorithm 2, because a training process that converges under Algorithm 1 may be cut off under Algorithm 2 for its slow convergence. To handle this, we introduce q_conv(t) and q_stick(t), the probability densities of the time for converging directly and of the time for getting stuck, respectively. Since a training process either converges directly or gets stuck, we have

$$\int_0^\infty q_{conv}(t)\,dt + \int_0^\infty q_{stick}(t)\,dt = 1. \qquad (5)$$

The following theorem gives the expected convergence time of Algorithm 2 in terms of q_conv and q_stick.

Theorem 7. Assume that $\int_0^\infty q_{conv}(t)\,dt > 0$. The expected convergence time of Algorithm 2 is

$$T'_{RES} = \frac{\int_0^\infty t\,q_{conv}(t)\,dt + \int_0^\infty t\,q_{stick}(t)\,dt}{\int_0^\infty q_{conv}(t)\,dt}. \qquad (6)$$

Proof. The expected convergence time T′_RES satisfies

$$T'_{RES} = \int_0^\infty t\,q_{conv}(t)\,dt + \int_0^\infty (t + T'_{RES})\,q_{stick}(t)\,dt. \qquad (7)$$

The two terms on the right-hand side represent, respectively, the case that the training converges directly and the case that the restart is taken at least once. In the latter case, the expected convergence time equals the time that has passed until the first restart, t, plus the expectation of a fresh training process, T′_RES. Solving Eq. (7) for T′_RES and using Eq. (5), we have

$$T'_{RES} = \frac{\int_0^\infty t\,q_{conv}(t)\,dt + \int_0^\infty t\,q_{stick}(t)\,dt}{1 - \int_0^\infty q_{stick}(t)\,dt} = \frac{\int_0^\infty t\,q_{conv}(t)\,dt + \int_0^\infty t\,q_{stick}(t)\,dt}{\int_0^\infty q_{conv}(t)\,dt}. \qquad \Box$$

Remark. Assume that the training process cannot go on forever without either converging or getting stuck, i.e., q_conv and q_stick have compact supports. Let [0, A] contain both supports. Then the numerator in Eq. (6) satisfies

$$\int_0^A t\,q_{conv}(t)\,dt + \int_0^A t\,q_{stick}(t)\,dt \le A\left(\int_0^A q_{conv}(t)\,dt + \int_0^A q_{stick}(t)\,dt\right) = A < \infty,$$

and the denominator is greater than zero, so T′_RES < ∞. However, when Eq. (3) is satisfied, we have T̄ = ∞. Therefore, in this case Algorithm 2 is faster than training without restart.

Now we consider the success rate of Algorithm 2. Theorems 8 and 9 cover the cases where the upper limit of total training time is infinite and finite, respectively.

Theorem 8. Assume that $\int_0^\infty q_{conv}(t)\,dt > 0$. The success rate of Algorithm 2 is S′_RES = 1.

Proof. From the assumption and Eq. (5), we have $0 \le \int_0^\infty q_{stick}(t)\,dt < 1$. Since the probability of restarting at least n times is $\left(\int_0^\infty q_{stick}(t)\,dt\right)^n$, the probability of converging with fewer than n restarts is $1 - \left(\int_0^\infty q_{stick}(t)\,dt\right)^n$. Then

$$S'_{RES} = \lim_{n\to\infty}\left[1 - \left(\int_0^\infty q_{stick}(t)\,dt\right)^{\!n}\right] = 1. \qquad \Box$$
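To see Eq. (6) in action, here is a minimal Monte-Carlo sketch under assumed toy densities: with probability 0.6 a run converges directly after an exponentially distributed time with mean 300 epochs (our stand-in for q_conv), and with probability 0.4 it gets stuck after an exponential time with mean 150 epochs and is restarted (q_stick). All constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sub-densities (illustrative assumptions).
P_CONV, MEAN_CONV, MEAN_STICK = 0.6, 300.0, 150.0

def one_training():
    """Simulate Algorithm 2: restart until some run converges directly."""
    total = 0.0
    while True:
        if rng.random() < P_CONV:
            return total + rng.exponential(MEAN_CONV)  # converges directly
        total += rng.exponential(MEAN_STICK)           # gets stuck; restart

simulated = np.mean([one_training() for _ in range(100_000)])

# Eq. (6): numerator = int t q_conv + int t q_stick, denominator = int q_conv
theory = (P_CONV * MEAN_CONV + (1 - P_CONV) * MEAN_STICK) / P_CONV
print(f"simulated {simulated:.1f} epochs, Eq. (6) predicts {theory:.1f}")
```

With these densities Eq. (6) gives (0.6·300 + 0.4·150)/0.6 = 400 epochs, which the simulation reproduces.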
Theorem 9. When T′ is set to a finite number, the success rate of Algorithm 2 satisfies the integral equation

$$S'_{RES}(T') = \int_0^{T'} q_{conv}(t)\,dt + \int_0^{T'} q_{stick}(t)\,S'_{RES}(T'-t)\,dt. \qquad (8)$$
Proof. In Eq. (8), the first term on the right-hand side represents the case that the training process converges directly within T′, and the second term represents the case that it converges after at least one restart. The probability of the latter case equals the probability of restarting at time t, q_stick(t), multiplied by the success rate within the remaining time, S′_RES(T′ − t), integrated with respect to t from 0 to T′. □

Remark. Theorem 8 provides a condition under which Algorithm 2 is more successful than training without restart. It is worth noting that the densities q_conv(t) and q_stick(t) depend on the criterion used to decide when to restart. Different criteria lead to different q_conv(t) and q_stick(t), and hence to different training effects, in expected convergence time as well as in success rate. It is therefore important to choose the best criterion, and the results of Theorems 7 and 9 can guide this choice, by minimizing the expected convergence time or maximizing the success rate.

4. Simulations and discussions

In this section, simulations are performed on several benchmark tasks: the XOR problem, symmetry detection, the parity problem and Arabic numeral recognition. The expected convergence time and the success rate serve as criteria for searching for the optimal parameters of the restart algorithms. The XOR problem is studied in detail first; more examples and discussions follow.

4.1. Experimental details of Algorithm 2
To put Algorithm 2 into practice, we first settle some experimental details. The criterion in Step 4 of Algorithm 2 is critical, since it strongly affects the densities q_conv(t) and q_stick(t). Generally speaking, to check whether the training process is converging too slowly, two criteria are commonly adopted: (i) the Euclidean norm of the gradient vector is sufficiently small; (ii) the change of the error function is sufficiently small. The latter is used in our simulations. Denoting the error function by E(t), the practical criterion is

$$|E(t) - E(t-N)| < \sigma, \qquad (9)$$

where N and σ are two parameters that should be chosen properly to optimize the training effects. As we show below, the training effects are relatively insensitive to these two parameters, which makes it easy to choose proper values for them.
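A minimal sketch of Algorithm 2 with criterion (9), under our own interface assumptions: `init_fn`, `step_fn` and `error_fn` are hypothetical callbacks standing in for the network initialization, one backpropagation epoch, and the error evaluation, and the per-run cap `t_max` is a safety guard of ours, not part of the algorithm.

```python
def train_with_restart(init_fn, step_fn, error_fn,
                       eps=1e-3, N=100, sigma=1e-4, t_max=10_000):
    """Sketch of Algorithm 2. Restart criterion (9): |E(t) - E(t-N)| < sigma.

    init_fn()   -> fresh random weights (Step 1)
    step_fn(w)  -> weights after one training epoch (Step 2)
    error_fn(w) -> current training error E (Step 3)
    """
    while True:
        w = init_fn()                    # Step 1: random (re)initialization
        errors = []                      # error history of this run
        for _ in range(t_max):           # safety cap per run (our addition)
            w = step_fn(w)               # Step 2: one epoch of training
            e = error_fn(w)
            errors.append(e)
            if e < eps:                  # Step 3: converged to the goal
                return w
            if len(errors) > N and abs(e - errors[-N - 1]) < sigma:
                break                    # Step 4: too slow, restart
```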
4.2. Case study on XOR problem

We choose the XOR (exclusive-OR) problem, a well-known and widely studied benchmark, as our first task; it was also used as an example in [11]. As in that paper, a 2–2–1 multilayer network is employed in our simulation. This network architecture for the XOR problem has been proved to have no local minima [6,20]. However, even without local minima, the backpropagation algorithm may become trapped at stationary points of any type, or be drawn towards an infinite plateau, either of which can cause training to fail [8,17]. We generate the initial values of the synaptic weights and thresholds from the uniform distribution on (−1, 1).

First, we need to determine the probability density p(t). Some methods have been proposed to obtain p(t) for backpropagation networks [7,10,16], but here we estimate it directly from a large number of trials. Let the learning rate η = 0.1, the stopping error for convergence (i.e., the error below which convergence is declared) ε = 1×10⁻³, and the upper limit of training time T′ = 10,000 epochs. The standard backpropagation algorithm was run 100,000 times, yielding a satisfactory estimate of p(t), shown in Fig. 1 (also given in [11]).

To optimize Algorithm 1, we need to decide the best value of the restart time T. According to Eqs. (2) and (4), the expected convergence time and the success rate are calculated from the estimate of p(t). As functions of T, T_RES and S_RES are shown in Fig. 2(a) (also given in [11]) and Fig. 2(b), respectively. From the figures, the best restart time is about 2550 epochs according to either criterion.

Fig. 1. The probability density p(t) in the XOR problem (x-axis: convergence time in thousand epochs; y-axis: probability density).
Fig. 2. The training effects of the XOR problem with Algorithm 1 when T varies. (a) T_RES as a function of the restart time T. (b) S_RES as a function of the restart time T.

Fig. 3. The probability density q_conv(t) in the XOR problem (x-axis: convergence time in thousand epochs; y-axis: probability density).
Then we switch to Algorithm 2. In Eq. (9), let N = 100 and σ = 1×10⁻⁴. The estimates of q_conv(t) and q_stick(t) were obtained from a large number of training trials (over 200,000 runs); the densities are shown in Figs. 3 and 4. To search for the best values of N and σ, we give them a set of values and measure the average convergence time and the success rate experimentally, i.e., we optimize T′_RES and S′_RES over a 2-D parameter space (a sketch of this search is given after the discussion below). Fixing one parameter while the other varies gives the training effects illustrated in Fig. 5. From the figures, it is easy to see that N = 100 and σ = 1×10⁻⁴ are the best values.

We briefly discuss the experimental results.

The optimal parameters judged by the expected convergence time and by the success rate are consistent. This shows that both criteria can be used to evaluate the efficiency of the restart algorithms; however, the success rate is often easier to calculate and observe in practice.
Fig. 4. The probability density q_stick(t) in the XOR problem (x-axis: sticking time in thousand epochs; y-axis: probability density).
Algorithm 2 usually performs better than Algorithm 1. In this example, with the parameters of both algorithms optimized, Algorithm 1 achieves at best T_RES = 2100 and S_RES = 99.8%, compared with T′_RES = 1450 and S′_RES = 100% for Algorithm 2. This is because in Algorithm 2 a slow training process is cut off and restarted earlier than in Algorithm 1, so more training time is saved.

Algorithm 2 is not very sensitive to the parameter values. In this example, when N varies from 50 to 1600 and σ varies from 1×10⁻⁴ to 1×10⁻⁷, the average convergence time stays below 2000 and the success rate remains at nearly 100%. This makes the restart algorithm very convenient to apply.
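The two-parameter search described above can be sketched as follows. The helper `run_trials` is hypothetical: it is assumed to train the network `runs` times with the given (N, σ) and report the average number of epochs and the success rate, e.g. by wrapping the `train_with_restart` sketch from Section 4.1. The grid values mirror the ranges explored in Fig. 5.

```python
from itertools import product

def grid_search(run_trials,
                Ns=(25, 50, 100, 200, 400, 800, 1600, 3200, 6400),
                sigmas=(1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9),
                runs=1000):
    """Optimize T'_RES and S'_RES of Algorithm 2 over the 2-D (N, sigma) grid.

    run_trials(N, sigma, runs) -> (average epochs, success rate) is a
    hypothetical helper that executes Algorithm 2 `runs` times.
    """
    results = {(N, s): run_trials(N, s, runs) for N, s in product(Ns, sigmas)}
    fastest = min(results, key=lambda k: results[k][0])          # min T'_RES
    most_successful = max(results, key=lambda k: results[k][1])  # max S'_RES
    return fastest, most_successful, results
```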
Fig. 5. The training effects of the XOR problem with Algorithm 2 when N or σ varies. (a) T′_RES vs. N with σ = 1×10⁻⁴. (b) T′_RES vs. σ with N = 100. (c) S′_RES vs. N with σ = 1×10⁻⁴. (d) S′_RES vs. σ with N = 100.
4.3. More tasks and comparisons

In this subsection, the restart algorithms are applied to more tasks, and all results are compared with simulated annealing, a widely used method for global optimization.

Symmetry detection. This problem is to detect whether the binary activity levels of a 1-D array of input neurons are symmetrical about the center point. To facilitate a large number of trials, we investigate the 3-bit case and train a 3–2–1 network. Let η = 0.1, ε = 1×10⁻³ and T′ = 10,000, over 10,000 runs. For Algorithm 1, T = 1400; for Algorithm 2, N = 100 and σ = 1×10⁻⁴.

Parity problem. The error surface of the N-bit parity problem becomes more intricate and admits more local minima as N increases, which makes it an excellent backpropagation benchmark. We take the 4-bit case as an example and adopt the standard 4–4–1 network architecture. Let η = 0.05, ε = 1×10⁻³ and T′ = 100,000, over 500 runs. For Algorithm 1, T = 10,000; for Algorithm 2, N = 1600 and σ = 1×10⁻⁴.

Arabic numeral recognition. The training patterns in this task are the Arabic numerals 0–9, each defined by a 5×3 pixel binary image, as shown in Fig. 6.

Fig. 6. The training patterns used for numeral recognition.
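For reference, minimal builders for the symmetry and parity training sets are sketched below; both simply enumerate all binary input vectors and label them by the stated rule (the function names are ours).

```python
import numpy as np
from itertools import product

def symmetry_set(bits=3):
    """All binary inputs of length `bits`; target 1 iff the pattern is
    symmetric about its center point."""
    X = np.array(list(product((0, 1), repeat=bits)))
    y = np.array([int(np.array_equal(x, x[::-1])) for x in X])
    return X, y

def parity_set(bits=4):
    """All binary inputs; target 1 iff the number of ones is odd."""
    X = np.array(list(product((0, 1), repeat=bits)))
    y = X.sum(axis=1) % 2
    return X, y
```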
Table 2
Simulation results and comparisons

                                 Standard BP  Algorithm 1  Algorithm 2  Simulated annealing
(a) XOR problem
    Average number of epochs:       2519         2040         1458         2062
    Success rate (%):               89.9         99.8         100          93.5
(b) 3-bit symmetry detection
    Average number of epochs:       2296         1225          839         1914
    Success rate (%):               84.9         100           100          91.8
(c) 4-bit parity problem
    Average number of epochs:      86672        52286        23385        37926
    Success rate (%):               14.8         77.6         99.4         79.8
(d) Arabic numeral recognition
    Average number of epochs:       3210         2563         2354         3198
    Success rate (%):               86.9         98.5         99.7         89.2
We employ a 15–9–10 network in our simulation. Let η = 0.1, ε = 1×10⁻³ and T′ = 10,000, over 1000 runs. For Algorithm 1, T = 3000; for Algorithm 2, N = 300 and σ = 1×10⁻⁵.

Since the classical simulated annealing version is not appropriate in the context of neural network training (see the discussion in [21]), we adopt the version used in [2], in which the change of the synaptic weights produced by simulated annealing at each iteration is defined by

$$\delta W = \delta w + n\,r\,2^{-kt},$$

where δw is the synaptic weight change produced by the standard backpropagation algorithm, n and k are constants, and r is a random number. In our simulations, we set n = 0.1 for the symmetry detection and the parity problem, n = 0.05 for the Arabic numeral recognition, k = 0.002, and r is generated uniformly in (−0.5, 0.5) for all tasks.
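A minimal sketch of this annealing perturbation, under the reading of the formula reconstructed above; whether r is drawn once per update or independently per weight is not specified in the text, so drawing it per weight is our assumption.

```python
import numpy as np

rng = np.random.default_rng()

def annealed_update(dw, t, n=0.1, k=0.002):
    """Weight change of the annealing variant of [2] as reconstructed
    above: the backpropagation step dw plus noise that decays with the
    epoch number t. Drawing r independently per weight is our assumption.
    """
    r = rng.uniform(-0.5, 0.5, size=np.shape(dw))  # per-weight noise
    return dw + n * r * 2.0 ** (-k * t)            # delta_W = delta_w + n r 2^{-kt}
```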
For comprehensive comparisons, all the simulation results, including those on the XOR problem, are listed in Table 2. The table shows that the two restart algorithms do improve the training effects and outperform simulated annealing in most situations. It is worth pointing out, however, as the 4-bit parity data show, that Algorithm 1 does not work efficiently when the success rate of the standard backpropagation algorithm is very low; in that case the upper limit of total training time has to be set very large to obtain satisfactory performance, a fact that can also be derived from Theorem 6. Fortunately, Algorithm 2 still works well in this situation, for the reason given in the previous subsection.

Although we have compared the restart algorithms with simulated annealing, the advantage of the restart mechanism should in general be illustrated by comparing an algorithm with restart against the same algorithm without restart. The restart algorithms do not conflict with other advanced techniques, such as simulated annealing and various second-order methods; the combined algorithms are more efficient in applications and will produce better performance. In particular, if the convergence time of the original algorithm is very long or its success rate is very low, employing accelerating techniques or advanced methods that help overcome false minima is necessary before applying the restart algorithms; otherwise, the restart algorithms are probably not capable of much improvement.

5. Conclusions

In this paper, we have discussed two versions of the restart algorithms used in neural network training as well as in many other problems. Our work extends the work in [11] and gives an in-depth study of the restart algorithms. Based on the two definitions we propose, we investigate the expected convergence time and the success rate of the training process, and we identify the conditions under which the restart mechanism improves the training effects. These results provide insight into the proper use of restarting. The simulations and comparisons confirm the consistency between theory and practice. However, estimating the probability densities p(t), q_conv(t) and q_stick(t) in general situations remains difficult, which limits applications of the theoretical results.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 60074005 and 60374018. We are very grateful to the referees for their helpful comments and suggestions, which have greatly improved the manuscript.

References

[1] S. Amari, H. Park, K. Fukumizu, Adaptive method of realizing natural gradient learning for multilayer perceptrons, Neural Comput. 12 (2000) 1399–1409.
[2] R.M. Burton, G.J. Mpitsos, Event-dependent control of noise enhances learning in neural networks, Neural Networks 5 (1992) 627–637.
[3] S. Fahlman, An empirical study of learning speed in backpropagation networks, CMU Technical Report No. 96/167C, Carnegie Mellon University, 1988.
[4] F. Ghannadian, C. Alford, R. Shonkwiler, Application of random restart to genetic algorithms, Inform. Sci. 95 (1996) 81–102.
[5] M. Gori, A. Tesi, On the problem of local minima in backpropagation, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1992) 76–85.
[6] L.G.C. Hamey, XOR has no local minima: a case study in neural network error surface analysis, Neural Networks 11 (1998) 669–681.
[7] T. Heskes, E. Slijpen, B. Kappen, Learning in neural networks with local minima, Phys. Rev. A 46 (1992) 5221–5231.
[8] D.R. Hush, B. Horne, J.M. Salas, Error surfaces for multilayer perceptrons, IEEE Trans. Syst. Man Cybern. 22 (1992) 1152–1161.
[9] J.F. Kolen, J.B. Pollack, Back propagation is sensitive to initial conditions, in: R.P. Lippmann, J.E. Moody, D.S. Touretzky (Eds.), Advances in Neural Information Processing Systems, vol. 3, Morgan Kaufmann, San Mateo, CA, 1991, pp. 860–867.
[10] T. Leen, J. Moody, Weight space probability densities in stochastic learning: I. Dynamics and equilibria, in: C.L. Giles, S.J. Hanson, J.D. Cowan (Eds.), Advances in Neural Information Processing Systems, Morgan Kaufmann, San Mateo, CA, 1993, pp. 451–458.
[11] M. Magdon-Ismail, A.F. Atiya, The early restart algorithm, Neural Comput. 12 (2000) 1303–1312.
[12] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the convergence of the backpropagation algorithm using learning rate adaptation methods, Neural Comput. 11 (1999) 1769–1796.
[13] J.M. McInerney, K.G. Haines, S. Biafore, R. Hecht-Nielsen, Back propagation error surfaces can have local minima, in: Proceedings of IJCNN'89, vol. II, IEEE Press, Piscataway, NJ, 1989, p. 627.
[14] M.F. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6 (1993) 525–533.
[15] M. Muselli, A theoretical approach to restart in global optimization, J. Global Optim. 10 (1997) 1–16.
[16] G. Orr, T. Leen, Weight space probability densities in stochastic learning: II. Transients and basin hopping times, in: C.L. Giles, S.J. Hanson, J.D. Cowan (Eds.), Advances in Neural Information Processing Systems, Morgan Kaufmann, San Mateo, CA, 1993, pp. 507–514.
[17] T. Poston, C.-N. Lee, Y. Choie, Y. Kwon, Local minima and back propagation, in: Proceedings of IJCNN'91, vol. II, IEEE Press, New York, 1991, pp. 173–176.
[18] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, vol. 1, MIT Press, Cambridge, MA, 1986, pp. 318–362.
[19] A.J. Shepherd, Second-Order Methods for Neural Networks, Springer, London, 1997.
[20] I.G. Sprinkhuizen-Kuyper, E.J.W. Boers, The error surface of the 2–2–1 XOR network: the finite stationary points, Neural Networks 11 (1998) 683–690.
[21] S.T. Welstead, Neural Network and Fuzzy Logic Applications in C/C++, Wiley, New York, 1994.
[22] L.F. Wessel, E. Barnard, Avoiding false local minima by proper initialization of connections, IEEE Trans. Neural Networks 3 (1992) 899–905.

Wei Lin received his B.S. degree in Mathematics in 2002 and M.S. degree in Applied Mathematics in 2005, both from Fudan University, Shanghai, China. He is now studying in the Department of Mathematics and Statistics at the University of Minnesota, Duluth, USA. His research interests focus on neural networks, machine learning, dynamical systems, and their applications in biology and engineering.
Tianping Chen is a professor in the Department and Institute of Mathematics at Fudan University, Shanghai, China. He is the recipient of several important awards, including the Second Prize of 2002 National Natural Science Award of China, 1997 Outstanding Paper Award of IEEE Transactions on Neural Networks, and 1997 Best Paper Award of Japanese Neural Network Society. His research interests include harmonic analysis, approximation theory, neural networks, signal processing and dynamical systems.