Neural Networks 17 (2004) 589–609 www.elsevier.com/locate/neunet
New training strategies for constructive neural networks with application to regression problems L. Ma, K. Khorasani* Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd West, Montreal, Que. H3G 1M8, Canada Received 10 July 2001; accepted 9 February 2004
Abstract The regression problem is an important application area for neural networks (NNs). Among the large number of existing NN architectures, the feedforward NN (FNN) paradigm is one of the most widely used structures. Although one-hidden-layer feedforward neural networks (OHL-FNNs) have simple structures, they possess interesting representational and learning capabilities. In this paper, we are particularly interested in incremental constructive training of OHL-FNNs. In the proposed incremental constructive training schemes for an OHL-FNN, input-side training and output-side training may be separated in order to reduce the training time. A new technique is proposed to scale the error signal during the constructive learning process, to improve the input-side training efficiency and to obtain better generalization performance. Two pruning methods for removing redundant input-side connections have also been applied. Numerical simulations demonstrate the potential and advantages of the proposed strategies when compared to other existing techniques in the literature. © 2004 Elsevier Ltd. All rights reserved. Keywords: Constructive neural networks; Network growth; Network pruning; Training strategy; Regression and function approximation
1. Introduction

Among the numerous existing neural network (NN) paradigms, such as Hopfield networks, Kohonen's self-organizing feature maps (SOM), etc., the feedforward NNs (FNNs) are the most popular due to their flexibility in structure, good representational capabilities, and the large number of available training algorithms (Bose & Liang, 1996; Leondes, 1998; Lippmann, 1987; Sarkar, 1995). In this paper we will be mainly concerned with FNNs. When using a NN, one needs to address three important issues. The solutions to these will significantly influence the overall performance of the NN as far as the following two considerations are concerned: (i) recognition rate for new patterns, and (ii) generalization performance on new data sets that have not been presented during network training. The first problem is the selection of data/patterns for network training. This is a problem that has practical implications and has not received as much attention from researchers as the other problems. The training data set selection can have considerable effects on the performance

* Corresponding author. Tel.: +1-514-848-3086; fax: +1-514-848-2802. E-mail addresses: [email protected] (K. Khorasani); [email protected] (L. Ma). 0893-6080/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.neunet.2004.02.002
of the trained network. Some research on this issue has been conducted in Tetko (1997) (and the references therein). The second problem is the selection of an appropriate and efficient training algorithm from the large number of training algorithms that have been developed in the literature, such as the classical error backpropagation (BP) (Rumelhart, Hinton, & Williams, 1986) and its many variants (Magoulas et al., 1997; Sarkar, 1995; Stager & Agarwal, 1997), and the second-order algorithms (Osowski, Bojarczak, & Stodolski, 1996; Shepherd, 1997), to name a few. Many new training algorithms with faster convergence properties and lower computational requirements are being developed by researchers in the NN community. The third problem is the determination of the network size. This problem is more important from a practical point of view than the above two problems, and is generally more difficult to solve. The problem here is to find a network structure as small as possible that meets certain desired performance specifications. What is usually done in practice is that the developer trains a number of networks of different sizes, and then the smallest network that can fulfill all or most of the performance requirements is selected. This amounts to a tedious process of trial and error that unfortunately seems to be unavoidable. This paper focuses on developing a systematic procedure for
an automatic determination and/or adaptation of the network architecture of a FNN. The second and third problems are actually closely related to one another, in the sense that different training algorithms are suitable for different NN topologies. Therefore, the above three considerations are indeed critical when a NN is to be applied to a real-life problem. Consider a data set generated by an underlying function. This situation usually occurs in pattern classification, function approximation, and regression problems. The problem is to find a model that can represent the input–output relationship of the data set. The model is to be determined or trained based on the data set so that it can predict, within some prespecified error bounds, the output for any new input pattern. In general, a FNN can solve this problem if its structure is chosen appropriately. Too small a network may not be able to learn the inherent complexities present in the data set, but too large a network may learn 'unimportant' details such as observation noise in the training samples, leading to 'overfitting' and hence poor generalization performance. This is analogous to the situation when one uses polynomial functions for curve fitting problems. Generally, acceptable results cannot be achieved if too few coefficients are used, since the characteristics or features of the underlying function cannot be captured completely. However, too many coefficients may fit not only the underlying function but also the noise contained in the data, yielding a poor representation of the underlying function. When an 'optimal' number of coefficients is used, the fitted polynomial will yield the 'best' representation of the function and also the best prediction for any new data. A similar situation arises in the application of NNs, where it is also imperative to relate the architecture of the NN to the complexity of the problem.
Obviously, algorithms that can determine an appropriate network architecture automatically according to the complexity of the underlying function embedded in the data set are very cost-efficient, and thus highly desirable. Efforts toward the network size determination have been made in the literature for many years, and many techniques have been developed (Hush & Horne, 1993; Kwok & Yeung, 1997a) (and the references therein). Towards this end, in Section 2, we review three general methods that deal with the problem of automatic NN structure determination.
2. Adaptive structure neural networks

2.1. Pruning algorithms

One intuitive way to determine the network size is to first establish, by some means, a network that is considered to be sufficiently large for the problem being considered, and then trim the unnecessary connections or units to reduce the network to an appropriate size. This is the basis for the pruning algorithms. Since it is much 'easier' to determine or
select a 'very large' network than to find the proper size needed, the pruning idea is expected to provide a practical but partial solution to the structure determination issue. The main problem to resolve is how to design a criterion for trimming or pruning the redundant connections or units of the network. Mozer and Smolensky (1989) proposed a method that estimates which units are 'least important' and deletes them during training. Karnin (1990) defined a sensitivity of the error function being minimized with respect to the removal of each connection and pruned the connections that have low sensitivities. Cun, Denker, and Solla (1990) designed a criterion called 'saliency' by estimating the second derivative of the error function with respect to each weight, and trimmed the weights with sensitivities lower than some prespecified bound. Castellano, Fanelli, and Pelillo (1990) developed a new iterative pruning algorithm for FNNs, which does not need the problem-dependent tuning phase and the retraining phase that are required by other pruning algorithms such as those in Cun et al. (1990) and Mozer and Smolensky (1989). The above pruning algorithms share the following limitations: (i) the size of the initial network may be difficult to determine a priori, and (ii) the computational costs are excessive, since repeated pruning and retraining processes have to be performed.

2.2. Regularization

A NN that is larger than required will, in general, have some weights that have very little, if anything, to do with the essential representation of the input–output relationship. Although they may contribute to reducing the training error, they will not behave properly when the network is fed with new input data, resulting in an increase of the output error in an unpredictable fashion. This is the case when the trained NN does not generalize well.
In the conventional BP and its many variants, these weights, along with other more crucial weights, are all adjusted in the training phase. Pruning these unnecessary weights from the trained network according to a certain sensitivity criterion, as mentioned above, is one possible way to deal with this situation. Alternatively, by invoking a scheme known as regularization, one may impose certain conditions on the NN training so that the unnecessary weights are automatically forced to converge to zero. This can be achieved by adding a penalty term to the error objective/cost function that is being minimized. Several penalty terms have been proposed in the literature and found to be effective for different problems; see for example the references (Chauvin, 1990; Ishikawa, 1990). However, the regularization techniques still cannot automatically determine the size of the network. The initial network size has to be specified a priori by the user. Furthermore, in the objective/cost function one has to appropriately weight the error and the penalty term. This is controlled by a parameter known as the regularization parameter. Usually, this parameter is selected by trial
and error or by some heuristic-based procedure. Recently, Bayesian methods have been incorporated into network training for automatically selecting the regularization parameter (Buntine & Weigend, 1991; Thodberg, 1996). In these methods, pre-assumed priors on the network weights, such as normal or Laplace distributions, are used that favor small network weights. However, the relationship between the generalization performance and the Bayesian evidence has not been established, and these priors sometimes do not produce good results.

2.3. Constructive learning algorithms

The third approach for determining the network size is known as constructive learning. Constructive learning alters the network structure as learning proceeds, automatically producing a network of an appropriate size. In this approach one starts with an initial network of a 'small' size, and then incrementally adds new hidden units and/or hidden layers until some prespecified error requirement is reached, or no performance improvement can be observed. The network obtained this way is a 'reasonably' sized one for the given problem at hand. Generally, a 'minimal' or an 'optimal' network is seldom achieved by using this strategy; however, a 'sub-minimal/sub-optimal' network can be expected (Kwok & Yeung, 1997a,b). This problem has attracted a lot of attention from many researchers, and several promising algorithms have been proposed in the literature. Kwok and Yeung (1997a) survey the major constructive algorithms in the literature. The dynamic node creation algorithm and its variants (Ash, 1989; Setiono & Hui, 1995), activity-based structure-level adaptation (Lee, 1991; Weng & Khorasani, 1996), cascade-correlation (CC) algorithms (Fahlman & Lebiere, 1991; Prechelt, 1997), and the constructive one-hidden-layer (OHL) algorithms (Kwok & Yeung, 1996, 1997b) are among the most common constructive learning algorithms developed so far in the literature.
In Ma and Khorasani (2000), we proposed a new training strategy for constructive NNs to improve the performance of the algorithm given in Kwok and Yeung (1997b). An algorithm that can adaptively construct multi-layer feedforward NNs is proposed in Ma and Khorasani (2003). Constructive algorithms have the following major advantages over the pruning algorithms and regularization-based techniques:

A1 It is easier to specify the initial network architecture in constructive learning techniques, whereas in pruning algorithms one usually does not know a priori how large the original network should be.

A2 Constructive algorithms tend to build small networks due to their incremental learning nature. Networks are constructed that correspond directly to the complexity of the given problem and the specified performance requirements, whereas excessive effort may be needed to trim the unnecessary weights in
the network in pruning algorithms. Thus, constructive algorithms are generally more efficient (in terms of training time and network complexity/structure) than pruning algorithms.

A3 In pruning algorithms and regularization-based techniques, one must specify or select several problem-dependent parameters in order to obtain an 'acceptable' and 'good' network yielding satisfactory performance results. This aspect could potentially reduce the applicability of these algorithms in real-life applications. Constructive algorithms, on the other hand, do not share these limitations.
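The start-small-and-grow idea described above can be sketched in a few lines. This is an illustrative outline only, not the algorithm proposed in this paper: the candidate pool here is drawn at random and the output-side weight is set by least squares, and the names `fit_candidate` and `constructive_train`, as well as the stopping thresholds, are our own placeholders.

```python
import numpy as np

def fit_candidate(X, residual, n_trials=50, rng=None):
    """Hypothetical candidate trainer: draw random input-side weights,
    keep the candidate unit that best explains the current residual,
    and set its output-side weight by least squares."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best = None
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        h = np.tanh(X @ w)                              # unit output
        v = float(h @ residual) / (float(h @ h) + 1e-12)
        err = float(np.mean((residual - v * h) ** 2))
        if best is None or err < best[0]:
            best = (err, w, v)
    return best[1], best[2]

def constructive_train(X, d, max_units=20, target_mse=1e-3, seed=0):
    """Start 'small' and grow: add hidden units one at a time until the
    error requirement is met or the unit budget is exhausted. The
    result is sub-optimal, as noted in the text."""
    rng = np.random.default_rng(seed)
    residual = np.asarray(d, dtype=float).copy()
    units = []
    for _ in range(max_units):
        w, v = fit_candidate(X, residual, rng=rng)
        residual = residual - v * np.tanh(X @ w)        # shrink the error
        units.append((w, v))
        if np.mean(residual ** 2) <= target_mse:
            break
    return units, float(np.mean(residual ** 2))
```

Because each output-side weight is the least-squares minimizer for its unit, the training MSE is non-increasing as units are added, which is the property that makes the incremental scheme attractive.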
3. Constructive algorithms for feedforward neural networks

In this section, we first give a simple formulation of the training problem for a constructive OHL-FNN in the context of a nonlinear optimization problem. The advantages and disadvantages of these constructive algorithms are also discussed.

3.1. Formulation of constructive FNN training

Suppose a FNN is used to approximate a regression function whose input vector (or predictor variables) is denoted by the multi-dimensional vector X and whose output (or response) is expressed, without loss of any generality, by the scalar Y. A regression surface (input–output function) g(·) is used to describe the relationship between X and Y. A FNN is trained and used to realize or represent this relationship. The input samples are denoted by (x^1, x^2, …, x^P), the output samples at each layer are denoted by (y_1^j, …, y_{L−1}^j, y_L^j), j = 1, …, P, and the corresponding target samples (or observations) are denoted by (d^1, d^2, …, d^P), which are the output data contaminated by an additive white noise vector (e^1, e^2, …, e^P), where L − 1 is the number of hidden layers, L denotes the output layer, and P is the number of patterns in the data set. The network training may be formulated as the following unconstrained least-squares (LS) nonlinear optimization problem:

$$\min_{L,\,n,\,f,\,w} \sum_{j=1}^{P} (d^j - y_L^j)^2, \qquad (1)$$

subject to

$$\begin{cases} y_1^j = f_1(w_1 x^j), & w_1 \in \mathbb{R}^{n_1 \times M},\; y_1^j \in \mathbb{R}^{n_1},\; x^j \in \mathbb{R}^{M}, \\ y_2^j = f_2(w_2 y_1^j), & w_2 \in \mathbb{R}^{n_2 \times n_1},\; y_2^j \in \mathbb{R}^{n_2}, \\ \quad\vdots & \\ y_L^j = f_L(w_L y_{L-1}^j), & w_L \in \mathbb{R}^{1 \times n_{L-1}},\; y_L^j \in \mathbb{R}^{1}, \end{cases} \qquad (2)$$
where n = (n_1, n_2, …, n_{L−1}) is a vector denoting the number of hidden units in each hidden layer; f = (f_1, f_2, …, f_L) denotes the activation function of each layer, where f_1, f_2, …, f_{L−1} are usually nonlinear activation functions and f_L, the activation function of the output layer, is selected to be linear for a regression problem; and w = (w_1, w_2, …, w_L) represents the weight matrix corresponding to each layer. All levels of adaptation, that is, structure level, functional level, and learning level, are included in the above LS nonlinear optimization problem. The performance index (1) is clearly too complicated to solve by existing optimization techniques. Fortunately, by fixing certain variables, the optimization problem becomes easier to solve; however, only a suboptimal solution can then be provided. As it turns out, a suboptimal solution suffices in many practical situations. In practice, one can solve Eqs. (1) and (2) only through an incremental procedure, that is, by first fixing certain variables, say the activation functions and the numbers of hidden layers and units, and then solving the resulting LS optimization problem with respect to the remaining variables. This process is repeated until an acceptable solution or network is obtained. In this way, for a given choice of activation functions and number of units, Eqs. (1) and (2) reduce to a nonlinear optimization problem with respect to the number of hidden layers and the weight matrices. The problem of weight selection may be solved by using steepest-descent (for example, the BP) type or second-order (quasi-Newton) recursive methods. The error surface represented by Eqs. (1) and (2) has, in general, many local minima, and the solution to the problem is sought for only one local minimum at a time. The error surface changes its shape each time the activation functions and/or the number of hidden layers and/or units are modified, resulting in different local minima each time.
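The fix-some-variables view of Eqs. (1) and (2) can be made concrete with a short sketch. For a fixed depth, fixed layer sizes, and fixed activation functions, the forward pass is an L-fold composition and the cost is the squared error summed over patterns; the layer shapes below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def forward(x, weights, acts):
    """Constraint (2): y_l = f_l(w_l y_{l-1}), starting from y_0 = x."""
    y = x
    for w, f in zip(weights, acts):
        y = f(w @ y)
    return y

def ls_cost(X, d, weights, acts):
    """Performance index (1): sum over patterns of (d^j - y_L^j)^2,
    for a scalar-output network (last weight matrix has one row)."""
    return float(sum((dj - forward(xj, weights, acts)[0]) ** 2
                     for xj, dj in zip(X, d)))

# Illustrative shapes for L = 3 (two hidden layers), M = 2 inputs:
# tanh hidden activations and a linear output, as in a regression setting.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 2)),   # w_1 in R^{4 x 2}
           rng.normal(size=(3, 4)),   # w_2 in R^{3 x 4}
           rng.normal(size=(1, 3))]   # w_L in R^{1 x 3}
acts = [np.tanh, np.tanh, lambda z: z]
```

Minimizing `ls_cost` over `weights` alone is the reduced problem the incremental procedure solves at each step; re-solving it after each structural change corresponds to the shape-shifting error surface described above.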
The standard constructive algorithms available in the literature have adopted the above approach and can therefore provide only suboptimal solutions to Eqs. (1) and (2). The constructive multilayer FNN proposed in this paper also provides a suboptimal solution to Eqs. (1) and (2).

3.2. Limitations of the current constructive FNNs
Our motivation for developing the proposed constructive learning algorithm may be stated as follows:

R1 The FNN is simple and elegant in structure. The fan-in problem of the CC-type architectures is not present in this structure. Furthermore, in a CC-type network, the deeper the structure becomes, the more input-side connections a new hidden unit requires. This may give rise to degradation of the generalization performance of the network, as some of the connections may become irrelevant to the prediction of the output.

R2 The FNN is a universal approximator as long as the number of hidden units is sufficiently large. Therefore, the convergence of constructive algorithms can be easily established (Kwok & Yeung, 1997b).

R3 The constructive learning process is simple and facilitates the investigation of the training efficiency and the development of other improved constructive strategies.

The constructive multi-hidden-layer FNN considered in this paper may yield improved approximation and representation capabilities for many complicated problems when compared to OHL-FNNs. Other multi-layer architectures have also been developed in the literature, such as the stack learning algorithm (Fang & Lacher, 1994) and the adding-and-deleting algorithm (Nabhan & Zomaya, 1994). The stack learning algorithm begins with a minimal structure consisting of the input and the output units only, similar to the initial network in CC. The algorithm then constructs a network by creating a new set of output units and by converting the previous output units into new hidden units. The new output layer has connections to both the original input units and all the established hidden units. In other words, this algorithm generates a network that has a similar structure to the CC-based networks, and hence has the same limitations as CC. In the adding-and-deleting algorithm, the network training is divided into two phases: an addition phase and a deletion phase. These two phases are controlled by evaluating the network performance. The so-called backtracking technique is used to avoid the 'limit off' of the constructive learning. This algorithm may produce multi-layered FNNs, but it is computationally very intensive due to its lengthy addition-and-deletion process and the use of the BP-based training algorithm.

4. Statement of the problem

Generally, a multivariate model-free regression problem can be described as follows. Suppose one is given P pairs of vectors (d^j, x^j) = (d_1^j, d_2^j, …, d_N^j; x_1^j, x_2^j, …, x_M^j), j = 1, 2, …, P, that are generated from the unknown models

$$d_i^j = g_i(x^j) + e_i^j, \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, P, \qquad (3)$$

where the {d^j} are called the multivariate 'response' vectors, the {x^j} are called the 'independent variables' or the 'carriers', and M and N are the dimensions of x and d, respectively. The g_i are unknown smooth nonparametric or model-free functions from M-dimensional Euclidean space to the real line:

$$g_i: \mathbb{R}^M \to \mathbb{R}, \quad i = 1, 2, \ldots, N. \qquad (4)$$

The {e_i^j} are random variables (or noise) with zero mean, E[e_i^j] = 0, and are independent of {x^j}. Usually, the {e_i^j} are assumed to be independent and identically distributed (iid).
The goal of regression is to obtain estimators ĝ_1, ĝ_2, …, ĝ_N that are functions of the data (d^j, x^j), j = 1, 2, …, P, to best approximate the unknown functions g_1, g_2, …, g_N, and to use these estimators to predict (generalize) a new d given a new x:

$$\hat{d}_i = \hat{g}_i(x), \quad i = 1, 2, \ldots, N. \qquad (5)$$

Suppose, without loss of any generality and for the sake of simplicity, that the regression problem has an M-dimensional input vector and a one-dimensional output scalar (N = 1). The j-th input and output patterns are denoted by x^j = (x_0^j, x_1^j, …, x_M^j) and d^j (target), respectively, where j = 1, 2, …, P, and x_0^j = 1 for all j, corresponding to the bias. As discussed earlier, the constructive OHL algorithm (Kwok & Yeung, 1997b) may start from a small network, say a one-hidden-unit network, adapted by a BP-type learning algorithm. At any given point during the constructive training process, say there are n − 1 hidden units in the hidden layer, and the problem is to train the n-th hidden unit that is added to the network. Without loss of generality, all the hidden units are assumed to have identical sigmoidal activation functions. The network is depicted in Fig. 1. Alternatively, a candidate that maximizes a correlation-based objective function, as described in detail in the next section, is selected from a pool of candidates, and is then incorporated into the network as the n-th hidden unit. The input to the n-th hidden unit is given by

$$s_n^j = \sum_{i=0}^{M} w_{n,i}\, x_i^j, \quad j = 1, 2, \ldots, P, \qquad (6)$$

where w_{n,i} is the weight from the i-th input to the n-th hidden unit, and w_{n,0} is the so-called bias of the n-th hidden unit, whose input is set to x_0^j = 1. The training of the network weights is accomplished in two separate stages: input-side and output-side training. Before proceeding to the output-side training of a new hidden unit, its input-side weights are trained first. This is the so-called input-side training phase. Once the input-side training is accomplished, the output of the hidden unit can then be expressed as

$$f_n(s_n^j) = f\!\left(\sum_{i=0}^{M} w_{n,i}\, x_i^j\right), \quad j = 1, 2, \ldots, P, \qquad (7)$$
where f(·) is the sigmoidal activation function of the hidden unit. The network output with n hidden units may now be expressed as follows:

$$y_n^j = \sum_{k=0}^{n} v_k\, f_k(s_k^j), \quad j = 1, 2, \ldots, P, \qquad (8)$$

where v_k (k = 1, 2, …, n) are the output-side weights of the k-th hidden unit, and v_0 is the bias of the output unit, with its input fixed to f_0 = 1. The corresponding output error is now given by

$$e_n^j = d^j - y_n^j = d^j - \sum_{k=0}^{n} v_k\, f_k(s_k^j), \quad j = 1, 2, \ldots, P. \qquad (9)$$
Subsequently, the output-side training is performed by solving the following LS problem, given that linear activation functions are assigned to the neurons in the output layer:

$$J_{output} = \frac{1}{2} \sum_{j=1}^{P} (e_n^j)^2 = \frac{1}{2} \sum_{j=1}^{P} \left\{ d^j - \sum_{k=0}^{n} v_k\, f_k(s_k^j) \right\}^2.$$

Following the output-side training, a new error signal e_n^j is calculated from Eq. (9) for the next cycle of input-side training, associated with the (n + 1)-th hidden unit and its corresponding output-side training, as discussed in Section 5. The remainder of this paper is organized as follows. Section 5 presents the concept of error scaling, introduced to improve the performance of the constructive training procedure described briefly above. In Section 6, two pruning strategies are proposed to remove (delete), during the constructive training process, the input-side connections that have 'minimal' contributions to the input-side training. Since these pruning techniques are applied locally, the generalization performance of the resulting networks may not be significantly improved; however, networks with fewer connections will be constructed, which should be more advantageous for implementation purposes. In Section 7, the convergence of the proposed constructive algorithm is discussed, and an optimal objective function for the input-side training is obtained that includes and encompasses the other objective functions given in the literature (Kwok & Yeung, 1997b). Section 8 concludes the paper.
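Because the output layer is linear, minimizing J_output with respect to the output-side weights v_k is an ordinary linear least-squares problem, solvable in closed form once the hidden-unit outputs are stacked into a design matrix. A minimal sketch (the matrix layout and the function name are our own conventions, not the paper's):

```python
import numpy as np

def output_side_train(F, d):
    """Minimize J_output = (1/2) sum_j (d^j - sum_k v_k F[j, k])^2.

    F : (P, n+1) design matrix whose columns are the hidden-unit
        outputs f_k(s_k^j), with column 0 fixed to 1 (f_0 = 1).
    d : (P,) vector of targets d^j.
    Returns the output-side weights v and the error signal e_n^j of
    Eq. (9), which drives the next cycle of input-side training.
    """
    v, *_ = np.linalg.lstsq(F, d, rcond=None)
    e = d - F @ v              # residual handed to the next unit
    return v, e
```

Solving for all output-side weights jointly in this closed form is what makes separating input-side and output-side training cheap compared to back-propagating through the whole network.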
5. Error scaling strategy for input-side training
Fig. 1. Structure of a constructive OHL-FNN.
In this section, the features of a correlation-based objective function are investigated. Without loss of any
generality, a regression problem with only one output is considered. The correlation-based objective function in this case is given as follows (Fahlman & Lebiere, 1991):

$$J_{input} = \left| \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1})(f_n(s_n^j) - \bar{f}_n) \right|, \qquad (10)$$

where $\bar{e}_{n-1} = (1/P) \sum_{j=1}^{P} e_{n-1}^j$ and $\bar{f}_n = (1/P) \sum_{j=1}^{P} f_n(s_n^j)$, with $\bar{e}_{n-1}$ and $\bar{f}_n$ denoting the mean values over the entire set of training samples of the training error and of the output of the n-th hidden unit, respectively. The derivative of J_input with respect to the weight w_{n,i} is calculated as (for details refer to Appendix A)

$$\frac{\partial J_{input}}{\partial w_{n,i}} = \mathrm{sgn}(C_0) \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1})\, f'_n(s_n^j)\, x_i^j, \qquad (11)$$

where $C_0 = \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1})(f_n(s_n^j) - \bar{f}_n)$. The explicit weight adjustment rules are given in detail in the Appendix. From Eqs. (10) and (11), the following observations can be made:

O1 Maximization of the objective function (10) requires that the output of the new hidden unit, f_n(s_n^j), resemble the error signal e_{n-1}^j as closely as possible. The optimal situation is achieved if

$$f_n(s_n^j) = f(s_n^j) = \lambda e_{n-1}^j, \qquad (12)$$

for all the training patterns, where λ is an appropriate constant. Under this condition, it can be shown that the addition of the n-th hidden unit will make the network output error zero, which implies that 'perfect' training may be achieved. However, in most regression problems this ideal situation is never feasible, and therefore one is required to perform the input-side training to approach the above ideal case. For further details on the convergence of a constructive training algorithm refer to Section 7.

O2 Since f(s_n^j) is upper and lower bounded (saturated for large or small values of s_n^j) to the interval [0,1], due to the characteristics of the sigmoidal activation function, Eq. (12) will not be satisfied if certain errors e_{n-1}^j for some training samples reside outside the range of f(·). In this case, the neurons will be forced into their saturation regions in order to achieve as much 'resemblance' as possible to the corresponding error signal. The derivative of the activation function around the saturation region will be zero, resulting in no contribution to Eq. (10). Equivalently, the training samples that have resulted in this undesirable situation will have no contribution to the input-side training.

O3 Based on the above observations one can envisage the following simple remedy: why not try to position the undesirable error signal samples into the operational range of the activation function (with non-zero
derivative) during the input-side training? This is the motivation behind the proposed 'error scaling technique' that is developed in the next subsection.

5.1. Error scaling

During the input-side training phase, if the error signal e_{n-1}^j varies within a range [e_{n-1,min}, e_{n-1,max}] that is beyond the active range [f_min, f_max] of the activation function of a hidden unit, the unit cannot track the error. Consequently, the error signal range [e_{n-1,min}, e_{n-1,max}] should ideally be mapped into the active range [f_min, f_max] of the hidden unit by invoking a linear coordinate transformation, as follows:

$$f_{max} = C_1 e_{n-1,max} + C_2, \qquad f_{min} = C_1 e_{n-1,min} + C_2.$$

Note that only linear transformations can be utilized here if one wants to preserve the waveform of the error signal. Solving the above expressions for the parameters C_1 and C_2 yields

$$C_1 = \frac{f_{max} - f_{min}}{e_{n-1,max} - e_{n-1,min}}, \qquad C_2 = \frac{e_{n-1,max} f_{min} - e_{n-1,min} f_{max}}{e_{n-1,max} - e_{n-1,min}}.$$

Therefore, the error signal e_{n-1}^j is linearly transformed into e_{n-1,new}^j according to

$$e_{n-1,new}^j = \begin{cases} C_1 e_{n-1}^j + C_2, & C_1 < 1, \\[4pt] e_{n-1}^j - \bar{e}_{n-1} + \dfrac{f_{max} + f_{min}}{2}, & C_1 \geq 1. \end{cases} \qquad (13)$$

The error signal e_{n-1}^j in Eq. (10) will now be replaced by e_{n-1,new}^j, which is calculated from the above equation. If the activation function of the hidden unit is a log-sigmoidal function, then its active range may, for example, be set to [0.1, 0.9]. It is not difficult to determine from the above equations that if e_{n-1,max} and e_{n-1,min} are chosen such that the mean value of the error signal, \bar{e}_{n-1}, equals (1/2)(e_{n-1,max} + e_{n-1,min}), one gets

$$\bar{e}_{n-1,new} = \frac{f_{max} + f_{min}}{2}, \qquad (14)$$

regardless of C_1. This is the point where the activation function reaches its maximum first-order derivative. The replacement of e_{n-1}^j by e_{n-1,new}^j is now expected to improve the capability of the input-side training. What remains to be defined are the bounds for the error signal e_{n-1}^j. Since the correlation between f_n(s_n^j) and e_{n-1}^j is statistically defined, the bounds for e_{n-1}^j also have to be selected in a statistical sense rather than as simple fixed values. Consequently, the values for the bounds of e_{n-1}^j are selected according to

$$e_{n-1,max} = \bar{e}_{n-1} + 3\sigma_{e_{n-1}}, \qquad e_{n-1,min} = \bar{e}_{n-1} - 3\sigma_{e_{n-1}},$$

where $\sigma_{e_{n-1}}$ is the standard deviation of e_{n-1}^j.
5.2. Simulation results

In this section, several examples are worked out in some detail to demonstrate the effectiveness of the proposed error scaling technique, by comparing it with a previous training algorithm in the literature (Kwok & Yeung, 1997b). In all of the following examples the performance of the network is measured by the so-called fraction of variance unexplained (FVU) metric (Kwok & Yeung, 1997b), which is defined as

$$\mathrm{FVU} = \frac{\sum_{j=1}^{P} (\hat{g}(x^j) - g(x^j))^2}{\sum_{j=1}^{P} (g(x^j) - \bar{g})^2}, \qquad (15)$$

where g(·) is the function to be implemented by the FNN, ĝ(·) is the estimate of g(·) realized by the network, and $\bar{g}$ is the mean value of g(·) over the training samples. The FVU is basically the ratio of the error variance to the variance of the function being analyzed by the network. Generally, the larger the variance of the function, the more difficult it is to perform the regression analysis. Therefore, the FVU may be viewed as a measure normalized by the 'complexity' of the function under consideration. Note that
the FVU is proportional to the mean square error (MSE). Furthermore, the function under study is likely to be contaminated by an additive noise e. In this case the signal-to-noise ratio (SNR) is defined by

$$\mathrm{SNR} = 10 \log_{10} \left\{ \frac{\sum_{j=1}^{P} (g(x^j) - \bar{g})^2}{\sum_{j=1}^{P} (e^j - \bar{e})^2} \right\}, \qquad (16)$$

where $\bar{e}$ is the mean value of the additive noise, which is usually assumed to be zero.

Example I. Let us consider the following five two-dimensional regression functions that are also studied in Kwok and Yeung (1997b):

(a) Complicated interaction function (CIF)

$$g(x_1, x_2) = 1.9\,(1.35 + e^{x_1} e^{-x_2} \sin(13 (x_1 - 0.6)^2) \sin(7 x_2)). \qquad (17)$$

(b) Additive function (AF)

$$g(x_1, x_2) = 1.3356\,(1.5 (1 - x_1) + e^{2 x_1 - 1} \sin(3\pi (x_1 - 0.6)^2) + e^{3 (x_2 - 0.5)} \sin(4\pi (x_2 - 0.9)^2)). \qquad (18)$$
Fig. 2. The original and the generalized CIF function with 10 [dB] SNR. Top row from left to right: (a) CIF (original) and (b) Generalized CIF with 1 HU; bottom row from left to right: (c) Generalized CIF with 10 HUs and (d) Generalized CIF with 20 HUs.
L. Ma, K. Khorasani / Neural Networks 17 (2004) 589–609
(c) Harmonic function (HF)

g(x_1, x_2) = 42.659(0.1 + (x_1 - 0.5)(0.05 + (x_1 - 0.5)^4 - 10(x_1 - 0.5)^2 (x_2 - 0.5)^2 + 5(x_2 - 0.5)^4)).   (19)

(d) Radial function (RF)

g(x_1, x_2) = 24.234((x_1 - 0.5)^2 + (x_2 - 0.5)^2)(0.75 - (x_1 - 0.5)^2 - (x_2 - 0.5)^2).   (20)

(e) Simple interaction function (SIF)

g(x_1, x_2) = 10.391((x_1 - 0.4)(x_2 - 0.6) + 0.36).   (21)
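For reference, the five benchmark surfaces of Eqs. (17)-(21) can be coded directly; this is a sketch in Python (the short function names are ours):

```python
from math import exp, sin, pi

def cif(x1, x2):   # complicated interaction function, Eq. (17)
    return 1.9 * (1.35 + exp(x1) * exp(-x2) * sin(13 * (x1 - 0.6) ** 2) * sin(7 * x2))

def af(x1, x2):    # additive function, Eq. (18)
    return 1.3356 * (1.5 * (1 - x1)
                     + exp(2 * x1 - 1) * sin(3 * pi * (x1 - 0.6) ** 2)
                     + exp(3 * (x2 - 0.5)) * sin(4 * pi * (x2 - 0.9) ** 2))

def hf(x1, x2):    # harmonic function, Eq. (19)
    return 42.659 * (0.1 + (x1 - 0.5) * (0.05 + (x1 - 0.5) ** 4
                     - 10 * (x1 - 0.5) ** 2 * (x2 - 0.5) ** 2
                     + 5 * (x2 - 0.5) ** 4))

def rf(x1, x2):    # radial function, Eq. (20)
    r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
    return 24.234 * r2 * (0.75 - r2)

def sif(x1, x2):   # simple interaction function, Eq. (21)
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)
```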
The plots of these functions are depicted in Figs. 2(a)-6(a). For each function, two hundred and twenty-five uniformly distributed random points were generated from the two-dimensional interval [0,1] for network training. Ten thousand uniformly sampled points from the same interval without additive noise were used to test the generalization performance of the trained network with different numbers of hidden units. Forty independent runs were performed (where for each run the network is initialized to a different set of weight values) for an ensemble-averaged evaluation for two cases where the SNRs are 10 [dB] and 0 [dB], respectively. Mean FVUs for the training and the generalization performance of networks trained with and without error scaling for the above regression functions are summarized in Table 1. Selected generalized surfaces of the above functions by the networks constructed using samples with 10 [dB] SNR are also provided for comparison in Figs. 2(b)-(d) to 6(b)-(d). Comparisons between the training and the generalization performances of the standard (i.e. without error scaling) (Kwok & Yeung, 1997b) and the proposed (i.e. with error scaling) algorithms for all five functions are depicted in Figs. 7-11, where the SNR is 0 [dB]. It readily follows from these figures that the proposed technique can prevent overtraining and provides much improved generalization performance, especially for relatively large noise scenarios (see Table 1 for results with different SNR values).

Example II. Consider the following three-dimensional analytical function (SAF) (Samad, 1991) given by

g(x_1, x_2, x_3) = \frac{1}{1 + e^{-e^{x_1} + (x_2 - 0.5)^2 + 3\sin(\pi x_3)}}.   (22)
Fig. 3. The original and the generalized AF function with 10 [dB] SNR. Top row from left to right: (a) AF (original) and (b) Generalized AF with 1 HU; bottom row from left to right: (c) Generalized AF with 10 HUs and (d) Generalized AF with 20 HUs.
Fig. 4. The original and the generalized HF function with 10 [dB] SNR. Top row from left to right: (a) HF (original) and (b) Generalized HF with 1 hidden unit (HU); bottom row from left to right: (c) Generalized HF with 10 HUs and (d) Generalized HF with 20 HUs.
One thousand random samples with additive noise were generated for network training, and 5000 random samples without additive noise were generated for verifying the network generalization performance. The SNR is -3 [dB]. Fig. 12 shows the training and generalization performance of both our proposed algorithm (i.e. with error scaling) and the previous work (i.e. without error scaling) (Kwok & Yeung, 1997b). We see that our proposed algorithm performs slightly better than the previous algorithm.

Example III. Consider the following five-dimensional function (HAS) (Hashem, 1997) given by

g(x_1, x_2, x_3, x_4, x_5) = 0.0647(12 + 3x_1 - 3.5x_2^2 + 7.2x_3^3)(1 + \cos 4\pi x_4)(1 + 0.8 \sin 3\pi x_5).   (23)

Three thousand random samples with additive noise were generated for network training, and 10,000 random samples without additive noise were generated for verifying the network generalization capabilities. The SNR is -3 [dB]. From Fig. 13, it follows very clearly that with our new error scaling technique the training error becomes much smaller than with the standard approach (i.e. without error scaling) (Kwok & Yeung, 1997b), and the generalization performance is improved a great deal.

Example IV. Finally, consider the benchmark regression problem 'Housing' (Blake & Merz, 1998). There are 506 samples in the data set. Three hundred samples are used for training and the remaining 206 samples for generalization. There are 13 inputs and one output. Forty independent runs were performed. Fig. 14 shows the simulation results. Obviously, in this case the proposed algorithm works much better in terms of the generalization FVU as compared to the standard algorithm (Kwok & Yeung, 1997b).

6. Input-side pruning strategies
In the input-side training, one can use a single candidate or a pool of candidates to train a new hidden unit. In the latter case, the neuron that results in the maximum objective function is selected as the best candidate. This candidate is incorporated into the network and its input-side weights are frozen in the subsequent training. However, certain input-side weights may not contribute much to the maximization of the objective function or, indirectly, to the reduction of the training error. These connections should first be detected and then removed
Fig. 5. The original and the generalized RF function with 10 [dB] SNR. Top row from left to right: (a) RF (original) and (b) Generalized RF with 1 HU; bottom row from left to right: (c) Generalized RF with 10 HUs and (d) Generalized RF with 20 HUs.
through a pruning technique. Pruning these connections is expected to produce a smaller network without compromising the performance of the network. Note that the pruning operation is carried out 'locally', and therefore the generalization performance of the final network will not be improved significantly, since the conventional pruning-and-backfitting performed in standard fixed-size network pruning is not implemented here. In this section we propose two types of sensitivity functions for the purpose of formalizing the input-side weight pruning process.

6.1. Sensitivity based on training error (Method A)

Let us reproduce here the squared output error criterion given earlier for a single output network

J_{output} = \frac{1}{2} \sum_{j=1}^{P} (y_n^j - d^j)^2.   (24)

If J_{output} is locally minimized by incorporating the n-th hidden unit, the error surface with respect to the input-side weights denoted by W_n = (w_{n,0}, w_{n,1}, \ldots, w_{n,M}) for the new hidden unit will appear to be bowl-shaped in a small vicinity of that local minimum. This implies that the second-order derivatives of J_{output} with respect to each weight of that unit will be positive at the local minimum. If this is not the case for a given weight, that weight would then make no contribution to reducing the training error. The first derivative of J_{output} with respect to an input-side weight w_{n,i} of the n-th hidden unit is calculated as follows

\frac{\partial J_{output}}{\partial w_{n,i}} = v_n \sum_{j=1}^{P} x_i^j f'(s_n^j)(y_n^j - d^j), \quad i = 1, \ldots, M.   (25)

The second-order derivative (sensitivity) function may now be evaluated according to

S_{n,i}^A = \frac{\partial^2 J_{output}}{\partial w_{n,i}^2} = v_n^2 \sum_{j=1}^{P} (x_i^j)^2 (f'(s_n^j))^2 + v_n \sum_{j=1}^{P} (x_i^j)^2 (y_n^j - d^j) f''(s_n^j), \quad i = 1, \ldots, M.   (26)

The above criterion was also used for pruning BP-trained NNs in Cun et al. (1990). However, we are using it here for pruning the input-side weights only during the input-side training phase. When the output-side training is completed, the second-order derivative or sensitivity function (26) with respect to each weight of the new hidden unit is calculated. The weights that result in negative sensitivity
Fig. 6. The original and the generalized SIF function with 10 [dB] SNR. Top row from left to right: (a) SIF (original) and (b) Generalized SIF with 1 hidden unit (HU); bottom row from left to right: (c) Generalized SIF with 10 HUs and (d) Generalized SIF with 20 HUs.
are set to zero, i.e. pruned. Subsequently, the output of the new hidden unit is recalculated and only the output-side training is performed again. Clearly, pruning the weights that make no contribution to reducing the training error may be achieved only once; however, output-side training has to be performed twice: once before and once after the pruning. Furthermore, weights that have relatively 'small' second-order derivatives, for example lower than some prespecified threshold (referred to as the pruning level), can also be removed without significantly affecting the performance of the network.

6.2. Sensitivity based on correlation (Method B)

Suppose that the best candidate for the n-th hidden unit to be added to the network results in an objective function J_{max,n}. Then, the sensitivity of each weight may be defined as follows:

S_{n,i}^B = J_{max,n} - J_{input}(w_{n,i} = 0), \quad i = 1, \ldots, M,   (27)

where J_{input}(w_{n,i} = 0) is the value of the objective function when w_{n,i} is set to zero, while the other connections are unchanged. Note that the bias is usually not pruned. The above sensitivity function measures the contribution of each connection to the objective function. The largest sensitivity for the n-th hidden unit is denoted by S_n^{max}. If S_{n,i}^B \le 0, and/or it is very small compared to S_n^{max}, say 3% of it (with the pruning level defined as (S_n^{max} - S_{n,i}^B)/S_n^{max}), then the weight w_{n,i} is removed. After the pruning is performed, the output of the hidden unit f(s_n^j) is re-evaluated and the output-side training is performed one more time.

6.3. Simulation results

To confirm the validity of the pruning methods A and B, we present in this section several simulation results. Intuitively, the larger the dimension of the input vector, the more effective the above pruning techniques would be. In all the following examples, 10 independent runs are performed to obtain an appropriate gauge for average evaluation purposes. The SNR in all examples is set to 10 [dB].

Example 1. The first example is the two-dimensional function CIF (Draelos & Hush, 1996; Kwok & Yeung, 1997b) considered in the previous section. Again, 225 uniformly distributed random points were generated from the interval [0,1] for network training. Ten thousand uniformly sampled points from the same interval without additive noise were used to test the generalization performance of the trained network. The pruning level is
Table 1
Mean FVU values of the training and the generalization performance of the proposed network constructed 'with' (w) the error scaling technique and the existing algorithm (Kwok & Yeung, 1997b) 'without' (w/o) the error scaling technique, as a function of the number of hidden units

                                Training (hidden units)          Generalization (hidden units)
Fcn.  SNR [dB]  Approach        2      5      10     20          2      5      10     20
CIF   10        w/o             0.659  0.429  0.181  0.089       0.685  0.452  0.214  0.108
                w               0.661  0.409  0.164  0.095       0.686  0.433  0.187  0.088
      0         w/o             0.778  0.639  0.501  0.404       0.646  0.548  0.362  0.235
                w               0.779  0.643  0.501  0.454       0.646  0.512  0.290  0.200
AF    10        w/o             0.697  0.207  0.126  0.092       0.723  0.171  0.094  0.048
                w               0.634  0.182  0.125  0.095       0.644  0.135  0.079  0.040
      0         w/o             0.791  0.500  0.406  0.350       0.699  0.270  0.225  0.313
                w               0.797  0.494  0.426  0.406       0.665  0.184  0.155  0.190
HF    10        w/o             0.770  0.557  0.367  0.171       0.888  0.678  0.499  0.290
                w               0.777  0.551  0.336  0.157       0.894  0.657  0.421  0.219
      0         w/o             0.882  0.712  0.573  0.407       0.898  0.783  0.565  0.449
                w               0.885  0.729  0.565  0.461       0.898  0.838  0.493  0.332
RF    10        w/o             0.625  0.240  0.125  0.072       0.691  0.243  0.098  0.036
                w               0.627  0.200  0.110  0.073       0.684  0.191  0.065  0.024
      0         w/o             0.658  0.493  0.417  0.356       0.632  0.267  0.258  0.405
                w               0.658  0.498  0.441  0.415       0.622  0.223  0.185  0.264
SIF   10        w/o             0.275  0.136  0.098  0.082       0.281  0.101  0.054  0.032
                w               0.275  0.132  0.097  0.085       0.276  0.085  0.047  0.022
      0         w/o             0.634  0.454  0.400  0.361       0.357  0.182  0.164  0.206
                w               0.628  0.453  0.419  0.406       0.344  0.174  0.123  0.154
set to 5%. Fig. 15(a) shows the generalization errors with and without pruning method A (error scaling is performed in both cases), and Fig. 15(b) the cumulative number of weights pruned using method A. Fig. 15(c) and (d) depict the corresponding results using pruning method B. From these figures, one may observe that pruning method A did not remove any weight, while pruning method B pruned on average only one and a half weights from the network that has 20 hidden units. This shows that neither of the proposed pruning methods demonstrates a significant advantage for a problem that has a low-dimensional input space.

Example 2. The second example is the simple three-dimensional analytical function (SAF) (Samad, 1991) considered in the previous section. One thousand random samples with additive noise were generated for network
Fig. 7. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the CIF regression problem (SNR ¼ 0 [dB], 40 runs).
Fig. 8. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the AF regression problem (SNR ¼ 0 [dB], 40 runs).
Fig. 9. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the HF regression problem (SNR ¼ 0 [dB], 40 runs).
Fig. 10. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the RF regression problem (SNR ¼ 0 [dB], 40 runs).
Fig. 11. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the SIF regression problem (SNR ¼ 0 [dB], 40 runs).
Fig. 12. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the SAF regression problem (SNR ¼ 23 [dB], 40 runs).
Fig. 13. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the HAS regression problem (SNR ¼ 23 [dB], 40 runs).
Fig. 14. Comparisons between the mean FVUs of the training and the generalization with and without error scaling for the benchmark regression problem ‘Housing’ (40 runs).
Fig. 15. Left column figures from top to bottom—generalization errors for CIF: (a) method A and (c) method B (solid line with and dashed line without pruning methods). Right column figures from top to bottom—cumulative number of weights pruned versus the number of hidden units: (b) method A and (d) method B.
Fig. 16. Left column figures from top to bottom—generalization errors for SAF: (a) method A and (c) method B (solid line with and dashed line without pruning methods). Right column figures from top to bottom—cumulative number of weights pruned versus the number of hidden units: (b) method A and (d) method B.
training, and 5000 random samples without additive noise were generated for verifying the network generalization performance. The pruning level in this case is set to 1%. Fig. 16(a) shows the generalization errors with and without pruning method A (error scaling is performed in both cases), and Fig. 16(b) the cumulative number of weights pruned using method A. Similar results using pruning method B are shown in Fig. 16(c) and (d), where the pruning level is again set to 1%. These figures demonstrate that both methods have pruned weights, but method B prunes more than method A. This is due to the fact that more weights could be pruned in method B than in method A, even though both methods yield practically the same FVUs for both the training and generalization comparisons.

Example 3. The third example is the five-dimensional function (HAS) (Hashem, 1997) considered in the previous section. Three thousand random samples with additive noise were generated for network training, and 10,000 random samples without additive noise were generated for verifying the network generalization capabilities. The pruning level in this case is set to 0.2%. Fig. 17(a)-(d) illustrate the results achieved using methods A and B. Again, it is observed that method B pruned more weights than method A. For a quantitative comparison, the results obtained using method B for the above three examples are summarized in Table 2.
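The two sensitivity computations of Sections 6.1 and 6.2 are straightforward to sketch. Below is an illustrative Python version for a single 'logsig' hidden unit, following Eq. (26) for method A and Eq. (27) for method B; the helper names and the simple pruning rule (sensitivity non-positive or below a fraction of the unit maximum) are our reading of the pruning level, not code from the paper:

```python
import math

def logsig(s):
    return 1.0 / (1.0 + math.exp(-s))

def sensitivity_A(v_n, w, xs, ys, ds):
    # Eq. (26) with f = logsig, so f'(s) = f(1 - f) and f''(s) = f'(1 - 2f).
    # w = [w_{n,0}, ..., w_{n,M}]; xs: inputs, ys: network outputs, ds: targets.
    sens = []
    for i in range(1, len(w)):
        total = 0.0
        for x, y, d in zip(xs, ys, ds):
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            f = logsig(s)
            f1 = f * (1.0 - f)
            f2 = f1 * (1.0 - 2.0 * f)
            total += v_n ** 2 * x[i - 1] ** 2 * f1 ** 2 \
                     + v_n * x[i - 1] ** 2 * (y - d) * f2
        sens.append(total)
    return sens

def sensitivity_B(j_input, w):
    # Eq. (27): drop in the input-side objective when weight i is zeroed
    # (the bias w[0] is not pruned).
    j_max = j_input(w)
    sens = []
    for i in range(1, len(w)):
        w0 = list(w)
        w0[i] = 0.0
        sens.append(j_max - j_input(w0))
    return sens

def prune_mask(sens, level):
    # Mark weights whose sensitivity is non-positive or falls below
    # `level` times the unit's largest sensitivity.
    s_max = max(sens)
    return [s <= 0.0 or s < level * s_max for s in sens]
```

With identical outputs and targets, the second term of Eq. (26) vanishes and every method-A sensitivity is non-negative, matching the bowl-shaped-minimum argument of Section 6.1.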
7. Convergence of the proposed constructive algorithm

For our proposed constructive OHL-FNN, the convergence of the algorithm with respect to the added hidden units is an important issue and needs careful investigation. First, we investigate an ideal case where, assuming
Fig. 17. Left column figures from top to bottom—generalization errors for HAS: (a) method A and (c) method B (solid line with and dashed line without pruning methods). Right column figures from top to bottom—cumulative number of weights pruned versus the number of hidden units: (b) method A and (d) method B.
a 'perfect' input-side training, the convergence of the constructive training algorithm with and without the error scaling operation is determined. The ideal case yields an 'upper-bound' estimate on the convergence rate that the constructive training algorithm can theoretically achieve. Furthermore, through this analysis one obtains a better understanding of the implications of a perfect or ideal input-side training. Secondly, the convergence properties of the nominal case of the constructive algorithm, where a perfect input-side training is not assumed, will be considered. Formal analysis of the input-side training is not tractable due to
Table 2
Percentage (η) of the number of weights pruned by the proposed method B with respect to that of the full input-side connections (10 runs; FVU: generalization error without pruning, FVU_p: generalization error with pruning). The pruning level is selected through experimentation to yield the best results for each case

                          5 hidden units             10 hidden units            15 hidden units
Fcn.  Pruning level (%)   FVU    FVU_p   η (%)       FVU    FVU_p   η (%)       FVU    FVU_p   η (%)
CIF   5                   0.377  0.377   0.00        0.218  0.213   3.50        0.158  0.161   4.33
SAF   1                   0.068  0.091   25.3        0.022  0.024   19.2        0.018  0.018   15.1
HAS   0.2                 0.671  0.660   22.8        0.208  0.216   27.2        0.195  0.202   20.0
the nonlinearities of the activation functions of the hidden units. Fortunately, due to the linearity of the output node of the network, one may formally analyze the output-side training to determine how the input-side training affects the output-side training, and further to determine how the output error changes with the addition of new hidden units as the constructive learning proceeds. Consequently, a sufficient condition is developed in this section for the input-side training to guarantee the convergence of the constructive algorithm. This condition also provides us with a formal method to better understand the previously designed objective functions in Kwok and Yeung (1997b). Furthermore, guidelines for the selection of the objective function to be utilized for input-side training may also be obtained.
7.1. Convergence for the ideal case

Once the network with, say, n - 1 hidden units has been constructed and trained using a BP-based or some second-order training algorithm, the output error e_{n-1}^j for the n - 1 hidden unit(s) is obtained. If one attempts to train the new (namely the n-th) hidden unit where the input-side training is assumed to be perfect or ideal, i.e. Eq. (12) is fully satisfied, then for the purpose of output-side training one can set the output-side weight of the new hidden unit v_n as

v_n = \frac{1}{\lambda},   (28)

while leaving the other existing output-side weights unchanged. Therefore, the output error e_n^j will now become

e_n^j = d^j - y_n^j = d^j - y_{n-1}^j - v_n f_n(s_n^j) = e_{n-1}^j - \frac{1}{\lambda}\{\lambda e_{n-1}^j\} = 0   (29)

for all j, implying that the constructive OHL network needs to add only one hidden unit to the previous network architecture in order to achieve zero output error. However, since Eq. (12) cannot be satisfied in general, the above ideal situation is not likely to occur. Below we provide some further explanations for this observation. If Eq. (12) holds, then we would have

w_{n,1} x_1^1 + w_{n,2} x_2^1 + \cdots + w_{n,M} x_M^1 + w_{n,0} = f^{-1}(\lambda e_{n-1}^1)
   \vdots
w_{n,1} x_1^j + w_{n,2} x_2^j + \cdots + w_{n,M} x_M^j + w_{n,0} = f^{-1}(\lambda e_{n-1}^j)   (30)
   \vdots
w_{n,1} x_1^P + w_{n,2} x_2^P + \cdots + w_{n,M} x_M^P + w_{n,0} = f^{-1}(\lambda e_{n-1}^P)

where f^{-1}(\cdot) is the inverse of f(\cdot). The above set of P simultaneous equations has M + 1 unknown variables w_{n,0}, \ldots, w_{n,M}. If P = M + 1 and the related necessary matrix rank conditions hold, then Eq. (12) will be satisfied, implying that the ideal case is feasible. However, in most regression problems P is much larger than M + 1, and hence only a LS solution may be computed from the above set of simultaneous equations.

When the output error is scaled according to Eq. (13), for C_1 < 1 we may write

e_n^j = d^j - y_{n-1}^j - \Delta v_0 - v_n f_n(s_n^j) = e_{n-1}^j - \Delta v_0 - v_n \lambda (C_1 e_{n-1}^j + C_2) = (1 - v_n C_1 \lambda) e_{n-1}^j - (\Delta v_0 + v_n C_2 \lambda),   (31)

where \Delta v_0 denotes the change in the bias of the output node. The other output-side weights v_1, v_2, \ldots, v_{n-1} are left unchanged. If one sets

v_n = \frac{1}{\lambda C_1}, \quad \Delta v_0 = -\frac{C_2}{C_1},   (32)

the error signal e_n^j will be zero for all j. For C_1 \ge 1, we have

e_n^j = d^j - y_{n-1}^j - \Delta v_0 - v_n f_n(s_n^j) = e_{n-1}^j - \Delta v_0 - v_n \lambda \left( e_{n-1}^j - \bar{e}_{n-1} + \frac{f_{max} + f_{min}}{2} \right) = (1 - v_n \lambda) e_{n-1}^j - \Delta v_0 + v_n \lambda \bar{e}_{n-1} - v_n \lambda \frac{f_{max} + f_{min}}{2}.   (33)

Therefore, by setting

v_n = \frac{1}{\lambda}, \quad \Delta v_0 = \bar{e}_{n-1} - \frac{f_{max} + f_{min}}{2},   (34)

the output error is forced to zero for all j. Consequently, when the error scaling technique is utilized and provided that Eq. (12) holds, perfect training (i.e. zero output error) is possible by properly selecting the bias of the output node and the output-side weight of only the new hidden unit that is added to the network. In practice, as pointed out above, Eq. (12) may hold in only very limited situations. One is therefore required to develop a procedure by which the output of the new hidden unit is forced to approach Eq. (12) as closely as possible. In other words, the objective is basically to select a performance index/criterion for developing a training algorithm for the input-side weights. The correlation-based criterion Eq. (10) introduced by Fahlman and Lebiere (1991) may be considered as one such design in fulfilling the above goal.

7.2. Convergence of the proposed algorithm
The choice of a given objective function plays a key role in the input-side training. There are several objective functions (Kwok & Yeung, 1997b) that can be used for training the hidden units. Their effectiveness,
however, is found to be problem-dependent. Among the objective functions considered in Kwok and Yeung (1997b), the correlation-based functions proposed by Fahlman and Lebiere (1991) appear to be quite general and work satisfactorily for most regression problems. Furthermore, they have the lowest computational load among the objective functions investigated in Kwok and Yeung (1997b). In Kwok and Yeung (1997b), a new objective function or criterion for input-side training was derived as

J = \frac{\left( \sum_{j=1}^{P} e_{n-1}^j f_n(s_n^j) \right)^2}{\sum_{j=1}^{P} (f_n(s_n^j))^2}.   (35)

It was concluded that this function, compared to the other objective functions used for input-side training of an OHL-FNN, results in somewhat better performance in solving certain regression problems (Kwok & Yeung, 1997b). Note that the above cost function was derived under the assumption that the output weights of the already-added hidden units do not change during the output-side training. We have found that in all the experiments we conducted this assumption is never valid. When output-side training is performed, it causes tangible changes to all the output-side weights of the existing n - 1 hidden units. Therefore, from both theoretical and practical points of view it is quite interesting to investigate what the 'optimum' objective function for input-side training is, and under what condition(s) the proposed algorithm will converge. This section is dedicated to addressing these questions.

Let the output weights of the hidden units be denoted by V_{n-1} = (v_0, v_1, v_2, \ldots, v_{n-1})^T before the n-th new hidden unit is added to the network. The change in V_{n-1} as a result of the output-side training after the n-th new hidden unit is added is denoted by \Delta V_{n-1}. The summed squared error of the output after the addition of the n-th hidden unit is now given by

SSE_n = \sum_{j=1}^{P} (d^j - y_n^j)^2 = \sum_{j=1}^{P} \{ d^j - (V_{n-1}^T + \Delta V_{n-1}^T) f_{n-1}^j - v_n f_n(s_n^j) \}^2 = \sum_{j=1}^{P} (e_{n-1}^j - U_n^T f_n^j)^2,   (36)

where U_n = [\Delta V_{n-1}^T, v_n]^T and f_n^j = [f_0, f_1(s_1^j), \ldots, f_{n-1}(s_{n-1}^j), f_n(s_n^j)]^T, with f_0 = 1. After some algebraic manipulations, Eq. (36) may be written as

SSE_n = SSE_{n-1} - 2 U_n^T Q_n + U_n^T R_n U_n,   (37)

where Q_n = \sum_{j=1}^{P} e_{n-1}^j f_n^j and R_n = \sum_{j=1}^{P} f_n^j (f_n^j)^T. Minimization of the above SSE_n leads to the following LS solution

U_n = R_n^{-1} Q_n,   (38)

with the minimum value given by

SSE_n^{opt} = SSE_{n-1} - Q_n^T R_n^{-1} Q_n.   (39)

Obviously, maximizing the second term in the RHS of the above equation will maximize the convergence in minimizing the SSE (or MSE). Therefore, this provides an important condition that guarantees the convergence of our constructive algorithm: provided that the input-side training for each hidden unit makes the second term in the RHS of Eq. (39) positive, the network error will decrease monotonically as the training progresses. It should also be noted that the objective function must be designed in such a way that the above term is positive and possibly maximized. Since the second term is highly nonlinear, and the many previously proposed objective functions in Kwok and Yeung (1997b) are related to it in rather complicated ways, one generally resorts to a large number of simulations to determine the objective function that works best for a given problem. This fact emphasizes the difficulties one has to deal with as far as the input-side training is concerned. Direct maximization of the second term appears to be the most effective way to achieve the fastest convergence. Therefore, the optimum objective function may be given by

J_{input,opt} = Q_n^T R_n^{-1} Q_n,   (40)

which is different from the objective function J given by Eq. (35) and derived in Kwok and Yeung (1997b). Note that the above objective function is of theoretical interest, as it will be difficult to implement in actual input-side training due to the matrix inversion R_n^{-1}. Also, note that the correlation between the output error and the output of the new hidden unit as indicated by Eq. (10) is included in J_{input,opt}, and maximization of this correlation will contribute to making J_{input,opt} positive. If this condition can be ensured each time a new hidden unit is trained, the convergence of our proposed constructive algorithm will then be guaranteed. By denoting

Q_n^T = [Q_{n-1}^T, \bar{Q}_n],   (41)

\bar{Q}_n = \sum_{j=1}^{P} e_{n-1}^j f_n(s_n^j),   (42)

R_n = \begin{bmatrix} R_{n-1} & r_{n-1} \\ r_{n-1}^T & r_n \end{bmatrix},   (43)

and assuming that r_{n-1} = 0 (implying that the output of the new hidden unit is not correlated with the outputs of all the previous hidden units, although this usually may not be satisfied in network training), we obtain

SSE_n^{opt} = SSE_{n-1} - Q_{n-1}^T R_{n-1}^{-1} Q_{n-1} - \bar{Q}_n^2 / r_n.   (46)

It follows that minimization of SSE_n yields the objective function J given by Eq. (35) and derived in Kwok and
Yeung (1997b) (the third term in the RHS of Eq. (46)). Given that the assumptions stated earlier are not valid during the constructive learning process, the objective function J is therefore only 'suboptimal'. In other words, what we have derived here is an optimal objective function that includes the objective function derived in Kwok and Yeung (1997b) as a special case. Implementing or developing a network structure that would realize the proposed objective function is left as a topic of future research.
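The convergence analysis above admits simple numerical spot-checks. The following sketch verifies, first, that the ideal-case choice of Eq. (32) zeroes every residual when the new unit exactly reproduces the scaled error (the C_1 < 1 case), and second, the scalar (single-weight) form of Eqs. (37)-(39), where the optimum SSE drops by exactly Q^2/R. All numbers are arbitrary illustrative values:

```python
# Check 1: Eqs. (31)-(32).  If f_n(s_n^j) = lambda*(C1*e_j + C2), then
# v_n = 1/(lambda*C1) and dv0 = -C2/C1 drive every residual e_n^j to zero.
lam, C1, C2 = 0.25, 0.8, 0.05
errors = [0.7, -0.3, 1.2, 0.0]                      # e_{n-1}^j
v_n, dv0 = 1.0 / (lam * C1), -C2 / C1               # Eq. (32)
residuals = [e - dv0 - v_n * lam * (C1 * e + C2) for e in errors]
assert all(abs(r) < 1e-12 for r in residuals)

# Check 2: scalar version of Eqs. (37)-(39).  With one adjustable weight U,
# SSE_n(U) = SSE_{n-1} - 2*U*Q + U**2*R is minimized by U = Q/R (Eq. (38)),
# and the minimum equals SSE_{n-1} - Q**2/R (Eq. (39)).
errs = [0.9, -0.4, 0.6]                             # e_{n-1}^j
feats = [0.8, 0.1, 0.5]                             # f_n(s_n^j)
sse_prev = sum(e * e for e in errs)
Q = sum(e * f for e, f in zip(errs, feats))
R = sum(f * f for f in feats)
sse_opt = sum((e - (Q / R) * f) ** 2 for e, f in zip(errs, feats))
assert abs(sse_opt - (sse_prev - Q * Q / R)) < 1e-12
```

Both asserts pass, illustrating that the error decreases by the (here scalar) second term of Eq. (39) whenever it is positive.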
8. Conclusions

In this paper, a new constructive adaptive NN scheme is proposed that scales the error signal during the learning process to improve the input-side training effectiveness and efficiency, and to obtain better generalization performance. All the regression simulations performed, for input spaces of up to 13 dimensions, confirmed the effectiveness and superiority of the proposed technique. Further simulations for higher-dimensional input spaces will have to be performed to confirm the full extent of our proposed method. Two pruning methods for removing redundant input-side connections were also developed. It was shown that pruning results in smaller networks without degrading or compromising their performance. Finally, convergence of the proposed constructive algorithm was discussed and an optimum objective function for input-side weight training was proposed.
Acknowledgements

This research was supported in part by the NSERC (Natural Sciences and Engineering Research Council of Canada) Discovery Grant number RGPIN-42515.

Appendix A. The 'quickprop' algorithm

The quickprop algorithm developed by Fahlman and Lebiere (1991) has played a very important role in the input-side training of our constructive OHL-FNNs. In this appendix, a brief introduction to this algorithm is provided in the context of correlation-based input-side training. Note that quickprop is a second-order optimization method based on Newton's method. As shown below, it is simple in form, but has been found to be surprisingly effective when used iteratively (Fahlman & Lebiere, 1991).

The correlation-based objective function for input-side training is reproduced here:

J_{input} = \left| \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1})(f_n(s_n^j) - \bar{f}_n) \right|,   (A1)

\bar{e}_{n-1} = \frac{1}{P} \sum_{j=1}^{P} e_{n-1}^j, \quad \bar{f}_n = \frac{1}{P} \sum_{j=1}^{P} f_n(s_n^j), \quad s_n^j = \sum_{i=0}^{M} w_{n,i} x_i^j,   (A2)

where it is assumed that there are already n - 1 hidden units in the network. The above objective function is used to train the input-side weights {w} of the n-th hidden unit; f_n(\cdot) is the activation function of the n-th hidden unit, x_i^j is the i-th element of the input vector of dimension M \times 1, e_{n-1}^j is the output node error, and f_n(s_n^j) is the output of the n-th hidden unit, all defined for the training sample j. The derivative of J_{input} with respect to an input-side weight is given by

\frac{\partial J_{input}}{\partial w_{n,i}} = \sum_{j=1}^{P} \delta^j x_i^j, \quad \delta^j = \sum_{o=1}^{I} \mathrm{sgn}(C_o)(e_{n-1}^j - \bar{e}_{n-1}) \frac{d f_n(s_n^j)}{d s_n^j}, \quad C_o = \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1})(f_n(s_n^j) - \bar{f}_n),   (A3)

where, for simplicity, \bar{f}_n is treated as a constant in the above calculation, even though it is actually a function of the input-side weights. Let us define

S_{n,i}(t) = -\frac{\partial J_{input}}{\partial w_{n,i}}, \quad i = 0, 1, \ldots, M,   (A4)

where t is the iteration step. The quickprop algorithm maximizing Eq. (A1) may now be expressed by

\Delta w_{n,i}(t) = \begin{cases} \epsilon S_{n,i}(t), & \text{if } \Delta w_{n,i}(t-1) = 0, \; (i = 0, 1, \ldots, M) \\ \dfrac{S_{n,i}(t)}{S_{n,i}(t-1) - S_{n,i}(t)} \Delta w_{n,i}(t-1), & \text{if } \Delta w_{n,i}(t-1) \ne 0 \text{ and } \dfrac{S_{n,i}(t)}{S_{n,i}(t-1) - S_{n,i}(t)} < \mu \\ \mu \Delta w_{n,i}(t-1), & \text{otherwise} \end{cases}   (A5)

where \Delta w_{n,i}(t-1) = w_{n,i}(t) - w_{n,i}(t-1), and \epsilon and \mu are positive user-specified parameters. Note that in the simulation results presented in this work \mu is selected as 1.75 and \epsilon as 0.35. Moreover, the nonlinear activation functions are all selected as 'logsig'.
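A single-weight update of Eq. (A5) can be sketched as follows; this is a minimal illustration, and the guard against a vanishing secant denominator is our addition, not part of the original rule:

```python
def quickprop_step(S_t, S_prev, dw_prev, eps=0.35, mu=1.75):
    # One weight change Delta w(t) per Eq. (A5).
    # S_t, S_prev: S_{n,i}(t) and S_{n,i}(t-1); dw_prev: Delta w_{n,i}(t-1).
    if dw_prev == 0.0:
        return eps * S_t                      # plain gradient step to start
    denom = S_prev - S_t
    if denom == 0.0:
        return mu * dw_prev                   # guard (our addition)
    ratio = S_t / denom                       # secant estimate of Newton's step
    if ratio < mu:
        return ratio * dw_prev
    return mu * dw_prev                       # growth-limited step
```

With \mu = 1.75 and \epsilon = 0.35 as used in the paper, a fresh weight (dw_prev = 0) takes the gradient step \epsilon S, while the secant step is capped at \mu times the previous change.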
References

Ash, T. (1989). Dynamic node creation in backpropagation networks. Connection Science, 1(4), 365–375.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bose, N. K., & Liang, P. (1996). Neural network fundamentals with graphs, algorithms, and applications. New York: McGraw-Hill.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian backpropagation. Complex Systems, 5, 603–643.
Castellano, G., Fanelli, A. M., & Pelillo, M. (1997). An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3), 519–531.
Chauvin, Y. (1990). A back-propagation algorithm with optimal use of hidden units. In D. S. Touretzky (Ed.), Advances in neural information processing (2) (pp. 642–649), Denver.
Cun, Y. L., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing (2) (pp. 598–605), Denver.
Draelos, T., & Hush, D. (1996). A constructive neural network algorithm for function approximation (Vol. 1). IEEE International Conference on Neural Networks, Washington, DC, pp. 50–55.
Fahlman, S. E., & Lebiere, C. (1991). The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University.
Fang, W., & Lacher, R. C. (1994). Network complexity and learning efficiency of constructive learning algorithms. Proceedings of the International Conference on Artificial Neural Networks, pp. 366–369.
Hashem, H. (1997). Optimal linear combinations of neural networks. Neural Networks, 10, 599–614.
Hush, D. R., & Horne, B. G. (1993). Progress in supervised neural networks. IEEE Signal Processing Magazine, 8–39.
Ishikawa, M. (1990). A structural learning algorithm with forgetting of link weights. Technical Report TR-90-7, Electrotechnical Lab., Tsukuba-City, Japan.
Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239–242.
Kwok, T. Y., & Yeung, D. Y.
(1996). Bayesian regularization in constructive neural networks. Proceedings of the International Conference on Artificial Neural Networks. Bochum, Germany, pp. 557–562. Kwok, T. Y., & Yeung, D. Y. (1997a). Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3), 630– 645. Kwok, T. Y., & Yeung, D. Y. (1997b). Objective functions for training new hidden units in constructive neural networks. IEEE Transactions on Neural Networks, 8(5), 1131– 1148. Lee, T. C. (1991). Structure level adaptation for artificial neural networks. Dordrecht: Kluwer. Leondes, C. T. (1998). Neural network systems techniques and applications: Algorithms and architectures. New York: Academic Press.
Lippmann, R. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4–22.
Ma, L., & Khorasani, K. (2000). Input-side training in constructive neural networks based on error scaling and pruning. Proceedings of the IJCNN 2000 (Vol. VI, pp. 455–460).
Ma, L., & Khorasani, K. (2003). A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51, 361–385.
Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1997). Efficient backpropagation training with variable stepsize. Neural Networks, 10(1), 69–82.
Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in neural information processing (1) (pp. 107–115), Denver.
Nabhan, T. M., & Zomaya, A. Y. (1994). Toward generating neural network structures for function approximation. Neural Networks, 7(1), 89–99.
Osowski, S., Bojarczak, P., & Stodolski, M. (1996). Fast second-order learning algorithm for feedforward multilayer neural networks and its applications. Neural Networks, 9(9), 1583–1596.
Prechelt, L. (1997). Investigation of the CasCor family of learning algorithms. Neural Networks, 10(5), 885–896.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 318–362). Cambridge, MA: MIT Press.
Samad, T. (1991). Backpropagation with expected source values. Neural Networks, 4, 615–618.
Sarkar, D. (1995). Methods to speed up error back-propagation learning algorithm. ACM Computing Surveys, 27(4), 519–542.
Setiono, R., & Hui, L. C. K. (1995). Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks, 6, 273–277.
Shepherd, A. J. (1997). Second-order methods for neural networks. London: Springer.
Stager, F., & Agarwal, M. (1997). Three methods to speed up the training of feedforward and feedback perceptrons. Neural Networks, 10(8), 1435–1443.
Tetko, I. V. (1997). Efficient partition of learning data sets for neural network training. Neural Networks, 10(8), 1361–1374.
Thodberg, H. H. (1996). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56–72.
Weng, W., & Khorasani, K. (1996). An adaptive structural neural network with application to EEG automatic seizure detection. Neural Networks, 9(7), 1223–1240.