European Journal of Operational Research 170 (2006) 567–577 www.elsevier.com/locate/ejor
Computing, Artificial Intelligence and Information Technology
Model selection in Neural Networks: Some difficulties

B. Curry *, P.H. Morgan

Cardiff Business School, Cardiff University, Aberconway Building, Colum Drive, Cardiff CF10 3EU, UK

Received 26 February 2003; accepted 25 May 2004
Available online 25 August 2004
doi:10.1016/j.ejor.2004.05.026

* Corresponding author. Tel.: +44 29 20875668; fax: +44 29 20874419. E-mail address: curry@cardiff.ac.uk (B. Curry).
Abstract

This paper considers two related issues regarding feedforward Neural Networks (NNs). The first involves the question of whether the network weights corresponding to the best fitting network are unique. Our empirical tests suggest an answer in the negative, whether using the standard Backpropagation algorithm or our preferred direct (non-gradient-based) search procedure. We also offer a theoretical analysis which suggests that there will almost inevitably be functional relationships between network weights. The second issue concerns the use of standard statistical approaches to testing the significance of weights or groups of weights. Treating feedforward NNs as an interesting way to carry out nonlinear regression suggests that statistical tests should be employed. According to our results, however, statistical tests can in practice be indeterminate. It is rather difficult to choose either the number of hidden layers or the number of nodes on this basis.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Neural Networks; Network weights; Hidden layers; Backpropagation; Polytope
1. Introduction

In this paper feedforward Neural Networks (NNs) are viewed as devices for flexible nonlinear regression, relying on the property of 'universal approximation'. Theoretical work on such approximation dates back to Cybenko (1989) and Hecht-Nielsen (1989). Despite a plethora of such results, and despite our own confidence in the approximation properties of NNs, there remain numerous questions to be asked regarding selection of the 'best' network for a given data set. Zhang et al. (1998), in their review of NNs in a forecasting context, comment that "... there is no simple clear-cut method for determination of ... parameters". They also regard network design as "more of an art than a science". We discuss two related difficulties. The first is an identification problem manifesting itself in an inability to replicate weight values for a trained network. As regards finding a unique set of
estimated network weights (parameters), our analysis leads us generally to a pessimistic view. The second problem concerns the use of statistical principles in choosing the number of hidden layers and the number of hidden nodes. White (1989b) has shown in formal terms that standard statistical tests are available, which is not surprising if one views NNs as a form of nonlinear regression. However, in the experiments reported below, a statistical approach is far from being conclusive. In what follows, we review in Sections 2 and 3 the properties of NNs and discuss issues involved in their implementation. This includes the choice of learning algorithm. In Section 4, we present some theoretical investigations of interdependence between network weights, using a Taylor series approach: this section also contains an example based on simulated data. We then report on experiments with three real data sets.
2. Neural Networks

2.1. The basic model

A feedforward logistic NN is defined as follows:

    Y = f(X) + u,                                                        (1)

where Y denotes the output, X = (X_1, ..., X_m) denotes the m input variables, and u is an error term. In general, u is presumed, albeit often implicitly, to possess the usual desirable properties. We denote the number of sample observations by n. Let the hidden layers be indexed by p = 1, ..., r. For p = 1, we have the first hidden layer, connected directly to the inputs. The rth hidden layer is connected directly to the output. Each hidden layer p contains q(p) nodes: the jth node in layer p is denoted by h_j^{(p)}. We also have the following defined weights:

w_{ik} denotes the weight connecting the kth input to the ith node in the first hidden layer.
v_{st}^{(p+1)} denotes the weight connecting the tth node in hidden layer p to the sth node in hidden layer p + 1, for p = 1, ..., r - 1.
o_v denotes the weight from the vth node in hidden layer r to the output, for v = 1, ..., q(r).

Denoting the logistic (sigmoid) function by L, the network model can be expressed as

    h_i^{(1)} = L\left( \sum_{j=1}^{m} w_{ij} X_j \right),               (2)

    h_i^{(p+1)} = L\left( \sum_{t=1}^{q(p)} v_{it}^{(p+1)} h_t^{(p)} \right)    (3)

for p = 1, ..., r - 1, and

    f(X) = \sum_{v=1}^{q(r)} o_v h_v^{(r)}.                              (4)

Bias terms may be included by setting appropriate variables to unity. One may optionally apply a logistic transformation to the output, but it is the functions employed in the hidden nodes which provide the universal approximation property. The case of a logistic network with two inputs, no bias terms and a single hidden layer comprising q hidden nodes would be expressed as follows:

    Y = \sum_{i=1}^{q} \frac{o_i}{1 + e^{-(w_{i1} X_1 + w_{i2} X_2)}}.   (5)
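To make the structure of Eqs. (2)-(5) concrete, the following short sketch evaluates the single-hidden-layer network of Eq. (5) in numpy. It is an illustration only: variable names, array shapes and the test values are our own choices, not the authors'.

    import numpy as np

    def logistic(z):
        """The sigmoid function L used in the hidden nodes."""
        return 1.0 / (1.0 + np.exp(-z))

    def forward(X, W, o):
        """Eq. (5): Y = sum_i o_i * L(w_i1*X_1 + w_i2*X_2) for each observation.

        X: (n, m) array of inputs; W: (q, m) input-to-hidden weights;
        o: (q,) hidden-to-output weights.
        """
        H = logistic(X @ W.T)    # hidden-node outputs, cf. Eq. (2)
        return H @ o             # linear output layer, cf. Eq. (4)

    # A tiny example with q = 3 hidden nodes and m = 2 inputs
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(5, 2))
    W = rng.normal(size=(3, 2))
    o = rng.normal(size=3)
    print(forward(X, W, o))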
One can see immediately that the structure is somewhat unwieldy, and the fitting process will inevitably be problematic. White (1989a) uses the term 'Non-Linear Least Squares' (NLS) specifically to describe the application of a quadratic error metric to Eq. (1). This definition of NLS incorporates no specific fitting algorithm, although White notes that optimization of the error sum of squares is 'computationally demanding'. The term NLS can of course be given a much broader interpretation. The most common way to apply the standard network model is to optimize RMS (root mean square) error through Backpropagation (BP), as employed by Rumelhart et al. (1986). White (1989b) shows that BP arises from application of the method of Stochastic Approximation originally suggested by Robbins and Monro (1951). Gradient-based methods such as BP are especially problematic in the context of NNs since the sum of squares function may be quite flat in places and subject to 'narrow valleys', from which a search algorithm may find difficulty in 'escaping'. These points are amplified below.
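For concreteness, the pattern-by-pattern updating that BP applies to the network of Eq. (5) can be sketched as follows. This is our own illustration of the gradient (stochastic-approximation) step, not the authors' implementation; the learning rate, data and epoch count are arbitrary placeholders.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bp_epoch(X, y, W, o, lr=0.05):
        """One pass of pattern-by-pattern gradient descent on the squared error."""
        for x_t, y_t in zip(X, y):
            a = logistic(W @ x_t)                  # hidden-node outputs
            e = a @ o - y_t                        # prediction error for this pattern
            grad_o = e * a                         # dE/do_i = e * a_i
            grad_W = e * (o * a * (1 - a))[:, None] * x_t[None, :]   # chain rule
            o -= lr * grad_o
            W -= lr * grad_W
        return W, o

    # Example: a few epochs on arbitrary data with q = 3 hidden nodes
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(50, 2))
    y = X[:, 0] ** 2 + X[:, 1] ** 2
    W, o = rng.normal(size=(3, 2)), rng.normal(size=3)
    for _ in range(200):
        W, o = bp_epoch(X, y, W, o)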
In the case of BP, it is common to observe a rapid early reduction of RMS error followed by a relatively slow 'crawl' along the flat-bottomed narrow valleys. These points were put forward in our earlier papers (e.g. Curry and Morgan, 1997; Curry et al., 2002), in which we provide evidence of possible shortcomings of BP and consider a direct (i.e. non-gradient-based) alternative known as the Polytope Algorithm. In various experiments the Polytope was found to offer superior performance. The same algorithm is employed in this paper and compared to BP. In summary, the algorithm involves maintaining a highly flexible simplex or type of 'Polytope' which, through operations of stretching, reflection, etc., moves and adjusts itself to the shape of the unknown function to be optimized. We use the term Polytope to avoid confusion with the Simplex algorithm for Linear Programming. Further explanation is provided in Section 5 below.

2.2. Approximation issues

White (1989b) is among numerous authors who demonstrate that a feedforward NN can adapt to, i.e. approximate, arbitrary functional forms and arbitrary stochastic processes. Barron (1993) stresses that the functional form in Eqs. (2)-(4) above contains the Fourier series approximation as a special case. White (1989b) also shows that a single hidden layer of nodes is technically sufficient for such 'universal approximation'. However, it should be noted that these approximation theorems relate to limiting cases in which the number of nodes becomes infinitely large. Empirical support for approximation properties is also reported by Morgan et al. (1999), who show that NNs can provide notably accurate approximations to a wide range of interesting functional forms. These include the 'Mexican Hat', the shape of which provides an excellent challenge for the network. A notable finding in our earlier work is that additional hidden layers are often required. This is not only from the point of view of goodness of fit: in our experiments, additional layers can also provide better generalization, due to the additional complexity offered by the multi-layer model.
A number of authors, e.g. Kurkova (1992), deal with models containing two or more hidden layers, but there is a need for further theoretical research on the precise effects of additional layers. A rare paper on this particular topic (Tamura and Tateishi, 1997) is also noteworthy in that it deals with a finite number of nodes. The authors show that networks with one and two hidden layers are strictly equivalent only with an infinite number of nodes. However, they show that a network with a second hidden layer, while being superior in terms of the number of weights needed, can also provide arbitrarily close approximations. See also Chester (1990). As far as practical applications are concerned we would wish to sound a note of caution: it may be dangerous simply to assume that a single layer will naturally provide adequate approximations. Even with universal approximation, model selection issues are still crucial. This is because of the rather neat irony that better approximation within the training set naturally increases the risk of overfitting. The very strength of NNs, their ability to approximate, is also their principal weakness. Ultimately, it is the capacity of the network to generalize which is most important.
3. Model selection issues

Although there is continuing debate on model selection strategies, one area of agreement is that simple rules of thumb have little to offer. Such rules, suggesting that the number of hidden nodes should be directly related to the number of inputs and outputs, have historical interest, having been popular immediately after the pathfinding work of Rumelhart et al. (1986). The simplest, and probably still the most popular, sensible strategy is to focus directly on minimising RMS error, experimenting where necessary regarding selection of inputs and hidden nodes. This could be labelled a 'practitioner's' strategy. One general difficulty is the excessive computational burden which is placed on the process of experimentation: the cause is 'combinatorial explosion' in the number of different models which need to be run.
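To give a rough sense of this combinatorial explosion, the candidate model space for a hypothetical experiment can be enumerated as follows. The ranges below are arbitrary illustrations, not the design used in this paper.

    from itertools import combinations, product

    inputs = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']       # candidate input variables
    input_subsets = [s for r in range(1, len(inputs) + 1)
                     for s in combinations(inputs, r)]   # 2^6 - 1 = 63 subsets
    hidden_layouts = [(h1, h2) for h1 in range(1, 7)     # 1-6 nodes in layer 1,
                      for h2 in range(0, 7)]             # 0-6 in layer 2 (0 = single layer)
    learning_rates = [0.01, 0.05, 0.1]                   # BP tuning parameters
    runs_per_model = 5                                   # repeated random restarts

    configs = list(product(input_subsets, hidden_layouts, learning_rates))
    print(len(configs) * runs_per_model)                 # 63 * 42 * 3 * 5 = 39690 runs

Even these modest ranges imply tens of thousands of training runs, each of which may take considerable time.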
The problem arises with any model selection strategy employed, although certain strategies seek to prune the search tree. Experimentation is required to test subsets of inputs, to test the number of hidden layers and the number of hidden nodes. Also, in algorithms such as BP there are parameters such as learning rates. The end result is that even if there is to be merely a single 'run' of each case, a complete range of experiments is prohibitive. Each run may take considerable time, possibly measured in days.

A natural progression from 'simple' use of RMS is to exploit the fact that RMS is essentially a statistical concept. Minimization of RMS, equivalent to least squares, follows directly from adoption of a Maximum Likelihood approach, provided of course that the nonlinear functional relationship of the feedforward NN is supplemented with a well-behaved additive error term. This latter requirement is often adopted only implicitly. A firm foundation for the use of a statistical approach is provided by White (1989b), both for estimation of network weights through Maximum Likelihood methods and for testing hypotheses. For the latter, tests can naturally be based on the change in RMS when a single weight or group of weights is set to zero. The statistical foundations laid by White (1989b) are the basis of work by, for example, Cottrell et al. (1995) and Gorr et al. (1994). The former consider the use of a Neural Network as a nonlinear extension of autoregressive time series models. Their approach involves applying statistical principles to eliminate non-significant weights.

Anders and Korn (1999) stress that a serious problem to be faced with NN model selection is that the network weights are under-identified, meaning in particular that the RMS function will be subject to 'flat' regions. They refer to the work of Phillips (1989), in which a notable consequence of local under-identification is that the asymptotic distribution of the estimates will be non-normal. As to reasons why there should be an identification problem with NNs, they offer two suggestions. The first is that it is always possible simply to exchange within any hidden layer any pair of nodes and their associated weights, leaving computations with the network unaltered.
Secondly, setting certain weights to zero will automatically remove branches from the network. For example, in Eq. (5) above, setting any output weight o_i to zero will immediately remove weights w_{i1} and w_{i2}. Under-identification is an issue also addressed by White (1989b), who notes that the limiting distribution of the estimated network weights is subject to what he calls 'flats', arising because of redundant inputs or irrelevant hidden nodes.

The view taken here is that the problem is more serious, there being naturally occurring parameter redundancy. The hidden nodes may be regarded as nonlinear analogues of the latent variables or Principal Components employed in multivariate statistics; see Curry and Peel (1998). In Eq. (5), for example, even with two hidden nodes, we have six network weights, and of course the number of weights grows explosively as hidden nodes are added. This basic indeterminacy is stronger than would arise through simply exchanging weights. In Principal Components Analysis, the aim is to transform a vector of m variables to a vector of, say, d orthogonal components, and this requires a loading matrix of order d by m. Conventional direct estimation of such a matrix by least squares is therefore not possible. We offer some further theoretical investigation of this issue in Section 4 below.

Here we may note that parameter redundancy has implications for any modelling strategy which makes explicit use of the number of network weights (or degrees of freedom). These include standard criteria such as AIC, BIC, etc.: for a review in the context of NNs see Ripley (1996). Such measures suffer from the difficulty of calculating the number of degrees of freedom in the context of parameter redundancy. A link can also be made to other penalty-based methods, based on penalising network complexity or on forcing weight decay. Although penalty-based methods involve network weights, a direct focus on weights leads us to 'pruning' methods. These may, for example, be based on the Hessian matrix of the weights, and hence are based on statistical principles. See Ripley (1996) for further details. For a variance-based pruning approach see also Morgan et al. (2000).
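The node-exchange source of under-identification noted above is easily verified numerically. The following small sketch (our own illustration, assuming numpy; it is not taken from the paper) permutes the hidden nodes of the Eq. (5) network and confirms that the fitted function is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)

    def network(X, W, o):
        """Eq. (5): single hidden layer, logistic activations, linear output."""
        H = 1.0 / (1.0 + np.exp(-X @ W.T))   # hidden-node outputs, shape (n, q)
        return H @ o

    q, m, n = 3, 2, 5
    W = rng.normal(size=(q, m))              # input-to-hidden weights w_ij
    o = rng.normal(size=q)                   # hidden-to-output weights o_i
    X = rng.normal(size=(n, m))

    perm = rng.permutation(q)                # relabel the hidden nodes
    y1 = network(X, W, o)
    y2 = network(X, W[perm], o[perm])
    print(np.allclose(y1, y2))               # True: the fit is unchanged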
4. Theoretical considerations

4.1. An approach using Taylor series

A notable feature of our experience with NNs is that it is normally impossible to replicate a set of weights. Even with a preferred network structure, offering the best fit, there is no unique set of weight values. If we simply initialize the weight values, using random starting points, it is possible to replicate the RMS but not the individual weight values. Indeterminacy of weights arises simply because of the presence of latent variables: the effective degrees of freedom fall short of the number of weights, and there are substantial interdependencies between weights. Kurkova and Kainen (1994) use the term 'functionally equivalent' weight vectors. They present theorems regarding the existence of indeterminacies, without details on their precise nature. They also make the interesting point that indeterminacies could profitably be used in the search process, with the need only for a sample of one weight vector from a set of equivalent vectors.

Further insights on this question are offered by Williamson and Helmke (1995), who consider the very simple case of a feedforward logistic network with a single input and single hidden node. Among the questions they consider is the uniqueness of the 'best' approximation provided by the network to an arbitrary underlying function. The approximations are examined only in relation to a finite interval. For the very limited class of networks studied, Williamson and Helmke (1995) establish uniqueness of best approximation only for the case of the supremum norm and not for the standard least squares approach generally applied in this context. They note with some degree of pessimism that more typical networks with multiple inputs and multiple hidden nodes pose 'truly formidable difficulties'.

Given the intractability of the functional form we have investigated these issues using the MAPLE symbolic manipulation software. We focus on the case of a single hidden layer, with two inputs and three hidden nodes, corresponding to Eq. (5) above. For convenience we need to change notation.
We use p1, p5 and p9 as weights from the hidden nodes to the output. Weights p2, p6 and p10 are bias terms; p3, p4, p7, p8, p11 and p12 are weights from inputs to hidden nodes. The network is then as follows:

    Y = \frac{p_1}{1 + \exp(p_2 + p_3 x_1 + p_4 x_2)}
      + \frac{p_5}{1 + \exp(p_6 + p_7 x_1 + p_8 x_2)}
      + \frac{p_9}{1 + \exp(p_{10} + p_{11} x_1 + p_{12} x_2)}.          (6)
As in previous work (Curry et al., 2000) we have used a Taylor series approach to investigate the network. Teräsvirta et al. (1993) also use a manually derived Taylor series approximation, proceeding as far as third-order terms. They use derivatives with respect to the weights as opposed to the variables, which is necessary in the context of autoregressive models. Here, however, we are concerned with functional forms and use derivatives with respect to the input variables. MAPLE computes multivariate Taylor series through the COEFTAYL command, which specifies both the expansion point and the order of the expansion. We obtain the (i, j)th coefficient for the function f of arguments x1 and x2, expanded around the origin, using the command

    C[i, j] := COEFTAYL(f, [x1, x2] = [0, 0], [i, j]);

The solve command can be used to obtain expressions for the weights, given suitable constraints. As an example, with the following command we may set the Taylor coefficients C[0, 0], C[0, 1] and C[1, 0] to zero and solve for the weights p1 to p12:

    sol1 := solve({C[0, 0], C[0, 1], C[1, 0]}, {p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12});

Zero constraints in the command 'guide' the MAPLE solutions towards certain types of functional form. This is useful given that we are dealing with a Taylor approximation, and also compensates for the use of only three hidden nodes in Eq. (6). From the above command, MAPLE generates four distinct solutions.
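For readers without MAPLE, the same kind of calculation can be sketched with the sympy library. This is our re-expression, not the authors' code: for tractability it solves only the single constraint C[0, 0] = 0 for the bias p2, whereas the paper solves the full system of three constraints over all twelve weights.

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    p = sp.symbols('p1:13')          # p1, ..., p12 as in Eq. (6)

    # Three-hidden-node logistic network of Eq. (6)
    f = sum(p[4*k] / (1 + sp.exp(p[4*k + 1] + p[4*k + 2]*x1 + p[4*k + 3]*x2))
            for k in range(3))

    def C(i, j):
        """(i, j)th Taylor coefficient of f about the origin (cf. COEFTAYL)."""
        g = sp.diff(f, x1, i) if i else f
        g = sp.diff(g, x2, j) if j else g
        return g.subs({x1: 0, x2: 0}) / (sp.factorial(i) * sp.factorial(j))

    # Solving C[0, 0] = 0 for the bias p2 already yields a functional
    # dependency between the weights, in the spirit of Eq. (7) below
    print(sp.solve(sp.Eq(C(0, 0), 0), p[1]))

Even this single constraint returns p2 as an explicit function of the remaining weights, which is exactly the kind of dependency reported below.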
The solutions involve notable and often quite simple functional dependencies between weights. For example, within the fourth solution set we have

    p_2 = \ln\left( \frac{p_1 \exp(p_6) + p_1 + p_5 \exp(p_6)}{p_5} \right).        (7)

To pursue the case further we have generated a series of 100 data points at random from the function y = x_1^2 + x_2^2. The function was applied over the region (-1 < x_1 < +1, -1 < x_2 < +1) and a small noise component was added, using a standard deviation of 0.001. The three constraints on the Taylor coefficients in the example above involve setting to zero coefficients of both zero and first order: these tally with the simple quadratic function under discussion. The three-node network was trained using the Polytope algorithm. To check the validity of the Taylor approach, the first MAPLE solution was substituted into the network function in Eq. (6): a set of weights from the Polytope fit was then applied. The resulting network function corresponds very closely with the original target quadratic, except at the edges. The use of a Taylor approximation does not seem to have caused undue damage.

Our example is consistent with the theoretical work of Kurkova and Kainen (1994) and Williamson and Helmke (1995). It suggests the existence of numerous interdependencies. Similar collections of solutions can easily be derived for other cases, setting various other Taylor coefficients to zero. The presence of interdependencies between weights is not an isolated instance. This confirms the point made in Section 3 above, that the identification problem in NNs is more serious than suggested by Anders and Korn (1999). Given that the MAPLE model just discussed is quite small, we also need to consider how far it can be generalized. Similar relationships between weights emerge when the model is extended, but the capability of MAPLE to handle extensions is quite limited. The number of potential solutions for the solve command to handle increases explosively with the number of inputs and the number of hidden nodes. It also depends on the maximum degree for which the Taylor series coefficients are set to zero. As the size of the model is increased, MAPLE runs out of stack memory and
simply cannot generate results. This may happen even with two inputs, but is much more likely with three.
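For reference, the simulated data set described above can be generated along the following lines. This is a sketch assuming numpy; the exact random-number scheme used by the authors is not reported, and the seed below is arbitrary.

    import numpy as np

    rng = np.random.default_rng(42)

    n = 100
    X = rng.uniform(-1.0, 1.0, size=(n, 2))   # inputs drawn from (-1, 1) x (-1, 1)
    y = X[:, 0]**2 + X[:, 1]**2               # target function y = x1^2 + x2^2
    y += rng.normal(0.0, 0.001, size=n)       # small noise, standard deviation 0.001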
5. Experiments

5.1. General

Given the various issues noted above, we have carried out a number of experiments relating to the following specific points of comparison.

1. Networks trained with Polytope and Backpropagation algorithms.
2. Networks with single and double hidden layers.
3. Networks with logistic activation and hyperbolic tangent activation.

In the various empirical tests we employ a 'raw' unadjusted RMS error criterion, quoting R² values in addition in order to provide comparisons between NN models and linear regressions. This is standard practice in the NN 'community': in engineering and other disciplines RMS error is a natural choice. Of course, following standard practice is not in itself a convincing reason, and one would naturally wish to consider other error criteria such as AIC, BIC, etc. The problem with such criteria in the context of NN modelling is to establish the 'true' degrees of freedom. This is a consequence of the identification and parameter dependency problems discussed above. The 'effective number of parameters', in the sense employed by Moody (1992), is as yet unknown, so it seems dangerous to use a count of the network weights: our preference is for using 'raw' RMS.

5.2. Data aspects

Three data sets have been employed in this section. The first, kindly provided by GFK Marketing Ltd., relates to purchases of Sony TV sets in the UK and is based on the hedonic model, whereby price is dependent on bundles of attributes, the aim being to analyse the quality of a product. Six such quality variables were used as inputs (see Curry et al., 2002 for a more detailed description
of the data and earlier experiments involving a feedforward NN and linear regression). Some 447 observations were available. It was found that although a Neural Network model could improve on the goodness of fit of a linear regression, the latter provided a reasonable approximation away from the extremes.

The second data set relates to sales of dishwasher machines in the US, the data being publicly available at http://lib.stat.cmu.edu/DASL. Annual dishwasher sales, over a period of 26 years, are taken to be dependent on total annual expenditure on durable goods and private residential investment. The final data set involves temperature data, again for the US. It is also available from the above source. The average temperature in 56 cities is taken to be dependent on the degrees of latitude and longitude. As can be seen at the source website, the second and third data sets exhibit substantial nonlinearity.

5.3. Implementation issues

For the training process our main focus is on the Polytope algorithm, employed because of previous favourable experience. The algorithm makes no use of gradients, nor does it require decisions regarding the order in which data points are presented. Its implementation requires that for m parameters (weights) we define m + 1 points situated at the vertices of a regular simplex in the space of m dimensions. Initial weight values are generated by random drawing from a uniform distribution on the interval (-1, 1). At each iteration a reflection of the least optimal vertex is generated through the centroid of the remaining m vertices, and the objective function value is calculated. If the resulting value is worse than the worst, then we contract the whole original simplex by replacing the worst vertex by one closer to its base (which contains the centroid of the m best vertices). If this reflected point is better than the worst but not as good as the best, then we can at least get rid of the worst by replacing it with the reflected value. If the reflected value is better than the best, then we try expanding the simplex in its direction.
If reflection, expansion and contraction all fail, then we assume that the optimum probably lies somewhere inside the simplex and we try shrinking it by moving all the vertices toward the best vertex. Some additional (but minor) refinements to improve convergence were also adopted. The algorithm can terminate when there is no change in the objective function (RMS error). In practice, under double precision this means stopping when the change is less than 10^-15. Alternatively, it can stop when the values of the parameters are unchanged. In practice this involves stopping when the mean of the sum of the relative absolute differences is less than 10^-14. As a further safeguard against the method getting trapped in a subspace of the parameters, the method was restarted by constructing a new simplex with size one-tenth the initial one and incorporating the current best vertex. The restarts were made with simplices of this same size until successive restarts were within 10^-15 of each other in terms of the sum of squared errors, or a set number (30) of restarts had been made. Further details can be found in Curry and Morgan (1997).

For the cross-section data (television sets and temperature) we have worked with a training set of approximately 75% of the data points, selected at random (using random numbers generated in EXCEL). For the dishwasher data the first 75% of values were used as a training set. To conform to standard econometric practice we use 'split-sample validation': we first optimize with respect to the training set and then proceed to test for overfitting by checking the RMS or R² for the test set (using the weights from the training set). The error between the fitted model and the test set provides an estimate of the generalization error. This particular approach is well established and has also been adopted here because of training-time considerations.

For Backpropagation, the Neuralworks Professional software, from Neuralware Inc., was employed. This package has no documented capability to handle skip layer connections, but it was found that such a model could be employed by manipulating the on-screen network display. For both linear and NN fits, the models were evaluated using RMS error. The unadjusted R² gives identical results, and is presented below to facilitate comparisons with linear models.
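As an illustrative sketch of this training scheme (not the authors' code), a network of the Eq. (5) type can be fitted by a direct-search routine such as scipy's Nelder-Mead implementation, with random restarts and a 75/25 split-sample check. The tolerances, restart count and synthetic stand-in data below are placeholders rather than the values and data sets used in the paper.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    def network(weights, X, q):
        """Single-hidden-layer logistic network of Eq. (5) with q hidden nodes."""
        m = X.shape[1]
        W = weights[:q * m].reshape(q, m)        # input-to-hidden weights
        o = weights[q * m:]                      # hidden-to-output weights
        H = 1.0 / (1.0 + np.exp(-X @ W.T))
        return H @ o

    def rms(weights, X, y, q):
        e = network(weights, X, q) - y
        return np.sqrt(np.mean(e ** 2))

    # Synthetic stand-in data (the paper's data sets are not reproduced here)
    X = rng.uniform(-1, 1, size=(100, 2))
    y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(0, 0.001, size=100)

    # 75/25 split-sample validation
    idx = rng.permutation(len(y))
    train, test = idx[:75], idx[75:]

    q = 3
    best = None
    w = rng.uniform(-1, 1, size=q * X.shape[1] + q)      # initial weights on (-1, 1)
    for restart in range(5):                             # a handful of restarts
        res = minimize(rms, w, args=(X[train], y[train], q),
                       method='Nelder-Mead',
                       options={'xatol': 1e-10, 'fatol': 1e-12, 'maxiter': 20000})
        if best is None or res.fun < best.fun:
            best = res
        w = best.x + rng.uniform(-0.1, 0.1, size=w.shape)  # perturbed restart

    print('train RMS:', best.fun)
    print('test RMS :', rms(best.x, X[test], y[test], q))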
5.4. Results

Our main results are summarized in Table 1 below: for reference, the table contains results from a linear regression. The number of hidden nodes and layers was varied and for each case five runs were carried out from randomly generated starting points for the weights. A uniform distribution was used for this initialization. The algorithms conduct their search in terms of RMS error, but for ease of comparison across data sets we have expressed final results in terms of R². The values contained in the table represent the best R² obtained over multiple runs, varying the number of layers and hidden nodes. Testing for significance of individual coefficients is also of interest, but not in the present context. We note that asymptotic t-tests are available for this purpose (White, 1989b), although they are not part of standard NN practice. The indeterminacy of network weights, reported and examined in more detail below, militates against such formal testing. Space considerations dictate omission of details such as the number of hidden nodes. Although such issues are important, they are not the primary focus of this paper: here we are concerned with the ability of network models to improve goodness of fit.

Table 1
Unadjusted R² values for the three data sets

Data set            Linear regression   Polytope    Backprop
Sony TV data
  In sample         0.825287            0.898860    0.829350
  Out of sample     0.819733            0.813970    0.827589
Dishwasher data
  In sample         0.929578            0.999955    0.993866
  Out of sample     0.871807            0.922198    0.747593
Temperature data
  In sample         0.766305            0.984528    0.971805
  Out of sample     0.633504            0.954705    0.938510

In Table 1 there is no indication of serious overfitting for any of the data sets. There is no tendency for the goodness of fit to fall off substantially out of sample. Secondly, the NN models were consistently able to attain higher R² values for the training sets compared with linear regression, although not necessarily substantially higher. For the Sony TV set data a linear fit offers what one might term a reasonable working model for managers. The best R² value of 0.898860 was obtained by applying the Polytope algorithm: the corresponding out-of-sample R² was 0.813. Applying linear regression produces an R² of 0.825287 for the training set, falling to 0.819733 out of sample. In terms of linearity testing, we may say that there are certain nonlinearities present, which the network can capture. These nonlinearities are more prominent in the extremes, where the curvature of the model produces a greater departure from linearity. For the dishwasher data, the Polytope achieves a best R² of 0.999955, compared with a value of 0.929578 for linear regression. In the case of the temperature data the respective values are 0.987790 and 0.766305. These improvements over linearity are much more pronounced, as one would expect given the nonlinearities present in the data (see http://lib.stat.cmu.edu/DASL for details).

Our focus here on the Polytope algorithm is supported by the fact that the algorithm was never outperformed by Backpropagation, and showed some tendency to provide superior fit, confirming our previous experience (Curry and Morgan, 1997). This was especially true of the Sony TV data: although the Backpropagation runs failed to improve on the linear version for that data set, the Polytope algorithm was able to provide some improvement. For the dishwasher and temperature data, BP achieved respective R² values a little lower than those obtained from the Polytope. One simple advantage of BP over the Polytope, however, is the computation time required: a BP run takes several minutes, whereas Polytope runs may take several days.

A disturbing aspect of our results relates to the final values of the weights that are generated. In a simple linear regression, for example, one may be confident that the parameter estimates constitute a distinct global minimum of the sum of squared errors. With nonlinear regressions, however, where a search algorithm needs to be employed, it is
important to consider whether this is indeed the case. Not only do we need to be wary of local minima, but we need to consider whether a single set of weights can be relied upon. For the three data sets reported in this paper, it is somewhat alarming to report that although we can carry out repeated runs of the model and replicate the level of fit achieved, we are unable to replicate the precise values of the weights. Similar problems have emerged in other studies we have carried out. In statistical terms, this reflects an identification problem not always the subject of careful attention by researchers. It is apparently only in the case of very simple models that we can break out of this difficulty. For example, in an earlier paper (Curry et al., 2000), we report on fitting a network to 30 data points generated from a simple bivariate quadratic function with no noise. In this case, the weights are stable under replication. We were also able to show, using a 5th-order Taylor series expansion constructed in MAPLE, that the weights could be translated into polynomial coefficients close to those of the initial quadratic function.

5.5. Further implications

It is common to refer to the NN as a 'black box', and researchers frequently make no reference to the actual weight parameters of a trained model. (This is partly because of its complexity, as indicated in Eqs. (2)-(4) above.) A notable exception is Moutinho et al. (1996), who attempt an interpretation of the weights in terms of their model of consumer satisfaction. Although it is both possible and desirable to extract and analyse weight values, our experience leads to the rather pessimistic view that interpretation of individual coefficients may not be possible. We can simply use NNs as a test of some other more tractable case such as a linear model. The network provides valuable comparisons, but is not a true parametric model in its own right.

Our experiments also suggest problems with using a statistical approach to select the best network for any given data set. Table 2 below shows for each of our data sets the R² values for certain networks, varying both the number of layers and the number of hidden nodes. The table relates to the Polytope algorithm applied with the sigmoid activation function.
Table 2
Best R² values for various networks (Polytope algorithm)

Network             Best R²
Sony TV data
  2:0               0.842114
  4:0               0.850342
  6:0               0.884945
  2:2               0.840045
  2:4               0.840243
  2:6               0.898860
Dishwasher data
  2:0               0.990500
  3:0               0.997337
  4:0               0.998379
  2:2               0.996992
  3:2               0.999480
  3:3               0.999955
Temperature data
  2:0               0.959660
  3:0               0.975276
  4:0               0.984528
  2:2               0.970932
  3:2               0.964220
  3:3               0.966927

(x:y) denotes a network with x and y nodes respectively in the first and second hidden layers; where y = 0 there is a single hidden layer.
For the dishwasher and temperature data only two inputs are available, and the table contains results for up to four hidden nodes in the first layer; for the TV data there are six inputs and we show results for up to six nodes in the first hidden layer. In general, it is difficult to establish the best network for each data set. Similar indeterminacy emerges if Backpropagation is employed. It can be seen that in all three cases there is no clear-cut 'winner'. This is especially important given that there is in general variation between different runs for the same network. As an example, the R² values for a network with two hidden layers, each containing two nodes, range from 0.830807 to 0.844074.

6. Summary and conclusions

Although confident in the approximation properties of NNs, we feel less confident in some other
aspects. In particular, our experiments and the theoretical results in Section 4 suggest a serious identification problem. Although one can replicate the goodness of fit for a given network, it is not in general possible to replicate the weight values. This holds for our three data sets and for other cases, whether we use standard BP or the Polytope algorithm. It also holds both for logistic and hyperbolic tangent activation functions.

The implications are a cause for concern. First of all, the absence of unique best-fitting weights removes our ability to look inside the NN 'black box'. It is possible to compute partial derivatives with respect to the input variables, although the expressions are far from convenient. However, without unique values for the weights such expressions have little value. One is left with general statements about overall departures from linearity, in cases where a number of network models outperform linear regressions. Such problems are less serious in a forecasting context, in the same way that collinearity or interdependence between explanatory variables matters less than overall forecast accuracy. However, forecasts become more trustworthy if they are based on a credible underlying structure, and forecasts not based on stable parameter values are more likely to degenerate outside of the data set or in response to structural changes.

A second implication of weight indeterminacy concerns the use of metrics such as AIC, BIC or even R² adjusted for degrees of freedom. The basic problem is that it is unclear how one should compute degrees of freedom, or what is described by Moody (1992) as the 'effective number of parameters'. Lack of identification of weights also potentially explains why we have found it difficult to choose on statistical grounds a single network structure which provides the best fit. It should be noted that we have considered statistical issues by actually fitting or training NNs. In other approaches, this is not the case. For example, in the NN test for neglected nonlinearity examined by Lee et al. (1993), network weights are selected at random and tests are applied to the results of linear regressions.

Overall, the choice of model selection strategies is still very much open. Statistical approaches need
to be compared and contrasted with pruning or penalty-based methods, for which the aim is to deal with issues of network structure in the training algorithm itself. See for example Ripley (1996) for a review.
Acknowledgement The authors are grateful to Prof. Mick Silver and to GFK Marketing for help with the provision of data.
References

Anders, U., Korn, O., 1999. Model selection in neural networks. Neural Networks 12, 309–333.
Barron, A.R., 1993. Universal approximation bounds for superpositions of a sigmoid function. IEEE Transactions on Information Theory 39, 930–945.
Chester, D.L., 1990. Why two hidden layers are better than one. In: International Joint Conference on Neural Networks, Washington, DC, vol. 1. Lawrence Erlbaum, pp. 265–268.
Cottrell, M., Girard, B., Girard, Y., Mangeas, M., Muller, C., 1995. Neural modelling for time series: A statistical stepwise method for weight elimination. IEEE Transactions on Neural Networks 6 (6), 1355–1364.
Curry, B., Morgan, P., 1997. Neural networks: A need for caution. OMEGA 25 (1), 123–133.
Curry, B., Morgan, P., Beynon, M., 2000. Neural networks and flexible approximations. IMA Journal of Mathematics Applied in Business and Industry 11 (1), 19–35.
Curry, B., Peel, M.J., 1998. Neural networks and business forecasting: An application to cross sectional audit fee data. International Journal of Commerce and Management 8 (2), 94–120.
Curry, B., Morgan, P.H., Silver, M., 2002. Neural networks and non linear statistical methods: An application to modelling of price quality relationships. Computers and Operations Research 29, 951–969.
Cybenko, G., 1989. Approximation by superposition of a sigmoidal function. Mathematics of Control, Signals and Systems 3, 303–314.
Gorr, W.L., Nagin, D., Szczypula, J., 1994. Comparative study of artificial neural network and statistical models for predicting student grade point averages. International Journal of Forecasting 10, 17–34.
Hecht-Nielsen, R., 1989. Theory of the Backpropagation Network. In: International Joint Conference on Neural Networks, Washington, DC, vol. 1. IEEE, New York, pp. 593–605.
Kurkova, V., 1992. Kolmogorov's theorem and multilayer neural networks. Neural Networks 5, 501–506.
Kurkova, V., Kainen, P.C., 1994. Functionally equivalent feedforward neural networks. Neural Computation, 543–558.
Lee, T.-H., White, H., Granger, C.W.J., 1993. Testing for neglected non-linearity in time series models. Journal of Econometrics 56, 269–290.
Moody, J.E., 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. NIPS 4, 847–854.
Morgan, P., Curry, B., Beynon, M., 1999. Comparing neural network approximations for different functional forms. Expert Systems 16, 2.
Morgan, P.H., Curry, B., Beynon, M., 2000. Pruning neural networks by minimisation of the estimated variance. European Journal of Economic and Social Systems 14 (1), 1–16.
Moutinho, L., Davies, F., Curry, B., 1996. The impact of gender on car buyer satisfaction and loyalty. Journal of Retailing and Consumer Services 3 (3), 135–144.
Phillips, P.C.B., 1989. Partially identified econometric models. Econometric Theory 5, 181–240.
Ripley, B.D., 1996. Pattern recognition and neural networks. Cambridge University Press, Cambridge.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation. In: McClelland, J.L., Rumelhart, D.E. (Eds.), Parallel Distributed Processing. MIT Press, Cambridge, MA.
Tamura, S., Tateishi, M., 1997. Capabilities of a four-layered feedforward neural network: Four layers versus three. IEEE Transactions on Neural Networks 8 (2).
Teräsvirta, T., Lin, C.F., Granger, C.W.J., 1993. The power of the neural network test. Journal of Time Series Analysis 14 (2), 209–220.
White, H., 1989a. Asymptotic results for learning in single hidden layer feedforward network models. Journal of the American Statistical Association 84, 1003–1008.
White, H., 1989b. Learning in artificial neural networks: A statistical perspective. Neural Computation 1, 425–464.
Williamson, R.C., Helmke, U., 1995. Existence and uniqueness results for neural network approximations. IEEE Transactions on Neural Networks 6 (1), 2–13.
Zhang, G., Patuwo, B.E., Hu, M.Y., 1998. Forecasting with artificial neural networks. International Journal of Forecasting 14, 45–62.