Model selection in Neural Networks: Some difficulties


European Journal of Operational Research 170 (2006) 567–577 www.elsevier.com/locate/ejor

Computing, Artificial Intelligence and Information Technology

B. Curry *, P.H. Morgan

Cardiff Business School, Cardiff University, Aberconway Building, Colum Drive, Cardiff CF10 3EU, UK

Received 26 February 2003; accepted 25 May 2004; available online 25 August 2004

* Corresponding author. Tel.: +44 29 20875668; fax: +44 29 20874419. E-mail address: curry@cardiff.ac.uk (B. Curry).

Abstract

This paper considers two related issues regarding feedforward Neural Networks (NNs). The first involves the question of whether the network weights corresponding to the best fitting network are unique. Our empirical tests suggest an answer in the negative, whether using the standard Backpropagation algorithm or our preferred direct (non-gradient-based) search procedure. We also offer a theoretical analysis which suggests that there will almost inevitably be functional relationships between network weights. The second issue concerns the use of standard statistical approaches to testing the significance of weights or groups of weights. Treating feedforward NNs as an interesting way to carry out nonlinear regression suggests that statistical tests should be employed. According to our results, however, statistical tests can in practice be indeterminate. It is rather difficult to choose either the number of hidden layers or the number of nodes on this basis.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Neural Networks; Network weights; Hidden layers; Backpropagation; Polytope

1. Introduction

In this paper feedforward Neural Networks (NNs) are viewed as devices for flexible nonlinear regression, relying on the property of 'universal approximation'. Theoretical work on such approximation dates back to Cybenko (1989) and Hecht-Neilson (1989). Despite a plethora of such results, and despite our own confidence in the approximation properties of NNs, there remain numerous questions to be asked regarding selection of the 'best' network for a given data set. Zhang et al. (1998), in their review of NNs in a forecasting context, comment that "... there is no simple clear-cut method for determination of ... parameters". They also regard network design as "more of an art than a science".

We discuss two related difficulties. The first is an identification problem manifesting itself in an inability to replicate weight values for a trained network. As regards finding a unique set of estimated network weights (parameters), our analysis leads us generally to a pessimistic view. The second problem concerns the use of statistical principles in choosing the number of hidden layers and the number of hidden nodes. White (1989b) has shown in formal terms that standard statistical tests are available, which is not surprising if one views NNs as a form of nonlinear regression. However, in the experiments reported below, a statistical approach is far from being conclusive.

In what follows, we review in Sections 2 and 3 the properties of NNs and discuss issues involved in their implementation. This includes the choice of learning algorithm. In Section 4, we present some theoretical investigations of interdependence between network weights, using a Taylor series approach: this section also contains an example based on simulated data. We then report on experiments with three real data sets.

2. Neural Networks

2.1. The basic model

A feedforward logistic NN is defined as follows:

    Y = f(X) + u,    (1)

where Y denotes the output, X = (X_1, ..., X_m) denotes the m input variables, and u is an error term. In general, u is presumed, albeit often implicitly, to possess the usual desirable properties. We denote the number of sample observations by n.

Let the hidden layers be indexed by p = 1, ..., r. For p = 1, we have the first hidden layer, connected directly to the inputs. The rth hidden layer is connected directly to the output. Each hidden layer p contains q(p) nodes: the jth node in layer p is denoted by h_j^{(p)}. We also have the following weights:

w_{ik} denotes the weight connecting the kth input to the ith node in the first hidden layer;
v_{st}^{(p+1)} denotes the weight connecting the tth node in hidden layer p to the sth node in hidden layer p + 1, for p = 1, ..., r − 1;
o_v denotes the weight from the vth node in hidden layer r to the output, for v = 1, ..., q(r).

Denoting the logistic (sigmoid) function by L, the network model can be expressed as

    h_i^{(1)} = L\left( \sum_{j=1}^{m} w_{ij} X_j \right),    (2)

    h_i^{(p+1)} = L\left( \sum_{t=1}^{q(p)} v_{it}^{(p+1)} h_t^{(p)} \right)    (3)

for p = 1, ..., r − 1, and

    f(X) = \sum_{v=1}^{q(r)} o_v h_v^{(r)}.    (4)

Bias terms may be included by setting appropriate variables to unity. One may optionally apply a logistic transformation to the output, but it is the functions employed in the hidden nodes which provide the universal approximation property. The case of a logistic network with two inputs, no bias terms and a single hidden layer comprising q hidden nodes would be expressed as follows:

    Y = \sum_{i=1}^{q} \frac{o_i}{1 + e^{-(w_{i1} X_1 + w_{i2} X_2)}}.    (5)
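As an illustration of the structure in Eq. (5), the following fragment (a minimal sketch in Python with NumPy, which is not part of the original paper; all names and weight values are our own) computes the output of a single-hidden-layer logistic network with two inputs, no bias terms and q hidden nodes.

    import numpy as np

    def logistic_network(X, w, o):
        """Output of the single-hidden-layer logistic network of Eq. (5).
        X: (n, 2) inputs; w: (q, 2) input-to-hidden weights; o: (q,) hidden-to-output weights."""
        z = X @ w.T                      # pre-activations, one column per hidden node
        h = 1.0 / (1.0 + np.exp(-z))     # logistic hidden nodes
        return h @ o                     # weighted sum of hidden nodes gives the output

    # Example with q = 3 hidden nodes and arbitrary weights
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(5, 2))
    w = rng.normal(size=(3, 2))
    o = rng.normal(size=3)
    print(logistic_network(X, w, o))

Even in this small case the number of weights is 3q, and it grows rapidly as hidden nodes and layers are added.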

One can see immediately that the structure is somewhat unwieldy, and the fitting process will inevitably be problematic. White (1989a) uses the term 'Non-Linear Least Squares' (NLS) specifically to describe the application of a quadratic error metric to Eq. (1). This definition of NLS incorporates no specific fitting algorithm, although White notes that optimization of the error sum of squares is 'computationally demanding'. The term NLS can of course be given a much broader interpretation. The most common way to apply the standard network model is to optimize RMS (root mean square) error through Backpropagation (BP), as employed by Rumelhart et al. (1986). White (1989b) shows that BP arises from application of the method of Stochastic Approximation originally suggested by Robbins and Monro (1951). Gradient-based methods such as BP are especially problematic in the context of NNs, since the sum of squares function may be quite flat in places and subject to 'narrow valleys' from which a search algorithm may find difficulty in 'escaping'. These points are amplified below. In the case of BP, it is common to observe a rapid early reduction of RMS error followed by a relatively slow 'crawl' along the flat-bottomed narrow valleys.

These are points put forward in our earlier papers (e.g. Curry and Morgan, 1997; Curry et al., 2002), in which we provide evidence of possible shortcomings of BP and consider a direct (i.e. non-gradient-based) alternative known as the Polytope Algorithm. In various experiments the Polytope was found to offer superior performance. The same algorithm is employed in this paper and compared to BP. In summary, the algorithm involves maintaining a highly flexible simplex or 'polytope' which, through operations of stretching, reflection and so on, moves and adjusts itself to the shape of the unknown function to be optimized. We use the term Polytope to avoid confusion with the Simplex algorithm for Linear Programming. Further explanation is provided in Section 5 below.

2.2. Approximation issues

White (1989b) is among numerous authors who demonstrate that a feedforward NN can adapt to, i.e. approximate, arbitrary functional forms and arbitrary stochastic processes. Barron (1993) stresses that the functional form in Eqs. (2)–(4) above contains the Fourier series approximation as a special case. White (1989b) also shows that a single hidden layer of nodes is technically sufficient for such 'universal approximation'. However, it should be noted that approximation theorems relate to limiting cases in which the number of nodes becomes infinitely large. Empirical support for approximation properties is also reported by Morgan et al. (1999), who show that NNs can provide notably accurate approximations to a wide range of interesting functional forms. These include the 'Mexican Hat', the shape of which provides an excellent challenge for the network.

A notable finding in our earlier work is that additional hidden layers are often required. This is not only from the point of view of goodness of fit: in our experiments, additional layers can also provide better generalization, due to the additional complexity offered by the multi-layer model. A number of authors, e.g. Kurkova (1992), deal with models containing two or more hidden layers, but there is a need for further theoretical research on the precise effects of additional layers. A rare paper on this particular topic (Tamura and Tateishi, 1997) is also noteworthy in that it deals with a finite number of nodes. The authors show that networks with one and two hidden layers are strictly equivalent only with an infinite number of nodes. However, they show that a network with a second hidden layer, while being superior in terms of the number of weights needed, can also provide arbitrarily close approximations. See also Chester (1990).

As far as practical applications are concerned we would wish to sound a note of caution: it may be dangerous simply to assume that a single layer will naturally provide adequate approximations. Even with universal approximation, model selection issues are still crucial. This is because of the rather neat irony that better approximation within the training set naturally increases the risk of overfitting. The very strength of NNs, their ability to approximate, is also their principal weakness. Ultimately, it is the capacity of the network to generalize which is most important.
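Before turning to model selection, a concrete and deliberately simplified illustration of direct-search training may be useful. The sketch below fits the small network of Eq. (5) by minimizing RMS error with SciPy's Nelder-Mead simplex routine, a polytope-type direct search in the same family as, though not identical to, the Polytope algorithm used in this paper; the data-generating function, sample size and tolerances are illustrative assumptions rather than the authors' settings.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)

    # Simulated data: a smooth nonlinear target with a small noise component
    X = rng.uniform(-1, 1, size=(100, 2))
    y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.01, size=100)

    q = 3  # hidden nodes

    def rms_error(params):
        w = params[:2 * q].reshape(q, 2)          # input-to-hidden weights
        o = params[2 * q:]                        # hidden-to-output weights
        h = 1.0 / (1.0 + np.exp(-(X @ w.T)))      # logistic hidden layer
        return np.sqrt(np.mean((y - h @ o) ** 2))

    # Random starting weights drawn from a uniform distribution on (-1, 1)
    start = rng.uniform(-1, 1, size=3 * q)
    fit = minimize(rms_error, start, method="Nelder-Mead",
                   options={"maxiter": 20000, "fatol": 1e-12, "xatol": 1e-12})
    print("final RMS error:", fit.fun)

Repeating such a fit from different random starting points typically reproduces the final RMS error much more reliably than the individual weight values, which anticipates the identification problem examined in Section 4.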

3. Model selection issues

Although there is continuing debate on model selection strategies, one area of agreement is that simple rules of thumb have little to offer. Such rules, suggesting that the number of hidden nodes should be directly related to the number of inputs and outputs, have historical interest, having been popular immediately after the pathfinding work of Rumelhart et al. (1986). The simplest, and probably still the most popular, sensible strategy is to focus directly on minimising RMS error, experimenting where necessary regarding selection of inputs and hidden nodes. This could be labelled a 'practitioner's' strategy. One general difficulty is the excessive computational burden which is placed on the process of experimentation: the cause is 'combinatorial explosion' in the number of different models which need to be run. The problem arises with any model selection strategy employed, although certain strategies seek to prune the search tree. Experimentation is required to test subsets of inputs, to test the number of hidden layers and the number of hidden nodes. Also, in algorithms such as BP there are parameters such as learning rates. The end result is that even if there is to be merely a single 'run' of each case, a complete range of experiments is prohibitive. Each run may take considerable time, possibly measured in days.

A natural progression from 'simple' use of RMS is to exploit the fact that RMS is essentially a statistical concept. Minimization of RMS, equivalent to least squares, follows directly from adoption of a Maximum Likelihood approach, provided of course that the nonlinear functional relationship of the feedforward NN is supplemented with a well-behaved additive error term. This latter requirement is often adopted only implicitly. A firm foundation for the use of a statistical approach is provided by White (1989b), both for estimation of network weights through Maximum Likelihood methods and for testing hypotheses. For the latter, tests can naturally be based on the change in RMS when a single weight or group of weights is set to zero. The statistical foundations laid by White (1989b) are the basis of work, for example, by Cottrell et al. (1995) and Gorr et al. (1994). The former consider the use of a Neural Network as a nonlinear extension of autoregressive time series models. Their approach involves applying statistical principles to eliminate non-significant weights.

Anders and Korn (1999) stress that a serious problem to be faced with NN model selection is that the network weights are under-identified, meaning in particular that the RMS function will be subject to 'flat' regions. They refer to the work of Phillips (1989), in which a notable consequence of local under-identification is that the asymptotic distribution of the estimates will be non-normal. As to reasons why there should be an identification problem with NNs, they offer two suggestions. The first is that it is always possible simply to exchange within any hidden layer any pair of nodes and their associated weights, leaving computations with the network unaltered. Secondly, setting certain weights to zero will automatically remove branches from the network. For example, in Eq. (5) above, setting any output weight o_i to zero will immediately remove weights w_{i1} and w_{i2}.

Under-identification is an issue also addressed by White (1989b), who notes that the limiting distribution of the estimated network weights is subject to what he calls 'flats', arising because of redundant inputs or irrelevant hidden nodes. The view taken here is that the problem is more serious, there being naturally occurring parameter redundancy. The hidden nodes may be regarded as nonlinear analogues of the latent variables or Principal Components employed in multivariate statistics; see Curry and Peel (1998). In Eq. (5), for example, even with two hidden nodes we have six network weights, and of course the number of weights grows explosively as hidden nodes are added. This basic indeterminacy is stronger than would arise through simply exchanging weights. In Principal Components Analysis, the aim is to transform a vector of m variables to a vector of, say, d orthogonal components, and this requires a loading matrix of order d by m. Conventional direct estimation of such a matrix by least squares is therefore not possible. We offer some further theoretical investigation of this issue in Section 4 below.

Here we may note that parameter redundancy has implications for any modelling strategy which makes explicit use of the number of network weights (or degrees of freedom). These include standard criteria, such as AIC, BIC, etc.: for a review in the context of NNs see Ripley (1996). Such measures suffer from the difficulty of calculating the number of degrees of freedom in the context of parameter redundancy. A link can also be made to other penalty-based methods, based on penalising network complexity or through forcing of weight decay. Although penalty-based methods involve network weights, a direct focus on weights leads us to 'pruning' methods. These may for example be based on the Hessian matrix of the weights, and hence are based on statistical principles. See Ripley (1996) for further details. For a variance-based pruning approach see also Morgan et al. (2000).
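The node-exchange source of under-identification is easy to verify numerically. In the short check below (a sketch in Python, not taken from the paper; the weights are arbitrary), permuting the hidden nodes of the network in Eq. (5) together with their attached weights leaves the fitted function unchanged, so two distinct weight vectors represent exactly the same model.

    import numpy as np

    def network(X, w, o):
        """Single-hidden-layer logistic network of Eq. (5)."""
        h = 1.0 / (1.0 + np.exp(-(X @ w.T)))
        return h @ o

    rng = np.random.default_rng(2)
    X = rng.uniform(-1, 1, size=(10, 2))
    w = rng.normal(size=(3, 2))   # input-to-hidden weights for q = 3 nodes
    o = rng.normal(size=3)        # hidden-to-output weights

    perm = [2, 0, 1]              # exchange the hidden nodes (and their weights)
    print(np.allclose(network(X, w, o), network(X, w[perm], o[perm])))   # True

The redundancy discussed in this paper goes beyond such permutations, but the example shows why weight values alone cannot identify a network.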


4. Theoretical considerations

4.1. An approach using Taylor series

A notable feature of our experience with NNs is that it is normally impossible to replicate a set of weights. Even with a preferred network structure, offering the best fit, there is no unique set of weight values. If we simply initialize the weight values, using random starting points, it is possible to replicate the RMS but not the individual weight values. Indeterminacy of weights arises simply because of the presence of latent variables: the number of weights is less than the effective degrees of freedom and there are substantial interdependencies between weights.

Kurkova and Kainen (1994) use the term 'functionally equivalent' weight vectors. They present theorems regarding the existence of indeterminacies, without details on their precise nature. They also make the interesting point that indeterminacies could profitably be used in the search process, with the need only for a sample of one weight vector from a set of equivalent vectors. Further insights on this question are offered by Williamson and Helmke (1995), who consider the very simple case of a feedforward logistic network with a single input and single hidden node. Among the questions they consider is the uniqueness of the 'best' approximation provided by the network to an arbitrary underlying function. The approximations are examined only in relation to a finite interval. For the very limited class of networks studied, Williamson and Helmke (1995) establish uniqueness of best approximation only for the case of the supremum norm and not for the standard least squares approach generally applied in this context. They note with some degree of pessimism that more typical networks with multiple inputs and multiple hidden nodes pose 'truly formidable difficulties'.

Given the intractability of the functional form we have investigated these issues using the MAPLE symbolic manipulation software. We focus on the case of a single hidden layer, with two inputs and three hidden nodes, corresponding to Eq. (5) above. For convenience we need to change notation. We use p1, p5 and p9 as weights from the hidden nodes to the output. Weights p2, p6 and p10 are bias terms; p3, p4, p7, p8, p11 and p12 are weights from inputs to hidden nodes. The network is then as follows:

    Y = \frac{p_1}{1 + \exp(p_2 + p_3 x_1 + p_4 x_2)} + \frac{p_5}{1 + \exp(p_6 + p_7 x_1 + p_8 x_2)} + \frac{p_9}{1 + \exp(p_{10} + p_{11} x_1 + p_{12} x_2)}.    (6)

As in previous work (Curry et al., 2000) we have used a Taylor series approach to investigate the network. Teräsvirta et al. (1993) also use a manually derived Taylor series approximation, proceeding as far as third order terms. They use derivatives with respect to the weights as opposed to the variables, which is necessary in the context of autoregressive models. Here, however, we are concerned with functional forms and use derivatives with respect to the input variables. MAPLE computes multivariate Taylor series through the COEFTAYL command, which specifies both the expansion point and the order of the expansion. We obtain the (i, j)th coefficient for the function f of arguments x1 and x2, expanded around the origin, using the command

    C[i, j] := COEFTAYL(f, [x1, x2] = [0, 0], [i, j]):

The SOLVE command can be used to obtain expressions for the weights, given suitable constraints. As an example, with the following command we may set Taylor coefficients C[0, 0], C[0, 1] and C[1, 0] to zero and solve for the weights p1 to p12:

    sol1 := solve({C[0, 0], C[0, 1], C[1, 0]}, {p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12}):

Zero constraints in the command 'guide' the MAPLE solutions towards certain types of functional form. This is useful given that we are dealing with a Taylor approximation, and it also compensates for the use of only three hidden nodes in Eq. (6). From the above command, MAPLE generates four distinct solutions. The solutions involve notable and often quite simple functional dependencies between weights. For example, within the fourth solution set we have

    p_2 = \ln\left( \frac{p_1 \exp(p_6) + p_1 + p_5 \exp(p_6)}{p_5} \right).    (7)

To pursue the case further we have generated a series of 100 data points at random from the function y = x_1^2 + x_2^2. The function was applied over the region (−1 < x_1 < +1, −1 < x_2 < +1) and a small noise component was added, using a standard deviation of 0.001. The three constraints on the Taylor coefficients in the example above involve setting to zero coefficients of both zero and first order: these tally with the simple quadratic function under discussion. The three-node network was trained using the Polytope algorithm. To check the validity of the Taylor approach, the first MAPLE solution was substituted into the network function in Eq. (6): a set of weights from the Polytope fit was then applied. The resulting network function corresponds very closely with the original target quadratic, except at the edges. The use of a Taylor approximation does not seem to have caused undue damage.

Our example is consistent with the theoretical work of Kurkova and Kainen (1994) and Williamson and Helmke (1995). It suggests the existence of numerous interdependencies. Similar collections of solutions can easily be derived for other cases, setting various other Taylor coefficients to zero. The presence of interdependencies between weights is not an isolated instance. This confirms the point made in Section 3 above, that the identification problem in NNs is more serious than suggested by Anders and Korn (1999).

Given that the MAPLE model just discussed is quite small, we also need to consider how far it can be generalized. Similar relationships between weights emerge when the model is extended, but the capability of MAPLE to handle extensions is quite limited. The number of potential solutions for the SOLVE command to handle increases explosively with the number of inputs and the number of hidden nodes. It also depends on the maximum degree for which the Taylor series coefficients are set to zero. As the size of the model is increased MAPLE runs out of stack memory and

simply cannot generate results. This may happen even with two inputs, but is much more likely with three.
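A rough analogue of the MAPLE procedure can be sketched with open-source symbolic tools. The fragment below uses SymPy rather than MAPLE (an assumption of ours, with our own helper names) to extract low-order Taylor coefficients of the three-node network in Eq. (6) as reconstructed above, and then solves a single zero constraint for one weight; even this minimal exercise produces a functional dependency between weights of the same flavour as Eq. (7). It is an indicative sketch only, and the full twelve-weight solve quickly becomes expensive, mirroring the combinatorial growth described above.

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    p = sp.symbols('p1:13')   # p[0]..p[11] play the role of p1..p12 in Eq. (6)

    # Three-node logistic network of Eq. (6), as reconstructed above
    f = (p[0] / (1 + sp.exp(p[1] + p[2] * x1 + p[3] * x2))
         + p[4] / (1 + sp.exp(p[5] + p[6] * x1 + p[7] * x2))
         + p[8] / (1 + sp.exp(p[9] + p[10] * x1 + p[11] * x2)))

    def coeftayl(expr, i, j):
        # (i, j)th multivariate Taylor coefficient about the origin
        # (a rough analogue of MAPLE's COEFTAYL command)
        d = expr
        if i:
            d = sp.diff(d, x1, i)
        if j:
            d = sp.diff(d, x2, j)
        return d.subs({x1: 0, x2: 0}) / (sp.factorial(i) * sp.factorial(j))

    C00, C01, C10 = coeftayl(f, 0, 0), coeftayl(f, 0, 1), coeftayl(f, 1, 0)

    # The MAPLE analysis solves {C[0,0], C[0,1], C[1,0]} = 0 for all twelve
    # weights at once; here we impose only C[0,0] = 0 and express p2 in terms
    # of the remaining weights, exposing one functional dependency.
    sp.pprint(sp.solve(sp.Eq(C00, 0), p[1]))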

5. Experiments

5.1. General

Given the various issues noted above, we have carried out a number of experiments relating to the following specific points of comparison:

1. Networks trained with the Polytope and Backpropagation algorithms.
2. Networks with single and double hidden layers.
3. Networks with logistic activation and hyperbolic tangent activation.

In the various empirical tests we employ a 'raw' unadjusted RMS error criterion, quoting R2 values in addition in order to provide comparisons between NN models and linear regressions. This is standard practice in the NN 'community': in engineering and other disciplines RMS error is a natural choice. Of course, following standard practice is not in itself a convincing reason, and one would naturally wish to consider other error criteria such as AIC, BIC, etc. The problem with such criteria in the context of NN modelling is to establish the 'true' degrees of freedom. This is a consequence of the identification and parameter dependency problems discussed above. The 'effective number of parameters', in the sense employed by Moody (1992), is as yet unknown, so it seems dangerous to use a count of the network weights: our preference is for using 'raw' RMS.

5.2. Data aspects

Three data sets have been employed in this section. The first, kindly provided by GFK Marketing Ltd., relates to purchases of Sony TV sets in the UK and is based on the hedonic model, whereby price is dependent on bundles of attributes, the aim being to analyse the quality of a product. Six such quality variables were used as inputs (see Curry et al., 2002 for a more detailed description of the data and earlier experiments involving a feedforward NN and linear regression). Some 447 observations were available. It was found that although a Neural Network model could improve on the goodness of fit of a linear regression, the latter provided a reasonable approximation away from the extremes.

The second data set relates to sales of dishwasher machines in the US, the data being publicly available at http://lib.stat.cmu.edu/DASL. Annual dishwasher sales, over a period of 26 years, are taken to be dependent on total annual expenditure on durable goods and private residential investment. The final data set involves temperature data, again for the US. It is also available from the above source. The average temperature in 56 cities is taken to be dependent on the degrees of latitude and longitude. As can be seen at the source website, the second and third data sets exhibit substantial nonlinearity.

5.3. Implementation issues

For the training process our main focus is on the Polytope algorithm, employed because of previous favourable experience. The algorithm makes no use of gradients, nor does it require decisions regarding the order in which data points are presented. Its implementation requires that for m parameters (weights) we define m + 1 points situated at the vertices of a regular simplex in the space of m dimensions. Initial weight values are generated by random drawing from a uniform distribution on the interval (−1, 1). At each iteration a reflection of the worst vertex is generated through the space spanned by the remaining m vertices, and the objective function value is calculated. If the resulting value is worse than the worst, then we contract the whole original simplex by replacing the worst vertex by one closer to its base (which contains the centroid of the m best vertices). If the reflected point is better than the worst but not as good as the best, then we can at least get rid of the worst by replacing it with the reflected value. If the reflected value is better than the best, then we try expanding the simplex in its direction. If all reflection, expansions and contractions fail, then we assume that the optimum probably lies somewhere inside the simplex and we try shrinking it by moving all the vertices toward the best vertex. Some additional (but minor) refinements to improve convergence were also adopted.

The algorithm can terminate when there is no change in the objective function (RMS error). In practice, under double precision this means stopping when the change is less than 10^−15. Alternatively, it can stop when the values of the parameters are unchanged; in practice this involves stopping when the mean of the sum of the relative absolute differences is less than 10^−14. As a further safeguard against the method getting trapped in a subspace of the parameters, the method was restarted by constructing a new simplex with size one-tenth of the initial one and incorporating the current best vertex. The restarts were made with simplices of this same size until successive restarts were within 10^−15 of each other in terms of the sum of squared errors, or a set number (30) of restarts had been made. Further details can be found in Curry and Morgan (1997).

For the cross-section data (television sets and temperature) we have worked with a training set of approximately 75% of the data points, selected at random (using random numbers generated in EXCEL). For the dishwasher data the first 75% of values were used as a training set. To conform to standard econometric practice we use 'split-sample validation': we first optimize with respect to the training set and then proceed to test for overfitting by checking the RMS or R2 for the test set (using the weights from the training set). The error between the fitted model and the test set provides an estimate of the generalization error. This particular approach is well established and has also been adopted here because of training-time considerations.

For Backpropagation, the Neuralworks Professional software, from Neuralware Inc., was employed. This package has no documented capability to handle skip layer connections, but it was found that such a model could be employed by manipulating the on-screen network display. For both linear and NN fits, the models were evaluated using RMS error.


The unadjusted R2 gives identical results, and is presented below to facilitate comparisons with linear models.

5.4. Results

Our main results are summarized in Table 1 below: for reference, the table contains results from a linear regression. The number of hidden nodes and layers was varied, and for each case five runs were carried out from randomly generated starting points for the weights. A uniform distribution was used for this initialization. The algorithms conduct their search in terms of RMS error, but for ease of comparison across data sets we have expressed final results in terms of R2. The values contained in the table represent the best R2 obtained over multiple runs, varying the number of layers and hidden nodes.

Testing for significance of individual coefficients is also of interest, but not in the present context. We note that asymptotic t-tests are available for this purpose (White, 1989b), although they are not part of standard NN practice. The indeterminacy of network weights, reported and examined in more detail below, militates against such formal testing. Space considerations dictate omission of details such as the number of hidden nodes. Although such issues are important, they are not the primary focus of this paper: here we are concerned with the ability of network models to improve goodness of fit.

Table 1
Unadjusted R2 values for the three datasets

                        Linear regression   Polytope   Backprop
Sony TV data
  In sample             0.825287            0.898860   0.829350
  Out of sample         0.819733            0.813970   0.827589
Dishwasher data
  In sample             0.929578            0.999955   0.993866
  Out of sample         0.871807            0.922198   0.747593
Temperature data
  In sample             0.766305            0.984528   0.971805
  Out of sample         0.633504            0.954705   0.938510

In Table 1 there is no indication of serious overfitting for any of the datasets.

There is no tendency for the goodness of fit to fall off substantially out of sample. Secondly, the NN models were consistently able to attain higher R2 values for the training sets compared with linear regression, although not necessarily substantially higher.

For the Sony TV set data a linear fit offers what one might term a reasonable working model for managers. The best R2 value of 0.898860 was obtained by applying the Polytope algorithm: the corresponding out-of-sample R2 was 0.813. Applying linear regression produces an R2 of 0.825287 for the training set, falling to 0.819733 out of sample. In terms of linearity testing, we may say that there are certain nonlinearities present, which the network can capture. These nonlinearities are more prominent in the extremes, where the curvature of the model produces a greater departure from linearity. For the dishwasher data, the Polytope achieves a best R2 of 0.999955, compared with a value of 0.929578 for linear regression. In the case of the temperature data the respective values are 0.987790 and 0.766305. These improvements over linearity are much more pronounced, as one would expect given the nonlinearities present in the data (see http://lib.stat.cmu.edu/DASL for details).

Our focus here on the Polytope algorithm is supported by the fact that the algorithm was never outperformed by Backpropagation, and showed some tendency to provide superior fit, confirming our previous experience (Curry and Morgan, 1997). This was especially true of the Sony TV data: although the Backpropagation runs failed to improve on the linear version, the Polytope algorithm was able to provide some improvement. For the dishwasher and temperature data, BP achieved R2 values a little lower than those obtained from the Polytope. One simple advantage of BP over the Polytope lies, however, in the computation time required: whereas a BP run requires several minutes, Polytope runs may take several days.

A disturbing aspect of our results relates to the final values of the weights that are generated. In a simple linear regression, for example, one may be confident that the parameter estimates constitute a distinct global minimum of the sum of squared errors. With nonlinear regressions, however, where a search algorithm needs to be employed, it is important to consider whether this is indeed the case.


Not only do we need to be wary of local minima, but we need to consider whether a single set of weights can be relied upon. For the three data sets reported in this paper, it is somewhat alarming to report that although we can carry out repeated runs of the model and replicate the level of fit achieved, we are unable to replicate the precise values of the weights. Similar problems have emerged in other studies we have carried out. In statistical terms, this reflects an identification problem not always the subject of careful attention by researchers. It is apparently only in the case of very simple models that we can break out of this difficulty. For example, in an earlier paper (Curry et al., 2000), we report on fitting a network to 30 data points generated from a simple bivariate quadratic function with no noise. In this case, the weights are stable under replication. We were also able to show, using a 5th order Taylor series expansion constructed in MAPLE, that the weights could be translated into polynomial coefficients close to those of the initial quadratic function.

5.5. Further implications

It is common to refer to the NN as a 'black box', and researchers frequently make no reference to the actual weight parameters of a trained model. (This is partly because of its complexity, as indicated in Eqs. (2)–(4) above.) A notable exception is Moutinho et al. (1996), who attempt an interpretation of the weights in terms of their model of consumer satisfaction. Although it is both possible and desirable to extract and analyse weight values, our experience leads to the rather pessimistic view that interpretation of individual coefficients may not be possible. We can simply use NNs as a test of some other more tractable case such as a linear model. The network provides valuable comparisons, but is not a true parametric model in its own right.

Our experiments also suggest problems with using a statistical approach to select the best network for any given data set. Table 2 below shows, for each of our data sets, the R2 values for certain networks, varying both the number of layers and the number of hidden nodes. The table relates to the Polytope algorithm applied with the sigmoid activation function.


Table 2
Best R2 values for various networks (Polytope algorithm)

                   Network   Best R2
Sony TV data       2:0       0.842114
                   4:0       0.850342
                   6:0       0.884945
                   2:2       0.840045
                   2:4       0.840243
                   2:6       0.898860
Dishwasher data    2:0       0.990500
                   3:0       0.997337
                   4:0       0.998379
                   2:2       0.996992
                   3:2       0.999480
                   3:3       0.999955
Temperature data   2:0       0.959660
                   3:0       0.975276
                   4:0       0.984528
                   2:2       0.970932
                   3:2       0.964220
                   3:3       0.966927

(x:y) denotes a network with x and y nodes respectively in the first and second hidden layers; where y = 0 there is a single hidden layer.

For the dishwasher and temperature data only two inputs are available, and the table contains results for up to four hidden nodes in the first layer; for the TV data there are six inputs and we show results for up to six nodes in the first hidden layer. In general, it is difficult to establish the best network for each data set. Similar indeterminacy emerges if Backpropagation is employed. It can be seen that in all three cases there is no clear-cut 'winner'. This is especially important given that there is in general variation between different runs for the same network. As an example, the R2 values for a network with two hidden layers, each containing two nodes, range from 0.830807 to 0.844074.

6. Summary and conclusions


Although confident in the approximation properties of NNs, we feel less confident in some other aspects. In particular, our experiments and the theoretical results in Section 4 suggest a serious identification problem. Although one can replicate the goodness of fit for a given network, it is not in general possible to replicate the weight values. This holds for our three datasets and for other cases, whether we use the standard BP or the Polytope algorithm. It also holds both for logistic and hyperbolic tangent activation functions.

The implications are a cause for concern. First of all, the absence of unique best-fitting weights removes our ability to look inside the NN 'black box'. It is possible to compute partial derivatives with respect to the input variables, although the expressions are far from convenient. However, without unique values for the weights such expressions have little value. One is left with general statements about overall departures from linearity, in cases where a number of network models outperform linear regressions. Such problems are less serious in a forecasting context, in the same way that collinearity or interdependence between explanatory variables matters less than overall forecast accuracy. However, forecasts become more trustworthy if they are based on a credible underlying structure, and forecasts not based on stable parameter values are more likely to degenerate outside of the data set or in response to structural changes.

A second implication of weight indeterminacy concerns the use of metrics such as AIC, BIC or even R2 adjusted for degrees of freedom. The basic problem is that it is unclear how one should compute degrees of freedom, or what is described by Moody (1992) as the 'effective number of parameters'. Lack of identification of weights also potentially explains why we have found it difficult to choose on statistical grounds a single network structure which provides the best fit. It should be noted that we have considered statistical issues by actually fitting or training NNs. In other approaches, this is not the case. For example, in the NN test for neglected nonlinearity examined by Lee et al. (1993), network weights are selected at random and tests applied to the results of linear regressions.

Overall, the choice of model selection strategies is still very much open. Statistical approaches need to be compared and contrasted with pruning or penalty-based methods, for which the aim is to deal with issues of network structure in the training algorithm itself. See for example Ripley (1996) for a review.

Acknowledgement

The authors are grateful to Prof. Mick Silver and to GFK Marketing for help with the provision of data.

References

Anders, U., Korn, O., 1999. Model selection in neural networks. Neural Networks 12, 309–333.
Barron, A.R., 1993. Universal approximation bounds for superpositions of a sigmoid function. IEEE Transactions on Information Theory 39, 930–945.
Chester, D.L., 1990. Why two hidden layers are better than one. In: International Joint Conference on Neural Networks, Washington, DC, vol. 1. Lawrence Erlbaum, pp. 265–268.
Cottrell, M., Girard, B., Girard, Y., Mangeas, M., Muller, C., 1995. Neural modelling for time series: A statistical stepwise method for weight elimination. IEEE Transactions on Neural Networks 6 (6), 1355–1364.
Curry, B., Morgan, P., 1997. Neural networks: A need for caution. OMEGA 25 (1), 123–133.
Curry, B., Morgan, P., Beynon, M., 2000. Neural networks and flexible approximations. IMA Journal of Mathematics Applied in Business and Industry 11 (1), 19–35.
Curry, B., Peel, M.J., 1998. Neural networks and business forecasting: An application to cross sectional audit fee data. International Journal of Commerce and Management 8 (2), 94–120.
Curry, B., Morgan, P.H., Silver, M., 2002. Neural networks and non-linear statistical methods: An application to modelling of price quality relationships. Computers and Operations Research 29, 951–969.
Cybenko, G., 1989. Approximation by superposition of a sigmoidal function. Mathematics of Control, Systems and Signals 3, 303–314.
Gorr, W.L., Nagin, D., Szczypula, J., 1994. Comparative study of artificial neural network and statistical models for predicting student grade point averages. International Journal of Forecasting 10, 17–34.
Hecht-Neilson, R., 1989. Theory of the Backpropagation Network. In: International Joint Conference on Neural Networks, Washington, DC, vol. 1. IEEE, New York, pp. 593–605.
Kurkova, V., 1992. Kolmogorov's theorem and multilayer neural networks. Neural Networks 5, 501–506.

Kurkova, V., Kainen, P.C., 1994. Functionally equivalent feedforward neural networks. Neural Computation, 543–558.
Lee, T.-H., White, H., Granger, C.W.J., 1993. Testing for neglected non-linearity in time series models. Journal of Econometrics 56, 269–290.
Moody, J.E., 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. NIPS 4, 847–854.
Morgan, P., Curry, B., Beynon, M., 1999. Comparing neural network approximations for different functional forms. Expert Systems 16 (2).
Morgan, P.H., Curry, B., Beynon, M., 2000. Pruning neural networks by minimisation of the estimated variance. European Journal of Economic and Social Systems 14 (1), 1–16.
Moutinho, L., Davies, F., Curry, B., 1996. The impact of gender on car buyer satisfaction and loyalty. Journal of Retailing and Consumer Services 3 (3), 135–144.
Phillips, P.C.B., 1989. Partially identified econometric models. Econometric Theory 5, 181–240.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.


Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22.
Rumelhart, D.E., Williams, R.J., Hinton, G.E., 1986. Learning internal representations by error propagation. In: McClelland, D.J., Rumelhart, D.E. (Eds.), Parallel Distributed Processing. MIT Press, Cambridge, MA.
Tamura, S., Tateishi, M., 1997. Capabilities of a four-layered feedforward neural network: Four layers versus three. IEEE Transactions on Neural Networks 8 (2).
Teräsvirta, T., Lin, C.F., Granger, C.W.J., 1993. The power of the neural network test. Journal of Time Series Analysis 14 (2), 209–220.
White, H., 1989a. Asymptotic results for learning in single hidden layer feedforward network models. Journal of the American Statistical Association 84, 1003–1008.
White, H., 1989b. Learning in artificial neural networks: A statistical perspective. Neural Computation 1, 425–464.
Williamson, R.C., Helmke, U., 1995. Existence and uniqueness results for neural network approximations. IEEE Transactions on Neural Networks 6 (1), 2–13.
Zhang, G., Patuwo, B.E., Hu, M.Y., 1998. Forecasting with artificial neural networks. International Journal of Forecasting 14, 45–62.