A Methodology for simplification and interpretation of backpropagation-based neural network models




Expert Systems With Applications, Vol. 10, No. 1, pp. 37-54, 1996 Copyright © 1996 Elsevier Science Ltd Printed in Great Britain. All rights reserved 0957-4174/96 $15.00 + 0.00

Pergamon 0957-4174(95)00032-1

A Methodology for Simplification and Interpretation of Backpropagation-Based Neural Network Models

LOUIS W. GLORFELD

Computer Information Systems and Quantitative Analysis, College of Business Administration, University of Arkansas, Fayetteville, AR 72701

Abstract--A new methodology for building inductive expert systems known as neural networks has emerged as one of the most promising applications of artificial intelligence in the 1990s. The primary advantages of a neural network approach for modeling expert decision processes are: (1) the ability of the network to learn from examples of experts' decisions, which avoids the costly, time-consuming, and error-prone task of trying to directly extract knowledge of a problem domain from an expert, and (2) the ability of the network to handle noisy, incomplete, and distorted data that are typically found in decision making under conditions of uncertainty. Unfortunately, a major limitation of neural network-based models has been the opacity of the inference process. Unlike conventional expert system decision support tools, decision makers are generally unable to understand the basis of neural network decisions. This problem often makes such systems undesirable for decision support applications. A new methodology is presented that allows the development of highly simplified backpropagation neural network models. This methodology simplifies network models by using a backward selection process to eliminate input variables that are not contributing to the network's ability to produce accurate predictions. Elimination of unnecessary input variables directly reduces the number of network parameters that must be estimated and consequently the complexity of the network structure. A primary benefit of this development methodology is that it is based on a variable importance measure that addresses the problem of producing an interpretation of a neural network's functioning. Decision makers may easily understand the resulting networks in terms of the proportional contribution each input variable is making in the production of accurate predictions.
Furthermore, in actual application the accuracy of these simplified models should be comparable to or better than the more complex models developed with the standard approach. This new methodology is demonstrated by two classification problems based on sets of actual data.

1. INTRODUCTION

ONE OF THE MOST recent and exciting developments from the artificial intelligence (AI) community to have found general application has been the development of neural network models. Neural network-based expert system applications have become increasingly popular as easy-to-use network emulation software and increased computing power have become readily available. Part of this popularity is due to the fact that neural networks represent an inductive learning strategy that addresses an issue that has gained some recent importance in the development of expert systems. This issue concerns the problem faced in the traditional knowledge acquisition phase of building expert systems. Attempting to extract the necessary knowledge directly from a relevant expert or experts has proven to be costly, error prone, and time consuming, often preventing the timely implementation of expert systems.

The problem of eliciting expertise is well known in the literature of psychology and human-machine interaction (Dawes, 1979; Hart, 1986; Meehl, 1954). This knowledge engineering bottleneck has led to investigation of techniques for building expert systems that automate the knowledge acquisition process (Barr & Feigenbaum, 1982). Although eliciting complete rule sets from an expert may be difficult, experts usually have little difficulty in providing problem descriptions and example reference data (Bratko & Michie, 1987). Given the availability of relevant example reference data, one commonly used approach to automatic knowledge acquisition is termed induction. In the inductive approach, a sample of decisions along with a set of attributes on which the decisions were thought to be based is given to the modeling system. The system then generates an approximate model of the expert based on the sample data. There has been a recent proliferation of inductive


models to develop expert systems to address classification and other problems. Classification problems represent one of the most successful application areas of inductive models and will be the focus of the methods reviewed and the application examples.

All inductive methods typically have the same general data requirements. A reference set of exemplary sample data with attributes related to the given correct classification result must be supplied. A model is then developed from this training data. Despite the large number of alternative inductive methods, there are still several general categories of inductive inference that encompass the majority of models, including statistical, decision tree, case-based, genetic algorithm, and neural network models.

The rich class of inductive neural network strategies based on analogies with the biological functioning of the brain has been the focus of much current interest. Although often computationally intensive, the ability of a neural network model to learn from very noisy, distorted, or incomplete sample data and the often superior or closely comparable results in comparison with other inductive learning methods in a wide range of problem domains help to explain why this class of methods has been preferred to some alternative inductive strategies. Currently, the neural network model based on the backpropagation (BP) architecture is most popular because of its wide range of problem applicability and generally good performance.

Unfortunately, a major limitation of neural network-based systems has been their inability to provide any comprehensible explanations of how the input attributes are used to produce the output predictions. This problem often makes such systems undesirable for decision support applications and has led to a view of neural networks as black boxes whose iterative inner processes are mysterious though the resulting predictions are good (Yoon & Peterson, 1992). For example, Hansen, McDonald and Stice (1992, p. 720) comment that: "Importantly, even when a neural network does an effective classification job on new examples, there is no convenient way to interpret what the network has learned." It is likely that end users will have little confidence in and will be reluctant to use an incomprehensible decision support tool, despite its predictive accuracy (Deng, 1993; James, 1985; Salchenberger, Cinar & Lash, 1992). This problem points to the importance of developing a method for providing a simple explanation of how a neural network uses its input variables to produce a predicted output so that an end user may have at least some understanding of the basis for the neural network decisions.

A second consequence of not being able to determine the relative importance of input data in neural network processing manifests itself in applications development of neural networks. There is a tendency to throw any remotely relevant and available input data at the neural network and let the robust properties of the network filter


out the usable information from the large amount of input noise. Of course, once the network is in actual use, the process of gathering unnecessary data is costly and simply confuses the issue of explaining how the neural network uses the input data to produce accurate output predictions.

This paper addresses these issues by introducing a new methodology that can lead to a significant simplification in a BP neural network structure. The methodology is based on the adaptation of backward selection, a well-established variable selection methodology used in developing simplified predictive statistical models, to the development of BP neural network models. The variable selection process takes advantage of a recently established measure of variable importance developed by Garson (1991) for explicit use in the interpretation of BP networks. Selection of a final model is accomplished by the use of V-fold cross-validation coupled with the use of a heuristic measure of standard error for the cross-validated results. Model simplification is achieved by the elimination of input variables that do not contribute to the predictive accuracy of the BP network. Elimination of unnecessary input variables directly reduces the number of network parameters that must be estimated and the complexity of the network structure. A major benefit realized by basing the variable selection process on the Garson (1991) importance measure is that the contribution of the variables remaining in the network model to producing accurate output predictions may be interpreted as a simple proportion. An additional benefit of a simplified model is that it will generalize well to new data not actually used to build the specific model.

2. PERFORMANCE COMPARISONS

Researchers have carried out a number of comparative performance studies of BP-based neural networks and other inductive algorithms in several problem domains.
These studies indicate that the BP neural network will often outperform other inductive methods. In cases where BP does not perform best, it often is close to best. Other inductive algorithms seem to be more problem specific in terms of their performance. Atlas et al. (1989) compared backpropagation and CART on three real world data sets and found backpropagation to be slightly more accurate in all three cases, although the difference was only statistically significant in one case. Shavlik, Mooney and Towell (1991) investigated the classification performance of ID3, perceptron, and backpropagation on five different data sets ranging from soybean disease to heart disease data and found mixed results. Statistically, backpropagation performed better than ID3 in two cases and had comparable performance to ID3 in three cases. Both backpropagation and ID3 tended to perform slightly better than the perceptron model. Denton, Hung and Osyk (1990) compared the performance of linear and

quadratic discriminant analysis and a linear programming approach to classification with backpropagation on four sets of simulated data with different data characteristics. In three of the four cases, backpropagation performed best and was very close to best in one case. Tam and Kiang (1992) compared the performance of linear discriminant analysis, logistic regression, K nearest-neighbor, ID3, and backpropagation for predicting bank failures 1 and 2 years prior to failure. On cross-validation performance comparisons, backpropagation gave the best performance on both the 1 and 2 years prior to failure data sets. Hansen, McDonald and Stice (1992) compared the performance of two forms of qualitative response models, probit, logistic, ID3, backpropagation and naive (random classification) models on two problems from the audit domain. Although backpropagation did not perform as well as the probit, logit, and qualitative response models in absolute terms, the resulting classification rate was not statistically different from the best model in either of the two sample problems. However, the qualitative response, logit, probit, and backpropagation models dominated the ID3 model in both examples and in all cases, all models performed better than the naive model. Salchenberger, Cinar and Lash (1992) compared the ability of a logistic and backpropagation model to predict thrift institution failures on training data 6, 12, and 18 months before failure, and a diluted sample. In all cases the backpropagation models performed as well as or better than the logistic models.

3. NEURAL NETWORK METHODOLOGY

3.1. Standard Backpropagation

Currently, many different neural network architectures are available. The neural network backpropagation (BP) methodology is currently the most popular and is especially suited for the kind of expert systems application involved in classification problems that are used for the application examples. Detailed descriptions of the theory and methods used to implement neural networks may be found in many sources (Caudill, 1988, 1991; Gallant, 1993; Hecht-Nielsen, 1989; Kilmasauskas, 1988; Knight, 1990; Rumelhart, Hinton & Williams, 1986; Wasserman, 1989) and will not be repeated here, but a brief overview of BP networks in a classification context will be provided to standardize terminology. In a neural network, the unit analogous to the brain's neuron is called a processing element. The BP network is hierarchical and consists of at least three layers. The first layer is an input layer with the same number of processing elements as attributes which are thought to relate to the output classification signal. This layer typically operates as a simple input buffer and uses a linear transfer function to distribute each of the input

attribute values to each of the second layer's processing elements. The second layer is called the hidden layer and has a sufficient number of elements to solve the problem at hand. Each of the hidden layer processing elements typically uses some form of nonlinear sigmoidal type of transfer function which has the same number of weights as the number of input attributes. The third and usually final layer is called the output layer and typically has the same number of processing elements as there are classification categories. The output signals from each of the hidden layer's transfer functions feed into each output layer processing element and are further processed by the output layer transfer function. The output layer transfer functions are typically the same form as the hidden layer functions except that the number of weights in each function will be equal to the number of hidden layer processing elements. In addition to the function weights just described, a sink or constant term is connected to each of the transfer functions and is known as a bias weight. Based on the previous description it should be apparent that the layers are fully connected into a network structure. The network is repeatedly presented with the sample data and effectively processes each sample input pattern through the appropriate transfer functions resulting in an output pattern. The actual output pattern is compared with the desired output pattern. The squared difference between the desired output and the actual output is used as a measure of error. The term backpropagation comes from the method of distributing the blame for the output errors which occur during the learning process. This process assumes that error in the output is due to all processing elements equally and the output error is propagated backward through the weighted connections to the previous layer for use in making adjustments to the transfer function weights until finally the input layer is reached.
These successive corrective adjustments to the function weights, as cases are repeatedly presented to the BP network, lead to an ever increasing output accuracy in a least squares sense and account for the illusion that the network "learns" to solve the problem. Halbert White has given an interesting proof that a simple BP network is asymptotically equivalent to a misspecified nonlinear regression model (White, 1989). This demonstrated approximation of a BP network by regression is important since it provides a possible basis for the analogous application of powerful statistical theory and techniques developed over many years to BP networks.
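The training cycle described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulator used in the paper: it assumes the standard generalized delta rule with logistic sigmoid transfer functions, and the learning rate, initial weight range, network size, and the toy logical-OR data are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid transfer function (an assumed choice of sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, hidden_w, output_w):
    """Forward pass; the last weight in each row is the bias weight."""
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
         for row in hidden_w]
    o = [sigmoid(sum(w * v for w, v in zip(row[:-1], h)) + row[-1])
         for row in output_w]
    return h, o


def train_epoch(data, hidden_w, output_w, lr=0.5):
    """Present every case once; return the summed squared error observed."""
    total = 0.0
    for x, target in data:
        h, o = forward(x, hidden_w, output_w)
        total += sum((t - y) ** 2 for t, y in zip(target, o))
        # Output error signal: (desired - actual) times the sigmoid derivative.
        d_out = [(t - y) * y * (1.0 - y) for t, y in zip(target, o)]
        # Hidden error signal: output errors propagated back through the weights.
        d_hid = [hj * (1.0 - hj) *
                 sum(d_out[k] * output_w[k][j] for k in range(len(output_w)))
                 for j, hj in enumerate(h)]
        # Adjust output layer weights, then hidden layer weights.
        for k, row in enumerate(output_w):
            for j in range(len(h)):
                row[j] += lr * d_out[k] * h[j]
            row[-1] += lr * d_out[k]
        for j, row in enumerate(hidden_w):
            for i in range(len(x)):
                row[i] += lr * d_hid[j] * x[i]
            row[-1] += lr * d_hid[j]
    return total


# Toy example (assumed data): two inputs, two hidden elements, one output.
rng = random.Random(0)
n_in, n_hid, n_out = 2, 2, 1
hidden_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
output_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]  # logical OR
errors = [train_epoch(data, hidden_w, output_w) for _ in range(2000)]
```

The list `errors` records the squared error after each presentation of the full sample, so its downward trend shows the successive corrective adjustments at work.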

3.2. Determining Network Weight Importance

If a neural network is to be viewed as anything but some form of iterative black box that produces accurate predictions, then some method must be available for interpreting the relative importance of each of the input attributes in terms of producing the output. One approach



to this problem has been to combine neural networks with more traditional rule-based systems (Gallant, 1993; Goodman, Higgins & Miller, 1992). However, the actual functioning of the neural network part of the system may still remain obscure. Some commercially available network packages have addressed the interpretation problem by providing a kind of sensitivity analysis methodology (Kilmasauskas, 1988). In one form of this methodology, all of the input variables except the one of interest are held constant while the input variable of interest is varied a standardized amount. The effect on the output signal is noted. This process is repeated for each of the input variables and leads to an ordering and indication of each input variable's contribution to the output signal. A recent development by Garson (1991) has provided a particularly suitable methodology to partition the available input variables' contributions to the output signal of a BP network in terms of a simple proportion of the output attributable to each input attribute. With a simulated causal modeling problem, Garson demonstrated how such a procedure led to an explanation of the output of a BP network in a manner similar to the interpretation of the independent variables' contributions to a dependent variable in a standardized multiple regression equation. Garson gives a formula that summarizes the variable importance measure, VI, for the Pth input variable, $X_P$, as:

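$$
VI_{X_P} = \frac{\displaystyle\sum_{j=1}^{n_H}\left(\frac{|I|_{Pj}}{\sum_{k=1}^{n_P}|I|_{kj}}\,|O|_j\right)}{\displaystyle\sum_{i=1}^{n_P}\left[\sum_{j=1}^{n_H}\left(\frac{|I|_{ij}}{\sum_{k=1}^{n_P}|I|_{kj}}\,|O|_j\right)\right]} \qquad (1)
$$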

where $n_P$ is the number of input variables, $n_H$ the number of hidden layer processing elements, $|I|_{Pj}$ is the absolute value of the hidden layer weight corresponding to the Pth input variable and jth hidden layer processing element, and $|O|_j$ is the absolute value of the output layer weight corresponding to the jth hidden layer processing element. The idea behind the formula is that for each j of the $n_H$ hidden layer processing elements, sum the product formed by multiplying the absolute value of hidden layer processing element weight I associated with variable P for hidden layer processing element j, times the absolute value of output processing element weight O for hidden layer processing element j. Divide this result by the sum of such quantities for all variables. This calculation produces a simple proportion of the output signal that is attributable to each input variable. It should be noted that this calculation does not take the bias or constant term weights resulting from the backpropagation process into account. Disregarding the bias weights is necessary because there is no way to relate the bias weights directly

to the input variables. However, a reasonable assumption to make is that the partition of the bias weights associated with the input variables would be the same as the partition of the weights excluding the bias weights. This assumption leads to an interpretational situation similar to a standard regression analysis where the constant term, although of possible interest, is not involved in determining relative variable importance.

To supplement this formulation, a set of computational steps follows that also extends the procedure to deal with more than a single processing element in the output layer. Let $n_P$ and $n_H$ be defined as before and $n_O$ represent the number of output layer processing elements. Then perform the following steps for a BP neural network ignoring the bias term:

Step 1: Make sure the network has converged. A good check is to make sure that iteration continues until all transfer function weight changes are zero through the fourth decimal place.

Step 2: Extract the $n_P$ transfer function weights for each of the $n_H$ hidden layer processing elements and take their absolute values. Compute the sum across all $n_P$ transfer function weights for each of the $n_H$ hidden layer processing elements. This calculation results in $n_H$ sums representing the total output signal for each hidden layer processing element.

Step 3: Extract the $n_H$ transfer function weights for each of the output layer processing elements and take their absolute values. Compute the sum across all $n_O$ output layer processing elements for each of the corresponding $n_H$ transfer function weights. This calculation will result in $n_H$ sums representing the output signal pattern based on each of the $n_H$ hidden layer processing elements.

Step 4: For each hidden layer processing element transfer function, divide each of the $n_P$ weights' absolute values by the corresponding sum computed in Step 2. This procedure effectively converts the $n_P$ weights to simple proportions. This operation will be performed $n_H$ times.

Step 5: For each of the $n_H$ hidden layer processing elements, multiply each of the $n_P$ proportions produced in Step 4 by the corresponding output layer sum produced in Step 3. Using this new set of weights, sum across the $n_H$ hidden layer processing element weights for each of the corresponding $n_P$ weights, producing a single set of $n_P$ weights for the hidden layer.

Step 6: Convert this final set of $n_P$ weights to simple proportions by computing the total across all $n_P$ weights and dividing each of the $n_P$ weights by this total.

To clarify the operations involved, a computational example is given in Appendix A. The result of this set of operations is a simple proportion corresponding to each

of the input variables that indicates its relative importance and contribution to producing the output of the BP neural network. This procedure results in an easily understood interpretation of how the BP network uses the input variables to produce the output predictions.

Some evidence of the validity of the Garson variable importance measure was given by Nath and Rajagopalan (1994). In a simulation study involving generation of 5000 samples from two populations under varied conditions, the variable importance orderings produced by classical linear discriminant analyses were compared to the variable orderings produced by the Garson importance measure when BP neural networks were developed for the corresponding sample data. Differences in variable importance orderings produced by the two procedures were trivial and not statistically significant. It should be understood that the direct relevance of the weights and the derived input variable proportional contributions to any given problem are dependent on the representativeness of the data used to train the network. As would be the case with ordinary regression weights, differing sets of training data would lead to different proportional input variable contributions in terms of the output predictions. To the extent that the training data is not representative of the general population of interest, the interpretation of the importance of the input variables will be misleading.

4. VARIABLE SELECTION METHODOLOGY

A second issue that has generated some interest among neural network researchers has been development of methods for network simplification. The primary methodology used for simplification is some form of connection pruning. This technique is exemplified by the "optimal brain damage" technique of LeCun, Denker and Solla (1990).
However, such connection pruning techniques rarely lead to elimination of any of the input variables since this pruning technique would require that the connections be dropped for all of the hidden layer processing elements corresponding to a given input variable, a very unlikely occurrence in actual practice. Given the existence of the Garson methodology for determining the contribution of each of the input variables to the output of a BP neural network model, it should be possible to use this technique to determine input variables that are primarily introducing noise to the resulting BP network model. Such a procedure would not only avoid the unnecessary cost of gathering input data that are of little relevance to producing accurate output predictions, but would also lead to more parsimonious explanations of the importance of the input variables for producing the output predictions. The law of parsimony dictates that when alternate explanations of a phenomenon are available, the simplest explanation is preferred. Building unnecessarily complex BP neural networks seems like a tremendous waste of resources.
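As a rough illustration of the Garson importance computation on which this selection idea rests, the six computational steps of Section 3.2 might be coded as follows. The function name `garson_importance`, the weight layout, and the example weights are assumptions made for this sketch; they are not taken from the paper or its Appendix A, and Step 1 (checking convergence) is assumed to have been done already.

```python
def garson_importance(hidden_w, output_w):
    """Proportional importance of each input variable (bias weights excluded).

    hidden_w[j][p]: weight from input variable p to hidden element j.
    output_w[k][j]: weight from hidden element j to output element k.
    Assumes every hidden element has at least one nonzero input weight.
    """
    n_h = len(hidden_w)
    n_p = len(hidden_w[0])
    # Step 2: absolute hidden layer weights and their sum per hidden element.
    abs_hidden = [[abs(w) for w in row] for row in hidden_w]
    hidden_sums = [sum(row) for row in abs_hidden]
    # Step 3: per hidden element, sum absolute output weights over all outputs.
    output_sums = [sum(abs(row[j]) for row in output_w) for j in range(n_h)]
    # Steps 4 and 5: convert to proportions, scale by the output sums,
    # then sum over the hidden elements for each input variable.
    contributions = [
        sum(abs_hidden[j][p] / hidden_sums[j] * output_sums[j] for j in range(n_h))
        for p in range(n_p)
    ]
    # Step 6: rescale so that the importances sum to one.
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical converged weights: two inputs, two hidden elements, one output.
example_hidden = [[1.0, 3.0], [2.0, 2.0]]
example_output = [[1.0, -1.0]]
importance = garson_importance(example_hidden, example_output)
# importance == [0.375, 0.625]
```

With these made-up weights the second input variable accounts for 62.5% of the output signal, so the first would be the candidate for elimination.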

4.1. The Backward Elimination Selection Procedure

Statisticians have a long history of development of variable selection procedures, particularly in the multiple regression context. Chatterjee and Price (1992) give a recent presentation of methods and Hocking (1976) gives a classic review of the common methodologies. Although many variable selection procedures are available, only backward or forward selection would be computationally feasible in the context of neural networks because of the tremendous computational resources required for network development with typically available network simulators. In particular, statisticians usually prefer the backward elimination procedure to forward variable selection because it allows assessment of the full model containing all candidate variables. An additional benefit in a neural network context is that this procedure requires minimal computational resources. One limitation of backward elimination and all other methods, except all possible subsets selection, is that there is no guarantee that the globally optimum variable subset will be found due to problems such as multicollinearity. Consequently, it is possible for the selection procedure to miss a theoretically important variable, although from the standpoint of predictive accuracy, there would be little effect. The selection methodology is straightforward. An optimized BP network model is developed using the full set of available input variables. The Garson methodology is then used to determine the input variables' importance relative to the output. The variable that makes the smallest contribution to the output is then dropped. This procedure is repeated for the remaining set of variables. This process continues until only a single input variable remains to develop the final BP network model. At each step the output objective function value is recorded.
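The selection loop just described might be sketched as follows. This is a minimal sketch under stated assumptions: `fit_and_score` is a hypothetical callback standing in for "develop an optimized BP network on these variables, then return the Garson importances and the objective function value", and `toy_fit_and_score` is a made-up stand-in so the loop can be run without a network simulator.

```python
def backward_elimination(variables, fit_and_score):
    """Drop the least important variable at each step; record each model.

    fit_and_score(vars) is a hypothetical callback returning
    (importances, score), where importances maps each variable name to its
    proportional contribution and score is the objective function value.
    """
    remaining = list(variables)
    history = []
    while remaining:
        importances, score = fit_and_score(remaining)
        history.append((list(remaining), score))
        if len(remaining) == 1:
            break  # the single-variable model is the last one developed
        weakest = min(remaining, key=lambda v: importances[v])
        remaining.remove(weakest)
    return history


def toy_fit_and_score(variables):
    """Stand-in for network training, with fixed made-up importances."""
    raw = {"x1": 5.0, "x2": 1.0, "x3": 3.0}
    total = sum(raw[v] for v in variables)
    importances = {v: raw[v] / total for v in variables}
    score = 0.9 if "x1" in variables else 0.5
    return importances, score


history = backward_elimination(["x1", "x2", "x3"], toy_fit_and_score)
# Models visited: [x1, x2, x3], then [x1, x3], then [x1]
```

The returned `history` holds one (variable subset, objective value) pair per step, which is the set of candidate models from which a final model must then be chosen.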
For a classification BP network, the recorded objective function value would correspond to the rate of correct classification (or error rate) determined using the same data used to train the BP model. The problem is now reduced to determining the best model from the set of models produced by the backward variable selection process. Unfortunately, there are no classical statistical tests available to help make this determination. However, in the context of variable selection procedures, such tests are not strictly accurate and are primarily of heuristic value anyway. Simply using the classification rate determined from the training data would typically be inappropriate as pointed out in the next section. A data-based method for estimating the accuracy of each of the alternative models' performance on new data is needed to guide model selection.

4.2. V-Fold Cross-Validation

It is usually the case that the data that is used to train a neural network model represents some sort of sample from a population of data to which the developed neural


network might be applied. The problem of estimating the performance of a model based on sample data when applied to data from the population not included in the sample is known as validation. The issue of validating the performance of a neural network model may be dealt with in several ways. The simplest procedure is known as the resubstitution method where the same observations used to train the neural network are then resubstituted into it to estimate its performance. Unfortunately, this very simple and appealing procedure is also a biased method for estimation of error rates (Efron, 1982; Efron & Tibshirani, 1993; Krzanowski, 1988). The reason for this bias is that the same individual observations for which the neural network was optimized will be precisely those observations of all the ones to which the network could be applied that have the least chance of being misallocated by the given neural network model. Thus, the resubstitution method will provide an overoptimistic assessment of the performance accuracy of the neural network model on future individuals from the population to which it will be applied. The smaller the size of the training sample, the more extreme the overoptimism is likely to be. Therefore, the resubstitution method may give very misleading results unless the sample size is indeed large. One methodology for validating the performance of neural networks has become fairly standard. This methodology is known as cross-validation. The standard methodology involves splitting the training data set randomly into two portions, and then using one portion of the training set to develop the neural network model itself, and the other portion to assess its performance (Efron, 1982; Efron & Tibshirani, 1993; Krzanowski, 1988; Stone, 1974). Although this standard validation methodology addresses the problem of performance bias, it has several practical drawbacks.
The primary drawback is that such a procedure is very wasteful of valuable data since only a portion of the available data is actually used to train the neural network. Unless the initial sample size is very large, the developed neural network model or the assessment of its performance or both will be based on small samples that are subject to large sampling fluctuations. If the estimation of model performance is based on the split sample and then the split samples are combined and a final model is developed on the full sample data set, the model used for prediction of future observations will be different from the model used to develop the performance estimate. To overcome most of the problems inherent in the two previous methods for assessing neural network model performance, the use of V-fold cross-validation is proposed (Efron, 1982; Efron & Tibshirani, 1993; Krzanowski, 1988; Stone, 1974; Urban Hjorth, 1994). Using a classification problem as an example, the basic methodology is as follows. First determine V, which indicates the number of random groupings of data into which the original data set is to be split. The larger V, the


less bias will be involved in the estimation of the model's future performance, where the maximum value of V would be equal to the sample size, n. The first of the V groups is held out for model validation while the actual model is trained using the remaining V-1 groups. The neural network model is developed using the combined V-1 group sample and its performance is assessed with the held out group sample. The second group is then removed from the V-1 group sample and the first group is included in this sample. Again, the model is developed with the V-1 group sample and its performance is determined using the held out second group. This process is repeated V times, once for each of the V sample groups. Because of the computational burden imposed both by this validation methodology and by the neural network algorithm, setting V equal to 10 is recommended as sufficiently accurate for practical purposes. After the 10-fold cross-validation has been carried out, the final neural network model is trained using all of the data and its future performance on new data is estimated from the 10-fold cross-validation results. This procedure tends to avoid the previous validation problem of the resubstitution method since the held out group being classified each time has not been used in the formulation of the neural network model. It also tends to avoid the problem of the simple sample-splitting method since the model being assessed is based on 90% of the data observations, which should differ only slightly from the model based on all of the observations that will actually be used for future classification. Efron (1982) and Efron and Tibshirani (1993) provide a theoretical discussion of cross-validation and other procedures such as the jackknife and bootstrap that could be useful in assessing neural network performance.
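The V-fold procedure can be sketched generically in Python. This is an illustrative sketch only: the function `v_fold_accuracy`, its `train` and `predict` callbacks, and the trivial majority-class "model" used in the demonstration are assumptions introduced here; in the setting of this paper the callbacks would wrap the development and application of a BP network.

```python
import random


def v_fold_accuracy(samples, labels, train, predict, v=10, seed=0):
    """Estimate the rate of correct classification by V-fold cross-validation.

    train(xs, ys) returns a model; predict(model, x) returns a class label.
    Both callbacks are hypothetical placeholders for a BP network simulator.
    """
    n = len(samples)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # random grouping of the data
    folds = [indices[i::v] for i in range(v)]  # V groups of near-equal size
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        # Develop the model on the remaining V-1 groups.
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        # Assess it only on the group that was held out.
        correct += sum(predict(model, samples[i]) == labels[i] for i in held_out)
    return correct / n


def train_majority(xs, ys):
    """Toy stand-in model: remember the most frequent class label."""
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    """Toy stand-in prediction: always return the remembered label."""
    return model


samples = list(range(20))
labels = [1] * 15 + [0] * 5
accuracy = v_fold_accuracy(samples, labels, train_majority, predict_majority, v=10)
# accuracy == 0.75: every held-out case is predicted as class 1
```

After the cross-validated estimate is obtained, the final model would be trained on all of the data, with `accuracy` serving as the estimate of its performance on new data.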

4.3. Selecting the Final Model

Applying the 10-fold cross-validation procedure to the backward elimination variable selection process requires the computation of the 10-fold cross-validated rate of correct classification for each model produced by the backward variable selection procedure. The naive approach to model selection would be to select the simplest model with the highest cross-validated correct classification rate (or lowest error rate) as the optimal model. However, several important practical considerations dictate a modification of this naive approach. Model classification differences of only a few percent could easily be due to extraneous causes. First, even slight changes to the network model parameters such as the number of training iterations, the number of hidden layer processing elements, the learning coefficients, the momentum term, or the epoch length could all possibly lead to better models. There is no way to be sure that the optimal model has been reached. Therefore, small differences in model accuracy may be due to a less than optimal choice of a model parameter or parameters instead of some real difference in model performance. Secondly, the training sample and the model derived from it will by chance have characteristics that are different from the population of data to which the model will ultimately be applied. Again, small differences in the rate of correct classification of two or more competing models may not be statistically meaningful. The third and perhaps most important reason for ignoring small differences in model performance is based on the law of parsimony. When two models have comparable classification accuracy, it seems reasonable to choose the simpler model.

The problem now reduces to defining in some way what a small difference is. A simple heuristic method similar to the one used by Breiman, Friedman, Olshen and Stone (1984) in their rule inductive CART methodology can be used to help in the network model selection process. Given the situation where a single training sample of size N is used to develop a neural network model with p variables, and a validation sample of size Nv is used to validate the model, the proportion of correct classifications based on the validation sample, Pv, is an estimate of the population value of correct classifications, P*, and (1 - Pv) an estimate of the population error rate for the model. Since running any validation observation through the network model may be viewed as a binomial trial with P* representing the probability of a success and (1 - P*) representing the probability of a failure, it should be apparent that the expected value of Pv is P*. The standard error for Pv may then be estimated as

$$\mathrm{SE}(P_v) = \sqrt{\frac{P_v (1 - P_v)}{N_v}} \qquad (2)$$

In the V-fold cross-validation case, the same data are used both to build the training models and to validate them. This procedure violates the assumption of independence required for the use of the binomial standard error estimate. However, as pointed out by Breiman et al. (1984), there is no clear way to obtain a rigorously appropriate standard error for the V-fold cross-validated classification rate. The heuristic used by Breiman et al. (1984) is simply to ignore the lack of independence and to apply (2) anyway with Nv equal to N and Pv equal to the V-fold cross-validated classification rate, Pcv. This procedure results in a conservative estimate from the standpoint of model selection in the sense that the standard error is smaller than it actually should be, which might lead to including one or several extra variables instead of too few. The rule for selecting the best network model is to select the simplest model that is within one standard error of the model that attains the maximum Pcv (Efron, 1982). This rule is reasonable since, from a statistical perspective, models within one standard error of each other would not be considered statistically different. As with any heuristic rule, an analyst's knowledge of the context of the problem and of the variables involved may suggest an alternative selection.
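The one-standard-error rule can be illustrated with a short Python sketch. The function names are assumptions made for illustration; the rates are the 10-fold cross-validated rates reported for the commercial loan example in Table 2, keyed by the number of variables in each model, with N = 42.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion p based on n binomial trials, as in (2)."""
    return math.sqrt(p * (1.0 - p) / n)

def one_se_choice(cv_rates, n):
    """cv_rates maps number of variables -> cross-validated correct rate.

    Returns (number of variables in the chosen model, SE, threshold),
    where the chosen model is the simplest one whose rate is within
    one SE of the maximum rate.
    """
    best = max(cv_rates.values())
    se = binomial_se(best, n)
    threshold = best - se
    chosen = min(k for k, p in cv_rates.items() if p >= threshold)
    return chosen, se, threshold

# Cross-validated rates from Table 2 (commercial loan data).
loan_cv_rates = {
    17: 0.762, 16: 0.810, 15: 0.762, 14: 0.786, 13: 0.833, 12: 0.833,
    11: 0.810, 10: 0.881, 9: 0.857, 8: 0.786, 7: 0.786, 6: 0.833,
    5: 0.881, 4: 0.667, 3: 0.714, 2: 0.667, 1: 0.738,
}
chosen, se, threshold = one_se_choice(loan_cv_rates, n=42)
```

Under these inputs the sketch selects the five-variable model, since no model with fewer than five variables reaches the threshold of roughly 0.831.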

5. EXAMPLE PROBLEMS

5.1. Commercial Loan Problem

For purposes of demonstration, the first problem introduced involves commercial loan data. This problem is ideal for demonstration purposes since it involves a small sample with a large number of candidate variables. The data set used for this example contained 42 commercial loan applications obtained from the credit department of a large bank. Of the 42 loan applications, 26 had been approved and 16 had been disapproved. Each loan application included a set of 17 financially-related attributes along with the decision to grant or deny the loan. Bank loan officers suggested these financial attributes would be useful in making their decision. Descriptions of the financial attributes and simple summary statistics are given in Appendix B. The categorical variable that indicated whether a commercial loan was a loan renewal was actually represented by two input elements, 1 and 0 for a renewal and 0 and 1 otherwise. Model development may be carried out with any implementation of the backpropagation algorithm. In our case, an initial network model that had 18 input elements in the input layer, 3 elements in the hidden or middle layer, and 2 output elements was developed using the standard BP neural network implementation available in NeuralWare's neural network simulator (Klimasauskas, 1988). The number of processing elements to use in the hidden layer was determined by 10-fold cross-validation. This procedure prevents selecting too many hidden layer processing elements, which leads to overfitting or possible overtraining of the training data and a network that generalizes poorly. The network was trained using the normalized cumulative delta rule with the epoch length set to 21 and a hyperbolic tangent transfer function. The epoch length determines the number of iterations over which the error is accumulated before the weights are updated.
Using a long epoch length, 50% of the training data in this case, discourages the network from grandmothering or learning a single or small subset of training observations and can lead to faster convergence. The combined effect of using cross-validation to select the number of hidden layer processing elements and selecting a relatively long epoch length is to allow training a network optimally that does not overfit the training data. When only a small training sample is available, this approach can be especially important. Training a network to optimality is important because of the consistency of the development of the network weights required by the variable selection procedure. Procedures that attempt to compensate for overly complex network models, which may easily overfit the training data, by stopping training early based on hold out sample error are not consistent enough to be used effectively with the variable selection methodology. Furthermore, such procedures amount to indirect training on the holdout cases (Carpenter & Hoffman, 1995). The network was trained for 150,000 presentations of data although basic convergence appeared to occur at approx. 10,000 iterations. After training, all weight changes were zero through the fourth decimal place. Throughout the variable selection process, BP networks were trained using this same process with reductions in the number of input elements used. Through use of 10-fold cross-validation, it was determined that the three element hidden layer remained appropriate.

The backward selection process required the development of 17 BP network models. Table 1 summarizes the result of applying this methodology to the commercial loan data. For each model the relative proportional contribution to the output of each input variable is shown in parentheses. The variables are ordered by their relative output contribution, from most important to least important. The first model is the full model that includes all 17 input variables. By looking at the relative output contribution of each variable it can be seen that MARGIN is making the smallest relative output contribution, accounting for only 1.1% of the total output. This indicates that MARGIN should be eliminated at this stage and a new reduced model should be formed using only the 16 remaining variables. Model 2 shows this new reduced model for which the relative output contribution of each of the remaining variables has been recalculated. In Model 2 we see that RECTO is making the smallest contribution, accounting for only 1.6% of the total output. Therefore, Model 3 will consist of only 15 variables, excluding both MARGIN and RECTO.
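The elimination loop just described (retrain, rank the variables by relative output contribution, drop the weakest, repeat) can be sketched as follows. This is only an outline under stated assumptions: `train_and_rank` is a hypothetical callback standing in for BP network training followed by the importance calculation, and the variable names and scores in the example are invented for illustration, not taken from Table 1.

```python
def backward_elimination(variables, train_and_rank):
    """Repeatedly drop the variable with the smallest relative contribution.

    train_and_rank(vars) is assumed to retrain a network on `vars` and
    return {variable: relative contribution}, with contributions summing
    to 1. Returns one (variables, contributions) pair per model built.
    """
    current = list(variables)
    history = []
    while len(current) > 1:
        contrib = train_and_rank(current)
        history.append((list(current), contrib))
        weakest = min(current, key=lambda v: contrib[v])
        current.remove(weakest)
    # The single-variable model trivially contributes 1.0 to the output.
    history.append((list(current), {current[0]: 1.0}))
    return history

# Hypothetical stand-in for network training: fixed raw scores that are
# renormalised over whichever variables remain in the model.
RAW_SCORES = {"x1": 5.0, "x2": 1.0, "x3": 3.0, "x4": 0.5}

def fake_train_and_rank(vars_in_model):
    total = sum(RAW_SCORES[v] for v in vars_in_model)
    return {v: RAW_SCORES[v] / total for v in vars_in_model}

models = backward_elimination(RAW_SCORES, fake_train_and_rank)
variable_sets = [vars_in_model for vars_in_model, _ in models]
```

Starting from p candidate variables, the loop produces p nested models, which matches the 17 models developed for the 17 loan variables.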
This process is continued until only a single variable model remains. The final, single variable model is excluded since it must make a relative contribution of 1.0 to the output. For each of the 17 models resulting from the backward selection process, classification performance was assessed, based on the full training data set followed by use of 10-fold cross-validation. For each BP network model, Table 2 gives the classification matrix, along with the overall rate of correct classification for both the full training data set and the corresponding 10-fold cross-validation process. As was the case in Table 1, the first model corresponds to the full model including all of the 17 input variables. The first classification matrix for Model 1 indicates the classification performance for this model based on training the full 42 observation data set and then applying this model to the same 42 observations on which it was just trained. Group 0 corresponds to the loans which were denied and 1 corresponds to the loans which were granted. The rows of the matrix represent the actual group membership of the observations while the columns represent the predicted group membership. The diagonal elements of the matrix indicate the number of correct classifications for each group. The overall classification performance was determined by summing the diagonal elements of the matrix and dividing by the total number of observations. In the case of Model 1, (16 + 26)/42 equals a perfect classification rate of 1.0 for the model based on the full set of observations. The second matrix is identical in format to the first, but indicates the classification results from 10-fold cross-validation. In this case the overall correct classification performance for Model 1 is (9 + 23)/42, which equals 0.762. The classification matrices for the remaining models corresponding to those in Table 1 were developed in exactly the same way as those for Model 1. The model selected as the single best model is indicated with an asterisk.

By looking at the last column of Table 2, the model with the smallest number of variables that achieves the highest classification rate (lowest error rate) can be identified as Model 13, which includes 5 variables with a correct classification rate of 0.881. It might be noted that Model 8 has the same rate of cross-validated classification but requires 10 variables to achieve the same level of performance. Applying the heuristic based on (2) gives an SE of about 5%, [(0.881 × 0.119)/42]^(1/2), for the five-variable model. Since no simpler model is within one SE (has a classification rate of 0.831 or more), the five-variable model is determined to be the best model. Figure 1 displays a graphical summarization of the data in Table 2 by plotting the full training data set and cross-validated error rates corresponding to the number of variables in each model from the backward selection process.
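The arithmetic for reading a classification matrix can be written out as a small Python sketch. The function names are illustrative assumptions; the two matrices are the Model 1 entries from Table 2, with rows as actual groups and columns as predicted groups.

```python
def overall_rate(matrix):
    """matrix[i][j] = number of cases in actual group i predicted as group j.
    Returns the sum of the diagonal divided by the total number of cases."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def group_rates(matrix):
    """Proportion of each actual group that was classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Model 1 (all 17 variables), commercial loan data, from Table 2.
full_model_1 = [[16, 0], [0, 26]]   # trained and evaluated on all 42 cases
cv_model_1 = [[9, 7], [3, 23]]      # 10-fold cross-validated

full_rate = overall_rate(full_model_1)   # (16 + 26) / 42
cv_rate = overall_rate(cv_model_1)       # (9 + 23) / 42
cv_by_group = group_rates(cv_model_1)    # denied loans, granted loans
```

The per-group rates are not reported in the text, but they fall out of the same matrix and show that the cross-validated errors for Model 1 are concentrated in the denied-loan group.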
As the cross-validated rates of correct classification in Table 2 and the error rates in Figure 1 indicate, the five-variable model is the simplest model that achieves the minimal cross-validated error resulting from the backward selection process with a cross-validated classification rate of 88%. In this instance, this model is also the model with the minimum number of variables that first achieves a 100% correct classification rate on the full training data set. The five-variable model selected by the backward elimination methodology gives a 10-fold cross-validated classification rate of 88% and may be interpreted by referring to the relative output contribution of each variable for Model 13 in Table 1. The most important variable in the model is working capital turnover, which accounts for about 30% of the output. The debt ratio and return on assets variables are about equally important, each accounting for approx. 20% of the output. The quick ratio, which measures immediate liquidity, accounts for 13% of the output. The least important variable is renewal, which indicates if the loan is a renewal of a current loan, accounting for about 8% of the output. The selected set of input variables seems to form a reasonable basis for determining if a loan should be granted or rejected. As the cross-validated results


TABLE 2
Matrices and Rates of Correct Classification

                   Full data set                         10-fold cross-validated
                   Actual 0      Actual 1                Actual 0      Actual 1
Model  Variables   P0     P1     P0     P1     Rate      P0     P1     P0     P1     Rate
1      17          16      0      0     26     1.000      9      7      3     23     0.762
2      16          16      0      0     26     1.000     10      6      2     24     0.810
3      15          16      0      0     26     1.000      8      8      2     24     0.762
4      14          16      0      0     26     1.000     10      6      3     23     0.786
5      13          16      0      0     26     1.000     11      5      2     24     0.833
6      12          16      0      0     26     1.000     11      5      2     24     0.833
7      11          16      0      0     26     1.000     11      5      3     23     0.810
8      10          16      0      0     26     1.000     13      3      2     24     0.881
9      9           16      0      0     26     1.000     12      4      2     24     0.857
10     8           16      0      0     26     1.000     10      6      3     23     0.786
11     7           16      0      0     26     1.000     10      6      3     23     0.786
12     6           16      0      0     26     1.000     11      5      2     24     0.833
13*    5           16      0      0     26     1.000     13      3      2     24     0.881
14     4           14      2      1     25     0.929      7      9      5     21     0.667
15     3           13      3      1     25     0.905      8      8      4     22     0.714
16     2           12      4      2     24     0.857      8      8      6     20     0.667
17     1            9      7      2     24     0.738      9      7      4     22     0.738

P0 and P1 denote predicted group 0 and predicted group 1.
Note: The asterisk indicates the selected optimal model.

indicate, including more variables in a BP network model does not necessarily lead to better performance. One interesting comparison can be made. If the variable selection methodology is working reasonably, then the variables selected for the five-variable BP network model can be compared with the variables selected by other, better established, methodologies. Because these methodologies have different mathematical and methodological bases, an exact correspondence would not be expected. The results of the best five-variable model selected by use of classical linear discriminant analysis (CLDA) (Fisher, 1936), logistic regression analysis (LRA) (Krzanowski, 1988), recursive partitioning analysis (as implemented in CART) (Breiman et al., 1984), and the new BP-network methodology are shown in Table 3. The BP-network and LRA models contain the same five variables. The BP-network model differs from CLDA

and CART by only a single variable, LOANTERM. The most striking feature is the similarity of the variables selected, despite the very different bases of the solutions.

5.2. Check Overdraft Problem

The second problem is based on a set of data that captures a bank's check overdraft return or payment policy. This problem was selected because it is maximally different from the first problem. The problem involves a large sample and a small number of variables that are primarily categorical in nature. The data set used for this example contained 340 checking account overdrafts obtained from a bank located in a university-dominated city. Of the 340 checking account overdrafts, 145 had been paid by the bank while 195 had been returned unpaid. Each check overdraft included a set of

FIGURE 1. The cross-validated and training sample error rates for the BP-network models resulting from the backward variable selection process for the commercial loan data. (Vertical axis: error rate; horizontal axis: number of variables; curves: cross-validated and full data set, with the selected model marked.)

four customer-related attributes, one continuous and three categorical, and three interaction variables along with the decision to pay the overdraft or return the check unpaid. Descriptions of the customer attributes and simple summary statistics are given in Appendix C. As in the previous problem, the three categorical variables were fully coded as a set of binary inputs and treated as single variables in the variable selection process. Using the standard BP neural network simulator as in the foregoing problem, an initial network was built that had 11 input elements in the input layer, 3 elements in the hidden or middle layer, and two output elements. BP network training used the normalized cumulative delta rule with the epoch length set to 5 (and in several cases up to 20) with a hyperbolic tangent transfer function. The network was trained for 150,000 presentations of data although basic convergence appeared to occur at approx. 70,000 iterations. The procedure followed was the same as in the previous example. The backward selection process required the development of seven BP-network models. The results of applying this methodology to the check overdraft data are summarized in Table 4. For each model, the relative proportional contribution to the output of each input variable is shown in parentheses. As was previously the case, the variables are ordered by their relative output contribution from most important to least important. The first model includes all seven variables. TYPEACT is making the smallest proportional contribution to the output, accounting for only 0.8% of the output, and therefore is eliminated in model two. In model two

which contains the six remaining variables, MULTCKS is making the smallest contribution of 3.3% to the output. Therefore MULTCKS is dropped from model three, which contains only the five remaining variables. This process is continued, resulting in the final single variable model, which can be omitted since it accounts for 100% of the output. For each of the resulting models from the backward selection process, 10-fold cross-validation was carried out. Table 5 gives the classification matrix along with the overall rate of correct classification for both the full training data set and the corresponding 10-fold cross-validation process for each BP network model. The classification matrices correspond to the models resulting from the backward variable selection process. The first classification matrix, which corresponds to the full model containing all seven variables, indicates the performance of this model based on the full set of 340 observations. A check which was returned unpaid is coded as 0 while a check which was paid is coded as 1. The second matrix in the first row indicates the classification performance of this model based on the 10-fold cross-validation results. The numbers in the matrices are interpreted as in the previous example. Figure 2 displays a plot of the full training data set and cross-validated error rates derived from Table 5 corresponding to the number of variables in each model from the backward selection process. As the cross-validated rates of correct classification in Table 5 and the error rates in Figure 2 indicate, the six-variable model is the simplest model that achieves the minimal cross-validated error resulting from the backward selection process, with a cross-validated classification rate of 95%. Using the heuristic based on (2) results in a SE of 1.3% for the six-variable model.
The simpler two-variable model with a cross-validated classification rate of 93.8% is within one standard error of the six-variable model and is therefore determined to be the best model. The two-variable model selected by the backward elimination methodology has an especially appealing interpretation. The most important variable in the model is credit rating, which accounts for about 60% of the output. The interaction of the unknown credit rating binary variable and the amount of the overdraft is the least important variable and accounts for 40% of the output. In this instance a very straightforward interpretation is possible. Generally, if the person making the overdraft has a bad credit rating the check(s) is/are returned, a good credit rating results in payment of the check(s), and if the credit rating is unknown, the decision to pay or return the check(s) unpaid is based on the amount of the overdraft. Furthermore, an interesting transformation can be made to this BP-network model. By experimentally inputting data to the model, it is easily determined that with an unknown credit rating an overdraft amount of six dollars or less results in the check(s) being paid and any amount over six dollars


TABLE 5
Matrices and Rates of Correct Classification

                   Full data set                         10-fold cross-validated
                   Actual 0      Actual 1                Actual 0      Actual 1
Model  Variables   P0     P1     P0     P1     Rate      P0     P1     P0     P1     Rate
1      7           140     5     10    185     0.956     138     7     10    185     0.950
2      6           136     9      7    188     0.953     138     7     10    185     0.950
3      5           137     9     12    183     0.941     137     8     14    181     0.935
4      4           137     8     12    183     0.941     136     9     12    183     0.938
5      3           137     8     12    183     0.941     136     9     13    182     0.935
6*     2           137     8     12    183     0.941     137     8     13    182     0.938
7      1            74    71      1    194     0.788      74    71      1    194     0.788

P0 and P1 denote predicted group 0 and predicted group 1.
Note: The asterisk indicates the selected optimal model.

results in the check(s) being returned unpaid. This neural network may then be replaced with a set of simple production rules! As in the previous example, the variables selected for the two-variable BP-network model can be compared with the variables selected by other, better established, methodologies. The results of the best two-variable model selected by use of CLDA, LRA, CART, and the new BP-network methodology are shown in Table 6. The CLDA, LRA, and BP network models are identical. The CART model differs by one variable. Once again, the most striking feature is the similarity of the variables selected, despite the very different bases of the solutions.

FIGURE 2. The cross-validated and training sample error rates for the BP-network models resulting from the backward variable selection process for the check overdraft data. (Vertical axis: error rate; horizontal axis: number of variables; curves: cross-validated and full data set, with the selected model marked.)
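The production rules described for the two-variable check overdraft model can be written down directly. The sketch below is an assumed encoding: the function name, the string labels for credit rating, and the return values are illustrative, while the six dollar cutoff and the direction of each rule come from the text.

```python
def overdraft_decision(credit_rating, overdraft_amount):
    """Return 'pay' or 'return' for an overdrawn check.

    credit_rating: 'good', 'bad', or 'unknown' (labels are assumptions).
    overdraft_amount: amount of the overdraft in dollars.
    """
    if credit_rating == "good":
        return "pay"
    if credit_rating == "bad":
        return "return"
    if credit_rating == "unknown":
        # With an unknown rating, the decision depends only on the amount.
        return "pay" if overdraft_amount <= 6.0 else "return"
    raise ValueError("credit_rating must be 'good', 'bad', or 'unknown'")

decisions = [
    overdraft_decision("good", 250.0),
    overdraft_decision("bad", 2.0),
    overdraft_decision("unknown", 6.0),
    overdraft_decision("unknown", 6.01),
]
```

These rules summarize the general behavior reported for the network; they are not guaranteed to reproduce every one of its individual classifications.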

6. DISCUSSION

The degree to which the variable selection procedure simplified the resulting BP-network models can be precisely determined for the two example problems. In the commercial loan problem there were originally 17 variables under consideration as predictors. To be conservative, the two-group categorical variable, which actually used two input elements, is counted as a single variable. Since each input predictor variable must have a function weight or parameter estimated for it in each hidden layer element, this means that for the three hidden layer elements of the BP network, 51 connection parameters would have to be estimated, excluding the bias terms. After the variable selection process, only 5 input variables remained, requiring the estimation of only 15 connection parameters. Thus, the complexity of the BP network was reduced by 36 parameters or 71% of the input to hidden layer connections. This demonstrates the dramatic degree of simplification that can result from the backward variable selection procedure. The most important point to make is that this significant simplification occurs with no real loss in the predictive accuracy of the model or even an increase in the model's predictive accuracy! In the check overdraft problem, seven predictor variables were originally used, again considering two-group categorical variables as single inputs. This means


that a total of 21 connection parameters corresponding to the input variables had to be estimated for the three hidden layer processing elements. The variable selection procedure resulted in the retention of only two variables, requiring estimation of only six hidden connection parameters. As in the previous case, the reduction of the BP network input to hidden layer connections by 71% demonstrates the considerable simplification to the network architecture that can result from the backward variable selection methodology.

Although the backward variable selection procedure can produce significant simplification of the BP-network connection architecture, there were some problems that became apparent in the original development of the methodology. One problem with the Garson importance measure was that qualitative variables, which were coded as binary input variables, were not necessarily given as much weight as they warranted. This problem was overcome by including the full set of binary coded inputs and then summing the importance proportions across this full set. For example, the loan renewal variable was originally represented for input as a 1 if a loan was a renewal and 0 otherwise, requiring only a single input element and resulting in its early elimination from the model. This coding was then changed to 1 and 0 for a renewal and 0 and 1 otherwise, requiring two input elements whose contributions to the output were summed. This second coding method was found to be the method of choice for giving categorical variables the weight they deserved and led to a much better set of models in the variable selection process. In the commercial loan example, any model requires the renewal variable to achieve 100% classification accuracy on the training data. The importance of individual categories within a composite categorical variable may be assessed using the importance measure.
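The summing of importance proportions across a fully coded categorical variable can be sketched as follows. The importance function uses one commonly cited form of Garson's measure for a single output element; this is an assumption, since the measure used in this paper is defined in an earlier section and may differ in detail. The weights, the variable names X1 and RENEWAL, and the function names are hypothetical.

```python
def garson_importance(w_in, w_out):
    """Proportional contribution of each input element to one output.

    w_in[i][j]: weight from input element i to hidden element j.
    w_out[j]:   weight from hidden element j to the output element.
    Each hidden element's absolute output weight is shared among the
    inputs in proportion to their absolute input weights, then the
    shares are summed over hidden elements and normalised to sum to 1.
    """
    n_in, n_hid = len(w_in), len(w_out)
    col_sums = [sum(abs(w_in[i][j]) for i in range(n_in)) for j in range(n_hid)]
    raw = [
        sum(abs(w_in[i][j]) / col_sums[j] * abs(w_out[j]) for j in range(n_hid))
        for i in range(n_in)
    ]
    total = sum(raw)
    return [r / total for r in raw]

def group_importance(importance, groups):
    """groups maps a variable name to the indices of its input elements;
    the shares of those elements are summed to give one value per variable."""
    return {name: sum(importance[i] for i in idx) for name, idx in groups.items()}

# Hypothetical weights: one continuous input plus a two-element binary coding.
w_in = [[1.0, -1.0],    # X1
        [0.5, 0.5],     # RENEWAL coded (1, 0)
        [-0.5, 1.5]]    # RENEWAL coded (0, 1)
w_out = [2.0, -1.0]

element_importance = garson_importance(w_in, w_out)
variable_importance = group_importance(
    element_importance, {"X1": [0], "RENEWAL": [1, 2]}
)
```

In this made-up example neither renewal element outranks X1 on its own, but the summed renewal variable does, which is the effect the two-element coding is meant to capture.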
Another problem, not directly a fault of the importance measure, is that it is only indirectly connected to the actual objective function of the problem, which is to minimize the classification error. This problem is directly tied to the least-squares objective function used by the standard BP network, which is also only indirectly related to the correct classification rate (Gallant, 1993). Although the developed models based on the importance measure are good, there is no guarantee that the selected models are actually the optimal models in terms of the correct rate of classification. One interesting characteristic of the variable selection methodology is that only variables which make no contribution to the general predictive accuracy of a BP network model are eliminated. Variables with a high degree or even complete redundancy of information are not necessarily dropped from the model unless they are not contributing to predictive accuracy. The stochastic optimization methodology used in training BP neural networks is the primary reason for this behavior. However, there is some question as to the arbitrariness of


the resulting weights in this situation, since no explicit solution constraints are provided. From an explanatory standpoint, the retention of nearly redundant but theoretically important variables in a model may be desirable, provided the variable importance weights have some reasonable interpretive basis. The characteristics and consistency of the importance weights of nearly redundant variables would be an area for future research.

Although the demonstrated results of the variable selection methodology support the usefulness of the Garson importance measure with standard BP-network scaling, the scaling of the data still could be an interesting issue for further research. The standard neural network approach to scaling data is to transform all input values so that they range from zero to one. Since, in simple terms, the Garson methodology partitions the sum of the output weights in terms of the weights summed across the hidden layer functions for each input variable, any rescaling of the data which would affect the relative size of the hidden and output layer weights would also affect the Garson importance measure. This in turn could affect the variable selection process. For example, one standard rescaling of the data that immediately suggests itself would be to first scale each input variable to standard form with a mean of zero and a standard deviation of one before restricting the range between zero and one.

Another target for further research would be the development of a comprehensive model development methodology based on the 10-fold cross-validation process. In this methodology a series of increasingly complex network-based models would be used, starting with simple linear single element models and extending to more complex nonlinear and multi-layer BP models with an increasing number of hidden layer elements and even multiple hidden layers. Cross-validation would be used to select the simplest model which demonstrated the best generalization potential.
The backward variable selection methodology could then be used to further simplify the selected model. In some preliminary investigation of this methodology, the results have been encouraging.

Finally, it is important to keep in mind that the variable selection procedure is heuristic in nature. The heuristic basis of the procedure means that it should be used with caution since it cannot be guaranteed to automatically work in all problem domains and with all data sets. An analyst's knowledge of the problem to which the variable selection procedure is being applied can be used to monitor and modify the process if necessary.

7. SUMMARY

A new methodology that is computationally feasible has been presented for use with BP-based neural networks. This methodology allows the elimination of noise and possibly redundant variables from a set of candidate

51

input variables using a backward elimination variable selection technique to develop a set of candidate models. Final model selection is then based on the results of 10-fold cross-validation, which leads to a parsimonious model that provides good predictive accuracy in actual application to data not originally used to develop the model. Furthermore, the criteria used for variable selection allows for the direct interpretation of the importance of the input variables to the generated output of the selected final model. This methodology was first demonstrated with a set of commercial loan data that had 17 candidate financial input attributes. Using all 17 variables, a BP neural network model that could correctly classify 100% of the training data as developed. Application of the variable selection methodology led to selecting a much simpler five-variable model that still gave 100% classification accuracy for the training data and had a cross-validated classification rate of 88%. As a secondary benefit, the cross-validation results indicate that the simplified fivevariable model will actually generalize better than the complex 17-variable model. A second application of the variable selection methodology to a set of bank checking account overdraft data resulted in reducing a sevenvariable BP network model to a much simpler two-variable model that gave 94.1 s-validated classification rate. In this instance the final model could actually be reduced to a simple set of decision rules. Comparisons of the variables in the selected BP-network models with variables selected by methodologies with established variable selection procedures reinforce the validity of the BP-network variable selection methodology. The variable selection procedure takes a definite step toward overcoming what is possibly the primary criticism of neural networks used to develop expert systems for decision support. 
BP-based neural network models do not have to be treated as complex and mysterious black boxes. The usefulness of the developed variable selection procedure should be apparent from the standpoint of cost savings realized by not collecting unnecessary data and the development of parsimonious and interpretable neural network models that will perform well in actual applications.
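The procedure summarized above (backward elimination to generate candidate models, then cross-validation to choose among them) can be sketched roughly as follows. This is only an illustrative outline, not the procedure exactly as implemented in the study: the names `backward_select`, `importance_fn`, and `cv_score_fn` are hypothetical, the sketch assumes one variable is dropped per step, and the tie-breaking rule (prefer the smallest model among those with the best cross-validated score) is an assumption. In practice `importance_fn` would retrain a BP network on the current variables and return Garson importance weights, and `cv_score_fn` would return a 10-fold cross-validated classification rate; toy stand-ins are used here so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Tuple[List[str], float]


def backward_select(
    variables: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    cv_score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], List[Candidate]]:
    """Drop the least important variable at each step, then choose the
    candidate model with the best cross-validated score."""
    if not variables:
        raise ValueError("at least one candidate variable is required")
    current = list(variables)
    candidates: List[Candidate] = []
    while current:
        # Record the current variable set together with its CV score.
        candidates.append((list(current), cv_score_fn(list(current))))
        if len(current) == 1:
            break
        # Remove the variable with the smallest importance weight.
        importances = importance_fn(list(current))
        weakest = min(current, key=lambda name: importances[name])
        current.remove(weakest)
    best_score = max(score for _, score in candidates)
    # Assumed tie-break: the fewest variables among the best-scoring models.
    best_vars = min(
        (names for names, score in candidates if score == best_score), key=len
    )
    return best_vars, candidates


# Toy stand-ins (hypothetical data, not taken from the study).
TOY_IMPORTANCE = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}


def toy_importance(names: List[str]) -> Dict[str, float]:
    total = sum(TOY_IMPORTANCE[n] for n in names)
    return {n: TOY_IMPORTANCE[n] / total for n in names}


def toy_cv_score(names: List[str]) -> float:
    # Pretend only variables "a" and "b" carry predictive information.
    return 0.9 if {"a", "b"} <= set(names) else 0.5


best, history = backward_select(["a", "b", "c", "d"], toy_importance, toy_cv_score)
```

With these toy functions the selected set is the two-variable model, since removing either of its variables lowers the toy cross-validated score.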

REFERENCES

Atlas, L., Cole, R., Connor, J., El-Sharkawi, M., Marks II, R. J., Muthusamy, Y., & Barnard, E. (1990). Performance comparisons between backpropagation networks and classification trees on three real world applications. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems 2, pp. 441-449. San Mateo, CA: Morgan Kaufmann Publishers.
Barr, A. & Feigenbaum, E. A. (1982). The Handbook of Artificial Intelligence (Vol. II). London: Pitman Books, Ltd.
Bratko, I. & Michie, D. (1987). Some comments on rule induction. Knowledge Engineering Review, 2, 65-67.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Reading, MA: Wadsworth Publishing Co.
Carpenter, W. C. & Hoffman, M. E. (1995). Training backprop neural networks. AI Expert, 10(3), 30-33.
Caudill, M. (1988). Neural networks primer part III. AI Expert, 5(6), 53-59.
Caudill, M. (1991). Neural network training tips and techniques. AI Expert, 6(1), 56-61.
Chatterjee, S. & Price, B. J. (1992). Regression Analysis by Example. New York: John Wiley & Sons.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7), 571-82.
Deng, Pi-Sheng. (1993). Automatic knowledge acquisition and refinement for decision support: A connectionist inductive inference model. Decision Sciences, 24(2), 371-393.
Denton, J. W., Hung, M. S., & Osyk, B. A. (1990). A neural network approach to the classification problem. Expert Systems with Applications, 1, 417-424.
Efron, Bradley. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia, PA: Society for Applied Mathematics.
Efron, Bradley & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall, 237-257.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Gallant, Stephen L. (1993). Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press.
Garson, G. David. (1991). Interpreting neural-network connection weights. AI Expert, 6(4), 47-51.
Goodman, R. M., Higgins, C. M. & Miller, J. M. (1992). Rule-based neural networks for classification and probability estimation. Neural Computation, 4, 781-804.
Hanson, J. V., McDonald, James B. & Stice, J. D. (1992). Artificial intelligence and generalized qualitative-response models: An empirical test on two audit decision-making domains. Decision Sciences, 23(3), 704-723.
Hart, A. (1986). The role of induction in knowledge acquisition. Expert Systems, 2, 24-28.
Hecht-Nielsen, R. (1989). Neurocomputing. Reading, MA: Addison-Wesley.
Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics, 32, 1-49.
James, M. (1985). Classification Algorithms. New York: John Wiley & Sons.
Klimasauskas, C. C. (1988). NeuralWorks: An Introduction to Neural Computing. Sewickley, PA: NeuralWare, Inc.
Knight, K. (1990). Connectionist ideas and algorithms. Communications of the ACM, 33(11), 59-74.
Krzanowski, W. J. (1988). Principles of Multivariate Analysis: A User's Perspective. Oxford: Clarendon Press.
LeCun, Y., Denker, J. S. & Solla, S. A. (1990). Optimal brain damage. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems 2, pp. 598-605. San Mateo, CA: Morgan Kaufmann Publishers.
Meehl, P. E. (1954). Clinical Versus Statistical Prediction: A Theoretical Analysis and Review of the Evidence. Minneapolis, MN: U. of Minnesota Press.
Nath, R. & Rajagopalan, B. (1994). Determining the saliency of input variables in neural network classifiers. 1994 Proceedings of the Decision Sciences Institute, pp. 1294-1296. Honolulu, Hawaii.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Volume 1, pp. 318-362. Cambridge, MA: The MIT Press.
Salchenberger, L. M., Cinar, E. M. & Lash, N. A. (1992). Neural networks: A new tool for predicting thrift failures. Decision Sciences, 23(4), 899-916.
Shavlik, J. W., Mooney, R. J. & Towell, G. G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6, 111-143.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111-48.
Tam, K. Y. & Kiang, M. Y. (1992). Managerial applications of neural networks: The case of bank failure predictions. Management Science, 38(7), 926-947.
Urban Hjorth, J. S. (1994). Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap. New York: Chapman & Hall.
Wasserman, P. D. (1989). Neural Computing: Theory and Practice. New York: Van Nostrand.
White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models. Journal of the American Statistical Association, 84(408), 1003-1013.
Whitley, L., Dominic, S. & Das, R. (1991). Genetic reinforcement learning with multilayer neural networks. The Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 101-108. San Mateo, CA: Morgan Kaufmann.
Yoon, Y. & Peterson, L. (1992). Artificial neural networks: An emerging new technique. Data Base, 23(1), 55-57.

APPENDIX A

Computational Example

An example of the computation of the proportional contributions of the input variables to the output signal using the Garson methodology. The example is based on the simple three-variable model that appears in the commercial loan data variable selection process shown in Table 1.

Step 1: The network change weights were all zero through the fourth decimal place.

Step 2: Hidden layer computations.

Original hidden layer weights
             H1        H2        H3
WCTO    -0.7678   -5.4218   -1.2469
ROA     -0.7911   -1.9868    4.2391
DEBT    -1.4785   -2.5392    2.8626

Absolute value of weights
             H1        H2        H3
WCTO     0.7678    5.4218    1.2469
ROA      0.7911    1.9868    4.2391
DEBT     1.4785    2.5392    2.8626
sum      3.0374    9.9478    8.3486

Step 3: Output layer computations.

Original output layer weights
             H1        H2        H3
OUT1    -0.7992    1.2531   -1.3356
OUT2     0.7992   -1.2351    1.3356

Absolute value of weights
             H1        H2        H3
OUT1     0.7992    1.2531    1.3356
OUT2     0.7992    1.2351    1.3356
sum      1.5984    2.4882    2.6712

Step 4: Hidden layer conversion to proportions (each absolute hidden layer weight divided by its column sum from Step 2).

             H1        H2        H3
WCTO     0.2528    0.5450    0.1493
ROA      0.2604    0.1997    0.5078
DEBT     0.4868    0.2553    0.3429
sum      1.0000    1.0000    1.0000

Step 5: Proportion the output layer signal among the hidden layer inputs (each Step 4 proportion multiplied by the corresponding column sum from Step 3).

             H1        H2        H3       sum
WCTO     0.4041    1.3561    0.3990    2.1591
ROA      0.4163    0.4970    1.3563    2.2696
DEBT     0.7780    0.6351    0.9159    2.3291
sum      1.5984    2.4882    2.6712    6.7578

Step 6: Produce final proportions (each row sum from Step 5 divided by the grand total).

WCTO     0.3195
ROA      0.3358
DEBT     0.3447
sum      1.0000
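The six steps above can be reproduced with a short script. The following is a minimal sketch in plain Python, assuming the weight layout used in this appendix (one row per input variable for the hidden layer, one row per output unit for the output layer); the function name `garson_importance` and the constant names are illustrative and are not taken from the original study.

```python
def garson_importance(hidden_weights, output_weights):
    """Proportional contribution of each input variable to the output signal.

    hidden_weights[i][j]: weight from input variable i to hidden unit j.
    output_weights[k][j]: weight from hidden unit j to output unit k.
    """
    n_inputs = len(hidden_weights)
    n_hidden = len(hidden_weights[0])

    # Step 2: column sums of the absolute hidden layer weights.
    hidden_sums = [
        sum(abs(hidden_weights[i][j]) for i in range(n_inputs))
        for j in range(n_hidden)
    ]
    # Step 3: column sums of the absolute output layer weights.
    output_sums = [
        sum(abs(row[j]) for row in output_weights) for j in range(n_hidden)
    ]
    # Steps 4 and 5: convert hidden weights to proportions, share out each
    # hidden unit's output signal, and sum across hidden units per input.
    contributions = [
        sum(
            abs(hidden_weights[i][j]) / hidden_sums[j] * output_sums[j]
            for j in range(n_hidden)
        )
        for i in range(n_inputs)
    ]
    # Step 6: normalize so the contributions sum to one.
    total = sum(contributions)
    return [c / total for c in contributions]


# Weights from the worked example (rows: WCTO, ROA, DEBT; columns: H1-H3).
HIDDEN = [
    [-0.7678, -5.4218, -1.2469],
    [-0.7911, -1.9868, 4.2391],
    [-1.4785, -2.5392, 2.8626],
]
# Rows: OUT1, OUT2; columns: H1-H3.
OUTPUT = [
    [-0.7992, 1.2531, -1.3356],
    [0.7992, -1.2351, 1.3356],
]

importance = garson_importance(HIDDEN, OUTPUT)
for name, value in zip(["WCTO", "ROA", "DEBT"], importance):
    print(f"{name:5s} {value:.4f}")
```

Because the script carries full precision through every step, its results agree with the tabulated proportions to about four decimal places rather than digit for digit with each rounded intermediate value.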

APPENDIX B

Financial Variables and Summary Statistics

                                                  Approved                  Disapproved
Data Item  Definition                    N      Mean    Std Dev     N      Mean    Std Dev
ACP        Average Collection Period    26     65.81     106.31    16    122.10     377.49
AMOUNT     Amount of Loan ($000)        26    798.05    1064.68    16    837.68    1281.56
CURRENT    Current Ratio                26      2.45       3.03    16      6.83      16.14
DEBT       Debt Ratio                   26      0.50       0.32    16      0.77       0.30
FATO       Fixed Asset Turnover         26     34.63      45.60    15     29.27      33.45
ITO        Inventory Turnover           24     14.51      21.13    12      8.86       9.79
LOANRATE   Loan Rating                  26      2.84       0.54    16      2.93       0.44
LOANTERM   Term of Loan (Months)        26      9.38       3.98    15     18.06      21.08
MARGIN     Profit Margin                26     -0.08       0.50    16     -0.13       0.64
NETWORTH   Networth ($000)              26   4960.75   15880.07    16    645.63    1066.56
QUICK      Quick Ratio                  26      1.57       3.15    16      6.26      16.35
RECTO      Receivables Turnover         26     25.03      33.04    15     37.96      74.92
RENEWAL    Renewal Loan (Y/N)           26      0.65       0.48    16      0.25       0.44
ROA        Return on Assets             26      0.01       0.17    16      0.07       0.20
ROE        Return on Equity             26      0.26       0.45    16     -2.05       6.60
WC         Net Working Capital          26   1283.83    3095.57    16    290.35     661.53
WCTO       Working Capital Turnover     26     -7.09      80.53    16     25.79      44.79

Note: Variables with an N < 26 in the Approved group or < 16 in the Disapproved group indicate missing values.


APPENDIX C

Checking Account Variables and Summary Statistics

                                                                             Paid                Returned Unpaid
Data Item       Definition                                            N     Mean  Std Dev     N     Mean  Std Dev
ODAMT           Overdraft Checking Amount                           145    27.58    74.96   195    29.55    42.21
TYPEACT         10 = Regular                                        145     0.67     0.47   195     0.46     0.50
                01 = Student                                        145     0.33     0.47   195     0.54     0.50
CREDIT          100 = Good Rating                                   145     0.51     0.50   195     0.01     0.07
                010 = Unknown Rating                                145     0.48     0.50   195     0.70     0.46
                001 = Bad Rating                                    145     0.01     0.08   195     0.29     0.45
MULTCKS         10 = Multiple Checks                                145     0.12     0.32   195     0.33     0.47
                01 = Single Check                                   145     0.88     0.32   195     0.67     0.47
GCREDITxODAMT   Interaction of Good Credit and Overdraft Amount     145    26.10    75.42   195     0.22     3.00
UKCREDITxODAMT  Interaction of Unknown Credit and Overdraft Amount  145     1.46     2.68   195    23.31    42.79
BCREDITxODAMT   Interaction of Bad Credit and Overdraft Amount      145     0.01     0.08   195     6.03    15.34