Reliable classification using neural networks: a genetic algorithm and backpropagation comparison




Decision Support Systems 30 (2000) 11–22
www.elsevier.com/locate/dsw

Randall S. Sexton a, Robert E. Dorsey b

a Computer Information Systems, Southwest Missouri State University, Springfield, MO 65804, USA
b Department of Economics and Finance, University of Mississippi, University, MS 38677, USA

Accepted 10 May 2000

Abstract

Although the genetic algorithm (GA) has been shown to be a superior neural network (NN) training method on computer-generated problems, its performance on real-world classification data sets is untested. To gain confidence that this alternative training technique is suitable for classification problems, a collection of 10 benchmark real-world data sets was used in an extensive Monte Carlo study that compares backpropagation (BP) with the GA for NN training. We find that the GA reliably outperforms the commonly used BP algorithm as an alternative NN training technique. While this does not prove that the GA will always dominate BP, this demonstrated reliability on real-world problems enables managers to use NNs trained with GAs as decision support tools with a greater degree of confidence. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Neural networks; Genetic algorithm; Backpropagation; Decision support; Classification; Artificial intelligence

1. Introduction

Important tools in modern decision-making, whether in business or any other field, include those that allow the decision-maker to assign an object to an appropriate group, or classification. One such tool that has demonstrated promising potential is the artificial neural network (NN).


Numerous studies have compared the classification performance of NNs with traditional statistical techniques and provided evidence that NNs outperform traditional techniques in some cases [2,12,20,32,38,44,46,54]. This is not surprising, since traditional statistical techniques represent a subset of the models that can be approximated by NNs. What is surprising is that NNs in these studies do not uniformly dominate the traditional techniques. For example, Hensen et al. [12] found that Logit did better than NNs when applied to a data set forecasting business failures, and Patuwo et al. [32] compared NNs on two-group classification problems which included:
1. variables having a bivariate normal distribution with equal variance–covariance matrices across the groups,




2. variables with a bivariate normal distribution but with unequal variance–covariance matrices, and
3. variables with a bi-exponential distribution.
They found that NNs did not always dominate the classical methods. Since Refs. [10] and [15] proved that NNs have the ability to approximate any function to any desired degree of accuracy, it is somewhat troubling that the NNs in many classification studies were not able to outperform, or at least equal, the traditional techniques in all cases. In order for the NN to be used effectively as a management tool for classification problems, managers must have confidence in its reliability. Unlike traditional statistical techniques, which will achieve the same solution each time they are applied to a particular set of data, search algorithms generally report different solutions depending on their starting values. A manager cannot rely on a NN unless the solution reported is consistently at least as good as the next best alternative. Since the NN has the theoretical capability to dominate traditional statistical methods, any failure to achieve superior performance must be due to the training algorithm. In particular, this inability to reliably outperform standard techniques may be due to the well-known limitations of the backpropagation (BP) training algorithm. To explore this possibility, we directly compare BP and a global search algorithm, the genetic algorithm (GA), as search techniques for optimizing NNs. The Monte Carlo comparison consists of 10 real-world benchmark classification problems generated by an independent source for the sole purpose of algorithm comparisons. The comparisons are based on the algorithms' ability to find superior classification solutions. We find that by using an appropriate global search technique, such as the GA, many, if not all, of the limitations of gradient search techniques can be overcome. The next two sections provide a brief discussion of BP and the GA, followed by the experimental design, results, and conclusions.

2. Backpropagation

The original development of the BP training algorithm is generally credited to Refs. [21,31,52].

BP is currently the most widely used search technique for training NNs. Rumelhart et al. [36,37] popularized the use of BP (the generalized delta rule) for learning internal representations in NNs. Rumelhart et al. [36,37] also made two important observations regarding the limitations of the BP algorithm, namely, its slowness in convergence and its inability to escape local optima. There have been a variety of well-known attempts to improve upon the original BP algorithm. Although BP is a gradient search algorithm, and thus inherently a local search technique, it is still a very popular and successful tool. A search for articles referencing NNs as applied to business problems shows that BP (or some variation) is by far the most popular method of optimizing NNs. However, since it has been shown to have limitations resulting in inconsistent and unpredictable performance [2,16,19,25,26,28,36,45,48–50,53], its reliability as a management decision tool is questionable. When using gradient search techniques such as BP, the manager is faced with certain well-recognized problems, such as escaping local optima. Since virtually all BP algorithms used for training NNs initialize the starting weights (or point in n-dimensional space) randomly, there is a high probability that the starting point is located in a local valley. BP will then generally converge on a local solution. In order to prevent this, researchers have modified the basic algorithm to try to escape local optima and find the global solution. Numerous modifications have been implemented to overcome this problem, including differential scaling (changing the learning rate and momentum values during training) [3,13,22,33,39,41], error metric modification (changing the objective function to something other than the normal SSE or RMSE) [9,14,29,43,47], transfer function modification (changing the transfer function to something other than the standard sigmoid function) [18,24,39,42,51], modifications to the architecture (changing the architecture of the NN to aid the optimization) [1,17,32], and non-linear optimization techniques [23,30]. Many researchers have also recommended that multiple random starting points be used. Unfortunately, there is no clear solution to this problem. Even when starting from multiple points, there is no guideline for how many starting points are necessary.


Without a clear protocol to ensure a global solution, managers cannot rely on any given solution. When using popular commercial NN software packages that incorporate BP, such as NeuralWorks Professional II/Plus by NeuralWare, user-defined parameter adjustments must be selected. A manager must choose a step size, momentum value, learning rule (variation of BP), normalization technique, random seed, transfer function, and network architecture in order to find the best combination for solving the particular problem. Although there are many parameter settings for this type of BP strategy, it is the strategy of choice of most NN researchers. Since even simple functions can have multiple local solutions, and assuming the function is unknown, finding the correct combination of parameter settings to obtain a global solution from a random starting point is difficult and is based primarily on chance. Since most of the problems associated with BP are due to its gradient nature, it seems plausible that using global search techniques that are not entirely dependent on derivatives could eliminate many of these problems.
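BP's core update, as described above, moves the weights a step against the error gradient and adds a momentum term that reuses a fraction of the previous weight change. The following minimal sketch illustrates that generic rule on a toy error surface; the learning rate, momentum value, and objective shown are illustrative placeholders rather than settings from this study.

```python
import numpy as np

def momentum_update(weights, gradient, velocity, learning_rate=0.1, momentum=0.9):
    """One generalized-delta-rule style step: move against the error gradient,
    blended with a fraction of the previous weight change (the momentum term)."""
    velocity = momentum * velocity - learning_rate * gradient
    return weights + velocity, velocity

# Toy usage on the error surface E(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.5, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_update(w, 2 * w, v)
print(w)  # close to the (global) minimum at the origin
```

On a surface with many local valleys, the same rule simply settles into whichever valley the random starting point happens to fall in, which is exactly the reliability problem discussed above.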

3. The genetic algorithm

The GA is a global search procedure that searches from one population of points to another. As the algorithm continuously samples the parameter space, the search is directed toward the area of the best solution found so far. In our implementation, this is accomplished in two ways. First, the best solution found so far is periodically re-injected into the population as a candidate solution and, second, the search space is bounded around the best solution so far and permitted to shrink as the number of generations increases. This algorithm has been shown to perform exceedingly well in obtaining global solutions for difficult non-linear functions [5,6]. The application of the GA to one particularly complex non-linear function, the NN, has also been shown to work well [8]. Sexton et al. [40] demonstrate that the GA significantly outperforms BP in optimizing NNs for a set of computer-generated data. This paper examines real-world data sets in order to determine whether or not a global search algorithm such as the GA offers managers a more reliable tool for applying the NN to classification problems.


A formal description of the algorithm is provided in Ref. [5]. The GA is used by first choosing an objective function for optimizing the network, such as minimization of the sum of squared errors or the sum of absolute errors. Unlike BP, the objective function need not be differentiable or even continuous. Using the chosen objective function, each candidate point (or solution set of weights) in the initial population of randomly chosen starting points is used to evaluate the objective function. The objective function values are then used to assign probabilities to each of the points in the population. For minimization, as in the case of the sum of squared errors, the highest probability is assigned to the point with the lowest objective function value. Once all points have been assigned a probability, a new population of points is drawn from the original population with replacement. The points are chosen randomly, with the probability of selection equal to the assigned probability value. Thus, the points generating the lowest sum of squared errors are the most likely to be represented in the new population. The points comprising this new population are then randomly paired for the crossover operation. Each point is a vector (string) of n parameters (weights). A position along the vectors is randomly selected for each pair of points and the preceding parameters are switched between the two points. This crossover operation results in each new point having parameters from both parent points. Finally, each weight has a small probability of being replaced with a value randomly chosen from the parameter space. This operation is referred to as mutation. Mutation enhances the GA by intermittently injecting a random point, in order to search the entire parameter space. This allows the GA to escape from local optima if the new point generated is a better solution than has previously been found, thus providing a more robust solution. The resulting set of points becomes the new population, and the process repeats until convergence. Since this method simultaneously searches in many directions, the probability of finding a global optimum greatly increases. The algorithm's similarity to natural selection inspires its name. As the GA progresses through generations, the parameters most favorable in optimizing the objective function will reproduce and thrive in future generations, while poorly performing parameters die out, as in the saying "survival of the fittest".



Research using the GA for optimization has demonstrated its strong potential for obtaining globally optimal solutions [4,11]. A reliable rule for stopping the search does not currently exist. With derivative search techniques, one can demonstrate convergence to a local solution by obtaining zero-valued derivatives, but there is currently no test to determine whether or not a global solution has been found. Two papers that have addressed this problem are Refs. [7] and [55]. Veal [55] uses a random draw approach with limited success, and Dorsey and Mayer [7] use specification tests and a modification of the Veal technique with improved success. However, until such a test has been developed for the NN, no stopping rule will be totally reliable.
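To make the selection, crossover, and mutation steps above concrete, here is a deliberately simplified real-valued GA for minimizing an arbitrary sum-of-squared-errors objective over a weight vector. It is a sketch, not the Dorsey–Mayer algorithm of Ref. [5]: it keeps the re-injection of the best point found so far but omits, for example, the shrinking search bounds, and the population size, bounds, and mutation rate shown are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_minimize(objective, n_params, pop_size=12, generations=200,
                     bound=10.0, mutation_rate=0.01):
    """Minimal real-valued GA: selection proportional to (inverted) error,
    single-point crossover on randomly paired parents, random mutation,
    and re-injection of the best point found so far."""
    pop = rng.uniform(-bound, bound, size=(pop_size, n_params))
    best, best_err = pop[0].copy(), np.inf
    for _ in range(generations):
        errors = np.array([objective(p) for p in pop])
        if errors.min() < best_err:
            best_err, best = errors.min(), pop[errors.argmin()].copy()
        # Lower error -> higher probability of being drawn into the new population.
        fitness = errors.max() - errors + 1e-12
        probs = fitness / fitness.sum()
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        # Single-point crossover: swap the leading weights of each pair.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_params)
            children[i, :cut] = parents[i + 1, :cut]
            children[i + 1, :cut] = parents[i, :cut]
        # Mutation: occasionally replace a weight with a fresh random draw.
        mask = rng.random(children.shape) < mutation_rate
        children[mask] = rng.uniform(-bound, bound, size=int(mask.sum()))
        children[0] = best            # re-inject the best solution so far
        pop = children
    return best, best_err

# Toy usage: minimize a simple quadratic "error" in five weights.
w, err = genetic_minimize(lambda p: float(np.sum(p ** 2)), n_params=5)
print(err)   # best (lowest) sum of squared errors found; it never rises with more generations
```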

4. Experimental design

In a recent study, Prechelt [34] evaluated the methodologies used in a number of research papers that compare NN algorithms and found that most studies present performance results for only a small number of problems. Prechelt also concluded that these studies typically used synthetic problems, as opposed to real-world problems. He claims that synthetic problems do not adequately test the algorithms. Another problem identified in Ref. [34] was the lack of standard methods for developing data sets. For example, he claims that while authors might use the same problems as past researchers, encoding the data in different formats may negate valid comparisons. To propose a solution to these problems, Prechelt [35] collected real data sets and formatted them in a standard way. He also developed a set of rules and conventions for applying them in learning algorithm evaluations. For these reasons, this benchmark collection with its rules and conventions (PROBEN1) is used for this comparison. The focus of our study is classification problems, and 11 of the 15 data sets included in PROBEN1 are classification problems. One of these, the mushroom problem, was eliminated from this study because of its simplicity.

The first data set for each of the following problems was used in this study. The following briefly describes the 10 classification problems.
Cancer — This problem requires the decision maker to correctly diagnose breast lumps as either benign or malignant based on data from automated microscopic examination of cells collected by needle aspiration. The data set includes nine inputs and two outputs. The exemplars are split with 350 for training, 175 for validation, and 174 for testing, totaling 699 exemplars. All inputs are continuous variables and 65.5% of the examples are benign. The data set was originally generated at hospitals of the University of Wisconsin Madison by Dr. William H. Wolberg. An application of this data set can also be seen in Ref. [27].
Card — The second data set is used to classify whether or not a bank (or similar institution) granted a credit card to a customer. This data set includes 51 inputs and 2 outputs. The exemplars are split with 345 for training, 173 for validation, and 172 for testing, totaling 690 exemplars. The inputs are a mix of continuous and integer variables, with 44% of the examples positive. This data set was created from the "crx" data of the "credit screening" problem data set from the UCI repository of machine learning databases.
Diabetes — The third data set is used for diagnosing diabetes among Pima Indians. This data set includes eight inputs and two outputs. The exemplars are split with 384 for training, 192 for validation, and 192 for testing, totaling 768 exemplars. All inputs are continuous, and 65.1% of the examples are negative for diabetes. This data set was created from the "Pima Indians diabetes" problem data set from the UCI repository of machine learning databases.
Gene — This data set is used for detecting intron/exon boundaries in nucleotide sequences. The three possible output classes are an intron/exon boundary (a donor), an exon/intron boundary (an acceptor), or neither. There are 120 inputs made up from 60 DNA sequence elements, each a four-valued nominal attribute that was encoded using two binary inputs. The exemplars are split with 1588 for training, 794 for validation, and 793 for testing, totaling 3175 exemplars.


There are 25% donors and 25% acceptors in the data set. This data set ("splice junction") was also created from the UCI repository of machine learning databases.
Glass — The glass data set is used for separating glass splinters in criminal investigations into six classes: float-processed or non-float-processed building windows, vehicle windows, containers, tableware, or headlamps. This data set is made up of nine inputs and six outputs. All inputs are continuous, with 70, 76, 17, 13, 9, and 29 instances of each class, respectively. The exemplars were split with 107 for training, 54 for validation, and 53 for testing, totaling 214 exemplars. This data set was created from the "glass" problem data set from the UCI repository of machine learning databases.
Heart — The sixth and seventh data sets deal with the prediction of heart disease. A yes or no decision is made as to whether or not at least one of four major vessels is reduced in diameter by more than 50%. The first data set (heart) includes 35 inputs and 2 outputs. The exemplars are split with 460 for training, 230 for validation, and 230 for testing, totaling 920 exemplars. Forty-five percent of the patients in this data set had no vessel reduction. Since the "heart" data set included several missing values, a second data set, called "heartc", was constructed from the cleanest part of the preceding data. Its exemplars were split with 152 for training, 76 for validation, and 75 for testing, totaling 303 exemplars. This data set has 54% of exemplars with no vessel reduction. The original data were obtained from (1) the Hungarian Institute of Cardiology, Budapest (Andras Janosi, MD); (2) University Hospital, Zurich, Switzerland (William Steinbrun, MD); (3) University Hospital, Basel, Switzerland (Matthias Pfisterer, MD); and (4) the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation (Robert Detrano, MD, PhD).
Horse — The horse data set is used for predicting the fate of a horse that has colic. The three classes are: will survive, will die, or will be euthanized. There are 58 inputs and 3 outputs in this data set. The exemplars are split with 182 for training, 91 for validation, and 91 for testing, totaling 364 exemplars. The distribution of classes is 62% survived, 24% died, and 14% were euthanized. This data set was created from the "horse colic" problem database at the UCI repository of machine learning databases.


Soybean — The soybean data set was constructed to identify 19 different diseases of soybeans. The inputs were based on the description of the bean and plant. The data set includes 35 inputs and 19 outputs. The exemplars are split with 342 for training, 171 for validation, and 170 for testing. This data set was created from the "soybean large" problem data set from the UCI repository of machine learning databases.
Thyroid — The last problem deals with diagnosing a patient's thyroid function as being overfunction, normal function, or underfunction. There are 21 inputs derived from patient query data and patient examination data. The exemplars are split with 3600 for training, 1800 for validation, and 1800 for testing. The class distributions are 5.1%, 92.6%, and 2.3%, respectively.
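PROBEN1 supplies each problem already split, in a fixed order, into the training, validation, and test partitions listed above (roughly 50%, 25%, and 25% of the exemplars). As a hedged illustration only, the helper below carves such partitions out of an already-encoded array of exemplars; the exact file format and counts per problem are those documented in Ref. [35], and the cancer counts used in the example are simply the ones quoted above.

```python
import numpy as np

def proben1_partition(exemplars, n_train, n_valid, n_test):
    """Split an ordered array of exemplars into fixed training, validation,
    and test partitions, taken in the order the exemplars appear."""
    assert len(exemplars) == n_train + n_valid + n_test
    train = exemplars[:n_train]
    valid = exemplars[n_train:n_train + n_valid]
    test = exemplars[n_train + n_valid:]
    return train, valid, test

# Example with the cancer counts quoted above: 699 exemplars, 9 inputs + 2 outputs.
data = np.zeros((699, 11))                       # stand-in for the encoded data
train, valid, test = proben1_partition(data, 350, 175, 174)
print(len(train), len(valid), len(test))         # 350 175 174
```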

4.1. Summary of data

Each of these data sets has been used in past research, but without standard methods of comparison. By using these real data problems in a consistent manner, validity, reproducibility, and comparability should all be enhanced for future research. Detailed descriptions of how the data sets were constructed, and further references about the problems, can be found in Ref. [35]. The following sections describe how BP and the GA are used for this comparison.

4.2. Training

For both algorithms, three network architectures are used for all 10 classification problems. The three architectures for each problem included one hidden layer with 3, 6, and 12 hidden nodes. Ten replications with each algorithm and architecture were conducted for each problem. Replications differed by changing the random seed for drawing the initial solutions. Thus, there were a total of 30 trained networks for each problem and algorithm. The transfer function in all cases was the standard sigmoid.
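All of the trained networks therefore share the same simple topology: one hidden layer of 3, 6, or 12 nodes with sigmoid transfer functions. A minimal sketch of the forward pass of such a network is shown below (bias weights are included here as an assumption; the paper does not spell out that detail). These are the weights that both BP and the GA must search over.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer feedforward pass with sigmoid transfer functions
    on both the hidden and the output layer."""
    hidden = sigmoid(x @ w_hidden + b_hidden)
    return sigmoid(hidden @ w_out + b_out)

# Example shapes for the cancer problem (9 inputs, 2 outputs) with 6 hidden nodes.
n_in, n_hidden, n_out = 9, 6, 2
rng = np.random.default_rng(0)
weights = (rng.normal(size=(n_in, n_hidden)), rng.normal(size=n_hidden),
           rng.normal(size=(n_hidden, n_out)), rng.normal(size=n_out))
outputs = forward(rng.normal(size=(5, n_in)), *weights)
print(outputs.shape)  # (5, 2): one row of outputs per exemplar
```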



Both the GA and BP algorithms were PC based and were implemented on a 200 MHz Pentium Pro machine. The parameters used for the GA are those recommended in Ref. [5]. No equivalent paper exists for BP, so parameter settings were selected based on recommendations provided with the software, as described in the following section.

4.3. Training with backpropagation

As mentioned in the descriptions of the data sets, all problems were split into training, validation, and testing files. The training and validation exemplars for each problem can be used in the search for the best solution. This can be done in two ways. The first method combines the training and validation sets for training. The second uses the training portion for the initial training, while the validation portion is used to determine when to stop training. For example, a BP network will search for a solution using the training data, but once the error stops decreasing on the validation set, training is discontinued. Some researchers feel this step is necessary to avoid overfitting the particular function being estimated (see, for example, Ref. [35]). What often occurs at this stage is that the algorithm begins to converge on a local solution instead of a global solution. If the correct objective function is chosen and a global solution is found, then there can be no such problem. The problem arises when a gradient algorithm is converging on a local solution that is not global.

When this occurs, the solution can move from good out-of-sample predictions to poorer ones as in-sample results improve. Since BP converges locally, this type of NN training (using a validation sample) seems necessary. For this study, the BP networks use the training set for the initial training while intermittently testing the validation set (every 10,000 epochs) to indicate when to stop this local convergence. Once 10 consecutive validation errors cease to decrease, or a maximum of 10 million epochs is reached, training is terminated. The NeuralWorks Professional II/Plus commercial neural network software by NeuralWare was used for this study. In training each problem with BP, many factors could have been manipulated in an effort to find the best configuration for optimizing each problem. These include the learning rate, momentum, weight update size, the variation of the BP algorithm, the variation of transfer functions, the normalization of the data, and the Logicon algorithm. Different combinations of the learning rate and momentum are used to try to find the combination that will allow the solution to escape local minima but not skip over the global solution. An epoch is defined as one complete pass through the data set. For this study, the learning rate was set at 1.0 and the momentum factor at 0.9. Both of these user-defined parameters were systematically reduced by a factor of 0.5 every 10,000 epochs. This was done in order to keep the solution from oscillating and thereby help it converge upon a solution.

Table 1
Average classification error and squared error percentage

             Classification error percentage                 Squared error percentage
             BP                    GA                        BP                    GA
Problem      Mean      Std. dev.   Mean      Std. dev.       Mean      Std. dev.   Mean      Std. dev.
Cancer        3.0095    1.1987      2.2667    0.3428          2.2106    0.8305      1.7390    0.1935
Card         18.7791    3.0739     15.0194    0.6474         12.2752    1.0174     10.3810    0.4670
Diabetes     29.6181    2.2001     26.2326    1.2767         18.0818    0.6042     17.7509    1.8426
Gene         34.1992    3.1748     12.0429    1.7464         14.4765    0.7081      8.9247    0.6688
Glass        49.8113    6.6143     33.2075    2.6877         10.5530    0.4517      9.6368    0.4353
Heart        22.6667    0.8350     19.8841    1.0985         16.5733    0.6116     16.3125    0.3420
Heartc       24.8889    1.7134     20.4444    1.9746         18.2453    0.7126     16.6635    0.7633
Horse        28.7179    3.3620     23.9927    1.9118         14.7548    1.6252     11.8495    0.4706
Soybean      58.8235    5.0660     39.7451    1.3011          3.5337    0.1976      1.3555    0.2097
Thyroid       7.0759    0.3155      3.7815    0.9901          4.3475    0.3480      2.0094    0.3049



Fig. 1.

While there are rules of thumb for setting these values, there is no set standard upon which a researcher can draw for deriving optimum configurations for training with BP. Guidelines suggested by the NeuralWorks manual were used in selecting the values. The best network for each problem was chosen on the basis of the best validation set error.
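Putting this section together, the BP stopping and decay policy described above can be sketched as a simple driver loop. The train_one_epoch and validation_error callbacks below are hypothetical stand-ins for the NeuralWare routines actually used; only the schedule follows the description above (check the validation set every 10,000 epochs, halve the learning rate and momentum on the same schedule, and stop after 10 consecutive non-improving checks or 10 million epochs).

```python
def train_bp(train_one_epoch, validation_error, max_epochs=10_000_000,
             check_every=10_000, patience=10, lr=1.0, momentum=0.9):
    """Hypothetical driver loop mirroring the BP protocol described above."""
    best_val, bad_checks, epoch = float("inf"), 0, 0
    while epoch < max_epochs and bad_checks < patience:
        for _ in range(check_every):
            train_one_epoch(lr, momentum)        # one full pass over the training set
        epoch += check_every
        lr, momentum = 0.5 * lr, 0.5 * momentum  # damp oscillation as training proceeds
        val = validation_error()
        if val < best_val:
            best_val, bad_checks = val, 0
        else:
            bad_checks += 1                      # 10 in a row terminates training
    return best_val

# Toy usage with stand-in callbacks (no real network behind them).
errs = iter([0.50, 0.40] + [0.40 + 0.01 * k for k in range(1, 11)])
print(train_bp(lambda lr, m: None, lambda: next(errs), check_every=1))  # 0.4
```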

4.4. Training with the genetic algorithm

Since the GA searches globally for a solution, the validation set is not needed to check for overfitting. An argument could be made that, since BP was using the validation set in its training process, not using the validation set for the GA places it at a disadvantage, but the validation set is simply not necessary for a global search. The GA as presented in Ref. [5] was used for this study. The only change made to the original recommended parameter settings was the use of 12 rather than 20 strings in the initial population. The reduction in strings is attributable to the lack of significant improvement in solutions relative to the added computation time. Termination was set arbitrarily at 3000 generations, or 36,000 epochs, for each of the data sets. Further training would have resulted in smaller errors but was unnecessary for this study.

Table 2
Ranking of algorithms for best average classification error percentage

Problem      First      Second     Third      Fourth     Fifth      Sixth
Cancer       GA 12      GA* 3      GA* 6      BP 6       BP 12      BP 3
Card         GA 3       GA 6       GA 12      BP 12      BP 6       BP 3
Diabetes     GA 3       GA 6       GA 12      BP 3       BP 12      BP 6
Gene         GA 3       GA 6       GA 12      BP 12      BP 6       BP 3
Glass        GA 3       GA 6       GA 12      BP 12      BP 6       BP 3
Heart        GA 3       GA 6       GA 12      BP* 12     BP* 3      BP 6
Heartc       GA 12      GA 6       GA 3       BP 12      BP 6       BP 3
Horse        GA 6       GA 12      GA 3       BP 6       BP 12      BP 3
Soybean      GA 12      GA 6       BP 12      GA 3       BP 6       BP 3
Thyroid      GA 12      GA 3       GA 6       BP 6       BP 12      BP 3

Entries give the algorithm followed by the number of hidden nodes; * marks a tie of classification error percentage for that particular problem.



Table 3
Best classification error and squared error percentage

             Classification error percentage      Squared error percentage
Problem      BP          GA                       BP          GA
Cancer        1.1494      1.1494                   1.6620      1.4636
Card         15.1163     12.2093                  11.3287     10.2405
Diabetes     27.6042     23.4375                  17.6549     17.5024
Gene         25.3468     10.3405                  12.6646      7.0640
Glass        32.0755     28.3019                   8.7881      6.3879
Heart        21.7391     18.2609                  16.1394     15.3125
Heartc       24.0000     17.3333                  18.9575     14.1614
Horse        25.2747     19.7802                  14.4929     11.6417
Soybean      34.7059     18.8235                   2.4493      0.8420
Thyroid       6.6111      2.1111                   3.7958      1.2622

4.5. Error measure

Two error measures were used in evaluating the performance of both algorithms. The first is the percent classification error. This measure reports the percentage of incorrectly classified examples. The second measure is the squared error percentage, which is

E = 100 \, \frac{O_{\max} - O_{\min}}{N \cdot P} \sum_{p=1}^{P} \sum_{i=1}^{N} \left( O_{pi} - t_{pi} \right)^{2},    (1)

where O is the actual output value, t is the target output value, O_min and O_max are the minimum and maximum values of the actual output values in the problem representation, N is the number of output nodes of the network, and P is the number of exemplars in the data set considered.
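As a worked illustration of the two measures, the snippet below computes the classification error percentage, under the usual winner-takes-all reading (an exemplar counts as misclassified when the output node with the largest activation does not match the target class, which is an assumption here), and the squared error percentage of Eq. (1), with O_min and O_max taken as 0 and 1 for sigmoid outputs.

```python
import numpy as np

def classification_error_pct(outputs, targets):
    """Percentage of exemplars whose highest-activation output node does not
    match the target class (winner-takes-all interpretation)."""
    wrong = np.argmax(outputs, axis=1) != np.argmax(targets, axis=1)
    return 100.0 * wrong.mean()

def squared_error_pct(outputs, targets, o_min=0.0, o_max=1.0):
    """Squared error percentage of Eq. (1); o_min and o_max are the minimum and
    maximum output values of the problem representation (0 and 1 assumed here)."""
    P, N = outputs.shape
    return 100.0 * (o_max - o_min) / (N * P) * np.sum((outputs - targets) ** 2)

# Toy usage: 4 exemplars, 2 output nodes, one-of-two target coding.
targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
outputs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]], dtype=float)
print(classification_error_pct(outputs, targets))  # 25.0 (one wrong winner)
print(squared_error_pct(outputs, targets))         # 12.5
```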

5. Results

Since the ability of a NN to predict out-of-sample, rather than in-sample, is more relevant for these problems, the following discussion focuses on the algorithms' performance on the testing data for all problems. The validation error was used only to indicate which of the 10 replications for each algorithm and problem generated the best solution. In almost every case, over the 10 replications for each algorithm and the 3 architectures, the GA found 10 out of 10 superior classification errors compared to BP. The GA also found superior error percentages for all 10 problems. Table 1 shows the average of the 10 replications for each algorithm and problem across all architectures; this comparison is shown graphically in Fig. 1. The mean classification error percentage for the GA is significantly below the mean for BP at the 1% level for every problem. Not only is the mean lower, but the variation of the solutions is typically smaller for the GA.

Fig. 2.


A ranking of each algorithm and its different architectures was also tabulated to illustrate the average best-to-worst algorithm and architecture for each problem, shown in Table 2. The ranking was based on the classification error percentage, using the squared error percentage to break any ties between the solutions for each problem. It can be seen in Table 2 that the GA found the best three solutions for all but the soybean problem; for that problem, the GA found the first, second, and fourth best solutions. Tables 1 and 2 show how well each algorithm does on average over the 10 replications, giving an indication of the robustness of each algorithm. Tables 3 and 4, in contrast, show the best-found solution for each algorithm instead of the averages. As mentioned earlier, the best solution was determined by the smallest classification error percentage on the validation data set. Once the best weights were determined, the test set errors were calculated and compared. As can be seen in Table 3, the GA found the best solution for every problem with respect to classification error percentage, tying BP on the cancer problem. This information is provided graphically in Fig. 2. The table also shows that the GA found all 10 of the best solutions with respect to the squared error percentage. A ranking of each algorithm and its different architectures was also tabulated to illustrate the best-to-worst algorithm and architecture for each problem, shown in Table 4. The ranking was again based on classification error percentage, using the squared error percentage to break any ties. The GA found all 10 of the best solutions, 9 of the 10 second best solutions, and 8 of the 10 third best solutions.

Table 4
Ranking of algorithms for best classification error percentage

Problem      First      Second     Third      Fourth     Fifth      Sixth
Cancer       GA* 3      GA* 6      BP* 3      BP* 6      GAa 12     BPa 12
Card         GA 3       GA 12      GA 6       BP 3       BP 12      BP 6
Diabetes     GA 6       GA 3       GA 12      BPa 3      BPa 12     BP 6
Gene         GA 3       GA 6       GA 12      BP 12      BP 6       BP 3
Glass        GA* 3      GA* 6      BP 12      GA 12      BP 3       BP 6
Heart        GA 12      GA 6       GA 3       BP 6       BP* 3      BP* 12
Heartc       GA 3       GA 6       GA 12      BP 6       BP 12      BP 3
Horse        GA 12      GA 3       GA 6       BP 12      BP 3       BP 6
Soybean      GA 12      BP 12      GA 6       BP 6       GA 3       BP 3
Thyroid      GA 12      GA 6       GA 3       BP 3       BP* 12     BP* 6

Entries give the algorithm followed by the number of hidden nodes; * and a mark ties of classification error percentage for that particular problem.


Table 5
Average epochs and CPU time for all algorithms and architectures

Problem      BP epochs     GA epochs     BP time     GA time
Cancer       1,140,667     36,000          704         510
Card         1,814,333     36,000         1570         853
Diabetes       158,533     36,000          155         532
Gene           862,667     36,000         1356        9562
Glass          477,000     36,000          475         296
Heart        1,507,000     36,000         1222         554
Heartc         490,000     36,000          398         325
Horse          466,333     36,000          459        1070
Soybean        491,000     36,000          701       10248
Thyroid        204,667     36,000          162        9524

Comparing the performance of the best model found with each algorithm, a Wilcoxon sign test finds that the GA model's forecasts for out-of-sample data have a smaller error than the BP model's at the 1% level for each of the data sets except heartc and glass. The p-value was 0.05 for the heartc data set and 0.07 for the glass data set. A comparison of CPU time and the number of epochs trained is also tabulated in Table 5. As mentioned earlier, BP was allowed to train until the validation error ceased to decrease or a maximum of 10 million epochs was reached. The GA was terminated, before fully converging upon a solution, at 36,000 epochs. Although the GA trained for far fewer epochs than BP, it was much slower than BP per epoch of training. Even though the GA takes more time for each epoch of training, it found superior solutions in a shorter amount of time in 5 out of the 10 problems. The value of converging upon a poorer solution faster has to be weighed for the problem being estimated. In these cases, and in most other classification problems, the extra time needed for finding more consistent and predictable solutions is minimal.
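A comparable significance check can be run in a few lines. The sketch below applies SciPy's Wilcoxon signed-rank test to paired out-of-sample errors from two models; the paper reports a Wilcoxon sign test, so treat this as an analogous, not identical, procedure, and the error values here are made up rather than taken from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical paired out-of-sample errors for the two best models on the
# same test exemplars (illustrative numbers only).
bp_errors = rng.normal(loc=0.20, scale=0.05, size=100)
ga_errors = bp_errors - rng.normal(loc=0.03, scale=0.02, size=100)

# One-sided test of whether the BP errors are systematically larger.
stat, p_value = wilcoxon(bp_errors, ga_errors, alternative="greater")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```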

6. Conclusions

Although BP used the validation data for training and trained for many more epochs, it still performed worse than the GA on these classification problems. A critique of this study might be that the BP configuration used is limited to one set of user-defined parameter settings, for example, setting the initial learning rate to 1.0 and the momentum value to 0.9.



Changing any or all of the parameters available to users of the BP algorithm could possibly result in better performance. However, since there are no heuristics for setting these parameters and each BP NN is problem dependent, the problem facing the decision maker is which set to choose, and the recommendations provided with the software are reasonable choices. The GA as used in this study, on the other hand, has predefined parameter settings and can be readily applied to any given problem. Our findings indicate that the GA is not dependent upon the initial random weights for finding superior solutions and is more consistent and predictable in finding those solutions. This research has shown the GA to be a viable alternative for NN optimization that finds superior solutions in a more consistent and predictable manner than BP. Although in this study the GA actually used less time than the BP algorithm, it is often the case that BP is chosen because it converges rapidly. Managers must keep in mind, however, that solutions other than global solutions may lead to unreliable forecasts. This study, of course, does not prove that the GA will dominate for all problems, but it does indicate that the GA may enable managers to use NNs with more confidence that the reported solution is in fact the desired solution and, thus, allow the NN to become a powerful tool for managers.

References

[1] D. Alpsan, M. Towsey, O. Ozdamar, A.C. Tsoi, D.N. Ghista, Efficacy of modified backpropagation and optimization methods on a real-world medical problem, Neural Networks 8 (6) (1995) 945–962.
[2] N.P. Archer, S. Wang, Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems, Decision Sciences 24 (1) (1992) 60–75.
[3] J.R. Chen, P. Mars, Stepsize variation methods for accelerating the back propagation algorithm, Proceedings of the International Joint Conference on Neural Networks, Washington, DC 1 (1990) 601–604.
[4] K.A. De Jong, An analysis of the behavior of a class of genetic adaptive systems, Unpublished PhD Dissertation, University of Michigan, Dept. of Computer Science, 1975.
[5] R.E. Dorsey, W.J. Mayer, Optimization using genetic algorithms, in: J.D. Johnson, A.B. Whinston (Eds.), Advances in Artificial Intelligence in Economics, Finance, and Management, vol. 1, JAI Press, Greenwich, CT, 1994, pp. 69–91.
[6] R.E. Dorsey, W.J. Mayer, Genetic algorithms for estimation problems with multiple optima, non-differentiability, and other irregular features, Journal of Business and Economic Statistics 13 (1) (1995) 53–66.
[7] R.E. Dorsey, W.J. Mayer, Detection of spurious maxima through random draw tests and specification tests, Computational Economics, in preparation.
[8] R.E. Dorsey, J.D. Johnson, W.J. Mayer, A genetic algorithm for the training of feedforward neural networks, in: J.D. Johnson, A.B. Whinston (Eds.), Advances in Artificial Intelligence in Economics, Finance and Management, vol. 1, JAI Press, Greenwich, CT, 1994, pp. 93–111.
[9] M.A. Franzini, Speech recognition with back propagation, Proceedings of the IEEE/Ninth Annual Conference of the Engineering in Medicine and Biology Society, Boston, MA 9 (1987) 1702–1703.
[10] K.I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (3) (1989) 183–192.
[11] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[12] J.V. Hensen, J.B. McDonald, J.D. Stice, Artificial intelligence and generalized qualitative-response models: an empirical test on two audit decision-making domains, Decision Sciences 23 (3) (1992) 708–723.
[13] J. Higashino, B.L. de Greef, E.H. Persoon, Numerical analysis and adaptation method for learning rate of back propagation, Proceedings of the International Joint Conference on Neural Networks, Washington, DC 1 (1990) 627–630.
[14] M.J. Holt, S. Semnani, Convergence of back propagation in neural networks using a log-likelihood cost function, Electronics Letters 26 (1990) 1964–1965.
[15] K. Hornik, M. Stinchcombe, H. White, Multilayer feed-forward networks are universal approximators, Neural Networks 2 (5) (1989) 359–366.
[16] J.T. Hsiung, W. Suewatanakul, D.M. Himmelblau, Should backpropagation be replaced by more effective optimization algorithms? Proceedings of the International Joint Conference on Neural Networks (IJCNN) 7 (1990) 353–356.
[17] J. Hwang, S. Lay, R. Maechler, D. Martin, J. Schimert, Regression modeling in back-propagation and projection pursuit learning, IEEE Transactions on Neural Networks 5 (3) (1994) 342–353.
[18] Y. Izui, A. Pentland, Analysis of neural networks with redundancy, Neural Computation 2 (1990) 226–238.
[19] T. Kawabata, Generalization effects of k-neighbor interpolation training, Neural Computation 3 (1991) 409–417.
[20] J.W. Kim, H.R. Weistroffer, R.T. Redmond, Expert systems for bond rating: a comparative analysis of statistical, rule-based, and neural network systems, Expert Systems 10 (3) (1993) 167–171.
[21] Y. LeCun, Learning processes in an asymmetric threshold network, Disordered Systems and Biological Organization, Springer, Berlin, 1986, pp. 233–240.
[22] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (1989) 541–551.
[23] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[24] J. Lee, Z. Bien, Improvement on function approximation capability of backpropagation neural networks, Proceedings of the International Joint Conference on Neural Computation 1 (1991) 541–551.
[25] M. Lenard, P. Alam, G. Madey, The applications of neural networks and a qualitative response model to the auditor's going concern uncertainty decision, Decision Sciences 26 (2) (1995) 209–227.
[26] G.R. Madey, J. Denton, Credit evaluation with missing data fields, Proceedings of the INNS, Boston, 1988, p. 456.
[27] O.L. Mangasarian, Mathematical programming in neural networks, ORSA Journal on Computing 5 (4) (1993) 349–360.
[28] E. Masson, Y. Wang, Introduction to computation and learning in artificial neural networks, European Journal of Operational Research 47 (1990) 1–28.
[29] K. Matsuoka, J. Yi, Backpropagation based on the logarithmic error function and elimination of local minima, Proceedings of the International Joint Conference on Neural Networks, Singapore 2 (1991) 1117–1122.
[30] M. Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6 (1993) 525–533.
[31] D. Parker, Learning logic, Technical Report TR-8, Center for Computational Research in Economics and Management Science, MIT, Cambridge, MA, 1985.
[32] E. Patuwo, M.Y. Hu, M.S. Hung, Two-group classification using neural networks, Decision Sciences 24 (4) (1993) 825–845.
[33] D.C. Plaut, S.J. Nowlan, G.E. Hinton, Experiments on learning by back propagation (CMU-CS-86-126), Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1988.
[34] L. Prechelt, A study of experimental evaluations of neural network learning algorithms: current research practice, Technical Report 19/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany. Anonymous FTP: /pub/papers/techreports/1994/1994-19.ps.Z on ftp.ira.uka.de, 1994a.
[35] L. Prechelt, PROBEN1 — a set of benchmarks and benchmarking rules for neural network training algorithms, Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.gz on ftp.ira.uka.de, 1994b.
[36] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, Parallel Distributed Processing: Exploration in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986, pp. 318–362.
[37] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back propagating errors, Nature 323 (1986) 533–536.
[38] L.M. Salchenberger, E.M. Cinar, N.A. Lash, Neural networks: a new tool for predicting thrift failures, Decision Sciences 23 (4) (1992) 899–916.
[39] T. Samad, Backpropagation improvements based on heuristic arguments, Proceedings of the International Joint Conference on Neural Networks, Washington, DC 1 (1990) 565–568.
[40] R. Sexton, R. Dorsey, J. Johnson, Toward a global optimum for neural networks: a comparison of the genetic algorithm and backpropagation, Decision Support Systems 22 (1998) 171–185.
[41] J. Sietsma, R.J.F. Dow, Creating artificial neural networks that generalize, Neural Networks 2 (1991) 67–79.
[42] P.A. Shoemaker, M.J. Carlin, R.L. Shimabukuro, Back propagation learning with trinary quantization of weight updates, Neural Networks 4 (1991) 231–241.
[43] S.A. Solla, E. Levin, M. Fleisher, Accelerated learning in layered neural networks, Complex Systems 2 (1988) 34–39.
[44] V. Subramanian, M.S. Hung, M.Y. Hu, An experimental evaluation of neural networks for classification, Computers and Operations Research 20 (7) (1993) 769–782.
[45] V. Subramanian, M.S. Hung, A GRG-based system for training neural networks: design and computational experience, ORSA Journal on Computing 5 (4) (1990) 386–394.
[46] K.Y. Tam, M.Y. Kiang, Managerial applications of neural networks: the case of bank failure prediction, Management Science 38 (7) (1992) 926–947.
[47] A. van Ooryen, B. Nienhuis, Improving the convergence of the back propagation algorithm, Neural Networks 5 (1992) 465–471.
[48] R. Vitthal, P. Sunthar, R.Ch. Durgaprasada, The generalized proportional-integral-derivative (PID) gradient descent back propagation algorithm, Neural Networks 8 (4) (1995) 563–569.
[49] S. Wang, The unpredictability of standard back propagation neural networks in classification applications, Management Science 41 (3) (1995) 555–559.
[50] R.L. Watrous, Learning algorithms for connections and networks: applied gradient methods of nonlinear optimization, Proceedings of the IEEE Conference on Neural Networks 2, San Diego, IEEE, 1987, pp. 619–627.
[51] A.S. Weigend, D.E. Rumelhart, B.A. Huberman, Back propagation, weight-elimination and time series prediction, in: D.S. Touretzky, J.L. Elman, T.J. Sejnowski, G.E. Hinton (Eds.), Connectionist Models, Proceedings of the 1990 Connectionist Models Summer School, Morgan Kaufmann, San Mateo, CA, 1991, pp. 105–116.
[52] P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley, New York, NY, 1993.
[53] H. White, Some asymptotic results for back-propagation, Proceedings of the IEEE Conference on Neural Networks 3, San Diego, IEEE, 1987, pp. 261–266.
[54] Y. Yoon, G. Swales, T.M. Margavio, A comparison of discriminant analysis versus artificial neural networks, Journal of the Operations Research Society 44 (1) (1993) 51–60.
[55] M.R. Veal, Testing for a global maximum in an econometric context, Econometrica 58 (1990) 1459–1465.



Biographies

Randall S. Sexton is an Assistant Professor of Computer Information Systems at Southwest Missouri State University. He received his PhD in Management Information Systems at the University of Mississippi. His research interests include computational methods, algorithm development, artificial intelligence, and neural networks. His articles have been accepted or published in Decision Support Systems, OMEGA, INFORMS Journal on Computing, European Journal of Operational Research, Journal of Computational Intelligence in Finance, Journal of End User Computing, and other leading journals.

Robert E. Dorsey is an Associate Professor of Economics and Finance at the University of Mississippi. He received his PhD in Economics at the University of Arizona. Prior to attending graduate school, he worked for 15 years in the private sector. His research interests include computational methods, experimental economics, and artificial intelligence. His articles have been accepted or published in Decision Support Systems, Journal of Econometrics, Journal of Business and Economic Statistics, Journal of Computational Economics, European Journal of Operational Research, and other leading journals.