Expert Systems with Applications 39 (2012) 9308–9314
Classifiers selection in ensembles using genetic algorithms for bankruptcy prediction

Myoung-Jong Kim (School of Business, Pusan National University, South Korea)
Dae-Ki Kang (Division of Computer & Information Engineering, Dongseo University, South Korea; corresponding author)
Keywords: Ensemble learning; Genetic algorithm; Coverage optimization; Bankruptcy prediction
Abstract

Ensemble learning is a method for improving the performance of classification and prediction algorithms. Many studies have demonstrated that ensemble learning can decrease the generalization error and improve the performance of individual classifiers and predictors. However, its performance can be degraded by the multicollinearity problem that arises when multiple classifiers of an ensemble are highly correlated with one another. This paper proposes a genetic algorithm-based coverage optimization technique for resolving the multicollinearity problem. Empirical results with bankruptcy prediction on Korean firms indicate that the proposed coverage optimization algorithm can help to design a diverse and highly accurate classification system.

© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

Since bankruptcy is a critical event that can inflict great losses on management, stockholders, employees, customers, and the nation, the development of bankruptcy prediction models has been one of the important issues in accounting and finance research. The methods widely used for developing bankruptcy prediction models come from statistics and machine learning. Statistical techniques, including multiple regression, discriminant analysis, logistic models, and probit, have traditionally been used in forecasting business failures (Altman, 1968; Altman, Edward, Haldeman, & Narayanan, 1977; Dimitras, Zanakis, & Zopounidis, 1996; Meyer & Pifer, 1970; Ohlson, 1980; Pantalone & Platt, 1987; Zmijewski, 1984). However, one major drawback is that they rely on strict assumptions: linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables to the predictor variables. These strict assumptions have limited the application of traditional statistics to the real world. Machine learning techniques used in bankruptcy prediction models include decision trees (DT), neural networks (NN), and support vector machines (SVM) (Bryant, 1997; Buta, 1994; Han, Chandler, & Liang, 1996; Laitinen & Kankaanpaa, 1999; Min,
Lee, & Han, 2006; Odom & Sharda, 1990; Ravi & Ravi, 2007; Shaw & Gentry, 1998; Shin, Lee, & Kim, 2005). One of the techniques recently applied to bankruptcy prediction is ensemble learning (Alfaro, García, Gámez, & Elizondo, 2008; Alfaro, Gámez, & García, 2007; Kim & Kang, 2010). Ensemble learning is a machine learning technique for improving the performance of individual classifiers and predictors. Basically, ensemble learning constructs a highly accurate classifier (a single strong classifier) on the training set by combining an ensemble of weak classifiers, each of which needs only to be moderately accurate on the training set. Many studies on ensemble learning have provided experimental confirmation and a theoretical explanation that a combination of diverse hypotheses can produce a strong ensemble whose error is reduced with respect to the average error of its members. In the last decade, many studies have applied ensemble learning to designing high-performance classification systems, mainly in terms of classification accuracy, in several pattern recognition tasks such as alphanumeric character recognition and face recognition (Czyz, Sadeghi, Kittler, & Vandendorpe, 2004; Lemieux & Parizeau, 2003; Zhou & Zhang, 2002). Recently, empirical studies on bankruptcy prediction have also demonstrated a reduction in generalization error and a prominent performance improvement (Alfaro et al., 2007, 2008; Kim & Kang, 2010). However, some studies have reported a performance degradation problem of ensemble learning caused by multicollinearity among classifiers (Buciu, Kotropoulos, & Pitas, 2001; Dong & Han, 2004; Eom, Kim, & Zhang, 2008; Valentini, Muselli, & Ruffino, 2003). Several studies have proposed coverage optimization to
cope with this problem (Banfield, Hall, Bowyer, & Kegelmeyer, 2003; Giacinto & Roli, 2001; Valentini, 2005). Coverage optimization, also known as diversity-based classifier selection, is a method for selecting classifiers so as to decrease the number of ensemble members while keeping the diversity among the selected members (Santana, Soares, Canuto, & Souto, 2006). These experimental studies have reported that the optimized ensembles have fewer classifiers than the original ensembles, but that their accuracies are higher. This paper proposes a genetic algorithm-based coverage optimization system for ensemble learning. The optimal (or near-optimal) classifier subset is selected based on prediction accuracy and on a diversity measure represented by the variance inflation factor (VIF). The proposed coverage optimization is applied to a company failure prediction task to validate its effect on performance improvement. Experimental results with bankruptcy prediction on Korean firms indicate that the proposed genetic algorithm-based coverage optimization can help to design a diverse and highly accurate classification system. The remainder of this paper is organized as follows. The next section describes two popular ensemble algorithms, Bagging and Boosting, and the diversity problem in ensemble learning. Section 3 explains the proposed coverage optimization algorithm. Section 4 presents the data description and experimental design. Section 5 discusses the experimental results. The final section presents several concluding remarks and future research issues.
2. The diversity problem in ensemble learning

Several ensemble methods for constructing and combining a collection of classifiers have been proposed. The two most widely used methods are Bagging (Breiman, 1994) and Boosting (Freund, 1995; Schapire, 1990). Bagging is a method that creates and combines multiple classifiers, each of which is trained on a bootstrap replicate of the original training set. Bootstrap data are created by resampling examples uniformly with replacement from the original training set, and each classifier is trained on the corresponding bootstrap replicate. The classifiers can be trained in parallel, and the final classifier is generated by combining the ensemble of classifiers. Bagging has been considered a variance reduction technique for a given classifier. It is known to be particularly effective when the classifiers are unstable, that is, when perturbing the learning set can cause significant changes in classification behavior, because Bagging improves generalization performance through a reduction in variance while maintaining or only slightly increasing bias. Boosting constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than examples that are correctly predicted, so Boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. Boosting combines the predictions of the ensemble of classifiers by weighted majority voting, giving more weight to more accurate predictions. In the last decade, many studies have applied ensemble learning to designing high-performance classification systems. In particular, many empirical studies using DT as a base classifier have shown that ensemble learning can enhance the prediction performance of DT classification algorithms such as Classification and Regression Trees (CART) and C4.5 (Banfield, Hall, Bowyer, & Kegelmeyer, 2007; Bauer & Kohavi, 1999; Drucker & Cortes, 1996; Quinlan, 1996). Recently, several studies have applied ensemble
learning to bankruptcy classification trees and have shown that ensemble learning decreases the generalization error and improves accuracy (Alfaro et al., 2007). On the other hand, many studies on NN/SVM ensembles have also reported that ensemble learning can improve an individual classifier's accuracy. However, some studies have indicated that ensemble combination with NN/SVM is less effective than DT ensembles in terms of performance improvement, and that an ensemble's performance is often even worse than that of a single classifier (Buciu et al., 2001; Dong & Han, 2004; Eom et al., 2008; Valentini et al., 2003). Several works have investigated the cause of this performance degradation and argued that the performance of an ensemble can be degraded when its classifiers are highly correlated with one another, resulting in a multicollinearity problem that degrades the performance of the ensemble (Banfield et al., 2003; Breiman, 1994; Giacinto & Roli, 2001; Hansen & Salamon, 1990; Valentini, 2005). Hansen and Salamon (1990) argued that it is necessary and sufficient for the performance enhancement of an ensemble that the ensemble contain diverse classifiers and that each classifier in the ensemble be more accurate than random guessing. This means that the accuracy of each classifier in the ensemble should be over 50% when there are two class labels, and the classifiers in the ensemble should be diverse in order to minimize the misclassification rate. Therefore, the key to successful ensemble methods is to construct individual classifiers with error rates below 0.5 whose errors are at least somewhat uncorrelated. Breiman (1994) reported that Bagging (and, to a lesser extent, Boosting) can increase the performance of unstable learning algorithms, but does not show remarkable performance improvement on stable learning algorithms. Ensemble learning applies various sampling techniques, such as bagging and boosting, to guarantee diversity in a classifier pool. Unstable learning algorithms such as DT learners are sensitive to changes in the training data, so small changes in the training data can yield large changes in the generated classifiers. Therefore, an ensemble of unstable learning algorithms can guarantee some diversity among the classifiers. On the contrary, stable learning algorithms such as NN/SVM generate similar classifiers in spite of changes in the training data, so the correlation among the resulting classifiers is very high. This high correlation results in a multicollinearity problem, which leads to performance degradation of the ensemble. The concept of coverage optimization is introduced to cope with the performance degradation due to the multicollinearity problem. Coverage optimization is a method for selecting classifiers in order to decrease the number of ensemble members while, at the same time, keeping the diversity among the selected members. It arises from the intuition that a set of dissimilar classifiers will perform better than a single good decision maker, because each classifier's error is compensated by the decisions of the others. For example, there is clearly no accuracy gain in an ensemble composed of a set of identical classifiers. Thus, if there are many different classifiers to be combined, one would expect an increase in the overall accuracy when combining them, as long as they are diverse (Santana et al., 2006).
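To make the stability argument above concrete, the following is a minimal sketch in Python (scikit-learn on synthetic data; the data, ensemble sizes, and helper name are illustrative, and on scikit-learn versions before 1.2 the `estimator` argument of BaggingClassifier is named `base_estimator`). It bags a DT learner and an RBF SVM and compares the average pairwise correlation of the members' predictions; under the argument above, the bagged DT members should be noticeably less correlated.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=7, random_state=0)

    def mean_pairwise_correlation(ensemble, X):
        """Average Pearson correlation between the 0/1 outputs of all member pairs."""
        preds = np.array([est.predict(X) for est in ensemble.estimators_], dtype=float)
        corr = np.corrcoef(preds)                 # rows = member prediction vectors
        iu = np.triu_indices_from(corr, k=1)      # upper triangle = distinct pairs
        return corr[iu].mean()

    for base in (DecisionTreeClassifier(), SVC(kernel="rbf")):
        bag = BaggingClassifier(estimator=base, n_estimators=10, random_state=0).fit(X, y)
        print(type(base).__name__, mean_pairwise_correlation(bag, X))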
In the previous literature, several studies have proposed methods for the diversity-based classifier selection problem (Banfield et al., 2003; Giacinto & Roli, 2001; Valentini, 2005). For instance, classifiers can be clustered based on the diversity they produce; in the prediction task, one classifier from each group is selected as a member of the ensemble to avoid the multicollinearity problem, because classifiers that belong to the same group tend to make correlated errors (Giacinto & Roli, 2001). Banfield et al. (2003) proposed an ensemble diversity procedure based on uncertain points (patterns). These uncertain points are considered to deliver diversity to the ensemble, since there is no general agreement among
the classifiers about the correct output for these points. In this context, the classifiers that have higher accuracy on the uncertain points (diversity) are chosen to be part of the ensemble.

3. GA-based coverage optimization algorithm

Assuming the availability of an original set of multiple classifiers, referred to as a classifier pool, Ho (2002) proposed two optimization techniques for combining multiple classifiers: coverage optimization and decision optimization. Coverage optimization is the problem of selecting an optimal classifier subset from a given classifier pool, while decision optimization is the problem of combining the outcomes of the classifiers belonging to a given classifier ensemble. Studies on optimization methods have placed much more effort on the decision optimization issue (Fumera & Roli, 2005; Ueda, 2000). However, coverage optimization is of no lesser importance than decision optimization, because the outcomes of the classifier ensemble are direct inputs to the combination algorithm (Roli & Giacinto, 2002). Classifier ensemble selection is defined as the problem of selecting an ensemble of d classifiers from a classifier pool of K classifiers so that the chosen sub-ensemble has the optimal classification performance. For this problem, the size of the possible search space is KCd (K ≥ d), so the search space grows combinatorially with K. Genetic algorithms (GAs) are popularly used as an effective tool for such search problems. GAs can avoid local optima by using crossover and mutation operators, and can rapidly search a vast and complicated search space for an optimal or near-optimal solution using probabilistic search methods. Recently, GAs have been applied to the classifier selection process for performance improvement of ensemble learning. Zhou, Wu, and Tang (2002) formally proved that a sub-ensemble of selected classifiers can be superior to an ensemble composed of all the classifiers in terms of prediction accuracy, and experimentally demonstrated their proposal by generating a neural network ensemble and adapting a GA to choose an optimal sub-ensemble. Oliveira, Sabourin, Bortolozzi, and Suen (2003) also used GAs to select a sub-ensemble from ensembles with multiple classifiers to improve prediction accuracy. Wu and Chen (2004) proposed accuracy-based classifier selection based on Bagging and GAs. The main difference between those studies and our work is that those works used GAs as a selection algorithm for performance improvement, whereas the proposed algorithm concentrates on selecting an ensemble containing diverse members. The GA learning process for coverage optimization is performed in four stages, as follows.

3.1. Chromosome encoding

A solution is encoded in chromosome form in order to solve the coverage optimization problem. We set the size of the search space for the chromosomes of the coverage optimization to the number of classifiers in the pool, K, and assign the weight (dk) of each classifier as either 0 or 1, where 0 means the classifier is excluded and 1 means the classifier is selected. Thus, the GA chromosomes for the coverage optimization are encoded as binary strings. For example, chromosome C = 1100100011 (when K = 10) means that classifiers #1, #2, #5, #9, and #10 are selected as the classifier ensemble (d = 5).

3.2. Initial population

The initial population is generated by random number generation.
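As an illustration of Sections 3.1 and 3.2, the following is a minimal sketch in Python (NumPy; the helper name, seed, and population size are our own choices, not specified in the paper) of the binary chromosome encoding and the randomly generated initial population:

    import numpy as np

    K = 10                                   # size of the classifier pool
    rng = np.random.default_rng(seed=42)

    def random_population(pop_size, K):
        """Random initial population of binary chromosomes (1 = selected, 0 = excluded)."""
        return rng.integers(0, 2, size=(pop_size, K))

    population = random_population(pop_size=20, K=K)

    # The example chromosome from the text: classifiers #1, #2, #5, #9, #10 selected.
    C = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 1])
    selected = np.flatnonzero(C) + 1         # 1-based indices of selected classifiers
    print(selected, "d =", C.sum())          # [ 1  2  5  9 10] d = 5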
3.3. Fitness function

A chromosome C is evaluated using a fitness value produced by combining the outcomes of the selected classifier ensemble. Majority voting is used to combine the results of the multiple classifiers and to generate the final class of each observation xi. In case of a tie, xi is assigned the class predicted by the classifier with the highest prediction accuracy on the training set. In this paper, the fitness function is defined as the average prediction accuracy (PA), because the purpose of this paper is to find a bankruptcy prediction model with high prediction performance. However, focusing only on prediction accuracy through the GA search is not sufficient to cope with the multicollinearity problem. Thus, we adopt VIF analysis to measure diversity and add a diversity constraint based on the VIF value. VIF analysis is a statistical method generally used to measure multicollinearity. The VIF value of the kth classifier is calculated as

VIF(k) = 1 / (1 − R_k²),

where R_k² is the coefficient of determination obtained when the outputs of the kth classifier are regressed on those of the other classifiers. If the kth classifier is closely related to the other classifiers, then R_k² will be close to 1, and therefore VIF(k) will be large. If 5 < VIF(k) < 10, the kth classifier possibly has multicollinearity, and if VIF(k) > 10, the kth classifier has serious multicollinearity. Thus, the classifiers included in the sub-ensemble should have VIF less than 5. The fitness function with the diversity constraint can therefore be expressed as follows:
Fitness_PA = (1/n) · Σ_{i=1}^{n} CR_i,   where CR_i = 1 if C(i) = y_i and CR_i = 0 if C(i) ≠ y_i,

subject to VIF(k) < 5 for every selected classifier k,

where C(i) and y_i denote the predicted output of the single strong (combined) classifier and the actual output for the ith observation, respectively, and n is the number of training samples.
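Putting the pieces together, the following is a minimal sketch in Python (NumPy only; the function names, the least-squares VIF computation on the members' 0/1 outputs, and the simplified tie-breaking are our own reading of the text, not the authors' code):

    import numpy as np

    def vif(preds):
        """VIF(k) = 1 / (1 - R_k^2), regressing classifier k's outputs on the others.

        preds: array of shape (K, n) holding the 0/1 predictions of K classifiers.
        """
        K, n = preds.shape
        vifs = np.empty(K)
        for k in range(K):
            X = np.column_stack([preds[j] for j in range(K) if j != k])
            X = np.column_stack([np.ones(n), X])               # add intercept
            beta, *_ = np.linalg.lstsq(X, preds[k], rcond=None)
            resid = preds[k] - X @ beta
            ss_res = (resid ** 2).sum()
            ss_tot = ((preds[k] - preds[k].mean()) ** 2).sum()
            r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
            vifs[k] = 1.0 / max(1.0 - r2, 1e-12)               # guard against R^2 = 1
        return vifs

    def fitness(chromosome, preds, y):
        """Majority-voted accuracy of the selected classifiers, VIF(k) < 5 required."""
        sel = preds[chromosome.astype(bool)]
        if len(sel) == 0:
            return 0.0
        if len(sel) > 1 and np.any(vif(sel) >= 5):             # diversity constraint
            return 0.0                                         # infeasible chromosome
        # Majority vote over 0/1 labels; ties fall to class 0 here, whereas the
        # paper breaks ties with the most accurate member (simplified for brevity).
        votes = (sel.mean(axis=0) > 0.5).astype(int)
        return (votes == y).mean()

Treating an infeasible chromosome as fitness 0 is one simple way to enforce the constraint; a penalty term subtracted from the accuracy would be an equally plausible reading.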
3.4. Genetic operations

At this stage, the GA selects two chromosomes with high fitness values as parents. The selection algorithm uses a rank-based roulette-wheel scheme. New candidates (offspring) are generated from the two selected parents using standard genetic operations such as crossover and mutation. For the crossover operation, an m-point crossover operator is used, which chooses m cutting points at random and alternately copies each segment out of the two parents. One of the two candidates is then chosen at random, and mutation is applied to the chosen candidate. The algorithm calculates a fitness score for each candidate and replaces chromosomes with low scores by new candidates with high scores. This process is repeated until the stopping conditions are satisfied. The coverage optimization algorithm is illustrated in Fig. 1.

Fig. 1. The genetic algorithm used for classifier selection.

4. Experimental design

4.1. Data and variables

For the evaluation of the proposed method, we have applied it to a benchmark data set obtained from a major commercial bank in Korea. The data set contains 1,200 externally audited manufacturing firms, half of which went bankrupt during 2002–2005, while the healthy firms were selected from companies active at the end of 2005. Through a literature review, we first investigated 31 financial ratios, categorized as profitability, debt coverage, leverage, capital structure, liquidity, activity, and size. We then chose the final input variables by assessing the performance of each variable
based on receiver operating characteristic (ROC) curve analysis, in which 1 − specificity and sensitivity of the classifier are plotted graphically. Sensitivity and specificity are measured as TP/(TP + FN) and TN/(FP + TN), where TP, TN, FP, and FN are defined in the confusion matrix shown in Table 1. The performance criterion for each variable in the ROC curve analysis is the area under the ROC curve (AUROC), which equals the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUROC of a random-guess model is 0.5, while an AUROC of 1 indicates a perfect variable (or model) for the prediction task. Generally, a variable (or a classifier) has an AUROC value between 0.5 and 1 and is considered accurate if its AUROC is close to 1. We have chosen the seven financial ratios with the highest AUROC values from the 31 financial ratios, as presented in Table 2. The potential presence of multicollinearity is an important checkpoint for the model. We have estimated the VIF values among the seven financial ratios to check for multicollinearity. Table 3 shows that the estimated VIF values are all below five, indicating that the chosen model variables do not present any substantial multicollinearity.
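As a small illustration of the screening metrics above, here is a minimal sketch in Python (scikit-learn; the toy labels and scores are illustrative only) that computes sensitivity, specificity, and AUROC for one candidate financial ratio used as a raw score:

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true = np.array([1, 1, 0, 0, 1, 0])                  # 1 = bankrupt, 0 = non-bankrupt
    score  = np.array([0.9, 0.4, 0.2, 0.3, 0.7, 0.6])      # financial ratio as a score
    y_pred = (score > 0.5).astype(int)                     # threshold for the confusion matrix

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                           # TP / (TP + FN)
    specificity = tn / (fp + tn)                           # TN / (FP + TN)
    auroc = roc_auc_score(y_true, score)                   # area under the ROC curve
    print(sensitivity, specificity, auroc)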
Table 2
Area under the ROC curve (AUROC) values of the 31 financial ratios.

Category             Variables                                         AUROC
Profitability        Ordinary income to total assets^a                 52.5
                     Net income to total assets                        45.9
                     Financial expenses to sales                       49.7
                     Financial expenses to total debt                  48.9
                     Net financing cost to sales                       50.8
                     Ordinary income to sales                          45.9
                     Net income to sales                               49.9
                     Ordinary income to capital                        48.8
                     Net income to capital                             48.1
Debt coverage        EBITDA to interest expenses^a                     53.7
                     EBIT to interest expenses                         40.1
                     Cash operating income to interest expenses        48.9
                     Cash operating income to total debt               48.8
                     Cash flow after interest payment to total debt    52.3
                     Cash flow after interest payment to total debt    53.1
                     Debt repayment coefficient                        51.7
                     Borrowings to interest expenses                   53.4
Leverage             Total debt to total assets^a                      51.9
                     Current assets to total assets                    50.9
Capital structure    Retained earning to total assets^a                53.5
                     Retained earning to total debt                    52.7
                     Retained earning to current assets                51.1
Liquidity            Cash ratio^a                                      46.5
                     Quick ratio                                       45.5
                     Current assets to current liabilities             43.2
Activity             Inventory to sales^a                              30.8
                     Current liabilities to sales                      29.2
                     Account receivable to sales                       27.7
Size                 Total assets^a                                    24.8
                     Sales                                             22.4
                     Fixed assets                                      22.6

^a The chosen seven financial ratios.
4.2. Experimental design

For the experimental design, we have performed three separate experiments with individual classifiers, ensemble classifiers, and coverage optimization-based ensemble classifiers. To develop individual classifiers for bankruptcy prediction, we have used three popular classification algorithms: DT, NN, and SVM. For DT and NN, C4.5 and a multi-layer perceptron (MLP) are used to generate the individual classifiers, respectively. For SVM, we have trained the SVM classifier using sequential minimal optimization (SMO) with the radial basis function (RBF) kernel.
Table 1
Confusion matrix that shows TP, TN, FP, and FN.

                              Predicted class
Actual class                  Bankrupt (positive)    Non-bankrupt (negative)
Bankrupt (positive)           True positive (TP)     False negative (FN)
Non-bankrupt (negative)       False positive (FP)    True negative (TN)
There are two parameters when using the RBF kernel: the acceptable error C and the kernel parameter δ². We test various configurations of the two parameters, varying C from 1 to 250 and δ² from 1 to 200. For the performance analysis of ensemble classifiers, we have used two popular ensemble methods, AdaBoost and Bagging, to construct the ensembles. Finally, we have applied GAs to develop coverage optimization-based ensemble classifiers, namely coverage-optimized Boosting (CO-Boosting) and coverage-optimized Bagging (CO-Bagging). To prevent the output from falling into local optima, we have flexibly varied the crossover and mutation rates of the GA.
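A minimal sketch of this parameter search in Python (scikit-learn; note that sklearn's SVC parameterizes the RBF width through gamma rather than δ², so the grid values below are illustrative stand-ins, not the paper's exact settings):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=7, random_state=0)
    grid = {"C": [1, 10, 50, 100, 250],              # C varied from 1 to 250
            "gamma": [1.0, 0.1, 0.02, 0.005]}        # RBF width stand-in for d^2 in 1-200
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)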
Table 3
The estimated variance inflation factors of the chosen variables.

Variables                            VIF
Ordinary income to total assets      1.64
EBITDA to interest expenses          2.34
Total debt to total assets           1.95
Retained earning to total assets     2.77
Cash ratio                           1.54
Inventory to sales                   1.73
Total assets                         1.67
Table 4
Multicollinearity analysis results by VIF.

               DT        NN        SVM
Individual     1 (0)     1 (0)     1 (0)
Boosting       7 (0)     9 (3)     8 (3)
Bagging        10 (2)    10 (3)    10 (4)
CO-Boosting    6 (0)     6 (0)     5 (0)
CO-Bagging     7 (0)     6 (0)     6 (0)

Note: each cell shows the total number of classifiers and, in parentheses, the number of classifiers whose VIF value is greater than five.
Table 5
Performance of classifiers on the validation set (%).

               DT       NN       SVM
Individual     70.30    71.02    72.45
Boosting       75.10    73.10    73.07
Bagging        75.78    73.97    72.85
CO-Boosting    76.00    76.52    77.53
CO-Bagging     76.20    76.92    77.23
The crossover rate ranges from 0.5 to 0.7 and the mutation rate from 0.06 to 0.1 in this experiment. The stopping condition is set to 1,000 iterations, and the genetic search is thus repeated for 50 generations.
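The genetic operations of Section 3.4 under these settings can be sketched as follows in Python (NumPy; the function names, seed, and default rates are our own, with the rates chosen from the ranges above):

    import numpy as np

    rng = np.random.default_rng(0)

    def rank_roulette_select(population, fitnesses):
        """Select one parent with probability proportional to its fitness rank."""
        ranks = np.argsort(np.argsort(fitnesses)) + 1      # 1 = worst, len = best
        p = ranks / ranks.sum()
        return population[rng.choice(len(population), p=p)]

    def m_point_crossover(p1, p2, m=2):
        """Cut at m random points and alternately copy segments from the two parents."""
        cuts = np.sort(rng.choice(np.arange(1, len(p1)), size=m, replace=False))
        child = p1.copy()
        bounds = zip(np.r_[0, cuts], np.r_[cuts, len(p1)])
        for i, (a, b) in enumerate(bounds):
            if i % 2 == 1:                                 # odd segments come from p2
                child[a:b] = p2[a:b]
        return child

    def mutate(c, rate=0.08):                              # mutation rate in 0.06-0.1
        flip = rng.random(len(c)) < rate
        return np.where(flip, 1 - c, c)                    # bit-flip mutation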
5. Experimental results

We have repeated 10-fold cross-validation five times with different random seeds in order to ensure that the comparison among the three different classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and each set is then used in turn as the test set while the classifier is trained on the other nine sets; that is, the cross-validated folds are tested independently for each algorithm. Through these steps, we have obtained results for the classifiers on each of the 50 experiments. We have performed VIF analysis to investigate the effect of multicollinearity on the performance of the classifiers. Table 4 shows the results of the multicollinearity test through VIF analysis, where each cell shows the total number of classifiers and the number of classifiers whose VIF is greater than five. The DT-based classifiers do not present any substantial multicollinearity, except for the DT Bagging classifier, in which two base classifiers have a multicollinearity problem. Meanwhile, there is a substantial multicollinearity problem in three base classifiers of the NN Boosting ensemble and three base classifiers of the NN Bagging ensemble. For SVM, the multicollinearity problem is observed in three classifiers of the boosted SVM and four classifiers of the bagged SVM. These results indicate that the performance improvement of NN/SVM ensembles can deteriorate due to the multicollinearity problem.
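The evaluation protocol above (10-fold cross-validation repeated five times, yielding 50 fold-level results) can be reproduced with a minimal sketch in Python (scikit-learn; the synthetic data mirrors only the sample size of the benchmark set):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1200, n_features=7, random_state=0)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    print(len(scores), scores.mean())     # 50 fold-level accuracies and their mean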
After selecting classifiers through coverage optimization, the multicollinearity problem among the classifier members is no longer observed, which means that both the CO-Boosting and CO-Bagging classifiers contain diverse members without substantial multicollinearity. Table 5 shows the performance comparison of the classifiers on the validation set. Among the individual classifiers, SVM (72.45%) shows higher prediction accuracy than both the NN (71.02%) and DT (70.30%) classifiers. All ensembles show higher performance than the individual classifiers. The results show that the prediction accuracy of the DT ensembles (75.10%, 75.78%) is higher than that of both the NN (73.10%, 73.97%) and SVM (73.07%, 72.85%) ensembles. The ensembles show improvements over a single classifier on the validation data of about 4.8% and 5.48% for DT, 2.08% and 2.95% for NN, and 0.62% and 0.4% for SVM, respectively. These results mean that DT ensembles containing diverse classifiers can decrease the generalization error and thereby generate a prominent performance improvement, while stable NN/SVM ensembles are limited in their performance improvement due to the multicollinearity problem. Among the optimized ensembles, CO-SVM (77.53%, 77.23%) yields more accurate results than CO-DT (76.00%, 76.20%) and CO-NN (76.52%, 76.92%). The improvements from coverage optimization are about 0.9% and 0.42% for the DT ensembles, 3.42% and 2.95% for the NN ensembles, and 4.46% and 4.38% for the SVM ensembles. Compared with the ensemble classifiers, the optimized classifiers have fewer members, as shown in Table 4, but their accuracies are higher than those of the original ensemble classifiers. A t-test is used to examine whether the average performance of each classifier over the 50 folds is significantly different. As shown in Table 6, the results indicate that ensemble learning significantly improves the DT and NN classifiers at the 1% level, while it has no substantial effect on the performance of the SVM classifier. On the other hand, coverage optimization yields a prominent performance improvement of the stable NN/SVM ensembles at the 1% significance level, while the performance of the CO-DT classifiers is not significantly different from that of the DT ensembles. We additionally compared the performance across learning methods, as shown in Table 7.
Table 6
T-test results of prediction accuracy (t-values).

Classifiers         Boosting     Bagging      CO-Boosting    CO-Bagging
Individual   DT     4.258***     4.803***     4.996***       5.171***
             NN     4.037***     5.726***     10.675***      11.451***
             SVM    1.203        0.776        9.860***       9.278***
Boosting     DT     –            –            0.586          –
             NN     –            –            6.638***       –
             SVM    –            –            8.656***       –
Bagging      DT     –            –            –              0.275
             NN     –            –            –              5.276***
             SVM    –            –            –              8.501***

*** Significant at the 1% level.
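The significance tests above can be reproduced with a minimal sketch in Python (SciPy; the paper does not state whether the t-tests are paired or independent, so this sketch uses a paired test on hypothetical fold-level accuracies, generated here only for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    acc_boosted_dt = rng.normal(0.751, 0.03, size=50)   # hypothetical 50 fold accuracies
    acc_single_dt  = rng.normal(0.703, 0.03, size=50)

    t, p = stats.ttest_rel(acc_boosted_dt, acc_single_dt)
    print(f"t = {t:.3f}, p = {p:.4f}")   # p < 0.01 -> significant at the 1% level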
Table 7
Performance comparison among learning algorithms (%, t-value).

Classifier     DT       NN                 SVM
Individual     70.30    71.02 (1.398)*     72.45 (4.173)***
Boosting       75.10    73.10 (1.780)**    73.07 (1.870)**
Bagging        75.78    73.97 (1.589)*     72.85 (2.570)***
CO-Boosting    76.00    76.52 (0.453)      77.53 (1.337)
CO-Bagging     76.20    76.92 (0.628)      77.23 (0.899)

Note: t-values in parentheses compare NN and SVM with DT.
* Significant at the 10% level. ** Significant at the 5% level. *** Significant at the 1% level.
For a single classifier, the performance of SVM is significantly different from that of DT at the 1% significance level, while NN is not significantly different from DT. Among the Boosting classifiers, Boosted DT shows significantly different performance compared with Boosted NN/SVM at the 5% level, while the performance of Bagged DT is significantly different from Bagged SVM at the 1% level. On the other hand, the performances of the CO-classifiers are not significantly different from each other.

6. Conclusion

In this work, we have proposed a coverage optimization algorithm to resolve the multicollinearity problem and enhance the performance of stable ensembles. The proposed algorithm uses a GA to guarantee an ensemble that contains diverse classifiers in the coverage optimization process. We use prediction accuracy as the fitness function to improve the performance of the ensemble, and VIF analysis to measure multicollinearity as a degree of diversity in order to select diverse classifiers, which is the goal of coverage optimization. In other words, we use prediction accuracy as the fitness function and VIF as a constraint of the GA search to remove high correlation among the classifiers and thereby ensure their diversity. Experiments on company failure prediction have shown that coverage optimization is effective for the stable performance enhancement of NN/SVM ensembles through a choice of classifiers that takes the correlations within the ensemble into account. Although the performance improvement for unstable classifiers, such as DT ensembles, is smaller than that for stable classifiers, the experiments have also shown that coverage optimization can be helpful for improving the performance of unstable classifiers. However, further research is needed to address the limitations of this study. First of all, we have mainly focused on coverage optimization, but one of the most fundamental problems in ensemble learning is the decision optimization process of finding an optimal combination function; we expect to pursue further research on decision optimization in the future. Secondly, ensemble learning has a problem with noise. Noise in the learning samples distorts the classification boundary of learning algorithms such as SVM and degrades learning performance. In particular, in Boosting ensembles, which focus on learning misclassified observations, noise repeatedly affects the newly generated classifiers. To deal with such outliers, various SVM ensemble methods such as probabilistic roulette selection, Karush–Kuhn–Tucker condition-based heuristic selection, and automatic feature selection have been proposed (Maia, Braga, & Carvalho, 2008). Coupled with those lines of research, we expect to pursue more advanced work in the future.

References

Alfaro, E., Gámez, M., & García, N. (2007). Multiclass corporate failure prediction by AdaBoost.M1. Advanced Economic Research, 13, 301–312.
Alfaro, E., García, N., Gámez, M., & Elizondo, D. (2008). Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45, 110–122.
Altman, E. L. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589–609.
Altman, E. L., Edward, I., Haldeman, R., & Narayanan, P. (1977). A new model to identify bankruptcy risk of corporations. Journal of Banking and Finance, 1, 29–54.
Banfield, R., Hall, L., Bowyer, K., & Kegelmeyer, W. (2003). A new ensemble diversity measure applied to thinning ensembles. In 4th International Workshop, MCS 2003 (Vol. 2709, pp. 306–316).
Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2007). A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 173–180.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–139.
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
Bryant, S. M. (1997). A case-based reasoning approach to bankruptcy prediction modeling. International Journal of Intelligent Systems in Accounting, Finance and Management, 6(3), 195–214.
Buciu, I., Kotropoulos, C., & Pitas, I. (2001). Combining support vector machines for accurate face detection. In Proceedings of ICIP (pp. 1054–1057).
Buta, P. (1994). Mining for financial knowledge with CBR. AI Expert, 9(10), 34–41.
Czyz, J., Sadeghi, M., Kittler, J., & Vandendorpe, L. (2004). Decision fusion for face authentication. In Proceedings of the first international conference on machine learning (pp. 115–123).
Dimitras, A. I., Zanakis, S. H., & Zopounidis, C. (1996). A survey of business failure with an emphasis on prediction methods and industrial applications. European Journal of Operational Research, 90(3), 487–513.
Dong, Y. S., & Han, K. S. (2004). A comparison of several ensemble methods for text categorization. In IEEE international conference on services computing (SCC'04) (pp. 419–422).
Drucker, H., & Cortes, C. (1996). Boosting decision trees. In Advances in neural information processing systems (Vol. 8).
Eom, J. H., Kim, S. C., & Zhang, B. T. (2008). AptaCDSS-E: A classifier ensemble-based clinical decision support system for cardiovascular disease level prediction. Expert Systems with Applications, 34, 2465–2479.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Fumera, G., & Roli, F. (2005). A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 942–956.
Giacinto, G., & Roli, F. (2001). An approach to the automatic design of multiple classifier systems. Pattern Recognition Letters, 22, 25–33.
Han, I., Chandler, J. S., & Liang, T. P. (1996). The impact of measurement scale and correlation structure on classification performance of inductive learning and statistical methods. Expert Systems with Applications, 10(2), 209–221.
Hansen, L., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on PAMI, 12, 993–1001.
Ho, T. K. (2002). Multiple classifier combination: Lessons and next steps. In Hybrid methods in pattern recognition. World Scientific.
Kim, M. J., & Kang, D. K. (2010). Ensemble with neural networks for bankruptcy prediction. Expert Systems with Applications, 37, 3373–3379.
Laitinen, T., & Kankaanpaa, M. (1999). Comparative analysis of failure prediction methods: The Finnish case. European Accounting Review, 8(1), 67–92.
Lemieux, A., & Parizeau, M. (2003). Flexible multi-classifier architecture for face recognition systems. In The 16th international conference on vision interface (pp. 1–8).
Maia, T. T., Braga, A. P., & Carvalho, A. F. (2008). Hybrid classification algorithms based on boosting and support vector machines. Kybernetes, 37(9), 1469–1491.
Meyer, P. A., & Pifer, H. (1970). Prediction of bank failures. The Journal of Finance, 25, 853–868.
Min, S. H., Lee, J. M., & Han, I. G. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31, 652–660.
Odom, M., & Sharda, R. (1990). A neural network for bankruptcy prediction. In Proceedings of the international joint conference on neural networks (Vol. 2, pp. 163–168).
Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), 109–131.
Oliveira, L. S., Sabourin, R., Bortolozzi, F., & Suen, C. Y. (2003). A methodology for feature selection for ensembles using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence, 17(6), 903–930.
Pantalone, C., & Platt, M. B. (1987). Predicting commercial bank failure since deregulation. New England Economic Review, 37–47.
Quinlan, J. R. (1996). Bagging, boosting and C4.5. In Machine learning: Proceedings of the fourteenth international conference (pp. 725–730).
Ravi, P., & Ravi, K. V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review. European Journal of Operational Research, 180(1), 1–28.
Roli, F., & Giacinto, G. (2002). Design of multiple classifier systems. In Hybrid methods in pattern recognition (pp. 199–226). World Scientific.
Santana, A., Soares, R., Canuto, A., & Souto, M. (2006). A dynamic classifier selection method to build ensembles using accuracy and diversity. In Proceedings of the ninth Brazilian symposium on neural networks (pp. 36–41).
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Shaw, M., & Gentry, J. (1998). Using an expert system with inductive learning to evaluate business loans. Financial Management, 17(3), 45–56.
Shin, K., Lee, T., & Kim, H. (2005). An application of support vector machines in bankruptcy prediction. Expert Systems with Applications, 28, 127–135.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2), 207–215.
Valentini, G. (2005). An experimental bias-variance analysis of SVM ensembles based on resampling techniques. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 35(6), 1252–1271.
Valentini, G., Muselli, M., & Ruffino, F. (2003). Bagged ensembles of SVMs for gene expression data analysis. In The IEEE-INNS-ENNS international joint conference on neural networks (pp. 1844–1849).
Wu, X., & Chen, Z. (2004). Recognition of exon/intron boundaries using dynamic ensembles. In Proceedings of the computational systems bioinformatics conference (pp. 16–19).
Zhou, D., & Zhang, J. (2002). Face recognition by combining several algorithms. Pattern Recognition, 3(3), 497–500.
Zhou, Z. H., Wu, J. X., & Tang, W. (2002). Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137, 239–263.
Zmijewski, M. E. (1984). Methodological issues related to the estimation of financial distress prediction models. Journal of Accounting Research, 22(1), 59–82.