Classifiers selection in ensembles using genetic algorithms for bankruptcy prediction




Expert Systems with Applications 39 (2012) 9308–9314. doi:10.1016/j.eswa.2012.02.072



Myoung-Jong Kim (a), Dae-Ki Kang (b,*)

(a) School of Business, Pusan National University, South Korea
(b) Division of Computer & Information Engineering, Dongseo University, South Korea
(*) Corresponding author. Address: Division of Computer & Information Engineering, Dongseo University, 47, Churye-Ro, Sasang-Gu, Busan 617-716, South Korea. Tel.: +82 51 320 1724; fax: +82 51 327 8955. E-mail address: [email protected] (D.-K. Kang).

Keywords: Ensemble learning; Genetic algorithm; Coverage optimization; Bankruptcy prediction

Abstract: Ensemble learning is a method for improving the performance of classification and prediction algorithms. Many studies have demonstrated that ensemble learning can decrease the generalization error and improve the performance of individual classifiers and predictors. However, its performance can be degraded by the multicollinearity problem, which arises when the multiple classifiers of an ensemble are highly correlated with one another. This paper proposes a genetic algorithm-based coverage optimization technique to resolve the multicollinearity problem. Empirical results on bankruptcy prediction for Korean firms indicate that the proposed coverage optimization algorithm can help to design a diverse and highly accurate classification system.

1. Introduction

Since bankruptcy is a critical event that can inflict great losses on management, stockholders, employees, customers, and the nation, the development of bankruptcy prediction models has been one of the important issues in accounting and finance research. The widely used methods for developing bankruptcy prediction models come from statistics and machine learning. Statistical techniques, including multiple regression, discriminant analysis, logistic models, and probit, have traditionally been used to forecast business failures (Altman, 1968; Altman, Edward, Haldeman, & Narayanan, 1977; Dimitras, Zanakis, & Zopounidis, 1996; Meyer & Pifer, 1970; Ohlson, 1980; Pantalone & Platt, 1987; Zmijewski, 1984). However, one major drawback of these techniques is their reliance on strict assumptions: linearity, normality, independence among the predictor variables, and a pre-existing functional form relating the criterion variables to the predictor variables. These strict assumptions have limited the application of traditional statistics to the real world.


Machine learning techniques used in bankruptcy prediction models include decision trees (DT), neural networks (NN), and support vector machines (SVM) (Bryant, 1997; Buta, 1994; Han, Chandler, & Liang, 1996; Laitinen & Kankaanpaa, 1999; Min, Lee, & Han, 2006; Odom & Sharda, 1990; Ravi & Ravi, 2007; Shaw & Gentry, 1998; Shin, Lee, & Kim, 2005). One of the techniques recently applied to bankruptcy prediction is ensemble learning (Alfaro, García, Gámez, & Elizondo, 2008; Alfaro, Gámez, & García, 2007; Kim & Kang, 2010). Ensemble learning is a machine learning technique for improving the performance of individual classifiers and predictors. Basically, ensemble learning constructs a highly accurate classifier (a single strong classifier) on the training set by combining an ensemble of weak classifiers, each of which needs only to be moderately accurate on the training set. Many studies on ensemble learning have provided experimental confirmation and theoretical explanation that a combination of diverse hypotheses can produce a strong ensemble whose error is reduced with respect to the average error of its members. In the last decade, many studies have applied ensemble learning to the design of high-performance classification systems, mainly in terms of classification accuracy, in several pattern recognition tasks such as alphanumeric character recognition and face recognition (Czyz, Sadeghi, Kittler, & Vandendorpe, 2004; Lemieux & Parizeau, 2003; Zhou & Zhang, 2002). Recently, empirical studies on bankruptcy prediction have also demonstrated a reduction in generalization error and a prominent performance improvement (Alfaro et al., 2007, 2008; Kim & Kang, 2010). However, some studies have reported a performance degradation problem in ensemble learning caused by multicollinearity among the classifiers (Buciu, Kotropoulos, & Pitas, 2001; Dong & Han, 2004; Eom, Kim, & Zhang, 2008; Valentini, Muselli, & Ruffino, 2003).


Several studies have proposed coverage optimization to cope with this problem (Banfield, Hall, Bowyer, & Kegelmeyer, 2003; Giacinto & Roli, 2001; Valentini, 2005). Coverage optimization, also known as diversity-based classifier selection, is a method for selecting classifiers so as to decrease the number of ensemble members while keeping the diversity among the selected members (Santana, Soares, Canuto, & Souto, 2006). These experimental studies have reported that the optimized ensembles have fewer classifiers than the original ensembles, yet achieve higher accuracies. This paper proposes a genetic algorithm-based coverage optimization system for ensemble learning. The optimal (or near-optimal) classifier subset is selected based on prediction accuracy and a diversity measurement represented by the variance inflation factor (VIF). The proposed coverage optimization is applied to a company failure prediction task to validate its effect on performance improvement. Experimental results on bankruptcy prediction for Korean firms indicate that the proposed genetic algorithm-based coverage optimization can help to design a diverse and highly accurate classification system. The remainder of this paper is organized as follows. The next section describes two popular ensemble algorithms, Bagging and Boosting, and the diversity problem in ensemble learning. Section 3 explains the proposed coverage optimization algorithm. Section 4 presents the data description and the experimental design. Section 5 discusses the experimental results. The final section presents concluding remarks and future research issues.

2. The diversity problem in ensemble learning

Several ensemble methods for constructing and combining a collection of classifiers have been proposed. The two most widely used are Bagging (Breiman, 1994) and Boosting (Freund, 1995; Schapire, 1990). Bagging creates and combines multiple classifiers, each of which is trained on a bootstrap replicate of the original training set. The bootstrap data are created by resampling examples uniformly with replacement from the original training set, and each classifier is trained on its corresponding bootstrap replicate. The classifiers can be trained in parallel, and the final classifier is generated by combining the ensemble of classifiers. Bagging has been considered a variance-reduction technique for a given classifier. It is known to be particularly effective when the classifiers are unstable, that is, when perturbing the learning set can cause significant changes in classification behavior, because Bagging improves generalization performance through a reduction in variance while maintaining, or only slightly increasing, bias. Boosting constructs a composite classifier by sequentially training classifiers while increasing the weight of misclassified observations through the iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted, so Boosting attempts to produce new classifiers that better predict the examples on which the current ensemble performs poorly. Boosting combines the predictions of the ensemble by weighted majority voting, giving more weight to the more accurate predictions. In the last decade, many studies have applied ensemble learning to the design of high-performance classification systems. In particular, many empirical studies using DT as the base classifier have shown that ensemble learning can enhance the prediction performance of DT algorithms such as Classification and Regression Trees (CART) and C4.5 (Banfield, Hall, Bowyer, & Kegelmeyer, 2007; Bauer & Kohavi, 1999; Drucker & Cortes, 1996; Quinlan, 1996).
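To make the two procedures concrete, the sketch below builds a bagged and a boosted decision-tree ensemble. It is an illustration only, assuming scikit-learn and synthetic data rather than the classifiers and data used in this paper.

# Minimal Bagging and Boosting sketch (assumes scikit-learn; data are synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, n_features=7, random_state=0)

# Bagging: each tree is trained on a bootstrap replicate drawn uniformly with
# replacement from the training set; predictions are combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)

# Boosting: trees are trained sequentially, and the weight of misclassified
# observations is increased at every iteration.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=10, random_state=0)

for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: mean 10-fold CV accuracy = {acc:.3f}")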


Recently, several studies have applied ensemble learning with classification trees to bankruptcy prediction, showing that ensemble learning decreases the generalization error and improves accuracy (Alfaro et al., 2007). Many studies on NN/SVM ensembles have likewise reported that ensemble learning can improve an individual classifier's accuracy. However, some studies have indicated that ensemble combination with NN/SVM is less effective than DT ensembles with respect to performance improvement, and that the ensemble's performance is often even worse than that of a single classifier (Buciu et al., 2001; Dong & Han, 2004; Eom et al., 2008; Valentini et al., 2003). Several works have investigated the cause of this performance degradation and argued that the performance of an ensemble can be degraded when its multiple classifiers are highly correlated with one another, since such correlation gives rise to the multicollinearity problem (Banfield et al., 2003; Breiman, 1994; Giacinto & Roli, 2001; Hansen & Salamon, 1990; Valentini, 2005). Hansen and Salamon (1990) argued that it is necessary and sufficient for the performance enhancement of an ensemble that the ensemble contain diverse classifiers and that each classifier be more accurate than random guessing. This means that the accuracy of each classifier in the ensemble should exceed 50% when there are two class labels, and the classifiers in the ensemble should be diverse, so as to minimize the misclassification rate. Therefore, the key to successful ensemble methods is to construct individual classifiers with error rates below 0.5 whose errors are at least somewhat uncorrelated. Breiman (1994) reported that Bagging (and, to a lesser extent, Boosting) can increase the performance of unstable learning algorithms, but shows no remarkable performance improvement for stable learning algorithms. Ensemble learning applies various sampling techniques, such as bagging and boosting, to guarantee diversity in the classifier pool. Unstable learning algorithms such as DT learners are sensitive to changes in the training data, so small changes in the training data can yield large changes in the generated classifiers; an ensemble of unstable learners can therefore guarantee some diversity among its classifiers. On the contrary, stable learning algorithms such as NN/SVM generate similar classifiers in spite of changes in the training data, so the correlation among the resulting classifiers is very high. This high correlation results in the multicollinearity problem, which leads to performance degradation of the ensemble. The concept of coverage optimization was introduced to cope with the performance degradation caused by the multicollinearity problem. Coverage optimization is a method for selecting classifiers so as to decrease the number of ensemble members while, at the same time, keeping the diversity among the selected members. It arises from the intuition that a set of dissimilar classifiers will perform better than a single good decision maker, because each member's errors are compensated by the decisions of the others. For example, there is clearly no accuracy gain in an ensemble composed of identical classifiers. Thus, if there are many different classifiers to be combined, one would expect an increase in overall accuracy when combining them, as long as they are diverse (Santana et al., 2006).
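This degradation can be checked directly from the classifiers' outputs: collecting the 0/1 predictions of the base classifiers as the columns of a matrix, strongly correlated columns signal the lack of diversity described above. A minimal sketch with simulated outputs (all data here are hypothetical):

# Measuring correlation among base-classifier outputs on a validation set.
import numpy as np

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=500)   # outputs of one "stable" classifier
# Five classifiers that each agree with `base` on about 90% of the samples,
# mimicking the similar classifiers produced by stable learning algorithms.
predictions = np.column_stack(
    [np.where(rng.random(500) < 0.9, base, 1 - base) for _ in range(5)]
)

corr = np.corrcoef(predictions, rowvar=False)   # pairwise correlation matrix
print(np.round(corr, 2))   # high off-diagonal values indicate low diversity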
In the previous literature, several methods have been proposed for the diversity-based classifier selection problem (Banfield et al., 2003; Giacinto & Roli, 2001; Valentini, 2005). For instance, classifiers can be clustered based on the diversity they produce; in the prediction task, one classifier from each group is selected as a member of the ensemble to avoid the multicollinearity problem, because classifiers belonging to the same group tend to make correlated errors (Giacinto & Roli, 2001). Banfield et al. (2003) proposed an ensemble diversity procedure based on uncertain points (patterns). These uncertain points are considered to deliver diversity to the ensemble, since there is no general agreement among the classifiers about the correct output for these points.



In this context, the classifiers that have higher accuracy on the uncertain points (i.e., that contribute diversity) are chosen to be part of the ensemble.

3. GA-based coverage optimization algorithm

Assuming the availability of an original set of multiple classifiers, referred to as a classifier pool, Ho (2002) identified two optimization techniques for combining multiple classifiers: coverage optimization and decision optimization. Coverage optimization is the problem of selecting an optimal classifier subset from a given classifier pool, while decision optimization is the problem of combining the outcomes of the classifiers belonging to a given classifier ensemble. Studies on optimization methods have placed much more effort on the decision optimization issue (Fumera & Roli, 2005; Ueda, 2000). However, coverage optimization is no less important than decision optimization, because the outcomes of the classifier ensemble are fed directly into the combination algorithm (Roli & Giacinto, 2002). Classifier ensemble selection is defined as the problem of selecting an ensemble of d classifiers from a classifier pool of K classifiers so that the chosen sub-ensemble has optimal classification performance. The size of the possible search space for this problem is the number of combinations KCd (K ≥ d), so the search space grows combinatorially with K. Genetic algorithms (GAs) are popularly used as an effective tool for such combinatorial search problems. GAs can escape local optima by using crossover and mutation operators, and can rapidly search a vast and complicated space for an optimal or near-optimal solution using probabilistic search methods. Recently, GAs have been applied to the classifier selection process to improve the performance of ensemble learning. Zhou, Wu, and Tang (2002) formally proved that a sub-ensemble of selected classifiers can be superior, in prediction accuracy, to an ensemble composed of all the classifiers. They also demonstrated this experimentally by generating a neural network ensemble and adapting a GA to choose an optimal sub-ensemble. Oliveira, Sabourin, Bortolozzi, and Suen (2003) also used GAs to select sub-ensembles from ensembles of multiple classifiers to improve prediction accuracy. Wu and Chen (2004) proposed accuracy-based classifier selection based on Bagging and GAs. The main difference between those studies and our work is that they used GAs purely as a selection algorithm for performance improvement, while the proposed algorithm concentrates on selecting an ensemble containing diverse members. The GA learning process for coverage optimization is performed through four stages, as follows.

3.1. Chromosome encoding

A solution is encoded in chromosome form in order to solve the coverage optimization problem. We set the chromosome length to the number of classifiers in the pool, K, and assign each classifier a weight (dk) of either 0 or 1, where 0 means the classifier is excluded and 1 means it is selected. Thus, the GA chromosomes for coverage optimization are encoded as binary strings. For example, the chromosome C = 1100100011 (with K = 10) means that classifiers #1, #2, #5, #9, and #10 are selected as the classifier ensemble (d = 5).

3.2. Initial population

The initial population is generated by random number generation.
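As an illustration of this encoding and of the random initialization in Section 3.2, the following sketch decodes a binary chromosome into the indices of the selected classifiers and draws a random initial population; the pool size and population size are arbitrary, not the paper's settings.

# Sketch of the binary chromosome encoding for coverage optimization.
import random

K = 10   # size of the classifier pool

def decode(chromosome):
    # Return the 1-based indices of the classifiers selected by a 0/1 string.
    return [k + 1 for k, bit in enumerate(chromosome) if bit == 1]

# Example from the text: C = 1100100011 selects classifiers #1, #2, #5, #9, #10.
C = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
print(decode(C))   # -> [1, 2, 5, 9, 10], i.e. d = 5

# Initial population (Section 3.2): random binary strings of length K.
population = [[random.randint(0, 1) for _ in range(K)] for _ in range(30)]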

3.3. Fitness function

A chromosome C is evaluated using the fitness value produced by combining the outcomes of the selected classifier ensemble. Majority voting is used to combine the results of the multiple classifiers and generate the final class of each observation x_i. In case of a tie, x_i is assigned the class predicted by the classifier with the highest prediction accuracy on the training set. In this paper, the fitness function is defined as the average prediction accuracy (PA), because our purpose is to find a bankruptcy prediction model with high prediction performance. However, focusing only on prediction accuracy during the GA search is not sufficient to cope with the multicollinearity problem. Thus, we adopt VIF analysis to measure diversity, and add a diversity constraint based on the VIF value. VIF analysis is a statistical method generally used to measure multicollinearity. The VIF value of the kth classifier is calculated as VIF(k) = 1 / (1 − Rk²), where Rk² is the coefficient of determination obtained when the output of the kth classifier is regressed on the outputs of the other classifiers. If the kth classifier is closely related to the other classifiers, then Rk² will be close to 1 and VIF(k) will be large. If 5 < VIF(k) < 10, the kth classifier possibly has multicollinearity, and if VIF(k) > 10, the kth classifier can be considered to have serious multicollinearity. Thus, the classifiers included in the sub-ensemble should have VIF values below 5. The fitness function with the diversity constraint can therefore be expressed as follows:

Fitness = PA = (1/n) Σ_{i=1..n} CR_i,  where CR_i = 1 if C(i) = y_i, and CR_i = 0 if C(i) ≠ y_i,

subject to VIF(k) < 5 for every selected classifier k,

where C(i) and y_i are the predicted output of the combined strong classifier and the actual output for the ith observation, respectively, and n is the number of training samples.

3.4. Genetic operations

At this stage, the GA selects two chromosomes with high fitness values as parents, using a rank-based roulette-wheel selection scheme. New candidates (offspring) are generated from the two selected parents by standard genetic operations such as crossover and mutation. For crossover, an m-point crossover operator is used, which chooses m cutting points at random and alternately copies each segment from the two parents. One of the two offspring is chosen at random, and mutation is applied to it. The algorithm calculates a fitness score for each candidate and replaces chromosomes with low scores by new candidates with high scores. This process is repeated until the stopping conditions are satisfied. The coverage optimization algorithm is illustrated in Fig. 1.

Fig. 1. The genetic algorithm used for classifier selection.
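Complementing Fig. 1, the following is a compact sketch of the whole procedure: the VIF-constrained fitness of Section 3.3 combined with rank-based roulette-wheel selection, crossover, and mutation. It is an illustration only: the helper names, the one-point crossover (the paper uses m-point crossover), the tie-breaking in majority voting (ties default to class 1 here, whereas the paper breaks ties by the most accurate member), and all parameter values are our own assumptions.

# Sketch of GA-based coverage optimization with a VIF diversity constraint.
# `outputs` is a (n_samples x K) 0/1 matrix of base-classifier predictions
# on the training set; `y` holds the true labels. Both are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def vif(cols):
    # VIF(k) = 1 / (1 - R^2_k): regress column k on the remaining columns.
    vals = []
    n = cols.shape[0]
    for k in range(cols.shape[1]):
        others = np.delete(cols, k, axis=1)
        X = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(X, cols[:, k], rcond=None)
        resid = cols[:, k] - X @ beta
        ss_tot = np.sum((cols[:, k] - cols[:, k].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / ss_tot if ss_tot > 0 else 1.0
        vals.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(vals)

def fitness(chrom, outputs, y):
    # Majority-vote accuracy of the selected sub-ensemble, zeroed when the
    # diversity constraint (VIF < 5 for every selected classifier) is violated.
    idx = np.flatnonzero(chrom)
    if len(idx) < 2 or np.any(vif(outputs[:, idx]) >= 5.0):
        return 0.0
    votes = (outputs[:, idx].mean(axis=1) >= 0.5).astype(int)  # ties -> class 1
    return float(np.mean(votes == y))

def evolve(outputs, y, pop_size=30, generations=50, pc=0.6, pm=0.08):
    K = outputs.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, K))
    for _ in range(generations):
        fit = np.array([fitness(c, outputs, y) for c in pop])
        ranks = np.argsort(np.argsort(fit)) + 1           # rank-based roulette wheel
        probs = ranks / ranks.sum()
        children = []
        for _ in range(pop_size):
            i, j = rng.choice(pop_size, size=2, p=probs)  # select two parents
            if rng.random() < pc:                         # one-point crossover
                cut = int(rng.integers(1, K))
                child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            else:
                child = pop[i].copy()
            flips = rng.random(K) < pm                    # bit-flip mutation
            child[flips] = 1 - child[flips]
            children.append(child)
        pop = np.array(children)
    fit = np.array([fitness(c, outputs, y) for c in pop])
    return pop[int(np.argmax(fit))]

# Hypothetical demonstration: ten noisy copies of the labels as a classifier pool.
y = rng.integers(0, 2, size=300)
outputs = np.column_stack([np.where(rng.random(300) < 0.75, y, 1 - y) for _ in range(10)])
print("selected classifiers:", np.flatnonzero(evolve(outputs, y)) + 1)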

4. Experimental design

4.1. Data and variable

For the evaluation of the proposed method, we have applied it to a benchmark data set obtained from a major commercial bank in Korea. The data set contains 1,200 externally audited manufacturing firms, half of which went bankrupt during 2002–2005, while the healthy firms were selected from companies still active at the end of 2005. Through a literature review, we first investigated 31 financial ratios, categorized as profitability, debt coverage, leverage, capital structure, liquidity, activity, and size. We then chose the final input variables by assessing the performance of each variable with receiver operating characteristic (ROC) curve analysis.

In an ROC curve, 1 − specificity and sensitivity of the classifier are plotted against each other. Sensitivity and specificity are measured as TP/(TP + FN) and TN/(FP + TN), respectively, where TP, TN, FP, and FN are defined in the confusion matrix shown in Table 1. The performance criterion for each variable in ROC curve analysis is the value of the area under the ROC curve (AUROC), which is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUROC of a random-guess model is 0.5; an AUROC of 1 indicates a perfect variable (or model) for the prediction task. In general, a variable (or classifier) has an AUROC between 0.5 and 1 and is considered accurate if its AUROC is close to 1. We have chosen seven financial ratios, the one with the highest AUROC value in each category, as presented in Table 2.
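The rank interpretation of AUROC admits a direct computation: score every (bankrupt, healthy) pair of firms by one variable and count how often the bankrupt firm receives the higher value, counting ties as one half. A small sketch with made-up numbers follows; a library routine such as sklearn.metrics.roc_auc_score computes the same quantity.

# AUROC as the probability that a randomly chosen positive (bankrupt) case
# is ranked above a randomly chosen negative (healthy) case.
def auroc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else (0.5 if p == q else 0.0)  # ties count half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical values of one financial ratio for bankrupt vs. healthy firms.
bankrupt = [0.8, 0.6, 0.7, 0.4]
healthy = [0.3, 0.5, 0.2, 0.6]
print(auroc(bankrupt, healthy))   # 0.5 = random guess, 1.0 = perfect separation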

Table 1. Confusion matrix defining TP, TN, FP, and FN.

Actual class               Predicted bankrupt (positive)    Predicted non-bankrupt (negative)
Bankrupt (positive)        True positive (TP)               False negative (FN)
Non-bankrupt (negative)    False positive (FP)              True negative (TN)

Table 2. Area under the ROC curve (AUROC) values of the 31 financial ratios (%).

Category             Variable                                           AUROC
Profitability        Ordinary income to total assets*                   52.5
                     Net income to total assets                         45.9
                     Financial expenses to sales                        49.7
                     Financial expenses to total debt                   48.9
                     Net financing cost to sales                        50.8
                     Ordinary income to sales                           45.9
                     Net income to sales                                49.9
                     Ordinary income to capital                         48.8
                     Net income to capital                              48.1
Debt coverage        EBITDA to interest expenses*                       53.7
                     EBIT to interest expenses                          40.1
                     Cash operating income to interest expenses         48.9
                     Cash operating income to total debt                48.8
                     Cash flow after interest payment to total debt     52.3
                     Cash flow after interest payment to total debt     53.1
                     Debt repayment coefficient                         51.7
                     Borrowings to interest expenses                    53.4
Leverage             Total debt to total assets*                        51.9
                     Current assets to total assets                     50.9
Capital structure    Retained earnings to total assets*                 53.5
                     Retained earnings to total debt                    52.7
                     Retained earnings to current assets                51.1
Liquidity            Cash ratio*                                        46.5
                     Quick ratio                                        45.5
                     Current assets to current liabilities              43.2
Activity             Inventory to sales*                                30.8
                     Current liabilities to sales                       29.2
                     Accounts receivable to sales                       27.7
Size                 Total assets*                                      24.8
                     Sales                                              22.4
                     Fixed assets                                       22.6

* Chosen as one of the seven final input variables.

The potential presence of multicollinearity is an important checkpoint of the model. We have estimated the VIF values among the seven chosen financial ratios to check for multicollinearity. Table 3 shows that the estimated VIF values are all below five, indicating that the chosen variables do not present any substantial multicollinearity.

4.2. Experimental design

As for the experimental design, we have performed three separate experiments: with individual classifiers, with ensemble classifiers, and with coverage optimization-based ensemble classifiers. To develop the individual classifiers for bankruptcy prediction, we have used three popular classification algorithms: DT, NN, and SVM. For DT and NN, C4.5 and the multi-layer perceptron (MLP) are used to generate the individual classifiers, respectively. For SVM, we have trained the classifier using sequential minimal optimization (SMO) with the radial basis function (RBF) kernel as the kernel function.

Two parameters must be set when using RBF kernels: the acceptable error C and the kernel parameter δ². We have constructed various configurations of these two parameters, varying C from 1 to 250 and δ² from 1 to 200. For the performance analysis of ensemble classifiers, we have used two popular ensemble methods, AdaBoost and Bagging, to construct the ensembles. Finally, we have applied GAs to develop the coverage optimization-based ensemble classifiers, namely coverage-optimized Boosting (CO-Boosting) and coverage-optimized Bagging (CO-Bagging). To prevent the search from falling into local optima, we have flexibly varied the crossover and mutation rates of the GA.



Table 3. The estimated variance inflation factors of the chosen variables.

Variable                              VIF
Ordinary income to total assets       1.64
EBITDA to interest expenses           2.34
Total debt to total assets            1.95
Retained earnings to total assets     2.77
Cash ratio                            1.54
Inventory to sales                    1.73
Total assets                          1.67

Table 4. Multicollinearity analysis results by VIF.

               DT        NN        SVM
Individual     1 (0)     1 (0)     1 (0)
Boosting       7 (0)     9 (3)     8 (3)
Bagging        10 (2)    10 (3)    10 (4)
CO-Boosting    6 (0)     6 (0)     5 (0)
CO-Bagging     7 (0)     6 (0)     6 (0)

Note: each cell shows the total number of classifiers and, in parentheses, the number of classifiers whose VIF value is greater than five.

Table 5. Performance of classifiers on the validation set (%).

               DT       NN       SVM
Individual     70.30    71.02    72.45
Boosting       75.10    73.10    73.07
Bagging        75.78    73.97    72.85
CO-Boosting    76.00    76.52    77.53
CO-Bagging     76.20    76.92    77.23

The crossover rate ranges from 0.5 to 0.7 and the mutation rate from 0.06 to 0.1 in this experiment. The stopping condition is set to 1,000 iterations, so the genetic search is repeated for 50 generations.
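For reference, the experimental settings just described can be gathered into one configuration; the dictionary layout below is our own illustrative choice, with the values restated from Section 4.2 and this paragraph.

# Experimental settings restated as a configuration (illustrative layout only).
experiment_config = {
    "svm": {"C_range": (1, 250), "delta_sq_range": (1, 200)},   # RBF-kernel SMO
    "ga": {
        "crossover_rate": (0.5, 0.7),   # varied flexibly within this range
        "mutation_rate": (0.06, 0.1),
        "iterations": 1000,             # stopping condition
        "generations": 50,
    },
}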

5. Experimental results

We have repeated 10-fold cross-validation five times with different random seeds, to ensure that the comparison among the three different classifiers does not happen by chance. In each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and each set is then used in turn as the test set while the classifier is trained on the other nine sets; that is, the cross-validated folds are tested independently for each algorithm. Through these steps, we have obtained results for the classifiers on each of the 50 experiments. We have performed VIF analysis to investigate the effect of multicollinearity on classifier performance. Table 4 shows the results of the multicollinearity test, where each cell gives the total number of classifiers and the number of classifiers whose VIF is greater than five. The DT-based classifiers do not present any substantial multicollinearity, except for the bagged DT classifier, in which two base classifiers show multicollinearity. Meanwhile, substantial multicollinearity exists in three base classifiers of the boosted NN ensemble and three base classifiers of the bagged NN ensemble. For SVM, multicollinearity is observed in three classifiers of the boosted SVM and four classifiers of the bagged SVM. These results indicate that the performance improvement of NN/SVM ensembles can deteriorate due to the multicollinearity problem.
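The evaluation protocol, five repetitions of 10-fold cross-validation yielding 50 fold results per classifier, can be sketched as follows; the synthetic data and the placeholder model assume scikit-learn and are not the paper's implementation.

# Sketch of the 5 x 10-fold cross-validation protocol (50 results per model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, n_features=7, random_state=0)
model = DecisionTreeClassifier()   # placeholder for any of the compared classifiers

scores = []
for seed in range(5):              # five repetitions with different random seeds
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores.extend(cross_val_score(model, X, y, cv=cv))
print(len(scores), float(np.mean(scores)))   # 50 fold accuracies and their mean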

After selecting classifiers through coverage optimization, the multicollinearity problem among the classifier members is no longer observed, which means that both the CO-Boosting and CO-Bagging classifiers contain diverse classifiers without substantial multicollinearity. Table 5 shows the performance comparison of the classifiers on the validation set. Among the individual classifiers, SVM (72.45%) shows higher prediction accuracy than both NN (71.02%) and DT (70.30%). All ensembles show higher performance than the individual classifiers. The prediction accuracy of the DT ensembles (75.10%, 75.78%) is higher than that of both the NN (73.10%, 73.97%) and SVM (73.07%, 72.85%) ensembles. The ensembles improve over a single classifier on the validation data by about 4.80% and 5.48% for DT, 2.08% and 2.95% for NN, and 0.62% and 0.40% for SVM, respectively. These results mean that DT ensembles containing diverse classifiers can decrease the generalization error and thereby generate a prominent performance improvement, while the stable NN/SVM ensembles are limited in performance improvement due to the multicollinearity problem. Among the optimized ensembles, CO-SVM (77.53%, 77.23%) gives more accurate results than CO-DT (76.00%, 76.20%) and CO-NN (76.52%, 76.92%). The improvements from coverage optimization are about 0.90% and 0.42% for the DT ensembles, 3.42% and 2.95% for the NN ensembles, and 4.46% and 4.38% for the SVM ensembles. Compared with the original ensemble classifiers, the optimized classifiers have fewer members, as shown in Table 4, but higher accuracies. A t-test is used to examine whether the average performance of each classifier over the 50 folds is significantly different. As shown in Table 6, the results indicate that ensemble learning significantly improves the DT and NN classifiers at the 1% level, while it has no substantial effect on the performance of the SVM classifier. On the other hand, coverage optimization yields a prominent performance improvement for the stable NN/SVM ensembles at the 1% significance level, while the performance of the CO-DT classifiers is not significantly different from that of the DT ensembles. We additionally compare the performance of the learning methods, as shown in Table 7.

Table 6. T-test results of prediction accuracy (t-values).

Classifiers          Boosting     Bagging      CO-Boosting    CO-Bagging
Individual   DT      4.258***     4.803***     4.996***       5.171***
             NN      4.037***     5.726***     10.675***      11.451***
             SVM     1.203        0.776        9.860***       9.278***
Boosting     DT      –            –            0.586          –
             NN      –            –            6.638***       –
             SVM     –            –            8.656***       –
Bagging      DT      –            –            –              0.275
             NN      –            –            –              5.276***
             SVM     –            –            –              8.501***

*** Significant at the 1% level.

Table 7. Performance comparison among learning algorithms (%; t-values of the difference from DT in parentheses).

Classifier     DT       NN                 SVM
Individual     70.30    71.02 (1.398)*     72.45 (4.173)***
Boosting       75.10    73.10 (1.780)**    73.07 (1.870)**
Bagging        75.78    73.97 (1.589)*     72.85 (2.570)***
CO-Boosting    76.00    76.52 (0.453)      77.53 (1.337)
CO-Bagging     76.20    76.92 (0.628)      77.23 (0.899)

* Significant at the 10% level. ** Significant at the 5% level. *** Significant at the 1% level.


For a single classifier, the performance of SVM is significantly different from that of DT at the 1% significance level, while NN is not significantly different from DT. Among the Boosting classifiers, boosted DT shows significantly different performance from boosted NN/SVM at the 5% level, while the performance of bagged DT is significantly different from that of bagged SVM at the 1% level. On the other hand, the performances of the CO-classifiers are not significantly different from one another.

6. Conclusion

In this work, we have proposed a coverage optimization algorithm to resolve the multicollinearity problem and achieve a stable performance enhancement of ensembles. The proposed algorithm uses a GA to guarantee an ensemble that contains diverse classifiers during the coverage optimization process. We use prediction accuracy to improve the performance of the ensemble, and VIF analysis to measure multicollinearity as a degree of diversity so that varied classifiers are selected, which is the goal of coverage optimization. Accordingly, prediction accuracy serves as the fitness function and the VIF serves as a constraint of the GA search that removes high correlation among the classifiers and ensures their diversity. Experiments on company failure prediction show that coverage optimization is effective for the stable performance enhancement of NN/SVM ensembles, through a choice of classifiers that accounts for the correlations within the ensemble. Although the performance improvement for unstable classifiers, such as DT ensembles, is smaller than that for stable classifiers, the experiments also show that coverage optimization can help to improve the performance of unstable classifiers. However, further research is needed to address the limitations of this study. First, we have focused mainly on coverage optimization, but one of the most fundamental problems in ensemble learning is the decision optimization process of finding the optimal combination function; we expect to pursue more advanced research on decision optimization in the future. Secondly, ensemble learning has a problem with noise. Noise in the learning samples distorts the classification boundary of learning algorithms such as SVM and degrades learning performance. In particular, in Boosting ensembles, which focus on learning the misclassified observations, noise repeatedly affects the newly generated classifiers. To deal with such outliers, various SVM ensemble methods, such as probabilistic roulette selection, Karush–Kuhn–Tucker condition-based heuristic selection, and automatic feature selection, have been proposed (Maia, Braga, & Carvalho, 2008). Coupled with this line of work, we expect to pursue more advanced research in the future.

References

Alfaro, E., Gámez, M., & García, N. (2007). Multiclass corporate failure prediction by AdaBoost.M1. Advanced Economic Research, 13, 301–312.
Alfaro, E., García, N., Gámez, M., & Elizondo, D. (2008). Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45, 110–122.
Altman, E. L. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589–609.
Altman, E. L., Edward, I., Haldeman, R., & Narayanan, P. (1977). A new model to identify bankruptcy risk of corporations. Journal of Banking and Finance, 1, 29–54.
Banfield, R., Hall, L., Bowyer, K., & Kegelmeyer, W. (2003). A new ensemble diversity measure applied to thinning ensembles. In 4th International Workshop, MCS 2003 (Vol. 2709, pp. 306–316).
Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2007). A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 173–180.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–139.


Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
Bryant, S. M. (1997). A case-based reasoning approach to bankruptcy prediction modeling. International Journal of Intelligent Systems in Accounting, Finance and Management, 6(3), 195–214.
Buciu, I., Kotropoulos, C., & Pitas, I. (2001). Combining support vector machines for accurate face detection. In Proceedings of ICIP (pp. 1054–1057).
Buta, P. (1994). Mining for financial knowledge with CBR. AI Expert, 9(10), 34–41.
Czyz, J., Sadeghi, M., Kittler, J., & Vandendorpe, L. (2004). Decision fusion for face authentication. In Proceedings of the first international conference on machine learning (pp. 115–123).
Dimitras, A. I., Zanakis, S. H., & Zopounidis, C. (1996). A survey of business failure with an emphasis on prediction methods and industrial applications. European Journal of Operational Research, 90(3), 487–513.
Dong, Y. S., & Han, K. S. (2004). A comparison of several ensemble methods for text categorization. In IEEE international conference on services computing (SCC'04) (pp. 419–422).
Drucker, H., & Cortes, C. (1996). Boosting decision trees. In Advances in neural information processing systems 8.
Eom, J. H., Kim, S. C., & Zhang, B. T. (2008). AptaCDSS-E: A classifier ensemble-based clinical decision support system for cardiovascular disease level prediction. Expert Systems with Applications, 34, 2465–2479.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Fumera, G., & Roli, F. (2005). A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 942–956.
Giacinto, G., & Roli, F. (2001). An approach to the automatic design of multiple classifier systems. Pattern Recognition Letters, 22, 25–33.
Han, I., Chandler, J. S., & Liang, T. P. (1996). The impact of measurement scale and correlation structure on classification performance of inductive learning and statistical methods. Expert Systems with Applications, 10(2), 209–221.
Hansen, L., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993–1001.
Ho, T. K. (2002). Multiple classifier combination: Lessons and next steps. In Hybrid methods in pattern recognition. World Scientific.
Kim, M. J., & Kang, D. K. (2010). Ensemble with neural networks for bankruptcy prediction. Expert Systems with Applications, 37, 3373–3379.
Laitinen, T., & Kankaanpaa, M. (1999). Comparative analysis of failure prediction methods: The Finnish case. European Accounting Review, 8(1), 67–92.
Lemieux, A., & Parizeau, M. (2003). Flexible multi-classifier architecture for face recognition systems. In The 16th international conference on vision interface (pp. 1–8).
Maia, T. T., Braga, A. P., & Carvalho, A. F. (2008). Hybrid classification algorithms based on boosting and support vector machines. Kybernetes, 37(9), 1469–1491.
Meyer, P. A., & Pifer, H. (1970). Prediction of bank failures. The Journal of Finance, 25, 853–868.
Min, S. H., Lee, J. M., & Han, I. G. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31, 652–660.
Odom, M., & Sharda, R. (1990). A neural network for bankruptcy prediction. In Proceedings of the international joint conference on neural networks (Vol. 2, pp. 163–168).
Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), 109–131.
Oliveira, L. S., Sabourin, R., Bortolozzi, F., & Suen, C. Y. (2003). A methodology for feature selection for ensembles using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence, 17(6), 903–930.
Pantalone, C., & Platt, M. B. (1987). Predicting commercial bank failure since deregulation. New England Economic Review, 37–47.
Quinlan, J. R. (1996). Bagging, boosting and C4.5. In Machine learning: Proceedings of the fourteenth international conference (pp. 725–730).
Ravi, P., & Ravi, K. V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques: A review. European Journal of Operational Research, 180(1), 1–28.
Roli, F., & Giacinto, G. (2002). Design of multiple classifier systems. In Hybrid methods in pattern recognition (pp. 199–226). World Scientific.
Santana, A., Soares, R., Canuto, A., & Souto, M. (2006). A dynamic classifier selection method to build ensembles using accuracy and diversity. In Proceedings of the ninth Brazilian symposium on neural networks (pp. 36–41).
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Shaw, M., & Gentry, J. (1998). Using an expert system with inductive learning to evaluate business loans. Financial Management, 17(3), 45–56.
Shin, K., Lee, T., & Kim, H. (2005). An application of support vector machines in bankruptcy prediction. Expert Systems with Applications, 28, 127–135.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2), 207–215.
Valentini, G. (2005). An experimental bias-variance analysis of SVM ensembles based on resampling techniques. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 35(6), 1252–1271.
Valentini, G., Muselli, M., & Ruffino, F. (2003). Bagged ensembles of SVMs for gene expression data analysis. In The IEEE-INNS-ENNS international joint conference on neural networks (pp. 1844–1849).



Wu, X., & Chen, Z. (2004). Recognition of exon/intron boundaries using dynamic ensembles. In Proceedings of the computational systems bioinformatics conference (pp. 16–19).
Zhou, D., & Zhang, J. (2002). Face recognition by combining several algorithms. Pattern Recognition, 3(3), 497–500.
Zhou, Z. H., Wu, J. X., & Tang, W. (2002). Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137, 239–263.
Zmijewski, M. E. (1984). Methodological issues related to the estimation of financial distress prediction models. Journal of Accounting Research, 22(1), 59–82.