Expert Systems with Applications 41 (2014) 2353–2361
An improved boosting based on feature selection for corporate bankruptcy prediction

Gang Wang (a,b,c,*), Jian Ma (c), Shanlin Yang (a,b)

a School of Management, Hefei University of Technology, Hefei, Anhui 230009, PR China
b Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education, Hefei, Anhui, PR China
c Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Keywords: Corporate bankruptcy prediction; Ensemble learning; Boosting; Feature selection
Abstract: With the recent financial crisis and European debt crisis, corporate bankruptcy prediction has become an increasingly important issue for financial institutions. Many statistical and intelligent methods have been proposed; however, no single method has proved to be the overall best for predicting corporate bankruptcy. Recent studies suggest that ensemble learning methods may have potential applicability in corporate bankruptcy prediction. In this paper, a new and improved Boosting, FS-Boosting, is proposed to predict corporate bankruptcy. Through injecting a feature selection strategy into Boosting, FS-Boosting can obtain better performance, as the base learners in FS-Boosting achieve greater accuracy and diversity. For testing and illustration purposes, two real world bankruptcy datasets were selected to demonstrate the effectiveness and feasibility of FS-Boosting. Experimental results reveal that FS-Boosting can be used as an alternative method for corporate bankruptcy prediction. © 2013 Elsevier Ltd. All rights reserved.
* Corresponding author at: Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong. Tel.: +852 9799 0955; fax: +852 2788 8694. E-mail address: [email protected] (G. Wang). http://dx.doi.org/10.1016/j.eswa.2013.09.033

1. Introduction

Predicting corporate bankruptcy is an important management science problem, and its main goal is to differentiate corporations with a probability of distress from healthy corporations. Moreover, incorrect decision-making in financial institutions may lead them into financial difficulty or distress and cause many social costs affecting owners or shareholders, managers, government, etc. As a result, how to predict corporate bankruptcy has become a hot topic for both industrial application and academic research (Li, Andina, & Sun, 2012; Olson, Delen, & Meng, 2012; Zhou, Lai, & Yen, 2014). As there are no mature theories of corporate bankruptcy, studies of corporate bankruptcy have largely been based on trial-and-error iterative processes of selecting features and predictive models (Li & Sun, 2009; Zhou et al., 2014).

With the development of statistics and artificial intelligence (AI), various statistical and intelligent methods have been proposed for corporate bankruptcy prediction. The statistical methods applied in corporate bankruptcy prediction mainly include Linear Discriminant Analysis (LDA), Multivariate Discriminant Analysis (MDA), Quadratic Discriminant Analysis (QDA), Logistic Regression Analysis (LRA), and Factor Analysis (FA) (Li & Sun, 2009; Zmijewski, 1984). However, the problem with applying these statistical techniques to corporate bankruptcy prediction is that some of their assumptions, such as the multivariate normality assumption for the independent variables, are frequently violated in practice, which makes these techniques theoretically invalid for finite samples (Shin & Lee, 2002). In recent years, many studies have demonstrated that intelligent techniques such as Artificial Neural Networks (ANN), Decision Tree (DT), Case-Based Reasoning (CBR), and Support Vector Machine (SVM) can be used as alternative methods for corporate bankruptcy prediction (Olson et al., 2012; Tsai & Wu, 2008). In contrast with statistical techniques, intelligent techniques do not assume certain data distributions and automatically extract knowledge from training samples (Wang, Ma, Huang, & Xu, 2012). However, there is still no overall best intelligent method for predicting corporate bankruptcy. The performance of prediction depends on the details of the problem, the data structure, the characteristics used, the extent to which the classes can be segregated by using those characteristics, and the objective of the classification (Duéñez-Guzmán & Vose, 2013).

Recently, integrating multiple predictors into an aggregated output, i.e., ensemble methods, has been demonstrated to be an efficient strategy for achieving high prediction performance, especially when the component predictors have different structures that lead to independent prediction errors (Breiman, 1996; Polikar, 2006). Moreover, recent studies have shown that such ensemble techniques perform better than single intelligent techniques in financial distress prediction (Deligianni & Kotsiantis, 2012; Sun & Li, 2012). In this paper, a novel and improved Boosting, FS-Boosting, is proposed to predict corporate bankruptcy. Through injecting feature
selection strategy into Boosting, FS-Boosting can obtain better performance, as the base learners in FS-Boosting achieve greater accuracy and diversity. For testing and illustration purposes, two real world bankruptcy datasets were selected to demonstrate the effectiveness and feasibility of FS-Boosting. Among the eight methods compared, FS-Boosting obtains the best prediction accuracy on both datasets. Experimental results reveal that FS-Boosting can be used as an alternative method for corporate bankruptcy prediction.

The remainder of the paper is organized as follows. In Section 2, we review related work on corporate bankruptcy prediction. In Section 3, an improved Boosting, FS-Boosting, is proposed for corporate bankruptcy prediction. In Section 4, we present the details of the experimental design. Sections 5 and 6 present and analyze the empirical results and discussion. Based on the results and observations of these experiments, Section 7 draws conclusions and outlines future research directions.

2. Related work

Many techniques have been proposed by prior research. In this study, we classify these techniques into statistical techniques and intelligent techniques.

2.1. Statistical techniques for corporate bankruptcy prediction

In the past, many researchers have developed a variety of statistical techniques for corporate bankruptcy prediction. The main statistical methods include LDA, MDA, QDA, LRA, and FA. Some of the earliest techniques for corporate bankruptcy prediction were proposed by Beaver (1966) and Altman (1968), who used univariate discriminant analysis and multiple discriminant analysis, respectively, to identify corporations that would go bankrupt. Subsequently, due to the restrictive statistical requirements of normality of the explanatory variables and equality of the variance-covariance matrices across groups, logit and probit models were also applied (Ohlson, 1980; Zmijewski, 1984).
West used factor analysis to create composite variables describing a bank's financial and operating characteristics (West, 1985). Experimental results demonstrated that the combined method of factor analysis and logit was promising in evaluating a bank's condition. However, these conventional statistical techniques impose some restrictive assumptions, such as normality and independence among the predictor or input variables. Considering that violations of these assumptions frequently occur with financial data, the statistical techniques can be limited in their effectiveness and validity (Shin & Lee, 2002).

2.2. Intelligent techniques for corporate bankruptcy prediction

In recent years, many studies have demonstrated that intelligent techniques can serve as alternative methodologies for predicting corporate bankruptcy. Intelligent techniques automatically extract knowledge from a dataset and construct different model representations to explain the data set. The major difference between intelligent techniques and statistical techniques is that statistical techniques usually require researchers to impose structure on the model, such as the linearity in multiple regression analysis, and to construct the model by estimating parameters to fit the data or observations, while intelligent techniques allow the particular structure of the model to be learned from the data (Wang, Hao, Ma, & Jiang, 2011). The intelligent techniques frequently used include DT (Shaw & Gentry, 1990), ANN (Tam & Kiang, 1992; Tang & Chi, 2005), CBR (Buta, 1994; Shin & Han, 2001), and SVM (Min & Lee, 2005; Van Gestel et al., 2003). Shaw and Gentry applied DT to risk classification applications and found that the performance of DT was better than probit or logit analysis (Shaw & Gentry, 1990). Tam and Kiang used ANN to predict bankruptcy risk and compared ANN with a linear discriminant model, a logit model, DT and KNN (Tam & Kiang, 1992). Tang and Chi proposed a means to collect and determine explanatory variables using a between-countries approach (Tang & Chi, 2005). In addition, they established a systematic experiment to investigate the influences of techniques for both network architecture selection and variable selection on neural network models' learning and prediction capability. CBR, which benefits from utilizing case-specific knowledge of previously experienced problem situations, has also been applied to predict corporate bankruptcy: a new problem is solved by finding a similar past case and reusing it in the new problem domain. Buta developed a CBR model using financial data of 1000 companies in the Standard and Poor's Compustat database (Buta, 1994), and Shin and Han proposed a CBR method which used nearest neighbor matching algorithms to retrieve cases (Shin & Han, 2001). Another widely used method is SVM, whose formulation simultaneously embodies the structural risk and empirical risk minimization principles. Van Gestel et al. reported that least squares SVM obtained significantly better results when contrasted with the classical methods (Van Gestel et al., 2003). Min and Lee proposed a grid-search method using five-fold cross validation to find the optimal parameter values of the kernel function of SVM (Min & Lee, 2005) and found that SVM outperformed NN, MDA and logit models. However, there is still no overall best intelligent technique for predicting corporate bankruptcy.
Recent studies have shown that ensemble techniques perform better than single intelligent techniques for corporate bankruptcy prediction (Alfaro, García, Gámez, & Elizondo, 2008; Deligianni & Kotsiantis, 2012; Sun & Li, 2012; Sánchez-Lasheras, de Andrés, Lorca, & de Cos Juez, 2012). For example, Alfaro et al. compared two classification methods, i.e., AdaBoost and ANN, and experimental results showed the improvement in accuracy that AdaBoost achieves over ANN (Alfaro et al., 2008). Sun and Li proposed a new SVM ensemble method whose candidate single classifiers are trained by SVM algorithms with different kernel functions on different feature subsets of one initial dataset (Sun & Li, 2012). Deligianni and Kotsiantis found that an ensemble of classifiers could enable users to predict bankruptcies with satisfying precision long before the final bankruptcy (Deligianni & Kotsiantis, 2012). At the same time, some studies have found that ensemble methods are not always clearly superior to single classifiers (Alfaro-Cid et al., 2008; Nanni & Lumini, 2009; Tsai & Wu, 2008). This means that ensemble techniques should be adjusted according to the characteristics of corporate bankruptcy prediction. In this paper, an improved Boosting based on feature selection is proposed to predict corporate bankruptcy.
3. Feature selection based Boosting for bankruptcy prediction

3.1. Feature selection

Feature selection has been an active research area in the machine learning and data mining communities (Liu & Motoda, 1998). The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information. Feature selection reduces the dimensionality of the feature space and removes redundant, irrelevant, or noisy data. It brings immediate benefits for applications: speeding up a learning algorithm and improving data quality, and thereby classifier performance (Blum & Langley, 1997). Diverse feature selection techniques have been proposed in the machine learning and data mining literature. Feature selection
techniques broadly fall into two categories: the wrapper model and the filter model (Jain & Zongker, 1997; Kohavi & John, 1997). The wrapper model requires one predetermined learning algorithm in the feature selection process; features are selected based on their effect on the performance of that learning algorithm. For each new subset of features, the wrapper model needs to retrain the classifier. It tends to find features better suited to the predetermined learning algorithm, resulting in superior learning performance; however, because it embeds the learning algorithm in the selection process, the wrapper model is more computationally expensive (Hengpraprohm & Chongstitvatana, 2009; Liu & Motoda, 1998). The filter model relies on general characteristics of the training data to select a feature set without the involvement of a learning algorithm. In the filter model, feature set estimators evaluate features individually: a typical feature selection algorithm computes some relevance measure for each feature, mostly derived from statistical analysis of the samples in the training data set, and then assigns each feature a score. Once the features are ranked, in the second phase of the learning system one is often interested in achieving maximum classification accuracy with a minimum number of features (Liu & Motoda, 1998). Considering the computational complexity of ensemble methods and the higher computational burden of the wrapper model, Information Gain (IG) based feature selection, one of the most popular filter models, is adopted in our research (Dash & Liu, 1997; Forman, 2003). The definition of IG is based on entropy, a measure commonly used in information theory which characterizes the purity of an arbitrary collection of examples and lies at the foundation of IG based feature selection. The entropy measure can be considered a measure of the system's unpredictability. The entropy of Y is:
H(Y) = - Σ_{y∈Y} p(y) log2(p(y))    (1)
where p(y) is the marginal probability of the value y of the random variable Y. If the observed values of Y in the training data set S are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between the features Y and X. The entropy of Y after observing X is:
H(Y|X) = - Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log2(p(y|x))    (2)
where p(y|x) is the conditional probability of y given x. Given entropy as a criterion of impurity in a training data set S, we can define a measure reflecting the additional information about Y provided by X, i.e., the amount by which the entropy of Y decreases. This measure is known as IG. It is given by:
IG = H(Y) - H(Y|X) = H(X) - H(X|Y)    (3)
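As a concrete illustration of Eqs. (1)-(3), the following is a minimal sketch of entropy and information gain estimated from a sample (the toy feature and label values are invented for the example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y p(y) * log2 p(y), estimated from a sample (Eq. (1))."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y|X) = sum_x p(x) * H(Y | X = x) (Eq. (2))."""
    n = len(feature)
    h = 0.0
    for x in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == x]
        h += len(subset) / n * entropy(subset)
    return h

def info_gain(feature, labels):
    """IG = H(Y) - H(Y|X) (Eq. (3))."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy data: X perfectly predicts Y, while Z carries no information about Y.
Y = [1, 1, 0, 0]
X = ['a', 'a', 'b', 'b']
Z = ['a', 'b', 'a', 'b']

print(info_gain(X, Y))  # 1.0
print(info_gain(Z, Y))  # 0.0
```

Note that swapping the roles of feature and label leaves the result unchanged, which is the symmetry property of Eq. (3).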
IG is a symmetrical measure (refer to Eq. (3)): the information gained about Y after observing X is equal to the information gained about X after observing Y.

3.2. FS-Boosting for bankruptcy prediction

Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use (Opitz & Maclin, 1999). The base learners are the components of the ensemble. Many ensemble methods are able to boost weak learners, which are only slightly better than random guessing, into strong learners that can make very accurate predictions. Therefore, "base learners"
are also referred to as "weak learners". However, it is noteworthy that although most theoretical analyses assume weak learners, the base learners used in practice are not necessarily weak, since using not-so-weak base learners often results in better performance. One of the earliest works on ensemble systems is Dasarathy and Sheela's paper (Dasarathy & Sheela, 1979), which discussed using two or more classifiers to partition the feature space. In 1990, Hansen and Salamon showed that the generalization performance of an ANN can be improved by using an ensemble of similarly configured ANNs (Hansen & Salamon, 1990). Schapire demonstrated that a strong classifier in the probably approximately correct (PAC) sense can be generated by combining weak classifiers through boosting, producing the predecessor of the suite of AdaBoost algorithms (Schapire, 1990). Since these seminal works, research on ensemble systems has expanded rapidly, often appearing in the literature under many creative names and ideas. The generalization ability of an ensemble method is usually much stronger than that of a single learner, which makes ensemble methods very attractive. Dietterich gave three reasons for this by viewing the nature of machine learning as searching a hypothesis space for the most accurate hypothesis (Dietterich, 1997). Firstly, the training data might not provide sufficient information for choosing a single best learner; for example, there may be many learners performing equally well on the training set, and combining these learners may be a better choice. Secondly, the search processes of the learning algorithms might be imperfect: even if there is a unique best hypothesis, it might be difficult to find, since running the algorithms results in sub-optimal hypotheses. Thus, ensembles can compensate for such imperfect search processes.
Thirdly, the hypothesis space being searched might not contain the true target function, while ensembles can give a good approximation of it. For example, the classification boundaries of a DT are linear segments parallel to the coordinate axes; if the target classification boundary is a smooth diagonal line, using a single DT will lead to a poor approximation. Although the above intuitive explanations are reasonable, they lack rigorous theoretical analyses. In practice, to achieve a good ensemble, two necessary conditions should be satisfied: accuracy and diversity. Each base learner should be more accurate than random guessing, and each base learner should have its own knowledge about the problem, with a different pattern of errors than the other base learners. In general, ensemble methods can be divided into two classes: instance partitioning methods and feature partitioning methods. For example, Boosting and Bagging belong to the former, while Random Subspace (Ho, 1998) belongs to the latter. Boosting and Bagging perturb the instance space to obtain diversity, while Random Subspace perturbs the feature space. Recently, in order to enforce the diversity of base learners, some explorations have been made from the perspective of integrating instance partitioning and feature partitioning (Breiman, 2001; Rodriguez, Kuncheva, & Alonso, 2006). For example, a version of Bagging called Random Forest was proposed by Breiman (Breiman, 2001). The ensemble consists of decision trees built, again, on bootstrap samples. The difference lies in the construction of the decision tree: the feature used to split a node is selected as the best feature among a set of M randomly chosen features, where M is a parameter of the algorithm. This small alteration appeared to be a winning heuristic, in that diversity was introduced without much compromising the accuracy of the individual base learners.
However, Random Forest is prone to overfitting in noisy classification tasks, and it is ineffective when handling a large number of irrelevant features (Breiman, 2001). Following this study, Rodriguez and Kuncheva proposed a rotation approach, named Rotation Forest (Rodriguez et al., 2006), to simultaneously encourage individual accuracy and diversity within the ensemble. Diversity is promoted through feature extraction for each base learner. To create the
training data for the base learners, the feature set is randomly split into K subsets and Principal Component Analysis is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Decision trees were chosen as base learners because they are sensitive to rotation of the feature axes. Experimental results revealed that Rotation Forest constructs individual base learners which are more accurate than those in AdaBoost and Random Forest, and more diverse than those in Bagging, sometimes more accurate as well. Like Random Forest, Rotation Forest is also ineffective when encountering a large number of irrelevant features (Rodriguez et al., 2006). It can be seen that the above methods, which integrate instance partitioning and feature partitioning, obtain better performance. However, these methods pay more attention to one particular instance partitioning method, Bagging, and give less attention to Boosting. Meanwhile, feature selection, another form of feature partitioning, has seldom been considered. Features that are irrelevant to the learning task may deteriorate the performance of learning algorithms; therefore, the omission of some features can be not only tolerable but even desirable relative to the costs involved. Feature selection can influence the quality of an ensemble method in several ways, e.g., by reducing learner complexity, promoting the diversity of base learners, and affecting the trade-off between the accuracy and diversity of base learners. Compared with Bagging and Random Subspace, Boosting is by nature more easily affected by noisy data. In order to tackle the irrelevant features problem, we introduce a feature selection method into Boosting and propose an improved Boosting: FS-Boosting. Through injecting a feature selection strategy into Boosting, FS-Boosting can obtain better performance, as the base learners in FS-Boosting achieve greater accuracy and diversity.
FS-Boosting combines one of the most popular feature selection methods, IG based feature selection, with the standard Boosting procedure. We utilize feature selection, i.e., IG based feature selection, to enhance the accuracy and diversity of the base
learners. The pseudo-code for the FS-Boosting algorithm is given in Fig. 1. The proposed FS-Boosting algorithm proceeds in a series of T rounds. In every round, a base learner is trained with a different distribution D_t^FS, altered by emphasizing particular training instances. The distribution is updated to give wrongly classified instances higher weights than correctly classified ones. After obtaining a new dataset, FS-Boosting applies feature selection to this dataset to reduce the noise. The entire weighted training set is then given to the base learner to compute the hypothesis h_t. At the end, the different hypotheses are combined into a final hypothesis H(x).

4. Experimental design

4.1. Real world bankruptcy datasets

Two real world datasets were used to test the performance of the proposed method. The first dataset was collected by Pietruszkiewicz (2008). It contains 240 companies, including 112 failed companies. The time period of the dataset is from 1997 to 2001, before bankruptcy took place. A total of 30 financial variables were used for prediction. The particular explanations of these financial variables are listed in Table 1. The second dataset is selected from the CD-ROM database (Shmueli, Patel, & Bruce, 2011) and includes 132 companies (66 non-risk cases and 66 risk cases). The time period of the dataset is from 1970 to 1982. A total of 24 financial variables were computed for each of the 132 companies using data from Compustat and from Moody's Industrial Manual. The particular explanations of these financial variables are listed in Table 2.

4.2. Evaluation criteria

The evaluation criteria of our experiments were adopted from the established standard measures in the fields of corporate
Fig. 1. The FS-Boosting algorithm.
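The procedure of Fig. 1 can be sketched as follows. This is a minimal illustration of the idea, not the authors' WEKA-based implementation: it uses decision stumps on binary features as base learners, and all function and parameter names are our own simplifications. Each boosting round first ranks the features by weighted information gain and keeps only the top fraction before training the base learner.

```python
import numpy as np

def entropy(class_weights):
    """Entropy of a distribution given (possibly unnormalized) class weights."""
    p = class_weights / class_weights.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def weighted_info_gain(x, y, w):
    """IG = H(Y) - H(Y|X) on weighted samples, for a binary feature x."""
    h_y = entropy(np.array([w[y == 0].sum(), w[y == 1].sum()]))
    h_y_x = 0.0
    for v in (0, 1):
        m = x == v
        wv = w[m].sum()
        if wv > 0:
            h_y_x += wv / w.sum() * entropy(
                np.array([w[m & (y == 0)].sum(), w[m & (y == 1)].sum()]))
    return h_y - h_y_x

def train_stump(X, y, w):
    """Best weighted decision stump on binary features: (feature, polarity, error)."""
    best = (0, 1, np.inf)
    for j in range(X.shape[1]):
        for pol in (0, 1):
            pred = np.where(X[:, j] == 1, pol, 1 - pol)
            err = w[pred != y].sum()
            if err < best[2]:
                best = (j, pol, err)
    return best

def stump_predict(X, j, pol):
    return np.where(X[:, j] == 1, pol, 1 - pol)

def fs_boosting(X, y, T=10, subspace_ratio=0.7):
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # uniform initial distribution
    k = max(1, int(round(subspace_ratio * d)))
    ensemble = []
    for _ in range(T):
        # Feature selection step: rank features by IG under the current weights.
        gains = np.array([weighted_info_gain(X[:, j], y, w) for j in range(d)])
        keep = np.argsort(gains)[::-1][:k]
        j_sel, pol, err = train_stump(X[:, keep], y, w)
        j = keep[j_sel]
        if err >= 0.5:                       # base learner no better than chance
            break
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, j, pol)
        w *= np.exp(alpha * (pred != y))     # up-weight misclassified instances
        w /= w.sum()
        ensemble.append((alpha, j, pol))
    return ensemble

def ensemble_predict(ensemble, X):
    score = np.zeros(X.shape[0])
    for alpha, j, pol in ensemble:
        score += alpha * (2.0 * stump_predict(X, j, pol) - 1.0)
    return (score > 0).astype(int)

# Synthetic demo: one informative binary feature (10% label noise) among noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.integers(0, 2, (200, 8))
X[:, 0] = y ^ (rng.random(200) < 0.1)
ens = fs_boosting(X, y, T=10, subspace_ratio=0.5)
acc = (ensemble_predict(ens, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Here `subspace_ratio` plays the role of the subspace ratio that is varied from 0.5 to 0.9 in the experiments of Section 4.3.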
bankruptcy prediction. These measures include average accuracy, type I error and type II error. Each measure has its merits and limitations. In this study, we used a combination of these measures, rather than a single measure, to assess the performance of the proposed methods. The definition of these measures can be explained with respect to a confusion matrix as shown in Table 3.

Table 3. Confusion matrix for bankruptcy.

                                         Actual condition
Test result                              Positive (Non-bankruptcy)   Negative (Bankruptcy)
Positive (Non-bankruptcy)                True Positive (TP)          False Positive (FP)
Negative (Bankruptcy)                    False Negative (FN)         True Negative (TN)

Formally speaking, they are defined as follows:

Average accuracy = (TP + TN) / (TP + FP + FN + TN)    (4)

Type I Error = FN / (TP + FN)    (5)

Type II Error = FP / (FP + TN)    (6)

Table 1. Financial variables for bankruptcy 1.

X1: Cash/Current liabilities
X2: Cash/Total assets
X3: Current assets/Current liabilities
X4: Current assets/Total assets
X5: Working capital/Total assets
X6: Working capital/Sales
X7: Sales/Inventory
X8: Sales/Receivables
X9: Net profit/Total assets
X10: Net profit/Current assets
X11: Net profit/Sales
X12: Gross profit/Sales
X13: Net profit/Liabilities
X14: Net profit/Equity
X15: Net profit/(Equity + Long term liabilities)
X16: Sales/Receivables
X17: Sales/Total assets
X18: Sales/Current assets
X19: (365 * Receivables)/Sales
X20: Sales/Total assets
X21: Liabilities/Total income
X22: Current liabilities/Total income
X23: Receivables/Liabilities
X24: Net profit/Sales
X25: Liabilities/Total assets
X26: Liabilities/Equity
X27: Long term liabilities/Equity
X28: Current liabilities/Equity
X29: EBIT/Total assets
X30: Current assets/Sales

Table 2. Financial variables for bankruptcy 2.

X1: Cash/Current debt
X2: Cash/Sales
X3: Cash/Total assets
X4: Cash/Total debt
X5: Cash flow from operations/Sales
X6: Cash flow from operations/Total assets
X7: Cash flow from operations/Total debt
X8: Cost of goods sold/Inventory
X9: Current assets/Current debt
X10: Current assets/Sales
X11: Current assets/Total assets
X12: Current debt/Total debt
X13: Income/Sales
X14: Income/Total assets
X15: Income/Total debt
X16: Income plus depreciation/Sales
X17: Income plus depreciation/Total assets
X18: Income plus depreciation/Total debt
X19: Sales/Receivables
X20: Sales/Total assets
X21: Total assets/Total debt
X22: Working capital from operations/Sales
X23: Working capital from operations/Total assets
X24: Working capital from operations/Total debt
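The three evaluation measures defined above can be computed directly from the confusion-matrix counts; the following is a small sketch (the counts are invented for illustration):

```python
def evaluation_measures(tp, fp, fn, tn):
    """Average accuracy, type I error and type II error from a confusion matrix
    (positive = non-bankruptcy, negative = bankruptcy, as in Table 3)."""
    average_accuracy = (tp + tn) / (tp + fp + fn + tn)  # Eq. (4)
    type_i_error = fn / (tp + fn)   # Eq. (5): non-bankrupt firms classified as bankrupt
    type_ii_error = fp / (fp + tn)  # Eq. (6): bankrupt firms classified as non-bankrupt
    return average_accuracy, type_i_error, type_ii_error

acc, t1, t2 = evaluation_measures(tp=80, fp=10, fn=20, tn=90)
print(acc, t1, t2)  # 0.85 0.2 0.1
```

Note that a type II error (accepting a firm that later goes bankrupt) is usually the more costly mistake for a financial institution, which is why the three measures are reported separately rather than as a single score.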
4.3. Experimental procedures

The experiments described in this section were performed on a PC with a 1.83 GHz Intel Core Duo CPU and 2 GB RAM, running the Windows XP operating system. The data mining toolkit WEKA (Waikato Environment for Knowledge Analysis), version 3.6.6, was used for the experiments. WEKA is an open source toolkit consisting of a collection of machine learning algorithms for solving data mining problems (Witten, Frank, & Hall, 2011). In this study, we compared FS-Boosting with seven other methods commonly used in corporate bankruptcy prediction, i.e., LRA, NB, DT, ANN, SVM, Bagging, and Boosting. For the implementation of LRA, NB, DT, ANN and SVM, we chose the Logistic, NaiveBayes, J48 (WEKA's own version of C4.5), MultilayerPerceptron and SMO modules, respectively. For the implementation of ensemble learning, i.e., Bagging and Boosting, we chose the Bagging and AdaBoostM1 modules. FS-Boosting was implemented in Eclipse using the WEKA package (weka.jar). In addition, we used DT as the base classifier in the experiments, following Opitz and Maclin (Opitz & Maclin, 1999) and Fu et al. (Fu, Golden, Lele, Raghavan, & Wasil, 2006). Except when stated otherwise, all the default parameters in WEKA were used. Moreover, for FS-Boosting, five subspace ratios of the feature set were tested, with the ratio set to 0.5, 0.6, 0.7, 0.8 and 0.9, respectively. To minimize the influence of variability in the training set, ten times 10-fold cross validation was performed. In detail, each corporate bankruptcy prediction dataset was partitioned into ten subsets with similar sizes and distributions. The union of nine subsets was used as the training set while the remaining subset was used as the test set. This process was repeated ten times, such that every subset was used as the test set once. The average test result was regarded as the result of the 10-fold cross validation.
The whole above process was repeated ten times with random partitions into the ten subsets, and the average results of these different partitions were recorded.

5. Experimental results

In this paper, an improved Boosting, FS-Boosting, is proposed to predict bankruptcy and reduce the losses of financial institutions. Table 4 summarizes the performance indicators of the different methods on the two bankruptcy datasets, where the values following "±" are standard deviations. Firstly, we consider the results on the bankruptcy 1 dataset. As shown in Table 4, FS-Boosting gets the highest average accuracy, 81.50%. The two other ensemble methods, i.e., Bagging and Boosting, also get high average accuracies, 76.21% and 77.42%, respectively. For the type I and II errors, FS-Boosting reduces the two indicators to 12.81% and 25.02%. Compared with Boosting, the reason why FS-Boosting gets higher accuracy is that FS-Boosting greatly reduces the type I error. It is interesting that although NB also gets a low type I error, its type II error is the highest among the eight methods.
Table 4. Results of different methods.

             Bankruptcy 1                                         Bankruptcy 2
Methods      Avg. Acc. (%)   Type I (%)      Type II (%)          Avg. Acc. (%)   Type I (%)      Type II (%)
LRA          74.04 ± 2.63    23.24 ± 6.20    29.09 ± 8.01         73.90 ± 6.05    26.38 ± 10.53   25.83 ± 10.94
NB           67.33 ± 3.72    12.49 ± 10.18   55.73 ± 16.06        77.78 ± 5.41    29.93 ± 12.28   14.48 ± 8.78
DT           71.63 ± 5.47    30.21 ± 10.42   26.19 ± 13.69        75.99 ± 5.14    23.10 ± 11.39   24.74 ± 10.98
ANN          73.38 ± 3.03    25.37 ± 9.65    28.07 ± 10.25        75.69 ± 5.70    20.62 ± 12.02   27.69 ± 12.88
SVM          72.21 ± 3.27    30.19 ± 8.04    25.01 ± 7.15         79.99 ± 4.59    21.55 ± 10.83   18.19 ± 9.64
Bagging      76.21 ± 3.46    21.16 ± 6.88    26.80 ± 8.49         81.22 ± 4.91    17.83 ± 9.13    19.50 ± 10.52
Boosting     77.42 ± 4.18    20.16 ± 7.73    25.36 ± 7.24         79.40 ± 4.08    19.64 ± 10.29   21.17 ± 10.56
FS-Boosting  81.50 ± 4.01    12.81 ± 6.01    25.02 ± 8.88         86.79 ± 4.22    18.24 ± 10.18   7.86 ± 6.77

Then we consider the bankruptcy 2 dataset. As shown in Table 4, FS-Boosting again gets the highest average accuracy, 86.79%. Bagging and Boosting get the next highest average accuracies, 81.22% and 79.40%, respectively. For the type I and II errors, FS-Boosting gets the lowest type II error, 7.86%. The type I error of FS-Boosting is lower than that of Boosting, but higher than that of Bagging. It is clear that the reason why FS-Boosting gets the highest accuracy is the great reduction of the type II error. Among the single classifiers, SVM gets the highest average accuracy, 79.99%.

6. Discussion

In order to ensure that the assessment does not happen by chance, we tested the significance of the above results by means of the paired t-test. The null hypothesis is "Model A's mean Average Accuracy / Type I Error / Type II Error = Model B's mean Average Accuracy / Type I Error / Type II Error". The alternative hypothesis is "Model A's mean Average Accuracy / Type I Error / Type II Error ≠ Model B's mean Average Accuracy / Type I Error / Type II Error". The column 'Improvement' gives the relative improvement in mean Average Accuracy (Type I Error or Type II Error) that Model A gives over Model B. The results are summarized
in Tables 5 and 6. As shown in Tables 5 and 6, the proposed FS-Boosting is significantly better than the other seven methods.

As the random subspace ratio is an important parameter of FS-Boosting, its value has a great influence on performance. Figs. 2 and 3 display the average accuracy, type I error, and type II error curves of the eight methods when the subspace ratio varies from 0.5 to 0.9. As shown in Figs. 2 and 3, the performance of FS-Boosting varies with the subspace ratio; however, FS-Boosting achieves the best average accuracy on both bankruptcy datasets under every setting. This means that although the performance of FS-Boosting is affected by the subspace ratio, it still outperforms the traditional methods, and that its performance is not overly sensitive to this parameter.

The better performance of FS-Boosting stems from the introduction of feature selection into Boosting. This raises another question: could feature selection alone achieve comparable or better results? Another experiment was conducted to examine the effectiveness of feature selection on its own, with the subspace ratio again varying from 0.5 to 0.9. The experimental results are shown in Figs. 4 and 5, where FS-DT denotes a DT using the feature selection technique. It can be seen that feature selection alone cannot achieve better accuracy. Although FS-DT can achieve a better type I or II error, the
Table 5
Significance test results of the paired t-test (Bankruptcy 1). Method A is FS-Boosting.

              Average Accuracy           Type I Error               Type II Error
Method B      Improvement    t           Improvement    t           Improvement    t
LRA           10.07%         8.684**     81.48%         9.841**     16.29%         3.118**
NB            21.04%         13.765**    2.50%          0.180       122.80%        15.141**
DT            13.79%         10.553**    135.89%        12.615**    4.69%          0.729
ANN           11.07%         8.905**     98.05%         9.598**     12.20%         1.650
SVM           12.87%         9.906**     135.69%        14.183**    0.03%          0.005
Bagging       6.94%          8.810**     65.22%         11.310**    7.15%          1.734*
Boosting      5.27%          5.870**     57.41%         8.903**     1.39%          0.334

* P-values significant at alpha = 0.1.
** P-values significant at alpha = 0.01.

Table 6
Significance test results of the paired t-test (Bankruptcy 2). Method A is FS-Boosting.

              Average Accuracy           Type I Error               Type II Error
Method B      Improvement    t           Improvement    t           Improvement    t
LRA           17.44%         13.021**    44.65%         5.588**     228.79%        11.774**
NB            11.59%         8.834**     64.10%         7.888**     84.24%         4.657**
DT            14.22%         11.706**    26.63%         3.320**     214.85%        11.308**
ANN           14.66%         11.811**    13.05%         1.712*      252.42%        11.534**
SVM           8.50%          7.435**     18.15%         2.767**     131.52%        7.463**
Bagging       6.86%          6.977**     2.22%          0.441       148.18%        8.946**
Boosting      9.31%          9.278**     7.70%          1.468       169.39%        10.668**

* P-values significant at alpha = 0.1.
** P-values significant at alpha = 0.01.
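The significance test reported in Tables 5 and 6 can be reproduced in principle with a standard paired t-test over matched per-fold scores. A minimal sketch follows; the fold-level scores in the test are made up for illustration, since the paper reports only the aggregated means and standard deviations.

```python
import math

def paired_t(scores_a, scores_b):
    """t statistic for H0: mean(scores_a) == mean(scores_b),
    where scores_a[i] and scores_b[i] come from the same CV fold."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

def relative_improvement(scores_a, scores_b):
    """Relative improvement of model A's mean over model B's mean,
    as reported in the 'Improvement' columns."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return (mean_a - mean_b) / mean_b
```

With 10 × 10-fold cross validation there are 100 paired scores per comparison; the computed t statistic would then be checked against the Student-t critical values at alpha = 0.1 and alpha = 0.01, matching the * and ** markers in the tables.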
Fig. 2. Sensitivity analysis of different methods (Bankruptcy 1).
Fig. 3. Sensitivity analysis of different methods (Bankruptcy 2).
Fig. 4. Sensitivity analysis of feature selection (Bankruptcy 1).
Fig. 5. Sensitivity analysis of feature selection (Bankruptcy 2).
average accuracy of FS-DT is worse than that of FS-Boosting. These results also indicate that the advantages of FS-Boosting stem from two complementary aspects: Boosting and feature selection.
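As a concrete illustration of how feature selection can be injected into Boosting, the sketch below runs an AdaBoost-style loop in which each round's base learner only sees a feature subset whose size is governed by the subspace ratio. Two simplifications to note: the subset here is drawn at random (the paper ranks features by information gain), and a decision stump stands in for the base decision tree. This is a sketch of the idea under those assumptions, not the authors' implementation.

```python
import math
import random

def stump_fit(X, y, w, feats):
    """Weighted decision stump restricted to the allowed feature subset."""
    best = None  # (weighted error, feature, threshold, sign)
    for f in feats:
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[f] >= thr else -sign for x in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def fs_boosting(X, y, n_rounds=10, ratio=0.7, seed=0):
    """Boosting with a per-round feature subset of size ratio * n_features.
    Labels y must be in {-1, +1} (e.g. +1 = bankrupt)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(ratio * d))
    w = [1.0 / n] * n                      # uniform initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        feats = rng.sample(range(d), k)    # random stand-in for info-gain ranking
        err, f, thr, sign = stump_fit(X, y, w, feats)
        if err >= 0.5:                     # no better than chance: stop early
            break
        alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-10))
        ensemble.append((alpha, f, thr, sign))
        for i, x in enumerate(X):          # re-weight misclassified firms upward
            p = sign if x[f] >= thr else -sign
            w[i] *= math.exp(-alpha * p * y[i])
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x[f] >= thr else -s) for a, f, thr, s in ensemble)
    return 1 if score >= 0 else -1
```

Varying `ratio` between 0.5 and 0.9 in this sketch corresponds to the subspace-ratio sensitivity analysis of Figs. 2–5: a smaller ratio increases the diversity of the base learners, while a larger ratio preserves more of their individual accuracy.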
7. Conclusions and future directions

Owing to the recent financial crisis and European debt crisis, bankruptcy prediction has become an increasingly important issue for financial institutions. Meanwhile, ensemble learning is a powerful machine learning paradigm that has exhibited apparent advantages in many applications. In this paper, an improved Boosting, FS-Boosting, is proposed to predict bankruptcy and reduce the losses of financial institutions. FS-Boosting integrates the advantages of Boosting and feature selection to enhance prediction performance. Empirical results demonstrate that FS-Boosting can be used as an alternative method for bankruptcy prediction.

Several future research directions emerge from this study. Firstly, larger data sets for experiments and applications, particularly with more exploration of bankruptcy data structures, should be collected to further validate the conclusions of the study. Secondly, for reasons of efficiency, FS-Boosting uses an information-gain-based filter feature selection method to overcome the shortcoming that Boosting is easily affected by noisy data; other filter methods and wrapper methods should be considered in future research. Thirdly, as FS-Boosting belongs to the class of hybrid systems (Sánchez-Lasheras et al., 2012), more detailed comparative studies are needed in future research.
Fourthly, this research only provides an empirical test of the effectiveness of FS-Boosting; theoretical analysis should be considered in future research.
Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 71071045, 71131002, 71101042), the Specialized Research Fund for the Doctoral Program of Higher Education (20110111120014), the China Postdoctoral Science Foundation (2011M501041, 2013T60611), and the Special Fund of AnHui Province Key Research Institute of Humanities and Social Sciences at Universities (SK2013B400).
References Alfaro, E., García, N., Gámez, M., & Elizondo, D. (2008). Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45, 110–122. Alfaro-Cid, E., Castillo, P. A., Esparcia, A., Sharman, K., Merelo, J. J., Prieto, A., Mora, A. M., & Laredo, J. L. J. (2008). Comparing multiobjective evolutionary ensembles for minimizing type I and II errors for bankruptcy prediction. In Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on (pp. 2902–2908). Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The journal of finance, 23, 589–609. Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111. Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245–271. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. Buta, P. (1994). Mining for financial knowledge with CBR. AI Expert, 9, 34–41. Dasarathy, B. V., & Sheela, B. V. (1979). A composite classifier system design: Concepts and methodology. Proceedings of the IEEE, 67, 708–713. Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1, 131–156. Deligianni, D., & Kotsiantis, S. (2012). Forecasting corporate bankruptcy with an ensemble of classifiers. In Lecture notes in computer science Vol. 7297 (pp. 65– 72). Dietterich, T. (1997). Machine learning research: Four current directions. AI Magazine, 18, 97–136. Duéñez-Guzmán, E. A., & Vose, M. D. (2013). No free lunch and benchmarks. Evolutionary Computation, 21, 293–312. Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289–1305. Fu, Z., Golden, B. 
L., Lele, S., Raghavan, S., & Wasil, E. (2006). Diversification for better classification trees. Computers & Operations Research, 33, 3185–3202. Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12, 993–1001. Hengpraprohm, S., & Chongstitvatana, P. (2009). Feature selection by weighted-snr for cancer microarray data classification. International Journal of Innovative Computing, Information and Control, 5, 4627–4635. Ho, T. K. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20, 832–844. Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19, 153–158. Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
Li, H., Andina, D., & Sun, J. (2012). Multiple proportion case-basing driven CBRE and its application in the evaluation of possible failure of firms. International Journal of Systems Science, 44, 1409–1425. Li, H., & Sun, J. (2009). Gaussian case-based reasoning for business failure prediction with empirical data in China. Information Sciences, 179, 89–108. Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining (Vol. 454). Springer. Min, J. H., & Lee, Y. C. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 28, 603–614. Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36, 3028–3033. Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18, 109–131. Olson, D. L., Delen, D., & Meng, Y. (2012). Comparative analysis of data mining methods for bankruptcy prediction. Decision Support Systems, 52, 464–473. Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198. Pietruszkiewicz, W. (2008). Dynamical systems and nonlinear Kalman filtering applied in classification. In Cybernetic Intelligent Systems, 2008. CIS 2008. 7th IEEE International Conference on (pp. 1–6). Polikar, R. (2006). Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6, 21–45. Rodriguez, J. J., Kuncheva, L. I., & Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28, 1619–1630. Sánchez-Lasheras, F., de Andrés, J., Lorca, P., & de Cos Juez, F. J. (2012). A hybrid device for the solution of sampling bias problems in the forecasting of firms’ bankruptcy. Expert Systems with Applications, 39, 7512–7523. Schapire, R. E. 
(1990). The strength of weak learnability. Machine Learning, 5, 197–227. Shaw, M. J., & Gentry, J. A. (1990). Inductive learning for risk classification. IEEE Expert, 5, 47–53. Shin, K., & Han, I. (2001). A case-based approach using inductive indexing for corporate bond rating. Decision Support Systems, 32, 41–52. Shin, K. S., & Lee, Y. J. (2002). A genetic algorithm application in bankruptcy prediction modeling. Expert Systems with Applications, 23, 321–328. Shmueli, G., Patel, N. R., & Bruce, P. C. (2011). Data mining for business intelligence: Concepts, techniques, and applications in microsoft office excel with xlminer. Wiley. Sun, J., & Li, H. (2012). Financial distress prediction using support vector machines: Ensemble vs. individual. Applied Soft Computing, 12, 2254–2265. Tam, K. Y., & Kiang, M. Y. (1992). Managerial applications of neural networks: the case of bank failure predictions. Management Science, 38, 926–947. Tang, T. C., & Chi, L. C. (2005). Neural networks analysis in business failure prediction of Chinese importers: A between-countries approach. Expert Systems with Applications, 29, 244–255. Tsai, C.-F., & Wu, J.-W. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34, 2639–2649. Van Gestel, T., Baesens, B., Suykens, J., Espinoza, M., Baestaens, D. E., Vanthienen, J., et al. (2003). Bankruptcy prediction with least squares support vector machine classifiers. In 2003 IEEE international conference on computational intelligence for financial engineering (pp. 1–8). IEEE. Wang, G., Hao, J., Ma, J., & Jiang, H. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications, 38, 223–230. Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems, 26, 61–68. West, R. C. (1985). A factor-analytic approach to bank condition. Journal of Banking & Finance, 9, 253–266. 
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann. Zhou, L., Lai, K. K., & Yen, J. (2014). Bankruptcy prediction using SVM models with a new approach to combine features selection and parameter optimisation. International Journal of Systems Science, 45, 241–253. Zmijewski, M. E. (1984). Methodological issues related to the estimation of financial distress prediction models. Journal of Accounting Research, 22, 59–82.