Knowledge-Based Systems 22 (2009) 120–127
Feature selection in bankruptcy prediction

Chih-Fong Tsai *

Department of Information Management, National Central University, 300 Jhongda Road, Jhongli 32001, Taiwan

* Tel.: +886 3 4227151; fax: +886 3 4254604. E-mail address: [email protected]
doi:10.1016/j.knosys.2008.08.002
Article info

Article history: Received 9 January 2008; received in revised form 14 July 2008; accepted 7 August 2008; available online 14 August 2008.

Keywords: Feature selection; Data mining; Bankruptcy prediction; Neural networks
Abstract

For many corporations, assessing the credit of investment targets and the possibility of bankruptcy is a vital issue before investment. Data mining and machine learning techniques have been applied to solve the bankruptcy prediction and credit scoring problems. Although feature selection is an important step for selecting more representative data from a given dataset and thereby improving the final prediction performance, it is unknown which feature selection method performs best. Therefore, this paper compares five well-known feature selection methods used in bankruptcy prediction, namely the t-test, correlation matrix, stepwise regression, principal component analysis (PCA), and factor analysis (FA), and examines their prediction performance. Multi-layer perceptron (MLP) neural networks are used as the prediction model, and five related datasets are used in order to provide a reliable conclusion. The experimental results show that the t-test feature selection method outperforms the others on the two performance measures.

© 2008 Elsevier B.V. All rights reserved.
1. Introduction

The business and academic communities have paid much attention to bankruptcy prediction. Incorrect decision-making in financial institutions can lead to financial difficulty or distress and impose large social costs on owners, shareholders, managers, workers, lenders, suppliers, clients, the community, and government. As a result, bankruptcy prediction has been one of the most challenging tasks and a major research topic in accounting and finance.

The advancement of information technology allows us to obtain a variety of information about the risk status of a company from many sources, such as professional agencies and the mass media. When evaluating a great amount of such information, many people rely on an analyst's judgment; however, various factors can influence the result of the analysis. Statistical and artificial intelligence (AI) methods can instead be used to identify important factors for bankruptcy prediction.

In the field of bankruptcy prediction, AI methods have been developed for a long time. They are used to build models that evaluate whether corporations face financial distress. The grand assumption is that financial variables extracted from public financial statements, such as financial ratios, contain a large amount of information about a company's financial status, which may be a factor causing bankruptcy [1]. It is a complicated process to utilize
those related financial data and other information, from an enterprise's strategic competitiveness down to its operational details, to establish an effective model.

Along with the development of AI and database technology, data mining techniques have gradually been applied in various domains. In bankruptcy prediction, data mining techniques are able to predict business failures, which can be very important to the staff concerned in two different ways. First, they can be used as "early warning systems". Such systems are very useful to those (e.g. managers and authorities) who can take actions to prevent business failures; these actions include decisions about merging the distressed firm, liquidation or reorganization, and the associated costs. Second, these systems can help decision makers in financial institutions to evaluate and select firms to collaborate with or to invest in. Such decisions have to take into account the opportunity cost and the risk of failure [2].

Deeply analyzing a huge amount of corporate information is likely to take much time and many human resources, and when irrelevant information is overabundant, the information cannot be interpreted and absorbed easily. Therefore, how to filter and condense the large amount of data is a very important issue in predicting business failures, especially for bankruptcy prediction. Feature selection, as the preprocessing step, is one of the most important steps in the data mining process. It aims at filtering out redundant and/or irrelevant features from the original data [3].

Related work attempts to design various mathematical calculations and/or combine different models to tackle the bankruptcy prediction problem. However, the crucial process of feature
selection, that is, selecting more informative data to effectively predict bankruptcy, has not been carefully considered in many bankruptcy prediction studies. Superfluous and redundant information fed into a model can consume much time and cost, and can even reduce the accuracy of the model [4,5]. As there are a number of statistics-based feature selection methods used for bankruptcy prediction, the research question of this paper is which method allows the models to provide the best performance.

In this paper, we consider five feature selection methods which have been applied in bankruptcy prediction and compare their prediction accuracy and Type I and II errors. They are the t-test, correlation matrix, stepwise regression, principal component analysis (PCA), and factor analysis (FA). It should be noted that although some machine learning techniques, such as self-organizing maps (SOM) [6] and genetic algorithms [7], can be applied to select representative features, they are not widely considered in the business domain, especially for bankruptcy prediction. Therefore, the aim of this paper is to first examine the traditional statistics-based feature selection methods for bankruptcy prediction. The contributions of this paper allow us not only to identify the best feature selection method for effective bankruptcy prediction but also to provide a baseline feature selection method for future related research.

This paper is organized as follows. Section 2 briefly describes the data mining methods applied in bankruptcy prediction and reviews related work. Section 3 describes the experimental methodology. Experimental results are presented in Section 4, and the conclusion is provided in Section 5.
2. Literature review

2.1. Bankruptcy prediction

Sometimes a firm can become distressed and continue to operate in that condition for many years. On the other hand, some firms enter bankruptcy immediately after a highly distressing event, such as a major fraud. A number of factors influence these outcomes. Lensberg et al. [8] investigate related work and categorize the factors that potentially affect bankruptcy: audit, financial ratios, fraud indicators, start-up, and stress, which are measured by qualitative or quantitative variables.

Bankruptcy occurs when a company cannot continue to operate, pay its liabilities, or earn profits, or when it obtains bad credit. Developing an appropriate method to predict bankruptcy helps protect the staff concerned from the worst crises. Bankrupt entities are companies or individuals that cannot operate continually or that have awful credit. Forecasting bankruptcy can be thought of as a classification problem: with the financial and accounting data of a firm as input variables, we try to find out which category the firm belongs to, bankruptcy or non-bankruptcy.

2.2. Bankruptcy prediction using data mining techniques

Early on, Beaver [9,10] used financial ratios as the input variables of linear regression models to classify healthy/bankrupt firms, and Altman [11] used the classical multivariate discriminant analysis technique. Many recent studies, on the other hand, focus on using data mining techniques [12] for bankruptcy prediction. Related work shows that data mining models (e.g. neural networks) outperform statistical approaches (e.g. logistic regression, linear discriminant analysis, and multiple discriminant analysis) [1,13–16].
Table 1 compares related studies published between 2001 and 2008, in terms of which model they build and whether they consider feature selection. Many of these studies emphasize designing more sophisticated classifiers, even though the features of bankruptcy play an important role in the later prediction result. Not all of them highlight this issue: only six of the studies listed consider feature selection, and each of these six uses one specific method to select suitable features during preprocessing. It therefore remains unknown which preprocessing method should be used for better prediction.

2.3. Feature selection for bankruptcy prediction

In many research fields, such as system modeling and pattern recognition, it is important to choose a set of attributes that carries more predictive information. Reducing the number of irrelevant or redundant features drastically reduces the running time of a learning algorithm and yields a more general concept. Feature selection has many potential benefits: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance [34]. In addition, it helps in getting a better insight into the underlying concept of a real-world classification task [35].

There are many well-known feature extraction techniques, such as principal component analysis, factor analysis, independent component analysis (ICA), and discriminant analysis (DA). Taking PCA and FA as examples, PCA transforms the input data vector into a vector of uncorrelated and orthogonal principal components, while FA is a generalization of PCA; the main difference between them is that FA allows the noise to have a non-spherical shape while transforming the data. The main goal of both PCA and FA is to transform the coordinate system such that the correlation between system variables is minimized [36].

The feature selection stage is generally performed before training the models. However, not all related studies consider this stage (cf. Table 1), and the studies that do undertake it use different methods; Table 2 summarizes the methods applied. As there are qualitative and quantitative variables for bankruptcy prediction (cf. Section 2.1), and quantitative data present the financial condition of an enterprise or individual more directly, this paper focuses on the methods applied to quantitative data: correlation matrix, factor analysis, t-test, stepwise regression, and principal component analysis.

2.3.1. Correlation matrix

A correlation matrix is used to assess the correlation between two quantitative groups, as well as to analyze whether one group affects the other. A correlation coefficient is the result of a mathematical comparison of how closely related two variables are. The relationship between two variables is said to be highly correlated if a movement in one variable results from, or takes place at the same time as, a similar movement in the other variable. Using this technique to select the variables that most strongly affect the outcome can bring related advantages [17].
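To make the procedure concrete, the following is a minimal sketch of correlation-based filtering in Python. It assumes a numeric feature matrix X and a binary class vector y; the function name and the 0.05 significance level (corresponding to the 95% confidence level used later in Section 3.4) are illustrative choices, not the paper's original implementation.

```python
import numpy as np
from scipy import stats

def select_by_correlation(X, y, alpha=0.05):
    """Keep the features whose correlation with the class label is
    statistically significant (illustrative sketch)."""
    selected = []
    for j in range(X.shape[1]):
        # Pearson correlation between feature j and the binary outcome,
        # with the p-value of the test that the true correlation is zero
        r, p_value = stats.pearsonr(X[:, j], y)
        if p_value < alpha:
            selected.append(j)
    return selected
```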
It helps to answer the underlying question: do the two groups come from the same population and appear different only because of chance errors, or is there some significant difference between the two groups?
Table 1
A survey of related studies

Work | Feature selection | Prediction models
Atiya (2001) [17] | Yes | Neural networks
Lee et al. (2002) [18] | No | Discriminant analysis + neural networks
Malhotra and Malhotra (2002) [19] | No | Fuzzy logic + neural networks
McKee and Lensberg (2002) [20] | No | Genetic algorithms
Shin and Lee (2002) [21] | Yes | Genetic algorithms
Kim and Han (2003) [22] | No | Genetic algorithms
Huang et al. (2004) [1] | Yes | Support vector machines
Canbas et al. (2005) [23] | Yes | Discriminant analysis + logistic regression
Lee et al. (2005) [24] | No | Self-organizing maps
Min and Lee (2005) [25] | Yes | Support vector machines
Ong et al. (2005) [26] | No | Neural networks + discriminant analysis
Shin et al. (2005) [27] | Yes | Support vector machines
Gestel et al. (2006) [28] | No | Support vector machines
Huysmans et al. (2006) [29] | No | Self-organizing maps + neural networks
Lensberg et al. (2006) [8] | No | Genetic algorithms
Min et al. (2006) [30] | No | Genetic algorithms + support vector machines
Tsakonas et al. (2006) [31] | No | Neural logic networks + genetic algorithms
Wu et al. (2007) [32] | No | Genetic algorithms + support vector machines
Tsai and Wu (2008) [33] | No | Neural networks
Table 2
Feature selection methods used in the literature

Work | Feature selection methods
Atiya (2001) [17] | Correlation matrix
Shin and Lee (2002) [21] | Factor analysis; t-test; stepwise regression
Huang et al. (2004) [1] | t-test
Canbas et al. (2005) [23] | Principal component analysis
Min and Lee (2005) [25] | Principal component analysis
Shin et al. (2005) [27] | t-test; stepwise regression
Three basic factors help determine whether an apparent difference between two groups is a true difference or just an error due to chance [37]:

1. The larger the sample, the less likely it is that the difference is due to sampling errors or chance.
2. The larger the difference between the two means, the less likely it is that the difference is due to sampling errors.
3. The smaller the variance among the participants, the less likely it is that the difference was created by sampling errors.
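As an illustration, a two-sample t-test filter can be sketched as follows. It assumes a feature matrix X and labels y where 0 marks good cases and 1 bad cases; the function name, the significance level, and the use of Welch's variant are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def select_by_ttest(X, y, alpha=0.05):
    """Keep the features whose means differ significantly between the
    good (y == 0) and bad (y == 1) groups (illustrative sketch)."""
    good, bad = X[y == 0], X[y == 1]
    selected = []
    for j in range(X.shape[1]):
        # Welch's t-test; it does not assume equal variances in the two groups
        t_stat, p_value = stats.ttest_ind(good[:, j], bad[:, j], equal_var=False)
        if p_value < alpha:
            selected.append(j)
    return selected
```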
2.3.3. Factor analysis

The purpose of FA is to describe concisely the interrelationships of numerous variables and to help researchers conceptualize them. In other words, it uses fewer dimensions to represent the original structure of the data while keeping most of the information. FA presumes that the observed data are composed of two parts: common factors and unique factors. Common variance refers to the variance in a variable that is shared with all the other variables in the analysis; unique factors are related neither to the common factors nor to the other unique factors. As each variable contains a unique factor, at least one or more common factors can be analyzed. By evaluating which common factors cause the variation, we can select the appropriate variables affecting the dependent one.
FA seeks the smallest number of factors that can account for the common variance (correlation) of a set of variables. With this method, the first factor explains the most common variance between the variables, the second factor explains the most variance after the first has been eliminated, and the remaining factors interpret the residual variance sequentially until all the common variance is divided up [37]. Common FA models include principal axis factoring, maximum likelihood, and alpha factoring [38].

2.3.4. Principal component analysis

The central idea of PCA is to reduce the dimensionality of a data set containing a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming the data to a new set of variables, the principal components, which are uncorrelated and ordered so that the first few retain most of the variation present in the original variables. By computing eigenvalues and eigenvectors, we can find the linear combinations of the original variables that yield the greatest variance: the first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Thus the definition and computation of principal components are straightforward [38]. The difference between FA and PCA is that PCA considers the total variance, accounting for all the common and unique (specific plus error) variance in a set of variables, while common FA considers only the common variance.
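The loadings-based use of FA and PCA for variable selection can be sketched as follows. This is one possible reading of the criterion used later in Section 3.4 (loadings of at least 0.5 mark informative variables); the component count, function name, and use of scikit-learn's FactorAnalysis/PCA are assumptions, and note that for PCA the rows of components_ are unit-length eigenvectors, so thresholding them is only a rough analogue of factor loadings.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

def select_by_loadings(X, n_components=5, threshold=0.5, method="fa"):
    """Keep the variables with an absolute loading >= threshold on at
    least one retained factor/component (illustrative sketch)."""
    Xs = StandardScaler().fit_transform(X)  # both methods expect standardized data
    model = (FactorAnalysis if method == "fa" else PCA)(n_components=n_components)
    model.fit(Xs)
    # Rows of components_ are the factors/components; transpose so that each
    # row of `loadings` holds one original variable's loadings
    loadings = model.components_.T
    keep = np.abs(loadings).max(axis=1) >= threshold
    return np.where(keep)[0]
```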
2.3.5. Stepwise regression

The stepwise regression (or stepwise) method is used by Shin and Lee [21] and Shin et al. [27]. When using regression to build models, a common technique for finding the best combination of predictor variables is stepwise regression. Although there are many variations, the most basic procedure is to find the single best predictor variable and then add variables that meet some specified criterion. The result is a combination of predictor variables, all of which have significant coefficients. Stepwise regression serves two purposes: (1) choosing or deleting variables and (2) evaluating the importance of the factors.

It is worth noting that the above-mentioned techniques belong to inferential statistics. With inferential statistics, we try to infer from the sample data what the population might be like, or to judge the probability that an observed difference between groups is a dependable one. Applying these techniques to acquire the most representative factors can affect the accuracy of bankruptcy prediction. The five above-mentioned methods are traditional statistical methods for selecting variables, which are very helpful for the later data mining process. We aim at finding the methods that make the models applied in bankruptcy prediction more accurate.

3. Experimental design

The experiment consists of three stages, shown in Fig. 1. The first stage builds a multi-layer perceptron (MLP) neural network as the baseline model, since it is the model most widely used in bankruptcy prediction [1,13]; no feature selection method is applied in this stage. The second stage applies the five feature selection methods individually to generate more appropriate features, producing five newly generated feature sets which are used to train the MLP model. The third stage evaluates the models' performance. We consider two evaluation measures to verify these models: average accuracy and Type I and II errors.
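A compact sketch of this three-stage design is given below. It assumes selector functions like those sketched in Section 2.3 that return column indices; the MLP settings shown are placeholders (the actual grid of settings is described in Section 3.2).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def run_experiment(X, y, selectors):
    """Stage 1: baseline MLP on all features; stage 2: retrain on each
    reduced feature set; stage 3: collect the scores for comparison."""
    def evaluate(features):
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=400)
        return cross_val_score(clf, features, y, cv=5).mean()  # 5-fold CV accuracy

    results = {"baseline": evaluate(X)}            # stage 1: no selection
    for name, select in selectors.items():         # stage 2: each method
        results[name] = evaluate(X[:, select(X, y)])
    return results                                 # stage 3: compare

# e.g. run_experiment(X, y, {"t-test": select_by_ttest,
#                            "correlation": select_by_correlation})
```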
[Fig. 1. The experiment process: (a) in the first stage, the MLP is trained on the original dataset; (b) in the second stage, feature selection is applied to the original dataset and the MLP is trained on the new dataset; (c) in the third stage, the models are evaluated by average accuracy and Type I and II errors.]

Table 3
The five datasets

Dataset | No. of variables | No. of samples | Good/bad cases
Australian credit | 14 | 690 | 307/383
German credit | 20 | 1000 | 700/300
Japanese credit | 15 | 690 | 307/383
Bankruptcy dataset | 33 | 240 | 128/112
UC competition | 39 | 2528 | 2449/79

3.1. The datasets

In order to make a reliable comparison, we used five datasets: Australian Credit,1 German Credit,2 Japanese Credit,3 the Bankruptcy dataset [39],4 and the UC competition5 dataset. The five datasets cover bankruptcies and credit evaluations, which match the definition of bankruptcy (cf. Section 2.1). Table 3 shows the number of variables, the number of samples, and the good/bad cases in each dataset.

1 http://www.liacc.up.pt/ML/statlog/datasets/australian/australian.doc.html
2 http://www.liacc.up.pt/ML/statlog/datasets/german/german.doc.html
3 http://www.ics.uci.edu/~mlearn/MLRepository.html
4 http://www.pietruszkiewicz.com/
5 http://mill.ucsd.edu/

3.2. The baseline model

We used MLP neural networks with the back-propagation learning algorithm as the baseline prediction model, because approximately 95% of business application studies utilize MLP [40] and back-propagation is the most popular learning method [41]. The success of neural network models is also discussed by Atiya [17]. Since the focus of this paper is not on developing a novel prediction model, it is feasible to adopt the widely applied MLP as the baseline prediction model for comparing the feature selection methods.

Some issues need to be considered in designing neural network models: the architecture of the network (i.e. the numbers of hidden neurons and layers) must be chosen, and overtraining can occur during training. Hornik et al. [42] show that a one-hidden-layer network is sufficient to model any complex system with any desired accuracy, so the MLP constructed in this paper has only one hidden layer. In addition, Raudys [43] observes that the generalization error, as the number of training iterations increases, first decreases to a minimum and then begins to rise. To avoid overtraining, much related work constructing MLP baselines examines different parameter settings in order to obtain the 'best' MLP model for further comparison (e.g. [18,24,25]). Therefore, we tested four different numbers of hidden nodes and four different numbers of learning epochs. The numbers of hidden nodes are 8, 16, 32,
and 64, and the learning epochs are 50, 100, 200, and 400. As a result, sixteen models are tested on each dataset.

Moreover, the bias introduced by changing the dataset composition may negatively affect the choice of the neural network architecture and its parameters. Cross-validation can be used to examine the prediction performance of the MLP models in terms of sampling variation: it avoids the sample variability that may affect the performance of the MLP and minimizes any bias effect [44,45]. In this paper, 5-fold cross-validation is used. The dataset is divided into five equal parts; any four of the five segments are selected for training, and the remaining part is used for testing the model. As a result, each part is trained and tested five times. The best classification rate can serve as the indicator of a model's performance, and the average prediction rate is also a good indicator of the fit and stability of a model.

3.3. Evaluation methods

After 5-fold cross-validation, we can evaluate which model is the most appropriate baseline, i.e. which provides the highest prediction accuracy. The purpose of this evaluation is to find a suitable group of MLP parameters for each dataset. In addition to prediction accuracy, Type I and Type II errors are also examined as performance measures.

3.3.1. Type I error

A Type I error is the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature; in other words, it is the error of failing to accept an alternative hypothesis when one does not have adequate power. In bankruptcy prediction, it occurs when we classify a non-bankruptcy case into the bankruptcy group.

3.3.2. Type II error

A Type II error is the error of rejecting a null hypothesis when it is the true state of nature; in other words, it is the error of accepting an alternative hypothesis (the real hypothesis of interest) when an observation is due to chance. In bankruptcy prediction, it occurs when we classify a bankruptcy case into the non-bankruptcy group.
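The parameter search and error measures just described can be sketched as follows. The grid matches the settings above; treating scikit-learn's max_iter as the number of learning epochs and coding bankrupt cases as label 1 are simplifying assumptions of this sketch, and the Type I/II error computation follows the definitions in Sections 3.3.1 and 3.3.2.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

HIDDEN_NODES = (8, 16, 32, 64)
LEARNING_EPOCHS = (50, 100, 200, 400)

def best_baseline(X, y):
    """Evaluate the sixteen (hidden nodes, epochs) settings under 5-fold
    cross-validation and return the best one with its error rates.
    Illustrative sketch: label 0 = non-bankruptcy, label 1 = bankruptcy."""
    best = None
    for nodes, epochs in product(HIDDEN_NODES, LEARNING_EPOCHS):
        acc, type1, type2 = [], [], []
        for train, test in StratifiedKFold(n_splits=5).split(X, y):
            clf = MLPClassifier(hidden_layer_sizes=(nodes,), max_iter=epochs)
            pred = clf.fit(X[train], y[train]).predict(X[test])
            acc.append(np.mean(pred == y[test]))
            type1.append(np.mean(pred[y[test] == 0] == 1))  # non-bankrupt -> bankrupt
            type2.append(np.mean(pred[y[test] == 1] == 0))  # bankrupt -> non-bankrupt
        score = (np.mean(acc), np.mean(type1), np.mean(type2), nodes, epochs)
        if best is None or score[0] > best[0]:
            best = score
    return best  # (average accuracy, Type I error, Type II error, nodes, epochs)
```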
3.4. Feature selection

The following describes the parameter settings of the five feature selection methods employed on each of the five original datasets.

Correlation matrix. We set a confidence level of 95% for the significance of the relation between each variable and the outcome distinguishing bankruptcy from non-bankruptcy.

FA and PCA. We considered factor loadings equal to or greater than 0.5 as marking informative variables. In particular, principal axis analysis is used to filter the variables.

Stepwise. We used a probability of F of 0.05 as the entry criterion for selecting a variable that is crucial to the model. To test the collinearity among the variables already selected into the model, we set a probability of F of 0.1 as the removal criterion: a variable enters the model if its probability of F is less than 0.05, and it is removed from the model if its probability of F exceeds 0.1 once more than one variable has been selected. The probability of F indicates whether a variable's contribution to the model reaches significance.

t-test. Using the same criterion, we set a 0.95 confidence level for the t-test to assess whether each feature behaves identically in the good and bad cases. When the level of significance is more than 0.95, the feature differs between the good and bad cases, and the variable is therefore a substantial factor for bankruptcy prediction.

As a result, for each original dataset, five different datasets are produced, one by each of the five feature selection methods. In order to compare directly and fairly with the best MLP model without feature selection (cf. Section 3.2), the best MLP parameter setting of a specific original dataset is applied to each of its five newly generated datasets, to see whether feature selection can outperform non-feature selection. In addition, ANOVA is used to analyze the significance of the differences in prediction performance among the five feature selection methods. In order to compare the five methods and reach a more reliable conclusion, we only consider results that show a high level of significance.
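For illustration, a forward stepwise procedure with the entry and removal thresholds described above might look as follows; statsmodels' ordinary least squares is used as a stand-in for the regression routine, and all names are assumptions of this sketch rather than the software actually used.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, p_enter=0.05, p_remove=0.10):
    """Add the most significant excluded variable while p < p_enter, and
    drop included variables whose p rises above p_remove (illustrative)."""
    included = []
    changed = True
    while changed:
        changed = False
        # Entry step: try each excluded variable alongside those already included
        excluded = [j for j in range(X.shape[1]) if j not in included]
        if excluded:
            pvals = np.array([sm.OLS(y, sm.add_constant(X[:, included + [j]]))
                              .fit().pvalues[-1] for j in excluded])
            if pvals.min() < p_enter:
                included.append(excluded[int(pvals.argmin())])
                changed = True
        # Removal step: re-fit and drop the least significant included variable
        if included:
            pvals = sm.OLS(y, sm.add_constant(X[:, included])).fit().pvalues[1:]
            if pvals.max() > p_remove:
                included.pop(int(pvals.argmax()))
                changed = True
    return included
```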
Table 4
The baseline models

Dataset | Learning epochs | Hidden nodes | Average accuracy | Type I error | Type II error
Japanese credit | 400 | 64 | 85.88% | 90.05% | 22.40%
Australian credit | 400 | 32 | 81.93% | 21.89% | 13.89%
Bankruptcy dataset | 400 | 8 | 71.03% | 12.85% | 30.42%
German credit | 100 | 16 | 74.28% | 55.39% | 9.63%
UC competition | 50 | 32 | 96.92% | 81.68% | 4.05%
4. Results and discussion

4.1. The baseline models

To investigate how the different parameters affect the outcome of the MLP on the five datasets, we built the sixteen models and performed 5-fold cross-validation. Table 4 presents the baseline model for each of the five datasets, in terms of its best setting of training epochs and number of hidden nodes, its average accuracy, and its Type I and II errors.

4.2. Feature selection performance

Table 5 shows the performance of the t-test, stepwise, correlation matrix, FA, and PCA methods applied with the baseline model over the five datasets. In Table 5, all three performance measures show a high level of significant difference, except the Type II error on the Japanese Credit and German Credit datasets. By evaluating the means of the results, we can rank the five feature selection methods by each performance measure that shows a significant difference; Table 6 shows the ranking result. Note that we disregard the Japanese dataset and the accuracy of the Bankruptcy dataset, because there most of the five feature selection methods are not significantly different. For the other three datasets, the top three positions of the ANOVA results are considered and weighted, since the aim of this paper is to find the 'optimal' method: three points are given for rank one, two points for rank two, and one point for rank three. The two methods with the highest scores are then regarded as the best feature selection methods; Table 7 shows the ranking results. Therefore, for average accuracy, the first place goes to factor analysis, followed by the t-test. For the Type I error, the first place is the t-test and the second is stepwise.

Table 5
Performance of feature selection (unit: %)

Dataset / measure | t-test | Stepwise | Correlation matrix | FA | PCA | Baseline model | F value
Japanese credit
Accuracy | 63.53 | 82.64 | 60.16 | 74.22 | 74.00 | 85.88 | 3.25*
Type I error | 55.33 | 32.27 | 74.55 | 29.17 | 47.46 | 90.05 | 3.521*
Type II error | 17.29 | 6.77 | 3.49 | 23.75 | 10.37 | 22.40 | 2.046
Australian credit
Accuracy | 89.27 | 84.74 | 89.31 | 86.08 | 89.93 | 81.93 | 7.279**
Type I error | 9.38 | 12.80 | 13.33 | 14.58 | 7.93 | 21.89 | 7.949**
Type II error | 11.72 | 16.71 | 8.33 | 13.60 | 11.53 | 13.89 | 7.136**
Bankruptcy dataset
Accuracy | 82.98 | 77 | 76.08 | 72.91 | 79.59 | 71.03 | 3.219*
Type I error | 7.69 | 37.27 | 22.76 | 22.50 | 16.55 | 12.85 | 7.707**
Type II error | 28.57 | 5.56 | 25.45 | 32.73 | 26 | 30.42 | 8.132**
German credit
Accuracy | 75.87 | 75.51 | 74.84 | 78.76 | 67.03 | 74.28 | 33.002**
Type I error | 61.28 | 51.34 | 54.36 | 48.69 | 84.92 | 55.39 | 18.261**
Type II error | 8.62 | 12.25 | 12.04 | 10.66 | 6.27 | 9.63 | 2.348
UC competition
Accuracy | 97.25 | 96.33 | 96.70 | 97.30 | 96.47 | 96.92 | 3.678*
Type I error | 74.82 | 79.25 | 96.47 | 94.00 | 90.00 | 81.68 | 7.811**
Type II error | 0.16 | 0.35 | 0.04 | 0.08 | 0.13 | 4.05 | 3.243*

* The level of significance is higher than 95% by ANOVA.
** The level of significance is higher than 99% by ANOVA.
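For reference, the significance test behind the F values in Table 5 can be reproduced with a one-way ANOVA over the per-fold results. The accuracy lists below are hypothetical placeholders, not the paper's data; only the use of scipy's f_oneway is meant to illustrate the procedure.

```python
from scipy import stats

# Per-fold accuracies of each feature selection method on one dataset
# (hypothetical numbers, for illustration only)
ttest_acc    = [0.76, 0.75, 0.77, 0.76, 0.75]
stepwise_acc = [0.75, 0.76, 0.75, 0.76, 0.75]
corr_acc     = [0.74, 0.75, 0.75, 0.74, 0.76]
fa_acc       = [0.79, 0.78, 0.79, 0.78, 0.79]
pca_acc      = [0.67, 0.66, 0.68, 0.67, 0.67]

# One-way ANOVA: does at least one method have a different mean performance?
f_stat, p_value = stats.f_oneway(ttest_acc, stepwise_acc, corr_acc, fa_acc, pca_acc)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p < 0.05: significant at the 95% level
```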
Table 6
Ranking of the feature selection methods

Dataset | Accuracy | Type I error | Type II error
Japanese credit | B>C | B>C | –
Australian credit | P>C>T>S>B | P>T>S>C>F>B | C>P>F>B>S
Bankruptcy dataset | T>B | T>P>B>S | S>C>P>T>B>F
German credit | F>T>S>C>B>P | F>S>C>B>T>P | –
UC competition | F>T>S | T>S>P>F>C>B | C>B>S

Note: T, t-test; S, stepwise; C, correlation matrix; F, FA; P, PCA; B, baseline model.
Correlation matrix and stepwise are the first and second most effective methods for reducing the Type II error, respectively. To sum up, the t-test is the better feature selection method for providing higher prediction accuracy and reducing the Type I error. On the other hand, stepwise extracts the most features (see the next subsection) and provides relatively good performance in terms of average accuracy.
4.3. Selected features versus prediction performance

Criteria were set for the five feature selection methods to filter irrelevant features out and improve the results of bankruptcy prediction; Table 8 shows the original and extracted numbers of variables in the five datasets. We can see that stepwise extracts the largest number of variables (i.e. it has the highest extraction rate): on average, 65.5% of the variables are extracted. The other four feature selection methods differ little from one another, and none of their extraction rates exceeds 45%. It should be noted that, as the five datasets are composed of different variables (features), a deeper analysis of the extracted features themselves is not considered, since it is not the focus of this paper.

Fig. 2 shows the feature reduction percentage versus the prediction performance. Note that the Japanese Credit dataset is not considered in Fig. 2, as its differences are not significant. The results show that stepwise substantially diminishes the number of redundant or irrelevant features, yet its accuracy rates are not the worst over the four datasets. On the other hand, FA removes the lowest rate of irrelevant features; it provides the best result on the German Credit dataset but the worst result on the Bankruptcy dataset. The t-test lies in the middle of the feature reduction range and performs very well over the four datasets.
Table 7
Ranking results

Measure | Number 1 | Number 2
Average accuracy | Factor analysis | t-test
Type I error | t-test | Stepwise
Type II error | Correlation matrix | Stepwise
Table 8
Numbers of selected variables by feature selection

Method | Japanese credit (before: 15) | Australian credit (before: 14) | Bankruptcy dataset (before: 33) | German credit (before: 20) | UC competition (before: 39) | Average extraction rate
t-test | 12 (20%) | 12 (14.3%) | 13 (60.6%) | 12 (40%) | 27 (30.8%) | 33.1%
Stepwise | 5 (66.7%) | 7 (50%) | 2 (93.9%) | 10 (50%) | 13 (66.7%) | 65.5%
Correlation matrix | 12 (20%) | 12 (14.3%) | 14 (57.6%) | 12 (40%) | 27 (30.8%) | 32.5%
Factor analysis | 11 (26.7%) | 9 (35.7%) | 26 (21.2%) | 16 (20%) | 32 (17.9%) | 24.3%
PCA | 8 (46.7%) | 9 (35.7%) | 24 (27.3%) | 9 (55%) | 16 (59%) | 44.7%

Note: Each cell gives the number of variables retained after selection; (%) shows the percentage of variables extracted (removed) compared with the original ones.
[Fig. 2. Feature reduction rate versus prediction accuracy. Four panels, (a) Bankruptcy dataset, (b) German credit, (c) Australian credit, and (d) UC competition, each plot the feature extraction rate and the average accuracy for t-test, stepwise, correlation matrix, FA, PCA, and the baseline.]
[Fig. 3. Feature reduction rate versus Type I and II errors. Four panels, (a) Bankruptcy dataset, (b) German credit, (c) Australian credit, and (d) UC competition, each plot the feature extraction rate and the Type I and Type II errors for t-test, stepwise, correlation matrix, FA, PCA, and the baseline.]
Similarly, Fig. 3 shows the relationship between the feature reduction percentage and the Type I and II errors. Although stepwise and PCA perform very well in reducing the number of variables, they produce higher Type I and II errors. By contrast, the t-test performs stably on both error measures. Above all, we can verify that the t-test is preferable to stepwise and PCA: considering the prediction performance together with the feature reduction percentage, the t-test is more stable than the others.

5. Conclusion

Accurately predicting business failure is a very important issue in financial decision-making. Bankruptcy prediction has long been regarded as a critical topic and has been studied extensively in the accounting and finance literature, and data mining techniques have been used to predict bankruptcies in recent years. Feature selection, a pre-processing step in the data mining process, selects and extracts the more valuable information from the mass of related material. That is, it aims at filtering out redundant or irrelevant information; consequently, it can improve a model's performance as well as reduce the effort of training the model.

However, feature selection pre-processing has not been carefully considered in the literature. Many studies focus on developing more effective prediction models per se that provide better predictive capabilities; some of them do not even consider feature selection before constructing their models, and the related work that has applied feature selection uses only a single chosen method. Therefore, as there are a number of feature selection methods available in the literature, it is unknown which one allows the prediction model to provide the best performance. In this paper, we compared five well-known feature selection methods used in the
bankruptcy prediction literature over five chosen datasets: the t-test, correlation matrix, stepwise regression, PCA, and FA. In addition, we used average accuracy and Type I and II errors in order to evaluate the performance of these methods reliably.

The experimental results show that feature selection methods, by selecting more representative variables, certainly increase the prediction performance. On average, the t-test is superior to the others, with stepwise in second place. For the percentage of original variables removed, stepwise outperforms the others, providing the highest feature reduction rate; however, the results of using stepwise for the bankruptcy prediction and credit scoring problems are unstable over the five chosen datasets. In summary, the t-test performs stably and provides higher prediction accuracy and lower Type I and II errors.

For future work, these findings could be applied in other related business domains, especially two-class classification problems of the same form as the bankruptcy prediction problem, such as stock price prediction (up or down) and customer churn prediction (churn or non-churn). In addition, related work proposing new feature selection methods can compare against the t-test as a major baseline feature selection method in order to reach a reasonable conclusion.

Acknowledgement

This research is partially supported by the National Science Council of Taiwan (NSC 96-2416-H-194-010-MY3).

References

[1] Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study, Decision Support Systems 37 (2004) 543–558.
[2] A.I. Dimitras, S.H. Zanakis, C. Zopounidis, A survey of business failures with an emphasis on prediction methods and industrial applications, European Journal of Operational Research 90 (1996) 487–513.
[3] J. Yang, S. Olafsson, Optimization-based feature selection with adaptive instance sampling, Computers & Operations Research 33 (11) (2006) 3088–3106.
[4] S. Piramuthu, Evaluating feature selection methods for learning in data mining application, European Journal of Operational Research 156 (2004) 483–494.
[5] C.L. Huang, C.J. Wang, A GA-based feature selection and parameters optimization for support vector machines, Expert Systems with Applications 31 (2006) 231–240.
[6] H. Yin, Data visualisation and manifold mapping using the ViSOM, Neural Networks 15 (8–9) (2002) 1005–1016.
[7] L. Rokach, Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition 41 (2008) 1676–1700.
[8] T. Lensberg, A. Eilifsen, T.E. McKee, Bankruptcy theory development and classification via genetic programming, European Journal of Operational Research 169 (2006) 677–697.
[9] W.H. Beaver, Financial ratios as predictors of failure, Journal of Accounting Research 4 (1966) 71–102.
[10] W.H. Beaver, Alternative accounting measures as predictors of failure, Accounting Review 43 (1) (1968) 113–122.
[11] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, Journal of Finance 23 (1968) 589–609.
[12] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[13] P.R. Kumar, V. Ravi, Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review, European Journal of Operational Research 180 (1) (2007) 1–28.
[14] J.H. Min, Y.-C. Lee, Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters, Expert Systems with Applications 28 (2005) 603–614.
[15] K.S. Shin, T.S. Lee, H.J. Kim, An application of support vector machines in bankruptcy prediction model, Expert Systems with Applications 28 (2005) 127–135.
[16] G. Zhang, M.Y. Hu, B.E. Patuwo, D.C. Indro, Artificial neural networks in bankruptcy prediction: general framework and cross-validation analysis, European Journal of Operational Research 116 (1999) 16–32.
[17] A.F. Atiya, Bankruptcy prediction for credit risk using neural networks: a survey and new results, IEEE Transactions on Neural Networks 12 (4) (2001) 929–935.
[18] T.S. Lee, C.C. Chiu, C.J. Lu, I.F. Chen, Credit scoring using the hybrid neural discriminant technique, Expert Systems with Applications 23 (2002) 245–254.
[19] R. Malhotra, D.K. Malhotra, Differentiating between good credits and bad credits using neuro-fuzzy systems, European Journal of Operational Research 136 (2002) 190–211.
[20] T.E. McKee, T. Lensberg, Genetic programming and rough sets: a hybrid approach to bankruptcy classification, European Journal of Operational Research 138 (2002) 436–451.
[21] K.S. Shin, Y.J. Lee, A genetic algorithm application in bankruptcy prediction modeling, Expert Systems with Applications 23 (2002) 321–328.
[22] M.-J. Kim, I. Han, The discovery of experts' decision rules from qualitative bankruptcy data using genetic algorithms, Expert Systems with Applications 25 (2003) 637–646.
[23] S. Canbas, A. Cabuk, S.B. Kilic, Prediction of commercial bank failure via multivariate statistical analysis of financial structures: The Turkish case, European Journal of Operational Research 166 (2005) 528–546.
[24] K. Lee, D. Booth, P. Alam, A comparison of supervised and unsupervised neural networks in predicting bankruptcy of Korean firms, Expert Systems with Applications 29 (2005) 1–16.
[25] J.H. Min, Y.-C. Lee, Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters, Expert Systems with Applications 28 (2005) 603–614.
[26] C.-S. Ong, J.-J. Huang, G.-H. Tzeng, Building credit scoring models using genetic programming, Expert Systems with Applications 29 (2005) 41–47.
[27] K.S. Shin, T.S. Lee, H.J. Kim, An application of support vector machines in bankruptcy prediction model, Expert Systems with Applications 28 (2005) 127–135.
[28] T.V. Gestel, B. Baesens, J.A.K. Suykens, D. Van den Poel, D.-E. Baestaens, M. Willekens, Bayesian kernel based classification for financial distress detection, European Journal of Operational Research 172 (2006) 979–1003.
[29] J. Huysmans, B. Baesens, J. Vanthienen, T. van Gestel, Failure prediction with self organizing maps, Expert Systems with Applications 30 (2006) 479–487.
[30] S.-H. Min, J. Lee, I. Han, Hybrid genetic algorithms and support vector machines for bankruptcy prediction, Expert Systems with Applications 31 (3) (2006) 652–660.
[31] A. Tsakonas, G. Dounias, M. Doumpos, C. Zopounidis, Bankruptcy prediction with neural logic networks by means of grammar-guided genetic programming, Expert Systems with Applications 30 (2006) 449–461.
[32] C.H. Wu, G.H. Tzeng, Y.J. Goo, W.C. Fang, A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy, Expert Systems with Applications 32 (2) (2007) 397–408.
[33] C.-F. Tsai, J.-W. Wu, Using neural network ensembles for bankruptcy prediction and credit scoring, Expert Systems with Applications 34 (4) (2008) 2639–2649.
[34] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[35] M. Dash, H. Liu, Feature selection for classification, Intelligent Data Analysis 1 (1997) 131–156.
[36] O. Uncu, I.B. Turksen, A novel feature selection approach: combining feature wrappers and filters, Information Sciences 177 (2) (2007) 449–466.
[37] R.R. Pagano, Understanding Statistics in the Behavioral Sciences, Sixth ed., Wadsworth/Thomson Learning, California, 2001.
[38] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[39] W. Pietruszkiewicz, Application of discrete predicting structures in an early warning expert system for financial distress, Ph.D. Thesis, Szczecin Technical University, Szczecin, 2004.
[40] K.A. Smith, J.N.D. Gupta, Neural networks in business: techniques and applications for the operations researcher, Computers & Operations Research 27 (2000) 1023–1044.
[41] S. Olafsson, X. Li, S. Wu, Operations research and data mining, European Journal of Operational Research 187 (2008) 1429–1448.
[42] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 336–359.
[43] S. Raudys, Statistical and Neural Classifiers, Springer, London, 2001.
[44] K.Y. Tam, M.Y. Kiang, Managerial applications of neural networks: the case of bank failure predictions, Management Science 38 (1992) 926–947.
[45] G. Zhang, M.Y. Hu, B.E. Patuwo, D.C. Indro, Artificial neural networks in bankruptcy prediction: general framework and cross-validation analysis, European Journal of Operational Research 116 (1999) 16–32.