Expert Systems With Applications 59 (2016) 226–234
Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction

Hyun-Jung Kim, Nam-Ok Jo, Kyung-Shik Shin∗
School of Business, Ewha Womans University, 52 Ewhayeodae-gil, Seodaemun-Gu, Seoul, 120-750, Korea

∗ Corresponding author. E-mail addresses: [email protected] (H.-J. Kim), [email protected] (N.-O. Jo), [email protected] (K.-S. Shin).
Article history: Received 2 December 2015; Revised 12 March 2016; Accepted 21 April 2016; Available online 23 April 2016

Keywords: Genetic algorithms; Cluster-based undersampling technique; Imbalanced data; Corporate bankruptcy prediction
Abstract

We suggest an optimization approach of cluster-based undersampling to select appropriate instances. This approach can solve the data imbalance problem, and the resulting knowledge extraction improves the performance of existing data mining techniques. Although data mining techniques among various big data analytics technologies have been successfully applied and proven in terms of classification performance in various domains, such as marketing, accounting, and finance, the data imbalance problem has been regarded as one of the most important issues to be considered. We examine the effectiveness of a hybrid method using a clustering technique and genetic algorithms based on the artificial neural networks model to balance the proportion between the minority class and majority class. The objective of this paper is to constitute the training dataset best suited both to decreasing data imbalance and to improving classification accuracy. We extract a properly balanced dataset composed of optimal or near-optimal instances for the artificial neural networks model. The main contribution of the proposed method is that it extracts explorative knowledge based on recognition of the data structure, categorizing instances through the clustering technique while simultaneously optimizing the artificial neural networks model. In addition, the rule-format knowledge representation makes it easy to understand why instances are selected, increasing the expressive power of the instance-selection criteria. The proposed method is successfully applied to the bankruptcy prediction problem using financial data in which the proportion of small- and medium-sized bankruptcy firms in the manufacturing industry is extremely small compared to that of non-bankruptcy firms.

© 2016 Elsevier Ltd. All rights reserved.
1. Introduction

With increasing amounts of data available on the web from sources such as social media and mobile phones, the issue of how to use big data analytics to gain insights for managerial decision making has emerged (Tambe, 2014). In handling big data, the first stage is to preprocess the raw data and select suitable data for advanced analytics, finding the salient properties between groups that play a significant role in applying big data generated by social network services (SNS), sensors, and mobile devices to various analytic technologies. Clustering, one of the data mining techniques, is widely used as a preliminary step for advanced analytics; it is a descriptive, unsupervised learning technique that does not use any pre-defined target variable. It explores data to find similar groups based on many features by identifying the characteristics of the collected data.
Additionally, it is appropriate for reflecting all the various candidate variables, because it is relatively free from the curse of dimensionality compared with statistical methods.

Furthermore, a general characteristic of big data in real-world problems is data imbalance between groups. The data imbalance problem has emerged in most practical applications, including fault monitoring (Japkowicz, Myers, & Gluck, 1995), fraud detection (Fawcett & Provost, 1997), remote sensing (Bruzzone & Serpico, 1997), image processing (Kubat, Holte, & Matwin, 1998), response modeling (Kang, Cho, & MacLachlan, 2012), and credit evaluation (Zhou, 2013). Data imbalance occurs when the dataset is skewed towards one of the categories of the target variable; in other words, the proportion of one class is extremely small compared to that of the other class. Classification techniques assume that the training dataset is almost equally distributed between classes, so conducting classification tasks on imbalanced data deteriorates classification performance. The greater the difference in data size between the two classes, the more the data is overwhelmingly classified into the majority class to reduce overall misclassification. It becomes difficult to recognize patterns, because the minority class is likely to be
treated as the majority class. Therefore, handling imbalanced data is a crucial procedure in model development.

In big data analytics, handling imbalanced data in classification problems concerns the proportions of data belonging to each category, which differ severely when the target variable consists of discrete classes such as bankruptcy and non-bankruptcy. Learning from imbalanced data deteriorates classification performance in a binary classification problem: the greater the difference in data size between the two classes, the more the classification model tends to be overwhelmed by non-bankruptcy cases and to ignore bankruptcy cases. Therefore, a number of approaches have been proposed to solve the data imbalance problem. A simple and typical approach to handling imbalanced data is to use sampling techniques that shift the original class distribution of the training dataset. The basic methods are oversampling the minority class and undersampling the majority class, carried out until the classes are almost equally distributed. Many previous studies on bankruptcy prediction have adopted random undersampling.

In this paper, we suggest an optimization approach based on an undersampling technique for selecting appropriate instances to solve data imbalance, conducting knowledge extraction through data preprocessing to improve the performance of existing data mining techniques. Data mining techniques among big data analytics technologies have been successfully applied and proven in terms of classification performance in various domains, such as customer relationship management (Ahmed, 2004; Chen, Hsu, & Chou, 2003; Datta, Masand, Mani, & Li, 2000; Hung, Yen, & Wang, 2006; Kim & Street, 2004) and credit evaluation (Fletcher & Goss, 1993; Odom & Sharda, 1990; Shin, Lee, & Kim, 2005; Zhang, Hu, Patuwo, & Indro, 1999). We propose cluster-based undersampling for the Genetic Algorithms (GA) and Artificial Neural Networks (ANN) model (GA-ANN), combining a clustering technique based on the undersampling methodology with GA-ANN to balance the proportion between the minority class and majority class. Since we assume that instances far from the centroid of a cluster are noisy, the noisy instances of the majority class are removed according to a threshold, the removal criterion searched for with GA. The objective of this paper is to constitute the training dataset best suited both to decreasing data imbalance and to improving classification accuracy. Thus, instead of simple average accuracy, we use the geometric mean of sensitivity and specificity as the fitness function in optimization, since these metrics represent the accuracies observed separately on the minority class and majority class. The proposed method is applied to the bankruptcy prediction problem using financial data in which the proportion of bankruptcy firms, focused on small- and medium-sized manufacturing firms, is only 5.9% of the total.

The remainder of this paper is organized as follows. Section 2 provides a description of sampling techniques for handling imbalanced data, including a brief review of previous studies relevant to the research topic of this paper. Section 3 describes the cluster-based undersampling technique for GA-ANN modeling. Section 4 reports the model-building process and the results of the experiments.
Section 5 discusses conclusions and future research issues.

2. Handling imbalanced data in classification problems

Previous studies have addressed the data imbalance problem with two approaches, at the data level and at the algorithm level. The solution at the data level involves re-sampling, including
undersampling and oversampling. The solution at the algorithmic level is to adjust class costs or the decision threshold, or to use one-class learning (Chawla, Japkowicz, & Kotcz, 2004). In this section, we provide a brief description of several methods for handling imbalanced data, focusing on sampling techniques at the data level.

In selecting instances for constructing the ANN model, Kim (2006) proposed an instance selection method based on GA in ANN. GA optimized the connection weights and instance selection at the same time to reduce the learning time and improve the predictive performance in stock market prediction. The proposed method reduced the dimensionality of the data and removed noise within the dataset. The experimental results showed that the GA-based instance selection technique was effective in improving the performance of the ANN model. This study focuses on undersampling to solve the data imbalance problem in bankruptcy prediction applications. In GA-ANN, GA simultaneously optimizes the connection weights of the ANN and a training dataset composed of relevant samples.

Sampling techniques at the data level are performed independently of the classification models; thus, they can be integrated with any classification model. Sampling for imbalanced data transforms the original data through various methods of oversampling or undersampling to obtain a balanced distribution between classes. Balanced data processed by sampling techniques provides enhanced classification performance (He & Garcia, 2009). Undersampling removes instances from the majority class to balance it against the minority class, while oversampling augments instances in the minority class. The two techniques have both advantages and disadvantages. Oversampling maintains the natural distribution of the minority class without discarding original data, but the artificially increased data in the minority class risks including noise that degrades classification performance. Undersampling is efficient in terms of training time because it uses relatively little data, but it risks excluding data useful for learning. Generally, the performance of oversampling is relatively low compared to that of undersampling (Drummond & Holte, 2003). This paper adopts undersampling at the data level, the sampling method commonly used in previous studies on bankruptcy prediction.

2.1. Oversampling

Oversampling augments the size of the minority class to balance the majority class. It preserves the original distribution of the minority class, but it has the disadvantage of a long training time, because the total size of the training dataset is artificially increased. In addition, the deviation among instances within the minority class is greatly increased, and noise can be introduced into the training dataset in the process of generating data. Nevertheless, it is a useful method when the amount of data in the minority class is extremely small. Random oversampling, the simplest oversampling method, generates a dataset by adjusting the distribution between the majority class and minority class to any desired level (He & Garcia, 2009).
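As a point of reference for the more intelligent techniques discussed next, the following is a minimal sketch of random oversampling as just described. It assumes the minority class is given as a NumPy feature matrix; the function name and arguments are illustrative rather than taken from any cited work.

```python
import numpy as np

def random_oversample(X_minor, n_target, seed=None):
    """Random oversampling: replicate randomly chosen minority-class
    instances (with replacement) until the class holds n_target rows."""
    rng = np.random.default_rng(seed)
    extra = rng.integers(0, len(X_minor), size=n_target - len(X_minor))
    return np.vstack([X_minor, X_minor[extra]])
```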
Unlike the random selection approach, the synthetic minority oversampling technique (SMOTE) and SMOTEBoost were developed to perform more intelligent oversampling. SMOTE, proposed by Chawla, Bowyer, Hall, and Kegelmeyer (2002), generates artificial data using the k-nearest neighbor technique: it creates new minority-class instances by interpolating between a minority instance and one of its k nearest neighbors selected at random. Chawla, Lazarevic, Hall, and Bowyer (2003) proposed SMOTEBoost, combining a boosting technique with SMOTE.
Han, Wang, and Mao (2005) developed borderline-SMOTE, which oversamples instances in the minority class near the borderline. Hu, Liang, Ma, and He (2009) proposed MSMOTE (modified synthetic minority oversampling technique), which considers the distribution of the minority class and excludes latent noise using the k-nearest neighbors.

In addition to SMOTE-based oversampling, cluster-based oversampling approaches using clustering techniques have been studied. Jo and Japkowicz (2004) suggested an oversampling algorithm based on the k-means clustering technique. They performed oversampling considering two types of data imbalance: the between-class imbalance of the two classes and the within-class imbalance among the sub-clusters of each class. They calculated cluster centers representing the average of each cluster, and the training instances were then allocated to the cluster whose center was closest. The cluster averages were updated until all training instances were assigned to sub-clusters, balancing the number of instances of the training dataset between the minority class and the majority class.
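Since SMOTE's interpolation step is easy to misread, the sketch below shows one plausible implementation of the basic algorithm described above; this is an illustration under our own naming, not code from the cited papers.

```python
import numpy as np

def smote(X_minor, n_new, k=5, seed=None):
    """Basic SMOTE: synthesize n_new minority instances by interpolating
    between an instance and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minor)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_minor[:, None, :] - X_minor[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
    base = rng.integers(0, n, size=n_new)       # instances to expand
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factors in [0, 1)
    return X_minor[base] + gap * (X_minor[neigh] - X_minor[base])
```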
2.2. Undersampling

Undersampling decreases the number of instances in the majority class to balance it against the minority class. It is efficient when a large amount of data is available, in that the training time is reduced along with the training dataset, but it risks distorting the original distribution of the majority class and may discard potentially useful data. Nevertheless, partial data can be used in modeling because the amount of data available for analysis is sufficient in the era of big data. It is crucial to obtain a relevant dataset, improving the classification performance of a model by sampling data with similar properties.

Random undersampling, the simplest method, reduces the dataset by removing a randomly sampled subset of the majority class. Beyond the random selection approach, more intelligent undersampling techniques based on one-sided selection (OSS), data cleaning, clustering, and GA have been suggested. Kubat and Matwin (1997) proposed the OSS method, which removes borderline and noisy instances from the majority class while retaining all instances of the minority class. Undersampling using data cleaning techniques, such as Tomek links (Tomek, 1976), has been applied in many previous studies: a Tomek link is a pair of nearest-neighbor instances belonging to different classes, and removing such instances cleans the overlapping region between the minority class and the majority class. Barandela, Sánchez, García, and Rangel (2003) applied Wilson's editing (WE) algorithm (Wilson, 1972) to remove noisy instances in the majority class. Yen and Lee (2009) proposed a cluster-based undersampling approach that first clusters all instances into several clusters and then selects the relevant number of majority-class instances from each cluster based on the ratio of majority-class to minority-class instances in the cluster. Kang et al. (2012) used clustering, undersampling, and ensemble methods to solve the class imbalance problem. They first clustered the instances of the majority class and then constructed multiple training datasets consisting of majority-class instances sampled from each cluster, preserving the instances of the minority class. Barandela, Hernández, Sánchez, and Ferri (2005) used GA to reduce the significant amount of computational resources required; their method simultaneously performs two tasks based on GA, reduction of the imbalanced data and feature selection.
Table 1
Confusion matrix for performance evaluation.

                   Predicted
Actual             Bankruptcy            Non-bankruptcy
Bankruptcy         True Positive (TP)    False Negative (FN)
Non-bankruptcy     False Positive (FP)   True Negative (TN)
The evolutionary sampling technique based on GA has been employed to selectively remove instances from the majority class (García & Herrera, 2009; Khoshgoftaar, Seliya, & Drown, 2010). However, previous studies on evolutionary sampling using GA have involved time-consuming searches for optimal or near-optimal solutions, because every instance of the majority class becomes part of the string searched by GA. Thus, this study suggests cluster-based sampling supported by GA to handle the inefficiency of previous evolutionary sampling methods.

2.3. Performance measures

We investigate various performance measures for modeling imbalanced data and select an appropriate measure for our study. The measures are explained using the confusion matrix shown in Table 1. True Positive (TP) indicates the number of bankrupt firms correctly classified, and False Negative (FN) indicates the number of bankrupt firms incorrectly classified as non-bankrupt. False Positive (FP) indicates the number of non-bankrupt firms incorrectly classified as bankrupt, and True Negative (TN) indicates the number of non-bankrupt firms correctly classified.

• Hit-ratio: simple average accuracy, (TP + TN) / (TP + FN + FP + TN). The hit-ratio is not a suitable performance measure when modeling imbalanced data, because its value swings severely with the cut-off used to calculate the sensitivity and specificity criteria.
• G-Mean: geometric mean, sqrt(Sensitivity × Specificity), where sensitivity is the true positive rate, TP / (TP + FN), and specificity is the true negative rate, TN / (FP + TN). Unlike the hit-ratio, which represents simple average accuracy, the geometric mean considers the accuracy of the majority class and the minority class equally. A higher geometric mean indicates that the balance between classes is reasonable and that the binary classification model performs well (Kubat, Holte, & Matwin, 1997). We use the G-Mean as the fitness function in GA for data balancing.
• AUROC: the Receiver Operating Characteristic (ROC) curve plots the percentage of bankruptcy firms correctly classified on the y-axis against the percentage of non-bankrupt firms incorrectly classified as bankrupt on the x-axis. Confidence intervals can be computed for these curves, and they are useful for visualizing model performance, particularly in domains affected by the data imbalance problem.
• AR: the Cumulative Accuracy Profile (CAP) curve plots the percentage of bankruptcy firms correctly classified on the y-axis against the percentage of the total number of firms. Although the ROC and the CAP are not identical, they convey similar information in terms of the Gini index (Agresti, 1984; Engelmann, Hayden, & Tasche, 2003), since there is a simple linear relation between the AUROC and the AR: AR = 2(AUROC − 0.5).
Fig. 1. The process of cluster-based evolutionary undersampling (CBEUS).
• H-measure: the AUROC has been widely used for measuring classification performance in data mining studies. Hand (2009) pointed out that comparing model performance based on ROC curves implicitly applies a different misclassification cost distribution to each classification model. This paper therefore utilizes the H-measure, which uses a pre-defined beta distribution as suggested by Hand (2009), for comparing the classification performance of the constructed models (Anagnostopoulos, Hand, & Adams, 2012).
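To make the measures above concrete, here is a small sketch computing them from predicted labels and scores; it uses scikit-learn only for the AUROC and derives the AR from the linear relation given above. The function name and argument layout are our own, and the H-measure itself is left to the hmeasure package cited in the references.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Measures from Section 2.3; label 1 = bankruptcy (positive class)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    sens = tp / (tp + fn)                       # true positive rate
    spec = tn / (fp + tn)                       # true negative rate
    auroc = roc_auc_score(y_true, y_score)
    return {
        "hit_ratio": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sens,
        "specificity": spec,
        "g_mean": np.sqrt(sens * spec),         # fitness function used for GA
        "auroc": auroc,
        "ar": 2.0 * (auroc - 0.5),              # AR = 2(AUROC - 0.5)
    }
```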
3. Proposed model

The proposed cluster-based evolutionary undersampling (CBEUS) method is an undersampling technique combining clustering and GA to address the problem of data imbalance. We used ANN as the classification model in this study. Studies on integrating GA and ANN started in the late 1980s (Harp & Samad, 1991; Harp, Samad, & Guha, 1989; Heistermann, 1990; Miller, Todd, & Hedge, 1989; Whitely & Hanson, 1989; Whitley, Starkweather, & Bogart, 1990), and GA-ANN has been applied to various fields. In constructing GA-ANN, there are many optimization points, such as input variables (Back, Laitinen, & Sere, 1996), network architectures (Benardos & Vosniakos, 2007; Kim & Shin, 2007; Son et al., 2004), and connection weights (Shin & Lee, 2004). However, optimizing all of these elements in GA is inefficient, because it widens the space that must be searched for a solution. Among them, the connection weights are the most important optimization points, because they carry the data-driven knowledge extracted from the application domain. Therefore, our study uses GA to conduct a simultaneous search that performs cluster-based undersampling while optimizing the connection weights of the ANN model. Parameters such as the number of hidden nodes and hidden layers are determined before GA optimization: the number of hidden nodes is fixed to the number of input variables, and the number of hidden layers is fixed at 1, following previous studies on GA-ANN (Kim, 2000, 2006).

The overall process of the proposed model is shown in Fig. 1 and consists of the following steps. The first step is to divide the non-bankruptcy firms corresponding to the majority class into several clusters using k-means clustering and to compute the distance between each instance and the centroid of its cluster using the Euclidean distance function. The second step
is to find, with GA, the thresholds that represent the allowable distance from the centroid of each cluster. We select relevant instances from the majority class under the assumption that instances far from the centroid of a cluster are noisy and should be removed. We expect this approach to have advantages over previous undersampling techniques. First, it removes noisy instances by considering the distance between a majority-class instance and the centroid of its cluster; a dataset cleaned of noisy instances yields improved and stable classification performance. Second, we employ a relatively intelligent undersampling technique based on survival of the fittest instances in optimization. A disadvantage of random undersampling is that potentially useful instances may be discarded; we therefore use the geometric mean, which considers the balance and accuracy of both the majority class and the minority class, to compensate for this drawback.

3.1. Clustering of non-bankruptcy firms corresponding to the majority class

Non-bankruptcy firms corresponding to the majority class are divided into several clusters using k-means clustering, and the distance between each majority-class instance and the centroid of its cluster is then computed using the Euclidean distance function. The silhouette statistic proposed by Rousseeuw (1987) is adopted to determine the adequate number of clusters. This measure reflects the concepts of distance within clusters and distance between clusters as follows:
S_i = (b_i − a_i) / max(a_i, b_i)

where a_i indicates the average distance from observation i to all other points within its cluster, and b_i indicates the smallest average distance from observation i to all points in any other cluster. The silhouette statistic ranges from −1 to 1. A value close to 1 means clustering is properly performed; a value close to −1 means the result of clustering is not reasonable.

We regard instances that are far from the centroid (the cluster average) of each cluster as noisy instances. The noisy instances of the majority class are removed until the class distribution is balanced. This approach selects instances of the majority class from each cluster according to their distances from the centroid.
The thresholds of each cluster, which serve as the removal criteria in Step 2, are searched by GA. If the distance between an instance and its cluster centroid is less than the threshold, the instance is included in the training dataset.
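A minimal sketch of Step 1 follows, assuming scikit-learn is acceptable for the k-means and silhouette computations; the function name, the candidate range of k, and the random seed are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_majority(X_major, k_candidates=range(2, 11), seed=0):
    """Step 1: cluster the majority (non-bankruptcy) class with k-means,
    choose k by the silhouette statistic, and return each instance's
    Euclidean distance to the centroid of its own cluster."""
    best_k, best_s = None, -1.0
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X_major)
        s = silhouette_score(X_major, labels)   # in [-1, 1]; near 1 is good
        if s > best_s:
            best_k, best_s = k, s
    km = KMeans(n_clusters=best_k, n_init=10, random_state=seed).fit(X_major)
    dist = np.linalg.norm(X_major - km.cluster_centers_[km.labels_], axis=1)
    return km, dist, best_s
```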
Table 2
Description of each dataset.

Dataset      Non-bankruptcy   Bankruptcy   Total
Training     16,920           1,080        18,000
Validation   4,230            270          4,500
Total        21,150           1,350        22,500

3.2. Undersampling based on genetic algorithms
To set up the genetic search environment, we need parameters coded for the optimization problem and a fitness function to evaluate the performance of each chromosome. The coded parameters are composed of two sets. The first set includes the thresholds for each cluster, which represent the distance from the centroid (c_k) of the cluster; we set the range of each threshold between the minimum and maximum distances from the centroid within that cluster. The second set includes the connection weights of the ANN. The string encoded for the GA experiments is represented as follows:
Table 3
Definition of variables.

Variable   Definition
X1         Return on assets
X2         Stockholder's equity to total assets
X3         Total borrowings to total assets
X4         Productivity of capital
X5         Operating cash-flow to total assets
X6         EBITDA to sales
String = {d_1, d_2, ..., d_k, w_11, w_12, ..., w_mn}

where d: the distance threshold from the centroid of a cluster; k: the number of clusters; w: the connection weights of the neural network, which form a matrix of dimension (m + 1)n + (n + 1); m: the number of input nodes; n: the number of hidden nodes. The values of each cell in the connection weight matrix are searched from −5 to 5, both between the input layer and the hidden layer and between the hidden layer and the output layer. We select instances that are within d of c_k, the centroid of each cluster. The rule structure for the cluster-based undersampling based on GA is as follows:
IF [Distance from c_k <= d_k] THEN select

The performance of each chromosome is defined by the application domain and evaluated by the user-defined fitness function. The objectives of this study are to find, for each cluster, the threshold that balances the classes, together with the connection weights of the ANN. Through the GA search process, the fitness function shows to what extent the optimal training dataset, consisting of samples extracted from the majority class, increases the classification performance. We apply the G-Mean as the fitness function, because the Hit-ratio has the disadvantage of being strongly affected by the accuracy of the majority class. Thus, the chromosomes consisting of thresholds and connection weights are searched to maximize the fitness function. Mathematically, the fitness function for this study is represented as follows:
Max. G-Mean = sqrt( (1/b) Σ_{i=1..b} BA_i × (1/n) Σ_{j=1..n} NA_j )

s.t.

BA_i = 1 if PO_i = AO_i, 0 otherwise
NA_j = 1 if PO_j = AO_j, 0 otherwise

for given i (i = 1, 2, ..., b) and j (j = 1, 2, ..., n)

where b: the number of bankruptcy firms; BA_i: the classification accuracy of the ith instance of the bankruptcy firms, denoted by 1 and 0 ('correct' = 1, 'incorrect' = 0); n: the number of non-bankruptcy firms; NA_j: the classification accuracy of the jth instance of the non-bankruptcy firms, denoted by 1 and 0 ('correct' = 1, 'incorrect' = 0); PO_i: the predicted output of the ith instance of the bankruptcy firms; AO_i: the actual output of the ith instance of the bankruptcy firms; PO_j: the predicted output of the jth instance of the non-bankruptcy firms; AO_j: the actual output of the jth instance of the non-bankruptcy firms.
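The sketch below shows how a single chromosome could be scored under this fitness function, assuming a helper `ann_predict` (given below) that maps the weight genes and a feature matrix to 0/1 labels; the decomposition of the chromosome and all names are illustrative, not the authors' code.

```python
import numpy as np

def fitness(chromosome, k, dist, cluster_id, X_major, X_minor, n_hidden):
    """G-Mean fitness of one chromosome: the first k genes are the cluster
    thresholds d_1..d_k, the remaining genes are the ANN connection weights.
    `dist` and `cluster_id` hold each majority instance's centroid distance
    and cluster index, as produced by cluster_majority() above."""
    d, weights = chromosome[:k], chromosome[k:]
    keep = dist <= d[cluster_id]                # select majority instances within d_k
    X = np.vstack([X_minor, X_major[keep]])
    y = np.concatenate([np.ones(len(X_minor)), np.zeros(int(keep.sum()))])
    pred = ann_predict(weights, X, n_hidden)    # 0/1 outputs of the GA-weighted ANN
    ba = (pred[y == 1] == 1)                    # BA_i over bankruptcy firms
    na = (pred[y == 0] == 0)                    # NA_j over non-bankruptcy firms
    return np.sqrt(ba.mean() * na.mean())       # G-Mean to be maximized
```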
Each neuron is linked to numerical weights within the ANN, and we searched these weights through GA. First, each neuron calculates the weighted sum of its input values through the combination function as follows:
X = Σ_{i=1..n} x_i w_i

where X: the net weighted input to the neuron; x_i: the value of input i; w_i: the connection weight of input i; n: the number of neuron inputs. The weighted sum is transformed by a transfer function into the activation value of the neuron. We use the sigmoid function, which is generally used as a transfer function in previous studies on ANN. The transfer function is as follows:
Y = 1 / (1 + e^(−X))
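For completeness, here is a sketch of the forward pass using the combination and transfer functions above, with the flat GA-searched weight vector of length (m + 1)n + (n + 1) (biases included) reshaped into the two layers; the 0.5 output cut-off and all names are our own assumptions.

```python
import numpy as np

def ann_predict(weights, X, n_hidden):
    """One-hidden-layer ANN forward pass with GA-searched weights.
    `weights` has length (m + 1) * n_hidden + (n_hidden + 1)."""
    m = X.shape[1]
    w1 = weights[:(m + 1) * n_hidden].reshape(m + 1, n_hidden)
    w2 = weights[(m + 1) * n_hidden:]            # hidden-to-output weights + bias
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z)) # Y = 1 / (1 + e^(-X))
    h = sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ w1)  # X = sum(x_i * w_i)
    y = sigmoid(np.hstack([h, np.ones((len(h), 1))]) @ w2)
    return (y >= 0.5).astype(int)                # class cut-off at 0.5
```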
4. Experiments and results

4.1. Research data and experiments

The dataset consisted of 106 financial ratios for 22,500 externally non-audited small- and medium-sized Korean manufacturing firms, 1,350 of which filed for bankruptcy and the other 21,150 of which did not, between 2002 and 2007. The number of bankruptcy firms was only 5.9% of the total. Generally, the bankruptcy ratio of non-audited small- and medium-sized firms is higher than that of firms overall. The dataset was split into two subsets as shown in Table 2; approximately 80% of the data was used for a training dataset and 20% for a validation dataset. The training dataset was used to develop the undersampling technique for class balancing. The validation data was organized according to the distribution of the training data and used to test the validity of the model.

We applied a two-stage input variable selection process based on methods in previous studies (Kim & Ahn, 2012; Kim & Han, 2001; Lee, 2007; Shin & Han, 2001; Shin et al., 2005). In the first stage, we selected 31 variables using an independent-sample t-test (between each financial ratio as input and bankrupt or non-bankrupt as output) and expert opinion. In the second stage, we selected the final six financial variables using the statistical stepwise method (Kim & Ahn, 2012; Lee, 2007; Shin & Han, 2001; Shin et al., 2005). That is, we first selected variables satisfying the univariate test and then selected the final variables to reduce dimensionality. The selected input variables, shown in Table 3, are widely used for the credit evaluation of small- and medium-sized firms in practice.

The controlling parameters for the GA search included population size, crossover rate, mutation rate, and the stopping criterion. We used 100 organisms in the population. The genetic operators, such as the crossover and mutation rates, were varied to prevent the solution from falling into local optima (Shin & Han, 1999); the crossover and mutation rates were set to 0.6 and 0.1, respectively, and we used 2,000 trials as the stopping condition. To build the neural network model, we fixed the number of hidden nodes at six, equal to the number of input variables, and fixed the number of hidden layers at 1, following previous studies on GA-ANN (Kim, 2000, 2006). We used the widely used sigmoid function as the transfer function.

4.2. Results and analysis

To investigate the effectiveness of the cluster-based undersampling technique for constructing GA-ANN in the bankruptcy prediction application, we set GA to simultaneously search the cut-off for each cluster, representing the threshold distance from the centroid of that cluster, and the connection weights of the ANN model. The number of clusters k was determined in advance for k-means clustering using the silhouette statistic (Rousseeuw, 1987), which considers distance within clusters and distance between clusters. The silhouette statistic showed a reasonable value of 0.4 when k was set to 5; five clusters were thus formed from the six financial ratios by k-means clustering. Table 4 shows the frequency and ratio of each cluster. The centroids of the resulting clusters are shown in Table 5, and the cut-off values derived from the GA search are summarized in Table 6.

Table 4
Frequency and ratio of each cluster.

Cluster   Frequency   Ratio (%)
1         5,090       29.7
2         4,867       28.8
3         2,416       14.3
4         2,359       13.9
5         4,017       19.7
Total     20,438      100.0

Table 5
Centroids of each variable per cluster.

Cluster   X1      X2      X3      X4      X5      X6
1         −0.22   −0.77   1.04    −0.15   −0.08   0.22
2         −0.04   0.27    0.13    −0.10   −0.11   0.07
3         0.24    1.84    −1.27   0.06    −0.04   0.03
4         0.74    0.51    −1.01   0.68    0.53    −0.07
5         −0.14   −0.66   −0.68   0.02    0.13    −0.46

Table 6
Cut-off values of each cluster derived by GA searching.

Cut-off   d1      d2      d3      d4      d5
Value     0.501   0.520   5.817   1.172   0.576

To reduce the impact of random variation in the GA search process, we replicated the experiment over several trials to select instances in the majority class and derive the ANN model with the best predictive power. The derived rule based on cluster-based undersampling is given below; its structure includes five conditions corresponding to the five clusters, joined by AND relations. The performance of each chromosome is defined by the application domain and evaluated by the user-defined fitness function. This study finds a simple rule for constituting the training dataset through the GA search. We selected the strings that maximized the fitness function values, because the optimal or near-optimal solutions are determined by the fitness function of GA.

CBEUS rule:

IF [Distance from c1 (−0.22, −0.77, 1.04, −0.15, −0.08, 0.22) <= 0.501,
AND Distance from c2 (−0.04, 0.27, 0.13, −0.10, −0.11, 0.07) <= 0.520,
AND Distance from c3 (0.24, 1.84, −1.27, 0.06, −0.04, 0.03) <= 5.817,
AND Distance from c4 (0.74, 0.51, −1.01, 0.68, 0.53, −0.07) <= 1.172,
AND Distance from c5 (−0.14, −0.66, −0.68, 0.02, 0.13, −0.46) <= 0.576]
THEN Select
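Applying the derived CBEUS rule is then a one-line filter. The sketch below reuses `km` and `dist` from the clustering sketch in Section 3.1 and plugs in the cut-off values of Table 6; treat it as an illustration of the selection step rather than the authors' code.

```python
import numpy as np

# Cut-off values d_1..d_5 derived by the GA search (Table 6)
cutoffs = np.array([0.501, 0.520, 5.817, 1.172, 0.576])

# An instance of the majority class is kept when its distance to the
# centroid of its own cluster does not exceed that cluster's cut-off.
selected = dist <= cutoffs[km.labels_]
X_major_balanced = X_major[selected]
```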
The model performance of the optimized GA-ANN with cluster-based evolutionary undersampling (GA-ANN_CBEUS) is compared with ANN using the original data without sampling (ANN_None), ANN with random undersampling (ANN_RUS), GA-ANN with RUS (GA-ANN_RUS), and GA-ANN with evolutionary undersampling (GA-ANN_EUS), using the AUROC and H-measure as representative performance measures. Among the models constructed in this paper, ANN_RUS and GA-ANN_RUS represent models built on a balanced sample. The model performance on the training and validation datasets is summarized in Tables 7 and 8, respectively, and Figs. 2 and 3 compare the classification results of all the models developed for this study. In terms of both the AUROC and the H-measure, GA-ANN_CBEUS outperformed the other models: it was effective in selecting proper instances and balancing sensitivity and specificity through simultaneous optimization. Modeling a balanced training dataset optimized by GA provided enhanced performance compared with using an imbalanced dataset. The McNemar test was performed to verify that cluster-based undersampling supported by GA performed sampling significantly better than random sampling and evolutionary sampling, as shown in Table 9.
Table 7
Model performance in the training set (unit: %).

Category      ANN_None    ANN_RUS     GA-ANN_RUS   GA-ANN_EUS   GA-ANN_CBEUS
Sensitivity   4.907       79.815      84.722       75.648       77.315
Specificity   99.905      85.370      78.796       73.008       93.716
G-Mean        22.142      82.546      81.706       74.316       88.654
AUROC         80.846      88.760      87.852       77.652       92.698
H-measure     8.289       53.498      52.586       8.108        60.748
NB:B          94.0:6.0    50.0:50.0   50.0:50.0    88.4:11.6    69.1:30.9

NB: Non-bankruptcy; B: Bankruptcy.

Table 8
Model performance in the validation set (unit: %).

Category      ANN_None    ANN_RUS     GA-ANN_RUS   GA-ANN_EUS   GA-ANN_CBEUS
Sensitivity   2.963       69.630      90.370       85.185       87.407
Specificity   100.000     89.630      68.519       75.556       75.556
G-Mean        17.213      78.999      78.690       80.226       84.259
AUROC         83.963      87.111      86.185       85.185       88.778
H-measure     42.586      47.598      47.291       47.933       56.158
NB:B          50.0:50.0   50.0:50.0   50.0:50.0    50.0:50.0    50.0:50.0

NB: Non-bankruptcy; B: Bankruptcy.

Table 9
McNemar values for classification accuracy between models (significance level).

              ANN_RUS     GA-ANN_RUS   GA-ANN_EUS   GA-ANN_CBEUS
ANN_None      0.000***    0.000***     0.000***     0.000***
ANN_RUS       –           0.728        0.906        0.022**
GA-ANN_RUS    –           –            10.653       0.005***
GA-ANN_EUS    –           –            –            0.024**

*** Significant at 1%. ** Significant at 5%.

Fig. 2. Comparison of model performance using AUROC.

Fig. 3. Comparison of model performance using H-measure.

We used a balanced sample, consisting of the same randomly selected firms as the validation data, for the McNemar test.

Based on the experimental results, we concluded that GA-ANN_CBEUS significantly outperformed the other methods. However, GA-ANN_RUS, which optimized the connection weights of the ANN, and GA-ANN_EUS, which optimized both the training dataset and the connection weights, showed no significant difference from the benchmark model (ANN_RUS) when we set 2,000 trials as the stopping condition in constructing GA-ANN. In particular, GA-ANN_EUS had an over-fitting problem arising from its enlarged search space, because whether each record was selected was itself a search variable; it therefore took a long time to attain an optimal or near-optimal solution and increased the computational cost of the GA search. On the other hand, the GA-ANN_CBEUS proposed in this study showed the highest predictive performance while relieving the computational burden, since it used a relatively small number of search variables.

5. Conclusion
This study examined the effectiveness of a cluster-based undersampling approach for optimizing GA-ANN for bankruptcy prediction. We extracted a properly balanced dataset composed of optimal or near-optimal instances to both decrease data imbalance and improve classification performance. To evaluate the classification performance of the model, we utilized the geometric mean, which considers the balance of the proportion between the minority class and the majority class. The experimental results showed that GA-ANN with cluster-based evolutionary undersampling was superior to ANN with random undersampling, GA-ANN with random undersampling, and GA-ANN with evolutionary undersampling across various performance metrics.

The undersampling method in this study makes the following main contributions. First, the structure of the data can be explored by categorizing instances through the clustering technique while simultaneously optimizing the construction of the ANN model. Second, the proposed undersampling method selects samples by excluding outlying instances far from the centroid of each cluster; modeling on a dataset cleaned of noisy instances yields improved and stable classification performance. Third, the rule-format knowledge representation makes it easy to recognize why instances are selected, increasing the expressive power of the instance-selection criteria. Finally, we verified the benefit of the cluster-based evolutionary undersampling approach by comparing various performance measures, such as sensitivity, specificity, G-Mean, AUROC, AR, and H-measure. In particular, the H-measure can be an appropriate measure for comparing the classification performance of models constructed on the imbalanced data in this study.
Our study has the following limitations that require future research. First, the proposed model selected the optimal training dataset and found suitable connection weights to train the ANN model. We need to consider possible combinations of control parameters, such as learning trials, population size, crossover rate, and mutation rate, in GA-ANN modeling to find global solutions, since these control parameters can affect classification performance. Although GA-ANN with evolutionary sampling has a wider search space, because the search variables consist of each record of the given dataset, GA-ANN using cluster-based evolutionary undersampling makes the GA search efficient, since the search variables consist only of the cut-off values of each cluster. However, the proposed model using a general GA risks converging to a local optimum; a further study should therefore apply multimodal GA to find multiple solutions for the cut-off values of each cluster. Second, we used the k-means algorithm to cluster the majority class of non-bankruptcy firms for undersampling the optimal training dataset. Future studies will consider other clustering algorithms, such as self-organizing maps and hierarchical agglomerative clustering, because the classification performance of the proposed model may depend on the quality of the clusters. Finally, we tentatively performed investigation and analysis using performance measures suitable for modeling imbalanced data in binary classification problems. However, performance metrics such as the AUROC, the AR, and the H-measure have no definite criteria that provide evidence for judging the excellence of model performance. A future study should provide specific criteria for model evaluation across the manifold contingent environments of academic and practical applications.

Acknowledgment

This work was supported by the National Research Foundation of Korea Grant funded by the Korean government (NRF-2013S1A3A2054667).

References

Agresti, A. (1984). Analysis of ordinal categorical data. New York: John Wiley & Sons.
Ahmed, S. R. (2004). Applications of data mining in retail business. Information Technology: Coding and Computing, 2, 455–459.
Anagnostopoulos, C., Hand, D. J., & Adams, N. M. (2012). Measuring classification performance: The hmeasure package. URL http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf.
Back, B., Laitinen, T., & Sere, K. (1996). Neural networks and genetic algorithms for bankruptcy predictions. Expert Systems with Applications, 11(4), 407–413.
Barandela, R., Hernández, J. K., Sánchez, J. S., & Ferri, F. J. (2005). Imbalanced training data set reduction and feature selection through genetic optimization. CCIA, 131, 215–222.
Barandela, R., Sánchez, J. S., García, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849–851.
Benardos, P. G., & Vosniakos, G. C. (2007). Optimizing feedforward artificial neural network architecture. Engineering Applications of Artificial Intelligence, 20(3), 365–382.
Bruzzone, L., & Serpico, S. B. (1997). Classification of imbalanced remote-sensing data by neural networks. Pattern Recognition Letters, 18(11), 1323–1328.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1–6.
Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In Knowledge discovery in databases: PKDD 2003 (pp. 107–119). Berlin Heidelberg: Springer.
Chen, Y. L., Hsu, C. L., & Chou, S. C. (2003). Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications, 25(2), 199–209.
Datta, P., Masand, B., Mani, D. R., & Li, B. (2000). Automated cellular modeling and prediction on a large scale. Artificial Intelligence Review, 14, 485–502.
Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II (p. 11).
Engelmann, B., Hayden, E., & Tasche, D. (2003). Measuring the discriminative power of rating systems. Discussion Paper, Series 2: Banking and Financial Supervision, 2003, 01.
Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291–316.
Fletcher, D., & Goss, E. (1993). Forecasting with neural networks: An application using bankruptcy data. Information & Management, 24(3), 159–167.
García, S., & Herrera, F. (2009). Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation, 17(3), 275–306.
Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Lecture Notes in Computer Science, 3644, 878–887.
Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77, 103–123.
Harp, S. A., & Samad, T. (1991). Optimizing neural networks with genetic algorithms. In Proceedings of the American Power Conference (pp. 1138–1143).
Harp, S. A., Samad, T., & Guha, A. (1989). Towards the genetic synthesis of neural networks. In Proceedings of the 3rd International Conference on Genetic Algorithms (pp. 360–369).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Heistermann, J. (1990). Learning in neural nets by genetic algorithms. In R. Eckmiller, et al. (Eds.), Proceedings of Parallel Processing in Neural Systems and Computers (ICNC) (pp. 165–168). Elsevier.
Hu, S., Liang, Y., Ma, L., & He, Y. (2009). MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the 2nd International Workshop on Computer Science and Engineering (pp. 13–17).
Hung, S. Y., Yen, D. C., & Wang, H. Y. (2006). Applying data mining to telecom churn management. Expert Systems with Applications, 31(3), 515–524.
Japkowicz, N., Myers, C., & Gluck, M. A. (1995). Novelty detection approach to classification. IJCAI, 1, 518–523.
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40–49.
Kang, P., Cho, S., & MacLachlan, D. L. (2012). Improved response modeling based on clustering, under-sampling, and ensemble. Expert Systems with Applications, 39(8), 6738–6753.
Khoshgoftaar, T. M., Seliya, N., & Drown, D. J. (2010). Evolutionary data analysis for the class imbalance problem. Intelligent Data Analysis, 14(1), 69–88.
Kim, H. J., & Shin, K. S. (2007). A hybrid approach based on neural networks and genetic algorithms for detecting temporal patterns in stock markets. Applied Soft Computing, 7(2), 569–576.
Kim, K. J. (2006). Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications, 30(3), 519–526.
Kim, K. J., & Ahn, H. (2012). A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach. Computers & Operations Research, 39(8), 1800–1811.
Kim, K. J., & Han, I. (2000). Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19(2), 125–132.
Kim, Y. S., & Street, W. N. (2004). An intelligent system for customer targeting: A data mining approach. Decision Support Systems, 37(2), 215–228.
Kim, K. S., & Han, I. (2001). The cluster-indexing method for case-based reasoning using self-organizing maps and learning vector quantization for bond rating cases. Expert Systems with Applications, 21(3), 147–156.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training data sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning (pp. 170–186). Nashville, USA.
Kubat, M., Holte, R. C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2–3), 195–215.
Kubat, M., Holte, R., & Matwin, S. (1997). Learning when negative examples abound. In Machine learning: ECML-97 (pp. 146–153). Berlin Heidelberg: Springer.
Lee, Y. C. (2007). Application of support vector machines to corporate credit rating prediction. Expert Systems with Applications, 33(1), 67–74.
Miller, G. F., Todd, P. M., & Hedge, S. U. (1989). Designing neural networks using genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms.
Odom, M. D., & Sharda, R. (1990). A neural network model for bankruptcy prediction. In Proceedings of the 1990 IJCNN International Joint Conference on Neural Networks (pp. 163–168).
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Shin, K. S., & Lee, K. J. (2004). Neuro-genetic approach for bankruptcy prediction modeling. In Knowledge-based intelligent information and engineering systems (pp. 646–652). Berlin Heidelberg: Springer.
Shin, K. S., & Han, I. (1999). Case-based reasoning supported by genetic algorithms for corporate bond rating. Expert Systems with Applications, 16(2), 85–95.
Shin, K. S., & Han, I. (2001). A case-based approach using inductive indexing for corporate bond rating. Decision Support Systems, 32(1), 41–52.
Shin, K. S., Lee, T. S., & Kim, H. J. (2005). An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications, 28(1), 127–135.
Son, J. S., Lee, D. M., Kim, I. S., & Choi, S. K. (2004). A study on genetic algorithm to select architecture of an optimal neural network in the hot rolling process. Journal of Materials Processing Technology, 153, 643–648.
Tambe, P. (2014). Big data investment, skills, and firm value. Management Science, 60(6), 1452–1469.
Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6(11), 769–772.
Whitely, D., & Hanson, T. (1989). Optimizing neural networks using faster, more accurate genetic search. In Proceedings of the 3rd International Conference on Genetic Algorithms (pp. 391–396).
Whitley, D., Starkweather, T., & Bogart, C. (1990). Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, 14(3), 347–361.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbour rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 3, 408–421.
Yen, S. J., & Lee, Y. S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718–5727.
Zhang, G., Hu, M. Y., Patuwo, B. E., & Indro, D. C. (1999). Artificial neural networks in bankruptcy prediction: General framework and cross-validation analysis. European Journal of Operational Research, 116(1), 16–32.
Zhou, L. (2013). Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowledge-Based Systems, 41, 16–25.