Accepted Manuscript
A new hybrid ensemble credit scoring model based on classifiers consensus system approach
Maher Alaraj, Maysam F. Abbod
PII: S0957-4174(16)30362-1
DOI: 10.1016/j.eswa.2016.07.017
Reference: ESWA 10763
To appear in: Expert Systems With Applications
Received date: 28 March 2016
Revised date: 10 July 2016
Accepted date: 11 July 2016
Please cite this article as: Maher Alaraj , Maysam F. Abbod , A new hybrid ensemble credit scoring model based on classifiers consensus system approach, Expert Systems With Applications (2016), doi: 10.1016/j.eswa.2016.07.017
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
- A new hybrid ensemble model for the credit scoring problem is proposed.
- An improved data filtering technique is developed based on the GNG method.
- GNG and MARS combined proved to be better than applying them individually.
- Our model is validated on three performance measures over seven credit datasets.
- Classifiers' decisions after consensus effectively improved prediction performance.
A new hybrid ensemble credit scoring model based on classifiers consensus system approach

Maher Ala'raj*, Maysam F. Abbod

Department of Electronic and Computer Engineering, College of Engineering, Design and Physical Sciences, Brunel University London, Kingston Lane, Uxbridge UB8 3PH, UK

* Corresponding author. Tel.: +447466925096. E-mail addresses: maher.ala'[email protected] (M. Ala'raj), [email protected] (M.F. Abbod).

ABSTRACT
During the last few years there has been marked attention towards hybrid and ensemble systems development, having proved their ability to be more accurate than single classifier models. However, among the hybrid and ensemble models developed in the literature there has been little consideration given to: 1) combining data filtering and feature selection methods; 2) combining classifiers of different algorithms; and 3) exploring different classifier output combination techniques other than the traditional ones found in the literature. In this paper, the aim is to improve predictive performance by presenting a new hybrid ensemble credit scoring model through the combination of two data pre-processing methods based on Gabriel Neighbourhood Graph editing (GNG) and Multivariate Adaptive Regression Splines (MARS) in the hybrid modelling phase. In addition, a new classifier combination rule based on the consensus approach (ConsA) of different classification algorithms during the ensemble modelling phase is proposed. Several comparisons are carried out in this paper, as follows: 1) comparison of individual base classifiers with the GNG and MARS methods applied separately and combined, in order to choose the best results for the ensemble modelling phase; 2) comparison of the proposed method with all the base classifiers and with ensemble classifiers using the traditional combination methods; and 3) comparison of the proposed approach with recent related studies in the literature. Five well-known base classifiers are used, namely, neural networks (NN), support vector machines (SVM), random forests (RF), decision trees (DT), and naïve Bayes (NB). The experimental results, analysis and statistical tests prove the ability of the proposed method to improve prediction performance against all the base classifiers, hybrid models and the traditional combination methods in terms of average accuracy, the area under the curve (AUC), the H-measure and the Brier score. The model was validated over seven real-world credit datasets.

Keywords: credit scoring; consensus approach; classifier ensembles; hybrid models; data filtering; feature selection.
1. Introduction

1.1. Background
Managing customer credit is an important issue for commercial banks and hence, they take great care when dealing with customer loans to avoid any improper decisions that can lead to loss of opportunity or financial losses. The manual estimation of customer creditworthiness has become both time- and resource-consuming. Moreover, a manual approach is subjective (dependent on the bank employee who gives the estimation), which is why devising and implementing programmed models that provide loan estimations is being pursued, as this can eradicate the 'human factor' from the process. Such a model should be able to provide recommendations to the bank in terms of whether or not a loan should be given and/or give a probability in relation to whether the loan will be repaid. Nowadays, a number of models have been designed, but there is no ideal classifier amongst them as each gives some percentage of incorrect outputs, which is a critical consideration when each percentage point of incorrect answers can mean millions of dollars of losses for large banks.

The area of credit scoring has become an extensively researched topic for scholars and the financial industry (Kumar & Ravi, 2007; Lin et al., 2012), with many models having been proposed and developed using statistical approaches, such as logistic regression (LR) and linear discriminant analysis (LDA) (Desai et al., 1996; Baesens et al., 2003). As a result of the financial crises, the Basel Committee on Banking Supervision requested all banks to apply rigorous credit evaluation models in their systems when granting a loan to an individual client or a company. Accordingly, research studies have demonstrated that artificial intelligence (AI) techniques (e.g. neural networks, support vector machines and random forests) can be a good replacement for statistical approaches in building credit scoring models (Atiya, 2000; Bellotti and Crook, 2009).

The utilisation of the different techniques in building credit-scoring models has varied over time: researchers initially used each technique individually and later, in order to overcome the shortcomings of applying them in this way, they started to customise the design of credit-scoring models. That is, they began to introduce complexity into their designs, with new approaches, such as hybrid and ensemble modelling, that provided better performance than the use of individual techniques. Hybrid and ensemble approaches can be utilised independently or in combination. The basic idea behind the former is to conduct a pre-processing step on the data that are fed to the classifiers. Hybrid modelling can take many forms, including: 1) cascading different approaches (Lee et al., 2002; Garcia et al., 2012); 2) combining clustering and classification methods (Tsai and Chen, 2010); and 3) combining different methods synergetically into single approaches, such as fuzzy-based rules (Gorzałczany and Rudzinski, 2016); the focus of this paper is on the first type of technique. Ensemble modelling, in contrast, focuses on gathering the insight of a group of classifiers trained on the same problem and using their opinions to reach an effective classification decision. In spite of complex modelling being associated with significant financial and computational cost, we believe that complexity leads to better and more universal classification models for credit scoring and hence, the creation of such a complex model is the main aim of this paper.
1.2. Research motivations

Our research motivation in this paper is threefold, relating to: 1) data filtering, 2) feature selection and 3) ensemble combination. The general process in credit scoring modelling is to use the credit history of former clients to compute and predict new applicants' risk of default (Tsai and Wu, 2008; Thomas, 2009). The collected historical loans are used to build a credit scoring model that maps the attributes or features of each loan applicant to a measure of their probability of default. The available features make up the feature space, and high dimensionality within it has advantages, but also some serious deficiencies (Yu and Liu, 2003). In practice, real historical datasets are used in order to develop credit-scoring models; these datasets might differ in size, nature, and the information or characteristics they hold, which can cause difficulties in classifier training and hence an inability to capture the different relationships among the dataset characteristics. Such datasets can include noisy data, missing values, redundant or irrelevant features and can exhibit complexity of distribution (Piramuthu, 2006). Practically, the more features in a dataset, the more computation time is required, and this could lead to low model accuracy and inefficient scoring interpretation results (Liu and Schumann, 2005).
One solution is to perform feature selection on the original data. Another problem that can arise with the original dataset is that there might not be any data points located at the exact points that would allow for the most accurate and concise concept description (Wilson and Martinez, 2000). A solution to this is to reduce the original data by filtering out problematic samples. The majority of credit scoring studies have used feature selection as a pre-processing step to clean their data from any noise that can disturb the training process (Liu and Schumann, 2005; Huang et al., 2007; Bellotti and Crook, 2009; Tsai, 2009; Chen and Ma, 2009; Chen and Li, 2010; Tsai & Chen, 2010; Akkoc, 2012; Tsai, 2014; Harris, 2014). On the other hand, to our knowledge only three studies have considered data filtering in their approaches (Tsai and Chou, 2011; Tsai and Cheng, 2012; Garcia et al., 2012). Few studies have considered two-stage modelling in their approach and, in order to fill this gap, we consider a two-stage model based on data filtering and feature selection in addition to our proposed approach, which is based on classifier cooperation.

The purpose of data filtering is to reduce the size of the original dataset and to produce a representative training dataset, whilst keeping its integrity (Wilson & Martinez, 2000). Data that are noisy or contain outliers can have a strong effect on model performance, as can redundant and irrelevant features. According to Tsai & Chou (2011), in some cases removing outliers can increase the classifiers' performance and accuracy by smoothing the decision boundaries between data points in the feature space. In general, an outlier in a dataset is a sample that appears to be inconsistent with the other samples in the same dataset; such data can be atypical data, data without a prior class or data that are mislabelled. If any of these appear in a dataset, then the outliers must be eliminated by filtering out the samples that hold such characteristics, since their occurrence can hinder the training process and lead to inefficient training of the classifiers. Another important step in data pre-processing is feature selection, which refers to choosing the most relevant and appropriate features and, accordingly, removing the unneeded ones. In other words, it pertains to the process of selecting a subset of representative features that can improve model performance.

The first motivation for this paper is to develop hybrid classifiers using data filtering and feature selection approaches. After training these, the results obtained from the classifiers are compared comprehensively in terms of using feature selection and data filtering separately as well as combined. To the best of our knowledge, combining a data-filtering technique with a feature-selection technique has not been considered before in the area of credit scoring. Nowadays, the research trend has been actively moving towards combining single AI techniques to build ensemble models (Wang et al., 2011). According to Tsai (2014), the idea of ensemble classifiers is based on the combination of a pool of diversified classifiers, which leads to better performance as each complements the other classifiers' errors.
In the credit-scoring literature, most of the classifier combination techniques adopt the form of homogeneous or heterogeneous classifier ensembles, where the former combines classifiers of the same algorithm, whilst the latter combines those of different algorithms (Partalas et al., 2010; Lessmann et al., 2015; Tsai, 2014). As Nanni & Lumini (2009) point out, an ensemble classifier is a group of classifiers whose decisions are combined using the same approach. In the light of the above discussion, a recent paper was presented by the authors (Ala'raj and Abbod, 2016), in which they developed a credit scoring model based on a heterogeneous ensemble of classifiers and combined their rankings using a new combination rule called the consensus approach, through which the classifiers work as a team to reach an agreement on the final output for every data point. They carried out a comprehensive analysis, comparing their approach with several other classifiers, and theirs was demonstrated as having significantly superior predictive performance across various measures and datasets. The second and main motivation of this paper is to enhance the results of this approach by: 1) proposing a new way of evaluating the consensus between classifiers; and 2) investigating the use of data pre-processing, including data filtering and feature selection, to see to what extent the consensus approach results can be improved upon, as Ala'raj and Abbod (2016) solely used the raw data for training.

Based on the above motivations, a new hybrid ensemble model is proposed that combines hybrid and ensemble modelling, bringing data filtering together with feature selection. In the next stage, the outcomes of the hybrid classifiers are combined using the enhanced consensus approach, with the aim of increasing accuracy and achieving better predictive performance. Experimentally, the consensus approach comprises combining the decisions of the classifier ensembles after hybridising them. The generalisation ability of the proposed model is evaluated against four widely approved performance indicator measures across several real-world financial credit datasets. Moreover, the proposed model is compared to the other models developed in this paper in addition to some literature results that are considered as benchmarks.

The structure of the paper is as follows: Section 2 provides an introduction to data filtering, feature selection and ensemble models, with comparison and analysis of the related literature in terms of the datasets, base models, hybrid and ensemble models, combination methods and the evaluation performance measures used. Section 3 explains the data filtering method, the feature selection method and the classifier consensus approach adopted in this paper. Section 4 describes the experimental setup of the paper, whilst Section 5 presents the experimental results and analysis. Finally, in Section 6 conclusions are drawn and future work possibilities are discussed.
2. Literature review

2.1. Feature selection
Datasets, in general, contain different attributes or features, and these vary from one dataset to another. Datasets can include irrelevant or redundant features that make it difficult to train models, thus leading to low performance and accuracy. As a result, analysing features in terms of their importance has become an essential task in data pre-processing for data mining in general and credit scoring in particular, in an effort to enhance the chosen model's prediction performance (Tsai, 2009; Yao, 2009). In the literature, many studies have conducted data pre-processing to hybridise their models. Lee & Chen (2005) built a two-stage hybrid model based on MARS to choose the most significant variables, which provided the best performance when compared to other techniques. Chen & Ma (2009) proposed a hybrid SVM model based on three strategies, namely CART, MARS and grid search, with the best results being delivered when using MARS with this model. Chen & Li (2010) proposed four methods to select the best feature subset for building a hybrid SVM model. The results showed that such a model is very robust and effective in finding the best feature subsets for credit scoring models.

The main reason for using feature selection is to choose an optimal subset of features for improving prediction accuracy, or for decreasing the size of the data structure without significantly decreasing the prediction accuracy of the classifier. This type of data pre-processing is important for many classifiers as it increases learning speed and accuracy on the testing set. To fulfil this purpose, the MARS technique can be used to determine the most valuable variables or features in the input data. MARS is a commonly applied classification technique, nowadays widely accepted by researchers and practitioners in the field of credit scoring, for the following reasons:

- It is capable of modelling complex non-linear relationships among variables without strong model assumptions;
- It can evaluate the relative importance of independent variables to the dependent variable when many potential independent variables are considered;
- It does not require a long training process and hence can save a lot of model building time, especially when working with a large dataset;
- It gives models that can be easily interpreted (Lee et al., 2002; Lee and Chen, 2005).
Moreover, according to Sholom and Indurhnya (1998), the MARS method is suitable for feature selection when the number of variables is not very large. Hence, the selection of MARS for the feature selection task is justified for the current enquiry.

2.2. Data filtering
Data filtering is used to improve the results of machine-learning classifiers, which should be trained on some training data before being applied to the testing set. It improves the training set by removing inaccurate samples, i.e. those samples that stand out from the whole picture. For example, a loan in the sample data that is labelled as bad amongst many good loans with similar characteristics would need to be removed from the training set. Tsai and Cheng (2012) carried out an investigation into the removal of different percentages of outliers in the data using simple distance-based clustering and examined the performance of prediction models developed using four different classification models. The results demonstrated different performance abilities due to the structure of the datasets used. Garcia et al. (2012) conducted a study using a wide range of filtering algorithms and applied these to a credit-scoring assessment problem. Specifically, they used a total of 20 filtering algorithms, all of which showed superiority over the original training set, and of those, they reported that the RNG (Relative Neighbourhood Graph editing) filtering algorithm, which is based on proximity graphs, was the most statistically significant when compared with the others. Consequently, the idea of proximity graphs is adopted in this paper for the filtering algorithm used to pre-process the training data of the collected datasets. The motivation behind applying a data filtering algorithm in this paper lies in the belief that training a classifier with the filtered dataset can have several benefits (Garcia et al., 2012), such as:

- The decision boundaries are smooth and clear;
- It is easier for the classifiers to discriminate between the classes;
- The size of the training set is decreased, leaving in it only the really important data;
- The accuracy performance of the model is improved;
- Computational costs can be reduced.
The particular algorithm employed for fulfilling the task of data filtering is Gabriel Neighbourhood Graph editing (GNG), which is based on proximity graphs and was chosen for the following reasons:

- Proximity graphs are used to avoid incoherent data (when some regions are full of points and some have only a few);
- Proximity graphs find neighbours in each direction, so if some point has two neighbours, one directly behind the other, the second will not be counted. This feature is not available in k-NN filtering, where the directions of the neighbours do not count;
- Proximity graphs describe the structure of the data very well, in that for each point the algorithm finds the closest matches to it.
2.3. Ensemble models
Alongside the hybrid techniques, another method used by researchers is the ensemble learning method, or multiple classifier system, which is the most recent to be introduced to credit-scoring evaluation (Lin & Zhong, 2012). It involves applying multiple classifiers rather than single ones in order to achieve higher accuracy in the results. The difference between the ensemble and hybrid methods is that for the former, the outputs of the multiple classifiers are pooled to give a decision, whilst for the latter, only one classifier gives the final output and the other classifiers' results are processed as an input to this final classifier (Verikas et al., 2010). The ensemble method for building credit-scoring models is valued for its ability to outperform the best single classifier's performance (Kittler et al., 1998). A key issue in building an ensemble classifier is to make each single classifier as different from the others as possible; in other words, the aim is to be as diverse as possible (Nanni & Lumini, 2009). Most of the work on ensembles in the domain of credit scoring has focused on homogeneous ensemble classifiers with simple combination rules and basic fusion methods, such as majority voting (MajVot), weighted average (WAVG), weighted voting (WVOT), reliability-based methods (i.e. MIN, MAX, PROD), stacking and fuzzy rules (Wang et al., 2012; Tsai, 2014; Yu et al., 2009; Tsai & Wu, 2008; West et al., 2005; Yu et al., 2008). A few researchers have employed heterogeneous ensemble classifiers in their studies, but still with the aforementioned combination rules (Ala'raj and Abbod, 2016; Lessmann et al., 2015; Wang et al., 2012; Hsieh & Hung, 2010; Tsai, 2014). In ensemble learning, all the classifiers are trained independently to produce their decisions, which are subsequently combined via a heuristic algorithm to produce one final decision (Zhang et al., 2014; Rokach, 2010). Table 1 summarises the extant ensemble studies in relation to whether they were homogeneous or heterogeneous; as can be seen, most involved adopting a homogeneous classifier ensemble with the traditional combination rules. Two studies solely applied heterogeneous ensembles, while three employed both homogeneous and heterogeneous ones.
Table 1. Ensemble model studies

| Year | Study | Combination rule |
|---|---|---|
| 2005 | West et al. | Majority vote, Weighted average |
| 2006 | Lai et al. | Majority vote, Reliability-based |
| 2008 | Tsai and Wu | Majority vote |
| 2008 | Yu et al. | Majority vote, Reliability-based |
| 2009 | Nanni and Lumini | Sum rule |
| 2009 | Yu et al. | Fuzzy GDM (group decision making) |
| 2010 | Hsieh and Hung | Confidence-weighted average |
| 2010 | Yu et al. | Majority vote, Weighted average, ALNN (adaptive linear neural network) |
| 2010 | Zhang et al. | Majority vote |
| 2010 | Zhou et al. | Majority vote, Reliability-based, Weights based on tough samples |
| 2010 | Partalas et al. | Weighted voting |
| 2011 | Wang et al. | Majority vote, Weighted average, stacking |
| 2011 | Finlay | Majority vote, Weighted average, mean |
| 2012 | Wang et al. | Majority vote |
| 2012 | Wang and Ma | Majority vote |
| 2012 | Marques et al. (a) | Majority vote |
| 2012 | Marques et al. (b) | Majority vote |
| 2014 | Tsai | Majority vote, Weighted vote |
| 2014 | Abellan and Mantas | Majority vote |
| 2015 | Lessmann et al. | Majority vote, Weighted average, stacking |
| 2015 | Zhou | Majority vote |
| 2016 | Xiao et al. | Majority vote, Weighted vote |
| 2016 | Ala'raj and Abbod | Consensus approach |

3. New hybrid ensemble credit scoring model

3.1. Gabriel Neighbourhood Graph editing (GNG)
The idea behind data filtering is the selection of the data outliers: data points whose labels are weakly associated with those of their neighbours, such that it is assumed that some mistake in data collection or representation was made and hence the data point may contain an error. Consequently, it is best not to include such data in the training process. An efficient way to reflect the structure of the data and the interconnections between training set entries is to represent the training data as a graph. The simplest way to do this is to connect two data points when they are close enough, such that $x_i$ and $x_j$ are connected if $d(x_i, x_j) < \varepsilon$, where $\varepsilon$ is chosen manually. Whilst this method is easy, in the case of a non-uniform data distribution it is impossible to provide coherency of data representation: some graph areas are full of edges, whereas others have only a few. Moreover, it is not guaranteed that the obtained graph is connected. Consequently, it is better to use proximity graphs, which are built without using a fixed distance; instead, two data points are connected according to their neighbours' locations with respect to them. GNG (Garcia et al., 2012) is used as a special case of proximity graphs, which provides a list of neighbours for each point of the training set, and is defined as follows: two points $x_i$ and $x_j$ are Gabriel neighbours if and only if

$$d^2(x_i, x_j) \le d^2(x_i, x_k) + d^2(x_j, x_k) \quad \text{for all } x_k,\; k \ne i, j \qquad (1)$$

Figure 1 demonstrates the connection between two points in the GNG. In simple terms, the idea is to connect the points $x_i$ and $x_j$ if and only if there is no point inside the circle with the segment $[x_i, x_j]$ as a diameter; otherwise the two points are not connected. The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated as:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \qquad (2)$$

Fig. 1. Illustration of GNG edge connection (Gabriel & Sokal, 1969)
Figure 2 illustrates the construction of a GNG on a 2-D training dataset, showing how points are connected and the process of filtering data points, which depends on meeting certain conditions set by the GNG algorithm.
Fig. 2. The construction of GNG for 2-D on training data (Gabriel & Sokal, 1969)
As can be seen from Figure 2, the GNG is constructed with the training data points connected, and for each sample $x_i$ the weighted average of all its neighbours' labels is evaluated, with the weights chosen in inverse proportion to the distance of each neighbour from $x_i$. Also, two scalar values, which can be interpreted as thresholds (one for good loans and one for bad loans), are defined. Subsequently, two conditions are checked:

- If the label of $x_i$ is equal to 0 and the weighted average over its GNG neighbours is greater than the good-loan threshold, $x_i$ is removed (filtered) from the training set;
- If the label of $x_i$ is equal to 1 and the weighted average over its GNG neighbours is less than the bad-loan threshold, $x_i$ is removed (filtered) from the training set;
- If neither condition is satisfied, $x_i$ remains in the training set.

In balanced datasets, where the number of '0' labels is approximately equal to the number of '1' labels, it is wise to use 0.5 for both thresholds. However, in the case of imbalanced datasets, when the number of bad loans is far less than the number of good ones, setting both thresholds to 0.5 leads to excessive filtering of entries labelled '1'. For instance, if the dataset is imbalanced, the goal is to tighten the conditions for good loans as they are the majority, so the good-loan threshold is decreased to remove more good loans (the advantage here outweighs the disadvantage: even if a non-noisy loan is removed, it is guaranteed that the noisy ones are removed); regarding the minority class, the bad-loan threshold is decreased to keep as many bad loans as possible. For imbalanced datasets, the thresholds are chosen based on the best accuracy on the training set. Therefore, the new enhancement of the GNG proposed in this paper consists of the following two modifications:

- Using a weighted average over the GNG neighbours instead of a simple one, in order to count far points less than close points for each $x_i$, thereby finding outliers more precisely;
- Using various thresholds so as to avoid excessive filtering of bad-loan entries.
For a clearer understanding, the steps of the GNG filtering algorithm are summarised in the following pseudo-code:

1. Compute the GNG for all entries of the training set:
   For every pair $(x_i, x_j)$ of the training set:
     Check whether to connect them in the GNG using Equation (1).
   End for
2. For each classifier, the optimal good-loan and bad-loan thresholds are evaluated beforehand.
3. For each entry $x_i$ of the training set:
     Compute the vector of actual labels of all Gabriel Graph neighbours of $x_i$;
     Compute the vector of distances from $x_i$ to its Gabriel Graph neighbours;
     Evaluate the weighted average of the neighbours' labels as the scalar product of the label vector with the inverse-distance weight vector:
       If the label of $x_i$ is equal to 0 and the weighted average is greater than the good-loan threshold, then $x_i$ is removed from the training set;
       If the label of $x_i$ is equal to 1 and the weighted average is less than the bad-loan threshold, $x_i$ is removed from the training set.
   End for
4. Perform the training stage of the selected classifier on the reduced training set.
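To make the filtering rule concrete, the following is a minimal Python/NumPy sketch of the GNG construction and filtering described above; the function names, the brute-force graph construction and the inverse-distance weighting are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gabriel_neighbours(X):
    """Return, for each point, the indices of its Gabriel Graph neighbours.
    Two points i and j are connected iff no third point lies inside the circle
    having the segment [x_i, x_j] as its diameter, i.e.
    d(i, j)^2 <= d(i, k)^2 + d(j, k)^2 for all k (Equation (1))."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # squared distances
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            others = [k for k in range(n) if k != i and k != j]
            if all(d2[i, j] <= d2[i, k] + d2[j, k] for k in others):
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours

def gng_filter(X, y, t_good=0.5, t_bad=0.5):
    """Remove samples whose label disagrees with the distance-weighted
    average label of their Gabriel neighbours (thresholds t_good, t_bad)."""
    neighbours = gabriel_neighbours(X)
    keep = np.ones(len(y), dtype=bool)
    for i, nbrs in enumerate(neighbours):
        if not nbrs:
            continue
        d = np.linalg.norm(X[nbrs] - X[i], axis=1)
        w = 1.0 / (d + 1e-12)                     # closer neighbours weigh more
        wavg = np.dot(w, y[nbrs]) / w.sum()       # weighted average of labels
        if y[i] == 0 and wavg > t_good:           # good loan surrounded by bad ones
            keep[i] = False
        elif y[i] == 1 and wavg < t_bad:          # bad loan surrounded by good ones
            keep[i] = False
    return X[keep], y[keep]
```

The brute-force graph construction above is cubic in the number of samples and is meant only to make the filtering rule explicit; an efficient implementation would build the Gabriel graph from a Delaunay triangulation.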
3.2. Multivariate Adaptive Regression Splines (MARS)

MARS is a non-parametric and non-linear regression technique developed by Friedman (1991), which models the complex relationships between the independent input variables and the dependent target variable. It is a member of the regression analysis family of methods and can be described as an extension of linear regression. The MARS model takes care of non-linearity: it is constructed by assuming the result value at the unknown points and then converting the linear model into a non-linear one. The conversion occurs by creating knots on the arguments using hinge functions. The advantage of converting the model with these functions is that different combinations of them can form a complex model, which lies as close as possible to the real results. A hinge function takes the form

$$h(x) = \max(0,\, x - c) \quad \text{or} \quad h(x) = \max(0,\, c - x)$$

where $c$ is a constant called a knot. At the knot a hinge function changes its direction; in fact, a single hinge function is a combination of two linear pieces, the second of which continues the first after the knot.

After finishing the training stage, the obtained MARS model is a mathematical generalised linear model which fits the data well. However, this model can be used not only to classify entries, but also to analyse the input data and to find the most important features, which are highly correlated with the target labels. The aim of feature selection is to choose a subset of features for improving the prediction accuracy, or for decreasing the size of the structure without significantly decreasing the accuracy of a classifier built using only the selected features. ANOVA decomposition is used on the MARS mathematical model to determine the most valuable and important features of the input data. The main characteristic that distinguishes MARS from other classifiers is that its results are easily interpreted, and ANOVA can then be used with this model to investigate the input data structure, particularly regarding feature importance. ANOVA decomposition is the process of separating the MARS model into groups of functions that depend on the different variables (features). Thus, when analysing this decomposition, one of the groups is removed to see how much performance drops; the more it drops, the more important the feature is. The steps of the MARS feature selection process can be summarised in pseudo-code as follows. Suppose there are $N$ iterations, for each of which the dataset is divided into training and testing parts.

1. For $i$ from 1 to $N$ do
     Evaluate the MARS algorithm using the training set with the given parameters:
     - the maximum number of hinge functions allowed in the model (the default value for this parameter is -1, in which case maxFuncs is calculated automatically from the number of input variables $d$ (Jekabsons, 2009));
     - the penalty value per knot (larger values lead to fewer knots being placed, i.e. a simpler final model).
     Perform ANOVA decomposition on the obtained model using the ARESLab (Matlab) function for ANOVA decomposition and variable importance assessment, and store the importance of the $k$-th input feature during the $i$-th iteration as $imp_i(k)$.
   End For
2. Assume $M$ is the total number of features of the data.
   For $s$ from 1 to $M$ do
     $$Imp(s) = \sum_{i=1}^{N} imp_i(s) \qquad (3)$$
   End for
   Return $(Imp(1), \ldots, Imp(M))$.
3. Suppose that for each single classifier the optimal feature importance threshold was evaluated beforehand. Thus, for training and testing, only those features $s$ are chosen for which $Imp(s)$ exceeds this threshold.
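The aggregation of importances over the $N$ iterations (Equation (3)) and the final thresholding can be sketched in Python as below; the MARS fit itself is passed in as a callable, since the paper performs this step with the ARESLab toolbox in Matlab, and the split ratio shown is only an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def mars_feature_selection(X, y, importance_fn, n_iterations=10, threshold=0.0, test_size=0.3):
    """Accumulate per-feature MARS importances over several random
    train/test splits (Equation (3)) and keep the features whose total
    importance exceeds a pre-tuned threshold."""
    total = np.zeros(X.shape[1])
    for i in range(n_iterations):
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=test_size, random_state=i)
        # importance_fn fits a MARS model on (X_tr, y_tr) and returns one
        # ANOVA-based importance value per input feature, i.e. imp_i(k).
        total += importance_fn(X_tr, y_tr)
    selected = np.where(total > threshold)[0]    # features with Imp(s) above the threshold
    return selected, total
```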
3.3. The classifier consensus system approach
The basic idea behind combining classifier decisions is that, when making a decision, one should not rely on a single classifier only; rather, the classifiers should participate jointly in the decision-making process by combining or fusing their individual opinions or decisions. Consequently, the core problem that needs to be addressed when combining different classifiers is resolving the conflicts between them. In other words, the problem is how to combine the results of different classifiers to obtain a better result (Chitroub, 2010; Xu et al., 1992). In this section, a new combination method based on classifier consensus is introduced to the field of credit scoring, in which the classifiers of the ensemble interact in a cooperative manner in order to reach an agreement on the final decision for each data sample. Tan (1993) stressed that classifiers working in collaboration can considerably outperform those working independently. The idea of the consensus approach is not new and has been investigated in many studies in different fields, such as statistics, remote sensing, geography, classification, web information retrieval, multi-sensory data and the financial domain (Tan, 1993; DeGroot, 1975; Benediktsson and Swain, 1992; Shaban et al., 2002; Basir and Shen, 1993; Ala'raj and Abbod, 2016). We adopted the general guidelines of DeGroot (1975) and Shaban et al. (2002), who proposed a framework that provides a comprehensive and practical set of procedures for the construction of the consensus theory, where the interactions between the classifiers are modelled when agreement between them is needed. It is believed that their guidelines can be useful in the domain of credit scoring and credit risk evaluation. The goal of the consensus approach is to merge the rankings of the ensemble classifiers into one group ranking (the answer of the group); to do so, it comprises the following main stages.

I. Calculating the rankings of all the ensemble classifiers and building a decision profile

Consider a group of $N$ classifiers, denoted by $C_1, C_2, \ldots, C_N$. All the classifiers in the ensemble are trained and tested on the same input data points. After training, each classifier produces a ranking for an input data point, and this ranking, after applying a threshold, yields one of two possible answers: good loan or bad loan. The set of the two possible answers is denoted by $A$. For each classifier $C_i$, consider a ranking function $R_i$ which associates a non-negative number with every possible answer $a \in A$. The value of this function lies in the range $[0, 1]$ and expresses the desirability of the corresponding answer. The predictions of the classifiers are found by computing $R_i$ and applying a threshold to it. Hence, the rankings of each classifier for a given input data point satisfy:

$$\sum_{a \in A} R_i(a) = 1, \qquad R_i(a) \ge 0, \quad \forall a \in A \qquad (4)$$

where $R_i$ is the ranking of the two classes for each classifier given an input data point. After calculating each classifier's rankings, the decision profile can be represented in matrix form as follows:

$$DP = \begin{bmatrix} R_1(e_1) & R_1(e_2) & \cdots & R_1(e_n) \\ R_2(e_1) & R_2(e_2) & \cdots & R_2(e_n) \\ \vdots & \vdots & \ddots & \vdots \\ R_5(e_1) & R_5(e_2) & \cdots & R_5(e_n) \end{bmatrix} \qquad (5)$$

where $n$ is the number of input data points in the training/testing set, $e_i$ is the $i$-th input data point and $R_j(e_i)$, $j = 1, \ldots, 5$, is the $j$-th classifier's ranking for the $i$-th input data point. To evaluate the uncertainty between the classifiers, the $n$ columns of matrix $DP$ for the testing set are processed input by input. The main objective is to evaluate the common group ranking by aggregating the expected rankings of all classifiers and hence reach a consensus on the final ranking of each given input data point.
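As an illustrative sketch, the decision profile of equation (5) can be assembled from the trained base classifiers as follows, assuming each classifier exposes class-membership probabilities (e.g. a predict_proba method, as the scikit-learn implementations of the five base classifiers do); the names used are illustrative.

```python
import numpy as np

def build_decision_profile(classifiers, X):
    """Decision profile DP of equation (5): one row per classifier holding its
    bad-loan rankings R_i(1) for every input sample (one column per sample)."""
    rows = [clf.predict_proba(X)[:, 1] for clf in classifiers]  # P(class '1') as the ranking
    return np.vstack(rows)
```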
II. Calculating classifier uncertainty

After building the decision profile ($DP$) of the classifier rankings, the next stage is to find a function by which each classifier's uncertainty about its own decision can be computed. The task here is to give more weight to those classifiers that are less uncertain about their decision, and vice versa. Moreover, the assigned weights should reflect the contrast in the classifiers' decisions. At this stage, classifier uncertainty can be divided into two types: local (self) and global (conditional). Self-uncertainty relates to the quality of the classifier's own decision, while conditional uncertainty refers to this quality after being exposed to the other classifiers' decisions, which takes the form of a decision profile exchange between classifiers. At this stage, a classifier is able to review its uncertainty level and modify it in the light of its own decision as well as those of the other classifiers; in other words, this shows how a classifier is able to improve its decision when the other classifiers' decisions become available. $R_i(a)$ is the $i$-th classifier's ranking of answer $a$, and $R_i(a \mid C_j)$ is the $i$-th classifier's ranking of answer $a$ when it is exposed to the ranking of the $j$-th classifier. Consequently, the uncertainty matrix can be presented as follows:

$$U = \begin{bmatrix} U_{11} & U_{12} & \cdots & U_{1N} \\ U_{21} & U_{22} & \cdots & U_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ U_{N1} & U_{N2} & \cdots & U_{NN} \end{bmatrix} \qquad (6)$$

Matrix $U$ is evaluated using equations (7) and (8):

$$U_{ii} = -\sum_{a \in A} R_i(a) \log_2 R_i(a) \qquad (7)$$

$$U_{ij} = -\sum_{a \in A} R_i(a \mid C_j) \log_2 R_i(a \mid C_j) \qquad (8)$$

where $U_{ii}$ is the self-uncertainty and $U_{ij}$ is the conditional uncertainty of each classifier for each given input data point. Now, knowing that equation (4) is fulfilled, equation (8) requires that the conditional rankings also satisfy:

$$\sum_{a \in A} R_i(a \mid C_j) = 1, \qquad R_i(a \mid C_j) \ge 0, \quad \forall a \in A \qquad (9)$$

In the case of two possible answers, "0" for good loans and "1" for bad loans, equations (4) and (9) can, for simplicity, be converted into:

$$R_i(0) = 1 - R_i(1), \qquad R_i(0 \mid C_j) = 1 - R_i(1 \mid C_j) \qquad (10)$$

where $R_i(1)$ is the $i$-th classifier's ranking of answer "1" (bad loan) and $R_i(0)$ is the $i$-th classifier's ranking of answer "0" (good loan). Denote $R_i = R_i(1)$ and $R_{ij} = R_i(1 \mid C_j)$; then $R_i(0) = 1 - R_i$ and $R_i(0 \mid C_j) = 1 - R_{ij}$, and hence equations (7) and (8) can be converted into:

$$U_{ii} = -R_i \log_2 R_i - (1 - R_i) \log_2 (1 - R_i) \qquad (11)$$

$$U_{ij} = -R_{ij} \log_2 R_{ij} - (1 - R_{ij}) \log_2 (1 - R_{ij}) \qquad (12)$$

where $U_{ii}$, $i = 1, \ldots, 5$, is the local uncertainty of the $i$-th classifier, and $U_{ij}$, $i, j = 1, \ldots, 5$, $i \ne j$, is the global uncertainty of the $i$-th classifier when it knows the ranking of the $j$-th classifier. The reason why the uncertainties in equations (11) and (12) are evaluated using a logarithm with base 2 can be demonstrated by plotting equation (11), where $U_{ii}$ is a function of the parameter $R_i$:
Fig. 3. Uncertainty value $U_{ii}$ as a function of the parameter $R_i$

From the plot in Figure 3 it is clear that if the value of the classifier's ranking is close to the edges of the $[0, 1]$ interval, the uncertainty will be near zero (the classifier is certain about its decision). On the other hand, if the ranking is close to 0.5, the classifier's uncertainty will be close to 1, which is the maximum value of uncertainty (the classifier is very uncertain about its decision). Looking at matrix $U$, it is straightforward to calculate the self-uncertainty of each classifier, but when it comes to the conditional uncertainty there is no information available about how to calculate the rankings of the classifiers after they are exposed to each other, $R_i(1 \mid C_j)$. DeGroot (1974), Berger (1981) and Shaban et al. (2002) investigated convergence conditions for the optimal single decision of the group, but provided no information about how they calculated their conditional uncertainty. To evaluate the conditional rankings in $U$, an uncertainty approach is proposed here. Ala'raj and Abbod (2016) proposed a way of estimating the conditional rankings $R_i(1 \mid C_j)$ of matrix $U$ based on weighting the rankings of the $i$-th and $j$-th classifiers using the hyperbolic function tanh and global accuracy. In this paper, another way of estimating the conditional rankings of $U$ is put forward, which is based on: 1) calculating how far the $i$-th and $j$-th classifiers' rankings are (their distance) from the threshold (0.5), measuring how certain they are about their decisions; and 2) calculating the local accuracy (the strength of a classifier with regard to classifying loans around the given input test data point) of both classifiers when computing their uncertainty (the more locally accurate classifier will have less uncertainty, and vice versa).

The whole process of estimating $R_i(1 \mid C_j)$ and $U_{ij}$ is described in Algorithm (1):

1) Calculate the offsets of $R_i$ and $R_j$ from the threshold 0.5.
   - If both offsets are positive (both classifiers lean towards a bad loan), the certainty of the $i$-th classifier increases, so the conditional ranking $R_i(1 \mid C_j)$ is increased above $R_i$;
   - If both offsets are negative (both classifiers lean towards a good loan), the certainty of the $i$-th classifier increases, so the conditional ranking $R_i(1 \mid C_j)$ is decreased below $R_i$;
   - If the offsets have opposite signs, the conditional ranking $R_i(1 \mid C_j)$ is moved towards 0.5.
   Evaluate $U_{ij}$ according to equation (12).
The logic behind Algorithm (1) is to simulate the classifiers' communication behaviour in order to generate the conditional rankings from which $U_{ij}$ can be calculated. In the first two conditions, the certainty of the $i$-th classifier increases because of the similar opinion of the $j$-th classifier. The reasoning is that if the two classifiers simultaneously consider a loan as 'good', after communicating with each other their certainty in that decision will increase (and thus the ranking will decrease). On the other hand, if two classifiers simultaneously consider a loan as 'bad', after communicating, their certainty in that decision will increase (and thus the ranking will increase).

2) Update $U_{ij}$ taking into account $LA_i(e)$, the local accuracy of the $i$-th classifier on the input data point $e$, computed using the $i$-th classifier's answer error for exactly $k$ nearest-neighbour queries from the training set, together with a normalising coefficient $k_i$. In the current implementation the parameter $k$ is fixed. The parameters $k_1, k_2, k_3, k_4$ and $k_5$ are chosen using gradient descent, with the objective function being the global accuracy on the training set; for each iteration these parameters are evaluated separately.
After calculating all the values of the uncertainty matrix, a symmetrical matrix would be produced; however, because of the clear differences in single classifier performance we do not want the matrix to be symmetrical, which is why $U_{ij}$ is updated taking into consideration the classifiers' local accuracies, as in step (2). The reason for the update in step (2) is to take into account the local accuracy of a classifier: a classifier with low local accuracy is more uncertain about its decision (as the local accuracy is in the denominator). The coefficient $k_i$ is a normalising coefficient chosen lower than the lowest local accuracy of the $i$-th classifier, so that the denominator stays positive.

3) Local accuracy ($LA_i$) is estimated as the accuracy of each classifier in the local region of the feature space surrounding an unknown test point. Before applying the ensemble combiner, all single classifiers are trained and their predictions are evaluated on the training and testing sets. To combine the decisions of these classifiers, the local accuracy is evaluated for each entry of the testing set. Xiao et al. (2012) chose a non-negative distance as the local accuracy area and evaluated the accuracy over all entries of the training set located at less than this distance from the test point. As an enhancement, rather than using a simple mean value, a weighted average is evaluated over the points of the training set, with weights that are inversely proportional to the distance from the training entries to the test entry $e_0$. Hence, given the classifier's outputs on the training entries and the corresponding actual targets, the local accuracy is evaluated as:

$$LA_i(e_0) = \frac{\sum_{j} w_j\, a_j}{\sum_{j} w_j}, \qquad w_j \propto \frac{1}{d(e_j, e_0)} \qquad (13)$$

where $a_j$ measures the agreement between the $i$-th classifier's output for training entry $e_j$ and its actual target $\tilde{y}_j$ (e.g. 1 if correct and 0 otherwise), and $d(e_j, e_0)$ is the distance from $e_j$ to the test entry $e_0$.
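A minimal sketch of this distance-weighted local accuracy is given below; it assumes the 0/1 agreement form of equation (13) discussed above and restricts the weighted average to the k nearest training points, both of which are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def local_accuracy(x0, X_train, y_train, train_preds, k=10, eps=1e-12):
    """Distance-weighted local accuracy of one classifier around a test point x0,
    in the spirit of equation (13): nearby training points contribute more."""
    d = np.linalg.norm(X_train - x0, axis=1)
    idx = np.argsort(d)[:k]                           # the k nearest training entries
    w = 1.0 / (d[idx] + eps)                          # inverse-distance weights
    agreement = (train_preds[idx] == y_train[idx]).astype(float)
    return float(np.dot(w, agreement) / w.sum())
```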
III. Evaluating the weights of each classifier's uncertainty

Having calculated the uncertainties of the classifiers and presented all the uncertainty values in the uncertainty matrix, at this stage each classifier can assign weights to itself and to the other classifiers. The uncertainty weights are evaluated using the following equation and can be presented in a matrix, as with the uncertainty, which we call matrix $W$:

$$W_{ij} = \frac{U_{ij}^{-2}}{\sum_{k=1}^{N} U_{ik}^{-2}} \qquad (14)$$

Equation (14) is the result of a set of minimisation problems (Shaban et al., 2002), one problem for each $i$:

$$\min_{W_{i1}, \ldots, W_{iN}} \; \sum_{j=1}^{N} W_{ij}^{2}\, U_{ij}^{2} \quad \text{subject to} \quad \sum_{j=1}^{N} W_{ij} = 1 \qquad (15)$$
These problems are stated in this form to ensure that each classifier will assign high weights to classifiers with low global uncertainties and low weights to those with high global uncertainties. The problems are solved via the Lagrange method of undetermined coefficients, whose solution is equation (14); the detailed process of the derivation is described in Shaban et al. (2002).
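A short sketch of equation (14) in Python/NumPy follows; the handling of rows containing a zero uncertainty (all weight placed on the zero entries) mirrors the behaviour seen in the worked example later in this section and is stated here as an assumption.

```python
import numpy as np

def uncertainty_weights(U):
    """Row-normalised inverse-square weights of equation (14); a row of U that
    contains zero uncertainties puts all of its weight on those entries."""
    U = np.asarray(U, dtype=float)
    W = np.zeros_like(U)
    for i, row in enumerate(U):
        zeros = row == 0
        if zeros.any():
            W[i, zeros] = 1.0 / zeros.sum()
        else:
            inv_sq = 1.0 / row ** 2
            W[i] = inv_sq / inv_sq.sum()
    return W
```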
IV. Evaluating vector $\pi$ (the weight of each classifier)

Vector $\pi$ is the weight given to each classifier, reflecting the confidence of each classifier in its own decisions after being exposed to the results of the other classifiers. It is evaluated as an approximate solution of the following system:

$$\pi^{T} W = \pi^{T}, \qquad \sum_{i=1}^{N} \pi_i = 1 \qquad (16)$$

The weights of matrix $W$ are regarded as the transition matrix of a Markovian chain over the single classifiers, as stated in DeGroot (1974). Then, the stationary distribution of this chain can be evaluated using a system of equations. Sometimes there is no exact solution to this system, because the number of equations in it is one more than the number of variables; in this case, the Markovian chain does not converge to a stationary distribution. Equation (16) can be converted into the form:

$$\widetilde{U}\,\pi = \tilde{b} \qquad (17)$$

where

$$\widetilde{U} = \begin{pmatrix} (W - I)^{T} \\ \mathbf{1}^{T} \end{pmatrix}, \qquad \tilde{b} = (0, \ldots, 0, 1)^{T} \qquad (18)$$

Matrix $\widetilde{U}$ is a rectangular $(N+1) \times N$ matrix. The sum of the elements in each column of $(W - I)^{T}$ is equal to 0, so at least one row of this matrix is redundant and can be removed. Therefore, if

$$\operatorname{rank}(\widetilde{U}) = N \qquad (19)$$

then equation (17) has a single exact solution. To solve equation (16), the least squares method in Matlab can be used. This is also a new approach when compared to those in the articles by DeGroot (1974) and Berger (1981). Using the least squares method, it is not necessary to worry about the convergence of vector $\pi$, because the result of the approximate solution of equation (17), when equation (19) is fulfilled, is the same as that of DeGroot's (1974) iterative method with normalisation at each step, iterated until the norm of the change in $\pi$ becomes close to zero. In that paper, the final value of $\pi$ is called the 'equilibrium', as this value does not change again after reaching it. Generally, the equilibrium represents a balance between the single classifiers' opinions, i.e. $\pi$ is a common denominator for all classifiers and they all agree with it.

V. Aggregating the consensus ranking of each ensemble classifier
When all the classifiers reach a consensus about their decisions and there is no room for further decision updates, the aggregation of the consensus rankings is evaluated using the following equation:

$$R_{ConsA}(a) = \sum_{i=1}^{N} \pi_i\, R_i(a) \qquad (20)$$

Vector $\pi$ holds the importance weights of the single classifiers and the sum of all its elements equals 1, so the consensus ranking aggregation can be evaluated as a linear combination of the single classifier rankings. The length of vector $R_{ConsA}$ is equal to the size of the set of possible answers, and the sum of all its elements is equal to 1. The final prediction of the group, using ConsA, is the answer $a^{*}$ for which $R_{ConsA}(a)$ reaches its maximum value, which can be specified as:

$$a^{*} = \arg\max_{a \in A} R_{ConsA}(a) \qquad (21)$$
The pseudo-code below summarises the process of the classifiers' ConsA adopted in this work.

The ConsA pseudo-code (generating the common group ranking for one input sample)
Input: $R_i$ – ranking of answer '1' for each agent; $LA_i$ – accuracy of each agent.
Output: the common group ranking and the group answer.
For i = 1 to N do
  For j = 1 to N do
    If (i == j) then
      compute $U_{ii}$ (equation (11))
    Else
      compute $U_{ij}$ (Algorithm (1) and equation (12))
    End if
  End for
End for
Compute $W$ (equation (14)).
Compute $\widetilde{U}$ (equation (18)).
Compute $\pi$ from system (17); in Matlab, a least squares solver is used.
Compute the aggregate ConsA ranking $R_{ConsA}$ using equation (20).
Define the group aggregate answer using equation (21).
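Stages IV and V can be sketched as follows, assuming the weight matrix W has already been computed from the uncertainty matrix (equations (11)-(14) and Algorithm (1)); NumPy's least squares routine stands in here for the Matlab solver mentioned above.

```python
import numpy as np

def equilibrium_and_rank(R, W):
    """Given the classifiers' bad-loan rankings R and the weight matrix W of
    equation (14), solve the stationary-distribution system (17)-(18) for the
    equilibrium vector pi in the least squares sense and return the aggregate
    ConsA ranking of equation (20)."""
    R, W = np.asarray(R, float), np.asarray(W, float)
    n = len(R)
    U_tilde = np.vstack([(W - np.eye(n)).T,    # (W - I)^T : stationarity equations
                         np.ones((1, n))])     # the weights pi must sum to one
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(U_tilde, b, rcond=None)
    agg = float(pi @ R)                         # equation (20) for the 'bad loan' answer
    label = "bad" if agg > 0.5 else "good"      # equation (21) with two possible answers
    return agg, pi, label
```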
Figure 4 shows a flowchart for the ConsA, based on generating a common group ranking for one data sample or input.

Fig. 4. The process of ConsA (flowchart: the input data are split into training and testing data profiles; after data filtering, MARS feature selection deletes unnecessary features; the five base classifiers (neural network, SVM, decision tree, random forest, naïve Bayes) produce training and testing set decisions; from these, the local accuracy of each testing point, the uncertainty matrix and the weight matrix are calculated; the least squares method yields the equilibrium value π, from which the aggregate consensus ranking and the final answer of the group are obtained)
To make this clear, we provide an example of how ConsA works. Suppose that the five classifiers have the rankings $R = (0.8, 0.3, 0.4, 0.7, 0.6)$ and the local accuracies $LA = (0.77, 0.7, 0.65, 0.75, 0.65)$:

1) During the gradient descent the vector of parameters $k_1 = 1$, $k_2 = 2$, $k_3 = 0.5$, $k_4 = 1$, $k_5 = 0.3$ is obtained, which gives the best accuracy on the training set.

2) Calculate the uncertainty matrix $U$ (equation (11) is used for the diagonal elements, Algorithm (1) for the non-diagonal ones):

U =
[ 0.72  2.48  2.77  0.00  1.34 ]
[ 2.11  0.88  0.00  2.22  2.84 ]
[ 2.07  0.00  0.97  2.21  2.86 ]
[ 0.00  2.50  2.84  0.88  2.06 ]
[ 1.00  2.48  2.86  1.60  0.97 ]

$U_{11}$ is calculated as $U_{11} = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 \approx 0.72$ (0.8 is the first classifier's ranking, $R_1$), and the other diagonal elements are calculated in the same way. For the non-diagonal elements Algorithm (1) is used; $U_{12}$, for example, is calculated as follows: the offsets of $R_1 = 0.8$ and $R_2 = 0.3$ from the threshold have opposite signs, so the conditional ranking $R_1(1 \mid C_2)$ is moved towards 0.5; $U_{12}$ is then evaluated from this conditional ranking according to equation (12); finally, $LA_1(e)$ is evaluated according to equation (13) and $U_{12}$ is updated as in step (2) of Algorithm (1). The other non-diagonal elements are calculated by the same algorithm.

3) Evaluate the weight matrix $W$ using equation (14):

W =
[ 0.00  0.00  0.00  1.00  0.00 ]
[ 0.00  0.00  1.00  0.00  0.00 ]
[ 0.00  1.00  0.00  0.00  0.00 ]
[ 1.00  0.00  0.00  0.00  0.00 ]
[ 0.37  0.06  0.04  0.14  0.39 ]

In this example, the first four rows each have only one non-zero element. This is because the sum of inverse squares $\sum_j U_{ij}^{-2}$ is infinite for rows of $U$ that contain a zero, so in those rows the whole weight, equal to one, is placed where $U_{ij} = 0$. The last row of $U$ has no zeros, so its weights are evaluated directly; for example, for $W_{51}$ the sum of the inverse squares of the elements of the last row of $U$ is first evaluated, $1/1.00^2 + 1/2.48^2 + 1/2.86^2 + 1/1.60^2 + 1/0.97^2 \approx 2.74$, and then $W_{51} = (1/1.00^2)/2.74 \approx 0.37$. All other weights in this row are calculated in the same way.

4) Evaluate matrix $\widetilde{U}$ according to equation (18) and calculate vector $\pi$ such that $\widetilde{U}\pi = \tilde{b}$:

Ũ =
[ -1.00   0.00   0.00   1.00   0.37 ]
[  0.00  -1.00   1.00   0.00   0.06 ]
[  0.00   1.00  -1.00   0.00   0.04 ]
[  1.00   0.00   0.00  -1.00   0.14 ]
[  0.00   0.00   0.00   0.00  -0.61 ]
[  1.00   1.00   1.00   1.00   1.00 ]

To do this, the least squares method is used, $\pi = (\widetilde{U}^{T}\widetilde{U})^{-1}\widetilde{U}^{T}\tilde{b}$, giving

$$\pi = (0.3,\ 0.2,\ 0.2,\ 0.3,\ 0.0)$$

5) Evaluate the global final ranking as a linear combination of the single classifier rankings (equation (20)):

$$R_{ConsA}(1) = 0.8 \times 0.3 + 0.3 \times 0.2 + 0.4 \times 0.2 + 0.7 \times 0.3 + 0.6 \times 0.0 = 0.59$$

6) As the global final ranking is greater than 0.5, ConsA considers the loan as "bad".
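The final two steps of the example can be re-checked with a few lines of NumPy; the numbers below are taken directly from the example.

```python
import numpy as np

pi = np.array([0.3, 0.2, 0.2, 0.3, 0.0])   # equilibrium weights from step 4
R = np.array([0.8, 0.3, 0.4, 0.7, 0.6])    # individual classifier rankings
print(pi @ R)                              # ~0.59 > 0.5, so the loan is classed as "bad"
```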
4. Experimental design

As this paper extends the work of Ala'raj and Abbod (2016), in order to reach a fair comparison the decision was taken to use the same experimental set-up deployed in their earlier study in terms of the credit datasets, base classifiers, traditional combination methods, performance indicator measures and significance tests.

4.1. Credit datasets

A collection of public and private datasets with different characteristics was employed in the process of empirical model evaluation. In total, seven datasets were obtained, four public and three private. The public datasets are well-known real-world credit-scoring datasets that have been widely adopted by researchers in their studies and are easily accessed and publicly available at the UCI machine-learning repository (Asuncion & Newman, 2007).
The German, Australian and Japanese datasets of this type were employed, with the purpose being to provide extra validation. The Iranian dataset, which consists of corporate client data from a small private bank in Iran, has been used in several studies (Sabzevari et al., 2007; Kennedy, 2012; Marques et al., 2012a, 2012b), whilst the Polish dataset contains information on bankrupted Polish companies recorded over two years (Pietruszkiewicz, 2008; Kennedy, 2012; Marques et al., 2012a, 2012b). Moreover, two extra datasets are used for the proposed model for extra validation. Firstly, a Jordanian dataset, based on a historical loan dataset, was gathered from one public commercial bank in Jordan; these data are confidential and sensitive, hence acquiring them involved a complex and time-consuming process. Secondly, we used the UCSD dataset, a reduced version of a database employed in the 2007 Data Mining Contest organised by the University of California San Diego and Fair Isaac Corporation. A summary of all the datasets is given in Table 2.

Table 2. Description of the seven datasets used in the study

| Dataset | #Loans | #Attributes | Good/Bad |
|---|---|---|---|
| German | 1000 | 20 | 700/300 |
| Australian | 690 | 14 | 307/383 |
| Japanese | 690 | 15 | 296/357 |
| Iranian | 1000 | 27 | 950/50 |
| Polish | 240 | 30 | 128/112 |
| Jordanian | 500 | 12 | 400/100 |
| UCSD | 2435 | 38 | 1836/599 |

UCI dataset sources: German – https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data); Australian – https://archive.ics.uci.edu/ml/datasets/Statlog+(Australian+Credit+Approval); Japanese – https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening.

4.2. Base classifiers development
The baseline models developed in this paper and chosen to be part of the multiple classifier system, based on their wide application in credit scoring studies (Harris, 2015; Tomczak and Zieba, 2015; Xiao, 2016; Louzada and Fernandes, 2016), are neural networks (NN), support vector machines (SVM), decision trees (DT), random forests (RF) and naïve Bayes (NB). As well as being well-known, these methods are easy to implement and hence facilitate banks or credit card companies in quickly evaluating the creditworthiness of clients. Below is the theoretical background to the classifiers used.

Neural Networks (NNs)

NNs are machine-learning systems based on the concept of artificial intelligence (AI) inspired by the design of the biological neuron (Haykin, 1999). They are modelled in such a way as to be able to mimic human brain functions in terms of capturing complex relationships between inputs and outputs (Bhattacharyya and Maulik, 2013). One of the most common architectures for NNs is the multi-layer perceptron (MLP), which consists of one input layer, one or more hidden layers and one output layer. According to Angelini et al. (2008), one of the key issues needing to be addressed in building NNs is their topology, structure and learning algorithm. The most commonly utilised topology of an NN model in credit scoring is the three-layer feed-forward back-propagation network. Consider the input of a credit-scoring training set x = {x1, x2, …, xn}; the NN model works in one direction, starting from feeding the data x to the input layer (x includes the customer's attributes or characteristics). These inputs are then sent to a hidden layer through links, or synapses, associated with a random initial weight for every input. The hidden layer processes what it has received from the input layer and, accordingly, applies an activation function. The result serves as a weighted input to the output layer, which further processes the weighted inputs and applies the activation function, leading to a final decision (Malhotra and Malhotra, 2003).

Support Vector Machines (SVMs)
An SVM is another powerful machine-learning technique used in classification and credit-scoring problems. It has been widely used in the area of credit scoring and other fields owing to its superior results (Huang et al., 2007; Lahsasna et al., 2010). SVMs were first proposed by Cortes & Vapnik (1995), adopting the form of a linear classifier. SVMs take a set of inputs from two given classes and predict which of the two classes, namely good or bad, each input belongs to. SVMs are used for binary classification in order to construct an optimal hyperplane (line) that categorises the input data into two classes (good and bad credit) (Li et al., 2006). In cases where the data are non-linear, other variants are proposed in order to improve the accuracy of the original model. The main difference of the new model, compared to the initial one, is the function used to map the data into a higher-dimensional space. To achieve this, new kernel functions were proposed, namely linear, polynomial, radial basis function (RBF) and sigmoid. SVMs map the non-linear data of the two classes to a high-dimensional feature space, with a linear model then being used to implement the non-linear classes. The linear model in the new space denotes the non-linear decision margin in the original space. Subsequently, the SVM constructs an optimal line (hyperplane) that can best separate the two classes in the space.

Decision Trees (DTs)
A DT is another commonly used approach for classification purposes in credit-scoring applications. DTs are non-parametric classification techniques used to analyse dependent variables as a function of independent variables (Lee et al., 2006). They can be represented using graphical tools: each node is shown in a box with lines indicating the possible events and consequences, until the optimal outcome is reached. The idea behind the DT in credit scoring is to provide a classification between two classes of credit, namely ‘good’ and ‘bad’ loans. It begins with a root node that comprises the two types of classes, with the node then being split into two subsets with the possible events based on the chosen variable or attribute. The DT algorithm evaluates all the splits to find the optimal one and then selects the winning sub-tree that gives the most accurate separation of ‘good’ and ‘bad’ loans based on its overall error rate and lowest misclassification cost (Breiman et al., 1984; Thomas, 2000).
Random Forests (RFs)
An RF can be considered an advanced technique built on DTs, as proposed by Breiman (2001). It consists of a collection of DTs created by generating n subsets from the main dataset, with a DT grown on each subset using randomly selected variables. Consequently, a very large number of trees are generated, which is why it is referred to as a forest. After all the DTs have been generated and trained, the final decision class is based on a voting procedure, where the most popular class across the trees is selected as the final class of the RF.
6 https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
7 https://archive.ics.uci.edu/ml/datasets/Statlog+(Australian+Credit+Approval)
8 https://archive.ics.uci.edu/ml/datasets/Japanese0Credit0Screening
Naïve Bayes (NB)
NB classifiers are statistical classifiers that predict a particular class (good or bad) of loan. Bayesian classification is based on Bayesian theory and is a valuable approach when the input feature space is high-dimensional (Bishop, 2006). It is a very simple method for producing classification rules that are often more accurate than those made by other methods; however, it has received very little attention in the credit-scoring domain (Antonakis & Sfakianakis, 2009). An NB classifier computes the posterior probability of a class by multiplying the prior probability of the class, before seeing any data, by the likelihood of the data given that class. For example, in the credit-scoring context, assume a training sample set D = {x₁, x₂, …, xn}, where each x is made up of n characteristics or attributes {x11, x12, …, x1n} and is associated with a class label c, which is either a good or a bad loan. The task of the NB classifier is to analyse these training instances and determine a mapping function ƒ: (x11, …, x1n) -> (c), which can decide the label of an unknown example x = (x1, …, xn).
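As a compact statement of this decision rule (using the notation above and the standard NB conditional-independence assumption), the predicted class of an example x = (x1, …, xn) can be written as:

\hat{c} = \arg\max_{c \in \{\mathrm{good},\,\mathrm{bad}\}} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)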
4.3. Data pre-processing and partitioning
Before building and training the models, the data have to be prepared in terms of dealing with any missing values that could detract from the knowledge-discovery process. The easiest way of dealing with missing values is to delete the instances containing the missing value of the feature. However, there are other ways of handling missing values instead of deleting them, such as adopting an imputation approach, which means replacing missing values with new ones based on some estimation (Acuna et al., 2004). Regarding the collected datasets, only the Japanese dataset was found to contain missing values, and it was decided to impute them via a simple imputation approach as follows (Acuna et al., 2004; Lessmann et al., 2015):
Missing categorical or nominal values are replaced with the most frequent category (the mode) among the remaining entries, while missing quantitative values are replaced with the mean of the feature that holds the missing value.
Some classifiers, such as NNs and SVMs, require input values in the range 0 to 1, supplied as vectors of real numbers. However, the attributes of the datasets hold values that vary considerably in range. In order to avoid bias and to feed the classifiers with data on a common scale, the data should be transformed from their different scales of values to a common one. To achieve this, the dataset attributes are normalised to values between 0 and 1 in a way appropriate for the datasets used in this paper. The data are normalised using the min-max normalisation procedure (Sustersic et al., 2009; Wang & Huang, 2009; Li & Sun, 2009), where the maximum value of an attribute is mapped to 1 (max_new) and the minimum value to 0 (min_new); the values in between are scaled according to the equation below:
New_value = (original − min) / (max − min) × (max_new − min_new) + min_new
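As an illustration of this scaling, a minimal Matlab sketch follows (variable names are illustrative; X is assumed to be a samples-by-attributes numeric matrix):

% Min-max normalisation of every attribute (column) of X to the [0, 1] interval.
colMin = min(X, [], 1);
colMax = max(X, [], 1);
Xnorm  = bsxfun(@rdivide, bsxfun(@minus, X, colMin), colMax - colMin);
% Note: constant columns (max = min) would need special handling to avoid division by zero.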
Regarding the data-splitting technique, k-fold cross-validation was adopted. In this technique, the original dataset is partitioned into k subsets or folds of approximately equal size; for example, consider P1, P2, P3, …, Pk to be the partitions made from the original dataset (Louzada and Fernandes, 2016). Each partition is then used in turn for testing while the remaining partitions are used for training, and the final accuracy is estimated by averaging over all the tested partitions or folds. A further issue is how many folds or partitions to create. Garcia et al. (2015) stated that 5 or 10 folds can be a good choice for datasets of different sizes, with repetitions of the process also being desirable in order to switch between training and testing data as much as possible and to avoid high variances. Consequently, a 5-fold cross-validation repeated 10 times was adopted in order to achieve reliable and robust conclusions on model performance. As a result, in this paper, 5-fold cross-validation was applied to each dataset and the whole process was repeated 10 times, giving a total of 50 test results that were averaged to produce the final result for each dataset.
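A hedged sketch of this splitting scheme in Matlab is given below (a single decision tree is used purely as a placeholder classifier; function and variable names are illustrative assumptions):

% 5-fold cross-validation repeated 10 times, giving 50 test-set accuracy estimates.
accs = zeros(10, 5);
for r = 1:10
    c = cvpartition(y, 'KFold', 5);                    % new random partition on each repetition
    for k = 1:5
        mdl        = fitctree(X(training(c, k), :), y(training(c, k)));
        yhat       = predict(mdl, X(test(c, k), :));
        accs(r, k) = mean(yhat == y(test(c, k)));
    end
end
finalAccuracy = mean(accs(:));                          % average of the 50 test results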
4.4. Parameters tuning and setting
Practically, a few parameters needed to be set up before classifier construction for classifiers such as the NN, SVM, DT and RF. The intention was to make a unique model for all the datasets, and it is worth noting that all the parameters for all the classifiers across all the datasets were chosen based on achieving the best results on the training set. Firstly, for the NN model, a feed-forward back-propagation network was constructed, and for the German, Australian, Japanese, Iranian, Polish, UCSD and Jordanian datasets the chosen numbers of neurons in the hidden layer were 4, 10, 3, 10, 10, 10 and 10, respectively. Generally, the number of hidden neurons should be chosen relative to the number and complexity of the relations between the input features of each dataset; in the developed NN classifier, a grid search was carried out to find the optimal number of neurons in the hidden layer for each dataset. Furthermore, the learning rate was left at the default of 0.01 for the Australian, Japanese, Iranian, UCSD and Jordanian datasets, while for the German and Polish ones it was set to 0.005 and 0.5, respectively. Moreover, the maximal number of epochs was set to 1,000. For each particular dataset, it was also important to change the way in which the neural network was trained: the German, Australian, Polish, UCSD and Jordanian datasets were kept at the default training function ( ), whereas for the Japanese and Iranian datasets it was changed to , respectively. In addition, for the Japanese and Iranian datasets a default momentum parameter of 0.9 was chosen; for all the other datasets, a momentum was not defined, as the and training methods do not require this parameter. Secondly, regarding the SVM, an RBF kernel was used. For each dataset, different values of the kernel scale parameter were provided (German: ‘1.37’; Australian, Japanese, UCSD, Polish and Jordanian: 'auto'9; Iranian: ‘1’). Thus, for the majority of datasets, the SVM function automatically chose the appropriate kernel scale; however, for some datasets, values of the kernel scale parameter that increased the SVM accuracy were found by grid search rather than by using the default ('auto') setting. The ‘auto’ setting gave the values 0.7812, 0.9795, 0.1836, 0.366 and 0.243 for the Australian, Japanese, UCSD, Polish and Jordanian datasets, respectively. With the RF, the most important parameters are the number of trees and the number of attributes used to build each tree: 60 trees10 were built and, regarding the number of attributes chosen for growing each decision tree, the default value was selected (all attributes available in the dataset). Another important issue worth noting is the definition of the categorical variables for each analysed dataset. The number of features defined as categorical during the RF evaluation was lower than the initial number of categorical features, because some are best treated as numerical. According to Rhemtulla et al. (2012), when categorical variables have many levels, there is a considerable advantage in treating them as continuous variables. By way of an example: a particular dataset has an ‘Education’ feature, where ‘0’ means ‘no education’, ‘1’ refers to ‘ordinary school’, ‘2’ pertains to ‘MSc’ and ‘3’ means ‘PhD’. Hence, this feature can be considered numerical, whereby the bigger its value, the better educated the loan applicant. So, during leaf splitting, the RF does not need to iterate over all values of this feature; it can simply define the leaf threshold, which is more efficient. Lastly, in relation to the DT, the impurity evaluation is performed according to Gini’s diversity index in order to choose the best feature to start the tree with, and the best categorical-variable split is found by considering 2^(C−1) − 1 combinations, where C is the number of categories in each categorical variable.
9 ‘auto’ is a parameter that automatically chooses the best value for the kernel scale that increases the SVM’s training accuracy. This is achieved by using the fitcsvm function in Matlab.
10 We tried several parameter values and with a value of 60 the RF achieved very good training accuracy with acceptable computational speed.
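As an indication of how these settings translate into code, a hedged Matlab sketch follows (training-data names are illustrative; the exact calls used by the authors are not shown in the paper):

% SVM with an RBF kernel and automatic kernel-scale selection (cf. footnote 9).
svmMdl = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'rbf', 'KernelScale', 'auto');
% Random forest with 60 trees (cf. footnote 10); all attributes are available at each split.
rfMdl = TreeBagger(60, Xtrain, ytrain, 'Method', 'classification');
% Single decision tree split according to Gini's diversity index ('gdi').
dtMdl = fitctree(Xtrain, ytrain, 'SplitCriterion', 'gdi');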
4.5. Performance measure metrics
In order to reach a reliable and robust conclusion on the predictive accuracy of the proposed approach, four performance measures are implemented, specifically: 1) accuracy, 2) the area under the curve (AUC), 3) the H-measure and 4) the Brier Score. These were chosen because they are popular in credit scoring and together give a comprehensive view of all aspects of model performance. Accuracy stands for the proportion of correctly classified good and bad loans, and as such is a criterion that measures the discriminating ability of the model (Lessmann et al., 2015). The AUC is a tool used in binary classification analysis to determine which of the models predicts the classes best. According to Hand (2009), the AUC can be used to estimate a model’s performance without any prior information about the error costs; however, it implicitly assumes a different cost distribution for each classifier, depending on its actual score distribution, which prevents classifiers from being compared effectively. As a result, Hand (2009) proposed the H-measure as an alternative to the AUC for measuring classification performance, which assumes a cost distribution that does not depend on the classifiers’ scores; in other words, this measure uses a single threshold distribution for all classifiers. Finally, the Brier Score, which is also known as the mean squared error of the probability predictions (Brier, 1950), measures the accuracy of the classifier’s probability predictions by taking the mean squared difference between the predicted probabilities and the observed outcomes. In other words, it shows the average squared probability error; the main difference between it and accuracy is that it takes the probabilities into account directly, whereas accuracy transforms these probabilities into 0 or 1 based on a pre-determined threshold or cut-off score. The lower the Brier Score, the better the classifier performance.
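For concreteness, a minimal Matlab sketch of three of these measures is shown below, assuming p holds the predicted bad-loan probabilities and y the true 0/1 labels (the H-measure requires the separate routine proposed by Hand (2009) and is omitted here):

yhat  = double(p >= 0.5);                 % accuracy uses a cut-off (0.5 is illustrative)
acc   = mean(yhat == y);
[~, ~, ~, auc] = perfcurve(y, p, 1);      % AUC, treating class 1 (bad loan) as the positive class
brier = mean((p - y).^2);                 % Brier Score: lower is better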
4.6. Statistical significance tests
Statistical tests can be categorised into parametric and non-parametric (Demšar, 2006). Demšar recommended using non-parametric tests in preference to parametric ones, as the latter can be conceptually inappropriate and statistically unsafe. That is, non-parametric tests are more appropriate and safer than parametric tests since they do not assume normality of the data or homogeneity of the variance (Demšar, 2006). Friedman’s (1940) test is a non-parametric test that ranks the classifiers for each dataset independently: the best-performing classifier is given a rank of one, the second best a rank of two, and so on. The null hypothesis of the Friedman test is that all classifiers in the group perform identically and all observed differences are merely random fluctuations. The Friedman statistic is distributed according to the χ² distribution with K − 1 degrees of freedom when N (the number of datasets) and K (the number of classifiers) are big enough (Demšar, 2006). If the null hypothesis of the Friedman test is rejected, a post-hoc test is carried out in order to find the particular pairwise comparisons that produce significant differences. For instance, the Bonferroni–Dunn (Dunn, 1961) test can be used when all the classifiers are compared with a control model (Demšar, 2006; Marques et al., 2012a; Marques et al., 2012b). With this test, the performance of two or more classifiers is significantly different if their average ranks differ by at least the critical difference (CD), as follows:

CD = q_α √( K(K + 1) / (6N) )            (22)

where q_α is calculated as the Studentised range statistic at a confidence level of α/(K − 1), divided by √2; K is the number of classifiers to be compared with ConsA and N is the number of datasets.
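A small illustrative computation of the critical difference follows; the value of q_α must be taken from the Bonferroni–Dunn tables for the chosen significance level, so the number used below is a placeholder, not a value from this paper.

% Critical difference of equation (22) for K classifiers compared over N datasets.
K = 13;  N = 7;                 % e.g. twelve competing methods plus ConsA, seven datasets
qAlpha = 2.5;                   % placeholder; look up the two-tailed critical value for alpha and K
CD = qAlpha * sqrt(K * (K + 1) / (6 * N));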
5. Experimental results
In this section, the results of the proposed model are presented along with comparisons against the individual base classifiers, the hybrid classifiers and the ensemble classifiers with traditional combination methods. The model is validated over the seven real-world credit datasets described above across four performance measures. In addition, histograms of the loan-ranking distributions of the proposed model are provided and discussed. All the experiments for this study were performed using Matlab 2014b, on a PC with a 3.4 GHz Intel Core i7 and 8 GB RAM, running the Microsoft Windows 7 operating system.
Table 3 Classifier results for all individual classifiers for all the datasets for the different performance measures without GNG or MARS
Dataset      Measure       RF      DT      NB      NN      SVM
German       Accuracy      0.767   0.705   0.725   0.748   0.761
             AUC           0.792   0.679   0.762   0.764   0.783
             H-measure     0.289   0.136   0.238   0.239   0.272
             Brier Score   0.162   0.252   0.199   0.172   0.166
Australian   Accuracy      0.867   0.826   0.803   0.859   0.852
             AUC           0.936   0.866   0.896   0.915   0.911
             H-measure     0.661   0.515   0.580   0.614   0.602
             Brier Score   0.095   0.147   0.167   0.108   0.114
Japanese     Accuracy      0.867   0.817   0.797   0.858   0.858
             AUC           0.931   0.856   0.889   0.912   0.907
             H-measure     0.649   0.489   0.567   0.619   0.607
             Brier Score   0.098   0.157   0.175   0.108   0.114
Iranian      Accuracy      0.951   0.924   0.926   0.950   0.948
             AUC           0.779   0.615   0.714   0.613   0.603
             H-measure     0.276   0.112   0.193   0.062   0.073
             Brier Score   0.043   0.070   0.076   0.048   0.051
Polish       Accuracy      0.763   0.701   0.690   0.698   0.749
             AUC           0.837   0.728   0.740   0.767   0.821
             H-measure     0.384   0.210   0.235   0.250   0.366
             Brier Score   0.165   0.267   0.297   0.198   0.176
Jordanian    Accuracy      0.855   0.828   0.811   0.815   0.830
             AUC           0.909   0.795   0.707   0.746   0.789
             H-measure     0.535   0.385   0.176   0.238   0.309
             Brier Score   0.097   0.143   0.172   0.142   0.128
UCSD         Accuracy      0.862   0.820   0.614   0.831   0.831
             AUC           0.903   0.788   0.574   0.859   0.843
             H-measure     0.513   0.330   0.080   0.398   0.386
             Brier Score   0.101   0.159   0.314   0.123   0.134
Table 4 Classifier results for all individual classifiers for all the datasets for the different performance measures with GNG
Dataset      Measure       RF       DT       NB      NN      SVM
German       Accuracy      0.770    0.745    0.759   0.751   0.768
             AUC           0.793    0.689    0.775   0.766   0.796
             H-measure     0.294    0.182    0.268   0.247   0.296
             Brier Score   0.1608   0.2301   0.198   0.173   0.164
Australian   Accuracy      0.868    0.868    0.865   0.859   0.863
             AUC           0.923    0.888    0.911   0.916   0.921
             H-measure     0.648    0.615    0.624   0.623   0.636
             Brier Score   0.101    0.122    0.120   0.110   0.105
Japanese     Accuracy      0.867    0.864    0.863   0.865   0.853
             AUC           0.929    0.882    0.909   0.908   0.911
             H-measure     0.649    0.607    0.623   0.630   0.622
             Brier Score   0.099    0.127    0.122   0.109   0.111
Iranian      Accuracy      0.951    0.951    0.931   0.949   0.946
             AUC           0.788    0.530    0.722   0.613   0.649
             H-measure     0.301    0.036    0.211   0.070   0.117
             Brier Score   0.043    0.0491   0.071   0.048   0.051
Polish       Accuracy      0.752    0.751    0.726   0.743   0.756
             AUC           0.834    0.770    0.773   0.806   0.810
             H-measure     0.375    0.312    0.302   0.336   0.365
             Brier Score   0.167    0.2338   0.264   0.184   0.177
Jordanian    Accuracy      0.864    0.853    0.813   0.820   0.834
             AUC           0.889    0.763    0.710   0.765   0.824
             H-measure     0.514    0.354    0.185   0.269   0.386
             Brier Score   0.100    0.132    0.175   0.137   0.119
UCSD         Accuracy      0.868    0.834    0.807   0.840   0.832
             AUC           0.916    0.780    0.823   0.857   0.844
             H-measure     0.541    0.344    0.373   0.421   0.397
             Brier Score   0.095    0.150    0.191   0.118   0.135
Table 5 Classifier results for all individual classifiers for all the datasets for the different performance measures with MARS
Dataset      Measure       RF      DT      NB      NN      SVM
German       Accuracy      0.767   0.721   0.744   0.748   0.766
             AUC           0.787   0.692   0.768   0.767   0.783
             H-measure     0.284   0.156   0.246   0.248   0.281
             Brier Score   0.163   0.239   0.186   0.171   0.165
Australian   Accuracy      0.868   0.828   0.785   0.865   0.853
             AUC           0.940   0.868   0.894   0.919   0.905
             H-measure     0.668   0.518   0.574   0.629   0.596
             Brier Score   0.093   0.146   0.174   0.104   0.114
Japanese     Accuracy      0.866   0.817   0.797   0.862   0.858
             AUC           0.932   0.857   0.889   0.911   0.908
             H-measure     0.651   0.493   0.567   0.619   0.608
             Brier Score   0.098   0.156   0.175   0.108   0.114
Iranian      Accuracy      0.951   0.927   0.945   0.949   0.948
             AUC           0.790   0.639   0.740   0.618   0.566
             H-measure     0.278   0.151   0.218   0.061   0.038
             Brier Score   0.043   0.068   0.055   0.048   0.0508
Polish       Accuracy      0.768   0.725   0.707   0.689   0.754
             AUC           0.842   0.753   0.801   0.772   0.826
             H-measure     0.395   0.264   0.339   0.259   0.374
             Brier Score   0.162   0.241   0.257   0.196   0.173
Jordanian    Accuracy      0.862   0.830   0.821   0.838   0.837
             AUC           0.920   0.809   0.709   0.827   0.796
             H-measure     0.567   0.402   0.177   0.378   0.373
             Brier Score   0.092   0.137   0.169   0.121   0.124
UCSD         Accuracy      0.866   0.824   0.625   0.847   0.844
             AUC           0.914   0.801   0.593   0.889   0.874
             H-measure     0.538   0.350   0.097   0.466   0.456
             Brier Score   0.096   0.152   0.306   0.111   0.139
[Figure 5 appears here: grouped bars showing the average Accuracy, AUC, Brier Score and H-measure over all classifiers and datasets for the four set-ups Filt(-)/FS(-), Filt(-)/FS(+), Filt(+)/FS(-) and Filt(+)/FS(+).]
Fig. 5 Comparisons of different set-ups of data filtering and feature selection on average for all the classifiers across all the datasets
5.1. Classification results
In this section, four sets of experimental results are presented in order to choose the best configuration for the proposed hybrid ensemble model: 1) results of all base classifiers with all features and data; 2) results of base classifiers with GNG data filtering and all features; 3) results of base classifiers with MARS feature selection and all data; and 4) results of base classifiers with MARS and GNG combined. All the obtained results are compared and hence the method used for the proposed model is justified. Subsequently, the results of the proposed model are summarised and compared with those of the base and ensemble classifiers using the traditional combination methods. Tables 3 to 6 report the results of the base classifiers with each of the MARS and GNG options, in addition to the results of the proposed model. For each classifier, several key findings emerge.
RF: shows a good example of how GNG+MARS work better jointly than separately. This is clear for the Japanese dataset, where accuracy decreases by 0.07% and 0.12% when GNG and MARS are used alone, respectively, but increases by 0.43% when they are combined. An unusual case is found with the Polish dataset, where filtering decreases the accuracy while feature selection increases it; the reason is that the Polish dataset has fewer data samples and a large number of features compared to the other datasets. All the other datasets show increments ranging between 0.03% and 1.1%.
DT: also demonstrates that GNG+MARS work well together. Each of them separately increases the accuracy, but combining them increases it further for all datasets except the Japanese, where GNG alone is better than GNG+MARS, which might be because MARS alone performed worse than the individual classification on this dataset. Another substantial increment for the combined methods is found for the Polish dataset, where the improvement is 8.87%, much better than either method separately.
NB: GNG+MARS behaves in the same way as with DT; for all the datasets the incremental change is larger than when using the methods separately, the exception being the Australian dataset, where GNG alone is better. In general, the performance of GNG alone is quite good with naïve Bayes, but combining it with MARS improves performance further when MARS alone is superior to the individual classification (e.g. the Australian and Japanese datasets).
NN: GNG+MARS interact well, whereas separately they behave differently. Both methods together improve the accuracy for all the datasets, ranging from 0.01% for the Iranian dataset to 5.46% for the Polish one; all the other datasets show increments of between 0.01% and 1.76%.
SVM: is the most controversial classifier, for whilst GNG+MARS should improve the accuracy, this is not so for the Japanese and Iranian datasets. Applying MARS to the Japanese dataset decreases the accuracy more than when GNG is used, because that dataset does not have many features compared to, for example, the Polish dataset. In sum, GNG and MARS work better together than when they are applied separately with the SVM.
Regarding the AUC and H-measure, GNG+MARS is, on average, better than using the methods separately or not at all. In conclusion, the results improve when applying GNG and MARS together (as an average over all the dataset outcomes): NB: 6.07%, DT: 4.37%, NN: 2.41%, SVM: 0.90%, RF: 0.63%.
It can be seen that the worst results occur when the classifiers are applied without filtering and feature selection, whereas the best improvement in accuracy occurs when these two pre-processing methods are applied together. The cases where using filtering and feature selection is particularly useful are as follows: when the dataset is well-balanced (e.g. the Australian and Japanese) and, in some cases, even when it is imbalanced (e.g. the Jordanian); when the data have a lot of features and some of them are categorical; when any individual classifier without filtering and feature selection gives surprisingly low results, which cannot be explained by any reason other than the existence of outliers in the data; and when DT or NB are used as part of the classification system. In fact, even if only one of the latter two is applied to the data being analysed, using filtering and feature selection in combination is very desirable. From Figure 5, the main conclusion that can be drawn is that using filtering and feature selection in combination is justified, as the experiments conducted with these two pre-processing techniques show improvement for all the tested classifiers when compared to no pre-processing or to a single pre-processing technique. It is worth noting that filtering accounts for a larger share of the accuracy increase than feature selection, which can be seen when the results of the experiments with filtering and feature selection applied separately are compared. Having made these comparisons, GNG+MARS will be used in combination when building the proposed hybrid ensemble model.
Table 6 reports the results of the proposed model as well as those for the base classifiers and the ensemble classifiers using traditional combination methods after implementing GNG+MARS. Regarding the ensemble classifiers with traditional combination methods, seven methods were adopted: Min Rule (MIN), Max Rule (MAX), Product Rule (PROD), Average Rule (AVG), Majority Voting (MajVot), Weighted Average (WAVG) and Weighted Voting (WVOT). The results reveal that the best combiner appears to be MajVot. Its first place can be explained by the fact that the classifiers have quite a high accuracy by themselves and hence the probability that four of them will make a mistake on the same data point is low. This is also the reason for AVG being in second place. WVOT, which is a combination of WAVG and MajVot, is third, but for the Japanese and UCSD datasets it holds first place. The final decision about which traditional combination method to choose can be made by looking at the structure of the dataset. The worst combiner is PROD, which can be explained by the fact that the result of multiplying the rankings of the five classifiers is a very small number, which makes the threshold very hard to choose. For example, if all five classifiers output a ranking of 0.6, then the ranking of this combiner is 0.6^5 ≈ 0.078, which is extremely small, so the threshold would have to be set below this value. Another interesting observation about the combiners and classifiers is that each classifier works better when it has few features to rely on; with many features, training is unnecessarily hindered, which in turn reduces the accuracy and increases losses. The results obtained clearly show that most traditional combiners fall behind the best of the classifiers (RF) for all the datasets. Of course, the random forest stays the best, for it is actually not a single classifier but rather a homogeneous combiner of DTs. Finally, the traditional combiners can be used to improve on the work of single classifiers, but they cannot be used on every dataset with the same productivity and hence should be chosen independently for each dataset. As a result of the above findings, a complex combiner (ConsA) is proposed to combine the rankings of all the base classifiers, which should result in a classifier that outperforms the best base classifier and the traditional ensemble combining methods developed in this paper.
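The following hedged Matlab sketch illustrates the traditional combination rules discussed above, applied to a matrix P of predicted bad-loan probabilities with one column per base classifier (names are illustrative; the weighted variants additionally require classifier weights and are omitted):

minRule  = min(P, [], 2);                     % Min Rule
maxRule  = max(P, [], 2);                     % Max Rule
prodRule = prod(P, 2);                        % Product Rule: note that 0.6^5 is about 0.078, hence the very low threshold
avgRule  = mean(P, 2);                        % Average Rule
majVote  = mean(double(P >= 0.5), 2) >= 0.5;  % Majority Voting on the per-classifier 0/1 decisions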
The last column of Table 6 reports the results of the proposed approach across all the datasets for the four performance measures. It can be seen that ConsA is superior on all measures across all the datasets when compared to the base classifiers and the traditional combination methods. Several distinct findings can be reported for each dataset.
German: ConsA shows an accuracy 1.65% higher than RF (the second best classifier in this case). Moreover, the accuracy of ConsA, at 0.7903, is superior to the best of the traditional combiners by 1.7%. The standard deviation of ConsA’s accuracy over all iterations is 0.029. The results clearly show that using GNG and MARS in conjunction is advisable, because this increases the performance of almost all the classifiers when compared to only one of them being used. The AUC value of ConsA is the highest amongst all the other classifiers and combiners.
Australian: with the filtering and feature selection methods applied in conjunction, ConsA’s accuracy is the highest at 0.881, better than the second best classifier by 0.74%. The standard deviation of ConsA’s accuracy over all iterations is 0.0268. Moreover, ConsA’s AUC is the highest amongst the other classifiers, indicating its efficiency across several thresholds, and its H-measure is the largest amongst all the classifiers.
Japanese: ConsA’s accuracy is 0.8871, better than RF by 1.5%. The standard deviation of ConsA’s accuracy over all iterations is 0.0259. Moreover, the AUC of ConsA is the highest, at 0.9330, and the H-measure of ConsA is almost 3.65% higher than that of RF. In general, ConsA is stable for balanced and unbalanced datasets so far.
Iranian: for this severely imbalanced dataset, the accuracy of ConsA rises to 95.75%, which is better than the second best classifier by 0.062%. The standard deviation of ConsA’s accuracy over all iterations is 0.015. The results show that for this dataset ConsA has the highest H-measure and by far the best AUC.
Polish: the accuracy of ConsA with filtering and feature selection enabled is 81.33%, better than the second best classifier by 2.33%. The standard deviation of ConsA’s accuracy over all iterations is 0.0514, the highest among all the datasets. The AUC value of ConsA remains greater than that of all the other classifiers. This is the dataset for which the biggest advantage of ConsA over the other classifiers and combiners can be seen so far. GNG and MARS helped to raise its accuracy by almost 2.3%, which proves the importance of these two pre-processing techniques in the classification procedure. Interestingly, for this dataset, RF shows a worse accuracy than DT, with the latter reaching 79%; the reason for this good performance is that GNG helps this classifier to choose the right node splits, so the obtained model becomes quite precise. The H-measure of ConsA is the best, as is the AUC; in fact, for this dataset ConsA shows superiority on all the measures evaluated.
Jordanian: with the filtering and feature selection algorithms, ConsA can rightfully be called the best possible option. It surpasses RF’s accuracy by 0.78% and is superior on all other measures, including AUC (almost a 3% increase). The other methods show different results: NN and SVM provide 2% worse results than RF and about 4.5% worse than ConsA, and NB delivers even worse results. This suggests that increasing the complexity of the classifier significantly increases the benefit of using it; for this dataset, the complexity of the classifier is highly positively correlated with its accuracy and the other performance metrics. Finally, the H-measure of ConsA is better than that of the second-placed WAVG by about 6%.
UCSD: ConsA is again better than any other classifier, although its accuracy enhancement of only 0.59% is not very large. However, in large real-world datasets this figure could be crucial in terms of losses and profits, which undoubtedly makes ConsA the number one classifier for this dataset.
The distribution of the testing-set rankings for the proposed model is shown in the histograms in Figures 6 to 12. Each histogram represents the following:
( ) is the subset of predicted values for which the actual target is 0 (red); ( ) is the subset of predicted values for which the actual target is 1 (green); ( ) is the full set of predicted values (black).
From Figure 6, it can be concluded that ConsA for the German dataset is much more certain about good-loan predictions than about bad ones; the highest probability (22%) is that the ranking of a random bad-loan entry falls in the interval [0.1–0.2].
Table 6 Classifier results including those for the proposed method for all the datasets for the different performance measures with GNG + MARS
Dataset     Measure       RF     DT     NB     NN     SVM    MIN    MAX    PROD   AVG    MajVot WAVG   WVOT   ConsA
German      Accuracy      0.773  0.753  0.764  0.758  0.773  0.764  0.753  0.736  0.773  0.778  0.746  0.773  0.790
            AUC           0.794  0.699  0.774  0.772  0.794  0.718  0.788  0.709  0.800  0.755  0.746  0.688  0.802
            H-measure     0.297  0.197  0.267  0.258  0.299  0.225  0.288  0.225  0.306  0.286  0.223  0.247  0.325
            Brier Score   0.160  0.221  0.193  0.170  0.164  0.166  0.206  0.232  0.158  0.184  0.180  0.193  0.164
Australian  Accuracy      0.871  0.869  0.861  0.864  0.869  0.866  0.866  0.858  0.873  0.874  0.871  0.870  0.881
            AUC           0.929  0.887  0.909  0.920  0.921  0.913  0.908  0.912  0.929  0.903  0.920  0.891  0.935
            H-measure     0.649  0.616  0.615  0.631  0.637  0.636  0.632  0.638  0.652  0.638  0.636  0.631  0.669
            Brier Score   0.098  0.122  0.125  0.104  0.104  0.100  0.115  0.123  0.098  0.111  0.102  0.106  0.096
Japanese    Accuracy      0.872  0.862  0.863  0.869  0.854  0.864  0.845  0.860  0.865  0.865  0.854  0.872  0.887
            AUC           0.929  0.880  0.909  0.907  0.911  0.913  0.903  0.910  0.926  0.908  0.909  0.849  0.933
            H-measure     0.651  0.603  0.623  0.631  0.622  0.634  0.614  0.633  0.647  0.641  0.601  0.611  0.688
            Brier Score   0.100  0.129  0.122  0.109  0.111  0.104  0.114  0.123  0.101  0.111  0.122  0.115  0.093
Iranian     Accuracy      0.951  0.951  0.945  0.950  0.946  0.950  0.910  0.950  0.950  0.950  0.950  0.946  0.958
            AUC           0.779  0.536  0.747  0.629  0.612  0.553  0.740  0.538  0.777  0.578  0.776  0.572  0.842
            H-measure     0.283  0.040  0.237  0.075  0.078  0.055  0.254  0.050  0.293  0.109  0.284  0.106  0.403
            Brier Score   0.043  0.049  0.054  0.047  0.051  0.101  0.049  0.050  0.043  0.047  0.045  0.048  0.039
Polish      Accuracy      0.774  0.790  0.730  0.752  0.757  0.720  0.768  0.719  0.782  0.788  0.736  0.774  0.813
            AUC           0.841  0.798  0.800  0.806  0.816  0.829  0.824  0.819  0.859  0.858  0.801  0.799  0.874
            H-measure     0.395  0.382  0.345  0.341  0.370  0.396  0.409  0.395  0.446  0.442  0.323  0.384  0.491
            Brier Score   0.163  0.203  0.264  0.184  0.175  0.230  0.264  0.269  0.153  0.158  0.187  0.199  0.143
Jordanian   Accuracy      0.866  0.861  0.821  0.845  0.847  0.825  0.853  0.816  0.857  0.860  0.862  0.857  0.874
            AUC           0.886  0.781  0.774  0.835  0.830  0.806  0.861  0.773  0.882  0.803  0.879  0.795  0.913
            H-measure     0.503  0.399  0.259  0.404  0.459  0.384  0.469  0.410  0.495  0.435  0.506  0.430  0.566
            Brier Score   0.101  0.125  0.157  0.119  0.113  0.135  0.147  0.168  0.104  0.114  0.101  0.110  0.096
UCSD        Accuracy      0.869  0.841  0.808  0.849  0.846  0.805  0.842  0.803  0.864  0.865  0.860  0.869  0.875
            AUC           0.916  0.793  0.831  0.883  0.868  0.883  0.836  0.893  0.908  0.877  0.901  0.809  0.924
            H-measure     0.542  0.374  0.396  0.463  0.455  0.462  0.401  0.473  0.516  0.498  0.506  0.481  0.562
            Brier Score   0.095  0.142  0.191  0.110  0.143  0.114  0.196  0.186  0.100  0.107  0.102  0.115  0.091
However, the bad-loan prediction performance of most of the other classifiers and combiners is even worse, and the few that show higher accuracy in such predictions have poor good-loan prediction as well as lower overall accuracy. This indicates that for the German dataset, due to its imbalanced structure, it is very difficult to build a combiner with over 85% accuracy. From Figure 7, it can be concluded that ConsA for the Australian dataset is very often certain about its decisions (the bars near the 0.4–0.6 region are much shorter than those near the edges of the [0, 1] ranking interval). Moreover, ConsA is often very certain about good loans: if the loan is good, the probability that ConsA will give a prediction value below 0.1 is more than 60%. Regarding the Japanese dataset, Figure 8 shows that ConsA provides a very good level of confidence for both good and bad loan entries. Most of the ConsA rankings lie either in the [0, 0.1] interval or in the [0.9, 1] one: when the input loan is good, the probability that ConsA will give a value near 0.1 or lower is almost 80%, whilst when it is bad, the probability that ConsA will give a value near 0.9 or higher is 70%. In relation to the Iranian dataset, Figure 9 again shows that ConsA is very good at good-loan recognition, but demonstrates much worse results in terms of bad-loan identification; most of the time, when the input query has a ‘bad loan’ label, ConsA treats it as good, and its ranking lies in the [0.1–0.3] interval. Regarding the Polish dataset, ConsA is not as certain about its answers as it is for some of the other datasets. The most likely ranking of a good-loan entry lies in the [0.1–0.2] interval (35%) and that of a bad-loan entry in the [0.8–1] interval; however, for 10% of the input entries ConsA is not certain at all, as the rankings lie in the [0.4–0.6] interval (see Figure 10). As can be seen in Figure 11, for the real-world Jordanian dataset, the tall red bar on the right of the graph indicates that ConsA is very certain regarding good loans, whereas for bad loans this is not the case. However, ConsA very rarely shows uncertainty (rankings between 0.4 and 0.6), and in most cases when it makes an incorrect prediction its ranking is not completely wrong: for bad loans it may err with a ranking of 0.2–0.3, but not of 0–0.1. In other words, even when ConsA is wrong and the actual class is ‘1’, its ranking is not ‘0’ (completely wrong), but rather around 0.2–0.3. So, if a near-certain guarantee of a correct good-loan prediction is required, a loan should only be accepted as good for rankings in the 0–0.1 range; the same logic is applicable to bad-loan predictions. In cases where it is crucial to be sure about the classifier prediction, a two-threshold system can be recommended: if the prediction is less than the first threshold, it can be accepted that the loan is good with great certainty; if the prediction is greater than the second threshold, it can be accepted that the loan is bad with great certainty; and if the prediction falls between the thresholds, this can be interpreted as a ‘grey zone’ in which no decision should be based on it. From Figure 12, it can be said that ConsA for the UCSD dataset is certain about its good- and bad-loan predictions, whereby most of the good loans are scored with a prediction value of less than 0.2 and most of the bad ones with a prediction value greater than 0.8.
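A hedged sketch of the two-threshold rule described above is given below, applied to a vector r of ConsA rankings in [0, 1]; the threshold values are purely illustrative and would have to be tuned for the application at hand.

t1 = 0.2;  t2 = 0.8;                            % assumed thresholds, not values from this paper
decision = cell(size(r));
decision(r <  t1)           = {'good loan'};    % accept with great certainty
decision(r >  t2)           = {'bad loan'};     % reject with great certainty
decision(r >= t1 & r <= t2) = {'grey zone'};    % no automatic decision is made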
This is a big advantage of ConsA, for if the classifier gives a ranking close to the boundaries of the [0, 1] interval, it can be claimed almost with certainty that it is correct. However, very few bad loans receive a prediction value of ‘1’, which could be due to the fact that the UCSD dataset is skewed. In summary, ConsA shows the best performance and, for some datasets, its superiority over all the other classifiers is impressive. The ranking histograms demonstrate that, for almost all the focal datasets, ConsA is certain about its predictions, which shows that it can be used successfully with various ranges of thresholds without any significant drop in accuracy. The most impressive performance of ConsA is on the Polish dataset, which can be explained by the fact that this dataset is balanced. ConsA also deals well with imbalanced datasets, and its high H-measure shows that it can be used successfully with different pairs of misclassification costs (false-positive cost and false-negative cost), with its misclassification error for all thresholds being lower than that of the other classifiers.
This means that, in real life, the losses caused by ConsA’s wrong decisions will be smaller than those resulting from the decisions of any of the other classifiers considered here.
5.2. Significance test results
5.2.1. Friedman test for the best classifiers
In this section, Friedman’s statistical test is conducted on all the implemented classifiers to show that ConsA is better not only on the seven datasets that have been investigated, but also, with very high probability, on datasets with a similar structure to those examined in this paper. After this, the Bonferroni–Dunn test is performed to rank all the classifiers from best to worst and to divide them into two groups: 1) classifiers that under some conditions could rival ConsA; and 2) classifiers that are undoubtedly worse than ConsA. To make the conclusions more solid, the Friedman test is first considered for the three best classifiers, including ConsA; that is, the test was performed on ConsA, the best single classifier and the best classical combined classifier. The null hypothesis in this case is that the differences between these classifiers’ rankings are accidental and not caused by genuine differences between the classifiers; the null hypothesis is accepted if the Friedman statistic falls below the critical value of the χ² distribution at the chosen significance level, and rejected otherwise. At the stricter significance level considered, the null hypothesis is rejected for all datasets except the Polish one, and the same holds at the more lenient level. The reason the Polish dataset is an exception is its small size, having only 60 entries; if it had more entries, the Friedman statistic would be much higher.
5.2.2. Bonferroni–Dunn test for all classifiers
The Friedman ranking test (accuracy rankings) was calculated for all the single classifiers, all the classical combiners and ConsA. To evaluate the critical values at the significance levels α = 0.05 and α = 0.10, a Bonferroni–Dunn two-tailed test was evaluated as in equation (22), where q_α is calculated as the Studentised range statistic at a confidence level of α/(K − 1), divided by √2. The corresponding q values and critical differences were obtained for both significance levels. Looking at Figure 13, the two horizontal lines, whose heights equal the sum of the lowest rank and the critical difference computed by the Bonferroni–Dunn test, represent the threshold for the best-performing method at each significance level (α = 0.05 and α = 0.10). The obtained results clearly show that ConsA is the best when compared to all the other classifiers and classical combiners. RF shows good, stable results, holding second position for all the datasets, whilst DT is good but worse than some of the classical combiners. Based on the evaluated critical values, it can be concluded that PROD, LR, NB, MAX, MIN, SVM, WAVG and NN are significantly worse than the ConsA approach at both significance levels, whilst DT is significantly worse at only one of them.
Fig. 6. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the German dataset
Fig. 7. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the Australian dataset
Fig. 8. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the Japanese dataset
Fig. 9. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the Iranian dataset
Fig. 10. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the Polish dataset
Fig. 11. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the Jordanian dataset
Fig. 12. Frequency histogram of conditional and absolute values of the predicted rankings for the test set for the UCSD dataset
Fig. 13. Significance ranking for the Bonferroni–Dunn two-tailed test for the ConsA approach, benchmark classifier, base classifiers and traditional combination methods, with α = 0.05 and α = 0.10
5.3. Benchmark studies
In this section, a comparison of the proposed ConsA approach with recent related studies in credit scoring and data classification (Gorzałczany and Rudzinski, 2016; Partalas et al., 2010) is provided. Gorzałczany and Rudzinski (2016) employed three of the same benchmark datasets as this paper, namely the German, Australian and Japanese. They developed a hybrid approach combining fuzzy rule-based classifiers with evolutionary optimisation algorithms. In their modelling design they used k-fold cross-validation for dataset splitting for model training and validation, with different values of k and with repetition in order to minimise any bias associated with the random splitting of the datasets; specifically, 2, 3, 5 and 10 folds were employed. Further comparisons with other studies can be found in Gorzałczany and Rudzinski (2016). Partalas et al. (2010) proposed a new metric based on uncertainty weighted accuracy (UWA) to drive heterogeneous ensemble pruning via directed hill climbing. The search is based on forward selection and backward elimination and, based on these methods, UWA determines whether to remove a classifier from the ensemble, leave it, or add a new one. Their approach was evaluated on many datasets, one of which was the same German dataset used in this paper. In their modelling design they used two data-splitting techniques based on hold-out sampling: results were reported for an 80% training / 20% testing split as well as for an 80% training, 20% validation and 20% testing split.
Table 7 Comparison of our proposed approach (ConsA) results with recent approaches across three benchmark datasets
Studies                            Proposed approach   German                          Australian                      Japanese
                                                       Splitting technique  Accuracy   Splitting technique  Accuracy   Splitting technique  Accuracy
Partalas et al. (2010)             BVUWA11             80%-20%-20%          0.7475     -                    -          -                    -
                                   BTUWA12             80%-20%              0.7535     -                    -          -                    -
                                   FVUWA13             80%-20%-20%          0.751      -                    -          -                    -
                                   FTUWA14             80%-20%              0.704      -                    -          -                    -
Gorzałczany and Rudzinski (2016)   FRB-MOEOAs15        2 k-folds            0.744      2 k-folds            0.868      2 k-folds            0.867
                                                       3 k-folds            0.754      3 k-folds            0.873      3 k-folds            0.872
                                                       5 k-folds            0.765      5 k-folds            0.880      5 k-folds            0.882
                                                       10 k-folds           0.785      10 k-folds           0.891      10 k-folds           0.890
This paper                         ConsA               5 k-folds            0.790      5 k-folds            0.881      5 k-folds            0.887
11 Forward uncertainty weighted accuracy using training set.
12 Forward uncertainty weighted accuracy using validation set.
13 Backward uncertainty weighted accuracy using training set.
14 Backward uncertainty weighted accuracy using validation set.
15 Multi-objective genetic optimization based on fuzzy based rules.
Table 7 summarises a comparison of the ConsA approach results against the aforementioned recent credit scoring and related binary classification studies in the literature. Our approach, in general, outperforms all the approaches proposed by Partalas et al. (2010) on the German dataset, and outperforms Gorzałczany and Rudzinski (2016) under 5-fold cross-validation. Considering the other values of k, ConsA remains better for the German dataset, and for the Australian and Japanese datasets our approach is better except under 10-fold cross-validation, where the Gorzałczany and Rudzinski (2016) method outperforms ours by 1% and 0.3%, respectively. For a like-for-like comparison at 5 folds, our proposed approach provides better accuracy performance than Gorzałczany and Rudzinski (2016) across all three datasets.
6. Conclusions
The main advantage of ConsA compared to traditional combiners is the creation of a group ranking as a fusion of the individual classifier rankings, rather than merging these rankings using arithmetical, logical or other mathematical functions. It simulates the behaviour of a real group of experts: they continuously interchange their opinions and adjust their assessments of the possible answers under the influence of the other experts, and the process continues until they reach a group decision with which they all agree. Sometimes, however, the experts cannot reach such a decision, i.e. ConsA does not converge. To prevent these situations, it was decided to use the least-squares method instead of an iterative procedure to obtain the optimal group ranking. Another problem is the unknown conditional ranking values, which have been evaluated as a linear combination of two classifier rankings; moreover, the better a classifier’s accuracy, the more impact it has on the other classifiers. Thus, the two new elements in the current investigation when compared with Shaban et al. (2002) are: using a local accuracy algorithm to estimate the performance of single classifiers at a given point and then evaluating the conditional rankings; and using the least-squares algorithm instead of iterations to solve equation (17). The ConsA algorithm was tested on seven datasets with the aim of predicting the loan quality of the client (0 – good loan, 1 – bad loan). For every dataset, ConsA delivered better performance than the single classifiers, hybrid classifiers and traditional combiners. In relation to future work, the proposed model could be extended by: analysing other approaches to the conditional ranking ( ) evaluation of ConsA; investigating combined homogeneous classifiers or different numbers of heterogeneous classifiers to see to what extent the ConsA results change; investigating different pre-processing methods for the datasets, such as other feature-selection or data-filtering methods; and improving ConsA so that it can output, rather than a single floating-point ranking, a fuzzy opinion using fuzzy logic, whereby a fuzzy matrix with fuzzy opinions can be produced.
References
Abellán, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 41, 3825-3830. Acuna, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier Accuracy. Classification, clustering, and data mining applications (pp. 639-647) Springer. Ala'raj, M., & Abbod, M. F. (2016). Classifiers consensus system approach for credit scoring. Knowledge-Based Systems, 104, 89-105. Angelini, E., di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk evaluation. The quarterly review of economics and finance, 48, 733-755 Antonakis, A., & Sfakianakis, M. (2009). Assessing naive Bayes as a method for screening credit applicants. Journal of applied Statistics, 36, 537-545. Asuncion, A., & Newman, D. (2007). UCI machine learning repository. Atiya, A. F., & Parlos, A. G. (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. Neural Networks, IEEE Transactions on, 11, 697709. Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54, 627-635. Basir, O. A., & Shen, H. C. (1993). New approach for aggregating multi‐sensory data. Journal of Robotic Systems, 10, 1075-1093. Bellotti, T., & Crook, J. (2009). Support vector machines for credit scoring and discovery of significant features. Expert Systems with Applications, 36, 3302-3308. Benediktsson, J. A., & Swain, P. H. (1992). Consensus theoretic classification methods. Systems, Man and Cybernetics, IEEE Transactions on, 22, 688-704. Berger, R. L. (1981). A necessary and sufficient condition for reaching a consensus using DeGroot's method. Journal of the American Statistical Association, 76, 415-418. Bhattacharyya, S., & Maulik, U. (2013). Soft computing for image and multimedia data processing. Springer. Bishop, C. M. (2006). Pattern recognition and machine learning. springer. Breiman, L. (2001). RFs. Machine-learning, 45(1), 5-32. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Whether Review, 78(1), 1-3. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth. Belmont, CA. Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39, 3446-3453. Chen, F., & Li, F. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications, 37, 4902-4909. Chen, W., Ma, C., & Ma, L. (2009). Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications, 36, 7611-7616. Chitroub, S. (2010). Classifier combination and score level fusion: concepts and practical aspects. International Journal of Image and Data Fusion, 1, 113-135. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297. DeGroot, M. H. (1974). Reaching a consensus. Journal of the American Statistical Association, 69, 118-121. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30. Desai, V. S., Crook, J. N., & Overstreet, G. A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. 
European Journal of Operational Research, 95, 24-37. Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64. Finlay, S. (2011). Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research, 210, 368-378. Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86-92. Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 1–67. García, V., Marqués, A., & Sánchez, J. S. (2012). On the use of data filtering techniques for credit risk prediction with instance-based models. Expert Systems with Applications, 39, 13267-13276. García, V., Marqués, A. I., & Sánchez, J. S. (2015). An insight into the experimental design for credit risk and corporate bankruptcy prediction systems. Journal of Intelligent Information Systems, 44, 159-189. Gorzałczany, M. B., & Rudziński, F. (2016). A multi-objective genetic optimization for fast, fuzzy rule-based credit classification with balanced accuracy and interpretability. Applied Soft Computing, 40, 206-220. Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77, 103-123. Harris, T. (2015). Credit scoring using the clustered support vector machine. Expert Systems with Applications, 42, 741-750. Haykin, S. (1999). Adaptive filters. Signal Processing Magazine, 6. Hsieh, N., & Hung, L. (2010). A data driven ensemble classifier for credit scoring analysis. Expert Systems with Applications, 37, 534-545. Huang, C., Chen, M., & Wang, C. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 33, 847-856. Jekabsons, G. (2009). Adaptive Regression Splines Toolbox for Matlab. Ver, 1, 3-17. Kennedy, K., Mac Namee, B., & Delany, S. J. (2012). Using semi-supervised classifiers for credit scoring. Journal of the Operational Research Society, 64, 513-529. Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20, 226-239. Kumar, P. R., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques–A review. European Journal of Operational Research, 180, 1-28. Lai, K. K., Yu, L., Zhou, L., & Wang, S. (2006). Credit risk evaluation with least square support vector machine. In Anonymous Rough Sets and Knowledge Technology (pp. 490-495). Springer. Lee, T., & Chen, I. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications, 28, 743-752. Lee, T., Chiu, C., Chou, Y., & Lu, C. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis, 50, 1113-1130
Lessmann, S., Baesens, B., Seow, H., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research. Li, X., & Zhong, Y. (2012). An overview of personal credit scoring: techniques and future work. Li, S., Shiue, W., & Huang, M. (2006). The evaluation of consumer loans using support vector machines. Expert Systems with Applications, 30, 772-782. Li, H., & Sun, J. (2009). Majority voting combination of multiple case-based reasoning for financial distress prediction. Expert Systems with Applications, 36(3), 4363-4373. Lin, W., Hu, Y., & Tsai, C. (2012). Machine learning in financial crisis prediction: a survey. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42, 421436. Liu, Y., & Schumann, M. (2005). Data mining feature selection for credit-scoring models. Journal of the Operational Research Society, 56(9), 1099-1108. Louzada, F., Ara, A., & Fernandes, G. B. (2016). Classification methods applied to credit scoring: A systematic review and overall comparison. arXiv preprint arXiv:1602.02137. Marqués, A., García, V., & Sánchez, J. S. (2012). Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications, 39, 10244-10250. Marqués, A., García, V., & Sánchez, J. S. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications, 39, 10916-10922. Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36, 3028-3033. Partalas, I., Tsoumakas, G., & Vlahavas, I. (2010). An ensemble uncertainty aware measure for directed hill climbing ensemble pruning. Machine Learning, 81(3), 257-282. Pietruszkiewicz, W. (2008). Dynamical systems and nonlinear Kalman filtering applied in classification. , 1-6. Piramuthu, S. (2006). On preprocessing data for financial credit risk evaluation. Expert Systems with Applications, 30, 489-497. Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous. A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological methods, 17, 354. Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review, 33, 1-39. Sabzevari, H., Soleymani, M., & Noorbakhsh, E. (2007). A comparison between statistical and data mining methods for credit scoring in case of limited available data. Shaban, K., Basir, O., Kamel, M., & Hassanein, K. (2002). Intelligent information fusion approach in cooperative multiagent systems. , 13, 429-434. Šušteršič, M., Mramor, D., & Zupan, J. (2009). Consumer credit-scoring models with limited data. Expert Systems with Applications, 36(3), 4736-4744. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. , 330-337. Tomczak, J. M., & Zięba, M. (2015). Classification restricted Boltzmann machine for comprehensible credit scoring model. Expert Systems with Applications, 42, 1789-1796. Tsai, C. (2014). Combining cluster analysis with classifier ensembles to predict financial distress. Information Fusion, 16, 46-58. Tsai, C. (2009). Feature selection in bankruptcy prediction. Knowledge-Based Systems, 22, 120-127. Tsai, C., & Chen, M. (2010). Credit rating by hybrid machine learning techniques. Applied soft computing, 10, 374-380. Tsai, C., & Cheng, K. (2012). 
Simple instance selection for bankruptcy prediction. Knowledge-Based Systems, 27, 333-342. Tsai, C., & Chou, J. (2011). Data pre-processing by genetic algorithms for bankruptcy prediction. , 1780-1783. Tsai, C., & Wu, J. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34, 2639-2649. Verikas, A., Kalsyte, Z., Bacauskiene, M., & Gelzinis, A. (2010). Hybrid and ensemble-based soft computing techniques in bankruptcy prediction: a survey. Soft Computing, 14, 995-1010. Wang, G., Hao, J., Ma, J., & Jiang, H. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications, 38, 223-230. Wang, C., & Huang, Y. (2009). Evolutionary-based feature selection approaches with new criteria for data mining: A case study of credit approval data. Expert Systems with Applications, 36(3), 5900-5908. Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems, 26, 61-68. West, D., Dellana, S., & Qian, J. (2005). Neural network ensemble strategies for financial decision applications. Computers & Operations Research, 32, 2543-2559. Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38, 257-286. Woods, K., Bowyer, K., & Kegelmeyer Jr, W. P. (1996). Combination of multiple classifiers using local accuracy estimates. , 391-396. Xiao, H., Xiao, Z., & Wang, Y. (2016). Ensemble classification based on supervised clustering for credit scoring. Applied Soft Computing, 43, 73-86. Xiao, J., Xie, L., He, C., & Jiang, X. (2012). Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications, 39, 36683675. Xu, L., Krzyżak, A., & Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. Systems, man and cybernetics, IEEE transactions on, 22, 418-435. Yao, P. (2009). Feature selection based on SVM for credit scoring. , 2, 44-47. Yu, L. &Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML, 3, 856-863 Yu, L., Wang, S., & Lai, K. K. (2009). An intelligent-agent-based fuzzy group decision making model for financial multicriteria decision support: The case of credit scoring. European Journal of Operational Research, 195, 942-959. Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications, 34, 1434-1444. Yu, L., Yue, W., Wang, S., & Lai, K. K. (2010). Support vector machine based multiagent ensemble learning for credit risk evaluation. Expert Systems with Applications, 37, 1351-1360. Zang, W., Zhang, P., Zhou, C., & Guo, L. (2014). Comparative study between incremental and ensemble learning on data streams: Case study. Journal of Big Data, 1, 1-16. Zhang, D., Zhou, X., Leung, S. C., & Zheng, J. (2010). Vertical bagging decision trees model for credit scoring. Expert Systems with Applications, 37, 7838-7843. Zhou, L., Lai, K. K., & Yu, L. (2010). Least squares support vector machines ensemble models for credit scoring. Expert Systems with Applications, 37, 127-133. Zhou, L., Tam, K. P., & Fujita, H. (2016). Predicting the listing status of Chinese listed companies with multi-class classification models. Information Sciences, 328, 222-236.