Credit risk evaluation using multi-criteria optimization classifier with kernel, fuzzification and penalty factors


European Journal of Operational Research xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

European Journal of Operational Research journal homepage: www.elsevier.com/locate/ejor

Innovative Applications of O.R.

Zhiwang Zhang a,*, Guangxia Gao b, Yong Shi c,d

a School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
b Shandong Institute of Business and Technology, Yantai 264005, China
c Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
d College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA

Article history: Received 1 April 2012; Accepted 20 January 2014; Available online xxxx

Keywords: Data mining; Fuzzy set; Kernel function; Multi-criteria optimization; Classification; Credit risk

Abstract

With the fast development of financial products and services, banks' credit departments have collected large amounts of data, which risk analysts use to build appropriate credit scoring models to evaluate an applicant's credit risk accurately. One of these models is the Multi-Criteria Optimization Classifier (MCOC). By finding a trade-off between the overlapping of different classes and the total distance from input points to the decision boundary, MCOC can derive a decision function from distinct classes of training data and subsequently use this function to predict the class label of an unseen sample. In many real-world applications, however, owing to noise, outliers, class imbalance, nonlinearly separable problems and other uncertainties in data, classification quality degenerates rapidly when using MCOC. In this paper, we propose a novel multi-criteria optimization classifier based on kernel, fuzzification, and penalty factors (KFP-MCOC): first, a kernel function is used to map input points into a high-dimensional feature space; then an appropriate fuzzy membership function is introduced to MCOC and associated with each data point in the feature space, and unequal penalty factors are added to the input points of imbalanced classes. Thus, the effects of the aforementioned problems are reduced. Our experimental results of credit risk evaluation and their comparison with MCOC, support vector machines (SVM) and fuzzy SVM show that KFP-MCOC can enhance the separation of different applicants, the efficiency of credit risk scoring, and the generalization of predicting the credit rank of a new credit applicant.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Credit risk evaluation is a very challenging and important data mining problem in the domain of financial analysis. Credit scoring models have been extensively used to evaluate the credit risk of consumers or enterprises, and they can classify applicants as either accepted or rejected according to their demographic and behavioral characteristics. Over the past three decades, a large number of methods have been proposed for credit risk decisions (Lando, 2004; Thomas, Crook, & Edelman, 2002). These methods mainly include logistic regression (Bolton, 2009; Wiginton, 1980), probit regression (Grablowsky & Talley, 1981), nearest neighbor analysis (Henley & Hand, 1996), Bayesian networks (Baesens, Egmont-Petersen, Castelo, & Vanthienen, 2002; Pavlenko & Chernyak, 2010), artificial neural networks (Jensen, 1992; West, 2000), decision trees (Bastos, 2008; Zhang and Zhang, et al., 2010; Zhang and Zhou, et al., 2010; Zhang and Zhu, et al., 2010), genetic algorithms (Abdou, 2009; Ong, Huang, & Tzeng, 2005), multiple criteria decision making (Shi, Peng, Xu, & Tang, 2002; Shi, Wise, Luo, & Lin, 2001), SVM (Bellotti & Crook, 2009; Gestel, Baesens, Garcia, & Dijcke, 2003; Huang, Chen, & Wang, 2007; Martens, Baesens, Van Gestel, & Vanthienen, 2007; Schebesch & Stecking, 2005), and so on. Among these classification methods, logistic regression is considered the most popular statistical approach and has been more widely used than the others in practice. Neural network credit scoring models have high accuracy, but some modeling skills are required (for instance, to design proper network topologies) and it is difficult to explain the results of credit scoring to users clearly. SVM-based models have shown promising results in credit risk evaluation, but the SVM classifier needs to solve a convex quadratic programming problem, which is very computationally expensive in real-world applications. Recently, increasing interest in the synergies of optimization and data mining can be observed (Olafsson, Li, & Wu, 2008; Meisel & Mattfeld, 2010; Corne, Dhaenens, & Jourdan, 2012). As optimization

* Corresponding author. Tel.: +86 015053532980. E-mail address: [email protected] (Z. Zhang).

http://dx.doi.org/10.1016/j.ejor.2014.01.044 0377-2217/© 2014 Elsevier B.V. All rights reserved.

Please cite this article in press as: Zhang, Z., et al. Credit risk evaluation using multi-criteria optimization classifier with kernel, fuzzification and penalty factors. European Journal of Operational Research (2014), http://dx.doi.org/10.1016/j.ejor.2014.01.044


techniques, SVM, based on statistical learning theory and optimization, recently grew in popularity (Cortes & Vapnik, 1995; Vapnik, 1995, 1998), mainly owing to its higher generalization power compared with some traditional methods. The main idea of the SVM algorithm is to separate instances from different classes by fitting a separating hyperplane that maximizes the margin among the classes and simultaneously minimizes the misclassification. In the linearly separable case, the hyperplane is located in the input space. In the nonlinearly separable case, kernel techniques are used to map the data from the input space into a feature space, and the hyperplane is positioned in the feature space (Cristianini & Shawe-Taylor, 2000; Hamel, 2009). When SVM is employed to solve classification problems, each input point is treated equally and assigned with certainty to one of the classes. However, in many practical applications, SVM is very sensitive to noise, outliers and anomalies in data, so that the separating hyperplane can deviate severely from the right position and direction. Several methods have therefore been proposed to solve this problem by introducing a proper fuzzy membership function into the SVM model (Abe, 2004; Abe & Inoue, 2002; Jiang, Yi, & Jian, 2006; Lin & Wang, 2002, 2004; Takuya & Shigeo, 2001; Tovar & Yu, 2008; Tsujinishi & Abe, 2003). Additionally, in real-world applications it is very common that one class is more important than others and that the class distribution is imbalanced, resulting in rapidly degenerating classification precision and accuracy. In order to deal effectively with the class imbalance problem, penalty techniques based on cost-sensitive learning are used (Koknar-Tezel & Latecki, 2009, 2010; Tang, Zhang, & Chawla, 2002; Yang, Wang, Yang, & Yu, 2008; Yang, Yang, & Wang, 2009; Zeng & Gao, 2009).
Another optimization method mentioned above is MCOC, which is used to solve classification problems in data mining and machine learning (Shi et al., 2001). The classifier mainly relies on a trade-off between the overlapping degree of different classes and the total distance from the input points to the separating hyperplane, with the former minimized and the latter maximized simultaneously, based on the idea of linear programming for classification (Freed & Glover, 1981; Glover, 1990). Since then, MCOC has been used in various applications within different fields of science, ranging from credit scoring to bioinformatics. A linear MCOC based on a compromise solution was proposed for the behavior analysis of credit cardholders (Shi et al., 2002). A multiple-phase fuzzy linear programming approach was provided for solving classification problems in data mining (He, Liu, Shi, Xu, & Yan, 2004). A penalized MCOC using the weight of the target class was proposed for solving the class-imbalanced classification problem in credit cardholder behavior analysis (Li, Shi, & He, 2008). Then a quadratic MCOC was proposed and used for credit data analysis (Peng, Kou, Shi, & Chen, 2008). A rough set-based MCOC was put forward and used for medical diagnosis and prognosis (Zhang, Shi, and Gao, 2009; Zhang, Shi, and Tian, 2009), and an MCOC with fuzzy parameters was used to improve the generalization power of MCOC: an appropriate fuzzy membership function was introduced to MCOC, the objective functions and the constraints were transformed into the fuzzy decision set, and the new MCOC with fuzzy parameters was then constructed (Zhang, Shi, and Gao, 2009; Zhang, Shi, and Tian, 2009). A kernel-based MCOC was also given, analogous to the use of kernel methods in SVM (Zhang and Zhang, et al., 2010; Zhang and Zhou, et al., 2010; Zhang and Zhu, et al., 2010).
Additionally, MCOC was used to analyze the behavior of VIP e-mail users (Zhang and Zhang, et al., 2010; Zhang and Zhou, et al., 2010; Zhang and Zhu, et al., 2010). The above rough set-based MCOC was also used to predict protein interaction hot spots (Chen et al., 2011). In these applications, MCOC outperformed some traditional data mining methods (Shi, 2010). A number of models and algorithms related to MCOC have thus gradually developed into powerful tools for solving classification, regression and other problems.

However, in many real-world applications such as bioinformatics, language information processing and credit risk evaluation, quality problems like noise, outliers and anomalies within the data set are very common; additionally, the set may be class-imbalanced, nonlinearly separable, and uncertain. Because of these uncertainties, some input points are difficult to classify correctly as one of the predefined classes. Consequently, when we train MCOC on such data sets, MCOC degenerates into an inefficient, unstable, and inaccurate classifier. In other words, MCOC lacks the capacity to deal effectively with noise, outliers, anomalies, class imbalance, nonlinearly separable cases and other uncertainties in data. To improve the performance of MCOC, we reformulate the model by introducing a kernel function to the constraints, a fuzzy membership degree to each input point, and penalty factors to the objective functions of imbalanced classes in MCOC. Thus, the proposed new method (KFP-MCOC) can improve the performance of the original MCOC approach in stability, efficiency and generalization, significantly reducing the effects of anomalies, class imbalance and nonlinearly separable problems. The rest of this paper is organized as follows: Section 2 describes the basic principles of MCOC. The new KFP-MCOC is illustrated in Section 3. The experiments on credit risk evaluation and their results are demonstrated in Section 4. Finally, discussion and conclusions are given in Sections 5 and 6, respectively.

2. MCO classifiers

Compared to many traditional methods in data mining, the multi-criteria optimization (MCO) approach based on optimization techniques was only recently introduced to practical applications. This is partly because SVM was first successfully applied to various domains; MCO approaches have since received growing attention.
The two methods share the common advantage of using flexible objectives and constraints to fit a decision function for the separation of different classes. Thus, a general classification problem using MCOC can be described as follows. For a binary classification problem, we are given the training data T = {(x_1, y_1), . . . , (x_n, y_n)}, where each input point x_i ∈ R^d belongs to either of the two classes with a label y_i ∈ {1, −1}, with i = 1, . . . , m for y_i = 1 and i = m + 1, . . . , n for y_i = −1; here d is the dimensionality of the input space and n is the sample size. In order to separate the two classes, Freed and Glover, for example, chose two measures for any input point: the overlapping degree of deviation from the separating hyperplane, and the distance between input points and the separating hyperplane (Freed & Glover, 1981). In the first case, an input point located on the wrong side of the hyperplane is misclassified, while in the second case an input point positioned on the right side of the hyperplane is correctly classified. Subsequently, Glover took the above two factors into consideration when building classification models (Glover, 1990). Let α_i (α_i ≥ 0) be the distance by which an input point x_i deviates from the separating hyperplane, and let the sum of the distances α_i be characterized by the function f(α) = ‖α‖_p^p (p ≥ 1), which should be minimized with respect to α_i; we have

$$\min f(\alpha)=\|\alpha\|_p^p \quad \text{subject to } w^T x_i - b \ge -y_i\alpha_i,\ \alpha_i \ge 0,\ \forall i. \tag{1}$$
Obviously, if α_i = 0, the input point x_i is correctly classified; if α_i > 0, the input point x_i is misclassified. Here the input points x_i are the given training data, and the weight vector w and the offset term b are unrestricted variables, i = 1, 2, . . . , n.

Similarly, let β_i (β_i ≥ 0) be the distance by which an input point x_i departs from the separating hyperplane; the sum of the distances β_i is characterized by the function g(β) = ‖β‖_q^q (q ≥ 1), which should be maximized with respect to β_i, and we get

$$\max g(\beta)=\|\beta\|_q^q \quad \text{subject to } w^T x_i - b \le y_i\beta_i,\ \beta_i \ge 0,\ \forall i. \tag{2}$$

And if β_i = 0, the input point x_i is misclassified; if β_i > 0, the input point x_i is correctly classified. If we take the two measures in Eqs. (1) and (2) into account simultaneously, MCOC is defined as

$$\min f(\alpha)=\|\alpha\|_p^p \quad \text{and} \quad \max g(\beta)=\|\beta\|_q^q \quad \text{subject to } w^T x_i - b = y_i(\beta_i-\alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i. \tag{3}$$

If C is a penalty factor of the objective function f(α), we may rewrite Eq. (3) as a new MCOC with the hybrid objective function h(α, β) with respect to α_i and β_i as follows:

$$\min h(\alpha,\beta) = C f(\alpha) - g(\beta) = C\|\alpha\|_p^p - \|\beta\|_q^q \quad \text{subject to } w^T x_i - b = y_i(\beta_i-\alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i. \tag{4}$$

According to Eq. (4), we can calculate the offset term b (b ∈ R) and the weight vector w (w ∈ R^d) of the separating hyperplane, which is defined as

$$w^T x = b. \tag{5}$$

Thus, during testing we may use the decision function to predict the class label of an input point x as below:

$$f(x) = \operatorname{sign}(w^T x - b). \tag{6}$$

If p = 1 and q = 1 in Eq. (4), we get a linear MCOC with a linear objective function, which can be rewritten as

$$\min h(\alpha,\beta) = C\sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } w^T x_i - b = y_i(\beta_i-\alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i. \tag{7}$$
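As a concrete illustration, the linear model (7) can be handed to an off-the-shelf LP solver. The sketch below is ours, not from the paper: it stacks the variables as [w, b, α, β], and, because model (7) is scale-invariant in (w, b), it box-constrains w to keep the LP bounded, a normalization the paper does not specify here.

```python
import numpy as np
from scipy.optimize import linprog

def linear_mcoc(X, y, C=10.0):
    """Sketch of the linear MCOC in Eq. (7).
    Variables are stacked as [w (d), b (1), alpha (n), beta (n)]:
        min  C*sum(alpha) - sum(beta)
        s.t. w^T x_i - b = y_i * (beta_i - alpha_i),  alpha, beta >= 0.
    w is boxed in [-1, 1]^d to keep the LP bounded (our own normalization)."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), C * np.ones(n), -np.ones(n)])
    # Equality rows: w^T x_i - b + y_i*alpha_i - y_i*beta_i = 0
    A_eq = np.zeros((n, d + 1 + 2 * n))
    A_eq[:, :d] = X
    A_eq[:, d] = -1.0
    A_eq[np.arange(n), d + 1 + np.arange(n)] = y
    A_eq[np.arange(n), d + 1 + n + np.arange(n)] = -y
    b_eq = np.zeros(n)
    bounds = [(-1, 1)] * d + [(None, None)] + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:d], res.x[d]

# Toy linearly separable data: two positive and two negative points
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = linear_mcoc(X, y)
```

On separable data such as this, the optimal solution drives all α_i to zero, so every training point satisfies y_i(wᵀx_i − b) ≥ 0.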

Similarly, if p = 1, 2 and q = 1 in Eq. (4), and the term (1/2)‖w‖₂², which defines the margin between the support hyperplanes of the two classes, is also added to the objective function, we get a quadratic MCOC with a quadratic objective function and linear constraints, which can be denoted as

$$\min h'(w,\alpha,\beta) = \tfrac{1}{2}\left(\|w\|_2^2+\|\alpha\|_2^2\right) + C\sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } w^T x_i - b = y_i(\beta_i-\alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i. \tag{8}$$

Besides, based on Eq. (7), the compromise solution approach has been used to improve the performance of the linear MCOC in business practices (Shi et al., 2001, 2002). With its obvious advantages in simplicity, high efficiency, and interpretability, MCOC has become more popular than some traditional methods in solving practical problems in recent years. But MCOC has significant limitations of instability, poor generalization and insufficient flexibility, especially for data sets that contain anomalies, class imbalance, nonlinear separability, and other uncertainties. Therefore, it is necessary to rebuild MCOC with a soft method instead of a crisp one, so as to remarkably increase its robustness, predictive accuracy and efficiency in finding an optimal decision function for the classification problem. Finally, for a multi-class classification problem, we may transform Eqs. (7) and (8) into multiple one-against-all or pairwise binary classification problems of MCOC in real-world applications.

3. KFP-MCO classifier

In this section, similar to the idea of the fuzzy SVM approach (Jiang et al., 2006; Yang et al., 2008), we propose a new KFP-MCOC which


employs kernel tricks, fuzzification, and class-imbalance methods, so that the effects of noise, outliers, class imbalance, nonlinear separability and other uncertainties in data on the performance of MCOC are reduced. The corresponding classifier algorithm is demonstrated in the following subsections.

3.1. Kernel method

For classification, the case of nonlinearly separable data is very common in many real-world applications. The kernel method is often used to address this problem; it is implemented by replacing the dot product with an appropriate positive definite function, so that a nonlinear mapping is implicitly performed from the input data into a high-dimensional feature space. An appropriate mapping function φ(x) is chosen to transform the input space, where the data set is not linearly separable, into a higher-dimensional feature space where the data set may be linearly separable (Boser, Guyon, & Vapnik, 1992). In order to compute the dot product φ(x)ᵀφ(y), we may replace it by the kernel function K(x, y) = φ(x)ᵀφ(y) of input points in the original input space, without computing the mapping function directly. The following kernel functions are often chosen:

(i) The linear kernel K(x, y) = xᵀy.
(ii) The polynomial kernel K(x, y) = (xᵀy + c)^d (c ≥ 0, d ≥ 2).
(iii) The radial basis function (RBF) kernel K(x, y) = exp(−‖x − y‖₂²/(2σ²)) (σ > 0).
(iv) The sigmoid kernel K(x, y) = tanh(a xᵀy + r) (a, r ∈ R).

In this paper, we use these four types of kernel functions in our experiments.

3.2. Fuzzification method

Ideally, a decision function of MCOC should correctly classify each input point as one of the predefined classes. However, in many practical applications, because of noise, outliers, and anomalies in data, the predictive performance of MCOC will rapidly degenerate.
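For concreteness, the four kernel functions listed in Section 3.1 can be written directly in code; the function and parameter names below are our own choices.

```python
import numpy as np

# The four kernel functions of Section 3.1, for 1-D numpy vectors x, y.
def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=2):
    # c >= 0, d >= 2
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    # sigma > 0; note K(x, x) = 1 for any x
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, r=0.0):
    # a, r are real scalars
    return np.tanh(a * (x @ y) + r)
```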
Besides, the contribution of each input point to classification is obviously different from that of other points, and some points are more important than others in solving classification problems. In other words, in the aforementioned constraints, the effects of less important input points should be reduced, or the points removed from the data, when a classifier is constructed. Thus an input point is assigned to one of the classes with uncertainty. That is, an input point belonging to one of the classes may fall into three cases: (i) the input point certainly belongs to the class; (ii) the input point certainly does not belong to the class; (iii) the input point belongs to the class with a degree of fuzzy membership s_i (0 < s_i < 1). For the first case, of course, we have s_i = 1, and s_i = 0 for the second case. For a predefined threshold s (s > 0) and the degree of fuzzy membership s_i of an input point x_i, if s_i ≤ s, the input point x_i is considered unimportant and may be discarded from MCOC; otherwise, the input point x_i is regarded as an important one in modeling the classification problem. In our paper, the fuzzification method is implemented based on the kernel feature space.

3.3. Class-imbalanced learning method

In data mining and machine learning the class-imbalanced problem is very common. Many studies show that in this case a classifier tends to overfit the samples of the majority class and, at the same time, underfit the samples of the minority class (Akbani, Kwek, & Japkowicz, 2004; Yang et al., 2008, 2009; Zeng & Gao, 2009). Generally speaking, there are several proposals to solve the problem, including: (i) Cost-sensitive learning, which adds different penalty factors to different classes of data so that misclassifying minority-class samples is costlier than misclassifying majority-class samples. (ii) Synthetic


minority over-sampling, a method which employs over-sampling of the minority class to shift the separating hyperplane towards the majority class. (iii) Margin calibration, a method which uses a margin compensation to refine the biased decision boundary for class-imbalanced learning. In our paper, we use cost-sensitive learning to build the penalized MCOC.

3.4. KFP-MCO classifier

Similar to the limitations of the traditional MCOC approach, many studies have shown that SVM is very sensitive to noise, outliers and anomalies in data, so fuzzy SVM was proposed to reduce the effects of anomalies in data (Abe, 2004; Jiang et al., 2006; Lin & Wang, 2002; Tsujinishi & Abe, 2003). A fuzzy SVM model is implemented by associating the importance or contribution of each input point with a fuzzy membership value s_i; a decision function is then constructed based on the fuzzy support vectors. Generally, this method can enhance the classification performance of SVM considerably. Following this approach, since MCOC is likewise very sensitive to noise, outliers, and anomalies in data, we propose KFP-MCOC by introducing to MCOC an appropriate fuzzy membership function, based on the distance between an input point and the representative point of its class in the kernel-induced feature space. Thus noise, outliers, and anomalies in data can be distinguished from other input points.

Given a training set T = {(x_1, y_1), . . . , (x_n, y_n)} and the mapping function φ(x_i) (i = 1, . . . , n), after mapping the input points x_i to the high-dimensional feature space, we can use the linearly separable classification method to compute the separating hyperplane. Hence, for the new training set T′ = {(φ(x_1), y_1), . . . , (φ(x_n), y_n)} in the kernel-induced feature space, we use the class mean as a representative point. For the two-class classification problem based on the new training set T′, we define φ̄_{y_i} as the mean of class y_i:

$$\bar{\phi}_{y_i} = \sum_i y_i \phi(x_i) \Big/ \sum_i y_i, \quad y_i \in \{1, -1\}, \tag{9}$$

where i = 1, . . . , m for y_i = 1, and i = m + 1, . . . , n for y_i = −1. Then the radius of class y_i is denoted as

$$r_{y_i} = \max_i \|\phi(x_i) - \bar{\phi}_{y_i}\|_2. \tag{10}$$

Let the degree of fuzzy membership s_i of each input point be a linear function of the distance to the mean and the radius of its class; we have

$$s_i = 1 - d(\phi(x_i), \bar{\phi}_{y_i}) \big/ (r_{y_i} + \delta), \tag{11}$$

where δ (δ > 0) is an adjusting constant used to avoid the case of s_i = 0. The distance d(φ(x_i), φ̄_{y_i}) between the point φ(x_i) and its class mean φ̄_{y_i} using the kernel method is written as

$$d(\phi(x_i), \bar{\phi}_{y_i}) = \|\phi(x_i) - \bar{\phi}_{y_i}\|_2 = \left(\phi(x_i)^T\phi(x_i) - 2\sum_j y_j \phi(x_i)^T\phi(x_j) \Big/ \sum_j y_j + \sum_j\sum_k y_j y_k \phi(x_j)^T\phi(x_k) \Big/ \sum_j\sum_k y_j y_k\right)^{1/2} = \left(K(x_i, x_i) - 2\sum_j y_j K(x_i, x_j) \Big/ \sum_j y_j + \sum_j\sum_k y_j y_k K(x_j, x_k) \Big/ \sum_j\sum_k y_j y_k\right)^{1/2}, \tag{12}$$

where j, k = 1, . . . , m for y_i = 1, and j, k = m + 1, . . . , n for y_i = −1. Accordingly, the radius of class y_i in the kernel-induced feature space (see Eq. (10)) is calculated by

$$r_{y_i} = \max_i \|\phi(x_i) - \bar{\phi}_{y_i}\|_2 = \max_i d(\phi(x_i), \bar{\phi}_{y_i}). \tag{13}$$

Therefore, for the linearly separable case, we are given the training set T″ = {(x_1, y_1, s_1), . . . , (x_n, y_n, s_n)} with the degrees of fuzzy membership s_i (s ≤ s_i ≤ 1), where each input point x_i ∈ R^d (i = 1, . . . , m, m + 1, . . . , n) is partitioned by the class label y_i ∈ {1, −1}. From Eq. (11) we know that an input point x_i with membership degree s_i ≤ s may be a less important one, to be considered as noise, an outlier or an anomaly; otherwise the input point x_i is regarded as a positive contribution to classification. Besides, the parameter α_i (α_i ≥ 0) is a measure of the misclassified input point in Eq. (7), so the term s_iα_i is a new measure of the classification error with a different weight for each point. In this way, the effects of noise, outliers and anomalies in data are reduced remarkably, and at the same time the stability of MCOC is improved considerably. Thus, in the linearly separable case, Eq. (7) is written as

$$\min\ C\sum_{i=1}^{n} s_i\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } w^T x_i - b = y_i(\beta_i - \alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i, \tag{14}$$

where C (C > 0) is a penalty constant for the misclassified input points. For the class-imbalanced case, if y_i = 1, let C_1 (C_1 > 0) be the misclassification cost or penalty factor of this class; similarly, if y_i = −1, let C_2 (C_2 > 0) be the misclassification cost of the other class. We can rewrite Eq. (14) as

$$\min\ C_1\sum_{i=1}^{m} s_i\alpha_i + C_2\sum_{i=m+1}^{n} s_i\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } w^T x_i - b = y_i(\beta_i - \alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i, \tag{15}$$

where the input points x_i are given, the weight vector w and the offset term b are unrestricted variables, and C_1 > 0, C_2 > 0, s < s_i ≤ 1, s > 0, i = 1, 2, . . . , n.

For the nonlinearly separable case, we suppose that φ(x) is a mapping function from the input data to a higher-dimensional feature space. According to the ideas of introducing kernel functions to mathematical programming (Cristianini & Shawe-Taylor, 2000; Smola, Bartlett, Scholkopf, & Schuurmans, 1999; Zhang and Zhang, et al., 2010; Zhang and Zhou, et al., 2010; Zhang and Zhu, et al., 2010), given the new data set T‴ = {(φ(x_1), y_1, s_1), . . . , (φ(x_n), y_n, s_n)} in the kernel-induced feature space, the weight vector w can be denoted as a linear combination of the input points φ(x_j) and the class labels y_j with respect to the nonnegative coefficients λ_j (λ_j ≥ 0), that is,

$$w = \sum_{j=1}^{n} \lambda_j y_j \phi(x_j). \tag{16}$$

Integrating the above weight vector w into Eq. (15), we have

$$\min\ C_1\sum_{i=1}^{m} s_i\alpha_i + C_2\sum_{i=m+1}^{n} s_i\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } \sum_{j=1}^{n}\lambda_j y_j \phi(x_j)^T\phi(x_i) - b = y_i(\beta_i - \alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i;\ 0 \le \lambda_j \le C_1 \text{ for } y_j = 1;\ 0 \le \lambda_j \le C_2 \text{ for } y_j = -1. \tag{17}$$

If now the dot product φ(x)ᵀφ(y) of the mapping function is replaced by the kernel function K(x, y), we get the KFP-MCOC

$$\min\ C_1\sum_{i=1}^{m} s_i\alpha_i + C_2\sum_{i=m+1}^{n} s_i\alpha_i - \sum_{i=1}^{n}\beta_i \quad \text{subject to } \sum_{j=1}^{n}\lambda_j y_j K(x_j, x_i) - b = y_i(\beta_i - \alpha_i),\ \alpha_i \ge 0,\ \beta_i \ge 0,\ \forall i;\ 0 \le \lambda_j \le C_1 \text{ for } y_j = 1;\ 0 \le \lambda_j \le C_2 \text{ for } y_j = -1. \tag{18}$$


By solving Eq. (18), we can obtain the nonnegative coefficients λ_j (j = 1, 2, . . . , n). Plugging the coefficients λ_j into Eq. (16), we can calculate the weight vector w. For any training point x_i (i = 1, 2, . . . , n′, n′ ≤ n) which satisfies α_i = 0 or β_i > 0, according to the separating hyperplane wᵀφ(x_i) = b, we have the offset term b = Σ_{j=1}^{n} λ_j y_j K(x_j, x_i); an average of the resulting values of b is then taken. The decision function is denoted as

$$f(x) = \operatorname{sign}\left(w^T\phi(x) - b\right) = \operatorname{sign}\left(\sum_{j=1}^{n}\lambda_j y_j K(x_j, x) - b\right). \tag{19}$$
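The membership computation of Eqs. (9)-(13) reduces to simple kernel-matrix arithmetic. A minimal sketch (function and variable names are ours), assuming a precomputed kernel matrix K and labels y in {+1, −1}:

```python
import numpy as np

def fuzzy_memberships(K, y, delta=1e-3):
    """Fuzzy membership s_i of Eqs. (9)-(13), from a precomputed kernel
    matrix K (n x n) and labels y in {+1, -1}; delta > 0 avoids s_i = 0."""
    s = np.zeros(len(y))
    for label in (1, -1):
        idx = np.where(y == label)[0]
        Kc = K[np.ix_(idx, idx)]
        # Eq. (12): distance of each point to its class mean, via kernels only
        d = np.sqrt(np.maximum(np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean(), 0.0))
        r = d.max()                       # Eq. (13): class radius
        s[idx] = 1.0 - d / (r + delta)    # Eq. (11)
    return s

# Toy usage with a linear kernel: the in-class outlier [3, 3] receives a
# much smaller membership than the central point [0, 0].
X = np.array([[0.0, 0.0], [0.0, 0.2], [3.0, 3.0], [-5.0, -5.0], [-5.0, -4.0]])
y = np.array([1, 1, 1, -1, -1])
s = fuzzy_memberships(X @ X.T, y)
```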

Following the above illustrations, the overall process of the KFP-MCOC algorithm can be summarized in the following four steps:

Step 1: Compute the degree of fuzzy membership s_i for each input point x_i (i = 1, 2, . . . , n) of the training set (see Eq. (11)): s_i = 1 − d(φ(x_i), φ̄_{y_i})/(r_{y_i} + δ), where the class mean φ̄_{y_i}, the class radius r_{y_i} and the distance d(φ(x_i), φ̄_{y_i}) between an input point and its class mean are given by Eqs. (9), (10), and (12) respectively, and δ (δ > 0) is a sufficiently small constant.

Step 2: Solve Eq. (18) and obtain the optimal solution λ_j (j = 1, . . . , n) of KFP-MCOC on the training set.

Step 3: Construct a decision function according to the separating hyperplane wᵀφ(x) = b with respect to λ_j on the training set (see Eq. (19)): w = Σ_{j=1}^{n} λ_j y_j φ(x_j) and b = Σ_{j=1}^{n} λ_j y_j K(x_j, x_i), where the input point x_i satisfies α_i = 0 or β_i > 0. Thus the decision function is written as f(x) = sign(Σ_{j=1}^{n} λ_j y_j K(x_j, x) − b).

Fig. 2. Separation by SVM with the RBF kernel.

Step 4: Test an unknown sample x with the above decision function: if f(x) ≥ 0, the point x is classified as the positive class; otherwise, the point x is predicted as the negative class.

3.5. Simulation

We designed a data set characterized by anomalies, class imbalance, and nonlinear separability in order to test the new KFP-MCOC. Since the linear MCOC and the quadratic MCOC cannot be directly used in nonlinearly separable cases, we only tested KFP-MCOC, SVM, and fuzzy SVM with the RBF kernel on this data set. The data set is used for binary classification, with 102 points of the negative class and 38 points of the positive class, shown by diamonds and crosses respectively in Fig. 1. Figs. 2 and 3 show the boundaries and margins found by SVM with C = 1000 and by fuzzy SVM with C = 100,000 and σ = 1 for the RBF kernel, respectively. In Fig. 4, KFP-MCOC using the RBF kernel

Fig. 1. Two-class data set.

Fig. 3. Separation by fuzzy SVM with the RBF kernel.

Fig. 4. Separation by KFP-MCOC with the RBF kernel.


generates a better separating boundary with C_1 = 1000, C_2 = 7000 and σ = 0.1, which shows that KFP-MCOC can achieve excellent accuracy for uncertain data.
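The paper does not give the generating mechanism of its simulated set; a sketch of a comparable imbalanced, nonlinearly separable two-class sample (102 negative, 38 positive points; the ring-versus-cluster geometry is our assumption) is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Negative class: a noisy ring of 102 points around the origin.
theta = rng.uniform(0.0, 2.0 * np.pi, 102)
radius = rng.normal(3.0, 0.3, 102)
X_neg = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# Positive class: a small central cluster of 38 points (minority class).
X_pos = rng.normal(0.0, 0.7, size=(38, 2))

X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(102), np.ones(38)])
```

No linear boundary separates a ring from its center, so linear classifiers fail here while RBF-kernel methods can succeed, which mirrors the setting of this simulation.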

4. Experiments and applications to credit scoring

In this section we use KFP-MCOC and other classifiers for credit risk evaluation, covering the data sets used, the experiment design, performance evaluation, experimental results, comparison analysis, and discussion in the following subsections.

4.1. Data sets

In our experiments, three credit data sets are used for credit risk analysis: an Australian credit data set, a German credit data set, and a major USA credit data set. They are presented separately as follows. The Australian credit data set is sourced from the UCI Repository of Machine Learning Databases, an online repository of data sets encompassing a wide variety of data types (http://archive.ics.uci.edu/ml). The data set consists of 307 instances of creditworthy applicants and 383 instances where applicants are not creditworthy. Each instance contains 6 nominal attributes, 8 numeric attributes, and 1 class attribute (accepted or rejected). This data set is interesting because there is a good mixture of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values in this data set. To protect data confidentiality, the attribute names and values have been changed to meaningless symbolic data. The German credit data set, which is also sourced from the UCI Repository of Machine Learning Databases, is class-imbalanced and composed of 700 instances of creditworthy applicants and 300 instances whose credit should not be extended. For each applicant, 24 input attributes describe the credit history, account balances, loan purpose, loan amount, employment status, personal information, age, housing, and job title. This data set consists entirely of numeric attributes. The last credit card data set used in our experiments is provided by a major U.S. bank. It contains 6000 records and 66 derived attributes.
Among these 6000 records, 960 are bankruptcy accounts and 5040 are ‘‘good’’ status accounts. Obviously this credit data set is also class-imbalanced. Finally, owing to the inevitable errors in collecting and computing the values of above attributes, these data sets may potentially contain noise, outliers and anomalies, at the same time, and they may be nonlinearly separable for different classes. Besides, the Australian credit data set, the German credit data set and the USA credit data set are class-balanced, with the first having the ratio of ‘‘good’’ to ‘‘bad’’ of roughly 4 to 5, the second showing a proportion of 7 to 3, and the third of approximately 5 to 1. 4.2. Experiment design In our experiments, for each credit data set, we randomly select 250 samples from each class of the respective data set, and form the training set. The remainder is used for the independent test set. Then the 10-fold cross-validation (CV) method is used to train MCOC, KFP-MCOC, SVM, and fuzzy SVM on the training subset, and the averages of predictive performance with the independent test set are calculated and reported. For the performance comparison among the different categories of classifiers, the average accuracies of those classifiers belonging to the same category are also computed and shown. Besides, missing values are filled with the mean

of each continuous attribute and the mode of each discrete attribute. All attributes are normalized to the range 0–1. Then, in the training process, the parameters of MCOC, SVM, fuzzy SVM and KFP-MCOC are chosen from specific discrete sets so as to obtain the best classification accuracy. To make the grid search method applicable, the parametric sets of these classifiers are defined as follows: the penalty factor C of MCOC, SVM, and fuzzy SVM, and the penalty factors C1 and C2 of KFP-MCOC, are taken from the set {1, 10, 20, 50, 100, 200, 500, 1000, 5000, 10,000, 100,000}; the offset term c from the set {0, 1} and the power d from the set {2, 3} for the polynomial kernel; the bandwidth r from the set {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000} for the RBF kernel; and the coefficient a from the set {0.0001, 0.001, 0.01, 0.1, 1, 10} and the offset term r from the set {0.00001, 0.0001, 0.001, 0.01, 0.1, 1} for the sigmoid kernel. For each of the classifiers mentioned above, an iterative procedure over the corresponding parametric set is used to determine the best parameter, and the 10-fold CV approach is applied in each iteration. The best parameter for the training subsets is chosen with regard to the best average performance on the validation subsets, so the 10-fold CV method yields 10 classifiers corresponding to the 10 training subsets. The same classifiers with the same parameter values are then also used for prediction on the independent test set, so that the average accuracy for evaluating the total performance and for the statistical comparison of different classifiers can be computed and reported. It is clear that, for the purposes of preventing credit fraud and avoiding bad loans, the classification accuracy for "bad" credit applicants must be improved considerably; at the same time, these improvements should not degrade the predictive accuracy for "good" credit applicants and for all creditors.
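The parameter selection just described is an exhaustive grid search with 10-fold cross-validation inside each candidate setting. It can be sketched in a few lines of stdlib-only Python (our illustration, not the authors' Matlab code; `train_eval` is a hypothetical callback that trains one of the four classifiers with the given parameters and returns its validation accuracy):

```python
import itertools
import random
import statistics

def kfold_indices(n, k=10, seed=0):
    """Shuffle range(n) and split it into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def grid_search_cv(train_eval, data, grid, k=10):
    """Return (best_params, best_mean_validation_accuracy).

    grid maps parameter names to candidate lists, e.g.
    {'C': [1, 10, 20, 50, ...]} as in Section 4.2;
    train_eval(params, train_idx, val_idx, data) must return the
    validation accuracy of the classifier trained with params.
    """
    folds = kfold_indices(len(data), k)
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        accs = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            accs.append(train_eval(params, train_idx, val_idx, data))
        mean_acc = statistics.mean(accs)
        if best is None or mean_acc > best[1]:
            best = (params, mean_acc)
    return best
```

The winning parameter setting is then reused on the independent test set, as described above.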
Thus in our experiments eight accuracy measures are used to evaluate the predictive performance of the classifiers. The first five are defined as follows:

(i) Total accuracy (the total classification accuracy rate, TA):

Total accuracy = (TP + TN)/(TP + FN + TN + FP).     (20)

(ii) Type I accuracy (the identification rate of "bad" creditors, T1A):

Type I accuracy = TP/(TP + FN).     (21)

(iii) Type II accuracy (the identification rate of "good" creditors, T2A):

Type II accuracy = TN/(TN + FP).     (22)

(iv) F1 score (the mixed measure of classification, F1S):

F1 score = 2TP/(2TP + FN + FP).     (23)

(v) Matthew's correlation coefficient (adjusted for the impact of an imbalanced data set, MCC):

MCC = (TP · TN − FP · FN)/((TP + FN)(TP + FP)(TN + FP)(TN + FN))^(1/2).     (24)
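The five measures of Eqs. (20)–(24) can be computed directly from the four confusion-matrix counts; the following stdlib-only Python sketch (ours, not part of the original paper) mirrors those formulas, treating the "bad" creditors as the positive class:

```python
import math

def scoring_measures(tp, fn, tn, fp):
    """TA, T1A, T2A, F1S and MCC of Eqs. (20)-(24); 'bad' creditors
    are the positive class, so tp counts correctly flagged bad ones."""
    ta = (tp + tn) / (tp + fn + tn + fp)           # Eq. (20)
    t1a = tp / (tp + fn)                           # Eq. (21)
    t2a = tn / (tn + fp)                           # Eq. (22)
    f1s = 2 * tp / (2 * tp + fn + fp)              # Eq. (23)
    mcc = (tp * tn - fp * fn) / math.sqrt(         # Eq. (24)
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return ta, t1a, t2a, f1s, mcc
```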

where true positive (TP) is the number of "bad" creditors that are correctly predicted, false negative (FN) is the number of "bad" creditors that are predicted as "good" creditors, true negative (TN) is the number of "good" creditors that are correctly predicted, and false positive (FP) is the number of "good" creditors that are predicted as "bad" creditors. In formula (23), the F1 score is the harmonic mean of recall and precision, so it accounts for both of these quantities of predictive accuracy. The value of MCC lies between −1 and 1, and a higher MCC corresponds to a better predictive performance of the classifier. Besides, the Kolmogorov–Smirnov (KS) statistic is one of the most useful and general nonparametric methods for evaluating classifiers, as it is sensitive to differences in both location and

Please cite this article in press as: Zhang, Z., et al. Credit risk evaluation using multi-criteria optimization classifier with kernel, fuzzification and penalty factors. European Journal of Operational Research (2014), http://dx.doi.org/10.1016/j.ejor.2014.01.044


Z. Zhang et al. / European Journal of Operational Research xxx (2014) xxx–xxx

shape of the empirical cumulative distribution functions of the two classes. The area under curve (AUC) statistic is an empirical measure of classification performance based on the area under an ROC (receiver operating characteristic) curve (Fawcett, 2006), where a classifier is preferred if its ROC curve is closer to the upper-left corner, that is, if it has a large AUC. The value of AUC lies in the interval [0, 1], and a larger AUC means a better predictive performance of the classifier. For comparative purposes, the H measure (HM) uses different distributions to evaluate different classifiers, that is, it makes fairer comparisons (Alpaydin, 2010; Hand, 2009). Finally, all of our experiments are carried out using Matlab 7.10. The linear programming problems of linear MCOC and KFP-MCOC, and the convex quadratic programming problems of quadratic MCOC, SVM and fuzzy SVM, are solved by using Matlab optimization tools.
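As an illustration of these two statistics (ours, not part of the paper), the empirical KS distance and the AUC can be computed directly from the scores a classifier assigns to the two classes, assuming that higher scores indicate "bad" creditors:

```python
def ks_and_auc(scores_bad, scores_good):
    """Empirical KS statistic and AUC for two lists of decision scores."""
    # AUC as the Mann-Whitney statistic: the probability that a randomly
    # chosen "bad" creditor outscores a randomly chosen "good" one
    # (ties counted as 1/2).
    wins = sum(1.0 if sb > sg else 0.5 if sb == sg else 0.0
               for sb in scores_bad for sg in scores_good)
    auc = wins / (len(scores_bad) * len(scores_good))
    # KS as the largest gap between the two empirical CDFs,
    # scanned over all observed score thresholds.
    thresholds = sorted(set(scores_bad) | set(scores_good))
    ks = max(abs(sum(s <= t for s in scores_bad) / len(scores_bad)
                 - sum(s <= t for s in scores_good) / len(scores_good))
             for t in thresholds)
    return ks, auc
```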

4.3. Results with the independent test set

For the Australian credit data set, the MCOC, SVM, fuzzy SVM and KFP-MCOC models are trained on the training subsets and validated on the validation subsets of the 10-fold CV method in turn, and the resulting classifiers are tested on the independent test set. Hence, the averages of the different accuracies of these classifiers and the total averages of predictive performance are calculated and reported in Tables 1–4 respectively. As the experimental results in Tables 1–4 show, the predictive performance of our proposed KFP-MCOC is slightly better on average than that of the others. Generally KFP-MCOC has the highest values of type I accuracy (average 95.44%), type II accuracy (average 83.03%), total accuracy (average 86.79%), F1 score (average 0.81), MCC (average 0.74), KS (average 0.83), AUC (average 0.91) and H measure (average 0.62). Then, the predictive accuracy of fuzzy SVM with the sigmoid

Table 1
Evaluation of MCOC for the credit risk prediction on the Australian credit test set.

Classifiers          TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
MCOC Linear          78.95    87.89     75.11     0.71   0.58   0.64   0.82   0.37
MCOC Quadratic       85.95    91.05     83.76     0.80   0.70   0.75   0.88   0.53
Average              82.45    89.47     79.44     0.76   0.64   0.70   0.85   0.45

Table 2
Evaluation of SVM for the credit risk prediction on the Australian credit test set.

Kernel classifiers   TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
SVM Linear           83.26    85.09     82.48     0.75   0.64   0.74   0.87   0.50
SVM Polynomial       82.93    82.71     83.03     0.75   0.63   0.76   0.88   0.55
SVM RBF              85.63    95.44     81.43     0.80   0.71   0.79   0.90   0.55
SVM Sigmoid          86.16    97.72     81.20     0.81   0.73   0.79   0.90   0.54
Average              84.50    90.24     82.04     0.78   0.68   0.77   0.89   0.54

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 3
Evaluation of fuzzy SVM for the credit risk prediction on the Australian credit test set.

Kernel classifiers     TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
Fuzzy SVM Linear       84.21    92.98     80.45     0.78   0.68   0.74   0.87   0.49
Fuzzy SVM Polynomial   85.53    87.54     84.66     0.78   0.69   0.75   0.88   0.53
Fuzzy SVM RBF          87.26    96.84     82.26     0.82   0.75   0.82   0.91   0.61
Fuzzy SVM Sigmoid      88.74    98.25     84.66     0.84   0.78   0.83   0.92   0.61
Average                86.44    93.90     83.01     0.81   0.73   0.79   0.90   0.56

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 4
Evaluation of KFP-MCOC for the credit risk prediction on the Australian credit test set.

Kernel classifiers    TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
KFP-MCOC Linear       85.00    90.88     82.48     0.78   0.69   0.80   0.90   0.58
KFP-MCOC Polynomial   86.37    95.61     82.41     0.81   0.73   0.80   0.90   0.57
KFP-MCOC RBF          88.84    96.84     85.41     0.84   0.77   0.87   0.93   0.71
KFP-MCOC Sigmoid      86.96    98.42     82.03     0.82   0.75   0.84   0.92   0.63
Average               86.79    95.44     83.03     0.81   0.74   0.83   0.91   0.62

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 5
Evaluation of MCOC for the credit risk prediction on the German credit test set.

Classifiers          TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
MCOC Linear          69.60    70.00     69.56     0.32   0.25   0.34   0.51   0.03
MCOC Quadratic       68.66    70.60     68.44     0.31   0.24   0.41   0.70   0.04
Average              69.13    70.30     69.00     0.32   0.25   0.38   0.61   0.04

Table 6
Evaluation of SVM for the credit risk prediction on the German credit test set.

Kernel classifiers   TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
SVM Linear           73.00    76.00     72.67     0.36   0.31   0.49   0.74   0.06
SVM Polynomial       69.60    70.00     69.56     0.32   0.25   0.40   0.70   0.03
SVM RBF              70.00    72.00     69.78     0.32   0.26   0.42   0.71   0.04
SVM Sigmoid          69.20    72.00     68.89     0.32   0.26   0.42   0.71   0.04
Average              70.45    72.50     70.23     0.33   0.27   0.43   0.72   0.04

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 7
Evaluation of fuzzy SVM for the credit risk prediction on the German credit test set.

Kernel classifiers     TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
Fuzzy SVM Linear       71.40    80.00     70.44     0.36   0.32   0.50   0.75   0.06
Fuzzy SVM Polynomial   71.20    76.00     70.67     0.36   0.30   0.47   0.73   0.05
Fuzzy SVM RBF          72.40    74.00     72.22     0.35   0.30   0.50   0.75   0.06
Fuzzy SVM Sigmoid      73.20    74.00     73.11     0.36   0.30   0.47   0.74   0.05
Average                72.05    76.00     71.61     0.36   0.31   0.49   0.74   0.06

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 8
Evaluation of KFP-MCOC for the credit risk prediction on the German credit test set.

Kernel classifiers    TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
KFP-MCOC Linear       71.68    77.08     71.09     0.36   0.31   0.52   0.76   0.06
KFP-MCOC Polynomial   73.40    78.00     72.89     0.37   0.33   0.51   0.75   0.06
KFP-MCOC RBF          73.20    80.00     72.44     0.37   0.34   0.51   0.75   0.06
KFP-MCOC Sigmoid      72.48    77.00     72.16     0.37   0.33   0.54   0.77   0.07
Average               72.69    78.02     72.15     0.37   0.33   0.52   0.76   0.06

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 9
Evaluation of MCOC for the credit risk prediction on the USA credit test set.

Classifiers          TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
MCOC Linear          70.49    71.55     70.33     0.38   0.29   0.42   0.71   0.06
MCOC Quadratic       70.47    74.65     69.85     0.39   0.31   0.45   0.72   0.07
Average              70.48    73.10     70.09     0.39   0.30   0.44   0.72   0.07


Table 10
Evaluation of SVM for the credit risk prediction on the USA credit test set.

Kernel classifiers   TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
SVM Linear           73.53    78.87     72.73     0.43   0.37   0.52   0.76   0.09
SVM Polynomial       72.89    82.96     71.40     0.44   0.38   0.54   0.77   0.10
SVM RBF              75.51    76.06     75.43     0.44   0.37   0.51   0.76   0.10
SVM Sigmoid          75.35    80.99     74.51     0.46   0.40   0.55   0.78   0.11
Average              74.32    79.72     73.52     0.44   0.38   0.53   0.77   0.10

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 11
Evaluation of fuzzy SVM for the credit risk prediction on the USA credit test set.

Kernel classifiers     TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
Fuzzy SVM Linear       74.38    81.27     73.36     0.45   0.39   0.55   0.77   0.11
Fuzzy SVM Polynomial   73.35    84.79     71.65     0.45   0.40   0.56   0.78   0.11
Fuzzy SVM RBF          74.29    76.76     73.92     0.44   0.36   0.51   0.75   0.09
Fuzzy SVM Sigmoid      76.05    80.28     75.43     0.46   0.40   0.56   0.78   0.12
Average                74.52    80.78     73.59     0.45   0.39   0.55   0.77   0.11

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

Table 12
Evaluation of KFP-MCOC for the credit risk prediction on the USA credit test set.

Kernel classifiers    TA (%)   T1A (%)   T2A (%)   F1S    MCC    KS     AUC    HM
KFP-MCOC Linear       75.01    82.79     73.86     0.46   0.40   0.57   0.78   0.11
KFP-MCOC Polynomial   74.95    83.07     73.75     0.46   0.40   0.57   0.78   0.11
KFP-MCOC RBF          73.17    83.22     71.69     0.44   0.38   0.55   0.77   0.10
KFP-MCOC Sigmoid      75.57    82.37     74.57     0.47   0.41   0.57   0.78   0.12
Average               74.68    82.86     73.47     0.46   0.40   0.57   0.78   0.11

The bold value means that compared with other classifiers the classifier gains the best predictive performance.

kernel and of KFP-MCOC with the RBF kernel exceeds that of linear MCOC, quadratic MCOC, and SVM. Besides, KFP-MCOC with the RBF kernel achieves better total accuracy than SVM and fuzzy SVM with the RBF kernel. Owing to the majority of "good" creditors, SVM with the linear kernel and fuzzy SVM with the polynomial and sigmoid kernels achieve better type II accuracy than the other classifiers. However, for KS, AUC, and H measure, KFP-MCOC is superior in predictive performance to the other classifiers. Similarly, for the German credit data set, MCOC, SVM, fuzzy SVM and KFP-MCOC are trained and obtained with the 10-fold

Fig. 5. ROC curves for different classifiers on the Australian credit test set.


CV subsets, and then the resulting classifiers are tested on the independent test set. The average accuracies and the total averages with respect to the predictive performance are shown in Tables 5–8 respectively. As the experimental results in Tables 5–8 demonstrate, we find that the predictive performance of our proposed KFP-MCOC is slightly better on average than that of the other methods. Generally KFP-MCOC has the highest values of total accuracy (average 72.69%), type I accuracy (average 78.02%), type II accuracy (average 72.15%), F1 score (average 0.37), MCC (average 0.33), KS (average 0.52), AUC (average 0.76) and H measure (average 0.06). The predictive accuracies of SVM, FSVM, and KFP-MCOC are better than those of linear and quadratic MCOC. Besides, for the type II and total accuracies, SVM with the linear kernel and fuzzy SVM with the sigmoid kernel outperform MCOC and KFP-MCOC. For KS, AUC, and H measure, KFP-MCOC is obviously superior in predictive performance to MCOC and SVM. For the USA credit data set, MCOC, SVM, fuzzy SVM and KFP-MCOC are trained and obtained with the 10-fold CV subsets, and then these classifiers are tested on the independent test set. Hence, the averages of the different accuracies of the classifiers, and the total averages with respect to the predictive performance, are shown in Tables 9–12 respectively.


As the experimental results in Tables 9–12 demonstrate, KFP-MCOC is slightly better than the others in predicting "bad" creditors (average 82.86%), the total accuracy (average 74.68%), F1 score (average 0.46), MCC (average 0.40), KS (average 0.57), AUC (average 0.78), and H measure (average 0.11). For the linear, RBF, and sigmoid kernels, the type I accuracy of KFP-MCOC is remarkably superior to that of the other classifiers, while for the linear and polynomial kernels, the total and type II accuracies of KFP-MCOC are slightly better than those of the other classifiers. Besides, because of the majority of "good" creditors in the data, SVM with the RBF kernel and fuzzy SVM with the sigmoid kernel have a slightly better type II accuracy, while fuzzy SVM with the polynomial kernel achieves the best type I accuracy. However, for KS, AUC, and H measure, KFP-MCOC is on average superior in predictive performance to MCOC and SVM.

4.4. Comparison analysis for performance of classifiers

On the independent test sets of the three credit data sets, we select the classifiers with the best predictive performance as obtained by the 10-fold CV method and plot the corresponding receiver operating characteristic (ROC) curves, as illustrated in Figs. 5–7 respectively.

Fig. 6. ROC curves for different classifiers on the German credit test set.


Fig. 7. ROC curves for different classifiers on the USA credit test set.

From Figs. 5–7, we find that KFP-MCOC in total performs better than the linear or quadratic MCOC. Hence, we can say that KFP-MCOC achieves a notable improvement in classification performance compared with MCOC. Besides, on the Australian credit data set, as shown in Fig. 5, KFP-MCOC with the linear and RBF kernels explicitly outperforms SVM and fuzzy SVM with the corresponding kernels for the best predictive performance. On the German credit data set, as seen in Fig. 6, KFP-MCOC with the polynomial, RBF, and sigmoid kernels outperforms SVM with the corresponding kernel functions under different loss conditions. On the USA credit data set, as seen in Fig. 7, for the different kernels KFP-MCOC is slightly better than SVM and fuzzy SVM in predictive accuracy, except for fuzzy SVM with the linear and sigmoid kernels, according to the visual analysis of the ROC curves. For the comparison of different classifiers, statistical tests are often used on the same data set (Alpaydin, 2010; Kuncheva, 2004). To this end, we employ the 10-fold CV Paired t-Test to compare the classification performance of KFP-MCOC with that of MCOC, SVM, and FSVM. To compare the performance of two classifiers, we use the measures of total accuracy, type I accuracy, type II accuracy, F1 score, MCC, KS, AUC and the H measure. For each measure m, a set of 10 paired differences between the two classifiers is obtained. One assumption that we make is that the set of differences is an independently drawn sample from an

approximately normal distribution. In fact, for the 10-fold CV method the training sets overlap, but the test sets are independent. Under the null hypothesis, the mean difference d̄ of the two classifiers is less than or equal to zero. The following statistic then has a t-distribution with 9 degrees of freedom:

t = d̄ · sqrt(10) / sqrt( sum_{i=1}^{10} (m_i − d̄)^2 / 9 ),     (25)

where d̄ = (1/10) sum_{i=1}^{10} m_i. For the level of significance 0.05, the tabulated value is t_{0.025,9} = 2.26. Thus if t > 2.26, we reject the null hypothesis and conclude that the first classifier attains significantly higher accuracy than the second; otherwise, we retain it. The study (Demsar, 2006) shows that non-parametric tests are safer and stronger than parametric tests because they do not assume normal distributions. The Wilcoxon signed-ranks test (WSR-Test), a non-parametric statistical test, is often used to rank the differences in the performance of two classifiers. For the 10-fold CV method, let d_i be the difference between the performance measures of the two classifiers on the ith fold with the independent test set. The differences are ranked according to their absolute values, and average ranks are assigned in case of ties. Let R+ be the sum of ranks for the folds on which the first classifier outperformed the second, and let R− be the sum of ranks


Table 13
Statistical comparison of different classifiers on the Australian credit test set. Each cell shows the t statistic of the 10-fold CV Paired t-Test and the z statistic of the WSR-Test as t/z.

Kernel type  Classifier 1  Classifier 2     TA           T1A          T2A          F1S          MCC          KS           AUC          HM
Linear       KFP-MCOC      Linear MCOC      8.17/−2.80   4.02/−2.55   7.60/−2.80   8.00/−2.80   8.09/−2.80   14.50/−2.80  7.25/−2.80   19.91/−2.80
                           SVM              1.53/−1.53   5.07/−2.80   0.00/−0.05   2.14/−1.78   2.26/−1.99   5.47/−2.80   2.73/−2.80   7.56/−2.80
                           FSVM             1.21/−1.07   −2.34/−1.99  2.13/−1.78   0.67/−0.46   0.49/−0.46   5.23/−2.80   2.62/−2.80   8.27/−2.80
Polynomial   KFP-MCOC      Quadratic MCOC   1.18/−1.07   5.75/−2.80   −2.42/−2.04  2.84/−2.19   3.33/−2.29   4.52/−2.80   2.27/−2.80   3.21/−2.80
                           SVM              2.43/−2.04   4.13/−2.80   −1.56/−0.97  3.03/−2.40   3.00/−2.40   3.80/−2.80   1.90/−2.80   1.81/−2.80
                           FSVM             0.70/−0.82   3.66/−2.60   −1.64/−1.78  1.47/−1.48   1.64/−1.48   4.52/−2.80   2.26/−2.50   3.21/−2.80
RBF          KFP-MCOC      Quadratic MCOC   5.06/−2.70   6.98/−2.80   2.37/−1.89   5.68/−2.70   6.10/−2.80   10.94/−2.80  5.47/−2.80   16.53/−2.80
                           SVM              4.95/−2.70   1.40/−1.27   6.28/−2.70   4.46/−2.70   4.19/−2.70   7.13/−2.80   3.57/−2.80   14.64/−2.80
                           FSVM             3.07/−2.29   0.00/−1.07   3.22/−2.29   2.92/−2.29   2.69/−2.19   4.28/−2.80   2.34/−2.80   9.19/−2.80
Sigmoid      KFP-MCOC      Quadratic MCOC   0.62/−1.63   8.94/−2.80   −3.97/−2.09  2.09/−2.29   2.79/−2.50   8.32/−2.80   4.16/−2.80   9.20/−2.80
                           SVM              0.33/−1.78   0.34/−0.82   0.10/−1.17   0.46/−1.58   0.45/−1.48   4.28/−2.80   2.14/−2.80   8.06/−2.80
                           FSVM             −0.36/−1.40  3.25/−2.67   −1.28/−1.07  −2.22/−0.27  −1.06/−1.29  3.71/−2.80   2.37/−2.70   1.42/−2.80

The bold value means that for the predictive performance the first classifier significantly outperforms the second one according to the statistical values.

Table 14
Statistical comparison of different classifiers on the German credit test set: 10-fold CV Paired t-Test (t statistic, above) and WSR-Test (z statistic, below). As in Table 13, classifier 1 is KFP-MCOC with the linear, polynomial, RBF and sigmoid kernels, classifier 2 is the corresponding MCOC (linear or quadratic), SVM and FSVM, and the measures are TA, T1A, T2A, F1S, MCC, KS, AUC and HM.

1.58 1.27 0.43 1.07 4.29 2.70

3.14 2.40 2.22 1.99 0.22 0.25

1.06 0.46 0.88 1.38 4.24 2.70

4.45 2.80 0.94 0.76 4.18 2.70

9.37 2.80 9.02 2.80 3.70 2.40

1.86 1.68 5.82 2.80 6.51 2.80

8.22 2.80 8.32 2.80 4.89 2.70

0.06 0.05 0.74 0.36 7.60 2.80

5.47 2.70 10.94 2.80 1.00 0.87

3.26 2.80 2.81 2.80 0.36 1.22

4.74 2.65 6.34 2.70 1.26 1.68

MCC

KS

AUC

HM

4.91 2.80 1.06 0.56 2.97 2.40

17.08 2.80 5.07 2.80 2.68 2.80

23.72 2.50 2.53 2.50 1.34 2.80

4.06 2.80 3.73 2.80 0.63 2.80

9.4 2.80 10.51 2.80 1.24 1.07

8.11 2.80 9.62 2.80 2.89 2.09

15.33 2.80 15.21 2.80 5.66 2.80

7.67 2.80 7.60 2.40 2.83 0.15

3.77 2.80 3.78 2.80 1.75 2.80

1.48 1.48 2.03 1.99 6.95 2.80

4.81 2.70 3.84 2.40 12.56 2.80

5.79 2.80 6.08 2.80 10.56 2.80

16.17 2.80 12.97 2.80 2.24 2.80

6.88 2.80 5.41 2.80 0.04 0.87

3.50 2.80 2.96 2.80 0.75 2.80

2.32 2.80 1.90 2.60 0.16 1.33

6.51 2.80 5.99 2.80 1.55 2.04

9.74 2.80 9.22 2.80 2.67 2.55

17.59 2.80 15.80 2.80 8.94 2.80

8.80 2.80 7.90 2.19 4.47 0.36

4.07 2.80 3.82 2.80 1.86 2.80

The bold value means that for the predictive performance the first classifier significantly outperforms the second one according to the statistical values.

for the opposite. The ranks of d_i = 0 are split evenly between the two sums. That is,

R+ = sum_{d_i > 0} rank(d_i) + (1/2) sum_{d_i = 0} rank(d_i),
R− = sum_{d_i < 0} rank(d_i) + (1/2) sum_{d_i = 0} rank(d_i).     (26)

Let t be the smaller of the two sums, that is, t = min(R+, R−). Then the z statistic of the 10-fold CV method is approximately normally distributed and can be written as

z = (t − (1/4)K(K + 1)) / sqrt((1/24)K(K + 1)(2K + 1)),     (27)

where K = 10. With the level of significance 0.05, the null hypothesis should be rejected if z < −1.96, and then we can draw a conclusion


Table 15
Statistical comparison of different classifiers on the USA credit test set: 10-fold CV Paired t-Test (t statistic, above) and WSR-Test (z statistic, below). As in Table 13, classifier 1 is KFP-MCOC with the linear, polynomial, RBF and sigmoid kernels, classifier 2 is the corresponding MCOC (linear or quadratic), SVM and FSVM, and the measures are TA, T1A, T2A, F1S, MCC, KS, AUC and HM.

14.40 2.80 16.13 0.25 1.03 2.80

16.42 2.80 17.90 2.60 4.77 2.80

9.06 2.80 17.23 0.25 0.69 2.80

25.97 2.80 10.69 2.29 1.94 2.80

27.52 2.80 5.81 2.80 2.13 2.70

14.04 2.80 6.80 2.80 2.75 2.80

7.02 2.80 3.40 2.70 1.38 2.80

5.34 2.80 2.91 2.80 1.24 2.80

11.30 2.80 10.81 2.80 15.15 2.80

5.93 2.80 6.48 2.80 12.10 2.80

8.74 2.80 10.12 2.80 14.76 2.80

10.05 2.80 14.43 2.80 16.17 2.80

9.43 2.80 17.66 2.80 13.00 2.80

11.59 2.80 3.17 2.80 0.37 2.80

5.79 2.80 1.58 2.80 0.19 2.80

4.69 2.80 2.01 2.80 0.92 2.80

6.84 2.80 5.26 2.70 3.28 2.50

6.66 2.80 12.89 2.80 16.11 2.80

3.39 2.29 7.33 2.80 6.67 2.80

7.74 2.80 2.94 2.40 7.22 2.80

7.77 2.80 6.66 2.80 10.77 2.80

9.90 2.80 4.62 2.80 5.70 2.80

4.95 2.80 2.31 2.80 2.85 2.80

3.53 2.80 2.38 2.80 1.30 2.80

11.62 2.80 2.31 1.99 2.21 1.89

6.10 2.80 1.21 1.38 1.11 0.87

8.91 2.80 2.12 1.58 2.14 1.89

9.54 2.80 3.09 2.29 2.19 1.78

8.99 2.80 2.89 2.09 2.07 1.68

11.83 2.80 1.97 2.80 1.69 2.80

5.91 2.80 0.99 2.80 0.84 1.33

5.02 2.80 0.80 2.80 0.30 2.80

The bold value means that for the predictive performance the first classifier significantly outperforms the second one according to the statistical values.

that the difference between the two classifiers is statistically significant. The statistics of the Paired t-Test (see Eq. (25)) and the WSR-Test (see Eq. (27)) based on the 10-fold CV method are calculated for the three credit data sets and shown in Tables 13–15 respectively. For the Australian credit data set, the results in Table 13, together with the experimental results on classifier performance in Section 4.3, show that KFP-MCOC totally outperforms linear and quadratic MCOC, and that the type I accuracy of KFP-MCOC is better than that of SVM. For the measures of KS, AUC and HM, according to the values of both the t and z statistics, KFP-MCOC is more outstanding than the other, similar classifiers. However, we find that there is no statistically significant difference between KFP-MCOC and fuzzy SVM, except that fuzzy SVM with the linear kernel outperforms KFP-MCOC with the linear kernel for the type I accuracy. On the German credit data set, we come to similar conclusions. KFP-MCOC significantly outperforms both linear and quadratic MCOC. Also, KFP-MCOC outperforms SVM except for the case of the linear kernel. For the type I accuracy, fuzzy SVM with the polynomial kernel is better than KFP-MCOC with the same kernel. However, for the z statistical values of the KS and HM measures, KFP-MCOC is obviously better than the other classifiers in predictive performance. On the USA credit data set, we conclude that KFP-MCOC is better than the linear and quadratic MCOC. At the same time, KFP-MCOC is better than SVM with the linear and polynomial kernels, except for the type I accuracy, AUC, and H measure. Besides, KFP-MCOC with the RBF kernel is better than SVM and fuzzy SVM with the same kernel for the measures of type I accuracy, F1 score, MCC, KS, and AUC. Finally, for the sigmoid kernel there is no statistically significant difference between KFP-MCOC and fuzzy SVM.
In general, for the comparison of different classifiers we find that the results of the non-parametric statistical tests (z statistic) are consistent with those of the parametric statistical tests (t statistic), although the former are not as sharp or sensitive as the latter.
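Both test statistics are easy to reproduce. The following stdlib-only Python sketch (ours, not the authors' code) implements Eq. (25) and Eqs. (26) and (27) for the K = 10 fold-wise performance differences of two classifiers:

```python
import math

def paired_t(m):
    """Paired t statistic of Eq. (25); m holds the K fold-wise
    differences m_i (classifier 1 minus classifier 2)."""
    k = len(m)
    d_bar = sum(m) / k
    s2 = sum((mi - d_bar) ** 2 for mi in m) / (k - 1)  # unbiased variance
    return d_bar * math.sqrt(k) / math.sqrt(s2)

def wilcoxon_z(d):
    """z statistic of Eqs. (26) and (27), with average ranks for ties
    and the ranks of zero differences split between R+ and R-."""
    k = len(d)
    order = sorted(range(k), key=lambda i: abs(d[i]))
    ranks = [0.0] * k
    i = 0
    while i < k:                      # assign average ranks to |d| ties
        j = i
        while j + 1 < k and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for p in range(i, j + 1):
            ranks[order[p]] = (i + j) / 2 + 1  # mean of 1-based positions
        i = j + 1
    half_zeros = 0.5 * sum(r for r, di in zip(ranks, d) if di == 0)
    r_plus = sum(r for r, di in zip(ranks, d) if di > 0) + half_zeros
    r_minus = sum(r for r, di in zip(ranks, d) if di < 0) + half_zeros
    t = min(r_plus, r_minus)                     # Eq. (26)
    return (t - k * (k + 1) / 4) / math.sqrt(    # Eq. (27)
        k * (k + 1) * (2 * k + 1) / 24)
```

Note that with K = 10 the most extreme possible value is z = −27.5/sqrt(96.25) ≈ −2.80, which is exactly the value that recurs throughout Tables 13–15.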

5. Discussion

From the above experimental results of credit scoring, we find that our models feature a significantly better performance in efficiency, flexibility, separation and generalization than the original MCOC. As we know, the capacity of catching the "bad" creditors is an important measure for credit risk evaluation approaches. Hence, for the average predictive accuracy of "bad" creditors (type I accuracy), as seen in Tables 4, 8 and 12, we find that our proposed KFP-MCOC provides the best average accuracy for credit risk prediction. Moreover, as shown by the parametric and non-parametric statistical comparisons of classifier performance in Tables 13–15, for KS, AUC, and H measure, KFP-MCOC with the linear and RBF kernels slightly outperforms SVM on the three credit data sets, while the F1 score and MCC also illustrate that KFP-MCOC is a good classification method. Generally there is no statistically significant difference between KFP-MCOC and fuzzy SVM. Finally, it is worth noting that KFP-MCOC only employs the linear programming method to efficiently solve classification problems.

6. Conclusions

In this paper, we proposed an improved MCOC method, KFP-MCOC, which is based on kernel, fuzzification, and penalty factors for credit risk evaluation. KFP-MCOC extends the capacities of MCOC and avoids solving the convex quadratic programming problem which is required for SVM and fuzzy SVM. KFP-MCOC is characterized by using a kernel function to transform the original space into a new high-dimensional feature space, introducing a degree of fuzzy membership for each input point in the kernel-induced feature space, and using class-imbalanced penalty factors to reach a compromise between overfitting the majority class and underfitting the minority class. The improved classifier can effectively reduce the effects of noise, outliers, and anomalies in data, class imbalance, and nonlinearly separable cases.
At the same time, KFP-MCOC was tested with simulation data and three real world data sets. The experimental results and the statistical


comparative analysis show that KFP-MCOC is a more effective classifier for credit risk evaluation and has great potential as a classification approach for other applications. In future work, we plan to improve the interpretability of the variables used by KFP-MCOC in credit risk evaluation and other real-world applications.

Acknowledgements

The authors would like to thank Mr. Dirk Kayser for his careful polishing of the English. They also thank the three anonymous reviewers for their valuable comments and suggestions. This research was partially supported by the Science Foundation of Ludong University (LY2010013), the National Natural Science Foundation of China (#70871111, #70921061), the Major International (Regional) Joint Research Project of the National Natural Science Foundation of China (#71110107026), the Natural Science Foundation of Shandong (ZR2009GM001, J09LG01, ZR2012FL13), and the CAS/SAFEA International Partnership Program for Creative Research Teams.

References

Abdou, Hussein A. (2009). Genetic programming for credit scoring: The case of Egyptian public sector banks. Expert Systems with Applications: An International Journal, 36(9), 11402–11417.
Abe, Shigeo (2004). Fuzzy LP-SVMs for multiclass problems. In ESANN'2004 proceedings – European symposium on artificial neural networks (Belgium) (pp. 429–434).
Abe, Shigeo, & Inoue, Takuya (2002). Fuzzy support vector machines for multiclass problems. In ESANN'2002 proceedings (pp. 113–118).
Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In J. F. Boulicaut, F. Esposito, F. Giannotti, & D. Pedreschi (Eds.), ECML, LNCS (LNAI) (Vol. 3201, pp. 39–50). Springer.
Alpaydin, Ethem (2010). Introduction to machine learning (2nd ed.). London: The MIT Press.
Baesens, B., Egmont-Petersen, M., Castelo, R., & Vanthienen, J. (2002). Learning Bayesian network classifiers for credit scoring using Markov chain Monte Carlo search. In 16th international conference on pattern recognition (ICPR'02) (Vol. 3, p. 30049).
Bellotti, Tony, & Crook, Jonathan (2009). Support vector machines for credit scoring and discovery of significant features. Expert Systems with Applications: An International Journal, 36(2), 3302–3308.
Bolton, Christine (2009). Logistic regression and its application in credit scoring. Dissertation, University of Pretoria.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th annual workshop on computational learning theory (pp. 144–152).
Chen, R., Zhang, Z., Wu, D., Zhang, P., Zhang, X., Wang, Y., & Shi, Y. (2011). Prediction of protein interaction hot spots using rough set-based multiple criteria linear programming. Journal of Theoretical Biology, 269, 174–180.
Corne, David, Dhaenens, Clarisse, & Jourdan, Laetitia (2012). Synergies between operations research and data mining: The emerging use of multi-objective approaches. European Journal of Operational Research, 221, 469–479.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Demsar, Janez (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Fawcett, Tom (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.
Freed, N., & Glover, F. (1981). Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research, 7, 44–60.
Gestel, T. V., Baesens, B., Garcia, J., & Dijcke, P. V. (2003). A support vector machine approach to credit scoring. Bank en Financiewezen, 2, 73–82.
Glover, F. (1990). Improved linear programming models for discriminant analysis. Decision Sciences, 21, 771–785.
Grablowsky, B. J., & Talley, W. K. (1981). Probit and discriminant functions for classifying credit applicants: A comparison. Journal of Economics and Business, 33, 254–261.
Hamel, Lutz (2009). Knowledge discovery with support vector machines. New Jersey: John Wiley & Sons.
Hand, David J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77, 103–123.
He, J., Liu, X., Shi, Y., Xu, W., & Yan, N. (2004). Classifications of credit cardholder behavior by using fuzzy linear programming. International Journal of Information Technology and Decision Making, 3(4), 633–650.


Henley, W. E., & Hand, D. J. (1996). Nearest neighbor analysis in credit scoring. The Statistician, 45(1), 77–95.
Huang, Cheng-Lung, Chen, Mu-Chen, & Wang, Chieh-Jen (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications: An International Journal, 33, 847–856.
Jensen, Herbert L. (1992). Using neural networks for credit scoring. Managerial Finance, 18(6), 15–26.
Jiang, X., Yi, Z., & Jian, C. (2006). Fuzzy SVM with a new fuzzy membership function. Neural Computing & Applications, 15(3–4), 268–276.
Bastos, Joao (2008). Credit scoring with boosted decision trees (Paper No. 8034).
Koknar-Tezel, S., & Latecki, L. J. (2009). Improving SVM classification on imbalanced data sets in distance space. In ICDM 2009 (pp. 259–267).
Koknar-Tezel, S., & Latecki, L. J. (2010). Improving SVM classification on imbalanced time series data sets with ghost points. Knowledge and Information Systems. http://dx.doi.org/10.1007/s10115-010-0310-3.
Kuncheva, Ludmila I. (2004). Combining pattern classifiers: Methods and algorithms. New Jersey: John Wiley & Sons.
Lando, David (2004). Credit risk modeling: Theory and applications. Princeton University Press.
Li, A., Shi, Y., & He, J. (2008). MCLP-based methods for improving ''bad'' catching rate in credit cardholder behavior analysis. Applied Soft Computing, 8(3), 1259–1265.
Lin, Chun-Fu, & Wang, Sheng-De (2002). Fuzzy support vector machines. IEEE Transactions on Neural Networks, 13(2), 464–471.
Lin, Chun-Fu, & Wang, Sheng-De (2004). Training fuzzy support vector machines with noisy data. Pattern Recognition Letters, 25, 1647–1656.
Martens, David, Baesens, Bart, Van Gestel, Tony, & Vanthienen, Jan (2007). Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183, 1466–1476.
Meisel, Stephan, & Mattfeld, Dirk (2010). Synergies of operations research and data mining. European Journal of Operational Research, 206, 1–10.
Olafsson, Sigurdur, Li, Xiaonan, & Wu, Shuning (2008). Operations research and data mining. European Journal of Operational Research, 187, 1429–1448.
Ong, Chorng-Shyong, Huang, Jih-Jeng, & Tzeng, Gwo-Hshiung (2005). Building credit scoring models using genetic programming. Expert Systems with Applications: An International Journal, 29(1), 41–47.
Pavlenko, T., & Chernyak, O. (2010). Credit risk modeling using Bayesian networks. International Journal of Intelligent Systems, 25(4), 326–344.
Peng, Y., Kou, G., Shi, Y., & Chen, Z. (2008). Multi-criteria convex quadratic programming model for credit data analysis. Decision Support Systems, 44, 1016–1030.
Schebesch, Klaus, & Stecking, Ralf (2005). Support vector machines for credit scoring: Extension to non standard cases. In Innovations in classification, data science, and information systems, studies in classification, data analysis, and knowledge organization, Part VI (pp. 498–505).
Shi, Y. (2010). Multiple criteria optimization based data mining methods and applications: A systematic survey. Knowledge and Information Systems, 24(3), 369–391.
Shi, Y., Peng, Y., Xu, W., & Tang, X. (2002). Data mining via multiple criteria linear programming: Applications in credit card portfolio management. International Journal of Information Technology and Decision Making, 1, 131–151.
Shi, Y., Wise, M., Luo, M., & Lin, Y. (2001). Data mining in credit card portfolio management: A multiple criteria decision making approach. In M. Koksalan & S. Zionts (Eds.), Advances in multiple criteria decision making in the new millennium (pp. 427–436). Berlin: Springer.
Smola, A., Bartlett, P., Scholkopf, B., & Schuurmans, D. (1999). Advances in large margin classifiers. Cambridge, MA: MIT Press, pp. 158–180.
Inoue, Takuya, & Abe, Shigeo (2001). Fuzzy support vector machines for pattern classification. In Proceedings of the international joint conference on neural networks (pp. 1449–1454). Washington, DC.
Tang, Yuchun, Zhang, Yan-Qing, & Chawla, N. V. (2002). SVMs modeling for highly imbalanced classification. Journal of LaTeX Class Files, 1(11), 1–9.
Thomas, L. C., Crook, J., & Edelman, D. (2002). Credit scoring and its applications. Society for Industrial Mathematics.
Tovar, J. C., & Yu, Wen (2008). On-line modeling via fuzzy support vector machines. In MICAI 2008, LNAI 5317 (pp. 220–229). Berlin Heidelberg: Springer.
Tsujinishi, Daisuke, & Abe, Shigeo (2003). Fuzzy least squares support vector machines for multiclass problems. Neural Networks, 16(5–6), 785–792.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley & Sons.
West, David (2000). Neural network credit scoring models. Computers & Operations Research, 27, 1131–1152.
Wiginton, J. C. (1980). A note on the comparison of logit and discriminant models of consumer credit behaviour. Journal of Financial and Quantitative Analysis, 15, 757–770.
Yang, Chan-Yun, Wang, Jian-Jun, Yang, Jr-Syu, & Yu, Guo-Ding (2008). Imbalanced SVM learning with margin compensation. In ISNN 2008, Part I, LNCS 5263 (pp. 636–644).
Yang, Chan-Yun, Yang, Jr-Syu, & Wang, Jian-Jun (2009). Margin calibration in SVM class-imbalanced learning. Neurocomputing, 73(1–3), 397–411.
Zeng, Zhi-Qiang, & Gao, Ji (2009). Improving SVM classification with imbalance data set. Neural Information Processing, Lecture Notes in Computer Science, 5863, 389–398.
Zhang, Z., Shi, Y., & Gao, G. (2009). A rough set-based multiple criteria linear programming approach for the medical diagnosis and prognosis. Expert Systems with Applications: An International Journal, 36(5), 8932–8937.

Please cite this article in press as: Zhang, Z., et al. Credit risk evaluation using multi-criteria optimization classifier with kernel, fuzzification and penalty factors. European Journal of Operational Research (2014), http://dx.doi.org/10.1016/j.ejor.2014.01.044



Zhang, Z., Shi, Y., & Tian, Y. (2009). An effective classification approach based on fuzzy set and multiple criteria linear programming. Journal of Data Analysis, Chung-hwa DATA MINING Society, 4(2), 105–122.
Zhang, Zhan, Zhang, D., & Tian, Y. (2010). Kernel-based multiple criteria linear programming classifier. Procedia Computer Science, 1, 2401–2409.

Zhang, Defu, Zhou, Xiyue, Leung, S. C. H., & Zheng, Jiemin (2010). Vertical bagging decision trees model for credit scoring. Expert Systems with Applications: An International Journal, 37(12), 7838–7843.
Zhang, P., Zhu, X., Zhang, Z., & Shi, Y. (2010). Multiple criteria programming models for VIP E-Mail behavior analysis. Web Intelligence and Agent Systems, 8(1), 69–78.
