Evolution strategy based adaptive Lq penalty support vector machines with Gauss kernel for credit risk analysis

Applied Soft Computing 12 (2012) 2675–2682
Jianping Li (a,∗), Gang Li (a,b), Dongxia Sun (a,b), Cheng-Few Lee (c)

a Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, PR China
b Graduate University of Chinese Academy of Sciences, Beijing 100049, PR China
c Department of Finance and Economics, Rutgers Business School, Rutgers University, Piscataway, NJ 08854, USA

Article history: Received 5 July 2011; received in revised form 20 February 2012; accepted 10 April 2012; available online 25 April 2012.

Keywords: Adaptive penalty; Support vector machine; Credit risk classification; Evolution strategy

Abstract

Credit risk analysis has long attracted great attention from both academic researchers and practitioners. However, the recent global financial crisis has made the issue even more important because of the need to further enhance the accuracy of borrower classification. In this study, an evolution strategy (ES) based adaptive Lq SVM model with Gauss kernel (ES-ALq G-SVM) is proposed for credit risk analysis. The support vector machine (SVM) is a classification method that has been extensively studied in recent years, and many improved SVM models with non-adaptive, pre-determined penalties have been proposed. However, in real life, different credit datasets have different structures that suit different penalty forms. Moreover, traditional parameter search methods, such as grid search, are time-consuming. The proposed ES-based adaptive Lq SVM model with Gauss kernel (ES-ALq G-SVM) aims to solve these problems: the non-adaptive penalty is extended to q ∈ (0, 2] to fit different credit data structures, and the Gauss kernel is used to improve classification accuracy. For verification purposes, two UCI credit datasets and a real-life credit dataset are used to test our model. The experimental results show that the proposed approach performs better than See5, DT, MCCQP, SVM light and the other popular algorithms listed in this study, and that its computing speed is greatly improved compared with the grid search method.

1. Introduction

Ever since the subprime mortgage crisis hit the United States and spread to the whole world, deepening the global economic crisis, credit risk has received increasing attention from both industry and academia. Numerous methodologies have been introduced to address the challenge and importance of credit risk analysis, risk evaluation, and risk management; statistical and neural network based approaches are among the most popular paradigms [1–5]. Support vector machines (SVMs), first proposed by Vapnik [6,7], have been used for credit risk analysis in the last few decades as a promising approach. Owing to their good generalization performance and strong theoretical foundations, they have been widely used in credit evaluation [8,9] and in other fields such as pattern classification, bioinformatics, and text categorization [10–12].

∗ Corresponding author. Tel.: +86 105935 8805. E-mail addresses: [email protected] (J. Li), [email protected] (G. Li), [email protected] (D. Sun), [email protected] (C.-F. Lee). doi: http://dx.doi.org/10.1016/j.asoc.2012.04.011

Although their strong theoretical foundations have been established [6,7], standard SVMs still have several drawbacks. For linearly separable problems, the SVM attempts to optimize the generalization performance bound by separating the data with a maximal margin classifier, which can be transformed into a constrained quadratic optimization problem. However, real-life data are complicated, and most original credit data are not linearly separable; therefore, the input space needs to be mapped into a higher dimensional feature space to make the data linearly separable [6,7,13]. As increasing attention is paid to learning kernel functions from data structures, kernels have been introduced into the SVM as nonlinear transformations to improve classification accuracy [13], and many researchers have observed that kernels greatly improve SVM performance [14–16]. Computational complexity is another problem in credit risk analysis, because solving the constrained quadratic programming problem and selecting the hyper-parameters and kernels for classification cost much time. Many modified versions of SVM have been proposed to deal with this drawback, such as LS-SVM and improved LS-SVM models [17–21]. However, these SVM and LS-SVM models fix the penalty form (L2) and cannot choose the optimal penalty adaptively according to the dataset structure. If the dataset is sparse, an SVM with an L2 penalty cannot delete the noisy data, which reduces the classification accuracy.


As the L1 penalty is less sensitive to outliers, many researchers have chosen it to improve generalization performance, and have achieved better results on datasets with redundant samples or outliers [22,23]; but an L1 penalty SVM is not preferred if the dataset has a non-sparse structure. An adaptive Lq SVM was proposed by Liu et al., which selects the optimal penalty adaptively, driven by the data [24]. However, that model adopted the linear kernel; consequently, its classification performance was not outstanding on nonlinearly separable credit datasets. Thus, a kernel is considered in our study to improve the performance of this model for credit classification.

In addition, parameter optimization also needs to be considered in SVMs. Credit datasets are massive; computation time increases as the size of the training data increases, and the performance of SVM deteriorates. Thus, new methods need to be introduced to reduce the computational complexity of credit analysis. The grid search (GS) method is the most traditional and reliable optimization method for the parameter selection problem. To select optimal SVM parameters, GS varies the SVM parameters through a wide range of values with a fixed step size, assesses the performance of every parameter combination based on a certain criterion, and finds the best setting in the end. However, the grid size increases dramatically as the number of parameters increases; GS is an exhaustive search method and very time-consuming. Therefore, a new method needs to be introduced to control computational costs.

Evolutionary algorithms (EA) are iterative, direct, randomized optimization methods inspired by the principles of neo-Darwinian evolution theory. EA have higher computing efficiency than grid search and can optimize parameters of non-differentiable functions; they have been proved suitable for hyper-parameter and feature selection for kernel-based learning algorithms [25,26]. Evolution strategies (ES), a branch of EA, were proposed by students at the Technical University of Berlin [27,28]; they do not need coding and encoding processes and have good computing efficiency [29,30]. In particular, ES have a self-adaptive search capacity which quickly guides the search toward optimal points, and some researchers have demonstrated that ES are suited to finding the optimal hyper-parameters for SVMs and other multi-criteria programming models [31–34]. In this paper, we introduce an ES-based search method to find the optimal penalty and hyper-parameters simultaneously, instead of a grid-based search method, to save computing time.

In brief, in order to improve the performance of credit classification and control computational costs, this paper proposes an ES-based ALq G-SVM approach. First, the Gauss kernel is applied to the adaptive Lq penalty support vector machine to better solve the nonlinear credit data problem. Second, an evolution strategy is used to select optimal parameters instead of grid search, in order to improve operational efficiency.

The remainder of this paper is organized as follows. Section 2 reviews the relevant literature on SVM and the adaptive Lq SVM. Section 3 introduces the proposed ES-based ALq G-SVM approach for improving classification accuracy and computing speed. Experimental results, compared with other popular models, are shown in Section 4. Conclusions are drawn in Section 5, with recommendations for future research.

2. Adaptive Lq SVM

2.1. Support vector machine

The general form of the support vector machine is used to solve binary classification problems. Given a set of credit data points $G = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in R^m$ is the $i$th input record and $y_i \in \{1, -1\}$ is the corresponding observed result, the main goal of SVM is to find an optimal separating hyper-plane, which can be represented as $\langle \omega, x \rangle + b = 0$, that maximizes the separation margin (the distance between the hyper-plane and the nearest data point of each class) and minimizes the empirical classification error. Thus, the problem of seeking the optimal separating hyper-plane can be transformed into the following optimization problem:

$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\ \ y_i(\omega^T \phi(x_i) + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n, \tag{1}$$

where the variables $\xi_i$ are non-negative slack variables representing the classification error, and the regularization parameter $C$ is a constant denoting the trade-off between the maximum margin and the minimum empirical risk. The above quadratic optimization problem can be solved by transforming it into the Lagrange function:

$$L(\omega, b, \xi, \alpha) = \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i(\omega^T \phi(x_i) + b) - 1 + \xi_i\right) - \sum_{i=1}^{n}\gamma_i \xi_i, \tag{2}$$

where $\alpha_i$ and $\gamma_i$ denote the Lagrange multipliers, with $\alpha_i \ge 0$ and $\gamma_i \ge 0$. At the optimal point, we have the saddle point equations $\partial L/\partial \omega = 0$, $\partial L/\partial b = 0$ and $\partial L/\partial \xi_i = 0$, which translate into:

$$\omega = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i), \qquad \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \alpha_i = C - \gamma_i,\ \ i = 1, \ldots, n. \tag{3}$$

By substituting (3) into (2), and by replacing $(\phi(x_i) \cdot \phi(x_j))$ with kernel functions $k(x_i, x_j)$, we get the dual optimization problem:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.}\ \ \sum_{i=1}^{n}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n. \tag{4}$$
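As an illustration of how (4) can be attacked numerically, the following minimal sketch hands the dual to a general-purpose constrained optimizer (SciPy's SLSQP). The function name solve_svm_dual and the choice of solver are our own, not the paper's; dedicated SVM solvers such as SMO are far more efficient in practice.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, y, C):
    """Solve the dual problem (4) given a precomputed kernel matrix K (n x n),
    labels y in {-1, +1} (float array) and the trade-off constant C."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j k(x_i, x_j)

    def neg_dual(a):                             # negate: SciPy minimizes
        return 0.5 * a @ Q @ a - a.sum()

    def grad(a):
        return Q @ a - np.ones(n)

    res = minimize(neg_dual, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,        # 0 <= alpha_i <= C
                   constraints=[{"type": "eq",   # sum_i alpha_i y_i = 0
                                 "fun": lambda a: a @ y}])
    return res.x                                 # the multipliers alpha
```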

Commonly used kernel functions include the polynomial, radial basis function (RBF) and sigmoid kernels, shown in (5)–(7), respectively:

$$k(x, y) = (x \cdot y + c)^d, \quad d \in \mathbb{N},\ c \ge 0, \tag{5}$$

$$k(x, y) = \exp(-\gamma \|x - y\|^2), \quad \gamma > 0, \tag{6}$$

$$k(x, y) = \tanh(\kappa (x \cdot y) + c), \quad \kappa > 0,\ c \ge 0. \tag{7}$$
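The three kernels can be written down directly; in this sketch the parameter names gamma and kappa stand in for the spread parameters of (6) and (7), whose symbols vary between references.

```python
import numpy as np

def polynomial_kernel(x, y, d=3, c=1.0):
    """Polynomial kernel (5): (x . y + c)^d with d a positive integer, c >= 0."""
    return (np.dot(x, y) + c) ** d

def gauss_kernel(x, y, gamma=0.5):
    """RBF/Gauss kernel (6): exp(-gamma * ||x - y||^2) with gamma > 0."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, kappa=0.5, c=0.0):
    """Sigmoid kernel (7): tanh(kappa * (x . y) + c) with kappa > 0, c >= 0."""
    return np.tanh(kappa * np.dot(x, y) + c)
```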

On solving the dual optimization problem, the solution $\alpha_i$ determines the parameters of the optimal hyper-plane. This leads to the decision function, expressed as:

$$f(x) = \text{sgn}\left(\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b\right), \tag{8}$$

where $b = y_j - \sum_{i=1}^{n} y_i \alpha_i k(x_i, x_j)$ for some $j$ with $0 < \alpha_j < C$. Usually, only a small subset of the Lagrange multipliers $\alpha_i$ is greater than zero in a classification problem. The training vectors with nonzero $\alpha_i^*$ are called support vectors, and the optimal decision hyper-plane depends on them exclusively:

$$f(x) = \text{sgn}\left(\sum_{i \in SV} y_i \alpha_i^* k(x_i, x) + b^*\right). \tag{9}$$
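A sketch of evaluating (9): only the training points with nonzero multipliers contribute, so prediction touches the support vectors alone. Names are ours; alpha and b are assumed to come from a dual solver such as the one above.

```python
import numpy as np

def svm_predict(x, X, y, alpha, b, kernel, tol=1e-8):
    """Decision function (9): sign of the support-vector expansion at x.
    X, y, alpha are numpy arrays over the training set; `kernel` is a
    two-argument kernel function such as gauss_kernel above."""
    sv = alpha > tol                 # support vectors: alpha_i > 0
    score = sum(a * t * kernel(xi, x)
                for a, t, xi in zip(alpha[sv], y[sv], X[sv]))
    return np.sign(score + b)
```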


2.2. Adaptive Lq penalty SVM

The SVM model (1) or (4) and many other improved SVMs use fixed penalty forms, but different data structures are suitable for different types of penalty. Thus, Liu et al. proposed an adaptive Lq penalty SVM to find the best penalty q for different data structures [24]. Under the general regularization framework, the standard SVM for the binary classification problem can be rewritten as follows:

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} l(f(x_i), y_i) + \lambda \|f\|_2^2, \tag{10}$$

where $l(f(x_i), y_i) = [1 - y_i f(x_i)]_+$ is the convex hinge loss; $\|f\|_2^2$, the L2 penalty of $f$, is a regularization term serving as the roughness penalty of $f$; and $\lambda > 0$ is the tuning parameter which controls the trade-off between the fitness of the data measured by $l$ and the complexity of $f$ in terms of $\|f\|_2^2$. Based on the extended SVM model (10), the adaptive Lq penalty SVM can be briefly described as:

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} c(-y_i)[1 - y_i f(x_i)]_+ + \lambda \|f\|_q^q, \tag{11}$$

where $f(x) = \sum_{d=1}^{m} \beta_d B_d(x)$ is the decision function, and $c(+1)$ and $c(-1)$ are the costs for false positives and false negatives, respectively. Different from the standard binary SVM, there are two parameters, $\lambda$ and $q$, to be tuned. Parameter $\lambda$, playing the same role as in the non-adaptive SVM, controls the trade-off between minimizing the hinge loss and the penalty on $f$. Parameter $q$ determines the penalty function on $f$. Here, $q \in (0, 2]$ is considered a tuning parameter, and it can be chosen adaptively by the data, together with $\lambda$. The best $q$ depends on the nature of the data structure. If there are many noisy input attributes, an Lq penalty with $q \le 1$ is chosen, since it automatically selects important attributes and removes noisy ones; otherwise, if all the attributes are important, $q > 1$ is used to avoid deleting any attributes.

Like the standard SVM, $f(x)$ is the linear decision function. For binary classification, the decision function is denoted as $f_k(x) = \omega_k^T x + b_k$, where $\omega_k = (\omega_{k1}, \ldots, \omega_{km})^T$ and $k = 1, 2$. In order to obtain a unique solution, the coefficients are subject to $b_1 + b_2 = 0$ and $\omega_1 + \omega_2 = 0$. Then the optimization problem becomes:

$$\min_{\{(\omega_k, b_k)\}_{k=1,2}}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{2} c(k \ne y_i)\,[\omega_k^T x_i + b_k + 1]_+ + \lambda \sum_{k=1}^{2}\sum_{d=1}^{m} |\omega_{kd}|^q \tag{12}$$

$$\text{s.t.}\ \ b_1 + b_2 = 0, \quad \omega_{1d} + \omega_{2d} = 0,\ \ d = 1, 2, \ldots, m.$$

The final decision rule for classifying $x$ is $\arg\max_k [f_1(x), f_2(x)]$.
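To make the tuning problem concrete, a minimal sketch of the objective (11) for a linear decision function follows; the names and the dense-matrix evaluation are ours, and the Lq term is what drives attribute selection for q <= 1.

```python
import numpy as np

def adaptive_lq_objective(w, b, X, y, lam, q, c_pos=1.0, c_neg=1.0):
    """Objective (11) for f(x) = w'x + b: cost-weighted hinge loss plus
    the Lq penalty lam * sum_d |w_d|^q with q in (0, 2].
    c_pos and c_neg are the costs c(+1) and c(-1) for false positives
    and false negatives."""
    f = X @ w + b
    hinge = np.maximum(0.0, 1.0 - y * f)       # [1 - y_i f(x_i)]_+
    cost = np.where(y == 1, c_neg, c_pos)      # c(-y_i)
    return np.mean(cost * hinge) + lam * np.sum(np.abs(w) ** q)
```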

3. ES based ALq G-SVM

3.1. Adaptive Lq SVM with Gauss kernel

When q = 2, the optimization problem (12) can be solved by quadratic programming (QP); when q = 1, (12) reduces to linear programming (LP), and many standard software packages are available to solve both. Except for these two special cases, problem (12) is essentially a nonlinear programming (NLP) problem, which is not easy to solve. Another problem is that when q < 1, the function $\|f\|_q^q$ is not convex in $\omega$, so standard optimization routines may fail. The LQA algorithm was proposed to minimize (12) via iterative quadratic optimization [24]. For fixed parameters $(\lambda, q)$, the LQA algorithm for solving (12) can be summarized in the following three steps:

Step 1: Set $l = 1$ and choose the initial value $\eta^{(1)}$;
Step 2: Let $\eta_0 = \eta^{(l)}$. Minimize $F(\eta) = \eta^T Q \eta + \eta^T L$ to obtain $\eta^{(l+1)}$;
Step 3: Set $l = l + 1$ and go to Step 2, until convergence.

Here $\eta$ denotes the parameter vector, and $F(\eta)$ is the first-order Taylor approximation of the objective function of model (12), with $Q = Q_{A1} + Q_{B1} + Q_{C1} + Q_{C2}$ and $L = L_{A1} + L_{A2} + L_{B1} + L_{B2}$. More details of the LQA algorithm are given in Ref. [24].

Generally, the training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by a function $\Phi(\cdot)$, and the SVM finds a linear separating hyper-plane with the maximal margin in this higher dimensional space. Among the kernels proposed by researchers, the Gauss kernel, shown in (13), is the one most often used in general cases and performs better than others [35]. Thus, the Gauss kernel is introduced into the adaptive Lq penalty SVM model to improve classification performance [36]:

$$k(x, y) = \exp\left(\frac{-\|x - y\|^2}{\sigma^2}\right). \tag{13}$$

In the LQA algorithm, the Gauss kernel is introduced into $Q_{A1}$ and $Q_{B1}$ in Step 2, which are defined as:

$$Q_{A1} = \frac{1}{4n}\sum_{i=1}^{n}\frac{c_{i1}}{|\eta_0^T \tilde{x}_i + 1|}\,\tilde{x}_i \tilde{x}_i^T = \frac{1}{4n}\sum_{i=1}^{n}\hat{x}_i \hat{x}_i^T = \text{Ker}\,Q_{A1}^{m \times m}, \tag{14}$$

where $\hat{x}_i = (c_{i1}/|\eta_0^T \tilde{x}_i + 1|)^{0.5}\,\tilde{x}_i$, and $\text{Ker}\,Q_{A1}^{m \times m}(i, j)$ is the Gauss kernel of columns $i$ and $j$ of the matrix whose rows are $(\hat{x}_i^T)_{i=1}^{n}$.

$$Q_{B1} = \frac{1}{4n}\sum_{i=1}^{n}\frac{c_{i2}}{|\eta_0^T \tilde{x}_i - 1|}\,\tilde{x}_i \tilde{x}_i^T = \frac{1}{4n}\sum_{i=1}^{n}\hat{x}_i \hat{x}_i^T = \text{Ker}\,Q_{B1}^{m \times m}, \tag{15}$$

where $\hat{x}_i = (c_{i2}/|\eta_0^T \tilde{x}_i - 1|)^{0.5}\,\tilde{x}_i$, and $\text{Ker}\,Q_{B1}^{m \times m}(i, j)$ is the Gauss kernel of columns $i$ and $j$ of the matrix whose rows are $(\hat{x}_i^T)_{i=1}^{n}$.
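The LQA loop itself is short once Q and L are available; the sketch below treats them as callables supplied by the caller, standing in for the paper's Q = Q_A1 + Q_B1 + Q_C1 + Q_C2 and L = L_A1 + L_A2 + L_B1 + L_B2 (see Ref. [24] for their exact construction).

```python
import numpy as np

def lqa_minimize(eta0, build_Q, build_L, max_iter=100, tol=1e-6):
    """Steps 1-3 of the LQA algorithm: repeatedly rebuild the local quadratic
    approximation F(eta) = eta'Q eta + eta'L at the current iterate and jump
    to its minimizer until the iterates converge."""
    eta = np.asarray(eta0, dtype=float)
    for _ in range(max_iter):
        Q, L = build_Q(eta), build_L(eta)
        # F is minimized where its gradient vanishes: 2 Q eta + L = 0
        # (assumes Q is symmetric positive definite at the current point)
        eta_next = np.linalg.solve(2.0 * Q, -L)
        if np.linalg.norm(eta_next - eta) < tol:
            return eta_next
        eta = eta_next
    return eta
```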

3.2. ES based ALq G-SVM

The grid search is an exhaustive search method that examines the search space entirely. Each dimension of the grid represents one variable to be optimized, so the grid search requires specifying, for each dimension, its maximum and minimum values; moreover, the number of divisions along each dimension, which determines how fine the grid is, must be given.

Some researchers have found that optimal hyper-parameter points do not exist uniquely, as isolated points; usually there are further optimal hyper-parameter points in the neighborhood of any one of them, forming a neighborhood in which the values of the evaluation function are equal and optimal [37,38]. Thus, evolution strategies can be introduced to find these optimal hyper-parameter points. As one of the important classes of evolutionary algorithms (EA), evolution strategies (ES) are often used as optimization tools due to their self-adaptive search capacity, which quickly guides the search toward optimal points. The ES algorithm has three basic operations: mutation, recombination and selection, and an ES run consists of several iterations of the basic evolution cycle shown in Fig. 1.

Fig. 1. ES basic evolution cycle.

3.2.1. Mutation

The main variation operators in evolutionary algorithms are mutations.


The mechanism of ES mutation is implemented by making perturbations, namely by adding random numbers that follow a Gaussian distribution to the variable vector (chromosome). Let $(p_1, \ldots, p_n)$ be a chromosome involving $n$ variables. Then the mutation operator is defined as:

$$p_t^{g+1} = p_t^g + N(0, \varepsilon_t^{g+1}), \tag{16}$$


where the strategy parameters $\varepsilon_t$ are the step sizes of the mutation, and $g$ is the number of generations. The $\varepsilon_t$ also evolve in their own way:

$$\varepsilon_t^{g+1} = \varepsilon_t^g \exp(\tau\, N(0, 1)), \tag{17}$$


where $\tau$ is the learning rate controlling the self-adaptation speed of $\varepsilon_t$.
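Equations (16) and (17) translate almost line for line into code; in this sketch the step sizes are treated as standard deviations of the Gaussian perturbation, and tau is the learning rate just introduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def es_mutate(p, eps, tau=0.3):
    """One ES mutation: self-adapt the step sizes per (17), then perturb
    the chromosome per (16). p and eps are same-length numpy arrays."""
    eps_new = eps * np.exp(tau * rng.standard_normal(eps.size))   # (17)
    p_new = p + eps_new * rng.standard_normal(p.size)             # (16)
    return p_new, eps_new
```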


3.2.2. Recombination

Mutation performs a search step based on the information of only one parent, while recombination shares the information of several parents and reproduces one offspring from $\rho$ parents. ES has two versions of the recombination technique: discrete recombination and intermediate recombination. Discrete recombination randomly selects every component of the offspring from the corresponding components of the parents. In contrast, intermediate recombination gives an equal right to all parents for reproduction: the offspring takes the average of the parent vectors as its value. When the variables are discrete, a rounding operation is needed.

3.2.3. Selection

The selection operation is necessary to direct the search to the promising range of the object parameter space and thus to drive the optimization. There are two versions of this technique: comma selection, denoted by $(\mu, \lambda)$, and plus selection, denoted by $(\mu + \lambda)$. In comma selection, only the $\lambda$ offspring individuals can be selected, while parental individuals are excluded from the selection set; this forgets the information of the parent generation and thus helps avoid premature convergence to local optima. Plus selection, by contrast, chooses from a selection set comprising both the parents and the offspring, ensuring that the best individuals survive and are preserved. Moreover, the fitness function, the criterion for evaluating search results, is also very important to the ES: a chromosome with a high fitness value has a high probability of being preserved to the next generation, so in this study we use the test's overall hit rate as the fitness measure.
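The two recombination variants and plus selection are small enough to sketch directly over a (rho x n) array of parents; all names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_recombination(parents):
    """Each offspring component is copied from a randomly chosen parent."""
    parents = np.asarray(parents)
    rows = rng.integers(0, parents.shape[0], size=parents.shape[1])
    return parents[rows, np.arange(parents.shape[1])]

def intermediate_recombination(parents):
    """All parents get an equal right: the offspring is the mean vector."""
    return np.asarray(parents).mean(axis=0)

def plus_selection(pop, fit, offspring, off_fit, mu):
    """(mu + lambda) selection: the best mu of parents and offspring survive."""
    all_pop = np.vstack([pop, offspring])
    all_fit = np.concatenate([fit, off_fit])
    best = np.argsort(all_fit)[::-1][:mu]        # higher fitness is better
    return all_pop[best], all_fit[best]
```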


3.3. Algorithm

The ES-based adaptive Lq SVM with Gauss kernel (ES-ALq G-SVM) algorithm is proposed as follows.

Table 1. Basic information of the three credit datasets.

                          ACD     GCD     AMCD
No. of total instances    690     1000    5000
No. of good instances     383     700     4185
No. of bad instances      307     300     815
No. of attributes         14      24      65
No. of classes            2       2       2

Table 2. Results of experiments with the ES-based ALq G-SVM on ACD.

Fold number   T1 (%)   T2 (%)   T (%)    Optimized (λ, σ², q)
1             95.45    88.46    91.67    (1.26, 0.85, 1.02)
2             86.36    86.54    86.46    (1515.35, 1.00, 2.00)
3             86.36    92.31    89.58    (11.38, 2.98, 2.00)
4             93.18    86.54    89.58    (1.00, 1.00, 2.00)
5             100.00   84.62    91.67    (1.00, 260.62, 2.00)
Average       92.27    87.69    89.79


Algorithm 1. ES-based adaptive Lq SVM with Gauss kernel (ES-ALq G-SVM)

Input: credit dataset G
Output: optimal penalty, hyper-parameters and classification results
Set:
  c(+1) – cost for false positives
  c(−1) – cost for false negatives
  g – number of generations
  ρ – number of parents
  μ – size of parent pool
  θ – number of offspring individuals
Initialize:
  regularization parameter pm(0)(1) ← λ
  penalty parameter pm(0)(2) ← q
  kernel parameter pm(0)(3) ← σ²

The whole process of the algorithm is shown in Fig. 2.

Fig. 2. Flow diagram of ES-ALq G-SVM.
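Since only the setup of Algorithm 1 survives in this copy, the following is a hedged sketch of how the surrounding ES loop can be organized over the three parameters (λ, q, σ²); cv_hit_rate, the bounds and the (μ + λ) scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def es_search(fitness, lo, hi, mu=5, lam=20, generations=30, tau=0.3):
    """(mu + lambda)-ES over a box [lo, hi]. `fitness` is assumed to map a
    parameter vector, e.g. (lambda_reg, q, sigma2), to the cross-validated
    overall hit rate of the ALqG-SVM trained with those values."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pop = rng.uniform(lo, hi, size=(mu, lo.size))          # parent pool
    eps = np.full((mu, lo.size), 0.1) * (hi - lo)          # step sizes
    fit = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        idx = rng.integers(0, mu, size=lam)                # choose parents
        c_eps = eps[idx] * np.exp(tau * rng.standard_normal((lam, lo.size)))
        child = pop[idx] + c_eps * rng.standard_normal((lam, lo.size))
        child = np.clip(child, lo, hi)                     # respect the box
        c_fit = np.array([fitness(c) for c in child])
        keep = np.argsort(np.concatenate([fit, c_fit]))[::-1][:mu]
        pop = np.vstack([pop, child])[keep]                # plus selection
        eps = np.vstack([eps, c_eps])[keep]
        fit = np.concatenate([fit, c_fit])[keep]
    return pop[0], fit[0]                                  # best individual
```

A call such as es_search(cv_hit_rate, lo=[2**-10, 0.1, 2**-10], hi=[2**10, 2.0, 2**10]), with cv_hit_rate a hypothetical wrapper around model training, would then cover the same parameter ranges as the grid search in Section 4.3.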

4. Experiment results

In our experiments, two widely used UCI credit datasets [39] are used to check the performance of our model: the Australian credit dataset (ACD) and the German credit dataset (GCD). In addition, we also test our model on a real-life credit dataset of a US commercial bank (AMCD). The proportion of bad instances in AMCD is relatively low; thus, AMCD is a typical unbalanced dataset and closer to the actual situation. These datasets are described in Table 1.

To guarantee valid predictions, each dataset was randomly partitioned into training sets and independent test sets via k-fold cross validation. The 5-fold cross validation method is used in our experiments when training SVM models for choosing the parameters for the binary-class datasets.

Table 3. Results of experiments with the ES-based ALq G-SVM on GCD.

Fold number   T1 (%)   T2 (%)   T (%)    Optimized (λ, σ², q)
1             75.00    90.00    79.41    (1.00, 1.00, 2.00)
2             70.83    95.00    77.94    (7.18, 17.01, 1.80)
3             70.83    85.00    75.00    (1.00, 1.00, 2.00)
4             65.63    85.00    71.32    (1.50, 1.62, 1.00)
5             64.58    82.50    69.85    (53.75, 1.00, 1.00)
Average       69.38    87.50    74.71

With 5-fold cross validation, a training dataset is further randomly divided into five subsets of equal size. Each of the five subsets is used to validate the model trained with the other four subsets. Therefore, for each dataset and each parameter combination, five SVM models are trained with five reduced, overlapping subsets, to obtain the corresponding average validation accuracy and avoid over-fitting.

Performance is measured by Type 1 accuracy (T1), Type 2 accuracy (T2) and total accuracy (T), which stand for the percentage of correctly classified good (healthy) samples, the percentage of correctly classified bad (ill) samples and the overall percentage of correctly classified samples, respectively:

$$T = \frac{TN + TP}{TN + FP + TP + FN}, \qquad T1 = \frac{TN}{TN + FP}, \qquad T2 = \frac{TP}{TP + FN},$$

where TN is the number of good credit samples that are correctly classified, TP is the number of bad instances correctly classified, FP is the number of good credit samples that are misclassified, and FN is the number of misclassified bad samples. Matlab 7.6 is used to perform all computations.
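The three measures reduce to a few lines; a small sketch (names ours) using the paper's convention that TN/TP count correctly classified good/bad samples:

```python
def credit_metrics(tn, fp, tp, fn):
    """Total, Type 1 and Type 2 accuracy from the four confusion counts."""
    t = (tn + tp) / (tn + fp + tp + fn)   # total accuracy T
    t1 = tn / (tn + fp)                   # Type 1: share of good samples caught
    t2 = tp / (tp + fn)                   # Type 2: share of bad samples caught
    return t, t1, t2
```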

4.1. Experiment results: UCI credit datasets

Results of experiments on these two UCI datasets with our model, by 5-fold cross validation, are shown in Tables 2 and 3. The results show that the best-fit penalty forms of different datasets are different; they vary from dataset to dataset. Due to the random division of the five training subsets of each dataset, the penalty forms and total accuracies of every UCI dataset are not robust, which means the data structures of these datasets are not robust.

We compare our model with 11 other major credit analysis models on the two UCI datasets, i.e. linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), decision-tree-based See5, logistic regression (LogR), decision tree (DT), k-nearest neighbor classifier (k-NN) with k = 10, MCCQP, SVM light, GA-based SVM, MK-MCP and CBA. Results of LDA, QDA, See5, LogR, DT, k-NN, MCCQP, SVM light and GA-based SVM are from [40,41]; MK-MCP results are from [42]; and CBA results are from [43]. Only total accuracies of CBA are cited, because Type 1 and Type 2 accuracies are not specified in [43]. Table 4 summarizes the results on the two UCI datasets, compared with these popular models with fixed penalty forms.

Table 4. Comparison of experiment results of other models on ACD and GCD.

               ACD                          GCD
Methods        T1 (%)   T2 (%)   T (%)     T1 (%)   T2 (%)   T (%)
LDA            80.68    92.18    85.80     72.57    71.33    72.20
QDA            91.38    66.12    80.14     69.33    66.57    67.40
See5           87.99    84.69    86.52     84.00    44.67    72.20
LogR           86.32    85.90    86.09     88.14    50.33    76.80
DT             80.13    87.21    84.06     77.43    48.67    68.80
k-NN           54.40    81.72    69.57     90.57    27.00    71.50
MCCQP          87.00    85.52    86.38     74.38    72.00    73.50
SVM light      18.03    90.65    44.83     77.00    42.00    66.50
GA-based SVM   84.72    92.18    88.10     89.60    76.62    85.60
MK-MCP         87.58    90.91    88.70     75.60    72.00    75.00
CBA            –        –        86.60     –        –        73.50
ES-ALq G-SVM   92.27    87.69    89.79     69.38    87.50    74.71

With respect to ACD, the ES-based ALq G-SVM performs best in total accuracy and Type 1 accuracy (89.79% and 92.27%, respectively); its Type 2 accuracy of 87.69% is also competitive. With respect to GCD, the results of the ES-based ALq G-SVM are also satisfactory: although its Type 1 accuracy is a little lower, its Type 2 accuracy of 87.50% is the best, and its total accuracy of 74.71% is better than those of the popular LDA, QDA, See5, DT, k-NN, MCCQP, SVM light and CBA. In general, the experiment results on ACD and GCD illustrate that the ES-based ALq G-SVM model is more competitive than the other popular models.

4.2. Experiment results on the real-life US commercial bank credit dataset

We compare our model with five other major credit analysis models on this real-life dataset, i.e. MCLP, decision tree (DT), neural network (NN), MCNP and MK-MCP. Results of MCLP, DT and NN are from [44], MCNP results are from [45], and MK-MCP results are from [42]. Results of experiments on the US credit dataset AMCD, compared with these popular models with fixed penalty forms, are listed in Table 5. The ES-based ALq G-SVM performs best overall among the six models, with the second-best Type 1 accuracy and the best total accuracy. The experiment results on AMCD thus also demonstrate that the ES-based ALq G-SVM is an outstanding credit risk analysis method.

Table 5. Comparison of experiment results of other models on AMCD.

Methods          T1 (%)   T2 (%)   T (%)
MCLP             75.51    40.61    69.82
Decision tree    52.09    82.70    57.08
Neural network   67.24    78.40    69.06
MCNP             50.97    82.82    56.16
MK-MCP           70.24    85.88    71.84
ES-ALq G-SVM     72.52    80.26    73.98

4.3. ES-based experimental results compared with GS

To compare the computing speed of the ES-based ALq G-SVM approach with that of the GS-based ALq G-SVM, the time consumed in completing the classification tasks needs to be compared. The CPU time is measured on a PC with an AMD Athlon(tm) 7750 processor (2.65 GHz) and 3.25 GB of main memory in a Windows Server XP environment. The CPU time denotes the time required for each algorithm to run a 5-fold cross validation within the training set, and the average CPU time is the average over ten training runs (5-fold cross validation repeated ten times). In the case of the GS-based ALq G-SVM, we use the Gauss kernel; the parameters are $\{\sigma^2, \lambda\} \in \{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$ and $q \in \{0.1, 0.2, \ldots, 2\}$, so the grid size is 8820. Table 6 lists the results when the grid search method and the proposed method are applied to the binary-class datasets with the 5-fold training process.
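As a quick check of the grid size quoted above, the hypothetical snippet below enumerates the stated parameter ranges: 21 values each for σ² and λ and 20 values for q give 21 × 21 × 20 = 8820 combinations.

```python
import itertools

sigma2_values = [2.0 ** e for e in range(-10, 11)]    # 2^-10 ... 2^10
lambda_values = [2.0 ** e for e in range(-10, 11)]
q_values = [round(0.1 * k, 1) for k in range(1, 21)]  # 0.1, 0.2, ..., 2.0
grid = list(itertools.product(sigma2_values, lambda_values, q_values))
assert len(grid) == 8820                              # matches the paper
```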

Table 6. Comparison of experiment results of ES-based models and GS-based models.

Methods                ACD      GCD       AMCD
ES based   T1 (%)      92.27    69.38     72.52
           T2 (%)      87.69    87.50     80.26
           T (%)       89.79    74.71     73.98
           Time (s)    318      588       41,019
GS based   T1 (%)      91.36    72.29     73.62
           T2 (%)      91.15    81.50     74.83
           T (%)       91.25    75.00     73.83
           Time (s)    31,733   104,647   465,693

With respect to total accuracy, the GS-based ALq G-SVM performs a little better on all three datasets, but the ES-based ALq G-SVM performs almost the same. Moreover, it is noteworthy that the ES-based ALq G-SVM speeds up the training process by a factor of roughly 10 to 180 compared with the GS-based ALq G-SVM. In general, the experiment results confirm that the ES-based ALq G-SVM achieves satisfactory performance while saving a great deal of time, so for real-world datasets the proposed method is an excellent choice.

5. Conclusions and future research

This study proposes an ES-based adaptive Lq penalty SVM with Gauss kernel (ES-based ALq G-SVM) for credit risk analysis and makes two major contributions. First, the model improves classification accuracy by introducing the Gauss kernel. Most real-life datasets are non-linear, so the original adaptive Lq penalty SVM has flaws in application. The Gauss kernel is employed to solve the non-linear classification problem: it maps the data into a high-dimensional space where they can be separated better. The experiments have affirmed the validity of the Gauss kernel. Second, the model lowers computational complexity by introducing evolution strategies. A traditional SVM uses grid search for parameter optimization; however, the step size of GS is fixed and the parameter space is huge, so using GS to select parameters one by one is very time-consuming. On the contrary, an evolution strategy is an efficient method for selecting optimal parameters: its step size is not fixed, and it exploits the smoothness around the optimal parameter area. The experiments have demonstrated that the ES-based SVM model can save a factor of 10 to 180 in time compared with the GS-based SVM. In other words, the model is verified to drastically reduce computation time while obtaining good accuracy.

Some questions remain unsolved. The experiment results of this study, obtained from two UCI credit datasets and a real-life US commercial bank credit dataset, show that the proposed ES-based ALq G-SVM performs well in credit classification; however, many more datasets need to be tested in the future to verify and extend this approach. In addition, several other evolutionary algorithms could be chosen for improvement, and how to make the model interpretable for credit decision making needs to be researched in the future.

Acknowledgement

This research is supported by the National Science Foundation of China (NSFC) under Grants Nos. 71071148, 70701033, and 70531040.

References

[1] H. Abdou, J. Pointon, A. El-Masry, Neural nets versus conventional techniques in credit scoring in Egyptian banking, Expert Systems with Applications 35 (2008) 1275–1292.
[2] E. Angelini, G. di Tollo, A. Roli, A neural network approach for credit risk evaluation, The Quarterly Review of Economics and Finance 48 (2008) 733–755.
[3] S.L. Lin, A new two-stage hybrid approach of credit risk in banking industry, Expert Systems with Applications 36 (2009) 8333–8341.


[4] A. Laha, Building contextual classifiers by integrating fuzzy rule based classification technique and k-nn method for credit scoring, Advanced Engineering Informatics 21 (2007) 281–291.
[5] L. Yu, S.Y. Wang, K.K. Lai, Credit risk assessment with a multistage neural network ensemble learning approach, Expert Systems with Applications 34 (2008) 1434–1444.
[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[7] V.N. Vapnik, Statistical Learning Theory, Addison-Wiley, Reading, MA, 1998.
[8] W.M. Chen, C.Q. Ma, L. Ma, Mining the customer credit using hybrid support vector machine technique, Expert Systems with Applications 36 (2009) 7611–7616.
[9] J.P. Li, J.L. Liu, W.X. Xu, Y. Shi, Support vector machines approach to credit assessment, Lecture Notes in Computer Science 3039 (2004) 892–899.
[10] M. Arun Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Systems with Applications 36 (2009) 7535–7543.
[11] J. Diederich, A. Al-Ajmi, P. Yellowlees, Ex-ray: data mining and mental health, Applied Soft Computing 7 (2007) 923–928.
[12] V. Mitra, C.-J. Wang, S. Banerjee, Text classification: a least square support vector machine approach, Applied Soft Computing 7 (2007) 908–914.
[13] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks 12 (2) (2001) 181–201.
[14] V.D. Sánchez, Advanced support vector machines and kernel methods, Neurocomputing 55 (2003) 5–20.
[15] W.J. Wang, Z.B. Xu, W.Z. Lu, X.Y. Zhang, Determination of the spread parameter in the Gaussian kernel for classification and regression, Neurocomputing 55 (2003) 643–663.
[16] Z.Y. Chen, J.P. Li, L.W. Wei, W.X. Xu, Y. Shi, Multiple-kernel SVM based multiple-task oriented data mining system for gene expression data analysis, Expert Systems with Applications 38 (10) (2011) 12151–12159.
[17] J.A.K. Suykens, J. Wandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (1999) 293–300.
[18] R. Debnath, M. Muramatsu, H. Takahashi, An efficient support vector machine learning method with second-order cone programming for large-scale problems, Applied Intelligence 23 (3) (2005) 219–239.
[19] J.P. Li, Z.Y. Chen, L.W. Wei, W.X. Xu, G. Kou, Feature selection via least squares support feature machine, International Journal of Information Technology and Decision Making 6 (4) (2007) 671–686.
[20] Z.Y. Chen, J.P. Li, L.W. Wei, A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue, Artificial Intelligence in Medicine 41 (2) (2007) 161–175.
[21] J.L. Liu, J.P. Li, W.X. Xu, Y. Shi, A weighted Lq adaptive least squares support vector machine classifiers – robust and sparse approximation, Expert Systems with Applications 38 (3) (2011) 2253–2259.
[22] P.S. Bradley, O.L. Mangasarian, Massive data discrimination via linear support vector machines, Optimization Methods and Software 13 (1) (2000) 1–10.
[23] L.W. Wei, Z.Y. Chen, J.P. Li, W.X. Xu, Sparse and robust least squares support vector machine: a linear programming formulation, Proceedings of the IEEE International Conference on Grey Systems and Intelligent Services (2007) 54–58.
[24] Y.F. Liu, H.H. Zhang, C. Park, J. Ahn, Support vector machines with adaptive Lq penalty, Computational Statistics and Data Analysis 51 (2007) 6380–6394.
[25] H. Frohlich, O. Chapelle, B. Schölkopf, Feature selection for support vector machines using genetic algorithm, International Journal on Artificial Intelligence Tools 13 (2004) 791–800.
[26] E.I. Papageorgiou, P.P. Groumpos, A new hybrid method using evolutionary algorithms to train fuzzy cognitive maps, Applied Soft Computing 5 (2005) 409–431.
[27] I. Rechenberg, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart, Germany, 1973.
[28] H.P. Schwefel, Evolutionsstrategie und numerische Optimierung, Dissertation, TU Berlin, Germany, 1975.
[29] H.G. Beyer, H.P. Schwefel, Evolution strategies – a comprehensive introduction, Natural Computing 1 (1) (2002) 3–52.
[30] M. Papadrakakis, N.D. Lagaros, Soft computing methodologies for structural optimization, Applied Soft Computing 3 (2003) 283–300.
[31] R. Liu, E. Liu, J. Yang, M. Li, F.L. Wang, Optimizing the hyper-parameters for SVM by combining evolution strategies with a grid search, ICIC 2006, vol. 344, LNCIS, 2006, pp. 712–721.
[32] B. Mersch, T. Glasmachers, P. Meinicke, C. Igel, Evolutionary optimization of sequence kernels for detection of bacterial gene starts, International Journal of Neural Systems 17 (5) (2007) 369–381.
[33] L.W. Wei, Z.-Y. Chen, J.-P. Li, Evolution strategies based adaptive Lp LS-SVM, Information Sciences 181 (14) (2011) 3000–3016.
[34] J.P. Li, L.W. Wei, G. Li, W.X. Xu, An evolution-strategy based multiple kernels multi-criteria programming approach: the case of credit decision making, Decision Support Systems 51 (2) (2011) 292–298.
[35] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[36] D.X. Sun, J.P. Li, L.W. Wei, Credit risk evaluation using adaptive Lq penalty SVM with Gauss kernel, Journal of Southeast University 24 (2008) 33–36.
[37] R.M. Liu, E. Liu, J. Yang, M. Li, F.L. Wang, Optimizing the hyper-parameters for SVM by combining evolution strategies with a grid search, Lecture Notes in Control and Information Sciences 344 (2006) 712–721.


[38] S.R. Gunn, Support Vector Machines for Classification and Regression, Technical Report, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science, 1998.
[39] UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA. http://archive.ics.uci.edu/ml/datasets.html
[40] Y. Peng, G. Kou, Y. Shi, Z.X. Chen, A multi-criteria convex quadratic programming model for credit data analysis, Decision Support Systems 44 (2008) 1016–1030.
[41] J.P. Li, L.W. Wei, G. Li, W.X. Xu, An evolution strategy-based multiple kernels multi-criteria programming approach: the case of credit decision making, Decision Support Systems 51 (2011) 292–298.

[42] L.W. Wei, Research on Data Mining Classification Model based on the Multiple Criteria Programming and Its Application, Ph.D. Thesis, 2008.
[43] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Knowledge Discovery and Data Mining, New York, August 27–31, 1998.
[44] J. He, X.T. Liu, Y. Shi, W.X. Xu, N. Yan, Classifications of credit cardholder behavior by using fuzzy linear programming, International Journal of Information Technology and Decision Making 3 (4) (2004) 633–650.
[45] J. He, Y. Shi, W.X. Xu, Classification of credit cardholder behavior by using multiple criteria non-linear programming, Lecture Notes in Artificial Intelligence 3327 (2004) 154–163.