Relevance vector machine based infinite decision agent ensemble learning for credit risk analysis


Shukai Li a,*, Ivor W. Tsang a, Narendra S. Chaudhari b

a Centre for Computational Intelligence, Nanyang Technological University, Singapore
b Department of Computer Science and Engineering, Indian Institute of Technology Indore, India


Keywords: Credit risk analysis; Boosting; Relevance vector machine; Perceptron; Kernel

Abstract: In this paper, a relevance vector machine based infinite decision agent ensemble learning (RVMIdeal) system is proposed for robust credit risk analysis. In the first level of our model, we adopt soft margin boosting to overcome overfitting. In the second level, the RVM algorithm is revised for boosting so that different RVM agents can be generated from the updated instance space of the data. In the third level, the perceptron Kernel is employed in RVM to simulate infinite subagents. Our system RVMIdeal also shares several desirable properties, such as good generalization performance, immunity to overfitting and the ability to predict the distance to default. According to the experimental results, our proposed system achieves better performance in terms of sensitivity, specificity and overall accuracy.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Credit risk is the chance that money owed may not be repaid. There is little doubt that awareness of credit risk has continued to grow, accompanied by an increasing recognition across many sectors of the economy that credit risk needs to be actively managed (Servigny & Renault, 2004). People pay close attention to potential future losses on credit assets, such as changes in credit quality (including downgrades or upgrades in credit ratings), variations in credit spreads, and default events. The role of credit risk analysis is to assess and evaluate the potential credit risk of any customer or borrower, and to advise on decisions about granting credit or providing loans or borrowing facilities (Graham & Coyle, 2000). In other words, credit risk analysis is the method by which one calculates the creditworthiness of a person, business or organization. For many credit granting institutions, such as commercial banks and credit companies, the ability to discriminate non-default customers from default ones is crucial for the success of their business. Credit risk analysis has attracted much more attention from financial institutions because of the Asian financial crisis in 1997, the subprime mortgage crisis of 2007–2009, and Basel II (Lang, Mester, & Vermilyea, 2008), published in 2004. Furthermore, as competition for market share and profit becomes more and more intense, some financial institutions undertake greater risks to achieve competitive superiority in the market.

The accessibility of large databases and advances in statistical and machine learning methods for building efficient credit risk models have changed this area fundamentally in the last decades. The prediction of credit risk has also been widely studied after its practical uses, such as early warning signals for defaults by obligors, were recognized. These techniques are widely applied in corporate and personal credit risk analysis, where there is a need to predict the credit risk of a potential obligor before the debt is approved and extended. Besides that, financial institutions are driven by obligees to employ powerful credit risk models to assess the credit risk of debt. Hence, more accurate quantitative models are essential in order to perform more accurate credit risk analysis of loan portfolios and to assess the obligor's creditworthiness.

1.1. The state of the art

Credit risk analysis is important but also complicated. Even the most reliable customer may default on his or her debt. Besides that, there is some noisy data in corporate financial statements and personal credit application forms. In order to build a robust credit risk analysis model, this line of research began in the 1960s; Beaver (1966) was one of the earliest researchers to study the prediction of credit risk. Beaver's analysis studied one financial ratio at a time and decided a cutoff threshold for each ratio. Subsequently, quantitative models (Galindo & Tamayo, 2000; Thomas, 2000) such as linear discriminant analysis and logistic regression (Ederington, 1985) were applied to predict the credit level of new clients.


In addition to these traditional statistical methods, machine learning techniques, such as rule-based reasoning systems (Kim, 1993) and neural networks (Maher & Sen, 1997), were adopted around the 1990s to improve the prediction accuracy. Investigations of machine learning methods and their experiments revealed that such methods normally reach higher accuracy than traditional statistical methods (Kim, 1993). Furthermore, hybrid methods (Lin, 2009) and the support vector machine (SVM) (Huang, Chen, & Wang, 2007) have also been employed in this area recently. Among various credit risk models, Chen and Shih (2006) and Huang et al. (2004) reported that SVM was competitive and outperformed other classifiers (including neural networks and the linear discriminant classifier) in terms of generalization performance. Because of the good performance of SVM, our model is mainly compared with it in our experiments. Furthermore, in order to further improve the generalization performance of existing models, ensemble learning has been adopted to enhance them, as in neural network and SVM based ensemble learning methods (Yu, Wang, & Lai, 2008; Yu, Yue, Wang, & Lai, 2010).

1.2. The preferred properties for credit risk modeling

The primary focus of credit risk analysis is to improve the prediction accuracy. In particular, sensitivity (SE) and specificity (SP) are the common performance measures for this task. If SE is low, banks will lose some non-default customers, which lowers their interest income in the income statement. If SP is low, it will lead to more defaults and larger provisions in the income statement. Therefore, for credit risk analysis, we should analyze SE and SP separately (Baesens et al., 2003), not pay attention only to overall accuracy. Moreover, the credit risk model should also pay attention to special groups of customers which are difficult to classify; if we can catch these customers, the generalization performance improves significantly. Besides that, any customer may default for various reasons, and even non-default customers have some probability of default. So the credit risk model should also predict the distance to default (DD) (Servigny & Renault, 2004), which denotes the probability of default.

For some machine learning models, like neural networks, we need to adjust the model structure and parameters, which often increases the complexity. Although neural networks are increasingly found to be powerful in many classification applications, their performance actually depends on the network model itself, especially on the initial conditions, network topology and training algorithm, which may be one reason why the results of neural networks for credit risk evaluation vary when compared with some statistical methods. Finding the optimal neural network model is still a challenging issue (Huang, Chen, Hsu, Chen, & Wu, 2004). If the credit model structure is not relatively stable and has many free parameters to adjust, it will be inconvenient for financial institutions. Therefore, for practical usage, a stable model structure is preferred. Besides that, there is some inaccurate information, both in the personal credit data from application forms and in corporate financial statements, which leads to noisy data and causes overfitting (Stecking & Schebesch, 2003) in the learning process. Thus, the credit risk model should have a relatively stable structure, good generalization performance, predict the distance to default, and also overcome overfitting.

1.3. The advantages of our model

The relevance vector machine (RVM) (Tipping, 2001) is a Bayesian sparse Kernel technique for regression and classification which has not yet been used in credit risk analysis and which overcomes several limitations of existing methods. It shares the good characteristics of SVM while avoiding some of its limitations. For instance, RVM also provides clear connections to the underlying statistical learning theory. The RVM algorithm usually finds a globally optimal solution and has a simple geometric interpretation (Bishop & Tipping, 2000). Compared with SVM, the advantages of RVM include: (1) the number of relevance vectors in RVM is much smaller than the number of support vectors in SVM; (2) its prediction is probabilistic, which can be used to estimate the distance to default (DD); (3) unlike SVM, there is no need to tune the soft margin parameter C, and even the Kernel parameters are optimized during the learning process of RVM; (4) the Kernel function does not need to satisfy Mercer's condition (Tipping, 2001). Additionally, RVM typically leads to much sparser models, resulting in faster prediction on test data without increasing the error.

In order to further improve the prediction performance of RVM, we apply ensemble learning to it. Ensemble learning makes use of many base agents, or different variants of them, to solve the problem, and it usually generalizes better than a single agent. One of the most common and effective ensemble learning frameworks is Adaboost (Freund & Schapire, 1996), which, however, often leads to overfitting. In this paper, we adopt soft margin boosting (Ratsch & Onoda, 2001) to overcome this drawback. Traditional boosting methods often use instance weights to generate diverse input data and to make the algorithms focus on the instances which are difficult to classify; in this way, we can separate special groups of customers which are difficult to classify. In our model, we also adopt instance weights to achieve this purpose. The instance weights are incorporated into the objective function of each RVM agent, so that the agent focuses on the instances with high weights. Besides that, in each RVM agent we employ the perceptron Kernel (Lin & Li, 2008) to simulate infinite subagents, which forms a three-level ensemble learning system named the RVM based infinite decision agent ensemble learning (RVMIdeal) system.

The rest of this paper is organized as follows. Section 2 gives a review of the RVM agent. Section 3 then describes the proposed framework for credit risk analysis. Experimental results are presented in Section 4, and the last section gives some concluding remarks. For simplicity, throughout the remainder of this paper we consider a two-class classification problem with non-default and default customers, in which the training data Z = {z1, z2, . . . , zn} comprise feature vectors X = {x1, x2, . . . , xn} along with corresponding binary target variables y = {y1, y2, . . . , yn}; yi = 1 and yi = −1 stand for non-default and default customers respectively. At the agent level, yi ∈ {1, −1} is mapped to yi ∈ {1, 0} as required by RVM. Each instance xi is composed of m features {xi1, xi2, . . . , xim} and carries a weight wi(t) in iteration t. The base agent obtained in iteration t is written as ht. Moreover, the operator ∘ denotes the element-wise product.

2. RVM review

The relevance vector machine (RVM) is a probabilistic Bayesian learning framework. It acquires relevance vectors and weights by maximizing a marginal likelihood. The RVM prediction function is a weighted sum of Kernel functions:

\[ y(\mathbf{x}) = \mathbf{a}'\boldsymbol{\phi}(\mathbf{x}) = \sum_{i=1}^{n} a_i K(\mathbf{x}, \mathbf{x}_i) + a_0 \tag{1} \]

where a = [a0, a1, a2, . . . , an]' and φ(x) = [1, K(x, x1), K(x, x2), . . . , K(x, xn)]'. The likelihood of the training data is


\[ p(\mathbf{y} \mid \mathbf{a}, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2}\, \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{a} \rVert^2 \right). \tag{2} \]

The parameters a and σ² are estimated by maximizing the likelihood (2). In order to overcome over-fitting, each weight ai is given a zero-mean Gaussian prior with variance αi⁻¹:

\[ p(\mathbf{a} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{n} N\!\left( a_i \mid 0, \alpha_i^{-1} \right) = \prod_{i=0}^{n} \sqrt{\frac{\alpha_i}{2\pi}} \exp\!\left( -\frac{\alpha_i a_i^2}{2} \right), \tag{3} \]

where α = [α0, α1, α2, . . . , αn]'. The posterior distribution over the weights a is then

\[ p(\mathbf{a} \mid \mathbf{y}, \boldsymbol{\alpha}, \sigma^2) = \frac{p(\mathbf{y} \mid \mathbf{a}, \sigma^2)\, p(\mathbf{a} \mid \boldsymbol{\alpha})}{p(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)} = (2\pi)^{-(n+1)/2} \lvert \boldsymbol{\Sigma} \rvert^{-1/2} \exp\!\left( -\frac{1}{2} (\mathbf{a} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{a} - \boldsymbol{\mu}) \right) \tag{4} \]

where the posterior covariance and mean are, respectively,

\[ \boldsymbol{\Sigma} = \left( \sigma^{-2} \boldsymbol{\Phi}'\boldsymbol{\Phi} + \mathbf{A} \right)^{-1} \tag{5} \]

\[ \boldsymbol{\mu} = \sigma^{-2} \boldsymbol{\Sigma} \boldsymbol{\Phi}' \mathbf{y} \tag{6} \]

with A = diag(α0, α1, . . . , αn) and Φ = [φ(x1), φ(x2), . . . , φ(xn)]'. The marginal likelihood is also computable:

\[ p(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2) = \int p(\mathbf{y} \mid \mathbf{a}, \sigma^2)\, p(\mathbf{a} \mid \boldsymbol{\alpha})\, d\mathbf{a} = (2\pi)^{-n/2} \lvert \sigma^2 \mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}' \rvert^{-1/2} \exp\!\left( -\frac{1}{2} \mathbf{y}' \left( \sigma^2 \mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}' \right)^{-1} \mathbf{y} \right). \tag{7} \]

Setting the derivatives of the marginal likelihood to zero gives the re-estimation equations

\[ \alpha_i^{\text{new}} = \frac{\gamma_i}{\mu_i^2} \tag{8} \]

\[ (\sigma^2)^{\text{new}} = \frac{\lVert \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\mu} \rVert^2}{n - \sum_{i=0}^{n} \gamma_i} \tag{9} \]

where γi = 1 − αiΣii and Σii is the ith diagonal element of the posterior covariance Σ given by (5). Learning therefore proceeds by choosing initial values for α and σ², evaluating the mean μ and covariance Σ of the posterior using (6) and (5), respectively, and then alternately re-estimating the hyperparameters α and σ² using (8) and (9). This process is repeated until a suitable convergence criterion is satisfied.

For the case of classification, we need to make a small modification. Following statistical convention, we generalize the linear model by applying the logistic sigmoid function g(x) = 1/(1 + e⁻ˣ) together with the Bernoulli distribution, and write the likelihood as

\[ p(\mathbf{y} \mid \mathbf{a}) = \prod_{i=1}^{n} g(y(\mathbf{x}_i))^{y_i} \left[ 1 - g(y(\mathbf{x}_i)) \right]^{1 - y_i}. \tag{10} \]

Here, the target yi ∈ {0, 1}. Since p(a | y, α) ∝ p(y | a) p(a | α), we find the maximum, over a, of

\[ \ln\!\left[ p(\mathbf{y} \mid \mathbf{a})\, p(\mathbf{a} \mid \boldsymbol{\alpha}) \right] = \sum_{i=1}^{n} \left[ y_i \ln g(y(\mathbf{x}_i)) + (1 - y_i) \ln\!\left( 1 - g(y(\mathbf{x}_i)) \right) \right] - \frac{1}{2} \mathbf{a}'\mathbf{A}\mathbf{a} + \text{const.} \tag{11} \]

At the mode of p(a | y, α), we have

\[ \boldsymbol{\Sigma} = \left( \boldsymbol{\Phi}'\mathbf{B}\boldsymbol{\Phi} + \mathbf{A} \right)^{-1} \tag{12} \]

\[ \mathbf{a}_{\text{MP}} = \mathbf{A}^{-1} \boldsymbol{\Phi}' \left( \mathbf{y} - \mathbf{y}(\mathbf{X}) \right) \tag{13} \]

where B = diag(b1, b2, . . . , bn), bi = g(y(xi))[1 − g(y(xi))], and y(X) = [y(x1), y(x2), . . . , y(xn)]'.

As a result of the optimization, a large proportion of the hyperparameters αi are driven to large (in principle infinite) values, so the weight parameters ai corresponding to these αi have posterior distributions whose mean and variance are both zero. Those parameters, and the corresponding basis functions K(x, xi), are therefore removed from the model and play no role in making predictions for new inputs. The inputs xi corresponding to the remaining non-zero weights are called relevance vectors.
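To make the hyperparameter re-estimation of Eqs. (5)–(9) concrete, the following minimal NumPy sketch implements the regression-case loop; it is an illustrative reading of the review above, not the authors' original code, and the initial values, iteration count and pruning threshold are our own assumptions.

```python
import numpy as np

def rvm_regression(K, y, n_iter=200, prune_at=1e9):
    """Minimal RVM regression sketch following Eqs. (5)-(9).

    K : (n, n) kernel matrix with K[i, j] = K(x_i, x_j); y : (n,) targets.
    Returns the posterior mean of the weights and a relevance mask over the
    n + 1 basis functions (index 0 is the bias).
    """
    n = len(y)
    Phi = np.hstack([np.ones((n, 1)), K])        # design matrix [1, K(x, x_1), ...]
    alpha = np.ones(n + 1)                       # hyperparameters alpha_0 .. alpha_n
    sigma2 = 0.1 * np.var(y) + 1e-6              # assumed initial noise variance
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)          # Eq. (5)
        mu = Sigma @ Phi.T @ y / sigma2                           # Eq. (6)
        gamma = 1.0 - alpha * np.diag(Sigma)                      # gamma_i = 1 - alpha_i * Sigma_ii
        alpha = gamma / (mu ** 2 + 1e-12)                         # Eq. (8)
        sigma2 = np.sum((y - Phi @ mu) ** 2) / max(n - gamma.sum(), 1e-6)  # Eq. (9)
    relevant = alpha < prune_at                                   # basis functions that survive pruning
    return mu, relevant
```

In practice, basis functions whose αi exceed the pruning threshold would be dropped inside the loop, which is exactly what produces the sparse set of relevance vectors described above; the classification variant used by each agent replaces the Gaussian likelihood with the weighted Bernoulli likelihood of Section 3.2.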

3. Methodology formulation

In this section, we present our proposed RVMIdeal approach for credit risk analysis. The architecture of the framework is depicted in Fig. 1 and consists of three levels. In the base level, we make use of the perceptron Kernel to simulate infinite subagents. In the middle level, the RVM algorithm using the perceptron Kernel is performed to train a weak learner ht, and the error function is computed and sent to boosting. In the top level, the boosting algorithm updates the instance weights and sends them back to the RVM agents; in the next iteration, a new RVM agent is trained based on the updated instance weights. This process is repeated until convergence. At the end, the final hypothesis makes predictions based on the ensemble of the prediction functions of all agents. The details of this framework are given as follows.

Fig. 1. The structure of RVMIdeal: perceptron Kernel subagents (infinitely many) feed RVM agents 1, 2, . . . , T, which exchange error functions and instance weights with the soft margin boosting level.

3.1. Base agent creation

Boosting is a powerful technique for combining multiple learning agents to produce a committee whose performance is usually significantly better than that of any single base agent. Boosting can give good results even if the base agents perform only slightly better than random guessing (Freund & Schapire, 1996); hence the base agents are known as weak learners.

3.1.1. Instance diversity

In the basic form of boosting, the base agents are trained in sequence, and each base agent is trained using a weighted form of the data set in which the weighting coefficient associated with each instance depends on the performance of the previous agents. In particular, instances that are misclassified by one of the base agents are given greater weight when used to train the next agent in the sequence. Once all the agents have been trained, their predictions are combined through a weighted majority voting scheme. The instance weight wi is initially set to 1/n for all instances. We assume that a procedure is available for training a base agent on weighted data.

At each stage of the algorithm, boosting trains a new agent using the data set in which the weighting coefficients are adjusted according to the performance of the previously trained agents, so as to give greater weight to the misclassified instances. Comparing the true label yi with the prediction ht(xi) given by the tth agent, we define the margin function as

\[ \rho(\mathbf{z}_i) = f_{\rho}\!\left( h_t(\mathbf{x}_i), y_i \right). \tag{14} \]

Based on the margin function, the weight update equation is defined as

\[ w_i^{(t+1)} = f_b\!\left( w_i^{(t)}, \rho \right). \tag{15} \]

The instance weights thus change across iterations according to the margin function of the previous iteration; the detailed forms of (14) and (15) will be given in Section 3.3.
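The following short sketch shows the generic shape of this instance-weighted boosting loop; the functions train_weighted_agent, margin_fn and weight_update stand for the abstract forms fρ and fb above and are placeholders that are made concrete in Section 3.3.

```python
import numpy as np

def boost(X, y, train_weighted_agent, margin_fn, weight_update, T=100):
    """Generic instance-weighted boosting skeleton (Eqs. (14)-(15))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # initial instance weights w_i = 1/n
    agents = []
    for t in range(T):
        h_t = train_weighted_agent(X, y, w)   # weak learner trained on weighted data
        rho = margin_fn(h_t(X), y)            # Eq. (14): margin of each instance
        w = weight_update(w, rho)             # Eq. (15): emphasize hard instances
        agents.append(h_t)
    return agents
```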

3.1.2. Infinite decision subagents via the perceptron Kernel

The base agent RVM is a Kernel method, so we need to select a Kernel that fits both RVM and boosting. The most commonly used Kernel functions are the polynomial, radial basis and hyperbolic tangent functions, none of which takes the needs of ensemble learning into consideration. Recently, some research (Lin & Li, 2005) has studied the relationship between Kernel methods and ensemble methods. Although the number of agents can be infinite in theory, most existing models only utilize a small, finite number of agents, which limits the capacity of the model (Freund & Schapire, 1997); using many agents in boosting also increases the time complexity considerably. Thus, we employ infinitely many base subagents at the agent level. Based on the above considerations, we use the perceptron Kernel (Lin & Li, 2005) defined as

\[ K_P(\mathbf{x}, \mathbf{x}') = \Delta_P - \lVert \mathbf{x} - \mathbf{x}' \rVert_2 \tag{16} \]

where Δ_P is a constant defined as

\[ \Delta_P = \frac{\int_{\lVert \boldsymbol{\theta} \rVert_2 = 1} d\boldsymbol{\theta}}{\int_{\lVert \boldsymbol{\theta} \rVert_2 = 1} \left| \cos\!\left( \mathrm{angle}\langle \boldsymbol{\theta}, \mathbf{e}_1 \rangle \right) \right| d\boldsymbol{\theta}}. \tag{17} \]

Here, e1 = (1, 0, . . . , 0)' and the operator angle⟨·, ·⟩ denotes the angle between two vectors. The perceptron is a linear threshold classifier of the form p_{θ,α}(x) = sign(θ'x − α), which is a basic model of a neuron. Consider the set of perceptrons

\[ \mathcal{P} = \left\{ p_{\boldsymbol{\theta},\alpha} : \boldsymbol{\theta} \in \mathbb{R}^m,\ \lVert \boldsymbol{\theta} \rVert_2 = 1,\ \alpha \in [-R, R] \right\} \tag{18} \]

where R is the radius of a ball containing the data. The perceptron Kernel can then be written as

\[ K_P(\mathbf{x}, \mathbf{x}') = \frac{2}{r_P^2} \int_{\lVert \boldsymbol{\theta} \rVert_2 = 1} \left( R - \left| \boldsymbol{\theta}'\mathbf{x} - \boldsymbol{\theta}'\mathbf{x}' \right| \right) d\boldsymbol{\theta} \tag{19} \]

where r_P² = 2∫_{‖θ‖₂=1} |cos(angle⟨θ, e1⟩)| dθ. The integral runs over every possible direction of θ, which is equivalent to an ensemble classifier over infinitely many perceptrons. In Kernel learning, (19) is usually simplified by dropping the constant Δ_P and keeping only the −‖x − x′‖₂ term (Lin & Li, 2008).
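As a quick illustration of Eq. (16), the snippet below builds a perceptron-Kernel Gram matrix; it is only a sketch of our reading of the formula, and setting the constant Δ_P from the observed data radius is an assumption made here for illustration, not the paper's definition.

```python
import numpy as np

def perceptron_kernel_matrix(X1, X2, delta_p=None):
    """Perceptron kernel K_P(x, x') = Delta_P - ||x - x'||_2 (Eq. (16)).

    X1 : (n1, m), X2 : (n2, m). If delta_p is None it is set to twice the
    largest pairwise distance, an illustrative choice that keeps the matrix
    entries positive; it is not the paper's definition of Delta_P.
    """
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)  # pairwise Euclidean distances
    if delta_p is None:
        delta_p = 2.0 * d.max()
    return delta_p - d

# Usage: K = perceptron_kernel_matrix(X_train, X_train) feeds directly into the
# RVM design matrix of Section 2; no kernel width needs to be tuned.
```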

3.2. Single RVM agent learning

In our formulation, instances with higher weights should be assigned their correct labels first. To achieve this, we incorporate the instance weights into the objective function of RVM, so that the learning process of RVM takes the instance weights into account. Considering the weight of each instance, the likelihood (10) becomes

\[ p(\mathbf{y} \mid \mathbf{a}) = \prod_{i=1}^{n} g(y(\mathbf{x}_i))^{w_i y_i} \left[ 1 - g(y(\mathbf{x}_i)) \right]^{w_i (1 - y_i)} \tag{20} \]

and (11) becomes

\[ \ln\!\left[ p(\mathbf{y} \mid \mathbf{a})\, p(\mathbf{a} \mid \boldsymbol{\alpha}) \right] = \sum_{i=1}^{n} w_i \left[ y_i \ln g(y(\mathbf{x}_i)) + (1 - y_i) \ln\!\left( 1 - g(y(\mathbf{x}_i)) \right) \right] - \frac{1}{2} \mathbf{a}'\mathbf{A}\mathbf{a} + \text{const.} \tag{21} \]

Here, the first term is the weighted sum of errors and the second is the regularizer. (21) is maximized to find a; in the process, instances with larger weights are assigned their correct labels first. The first and second derivatives of the log posterior distribution are then given by

\[ \nabla \ln\!\left[ p(\mathbf{y} \mid \mathbf{a})\, p(\mathbf{a} \mid \boldsymbol{\alpha}) \right] = (\boldsymbol{\Phi} \circ \mathbf{w})' \left( \mathbf{y} - \mathbf{y}(\mathbf{X}) \right) - \mathbf{A}\mathbf{a} \tag{22} \]

\[ \nabla\nabla \ln\!\left[ p(\mathbf{y} \mid \mathbf{a})\, p(\mathbf{a} \mid \boldsymbol{\alpha}) \right] = -\left[ (\boldsymbol{\Phi} \circ \mathbf{w})' \mathbf{B} \boldsymbol{\Phi} + \mathbf{A} \right] \tag{23} \]

where y(X) = [y(x1), y(x2), . . . , y(xn)]' and B = diag(b1, b2, . . . , bn) with bi = g(y(xi))[1 − g(y(xi))]. The negative second derivative is the inverse covariance matrix of the Gaussian approximation to the posterior distribution. The mode of this approximation, corresponding to the mean of the Gaussian, is obtained by setting (22) to zero, giving the mean and covariance of the Laplace approximation in the form

\[ \mathbf{a}_{\text{MP}} = \mathbf{A}^{-1} (\boldsymbol{\Phi} \circ \mathbf{w})' \left( \mathbf{y} - \mathbf{y}(\mathbf{X}) \right) \tag{24} \]

\[ \boldsymbol{\Sigma} = \left[ (\boldsymbol{\Phi} \circ \mathbf{w})' \mathbf{B} \boldsymbol{\Phi} + \mathbf{A} \right]^{-1}. \tag{25} \]

We now use this Laplace approximation to evaluate the marginal likelihood:

\[ p(\mathbf{y} \mid \boldsymbol{\alpha}) = \int p(\mathbf{y} \mid \mathbf{a})\, p(\mathbf{a} \mid \boldsymbol{\alpha})\, d\mathbf{a} \simeq p(\mathbf{y} \mid \mathbf{a}_{\text{MP}})\, p(\mathbf{a}_{\text{MP}} \mid \boldsymbol{\alpha})\, (2\pi)^{(n+1)/2} \lvert \boldsymbol{\Sigma} \rvert^{1/2}. \tag{26} \]

If we substitute for p(y | a_MP) and p(a_MP | α) and then set the derivative of the marginal likelihood with respect to αi equal to zero, we obtain

\[ \alpha_i^{\text{new}} = \frac{\gamma_i}{a_{\text{MP},i}^2} \tag{27} \]

where γi = 1 − αiΣii. This update formula is identical to (8). Hence, the remaining steps are the same as for RVM in the regression case: the algorithm alternately re-estimates the posterior mean and covariance using (24) and (25) and the hyperparameters using (27), until a suitable convergence criterion is satisfied.
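To make the weighted Laplace step concrete, here is a minimal sketch of one agent's training loop following Eqs. (20)–(27); it is our illustrative reading of the updates (the inner Newton iterations, initial values and iteration counts are assumptions), not the authors' released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_rvm_classifier(K, y01, w, n_outer=50, n_newton=10):
    """One weighted RVM agent (Eqs. (20)-(27)).

    K   : (n, n) kernel matrix, e.g. the perceptron kernel sketched above.
    y01 : (n,) labels in {0, 1}; w : (n,) boosting instance weights.
    Returns the weight vector a (index 0 is the bias) used in Eq. (1).
    """
    n = len(y01)
    Phi = np.hstack([np.ones((n, 1)), K])
    alpha = np.ones(n + 1)
    a = np.zeros(n + 1)
    for _ in range(n_outer):
        A = np.diag(alpha)
        # Inner Newton steps towards the mode of the weighted log posterior (Eq. (21)).
        for _ in range(n_newton):
            p = sigmoid(Phi @ a)
            B = np.diag(w * p * (1.0 - p))                 # instance-weighted Bernoulli curvature
            grad = Phi.T @ (w * (y01 - p)) - alpha * a     # weighted gradient, cf. Eq. (22)
            H = Phi.T @ B @ Phi + A                        # negative Hessian, cf. Eq. (23)
            a = a + np.linalg.solve(H, grad)
        Sigma = np.linalg.inv(H)                           # Laplace covariance, Eq. (25)
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (a ** 2 + 1e-12)                   # Eq. (27)
    return a
```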

3.3. Multiagent ensemble learning

In order to avoid overfitting and obtain good generalization performance, we introduce a soft margin into boosting, following Ratsch and Onoda (2001). First we define the margin of Eq. (14) as

\[ \rho(\mathbf{z}_i; \mathbf{c}) = y_i f(\mathbf{x}_i) = y_i \sum_{r=1}^{t} c_r h_r(\mathbf{x}_i). \tag{28} \]

In Adaboost, after many iterations the following inequalities are satisfied:

\[ \rho(\mathbf{z}_i; \mathbf{c}) \geq Q \quad (i = 1, 2, \ldots, n). \tag{29} \]

If Q > 0, all the instances are classified according to their possibly wrong labels, which leads to overfitting in the presence of noise. Therefore, we relax the margin:

\[ \tilde{\rho}(\mathbf{z}_i; \mathbf{c}) = \rho(\mathbf{z}_i; \mathbf{c}) + C\, \mu_t(\mathbf{z}_i) \tag{30} \]

where C is an a priori chosen constant and μt(zi) = |Σ_{r=1}^{t} cr wr(zi)|. Based on Ratsch's result, the weight update function of Eq. (15) becomes

\[ w_{t+1}(\mathbf{z}_i) = \frac{1}{Z_t} \exp\!\left( -\frac{1}{2} \left[ \rho(\mathbf{z}_i; \mathbf{b}^t) + C\, \lvert \mathbf{b}^t \rvert\, \mu_t(\mathbf{z}_i) \right] \right) \tag{31} \]

where b^t = argmin_{b^t ≥ 0} Σ_{i=1}^{n} exp{−½[ρ(zi; b^t) + C|b^t|μt(zi)]} with b^t = [b1, b2, . . . , bt]', and Zt is a normalization constant such that Σ_{i=1}^{n} wt+1(zi) = 1. In this way, the margin is computed from ht and the ground-truth labels y, and it is relaxed into a soft margin by adding the regularizer; if there are noisy data, this limits the increase of their weights. The RVM agent is then retrained on the updated instance space, and in our newly formulated RVM model the instances with higher weights are correctly classified first, yielding a new prediction function ht+1. This process is repeated until convergence. After the procedure has run for T iterations, we obtain the final hypothesis

\[ f(\mathbf{x}) = \sum_{t=1}^{T} c_t h_t, \qquad c_t = \frac{b_t}{\sum_{r=1}^{T} \lvert b_r \rvert}. \tag{32} \]
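Putting Sections 3.1–3.3 together, the sketch below outlines one possible reading of the overall RVMIdeal training loop with the soft-margin update of Eq. (31); it builds on the perceptron_kernel_matrix and weighted_rvm_classifier sketches given earlier, and the grid search used to pick each combination coefficient b_t is a simplification we assume for illustration, not the paper's exact optimizer.

```python
import numpy as np

def rvmideal_train(X, y_pm1, C=1.0, T=100, b_grid=np.linspace(0.05, 2.0, 40)):
    """Illustrative soft-margin boosted RVM ensemble (Eqs. (28)-(32)).

    y_pm1 : labels in {-1, +1}. Only the newest coefficient b_t is optimized in
    each round, a simplification of the joint argmin over b^t in Eq. (31).
    """
    n = len(y_pm1)
    K = perceptron_kernel_matrix(X, X)              # Section 3.1.2 sketch
    Phi = np.hstack([np.ones((n, 1)), K])
    y01 = (y_pm1 + 1) // 2                          # agent-level labels in {0, 1}
    w = np.full(n, 1.0 / n)                         # instance weights
    margin, mu = np.zeros(n), np.zeros(n)           # running rho(z_i) and mu_t(z_i)
    agents, b = [], []
    for t in range(T):
        a = weighted_rvm_classifier(K, y01, w)      # Section 3.2 sketch
        h = np.sign(Phi @ a)                        # predictions of agent t in {-1, +1}
        # Grid search for b_t on the soft-margin exponential loss behind Eq. (31).
        costs = [np.exp(-0.5 * (margin + bt * y_pm1 * h
                                + C * (mu + bt * w))).sum() for bt in b_grid]
        bt = b_grid[int(np.argmin(costs))]
        margin += bt * y_pm1 * h                    # accumulate rho(z_i; b), Eq. (28)
        mu += bt * w                                # accumulate mu_t(z_i)
        new_w = np.exp(-0.5 * (margin + C * mu))    # Eq. (31), before normalization
        w = new_w / new_w.sum()                     # Z_t keeps the weights summing to 1
        agents.append(a)
        b.append(bt)
    c = np.array(b) / np.sum(np.abs(b))             # Eq. (32): c_t = b_t / sum_r |b_r|
    return agents, c
```

A new customer x would then be scored as sign(Σt ct ht(x)); and since each RVM agent outputs the sigmoid posterior g(y(x)), the ensemble can also report a default probability, i.e. the distance-to-default estimate discussed in Section 1.2.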

4. Experiments

In order to verify the effectiveness of the proposed model, real-world credit card approval experiments are conducted and analyzed. Two credit card application approval data sets are adopted: the Australian and Japanese data sets from the UCI machine learning repository (http://www.archive.ics.uci.edu/ml/datasets.html). In the Australian credit application data set, there are 307 instances of creditworthy applicants and 383 instances of non-creditworthy applicants. Each sample is characterized by 14 attributes, including 6 numerical and 8 categorical attributes, and one target attribute (non-default or default). All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. There are 37 instances with missing values; the missing values are replaced by the mode of the attribute if it is categorical, or by the mean of the attribute if it is continuous. In the Japanese consumer credit card application approval data set, all attribute names and values have also been changed to meaningless symbols for confidentiality. Here we delete the instances with missing attribute values, which leaves 653 instances with 15 features, of which 357 cases were granted credit and 296 cases were refused.

4.1. Experimental setup

Here, 1/2 of the instances are used for the training set, and 1/4 each for the validation and testing sets. In credit risk analysis, detecting non-default and default customers are both important: if we fail to identify default customers, defaults will occur, and if we wrongly reject non-default customers, we lose them and the corresponding interest income. Therefore, we use the following measures:

\[ \text{SE} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false negatives}} \tag{33} \]

\[ \text{SP} = \frac{\text{Number of true negatives}}{\text{Number of true negatives} + \text{Number of false positives}} \tag{34} \]

\[ \text{BA} = \frac{\text{SE} + \text{SP}}{2} \tag{35} \]

where SE denotes the sensitivity, i.e. the ability of the model to recognize the actual non-default customers, SP represents the specificity, which evaluates the capacity to separate the actual default customers, and BA is short for balanced accuracy, which considers the capacity to detect both non-default and default customers. The training, validation and testing sets are chosen randomly; all the methods are run 20 times and the average performance is reported to reduce statistical variability. Finally, we report the prediction accuracy on the testing set, as shown in Tables 1 and 2. The best performance is listed in bold.

Table 1. Testing accuracy (%) on the Australian credit data set.

Method                    | BA         | SE         | SP
LOGR                      | 86.5 ± 2.2 | 91.1 ± 2.8 | 81.8 ± 3.2
SVM                       | 86.3 ± 1.2 | 89.5 ± 1.4 | 83.2 ± 1.8
RVM                       | 88.2 ± 1.5 | 92.6 ± 1.6 | 83.8 ± 2.2
RVMAda (T = 100, PK)      | 90.5 ± 1.0 | 94.7 ± 1.4 | 86.3 ± 1.5
RVMAda (T = 200, PK)      | 90.6 ± 1.0 | 94.8 ± 1.2 | 86.3 ± 1.5
RVMIdeal (T = 100, GK)    | 93.7 ± 1.2 | 97.2 ± 1.7 | 90.2 ± 1.8
RVMIdeal (T = 200, GK)    | 93.8 ± 1.2 | 97.3 ± 1.7 | 90.2 ± 1.8
RVMIdeal (T = 100, PK)    | 95.5 ± 1.0 | 98.4 ± 1.6 | 92.7 ± 1.4
RVMIdeal (T = 200, PK)    | 95.5 ± 1.0 | 98.4 ± 1.6 | 92.7 ± 1.5

Table 2. Testing accuracy (%) on the Japanese credit data set.

Method                    | BA         | SE         | SP
LOGR                      | 74.6 ± 4.3 | 75.8 ± 5.6 | 73.4 ± 6.2
SVM                       | 78.3 ± 4.1 | 80.4 ± 5.1 | 76.2 ± 6.1
RVM                       | 79.5 ± 3.4 | 82.2 ± 4.3 | 76.7 ± 4.8
RVMAda (T = 100, PK)      | 83.2 ± 3.6 | 84.3 ± 4.7 | 82.1 ± 5.1
RVMAda (T = 200, PK)      | 83.2 ± 3.6 | 84.3 ± 4.6 | 82.1 ± 5.1
RVMIdeal (T = 100, GK)    | 86.3 ± 3.3 | 89.3 ± 4.3 | 83.3 ± 4.7
RVMIdeal (T = 200, GK)    | 86.3 ± 3.3 | 89.3 ± 4.4 | 83.3 ± 4.7
RVMIdeal (T = 100, PK)    | 88.0 ± 3.3 | 91.5 ± 4.2 | 84.6 ± 4.9
RVMIdeal (T = 200, PK)    | 88.0 ± 3.3 | 91.5 ± 4.2 | 84.6 ± 4.9

4.2. Compared methods

We first compare our RVM based infinite decision agent ensemble learning (RVMIdeal) system with logistic regression (LOGR), SVM and RVM. LOGR is one of the best statistical credit risk models, and SVM is among the best machine learning credit risk models; thus these two models are typical baseline methods used in credit risk analysis. For SVM, we need to choose the regularization parameter C and the Gaussian Kernel parameter σ. C is selected from {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. In particular, the width σ of the Gaussian Kernel exp(−‖z‖²/2σ²) is picked from {0.25√c, 0.5√c, √c, 2√c, 4√c}, where c is the average distance over all pairs of instances. For RVM, all the parameters, including the Gaussian Kernel parameter, are adapted automatically; therefore, no validation is needed for RVM. Besides that, we also compare with RVM based Adaboost (RVMAda). For soft margin boosting, the regularization parameter C is selected from {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. We also adopt the Gaussian Kernel (GK) and the perceptron Kernel (PK) individually for our model.
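As a companion to the measures in Eqs. (33)–(35) and the split protocol above, the following sketch shows how such an evaluation could be scripted; the 1/2–1/4–1/4 split and the 20 repetitions follow Section 4.1, while the model object and its fit/predict interface are placeholders assumed here for illustration.

```python
import numpy as np

def se_sp_ba(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy (Eqs. (33)-(35)).
    Positive class (+1) = non-default, negative class (-1) = default."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return se, sp, (se + sp) / 2.0

def evaluate(model, X, y, n_runs=20, seed=0):
    """Repeat the 1/2 train, 1/4 validation, 1/4 test protocol n_runs times."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        n_tr, n_va = len(y) // 2, len(y) // 4
        tr = idx[:n_tr]
        va = idx[n_tr:n_tr + n_va]      # held out for parameter selection (unused in this minimal sketch)
        te = idx[n_tr + n_va:]
        model.fit(X[tr], y[tr])         # placeholder fit/predict interface
        scores.append(se_sp_ba(y[te], model.predict(X[te])))
    return np.mean(scores, axis=0)      # average (SE, SP, BA) over the runs
```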

4.3. Analysis of experimental results

We compare BA, SE and SP in Tables 1 and 2. In addition, the test errors (1 − BA), (1 − SE) and (1 − SP) are presented for different numbers of agents in Fig. 2. Based on these results, several important conclusions can be drawn.

Fig. 2. Test error (1 − BA, 1 − SE and 1 − SP) versus the number of agents for RVMAda, RVMIdeal(GK) and RVMIdeal(PK) on the two data sets.

(a) For the single agents, LOGR, SVM and RVM perform roughly the same on the Australian credit data set, while SVM and RVM are better on the Japanese data set. Both RVM and SVM are Kernel methods, but RVM does not need to search for the optimal model parameter values: all parameters in RVM, even those of the Gaussian Kernel, are optimized automatically.

(b) For the ensemble models, owing to the ensemble output of boosting (Freund & Schapire, 1996), the prediction performance of a single agent is improved significantly. Thus, the ensemble models RVMIdeal and RVMAda are better than the single-agent models LOGR, SVM and RVM. Moreover, the performance of RVMAda is inferior to that of RVMIdeal, possibly because RVMAda has no soft margin, which leads to overfitting, i.e. inferior prediction performance on the testing set.

(c) We also compare our model with the Gaussian Kernel and with the perceptron Kernel. From the experimental results in Tables 1 and 2, RVMIdeal(PK) performs best in terms of BA, SE and SP. A possible reason is that the perceptron Kernel is equivalent to infinitely many subagents, which suits ensemble learning better. In addition, the perceptron Kernel has no Kernel parameter that needs to be optimized, whereas the Kernel parameter of the Gaussian Kernel has to be optimized through the learning process.

(d) Furthermore, in Fig. 2, for all the ensemble methods the slope of RVMIdeal(PK) is steeper than that of RVMIdeal(GK), and the slope of RVMIdeal is steeper than that of RVMAda, which means that RVMIdeal(PK) converges most quickly among the three ensemble models, owing to its infinitely many perceptrons and its immunity to overfitting. The slope decreases as T grows and the curves nearly level off after T = 80; we also find almost no difference between T = 100 and T = 200. Thus, it is appropriate to adopt the model at T = 100.

5. Conclusion

In this paper, we first give a brief survey of credit risk analysis and the models employed for it, and compare the advantages and disadvantages of different kinds of credit risk models. Furthermore, we summarize the preferred properties of credit risk modeling. Based on these properties and existing machine learning techniques, we propose our model RVMIdeal. For the single agent, we mainly compare RVM with SVM. We also employ the perceptron Kernel to simulate infinite subagents; in this way, our system becomes a three-level ensemble learning system. Moreover, we update the objective function and the related functions in each RVM agent to fit the ensemble learning system. In order to combine the agents better, we adopt soft margin boosting to overcome overfitting. We also perform comprehensive experiments on two credit data sets, in which our proposed model outperforms existing statistical and machine learning methods. Besides the good accuracy, all the parameters at the agent level are optimized automatically. Based on the good properties of RVM, we can also compute the DD for each instance. All in all, our model RVMIdeal has a stable structure and good generalization performance, and it can overcome overfitting and predict the DD at the same time.

References

Baesens, B., Gestel, T. V., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54, 627–635.
Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111.


Bishop, C., & Tipping, M. E. (2000). Variational relevance vector machines. In C. Boutilier & M. Goldszmidt (Eds.), 16th conference on uncertainty in artificial intelligence (pp. 46–53). San Mateo, CA: Morgan Kaufman. Chen, W.-H., & Shih, J.-Y. (2006). A study of Taiwan’s issuer credit rating systems using support vector machines. Expert Systems with Applications, 30, 427–435. Ederington, H. (1985). Classification models and bond ratings. Financial Review, 20, 237–262. Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Thirteenth international conference on machine learning (pp. 148–156). Bari, Italy: Morgan Kaufman. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Science, 55, 119–139. Galindo, J., & Tamayo, P. (2000). Credit risk assessment using statistical and machine learning: Basic methodology and risk modeling applications. Computational Economics, 15, 107–143. Graham, A., & Coyle, B. (2000). Corporate Credit Analysis: Credit Risk Management. New York: Global Professional Publishing. Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., & Wu, S. (2004). Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems, 37, 543–558. Huang, C.-L., Chen, M.-C., & Wang, C.-J. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 33, 847–856. Kim, J. W. (1993). Expert systems for bond rating: A comparative analysis of statistical, rule-based and neural network systems. Expert Systems, 10, 167–171. Lang, W. W., Mester, L. J., & Vermilyea, T. A. (2008). Competitive effects of Basel II on US bank credit card lending. Journal of Financial Intermediation, 17, 478–508. Lin, S. L. (2009). A new two-stage hybrid approach of credit risk in banking industry. Expert Systems with Applications, 36, 8333–8341. Lin, H.-T., & Li, L. (2005). Infinite ensemble learning with support vector machines. In European conference on machine learning (pp. 242–254). Springer-Verlag. Lin, H.-T., & Li, L. (2005). Novel distance-based svm Kernels for infinite ensemble learning. In Twelfth international conference on neural information processing (pp. 761–766). Springer-Verlag. Lin, H.-T., & Li, L. (2008). Support vector machinery for infinite ensemble learning. Journal of Machine Learning Research, 9, 285–312. Maher, J. J., & Sen, T. K. (1997). Predicting bond ratings using neural networks: A comparison with logistic regression. Intelligent Systems in Accounting, Finance and Management, 6, 59–72. Ratsch, G., & Onoda, T. (2001). Soft margins for adaboost. Journal of Machine Learning Research, 42, 287–320. Servigny, A. D., & Renault, O. (2004). Measuring and Managing Credit Risk. New York: McGraw-Hill Companies. Stecking, R., & Schebesch, K. (2003). Support vector machines for credit scoring: Comparing to and combining with some traditional classification methods. In M. Schader, W. Gau, & M. Vichi (Eds.), Between Data Science and Applied Data Analysis (pp. 604–612). Berlin: Springer. Thomas, L. C. (2000). A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers. International Journal of Forecasting, 16, 149–172. Tipping, M. (2001). Sparse bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244. Yu, L., Wang, S., & Lai, K. K. (2008). 
Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications, 34, 1434–1444. Yu, L., Yue, W., Wang, S., & Lai, K. K. (2010). Support vector machine based multiagent ensemble learning for credit risk evaluation. Expert Systems with Applications, 37, 1351–1360.