Expert Systems with Applications 36 (2009) 7515–7518
Integrating nonlinear graph based dimensionality reduction schemes with SVMs for credit rating forecasting

Shian-Chang Huang
Department of Business Administration, National Changhua University of Education, College of Management, No. 2, Shi-Da Road, Changhua 500, Taiwan
Keywords: Kernel graph embedding; Dimensionality reduction; Support vector machine; Multi-class classification; Credit rating
Abstract

By integrating graph based nonlinear dimensionality reduction with support vector machines (SVMs), this study develops a novel prediction model for credit rating forecasting. SVMs have been successfully applied in numerous areas and have demonstrated excellent performance. However, due to the high dimensionality and nonlinear distribution of the input data, this study employs a kernel graph embedding (KGE) scheme to reduce the dimensionality of the input data and enhance the performance of SVM classifiers. Empirical results indicate that the one-vs-one SVM with KGE outperforms other multi-class SVMs and traditional classifiers. Compared with other dimensionality reduction methods, the performance improvement owing to KGE is significant.
1. Introduction

Credit ratings assess the creditworthiness of an individual, a corporation, or even a country. Typically, a credit rating tells a lender or investor the probability of the borrower being able to pay back a loan. Consequently, credit ratings are important determinants of risk premiums and even of the marketability of corporate bonds. Credit rating forecasting has recently become a critical issue in the banking industry: banking institutions and their regulators seek precise internal credit systems to model the credit quality of their borrowers. Furthermore, the subprime mortgage crisis in the latter half of 2007 profoundly impacted the US banking sector, and the banks with the most accurate estimates of their credit risk will be the most profitable. The objective of this study is thus to develop a reliable and accurate prediction model for risk assessment.

The development of corporate credit rating prediction models has attracted considerable research interest in the academic and business communities. Many researchers have attempted to construct automatic classification systems using data mining methods, such as statistical and artificial intelligence techniques. However, due to the high dimensionality of the input variables (both financial and non-financial information), this study combines a kernel graph embedding (KGE) scheme proposed by Yan et al. (2007) with multi-class SVMs to enhance the predictions.

Numerous classification techniques have been adopted for credit scoring. These techniques include (1) traditional statistical methods, for example, discriminant analysis and logistic regression
(Steenackers & Goovaerts, 1989; Stepanova & Thomas, 2001) and Bayesian networks, (2) non-parametric statistical models, such as k-nearest neighbors (Henley & Hand, 1997), (3) decision trees (Yobas, Crook, & Ross, 2000), and (4) neural networks (Desai, Crook, & Overstreet, 1996; West, 2000; Yobas et al., 2000). Recently, the support vector machine (SVM) (Cristianini & Shawe-Taylor, 2000; Schoelkopf, Burges, & Smola, 1999; Vapnik, 1999), another form of neural network, has become increasingly popular and is currently regarded as a state-of-the-art technique for regression and classification applications. The formulation of an SVM embodies the structural risk minimization principle (a maximum margin classifier) and thus combines excellent generalization properties with a sparse model representation. SVMs exploit the idea of mapping the input data into a high dimensional reproducing kernel Hilbert space (RKHS) in which a linear classification is performed. However, public financial statements provide large amounts of data that can be used for corporate credit rating prediction, and such large-scale input data can make SVM classifiers infeasible owing to the curse of dimensionality. Consequently, one needs to select key features from the raw data to reduce the dimensionality of the classification problem.

Dimensionality reduction has been studied extensively in both the statistics and machine learning communities during recent decades. Among dimensionality reduction methods, the linear algorithms principal component analysis (PCA) and linear discriminant analysis (LDA) have been the two most popular because of their relative simplicity and effectiveness. However, as indicated by Yan et al. (2007), in many real world problems there is no evidence that the data are sampled from a linear subspace. This has motivated researchers to consider manifold based techniques for dimensionality reduction. Recently, various
manifold learning techniques, such as ISOMAP (Tenenbaum, de Silva, & Langford, 2000), locally linear embedding (LLE) (Roweis & Saul, 2000), and the Laplacian eigenmap (Belkin & Niyogi, 2001), have been proposed which reduce the dimensionality of a fixed input data set in a way that maximally preserves certain inter-point relationships. This research adopts the general framework of Yan et al. (2007), called kernel graph embedding (KGE), for dimensionality reduction. Their framework offers a unified view for understanding and explaining many popular dimensionality reduction algorithms. The kernelization of graph embedding applies the kernel trick to the linear graph embedding algorithm and can therefore handle data with nonlinear distributions.

To handle the high dimensionality of the input data, this study combines KGE with SVMs to increase rating accuracy. KGE reduces the dimensionality of the input data and simultaneously eliminates irrelevant features, which lowers the computational load of the SVMs and enhances forecasting accuracy. Moreover, this study applies three types of multi-class SVMs (one-vs-one, one-vs-rest, and multi-class SVM) to classify enterprise credit ratings and compares these SVM classifiers with traditional classifiers. Empirical results indicate that the performance of SVMs with KGE is promising, and the performance improvement owing to KGE is significant. The method developed here will help financial institutions assess their credit risks accurately and substantially reduce their losses.

The remainder of this paper is organized as follows: Section 2 describes the multi-class SVMs. Section 3 introduces the KGE algorithm. Section 4 describes the study data and discusses the empirical findings. Conclusions are given in Section 5.

2. Support vector machines

Support vector machines (SVMs) were proposed by Vapnik (1999). Based on the structural risk minimization (SRM) principle, SVMs seek to minimize an upper bound on the generalization error instead of the empirical error, as in other neural networks. An SVM classifier constructs a hyperplane separating the two classes (labeled y ∈ {−1, 1}) so that the margin (the distance between the hyperplane and the nearest point) is maximal. The SVM classification function is formulated as follows:
$$y = \operatorname{sign}\left(w^T \phi(x) + b\right), \qquad (1)$$
where φ(x) is the feature map, a nonlinear mapping from the input space to the feature space. The coefficients w and b are estimated by solving the following optimization problem:
$$\min_{w,\,b,\,\xi} \; R(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i, \qquad (2)$$
subject to
$$y_i\left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, l, \qquad (3)$$
$$\xi_i \ge 0, \quad i = 1, \ldots, l, \qquad (4)$$
where C is a prescribed parameter that controls the trade-off between the empirical risk and the smoothness of the model. Taking the Lagrangian and applying the conditions for optimality, the dual of this convex optimization problem can be formulated as follows:
$$\max_{\alpha} \; D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j), \qquad (5)$$
with constraints
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad (6)$$
$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad (7)$$
where the α_i are Lagrange multipliers, which are also the solution to the dual problem, and K(x_i, x_j) is the kernel function. The bias b follows from the complementarity Karush–Kuhn–Tucker (KKT) conditions. The decision function is given by
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b\right). \qquad (8)$$
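As a concrete illustration, the following minimal Python sketch trains a binary soft-margin SVM of the form described above using scikit-learn. The synthetic data, the RBF kernel, and the value of C are illustrative assumptions rather than settings taken from this study.

```python
# Minimal sketch of a binary soft-margin SVM (Eqs. (1)-(8)); toy data and
# parameters are assumptions for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                # 200 samples, 10 features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels y in {-1, +1}

# C controls the trade-off between the margin term ||w||^2 and the slack penalty.
clf = SVC(C=1.0, kernel="rbf")                # K(x_i, x_j) replaces the inner product
clf.fit(X, y)

# decision_function corresponds to sum_i alpha_i y_i K(x, x_i) + b in Eq. (8);
# its sign gives the predicted class, as in Eq. (1).
print(clf.decision_function(X[:3]))
print(clf.predict(X[:3]))
```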
The value of the kernel equals the inner product of the two vectors x and x_i in the feature space, i.e., K(x, x_i) = φ(x)^T φ(x_i). Any function satisfying Mercer's condition (Vapnik, 1999) can be used as the kernel function.

2.1. Multi-class support vector machine

One approach to the multi-class classification problem is to treat it as a collection of binary classification problems. In the one-against-rest method, k classifiers are constructed, one for each class; the nth classifier constructs a hyperplane between class n and the k − 1 remaining classes. A majority vote across the classifiers, or some other measure, is then applied to classify a new point: a point is assigned to the class for which the distance from the margin, in the positive direction (i.e., toward class "one" rather than class "rest"), is maximal. Alternatively, C(k, 2) = k(k − 1)/2 hyperplanes can be constructed, one separating each class from each other class, and a similar voting scheme applied; this is the one-against-one method. Both methods have been used widely in the support vector literature to solve multi-class classification problems.

Another way to solve multi-class problems is to construct a decision function by considering all classes at once (Weston & Watkins, 1999). One can generalize (2) to the following setting:
$$\min_{w,\,b,\,\xi} \; R(w, \xi) = \frac{1}{2}\sum_{m} \|w_m\|^2 + C \sum_{i=1}^{l} \sum_{m \neq y_i} \xi_i^m, \qquad (9)$$
with
$$w_{y_i}^T \phi(x_i) + b_{y_i} \ge w_m^T \phi(x_i) + b_m + 2 - \xi_i^m, \qquad (10)$$
$$\xi_i^m \ge 0, \quad i = 1, \ldots, l, \; m \in \{1, \ldots, k\} \setminus y_i. \qquad (11)$$
This gives the decision function:
$$f(x) = \arg\max_{i = 1, \ldots, k}\left(w_i^T \phi(x) + b_i\right). \qquad (12)$$
One can also find the solution to this optimization problem in the dual variables by finding the saddle point of the Lagrangian. This method is termed MSVM.
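The two decomposition strategies above can be sketched as follows; the toy data and parameters are illustrative assumptions, and scikit-learn's wrappers stand in for the SVM implementations compared in this study.

```python
# Sketch of one-vs-one and one-vs-rest decompositions for multi-class SVMs.
# Data, kernel choice, and C are assumptions for illustration only.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)   # k = 3 classes

# One-vs-one: k(k-1)/2 binary SVMs, one per pair of classes, combined by voting.
ovo = OneVsOneClassifier(SVC(kernel="poly", degree=2, C=1.0)).fit(X, y)

# One-vs-rest: k binary SVMs, each separating one class from the remaining k-1;
# a new point is assigned to the class with the largest decision value.
ovr = OneVsRestClassifier(SVC(kernel="poly", degree=2, C=1.0)).fit(X, y)

print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```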
3. Kernel graph embedding

In this section, we present the dimensionality reduction method of Yan et al. (2007) and Cai, He, and Han (2007). Given m samples {x_i}, i = 1, ..., m, in R^n, dimensionality reduction aims at finding {y_i}, i = 1, ..., m, in R^d with d ≪ n, such that y_i represents x_i. Over the past decades, many algorithms, both supervised and unsupervised, have been proposed for this problem, and they can all be interpreted within the general graph embedding framework of Yan et al. (2007).

Given a graph G with m vertices, each vertex represents a data point. Let W be a symmetric m × m matrix in which W_ij holds the weight of the edge joining vertices i and j. G and W can be defined so as to characterize certain statistical or geometric properties of the data set. The purpose of graph embedding is to represent each vertex of the graph as a low dimensional vector that preserves the similarities between vertex pairs, where similarity is measured by the edge weight. Let y = [y_1, y_2, ..., y_m]^T be the map from the graph to the real line. The optimal y minimizes
$$\sum_{i,j} (y_i - y_j)^2 W_{ij} \qquad (13)$$
under an appropriate constraint. This objective function incurs a heavy penalty if neighboring vertices i and j are mapped far apart; minimizing it therefore attempts to ensure that if vertices i and j are close, then y_i and y_j are close as well. Expanding the square and using the symmetry of W, we have
$$\sum_{i,j} (y_i - y_j)^2 W_{ij} = 2\sum_{i} y_i^2 D_{ii} - 2\sum_{i,j} y_i y_j W_{ij} = 2\, y^T L y, \qquad (14)$$
where L = D − W is the graph Laplacian (Chung, 1997) and D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ji. Finally, the minimization problem reduces to finding
$$y^{*} = \arg\min_{y^T D y = 1} y^T L y = \arg\min \frac{y^T L y}{y^T D y}. \qquad (15)$$
The constraint y^T D y = 1 removes an arbitrary scaling factor in the embedding. The optimal y's can be obtained by solving the minimum-eigenvalue solutions of the generalized eigen-problem L y = λ D y.
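A minimal sketch of this spectral embedding step is given below, assuming an illustrative k-nearest-neighbour heat-kernel weight matrix (not the paper's construction); it solves the generalized eigen-problem L y = λ D y with SciPy and keeps the eigenvectors with the smallest eigenvalues.

```python
# Sketch of Eq. (15): build L = D - W from a symmetric weight matrix W and
# solve L y = lambda * D y. The kNN heat-kernel graph below is an assumption.
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Symmetric kNN adjacency with heat-kernel weights (illustrative choice).
A = kneighbors_graph(X, n_neighbors=5, mode="distance").toarray()
W = np.exp(-A**2 / (2 * 0.5**2)) * (A > 0)
W = np.maximum(W, W.T)                    # symmetrize

D = np.diag(W.sum(axis=1))
L = D - W

# Generalized eigen-problem L y = lambda D y; eigh returns eigenvalues ascending.
eigvals, eigvecs = eigh(L, D)
embedding = eigvecs[:, 1:3]               # skip the trivial constant eigenvector
print(embedding.shape)                    # (100, 2) low dimensional coordinates
```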
If we choose a linear function, i.e., y_i = f(x_i) = a^T x_i, Eq. (15) can be rewritten as
$$a^{*} = \arg\min \frac{y^T L y}{y^T D y} = \arg\min \frac{a^T X L X^T a}{a^T X D X^T a}, \qquad (16)$$
where X = [x_1, ..., x_m]. The optimal a's are the eigenvectors corresponding to the minimum eigenvalues of the generalized eigen-problem X L X^T a = λ X D X^T a. If instead we choose a function in an RKHS, i.e.,
$$y_i = f(x_i) = \sum_{j=1}^{m} \alpha_j K(x_j, x_i), \qquad (17)$$
where K(x_j, x_i) is a Mercer kernel, Eq. (15) can be rewritten as
$$\alpha^{*} = \arg\min \frac{y^T L y}{y^T D y} = \arg\min \frac{\alpha^T K L K \alpha}{\alpha^T K D K \alpha}, \qquad (18)$$
where α = [α_1, ..., α_m]^T. The optimal α's are the eigenvectors corresponding to the minimum eigenvalues of the eigen-problem K L K α = λ K D K α.

The procedure of KGE is stated below:
1. Constructing the adjacency graph: let G denote a graph with m nodes, the ith node corresponding to the sample x_i. We construct the graph G through the following three steps to model the local structure as well as the label information: (a) put an edge between nodes i and j if x_i is among the p nearest neighbors of x_j or x_j is among the p nearest neighbors of x_i; (b) put an edge between nodes i and j if x_i shares the same label as x_j; (c) remove the edge between nodes i and j if the label of x_i differs from that of x_j.
2. Choosing the weights: W is a sparse symmetric m × m matrix with W_ij holding the weight of the edge joining vertices i and j. (a) If there is no edge between i and j, W_ij = 0. (b) Otherwise,
$$W_{ij} = \begin{cases} 1/l_k & \text{if } x_i \text{ and } x_j \text{ both belong to the } k\text{th class}, \\ \delta\, s(i,j) & \text{otherwise}, \end{cases}$$
where l_k is the number of labeled samples in the kth class and 0 < δ < 1 is a parameter that adjusts the weight between the supervised information and the unsupervised neighborhood information. s(i, j) is a function evaluating the similarity between x_i and x_j, with two variations: (1) the heat kernel, s(i, j) = exp(−‖x_i − x_j‖² / (2σ²)), and (2) the simple-minded weight, s(i, j) = 1.

4. Experimental results and analysis

The Taiwan Economic Journal (TEJ) is an important provider of data on securities markets in Taiwan. This study used all of the financial variables from the TEJ in forecasting enterprise credit ratings. Specifically, these financial variables cover the following categories of information: company scale, financial structure, solvency, business performance, profitability, financial coverage, and cash flow, for a total of 36 input variables. Most of these variables are derived from publicly disclosed information that companies are required to file with authorities such as the securities and futures commission, and they are important for financial analysis. Besides the financial variables, this study also included the historical rating of each company to improve rating accuracy. Information on enterprise credit ratings was also obtained from the TEJ, which provides a credit rating for every publicly traded Taiwanese company. A TEJ rating indicates a company's capacity to meet its financial commitments over a one-year period and is classified as low risk, medium risk, or high risk. A low risk rating indicates that an organization has an extremely strong capacity to meet its commitments, whereas a high risk rating indicates that an organization is likely to default.

This study tested six models for corporate credit rating: one-vs-one SVM, one-vs-rest SVM, multi-class SVM (MSVM), nearest neighbors, logistic regression, and Bayesian networks. For the SVMs, a polynomial kernel of degree two was selected owing to its good performance compared with other types of kernels. The data set comprises 88 Taiwanese high-technology companies traded on Taiwan's securities market, with five ratings for each company over the period from 2000 to 2004. The data set was randomly divided into 10 parts, and 10-fold cross-validation was applied to evaluate model performance.
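The following sketch outlines one way this pipeline could be assembled, assuming synthetic data in place of the TEJ variables, an RBF kernel inside KGE, a purely supervised weight matrix (all samples labeled), and a small ridge term to keep the generalized eigen-problem well conditioned; it is not the exact configuration used in this study.

```python
# Sketch: supervised kernel graph embedding (KGE) to 5 dimensions, followed by a
# one-vs-one SVM with a degree-2 polynomial kernel and 10-fold cross-validation.
# All data and parameter choices here are illustrative assumptions.
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def supervised_weights(y):
    """W_ij = 1/l_k when x_i and x_j share class k, else 0 (all samples labeled)."""
    W = np.zeros((len(y), len(y)))
    for k in np.unique(y):
        idx = np.where(y == k)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    return W

def kge_fit(X, y, d=5, gamma=0.1, eps=1e-6):
    """Solve K L K a = lambda K D K a and keep the d smallest-eigenvalue vectors."""
    K = rbf_kernel(X, X, gamma=gamma)
    W = supervised_weights(y)
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = K @ L @ K
    B = K @ D @ K + eps * np.eye(len(y))    # small ridge keeps B positive definite
    _, vecs = eigh(A, B)
    return vecs[:, :d]                       # columns are the alpha vectors

def kge_transform(X_train, X_new, alphas, gamma=0.1):
    """Embed new points via y_i = sum_j alpha_j K(x_j, x_i)."""
    return rbf_kernel(X_new, X_train, gamma=gamma) @ alphas

# Toy data standing in for the 36 financial ratios and three rating classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 36))
y = np.digitize(X[:, 0] + X[:, 1], [-0.8, 0.8])   # classes 0, 1, 2

errors = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    alphas = kge_fit(X[tr], y[tr], d=5)
    Z_tr = kge_transform(X[tr], X[tr], alphas)
    Z_te = kge_transform(X[tr], X[te], alphas)
    clf = OneVsOneClassifier(SVC(kernel="poly", degree=2, C=1.0)).fit(Z_tr, y[tr])
    errors.append(1 - accuracy_score(y[te], clf.predict(Z_te)))
print("mean CV error rate:", np.mean(errors))
```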
4.1. Performance comparison

Table 1 shows that, on average, the one-vs-one SVM outperforms the other SVM classifiers. Compared with the traditional classifiers, Table 1 also reveals that the one-vs-one SVM performs best. Next, KGE is employed to enhance the performance of these SVM classifiers. We compared the performance improvement of KGE with that of PCA and independent component analysis (ICA; Hyvärinen, Karhunen, & Oja, 2001), setting the dimension of the subspace to five for all of these schemes; the results are listed in Table 2. We also compared the performance improvement of KGE with that of a well-known feature selection algorithm, the recursive feature elimination (RFE) method proposed by Guyon, Weston, Barnhill, and Vapnik (2002). The RFE algorithm recursively eliminates input variables; here it was used to identify the most important 5, 10, 15, and 20 feature subsets for comparison, and the results are listed in Table 3. Tables 2 and 3 also list the pure SVM models, without any dimensionality reduction or feature selection, for comparison.

Table 2 shows that only the KGE algorithm significantly improves the performance of these SVM classifiers. The one-vs-one SVM with KGE has better accuracy rates than the other multi-class SVM classifiers. These results demonstrate that in real rating problems the data are not sampled from a linear subspace; hence linear algorithms such as PCA and ICA fail to extract the key information contained in the data, and manifold based techniques for dimensionality reduction are more effective for credit rating problems.
Table 1
Forecasting performance (error rate) of every model

Model                  2000     2001     2002     2003     2004
Nearest neighbors      0.2661   0.2232   0.2286   0.2446   0.3427
Logistic regression    0.2564   0.2676   0.1944   0.2557   0.2597
Bayesian network       0.1923   0.1690   0.1806   0.2208   0.2468
1-vs-1 Pure SVM        0.2161   0.2232   0.1625   0.1929   0.1786
1-vs-rest Pure SVM     0.2143   0.1821   0.1750   0.2679   0.2661
Pure MSVM              0.2375   0.2804   0.2768   0.3304   0.2143
Table 2
Performance comparison (error rate) of three dimensionality reduction schemes

Model                  2000     2001     2002     2003     2004
1-vs-1 Pure SVM        0.2161   0.2232   0.1625   0.1929   0.1786
1-vs-rest Pure SVM     0.2143   0.1821   0.1750   0.2679   0.2661
Pure MSVM              0.2375   0.2804   0.2768   0.3304   0.2143
1-vs-1 + PCA           0.1661   0.2643   0.2393   0.2339   0.2964
1-vs-rest + PCA        0.2036   0.2500   0.2411   0.2071   0.2821
MSVM + PCA             0.2786   0.2214   0.4375   0.2589   0.3946
1-vs-1 + ICA           0.2750   0.1929   0.2196   0.3071   0.2911
1-vs-rest + ICA        0.3250   0.1929   0.2161   0.3714   0.4268
MSVM + ICA             0.3518   0.2214   0.2054   0.3589   0.3304
1-vs-1 + KGE           0.0625   0.1536   0.0536   0.1286   0.0500
1-vs-rest + KGE        0.0875   0.1393   0.0804   0.1661   0.0625
MSVM + KGE             0.2161   0.3214   0.2732   0.1679   0.1500
Table 3
Performance comparison (error rate) of KGE and RFE

Model                Features   2000     2001     2002     2003     2004
1-vs-1 Pure SVM      All        0.2161   0.2232   0.1625   0.1929   0.1786
1-vs-rest Pure SVM   All        0.2143   0.1821   0.1750   0.2679   0.2661
Pure MSVM            All        0.2375   0.2804   0.2768   0.3304   0.2143
1-vs-1 + RFE         5          0.1250   0.2536   0.0250   0.1393   0.1536
1-vs-1 + RFE         10         0.1375   0.1946   0.0679   0.0893   0.1393
1-vs-1 + RFE         15         0.1250   0.2214   0.1107   0.1018   0.1268
1-vs-1 + RFE         20         0.1750   0.2625   0.1339   0.1554   0.1518
1-vs-rest + RFE      5          0.1268   0.2786   0.0393   0.1536   0.1893
1-vs-rest + RFE      10         0.1625   0.1946   0.0929   0.1661   0.1929
1-vs-rest + RFE      15         0.1500   0.1679   0.1089   0.1643   0.1393
1-vs-rest + RFE      20         0.1893   0.1643   0.1500   0.2286   0.2429
MSVM + RFE           5          0.3071   0.3054   0.3304   0.2536   0.2661
MSVM + RFE           10         0.2518   0.2929   0.3161   0.2661   0.2643
MSVM + RFE           15         0.2375   0.2643   0.3036   0.3179   0.2786
MSVM + RFE           20         0.2625   0.2786   0.2750   0.2679   0.2768
1-vs-1 + KGE         5          0.0625   0.1536   0.0536   0.1286   0.0500
1-vs-rest + KGE      5          0.0875   0.1393   0.0804   0.1661   0.0625
MSVM + KGE           5          0.2161   0.3214   0.2732   0.1679   0.1500
Comparing KGE with RFE, Table 3 reveals that the one-vs-one SVM with KGE is the most cost-efficient model, because it uses the lowest-dimensional subspace and achieves the best accuracy rate. Clearly, the accuracy rates of the pure one-vs-one, one-vs-rest, and MSVM models, which contain all of the variables, are lower than those of the models containing fewer variables; that is, more information does not necessarily improve accuracy. Table 3 also shows that the performance improvement owing to RFE is lower than that owing to KGE, regardless of whether 5, 10, 15, or 20 key features are selected by RFE. The subspace formed by KGE contains sufficient information and latent structure to discriminate and represent the data, whereas the feature subset selected by RFE does not.
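For reference, a minimal sketch of an RFE-style baseline is shown below, assuming a linear SVM to rank features by weight magnitude; the data and parameter values are illustrative and do not reproduce the configuration of this study.

```python
# Sketch of recursive feature elimination: a linear SVM ranks features and the
# least important ones are removed until a target subset size remains.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 36))                      # stand-in for the 36 ratios
y = (X[:, 0] - X[:, 5] + X[:, 12] > 0).astype(int)

# RFE needs an estimator exposing coef_, hence the linear kernel here.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=5, step=1).fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.support_))
```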
5. Conclusions

Corporate credit ratings provide important information on credit risk for banks and investors in financial markets. This study integrated KGE with SVMs to create a novel classifier for rating prediction. The performance of the hybrid model was examined using a data set comprising a large amount of financial information on Taiwanese high-technology companies. The results show that the new classification model is more accurate than pure SVM classifiers and outperforms traditional techniques when applied to multiple-class credit rating problems.

Kernel graph embedding effectively enhances the classification performance of SVMs. Empirical results showed that the one-vs-one SVM with KGE outperforms the other multi-class SVMs. The accuracy rates of the three types of SVMs using all of the input variables are lower than those of the models using a smaller number of more important latent variables. Compared with traditional dimensionality reduction schemes, the performance improvement resulting from KGE is significant. Restated, nonlinear dimensionality reduction is a key technique for improving the performance of multi-class classifiers.

Future research may consider non-financial and macroeconomic variables as SVM inputs. However, including more information does not guarantee higher accuracy; in this situation, dimensionality reduction and feature selection are important strategies for enhancing classifier performance. Which dimensionality reduction schemes can be efficiently incorporated with SVM classifiers needs further study.

References

Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems 14 (pp. 585–591). Cambridge, MA: MIT Press.
Cai, D., He, X., & Han, J. (2007). Spectral regression for dimensionality reduction. Technical Report No. UIUCDCS-R-2007-2856, Department of Computer Science, University of Illinois at Urbana-Champaign.
Chung, F. R. K. (1997). Spectral graph theory. Regional conference series in mathematics (Vol. 92). AMS.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.
Desai, V. S., Crook, J. N., & Overstreet, G. A., Jr. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95, 24–37.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
Henley, W. E., & Hand, D. J. (1997). Construction of a k-nearest neighbour credit-scoring system. IMA Journal of Management Mathematics, 8, 305–321.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Wiley Interscience.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Schoelkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods – Support vector learning. Cambridge, MA: MIT Press.
Steenackers, A., & Goovaerts, M. J. (1989). A credit scoring model for personal loans. Insurance: Mathematics and Economics, 8, 31–34.
Stepanova, M., & Thomas, L. C. (2001). PHAB scores: Proportional hazards analysis behavioural scores. The Journal of the Operational Research Society, 52, 1007–1016.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Vapnik, V. N. (1999). The nature of statistical learning theory (2nd ed.). Springer.
West, D. (2000). Neural network credit scoring models. Computers and Operations Research, 27, 1131–1152.
Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of ESANN'99.
Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51.
Yobas, M. B., Crook, J. N., & Ross, P. (2000). Credit scoring using neural and evolutionary techniques. IMA Journal of Management Mathematics, 11, 111–125.