Expert Systems with Applications 30 (2006) 479–487 www.elsevier.com/locate/eswa
Failure prediction with self organizing maps

Johan Huysmans a,*, Bart Baesens c,a, Jan Vanthienen b,d, Tony van Gestel d

a Faculty of Economics and Applied Economics, Katholieke Universiteit Leuven, Naamsestraat 69, B-3000 Leuven, Belgium
b Credit Risk Modelling, Group Risk Management, Dexia Group, Square Meeus 1, B-1000 Brussel, Belgium
c School of Management, University of Southampton, Southampton SO17 1BJ, UK
d Department of Electrical Engineering, ESAT-SCD-SISTA, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Abstract

The significant growth of consumer credit has resulted in a wide range of statistical and non-statistical methods for classifying applicants into 'good' and 'bad' risk categories. Self organizing maps (SOMs) have existed for decades and, although they have been applied in various areas, little research has investigated their appropriateness for credit scoring. This is mainly due to the unsupervised character of the SOM's learning process. In this paper, the potential of SOMs for credit scoring is investigated. First, the powerful visualization capabilities of SOMs for exploratory data analysis are discussed. Afterwards, it is shown how a trained SOM can be used for classification and how the basic SOM algorithm can be integrated with supervised techniques like the multi-layered perceptron. Two different methods of integration are proposed. The first technique improves the predictive power of individual neurons of the SOM with the aid of supervised classifiers. The second is similar to a stacking model, in which the output of a supervised classifier is entered as an input variable for the SOM. The classification accuracy of both approaches is benchmarked against previously reported results. © 2005 Elsevier Ltd. All rights reserved.

Keywords: Self organizing maps; Classification; Credit scoring
1. Introduction

One of the key decisions financial institutions have to make is whether or not to grant a loan to a customer. This decision boils down to a binary classification problem which aims at distinguishing good payers from bad payers. Until recently, this distinction was made using a judgmental approach by merely inspecting the application form details of the applicant. The credit expert then decided upon the creditworthiness of the applicant, using all possible relevant information concerning his sociodemographic status, economic conditions, and intentions. The advent of data storage technology has enabled financial institutions to store all information regarding the characteristics and repayment behavior of credit applicants electronically. This has motivated the need to automate the credit granting decision by using statistical or machine learning algorithms. Numerous methods have been proposed in the
* Corresponding author.
E-mail addresses: [email protected] (J. Huysmans), [email protected] (B. Baesens), [email protected] (J. Vanthienen), [email protected] (T. van Gestel).

0957-4174/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2005.10.005
literature to develop credit-risk evaluation models. These models include traditional statistical methods (e.g. logistic regression (Steenackers & Goovaerts, 1989) and discriminant analysis (Malhotra & Malhotra, 2002)), nonparametric statistical models (e.g. k-nearest neighbor (Henley & Hand, 1997) and classification trees (Davis, Edelman, & Gammerman, 1992)), neural network models (Atiya, 2001; Chen & Huang, 2003; Yobas & Crook, 2000) and support vector machines (Baesens et al., 2003; Van Gestel, Baesens, Garcia, & Van Dijcke, 2003). While newer approaches, like neural networks and support vector machines, offer high predictive accuracy, it is often difficult to understand the motivation behind their classification decisions.

In this paper, the appropriateness of SOMs for credit scoring is investigated. The powerful visualization possibilities of this neural network model offer a significant advantage for understanding its decision process. However, because the training process of the SOM is unsupervised, its predictive power initially lies slightly below the classification accuracy of several supervised classifiers. In the rest of the paper, we investigate how the SOM can be integrated with supervised classifiers. Two distinct approaches are adopted. In the first approach, the classification accuracy of individual neurons is improved through the training of a separate supervised classifier for each of these neurons. The second approach is similar to a stacking model: the output of a supervised classifier is used as input to the SOM. Both models are
tested on two data sets obtained from a major Benelux financial institution. Both models are benchmarked against the results of other classifiers reported in Baesens et al. (2003).

2. Self organizing maps

SOMs were introduced in 1982 by Teuvo Kohonen (1982) and have been used in a wide array of applications, like the visualization of high-dimensional data (Vesanto, 1999), clustering of text documents (Honkela, Kaski, Lagus, & Kohonen, 1997), identification of fraudulent insurance claims (Brockett, Xia, & Derrig, 1998) and many others. An extensive overview of successful applications can be found in Kohonen (1995) and Deboeck and Kohonen (1998). A SOM is a feedforward neural network consisting of two layers. The neurons of the output layer are usually ordered in a low-dimensional grid, and each unit in the input layer is connected to all neurons in the output layer, with a weight attached to each of these connections. This is equivalent to associating with each output neuron a weight vector of the same dimensionality as the input space. When a training vector $\vec{x}$ is presented, the weight vector $\vec{w}_c$ of each neuron $c$ is compared with $\vec{x}$; one commonly opts for the Euclidean distance between both vectors as the distance measure. The neuron that lies closest to $\vec{x}$ is called the 'winner' or the Best Matching Unit (BMU). The weight vector of the BMU and its neighbors in the grid are adapted with the following learning rule:

$$\vec{w}_c = \vec{w}_c + h(t)\,\Lambda_{\mathrm{winner},c}(t)\,(\vec{x} - \vec{w}_c) \tag{1}$$
In this expression, $h(t)$ represents a learning rate that decreases during training. $\Lambda_{\mathrm{winner},c}(t)$ is the so-called neighborhood function, which decreases as the distance in the grid between neuron $c$ and the winner unit becomes larger. Often a Gaussian function centered around the winner unit, with a radius that decreases during training, is used as the neighborhood function. The decreasing learning rate and neighborhood radius result in a stable map that does not change substantially after a certain amount of training. From the learning rule, it can be seen that the neurons move towards the input vector and that the magnitude of the update is determined by the neighborhood function. Because units that are close to each other in the grid receive similar updates, the weights of these neurons will resemble each other and the neurons will be activated by similar input patterns. The winner units for similar input vectors will thus be close to each other, and self organizing maps are therefore often called topology-preserving maps.
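To make the training procedure concrete, the following sketch implements the update rule of Eq. (1) for a rectangular grid. It is a minimal illustration in Python/NumPy, not the SOM toolbox implementation used in this paper; all function and variable names are our own, and the linear decay schedules for the learning rate and the neighborhood radius are one common choice among many.

```python
import numpy as np

def train_som(X, grid_h=4, grid_w=6, n_epochs=20, lr0=0.5, radius0=3.0, seed=0):
    """Minimal SOM training loop implementing the learning rule of Eq. (1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # grid coordinates of the output neurons (used by the neighborhood function)
    coords = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)], dtype=float)
    W = rng.normal(size=(grid_h * grid_w, d))        # one weight vector per neuron
    T = n_epochs * n                                 # total number of update steps
    t = 0
    for _ in range(n_epochs):
        for x in X[rng.permutation(n)]:
            lr = lr0 * (1.0 - t / T)                     # decreasing learning rate h(t)
            radius = max(radius0 * (1.0 - t / T), 0.5)   # shrinking neighborhood radius
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # Euclidean BMU search
            grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            L = np.exp(-grid_d2 / (2.0 * radius ** 2))   # Gaussian neighborhood function
            W += lr * L[:, None] * (x - W)               # Eq. (1), applied to all neurons
            t += 1
    return W, coords
```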
3. Related research

In Serrano-Cinca (1996), a model based on self organizing maps is used to predict corporate bankruptcy. A data set containing 129 observations and 4 variables was divided into a training and a test data set, with the proportion between bankrupt and solvent companies being almost equal. A 12×12 map was trained and divided into a zone of bankruptcy and a zone of solvency. This division of the map was obtained by labelling each neuron with the label of the most similar training example. Unseen test observations were classified by calculating the distance between the neurons of the map and the observations: if the most-active neurons were in the solvent zone, the observation was classified as good. It is concluded that the percentage of correctly classified observations is comparable with the accuracy of a linear discriminant analysis and several multi-layered perceptrons. The author's conclusion is promising for the SOM: the flexibility of this neural model to combine with and to adapt to other structures, whether neural or otherwise, augurs a bright future for this type of model.

In Kiviluoto (1998), several SOM-based models for predicting bankruptcies are evaluated. The first of the models, SOM-1, is very similar to the model described above, but instead of assigning each neuron the label of the most similar observation, a voting scheme is used: for each neuron of the map, the probability of bankruptcy is estimated as the number of bankrupt companies projected onto that node divided by the total number of companies projected onto that neuron. A second, more complex model was also proposed (SOM-2). It consists of a small variation on the basic SOM algorithm as explained above. Each input vector consists of two types of variables: the financial indicators and the bankruptcy indicator. Only the financial indicators are used when searching for the BMU. Afterwards, the weights are updated with the traditional learning rule from Eq. (1). These weight updates are made not only for the financial indicators, but also for the bankruptcy indicator. The weight of the bankruptcy indicator after training is used as an estimate of the conditional probability of bankruptcy given the neuron. SOM-1 was clearly outperformed by other classifiers, like LDA and LVQ. SOM-2 performed much better and, more importantly, its classification accuracy was quite insensitive to the map grid size.

In Tan, van den Berg, and van den Bergh (2002), SOMs were used to predict creditworthiness ratings for companies. First, a clustering approach was followed, whereby it was investigated whether the different clusters are able to separate the companies according to their creditworthiness rating. In a second step, a k-nearest neighbor technique is used on the trained SOM to obtain the actual forecasts.

4. Description and preprocessing of the data

For this application, two different data sets were at our disposal. The characteristics of these data sets are summarized in Table 1.

Table 1
Description of the data sets

Name    Number of obs.   Number of variables   Good/bad   Best classifier       Sens./spec.
Bene1   3123             27                    67:33      RBF LS-SVM (73.1%)    83.9%/52.6%
Bene2   7190             27                    70:30      MLP (75.1%)           86.7%/48.1%

The same data sets were used in a benchmarking
study of different classification algorithms (Baesens et al., 2003). In this benchmarking study, two thirds of the data were used for training and one third as test set. The same training and test sets are used in this paper. For the Bene1 data set, an RBF LS-SVM obtained a classification accuracy of 73.1%. For the Bene2 data set, a multi-layer perceptron showed the best classification performance (75.1%). Additional measures like sensitivity and specificity for these classifiers are also given in Table 1. Sensitivity measures the proportion of good risks that are correctly identified, while specificity measures the proportion of bad risks that are correctly classified. Both data sets contain several categorical variables, like goal of the loan and residential status. Thomas (2000) mentions two ways of dealing with this kind of variable. The first is the creation of binary dummy variables for each possible value of the attribute. The second, which avoids the creation of a large number of dummies, consists of performing a weight of evidence encoding. For each attribute value, the number of good and bad risks are counted and the weight of evidence is calculated with the following formula:

$$\mathrm{WOE} = \ln\frac{p(\mathrm{value}\,|\,\mathrm{good})}{p(\mathrm{value}\,|\,\mathrm{bad})} \tag{2}$$
where $p(\mathrm{value}\,|\,\mathrm{good})$ is the number of goods that have this value for the attribute divided by the total number of goods, and $p(\mathrm{value}\,|\,\mathrm{bad})$ is the number of bads having this value for the attribute divided by the total number of bads. The categorical answers are replaced by these weights of evidence; higher weights indicate that the attribute value corresponds with good risks. After performing the weights of evidence encoding for the categorical variables, an additional normalization was applied such that each variable has a mean of zero and a variance of one. This normalization is necessary because the training algorithm uses Euclidean distances to find the BMU; if the variables are not normalized, some will have too large an influence during this process (Kohonen, 1995).
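As an illustration, a weights of evidence encoder following Eq. (2) could look as follows. This is a minimal sketch with names of our own choosing; a production version would also need to handle attribute values that occur only among goods or only among bads, for which the ratio in Eq. (2) is undefined.

```python
import numpy as np

def woe_encode(values, good):
    """Replace each categorical value by its weight of evidence, Eq. (2).
    `values`: array of categorical answers; `good`: boolean array, True = good risk."""
    values = np.asarray(values)
    good = np.asarray(good, dtype=bool)
    n_good, n_bad = good.sum(), (~good).sum()
    woe = {}
    for v in np.unique(values):
        mask = values == v
        p_good = (mask & good).sum() / n_good    # share of goods with this value
        p_bad = (mask & ~good).sum() / n_bad     # share of bads with this value
        woe[v] = np.log(p_good / p_bad)          # higher WOE -> value associated with goods
    return np.array([woe[v] for v in values])

# after encoding, standardize every column to zero mean and unit variance,
# as required by the Euclidean BMU search:
# X = (X - X.mean(axis=0)) / X.std(axis=0)
```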
5. Exploratory data analysis

5.1. Visualization of the SOM

SOMs have mainly been used for exploratory data analysis and clustering. This is largely due to the unsupervised character of their learning algorithm and the excellent visualization possibilities they offer. In this section, the basic SOM algorithm is applied to the Bene1 data set; similar conclusions were drawn for the Bene2 data set. We start with an overview of the different visualization possibilities of SOMs. Obtaining good classification accuracy is not the primary goal here; only afterwards, when integrating SOMs with supervised classifiers, do we try to obtain good classification accuracy in combination with a good understanding of the decision process. A map of 4×6 neurons is used in the rest of this section because it is small enough to be conveniently visualized. All analyses are performed with the SOM toolbox for Matlab (Helsinki University of Technology, 2003).
Fig. 1. U-matrix.
When performing a classification task, one hopes that all the observations from one category are significantly different from observations belonging to another category. For a SOM, this assumption would result in observations belonging to different categories being projected onto different neurons of the map. With SOMs, the easiest way to examine whether there are distinct groups present in a data set is based on the distance matrix. This matrix represents the distances between neighboring neurons of the map. For the Bene1 data set, with a grid of 4×6, the distance or U-matrix is given in Fig. 1. For each neuron, the distance with its six neighbors is indicated with a grayscale color: small and large distances are visualized in black and white, respectively. Dark spots on the map separated by bright ribbons indicate the presence of distinct groups in the data. We conclude, based on Fig. 1, that there are no significantly different groups present in the Bene1 data set.

To examine whether 'good' and 'bad' risk observations are projected onto different units, we can calculate for each observation the winner neuron. For the Bene1 data set, this results in Fig. 2, which gives, for each neuron, the number of good and bad risk observations that were projected onto it. For example, the upper-left neuron was the BMU for 219 training observations, of which 122 were good and 97 were bad.

Fig. 2. Number of hits per neuron (based on training set).

The same information is also given in the left part of Fig. 3. In this figure, the size of the bar indicates the number of good and bad observations projected onto each neuron. Notice that the scale of the bars is different for the 'good' and 'bad' risk categories. The right part of Fig. 3 contains the same information, but this time for the unseen test data. From the graphs, it can be noticed that bad risks tend to be projected onto the neurons in the upper half of the grid, but that the SOM is not able to achieve a clear separation. This corresponds with the results from Baesens et al. (2003): even powerful techniques like support vector machines are not able to obtain a very high degree of accuracy on the Bene1 data set.

Fig. 3. Number of hits per neuron (dark gray bars indicate good risks; light gray bars represent bad risks).

It is useful to investigate to which extent each of the input variables contributes to the map. Component planes are an excellent aid for this kind of analysis. A component plane can be drawn for each input variable and shows the weights that connect each neuron with the particular input variable (Vesanto, 1999). The component plane for the variable 'number of years a person has been client' is shown in Fig. 4. From the component plane, we conclude that recent clients tend to be projected on the upper part of the map, which corresponds with the region where bad risks are mostly projected. The variable is therefore a good predictor for the category of a client. The same analysis can be performed for each of the other variables.

Fig. 4. Component plane for variable 'number of years a person has been client'.

5.2. Clustering of the SOM

In the preceding section, we have shown that SOMs are an excellent tool for the exploration of large data sets. But if the SOM grid itself consists of numerous neurons, analysis can be facilitated by clustering similar neurons into groups (Vesanto & Alhoniemi, 2000). By first applying the SOM algorithm, the computation time can be significantly reduced compared to a direct clustering of the input data. For our credit scoring study, a map with 400 neurons was created. The 400 neurons were first clustered with Ward's hierarchical method. The centroids of the clusters found by this method were used as initial seeds for the k-means clustering algorithm that was applied afterwards. The result of this procedure, assuming five clusters are present in the data, is given in Fig. 5.

Fig. 5. Clusters found with k-means algorithm.
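A sketch of this two-stage procedure is given below, using SciPy's Ward linkage and scikit-learn's k-means in place of the Matlab tooling used in the paper. The function name and the choice of five clusters mirror the text; everything else is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def cluster_codebook(W, n_clusters=5):
    """Cluster the SOM codebook vectors W (one row per neuron):
    Ward's hierarchical clustering supplies the initial centroids for k-means."""
    Z = linkage(W, method="ward")
    ward_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    seeds = np.vstack([W[ward_labels == k].mean(axis=0)
                       for k in range(1, n_clusters + 1)])
    km = KMeans(n_clusters=n_clusters, init=seeds, n_init=1).fit(W)
    return km.labels_   # cluster index per neuron
```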
Several supervised methods for labelling the clusters have been proposed: a number of labelled training examples are projected onto the map and, depending on the distances from the neurons to these observations, the map is labelled. This procedure was used by Serrano-Cinca (1996) to divide a SOM into a zone of solvent neurons and a zone of bankruptcy. In contrast to these supervised approaches, several unsupervised labelling techniques have been proposed (Azcarraga et al., forthcoming; Lagus & Kaski, 1999; Rauber & Merkl, 1999). We shortly discuss the method presented in Azcarraga et al. (forthcoming). After performing a clustering of the trained map, the unlabelled training examples are projected again onto the map and it is recorded which observations are assigned to each cluster. Next, for each cluster the so-called salient dimensions are sought. These are defined as the variables that have significantly different values for the observations belonging and not belonging to that cluster. Finally, a human specialist manually interprets the salient dimensions of each cluster and assigns a descriptive label. We have not adopted this statistical approach, but assign descriptive labels to the clusters by visually comparing them with the component planes. Fig. 6 represents the component plane of the variable 'goal of the loan'. It can easily be seen that neurons with a high value for this variable tend to be projected on the lower cluster. Recall that categorical variables were encoded using the weights of evidence framework. For the variable 'goal of the loan', high values correspond with 'new car' and 'real estate'. A very large part of the applicants did not specify the goal of the loan or the goal was the purchase of a
secondhand car; these applicants tend to be projected on the other clusters. The same analysis can be performed for each of the other clusters and variables. For most of the clusters, we find a component plane that shows a high degree of similarity with the cluster, and these variables can therefore be used as the cluster's salient dimensions.

Fig. 6. Component plane for variable 'goal of the loan'.

6. Classification

The SOMs that we created can also be used for classification. In Serrano-Cinca (1996), a SOM is created and each neuron is assigned the label of the closest training observation. Predictions for the test data are based on the label of their BMU. Using the same labelling on our map of the Bene1 data results in 4 nodes being assigned the bad status and 20 nodes the good status. The labelling is shown in Fig. 7(a). It can be seen that most of the nodes labelled 'bad' are situated in the lower part of the map. From Fig. 3, however, we know that most bad risk observations are projected on the upper half of the map. The accuracy, specificity and sensitivity of this classification method are therefore rather low (57.5, 22.3 and 76.1%, respectively). Changes in grid size do not considerably alter these results. For the Bene2 data set, with a grid of 6×4, accuracy, specificity and sensitivity are 66.1, 14.5 and 88.3%, respectively. These numbers are considerably below the performance of several supervised classifiers reported in Baesens et al. (2003). Instead of using only the closest training observation for labelling each neuron, more sophisticated techniques, like k-nearest neighbor, might prove useful. A second way of labelling the neurons was proposed in Kiviluoto (1998): each node receives the label of the class from which most training observations are projected on the node.

Fig. 7. Classification of neurons with (a) the nearest-observation method and (b) the majority-vote method (white: nodes assigned the bad status; black: nodes assigned the good status).
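The two labelling schemes can be sketched as follows. This is our own minimal NumPy rendering, not the paper's code; `W` holds one codebook vector per neuron and labels are coded 1 for good and 0 for bad.

```python
import numpy as np

def label_nearest_observation(W, X_train, y_train):
    """Serrano-Cinca (1996): each neuron takes the label of its closest training example."""
    d = ((W[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d, axis=1)]

def label_majority_vote(W, X_train, y_train):
    """Kiviluoto (1998): each neuron takes the majority label of the observations
    for which it is the BMU; neurons without hits keep the overall majority label."""
    d = ((X_train[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    bmu = np.argmin(d, axis=1)                       # BMU index per training observation
    labels = np.full(len(W), int(y_train.mean() >= 0.5))
    for c in range(len(W)):
        hits = y_train[bmu == c]
        if len(hits) > 0:
            labels[c] = int(hits.mean() >= 0.5)
    return labels
```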
This majority-vote method will always result in a greater classification accuracy on the training data. Fig. 7(b) shows this labelling for the Bene1 data with a 6×4 grid. The accuracy, specificity and sensitivity of this map are 71.0, 44.6 and 84.9%, respectively. For the Bene2 data, with the same map size, classification performance was 69.7%, because the model assigns the 'good' label to all but one neuron and will therefore mainly just predict the majority class. Several improvements to the above method have been proposed. For example, if the cost of misclassifying a bad risk is twice the misclassification cost of a good risk, we can assign a neuron the bad label if at least one third of the observations projected onto that node are bad. This will result in more neurons being assigned the bad label: specificity will increase, while accuracy and sensitivity will decrease. This is closely related to considering the ratio between 'good' and 'bad' training observations for which the neuron is the BMU as the continuous output of the SOM, similar to SOM-1 in Kiviluoto (1998). For the Bene1 data, an observation projected onto the upper-left neuron has a probability of 55% (= 122/(122 + 97)) of being good. By applying different thresholds, a ROC curve analysis can be performed. A ROC curve is a two-dimensional plot of the sensitivity on the Y-axis versus (1 − specificity) on the X-axis for different values of the threshold. A ROC curve that lies close to the 45° diagonal corresponds with a random predictor; the more the ROC curve diverges from this diagonal, the better the performance of its classifier. The area under the ROC curve (AUC) is convenient for comparing different classifiers. For the Bene1 data, an AUC of 73.4% was found. This result is better than that of a classifier like C4.5, but worse than most other classifiers. For Bene2, an AUC of 62.48% was found.

Besides overall accuracy, it is also interesting to investigate where on the map most of the misclassifications occur. Fig. 8 gives this overview for the Bene1 data set. We see that most of the misclassifications occur in the first neuron of the top row of the map. Fifty-three bad risk observations from the test set are projected on this neuron; because the node received the label good, all 53 observations will be wrongly classified. But the absolute number of misclassifications might not be a good indicator for the accuracy in each node: suppose that thousands of good observations were also projected onto the same neuron, then the misclassification of these 53 observations is not so bad. Fig. 9 gives an overview of the classification accuracy in each node for the Bene1 data. Dark nodes are neurons with low classification accuracy; the size of a neuron indicates the number of observations for which that neuron is the BMU. We can see that some neurons are the BMU for many observations, while others are the BMU for only a few examples. The presence of large, dark neurons in Fig. 9 indicates a bad classification accuracy of the map. For the Bene1 data set, it can be seen that the lower part of the map has a good classification accuracy. The upper half of the grid shows worse accuracy; for some nodes, the accuracy is below 50%. Inverting the labelling of these neurons would improve the percentage correctly classified. Fortunately, only few observations are projected onto these nodes.
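A sketch of this continuous-output variant: estimate P(good | neuron) from the training hits and score test observations by the probability attached to their BMU, after which any threshold (or an AUC computation) can be applied. The names and the fallback to the class prior for empty neurons are our assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def neuron_good_probability(bmu_train, y_train, n_neurons):
    """P(good | neuron) as the share of good training hits per neuron,
    e.g. 122 / (122 + 97) for the upper-left neuron of the Bene1 map."""
    p = np.full(n_neurons, y_train.mean())   # class prior for neurons without hits
    for c in range(n_neurons):
        hits = y_train[bmu_train == c]
        if len(hits) > 0:
            p[c] = hits.mean()
    return p

# scoring and ROC analysis (bmu_test: BMU index of every test observation):
# scores = neuron_good_probability(bmu_train, y_train, n_neurons)[bmu_test]
# auc = roc_auc_score(y_test, scores)   # sweeping thresholds over `scores` traces the ROC curve
```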
Fig. 8. Number of misclassified observations with the 'majority vote' method (Bene1 test data).
From Fig. 9, we conclude that many observations are projected onto the first neuron of the top row and that not all of these observations belong to the category 'good', because classification accuracy is low.

In the following section, a more detailed model is elaborated. Observations that are projected onto neurons with low accuracy will not receive the standard labelling of these nodes, but will instead be classified by independent models. If a SOM is used as the new model, a hierarchy of SOMs for prediction is obtained.

The above techniques are all based on creating one map using all observations; the labels are not used during training. But a different approach is possible. Instead of training one map, we can train two maps: one trained on the observations labelled 'good', the other trained on the 'bad' observations. The labels of the observations are hereby implicitly used during training. Classification with this approach can be done as follows. For each new observation, we calculate the BMU on both maps. If the distance of the observation to the BMU of the 'good' map is smaller than the distance to the BMU of the 'bad' map, we label the observation as good, and vice versa. k-nearest neighbor variants are also feasible: the k nearest neurons on both maps are sought and a label is assigned to the observation depending on how many of the k neurons lie on each of the two maps. The first approach, finding the BMU on both maps and labelling depending on the shortest distance, results in an accuracy, specificity and sensitivity of 66.4, 79.7 and 59.4%, respectively, for Bene1 and 64.5, 73.6 and 60.6% for Bene2, if we take the size of both maps equal to 6×4 neurons. The size of both maps does not need to be equal: more neurons in the 'good' map will increase sensitivity, while increasing the size of the 'bad' map results in a better specificity. For example, increasing the size of the 'good' map to 8×6 and leaving the size of the 'bad' map unchanged improves the above results to 68.1, 66.6 and 68.9%, respectively, for Bene1 (68.5, 59.2 and 72.4% for Bene2). For both data sets, variations in grid size cause considerable differences in classification performance.
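The two-map classifier can be sketched as below; `W_good` and `W_bad` are the codebooks of the maps trained on good and bad observations, respectively (names are ours).

```python
import numpy as np

def classify_two_maps(x, W_good, W_bad):
    """Label by the map whose BMU lies closest to the observation (1 = good, 0 = bad)."""
    d_good = ((W_good - x) ** 2).sum(axis=1).min()
    d_bad = ((W_bad - x) ** 2).sum(axis=1).min()
    return int(d_good <= d_bad)

def classify_two_maps_knn(x, W_good, W_bad, k=5):
    """k-NN variant: majority vote over the k nearest neurons across both maps."""
    d = np.concatenate([((W_good - x) ** 2).sum(axis=1),
                        ((W_bad - x) ** 2).sum(axis=1)])
    from_good_map = np.arange(len(d)) < len(W_good)   # True for neurons of the 'good' map
    nearest = np.argsort(d)[:k]
    return int(from_good_map[nearest].mean() >= 0.5)
```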
Fig. 9. Accuracy in each node with the 'majority vote' method (Bene1 test data).

7. Integration of SOMs with other classification techniques

In the previous sections, we have shown that SOMs offer excellent visualization possibilities, which leads to a clear understanding of the decisions made by these models. But due to their unsupervised nature, SOMs seem unable to obtain the degree of accuracy achievable by several supervised techniques, like multi-layer perceptrons or support vector machines. In this part, the classification performance of the SOMs will be improved by integrating them with these supervised algorithms. There are two possible approaches for obtaining this integration. One possibility is to first train a SOM and then use the other classification techniques to improve the decisions made by the individual neurons of this SOM. A second possibility is to use the predictions of the classifiers as inputs to the SOM. These two approaches are now discussed in more detail.

7.1. Improving the accuracy of neurons with bad predictive power

From Fig. 9, we can observe that not all neurons achieve the same level of accuracy when predicting the risk category of an applicant. The lack of accuracy of the predictions made by the neurons in the top rows is compensated by the almost perfect predictions in the lower half of the map. A two-layered approach is suggested in this section. For neurons that achieve almost perfect accuracy on the training data when using one of the models from the previous section, nothing changes: all the observations projected on one of these neurons will receive the same label. There are only changes for neurons
whose level of accuracy on the training data lies below a user-specified threshold. For each of these neurons, we build a classifier based on the training examples projected on that neuron. In our experiments, we used feedforward neural networks as classifiers for each of the neurons, but there is no necessity for the classifiers to be of the same type. The user-specified accuracy threshold was fixed at 58% for the Bene1 data set with a 6×4 map. This value was estimated by a trial-and-error procedure: a threshold that is set too low will give no improvement over the above-mentioned classifiers because no new models will be estimated, while a very high threshold will result in too many new classifiers to be trained. With a threshold of 58%, three models are trained: one for each of the first two neurons of the top row and one for the third neuron of the third row. We tested several different values for the number of hidden neurons. The simplest case, with only one hidden neuron, delivered the best results, with an accuracy on the test set of 71.3% averaged over 100 independent trials. This is almost no improvement over the majority vote classifier, which showed an accuracy of 71.0%. It seems that the increase in accuracy of the three newly trained classifiers is only marginal; for some neurons, a decrease in the percentage correctly classified can even be noted. It seems extremely difficult to separate the applicants that are projected on these neurons. A possible improvement might result from requesting additional information if an applicant is projected on one of the low-accuracy neurons and then training the feedforward neural networks with this additional information. For applicants that are projected on one of the other neurons, requesting this additional information is not necessary.

The results for Bene2 are similar. A threshold of 65% results in 9 additional classifiers to be trained, of which most are situated in the upper half of the map. The accuracy, averaged over 100 independent trials, improves to 71.9%, compared with the original performance of the majority-vote method of 69.7%. Specificity and sensitivity are 26.8 and 91.1%, respectively.
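A sketch of the two-layered approach, in our own scikit-learn rendering: the 58% threshold and the single hidden neuron follow the text, while the handling of degenerate neurons is an added safeguard of ours.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_local_classifiers(X_train, y_train, bmu_train, neuron_acc, threshold=0.58):
    """Fit a separate MLP for every neuron whose training accuracy lies below the
    threshold, using only the observations projected onto that neuron."""
    local = {}
    for c, acc in enumerate(neuron_acc):
        mask = bmu_train == c
        # skip neurons without hits or whose hits all belong to one class
        if acc < threshold and mask.sum() > 0 and len(np.unique(y_train[mask])) > 1:
            clf = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000)
            local[c] = clf.fit(X_train[mask], y_train[mask])
    return local
# at prediction time: if the BMU of an observation has a local classifier, use it;
# otherwise fall back to the neuron's majority-vote label
```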
Fig. 10. Stacking.

7.2. Stacking model

A stacking model (Wolpert, 1991) consists of one meta-learner that combines the results of several base learners to make its predictions. In this section, a SOM will be used as the meta-learner (Fig. 10). The main difference with the previous section is that the classifiers are trained before training the SOM and not afterwards. The classifiers also learn from all available training observations and not from a small subpart of them. In our experiments, we start with only one base learner, a multi-layer perceptron with 2 hidden neurons, which achieves an average classification accuracy of 72.5% on the Bene1 data set (75.1% on Bene2). The input of the meta-learner, the SOM, consists of the training data augmented with the output of this MLP. A small variation on the basic SOM algorithm described above is used. Instead of finding the BMU by calculating the Euclidean distance between each neuron and the sample observation, a weighting factor is introduced for each variable. Heavily weighted variables, in our case the output of the MLP, will contribute more during formation of the map. The distance measure, with $n$ the number of variables, can be written as:

$$\|\vec{x} - \vec{w}_c\|_s = \sum_{i=1}^{n} \mathrm{weight}_i \, |x_i - w_{c,i}|^2 \tag{3}$$
The update rule from Eq. (1) does not change, so introducing the weights only affects finding the BMUs of the SOM (Helsinki University of Technology, 2003).
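A sketch of the weighted BMU search of Eq. (3), together with how the MLP output might be appended as an extra, heavily weighted input column; the variable names and the predict_proba-based augmentation are our assumptions.

```python
import numpy as np

def weighted_bmu(x, W, weights):
    """BMU search with the weighted distance of Eq. (3); the update rule of
    Eq. (1) is left unchanged."""
    d = (weights * (W - x) ** 2).sum(axis=1)
    return int(np.argmin(d))

# hypothetical augmentation for the stacking model:
# X_aug = np.column_stack([X, mlp.predict_proba(X)[:, 1]])   # MLP output as extra column
# weights = np.append(np.ones(X.shape[1]), w_mlp)            # w_mlp varied from 1 to 100
```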
Fig. 11. Stacking model: accuracy as a function of the weighting factor.
Table 2
Overview of classification results (accuracy/specificity/sensitivity)

Classifier                Bene1            Bene2
NTO                       57.5/22.3/76.1   66.1/14.5/88.3
MV                        71.0/44.6/84.9   69.7/0.6/99.3
2M (equal sized maps)     66.4/79.7/59.4   64.5/73.6/60.6
2M (unequal maps)         68.1/66.6/68.9   68.5/59.2/72.4
IA                        71.3/52.6/81.5   71.9/26.8/91.1
SM                        Fig. 11          Fig. 11

NTO, nearest training observation method; MV, majority vote method; 2M, two maps; IA, improving accuracy of neurons with bad predictive power; SM, stacking model.

Fig. 12. Component plane of the neural network input: a value close to 0 is an indication of a high-risk application.
For the experimental study, all the weighting factors of the original variables were fixed at a value of one, while the weighting factor of the MLP input was varied between 1 and 100. Classifications were made with both methods discussed above: the 'majority vote' method and the 'nearest distance' method. Fig. 11 gives an overview of the classification accuracy for each method and for both data sets with a grid size of 6×4. It can be seen that the performance of the nearest-distance method is always below the performance of the majority-vote method. Second, we conclude that the weighting factor of the MLP plays a crucial role in the classification performance of the integrated SOM. In general, the larger the weighting factor, the more the output of the integrated SOM resembles the output of the MLP. There is, however, a large amount of variance present in the results: a small change in the weighting factor can significantly change the observed performance.

A comparison of the component plane of the MLP with the other component planes gives a deeper insight into the influence of each variable on the decision process. For example, the component planes of the variables 'number of years a person has been client' and 'age' are very similar to the component plane of the neural network input (Fig. 12). Other variables, like amount and duration of the loan, show only a small similarity with the MLP component plane. These variables seem to contribute only marginally to a good separation of the applicants.

In theory, the stacking model can be used in combination with the previous method of integration, but the degree of complexity of the resulting model is high and the advantage of the SOM's explanatory power is lost. This approach was therefore not analyzed in greater detail.

8. Conclusion

In this paper, the appropriateness of self organizing maps for credit scoring has been investigated. In the first section, several visualization techniques were presented. These offer the data analyst an easy way of exploring the data. Initially, for classification purposes, a SOM seems less suitable because its unsupervised training algorithm results in a lower accuracy than what can be obtained by several supervised classifiers.
In the final part of the paper, two ways of integrating SOMs with supervised classifiers were proposed. It can be concluded that the integration of a SOM with a supervised classifier is feasible and that the percentage of correctly classified applicants of these integrated networks is better than what can be obtained by employing a SOM alone. A summary of these results is given in Table 2. The visualization techniques remain applicable to these integrated models, which are therefore easily understood.

Several topics were not discussed in this paper. For instance, we did not investigate in detail the influence of the map size on the results. A combination of SOMs with several different types of supervised classifiers was also not tested; comparison of the component planes of these different classifiers might visually show where the predictions of the models agree and where they disagree.

Acknowledgements

This work was supported by Onderzoeksfonds K.U.Leuven/Research Fund K.U.Leuven (OT/03/12).

References

Atiya, A. (2001). Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4), 929–935.
Azcarraga, A., Hsieh, M., Pan, S., & Setiono, R. (forthcoming). Extracting salient dimensions for automatic SOM labeling. IEEE Transactions on Systems, Man and Cybernetics, Part C.
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state of the art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.
Brockett, P., Xia, X., & Derrig, R. (1998). Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud. International Journal of Risk and Insurance, 65, 245–274.
Chen, M.-C., & Huang, S.-H. (2003). Credit scoring and rejected instances reassigning through evolutionary computation techniques. Expert Systems with Applications, 24, 433–441.
Davis, R., Edelman, D., & Gammerman, A. (1992). Machine learning algorithms for credit scoring applications. IMA Journal of Mathematics Applied in Business and Industry, 4, 43–51.
Deboeck, G., & Kohonen, T. (1998). Visual explorations in finance with self-organizing maps. Berlin: Springer.
Helsinki University of Technology (2003). SOM toolbox for Matlab. URL: http://www.cis.hut.fi/projects/somtoolbox/.
Henley, W., & Hand, D. (1997). Construction of a k-nearest neighbour credit scoring system. IMA Journal of Mathematics Applied in Business and Industry, 305–321.
Honkela, T., Kaski, S., Lagus, K., & Kohonen, T. (1997). WEBSOM—self-organizing maps of document collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4–6 (pp. 310–315). Espoo, Finland: Helsinki University of Technology, Neural Networks Research Centre.
Kiviluoto, K. (1998). Predicting bankruptcies with the self-organizing map. Neurocomputing, 21, 191–201.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Lagus, K., & Kaski, S. (1999). Keyword selection method for characterizing text document maps. In Proceedings of ICANN99, Ninth International Conference on Artificial Neural Networks (pp. 371–376). IEE.
Malhotra, R., & Malhotra, D. (2002). Differentiating between good credits and bad credits using neuro-fuzzy systems. European Journal of Operational Research, 136, 190–211.
Rauber, A., & Merkl, D. (1999). Automatic labeling of self-organizing maps: Making a treasure-map reveal its secrets. In Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), LNCS/LNAI 1574 (pp. 228–237). Berlin: Springer.
Serrano-Cinca, C. (1996). Self organizing neural networks for financial diagnosis. Decision Support Systems, 17, 227–238.
Steenackers, A., & Goovaerts, M. (1989). A credit scoring model for personal loans. Insurance: Mathematics and Economics, 8(1), 31–34.
Tan, R., van den Berg, J., & van den Bergh, W. (2002). Credit rating classification using self-organizing maps. In K. Smith, & J. Gupta (Eds.), Neural networks in business: Techniques and applications (pp. 140–153). Hershey: Idea Group Publishing (Chapter 9).
Thomas, L. (2000). A survey of credit and behavioural scoring; forecasting financial risk of lending to consumers. International Journal of Forecasting, 16, 149–172.
Van Gestel, T., Baesens, B., Garcia, J., & Van Dijcke, P. (2003). A support vector machine approach to credit scoring. Bank en Financiewezen, 2, 73–82.
Vesanto, J. (1999). SOM-based data visualization methods. Intelligent Data Analysis, 3, 111–126.
Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586.
Wolpert, D. (1991). Stacked generalization. Neural Networks, 5, 241–259.
Yobas, M., & Crook, J. (2000). Credit scoring using neural and evolutionary techniques. IMA Statistics in Finance, Journal of Mathematics Applied in Business and Industry, 11, 111–125.