North American Journal of Economics and Finance xxx (xxxx) xxx–xxx
Best classification algorithms in peer-to-peer lending

Petr Teply⁎, Michal Polena
Department of Banking and Insurance, Faculty of Finance and Accounting, University of Economics in Prague, Winston Churchill Sq. 4, 130 67 Prague, Czech Republic
ARTICLE INFO

Keywords: Classification; Classifiers ranking; Credit scoring; Lending Club; P2P lending

ABSTRACT
A proper credit scoring technique is vital to the long-term success of all kinds of financial institutions, including peer-to-peer (P2P) lending platforms. The main contribution of our paper is a robust ranking of 10 different classification techniques based on a real-world P2P lending data set. Our data set comes from the Lending Club, covers the 2009–2013 period, and contains 212,252 records and 23 different variables. Unlike other researchers, we use a data sample that contains the final loan resolution for all loans. We built our research on a 5-fold cross-validation method and 6 different classification performance measurements. Our results show that logistic regression, artificial neural networks, and linear discriminant analysis are the three best algorithms based on the Lending Club data. Conversely, we identify k-nearest neighbors and classification and regression tree as the two worst classification methods.
1. Introduction

Peer-to-peer (P2P) lending is a new, online-based financial intermediary connecting people willing to borrow (borrowers) with people willing to lend their money (lenders). Borrowers and lenders are connected through online P2P lending platforms. P2P lending platforms can provide loans with lower intermediation costs than traditional banks because of their online presence and the elimination of costs associated with "bricks and mortar" locations with staff. This fact supports the offer of more competitive terms for both borrowers and lenders. Borrowers get on average lower interest rates on P2P lending platforms than in banks (Namvar, 2013). Similarly, lenders with well-diversified loan portfolios achieve higher returns than on traditional savings accounts (Serrano-Cinca, Gutierrez-Nieto, & Lopez-Palacios, 2015). These underlying facts make P2P lending increasingly popular with both borrowers and lenders.

P2P lending has attracted the attention of many researchers in recent years due to the interest in understanding the application of proper credit risk management techniques across various platforms. There exist numerous classification algorithms, such as logistic regression or random forest, for the assessment of a borrower's creditworthiness.¹ These classification techniques support the decision-making process of whether to lend money to a borrower or not. There are already comparison studies of classification algorithms, such as Baesens et al. (2003) and Lessmann, Baesens, Seow, and Thomas (2015), who provide their rankings of classifiers. However, in line with Salzberg (1997) and Wu (2014), we are concerned about the relevance of these findings for real-world applications, because such studies are usually based on small data sets of mostly unknown origin. This might be a problem because classifiers' predictions are only as good as the data sets used for their training. Therefore, we refrain from comparing our classifier ranking results to studies that are not based on Lending Club data. In other words, we primarily focus on classifiers' comparison with other papers using Lending Club data.
⁎ Corresponding author. E-mail address: [email protected] (P. Teply).
¹ Throughout our paper, we use the terms credit scoring algorithm, classification algorithm, classification technique, and classifier interchangeably.
https://doi.org/10.1016/j.najef.2019.01.001
Received 15 May 2018; Received in revised form 20 December 2018; Accepted 2 January 2019
1062-9408/ © 2019 Elsevier Inc. All rights reserved.
Our contribution to the literature consists of three parts: the data set, the number of classification methods used, and the performance measurements. First, we use a large data set comprising 212,252 records from the 2009–2013 period. All our records are matured, i.e., we know their final loan resolution status. Moreover, our data preparation approach is more exhaustive and accurate than in the other papers based on the Lending Club data (Chang, Kim, & Kondo, 2015; Malekipirbazari & Aksakalli, 2015; Tsai, Ramiah, & Singh, 2014; Wu, 2014). The second part of our contribution is the number of classification methods used. The above-mentioned studies based on the Lending Club data have used a maximum of five methods; we use ten classification methods, which makes our comparison more comprehensive. Finally, we use six performance measurements from three different performance measurement groups. Using different measurement techniques makes our findings robust. The aforementioned reasons make our paper unique and add value to our findings.

The remainder of the paper is organized as follows. In Section 2, we present a literature review on classification techniques. After that, we describe the available Lending Club data in Section 3. The next section describes our methodology for classifier comparison. Section 5 contains our experimental results with the classifier ranking. Finally, the last section concludes our paper and states final remarks.

2. Literature review

We divide the literature review into two parts. The first part presents the current research papers comparing classification techniques for credit scoring. The second part explores the recent literature comparing classifiers based on the Lending Club data.

2.1. Comparison of classification techniques

A proper credit scoring technique is a vital part of long-term success for all types of financial institutions, especially P2P lending platforms. Abdou and Pointon (2011) conducted an in-depth review of 214 articles and books concerned with applications of credit scoring in various areas of business. They found that no single overall best classification technique exists for the creation of credit scoring models. Abdou and Pointon (2011), in line with Hand and Henley (1997), argue that the performance of classification techniques depends on many characteristics, such as the variables available in the data set, the data structure, or the objective of the classification. Even though one single best credit scoring technique might not exist, as argued by Abdou and Pointon (2011), the literature comparing different classification algorithms is very rich. The majority of those studies, such as Yeh and Lien (2009), Tsai, Lin, Cheng, and Lin (2009) or Akkoc (2012), introduce new classification methods. These new classifiers are then usually compared with a limited number of classifiers including logistic regression, which is regarded as an industry standard for credit scoring models (Ala'raj & Abbod, 2015). However, Lessmann et al. (2015) criticized such an approach. They argue that comparing some new classification method, often specifically fine-tuned and without any prior hypotheses, to a limited number of classifiers and showing its performance superiority over logistic regression is not a signal of methodological advancement. Another issue we observe in studies comparing different classification techniques is the choice of data set.
Some studies, such as Zhang, Huang, Chen, and Jiang (2007) and Chuang and Lin (2009), use the data sets of Lichman (2013) based on Australian and German credit data. Both these data sets are freely downloadable from the UCI Machine Learning Repository. The Australian credit data set has 690 observations with 14 independent variables and a default rate of 44.5%. The German credit data set has 1,000 observations with 20 independent variables and a default rate of 30%. Along with Wu (2014), we regard both these data sets as inappropriate for classifier comparison because of the low number of observations. Furthermore, although the high default rates of 44.5% and 30% ensure balanced data sets, they do not correspond to the reality of the credit lending industry.

Based on our literature review, we consider two studies comparing classification techniques methodologically outstanding: Baesens et al. (2003) and Lessmann et al. (2015). The latter study is an update of the former, incorporating new findings such as new classifiers, performance criteria, and statistical testing procedures. Furthermore, Lessmann et al. (2015) include more data sets than Baesens et al. (2003). Altogether, Lessmann et al. (2015) compare 41 different classification algorithms on eight data sets measured by six various measurement methods. We use a similar methodology (k-fold cross-validation; classifier ranking) and the same six performance measurement techniques as Lessmann et al. (2015) used in their study. Similar to Lessmann et al. (2015), our main goal is to create a classifier performance ranking. However, unlike Lessmann et al. (2015), we use a unique real-world data set from the Lending Club P2P platform. Therefore, as Abdou and Pointon (2011) and Hand and Henley (1997) argue, we might arrive at completely different results than Lessmann et al. (2015) based on the comparatively rich depth and breadth of the Lending Club data set.

2.2. Comparison of classifiers based on Lending Club data

There are only three P2P lending platforms that make their data about issued loans and borrower characteristics public: Bondora, Prosper, and the Lending Club. However, we have not found any studies comparing classification techniques based on the Bondora or Prosper data sets. Bondora is the youngest P2P lending platform among these three. Even though Bondora was founded in 2009, it experienced its first rapid growth in 2013: the number of issued loans reached almost 14,000 in January 2013 and more than three times that figure in January 2014. The average loan duration at Bondora is 47 months, which means that the majority of loans have not yet reached their maturity and cannot be properly analyzed. We assume that the current immaturity of loans issued at Bondora and,
consequently, the lack of defaulted loans might be the reason why no study comparing classifiers has used their data. There have already been many papers written on the basis of the Prosper data set, such as Herzenstein, Dholakia, and Andrews (2011) and Zhang and Liu (2012), but none of them compared classification techniques. Most of these papers, such as Pope and Sydnor (2011) and Duarte, Siegel, and Young (2012), are mainly concerned with the social features of Prosper. For example, Lin, Prabhala, and Viswanathan (2013) state that borrowers with stronger network relationships are less likely to default. We suppose that it is currently too difficult to isolate and quantify the effect of the social features in the Prosper data; therefore, these data might not be suitable at present for the comparison of classification techniques. The Lending Club does not support any social features, and all loans in our data set are matured. That is the reason why we firmly believe that their data, which we analyze in our paper, are the most relevant and reliable for the comparison of classification methods.

To the best of our knowledge, there are currently only four studies (Chang et al., 2015; Malekipirbazari & Aksakalli, 2015; Tsai et al., 2014; Wu, 2014) comparing classification methods based on the Lending Club data. We focus on three aspects of these studies: the data set, the use of classifiers, and the performance measurement techniques. First, we examine the data set details, e.g., the data period, how many observations were used for classification, and how many variables were considered. Second, we deal with the use of classifiers: only one study uses five classifiers, two studies use four classifiers, and one uses only two. The third aspect we are interested in is the use of performance measurement techniques. Most of the studies used no more than three performance measurements. The most popular measurement technique remains Percentage Correctly Classified (PCC), which was used in three of the studies. Table 7 displays the overview of information gathered from the papers using the Lending Club data.

Each of the four above-mentioned studies has been written with a different purpose. The primary goal of Wu (2014) is to compare logistic regression and random forest on a real data set. She argues that the performance of classifiers in the Kaggle.com competition called 'Give me some credit' cannot be taken as credible, and she also criticizes the Kaggle data set for being artificially created. Next, the purpose of the research of Tsai et al. (2014) is to avoid as many false positive predictions as possible, and they therefore use precision as a performance measurement. Moreover, Tsai et al. (2014) use a modified version of logistic regression with a penalty factor to avoid false positive predictions. The study by Chang et al. (2015) compares the performance of different naïve Bayes distributions and kernel methods for a support vector machine. Chang et al. (2015) found that naïve Bayes with a Gaussian distribution and a support vector machine with a linear kernel have the best performance based on the LC data.² Finally, Malekipirbazari and Aksakalli (2015) compare different machine learning algorithms and identify random forest as the best scoring classifier. Moreover, they show that their default prediction classification is empirically superior to default predictions based on FICO scores or Lending Club grades.
However, none of the above-discussed papers using Lending Club data includes a classifier ranking. To fill this gap, we created such a ranking in Table 7 based on the average performance of the classifiers in these papers. We use this ranking as a comparison benchmark for our results. Comparing the rankings with one another does not provide much insight; there might be various explanations why the rankings are not similar, such as the use of different variables or performance measurements.

Furthermore, we identified three different types of shortcomings in the discussed studies based on the Lending Club data: the time frame, the limited number of classifiers, and the choice of classifier performance measurements. First, all four studies used data from a time period in which even the final status of 36-month loans could not yet be known. For example, Chang et al. (2015) used data from the years 2014 and 2015, which means they could not have known the loan resolution. Therefore, they had a data set in which the majority of loans had the status Current, i.e., were in the process of repayment. It means that they could either label loans with the status Current as positive or filter all current loans out. Either of these approaches produces a biased data set, demonstrating the classic limitations of survivorship bias. As we demonstrate in the following section describing our data preparation process, our data set is not biased in this way. We identified the comparison of a limited number of classifiers as the second shortcoming, as a maximum of five classifiers was compared. Moreover, none of the four studies includes the nowadays very popular artificial neural network (ANN) or other classification techniques such as linear discriminant analysis (LDA). The last issue we observe is the choice of classifier performance measurements. For example, Tsai et al. (2014) use only one performance measurement for classifier comparison. We firmly believe that performance results based on one performance measurement cannot be robust. Nevertheless, more performance measurements from the same methodological group might not improve our understanding of classifier performance. For instance, Chang et al. (2015) apply Percentage Correctly Classified, precision, and G-mean as performance measures. All of them are, however, from the same performance measurement group based on a confusion matrix. Thus, the results of Chang et al. (2015) might not be robust enough, because measurement techniques from this performance measurement group could favor some classification techniques. We do believe that overcoming these shortcomings ensures a proper comparison of classifiers based on the Lending Club data.

3. Data analysis

For better transparency, we divide the data analysis into four parts. First, we introduce all available data and variables we have at hand and explain which variables were left out of our final data set. Second, we describe the transformation of our data set. Third, we provide the descriptive statistics of our data set. Finally, we explain the 5-fold cross-validation approach we chose for training, fine-tuning, and testing our classifiers.
² LC is an abbreviation for the Lending Club.
Fig. 1. Number of issued loans by years at Lending Club. Source: Authors based on the Lending Club data.
3.1. Data preparation

Our data set for the years 2009 through 2013 was downloaded from a registered account on the Lending Club website. Fig. 1 displays the numbers of loans issued in the given years and shows, among other things, that 134,814 loans were issued in the year 2013. The original data set contains 115 variables, from which we chose only 23 variables, including a variable with the final loan status (Table 1). The majority of variables from the original data set were left out for the following three reasons. First, many variables do not include any values or have a high number of missing (or null) values. Second, we left out the variables with constant values for all observations, since these variables cannot have any significant impact on a borrower's default. Third, a variable may lack information value. For example, the variable url, which represents the web link to the loan listing, or the variable member_id, assigning a unique number to a borrower, in our opinion adds no information of value for default prediction.

3.2. Data transformation

We have transformed five variables: loan_status, emp_length, desc³, earliest_cr_line, and the pair fico_range_low and fico_range_high.⁴ The loan status is evidently the most important variable for our purpose, since it describes the current status of a loan. Table 2 reports an overview of the seven possible loan statuses. We are, however, only interested in a loan status with a binary outcome: 0 for a paid back loan and 1 for a defaulted loan. As a result, we labeled all loans with the status Fully Paid as 0, because they have been paid back. We filtered out all loans with the status Current, as we do not know their final status. Loans with the status Charged Off are defaulted loans (labeled with 1). There are four more loan statuses used for loans with delayed payments. First, the loan status In Grace Period means that a borrower is at most 15 days late with loan repayments. Loans listed as Late (16–30 days) and Late (31–120 days) are past due by the number of days given in parentheses. Finally, loans with the status Default are more than 120 days past due. According to the Lending Club statistics, loans that are more than 90 days past due have an 85% chance of not being paid back at all. Based on this statistic, we have marked all loans with the status Default as defaulted, thus with 1. Moreover, all loans from Late (31–120 days) that are more than 90 days past due (i.e., 430 loans) have been marked as defaulted too. All other loans that are past due by no more than 90 days have been filtered out. Altogether, 13,871 loans have been filtered out. After all these adjustments, our data set has 212,280 records at this step.

The variable emp_length describes how long a borrower has been employed before asking for a loan. The values of emp_length, such as 1 year, 2 years, and 10+ years, make this variable categorical. For better usage, we have decided to make this variable continuous, ranging from 0 to 10. The value 0 of our emp_length variable means that a borrower worked less than 1 year before applying for a loan. The maximal value of emp_length, which is 10, includes all the borrowers who have worked 10 or more years for the same employer.

Every borrower can describe why he or she needs to borrow money. This loan description is included in the Lending Club data under the variable desc. Rather than the meaning of the text description provided by a borrower, we are interested in the number of characters the borrower used for the description.
As a consequence, our variable desc contains the number of characters used in the loan description.

The length of credit history is an important part of the FICO score. Furthermore, Serrano-Cinca et al. (2015) and Carmichael (2014) argue that the length of credit history is a significant determinant of a borrower's default probability.

³ The variable desc describes the number of characters used by a borrower when describing his or her need for a loan. Polena and Regner (2018) found that desc might be a significant determinant of a borrower's default.
⁴ We use lower-case letters in italics, sometimes with underscores, to denote the Lending Club variables.
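For concreteness, the following is a minimal pandas sketch of the transformations described in this section, including the earliest_cr_line and FICO adjustments explained below. It is our own simplified illustration, not the authors' code: the date format of earliest_cr_line and the simplified handling of delinquent loans are assumptions about the Lending Club export.

```python
import pandas as pd

REFERENCE_YEAR = 2017  # the paper takes 2017 as the reference year

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # loan_status -> binary target: 0 = Fully Paid, 1 = defaulted.
    # Simplified: the paper additionally marks Late (31-120 days) loans
    # more than 90 days past due as defaulted and filters out the rest.
    df = df[df["loan_status"].isin(["Fully Paid", "Charged Off", "Default"])]
    df["loan_status"] = (df["loan_status"] != "Fully Paid").astype(int)

    # emp_length: "< 1 year" -> 0, "1 year" -> 1, ..., "10+ years" -> 10
    emp = df["emp_length"].fillna("0").str.replace("< 1 year", "0", regex=False)
    df["emp_length"] = emp.str.extract(r"(\d+)")[0].astype(float)

    # desc: free-text loan description -> number of characters used
    df["desc"] = df["desc"].fillna("").str.len()

    # earliest_cr_line (e.g. "Dec-2001") -> years since the first credit line
    opened = pd.to_datetime(df["earliest_cr_line"], format="%b-%Y")
    df["earliest_cr_line"] = REFERENCE_YEAR - opened.dt.year

    # FICO bounds -> single average score
    df["fico_range_avg"] = (df["fico_range_low"] + df["fico_range_high"]) / 2
    return df.drop(columns=["fico_range_low", "fico_range_high"])
```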
Table 1
Included Lending Club variables.

| Abbreviated name | Transformed | Description |
|---|---|---|
| acc_now_delinq | No | The number of accounts on which the borrower is now delinquent. |
| annual_inc | No | The self-reported annual income provided by the borrower during registration. |
| chargeoff_within12mths | No | Number of charge-offs within 12 months. |
| delinq_2yrs | No | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years. |
| delinq_amnt | No | The past-due amount owed for the accounts on which the borrower is now delinquent. |
| dti | No | A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income. |
| home_ownership | No | The home ownership status provided by the borrower during registration or obtained from the credit report. |
| inq_last_6mths | No | The number of inquiries in the past 6 months (excluding auto and mortgage inquiries). |
| loan_amnt | No | The listed amount of the loan applied for by the borrower. |
| open_acc | No | The number of open credit lines in the borrower's credit file. |
| pub_rec | No | Number of derogatory public records. |
| pub_rec_bankruptcies | No | Number of public record bankruptcies. |
| purpose | No | A category provided by the borrower for the loan request. |
| revol_util | No | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| tax_liens | No | Number of tax liens. |
| term | No | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| total_acc | No | The total number of credit lines currently in the borrower's credit file. |
| verification_status | No | Indicates if income was verified by LC or not verified. |
| loan_status | Yes | Final status of the loan; binary outcome: 0 for Fully Paid loans and 1 for Charged Off loans. |
| emp_length | Yes | Number of years in employment, represented by a continuous variable going from 0 to 10. |
| desc | Yes | Number of characters included in the loan description. |
| earliest_cr_line | Yes | Number of years since the first credit line was opened. |
| fico_range_avg | Yes | The average value of fico_range_low and fico_range_high. |

Note: The column Transformed signifies whether a variable has its original form or has been transformed. For an overview of the variables' original descriptions, see Appendix A. For an overview of the variables' transformations, see section Data transformation. Source: Authors based on the Lending Club data.
In our data set, we have the variable earliest_cr_line, in the form month-year, which represents the month and the year when the borrower's first credit line was opened. We have transformed this variable to show how many years have passed since the first credit line was opened. As we have the data from the end of 2016, we consider the year 2017 to be the reference year. For example, a borrower with an earliest_cr_line value of 5 opened his or her first credit line 5 years before the reference year, i.e., in 2012. The last modified variables are fico_range_low and fico_range_high. The Lending Club data does not contain the exact value of the FICO score; it contains the FICO score as a range of four points with lower and upper bounds. In other words, the difference between fico_range_high and fico_range_low is four points. For our purpose, we have taken the average of these two variables. The newly created variable is called fico_range_avg and is listed in Table 1.

3.3. Descriptive statistics

As mentioned, we have chosen 23 variables including the loan status for our final data set (18 of these variables are continuous). While compiling the descriptive statistics of the continuous variables, we found some borrowers with suspiciously high annual incomes. For example, there was a borrower with a reported annual income of $7,141,778 who applied for a loan in the value of $14,825. We considered this record to be erroneous. Furthermore, there were altogether 28 records with reported annual income higher than $1 million. We deleted all these records from our data set because they might have been erroneous and could have skewed our results. After this change, there are 212,252 records in our final data set. Otherwise, we have not found any suspicious values when exploring the remaining variables. Table 3 includes the descriptive statistics of the continuous variables including the loan status.

The last column of Table 3, called t-test, indicates that for the majority of the continuous variables there are significant differences between their average values for loans with the statuses Fully Paid and Charged Off. Fully Paid loans have a significantly lower loan amount (loan_amnt) and longer loan description (desc_count) than Charged Off loans. Borrowers who paid off their loans have a higher annual income (annual_inc), a higher FICO score (fico_range_avg), and a longer credit history (earliest_cr_line) than defaulted borrowers. Besides, borrowers with Fully Paid loans report a lower debt-to-income ratio (dti), were less delinquent in the past two years (delinq_2yrs), and made fewer credit inquiries in the past six months (inq_last_6mths) than borrowers with Charged Off loans. Table 3 provides more details on the significant differences of the continuous variables.
Table 2
Number of loans by loan status before and after adjustments.

| Loan status (before) | # of loans |
|---|---|
| Fully Paid | 178,500 |
| Charged Off | 33,333 |
| Current | 13,307 |
| Late (31–120 days) | 543 |
| In Grace Period | 346 |
| Late (16–30 days) | 105 |
| Default | 17 |
| Total | 226,151 |

| Loan status (after) | # of loans |
|---|---|
| Fully Paid | 178,500 |
| Charged Off | 33,780 |
| Total | 212,280 |

Source: Authors based on the Lending Club data.

Table 3
Descriptive statistics of continuous variables.

| Abbreviated name | Mean | St. Dev. | Min | Median | Max | Mean (Fully Paid) | Mean (Charged Off) | t-test |
|---|---|---|---|---|---|---|---|---|
| loan_amnt | 13,406 | 7,958 | 1,000 | 12,000 | 35,000 | 13,161 | 14,704 | −31.23*** |
| emp_length | 5.849 | 3.585 | 0.000 | 6 | 10 | 5.836 | 5.919 | −3.93*** |
| annual_inc | 70,986 | 45,017 | 4,000 | 60,000 | 1,000,000 | 72,103 | 65,086 | 28.45*** |
| desc_count | 103.1 | 214.2 | 0 | 0 | 3,959 | 104.1 | 98.11 | 4.67*** |
| dti | 16.29 | 7.56 | 0.00 | 16.01 | 34.99 | 16.02 | 17.72 | −37.97*** |
| delinq_2yrs | 0.22 | 0.67 | 0 | 0 | 29 | 0.212 | 0.232 | −3.43*** |
| earliest_cr_line | 18.88 | 7.00 | 6 | 18 | 71 | 18.96 | 18.42 | 13.23*** |
| fico_range_avg | 702 | 32 | 662 | 697 | 848 | 704 | 694 | 62.51*** |
| inq_last_6mths | 0.82 | 1.04 | 0 | 0 | 8 | 0.79 | 0.97 | −27.17*** |
| open_acc | 10.65 | 4.61 | 0 | 10 | 62 | 10.62 | 10.84 | −7.89*** |
| pub_rec | 0.098 | 0.385 | 0 | 0 | 54 | 0.097 | 0.104 | −3.07*** |
| revol_util | 0.565 | 0.243 | 0.000 | 0.587 | 1.404 | 0.556 | 0.608 | −37.02*** |
| total_acc | 24.04 | 11.19 | 2 | 23 | 105 | 24.1 | 23.71 | 5.77*** |
| acc_now_delinq | 0.002 | 0.052 | 0 | 0 | 5 | 0.002 | 0.003 | −2.26** |
| chargeoff_within12mths | 0.004 | 0.074 | 0 | 0 | 5 | 0.004 | 0.004 | 1.71* |
| delinq_amnt | 6.81 | 476.48 | 0 | 0 | 65,000 | 6.62 | 7.82 | −0.40 |
| pub_rec_bankruptcies | 0.078 | 0.279 | 0 | 0 | 8 | 0.077 | 0.083 | −3.33*** |
| tax_liens | 0.011 | 0.222 | 0 | 0 | 53 | 0.012 | 0.011 | 0.87 |

Note: Stars in the column t-test signify whether the difference in average values for Fully Paid and Charged Off loans is significant. *** denotes significance at the 1% level, ** at the 5% level, and * at the 10% level. Source: Authors based on the Lending Club data.
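The t-test column of Table 3 can be reproduced along the following lines; this is our own sketch, and we assume a Welch two-sample t-test, since the paper does not state which variant it uses.

```python
from scipy import stats

def group_ttests(df, cols):
    """Compare Fully Paid (0) and Charged Off (1) group means column by column."""
    paid = df[df["loan_status"] == 0]
    charged_off = df[df["loan_status"] == 1]
    for col in cols:
        t, p = stats.ttest_ind(paid[col].dropna(), charged_off[col].dropna(),
                               equal_var=False)  # Welch test: unequal variances
        print(f"{col}: t = {t:.2f}, p = {p:.4f}")

# e.g. group_ttests(df, ["loan_amnt", "annual_inc", "dti", "fico_range_avg"])
```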
The remaining five variables are categorical: loan_status, home_ownership, purpose, term, and verification_status. Table 4 provides the descriptive statistics of the categorical variables (except for loan_status) and contains a column with the default rate for each level of a given categorical variable. The default rate is calculated based on the loan status as the ratio of Charged Off loans to the total number of loans. The overall default rate reaches 15.91% in our final data set. Table 4 also demonstrates that the main purpose of loan applications at the Lending Club is debt consolidation (56.7% of all loans) or credit card repayment (21.3% of all loans). Considering the default rate for different loan purposes, the loans with the purpose small business are by far the riskiest ones (default rate of 26.2%). On the other hand, car loans (default rate of 11.1%) and major purchases (11.9%) belong to the safest group of loans. Looking at the variable home_ownership, mortgage (49.6% of all loans) and rent (42.2%) are the options most commonly chosen by borrowers to describe their home situation. Surprisingly, people who own their home report a significantly higher default rate (16.2%) than people who have a mortgage (14.6%). Furthermore, 80.6% of all loans have a 36-month duration. At the same time, the 36-month loans have a significantly lower default rate (12.4%) than 60-month loans (30.5%). We argue that the high default rate of 60-month loans is caused by the earlier default of some loans and by the low number of included 60-month loans, as most of them were filtered out because they had not yet reached maturity. The last categorical variable is verification status. Most of the loans (65.1%) have been verified. However, they show a significantly higher default rate (17.8%) than loans with unverified information (12.3%). Based on the borrower's credit file, the Lending Club might not require the verification of the borrower's self-reported information because the borrower is considered to be creditworthy. This approach is justifiable because of the lower default rate of loans without verified information. Table 4 includes more details about the descriptive statistics of the categorical variables.
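As an illustration, the default rates in Table 4 below can be reproduced with a short pandas helper; this is our own sketch, assuming the transformed data set with the binary loan_status.

```python
import pandas as pd

def default_rates(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Share of loans and default rate per level of a categorical variable."""
    out = df.groupby(col)["loan_status"].agg(["size", "mean"])
    out.columns = ["n_loans", "default_rate"]          # mean of 0/1 target = default rate
    out["share_of_loans"] = out["n_loans"] / len(df)
    return out.sort_values("share_of_loans", ascending=False)

# e.g. default_rates(df, "purpose") should show small business as the riskiest
# category (26.2%) and car as the safest (11.1%), as reported in Table 4.
```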
Table 4
Descriptive statistics of categorical variables.

| Variable: level | Percentage (number) of loans | Default rate (%) |
|---|---|---|
| purpose: debt consolidation | 56.7% (120,285) | 16.9 |
| purpose: credit card | 21.3% (45,119) | 13.1 |
| purpose: home improvement | 5.8% (12,369) | 13.8 |
| purpose: other | 5.7% (12,160) | 18.5 |
| purpose: major purchase | 2.6% (5,574) | 11.9 |
| purpose: small business | 2.0% (4,337) | 26.2 |
| purpose: car | 1.6% (3,364) | 11.1 |
| purpose: wedding | 1.0% (2,186) | 12.4 |
| purpose: medical | 1.0% (2,150) | 16.9 |
| purpose: moving | 0.7% (1,573) | 16.7 |
| purpose: house | 0.7% (1,394) | 16.3 |
| purpose: vacation | 0.6% (1,263) | 15.8 |
| purpose: educational | 0.1% (260) | 16.5 |
| purpose: renewable energy | 0.1% (218) | 19.3 |
| home_ownership: mortgage | 49.6% (105,229) | 14.6 |
| home_ownership: rent | 42.2% (89,523) | 17.3 |
| home_ownership: own | 8.2% (17,346) | 16.2 |
| home_ownership: other | 0.1% (114) | 21.9 |
| home_ownership: none | 0.02% (40) | 17.5 |
| term: 36 months | 80.6% (171,137) | 12.4 |
| term: 60 months | 19.4% (41,115) | 30.5 |
| verification_status: verified | 65.1% (138,148) | 17.8 |
| verification_status: not verified | 34.9% (74,104) | 12.3 |

Source: Authors based on the Lending Club data.
In the correlation matrix of the continuous variables, we find four variables with a correlation coefficient higher than 0.5 in absolute terms. The strongest correlation is between pub_rec and pub_rec_bankrp (r = 0.76). Furthermore, the variable pub_rec is considerably correlated (r = 0.63) with the variable tax_liens. We have nevertheless decided not to exclude the variable pub_rec: by excluding it, we could lose important information about the borrower's public records that is not included in the variables pub_rec_bankrp and tax_liens. Moreover, we do believe that a correlation between variables is not an issue for default prediction but rather for coefficient interpretation. Therefore, we do not see any reason for excluding correlated variables.

3.4. k-Fold cross-validation

We have chosen k-fold cross-validation for training, validating, and testing our models. In the k-fold cross-validation method, the original data set is randomly divided into k subsets. Each of the k subsets is used as the testing data set in one of the k iterations; the remaining k−1 subsets are used for model training and fine-tuning. Salzberg (1997) argues that this approach minimizes the impact of data dependency. Put differently, the risk that the performance of a classifier depends on the choice of the testing set is minimized, because the classifier is scored sequentially on the whole data set. Moreover, Huang, Chen, and Wang (2007) add that the use of k-fold cross-validation serves as a guarantee of the validity of results. In our paper, we specifically use 5-fold cross-validation. We use three folds for model training and one fold for model fine-tuning. After fine-tuning, we retrain the model on the three training folds and the one fine-tuning fold. This model is then tested on the test fold (see Fig. 2).
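A minimal sketch of this scheme, assuming numeric arrays X and y and scikit-learn estimators; the candidate models and the AUC tuning criterion are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def five_fold_scores(X, y, candidate_models):
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in outer.split(X, y):
        # split the four non-test folds into three training folds + one tuning fold
        inner = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
        sub_fit, sub_tune = next(inner.split(X[train_idx], y[train_idx]))
        fit_idx, tune_idx = train_idx[sub_fit], train_idx[sub_tune]

        def tune_auc(model):
            m = clone(model).fit(X[fit_idx], y[fit_idx])
            return roc_auc_score(y[tune_idx], m.predict_proba(X[tune_idx])[:, 1])

        best = max(candidate_models, key=tune_auc)      # fine-tune on the tuning fold
        final = clone(best).fit(X[train_idx], y[train_idx])  # retrain on all four folds
        fold_scores.append(
            roc_auc_score(y[test_idx], final.predict_proba(X[test_idx])[:, 1])
        )
    return np.mean(fold_scores), np.std(fold_scores)

# e.g. five_fold_scores(X, y, [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 1.0, 10.0)])
```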
Fig. 2. 5-fold Cross Validation.
4. Methodology

In this section, we briefly describe our classification and performance measurement techniques. A comprehensive description of these techniques is beyond the scope of this paper; however, we refer below to the key works explaining the classification techniques in more depth.

4.1. Classification techniques

We use ten different classification techniques in our paper, which can be divided into three groups based on the type of algorithm they use: linear, non-linear, or rule-based. Logistic regression (LR) and linear discriminant analysis (LDA) are classification techniques based on linear algorithms (Kuhn & Johnson, 2013; Wendler & Gröttrup, 2016). The classifiers using non-linear algorithms are support vector machine (SVM), artificial neural network (ANN), k-nearest neighbors (k-NN), naïve Bayes (NB), and Bayesian network (B-Net) (Karatzoglou, Meyer, & Hornik, 2006; Kuhn & Johnson, 2013; Wendler & Gröttrup, 2016). The last group of rule-based classifiers contains classification and regression tree (CART) and random forest (RF), as discussed by Wendler and Gröttrup (2016) or Cichosz (2015). For more information about the classifiers' meta-parameters and their values, we refer to Table A1 in the Appendix.

4.2. Performance measurements

We use six different performance measurements to ensure a robust evaluation of classifier performance. These performance measurements can be divided into three groups. The first group evaluates the correctness of classifiers' categorical predictions, such as Percentage Correctly Classified (PCC) and the Kolmogorov-Smirnov statistic (KS), as discussed by Mays (2001). The second group contains performance measurements that evaluate the accuracy of classifiers' probability predictions, such as the Brier score (BS) (Hernandez-Orallo, Flach, & Ferri, 2011; Rufibach, 2010). The performance measurements using the discriminatory ability of a classifier, such as the area under the ROC curve (AUC), the partial Gini index (PG), and the H-measure (H), belong to the last group (Hand, 2009; Pundir & Seshadri, 2012).
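For illustration, four of the six measures can be computed from predicted default probabilities as follows. This is a sketch: the partial Gini index and the H-measure require specialized implementations and are omitted, and the 0.5 cut-off for PCC is our assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score, roc_curve

def evaluate(y_true, p_default):
    """PCC, KS, BS and AUC from a vector of predicted default probabilities."""
    pcc = accuracy_score(y_true, (p_default >= 0.5).astype(int))  # assumed 0.5 cut-off
    fpr, tpr, _ = roc_curve(y_true, p_default)
    ks = float(np.max(tpr - fpr))               # Kolmogorov-Smirnov statistic
    bs = brier_score_loss(y_true, p_default)    # Brier score (lower is better)
    auc = roc_auc_score(y_true, p_default)      # area under the ROC curve
    return {"PCC": pcc, "KS": ks, "BS": bs, "AUC": auc}
```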
5. Experimental results

This section is divided into two parts. In the first part, we present a comparison of the classifiers' performance; in the second, our results are compared with other studies using the Lending Club data.

5.1. Results overview

In our research, we compare ten classifiers on six different performance measurements with the 5-fold cross-validation method, which consists of five iterations. Table 5 aggregates the results from the single iterations and shows the average overall performance of our classifiers across the six different performance measurements. The best classifier under each performance measurement is determined based on the total performance and is shown in bold face in Table 5. We further calculate the standard deviation of each classifier's overall performance based on the results from the iterations. The last metric in Table 5 (column M-W) stands for the Mann-Whitney U statistic. The values of the Mann-Whitney U statistic are accompanied by stars showing whether the classifier's performance is significantly different from that of the best classifier: three stars (***) denote a significant difference at the 1% significance level, two stars (**) at the 5% level, and one star (*) at the 10% level. For example, logistic regression (LR) is the best classifier based on the PCC measure according to our results. The performance of LR measured by PCC is, however, not significantly different from the performance of linear discriminant analysis (LDA) and two other classifiers at the 5% significance level. On the other hand, the PCC performance of LR is significantly different from random forest (RF) at the 5% significance level, and from k-NN even at the 1% significance level. If not stated otherwise, we always refer to the 5% significance level when speaking about significant differences between classifiers' performance.

Table 6 enables a comparison of our classifiers across all performance measurements; the classifiers are ranked based on their performance. The rankings across the different performance measurements are averaged in column Avg. Score. The total ranking of the classifiers, displayed in column Total Ranking, is derived from the values of column Avg. Score. The best performing classifier gets a ranking of 1, the second best performing classifier gets a ranking of 2, and so on.
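A small sketch of the Mann-Whitney U comparison reported in Table 5: the five per-fold scores of a classifier are tested against those of the best classifier, so with five folds the U statistic ranges from 0 to 25. The fold scores below are illustrative, not taken from the paper.

```python
from scipy.stats import mannwhitneyu

best_auc  = [0.6952, 0.6961, 0.6990, 0.7003, 0.6989]   # illustrative per-fold scores
other_auc = [0.6330, 0.6352, 0.6371, 0.6385, 0.6427]

u, p = mannwhitneyu(other_auc, best_auc, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")  # U = 0 here: the two sets of folds do not overlap
```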
Table 5
Average performance results.

| Classifier | PCC | M-W | KS | M-W | BS | M-W | AUC | M-W | PG | M-W | H | M-W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | **0.7913** (0.0016) | / | **0.2885** (0.0056) | / | **0.1239** (0.0008) | / | **0.6979** (0.0028) | / | 0.2502 (0.0046) | 24** | **0.1319** (0.0039) | / |
| ANN | 0.7905 (0.0016) | 17 | 0.2848 (0.0039) | 17 | 0.1240 (0.0009) | 11 | 0.6975 (0.0019) | 14 | 0.2453 (0.0081) | 25*** | 0.1305 (0.0031) | 15 |
| LDA | 0.7904 (0.0011) | 16 | 0.2833 (0.0048) | 20 | 0.1245 (0.0008) | 8 | 0.6955 (0.0028) | 19 | 0.2586 (0.0075) | 17 | 0.1285 (0.0039) | 19 |
| L-SVM | 0.7887 (0.0021) | 18 | 0.2854 (0.0053) | 17 | 0.1585 (0.0003) | 0*** | 0.6967 (0.0029) | 18 | 0.2309 (0.0068) | 25*** | 0.1300 (0.0040) | 17 |
| RF | 0.7883 (0.0034) | 23** | 0.2789 (0.0074) | 20 | 0.1248 (0.0010) | 6 | 0.6928 (0.0032) | 23** | 0.2236 (0.0062) | 25*** | 0.1243 (0.0040) | 23** |
| B-Net | 0.7878 (0.0014) | 23** | 0.2555 (0.0032) | 25*** | 0.1257 (0.0007) | 1** | 0.6787 (0.0027) | 25*** | 0.2108 (0.0054) | 25*** | 0.1122 (0.0039) | 25*** |
| SVM-Rbf | 0.7818 (0.0018) | 25*** | 0.2104 (0.0044) | 25*** | 0.1306 (0.0008) | 0*** | 0.6519 (0.0034) | 25*** | **0.2641** (0.0070) | / | 0.1089 (0.0044) | 25*** |
| NB | 0.7836 (0.0019) | 24** | 0.2425 (0.0066) | 25*** | 0.1502 (0.0076) | 0*** | 0.6689 (0.0031) | 25*** | 0.2043 (0.0226) | 25*** | 0.1029 (0.0028) | 25*** |
| CART | 0.7659 (0.0072) | 25*** | 0.2557 (0.0090) | 25*** | 0.2019 (0.0073) | 0*** | 0.6373 (0.0048) | 25*** | 0.1495 (0.0249) | 25*** | 0.0801 (0.0053) | 25*** |
| k-NN | 0.7655 (0.0144) | 25*** | 0.2022 (0.0026) | 25*** | 0.1322 (0.0006) | 0*** | 0.6360 (0.0015) | 25*** | 0.1502 (0.0082) | 25*** | 0.0678 (0.0015) | 25*** |

Note: Each cell reports the average performance over the five cross-validation iterations, with the standard deviation in parentheses. The best performance under each measurement is shown in bold face; '/' in the M-W column marks the best classifier itself. For the Brier score (BS), lower values are better. The abbreviation M-W stands for the Mann-Whitney U test; the U statistics are accompanied by stars signifying whether the classifier's performance is significantly different from the performance of the best classifier. *** denotes significance at the 1% level, ** at the 5% level, and * at the 10% level. Source: Authors based on the Lending Club data.
Table 6
Classifiers' ranking.

| Classifier | PCC | KS | BS | AUC | PG | H | Avg. Score | Total Ranking |
|---|---|---|---|---|---|---|---|---|
| LR | 1 | 1 | 1 | 1 | 3 | 1 | 1.3 | 1 |
| ANN | 2 | 3 | 2 | 2 | 4 | 2 | 2.5 | 2 |
| LDA | 3 | 4 | 3 | 4 | 2 | 4 | 3.3 | 3 |
| L-SVM | 4 | 2 | 9 | 3 | 5 | 3 | 4.3 | 4 |
| RF | 5 | 5 | 4 | 5 | 6 | 5 | 5.0 | 5 |
| B-Net | 6 | 7 | 5 | 6 | 7 | 6 | 6.2 | 6 |
| SVM-Rbf | 8 | 9 | 6 | 8 | 1 | 7 | 6.5 | 7 |
| NB | 7 | 8 | 8 | 7 | 8 | 8 | 7.7 | 8 |
| CART | 9 | 6 | 10 | 9 | 10 | 9 | 8.8 | 9 |
| k-NN | 10 | 10 | 7 | 10 | 9 | 10 | 9.3 | 10 |
Notes: Avg. Score computes the average ranking of a classifier based on the rankings achieved under the different performance measurements. Total Ranking ranks the classifiers based on their average score.

Table 7
Final classifiers' comparison based on the LC data.

| Study | Data: years | # of observations | # of variables | LR | ANN | LDA | L-SVM | RF | B-Net | SVM-Rbf | NB | CART | k-NN | SVM-P | Performance measurement technique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wu (2014) | 2007–2011 | 33,571 | 22 | 1 | | | | 2 | | | | | | | PCC, AUC |
| Tsai et al. (2014) | 2007–2013 | 91,520 | n/a | 1 | | | | 3 | | 2 | 4 | | | | PPV |
| Chang et al. (2015) | 2007–2015 | n/a | n/a | 3 | | | 2 | | | 5 | 1 | | | 4 | PCC, G-mean |
| Malekipirbazari and Aksakalli (2015) | 2012–2014 | 68,000 | 16 | 4 | | | | 1 | | 2 | | | 3 | | PCC, AUC, RMSE |
| This study | 2009–2013 | 212,252 | 23 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | PCC, KS, BS, AUC, PG, H |

Source: Authors' information extraction and ranking computation based on Wu (2014), Tsai et al. (2014), Chang et al. (2015), Malekipirbazari and Aksakalli (2015), and own research.
Table 6 also reveals that LR has ranking 1 based on PCC because it reports the highest PCC performance.

5.2. Comparison with other LC-based studies

In the literature review, we presented four studies comparing classifiers based on the Lending Club data: Tsai et al. (2014), Wu (2014), Chang et al. (2015), and Malekipirbazari and Aksakalli (2015). Table 7 summarizes the key characteristics of these studies together with our results. First, it shows that we have used by far the largest data set, with 212,252 records. Second, our data set has the largest number of variables, since it contains 23 variables including the dependent variable loan status. Third, we have included ten classifiers to make our comparison comprehensive. As discussed, the primary goal of the remaining studies is not a classifier comparison but rather the introduction of new classifiers for default prediction. That is why these studies do not provide as comprehensive a comparison as ours and compare a maximum of five classifiers. The last thing that differentiates our paper from the remaining studies is the number of performance measurements used. Altogether, we have used six different measurement techniques from three different performance measurement groups. Using a broad range of evaluation techniques makes our results more robust.

Comparing our final classifier ranking to the four studies in Table 7, we observe that Wu (2014) and Tsai et al. (2014) rank logistic regression (LR) as the best classifier in their studies. On the other hand, Chang et al. (2015) rank logistic regression as the third and Malekipirbazari and Aksakalli (2015) as the fourth best classifier. Generally speaking, the comparison of rankings shows more differences than similarities. We see several possible explanations for the differences between these rankings, for example, the usage of different variables, differences in data structure, or an insufficient number of classifier performance measurements. However, we see the fact that the authors pursue different goals as the main reason for the observed differences. For instance, the main goal of Chang et al. (2015) is to compare different distributions of naïve Bayes (NB) and support vector machines (SVM). That is why the data preprocessing and other steps are done to suit these classifiers. For instance, Chang et al. (2015) state that the data set has been rebalanced because SVM underperforms on imbalanced data sets. With this set-up, it might not be surprising that NB is ranked as the best classifier and L-SVM as the second best classifier in Chang et al. (2015). Moreover, these classifiers have been specifically fine-tuned to fit the data. Both Tsai et al. (2014) and Malekipirbazari and Aksakalli (2015) are conducted in a similar manner to Chang et al. (2015). We therefore refrain from comparing our results with these studies because of the unequal
conditions for classifier comparison. Moreover, we consider our paper the first classifier comparison study based on the Lending Club data. It should be mentioned that Tsai et al. (2014) used a modified version of logistic regression (LR); for more information about this modification, we refer to Tsai et al. (2014).

6. Conclusion

This paper introduces the first ranking of ten different classification methods based on the real-world Lending Club data set. The data set contains 212,252 observations with 23 different variables. We used a 5-fold cross-validation approach and six different classifier performance measurements to ensure a robust and comprehensive comparison. According to our ranking, logistic regression placed as the best and artificial neural network as the second best classification method. Our results support the hypothesis of Baesens et al. (2003) that credit data sets might be linearly separable, given the high ranking of logistic regression and linear discriminant analysis. We identified two classification algorithms that we do not recommend for credit scoring because of their poor performance: classification and regression tree and k-nearest neighbors. Both of these algorithms also placed at the bottom of the rankings of Baesens et al. (2003) and Lessmann et al. (2015). The results of our paper might be used by both retail and institutional investors who are picking loans for their portfolios to assess the probability of a loan's default. Our ranking was further compared to the rankings of four relevant studies (Chang et al., 2015; Malekipirbazari & Aksakalli, 2015; Tsai et al., 2014; Wu, 2014) also using the Lending Club data. We have, however, found more differences than similarities between their rankings and ours. We believe that this is because the authors pursue different goals in their studies than we do. Therefore, we consider our study the first to comprehensively compare different classification methods on the Lending Club data set.

Acknowledgment

This research was supported by the Czech Science Foundation (Project No. GA 18-05244S) and the University of Economics in Prague (Project No. VŠE IP100040).

Appendix. Meta-parameters of Classifiers

Table A1 below displays the following information: the meta-parameters of the classifiers that we have fine-tuned, the fine-tuned values of these meta-parameters, and the analytical software we have used. The symbol n/a in the meta-parameter column denotes that fine-tuning was not needed for a given classifier. We would like to point out that different analytical software may require different meta-parameters. Our main software for classifiers' scoring was IBM SPSS Modeler 18.0. We have chosen SPSS Modeler because of its reliability, ease of use, and the authors' proficiency with this software. Two classification methods, naïve Bayes and random forest, are not covered by SPSS Modeler. Therefore, our second software is R 3.4.0. The R packages we have used are given in parentheses.

Table A1
Meta-parameters of Classifiers.
| Classifier | Meta-parameter | Value | Software |
|---|---|---|---|
| Artificial neural network (ANN) | # of hidden nodes; # of units in hidden nodes | 2; 22, 10 | SPSS Modeler |
| Bayesian net (B-Net) | Structure of network | TAN | SPSS Modeler |
| Classification and regression tree (CART) | Tree depth; min. leaf size | 6; 2% | SPSS Modeler |
| k-Nearest neighbors (k-NN) | # of nearest neighbors | 10 | SPSS Modeler |
| Linear discriminant analysis (LDA) | n/a | | SPSS Modeler |
| Logistic regression (LR) | n/a | | SPSS Modeler |
| Linear support vector machine (L-SVM) | Epsilon; Lambda | 0.1; 5 | SPSS Modeler |
| Naïve Bayes (NB) | n/a | | R (e1071) |
| Support vector machine, radial (SVM-Rbf) | Epsilon; Gamma | 0.1; 0.1 | SPSS Modeler |
| Random forest (RF) | # of grown trees; # of randomly sampled variables | 800; 5 | R (randomForest) |
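For readers working in Python rather than SPSS Modeler or R, the following sketch shows plausible scikit-learn equivalents of the fine-tuned values in Table A1; the mapping of the meta-parameter names to scikit-learn arguments is our assumption, not part of the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "ANN":  MLPClassifier(hidden_layer_sizes=(22, 10)),  # 2 hidden layers with 22 and 10 units
    "CART": DecisionTreeClassifier(max_depth=6, min_samples_leaf=0.02),  # 2% min. leaf size
    "k-NN": KNeighborsClassifier(n_neighbors=10),
    "RF":   RandomForestClassifier(n_estimators=800, max_features=5),  # 800 trees, 5 sampled vars
}
```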
Appendix A. Supplementary data Supplementary data to this article can be found online at https://doi.org/10.1016/j.najef.2019.01.001.
References

Abdou, H., & Pointon, J. (2011). Credit scoring, statistical technique and evaluation criteria: A review of the literature. Intelligent Systems in Accounting, Finance and Management, 18, 59–88.
Akkoc, S. (2012). An empirical comparison of conventional techniques, neural networks and the three stage hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data. European Journal of Operational Research, 222, 168–178.
Ala'raj, M., & Abbod, M. F. (2015). Classifiers consensus system approach for credit scoring. Knowledge-Based Systems, 104, 89–105.
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54, 627–635.
Carmichael, D. (2014). Modeling default for peer-to-peer loans. http://ssrn.com/abstract=2529240.
Chang, S., Kim, S. D.-O., & Kondo, G. (2015). Predicting default risk of Lending Club loans. Machine Learning, 1–5.
Chuang, C. L., & Lin, R. H. (2009). Constructing a reassigning credit scoring model. Expert Systems with Applications, 36, 1685–1694.
Cichosz, P. (2015). Data mining algorithms: Explained using R (1st ed.). New Jersey: John Wiley and Sons.
Duarte, J., Siegel, S., & Young, L. (2012). Trust and credit: The role of appearance in peer-to-peer lending. Review of Financial Studies, 25, 2455–2483.
Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77, 103–123.
Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160, 523–541.
Hernandez-Orallo, J., Flach, P., & Ferri, C. (2011). Brier curves: A new cost-based visualisation of classifier performance. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 585–592.
Herzenstein, M., Dholakia, U. M., & Andrews, R. L. (2011). Strategic herding behavior in peer-to-peer loan auctions. Journal of Interactive Marketing, 25, 27–36.
Huang, C.-L., Chen, M.-C., & Wang, C.-J. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 33, 847–856.
Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support vector machines in R. Journal of Statistical Software, 15, 1–28.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (5th ed.). New York: Springer.
Lessmann, S., Baesens, B., Seow, H. V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247, 124–136.
Lichman, M. (2013). UCI machine learning repository. Irvine: University of California.
Lin, M., Prabhala, N., & Viswanathan, S. (2013). Judging borrowers by the company they keep: Friendship networks and information asymmetry in online peer-to-peer lending. Management Science, 59, 17–35.
Malekipirbazari, M., & Aksakalli, V. (2015). Risk assessment in social lending via random forests. Expert Systems with Applications, 42, 4621–4631.
Mays, E. (2001). Handbook of credit scoring. Chicago: Global Professional Publishing.
Namvar, E. (2013). An introduction to peer to peer loans as investments. Journal of Investment Management, 12, 1–18.
Polena, M., & Regner, T. (2018). Determinants of borrowers' default in P2P lending under consideration of the loan risk class. Games, 9, 82.
Pope, D. G., & Sydnor, J. R. (2011). What's in a picture? Evidence of discrimination from Prosper.com. Journal of Human Resources, 46, 53–92.
Pundir, S., & Seshadri, R. (2012). A novel concept of partial Lorenz curve and partial Gini index. International Journal of Engineering, Science and Innovative Technology, 1, 296–301.
Rufibach, K. (2010). Use of Brier score to assess binary predictions. Journal of Clinical Epidemiology, 63, 938–939.
Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1, 317–328.
Serrano-Cinca, C., Gutierrez-Nieto, B., & Lopez-Palacios, L. (2015). Determinants of default in P2P lending. PLoS ONE, 10, 1–22.
Tsai, K., Ramiah, S., & Singh, S. (2014). Peer lending risk predictor. Stanford: Stanford University CS229 Project Report.
Tsai, M. C., Lin, S. P., Cheng, C. C., & Lin, Y. P. (2009). The consumer loan default predicting model: An application of DEA-DA and neural network. Expert Systems with Applications, 36, 11682–11690.
Wendler, T., & Gröttrup, S. (2016). Data mining with SPSS Modeler. Switzerland: Springer International Publishing.
Wu, J. (2014). Loan default prediction using Lending Club data. http://www.wujiayu.me/assets/projects/loan-default-predictionJiayu-Wu.pdf.
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36, 2473–2480.
Zhang, D., Huang, H., Chen, Q., & Jiang, Y. (2007). A comparison study of credit scoring models. Proceedings of the Third International Conference on Natural Computation, 1, 15–18.
Zhang, J., & Liu, P. (2012). Rational herding in microloan markets. Management Science, 58, 892–912.