Multivariable data imputation for the analysis of incomplete credit data

Expert Systems With Applications 141 (2020) 112926

Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Qiujun Lan a,∗, Xuqing Xu a, Haojie Ma a, Gang Li b,c,1

a Business School of Hunan University, Changsha 410082, China
b School of Information Technology, Deakin University, Geelong, VIC 3216, Australia
c Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China

Article info

Article history: Received 15 November 2018; Revised 5 April 2019; Accepted 4 September 2019; Available online 5 September 2019

Keywords: Bayesian network; Credit scoring; Data missing; Data mining

Abstract

Missing data significantly reduce the accuracy and usability of credit scoring models, especially in multivariate missing cases. Most credit scoring models address this problem by deleting the missing instances from the dataset or by imputing missing values with the mean, mode, or regression values. However, these methods often result in a significant loss of information or in bias. We proposed a novel method called BNII to impute missing values, which can be helpful for intelligent credit scoring systems. The proposed BNII algorithm consisted of two stages: the preparatory stage and the imputation stage. In the first stage, a Bayesian network with all of the attributes in the original dataset was constructed from the complete dataset so that both the network structure that implied the dependencies between variables and the parameters of each variable’s conditional distribution could be learned. In the second stage, the variables with missing values were iteratively imputed using the Bayesian network model from the first stage. The algorithm was found to be monotonically convergent. The most significant advantages of the method are that it exploits the inherent probability-dependent relationships between variables without requiring a specific probability distribution hypothesis, and that it is suitable for multivariate missing cases. Three datasets were used for the experiments: one was a real dataset from a famous P2P financial company in China, and the other two were benchmark datasets provided by UCI. The experimental results showed that BNII performed significantly better than other well-known imputation techniques. This suggested that the proposed method can be used to improve the performance of a credit scoring system and can be extended to other expert and intelligent systems. © 2019 Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail addresses: [email protected] (Q. Lan), [email protected] (X. Xu), [email protected] (H. Ma), [email protected] (G. Li).
1 This research work was completed when Gang Li was on ASL in Chinese Academy of Sciences, and we thank Deakin University for the support of ASL 2019 fund.
https://doi.org/10.1016/j.eswa.2019.112926
0957-4174/© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

For decades, credit scoring has been used by lenders as a credit risk assessment tool and an important means to reduce information asymmetry (Einav, Jenkins & Levin, 2013). In order to properly assess the borrower’s ability and willingness to repay debt on time, financial institutions collect various information about borrowers from their applications and from credit bureaus, including monthly income, outstanding debt, geographical data, borrowing history, and repayment actions (Bequé & Lessmann, 2017). Using expert judgment or statistical analysis models, they then aggregate the information into a prediction of a borrower’s repayment behaviors or profitability (Abdou & Pointon, 2011).

In recent years, small and medium-sized enterprises (SMSEs) have played an increasingly important role in maintaining economic growth, easing employment pressure, and facilitating people’s livelihoods (Zhang, Li & Chen, 2014). The increasing number of SMSEs has increased the demand for quality credit services. At the same time, the scale of all kinds of consumer credit markets has also experienced rapid growth, which has further stimulated the demand by loan institutions for credit scoring models (Kano, Uchida, Udell & Watanabe, 2011).

Generally, three categories of credit scoring methods have been used. In the early stage, methods based on the subjective experience of credit experts, such as 5C, 5P, and LAPP, were commonly used by loan institutions (Louzada, Ara & Fernandes, 2016). Later, with the promotion of statistical techniques, regression analysis, linear discriminant analysis (LDA; Fisher, 1936), logistic regression (LR; Sohn, Dong & Jin, 2016; Walker & Duncan, 1967), and Probit regression (Bliss, 1934) were introduced into credit scoring


(Wiginton, 1980). Examples include the z-score model that was proposed by Altman (1968) and the risk-calc model of Moody’s. In recent years, machine learning techniques have also been introduced to credit scoring, including k-nearest neighbor (KNN; Zhou et al., 2014), support vector machine (SVM; Chen & Li, 2010; Hens & Tiwari, 2012), decision tree (DT; Kao, Chiu & Chiu, 2012), and neural network (NN; Chun & Huang, 2011; West, 2000), and those approaches are regarded as the mainstream techniques in this field (Chen, Ribeiro & Chen, 2016). As a result, intelligent expert credit scoring systems are widely used by credit institutions such as banks.

However, missing values are ubiquitous when conducting credit scoring on enterprises, especially for SMSEs (Gordini, 2014; Shen, Shen, Xu & Bai, 2009). In many applications, credit data have suffered from unavailability, scarcity, and incompleteness (Schafer, 1997). This issue significantly affects the accuracy and usability of credit scoring systems. The causes of missing data are diverse and complicated, and can include an unwillingness to respond to survey questions, data acquisition fraud, and measurement errors. Two strategies have been commonly employed in practice to overcome this challenge. One is to drop the missing instances from the original dataset, as done by Won, Kim and Bae (2012); the other is to preprocess the data by replacing the missing values with mean values, as done by Feng, Xiao, Zhong, Dong and Qiu (2019), Lessmann, Baesens, Seow and Thomas (2015), and Florez-Lopez (2010). Such methods work well when the percentage of missing data is quite small and when ignoring a test instance with missing values can be tolerated. However, given the scarcity of credit data, these methods are not always the best option (Schafer, 1997).
They have been shown to result in the loss of information and to introduce biases into the credit scoring processes that can prevent the discovery of important credit risk factors and lead to invalid conclusions. Therefore, we mainly focused on data imputation approaches to estimate the missing values under incomplete credit data scenarios. We believe that this work will be of great benefit to improving data quality in the preprocessing process of data mining, and, consequently, to improve the performance of credit scoring models. We presented a novel missing value imputation method that was demonstrated to be suitable for multivariable missing credit data. The proposed imputation method was inspired by the EM algorithm presented in Dempster, Laird and Rubin (1977), which was used to find the local maximum likelihood parameters of a statistical model by updating the parameters and likelihood function in an alternate iterative fashion. Combining an iterative mechanism and a Bayesian network classifier to estimate the missing values, our proposed method did the following: (1) introduced an iterative strategy that was based on increasing posterior probability to make the imputation results more fitting with the real distribution, which made the algorithm more accurate; (2) decreased the dependence on the hypothesis for probability distribution, which made the algorithm more applicable; and (3) considered all attributes in the original dataset as nodes to construct the Bayesian network in order to make the algorithm suitable for both single variable and multivariable missing data. The proposed method showed a good capability to impute missing values utilizing the entire knowledge in complete datasets, which suggested that it can be beneficial for credit scoring systems and decision makers. The proposed framework represented a significant step toward the development of robust expert and intelligent credit scoring systems. This paper is organized as follows. 
In Section 2, we present a literature review of related work. Our proposed BNII algorithm for missing data imputation is described in Section 3. The experimental setting and results are given in Section 4. Finally, Section 5 provides our concluding remarks.

2. Related work

Since missing data are common in all kinds of statistical analysis work, a great number of techniques have been proposed to deal with the issue. Existing techniques can generally be divided into two categories: deletion and imputation methods (Garciarena & Santana, 2017; Hong & Wu, 2011; Purwar & Singh, 2015).

The deletion method includes case deletion and variable deletion. Ignoring cases or variables with missing data is generally a convenient choice when the cardinality of missing data is relatively small, the missing data are homogeneously distributed, or the missing variable can be substituted by other variables. Because of its simplicity, this method is widely used in data preprocessing (Luengo, García & Herrera, 2010). Nevertheless, when the cardinality of missing data is large, or when ignoring them would result in a significant loss of information, techniques that fill the gaps in the data are more often recommended.

The imputation method is generally based on two types of information: the distribution of the missing variable itself and the correlation between the missing variable and other variables (Deb & Liew, 2016; Tutz & Ramzan, 2015). Typical data imputation methods based on the distribution of the missing variable itself include mean and mode value imputation (Garciarena & Santana, 2017; Luengo et al., 2010). The former is applicable to numerical variables, and the latter is applicable to nominal variables; both are simple and have been widely adopted. However, the main limitation of these kinds of methods is that replacing missing values with the mean or mode value results in a distorted estimate of the distribution function, which can in turn reduce the quality of the data mining result (Nuovo, 2011; Tutz & Ramzan, 2015). Therefore, in order to ensure that the data after imputation are closer to the real data, scholars have studied imputation methods based on the correlation between the missing variable and other variables.
Some of these methods include regression-based imputation (RI; Atem, Sampene & Greene, 2017; Shahbazi, Karimi, Hosseini, Yazgi & Torbatian, 2018), k-nearest neighbor-based imputation (kNNI, Batista & Monard, 2003; Aydilek & Arslan, 2012), and expectation maximization imputation (EMI, Dempster et al., 1977; Schneider, 2001). The RI method imputes missing values by establishing regression equations. This method first divides the full dataset (DFull ) into two subsets of data, one having records with missing values (DMiss ), and the other having records without missing values (DComplete ). Then, it estimates the regression equations on DComplete by regarding the missing variables in DMiss as the dependent variables and others as the independent variables. The type of the regression can be chosen according to the different types of missing variables (e.g., logistical regression for categorical variables). Finally, the missing values are predicted using the corresponding regression equations built in the previous step. However, the main drawback of this method is that it incorrectly assumes that all variables are correlated linearly. The kNNI method imputes missing values using k similar records. This method first finds k records from the total dataset by using a suitable similarity measure. To impute a numerical missing value, the method utilizes the mean value of the specific variable within the k most similar records of the entire dataset. If the variable with missing values is categorical, then the method utilizes the mode value of the variable within the k most similar records. kNNI is a simple method that performs well on datasets that have a strong local correlation structure. However, the method can be expensive for large datasets, because for each record with missing value(s), it needs to find k similar records by searching the whole dataset. Moreover, how to find the most suitable similarity


Table 1. The advantages and disadvantages of typical methods to address missing data.

Method: Deletion method
References: Furlow, Fouladi, Gagne and Whittaker (2007), Little and Rubin (2002), Luengo et al. (2010)
Advantages: Simple and easy to apply
Disadvantages: Degrades the quality of estimations by removing information present in instances containing missing values

Method: Mean and mode value imputation
References: Garciarena and Santana (2017), Luengo et al. (2010), Nuovo (2011), Tutz and Ramzan (2015)
Advantages: Simple to apply, and valuable data are not deleted
Disadvantages: The existing relationships between the attributes are ignored; the variance of the single variables involved and the covariance with other variables are reduced (both are underestimated)

Method: RI
References: Atem et al. (2017), Gelman and Hill (2006), Shahbazi et al. (2018)
Advantages: Exploits the existing linear correlation between the attributes to estimate the missing data
Disadvantages: May lose performance when built from data in which the attributes are poorly or nonlinearly correlated with each other; the attributes used as independent variables in the regression equations must be observable, so this method is not suitable for multivariate missing cases

Method: KNNI
References: Batista and Monard (2003), Aydilek and Arslan (2012), Pan and Li (2010), Tutz and Ramzan (2015)
Advantages: Exploits the similarity between records to infer the missing data
Disadvantages: Its performance depends on k and on the similarity measure; the need to compare all instances to find the nearest neighbors results in a high time complexity

Method: EMI
References: Dempster et al. (1977), Schafer (2010), Schneider (2001)
Advantages: Exploits the inherent probability-dependent relationship between variables to estimate the missing data; suitable for multivariate missing cases
Disadvantages: Must solve highly complex likelihood equations or set a specific probability distribution hypothesis; local optima problem

measures for different datasets is also an open issue. These are the main drawbacks of this method (Batista & Monard, 2003). The EMI method is an iterative algorithm for estimating parameters of probability distribution with hidden variables using maximum likelihood estimation (MLE). Hidden variables are unobservable random variables, which can be regarded as missing variables. The EMI algorithm starts with an initial estimate of parameters of a known probability distribution, and it iterates until the imputed values and the estimates of parameters stop changing appreciably from the current iteration to the next. The EMI algorithm is only applicable to datasets in which the missing values are missing at random. The main drawback of this method is that for estimating parameters through MLE, the probability distribution function of the dataset is required, which is commonly assumed to be a multivariate normal distribution (Dempster et al., 1977). The brief information given above aims to describe the main idea of each method, together with their advantages and disadvantages. The results of these studies are summarized in Table 1.
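As a rough illustration of the difference between distribution-based and correlation-based imputation for categorical data, the sketch below contrasts mode imputation with a kNNI-style vote. The toy records, the attribute names, the Hamming distance, and the function names `mode_impute` and `knn_impute` are all assumptions made for this example; they are not taken from the cited studies.

```python
from collections import Counter

def mode_impute(column):
    """Fill None entries of one categorical column with its most frequent value."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def knn_impute(record, complete, target, k=3):
    """Impute record[target] with the mode of that attribute among the k
    complete records closest to `record` (Hamming distance on observed attributes)."""
    features = [f for f in record if f != target and record[f] is not None]

    def distance(other):
        return sum(record[f] != other[f] for f in features)

    neighbours = sorted(complete, key=distance)[:k]
    return Counter(n[target] for n in neighbours).most_common(1)[0][0]

# Hypothetical complete records, used only to exercise the two functions.
complete = [
    {"job": "Labor", "housing": "Own", "income": "High"},
    {"job": "Labor", "housing": "Own", "income": "High"},
    {"job": "No job", "housing": "Rent", "income": "Low"},
    {"job": "No job", "housing": "Rent", "income": "Low"},
    {"job": "No job", "housing": "Own", "income": "Low"},
]
query = {"job": "Labor", "housing": "Own", "income": None}

print(mode_impute([r["income"] for r in complete] + [None])[-1])  # global mode
print(knn_impute(query, complete, "income", k=3))                 # local vote
```

In this toy case the global mode of income is Low, while the three nearest neighbours of the query record vote for High, which is the kind of local structure that kNNI is meant to exploit. The full scan in `sorted` is also where the cost for large datasets comes from.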

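Table 2. Full dataset DFull.

| Record | Education | Job      | Income | Number of credits | Housing |
|--------|-----------|----------|--------|-------------------|---------|
| R1     | High      | Business | High   | 1–3               | Own     |
| R2     | ?         | Labor    | ?      | 1–3               | Own     |
| R3     | High      | Labor    | High   | More than 3       | Own     |
| R4     | Low       | Business | High   | 0                 | Rent    |
| R5     | Low       | Labor    | Low    | 1–3               | ?       |
| R6     | Low       | No job   | Low    | 1–3               | Rent    |
| R7     | Low       | No job   | Low    | 1–3               | Rent    |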

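Table 3. Complete dataset DComplete.

| Record | Education | Job      | Income | Number of credits | Housing |
|--------|-----------|----------|--------|-------------------|---------|
| R1     | High      | Business | High   | 1–3               | Own     |
| R3     | High      | Labor    | High   | More than 3       | Own     |
| R4     | Low       | Business | High   | 0                 | Rent    |
| R6     | Low       | No job   | Low    | 1–3               | Rent    |
| R7     | Low       | No job   | Low    | 1–3               | Rent    |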
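The relation between Tables 2 and 3 is a plain filter on completeness, the same DFull-to-DComplete/DMiss split described for the RI method above. A minimal sketch, assuming that `None` encodes the "?" entries and that ASCII "1-3" stands for the range 1–3 (the function name `split_by_completeness` is illustrative):

```python
# Table 2 (D_Full); None stands for the "?" entries.
COLUMNS = ("Education", "Job", "Income", "Number of credits", "Housing")
D_FULL = {
    "R1": ("High", "Business", "High", "1-3", "Own"),
    "R2": (None, "Labor", None, "1-3", "Own"),
    "R3": ("High", "Labor", "High", "More than 3", "Own"),
    "R4": ("Low", "Business", "High", "0", "Rent"),
    "R5": ("Low", "Labor", "Low", "1-3", None),
    "R6": ("Low", "No job", "Low", "1-3", "Rent"),
    "R7": ("Low", "No job", "Low", "1-3", "Rent"),
}

def split_by_completeness(records):
    """Return (complete, missing): records without and with missing values."""
    complete = {k: v for k, v in records.items() if None not in v}
    missing = {k: v for k, v in records.items() if None in v}
    return complete, missing

d_complete, d_miss = split_by_completeness(D_FULL)
print(sorted(d_complete))  # ['R1', 'R3', 'R4', 'R6', 'R7']
print(sorted(d_miss))      # ['R2', 'R5']
```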

3. Proposed approach

The BNII algorithm proposed in our study consisted of two stages. The first stage was the preparatory stage. In this stage, we created two datasets from the original dataset. The first dataset, denoted as the complete dataset (DComplete), contained the records with no missing values. The second dataset, denoted as the incomplete dataset (DMiss), contained the records with some missing attribute values. Then, considering all of the attributes in the original dataset as nodes, a Bayesian network was constructed from DComplete. In this way, both the network structure that implied the dependencies between variables and the parameters of each variable’s conditional distribution were learned in this stage, and this step formed the basis of our imputation algorithm. In addition, the mean or mode value of each possible missing variable was calculated in this stage for use in the second stage.

In the second stage, similar to the expectation maximization imputation (EMI) algorithm, the main task was to impute the missing variables iteratively until convergence was achieved. First, we made initial simple guesses for the missing values (e.g., mean imputation). Then, we re-imputed each missing variable using the Bayesian network model that was trained in the first stage. Once all of the variables had been re-imputed, the procedure was repeated. The final imputation values were determined when the iteration limit or a pre-specified threshold was reached. The framework of the BNII algorithm is shown in Fig. 1. The following subsections present this algorithm with an illustrative example.

Fig. 1. The framework structure of the BNII algorithm.

3.1. Preparatory stage

In the preparatory stage, a full dataset (DFull), as shown in Table 2, was first divided into two sub-datasets (where ? stands for a missing value). One subset contained the records with missing values (DMiss), and the other contained the records with no missing values (DComplete). Tables 3 and 4 show the resulting DComplete and DMiss, respectively.

Table 4. Missing value dataset DMiss.

| Record | Education | Job   | Income | Number of credits | Housing |
|--------|-----------|-------|--------|-------------------|---------|
| R2     | ?         | Labor | ?      | 1–3               | Own     |
| R5     | Low       | Labor | Low    | 1–3               | ?       |

Next, a Bayesian network on all attributes in the original dataset was trained from DComplete. A Bayesian network model is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Training a Bayesian network consists of two main steps: structural learning and parameter learning. Let G be a Bayesian network with nodes $X_1, \ldots, X_n$. If there is a directed edge from $X_i$ to $X_j$, then $X_i$ is called a parent of $X_j$, denoted $\pi(X_j)$. Given its parents, a node is conditionally independent from all other nodes. Thus, the joint distribution of all the nodes can be written as:

$$P(X) = P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i)), \tag{1}$$

where $\pi(X_i)$ denotes the parent variables of the node $X_i$. The predicted value of $X_i$ can be expressed as follows:

$$X_i = \arg\max_{X_i \in V} \{P(X_i \mid e)\}, \tag{2}$$

$$P(X_i \mid e) \propto P(e_c \mid X_i)\, P(X_i \mid e_p), \tag{3}$$

where $e$ represents all evidence (i.e., the values of the variables on nodes other than $X_i$), $e_p$ is the evidence of the parent nodes, and $e_c$ is the evidence of the child nodes. Eq. (3) expresses only a proportionality; however, at the end of the calculation, we normalized the probabilities over the states of $X_i$.

The Bayesian network that was trained from DComplete in our example is shown in Fig. 2. For the sake of simplification, the Bayesian network structure was provided manually, and the parameters were estimated using the MLE method. The joint probability function was as follows:

$$P(E, J, I, N, H) = P(E)\, P(J)\, P(I \mid E, J)\, P(N \mid I)\, P(H \mid I), \tag{4}$$

where the names of the variables were abbreviated to E = Education (High/Low), J = Job (Business/Labor/No job), I = Income (High/Low), N = Number of credits (0/1–3/More than 3), and H = Housing (Own/Rent). According to the Bayesian network structure in Fig. 2, the children of node I were nodes N and H, and the parents of node I were nodes E and J. When the values of the other variables were given, we could calculate the conditional distribution of income as follows:

$$P(I \mid e) \propto P(e_c \mid I)\, P(I \mid e_p) = P(e_N, e_H \mid I)\, P(I \mid e_E, e_J) = P(e_N \mid I)\, P(e_H \mid I)\, P(I \mid e_E, e_J), \tag{5}$$

where $e_N$, $e_H$, $e_E$, and $e_J$ represent the value of each node given by the evidence. In addition, the mean and mode values of each possible missing variable were calculated at this stage for use in the second stage.

Fig. 2. The Bayesian network built from the DComplete in our example.

3.2. Imputation stage

Let R represent a record with missing values in DMiss, and let $M = \{m_1, \ldots, m_p\}$ and $C = \{c_1, \ldots, c_q\}$, where M denotes the set of variables with missing values and C is the set of complete variables. The union of M and C contains all of the variables in R. Then, record R could be written as $R = (m_1, \ldots, m_p, c_1, \ldots, c_q)$. For each record in DMiss, its missing values were imputed in the following iterative fashion. First, the mode or mean of each missing variable was used to initialize the missing variables according to the specific variable type. After initialization, the missing record could be expressed as $R^{(0)} = (m_1^{(0)}, \ldots, m_p^{(0)}, c_1, \ldots, c_q)$. Next, we re-imputed the values of the missing variables in order. Once all of the variables had been re-imputed, an iteration was complete. In the (t + 1)th iteration, for each missing variable $m_j$ (j = 1,…,p), a new value was predicted to replace the old one by applying the Bayesian network model that was trained in the first stage:

$$m_j^{(t+1)} = \arg\max_{m_j \in V} \{P(m_j \mid e_j^{(t)})\}, \tag{6}$$

$$e_j^{(t)} = \left\{ e_{m_1}^{(t+1)}, \ldots, e_{m_{j-1}}^{(t+1)}, e_{m_{j+1}}^{(t)}, \ldots, e_{m_p}^{(t)}, e_{c_1}, \ldots, e_{c_q} \right\}, \tag{7}$$

where $e_j^{(t)}$ represents all of the evidence available when $m_j$ is re-imputed in the (t + 1)th iteration (i.e., the values of the variables on nodes other than $m_j$, after the value of $m_{j-1}^{(t)}$ has been replaced). Moreover, $e_{m_{j-1}}^{(t+1)}$ represents the value of node $m_{j-1}$ in the (t + 1)th iteration, and $e_{m_{j+1}}^{(t)}$ represents the value of node $m_{j+1}$ in the t-th iteration. After the (t + 1)th iteration, the missing record could be expressed as $R^{(t+1)} = (m_1^{(t+1)}, \ldots, m_p^{(t+1)}, c_1, \ldots, c_q)$.

To illustrate how the BNII algorithm imputed the missing values, we used R2 in Table 4 as an example. R2 could be written as R2 = (?, labor, ?, 1–3, own). First, the modes of the categorical missing variables Education and Income in DComplete were used to impute the missing values. After initialization, R2 could be expressed as $R2^{(0)}$ = (low, labor, high, 1–3, own), which is the first imputed row of Table 5. In the first iteration, each missing variable was re-imputed by applying the Bayesian network model trained in the first stage in the following ways.

Re-impute variable $m_1$ (Education):

$$m_1^{(1)} = \arg\max_{m_1 \in V} \{P(m_1 \mid e_1^{(0)})\}, \tag{8}$$

$$e_1^{(0)} = \left\{ e_{m_2}^{(0)}, e_{c_1}, e_{c_2}, e_{c_3} \right\} = \{I = \text{High}, J = \text{labor}, N = 1\text{–}3, H = \text{own}\}, \tag{9}$$

$$P(E = \text{High} \mid e_1^{(0)}) \propto P(I = \text{High} \mid E = \text{High}, J = \text{labor})\, P(E = \text{High}) = 1 \times \frac{2}{5} = \frac{2}{5}, \tag{10}$$

$$P(E = \text{low} \mid e_1^{(0)}) \propto P(I = \text{High} \mid E = \text{low}, J = \text{labor})\, P(E = \text{low}) = 0 \times \frac{3}{5} = 0. \tag{11}$$

Because $P(E = \text{High} \mid e_1^{(0)}) > P(E = \text{low} \mid e_1^{(0)})$, we estimated that $m_1^{(1)}$ = high.

Re-impute variable $m_2$ (Income):

$$m_2^{(1)} = \arg\max_{m_2 \in V} \{P(m_2 \mid e_2^{(0)})\}, \tag{12}$$

$$e_2^{(0)} = \left\{ e_{m_1}^{(1)}, e_{c_1}, e_{c_2}, e_{c_3} \right\} = \{E = \text{High}, J = \text{labor}, N = 1\text{–}3, H = \text{own}\}, \tag{13}$$

$$P(I = \text{High} \mid e_2^{(0)}) \propto P(N = 1\text{–}3 \mid I = \text{High})\, P(H = \text{own} \mid I = \text{High})\, P(I = \text{High} \mid E = \text{High}, J = \text{labor}) = \frac{2}{3} \times \frac{1}{3} \times 1 = \frac{2}{9}, \tag{14}$$

$$P(I = \text{low} \mid e_2^{(0)}) \propto P(N = 1\text{–}3 \mid I = \text{low})\, P(H = \text{own} \mid I = \text{low})\, P(I = \text{low} \mid E = \text{High}, J = \text{labor}) = 1 \times \frac{1}{2} \times 0 = 0. \tag{15}$$

Table 5. Iteration results.

| R2     | Education (m1) | Income (m2) | Job (c1) | Number of credits (c2) | Housing (c3) |
|--------|----------------|-------------|----------|------------------------|--------------|
| R2     | ?              | ?           | labor    | 1–3                    | own          |
| R2(0)  | low            | high        | labor    | 1–3                    | own          |
| R2(1)  | high           | high        | labor    | 1–3                    | own          |
| R2(2)  | high           | high        | labor    | 1–3                    | own          |
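The trace in Table 5 can be reproduced with a short sketch. It assumes the network structure of Eq. (4), estimates every conditional probability by relative frequency in Table 3, and treats a parent configuration that never occurs in DComplete as having probability 0, which is what the worked example does for P(I = High | E = low, J = labor). The names `cond_prob`, `blanket_score`, and `bnii_impute` are illustrative and are not the authors' implementation; ASCII "1-3" and ">3" stand for the categories 1–3 and More than 3.

```python
from collections import Counter

# Parents of each node, read off Eq. (4): P(E)P(J)P(I|E,J)P(N|I)P(H|I).
PARENTS = {"E": [], "J": [], "I": ["E", "J"], "N": ["I"], "H": ["I"]}

# D_Complete from Table 3 (E=Education, J=Job, I=Income, N=credits, H=Housing).
D_COMPLETE = [
    {"E": "High", "J": "Business", "I": "High", "N": "1-3", "H": "Own"},   # R1
    {"E": "High", "J": "Labor",    "I": "High", "N": ">3",  "H": "Own"},   # R3
    {"E": "Low",  "J": "Business", "I": "High", "N": "0",   "H": "Rent"},  # R4
    {"E": "Low",  "J": "No job",   "I": "Low",  "N": "1-3", "H": "Rent"},  # R6
    {"E": "Low",  "J": "No job",   "I": "Low",  "N": "1-3", "H": "Rent"},  # R7
]

def cond_prob(var, record, data):
    """Relative-frequency estimate of P(var = record[var] | parents in record).
    Assumption: a parent configuration never seen in `data` yields 0."""
    rows = [r for r in data if all(r[p] == record[p] for p in PARENTS[var])]
    if not rows:
        return 0.0
    return sum(r[var] == record[var] for r in rows) / len(rows)

def blanket_score(var, value, record, data):
    """Unnormalized P(var = value | evidence) as in Eq. (3): the node's own
    conditional probability times the conditional probabilities of its children."""
    trial = {**record, var: value}
    score = cond_prob(var, trial, data)
    for child, parents in PARENTS.items():
        if var in parents:
            score *= cond_prob(child, trial, data)
    return score

def bnii_impute(record, data, max_iter=20):
    """Return [R(0), R(1), ...] for one record; None marks a missing value."""
    missing = [v for v in PARENTS if record[v] is None]
    current = dict(record)
    for v in missing:  # initial guess: mode of the variable in D_Complete
        current[v] = Counter(r[v] for r in data).most_common(1)[0][0]
    trace = [dict(current)]
    for _ in range(max_iter):
        previous = dict(current)
        for v in missing:  # re-impute in order, using the freshest values
            candidates = sorted({r[v] for r in data})
            current[v] = max(candidates,
                             key=lambda c: blanket_score(v, c, current, data))
        trace.append(dict(current))
        if current == previous:  # nothing changed in this pass: stop
            break
    return trace

r2 = {"E": None, "J": "Labor", "I": None, "N": "1-3", "H": "Own"}
for t, state in enumerate(bnii_impute(r2, D_COMPLETE)):
    print(f"R2({t}): Education={state['E']}, Income={state['I']}")
```

Run on R2, the sketch prints Low/High for R2(0) and High/High for R2(1) and R2(2), matching Table 5. Because only the factors that involve the re-imputed variable are multiplied, the score is proportional to the full joint probability of Eq. (4) with all other values held fixed.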

Because $P(I = \text{High} \mid e_2^{(0)}) > P(I = \text{low} \mid e_2^{(0)})$, we estimated that $m_2^{(1)}$ = high. In this way, all of the missing variables were re-imputed, the first iteration was completed, and record R2 was updated to $R2^{(1)}$. However, because $R2^{(0)} \neq R2^{(1)}$, we had to continue the iteration. In the second iteration, for the missing variable $m_1$ (Education), the iteration was continued as follows:

$$m_1^{(2)} = \arg\max_{m_1 \in V} \{P(m_1 \mid e_1^{(1)})\}, \tag{16}$$

$$e_1^{(1)} = \left\{ e_{m_2}^{(1)}, e_{c_1}, e_{c_2}, e_{c_3} \right\} = \{I = \text{High}, J = \text{labor}, N = 1\text{–}3, H = \text{own}\}, \tag{17}$$

$$P(E = \text{High} \mid e_1^{(1)}) \propto P(I = \text{High} \mid E = \text{High}, J = \text{labor})\, P(E = \text{High}) = 1 \times \frac{2}{5} = \frac{2}{5}, \tag{18}$$

$$P(E = \text{low} \mid e_1^{(1)}) \propto P(I = \text{High} \mid E = \text{low}, J = \text{labor})\, P(E = \text{low}) = 0 \times \frac{3}{5} = 0. \tag{19}$$

Because $P(E = \text{High} \mid e_1^{(1)}) > P(E = \text{low} \mid e_1^{(1)})$, $m_1^{(2)}$ = high. Similarly, variable $m_2$ (Income) was re-imputed as $m_2^{(2)}$ = high. The second iteration was completed, and record R2 was updated to $R2^{(2)}$. Because $R2^{(2)} = R2^{(1)}$, we could stop the iteration and output $R2^{(2)}$ as the final imputation result. Table 5 shows each iteration result in this example.

3.3. Proof of the convergence of the BNII algorithm

Our proposed BNII algorithm is presented below.

BNII algorithm
Step 1: Deconstruct the full dataset into complete and missing value data subsets: DFull = DComplete + DMiss.
Step 2: Train a Bayesian network containing all attributes as nodes from DComplete.
Step 3: Rename all attributes in DMiss. Let R represent a record with missing values in DMiss, let M = {m1,…,mp} denote the set of variables with missing values, and let C = {c1,…,cq} denote the set of complete variables. p plus q is equal to the total number of variables.
Step 4: Impute the missing values:
    FOR each missing record R in DMiss DO
        Set the iteration counter t = 0
        Initialize R(0) = (m1(0),…,mp(0), c1,…,cq)
        REPEAT
            FOR j = 1 to p DO
                mj(t+1) = arg max over mj ∈ V of {P(mj | ej(t))}
            END FOR
            t = t + 1
        UNTIL the difference between the corresponding variables in R(t+1) and R(t) is less than 10^−5
    END FOR

In the second stage (the imputation stage), the iterative imputation process converged, and this was proven as follows:

Proof. Let R represent a record with missing values, $R = (m_1, \ldots, m_p, c_1, \ldots, c_q)$. The probability of R can be written as:

$$P(R) = P(m_1, \ldots, m_p, c_1, \ldots, c_q) = P(m_j \mid \chi)\, P(\chi), \tag{20}$$

where $\chi$ represents all variables other than $m_j$. After $m_j$ was re-imputed, it was updated to $m_j'$. At this point, the probability of R could be revised as:

$$P(R') = P(m_1, \ldots, m_j', \ldots, m_p, c_1, \ldots, c_q) = P(m_j' \mid \chi)\, P(\chi). \tag{21}$$

Since the imputation process was based on the Bayesian network model trained in the first stage, the corresponding maximum posterior probability estimate was returned. As a result, $P(m_j \mid \chi) \le P(m_j' \mid \chi)$, namely $P(R) \le P(R')$. It can be seen that this iterative imputation process made the joint probability of a record R increase monotonically. Since the number of combinations of the imputation values was limited, and the maximum probability value was 1, the joint probability of a record R was bounded. Therefore, the process converged.

4. Experiments and results

In order to verify the validity of the BNII algorithm in credit scoring, we used three credit datasets as the experimental data. One was from the Renrendai website, a famous P2P financial company in China, and the other two (German and Australia) were benchmark UCI datasets. The experiments entailed comparing our algorithm with the mode value imputation and EM imputation methods in two aspects: the imputation accuracy and the performance of the credit scoring model after imputation.

4.1. Datasets

The German dataset was from the UCI database and contains the credit card data of a German bank. The dataset contained 20 attributes and one class variable. The class had two values, {good, bad}, which represented a customer with good credit or with bad credit. The German dataset contained a total of 1000 records, out of which 300 customers had poor credit and 700 customers had good credit.

The Australia dataset was also a UCI credit dataset; it was derived from the credit card business of an Australian bank. There were 690 customer samples in the dataset, out of which 383 had good credit and 307 had bad credit. The dataset contained 14 attributes and one class variable. In order to avoid leaking the personal privacy of customers, the names of the attribute variables and the values of the categorical variables were replaced with letters.

The Renrendai dataset came from the company’s website. Renrendai is one of the most influential companies in the P2P loan industry. We crawled the successful borrowing records released by the company on its website from August 2014 to October 2015. The dataset contained 12 attributes and one class variable; it divided customers into bad customers who had defaulted and good customers who had paid off their loans. The actual number of samples used in the study was 30,256, among which the default record number was 2510, which accounted for 8.3% of the samples.

In order to make the datasets applicable to classification algorithms, we replaced the values of the categorical variables with numbers (e.g., 0, 1, 2, 3, …) according to their meanings. Moreover, interval variables were created from intervals on a contiguous scale of the numerical variables. To be specific, we divided the range of each numerical variable into several intervals of the form (a, b), (b, c), …, and then assigned each interval an ordinal number (e.g., 1, 2, …).
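The preprocessing described above can be sketched as follows. The cut points, the category order, and the function names `to_interval_code` and `encode_categories` are hypothetical, since the text does not list the actual interval boundaries or codings used for the three datasets.

```python
from bisect import bisect_right

def to_interval_code(value, cut_points):
    """Map a numerical value to an ordinal interval number starting at 1.
    With cut points (b, c, ...), values below b get 1, values in [b, c) get 2, ..."""
    return bisect_right(cut_points, value) + 1

def encode_categories(values, order):
    """Replace categorical labels with integers 0, 1, 2, ... following `order`."""
    mapping = {label: code for code, label in enumerate(order)}
    return [mapping[v] for v in values]

# Hypothetical loan amounts and cut points, for illustration only.
cut_points = [1000, 5000, 10000]
amounts = [500, 1000, 7500, 20000]
print([to_interval_code(a, cut_points) for a in amounts])

# Hypothetical ordered categories.
print(encode_categories(["low", "high", "medium"], order=["low", "medium", "high"]))
```

Treating a value that falls exactly on a boundary as belonging to the upper interval is a design choice of this sketch; the source does not say how boundary values were handled.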

4.2. Experiments

4.2.1. Comparison on the imputation accuracy

Imputation accuracy refers to how close the imputed value is to the true value: the smaller the difference between the two, the better the imputation. The experiment was designed as follows. First, we randomly selected 10% of the records from the original full dataset (DFull) as the missing dataset (DMiss), and the remaining records were taken as the complete dataset (DComplete). For DMiss, three variables were randomly selected as the missing variables, and all of the values of these three variables were removed. Second, we applied the mode value imputation, EMI, and BNII methods to impute the missing values in DMiss, and then compared the imputation accuracy of each missing variable based on the imputed values and the real values. For robustness, the final result was the average of 10 imputation experiments. The specific experimental process is shown in Fig. 3.

To evaluate the imputation accuracy of the three imputation methods, we chose two measures: the accuracy, p, and the root mean square error (RMSE). The RMSE is a frequently used measure of the difference between the values predicted by a model and the values actually observed. However, the RMSE is appropriate only for ordinal variables (e.g., housing) and interval variables (e.g., education); for nominal variables, it is meaningless. Let n be the number of records in DMiss and c be the number of correctly imputed missing values of a missing variable. The accuracy was defined as follows:

$$ p = \frac{c}{n}, \tag{22} $$

where the value of p ranges from 0 to 1, and 1 indicates perfect imputation. Let $A_i$ be the true value of the i-th missing value of a missing variable, $P_i$ be the corresponding imputed value, and $e_i = P_i - A_i$. The RMSE was defined as follows:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^{2}}. \tag{23} $$

The lower the RMSE value, the better the imputation.

Table 7. The description of the variables.

Dataset     Variable NO.   Variable          Variable range     Mode
German      X1             CheckingAccount   {0,1,2,3}          0
            X2             CreditAmount      {1,2,3,4}          1
            X3             History           {1,2,3}            2
Australia   X1             A5                {1,2,3,4,...,14}   8
            X2             A9                {0,1}              0
            X3             A13               {1,2,3,4,5}        4
Renrendai   X1             Duration          {1,2,3,4,5,6,7}    6
            X2             Amount            {1,2,3}            2
            X3             Working age       {1,2,3,4}          2

4.2.2. Comparison on the performance of credit scoring model after imputation

The purpose of missing data imputation is to improve the quality of the data so that the performance of the prediction model can be further improved. Similarly, when a credit scoring model is applied, new loan records often have missing values, and these missing values must be imputed before a credit risk prediction can be obtained. The more accurate the imputation, the better the performance of the credit scoring model. Therefore, when other conditions are the same, the performance of a credit scoring model reflects the capability of the imputation algorithm.

Our experiment was designed as follows. First, we randomly selected 10% of the records from the original full dataset (DFull) as the missing dataset (DMiss), which also served as the testing set of the credit scoring models. The remaining records were taken as the complete dataset (DComplete), which was used for imputation and as the training set of the credit scoring models. For DMiss, the same three variables as in the last subsection were selected as the missing variables, and all of their values were removed. Second, we applied the mode value imputation, EMI, and BNII methods to impute the missing values in DMiss and thus obtained a testing set without missing values. On DComplete, we used the Logistic Regression (LR) algorithm to train the credit scoring model and then applied it to predict the default probability for the imputed testing set. For robustness, the training and testing sets (DComplete and DMiss) were sampled with 10-fold cross-validation, and the final result was the average of the 10 rounds of experiments. The specific experimental process is shown in Fig. 4.

To evaluate the performance of the credit scoring model after imputation, we chose two measures: accuracy and AUC. Both are based on the confusion matrix shown in Table 6, which is commonly used as the basis for evaluation measures in classification prediction and credit scoring.

Table 6. Confusion matrix.

                 Prediction value
Real value       Positive               Negative
Positive         True Positive (TP)     False Negative (FN)
Negative         False Positive (FP)    True Negative (TN)

In credit scoring, each sample is classified into one of two classes: good credit (positive) or bad credit (negative). There are four possible outcomes from the classifier. A true positive (TP) is a good credit sample that is predicted as good credit. A false positive (FP) is a bad credit sample that is predicted as good credit. A true negative (TN) is a bad credit sample that is predicted as bad credit, and a false negative (FN) is a good credit sample that is predicted as bad credit.

(1) Accuracy is the ratio between the number of correctly predicted samples and the total number of samples, which was defined as follows:

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}. \tag{24} $$

Fig. 3. The process of the comparison on the imputation accuracy.

Fig. 4. The process of the comparison on the performance of the model.

Table 8. Examples of true values versus imputed values.

      German                      Australia                   Renrendai
      True value    Imputed value True value    Imputed value True value    Imputed value
NO.   X1  X2  X3    X1  X2  X3    X1  X2  X3    X1  X2  X3    X1  X2  X3    X1  X2  X3
1     0   1   5     0   1   5     1   10  4     1   10  1     7   2   2     5   3   2
2     1   2   3     2   4   5     0   8   4     0   4   3     7   3   1     6   3   2
3     1   1   3     1   2   3     0   11  4     0   10  4     5   3   2     4   3   2
4     1   1   5     0   4   4     0   1   4     0   1   4     6   3   2     6   2   2
5     0   1   5     0   4   5     0   8   4     0   7   5     4   3   3     4   3   4
6     1   1   3     0   1   3     0   3   4     1   6   1     7   3   1     6   2   1
7     0   1   3     0   1   3     1   8   4     1   8   3     3   1   2     2   1   4
8     0   1   3     0   1   5     0   8   4     1   9   4     5   3   2     5   3   2
9     1   1   3     1   1   3     0   8   4     1   9   4     2   1   3     5   3   1
10    0   4   5     0   4   2     1   8   4     1   14  5     6   3   2     6   2   2
11    1   4   3     0   4   3     1   8   4     1   13  4     3   1   2     4   3   2
12    1   1   5     0   4   5     1   8   4     1   9   2     5   1   3     6   2   2
13    0   1   5     2   2   5     0   11  4     0   10  4     6   3   2     6   2   2
14    1   4   5     2   3   5     0   8   4     0   7   4     7   2   1     7   3   1
15    1   1   5     1   1   1     0   8   4     0   9   5     5   3   1     4   3   2
16    0   1   3     0   1   3     0   8   1     0   4   1     6   2   2     6   2   2
17    1   4   5     2   4   3     0   8   4     0   11  1     5   3   2     4   3   2
18    0   1   3     1   1   3     0   8   4     0   2   4     7   3   1     6   2   1
19    1   2   3     2   4   5     1   8   1     1   8   1     6   2   2     6   2   4
20    2   4   5     2   4   4     1   8   2     1   11  2     5   3   2     6   3   2

(2) AUC is an extensively used evaluation measure defined as the area under the Receiver Operating Characteristic (ROC) curve. An ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The x-axis of the ROC curve represents the false positive rate, and the y-axis represents the true positive rate (sensitivity), which are defined as follows:

$$ \text{false positive rate} = \frac{FP}{TN + FP}, \tag{25} $$

$$ \text{true positive rate} = \frac{TP}{TP + FN}. \tag{26} $$

The AUC value lies between 0 and 1. The larger the AUC value, the better the classifier performance.
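As a rough illustration of the classification measures in Eqs. (24)–(26), the following pure-Python sketch computes the confusion-matrix counts, the accuracy, the two rates, and the AUC by sweeping the decision threshold over the predicted scores and applying the trapezoidal rule. The function names and the toy labels and scores are assumptions made for this example; the study itself reports using the sklearn library, whose implementation is not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN); label 1 = good credit (positive), 0 = bad credit."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Eq. (24)."""
    return (tp + tn) / (tp + fn + fp + tn)


def false_positive_rate(fp, tn):
    """Eq. (25)."""
    return fp / (tn + fp)


def true_positive_rate(tp, fn):
    """Eq. (26)."""
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """Area under the ROC curve; requires both classes to be present."""
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score gives one ROC point.
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fn, fp, tn = confusion_counts(y_true, y_pred)
        points.append((false_positive_rate(fp, tn), true_positive_rate(tp, fn)))
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy labels and predicted scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
counts = confusion_counts(y_true, [1 if s >= 0.5 else 0 for s in scores])
print(counts)                          # (3, 0, 1, 2)
print(round(accuracy(*counts), 3))     # 0.833
print(round(roc_auc(y_true, scores), 3))  # 0.889
```

In the toy data, eight of the nine (good, bad) pairs are ranked correctly by the scores, which is why the AUC equals 8/9.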

In this study, the EM imputation algorithm was implemented via the Amelia package in R, the LR algorithm was implemented via the sklearn machine learning library in Python, and the other algorithms were written in Python 3.

4.3. Results analysis

In this study, three variables were randomly selected from each of the German, Australia, and Renrendai datasets. The description of the variables is shown in Table 7. Table 8 shows the true and imputed values of the three missing variables in the three datasets after an imputation experiment using the BNII algorithm. Due to space limitations, only the first 20 missing records of DMiss are listed.

Table 9 shows the results of the first comparison, that is, the RMSE and the accuracy p of each missing variable after it was imputed by the BNII, mode value, and EM methods. Since the meanings of the variables in the Australia dataset are not disclosed, we could not determine whether they are ordinal. Therefore, their RMSE values were not calculated in the experiment. As Table 9 shows, in the German and Renrendai datasets, the RMSE of each missing variable imputed by BNII was lower than that obtained with the mode value and EMI methods. For example, the RMSE of the variable 'CheckingAccount' imputed by BNII was 1.267, which was lower than that of the mode value imputation (1.361) and of the EMI method (1.370). In addition, the accuracy of each missing variable imputed by BNII was higher than that obtained with the other two methods in all three datasets. For example, the accuracy of the variable 'CheckingAccount' imputed by BNII was 0.430, which was higher than that of the mode imputation (0.407) and that of the EMI method (0.317). Therefore, in terms of the RMSE and accuracy, the BNII method performed better than the mode and EM methods.
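To make the two imputation measures concrete, the sketch below computes the accuracy p of Eq. (22) and the RMSE of Eq. (23) for a mode value baseline. It is a minimal illustration only: the function names and the toy values are assumptions for this example and are not taken from the datasets or the results reported in the tables.

```python
from collections import Counter
from math import sqrt


def mode_impute(complete_column, n_missing):
    """Fill every missing entry with the most frequent value of the complete data."""
    mode_value = Counter(complete_column).most_common(1)[0][0]
    return [mode_value] * n_missing


def imputation_accuracy(true_values, imputed_values):
    """Eq. (22): p = c / n, the share of exactly recovered values."""
    c = sum(1 for a, p in zip(true_values, imputed_values) if a == p)
    return c / len(true_values)


def rmse(true_values, imputed_values):
    """Eq. (23): root mean square of the errors e_i = P_i - A_i."""
    n = len(true_values)
    return sqrt(sum((p - a) ** 2 for a, p in zip(true_values, imputed_values)) / n)


# Toy ordinal variable, for illustration only.
complete_column = [2, 1, 2, 3, 2, 1]   # values observed in the complete data
true_values = [2, 3, 1, 2]             # values removed from the missing data
imputed = mode_impute(complete_column, len(true_values))

print(imputed)                                     # [2, 2, 2, 2]
print(imputation_accuracy(true_values, imputed))   # 0.5
print(round(rmse(true_values, imputed), 3))        # 0.707
```

Because the mode baseline always predicts the same value, its accuracy equals the share of that value among the removed entries, which is one reason a method that uses the dependencies between variables can do better.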

Table 9. The results of the first comparison.

                                RMSE                     Accuracy p
Dataset     Variable            BNII    Mode    EMI      BNII    Mode    EMI
German      CheckingAccount     1.267   1.361   1.370    0.430   0.407   0.317
            CreditAmount        1.208   1.726   1.330    0.552   0.408   0.407
            History             1.065   1.207   1.522    0.698   0.529   0.405
Australia   A5                  –       –       –        0.325   0.201   0.088
            A9                  –       –       –        0.888   0.568   0.513
            A13                 –       –       –        0.357   0.301   0.217
Renrendai   Duration            1.057   1.749   2.292    0.544   0.322   0.184
            Amount              0.649   0.735   1.092    0.679   0.459   0.340
            Working age         0.992   1.084   1.545    0.583   0.375   0.266

Table 10. The results of the second comparison.

            Accuracy                 AUC
Dataset     BNII    Mode    EMI      BNII    Mode    EMI
German      0.676   0.671   0.650    0.686   0.656   0.652
Australia   0.859   0.842   0.829    0.914   0.910   0.898
Renrendai   0.922   0.907   0.919    0.965   0.962   0.963
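The sampling protocol behind the second comparison (each fold in turn serves as DMiss and testing set, with the selected variables removed, while the remaining records form DComplete) can be sketched as follows. This is a simplified illustration under assumptions: the function name, the record layout (a list of dictionaries), and the use of `None` to mark removed values are choices made here, and the imputation and Logistic Regression steps are deliberately left out.

```python
import random


def ten_fold_splits(records, missing_vars, k=10, seed=0):
    """Yield (d_complete, d_miss, truth) for each of the k folds.

    d_complete: records outside the fold (for imputation and training).
    d_miss:     records of the fold with missing_vars set to None.
    truth:      the removed values, kept to score the imputation.
    """
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)
    for f in range(k):
        fold = set(order[f::k])
        d_complete = [dict(r) for i, r in enumerate(records) if i not in fold]
        held_out = [records[i] for i in sorted(fold)]
        truth = [{v: r[v] for v in missing_vars} for r in held_out]
        d_miss = [{key: (None if key in missing_vars else val)
                   for key, val in r.items()} for r in held_out]
        yield d_complete, d_miss, truth


# Hypothetical records, for illustration only.
records = [{"X1": i % 4, "X2": i % 3, "X3": i % 5, "label": i % 2}
           for i in range(50)]
splits = list(ten_fold_splits(records, ["X1", "X2", "X3"]))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 45 5
```

With 50 records and 10 folds, every round holds out 10% of the records, which matches the proportion used for DMiss in the experiments.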

Table 10 shows the results of the second comparison, that is, the average prediction accuracy and the AUC value of the credit scoring models on the three credit datasets imputed by the BNII, mode, and EM methods under 10-fold cross-validation. Both the average accuracy and the AUC on the testing set imputed by BNII were the highest in all three datasets. For instance, in the German dataset, the accuracy on the test set imputed by BNII was 0.676, which was higher than that of the mode value imputation (0.671) and that of the EMI (0.650). In addition, the AUC value on the test set imputed by BNII was 0.686, which was also higher than that of the mode value imputation (0.656) and that of the EMI (0.652). The results showed that the credit scoring model performed best after imputation by BNII, which indicated that BNII had a better capability to restore the original data than the mode imputation and EMI methods did.

In summary, from the above two experiments, we found that the proposed algorithm had better imputation accuracy and was more beneficial to the performance of the credit scoring model than the mode imputation and EM imputation methods. This indicated that the BNII method was better able to solve the problem of multivariable missing data in the credit scoring model. However, unexpectedly, the experimental results also showed that the performance of the EMI method was lower than that of the mode value imputation method in most cases. This was because the true distribution of the original data differed too much from the multivariate normal distribution assumed by the EMI. In this case, the imputed values deviated from the true distribution and introduced noise instead.

5. Conclusions

We proposed a new imputation method called the BNII algorithm for multivariate missing credit data. The proposed method viewed the imputation of missing values as an optimization problem and solved it by combining an iterative mechanism and data mining techniques. The BNII algorithm consisted of two stages: fully representing the relationships among different attributes with a Bayesian network, and iteratively imputing missing values to find better estimates until a local maximum of the posterior probability was reached. The proposed imputation algorithm can be used to improve the performance of a credit scoring system, which can help financial institutions make more scientific and accurate decisions and reduce economic losses. Therefore, it would serve as a significant improvement for big data analytics in credit risk management.

The main advantages of the proposed method over traditional methods are its novelty in combining an iterative mechanism with data mining techniques, and its superiority in terms of adaptability and flexibility.

First, our method combined an iterative mechanism and a Bayesian network classifier, which introduced the following desired features:

(1) By considering all of the attributes in the original dataset as nodes when constructing the Bayesian network, our method fully represented the relationships among different attributes and utilized more relevant information to estimate missing values than other methods did. First, some simple imputation methods, such as the mean and mode imputation methods, do not explore relationships between attributes, which always leads to an underestimate of the variance and reduces the quality of the data mining result (Nuovo, 2011; Tutz & Ramzan, 2015). Second, for other methods based on a prediction model, such as regression imputation, the attributes used as independent variables to establish the regression equations must be observable. In other words, in the multivariable missing data scenario, such methods cannot utilize information about other missing attributes to predict the missing values of the target attributes. This problem did not exist in our method. Third, compared with other methods without feature selection, our method estimated missing values based on the corresponding causal connections identified from a Bayesian network. We believe that this is why it resulted in better performance.

(2) Because it detects the dependencies between variables through structure learning and parameter learning, our method was less dependent on a hypothesized probability distribution and was more widely applicable than other methods. As a classical missing value imputation method, the EMI algorithm assumes that the sample follows a multivariate normal distribution. When the true distribution of the sample is significantly different from the assumed distribution, it often cannot achieve satisfactory imputation accuracy. In contrast, the BNII algorithm was found to be suitable not only for categorical missing attributes, but also for continuous missing attributes.

(3) Since it introduced an iterative strategy based on increasing the posterior probability to make the imputation results fit the real distribution more closely, our method was more accurate than other single imputation methods. The results of our experiments clearly showed the superiority of the proposed imputation method over other well-known missing value imputation techniques.

Second, the proposed BNII method also demonstrated a high degree of adaptability and flexibility. A high volume of recent research has focused on schemes that integrate the imputation model with the prediction model (Purwar & Singh, 2015; Roozbeh et al., 2018). In such schemes, models with different functions cannot be used separately, which creates a problem of inflexibility. In contrast, our method can be embedded as a pre-processing component to complete a set of raw data and can be integrated with any other expert and intelligent system.

Our proposed method also had some limitations. First, the BNII method may lose performance when built from a dataset whose attributes are poorly correlated with one another. In this case, a method that does not explore relationships between attributes, such as the mean imputation method, may be preferable. Second, although the structure learning phase of the Bayesian network is executed only once throughout the whole imputation process, it may still prove costly in computing time and memory.

In the future, we plan to further improve the efficiency of the algorithm in several respects.

(1) In the second stage of the BNII algorithm, we iteratively imputed the missing variables one by one in increasing order of the index of the missing variables. In future studies, we will attempt to verify whether the imputation order affects the precision and efficiency of our method.
(2) According to the literature in this area, data characteristics such as the sample size and the number of variables may influence the performance of Bayesian network structure learning. In future studies, we will implement an adaptive method of selecting the optimal structure learning method according to the data characteristics for the BNII method.

Declaration of Competing Interest

None.

CRediT authorship contribution statement

Qiujun Lan: Conceptualization, Methodology, Resources, Writing - review & editing, Funding acquisition. Xuqing Xu: Validation, Investigation, Writing - original draft. Haojie Ma: Software. Gang Li: Writing - review & editing.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 71871090, 71301047), the Science Foundation of Ministry of Education of China (18YJAZH038), the Hunan Provincial Science & Technology Major Project (2018GK1020), Xinjiang Uygur Autonomous Region research fund and Deakin University ASL 2019 fund. We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

References

Abdou, H. A., & Pointon, J. (2011). Credit scoring, statistical techniques and evaluation criteria: A review of the literature. Intelligent Systems in Accounting Finance & Management, 18(2-3), 59–88.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609.
Atem, F. D., Sampene, E., & Greene, T. J. (2017). Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research, 28(2), 962280217727033.
Aydilek, I. B., & Arslan, A. (2012). A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks. International Journal of Innovative Computing Information & Control, 8(7), 4705–4717.
Batista, G. A. P. A., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6), 519–533.
Bequé, A., & Lessmann, S. (2017). Extreme learning machines for credit scoring: An empirical evaluation. Expert Systems with Applications, 86(15), 42–53.
Bliss, C. I. (1934). The method of probits. Science, 79(2037), 38–39.
Chen, F. L., & Li, F. C. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications, 37(7), 4902–4909.
Chen, N., Ribeiro, B., & Chen, A. (2016). Financial credit risk assessment: A recent review. Artificial Intelligence Review, 45(1), 1–23.
Chun-Ling, C., & Huang, S. (2011). A hybrid neural network approach for credit scoring. Expert Systems, 28(2), 185–196.
Deb, R., & Liew, W. C. (2016). Missing value imputation for the analysis of incomplete traffic accident data. Information Sciences, 339(2016), 274–289.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.
Einav, L., Jenkins, M., & Levin, J. (2013). The impact of credit scoring on consumer lending. The RAND Journal of Economics, 44(2), 249–274.
Feng, X., Xiao, Z., Zhong, B., Dong, Y., & Qiu, J. (2019). Dynamic weighted ensemble classification for credit scoring using Markov chain. Applied Intelligence, 49, 555. doi:10.1007/s10489-018-1253-8.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2), 179–188.
Florez-Lopez, R. (2010). Effects of missing data in credit risk scoring. A comparative analysis of methods to achieve robustness in the absence of sufficient data. Journal of the Operational Research Society, 61(3), 486–501.
Furlow, C. F., Fouladi, R. T., Gagne, P., & Whittaker, T. A. (2007). A Monte Carlo study of the impact of missing data and differential item functioning on theta estimates from two polytomous Rasch family models. Journal of Applied Measurement, 8(4), 388–403.
Garciarena, U., & Santana, R. (2017). An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers. Expert Systems with Applications, 89, 52–65.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Model checking and comparison (pp. 513–528). Cambridge University Press. doi:10.1017/CBO9780511790942.
Gordini, N. (2014). A genetic algorithm approach for SMEs bankruptcy prediction: Empirical evidence from Italy. Expert Systems with Applications, 41(14), 6433–6445.
Hens, A. B., & Tiwari, M. K. (2012). Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications, 39(8), 6774–6781.
Hong, T. P., & Wu, C. W. (2011). Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications, 38(4), 3931–3936.
Kano, M., Uchida, H., Udell, G. F., & Watanabe, W. (2011). Information verifiability, bank organization, bank competition and bank–borrower relationships. Journal of Banking & Finance, 35(4), 935–954.
Kao, L. J., Chiu, C. C., & Chiu, F. Y. (2012). A Bayesian latent variable model with classification and regression tree approach for behavior and credit scoring. Knowledge-Based Systems, 36(6), 245–252.
Lessmann, S., Baesens, B., Seow, H. V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124–136.
Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken, New Jersey: John Wiley & Sons.
Louzada, F., Ara, A., & Fernandes, G. B. (2016). Classification methods applied to credit scoring: Systematic review and overall comparison. Surveys in Operations Research & Management Science, 21(2), 117–134.
Luengo, J., García, S., & Herrera, F. (2010). A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFNs and event covering method. Neural Networks, 23(3), 406–418.
Nuovo, A. G. D. (2011). Missing data analysis with fuzzy c-means: A study of its application in a psychological scenario. Expert Systems with Applications, 38(6), 6793–6797.
Pan, L., & Li, J. (2010). K-nearest neighbor based missing data estimation algorithm in wireless sensor networks. Wireless Sensor Network, 2, 115.
Purwar, A., & Singh, S. K. (2015). Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications, 42(13), 5621–5631.
Roozbeh, R. F., Shiladitya, C., Mehrdad, S., & Enrico, Z. (2018). An integrated imputation-prediction scheme for prognostics of battery data with missing observations. Expert Systems with Applications, 115, 709–723.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L. (2010). Analysis of incomplete multivariate data. Boca Raton, Florida: CRC Press.
Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5), 853–871.
Shahbazi, H., Karimi, S., Hosseini, V., Yazgi, D., & Torbatian, S. (2018). A novel regression imputation framework for Tehran air pollution monitoring network using outputs from WRF and CAMX models. Atmospheric Environment, 187, 24–33.
Shen, Y., Shen, M., Xu, Z., & Bai, Y. (2009). Bank size and small- and medium-sized enterprise (SME) lending: Evidence from China. World Development, 37(4), 800–811.
Sohn, S. Y., Dong, H. K., & Jin, H. Y. (2016). Technology credit scoring model with fuzzy logistic regression. Applied Soft Computing, 43, 150–158.
Tutz, G., & Ramzan, S. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics and Data Analysis, 90(C), 84–99.
Walker, S. H., & Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54(1/2), 167–178.
West, D. (2000). Neural network credit scoring models. Computers and Operations Research, 27(11), 1131–1152.
Wiginton, J. C. (1980). A note on the comparison of logit and discriminant models of consumer credit behavior. Financial Quantitative Analysis, 15(3), 757–770.
Won, C., Kim, J., & Bae, J. K. (2012). Using genetic algorithm based knowledge refinement model for dividend policy forecasting. Expert Systems with Applications, 39(18), 13472–13479.
Zhang, Y., Li, J., & Chen, D. (2014). Information asymmetry, cloud financing mode and financing of small and micro science and technology enterprises. Science & Technology Progress and Policy, 15, 100–103 (in Chinese).
Zhou, H., Wang, J., Wu, J., Zhang, L., Lei, P., & Chen, X. (2014). Application of the hybrid SVM-KNN model for credit scoring. In Proceedings of the international conference on computational intelligence and security (pp. 174–177). IEEE.