Automation in Construction 20 (2011) 1242–1249
An enforced support vector machine model for construction contractor default prediction

H. Ping Tserng a,1, Gwo-Fong Lin a,2, L. Ken Tsai b,⁎, Po-Cheng Chen a,3

a Dept. of Civil Engineering, National Taiwan Univ., No. 1 Roosevelt Rd., Sec. 4, Taipei, Taiwan
b National Council of Structural Engineers Associations, 7F, No. 37, Tung Hsing Rd., Taipei 11070, Taiwan
Article history: Accepted 11 May 2011. Available online 22 June 2011.
Keywords: Contractor analysis; Default prediction; Support vector machine
Abstract

The financial health of construction contractors is critical to the successful completion of a project, and thus default prediction is of great concern to owners and other stakeholders. In other industries, many previous studies employ support vector machine (SVM) or other Artificial Neural Network (ANN) methods for corporate default prediction using the sample-matching method, which produces sample selection biases. To avoid such biases, this paper uses all available firm-year samples during the sample period. Yet this brings a new challenge: the number of non-defaulted samples greatly exceeds that of defaulted samples, which is referred to as between-class imbalance. Although the SVM algorithm is a powerful learning process, it cannot always be applied to data with extreme distribution characteristics. This paper proposes an enforced support vector machine-based model (ESVM model) for default prediction in the construction industry, using all available firm-year data in our sample period to address the between-class imbalance. The traditional logistic regression model is provided as a benchmark to evaluate the forecasting ability of the ESVM model. All financial variables related to the prediction of contractor default risk, as well as 7 variables selected by the Multivariate Discriminant Analysis (MDA) stepwise method, are put into the models for comparison. The empirical results show that the ESVM model always outperforms the logistic regression model and is more convenient to use because it is relatively insensitive to the selection of variables. We therefore recommend the proposed ESVM model as an alternative to the traditionally used logistic model. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Evaluating the failure probability of construction contractors is a critical issue in successfully completing a project. Project owners and managers are advised to avoid awarding contracts to contractors with a high failure tendency. It is also an important issue for other stakeholders such as surety underwriters, creditors, auditors, investors, and the contractors themselves. As a result of growth in project scale, progress in construction techniques, and the increasing complexity of materials and equipment, a single construction contractor is unable to complete a project alone. Thus, a successful construction project is highly dependent on the cooperation of the prime contractor and subcontractors; if one of them defaults or goes bankrupt, the others will also be affected. Therefore, prime construction contractors are nowadays concerned about the financial health of their subcontractors, and vice versa.
⁎ Corresponding author. Tel.: +886 2 87681117; fax: +886 2 87681116.
E-mail addresses: [email protected] (H.P. Tserng), [email protected] (G.-F. Lin), [email protected] (L.K. Tsai), [email protected] (P.-C. Chen).
1 Tel.: +886 2 23644154. 2 Tel.: +886 2 33664368. 3 Tel.: +886 917181585.
0926-5805/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.autcon.2011.05.007
There is abundant literature on the development of corporate failure prediction models, including univariate analysis [1], Multivariate Discriminant Analysis (MDA) [2–5], the Linear Probability Model (LPM) [6], the logistic regression model [7], the probit model, and the Cumulative Sums (CUSUM) procedure [8]. Most prior studies did not pay much attention to single industries, likely because of the limited number of defaulted samples. Yet economic intuition suggests that industry effects should be an important component of company default prediction. Chava and Jarrow [9] suggested that different industries face different levels of competition and have different accounting conventions; therefore, the likelihood of bankruptcy can differ for firms in different industries with otherwise identical balance sheets. As the characteristics of the construction industry differ markedly from those of other industries, so does its financial risk. First, the construction industry is easily influenced by economic conditions. Contractors often adopt the strategy of bidding low to win contracts during economic turmoil, and thus have poor financial stability. Second, construction materials and construction in progress are the common inventory items in the construction industry. When the inventory cannot be converted into cash because of contract disputes, the contractor suffers from insufficient liquidity. Third, contractors apply for advance payment or progress payment from the owner according to construction progress milestones,
and thus long-term capital can be relatively low. Contractors often use short-term financial instruments to fund the supply of preliminary materials or equipment; as a result, the capital source is rather unstable and interest payments are relatively high. Fourth, unlike in other industries, the production duration of a construction project is relatively long. The most common revenue recognition principle used by contractors is the percentage-of-completion method, whose advantage is that income can be recognized earlier. However, contractors may suffer from liquidity insufficiency owing to the combination of the construction industry's risky nature and over-optimism in revenue recognition. Generic credit risk models for all sectors thus tend to be too general and may lack the ability to deal with the construction industry, which has particular characteristics and financial risks. Some researchers have made efforts to develop default prediction models for the construction industry, including Mason and Harris [10], Langford et al. [11], Kangari et al. [12], Abidali and Harris [13], Severson et al. [14], and Russell and Zhai [15]. Most of these approaches use accounting or financial ratios to build Multivariate Discriminant Analysis (MDA) or logistic regression models; some additionally incorporate managerial or economic variables to enhance predicting performance. Since the late 1980s, artificial intelligence techniques such as Artificial Neural Networks (ANN) have been successfully applied to corporate financial distress forecasting. A large number of studies compared ANN models with other forecasting methods and showed that ANN had superior predicting performance [16,17]. In the late 1990s, a novel ANN-related model, the support vector machine (SVM), was introduced to deal with the classification problem. Fan and Palaniswami [18] applied SVM to select financial distress predictors.
They pointed out that SVM creates an optimal hyperplane in the hidden feature space according to the principle of structural risk minimization and uses quadratic programming to obtain an optimal solution. To test the predicting performance of ANN models, all studies to our knowledge (e.g., [16,19–30]) rely on matched samples or partially adjusted unequal matched samples. Zmijewski [31] argued persuasively that this sample-matching method produces sample selection biases. To avoid them, we abandon the sample-matching method and include all available firm-year data. Yet this brings a new challenge, referred to as between-class imbalance [32]: the number of non-defaulted samples greatly exceeds that of defaulted samples. When analyzing data with between-class imbalance, SVM, like other methods, captures only the distribution of the majority of the input points and ignores the minority. Thus, using the SVM model for default prediction is not satisfactory. The primary objective of this paper is to fix this shortcoming. We propose a method that can improve the predicting performance of the SVM model on between-class-imbalanced input data: the enforced support vector machine (ESVM) model. The ESVM model performs enforced training on the defaulted sample set, which has the smaller sample size, to increase the model's power to discriminate between defaulted and non-defaulted samples. This concept is close to the human learning process: when we try to grasp specific and important information, we tend to practice repeatedly for better results. In this paper, we replicate the defaulted samples in the training set several times to reinforce learning, expecting to resolve the between-class imbalance. This paper empirically validates the predicting performance of the ESVM model on construction contractor default.
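The replication idea can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the function name and list-based data layout are assumptions of this example):

```python
import random

def enforced_training_set(defaulted, non_defaulted, seed=None):
    """Replicate the minority (defaulted) samples until the two classes are
    roughly balanced, mimicking the ESVM enforced training procedure."""
    if not defaulted:
        raise ValueError("need at least one defaulted sample")
    # e.g. 1371 non-defaulters // 51 defaulters = 26 replications
    times = max(1, len(non_defaulted) // len(defaulted))
    training = non_defaulted + defaulted * times
    if seed is not None:
        random.Random(seed).shuffle(training)
    return training
```

With the sample sizes used later in this paper (51 defaulters, 1371 non-defaulters), each defaulted sample would be replicated 26 times.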
The logistic regression model, which is commonly used and well documented in the corporate failure prediction literature, is provided as a benchmark for assessing the results of the ESVM forecasting method. The rest of this paper is divided into four sections. Section 2 introduces the default prediction methodology used in this paper, including the SVM model, the ESVM model, and the logistic model. Section 3 presents how we apply the ESVM model to the prediction of
construction contractor default, including our data set, sample selection criteria, and input variable selection. Section 4 reports the results of the proposed ESVM model and the logistic regression method and compares their predicting performance. The results show that the ESVM model outperforms the logistic model in discriminatory power between defaulters and non-defaulters. Finally, Section 5 provides a summary of this paper and the concluding comment that the ESVM model can be an alternative to the traditionally used logistic model.

2. Methodology

Fig. 1 shows the framework of the methodology in this paper. The process of contractor default prediction is as follows: first, the sample set and the variables for the ESVM and the logistic model are selected. Then, the sample set is put into the ESVM model and the logistic model; at the same time, the enforced training procedure is executed for the ESVM model. Finally, this paper compares and analyzes the results of the ESVM and the logistic model. The methods used in this paper, including the SVM, the ESVM, and the logistic model, are introduced in the following sections.

2.1. Support vector machine

In the early 1990s, Vapnik [33] developed SVM for classification; it was later extended to regression. Since that time, a number of studies have been published concerning its theory and applications. Compared with most other learning techniques, SVM leads to improved performance in pattern recognition, regression estimation, financial time-series forecasting, marketing, estimation of manufacturing yields, text image analysis, and medical diagnosis [23]. The principles of SVM are based on structural risk minimization (SRM) and statistical learning theory. It mainly uses linear models to classify or fit a data sample after a specific non-linear mapping into a relevant high-dimensional feature space [33–35].
Many previous research papers applied the SVM classifier to predict company default; however, the classifier cannot offer a continuous output. In practice, decision-makers typically make continuous decision choices. For example, project owners choose the most competent construction contractor according to their ranking of default probability, and lending institutions determine the interest rate on a construction loan according to the estimated default probability
Fig. 1. Framework of methodology: 1. screen samples; 2. prepare financial data; 3. select financial ratios as input variables → logistic model and ESVM model (including the enforced training procedure) → comparison of models and analysis.
of the contractor. Thus, this paper applies SVM regression instead of an SVM classifier in order to obtain a continuous output, which can be viewed as the default potential or default probability. In the works of Lin et al. [36,37], the superiority of SVM over conventional ANN is discussed extensively. In this section, the SVM methodology used in this paper is briefly described; one can refer to the books of Vapnik [33,38] for more mathematical details.

Based on N training data (x_1, y_1), (x_2, y_2), …, (x_N, y_N), the objective of the support vector machine is to find a non-linear regression function that yields an output ŷ that best approximates the desired output y within an error tolerance ε. Note that, for comparability with the benchmark, the desired output y is set to 1 for defaulters and 0 for non-defaulters in this paper. First, the input vector x is mapped onto a higher-dimensional feature space by a non-linear function φ(x). The regression function that relates the input vector x to the output ŷ can then be written as

$$\hat{y} = f(x) = w^{T}\phi(x) + b \tag{1}$$

where w and b are the weights and the bias of the regression function, respectively. Based on the SRM induction principle, w and b are estimated by minimizing the following structural risk function:

$$R = \frac{1}{2}w^{T}w + C\sum_{i=1}^{N} L_{\varepsilon}(\hat{y}_i) \tag{2}$$

where Vapnik's ε-insensitive loss function L_ε is defined as

$$L_{\varepsilon}(\hat{y}) = |y-\hat{y}|_{\varepsilon} = \begin{cases} 0 & \text{for } |y-\hat{y}| \le \varepsilon \\ |y-\hat{y}| - \varepsilon & \text{for } |y-\hat{y}| > \varepsilon \end{cases} \tag{3}$$

The first and second terms in Eq. (2) represent the model complexity and the empirical error, respectively. The tradeoff between the model complexity and the empirical error is specified by a user-defined parameter C. Herein, C is set to 1, which means that the model complexity is weighted as heavily as the empirical error. In addition, it is acceptable to set the error tolerance ε to 0.0001 in contractor default prediction. Vapnik [33] expressed the SVM problem as the following optimization problem:

$$\text{minimize}\quad R(w,b,\xi,\xi') = \frac{1}{2}w^{T}w + C\sum_{i=1}^{N}(\xi_i + \xi'_i) \tag{4}$$

subject to

$$y_i - (w^{T}\phi(x_i) + b) \le \varepsilon + \xi_i,\qquad (w^{T}\phi(x_i) + b) - y_i \le \varepsilon + \xi'_i,\qquad \xi_i \ge 0,\ \xi'_i \ge 0,\quad i = 1, 2, \ldots, N$$

where the slack variables ξ and ξ′ represent the upper and the lower training errors, respectively. This optimization problem is usually solved in its dual form using Lagrange multipliers. Rewriting Eq. (4) in its dual form and differentiating with respect to the primal variables (w, b, ξ, ξ′) gives

$$\text{maximize}\quad \sum_{i=1}^{N} y_i(\alpha_i - \alpha'_i) - \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha'_i) - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha'_i)(\alpha_j - \alpha'_j)\,\phi(x_i)^{T}\phi(x_j) \tag{5}$$

subject to

$$\sum_{i=1}^{N}(\alpha_i - \alpha'_i) = 0,\qquad 0 \le \alpha_i \le C,\quad 0 \le \alpha'_i \le C,\quad i = 1, 2, \ldots, N$$

where α and α′ are the dual Lagrange multipliers. It should be noted that the solution to the optimization problem of Eq. (5) is guaranteed to be unique and globally optimal because the objective function is convex. The optimal Lagrange multipliers (α_i − α′_i)* are solved by a standard quadratic programming algorithm, and the regression function can then be rewritten as

$$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha'_i)^{*} K(x_i, x) + b \tag{6}$$

where the kernel function is defined as

$$K(x_i, x) = \phi(x_i)^{T}\phi(x). \tag{7}$$

Following [23,24,29,30,35], the default problem is treated as a non-linear problem, and the radial basis function is used as the kernel function in this paper:

$$K(x_i, x) = \exp\left(-\gamma\,\|x_i - x\|^{2}\right) \tag{8}$$

where γ ∈ R⁺ is a constant; in this paper, γ is set to 1. According to Shevade et al. [39], b can be calculated by minimizing Eq. (9):

$$\sum_{i=1}^{N} \max\left(0,\ y_i - f(x_i) - \varepsilon,\ f(x_i) - y_i - \varepsilon\right). \tag{9}$$

Some of the solved Lagrange multipliers (α − α′)* are zero and are eliminated from the regression function. Finally, the regression function involves only the nonzero Lagrange multipliers and the corresponding input vectors of the training data, which are called the support vectors. The final regression function can be rewritten as

$$f(x) = \sum_{k=1}^{N_{sv}}(\alpha_k - \alpha'_k)^{*} K(x_k, x) + b \tag{10}$$

where x_k denotes the kth support vector and N_sv is the number of support vectors.
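Once the dual problem has been solved, evaluating the regression function of Eq. (10) with the RBF kernel of Eq. (8) is a direct computation. The sketch below assumes the support vectors and the optimal coefficients (α_k − α′_k)* have already been obtained from a quadratic programming solver; the values in the example call are toy numbers for illustration only:

```python
import math

def rbf_kernel(xi, x, gamma=1.0):
    """RBF kernel of Eq. (8): K(xi, x) = exp(-gamma * ||xi - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, coefs, b, gamma=1.0):
    """Regression function of Eq. (10):
    f(x) = sum_k (alpha_k - alpha'_k)* K(x_k, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coefs)) + b

# Toy example: two support vectors with illustrative coefficients.
score = svr_predict([0.2, 0.5], [[0.1, 0.4], [0.9, 0.9]], [0.7, -0.3], b=0.05)
```

The continuous output `score` is what this paper interprets as the default potential of a contractor.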
2.2. Problem with imbalanced data and the enforced SVM model

The sample design employed by many studies of corporate default prediction has been to match a set of defaulted firms with the same number, or some multiple, of healthy firms. Because default or bankruptcy is relatively rare, Zmijewski [31] argued that this sample-matching method produces sample selection biases. Unless one builds a model on the entire population, the estimated coefficients will be biased and the resulting predictions unreliable. To avoid sample selection biases, many recent studies use all available firm-quarters or firm-years during the sample period to construct default prediction models, thereby improving the accuracy of the coefficient estimates and increasing the predicting performance of the models (e.g., [40–45]). Although the SVM algorithm is a powerful learning process, it cannot always be applied to data with extreme distribution characteristics, such as the distribution of default events. Thus, many studies that used SVM for default prediction continued to rely on matched samples or partially adjusted unequal matched samples to test predicting performance. To our knowledge, no researchers have used the SVM or ANN methodology for default prediction with an all-available firm-quarter or firm-year sample. In this paper, we include all available firm-year data to empirically explore construction contractor default prediction with the SVM model.

After including all firm-year data, a huge difference in sample size between defaulted and non-defaulted companies arises: the number of non-defaulted samples greatly exceeds that of defaulted samples. This form of imbalance is referred to as a between-class imbalance [32]. The problem with imbalanced data in SVM is that SVM, like other ANN models, represents only the distribution of the major part of the input points, ignoring the minor part. Mathematically, as explained in the research of Akbani et al. [46], the SVM model minimizes the model complexity $\frac{1}{2}w^{T}w$ in Eq. (2) while minimizing the associated error $C\sum_{i=1}^{N} L_{\varepsilon}(\hat{y}_i)$. When the value of C is not very large, the estimates for all samples in the SVM model will be close to 0, because in that way the model complexity is minimized and the cumulative error on the non-default samples approximates zero; the only tradeoff is a small cumulative error on the few default samples. That is why the SVM model is difficult to apply in the presence of imbalanced data, such as the default prediction problem in this paper.

Chang et al. [47] pointed out that to fix this shortcoming, the important information should be emphasized. A simple way to solve the problem is to repeatedly present those specific patterns. This idea is close to the human learning process: when we try to grasp specific and important information, we tend to practice repeatedly for better results. The proposed ESVM is based on this knowledge-learning process. It enforces extra training on the SVM network through the enforced training procedure, which means repeatedly learning from the smaller group of input samples. To achieve this, the defaulted samples are replicated in this paper until the numbers of defaulters and non-defaulters are balanced.

2.3. Logistic regression model

Ohlson [7] was the first to apply the logistic regression model to business bankruptcy prediction. The logistic regression model is a statistical modeling technique that seeks the relationship between a binary dependent variable and selected independent variables assumed to be related to it [48]. Unlike MDA, the logistic regression model requires neither multivariate normality nor equality of the covariance matrices of the two groups. It uses the logistic cumulative function to predict default. Jaselskis and Ashley [49], Russell and Jaselskis [50], and Severson et al. [14] have successfully built logistic regression models to predict contractor performance.
In this paper, we also employ the logistic regression model as a comparison method for the ESVM. Let yᵢ ∈ {0, 1} for all i = 1 to n. The logistic regression model estimates the probability that the label is 1 for a given example x using the model [51]:

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x - b)}. \tag{11}$$

The parameters w and b can be estimated by the maximum likelihood procedure, maximizing the following log-likelihood function with respect to w and b:

$$L(x_1, \ldots, x_n \mid w, b) = \sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \tag{12}$$

where $p_i = P(y = 1 \mid x_i)$.
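For completeness, Eqs. (11) and (12) can be fitted with a few lines of gradient ascent on the log-likelihood (a minimal illustrative sketch; the paper does not specify which estimation routine it used):

```python
import math

def sigmoid(z):
    """Logistic function of Eq. (11)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Estimate w and b by gradient ascent on the log-likelihood of Eq. (12).
    xs is a list of feature vectors; ys is a list of 0/1 labels."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - p  # gradient of the log-likelihood w.r.t. the linear score
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

On a small separable toy sample, the fitted model assigns P(y = 1 | x) below 0.5 to low-risk points and above 0.5 to high-risk points.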
3. Data and variables selection

3.1. Data

A large cross-section of construction contractors is considered in the empirical investigation of this paper. Data are collected from the Compustat Industrial file (quarterly data) [52] and the Center for Research in Security Prices (CRSP) for construction contractors on the New York Stock Exchange (NYSE), the American Exchange (AMEX), and Nasdaq. This paper restricts its attention to construction contractors with December fiscal year-ends by choosing firms with Standard Industrial Classification (SIC) codes between 1500 and 1799. Similar to the research of Severson et al. [14] and Russell and Zhai [15], the sample contractors cover three construction categories:

SIC codes 1500–1599: building construction, general contractors, and operative builders;
SIC codes 1600–1699: heavy construction other than building construction contractors;
SIC codes 1700–1799: construction special trade contractors.

The sample selection has three criteria. First, contractors that do not have financial statements for at least two years are removed from the sample. Second, data must be available in CRSP for at least two years prior to the default time, for data completeness. Third, default is defined by CRSP delisting codes 400 and 550 to 585. We follow the definition of defaulted firms proposed by Dichev [53] and Brockman and Turtle [40]: defaulted firms are those delisted because of bankruptcy, liquidation, or poor performance. As a result, we identified 51 defaulted contractors. The goal of this paper is to predict default within one year; thus the financial information of the year right before default is used for the default samples. Prior research usually involves "selecting" a group of non-defaulters on which to perform the analysis. To avoid sampling biases, this paper uses every firm-year for which data are available.
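The three screening criteria can be expressed as a simple filter. The records and field names below are hypothetical stand-ins for illustration, not the actual Compustat/CRSP schema:

```python
# Hypothetical contractor records; the field names are illustrative only.
contractors = [
    {"firm": "A", "sic": 1531, "years_of_data": 6, "delisting_code": None},
    {"firm": "B", "sic": 1623, "years_of_data": 1, "delisting_code": None},
    {"firm": "C", "sic": 1731, "years_of_data": 4, "delisting_code": 574},
    {"firm": "D", "sic": 2834, "years_of_data": 8, "delisting_code": None},
]

def is_default(code):
    """Default per the paper's criterion: CRSP delisting code 400 or 550-585."""
    return code is not None and (code == 400 or 550 <= code <= 585)

# Keep construction SIC codes (1500-1799) with at least two years of statements.
sample = [c for c in contractors
          if 1500 <= c["sic"] <= 1799 and c["years_of_data"] >= 2]
defaulters = [c["firm"] for c in sample if is_default(c["delisting_code"])]
# Firm B is dropped (too little data), D is dropped (not construction);
# firm C is flagged as a defaulter.
```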
The final combined sample consists of 1422 firm-year observations, including 51 default and 1371 non-default samples, representing 168 individual contractors.

3.2. Input variables selection

The first stage in deriving a financial-ratio default prediction model is selecting the financial variables. This paper selects twenty variables for analysis (Table 1). These variables were selected for the following reasons. First, all of them are among the most commonly used in previous studies of contractor default prediction models [10,12–15,54,55]. Second, they encompass a broad cross-section of accounting ratios describing a contractor's liquidity, leverage, activity, and profitability. As a group, these ratios capture the financial characteristics and performance of the construction industry. The input variables used in this paper are scaled into the range [0, 1] using a linear normalization algorithm. While each of these variables may provide an important alternative perspective on a contractor's condition, including a large number of variables in a quantitative model may nevertheless yield a model that is "over-fitted": the model performs very well in-sample on the data used to develop it, but its out-of-sample performance on new data is poor [56]. Following [23], this paper uses a Multivariate Discriminant Analysis (MDA) stepwise method to select a limited number of variables, in order to avoid over-fitting and yield a powerful model. In the stepwise MDA method used in this paper, the F-value to enter is 3.84 and the F-value to remove is 2.71. The variables selected by the MDA stepwise method are shown in Table 2.
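The linear normalization mentioned above is per-variable min–max scaling into [0, 1]; a minimal sketch:

```python
def minmax_normalize(column):
    """Linearly scale a list of ratio values into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # a constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]

scaled = minmax_normalize([2.0, 4.0, 6.0])  # -> [0.0, 0.5, 1.0]
```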
Table 1
The financial variables selected in this paper.

Liquidity: 1. Current ratio; 2. Quick ratio; 3. Net working capital to total assets; 4. Current assets to net assets.
Leverage: 5. Total liabilities to net worth; 6. Retained earnings to sales; 7. Debt ratio; 8. Times interest earned.
Activity: 9. Revenues to net working capital; 10. Accounts receivable turnover; 11. Accounts payable turnover; 12. Sales to net worth; 13. Quality of inventory; 14. Fixed assets to net worth; 15. Turnover of total assets; 16. Revenues to fixed assets.
Profitability: 17. Return on assets; 18. Return on equity; 19. Return on sales; 20. Profits to net working capital.
In the following sections, for comparison, we put both all 20 variables and the selected 7 variables into the logistic regression model and the ESVM model.

4. Results and discussion

4.1. Discriminatory power

To compare the performance of estimates from the models, this paper employs the Receiver Operating Characteristic (ROC) curve to assess their predictive ability. The ROC curve is widely used in medicine to test the efficacy of various treatments and diagnostic techniques. It is also a popular technique for assessing the discriminatory power of various credit scoring and rating models [45,57]. Many prior studies of business default prediction relied on prediction-oriented tests to distinguish between alternative statistical models. The shortcoming of a prediction-oriented test is that it produces only two ratings (bad/good), which are valid only for a specific model cut-off point, and leads to a dichotomous decision. In practice, decision-makers typically make continuous decision choices; thus, the results of prediction-oriented tests are not suitable for representing the performance of default prediction models. Furthermore, the prediction-oriented test typically assumes that the costs of each type of classification error are equal. This does not hold true in the real world, where Type I errors are substantially more costly than Type II errors. For example, the cost of awarding a contract to an impending defaulter will typically be much larger than the cost of rejecting a healthy contractor. Since prediction-oriented tests do not allow for these continuous choices, this paper uses discriminatory power to evaluate the performance of a default prediction model. The discriminatory power measures to what extent the model can differentiate firms that are more likely to default from firms that are less likely to default.
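The cost asymmetry between the two error types can be made concrete with a small expected-cost calculation (the cost and rate figures are illustrative assumptions of this example; the paper does not quantify them):

```python
def expected_cost(type1_rate, type2_rate, cost_type1, cost_type2, default_rate):
    """Expected misclassification cost per firm at a given cut-off.
    Type I: lending to an impending defaulter; Type II: denying a healthy firm."""
    return (default_rate * type1_rate * cost_type1
            + (1 - default_rate) * type2_rate * cost_type2)

# With Type I errors ten times as costly and an illustrative 20% default rate,
# a conservative cut-off that accepts more Type II errors in exchange for
# fewer Type I errors yields a lower expected cost than a lenient one.
strict = expected_cost(0.05, 0.30, 10.0, 1.0, 0.2)
loose = expected_cost(0.30, 0.05, 10.0, 1.0, 0.2)
```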
In a perfect discriminating model, all firms that actually default are assigned a larger probability of default than any surviving firm. The ROC curve is a useful tool for assessing the discriminatory power of a credit scoring model. The ROC curve is constructed by scoring all credits, arranging the non-defaults from riskiest to safest on the x axis, and then plotting the percentage of defaults excluded at each level on the y axis. The y axis is thus formed by associating every score on the x axis with the cumulative
Table 2
Definition of variables selected by the MDA stepwise method.

Return on assets: (Net profit after interest and taxes + Interest expense) / Total assets.
Net working capital to total assets: (Current assets − Current liabilities) / Total assets.
Revenues to fixed assets: Net sales / Average fixed assets.
Turnover of total assets: Net sales / Average total assets.
Sales to net worth: Net sales / Average net worth.
Return on equity: Net profit after interest and taxes / Net worth.
Profits to net working capital: Net profit after interest and taxes / (Current assets − Current liabilities).
percentage of defaults with a score equal to or worse than that score in the testing data. In other words, the ROC curve plots the Type II error against one minus the Type I error. In the case of default prediction, it describes the percentage of non-defaulting firms that must be inadvertently denied credit (Type II error) in order to avoid lending to a specific percentage of defaulting firms (1 − Type I error) when using a specific default model [57]. The ROC curve generalizes the relative performances across all possible cut-off points associated with the costs of each type of classification error, and it provides a form of cost–benefit analysis for decision-makers. The ROC curve of an entirely random prediction corresponds to the main diagonal, whereas a perfect model has an ROC curve that goes straight up from (0,0) to (0,1) and then across to (1,1). Given two models, the one with the better ranking will display an ROC curve that is further to the top left. The area under the curve (AUC) is commonly used as a summary statistic for the quality of a ranking. A model with perfect ranking has an AUC of one, whereas a model with random predictions has an AUC of 0.5 [42]. The general rule is as follows: AUC = 0.5 suggests no discrimination; 0.7 ≤ AUC < 0.8, acceptable discrimination; 0.8 ≤ AUC < 0.9, excellent discrimination; AUC ≥ 0.9, outstanding discrimination [58].

4.2. Validation process

To apply the ESVM model, the sample set must undergo the enforced training procedure. In the enforced training procedure of this paper, the defaulted samples are replicated 26 times to balance the numbers of defaulters (51) and non-defaulters (1371). This results in a new sample set, with which the ESVM model is built. The ROC curves comparing the in-sample performances of the ESVM model and the logistic regression model with all 20 variables and with the selected 7 variables are shown in Figs. 2 and 3, respectively.
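The AUC used as the summary statistic here can be computed directly from model scores as a pairwise-comparison statistic, without plotting the curve (a simple O(n·m) sketch; ties count one half):

```python
def auc(scores_default, scores_nondefault):
    """AUC as the probability that a randomly chosen defaulter receives a
    riskier (higher) score than a randomly chosen non-defaulter."""
    wins = 0.0
    for d in scores_default:
        for n in scores_nondefault:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(scores_default) * len(scores_nondefault))

# A model that ranks every defaulter above every non-defaulter scores 1.0.
```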
The in-sample performances of the ESVM and the logistic regression model are similar. As the out-of-sample performance is the key assessment criterion for models we randomly divide the defaulted data set into a training set and a testing set of ratio 2:1, and repeat the same for the non-defaulted data set. Thus, in the training set, there are 34 defaulted and 914 nondefaulted samples in the logistic model, and 884 defaulted (after enforced training) and 914 non-defaulted samples in the ESVM model. In the testing set, there are 17 defaulted and 457 non-defaulted samples. There are no overlapped samples between the training set and the testing set. Different selections of training data and testing data yield different results and sometimes lead to different conclusions. To reach just conclusions, experiments are repeated many times for different random splits of the data. We use the average AUC as the final performance measures of models. To ensure how many times experiments repeated can represent actual model predicting performance, we performed sensitivity analysis in our experiments. Figs. 4 and 5 show the relationship between repeated times of experiments and average AUC. For both models, the average AUC has high variance when the experiment is repeated only a few times, and
H.P. Tserng et al. / Automation in Construction 20 (2011) 1242–1249
1247
0.9
Average AUC
0.85 0.8 0.75 0.7 0.65 0.6 0
50
100
150
200
250
Repeated Times of Experiments ESVM
Fig. 2. ROC curve of models with all 20 variables.
the logistic regression model has higher variance than the ESVM model. When the experiment is repeated more than 100 times, the average AUC becomes stable. Therefore, we use the average AUC of 200 times as the final performance measures of models for more crucial criteria. 4.3. Validation result The performance of the ESVM and logistic regression model is summarized in Table 3. Three important results are found. Firstly, regardless of whether the MDA stepwise method is performed to select the input variables, the testing set AUC of the ESVM model is always higher compared to the logistic regression model and has better predicting performance. When using all 20 variables, the ESVM model's predicting performance (AUC= 0.7805) is much higher than that of the logistic regression model (AUC= 0.7082). When using the selected 7 variables, both models' predicting performance continue to rise, and the ESVM model still outperforms the logistic regression model. This result shows that, due to the application of the enforced training procedure, the ESVM model can grasp information within financial data and correctly analyze to bring better results in default prediction, which is a highly complicated and highly non-linear problem. Secondly, whether ESVM or logistic regression model, the testing set AUC is improved after the MDA stepwise method is used to select input variable. Too many input variables are not necessarily better for the models. Too many input variables add training time to the models, yet
Fig. 4. Average AUC of models based on different times of experiments with 20 variables.
they do not always improve the predictive performance; sometimes they even act as a disturbance and lower the model's predictive ability. By selecting the input variables with the MDA stepwise method, the training time of the models is reduced and the predictive performance is improved because the disturbance is reduced. Thirdly, even without the MDA stepwise selection of input variables, the predictive performance of the ESVM model is already better than that of the logistic regression model with the selected variables. This shows that the ESVM model is much more convenient to use than the logistic regression model. Different selection methods yield different selected variables, and the choice of input variables is often critical to the predictive performance. In practice, finding a suitable variable selection method is a considerable challenge. Whereas the predictive ability of the logistic regression model deteriorates greatly without variable selection, the ESVM model retains good predictive ability even without variable selection. When it is uncertain which selection method should be used, the ESVM model is more convenient to apply and still gives satisfactory results.

5. Summary and conclusions

The credit risk models developed for all sectors tend to be too general for the construction industry, which has unique characteristics and different accounting treatments. Previous studies applied Multivariate Discriminant Analysis (MDA) or the logistic regression method to build contractor default prediction models. However, the parameters in these models are likely to need periodic adjustment due to changes in economic conditions and market trends.
Fig. 3. ROC curve of models with selected 7 variables.
Fig. 5. Average AUC of models based on different times of experiments with 7 variables.
Table 3
Performance for different models (average AUC).

                       ESVM     Logistic regression
All 20 variables       0.7805   0.7082
Selected 7 variables   0.8031   0.7361
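The application recipe described in the conclusions below — replicate the defaulted samples until the classes are balanced, train an SVM, then substitute a target sample to obtain its default probability — can be sketched with scikit-learn, whose `SVC` class wraps the same LIBSVM library the authors used [59]. The feature vectors here are hypothetical stand-ins, not the paper's financial variables:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC is built on LIBSVM

# Hypothetical stand-ins for firm-year financial-ratio vectors.
rng = np.random.default_rng(0)
non_defaulted = rng.normal(1.0, 0.3, size=(90, 3))
defaulted = rng.normal(-1.0, 0.3, size=(5, 3))

# Enforced training: replicate the scarce defaulted samples until the
# two classes are roughly balanced (5 * (90 // 5) = 90 samples).
enforced = np.repeat(defaulted, len(non_defaulted) // len(defaulted), axis=0)

X = np.vstack([non_defaulted, enforced])
y = np.array([0] * len(non_defaulted) + [1] * len(enforced))

# RBF-kernel SVM with Platt-scaled probability outputs, as in LIBSVM.
model = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

# Substitute a target firm-year sample to obtain its default probability.
target = np.array([[-0.9, -1.1, -1.0]])
p_default = model.predict_proba(target)[0, list(model.classes_).index(1)]
```

Note that `probability=True` fits an additional calibration step via internal cross-validation, so the probabilities are estimates layered on top of the SVM decision function.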
This paper proposes an ESVM method for contractor default prediction that uses all available firm-year data during our sample period to prevent the sample selection biases of the traditional sample-matching method. As a consequence, the number of non-defaulted samples greatly exceeds the number of defaulted samples, which gives rise to the between-class imbalance problem. The traditional SVM produces unsatisfactory results when it learns from a data set with between-class imbalance. To increase the discriminatory power of the model, the ESVM model additionally trains on the defaulted samples, which are scarce but extremely important. To evaluate the predictive ability of the ESVM model, the commonly used logistic regression model is provided as a benchmark. From the empirical results of this paper, we conclude the following. First, owing to the enforced training procedure, the ESVM model clearly outperforms the logistic regression model in discriminating between defaulters and non-defaulters. Second, unlike the logistic regression model, the ESVM model can take all financial variables related to contractor default risk as inputs. This is more convenient because it avoids selecting and adjusting prediction variables during modeling. To apply the ESVM model, users can take all available firm-year samples as the initial training set and then execute the enforced training procedure: replicate the defaulted samples until the numbers of defaulters and non-defaulters are balanced. After regression and optimization, users can substitute the target sample into the ESVM model to calculate its default probability for further analysis. The proposed modeling technique improves construction contractor default forecasting and is more convenient to apply. Thus, the ESVM model is recommended as an alternative to the traditionally used logistic default prediction model.

6. Software source

In this paper, the program of the SVM model is LIBSVM [59], developed by C.C. Chang and C.J. Lin. The program of the logistic model was coded by the authors in VBA with MS Excel 2007.

References

[1] W.H. Beaver, Financial ratios as predictors of failure, Journal of Accounting Research 4 (1966) 71–111.
[2] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, Journal of Finance 23 (4) (1968) 589–609.
[3] M. Blum, Failing company discriminant analysis, Journal of Accounting Research 12 (1) (1974) 1–25.
[4] E.B. Deakin, Distributions of financial accounting ratios: some empirical evidence, The Accounting Review 51 (1) (1976) 90–96.
[5] R.J. Taffler, Forecasting company failure in the UK using discriminant analysis and financial ratio data, Journal of the Royal Statistical Society, Series A 145 (3) (1982) 342–358.
[6] P.A. Meyer, H.W. Pifer, Prediction of bank failures, The Journal of Finance 25 (4) (1970) 853–868.
[7] J.A. Ohlson, Financial ratios and the probabilistic prediction of bankruptcy, Journal of Accounting Research 18 (1980) 109–131.
[8] P.S. Theodossiou, Predicting shifts in the mean of a multivariate time series process: an application in predicting business failures, Journal of the American Statistical Association 88 (422) (1993) 441–449.
[9] S. Chava, R.A. Jarrow, Bankruptcy prediction with industry effects, Review of Finance 8 (4) (2004) 537–569.
[10] R.J. Mason, F.C. Harris, Predicting company failure in the construction industry, Proceedings of the Institution of Civil Engineers 66 (1979) 301–307.
[11] D. Langford, R. Iyagba, D.M. Komba, Prediction of solvency in construction companies, Construction Management and Economics 11 (5) (1993) 317–325.
[12] R. Kangari, F. Farid, H.M. Elgharib, Financial performance analysis for construction industry, Journal of Construction Engineering and Management 118 (2) (1992) 349–361.
[13] A.F. Abidali, F.C. Harris, A methodology for predicting company failure in the construction industry, Construction Management and Economics 13 (1995) 189–196.
[14] G.D. Severson, J.S. Russell, E.J. Jaselskis, Predicting contract surety bond claims using contractor financial data, Journal of Construction Engineering and Management 120 (2) (1994) 405–420.
[15] J.S. Russell, H. Zhai, Predicting contractor failure using stochastic dynamics of economic and financial variables, Journal of Construction Engineering and Management 122 (2) (1996) 183–191.
[16] P.K. Coats, L.F. Fant, Recognizing financial distress patterns using a neural network tool, Financial Management 22 (3) (1993) 142–155.
[17] G. Zhang, M.Y. Hu, B.E. Patuwo, D.C. Indro, Artificial neural networks in bankruptcy prediction: general framework and cross-validation analysis, European Journal of Operational Research 116 (1) (1999) 16–32.
[18] A. Fan, M. Palaniswami, Selecting bankruptcy predictors using a support vector machine approach, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks 6 (2000) 354–359.
[19] M.D. Odom, R. Sharda, A neural network model for bankruptcy prediction, IJCNN International Joint Conference 12 (1990) 163–168.
[20] R.L. Wilson, R. Sharda, Bankruptcy prediction using neural networks, Decision Support Systems 11 (8) (1994) 545–557.
[21] A.F. Atiya, Bankruptcy prediction for credit risk using neural networks: a survey and new results, IEEE Transactions on Neural Networks 12 (4) (2001) 929–935.
[22] Y. Wang, S. Wang, K.K. Lai, A new fuzzy support vector machine to evaluate credit risk, IEEE Transactions on Fuzzy Systems 13 (6) (2005) 820–831.
[23] K.S. Shin, T.S. Lee, H.J. Kim, An application of support vector machines in bankruptcy prediction model, Expert Systems with Applications 28 (1) (2005) 127–135.
[24] J.H. Min, Y.C. Lee, Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters, Expert Systems with Applications 28 (4) (2005) 603–614.
[25] J.C. Neves, A. Vieira, Improving bankruptcy prediction with hidden layer learning vector quantization, European Accounting Review 15 (2) (2006) 253–271.
[26] H. Ahn, K. Lee, K. Kim, Global optimization of support vector machines using genetic algorithms for bankruptcy prediction, Neural Information Processing 4234 (3) (2006) 420–429.
[27] T.H. Lin, A cross model study of corporate financial distress prediction in Taiwan: multiple discriminant analysis, logit, probit and neural networks models, Neurocomputing 72 (16–18) (2009) 3507–3516.
[28] G.H. Muller, B.W. Steyn-Bruwer, W.D. Hamman, Predicting financial distress of companies listed on the JSE — a comparison of techniques, South African Journal of Business Management 40 (1) (2009) 21–32.
[29] H.S. Kim, S.Y. Sohn, Support vector machines for default prediction of SMEs based on technology credit, European Journal of Operational Research 201 (3) (2010) 838–846.
[30] C.C. Yeh, D.J. Chi, M.F. Hsu, A hybrid approach of DEA, rough set and support vector machines for business failure prediction, Expert Systems with Applications 37 (2) (2010) 1535–1541.
[31] M.E. Zmijewski, Methodological issues related to the estimation of financial distress prediction models, Journal of Accounting Research 22 (1984) 59–82.
[32] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1263–1284.
[33] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[34] Y. Ding, X. Song, Y. Zeng, Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Systems with Applications 34 (4) (2008) 3081–3089.
[35] K.C. Lam, E. Palaneeswaran, C.Y. Yu, A support vector machine model for contractor prequalification, Automation in Construction 18 (3) (2009) 321–329.
[36] G.F. Lin, G.R. Chen, P.Y. Huang, Y.C. Chou, Support vector machine-based models for hourly reservoir inflow forecasting during typhoon-warning periods, Journal of Hydrology 372 (1–4) (2009) 17–29.
[37] G.F. Lin, G.R. Chen, M.C. Wu, Y.C. Chou, Effective forecasting of hourly typhoon rainfall using support vector machines, Water Resources Research 45 (2009) W08440, doi:10.1029/2009WR007911.
[38] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.
[39] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy, Improvements to the SMO algorithm for SVM regression, IEEE Transactions on Neural Networks 11 (5) (2000) 1188–1193.
[40] P. Brockman, H.J. Turtle, A barrier option framework for corporate security valuation, Journal of Financial Economics 67 (3) (2003) 511–529.
[41] S. Bharath, T. Shumway, Forecasting default with the KMV-Merton model, working paper, University of Michigan, 2004.
[42] A.S. Reisz, C. Perlich, A market-based framework for bankruptcy prediction, working paper, 2004.
[43] S.A. Hillegeist, E.K. Keating, D.P. Cram, K.G. Lundstedt, Assessing the probability of bankruptcy, Review of Accounting Studies 9 (2004) 5–34.
[44] P. Gharghori, H. Chan, R. Faff, Investigating the performance of alternative default-risk models: option-based versus accounting-based approaches, Australian Journal of Management 31 (2) (2006) 207–234.
[45] V. Agarwal, R. Taffler, Comparing the performance of market-based and accounting-based bankruptcy prediction models, Journal of Banking & Finance 32 (2008) 1541–1551.
[46] R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced datasets, Proceedings of the European Conference on Machine Learning (ECML), 2004, pp. 39–50.
[47] F.J. Chang, L.C. Chang, Y.S. Wang, Enforced self-organizing map neural networks for river flood forecasting, Hydrological Processes 21 (6) (2007) 741–749.
[48] D.H. Koo, S.T. Ariaratnam, Innovative method for assessment of underground sewer pipe condition, Automation in Construction 15 (4) (2006) 479–488.
[49] E.J. Jaselskis, D.B. Ashley, Optimal allocation of project management resources for achieving success, Journal of Construction Engineering and Management 117 (2) (1991) 321–340.
[50] J.S. Russell, E.J. Jaselskis, Predicting construction contractor failure prior to contract award, Journal of Construction Engineering and Management 118 (4) (1992) 791–811.
[51] T. Bellotti, J. Crook, Support vector machines for credit scoring and discovery of significant features, Expert Systems with Applications 36 (2) (2009) 3302–3308.
[52] Wharton Research Data Services, The Wharton School of the University of Pennsylvania, 2009, http://wrds.wharton.upenn.edu.
[53] I.D. Dichev, Is the risk of bankruptcy a systematic risk? The Journal of Finance 53 (3) (1998) 1131–1147.
[54] G.D. Severson, E.J. Jaselskis, J.S. Russell, Trends in construction contractor financial data, Journal of Construction Engineering and Management 119 (4) (1993) 854–858.
[55] R. Kangari, M. Bakheet, Construction surety bonding, Journal of Construction Engineering and Management 127 (3) (2001) 232–238.
[56] D. Dwyer, A. Kocagil, R. Stein, The Moody's KMV EDF RiskCalc v3.1 model: next-generation technology for predicting private firm credit default risk, Moody's KMV Company, 2004.
[57] R.M. Stein, Benchmarking default prediction models: pitfalls and remedies in model validation, Journal of Risk Model Validation 1 (1) (2007) 77–113.
[58] D.W. Hosmer, S. Lemeshow, Applied Logistic Regression, 2nd ed., John Wiley & Sons, Inc., New York, 2000.
[59] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.