Predicting the listing status of Chinese listed companies with multi-class classification models

Information Sciences 328 (2016) 222–236

Ligang Zhou a,∗, Kwo Ping Tam a, Hamido Fujita b

a School of Business, Macau University of Science and Technology, Taipa, Macau
b Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan

Article history: Received 3 June 2015; Revised 20 August 2015; Accepted 21 August 2015; Available online 29 August 2015

Keywords: Prediction; One-vs-all; One-vs-one; Multi-class classification; Listing status

Abstract

In China's stock markets, a listed company's different listing statuses signal different risk levels. Because such risks are difficult to measure adequately, predicting the listing status of listed companies is vital for investors and other stakeholders. Existing studies tend to simplify the measurement by classifying listing status into two categories and applying binary classification models; however, such models cannot support accurate risk management. Considering that Chinese listed companies in practice exhibit four different listing statuses, this study introduces three types of multi-class classification models to predict listing status and thereby achieve better performance in terms of accuracy measures. The three types of models are based on One-versus-One and One-versus-All with parallel and hierarchy strategies. The performances of the three models with two different feature selection strategies are compared, and the effectiveness and accuracy of the models are tested on a large test dataset. The achieved accuracy measures can provide better risk prediction for listed companies.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

The Chinese stock market, which opened in 1990, is an emerging market. To make individual investors aware of the risk of different listed companies, in 1998 the China Securities Regulatory Commission (CSRC) issued special listing rules on disclosing the risk level of listed companies. Under these rules, when abnormal financial or other specified conditions arise in a listed company, and these conditions increase the company's risk of being terminated from the exchange or leave investors unable to judge the company's prospects, thereby hurting investors' interests, the company's stock is given a special treatment (ST) label as a risk warning. Under the listing rules, an ST-labeled company can have the ST label removed if its financial health has improved to an extent that satisfies certain specified requirements. In China, a healthy company (without the special treatment label) may become financially troubled and thus receive the delisting risk (or some other risk) warning, or may even be delisted, while a financially troubled company may regain its status as a healthy company. In general, listed companies in China exhibit four listing statuses: (1) normal status without any risk warning, (2) abnormal status with other risk warning, (3) abnormal status with delisting risk warning, and (4) delisted status. These four statuses are denoted "A", "B", "D" and "X", respectively. Listing status can switch from one level to another, with the exception that delisted status cannot be reversed.



Corresponding author. Tel.: +85388972903. E-mail addresses: [email protected] (L. Zhou), [email protected] (K.P. Tam), [email protected] (H. Fujita).

http://dx.doi.org/10.1016/j.ins.2015.08.036 0020-0255/© 2015 Elsevier Inc. All rights reserved.


Correctly predicting the listing status of a listed company is very important for the company's stakeholders, including investors, creditors, suppliers and customers. Predicting a change in listing status can help investors manage their stock portfolio risk and help creditors, suppliers and customers accurately evaluate a company's credit risk. In addition, a daily price limit of plus or minus 5% applies to the trading price of special treatment companies, while a 10% daily limit is set for healthy listed companies. Consequently, stocks with different listing statuses carry different levels of overall risk, such as volatility, liquidity, and delisting risks, all of which are major concerns for investors in making their investment decisions. The existing literature on predicting the listing status of Chinese listed companies mainly focuses on predicting whether a healthy listed company will maintain a normal status or fall into financial distress. Ding et al. [6] introduced support vector machines to predict the financial distress of a company, defining a company in financial distress as one that had received special treatment in the Chinese stock markets. Zhang et al. [32] developed a Z-score model to forecast whether a firm would receive special treatment. Existing studies [12,16,19,20,28,31] commonly treat listing status prediction for Chinese listed companies as a binary classification problem and develop binary classification models to predict the listing status as "normal" or "financially distressed" in a given period. Given that there are in fact four possible listing statuses for a Chinese listed company, each corresponding to a different risk level, it is of practical significance to apply a multi-class classification approach to predict the listing status of listed companies in China.
The most popular strategy for solving multi-class classification problems is to transform the multi-class problem into multiple binary classification problems [10], which raises two important issues. One is how to decompose a multi-class classification problem into a series of binary classification problems; the other is how to assemble the results obtained by the multiple binary classifiers. The methods for the former and latter are termed "decomposition strategies" and "ensemble strategies", respectively. There are two common decomposition strategies: One-vs-One (OVO) [18] and One-vs-All (OVA) [3]. Suppose there are K classes in a multi-class classification problem. The OVO approach divides the problem into C_K^2 = K(K − 1)/2 binary classification problems, after which one binary classifier is trained to discriminate between the classes in each pair. The outputs from all C_K^2 binary classifiers are aggregated to predict the output class. The OVA approach divides the problem into K binary classification problems such that each binary classifier distinguishes one class from all other classes. Some existing studies show the successful application of OVA to multi-class classification problems [7,8,15]. Rifkin and Klautau [26] claimed that the OVA strategy is as accurate as any other approach when the base classifiers are well tuned, while others [11,17] have demonstrated that the OVO strategy is a useful alternative for multi-class classification problems, with performance superior to that of OVA. Galar et al. [10] conducted a comprehensive investigation of different ensemble methods for binary classifiers in multi-class problems with OVO and OVA strategies. Their empirical study showed that the performance of OVO and OVA is sensitive to the selection of the ensemble strategy, and that the best aggregation for a problem depends on the base classifier and the characteristics of the problem.
For a listed company, a large amount of information can be used to predict its listing status, such as the company's characteristics, financial performance and market information. The existing literature has shown that financial and market information is effective in predicting a company's financial status [1,5,25]. The selection of features affects the performance of the classification models. Moreover, the number of companies with each listing status varies from year to year, so the listing status prediction problem is a typical highly imbalanced multi-class classification problem. This study introduces multi-class classification models to predict the listing status of Chinese listed companies by integrating the selection of features and samples into OVA and OVO with parallel and hierarchy strategies, respectively. It also investigates the effect of different binary classifiers in the OVA and OVO strategies.

The remainder of this paper is organized as follows. The frameworks based on OVO and OVA with two different ensemble strategies are explained in detail in Section 2. Section 3 presents the empirical results of the multi-class classification models. Section 4 draws conclusions and summarizes the major findings.

2. One-vs-All and One-vs-One aggregative models

The OVA aggregative models (OVAAM) integrate the processes of feature selection, sampling and the OVA strategies. Two ensemble strategies, parallel and hierarchy, are employed with OVA in this study. The training and testing processes in OVAAM with the parallel and hierarchical ensemble strategies are shown in Figs. 1 and 2, respectively. The training and testing processes in the One-vs-One (OVO) aggregative models (OVOAM) are shown in Fig. 3.

2.1. Feature selection method

Feature selection can speed up learning, facilitate data understanding, and improve prediction performance.
A variety of feature selection methods have been proposed and examined, including filter methods, wrapper methods and embedded methods [14]. This study employs a hybrid feature selection method that combines a filter method based on a two-sample t-test with variance inflation factor (VIF) analysis. The hybrid method takes advantage of the speed of the filter method, while the VIF analysis ensures a low level of dependence among the selected features.


Fig. 1. The training and testing processes in OVAAM with a parallel ensemble strategy.

Suppose that the features of an observation are denoted by X = (X_1, X_2, …, X_m) and Y is the class, Y ∈ {1, 0}. There are N observations in a training set. Let v_j^1 = var(X_j | Y = 1) and v_j^0 = var(X_j | Y = 0), where var(·) is the variance of a group of values. Let m_j^1 = mean(X_j | Y = 1) and m_j^0 = mean(X_j | Y = 0), where mean(·) is the mean of a group of values. The feature weighting strategy based on the t-test on the training data set is defined as follows [13,33]:

$$ z_j = \frac{\left| m_j^1 - m_j^0 \right|}{\sqrt{v_j^1 / N_1 + v_j^0 / N_0}}, \tag{1} $$

where N_1 and N_0 denote the number of observations with Y = 1 and Y = 0, respectively. Each feature can be ranked by its z value, as defined in Eq. (1), in descending order. The top-ranked features have good discriminative capability on the two classes but may exhibit high multicollinearity. Therefore, VIF analysis is employed to prevent high multicollinearity among the selected features. The VIF of feature j is calculated as follows:

$$ V_j = \frac{1}{1 - R_j^2}, \tag{2} $$

where R_j^2 is the coefficient of determination of the regression equation X_j = β_0 + βX′, where X′ contains all features except X_j. The algorithm based on the filter method with the t-test criterion and VIF (ttFVIF) for feature subset selection is as follows:

Algorithm ttFVIF:
Input: the training sample set S_r with M features X_1, X_2, …, X_M and class Y; the number of features M∗ to be selected.
Output: feature subset F∗.
1. F∗ = ∅.
2. Rank the M features by the t-test criterion in descending order and add the first feature to F∗. Then calculate the VIF of the second feature with respect to the features in F∗. If the VIF value of the second feature is less than 10, add the second feature to F∗; otherwise, move on and compute the VIF of the next feature, and so on, until all M features have been checked or the number of features in F∗ reaches M∗.


Fig. 2. The training and testing processes in OVAAM with a hierarchy ensemble strategy.

The VIF threshold of 10 in ttFVIF is an empirical value suggested by [21]. A larger VIF threshold permits the selection of features that are highly correlated with already-selected features, while a drastically smaller VIF threshold reduces the number of feature candidates and may prevent any additional feature from being selected.
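The two steps above, t-test ranking (Eq. (1)) followed by the VIF screen (Eq. (2)), can be sketched as follows. This is an illustrative NumPy implementation under our own naming; the paper's experiments were run in Matlab and Weka, and degenerate cases (constant features, ties in z) are handled only minimally.

```python
import numpy as np

def vif(X, j, chosen):
    """Eq. (2): V_j = 1 / (1 - R_j^2), with R_j^2 the coefficient of
    determination of regressing X_j on the already-selected features."""
    A = np.column_stack([np.ones(len(X)), X[:, chosen]])
    b = X[:, j]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ coef
    r2 = 1.0 - (resid ** 2).sum() / ((b - b.mean()) ** 2).sum()
    return 1.0 / max(1.0 - r2, 1e-12)

def ttfvif(X, y, m_star, vif_threshold=10.0):
    """Sketch of Algorithm ttFVIF: rank features by the two-sample
    t-statistic of Eq. (1) in descending order, then admit them greedily
    while their VIF against the selected subset stays below the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x1, x0 = X[y == 1], X[y == 0]
    n1, n0 = len(x1), len(x0)
    # Eq. (1): z_j = |m_j^1 - m_j^0| / sqrt(v_j^1/N1 + v_j^0/N0)
    z = np.abs(x1.mean(axis=0) - x0.mean(axis=0)) / np.sqrt(
        x1.var(axis=0, ddof=1) / n1 + x0.var(axis=0, ddof=1) / n0)
    order = np.argsort(-z)                  # descending by z
    selected = [int(order[0])]
    for j in order[1:]:
        if len(selected) == m_star:
            break
        if vif(X, j, selected) < vif_threshold:
            selected.append(int(j))
    return selected
```

On synthetic data, a near-duplicate of an already-selected feature is rejected by the VIF screen even though its z value is high, which is exactly the behavior the threshold of 10 is meant to enforce.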

2.2. Sampling method

In the training data set for listing status prediction, the classes contain different numbers of observations. When the listing status prediction problem is decomposed into a series of binary classification problems, the training set for each binary problem consists of two classes: one class is inherited from the initial training set, and the remaining classes of the initial training set are merged into the other class. The numbers of observations in these two classes are usually imbalanced. To make the final training set balanced, the random undersampling method is employed. Suppose that the number of observations selected for each of the two classes is N′, and that S_1 and S_0 denote the observation sets, while N_1 and N_0 denote the number of observations with Y = 1 and Y = 0 in the original training set, respectively. The sampling algorithm based on random undersampling (RU) is as follows:


Fig. 3. The training and testing processes in OVOAM.

Algorithm RU:
Input: the original training set {S_1, S_0}.
Output: balanced training set S′ of size 2N′.
1. S′ = ∅.
2. N′ = min(N_0, N_1).
3. if N_1 > N_0 then
       S′ = S_0; randomly select N′ observations from S_1 into S′
   else
       S′ = S_1; randomly select N′ observations from S_0 into S′
   end if
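Algorithm RU is small enough to render directly in code; the following is an illustrative sketch (the function and parameter names are ours, not the paper's):

```python
import random

def random_undersample(s1, s0, seed=None):
    """Sketch of Algorithm RU: keep the minority class whole and draw,
    without replacement, an equally sized random subset of the majority
    class, returning a balanced set of 2*N' observations."""
    rng = random.Random(seed)
    n_prime = min(len(s1), len(s0))
    if len(s1) > len(s0):
        return list(s0) + rng.sample(list(s1), n_prime)
    return list(s1) + rng.sample(list(s0), n_prime)
```

Sampling without replacement guarantees that no majority-class observation is duplicated in the balanced set.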

2.3. Base binary classifiers

This study aims to find an effective classification model for predicting the listing status of Chinese listed companies. The base binary classifier plays an important role in the OVA and OVO aggregative models. Seven well-known and widely used binary classification methods have been selected to construct the base binary classifiers. The selected methods are briefly introduced as follows:

(1) Linear discriminant analysis (LDA) assumes that the classes have features drawn from different Gaussian distributions: the features of each class are generated by a multivariate normal distribution (MND), and the MNDs of the classes share the same covariance matrix but have different means. LDA can take into account the cost of misclassifying one class as another, and its prediction is designed to minimize the expected classification cost [22].


(2) Logistic regression (LR) is a classical binary classification method. It estimates the probability that an observation with features x belongs to class 1 by using the following formula:

$$ P(Y = 1 \mid x) = \frac{e^{w_0 + w^{T} x}}{1 + e^{w_0 + w^{T} x}}. \tag{3} $$

(3) Neural networks (NN), inspired by biological nervous systems, use interconnections between neurons to determine the network function. NNs are trained to fit the function mapping the input to the output; theoretically, they can fit any function, and they have been successfully used for a wide variety of classification and forecasting problems. A comprehensive introduction to neural networks can be found in [4].

(4) Decision tree C4.5 (DTC4.5) uses the gain ratio as its splitting criterion, and splitting stops when the number of observations to be split falls below a certain threshold. It uses error-based pruning to remove the least reliable branches [24].

(5) k-nearest neighbor (KNN) [22] is a simple and standard nonparametric approach that classifies an observation by considering only its k nearest neighbors in the training set. The method used to measure the distance between two observations and the value of k are key factors that affect its performance.

(6) Adaboost, a typical meta-learning algorithm, calls a weak or base learning algorithm repeatedly in a series of rounds. In each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the misclassified observations in the training set. The base learning algorithm is usually a simple decision tree [9].

(7) The least squares support vector machine (LSSVM) aims to minimize the upper bound of the generalization error. It maps the input vectors into a high-dimensional feature space through a nonlinear mapping function and constructs an optimal separating hyperplane that separates the two classes of observations with maximal margin. The details of LSSVM can be found in [29].

In the OVA and OVO aggregative models, every base binary classifier is required to compute the posterior probability that an observation belongs to class 1 (Y = 1).
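To make the role of the posterior probability concrete, here is a minimal logistic regression fitted by batch gradient descent that returns the P(Y = 1 | x) of Eq. (3). This is a didactic NumPy stand-in for the paper's Matlab/Weka implementations; the function name and hyperparameters are ours.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Fit (w0, w) by gradient descent on the logistic log-likelihood and
    return a function computing the posterior P(Y=1|x) of Eq. (3)."""
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))     # current P(Y=1|x_i)
        w -= lr * Xb.T @ (p - y) / len(y)     # gradient of the negative log-likelihood
    return lambda x: float(
        1.0 / (1.0 + np.exp(-(w[0] + np.asarray(x, dtype=float) @ w[1:]))))
```

For a linearly separable toy sample, the returned posterior rises monotonically with the feature value, and thresholding it at 0.5 recovers the class boundary.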
If a threshold value is specified, it is easy to classify an observation into a class according to the posterior probability. Most of the above binary classifiers can compute the posterior probability in a straightforward way; however, binary classifiers such as neural networks and support vector machines, which cannot generate a posterior probability directly, require a calibration method. In this study, Platt calibration [23] is applied to transform the initial output of such binary classifiers by passing it through a sigmoid function. Let the unthresholded output of a binary classifier be f(x); then, the probabilistic output can be obtained as follows:

$$ P(Y = 1 \mid f(x)) = \frac{1}{1 + e^{A f(x) + B}}, \tag{4} $$

where the parameters A and B are estimated by maximum likelihood from a fitting training set {(f_i, Y_i) | i = 1, 2, …, N}, i.e., by solving the following optimization problem:





$$ \arg\min_{A,B} \; -\sum_{i} \left[ Y'_i \log(p_i) + (1 - Y'_i) \log(1 - p_i) \right], \tag{5} $$

where

$$ p_i = \frac{1}{1 + e^{A f_i + B}}. \tag{6} $$

Let N_1 and N_0 denote the number of observations with Y = 1 and Y = 0, respectively; the target value Y′_i is defined as follows:



$$ Y'_i = \begin{cases} \dfrac{N_1 + 1}{N_1 + 2}, & \text{if } Y_i = 1, \\[1ex] \dfrac{1}{N_0 + 2}, & \text{if } Y_i = 0. \end{cases} \tag{7} $$
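Eqs. (4)–(7) can be sketched as follows. Note that this fits A and B by plain batch gradient descent on the objective of Eq. (5) rather than the more robust second-order procedure of Platt's original paper, so it should be read as a didactic approximation with our own naming.

```python
import math

def platt_calibrate(scores, labels, lr=0.01, iters=5000):
    """Fit the sigmoid parameters A, B of Eq. (4) by gradient descent on
    the cross-entropy objective of Eq. (5), using the smoothed targets of
    Eq. (7); returns a map from a raw score f(x) to P(Y = 1 | f(x))."""
    n1 = sum(1 for y in labels if y == 1)
    n0 = len(labels) - n1
    # Eq. (7): smoothed targets Y'_i
    targets = [(n1 + 1.0) / (n1 + 2.0) if y == 1 else 1.0 / (n0 + 2.0)
               for y in labels]
    a = b = 0.0
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(a * f + b))  # Eqs. (4)/(6)
            grad_a += (t - p) * f                  # d(loss)/dA
            grad_b += t - p                        # d(loss)/dB
        a -= lr * grad_a
        b -= lr * grad_b
    return lambda f: 1.0 / (1.0 + math.exp(a * f + b))
```

On scores that correlate positively with the class label, the fitted sigmoid is monotonically increasing in the score, and the smoothed targets keep the calibrated probabilities strictly inside (0, 1).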

2.4. One-vs-All aggregative models

The One-vs-All aggregative model with the parallel ensemble strategy (OVAPES), shown in Fig. 1, trains a base binary classifier for each class by distinguishing that class from all other classes. The prediction for a new observation is the class that obtains the highest posterior probability (confidence score) among all base binary classifiers. The algorithm for OVAPES is as follows:

Algorithm OVAPES:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K base binary classifiers C_k for k ∈ {1, 2, …, K}.
for k = 1 to K
    Construct a new label for observation i: Y′_i = 1 if Y_i = k, otherwise Y′_i = 0, i = 1, 2, …, N;
    Call Algorithm ttFVIF to identify the M∗-feature set F∗;


    Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k.
end for

For an unseen observation x, the posterior probability of classifying it into class k using classifier C_k is denoted by P(Y = k | x). The parallel ensemble strategy classifies the observation into the class with the maximum value of P(Y = k | x):

$$ \hat{Y} = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k \mid x). \tag{8} $$

The algorithm of the One-vs-All aggregative model with a hierarchical ensemble strategy (OVAHES) is as follows:

Algorithm OVAHES:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K − 1 base binary classifiers C_k for k ∈ {1, 2, …, K − 1}.
for k = 1 to K − 1
    Construct a new label for observation i: Y′_i = 1 if Y_i = k; Y′_i = 0 if Y_i ∈ {k + 1, k + 2, …, K}, i = 1, 2, …, N;
    Call Algorithm ttFVIF to identify the M∗-feature set F∗;
    Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k.
end for

The hierarchical ensemble strategy classifies an unseen observation x into the first class k whose classifier reports P(Y = k | x) ≥ 0.5:

$$ \hat{Y} = \min \left\{ k \in \{1, 2, \ldots, K\} \mid P(Y = k \mid x) \geq 0.5 \right\}, \tag{9} $$

where P(Y = K | x) = 1 − P(Y = K − 1 | x).

2.5. One-vs-One aggregative models

Algorithm OVO:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K(K − 1)/2 base binary classifiers C_k for k ∈ {1, 2, …, K(K − 1)/2}.
k = 1
for m = 1 to K − 1
    for j = m + 1 to K
        Select all observations of classes m and j from the training set S_r;
        Call Algorithm ttFVIF to identify the M∗-feature set F∗;
        Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k;
        k = k + 1
    end for
end for

OVO classifies an unseen observation x into the class that receives the maximum number of votes from the K(K − 1)/2 classifiers.

3. Empirical analysis

3.1. The data

The data source is the China Stock Market and Accounting Research Database (CSMARD) provided by the GTA database. There are 18,551 company-year observations dating from 1999 to 2011. Each observation in an observed year t contains the following features: (1) 164 financial ratios measuring various aspects of a company's financial status for fiscal year t − 1, such as short-term solvency, long-term solvency, asset management or turnover, profitability, capital structure, stockholders' earning profitability, cash management and development capability; (2) three market variables introduced by Shumway [27], namely the excess return of the company's stock, its relative market capitalization, and the standard deviation of the firm's stock return in fiscal year t − 1; and (3) the listing status of the company in year t + 1. According to the listing rules in China, all listed companies must declare their financial statements for a fiscal year within four months after the end of the fiscal year. Any necessary actions on listed companies are usually enforced by the CSRC by the end of June, in keeping with the listing rules.
The listing status of a company at the end of June in year t is denoted by L_t, while the company's listing status at the end of year t + 1 is denoted by Y_{t+1}. The prediction is made by the end of June each year. Fig. 4(a)–(c) gives the number of companies with different listing statuses by year, with L_t taking the values "A", "B" and "D", respectively. Because new companies are listed each year, and the number of newly listed companies is usually greater than the number of delisted companies, the number of normal companies marked "A" increases over time. Because a delisted


Fig. 4. The number of companies with listing status transformations.

company cannot recover any other listing status, it is meaningless to predict the status of a company that has just been delisted. Therefore, Fig. 4(d) only gives the number of companies with status "X" in year t + 1 when the prediction is conducted in year t. Fig. 4(a) shows that most normal, healthy companies with listing status "A" retain their "A" status in the next fiscal year; very few transform to other listing statuses. For example, in fiscal year 2010, there were 1995 companies with listing status "A", among which 1965 retained "A" status, one moved to "B", 24 changed to "D", and five were delisted in the following year. Therefore, predicting the listing status of listed companies is a highly imbalanced classification problem. Fig. 4(b) shows that most companies with listing status "B" retain "B" status in the following fiscal year, and some may regain listing status "A" because of improved financial performance. In fiscal year 2010, there were 87 companies with listing status "B", among which 31 regained "A" status, 36 retained "B" status, and 10 changed to "D". Fig. 4(c) shows that companies with listing status "D" may move to "A" or "B" or retain their "D" status; only a small proportion slip to "X".

3.2. Experimental settings

3.2.1. Sampling and classifiers

The dataset, comprising 18,551 observations, is split into a training set and a testing set. The training set consists of observations with t ≤ 2006, while the testing set contains observations with 2007 ≤ t ≤ 2011. Because the "A", "B", "D" and "X" classes are highly imbalanced in the training data, the RU algorithm is employed to construct a balanced training sample for the binary classifiers in the three frameworks. The number of selected observations N′ for each class in RU is set to different values in OVAPES, OVAHES and OVO.
The numbers of observations from the different classes in the training and testing sets are listed in Table 1. N_training is the number of observations available in the training set for sampling, and N_testing is the number of observations in the testing set. There are four binary classifiers in OVAPES, each constructed on one class versus all other classes. For example, in the sampled training set for OVAPES to classify "A" companies, there are 1223 companies with listing status "A"

Table 1
The number of observations for the different classes in the training set and testing set.

Y_{t+1}   N_training   N′ in sampled training set                                      N_testing
                       OVAPES       OVAHES       OVO: vs B    vs D       vs X
A         10,028       1223/1223    1223/1223    649/649      488/488    86/86        6629
B         649          649/649      574/574      –            488/488    86/86        328
D         488          488/488      86/86        –            –          86/86        319
X         86           86/86        –            –            –          –            24

Table 2
A typical resulting confusion matrix.

                   Predicted class
Actual class       A      B      D      X      Total
A                  n11    n12    n13    n14    n^1
B                  n21    n22    n23    n24    n^2
D                  n31    n32    n33    n34    n^3
X                  n41    n42    n43    n44    n^4
Total              n̂1    n̂2    n̂3    n̂4    n

plus 1223 companies with listing status "B", "D" or "X". There are three binary classifiers in OVAHES; each is constructed on one class versus all other classes that have not yet been considered separately. Six binary classifiers are used in OVO, each constructed on one class versus one of the other classes. Most classifiers are implemented in Matlab, while DTC4.5 is implemented by weka.classifiers.trees.J48 in Weka [30]. The parameter k in k-nearest neighbors takes the value 10, and most parameters of the binary classifiers use the default settings in Matlab or Weka. To show the sensitivity of the frameworks' performance to the number of selected features M∗ in ttFVIF, the frameworks have been tested with M∗ taking values in the set {4, 8, 12, 16, 20}.

3.2.2. Performance measures

The classification accuracy rate and Cohen's kappa have been applied as performance measures in this study because they are commonly used for evaluating performance in binary and multi-class problems [10]. A typical confusion matrix for the listing status prediction problem is reported in Table 2, where n_ij denotes the number of observations with actual listing status y_i that are predicted as listing status y_j (i = 1, …, 4, j = 1, …, 4). The classification accuracy rate (CAR) is the ratio of correctly classified observations to the total number of classified observations. It is computed as follows:

$$ \mathrm{CAR} = \frac{\sum_{i=1}^{4} n_{ii}}{n}. \tag{10} $$

Cohen's kappa (CK) evaluates the proportion of hits that can be attributed to the classifier itself, rather than to mere chance, relative to all of the classifications that cannot be attributed to chance alone. Cohen's kappa is computed as follows:

$$ \mathrm{CK} = \frac{n \sum_{i=1}^{4} n_{ii} - \sum_{i=1}^{4} n^{i} \hat{n}_{i}}{n^{2} - \sum_{i=1}^{4} n^{i} \hat{n}_{i}}. \tag{11} $$
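Both measures follow directly from the confusion matrix of Table 2; the following is a small self-contained sketch (the function name is ours):

```python
def car_and_kappa(cm):
    """Eqs. (10)-(11): classification accuracy rate and Cohen's kappa from
    a confusion matrix cm, where cm[i][j] counts observations of actual
    class i predicted as class j (the layout of Table 2)."""
    n = sum(sum(row) for row in cm)
    hits = sum(cm[i][i] for i in range(len(cm)))   # sum of n_ii
    actual = [sum(row) for row in cm]              # row totals n^i
    predicted = [sum(col) for col in zip(*cm)]     # column totals n-hat_i
    chance = sum(a * p for a, p in zip(actual, predicted))
    return hits / n, (n * hits - chance) / (n * n - chance)
```

A perfect diagonal matrix yields CAR = CK = 1, while a classifier whose predictions are independent of the true class yields CK = 0 even when CAR is well above zero, which is exactly why CK matters on imbalanced test sets.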

Cohen's kappa ranges from −1 (total disagreement) to +1 (total agreement), and CK = 0 indicates a random classification. Cohen's kappa is a simple but useful measure for evaluating the performance of multi-class classifiers: it scores the hits independently for each class and aggregates them, while the classification rate scores all of the hits over all classes [10].

3.3. Experimental results

3.3.1. Results of models with features selected by ttFVIF

Because the performance of the models is affected by the sampled training set and by the number of selected features M∗, 30 test iterations with the binary classifiers trained on 30 different groups of samples are conducted for each value of M∗ in the set {4, 8, 12, 16, 20}. Fig. 5 shows the average classification accuracy rate over the 30 test iterations for the three models with different binary classifiers and different values of M∗. It can be observed that OVAPES always obtains a significantly higher classification accuracy rate than OVAHES in all scenarios, where each scenario comprises the setting of a binary classifier and a fixed number of selected features. In addition, with respect to the classification accuracy rate, each binary classifier performs stably in OVAPES but not in OVAHES under different settings of the

[Fig. 5. Average classification accuracy rate of different frameworks with different settings. Panels: (a) OVAPES, (b) OVAHES, (c) OVO, (d) settings of maximum average CAR. Each panel plots the classification accuracy rate against the number of selected features (4–20) for the binary classifiers LR, NN, DTC4.5, KNN, Adaboost, LSSVM and LDA.]

number of selected features; i.e., the CAR of each binary classifier in OVAPES is less sensitive to this setting than in OVAHES. OVO achieves a higher CAR than OVAHES in almost all scenarios, except for LSSVM with M∗ taking the value 12, 16 or 20. Comparing the CAR of OVAPES and OVO, some binary classifiers, such as LDA and Adaboost, obtain a higher CAR in OVAPES under all settings of M∗, while others, such as DTC4.5, achieve a higher CAR in OVO under all settings of M∗. The binary classifiers LR, NN, KNN and LSSVM do not consistently obtain a higher CAR in either OVAPES or OVO across the different settings of M∗. Fig. 5(d) shows, for each binary classifier, the setting of M∗ and the aggregative model under which that classifier achieves its maximum average CAR over the 30 test iterations; it is generated from Fig. 5(a)–(c) by presenting only the maximum average CAR for each binary classifier. Fig. 5(d) shows that LDA achieves the global maximum average CAR of 0.9178 in OVAPES when the number of selected features is set to eight. LR follows LDA with the second-highest maximum average CAR of 0.8886 in OVO, also with the number of selected features set to eight. For each setting under which a binary classifier achieves its maximum average CAR, the standard deviation of the CAR over the 30 test iterations is small, which suggests that the binary classifier performs consistently under that setting. Table 3 reports the maximum average CAR and the standard deviation of the CAR over the 30 test iterations for each binary classifier when achieving its maximum average CAR.


L. Zhou et al. / Information Sciences 328 (2016) 222–236

Table 3
The maximum average CAR and the standard deviation of CAR from each binary classifier achieving the maximum average CAR.

Classifier    Max. CAR    Std.
LDA           0.9178      0.0026
LR            0.8886      0.0125
NN            0.7928      0.0262
DTC4.5        0.8380      0.0354
KNN           0.8712      0.0055
Adaboost      0.8673      0.0048
LSSVM         0.8422      0.0195

[Fig. 6 comprises four 3-D bar plots, (a) OVAPES, (b) OVAHES, (c) OVO, and (d) the settings of maximum average CK, each plotting Cohen's kappa against the number of selected features (4 to 20) for the binary classifiers LDA, LR, NN, DTC4.5, KNN, Adaboost and LSSVM.]

Fig. 6. Average Cohen’s kappa of different frameworks with different settings.

As shown in Table 1, the testing set is highly imbalanced. Although classification accuracy demonstrates the performance of the different frameworks to some degree, CK remains an important performance measure for multi-class classification problems. Fig. 6 shows the average CK of 30 test iterations under the three aggregative models with different binary classifiers and different numbers of selected features M∗. The CK results across the aggregative models under different settings suggest findings similar to those for CAR: OVAPES and OVO outperform OVAHES under almost all settings, and the CK measure is highly correlated with CAR, in that the higher the CAR, the higher the CK. As shown in Fig. 6(d), NN achieves the maximum average CK when M∗ is set to eight, whereas it achieves the maximum average CAR when M∗ is set to four, as shown in Fig. 5(d); the other binary classifiers obtain their maximum CAR and CK under the same settings. LDA has the maximum average CK score of 0.5630 in Fig. 6(d), followed by LR (0.4910) and LSSVM (0.4581). The performance of the three aggregative models incorporating the different binary classifiers is evaluated in terms of the average CAR and CK scores over 30 test iterations for the seven binary classifiers and three frameworks with M∗ set to eight. Because LDA achieves the global maximum average CAR at this setting and almost all binary classifiers in OVAPES and OVO are affected by different values of M∗, the performance of all binary classifiers under M∗ = 8 is selected as representative.
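Cohen's kappa corrects raw accuracy for the agreement expected by chance, which is why it matters on an imbalanced test set such as this one. A self-contained sketch in pure Python (not the authors' code) illustrates how a majority-class predictor scores high accuracy but zero kappa:

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (accuracy) and p_e is chance agreement from the marginals."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n**2
    return (p_o - p_e) / (1 - p_e)

# a classifier labelling everything "A" scores 0.90 accuracy on a
# 90%-"A" sample, but kappa 0.0: no skill beyond the majority class
y_true = ["A"] * 90 + ["B"] * 5 + ["D"] * 3 + ["X"] * 2
y_pred = ["A"] * 100
print(cohens_kappa(y_true, y_pred))  # 0.0
```

Perfect agreement yields kappa 1.0 regardless of the class imbalance, which is what makes CK a useful complement to CAR here.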


Table 4
Average CAR and CK scores (standard deviations in parentheses) in tests with seven binary classifiers and three different aggregative models.

Binary        OVAPES                               OVAHES                               OVO
classifier    CAR              CK                  CAR              CK                  CAR              CK
LDA           0.9178 (0.0026)  0.5630 (0.0092)     0.6620 (0.0270)  0.1853 (0.0171)     0.9135 (0.0033)  0.5444 (0.0106)
LR            0.8781* (0.0062) 0.4641* (0.0156)    0.6428 (0.0265)  0.1809 (0.0160)     0.8886 (0.0125)  0.4910 (0.0294)
NN            0.7195 (0.0905)  0.2599 (0.0766)     0.5336 (0.0868)  0.1362 (0.0469)     0.7817 (0.0422)  0.3227 (0.0518)
DTC4.5        0.7555 (0.0713)  0.2902 (0.0757)     0.6961 (0.1079)  0.2523 (0.0880)     0.8031 (0.0389)  0.3522 (0.0539)
KNN           0.8611 (0.0040)  0.4357 (0.0113)     0.8046 (0.0057)  0.3411 (0.0095)     0.8352 (0.0107)  0.3769 (0.0171)
Adaboost      0.8623 (0.0066)  0.4515 (0.0149)     0.6139 (0.0437)  0.1839 (0.0261)     0.7939 (0.0259)  0.3278 (0.0341)
LSSVM         0.7994 (0.0094)  0.3443 (0.0141)     0.6531 (0.0168)  0.1926 (0.0104)     0.7645 (0.0259)  0.2896 (0.0287)

Table 4 shows the performance of the different binary classifiers in the three aggregative models. A two-way ANOVA with replication shows that OVAPES and OVO outperform OVAHES in CAR and CK for all binary classifiers. Because LDA obtains the global maximum CAR in OVAPES, a Nemenyi test is used to compare all binary classifiers with each other under OVAPES and to check whether their differences are statistically significant. The performance of two classifiers is statistically significantly different if their average ranks differ by at least the critical difference (CD):

CD = q_\alpha \sqrt{\frac{C(C+1)}{6I}},    (12)

where qα is the critical value for the two-tailed Nemenyi test at significance level α, C is the number of classifiers, and I is the number of data sets. This study compares the performance of all binary classifiers in the OVAPES framework in Table 4, with C = 7, I = 30, qα = 2.949, and hence CD = 1.645. The binary classifiers whose CAR or CK performance shows no significant difference from that of LDA are marked with an asterisk "∗". The evidence suggests that LDA exhibits the best performance on the CAR and CK measures among all seven binary classifiers in the OVAPES framework, and that LR is not significantly different from LDA in terms of CAR and CK. A Wilcoxon signed-rank test shows that LDA does not differ significantly in CAR and CK between OVAPES and OVO, and nor does LR.

In Table 4, LDA achieves an overall classification accuracy of 91.78% on a test set of 7300 observations with different listing statuses, which suggests relatively high classification accuracy. However, the test set contains 6629 observations with listing status "A", so a classifier that simply labels every observation "A" would still show a high classification accuracy of 90.81%. On such a highly imbalanced test set, classification accuracy alone cannot reliably show the discriminative capability of the classifiers. In terms of Cohen's kappa, LDA generates an average score of 0.5272, which is noticeably larger than zero. In the test set, 6442 of the 6579 observations with a current listing status of "A" retained that status, so classifying all current "A" companies as "A" in the next year can still achieve a high classification accuracy rate. In practice, it is more important for risk managers to correctly predict a "B", "D" or "X" company in the next year than an "A" company, as companies with "B", "D" or "X" listing status are associated with higher risk.

Fig. 7 shows the classification accuracy of the three aggregative models, with the same settings as in Table 4, on test samples with different listing statuses. Fig. 7(a) shows that LDA in OVAPES, which achieves the global maximum average CAR, correctly classifies 96.08% of "A" observations, 57.98% of "B" observations, 44.22% of "D" observations and no "X" observations. Fig. 7(c) shows that LDA in OVO correctly classifies 95.91% of "A" observations, 80.85% of "B" observations, 13.57% of "D" observations and 8.19% of "X" observations. It can be observed from Fig. 7 that no framework and setting achieves the maximum classification accuracy for all four listing statuses. LR in OVO achieves the maximum average classification accuracy for "B" observations, at 82.24%. NN in OVAHES achieves the maximum average classification accuracy for "D" observations, at 71.14%, but correctly classifies only 53.25% of "A" observations. No framework or setting attains a classification accuracy above 25% for "X" firms, showing that it is very difficult to predict whether a listed company will be delisted. One possible reason is that there are few observations of delisted companies, so it is difficult for the classifiers to mine the patterns or characteristics of delisted companies. Another possible reason is that, on China's stock markets, listed companies may be given a delisting risk warning not because of their financial performance but because of operational risk or negative audits from accounting agencies; consequently, it is difficult to predict "D" listing status from the companies' financial ratios. In addition, many listed companies in financial distress use corporate restructuring to avoid delisting, which makes predicting "X" difficult.
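The critical difference in Eq. (12) is straightforward to reproduce; with the study's values C = 7, I = 30 and qα = 2.949, it comes out at the reported 1.645 (rounded to three decimals):

```python
import math

def nemenyi_cd(q_alpha, n_classifiers, n_datasets):
    """Critical difference for the two-tailed Nemenyi test, Eq. (12)."""
    return q_alpha * math.sqrt(n_classifiers * (n_classifiers + 1)
                               / (6 * n_datasets))

cd = nemenyi_cd(q_alpha=2.949, n_classifiers=7, n_datasets=30)
print(round(cd, 3))  # 1.645
```

Two classifiers are then declared significantly different whenever their average ranks over the 30 test iterations differ by more than this value.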


Fig. 7. Classification accuracy on four different classes.
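Fig. 7 reports per-class accuracy, i.e., the fraction of each true listing status that is predicted correctly (per-class recall). A minimal sketch of computing it from paired label sequences (toy data, not the paper's test set):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class's observations correctly predicted."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return {c: correct[c] / total[c] for c in total}

# toy labels over the four listing statuses
y_true = ["A", "A", "A", "B", "B", "D", "X"]
y_pred = ["A", "A", "B", "B", "D", "D", "A"]
acc = per_class_accuracy(y_true, y_pred)
print(acc["A"])  # 2 of 3 "A" observations correct
```

This is the measure that exposes the near-zero accuracy for "X" firms, which overall CAR conceals.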

It is typically difficult to increase the classification accuracy of one class in a multi-class classification model without sacrificing the classification accuracy of another class. The overall performance is usually a balanced outcome over the different classes in terms of the cost of misclassifying each class. In practice, misclassification costs are difficult to estimate; in this study, equal cost is assumed for the misclassification of each class in order to demonstrate the discriminative capability of each model.

3.3.2. Results of models with features used by financial experts

In the previous experiment, the features were automatically selected by the ttFVIF algorithm without considering their financial meaning. It is natural to try the features suggested by financial experts, who have a professional understanding of the meaning of financial features, to check whether models with such features can improve classification performance. Listing status serves as a risk-warning signal for listed companies on the Chinese stock market and, under the listing rules, is highly related to a company's financial performance. A listing status of "B", "D" or "X" may indicate that a company is facing financial distress or bankruptcy risk. Therefore, the features suggested by financial experts for financial distress or bankruptcy prediction can be used to predict the listing status of Chinese listed companies. Although most existing studies on financial distress prediction focus on American and European companies, the structure of financial statements is identical for listed companies in China. It is technically difficult to check every feature suggested by every financial expert in financial distress prediction, so only widely accepted features with highly discriminative capabilities are used. The features selected from Altman [2], Zhang et al. [32] and Shumway [27] include: working capital to total assets (WCTA), retained earnings to total assets (RETA), earnings before interest and taxes to total assets (EBITTA), sales to total assets (STA), net income to total assets (NITA), total liabilities to total assets (TLTA), excess annual return (EAR), the firm's market capitalization to total market capitalization (FMCTMC), the standard deviation of stock daily returns (Std), earnings per share (EPS), book value per share (BVPS) and operating income per share (OIPS). Table 5 reports the average CAR and CK performance of the three multi-class classification models with different binary classifiers utilizing the features used by financial experts. OVAPES with the LDA classifier achieves the maximum average CAR score of 0.9202 and average CK score of 0.5688. Although the average CAR and CK scores shown in Table 5 for OVAPES with LDA are marginally higher than those in Table 4, a Wilcoxon signed-rank test shows that the difference in CAR and CK between the two


Fig. 8. Classification accuracy on four different classes.

Table 5
Average CAR and CK scores (standard deviations in parentheses) in tests with seven binary classifiers and three different frameworks, with features suggested by financial experts.

Binary        OVAPES                               OVAHES                               OVO
classifier    CAR              CK                  CAR              CK                  CAR              CK
LDA           0.9202 (0.0021)  0.5688 (0.0073)     0.6283 (0.1098)  0.1664 (0.0684)     0.8705 (0.0143)  0.4396 (0.0279)
LR            0.8592 (0.0586)  0.4400 (0.0826)     0.6648 (0.0643)  0.1886 (0.0541)     0.8412 (0.0260)  0.4029 (0.0360)
NN            0.7751 (0.0751)  0.3149 (0.0862)     0.6855 (0.0725)  0.2221 (0.0565)     0.7451 (0.0651)  0.2857 (0.0623)
DTC4.5        0.7505 (0.1170)  0.2797 (0.0721)     0.6005 (0.0770)  0.1660 (0.0470)     0.7610 (0.0446)  0.3033 (0.0511)
KNN           0.8875 (0.0039)  0.4858 (0.0126)     0.7428 (0.0068)  0.2453 (0.0074)     0.8252 (0.0451)  0.3582 (0.0550)
Adaboost      0.8668 (0.0048)  0.4585 (0.0116)     0.7543 (0.0392)  0.2976 (0.0417)     0.7894 (0.0233)  0.3296 (0.0295)
LSSVM         0.8210 (0.0113)  0.3763 (0.0194)     0.7126 (0.0205)  0.2372 (0.0188)     0.6976 (0.0358)  0.2310 (0.0272)

groups of 30 tests by OVAPES with LDA with the two different sets of features is significantly different from zero; the p-values of the tests on CAR and CK are 0.0015 and 0.0157, respectively. However, the CAR scores in OVO with all binary classifiers in Table 5 are smaller than those in Table 4. Fig. 8 shows the classification accuracy for observations with different listing statuses. Comparing Fig. 8(a) with Fig. 7(a) suggests that OVAPES with LDA and the features used by financial experts slightly improves the accuracy rates for "A" and "B" observations while losing accuracy for "D" observations, with no change for "X" observations.
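The expert features listed in Section 3.3.2 are mostly simple ratios of financial-statement items. A hedged sketch of computing a subset of them from a dictionary of statement fields is shown below; the field names are illustrative assumptions, not from any specific data vendor or the authors' data set.

```python
def expert_features(fs):
    """Compute a subset of the expert-suggested ratios from a dict of
    financial-statement items (all monetary values in the same unit)."""
    ta = fs["total_assets"]
    return {
        "WCTA":   (fs["current_assets"] - fs["current_liabilities"]) / ta,
        "RETA":   fs["retained_earnings"] / ta,
        "EBITTA": fs["ebit"] / ta,
        "STA":    fs["sales"] / ta,
        "NITA":   fs["net_income"] / ta,
        "TLTA":   fs["total_liabilities"] / ta,
    }

# hypothetical statement items for one firm-year
sample = {"total_assets": 200.0, "current_assets": 80.0,
          "current_liabilities": 50.0, "retained_earnings": 30.0,
          "ebit": 20.0, "sales": 160.0, "net_income": 12.0,
          "total_liabilities": 110.0}
print(expert_features(sample)["WCTA"])  # (80 - 50) / 200 = 0.15
```

The market-based features (EAR, FMCTMC, Std) would additionally require daily return and market-capitalization data and are omitted from this sketch.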


4. Conclusion

This paper has treated the listing status transitions of Chinese listed companies as a multi-class classification problem, in contrast to the existing literature, which treats it as a binary classification problem of predicting whether a normal company will receive a risk warning. To solve the multi-class classification problem, three aggregative models based on the OVA and OVO strategies are proposed, and seven widely used binary classifiers are investigated. The empirical results show that LDA and LR are robust in terms of their small standard deviations over 30 test iterations, and both outperform the other binary classifiers in most test scenarios. Due to the difficulty of distinguishing the characteristics of class "B" companies, which carry other risk warnings, from those of class "D" companies, which carry delisting risk warnings, no model provides satisfactory classification accuracy for companies of these statuses compared with its performance for "A" companies. In the aggregative classification models, the features selected by ttFVIF perform consistently under both the OVA parallel strategy and OVO for each of the seven binary classifiers. OVAPES incorporating LDA achieves significantly higher performance with the features used by financial experts than with the features obtained by ttFVIF. Some types of listing status transition remain practically difficult to predict, such as the transition from "D" to "X" status. Future research may investigate what other data are needed and what procedures can be followed to predict such listing status transitions effectively.

References

[1] V. Agarwal, R. Taffler, Comparing the performance of market-based and accounting-based bankruptcy prediction models, J. Bank. Financ. 32 (8) (2008) 1541–1551.
[2] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, J. Financ. 23 (4) (1968) 589–609.
[3] R. Anand, K. Mehrotra, C.K. Mohan, S. Ranka, Efficient classification for multiclass problems using modular neural networks, IEEE Trans. Neural Netw. 6 (1) (1995) 117–124.
[4] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[5] S.R. Das, P. Hanouna, A. Sarin, Accounting-based versus market-based cross-sectional models of CDS spreads, J. Bank. Financ. 33 (4) (2009) 719–730.
[6] Y. Ding, X. Song, Y. Zen, Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Syst. Appl. 34 (4) (2008) 3081–3089.
[7] H.K. Ekenel, T. Semela, Multimodal genre classification of TV programs and YouTube videos, Multimed. Tools Appl. 63 (2) (2013) 547–567.
[8] H.J. Escalante, M. Montes, L.E. Sucar, Multi-class particle swarm model selection for automatic image annotation, Expert Syst. Appl. 39 (12) (2012) 11011–11021.
[9] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[10] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recognit. 44 (8) (2011) 1761–1776.
[11] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers, Pattern Recognit. 46 (12) (2013) 3412–3424.
[12] R.B. Geng, I. Bose, X. Chen, Prediction of financial distress: an empirical study of listed Chinese companies using data mining, Eur. J. Oper. Res. 241 (1) (2014) 236–247.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (7/8) (2003) 1157–1182.
[14] I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh, Feature Extraction: Foundations and Applications, Springer, 2006.
[15] J.H. Hong, J.K. Min, U.K. Cho, S.B. Cho, Fingerprint classification using one-vs-all support vector machines dynamically ordered with naive Bayes classifiers, Pattern Recognit. 41 (2) (2008) 662–671.
[16] C. Huang, C. Dai, M. Guo, A hybrid approach using two-level DEA for financial failure prediction and integrated SE-DEA and GCA for indicators selection, Appl. Math. Comput. 251 (2015) 431–441.
[17] T.M. Khoshgoftaar, K. Gao, H. Lin, Indirect classification approaches: a comparative study in network intrusion detection, Int. J. Comput. Appl. Technol. 27 (4) (2006) 232–245.
[18] S. Knerr, L. Personnaz, G. Dreyfus, Single-Layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network, Springer, 1990.
[19] S.J. Li, S. Wang, A financial early warning logit model and its efficiency verification approach, Knowl.-Based Syst. 70 (2014) 78–87.
[20] Z.Y. Li, J. Crook, G. Andreeva, Chinese companies distress prediction: an application of data envelopment analysis, J. Oper. Res. Soc. 65 (3) (2014) 466–479.
[21] D.A. Lind, W.G. Marchal, S.A. Wathen, Statistical Techniques in Business & Economics, McGraw-Hill, 2012.
[22] K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
[23] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Adv. Large Margin Classif. 10 (3) (1999) 61–74.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[25] K.P. Ravi, V. Ravi, Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review, Eur. J. Oper. Res. 180 (1) (2007) 1–28.
[26] R. Rifkin, A. Klautau, In defense of one-vs-all classification, J. Mach. Learn. Res. 5 (2004) 101–141.
[27] T. Shumway, Forecasting bankruptcy more accurately: a simple hazard model, J. Bus. 74 (1) (2001) 101–124.
[28] J. Sun, Z.M. Shang, H. Li, Imbalance-oriented SVM methods for financial distress prediction: a comparative study among the new SB-SVM-ensemble method and traditional methods, J. Oper. Res. Soc. 65 (12) (2014) 1905–1919.
[29] J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[30] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[31] Z. Xiao, X.L. Yang, Y. Pang, X. Dang, The prediction for listed companies' financial distress by using multiple prediction methods with rough set and Dempster-Shafer evidence theory, Knowl.-Based Syst. 26 (2012) 196–206.
[32] L. Zhang, E.I. Altman, J. Yen, Corporate financial distress diagnosis model and application in credit rating for listing firms in China, Front. Comput. Sci. China 4 (2) (2010) 220–236.
[33] L. Zhou, K.K. Lai, J. Yen, Empirical models based on features ranking techniques for corporate financial distress prediction, Comput. Math. Appl. 64 (8) (2012) 2484–2496.