Predicting the listing status of Chinese listed companies with multi-class classification models

Information Sciences 328 (2016) 222–236

Ligang Zhou a,∗, Kwo Ping Tam a, Hamido Fujita b

a School of Business, Macau University of Science and Technology, Taipa, Macau
b Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan

Article history: Received 3 June 2015; Revised 20 August 2015; Accepted 21 August 2015; Available online 29 August 2015

Keywords: Prediction; One-vs-all; One-vs-one; Multi-class classification; Listing status

Abstract

In China's stock markets, a listed company's different listing statuses signal different risk levels. Because such risks are difficult to measure adequately, predicting the listing status of listed companies is vital for investors and other stakeholders. Existing studies tend to simplify the measurement by classifying listing status into two categories and applying binary classification models; however, such models cannot support accurate risk management. Considering that Chinese listed companies in practice exhibit four different listing statuses, this study introduces three types of multi-class classification models to predict listing status and thereby achieve better performance in terms of accuracy measures. The three types of models are based on One-versus-One and One-versus-All with parallel and hierarchy strategies. The performances of the three models with two different feature selection strategies are compared, and the effectiveness and accuracy of the models are tested on a large test dataset. The achieved accuracy measures can provide better risk prediction for listed companies.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

The Chinese stock market, which opened in 1990, is an emerging market. To make individual investors aware of the risk of different listed companies, in 1998 the China Securities Regulatory Commission (CSRC) issued special listing rules on disclosing the risk level of listed companies. Under these rules, when abnormal financial or other specified conditions arise in a listed company, and these conditions increase the company's risk of being terminated from the exchange or leave investors unable to judge the company's prospects, thereby hurting investors' interests, the company's stock is given a special treatment (ST) label as a risk warning. Under the listing rules, an ST-labeled company can have the ST label removed if its financial health has improved to an extent that satisfies certain specified requirements. In China, a healthy company (without the special treatment label) may become financially troubled and thus receive the delisting risk (or some other risk) warning, or may even be delisted, while a financially troubled company may regain its status as a healthy company. In general, listed companies in China exhibit four listing statuses: (1) normal status without any risk warning, (2) abnormal status with other risk warning, (3) abnormal status with delisting risk warning, and (4) delisted status. These four statuses are denoted "A", "B", "D" and "X", respectively. Listing status can switch from one level to another, with the exception that delisted status cannot be reversed.



Corresponding author. Tel.: +85388972903. E-mail addresses: [email protected] (L. Zhou), [email protected] (K.P. Tam), [email protected] (H. Fujita).

http://dx.doi.org/10.1016/j.ins.2015.08.036 0020-0255/© 2015 Elsevier Inc. All rights reserved.


Correctly predicting the listing status of a listed company is very important for the company's stakeholders, including investors, creditors, suppliers and customers. Predicting a change in listing status can help investors manage their stock portfolio risk and help creditors, suppliers and customers accurately evaluate a company's credit risk. In addition, a daily price limit of plus or minus 5% applies to the trading price of special treatment companies, while a 10% daily limit is set for healthy listed companies. Consequently, stocks with different listing statuses carry different levels of overall risk, such as volatility, liquidity, and delisting risks, all of which are major concerns for investors in making their investment decisions. The existing literature on predicting the listing status of Chinese listed companies mainly focuses on predicting whether a healthy listed company will maintain a normal status or fall into financial distress. Ding et al. [6] introduced support vector machines to predict the financial distress of a company, defining a company in financial distress as one that had received special treatment in the Chinese stock markets. Zhang et al. [32] developed a Z-score model to forecast whether a firm would receive special treatment. Existing studies [12,16,19,20,28,31] commonly treat listing status prediction for Chinese listed companies as a binary classification problem and develop binary classification models to predict the listing status as "normal" or "financially distressed" in a given period. Given that there are in fact four possible listing statuses for a Chinese listed company, each corresponding to a different risk level, it is of practical significance to apply a multi-class classification approach to predict the listing status of listed companies in China.
The most popular strategy for solving multi-class classification problems is to transform the multi-class problem into multiple binary classification problems [10], which raises two important issues. One is how to decompose a multi-class classification problem into a series of binary classification problems; the other is how to assemble the results obtained by the multiple binary classifiers. The methods for the former and latter are termed "decomposition strategies" and "ensemble strategies", respectively. There are two common decomposition strategies: One-vs-One (OVO) [18] and One-vs-All (OVA) [3]. Suppose there are K classes in a multi-class classification problem. The OVO approach divides the problem into C_K^2 = K(K − 1)/2 binary classification problems, after which one binary classifier is trained to discriminate between the classes in each pair. The outputs from all C_K^2 binary classifiers are aggregated to predict the output class. The OVA approach divides the problem into K binary classification problems such that each binary classifier distinguishes one class from all other classes. Some existing studies show the successful application of OVA to multi-class classification problems [7,8,15]. Rifkin and Klautau [26] claimed that the OVA strategy is as accurate as any other approach when the base classifiers are well tuned, while others [11,17] have demonstrated that the OVO strategy is a useful alternative for multi-class classification problems, with performance superior to that of OVA. Galar et al. [10] conducted a comprehensive investigation of different ensemble methods for binary classifiers in multi-class problems with OVO and OVA strategies. Their empirical study showed that the performance of OVO and OVA is sensitive to the selection of the ensemble strategy, and that the best aggregation for a problem depends on the base classifier and the characteristics of the problem.
For a listed company, a large amount of information can be used to predict its listing status, such as the company's characteristics, financial performance and market information. The existing literature has shown that financial and market information is effective in predicting a company's financial status [1,5,25]. The selection of features affects the performance of the classification models. Moreover, the number of companies with each listing status varies from year to year, so the listing status prediction problem is a typical highly imbalanced multi-class classification problem. This study introduces multi-class classification models to predict the listing status of Chinese listed companies by integrating the selection of features and samples into OVA and OVO with parallel and hierarchy strategies, respectively. It also investigates the effect of different binary classifiers in the OVA and OVO strategies.

The remainder of this paper is organized as follows. The frameworks based on OVO and OVA with two different ensemble strategies are explained in detail in Section 2. Section 3 presents the empirical results of the multi-class classification models. Section 4 draws conclusions and summarizes the major findings.

2. One-vs-All and One-vs-One aggregative models

The OVA aggregative models (OVAAM) integrate the processes of feature selection, sampling and the OVA strategies. Two ensemble strategies, parallel and hierarchy, are employed with OVA in this study. The training and testing processes in OVAAM with the parallel and hierarchical ensemble strategies are shown in Figs. 1 and 2, respectively. The training and testing processes in the One-vs-One (OVO) aggregative models (OVOAM) are shown in Fig. 3.

2.1. Feature selection method

Feature selection can speed up learning, facilitate data understanding, and improve prediction performance.
A variety of feature selection methods have been proposed and examined, including filter methods, wrapper methods and embedded methods [14]. This study employs a hybrid feature selection method that combines a filter method based on a two-sample t-test with variance inflation factor (VIF) analysis. The hybrid method takes advantage of the speed of the filter method, while the VIF analysis ensures a low level of dependence among the selected features.


Fig. 1. The training and testing processes in OVAAM with a parallel ensemble strategy.

Suppose that the features of an observation are denoted by X = (X_1, X_2, …, X_m) and Y is the class, Y ∈ {1, 0}. There are N observations in a training set. Let v_j^1 = var(X_j | Y = 1) and v_j^0 = var(X_j | Y = 0), where var(·) is the variance of a group of values. Let m_j^1 = mean(X_j | Y = 1) and m_j^0 = mean(X_j | Y = 0), where mean(·) is the mean of a group of values. The feature weighting strategy based on the t-test on the training data set is defined as follows [13,33]:

$$ z_j = \frac{\left| m_j^1 - m_j^0 \right|}{\sqrt{v_j^1 / N_1 + v_j^0 / N_0}}, \tag{1} $$

where N_1 and N_0 denote the number of observations with Y = 1 and Y = 0, respectively. Each feature can be ranked by its z value, as defined in Eq. (1), in descending order. The top-ranked features have good discriminative capability on the two classes but may exhibit high multicollinearity. Therefore, VIF analysis is employed to prevent high multicollinearity among the selected features. The VIF of feature j is calculated as follows:

$$ V_j = \frac{1}{1 - R_j^2}, \tag{2} $$

where R_j^2 is the coefficient of determination of the regression equation X_j = β_0 + βX′, where X′ contains all features except X_j. The algorithm based on the filter method with the t-test criterion and VIF (ttFVIF) for feature subset selection is as follows:

Algorithm ttFVIF:
Input: the training sample set S_r with M features X_1, X_2, …, X_M and class Y; the number of features M∗ to be selected.
Output: feature subset F∗.
1. F∗ = ∅.
2. Rank the M features by the t-test criterion in descending order and add the first feature to F∗. Then calculate the VIF of the second feature with respect to the features in F∗. If the VIF value of the second feature is less than 10, add the second feature to F∗; otherwise, move on and compute the VIF of the next feature, and so on, until all M features have been checked or the number of features in F∗ reaches M∗.


Fig. 2. The training and testing processes in OVAAM with a hierarchy ensemble strategy.

The VIF threshold of 10 in ttFVIF is an empirical value suggested by [21]. A larger VIF threshold permits the selection of features that are highly correlated with already-selected features, while a drastically smaller VIF threshold reduces the number of feature candidates and may prevent any additional feature from being selected.
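The two steps above, t-test ranking (Eq. (1)) followed by the VIF screen (Eq. (2)), can be sketched as follows. This is an illustrative NumPy implementation under our own naming; the paper's experiments were run in Matlab and Weka, and degenerate cases (constant features, ties in z) are handled only minimally.

```python
import numpy as np

def vif(X, j, chosen):
    """Eq. (2): V_j = 1 / (1 - R_j^2), with R_j^2 the coefficient of
    determination of regressing X_j on the already-selected features."""
    A = np.column_stack([np.ones(len(X)), X[:, chosen]])
    b = X[:, j]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ coef
    r2 = 1.0 - (resid ** 2).sum() / ((b - b.mean()) ** 2).sum()
    return 1.0 / max(1.0 - r2, 1e-12)

def ttfvif(X, y, m_star, vif_threshold=10.0):
    """Sketch of Algorithm ttFVIF: rank features by the two-sample
    t-statistic of Eq. (1) in descending order, then admit them greedily
    while their VIF against the selected subset stays below the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x1, x0 = X[y == 1], X[y == 0]
    n1, n0 = len(x1), len(x0)
    # Eq. (1): z_j = |m_j^1 - m_j^0| / sqrt(v_j^1/N1 + v_j^0/N0)
    z = np.abs(x1.mean(axis=0) - x0.mean(axis=0)) / np.sqrt(
        x1.var(axis=0, ddof=1) / n1 + x0.var(axis=0, ddof=1) / n0)
    order = np.argsort(-z)                  # descending by z
    selected = [int(order[0])]
    for j in order[1:]:
        if len(selected) == m_star:
            break
        if vif(X, j, selected) < vif_threshold:
            selected.append(int(j))
    return selected
```

On synthetic data, a near-duplicate of an already-selected feature is rejected by the VIF screen even though its z value is high, which is exactly the behavior the threshold of 10 is meant to enforce.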

2.2. Sampling method

In the training data set for listing status prediction, the classes contain different numbers of observations. When the listing status prediction problem is decomposed into a series of binary classification problems, the training set for each binary problem consists of two classes: one class is inherited from the initial training set, and the remaining classes of the initial training set are merged into the other class. The numbers of observations in these two classes are usually imbalanced. To make the final training set balanced, the random undersampling method is employed. Suppose that the number of observations selected for each of the two classes is N′, and that S_1 and S_0 denote the observation sets, while N_1 and N_0 denote the number of observations with Y = 1 and Y = 0 in the original training set, respectively. The sampling algorithm based on random undersampling (RU) is as follows:


Fig. 3. The training and testing processes in OVOAM.

Algorithm RU:
Input: the original training set {S_1, S_0}.
Output: balanced training set S′ of size 2N′.
1. S′ = ∅.
2. N′ = min(N_0, N_1).
3. if N_1 > N_0 then
       S′ = S_0; randomly select N′ observations from S_1 into S′
   else
       S′ = S_1; randomly select N′ observations from S_0 into S′
   end if
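Algorithm RU is small enough to render directly in code; the following is an illustrative sketch (the function and parameter names are ours, not the paper's):

```python
import random

def random_undersample(s1, s0, seed=None):
    """Sketch of Algorithm RU: keep the minority class whole and draw,
    without replacement, an equally sized random subset of the majority
    class, returning a balanced set of 2*N' observations."""
    rng = random.Random(seed)
    n_prime = min(len(s1), len(s0))
    if len(s1) > len(s0):
        return list(s0) + rng.sample(list(s1), n_prime)
    return list(s1) + rng.sample(list(s0), n_prime)
```

Sampling without replacement guarantees that no majority-class observation is duplicated in the balanced set.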

2.3. Base binary classifiers

This study aims to find an effective classification model for predicting the listing status of Chinese listed companies. The base binary classifier plays an important role in the OVA and OVO aggregative models. Seven well-known and widely used binary classification methods have been selected to construct the base binary classifiers. The selected methods are briefly introduced as follows:

(1) Linear discriminant analysis (LDA) assumes that the classes have features drawn from different Gaussian distributions: the features of each class are generated by a multivariate normal distribution (MND), and the MNDs of the classes share the same covariance matrix but have different means. LDA can take into account the cost of misclassifying one class as another, and its prediction is designed to minimize the expected classification cost [22].


(2) Logistic regression (LR) is a classical binary classification method. It estimates the probability that an observation with features x belongs to class 1 by using the following formula:

$$ P(Y = 1 \mid x) = \frac{e^{w_0 + w^{T} x}}{1 + e^{w_0 + w^{T} x}}. \tag{3} $$

(3) Neural networks (NN), inspired by biological nervous systems, use interconnections between neurons to determine the network function. NNs are trained to fit the function mapping the input to the output; theoretically, they can fit any function, and they have been successfully used for a wide variety of classification and forecasting problems. A comprehensive introduction to neural networks can be found in [4].

(4) Decision tree C4.5 (DTC4.5) uses the gain ratio as its splitting criterion, and splitting stops when the number of observations to be split falls below a certain threshold. It uses error-based pruning to remove the least reliable branches [24].

(5) k-nearest neighbor (KNN) [22] is a simple and standard nonparametric approach that classifies an observation by considering only its k nearest neighbors in the training set. The method used to measure the distance between two observations and the value of k are key factors that affect its performance.

(6) Adaboost, a typical meta-learning algorithm, calls a weak or base learning algorithm repeatedly in a series of rounds. In each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the misclassified observations in the training set. The base learning algorithm is usually a simple decision tree [9].

(7) The least squares support vector machine (LSSVM) aims to minimize the upper bound of the generalization error. It maps the input vectors into a high-dimensional feature space through a nonlinear mapping function and constructs an optimal separating hyperplane that separates the two classes of observations with maximal margin. The details of LSSVM can be found in [29].

In the OVA and OVO aggregative models, every base binary classifier is required to compute the posterior probability that an observation belongs to class 1 (Y = 1).
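To make the role of the posterior probability concrete, here is a minimal logistic regression fitted by batch gradient descent that returns the P(Y = 1 | x) of Eq. (3). This is a didactic NumPy stand-in for the paper's Matlab/Weka implementations; the function name and hyperparameters are ours.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Fit (w0, w) by gradient descent on the logistic log-likelihood and
    return a function computing the posterior P(Y=1|x) of Eq. (3)."""
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))     # current P(Y=1|x_i)
        w -= lr * Xb.T @ (p - y) / len(y)     # gradient of the negative log-likelihood
    return lambda x: float(
        1.0 / (1.0 + np.exp(-(w[0] + np.asarray(x, dtype=float) @ w[1:]))))
```

For a linearly separable toy sample, the returned posterior rises monotonically with the feature value, and thresholding it at 0.5 recovers the class boundary.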
If a threshold value is specified, it is easy to classify an observation into a class according to the posterior probability. Most of the above binary classifiers can compute the posterior probability in a straightforward way; however, binary classifiers such as neural networks and support vector machines, which cannot generate a posterior probability directly, require a calibration method. In this study, Platt calibration [23] is applied to transform the initial output of such binary classifiers by passing it through a sigmoid function. Let the unthresholded output of a binary classifier be f(x); then, the probabilistic output can be obtained as follows:

$$ P(Y = 1 \mid f(x)) = \frac{1}{1 + e^{A f(x) + B}}, \tag{4} $$

where the parameters A and B are estimated by maximum likelihood from a fitting training set {(f_i, Y_i) | i = 1, 2, …, N}, i.e., by solving the following optimization problem:





$$ \arg\min_{A,B} \; -\sum_{i} \left[ Y'_i \log(p_i) + (1 - Y'_i) \log(1 - p_i) \right], \tag{5} $$

where

$$ p_i = \frac{1}{1 + e^{A f_i + B}}. \tag{6} $$

Let N_1 and N_0 denote the number of observations with Y = 1 and Y = 0, respectively; the target value Y′_i is defined as follows:



$$ Y'_i = \begin{cases} \dfrac{N_1 + 1}{N_1 + 2}, & \text{if } Y_i = 1, \\[1ex] \dfrac{1}{N_0 + 2}, & \text{if } Y_i = 0. \end{cases} \tag{7} $$
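Eqs. (4)–(7) can be sketched as follows. Note that this fits A and B by plain batch gradient descent on the objective of Eq. (5) rather than the more robust second-order procedure of Platt's original paper, so it should be read as a didactic approximation with our own naming.

```python
import math

def platt_calibrate(scores, labels, lr=0.01, iters=5000):
    """Fit the sigmoid parameters A, B of Eq. (4) by gradient descent on
    the cross-entropy objective of Eq. (5), using the smoothed targets of
    Eq. (7); returns a map from a raw score f(x) to P(Y = 1 | f(x))."""
    n1 = sum(1 for y in labels if y == 1)
    n0 = len(labels) - n1
    # Eq. (7): smoothed targets Y'_i
    targets = [(n1 + 1.0) / (n1 + 2.0) if y == 1 else 1.0 / (n0 + 2.0)
               for y in labels]
    a = b = 0.0
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(a * f + b))  # Eqs. (4)/(6)
            grad_a += (t - p) * f                  # d(loss)/dA
            grad_b += t - p                        # d(loss)/dB
        a -= lr * grad_a
        b -= lr * grad_b
    return lambda f: 1.0 / (1.0 + math.exp(a * f + b))
```

On scores that correlate positively with the class label, the fitted sigmoid is monotonically increasing in the score, and the smoothed targets keep the calibrated probabilities strictly inside (0, 1).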

2.4. One-vs-All aggregative models

The One-vs-All aggregative model with the parallel ensemble strategy (OVAPES), shown in Fig. 1, trains a base binary classifier for each class by distinguishing that class from all other classes. The prediction for a new observation is the class that obtains the highest posterior probability (confidence score) among all base binary classifiers. The algorithm for OVAPES is as follows:

Algorithm OVAPES:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K base binary classifiers C_k for k ∈ {1, 2, …, K}.
for k = 1 to K
    Construct a new label for observation i: Y′_i = 1 if Y_i = k, otherwise Y′_i = 0, i = 1, 2, …, N;
    Call Algorithm ttFVIF to identify the M∗-feature set F∗;


    Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k.
end for

For an unseen observation x, the posterior probability of classifying it into class k using classifier C_k is denoted by P(Y = k | x). The parallel ensemble strategy classifies the observation into the class with the maximum value of P(Y = k | x):

$$ \hat{Y} = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k \mid x). \tag{8} $$

The algorithm of the One-vs-All aggregative model with a hierarchical ensemble strategy (OVAHES) is as follows:

Algorithm OVAHES:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K − 1 base binary classifiers C_k for k ∈ {1, 2, …, K − 1}.
for k = 1 to K − 1
    Construct a new label for observation i: Y′_i = 1 if Y_i = k; Y′_i = 0 if Y_i ∈ {k + 1, k + 2, …, K}, i = 1, 2, …, N;
    Call Algorithm ttFVIF to identify the M∗-feature set F∗;
    Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k.
end for

The hierarchical ensemble strategy classifies an unseen observation x into the first class k whose classifier reports P(Y = k | x) ≥ 0.5:

$$ \hat{Y} = \min \left\{ k \in \{1, 2, \ldots, K\} \mid P(Y = k \mid x) \geq 0.5 \right\}, \tag{9} $$

where P(Y = K | x) = 1 − P(Y = K − 1 | x).

2.5. One-vs-One aggregative models

Algorithm OVO:
Input: the original training set S_r; the number of features M∗ to be selected.
Output: a list of K(K − 1)/2 base binary classifiers C_k for k ∈ {1, 2, …, K(K − 1)/2}.
k = 1
for m = 1 to K − 1
    for j = m + 1 to K
        Select all observations of classes m and j from the training set S_r;
        Call Algorithm ttFVIF to identify the M∗-feature set F∗;
        Call Algorithm RU to select 2N′ observations, taking the features in F∗, to train the base binary classifier C_k;
        k = k + 1
    end for
end for

OVO classifies an unseen observation x into the class that receives the maximum number of votes from the K(K − 1)/2 classifiers.

3. Empirical analysis

3.1. The data

The data source is the China Stock Market and Accounting Research Database (CSMARD) provided by the GTA database. There are 18,551 company-year observations dating from 1999 to 2011. Each observation in an observed year t contains the following features: (1) 164 financial ratios measuring various aspects of a company's financial status for fiscal year t − 1, such as short-term solvency, long-term solvency, asset management or turnover, profitability, capital structure, stockholders' earning profitability, cash management and development capability; (2) three market variables introduced by Shumway [27], namely the excess return of the company's stock, its relative market capitalization, and the standard deviation of the firm's stock return in fiscal year t − 1; and (3) the listing status of the company in year t + 1. According to the listing rules in China, all listed companies must declare their financial statements for a fiscal year within four months after the end of the fiscal year. Any necessary actions on listed companies are usually enforced by the CSRC by the end of June, in keeping with the listing rules.
The listing status of a company at the end of June in year t is denoted by L_t, while the company's listing status at the end of year t + 1 is denoted by Y_{t+1}. The prediction is made by the end of June each year. Fig. 4(a)–(c) gives the number of companies with different listing statuses by year, with L_t taking the values "A", "B" and "D", respectively. Because new companies are listed each year, and the number of newly listed companies is usually greater than the number of delisted companies, the number of normal companies marked "A" increases over time. Because a delisted


Fig. 4. The number of companies with listing status transformations.

company cannot recover any other listing status, it is meaningless to predict the status of a company that has just been delisted. Therefore, Fig. 4(d) only gives the number of companies with status "X" in year t + 1 when the prediction is conducted in year t. Fig. 4(a) shows that most normal, healthy companies with listing status "A" retain their "A" status in the next fiscal year; very few transform to other listing statuses. For example, in fiscal year 2010, there were 1995 companies with listing status "A", among which 1965 retained "A" status, one moved to "B", 24 changed to "D", and five were delisted in the following year. Therefore, predicting the listing status of listed companies is a highly imbalanced classification problem. Fig. 4(b) shows that most companies with listing status "B" retain "B" status in the following fiscal year, and some may regain listing status "A" because of improved financial performance. In fiscal year 2010, there were 87 companies with listing status "B", among which 31 regained "A" status, 36 retained "B" status, and 10 changed to "D". Fig. 4(c) shows that companies with listing status "D" may move to "A" or "B" or retain their "D" status; only a small proportion slip to "X".

3.2. Experimental settings

3.2.1. Sampling and classifiers

The dataset, comprising 18,551 observations, is split into a training set and a testing set. The training set consists of observations with t ≤ 2006, while the testing set contains observations with 2007 ≤ t ≤ 2011. Because the "A", "B", "D" and "X" classes are highly imbalanced in the training data, the RU algorithm is employed to construct a balanced training sample for the binary classifiers in the three frameworks. The number of selected observations N′ for each class in RU is set to different values in OVAPES, OVAHES and OVO.
The numbers of observations from the different classes in the training and testing sets are listed in Table 1. N_training is the number of observations available in the training set for sampling, and N_testing is the number of observations in the testing set. There are four binary classifiers in OVAPES, each constructed on one class versus all other classes. For example, in the sampled training set for OVAPES to classify "A" companies, there are 1223 companies with listing status "A"

Table 1
The number of observations for the different classes in the training set and testing set.

Y_{t+1}   N_training   N′ in sampled training set                                      N_testing
                       OVAPES       OVAHES       OVO: vs B    vs D       vs X
A         10,028       1223/1223    1223/1223    649/649      488/488    86/86        6629
B         649          649/649      574/574      –            488/488    86/86        328
D         488          488/488      86/86        –            –          86/86        319
X         86           86/86        –            –            –          –            24

Table 2
A typical resulting confusion matrix.

                   Predicted class
Actual class       A      B      D      X      Total
A                  n11    n12    n13    n14    n^1
B                  n21    n22    n23    n24    n^2
D                  n31    n32    n33    n34    n^3
X                  n41    n42    n43    n44    n^4
Total              n̂1    n̂2    n̂3    n̂4    n

plus 1223 companies with listing status "B", "D" or "X". There are three binary classifiers in OVAHES; each is constructed on one class versus all other classes that have not yet been considered separately. Six binary classifiers are used in OVO, each constructed on one class versus one of the other classes. Most classifiers are implemented in Matlab, while DTC4.5 is implemented by weka.classifiers.trees.J48 in Weka [30]. The parameter k in k-nearest neighbors takes the value 10, and most parameters of the binary classifiers use the default settings in Matlab or Weka. To show the sensitivity of the frameworks' performance to the number of selected features M∗ in ttFVIF, the frameworks have been tested with M∗ taking values in the set {4, 8, 12, 16, 20}.

3.2.2. Performance measures

The classification accuracy rate and Cohen's kappa have been applied as performance measures in this study because they are commonly used for evaluating performance in binary and multi-class problems [10]. A typical confusion matrix for the listing status prediction problem is reported in Table 2, where n_ij denotes the number of observations with actual listing status y_i that are predicted as listing status y_j (i = 1, …, 4, j = 1, …, 4). The classification accuracy rate (CAR) is the ratio of correctly classified observations to the total number of classified observations. It is computed as follows:

$$ \mathrm{CAR} = \frac{\sum_{i=1}^{4} n_{ii}}{n}. \tag{10} $$

Cohen's kappa (CK) evaluates the proportion of hits that can be attributed to the classifier itself, rather than to mere chance, relative to all of the classifications that cannot be attributed to chance alone. Cohen's kappa is computed as follows:

$$ \mathrm{CK} = \frac{n \sum_{i=1}^{4} n_{ii} - \sum_{i=1}^{4} n^{i} \hat{n}_{i}}{n^{2} - \sum_{i=1}^{4} n^{i} \hat{n}_{i}}. \tag{11} $$
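Both measures follow directly from the confusion matrix of Table 2; the following is a small self-contained sketch (the function name is ours):

```python
def car_and_kappa(cm):
    """Eqs. (10)-(11): classification accuracy rate and Cohen's kappa from
    a confusion matrix cm, where cm[i][j] counts observations of actual
    class i predicted as class j (the layout of Table 2)."""
    n = sum(sum(row) for row in cm)
    hits = sum(cm[i][i] for i in range(len(cm)))   # sum of n_ii
    actual = [sum(row) for row in cm]              # row totals n^i
    predicted = [sum(col) for col in zip(*cm)]     # column totals n-hat_i
    chance = sum(a * p for a, p in zip(actual, predicted))
    return hits / n, (n * hits - chance) / (n * n - chance)
```

A perfect diagonal matrix yields CAR = CK = 1, while a classifier whose predictions are independent of the true class yields CK = 0 even when CAR is well above zero, which is exactly why CK matters on imbalanced test sets.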

Cohen's kappa ranges from −1 (total disagreement) to +1 (total agreement), and CK = 0 indicates a random classification. Cohen's kappa is a simple but useful measure for evaluating the performance of multi-class classifiers: it scores the hits independently for each class and aggregates them, while the classification rate scores all of the hits over all classes [10].

3.3. Experimental results

3.3.1. Results of models with features selected by ttFVIF

Because the performance of the models is affected by the sampled training set and by the number of selected features M∗, 30 test iterations with the binary classifiers trained on 30 different groups of samples are conducted for each value of M∗ in the set {4, 8, 12, 16, 20}. Fig. 5 shows the average classification accuracy rate over the 30 test iterations for the three models with different binary classifiers and different values of M∗. It can be observed that OVAPES always obtains a significantly higher classification accuracy rate than OVAHES in all scenarios, where each scenario comprises the setting of a binary classifier and a fixed number of selected features. In addition, with respect to the classification accuracy rate, each binary classifier performs stably in OVAPES but not in OVAHES under different settings of the

[Fig. 5. Average classification accuracy rate of different frameworks with different settings. Panels: (a) OVAPES, (b) OVAHES, (c) OVO, (d) settings of maximum average CAR. Each panel plots the classification accuracy rate against the number of selected features (4–20) for the binary classifiers LR, NN, DTC4.5, KNN, Adaboost, LSSVM and LDA.]

number of selected features; i.e., the CAR of each binary classifier in OVAPES is less sensitive to this setting than in OVAHES. OVO achieves a higher CAR than OVAHES in almost all scenarios, except for LSSVM with M∗ taking the value 12, 16 or 20. Comparing the CAR of OVAPES and OVO, some binary classifiers, such as LDA and Adaboost, obtain a higher CAR in OVAPES under all settings of M∗, while others, such as DTC4.5, achieve a higher CAR in OVO under all settings of M∗. The binary classifiers LR, NN, KNN and LSSVM do not consistently obtain a higher CAR in either OVAPES or OVO across the different settings of M∗. Fig. 5(d) shows, for each binary classifier, the setting of M∗ and the aggregative model under which that classifier achieves its maximum average CAR over the 30 test iterations; it is generated from Fig. 5(a)–(c) by presenting only the maximum average CAR for each binary classifier. Fig. 5(d) shows that LDA achieves the global maximum average CAR of 0.9178 in OVAPES when the number of selected features is set to eight. LR follows LDA with the second-highest maximum average CAR of 0.8886 in OVO, also with the number of selected features set to eight. For each setting under which a binary classifier achieves its maximum average CAR, the standard deviation of the CAR over the 30 test iterations is small, which suggests that the binary classifier performs consistently under that setting. Table 3 reports the maximum average CAR and the standard deviation of the CAR over the 30 test iterations for each binary classifier when achieving its maximum average CAR.


L. Zhou et al. / Information Sciences 328 (2016) 222–236

Table 3
The maximum average CAR and the standard deviation of CAR from each binary classifier achieving the maximum average CAR.

Classifier    Max. CAR    Std.
LDA           0.9178      0.0026
LR            0.8886      0.0125
NN            0.7928      0.0262
DTC4.5        0.8380      0.0354
KNN           0.8712      0.0055
Adaboost      0.8673      0.0048
LSSVM         0.8422      0.0195

[Fig. 6 comprises four 3-D bar plots, (a) OVAPES, (b) OVAHES, (c) OVO, and (d) the settings of maximum average CK, each plotting Cohen's kappa against the number of selected features (4 to 20) for the binary classifiers LDA, LR, NN, DTC4.5, KNN, Adaboost and LSSVM.]

Fig. 6. Average Cohen’s kappa of different frameworks with different settings.

As shown in Table 1, the testing set is highly imbalanced. Although classification accuracy demonstrates the performance of the different frameworks to some degree, CK remains an important performance measure for multi-class classification problems. Fig. 6 shows the average CK of 30 test iterations under the three aggregative models with different binary classifiers and different numbers of selected features M∗. The CK results across the aggregative models under different settings suggest findings similar to those for CAR: OVAPES and OVO outperform OVAHES under almost all settings, and the CK measure is highly correlated with CAR, in that the higher the CAR, the higher the CK. As shown in Fig. 6(d), NN achieves the maximum average CK when M∗ is set to eight, whereas it achieves the maximum average CAR when M∗ is set to four, as shown in Fig. 5(d); the other binary classifiers obtain their maximum CAR and CK under the same settings. LDA has the maximum average CK score of 0.5630 in Fig. 6(d), followed by LR (0.4910) and LSSVM (0.4581). The performance of the three aggregative models incorporating the different binary classifiers is evaluated in terms of the average CAR and CK scores over 30 test iterations for the seven binary classifiers and three frameworks with M∗ set to eight. Because LDA achieves the global maximum average CAR at this setting and almost all binary classifiers in OVAPES and OVO are affected by different values of M∗, the performance of all binary classifiers under M∗ = 8 is selected as representative.
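Cohen's kappa corrects raw accuracy for the agreement expected by chance, which is why it matters on an imbalanced test set such as this one. A self-contained sketch in pure Python (not the authors' code) illustrates how a majority-class predictor scores high accuracy but zero kappa:

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (accuracy) and p_e is chance agreement from the marginals."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n**2
    return (p_o - p_e) / (1 - p_e)

# a classifier labelling everything "A" scores 0.90 accuracy on a
# 90%-"A" sample, but kappa 0.0: no skill beyond the majority class
y_true = ["A"] * 90 + ["B"] * 5 + ["D"] * 3 + ["X"] * 2
y_pred = ["A"] * 100
print(cohens_kappa(y_true, y_pred))  # 0.0
```

Perfect agreement yields kappa 1.0 regardless of the class imbalance, which is what makes CK a useful complement to CAR here.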


Table 4
Average CAR and CK scores (standard deviations in parentheses) in tests with seven binary classifiers and three different aggregative models.

Binary        OVAPES                               OVAHES                               OVO
classifier    CAR              CK                  CAR              CK                  CAR              CK
LDA           0.9178 (0.0026)  0.5630 (0.0092)     0.6620 (0.0270)  0.1853 (0.0171)     0.9135 (0.0033)  0.5444 (0.0106)
LR            0.8781* (0.0062) 0.4641* (0.0156)    0.6428 (0.0265)  0.1809 (0.0160)     0.8886 (0.0125)  0.4910 (0.0294)
NN            0.7195 (0.0905)  0.2599 (0.0766)     0.5336 (0.0868)  0.1362 (0.0469)     0.7817 (0.0422)  0.3227 (0.0518)
DTC4.5        0.7555 (0.0713)  0.2902 (0.0757)     0.6961 (0.1079)  0.2523 (0.0880)     0.8031 (0.0389)  0.3522 (0.0539)
KNN           0.8611 (0.0040)  0.4357 (0.0113)     0.8046 (0.0057)  0.3411 (0.0095)     0.8352 (0.0107)  0.3769 (0.0171)
Adaboost      0.8623 (0.0066)  0.4515 (0.0149)     0.6139 (0.0437)  0.1839 (0.0261)     0.7939 (0.0259)  0.3278 (0.0341)
LSSVM         0.7994 (0.0094)  0.3443 (0.0141)     0.6531 (0.0168)  0.1926 (0.0104)     0.7645 (0.0259)  0.2896 (0.0287)

Table 4 shows the performance of the different binary classifiers in the three aggregative models. A two-way ANOVA with replication shows that OVAPES and OVO outperform OVAHES in CAR and CK for all binary classifiers. Because LDA obtains the global maximum CAR in OVAPES, a Nemenyi test is used to compare all binary classifiers with each other under OVAPES and to check whether their differences are statistically significant. The performance of two classifiers is statistically significantly different if their average ranks differ by at least the critical difference (CD):

CD = q_\alpha \sqrt{\frac{C(C+1)}{6I}},    (12)

where qα is the critical value for the two-tailed Nemenyi test at significance level α, C is the number of classifiers, and I is the number of data sets. This study compares the performance of all binary classifiers in the OVAPES framework in Table 4, with C = 7, I = 30, qα = 2.949, and hence CD = 1.645. The binary classifiers whose CAR or CK performance shows no significant difference from that of LDA are marked with an asterisk "∗". The evidence suggests that LDA exhibits the best performance on the CAR and CK measures among all seven binary classifiers in the OVAPES framework, and that LR is not significantly different from LDA in terms of CAR and CK. A Wilcoxon signed-rank test shows that LDA does not differ significantly in CAR and CK between OVAPES and OVO, and nor does LR.

In Table 4, LDA achieves an overall classification accuracy of 91.78% on a test set of 7300 observations with different listing statuses, which suggests relatively high classification accuracy. However, the test set contains 6629 observations with listing status "A", so a classifier that simply labels every observation "A" would still show a high classification accuracy of 90.81%. On such a highly imbalanced test set, classification accuracy alone cannot reliably show the discriminative capability of the classifiers. In terms of Cohen's kappa, LDA generates an average score of 0.5272, which is noticeably larger than zero. In the test set, 6442 of the 6579 observations with a current listing status of "A" retained that status, so classifying all current "A" companies as "A" in the next year can still achieve a high classification accuracy rate. In practice, it is more important for risk managers to correctly predict a "B", "D" or "X" company in the next year than an "A" company, as companies with "B", "D" or "X" listing status are associated with higher risk.

Fig. 7 shows the classification accuracy of the three aggregative models, with the same settings as in Table 4, on test samples with different listing statuses. Fig. 7(a) shows that LDA in OVAPES, which achieves the global maximum average CAR, correctly classifies 96.08% of "A" observations, 57.98% of "B" observations, 44.22% of "D" observations and no "X" observations. Fig. 7(c) shows that LDA in OVO correctly classifies 95.91% of "A" observations, 80.85% of "B" observations, 13.57% of "D" observations and 8.19% of "X" observations. It can be observed from Fig. 7 that no framework and setting achieves the maximum classification accuracy for all four listing statuses. LR in OVO achieves the maximum average classification accuracy for "B" observations, at 82.24%. NN in OVAHES achieves the maximum average classification accuracy for "D" observations, at 71.14%, but correctly classifies only 53.25% of "A" observations. No framework or setting attains a classification accuracy above 25% for "X" firms, showing that it is very difficult to predict whether a listed company will be delisted. One possible reason is that there are few observations of delisted companies, so it is difficult for the classifiers to mine the patterns or characteristics of delisted companies. Another possible reason is that, on China's stock markets, listed companies may be given a delisting risk warning not because of their financial performance but because of operational risk or negative audits from accounting agencies; consequently, it is difficult to predict "D" listing status from the companies' financial ratios. In addition, many listed companies in financial distress use corporate restructuring to avoid delisting, which makes predicting "X" difficult.
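The critical difference in Eq. (12) is straightforward to reproduce; with the study's values C = 7, I = 30 and qα = 2.949, it comes out at the reported 1.645 (rounded to three decimals):

```python
import math

def nemenyi_cd(q_alpha, n_classifiers, n_datasets):
    """Critical difference for the two-tailed Nemenyi test, Eq. (12)."""
    return q_alpha * math.sqrt(n_classifiers * (n_classifiers + 1)
                               / (6 * n_datasets))

cd = nemenyi_cd(q_alpha=2.949, n_classifiers=7, n_datasets=30)
print(round(cd, 3))  # 1.645
```

Two classifiers are then declared significantly different whenever their average ranks over the 30 test iterations differ by more than this value.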


Fig. 7. Classification accuracy on four different classes.
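Fig. 7 reports per-class accuracy, i.e., the fraction of each true listing status that is predicted correctly (per-class recall). A minimal sketch of computing it from paired label sequences (toy data, not the paper's test set):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class's observations correctly predicted."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return {c: correct[c] / total[c] for c in total}

# toy labels over the four listing statuses
y_true = ["A", "A", "A", "B", "B", "D", "X"]
y_pred = ["A", "A", "B", "B", "D", "D", "A"]
acc = per_class_accuracy(y_true, y_pred)
print(acc["A"])  # 2 of 3 "A" observations correct
```

This is the measure that exposes the near-zero accuracy for "X" firms, which overall CAR conceals.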

It is typically difficult to increase the classification accuracy of one class in a multi-class classification model without sacrificing the classification accuracy of another class. The overall performance is usually a balanced outcome over the different classes in terms of the cost of misclassifying each class. In practice, misclassification costs are difficult to estimate; in this study, equal cost is assumed for the misclassification of each class in order to demonstrate the discriminative capability of each model.

3.3.2. Results of models with features used by financial experts

In the previous experiment, the features were automatically selected by the ttFVIF algorithm without considering their financial meaning. It is natural to try the features suggested by financial experts, who have a professional understanding of the meaning of financial features, to check whether models with such features can improve classification performance. Listing status serves as a risk-warning signal for listed companies on the Chinese stock market and, under the listing rules, is highly related to a company's financial performance. A listing status of "B", "D" or "X" may indicate that a company is facing financial distress or bankruptcy risk. Therefore, the features suggested by financial experts for financial distress or bankruptcy prediction can be used to predict the listing status of Chinese listed companies. Although most existing studies on financial distress prediction focus on American and European companies, the structure of financial statements is identical for listed companies in China. It is technically difficult to check every feature suggested by every financial expert in financial distress prediction, so only widely accepted features with highly discriminative capabilities are used. The features selected from Altman [2], Zhang et al. [32] and Shumway [27] include: working capital to total assets (WCTA), retained earnings to total assets (RETA), earnings before interest and taxes to total assets (EBITTA), sales to total assets (STA), net income to total assets (NITA), total liabilities to total assets (TLTA), excess annual return (EAR), the firm's market capitalization to total market capitalization (FMCTMC), the standard deviation of stock daily returns (Std), earnings per share (EPS), book value per share (BVPS) and operating income per share (OIPS). Table 5 reports the average CAR and CK performance of the three multi-class classification models with different binary classifiers utilizing the features used by financial experts. OVAPES with the LDA classifier achieves the maximum average CAR score of 0.9202 and average CK score of 0.5688. Although the average CAR and CK scores shown in Table 5 for OVAPES with LDA are marginally higher than those in Table 4, a Wilcoxon signed-rank test shows that the difference in CAR and CK between the two


Fig. 8. Classification accuracy on four different classes.

Table 5
Average CAR and CK scores (standard deviations in parentheses) in tests with seven binary classifiers and three different frameworks, with features suggested by financial experts.

Binary        OVAPES                               OVAHES                               OVO
classifier    CAR              CK                  CAR              CK                  CAR              CK
LDA           0.9202 (0.0021)  0.5688 (0.0073)     0.6283 (0.1098)  0.1664 (0.0684)     0.8705 (0.0143)  0.4396 (0.0279)
LR            0.8592 (0.0586)  0.4400 (0.0826)     0.6648 (0.0643)  0.1886 (0.0541)     0.8412 (0.0260)  0.4029 (0.0360)
NN            0.7751 (0.0751)  0.3149 (0.0862)     0.6855 (0.0725)  0.2221 (0.0565)     0.7451 (0.0651)  0.2857 (0.0623)
DTC4.5        0.7505 (0.1170)  0.2797 (0.0721)     0.6005 (0.0770)  0.1660 (0.0470)     0.7610 (0.0446)  0.3033 (0.0511)
KNN           0.8875 (0.0039)  0.4858 (0.0126)     0.7428 (0.0068)  0.2453 (0.0074)     0.8252 (0.0451)  0.3582 (0.0550)
Adaboost      0.8668 (0.0048)  0.4585 (0.0116)     0.7543 (0.0392)  0.2976 (0.0417)     0.7894 (0.0233)  0.3296 (0.0295)
LSSVM         0.8210 (0.0113)  0.3763 (0.0194)     0.7126 (0.0205)  0.2372 (0.0188)     0.6976 (0.0358)  0.2310 (0.0272)

groups of 30 tests by OVAPES with LDA with the two different sets of features is significantly different from zero; the p-values of the tests on CAR and CK are 0.0015 and 0.0157, respectively. However, the CAR scores in OVO with all binary classifiers in Table 5 are smaller than those in Table 4. Fig. 8 shows the classification accuracy for observations with different listing statuses. Comparing Fig. 8(a) with Fig. 7(a) suggests that OVAPES with LDA and the features used by financial experts slightly improves the accuracy rates for "A" and "B" observations while losing accuracy for "D" observations, with no change for "X" observations.
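The expert features listed in Section 3.3.2 are mostly simple ratios of financial-statement items. A hedged sketch of computing a subset of them from a dictionary of statement fields is shown below; the field names are illustrative assumptions, not from any specific data vendor or the authors' data set.

```python
def expert_features(fs):
    """Compute a subset of the expert-suggested ratios from a dict of
    financial-statement items (all monetary values in the same unit)."""
    ta = fs["total_assets"]
    return {
        "WCTA":   (fs["current_assets"] - fs["current_liabilities"]) / ta,
        "RETA":   fs["retained_earnings"] / ta,
        "EBITTA": fs["ebit"] / ta,
        "STA":    fs["sales"] / ta,
        "NITA":   fs["net_income"] / ta,
        "TLTA":   fs["total_liabilities"] / ta,
    }

# hypothetical statement items for one firm-year
sample = {"total_assets": 200.0, "current_assets": 80.0,
          "current_liabilities": 50.0, "retained_earnings": 30.0,
          "ebit": 20.0, "sales": 160.0, "net_income": 12.0,
          "total_liabilities": 110.0}
print(expert_features(sample)["WCTA"])  # (80 - 50) / 200 = 0.15
```

The market-based features (EAR, FMCTMC, Std) would additionally require daily return and market-capitalization data and are omitted from this sketch.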


4. Conclusion

This paper has treated the listing status transitions of Chinese listed companies as a multi-class classification problem, in contrast to the existing literature, which treats it as a binary classification problem of predicting whether a normal company will receive a risk warning. To solve the multi-class classification problem, three aggregative models based on the OVA and OVO strategies are proposed, and seven widely used binary classifiers are investigated. The empirical results show that LDA and LR are robust in terms of their small standard deviations over 30 test iterations, and both outperform the other binary classifiers in most test scenarios. Due to the difficulty of distinguishing the characteristics of class "B" companies, which carry other risk warnings, from those of class "D" companies, which carry delisting risk warnings, no model provides satisfactory classification accuracy for companies of these statuses compared with its performance for "A" companies. In the aggregative classification models, the features selected by ttFVIF perform consistently under both the OVA parallel strategy and OVO for each of the seven binary classifiers. OVAPES incorporating LDA achieves significantly higher performance with the features used by financial experts than with the features obtained by ttFVIF. Some types of listing status transition remain practically difficult to predict, such as the transition from "D" to "X" status. Future research may investigate what other data are needed and what procedures can be followed to predict such listing status transitions effectively.

References

[1] V. Agarwal, R. Taffler, Comparing the performance of market-based and accounting-based bankruptcy prediction models, J. Bank. Financ. 32 (8) (2008) 1541–1551.
[2] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, J. Financ. 23 (4) (1968) 589–609.
[3] R. Anand, K. Mehrotra, C.K. Mohan, S. Ranka, Efficient classification for multiclass problems using modular neural networks, IEEE Trans. Neural Netw. 6 (1) (1995) 117–124.
[4] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[5] S.R. Das, P. Hanouna, A. Sarin, Accounting-based versus market-based cross-sectional models of CDS spreads, J. Bank. Financ. 33 (4) (2009) 719–730.
[6] Y. Ding, X. Song, Y. Zen, Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Syst. Appl. 34 (4) (2008) 3081–3089.
[7] H.K. Ekenel, T. Semela, Multimodal genre classification of TV programs and YouTube videos, Multimed. Tools Appl. 63 (2) (2013) 547–567.
[8] H.J. Escalante, M. Montes, L.E. Sucar, Multi-class particle swarm model selection for automatic image annotation, Expert Syst. Appl. 39 (12) (2012) 11011–11021.
[9] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[10] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recognit. 44 (8) (2011) 1761–1776.
[11] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers, Pattern Recognit. 46 (12) (2013) 3412–3424.
[12] R.B. Geng, I. Bose, X. Chen, Prediction of financial distress: an empirical study of listed Chinese companies using data mining, Eur. J. Oper. Res. 241 (1) (2014) 236–247.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (7/8) (2003) 1157–1182.
[14] I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh, Feature Extraction: Foundations and Applications, Springer, 2006.
[15] J.H. Hong, J.K. Min, U.K. Cho, S.B. Cho, Fingerprint classification using one-vs-all support vector machines dynamically ordered with naive Bayes classifiers, Pattern Recognit. 41 (2) (2008) 662–671.
[16] C. Huang, C. Dai, M. Guo, A hybrid approach using two-level DEA for financial failure prediction and integrated SE-DEA and GCA for indicators selection, Appl. Math. Comput. 251 (2015) 431–441.
[17] T.M. Khoshgoftaar, K. Gao, H. Lin, Indirect classification approaches: a comparative study in network intrusion detection, Int. J. Comput. Appl. Technol. 27 (4) (2006) 232–245.
[18] S. Knerr, L. Personnaz, G. Dreyfus, Single-Layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network, Springer, 1990.
[19] S.J. Li, S. Wang, A financial early warning logit model and its efficiency verification approach, Knowl.-Based Syst. 70 (2014) 78–87.
[20] Z.Y. Li, J. Crook, G. Andreeva, Chinese companies distress prediction: an application of data envelopment analysis, J. Oper. Res. Soc. 65 (3) (2014) 466–479.
[21] D.A. Lind, W.G. Marchal, S.A. Wathen, Statistical Techniques in Business & Economics, McGraw-Hill, 2012.
[22] K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
[23] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Adv. Large Margin Classif. 10 (3) (1999) 61–74.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[25] K.P. Ravi, V. Ravi, Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review, Eur. J. Oper. Res. 180 (1) (2007) 1–28.
[26] R. Rifkin, A. Klautau, In defense of one-vs-all classification, J. Mach. Learn. Res. 5 (2004) 101–141.
[27] T. Shumway, Forecasting bankruptcy more accurately: a simple hazard model, J. Bus. 74 (1) (2001) 101–124.
[28] J. Sun, Z.M. Shang, H. Li, Imbalance-oriented SVM methods for financial distress prediction: a comparative study among the new SB-SVM-ensemble method and traditional methods, J. Oper. Res. Soc. 65 (12) (2014) 1905–1919.
[29] J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[30] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[31] Z. Xiao, X.L. Yang, Y. Pang, X. Dang, The prediction for listed companies' financial distress by using multiple prediction methods with rough set and Dempster-Shafer evidence theory, Knowl.-Based Syst. 26 (2012) 196–206.
[32] L. Zhang, E.I. Altman, J. Yen, Corporate financial distress diagnosis model and application in credit rating for listing firms in China, Front. Comput. Sci. China 4 (2) (2010) 220–236.
[33] L. Zhou, K.K. Lai, J. Yen, Empirical models based on features ranking techniques for corporate financial distress prediction, Comput. Math. Appl. 64 (8) (2012) 2484–2496.