Predicting the listing statuses of Chinese-listed companies using decision trees combined with an improved filter feature selection method





Ligang Zhou a,*, Yain-Whar Si b, Hamido Fujita c

a School of Business, Macau University of Science and Technology, Taipa, Macau
b Department of Computer and Information Science, University of Macau, Macau
c Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan

* Corresponding author. E-mail addresses: [email protected] (L. Zhou), [email protected] (Y.-W. Si), [email protected] (H. Fujita).

Article history: Received 23 December 2016; Revised 28 April 2017; Accepted 3 May 2017; Available online xxx.

Keywords: Multi-class classification; Listing-status prediction; Decision tree C4.5; Decision tree C5.0

Abstract

Predicting the listing statuses of Chinese-listed companies (PLSCLC) is an important and complex problem for investors in China. There is a large quantity of information related to each company's listing status. We propose an improved filter feature selection method to select effective features for predicting the listing statuses of Chinese-listed companies. Due to the practical concerns of analysts in finance about the performance and interpretability of the prediction models, models based on decision trees C4.5 and C5.0 are employed and are compared with several other widely used models. To evaluate the models' robustness with time, the models are also tested under rolling time windows. The empirical results demonstrate the efficacy of the proposed feature selection method and decision tree C5.0 model.

1. Introduction

To help make investors aware of the risks of the listed companies in Chinese stock markets, the China Securities Regulatory Commission (CSRC) issued special listing rules for the listed companies in 1998. According to the rules, a listed company that falls into abnormal operational or financial conditions will be given a special risk warning label, known as special treatment (ST), to indicate its risk. According to the "Rules Governing the Listing of Stocks (RGLS)" on the Shanghai Stock Exchange (SHSE) [1] and Shenzhen Stock Exchange (SZSE) [2] released in July 2012, there are two types of risk warning: indication of the delisting risk and indication of other risks. The special treatment of a stock with a delisting risk warning or other risk warning is to add the prefix "*ST" or "ST" to the stock symbol, respectively. In addition, the limit of the price increase/decrease within a trading day is 5% for a stock with a risk warning and 10% for a normal stock. In China's stock market, a stock with a delisting or other risk warning is commonly called an ST stock and the corresponding company is called an ST company.

The RGLS on the SHSE and SZSE list similar conditions for implementing the delisting risk warning on the listed companies.





The abnormal financial conditions are specified as follows [1,2]:

(1) The audited net profit of the company was negative in the last two consecutive fiscal years.
(2) The audited net worth of the company was negative in the last fiscal year.
(3) The audited operating income of the company was less than 10 million Yuan in the last fiscal year.
(4) The financial statements for the last fiscal year were given an adverse opinion or a disclaimer of opinion by the auditing firm.
(5) The company has been ordered by the China Securities Regulatory Commission (CSRC) to correct serious errors and false records, but has failed to do so within the specified time window, and the company's stocks have been suspended from trading for two months.
(6) The company fails to disclose its annual report or semi-annual report within the statutory time window, and the company's stock has been suspended from trading for two months.

If any of the following conditions occur, the listed company will be given another risk warning:

(1) The operation of the company has been seriously affected and cannot be recovered within three months.
(2) The main bank account of the company has been blocked.


(3) The board of directors cannot hold meetings on a regular basis and cannot reach a resolution.
(4) The company funds the controlling shareholders or their stakeholders, or the company violates the regulations on providing guarantees for external obligations.

As indicated above, the main reason for a normal company to receive ST is its poor financial performance. Therefore, financial ratios are important factors in listing-status prediction. A normal company may be given special treatment; at the same time, an ST company has the opportunity to get the risk warning removed if its financial status has improved and it can meet certain requirements specified in the RGLS. If a company with a risk warning fails to meet some requirements, it may be delisted. Since July 1999, ST firms that exhibited no sign of financial improvement in the following year were given a particular transfer (PT) warning by the CSRC (hereinafter called PT firms) [3]. However, on February 25, 2002, the PT warning policy was canceled by the CSRC. Therefore, in this study, the PT warning is not considered and the listing statuses of the stocks are categorized into four groups: (1) a normal company without any risk warning; (2) a company with a delisting risk warning; (3) a company with another risk warning; and (4) a delisted company. These four listing-status groups are denoted by A, B, D and X, respectively, in this research. Due to the different risk levels implied by the listing statuses, it is crucial for investors to forecast the stocks' listing statuses to manage the risk of their portfolios or make stock investment decisions in Chinese stock markets.

Fig. 1. The different formulations of the listing-status prediction problem.

Most preliminary research [3–5] forecasts the listing status of companies in China by classifying a currently listed company as either a normal company or a financially distressed company, as shown in Fig. 1(a). Since a risk warning given to a company always indicates the company's financial distress, the prediction of the listing status of a company in China is usually formulated as a financial distress prediction problem. In a prediction model constructed for the normal listed companies, a company can have one of two possible financial statuses in the considered forecasting period: normal or financially distressed. Therefore, the financial distress prediction problem is a typical binary classification problem. Altman et al. [3] developed a model called Zchina Score to predict the financial distress of the listed companies in China. They defined the "ST" and "PT" companies as financially distressed firms, and the model was constructed based on a training sample consisting of 30 financially distressed and 30 healthy companies that were announced in 1998 or 1999. Ding et al. [4] developed a prediction model based on support vector machines (SVM) to predict the financial distress of Chinese high-tech manufacturing companies. Li and Sun [5] proposed a hybrid Gaussian case-based reasoning system for predicting the business failures of Chinese-listed companies.

This study predicts the listing statuses of the listed companies based on the four different listing statuses, as shown in Fig. 1(b). The study demonstrates that the listing-status prediction problem is a multi-class classification (MC) problem in practice, which is simplified to a binary classification problem in most existing studies [3–5]. Since the listing status can indicate a company's risk level and affect the liquidity of the company's stock, correct prediction of a listed company's listing status is very important for the investors and stakeholders of the company. Moreover, the interpretability of the predictive models can help financial analysts judge the reliability of the models and thus increase the applicability of the models. Therefore, we select the decision tree models C4.5 and C5.0 for predicting the listing statuses of Chinese-listed companies. C5.0 is an extension of decision tree C4.5 with improvements in computational efficiency and memory usage. Some widely used approaches for multi-class classification are also used for benchmarking, such as linear discriminant analysis (LDA), Adaboost (AD), neural networks (NN), random forest (RF), and Bayesian networks (BN). These multi-class classification models can be implemented easily and have been successfully applied in different applications [6–10].

The performance in predicting the listing statuses of Chinese-listed companies can be affected not only by the classification approach but also by the selection of features. There are different categories of features that can be used in PLSCLC, such as macroeconomic factors, company characteristics, financial indicators and market information. Most previous studies have demonstrated that financial indicators and market information are most effective in financial distress prediction [3,11–14]. However, there are hundreds of different financial indicators and pieces of market information that can be used in PLSCLC.


Therefore, it is also important in PLSCLC to select an appropriate subset of features for accurate forecasting. There are hundreds of features in PLSCLC and most of them are highly correlated because they are derived from highly related items in the financial statements of the listed company. To select an efficient feature subset with good discriminative capability but less multi-collinearity, this study aims to develop an extended feature selection method that combines a filter method and the variance inflation factor (VIF) to prevent high dependency among features in the selected feature subset for multi-class classification. Feature selection based on the combination of a filter method and the VIF was initially proposed for binary classification in [15], which explores the efficacy of feature selection guided by domain knowledge. In addition, this study explores an efficient multi-class classification model and a data mining feature selection method for the PLSCLC problem with a large number of features and highly imbalanced classification data.

The rest of this paper is organized as follows. Section 2 reviews feature selection methods, including a genetic algorithm (GA) based wrapper method, and briefly introduces the decision tree models and other widely used methods for multi-class classification. Section 3 presents the proposed feature selection method. The empirical study is reported in Section 4. Section 5 presents the study's conclusions and a discussion.

2. Related research

2.1. Feature selection methods

Feature selection techniques in data mining can be used to select an effective feature subset for PLSCLC. The potential benefits of feature selection include facilitating data understanding, reducing computational time, and avoiding the curse of dimensionality to improve prediction performance [16]. Feature selection methods can be generally categorized into three groups: (1) filter methods, which select variables by ranking them according to information generated from the data, such as relative entropy or the absolute-value two-sample t-test with pooled variance estimate; (2) wrapper methods, which assess a feature subset according to the performance of a given model by searching the space of possible feature subsets and evaluating each subset in terms of the performance of the given model on that subset; and (3) embedded methods, which incorporate feature selection as part of the training process of the model, such as the least absolute shrinkage and selection operator (Lasso) in a linear model.

Filter methods can easily scale to high-dimensional data sets. These methods are also computationally simple and fast. However, they ignore the interaction with the classifier and the dependencies among features. Wrapper methods include the interaction between the feature-subset search and the classifier. These methods can take the feature dependencies into account and usually provide the best-performing feature set for the particular classifier. However, wrapper methods have high computational costs [17]. Embedded methods integrate the feature-subset selection into the classifier construction and have far lower computational costs than wrapper methods. However, not all classifiers can be easily adjusted to embed the search for the optimal feature subset.

The main idea of the wrapper method for feature selection is to search for the feature subset with the minimum error of the classification model on the training or validation set. A simple searching approach is sequential forward selection (SFS) [18], which adds a new feature from a candidate subset if the introduction of the new feature to the model can reduce the error of the model. One disadvantage of SFS is that once a feature has been selected, it cannot be removed, even if it becomes obsolete after the addition of other features. A genetic algorithm can overcome this shortcoming of SFS because it iteratively optimizes the feature subset.

In the wrapper method based on a genetic algorithm for multi-class classification (WMBGAMC), a feasible feature subset can be simply represented as a vector of binary numbers. For example, if there are a total of four features, i.e., m = 4, a feature subset with X2 and X4 selected can be coded as (0, 1, 0, 1). This vector can be naturally taken as a genome in the genetic algorithm. Each individual in a population is denoted by a genome, which is a vector of m such genes (g1, g2, ..., gm), gi ∈ {0, 1}. The final optimal genome obtained from repeated selection, crossover and mutation in the genetic algorithm can easily be transformed into an optimal feature subset. The fitness function is defined by the performance on the training or validation data set. More details about the wrapper method for feature selection can be found in [15,19].
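To make the WMBGAMC idea concrete, the following is a minimal Python sketch of a GA-based wrapper of this kind. It is an illustration only, not the authors' implementation: the numpy feature matrix X, the scikit-learn DecisionTreeClassifier stand-in, the 3-fold cross-validation fitness, and the population size, crossover and mutation rates are all assumptions chosen for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier  # stand-in base classifier


def ga_wrapper_select(X, y, estimator=None, pop_size=30, n_gen=40,
                      cx_rate=0.8, mut_rate=0.02, seed=0):
    """Return a boolean genome over the columns of X marking the selected feature subset."""
    rng = np.random.default_rng(seed)
    est = estimator if estimator is not None else DecisionTreeClassifier(random_state=seed)
    m = X.shape[1]

    def fitness(genome):
        if not genome.any():                          # an empty subset is infeasible
            return 0.0
        return cross_val_score(clone(est), X[:, genome], y, cv=3).mean()

    pop = rng.random((pop_size, m)) < 0.5             # random initial genomes
    fit = np.array([fitness(g) for g in pop])

    for _ in range(n_gen):
        # tournament selection: the fitter of two random individuals becomes a parent
        pairs = rng.integers(0, pop_size, size=(pop_size, 2))
        winners = np.where(fit[pairs[:, 0]] >= fit[pairs[:, 1]], pairs[:, 0], pairs[:, 1])
        children = pop[winners].copy()
        # one-point crossover on consecutive parent pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < cx_rate:
                cut = rng.integers(1, m)
                children[i, cut:], children[i + 1, cut:] = (
                    children[i + 1, cut:].copy(), children[i, cut:].copy())
        # bit-flip mutation
        children ^= rng.random(children.shape) < mut_rate
        child_fit = np.array([fitness(g) for g in children])
        # elitism: keep the best pop_size genomes of parents and children
        merged = np.vstack([pop, children])
        merged_fit = np.concatenate([fit, child_fit])
        keep = np.argsort(merged_fit)[-pop_size:]
        pop, fit = merged[keep], merged_fit[keep]

    return pop[np.argmax(fit)]
```

The returned boolean genome plays the role of (g1, ..., gm): True entries mark the selected features, and such a wrapper would be run once per classifier, with that classifier passed as `estimator`.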

2.2. Decision trees and other benchmarking models

2.2.1. Decision trees

Decision trees C4.5 [20] and C5.0 are two of the most widely used decision tree models for classification. C4.5 chooses the splitting attribute at each node by using the gain ratio as the criterion for splitting samples into subsets. The splitting ceases when the number of instances to be split is below a specified threshold. A pruning process is performed after the creation of the tree to remove branches that do not help reduce the error, by replacing them with leaf nodes. C5.0 is an extension of C4.5. It is significantly faster and more memory efficient than C4.5 and can obtain a model that is similar to that obtained by C4.5 but with considerably smaller decision trees [21]. The details of the extensions are largely undocumented, but an implementation of C5.0 is available in the R package "C50" [22].

2.2.2. Other benchmarking models

(1) Linear discriminant analysis is a classical classification method. It classifies an observation into a class by minimizing the expected classification cost. The classification function is shown as follows:

\[ \hat{y} = \arg\min_{y=1,2,\ldots,K} \sum_{i=1}^{K} \hat{P}(y=i \mid x)\, C(y \mid i), \tag{1} \]

where \(\hat{y}\) is the predicted class for observation x; K is the number of classes; \(\hat{P}(y=i \mid x)\) is the estimated posterior probability of class i for observation x; and C(y|i) is the cost of classifying an observation as y when its true class is i. C(y|i) is defined as follows:

\[ C(y \mid i) = \begin{cases} 1 & \text{if } y \neq i, \\ 0 & \text{if } y = i. \end{cases} \]
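As a quick illustration of Eq. (1), the decision rule fits in a few lines of numpy; with the 0-1 cost above it reduces to choosing the class with the largest posterior. The posterior matrix P and cost matrix C below are hypothetical inputs, not values from the paper.

```python
import numpy as np

def min_expected_cost_class(P, C):
    """P: (n_samples, K) posterior estimates; C[y, i]: cost of predicting y when the true class is i.
    Returns, for each sample, the class index that minimizes the expected cost, as in Eq. (1)."""
    expected_cost = P @ C.T            # entry (n, y) = sum_i P[n, i] * C[y, i]
    return np.argmin(expected_cost, axis=1)

K = 4
C = 1.0 - np.eye(K)                    # the 0-1 cost: 1 if y != i, 0 if y == i
P = np.array([[0.70, 0.20, 0.07, 0.03]])
print(min_expected_cost_class(P, C))   # [0]: with 0-1 cost this is just the argmax of the posterior
```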

More details about linear discriminant analysis can be found in [23].

(2) Neural networks are composed of neurons inspired by biological nervous systems. The processing ability of a neural network lies in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training samples [24]. Pattern recognition networks (PRN) are feed-forward neural networks that are widely used in classification problems. These networks are typically composed of an input layer, one or more hidden layers, and an output layer. The structure of the pattern-recognition neural networks used in this study is shown in Fig. 2.


Fig. 2. The structure of a typical pattern-recognition neural network.

(3) The Adaboost model is a typical meta-learning algorithm. The main idea is to call a weak or base learning algorithm in a series of rounds. In each round, the weights of incorrectly classified instances are increased so that the weak learner is forced to focus on the misclassified observations in the training set. The base learning algorithm is usually a simple decision tree. Adaboost.M2, proposed by Freund and Schapire [25], is employed.

(4) Random forests are a combination of tree predictors such that each tree depends on the values of a random vector that is sampled independently from the same distribution for all trees in the forest [26]. After the trees are generated, they vote for the most popular class.

(5) A Bayesian network always has two components: (1) a probability distribution and (2) a directed acyclic graph consisting of nodes representing stochastic variables and arcs representing directed dependencies among variables, which are measured by the probability distribution. It can be used to capture complex interaction mechanisms and to perform prediction and classification [27,28].

3. Filter method combined with variance inflation factor (VIF) for multi-class classification

For a multi-class classification problem, suppose the features of an observation are denoted by X = (X1, X2, ..., XM), where M is the number of features, and the class label of the observation is denoted by Y, Y ∈ {1, 2, ..., K}. There are a total of N observations in the training set. Let v_j^k = var(X_j | Y = k), k = 1, 2, ..., K, where var(·) is the variance of the specified group of numerical values, and let m_j^k = mean(X_j | Y = k), k = 1, 2, ..., K, where mean(·) is the mean of the specified group of numerical values. Thus v_j^k and m_j^k denote the variance and mean of feature j over the observations of class k in the training set, respectively. The t-test criterion for feature ranking was initially defined for binary classification, and the feature measure based on the t-test criterion between classes k and l (k ≠ l) on the training set for feature j is defined as follows [16]:

\[ z_j^{kl} = \frac{\left| m_j^{k} - m_j^{l} \right|}{\sqrt{\, v_j^{k}/N^{k} + v_j^{l}/N^{l} \,}}, \qquad j = 1, 2, \ldots, M, \tag{2} \]

where N^k and N^l denote the numbers of observations with Y = k and Y = l, respectively. z_j^{kl} is the absolute-value two-sample t-statistic with pooled variance estimate, which indicates the discriminative capability of feature j between observations of classes k and l. In a multi-class classification problem with K classes, there are C_K^2 = K(K − 1)/2 different pairs of classes, and the feature measure based on the t-test criterion of Eq. (2) for feature j can be defined as

\[ Z_j = \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} \frac{N^{k} + N^{l}}{(K-1)\,N}\, z_j^{kl}, \qquad j = 1, 2, \ldots, M. \tag{3} \]

Each feature can be ranked in terms of the Z value defined in Eq. (3) in descending order. The top-ranked features have good discriminative capability on the K classes, but they may exhibit high multi-collinearity. Therefore, VIF analysis is employed to prevent high collinearity among the selected features. The VIF of feature j is calculated as follows:

\[ V_j = \frac{1}{1 - R_j^{2}}, \tag{4} \]

where R_j^2 is the R-squared value of the regression equation X_j = β0 + βX', in which X' contains all features except X_j.

Algorithm 1: ttFVIFMC.
Input: the training sample set Sr with M features X1, X2, ..., XM and class Y; the number of features to be selected, M*.
Output: feature subset F*.
1: F* = ∅, m = 1
2: Rank the M features in terms of Z defined in Eq. (3) in descending order; the ranked feature indices are denoted by a list Fo
3: Add the first feature in Fo to F*, m = m + 1
4: while m < M* do
5:     Calculate the VIF_m of the m-th feature in Fo with respect to the features in F*
6:     if VIF_m < 10 then
7:         add the m-th feature in Fo to F*
8:     end
9:     m = m + 1
10: end

Algorithm 1 describes the feature-subset selection for multi-class classification based on the filter method with the t-test criterion and the VIF. The VIF threshold of 10 in ttFVIFMC is an empirical value suggested by [29]. A larger VIF threshold would permit the selection of features that are highly correlated with the already selected features, while a much smaller VIF threshold would reduce the number of feature candidates and might result in no additional features being selected.
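For readers who want to reproduce the selection step, the following is a minimal Python sketch of Algorithm 1 under stated assumptions: X is a numeric numpy feature matrix, y holds the class labels, Eq. (2)'s statistic and Eq. (3)'s weighting are transcribed directly, and the VIF screen uses the threshold of 10 from the text. The function and variable names are ours; this is not the authors' code.

```python
import numpy as np
from itertools import combinations

def tt_scores(X, y):
    """Z_j of Eq. (3): pairwise t statistics of Eq. (2) weighted and summed over all class pairs."""
    classes = np.unique(y)
    K, N = len(classes), len(y)
    Z = np.zeros(X.shape[1])
    for k, l in combinations(classes, 2):
        Xk, Xl = X[y == k], X[y == l]
        nk, nl = len(Xk), len(Xl)
        denom = np.sqrt(Xk.var(0, ddof=1) / nk + Xl.var(0, ddof=1) / nl) + 1e-12  # guard constant features
        Z += (nk + nl) / ((K - 1) * N) * np.abs(Xk.mean(0) - Xl.mean(0)) / denom
    return Z

def vif(x, selected_cols):
    """VIF of a candidate column x given the already selected columns (Eq. (4))."""
    A = np.column_stack([np.ones(len(x)), selected_cols])   # regression with intercept
    beta, *_ = np.linalg.lstsq(A, x, rcond=None)
    r2 = 1.0 - np.var(x - A @ beta) / np.var(x)
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

def ttfvifmc(X, y, n_select=10, vif_threshold=10.0):
    """Algorithm 1 (ttFVIFMC): walk down the Z ranking and keep a feature only if its VIF
    against the features already kept stays below the threshold."""
    order = np.argsort(tt_scores(X, y))[::-1]      # Fo: feature indices ranked by Z, descending
    selected = [order[0]]                           # the top-ranked feature is always kept
    for j in order[1:n_select]:                     # examine the next top-ranked candidates
        if vif(X[:, j], X[:, selected]) < vif_threshold:
            selected.append(j)
    return selected
```

Following the pseudocode, only the top-ranked candidates are examined, so the routine may return fewer than n_select features when later candidates are too collinear with the earlier ones.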


4. Experimental study

4.1. Data set

The data set in this study is collected from the China Stock Market and Accounting Research Database (CSMARD) provided by GTA Information Technology Co. Ltd (http://www.csmar.gtadata.com/). There are a total of 23,497 company-year observations dating from 1999 to 2013 in the data set after preprocessing. The number of observations in each class, by year, is shown in Fig. 3.

Fig. 3. The number of observations of different listing statuses in each observed year.

Each observation contains a total of 171 features and one class label, which indicates the company's listing status in the next year. The features include (1) 167 different financial ratios, which measure various aspects of a company's financial status, such as short-term solvency, long-term solvency, asset management or turnover, profitability, capital structure, stock-holder earning profitability, cash management and development capability; (2) three market variables that were introduced by Shumway [12] for corporate bankruptcy prediction, namely the excess return of the company's stock, its relative market capitalization, and the standard deviation of the company's stock return; and (3) the stock-market type (Shanghai or Shenzhen) and the current listing status of the company.

As shown in Fig. 3, a large proportion of the observations are in class A, small proportions are in classes B and D, and a very small proportion is in class X. This indicates that the prediction of listing status is a highly imbalanced classification problem.

4.2. Experimental framework

The framework for training and testing the multi-class classification models is shown in Fig. 4. We test all multi-class classification models in two different scenarios: without rolling windows and with rolling windows. In the no-rolling-windows scenario, the models are trained and tested on samples from a fixed period. The models are static because they do not change over time. In the rolling-windows scenario, the models are reconstructed by introducing new observations into the training set, i.e., the training samples used for training the models change over time. Therefore, the rolling-window tests are used to assess the stability of the models over time.

4.3. Experimental settings

In the no-rolling-windows scenario, the training set consists of a total of 12,960 observations with t ≤ 2008, while the testing set contains 10,537 observations with t > 2008. In the rolling-windows scenario, we predict the observations in each observed year from 2009 to 2013, but we construct the predictive models with training sets whose samples are from the last five years relative to the observed year. For example, when we predict companies' listing statuses for year t, the observations from years t − 5 to t − 1 are used to construct the models.

Although there are more than one hundred variables in the data set, most existing research [3–5,12] concerning financial distress prediction has demonstrated that the number of features that are statistically significant for discriminating a normal company from a financially distressed company is not more than 10. Therefore, the number of features to be selected, M*, is set to 10. Although only 10 features are selected, all features are checked by the feature selection methods and only the most efficient features are finally used in the classification models.
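The two evaluation scenarios translate into a short loop; the sketch below assumes a pandas DataFrame `data` with a `year` column, a `label` column and numeric feature columns (all hypothetical names), and uses a scikit-learn DecisionTreeClassifier as a stand-in for C5.0, which has no standard Python implementation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_cols = [c for c in data.columns if c not in ("year", "label")]

# Static scenario: one fixed split at 2008 (train on t <= 2008, test on t > 2008).
train_static = data[data["year"] <= 2008]
test_static = data[data["year"] > 2008]

# Rolling-window scenario: for each observed year, train on the previous five years.
predictions = {}
for t in range(2009, 2014):
    train = data[data["year"].between(t - 5, t - 1)]
    test = data[data["year"] == t]
    model = DecisionTreeClassifier()                      # stand-in for C5.0
    model.fit(train[feature_cols], train["label"])
    predictions[t] = pd.Series(model.predict(test[feature_cols]), index=test.index)
```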

Fig. 4. The framework of training and testing the multi-class classification models.

Table 1. A typical resulting confusion matrix.

  Actual class    Predicted class                                Total
                  C1        C2        ...       CK
  C1              n11       n12       ...       n1K              n1
  C2              n21       n22       ...       n2K              n2
  ...             ...       ...       ...       ...              ...
  CK              nK1       nK2       ...       nKK              nK
  Total           n̂1        n̂2        ...       n̂K               n

There are two other reasons for the selection of a small number of features: (1) to make the knowledge gained from the model for listing-status prediction interpretable and simple, and (2) to reduce the computational time for modeling.

C4.5 is implemented by weka.classifiers.trees.J48 in Weka [30], while C5.0 is implemented by the C50 package [22] in R. The other multi-class classification models are implemented in Matlab. The hidden-layer size in the PRN is 20. The default settings are used for all other parameters.

4.4. Performance measures

The Micro-average (MiA) and Macro-averaged F1 (MF1) measures are used in evaluating performance on multi-class problems. When the data set is highly imbalanced, MF1 provides a more meaningful measure of performance. Table 1 gives a typical resulting confusion matrix for a problem with K classes, where n_ij denotes the number of observations with actual listing status i that are classified as having listing status j (i = 1, ..., K; j = 1, ..., K). The Micro-average, which is the number of correctly classified observations relative to the total number of testing observations, is defined as follows [31]:


\[ \mathrm{MiA} = \frac{\sum_{i=1}^{K} n_{ii}}{n}. \tag{5} \]

The Macro-averaged F1 scores for each of the K classes are defined as in Eq. (6), and the MF1 for the K classes is defined in Eq. (7) [32]:

\[ \mathrm{Precision}_i = \frac{n_{ii}}{\hat{n}_{i}}, \qquad \mathrm{Recall}_i = \frac{n_{ii}}{n_{i}}, \qquad F1_i = 2 \cdot \frac{\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \qquad i = 1, \ldots, K, \tag{6} \]

\[ \mathrm{MF1} = \frac{1}{K} \sum_{i=1}^{K} F1_i. \tag{7} \]
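Eqs. (5)-(7) are straightforward to compute from the confusion matrix of Table 1; a small sketch follows. The orientation of `conf` (rows = actual class, columns = predicted class) matches Table 1, and taking an undefined per-class F1 as zero is our convention for classes that are never predicted correctly.

```python
import numpy as np

def mia_and_mf1(conf):
    """conf[i, j]: number of observations of actual class i predicted as class j (Table 1).
    Returns (MiA, MF1) following Eqs. (5)-(7)."""
    conf = np.asarray(conf, dtype=float)
    n_ii = np.diag(conf)
    mia = n_ii.sum() / conf.sum()                              # Eq. (5)
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.nan_to_num(n_ii / conf.sum(axis=0))     # n_ii / n_hat_i
        recall = np.nan_to_num(n_ii / conf.sum(axis=1))        # n_ii / n_i
        f1 = np.nan_to_num(2 * precision * recall / (precision + recall))  # Eq. (6)
    return mia, f1.mean()                                      # Eq. (7): macro average over K classes
```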

Since there is a random process in the WMBGAMC method, the average performance over 10 iterations of tests is used to evaluate the MiA and MF1 of all multi-class classification models under the WMBGAMC feature-selection method.

Table 2. The performances of MC models with different feature-selection methods in tests without rolling windows.

           MiA (%)                            MF1 (%)
  Model    AF       WMBGAMC   ttFVIFMC        AF       WMBGAMC   ttFVIFMC
  LDA      75.69    75.43     93.67           40.26    38.24     39.07
  NN       94.36    94.38     93.94           44.58    45.01     39.42
  C4.5     91.51    91.21     91.42           40.98    42.30     42.98
  C5.0     94.14    93.45     94.57           47.31    44.34     51.24
  AD       94.76    94.55     94.69           39.32    38.06     38.15
  RF       94.69    94.29     94.18           47.45    44.95     49.26
  BN       89.43    89.78     90.40           42.01    42.71     44.42

Table 3. Wilcoxon signed-rank tests for the comparison of MF1 performance between the WMBGAMC and ttFVIFMC feature-selection methods.

  Classifier   Statistic (R+)   p-Value   Hypothesis test conclusion
  LDA          22               0.3125    Do not reject H0
  NN           54               0.9990    Do not reject H0
  C4.5         13               0.0801    Reject H0 at α = 0.1
  C5.0         0                0.0010    Reject H0 at α = 0.01
  AD           31               0.6523    Do not reject H0
  RF           0                0.0010    Reject H0 at α = 0.01
  BN           4                0.0068    Reject H0 at α = 0.01

Table 4. The average performances of MC models with different feature-selection methods in tests with rolling windows.

           MiA (%)                            MF1 (%)
  Model    AF       WMBGAMC   ttFVIFMC        AF       WMBGAMC   ttFVIFMC
  LDA      93.45    93.73     93.93           42.08    42.39     43.32
  NN       93.95    94.56     94.93           41.91    47.25     49.91
  C4.5     92.95    93.48     93.21           45.62    46.51     46.68
  C5.0     94.65    93.66     95.00           48.72    45.28     51.49
  AD       94.94    94.89     94.74           41.22    42.33     38.15
  RF       94.82    94.85     94.91           49.35    48.50     51.31
  BN       89.64    89.87     93.08           43.78    44.16     48.25

4.5. The results of tests without rolling windows

Table 2 shows the MiA and MF1 performances, in the test without rolling windows, of the multi-class classification models with three different feature-selection methods: all features (AF), WMBGAMC, and ttFVIFMC. The training sample set consists of observations from years 1999 to 2008, and the testing samples are from years 2009 to 2013. The results show that the Adaboost model with all features achieves the largest MiA, while its MF1 is almost the smallest. Since the data set is highly imbalanced, Adaboost with all features classifies most observations into the major class; thus, it achieves the highest MiA. C5.0 combined with ttFVIFMC (C5.0-ttFVIFMC) achieves almost the same MiA as Adaboost but a higher MF1; this shows that C5.0-ttFVIFMC achieves better classification accuracy on observations of the minor classes.

To obtain well-founded conclusions on the MF1 comparison between WMBGAMC and ttFVIFMC, Wilcoxon signed-rank tests [33] are employed and the results of the hypothesis tests are given in Table 3. The null and alternative hypotheses are H0: M_W ≥ M_tt and H1: M_W < M_tt, where M_W and M_tt denote the MF1 performance obtained with WMBGAMC and ttFVIFMC, respectively, and R+ is the sum of the ranks of the positive differences between the MF1 from WMBGAMC and ttFVIFMC. The statistical comparison in Table 3 indicates that ttFVIFMC outperformed WMBGAMC for all the multi-class classification models except LDA, NN, and AD at significance level α = 0.10. For C5.0 and RF, the statistic R+ = 0 indicates that the WMBGAMC feature-selection method did not outperform ttFVIFMC in terms of MF1 in any of the ten testing rounds.

Since PLSCLC is a highly imbalanced problem, it is important to investigate the models' classification accuracies on observations from the different classes, especially the minor classes. Fig. 5 shows the classification accuracies of the three models with the top MF1 performances listed in Table 2. The proportion of each class is denoted by the angle of a sector of a circle, and the classification accuracy on each class is denoted by the radius of the filled sector. Because there are only 15 observations of class X among the 10,537 observations in the test, and none of the three classification models can correctly predict even one X observation, the sector representing class X is too small to be labeled. As shown in Fig. 5, the three models show similar performance patterns on the test samples. All of the models have high classification accuracies on the major class A and low classification accuracies on the minor classes B, D and X. C5.0-ttFVIFMC achieves the highest classification accuracies among all the models, 67.7% and 52.27% on classes B and D, respectively.
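The one-sided test of Table 3 can be reproduced with SciPy; `mf1_wmbgamc` and `mf1_ttfvifmc` below are hypothetical length-10 arrays holding the MF1 of one classifier over the ten test rounds, and `alternative="less"` matches H1 (ttFVIFMC yields the higher MF1).

```python
from scipy.stats import wilcoxon

# mf1_wmbgamc, mf1_ttfvifmc: MF1 values of the same classifier over the ten rounds (hypothetical data).
stat, p_value = wilcoxon(mf1_wmbgamc, mf1_ttfvifmc, alternative="less")
# For one-sided alternatives, recent SciPy versions return the sum of ranks of the positive
# differences (R+ in Table 3); a small p_value rejects H0 in favour of ttFVIFMC.
```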

4.6. Results of tests with rolling windows

Table 4 shows the MiA and MF1 performances of the multi-class classification models with the three different feature-selection methods in the tests with rolling windows. For each test year, the models are constructed based on a training set consisting of the observations from the last five years. For example, to predict the listing statuses of the listed companies in 2013, the training sample set consists of observations from observed years 2008 to 2012. Table 4 gives the overall performances on the tests from years 2009 to 2013.

Both Tables 2 and 4 report the performance on all testing observations from years 2009 to 2013. However, Table 2 gives the results of static models, which are constructed with fixed training sets, while Table 4 presents the results of dynamic models, whose training sets change with the rolling time windows. Wilcoxon signed-rank tests are again used for a statistical comparison of the performance of static and dynamic models; they indicate that the dynamic models outperform the static models in terms of MiA and MF1 at a significance level of 0.01.


Fig. 5. Classification accuracy on each class from the three selected models.

Fig. 6. The classification accuracies on each class of the top three methods.

Fig. 6 shows the classification accuracies on each class of the three selected models whose MF1 values are among the top three in Table 4. All three models achieve stable and very high accuracies on class A. The classification accuracies on classes B and D fluctuate over the five test years. As with the static models, none of the three selected dynamic models can correctly predict even one observation of class X. C5.0-ttFVIFMC achieves average classification accuracies of 55.81% and 53.67% on classes B and D over the five years' tests, respectively. RF-ttFVIFMC achieves a more stable accuracy on D than on B; it can correctly predict all observations in B in 2013, but only approximately 20% of the observations in B in 2012.

It seems to be challenging for the models to achieve stable and high classification accuracies on the minor classes B, D and X. One possible reason is that the data set is highly imbalanced and the characteristic differences between observations of classes B and D are subtle. We can merge classes B and D into one financially distressed class, as most previous studies [3,5,15] do; the classification accuracies on this merged class BD for the three models are shown in Fig. 6(d). NN, C5.0 and RF combined with ttFVIFMC achieve average classification accuracies on class BD of 76.66%, 73.49% and 75.92%, respectively.

As the best MF1 of the models shown in Table 4 is only 51.49%, how can stock investors use the models to help make investment decisions? The investors need to know the reliability of the predictive results from the models.


Fig. 7. The observed class distributions of the different classes predicted by the dynamic C5.0-ttFVIFMC model.

Fig. 7(a)–(c) shows the real class distributions for the three classes A, B, and D predicted by the dynamic C5.0-ttFVIFMC model. Fig. 7(a) shows that, among the five tested years, a very large proportion of the predicted class-A observations actually belong to class A. For example, in observed year 2013, the model classified a total of 2467 companies into class A, among which 2449 companies actually belong to class A; only 18 companies from other classes are wrongly predicted to be in class A. In the tests for the five years, the maximum error rate of the prediction for class A is 3.26%, from 2009. If the investors make a long-term investment in the stock of a company that has been misclassified into class A by the model, the investors may suffer great losses when the company turns out to belong to class B, D or X. However, because the error rate of the prediction for class A is very low, the expected loss on investments in the predicted class-A companies may still be low. If the B, D or X companies are correctly predicted in advance, it will help the investors avoid such stocks and therefore help to reduce risk. If a company with B status transfers to A status in the next year, the correct prediction represents a great opportunity for the investor; and if the company transfers to D status in the next year, the correct prediction can help the investors close out their positions in advance to reduce risk.

Fig. 7(b) and (c) show the real class distributions of the predicted B and D classes, respectively. It can be observed that the model obtains high error rates on the predicted class B, especially in observed year 2012, for which 72 companies were predicted to be in class B, among which only 13 are actually in class B and 54 are in class A. Similar results were obtained for the prediction of class D in 2012: 72 companies were predicted to be in class D, but only 25 are actually in class D. We can observe that the error rate in 2012 is much higher than those in the other years. One possible reason is that both the SHSE and SZSE released the revised RGLS in 2012, which changed some of the rules regarding listing status, whereas the models rely on historical data, which only cover the obsolete rules. It is interesting to observe that the error rate for the prediction of class D in 2013 is almost zero, and only two observations from class A are misclassified as belonging to class D. Due to the unavailability of data after 2013, we cannot check whether this is a coincidence or the model has learned the new rules from the data. Although the prediction accuracies for classes B and D, as shown in Fig. 7, are relatively low compared with that of class A, the cost of misclassifying a class-A observation into class B or D is low, because the investors only lose the investment opportunities for the class-A companies that have been predicted as B or D.

4.7. Knowledge from the predictive models

It is important for the financial analyst to understand which factors are used in the prediction models and how they are used to make the prediction. Table 5 gives brief descriptions of the ten features selected by the ttFVIFMC method based on the training set with observations from years 1999 to 2008 for the static model. These features are widely used and important financial ratios that reflect such factors as a company's profitability, debt-paying ability, and cash management.

Although decision trees are interpretable, they can sometimes be quite difficult to understand. An important feature of C5.0 is its ability to generate classifiers called rule sets, which consist of unordered collections of (relatively) simple if-then rules [34]. These rules may be more straightforward and easier to understand than the structures of decision trees. There are a total of 34 rules generated by the C5.0-ttFVIFMC static model, which is trained on 12,960 cases and the ten features shown in Table 5. Due to limited space, we select for demonstration only one representative rule per class, namely the rule that covers the most cases in the training set; these rules are listed in Table 6. In Table 6, #ob.covered denotes the number of training cases covered by the rule, and #ob.ncovered denotes the number of those training cases that do not belong to the class predicted by the rule. Different classes have rule sets of different sizes; some rule sets cover large numbers of observations, while others cover small numbers of observations. The size of a rule set mainly depends on the structure of the training set, i.e., the distribution of observations from the different classes. Although the rule sets generated by C5.0 cannot cover all observations correctly, they provide the financial analysts with an easy way to make improvements by incorporating their professional knowledge into the decision rule sets.


Table 5. Description of the ten features selected by ttFVIFMC.

  ID    Name                                Memo
  V1    Current listing status              The listing status at the time of prediction
  V2    Net margin of current asset         Net income / average current asset
  V3    Retained earnings per share         Retained earning / shares outstanding (SO)
  V4    Operating revenue per share         Revenue / SO
  V5    Cash flow per share                 Increment of cash or cash equivalent / SO
  V6    Earned surplus per share            Earned surplus / SO
  V7    Ratio of EBIT to total debt         EBIT / total debt
  V8    Equity turnover                     Operating revenue / shareholders equity
  V9    Operating net cash flow per share   Net cash flow of operating activities / SO
  V10   Return on assets (ROA)              Net income / total assets

Table 6. Selected representative rules for each class generated by C5.0-ttFVIFMC.

  Predicted class   Rule                                                                          #ob.covered   #ob.ncovered
  A                 V1 ≤ 1 and V2 > −0.2752                                                       11,037        150
  B                 V1 > 1 and V2 > −0.2752                                                       445           173
  D                 V2 ≤ 0.2752 and V4 ≤ 0.1946 and V8 ≤ 0.0736 and V9 ≤ −0.2412                  227           92
  X                 V1 > 2 and V2 ≤ −0.2752 and V3 ≤ −0.2129 and V7 > −0.2768 and V9 > −0.0326    4             1
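Because the rules in Table 6 are plain threshold tests, they translate directly into code. The sketch below is only an illustration of how such a rule excerpt could be applied: the thresholds are copied from Table 6, the feature names follow Table 5, treating the first matching rule as the prediction is a simplification of C5.0's rule-set voting, and the numeric coding of V1 (the current listing status) is our reading of the thresholds.

```python
def representative_rule_class(V1, V2, V3, V4, V7, V8, V9):
    """Apply the four representative rules of Table 6 (first match wins; observations
    matching none of them are not covered by this excerpt of the rule set)."""
    if V1 > 2 and V2 <= -0.2752 and V3 <= -0.2129 and V7 > -0.2768 and V9 > -0.0326:
        return "X"   # delisted
    if V2 <= 0.2752 and V4 <= 0.1946 and V8 <= 0.0736 and V9 <= -0.2412:
        return "D"   # other risk warning
    if V1 > 1 and V2 > -0.2752:
        return "B"   # delisting risk warning
    if V1 <= 1 and V2 > -0.2752:
        return "A"   # normal listing status
    return None      # not covered by these four rules
```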

5. Conclusion

This paper introduced a method that combines decision trees with an improved filter feature-selection method to predict the listing statuses of Chinese-listed companies. The improved filter feature-selection method, ttFVIFMC, combines a t-test filter, which selects features with high discriminative capability across the multiple classes, with VIF analysis to reduce multicollinearity. The empirical results show that the proposed feature-selection method can improve the prediction performance in comparison with the genetic algorithm based wrapper method. All models are tested under two different scenarios: with rolling windows and without rolling windows. The experimental results show that the performances of the models with rolling windows are better than those of the models without rolling windows, which implies that the prediction models are sensitive to time effects and that the introduction of up-to-date observations into the training set can improve the models' performances. The decision tree C5.0 combined with ttFVIFMC achieves the highest MF1 performance. In addition, the factors that are used and how they are used in this model are very clear to financial analysts, who can easily incorporate their professional knowledge to improve the decision rules. Although the prediction models exhibit different classification accuracies on the different classes of listing status, our analysis of the annual predictive results from the dynamic models demonstrates their reliability and efficacy.

We will explore how to combine the knowledge obtained from the decision tree models with the professional knowledge of financial experts to improve the rule set for predicting the listing statuses of Chinese-listed companies. Moreover, determining how to evaluate the predictive results in terms of decision cost instead of classification accuracy warrants further investigation as well.

References

[1] Shanghai Stock Exchange, Rules Governing the Listing of Stocks on Shanghai Stock Exchange (Revision 2012), http://www.sse.com.cn/lawandrules/sserules/listing/stock/c/c_20150912_3985843.shtml, 2012.
[2] Shenzhen Stock Exchange, Rules Governing the Listing of Stocks on Shenzhen Stock Exchange (Revision 2012), https://www.szse.cn/main/files/2013/01/14/486117565020.pdf, 2012.
[3] E.I. Altman, L. Zhang, J. Yen, Corporate financial distress diagnosis in China, Working paper, 2007.
[4] Y. Ding, X. Song, Y. Zen, Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Systems with Applications 34 (4) (2008) 3081–3089.
[5] H. Li, J. Sun, Gaussian case-based reasoning for business failure prediction with empirical data in China, Information Sciences 179 (1) (2009) 89–108.
[6] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473–484.
[7] M. Collins, R.E. Schapire, Y. Singer, Logistic regression, AdaBoost and Bregman distances, Machine Learning 48 (1-3) (2002) 253–285.
[8] G. Eibl, K.-P. Pfeiffer, Multiclass boosting for weak classifiers, Journal of Machine Learning Research 6 (2005) 189–210.
[9] R. Anand, K. Mehrotra, C.K. Mohan, S. Ranka, Efficient classification for multiclass problems using modular neural networks, IEEE Transactions on Neural Networks 6 (1) (1995) 117–124.

[10] A. Prinzie, D. Van den Poel, Random forests for multiclass classification: Random multinomial logit, Expert Systems with Applications 34 (3) (2008) 1721–1732.
[11] E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, The Journal of Finance 23 (4) (1968) 589–609.
[12] T. Shumway, Forecasting bankruptcy more accurately: A simple hazard model, The Journal of Business 74 (1) (2001) 101–124.
[13] S. Aktan, Financial statement indicators of financial failure: An empirical study on Turkish public companies during the November 2000 and February 2001 crisis, Investment Management and Financial Innovations 6 (1) (2009) 163–173.
[14] V. Agarwal, R. Taffler, Comparing the performance of market-based and accounting-based bankruptcy prediction models, Journal of Banking & Finance 32 (8) (2008) 1541–1551.
[15] L. Zhou, D. Lu, H. Fujita, The performance of corporate financial distress prediction models with features selection guided by domain knowledge and data mining approaches, Knowledge-Based Systems 85 (2015) 52–61.
[16] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[17] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (19) (2007) 2507–2517.
[18] S. Fallahpour, E.N. Lakvan, M.H. Zadeh, Using an ensemble classifier based on sequential floating forward selection for financial distress prediction problem, Journal of Retailing and Consumer Services 34 (2017) 159–167.
[19] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1-2) (1997) 273–324.
[20] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[21] RuleQuest Research, Is See5/C5.0 Better Than C4.5?, http://www.rulequest.com/see5-comparison.html, 2012.
[22] M. Kuhn, S. Weston, N. Coulter, M. Culp, R. Quinlan, C50: C5.0 decision trees and rule-based models, R package version 0.1.0-24, http://www.CRAN.R-project.org/package=C50, 2015.
[23] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[24] K. Gurney, An Introduction to Neural Networks, CRC Press, 1997.
[25] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: European Conference on Computational Learning Theory, Springer, 1995, pp. 23–37.
[26] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[27] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian network classifiers, Machine Learning 29 (2-3) (1997) 131–163.
[28] L. Jiang, H. Zhang, Z. Cai, A novel Bayes model: Hidden naive Bayes, IEEE Transactions on Knowledge and Data Engineering 21 (10) (2009) 1361–1371.
[29] A. Douglas, G. William, D. Robert, Statistical Techniques in Business & Economics, McGraw-Hill, 2012.
[30] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[31] V. Van Asch, Macro- and micro-averaged evaluation measures, http://www.cnts.ua.ac.be/~vincent/pdf/microaverage.pdf, 2013.
[32] G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research 3 (2003) 1289–1305.
[33] M. Hollander, D.A. Wolfe, E. Chicken, Nonparametric Statistical Methods, John Wiley & Sons, 2013.
[34] Information on See5/C5.0, RuleQuest Research Data Mining Tools, http://www.rulequest.com/see5-info.html, 2011.
