Financial distress prediction: Regularized sparse-based Random Subspace with ER aggregation rule incorporating textual disclosures




Applied Soft Computing Journal 90 (2020) 106152

Contents lists available at ScienceDirect

Applied Soft Computing Journal
journal homepage: www.elsevier.com/locate/asoc


Gang Wang a,b,∗, Jingling Ma a, Gang Chen c, Ying Yang a,b

a School of Management, Hefei University of Technology, Hefei, Anhui, PR China
b Key Laboratory of Process Optimization and Intelligent Decision-making (Hefei University of Technology), Ministry of Education, Hefei, Anhui, PR China
c School of Management, Fudan University, Shanghai, PR China

Article info

Article history:
Received 11 April 2019
Received in revised form 25 November 2019
Accepted 31 January 2020
Available online 3 February 2020

Keywords: Financial distress prediction; Random subspace; Textual disclosures; Grouping features; Sparse group lasso; Evidence reasoning rule

Abstract

For the sake of risk management, loss reduction, and cost saving, financial distress prediction (FDP) has attracted extensive attention from various communities, including academic researchers, industrial practitioners, and government regulators. In addition to conventional financial information, textual disclosures regarding companies have received particular attention recently and have been demonstrated to be effective for FDP. Ensemble methods have become a prevalent research line in the field of FDP incorporating financial and non-financial features. Feature quality is an important factor determining the accuracy of an ensemble; however, traditional ensemble methods integrate these different types of features directly and ignore their grouping structures, hence weakening the feature quality and ultimately deteriorating the prediction accuracy. Moreover, although diversity can be obtained by virtue of the randomness of feature sampling in an ensemble, such randomness leads to ambiguities among base classifiers, so that the prediction accuracy of individual classifiers cannot be ensured. Having noted these deficiencies, we propose a novel and robust meta FDP framework, which incorporates a feature regularizing module for identifying the discriminatory predictive power of multiple features and a probabilistic fusion module for enhancing the aggregation over base classifiers. To validate our proposed regularized sparse-based Random Subspace with Evidential Reasoning rule (RS²_ER), we conducted extensive experiments on datasets collected from the China Security Market Accounting Research Database (CSMARD). The experimental results indicate that the proposed RS²_ER method significantly improves prediction effectiveness for FDP by dealing with the grouping property of features and the ambiguities among base classifiers.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

With the ever-changing commercial environment, a growing number of companies face tremendous challenges in maintaining their competitive advantage in a mature market [1]. Without a sound mechanism for adapting to such an environment, companies are more likely to run into difficulties caused by insufficient liquidity, excessive liabilities, and so on, which ultimately push them towards financial distress. As one of the most closely watched problems in the area of commerce, financial distress imposes adverse impacts on companies' long-term development and even threatens their survival [2]. Specifically, financial distress inevitably damages the interests of company stakeholders such as investors, managers, employees, suppliers, and regulators [3]. Besides, when the number of financially distressed companies accumulates to a certain extent, it may burst into a financial crisis, thus threatening the stability of the whole capital market. Considering the adverse effects of financial distress, it is not strange that academic researchers, industrial practitioners, and government regulators have endeavored to seek guidance from a credible financial distress prediction (FDP) model that provides early warning of distress. With the benefits of FDP: (1) managers can make more advisable managerial decisions with respect to business strategies; (2) investors can obtain auxiliary support for their investment decisions [4]; and (3) regulators can achieve greater regulatory effectiveness. To this end, FDP has become an indispensable task for academic researchers, industrial practitioners, and government regulators [5,6].

∗ Correspondence to: School of Management, Hefei University of Technology, Hefei, Anhui 230009, PR China. E-mail address: [email protected] (G. Wang).
https://doi.org/10.1016/j.asoc.2020.106152
1568-4946/© 2020 Elsevier B.V. All rights reserved.


In the existing FDP research, acquiring effective features and constructing credible models have become two dominant directions for improving prediction accuracy [3,7]. From the feature point of view, financial features, which are generally derived by transforming companies' financial information into financial ratios, play a leading role in revealing companies' financial status and have been used steadily [8,9]. On the other hand, it is worth noting that in reality, non-financial information derived from textual disclosures can also be conducive to FDP ability [10,11]. Distinct from financial features, non-financial features extracted from textual disclosures are capable of reflecting the external environment with regard to the status of company operations, such as cost control, business strategies, and company management. In light of this, researchers have recently attached great importance to textual disclosures and explored them as a novel, efficient tool for achieving a more robust FDP model [12,13]. For example, to attain a better understanding of FDP, the predictive effectiveness of textual features has been examined recently, and the desirable results evidenced their supportive role for the prediction task [14,15]. In detail, textual disclosures associated with the economic and business environment, as well as the related policies of countries, were certified to play an important auxiliary role to financial features in terms of FDP, enabling more reliable prediction [16]. Moreover, recent efforts have shown that sentiment features, which capture the tone of financial documents, also contribute to prediction ability [17,18]. All the existing studies mentioned above indicate that both financial and non-financial features are substantial when predicting financial distress.
In particular, extracted financial ratios, in conjunction with sentiment and textual features, have been integrated to develop a more efficient FDP model, and the multiple features are shown to be associated with a higher prediction result [19]. As for constructing credible FDP models, previous efforts have mainly been devoted to two broad categories: statistical methods and machine learning methods [20,21]. Among the statistical methods, Discriminant Analysis (DA) [22–24], Logistic Regression (LR) [25], and Factor Analysis (FA) [26,27] have been extensively accepted for FDP by virtue of their low complexity and ease of use. Some researchers have provided comprehensive reviews of these statistical methods used in FDP and illustrated their prediction effects [20,28,29]. Nevertheless, the validity and applicability of statistical methods are limited by some strict assumptions, such as multivariate normality, linearity, and independence of the predictive variables [30]. More recently, researchers aiming to enhance the prediction effectiveness of FDP have focused on machine learning methods [31–34]. Typical examples such as Decision Trees (DT) [35], Neural Networks (NN) [36,37], and Support Vector Machines (SVM) [6,38] have been widely employed. When applying DT and NN to FDP, the results suggested that their prediction performance was comparable to that of statistical methods [37]. Moreover, SVM has been recognized as a representative baseline method for FDP owing to its outstanding generalization performance, which benefits from the structural risk minimization principle [22,39]. Note that common single classifiers expose their instability in various situations, leading to difficulties in finding an optimal predictive model that performs well consistently.
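As an aside, the single-classifier baselines discussed above (LR, DT, SVM) can be reproduced with off-the-shelf tooling. Below is a minimal sketch on synthetic data standing in for financial ratios; the data, hyperparameters, and cross-validated AUC protocol are illustrative choices, not those of the studies cited:

```python
# Hedged sketch: compare LR, DT, and SVM baselines on synthetic data
# that stands in for financial-ratio features (not real FDP data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

results = {}
for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("DT", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC(kernel="rbf"))]:
    # 5-fold cross-validated AUC, the usual FDP evaluation metric
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Which baseline wins depends on the data; the instability across situations noted above is exactly what motivates the ensemble methods discussed next.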
In this regard, ensemble methods, whose goal is to improve the generalization performance of individual classifiers, have become a prevalent research line in the field of FDP [40,41]. Furthermore, following the divide-and-conquer principle, ensemble learning benefits from dividing the data space into smaller, easy-to-learn partitions and combining the different classifiers appropriately, endowing it with the capability of coping with the underlying complex problems and achieving more stable prediction results [42]. In terms of partition manner, ensemble methods can be divided into two types: instance partitioning methods and feature partitioning methods. To be specific, instance partitioning methods such as Bagging, Boosting, and its variant AdaBoost have been extensively applied to FDP in recent studies and have obtained satisfactory prediction results in contrast to certain single classifiers [43,44]. On the other hand, Random Subspace (RS), an example of feature partitioning methods, has shown great superiority for FDP. In RS, by feature sampling, the original data space filled with redundant and irrelevant features can be refined to generate high-quality classifiers [45,46]. By dint of this peculiarity, RS shows its potential in dealing with high-dimensional problems. Given the above, we adopt an ensemble framework for FDP incorporating multiple features. As indicated by ensemble learning theory [47], the effectiveness of an ensemble can be strengthened via a proper trade-off between the accuracy and the diversity of its base classifiers. Intuitively, once a classification model is constructed, its ability to predict accurately depends solely on the quality of the feature space; that is, the features represent the real hypothesis space. Meanwhile, diversity intrinsically requires that the knowledge possessed by distinct base classifiers be as non-coincident as possible, which is essentially related to the way in which feature subsets are generated. For the FDP task, non-financial features extracted from textual disclosures have been demonstrated to be effective, and they have been integrated with financial features to realize accurate FDP [47]. However, the direct integration of these different types of features weakens the feature quality and ultimately deteriorates the prediction accuracy.
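The Random Subspace scheme described above, training each base classifier on a randomly sampled feature subset and combining them by majority vote, can be sketched as follows; the subspace rate, ensemble size, base learner, and synthetic data are illustrative choices, not the paper's exact configuration:

```python
# Hedged sketch of Random Subspace (RS): each base classifier sees only
# a random subset of features; predictions are combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=30, random_state=0)

n_classifiers, subspace_rate = 15, 0.5
k = int(subspace_rate * X.shape[1])          # features per subspace

subspaces, models = [], []
for _ in range(n_classifiers):
    idx = rng.choice(X.shape[1], size=k, replace=False)  # random feature subset
    clf = DecisionTreeClassifier(random_state=0).fit(X[:, idx], y)
    subspaces.append(idx)
    models.append(clf)

def predict(Xq):
    # Each model votes on its own columns; ties broken towards class 0
    votes = np.stack([m.predict(Xq[:, idx]) for m, idx in zip(models, subspaces)])
    return (votes.mean(axis=0) > 0.5).astype(int)

print("training accuracy:", (predict(X) == y).mean())
```

The random sampling gives diversity; the paper's point is that sampling all features uniformly, ignoring their group structure, can hurt the quality of each subspace.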
In detail, financial features acquired from financial statements are quantitative and directly connected with a company's financial performance, such as credit standing, possibility of repayment, and operation state [48]. In contrast, non-financial features extracted from textual disclosures are qualitative and reflect the external environment with regard to the company's operation status, such as cost control, business strategies, and company management [13]. Under these circumstances, directly combining these features without distinguishing their intrinsically different properties fails to make full use of the available features, which decreases the feature quality and ultimately weakens the model's predictive ability. Meanwhile, although diversity can be obtained by virtue of the randomness of feature sampling in an ensemble, such randomness leads to ambiguities among base classifiers, so that the prediction accuracy of individual classifiers cannot be ensured. In this vein, traditional aggregation rules for base classifiers, such as voting and averaging, fail to deal with the ambiguities among these individual classifiers, hence reducing the final prediction accuracy. These unresolved problems prompt us to propose a novel and unified ensemble framework for FDP, in which not only are multiple features well integrated to exert their maximal value, but distinct base classifiers are also combined within a probabilistic reasoning process to tackle their ambiguities in terms of prediction. Therefore, in this study, we propose a novel and robust meta FDP framework, which incorporates a feature regularizing module for identifying the discriminatory predictive power of multiple features and a probabilistic fusion module for enhancing the aggregation over base classifiers.
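To make the probabilistic fusion idea concrete, the sketch below combines base-classifier probability outputs using per-classifier reliabilities via a simplified, Dempster-style discount-and-combine scheme. It illustrates the spirit of weight/reliability-aware fusion, not the full ER rule used in this paper, and all numbers are illustrative:

```python
# Hedged sketch: reliability-weighted fusion of base-classifier outputs.
# Each probability vector is discounted by a reliability, the leftover
# mass goes to an "unknown" frame, and masses are combined conjunctively.
import numpy as np

def to_mass(p, reliability):
    """Discount a class-probability vector by the classifier's reliability;
    the remaining mass goes to the 'unknown' frame (last slot)."""
    m = reliability * np.asarray(p, dtype=float)
    return np.append(m, 1.0 - m.sum())

def combine(m1, m2):
    """Conjunctive combination over {class_0, ..., class_{k-1}, unknown},
    renormalizing away the conflicting mass."""
    k = len(m1) - 1
    out = np.zeros_like(m1)
    for c in range(k):
        out[c] = m1[c] * m2[c] + m1[c] * m2[k] + m1[k] * m2[c]
    out[k] = m1[k] * m2[k]
    return out / out.sum()

# three base classifiers, each with an assumed (illustrative) reliability
outputs       = [[0.8, 0.2], [0.6, 0.4], [0.9, 0.1]]
reliabilities = [0.9, 0.6, 0.8]

m = to_mass(outputs[0], reliabilities[0])
for p, r in zip(outputs[1:], reliabilities[1:]):
    m = combine(m, to_mass(p, r))

fused = m[:-1] / m[:-1].sum()   # final class probabilities
print(fused)
```

Unlike plain averaging, an unreliable classifier (here the second one) contributes little to the fused verdict, which is the behavior the ER-style aggregation is after.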
Specifically, the proposed method, the regularized sparse-based Random Subspace method with ER aggregation rule (RS²_ER), is well grounded in ensemble learning theory [42] and is pertinently constructed to solve the core issue raised by FDP in a context where non-financial data has been increasingly explored but inappropriately handled. In our proposed RS²_ER, RS is enhanced by a reasonable feature weighting mechanism as well as a classifier aggregation strategy, so as to achieve both accuracy and diversity in the ensemble. Concretely, the regularized sparse-based Random Subspace, which takes the features' grouping structure into account, makes it possible to maximize the use of multiple features, hence improving the feature quality as well as ensuring the prediction accuracy. Moreover, RS with the Evidential Reasoning rule (ER rule) endows the base classifiers generated on random feature subsets with different weights and reliabilities, thus contributing to a reasonable and accurate classifier combination. To verify the proposed RS²_ER, an empirical study using 1835 instances from the China Security Market Accounting Research Database (CSMARD) is conducted. The experimental results show that the proposed RS²_ER, which incorporates financial features together with sentiment and textual features, achieves the best prediction performance, with the value of AUC reaching more than 96.09%. Besides, the results indicate that RS²_ER outperforms RS with an AUC improvement of nearly 1.2% on the feature set integrating financial, sentiment, and textual features.

The main contributions of this research are:

(1) A novel and robust framework that considers both the prediction accuracy and the diversity of the ensemble is proposed for FDP incorporating multiple features. In the framework, both the feature regularizing module for identifying the discriminatory predictive power of multiple features and the probabilistic fusion module for enhancing the aggregation over base classifiers are integrated for accurate FDP.

(2) An enhanced method, RS²_ER, is proposed, in which a feature regularizing method and a classifier combination method are unified with the traditional Random Subspace for the FDP task.
Faced with existing unresolved issues that impose negative effects on the prediction accuracy and the diversity of the base classifiers in the ensemble, the proposed RS²_ER is a new exploration of the traditional ensemble method from the perspective of both feature integration and classifier combination. To the best of our knowledge, this is the first attempt to take the grouping structures of multiple features into account and simultaneously conduct effective aggregation of base classifiers for ensemble methods in FDP.

(3) An empirical study is conducted on the CSMAR datasets, and the experimental results verify RS²_ER's superior prediction performance relative to the benchmark methods on those datasets, hence demonstrating its effectiveness for FDP. Concretely, RS²_ER, with the grouping mechanism on the integrated features together with information maximization in the process of combining the base classifiers, has been verified to outperform the baseline methods.

The remainder of this study is organized as follows. Section 2 reviews related work on FDP. Section 3 presents the proposed RS²_ER method in detail. Section 4 describes the experimental design, including the dataset, evaluation metrics, and experimental procedures. Section 5 reports and discusses the experimental results in detail. Finally, Section 6 concludes this study and discusses future work.

2. Related work

As FDP is a challenging problem, it has stimulated numerous academics to research it over the past decades. Throughout existing studies on FDP, it is apparent that acquiring effective features and constructing credible models are two feasible and efficient ways to improve FDP.


2.1. Features in FDP

In the field of FDP, financial information has turned out to be strongly associated with companies' financial performance [49]. In essence, financial information is naturally capable of reflecting a company's credit standing, repayment possibility, operation state, and so on, which serve as crucial predictive information for estimating its financial distress. Therefore, financial ratios, which can be calculated from the financial information in companies' accounting statements, have been successfully applied to FDP. For instance, Beaver [8], who first used financial ratios to predict companies' financial distress, showed the effectiveness of financial features. Altman [24] likewise utilized a number of financial ratios of profitability, liquidity, and solvency to analyze companies' running state, verifying that they can provide valuable indications of financial distress. Accordingly, a growing number of researchers have adopted financial features for FDP and corroborated their contribution to a precise prediction model [44,50]. Geng et al. [9] employed 31 financial ratios, such as net profit margin of total assets, return on total assets, and cash flow per share, for FDP and demonstrated their efficiency. Huang et al. [51] selected the most important financial ratios from a set of 95 financial ratios to build predictive models, and indicated that fewer but relatively more vital financial ratios perform well. Moreover, to estimate the operation state of a company more accurately, non-financial data, for example companies' textual disclosures, which correlate directly with companies' operation status, such as cost control, business strategy development, and management situation elaboration, also provide reference value for FDP.
In this regard, non-financial features have stimulated much research into their complementary effect on conventional financial features in terms of FDP, and they have been corroborated to be important determinants of FDP [12,13]. For example, Cecchini et al. [52] examined the role that textual features extracted from financial texts may play in FDP, and found that the method combining both textual and financial features performed best in contrast to the benchmark methods (using only financial features or only textual features). Doumpos et al. [16] combined traditional financial features with qualitative non-financial features related to the macroeconomic and business environment to predict financial distress, achieving decent results. Hájek et al. [46] extracted sentiment features from annual reports as an auxiliary to financial features, and realized a more precise prediction with promising results. Jo et al. [53] aimed to improve the prediction accuracy of FDP by utilizing sentiment features extracted from financial news articles in conjunction with financial features, yielding satisfactory results. Obviously, to improve the prediction accuracy of FDP, recent efforts have focused on both financial and non-financial features. In particular, Wang et al. [19] first incorporated textual and sentiment features alongside financial features for FDP, indicating that the integrated features contribute considerably to the predictive power. However, since features extracted from various sources, such as financial statements, annual reports, and financial news articles, measure intrinsically different characteristics of financial distress, directly combining them without distinguishing their distinct properties weakens the feature quality and, ultimately, decreases the prediction accuracy of FDP.
In detail, financial features are quantitative and homogeneous, owing to the fact that the financial ratios from financial statements are calculated using similar formulas, whereas the qualitative non-financial features are mainly textual disclosures regarding the companies' business environment. Consequently, in order to enhance the feature quality and thus improve the prediction accuracy, a better synergy mechanism needs to be developed to maximize the use of multiple features while tackling irrelevant and redundant information.
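One established way to realize such a synergy mechanism is the sparse group lasso penalty named in this paper's keywords, λ1·‖β‖₁ + λ2·Σ_g √p_g·‖β_g‖₂, which can zero out individual coefficients as well as whole feature groups. A minimal sketch of its proximal operator, the building block of proximal-gradient solvers, follows; the group layout and λ values are illustrative:

```python
# Hedged sketch of the sparse group lasso proximal operator:
# an L1 soft-threshold per coefficient, then a group soft-threshold
# per block, so whole groups can collapse to zero.
import numpy as np

def prox_sparse_group_lasso(beta, groups, lam1, lam2, step=1.0):
    # L1 part: elementwise soft-thresholding
    out = np.sign(beta) * np.maximum(np.abs(beta) - step * lam1, 0.0)
    # group L2 part: shrink or kill each block
    for g in groups:
        block = out[g]
        norm = np.linalg.norm(block)
        thresh = step * lam2 * np.sqrt(len(g))   # sqrt(p_g) group weighting
        out[g] = 0.0 if norm <= thresh else block * (1.0 - thresh / norm)
    return out

beta = np.array([0.9, -0.8, 0.05, 0.02, -0.03])  # two strong + three weak coefficients
groups = [[0, 1], [2, 3, 4]]                     # e.g. financial vs. textual features
b = prox_sparse_group_lasso(beta, groups, lam1=0.05, lam2=0.1)
print(b)   # the weak group is zeroed out entirely; the strong group only shrinks
```

Dropping an entire uninformative group at once is precisely the group-aware behavior that a flat L1 penalty, which ignores feature grouping, cannot provide.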


2.2. Methods in FDP

As for constructing credible FDP models, previous efforts have been devoted to two broad categories: statistical methods and machine learning methods. Among statistical methods, typical examples such as Discriminant Analysis (DA), Logistic Regression (LR), and Factor Analysis (FA) have been demonstrated to be suitable for FDP. They are commonly used owing to their low complexity and ease of use. For instance, Altman [24] applied a discriminant analysis model to FDP and verified its effectiveness. Subsequently, LR and FA were employed successfully in financial distress prediction [25,26]. However, the validity and applicability of statistical methods are inevitably limited by their restrictive assumptions [20,54]. Under these circumstances, machine learning methods, with common examples being Decision Trees (DT), Neural Networks (NN), and Support Vector Machines (SVM), have been extensively adopted in FDP for their remarkable prediction performance [35,55,56]. For instance, Sun et al. [44] compared DT and MDA on different real-world datasets for FDP, and indicated that DT performed better. Meanwhile, Neural Networks have also been extensively employed for FDP because they can identify and represent non-linear relationships in the dataset. For example, Tam and Kiang [57] predicted financial distress by means of an ANN model and compared it with LDA and LR, and the results suggested that the ANN made the most accurate predictions. Besides, SVM has attracted academic interest in predicting financial distress owing to its outstanding generalization performance, which benefits from the structural risk minimization principle [58,59]. In detail, Shin et al. [39] found that, for FDP, SVM can be a successful model, as it showed excellent performance in balancing fitting ability, generalization ability, and model stability.
Ding et al. [60] conducted an empirical study on FDP with SVM and verified that SVM was a more robust and accurate model in contrast to MDA, LR, and NN. Note that common single classifiers expose their instability in various situations, leading to difficulties in finding an optimal prediction model that performs well consistently. In this regard, ensemble methods, whose goal is to improve the generalization performance of individual classifiers, have become a prevalent research line in the field of FDP [40,41]. Ensemble methods are generally superior to single classifiers and have been extensively adopted for FDP owing to several advantages [42]. First, through the ensemble, the errors generated by individual classifiers relying on different parts of the input space can be effectively compensated. From this point of view, ensemble classifiers are able to outperform the best-performing single classifier [61]. Second, ensemble learning methods benefit from the divide-and-conquer principle, which seeks to divide the data space into smaller, easy-to-learn partitions and combine the different classifiers appropriately, endowing them with the capability of coping with the underlying complex problems and achieving more stable prediction results. In terms of partition manner, ensemble methods can be divided into two types: (1) instance-partitioning methods, examples of which include Bagging, Boosting, and their many variations; and (2) feature-partitioning methods, the typical one being RS. The former have been extensively applied to FDP, in which individual classifiers are trained independently using various bootstrapped training instances instead of depending on the entire training set [42], and then a final classifier with superior performance can be obtained through a certain combination rule. For instance, West et al.
[62] investigated both Bagging and Boosting based on the single classifier of Multilayer Perceptron Neural Networks (MPNN) for FDP

and their experiment on a real dataset showed that Bagging achieved a lower generalization error than Boosting as well as other single classifiers. Besides, Zhang et al. [63] proposed a modified Bagging with decision trees as base classifiers, and it achieved satisfactory results on two real-world datasets from the UCI Machine Learning Repository. Moreover, Tsai and Hsu [40] performed a comparative study of ensemble methods with SVM, MLP, and NN as different base classifiers in FDP, and demonstrated that DT ensembles using Boosting performed best on the three different datasets. Furthermore, as an extension of traditional Boosting, AdaBoost, with an adaptive weight distribution over instances in each learning iteration, has been increasingly adopted for FDP in recent studies [41]. For example, Alfaro et al. [43] compared the prediction performance of a single NN classifier with its AdaBoost ensemble, and verified that AdaBoost improved the prediction accuracy over NN. Sun et al. [64] examined the performance of AdaBoost with different base classifiers, together with the single classifiers DT and SVM, for FDP, and through an experiment on 692 Chinese listed companies, they found that AdaBoost generally outperformed the single classifiers. On the other hand, different from the former, feature-partitioning methods generate different classifiers using random feature sets that make the classifiers more diverse, selecting a subset of all features for each classifier according to the subspace rate; these have been regarded as an efficient tool for the FDP task as well [65]. Among them, RS shows its superiority in dealing with high-dimensional problems because it randomly extracts different feature subsets from the original feature space to train the base classifiers [19,45]. For instance, Hájek et al.
[46] successfully applied RS with SVM base classifiers to FDP and compared it with RSMLP (RS with MLP base learners), SVM, and MLP on real-world datasets, and the results indicated that RS consistently outperformed the single classifiers. Besides, Nanni and Petr [45] investigated the performance of RS in comparison with Bagging and other single classifiers, and found that RS produced better results on three real-world datasets. Recently, Wang et al. [19] proposed an improved RS for FDP and, through validation, found that their proposed method performed well on high-dimensional datasets compared to the benchmark methods. It can be seen from numerous existing studies that the prediction power of ensemble methods is clearly superior to that of individual classifiers and that, in particular, RS shows outstanding performance for the FDP task. Nevertheless, as the standard RS samples all features directly and randomly, it neglects the grouping properties of multiple features, which in fact weakens the feature quality. Moreover, the ambiguities among different base classifiers deserve in-depth exploitation, far beyond simple voting strategies, to ensure an optimal final prediction result for the ensemble. In summary, there are still gaps that leave FDP an urgent issue remaining to be addressed.

3. Research design

FDP has become remarkably important for academic researchers, industrial practitioners, and government regulators seeking better risk management, as it can provide helpful decision support [3,6]. However, existing studies have failed to deal with challenges arising in the current FDP field, such as non-financial features' prediction effect, the grouping structure of multiple features, and the ambiguities among ensemble members, which are essential for enhancing the performance of FDP.
Subsequently, we propose a novel and robust meta FDP framework that incorporates the feature regularizing module for identifying discriminatory predictive power of multiple features and the probabilistic fusion module for enhancing the aggregation over base classifiers.


The framework of the proposed RS²_ER mainly contains three parts: (1) data acquisition, (2) feature extraction, and (3) model construction. First, in the process of data acquisition, both financial data and non-financial disclosures are acquired from the listed companies' financial statements and financial reports on the website, and the latter need to be pre-processed. Second, in terms of feature extraction, we investigate the intrinsic difference between the accounting-based quantitative features and the qualitative features based on textual disclosures by dividing the features with grouping structures into several groups. Third, in the stage of model construction, we construct RS²_ER, which simultaneously takes into account the benefits of RS, the regularized sparse model to maximize the utilization of financial and non-financial information, and the ER aggregation rule to handle the ambiguities among the base classifiers. Fig. 1 shows the research framework.

For a clear presentation, some essential notations used in this section are defined as follows. Given a training set $D = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, where $x_i = \{x_{1,i}, x_{2,i}, \ldots, x_{c,i}, \ldots, x_{m,i}\}$ are the predictor variables, $x_i \in \mathbb{R}^m$ is the vector-space pattern, and $y_i \in \{-1, 1\}$ is the binary class label of instance $x_i$; $n$ is the number of instances and $m$ is the feature size. Moreover, define $F_1 = \{f_1^1, f_2^1, \ldots, f_i^1, \ldots, f_{n_1}^1\}$, $F_2 = \{f_1^2, f_2^2, \ldots, f_i^2, \ldots, f_{n_2}^2\}$, $\ldots$, $F_g = \{f_1^g, f_2^g, \ldots, f_i^g, \ldots, f_{n_g}^g\}$ as the feature groups with different properties, where $n_i$ ($i = 1, 2, \ldots, g$) denotes the size of each group.

3.1. Data acquisition

Data collection and data pre-processing are the two main steps of data acquisition. As analyzed before, it is necessary to acquire two forms of data: quantitative financial data and qualitative non-financial data.
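The notation defined above, a training set D, labels y ∈ {−1, +1}, and feature groups F1, …, Fg, maps directly onto a simple in-memory representation; the group names and sizes below are illustrative:

```python
# Hedged sketch: mirror the paper's notation (D, y, F1..Fg) in code.
import numpy as np

n, n1, n2 = 6, 4, 3   # instances, |F1| (financial), |F2| (sentiment) -- illustrative sizes
X = np.random.default_rng(0).normal(size=(n, n1 + n2))  # design matrix, x_i in R^m, m = n1 + n2
y = np.array([-1, 1, -1, -1, 1, 1])                     # binary labels in {-1, +1}

# Feature groups F1..Fg as column index sets (here g = 2)
feature_groups = {
    "financial": list(range(0, n1)),        # F1
    "sentiment": list(range(n1, n1 + n2)),  # F2
}

# A base learner restricted to one group sees only that group's columns
X_fin = X[:, feature_groups["financial"]]
print(X.shape, X_fin.shape)
```

Keeping the group membership explicit, rather than treating all m columns as interchangeable, is what lets the later regularization and subspace sampling respect the grouping structure.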
First, financial data, which originate from company accounting statements, can be collected from an open platform, the China Security Market Accounting Research Database, since listed companies are required to disclose accounting statements on a regular basis. Second, non-financial data can be extracted from textual disclosures, for example annual reports, which are among the most important textual disclosures reflecting the external environment with regard to the status of company operations, such as cost control, business strategies, and company management [15,33]. Since non-financial data are qualitative and unstructured, data pre-processing is indispensable for excavating the valuable information hidden in the unstructured texts. Detailed procedures such as filtering, cleaning, and term extraction are conducted to turn the textual information into a series of useful individual terms, which improves the quality of the non-financial data.

3.2. Feature extraction

In predicting financial distress, previous studies have examined both financial and non-financial features and demonstrated their desirable prediction effects for FDP [12,16]. Note that financial features derived from accounting reports and non-financial features extracted from textual disclosures presumably hold different predictive power, which can be attributed to their natural data properties. In this study, we regard a feature's data property as its predictive ability; that is, data with different properties naturally help to reveal different aspects of companies' financial states. For example, financial features can reflect companies' financial condition directly, whereas textual features reveal companies' financial risk in an indirect way by capturing environmental factors regarding companies' operation state. Consequently, by incorporating financial features together with non-financial


features, we aim to explore the features' different intrinsic properties by dividing the features with grouping structures into several groups.

Financial features are regarded as a key component of FDP predictors and have been extensively adopted [8,51]. Financial ratios, as typical examples of financial features, can be acquired from accounting statements such as balance sheets, cash flow statements, and income statements. Specifically, there is a trend of employing ratios that are: (1) frequently used by previous studies; (2) accessible from available data resources; and (3) contributing domain-specific knowledge to meet the needs of preliminary research [43,66]. Such domain knowledge provides direct insight into estimating a company's probability of financial distress and meanwhile ensures an acceptable performance on FDP. In this sense, financial features are first utilized as input variables of RS2 _ER, including financial ratios from the aspects of solvency, profitability, operational capabilities, business development capacity, capital expansion capacity, and finance structure. The extracted features can be formulated as F1 = {f_1^1, f_2^1, …, f_i^1, …, f_{n1}^1}, where the ith financial variable f_i^1 belongs to one of the financial aspects.

In addition to financial features, non-financial features extracted from companies' annual reports have been evidenced to be correlated with companies' operation status as well, stimulating a growing number of researchers to investigate their prediction effectiveness and their supplementary role to financial features [12,33]. In detail, with non-financial information, the impacts of the sentiment of companies' textual disclosures on FDP have been extensively examined, and one of the most attractive findings is that positive or negative narrative reflects a company's financial risks.
As another typical example of non-financial features, textual features, which are extracted to represent financial texts and to capture the relevant environmental factors, have shown remarkable predictive ability for FDP [16,19]. Subsequently, in this study, we additionally extract sentiment and textual features to promote the prediction power. Concretely, in terms of sentiment analysis, we employ the common lexicon-based approach to extract sentiment features, which filters candidate words with groups of external sentiment categories. Among sentiment polarities, the positive and negative kinds have been most frequently adopted to measure annual reports' narrative tone, whose correlation with financial risk can be used to indicate financial distress [67]. In this study, to make full use of the emotional messages contained in the textual disclosures, we followed [68] and employed not only negative (e.g. loss, bankruptcy, problem, weak, etc.) and positive (e.g. effective, gain, strong, succeed, etc.) sentiment categories, but also litigious (e.g. allege, amend, bail, contract, etc.), modal strong (e.g. always, definitely, strongly, undoubtedly, etc.), and modal weak (e.g. nearly, seldom, sometimes, suggest, etc.) polarities for the FDP task. The process of extracting sentiment features is as follows: first, a certain number of annual reports are collected from the website on which listed companies disclose their business information periodically. Second, tokenization and denoising are conducted on the annual reports in order to acquire a set of candidate terms. Third, by matching the extracted candidate terms with the lexicons and counting their corresponding occurrence frequencies, the above five kinds of sentiment features can be derived, represented as F2 = {f_1^2, f_2^2, …, f_i^2, …, f_{n2}^2}.
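The lexicon-matching and frequency-counting step can be sketched as follows. The tiny English lexicons and the `sentiment_features` helper are illustrative stand-ins only, not the lexicons actually used in the study, and the tokens are assumed to come from the tokenization and denoising step (Chinese reports would additionally require a word segmenter):

```python
from collections import Counter

# Toy lexicons for illustration; the real categories are far larger.
LEXICONS = {
    "positive": {"effective", "gain", "strong", "succeed"},
    "negative": {"loss", "bankruptcy", "problem", "weak"},
    "litigious": {"allege", "amend", "bail", "contract"},
    "modal_strong": {"always", "definitely", "strongly", "undoubtedly"},
    "modal_weak": {"nearly", "seldom", "sometimes", "suggest"},
}

def sentiment_features(tokens):
    """Count occurrences of each sentiment category in a tokenized report."""
    counts = Counter(tokens)
    return {cat: sum(counts[w] for w in words)
            for cat, words in LEXICONS.items()}

report = ["loss", "weak", "loss", "strong", "definitely", "market"]
f2 = sentiment_features(report)
```

Each document thus yields one count per sentiment category, which together form its entry in F2.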
For textual features' extraction, the text pre-processing step is performed first, as it is fundamental to converting textual information into a structured form. Next, we adopt a standard textual feature extraction method, Bag of Words (BOW), which represents all words appearing in annual reports as a document-term matrix via the n-gram model [12,69]. In particular, unigram, bigram, and trigram features have been demonstrated to be three kinds of simple and effective features for text classification, prompting us to apply them. In addition, the efficient weighting method of term frequency-inverse document frequency (TF-IDF) is adopted to measure the relative importance of each stem [18,53]. As a result, a set of unigram, bigram, and trigram features with corresponding TF-IDF weights is extracted from the annual reports. To eliminate the effects of irrelevant factors such as noise and redundancy among the originally extracted grams, the feature selection method Information Gain (IG) is adopted to recognize and retain the most informative features for the prediction task [70–72]. Specifically, all unigrams, bigrams, and trigrams with an information gain larger than 0.0025 are retained and act as the textual inputs of the following model, forming the textual feature set F3 = {f_1^3, f_2^3, …, f_i^3, …, f_{n3}^3}.

Fig. 1. Research framework for financial distress prediction.

Furthermore, to explore the respective effects of the above multiple features, we divide the features with intrinsic distinctions into three groups: (1) financial features, (2) sentiment features, and (3) textual features. Moreover, in order to verify the supplementary role of non-financial features to financial features, we combine F1 with F2 and F3 separately, forming the integrated feature sets F1+F2 and F1+F3. Similarly, F2 and F3 are integrated into F2+F3 to investigate the full effect of non-financial features. Lastly, to make joint use of the financial and non-financial features and to improve the predictive accuracy of FDP as much as possible, all of F1, F2, and F3 are integrated into F1+F2+F3.

3.3. Model construction

The FDP task is a typical class-imbalanced problem, since the instance distribution of the two classes, ''distressed company'' and ''healthy company'', is skewed.
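The construction of the textual feature set F3 described in Section 3.2 (n-gram extraction, TF-IDF weighting, and IG-based filtering with the 0.0025 threshold) can be sketched in a simplified, self-contained form; the toy corpus and helper functions below are illustrative only, not the study's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Unigrams, bigrams, and trigrams of a tokenized document."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Term frequency times inverse document frequency, per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def info_gain(docs, labels, term):
    """IG of a term's presence/absence with respect to the binary label."""
    def H(ys):
        out = 0.0
        for y in set(ys):
            p = ys.count(y) / len(ys)
            out -= p * math.log2(p)
        return out
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    cond = sum(len(s) / len(labels) * H(s) for s in (present, absent) if s)
    return H(labels) - cond

# Toy corpus: grams from two "distressed" (1) and two "healthy" (0) reports.
docs = [ngrams(d) for d in (["debt", "loss"], ["loss", "risk"],
                            ["profit", "growth"], ["profit", "stable"])]
labels = [1, 1, 0, 0]
weights = tfidf(docs)
kept = [t for t in {g for d in docs for g in d}
        if info_gain(docs, labels, t) > 0.0025]
```

Only the grams surviving the IG threshold enter the model, carrying their TF-IDF weights as feature values.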
Traditionally, given the imbalanced characteristic of FDP, classifiers tend to treat imbalanced data as if it were balanced, overlooking the minority-class instances in the classification process and hence producing inaccurate predictions [73]. Under these circumstances, extensive efforts have been made on the imbalanced problem so as to minimize its adverse effects on classification accuracy. In general, the approaches for imbalanced problems can be sorted into two categories: (1) data-level approaches, whose main idea is to rebalance the class distribution by resampling the data space; and (2) algorithm-level approaches, which modify existing classification algorithms to bias learning toward the minority class or develop new algorithms to deal with the class-imbalanced problem [19]. Examples of the former approaches include Over-sampling (OS), Under-sampling (US), and

Synthetic Minority Over-sampling Technique (SMOTE). Specifically, OS duplicates minority-class instances to balance the data distribution, while US eliminates majority-class instances. SMOTE generates new instances between a minority-class instance and its neighbors in the same class instead of directly duplicating minority-class instances. Moreover, examples of algorithm-level approaches include cost-sensitive methods and ensemble methods [74]. In ensemble methods, the errors generated by individual classifiers can be effectively compensated by the combination of different classifiers, hence reducing the overall classification error. Faced with the high-dimensional datasets in FDP, ensemble methods have become a prevalent research line in this field for handling the class-imbalanced problem [75]. From the training-set partitioning view, ensemble methods can be divided into instance-partitioning methods, which generate their base classifiers by bootstrapping the training instances, and feature-partitioning methods, which train base classifiers using random feature subsets from the original feature space [65]. Among these ensemble methods, feature-partitioning-based RS shows its superiority in dealing with high-dimensional problems and in providing each classifier with an overlapping but less complex hypothesis space, endowing it with redundancy tolerance. It is a consensus that both the prediction accuracy of the base classifiers and the diversity among them relate to the final prediction performance of an ensemble [3]. As for accuracy, it is the basic requirement for the base classifiers, since the prediction effect of integrating base classifiers with poor performance may be worse than that of a single advanced classifier, which would make the ensemble itself controversial [45]. Intuitively, once a classification model is constructed, its ability to predict accurately depends solely on the quality of the feature space.
That is, the features represent the real hypothesis space. Meanwhile, diversity intrinsically requires that the knowledge possessed by distinct base classifiers be as non-coincident as possible, which is essentially related to the way feature subsets are generated. For the FDP task, non-financial features extracted from textual disclosures have been demonstrated to be effective, and they are integrated with financial features to realize accurate FDP. However, the direct integration of these different types of features weakens the feature quality and ultimately deteriorates the prediction accuracy. Moreover, in terms of diversity, it requires that the decision boundaries of base classifiers vary from each other, so that the total error can be reduced through a reasonable combination of these classifiers [43]. Traditional RS achieves diversity by virtue of the randomness of feature sampling, whereas the problem is that


such randomness leads to ambiguities among the base classifiers, so that the prediction accuracy of each classifier cannot be ensured. In this situation, traditional aggregation rules for base classifiers, such as voting and averaging, fail to deal with the ambiguities among these individual classifiers, hence weakening the final prediction accuracy. Moreover, considering the class-imbalanced nature of FDP, feature selection methods help to alleviate the negative impact of class imbalance on classification. Notably, in class-imbalanced problems, the samples of the minority class can easily be discarded as noise, causing inaccurate classification of the minority class; under this circumstance, a feature selection method that removes irrelevant features from the original feature space is an efficient way to reduce this risk [19]. Besides, the errors generated by individual classifiers can be effectively compensated by the combination of different classifiers, hence reducing the overall classification error [47]. Likewise, a reasonable classifier combination strategy is needed to, on the one hand, implement a more interpretable aggregation over distinguishable base learners and, on the other hand, generate a final classifier with a decent result for the class-imbalanced problem in FDP. In short, the insufficiency of the traditional ensemble framework in incorporating multiple features, combining different classifiers, and exploiting the advantages of feature selection in class-imbalanced problems motivates us to introduce an efficient feature selection method and combination rule to obtain more reasonable FDP results. Traditional feature selection methods tend to evaluate each feature's effectiveness separately by employing a binary weighting strategy (i.e., assigning each feature a weight of 0 or 1), so that each feature is either retained or deleted [72].
In this study, it is worth noting that features extracted from financial statements and textual disclosures naturally exhibit grouping structures, as they hold discriminatory predictive power for revealing financial distress, and directly applying traditional feature selection methods would fail to make full use of them. Recently, regularized sparse methods have attracted academic attention for feature processing [76,77]. In general, (1) they are powerful and flexible, as they assign continuous non-negative weights to features by means of regularized penalization; (2) they produce sparse solutions for features without any alteration of the original feature representation, making the model more interpretable; (3) they are supported by theoretical and empirical properties; and (4) they are robust to data noise and show obvious superiority in reducing training time and computational complexity, which makes the model faster and more cost-effective [78]. As one of the representatives of regularized sparse models, the lasso (least absolute shrinkage and selection operator), proposed by Tibshirani [79], has been widely applied to pick out important features. It takes effect by imposing an L1-norm penalty on the least-squares loss function, which yields sparsity among individual features but neglects the features' intrinsically different characteristics mentioned above; moreover, it judges distinct features by the same measurement, deteriorating the feature quality. Subsequent researchers have proposed improved lasso-based models: the group lasso, for example, accounts for features' grouping structures and remedies this deficiency of the lasso by selecting or abandoning homogeneous features simultaneously [80]. A further extension of the group lasso, the sparse group lasso (SGL), combines the lasso and group lasso and brings sparsity effects at both the group and within-group levels [81].
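The relationship among the three penalties can be illustrated numerically. The following sketch, with hypothetical coefficients and groups, computes the lasso, group-lasso, and SGL penalty terms, the last being the convex combination described in the text:

```python
import math

def lasso_pen(beta):
    """L1 penalty: sum of absolute coefficients."""
    return sum(abs(b) for b in beta)

def group_pen(beta, groups):
    """Group-lasso penalty: sqrt(group size) times the group's L2 norm."""
    return sum(math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
               for g in groups)

def sgl_pen(beta, groups, alpha, lam):
    """Sparse group lasso penalty: convex combination of the two."""
    return (1 - alpha) * lam * group_pen(beta, groups) + alpha * lam * lasso_pen(beta)

# Hypothetical coefficients for two feature groups (e.g. F1 and F2).
beta = [1.0, 0.0, 0.0, 2.0, 0.0]
groups = [[0, 1, 2], [3, 4]]
pen = sgl_pen(beta, groups, alpha=0.5, lam=1.0)
```

Setting alpha = 0 reduces `sgl_pen` to the group lasso and alpha = 1 to the lasso, mirroring the degenerate cases discussed later for formula (1).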
In this study, considering the naturally grouped features in FDP, such as the financial, sentiment, and textual features, we adopt SGL to enhance RS and thereby facilitate the prediction performance. Concretely, for the distinct groups of financial features (F1), sentiment features (F2), and textual features


(F3), SGL can identify the features' discriminatory prediction power among feature groups and within each feature group by endowing them with varied weight vectors. In detail, it estimates the informativeness of both the feature groups and the individual features for FDP, setting their coefficients to discriminate their different prediction power. SGL employs both groupwise sparsity and within-group sparsity to identify particularly ''important'' features for FDP. It should be noted that groupwise sparsity refers to the number of feature groups with at least one nonzero coefficient, and within-group sparsity refers to the number of individual features with nonzero coefficients in each feature group [81]. In this way, both the relative importance of the feature groups and the informativeness of individual features can be obtained, so as to achieve overall sparsity. Concretely, through this process, the different prediction power of feature groups such as F1, F2, and F3 can be estimated; for example, whether non-financial features are supplementary to financial features can be discerned. This is because, if non-financial features are identified as insignificant, their feature weights derived by SGL will be set to zero and they will not contribute to the prediction accuracy; otherwise, these non-financial features, with corresponding nonzero coefficients, will work in conjunction with financial features for highly accurate FDP. Moreover, the distinct prediction power of individual features in each feature group can be properly handled via their different coefficients in prediction, so that the more informative features in each feature group can be identified and play a relatively important role in FDP. On the other hand, the conventional combination rules of RS, including majority voting, averaging, and so on, fail to handle the ambiguities among different base classifiers well [65].
For example, majority voting assumes that each classifier plays the same role in classification and only keeps the prediction of the unanimous group with the majority of members, which is indeed far from the requirement of an efficient ensemble that calls for accurate as well as diverse learners. Comparatively speaking, the weighted-average aggregation strategy evaluates the relative importance of base classifiers from an ensemble perspective by assigning each of them a weight, and combines the outputs of all base classifiers in a linear manner. In fact, measuring the ambiguities among individual classifiers depends not only on the members' relative importance but also on the reliability of each member itself, which indicates its ability to provide correct assessments or predictions. Recently, Yang et al. proposed a reasonable fusion strategy, the ER rule [82], an extension of classical Dempster–Shafer (D–S) theory that introduces belief functions to quantify the evidence available from multiple sources and then combines evidence through Dempster's rule of combination. The ER rule has been widely adopted in the fields of decision support [83,84], complex system modeling [85], and classification [86] for its advantages: (1) it provides an evidential reasoning framework for evidence combination and information fusion in Artificial Intelligence (AI); (2) it handles the insufficiency of D–S theory when facing completely conflicting evidence and not fully reliable evidence by modifying the initial belief function; and (3) it assigns both a weight and a reliability to each piece of evidence to measure the different properties of evidence. Hence, in this study, the ER rule is added into RS to realize a more efficient ensemble with a desired level of accuracy and diversity while minimizing information loss.
In such a setting, evidence is defined as a probability distribution and understood as the learning result (decision result) of a base classifier (decision-maker) [50]. Moreover, the weight is introduced to measure the relative importance of an individual classifier compared to the others, while the reliability reflects the inherent property (e.g., prediction performance) of each classifier.



Therefore, by virtue of the SGL model and the ER rule, we propose a novel and unified ensemble framework for FDP, in which not only are multiple features well integrated to exert their maximal value, but distinct base classifiers are also combined within a probabilistic reasoning process to tackle their ambiguities in prediction. Concretely, our proposed RS2 _ER first resolves the input features' intrinsic structures and divides them into several groups. After this, the SGL model is added to RS2 _ER to cope with the multiple features' intrinsically different properties. In the process of classifier aggregation, with the help of the ER rule, RS2 _ER treats base classifiers dissimilarly by assigning them weights to show their relative importance and reliabilities to represent their respective abilities in coping with the given prediction task. The proposed RS2 _ER is detailed next and its two steps are presented in Fig. 2. As shown in Fig. 2, in step 1 of RS2 _ER, the base classifiers are trained on a set of subsets selectively generated from the original dataset. To begin with, given the input features, SGL estimation is applied to derive a sparse weight vector at the group and within-group levels, which is described as follows:

min_β (1/(2n)) ∥y − ∑_{k=1}^{l} X_k β^(k)∥₂² + (1 − α)λ ∑_{k=1}^{l} √p_k ∥β^(k)∥₂ + αλ ∥β∥₁   (1)

where the first penalty term (1 − α)λ ∑_{k=1}^{l} √p_k ∥β^(k)∥₂ is designed to produce sparse effects for feature groups and the second penalty term αλ ∥β∥₁ places sparsity upon individual features, with ∥·∥_q denoting the L_q norm. Besides, p_k refers to the number of predictor variables in group k, and l is the number of disjoint groups. Corresponding to the predictor matrix X_k, a regression coefficient vector β^(k) = (β_1^(k), …, β_{p_k}^(k))^T is derived as the kth group's weight vector, where β_j^(k) is the weight of the jth feature in group k. Moreover, the parameter α ∈ [0, 1] implements a convex combination of the aforementioned lasso and group lasso. Specifically, when α = 0 the formula degenerates to the group lasso, whereas α = 1 makes it degenerate to the lasso. Additionally, the tuning parameter λ adjusts the degree of shrinkage over both features and groups. When λ is set to a relatively large value, the weights of features with lower absolute coefficient values are compressed to 0, so that they are neglected as unnecessary or redundant features in the subsequent feature sampling process. Conversely, when λ is relatively small, a large number of features are retained owing to the weak constraint effect. In summary, both λ and α play important roles in achieving high-quality feature selection [81]. Taking the derived weight vector β^(k) as the sampling probabilities of the features, K feature subspaces are generated from the original dataset D; this process is additionally guided by another parameter, the subspace rate R, which adjusts the sampling scale. Finally, denote the obtained subsets as {D_sub^1, D_sub^2, …, D_sub^k, …, D_sub^K}, where the kth random feature subspace is D_sub^k = {(x_1^k, y_1^k), …, (x_i^k, y_i^k), …, (x_n^k, y_n^k)} with 1 ≤ k ≤ K. Taking the basic requirements of ensemble methods into account, we adopt SVM as the base classifier in RS2 _ER.
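The probabilistic feature sampling that turns the SGL weights into K subspaces can be sketched as follows; this is a simplified stand-in using sequential weighted draws without replacement, and the function name and weight vector are illustrative only:

```python
import random

def draw_subspace(weights, rate, rng):
    """Sample round(rate * m) distinct feature indices, with selection
    probability proportional to the absolute SGL coefficient; features
    shrunk to zero can never be drawn."""
    m = len(weights)
    size = round(rate * m)
    pool = [j for j in range(m) if abs(weights[j]) > 0]
    chosen = []
    for _ in range(min(size, len(pool))):
        total = sum(abs(weights[j]) for j in pool)
        r = rng.random() * total
        for idx, j in enumerate(pool):
            r -= abs(weights[j])
            if r <= 0:
                chosen.append(pool.pop(idx))
                break
    return chosen

rng = random.Random(0)
beta = [0.8, 0.0, 0.3, 0.0, 0.5, 0.1]   # hypothetical SGL weight vector
subspaces = [draw_subspace(beta, rate=0.5, rng=rng) for _ in range(4)]  # K = 4
```

Features with zero SGL weight are excluded outright, so the randomness of RS is steered toward the informative part of the feature space.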
On the one hand, SVM is trained with random feature subsets from the original feature space so as to reduce the similarity between classifiers and improve the diversity among them. On the other hand, SVM can achieve satisfactory prediction accuracy because (1) it provides an effective technique for handling the high-dimensional features encountered in the training dataset; and (2) it has fairly good generalization performance on the basis of structural risk minimization. In fact, the goal of SVM is to find an optimal hyperplane separating one class from the other; the resulting linear classifier f(x) satisfies f(x) = sgn(∑_{i=1}^{n} y_i α_i K(x_i, x) + b), where the α_i are Lagrange multipliers and K(x_i, x_j) is a kernel function generating the inner products needed to construct machines with different types of non-linear decision surfaces in the input space. Classical kernel functions include the linear kernel, the sigmoid kernel, and the radial basis function. The methodology to achieve this goal is based on the solution of a convex quadratic optimization problem described in formulas (2) and (3):

min_{w,b} (1/2) ∥w∥² + γ ∑_{i=1}^{n} ξ_i   (2)

subject to y_i(⟨w, x_i⟩ + b) + ξ_i − 1 ≥ 0 (ξ_i ≥ 0, i = 1, …, n)   (3)

In formulas (2) and (3), w is the weight vector, b is the bias term, γ represents the regularization parameter determining the trade-off between maximizing the margin and minimizing the classification error, and ξ_i is the non-negative slack variable for x_i. Subsequently, an SVM classifier is trained on each feature subset separately, deriving a set of base classifiers C = {SVM_1, …, SVM_i, …, SVM_K}. The probabilistic outputs of classifier SVM_i are defined as p_i for the positive class and 1 − p_i for the negative class. In step 2 of RS2 _ER, an effective combination of the individual classifiers is conducted using the ER rule. In the application scenario of the ER rule, two requirements on evidence must be met: the pieces of evidence should be independent of each other and should be assigned a weight and a reliability simultaneously [87]. As a result, following [88], we equate classifiers with evidence and define the independence of classifiers as the condition that the outputs of one individual classifier are independent of, and uninfluenced by, the outputs of the others. Next, we specify a proper weight and reliability for each classifier, so that all of them can be combined in a more reasonable way. In general, the weight reflects the relative importance of a classifier versus the others, while the reliability represents the ability of a classifier to provide accurate results for the given classification problem; the two are clearly distinct in meaning. Therefore, here the weights w_i are set equal, and the reliability r_i is represented by the AUC (Area Under the Curve) value, which is widely used to evaluate the performance of a binary classifier; apparently, the reliability of each single classifier depends on its own classification accuracy. The concrete aggregation procedures are as follows: define Θ as a frame of discernment Θ = {h_1, h_2, …
, h_q}, where Θ contains mutually exclusive and exhaustive hypotheses and 2^Θ denotes its power set. Then, a piece of evidence can be described by a belief distribution as:

e_i = {(h, p_{h,i}), ∀h ⊆ Θ}, with ∑_{h⊆Θ} p_{h,i} = 1   (4)

where (h, p_{h,i}) indicates that the evidence e_i supports the hypothesis h with a probability of p_{h,i}, which is generated by the classifier SVM_i. The reasoning process in the ER rule is performed by generating a new belief distribution with the following:

m_i = {(h, m̃_{h,i}), ∀h ⊆ Θ; (P(Θ), m̃_{P(Θ),i})}   (5)

where m̃_{h,i} expresses the degree of support for h from Θ, with both weight and reliability taken into consideration, and can be defined as:

m̃_{h,i} = { 0,              h = ∅
          { w̃_i p_{h,i},    h ⊆ Θ, h ≠ ∅   (6)
          { 1 − w̃_i,        h = P(Θ)

In the above formula, w̃_i = w_i/(1 + w_i − r_i) serves as a hybrid weight incorporating both weight and reliability. Then, the



Fig. 2. The process of the RS2 _ER.

ER rule combines the various pieces of evidence using a recursive method. Firstly, two pieces of independent evidence e_1 and e_2, which jointly support the hypothesis h with the probabilities or belief degrees p_{h,1} and p_{h,2}, are combined. The combined evidence e(2) is calculated by:

p_{h,e(2)} = { 0,                                   h = ∅
             { m̂_{h,e(2)} / ∑_{D⊆Θ} m̂_{D,e(2)},    h ⊆ Θ, h ≠ ∅   (7)

m̂_{h,e(2)} = [(1 − r_2) m_{h,1} + (1 − r_1) m_{h,2}] + ∑_{B∩C=h} m_{B,1} m_{C,2}, ∀h ⊆ Θ   (8)

Secondly, using the idea of recursion, Z pieces of independent evidence jointly in favor of the hypothesis h are further combined. The combined evidence e(i) at the ith step is generated by:

m̂_{h,e(i)} = [(1 − r_i) m_{h,e(i−1)} + m_{P(Θ),e(i−1)} m_{h,i}] + ∑_{B∩C=h} m_{B,e(i−1)} m_{C,i}, ∀h ⊆ Θ   (9)

m̂_{P(Θ),e(i)} = (1 − r_i) m_{P(Θ),e(i−1)}   (10)

After normalization, the final combined support for the hypothesis h can be calculated as:

p_h = m̂_{h,e(Z)} / (1 − m̂_{P(Θ),e(Z)})   (11)

According to the above formulas, the various pieces of evidence can be combined in conjunction with the weight and the reliability of each piece. Subsequently, the final prediction on the hypothesis h is determined by the largest support.

3.4. The RS2 _ER algorithm

In this section, the main procedures of the RS2 _ER algorithm are described in detail. Firstly, given the dataset D, the feature weight vector is obtained by the SGL estimation described in formula (1); in this process, the parameter α plays an important role in balancing the effects of the lasso and group lasso, while λ adjusts the degree of shrinkage over both features and groups. Next, with the feature weight vector β^(k) and the subspace rate R, K subspaces are generated by probabilistic sampling. Based on this, K base classifiers are trained respectively on the K subsets. At last, the outputs of all classifiers are combined by the ER rule to produce the final result.
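A minimal sketch of the ER-rule aggregation over the base classifiers' probabilistic outputs is given below. It follows the recursive combination of formulas (5)-(11), restricted to singleton hypotheses (the two FDP classes) plus the residual mass on P(Θ); the per-step renormalization and all names are simplifying assumptions, not the authors' implementation:

```python
def discount(probs, w, r):
    """Turn one classifier's class probabilities into a mass function
    using the hybrid weight w~ = w / (1 + w - r)."""
    wt = w / (1.0 + w - r)
    m = {h: wt * p for h, p in probs.items()}
    m["P"] = 1.0 - wt          # residual mass on the power set P(Theta)
    return m

def er_combine(prob_list, weights, reliabilities):
    """Recursively combine classifier outputs (singleton hypotheses only)."""
    m = discount(prob_list[0], weights[0], reliabilities[0])
    for probs, w, r in zip(prob_list[1:], weights[1:], reliabilities[1:]):
        mi = discount(probs, w, r)
        # For singletons, B and C intersect in h only when B = C = h.
        new = {h: (1 - r) * m[h] + m["P"] * mi[h] + m[h] * mi[h]
               for h in probs}
        new["P"] = (1 - r) * m["P"]
        total = sum(new.values())                  # renormalize the masses
        m = {h: v / total for h, v in new.items()}
    return {h: m[h] / (1.0 - m["P"]) for h in m if h != "P"}

# Two base SVMs: the more reliable one (AUC 0.9) favors "distressed".
outputs = [{"distressed": 0.8, "healthy": 0.2},
           {"distressed": 0.3, "healthy": 0.7}]
final = er_combine(outputs, weights=[1.0, 1.0], reliabilities=[0.9, 0.6])
```

The combined beliefs sum to one, and the class backed by the more reliable classifier receives the larger final support, illustrating how reliability tempers the influence of each piece of evidence.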



Fig. 3. Pseudo-code of the RS2 _ER algorithm.

Pseudo-code of the RS2 _ER algorithm is presented in Fig. 3.

4. Experimental design

To demonstrate the effectiveness of the proposed RS2 _ER, experiments based on a real-world dataset are conducted in this section. We first give a detailed description of the experimental datasets, then present the chosen evaluation metrics, and finally introduce the concrete experimental procedure.

4.1. Experimental dataset

For the experimental datasets, two essential points require attention: (1) the indicator of the listed companies' financial distress; and (2) the time periods in which data are collected and companies' financial distress is predicted. According to the Specially Treated (ST) mechanism established by the Shanghai and Shenzhen Stock Exchanges in China, we regard companies that have undergone annual losses in two consecutive years as financially distressed examples, a strategy adopted by earlier studies [9,66]. Moreover, given such a mechanism, it is of little use to predict a company's financial distress with data collected one or two years before the year it receives the ST label. Consequently, in this study, viewing the ST year

as the benchmark year T-0, we use financial and non-financial indicators from T-3 to predict financial distress. In total, 1835 listed companies are selected as experimental instances from the Shenzhen Stock Exchange and the Shanghai Stock Exchange of China, 261 of which have been labeled as ST (positive instances), while the remaining are sound (negative) companies. The specific steps are described below. Firstly, to collect financial features, 39 accounting ratios are selected from the China Security Market Accounting Research Database (CSMARD). Table 1 shows the concrete definitions of the 39 accounting ratios in the aspects of solvency, profitability, operational capabilities, business development capacity, capital expansion capacity, and finance structure. Secondly, in terms of sentiment features, we collect sentiment categories including positive words, negative words, modal strong words, and modal weak words from the Chinese HowNet sentiment lexicon (http://www.keenage.com/download/sentiment.rar). In addition, litigious words are collected from the Chinese Sougou lexicon (https://pinyin.sogou.com/dict/). Some examples of the sentiment words are listed in Table 2. Thirdly, as for textual features (F3), unigrams, bigrams, and trigrams extracted from the financial reports are applied for the FDP task. It should be noted that the financial reports are collected from the websites of the Shenzhen Stock Exchange and the Shanghai Stock Exchange.



Table 1 Financial variable list.

Profitability:
X1: (Sales revenue − sales cost)/sales revenue
X2: Net profit/sales revenue
X3: Earnings before income tax/average total assets
X4: Net profit/average total assets
X5: Net profit/average current assets
X6: Net profit/average fixed assets
X7: Net profit/average shareholders' equity

Business Development Capacity:
X8: Main business income of this year/main business income of last year
X9: Net profit of this year/net profit of last year
X10: Total assets of this year/total assets of last year
X11: Net assets of this year/net assets of last year

Solvency:
X12: Total liabilities/total assets
X13: Current assets/current liabilities
X14: (Current assets − inventory)/current liabilities
X15: Total liabilities/total shareholders' equity
X16: Current liabilities/total assets
X17: Earnings before interest and tax (EBIT)/interest expense
X18: Net operating cash flow/current liabilities
X19: Non-current liabilities/(non-current liabilities + owners' equity)
X20: Net operating cash flow/total liabilities

Operational Capabilities:
X21: Main business income/average total assets
X22: Sales revenue/average fixed assets
X23: Main business cost/average inventory
X24: Main business income/average balance of accounts receivable
X25: Sales revenue/average current assets
X26: Cost of sales/average payable accounts
X27: Sales revenue/average working capital

Capital Expansion Capacity:
X28: Net increase in cash and cash equivalents/number of ordinary shares
X29: Net assets/number of ordinary shares
X30: Net profit/number of ordinary shares
X31: Capital reserves/number of ordinary shares

Finance Structure:
X32: Current assets/total assets
X33: Fixed assets/total assets
X34: Shareholders' equity/fixed assets
X35: Current liabilities/total liabilities
X36: Cash flow/total assets
X37: Accounts receivable/total liabilities
X38: Current liabilities/shareholders' equity
X39: Working capital/total assets

Table 2 Some examples of sentiment words.

features, ranked in the Top 20 by term frequency in the financial reports, are listed in Table 3.
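The sentiment (F2) and textual (F3) feature construction described above can be sketched in a few lines. The paper builds these features with WEKA's StringToWordVector and the HowNet/Sougou lexicons; the tokenization and the tiny lexicons below are hypothetical stand-ins used only to show the mechanics.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def disclosure_features(tokens, pos_words, neg_words):
    """Build F2 (sentiment counts) and F3 (uni/bi/trigram term frequencies)
    from one tokenized report; the lexicon sets are hypothetical examples."""
    feats = {
        "n_positive": sum(t in pos_words for t in tokens),
        "n_negative": sum(t in neg_words for t in tokens),
    }
    for n in (1, 2, 3):
        for g in ngrams(tokens, n):
            feats[g] = feats.get(g, 0) + 1
    return feats
```

For example, the tokens ["revenue", "declined", "sharply"] with a negative lexicon containing "declined" yield n_negative = 1 and one occurrence of the bigram "revenue declined".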

4.2. Evaluation metrics

To evaluate the prediction performance of the proposed RS2_ER, we employ the area under the ROC curve (AUC) as the evaluation metric. For the binary classification problem, there are four kinds of possible results: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The ROC curve takes the FP rate (FPR) as the horizontal axis and the TP rate (TPR) as the vertical axis, which are defined as follows:

FPR = FP/(FP + TN)  (12)

TPR = TP/(TP + FN)  (13)

However, the ROC curve alone commonly does not display the performance of different classifiers intuitively; instead, the area under the ROC curve (AUC) is more extensively adopted to summarize classification performance, as it remains unchanged when the distribution of positive and negative instances changes [19].
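Equations (12) and (13) and the resulting AUC can be computed directly by sweeping decision thresholds; a minimal sketch (function and variable names are ours) might look like:

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs per Eqs. (12)-(13), one per decision threshold."""
    thresholds = sorted(set(scores), reverse=True)
    P = sum(labels)              # number of positive instances
    N = len(labels) - P          # number of negative instances
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))   # (FPR, TPR)
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfectly ranked score vector gives AUC = 1.0, while interleaved rankings reduce the area accordingly.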


Table 3
Some key examples of textual features.

4.3. Experimental procedure

In this section, we present a clear view of the experimental procedure, where the benchmark methods are elaborated first. To verify the comparatively superior performance of the proposed method, several benchmark methods, including SVM, Over-sampling with SVM (OS-SVM), Under-sampling with SVM (US-SVM), Synthetic Minority Oversampling Technique with SVM (SMOTE-SVM), Bagging (Bootstrap Aggregating), AdaBoost, and RS, are introduced for the FDP task. On the one hand, in light of the class-imbalance nature of FDP, three kinds of sampling-based methods are introduced to validate the effectiveness of RS2_ER in dealing with class-imbalanced classification. Note that Over-sampling, Under-sampling, and SMOTE address the imbalance problem by resampling the data space to rebalance the class distribution; they are preprocessing steps for classification. Hence, after the resampling process, SVM is selected as the base classifier to make the final classification. On the other hand, since the superiority of ensemble learning methods in the field of FDP has been demonstrated, it is necessary to compare them with the proposed RS2_ER so as to exclude the possibility that RS2_ER achieves its predictive advantages by chance. It should be noted that SVM is employed as the base classifier of the traditional ensemble learning methods and voting is used to combine the prediction results of the base classifiers. The benchmark methods are listed below:
(1) SVM, a classical and commonly used machine learning method that works as the base classifier in this study [6,39].
(2) OS-SVM (Over-sampling with SVM): over-sampling deals with the class-imbalance problem by increasing the number of minority-class samples so as to balance the class distribution [18]. For the FDP task, the balanced instances from OS are used to train the base classifier SVM for the final prediction.
(3) US-SVM (Under-sampling with SVM): under-sampling reduces the number of majority-class samples to balance the data, and SVM is selected as the base classifier to predict financial distress [89].
(4) SMOTE-SVM (Synthetic Minority Oversampling Technique with SVM): an improved version of over-sampling which creates instances by random interpolation between the original minority-class instances and their k nearest neighbors [89]. In FDP, the resampled instances from SMOTE are used to make a prediction by the base classifier SVM.
(5) Bagging (Bootstrap Aggregating): a simple ensemble method built on bootstrap sampling, which has been proved effective for FDP by previous studies [56]. Here we select SVMs as the base classifiers and combine their prediction results by the voting method.
(6) AdaBoost: a common ensemble method that generalizes the original Boosting algorithm and is commonly

Table 4
AUC (%) under different feature sets and methods. The best value in each column is achieved by RS2_ER, except for F2, where SMOTE-SVM performs best.

Method       F1      F2      F3      F1+F2   F1+F3   F2+F3   F1+F2+F3
SVM          88.56   55.96   78.61   90.88   91.23   82.82   93.17
OS-SVM       85.79   69.58   82.20   87.85   90.30   85.27   91.24
US-SVM       83.98   66.86   80.62   85.98   88.79   84.32   90.25
SMOTE-SVM    86.24   69.60   81.96   91.12   92.21   85.15   93.45
AdaBoost     88.62   63.14   81.45   91.03   92.42   84.87   93.41
Bagging      89.05   66.41   81.61   91.32   92.63   85.61   93.73
RS           89.19   66.89   82.09   91.79   93.78   86.26   94.89
RS2_ER       91.24   68.14   84.46   93.70   95.15   87.70   96.09

used in recent studies for FDP [43,64]. Note that the base classifiers and the combination rule of AdaBoost are the same as those of Bagging so as to make a fair and comparable evaluation.
(7) RS (Random Subspace): the benchmark of our proposed method RS2_ER. It trains each base classifier (SVM) on a random subset of the features rather than on all features and combines the prediction results via the voting method [19,45].

In order to process the textual annual reports, we employ the StringToWordVector module of WEKA. Besides, we implement SVM, SMOTE, Bagging, Boosting, and RS with the SMO, SMOTE, Bagging, AdaBoostM1, and RandomSubSpace modules, respectively. We conducted the experiments with 10-fold cross-validation and, considering the instability introduced by a single random division into training and test sets, repeated the cross-validation 5 times and took the average as the final result.

5. Experimental results and discussion

5.1. Experimental results

In this section, the experimental results on AUC are reported in Table 4 for comparisons across different feature sets and prediction methods. In order to present the results more explicitly, the best value of the methods on each feature set is highlighted in boldface. On the one hand, among all feature sets, F1+F2+F3 achieves the highest values across all methods, demonstrating the rationality of simultaneously using financial and non-financial features for FDP. Furthermore, the utilization of these features reaches its highest level under our proposed RS2_ER, which indicates that our feature fusion mechanism, dividing features with discriminatory predictive power into distinct groups to maximize their prediction performance, plays an effectual role. On the other hand, compared with the class-imbalance methods, RS2_ER outperforms them on all feature sets except F2.
Concretely, on F2 the sampling method SMOTE-SVM reaches the highest AUC value (69.60%), while OS-SVM achieves the second highest (69.58%), which can be explained by the superiority of sampling methods in dealing with class-imbalance problems in a low-dimensional feature space. Moreover, in contrast to the traditional ensemble methods, RS2_ER achieves noticeable improvements. For instance, on feature set F1+F3, RS2_ER achieves the highest AUC value (95.15%), followed by the traditional RS (93.78%), Bagging (92.63%), and AdaBoost (92.42%). Besides, it improves the prediction performance of the traditional RS by nearly 1.2% on the F1+F2+F3 set, which illustrates the necessity of modeling the grouping structures of multiple features and distinguishing the ambiguities among individual classifiers.
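The Random Subspace mechanics underlying both RS and RS2_ER, training each base learner on a random fraction R of the features and combining predictions by voting, can be sketched as follows. The paper's base classifier is SVM (WEKA's SMO); the per-class centroid learner below is a deliberately simple stand-in so that the subspace sampling and the voting stay visible.

```python
import random

def train_random_subspace(X, y, n_learners=10, rate=0.5, seed=42):
    """Random Subspace: each base learner sees a random feature subset of
    size rate * n_features; the base learner here is a per-class centroid."""
    rng = random.Random(seed)
    d = len(X[0])
    k = max(1, int(rate * d))
    ensemble = []
    for _ in range(n_learners):
        feats = rng.sample(range(d), k)          # random feature subset
        centroids = {}
        for label in set(y):
            rows = [x for x, yi in zip(X, y) if yi == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows)
                                for j in feats]
        ensemble.append((feats, centroids))
    return ensemble

def predict_vote(ensemble, x):
    """Majority vote over the base learners, as in the plain RS benchmark."""
    votes = {}
    for feats, centroids in ensemble:
        sub = [x[j] for j in feats]
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, centroids[c])),
        )
        votes[pred] = votes.get(pred, 0) + 1
    return max(votes, key=votes.get)
```

Swapping the centroid learner for an SVM and wrapping the whole pipeline in repeated stratified cross-validation would reproduce the benchmark setup described in Section 4.3.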


Table 5
Significance test results (t-values) of feature comparisons.

Method       F2/F1      F3/F1     F3/F2     F1+F2/F1   F1+F3/F1   F2+F3/F1   F1+F2+F3/F1
SVM          −26.58**   −9.68**   16.48**   2.62**     3.90**     −6.45**    6.91**
OS-SVM       −17.47**   −4.56**   13.22**   2.76**     7.75**     −0.62*     8.88**
US-SVM       −17.27**   −4.07**   14.70**   1.98*      7.38**     −0.35*     8.95**
SMOTE-SVM    −16.82**   −5.14**   12.35**   6.47**     9.93**     −1.26*     12.49**
AdaBoost     −21.17**   −7.44**   14.28**   2.67**     4.88**     −3.94**    6.18**
Bagging      −23.00**   −7.72**   16.29**   2.50*      5.74**     −3.86**    7.65**
RS           −20.16**   −9.49**   14.80**   2.88**     8.44**     −3.19**    8.57**
RS2_ER       −20.10**   −7.79**   15.37**   2.78**     6.67**     −4.18**    7.48**

Notes: * p-values significant at α = 0.05. ** p-values significant at α = 0.01.

Table 6
Significance test results (t-values) of method comparisons.

Comparison           F1        F1+F2     F1+F3     F2+F3     F1+F2+F3
RS2_ER/SVM           16.94**   11.58**   11.43**   10.74**   15.10**
RS2_ER/OS-SVM        12.93**   14.64**   12.34**   6.19**    16.86**
RS2_ER/US-SVM        11.00**   14.65**   16.36**   7.45**    16.59**
RS2_ER/SMOTE-SVM     16.71**   10.95**   9.94**    7.03**    11.48**
RS2_ER/AdaBoost      6.16**    6.78**    5.36**    5.87**    6.05**
RS2_ER/Bagging       9.81**    11.16**   10.68**   5.87**    11.94**
RS2_ER/RS            9.92**    10.51**   6.41**    3.90**    5.90**

Notes: * p-values significant at α = 0.05. ** p-values significant at α = 0.01.

5.2. Statistical analysis

To verify that our experimental results are not achieved by chance, and meanwhile to test the statistical significance of the improvements obtained by the integrated feature sets and the proposed method, we conducted pair-wise t-tests on the AUC values and report the corresponding t-values in Tables 5 and 6.

In terms of features, we first examine the effectiveness of using only non-financial features, and then perform pair-wise t-tests to show the advantages of the integrated feature sets; the results of the statistical tests are shown in Table 5. As can be seen from Table 5, the prediction performance of financial features is significantly better than that of non-financial features, which indicates the dominant role of financial features in FDP and the insufficiency of using non-financial features alone, and provides guidance for selecting more valuable features for prediction. In particular, the non-financial feature set F3 is significantly superior to F2; that is, the predictive power of textual features for FDP is higher than that of sentiment features. This interesting finding, rarely noticed in previous studies, can guide the extraction of non-financial features to enhance FDP. Second, the feature sets F1+F2, F1+F3, and F1+F2+F3 are respectively compared with F1, and the statistical results show that their AUC values are significantly higher than those of the financial features alone. This demonstrates the benefit of feature integration, in accordance with the study [19]. On the other hand, to verify the effectiveness of the proposed RS2_ER, we compared it with the benchmark methods on a subset of the feature sets, and the results are shown in Table 6.
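The pair-wise tests above reduce to the paired t statistic over matched AUC samples (e.g., per cross-validation repetition). A minimal sketch; the toy AUC vectors in the usage note are illustrative, not the paper's data:

```python
import math

def paired_t(a, b):
    """Paired t statistic for two matched samples of equal length
    (e.g., AUC values of two methods over the same CV folds)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

For instance, paired_t([0.96, 0.95, 0.97, 0.94], [0.94, 0.93, 0.94, 0.93]) is about 4.9; large positive values indicate the first method consistently dominates, matching the sign convention of Tables 5 and 6.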
According to the above experimental results, the proposed RS2_ER significantly outperforms the single classifier SVM, as well as class-imbalance methods such as US-SVM and SMOTE-SVM, on all integrated feature sets, demonstrating its advantages in dealing with high-dimensional and class-imbalanced problems in FDP, and simultaneously illustrating that the grouping mechanism in our proposed RS2_ER actually contributes to enhancing

Fig. 4. AUC comparison on different features.

the utilization of multiple features. Moreover, it can be seen that RS2_ER generally performs better than the original ensemble methods such as Bagging, AdaBoost, and RS, which indicates that RS2_ER not only improves feature quality but also helps to moderate the information loss in the ensemble of base classifiers. Furthermore, the significant advantage of RS2_ER over RS illustrates the necessity of the improvement on RS, hence supporting the proposition of this study.

5.3. Discussion

5.3.1. Evaluation on multiple features and their combinations

In this section, we discuss the prediction performance of multiple features, including financial, sentiment, and textual features, and analyze their discriminatory predictive power. The results of the comparative feature analysis are shown in Fig. 4. From Fig. 4, it can be clearly seen that the feature set F1+F2+F3 under RS2_ER yields the highest AUC, verifying that the fusion of multiple features is reasonable and that our grouping mechanism realizes the maximum utilization of these features among the involved methods. Then, compared with the non-financial feature sets F2, F3, and F2+F3, F1 holds its dominant role for FDP, whereas the explanatory power of non-financial features for FDP is inferior. Similar conclusions have been drawn by several recent researchers [8,9,51]. This interesting finding can provide researchers with insight into better utilization of financial and non-financial features for FDP. Moreover, the predictive power of F2 on AUC fluctuates from 55% to 70%, relatively lower than that of F1 and F3, illustrating that using sentiment features alone is effective but not sufficient. This is because sentiment features, which mainly focus on revealing the sentiment polarities embedded in financial reports, are


Fig. 5. AUC comparison on different methods.

one-sided compared to financial features, which contain more comprehensive financial indicators, and textual features, which can reflect descriptive information about business status [52]. Next, we further explore the joint effects of financial and non-financial features on prediction by evaluating the integrated feature sets F1+F2, F1+F3, and F1+F2+F3. Concretely, the results show that F1+F2 and F1+F3 are more effective and informative than F1 for FDP, which suggests that sentiment and textual features indeed complement financial features. This is consistent with previous studies that reported the supplementary role of non-financial features in FDP [12,18]. Besides, F1+F2+F3, which unifies the contributions of financial and non-financial features, achieves the best performance under every involved method, supporting the remarkable joint effect of multiple features for FDP; a similar result has also been obtained by previous studies [19].

5.3.2. Evaluation on the enhanced integration effect of RS2_ER

To explore the predictive power of the proposed RS2_ER, we compared it with the benchmark methods on different feature sets. Fig. 5 shows the AUC analysis results of the methods on all features. First of all, it can be seen that RS2_ER outperforms the other benchmark methods on the majority of feature sets. Taking F1 as an example, RS2_ER achieves improvements of 2.68%, 5.01%, 2.19%, and 2.05% over SVM, SMOTE-SVM, Bagging, and RS, respectively, which illustrates the superiority of the proposed RS2_ER and can be attributed to its reasonable feature assessment mechanism. Similar results can be observed on other feature sets. Exceptionally, the class-imbalance methods, e.g., OS-SVM and SMOTE-SVM, achieve better results than RS2_ER on F2.
This is possibly because, for the class-imbalanced FDP problem, the dimension of F2 is relatively lower than that of the other feature sets, rendering RS2_ER, which has a strong dimensionality-reduction effect, less efficient. This interesting phenomenon indicates that our proposed RS2_ER is outstanding in tackling high-dimensional problems thanks to its advantages in feature processing, whereas in a low-dimensional feature space it cannot exploit these advantages fully. Moreover, it should be pointed out that RS2_ER performs better than AdaBoost, Bagging, and RS to varying degrees. For instance, the AUC improvements achieved by RS2_ER over AdaBoost, Bagging, and RS on F1+F2+F3 are 2.68%, 2.36%, and 1.20%, respectively. These results demonstrate the effectiveness of

Fig. 6. AUC comparison on SVM, RS and RS2 _ER.

the RS2_ER versus the traditional ensemble methods, supporting the proposition of this study. Possible explanations are as follows. First, considering that different kinds of features reveal different aspects of a company's business status and thus possess distinct discriminatory power for FDP, the direct combination of financial, sentiment, and textual features reduces feature quality and further weakens prediction performance. By dividing the multiple features into several disjoint groups of homogeneous features, RS2_ER contributes substantially to improving the prediction accuracy for FDP. Second, compared with traditional ensemble methods that combine individual classifiers with strategies such as majority voting or averaging, RS2_ER conducts a more efficient and reasonable combination of base classifiers with the ER rule. Specifically, by incorporating the ER rule, which assigns both a weight and a reliability to each classifier in the combination, RS2_ER makes full use of the ambiguities among different classifiers, thereby improving the prediction performance considerably. We further explore whether, and to what extent, improvements are achieved compared with the original RS and SVM. Fig. 6 shows this comparison on AUC. Obviously, RS2_ER maintains the highest AUC across all feature sets. In fact, on the single feature sets F1, F2, and F3, RS2_ER improves on the original RS mainly through two advantages: (1) screening redundant and irrelevant features through the regularized sparse-based feature selection method; and (2) coping with the ambiguities among the different base classifiers by incorporating the ER rule. Then, under feature set F1+F2+F3, the AUC of RS2_ER is 1.2% higher than that of RS, and 2.92% higher than that of SVM.
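The weight/reliability idea behind the ER aggregation can be illustrated with a much-simplified probability fusion. Note that this is a reliability-discounted weighted average, not the full recursive ER rule of Yang and Xu that the paper actually uses, and all names here are ours:

```python
def fuse_probabilities(probs, weights, reliabilities):
    """Simplified illustration of weight + reliability fusion over the
    two-class probability outputs of base classifiers.
    NOTE: a stand-in for intuition only, not the full ER rule."""
    fused = [0.0, 0.0]
    total = 0.0
    for p, w, r in zip(probs, weights, reliabilities):
        # discount each classifier's evidence by its reliability,
        # shrinking unreliable outputs toward an uninformative 0.5/0.5
        discounted = [r * pi + (1 - r) * 0.5 for pi in p]
        fused = [f + w * d for f, d in zip(fused, discounted)]
        total += w
    return [f / total for f in fused]
```

With equal weights, a fully reliable classifier outputting [0.9, 0.1] and a half-reliable one outputting [0.6, 0.4] fuse to [0.725, 0.275]: the unreliable evidence is pulled toward indifference before voting, which is the intuition the ER rule formalizes.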
With the addition of different features to the original feature set, we first deliver features with intrinsically distinct structures into disparate groups and then apply SGL to cope with the intrinsically different properties of the multiple features. Meanwhile, the ER aggregation rule is adopted to make distinct use of the different base classifiers. Under these circumstances, it is unsurprising that RS2_ER consistently performs better than the other methods.

5.3.3. Evaluation on the influence of parameters on the prediction effect

In the RS2_ER, there are three important parameters: the subspace rate R, the penalty parameter λ, and the convex combination parameter α of the lasso and group-lasso penalties. To explore the influence of their different settings on prediction and implement


Fig. 7. Sensitivity analysis of AUC for the RS and RS2 _ER on subspace rate.

an optimal RS2_ER, we conduct an evaluation of them in this section. Fig. 7 presents the AUC of RS and RS2_ER under different settings of R. The optimal R value varies across feature sets. For example, RS achieves its highest AUC values of 91.19% with F1, 66.89% with F2, 81.45% with F3, 91.79% with F1+F2, and 85.61% with F2+F3 as R fluctuates from 0.5 to 0.7, while it performs best on F1+F3 and F1+F2+F3 with R = 0.3. These results are acceptable because the dimensions of the extracted features differ, which causes the optimal R values to differ from each other. Furthermore, the AUC of RS as a function of R presents an inverted-U ("A"-type) curve on all feature sets; that is, the AUC first increases to an optimal point and then decreases. This coincides with the viewpoint of existing studies: (1) when R is low, the sampling of features is inadequate and some crucial features may be missed, leading to inferior feature quality; (2) as R increases, more features begin to work efficiently, improving the prediction performance; and (3) when R keeps increasing, a large number of features are extracted, which in turn allows redundant and irrelevant features to be involved, weakening the predictive power. For RS2_ER, the trend with respect to R is irregular and differs from that of RS. For example, on F1, F2, F1+F3, and F2+F3, the AUC declines as R grows; on F3, it first increases and then decreases; and on F1+F2 and F1+F2+F3, the trend is the opposite of that on F3. This is because, in RS2_ER, the feature-sampling step alone does not bring a remarkable improvement in feature quality and its contribution to the prediction performance is limited; hence it is difficult to identify a regular pattern of the prediction effect with respect to R without the help of the other parameters.
This interesting discovery tells us that we need to further explore the joint effect of these parameters in order to search for the optimal setting. Next, we discuss the joint effect of the two important parameters R and α on RS2_ER, since λ has been optimized through cross-validation. Fig. 8 shows the varying performance of RS2_ER under different settings of R and α, where the X-axis, Y-axis, and Z-axis represent α, R, and AUC, respectively. When R is fixed, it is interesting that the AUC rises as α increases until it becomes stable, which indicates that RS2_ER performs best when it focuses more on sparsity at the within-group level than at the group level to remove redundant features. Simultaneously, this phenomenon also illustrates that

the considered homogeneous features (feature groups), including financial, sentiment, and textual features, are informative for prediction; hence they are all preserved. Next, we explore the joint effect of α and R. To make the analysis more understandable, we discuss four extreme cases. First, for α = 0.1 with R = 0.1 and α = 0.1 with R = 0.9, the prediction capabilities are generally poor, which draws attention to the fact that it is unreasonable for RS2_ER to put great effort into group sparsity while neglecting the redundancy and irrelevancy among individual features. Furthermore, with α = 0.1, the AUC under the lower R outperforms that under the higher R, which confirms that sampling excessive features may aggravate the adverse effect of redundancy on feature quality. This is a valuable finding, as it implies that feature quality should be enhanced by improving the sparsity among features and adjusting the sampling rate accordingly in order to improve prediction performance. Second, for α = 0.9 with R = 0.1 and α = 0.9 with R = 0.9, the results on all feature sets are favorable. This phenomenon indicates that prediction accuracy can be improved when within-group sparsity is achieved and the extraction ratio is at either a low or a high level, which provides reasonable support for utilizing multiple features to predict financial distress. That is to say, because the overall financial, sentiment, and textual features are effective, they are all retained by our proposed RS2_ER; meanwhile, since a certain level of redundancy exists within the homogeneous features (feature groups), strong within-group sparsity is needed to enhance the feature quality.
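The group-level versus within-group trade-off governed by α is exactly the sparse group lasso penalty. A minimal sketch of the penalty term follows the standard formulation with sqrt(group size) weights (the paper's exact weighting is an assumption, and the toy grouping below is hypothetical):

```python
import math

def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty:
    lam * ((1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1).
    alpha -> 1 emphasizes within-group (lasso) sparsity;
    alpha -> 0 emphasizes group-level (group-lasso) sparsity."""
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )
    l1_term = sum(abs(b) for b in beta)
    return lam * ((1 - alpha) * group_term + alpha * l1_term)
```

For beta = [3, 4, 0, 0] split into two groups of two, alpha = 1 recovers the pure l1 penalty (7.0), while intermediate alpha blends in the group norm, mirroring the behavior observed in Fig. 8.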
Moreover, as we conduct the sensitivity analysis on the high-dimensional feature set (F1+F2+F3), a lower R is capable of ensuring both feature quantity and feature quality and, without doubt, a higher R is able to extract more adequate features for prediction, both of which enable our proposed RS2_ER to reach a satisfactory result. Through the analyses above, we find that different parameter settings obviously influence the prediction performance, and that the optimal result derives from enhancing the feature quality by simultaneously realizing group-level and within-group sparsity. To further explore how α and R coordinate with each other in RS2_ER, we evaluate the optimal combinations of α and R on F1+F2+F3. The concrete distributions of α and R are shown in Fig. 9. Overall, the highest AUC is achieved when the features extracted from the original feature set are adequate and the sparsity


Fig. 8. Sensitivity analysis of AUC for the RS2 _ER on R and α .

Fig. 9. Sensitivity analysis for the RS2 _ER on R and α .

among the individual features is achieved through redundancy removal with α = 0.9 and R = 0.7. From Fig. 9, we find that α first increases and then decreases as R varies from 0.1 to 0.9, which reflects, to some extent, their synergetic and complementary roles in enabling RS2_ER to achieve the best values. Concretely, α and R play a synergetic role when the sampled features are relatively few, which provides a direction for researchers to correctly handle the relationship between α and R and build a better FDP model. Moreover, they are complementary when a large number of features are sampled: given a high R value, more features are retained, and in parallel RS2_ER focuses on sparsity at both the group level and the within-group level to improve the feature quality, subsequently enhancing the predictive power. The results above show that the parameters jointly determine the prediction performance of RS2_ER and that proper settings are important for achieving high-quality prediction for FDP.

6. Conclusion and future work

FDP has become a hot issue owing to the damage that financial distress inflicts on economic agents and even the commercial environment. Recently, non-financial features derived from textual disclosures have prompted researchers in the FDP area to work on more efficient models. Ensemble methods have become a prevalent research line in the field of FDP incorporating

multiple features. However, in traditional ensembles the intrinsic characteristics of multiple features have not been properly treated, which decreases their feature quality and leaves great room for improvement in prediction accuracy. Moreover, because the base classifiers are trained on random feature subsets extracted from the original feature set, diversity among the classifiers is obtained but the prediction accuracy of each base classifier is uncertain. The direct combination of these classifiers of differing quality leads to a loss of prediction accuracy. Under these circumstances, this study proposes a novel and robust meta FDP framework which incorporates a feature regularizing module for identifying the discriminatory predictive power of multiple features and a probabilistic fusion module for enhancing the aggregation over base classifiers. To validate the proposed RS2_ER, we conducted extensive experiments on real-world datasets, and the experimental results show that the proposed RS2_ER significantly improves the prediction performance by coping with the grouping property of features and the ambiguities among base classifiers.

There are also several future directions for this study. First, to enhance the utilization of multiple features, it is necessary to further explore the predictive power of other valuable quantitative and qualitative variables extracted from social media. Second, the research framework proposed in this study needs to be verified on larger and more diverse datasets to support the effectiveness of RS2_ER. Third, regarding computation, parallel computing techniques are required to


be studied in depth to cope with computationally intensive problems.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2020.106152.

CRediT authorship contribution statement

Gang Wang: Conceptualization, Formal analysis, Methodology, Writing - review & editing. Jingling Ma: Conceptualization, Software, Visualization, Writing - original draft, Writing - review & editing. Gang Chen: Conceptualization, Software, Visualization, Writing - original draft. Ying Yang: Formal analysis, Methodology, Writing - original draft.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (71471054, 71573071, 91646111), and the Fundamental Research Funds for the Central Universities, PR China (PA2019GDQT0004).

References

[1] H.R. Khedmatgozar, A. Shahnazi, The role of dimensions of perceived risk in adoption of corporate internet banking by customers in Iran, Electron. Commer. Res. 18 (2018) 389–412, http://dx.doi.org/10.1007/s10660-017-9253-z.
[2] Y. Peng, G. Wang, G. Kou, Y. Shi, An empirical study of classification algorithm evaluation for financial risk prediction, Appl. Soft Comput. J. 11 (2011) 2906–2915, http://dx.doi.org/10.1016/j.asoc.2010.11.028.
[3] D.L. Olson, D. Delen, Y. Meng, Comparative analysis of data mining methods for bankruptcy prediction, Decis. Support Syst. 52 (2012) 464–473, http://dx.doi.org/10.1016/j.dss.2011.10.007.
[4] Y. Yan, Z. Lv, B. Hu, Building investor trust in the P2P lending platform with a focus on Chinese P2P lending platforms, in: Proc. 2016 Int. Conf. Identification, Inf. Knowl. Internet Things (IIKI 2016), 2018, pp. 470–474, http://dx.doi.org/10.1109/IIKI.2016.15.
[5] W.W. Wu, Beyond business failure prediction, Expert Syst. Appl.
37 (2010) 2371–2376, http://dx.doi.org/10.1016/j.eswa.2009.07.056.
[6] P. du Jardin, Dynamics of firm financial evolution and bankruptcy prediction, Expert Syst. Appl. 75 (2017) 25–43, http://dx.doi.org/10.1016/j.eswa.2017.01.016.
[7] C.H. Chou, S.C. Hsieh, C.J. Qiu, Hybrid genetic algorithm and fuzzy clustering for bankruptcy prediction, Appl. Soft Comput. J. 56 (2017) 298–316, http://dx.doi.org/10.1016/j.asoc.2017.03.014.
[8] W.H. Beaver, Financial ratios as predictors of failure, J. Account. Res. (1966) 71–111.
[9] R. Geng, I. Bose, X. Chen, Prediction of financial distress: An empirical study of listed Chinese companies using data mining, European J. Oper. Res. 241 (2015) 236–247, http://dx.doi.org/10.1016/j.ejor.2014.08.016.
[10] P. Hajek, R. Henriques, When is a liability not a liability, Knowl.-Based Syst. 128 (2017) 139–152, http://dx.doi.org/10.1016/j.knosys.2017.05.001.
[11] W. Antweiler, M.Z. Frank, Is all that talk just noise? The information content of internet stock message boards, J. Finance 59 (2004) 1259–1294.
[12] R.P. Schumaker, Y. Zhang, C.N. Huang, H. Chen, Evaluating sentiment in financial news articles, Decis. Support Syst. 53 (2012) 458–464, http://dx.doi.org/10.1016/j.dss.2012.03.001.
[13] A. Volkov, D.F. Benoit, D. Van den Poel, Incorporating sequential information in bankruptcy prediction with predictors based on Markov for discrimination, Decis. Support Syst. 98 (2017) 59–68, http://dx.doi.org/10.1016/j.dss.2017.04.008.
[14] C. Chen, C. Liu, Y. Chang, H. Tasi, Opinion mining for relating multiword subjective expressions and annual earnings in US financial statements, J. Inf. Sci. Eng. 29 (2013) 743–764, http://nccur.lib.nccu.edu.tw/handle/140.119/66213.
[15] R. Balakrishnan, X.Y. Qiu, P. Srinivasan, On the predictive ability of narrative disclosures in annual reports, European J. Oper. Res. 202 (2010) 789–801, http://dx.doi.org/10.1016/j.ejor.2009.06.023.


[16] M. Doumpos, K. Andriosopoulos, E. Galariotis, G. Makridou, C. Zopounidis, Corporate failure prediction in the European energy sector: A multicriteria approach and the effect of country characteristics, European J. Oper. Res. 262 (2017) 347–360, http://dx.doi.org/10.1016/j.ejor.2017.04.024.
[17] S.W.K. Chan, M.W.C. Chong, Sentiment analysis in financial texts, Decis. Support Syst. 94 (2017) 53–64, http://dx.doi.org/10.1016/j.dss.2016.10.006.
[18] P. Hajek, V. Olej, R. Myskova, Forecasting corporate financial performance using sentiment in annual reports for stakeholders' decision-making, Technol. Econ. Dev. Econ. 20 (2014) 721–738, http://dx.doi.org/10.3846/20294913.2014.979456.
[19] G. Wang, G. Chen, Y. Chu, A new random subspace method incorporating sentiment and textual information for financial distress prediction, Electron. Commer. Res. Appl. 29 (2018) 30–49, http://dx.doi.org/10.1016/j.elerap.2018.03.004.
[20] P. Ravi Kumar, V. Ravi, Bankruptcy prediction in banks and firms via statistical and intelligent techniques - A review, European J. Oper. Res. 180 (2007) 1–28, http://dx.doi.org/10.1016/j.ejor.2006.08.043.
[21] J.H. Min, C. Jeong, A binary classification method for bankruptcy prediction, Expert Syst. Appl. 36 (2009) 5256–5263, http://dx.doi.org/10.1016/j.eswa.2008.06.073.
[22] C. Serrano-Cinca, B. Gutiérrez-Nieto, Partial least square discriminant analysis for bankruptcy prediction, Decis. Support Syst. 54 (2013) 1245–1255, http://dx.doi.org/10.1016/j.dss.2012.11.015.
[23] S. Canbas, A. Cabuk, S.B. Kilic, Prediction of commercial bank failure via multivariate statistical analysis of financial structures: The Turkish case, European J. Oper. Res. 166 (2005) 528–546, http://dx.doi.org/10.1016/j.ejor.2004.03.023.
[24] E. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, J. Finance 23 (1968) 589–609.
[25] P. Newton, Financial ratios and probabilistic prediction of bankruptcy, J. Account. Res. 2 (1975).
[26] R.C. West, A factor-analytic approach to bank condition, J. Bank. Financ. 9 (1985) 253–266.
[27] J.L. Peter, Assessing the vulnerability to failure of American industrial firms: A logistic analysis, J. Bus. Valuat. Econ. Loss Anal. 8 (2013) 19–45, http://dx.doi.org/10.1515/jbvela-2013-0009.
[28] C.F. Tsai, Y.F. Hsu, A meta-learning framework for bankruptcy prediction, J. Forecast. 32 (2013) 167–179, http://dx.doi.org/10.1002/for.1264.
[29] J. Sun, H. Li, Q.H. Huang, K.Y. He, Predicting financial distress and corporate failure: A review from the state-of-the-art definitions, modeling, sampling, and featuring approaches, Knowl.-Based Syst. 57 (2014) 41–56, http://dx.doi.org/10.1016/j.knosys.2013.12.006.
[30] N. Chen, B. Ribeiro, A. Chen, Financial credit risk assessment: a recent review, Artif. Intell. Rev. 45 (2016) 1–23, http://dx.doi.org/10.1007/s10462-015-9434-x.
[31] W.Y. Lin, Y.H. Hu, C.F. Tsai, Machine learning in financial crisis prediction: A survey, IEEE Trans. Syst. Man Cybern. C 42 (2011) 421–436, http://dx.doi.org/10.1109/tsmcc.2011.2170420.
[32] F. Barboza, H. Kimura, E. Altman, Machine learning models and bankruptcy prediction, Expert Syst. Appl. 83 (2017) 405–417, http://dx.doi.org/10.1016/j.eswa.2017.04.006.
[33] F. Mai, S. Tian, C. Lee, L. Ma, Deep learning models for bankruptcy prediction using textual disclosures, European J. Oper. Res. 274 (2019) 743–758, http://dx.doi.org/10.1016/j.ejor.2018.10.024.
[34] Z. Jing, Y. Fang, Predicting US bank failures: A comparison of logit and data mining models, J. Forecast. 37 (2018) 235–256, http://dx.doi.org/10.1002/for.2487.
[35] H. Frydman, E.I. Altman, D.-L. Kao, Introducing recursive partitioning for financial classification: The case of financial distress, J. Finance 40 (1985) 269–291, http://dx.doi.org/10.1111/j.1540-6261.1985.tb04949.x.
[36] C.F. Tsai, Financial decision support using neural networks and support vector machines, Expert Syst. 25 (2008) 380–393, http://dx.doi.org/10.1111/j.1468-0394.2008.00449.x.
[37] F.J. López Iturriaga, I.P. Sanz, Bankruptcy visualization and prediction using neural networks: A study of U.S. commercial banks, Expert Syst. Appl. 42 (2015) 2857–2869, http://dx.doi.org/10.1016/j.eswa.2014.11.025.
[38] Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study, Decis. Support Syst. 37 (2004) 543–558, http://dx.doi.org/10.1016/s0167-9236(03)00086-1.
[39] K.S. Shin, T.S. Lee, H.J. Kim, An application of support vector machines in bankruptcy prediction model, Expert Syst. Appl. 28 (2005) 127–135, http://dx.doi.org/10.1016/j.eswa.2004.08.009.
[40] C.F. Tsai, Y.F. Hsu, D.C. Yen, A comparative study of classifier ensembles for bankruptcy prediction, Appl. Soft Comput. 24 (2014) 977–984, http://dx.doi.org/10.1016/j.asoc.2014.08.047.
[41] H. Choi, H. Son, C. Kim, Predicting financial distress of contractors in the construction industry using ensemble learning, Expert Syst. Appl. 110 (2018) 1–10, http://dx.doi.org/10.1016/j.eswa.2018.05.026.
[42] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (2006) 21–45, http://dx.doi.org/10.1109/MCAS.2006.1688199.


G. Wang, J. Ma, G. Chen et al. / Applied Soft Computing Journal 90 (2020) 106152

[43] E. Alfaro, N. García, M. Gámez, D. Elizondo, Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks, Decis. Support Syst. 45 (2008) 110–122, http://dx.doi.org/10.1016/j.dss.2007.12.002.
[44] J. Sun, H. Li, Listed companies' financial distress prediction based on weighted majority voting combination of multiple classifiers, Expert Syst. Appl. 35 (2008) 818–827, http://dx.doi.org/10.1016/j.eswa.2007.07.045.
[45] L. Nanni, A. Lumini, An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring, Expert Syst. Appl. 36 (2009) 3028–3033, http://dx.doi.org/10.1016/j.eswa.2008.01.018.
[46] P. Hájek, V. Olej, R. Myšková, Predicting financial distress of banks using random subspace ensembles of support vector machines, Artif. Intell. Perspect. Appl. (2015) 131–140, http://dx.doi.org/10.1007/978-3-319-18476-0.
[47] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (2006) 21–44, http://dx.doi.org/10.1109/MCAS.2006.1688199.
[48] F. Lin, D. Liang, E. Chen, Financial ratio selection for business crisis prediction, Expert Syst. Appl. 38 (2011) 15094–15102, http://dx.doi.org/10.1016/j.eswa.2011.05.035.
[49] F. Lin, D. Liang, E. Chen, Financial ratio selection for business crisis prediction, Expert Syst. Appl. 38 (2011) 15094–15102, http://dx.doi.org/10.1016/j.eswa.2011.05.035.
[50] D. Delen, C. Kuzey, A. Uyar, Measuring firm performance using financial ratios: A decision tree approach, Expert Syst. Appl. 40 (2013) 3970–3983, http://dx.doi.org/10.1016/j.eswa.2013.01.012.
[51] J. Huang, H. Wang, G. Kochenberger, Distressed Chinese firm prediction with discretized data, Manag. Decis. 55 (2017) 786–807, http://dx.doi.org/10.1108/MD-08-2016-0546.
[52] M. Cecchini, H. Aytug, G.J. Koehler, P. Pathak, Making words work: Using financial text as a predictor of financial events, Decis. Support Syst. 50 (2010) 164–175, http://dx.doi.org/10.1016/j.dss.2010.07.012.
[53] N. Jo, K. Shin, Bankruptcy prediction modeling using qualitative information based on big data analytics, J. Intell. Inf. Syst. 22 (2016) 33–56, http://dx.doi.org/10.13088/jiis.2016.22.2.033.
[54] H. Li, J. Sun, Majority voting combination of multiple case-based reasoning for financial distress prediction, Expert Syst. Appl. 36 (2009) 4363–4373, http://dx.doi.org/10.1016/j.eswa.2008.05.019.
[55] H. Öğüt, R. Aktaş, A. Alp, M.M. Doğanay, Prediction of financial information manipulation by using support vector machine and probabilistic neural network, Expert Syst. Appl. 36 (2009) 5419–5423, http://dx.doi.org/10.1016/j.eswa.2008.06.055.
[56] M.A. Boyacioglu, Y. Kara, Ö.K. Baykan, Predicting bank financial failures using neural networks, support vector machines and multivariate statistical methods: A comparative analysis in the sample of savings deposit insurance fund (SDIF) transferred banks in Turkey, Expert Syst. Appl. 36 (2009) 3355–3366, http://dx.doi.org/10.1016/j.eswa.2008.01.003.
[57] K. Tam, M. Kiang, Managerial applications of neural networks: the case of bank failure predictions, Manage. Sci. 38 (1992) 926–947.
[58] H. Li, C.J. Li, X.J. Wu, J. Sun, Statistics-based wrapper for feature selection: An implementation on financial distress identification with support vector machine, Appl. Soft Comput. 19 (2014) 57–67, http://dx.doi.org/10.1016/j.asoc.2014.01.018.
[59] J. Sun, H. Li, Financial distress prediction using support vector machines: Ensemble vs. individual, Appl. Soft Comput. 12 (2012) 2254–2265, http://dx.doi.org/10.1016/j.asoc.2012.03.028.
[60] Y. Ding, X. Song, Y. Zen, Forecasting financial condition of Chinese listed companies based on support vector machine, Expert Syst. Appl. 34 (2008) 3081–3089, http://dx.doi.org/10.1016/j.eswa.2007.06.037.
[61] D. Liang, C.F. Tsai, A.J. Dai, W. Eberle, A novel classifier ensemble approach for financial distress prediction, Knowl. Inf. Syst. 54 (2018) 437–462, http://dx.doi.org/10.1007/s10115-017-1061-1.
[62] D. West, S. Dellana, J. Qian, Neural network ensemble strategies for financial decision applications, Comput. Oper. Res. 32 (2005) 2543–2559, http://dx.doi.org/10.1016/j.cor.2004.03.017.
[63] D. Zhang, X. Zhou, S.C.H. Leung, J. Zheng, Vertical bagging decision trees model for credit scoring, Expert Syst. Appl. 37 (2010) 7838–7843, http://dx.doi.org/10.1016/j.eswa.2010.04.054.
[64] J. Sun, M.Y. Jia, H. Li, AdaBoost ensemble for financial distress prediction: An empirical comparison with data from Chinese listed companies, Expert Syst. Appl. 38 (2011) 9305–9312, http://dx.doi.org/10.1016/j.eswa.2011.01.042.
[65] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 832–844, http://dx.doi.org/10.1109/34.709601.
[66] G. Wang, J. Ma, S. Yang, An improved boosting based on feature selection for corporate bankruptcy prediction, Expert Syst. Appl. 41 (2014) 2353–2361, http://dx.doi.org/10.1016/j.eswa.2013.09.033.
[67] R. Feldman, S. Govindaraj, J. Livnat, B. Segal, Management's tone change, post earnings announcement drift and accruals, Rev. Account. Stud. 15 (2010) 915–953, http://dx.doi.org/10.1007/s11142-009-9111-x.

[68] P. Hájek, V. Olej, Evaluating sentiment in annual reports for financial distress prediction using neural networks and support vector machines, Commun. Comput. Inf. Sci. 384 (2013) 1–10, http://dx.doi.org/10.1007/978-3-642-41016-1_1.
[69] A. Abbasi, S. France, Z. Zhang, H. Chen, Selecting attributes for sentiment classification using feature relation networks, IEEE Trans. Knowl. Data Eng. 23 (2011) 447–462, http://dx.doi.org/10.1109/TKDE.2010.110.
[70] P. Du Jardin, E. Séverin, Forecasting financial failure using a Kohonen map: A comparative study to improve model stability over time, European J. Oper. Res. 221 (2012) 378–396, http://dx.doi.org/10.1016/j.ejor.2012.04.006.
[71] D. Liang, C.F. Tsai, H.T. Wu, The effect of feature selection on financial distress prediction, Knowl.-Based Syst. 73 (2014) 289–297, http://dx.doi.org/10.1016/j.knosys.2014.10.010.
[72] L. Wang, C. Wu, Business failure prediction based on two-stage selective ensemble with manifold learning algorithm and kernel-based fuzzy self-organizing map, Knowl.-Based Syst. 121 (2017) 99–110, http://dx.doi.org/10.1016/j.knosys.2017.01.016.
[73] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Syst. Appl. 73 (2017) 220–239, http://dx.doi.org/10.1016/j.eswa.2016.12.035.
[74] L. Zhou, K.P. Tam, H. Fujita, Predicting the listing status of Chinese listed companies with multi-class classification models, Inf. Sci. 328 (2016) 222–236, http://dx.doi.org/10.1016/j.ins.2015.08.036.
[75] M.J. Kim, D.K. Kang, Ensemble with neural networks for bankruptcy prediction, Expert Syst. Appl. 37 (2010) 3373–3379, http://dx.doi.org/10.1016/j.eswa.2009.10.012.
[76] L. Jiang, L. Zhang, C. Li, J. Wu, A correlation-based feature weighting filter for naive Bayes, IEEE Trans. Knowl. Data Eng. 31 (2019) 201–213, http://dx.doi.org/10.1109/TKDE.2018.2836440.
[77] S. Ji, Z. Sun, D. Tao, T. Tan, J. Gui, Feature selection based on structured sparsity: A comprehensive study, IEEE Trans. Neural Netw. Learn. Syst. 28 (2016) 1490–1507, http://dx.doi.org/10.1109/tnnls.2016.2551724.
[78] M.A. Tahir, A. Bouridane, F. Kurugollu, Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier, Pattern Recognit. Lett. 28 (2007) 438–446, http://dx.doi.org/10.1016/j.patrec.2006.08.016.
[79] R. Tibshirani, The Lasso method for variable selection in the Cox model, Stat. Med. 16 (1997) 385–395, http://dx.doi.org/10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3.
[80] M. Yuan, Y. Lin, Model Selection and Estimation in Regression with Grouped Variables, Tech. Report, Dep. Stat., Univ. Wisconsin, 2004.
[81] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse-group lasso, J. Comput. Graph. Statist. 22 (2013) 231–245, http://dx.doi.org/10.1080/10618600.2012.681250.
[82] J.B. Yang, D.L. Xu, Evidential reasoning rule for evidence combination, Artificial Intelligence 205 (2013) 1–29, http://dx.doi.org/10.1016/j.artint.2013.09.003.
[83] Z.G. Zhou, F. Liu, L.L. Li, L.C. Jiao, Z.J. Zhou, J.B. Yang, Z.L. Wang, A cooperative belief rule based decision support system for lymph node metastasis diagnosis in gastric cancer, Knowl.-Based Syst. 85 (2015) 62–70, http://dx.doi.org/10.1016/j.knosys.2015.04.019.
[84] Z.G. Zhou, F. Liu, L.C. Jiao, Z.J. Zhou, J.B. Yang, M.G. Gong, X.P. Zhang, A bi-level belief rule based decision support system for diagnosis of lymph node metastasis in gastric cancer, Knowl.-Based Syst. 54 (2013) 128–136, http://dx.doi.org/10.1016/j.knosys.2013.09.001.
[85] Z.J. Zhou, L.L. Chang, C.H. Hu, X.X. Han, Z.G. Zhou, A new BRB-ER-based model for assessing the lives of products using both failure data and expert knowledge, IEEE Trans. Syst. Man Cybern. Syst. 46 (2016) 1529–1543, http://dx.doi.org/10.1109/TSMC.2015.2504047.
[86] L. Chang, Z. Zhou, Y. You, L. Yang, Z. Zhou, Belief rule based expert system for classification problems with new rule activation and weight calculation procedures, Inf. Sci. 336 (2016) 75–91, http://dx.doi.org/10.1016/j.ins.2015.12.009.
[87] X. Xu, J. Zheng, J.B. Yang, D.L. Xu, Y.W. Chen, Data classification using evidence reasoning rule, Knowl.-Based Syst. 116 (2017) 144–151, http://dx.doi.org/10.1016/j.knosys.2016.11.001.
[88] J. Zhou, Z. Zhou, Z.J. Hao, H. Li, S. Chen, X. Zhang, et al., Constructing multi-modality and multi-classifier radiomics predictive models through reliable classifier fusion, arXiv preprint arXiv:1710.01614, 2017, pp. 1–13.
[89] D. Veganzones, E. Séverin, An investigation of bankruptcy prediction in imbalanced datasets, Decis. Support Syst. 112 (2018) 111–124, http://dx.doi.org/10.1016/j.dss.2018.06.011.