European Journal of Operational Research 167 (2005) 518–542 www.elsevier.com/locate/dsw
Decision Aiding
Forecasting business profitability by using classification techniques: A comparative analysis based on a Spanish case
Javier de Andres a,*, Manuel Landajo b,1, Pedro Lorca a,2
a Department of Accounting and Business Administration, University of Oviedo, Avenida del Cristo, 33006 Oviedo, Spain
b Unit of Statistics and Econometrics, Department of Applied Economics, University of Oviedo, Avenida del Cristo, 33006 Oviedo, Spain
Received 3 December 2002; accepted 26 February 2004 Available online 18 May 2004
Abstract
A comparative study of the performance of a number of classificatory devices, both parametric (LDA and Logit) and non-parametric (perceptron neural nets and fuzzy-rule-based classifiers) is conducted, and a Monte Carlo simulation-based approach is used in order to measure the average effects of sample size variations on the predictive performance of each classifier. The paper uses as a benchmark the problem of forecasting the level of profitability of Spanish commercial and industrial companies upon the basis of a set of financial ratios. This case illustrates well a distinctive feature of many financial prediction problems, namely that of being characterized by a high dimension feature space as well as a low degree of separability. Response surfaces are estimated in order to summarize the results. A higher performance of model-free classifiers is generally observed, even for fairly moderate sample sizes.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Decision support systems; Neural networks; Fuzzy systems; Profitability forecasting; Financial ratios
* Corresponding author. Tel.: +34-985-104855; fax: +34-985103708.
E-mail addresses: [email protected] (J. de Andres), [email protected] (M. Landajo), [email protected] (P. Lorca).
1 Tel.: +34-985-105055; fax: +34-985-105050.
2 Tel.: +34-985-103902; fax: 34-985-103708.
0377-2217/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2004.02.018

1. Introduction
Classification techniques hold a very important place in business and economic research, as many real-world applications involve classification of observations into discrete categories. We can highlight, among others: (1) analysis of management decisions, (2) insolvency and bankruptcy forecasting, (3) lobbying positions on accounting standards, and (4) prediction of takeover targets. In the vast majority of accounting research works on this topic, classification is made upon the basis of sets of ratios computed using figures from public financial statements. As pointed out by several authors (e.g., Deakin, 1976; Watson, 1990), cross-sectional distributions of these ratios have peculiar properties (high degrees of positive skewness and kurtosis) which imply violations of some key assumptions, such as sphericity, which
underlie many standard statistical classification techniques. These kinds of problems may generally induce poor predictive performance of standard classification techniques, as well as other inferential problems (e.g., White, 1990). Since all these situations seem to be the rule rather than the exception, it is a matter of interest to determine what kinds of classifiers suit the special characteristics of financial information better.
It is well known that parametric statistical models, such as linear discriminant analysis (LDA) and logistic regression (Logit), have shown their adequacy for a great number of practical economic classification tasks (e.g., prediction of financial distress or bond ratings). They provide simple (i.e., low-parameterised), elegant and easy-to-interpret classifiers. In this sense, standard statistical models, when the researcher is capable of correctly selecting an adequate parametric model, are without doubt a good choice, especially when only modest sample sizes are available. Thus, they should not be discarded from the outset. Very often, some kind of ‘regularizing’ strategy (such as using Box–Cox transformations) may be tried in order to bypass the aforementioned problems and obtain a well-behaved data set more adequate for the use of standard classification techniques.
As an alternative, which may potentially overcome the aforementioned risks associated with inadequate selection of a parametric structure, a wide panoply of model-free or non-parametric techniques have been proposed in the literature. These include, among many others, neural networks, rule-based systems (both in their crisp and fuzzy variants), kernel-based classifiers, and mathematical programming. [3] Non-parametric classifiers are flexible enough to perform satisfactorily in very general settings, provided that sufficiently large amounts of statistical information are supplied by the researcher. Under rather mild conditions, these classifiers are consistent for the Bayes optimal classification rule (e.g., Stone, 1977; Yang, 1999), thus being capable of avoiding the risks (such as unduly poor predictive performance) associated with using incorrectly specified parametric classifiers. This guarantees that, at least asymptotically, non-parametric classifiers should perform no worse than their possibly mis-specified parametric counterparts, although this is certainly not necessarily the case for small data sets.
[3] For a recent literature review on classification methods see Zopounidis and Doumpos (2002).
Unfortunately, asymptotic results cannot give a general answer to the question of what sample size is sufficient to start to consider the use of model-free classifiers as useful alternatives to parametric methods. This is largely a case-dependent issue, related to the dimensionality of the feature space and the smoothness of the discriminant surface to be learned from data, this last characteristic obviously being generally unknown in advance (see, e.g., Yang, 1999). The results in this paper will provide some further empirical evidence in respect of these issues, complementing the above literature with extensive Monte Carlo evidence obtained from a specific (although representative) case-based study.
In this paper we test the predictive accuracy of two mainstream non-parametric classificatory devices, namely perceptron neural networks (NNs) and additive fuzzy rule-based systems (FSs), as compared to two classical parametric classifiers, namely, LDA and Logit. We adopt an approach that may be regarded as ‘dynamic’, in the sense that we focus on the issue of evaluating how the accuracy of the classifying devices varies as the available sample sizes increase. Of course, many previous works have focused on the comparison of parametric and non-parametric techniques, although, to our knowledge, no large scale analysis of the kind presented in this paper has appeared to date, since the available literature has focused more on computational aspects (such as training algorithms and their speed and efficiency) than on the effects of the available amounts of statistical information on learning of classifiers.
As to our choice of these particular classes of model-free classifiers, it is considerably less restrictive than it may appear at first glance. The selected structures summarize two general strategies underlying model-free classification, namely projection-based classifiers, which include parametric structures such as Logit and Probit as well as projection pursuit regression (Friedman and Stuetzle, 1981) and perceptron NNs, and tensor-product-based classifiers, which comprise general rule-based systems, kernel-based classifiers, and series estimators based on multivariate B-splines and other local bases. Hence, the classifiers we use here may be seen as representative of a fairly wide set of classifiers. The classes of NNs and FSs used also possess rather appealing mathematical and statistical properties, as detailed below. Finally, the class of FSs used in this paper has been shown to be mathematically equivalent to normalized RBF networks with Gaussian transfer functions (Jang and Sun, 1993), so that our results may also be seen as including a comparison of two NN architectures, namely perceptrons and RBF nets.
Rule induction systems, such as, for example, Quinlan’s models, are not considered in the present research because their adequacy for the analysis of business efficiency has already been studied (De Andres et al., 1998; De Andres, 2001). Other alternatives used in discrimination problems, such as rough sets and mathematical programming, omitted here for the sake of brevity, may well also have useful statistical properties. Although they are certainly not often seen as statistical devices, and, to our knowledge, neither asymptotic results nor extensive Monte Carlo evidence is yet available in the literature, rather encouraging recent results (e.g., Hansen et al., 1994) suggest that results similar to those reported here may be obtained for these other classification methodologies.
The paper is structured as follows: In Section 2 prior research is briefly reviewed, and the scope of the paper is outlined. In Section 3 the methodological details are expounded, including the variables and the class indicator, as well as some specific features of each classification technique and the procedures used for the simulations and the building of the response surfaces. The main results, and limitations, of this research are contained respectively in Sections 4 and 5.
Finally, Section 6 collects some concluding remarks and indicates further research lines.
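The ‘dynamic’ evaluation outlined in this introduction, in which error rates are averaged over many Monte Carlo replications at each sample size, might be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the paper: the synthetic population in `draw_case`, the nearest-centroid rule used as a stand-in classifier, and all parameter values (`shift`, `n_reps`, `n_test`) are hypothetical.

```python
import random
import statistics


def draw_case(rng, n_features=9, shift=0.3):
    """Draw one synthetic firm: a class label and weakly informative features.

    Hypothetical population; it only mimics a low-separability problem.
    """
    y = rng.randint(0, 1)
    x = [rng.gauss(shift * y, 1.0) for _ in range(n_features)]
    return x, y


def fit_centroids(sample):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in sample if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def average_error(n_train, n_reps=200, n_test=200, seed=0):
    """Average holdout error over n_reps independent training samples."""
    rng = random.Random(seed)
    errors = []
    while len(errors) < n_reps:
        train = [draw_case(rng) for _ in range(n_train)]
        if len({y for _, y in train}) < 2:
            continue  # redraw: both classes are needed to fit the classifier
        test = [draw_case(rng) for _ in range(n_test)]
        model = fit_centroids(train)
        wrong = sum(predict(model, x) != y for x, y in test)
        errors.append(wrong / n_test)
    return statistics.fmean(errors)


if __name__ == "__main__":
    # Averaged error rate as a function of the training sample size.
    for n in (10, 50, 200, 800):
        print(n, round(average_error(n), 3))
```

Averaging over replications is what smooths out sampling variation; the resulting pairs of sample size and average error are the kind of points a response surface could then be fitted to.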
2. Prior research and scope of the paper
As stated above, a considerable number of research works on classification techniques in the economics-related fields have dealt with the goal of comparing predictive accuracy of parametric and non-parametric classifiers. Among the latter, NNs and AI-inspired systems for rule and tree induction have probably received most attention. Probably, the most widely used rule and tree induction systems are the recursive partitioning algorithm (RPA) and Quinlan’s programs (ID3, C4.5 and SEE5). More recently, certain developments in the field of computing have been used in the design of inference engines. Hence, fuzzy sets, rough sets and genetic algorithms have also proven to be valid approaches to the construction of machine learning systems.
Table 1 shows a summary of some of the most important studies comparing AI- and NN-based classifiers with standard statistical techniques, when applied to modelling economic decisions or phenomena (for each case, the specific task, the tested systems and the main results are indicated).
From Table 1, it is clear that most research works have focused on prediction of insolvency and bankruptcy. Remarkably enough, AI–NN systems appear to yield acceptable results, even though sometimes they are incapable of outperforming linear models (the problem of over-fitting, quite common in non-parametric estimation, appears to be a relevant factor in those problematic situations). Also remarkable is the fact that the predictive accuracy of each specific classification system seems prone to vary depending on the specific choice of the classification task, variables, database and inference engine. This indicates that, unfortunately, universally valid conclusions should not be expected when comparing the performances of different classifiers in different contexts. In other words, results inherently tend to be highly case-dependent, and when we consider a new classification task, a different inference engine or different kinds of companies, the whole set of comparisons would need to be carried out again in order to select the best tool. [4]
[4] Sensitivity of AI techniques to changes in data structure has seldom been analyzed by researchers in the field of economics. Recent works by Kattan and Cooper (2000), Pavur (2002) and Pendharkar (2002) indicate that factors such as data distribution, class proportions and position of outliers do appear to affect significantly the accuracy of AI systems, although it must be taken into account that all these papers used data generated by simulations instead of actual figures from the financial statements of real companies.
Table 1
Prior research comparing NN–AI systems and statistical techniques

| Author(s) | Task | Techniques | Main results |
|---|---|---|---|
| Marais et al. (1984) | Modelling commercial bank loan classifications | Probit, RPA | RPA seems not significantly better, especially when the data do not include nominal variables |
| Frydman et al. (1985) | Predicting financial distress | LDA, RPA | Less complex RPA models perform better than LDA in terms of actual cross-validated and bootstrapped results |
| Braun and Chandler (1987) | Predicting stock market behaviour | LDA, ID3 | ID3 achieves better results than LDA |
| Elliott and Kennedy (1988) | Modelling accounting strategies | LDA, QDA, Logit, Probit, RPA | In the analysis of classification problems with more than two categories, all statistical techniques, with the exception of quadratic discriminant analysis (QDA), are better than RPA |
| Garrison and Michaelsen (1989) | Tax decisions | LDA, ID3, Probit | ID3 performs better than both LDA and Probit |
| Bell et al. (1990) | Insolvency prediction of commercial banks | Logit, NN | Multilayer perceptron NNs achieve slightly better results than Logit |
| Cronan et al. (1991) | Assessing mortgage, commercial and consumer lending | LDA, Logit, Probit, RPA, ID3 | RPA, which used fewer variables, provided notably higher accuracy than ID3, which used many more variables. RPA also outperformed statistical techniques |
| Liang et al. (1992) | Modelling FIFO/LIFO decision | Probit, ID3, NN | Predictive accuracy of ID3 is lower than that of Probit and NNs. However, ID3 is less sensitive to reductions in sample size |
| Coats and Fant (1993) | Recognizing financial distress patterns | LDA, NN | Cascade correlation NNs are more effective than LDA for pattern classification |
| Altman et al. (1994) | Financial distress prediction | LDA, NN | NN results are not clearly superior to those obtained using LDA |
| Goss and Ramchandani (1995) | Insolvency prediction of life insurers | LDA, Logit, NN | NNs predict insurer insolvency more effectively than parametric models |
| Greenstein and Welsh (1996) | Insolvency prediction | Logit, NN | Logit achieves better results than perceptron NNs |
| Didzarevich et al. (1997) | Financial distress prediction | RPA, CN2 induction, LDA, Logit, NN | Rule induction systems outperform Logit and LDA in the training sample, but the efficiency of induction systems drops substantially in the validation sample |
| Varetto (1998) | Insolvency prediction | LDA, Genetic algorithms | LDA proved to be slightly better than linear classifiers generated through genetic algorithms and the calculation of scores based on rules obtained using genetic algorithms |
| Bertels et al. (1999) | Evaluate the eligibility of a company to receive state subsidies | LDA, NN | Backpropagation NNs are not superior to LDA models except when they are given highly uncertain information |
| Mahmood et al. (1999) | Analyzing ethical decision situations | LDA, NN | NNs predict better in both training and testing phases |
| Markham et al. (2000) | Determining the number of circulating kanban cards in a just-in-time production system | RPA, NN | Both methods are comparable in terms of accuracy and response speed, but RPA has advantages in terms of explainability and development speed |
| St. John et al. (2000) | Modelling the relationship between corporate strategy and wealth creation | LDA, NN | NNs outperformed LDA in predictive ability in all analyses, suggesting the presence of non-linear effects |
| Tsai (2000) | Evaluation of credit card applications | Different models of neural networks | A good feature reduction algorithm is very important for probabilistic NNs, both in order to raise their classification accuracy and to increase generalization ability of the model |
| Wong et al. (2000) | Identify potential donors for University fund-raising programs | LDA, NN | NNs and LDA performed equally in terms of overall accuracy. Furthermore, NNs were capable of predicting the actual donors with much more accuracy than LDA |
| Zapranis and Ginoglou (2000) | Forecasting corporate failure | LDA, NN | NNs outperform the linear approach, due to their improved ability to classify correctly the problematic firms |
| De Andres (2001) | Forecasting profitability of small businesses | LDA, Logit, SEE5 | SEE5 outperforms Logit but the superiority of SEE5 over LDA is less clear. In contrast, SEE5 seems to suffer bigger increases in error rates for the holdout sample |
| McKee and Lensberg (2002) | Bankruptcy prediction | Rough sets, Genetic algorithms | The genetic programming produced a model that was less complex, more accurate and yielded more theoretical insight than a rough-set-based model |
| Mak and Munakata (2002) | Setting the product entry strategy | Rough sets, NN, ID3 | NNs provided best fit with numerical data, while ID3 and rough sets performed best with non-numerical data |
| Malhotra and Malhotra (2002) | Differentiating between good and bad credits | LDA, NN | Neuro-fuzzy models are superior to the LDA in identifying potential loan defaulters |
Compared with the above literature, this paper has the following distinctive features:
1. Average error rates, for each classifier and a wide range of sample sizes, are estimated from results of a large number of Monte Carlo simulations. Averaging permits a significant reduction of the distorting effects of sampling variation, which makes estimation of error rates much more accurate. In addition, response surfaces are constructed in order to summarize the effects of variations of sample sizes on predictive performance of each technique. These effects are of interest in themselves, since the significance of economic analyses may be greatly enhanced when we are capable of splitting a large sample into branches of activity and/or company sizes, and then carrying out the analyses separately for each branch. This last goal obviously requires classifiers capable of performing well in small samples.
2. We focus on business profitability analysis. As seen above, this topic has not received much attention in the literature, despite its importance, and most research has dealt with the issue of insolvency forecasting.
3. Our classification problem has a low separability degree. Deliberately, we do not consider here a number of variables (such as productivity ratios or gross margin ratios) which at first glance may appear to be obvious candidates as good predictors for the class indicator, but which on closer analysis reveal themselves as inadequate, simply because they are mere redefinitions of the variable to forecast. Plainly stated, the inclusion of these ratios would inflate the predictive ability of models simply by a kind of ‘tautological’ effect.
4. We test a slight variant of the class of additive fuzzy systems with Gaussian membership functions, with consequents normalized in order to be probabilities. The performance of such a class of FSs in this difficult classification task, with a high dimension feature space, will be compared to those of NNs and traditional parametric classifiers.

3. Methodological aspects
3.1. Basic information
The source information was obtained from a database consisting of financial statements from commercial and industrial firms located in Spain (see Appendix A). In accordance with Spanish legislation, limited liability companies are required to deposit their annual accounts in the Registro Mercantil (commercial register), the files of which are publicly available to every user of financial information. The financial statements analyzed here correspond to the year 1999. We only considered companies with more than 100 employees (this gave us a manageable database, which contained data from 6533 firms). Additionally, a number of filters were applied in order to guarantee (1) high quality of the financial information, and (2) that the selected sample adequately represented the economic activity of each sector. Hence, companies were eliminated when (a) they carried out no activity during 1999, (b) 1999 was their first year of business, or (c) the information they provided was not enough to compute the selected ratios. After this pruning, the database was made up of the accounts of 5671 Spanish companies.
3.2. Classification task and class identification
For class identification, the level of profitability is represented by a dichotomous variable Y which equals 0 if the company is included in the most profitable group, and 1 when it belongs to the least profitable group. Hence, our classification task is that of forecasting Y for each company on the basis of a set of accounting ratios, which we abbreviate using an N-dimensional vector X. In order to measure the profitability of the firms, and taking into account informational limitations (only annual accounts were available), we chose the financial profitability ratio. This is defined as the quotient of the company’s net profit and equity capital. A considerable amount of literature (e.g., Kelly and Tippet, 1991; Brief and Lawson, 1992, among many others) suggests that,
despite its limitations, this ratio provides a suitable measure of management efficiency. After computing financial profitability for each firm, and in order to avoid distortions caused by the so-called ‘sector effect’, we divided the ratio by the median of the profitability for each branch of activity, as recommended, e.g., in Platt and Platt (1990). For the definitions of both groups (respectively, ‘efficient firms’ and ‘inefficient firms’), the final specification was made by discarding the intermediate quartiles of the financial profitability ratio. Hence, the group comprising the most profitable companies gathered 25% of the firms with the highest financial profitability, and the group of the least profitable companies comprised 25% of the firms with the lowest value for this ratio. [5] This implied filtering down the database to 2863 cases. Appendix A shows the number of companies included in each group and each NACE (Nomenclature générale des Activités économiques dans les Communautés Européennes) sector.
[5] In order to test the stability of the sample, we also computed financial profitability for the years 1998 and 1997. Of the firms in the upper quartile of the 1999 figures (those included in the sample), 64.11% were also in the upper quartiles of both the 1998 and the 1997 figures. For the least profitable firms, 62.72% of the sample was also in the lower quartile for the two preceding years. So, it can be concluded that the composition of the sample is quite stable and the results of our research are not far different from those that would have been obtained if computing an average profitability for three years.
3.3. The financial variables used
In order to summarize the situation of the firms, we started by choosing the financial aspects to be measured. As indicated in Section 2 above, we excluded as predictors those figures that (a) could not be measured upon the basis of the information in the annual accounts drawn up according to Spanish generally accepted accounting principles (GAAP), or (b) were intrinsically connected to financial profitability (e.g., productivity), which clearly invalidates them as fair predictors for Y. This last pruning also had the side effect of inducing low separability of classes. After this, the features we considered were the following: (1) growth, (2) turnover of assets, (3) debt quality, (4) indebtedness, (5) use of fixed capital, (6) debt cost, (7) short-term liquidity, (8) share of labour costs, and (9) size.
In order to include the above figures in our model, we selected, for each concept, the financial variable that best measured it. Again, we were faced with the limitations caused by the relatively small amount of information in Spanish accounts. In order to avoid multicollinearity, each dimension is represented by one financial ratio. The prediction set finally selected appears in Table 2 (once more, distortions caused by the sector effect were corrected by dividing the ratio by the median for each branch of activity).
Table 3 displays descriptive statistics, by groups, for each variable in the prediction set. As expected, descriptive analysis clearly indicates positive skewness in the frequency distributions of most of the variables. Correlations among predictors (omitted for the sake of brevity) were almost null, which made dimension-reduction strategies based on some kind of principal components analysis practically unfeasible for our problem, and forced us to work with a relatively high dimensional (as well as not very separable) feature space. As is well known, in such kinds of contexts, parametric classifiers frequently outperform their model-free counterparts, reflecting the fact that identifying structure only from data in high dimension feature spaces usually requires prohibitive sample sizes and complex models.
With regard to the correlations between each predictor and the financial profitability ratio, the analysis showed no significant correlation with any of the variables, at least in respect of the usual significance levels (in all the significance tests the p-values were higher than 5%).
3.4. The classification techniques
As indicated above, we wish to compare the predictive performances of LDA and Logit with that of two model-free or non-parametric
Table 2
The set of financial variables

| Dimension | Variable | Code |
|---|---|---|
| Growth | Variation net turnover sales (percentage) | V01 |
| Turnover of assets | Operation income/total assets | V02 |
| Debt quality | Current liabilities/total debt | V03 |
| Indebtedness | Equity capital/total debt | V04 |
| Use of fixed capital | (Tangible fixed assets + intangible fixed assets)/total employment | V05 |
| Debt cost | Financial expenses/total debt | V06 |
| Short-term liquidity | Current assets/current debt | V07 |
| Share of labour costs | Labour cost/added value | V08 |
| Size | Average net turnover sales | V09 |

Table 3
Descriptive information relating to the financial variables

| Variable | Low profitability group: Mean | SD | Skewness | Kurtosis | High profitability group: Mean | SD | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| V01 | 45.979 | 891.664 | 24.173 | 615.740 | 6.151 | 77.400 | 34.639 | 1261.896 |
| V02 | 1.099 | 0.824 | 3.084 | 15.626 | 1.434 | 1.022 | 3.567 | 24.833 |
| V03 | 0.901 | 0.260 | -0.880 | 1.148 | 0.957 | 0.233 | -0.886 | 1.736 |
| V04 | 0.957 | 0.831 | -0.947 | 17.308 | 0.808 | 0.989 | -8.547 | 182.298 |
| V05 | 5.536 | 41.107 | 14.997 | 249.148 | 2.666 | 14.118 | 19.222 | 465.250 |
| V06 | 1.422 | 1.880 | 10.259 | 189.352 | 1.266 | 2.682 | 22.780 | 680.307 |
| V07 | 1.154 | 1.184 | 11.649 | 233.002 | 1.104 | 0.664 | 3.934 | 31.609 |
| V08 | 1.271 | 3.972 | -6.734 | 393.379 | 0.606 | 7.813 | -26.937 | 776.984 |
| V09 | 2.994 | 16.971 | 25.921 | 789.160 | 5.076 | 40.485 | 26.857 | 832.351 |
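The data preparation behind Tables 2 and 3 (sector-median scaling of each ratio, class labels taken from the extreme quartiles of profitability, and group-wise descriptive moments) could be sketched roughly as below. The function names are hypothetical, and several details are assumptions rather than facts from the paper: the quartile interpolation is the default of Python's `statistics.quantiles`, the SD uses the n - 1 divisor, skewness and kurtosis are plain moment ratios (non-excess kurtosis), and a zero sector median is not handled.

```python
import statistics


def sector_adjust(values, sectors):
    """Divide each firm's ratio by the median ratio of its own sector."""
    medians = {
        s: statistics.median(v for v, t in zip(values, sectors) if t == s)
        for s in set(sectors)
    }
    return [v / medians[s] for v, s in zip(values, sectors)]


def quartile_labels(profitability):
    """Y = 0 for the top quartile, 1 for the bottom quartile, None otherwise.

    Firms labelled None belong to the intermediate quartiles and are discarded.
    """
    q1, _, q3 = statistics.quantiles(profitability, n=4)
    return [0 if p >= q3 else 1 if p <= q1 else None for p in profitability]


def moments(xs):
    """Mean, sample SD, moment skewness and moment (non-excess) kurtosis."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return mean, statistics.stdev(xs), m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    # Toy figures only; they are not taken from the paper's database.
    ratio = sector_adjust([2.0, 4.0, 6.0, 10.0, 30.0], ["a", "a", "a", "b", "b"])
    print(ratio)
    print(quartile_labels([1, 2, 3, 4, 5, 6, 7, 8]))
    print(moments([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Applying `moments` separately to the firms with Y = 0 and Y = 1 would give one row of a table laid out like Table 3.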
techniques, namely, perceptron NNs and fuzzy rule-based systems with probabilistic output. From a statistical viewpoint, the problem of constructing a (two-class) classifier may be seen as one of first estimating, on the basis of a finite random sample $\{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$, the $L_2$-optimal conditional predictor for $Y$, namely the regression surface which passes through the conditional expectations of $Y$ given $X$, i.e., $E(Y \mid X = x) = P(Y = 1 \mid X = x)$. Once an estimate $\hat{P}(Y = 1 \mid X = x)$ is obtained for this expectation, [6] the standard Bayes rule is applied to the estimate (see Anderson, 1984), and we predict $\hat{y} = 1$ if $\hat{P}(Y = 1 \mid X = x) > 0.5$, otherwise $\hat{y} = 0$.
[6] For logistic regression, and also for the classes of NNs and FSs used here, $\hat{P}(Y = 1 \mid X = x)$ comes directly as the output of the estimation process. For LDA, the Bayes rule applies with estimated quantities replacing the unknown population parameters. Note that, for NNs and FSs, a common practice is to estimate the classifier directly, without first trying to estimate conditional probabilities. In this paper, by contrast, we impose a normalization of the outputs of NNs and FSs so that they are conditional probabilities. This essentially innocuous requirement (see Yang, 1999) permits the placing of all four classifiers within the same statistical decision framework.
For the case of LDA, a Box–Cox transformation of the data was carried out. For logistic regression classifiers, which do not require multivariate normality, no transformation was applied. Similarly, for NNs and FSs, no data transformation at all was applied. The rationale for this last choice is straightforward: No truly non-parametric method should require any kind of normalizing transform. Finally, the results of some additional experiments (analogous to those reported here) carried out for another three kinds of classifiers are not reported here, since the results proved scarcely relevant for various reasons. In particular, Probit models were considered, and finally discarded because their results and model semantics were very close to those of Logit classifiers. Quadratic discriminant analysis (QDA) was also tried, but seemed to perform very poorly with this
problem (even for large sample sizes, the error rates stabilized around 45%, considerably higher than for all the other techniques considered). [7] We also tried an additional non-parametric technique, kernel-based classifiers, also with a very poor performance of about 45% error rate in the largest samples. [8] In the following, practical details on applications are summarized.
[7] For QDA, the economic literature abounds in evidence of poor performances of this technique (e.g., Gilbert, 1974; Elliott and Kennedy, 1988). This result, counter-intuitive at least from a statistical viewpoint, seems to relate to the lack of robustness of standard quadratic discriminants in respect of strong deviations from multivariate normality, such as those appearing in the highly asymmetric cross-sectional distributions of financial ratios.
[8] A reasonable explanation for the poor performance of kernel-based classifiers in our problem seems to be the so-called curse of dimensionality. As is well known (e.g., Silverman, 1986), kernel-based density estimators converge very slowly in high dimension feature spaces, generally requiring enormous amounts of data, considerably larger than those available in this study, and possibly in many realistic economic contexts.
3.4.1. LDA and Logit
As to LDA, analogue (plug-in) estimates were used for the unknown parameters (e.g., Anderson, 1984). In order to come closer to meeting the requirement of multivariate normality (or, at least, sphericity), and following recommendations of previous papers on the statistical properties of financial ratios (e.g., Deakin, 1976; Watson, 1990, among others), natural logarithm transformations were applied to the variables. Evidence from the aforementioned literature shows that departures from both univariate and multivariate normality of the frequency distributions of most accounting ratios can be reduced by logarithmic transformations of the original data. In our case, preliminary tests indicated that, indeed, LDA performed slightly worse when no transformation was applied to data. Other Box–Cox transformations were also tried, with results similar to those of logarithms. As regards Logit, we estimated the models by standard (conditional) maximum likelihood techniques. Absence of correlation among predictors led us to discard the use of Logit models with
Fig. 1. Structure of perceptron classifiers.
interaction terms, which, in our preliminary tests, showed no improvement over standard Logit models. 3.4.2. Perceptron NNs From a theoretical viewpoint, multilayer perceptrons are universal approximators in many function spaces of practical relevance (e.g., Hornik et al., 1989), as well as being universal regression (and classification) tools (White, 1990; Lugosi and Zeger, 1995). Perceptron NNs have been extensively used in classification tasks in the field of finance, due to their capabilities of achieving good predictive results with non-normal data. 9 For our problem we have considered a rather standard topology, with two hidden layers. The first layer contains m neurons (m ¼ 1; 2; . . .), and the second hidden layer simply has a single unit which normalizes the output in order to be a probability (in both layers, logistic activation functions were used). Fig. 1 displays the architecture employed. Once the networks’ complexity was determined (by using the procedure to be indicated below), the training of the nets was carried out by using a nonlinear least squares criterion (an off-line steepest descent algorithm was employed). In some preliminary tests we detected that the effect of local minima was rather serious in respect of our problem. We found that an intensive random search, in order to find a good starting point for gradient-based iterations, led to greatly improved 9 For a review on the applications of NN to finance see Krishnaswamy et al. (2000).
J. de Andres et al. / European Journal of Operational Research 167 (2005) 518–542
solutions in a few steepest-descent steps. 10 Of course, many alternative optimization strategies, of a more sophisticated nature, could have been applied, but the considerable computational costs associated with our Monte Carlo analyses indicated that more elaborate procedures, adequate for a single estimation process, were largely unsuitable to our study. 3.4.3. The FS-based classifiers The FSs we used are rule-based systems with probabilistic output. The basic elements of an FSbased classifier for our two-class classification problem are (1) a fuzzy covering of the feature space X RN , i.e., a finite set of N -dimensional ~1; A ~2; . . . ; A ~ m , whose supports cover X , fuzzy sets A and (2) a rule base of the form ~ 1 ; then P ðY ¼ 1Þ ¼ p1 : Rule 1: If x 2 A ~2; Rule 2: If x 2 A
then P ðY ¼ 1Þ ¼ p2 :
~m; Rule m: If x 2 A
then P ðY ¼ 1Þ ¼ pm :
The idea is that the feature space is covered by a number of antecedent fuzzy sets (see, e.g., Kosko, 1992; Ishibuchi et al., 1994; Nauck and Kruse, 1997), which may partially overlap (as opposed to the case of crisp rules, where the feature space is split into pairwise disjoint regions). Each fuzzy rule separately assigns a probability $\pi_j$: formally stated, $\pi_j = P(Y = 1 \mid \tilde{A}_j)$ is the conditional probability that, for an individual that belongs to the fuzzy set $\tilde{A}_j$, the event Y = 1 may occur (see Fig. 2). Since an individual whose profile is X = x generally belongs (although generally at different degrees) to more than one of the $\tilde{A}_j$s (i.e., more than one of the fuzzy rules is activated simultaneously at x), the output of the system must combine the consequents of those rules which are active at that point, giving relatively more weight to consequents which correspond to more strongly activated rules. Additive (weighted average) structures such as those to be seen below are a leading choice in order to devise FSs which satisfy the above requirements in a simple way.

As opposed to crisp rule-based systems, which (as said above) rely on strategies of splitting, FSs have a number of practical advantages: they are smooth mappings, with nice analytical properties (e.g., they possess derivatives, and may be 'trained' by using gradient-based algorithms), which permit their use in a very similar way to other flexible regression tools such as NNs, keeping at the same time a semantic structure essentially identical to that of classic rule-based systems (Kosko, 1992, 1994).

For the implementation of the above structures we used a variant of the class of additive fuzzy systems with tensor product Gaussian membership functions. Analogously to perceptrons, this class of FSs possesses remarkable universal approximation properties (e.g., Wang and Mendel, 1992; Wang, 1994; Kreinovich et al., 2000; Landajo et al., 2001) together with non-parametric regression capabilities similar to those of NNs (Landajo, 2004). FSs are thus capable of learning arbitrary regression (and decision) surfaces. The analytical expression for the class of FSs to be used here is as follows:

$$A(x) = \frac{\sum_{j=1}^{m} \sigma(b_j) \prod_{i=1}^{N} a_{ij} \exp\left[-\frac{1}{2}\left(\frac{x_i - \mu_{ij}}{\sigma_{ij}}\right)^{2}\right]}{\sum_{j=1}^{m} \prod_{i=1}^{N} a_{ij} \exp\left[-\frac{1}{2}\left(\frac{x_i - \mu_{ij}}{\sigma_{ij}}\right)^{2}\right]},$$

with $x = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^N$. Since the consequent $\pi_j$ of each fuzzy rule must be a probability, our modification replaces the habitual consequents $b_j \in \mathbb{R}$ with a normalized output $\pi_j = \sigma(b_j) \in (0, 1)$, with $\sigma(\cdot)$ being an arbitrary continuous strictly increasing mapping taking values in [0,1] (i.e., a sigmoid). In this paper we have chosen $\sigma(z) = (1 + \exp(-5z + 2.5))^{-1}$ due to the fact that this function satisfies $\sigma(z) \approx z$ in the interval (0,1), although any other sigmoid would do similarly. The above normalization guarantees that $A(x) \in (0, 1)$ and permits the interpretation of the FS's output as an approximation to the conditional probability $E(Y \mid X = x) = P(Y = 1 \mid X = x)$. Once the above structure is estimated from data (we will denote by $\hat{A}$ the FS obtained), the plug-in Bayes rule (i.e., $\hat{y} = 1$ iff $\hat{A}(x) > 0.5$) is applied.

As to the training of the models, once the number of fuzzy rules (m) was selected (this proceeds by using the mechanism to be detailed below), we obtained an initial FS by using a clustering procedure based on an AVQ algorithm, similar to that proposed by Kosko (1992). Then, these values were used as initial estimates in a nonlinear least squares estimation procedure (again, an off-line steepest descent algorithm was used to solve the least squares problem).

Fig. 2. Semantic structure of fuzzy rule-based classifiers.

¹⁰ In order to reduce the risk of local optima, a number of strategies are possible. In our case, we found effective an expedient based on a Markov random search. The idea is as follows: we suppose that we must find an unknown point $\theta$ at which the mapping f achieves a global minimum in the set $\Theta \subset \mathbb{R}^q$; we start with an arbitrary choice, $\theta_0 \in \Theta$, and at each iteration we perturb $\theta_{t-1}$ with a random noise $u_t$. (In our applications, $u_t$ was a q-dimensional random vector with independent components, and the whole $(u_t)$ process was i.i.d.) Then, we set $\theta_t = \theta_{t-1} + u_t$ if $f(\theta_{t-1} + u_t) \le f(\theta_{t-1})$, and $\theta_t = \theta_{t-1}$ otherwise. Under mild conditions, random search strategies are convergent, in the sense that, as the number of trials $t \to \infty$, the probability of getting closer to a global optimum tends towards unity (e.g., Landajo, 2002). In practice, only a finite number of trials is possible. For our problem, we found that an initial Markov search at 1000m points (with m being the number of hidden units in the network's first layer) was sufficient to find a fairly good starting point for the gradient-based algorithms.

3.4.4. Determining complexity of NNs and FSs
In the above explanations about FSs and NNs, two additional details need to be specified: (1) for NNs, a mechanism for determining m, the number of units on the first hidden layer, and (2) for FSs, a procedure to determine the number of fuzzy rules (also denoted by m).
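Looking back at the additive fuzzy system of Section 3.4.3, the following minimal Python sketch shows how its output A(x) and the plug-in Bayes rule could be evaluated. It is an illustration only, not the Matlab code used in the study: the function names (`sigmoid`, `fuzzy_output`, `classify`) and the two-rule parameter values are hypothetical, and the symbols follow the reading in which each consequent is sigma(b_j), with Gaussian centres mu_ij and widths s_ij.

```python
import math

def sigmoid(z):
    """Normalizing map quoted in the text: 1 / (1 + exp(-5z + 2.5))."""
    return 1.0 / (1.0 + math.exp(-5.0 * z + 2.5))

def fuzzy_output(x, a, mu, s, b):
    """Evaluate A(x) for m rules and N features.

    a[j][i], mu[j][i], s[j][i]: weight, centre and width of rule j on feature i.
    b[j]: unconstrained consequent of rule j; sigmoid(b[j]) plays the role of pi_j.
    """
    num = 0.0
    den = 0.0
    for j in range(len(b)):
        # Tensor-product Gaussian activation of rule j at x.
        act = 1.0
        for i, xi in enumerate(x):
            act *= a[j][i] * math.exp(-0.5 * ((xi - mu[j][i]) / s[j][i]) ** 2)
        num += sigmoid(b[j]) * act
        den += act
    # Activation-weighted average of the rule consequents.
    return num / den

def classify(x, a, mu, s, b):
    """Plug-in Bayes rule: predict 1 iff A(x) > 0.5."""
    return 1 if fuzzy_output(x, a, mu, s, b) > 0.5 else 0

# Hypothetical two-rule, two-feature system (illustrative values only).
a = [[1.0, 1.0], [1.0, 1.0]]
mu = [[0.0, 0.0], [2.0, 2.0]]
s = [[1.0, 1.0], [1.0, 1.0]]
b = [0.1, 0.9]
```

With these made-up values, a point near the first centre is dominated by the first rule and a point near the second centre by the second rule, which is the weighting behaviour described for overlapping fuzzy sets. Because every consequent lies in (0, 1) and the output is a weighted average of them, the output stays in (0, 1) as well.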
From a mathematical viewpoint, both NNs and FSs may be seen as approximations to an unknown optimal Bayes decision surface. Roughly stated, a sufficiently complex NN or FS may approximate an arbitrary
discriminant surface to arbitrarily high accuracy. Obviously, in practice an adequate agreement must exist between the complexity of the models to be estimated and available sample sizes. Small sample sizes are only capable of supporting low-complexity NNs or FSs, while much more complex and flexible structures, capable of capturing finer details of the object to be learned, can only be estimated when large sample sizes are available. Hence, an intuitively appealing idea would be that of letting the complexity of the models be determined by the amount of statistical information at the researcher's disposal, by using any kind of deterministic or random increase rule. This idea may be adequately formalised by using concepts and techniques from the so-called method of sieves proposed by Grenander (1981). This is a general-purpose method which permits the derivation of well-behaved non-parametric estimators in very general settings.¹¹ In sieve estimation, consistent sequences of estimators are constructed on the basis of an increasing sequence of parametric models (taken from a class of universal approximators), the complexity of which (denoted by m) increases with sample size (n). Inasmuch as the complexity of the models increases with n at an adequately slow rate, convergence to the desired goal (in our case, an arbitrary decision surface) is obtained as the sample size approaches infinity. (This last expedient is, indeed, the hallmark of all non-parametric estimation techniques.) The above scheme, rather cumbersome at a theoretical level, is much easier to apply in practice. In this case, the method is implemented by the expedient of using either a deterministic increase rule¹² (e.g., $m = [c\sqrt{n}]$), or an adaptive or data-driven rule (e.g., some kind of cross-validated measure or information criterion, such as those in White (1990) and Sin and White (1996)) in order to select an adequate model complexity for a given sample size.

¹¹ For very general results on sieve estimation see, e.g., Shen (1997), and Chen and Shen (1998).
¹² Deterministic increase rules are common in the literature. Examples include series estimators (Andrews, 1991; Lugosi and Zeger, 1995), neural networks (White, 1990; Lugosi and Zeger, 1995), and kernel-based estimation, where a value for a smoothing parameter (the so-called bandwidth) must be selected, which may be carried out either by deterministic rules (i.e., functions of sample size n), or by stochastic rules such as cross-validation (see, e.g., Silverman, 1986). In De Andres et al. (2003) results analogous to those of this paper are obtained, although deterministic increase rules for NNs are used instead of data-driven methods.

From a purely statistical viewpoint, both NNs and FSs may be seen as particular classes of non-parametric sieve estimators (e.g., White, 1990; Kuan and White, 1994; Landajo, 2004). Following the spirit of the method of sieves, in this paper we have not considered neural or fuzzy models of a fixed complexity; this would be somewhat nonsensical, since most frequently NNs and FSs are only used to approximate systems which cannot be exactly represented by NNs or FSs of finite complexity. Instead, we permitted the number of model terms (neurons/fuzzy rules) to increase with sample size. Although, as mentioned above, purely deterministic increase rules are theoretically possible, data-driven selection procedures offer evident advantages (as well as being more commonly used in applied research, they are more intuitive and flexible) which make them more adequate for our problem. The expedient used in this paper was simply a standard model selection strategy based on the performance on a holdout sample: each training set of size n is provisionally split at random into a reduced estimation set and a test set and, in order to select an adequate model complexity, neurons/fuzzy rules are added as long as the classification error on the test set decreases. Obviously, this procedure may be seen as a rather crude form of validation (computational effort was a key consideration in order to avoid more sophisticated methods). Once each model's complexity (m) was determined, the model was re-estimated by the above-indicated methods, but using the whole training set of n observations.
As to the bounds for the model selection process, for perceptron nets we used the following rule: for sample sizes $n < 1600$, we permitted $1 \le m \le 2$ (i.e., NNs with up to 2 units in the first hidden layer), and, for $n \ge 1600$, we permitted $2 \le m \le 3$ units in the first hidden layer. Hence, complexity was selected by using a data-driven method, although within a set of admissible values indexed by n (i.e., a deterministically constrained set). No doubt, this increase rule may seem, at least at first glance, somewhat conservative. A number of reasons supported our choice: (1) These numbers clearly lie within the (theoretical) upper bounds permitted by known NN-based sieve estimation results (e.g., White, 1990; Lugosi and Zeger, 1995). (2) Related to this, results in Section 4 indicated no significant presence of overfitting since, even for fairly moderate sample sizes, error rates both on training and test sets are similar. (3) In addition, some preliminary tests detected no presence of underfitting, either (even for large sample sizes close to $n = 2400$, networks with three or more hidden units seemed to offer no predictive improvement with respect to simpler nets, and this at the cost of a considerable increase in computational effort).

For FSs, bearing in mind that in a high dimension feature space the number of parameters to be estimated increases much more rapidly for rule-based systems than for perceptrons, complexity controls must be tight. Hence, we have permitted only up to $m = 6$ fuzzy rules. Our strategy here was slightly different: we simply permitted the data-driven selection procedure to choose $2 \le m \le 6$, although preliminary tests suggested that FSs of low complexity sufficed, even for samples of around 2400 observations. An interesting point is that for this last model selection rule, dependence of m on sample size n does not appear explicitly, as in the increase rule we used for NNs above; however, it may be easily shown to be real. Indeed, when we try to construct very complex FSs on the basis of too-small data sets, the constructed systems will tend to overfit in the training samples, although they will generally predict very poorly on test sets, so that complexity selection mechanisms based on validation samples will automatically drive us towards simpler models.
Formally stated, when we try to fit a complex model to a too-small data set, validation mechanisms will tend to reject it with a high probability. Thus, instead of a deterministic rule m ¼ mðnÞ, we have a stochastic control which tends to constrain the model’s complexity in a similar way.
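The selection scheme of this subsection can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' routine: `fit` and `error_rate` are hypothetical caller-supplied hooks, the holdout fraction is an assumed value (the text does not report how the training set was split), and stopping at the first non-decrease is one reading of "added as long as the classification error decreases". Only the admissible ranges for m are taken from the text.

```python
import random

def admissible_range(n, model):
    """Bounds on m quoted in the text for each model family."""
    if model == "nn":
        return (1, 2) if n < 1600 else (2, 3)
    if model == "fs":
        return (2, 6)
    raise ValueError("model must be 'nn' or 'fs'")

def select_complexity(train, model, fit, error_rate, holdout_frac=0.25, rng=None):
    """Pick m by adding units/rules while the holdout error keeps decreasing.

    fit(data, m) -> classifier and error_rate(classifier, data) -> float are
    hypothetical hooks; holdout_frac is an assumption of this sketch.
    """
    rng = rng or random.Random()
    data = list(train)
    rng.shuffle(data)
    n_hold = max(1, int(round(holdout_frac * len(data))))
    holdout, estimation = data[:n_hold], data[n_hold:]
    lo, hi = admissible_range(len(data), model)
    best_m = lo
    best_err = error_rate(fit(estimation, lo), holdout)
    for m in range(lo + 1, hi + 1):
        err = error_rate(fit(estimation, m), holdout)
        if err >= best_err:  # stop as soon as the error no longer decreases
            break
        best_m, best_err = m, err
    # Re-estimate the selected model on the whole training set.
    return best_m, fit(data, best_m)
```

The final re-fit on all n observations mirrors the last step described in the text; the provisional estimation/holdout split is used only to choose m.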
3.5. Measuring the effects of sample size on each technique's performance
The basic strategy we used in the evaluation of each model's performance may be summarized as follows: a sample is randomly split into an estimation and a test set, and the model's performance is evaluated for both sets by using some appropriate error measure. Our approach may be seen as a Monte Carlo inspired refinement of the above idea. In essence, we replicated the same analysis for a large number of independent samples obtained from our database, which after pre-filtering came to 2836 cases. The idea is as follows: we considered sample sizes $n = 30, 60, \ldots, 2400$ (i.e., a 'step' length of 30 was taken). Once a sample size within the above range was selected, the following scheme was applied both for NNs and FSs:
(1) We randomly selected from the database a training set of n cases and an (independent) test set of 436 cases. This last number was taken as fixed for simplicity (it approximately constitutes 15% of the database) and we considered this to be a reasonable size for a test set.
(2) After this, the training set was used to select the model's complexity and estimate its free parameters, which tasks were carried out by following the routines mentioned in the above sections.
(3) Finally, the estimated classifier's performance was evaluated by calculating error rates both for the training and test sets.
This scheme was replicated independently $T = 100$ times for each sample size n, each time with different, independently selected, training and test sets. Once the above process was completed (for NNs, this required estimation and testing of as many as 24,000 models, and for FSs, even more), average error rates were computed for each sample size $n = 30, 60, \ldots, 2400$, separately for training and test sets. All the computations were programmed and performed by using Matlab 5.3, and executed on 10 standard PCs with Pentium II processors, each one running at 420 MHz.
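The replication scheme just described can be sketched as below. This is a hedged outline, not the original Matlab programs: `fit_and_score` is a hypothetical hook standing for steps (2) and (3), and the seed handling is an assumption added for reproducibility. The numbers 30, 2400, 436 and T = 100 are those quoted in the text.

```python
import random

def sample_sizes(step=30, largest=2400):
    """Training-set sizes n = 30, 60, ..., 2400 used in the text."""
    return list(range(step, largest + 1, step))

def draw_split(n_cases, n_train, n_test, rng):
    """Draw disjoint index sets for training and testing from the database."""
    picked = rng.sample(range(n_cases), n_train + n_test)
    return picked[:n_train], picked[n_train:]

def average_error_rates(n_cases, n_train, fit_and_score, n_test=436, T=100, seed=0):
    """Average (training, test) error rates over T independent replications.

    fit_and_score(train_idx, test_idx) -> (train_error, test_error) is a
    hypothetical hook wrapping model selection, estimation and evaluation
    for one classifier.
    """
    rng = random.Random(seed)
    train_sum = 0.0
    test_sum = 0.0
    for _ in range(T):
        # Fresh, independently drawn training and test sets in each replication.
        train_idx, test_idx = draw_split(n_cases, n_train, n_test, rng)
        train_err, test_err = fit_and_score(train_idx, test_idx)
        train_sum += train_err
        test_sum += test_err
    return train_sum / T, test_sum / T
```

Calling `average_error_rates(2836, n, ...)` once for every n in `sample_sizes()` gives the 80 averaged points per classifier on which the later analysis relies; the largest training set plus the test set (2400 + 436) uses the whole database of 2836 cases.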
As to LDA and Logit models, the process was analogous, with only one difference: there was no model selection stage, as the same structures were considered in all the simulations. Once the whole process was completed, average error rates for each classification device and sample size were available to be used as the input for our analyses. In principle, the process of averaging the results of a large number of independent replications should greatly enhance the robustness of our conclusions; it should be able to neutralize or, at least, greatly alleviate, the effects of factors such as sampling variability, as well as those of unduly poor performance of the models in some replications because of local minima. Some perils of excessive data-mining may also be kept reasonably under control, since (strictly speaking) our analysis is based on a large number of independent trials, and not on a one-case analysis, as both the training and the test sets are constantly renewed with each new replication, and the models' complexity is also determined again with each new training set. As a whole, the procedure should provide much more solid and trustworthy conclusions than those obtainable from the application of the naïve version of the evaluation process outlined at the beginning of this subsection.¹³ Upon the above-estimated average error rates, evaluated for a sufficiently wide range of sample sizes (since 2836 = 30 × 80 + 436, a total of 80 points is available), we may construct response surfaces which summarize, for each classifying device, the effects of variations of sample size on prediction error rates. These surfaces may be obtained by postulating some adequate functional form for that relation, and then obtaining the response surface's free parameters by least squares fitting. We will summarize all results, both average error rates and response surfaces, in the following section. As to what should be a priori expected of such an analysis, statistical theory provides some useful guides on which techniques may be more useful under certain circumstances.
Hence, when sample sizes are small, simpler classifiers in terms of number of parameters to be estimated are generally recommended, which essentially restricts the set of alternatives to parametric methods. In the event, unfortunately not so frequent in economic fields, of having at our disposal enough contour information to permit the postulation of an adequate parametric model, our choice is also obvious. As a general rule, non-parametric methods should be tried only when both simpler high-performance parametric alternatives are unavailable and sufficiently high sample sizes are at the researcher's disposal. As to the choice among different model-free techniques, it is well known (e.g., Girosi et al., 1993) that for high dimensional feature spaces, such as those considered in this paper, projection-based methods such as perceptron nets may have potential advantages over tensor-product-based structures, such as kernel estimators, rule-based systems or RBF nets. The reason is the well-known problem of the curse of dimensionality, which affects the latter more strongly than the former, with a huge increase in the number of free parameters to estimate as the dimension of the feature space X grows, even for models of a moderate complexity in terms of the number of neurons or fuzzy rules.

¹³ Some preliminary tests indicated that more complex NNs and FSs, or a higher number of replications of our experiments, appeared not to be able to change qualitatively this paper's conclusions. Although (rather mild) improvements still seemed possible in estimating prediction error rates, this could only be achieved by a dramatic increase of computational costs. Since we pursued the 'stylised facts' rather than a very precise estimation of error rates (which, in the end, were essentially case-dependent), the above expedient was considered to be adequate.

4. Main results
This section is structured as follows: First, some tables including average error rates for a few selected sample sizes (n = 30, 90, 150, 300, 600, 900, 1200, 1500, 1800, 2100 and 2400) are provided. Then, the estimated response surfaces, obtained from average error rates for the complete set of 80 different sample sizes, are displayed for each technique.

Table 4
Total percentage of errors in the test (estimation) sample

Size of the
estimation set (n)   LDA             Logit           NN              FS
30                   37.23 (15.83)   41.58 (18.36)   43.61 (17.47)   41.20 (21.85)
90                   36.49 (27.33)   38.67 (26.79)   39.99 (24.88)   37.00 (27.86)
150                  37.24 (31.35)   37.90 (29.40)   36.97 (25.85)   33.52 (28.05)
300                  36.90 (33.33)   35.76 (30.51)   33.99 (27.21)   32.45 (28.64)
600                  37.53 (35.71)   34.93 (31.72)   31.44 (26.61)   31.88 (29.88)
900                  37.60 (36.56)   35.21 (32.88)   30.28 (27.08)   31.53 (29.94)
1200                 38.15 (36.86)   35.10 (33.21)   30.03 (27.08)   30.90 (29.32)
1500                 38.32 (37.20)   35.33 (33.33)   28.80 (26.45)   31.86 (30.94)
1800                 38.37 (37.38)   34.89 (33.41)   28.04 (25.87)   31.03 (29.73)
2100                 38.12 (37.48)   35.10 (33.90)   27.59 (26.06)   30.11 (29.15)
2400                 38.46 (37.86)   35.42 (34.68)   26.85 (25.94)   31.69 (30.88)

4.1. Error rates
Tables 4–6 show the average error rates both for the estimation and test samples at the indicated points. The term 'type I error' denotes the case of erroneously assigning a high profitability company to the low profitability group. Analogously, the term 'type II error' will be used when a low profitability firm is wrongly classified as belonging to the high profitability group. Rates for overall errors summarize both sources of predictive errors. The relatively high error rates appear to confirm our initial intuition of low separability among classes (remember that, as mentioned in Section 2, a number of possible predictors were discarded from the outset upon logical consistency grounds). As an overall conclusion, non-parametric classifiers seem to outperform their parametric counterparts for the task we have considered in this paper, and only for fairly small sample sizes can we appreciate some advantage for parametric models. Among these, Logit seems to offer smaller values of total errors, although it suffers from larger type II errors than LDA. Among the non-parametric classifiers, NNs appear to perform somewhat better than FSs, probably due to the high dimension of the feature space, although the differences only tend to be noticeable when sample size becomes very large and, indeed, for moderate sample sizes FSs perform slightly better than perceptrons. It is
also noticeable that the NN-based classifiers also tend to make more type II errors than FSs. When the diagnostics of the training sets are compared to those obtained from the test sets, it is evident that, except for fairly low sample sizes (just where overfitting more strongly affects in-sample diagnostics, and just where out-of-sample indicators are more pessimistically biased), the differences between both kinds of diagnostics are rather small for all the classifiers. For NNs and FSs, this would mean that the somewhat conservative rules we used for determining the complexity of the models effectively controlled the risks of overfitting.

Table 5
Percentage of type I errors in the test (estimation) sample

Size of the
estimation set (n)   LDA             Logit           NN              FS
30                   36.16 (15.96)   38.14 (17.57)   41.53 (17.17)   43.65 (24.16)
90                   35.46 (27.04)   35.92 (25.12)   39.35 (24.75)   40.63 (29.97)
150                  35.85 (30.59)   34.76 (26.83)   35.85 (25.25)   36.55 (31.28)
300                  35.84 (32.58)   31.82 (26.96)   31.09 (25.54)   33.86 (29.73)
600                  35.79 (34.23)   30.27 (27.35)   29.00 (23.81)   33.60 (31.35)
900                  37.49 (36.63)   31.00 (28.75)   27.43 (24.16)   32.53 (31.50)
1200                 36.69 (35.78)   31.02 (29.28)   26.88 (24.06)   33.67 (32.00)
1500                 37.38 (36.57)   30.78 (28.78)   25.85 (23.63)   33.19 (32.29)
1800                 38.19 (37.21)   30.31 (28.75)   25.70 (23.88)   31.46 (29.87)
2100                 37.60 (36.81)   30.42 (29.15)   25.08 (23.26)   32.45 (31.40)
2400                 38.18 (37.34)   30.58 (29.69)   24.28 (23.57)   31.94 (31.00)

Table 6
Percentage of type II errors in the test (estimation) sample

Size of the
estimation set (n)   LDA             Logit           NN              FS
30                   38.44 (17.67)   45.02 (22.55)   45.38 (21.92)   38.92 (20.91)
90                   37.43 (29.42)   41.39 (30.11)   40.49 (26.74)   33.28 (25.88)
150                  38.49 (33.56)   41.00 (33.09)   37.90 (27.51)   30.50 (25.03)
300                  37.87 (35.00)   39.74 (34.49)   36.79 (29.72)   31.02 (27.62)
600                  39.16 (37.70)   39.51 (36.26)   33.87 (29.47)   30.26 (28.44)
900                  37.55 (36.89)   39.38 (37.15)   33.03 (30.12)   30.53 (28.53)
1200                 39.47 (38.20)   39.07 (37.28)   33.12 (30.17)   28.19 (26.66)
1500                 39.13 (38.05)   39.88 (37.90)   31.15 (29.33)   30.39 (29.63)
1800                 38.48 (37.76)   39.36 (38.14)   30.41 (27.89)   30.60 (29.64)
2100                 38.49 (38.32)   39.71 (38.67)   30.08 (28.90)   27.76 (26.93)
2400                 38.60 (38.41)   40.22 (39.67)   29.40 (28.32)   31.44 (30.76)

4.2. Response surfaces
At an intuitive level, our response surfaces may be seen as experience curves: as the amount of statistical information increases as new samples arrive, the classifiers perform better and better.
However, this improvement takes place at a decreasing rate, until the learning process stabilizes around a limit value, indicating the classifiers' asymptotic best predictive performances. For parametric classifiers, this optimal error rate coincides with the Bayes optimal error rate only when the model is correctly specified and the learning mechanism is consistent; otherwise, it is higher. For (consistent) model-free classifiers, since they asymptotically avoid the risks of incorrect model selection, they are capable in large samples of approaching the optimal Bayes error rate. Such a convergence process is generally slower than for their parametric counterparts (this is an unfortunate consequence of the so-called bias-variance dilemma). Of course, no improvement is possible beyond the optimal Bayes rate, since separability imposes natural limits on the learning process. The Bayes rate imposes a lower bound that cannot be surpassed by simply adding more samples, whichever classification device we may use.

Taking into account the above considerations, we have essentially focused on the following functional form: $e(n) = a + b/n^{c}$, with $e(n)$ being the classifier's average error rate for sample size n. This structure seems (when $c > 0$) a natural choice for our problem, since it may easily reflect effects such as monotonic convergence to a (minimum/maximum) limit error rate, as well as being parsimonious, in the sense that only a very low number of free parameters need to be estimated. As to the semantics of our response surfaces, in principle the idea is that they should reflect (although somewhat distorted by sampling variability) certain characteristics of the analyzed classifiers. In particular, the term a would be the limit value as $n \to \infty$, and parameters b and c govern the speed of variation of the error rate as n increases. Also, a should not be higher for non-parametric classifiers than for their parametric counterparts (in case of incorrect specification of parametric classifiers, it should definitely be lower, reflecting the asymptotic capability of model-free classifiers of achieving the optimal Bayes error rate, which is not generally achievable by incorrectly specified parametric models). As regards c, non-parametric estimators usually will tend to converge more slowly than their parametric counterparts. Some recent theoretical results by Yang (1999) shed some light on (minimax) optimal convergence rates of classification errors, as n increases, under the assumption that the Bayes discriminant surface $f(x) = P(Y = 1 \mid X = x)$ belongs to some specific function space (named non-parametric class, in statistical jargon). It may be shown that for a completely general classification problem, no generic convergence rates may exist, so this restriction to non-parametric classes is necessary, although results by Yang include the most commonly used non-parametric classes.
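As a rough illustration of how a surface of the form e(n) = a + b/n^c might be fitted to averaged error rates, the sketch below uses a simple profiling approach: for a fixed c the model is linear in (a, b), so those two are obtained by ordinary least squares, and c is searched on a grid. This is a technique swapped in for illustration, not necessarily the nonlinear least squares routine used in the paper; the function name `fit_response_surface` and the grid for c are assumptions.

```python
def fit_response_surface(ns, errors, c_grid=None):
    """Fit e(n) = a + b / n**c by profiling over c.

    For each candidate c the regressor z = n**(-c) is formed and (a, b) are
    the ordinary least squares intercept and slope; the c with the smallest
    residual sum of squares is kept. The grid is an assumption of this sketch.
    """
    if c_grid is None:
        c_grid = [k / 100.0 for k in range(1, 201)]  # c in 0.01 .. 2.00
    k = len(ns)
    mean_e = sum(errors) / k
    best = None
    for c in c_grid:
        z = [n ** (-c) for n in ns]
        mean_z = sum(z) / k
        szz = sum((zi - mean_z) ** 2 for zi in z)
        sze = sum((zi - mean_z) * (ei - mean_e) for zi, ei in zip(z, errors))
        b = sze / szz
        a = mean_e - b * mean_z
        sse = sum((ei - (a + b * zi)) ** 2 for zi, ei in zip(z, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c
```

On noise-free data generated from the functional form itself, the routine should recover the generating parameters whenever the true c lies on the grid; with real averaged error rates, a gradient-based nonlinear least squares solver started from this grid solution would be a natural refinement.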
As an example, when f is assumed to belong to a specific non-parametric class related to perceptron NNs, Yang (1999) reports that, for high dimension feature spaces, $e^{*}(n) \approx a + b/n^{1/4}$, with a being the Bayes error rate and $e^{*}(n)$ being the optimal error rate available for any classifier constructed on the basis of samples of size n.¹⁴ For parametric classifiers, convergence rates are usually more rapid, although this proviso does not necessarily apply when they are incorrectly specified. As a summary, the response surfaces may find support in the above-mentioned theoretical results on convergence rates for non-parametric estimation. For instance, in our results, perceptrons appear to converge in a way similar to that predicted by theory, although the convergence rates we estimated appear slower than Yang's optimal rates (see Table 8). On the other hand, we may expect that response surfaces estimated in-sample generally give optimistically biased results (i.e., excessively rapid convergence rates). More reliable results (although somewhat pessimistically biased) should be obtained when response surfaces are fitted on the basis of results for test samples. Notice that c is determined essentially at the points where error rates show highest curvature with respect to sample size, and this is precisely at the smallest sample sizes, which is just where in-sample error rates are more downwardly biased, and out-of-sample error rates are upwardly biased. In order to obtain each response surface's parameters, non-linear least squares fitting was used. In what follows, and due to space restrictions, results of the above procedure are given for the total-error response surfaces corresponding both to training and test sets (similar curves may be constructed for type I and type II errors).

4.2.1. LDA and Logit
Table 7 and Figs. 3 and 4 show the parameters and main diagnostics for the fitted response surfaces which correspond, respectively, to LDA and Logit classifiers. In all cases, the functional form was the 'quasi-hyperbolic' structure commented upon above. As regards LDA, the excellent fitting of the response curve obtained for average training error rates is remarkable.
A somewhat a typical form was obtained for the test sets’ response surface, in that for low sample sizes the error rate appears to
14
For related results, see Lugosi and Zeger (1995).
534
J. de Andres et al. / European Journal of Operational Research 167 (2005) 518–542
Table 7 Response surfaces for LDA and Logit classifiers (standard errors in parentheses) Statistics
LDA
a b c Adjusted R2 p-Value (F statistic)
Logit
Estimation set
Test set
Estimation set
Test set
38.786 (0.078) )220.770 (8.002) 0.665 (0.01) 0.996 0.000
32.780 (5.731) 2.619 (5.043) )0.096 (0.116) 0.724 0.000
35.524 (0.150) )104.799 (6.210) 0.540 (0.017) 0.989 0.000
34.670 (0.119) 86.565 (16.888) 0.714 (0.053) 0.908 0.000
Fig. 3. Average error rates and response surfaces for LDA. Error (estim.) = average error rate (estimation sample). Error (test) = average error rate (test sample). Pred-error (estim.) = predicted error rate (estimation sample). Pred-error (test) = predicted error rate (test sample).
show a slight trend towards increasing with sample size, until it stabilizes at an approximate ‘limit’ fixed value (notice the almost null value of c, indicating an extremely low rate of variation with n). However, taking into account the rather high standard errors associated with estimates for b and c, only the estimate for a appears to be statistically significant, which strongly supports the hypothesis that the response surface for errors on the test sets is, in fact, a flat line.
For the case of Logit, the goodness of fit of both response surfaces looks remarkably high, and the shape of the curves is as expected; the test sets’ error rates decrease with sample size until their stabilization at a limit value, and in the case of the training sets, the response surface increases until it converges to the same limit value, showing how the effects of overfitting rapidly diminish as sample size increases.
Fig. 4. Average error rates and response surfaces for Logit. Error (estim.) = average error rate (estimation sample). Error (test) = average error rate (test sample). Pred-error (estim.) = predicted error rate (estimation sample). Pred-error (test) = predicted error rate (test sample).
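As an illustration of how the fitted Logit surfaces behave, the following minimal sketch evaluates the quasi-hyperbolic form e(n) = a + b/n^c with the Logit parameters reported in Table 7 and computes the gap between the predicted test-set and estimation-set error rates. The function and variable names are ours and purely illustrative; the sketch ignores the standard errors of the estimates, and the two fitted limit values (35.524 and 34.670) differ slightly, so it should be read only for sample sizes within the simulated range.

```python
def response_surface(n, a, b, c):
    """Quasi-hyperbolic response surface e(n) = a + b / n**c (error rate in %)."""
    return a + b / n ** c


# Logit parameters (a, b, c) copied from Table 7.
LOGIT_ESTIMATION = (35.524, -104.799, 0.540)
LOGIT_TEST = (34.670, 86.565, 0.714)


def overfitting_gap(n):
    """Predicted test-set error minus predicted estimation-set error at size n."""
    return response_surface(n, *LOGIT_TEST) - response_surface(n, *LOGIT_ESTIMATION)


# Print both fitted curves and their gap for a few sample sizes.
for n in (100, 300, 600, 1200, 2400):
    est = response_surface(n, *LOGIT_ESTIMATION)
    test = response_surface(n, *LOGIT_TEST)
    print(f"n={n:5d}  estimation={est:6.2f}  test={test:6.2f}  gap={overfitting_gap(n):5.2f}")
```

Within the range considered, the gap shrinks steadily as n grows, which is the pattern of diminishing overfitting described above.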
4.2.2. Perceptron NNs
In respect of the response surface for test sets, we have used the same functional forms as above. However, for the training-sets-based error surface, error rates clearly show (see Fig. 5) a small jump at the point n = 1600, just where the number of permitted neurons changes from 1 ≤ m ≤ 2 to 2 ≤ m ≤ 3. This jump downwards simply reflects the fact that more complex NNs fit the data somewhat better. However, there appears to be no jump at the same point in the case of the error rates when measured on the test sets (see Fig. 5), and statistical tests strongly supported that indeed this was the case. Hence, at least for n 1600, networks with more than two neurons appeared to be unnecessarily complex from a predictive accuracy viewpoint. Upon the above considerations, for the training-sets-based response surface we have taken the
Fig. 5. Average error rates and response surfaces for perceptrons. Error (estim.) = average error rate (estimation sample). Error (test) = average error rate (test sample). Pred-error (estim.) = predicted error rate (estimation sample). Pred-error (test) = predicted error rate (test sample).
following functional form: $e(n) = a_1 + a_2\,\mathrm{jump} + b/n^{c}$, with jump being a dummy variable which equals 1 if n ≥ 1600, and 0 otherwise. Obviously, the inclusion of this binary variable permits the evaluation of the effect of the change in the NNs’ structure. Table 8 displays the main results. Once more, the fitted response surfaces appear to fit the data rather well. The test-set response curve, as expected, decreases as sample size rises. For in-sample error rates the behaviour is approximately symmetric, indicating that overfitting decreases as sample size grows, unless we add more hidden units, which obviously is needless unless predictive accuracy improves with more complex nets.
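The functional form with the jump dummy can be fitted in several ways. The sketch below is one possible approach, not necessarily the routine used in this study: since the model is linear in (a1, a2, b) once c is fixed, it profiles over a grid of c values and solves an ordinary least squares problem at each one. The data are synthetic and noise-free, generated from the estimation-set values reported in Table 8; they are not the Monte Carlo averages of this paper, and all names (`fit_jump_surface`, `solve_linear`, the grid bounds) are illustrative assumptions.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, size))
        solution[i] = (aug[i][size] - tail) / aug[i][i]
    return solution


def fit_jump_surface(sizes, errors, threshold=1600, c_grid=None):
    """Fit e(n) = a1 + a2*jump + b/n**c, with jump = 1 if n >= threshold.

    For each candidate c the model is linear in (a1, a2, b), so the normal
    equations are solved and the c with the smallest squared error is kept.
    The sizes must fall on both sides of the threshold, otherwise the dummy
    is collinear with the constant term.
    """
    if c_grid is None:
        c_grid = [k / 1000 for k in range(50, 2001)]  # assumed grid: 0.050 to 2.000
    best = None
    for c in c_grid:
        rows = [[1.0, 1.0 if n >= threshold else 0.0, n ** -c] for n in sizes]
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * e for r, e in zip(rows, errors)) for i in range(3)]
        a1, a2, b = solve_linear(xtx, xty)
        sse = sum((e - (a1 * r[0] + a2 * r[1] + b * r[2])) ** 2
                  for r, e in zip(rows, errors))
        if best is None or sse < best[0]:
            best = (sse, a1, a2, b, c)
    return best[1:]


# Synthetic, noise-free illustration built from the Table 8 estimation-set values.
TRUE_A1, TRUE_A2, TRUE_B, TRUE_C = 27.085, -1.027, -688.366, 1.241
sizes = list(range(100, 2401, 100))
errors = [TRUE_A1 + TRUE_A2 * (n >= 1600) + TRUE_B / n ** TRUE_C for n in sizes]

a1, a2, b, c = fit_jump_surface(sizes, errors)
print(f"a1={a1:.3f}  a2={a2:.3f}  b={b:.3f}  c={c:.3f}")
```

The grid search is deliberately simple; with noisy data a dedicated non-linear least squares routine would also provide the standard errors reported in the tables.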
4.2.3. FS-based classifiers
Table 9 shows the main results for the response surfaces we fitted for FS classifiers. We used the standard functional form proposed at the beginning of the section. Notice that the (essentially unconstrained) data-driven model selection procedure appears to flatten the curves (see Fig. 6), in the sense that no jumps appear that may be modelled by dummies. This is not surprising, since the jumps appear at random points, which produces the effect of smoothing the response surfaces. The response surface for average errors on the test set fitted fairly well, in a way similar to perceptrons, although somewhat lower. A comparison of the values obtained for b and c in each response surface clearly shows that in-sample error rates converge much more rapidly than their out-of-sample counterparts. This happens both for
Table 8. Response surfaces for perceptron NNs (standard errors in parentheses)

| Parameters | Estimation set | Test set |
|---|---|---|
| a1 | 27.085 (0.091) | 10.941 (3.291) |
| a2 | −1.027 (0.111) | – |
| b | −688.366 (219.865) | 56.024 (1.103) |
| c | 1.241 (0.090) | 0.156 (0.023) |
| Adjusted R² | 0.898 | 0.980 |
| p-Value (F statistic) | 0.000 | 0.000 |

Table 9. Response surfaces for FS-based classifiers (standard errors in parentheses)

| Parameters | Estimation set | Test set |
|---|---|---|
| a | 30.071 (0.103) | 30.253 (0.257) |
| b | −292.763 (107.360) | 94.050 (20.146) |
| c | 1.042 (0.100) | 0.627 (0.059) |
| Adjusted R² | 0.808 | 0.881 |
| p-Value (F statistic) | 0.000 | 0.000 |

Fig. 6. Average error rates and response surfaces for FSs. Error (estim.) = average error rate (estimation sample). Error (test) = average error rate (test sample). Pred-error (estim.) = predicted error rate (estimation sample). Pred-error (test) = predicted error rate (test sample).
NNs and FSs. A similar (although not so clear-cut) phenomenon occurs for LDA and Logit.
Finally, a point that we have not explicitly considered in our analyses is that of the time needed to estimate the classifiers. Obviously, non-parametric systems are generally harder to fit, in terms of CPU time, than their parametric counterparts, although, in our view, this point may be negligible in most applications, with the possible exception of those cases where classifiers may be required to work in real time.
4.2.4. Some practical conclusions
Although, as indicated throughout the text, universally valid principles for empirical work in this field are probably unattainable, a few practical considerations or common sense rules which emerge from the analysis of the above results may be of practical relevance to related applications:
1. We support the usual approach of first trying parametric classifiers, and then trying to improve results by using model-free techniques if sample sizes are sufficient.
2. As to the sample size that we may call ‘sufficient’, although it theoretically depends on the dimensionality of the feature space, in the studied 9-dimension feature space non-parametric methods start to be competitive for sample sizes around n = 150. This may be taken as a prudent bound. Since the required sample sizes grow rapidly with the dimension of the feature space, for lower numbers of explanatory variables we may reasonably expect the required sample sizes to be considerably lower than the above figure. Hence, analyses with large samples split by branches of activity may appear possible in realistic contexts, under the condition, of course, that we accommodate the models’ complexity to sample sizes available for each branch.
3. Related to the above comment, whenever non-linearities actually play a significant role in our problem, even fairly small numbers of neurons/rules (e.g., 2 ≤ m ≤ 3) should clearly indicate improvements on linear methods, for fairly moderate sample sizes. When such an improvement is not apparent from the beginning (i.e., for moderate sample sizes and moderately complex nets), probably more complex models will not be able to work better, at least for realistic sample sizes (indeed, it is not so strange that for not very large sample sizes linear discriminators work better than poorly learned non-linear decision surfaces).
4. When feature space is high dimensional, after trying Logit and LDA, we suggest trying perceptron NNs, or any other projection-based method. For feature spaces of lower dimension (not higher than 5 or 6), FSs can first be tried.
5. Our results confirm previous results on the poor performance of QDA, as well as theoretical advice in respect of the serious inefficiency of kernel-based classifiers in high-dimension feature spaces.
6. In high-dimension feature spaces, FSs seem to perform considerably better than (closely related) kernel classifiers, and they work similarly to perceptrons, although in very large samples, a somewhat higher performance of perceptrons is observed.
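The sample-size considerations above can be explored numerically with the fitted test-set surfaces. The following minimal sketch uses the test-set parameters reported in Tables 7, 8 and 9 for Logit, perceptrons and FSs (LDA is left out because, as noted earlier, only the constant of its test-set surface is significant). It ranks the three classifiers by predicted test error at a given n and searches a grid for the first sample size at which one surface falls below another. The helper names and the grid are illustrative assumptions, and the standard errors of the estimates are ignored, so the output is only a rough reading of the fitted curves.

```python
# Test-set response surfaces e(n) = a + b / n**c, parameters (a, b, c)
# copied from Tables 7 (Logit), 8 (perceptrons) and 9 (FSs).
TEST_SURFACES = {
    "Logit": (34.670, 86.565, 0.714),
    "Perceptron": (10.941, 56.024, 0.156),
    "FS": (30.253, 94.050, 0.627),
}


def predicted_test_error(name, n):
    """Predicted average test-set error rate (%) of a classifier at sample size n."""
    a, b, c = TEST_SURFACES[name]
    return a + b / n ** c


def ranking(n):
    """Classifier names sorted from lowest to highest predicted test error."""
    return sorted(TEST_SURFACES, key=lambda name: predicted_test_error(name, n))


def first_size_where_better(challenger, benchmark, sizes):
    """First n in `sizes` where the challenger's predicted error is lower, else None."""
    for n in sizes:
        if predicted_test_error(challenger, n) < predicted_test_error(benchmark, n):
            return n
    return None


grid = range(50, 2401, 10)  # assumed grid of sample sizes
crossover = first_size_where_better("Perceptron", "Logit", grid)
print("Perceptron surface falls below Logit from about n =", crossover)
for n in (100, 150, 600, 2400):
    print(n, ranking(n))
```

On these fitted curves the perceptron surface drops below the Logit surface between n = 100 and n = 150, and perceptrons rank first at the largest sample sizes, in line with points 2 and 6 above.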
5. Research limitations
As defined above, the methodology of this study has several limitations that are either impossible or not sufficiently cost-effective to overcome. First of all, although available results (e.g., De Andres, 2000) suggest that the Spanish case may well be representative of most European countries, and, therefore, our results may be of a more general scope than a priori expected, it must be taken into account that the validity of some of our conclusions may be, strictly speaking, limited to the studied case.
Secondly, the database we used has several informational limitations as it only includes financial data from the companies’ annual accounts. This limited availability of information is responsible for many of the ‘low separability’ characteristics of our approach. The inclusion of non-financial data among the set of predictors, such as, for example, variables describing business organization and strategy, which can be obtained only from interviews with managers, would probably have increased the explanatory power of the models. However, when considering large sets of firms, it is generally not cost-effective to include such data in the model. On the other hand, it can be presumed that the inclusion of organizational and strategic information would not have affected the main conclusions of this research, since such data possess statistical properties (such as non-normality and lower-boundedness) which are very similar to those of the financial ratios used in this paper.
6. Concluding remarks and further research
Classification systems hold an important place as research techniques in economics and finance. In this paper, we have focused on a problem to which little attention had been paid in the literature to date, namely forecasting business profitability, with some specific features, such as low separability and high-dimension feature space, which make prediction a non-trivial task.
We adopted a Monte Carlo approach, by combining the results of the analyses of a large number of artificial samples randomly extracted from a real world database (as indicated, the Spanish case seems to have close cousins on a European scale). The performances of two mainstream parametric techniques (LDA and Logit) were compared to those of two non-parametric classifiers (perceptron NNs and additive FSs with Gaussian membership functions). Our approach may be regarded as ‘dynamic’, in the sense that we tried to evaluate the effects of variation of sample size on each technique’s predictive performance, rather than carry out a crude comparison on the basis of a single fixed-size data set. A number of simple and interpretable response surfaces were constructed for measuring how each technique’s performance evolves with sample size. As a general conclusion, our results essentially indicate a clear advantage of non-parametric systems over parametric alternatives, even for moderate sample sizes.
A number of possible extensions of our analyses seem of clear interest. The most straightforward is analyzing the robustness of our conclusions to changes in the database. In addition, the analysis should also be extended to other relevant business classification problems (e.g., financial distress prediction and analysis of management decisions). Of particular interest is the extension of our analyses in order to include other techniques, such as classifiers based on genetic algorithms and mathematical programming, which have proved to be highly effective in closely related problems, but have been omitted here for the sake of brevity. These paradigms may surely perform competitively in respect of the class of problems we analyzed here, and will surely show similar statistical behaviours.
We have approached the profitability forecasting problem as a qualitative prediction problem, of a class membership indicator. Certainly, there is some loss of information when profitability, which may be measured on a continuous scale, is replaced by a dichotomous profitability indicator, and our strategy of discarding the intermediate quartiles in order to obtain the two groups increases this loss. To date, attempts to model the profitability of firms through approaches, such as standard regression, which do not imply such a loss of information, appear to have largely failed. Such failure has been attributed to the unavailability of adequate parametric functional forms for modelling the relation between profitability and explanatory variables (for more details see, e.g., Gort (1963) and Harris (1976)). The general purpose non-parametric regression capabilities of NNs and FSs ensure that they are capable of estimating arbitrary regression surfaces, and that, at least in large samples, they would be able to overcome the specification problems, although certainly this specific prediction problem may possess an inherent degree of uncertainty which makes it difficult to approach by whatever means we use.
This suggests an interesting research goal, that of testing whether our problem may be usefully approached as a non-parametric regression, with stronger predictive targets than those of class-membership forecasting.
Finally, development of improved estimation and model selection strategies for FSs, more computationally efficient and adequate for high dimensional problems, seems another interesting research goal, in order to simultaneously maintain the semantic clarity of fuzzy rule-based classifiers without having to renounce the predictive accuracy of projection-based approaches when used in high-dimension problems.
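The dichotomization mentioned above, in which the intermediate quartiles are discarded to obtain the two groups, may be sketched as follows. This is only a minimal illustration under stated assumptions: the firm identifiers and profitability values are made up, the quartile convention is the default of Python's `statistics.quantiles` (the convention actually used in the study is not restated here), and ties at the cut points are assigned to the outer groups.

```python
import statistics


def dichotomize_by_quartiles(profitability):
    """Label firms 'low' or 'high' by the outer quartiles of profitability.

    `profitability` maps a firm identifier to a profitability measure.
    Firms falling strictly between the first and third quartiles are dropped,
    which is where the additional loss of information comes from.
    """
    q1, _median, q3 = statistics.quantiles(profitability.values(), n=4)
    labels = {}
    for firm, value in profitability.items():
        if value <= q1:
            labels[firm] = "low"
        elif value >= q3:
            labels[firm] = "high"
    return labels


# Hypothetical toy data: eight firms with profitability 1.0, 2.0, ..., 8.0.
toy = {f"firm_{i}": float(i) for i in range(1, 9)}
labels = dichotomize_by_quartiles(toy)
print(labels)
```

A regression-based approach would instead keep all firms and the continuous profitability measure as the target.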
Appendix A. Companies in the database detailed by branch of activity

| No. | Name | Low efficiency | High efficiency | Total |
|---|---|---|---|---|
| 01 | Agriculture, hunting and related service activities | 29 | 26 | 55 |
| 02 | Forestry, logging and related service activities | 2 | 3 | 5 |
| 05 | Fishing, operation of fish hatcheries and fish farms; service activities incidental to fishing | 5 | 5 | 10 |
| 10 | Mining of coal and lignite; extraction of peat | 6 | 4 | 10 |
| 11 | Extraction of crude petroleum and natural gas; service activities incidental to oil and gas extraction, excluding surveying | 1 | 1 | 2 |
| 13 | Mining of metal ores | 2 | 2 | 4 |
| 14 | Other mining and quarrying | 3 | 6 | 9 |
| 15 | Manufacture of food products and beverages | 80 | 69 | 149 |
| 16 | Manufacture of tobacco products | 2 | 0 | 2 |
| 17 | Manufacture of textiles | 33 | 35 | 68 |
| 18 | Manufacture of leather clothes | 19 | 19 | 38 |
| 19 | Tanning and dressing of leather | 7 | 5 | 12 |
| 20 | Manufacture of wood and products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials | 4 | 6 | 10 |
| 21 | Manufacture of pulp, paper and paper products | 21 | 26 | 47 |
| 22 | Publishing, printing and reproduction of recorded media | 20 | 34 | 54 |
| 23 | Manufacture of coke, refined petroleum products and nuclear fuel | 0 | 1 | 1 |
| 24 | Manufacture of chemicals and chemical products | 60 | 48 | 108 |
| 25 | Manufacture of rubber and plastic products | 29 | 28 | 57 |
| 26 | Manufacture of other non-metallic mineral products | 43 | 32 | 75 |
| 27 | Manufacture of basic metals | 23 | 12 | 35 |
| 28 | Manufacture of fabricated metal products, except machinery and equipment | 33 | 27 | 60 |
| 29 | Manufacture of machinery and equipment n.e.c. | 35 | 27 | 62 |
| 30 | Manufacture of office machinery and computers | 0 | 1 | 1 |
| 31 | Manufacture of electrical machinery and apparatus n.e.c. | 22 | 26 | 48 |
| 32 | Manufacture of radio, television and communication equipment and apparatus | 9 | 5 | 14 |
| 33 | Manufacture of medical, precision and optical instruments, watches and clocks | 7 | 6 | 13 |
| 34 | Manufacture of motor vehicles, trailers and semi-trailers | 41 | 33 | 74 |
| 35 | Manufacture of other transport equipment | 17 | 12 | 29 |
| 36 | Manufacture of furniture; manufacturing n.e.c. | 15 | 11 | 26 |
| 37 | Recycling | 3 | 3 | 6 |
| 40 | Electricity, gas, steam and hot-water supply | 4 | 4 | 8 |
| 41 | Collection, purification and distribution of water | 8 | 7 | 15 |
| 45 | Construction | 117 | 141 | 258 |
| 50 | Sale, maintenance and repair of motor vehicles and motorcycles; retail sale of automotive fuel | 22 | 23 | 45 |
| 51 | Wholesale trade and commission trade, except of motor vehicles and motorcycles | 159 | 153 | 312 |
| 52 | Retail trade, except of motor vehicles and motorcycles; repair of personal and household goods | 63 | 56 | 119 |
| 55 | Hotels and restaurants | 64 | 75 | 139 |
| 60 | Land transport; transport via pipelines | 39 | 41 | 80 |
| 61 | Water transport | 5 | 4 | 9 |
| 62 | Air transport | 4 | 6 | 10 |
| 63 | Supporting and auxiliary transport activities; activities of travel agencies | 26 | 27 | 53 |
| 64 | Post and telecommunications | 11 | 10 | 21 |
| 65 | Financial intermediation, except insurance and pension funding | 2 | 4 | 6 |
| 66 | Insurance and pension funding, except compulsory social security | 1 | 1 | 2 |
| 67 | Activities auxiliary to financial intermediation | 2 | 7 | 9 |
| 70 | Real estate activities | 13 | 21 | 34 |
| 71 | Renting of machinery and equipment without operator and of personal and household goods | 4 | 1 | 5 |
| 72 | Computer and related activities | 27 | 28 | 55 |
| 73 | Research and development | 1 | 3 | 4 |
| 74 | Other business activities | 175 | 180 | 355 |
| 75 | Public administration and defence; compulsory social security | 3 | 2 | 5 |
| 80 | Education | 12 | 16 | 28 |
| 85 | Health and social work | 34 | 38 | 72 |
| 90 | Sewage and refuse disposal, sanitation and similar activities | 11 | 11 | 22 |
| 91 | Activities and membership organizations n.e.c. | 1 | 1 | 2 |
| 92 | Recreational, cultural and sporting activities | 30 | 38 | 68 |
| 93 | Other service activities | 9 | 7 | 16 |
| Total | | 1418 | 1418 | 2836 |
References
Anderson, T.W., 1984. An Introduction to Multivariate Statistical Analysis, second ed. Wiley, New York.
Andrews, D.W.K., 1991. Asymptotic normality of series estimators for nonparametric and semiparametric regression models. Econometrica 59 (2), 307–345.
Altman, E.I., Marco, G., Varetto, F., 1994. Corporate distress diagnosis: Comparisons using linear discriminant analysis and neural networks (the Italian experience). Journal of Banking and Finance 18, 505–529.
Bell, T.B., Ribar, G.S., Verchio, J.R., 1990. Neural nets versus logistic regression: A comparison of each model’s ability to predict commercial bank failures. In: Proceedings of the Deloitte & Touche/University of Kansas Symposium of Auditing Problems, pp. 29–53.
Bertels, K., Jacques, J.M., Neuberg, L., Gatot, L., 1999. Qualitative company performance evaluation: Linear discriminant analysis and neural network models. European Journal of Operational Research 115, 608–615.
Braun, H., Chandler, J.S., 1987. Predicting stock market behavior through rule induction: An application of the learning-from-examples approach. Decision Sciences (Summer), 415–429.
Brief, R.P., Lawson, R.A., 1992. The role of the accounting rate of return in financial statement analysis. Accounting Review 67 (2), 411–426.
Chen, X., Shen, X., 1998. Sieve extremum estimates for weakly dependent data. Econometrica 66 (2), 289–314.
Coats, P.K., Fant, L.F., 1993. Recognizing financial distress patterns using a neural network tool. Financial Management 22 (3), 142–155.
Cronan, T.P., Glorfeld, L.W., Perry, L.G., 1991. Production system development for expert systems using a recursive partitioning induction approach: An application to mortgage, commercial and consumer lending. Decision Sciences 22 (4), 812–845.
De Andres, J., 2000. Los parámetros característicos de las empresas manufactureras de alta rentabilidad. Una aplicación del análisis discriminante. Revista Española de Financiación y Contabilidad 104, 443–482.
De Andres, J., 2001. Statistical techniques vs. SEE5 algorithm. An application to a small business environment. International Journal of Digital Accounting Research 1 (2), 157–184.
De Andres, J., Rodríguez, E., Gonzalez, B., 1998. Logistic regression vs. C4.5 algorithm. An application to a small business environment. In: Bonsón Ponte, E., Vasarhelyi, M. (Eds.), Emerging Technologies in Accounting and Finance. Huelva, Spain, pp. 101–114.
De Andres, J., Landajo, M., Lorca, P., 2003. Forecasting business efficiency by using classification techniques: A comparative analysis based on a Spanish case. Working Paper, SSRN.
Deakin, E.B., 1976. Distribution of financial accounting ratios: Some empirical evidence. Accounting Review 51 (1), 90–96.
Didzarevich, S., Lizarraga, F., Larrañaga, P., Sierra, B., Gallego, M.J., 1997. Statistical and machine learning methods in the prediction of bankruptcy. In: Sierra Molina, G., Bonsón Ponte, E. (Eds.), Intelligent Technologies in Accounting and Business. Huelva, Spain, pp. 85–100.
Elliott, J.A., Kennedy, D.B., 1988. Estimation and prediction of categorical models in accounting research. Journal of Accounting Literature 7, 202–242.
Friedman, J.H., Stuetzle, W., 1981. Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.
Frydman, H., Altman, E.I., Kao, D.L., 1985. Introducing recursive partitioning for financial classification: The case of financial distress. Journal of Finance 40 (1), 269–291.
Garrison, L.R., Michaelsen, R.H., 1989. Symbolic concept acquisition: A new approach to determining underlying tax law constructs. The Journal of the American Tax Association (Fall), 77–91.
Gilbert, G.G., 1974. Predicting de novo expansion on bank merger cases. Journal of Finance 29 (1), 151–162.
Girosi, F., Jones, M., Poggio, T., 1993. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, MIT.
Gort, M., 1963. Analysis of stability and change in market shares. Journal of Political Economy 71 (1), 51–63.
Goss, E.E.P., Ramchandani, H., 1995. Comparing classification accuracy of neural networks, binary Logit regression and discriminant analysis for insolvency prediction of life insurers. Journal of Economics and Finance 19 (3), 1–18.
Greenstein, M.M., Welsh, M.J., 1996. Bankruptcy prediction using ex-ante neural networks and realistically proportioned testing sets. In: Sierra Molina, G., Bonsón Ponte, E. (Eds.), Intelligent Systems in Accounting and Finance. Huelva, Spain, pp. 187–212.
Grenander, U., 1981. Abstract Inference. Wiley, New York.
Hansen, P., Jaumard, B., Saulaville, E., 1994. Partitioning problems in cluster analysis: A review of mathematical programming approaches. In: Diday, E., Lechevalier, Y., Schader, M., Bertrand, P., Burtschy, B. (Eds.), New Approaches in Classification and Data Analysis. Springer, Berlin, pp. 228–240.
Harris, M.N., 1976. Entry and barriers to entry. Industrial Organization Review 3, 165–175.
Hornik, K., Stinchcombe, M., White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366.
Ishibuchi, H., Nozaki, K., Yamamoto, N., Tanaka, H., 1994. Construction of fuzzy classification systems with rectangular fuzzy rules using genetic algorithms. Fuzzy Sets and Systems 65 (2–3), 237–253.
Jang, J.S.R., Sun, C.T., 1993. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks 4 (1), 156–159.
Kattan, M.W., Cooper, R.B., 2000. A simulation of factors affecting machine learning techniques: An examination of partitioning and class proportions. Omega 28, 501–512.
Kelly, G., Tippet, M., 1991. Economic and accounting rates of return: A statistical model. Accounting & Business Research 21 (4), 321–329.
Kosko, B., 1992. Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice-Hall, Englewood Cliffs, NJ.
Kosko, B., 1994. Fuzzy Systems as Universal Approximators. IEEE Transactions on Computers 43 (11), 1329–1333.
Kreinovich, V., Nguyen, H.-T., Yam, Y., 2000. Fuzzy systems are universal approximators for a smooth function and its derivatives. International Journal of Intelligent Systems 15, 565–574.
Krishnaswamy, C.R., Gilbert, E.W., Pashley, M.M., 2000. Neural networks applications in finance: A practical introduction. Financial Practice and Education 10 (1), 75–84.
Kuan, C.M., White, H., 1994. Artificial neural networks: An econometric approach. Econometric Reviews 13 (1), 1–91.
Landajo, M., 2002. Some stochastically convergent strategies for global optimization with applications to neural network training and nonlinear estimation. Unpublished manuscript.
Landajo, M., 2004. A note on model-free regression capabilities of fuzzy systems. IEEE Transactions on Systems, Man and Cybernetics, Part B 34 (1), 645–651.
Landajo, M., Río, M.J., Perez, R., 2001. A note on smooth approximation capabilities of fuzzy systems. IEEE Transactions on Fuzzy Systems 9 (2), 229–237.
Liang, T.P., Chandler, J.S., Han, I., Roan, J., 1992. An empirical investigation of some data effects on the classification accuracy of Probit, ID3 and neural networks. Contemporary Accounting Research 9 (Fall), 306–328.
Lugosi, G., Zeger, K., 1995. Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory 41 (3), 677–687.
Mahmood, M.A., Sullivan, G.L., Tung, R.L., 1999. A new approach to evaluating business ethics: An artificial neural networks application. Journal of End User Computing 11 (3), 11–19.
Mak, B., Munakata, T., 2002. Rule extraction from expert heuristics: A comparative study of rough sets with neural networks and ID3. European Journal of Operational Research 136, 212–229.
Malhotra, R., Malhotra, D.K., 2002. Differentiating between good credits and bad credits using neuro-fuzzy systems. European Journal of Operational Research 136, 190–211.
Marais, M.L., Patell, J.M., Wolfson, M.A., 1984. The experimental design of classification models: An application of recursive partitioning and bootstrapping to commercial bank loan classifications. Journal of Accounting Research 22 (Suppl.), 87–114.
Markham, I.S., Mathieu, R.G., Wray, B.A., 2000. Kanban setting through artificial intelligence: A comparative study of artificial neural networks and decision trees. Integrated Manufacturing Systems 11 (4), 239–259.
McKee, T.E., Lensberg, T., 2002. Genetic programming and rough sets: A hybrid approach to bankruptcy classification. European Journal of Operational Research 138, 436–451.
Nauck, D., Kruse, R., 1997. What are neuro-fuzzy classifiers? In: Proceedings of the Seventh International Fuzzy Systems Association World Congress IFSA’97, vol. III. Academia, Prague, pp. 228–233.
Pavur, R., 2002. A comparative study of the effect of the position of outliers on classical and nontraditional approaches to the two group classification problem. European Journal of Operational Research 136, 603–615.
Pendharkar, P.C., 2002. A computational study on the performance of artificial neural networks under changing structural design and data distribution. European Journal of Operational Research 138, 155–177.
Platt, H.D., Platt, M.B., 1990. Development of a class of stable predictive variables: The case of bankruptcy prediction. Journal of Business, Finance and Accounting 17 (1), 31–51.
Shen, X., 1997. On methods of sieves and penalization. The Annals of Statistics 25 (6), 2555–2591.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Sin, C.Y., White, H., 1996. Information criteria for selecting possibly mis-specified parametric models. Journal of Econometrics 71 (1–2), 207–225.
St. John, C.H., Balakrishnan, N., Fiet, J.O., 2000. Modeling the relationship between corporate strategy and wealth creation using neural networks. Computers & Operations Research 27 (11, 12), 1077–1102.
Stone, C.J., 1977. Consistent nonparametric regression. The Annals of Statistics 5 (4), 595–645.
Tsai, C.Y., 2000. An iterative feature reduction algorithm for probabilistic neural networks. Omega 28, 513–524.
Varetto, F., 1998. Genetic algorithms applications in the analysis of insolvency risk. Journal of Banking and Finance 22, 1421–1439.
Wang, L.X., 1994. Adaptive Fuzzy Systems and Control: Design and Stability Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Wang, L.X., Mendel, J.M., 1992. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Transactions on Neural Networks 3 (5), 807–814.
Watson, C.J., 1990. Multivariate distributional properties, outliers, and transformation of financial ratios. Accounting Review 65 (3), 662–695.
White, H., 1990. Connectionist non-parametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks 3, 535–549.
Wong, B.K., Bodnovich, T.A., Lai, V.S.-K., 2000. The use of cascade correlation networks in university fund raising. Journal of the Operational Research Society 51 (8), 913–928.
Yang, Y., 1999. Minimax nonparametric classification. (I): Rates of convergence. IEEE Transactions on Information Theory 45 (7), 2271–2284.
Zapranis, A., Ginoglou, D., 2000. Forecasting corporate failure with neural networks approach: The Greek case. Journal of Financial Management & Analysis 13 (2), 1–11.
Zopounidis, C., Doumpos, M., 2002. Multicriteria classification and sorting methods: A literature review. European Journal of Operational Research 138, 229–246.