On preprocessing data for financial credit risk evaluation


Expert Systems with Applications 30 (2006) 489–497 www.elsevier.com/locate/eswa

Selwyn Piramuthu *
Decision and Information Sciences, University of Florida, 351 Stuzin Hall, P.O. Box 117169, Gainesville, FL 32611-7169, USA
* Tel.: +1 352 392 8882; fax: +1 352 392 5438. E-mail address: [email protected].

Abstract

Financial credit-risk evaluation is among a class of problems known to be semi-structured, where not all variables that are used for decision-making are either known or captured without error. Machine learning has been successfully used for credit-evaluation decisions. However, blindly applying machine learning methods to financial credit risk evaluation data with minimal knowledge of the data may not always lead to the expected results. We present and evaluate some data and methodological considerations that should be taken into account when using machine learning methods for these decisions. Specifically, we consider the effects of preprocessing the credit-risk evaluation data used as input for machine learning methods.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Feature selection; Feature construction; Financial credit-risk evaluation; Decision tables

1. Introduction

According to a recent survey from Risk Waters Group and SAS Institute, Inc., as cited in Computerworld (25 August 2004), organizations expect to benefit from significant rewards (e.g. 10% reduction in economic capital and 14% reduction in the cost of credit losses) through improved credit risk management. Results from this survey also indicate that data management is the biggest obstacle to successfully implementing credit risk management systems. These findings are not surprising given the stakes involved when dealing with significant amounts of resources. They are especially significant when put in the perspective of even a typical mid-sized financial institution, where the losses that could be prevented with improved credit risk management systems run into the hundreds of millions of dollars. Financial credit and the losses associated with it are not isolated problems. These problems are common to most financial institutions, although only huge losses get enough publicity to attract attention from the general public. Recent financial crises include the US S&L crisis, with an estimated cost in the hundreds of billions of dollars, the Nordic countries' injection of around $16 billion into their financial systems to keep them from bankruptcy, Japan's bad loans that were estimated to be in the $160–240 billion range in


October 1993, and later Mexico's spending of at least $20 billion to keep its financial system from collapsing. Although it is difficult, if not impossible, to be completely risk-free when dealing with financial credit, it is possible to reduce some of these problems. Better estimation of credit risk, and implementing ways to completely or even partially avoid some of these losses, translates to more effective utilization of resources in these situations. Managing credit risk and reducing loan losses directly affect the bottom line of a financial institution. The identification and quantification of credit risk is thus important in improving the efficiency, accuracy, and consistency of financial credit risk management initiatives. One way to approach this problem is to utilize better decision support tools when evaluating credit risk. The tools for such decision support have been developed over the years, and range from simple 'eye-balling' of data to complex data analyses. These tools by themselves are only a part of the solution. A complete system for financial risk management includes such tools along with means of interpreting the results they generate and the resources necessary to execute appropriate decisions in a timely manner, among other components. In this paper, we consider the decision support tool part. Specifically, we consider decision support tools from a machine learning perspective. Several previous studies have addressed some of the issues in applying machine learning tools for credit-risk evaluation (e.g. Baesens, Setiono, Mues, & Vanthienen, 2003; Piramuthu, 1999; Shaw & Gentry, 1990). We discuss a few means of improving the performance of these tools through data preprocessing, specifically through feature selection and construction.


This paper is organized as follows: the next section provides a brief overview of some of the issues that relate to machine learning and to applying these methods to financial credit risk data. Section 3 discusses a few means to address some of the issues raised in Section 2. Section 4 provides some illustrations, and Section 5 ends the paper with a brief discussion of the implications.

2. Relevant machine learning issues

Shaffer's (1994) work and later the well-known Wolpert and Macready (1995, 1997) no free lunch (NFL) theorems for search and optimization state that the performance of search, optimization, or learning algorithms is equal when averaged over all possible problems. A corollary of the no free lunch theorem is that if an algorithm performs better than average on a given set of functions, it must perform worse than average on the complementary set of these functions. In other words, an algorithm performs well on a subset of functions at the expense of poor performance on the complementary set of these functions. A consequence of this is that all algorithms are equally specialized (Schumacher, Vose, & Whitley, 2001). Since the performance of all algorithms is similar, there can be no algorithm that is more robust than the rest. NFL applies to cases where each function has the same probability of being the target function. This was later extended (e.g. Igel & Toussaint, 2003; Schumacher et al., 2001) to provide necessary and sufficient conditions for subsets of functions as well as arbitrary non-uniform distributions of target functions. The NFL theorems and the related work that followed raise serious questions about blindly applying an algorithm (e.g. a neural network, a genetic algorithm) to data (e.g. Culberson, 1998). However, a cursory look at the published literature reveals a plethora of articles that compare across methods and conclude that a given method is better than a few other methods. For example, Giplin et al. (1990) compared stepwise linear discriminant analysis, stepwise logistic regression, and CART to three senior cardiologists for predicting whether a patient would die within a year of being discharged after an acute myocardial infarction. Their results showed that there was no difference between the physicians and the computers in terms of prediction accuracy. Kors and Van Bemmel (1990) compared statistical multivariate methods with heuristic decision tree methods in the domain of electrocardiogram (ECG) analysis. Their comparisons show that decision tree classifiers are more comprehensible and more flexible in incorporating or changing existing categories. Comparisons of CART to multiple linear regression and discriminant analysis can be found in Callahan and Sorensen (1991), where it is argued that CART is more suitable than the other methods for very noisy domains with many missing values. Feng, Sutherland, King, Muggleton, and Henery (1993) present a comparison of several machine learning methods (including decision trees, neural networks, and statistical classifiers) as part of the European Statlog project. Their main conclusions are that (1) no method seems uniformly superior to others, (2) machine learning methods seem to be

superior for multi-modal distributions, and (3) statistical methods are computationally the most efficient. Curram and Mingers (1994) compare decision trees, neural networks, and discriminant analysis on several real-world data sets. Their comparisons reveal that linear discriminant analysis is the fastest of the methods when the underlying assumptions are met, and that decision tree methods overfit in the presence of noise. It should be noted that most of these methods (e.g. neural networks, genetic algorithms, decision trees) involve specific learning strategies, and each application of these algorithms involves further specifying the strategy by tweaking various parameters (e.g. topology, learning rate, weight-update mechanism, etc. in neural networks; encoding of the genotype, mutation rate, replication mechanism, selection mechanism, etc. in genetic algorithms; the mechanism for splitting decisions at each node, when to prune, etc. in decision trees). Without appropriate tweaking, the resulting performance of these algorithms may not necessarily prove to be as reported in the literature. Moreover, tweaking tailors the different parameters of a given algorithm to the data sets of interest. There have also been studies that have compared different algorithms, only to conclude that no one algorithm dominates the rest by performing better overall across several data sets. For example, Duin (1996) compared six methods (including neural networks, decision trees, and nearest mean) on seven data sets and concluded that there is no such thing as a best classifier. He goes on to state that for any classifier, a problem or data can be selected on which it can be shown to perform well. Lim and Shih (2000) study 33 classification methods using 32 data sets and report that the mean error rates of these algorithms are sufficiently similar that their differences are statistically insignificant. What do all these studies have to do with using machine learning methods for financial credit risk classification? For one, it is clear that we cannot select an algorithm and claim its superiority over competing algorithms without regard to the data and/or problem characteristics as well as the suitability of the algorithm to such data and/or problem characteristics. However, given results from previous studies, one can possibly claim superiority of an algorithm for a specific data set or problem. The lesson learned here is simply that one needs to take the data and/or problem characteristics, as well as the suitability of a given algorithm to them, into account to obtain better performance. Performance, in this context, depends on at least two different entities: the algorithm and the data set. The results mentioned in the last few paragraphs, as well as the NFL theorems, deal with the performance of algorithms. These studies, however, fail to deal with data characteristics and their appropriateness for a given algorithm. Data characteristics (noise, missing values, complexity of the data distribution, instance selection, etc.) can and do significantly affect the resulting performance of most, if not all, algorithms. Having selected an appropriate algorithm for a given data set, it can be shown that performance can be further improved by improving the characteristics of the data through appropriate preprocessing.


3. Reducing data complexity for learning

Assessing a firm's financial risk is an important decision for investors, companies that extend credit, and financial institutions. An incorrect valuation of potential risks can result in serious financial loss. Three aspects of financial risk classification are critical but difficult: the development of a compact model, the use and refinement of the classification model for evaluation, and the identification of relevant financial features. For typical classification problems, values for a set of independent variables are given in a set of (training) examples, upon which a model is developed to categorize future observations into appropriate classes. Classification problems arise in credit or loan evaluation (Carter & Cartlett, 1987), bond rating (Ang & Patel, 1975), market surveys (Currim, Meyer, & Le, 1988), tax planning (Michaelsen, 1984), and bankruptcy prediction of firms (Messier & Hansen, 1988; Shaw & Gentry, 1990; Tam & Kiang, 1992), among other applications. A concept is an expression that identifies a subset of some universe (Rendell & Seshu, 1990). The concept learning problem can be represented by an instance space whose axes are the features used in the training examples. When there are multiple regions (peaks) in the instance space, the learning problem is characterized as 'hard concept learning' (Rendell & Seshu, 1990) because of its inherent learning difficulty. In most hard learning problems, using the appropriate set of features is critical for the success of the learning process and therefore, by itself, is an important decision. In the game of checkers, for example, detailed features such as the content of each board position may not be as helpful for learning good strategies as higher-level information, such as piece advantage and mobility. It is therefore reasonable to hypothesize that the


learning of checker strategies based on observing the content of board positions is more difficult than the learning problem based on training examples described by piece advantage and mobility. The same phenomenon, with respect to the relationship between learning difficulty and proper representation of the training examples, is especially pronounced in the financial risk evaluation domain. In determining companies' creditworthiness, for example, the features used in training determine the learning complexity to a great extent, and sometimes even the degree of eventual success of the learning process itself. The creditworthiness of companies would be more difficult for a learning system to learn from raw accounting data (e.g. figures from income statements and balance sheets) than from higher-level financial concepts such as liquidity, leverage level, profit growth, and operating cash flow. Successful learning hinges on the proper representation of training examples.

3.1. Feature construction

Consider the XOR example in Fig. 1(a). This problem requires at least two hyperplanes (straight lines, in this space) to separate examples belonging to the two (+, −) classes. The addition of a new feature X3 (X3 = X1 ⊕ X2, the exclusive-or of X1 and X2) decreases the learning difficulty by requiring just one hyperplane (abcd in Fig. 1(b)) to separate examples belonging to the two classes. Although the addition of the new feature increases the number of features used, the resulting space simplifies the classification process.

Fig. 1. A new feature (i.e. X3) makes learning easier.
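To make this effect concrete, the following is a minimal sketch (using NumPy and scikit-learn, which are illustration choices and not tools used in this study) showing that a single linear boundary cannot fit the four XOR examples on X1 and X2 alone, but fits them exactly once the constructed feature X3 = X1 XOR X2 is added.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four XOR examples: the class is X1 XOR X2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A single linear boundary cannot separate the XOR classes.
linear_only = LogisticRegression().fit(X, y)
print("Accuracy with X1, X2 only:", linear_only.score(X, y))   # a line never exceeds 0.75 on XOR

# The constructed feature X3 = X1 XOR X2 makes one hyperplane sufficient.
X3 = np.logical_xor(X[:, 0], X[:, 1]).astype(int).reshape(-1, 1)
X_aug = np.hstack([X, X3])
augmented = LogisticRegression().fit(X_aug, y)
print("Accuracy with constructed X3 added:", augmented.score(X_aug, y))  # 1.0
```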


Feature construction can be defined in terms of concept learning as follows: feature construction is the process of applying a set of constructive operators {o1, o2, ..., on} to a set of existing features {f1, f2, ..., fm}, resulting in the construction of one or more new features {f1, f2, ..., fN} intended for use in describing the target concept (Matheus & Rendell, 1989). A separate learning method (e.g. neural network learning) can then make use of the constructed features in attempting to describe the target concept. Examples of feature construction systems include BACON (Langley, Zytkow, Simon, & Bradshaw, 1986), FRINGE (Pagallo, 1989), CITRE (Matheus & Rendell, 1989), MIDOS (Wrobel, 1997), Explora (Klösgen, 1996), and Tertius (Flach & Lachiche, 2001). BACON (Langley et al., 1986), a program that discovers relationships among real-valued features of instances in data, uses two operators [multiply(_,_) and divide(_,_)]. The strong bias restricting the constructive operators allowed leads to a manageable feature construction process, although concept learning is severely restricted by the chosen operators. FRINGE (Pagallo, 1989) is a decision-tree-based (e.g. Quinlan, 1986) feature construction algorithm. New features are constructed by conjoining pairs of features at the fringe of each of the positive branches in the decision tree. During each iteration, the newly constructed features and the existing features are used as the input space for the algorithm. This process is repeated until no new features are constructed. CITRE (Matheus & Rendell, 1989) and DC Fringe (Yang, Rendell, & Blix, 1991) are also decision-tree-based feature construction algorithms. They use a variety of operands such as root (selects the first two features of each positive branch), fringe (similar to FRINGE), root-fringe (a combination of both root and fringe), adjacent (selects all adjacent pairs along each branch), and all (all of the above). All of these operands use conjunction as the operator. In DC Fringe, both conjunction and disjunction are utilized as operators. MIDOS (Wrobel, 1997), which stands for multi-relational discovery of subgroups, finds statistically unusual subgroups in a database. It uses optimistic estimate and minimal support pruning, and an optimal refinement operator. MIDOS takes the generality of a hypothesis (i.e. the size of the subgroup) into account in addition to the proportion of positive examples in a subgroup. Explora (Klösgen, 1996) is an interactive system for the discovery of interesting patterns in databases. The number of patterns presented to the user is reduced by organizing the search hierarchically, beginning with the strongest, most general, hypotheses. An additional refinement strategy selects the most interesting statements and eliminates overlapping findings. The efficiency of discovery is improved by inverting the record-oriented data structure and storing all values of the same variable together, allowing efficient computation of aggregate measures. Different data subsets are represented as bit-vectors, making computation of logical combinations of conditions very efficient. Tertius (Flach & Lachiche, 2001) uses a first-order logic representation and implements a top-down rule discovery mechanism. It deals with extensional knowledge with explicit negation or under the closed-world assumption. It employs a confirmation measure, where only substitutions that explicitly satisfy the body of rules are taken into account.
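As a small illustration of constructive operators in the credit-risk domain, the sketch below applies a divide(_,_) operator in the spirit of BACON to turn raw accounting figures into the kind of higher-level ratios (liquidity, leverage, profitability) mentioned earlier; the column names and values are hypothetical and are not taken from any of the data sets used later in this paper.

```python
import pandas as pd

# Hypothetical raw balance-sheet and income-statement figures for three firms.
raw = pd.DataFrame({
    "current_assets":      [120.0,  80.0, 300.0],
    "current_liabilities": [ 60.0,  95.0, 150.0],
    "total_debt":          [200.0, 340.0, 400.0],
    "total_assets":        [500.0, 380.0, 900.0],
    "net_income":          [ 45.0, -12.0,  70.0],
})

# A divide(_,_) constructive operator yields higher-level financial concepts.
constructed = pd.DataFrame({
    "liquidity":     raw["current_assets"] / raw["current_liabilities"],  # current ratio
    "leverage":      raw["total_debt"]     / raw["total_assets"],
    "profitability": raw["net_income"]     / raw["total_assets"],         # return on assets
})

# A learner would typically see the constructed ratios rather than the raw figures.
print(pd.concat([raw, constructed], axis=1))
```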

In decision-tree-based feature construction algorithms, as feature construction proceeds iteratively, the addition of new features to the previous set of features can lead to a large number of features being used as input to the decision tree construction algorithm. Thus, pruning of features is done during each iteration. The most desirable features are kept to be carried over to the next iteration, as well as to form newer features, whereas the least desirable features are discarded. This is done by the decision tree algorithm (e.g. ID3) through pruning, as well as by discarding the features that were not used in the formation of the decision tree. FC (Ragavan, Rendell, Shaw, & Tessmer, 1993) constructs features iteratively from decision trees. It forms new features by conjoining as well as disjoining two nodes at the fringe of the tree: the parent and grandparent nodes of positive leaves are conjoined or disjoined to give a new feature. New features are added to the set of original features and a new decision tree is constructed using the maximum information-gain criterion (Quinlan, 1986). This feature selection phase thus chooses from both the newly constructed features and the original features for rebuilding the decision tree. The iterative process of tree-building and feature construction continues until no new features are found. Splitting continues to purity, i.e. no pruning (Breiman et al., 1984) is used in this study. Detailed steps for constructing the decision tree can be found in Quinlan (1986). FC basically resolves the interactions among features by conjoining and disjoining features that appear close to the leaf nodes in a decision tree generated by an inductive learning program such as ID3 (Quinlan, 1986). New feature sets constructed through FC have been shown to make learning easier in applications.

3.2. Feature selection

Consider the data given in Fig. 2. There are two independent variables, V1 and V2, and a dependent variable Y. The examples from the two classes (Y = + and Y = −) are clearly separable. Given a choice between the variables V1 and V2, we would choose V1 as the variable that distinguishes examples from the two classes.

Fig. 2. An example.


Intuitively, if we project the data points onto the V1 axis, the four data points belonging to the class '−' fall on V1 = 1 and 2, and the four points belonging to the class '+' fall on V1 = 4 and 5. Here, given just the V1 axis, we can separate the examples belonging to the '−' class from those belonging to the '+' class. Similarly, if we project the data points onto the V2 axis, the four data points belonging to the class '−' fall on V2 = 1 and 2, and the four points belonging to the class '+' also fall on V2 = 1 and 2. Here, given just the V2 axis, we cannot separate the '−' examples from the '+' examples: there is perfect overlap between the two classes in dimension V2, whereas our goal is to separate examples belonging to the two classes. Invariably, and unknowingly for the most part, irrelevant as well as redundant variables are introduced along with relevant variables to better represent the domain in credit risk evaluation applications. A relevant variable is neither irrelevant nor redundant to the target concept of interest (John, Kohavi, & Pfleger, 1994). Whereas an irrelevant feature (variable) does not affect the description of the target concept in any way, a redundant feature adds nothing new to the description of the target concept while possibly adding more noise than useful information during concept learning. Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept (Kira & Rendell, 1992). Feature selection is of paramount importance for any learning algorithm; when it is done poorly (i.e. a poor set of features is selected), it may lead to problems associated with incomplete information, noisy or irrelevant features, and a poor set or mix of features, among others. The learning algorithm is then slowed down unnecessarily by the higher dimensionality of the feature space, while also achieving lower classification accuracy due to learning irrelevant information. The ultimate objective of feature selection is to obtain a feature space with (1) low dimensionality, (2) retention of sufficient information, (3) enhanced separability, in feature space, of examples in different categories, achieved by removing effects due to noisy features, and (4) comparability of features among examples in the same category (Meisel, 1972). Although seemingly trivial, the importance of feature selection cannot be overstated. Consider, for example, a data mining situation where the concept to be learned is to classify customers as good or bad credit risks. The data for this application could include several variables, such as social security number, assets, liabilities, past credit history, number of years with the current employer, salary, and frequency of credit evaluation requests. Here, regardless of the other variables included in the data, the social security number uniquely determines a customer's creditworthiness on the training data. The knowledge learned using only the social security number as a predictor has extremely poor generalizability when applied to new customers. Clearly, in this case, we can avoid such a problem by excluding social security numbers from the input data. It is not always clear-cut, however, which of the variables could result in such spurious patterns; a similar problem could exist among one or more of the other variables in the data.
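The V1/V2 argument can be turned into a simple per-feature relevance check. Below is a minimal sketch (using NumPy, and a Fisher-style separation score chosen purely for illustration, not a method from this paper) that reconstructs data consistent with the description of Fig. 2 and scores each feature by how well its one-dimensional projection separates the two classes.

```python
import numpy as np

# Data consistent with the description of Fig. 2: class '-' lies at V1 in {1, 2},
# class '+' at V1 in {4, 5}, while both classes share the V2 values {1, 2}.
V1 = np.array([1, 1, 2, 2, 4, 4, 5, 5], dtype=float)
V2 = np.array([1, 2, 1, 2, 1, 2, 1, 2], dtype=float)
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = '-', 1 = '+'

def separation_score(feature, labels):
    """Fisher-style score: between-class mean distance relative to within-class spread."""
    a, b = feature[labels == 0], feature[labels == 1]
    pooled_sd = np.sqrt((a.var() + b.var()) / 2.0) + 1e-12
    return abs(a.mean() - b.mean()) / pooled_sd

for name, column in [("V1", V1), ("V2", V2)]:
    print(name, round(separation_score(column, y), 2))
# V1 scores highly (the projected classes are far apart), while V2 scores 0
# (the projections overlap completely), so V1 would be selected over V2.
```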


Feature selection methods can be used in similar situations to cull out such problematic features before the data enters the pattern extraction stage of data mining systems. A goal of feature selection is to avoid selecting more or fewer features than necessary. If too few features are selected, there is a good chance that the information content of this set of features is low. On the other hand, if too many (irrelevant) features are selected, the effects due to the noise present in (most real-world) data may overshadow the information present. Hence, this is a tradeoff that must be addressed by any feature selection method. The marginal benefit resulting from the presence of a feature in a given set plays an important role: a given feature might provide more information when present with certain other feature(s) than when considered by itself. Cover (1974), Elashoff, Elashoff, and Goldman (1967), and Toussaint (1971), among others, have shown the importance of selecting features as a set, rather than selecting the best individual features to form the (supposedly) best set. They have shown that the best individual features do not necessarily constitute the best set of features. However, in most real-world situations, neither the best set of features nor the number (n) of features in such a set is known. Currently, there is no means to obtain the value of n, which depends partially on the objective of interest. Even assuming that n is known, it is extremely difficult to obtain the best set of n features, since not all n of these features may be present in the available data. There exists a vast amount of literature on feature selection. Researchers have attempted feature selection through varied means, such as statistical measures (e.g. Kittler, 1975), geometrical measures (e.g. Elomaa & Ukkonen, 1994), information-theoretic measures (e.g. Chambless & Scarborough, 2001), neuro-fuzzy approaches (e.g. Benitez, Castro, Mantas, & Rojas, 2001), receiver operating characteristic (ROC) curves (Coetzee, Glover, Lawrence, & Giles, 2001), discretization (Liu & Setiono, 1997), and mathematical programming (e.g. Bradley et al., 1998), among others. In statistical analyses, forward and backward stepwise multiple regression (SMR) are widely used to select features, with forward SMR being used more often because it involves less computation. The output here is the smallest subset of features resulting in an R2 (coefficient of determination) value that explains a significantly large amount of the variance. In forward SMR, the analysis proceeds by adding features to a subset until the addition of a new feature no longer results in a significant (usually at the 0.05 level) increment in explained variance (R2 value). In backward SMR, the full set of features is used as the starting point, and the features with the smallest contribution to R2 are successively eliminated. Malki and Moghaddamjoo (1991) apply the Karhunen–Loève (K-L) transform on the training examples to obtain the initial training vectors. Training is started in the direction of the major eigenvectors of the correlation matrix of the training examples. The remaining components are gradually included in their order of significance. The authors generated training examples from a synthetic noisy image and compared the results obtained using the proposed method to those of the standard backpropagation algorithm. The proposed method converged faster than standard backpropagation, with comparable classification performance.


Siedlecki and Sklansky (1989) use genetic algorithms for feature selection by encoding the initial set of n features as an n-element bit string, with 1 and 0 representing the presence and absence, respectively, of features in the set. They used classification accuracy as the fitness function (for the genetic algorithm while selecting features) and obtained good neural network results compared to branch-and-bound and sequential search (Stearns, 1976) algorithms. They used synthetic data as well as digitized infrared imagery of real scenes, with classification accuracy as the objective function. Yang and Honavar (1997) report a similar study. However, Hopkins, Routen, and Watson (1994) later show that classification accuracy may be a poor fitness measure when searching to reduce the dimension of the feature set. Using rough sets theory (Pawlak, 1982), PRESET (Modrzejewski, 1993) determines the degree of dependency (γ) of sets of attributes for selecting binary features. Features leading to a minimal preset decision tree, which is the one with minimal length of all paths from root to leaves, are selected. Kohavi and Frasca (1994) use best-first search, stopping after a predetermined number of non-improving node expansions. They suggest that it may be beneficial to use a feature subset that is not a reduct, where a reduct has the property that a feature cannot be removed from it without changing the independence property of the features. A table-majority inducer was used, with good results. The wrapper method (Kohavi, 1995a) searches for a good feature subset using the induction algorithm as a black box. The feature selection algorithm exists as a wrapper around the induction algorithm. The induction algorithm is run on data sets with subsets of features, and the subset of features with the highest estimated value of a performance criterion is chosen. The induction algorithm is then used to evaluate the data set with the chosen features on an independent test set. Yuan, Tseng, Gangshan, and Fuyan (1999) develop a two-phase method combining wrapper and filter approaches. Almuallim and Dietterich (1991) introduce the MIN-FEATURES bias (if two functions are consistent with the training examples, prefer the function that involves fewer input features) to select features in the FOCUS algorithm. They used synthetic data to study the performance of the FOCUS, ID3, and FRINGE algorithms, using sample complexity, coverage, and classification accuracy as performance criteria. They increased the number of irrelevant features and showed that FOCUS performed consistently better. The IDG algorithm (Elomaa & Ukkonen, 1994) uses the positions of examples in the instance space to select features for decision trees. The authors limit their attention to boundaries separating examples belonging to different classes, while rewarding (penalizing) rules that separate examples from different (the same) classes. Eight data sets are used to compare the performance (% accuracy, number of nodes in the decision tree, time) of decision trees constructed using the proposed algorithm with ID3 (Quinlan, 1987). Decision trees generated using the proposed algorithm had better accuracy, whereas

those built with ID3 had fewer nodes and took more than an order of magnitude less time. Based on the positions of instances in the instance space, the relief algorithm (Kira & Rendell, 1992) selects features that are statistically relevant to the target concept, using a relevancy threshold that is selected by the user. relief is noise-tolerant and is unaffected by feature interaction. The complexity of relief is O(pn), where n and p are the number of instances and the number of features, respectively. relief was studied using two 2-class problems with good results, compared to FOCUS (Almuallim & Dietterich, 1991) and heuristic search (Devijver & Kittler, 1982). Kononenko (1994) extended relief to deal with noisy, incomplete, and multi-class data sets. Milne (1995) used neural networks to measure the contribution of individual input features to the output of the neural network. A new measure of an input feature's contribution to the output is proposed and evaluated using data mapping species occurrence in a forest. Using a scatter plot of contribution to output, subsets of features were removed and the remaining feature sets were used as input to neural networks. Setino and Liu (1997) present a similar study using neural networks to select features. Battiti (1994) developed MIFS, which uses mutual information to evaluate the information content of each individual feature with respect to the output class. The features thus selected were used as input to neural networks. The author shows that the proposed method is better than feature selection methods that use linear dependence measures (e.g. correlations, as in principal components analysis). Al-Ani and Deriche (2001) extend this work by considering trade-offs between computational costs and combined feature selection. Koller and Sahami (1996) use cross-entropy to minimize the amount of predictive information lost during feature selection. Piramuthu and Shaw (1994) use C4.5 (Quinlan, 1990) to select the features used as input to neural networks. Their results showed improvements over plain backpropagation, both in terms of classification accuracy and the time taken by the neural networks to converge. The most popular feature selection methods in the machine learning literature are variations of sequential forward search (SFS) and sequential backward search (SBS), as described in Devijver and Kittler (1982), and their variants (e.g. Pudil, Ferri, Novovicova, & Kittler, 1994). SFS (SBS) obtains a chain of nested subsets of features by adding (subtracting) the locally best (worst) feature to (from) the set. These methods are particular cases of the more general 'plus l-take away r' method (Stearns, 1976). Results from previous studies indicate that the performance of forward and backward searches is comparable. In terms of computing resources, forward search has the advantage, since fewer features are evaluated at each iteration, compared to backward search, where the process begins with all the features.
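As an illustration, the following is a minimal sketch of SFS used in the wrapper style (scikit-learn, the decision-tree inducer, and the synthetic data are assumptions made for this sketch, not the setup of the experiments reported below): at each step, the feature whose addition yields the highest cross-validated accuracy of the induction algorithm is added, and the search stops when no addition improves the estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def sequential_forward_search(X, y, estimator, cv=5):
    """Greedy SFS with wrapper-style evaluation of candidate feature subsets."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every one-feature extension of the current subset.
        scores = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, best_feature = max(scores)
        if score <= best_score:      # stop when no addition improves the estimate
            break
        best_score = score
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_score

# Synthetic credit-style data: a few informative features among many irrelevant ones.
X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           n_redundant=2, random_state=0)
subset, accuracy = sequential_forward_search(X, y, DecisionTreeClassifier(random_state=0))
print("Selected features:", subset, "estimated accuracy:", round(accuracy, 3))
```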


4. Financial risk classification applications

As it is important for companies, investors, and financial institutions to assess firms' financial health or riskiness, numerous empirical models have been developed that use annual financial information to distinguish between firms that are healthy and those that are risky (e.g. Abdel-Khalik & El-Sheshai, 1980). Although the financial credit-risk analysis literature is extensive, research interest continues in the development of a theoretical foundation that would capture the many dimensions of financial distress and failure. Likewise, numerous lenders and investors are interested in improving their ability to interpret, explain, and predict credit risk. This type of financial risk analysis presents a challenge to the development of appropriate classification models because of the lack of linear relationships among features, the inherent level of noise in the training data, and the high degree of interactions among features. We use four real-world financial credit-risk evaluation data sets to illustrate the effects of data preprocessing on learning performance.

4.1. Tam and Kiang (1992) data

This data set was used in the Tam and Kiang (1992) study. Texas banks that failed during 1985–1987 were the primary source of data. Data from one year and two years prior to failure were used. Data from 59 failed banks were matched with data from 59 non-failed banks that were comparable in terms of asset size, number of branches, age, and charter status. Tam and Kiang had also used holdout samples for both the one- and two-year prior cases. The one-year prior case consists of 44 banks, 22 of which belong to failed banks and the other 22 to non-failed banks. The two-year prior case consists of 40 banks, 20 of which belong to failed banks and 20 to non-failed banks. The data describe each of these banks in terms of 19 financial ratios. For a detailed overview of the data set, the reader is referred to Tam and Kiang (1992).

4.2. Abdel-Khalik and El-Sheshai (1980) data

This data set was used in the Abdel-Khalik and El-Sheshai (1980) study, among others. The data were used to classify a set of firms into those that would default and those that would not default on loan payments. Of the 32 examples for training, 16 belong to the default case and the other 16 to the non-default case. All 16 holdout examples belong to the non-default case. The 18 variables in this data set are: (1) net income/total assets, (2) net income/sales, (3) total debt/total assets, (4) cash flow/total debt, (5) long-term debt/net worth, (6) current assets/current liabilities, (7) quick assets/sales, (8) quick assets/current liabilities, (9) working capital/sales, (10) cash at year-end/total debt, (11) earnings trend, (12) sales trend, (13) current ratio trend, (14) trend of L.T.D./N.W., (15) trend of W.C./sales, (16) trend of N.I./T.A., (17) trend of N.I./sales, and (18) trend of cash flow/T.D. For a detailed description of this data set, the reader is referred to Abdel-Khalik and El-Sheshai (1980).

4.3. German credit data

This data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/statlog/) contains 1000 observations on 20


attributes. The class attribute describes people as either good (about 700 observations) or bad (about 300 observations) credit risks. Other attributes include status of existing checking account, credit history, credit purpose, credit amount, savings account/bonds, duration of present employment, installment rate as a percentage of disposable income, marital status and gender, other debtors/guarantors, duration in current residence, property, age, number of existing credits at this bank, job, telephone ownership, whether the applicant is a foreign worker, and number of dependents.

4.4. Australian credit approval data

This credit card applications data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/) was used in Quinlan (1990), and has 690 observations with 15 attributes. Of the attributes, five are real-valued and the remaining are nominal. There are 307 positive examples and 383 negative examples in this data set.

4.5. Results

We use decision tables (e.g. Vanthienen & Wets, 1994) as an example of a tool for credit-risk evaluation decisions. Specifically, we used a simple decision table majority classifier as given in Kohavi (1995b). We use several feature selection algorithms for preprocessing the data used as input to the decision tables. Specifically, we use relief (Kira & Rendell, 1992), gain ratio (Quinlan, 1990), and chi-square (Chan & Wong, 1991). Table 1 provides results using these feature selection algorithms on the different data sets. Here, AE represents the Abdel-Khalik and El-Sheshai (1980) data, KT represents the Tam and Kiang (1992) data, GC represents the German credit data, and AC represents the Australian credit approval data. To facilitate comparison, we used the same sets of training and testing data as in the initial reported studies where these data sets were used. In Table 1, 'All' represents the case where no feature selection was used to select input data for the decision tables. The numbers outside the parentheses are the percentages correctly classified by the decision tables; the numbers inside the parentheses are the numbers of attributes used in the final decision tables. As can be seen in Table 1, feature selection leads to a reduction in the number of variables with comparable, or sometimes even better, classification performance. Preprocessing data can be beneficial for learning algorithms by reducing the complexity of the instance space used as input to these algorithms. We can also improve learning performance through appropriate sampling of the input data to achieve better instance selection.

Table 1
% Correctly classified by decision table (number of attributes used in parentheses)

Feature selection / data set   AE           KT           GC           AC
All                            68.75 (18)   83.43 (19)   76.7 (20)    87.9 (15)
Relief                         77.083 (5)   83.43 (1)    76.7 (3)     88.21 (14)
Gainratio                      77.083 (5)   83.43 (5)    87.6 (4)     87.9 (10)
Chi-square                     77.083 (5)   83.43 (5)    82.7 (11)    87.9 (12)
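To indicate how such numbers can be produced, the following is a hedged sketch of a decision table majority classifier in the spirit of Kohavi (1995b): the selected features index the cells of a table, each cell predicts the majority class of the training examples falling into it, and unseen cells fall back to the overall majority class. The toy data, the coarse discretization, and the chosen feature subset are assumptions of this sketch, not the exact procedure behind Table 1.

```python
import numpy as np
from collections import Counter, defaultdict

class DecisionTableMajority:
    """Decision table majority classifier keyed on a selected subset of features."""
    def __init__(self, feature_idx):
        self.feature_idx = feature_idx            # indices chosen by a feature selection step

    def fit(self, X, y):
        self.default = Counter(y).most_common(1)[0][0]   # overall majority class
        cells = defaultdict(list)
        for row, label in zip(X, y):
            cells[tuple(row[self.feature_idx])].append(label)
        self.table = {key: Counter(labels).most_common(1)[0][0]
                      for key, labels in cells.items()}
        return self

    def predict(self, X):
        return np.array([self.table.get(tuple(row[self.feature_idx]), self.default)
                         for row in X])

# Toy example: three coarsely discretized attributes, only the first two being relevant.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 3))             # each attribute takes values {0, 1, 2}
y = (X[:, 0] + X[:, 1] >= 3).astype(int)          # class depends only on attributes 0 and 1
model = DecisionTableMajority(feature_idx=[0, 1]).fit(X[:150], y[:150])
print("Holdout accuracy:", (model.predict(X[150:]) == y[150:]).mean())
```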


5. Discussion

Financial credit-risk evaluation data are replete with noise, and the available information itself is prone to incompleteness. In spite of these constraints, one should be able to efficiently obtain information from the available data so as to compensate for its inadequacies. Financial credit-risk evaluation is done thousands of times every day in most financial institutions, among others, and involves huge amounts of capital. Any improvement in currently available methods would certainly benefit these institutions in a tangible way. This paper considered one facet of financial credit-risk evaluation: decision-making tools. Although there are several tools that can be successfully used for this purpose, it is better to select a tool that is tailored for this purpose, taking the characteristics of financial credit-evaluation data into consideration. Another means of improving the performance of these tools is through proper preprocessing of the data used in these decision support tools. Given the ready availability of tools for decision-making as well as tools for preprocessing input data, there really is no excuse not to utilize them to get the most benefit from risk analysis.

Acknowledgements

I thank the referee for a thorough review, and for highlighting important issues that have helped in improving the clarity of presentation of this paper.

References

Abdel-Khalik, A. R., & El-Sheshai, K. M. (1980). Information choice and utilization in an experiment on default prediction. Journal of Accounting Research, Autumn, 325–342. Al-Ani, A., & Deriche, M. (2001). An optimal feature selection technique using the concept of mutual information. Proceedings of the International Symposium on Signal Processing and its Applications (ISSPA) (pp. 477–480). Kuala Lumpur. Almuallim, H. M., & Dietterich, T. G. (1991). Learning with many irrelevant features. Proceedings of the Ninth National Conference on Artificial Intelligence (pp. 547–552). Ang, J., & Patel, K. (1975). Bond rating methods: Comparison and validation. Journal of Finance, 30(2), 631–640. Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3), 312–329. Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550. Benitez, J. M., Castro, J. L., Mantas, C. J., & Rojas, F. (2001). A neuro-fuzzy approach for feature selection. Proceedings of the IFSA World Congress and 20th NAFIPS International Conference (Vol. 2) (pp. 1003–1008). Bradley, P. S., Mangasarian, O. L., & Street, W. N. (1998). Feature selection in mathematical programming. INFORMS Journal on Computing, 10(2), 209–217. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.

Callahan, J. D., & Sorensen, S. W. (1991). Rule induction for group decisions with statistical data—an example. Journal of the Operational Research Society, 42(3), 227–234. Carter, C., & Cartlett, J. (1987). Assessing credit card applications using machine learning. IEEE Expert, Fall, 71–79. Chambless, B., & Scarborough, D. (2001). Information-theoretic feature selection for a neural behavioral model. Proceedings of the International Joint Conference on Neural Networks (IJCNN-01) (Vol. 2) (pp. 1443–1448). Chan, K. C., & Wong, A. K. (1991). A statistical technique for extracting classificatory knowledge from databases. In G. Piatetsky-Shapiro, & W. Frawley (Eds.), Knowledge discovery in databases. Cambridge, MA: AAAI Press. Coetzee, F. M., Glover, E., Lawrence, S., & Giles, C. L. (2001). Feature selection in web applications by ROC inflections and powerset pruning. Proceedings of the Symposium on Applications and the Internet (pp. 5–14). Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Transactions on Systems, Man, and Cybernetics, SMC-4(1), 116–117. Culberson, J. C. (1998). On the futility of blind search: An algorithmic view of ‘No Free Lunch’. Evolutionary Computation, 6, 109–127. Curram, S. P., & Mingers, J. (1994). Neural networks, decision tree induction and discriminant analysis: An empirical comparison. Journal of the Operational Research Society, 45(4), 440–450. Currim, I. S., Meyer, R. J., & Le, N. T. (1988). Disaggregate tree-structured modeling of consumer choice data. Journal of Marketing Research, August, 253–265. Devijver, P. A., & Kittler, J. (1982). Pattern recognition: A statistical approach. Englewood Cliffs, NJ: Prentice-Hall. Duin, R. P. W. (1996). A note on comparing classifiers. Pattern Recognition Letters, 17, 529–536. Elashoff, J. D., Elashoff, R. M., & Goldman, G. E. (1967). On the choice of variables in classification problems with dichotomous variables. Biometrika, 54, 668–670. Elomaa, T., & Ukkonen, E. (1994). A geometric approach to feature selection. In Proceedings of the European Conference on Machine Learning (pp. 351–354). Feng, C., Sutherland, A., King, R., Muggleton, S., & Henery, R. (1993). Comparison of machine learning classifiers to statistics and neural networks. AI & Statistics-93, 41–52. Flach, P. A., & Lachiche, N. (2001). Confirmation-guided discovery of first-order rules with tertius. Machine Learning, 42, 61–95. Giplin, E. A., Olshen, R. A., Chatterjee, K., Kjekshus, J., Moss, A. J., Henning, H., et al. (1990). Predicting 1-year outcome following acute myocardial infarction. Computers and Biomedical Research, 23(1), 46–63. Hopkins, C., Routen, T., & Watson, T. (1994). Problems with using genetic algorithms for neural network feature selection. 11th European Conference on Artificial Intelligence (pp. 221–225). Igel, C., & Toussaint, M. (2003). On classes of functions for which no free lunch results hold. Information Processing Letters, 86(6), 317–321. John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In W. W. Cohen, & H. Hirsh (Eds.), Machine Learning: Proceedings of the Eleventh International Conference (pp. 121–129). San Francisco, CA: Morgan Kaufmann Publishers. Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning (pp. 249–256). Kittler, J. (1975). Mathematical methods of feature selection in pattern recognition.
International Journal of Man–Machine Studies, 7, 609–637. Klösgen, W. (1996). Explora: A multipattern and multistrategy discovery assistant. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 249–271). Menlo Park, CA: AAAI Press. Kohavi, R., & Frasca, B. (1994). Useful feature subsets and rough sets reducts. Third International Workshop on Rough Sets and Soft Computing (RSSC 94). Kohavi, R. (1995a). Wrappers for performance enhancement and oblivious decision graphs. PhD dissertation, Computer Science Department, Stanford University.

Kohavi, R. (1995b). The power of decision tables. Proceedings of the Eighth European Conference on Machine Learning (pp. 174–189). Koller, D., & Sahami, M. (1996). Toward optimal feature selection. Machine Learning: Proceedings of the 13th International Conference. Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. Proceedings of the European Conference on Machine Learning (pp. 171–182). Kors, J. A., & van Bemmel, J. H. (1990). Classification methods for computerized interpretation of the electrocardiogram. Methods of Information in Medicine, 29(4), 330–336. Langley, P., Zytkow, J. M., Simon, H. A., & Bradshaw, G. L. (1986). The search for regularity: Four aspects of scientific discovery. Machine learning: An artificial intelligence approach (Vol. 2, pp. 425–470). Los Altos, CA: Morgan Kaufmann. Lim, T.-S., & Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–229. Liu, H., & Setiono, R. (1997). Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4), 642–645. Malki, H. A., & Moghaddamjoo, A. (1991). Using the Karhunen–Loève transformation in the back-propagation training algorithm. IEEE Transactions on Neural Networks, 2(1), 162–165. Matheus, C. J., & Rendell, L. (1989). Constructive induction in decision trees. Proceedings of the Eleventh IJCAI (pp. 645–650). Meisel, W. S. (1972). Computer-oriented approaches to pattern recognition. New York: Academic Press. Messier, W. F., & Hansen, J. V. (1988). Inducing rules for expert system development: An example using default and bankruptcy data. Management Science, 34(12), 1403–1415. Michaelsen, R. H. (1984). An expert system for tax planning. Expert Systems, October, 149–167. Milne, L. (1995). Feature selection using neural networks with contribution measures. AI'95, Canberra. Modrzejewski, M. (1993). Feature selection using rough sets theory. European Conference on Machine Learning, 213–226. Pagallo, G. (1989). Learning DNF by decision trees. Proceedings of the Eleventh IJCAI (pp. 639–644). Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11(5), 341–356. Piramuthu, S. (1999). Financial credit-risk evaluation with neural and neurofuzzy systems. European Journal of Operational Research, 112, 310–321. Piramuthu, S., & Shaw, M. J. (1994). On using decision tree as feature selector for feed-forward neural networks. International Symposium on Integrating Knowledge and Neural Heuristics, 67–74. Pudil, P., Ferri, F. J., Novovicova, J., & Kittler, J. (1994). Floating search methods for feature selection with nonmonotonic criterion functions. IEEE 12th International Conference on Pattern Recognition (Vol. II, pp. 279–283).


Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. Quinlan, J. R. (1990). Decision trees and decision making. IEEE Transactions on Systems, Man and Cybernetics, 20(2), 339–346. Ragavan, H., Rendell, L., Shaw, M., & Tessmer, A. (1993). Complex concept acquisition through directed search & feature caching, and practical results in a financial domain. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 946–951). Rendell, L., & Seshu, R. (1990). Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence, 6(4), 247–270. Schumacher, C., Vose, M. D., & Whitley, L. D. (2001). The no free lunch and description length. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (pp. 565–570). Setino, R., & Liu, H. (1997). Neural network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662. Siedlecki, W., & Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(5), 335–347. Shaffer, C. (1994). A conservative law for generalization performance. Proceedings of the 1994 International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Shaw, M. J., & Gentry, J. (1990). Inductive learning for risk classification. IEEE Expert, February, 47–53. Stearns, S. D. (1976). On selecting features for pattern classifiers. Third International Conference on Pattern Recognition, 71–75. Tam, K. Y., & Kiang, M. Y. (1992). Managerial applications of neural networks: The case of bank failure predictions. Management Science, 38(7), 926–947. Toussaint, G. T. (1971). Note on optimal selection of independent binary-valued features for pattern recognition. IEEE Transactions on Information Theory, IT-17, 618. Vanthienen, J., & Wets, G. (1994). From decision tables to expert system shells. Data and Knowledge Engineering, 13(3), 265–282. Wolpert, D. H., & Macready, W. G. (1995). No free lunch theorems for search. Technical report SFI-TR-05-010, Santa Fe Institute, Santa Fe, New Mexico. Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. Berlin: Springer. Yang, J., & Honavar, V. (1997). Feature subset selection using a genetic algorithm. Proceedings of the Genetic Programming Conference, GP'97 (pp. 380–385). Yang, D.-S., Rendell, L., & Blix, G. (1991). A scheme for feature construction and a comparison of empirical methods. Proceedings of the Twelfth IJCAI (pp. 699–704). Yuan, H., Tseng, S.-S., Gangshan, W., & Fuyan, Z. (1999). A two-phase feature selection method using both filter and wrapper. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics (Vol. 2, pp. 132–136).