Conflict-sensitivity contexture learning algorithm for mining interesting patterns using neuro-fuzzy network with decision rules


Chun-Min Hung a, Yueh-Min Huang b,*

a Department of Information Management, Kun Shan University, No. 949, Da Wan Road, Yung-Kang City, Tainan Hsien 710, Taiwan, ROC
b Department of Engineering Science, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC

Abstract

Most real-world data analyzed by classification techniques is imbalanced in terms of the proportion of examples available for each data class. This class imbalance problem impedes the performance of some standard classifiers, since a modal-class pattern may cover many relatively weak patterns of interest. This study presents a new learning algorithm based on conflict-sensitive contexture, which remedies the class imbalance problem by basing decisions on the inconsistency of a local entropy estimator. The study also applies a new neuro-fuzzy network algorithm with multiple decision rules to a real-world banking case to mine very significant patterns. The proposed algorithm raises classification accuracy for the minority class from roughly 10% up to 71%. This work also elucidates the discovered patterns of interest and suggests several business applications for them.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Class imbalance; Banking; Neuro-fuzzy; Decision tree; Entropy

1. Introduction

In credit assessment tasks, a relatively weak pattern is still regarded as very significant if its class distribution, as adjusted by sampling techniques, influences decision-making as much as strong patterns do. Moreover, if mining the relatively weak pattern can significantly exceed the prior expectation of margin, the credit assessment task is successful. Many standard classifiers, such as the decision tree and fuzzy inference, are not appropriate for classifying data with class imbalances, in which one class contains a large number of examples while another contains only a few. As surveyed by Japkowicz and Stephen (2002), the class imbalance problem is an important issue because standard learning approaches, which assume a balanced class distribution, have difficulty improving accuracy on such data.

* Corresponding author. E-mail addresses: [email protected] (C.-M. Hung), [email protected] (Y.-M. Huang). doi:10.1016/j.eswa.2006.08.018

This study applies the proposed scheme to credit scoring of bank checking accounts and to mining interesting patterns, which are likely to be covered by strong patterns yielded by less conflicting decision-making. Our goals are to improve the classification accuracy of the 'Risky' class to over 50% and to suggest at least five useful business rules derived from the interesting patterns discovered in the experimental results. In 2001, An, Cercone, and Huang (2001) presented a case study on learning from extremely imbalanced pharmaceutical data and proposed a rule coverage approach to enhance predictive performance. Next, Japkowicz and Stephen (2002) discussed basic re-sampling and cost-modifying schemes to circumvent the class imbalance problem, and established a relationship between concept complexity, training set size and class imbalance level. Most notably, their work concludes that Multi-Layer Perceptrons (MLP) seem to be less affected by the class imbalance problem than decision trees. Furthermore, Guo and Viktor (2004) proposed the DataBoost-IM method for learning from imbalanced data sets by boosting and data generation.


The boosting method uses examples of equal classes to create synthetic data, which is added to the original training set to rebalance the class distribution and the total weights of the different classes. They claimed to attain high prediction accuracy for both minority and majority classes without favoring either class (Guo & Viktor, 2004). Additionally, Chawla, Lazarevic, and Hall (2003) presented the SMOTEBoost approach to improve minority class prediction by boosting. As to multiple classifiers, Phua, Alahakoon, and Lee (2004) integrated back-propagation networks, naïve Bayesian and C4.5 (Quinlan, 1993) algorithms into a single meta-classifier system to alleviate the class imbalance issues in fraud detection research. Besides, Zadrozny and Elkan (2001a) proposed a direct cost-sensitive decision-making solution, based on the MetaCost method (Domingos, 1999), to the sample selection bias problem in econometrics. Using smoothing techniques, the direct cost-sensitive approach obtains well-calibrated conditional probability estimates for tackling extremely imbalanced datasets (Zadrozny & Elkan, 2001a). Ling and Li (1998) provide another similar application in banking, used for direct marketing. The foundations of cost-sensitive learning can be found in Elkan (2001). As for nondeterministic classifiers, DeRouin, Brown, Beck, Fausett, and Schneider (1991) presented a neural-network-based training scheme for class imbalance as early as 1991. Thereafter, Japkowicz (2001) presented an unsupervised binary-learning autoassociator using feedforward neural networks, aiming to provide more accurate classification than MLP, which uses supervised neural networks. However, MLP is more accurate than autoassociation in multi-modal domains. Recently, many learning methods to alleviate the class imbalance problem have been reported in the literature. Weiss and Provost (2003) reported an effect of class distribution on tree induction learning when training data are costly. Although this effect may lead to inaccurate estimates for minority classes due to class imbalance problems, their method used a standard decision tree since it is very fast. Additionally, Hickey (2003) proposed the REFLEX algorithm for learning a minority class with its footprints. Visa and Ralescu (2003) have also published a learning method to eliminate the imbalance problem with overlapping classes. Because our dataset involves minority class problems with overlapping classes, traditional approaches such as the decision tree must therefore be applied differently. The proposed algorithm, which draws from both methods, is designed to address the overlapping class problem using fuzzy logic. For flexibility, the proposed method also incorporates a neural network to learn the fuzzy membership function (Jang, 1993). Batista, Prati, and Monard (2004) provided evidence illustrating the performance bottlenecks of 10 methods for dealing with imbalanced training datasets. In Batista et al. (2004), they concluded that class-overlapping issues essentially influence the performance of learning systems, and that over-sampling methods under specific conditions provide more accurate results than under-sampling methods.

In contrast to the cost-sensitive methods, other proposed schemes to solve the class imbalance problem are classified as sampling methods. Chawla, Bowyer, Hall, and Kegelmeyer (2002) proposed the SMOTE approach based on over-sampling; Estabrooks, Jo, and Japkowicz (2004) proposed a multiple resampling method; and Drummond et al. (2003) discussed why under-sampling is better than over-sampling. Clearly, the conclusions of Batista et al. (2004) and Drummond et al. (2003) contradict each other. Considering the drawbacks of these techniques, the proposed algorithm adopts an indirect-sampling technique using a local contexture estimator when the multiple classifiers provide conflicting forecast responses. The indirect-sampling method generalizes a classification involving more than two classes into a binary-learning classification based on class values of either 0 or 1, indicating a conflict-sensitive or conflict-insensitive example, respectively. A training model constructed from the conflict-sensitive examples can then be employed to perform the original classification on test data. This indirect-sampling strategy is intended to avoid direct random sampling on real-world instances with known posterior probabilities, while it produces a local contexture dataset that contains the relationship between entropy-based estimators in a locality and conflicting decision-making from multiple classifiers. Based on the following two reasons, the added local contexture computation can effectively offer a particular way to discover interesting patterns.

(i) Diversity in uniform distribution: the original training dataset comprises three-class data, the variation of its probability distribution is high, and class overlapping is broad. In particular, most of the interesting patterns within the 'Risky' class are covered by the other two classes. Through this indirect-sampling strategy, the proposed method removes non-overlapping instances from the original dataset so that the new dataset attains a resemblance in the overlapping of each class. It thereby alleviates imbalance problems and retains the part of the examples that holds the original probability distribution disputed by overlapping. In other words, this strategy generates a set of instances with diversity in a uniform distribution, from which more patterns of interest can be discovered.

(ii) Locality in correlation patterns: in fact, neighboring instances created in one day are much more correlated than instances generated over a long gap. If the strategy distinguishes the instances on which multiple classifiers, with diverse capabilities in the influence of training parameters, make conflicting decisions, it can assist the classifiers in judging disputed decisions in terms of locality.


Instances identified as conflict-sensitive are likely to contain far more interesting patterns than others. Because the majority of consistent decisions have been removed, the relative prominence of these interesting patterns is, in theory, amplified in terms of local patterns. Therefore, conflict-sensitive instances are very suitable for mining interesting patterns; a small illustrative sketch of this relabeling idea follows.
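The sketch below illustrates the indirect-sampling idea in its simplest form: relabel each training example as conflict-sensitive (g = 0) or conflict-insensitive (g = 1) according to whether several classifiers trained on different views of the data disagree on it. It is a minimal sketch under stated assumptions (scikit-learn-style arrays, bootstrap views, decision-tree base learners); the number of views and the resampling scheme are illustrative and not the paper's exact configuration.

```python
# Minimal sketch of the indirect-sampling idea: relabel each training example as
# conflict-sensitive (g = 0) or conflict-insensitive (g = 1) according to whether
# several classifiers trained on different resamples disagree on it.
# Assumptions: scikit-learn-style data (X, y); the number of views and the bootstrap
# resampling scheme are illustrative, not the paper's exact configuration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def conflict_sensitivity_labels(X, y, n_views=3, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_views):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap view
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        preds.append(clf.predict(X))
    preds = np.vstack(preds)
    agree = (preds == preds[0]).all(axis=0)    # all views give the same class
    return np.where(agree, 1, 0)               # 1 = insensitive, 0 = sensitive
```

A model trained only on the examples labeled 0 by such a routine would then play the role of the conflict-sensitive classifier described later in the paper.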

2. Methods

The system architecture depicted in Fig. 1 describes how the system components interact and work together to accomplish the overall system objectives. Fig. 1 describes the system operation, what each component does and the information exchanged among the components. The training system architecture comprises two major parts. In the first part, called row-fusion and column-fusion, an ensemble dataset is constructed from each customer input vector and approximate response to estimate its local context. In the second part, called cross-fusion, the decision-rule neuro-fuzzy system identifies conflicts in decision-making to enhance minority class data mining and discover interesting patterns.

In contrast to the method of Saerens et al., which generates the a priori probabilities (Saerens, Latinne, & Decaestecker, 2002), the proposed algorithm applies a simple technique called row-fusion, in which each row determines its response (class) from a common decision of three different classifiers. Subsequently, cross-fusion is performed on a local contexture computation of each test example and of each training example, while another procedure, column-fusion, aggregates five classifier decisions for each test example to create a new dataset. The ensemble dataset generated from the three fusions can be used to approximately estimate the a priori probabilities in the test dataset. Although the proposed system can improve the classification accuracy of the minority class, the majority class may suffer performance degradation. Recently, Jo and Japkowicz (2004) have suggested that class imbalances are not truly responsible for this performance degradation, which may instead be caused by small disjuncts. Thus, the proposed approach considers this factor for the imbalance problems mentioned above, and then proposes an entropy-based contexture solution to them. The experimental results confirmed that the proposed scheme can raise the classification accuracy for one of the two minority classes from less than 10% to 71%, while the other minority class does not sacrifice its prediction accuracy. The random probability of correctly predicting the first minority class is 15% according to the class distribution in the training dataset; thus, if the proposed method can increase the accuracy to over 30%, the study is already very successful. In contrast, the other minority class constitutes only 19% of the dataset, but it has a strong pattern, so its classification accuracy should not be sacrificed. This investigation used a real-world banking dataset of about 1000 records, with the classes '1', '2', and '3' distributed in the ratio 66:15:19, sampled randomly from a dataset of 62,621 records. Hopefully, some patterns can be found to identify class '2', denoting a customer who may have bounced-check risks.

2.1. Entropy-based contexture

The differences in the local contexture of relationships between training examples are assumed to be highly systematic and representable in a set of rules. Hence, these rules can identify the non-deterministic decision-making between classifiers. Consider each x_i in a training dataset A containing N_A examples, where A = {[X_k, R_l] | x_{i,j} ∈ X is a signal matrix and r_{i,j} ∈ R is a response matrix}, and k and l denote the dimensions of the signal matrix X and the response matrix R, respectively. Each example's local contexture is constructed from a small set of examples x and r, where

x = \{x_{i,j} \mid i = 1 \ldots w,\; w = 2m+1 \ll N_A,\; j = 1 \ldots k\} \quad and \quad r = \{r_{i,j} \mid i = 1 \ldots w,\; w = 2m+1 \ll N_A,\; j = 1 \ldots l\}.

Because these examples, fed sequentially to a learner, might affect a decision by the classifiers, a common technique called an entropy estimator (Coifman & Wickerhauser, 1992) must be used in a local observation. Eq. (1) quantitatively determines the local entropy difference of each example v:

\Delta E_v = \sqrt{\sum_{j=1}^{k} E_j^2(x)} - \sqrt{\sum_{j=1}^{l} E_j^2(r)}, \qquad E_j(x) = -\sum_{i=v-m}^{v+m} x_{i,j}^2 \log(x_{i,j}^2),    (1)

where x_{i,j} is the distribution of x_j, which bins the elements of vector x_j into \sqrt{w} equally spaced containers and returns the number of elements in each container as a row vector.

The entropy-based estimator calculated using information theory is a local entropy difference ΔE_v of each example v, computed in a fixed-size window w that limits a continuous range of examples near v and shifts across the dataset. Subsequently, Eq. (2) combines the entropy-based estimator, the vector norm and their vector difference for those examples to construct the local contexture U_v, enabling the algorithm to detect a conflict-sensitive response, where the training example frequently generates conflicting classifier decisions.
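The following is a minimal sketch of the windowed entropy-difference estimator in the spirit of Eq. (1). The binning into roughly sqrt(w) containers follows the text; treating each column independently and truncating the window at the dataset borders are assumptions of the sketch, not the authors' exact implementation.

```python
# Sketch of the local entropy difference of Eq. (1): for each example v, the signals and
# responses inside a window of w = 2m+1 neighbors are binned, and a squared-entropy
# term is accumulated per column. Border truncation and per-column treatment are
# assumptions of this sketch.
import numpy as np

def column_entropy(window_col, n_bins):
    hist, _ = np.histogram(window_col, bins=n_bins)
    p = hist / max(hist.sum(), 1)              # normalized bin counts
    p = p[p > 0]
    return -np.sum(p**2 * np.log(p**2))

def local_entropy_difference(X, R, m=7):
    n = len(X)
    w = 2 * m + 1
    n_bins = max(int(np.sqrt(w)), 1)
    dE = np.zeros(n)
    for v in range(n):
        lo, hi = max(0, v - m), min(n, v + m + 1)   # window truncated at the borders
        ex = [column_entropy(X[lo:hi, j], n_bins) for j in range(X.shape[1])]
        er = [column_entropy(R[lo:hi, j], n_bins) for j in range(R.shape[1])]
        dE[v] = np.sqrt(np.sum(np.square(ex))) - np.sqrt(np.sum(np.square(er)))
    return dE
```

The resulting ΔE_v values would be one ingredient of the local contexture U_v that the rest of the system consumes.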


Fig. 1. Multiclassifier system architecture based on detection of conflict-sensitivity using the NFLDR learner. The upper dotted square comprises three fusion processes integrating multi-source decisions depending on conflict-sensitivity in decision-making. The conflict-sensitivity determined by the NFLDR learner on the right side of the sub-diagram uses an entropy-based estimator U, called the local contexture, in a fixed-size window on training dataset A to dissect conflict-sensitive examples ~v and conflict-insensitive examples v̄ on validation dataset B and ensemble dataset B′. The fixed-size window, here defined as w = 15, shifts by a fixed number of examples through the overall dataset to iteratively evaluate U_v for each example v. The lower dotted square in Fig. 1 is a two-layer decision fusion that constructs two data models after training on ~v and v̄ in B′, respectively, to test an outside test dataset C. Both data models use a standard decision-tree classifier. Additionally, the decisions resulting from the models offer users specific patterns of interest during an 'explain' phase within the data mining cycle.

Experimental results demonstrate that the local contexture estimator U_v can easily identify the conflict-sensitive decision-making response. Generally, if the examples are independent, then the combined response variance is large. Conversely, if the examples are correlated, then the response variance is conflict-sensitive.

2.2. Conflict sensitivity

Depending on the primary training probability, a deterministic classification method such as a decision tree constantly classifies a test example into one skewed class due to its high bias and high variance properties, as surveyed by Zadrozny and Elkan (2001b).

Thus, using the properties of decision tree methods to infer a set of interesting rules from imbalanced datasets is inadequate, since they would probably result in inaccurate conditional probability estimates. In fact, a weak pattern from a minority class can be amplified by learning the opinions of multiple classifiers, which are built from diverse datasets produced by sampling, splitting, and merging operations in rows, columns, or both. In the proposed model, six decision trees are constructed from these operations, including global trees, correctly classified trees, misclassified trees, local trees, class-balanced trees and synthesis trees, and gathered together. These classifiers produce conflicting responses due to the change in data distribution.


Applying the conflicting response to identify the weak pattern is feasible, since such responses usually appear with less statistical reliability. By contrast, a robust response denotes a strong pattern, which inevitably masks some weak patterns when standard classifiers are used on real-world data. Therefore, distinguishing whether a response represents a conflict-sensitive or a conflict-insensitive decision is very important in identifying a weak pattern. To learn the conflict-sensitivity of the responses from the classifiers, the proposed model uses an inside-test of the decision tree C4.5 (Quinlan, 1993) to generate another training dataset containing the local contexture estimator associated with the conflicting and consistent responses corresponding to each example.

2.3. Two-layer decision fusion

To postulate the conflict sensitivity originating from varied classifier decisions, the model must first merge these decisions to form a final response, which may come from a global tree, a classified tree, a misclassified tree, a local tree or a class-balanced tree. The multiple classifier system (MCS), an aggregated classifier, has been used since the 1960s. Boosting (Freund & Schapire, 1996) and bagging (Breiman, 1996), which are both based on manipulating training samples, are two of the most used MCSs. The two approaches can be combined easily, for example by voting AdaBoost learners. In 1995, Freund and Schapire (1997) introduced the AdaBoost approach, which adjusts the next weak classifier to focus on the misclassified training data. In contrast, the statistical consensus theory of Benediktsson, Sveinsson, and Swain (1997) summarizes all the training data only once by exploiting the independence of data sources. However, these existing multiclassifier fusion methods for multisource data focus only on improving classification accuracy, and do not address noise problems. Jo and Japkowicz (2004) argued that solving the class imbalance problem should consider the factor of small disjuncts in performance degradation; any noise on the minority class would sensitively influence these small disjuncts (Jo & Japkowicz, 2004). Although an appropriate weighting may increase the total classification accuracy for all classes, the minority class pattern is likely to be deformed by an artificial environment far removed from the real-world case. Thus, the proposed model provides a two-layer decision fusion, dissecting conflict-sensitive examples (~v) and conflict-insensitive examples (v̄), to discover a weak interesting pattern from the noise of the other two strong patterns, and to understand the patterns identified.

2.3.1. Row fusion

To dissect ~v and v̄ in the dataset, the model builds a new learner W to forecast whether a test example v ∈ ~v or v ∈ v̄. Because the contexture computation U_v of each example v must estimate the response value r_v in Eqs. (1) and (2), the model requires a validation dataset B whose observation data (input signals) equal those of C but whose corresponding response values are invisible, in order to create another ensemble dataset B′ including the predicted response values.


Let the number of examples in B be denoted by N_B, and set N_B ≅ N_C, the number of examples in C. First, all examples in the training dataset A form a global decision tree DT_G, which generates the predicted responses {r}_1^{N_A} on A and the predicted responses {r^0}_1^{N_B} on B by using a C4.5 (Quinlan, 1993) classifier such that r^0 ≈ r; the contexture U_A on A is estimated by using Eq. (2). In the sequel, DT_G is divided into two independent decision spaces according to its predicted responses {r}_1^{N_A} using the inside-test. These decision spaces are the 'classified tree' DT_C, which makes the same decisions, and the 'misclassified tree' DT_M, which always makes the opposite decisions to DT_C. Applying a posterior pruning approach to DT_C, the multiclassifier system in Fig. 1 produces the pruned tree DT_CP to forecast the predicted responses {r^1}_1^{N_B} on B. Nevertheless, DT_M is applied directly to forecast the predicted responses {r^2}_1^{N_B} on B without pruning, since it may be a singleton decision tree node. Since the original decision space is classified into two states denoted by g, Eq. (3) defines the calibrating function w of a learner W to assess whether an example v underlies either the conflict-sensitivity state g = 0 or the conflict-insensitivity state g = 1:

w(W(U_v), \alpha_i) = \{\operatorname{round}(g_v) \mid \exists \alpha_i,\; g_v = W(U_v) + \alpha_i,\; g_v \in [0,1],\; f(\alpha) = f^{*}\},    (3)

where f denotes a search function used to find the minimum f*, and α_i represents an optimum adjusting value for the output of learner W. Eq. (4) applies Bayes's rule (John & Langley, 1995) with a maximum-entropy estimate ξ_max to Eq. (3) to balance the probability distribution of each class. Bayes's rule states that given a hypothesis a and evidence b which bears on that hypothesis, then p(a | b) = p(b | a) p(a) / p(b), where p denotes a probability function of each class occurring in both a and b. By using Eqs. (3) and (4), the experimental results reveal stationary accuracy for each class, since the conflict-sensitive state may contain an unbiased distribution for a after adjusting the output distribution of W in b:

f(\alpha) = 1 - \xi_{\max}, \qquad \xi_{\max}(a, b_{g=0}) = -\sum_{i \in C_a} p^{2}(a_i \mid b_{g=0}) \log p^{2}(a_i \mid b_{g=0}),    (4)

where C_a denotes the set of class labels, a represents a class label of the responses in A, i.e. [1 2 3], and b is a class label of the states for dissecting ~v and v̄, i.e. [0 1].
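The sketch below illustrates the calibration of Eqs. (3) and (4): the learner output is shifted by α, rounded to a state g, and the α that minimizes f(α) = 1 − ξ_max (i.e., that makes the class distribution inside the conflict-sensitive state as balanced as possible) is kept. The grid of candidate α values is an assumption of this sketch, not the authors' search procedure.

```python
# Sketch of the calibration in Eqs. (3)-(4): g_v = W(U_v) + alpha is rounded to a state
# (0 = conflict-sensitive, 1 = conflict-insensitive), and alpha is chosen so that the
# squared-probability entropy of the original classes inside the sensitive state is
# maximal, i.e. f(alpha) = 1 - xi_max is minimal. The alpha grid is an assumption.
import numpy as np

def squared_prob_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p**2 * np.log(p**2))

def calibrate_alpha(raw_output, class_labels, alphas=np.linspace(-0.5, 0.5, 101)):
    best_alpha, best_f = 0.0, np.inf
    for a in alphas:
        g = np.clip(np.round(raw_output + a), 0, 1)   # state of each example
        sensitive = class_labels[g == 0]
        if len(sensitive) == 0:
            continue
        f = 1.0 - squared_prob_entropy(sensitive)     # f(alpha) = 1 - xi_max
        if f < best_f:
            best_alpha, best_f = a, f
    return best_alpha
```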


Conversely, the multiclassifier system employs the dataset B with {r^0}_1^{N_B} to construct a local tree DT_L that is directly applied to predict the responses {r^3}_1^{N_B} on B without pruning. For row fusion, a weighted linear combination of opinions is used to estimate a consistent decision r′ on an ensemble dataset B′ according to the three opinions r^0_v, r^1_v and r^3_v for each v. In Eq. (5), the consistency probability ω_i of each opinion weights the linear combination of r^0_v, r^1_v, and r^3_v to determine r′_v if r^1_v is the same as r^3_v; otherwise, r′_v takes the mean of the consistent opinion r^0_v and the additional opinion r^2_v:

r'_v = \begin{cases} \sum_{i=0,1,3} \omega_i\, r^i_v & \text{if } r^1_v = r^3_v, \\ \dfrac{r^0_v + r^2_v}{2} & \text{otherwise}, \end{cases} \qquad \omega_i = \frac{\sum_{v=1}^{N_B} \phi(r'_v = r^i_v)}{N_B},    (5)

where φ is an indicator of agreement between opinions. Once r′_v is visible on B′, the contexture U_{B′} of each example in B′ is easily estimated from Eq. (2). Now, given U_A and U_{B′} with corresponding responses r and r′ respectively, another learning model with the new training dataset (U_A, r) is created for the new testing dataset (U_{B′}, r′). The learning objective is to maximize the likelihood of contexture between (U_A, r) and (U_{B′}, r′), since the ensemble contexture U_{B′} is similar to the unknown U_C, the contexture of the test dataset C. Thus, the set of conflict-sensitive examples ~v (g = 0) in B′ may be determined to generate an unbiased learning model W̃ for predicting the true responses in C.

2.3.2. Column fusion

To link the original dataset B′ with the corresponding contexture U_{B′}, the proposed column-fusion is better able to involve complex decisions between dataset and contexture than row-fusion. Our previous work established that the classification of equal quantity (CEQ) (Hung, Huang, & Chen, 2002), in which each class is sampled with an equal number of examples, can improve accuracy when predicting a weak pattern hidden by another strong pattern. Therefore, the proposed multiclassifier system also introduces the CEQ approach to construct a class-balanced decision tree DT_EQ, illustrated in Fig. 1, using Drummond et al.'s under-sampling technique (Drummond et al., 2003). This DT_EQ provides a different opinion from the above-mentioned classifiers because of local equivalences. The CEQ plays a critical role in the division of conflict-sensitive and conflict-insensitive examples, because most basic classifiers can reach high classification accuracy on average, i.e. only a few conflict-sensitive examples ~v would be detected. Therefore a learning model W̃ cannot easily be constructed to be entirely representative of the features of the various patterns. Thus, the CEQ is adopted to filter U_A and U_{B′}, as well as to filter A in DT_EQ. Subsequently, the last classifier, called the synthesis decision tree, produces a response r^4, which is constructed from DT_EQ and DT_CP according to g = 0 and g = 1, respectively. Finally, the column-fusion vertically merges the contextures and responses to create a new dataset X = [U_{B′}, r^0, r^1, r^2, r^3, r^4, g] for local pattern mining; a small sketch of the row-fusion vote follows.
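The sketch below is a literal reading of the row-fusion vote of Eq. (5). Class labels are numeric (1, 2, 3), so the weighted combination is computed numerically and rounded back to a class; using agreement with the global-tree opinion r^0 as the weight and normalizing the weights are assumptions of this sketch.

```python
# Literal sketch of the row-fusion vote of Eq. (5). The agreement-rate weights
# (measured against the global-tree opinion r0), their normalization, and the final
# rounding back to an integer class are assumptions of this sketch.
import numpy as np

def row_fusion(r0, r1, r2, r3):
    r0, r1, r2, r3 = map(np.asarray, (r0, r1, r2, r3))
    opinions = {0: r0, 1: r1, 3: r3}
    omega = {i: np.mean(opinions[i] == r0) for i in opinions}  # consistency of each opinion
    total = sum(omega.values())
    omega = {i: w / total for i, w in omega.items()}           # normalize the weights
    fused = np.where(r1 == r3,
                     sum(omega[i] * opinions[i] for i in opinions),
                     (r0 + r2) / 2.0)
    return np.rint(fused).astype(int)
```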

2.3.3. Cross fusion

To exploit the multiclassifier system, a cross-fusion is required to coordinate decision-making for a two-layer explanation, which the experimental section demonstrates in detail. First, the cross-fusion process makes a decision with the mean decision r̄ of r^0, r^1, r^2, r^3, and r^4 when g = 0. Next, a learning model W is applied to create two dependent models W_0 and W_1 to learn the contexture-associated classifier decision-making results using the partial columns of X. Let X_{g=0} = \{[\,\bigcup_{i=0}^{k} r^i \cup \frac{\sum_{i=0}^{k} r^i}{k} \cup g\,] \mid r^i, g \in X,\; k = 4\} and X_{g=1} = \{[U_{B'} \cup g] \mid U_{B'}, g \in X\}. Furthermore, let W_0(X_{g=0}, r_{g=0}) = r^5 and W_1(X_{g=1}, r_{g=1}) = r^6, such that r_{g=0} ∪ r_{g=1} = r on A, where r^5 and r^6 are the conflict-sensitive responses for ~v on C and the conflict-insensitive responses for v̄ on C, respectively. Assume that most examples of the majority class K are addressed as v̄, represented as K ∝ v̄. When using U_{B′} ∈ X_{g=1}, W_1(X_{g=1}, r_{g=1}) can be used to forecast r^6. Because this induction implies v̄ ∝ U, it follows that K ∝ v̄ ∝ U. Therefore, only the majority class K would be predicted successfully using U_{B′} when g = 1; otherwise the other classes should adopt W_0(X_{g=0}, r_{g=0}) to forecast r^5, since a nonlinear combination can repair the drawback of a rough r̄ when the individual opinions in r̄ differ violently. The experimental analysis herein demonstrates that K ∝ v̄ is feasible.

3. Neuro-fuzzy logic with decision rules

The learning model W is a neuro-fuzzy logic inference system with decision rules (NFLDR), which are generated from a standard C4.5 decision tree classifier, to discriminate between ~v and v̄ examples on the test dataset C. Jang (1993) uses a hybrid learning algorithm to identify the parameters of Sugeno-type (Sugeno, 1985) fuzzy inference systems (FIS). The proposed system not only determines parameters using a Sugeno-type FIS, but also obtains fuzzy logic rules from the decision rules of a decision tree. This NFLDR W combines the least-squares and back-propagation gradient descent methods to train the FIS membership function parameters to emulate the given contexture U_A on A. Additionally, NFLDR W formulates the conditional statements comprising the fuzzy logic from a decision tree built from A. Thus, NFLDR W is a very powerful, computationally efficient tool for handling imprecision and nonlinearity. The following steps briefly describe how NFLDR W determines the g values. First, as illustrated in Fig. 2, W inputs each variable in U_A to the membership function M(U_A). Second, the input membership functions are associated with the fuzzy rules generated from a decision tree DT_P1 in Fig. 3, where DT_P1 denotes a decision tree of training dataset A pruned to one level, eliminating all errors within one unit. In the third step, the fuzzy rules are applied to a set of output characteristics. In the fourth step, the output characteristics are transformed into output membership functions. Finally, the output membership functions are summed to a single-valued output or a decision associated with the output. A membership function is a curve defining how each point in the input space is mapped to a degree of membership between 0 and 1. Unlike an ordinary FIS using a predefined membership function, the NFLDR W learns the membership function M(U_A) using the back-propagation gradient descent method to define an adaptive membership function.


Fig. 2. Recognizing contexture for conflict-sensitivity with a 4-input, 1-output, 23-decision-rule NFLDR learner W. The learner cooperates with the proposed system by training FIS membership function parameters to emulate the input contexture U_A for each example v. The left-hand side of the diagram contains the number, name and notation of the variables, as well as two sub-diagrams in each input square. The left-hand side of the input square contains a membership function sub-diagram whose Y-axis denotes the degree of membership for each class. The right-hand side of the input square contains an output function sub-diagram whose Y-axis represents the degree of conflict-sensitivity. The X-axis in these sub-diagrams maps both input values. The shadowed input square indicates that input 2 (signal X_v) does not involve any fuzzy rule affecting the output (conflict-sensitivity g), since the source decision tree in Fig. 3 for generating decision rules does not hold any node containing input 2. The central part of the diagram shows only Rules 1–5 of the 23 because of space limitations. Rule 1 demonstrates the fuzzy reasoning process and generates the FIS output surface associated with its antecedent variables, i.e. response and DSR. The rule's consequent variable corresponds to its output value representing the degree of conflict-sensitivity. The gray arrowed line depicts the application of the fuzzy operator in the antecedent and the implication from the antecedent to the consequent. The right-hand side of the diagram denotes a linear combination of the degrees of membership of each input to produce a defuzzified output number.

However, the learning process has to define an initial membership function before learning its optimal parameters. The proposed approach adopts a generalized bell function as the initial membership function with three randomly generated parameters. The generalized bell function of x depends on the three parameters σ, b and d, as given by Eq. (6):

f(x; \sigma, b, d) = \frac{1}{1 + \left|\frac{x - d}{\sigma}\right|^{2b}},    (6)

where b is typically positive and d locates the center of the curve. Only roughly 20 epochs are required to learn σ, b and d with the back-propagation gradient descent method, since a fuzzy logic system weakens the effect of errors in the function. Moreover, DT_P1 limits the number of rules generated by deleting approximate rules; to avoid time-consuming computation, a maximum of 30 rules is permitted herein.
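A minimal sketch of the generalized bell membership function of Eq. (6) follows. The parameter names mirror the equation (σ controls the width, b the slope, d the center); the back-propagation learning of these parameters performed by the NFLDR is not shown.

```python
# Generalized bell membership function of Eq. (6). Learning sigma, b and d by
# back-propagation, as the NFLDR does, is not shown; the example parameters are
# arbitrary illustrative values.
import numpy as np

def generalized_bell(x, sigma, b, d):
    return 1.0 / (1.0 + np.abs((x - d) / sigma) ** (2 * b))

# Example: degree of membership of a few crisp inputs in a "medium" fuzzy set.
memberships = generalized_bell(np.array([0.2, 0.5, 0.8]), sigma=0.15, b=2.0, d=0.5)
```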

This study proposes an approach to produce the fuzzy rules from the decision tree DT_P1 by a fuzzy rule function R of DT_P1, as described below. As illustrated in Fig. 3, the NFLDR W extracts fuzzy rules from DT_P1 to form if–then rule statements and integrates them into W. Each numbered circle associated with its arrowed dotted line denotes a reverse fuzzy rule, which reverses the node, item and order for both the antecedent and consequent of an if–then rule. The reverse fuzzy rule is revised by transposing the node order and eliminating duplicate nodes such that each unique variable name appears only once in the rule. The numbering order for each rule follows a top-down and zigzag path. Furthermore, the lower left part of Fig. 3 depicts all input and output variables regarding their crisp (non-fuzzy) numbers X_j limited to the specific ranges p(X_j) given by Eq. (7) for variable j:

p(X_j) = \begin{cases} \left\{ r_{i,j} \;\middle|\; r_{i,j} = x_{i,j} + \dfrac{\sum_{i=1}^{n_j} \Delta x_{i,j} / n_j}{2},\; r_j^{\min} \leq r_{i,j} \in X_j \leq r_j^{\max} \right\} & \text{if } X \text{ is numeric}, \\ \{ r_{i,j} \mid \exists!\, r_{i,j} \in X_j,\; i = 1 \ldots n_j \} & \text{otherwise}. \end{cases}    (7)

In Eq. (7), r_{i,j} denotes the numeric label for class i of variable j between r_j^{min} and r_j^{max}; x_{i,j} represents the distribution of X_j in each class i, which is binned into n_j containers; and Δx_{i,j} is the differential value of each x_{i,j}.


Fig. 3. Extraction of fuzzy rules from a decision tree DTP1 for producing if–then rule statements integrated into NFLDR. Each numbered circle linked with its arrowed dotted line depicts a reverse fuzzy rule, which reverses the antecedent and consequent. The lower left part of the diagram indicates all input and output variables regarding crisp (non-fuzzy) numbers limited to specific ranges. Other labels from Rules 6–23 are ignored to simplify the diagram. The numbering order follows a top-down and zigzag path.

Before discretizing the numeric domain, Eq. (7) must first determine the number of classes n_j ∈ n_r for variable j by using Eq. (8):

n_r = \{ n_j \mid n_j = \max(R_j(DT_{P1}) + 1),\; j = 1 \ldots k + l \}.    (8)

In Eq. (8), k and l indicate the dimensions of the signal matrix X and the response matrix R, respectively, and R_j(DT_P1) is the fuzzy rule function of DT_P1 for variable j. The fuzzy rule function returns the number of membership functions in Eq. (6) and the numeric label for class i in Eq. (7), and joins them using the 'AND' operator.


Eq. (8) searches for the maximum number of membership functions of each variable j associated with a fuzzy set such as {low, medium, high} for input variables and {sensitive, insensitive} for output variables. Subsequently, the fuzzy system evaluates the antecedent of each if–then rule in order by fuzzifying the input and applying any required fuzzy operators, and then implies the result to the consequent; a small sketch of this rule-extraction step follows.
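The sketch below shows one way to turn root-to-leaf paths of a trained decision tree into if–then rule statements, in the spirit of the reverse fuzzy rules of Fig. 3. It relies on scikit-learn's tree internals; collapsing duplicate tests so each variable appears once by keeping the last threshold on the path is an assumption of this sketch.

```python
# Sketch of extracting if-then rules from a trained decision tree: each root-to-leaf
# path becomes one rule, and duplicate tests on the same variable are collapsed so
# each variable appears only once (the last threshold on the path is kept, which is
# an assumption of this sketch).
from sklearn.tree import DecisionTreeClassifier

def extract_rules(tree: DecisionTreeClassifier, feature_names):
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:              # leaf node
            antecedent = {}
            for name, op, thr in conditions:         # keep one test per variable
                antecedent[(name, op)] = thr
            label = int(t.value[node][0].argmax())
            rules.append((antecedent, label))
            return
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        walk(t.children_left[node], conditions + [(name, "<=", thr)])
        walk(t.children_right[node], conditions + [(name, ">", thr)])

    walk(0, [])
    return rules
```

Each extracted antecedent can then be fuzzified by attaching a membership function, such as the generalized bell of Eq. (6), to every (variable, threshold) pair.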


Fig. 4. (a) NFLDR output surface of response versus DSR for Rule 1. The adjusted output, whose resulting value is either 0 or 1, is rounded off after adding an alpha value a from Eq. (3) to the original output. For example, given a = −0.25 and an original output of 0.5, the resulting value is 0 rather than the direct round-off(0.5) = 1. The dotted gray line in (a) denotes the depression of the output surface after adjusting the original output. The color map indicates the degree of conflict-sensitivity for decision-making; the output value ranging from 0 (blue) to 1 (brown) maps from conflict-sensitive to conflict-insensitive. Diagram (b) shows the NFLDR output surface of response versus LED for Rule 4 in Fig. 2, where Rule 4 is apparently conflict-sensitive, and its range of the response entropy is wider than that of Rule 1 in (a). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 5. Scatter diagram of signal versus contexture in terms of probability density. The class scatter diagram in sub-diagram (a) shows the correlation between contexture and signal for each class. By contrast, in sub-diagram (b), red square points form a set of conflict-sensitive examples ~v concentrating on the lower left side, such that a hill takes shape in (a). This study addresses the examples ~v in the hill area to mine patterns of interest. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4 shows the NFLDR output surface of 'response' versus 'DSR' for Rule 1 and of 'response' versus 'LED' for Rule 4, mentioned above in Fig. 2. Comparing (a) and (b) in Fig. 4, Rule 1 presents a more relaxed output boundary than Rule 4 for recognizing the conflict-sensitive examples. Fig. 4(a) shows the effect of Eqs. (3) and (4) on the output more clearly than Fig. 4(b). However, Rule 1's response range is sharper than that of Rule 4, where the variables involved in the fuzzy rule antecedents are SRD and LED, respectively. The result in (a) implies that only specific responses within a narrow range would cause a decision-making conflict, indicated by the blue area, when the signal values of neighboring examples approach each other. Conversely, the result in (b) implies that many responses within a wide range would lead to a decision-making conflict (blue area) when their local contexture variance is very low. The proposed system applies Eqs. (3) and (4) to adjust example conflict-sensitivities against the multisource decisions according to their probability distribution, and can thus depress the output surface, enabling many patterns with an underlying conflict-sensitive condition to be classified. Additionally, the proposed system employs the local entropy difference (LED) to provoke potentially alternative decisions, converting static patterns representing underlying conflict-insensitivity (the non-blue area) into dynamic patterns indicating underlying conflict-sensitivity. Thus, the proposed approach can discover many more patterns of interest than traditional methods can. In this way the learning model W determines whether v is ~v or v̄. Similarly, the two dependent models W_0 and W_1 in the cross-fusion phase are used with W to learn the contexture-associated classifier decision-making results using the partial columns of X.

Furthermore, Eq. (9) presents a probability density function f_k to confirm that the proposed system can address a set of interesting patterns that may belong to a minority class. The probability density function f_k(X) returns the n × 1 vector k, which contains the probability density of the multivariate normal distribution with zero mean and identity covariance matrix, evaluated at each row of the n × d matrix X. Rows of X map to observations, and columns map to variables:

k_i = (2\pi)^{-d/2} \, e^{-\frac{1}{2}\sum_{j=1}^{d} (X_{i,j}/R)^2}, \qquad i = 1 \ldots n,    (9)

where R is an upper triangular matrix such that R′ · R = X, and R′ denotes the transpose of R. Fig. 5 applies Eq. (9) to plot a scatter diagram of the observed correlation between the probability density of the contexture U_{B′} and that of the signal X. Fig. 5(a) and (b) illustrate that the conflict-sensitive examples in (b) form a hill containing the same examples ~v as in (a). The examples ~v in the hill are used to build a classification learning model with simple C4.5 classifiers. Diagram (a) clearly shows that the majority of red cross-points, indicating 'Good' customers, lie outside the hill area and incur the imbalance problem. Similarly, some green circle points representing 'Declined' customers have the same problem as the 'Good' customers. The experimental results show that 'Good' (class 1) customers and 'Declined' (class 3) customers have strong patterns which nonetheless partially obscure interesting patterns for 'Risky' (class 2) customers under traditional classifiers.
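The following is a minimal sketch of the density computation behind the Fig. 5 scatter plot: the standard zero-mean, identity-covariance multivariate normal density evaluated at each row, following the textual description of Eq. (9). It mirrors that description rather than the authors' exact implementation, and the variable names are illustrative.

```python
# Sketch of the density used for the Fig. 5 scatter plot: the standard multivariate
# normal density with zero mean and identity covariance, evaluated at each row,
# following the description of Eq. (9).
import numpy as np

def mvn_density(X):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(X**2, axis=1))

# Example (illustrative names): density of the signal part and of the contexture part
# of an ensemble dataset, plotted against each other as in Fig. 5.
# signal_density = mvn_density(signals); contexture_density = mvn_density(contextures)
```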


Table 1
List of nominal values in nominal attributes

Variable | Attribute | Possible values of nominal attributes | Meanings
1 | Overdraft | None, vouching | Type of overdraft
2 | Type | Chief, sponsor, contribution, others | Type of account
3 | Transactions | Very high, high, middle, low, very low | Number of transactions
4 | Interest | Huge, very high, high, middle, low, very low, none | Interest over this year
5 | Activity | No, yes | Status of recent transactions
6 | Changed-signature | No, yes | Any changes to signature of the specimen seal
Class | Credit | Good (class 1), Risky (class 2), Declined (class 3) | Credit class


Fig. 6. Actual class distribution for six attribute variables and the class variable.

The business rule section of this study suggests several applications of these patterns.

4. Experimental data

This work proposes a novel algorithm for mining interesting patterns for the Tainan Business Bank (TNB), a local bank in Tainan. The proposed algorithm adopts state-of-the-art models to circumvent class imbalance problems in the raw data. The bank expects to create a descriptive risk management model from raw data containing randomly selected checking accounts. The denominated variables and their detailed descriptions for this training exercise are available in Hung et al. (2002). In summary, Table 1 lists the nominal values used to convert numeric attributes into nominal attributes. Significantly, all of the numeric values in the original training dataset were normalized to the range between 0 and 1 inclusive; a brief sketch of this preprocessing is given below.
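The sketch below illustrates the preprocessing implied by Table 1 and the normalization statement: min-max scaling of a numeric column to [0, 1] and conversion into nominal levels. The equal-width bin boundaries and the level names used here are illustrative assumptions; the bank's actual thresholds are not published in this section.

```python
# Sketch of the preprocessing described above: min-max normalization of numeric
# attributes to [0, 1] and conversion into nominal levels as in Table 1. Equal-width
# bins and the level names are illustrative assumptions.
import numpy as np

def min_max_normalize(column):
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    return (column - column.min()) / span if span > 0 else np.zeros_like(column)

def to_nominal(normalized, levels=("very low", "low", "middle", "high", "very high")):
    edges = np.linspace(0.0, 1.0, len(levels) + 1)
    idx = np.clip(np.digitize(normalized, edges[1:-1]), 0, len(levels) - 1)
    return [levels[i] for i in idx]

transactions = to_nominal(min_max_normalize([3, 10, 250, 42, 7]))
```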

To obtain a clear class distribution for the training dataset, Fig. 6 illustrates the actual class distribution in this real-world case, with six attribute variables and one class variable over 1000 data records sampled from the original dataset of 62,621 records. Clearly, all variables are dominated by the modal class '1' in Fig. 6. The data are skewed such that the group of good customers is much larger than the other groups combined. Because the outside-test prediction accuracy is an important measure of the proposed model's robustness, a series of experiments was conducted, as described in the next section.

4.1. Experimental results

The proposed model was trained and tested on a PC with 512 MB RAM and a 1.5 GHz Pentium IV CPU running Windows 2000. Table 2 compares the experimental results of the proposed model with those of other methods by accuracy, true positive rate and related confidence intervals.


Table 2
Prediction performance using various data mining models for three-class classification

Methods | Accuracy % | TP rate 'Good' | TP rate 'Risky' | TP rate 'Declined'
Conflict-sensitivity classifier with NFLDR (C-NFLDR) | 81.50 | 0.912 | 0.235 | 0.841
Conflict-sensitive classifier (CSC) | 61 [51, 71] | 0.56 [0.38, 0.73] | 0.57 [0.43, 0.71] | 0.84 [0.84, 0.84]
Conflict-insensitive classifier (CIC) | 72 [68, 77] | 0.98 [0.98, 0.99] | 0.06 [0.04, 0.08] | 0.25 [0.00, 0.50]
C4.5 after data cleaning | 80.00 | 0.912 | 0.154 | 0.842
C4.5 with prior probability smoothing | 81.54 | 0.912 | 0.308 | 0.842
C4.5 with cost and probability smoothing | 80.95 | 0.955 | 0.294 | 0.724
C4.5 with cost smoothing after data cleaning | 80.50 | 0.897 | 0.231 | 0.842
C4.5 with probability density | 79.21 | 0.912 | 0.077 | 0.842
Standard C4.5 | 81.54 | 0.972 | 0.196 | 0.746
Multilayer perceptron | 81.54 | 0.973 | 0 | 0.770
Neural fuzzy logic after sampling with probability density | 78.32 | 0.986 | 0 | 0.737
Naïve Bayes | 81.50 | 0.972 | 0.005 | 0.768
Bayes NetB | 81.44 | 0.97 | 0.007 | 0.771
IBk | 81.15 | 0.964 | 0.016 | 0.774
ID3 | 80.99 | 0.964 | 0.017 | 0.774
Conjunctive rule | 80.97 | 0.965 | 0 | 0.770
Complement naïve Bayes | 80.59 | 0.958 | 0.013 | 0.770
RBF 200-clusters | 80.42 | 0.974 | 0.005 | 0.699
Random committee | 78.68 | 0.929 | 0.062 | 0.741
RBF 30-clusters | 78.54 | 0.974 | 0 | 0.592
RBF 3-clusters | 73.84 | 0.989 | 0 | 0.259
RBF 2-clusters | 72.43 | 0.994 | 0 | 0.157
NNge | 70.26 | 0.825 | 0.124 | 0.629
PRISM | 21.45 | 0.056 | 0.055 | 0.977

In contrast to systems designed to generate consistent experimental results, this model adopts the nondeterministic NFLDR algorithm to determine conflict-sensitive examples and multi-classifier decisions, leading to results that vary between runs. Therefore, Table 2 also lists the confidence intervals of the accuracy and true positive (TP) rates at classification, presented only for our models, to assess the statistical relevance of the 'mean accuracy' hypothesis. The first column in Table 2 identifies the models tested in this experiment. The tested models included the standard classifier C4.5 with the class imbalance problem remedied using cost-sensitive (Domingos, 1999; Elkan, 2001; Zadrozny & Elkan, 2001a) or probability methods, and some representative well-known classifiers, including the following: standard C4.5 (Quinlan, 1993), multilayer perceptron (Fletcher, 1987; Jiang et al., 2000; Rumelhart, Hinton, & Williams, 1986; Stan & Kamen, 1999), neural fuzzy logic (Jang, 1993) after sampling by probability density, naïve Bayes (John & Langley, 1995), Bayes NetB (Heckerman, Geiger, & Chickering, 1995), the instance-based K-nearest neighbor classifier (IBk) (Aha & Kibler, 1991), ID3 (Witten & Frank, 1999), the Conjunctive Rule method, which reads a set of rules directly off a decision tree, complement naïve Bayes (Rennie, Shih, Teevan, & Karger, 2003), RBF (Powell, 1985) with 200 clusters, random committee, RBF with 30 clusters, RBF with 3 clusters, RBF with 2 clusters, PRISM (Cendrowska, 1987), and the nearest-neighbor-like algorithm using non-nested generalized exemplars (NNge). The second column in Table 2 presents the classification accuracy of each method as a percentage, and the last three columns show the prediction TP rates of each method for the classes 'Good', 'Risky', and 'Declined'; a short sketch of how such per-class TP rates and confidence intervals can be computed follows.
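The sketch below computes a per-class true-positive rate with a normal-approximation confidence interval at the 0.05 significance level, the kind of interval reported for the nondeterministic classifiers in Table 2. It is a generic calculation under that assumption, not necessarily the authors' exact statistical procedure.

```python
# Per-class TP rate with a normal-approximation (Z-based) confidence interval at the
# 0.05 significance level. This is a generic calculation, not necessarily the authors'
# exact procedure for Table 2.
import numpy as np

def tp_rate_with_ci(y_true, y_pred, positive_class, z=1.96):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = y_true == positive_class
    n = mask.sum()
    if n == 0:
        return None
    tp_rate = np.mean(y_pred[mask] == positive_class)
    half_width = z * np.sqrt(tp_rate * (1 - tp_rate) / n)
    return tp_rate, (max(0.0, tp_rate - half_width), min(1.0, tp_rate + half_width))

# Example: TP rate of the 'Risky' class (label 2) for one run of a classifier.
# rate, interval = tp_rate_with_ci(y_test, predictions, positive_class=2)
```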

The multilayer perceptron with the back-propagation algorithm (BPN) and C4.5 have the highest prediction accuracy. However, C4.5 is much simpler than BPN, since each split considered in the entropy method takes only O(m log m) time, where m is the number of instances in the training dataset. Although IBk and ID3 are slightly less accurate than either C4.5 or BPN, they can identify some 'Risky' cases of interest. Even simple Bayes-like methods achieve an acceptable level of classification accuracy. Additionally, comparing these Bayes-like systems, the TP rate of 'Risky' gradually rises from 0.005 to 0.013, implying that the classification of the risky customers is not strongly dependent on the attributes, since the Complement Naïve Bayes method modifies the poor assumption of Naïve Bayes (Rennie et al., 2003). The analytical results indirectly reveal that the assumption that all attributes are independent and affect results consistently must be modified when using Bayes-like models for this real case. Moreover, the Conjunctive Rule can attain the same accuracy as ID3 because it is induced from a decision tree. Owing to the approximation property of RBF networks, more K clusters apparently enhance prediction accuracy: the prediction accuracy of RBF networks improved from 72.43% to 80.42% as the cluster number rose from 2 to 200. Experimentally, the Random Committee method was found to achieve the same prediction accuracy as RBF networks with K values between 30 and 50 clusters. Before the data are cleaned, most neural networks, including BPN and RBF, are insensitive to variations in the resulting TP rate. Predictions using most of the methods mentioned above can reach an accuracy of up to around 80%.


However, the prediction using the NNge method yields an accuracy of just 70.26% because of excessive noise. This sensitivity to noise means that instance-based methods are not suitable for raw data mining. Although the prediction accuracy of NNge is extremely low, its TP rate for the 'Risky' class is higher than that obtained using other methods. Surprisingly, the prediction accuracy of PRISM is only 21.45%, but its TP rate for the 'Declined' class reaches 0.977. The extremely low accuracy is caused by the 'Declined' class of customers having apparent patterns or consistent characteristics.

4.2. Interesting mining in two-layer explanations

This study considers that patterns of 'Risky' customers represent far more valuable knowledge than patterns from the two other classes in banking. Table 2 demonstrates that the maximum accuracy for 'Risky' customers is achieved by the standard C4.5 method combined with the proposed conflict-sensitive example recognition algorithm. The accuracy result is obtained by performing a Z statistical test with a significance level of 0.05 on the 'mean' hypothesis, assuming the results follow a normal distribution. Thus, the experimental results demonstrate that the sensitive classifier for 'Risky' customers has a mean prediction accuracy of 57%, with a confidence interval between 43% and 71% at a significance level of 0.05. Clearly, only the sensitive classifier achieves such consistently high TP rate performance for every class, even when cost-sensitive, probability or data cleaning methods (Hung et al., 2002) are used to remedy the class imbalance problem so that the minority class can be mined. The next section presents practical applications of the patterns from the minority class. A conflict-sensitivity classifier with NFLDR, combining the accuracy results in Table 2 from the conflict-sensitive classifier and the conflict-insensitive classifier, attains an accuracy performance similar to state-of-the-art methods for remedying the class imbalance problem. Since the insensitive classifier produces either very high or very low accuracy, consistent with the point distribution in Fig. 5, we believe that the boundary dissecting the conflict-sensitive and conflict-insensitive examples is appropriate. Therefore, a two-layer explanation during the data mining cycle can demonstrate risk management better than a general explanation. The two-layer explanation, depicted in the lower dotted square in Fig. 1, leads decision-makers to two different business decisions: the 'strong class' decision of recognizing loyal customers, and the 'weak class' decision, such as identifying risky customers by their environment.

5. Business application

From the previous investigations, this study proposes a critical methodology called the conflict-sensitivity classifier with NFLDR (C-NFLDR) to derive significant patterns of interest, from a business viewpoint, for this bank. The methodology remedies the class imbalance problem and extracts patterns of interest according to the conflict-sensitivity of decision-making. Because risky customers are more valuable in business terms than others, the ensuing discussion focuses on identifying risky customer behaviors. For clarity of explanation, all base dates in the ensuing expressions are assigned to the date the training dataset was generated, referred to as the mining date.

Business rule 1. Most customer credit evaluations depend on the willingness to exercise the right of overdraft. If customers do not accept the terms of this right, their credit is usually good. Meanwhile, for the other customers, whose overdraft protection comes into play after vouching their overdraft rights, the degree of transaction quantity must be further observed in order to make a credit decision. These customers are denoted vouching-overdraft customers. Generally, either a very high or a very low transaction quantity for vouching-overdraft customers implies the possibility of 'Risky' credit or 'Good' credit. Furthermore, such customers with an ordinary number of transactions almost always obtain 'Declined' credit. This study strongly recommends that bank managers establish alarm systems for vouching-overdraft customers, and advise these customers on managing their finances so as to increase their number of transactions.

Business rule 2. From the opposite viewpoint of business rule 1, a specific group of customers that do not exercise the right of overdraft would receive more 'Risky' credit than the customers classified by business rule 1. Besides, in this case the customers might attain either 'Good' or 'Risky' credit when the number of transactions is normal. Such a strong pattern, which results from local events or localities, indicates specific customers controlling their credit in a particular manner. A well-designed credit control strategy could protect them from being declared 'Declined'. Alternatively, these customers restrain the number of their bounced checks to below three during one year. In such cases, a bank manager should encourage customers to draw checks of small denomination in large numbers. Both business rule 1 and business rule 2 depend on the strong-class results obtained using C-NFLDR.

5.1. Applying CSC with non-dominating attributes

Next, the study further verifies the conflict-sensitive classifier (CSC) to understand risky customer characteristics. The overall prediction accuracy of C-NFLDR is around 81%, where the TP rates are 0.91, 0.24 and 0.84 for 'Good' (class 1), 'Risky' (class 2), and 'Declined' (class 3), respectively. In contrast to C-NFLDR, the TP rates of CSC at classification are 0.56, 0.57 and 0.84 for 'Good' (class 1), 'Risky' (class 2), and 'Declined' (class 3), respectively. Clearly, CSC can raise the TP rates of all classes simultaneously to over 0.5. In fact, CSC, which is a standard C4.5 classifier built on conflict-sensitive examples, can easily be applied to mining rules without a set of dominating attributes, by removing the dominating attributes (variables) in advance to suppress a monopoly so that only rules with the 'activity', 'interest', and 'changed-signature' attributes emerge.


5.1. Applying CSC with non-dominating attributes

Next, this study further verifies the conflict-sensitive classifier (CSC) in order to reveal the characteristics of risky customers. The overall prediction accuracy of C-NFLDR is around 81%, where the TP rates of the classes are 0.91, 0.24 and 0.84 for ‘Good’ (class 1), ‘Risky’ (class 2) and ‘Declined’ (class 3), respectively. In contrast, the TP rates of CSC at classification are 0.56, 0.57 and 0.84 for ‘Good’ (class 1), ‘Risky’ (class 2) and ‘Declined’ (class 3), respectively. Clearly, CSC can simultaneously raise the TP rates of all classes above 0.5. In fact, CSC, which is a standard C4.5 trained on conflict-sensitive examples, can easily be applied to mining rules without a set of dominating attributes: the dominating attributes (variables) are removed in advance to suppress their monopoly, so that only rules involving the ‘activity’, ‘interest’ and ‘changed-signature’ attributes emerge (a minimal sketch of this procedure is given after business rule 5 below). In this way, several interesting patterns explain the potential relationship between credit risk and individual habits. The individual habits considered include incorrect bankroll operation, low loyalty, and low interest in finance. Thus, analysis of non-dominating attributes using CSC yields the following three business rules, which depend on the weak class:

Business rule 3. A customer whose ‘activity’ attribute is ‘yes’, who executed more than one transaction during the three months preceding the mining date, and who contributed some interest-free funds during the year always has a ‘Risky’ credit. Such customers always have unstable credit because of their poor bankroll operation. In other words, these customers are active and normal, but lack sufficient funds. Fortunately, these ‘Risky’ customers mostly tend to move from ‘Risky’ credit to ‘Good’ credit. Bank managers should provide formal guidance to these customers and increase lending to them.

Business rule 4. According to well-known rules of the checking business, checking accounts are an interest-free fund. Moreover, most good customers have in fact never changed their specimen seal signatures. Even though such knowledge appears trivial, the customers identified by business rule 4, together with those identified by business rule 3, confirm the existence of customers with unstable credit that changes from ‘Risky’ to ‘Good’ and vice versa. Notably, such a classification obtains a high rate of false positives, since the dominating attributes are not involved. Whether activity is ‘yes’ or ‘no’, most ‘Declined’ credits can be identified by a classification involving the dominating attribute ‘overdraft’. Furthermore, customers who frequently change their specimen seal signatures accumulate some unexpected credit-risk records. In this case, a bank manager should use the on-the-spot information together with the alarm system mentioned in business rule 1. If the forecast credit of a customer tends toward the ‘Risky’ class, the bank manager should reduce that customer's credit; otherwise, the bank manager should increase lending to the customer.

Business rule 5. Likewise, according to the Check Truncation Act (CTA) in Taiwan, a customer's original checks are immediately stopped at any bank in the collection chain, and new transactions are inhibited, when the customer receives a notice of non-acceptance; such a customer corresponds to the ‘Declined’ credit class here. Because these CTA terms are a well-known rule, the interest contributions yielded from the interest-free funds of such checking accounts are generally zero. Also, their activity must be ‘no’, since these customers have already been assigned a ‘Declined’ credit. Moreover, some ‘Good’ credit and ‘Risky’ credit customers, leaving aside the interest considerations mentioned above, may stop drawing new checks on the consumers' bank even when they do not suffer from an occasional bounced check. The reasons behind the individual habits of the customers with either ‘Good’ or ‘Risky’ credit mentioned above are low customer loyalty or a low contribution of interest in finance.
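The CSC step described above can be approximated with a standard entropy-based decision tree after dropping the dominating attributes. The sketch below is illustrative only: scikit-learn's DecisionTreeClassifier with the entropy criterion stands in for C4.5, and the column names ('overdraft', 'activity', 'interest', 'changed_signature', 'credit_class') are hypothetical placeholders for the bank's actual fields.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical table of conflict-sensitive examples; the real input would be the
# examples flagged as conflict-sensitive by the contexture learning step.
data = pd.DataFrame({
    "overdraft":         [1, 0, 1, 0, 1, 0, 0, 1],   # dominating attribute
    "activity":          [1, 1, 0, 1, 0, 0, 1, 0],
    "interest":          [0, 2, 0, 1, 0, 0, 3, 0],
    "changed_signature": [0, 0, 1, 0, 1, 0, 0, 1],
    "credit_class":      ["Risky", "Good", "Declined", "Good",
                          "Declined", "Risky", "Good", "Declined"],
})

# Remove the dominating attribute in advance so that rules built from the weaker
# attributes ('activity', 'interest', 'changed_signature') can emerge.
X = data.drop(columns=["overdraft", "credit_class"])
y = data["credit_class"]

# Entropy-based tree as a stand-in for the standard C4.5 learner.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the induced rules for inspection by the analyst.
print(export_text(tree, feature_names=list(X.columns)))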

The former is caused by other banks supplying better services or better terms, while the latter is caused by specific consumption habits, such as insurance payments that may occur only once a year. Therefore, new financial services, for example easy loans, should continuously be offered, since they can reduce the impact of an occasional bounced check. Moreover, bank managers and employees should encourage those customers who have ‘Good’ credit but use checks only for specific functions to adopt other financial products, such as credit cards, as substitutes.

6. Conclusions

This work proposed a novel approach to remedy the class imbalance problem and to help manage conflicting decision-making in identifying ‘Risky’ bank account applications. This study concludes that (i) conflict-sensitive sampling of each class can improve accuracy; (ii) local pattern learning by entropy-based evaluation in contexture can resolve the class imbalance problem; and (iii) two-layer explanations during data mining can practically express business rules; hence a hybrid classification system is proposed. Furthermore, since the number of good customers naturally far exceeds the number of bad customers, this investigation proposes a non-random sampling approach to accurately reflect real-world situations containing imbalanced distributions and contradictory data. Consequently, the proposed method obtains consistently high TP prediction rates for each class. The ‘Risky’ class of bank customers is the most interesting one for bank managers in classification prediction. Thus, the study promises to discover valuable business knowledge through data mining, and it also provides further insight into the financial field. The findings reveal that classification accuracy for all classes is around 80% without data cleaning, while the accuracy for the minority class rises to 71% when the conflict-sensitive examples are identified. However, numerous factors may affect customer credit, such as loans and other transaction information. Therefore, future studies should consider including more information in the preliminary model than was used in this study.

Acknowledgement

The authors would like to thank the Tainan Business Bank in the Republic of China for financially supporting this study.

References

Aha, D., & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
An, A., Cercone, N., & Huang, X. (2001). A case study for learning from imbalanced data sets. In Advances in artificial intelligence: Proc. 14th conf. Canadian soc. comput. studies of intell. (pp. 1–15).

Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1), 20–29.
Benediktsson, J. A., Sveinsson, J. R., & Swain, P. H. (1997). Hybrid consensus theoretic classification. IEEE Transactions on Geoscience and Remote Sensing, 35, 833–843.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Cendrowska, J. (1987). PRISM: an algorithm for inducing modular rules. International Journal of Man–Machine Studies, 27, 349–370.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (JAIR), 16, 321–357.
Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: improving prediction of the minority class in boosting. In 7th Euro. conf. on principles and practice of knowledge discovery, Cavtat-Dubrovnik, Croatia (pp. 107–119).
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2), 713–718.
DeRouin, E., Brown, J., Beck, H., Fausett, L., & Schneider, M. (1991). Neural network training on unequally represented classes. In C. H. Dagli, S. R. T. Kumara, & Y. C. Shin (Eds.), Intelligent engineering systems through artificial neural networks (pp. 135–140). New York: ASME Press.
Domingos, P. (1999). MetaCost: a general method for making classifiers cost-sensitive. In Proc. 5th int'l conf. on knowledge discovery and data mining (pp. 155–164).
Drummond, C., & Holte, R. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Proc. ICML workshop on learning from imbalanced data sets.
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proc. of the seventeenth int'l joint conf. on artificial intelligence (IJCAI'01) (pp. 973–978).
Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18–37.
Fletcher, R. (1987). Practical methods of optimization. New York: Wiley.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proc. of the thirteenth international conf.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Guo, H., & Viktor, H. L. (2004). Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explorations, 6(1), 30–39.
Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Hickey, R. (2003). Learning rare class footprints: the REFLEX algorithm. In Proc. ICML workshop on learning from imbalanced data sets.
Hung, C. M., Huang, Y. M., & Chen, T. S. (2002). Assessing check credit with skewed data: a knowledge discovery case study. In International computer symposium, workshop on artificial intelligence, Taiwan.
Jang, J. S. R. (1993). ANFIS: adaptive-network-based fuzzy inference systems. IEEE Transactions on Systems, Man and Cybernetics, 23(3), 665–685.

Japkowicz, N. (2001). Supervised versus unsupervised binary-learning by feedforward neural networks. Machine Learning, 42(1), 97–122.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5), 429–450.
Jiang, M., Zhu, X., Yuan, B., Tang, X., Lin, B., Ruan, Q., et al. (2000). A fast hybrid algorithm of global optimization for feedforward neural networks. In Proc. signal processing, WCCC-ICSP international conference (Vol. 3, pp. 1609–1612).
John, G. H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the eleventh conference on uncertainty in artificial intelligence (pp. 338–345).
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. SIGKDD Explorations, 6(1), 40–49.
Ling, C. X., & Li, C. (1998). Data mining for direct marketing: problems and solutions. In Proc. 4th ACM SIGKDD int'l conf. on knowledge discovery and data mining (KDD-98) (pp. 73–79). New York, NY: ACM.
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: classification of skewed data. SIGKDD Explorations, 6(1), 50–59.
Powell, M. J. D. (1985). Radial basis functions for multivariable interpolation: a review. In RMCS IMA conf. on algorithms for the approximation of functions and data (pp. 143–167).
Quinlan, J. R. (1993). C4.5: programs for machine learning. Morgan Kaufmann.
Rennie, J. D., Shih, L., Teevan, J., & Karger, D. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In ICML 2003 (pp. 616–623).
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature (London), 323, 533–536.
Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1), 21–41.
Stan, O., & Kamen, E. W. (1999). New block recursive MLP training algorithms using the Levenberg–Marquardt algorithm. In International joint conference on neural networks (IJCNN '99) (Vol. 3, pp. 1672–1677).
Sugeno, M. (1985). Industrial applications of fuzzy control. Elsevier Science Pub. Co.
Visa, S., & Ralescu, A. (2003). Learning imbalanced and overlapping classes using fuzzy sets. In Proc. ICML workshop on learning from imbalanced data sets.
Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315–354.
Witten, I. H., & Frank, E. (1999). Data mining: practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann Publishers.
Zadrozny, B., & Elkan, C. (2001a). Learning and making decisions when costs and probabilities are both unknown. In Proc. seventh int'l conf. knowledge discovery and data mining (pp. 204–213).
Zadrozny, B., & Elkan, C. (2001b). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML 2001 (pp. 609–616).